# on_truthing_issues_in_supervised_classification__a508ad11.pdf

Journal of Machine Learning Research 25 (2024) 1-91 Submitted 4/19; Revised 12/23; Published 01/24

On Truthing Issues in Supervised Classiﬁcation

Jonathan K. Su su@ll.mit.edu MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02421-6426, USA

Editor: Shivani Agarwal, Francis Bach

Ideal supervised classiﬁcation assumes known correct labels, but various truthing issues can arise in practice: noisy labels; multiple, conﬂicting labels for a sample; missing labels; and diﬀerent labeler combinations for diﬀerent samples. Previous work introduced a noisylabel model, which views the observed noisy labels as random variables conditioned on the unobserved correct labels. It has mainly focused on estimating the conditional distribution of the noisy labels and the class prior, as well as estimating the correct labels or training with noisy labels. In a complementary manner, given the conditional distribution and class prior, we apply estimation theory to classiﬁer testing, training, and comparison of diﬀerent combinations of labelers. First, for binary classiﬁcation, we construct a testing model and derive approximate marginal posteriors for accuracy, precision, recall, probability of false alarm, and F-score, and joint posteriors for ROC and precision-recall analysis. We propose minimum mean-square error (MMSE) testing, which employs empirical Bayes algorithms to estimate the testing-model parameters and then computes optimal point estimates and credible regions for the metrics. We extend the approach to multi-class classiﬁcation to obtain optimal estimates of accuracy and individual confusion-matrix elements. Second, we present a uniﬁed view of training that covers probabilistic (i.e., discriminative or generative) and non-probabilistic models. For the former, we adjust maximum-likelihood or maximum a posteriori training for truthing issues; for the latter, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk. We also describe suboptimal training that is compatible with existing infrastructure. Third, we observe that mutual information lets one express any labeler combination as an equivalent single labeler, implying that multiple mediocre labelers can be as informative as, or more informative than, a single expert labeler. Experiments demonstrate the eﬀectiveness of the methods and conﬁrm the implication. Keywords: supervised classiﬁcation, truth errors, noisy labels, Bayesian estimation, empirical Bayes, mutual information, crowdsourcing

. DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, ﬁndings, conclusions or recommendations expressed in this material are those of the author and do not necessarily reﬂect the views of the Under Secretary of Defense for Research and Engineering. c 2023 Massachusetts Institute of Technology. Subject to FAR52.227-11 Patent Rights - Ownership by the contractor (May 2014) Delivered to the U.S. Government with Unlimited Rights, as deﬁned in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are deﬁned by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as speciﬁcally authorized by the U.S. Government may violate any copyrights that exist in this work.

c 2024 Jonathan K. Su.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/19-301.html.

1. Introduction

Supervised classiﬁcation uses labeled data to train and test a predictive model or classiﬁer, which will be used to predict the labels of unlabeled data. The predictive model is a mapping g : X Y, parameterized by θ, from a feature space X to a set Y = {0, 1, . . . , C 1} of C mutually exclusive and exhaustive classes or labels. The input to the model is a feature vector x X, and the output of the model is the predicted label ˆy = g(x; θ) Y, which may or may not agree with the correct label y Y. A labeled sample (x, y) consists of a feature vector x and its corresponding correct label y, while an unlabeled sample is just a feature vector x. A set of N labeled samples is denoted as {x, y}, where x = (x1, . . . , x N) = (xi)N i=1, y = (yi)N i=1, and for each i, xi X, yi Y, where yi is the correct label associated with xi. The result of applying a learned model to each feature vector in x is the list of corresponding predicted labels ˆy = (ˆyi)N i=1 = (g(xi; θ))N i=1.

Training learns the model from a training set of labeled samples. Testing uses a separate testing set of labeled samples, applies the learned model to its feature vectors, and calculates metrics that quantify the agreement between each predicted label and the corresponding correct label. Our notation does not distinguish between the training and testing sets, since the intended set will be clear from context.

1.1 Truthing Issues

It is usually assumed that the correct labels are known during training and testing; we refer to this situation as the ideal case. However, this critical assumption is often violated in practice (see Everingham et al. (2006, p. 172), Fr enay and Verleysen (2014), or Northcutt et al. (2021b), for example), which can degrade the trained model and produce misleading testing metrics. Deviation from the ideal case can invalidate many hours of hard work and computer processing, as well as make users and clients skeptical of the utility of a classiﬁer or its performance.

Consequently, substantial resources are often devoted to truthing, the process of labeling data as correctly as possible. We use the term labeler to mean an entity that assigns labels to samples; some authors use terms like teacher or annotator. A labeler could be a human, a sensor, a laboratory test, or even another classiﬁer. Despite a labeler s best eﬀorts, errors can and do still occur. Humans make mistakes, become fatigued, and have varying amounts of expertise, attentiveness, and motivation; sensors are subject to noise, occlusion, and other degradations; laboratory test results are not always deﬁnitive; and classiﬁers are rarely perfect predictors. In addition, labeling via crowdsourcing means that more than one labeler could assign a label to the same feature vector xi, and if the labels conﬂict, then at least one of them must be incorrect.

In short, a number of truthing issues can arise: truth errors, multiple labelers who provide conﬂicting labels for the same sample, missing labels, and diﬀerent combinations of labelers for diﬀerent samples. We introduce some notation here and include an example of it in Table 1. We assume there are T labelers, indexed from 1 to T, and we let T = {1, 2, . . . , T}. Let zi,t { } Y, t T , denote the noisy label assigned to xi by the tth labeler, where zi,t = if no label was assigned. We require that at least one labeler assigns a label to each sample; that is, zi,t Y for some t T . Hence, a labeler can only assign one label

On Truthing Issues in Supervised Classification

Labeler Index t 1 2 3 4 5

Sample Feature Correct Noisy Index Vector Label Labels zi i xi yi zi,1 zi,2 zi,3 zi,4 zi,5 1 x1 0 0 1 0 0 2 x2 1 1 1 3 x3 0 0 0 0 0 0 4 x4 1 1 1 0 0 1 5 x5 0 0 0 0 ... ... ... ... N x N 0 0 0 0

Table 1: Example of notation for binary classiﬁcation and ﬁve labelers. This example corresponds to the training example in Section 5.4.

to a sample. Of course, zi,t might be incorrect and diﬀer from yi. Denote the set of noisy labels for xi by zi = (zi,t)t T ; the noisy labels need not agree. Finally, let z = (zi)N i=1. A natural approach, introduced by Dawid and Skene (1979), is to treat the correct labels and noisy labels as random variables (RVs). We adopt this viewpoint and use capitalization (e.g., Y ) for an RV and lowercase (e.g., y) for its non-random counterpart. Hence, Yi is the correct-label RV for the ith sample, so Y = (Yi)N i=1 is the list of correct-label RVs, and y is the list of correct labels themselves. Similarly, the feature-vector RVs are Xi and X = (Xi)N i=1, and their realizations are xi and x. Also, Zi,t, Z, Zi = (Zi,t)t T , and Z = (Zi)N i=1 indicate noisy-label RVs, while zi,t, z, zi, and z indicate their respective realizations. We assume that the RVs associated with diﬀerent samples are independent and identically distributed and that the correct-label RVs have class prior π = π(y) C 1 y=0 , where π(y) = p(y) = Pr(Y = y). We further assume that the noisy-label RVs are conditionally dependent given the correct-label RV, and we denote the noisy-label conditional distribution by p(z|y; ψ), which is parameterized by ψ, and where a semi-colon separates RVs from nonrandom parameters. We refer to p(z|y; ψ)π(y) as a noisy-label model. For the ith sample, the parameters are ψi. We assume that Zi and Zj follow the same conditional distribution p(z|y; ψ) if ψi = ψj = ψ. At this stage, we let the parameters remain completely general. Also, let ψ = (ψi)N i=1 be the list of parameters for all samples. Much of the related work, described in Section 1.2, has focused on constructing a model for p(z|y; ψ), estimating the parameters ψ and class prior π, and estimating the correctlabel RVs Y . This paper and the related work implicitly assume that p(z|y; ψ) = p(z; ψ) so that the noisy-label RVs provide some information about the correct-label RV. If z is non-informative, then none of these methods will be eﬀective. Denote the estimated correct label for the ith sample as ˇyi, and let ˇy = (ˇyi)N i=1. In this paper, we seek answers to the following questions:

1. How can one test a classiﬁer in the presence of truthing issues1?

2. How can one train a classiﬁer in the presence of truthing issues?

3. How can one compare diﬀerent combinations of labelers with diﬀerent abilities?

1.2 Related Work

We organize the related work into noisy-label modeling and learning, training with truthing issues, testing with truthing issues, and comparing combinations of labelers.

1.2.1 Noisy-Label Modeling and Learning

Several authors have proposed noisy-label models p(z|y; ψ)π(y) and developed methods for learning them,2 i.e., estimating ψ and π. Often, they also estimated the correct-label RVs Y . Table 2 presents a summary. All but one set of authors allowed for multiple labelers. The table shows that most of them assumed that the noisy-label RVs are conditionally independent given the correct-label RV. Dawid and Skene (1979) made an initial, inﬂuential foray into this area. They did not address supervised classiﬁcation, but they considered the problem of compiling observations of patients from multiple clinicians who might disagree or make mistakes. In this context, they introduced the conditional-RV formulation that is now commonplace. When the correct labels are available, they showed that maximum-likelihood (ML) estimation of ψ and π is straightforward. With the beneﬁt of hindsight, it is clear that this situation amounts to learning a classiﬁer from an auxiliary data set {z , y } where the noisy labels z serve as the feature vector. For example, one might have a small auxiliary data set containing noisy labels from clinicians and correct labels from a separate, gold-standard laboratory test, and a large amount of data with only noisy labels. More important, when the correct labels are unavailable, Dawid and Skene demonstrated the utility of the expectation-maximization (EM) algorithm of Dempster et al. (1977) for estimating ψ, π, and Y . Many other authors have built upon this work. Donmez et al. (2010) considered classiﬁcation and regression, and they also proposed using the EM algorithm for learning the noisy-label model. They studied conditions under which ML estimates of the noisy-label model parameters are consistent. For classiﬁcation, they showed that consistency holds under certain conditions, such as when the labelers are weak learners or better and the class prior π is not equiprobable. Whitehill et al. (2009) proposed a model for p(z|y; ψ) that considers sample diﬃculty and labeler expertise. A Bayesian extension of the model allowed priors for these parameters. They used both the EM algorithm and a Bayesian version of it to estimate ψ and Y . Welinder and Perona (2010) and Welinder et al. (2010) expanded the model to include labeler bias and to support multi-class labels and continuously-valued annotations.

1. We tackle testing before training because our testing approach applies regardless of how a classiﬁer was trained. Also, if truthing issues are present, then there is little point in training a classiﬁer if one cannot evaluate its performance. 2. One could propose learning a model of the form p(z|x, y)p(x, y) or p(z|y, x)p(y|x) from an auxiliary set {x , y , z }. However, learning it would be harder than learning a predictive model for p(x, y) or p(y|x), so if such a set were available, then one could just learn the predictive model from {x , y }.

On Truthing Issues in Supervised Classification

Reference Description Number of Classes

Labeler Dependence

Dawid and Skene (1979)

Seminal work ML estimation of noisy-label model, if correct labels available EM estimation of noisy-label model and correct labels, if correct labels unavailable

Multiple Independent

Donmez et al. (2010) Also regression EM estimation of noisy-label models Consistency conditions for ML estimates

Multiple Independent

Whitehill et al. (2009) Sample-diﬃculty and labeler-expertise model EM or Bayesian estimation of noisy-label model

Binary Independent

Welinder and Perona (2010); Welinder et al. (2010)

Extension of Whitehill et al. (2009) to include labeler bias, multiple classes, and continuously-valued annotations

Binary or Multiple

Independent

Branson et al. (2017) Extension of Welinder et al. (2010); Welinder and Perona (2010) to part-keypoint and bounding-box annotations Sequential acquisition of noisy labels

Binary Independent

Van Horn et al. (2018) Extension of Branson et al. (2017) to multiple classes and sequentially dependent labelers

Multiple Sequentially dependent Karger et al. (2014) Allocation of labelers to unlabeled samples Iterative message passing

Binary Independent

Zhou et al. (2015) Minimax conditionl entropy principle Multiple Independent Platanios et al. (2016) Hierarchical models Gibbs sampling

Binary Independent or dependent Northcutt et al. (2021a)

Assume previously-trained classiﬁer Estimate joint dist. p(z, y) on new data set Correct or remove mislabeled samples

Multiple Not applicable

Table 2: Related work on noisy-label modeling and learning. References that jointly learn a noisy-label model and train a predictive model appear in Table 3.

Branson et al. (2017) extended the models by Welinder and Perona and by Welinder et al. to include other forms of annotation. They also introduced an algorithm that sequentially acquires more noisy labels until the risk of the estimated correct label falls below a threshold. Van Horn et al. (2018) extended the work by Branson et al. to multi-class classiﬁcation and sequentially dependent labelers. Karger et al. (2014) considered the problem of allocating labelers to unlabeled samples and proposed a message-passing algorithm for estimating the correct labels. Zhou et al. (2015) included sample diﬃculty in their model of p(z|y; ψ) and estimated ψ, π, and Y by optimizing a minimax criterion on the conditional entropy of Z given Y . Platanios et al. (2016) proposed a variety of generative models for p(z|y; ψ), and they applied Gibbs sampling to infer π, ψ, and Y . None of these authors considered training or testing, and they used the known correct labels to compute estimation errors in their experiments. For our purposes, these works oﬀer noisy-label models that could be learned even without correct labels and used with our training and testing methods. The model introduced by Whitehill et al. (2009) is

italicized because, in Section 5.1.2, we present a similar model for p(z|y; ψ) that includes sample diﬃculty and labeler fallibility; however, our purpose is not to estimate ψ but to demonstrate training and testing when ψ is already available.

Northcutt et al. (2021a) took a diﬀerent viewpoint and assumed the availability of an existing classiﬁer, previously trained on an auxiliary data set with enough correctlylabeled samples to overcome the presence of some noisily-labeled ones. Given a feature vector, the existing classiﬁer predicts class probabilities for all classes, unlike a human labeler who provides a noisy label indicating a single class. For a new data set with noisy labels, Northcutt et al. leveraged this property to estimate the joint distribution p(z, y) as well as p(z|y) and π. They used these estimates to identify samples in the new data set that were likely mislabeled and to correct or remove them. A new classiﬁer can then be trained or tested using the corrected labels. This approach oﬀers a diﬀerent way to obtain a noisy-label model, and our methods can complement it by addressing samples whose noisy labels cannot be resolved.

Some other authors perform joint learning of a noisy-label model and training of a predictive model. These references are discussed next.

1.2.2 Training (and Joint Learning) with Truthing Issues

Works on training with truthing issues are listed in Table 3 and reviewed here.

Cid-Sueiro (2012) and Cid-Sueiro et al. (2014) took a Bayesian viewpoint and examined weak losses for training with partial labels, which are modeled slightly diﬀerently than the noisy labels considered here. The authors related the weak loss to an equivalent loss for correct labels and studied theoretical aspects of constructing a weak loss from a given equivalent loss. Their approach does not conform to the uniﬁed view of training that we oﬀer in Section 3.2; it could be interpreted as the reverse of the minimum meansquare error training method proposed in Section 3.4.1. The two approaches are discussed in Section 3.6.1.

Natarajan et al. (2013, 2018) took a classical (i.e., frequentist) view and trained binary classiﬁers on noisy labels from a single labeler by forming a proxy loss function for Z that is an unbiased estimator (in the classical sense) of the ideal loss function for y. Their approach does not ﬁt into our uniﬁed view; further discussion appears in Section 3.6.2. van Rooyen and Williamson (2018) presented an abstract framework for learning with noisy labels, parts of which generalize the work by Cid-Sueiro (2012), Cid-Sueiro et al. (2014), and Natarajan et al. (2013, 2018).

Ratner et al. (2016, 2017) proposed a technique for programmatically generating multiple noisy labels for many unlabeled samples and subsequently training discriminative classiﬁers with the noisy labels. To model p(z|y; ψ), they introduced generative models with independent or dependent labelers, and they estimated ψ while assuming an equiprobable class prior. Training minimized the expected empirical risk with respect to p(z|y; ψ).

The remaining authors considered joint learning and training with noisy labels. Raykar et al. (2010) used the EM algorithm to jointly estimate ψ, π, and θ and train a logistic regression binary classiﬁer. They also presented a Bayesian form of the algorithm with priors on the labelers error probabilities and provided approximate posteriors of the error prob-

On Truthing Issues in Supervised Classification

abilities, and they extended the approach to multi-class classiﬁcation, ordinal regression, and regression.

Khetan et al. (2018) presented a version of the EM algorithm that alternates between training a binary classiﬁer with z and the current estimates of ψ and π and then updating the estimates of ψ and π using the current predictions. They weight the training loss function using the correct-label posterior for each possible correct-label value, eﬀectively marginalizing out the correct-label RV.

To train a convolutional neural network (CNN) on noisy labels from a single labeler and estimate p(z|y), Sukhbaatar et al. (2015) and Jindal et al. (2016) appended an additional layer to the softmax output of a base network. During training, the base network learns to predict the unknown correct labels while the additional layer estimates p(z|y) and predicts the noisy label from the output of the base network. Following training, the additional layer can be excised and the base network used for prediction. This training method lies outside our uniﬁed view of training and is discussed in Section 3.6.3.

Tanno et al. (2019) jointly trained a CNN and estimated the confusion matrices for multiple independent labelers. They used the confusion matrices to convert the CNN outputs into predictions of the noisy labels, which is similar to the work by Sukhbaatar et al. (2015). Notably, Tanno et al. introduced a trace-regularization term, which minimizes the traces of the confusion matrices and, under certain conditions, ensures proper estimation. The authors compared their method to several other methods, including those of Raykar et al. (2010), Khetan et al. (2018), and Sukhbaatar et al. (2015). This approach does not ﬁt our uniﬁed view of training, and further discussion appears in Section 3.6.4

All of these authors used the correct labels during testing or other experimental evaluations; they did not consider testing with truthing issues. As Table 3 shows, many of them do not consider multiple, possibly dependent labelers, which our work allows. Some of them jointly learn a noisy-label model and a predictive model, which we do not consider. A number of these works ﬁt within our uniﬁed view of training (Section 3). The table italicizes parts of Ratner et al. (2016, 2017) and Khetan et al. (2018) that are consistent with our use of minimum mean-square error estimation of the empirical risk (Section 3.4.1). In the work of Raykar et al. (2010), logistic regression is italicized because we also use it as an illustrative example (Section 5.4).

1.2.3 Testing with Truthing Issues

As noted above, all of the related work on training used a reserved set of correct labels for testing. Related work on testing with truthing issues appears in Table 4.

Smyth et al. (1994) considered testing with noisy labels assigned by a scientist or classiﬁer to synthetic aperture imagery of the planet Venus to indicate the presence or absence of types of volcanoes. Clearly, absolute ground truth is unavailable in this case. To test an individual scientist s predicted labels, the authors used the EM algorithm proposed by Dawid and Skene (1979) with the noisy labels from all other scientists to estimate the correct labels, which were then treated as if correct. A classiﬁer was also trained using consensus labels from two scientists (Burl et al., 1994). It was tested by comparing its predictions to the estimated correct labels from EM using all scientists noisy labels.

Reference Description Number of Classes

Labeler Dependence

Joint with noisy-label model learning?

Fits into our uniﬁed view?

Cid-Sueiro (2012); Cid-Sueiro et al. (2014)

Theoretical Bayesian view Weak loss designed for partial labels Equivalent loss for correct labels

Binary or multiple

Not applicable (Partial labels)

Natarajan et al. (2013, 2018)

Classical or frequentist view Proxy loss function that is unbiased in classical sense

Binary Not applicable (Single labeler)

van Rooyen and Williamson (2018)

Abstract framework Generalization of Cid-Sueiro (2012); Cid-Sueiro et al. (2014); Natarajan et al. (2013, 2018)

Binary Not applicable (Single labeler)

Ratner et al. (2016, 2017)

Logistic regression Also long short-term memory network Programmatic noisy labeling Minimization of expected empirical risk

Binary Independent or dependent

Raykar et al. (2010) Binary logistic regression as main example Also regression, ordinal regression EM or Bayesian estimation

Binary or multiple

Independent Yes Yes

Khetan et al. (2018) Empirical risk minimization Marginalization of loss function using correct-label posterior

Binary or multiple

Independent Yes Yes

Sukhbaatar et al. (2015); Jindal et al. (2016)

CNN Base network learns p(y|x) Extra layer learns p(z|y)

Multiple Not applicable (Single labeler)

Tanno et al. (2019) CNN Regularization of trace of labelers confusion matrices

Multiple Independent Yes No

Table 3: Related work on training with truthing issues.

On Truthing Issues in Supervised Classification

Reference Description Number of Classes

Labeler Dependence

Smyth et al. (1994) Comparison of one labeler s noisy labels with estimated correct labels obtained by applying EM to other labelers noisy labels Multi-class labels (volcano type or non-volcano) reduced to binary labels (volcano or non-volcano)

Multiple, reduced to binary

Independent

Lam and Stork (2003) Eﬀect of noisy labels on probability of error Variance of estimated probability of error as a function of labeler error probability, number of samples

Multiple, reduced to binary

Not applicable (Single labeler) Carlotto (2009) Study of the eﬀect of noisy labels on accuracy Relationship between accuracies calculated on correct labels and on noisy labels Rough estimate of ideal accuracy

Multiple Not applicable (Single labeler) Holodnak et al. (2018) Empirical study of methods for estimating accuracy from noisy labels

Multiple Independent or dependent

Table 4: Related work on testing with truthing issues.

For binary classiﬁcation, Lam and Stork (2003) related the ideal probability of error Pr(ˆy = y) = 1 accuracy to a labeler s error probability ε = Pr(z = y) and the classiﬁer s apparent probability of error Pr(ˆy = z). They provided an estimate of Pr(ˆy = y) given an assumed value of ε, and they examined the variance of this estimate as a function of ε and the number of samples N. Carlotto (2009) analyzed how measured accuracy is aﬀected by truth errors for a single labeler. Carlotto obtained an expression that relates ideal accuracy to accuracy calculated against noisy labels and suggested a rough estimate of accuracy when the labeler s error probability is known. Holodnak et al. (2018) conducted an empirical study with real and simulated data to compare a variety of techniques for estimating the accuracy of a classiﬁer from noisy labels. They introduced two models that incorporate dependencies between the labelers or noisy-label RVs, and they demonstrated that estimation techniques that assume conditional independence provide less reliable estimates as labeler dependence increases. In a survey paper on classiﬁcation with label noise (Fr enay and Verleysen, 2014, p. 862), the authors remark, a problem that is seldom mentioned in the literature is that model validation can be diﬃcult in the presence of label noise. Indeed, the number of references on testing is considerably smaller than that on learning or training, and such work has mainly examined accuracy. Nevertheless, the above works reﬂect the main approaches: Calculate estimated correct labels ˇy and use them as the reference; estimate the ideal accuracy in some way; and comparative studies. Our work includes many common testing metrics, including accuracy, precision, recall or probability of detection, and probability of false alarm. We focus on estimating the testing metrics rather than the correct labels, and we develop algorithms for computing Bayesian optimal estimates of scalar and joint metrics. Our experiments use conditionally independent labelers for implementational convenience, but our approaches accommodate noisy-label models with conditionally dependent labelers.

1.2.4 Comparing Combinations of Labelers

Regarding the comparison of diﬀerent combinations of labelers, we have found one rather distantly-related publication. For binary classiﬁcation and a single labeler, Lugosi (1992) viewed the correct-label RV and noisy-label RV as the respective input and output of a communications channel. Lugosi examined purely theoretical aspects of the eﬀects of noisy labels on accuracy if a classiﬁer uses the maximum a posteriori or nearest-neighbor decision rule. In Section 4, we make the channel analogy for one or more labelers, but we proceed in a diﬀerent direction. We suggest that mutual information can be used to compare combinations of labelers, which implies that the information conveyed by multiple mediocre labelers can equal or exceed that provided by a single expert labeler.

1.3 Supervised Classiﬁcation and Estimation Theory

This work applies estimation theory to supervised classiﬁcation with truthing issues, and Figure 1 presents conceptual diagrams for these two ﬁelds. The diagrams look similar but diﬀer in a fundamental way: In supervised classiﬁcation, the actual process is unknown, while in estimation theory, the actual variable is unknown. Supervised classiﬁcation is concerned with ﬁnding a good predictive model that generalizes to future, out-of-sample realizations from an unknown process, given a set of labeled samples from the process and a family of models. In this work, estimation theory is concerned with making a good guess at the current, in-sample value of an unobserved variable, given noisy measurements and a model for the measurement process. These ﬁelds also use similar terms and objectives. Supervised classiﬁcation learns model parameters θ, estimation theory ﬁnds an estimate ˆy or estimator ˆY , and both ﬁelds seek an answer that is best according to some criterion. We present a few examples. In these examples, the criterion is essentially the same; the diﬀerences lie in the components that are assumed to be known, the solution space, and the optimization methods. First, both ﬁelds employ ML estimation: classiﬁcation selects non-random parameters θ to maximize p(y|x; θ) or p(y, x; θ), and estimation chooses ˆy to maximize p Z|Y (z|ˆy). Second, when classiﬁcation treats the model parameters as RVs Θ with prior p(θ), it uses the maximum a posteriori (MAP) criterion and chooses θ to maximize p(θ|y, x). When estimation adopts the MAP criterion, it selects ˆy = arg maxy p(y|z). Third, classiﬁcation may minimize the average square loss N 1 PN i=1(yi g(xi; θ))2, while estimation may minimize the mean-square error (MSE) E[(Y h(Z))2]. We make repeated use of minimum mean-square error (MMSE) estimation, which seeks the estimator h(Z) that minimizes the MSE. A convenient standard result (see Appendix C) is that the MMSE estimator is the conditional mean of Y given Z:

h MMSE(Z) = arg min h E[(Y h(Z))2] = E[Y |Z]. (1)

Finally, classiﬁcation may minimize the average zero-one loss N 1 PN i=1 1(yi = g(xi; θ)), where 1(w) is the indicator function: 1(w) = 0 if w is false, and 1(w) = 1 if w is true. Likewise, estimation may adopt the minimum probability-of-error (MPE) criterion and minimize E[1(Y = h(Z))], the probability of error. Another standard result from estimation

On Truthing Issues in Supervised Classification

(a) Supervised classiﬁcation

(b) Estimation theory

Figure 1: Comparison of supervised classiﬁcation and estimation theory. In supervised classiﬁcation, the goal is to use a set of correctly-labeled samples from an unknown actual process to learn a predictive model that generalizes to future, out-of-sample realizations from the process. In estimation theory, the goal is to estimate the in-sample value of an unobserved variable from noisy measurements produced by a known measurement process. For each problem, the key unknown element appears in red, the key known element in green, and the desired element in blue.

theory (see Appendix D) is that the MPE estimator corresponds to the MAP estimator:

h MPE(Z) = arg min h E[1(Y = h(Z))] = arg max y p(y|Z) = h MAP(Z). (2)

Our application of estimation theory begins with the assumption that a good noisy-label model p(z|y; ψ)π(y) is available. If an auxiliary data set {z , y } is available, then such a model could be obtained via supervised learning. Much of the related work addresses the problem of learning a model when correct labels are unavailable but some knowledge about the labelers error behavior and/or class prior is available. Consequently, this paper is complementary to and broadly compatible with much of the related work.

1.4 Novel Contributions and Organization

We use boldface to highlight speciﬁc novel contributions. We discuss some conceptual contributions and then use the three questions raised in Section 1.1 to cover the organization of the rest of the paper and mention other contributions. For reference, Tables 5, 6, and 7 list abbreviations, symbols, and important distributions used throughout this paper. Supervised classiﬁcation with truthing issues involves three diﬀerent components, which each require a model. The truthing or labeling process requires a noisy-label model, training learns the predictive model, and testing calls for a testing model. Much of the related work has investigated ways to learn a noisy-label model and/or use a noisy-label model to train a predictive model. The greatest conceptual contribution of this paper is its application of Bayesian estimation theory to training and testing. To our knowledge, this paper is the ﬁrst one

to take this viewpoint and pursue its possibilities so extensively. We concentrate on training and testing, and we start with the assumption that a noisy-label model p(z|y, ψ)π(y) is available, so this work complements much of the related work. The Bayesian approach naturally allows for multiple labelers, diﬀerent combinations of labelers for each sample, and missing and/or conﬂicting noisy labels. It also supports noisy-label models with conditionally dependent labelers.3

In this work, we say that an estimator is optimal only if it meets three requirements: it employs an appropriate estimand, it fully exploits all available information, and it minimizes a well-deﬁned penalty criterion (or maximizes a well-deﬁned utility criterion). An estimator that fails any of these requirements is considered suboptimal. Our technical contributions consist of testing and training approaches that are optimal in this strict sense. Our optimal testing approach focuses on estimating the metrics, introduces a testing model that enables thorough exploitation of the available information, and employs the MMSE criterion. In contrast, some suboptimal methods omit estimation theory entirely, others estimate the correct labels instead of the metrics, and still others omit a testing model and fail to take full advantage of the available information. Our optimal training methods select an appropriate likelihood function, posterior, or risk that is faithful to the original objective from ideal training; they use the noisy-label model to exploit the available information completely; and they apply the ML, MAP, or MMSE criterion. Some suboptimal methods ignore estimation theory, while others estimate the correct labels rather than the risk. Our proposed methods never estimate the correct labels because, in our view, they are not the right estimand. This viewpoint marks another novel conceptual contribution: Our conscious choice to refrain from estimating the correct labels. We deliberately avoid making hard decisions about the unobserved correct labels because doing so would produce errors that would propagate into training and testing. In this way, our work diﬀers from existing, suboptimal approaches that estimate the correct labels and then proceed as if the estimated labels were correct.

1. How can one test a classiﬁer in the presence of truthing issues?

This question is deeply investigated in Section 2. For binary classiﬁcation, we propose a testing model for the noisy and predicted labels (Section 2.1, Figure 2). We then derive the estimation-theoretic testing methods as follows:

(a) We express various metrics in terms of two common RVs that are independent and approximately normally distributed (Section 2.2).

(b) We derive approximate marginal posteriors for accuracy, precision, recall or probability of detection, probability of false alarm, and F-score and the approximate joint posteriors for probability of detection and probability of false alarm as well as for precision and recall (Section 2.3).

(c) We propose MMSE testing and develop empirical Bayes algorithms for estimating the testing-model parameters via iterative MMSE estimation

3. For ease of implementation, our simulations and experiments employ conditionally independent labelers, but the derivations and algorithms do not.

On Truthing Issues in Supervised Classification

(Algorithms 1 and 2, Figure 3), and we discuss their relation to the EM algorithm and convergence (Section 2.4),

(d) We explain how to calculate Bayesian optimal estimates (MMSE, MAP, and credible regions) of the metrics from the estimated testing-model parameters and posteriors (Section 2.5).

We also describe some alternative testing approaches (Section 2.6), such as MPE or MAP estimation of the correct labels (Section 2.6.1, Algorithm 3) and fully Bayesian estimation (Section 2.6.2).

For multi-class classiﬁcation (Section 2.7), we extend the testing model (Section 2.7.1), provide another empirical Bayes algorithm for MMSE testing (Algorithm 4, Figure 4), and derive approximate posteriors for accuracy and individual elements of the confusion matrix (Section 2.7.2).

2. How can one train a classiﬁer in the presence of truthing issues?

We consider this question in Section 3. We restate the assumption of independent samples (Section 3.1), and we present a uniﬁed view of training that encompasses and organizes some of the related work (Section 3.2).

For probabilistic (e.g., discriminative or generative) models, we derive the likelihood function or posterior of the predictive model parameters for truthing issues such that the original, ideal training principle (ML or MAP) is preserved (Section 3.3).

For non-probabilistic models (Section 3.4), which are trained by minimizing the empirical risk, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk, and we demonstrate that this approach leads to the same training objective proposed by some related work (Section 3.4.1). We review properties associated with MMSE estimation (Section 3.4.2), mention its convenient form for gradient calculation (Section 3.4.3), and consider some special cases (Section 3.4.4). We provide a basic condition for consistency of the MMSE estimator (Section 3.4.5).

Next, we mention some aspects of MMSE training that make it more appealing than ML and MAP training (Section 3.5). We also discuss some alternative training approaches that do not ﬁt into the uniﬁed view (Section 3.6). Finally, we describe ways to do training with infrastructure that was not designed for truthing issues (Section 3.7).

3. How can one compare diﬀerent combinations of labelers with diﬀerent abilities?

This question is brieﬂy studied in Section 4. We make a simple analogy between the noisy-label model and a communications channel, which suggests mutual information as a basis for comparing combinations of labelers. We focus on the binary symmetric broadcast channel (Section 4.1), and we suggest expressing the mutual information for a set of labelers in terms of that for a single equivalent labeler (Section 4.2). We observe that, in theory, multiple mediocre labelers can be as informative as or more informative than a single expert labeler (Section 4.3).

Experimental results appear in Section 5. We relied on simulation to generate many of the correct, noisy, and predicted labels (Section 5.1). We simulated correct and predicted

Abbreviation Expansion BSBC binary symmetric broadcast channel BSC binary symmetric channel CLT central limit theorem CNN convolutional neural network EM expectation-maximization ERM empirical risk minimization MAC moment-approximation condition MAD maximum absolute diﬀerence MAP maximum a posteriori ML maximum likelihood MMSE minimum mean-square error MPE minimum probability-of-error MSE mean-square error OP operating point P-R precision-recall ROC receiver operating characteristic RV random variable

Table 5: List of abbreviations.

labels in a straightforward manner (Section 5.1.1), and we employed a particular noisy-label model (Section 5.1.2) that is similar to the one by Whitehill et al. (2009). For testing, we review results for several experiments on binary classiﬁcation (Section 5.2), and we report on one experiment on multi-class classiﬁcation (Section 5.3). For training, we provide an example involving binary logistic regression (Section 5.4). For the comparison of labelers, we show experiments that use the proposed training and testing methods to verify the possibility of equivalent mutual information (Section 5.5). Section 6 provides a summary and conclusions (Section 6.1), a workﬂow for supervised classiﬁcation with truthing issues (Section 6.2), and suggested future directions (Section 6.3). Several appendices are also included. Appendix A derives testing metrics in terms of the common RVs from Section 2.2. Appendix B summarizes results on ratios of jointly normal RVs; they are useful for calculating posteriors of scalar metrics. Appendices C and D review MMSE, MPE, and MAP estimation. Appendix E provides derivations for the training approaches in Sections 3.3 and 3.4. Appendix F gives details for the logistic regression training example in Section 5.4.

2. Testing with Truthing Issues: Grading with Dirty Answer Keys

We cover testing before training because the results and methods presented here apply regardless of how a classiﬁer was trained. They can be employed even if training did not consider truthing issues. Additionally, if one needs to train a model with truthing issues, then one very likely needs to test the trained model with truthing issues, too. One might be reluctant to embark on training if one suspects that truthing issues will invalidate the testing metrics. This section provides reassurance that reliable testing is possible. Conventional testing calculates metrics over an ideal testing set {ˆy, y} to measure the (dis)agreement between the predicted labels ˆy and the correct labels y. In machine learning,

On Truthing Issues in Supervised Classification

Symbol Description Symbol Description B Bernoulli distribution g pre-threshold pred. model Beta beta distribution h arbitrary estimator C number of classes h MMSE MMSE estimator Cemp, [C] confusion matrix i sample index Cat categorical distribution j iteration or generic index Femp β , [Fβ] F-score p D, [PD] probabilty of detection I mutual information p FA, [PFA] probability of false alarm Jideal ideal training objective p D, p FA, [ PD, PFA] OP parameters Jpri primary term s pre-threshold pred. stat. Jreg regularization term t labeler index Kemp, [K] cond. conf. matrix x, x, x, [X, X, X] feature vectors K cond. conf. matrix param. y, y, [Y , Y ] correct labels L loss function ˆy, ˆy, [ ˆY , ˆY ] predicted labels ˆLMMSE MMSE est. of loss function ˇy, ˇy estimated correct labels M no. draws in sampling algs. z, z, z, [Z, Z, Z] noisy labels N number of samples α clipping value ˆN1 number of times ˆyi = 1 β F-score parameter Remp, [R] empirical risk δ, δ sample diﬃculty ˆRMMSE MMSE est. of emp.-risk RV ε error prob. (57) or (60) T number of labelers η prob. provide noisy label [U, V ] common RVs for metrics θ, θ, θ , [Θ] pred.-model parameters N normal distribution λ regularization weight T set of labelers π, π class prior U uniform distribution τ decision threshold X feature-vector space φ, φ labeler fallibility Y set of classes ψ, ψ, ψ noisy-label params. f arbitrary function 0 zero vector g predictive model 1( ) indicator function

Table 6: List of main symbols. Brackets indicate RVs.

Distribution Description

p(z|y; ψ) noisy-label conditional distribution

p(z|y; ψ)π(y) noisy-label model

p(ˆyi|yi; p D, p FA) predicted-label conditional distribution (3), (4), (5), (6)

p(ˆyi|yi; p D, p FA)p(zi|yi; ψi) testing model (7)

p(yi|ˆyi, zi; ψi, p D, p FA) testing class posterior (10)

p(yi|zi; ψi) training class posterior (41)

Table 7: List of key distributions.

ˆy is obtained by applying the learned model g to each feature vector xi x. However, testing does not actually involve the feature vectors x, so the results of this section apply to problems outside of machine learning, such as reconciling observations by multiple clinicians (for examples, see Dawid and Skene, 1979; Raykar et al., 2010). With truthing issues, testing must work with the testing set {ˆy, z, ψ, π} instead of {ˆy, y}. The metrics are functions of the correct-label RVs, which are unobserved, so the metrics are RVs whose values remain uncertain. We therefore treat testing as an estimation problem: we want to know the in-sample values of the metrics on the testing set. We seek the posteriors of the metric RVs, conditioned on ˆy and z. Optimal estimates or credible regions for the metric RVs can then be calculated from the posteriors. Our approach exploits all available information, including the predicted labels ˆy. The problem is analogous to grading a multiple-choice quiz using answer keys from one or more teaching assistants who have poor handwriting. The answer keys provide information about the correct answers, but the student s answers do, too. We can obtain the best estimate of the grade by consulting the student s answers along with the answer keys, rather than relying on the answer keys alone. Most of this section covers binary classiﬁcation; Section 2.7 discusses extensions to multiclass classiﬁcation. For binary classiﬁcation, we consider several common scalar metrics: accuracy, precision, recall or probability of detection,4 probability of false alarm,5 and Fscore. Each of these metrics takes on values in [0, 1]. For probability of false alarm, smaller values indicate better performance; for the other metrics, larger values correspond to better performance. Table 8 gives the empirical forms for these metrics, denoted as acc, prec, p D or rec, p FA, and Femp β , respectively. We also consider two common joint metrics: the receiver operating characteristic (ROC) and precision-recall (P-R) operating points, namely (p D, p FA) and (prec, rec). The empirical metrics are used in the ideal case when {ˆy, y} is available. The corresponding RV forms are ACC, PREC, PD or REC, PFA, Fβ, as well as (PD, PFA) and (PREC, REC). We overload the lowercase symbols to mean either an empirical metric or a realization of a metric RV.

2.1 Testing Assumptions

In ideal testing, the metrics do not involve the feature vectors, so we eliminate X and x from consideration. If we had a good testing model for p(ˆy, z|y), then we could use it to estimate the metric RVs. One might propose learning it from an auxiliary set {ˆy , y , z }, but if such a set were available, then one could just do ideal testing with {ˆy , y }.6

Instead, we must devise a model for p(ˆy, z|y) and estimate its parameters from {ˆy, z, ψ, π}, with no prospect of learning them from auxiliary data. We tackle this problem by applying techniques from estimation theory rather than machine learning. We state our assumptions here, and Figure 2 shows the graphical model depicting them. We assume that ˆy, z, ψ, and π are available. We let ˆN1 denote the number of times that ˆyi = 1; it is immediately available from ˆy. We further assume that the noisy-label RVs Zi

4. Recall, probability of detection, sensitivity, and true positive rate are equivalent terms. 5. Probability of false alarm is equivalent to (1 speciﬁcity) and false positive rate. 6. We have chosen to omit any dependence on the feature vectors. If one were to propose learning a testing model such as p(ˆy, z|x, y)p(x, y) from {x , ˆy , y , z }, then a similar contradiction would arise.

On Truthing Issues in Supervised Classification

Metric Empirical Form Random-Variable Form Symbol Expression Symbol Expression

Accuracy acc no. of times ˆyi = yi

N ACC U V ˆN1 N + 1

Precision prec no. of times ˆyi =1 and yi =1

no. of times ˆyi =1 PREC N

Prob. of Detection, Recall

p D, rec no. of times ˆyi = 1 and yi = 1

no. of times yi = 1 PD, REC U U + V

Prob. of False Alarm p FA

no. of times ˆyi = 1 and yi = 0

no. of times yi = 0 PFA

ˆN1/N U 1 (U + V )

F-Score (β > 0) Femp β (1 + β2) prec rec β2prec + rec Fβ

β2(U + V ) + ˆN1/N

Table 8: Binary-classiﬁcation metrics: Empirical forms and RV forms in terms of the common RVs U and V from (8) and (9).

Figure 2: Graphical model for testing approach. Small rectangles indicate non-random variables; circles indicate RVs. Shading indicates a variable that is fully observed. Large rectangles indicate N independent instances of the enclosed variables indexed by i.

have conditional distribution p(zi|yi; ψi) and do not depend on the predicted label. As is common practice, we also assume independent samples, so p(z|y; ψ) = QN i=1 p(zi|yi; ψi). Next, we observe that accuracy, precision, and F-score can each be written solely in terms of the class prior, probability of detection, and probability of false alarm. For example, acc = π(0)(1 p FA) + π(1)p D, and prec = π(1)p D/(π(0)p FA + π(1)p D). Consequently, we introduce the ROC operating-point (OP) parameters ( p D, p FA), which suﬃce to determine the other metrics because the class prior is assumed known. The OP parameters represent the anticipated performance on the testing set before ˆY and Z are observed. We then treat each predicted label ˆyi as a realization of a predicted-label RV ˆYi conditioned on the correct-label RV Yi and ( p D, p FA). The predicted-label conditional distribution

p ˆYi|Yi(0|0; p D, p FA) = 1 p FA, (3)

p ˆYi|Yi(1|0; p D, p FA) = p FA, (4)

p ˆYi|Yi(0|1; p D, p FA) = 1 p D, (5)

p ˆYi|Yi(1|1; p D, p FA) = p D. (6)

The assumption of independent samples means p(ˆy|y; p D, p FA) = QN i=1 p(ˆyi|yi; p D, p FA). Our ﬁnal assumption is that, given Yi, the RVs ˆYi and Zi are conditionally independent, where ( p D, p FA) and ψi only inﬂuence ˆYi and Zi, respectively. This assumption may be a strong one, but it reﬂects the typical case in which ˆYi is generated without access to Zi. For example, ˆYi could be the output of a predictive model whose only input is Xi, or ˆYi could be a diagnosis made by a clinician who has not seen the diagnoses from other clinicians. Therefore, our testing model is

p(ˆyi, zi|yi; ψi, p D, p FA) = p(ˆyi|yi; p D, p FA)p(zi|yi; ψi). (7)

We excluded Xi, so Yi is the only RV that can link Zi and ˆYi, and the model includes such a connection. The model is parsimonious, having just two parameters. As explained at the top of this section, they will not be determined via supervised learning. Section 2.4 presents iterative methods for estimating them from {ˆy, z, ψ, π}.

2.2 Metric RVs in Terms of Common RVs

Here we demonstrate that each metric RV can be written in terms of two RVs that are independent and approximately normal.

2.2.1 Metric RVs in Terms of Common RVs

We begin by considering a single metric, namely probability of detection. Given truthing issues, it becomes an RV conditioned on ˆy, z, ψ, and ( p D, p FA), which we write as

1 N PN i=1 1(ˆyi = 1 and Yi = 1)

1 N PN i=1 1(Yi = 1) .

In this expression, the predicted label is shown in lowercase because it is observed, and the correct label is capitalized because it is an unobserved RV. We deﬁne the RVs U and V , conditioned on {ˆy, z, ψ, p D, p FA}, as follows:

i=1 1(ˆyi = 1 and Yi = 1)

i:ˆyi=1 1(Yi = 1), (8)

On Truthing Issues in Supervised Classification

i=1 1(ˆyi = 0 and Yi = 1)

i:ˆyi=0 1(Yi = 1). (9)

The numerator of PD is exactly U. The denominator is

i=1 1(Yi = 1) = 1

i:ˆyi=1 1(Yi = 1) + 1

i:ˆyi=0 1(Yi = 1)

PD = U U + V .

By similar manipulations (see Appendix A), we can write each of the other metric RVs in terms of the common RVs U and V . Table 8 summarizes the relationships.

2.2.2 Independence and Approximate Normality

We immediately conclude that U and V are independent since they involve summations over disjoint subsets of ˆy. We now consider their distributions.

In (8) and (9), each 1(Yi = 1) is a Bernoulli RV with success probability equal to p Yi| ˆYi,Zi(1|ˆyi, zi; ψi, p D, p FA). Using Bayes rule, we obtain this probability from the testing class posterior:

p(yi|ˆyi, zi; ψi, p D, p FA) = π(yi)p(ˆyi, zi|yi; ψi, p D, p FA) P y i Y π(y i)p(ˆyi, zi|y i; ψi, p D, p FA)

= π(yi)p(ˆyi|yi; p D, p FA)p(zi|yi; ψi) P y i Y π(y i)p(ˆyi|y i; p D, p FA)p(zi|y i; ψi), (10)

where the appropriate value for p(ˆyi|yi; p D, p FA) can be determined from (3) (6).

We denote a Bernoulli RV with success probability p [0, 1] by B(p), so each 1(Yi = 1) in (8) is distributed B(pi) with

pi = p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA), i = 1, . . . , N. (11)

Let N(µ, σ2) denote a normal RV with mean µ and variance σ2. By the central limit theorem (CLT), U is approximately distributed N(µU, σ2 U|ˆy, z; ψ, p D, p FA) with

µU(ˆy, z; ψ, p D, p FA) = 1

i:ˆyi=1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA), (12)

σ2 U(ˆy, z; ψ, p D, p FA) = 1 N2 X

i:ˆyi=1 pi(1 pi)

i:ˆyi=1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA)

1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA) , (13)

where we have indicated the dependence on {ˆy, z, ψ, p D, p FA}. Likewise, each 1(Yi = 1) in (9) is distributed B(qi) with

qi = p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA), i = 1, . . . , N, (14)

so V is approximately distributed N(µV , σ2 V |ˆy, z; ψ, p D, p FA) with

µV (ˆy, z; ψ, p D, p FA) = 1

i:ˆyi=0 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA), (15)

σ2 V (ˆy, z; ψ, p D, p FA) = 1 N2 X

i:ˆyi=0 qi(1 qi)

i:ˆyi=0 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA)

1 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA) . (16)

These expressions readily accommodate some correctly-labeled samples. For the ith

sample, if yi is known or can be exactly recovered from zi, then pi = 1(yi = 1) if ˆyi = 1, and qi = 1(yi = 1) if ˆyi = 0. Correctly-labeled samples add a constant to the summations for µU and µV and contribute zero to the summations for σ2 U and σ2 V .

2.3 Posteriors of Metric RVs

We can now obtain the posteriors of the metric RVs.

2.3.1 Scalar Metric RVs

We immediately ﬁnd that ACC is approximately normal with mean µU µV ˆN1/N + 1 and variance σ2 U + σ2 V . Similarly, PREC is approximately normal with mean NµU/ ˆN1 and variance N2σ2 U/ ˆN2 1 .

On Truthing Issues in Supervised Classification

From the Random-Variable Form section of Table 8, we observe that each remaining scalar metric RV is a ratio of jointly, approximately normal RVs. Closed-form expressions for these posteriors are unavailable, but the work of Marsaglia (1965, 2006) explains how the posteriors can be computed. Appendix B reviews the procedure, and Table 16 in the appendix covers the necessary parameters for each of these metrics.

Moreover, Marsaglia (2006, 4) and Appendix B provide closed-form approximations for the mean and variance of a ratio of jointly normal RVs. The approximations are valid if a simple moment-approximation condition (MAC) is satisﬁed.7

In summary, for any scalar metric RV Q {ACC, PREC, PD, PFA, Fβ}, we can compute the moments µU, σ2 U, µV , σ2 V from {ˆy, z, ψ, p D, p FA} and use them to get approximations for p(q|ˆy, z; ψ, p D, p FA), E[Q|ˆy, z; ψ, p D, p FA], and var(Q|ˆy, z; ψ, p D, p FA).

2.3.2 Joint Metric RVs

For ROC analysis, we observe that PD and PFA are functions of two independent, approximately normal RVs U and V . We apply the standard transformation for two functions of two RVs (see Papoulis, 1991, 6-3) to obtain the joint posterior of (PD, PFA):

p(p D, p FA|ˆy, z; ψ, p D, p FA)

( ˆN1/N p FA)( ˆN1/N p D)

(p D p FA)3

ˆ N1/N p FA

p D p FA p D µU 2

ˆ N1/N p FA

p D p FA (1 p D) µV 2

0 p D 1, 0 p FA 1, p D = p FA, (17)

where the moments of U and V are computed from (11) (16).

Clearly, the posterior is non-Gaussian. This expression is well-deﬁned except when p D = p FA, which is the chance line in ROC space (see Saito and Rehmsmeier, 2015). Although we have not succeeded in ﬁnding a closed-form expression for limp D p FA p(p D, p FA|ˆy, z; ψ, p D, p FA), we do not consider the lack of this limit to be problematic. First, Section 2.3.3 explains how we can use sampling to approximate the posterior of (PD, PFA) without using (17). Second, a binary classiﬁer is only useful if it operates far from the chance line, so one will likely only be interested in classiﬁers with negligible probability density near the chance line. Finally, in Section 2.4.1, we present an estimation algorithm that only uses the marginal posteriors of PD and PFA and does not require (17).

7. In general, the distribution of a ratio of jointly normal RVs cannot be approximated well by a normal distribution. However, Marsaglia (2006) and the appendix also provide a test for when a normal approximation is reasonable.

For P-R analysis, PREC and REC are also functions of U and V . We apply the transformation of functions of RVs to obtain the joint posterior of (PREC, REC):

p(prec, rec|ˆy, z; ψ, p D, p FA)

rec2 1 2πσUσV

N prec µU 2

N prec 1 rec

0 < rec 1, 0 prec 1, (18)

with the moments of U and V computed from (11) (16). This distribution is also non Gaussian. The chance line in P-R space is prec = π(1) (see Saito and Rehmsmeier, 2015), and it poses no diﬃculties. L Hˆopital s rule can be applied to show that limrec 0 p(prec, rec| ˆy, z; ψ, p D, p FA) = 0, so this expression is well-deﬁned over the entire unit square.

2.3.3 Posteriors via Sampling

We can also use sampling to approximate the posteriors. We generate M length-N realizations {y(m)}M m=1 of the correct-label RVs Y . For each realization y(m), each y(m) i is drawn B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p D, p FA) using (10). For each y(m), we compute the desired empirical metric (scalar or joint), which yields M realizations of the metric RV. The approximate posterior can then be computed as a one-dimensional or two-dimensional histogram.

2.4 Empirical Bayes Estimation of Operating-Point Parameters (MMSE Testing)

The validity of the posteriors depends on how well we can estimate the moments of U and V , which in turn depend upon the OP parameters ( p D, p FA). We present two iterative algorithms for ﬁnding the MMSE estimate of the OP parameters, so we refer to this approach as MMSE testing. The algorithms belong to the class of empirical Bayes estimators. Standard Bayesian estimation handles unknown nuisance variables by specifying a prior for them and marginalizing them out to obtain the posterior estimate of the estimand. In contrast, empirical Bayes estimation does not assume a prior;8 it instead consults the available data to estimate nuisance variables (Casella, 1992), often using MAP or ML estimation. It alternates between estimating the nuisance variables and the estimand. This iterative process successively improves each estimate and is similar in spirit to the EM algorithm of Dempster et al. (1977). In the proposed algorithms, the OP parameters are the nuisance variables, the estimand consists of the probability-of-detection and probability-of-false alarm RVs, and MMSE estimation is employed. Section 2.4.4 explores the relationship between these algorithms and the EM algorithm.

2.4.1 Motivation

Suppose that, on the jth iteration, we have the previous OP parameters p(j 1) D , p(j 1) FA , along with {ˆy, z, ψ, π}. The preceding results enable us to get the moments of U and

8. As a result, empirical Bayes methods are sometimes said not to be fully Bayesian.

On Truthing Issues in Supervised Classification

V conditioned on {ˆy, z, ψ, p(j 1) D , p(j 1) FA } and then to get the approximate posterior of

(PD, PFA). We can improve on the OP parameters by updating p(j) D , p(j) FA with an optimal

estimate of (PD, PFA) given p(j 1) D , p(j 1) FA and {ˆy, z, ψ, π}. We can repeat this procedure until the OP parameters converge. We adopt the MMSE criterion from Bayesian estimation theory. Let h( ˆY , Z) = (h D( ˆY , Z), h FA( ˆY , Z)) be an estimator of (PD, PFA) given ˆY , Z. The MSE of such an estimator is E h D( ˆY , Z) PD 2 + E h FA( ˆY , Z) PFA 2 , and the MMSE estimator is deﬁned as h MMSE = arg minh E h D( ˆY , Z) PD 2 + E h FA( ˆY , Z) PFA 2 . The standard result is that the MMSE estimator is the conditional mean (see (1) or Appendix C), so

h MMSE( ˆY , Z) = Ep(p D,p FA|ˆy,z;ψ, p(j 1) D , p(j 1) FA ) PD, PFA ˆY , Z; ψ, p(j 1) D , p(j 1) FA ,

where the relevant distribution is shown as a subscript (cf. Cover and Thomas (1991, Equation 2.2)). Therefore, given ˆY = ˆy and Z = z, we set

p(j) D = E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA ,

p(j) FA = E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA .

Each conditional mean involves a marginal posterior. For example,

E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA = Z 1

0 p D p p D ˆy, z; ψ, p(j 1) D , p(j 1) FA dp D. (19)

Thus, as mentioned previously, the lack of a limit along the chance line in (17) does not prevent us from calculating the conditional mean of (PD, PFA). We can compute the conditional mean of PD or PFA in three ways. First, if the MAC is satisﬁed, we can use the closed-form expression in Marsaglia (2006) or Appendix B. Second, we can perform one-dimensional numerical integration. Third, we can use sampling, as described in Section 2.3.3.

2.4.2 Choice of Initial Operating-Point Parameters

This section discusses the initial OP parameters p(0) D , p(0) FA . We take a Bayesian viewpoint

with initial OP RVs P(0) D , P(0) FA and a non-informative prior namely, the uniform distri-

bution over the unit square. Then the initial OP parameters p(0) D , p(0) FA are given by the conditional mean, which is just (1/2, 1/2). Consequently, p(ˆyi|yi; p D = 1/2, p FA = 1/2) equals 1/2 for any combination of ˆyi and yi, so the testing class posterior (10) reduces to

p(yi|ˆyi, zi; ψi, p D = 1/2, p FA = 1/2) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (20)

The right-hand side matches the ordinary class posterior, which does not condition on ˆyi, p D, and p FA:

p(yi|zi; ψi) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (21)

Thus, choosing p(0) D , p(0) FA = (1/2, 1/2) has the same eﬀect as if we had not included the predicted label at all.

Figure 3: Graphical model of iterative estimation for testing. The common RVs U and V depend on separate partitions of Y and ˆy.

2.4.3 Estimation Algorithms

The recursive relationship between successive OP parameters leads to two iterative empirical Bayes estimators for MMSE testing, namely Algorithms 1 and 2. Figure 3 depicts the graphical model and the procedure. On each iteration, each algorithm updates its OP parameters by computing the MMSE estimate the conditional mean of (PD, PFA) given ˆy, z, ψ, and the previous OP parameters. On the next iteration, the MMSE estimate provides the new OP parameters. Algorithm 1 takes advantage of the fact that PD and PFA are both ratios of jointly approximately normal RVs, so their conditional means can be computed without sampling. It begins the jth iteration by using ˆyi, zi, ψi, p(j 1) D , and p(j 1) FA to compute pi and qi for i = 1, . . . , N, and then it computes the moments of U and V . To get the improved OP parameters p(j) D , p(j) FA , it uses the closed-form approximation for the conditional mean of PD or PFA if the MAC holds, and it falls back to numerical integration if not. To prevent numerical degeneracy, the new OP parameters are clipped to [α, 1 α]. The procedure repeats until the maximum absolute diﬀerence between successive estimates falls below some tolerance or a maximum number of iterations jmax is reached. It then returns the ﬁnal OP parameters ( p D, p FA) = p(j) D , p(j) FA . Algorithm 2 uses sample realizations of the correct-label RVs Y to approximate the conditional means of PD and PFA. On the jth iteration, it generates M length-N realizations y(m), m = 1, . . . , M, by drawing each y(m) i B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA )

using (10). Then it computes the empirical ROC operating point p(j,m) D , p(j,m) FA for each realization. The conditional mean of (PD, PFA) is obtained by averaging the M operating points, which yields the new OP parameters p(j) D , p(j) FA .

On Truthing Issues in Supervised Classification

Algorithm 1 MMSE testing with empirical Bayes estimation of ( p D, p FA) via ratios of jointly normal RVs.

1: function Empirical Bayes Via Ratios(ˆy, z, ψ, π)

2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD

3: while j < jmax and MAD tol do

5: for i 1 : N do

6: Compute pi and qi from ˆyi, zi, ψi, p(j 1) D , and p(j 1) FA Eqs. (10), (11), (14)

7: Use {pi}N i=1 and ˆy, z, ψ, p(j 1) D , p(j 1) FA to compute µU, σ2 U Eqs. (12), (13)

8: Use {qi}N i=1 and ˆy, z, ψ, p(j 1) D , p(j 1) FA to compute µV , σ2 V Eqs. (15), (16)

9: if mean approximation for PD from µU, σ2 U, µV , σ2 V is valid then

Marsaglia (2006), App. B

10: p(j) D E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA from mean approximation

12: Get p p D ˆy, z; ψ, p(j 1) D , p(j 1) FA from µU, σ2 U, µV , σ2 V Marsaglia (1965, 2006), App. B

13: p(j) D E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA by 1-D numerical integration

14: if mean approximation for PFA from µU, σ2 U, µV , σ2 V is valid then

15: p(j) FA E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA from mean approximation

17: Get p p FA ˆy, z; ψ, p(j 1) D , p(j 1) FA from µU, σ2 U, µV , σ2 V

18: p(j) FA E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA by 1-D numerical integration

19: Clip p(j) D and p(j) FA to [α, 1 α]

20: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA

21: ( p D, p FA) p(j) D , p(j) FA Final OP parameters

22: return ( p D, p FA)

Both algorithms used tol = 10 3, jmax = 30, α = 10 3, and Algorithm 2 used M = 5000. They also allow for some correctly-labeled samples. If yi is known or can be recovered exactly from zi, then pi, qi, or p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA ) equals 1(yi = 1). Algorithm 1 will add the proper constants to the moments of U and V , and Algorithm 2 will always draw the proper realization of the correct-label RV.

2.4.4 Relation to EM Algorithm and Remarks on Convergence

Like many empirical Bayes methods, our iterative algorithms are similar to the EM algorithm but diﬀer in that the hidden or latent variables namely, the correct labels are RVs instead of non-random quantities. The above relationships let us obtain the approximate conditional means of PD and PFA directly, so our E-step does not employ an auxiliary function like the EM algorithm does. Also, the other latent variables namely the OP parameters ( p D, p FA) are determined using MMSE rather than ML estimation, so our Mstep performs minimization of the MSE rather than maximization of an auxiliary function.

Algorithm 2 MMSE testing with empirical Bayes estimation of ( p D, p FA) via sampling.

1: function Empirical Bayes Via Sampling(ˆy, z, ψ, π, M)

2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD

3: while j < jmax and MAD tol do

5: for m 1 : M do Generate M length-N realizations of Y

6: for i 1 : N do Generate mth realization y(m) of Y

7: Draw y(m) i B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA ) Eq. (10)

p(j,m) D , p(j,m) FA empirical metrics for {ˆy, y(m)}

p(j) D , p(j) FA mean of p(j,m) D , p(j,m) FA M m=1 Empirical conditional mean

10: Clip p(j) D and p(j) FA to [α, 1 α]

11: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA

12: ( p D, p FA) p(j) D , p(j) FA Final OP parameters

13: return ( p D, p FA)

We speculate that, regardless of the initial OP parameters, the algorithms should converge to the global optimum ( p D, p FA), the minimum of the MSE E PD p D 2 + PFA p FA 2 ˆy, z; ψ, p D, p FA . The argument is based on the following reasoning: The estimand (PD, PFA) is bounded; generally, the conditional mean of (PD, PFA) is unique for any OP parameters p(j 1) D , p(j 1) FA ; and the MSE decreases on each iteration, unless p(j 1) D , p(j 1) FA is already at the global optimum. Therefore, the algorithms should con-

verge to the global optimum regardless of p(0) D , p(0) FA . This property contrasts with the EM algorithm, which can only be said to converge to a local optimum, and this local optimum may vary signiﬁcantly depending on initialization (Koller and Friedman, 2009, 19.2). We can also view the algorithms as seeking the unique ﬁxed point p D, p FA such that E[PD, PFA|ˆy, z; ψ, p D, p FA] = p D, p FA . A technicality prevents us from making a stronger statement regarding convergence. Both PD and PFA are approximated as ratios of jointly normal RVs, for which the mean and variance do not exist (see Marsaglia (1965, 2006) and Appendix B). As a result, uniqueness of the MMSE estimator of (PD, PFA) cannot be guaranteed, so a strict proof of convergence or existence of a unique ﬁxed point may not be possible. This obstacle might mainly be a theoretical concern, and, except for a few pathological situations, the algorithms might always converge in practice. Section 5.2.3 presents some empirical evidence of global convergence.

2.5 Optimal Estimation of Metric RVs

Algorithms 1 and 2 each produce ﬁnal OP parameters ( p D, p FA). It still remains to calculate optimal estimates of the metric RVs given {ˆy, z, ψ, π, p D, p FA}. We do so by computing the moments of the common RVs U and V (Section 2.2.2), and then by using the estimated posteriors of the metric RVs (Section 2.3). Given the approximate posteriors, optimal estimation is a simple matter; we brieﬂy discuss it for completeness.

On Truthing Issues in Supervised Classification

We consider two point estimates and one range estimate. The ﬁrst point estimate is the conditional mean, which is optimal in the MMSE sense. The accuracy and precision RVs are approximately normal, so it is just the mean. For REC or PD, PFA, or Fβ, we can use the mean approximation if the MAC holds or resort to numerical integration or sampling. For P-R or ROC analysis, we can just use the conditional means of the individual elements. The second point estimate is the MAP estimate, the most-probable value of the metric RV. It can be obtained by computing the posterior at a ﬁne resolution and ﬁnding the peak. The optimal range estimate is the p%-credible region, which speciﬁes a region such that, with probability p/100, the metric RV lies inside the region. The credible region is not necessarily unique, but one way to obtain a reasonable credible region is to apply binary search to ﬁnd a threshold c such that the numerical integral of the posterior over the points where the posterior exceeds c is equal to p/100 within some tolerance. Finally, we reiterate that Algorithms 1 and 2 and the subsequent optimal-estimation calculations never attempt to estimate the correct-label RVs Y . Making a hard decision about Y at any point could introduce errors into downstream processing, as remarked in Section 1.4. Our approach avoids doing so while using all available information to produce its estimates.9

2.6 Alternative Testing Approaches

For comparison purposes, we mention some alternative approaches to testing. We ﬁrst present four suboptimal methods, and then we describe a fully Bayesian approach, which is optimal but handles the OP parameters diﬀerently than the empirical Bayes approach.

2.6.1 Suboptimal Testing Approaches

The ﬁrst suboptimal approach is to estimate the correct labels and then treat the estimates as if they were correct to improve on the OP parameters ( p D, p FA). We present an iterative method in Algorithm 3 and denote its estimate of the correct-label RVs Y as ˇy. On the jth iteration, the algorithm uses the previous OP parameters ( p D, p FA) = p(j 1) D , p(j 1) FA and estimates Y according to the MPE criterion. Let hi( ˆYi, Zi) be an esti-

mator of Yi given ˆYi, Zi, and parameters ψi, p(j 1) D , p(j 1) FA , which are suppressed because they are not RVs. The probability of error of hi is perror(hi( ˆYi, Zi), Yi) = E[1(hi( ˆYi, Zi) = Yi)] = P ˆyi,zi P y 1(hi(ˆyi, zi) = yi)p(ˆyi, zi, yi). This approach seeks (h MPE 1 , . . . , h MPE N ) = arg min(h1,...,h N) N 1 PN i=1 perror(hi( ˆYi, Zi), Yi). The solution consists of ﬁnding the MPE estimator for each individual Yi, and the standard result is that the MAP estimator minimizes the probability of error (see (2) or Appendix D). Thus, the algorithm computes ˇy(j) using:

ˇy(j) i = arg max y Y p y|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA , i = 1, . . . , N. (22)

The algorithm then uses these estimates to compute the empirical probabilities of detection and false alarm p(j) D and p(j) FA. This approach is similar to Algorithms 1 and 2 except

9. If one wants to estimate Y , then one can do so using (10) after ( p D, p FA) has been determined. However, subsequently using this estimate to compute testing metrics runs counter to our approach.

Algorithm 3 Suboptimal estimation of ( p D, p FA) by estimating the correct-label RVs Y .

1: function Estimate OPParameters Via Estimated Correct Labels(ˆy, z, ψ, π)

2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD

3: while j < jmax and MAD tol do

5: for i 1 : N do

6: ˇy(j) i arg maxy Y p Yi| ˆYi,Zi y|ˆyi, zi; ψi, p D = p(j 1) D , p FA = p(j 1) FA Eq. (22)

7: Compute empirical p(j) D and p(j) FA from ˆy and ˇy(j) = ˇy(j) i N i=1 8: Clip p(j) D and p(j) FA to [α, 1 α]

9: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA

10: ( p D, p FA) p(j) D , p(j) FA Final OP parameters

11: return ( p D, p FA)

that it makes a hard decision about the correct labels on every iteration. This approach leverages the testing model and minimizes a well-deﬁned penalty criterion, but it is suboptimal according to Section 1.4 because it estimates Y when it should estimate (PD, PFA). We make no claims about its convergence properties. We again used tol = 10 3, jmax = 30, α = 10 3. Like Algorithms 1 and 2, Algorithm 3 only estimates the ﬁnal OP parameters ( p D, p FA). Following this step, this approach applies (22) to get the ﬁnal MPE or MAP estimate ˇy of Y given {ˆy, z, ψ, p D, p FA}. Final estimates of the metrics are obtained by treating ˇy as if it were correct and computing empirical metrics for {ˆy, ˇy}. The second suboptimal approach neglects estimation theory completely; it just merges the metrics calculated for each individual labeler s labels. For each t T , it treats the labels from the tth labeler as if they were correct, and it computes the empirical metric for these samples only. Doing so produces T instances of a metric. The ﬁnal estimate is then obtained using a centrality statistic, such as the mean or median, of the T instances. The third and fourth suboptimal approaches use the noisy-label conditional distribution p(zi|yi; ψi) instead of the testing model (7), so they omit p(ˆyi|yi; p D, p FA) and ( p D, p FA). They are suboptimal because they neglect the relationship between ˆYi and Yi and thus fail to exploit the predicted labels ˆy fully. The third approach uses MMSE estimation but replaces p(yi|ˆyi, zi; ψi, p D, p FA) with p(yi|zi; ψi) in (11) (16). The fourth approach estimates the correct labels and makes the same substitution in (22). (Equivalently, one can set jmax = 0 in Algorithm 1, 2, or 3, so ( p D, p FA) = (1/2, 1/2) and (10) reduces to (21).)

2.6.2 Fully Bayesian Estimation

Another alternative is a fully Bayesian approach, which treats ( p D, p FA) as an unobserved realization of nuisance-parameter RVs ( PD, PFA) with prior p( p D, p FA). A uniform distribution over the unit square serves as a non-informative prior. This approach estimates each metric RV by marginalizing out ( PD, PFA), so the estimate is not conditioned on a particular instance of ( p D, p FA), and no iteration is required. For example, consider MMSE estimation of the accuracy RV. For an estima-

On Truthing Issues in Supervised Classification

tor h( ˆY , Z) of ACC, the MSE is E h( ˆY , Z) ACC 2 , and the MMSE estimator is h MMSE = arg minh E h( ˆY , Z) ACC 2 . The standard result (see (1) or Appendix C) is that the solution is the conditional mean, so the fully Bayesian estimate of ACC is

Ep( p D, p FA)[ACC | ˆy, z; ψ] = Z 1

0 E[ACC | ˆy, z, p D, p FA; ψ] p( p D, p FA) d p D d p FA.

This estimate minimizes the MSE under the assumption that the OP parameters are RVs ( PD, PFA) rather than non-random quantities. In contrast, the empirical Bayes approach does not view ( p D, p FA) as realizations of RVs; it treats them as unknown, non-random parameters of p(ˆyi|yi; p D, p FA), so it does not marginalize them out, and its estimates depend on them. For example, it computes E PD ˆy, z; ψ, p D, p FA in (19), where the moments of U and V are conditional on {ˆy, z, ψ, p D, p FA}, which depends on the speciﬁc choice of ( p D, p FA). Both approaches are optimal according to Section 1.4, but they diﬀer in their treatment of the OP parameters. Marginalization over ( PD, PFA) is a form of averaging, so the fully Bayesian approach computes an average grade. The empirical Bayes approach is more faithful to the grading analogy at the beginning of this section: The quiz has a particular grade, represented by ( p D, p FA), rather than an average grade. Nevertheless, experimentation is required to compare the performance of these approaches; it appears in Section 5.2.1.

2.7 MMSE Testing for Multi-Class Classiﬁcation

This section discusses the extension of MMSE testing to multi-class classiﬁcation (C > 2) and some of the challenges associated with it. An immediate challenge is that it might be diﬃcult to model p(z|y; ψ) or estimate ψ and π, but the related work described in Sections 1.2.1 and 1.2.2 is very encouraging. We begin by introducing the C C conditional confusion matrix. We use Kemp to denote its empirical form, and we indicate an element of it by Kemp n|ℓ to emphasize its conditional nature.10 Then Kemp n|ℓis deﬁned as

Kemp n|ℓ= no. of times ˆyi = n and yi = ℓ

no. of times yi = ℓ , ℓ, n Y. (23)

The RV form of the matrix is K, with

1 N PN i=1 1(ˆyi = n and Yi = ℓ)

1 N PN i=1 1(Yi = ℓ) , ℓ, n Y.

2.7.1 Empirical Bayes via Sampling (Multi-Class Classification)

We make the same assumptions as for binary classiﬁcation in Section 2.1. We are given {z, ˆy, ψ, π}, each Zi has conditional distribution p(zi|yi; ψi), and the samples are independent. A graphical model appears in Figure 4. Each predicted label ˆyi is a realization of an RV ˆYi, which for multi-class classiﬁcation has conditional distribution p(ˆyi|yi; K), where

10. We use non-standard, zero-based, column-major matrix indexing; Kemp n|ℓcorresponds to Kemp(ℓ+1, n+1) in standard matrix notation.

Figure 4: Multi-class classiﬁcation: Graphical model of iterative estimation for testing.

K is a matrix of conditional confusion-matrix parameters that are analogous to the OP parameters, and therefore p(ˆyi|yi; K) = Kˆyi|yi. (24)

Each row of K must form a (C 1)-probability simplex: For each ℓ Y, Kn|ℓ 0, n Y, and P n Y Kn|ℓ= 1. When C = 2, it was suﬃcient to consider a single element in each row of K, and we used PFA = K1|0 and PD = K1|1; this property enabled us to use the mean approximations or integrate the marginal posteriors of PD and PFA in Algorithm 1. This technique is not viable for C > 2 because, for each row of K, we would have to determine the joint posterior of (C 1) RVs, and we would have to do (C 1)-dimensional numerical integration. On the other hand, for moderate values of C and a class prior that is not highly skewed, we can readily extend the empirical Bayes sampling procedure of Algorithm 2 to multi-class classiﬁcation. Pseudocode appears in Algorithm 4. Using the parameter matrix K, we draw M realizations of Y according to the multi-class testing class posterior, which is obtained in the same way as (10):

p(yi|ˆyi, zi; ψi, K) = π(yi)p(ˆyi, zi|yi; ψi, p D, p FA) P y i Y π(y i)p(ˆyi, zi|y i; ψi, K)

= π(yi)p(ˆyi|yi; K)p(zi|yi; ψi) P y i Y π(y i)p(ˆyi|y i; K)p(zi|y i; ψi)

(a) = π(yi) Kˆyi|yi p(zi|yi; ψi) P y i Y π(y i) Kˆyi|y i p(zi|y i; ψi) , (25)

where (a) is from (24). We compute the empirical conditional confusion matrix for each length-N realization of Y , and we average the M matrices to get an estimate of the conditional mean of K given {ˆy, z, ψ, K}. For Algorithm 4, we set tol = 10 3, jmax = 50, α = 10 3, and M = 2500 C.

On Truthing Issues in Supervised Classification

Algorithm 4 MMSE testing for multi-class classiﬁcation with empirical Bayes estimation of K via sampling.

1: function Multi Class Empirical Bayes Via Sampling(ˆy, z, ψ, π, M)

2: Initialize K(0) n|ℓ= 1/C for (n, ℓ) Y Y, and j 0

5: for m 1 : M do

6: for i 1 : N do

7: Draw y(m) i p Y | ˆYi,Zi(y|ˆyi, zi; ψi, K(j 1)) Eq. (25)

8: K(j,m) empirical matrix from {ˆy, y(m)} Eq. (23)

9: K(j) mean of { K(j,m)}M m=1 10: Adjust each row of K(j) so each element lies in [α, 1 α] and row sums to one

11: until | K(j) K(j 1) |max < tol or j jmax 12: K K(j) Final estimate of K

13: return K

This method allows for some correctly-labeled samples. If the correct label for the ith

sample is known to be equal to ℓ, then p Y | ˆYi,Zi(y|ˆyi, zi; ψi, K(j 1)) = 1(y = ℓ), and sampling will always produce ℓwhen drawing a realization for this sample. The initial parameter matrix K(0) is the result of using a non-informative prior for each row of the matrix namely, a Dirichlet distribution of order C with all concentration parameters equal to unity. The algorithm returns K, the approximate conditional mean of K. The elements of K are bounded and MMSE estimation is employed, so the remarks in Section 2.4.4 on convergence may also be applicable.

2.7.2 Posteriors of Metric RVs (Multi-Class Classification)

Once the estimate K is available, we can obtain the posteriors of diﬀerent metric RVs. Accuracy is deﬁned in the same way for binary and multi-class classiﬁcation. Empirical accuracy is acc = (no. of times ˆyi = yi)/N, and its RV form is ACC = N 1 PN i=1 1(Yi = ˆyi). In the multi-class case, each 1(Yi = ˆyi) is a Bernoulli RV with success probability p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K). The Bernoulli RVs are independent, so ACC is approximately normal by the CLT. In addition, the ordinary C C confusion matrix has empirical form Cemp with11

Cemp n,ℓ= (no. of times ˆyi = n and yi = ℓ) and RV form C with

i=1 1(ˆyi = n and Yi = ℓ)

i:ˆyi=n 1(Yi = ℓ).

11. Cemp n,ℓcorresponds to Cemp(ℓ+ 1, n + 1) in standard matrix notation.

Hence, matrix element Cn,ℓis approximately normally distributed with E[Cn,ℓ|ˆy, z; ψ, K] = P i:ˆyi=n p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K) and var(Cn,ℓ|ˆy, z; ψ, K) = P i:ˆyi=n p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K)

1 p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K) . This result applies to the posterior of each individual element of C; it does not describe the joint posterior of multiple elements of C. Finally, we remark that one could apply the ratio-of-normals procedure to compute the approximate posterior of Kn|ℓ, an individual element of K.

3. Training with Truthing Issues: Learning from Noisy Labels

This section addresses training with truthing issues. Recall that we use g(x; θ) to denote a classiﬁer or predictive model with model parameters θ. Given x, the classiﬁer calculates a statistic s = g(x; θ), which contains the model s calculation of the chance that x belongs to each class, and then it applies a decision rule to s to select ˆy. In binary classiﬁcation, s is typically a scalar, and ˆy is selected by comparing s to a threshold τ; i.e., ˆy = 1(s > τ). In multi-class classiﬁcation, s is a usually vector (s0, s1, . . . , s C 1), and ˆy corresponds to the index of the largest element of s; i.e., ˆy = arg maxy Y sy . Training is the process of learning θ from the training set. In the ideal case, the training set is {x, y}, and training seeks the parameters θ that will produce the most accurate predictions when the trained model is applied to as-yet-unseen samples. Training is usually posed as an optimization problem:

θ = arg min θ Jideal(θ; x, y) = arg min θ Jpri(θ; x, y) + λJreg(θ) , (26)

where the primary term Jpri(θ; x, y) imposes a cost or penalty for diﬀerences between the correct labels yi y and either g(xi; θ) or g(xi; θ), xi x; Jreg(θ) is a regularization term that reduces overﬁtting to the training set and improves generalization to unseen samples; and the weight λ 0 controls the level of regularization. The primary term depends on the predictive model, and the regularization term depends on the choice of regularization, such as L2 regularization: Jreg(θ) P j θ2 j, or L1 regularization: Jreg(θ) P j |θj|.

3.1 Training Assumptions

In the presence of truthing issues, the correct labels are not available, and the training set becomes {x, z, ψ, π}. With ψ and π given, the noisy-label model p(z|y; ψ)π(y) is known for each sample. We make the usual assumption of independent samples, so p(z|y, x; ψ) = QN i=1 p(zi|yi, xi; ψi). Next, we assume that, given yi and ψi, the noisy-label RVs Zi do not depend on the feature vector xi, which gives p(zi|yi, xi; ψi) = p(zi|yi; ψi). Thus,

p(z|y, x; ψ) =

i=1 p(zi|yi; ψi). (27)

This expression allows for conditionally dependent labelers. We limit our attention to training that can be expressed as in (26). Then the optimization problem for training given {x, z, ψ, π} becomes

θ = arg min θ J(θ; x, z, ψ, π) = arg min θ Jpri(θ; x, z, ψ, π) + λJreg(θ) , (28)

On Truthing Issues in Supervised Classification

where only the primary term has been modiﬁed to become Jpri(θ; x, z, ψ, π). Since the regularization term is unchanged, we focus on the primary term below.

3.2 Uniﬁed View

We present a uniﬁed view of training with truthing issues that describes general approaches for training probabilistic or non-probabilistic predictive models.

Uniﬁed View of Training with Truthing Issues

1. For probabilistic models, keep the optimality principle from ideal training, and modify the primary term to account for {x, z, ψ, π} rather than {x, y}:

(a) For ML training, which would ideally maximize the likelihood function p(y|x; θ) or p(y, x; θ), instead use p(z|x; ψ, θ) or p(z, x; ψ, θ), respectively. (b) For MAP training, which would ideally maximize the posterior distribution p(θ|y, x), instead use p(θ|z, x; ψ).

2. For non-probabilistic models that use a loss function and would ideally minimize the empirical risk Remp(θ; x, y), perform MMSE training: retain the loss function and minimize ˆRMMSE(θ; x, z), the MMSE estimate of the empirical-risk RV given {x, z, ψ, π}.

The uniﬁed view is simple, elegant, and intuitively appealing. For probabilistic models, training remains true to the original optimality principle from ideal training. For nonprobabilistic models, training retains the original loss function from ideal training and optimizes the MMSE estimate of the empirical risk. Each approach is optimal according to Section 1.4. The estimands are appropriate: Training for probabilistic models targets the likelihood function or posterior, and training for non-probabilistic models targets the empirical risk. The use of the likelihood function, posterior, or MMSE estimator means that all available information in {x, z, ψ, π} is fully exploited. The penalty or utility criterion is clearly deﬁned. None of the methods try to estimate the correct labels. The uniﬁed view also organizes some of the related work. The training method proposed by Raykar et al. (2010) corresponds to Item 1a. Ratner et al. (2016, 2017) and Khetan et al. (2018) proposed training approaches that are equivalent to Item 2, although they did not arrive at them by applying MMSE estimation. We assume that p(z|y, ψ)π(y) is known or has already been learned, but Raykar et al. (2010) and Khetan et al. (2018) have demonstrated that one can employ these training approaches while jointly learning the noisy-label model. The next two sections derive the likelihood functions, posteriors, and MMSE estimator that provide the modiﬁed primary terms. The derivations are much simpler than those for testing because there are no predicted labels ˆy to consider. Indeed, the derivations amount to marginalizing over the correct labels. Marginalization introduces some implicit regularization by accounting for the uncertainty of the correct labels.12

12. The author thanks one of the anonymous reviewers for this observation.

3.3 Probabilistic Predictive Models (ML or MAP Training)

Here, we address predictive models based on a probabilistic viewpoint. There are two aspects to consider. One aspect is the form of the predictive model: discriminative or generative. A discriminative model assumes a form for the posterior p(y|x; θ) and directly uses g(x; θ) = p(y|x; θ). A generative model assumes a form for the joint distribution p(y, x; θ) and uses g(x; θ) = p(y, x; θ)/p(x; θ) p(y, x; θ). Many generative models apply the factorization p(y, x; θ) = p(x|y; θ)p(y; θ) and use g(x; θ) = p(x|y; θ)p(y; θ)/p(x; θ) p(x|y; θ)p(y; θ). The choice of g(x; θ) determines Jpri(θ; x, y). Logistic regression and neural networks are common examples of discriminative models, and na ıve Bayes is a classic example of a generative model (see Ng and Jordan, 2001).

The other aspect is the treatment of the predictive-model parameters θ: non-random or random. When the parameters are non-random, ML training is employed. For a discriminative model, this means ﬁnding θ to maximize the likelihood function p(y|x; θ); for a generative model, this means ﬁnding θ to maximize the likelihood function p(y, x; θ). When the parameters are RVs Θ with prior p(θ), MAP training is used. This means ﬁnding θ that maximizes the posterior p(θ|y, x); the posterior is expanded diﬀerently depending upon whether the model is discriminative or generative.

By considering both of these aspects, we can derive the primary term the likelihood function or posterior in the training objective function. The regularization term remains unchanged since it does not involve the correct labels.

3.3.1 Example Case: Discriminative Model, Non-Random Parameters

To illustrate the approach, we consider a discriminative model with non-random parameters θ. In the ideal case, we have access to {x, y} and use ML training to ﬁnd θ to maximize the likelihood function p(y|x; θ), which is

p(y|x; θ) =

i=1 p(yi|xi; θ). (29)

Equivalently, we can minimize the normalized negative log-likelihood function, which gives

Jpri(θ; x, y) = 1

N log p(y|x; θ)

i=1 log p(yi|xi; θ). (30)

On Truthing Issues in Supervised Classification

With truthing issues, we seek θ to maximize p(z|x; ψ, θ), given by

p(z|x; ψ, θ) =

i=1 p(zi|xi; ψi, θ)

yi Y p(yi, zi|xi; ψi, θ)

yi Y p(zi|yi, xi; ψi, θ)p(yi|xi; ψi, θ)

yi Y p(zi|yi; ψi)p(yi|xi; θ), (31)

where (a) marginalizes over the correct labels, and (b) is because the noisy-label RVs do not depend on the feature vector xi or the parameters θ and because the classiﬁer makes its prediction based only on xi and θ. We can equivalently minimize

Jpri(θ; x, z, ψ, π) = 1

N log p(z|x; ψ, θ)

yi Y p(zi|yi; ψi)p(yi|xi; θ). (32)

Raykar et al. (2010) proposed (31) as part of joint estimation of p(z|y) and logistic regression training.

3.3.2 All Cases

The top four rows of Table 9 summarize the results for all possible cases; derivations appear in Appendices E.1, E.2, and E.3. The only diﬀerence between the ideal case and truthing issues is that the latter marginalizes over the possible values of the correct-label RV Yi. The other diﬀerences are the same as in the ideal case. Switching from non-random to random parameters Θ introduces another factor for the parameter prior p(θ) and changes the optimization goal from ML to MAP. Switching from a discriminative model to a generative model changes the function being maximized from one involving p(yi|xi; θ) to one involving p(xi|yi; θ)π(yi) = p(xi, yi; θ).

3.4 Non-Probabilistic Predictive Models (Empirical Risk Minimization)

Some classiﬁers, like support vector machines, do not have a probabilistic formulation, and sometimes the theoretical origins of a probabilistic model are not the main focus. In such cases, ideal training applies the empirical risk minimization (ERM) principle (Vapnik, 1991) and seeks θ to minimize the empirical risk or average loss

Jpri(θ; x, y) = Remp(θ; x, y) = 1

i=1 L( g(xi; θ), yi), (33)

Pred. Par./ Labels and Training Set

Model Crit.* Ideal: Use {x, y} Noisy: Use {x, z, ψ, π}

NR/ Likelihood p(y|x; θ) Likelihood p(z|x; ψ, θ)

ML = QN i=1 p(yi|xi; θ) = QN i=1 P

yi Y p(zi|yi; ψi)p(yi|xi; θ)

RV/ Posterior p(θ|y, x) Posterior p(θ|z, x; ψ)

Discriminative

MAP p(θ) QN i=1 p(yi|xi, θ) p(θ) QN i=1 P

yi Y p(zi|yi; ψi)p(yi|xi, θ)

NR/ Likelihood p(y, x; θ) Likelihood p(z, x; ψ, θ)

ML = QN i=1 p(xi|yi; θ)π(yi) = QN i=1 P

yi Y p(xi|yi; θ)p(zi|yi; ψi)π(yi)

RV/ Posterior p(θ|y, x) Posterior p(θ|z, x; ψ)

MAP p(θ) QN i=1 p(xi|yi, θ)π(yi) p(θ)QN i=1 P

yi Y p(xi|yi, θ)p(zi|yi; ψi)π(yi)

NR/ Empirical risk Remp(θ; x, y)

MMSE estimate of empirical-risk RV

ERM = N 1 PN i=1 L( g(xi; θ), yi)

ˆRMMSE(θ; x, z)

Nonprobabilistic

= N 1 PN i=1 P

yi Y L( g(xi; θ), yi)p(yi|zi; ψi)

* Par./Crit. indicates the treatment of the predictive-model parameters θ (NR: non-random, RV: random variables) and the optimality criterion (ML: maximum likelihood, MAP: maximum a posteriori, ERM: empirical risk minimization). For discriminative or generative models, the primary term equals the negative normalized log-likelihood or log-posterior; for example, the top-left entry corresponds to Jpri(θ; x, y) = N 1 log p(y|x; θ). For non-probabilistic models, the primary term appears in the bottom row.

Table 9: Comparison of training objective functions for predictive models.

On Truthing Issues in Supervised Classification

where L(s, y) is a loss function that penalizes deviations between s = g(x; θ) and the correct label y. Examples for binary classiﬁcation include support vector machines, which may be trained with the hinge loss, and logistic regression, which corresponds to a linear model trained with the logistic loss. As an example of multi-class classiﬁcation, neural network training often uses the output vector from the network s ﬁnal fully-connected layer for s and applies the cross-entropy loss. In this section, we assume that θ is not random, since random parameters would require probabilistic modeling. As before, the regularization term is unaﬀected by truthing issues.

3.4.1 MMSE Estimation of the Empirical-Risk RV (MMSE Training)

With truthing issues, the correct-label RVs are not observed, and the loss functions and empirical risk are functions of them, so their values are uncertain. Let L( g(xi; θ), Yi), i = 1, . . . , N, be the ith loss-function RV, a function of the correct-label RV Yi, and from (33), write the empirical-risk RV as

Jpri(θ; x, Y ) = R(θ; x, Y ) = 1

i=1 L( g(xi; θ), Yi). (34)

We reason that, if a good in-sample estimate of the empirical-risk RV can be obtained for the training set {x, z, ψ, π}, then training that minimizes this estimate should perform well. We therefore focus on estimating the empirical-risk RV. Uncertainty about the empirical-risk RV arises because the correct-label RVs Y are unobserved. The feature vectors x do not contribute to the uncertainty because they are known. Given the noisy-label model form p(z|y; ψ)π(y), Bayes rule can provide p(y|z; ψ) but not p(y|z, x; ψ). For these reasons, when estimating the empirical-risk RV, we treat Y and Z as RVs, but not x. However, in the broader context of learning, the feature vectors may be treated as RVs. We adopt the MMSE criterion when estimating the empirical-risk RV. Denote an estimator of R(θ; x, Y ) by ˆR(θ; x, Z), where we omit ψ and π from ˆR for brevity. We set the primary term equal to the MMSE estimator of the empirical-risk RV:

Jpri(θ; x, Z, ψ, π) = ˆRMMSE(θ; x, Z) = arg min ˆR E ˆR(θ; x, Z) R(θ; x, Y ) 2 .

The standard result is that the MMSE estimator is the conditional mean of R(θ; x, Y ) given Z (see (1) or Appendices C and E.4), so

ˆRMMSE(θ; x, Z) = Ep(y|z;ψ) R(θ; x, Y ) Z; ψ

i=1 Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|Zi; ψi] (36)

yi Y L( g(xi; θ), yi)p(yi|Zi; ψi). (37)

The samples are independent, so R(θ; x, Y ) is approximately normally distributed by the CLT. The MMSE estimator guesses the conditional mean of this normal RV given Z.

Given Z = z, the MMSE estimate of the empirical-risk RV is

ˆRMMSE(θ; x, z) = Ep(y|z;ψ) R(θ; x, Y ) z; ψ

i=1 Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|zi; ψi] (39)

yi Y L( g(xi; θ), yi)p(yi|zi; ψi), (40)

where p(yi|zi; ψi) is the training class posterior13:

p(yi|zi; ψi) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (41)

Having found the MMSE estimator of the empirical-risk RV, we can now speak of MMSE training, which seeks θ that minimizes (40), the MMSE estimate of the empirical-risk RV. Equation (28) becomes

θ = arg min θ ˆRMMSE(θ; x, z) + λJreg(θ)

= arg min θ 1 N

yi Y L( g(xi; θ), yi)p(yi|zi; ψi) + λJreg(θ).

MMSE training handles correctly-labeled samples in a natural way. For the ith sample, if the correct label is known to be ℓ, then p(yi|zi; ψi) = 1(yi = ℓ), and the sample contributes its usual loss-function value to the empirical-risk term. The bottom row of Table 9 summarizes training for non-probabilistic models. Ratner et al. (2016, 2017) employed (39) for discriminative models including binary logistic regression and long short-term memory recurrent neural networks. Khetan et al. (2018, 4) proposed (40) for arbitrary loss functions, used it for joint estimation of p(z|y) and training, and provided performance guarantees for binary classiﬁcation. Neither group of authors demonstrated the link to MMSE estimation, so this section provides another way to motivate and arrive at this result. We close this section by considering the loss function for a single sample. In (36), each term in the summation corresponds to the MMSE estimator of an individual loss-function RV L( g(xi; θ), Yi), namely

ˆLMMSE( g(xi; θ), Zi) = Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|Zi; ψi], i = 1, . . . , N, (42)

yi Y L( g(xi; θ), yi)p(yi|Zi; ψi), i = 1, . . . , N. (43)

If the loss function has properties such as non-negativity or convexity, then (43) preserves such properties because p(yi|zi; ψi) is a probability distribution. Not surprisingly, the MMSE estimator of the empirical-risk RV is just the average of the MMSE estimators of the individual loss-function RVs.

13. The training class posterior diﬀers from the testing class posterior (10), which included the predicted labels and OP parameters. It is the same as (21) in Section 2.4.2.

On Truthing Issues in Supervised Classification

3.4.2 Standard Properties

Standard properties of MMSE estimators apply to ˆRMMSE(θ; x, Z); see Appendix C.1. It is unbiased in the Bayesian sense:14

E ˆRMMSE(θ; x, Z) = E R(θ; x, Y ) . (44)

Likewise, the MMSE estimator of each loss function is also unbiased:

E ˆLMMSE( g(xi; θ), Zi) = E[L( g(xi; θ), Yi)], i = 1, . . . , N. (45)

From (44), the mean of the estimation error is zero:

E ˆRMMSE(θ; x, Z) R(θ; x, Y ) = 0,

and the variance of the estimation error equals the MSE:

var ˆRMMSE(θ; x, Z) R(θ; x, Y ) = E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 (46)

i=1 E var L( g(xi; θ), Yi) Zi . (47)

Appendix E.4 derives (47) and shows that the estimation error converges to a normal RV with the above moments. Note that the MSE of ˆRMMSE(θ; x, Z) is the MMSE. Other properties, such as the orthogonality principle, also hold but are not relevant here. The MMSE estimator is a function of the RV Z, so it is also an RV, and (46) and (47) describe the MMSE considering the distribution of Z. In contrast, the MMSE estimate is the MMSE estimator evaluated at Z = z and is not random. The MSE that was realized by (39) or (40) equals the conditional variance of the empirical-risk RV given Z = z:

E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 Z = z

= var R(θ; x, Y ) Z = z

i=1 var L( g(xi; θ), Yi) Zi = zi .

3.4.3 Gradient Calculation

Training often employs some form of gradient descent to ﬁnd θ . For example, deep neural network training uses automatic diﬀerentiation to calculate the gradient of the empirical risk. In the ideal case,

Remp(θ; x, y) + λ Jreg

and from (33)

Remp(θ; x, y) = 1

L( g(xi; θ), yi) = 1

g ( g(xi; θ), yi) g

θj (xi; θ). (49)

14. Speciﬁcally, Ep(z;ψ)[ ˆRMMSE(θ; x, Z)] = Ep(z;ψ) Ep(y|z;ψ)[R(θ; x, Y ) Z; ψ] = Ep(y)[R(θ; x, Y )].

Training based on MMSE estimation replaces the partial derivative Remp(θ; x, y) / θj with ˆRMMSE(θ; x, z) / θj. From (40), this term is

ˆRMMSE(θ; x, z) = 1

yi Y p(yi|zi; ψi)

L( g(xi; θ), yi)

yi Y p(yi|zi; ψi) L

g ( g(xi; θ), yi) g

θj (xi; θ). (50)

It is a convex combination of the partial derivatives of the loss function. It is also compatible with automatic diﬀerentiation methods and inexpensive to compute. For a deep neural network, calculating g(xi; θ) and g/ θj represents the vast majority of the computational burden, and these quantities must only be computed once the same burden as in the ideal case. Calculating L/ g requires a negligible amount of computation for a typical loss function, so calculating it for each possible value of yi does not substantially increase the computational burden. An example of the partial derivatives for binary logistic regression appears in Section 5.4 and Appendix F.2.

3.4.4 Special Cases

We brieﬂy consider two special cases at opposite extremes. First, suppose that the correct label can be perfectly recovered from the noisy labels; i.e., p(y i|zi; ψi) = 1(y i = yi). Then MMSE estimation returns the correct values of the empirical risk and loss function. For example, (40) yields ˆRMMSE(θ; x, z) = R(θ; x, y) = Remp(θ; x, y), and (43) gives ˆLMMSE( g(xi; θ), Zi) = L( g(xi; θ), Yi). Likewise, the partial derivative in (50) reduces to (49): θj ˆRMMSE(θ; x, z) = θj Remp(θ; x, y) . Second, suppose that the noisy labels provide no information about the correct-label RVs; i.e., p(z|y; ψ) = p(z; ψ). Then p(y|z; ψ) reduces to π(y), and (37) becomes

ˆRMMSE(θ; x, Z) = 1

yi Y L( g(xi; θ), yi)π(yi) = Eπ(y) R(θ; x, Y ) .

Regardless of the value of Z, the MMSE estimator always returns the mean of the empiricalrisk RV, taken with respect to the class prior π(y). The partial derivative in (50) behaves similarly. The MMSE estimator remains unbiased, so (44) still holds. However, the estimator is a constant, so its variance is zero. By the law of total variance, its MSE is as large as possible and equals the variance of R(θ; x, Y ) (see Appendix C.1, Item 4).

3.4.5 Consistency of the MMSE Estimator

For ideal training, an important reason for minimizing the empirical risk (33) for a class of predictive models is consistency of the ERM principle: The ideal empirical risk converges in probability to the minimum achievable risk as N , even though the true distribution of (Xi, Yi)N i=1 is unknown (Vapnik, 1991). The relationship between MMSE estimation of the empirical-risk RV and consistency of the ERM principle requires further study. In the meantime, we consider the consistency of the MMSE estimator itself.

On Truthing Issues in Supervised Classification

The estimator ˆRMMSE(θ; x, Z) is consistent if it converges in probability to R(θ; x, Y ) as N . In this way, it attains the true value of the empirical-risk RV. Also, an estimator is mean-square consistent if its MSE goes to zero as N , and mean-square consistency implies consistency. Hence, if ˆRMMSE(θ; x, Z) is mean-square consistent, then it is consistent. From (46) and (47), the MSE of ˆRMMSE(θ; x, Z) is

E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 = 1 N2

i=1 E var L( g(xi; θ), Yi) Zi .

A suﬃcient condition for when the MMSE estimator is consistent is therefore:

i=1 E var L( g(xi; θ), Yi) Zi = 0. (51)

If there exists a constant b such that

E var L( g(xi; θ), Yi) Zi < b, i = 1, . . . , N, (52)

then lim N N 2 PN i=1 E var L( g(xi; θ), Yi) Zi < lim N b/N = 0, and (51) is satisﬁed. Hence, if the loss function is bounded, then clearly (52) is fulﬁlled. If the original loss function is unbounded, then one may be able to modify it to make it bounded; for example, once the original loss exceeds some threshold, it could be transformed to asymptotically approach an upper limit.

3.5 Advantages of MMSE Training

Within the uniﬁed view of training, MMSE training for non-probabilistic models has a number of appealing beneﬁts compared to ML or MAP training for probabilistic models. MMSE training can continue to use the original loss function and involves a simple modiﬁcation to the empirical risk calculation. The MMSE estimator gains standard properties such as Bayesian unbiasedness. Gradient descent and automatic diﬀerentiation can be used with minor modiﬁcations. The MMSE estimator is a consistent estimator of the empirical-risk RV if the loss function is bounded. In contrast, the ML or MAP training approach can require new derivations, and it may not be able to leverage existing loss functions and their gradients. The resulting expressions could complicate theoretical analysis. These diﬀerences are evident in the example for binary logistic regression, presented in Section 5.4. Equations for the primary terms and gradients are given in Appendix F.

3.6 Alternative Training Approaches

The uniﬁed view does not cover all possible approaches to training, and here we discuss some alternatives.

3.6.1 Constructing Weak Losses for Partial Labels

An alternative Bayesian approach by Cid-Sueiro (2012) and Cid-Sueiro et al. (2014) studied theoretical properties of weak losses for partial labels.15 Labeling used length-C binaryencoded vectors. For class y, the correct label was one-hot encoded as y = ey, where the vector ej equals one at element j and zero elsewhere. The partial-label vector z was similarly encoded, but multiple elements could be set to one, so the noisy labeler could indicate more than one class. The authors considered diﬀerent constraints on the permitted partial-label vectors, so the number of possible partial-label vectors, B, could be an integer between C and 2C. Recall that s = g(x; θ). The authors deﬁned a weak loss Lwk(s, z) to be a loss function designed to operate on partial labels rather than correct ones. They then introduced an equivalent loss Leq(s, y) for correct labels:

Leq(s, y) = Ep( z| y)[Lwk(s, Z)| Y = y], (53)

and they showed that

Ep( z)[Lwk(s, Z)] = X

z p( z)Lwk(s, z)

y p( y)p( z| y)Lwk(s, z)

z p( z| y)Lwk(s, z)

= Ep( y) Ep( z| y)[Lwk(s, Z)| Y ] (54)

= Ep( y)[Leq(s, Y )]. (55)

From this relationship they reasoned that training on partial labels with Lwk(s, z) will behave like training on correct labels with Leq(s, y). Given an original loss function L(s, y) intended for correct labels, the equivalent loss is Leq(s, ey) = L(s, y), and one would like to ﬁnd a corresponding weak loss. To this end, the authors expanded (53) into a matrix equation:

Leq(s, e0) Leq(s, e1) ... Leq(s, e C 1)

p Z| Y ( b0| e0) p Z| Y ( b1| e0) p Z| Y ( b B 1| e0) p Z| Y ( b0| e1) p Z| Y ( b1| e1) p Z| Y ( b B 1| e1) ... ... ... ... p Z| Y ( b0| e C 1) p Z| Y ( b1| e C 1) p Z| Y ( b B 1| e C 1)

Lwk(s, b0) Lwk(s, b1) ... Lwk(s, b B 1)

(56) where bi denotes the ith possible partial-label vector, i = 0, 1, . . . , B 1. The matrix tabulates the conditional distribution p( z| y), so it has dimensions C B. Cid-Sueiro pointed out that, for B > C, an inﬁnite number of solutions for Lwk(s, z) exist. The MMSE training approach proposed in Section 3.4.1 is also Bayesian, but rather than design a new loss function for noisy labels, it uses the MMSE estimator ˆLMMSE(s, Z) of an original loss function L(s, Y ) meant for correct labels. MMSE training assumes that

15. Portions of the work by van Rooyen and Williamson (2018) are closely related to this approach.

On Truthing Issues in Supervised Classification

the noisy labels do not depend on the feature vector because the noisy-label model has the form p(z|y)π(y). From (42) and (43), the MMSE estimator is the conditional mean of the original loss function given Z and is just a convex combination of the original loss function values for each possible y, weighted by p(y|Z). For the approach taken by Cid-Sueiro et al., (53) suggests that the equivalent loss could be interpreted as the MMSE estimate of the weak loss given Y = y. To ﬁnd a weak loss for a given equivalent loss, one must solve the matrix equation (56). From (45), the MMSE estimator is unbiased in the Bayesian sense, a standard result obtained by iterated expectations:

Ep(z)[ˆLMMSE(s, Z)] = Ep(z) Ep(y|z)[L(s, Y )|Z] = Ep(y)[L(s, Y )].

Cid-Sueiro et al. obtain a similar relationship, but in the opposite order: (54) and (55) give

Ep( z)[Lwk(s, Z)] = Ep( y) Ep( z| y)[Lwk(s, Z)| Y ] = Ep( y)[Leq(s, Y )].

This relationship corresponds to Bayesian unbiasedness of the MMSE estimator of the weak loss given Y . MMSE training conditions on the noisy-label RV Z. After Z = z is observed, it uses p(y|z) to estimate the original loss function. The work of Cid-Sueiro et al. conditions on the correct-label RV Y . Before Z is observed, they tabulate p( z| y) for all combinations of z and y, and they solve (56) to calculate the weak loss for every possible realization of Z that might occur. In summary, these two Bayesian approaches bear some similarities but address truthing issues diﬀerently. Neither one estimates the correct labels. One could be interpreted as operating in the reverse direction of the other. Cid-Sueiro et al. take the equivalent loss for correct labels and construct a weak loss for noisy labels. MMSE training takes the noisy labels and estimates the original loss function for correct labels.

3.6.2 Using Proxy Loss Functions

Another alternative, proposed by Natarajan et al. (2013, 2018), took a classical (i.e., frequentist) view when replacing the original loss function L(s, y) of s = g(x; θ) and y with a proxy loss function designed for noisy labels.16 They considered binary classiﬁcation with a single labeler and took the classical viewpoint, so y is unknown but non-random. Assuming that p(z; y) is known, they devised a proxy loss function Lpr(s, Z) for the noisylabel RV Z that is an unbiased estimator of L(s, y) in the classical sense, meaning that Ep(z;y)[Lpr(s, Z)] = L(s, y), y Y (see Papoulis 1991, 9-2; Kay 1993, 2.3). They then trained models on z using Jpri(θ; x, z, p(z; y)) = N 1 PN i=1 Lpr( g(xi; θ), zi).

16. van Rooyen and Williamson (2018) generalized this approach to other situations where a loss function for correct labels can be modiﬁed to account for noisy labels.

The proxy loss function is the solution of a system of C linear equations in C unknowns:17

p Z;y(0; 0) p Z;y(1; 0) p Z;y(C 1; 0) p Z;y(0; 1) p Z;y(1; 1) p Z;y(C 1; 1) ... ... ... ... p Z;y(0; C 1) p Z;y(1; C 1) p Z;y(C 1; C 1)

Lpr(s, 0) Lpr(s, 1) ... Lpr(s, C 1)

L(s, 0) L(s, 1) ... L(s, C 1)

which is very similar to (56) in the approach by Cid-Sueiro (2012) and Cid-Sueiro et al. (2014). The matrix is just the conditional distribution p(z; y); as long as it has full rank, a unique solution for Lpr(s, Z) exists. This approach exploits knowledge of p(z; y) and does not estimate the correct labels. As a classical approach, it does not involve a class prior π. Extending it to multiple labelers and varying combinations of labelers might be diﬃcult. For T labelers, if every labeler provides a label for every sample, then the proxy loss function must satisfy Ep(z;y)[Lpr(s, Z)] = L(s, y), y Y, where Z = (Z1, Z2, . . . , ZT ). This requirement yields a system of C linear equations in CT unknowns, which is underdetermined for T > 1. If the labelers can decline to provide a label but at least one labeler must provide a label for every sample, then the number of unknowns becomes (C + 1)T 1. In contrast, MMSE training in Section 3.4.1 is a Bayesian method that exploits the class prior π as well as p(z|y). It employs the MMSE estimator of L(s, Y ), which from (42) is ˆLMMSE(s, Z) = Ep(y|z)[L(s, Y )|Z]. This estimator is unbiased in the Bayesian sense, meaning Ep(z)[ˆLMMSE(s, Z)] = Ep(y)[L(s, Y )] (cf. (45)). The classical and Bayesian viewpoints are fundamentally diﬀerent (Kay, 1993, 10.3): The former treats y as an unknown, nonrandom quantity, and the latter treats y as an unobserved realization of the RV Y . Classical unbiasedness is thus not a paramount objective in Bayesian statistics (see Breiman 2001; Gelman et al. 2013, 4.5). Our training approach uses the original loss function, so a proxy loss function is unnecessary. Also, from (43), the MMSE estimator is a convex combination of original loss function values; no system of linear equations must be solved. Finally, MMSE training readily accommodates multiple labelers and diﬀerent combinations of labelers for diﬀerent samples.

3.6.3 Predicting the Noisy Labels

In another alternative training approach, Sukhbaatar et al. (2015) and Jindal et al. (2016) trained a composite neural network, which consists of a base network followed by an additional layer, to predict the noisy labels z from a single labeler. The loss function is unchanged, but y is replaced by z. During training, the authors regularized both components of the composite network so that the base network learned p(y|x) and the additional layer learned p(z|y). Following training, the base network can be extracted and used to predict correct labels. Let nnb(x; θ) denote the base network, and ℓa(nnb(x; θ); ψ) denote the additional layer, where ψ is a C C matrix representing the layer s estimate of p(z|y). Then the primary term for this approach is Jpri(θ, ψ; x, z) = 1

N PN i=1 L ℓa(nnb(x; θ); ψ), zi . This approach does not estimate the correct labels, and it jointly estimates both p(y|x) and p(z|y). It does not consider the class prior π. For a single labeler, no changes to the

17. Natarajan et al. only considered binary classiﬁcation; we oﬀer the formulation for arbitrary C.

On Truthing Issues in Supervised Classification

loss function are necessary, but the loss function would have to be modiﬁed to account for multiple labelers or diﬀerent combinations of labelers.

3.6.4 Predicting the Noisy Labels with Trace Regularization

For multiple independent labelers, Tanno et al. (2019) proposed jointly training a CNN and estimating the labelers confusion matrices. They used the confusion matrices to adjust the softmax output of the CNN to obtain predicted scores for the noisy labels z. The training objective combined a cross-entropy loss term and a trace-regularization term. The cross-entropy loss was applied to the adjusted CNN outputs and noisy labels z, which amounts to predicting the noisy labels like Sukhbaatar et al. (2015). Trace regularization was introduced because, as the authors showed, doing so will drive the estimated confusion matrices to their actual values, under certain conditions.18 Consequently, as the labelers confusion matrices are estimated, the CNN learns to predict the correct labels. This method does not estimate the correct labels. The estimated confusion matrices can be used to obtain estimates of p(zt, y) and p(zt|y) for each labeler, as well as π. Tanno et al. also described some computational advantages of their method compared to the EM-based methods of Raykar et al. (2010) and Khetan et al. (2018).

3.7 Suboptimal, Infrastructure-Compatible Training

There exists signiﬁcant existing infrastructure, like software packages or machine-learning frameworks, that expects correct or assumed-to-be-correct labels, and modifying it for one of the preceding methods might be impractical or costly. Below, we describe some training methods that are suboptimal but compatible with such infrastructure. First, label estimation produces an estimate ˇy of the correct-label RVs Y and trains on {x, ˇy} instead of {x, y}. To estimate Y , one can use the related work or apply the MPE criterion as in Section 2.6.1. The MPE estimator is given by h MPE i = arg minhi E[1(hi(Zi) = Yi)]. The standard result is that the solution is the MAP estimator (see (2) or Appendix D), so ˇyi = arg maxy Y p(y|zi; ψi), i = 1, . . . , N. Label estimation is suboptimal (cf. Section 1.4) because training should estimate the primary term e.g., the empirical-risk RV, as in Section 3.4.1 rather than Y . Second, voting trains T classiﬁers, where the tth classiﬁer gt is trained on the samples and noisy labels treated as if correct from the tth labeler only, namely {(xi, zi,t) : zi,t = , i = 1, . . . , N}. Given an unlabeled sample x, the predicted label ˆy is chosen by a majority vote among {gt(x; θt)}T t=1. This technique ignores estimation theory so it is suboptimal, and it multiplies training and deployment complexity by T. Third, sample replication, which was suggested by Raykar et al. (2010), copies each xi several times, assigns labels to its copies based on p(yi|zi; ψi), and trains on the copies and their labels. For example, in binary classiﬁcation, if p Yi|Zi(0|zi; ψi) = 0.2, then xi is replicated 5 times, with one copy assigned label 0 and four copies assigned label 1. Direct implementation could multiply the storage and computation requirements for training by the number of copies made.

18. Sukhbaatar et al. (2015) mentioned this result but applied diﬀerent regularization for implementational reasons.

Figure 5: Binary symmetric broadcast channel.

These methods oﬀer diﬀerent compromises between accounting for truthing issues and modifying existing infrastructure. Label estimation makes a hard decision about Y before training commences, which could introduce mistaken labels into training but requires no other modiﬁcations. Voting considers diﬀerent hard decisions about Y from the individual labelers, multiplies deployment complexity by T, and requires a simple voting mechanism. Sample replication never makes a hard decision about the correct-label RVs Y , oﬀers a quantized approximation of the approaches given in Sections 3.2, 3.3, and 3.4, and it multiplies training resource requirements by the number of copies, but it does not aﬀect deployment complexity.

4. Comparing Combinations of Labelers: An Information-Theoretic View

The truthing issues have a simple information-theoretic interpretation. For a single sample, the correct-label RV Y can be viewed as the input to a channel, and the noisy-label RVs Z = (Z1, . . . , ZT ) can be viewed as the outputs from the channel.19 Then the mutual information I(Z; Y ) = P z,y p(z, y) log2 p(z, y)/p(z)π(y) quantiﬁes the amount of information that Z conveys about Y and vice versa (see Cover and Thomas, 1991). Here, we show that the interpretation allows us to compare diﬀerent combinations of labelers in terms of mutual information, at least in theory. In Section 5.5, we demonstrate that the training and testing techniques of Sections 2 and 3 enable us to exploit that information in practice.

4.1 Binary Symmetric Broadcast Channel

We illustrate this viewpoint for binary classiﬁcation. For simplicity, we assume that the labelers or noisy-label RVs are conditionally independent, although the interpretation applies for conditionally dependent labelers as well. We also assume that all labelers have the same conditional distribution and that each labeler assigns a noisy label to every sample, so it suﬃces to consider a single sample.

19. In a theoretical context, Lugosi (1992) presented such a channel interpretation for a single labeler.

On Truthing Issues in Supervised Classification

We can now introduce a binary symmetric broadcast channel (BSBC) (see Cover and Thomas, 1991, 14.6), shown in Figure 5 and parameterized by the labeling-error probability ε [0, 1]. For t T and y, zt {0, 1},

p(zt|y; ε) =

( 1 ε, if zt = y; ε, if zt = y. (57)

For ε [0, 1/2], this model is equivalent to setting C = 2, δi 0, i, and φt 2ε in (59) and (60). The binomial theorem can be used to obtain I(Z; Y |T, ε), the mutual information for T labelers, each with error probability ε:

I(Z; Y |T, ε) = π(0)

εm(1 ε)T m log2 εm(1 ε)T m

π(0)εm(1 ε)T m + π(1)(1 ε)mεT m

log2 π(0)εm(1 ε)T m + π(1)(1 ε)mεT m .

For a ﬁxed value of π(1), I(Z; Y |T, ε) is maximized if ε {0, 1}, in which case it equals π(0) log2 π(0) π(1) log2 π(1) = H(Y ), the entropy of Y . If ε = 1/2, then the labelers assign labels equiprobably, providing no information about Y , and I(Z; Y |T, ε) = 0. If ε (0, 1/2) (1/2, 1), then lim T I(Z; Y |T, ε) = H(Y ), and all information about Y becomes available from Z. Figure 6 shows I(Z; Y |T, ε) for diﬀerent values of T and ε when π(1) = 0.1. The plot is only shown for 0 ε 1/2. As ε decreases from 1/2 to zero, I(Z; Y |T, ε) increases from zero to its maximum as one would expect. For ε = 0, I(Z; Y |T, ε) increases with T because more noisy-label observations are available for estimating Y .

4.2 Equivalent Single Labeler

We can use I(Z1; Y |T1, ε1) and I(Z2; Y |T2, ε2) to compare the information provided by two diﬀerent groups of labelers, but it can be helpful to think in terms of a single labeler. Let I(Z; Y |ε ) denote the mutual information for a single labeler with error probability ε , which corresponds to the standard (single-input, single-output) binary symmetric channel (BSC) (see Cover and Thomas, 1991, 8.1). Given π, T, and ε, we can use binary search to ﬁnd ε such that I(Z; Y |ε ) = I(Z; Y |T, ε) and express the T labelers as an equivalent single labeler with error probability ε . Thus, we can compare diﬀerent numbers of labelers with diﬀerent error probabilities in terms of their equivalent single-labeler mutual information. Figure 7 shows the relationship between π, ε , and I(Z; Y |ε ) for the standard BSC. The mutual information is symmetric about π(1) = 1/2. For a ﬁxed value of ε , it decreases as π(1) decreases from 1/2, which indicates that handling truthing issues becomes more challenging as the classes become more imbalanced.

Figure 6: Graph of the mutual information I(Z; Y |T, ε) of BSBC for π(1) = 0.1 as a function of T and ε.

Figure 7: Mutual information I(Z;Y |ε ) of standard BSC as a function of class prior π(1) and error probability ε .

4.3 Multiple Mediocre Labelers or Single Expert Labeler

Figure 6 also indicates that, for 0 < ε < ε < 1/2, I(Z; Y |T, ε) can equal or exceed I(Z; Y |ε ) if T is suﬃciently large. Hence, multiple mediocre labelers can be as informative as or more informative than a single expert labeler. This result helps justify crowdsourcing and explain its successes. We can again use binary search to ﬁnd the minimum value of T needed to satisfy I(Z; Y |T, ε) I(Z; Y |ε ).

Figure 8 shows example curves for ε {0.01, 0.02, 0.05, 0.10} and π(1) = 0.4. The curves rise steeply for small values of T before diminishing returns set in; as T , ε approaches an asymptote at 1/2. Hence, a few mediocre labelers may suﬃce to obtain a small equivalent error probability ε . Alternatively, one might desire a very small value of ε that no single expert can achieve, but a few human not superhuman labelers might be able to attain it together.

This possibility is reminiscent of boosting (see Schapire, 1990; Freund and Schapire, 1997), in which multiple weak learners are leveraged to achieve the performance of a strong learner. We have not explored this relationship further but make some brief remarks. In boosting, the weak learners perform slightly better than random guessing, but the correct labels are known, and this information is exploited to improve performance. With truthing issues, the noisy labels might also be quite inaccurate and the correct labels are unobserved, but the conditional distribution p(z|y; ψ) is known, and it provides information about the correct labels for training and testing.

The information-theoretic implication of equivalent information is intriguing, but it does not explain how to exploit the information that Z or Z conveys about Y . Fortunately, the methods developed in Sections 2 and 3 provide a means to do so. They should produce better estimates as T grows because the variance of an optimal estimator decreases as more

On Truthing Issues in Supervised Classification

5 10 15 20 25 30 35 40 45 50 Number of Labelers

Error Probability of Each Labeler

Class Prior Pr(Y=1) = 0.40 Multiple-Labeler Error Probability Needed to Match Single Labeler

Single-labeler error prob. = 0.10, I(Z;Y) = 0.512 bits Single-labeler error prob. = 0.05, I(Z;Y) = 0.690 bits Single-labeler error prob. = 0.02, I(Z;Y) = 0.832 bits Single-labeler error prob. = 0.01, I(Z;Y) = 0.891 bits

Figure 8: Number of labelers needed to achieve the same mutual information as a single labeler for π(1) = 0.4.

(independent) noisy observations become available.20 For T suﬃciently large, the estimates for the mediocre labelers should become comparable to those for the single expert labeler. Section 5.5 presents a couple of experiments that verify the implication.

5. Experiments

We conducted a number of experiments to see how the diﬀerent testing and training methods performed and to check the implication of equivalent mutual information for diﬀerent combinations of labelers.

5.1 Simulation

To exercise the testing methods, we need correct, noisy, and predicted labels. To study the training methods, we only need correct and noisy labels. These can all be generated via simulation. Of course, the methods do not use the correct labels, which would not be available in the intended applications, but simulation lets us compare the results with the ideal case.

5.1.1 Simulating Correct and Predicted Labels

Simulation for a desired class prior π is simple. For binary classiﬁcation, the N correct-label realizations y(0) are drawn independently B(π(1)). For multi-class classiﬁcation, they are

20. This behavior relates to T, the number of noisy labels per sample, rather than N, the number of samples. Increasing T means more noisy observations of Y are available to reduce the uncertainty about the correct label for a sample. Merely increasing N does not provide any more observations for the previous samples.

drawn independently from a categorical distribution with prior π and categories 0, 1, . . . , C 1. Denote such a distribution as Cat π, (0, 1, . . . , C 1) . Experiments on the testing approaches of Section 2 require simulated predicted labels. To simulate a binary classiﬁer with desired operating point pdes D , pdes FA , the predicted-

label realizations ˆy are drawn independently, with ˆyi drawn B(pdes FA ) if y(0) i = 0, and ˆyi drawn B(pdes D ) if y(0) i = 1. Likewise, the performance of a multi-class classiﬁer can be speciﬁed by a desired conditional confusion matrix Kdes, where matrix element Kdes n|ℓcontains p(ˆy = n|y = ℓ). Thus, each predicted-label realization ˆyi is drawn independently Cat (Kdes 0|ˆyi, Kdes 1|ˆyi, . . . , Kdes C 1|ˆyi), (0, 1, . . . , C 1) .

5.1.2 Modeling and Simulating Noisy Labels

Since the samples are independent, we temporarily drop the sample index i while describing the modeling and simulation of the noisy-label RVs. Until this point, p(z|y; ψ) has remained completely general, so it allows for conditionally dependent labelers; that is, dependencies between diﬀerent noisy-label RVs Zt and Zt for the same sample. We have also left ψ unspeciﬁed. For simulation, we need a more concrete model. First, we assume that the noisy-label RVs for a sample are conditionally independent:

p(z|y; ψ) = Y

t:zt = p(zt|y; ψt), (58)

where ψ = (ψt)t T , and ψt contains parameters that determine how the tth labeler labels the sample. This assumption is primarily for implementational convenience; we could have used a model with dependent labelers like one of the models in Holodnak et al. (2018). Second, we let ψt = (δ, φt), where δ [0, 1] is the sample diﬃculty, and φt [0, 1] is the labeler fallibility. The sample diﬃculty represents the ambiguity about the correct label inherent to the sample itself. For example, in image recognition, it could indicate the amount of blur or noise. The labeler fallibility is an attribute of the tth labeler; it could reﬂect an analyst s experience or a crowdsourcing participant s trustworthiness. The model is then

p(z|y; δ, φt) =

1 ε(δ, φt), if z = y Y; ε(δ,φt)

C 1 , if z = y Y, z Y; 0, if z Y;

where the labeling-error probability ε(δ, φt) is a bilinear function:

ε(δ, φt) = (δ δφt + φt)C 1

C , 0 δ 1, 0 φt 1. (60)

The labeler chooses Zt to be the correct label with probability 1 ε(δ, φt) and makes a mistake with probability ε(δ, φt) [0, 1]. If a mistake occurs, then Zt is equiprobably distributed over the C 1 possible incorrect labels. If δ = φt = 0, then Zt y, but if either parameter is non-zero, then ε(δ, φt) increases with both δ and φt. If either δ = 1 or φt = 1, then Zt is equiprobably distributed over Y.

On Truthing Issues in Supervised Classification

Our model resembles the one by Whitehill et al. (2009), but our formulation is slightly diﬀerent, and our purpose is signiﬁcantly diﬀerent. Their model is for binary classiﬁcation, whereas our model applies to arbitrary C. Their model allows for adversarial labelers who tend to choose the opposite of the correct binary label, but when C > 2, the opposite of the correct label is not obvious. More important, Whitehill et al. concentrated on estimating the model parameters, while we focus on exploiting the model when its parameters are known. In the latter case, an adversarial labeler who is known to frequently choose the opposite of the correct label is actually more informative than a sloppy labeler who assigns labels equiprobably at random.21

Finally, we again consider all samples and let δ = (δi)N i=1 and φ = (φt)T t=1. Then ψ = (δ, φ), and ψi = (δi, φ), so

p(z|y; ψ) =

i=1 p(zi|yi; δi, φ)

t:zi,t = p(zi,t|y; δi, φt), (61)

where (a) is from (58). During simulation, we draw δi, i = 1, . . . , N, and φt, t = 1, . . . , T, independently from beta or uniform distributions. We also draw a vector η = (ηt)T t=1, where ηt is the approximate probability that the tth labeler provides a label for a sample. Each ηt is drawn independently U(0, 1), where U(a, b) denotes a uniform distribution over (a, b). For the ith sample, we repeatedly draw ηi B(ηt) T t=1 with independent Bernoulli-distributed elements until at least one element of ηi equals unity. We then simulate zi,t in accord with (59) and (60) for t {t : ηi,t = 1}, and we set zi,t = for t {t : ηi,t = 0}.

5.2 Examples of Testing: Binary Classiﬁcation

A variety of examples that illustrate aspects of the testing approaches for binary classiﬁcation appear here. We emphasize that, except for the ideal case, no correct labels were used during testing.

5.2.1 Main Testing Example

A simulation was conducted with N = 103, T = 5, π(1) = 0.2, pdes D , pdes FA = (0.8, 0.3), δi Beta(1, 5), i, and φt U(0, 0.4), t. The simulator produced δ = (0.098, 0.046, 0.347, . . . , 0.199, 0.221), φ = (0.249, 0.030, 0.387, 0.244, 0.154), and η = (0.727, 0.873, 0.286, 0.657, 0.232). The number of samples labeled by each labeler was 727, 879, 272, 667, 229, respectively; on average, there were about 2.8 noisy labels per sample. Figure 9 shows the progression of the estimates p(j) D and p(j) FA for the iterative methods. The dotted lines show the ideal values of p D and p FA if the correct labels were known; the ideal values diﬀer slightly from pdes D , pdes FA because they are the result of sampling and simulation. For MMSE testing, both empirical Bayes methods (Algorithms 1 and 2), the

21. Ipeirotis et al. (2010, 3) made a similar observation in the context of assessing the cost of a labeler.

0 2 4 6 8 10 Iteration number

Estimated value

Ideal (correct labels known): p D Ratios: p D Sampling: p D Estimate labels: p D Ideal (correct labels known): p FA Ratios: p FA Sampling: p FA Estimate labels: p FA

Figure 9: Main testing example: Progression of p(j) D , p(j) FA in iterative estimation methods.

0.6 0.65 0.7 0.75 0.8 0.85 0.9 Accuracy

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Sampling: Density Sampling: Cond. mean Sampling: MAP estimate Sampling: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Full Bayes: Density Full Bayes: Cond. mean Full Bayes: MAP estimate

Figure 10: Main testing example: Accuracy estimates by all methods. Markers are spaced vertically for easier readability.

estimates improved on each iteration, and both converged to nearly the same ﬁnal OP parameters after seven iterations. The suboptimal method of estimating the correct labels (Algorithm 3) converged after nine iterations; its ﬁnal OP parameter p FA was slightly more accurate than those of the empirical Bayes methods, but its ﬁnal OP parameter of p D was considerably less accurate than theirs. Table 10 summarizes the estimates of all metrics by the techniques presented in Section 2, and Figure 10 shows the results for accuracy. Results for the suboptimal methods that do not fully exploit the predicted labels appear in Section 5.2.2. For MMSE testing with the empirical Bayes methods ( Ratios and Sampling ), the ﬁgure displays the estimated posterior density, conditional mean, MAP estimate, and 95%-credible region for ACC. The ﬁgure also displays a gray histogram, which was created by taking the ﬁnal OP parameters ( p D, p FA) from Algorithm 2, generating 5000 sample realizations of Y from (10), and computing the empirical accuracy for each realization. The optimal estimates are fairly close to the ideal accuracy, and the credible regions contain the ideal accuracy. The results for the two methods are very similar; this behavior was observed for all examples, so subsequent ﬁgures do not include results for Algorithm 2 beyond the histograms. The ﬁgure also shows the estimated accuracy for the suboptimal method in Algorithm 3 ( Estimate labels ). This estimate is reasonably good, but it does not provide any sense of uncertainty like a credible region does. Also, this algorithm produced inaccurate estimates of REC or PD (shown next in Figures 11 and 12). Likewise, the ﬁgure displays the T instances of accuracy for each individual labeler ( Labelers ). We also computed the mean and median of these instances; we applied the algorithm by Vardi and Zhang (2000) to calculate the multi-dimensional median. The instances are fairly inaccurate, so their mean and median also yield poor estimates. The last set of results in the ﬁgure are for the fully Bayesian approach. Its estimated density is very spread out as a result of marginalization over ( PD, PFA), and its conditional

On Truthing Issues in Supervised Classification

Scalar Metrics Joint Metrics

Estimate ACC PREC PD or REC PFA F1 (PREC, REC) (PD, PFA)

Ideal (correct labels known) Ideal 0.712 0.421 0.810 0.316 0.554 (0.421, 0.810) (0.810, 0.316) MMSE Testing: Empirical Bayes estimation

via ratios of jointly normal RVs (Alg. 1) Conditional mean 0.700a 0.397a 0.795 0.328 0.534 (0.397, 0.795)b (0.795, 0.328)b

MAP estimate 0.795 0.325 0.529 (0.397, 0.795)c (0.796, 0.325)c

Credible (lower) 0.687 0.373 0.764 0.317 0.506 region (upper) 0.713 0.420 0.828 0.335 0.553 MMSE Testing: Empirical Bayes estimation via sampling (Alg. 2) Conditional mean 0.701a 0.397a 0.796 0.327 0.535 (0.397, 0.796)b (0.796, 0.327)b

MAP estimate 0.796 0.325 0.530 (0.397, 0.796)c (0.796, 0.325)c

Credible (lower) 0.688 0.374 0.765 0.316 0.506 region (upper) 0.714 0.421 0.829 0.334 0.554 Estimation of correct labels (Alg. 3) Estimate labels 0.720 0.402 0.868 0.316 0.550 (0.402, 0.868) (0.868, 0.316) Combine metrics from individual labelers Mean 0.636 0.431 0.615 0.354 0.506 (0.431, 0.615) (0.615, 0.354) Median 0.622 0.421 0.605 0.354 0.504 (0.421, 0.605) (0.605, 0.354) Labeler No. of Metrics from index labels individual labelers 1 727 0.622 0.450 0.574 0.354 0.505 (0.450, 0.574) (0.574, 0.354) 2 879 0.653 0.414 0.643 0.343 0.504 (0.414, 0.643) (0.643, 0.343) 3 272 0.618 0.405 0.605 0.377 0.485 (0.405, 0.605) (0.605, 0.377) 4 667 0.619 0.421 0.587 0.366 0.490 (0.421, 0.587) (0.587, 0.366) 5 229 0.668 0.465 0.667 0.331 0.548 (0.465, 0.667) (0.667, 0.331) Fully Bayesian estimation Conditional mean 0.646 0.345 0.664 0.360 0.456 (0.345, 0.664)b (0.664, 0.360)b

MAP estimate 0.662 0.342 0.714 0.349 0.464 (0.347, 0.690)c (0.663, 0.360)c

a For accuracy or precision, the empirical Bayes conditional mean and MAP estimate are identical. b The conditional means for the joint metrics are the same as those for the corresponding scalar metrics. c The MAP estimate of a joint metric can diﬀer from the MAP estimates of its individual components.

Table 10: Testing metrics for main testing example.

mean and MAP estimate are not very accurate. This behavior occurred for other metrics, so we do not show the fully Bayesian method in the subsequent ﬁgures. Next, Figure 11 displays results for the other scalar metrics. MMSE testing using the empirical Bayes method of Algorithm 1 produced estimates within about 0.025 of each metric. For REC or PD, the credible region contains the ideal metric, and for the other metrics, it lies outside the credible region by just 0.001. Table 10 indicates that the credible regions of Algorithm 2 contained each ideal metric. The suboptimal method of estimating the correct labels (Algorithm 3) was quite accurate for several metrics but oﬀby 0.058 for REC or PD. The use of the labelers labels often produced very inaccurate estimates.

0.3 0.32 0.34 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 Precision

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

Recall or Prob. of Detection

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 Recall or Prob. of Detection

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

Prob. of False Alarm

0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 Prob. of False Alarm

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

F-score (beta = 1.0)

0.45 0.5 0.55 0.6 0.65 F-score (beta = 1.0)

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

Figure 11: Main testing example: Estimates of scalar metric RVs.

On Truthing Issues in Supervised Classification

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm

Prob. of Detection

ROC Analysis

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Recall

Precision-Recall Analysis

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

Figure 12: Main testing example: Estimates of joint metric RVs for ROC and P-R analysis.

Figure 12 presents results for P-R and ROC analysis. MMSE testing with empirical Bayes estimation produced the best joint estimates; its point estimates are closest to the ideal operating points, and its 95%-credible regions contain the ideal operating points. Estimating the correct labels yielded less accurate joint estimates, and it continues to provide no sense of uncertainty. Using the labelers labels gave very poor estimates.

5.2.2 Exploitation of Predicted Labels

Full exploitation of the predicted labels ˆy and OP parameters ( p D, p FA) is an important characteristic of the most successful testing methods. Figure 13 shows results for the simulation in Section 5.2.1 if the predicted labels are not fully exploited, which corresponds to the third and fourth suboptimal testing approaches in Section 2.6.1. They use p(zi|yi; ψi) rather than the testing model (7) with p(ˆyi|yi; p D, p FA)p(zi|yi; ψi), which amounts to setting jmax = 0 in Algorithm 1, 2, or 3. Compared to Figures 10 and 12, the estimated posterior and the estimates are quite inaccurate, and the credible regions do not contain the ideal metrics. These results illustrate the importance of including the predicted labels and OP parameters during estimation.

5.2.3 Convergence Experiments for Iterative Algorithms

Section 2.4.4 speculated that, when estimating ( p D, p FA) during MMSE testing, the empirical Bayes methods (Algorithms 1 and 2) will converge to the global optimum regardless of the initial OP parameters. To check this possibility, we conducted two experiments, which also included the suboptimal correct-label estimation method (Algorithm 3). First, we used the same simulation as in Section 5.2.1 but varied the initial OP parameters p(0) D , p(0) FA over the 10 10 grid {0.05, 0.15, . . . , 0.95} {0.05, 0.15, . . . , 0.95}. Table 11 summarizes the results. The empirical Bayes methods converged to nearly the same ﬁnal OP parameters every time, with maximum absolute errors of about 0.015. In addition, the MAC was always satisﬁed in Algorithm 1. The suboptimal method of Algorithm 3 had larger p D errors but slightly smaller p FA errors than the empirical Bayes methods. For this

0.6 0.65 0.7 0.75 0.8 0.85 0.9 Accuracy

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Recall

Precision-Recall Analysis

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels

Figure 13: Examples of estimates of metric RVs if the predicted labels ˆy and OP parameters ( p D, p FA) are not fully exploited. Compare with Figures 10 and 12. The plots do not show the labelers results because they are unchanged from those ﬁgures.

Same Ideal Operating Point, Iterative Estimation Method Diﬀerent Initial OP Parameters Ratios Sampling Est. Labels

Error p D p D

Mean 0.0149 0.0131 0.0473 Std. dev. 3.20 10 4 4.15 10 4 1.61 10 2

Max abs. 0.0153 0.0142 0.0715

Error p FA p FA

Mean 0.0118 0.0092 0.0015 Std. dev. 6.34 10 5 8.56 10 5 1.49 10 3

Max abs. 0.0119 0.0095 0.0033

Number of Iterations

Mean 6.6 6.6 6.2 Std. dev. 1.08 1.05 2.35 Max 8 8 10

Table 11: Estimation error statistics for the ﬁnal OP parameters ( p D, p FA) for iterative estimation methods. For the same ideal operating point (p D, p FA) = (0.810, 0.316), 100 diﬀerent initial OP parameters were used.

ideal operating point, all algorithms consistently converged to nearly the same ﬁnal OP parameters. Second, we ran another set of simulations that always initialized the iterative algorithms with the default initial OP parameters p(0) D , p(0) FA = (1/2, 1/2), but we varied the desired operating point pdes D , pdes FA over the same 10 10 grid as above, which produced 100 diﬀerent ideal operating points. These simulations used π(1) = 0.5, δi U(0, 1), i, φt U(0, 0.5), and ηt U(0, 1), t. Results appear in Table 12. The empirical Bayes methods converged every time, with average errors near 0.012 and a maximum absolute error below 0.045. Again, the MAC was always satisﬁed in Algorithm 1. The suboptimal method of Algorithm 3 performed less well. Although its average errors are no more than 0.021, its standard deviations are on the order of 0.200 and its maximum absolute error exceeds 0.333,

On Truthing Issues in Supervised Classification

Diﬀerent Ideal Operating Points, Iterative Estimation Method Same Default Initial OP parameters Ratios Sampling Est. Labels

Error p D p D

Mean 0.0113 0.0114 0.0141 Std. dev. 1.04 10 2 9.56 10 3 1.95 10 1

Max abs. 0.0314 0.0310 0.3590

Error p FA p FA

Mean 0.0120 0.0122 0.0210 Std. dev. 1.36 10 2 1.14 10 2 2.03 10 1

Max abs. 0.0442 0.0381 0.3369

Number of Iterations

Mean 10.9 11.2 10.4 Std. dev. 3.15 3.46 3.39 Max 17 18 21

Table 12: Estimation error statistics for the ﬁnal OP parameters ( p D, p FA) for iterative estimation methods. For 100 diﬀerent ideal operating points, the same default initial OP parameters was used.

indicating that it often became trapped near a local optimum. These results demonstrate the substantial beneﬁt of the empirical Bayes methods over estimating the correct labels.

5.2.4 Estimation Performance for Different Operating Points

The convergence experiments in the previous section only examine the estimation error of the ﬁnal OP parameters ( p D, p FA). For the second set of simulations in that section, we also compiled statistics on the estimation errors of the ﬁnal estimates of the scalar and joint metrics over the 10 10 grid {0.05, 0.15, . . . , 0.95} {0.05, 0.15, . . . , 0.95} of pdes D , pdes FA ; i.e., over 100 diﬀerent ideal operating points. Figure 14 summarizes the error statistics for the estimated scalar metrics by the diﬀerent testing approaches. For MMSE testing, the point estimates from the empirical Bayes methods (Algorithms 1 and 2) have average errors of about 0.012, with standard deviations near 0.011. The scale changes between MMSE testing and the other methods. The other methods have average errors between 0.001 and 0.043, but their standard deviations range from about 0.100 to over 0.200, an order of magnitude greater than those for the empirical Bayes methods. Figures 15 and 16 display joint error statistics for P-R and ROC analysis. Results for MMSE testing were similar for both the conditional mean and MAP estimate, so the ﬁgures only show estimation errors for the conditional mean of Algorithm 1 and the MAP estimate of Algorithm 2. The empirical Bayes methods have average errors near 0.015 in each dimension. The square roots of the eigenvalues of the error covariance matrix fall between 0.005 and 0.025. The scale changes between MMSE testing and the other methods. The latter methods have average errors between 0.005 and 0.043, but the eigenvalues square roots range from about 0.100 to over 0.430. These results demonstrate the superior estimation performance of the MMSE testing methods, whose estimates typically lie within 0.010 to 0.040 of the ideal metrics over a wide range of ideal operating points. They signiﬁcantly outperform the other methods,

-0.05-0.025 0 0.025 0.05 Error

Sampling: MAP Estimate

Sampling: Cond. Mean

Ratios: MAP Estimate

Ratios: Cond. Mean

-0.05-0.025 0 0.025 0.05 Error

-0.05-0.025 0 0.025 0.05 Error

Recall or Prob. of Detection

-0.05-0.025 0 0.025 0.05 Error

Prob. of False Alarm

-0.05-0.025 0 0.025 0.05 Error

-0.5 -0.25 0 0.25 0.5 Error

Full Bayes: Cond. mean

Labelers: Median

Labelers: Mean

Estimate labels

-0.5 -0.25 0 0.25 0.5 Error

-0.5 -0.25 0 0.25 0.5 Error

Recall or Prob. of Detection

-0.5 -0.25 0 0.25 0.5 Error

Prob. of False Alarm

-0.5 -0.25 0 0.25 0.5 Error

Figure 14: Scalar metric estimation errors for diﬀerent testing approaches over 100 diﬀerent ideal operating points. The axis limits diﬀer for MMSE testing with empirical Bayes (upper plots: 0.05 to +0.05) and the other approaches (lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and as text above each circle. Multiples of 1 and 2 times the standard deviation of the errors appear as crosses, and text below the ﬁrst cross gives the standard deviation.

On Truthing Issues in Supervised Classification

-0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Precision Error

Recall Error

Ratios: Conditional Mean

(-0.0082, -0.0114)

0.0258 0.0096

-0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Precision Error

Recall Error

Sampling: MAP Estimate

(-0.0080, -0.0112)

0.0236 0.0093

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error

Recall Error

Estimate labels

(-0.0131, -0.0142)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error

Recall Error

Full Bayes: Cond. mean

(-0.0010, -0.0045)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error

Recall Error

Labelers: Mean

(-0.0284, -0.0132)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error

Recall Error

Labelers: Median

(-0.0149, -0.0153)

Figure 15: P-R analysis estimation errors for diﬀerent testing approaches over 100 diﬀerent ideal operating points. The axis limits diﬀer for MMSE testing with empirical Bayes (top plots: 0.05 to +0.05) and the other approaches (middle and lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and listed as an ordered pair. Ellipses denote areas that account for 0.6827 and 0.9545 of the density of a bivariate normal distribution ﬁtted to the errors, and text adjacent to the semi-major and semi-minor axes gives the square roots of the eigenvalues of the error covariance matrix.

-0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Prob. of False Alarm Error

Prob. of Detection Error

Ratios: Conditional Mean

( 0.0121, -0.0114)

-0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Prob. of False Alarm Error

Prob. of Detection Error

Sampling: MAP Estimate

( 0.0124, -0.0115)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error

Prob. of Detection Error

Estimate labels

( 0.0212, -0.0142)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error

Prob. of Detection Error

Full Bayes: Cond. mean

( 0.0087, -0.0045)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error

Prob. of Detection Error

Labelers: Mean

( 0.0426, -0.0132)

-0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error

Prob. of Detection Error

Labelers: Median

( 0.0188, -0.0144)

Figure 16: ROC analysis estimation errors for diﬀerent testing approaches over 100 diﬀerent ideal operating points. The axis limits diﬀer for MMSE testing with empirical Bayes (top plots: 0.05 to +0.05) and the other approaches (middle and lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and listed as an ordered pair. Ellipses denote areas that account for 0.6827 and 0.9545 of the density of a bivariate normal distribution ﬁtted to the errors, and text adjacent to the semi-major and semi-minor axes gives the square roots of the eigenvalues of the error covariance matrix.

On Truthing Issues in Supervised Classification

0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 Accuracy

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers

0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 Recall

Precision-Recall Analysis

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers

Figure 17: Testing example for a single labeler (T = 1) when a constant labeling-error probability ε0 = 0.1 is assumed for all samples. The mean and median of the labelers metrics are not shown because T = 1.

which frequently yield errors in excess of 0.100, 0.200, or more unacceptable given that the metrics lie in [0, 1]. The empirical Bayes approach is clearly more appropriate than the fully Bayesian one.

5.2.5 Single Labeler and Constant Labeling-Error Probability

The testing methods can also be useful if there is a single labeler (T = 1), and one merely wants to know how performance would be aﬀected by some worst-case labeling-error probability ε0. One can simply set δi 0 and φt Cε0/(C 1) in (60) and apply the methods. Figure 17 shows an example for T = 1 and ε0 = 0.1; other simulation settings were N = 103, π(1) = 0.6, pdes D , pdes FA = (0.90, 0.10), δi 0, and φt 0.2. MMSE testing produces accurate estimates, and its estimated posteriors and credible regions allow one to understand the possible variability caused by the assumed labeling-error probability. For 0 < ε0 < 1/2, the estimated correct label ˇyi is identically equal to the noisy label zi,1, so the markers for the estimated correct labels and the labelers labels lie in the same location. These suboptimal estimates are much less accurate than those from the empirical Bayes method.

5.2.6 Small Sample Size

The approximations behind MMSE testing are driven by the CLT, so they should hold for small N as long as ˆN1 and N ˆN1 are greater than or equal to thirty. We conducted another simulation with N = 70, which produced ˆN1 = 37 and N ˆN1 = 33; other simulation settings were T = 3, π(1) = 0.5, pdes D , pdes FA = (0.85, 0.20), δi Beta(1, 2), i, and φt U(0.2, 0.5), t. Figure 18 displays results for PFA and (PD, PFA). The posteriors are clearly non Gaussian, and the 95%-credible regions are large because of the small sample size, but they contain the ideal metrics. The suboptimal methods again provide less accurate estimates.

Prob. of False Alarm

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm

Probability Density

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm

Prob. of Detection

ROC Analysis

Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers

Figure 18: Testing example for small sample size (N = 70), with ˆN1 = 37.

5.3 Example of Testing: Multi-Class Classiﬁcation

For the multi-class testing methods of Section 2.7, we present an example with C = 4, N = 2000, π = (0.2, 0.3, 0.1, 0.4),

0.75 0.08 0.10 0.07 0.10 0.65 0.12 0.13 0.04 0.06 0.80 0.10 0.10 0.05 0.05 0.80

T = 5, δi 0, i, and φt U(0, 0.4), t. MMSE testing with the empirical Bayes sampling method (Algorithm 4) used M = 2500 C = 104. We also extended the method that estimates the correct labels (Algorithm 3) to multi-class classiﬁcation. Both algorithms converged after six iterations. For the individual labelers, we computed their confusion matrices, scaled the tth labeler s confusion matrix by N/(no. of labels from tth labeler), and calculated the mean and multi-dimensional median of the scaled matrices. Apart from the ideal case, testing did not involve the correct labels. First, Table 13 compares the estimates of the accuracy RV for the diﬀerent estimation methods. MMSE testing produced the most accurate estimates, and the ideal accuracy value lies squarely inside the 95%-credible region. Estimating the correct labels yielded a reasonably good estimate of accuracy but again without a sense of uncertainty. Using the individual labelers labels gave accuracies of 0.6586, 0.5667, 0.6397, 0.6698, and 0.5504; taking the mean or median of these values produces highly inaccurate estimates of the accuracy. Second, Table 14 shows the ideal confusion matrix and the estimated confusion matrices from the diﬀerent techniques. Ten matrix elements from MMSE testing are closest to the corresponding elements in the ideal confusion matrix, six from the estimation of correct labels are closest, one from the labelers mean is closest, and one from the labelers median is closest. (There were two ties, so these numbers sum to eighteen rather than sixteen.) Finally, Table 15 shows the 95%-credible regions of the confusion matrix elements from MMSE testing. Every credible region contains the corresponding element in the ideal confusion matrix.

On Truthing Issues in Supervised Classification

Estimation Method ACC Ideal (correct labels known) 0.7350 MMSE testing: Estimate Conditional mean 0.7355 conditional confusion MAP estimate 0.7355 matrix via sampling 95%-credible region (0.7282, 0.7427) Estimate correct labels 0.7390 Combine metrics from each labeler

Mean 0.6170 Median 0.6397

Table 13: Estimated accuracy for 4-class classiﬁcation example. Boldface indicates an estimate that was closest to the ideal accuracy or a credible region that contained the ideal accuracy.

Ideal (Correct) Predicted Label Labels Known) 0 1 2 3 0 290 40 40 39 Correct 1 58 391 68 68 Label 2 8 10 168 27 3 93 43 36 621

MMSE Predicted Label Testing 0 1 2 3 0 284.8 41.3 36.6 33.9 Correct 1 63.7 391.7 67.0 67.6 Label 2 7.5 7.3 169.8 29.0 3 93.0 43.6 38.7 624.6

Est. Correct Predicted Label Labels 0 1 2 3 0 288 43 35 37 Correct 1 61 395 70 68 Labels 2 7 4 170 25 3 93 42 37 625

Mean of Predicted Label Labelers 0 1 2 3 0 238.9 60.7 65.3 54.2 Correct 1 60.7 335.5 66.3 86.3 Label 2 36.7 54.2 130.1 64.8 3 76.6 91.3 48.8 529.6

Median of Predicted Label Labelers 0 1 2 3 0 254.1 57.4 46.8 56.2 Correct 1 71.1 343.4 67.2 86.9 Label 2 26.1 35.9 145.9 62.3 3 90.6 65.8 46.7 543.6

Table 14: Ideal and estimated confusion matrices for 4-class classiﬁcation example. Boldface indicates that the element was closest to the corresponding element in the ideal confusion matrix.

95%-Credible Predicted Label Regions 0 1 2 3 0 (277.72, 291.95) (35.80, 46.79) (32.27, 40.88) (28.54, 39.22) Correct 1 (57.75, 69.56) (384.72, 398.77) (61.51, 72.46) (60.74, 74.43) Label 2 ( 4.79, 10.23) (3.97, 10.67) (163.73, 175.83) (24.03, 33.93) 3 (86.74, 99.25) (38.47, 48.82) (34.34, 42.99) (615.90, 633.21)

Table 15: 95%-credible regions for individual elements of the confusion matrix estimated by MMSE testing (Algorithm 4) for 4-class classiﬁcation example.

5.4 Example of Training: Logistic Regression

This section uses logistic regression to illustrate the training approaches in Section 3. We use the Ionosphere binary-classiﬁcation data set from the UCI Machine Learning Repository (see Dua and Graﬀ, 2017), which contains N = 351 samples, each consisting of 34 real-valued features. Class 0 corresponds to a good radar return and class 1 to a bad radar return. Ionosphere contains 126 bad radar returns, so π(1) = 0.359. We employ 75% 25% stratiﬁed hold-out validation since multi-fold cross-validation produced cluttered plots that were too diﬃcult to read. In practice, one could use cross-validation, of course. Training and testing were conducted for T = 1, 5, 9, and 13. The data set provides the correct labels; noisy labels were simulated as in Section 5.1.2; hence, the noisy-label RVs are conditionally independent as in (58), ψi,t = (δi, φt), and (27) reduces to (61). The settings were δi Beta(1, 5), i; φt U(0, 0.5), t; η1 1 to force the ﬁrst labeler to label every sample; and ηt U(0.33, 1), t T \ {1}. The sample-diﬃculty realization was δ = (0.367, 0.524, 0.115, 0.181, 0.021, . . . , 0.154), and the labeler-fallibility realization for T = 13 was φ = (0.079, 0.440, 0.137, 0.207, 0.148, 0.314, 0.290, 0.300, 0.133, 0.142, 0.127, 0.164, 0.072). For each value of T, z consisted of the noisy labels for labeler indexes 1 through T. In fact, the introductory example in Table 1 is an excerpt of z for T = 5. We consider ﬁve classiﬁers. The ideal classiﬁer performs conventional ML training with {x, y} and is included for reference. The ML-optimal classiﬁer performs ML training given {x, z, ψ, π} according to part 1a of the uniﬁed view in Section 3.2, and primary term (32) from Section 3.3.1; details appear in Appendix F.1. The resulting objective function may no longer be convex, but gradient descent can still be used to ﬁnd a local optimum. The MMSEoptimal classiﬁer uses MMSE training according to part 2 of the uniﬁed view, primary term (40) from Section 3.4.1, and gradient components in (50) of Section 3.4.3; details are in Appendix F.2. The suboptimal, infrastructure-compatible label-estimation and voting classiﬁers described in Section 3.7 are also included. We omitted sample replication since it is essentially a quantized form of training with (32). For all classiﬁer training, we used L2 regularization and the standard Broyden-Fletcher-Goldfarb-Shanno method. All testing metrics were calculated using the empirical Bayes method of Algorithm 1, so, except for the ideal classiﬁer, no correct labels were used during training or testing. Although training could be optimal or suboptimal, testing always applied the same technique. For each training method, the regularization weight λ was swept over {0.5, 1.0, . . . , 10.0}, producing twenty trained models. For the ideal classiﬁer, we selected the model with the

On Truthing Issues in Supervised Classification

largest ideal area under the ROC curve on the held-out testing set. For the other classiﬁers, we estimated the ROC curve on the testing set (explained shortly in Section 5.4.2) and selected the model with the largest estimated area under the ROC curve.

5.4.1 Fixed Decision Threshold

A trained logistic regression model computes g(x; θ) [0, 1] as its estimate of p(y|x; θ) and compares it against a threshold τ; the predicted label ˆy is 1 if g(x; θ) > τ and 0 otherwise. This section presents results for the single default threshold τ = 1/2. Figure 19 displays P-R analysis plots for the held-out testing set as T increases. For the classiﬁers trained with noisy labels, contours show the estimated joint posteriors of (PREC, REC), circles show the conditional means, and solid lines indicate the 95%-credible regions. Normally, the correct labels y would not be available, but since we have the luxury of knowing them, inverted triangles mark each classiﬁer s actual performance. The upright, solid black triangle shows the ideal classiﬁer s operating point, which is identical for all values of T. With T = 1, z conveys little information about y, so the posteriors are spread out, and the credible regions are quite large. For each training method, the conditional mean and actual operating point are not very close together, but the actual operating point lies within the credible region. The suboptimal methods outperform the MMSE-optimal and ML-optimal classiﬁers for this case, but the credible regions overlap so much that one could not make this conclusion without access to the correct labels. This plot also demonstrates that our training and testing approaches are applicable even for a single, imperfect labeler. When T = 5, more information about y is available from z, so the credible regions become smaller, and the conditional means become much closer to the actual operating points for all classiﬁers. The ML-optimal classiﬁer outperforms the ones that used suboptimal training, and from the posteriors, one could reasonably expect it to do so. It also happens to outperform the ideal classiﬁer. The MMSE-optimal classiﬁer performs comparably to the label-estimation classiﬁer. For T = 9, the training methods provide similar estimated precisions, with the MLoptimal classiﬁer achieving greater recall. However, the posteriors overlap substantially, so one could not claim this without access to the correct labels. The ML-optimal classiﬁer again slightly outperforms the ideal one. The MMSE-optimal and voting-trained classiﬁer have identical performance. By the time T = 13, the posteriors and credible regions are much tighter, and the conditional means give quite accurate estimates of the actual operating points. The MLoptimal classiﬁer and the classiﬁer trained with label estimation slightly outperform voting training, and their actual operating points coincide with that of the ideal classiﬁer. For this choice of the threshold τ, the MMSE-optimal classiﬁer has the lowest recall. In the next sections, the threshold is varied, and the MMSE-optimal classiﬁer is seen to have competitive performance.

5.4.2 Performance Curves

It is also common practice to sweep the threshold τ over its range of possible values to obtain ROC or P-R curves. Figure 20 displays estimated ROC curves, which are the conditional

Number of Labelers = 1 Testing: Precision-Recall Analysis

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall

Number of Labelers = 5 Testing: Precision-Recall Analysis

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall

Number of Labelers = 9 Testing: Precision-Recall Analysis

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall

Number of Labelers = 13 Testing: Precision-Recall Analysis

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall

Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region Label-est.: Est. posterior Label-est.: Performance against correct labels Label-est.: Cond. mean from est. posterior Label-est.: 95%-credible region Voting among labelers: Est. posterior Voting among labelers: Performance against correct labels Voting among labelers: Cond. mean from est. posterior Voting among labelers: 95%-credible region

Figure 19: Training example with diﬀerent numbers of labelers T: Estimated testing results from MMSE testing with Algorithm 1 on the held-out testing set. The chance line appears as a black dotted line.

On Truthing Issues in Supervised Classification

means of (PD, PFA) from MMSE testing using Algorithm 1, for the diﬀerent training methods and numbers of labelers. The estimated performance improves as T increases, although it does not change much beyond T = 5. When T = 1, it is diﬃcult to estimate performance, which results in the jagged estimated ROC curves. As T increases, the estimated ROC curves begin to display the stepwise shape of the ideal ROC curve.

For each value of T, all of the training methods perform comparably. This behavior suggests that the regularized suboptimal training methods oﬀer practical alternatives to training methods that are optimal according to the uniﬁed view. This observation differs from the experiments on testing from Sections 5.2.1, 5.2.4, and 5.2.5, where MMSE testing (Algorithms 1 and 2) outperformed the method of estimating the correct labels (Algorithm 3). We posit that this diﬀerence reﬂects the inherently diﬀerent goals of training and testing and the presence or absence of regularization. The goal of training is to learn a predictive model that generalizes well to out-of-sample data beyond the training set {x, z, ψ, π}. Therefore, training includes regularization, which helps a suboptimal training method compensate for its inferior estimation ability. In contrast, the goal of testing is to obtain the in-sample metrics for the testing set {ˆy, z, ψ, π}. Regularization is not called for in this case, and MMSE testing can provide much better estimation performance over suboptimal testing methods.

Figure 21 displays the actual ROC curves calculated against the correct labels. These curves would not be available in practice if truthing issues are present. They show that the training methods can achieve performance similar to the ideal case, and a comparison with Figure 20 shows that the estimated ROC curves are reasonably good even when T = 1, and they are very good for T = 9 or 13.

5.4.3 Performance Curve Posteriors

MMSE testing allows us to estimate the joint posterior of (PD, PFA) or (PREC, REC). We can average the posteriors over all threshold values to obtain the posterior of a ROC or P-R curve an important capability when truthing issues are present. Figure 22 shows the posteriors of the ROC curves for the ML-optimal classiﬁer as T varies. Posterior values over the range [10 3, 104] are shown using heat maps with a base-10 logarithmic color scale. The ﬁgure also displays the estimated and actual ROC curves for this training method. We could also compute credible regions for the curves, but we do not show them to avoid cluttering the ﬁgure. Figure 19 already demonstrated the good containment of the credible regions.

Similarly, Figure 23 shows P-R curve posteriors for the MMSE-optimal classiﬁer. When τ > 1, the classiﬁer predicts ˆyi = 0, i, so recall is zero but precision is undeﬁned, and none of the curves show a point when rec = 0.

The ﬁgures also reveal another beneﬁt of MMSE testing. Although actual performance is similar to the ideal case when T = 1 or 5, the estimated curves display some large deviations from the actual ones, and the posteriors have broad support, which indicates substantial uncertainty about performance. For T = 9 or 13, the estimated curves coincide closely with the actual ones, and the posteriors have more concentrated support, which reﬂects less uncertainty about performance. In the presence of truthing issues, the actual

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 1 Testing: ROC Analysis (Conditional Means)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 5 Testing: ROC Analysis (Conditional Means)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 9 Testing: ROC Analysis (Conditional Means)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 13 Testing: ROC Analysis (Conditional Means)

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior ML training: Cond. mean from est. posterior Label-est.: Cond. mean from est. posterior Voting among labelers: Cond. mean from est. posterior

Figure 20: Training example with diﬀerent numbers of labelers T: Estimated ROC curves from MMSE testing with Algorithm 1 on the held-out testing set. The chance line appears as a black dotted line.

On Truthing Issues in Supervised Classification

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 1 Testing: ROC Analysis (vs. Correct Labels)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 5 Testing: ROC Analysis (vs. Correct Labels)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 9 Testing: ROC Analysis (vs. Correct Labels)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 13 Testing: ROC Analysis (vs. Correct Labels)

Ideal (correct labels known during training and testing) MMSE training: Performance against correct labels ML training: Performance against correct labels Label-est.: Performance against correct labels Voting among labelers: Performance against correct labels

Figure 21: Training example with diﬀerent numbers of labelers T: Actual ROC curves on the held-out testing set. The chance line appears as a black dotted line.

Number of Labelers = 1 Testing: ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 5 Testing: ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 9 Testing: ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Number of Labelers = 13 Testing: ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels

Figure 22: Training example with diﬀerent numbers of labelers T for ML-trained classiﬁers: Estimated ROC curve posteriors from Algorithm 1 on the held-out testing set. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line.

performance curve will not be available, but the support of the performance curve posterior can help one understand the potential variability in performance that might occur.

5.5 Examples of Equivalent Mutual Information

In Section 4, the use of mutual information as a basis for comparing diﬀerent combinations of labelers implied that multiple mediocre labelers could be as informative as a single expert labeler. To check the implication, we revisited the Ionosphere data set and simulated noisy labels with the model of (57), both for a single good labeler with ε = 0.05 and for nine mediocre labelers, each with ε = 0.25. Every labeler provided a noisy label for every sample. Then I(Z; Y |ε = 0.05) = 0.667 bits, and I(Z; Y |T = 9, ε = 0.25) = 0.758 bits. As in Section 5.4, we trained ML-optimal and MMSE-optimal classiﬁers using regularized logistic regression and estimated the testing metric RVs with Algorithm 1 for MMSE testing.

On Truthing Issues in Supervised Classification

Number of Labelers = 1 Testing: Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Number of Labelers = 5 Testing: Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Number of Labelers = 9 Testing: Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Number of Labelers = 13 Testing: Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels

Figure 23: Training example with diﬀerent numbers of labelers T for MMSE-trained classiﬁers: Estimated P-R curve posteriors from Algorithm 1 on the held-out testing set. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line.

Figure 24 shows ROC analysis plots for the held-out testing set. The results show that the classiﬁer trained with the mediocre labelers noisy labels performs better than the one trained with the single good labeler s noisy labels, which conﬁrms the implication. As an extreme example, we simulated a single expert labeler with ε = 0.01 and 399 poor labelers, each with ε = 0.45, so I(Z; Y |ε = 0.01) = 0.863 bits and I(Z; Y |T = 399, ε = 0.45) = 0.859 bits. The estimated P-R curve posteriors appear in Figure 25; when τ > 1, both classiﬁers predict ˆyi = 0, i, so recall is zero but precision is undeﬁned, and no points for rec = 0 are plotted. The curves indicate comparable performance and again conﬁrm the implication.

6. Summary, Conclusions, and Future Directions

In supervised classiﬁcation, a number of truthing issues may arise: noisy labels; missing labels; multiple, conﬂicting labels for the same sample; and diﬀerent combinations of labelers for diﬀerent samples. This situation involves three components, each of which requires a model: truthing warrants a noisy-label model, training learns a predictive model, and testing calls for a testing model. We did not study the problem of formulating and learning a noisylabel model, which is the subject of much of the related work. Instead, we concentrated on testing and training, and we began by assuming that a good noisy-label model p(z|y, ψ)π(y) was available, which makes our work compatible with and complementary to the related work. Our methods support models with dependent labelers.

6.1 Summary and Conclusions

By applying principles from Bayesian estimation theory, we succeeded in obtaining some promising and insightful answers to the questions posed in the introduction.

1. How can one test a classiﬁer in the presence of truthing issues?

Given noisy labels z, predicted labels ˆy, noisy-label model parameters ψ, and class prior π, we developed testing methods that are optimal: they estimate the metric RVs rather than the correct-label RVs, they fully exploit all available information, and they optimize a well-deﬁned criterion (MMSE). Our approach is completely separate from training and applicable beyond the realm of machine learning. For example, it could be used to reconcile diagnoses made by clinicians or categorizations assigned by scientists.

To arrive at the methods, we proposed a novel testing model (7), and we used it to derive approximate marginal posteriors for several scalar metrics, as well as joint posteriors for ROC and P-R analysis. We then introduced MMSE testing and developed empirical Bayes algorithms (Algorithms 1 and 2) for iteratively ﬁnding the MMSE estimate of the testing-model parameters from {z, ˆy, ψ, π}. After estimating the parameters, we calculated Bayesian optimal estimates (MMSE or MAP point estimates, or credible regions) of the metric RVs. Finally, we extended the approach to multi-class classiﬁcation.

In our experiments, MMSE testing provided excellent estimates of many binaryclassiﬁcation metrics. Their estimation errors were an order of magnitude smaller

On Truthing Issues in Supervised Classification

Single Good Labeler (Error Prob. = 0.05) Testing: ROC Analysis

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region

Nine Mediocre Labelers (Error Prob. = 0.25) Testing: ROC Analysis

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region

Single Good Labeler (Error Prob. = 0.05) ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels

Nine Mediocre Labelers (Error Prob. = 0.25) ROC Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels

Single Good Labeler (Error Prob. = 0.05) ROC Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels

Nine Mediocre Labelers (Error Prob. = 0.25) ROC Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm

Prob. of Detection

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels

Figure 24: ROC analysis for a single good labeler (left) and multiple mediocre labelers (right). The upper graphs show performance for the single threshold τ = 1/2. The middle graphs show the estimated ROC curve posteriors for ML-trained classiﬁers. The lower graphs show the estimated ROC curve posteriors for MMSE-trained classiﬁers. The posteriors appear as heat maps with a base10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line.

Single Expert Labeler (Error Prob. = 0.01) Precision-Recall Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Recall

Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels

399 Poor Labelers (Error Prob. = 0.45) Precision-Recall Analysis (ML Training)

0 0.2 0.4 0.6 0.8 1 Recall

Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels

Single Expert Labeler (Error Prob. = 0.01) Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels

399 Poor Labelers (Error Prob. = 0.45) Precision-Recall Analysis (MMSE Training)

0 0.2 0.4 0.6 0.8 1 Recall

Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels

Figure 25: Estimated P-R curve posteriors for a single expert labeler (left) and many poor labelers (right). The upper graphs show results for ML-trained classiﬁers; the lower graphs show results for MMSE-trained classiﬁers. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line.

On Truthing Issues in Supervised Classification

than those for suboptimal methods like estimating the correct labels or averaging the metrics from individual labelers. For multi-class classiﬁcation, MMSE testing outperformed the suboptimal methods at estimating accuracy and individual elements of the confusion matrix.

2. How can one train a classiﬁer in the presence of truthing issues?

With z, ψ, π, and the feature vectors x given, we presented a uniﬁed view of training that is elegant and intuitive. The uniﬁed view explains how to train a broad range of classiﬁers, and it organizes some of the related work. Each one of our training approaches is optimal: it employs an appropriate likelihood function, posterior, or estimator; it fully exploits the available information; and it optimizes a well-deﬁned penalty or utility criterion. None of the approaches estimates the correct labels.

For probabilistic (i.e., generative or discriminative) models with parameters θ, we showed that the optimization principle from ideal training can be retained. In ML training, the likelihood function p(y|x; θ) or p(y, x; θ) is replaced with p(z|x; ψ, θ) or p(z, x; ψ, θ), respectively. In MAP training, the posterior p(θ|y, x) is replaced by p(θ|z, x; ψ). For non-probabilistic models, we proposed MMSE training, which retains the original loss function and minimizes the in-sample MMSE estimate of the empirical-risk RV. Some related work has proposed the same form of training but did not use estimation theory to motivate it. We discussed properties of the MMSE estimator and provided a condition for when the MMSE estimator is a consistent estimator.

Experiments using binary logistic regression demonstrated the eﬀectiveness of our training approach, as well as the competitiveness of some suboptimal methods, like estimating the correct labels or voting among labelers, which are compatible with existing machine-learning infrastructure. We reasoned that regularization helps the suboptimal methods compensate for their estimation deﬁciencies. The experiments employed our testing methods, thus demonstrating the feasibility of training and testing with noisy labels only. Moreover, they showed that our testing methods can provide approximate posteriors and optimal estimates of ROC and P-R curves.

3. How can one compare diﬀerent combinations of labelers with diﬀerent abilities?

The noisy-labeling process can be viewed as a broadcast channel, so mutual information quantiﬁes the amount of information about the correct label conveyed by a group of labelers, and it facilitates comparison between diﬀerent combinations of labelers. As another basis for comparison, any combination of labelers can be represented as an equivalent single labeler with some corresponding error probability. This observation implies that multiple mediocre labelers can convey information greater than or equal to that from a single expert labeler.

The preceding statement is theoretical; it does not explain how to extract this information in practice. Fortunately, our training and testing methods provide a way to do so. The work in this paper culminated in training and testing experiments that conﬁrmed the implication and showed that our methods can realize its beneﬁts.

6.2 Expanded Workﬂow

Truthing issues are a reality in many applications. To address them, we advocate an expanded workﬂow that combines this work and the complementary related work. Training must learn two models: a noisy-label model and the desired predictive model. The models can be learned separately, using methods like those in Section 1.2.1 to learn the noisy-label model and then using the training techniques from Section 3 to learn the predictive model. Alternatively, the models can be learned jointly, using techniques like those by Raykar et al. (2010), Khetan et al. (2018), or Tanno et al. (2019). After both models have been learned, the predictive model can be tested using Algorithm 1, 2, or 4 from MMSE testing.

6.3 Future Directions

We close with a discussion of areas for future work.

6.3.1 Directly-Related Topics

A number of aspects of MMSE testing merit further study. First, Section 2.2 applied the CLT to the common RVs U and V , and subsequent manipulations produced the approximate posteriors of the metric RVs in Section 2.3; it would be useful to consider the case when one or both of the summations in (8) and (9) contain an insuﬃcient number of terms to justify the CLT. Second, it would be fulﬁlling to obtain an expression for the joint posterior (17) of (PD, PFA) along the chance line p D = p FA. Third, closed-form expressions for the maximum of (17) and (18) would simplify MAP estimation of the ROC and P-R operating points. Fourth, Section 2.4.4 discussed convergence of the empirical Bayes methods, but more detailed study is called for. Finally, Section 2.7 examined multi-class classiﬁcation, and it considered accuracy and individual elements of the confusion matrix. One could investigate other multi-class classiﬁcation metrics, the joint distribution of the confusion matrix, the eﬀect of a highly imbalanced class prior, and tactics when the number of classes is large.

Regarding training of a probabilistic predictive model, Section 3.3 showed that ML or MAP training could be modiﬁed for noisy labels. One could apply the forms given in Table 9 to adapt training from the ideal case to the case of truthing issues.

For MMSE training of non-probabilistic predictive models, Section 3.4.3 explained that its gradient is amenable to automatic diﬀerentiation (cf. (50)), so it would be exciting to see MMSE training applied to deep neural networks. Section 3.4.5 considered consistency of the MMSE estimator of the empirical-risk RV and gave the simplest of suﬃcient conditions. One could derive other conditions or bounds on the estimation error. One could also investigate conditions under which MMSE training preserves consistency of the ERM principle. The work of Khetan et al. (2018), Cid-Sueiro (2012), and Cid-Sueiro et al. (2014) could be useful in this regard.

The information-theoretic view was illustrated with the BSBC in Section 4.1. One could examine an asymmetric channel (i.e., a channel with diﬀerent miss and false-alarm probabilities), a channel with dependent labelers, or a multi-class channel.

On Truthing Issues in Supervised Classification

6.3.2 Learning and Labeler Allocation

This paper has assumed that the noisy-label model is known, but in many cases it must be learned. This requirement creates a variety of allocation problems. As an example of learning allocation, suppose that one has a small set {x , y , z } with both correct and noisy labels and a much larger set {x, z} with only noisy labels. How should one partition the sets for learning the noisy-label model, training the predictive model, and testing the learned predictive model? Two examples of labeler allocation follow. Given a labeling budget and costs for labelers with diﬀerent abilities, what is the most cost-eﬀective way to acquire labels?22 Similarly, given a set of samples with diﬀerent labeling diﬃculties (e.g., images under a variety of lighting conditions), how should the labeling eﬀort be distributed across the set?

6.3.3 Extension to Weak Supervision

The ideas in the related work and this paper could be adapted to other forms of supervised learning that involve noisy annotation, also known as weak supervision. Per Section 1.4, it would require three models: the usual predictive model, a noisy-annotation model for the imperfect annotation process, and a testing model that relates the predictions and noisy annotations. The noisy-annotation model is like the noisy-label model p(z|y)π(y) and provides estimated probabilities. It should generalize to unseen, out-of-sample realizations of (Z, Y ), so it will be learned with machine-learning methods. The testing model is analogous to p(ˆy, z|y) = p(ˆy|y)p(z|y) in (7); it characterizes in-sample performance on the testing set, so it will be learned with estimation-theoretic methods. For training, many aspects of MMSE training are likely applicable to weak supervision. Section 3.4 and Appendix E.4 allow for a generic loss function and noisy-annotation model. If the annotation set Y is continuous rather than ﬁnite, then the summations over Y must be replaced by integrals (cf. (37), (40), (41), (43), (50)), and techniques for computing or approximating the integrals will be needed. For testing of the predictive model, work tailored to the particular metrics will be required; the MMSE testing approach can provide a general strategy. Testing should follow the principle of estimating the in-sample metric RV rather than the correct annotation. First, one should identify a suitable testing model and parameters23 (cf. Section 2.1) and express the metrics as RVs (Section 2.2). Second, one should obtain the posteriors of the parameters and metric RVs (Section 2.3); if the samples are independent, then the CLT may be helpful. Third, one should develop algorithms for estimating the parameters (Section 2.4). Finally, once the parameters have been estimated, the posteriors of the metric RVs can be used to ﬁnd optimal estimates of the metrics (Section 2.5).

Acknowledgments

22. See Sheng et al. (2008), for example. 23. For binary classiﬁcation, the OP parameters ( p D, p FA) corresponded to probability of detection and probability of false alarm, which are standard metrics, but multi-class classiﬁcation introduced the conditional confusion matrix, which is not commonly used.

The author gratefully thanks Rajmonda Caceres, John Holodnak, Jason Matterer, and Anna Yanchenko for their helpful comments and interest in this work, as well as Paul Monticciolo, Vijay Gadepally, Tim Dasey, and Jason Thornton for enabling this work. The author also thanks the Action Editors and anonymous reviewers for their careful attention, thoughtful questions, and helpful suggestions.

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001.

Appendix A. Metrics in Terms of Common RVs

For probability of false alarm, its RV form is

1 N PN i=1 1(ˆyi = 1 and Yi = 0)

1 N PN i=1 1(Yi = 0) .

The numerator is

i=1 1(ˆyi = 1 and Yi = 0) = 1

i:ˆyi=1 1(Yi = 0)

i:ˆyi=1 (1 1(Yi = 1))

i:ˆyi=1 1 1

i:ˆyi=1 1(Yi = 1)

and the denominator is

i=1 1(Yi = 0) = 1

i=1 (1 1(Yi = 1))

i=1 1(Yi = 1)

= 1 (U + V ).

On Truthing Issues in Supervised Classification

The RV form of accuracy is

i=1 1(Yi = ˆyi)

i:ˆyi=0 1(Yi = 0) + 1

i:ˆyi=1 1(Yi = 1)

i:ˆyi=0 (1 1(Yi = 1)) + U

N (N ˆN1) V + U

= U V ˆN1/N + 1.

The RV form of precision is

1 N PN i=1 1(ˆyi = 1 and Yi = 1)

1 N PN i=1 1(ˆyi = 1)

1 N P i:ˆyi=1 1(Yi = 1)

Fβ is obtained by taking

Fβ = (1 + β2) PREC REC β2PREC + REC ,

substituting the expressions for PREC and REC in Table 8, and performing a little algebra.

Appendix B. Ratios of Jointly Normal Random Variables

This section summarizes the procedure from Marsaglia (2006, 1965) for calculating the distribution or approximating the moments of the ratio of jointly normal RVs Z and W . Let E[Z ] = µZ , var(Z ) = σ2 Z , E[W ] = µW , var(W ) = σ2 W , and ρ = cov(Z , W )/σZ σW . The density of Z /W is obtained as follows:

1. Let h = σZ p

1 ρ2; the sign of h will be determined momentarily. Also let r = σW /h, s = ρσZ /σW , b = µW /σW , and a = (µZ sµW )/h.

2. Choose the sign of h so that a and b have the same sign. Then the RV T = r(Z /W s) can be expressed in the form (a + X )/(b + Y ), where X and Y are independent N(0, 1) RVs. The density of T is

p(t ) = e (a2+b2)/2

1 + q(t )e(q(t ))2/2 Z q(t )

0 e x2/2 dx,

where q(t ) = (b + at )/ q

(1 + t 2), and the integral can be calculated using the error

function erf(y) = (2/ π) R y 0 e τ 2 dτ.

Symbol PD, REC PFA Fβ

Z U ( ˆN1/N) U (1 + β2)U

W U + V 1 (U + V ) β2(U + V ) + ˆN1/N

µZ µU ( ˆN1/N) µU (1 + β2)µU

σ2 Z σ2 U σ2 U (1 + β2)2σ2 U µW µU + µV 1 (µU + µV ) β2(µU + µV ) + ˆN1/N

σ2 W σ2 U + σ2 V σ2 U + σ2 V β4(σ2 U + σ2 V )

cov(Z , W ) σ2 U σ2 U β2(1 + β2)σ2 U ρ σU

Table 16: Parameters for scalar metric RVs with posteriors equal to the ratio Z /W of jointly approximately normal RVs Z and W .

3. It follows that Z /W = T /r + s, so p Z /W (ζ) = |r| p T (r(ζ s)).

The density p(t ) includes the standard Cauchy density 1/ π(1 + t 2) , so technically, the moments of T of order greater than zero do not exist; i.e., R t ip(t ) dt is inﬁnite for i {1, 2, . . .}. However, Marsaglia (2006, 4) points out that, in practice, one might be able to assume that the denominator b + Y approaches zero with negligible probability, enabling one to compute the moments (conditioned on this assumption). For example, if b = 4, then Pr(b + Y 0) = Pr(Y 4) 3.17 10 5. Marsaglia reports that, conditioned on b > 4 (or Y > 4), the mean and variance of T are approximately µT = a/(1.01 b 0.2713) and σ2 T = (a2 + 1)/(b2 + 0.108 b 3.795) µ2 T . Finally, Marsaglia also observes that, if a < 2.256 and b > 4, then T can be reasonably approximated by a normal distribution.

Appendix C. Review of MMSE Estimation

Let A be the unobserved RV, B be the observed RV, and let f(A) = f1(A) f D(A) T

be a D-dimensional vector of scalar functions of A, with E[f2 j (A)] < , j = 1, . . . , D.

Deﬁne an estimator of f(A) from B as h(B) = h1(B) h D(B) T, and deﬁne the MSE of h(B) by mse h(B), f(A) = PD j=1 E hj(B) fj(A) 2 . The goal is to ﬁnd the MMSE estimator h MMSE = arg minh mse h(B), f(A) . For Section 2.4.1, A = (PD, PFA), B = ( ˆY , Z), and f(A) is the identity function f(A) A. The estimators are h1(B) = h D ˆY , Z, ψ, p(j 1) D , p(j 1) FA and h2(B) = h FA ˆY ,

Z, ψ, p(j 1) D , p(j 1) FA , where ψ, p(j 1) D , and p(j 1) FA are non-random parameters. In Section 2.6.2 on fully Bayesian estimation of a metric RV like accuracy, A = ACC, B = ( ˆY , Z), f(A) is the identity function, and ψ is a non-random parameter. For estimation of the empirical-risk RV in Section 3.4.1 and Appendix E.4, A = Y , B = Z, f(A) = Jpri(θ; x, Y ) = J(Y ), and an estimator is h(B) = ˆJ(θ; x, Z, ψ, π) = ˆJ(Z). For estimating

On Truthing Issues in Supervised Classification

the ith loss-function RV in Appendix E.4, A = Yi, B = Zi, f(A) = L( g(xi; θ), Yi) = ℓi(Yi), and an estimator is h(Zi) = ˆℓi(xi, θ, Zi) = ˆℓi(Zi).

C.1 Standard Results

1. The MMSE estimator is h MMSE(B) = E[f(A)|B], or h MMSE d (B) = E[fd(A)|B], d = 1, . . . , D; that is, the MMSE estimator is the conditional mean. The MMSE estimator is an RV because it is a function of B. Given B = b, the MMSE estimate is the nonrandom quantity h MMSE(b) = E[f(A)|B = b], or h MMSE d (b) = E[fd(A)|B = b], d = 1, . . . , D. See (Van Trees, 1968, 2.4.1), (Papoulis, 1991, 7-5, 8-3), (Kay, 1993, 10.3), (Kamen and Su, 1999, Theorems 3.1, 3.4), (Oppenheim and Verghese, 2015, 8.1, 8.2).

2. The MMSE estimator is unbiased: E[h MMSE(B)] = E[f(A)]; see (Papoulis, 1991, 7-4), (Kay, 1993, 11.6), (Kamen and Su, 1999, 3.3).

3. The estimation error h MMSE(B) f(A) has the following properties.

(a) The mean is zero: E h MMSE(B) f(A) = 0 from Result 2. (b) The total variance equals the MMSE, which is the MSE of the MMSE estimator: PD j=1 var h MMSE j (B) fj(A) = mse h MMSE(B), f(A) ; see (Kay, 1993, 11.6). (c) (Orthogonality principle) The error is orthogonal to every function w(B) = w1(B) w D(B) T: E h MMSE(B) f(A) Tw(B) = 0; see (Papoulis, 1991, 8-3), (Kamen and Su, 1999, Theorems 3.2, 3.3), (Oppenheim and Verghese, 2015, 8.2.1).

4. The MMSE estimator can be related to the law of total variance, which states that, for any RV G with ﬁnite variance, var(G) = E[var(G|B)]+var(E[G|B]), and E[var(G|B)] and var(E[G|B]) are the unexplained and explained variances of G, respectively; see (Blitzstein and Hwang, 2019, 9.5).

(a) The MMSE is equal to the total unexplained variance of f(A): mse h MMSE(B), f(A) = PD j=1 E[var(fj(A)|B)]; see (Van Trees, 1968, 2.4), (Kay, 1993, 10.4). (b) The total variance of the MMSE estimator equals the total explained variance of f(A): PD j=1 var h MMSE j (B) = PD j=1 var( E[fj(A)|B] ), which means the MMSE estimator accounts for as much of the variation of f(A) as possible given B. (c) Hence, the total variance of f(A) equals the sum of the MMSE (the unexplained variance) and the variance of the MMSE estimator (the explained variance):

j=1 var(fj(A)) = mse(h MMSE(B), f(A)) | {z } PD j=1 E[ var(fj(A)|B) ]

j=1 var(h MMSE j (B))

| {z } PD j=1 var( E[fj(A)|B] )

5. Results 2 through 4 apply to the MMSE estimator. For the MMSE estimate given B = b, the MSE equals the sum of the conditional variances of the functions being estimated: mse h MMSE(B), f(A) B = b = PD j=1 var(fj(A)|B = b); see (Oppenheim and Verghese, 2015, 8.1).

C.2 Proofs and Derivations of Standard Results

The derivation of Result 1 assumes B is continuous but is easily modiﬁed if B is discrete. Expand the MSE as

mse h(B), f(A) =

E[h2 j(B)] 2E[hj(B)fj(A)] + E[f2 j (A)]

E[h2 j(B)] 2E E[hj(B)fj(A)|B] + E[f2 j (A)]

Z p(b)h2 j(b) db 2 Z p(b)E[hj(B)fj(A)|B = b] db + E[f2 j (A)]

Z p(b)h2 j(b) db 2 Z p(b)hj(b)E[fj(A)|B = b] db + E[f2 j (A)] ,

where (a) is from iterated expectations, and (b) is because hj(B) is non-random and equal to hj(b) when conditioned on B = b. For d = 1, . . . , D, take the partial derivative of the above equation with respect to hd, set it equal to zero, and solve for hd. Doing so gives

hd = 2 Z p(b)hd(b) db 2 Z p(b)E[fd(A)|B = b] db, d = 1, . . . , D,

= 2 Z p(b) hd(b) E[fd(A)|B = b] db, d = 1, . . . , D.

Since p(b) is a probability distribution, this expression is zero when hd(b) = E[fd(A)|B = b] for any valid b, which is just the deﬁnition of the conditional mean. Also, 2mse/ h2 d = 2 R p(b) db = 2 > 0, d = 1, . . . , D, so the solution is the unique minimum. Thus, the MMSE estimator is h MMSE(B) = E[f(A)|B], or h MMSE d (B) = E[fd(A)|B], d = 1, . . . , D; and the MMSE estimate given B = b is h MMSE(b) = E[f(A)|B = b], or h MMSE d (b) = E[fd(A)|B = b], d = 1, . . . , D. This proves Result 1.

For Result 2, write E[h MMSE(B)] = E E[f(A)|B] (a) = E[f(A)], where (a) applies iterated expectations. For Result 3a, unbiasedness means E h MMSE(B) f(A) = 0. For Result 3b, the preceding relation means PD j=1 var h MMSE j (B) fj(A) = PD j=1 E h MMSE j (B) fj(A) 2 = mse h MMSE(B), f(A) . To obtain Result 3c, for any valid j and b, write E (h MMSE j (B) fj(A))wj(B) B = b = h MMSE j (b)wj(b) E[fj(A)|B = b]wj(b) = E[fj(A)|B = b]wj(b)

E[fj(A)|B = b]wj(b) = 0. Then E h MMSE(B) f(A) Tw(B) = PD j=1 E h MMSE j (B) fj(A) wj(B) = PD j=1 E E h MMSE j (B) fj(A) wj(B) B = PD j=1 R E h MMSE j (B) fj(A) wj(B) B = b p(b) db = PD j=1 R 0 p(b) db = 0.

For Result 4a, write mse h MMSE(B), f(A) (a) = PD j=1 E E h MMSE j (B) fj(A) 2 B = PD j=1 E E fj(A) E[fj(A)|B] 2 B (b) = PD j=1 E[ var(fj(A)|B) ], where (a) uses iterated expectations and (b) applies the deﬁnition of conditional variance. Result 4b is because

On Truthing Issues in Supervised Classification

h MMSE j (B) = E[fj(A)|B], so their variances are equal. Applying the law of total variance to each term in PD j=1 var(fj(A)) yields Result 4c.

For Result 5, conditioning the MSE on B = b gives mse h MMSE(B), f(A) B = b = PD j=1 E fj(A) E[fj(A)|B] 2 B = b = PD j=1 E fj(A) E[fj(A)|B = b] 2 B = b (a) = PD j=1 var(fj(A)|B = b), where (a) uses the deﬁnition of conditional variance given B = b.

Appendix D. Review of MPE and MAP Estimation

Additional derivations appear in Van Trees (1968, 2.4.1) and Kay (1998, Ch. 3). Let Y be a ﬁnite RV with domain Y, and let B be the observed, discrete RV. Denote an estimator of Y given B as h(B). The probability of error of h(B) is perror(h(B), Y ) = Pr(h(B) = Y ) = Ep(y,b)[1(h(B) = Y )]. The goal is to ﬁnd the MPE estimator, namely h MPE = arg minh perror(h(B), Y ).

For Section 2.6.1, Y = Yi, and B = ( ˆYi, Zi) with additional conditioning parameters ψi, p(j 1) D , p(j 1) FA . For Section 3.7, Y = Yi, and B = Zi with additional parameter ψi.

Write perror(h(B), Y ) = E[1 1(h(B)=Y )] = 1 E E[1(h(B)=Y ) | B] = 1 P b p(b) E[1(h(B) = Y ) | B = b]. The last form is minimized by maximizing the conditional expectation for each b. Then E[1(h(B)=Y ) | B =b] = P y Y p(y|b)1(h(b)=y) = p Y |B(h(b) | b), and the maximum occurs when h(b) is the MAP estimate of Y given B = b; i.e., when h(b) = h MAP(b) = arg maxy Y p(y|b). Deﬁne the MAP estimator h MAP(B) as the function of B that returns h MAP(b) when B = b; then the MPE estimator is h MPE(B) = h MAP(B).

Appendix E. Training with Truthing Issues

E.1 Discriminative Model, Random Parameters

Ideal case: Find θ to maximize the posterior p(θ|y, x), given by p(θ|y, x) (a) p(y|x, θ)

p(θ|x) (b) = p(θ) QN i=1 p(yi|xi, θ), where (a) uses Bayes rule, and (b) is from (29).

Truthing issues: Find θ to maximize p(θ|z, x; ψ), and write p(θ|z, x; ψ) (a) p(z|x, θ; ψ)

p(θ|x; ψ) (b) = p(θ) QN i=1 P yi Y p(zi|yi; ψi)p(yi|xi, θ), where (a) uses Bayes rule, and (b) applies (31).

E.2 Generative Model, Non-Random Parameters

Ideal case: Find θ to maximize the likelihood function p(y, x; θ), which can be written as p(y, x; θ) = p(x|y; θ)p(y) = QN i=1 p(xi|yi; θ)π(yi).

Truthing issues: Instead, ﬁnd θ to maximize p(z, x; ψ, θ) = QN i=1 p(zi, xi; ψi, θ) (a) = QN i=1 P yi Y p(zi, yi, xi; ψi, θ) = QN i=1 P yi Y p(xi | zi, yi; ψi, θ)p(zi | yi; ψi, θ)p(yi; ψi, θ) (b) = QN i=1 P yi Y p(xi|yi; θ)p(zi|yi; ψi)π(yi), where (a) uses marginalization and (b) applies nondependencies.

E.3 Generative Model, Random Parameters

Ideal case: Find θ to maximize the posterior p(θ|y, x); then p(θ|y, x) (a) p(y, x|θ)p(θ) (b) = p(θ) QN i=1 p(xi|yi, θ)π(yi), where (a) uses Bayes rule and (b) substitutes the expression for p(y, x|θ) from the ideal case in Appendix E.2.

Truthing issues: Find θ to maximize p(θ|z, x; ψ). Write p(θ|z, x; ψ) (a) p(z, x|θ; ψ)p(θ)

(b) = p(θ) QN i=1 P yi Y p(xi|yi, θ)p(zi|yi; ψi)π(yi), where (a) applies Bayes rule and (b) obtains p(z, x|θ; ψ) in the same way that Appendix E.2 obtained p(z, x; ψ, θ).

E.4 Non-Probabilistic Models: MMSE Estimation of Empirical-Risk RV

As shorthand, denote the ith loss-function RV by ℓi(Yi) = L( g(xi; θ), Yi) and the empiricalrisk RV by R(Y ) = R(θ; x, Y ) = N 1 PN i=1 ℓi(Yi), which is (34). Also let ˆR(Z) = ˆR(θ; x, Z) be an estimator of R(Y ) with mse( ˆR(Z), R(Y )) = E[( ˆR(Z) R(Y ))2]. The MMSE estimator is Jpri(θ; x, Z, ψ, π) = ˆRMMSE(Z) = arg min ˆR mse( ˆR(Z), R(Y )).

The standard result (see Appendix C) is that the MMSE estimator is the conditional mean of R(Y ) given Z: ˆRMMSE(Z) = E[R(Y )|Z], which is (35). Then ˆRMMSE(Z) = N 1 PN i=1 E[ℓi(Yi)|Z] = N 1 PN i=1 E[ℓi(Yi)|Zi] = N 1 PN i=1 ˆℓMMSE i (Zi), which is (36). The last expression is because, for an estimator ˆℓi(Zi) = ˆℓi(xi, θ, Zi) of the ith lossfunction RV ℓi(Yi), the MMSE estimator is ˆℓMMSE i (Zi) = arg minˆℓi mse(ˆℓi(Zi), ℓi(Yi)) =

arg minˆℓi E[(ˆℓi(Zi) ℓi(Yi)2] = E[ℓi(Yi)|Zi]. Thus, ˆRMMSE(Z) equals the average of the MMSE estimators of the individual loss-function RVs.

Likewise, the estimation error is ˆRMMSE(Z) R(Y ) = N 1 PN i=1 ˆℓMMSE i (Zi) ℓi(Yi) , the average of the estimation errors of the MMSE loss-function estimators. The estimation error of each of these estimators has E[ˆℓMMSE i (Zi) ℓi(Yi)] = 0 and var(ˆℓMMSE i (Zi) ℓi(Yi)) = mse(ˆℓMMSE i (Zi), ℓi(Yi)) = E[var(ℓi(Yi)|Zi)]. The samples are independent, so the estimation errors of the MMSE loss-function estimators are independent, too. By the CLT, the estimation error of ˆRMMSE(Z) converges in distribution to a normal RV with mean zero and variance var( ˆRMMSE(Z) R(Y )) = N 2 PN i=1 var(ˆℓMMSE i (Zi) ℓi(Yi)) = N 2 PN i=1 mse(ˆℓMMSE i (Zi), ℓi(Yi)) = N 2 PN i=1 E[var(ℓi(Yi)|Zi)], which is (47).

Appendix F. Regularized Logistic Regression Training Equations

Let the feature dimensionality be D, so θ has dimensionality D + 1. Deﬁne x T i = 1 x T i , and index xi and θ from zero rather than one. Logistic regression uses the model Yi|xi, θ B( g(xi; θ)), where g(xi; θ) = 1/(1 + e x T i θ), so the model assumes p(yi|xi; θ) = (1 g(xi; θ))1 yi g(xi; θ)yi. We use the L2-regularization term Jreg(θ) = 1 2N PD j=1 θ2 j; the intercept weight θ0 is not subject to regularization and is omitted from the summation.

On Truthing Issues in Supervised Classification

F.1 ML Training

From (26) and (30), ML training in the ideal case seeks θ to minimize

Jideal(θ; x, y) = 1

N log p(y|x; θ) | {z } Jpri(θ;x,y)

+λJreg(θ) = 1

i=1 log p(yi|xi; θ) + λ

and some algebra yields

Jideal(θ; x, y) = 1

(1 yi) log 1

1 + e+ x T i θ + yi log 1

1 + e x T i θ

j=1 θ2 j. (62)

For the gradient, the partial derivatives with respect to θj are

1 + e x T i θ yi

N θj1(j = 0), j = 0, 1, . . . , D. (63)

From (28) and (32), ML training with truthing issues means minimizing

JML(θ; x, z, ψ, π) = 1

N log p(z|x; ψ, θ) | {z } Jpri(θ;x,z,ψ,π)

yi=0 p(yi|xi; θ)p(zi|yi; ψi) + λ

yi=0 p(yi|xi; θ) Y

t:zi,t = p(zi,t|yi; δi, φt)

1 + e+ x T i θ Y

t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt)

1 + e x T i θ Y

t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt)

where (a) applies ψi = (δi, φ) along with (61), and (b) is because p Y (0|xi; θ) = 1 g(xi; θ) and p Y (1|xi; θ) = g(xi; θ). For the gradient, diﬀerentiation and some algebra yield

Q t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt) Q t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt) Q t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt) + e+ x T i θ Q t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt)

1 + e x T i θ xi(j) + λ

N θj1(j = 0), j = 0, 1, . . . , D.

F.2 MMSE Training

For ideal training, rewrite the ﬁrst term in (62) as the empirical risk (33) by setting s = g(x; θ) = 1/(1 + e x T i θ) and deﬁning L(s, y) = (1 y) log(1 s) y log s. Then

L( g(x; θ), y) = 1 1 + e x Tθ y x(j). (64)

For the gradient, plugging this equation into (48) and (49) again yields (63). For MMSE training with truthing issues, we apply (40) and minimize

JMMSE(θ; x, z, ψ, π) = 1

p Y |Z(0|zi; ψi) log 1

1 + e+ x T i θ

+ p Y |Z(1|zi; ψi) log 1

1 + e x T i θ

For the gradient, plugging (64) into (50) and simplifying yields

p Y |Z(0|zi; ψi) 1

1 + e x T i θ p Y |Z(1|zi; ψi) 1

1 + e+ x T i θ

N θj1(j = 0), j = 0, 1, . . . , D.

Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. CRC Press, Boca Raton, FL, USA, 2nd edition, 2019. doi: 10.1201/9780429428357.

Steve Branson, Grant Van Horn, and Pietro Perona. Lean crowdsourcing: Combining humans and machines in an online system. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 6109 6118, July 2017. doi: 10.1109/ CVPR.2017.647. URL https://openaccess.thecvf.com/content_cvpr_2017/html/ Branson_Lean_Crowdsourcing_Combining_CVPR_2017_paper.html.

Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199 231, August 2001. doi: 10.1214/ss/1009213726.

M. C. Burl, U. M. Fayyad, P. Perona, P. Smyth, and M. P. Burl. Automating the hunt for volcanoes on Venus. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 302 309, 1994. doi: 10.1109/CVPR.1994.323844.

Mark J. Carlotto. Eﬀect of errors in ground truth on classiﬁcation accuracy. Intl. J. Remote Sensing, 30(18):4831 4849, September 2009. doi: 10.1080/01431160802672864.

George Casella. Illustrating empirical Bayes methods. Chemometrics and Intelligent Laboratory Systems, 16(2):107 125, October 1992. doi: 10.1016/0169-7439(92)80050-E.

On Truthing Issues in Supervised Classification

Jes us Cid-Sueiro. Proper losses for learning from partial labels. In Neural Information Processing Systems (Neur IPS), pages 1565 1573. Curran Associates, Inc., December 2012. URL https://dl.acm.org/doi/10.5555/2999134.2999309.

Jes us Cid-Sueiro, Dar ıo Garc ıa-Garc ıa, and Ra ul Santos-Rodr ıguez. Consistency of losses for learning from weak labels. In Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes in Computer Science, pages 197 210. Springer, 2014. doi: 10.1007/978-3-662-44848-9 13.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, NY, USA, 1st edition, 1991. doi for 2nd edition: 10.1002/047174882X.

A. Philip Dawid and Allan M. Skene. Maximum likelihood estimation of observer errorrates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20 28, 1979. doi: 10.2307/2346806.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1 38, 1977. URL https://www.jstor.org/stable/ 2984875.

Pinar Donmez, Guy Lebanon, and Krishnakumar Balasubramanian. Unsupervised supervised learning I: Estimating classiﬁcation and regression errors without labels. J. Machine Learning Research (JMLR), 11(44):1323 1351, April 2010. URL https://www. jmlr.org/papers/v11/donmez10a.html.

Dheeru Dua and Casey Graﬀ. UCI machine learning repository. URL https://archive. ics.uci.edu/ml, 2017. URL visited Feb. 1, 2017.

Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In Machine Learning Challenges Workshop (MLCW), volume 3944 of Lecture Notes in Computer Science, pages 117 176. Springer, 2006. doi: 10.1007/11736790 8.

Benoˆıt Fr enay and Michel Verleysen. Classiﬁcation in the presence of label noise: A survey. IEEE Trans. Neural Networks and Learning Systems, 25(5):845 869, 2014. doi: 10.1109/ TNNLS.2013.2292894.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119 139, August 1997. doi: 10.1006/jcss.1997.1504.

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall-CRC, Boca Raton, FL, USA, 3rd edition, 2013. doi: 10.1201/b16018.

John T. Holodnak, Jason T. Matterer, and William W. Streilein. Estimating classiﬁer accuracy using noisy expert labels. Technical Report 1225, MIT Lincoln Laboratory, Lexington, MA, USA, January 2018.

Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In ACM Intl. Conf. Knowledge Discovery and Data Mining (SIGKDD), Workshop on Human Computation (HCOMP), pages 64 67, 2010. doi: 10.1145/1837885. 1837906.

Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy labels with dropout regularization. In IEEE Intl. Conf. Data Mining (ICDM), pages 967 972. IEEE, 2016. doi: 10.1109/ICDM.2016.0121.

Edward W. Kamen and Jonathan K. Su. Introduction to Optimal Estimation. Springer Verlag, London, UK, 1999. doi: 10.1007/978-1-4471-0417-9.

David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1 24, February 2014. doi: 10.1287/opre.2013.1235.

Steven M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1993.

Steven M. Kay. Fundamentals of Statistical Signal Processing: Detection Theory. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.

Ashish Khetan, Zachary C. Lipton, and Anima Anandkumar. Learning from noisy singlylabeled data. In Intl. Conf. Learning Representations (ICLR), December 2018. doi: 10.48550/ar Xiv.1712.04577. URL https://openreview.net/forum?id=H1s UHgb0Z.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, USA, 2009.

Chuck P. Lam and David G. Stork. Evaluating classiﬁers by means of test data with noisy labels. In Intl. Joint Conf. Artiﬁcial Intelligence (IJCAI), pages 513 518. Morgan Kaufmann, 2003. URL http://ijcai.org/Proceedings/03/Papers/076.pdf.

G abor Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79 87, January 1992. doi: 10.1016/0031-3203(92)90008-7.

George Marsaglia. Ratios of normal variables and ratios of sums of uniform variables. J. Amer. Statistical Assoc., 60(309):193 204, March 1965. doi: 10.2307/2283145.

George Marsaglia. Ratios of normal variables. J. Statistical Software, 16(4):1 10, May 2006. doi: 10.18637/jss.v016.i04.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Neural Information Processing Systems (Neur IPS), pages 1196 1204. Curran Associates Inc., December 2013. URL https://dl.acm.org/doi/10.5555/ 2999611.2999745.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Costsensitive learning with noisy labels. J. Machine Learning Research (JMLR), 18(155): 1 33, April 2018. URL https://jmlr.org/papers/v18/15-226.html.

On Truthing Issues in Supervised Classification

Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classiﬁers: A comparison of logistic regression and naive Bayes. In Neural Information Processing Systems (Neur IPS), pages 841 848. MIT Press, January 2001. URL https://dl.acm.org/doi/ 10.5555/2980539.2980648.

Curtis Northcutt, Lu Jiang, and Isaac Chuang. Conﬁdent learning: Estimating uncertainty in dataset labels. J. Artiﬁcial Intelligence Research (JAIR), 70:1373 1411, May 2021a. ISSN 1076-9757. URL https://doi.org/10.1613/jair.1.12125.

Curtis G. Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. In Neural Information Processing Systems (Neur IPS) Track on Datasets and Benchmarks. Curran Associates, Inc., December 2021b. URL https://doi.org/10.48550/ar Xiv.2103.14749.

Alan V. Oppenheim and George C. Verghese. Signals, Systems and Inference. Pearson Education, London, UK, 2015.

Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. Mc Graw Hill, New York, NY, USA, 3rd edition, 1991.

Emmanouil A. Platanios, Avinava Dubey, and Tom Mitchell. Estimating accuracy from unlabeled data: A Bayesian approach. In Intl. Conf. Machine Learning (ICML), pages 1416 1425, June 2016. URL https://proceedings.mlr.press/v48/platanios16.pdf.

Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R e. Snorkel: Rapid training data creation with weak supervision. In Intl. Conf. Very Large Data Bases (VLBD), volume 11(3), pages 269 282. VLDB Endowment, November 2017. doi: 10.14778/3157794.3157797.

Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. In Neural Information Processing Systems (Neur IPS), pages 3567 3575. Curran Associates, Inc., December 2016. URL https://dl.acm.org/doi/10.5555/3157382.3157497.

Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerado Hemosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Machine Learning Research (JMLR), 11(43):1297 1322, April 2010. URL https://jmlr.csail.mit.edu/papers/ v11/raykar10a.html.

Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classiﬁers on imbalanced datasets. Public Library of Science (PLOS) ONE, 10(3):1 21, March 2015. doi: 10.1371/journal.pone.0118432.

Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197 227, June 1990. doi: 10.1007/BF00116037.

Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In ACM Intl. Conf. Knowledge Discovery and Data Mining (SIGKDD), pages 614 -622, 2008. doi: 10.1145/1401890.1401965.

Padhraic Smyth, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of Venus images. In Neural Information Processing Systems (Neur IPS), pages 1085 1092. MIT Press, January 1994. URL https://dl.acm. org/doi/10.5555/2998687.2998822.

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. In Intl. Conf. Learning Representations (ICLR), April 2015. doi: 10.48550/ar Xiv.1406.2080.

Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 11236 11245, 2019. doi: 10.1109/CVPR.2019.01150.

Grant Van Horn, Steve Branson, Scott Loarie, Serge Belongie, and Pietro Perona. Lean multiclass crowdsourcing. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 2714 2723, 2018. doi: 10.1109/CVPR.2018.00287. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Van_Horn_Lean_ Multiclass_Crowdsourcing_CVPR_2018_paper.pdf.

Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. J. Machine Learning Research (JMLR), 18(228):1 50, July 2018. URL https: //jmlr.org/papers/v18/16-315.html.

Harry L. Van Trees. Detection, Estimation, and Modulation Theory. Part I: Detection, Estimation, and Linear Modulation Theory. John Wiley & Sons, New York, NY, USA, 1968. doi: 10.1002/0471221082.

Vladimir Vapnik. Principles of risk minimization for learning theory. In Neural Information Processing Systems (Neur IPS), pages 831 838. Morgan-Kaufmann, December 1991. URL https://dl.acm.org/doi/10.5555/2986916.2987018.

Yehuda Vardi and Cun-Hui Zhang. The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. USA (PNAS), 97(4):1423 1426, February 2000. doi: 10.1073/pnas. 97.4.1423.

Peter Welinder and Pietro Perona. Online crowdsourcing: Rating annotators and obtaining cost-eﬀective labels. In IEEE Computer Vision and Pattern Recognition (CVPR), Workshop on Advancing Computer Vision with Humans in the Loop (ACVHL), pages 25 32, June 2010. doi: 10.1109/CVPRW.2010.5543189.

Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In Neural Information Processing Systems (Neur IPS), pages 2424 2432. Curran Associates, Inc., 2010. URL https://dl.acm.org/doi/10.5555/2997046. 2997166.

Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Neural Information Processing Systems (Neur IPS), pages 2035 2043. Curran

On Truthing Issues in Supervised Classification

Associates, Inc., December 2009. URL https://dl.acm.org/doi/10.5555/2984093. 2984321.

Dengyong Zhou, Qiang Liu, John C. Platt, Christopher Meek, and Nihar B. Shah. Regularized minimax conditional entropy for crowdsourcing. URL https://arxiv.org/abs/ 1503.07240, March 2015.