# on_truthing_issues_in_supervised_classification__a508ad11.pdf Journal of Machine Learning Research 25 (2024) 1-91 Submitted 4/19; Revised 12/23; Published 01/24 On Truthing Issues in Supervised Classification Jonathan K. Su su@ll.mit.edu MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02421-6426, USA Editor: Shivani Agarwal, Francis Bach Ideal supervised classification assumes known correct labels, but various truthing issues can arise in practice: noisy labels; multiple, conflicting labels for a sample; missing labels; and different labeler combinations for different samples. Previous work introduced a noisylabel model, which views the observed noisy labels as random variables conditioned on the unobserved correct labels. It has mainly focused on estimating the conditional distribution of the noisy labels and the class prior, as well as estimating the correct labels or training with noisy labels. In a complementary manner, given the conditional distribution and class prior, we apply estimation theory to classifier testing, training, and comparison of different combinations of labelers. First, for binary classification, we construct a testing model and derive approximate marginal posteriors for accuracy, precision, recall, probability of false alarm, and F-score, and joint posteriors for ROC and precision-recall analysis. We propose minimum mean-square error (MMSE) testing, which employs empirical Bayes algorithms to estimate the testing-model parameters and then computes optimal point estimates and credible regions for the metrics. We extend the approach to multi-class classification to obtain optimal estimates of accuracy and individual confusion-matrix elements. Second, we present a unified view of training that covers probabilistic (i.e., discriminative or generative) and non-probabilistic models. For the former, we adjust maximum-likelihood or maximum a posteriori training for truthing issues; for the latter, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk. We also describe suboptimal training that is compatible with existing infrastructure. Third, we observe that mutual information lets one express any labeler combination as an equivalent single labeler, implying that multiple mediocre labelers can be as informative as, or more informative than, a single expert labeler. Experiments demonstrate the effectiveness of the methods and confirm the implication. Keywords: supervised classification, truth errors, noisy labels, Bayesian estimation, empirical Bayes, mutual information, crowdsourcing . DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering. c 2023 Massachusetts Institute of Technology. Subject to FAR52.227-11 Patent Rights - Ownership by the contractor (May 2014) Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work. c 2024 Jonathan K. Su. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/19-301.html. 1. Introduction Supervised classification uses labeled data to train and test a predictive model or classifier, which will be used to predict the labels of unlabeled data. The predictive model is a mapping g : X Y, parameterized by θ, from a feature space X to a set Y = {0, 1, . . . , C 1} of C mutually exclusive and exhaustive classes or labels. The input to the model is a feature vector x X, and the output of the model is the predicted label ˆy = g(x; θ) Y, which may or may not agree with the correct label y Y. A labeled sample (x, y) consists of a feature vector x and its corresponding correct label y, while an unlabeled sample is just a feature vector x. A set of N labeled samples is denoted as {x, y}, where x = (x1, . . . , x N) = (xi)N i=1, y = (yi)N i=1, and for each i, xi X, yi Y, where yi is the correct label associated with xi. The result of applying a learned model to each feature vector in x is the list of corresponding predicted labels ˆy = (ˆyi)N i=1 = (g(xi; θ))N i=1. Training learns the model from a training set of labeled samples. Testing uses a separate testing set of labeled samples, applies the learned model to its feature vectors, and calculates metrics that quantify the agreement between each predicted label and the corresponding correct label. Our notation does not distinguish between the training and testing sets, since the intended set will be clear from context. 1.1 Truthing Issues It is usually assumed that the correct labels are known during training and testing; we refer to this situation as the ideal case. However, this critical assumption is often violated in practice (see Everingham et al. (2006, p. 172), Fr enay and Verleysen (2014), or Northcutt et al. (2021b), for example), which can degrade the trained model and produce misleading testing metrics. Deviation from the ideal case can invalidate many hours of hard work and computer processing, as well as make users and clients skeptical of the utility of a classifier or its performance. Consequently, substantial resources are often devoted to truthing, the process of labeling data as correctly as possible. We use the term labeler to mean an entity that assigns labels to samples; some authors use terms like teacher or annotator. A labeler could be a human, a sensor, a laboratory test, or even another classifier. Despite a labeler s best efforts, errors can and do still occur. Humans make mistakes, become fatigued, and have varying amounts of expertise, attentiveness, and motivation; sensors are subject to noise, occlusion, and other degradations; laboratory test results are not always definitive; and classifiers are rarely perfect predictors. In addition, labeling via crowdsourcing means that more than one labeler could assign a label to the same feature vector xi, and if the labels conflict, then at least one of them must be incorrect. In short, a number of truthing issues can arise: truth errors, multiple labelers who provide conflicting labels for the same sample, missing labels, and different combinations of labelers for different samples. We introduce some notation here and include an example of it in Table 1. We assume there are T labelers, indexed from 1 to T, and we let T = {1, 2, . . . , T}. Let zi,t { } Y, t T , denote the noisy label assigned to xi by the tth labeler, where zi,t = if no label was assigned. We require that at least one labeler assigns a label to each sample; that is, zi,t Y for some t T . Hence, a labeler can only assign one label On Truthing Issues in Supervised Classification Labeler Index t 1 2 3 4 5 Sample Feature Correct Noisy Index Vector Label Labels zi i xi yi zi,1 zi,2 zi,3 zi,4 zi,5 1 x1 0 0 1 0 0 2 x2 1 1 1 3 x3 0 0 0 0 0 0 4 x4 1 1 1 0 0 1 5 x5 0 0 0 0 ... ... ... ... N x N 0 0 0 0 Table 1: Example of notation for binary classification and five labelers. This example corresponds to the training example in Section 5.4. to a sample. Of course, zi,t might be incorrect and differ from yi. Denote the set of noisy labels for xi by zi = (zi,t)t T ; the noisy labels need not agree. Finally, let z = (zi)N i=1. A natural approach, introduced by Dawid and Skene (1979), is to treat the correct labels and noisy labels as random variables (RVs). We adopt this viewpoint and use capitalization (e.g., Y ) for an RV and lowercase (e.g., y) for its non-random counterpart. Hence, Yi is the correct-label RV for the ith sample, so Y = (Yi)N i=1 is the list of correct-label RVs, and y is the list of correct labels themselves. Similarly, the feature-vector RVs are Xi and X = (Xi)N i=1, and their realizations are xi and x. Also, Zi,t, Z, Zi = (Zi,t)t T , and Z = (Zi)N i=1 indicate noisy-label RVs, while zi,t, z, zi, and z indicate their respective realizations. We assume that the RVs associated with different samples are independent and identically distributed and that the correct-label RVs have class prior π = π(y) C 1 y=0 , where π(y) = p(y) = Pr(Y = y). We further assume that the noisy-label RVs are conditionally dependent given the correct-label RV, and we denote the noisy-label conditional distribution by p(z|y; ψ), which is parameterized by ψ, and where a semi-colon separates RVs from nonrandom parameters. We refer to p(z|y; ψ)π(y) as a noisy-label model. For the ith sample, the parameters are ψi. We assume that Zi and Zj follow the same conditional distribution p(z|y; ψ) if ψi = ψj = ψ. At this stage, we let the parameters remain completely general. Also, let ψ = (ψi)N i=1 be the list of parameters for all samples. Much of the related work, described in Section 1.2, has focused on constructing a model for p(z|y; ψ), estimating the parameters ψ and class prior π, and estimating the correctlabel RVs Y . This paper and the related work implicitly assume that p(z|y; ψ) = p(z; ψ) so that the noisy-label RVs provide some information about the correct-label RV. If z is non-informative, then none of these methods will be effective. Denote the estimated correct label for the ith sample as ˇyi, and let ˇy = (ˇyi)N i=1. In this paper, we seek answers to the following questions: 1. How can one test a classifier in the presence of truthing issues1? 2. How can one train a classifier in the presence of truthing issues? 3. How can one compare different combinations of labelers with different abilities? 1.2 Related Work We organize the related work into noisy-label modeling and learning, training with truthing issues, testing with truthing issues, and comparing combinations of labelers. 1.2.1 Noisy-Label Modeling and Learning Several authors have proposed noisy-label models p(z|y; ψ)π(y) and developed methods for learning them,2 i.e., estimating ψ and π. Often, they also estimated the correct-label RVs Y . Table 2 presents a summary. All but one set of authors allowed for multiple labelers. The table shows that most of them assumed that the noisy-label RVs are conditionally independent given the correct-label RV. Dawid and Skene (1979) made an initial, influential foray into this area. They did not address supervised classification, but they considered the problem of compiling observations of patients from multiple clinicians who might disagree or make mistakes. In this context, they introduced the conditional-RV formulation that is now commonplace. When the correct labels are available, they showed that maximum-likelihood (ML) estimation of ψ and π is straightforward. With the benefit of hindsight, it is clear that this situation amounts to learning a classifier from an auxiliary data set {z , y } where the noisy labels z serve as the feature vector. For example, one might have a small auxiliary data set containing noisy labels from clinicians and correct labels from a separate, gold-standard laboratory test, and a large amount of data with only noisy labels. More important, when the correct labels are unavailable, Dawid and Skene demonstrated the utility of the expectation-maximization (EM) algorithm of Dempster et al. (1977) for estimating ψ, π, and Y . Many other authors have built upon this work. Donmez et al. (2010) considered classification and regression, and they also proposed using the EM algorithm for learning the noisy-label model. They studied conditions under which ML estimates of the noisy-label model parameters are consistent. For classification, they showed that consistency holds under certain conditions, such as when the labelers are weak learners or better and the class prior π is not equiprobable. Whitehill et al. (2009) proposed a model for p(z|y; ψ) that considers sample difficulty and labeler expertise. A Bayesian extension of the model allowed priors for these parameters. They used both the EM algorithm and a Bayesian version of it to estimate ψ and Y . Welinder and Perona (2010) and Welinder et al. (2010) expanded the model to include labeler bias and to support multi-class labels and continuously-valued annotations. 1. We tackle testing before training because our testing approach applies regardless of how a classifier was trained. Also, if truthing issues are present, then there is little point in training a classifier if one cannot evaluate its performance. 2. One could propose learning a model of the form p(z|x, y)p(x, y) or p(z|y, x)p(y|x) from an auxiliary set {x , y , z }. However, learning it would be harder than learning a predictive model for p(x, y) or p(y|x), so if such a set were available, then one could just learn the predictive model from {x , y }. On Truthing Issues in Supervised Classification Reference Description Number of Classes Labeler Dependence Dawid and Skene (1979) Seminal work ML estimation of noisy-label model, if correct labels available EM estimation of noisy-label model and correct labels, if correct labels unavailable Multiple Independent Donmez et al. (2010) Also regression EM estimation of noisy-label models Consistency conditions for ML estimates Multiple Independent Whitehill et al. (2009) Sample-difficulty and labeler-expertise model EM or Bayesian estimation of noisy-label model Binary Independent Welinder and Perona (2010); Welinder et al. (2010) Extension of Whitehill et al. (2009) to include labeler bias, multiple classes, and continuously-valued annotations Binary or Multiple Independent Branson et al. (2017) Extension of Welinder et al. (2010); Welinder and Perona (2010) to part-keypoint and bounding-box annotations Sequential acquisition of noisy labels Binary Independent Van Horn et al. (2018) Extension of Branson et al. (2017) to multiple classes and sequentially dependent labelers Multiple Sequentially dependent Karger et al. (2014) Allocation of labelers to unlabeled samples Iterative message passing Binary Independent Zhou et al. (2015) Minimax conditionl entropy principle Multiple Independent Platanios et al. (2016) Hierarchical models Gibbs sampling Binary Independent or dependent Northcutt et al. (2021a) Assume previously-trained classifier Estimate joint dist. p(z, y) on new data set Correct or remove mislabeled samples Multiple Not applicable Table 2: Related work on noisy-label modeling and learning. References that jointly learn a noisy-label model and train a predictive model appear in Table 3. Branson et al. (2017) extended the models by Welinder and Perona and by Welinder et al. to include other forms of annotation. They also introduced an algorithm that sequentially acquires more noisy labels until the risk of the estimated correct label falls below a threshold. Van Horn et al. (2018) extended the work by Branson et al. to multi-class classification and sequentially dependent labelers. Karger et al. (2014) considered the problem of allocating labelers to unlabeled samples and proposed a message-passing algorithm for estimating the correct labels. Zhou et al. (2015) included sample difficulty in their model of p(z|y; ψ) and estimated ψ, π, and Y by optimizing a minimax criterion on the conditional entropy of Z given Y . Platanios et al. (2016) proposed a variety of generative models for p(z|y; ψ), and they applied Gibbs sampling to infer π, ψ, and Y . None of these authors considered training or testing, and they used the known correct labels to compute estimation errors in their experiments. For our purposes, these works offer noisy-label models that could be learned even without correct labels and used with our training and testing methods. The model introduced by Whitehill et al. (2009) is italicized because, in Section 5.1.2, we present a similar model for p(z|y; ψ) that includes sample difficulty and labeler fallibility; however, our purpose is not to estimate ψ but to demonstrate training and testing when ψ is already available. Northcutt et al. (2021a) took a different viewpoint and assumed the availability of an existing classifier, previously trained on an auxiliary data set with enough correctlylabeled samples to overcome the presence of some noisily-labeled ones. Given a feature vector, the existing classifier predicts class probabilities for all classes, unlike a human labeler who provides a noisy label indicating a single class. For a new data set with noisy labels, Northcutt et al. leveraged this property to estimate the joint distribution p(z, y) as well as p(z|y) and π. They used these estimates to identify samples in the new data set that were likely mislabeled and to correct or remove them. A new classifier can then be trained or tested using the corrected labels. This approach offers a different way to obtain a noisy-label model, and our methods can complement it by addressing samples whose noisy labels cannot be resolved. Some other authors perform joint learning of a noisy-label model and training of a predictive model. These references are discussed next. 1.2.2 Training (and Joint Learning) with Truthing Issues Works on training with truthing issues are listed in Table 3 and reviewed here. Cid-Sueiro (2012) and Cid-Sueiro et al. (2014) took a Bayesian viewpoint and examined weak losses for training with partial labels, which are modeled slightly differently than the noisy labels considered here. The authors related the weak loss to an equivalent loss for correct labels and studied theoretical aspects of constructing a weak loss from a given equivalent loss. Their approach does not conform to the unified view of training that we offer in Section 3.2; it could be interpreted as the reverse of the minimum meansquare error training method proposed in Section 3.4.1. The two approaches are discussed in Section 3.6.1. Natarajan et al. (2013, 2018) took a classical (i.e., frequentist) view and trained binary classifiers on noisy labels from a single labeler by forming a proxy loss function for Z that is an unbiased estimator (in the classical sense) of the ideal loss function for y. Their approach does not fit into our unified view; further discussion appears in Section 3.6.2. van Rooyen and Williamson (2018) presented an abstract framework for learning with noisy labels, parts of which generalize the work by Cid-Sueiro (2012), Cid-Sueiro et al. (2014), and Natarajan et al. (2013, 2018). Ratner et al. (2016, 2017) proposed a technique for programmatically generating multiple noisy labels for many unlabeled samples and subsequently training discriminative classifiers with the noisy labels. To model p(z|y; ψ), they introduced generative models with independent or dependent labelers, and they estimated ψ while assuming an equiprobable class prior. Training minimized the expected empirical risk with respect to p(z|y; ψ). The remaining authors considered joint learning and training with noisy labels. Raykar et al. (2010) used the EM algorithm to jointly estimate ψ, π, and θ and train a logistic regression binary classifier. They also presented a Bayesian form of the algorithm with priors on the labelers error probabilities and provided approximate posteriors of the error prob- On Truthing Issues in Supervised Classification abilities, and they extended the approach to multi-class classification, ordinal regression, and regression. Khetan et al. (2018) presented a version of the EM algorithm that alternates between training a binary classifier with z and the current estimates of ψ and π and then updating the estimates of ψ and π using the current predictions. They weight the training loss function using the correct-label posterior for each possible correct-label value, effectively marginalizing out the correct-label RV. To train a convolutional neural network (CNN) on noisy labels from a single labeler and estimate p(z|y), Sukhbaatar et al. (2015) and Jindal et al. (2016) appended an additional layer to the softmax output of a base network. During training, the base network learns to predict the unknown correct labels while the additional layer estimates p(z|y) and predicts the noisy label from the output of the base network. Following training, the additional layer can be excised and the base network used for prediction. This training method lies outside our unified view of training and is discussed in Section 3.6.3. Tanno et al. (2019) jointly trained a CNN and estimated the confusion matrices for multiple independent labelers. They used the confusion matrices to convert the CNN outputs into predictions of the noisy labels, which is similar to the work by Sukhbaatar et al. (2015). Notably, Tanno et al. introduced a trace-regularization term, which minimizes the traces of the confusion matrices and, under certain conditions, ensures proper estimation. The authors compared their method to several other methods, including those of Raykar et al. (2010), Khetan et al. (2018), and Sukhbaatar et al. (2015). This approach does not fit our unified view of training, and further discussion appears in Section 3.6.4 All of these authors used the correct labels during testing or other experimental evaluations; they did not consider testing with truthing issues. As Table 3 shows, many of them do not consider multiple, possibly dependent labelers, which our work allows. Some of them jointly learn a noisy-label model and a predictive model, which we do not consider. A number of these works fit within our unified view of training (Section 3). The table italicizes parts of Ratner et al. (2016, 2017) and Khetan et al. (2018) that are consistent with our use of minimum mean-square error estimation of the empirical risk (Section 3.4.1). In the work of Raykar et al. (2010), logistic regression is italicized because we also use it as an illustrative example (Section 5.4). 1.2.3 Testing with Truthing Issues As noted above, all of the related work on training used a reserved set of correct labels for testing. Related work on testing with truthing issues appears in Table 4. Smyth et al. (1994) considered testing with noisy labels assigned by a scientist or classifier to synthetic aperture imagery of the planet Venus to indicate the presence or absence of types of volcanoes. Clearly, absolute ground truth is unavailable in this case. To test an individual scientist s predicted labels, the authors used the EM algorithm proposed by Dawid and Skene (1979) with the noisy labels from all other scientists to estimate the correct labels, which were then treated as if correct. A classifier was also trained using consensus labels from two scientists (Burl et al., 1994). It was tested by comparing its predictions to the estimated correct labels from EM using all scientists noisy labels. Reference Description Number of Classes Labeler Dependence Joint with noisy-label model learning? Fits into our unified view? Cid-Sueiro (2012); Cid-Sueiro et al. (2014) Theoretical Bayesian view Weak loss designed for partial labels Equivalent loss for correct labels Binary or multiple Not applicable (Partial labels) Natarajan et al. (2013, 2018) Classical or frequentist view Proxy loss function that is unbiased in classical sense Binary Not applicable (Single labeler) van Rooyen and Williamson (2018) Abstract framework Generalization of Cid-Sueiro (2012); Cid-Sueiro et al. (2014); Natarajan et al. (2013, 2018) Binary Not applicable (Single labeler) Ratner et al. (2016, 2017) Logistic regression Also long short-term memory network Programmatic noisy labeling Minimization of expected empirical risk Binary Independent or dependent Raykar et al. (2010) Binary logistic regression as main example Also regression, ordinal regression EM or Bayesian estimation Binary or multiple Independent Yes Yes Khetan et al. (2018) Empirical risk minimization Marginalization of loss function using correct-label posterior Binary or multiple Independent Yes Yes Sukhbaatar et al. (2015); Jindal et al. (2016) CNN Base network learns p(y|x) Extra layer learns p(z|y) Multiple Not applicable (Single labeler) Tanno et al. (2019) CNN Regularization of trace of labelers confusion matrices Multiple Independent Yes No Table 3: Related work on training with truthing issues. On Truthing Issues in Supervised Classification Reference Description Number of Classes Labeler Dependence Smyth et al. (1994) Comparison of one labeler s noisy labels with estimated correct labels obtained by applying EM to other labelers noisy labels Multi-class labels (volcano type or non-volcano) reduced to binary labels (volcano or non-volcano) Multiple, reduced to binary Independent Lam and Stork (2003) Effect of noisy labels on probability of error Variance of estimated probability of error as a function of labeler error probability, number of samples Multiple, reduced to binary Not applicable (Single labeler) Carlotto (2009) Study of the effect of noisy labels on accuracy Relationship between accuracies calculated on correct labels and on noisy labels Rough estimate of ideal accuracy Multiple Not applicable (Single labeler) Holodnak et al. (2018) Empirical study of methods for estimating accuracy from noisy labels Multiple Independent or dependent Table 4: Related work on testing with truthing issues. For binary classification, Lam and Stork (2003) related the ideal probability of error Pr(ˆy = y) = 1 accuracy to a labeler s error probability ε = Pr(z = y) and the classifier s apparent probability of error Pr(ˆy = z). They provided an estimate of Pr(ˆy = y) given an assumed value of ε, and they examined the variance of this estimate as a function of ε and the number of samples N. Carlotto (2009) analyzed how measured accuracy is affected by truth errors for a single labeler. Carlotto obtained an expression that relates ideal accuracy to accuracy calculated against noisy labels and suggested a rough estimate of accuracy when the labeler s error probability is known. Holodnak et al. (2018) conducted an empirical study with real and simulated data to compare a variety of techniques for estimating the accuracy of a classifier from noisy labels. They introduced two models that incorporate dependencies between the labelers or noisy-label RVs, and they demonstrated that estimation techniques that assume conditional independence provide less reliable estimates as labeler dependence increases. In a survey paper on classification with label noise (Fr enay and Verleysen, 2014, p. 862), the authors remark, a problem that is seldom mentioned in the literature is that model validation can be difficult in the presence of label noise. Indeed, the number of references on testing is considerably smaller than that on learning or training, and such work has mainly examined accuracy. Nevertheless, the above works reflect the main approaches: Calculate estimated correct labels ˇy and use them as the reference; estimate the ideal accuracy in some way; and comparative studies. Our work includes many common testing metrics, including accuracy, precision, recall or probability of detection, and probability of false alarm. We focus on estimating the testing metrics rather than the correct labels, and we develop algorithms for computing Bayesian optimal estimates of scalar and joint metrics. Our experiments use conditionally independent labelers for implementational convenience, but our approaches accommodate noisy-label models with conditionally dependent labelers. 1.2.4 Comparing Combinations of Labelers Regarding the comparison of different combinations of labelers, we have found one rather distantly-related publication. For binary classification and a single labeler, Lugosi (1992) viewed the correct-label RV and noisy-label RV as the respective input and output of a communications channel. Lugosi examined purely theoretical aspects of the effects of noisy labels on accuracy if a classifier uses the maximum a posteriori or nearest-neighbor decision rule. In Section 4, we make the channel analogy for one or more labelers, but we proceed in a different direction. We suggest that mutual information can be used to compare combinations of labelers, which implies that the information conveyed by multiple mediocre labelers can equal or exceed that provided by a single expert labeler. 1.3 Supervised Classification and Estimation Theory This work applies estimation theory to supervised classification with truthing issues, and Figure 1 presents conceptual diagrams for these two fields. The diagrams look similar but differ in a fundamental way: In supervised classification, the actual process is unknown, while in estimation theory, the actual variable is unknown. Supervised classification is concerned with finding a good predictive model that generalizes to future, out-of-sample realizations from an unknown process, given a set of labeled samples from the process and a family of models. In this work, estimation theory is concerned with making a good guess at the current, in-sample value of an unobserved variable, given noisy measurements and a model for the measurement process. These fields also use similar terms and objectives. Supervised classification learns model parameters θ, estimation theory finds an estimate ˆy or estimator ˆY , and both fields seek an answer that is best according to some criterion. We present a few examples. In these examples, the criterion is essentially the same; the differences lie in the components that are assumed to be known, the solution space, and the optimization methods. First, both fields employ ML estimation: classification selects non-random parameters θ to maximize p(y|x; θ) or p(y, x; θ), and estimation chooses ˆy to maximize p Z|Y (z|ˆy). Second, when classification treats the model parameters as RVs Θ with prior p(θ), it uses the maximum a posteriori (MAP) criterion and chooses θ to maximize p(θ|y, x). When estimation adopts the MAP criterion, it selects ˆy = arg maxy p(y|z). Third, classification may minimize the average square loss N 1 PN i=1(yi g(xi; θ))2, while estimation may minimize the mean-square error (MSE) E[(Y h(Z))2]. We make repeated use of minimum mean-square error (MMSE) estimation, which seeks the estimator h(Z) that minimizes the MSE. A convenient standard result (see Appendix C) is that the MMSE estimator is the conditional mean of Y given Z: h MMSE(Z) = arg min h E[(Y h(Z))2] = E[Y |Z]. (1) Finally, classification may minimize the average zero-one loss N 1 PN i=1 1(yi = g(xi; θ)), where 1(w) is the indicator function: 1(w) = 0 if w is false, and 1(w) = 1 if w is true. Likewise, estimation may adopt the minimum probability-of-error (MPE) criterion and minimize E[1(Y = h(Z))], the probability of error. Another standard result from estimation On Truthing Issues in Supervised Classification (a) Supervised classification (b) Estimation theory Figure 1: Comparison of supervised classification and estimation theory. In supervised classification, the goal is to use a set of correctly-labeled samples from an unknown actual process to learn a predictive model that generalizes to future, out-of-sample realizations from the process. In estimation theory, the goal is to estimate the in-sample value of an unobserved variable from noisy measurements produced by a known measurement process. For each problem, the key unknown element appears in red, the key known element in green, and the desired element in blue. theory (see Appendix D) is that the MPE estimator corresponds to the MAP estimator: h MPE(Z) = arg min h E[1(Y = h(Z))] = arg max y p(y|Z) = h MAP(Z). (2) Our application of estimation theory begins with the assumption that a good noisy-label model p(z|y; ψ)π(y) is available. If an auxiliary data set {z , y } is available, then such a model could be obtained via supervised learning. Much of the related work addresses the problem of learning a model when correct labels are unavailable but some knowledge about the labelers error behavior and/or class prior is available. Consequently, this paper is complementary to and broadly compatible with much of the related work. 1.4 Novel Contributions and Organization We use boldface to highlight specific novel contributions. We discuss some conceptual contributions and then use the three questions raised in Section 1.1 to cover the organization of the rest of the paper and mention other contributions. For reference, Tables 5, 6, and 7 list abbreviations, symbols, and important distributions used throughout this paper. Supervised classification with truthing issues involves three different components, which each require a model. The truthing or labeling process requires a noisy-label model, training learns the predictive model, and testing calls for a testing model. Much of the related work has investigated ways to learn a noisy-label model and/or use a noisy-label model to train a predictive model. The greatest conceptual contribution of this paper is its application of Bayesian estimation theory to training and testing. To our knowledge, this paper is the first one to take this viewpoint and pursue its possibilities so extensively. We concentrate on training and testing, and we start with the assumption that a noisy-label model p(z|y, ψ)π(y) is available, so this work complements much of the related work. The Bayesian approach naturally allows for multiple labelers, different combinations of labelers for each sample, and missing and/or conflicting noisy labels. It also supports noisy-label models with conditionally dependent labelers.3 In this work, we say that an estimator is optimal only if it meets three requirements: it employs an appropriate estimand, it fully exploits all available information, and it minimizes a well-defined penalty criterion (or maximizes a well-defined utility criterion). An estimator that fails any of these requirements is considered suboptimal. Our technical contributions consist of testing and training approaches that are optimal in this strict sense. Our optimal testing approach focuses on estimating the metrics, introduces a testing model that enables thorough exploitation of the available information, and employs the MMSE criterion. In contrast, some suboptimal methods omit estimation theory entirely, others estimate the correct labels instead of the metrics, and still others omit a testing model and fail to take full advantage of the available information. Our optimal training methods select an appropriate likelihood function, posterior, or risk that is faithful to the original objective from ideal training; they use the noisy-label model to exploit the available information completely; and they apply the ML, MAP, or MMSE criterion. Some suboptimal methods ignore estimation theory, while others estimate the correct labels rather than the risk. Our proposed methods never estimate the correct labels because, in our view, they are not the right estimand. This viewpoint marks another novel conceptual contribution: Our conscious choice to refrain from estimating the correct labels. We deliberately avoid making hard decisions about the unobserved correct labels because doing so would produce errors that would propagate into training and testing. In this way, our work differs from existing, suboptimal approaches that estimate the correct labels and then proceed as if the estimated labels were correct. 1. How can one test a classifier in the presence of truthing issues? This question is deeply investigated in Section 2. For binary classification, we propose a testing model for the noisy and predicted labels (Section 2.1, Figure 2). We then derive the estimation-theoretic testing methods as follows: (a) We express various metrics in terms of two common RVs that are independent and approximately normally distributed (Section 2.2). (b) We derive approximate marginal posteriors for accuracy, precision, recall or probability of detection, probability of false alarm, and F-score and the approximate joint posteriors for probability of detection and probability of false alarm as well as for precision and recall (Section 2.3). (c) We propose MMSE testing and develop empirical Bayes algorithms for estimating the testing-model parameters via iterative MMSE estimation 3. For ease of implementation, our simulations and experiments employ conditionally independent labelers, but the derivations and algorithms do not. On Truthing Issues in Supervised Classification (Algorithms 1 and 2, Figure 3), and we discuss their relation to the EM algorithm and convergence (Section 2.4), (d) We explain how to calculate Bayesian optimal estimates (MMSE, MAP, and credible regions) of the metrics from the estimated testing-model parameters and posteriors (Section 2.5). We also describe some alternative testing approaches (Section 2.6), such as MPE or MAP estimation of the correct labels (Section 2.6.1, Algorithm 3) and fully Bayesian estimation (Section 2.6.2). For multi-class classification (Section 2.7), we extend the testing model (Section 2.7.1), provide another empirical Bayes algorithm for MMSE testing (Algorithm 4, Figure 4), and derive approximate posteriors for accuracy and individual elements of the confusion matrix (Section 2.7.2). 2. How can one train a classifier in the presence of truthing issues? We consider this question in Section 3. We restate the assumption of independent samples (Section 3.1), and we present a unified view of training that encompasses and organizes some of the related work (Section 3.2). For probabilistic (e.g., discriminative or generative) models, we derive the likelihood function or posterior of the predictive model parameters for truthing issues such that the original, ideal training principle (ML or MAP) is preserved (Section 3.3). For non-probabilistic models (Section 3.4), which are trained by minimizing the empirical risk, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk, and we demonstrate that this approach leads to the same training objective proposed by some related work (Section 3.4.1). We review properties associated with MMSE estimation (Section 3.4.2), mention its convenient form for gradient calculation (Section 3.4.3), and consider some special cases (Section 3.4.4). We provide a basic condition for consistency of the MMSE estimator (Section 3.4.5). Next, we mention some aspects of MMSE training that make it more appealing than ML and MAP training (Section 3.5). We also discuss some alternative training approaches that do not fit into the unified view (Section 3.6). Finally, we describe ways to do training with infrastructure that was not designed for truthing issues (Section 3.7). 3. How can one compare different combinations of labelers with different abilities? This question is briefly studied in Section 4. We make a simple analogy between the noisy-label model and a communications channel, which suggests mutual information as a basis for comparing combinations of labelers. We focus on the binary symmetric broadcast channel (Section 4.1), and we suggest expressing the mutual information for a set of labelers in terms of that for a single equivalent labeler (Section 4.2). We observe that, in theory, multiple mediocre labelers can be as informative as or more informative than a single expert labeler (Section 4.3). Experimental results appear in Section 5. We relied on simulation to generate many of the correct, noisy, and predicted labels (Section 5.1). We simulated correct and predicted Abbreviation Expansion BSBC binary symmetric broadcast channel BSC binary symmetric channel CLT central limit theorem CNN convolutional neural network EM expectation-maximization ERM empirical risk minimization MAC moment-approximation condition MAD maximum absolute difference MAP maximum a posteriori ML maximum likelihood MMSE minimum mean-square error MPE minimum probability-of-error MSE mean-square error OP operating point P-R precision-recall ROC receiver operating characteristic RV random variable Table 5: List of abbreviations. labels in a straightforward manner (Section 5.1.1), and we employed a particular noisy-label model (Section 5.1.2) that is similar to the one by Whitehill et al. (2009). For testing, we review results for several experiments on binary classification (Section 5.2), and we report on one experiment on multi-class classification (Section 5.3). For training, we provide an example involving binary logistic regression (Section 5.4). For the comparison of labelers, we show experiments that use the proposed training and testing methods to verify the possibility of equivalent mutual information (Section 5.5). Section 6 provides a summary and conclusions (Section 6.1), a workflow for supervised classification with truthing issues (Section 6.2), and suggested future directions (Section 6.3). Several appendices are also included. Appendix A derives testing metrics in terms of the common RVs from Section 2.2. Appendix B summarizes results on ratios of jointly normal RVs; they are useful for calculating posteriors of scalar metrics. Appendices C and D review MMSE, MPE, and MAP estimation. Appendix E provides derivations for the training approaches in Sections 3.3 and 3.4. Appendix F gives details for the logistic regression training example in Section 5.4. 2. Testing with Truthing Issues: Grading with Dirty Answer Keys We cover testing before training because the results and methods presented here apply regardless of how a classifier was trained. They can be employed even if training did not consider truthing issues. Additionally, if one needs to train a model with truthing issues, then one very likely needs to test the trained model with truthing issues, too. One might be reluctant to embark on training if one suspects that truthing issues will invalidate the testing metrics. This section provides reassurance that reliable testing is possible. Conventional testing calculates metrics over an ideal testing set {ˆy, y} to measure the (dis)agreement between the predicted labels ˆy and the correct labels y. In machine learning, On Truthing Issues in Supervised Classification Symbol Description Symbol Description B Bernoulli distribution g pre-threshold pred. model Beta beta distribution h arbitrary estimator C number of classes h MMSE MMSE estimator Cemp, [C] confusion matrix i sample index Cat categorical distribution j iteration or generic index Femp β , [Fβ] F-score p D, [PD] probabilty of detection I mutual information p FA, [PFA] probability of false alarm Jideal ideal training objective p D, p FA, [ PD, PFA] OP parameters Jpri primary term s pre-threshold pred. stat. Jreg regularization term t labeler index Kemp, [K] cond. conf. matrix x, x, x, [X, X, X] feature vectors K cond. conf. matrix param. y, y, [Y , Y ] correct labels L loss function ˆy, ˆy, [ ˆY , ˆY ] predicted labels ˆLMMSE MMSE est. of loss function ˇy, ˇy estimated correct labels M no. draws in sampling algs. z, z, z, [Z, Z, Z] noisy labels N number of samples α clipping value ˆN1 number of times ˆyi = 1 β F-score parameter Remp, [R] empirical risk δ, δ sample difficulty ˆRMMSE MMSE est. of emp.-risk RV ε error prob. (57) or (60) T number of labelers η prob. provide noisy label [U, V ] common RVs for metrics θ, θ, θ , [Θ] pred.-model parameters N normal distribution λ regularization weight T set of labelers π, π class prior U uniform distribution τ decision threshold X feature-vector space φ, φ labeler fallibility Y set of classes ψ, ψ, ψ noisy-label params. f arbitrary function 0 zero vector g predictive model 1( ) indicator function Table 6: List of main symbols. Brackets indicate RVs. Distribution Description p(z|y; ψ) noisy-label conditional distribution p(z|y; ψ)π(y) noisy-label model p(ˆyi|yi; p D, p FA) predicted-label conditional distribution (3), (4), (5), (6) p(ˆyi|yi; p D, p FA)p(zi|yi; ψi) testing model (7) p(yi|ˆyi, zi; ψi, p D, p FA) testing class posterior (10) p(yi|zi; ψi) training class posterior (41) Table 7: List of key distributions. ˆy is obtained by applying the learned model g to each feature vector xi x. However, testing does not actually involve the feature vectors x, so the results of this section apply to problems outside of machine learning, such as reconciling observations by multiple clinicians (for examples, see Dawid and Skene, 1979; Raykar et al., 2010). With truthing issues, testing must work with the testing set {ˆy, z, ψ, π} instead of {ˆy, y}. The metrics are functions of the correct-label RVs, which are unobserved, so the metrics are RVs whose values remain uncertain. We therefore treat testing as an estimation problem: we want to know the in-sample values of the metrics on the testing set. We seek the posteriors of the metric RVs, conditioned on ˆy and z. Optimal estimates or credible regions for the metric RVs can then be calculated from the posteriors. Our approach exploits all available information, including the predicted labels ˆy. The problem is analogous to grading a multiple-choice quiz using answer keys from one or more teaching assistants who have poor handwriting. The answer keys provide information about the correct answers, but the student s answers do, too. We can obtain the best estimate of the grade by consulting the student s answers along with the answer keys, rather than relying on the answer keys alone. Most of this section covers binary classification; Section 2.7 discusses extensions to multiclass classification. For binary classification, we consider several common scalar metrics: accuracy, precision, recall or probability of detection,4 probability of false alarm,5 and Fscore. Each of these metrics takes on values in [0, 1]. For probability of false alarm, smaller values indicate better performance; for the other metrics, larger values correspond to better performance. Table 8 gives the empirical forms for these metrics, denoted as acc, prec, p D or rec, p FA, and Femp β , respectively. We also consider two common joint metrics: the receiver operating characteristic (ROC) and precision-recall (P-R) operating points, namely (p D, p FA) and (prec, rec). The empirical metrics are used in the ideal case when {ˆy, y} is available. The corresponding RV forms are ACC, PREC, PD or REC, PFA, Fβ, as well as (PD, PFA) and (PREC, REC). We overload the lowercase symbols to mean either an empirical metric or a realization of a metric RV. 2.1 Testing Assumptions In ideal testing, the metrics do not involve the feature vectors, so we eliminate X and x from consideration. If we had a good testing model for p(ˆy, z|y), then we could use it to estimate the metric RVs. One might propose learning it from an auxiliary set {ˆy , y , z }, but if such a set were available, then one could just do ideal testing with {ˆy , y }.6 Instead, we must devise a model for p(ˆy, z|y) and estimate its parameters from {ˆy, z, ψ, π}, with no prospect of learning them from auxiliary data. We tackle this problem by applying techniques from estimation theory rather than machine learning. We state our assumptions here, and Figure 2 shows the graphical model depicting them. We assume that ˆy, z, ψ, and π are available. We let ˆN1 denote the number of times that ˆyi = 1; it is immediately available from ˆy. We further assume that the noisy-label RVs Zi 4. Recall, probability of detection, sensitivity, and true positive rate are equivalent terms. 5. Probability of false alarm is equivalent to (1 specificity) and false positive rate. 6. We have chosen to omit any dependence on the feature vectors. If one were to propose learning a testing model such as p(ˆy, z|x, y)p(x, y) from {x , ˆy , y , z }, then a similar contradiction would arise. On Truthing Issues in Supervised Classification Metric Empirical Form Random-Variable Form Symbol Expression Symbol Expression Accuracy acc no. of times ˆyi = yi N ACC U V ˆN1 N + 1 Precision prec no. of times ˆyi =1 and yi =1 no. of times ˆyi =1 PREC N Prob. of Detection, Recall p D, rec no. of times ˆyi = 1 and yi = 1 no. of times yi = 1 PD, REC U U + V Prob. of False Alarm p FA no. of times ˆyi = 1 and yi = 0 no. of times yi = 0 PFA ˆN1/N U 1 (U + V ) F-Score (β > 0) Femp β (1 + β2) prec rec β2prec + rec Fβ β2(U + V ) + ˆN1/N Table 8: Binary-classification metrics: Empirical forms and RV forms in terms of the common RVs U and V from (8) and (9). Figure 2: Graphical model for testing approach. Small rectangles indicate non-random variables; circles indicate RVs. Shading indicates a variable that is fully observed. Large rectangles indicate N independent instances of the enclosed variables indexed by i. have conditional distribution p(zi|yi; ψi) and do not depend on the predicted label. As is common practice, we also assume independent samples, so p(z|y; ψ) = QN i=1 p(zi|yi; ψi). Next, we observe that accuracy, precision, and F-score can each be written solely in terms of the class prior, probability of detection, and probability of false alarm. For example, acc = π(0)(1 p FA) + π(1)p D, and prec = π(1)p D/(π(0)p FA + π(1)p D). Consequently, we introduce the ROC operating-point (OP) parameters ( p D, p FA), which suffice to determine the other metrics because the class prior is assumed known. The OP parameters represent the anticipated performance on the testing set before ˆY and Z are observed. We then treat each predicted label ˆyi as a realization of a predicted-label RV ˆYi conditioned on the correct-label RV Yi and ( p D, p FA). The predicted-label conditional distribution p ˆYi|Yi(0|0; p D, p FA) = 1 p FA, (3) p ˆYi|Yi(1|0; p D, p FA) = p FA, (4) p ˆYi|Yi(0|1; p D, p FA) = 1 p D, (5) p ˆYi|Yi(1|1; p D, p FA) = p D. (6) The assumption of independent samples means p(ˆy|y; p D, p FA) = QN i=1 p(ˆyi|yi; p D, p FA). Our final assumption is that, given Yi, the RVs ˆYi and Zi are conditionally independent, where ( p D, p FA) and ψi only influence ˆYi and Zi, respectively. This assumption may be a strong one, but it reflects the typical case in which ˆYi is generated without access to Zi. For example, ˆYi could be the output of a predictive model whose only input is Xi, or ˆYi could be a diagnosis made by a clinician who has not seen the diagnoses from other clinicians. Therefore, our testing model is p(ˆyi, zi|yi; ψi, p D, p FA) = p(ˆyi|yi; p D, p FA)p(zi|yi; ψi). (7) We excluded Xi, so Yi is the only RV that can link Zi and ˆYi, and the model includes such a connection. The model is parsimonious, having just two parameters. As explained at the top of this section, they will not be determined via supervised learning. Section 2.4 presents iterative methods for estimating them from {ˆy, z, ψ, π}. 2.2 Metric RVs in Terms of Common RVs Here we demonstrate that each metric RV can be written in terms of two RVs that are independent and approximately normal. 2.2.1 Metric RVs in Terms of Common RVs We begin by considering a single metric, namely probability of detection. Given truthing issues, it becomes an RV conditioned on ˆy, z, ψ, and ( p D, p FA), which we write as 1 N PN i=1 1(ˆyi = 1 and Yi = 1) 1 N PN i=1 1(Yi = 1) . In this expression, the predicted label is shown in lowercase because it is observed, and the correct label is capitalized because it is an unobserved RV. We define the RVs U and V , conditioned on {ˆy, z, ψ, p D, p FA}, as follows: i=1 1(ˆyi = 1 and Yi = 1) i:ˆyi=1 1(Yi = 1), (8) On Truthing Issues in Supervised Classification i=1 1(ˆyi = 0 and Yi = 1) i:ˆyi=0 1(Yi = 1). (9) The numerator of PD is exactly U. The denominator is i=1 1(Yi = 1) = 1 i:ˆyi=1 1(Yi = 1) + 1 i:ˆyi=0 1(Yi = 1) PD = U U + V . By similar manipulations (see Appendix A), we can write each of the other metric RVs in terms of the common RVs U and V . Table 8 summarizes the relationships. 2.2.2 Independence and Approximate Normality We immediately conclude that U and V are independent since they involve summations over disjoint subsets of ˆy. We now consider their distributions. In (8) and (9), each 1(Yi = 1) is a Bernoulli RV with success probability equal to p Yi| ˆYi,Zi(1|ˆyi, zi; ψi, p D, p FA). Using Bayes rule, we obtain this probability from the testing class posterior: p(yi|ˆyi, zi; ψi, p D, p FA) = π(yi)p(ˆyi, zi|yi; ψi, p D, p FA) P y i Y π(y i)p(ˆyi, zi|y i; ψi, p D, p FA) = π(yi)p(ˆyi|yi; p D, p FA)p(zi|yi; ψi) P y i Y π(y i)p(ˆyi|y i; p D, p FA)p(zi|y i; ψi), (10) where the appropriate value for p(ˆyi|yi; p D, p FA) can be determined from (3) (6). We denote a Bernoulli RV with success probability p [0, 1] by B(p), so each 1(Yi = 1) in (8) is distributed B(pi) with pi = p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA), i = 1, . . . , N. (11) Let N(µ, σ2) denote a normal RV with mean µ and variance σ2. By the central limit theorem (CLT), U is approximately distributed N(µU, σ2 U|ˆy, z; ψ, p D, p FA) with µU(ˆy, z; ψ, p D, p FA) = 1 i:ˆyi=1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA), (12) σ2 U(ˆy, z; ψ, p D, p FA) = 1 N2 X i:ˆyi=1 pi(1 pi) i:ˆyi=1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA) 1 p Yi| ˆYi,Zi(1|1, zi; ψi, p D, p FA) , (13) where we have indicated the dependence on {ˆy, z, ψ, p D, p FA}. Likewise, each 1(Yi = 1) in (9) is distributed B(qi) with qi = p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA), i = 1, . . . , N, (14) so V is approximately distributed N(µV , σ2 V |ˆy, z; ψ, p D, p FA) with µV (ˆy, z; ψ, p D, p FA) = 1 i:ˆyi=0 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA), (15) σ2 V (ˆy, z; ψ, p D, p FA) = 1 N2 X i:ˆyi=0 qi(1 qi) i:ˆyi=0 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA) 1 p Yi| ˆYi,Zi(1|0, zi; ψi, p D, p FA) . (16) These expressions readily accommodate some correctly-labeled samples. For the ith sample, if yi is known or can be exactly recovered from zi, then pi = 1(yi = 1) if ˆyi = 1, and qi = 1(yi = 1) if ˆyi = 0. Correctly-labeled samples add a constant to the summations for µU and µV and contribute zero to the summations for σ2 U and σ2 V . 2.3 Posteriors of Metric RVs We can now obtain the posteriors of the metric RVs. 2.3.1 Scalar Metric RVs We immediately find that ACC is approximately normal with mean µU µV ˆN1/N + 1 and variance σ2 U + σ2 V . Similarly, PREC is approximately normal with mean NµU/ ˆN1 and variance N2σ2 U/ ˆN2 1 . On Truthing Issues in Supervised Classification From the Random-Variable Form section of Table 8, we observe that each remaining scalar metric RV is a ratio of jointly, approximately normal RVs. Closed-form expressions for these posteriors are unavailable, but the work of Marsaglia (1965, 2006) explains how the posteriors can be computed. Appendix B reviews the procedure, and Table 16 in the appendix covers the necessary parameters for each of these metrics. Moreover, Marsaglia (2006, 4) and Appendix B provide closed-form approximations for the mean and variance of a ratio of jointly normal RVs. The approximations are valid if a simple moment-approximation condition (MAC) is satisfied.7 In summary, for any scalar metric RV Q {ACC, PREC, PD, PFA, Fβ}, we can compute the moments µU, σ2 U, µV , σ2 V from {ˆy, z, ψ, p D, p FA} and use them to get approximations for p(q|ˆy, z; ψ, p D, p FA), E[Q|ˆy, z; ψ, p D, p FA], and var(Q|ˆy, z; ψ, p D, p FA). 2.3.2 Joint Metric RVs For ROC analysis, we observe that PD and PFA are functions of two independent, approximately normal RVs U and V . We apply the standard transformation for two functions of two RVs (see Papoulis, 1991, 6-3) to obtain the joint posterior of (PD, PFA): p(p D, p FA|ˆy, z; ψ, p D, p FA) ( ˆN1/N p FA)( ˆN1/N p D) (p D p FA)3 ˆ N1/N p FA p D p FA p D µU 2 ˆ N1/N p FA p D p FA (1 p D) µV 2 0 p D 1, 0 p FA 1, p D = p FA, (17) where the moments of U and V are computed from (11) (16). Clearly, the posterior is non-Gaussian. This expression is well-defined except when p D = p FA, which is the chance line in ROC space (see Saito and Rehmsmeier, 2015). Although we have not succeeded in finding a closed-form expression for limp D p FA p(p D, p FA|ˆy, z; ψ, p D, p FA), we do not consider the lack of this limit to be problematic. First, Section 2.3.3 explains how we can use sampling to approximate the posterior of (PD, PFA) without using (17). Second, a binary classifier is only useful if it operates far from the chance line, so one will likely only be interested in classifiers with negligible probability density near the chance line. Finally, in Section 2.4.1, we present an estimation algorithm that only uses the marginal posteriors of PD and PFA and does not require (17). 7. In general, the distribution of a ratio of jointly normal RVs cannot be approximated well by a normal distribution. However, Marsaglia (2006) and the appendix also provide a test for when a normal approximation is reasonable. For P-R analysis, PREC and REC are also functions of U and V . We apply the transformation of functions of RVs to obtain the joint posterior of (PREC, REC): p(prec, rec|ˆy, z; ψ, p D, p FA) rec2 1 2πσUσV N prec µU 2 N prec 1 rec 0 < rec 1, 0 prec 1, (18) with the moments of U and V computed from (11) (16). This distribution is also non Gaussian. The chance line in P-R space is prec = π(1) (see Saito and Rehmsmeier, 2015), and it poses no difficulties. L Hˆopital s rule can be applied to show that limrec 0 p(prec, rec| ˆy, z; ψ, p D, p FA) = 0, so this expression is well-defined over the entire unit square. 2.3.3 Posteriors via Sampling We can also use sampling to approximate the posteriors. We generate M length-N realizations {y(m)}M m=1 of the correct-label RVs Y . For each realization y(m), each y(m) i is drawn B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p D, p FA) using (10). For each y(m), we compute the desired empirical metric (scalar or joint), which yields M realizations of the metric RV. The approximate posterior can then be computed as a one-dimensional or two-dimensional histogram. 2.4 Empirical Bayes Estimation of Operating-Point Parameters (MMSE Testing) The validity of the posteriors depends on how well we can estimate the moments of U and V , which in turn depend upon the OP parameters ( p D, p FA). We present two iterative algorithms for finding the MMSE estimate of the OP parameters, so we refer to this approach as MMSE testing. The algorithms belong to the class of empirical Bayes estimators. Standard Bayesian estimation handles unknown nuisance variables by specifying a prior for them and marginalizing them out to obtain the posterior estimate of the estimand. In contrast, empirical Bayes estimation does not assume a prior;8 it instead consults the available data to estimate nuisance variables (Casella, 1992), often using MAP or ML estimation. It alternates between estimating the nuisance variables and the estimand. This iterative process successively improves each estimate and is similar in spirit to the EM algorithm of Dempster et al. (1977). In the proposed algorithms, the OP parameters are the nuisance variables, the estimand consists of the probability-of-detection and probability-of-false alarm RVs, and MMSE estimation is employed. Section 2.4.4 explores the relationship between these algorithms and the EM algorithm. 2.4.1 Motivation Suppose that, on the jth iteration, we have the previous OP parameters p(j 1) D , p(j 1) FA , along with {ˆy, z, ψ, π}. The preceding results enable us to get the moments of U and 8. As a result, empirical Bayes methods are sometimes said not to be fully Bayesian. On Truthing Issues in Supervised Classification V conditioned on {ˆy, z, ψ, p(j 1) D , p(j 1) FA } and then to get the approximate posterior of (PD, PFA). We can improve on the OP parameters by updating p(j) D , p(j) FA with an optimal estimate of (PD, PFA) given p(j 1) D , p(j 1) FA and {ˆy, z, ψ, π}. We can repeat this procedure until the OP parameters converge. We adopt the MMSE criterion from Bayesian estimation theory. Let h( ˆY , Z) = (h D( ˆY , Z), h FA( ˆY , Z)) be an estimator of (PD, PFA) given ˆY , Z. The MSE of such an estimator is E h D( ˆY , Z) PD 2 + E h FA( ˆY , Z) PFA 2 , and the MMSE estimator is defined as h MMSE = arg minh E h D( ˆY , Z) PD 2 + E h FA( ˆY , Z) PFA 2 . The standard result is that the MMSE estimator is the conditional mean (see (1) or Appendix C), so h MMSE( ˆY , Z) = Ep(p D,p FA|ˆy,z;ψ, p(j 1) D , p(j 1) FA ) PD, PFA ˆY , Z; ψ, p(j 1) D , p(j 1) FA , where the relevant distribution is shown as a subscript (cf. Cover and Thomas (1991, Equation 2.2)). Therefore, given ˆY = ˆy and Z = z, we set p(j) D = E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA , p(j) FA = E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA . Each conditional mean involves a marginal posterior. For example, E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA = Z 1 0 p D p p D ˆy, z; ψ, p(j 1) D , p(j 1) FA dp D. (19) Thus, as mentioned previously, the lack of a limit along the chance line in (17) does not prevent us from calculating the conditional mean of (PD, PFA). We can compute the conditional mean of PD or PFA in three ways. First, if the MAC is satisfied, we can use the closed-form expression in Marsaglia (2006) or Appendix B. Second, we can perform one-dimensional numerical integration. Third, we can use sampling, as described in Section 2.3.3. 2.4.2 Choice of Initial Operating-Point Parameters This section discusses the initial OP parameters p(0) D , p(0) FA . We take a Bayesian viewpoint with initial OP RVs P(0) D , P(0) FA and a non-informative prior namely, the uniform distri- bution over the unit square. Then the initial OP parameters p(0) D , p(0) FA are given by the conditional mean, which is just (1/2, 1/2). Consequently, p(ˆyi|yi; p D = 1/2, p FA = 1/2) equals 1/2 for any combination of ˆyi and yi, so the testing class posterior (10) reduces to p(yi|ˆyi, zi; ψi, p D = 1/2, p FA = 1/2) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (20) The right-hand side matches the ordinary class posterior, which does not condition on ˆyi, p D, and p FA: p(yi|zi; ψi) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (21) Thus, choosing p(0) D , p(0) FA = (1/2, 1/2) has the same effect as if we had not included the predicted label at all. Figure 3: Graphical model of iterative estimation for testing. The common RVs U and V depend on separate partitions of Y and ˆy. 2.4.3 Estimation Algorithms The recursive relationship between successive OP parameters leads to two iterative empirical Bayes estimators for MMSE testing, namely Algorithms 1 and 2. Figure 3 depicts the graphical model and the procedure. On each iteration, each algorithm updates its OP parameters by computing the MMSE estimate the conditional mean of (PD, PFA) given ˆy, z, ψ, and the previous OP parameters. On the next iteration, the MMSE estimate provides the new OP parameters. Algorithm 1 takes advantage of the fact that PD and PFA are both ratios of jointly approximately normal RVs, so their conditional means can be computed without sampling. It begins the jth iteration by using ˆyi, zi, ψi, p(j 1) D , and p(j 1) FA to compute pi and qi for i = 1, . . . , N, and then it computes the moments of U and V . To get the improved OP parameters p(j) D , p(j) FA , it uses the closed-form approximation for the conditional mean of PD or PFA if the MAC holds, and it falls back to numerical integration if not. To prevent numerical degeneracy, the new OP parameters are clipped to [α, 1 α]. The procedure repeats until the maximum absolute difference between successive estimates falls below some tolerance or a maximum number of iterations jmax is reached. It then returns the final OP parameters ( p D, p FA) = p(j) D , p(j) FA . Algorithm 2 uses sample realizations of the correct-label RVs Y to approximate the conditional means of PD and PFA. On the jth iteration, it generates M length-N realizations y(m), m = 1, . . . , M, by drawing each y(m) i B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA ) using (10). Then it computes the empirical ROC operating point p(j,m) D , p(j,m) FA for each realization. The conditional mean of (PD, PFA) is obtained by averaging the M operating points, which yields the new OP parameters p(j) D , p(j) FA . On Truthing Issues in Supervised Classification Algorithm 1 MMSE testing with empirical Bayes estimation of ( p D, p FA) via ratios of jointly normal RVs. 1: function Empirical Bayes Via Ratios(ˆy, z, ψ, π) 2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD 3: while j < jmax and MAD tol do 5: for i 1 : N do 6: Compute pi and qi from ˆyi, zi, ψi, p(j 1) D , and p(j 1) FA Eqs. (10), (11), (14) 7: Use {pi}N i=1 and ˆy, z, ψ, p(j 1) D , p(j 1) FA to compute µU, σ2 U Eqs. (12), (13) 8: Use {qi}N i=1 and ˆy, z, ψ, p(j 1) D , p(j 1) FA to compute µV , σ2 V Eqs. (15), (16) 9: if mean approximation for PD from µU, σ2 U, µV , σ2 V is valid then Marsaglia (2006), App. B 10: p(j) D E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA from mean approximation 12: Get p p D ˆy, z; ψ, p(j 1) D , p(j 1) FA from µU, σ2 U, µV , σ2 V Marsaglia (1965, 2006), App. B 13: p(j) D E PD ˆy, z; ψ, p(j 1) D , p(j 1) FA by 1-D numerical integration 14: if mean approximation for PFA from µU, σ2 U, µV , σ2 V is valid then 15: p(j) FA E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA from mean approximation 17: Get p p FA ˆy, z; ψ, p(j 1) D , p(j 1) FA from µU, σ2 U, µV , σ2 V 18: p(j) FA E PFA ˆy, z; ψ, p(j 1) D , p(j 1) FA by 1-D numerical integration 19: Clip p(j) D and p(j) FA to [α, 1 α] 20: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA 21: ( p D, p FA) p(j) D , p(j) FA Final OP parameters 22: return ( p D, p FA) Both algorithms used tol = 10 3, jmax = 30, α = 10 3, and Algorithm 2 used M = 5000. They also allow for some correctly-labeled samples. If yi is known or can be recovered exactly from zi, then pi, qi, or p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA ) equals 1(yi = 1). Algorithm 1 will add the proper constants to the moments of U and V , and Algorithm 2 will always draw the proper realization of the correct-label RV. 2.4.4 Relation to EM Algorithm and Remarks on Convergence Like many empirical Bayes methods, our iterative algorithms are similar to the EM algorithm but differ in that the hidden or latent variables namely, the correct labels are RVs instead of non-random quantities. The above relationships let us obtain the approximate conditional means of PD and PFA directly, so our E-step does not employ an auxiliary function like the EM algorithm does. Also, the other latent variables namely the OP parameters ( p D, p FA) are determined using MMSE rather than ML estimation, so our Mstep performs minimization of the MSE rather than maximization of an auxiliary function. Algorithm 2 MMSE testing with empirical Bayes estimation of ( p D, p FA) via sampling. 1: function Empirical Bayes Via Sampling(ˆy, z, ψ, π, M) 2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD 3: while j < jmax and MAD tol do 5: for m 1 : M do Generate M length-N realizations of Y 6: for i 1 : N do Generate mth realization y(m) of Y 7: Draw y(m) i B p Y | ˆYi,Zi(1|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA ) Eq. (10) p(j,m) D , p(j,m) FA empirical metrics for {ˆy, y(m)} p(j) D , p(j) FA mean of p(j,m) D , p(j,m) FA M m=1 Empirical conditional mean 10: Clip p(j) D and p(j) FA to [α, 1 α] 11: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA 12: ( p D, p FA) p(j) D , p(j) FA Final OP parameters 13: return ( p D, p FA) We speculate that, regardless of the initial OP parameters, the algorithms should converge to the global optimum ( p D, p FA), the minimum of the MSE E PD p D 2 + PFA p FA 2 ˆy, z; ψ, p D, p FA . The argument is based on the following reasoning: The estimand (PD, PFA) is bounded; generally, the conditional mean of (PD, PFA) is unique for any OP parameters p(j 1) D , p(j 1) FA ; and the MSE decreases on each iteration, unless p(j 1) D , p(j 1) FA is already at the global optimum. Therefore, the algorithms should con- verge to the global optimum regardless of p(0) D , p(0) FA . This property contrasts with the EM algorithm, which can only be said to converge to a local optimum, and this local optimum may vary significantly depending on initialization (Koller and Friedman, 2009, 19.2). We can also view the algorithms as seeking the unique fixed point p D, p FA such that E[PD, PFA|ˆy, z; ψ, p D, p FA] = p D, p FA . A technicality prevents us from making a stronger statement regarding convergence. Both PD and PFA are approximated as ratios of jointly normal RVs, for which the mean and variance do not exist (see Marsaglia (1965, 2006) and Appendix B). As a result, uniqueness of the MMSE estimator of (PD, PFA) cannot be guaranteed, so a strict proof of convergence or existence of a unique fixed point may not be possible. This obstacle might mainly be a theoretical concern, and, except for a few pathological situations, the algorithms might always converge in practice. Section 5.2.3 presents some empirical evidence of global convergence. 2.5 Optimal Estimation of Metric RVs Algorithms 1 and 2 each produce final OP parameters ( p D, p FA). It still remains to calculate optimal estimates of the metric RVs given {ˆy, z, ψ, π, p D, p FA}. We do so by computing the moments of the common RVs U and V (Section 2.2.2), and then by using the estimated posteriors of the metric RVs (Section 2.3). Given the approximate posteriors, optimal estimation is a simple matter; we briefly discuss it for completeness. On Truthing Issues in Supervised Classification We consider two point estimates and one range estimate. The first point estimate is the conditional mean, which is optimal in the MMSE sense. The accuracy and precision RVs are approximately normal, so it is just the mean. For REC or PD, PFA, or Fβ, we can use the mean approximation if the MAC holds or resort to numerical integration or sampling. For P-R or ROC analysis, we can just use the conditional means of the individual elements. The second point estimate is the MAP estimate, the most-probable value of the metric RV. It can be obtained by computing the posterior at a fine resolution and finding the peak. The optimal range estimate is the p%-credible region, which specifies a region such that, with probability p/100, the metric RV lies inside the region. The credible region is not necessarily unique, but one way to obtain a reasonable credible region is to apply binary search to find a threshold c such that the numerical integral of the posterior over the points where the posterior exceeds c is equal to p/100 within some tolerance. Finally, we reiterate that Algorithms 1 and 2 and the subsequent optimal-estimation calculations never attempt to estimate the correct-label RVs Y . Making a hard decision about Y at any point could introduce errors into downstream processing, as remarked in Section 1.4. Our approach avoids doing so while using all available information to produce its estimates.9 2.6 Alternative Testing Approaches For comparison purposes, we mention some alternative approaches to testing. We first present four suboptimal methods, and then we describe a fully Bayesian approach, which is optimal but handles the OP parameters differently than the empirical Bayes approach. 2.6.1 Suboptimal Testing Approaches The first suboptimal approach is to estimate the correct labels and then treat the estimates as if they were correct to improve on the OP parameters ( p D, p FA). We present an iterative method in Algorithm 3 and denote its estimate of the correct-label RVs Y as ˇy. On the jth iteration, the algorithm uses the previous OP parameters ( p D, p FA) = p(j 1) D , p(j 1) FA and estimates Y according to the MPE criterion. Let hi( ˆYi, Zi) be an esti- mator of Yi given ˆYi, Zi, and parameters ψi, p(j 1) D , p(j 1) FA , which are suppressed because they are not RVs. The probability of error of hi is perror(hi( ˆYi, Zi), Yi) = E[1(hi( ˆYi, Zi) = Yi)] = P ˆyi,zi P y 1(hi(ˆyi, zi) = yi)p(ˆyi, zi, yi). This approach seeks (h MPE 1 , . . . , h MPE N ) = arg min(h1,...,h N) N 1 PN i=1 perror(hi( ˆYi, Zi), Yi). The solution consists of finding the MPE estimator for each individual Yi, and the standard result is that the MAP estimator minimizes the probability of error (see (2) or Appendix D). Thus, the algorithm computes ˇy(j) using: ˇy(j) i = arg max y Y p y|ˆyi, zi; ψi, p(j 1) D , p(j 1) FA , i = 1, . . . , N. (22) The algorithm then uses these estimates to compute the empirical probabilities of detection and false alarm p(j) D and p(j) FA. This approach is similar to Algorithms 1 and 2 except 9. If one wants to estimate Y , then one can do so using (10) after ( p D, p FA) has been determined. However, subsequently using this estimate to compute testing metrics runs counter to our approach. Algorithm 3 Suboptimal estimation of ( p D, p FA) by estimating the correct-label RVs Y . 1: function Estimate OPParameters Via Estimated Correct Labels(ˆy, z, ψ, π) 2: Initialize p(0) D 0.5, p(0) FA 0.5, j 0, MAD 3: while j < jmax and MAD tol do 5: for i 1 : N do 6: ˇy(j) i arg maxy Y p Yi| ˆYi,Zi y|ˆyi, zi; ψi, p D = p(j 1) D , p FA = p(j 1) FA Eq. (22) 7: Compute empirical p(j) D and p(j) FA from ˆy and ˇy(j) = ˇy(j) i N i=1 8: Clip p(j) D and p(j) FA to [α, 1 α] 9: MAD max p(j) D p(j 1) D , p(j) FA p(j 1) FA 10: ( p D, p FA) p(j) D , p(j) FA Final OP parameters 11: return ( p D, p FA) that it makes a hard decision about the correct labels on every iteration. This approach leverages the testing model and minimizes a well-defined penalty criterion, but it is suboptimal according to Section 1.4 because it estimates Y when it should estimate (PD, PFA). We make no claims about its convergence properties. We again used tol = 10 3, jmax = 30, α = 10 3. Like Algorithms 1 and 2, Algorithm 3 only estimates the final OP parameters ( p D, p FA). Following this step, this approach applies (22) to get the final MPE or MAP estimate ˇy of Y given {ˆy, z, ψ, p D, p FA}. Final estimates of the metrics are obtained by treating ˇy as if it were correct and computing empirical metrics for {ˆy, ˇy}. The second suboptimal approach neglects estimation theory completely; it just merges the metrics calculated for each individual labeler s labels. For each t T , it treats the labels from the tth labeler as if they were correct, and it computes the empirical metric for these samples only. Doing so produces T instances of a metric. The final estimate is then obtained using a centrality statistic, such as the mean or median, of the T instances. The third and fourth suboptimal approaches use the noisy-label conditional distribution p(zi|yi; ψi) instead of the testing model (7), so they omit p(ˆyi|yi; p D, p FA) and ( p D, p FA). They are suboptimal because they neglect the relationship between ˆYi and Yi and thus fail to exploit the predicted labels ˆy fully. The third approach uses MMSE estimation but replaces p(yi|ˆyi, zi; ψi, p D, p FA) with p(yi|zi; ψi) in (11) (16). The fourth approach estimates the correct labels and makes the same substitution in (22). (Equivalently, one can set jmax = 0 in Algorithm 1, 2, or 3, so ( p D, p FA) = (1/2, 1/2) and (10) reduces to (21).) 2.6.2 Fully Bayesian Estimation Another alternative is a fully Bayesian approach, which treats ( p D, p FA) as an unobserved realization of nuisance-parameter RVs ( PD, PFA) with prior p( p D, p FA). A uniform distribution over the unit square serves as a non-informative prior. This approach estimates each metric RV by marginalizing out ( PD, PFA), so the estimate is not conditioned on a particular instance of ( p D, p FA), and no iteration is required. For example, consider MMSE estimation of the accuracy RV. For an estima- On Truthing Issues in Supervised Classification tor h( ˆY , Z) of ACC, the MSE is E h( ˆY , Z) ACC 2 , and the MMSE estimator is h MMSE = arg minh E h( ˆY , Z) ACC 2 . The standard result (see (1) or Appendix C) is that the solution is the conditional mean, so the fully Bayesian estimate of ACC is Ep( p D, p FA)[ACC | ˆy, z; ψ] = Z 1 0 E[ACC | ˆy, z, p D, p FA; ψ] p( p D, p FA) d p D d p FA. This estimate minimizes the MSE under the assumption that the OP parameters are RVs ( PD, PFA) rather than non-random quantities. In contrast, the empirical Bayes approach does not view ( p D, p FA) as realizations of RVs; it treats them as unknown, non-random parameters of p(ˆyi|yi; p D, p FA), so it does not marginalize them out, and its estimates depend on them. For example, it computes E PD ˆy, z; ψ, p D, p FA in (19), where the moments of U and V are conditional on {ˆy, z, ψ, p D, p FA}, which depends on the specific choice of ( p D, p FA). Both approaches are optimal according to Section 1.4, but they differ in their treatment of the OP parameters. Marginalization over ( PD, PFA) is a form of averaging, so the fully Bayesian approach computes an average grade. The empirical Bayes approach is more faithful to the grading analogy at the beginning of this section: The quiz has a particular grade, represented by ( p D, p FA), rather than an average grade. Nevertheless, experimentation is required to compare the performance of these approaches; it appears in Section 5.2.1. 2.7 MMSE Testing for Multi-Class Classification This section discusses the extension of MMSE testing to multi-class classification (C > 2) and some of the challenges associated with it. An immediate challenge is that it might be difficult to model p(z|y; ψ) or estimate ψ and π, but the related work described in Sections 1.2.1 and 1.2.2 is very encouraging. We begin by introducing the C C conditional confusion matrix. We use Kemp to denote its empirical form, and we indicate an element of it by Kemp n|ℓ to emphasize its conditional nature.10 Then Kemp n|ℓis defined as Kemp n|ℓ= no. of times ˆyi = n and yi = ℓ no. of times yi = ℓ , ℓ, n Y. (23) The RV form of the matrix is K, with 1 N PN i=1 1(ˆyi = n and Yi = ℓ) 1 N PN i=1 1(Yi = ℓ) , ℓ, n Y. 2.7.1 Empirical Bayes via Sampling (Multi-Class Classification) We make the same assumptions as for binary classification in Section 2.1. We are given {z, ˆy, ψ, π}, each Zi has conditional distribution p(zi|yi; ψi), and the samples are independent. A graphical model appears in Figure 4. Each predicted label ˆyi is a realization of an RV ˆYi, which for multi-class classification has conditional distribution p(ˆyi|yi; K), where 10. We use non-standard, zero-based, column-major matrix indexing; Kemp n|ℓcorresponds to Kemp(ℓ+1, n+1) in standard matrix notation. Figure 4: Multi-class classification: Graphical model of iterative estimation for testing. K is a matrix of conditional confusion-matrix parameters that are analogous to the OP parameters, and therefore p(ˆyi|yi; K) = Kˆyi|yi. (24) Each row of K must form a (C 1)-probability simplex: For each ℓ Y, Kn|ℓ 0, n Y, and P n Y Kn|ℓ= 1. When C = 2, it was sufficient to consider a single element in each row of K, and we used PFA = K1|0 and PD = K1|1; this property enabled us to use the mean approximations or integrate the marginal posteriors of PD and PFA in Algorithm 1. This technique is not viable for C > 2 because, for each row of K, we would have to determine the joint posterior of (C 1) RVs, and we would have to do (C 1)-dimensional numerical integration. On the other hand, for moderate values of C and a class prior that is not highly skewed, we can readily extend the empirical Bayes sampling procedure of Algorithm 2 to multi-class classification. Pseudocode appears in Algorithm 4. Using the parameter matrix K, we draw M realizations of Y according to the multi-class testing class posterior, which is obtained in the same way as (10): p(yi|ˆyi, zi; ψi, K) = π(yi)p(ˆyi, zi|yi; ψi, p D, p FA) P y i Y π(y i)p(ˆyi, zi|y i; ψi, K) = π(yi)p(ˆyi|yi; K)p(zi|yi; ψi) P y i Y π(y i)p(ˆyi|y i; K)p(zi|y i; ψi) (a) = π(yi) Kˆyi|yi p(zi|yi; ψi) P y i Y π(y i) Kˆyi|y i p(zi|y i; ψi) , (25) where (a) is from (24). We compute the empirical conditional confusion matrix for each length-N realization of Y , and we average the M matrices to get an estimate of the conditional mean of K given {ˆy, z, ψ, K}. For Algorithm 4, we set tol = 10 3, jmax = 50, α = 10 3, and M = 2500 C. On Truthing Issues in Supervised Classification Algorithm 4 MMSE testing for multi-class classification with empirical Bayes estimation of K via sampling. 1: function Multi Class Empirical Bayes Via Sampling(ˆy, z, ψ, π, M) 2: Initialize K(0) n|ℓ= 1/C for (n, ℓ) Y Y, and j 0 5: for m 1 : M do 6: for i 1 : N do 7: Draw y(m) i p Y | ˆYi,Zi(y|ˆyi, zi; ψi, K(j 1)) Eq. (25) 8: K(j,m) empirical matrix from {ˆy, y(m)} Eq. (23) 9: K(j) mean of { K(j,m)}M m=1 10: Adjust each row of K(j) so each element lies in [α, 1 α] and row sums to one 11: until | K(j) K(j 1) |max < tol or j jmax 12: K K(j) Final estimate of K 13: return K This method allows for some correctly-labeled samples. If the correct label for the ith sample is known to be equal to ℓ, then p Y | ˆYi,Zi(y|ˆyi, zi; ψi, K(j 1)) = 1(y = ℓ), and sampling will always produce ℓwhen drawing a realization for this sample. The initial parameter matrix K(0) is the result of using a non-informative prior for each row of the matrix namely, a Dirichlet distribution of order C with all concentration parameters equal to unity. The algorithm returns K, the approximate conditional mean of K. The elements of K are bounded and MMSE estimation is employed, so the remarks in Section 2.4.4 on convergence may also be applicable. 2.7.2 Posteriors of Metric RVs (Multi-Class Classification) Once the estimate K is available, we can obtain the posteriors of different metric RVs. Accuracy is defined in the same way for binary and multi-class classification. Empirical accuracy is acc = (no. of times ˆyi = yi)/N, and its RV form is ACC = N 1 PN i=1 1(Yi = ˆyi). In the multi-class case, each 1(Yi = ˆyi) is a Bernoulli RV with success probability p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K). The Bernoulli RVs are independent, so ACC is approximately normal by the CLT. In addition, the ordinary C C confusion matrix has empirical form Cemp with11 Cemp n,ℓ= (no. of times ˆyi = n and yi = ℓ) and RV form C with i=1 1(ˆyi = n and Yi = ℓ) i:ˆyi=n 1(Yi = ℓ). 11. Cemp n,ℓcorresponds to Cemp(ℓ+ 1, n + 1) in standard matrix notation. Hence, matrix element Cn,ℓis approximately normally distributed with E[Cn,ℓ|ˆy, z; ψ, K] = P i:ˆyi=n p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K) and var(Cn,ℓ|ˆy, z; ψ, K) = P i:ˆyi=n p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K) 1 p Yi| ˆYi,Zi(ˆyi|ˆyi, zi; ψi, K) . This result applies to the posterior of each individual element of C; it does not describe the joint posterior of multiple elements of C. Finally, we remark that one could apply the ratio-of-normals procedure to compute the approximate posterior of Kn|ℓ, an individual element of K. 3. Training with Truthing Issues: Learning from Noisy Labels This section addresses training with truthing issues. Recall that we use g(x; θ) to denote a classifier or predictive model with model parameters θ. Given x, the classifier calculates a statistic s = g(x; θ), which contains the model s calculation of the chance that x belongs to each class, and then it applies a decision rule to s to select ˆy. In binary classification, s is typically a scalar, and ˆy is selected by comparing s to a threshold τ; i.e., ˆy = 1(s > τ). In multi-class classification, s is a usually vector (s0, s1, . . . , s C 1), and ˆy corresponds to the index of the largest element of s; i.e., ˆy = arg maxy Y sy . Training is the process of learning θ from the training set. In the ideal case, the training set is {x, y}, and training seeks the parameters θ that will produce the most accurate predictions when the trained model is applied to as-yet-unseen samples. Training is usually posed as an optimization problem: θ = arg min θ Jideal(θ; x, y) = arg min θ Jpri(θ; x, y) + λJreg(θ) , (26) where the primary term Jpri(θ; x, y) imposes a cost or penalty for differences between the correct labels yi y and either g(xi; θ) or g(xi; θ), xi x; Jreg(θ) is a regularization term that reduces overfitting to the training set and improves generalization to unseen samples; and the weight λ 0 controls the level of regularization. The primary term depends on the predictive model, and the regularization term depends on the choice of regularization, such as L2 regularization: Jreg(θ) P j θ2 j, or L1 regularization: Jreg(θ) P j |θj|. 3.1 Training Assumptions In the presence of truthing issues, the correct labels are not available, and the training set becomes {x, z, ψ, π}. With ψ and π given, the noisy-label model p(z|y; ψ)π(y) is known for each sample. We make the usual assumption of independent samples, so p(z|y, x; ψ) = QN i=1 p(zi|yi, xi; ψi). Next, we assume that, given yi and ψi, the noisy-label RVs Zi do not depend on the feature vector xi, which gives p(zi|yi, xi; ψi) = p(zi|yi; ψi). Thus, p(z|y, x; ψ) = i=1 p(zi|yi; ψi). (27) This expression allows for conditionally dependent labelers. We limit our attention to training that can be expressed as in (26). Then the optimization problem for training given {x, z, ψ, π} becomes θ = arg min θ J(θ; x, z, ψ, π) = arg min θ Jpri(θ; x, z, ψ, π) + λJreg(θ) , (28) On Truthing Issues in Supervised Classification where only the primary term has been modified to become Jpri(θ; x, z, ψ, π). Since the regularization term is unchanged, we focus on the primary term below. 3.2 Unified View We present a unified view of training with truthing issues that describes general approaches for training probabilistic or non-probabilistic predictive models. Unified View of Training with Truthing Issues 1. For probabilistic models, keep the optimality principle from ideal training, and modify the primary term to account for {x, z, ψ, π} rather than {x, y}: (a) For ML training, which would ideally maximize the likelihood function p(y|x; θ) or p(y, x; θ), instead use p(z|x; ψ, θ) or p(z, x; ψ, θ), respectively. (b) For MAP training, which would ideally maximize the posterior distribution p(θ|y, x), instead use p(θ|z, x; ψ). 2. For non-probabilistic models that use a loss function and would ideally minimize the empirical risk Remp(θ; x, y), perform MMSE training: retain the loss function and minimize ˆRMMSE(θ; x, z), the MMSE estimate of the empirical-risk RV given {x, z, ψ, π}. The unified view is simple, elegant, and intuitively appealing. For probabilistic models, training remains true to the original optimality principle from ideal training. For nonprobabilistic models, training retains the original loss function from ideal training and optimizes the MMSE estimate of the empirical risk. Each approach is optimal according to Section 1.4. The estimands are appropriate: Training for probabilistic models targets the likelihood function or posterior, and training for non-probabilistic models targets the empirical risk. The use of the likelihood function, posterior, or MMSE estimator means that all available information in {x, z, ψ, π} is fully exploited. The penalty or utility criterion is clearly defined. None of the methods try to estimate the correct labels. The unified view also organizes some of the related work. The training method proposed by Raykar et al. (2010) corresponds to Item 1a. Ratner et al. (2016, 2017) and Khetan et al. (2018) proposed training approaches that are equivalent to Item 2, although they did not arrive at them by applying MMSE estimation. We assume that p(z|y, ψ)π(y) is known or has already been learned, but Raykar et al. (2010) and Khetan et al. (2018) have demonstrated that one can employ these training approaches while jointly learning the noisy-label model. The next two sections derive the likelihood functions, posteriors, and MMSE estimator that provide the modified primary terms. The derivations are much simpler than those for testing because there are no predicted labels ˆy to consider. Indeed, the derivations amount to marginalizing over the correct labels. Marginalization introduces some implicit regularization by accounting for the uncertainty of the correct labels.12 12. The author thanks one of the anonymous reviewers for this observation. 3.3 Probabilistic Predictive Models (ML or MAP Training) Here, we address predictive models based on a probabilistic viewpoint. There are two aspects to consider. One aspect is the form of the predictive model: discriminative or generative. A discriminative model assumes a form for the posterior p(y|x; θ) and directly uses g(x; θ) = p(y|x; θ). A generative model assumes a form for the joint distribution p(y, x; θ) and uses g(x; θ) = p(y, x; θ)/p(x; θ) p(y, x; θ). Many generative models apply the factorization p(y, x; θ) = p(x|y; θ)p(y; θ) and use g(x; θ) = p(x|y; θ)p(y; θ)/p(x; θ) p(x|y; θ)p(y; θ). The choice of g(x; θ) determines Jpri(θ; x, y). Logistic regression and neural networks are common examples of discriminative models, and na ıve Bayes is a classic example of a generative model (see Ng and Jordan, 2001). The other aspect is the treatment of the predictive-model parameters θ: non-random or random. When the parameters are non-random, ML training is employed. For a discriminative model, this means finding θ to maximize the likelihood function p(y|x; θ); for a generative model, this means finding θ to maximize the likelihood function p(y, x; θ). When the parameters are RVs Θ with prior p(θ), MAP training is used. This means finding θ that maximizes the posterior p(θ|y, x); the posterior is expanded differently depending upon whether the model is discriminative or generative. By considering both of these aspects, we can derive the primary term the likelihood function or posterior in the training objective function. The regularization term remains unchanged since it does not involve the correct labels. 3.3.1 Example Case: Discriminative Model, Non-Random Parameters To illustrate the approach, we consider a discriminative model with non-random parameters θ. In the ideal case, we have access to {x, y} and use ML training to find θ to maximize the likelihood function p(y|x; θ), which is p(y|x; θ) = i=1 p(yi|xi; θ). (29) Equivalently, we can minimize the normalized negative log-likelihood function, which gives Jpri(θ; x, y) = 1 N log p(y|x; θ) i=1 log p(yi|xi; θ). (30) On Truthing Issues in Supervised Classification With truthing issues, we seek θ to maximize p(z|x; ψ, θ), given by p(z|x; ψ, θ) = i=1 p(zi|xi; ψi, θ) yi Y p(yi, zi|xi; ψi, θ) yi Y p(zi|yi, xi; ψi, θ)p(yi|xi; ψi, θ) yi Y p(zi|yi; ψi)p(yi|xi; θ), (31) where (a) marginalizes over the correct labels, and (b) is because the noisy-label RVs do not depend on the feature vector xi or the parameters θ and because the classifier makes its prediction based only on xi and θ. We can equivalently minimize Jpri(θ; x, z, ψ, π) = 1 N log p(z|x; ψ, θ) yi Y p(zi|yi; ψi)p(yi|xi; θ). (32) Raykar et al. (2010) proposed (31) as part of joint estimation of p(z|y) and logistic regression training. 3.3.2 All Cases The top four rows of Table 9 summarize the results for all possible cases; derivations appear in Appendices E.1, E.2, and E.3. The only difference between the ideal case and truthing issues is that the latter marginalizes over the possible values of the correct-label RV Yi. The other differences are the same as in the ideal case. Switching from non-random to random parameters Θ introduces another factor for the parameter prior p(θ) and changes the optimization goal from ML to MAP. Switching from a discriminative model to a generative model changes the function being maximized from one involving p(yi|xi; θ) to one involving p(xi|yi; θ)π(yi) = p(xi, yi; θ). 3.4 Non-Probabilistic Predictive Models (Empirical Risk Minimization) Some classifiers, like support vector machines, do not have a probabilistic formulation, and sometimes the theoretical origins of a probabilistic model are not the main focus. In such cases, ideal training applies the empirical risk minimization (ERM) principle (Vapnik, 1991) and seeks θ to minimize the empirical risk or average loss Jpri(θ; x, y) = Remp(θ; x, y) = 1 i=1 L( g(xi; θ), yi), (33) Pred. Par./ Labels and Training Set Model Crit.* Ideal: Use {x, y} Noisy: Use {x, z, ψ, π} NR/ Likelihood p(y|x; θ) Likelihood p(z|x; ψ, θ) ML = QN i=1 p(yi|xi; θ) = QN i=1 P yi Y p(zi|yi; ψi)p(yi|xi; θ) RV/ Posterior p(θ|y, x) Posterior p(θ|z, x; ψ) Discriminative MAP p(θ) QN i=1 p(yi|xi, θ) p(θ) QN i=1 P yi Y p(zi|yi; ψi)p(yi|xi, θ) NR/ Likelihood p(y, x; θ) Likelihood p(z, x; ψ, θ) ML = QN i=1 p(xi|yi; θ)π(yi) = QN i=1 P yi Y p(xi|yi; θ)p(zi|yi; ψi)π(yi) RV/ Posterior p(θ|y, x) Posterior p(θ|z, x; ψ) MAP p(θ) QN i=1 p(xi|yi, θ)π(yi) p(θ)QN i=1 P yi Y p(xi|yi, θ)p(zi|yi; ψi)π(yi) NR/ Empirical risk Remp(θ; x, y) MMSE estimate of empirical-risk RV ERM = N 1 PN i=1 L( g(xi; θ), yi) ˆRMMSE(θ; x, z) Nonprobabilistic = N 1 PN i=1 P yi Y L( g(xi; θ), yi)p(yi|zi; ψi) * Par./Crit. indicates the treatment of the predictive-model parameters θ (NR: non-random, RV: random variables) and the optimality criterion (ML: maximum likelihood, MAP: maximum a posteriori, ERM: empirical risk minimization). For discriminative or generative models, the primary term equals the negative normalized log-likelihood or log-posterior; for example, the top-left entry corresponds to Jpri(θ; x, y) = N 1 log p(y|x; θ). For non-probabilistic models, the primary term appears in the bottom row. Table 9: Comparison of training objective functions for predictive models. On Truthing Issues in Supervised Classification where L(s, y) is a loss function that penalizes deviations between s = g(x; θ) and the correct label y. Examples for binary classification include support vector machines, which may be trained with the hinge loss, and logistic regression, which corresponds to a linear model trained with the logistic loss. As an example of multi-class classification, neural network training often uses the output vector from the network s final fully-connected layer for s and applies the cross-entropy loss. In this section, we assume that θ is not random, since random parameters would require probabilistic modeling. As before, the regularization term is unaffected by truthing issues. 3.4.1 MMSE Estimation of the Empirical-Risk RV (MMSE Training) With truthing issues, the correct-label RVs are not observed, and the loss functions and empirical risk are functions of them, so their values are uncertain. Let L( g(xi; θ), Yi), i = 1, . . . , N, be the ith loss-function RV, a function of the correct-label RV Yi, and from (33), write the empirical-risk RV as Jpri(θ; x, Y ) = R(θ; x, Y ) = 1 i=1 L( g(xi; θ), Yi). (34) We reason that, if a good in-sample estimate of the empirical-risk RV can be obtained for the training set {x, z, ψ, π}, then training that minimizes this estimate should perform well. We therefore focus on estimating the empirical-risk RV. Uncertainty about the empirical-risk RV arises because the correct-label RVs Y are unobserved. The feature vectors x do not contribute to the uncertainty because they are known. Given the noisy-label model form p(z|y; ψ)π(y), Bayes rule can provide p(y|z; ψ) but not p(y|z, x; ψ). For these reasons, when estimating the empirical-risk RV, we treat Y and Z as RVs, but not x. However, in the broader context of learning, the feature vectors may be treated as RVs. We adopt the MMSE criterion when estimating the empirical-risk RV. Denote an estimator of R(θ; x, Y ) by ˆR(θ; x, Z), where we omit ψ and π from ˆR for brevity. We set the primary term equal to the MMSE estimator of the empirical-risk RV: Jpri(θ; x, Z, ψ, π) = ˆRMMSE(θ; x, Z) = arg min ˆR E ˆR(θ; x, Z) R(θ; x, Y ) 2 . The standard result is that the MMSE estimator is the conditional mean of R(θ; x, Y ) given Z (see (1) or Appendices C and E.4), so ˆRMMSE(θ; x, Z) = Ep(y|z;ψ) R(θ; x, Y ) Z; ψ i=1 Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|Zi; ψi] (36) yi Y L( g(xi; θ), yi)p(yi|Zi; ψi). (37) The samples are independent, so R(θ; x, Y ) is approximately normally distributed by the CLT. The MMSE estimator guesses the conditional mean of this normal RV given Z. Given Z = z, the MMSE estimate of the empirical-risk RV is ˆRMMSE(θ; x, z) = Ep(y|z;ψ) R(θ; x, Y ) z; ψ i=1 Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|zi; ψi] (39) yi Y L( g(xi; θ), yi)p(yi|zi; ψi), (40) where p(yi|zi; ψi) is the training class posterior13: p(yi|zi; ψi) = π(yi)p(zi|yi; ψi) P y i Y π(y i)p(zi|y i; ψi). (41) Having found the MMSE estimator of the empirical-risk RV, we can now speak of MMSE training, which seeks θ that minimizes (40), the MMSE estimate of the empirical-risk RV. Equation (28) becomes θ = arg min θ ˆRMMSE(θ; x, z) + λJreg(θ) = arg min θ 1 N yi Y L( g(xi; θ), yi)p(yi|zi; ψi) + λJreg(θ). MMSE training handles correctly-labeled samples in a natural way. For the ith sample, if the correct label is known to be ℓ, then p(yi|zi; ψi) = 1(yi = ℓ), and the sample contributes its usual loss-function value to the empirical-risk term. The bottom row of Table 9 summarizes training for non-probabilistic models. Ratner et al. (2016, 2017) employed (39) for discriminative models including binary logistic regression and long short-term memory recurrent neural networks. Khetan et al. (2018, 4) proposed (40) for arbitrary loss functions, used it for joint estimation of p(z|y) and training, and provided performance guarantees for binary classification. Neither group of authors demonstrated the link to MMSE estimation, so this section provides another way to motivate and arrive at this result. We close this section by considering the loss function for a single sample. In (36), each term in the summation corresponds to the MMSE estimator of an individual loss-function RV L( g(xi; θ), Yi), namely ˆLMMSE( g(xi; θ), Zi) = Ep(yi|zi;ψi)[L( g(xi; θ), Yi)|Zi; ψi], i = 1, . . . , N, (42) yi Y L( g(xi; θ), yi)p(yi|Zi; ψi), i = 1, . . . , N. (43) If the loss function has properties such as non-negativity or convexity, then (43) preserves such properties because p(yi|zi; ψi) is a probability distribution. Not surprisingly, the MMSE estimator of the empirical-risk RV is just the average of the MMSE estimators of the individual loss-function RVs. 13. The training class posterior differs from the testing class posterior (10), which included the predicted labels and OP parameters. It is the same as (21) in Section 2.4.2. On Truthing Issues in Supervised Classification 3.4.2 Standard Properties Standard properties of MMSE estimators apply to ˆRMMSE(θ; x, Z); see Appendix C.1. It is unbiased in the Bayesian sense:14 E ˆRMMSE(θ; x, Z) = E R(θ; x, Y ) . (44) Likewise, the MMSE estimator of each loss function is also unbiased: E ˆLMMSE( g(xi; θ), Zi) = E[L( g(xi; θ), Yi)], i = 1, . . . , N. (45) From (44), the mean of the estimation error is zero: E ˆRMMSE(θ; x, Z) R(θ; x, Y ) = 0, and the variance of the estimation error equals the MSE: var ˆRMMSE(θ; x, Z) R(θ; x, Y ) = E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 (46) i=1 E var L( g(xi; θ), Yi) Zi . (47) Appendix E.4 derives (47) and shows that the estimation error converges to a normal RV with the above moments. Note that the MSE of ˆRMMSE(θ; x, Z) is the MMSE. Other properties, such as the orthogonality principle, also hold but are not relevant here. The MMSE estimator is a function of the RV Z, so it is also an RV, and (46) and (47) describe the MMSE considering the distribution of Z. In contrast, the MMSE estimate is the MMSE estimator evaluated at Z = z and is not random. The MSE that was realized by (39) or (40) equals the conditional variance of the empirical-risk RV given Z = z: E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 Z = z = var R(θ; x, Y ) Z = z i=1 var L( g(xi; θ), Yi) Zi = zi . 3.4.3 Gradient Calculation Training often employs some form of gradient descent to find θ . For example, deep neural network training uses automatic differentiation to calculate the gradient of the empirical risk. In the ideal case, Remp(θ; x, y) + λ Jreg and from (33) Remp(θ; x, y) = 1 L( g(xi; θ), yi) = 1 g ( g(xi; θ), yi) g θj (xi; θ). (49) 14. Specifically, Ep(z;ψ)[ ˆRMMSE(θ; x, Z)] = Ep(z;ψ) Ep(y|z;ψ)[R(θ; x, Y ) Z; ψ] = Ep(y)[R(θ; x, Y )]. Training based on MMSE estimation replaces the partial derivative Remp(θ; x, y) / θj with ˆRMMSE(θ; x, z) / θj. From (40), this term is ˆRMMSE(θ; x, z) = 1 yi Y p(yi|zi; ψi) L( g(xi; θ), yi) yi Y p(yi|zi; ψi) L g ( g(xi; θ), yi) g θj (xi; θ). (50) It is a convex combination of the partial derivatives of the loss function. It is also compatible with automatic differentiation methods and inexpensive to compute. For a deep neural network, calculating g(xi; θ) and g/ θj represents the vast majority of the computational burden, and these quantities must only be computed once the same burden as in the ideal case. Calculating L/ g requires a negligible amount of computation for a typical loss function, so calculating it for each possible value of yi does not substantially increase the computational burden. An example of the partial derivatives for binary logistic regression appears in Section 5.4 and Appendix F.2. 3.4.4 Special Cases We briefly consider two special cases at opposite extremes. First, suppose that the correct label can be perfectly recovered from the noisy labels; i.e., p(y i|zi; ψi) = 1(y i = yi). Then MMSE estimation returns the correct values of the empirical risk and loss function. For example, (40) yields ˆRMMSE(θ; x, z) = R(θ; x, y) = Remp(θ; x, y), and (43) gives ˆLMMSE( g(xi; θ), Zi) = L( g(xi; θ), Yi). Likewise, the partial derivative in (50) reduces to (49): θj ˆRMMSE(θ; x, z) = θj Remp(θ; x, y) . Second, suppose that the noisy labels provide no information about the correct-label RVs; i.e., p(z|y; ψ) = p(z; ψ). Then p(y|z; ψ) reduces to π(y), and (37) becomes ˆRMMSE(θ; x, Z) = 1 yi Y L( g(xi; θ), yi)π(yi) = Eπ(y) R(θ; x, Y ) . Regardless of the value of Z, the MMSE estimator always returns the mean of the empiricalrisk RV, taken with respect to the class prior π(y). The partial derivative in (50) behaves similarly. The MMSE estimator remains unbiased, so (44) still holds. However, the estimator is a constant, so its variance is zero. By the law of total variance, its MSE is as large as possible and equals the variance of R(θ; x, Y ) (see Appendix C.1, Item 4). 3.4.5 Consistency of the MMSE Estimator For ideal training, an important reason for minimizing the empirical risk (33) for a class of predictive models is consistency of the ERM principle: The ideal empirical risk converges in probability to the minimum achievable risk as N , even though the true distribution of (Xi, Yi)N i=1 is unknown (Vapnik, 1991). The relationship between MMSE estimation of the empirical-risk RV and consistency of the ERM principle requires further study. In the meantime, we consider the consistency of the MMSE estimator itself. On Truthing Issues in Supervised Classification The estimator ˆRMMSE(θ; x, Z) is consistent if it converges in probability to R(θ; x, Y ) as N . In this way, it attains the true value of the empirical-risk RV. Also, an estimator is mean-square consistent if its MSE goes to zero as N , and mean-square consistency implies consistency. Hence, if ˆRMMSE(θ; x, Z) is mean-square consistent, then it is consistent. From (46) and (47), the MSE of ˆRMMSE(θ; x, Z) is E ˆRMMSE(θ; x, Z) R(θ; x, Y ) 2 = 1 N2 i=1 E var L( g(xi; θ), Yi) Zi . A sufficient condition for when the MMSE estimator is consistent is therefore: i=1 E var L( g(xi; θ), Yi) Zi = 0. (51) If there exists a constant b such that E var L( g(xi; θ), Yi) Zi < b, i = 1, . . . , N, (52) then lim N N 2 PN i=1 E var L( g(xi; θ), Yi) Zi < lim N b/N = 0, and (51) is satisfied. Hence, if the loss function is bounded, then clearly (52) is fulfilled. If the original loss function is unbounded, then one may be able to modify it to make it bounded; for example, once the original loss exceeds some threshold, it could be transformed to asymptotically approach an upper limit. 3.5 Advantages of MMSE Training Within the unified view of training, MMSE training for non-probabilistic models has a number of appealing benefits compared to ML or MAP training for probabilistic models. MMSE training can continue to use the original loss function and involves a simple modification to the empirical risk calculation. The MMSE estimator gains standard properties such as Bayesian unbiasedness. Gradient descent and automatic differentiation can be used with minor modifications. The MMSE estimator is a consistent estimator of the empirical-risk RV if the loss function is bounded. In contrast, the ML or MAP training approach can require new derivations, and it may not be able to leverage existing loss functions and their gradients. The resulting expressions could complicate theoretical analysis. These differences are evident in the example for binary logistic regression, presented in Section 5.4. Equations for the primary terms and gradients are given in Appendix F. 3.6 Alternative Training Approaches The unified view does not cover all possible approaches to training, and here we discuss some alternatives. 3.6.1 Constructing Weak Losses for Partial Labels An alternative Bayesian approach by Cid-Sueiro (2012) and Cid-Sueiro et al. (2014) studied theoretical properties of weak losses for partial labels.15 Labeling used length-C binaryencoded vectors. For class y, the correct label was one-hot encoded as y = ey, where the vector ej equals one at element j and zero elsewhere. The partial-label vector z was similarly encoded, but multiple elements could be set to one, so the noisy labeler could indicate more than one class. The authors considered different constraints on the permitted partial-label vectors, so the number of possible partial-label vectors, B, could be an integer between C and 2C. Recall that s = g(x; θ). The authors defined a weak loss Lwk(s, z) to be a loss function designed to operate on partial labels rather than correct ones. They then introduced an equivalent loss Leq(s, y) for correct labels: Leq(s, y) = Ep( z| y)[Lwk(s, Z)| Y = y], (53) and they showed that Ep( z)[Lwk(s, Z)] = X z p( z)Lwk(s, z) y p( y)p( z| y)Lwk(s, z) z p( z| y)Lwk(s, z) = Ep( y) Ep( z| y)[Lwk(s, Z)| Y ] (54) = Ep( y)[Leq(s, Y )]. (55) From this relationship they reasoned that training on partial labels with Lwk(s, z) will behave like training on correct labels with Leq(s, y). Given an original loss function L(s, y) intended for correct labels, the equivalent loss is Leq(s, ey) = L(s, y), and one would like to find a corresponding weak loss. To this end, the authors expanded (53) into a matrix equation: Leq(s, e0) Leq(s, e1) ... Leq(s, e C 1) p Z| Y ( b0| e0) p Z| Y ( b1| e0) p Z| Y ( b B 1| e0) p Z| Y ( b0| e1) p Z| Y ( b1| e1) p Z| Y ( b B 1| e1) ... ... ... ... p Z| Y ( b0| e C 1) p Z| Y ( b1| e C 1) p Z| Y ( b B 1| e C 1) Lwk(s, b0) Lwk(s, b1) ... Lwk(s, b B 1) (56) where bi denotes the ith possible partial-label vector, i = 0, 1, . . . , B 1. The matrix tabulates the conditional distribution p( z| y), so it has dimensions C B. Cid-Sueiro pointed out that, for B > C, an infinite number of solutions for Lwk(s, z) exist. The MMSE training approach proposed in Section 3.4.1 is also Bayesian, but rather than design a new loss function for noisy labels, it uses the MMSE estimator ˆLMMSE(s, Z) of an original loss function L(s, Y ) meant for correct labels. MMSE training assumes that 15. Portions of the work by van Rooyen and Williamson (2018) are closely related to this approach. On Truthing Issues in Supervised Classification the noisy labels do not depend on the feature vector because the noisy-label model has the form p(z|y)π(y). From (42) and (43), the MMSE estimator is the conditional mean of the original loss function given Z and is just a convex combination of the original loss function values for each possible y, weighted by p(y|Z). For the approach taken by Cid-Sueiro et al., (53) suggests that the equivalent loss could be interpreted as the MMSE estimate of the weak loss given Y = y. To find a weak loss for a given equivalent loss, one must solve the matrix equation (56). From (45), the MMSE estimator is unbiased in the Bayesian sense, a standard result obtained by iterated expectations: Ep(z)[ˆLMMSE(s, Z)] = Ep(z) Ep(y|z)[L(s, Y )|Z] = Ep(y)[L(s, Y )]. Cid-Sueiro et al. obtain a similar relationship, but in the opposite order: (54) and (55) give Ep( z)[Lwk(s, Z)] = Ep( y) Ep( z| y)[Lwk(s, Z)| Y ] = Ep( y)[Leq(s, Y )]. This relationship corresponds to Bayesian unbiasedness of the MMSE estimator of the weak loss given Y . MMSE training conditions on the noisy-label RV Z. After Z = z is observed, it uses p(y|z) to estimate the original loss function. The work of Cid-Sueiro et al. conditions on the correct-label RV Y . Before Z is observed, they tabulate p( z| y) for all combinations of z and y, and they solve (56) to calculate the weak loss for every possible realization of Z that might occur. In summary, these two Bayesian approaches bear some similarities but address truthing issues differently. Neither one estimates the correct labels. One could be interpreted as operating in the reverse direction of the other. Cid-Sueiro et al. take the equivalent loss for correct labels and construct a weak loss for noisy labels. MMSE training takes the noisy labels and estimates the original loss function for correct labels. 3.6.2 Using Proxy Loss Functions Another alternative, proposed by Natarajan et al. (2013, 2018), took a classical (i.e., frequentist) view when replacing the original loss function L(s, y) of s = g(x; θ) and y with a proxy loss function designed for noisy labels.16 They considered binary classification with a single labeler and took the classical viewpoint, so y is unknown but non-random. Assuming that p(z; y) is known, they devised a proxy loss function Lpr(s, Z) for the noisylabel RV Z that is an unbiased estimator of L(s, y) in the classical sense, meaning that Ep(z;y)[Lpr(s, Z)] = L(s, y), y Y (see Papoulis 1991, 9-2; Kay 1993, 2.3). They then trained models on z using Jpri(θ; x, z, p(z; y)) = N 1 PN i=1 Lpr( g(xi; θ), zi). 16. van Rooyen and Williamson (2018) generalized this approach to other situations where a loss function for correct labels can be modified to account for noisy labels. The proxy loss function is the solution of a system of C linear equations in C unknowns:17 p Z;y(0; 0) p Z;y(1; 0) p Z;y(C 1; 0) p Z;y(0; 1) p Z;y(1; 1) p Z;y(C 1; 1) ... ... ... ... p Z;y(0; C 1) p Z;y(1; C 1) p Z;y(C 1; C 1) Lpr(s, 0) Lpr(s, 1) ... Lpr(s, C 1) L(s, 0) L(s, 1) ... L(s, C 1) which is very similar to (56) in the approach by Cid-Sueiro (2012) and Cid-Sueiro et al. (2014). The matrix is just the conditional distribution p(z; y); as long as it has full rank, a unique solution for Lpr(s, Z) exists. This approach exploits knowledge of p(z; y) and does not estimate the correct labels. As a classical approach, it does not involve a class prior π. Extending it to multiple labelers and varying combinations of labelers might be difficult. For T labelers, if every labeler provides a label for every sample, then the proxy loss function must satisfy Ep(z;y)[Lpr(s, Z)] = L(s, y), y Y, where Z = (Z1, Z2, . . . , ZT ). This requirement yields a system of C linear equations in CT unknowns, which is underdetermined for T > 1. If the labelers can decline to provide a label but at least one labeler must provide a label for every sample, then the number of unknowns becomes (C + 1)T 1. In contrast, MMSE training in Section 3.4.1 is a Bayesian method that exploits the class prior π as well as p(z|y). It employs the MMSE estimator of L(s, Y ), which from (42) is ˆLMMSE(s, Z) = Ep(y|z)[L(s, Y )|Z]. This estimator is unbiased in the Bayesian sense, meaning Ep(z)[ˆLMMSE(s, Z)] = Ep(y)[L(s, Y )] (cf. (45)). The classical and Bayesian viewpoints are fundamentally different (Kay, 1993, 10.3): The former treats y as an unknown, nonrandom quantity, and the latter treats y as an unobserved realization of the RV Y . Classical unbiasedness is thus not a paramount objective in Bayesian statistics (see Breiman 2001; Gelman et al. 2013, 4.5). Our training approach uses the original loss function, so a proxy loss function is unnecessary. Also, from (43), the MMSE estimator is a convex combination of original loss function values; no system of linear equations must be solved. Finally, MMSE training readily accommodates multiple labelers and different combinations of labelers for different samples. 3.6.3 Predicting the Noisy Labels In another alternative training approach, Sukhbaatar et al. (2015) and Jindal et al. (2016) trained a composite neural network, which consists of a base network followed by an additional layer, to predict the noisy labels z from a single labeler. The loss function is unchanged, but y is replaced by z. During training, the authors regularized both components of the composite network so that the base network learned p(y|x) and the additional layer learned p(z|y). Following training, the base network can be extracted and used to predict correct labels. Let nnb(x; θ) denote the base network, and ℓa(nnb(x; θ); ψ) denote the additional layer, where ψ is a C C matrix representing the layer s estimate of p(z|y). Then the primary term for this approach is Jpri(θ, ψ; x, z) = 1 N PN i=1 L ℓa(nnb(x; θ); ψ), zi . This approach does not estimate the correct labels, and it jointly estimates both p(y|x) and p(z|y). It does not consider the class prior π. For a single labeler, no changes to the 17. Natarajan et al. only considered binary classification; we offer the formulation for arbitrary C. On Truthing Issues in Supervised Classification loss function are necessary, but the loss function would have to be modified to account for multiple labelers or different combinations of labelers. 3.6.4 Predicting the Noisy Labels with Trace Regularization For multiple independent labelers, Tanno et al. (2019) proposed jointly training a CNN and estimating the labelers confusion matrices. They used the confusion matrices to adjust the softmax output of the CNN to obtain predicted scores for the noisy labels z. The training objective combined a cross-entropy loss term and a trace-regularization term. The cross-entropy loss was applied to the adjusted CNN outputs and noisy labels z, which amounts to predicting the noisy labels like Sukhbaatar et al. (2015). Trace regularization was introduced because, as the authors showed, doing so will drive the estimated confusion matrices to their actual values, under certain conditions.18 Consequently, as the labelers confusion matrices are estimated, the CNN learns to predict the correct labels. This method does not estimate the correct labels. The estimated confusion matrices can be used to obtain estimates of p(zt, y) and p(zt|y) for each labeler, as well as π. Tanno et al. also described some computational advantages of their method compared to the EM-based methods of Raykar et al. (2010) and Khetan et al. (2018). 3.7 Suboptimal, Infrastructure-Compatible Training There exists significant existing infrastructure, like software packages or machine-learning frameworks, that expects correct or assumed-to-be-correct labels, and modifying it for one of the preceding methods might be impractical or costly. Below, we describe some training methods that are suboptimal but compatible with such infrastructure. First, label estimation produces an estimate ˇy of the correct-label RVs Y and trains on {x, ˇy} instead of {x, y}. To estimate Y , one can use the related work or apply the MPE criterion as in Section 2.6.1. The MPE estimator is given by h MPE i = arg minhi E[1(hi(Zi) = Yi)]. The standard result is that the solution is the MAP estimator (see (2) or Appendix D), so ˇyi = arg maxy Y p(y|zi; ψi), i = 1, . . . , N. Label estimation is suboptimal (cf. Section 1.4) because training should estimate the primary term e.g., the empirical-risk RV, as in Section 3.4.1 rather than Y . Second, voting trains T classifiers, where the tth classifier gt is trained on the samples and noisy labels treated as if correct from the tth labeler only, namely {(xi, zi,t) : zi,t = , i = 1, . . . , N}. Given an unlabeled sample x, the predicted label ˆy is chosen by a majority vote among {gt(x; θt)}T t=1. This technique ignores estimation theory so it is suboptimal, and it multiplies training and deployment complexity by T. Third, sample replication, which was suggested by Raykar et al. (2010), copies each xi several times, assigns labels to its copies based on p(yi|zi; ψi), and trains on the copies and their labels. For example, in binary classification, if p Yi|Zi(0|zi; ψi) = 0.2, then xi is replicated 5 times, with one copy assigned label 0 and four copies assigned label 1. Direct implementation could multiply the storage and computation requirements for training by the number of copies made. 18. Sukhbaatar et al. (2015) mentioned this result but applied different regularization for implementational reasons. Figure 5: Binary symmetric broadcast channel. These methods offer different compromises between accounting for truthing issues and modifying existing infrastructure. Label estimation makes a hard decision about Y before training commences, which could introduce mistaken labels into training but requires no other modifications. Voting considers different hard decisions about Y from the individual labelers, multiplies deployment complexity by T, and requires a simple voting mechanism. Sample replication never makes a hard decision about the correct-label RVs Y , offers a quantized approximation of the approaches given in Sections 3.2, 3.3, and 3.4, and it multiplies training resource requirements by the number of copies, but it does not affect deployment complexity. 4. Comparing Combinations of Labelers: An Information-Theoretic View The truthing issues have a simple information-theoretic interpretation. For a single sample, the correct-label RV Y can be viewed as the input to a channel, and the noisy-label RVs Z = (Z1, . . . , ZT ) can be viewed as the outputs from the channel.19 Then the mutual information I(Z; Y ) = P z,y p(z, y) log2 p(z, y)/p(z)π(y) quantifies the amount of information that Z conveys about Y and vice versa (see Cover and Thomas, 1991). Here, we show that the interpretation allows us to compare different combinations of labelers in terms of mutual information, at least in theory. In Section 5.5, we demonstrate that the training and testing techniques of Sections 2 and 3 enable us to exploit that information in practice. 4.1 Binary Symmetric Broadcast Channel We illustrate this viewpoint for binary classification. For simplicity, we assume that the labelers or noisy-label RVs are conditionally independent, although the interpretation applies for conditionally dependent labelers as well. We also assume that all labelers have the same conditional distribution and that each labeler assigns a noisy label to every sample, so it suffices to consider a single sample. 19. In a theoretical context, Lugosi (1992) presented such a channel interpretation for a single labeler. On Truthing Issues in Supervised Classification We can now introduce a binary symmetric broadcast channel (BSBC) (see Cover and Thomas, 1991, 14.6), shown in Figure 5 and parameterized by the labeling-error probability ε [0, 1]. For t T and y, zt {0, 1}, p(zt|y; ε) = ( 1 ε, if zt = y; ε, if zt = y. (57) For ε [0, 1/2], this model is equivalent to setting C = 2, δi 0, i, and φt 2ε in (59) and (60). The binomial theorem can be used to obtain I(Z; Y |T, ε), the mutual information for T labelers, each with error probability ε: I(Z; Y |T, ε) = π(0) εm(1 ε)T m log2 εm(1 ε)T m π(0)εm(1 ε)T m + π(1)(1 ε)mεT m log2 π(0)εm(1 ε)T m + π(1)(1 ε)mεT m . For a fixed value of π(1), I(Z; Y |T, ε) is maximized if ε {0, 1}, in which case it equals π(0) log2 π(0) π(1) log2 π(1) = H(Y ), the entropy of Y . If ε = 1/2, then the labelers assign labels equiprobably, providing no information about Y , and I(Z; Y |T, ε) = 0. If ε (0, 1/2) (1/2, 1), then lim T I(Z; Y |T, ε) = H(Y ), and all information about Y becomes available from Z. Figure 6 shows I(Z; Y |T, ε) for different values of T and ε when π(1) = 0.1. The plot is only shown for 0 ε 1/2. As ε decreases from 1/2 to zero, I(Z; Y |T, ε) increases from zero to its maximum as one would expect. For ε = 0, I(Z; Y |T, ε) increases with T because more noisy-label observations are available for estimating Y . 4.2 Equivalent Single Labeler We can use I(Z1; Y |T1, ε1) and I(Z2; Y |T2, ε2) to compare the information provided by two different groups of labelers, but it can be helpful to think in terms of a single labeler. Let I(Z; Y |ε ) denote the mutual information for a single labeler with error probability ε , which corresponds to the standard (single-input, single-output) binary symmetric channel (BSC) (see Cover and Thomas, 1991, 8.1). Given π, T, and ε, we can use binary search to find ε such that I(Z; Y |ε ) = I(Z; Y |T, ε) and express the T labelers as an equivalent single labeler with error probability ε . Thus, we can compare different numbers of labelers with different error probabilities in terms of their equivalent single-labeler mutual information. Figure 7 shows the relationship between π, ε , and I(Z; Y |ε ) for the standard BSC. The mutual information is symmetric about π(1) = 1/2. For a fixed value of ε , it decreases as π(1) decreases from 1/2, which indicates that handling truthing issues becomes more challenging as the classes become more imbalanced. Figure 6: Graph of the mutual information I(Z; Y |T, ε) of BSBC for π(1) = 0.1 as a function of T and ε. Figure 7: Mutual information I(Z;Y |ε ) of standard BSC as a function of class prior π(1) and error probability ε . 4.3 Multiple Mediocre Labelers or Single Expert Labeler Figure 6 also indicates that, for 0 < ε < ε < 1/2, I(Z; Y |T, ε) can equal or exceed I(Z; Y |ε ) if T is sufficiently large. Hence, multiple mediocre labelers can be as informative as or more informative than a single expert labeler. This result helps justify crowdsourcing and explain its successes. We can again use binary search to find the minimum value of T needed to satisfy I(Z; Y |T, ε) I(Z; Y |ε ). Figure 8 shows example curves for ε {0.01, 0.02, 0.05, 0.10} and π(1) = 0.4. The curves rise steeply for small values of T before diminishing returns set in; as T , ε approaches an asymptote at 1/2. Hence, a few mediocre labelers may suffice to obtain a small equivalent error probability ε . Alternatively, one might desire a very small value of ε that no single expert can achieve, but a few human not superhuman labelers might be able to attain it together. This possibility is reminiscent of boosting (see Schapire, 1990; Freund and Schapire, 1997), in which multiple weak learners are leveraged to achieve the performance of a strong learner. We have not explored this relationship further but make some brief remarks. In boosting, the weak learners perform slightly better than random guessing, but the correct labels are known, and this information is exploited to improve performance. With truthing issues, the noisy labels might also be quite inaccurate and the correct labels are unobserved, but the conditional distribution p(z|y; ψ) is known, and it provides information about the correct labels for training and testing. The information-theoretic implication of equivalent information is intriguing, but it does not explain how to exploit the information that Z or Z conveys about Y . Fortunately, the methods developed in Sections 2 and 3 provide a means to do so. They should produce better estimates as T grows because the variance of an optimal estimator decreases as more On Truthing Issues in Supervised Classification 5 10 15 20 25 30 35 40 45 50 Number of Labelers Error Probability of Each Labeler Class Prior Pr(Y=1) = 0.40 Multiple-Labeler Error Probability Needed to Match Single Labeler Single-labeler error prob. = 0.10, I(Z;Y) = 0.512 bits Single-labeler error prob. = 0.05, I(Z;Y) = 0.690 bits Single-labeler error prob. = 0.02, I(Z;Y) = 0.832 bits Single-labeler error prob. = 0.01, I(Z;Y) = 0.891 bits Figure 8: Number of labelers needed to achieve the same mutual information as a single labeler for π(1) = 0.4. (independent) noisy observations become available.20 For T sufficiently large, the estimates for the mediocre labelers should become comparable to those for the single expert labeler. Section 5.5 presents a couple of experiments that verify the implication. 5. Experiments We conducted a number of experiments to see how the different testing and training methods performed and to check the implication of equivalent mutual information for different combinations of labelers. 5.1 Simulation To exercise the testing methods, we need correct, noisy, and predicted labels. To study the training methods, we only need correct and noisy labels. These can all be generated via simulation. Of course, the methods do not use the correct labels, which would not be available in the intended applications, but simulation lets us compare the results with the ideal case. 5.1.1 Simulating Correct and Predicted Labels Simulation for a desired class prior π is simple. For binary classification, the N correct-label realizations y(0) are drawn independently B(π(1)). For multi-class classification, they are 20. This behavior relates to T, the number of noisy labels per sample, rather than N, the number of samples. Increasing T means more noisy observations of Y are available to reduce the uncertainty about the correct label for a sample. Merely increasing N does not provide any more observations for the previous samples. drawn independently from a categorical distribution with prior π and categories 0, 1, . . . , C 1. Denote such a distribution as Cat π, (0, 1, . . . , C 1) . Experiments on the testing approaches of Section 2 require simulated predicted labels. To simulate a binary classifier with desired operating point pdes D , pdes FA , the predicted- label realizations ˆy are drawn independently, with ˆyi drawn B(pdes FA ) if y(0) i = 0, and ˆyi drawn B(pdes D ) if y(0) i = 1. Likewise, the performance of a multi-class classifier can be specified by a desired conditional confusion matrix Kdes, where matrix element Kdes n|ℓcontains p(ˆy = n|y = ℓ). Thus, each predicted-label realization ˆyi is drawn independently Cat (Kdes 0|ˆyi, Kdes 1|ˆyi, . . . , Kdes C 1|ˆyi), (0, 1, . . . , C 1) . 5.1.2 Modeling and Simulating Noisy Labels Since the samples are independent, we temporarily drop the sample index i while describing the modeling and simulation of the noisy-label RVs. Until this point, p(z|y; ψ) has remained completely general, so it allows for conditionally dependent labelers; that is, dependencies between different noisy-label RVs Zt and Zt for the same sample. We have also left ψ unspecified. For simulation, we need a more concrete model. First, we assume that the noisy-label RVs for a sample are conditionally independent: p(z|y; ψ) = Y t:zt = p(zt|y; ψt), (58) where ψ = (ψt)t T , and ψt contains parameters that determine how the tth labeler labels the sample. This assumption is primarily for implementational convenience; we could have used a model with dependent labelers like one of the models in Holodnak et al. (2018). Second, we let ψt = (δ, φt), where δ [0, 1] is the sample difficulty, and φt [0, 1] is the labeler fallibility. The sample difficulty represents the ambiguity about the correct label inherent to the sample itself. For example, in image recognition, it could indicate the amount of blur or noise. The labeler fallibility is an attribute of the tth labeler; it could reflect an analyst s experience or a crowdsourcing participant s trustworthiness. The model is then p(z|y; δ, φt) = 1 ε(δ, φt), if z = y Y; ε(δ,φt) C 1 , if z = y Y, z Y; 0, if z Y; where the labeling-error probability ε(δ, φt) is a bilinear function: ε(δ, φt) = (δ δφt + φt)C 1 C , 0 δ 1, 0 φt 1. (60) The labeler chooses Zt to be the correct label with probability 1 ε(δ, φt) and makes a mistake with probability ε(δ, φt) [0, 1]. If a mistake occurs, then Zt is equiprobably distributed over the C 1 possible incorrect labels. If δ = φt = 0, then Zt y, but if either parameter is non-zero, then ε(δ, φt) increases with both δ and φt. If either δ = 1 or φt = 1, then Zt is equiprobably distributed over Y. On Truthing Issues in Supervised Classification Our model resembles the one by Whitehill et al. (2009), but our formulation is slightly different, and our purpose is significantly different. Their model is for binary classification, whereas our model applies to arbitrary C. Their model allows for adversarial labelers who tend to choose the opposite of the correct binary label, but when C > 2, the opposite of the correct label is not obvious. More important, Whitehill et al. concentrated on estimating the model parameters, while we focus on exploiting the model when its parameters are known. In the latter case, an adversarial labeler who is known to frequently choose the opposite of the correct label is actually more informative than a sloppy labeler who assigns labels equiprobably at random.21 Finally, we again consider all samples and let δ = (δi)N i=1 and φ = (φt)T t=1. Then ψ = (δ, φ), and ψi = (δi, φ), so p(z|y; ψ) = i=1 p(zi|yi; δi, φ) t:zi,t = p(zi,t|y; δi, φt), (61) where (a) is from (58). During simulation, we draw δi, i = 1, . . . , N, and φt, t = 1, . . . , T, independently from beta or uniform distributions. We also draw a vector η = (ηt)T t=1, where ηt is the approximate probability that the tth labeler provides a label for a sample. Each ηt is drawn independently U(0, 1), where U(a, b) denotes a uniform distribution over (a, b). For the ith sample, we repeatedly draw ηi B(ηt) T t=1 with independent Bernoulli-distributed elements until at least one element of ηi equals unity. We then simulate zi,t in accord with (59) and (60) for t {t : ηi,t = 1}, and we set zi,t = for t {t : ηi,t = 0}. 5.2 Examples of Testing: Binary Classification A variety of examples that illustrate aspects of the testing approaches for binary classification appear here. We emphasize that, except for the ideal case, no correct labels were used during testing. 5.2.1 Main Testing Example A simulation was conducted with N = 103, T = 5, π(1) = 0.2, pdes D , pdes FA = (0.8, 0.3), δi Beta(1, 5), i, and φt U(0, 0.4), t. The simulator produced δ = (0.098, 0.046, 0.347, . . . , 0.199, 0.221), φ = (0.249, 0.030, 0.387, 0.244, 0.154), and η = (0.727, 0.873, 0.286, 0.657, 0.232). The number of samples labeled by each labeler was 727, 879, 272, 667, 229, respectively; on average, there were about 2.8 noisy labels per sample. Figure 9 shows the progression of the estimates p(j) D and p(j) FA for the iterative methods. The dotted lines show the ideal values of p D and p FA if the correct labels were known; the ideal values differ slightly from pdes D , pdes FA because they are the result of sampling and simulation. For MMSE testing, both empirical Bayes methods (Algorithms 1 and 2), the 21. Ipeirotis et al. (2010, 3) made a similar observation in the context of assessing the cost of a labeler. 0 2 4 6 8 10 Iteration number Estimated value Ideal (correct labels known): p D Ratios: p D Sampling: p D Estimate labels: p D Ideal (correct labels known): p FA Ratios: p FA Sampling: p FA Estimate labels: p FA Figure 9: Main testing example: Progression of p(j) D , p(j) FA in iterative estimation methods. 0.6 0.65 0.7 0.75 0.8 0.85 0.9 Accuracy Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Sampling: Density Sampling: Cond. mean Sampling: MAP estimate Sampling: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Full Bayes: Density Full Bayes: Cond. mean Full Bayes: MAP estimate Figure 10: Main testing example: Accuracy estimates by all methods. Markers are spaced vertically for easier readability. estimates improved on each iteration, and both converged to nearly the same final OP parameters after seven iterations. The suboptimal method of estimating the correct labels (Algorithm 3) converged after nine iterations; its final OP parameter p FA was slightly more accurate than those of the empirical Bayes methods, but its final OP parameter of p D was considerably less accurate than theirs. Table 10 summarizes the estimates of all metrics by the techniques presented in Section 2, and Figure 10 shows the results for accuracy. Results for the suboptimal methods that do not fully exploit the predicted labels appear in Section 5.2.2. For MMSE testing with the empirical Bayes methods ( Ratios and Sampling ), the figure displays the estimated posterior density, conditional mean, MAP estimate, and 95%-credible region for ACC. The figure also displays a gray histogram, which was created by taking the final OP parameters ( p D, p FA) from Algorithm 2, generating 5000 sample realizations of Y from (10), and computing the empirical accuracy for each realization. The optimal estimates are fairly close to the ideal accuracy, and the credible regions contain the ideal accuracy. The results for the two methods are very similar; this behavior was observed for all examples, so subsequent figures do not include results for Algorithm 2 beyond the histograms. The figure also shows the estimated accuracy for the suboptimal method in Algorithm 3 ( Estimate labels ). This estimate is reasonably good, but it does not provide any sense of uncertainty like a credible region does. Also, this algorithm produced inaccurate estimates of REC or PD (shown next in Figures 11 and 12). Likewise, the figure displays the T instances of accuracy for each individual labeler ( Labelers ). We also computed the mean and median of these instances; we applied the algorithm by Vardi and Zhang (2000) to calculate the multi-dimensional median. The instances are fairly inaccurate, so their mean and median also yield poor estimates. The last set of results in the figure are for the fully Bayesian approach. Its estimated density is very spread out as a result of marginalization over ( PD, PFA), and its conditional On Truthing Issues in Supervised Classification Scalar Metrics Joint Metrics Estimate ACC PREC PD or REC PFA F1 (PREC, REC) (PD, PFA) Ideal (correct labels known) Ideal 0.712 0.421 0.810 0.316 0.554 (0.421, 0.810) (0.810, 0.316) MMSE Testing: Empirical Bayes estimation via ratios of jointly normal RVs (Alg. 1) Conditional mean 0.700a 0.397a 0.795 0.328 0.534 (0.397, 0.795)b (0.795, 0.328)b MAP estimate 0.795 0.325 0.529 (0.397, 0.795)c (0.796, 0.325)c Credible (lower) 0.687 0.373 0.764 0.317 0.506 region (upper) 0.713 0.420 0.828 0.335 0.553 MMSE Testing: Empirical Bayes estimation via sampling (Alg. 2) Conditional mean 0.701a 0.397a 0.796 0.327 0.535 (0.397, 0.796)b (0.796, 0.327)b MAP estimate 0.796 0.325 0.530 (0.397, 0.796)c (0.796, 0.325)c Credible (lower) 0.688 0.374 0.765 0.316 0.506 region (upper) 0.714 0.421 0.829 0.334 0.554 Estimation of correct labels (Alg. 3) Estimate labels 0.720 0.402 0.868 0.316 0.550 (0.402, 0.868) (0.868, 0.316) Combine metrics from individual labelers Mean 0.636 0.431 0.615 0.354 0.506 (0.431, 0.615) (0.615, 0.354) Median 0.622 0.421 0.605 0.354 0.504 (0.421, 0.605) (0.605, 0.354) Labeler No. of Metrics from index labels individual labelers 1 727 0.622 0.450 0.574 0.354 0.505 (0.450, 0.574) (0.574, 0.354) 2 879 0.653 0.414 0.643 0.343 0.504 (0.414, 0.643) (0.643, 0.343) 3 272 0.618 0.405 0.605 0.377 0.485 (0.405, 0.605) (0.605, 0.377) 4 667 0.619 0.421 0.587 0.366 0.490 (0.421, 0.587) (0.587, 0.366) 5 229 0.668 0.465 0.667 0.331 0.548 (0.465, 0.667) (0.667, 0.331) Fully Bayesian estimation Conditional mean 0.646 0.345 0.664 0.360 0.456 (0.345, 0.664)b (0.664, 0.360)b MAP estimate 0.662 0.342 0.714 0.349 0.464 (0.347, 0.690)c (0.663, 0.360)c a For accuracy or precision, the empirical Bayes conditional mean and MAP estimate are identical. b The conditional means for the joint metrics are the same as those for the corresponding scalar metrics. c The MAP estimate of a joint metric can differ from the MAP estimates of its individual components. Table 10: Testing metrics for main testing example. mean and MAP estimate are not very accurate. This behavior occurred for other metrics, so we do not show the fully Bayesian method in the subsequent figures. Next, Figure 11 displays results for the other scalar metrics. MMSE testing using the empirical Bayes method of Algorithm 1 produced estimates within about 0.025 of each metric. For REC or PD, the credible region contains the ideal metric, and for the other metrics, it lies outside the credible region by just 0.001. Table 10 indicates that the credible regions of Algorithm 2 contained each ideal metric. The suboptimal method of estimating the correct labels (Algorithm 3) was quite accurate for several metrics but offby 0.058 for REC or PD. The use of the labelers labels often produced very inaccurate estimates. 0.3 0.32 0.34 0.36 0.38 0.4 0.42 0.44 0.46 0.48 0.5 Precision Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Recall or Prob. of Detection 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 Recall or Prob. of Detection Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Prob. of False Alarm 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 Prob. of False Alarm Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers F-score (beta = 1.0) 0.45 0.5 0.55 0.6 0.65 F-score (beta = 1.0) Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Figure 11: Main testing example: Estimates of scalar metric RVs. On Truthing Issues in Supervised Classification 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm Prob. of Detection ROC Analysis Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Recall Precision-Recall Analysis Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Figure 12: Main testing example: Estimates of joint metric RVs for ROC and P-R analysis. Figure 12 presents results for P-R and ROC analysis. MMSE testing with empirical Bayes estimation produced the best joint estimates; its point estimates are closest to the ideal operating points, and its 95%-credible regions contain the ideal operating points. Estimating the correct labels yielded less accurate joint estimates, and it continues to provide no sense of uncertainty. Using the labelers labels gave very poor estimates. 5.2.2 Exploitation of Predicted Labels Full exploitation of the predicted labels ˆy and OP parameters ( p D, p FA) is an important characteristic of the most successful testing methods. Figure 13 shows results for the simulation in Section 5.2.1 if the predicted labels are not fully exploited, which corresponds to the third and fourth suboptimal testing approaches in Section 2.6.1. They use p(zi|yi; ψi) rather than the testing model (7) with p(ˆyi|yi; p D, p FA)p(zi|yi; ψi), which amounts to setting jmax = 0 in Algorithm 1, 2, or 3. Compared to Figures 10 and 12, the estimated posterior and the estimates are quite inaccurate, and the credible regions do not contain the ideal metrics. These results illustrate the importance of including the predicted labels and OP parameters during estimation. 5.2.3 Convergence Experiments for Iterative Algorithms Section 2.4.4 speculated that, when estimating ( p D, p FA) during MMSE testing, the empirical Bayes methods (Algorithms 1 and 2) will converge to the global optimum regardless of the initial OP parameters. To check this possibility, we conducted two experiments, which also included the suboptimal correct-label estimation method (Algorithm 3). First, we used the same simulation as in Section 5.2.1 but varied the initial OP parameters p(0) D , p(0) FA over the 10 10 grid {0.05, 0.15, . . . , 0.95} {0.05, 0.15, . . . , 0.95}. Table 11 summarizes the results. The empirical Bayes methods converged to nearly the same final OP parameters every time, with maximum absolute errors of about 0.015. In addition, the MAC was always satisfied in Algorithm 1. The suboptimal method of Algorithm 3 had larger p D errors but slightly smaller p FA errors than the empirical Bayes methods. For this 0.6 0.65 0.7 0.75 0.8 0.85 0.9 Accuracy Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Recall Precision-Recall Analysis Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Figure 13: Examples of estimates of metric RVs if the predicted labels ˆy and OP parameters ( p D, p FA) are not fully exploited. Compare with Figures 10 and 12. The plots do not show the labelers results because they are unchanged from those figures. Same Ideal Operating Point, Iterative Estimation Method Different Initial OP Parameters Ratios Sampling Est. Labels Error p D p D Mean 0.0149 0.0131 0.0473 Std. dev. 3.20 10 4 4.15 10 4 1.61 10 2 Max abs. 0.0153 0.0142 0.0715 Error p FA p FA Mean 0.0118 0.0092 0.0015 Std. dev. 6.34 10 5 8.56 10 5 1.49 10 3 Max abs. 0.0119 0.0095 0.0033 Number of Iterations Mean 6.6 6.6 6.2 Std. dev. 1.08 1.05 2.35 Max 8 8 10 Table 11: Estimation error statistics for the final OP parameters ( p D, p FA) for iterative estimation methods. For the same ideal operating point (p D, p FA) = (0.810, 0.316), 100 different initial OP parameters were used. ideal operating point, all algorithms consistently converged to nearly the same final OP parameters. Second, we ran another set of simulations that always initialized the iterative algorithms with the default initial OP parameters p(0) D , p(0) FA = (1/2, 1/2), but we varied the desired operating point pdes D , pdes FA over the same 10 10 grid as above, which produced 100 different ideal operating points. These simulations used π(1) = 0.5, δi U(0, 1), i, φt U(0, 0.5), and ηt U(0, 1), t. Results appear in Table 12. The empirical Bayes methods converged every time, with average errors near 0.012 and a maximum absolute error below 0.045. Again, the MAC was always satisfied in Algorithm 1. The suboptimal method of Algorithm 3 performed less well. Although its average errors are no more than 0.021, its standard deviations are on the order of 0.200 and its maximum absolute error exceeds 0.333, On Truthing Issues in Supervised Classification Different Ideal Operating Points, Iterative Estimation Method Same Default Initial OP parameters Ratios Sampling Est. Labels Error p D p D Mean 0.0113 0.0114 0.0141 Std. dev. 1.04 10 2 9.56 10 3 1.95 10 1 Max abs. 0.0314 0.0310 0.3590 Error p FA p FA Mean 0.0120 0.0122 0.0210 Std. dev. 1.36 10 2 1.14 10 2 2.03 10 1 Max abs. 0.0442 0.0381 0.3369 Number of Iterations Mean 10.9 11.2 10.4 Std. dev. 3.15 3.46 3.39 Max 17 18 21 Table 12: Estimation error statistics for the final OP parameters ( p D, p FA) for iterative estimation methods. For 100 different ideal operating points, the same default initial OP parameters was used. indicating that it often became trapped near a local optimum. These results demonstrate the substantial benefit of the empirical Bayes methods over estimating the correct labels. 5.2.4 Estimation Performance for Different Operating Points The convergence experiments in the previous section only examine the estimation error of the final OP parameters ( p D, p FA). For the second set of simulations in that section, we also compiled statistics on the estimation errors of the final estimates of the scalar and joint metrics over the 10 10 grid {0.05, 0.15, . . . , 0.95} {0.05, 0.15, . . . , 0.95} of pdes D , pdes FA ; i.e., over 100 different ideal operating points. Figure 14 summarizes the error statistics for the estimated scalar metrics by the different testing approaches. For MMSE testing, the point estimates from the empirical Bayes methods (Algorithms 1 and 2) have average errors of about 0.012, with standard deviations near 0.011. The scale changes between MMSE testing and the other methods. The other methods have average errors between 0.001 and 0.043, but their standard deviations range from about 0.100 to over 0.200, an order of magnitude greater than those for the empirical Bayes methods. Figures 15 and 16 display joint error statistics for P-R and ROC analysis. Results for MMSE testing were similar for both the conditional mean and MAP estimate, so the figures only show estimation errors for the conditional mean of Algorithm 1 and the MAP estimate of Algorithm 2. The empirical Bayes methods have average errors near 0.015 in each dimension. The square roots of the eigenvalues of the error covariance matrix fall between 0.005 and 0.025. The scale changes between MMSE testing and the other methods. The latter methods have average errors between 0.005 and 0.043, but the eigenvalues square roots range from about 0.100 to over 0.430. These results demonstrate the superior estimation performance of the MMSE testing methods, whose estimates typically lie within 0.010 to 0.040 of the ideal metrics over a wide range of ideal operating points. They significantly outperform the other methods, -0.05-0.025 0 0.025 0.05 Error Sampling: MAP Estimate Sampling: Cond. Mean Ratios: MAP Estimate Ratios: Cond. Mean -0.05-0.025 0 0.025 0.05 Error -0.05-0.025 0 0.025 0.05 Error Recall or Prob. of Detection -0.05-0.025 0 0.025 0.05 Error Prob. of False Alarm -0.05-0.025 0 0.025 0.05 Error -0.5 -0.25 0 0.25 0.5 Error Full Bayes: Cond. mean Labelers: Median Labelers: Mean Estimate labels -0.5 -0.25 0 0.25 0.5 Error -0.5 -0.25 0 0.25 0.5 Error Recall or Prob. of Detection -0.5 -0.25 0 0.25 0.5 Error Prob. of False Alarm -0.5 -0.25 0 0.25 0.5 Error Figure 14: Scalar metric estimation errors for different testing approaches over 100 different ideal operating points. The axis limits differ for MMSE testing with empirical Bayes (upper plots: 0.05 to +0.05) and the other approaches (lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and as text above each circle. Multiples of 1 and 2 times the standard deviation of the errors appear as crosses, and text below the first cross gives the standard deviation. On Truthing Issues in Supervised Classification -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Precision Error Recall Error Ratios: Conditional Mean (-0.0082, -0.0114) 0.0258 0.0096 -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Precision Error Recall Error Sampling: MAP Estimate (-0.0080, -0.0112) 0.0236 0.0093 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error Recall Error Estimate labels (-0.0131, -0.0142) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error Recall Error Full Bayes: Cond. mean (-0.0010, -0.0045) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error Recall Error Labelers: Mean (-0.0284, -0.0132) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Precision Error Recall Error Labelers: Median (-0.0149, -0.0153) Figure 15: P-R analysis estimation errors for different testing approaches over 100 different ideal operating points. The axis limits differ for MMSE testing with empirical Bayes (top plots: 0.05 to +0.05) and the other approaches (middle and lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and listed as an ordered pair. Ellipses denote areas that account for 0.6827 and 0.9545 of the density of a bivariate normal distribution fitted to the errors, and text adjacent to the semi-major and semi-minor axes gives the square roots of the eigenvalues of the error covariance matrix. -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Prob. of False Alarm Error Prob. of Detection Error Ratios: Conditional Mean ( 0.0121, -0.0114) -0.05 -0.04 -0.03 -0.02 -0.01 0 0.01 0.02 0.03 0.04 0.05 Prob. of False Alarm Error Prob. of Detection Error Sampling: MAP Estimate ( 0.0124, -0.0115) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error Prob. of Detection Error Estimate labels ( 0.0212, -0.0142) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error Prob. of Detection Error Full Bayes: Cond. mean ( 0.0087, -0.0045) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error Prob. of Detection Error Labelers: Mean ( 0.0426, -0.0132) -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 Prob. of False Alarm Error Prob. of Detection Error Labelers: Median ( 0.0188, -0.0144) Figure 16: ROC analysis estimation errors for different testing approaches over 100 different ideal operating points. The axis limits differ for MMSE testing with empirical Bayes (top plots: 0.05 to +0.05) and the other approaches (middle and lower plots: 0.5 to +0.5). Miniature scatterplots of the estimation errors appear as gray dots. The average error is marked with a circle and listed as an ordered pair. Ellipses denote areas that account for 0.6827 and 0.9545 of the density of a bivariate normal distribution fitted to the errors, and text adjacent to the semi-major and semi-minor axes gives the square roots of the eigenvalues of the error covariance matrix. On Truthing Issues in Supervised Classification 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 Accuracy Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 Recall Precision-Recall Analysis Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers Figure 17: Testing example for a single labeler (T = 1) when a constant labeling-error probability ε0 = 0.1 is assumed for all samples. The mean and median of the labelers metrics are not shown because T = 1. which frequently yield errors in excess of 0.100, 0.200, or more unacceptable given that the metrics lie in [0, 1]. The empirical Bayes approach is clearly more appropriate than the fully Bayesian one. 5.2.5 Single Labeler and Constant Labeling-Error Probability The testing methods can also be useful if there is a single labeler (T = 1), and one merely wants to know how performance would be affected by some worst-case labeling-error probability ε0. One can simply set δi 0 and φt Cε0/(C 1) in (60) and apply the methods. Figure 17 shows an example for T = 1 and ε0 = 0.1; other simulation settings were N = 103, π(1) = 0.6, pdes D , pdes FA = (0.90, 0.10), δi 0, and φt 0.2. MMSE testing produces accurate estimates, and its estimated posteriors and credible regions allow one to understand the possible variability caused by the assumed labeling-error probability. For 0 < ε0 < 1/2, the estimated correct label ˇyi is identically equal to the noisy label zi,1, so the markers for the estimated correct labels and the labelers labels lie in the same location. These suboptimal estimates are much less accurate than those from the empirical Bayes method. 5.2.6 Small Sample Size The approximations behind MMSE testing are driven by the CLT, so they should hold for small N as long as ˆN1 and N ˆN1 are greater than or equal to thirty. We conducted another simulation with N = 70, which produced ˆN1 = 37 and N ˆN1 = 33; other simulation settings were T = 3, π(1) = 0.5, pdes D , pdes FA = (0.85, 0.20), δi Beta(1, 2), i, and φt U(0.2, 0.5), t. Figure 18 displays results for PFA and (PD, PFA). The posteriors are clearly non Gaussian, and the 95%-credible regions are large because of the small sample size, but they contain the ideal metrics. The suboptimal methods again provide less accurate estimates. Prob. of False Alarm 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm Probability Density Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Prob. of False Alarm Prob. of Detection ROC Analysis Ideal (correct labels known) Ratios: Density Ratios: Cond. mean Ratios: MAP estimate Ratios: 95%-credible region Estimate labels Labelers: Mean Labelers: Median Labelers Figure 18: Testing example for small sample size (N = 70), with ˆN1 = 37. 5.3 Example of Testing: Multi-Class Classification For the multi-class testing methods of Section 2.7, we present an example with C = 4, N = 2000, π = (0.2, 0.3, 0.1, 0.4), 0.75 0.08 0.10 0.07 0.10 0.65 0.12 0.13 0.04 0.06 0.80 0.10 0.10 0.05 0.05 0.80 T = 5, δi 0, i, and φt U(0, 0.4), t. MMSE testing with the empirical Bayes sampling method (Algorithm 4) used M = 2500 C = 104. We also extended the method that estimates the correct labels (Algorithm 3) to multi-class classification. Both algorithms converged after six iterations. For the individual labelers, we computed their confusion matrices, scaled the tth labeler s confusion matrix by N/(no. of labels from tth labeler), and calculated the mean and multi-dimensional median of the scaled matrices. Apart from the ideal case, testing did not involve the correct labels. First, Table 13 compares the estimates of the accuracy RV for the different estimation methods. MMSE testing produced the most accurate estimates, and the ideal accuracy value lies squarely inside the 95%-credible region. Estimating the correct labels yielded a reasonably good estimate of accuracy but again without a sense of uncertainty. Using the individual labelers labels gave accuracies of 0.6586, 0.5667, 0.6397, 0.6698, and 0.5504; taking the mean or median of these values produces highly inaccurate estimates of the accuracy. Second, Table 14 shows the ideal confusion matrix and the estimated confusion matrices from the different techniques. Ten matrix elements from MMSE testing are closest to the corresponding elements in the ideal confusion matrix, six from the estimation of correct labels are closest, one from the labelers mean is closest, and one from the labelers median is closest. (There were two ties, so these numbers sum to eighteen rather than sixteen.) Finally, Table 15 shows the 95%-credible regions of the confusion matrix elements from MMSE testing. Every credible region contains the corresponding element in the ideal confusion matrix. On Truthing Issues in Supervised Classification Estimation Method ACC Ideal (correct labels known) 0.7350 MMSE testing: Estimate Conditional mean 0.7355 conditional confusion MAP estimate 0.7355 matrix via sampling 95%-credible region (0.7282, 0.7427) Estimate correct labels 0.7390 Combine metrics from each labeler Mean 0.6170 Median 0.6397 Table 13: Estimated accuracy for 4-class classification example. Boldface indicates an estimate that was closest to the ideal accuracy or a credible region that contained the ideal accuracy. Ideal (Correct) Predicted Label Labels Known) 0 1 2 3 0 290 40 40 39 Correct 1 58 391 68 68 Label 2 8 10 168 27 3 93 43 36 621 MMSE Predicted Label Testing 0 1 2 3 0 284.8 41.3 36.6 33.9 Correct 1 63.7 391.7 67.0 67.6 Label 2 7.5 7.3 169.8 29.0 3 93.0 43.6 38.7 624.6 Est. Correct Predicted Label Labels 0 1 2 3 0 288 43 35 37 Correct 1 61 395 70 68 Labels 2 7 4 170 25 3 93 42 37 625 Mean of Predicted Label Labelers 0 1 2 3 0 238.9 60.7 65.3 54.2 Correct 1 60.7 335.5 66.3 86.3 Label 2 36.7 54.2 130.1 64.8 3 76.6 91.3 48.8 529.6 Median of Predicted Label Labelers 0 1 2 3 0 254.1 57.4 46.8 56.2 Correct 1 71.1 343.4 67.2 86.9 Label 2 26.1 35.9 145.9 62.3 3 90.6 65.8 46.7 543.6 Table 14: Ideal and estimated confusion matrices for 4-class classification example. Boldface indicates that the element was closest to the corresponding element in the ideal confusion matrix. 95%-Credible Predicted Label Regions 0 1 2 3 0 (277.72, 291.95) (35.80, 46.79) (32.27, 40.88) (28.54, 39.22) Correct 1 (57.75, 69.56) (384.72, 398.77) (61.51, 72.46) (60.74, 74.43) Label 2 ( 4.79, 10.23) (3.97, 10.67) (163.73, 175.83) (24.03, 33.93) 3 (86.74, 99.25) (38.47, 48.82) (34.34, 42.99) (615.90, 633.21) Table 15: 95%-credible regions for individual elements of the confusion matrix estimated by MMSE testing (Algorithm 4) for 4-class classification example. 5.4 Example of Training: Logistic Regression This section uses logistic regression to illustrate the training approaches in Section 3. We use the Ionosphere binary-classification data set from the UCI Machine Learning Repository (see Dua and Graff, 2017), which contains N = 351 samples, each consisting of 34 real-valued features. Class 0 corresponds to a good radar return and class 1 to a bad radar return. Ionosphere contains 126 bad radar returns, so π(1) = 0.359. We employ 75% 25% stratified hold-out validation since multi-fold cross-validation produced cluttered plots that were too difficult to read. In practice, one could use cross-validation, of course. Training and testing were conducted for T = 1, 5, 9, and 13. The data set provides the correct labels; noisy labels were simulated as in Section 5.1.2; hence, the noisy-label RVs are conditionally independent as in (58), ψi,t = (δi, φt), and (27) reduces to (61). The settings were δi Beta(1, 5), i; φt U(0, 0.5), t; η1 1 to force the first labeler to label every sample; and ηt U(0.33, 1), t T \ {1}. The sample-difficulty realization was δ = (0.367, 0.524, 0.115, 0.181, 0.021, . . . , 0.154), and the labeler-fallibility realization for T = 13 was φ = (0.079, 0.440, 0.137, 0.207, 0.148, 0.314, 0.290, 0.300, 0.133, 0.142, 0.127, 0.164, 0.072). For each value of T, z consisted of the noisy labels for labeler indexes 1 through T. In fact, the introductory example in Table 1 is an excerpt of z for T = 5. We consider five classifiers. The ideal classifier performs conventional ML training with {x, y} and is included for reference. The ML-optimal classifier performs ML training given {x, z, ψ, π} according to part 1a of the unified view in Section 3.2, and primary term (32) from Section 3.3.1; details appear in Appendix F.1. The resulting objective function may no longer be convex, but gradient descent can still be used to find a local optimum. The MMSEoptimal classifier uses MMSE training according to part 2 of the unified view, primary term (40) from Section 3.4.1, and gradient components in (50) of Section 3.4.3; details are in Appendix F.2. The suboptimal, infrastructure-compatible label-estimation and voting classifiers described in Section 3.7 are also included. We omitted sample replication since it is essentially a quantized form of training with (32). For all classifier training, we used L2 regularization and the standard Broyden-Fletcher-Goldfarb-Shanno method. All testing metrics were calculated using the empirical Bayes method of Algorithm 1, so, except for the ideal classifier, no correct labels were used during training or testing. Although training could be optimal or suboptimal, testing always applied the same technique. For each training method, the regularization weight λ was swept over {0.5, 1.0, . . . , 10.0}, producing twenty trained models. For the ideal classifier, we selected the model with the On Truthing Issues in Supervised Classification largest ideal area under the ROC curve on the held-out testing set. For the other classifiers, we estimated the ROC curve on the testing set (explained shortly in Section 5.4.2) and selected the model with the largest estimated area under the ROC curve. 5.4.1 Fixed Decision Threshold A trained logistic regression model computes g(x; θ) [0, 1] as its estimate of p(y|x; θ) and compares it against a threshold τ; the predicted label ˆy is 1 if g(x; θ) > τ and 0 otherwise. This section presents results for the single default threshold τ = 1/2. Figure 19 displays P-R analysis plots for the held-out testing set as T increases. For the classifiers trained with noisy labels, contours show the estimated joint posteriors of (PREC, REC), circles show the conditional means, and solid lines indicate the 95%-credible regions. Normally, the correct labels y would not be available, but since we have the luxury of knowing them, inverted triangles mark each classifier s actual performance. The upright, solid black triangle shows the ideal classifier s operating point, which is identical for all values of T. With T = 1, z conveys little information about y, so the posteriors are spread out, and the credible regions are quite large. For each training method, the conditional mean and actual operating point are not very close together, but the actual operating point lies within the credible region. The suboptimal methods outperform the MMSE-optimal and ML-optimal classifiers for this case, but the credible regions overlap so much that one could not make this conclusion without access to the correct labels. This plot also demonstrates that our training and testing approaches are applicable even for a single, imperfect labeler. When T = 5, more information about y is available from z, so the credible regions become smaller, and the conditional means become much closer to the actual operating points for all classifiers. The ML-optimal classifier outperforms the ones that used suboptimal training, and from the posteriors, one could reasonably expect it to do so. It also happens to outperform the ideal classifier. The MMSE-optimal classifier performs comparably to the label-estimation classifier. For T = 9, the training methods provide similar estimated precisions, with the MLoptimal classifier achieving greater recall. However, the posteriors overlap substantially, so one could not claim this without access to the correct labels. The ML-optimal classifier again slightly outperforms the ideal one. The MMSE-optimal and voting-trained classifier have identical performance. By the time T = 13, the posteriors and credible regions are much tighter, and the conditional means give quite accurate estimates of the actual operating points. The MLoptimal classifier and the classifier trained with label estimation slightly outperform voting training, and their actual operating points coincide with that of the ideal classifier. For this choice of the threshold τ, the MMSE-optimal classifier has the lowest recall. In the next sections, the threshold is varied, and the MMSE-optimal classifier is seen to have competitive performance. 5.4.2 Performance Curves It is also common practice to sweep the threshold τ over its range of possible values to obtain ROC or P-R curves. Figure 20 displays estimated ROC curves, which are the conditional Number of Labelers = 1 Testing: Precision-Recall Analysis 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Number of Labelers = 5 Testing: Precision-Recall Analysis 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Number of Labelers = 9 Testing: Precision-Recall Analysis 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Number of Labelers = 13 Testing: Precision-Recall Analysis 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region Label-est.: Est. posterior Label-est.: Performance against correct labels Label-est.: Cond. mean from est. posterior Label-est.: 95%-credible region Voting among labelers: Est. posterior Voting among labelers: Performance against correct labels Voting among labelers: Cond. mean from est. posterior Voting among labelers: 95%-credible region Figure 19: Training example with different numbers of labelers T: Estimated testing results from MMSE testing with Algorithm 1 on the held-out testing set. The chance line appears as a black dotted line. On Truthing Issues in Supervised Classification means of (PD, PFA) from MMSE testing using Algorithm 1, for the different training methods and numbers of labelers. The estimated performance improves as T increases, although it does not change much beyond T = 5. When T = 1, it is difficult to estimate performance, which results in the jagged estimated ROC curves. As T increases, the estimated ROC curves begin to display the stepwise shape of the ideal ROC curve. For each value of T, all of the training methods perform comparably. This behavior suggests that the regularized suboptimal training methods offer practical alternatives to training methods that are optimal according to the unified view. This observation differs from the experiments on testing from Sections 5.2.1, 5.2.4, and 5.2.5, where MMSE testing (Algorithms 1 and 2) outperformed the method of estimating the correct labels (Algorithm 3). We posit that this difference reflects the inherently different goals of training and testing and the presence or absence of regularization. The goal of training is to learn a predictive model that generalizes well to out-of-sample data beyond the training set {x, z, ψ, π}. Therefore, training includes regularization, which helps a suboptimal training method compensate for its inferior estimation ability. In contrast, the goal of testing is to obtain the in-sample metrics for the testing set {ˆy, z, ψ, π}. Regularization is not called for in this case, and MMSE testing can provide much better estimation performance over suboptimal testing methods. Figure 21 displays the actual ROC curves calculated against the correct labels. These curves would not be available in practice if truthing issues are present. They show that the training methods can achieve performance similar to the ideal case, and a comparison with Figure 20 shows that the estimated ROC curves are reasonably good even when T = 1, and they are very good for T = 9 or 13. 5.4.3 Performance Curve Posteriors MMSE testing allows us to estimate the joint posterior of (PD, PFA) or (PREC, REC). We can average the posteriors over all threshold values to obtain the posterior of a ROC or P-R curve an important capability when truthing issues are present. Figure 22 shows the posteriors of the ROC curves for the ML-optimal classifier as T varies. Posterior values over the range [10 3, 104] are shown using heat maps with a base-10 logarithmic color scale. The figure also displays the estimated and actual ROC curves for this training method. We could also compute credible regions for the curves, but we do not show them to avoid cluttering the figure. Figure 19 already demonstrated the good containment of the credible regions. Similarly, Figure 23 shows P-R curve posteriors for the MMSE-optimal classifier. When τ > 1, the classifier predicts ˆyi = 0, i, so recall is zero but precision is undefined, and none of the curves show a point when rec = 0. The figures also reveal another benefit of MMSE testing. Although actual performance is similar to the ideal case when T = 1 or 5, the estimated curves display some large deviations from the actual ones, and the posteriors have broad support, which indicates substantial uncertainty about performance. For T = 9 or 13, the estimated curves coincide closely with the actual ones, and the posteriors have more concentrated support, which reflects less uncertainty about performance. In the presence of truthing issues, the actual 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 1 Testing: ROC Analysis (Conditional Means) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 5 Testing: ROC Analysis (Conditional Means) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 9 Testing: ROC Analysis (Conditional Means) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 13 Testing: ROC Analysis (Conditional Means) Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior ML training: Cond. mean from est. posterior Label-est.: Cond. mean from est. posterior Voting among labelers: Cond. mean from est. posterior Figure 20: Training example with different numbers of labelers T: Estimated ROC curves from MMSE testing with Algorithm 1 on the held-out testing set. The chance line appears as a black dotted line. On Truthing Issues in Supervised Classification 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 1 Testing: ROC Analysis (vs. Correct Labels) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 5 Testing: ROC Analysis (vs. Correct Labels) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 9 Testing: ROC Analysis (vs. Correct Labels) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 13 Testing: ROC Analysis (vs. Correct Labels) Ideal (correct labels known during training and testing) MMSE training: Performance against correct labels ML training: Performance against correct labels Label-est.: Performance against correct labels Voting among labelers: Performance against correct labels Figure 21: Training example with different numbers of labelers T: Actual ROC curves on the held-out testing set. The chance line appears as a black dotted line. Number of Labelers = 1 Testing: ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 5 Testing: ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 9 Testing: ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Number of Labelers = 13 Testing: ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels Figure 22: Training example with different numbers of labelers T for ML-trained classifiers: Estimated ROC curve posteriors from Algorithm 1 on the held-out testing set. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line. performance curve will not be available, but the support of the performance curve posterior can help one understand the potential variability in performance that might occur. 5.5 Examples of Equivalent Mutual Information In Section 4, the use of mutual information as a basis for comparing different combinations of labelers implied that multiple mediocre labelers could be as informative as a single expert labeler. To check the implication, we revisited the Ionosphere data set and simulated noisy labels with the model of (57), both for a single good labeler with ε = 0.05 and for nine mediocre labelers, each with ε = 0.25. Every labeler provided a noisy label for every sample. Then I(Z; Y |ε = 0.05) = 0.667 bits, and I(Z; Y |T = 9, ε = 0.25) = 0.758 bits. As in Section 5.4, we trained ML-optimal and MMSE-optimal classifiers using regularized logistic regression and estimated the testing metric RVs with Algorithm 1 for MMSE testing. On Truthing Issues in Supervised Classification Number of Labelers = 1 Testing: Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Number of Labelers = 5 Testing: Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Number of Labelers = 9 Testing: Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Number of Labelers = 13 Testing: Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels Figure 23: Training example with different numbers of labelers T for MMSE-trained classifiers: Estimated P-R curve posteriors from Algorithm 1 on the held-out testing set. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line. Figure 24 shows ROC analysis plots for the held-out testing set. The results show that the classifier trained with the mediocre labelers noisy labels performs better than the one trained with the single good labeler s noisy labels, which confirms the implication. As an extreme example, we simulated a single expert labeler with ε = 0.01 and 399 poor labelers, each with ε = 0.45, so I(Z; Y |ε = 0.01) = 0.863 bits and I(Z; Y |T = 399, ε = 0.45) = 0.859 bits. The estimated P-R curve posteriors appear in Figure 25; when τ > 1, both classifiers predict ˆyi = 0, i, so recall is zero but precision is undefined, and no points for rec = 0 are plotted. The curves indicate comparable performance and again confirm the implication. 6. Summary, Conclusions, and Future Directions In supervised classification, a number of truthing issues may arise: noisy labels; missing labels; multiple, conflicting labels for the same sample; and different combinations of labelers for different samples. This situation involves three components, each of which requires a model: truthing warrants a noisy-label model, training learns a predictive model, and testing calls for a testing model. We did not study the problem of formulating and learning a noisylabel model, which is the subject of much of the related work. Instead, we concentrated on testing and training, and we began by assuming that a good noisy-label model p(z|y, ψ)π(y) was available, which makes our work compatible with and complementary to the related work. Our methods support models with dependent labelers. 6.1 Summary and Conclusions By applying principles from Bayesian estimation theory, we succeeded in obtaining some promising and insightful answers to the questions posed in the introduction. 1. How can one test a classifier in the presence of truthing issues? Given noisy labels z, predicted labels ˆy, noisy-label model parameters ψ, and class prior π, we developed testing methods that are optimal: they estimate the metric RVs rather than the correct-label RVs, they fully exploit all available information, and they optimize a well-defined criterion (MMSE). Our approach is completely separate from training and applicable beyond the realm of machine learning. For example, it could be used to reconcile diagnoses made by clinicians or categorizations assigned by scientists. To arrive at the methods, we proposed a novel testing model (7), and we used it to derive approximate marginal posteriors for several scalar metrics, as well as joint posteriors for ROC and P-R analysis. We then introduced MMSE testing and developed empirical Bayes algorithms (Algorithms 1 and 2) for iteratively finding the MMSE estimate of the testing-model parameters from {z, ˆy, ψ, π}. After estimating the parameters, we calculated Bayesian optimal estimates (MMSE or MAP point estimates, or credible regions) of the metric RVs. Finally, we extended the approach to multi-class classification. In our experiments, MMSE testing provided excellent estimates of many binaryclassification metrics. Their estimation errors were an order of magnitude smaller On Truthing Issues in Supervised Classification Single Good Labeler (Error Prob. = 0.05) Testing: ROC Analysis 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region Nine Mediocre Labelers (Error Prob. = 0.25) Testing: ROC Analysis 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) MMSE training: Est. posterior MMSE training: Performance against correct labels MMSE training: Cond. mean from est. posterior MMSE training: 95%-credible region ML training: Est. posterior ML training: Performance against correct labels ML training: Cond. mean from est. posterior ML training: 95%-credible region Single Good Labeler (Error Prob. = 0.05) ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels Nine Mediocre Labelers (Error Prob. = 0.25) ROC Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels Single Good Labeler (Error Prob. = 0.05) ROC Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels Nine Mediocre Labelers (Error Prob. = 0.25) ROC Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Prob. of False Alarm Prob. of Detection Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels Figure 24: ROC analysis for a single good labeler (left) and multiple mediocre labelers (right). The upper graphs show performance for the single threshold τ = 1/2. The middle graphs show the estimated ROC curve posteriors for ML-trained classifiers. The lower graphs show the estimated ROC curve posteriors for MMSE-trained classifiers. The posteriors appear as heat maps with a base10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line. Single Expert Labeler (Error Prob. = 0.01) Precision-Recall Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Recall Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels 399 Poor Labelers (Error Prob. = 0.45) Precision-Recall Analysis (ML Training) 0 0.2 0.4 0.6 0.8 1 Recall Ideal (correct labels known during training and testing) ML training: Cond. mean from est. posterior ML training: Performance against correct labels Single Expert Labeler (Error Prob. = 0.01) Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels 399 Poor Labelers (Error Prob. = 0.45) Precision-Recall Analysis (MMSE Training) 0 0.2 0.4 0.6 0.8 1 Recall Ideal (correct labels known during training and testing) MMSE training: Cond. mean from est. posterior MMSE training: Performance against correct labels Figure 25: Estimated P-R curve posteriors for a single expert labeler (left) and many poor labelers (right). The upper graphs show results for ML-trained classifiers; the lower graphs show results for MMSE-trained classifiers. The posteriors appear as heat maps with a base-10 logarithmic color scale; the color bar ticks correspond to the exponents of the scale. The chance line appears as a black dotted line. On Truthing Issues in Supervised Classification than those for suboptimal methods like estimating the correct labels or averaging the metrics from individual labelers. For multi-class classification, MMSE testing outperformed the suboptimal methods at estimating accuracy and individual elements of the confusion matrix. 2. How can one train a classifier in the presence of truthing issues? With z, ψ, π, and the feature vectors x given, we presented a unified view of training that is elegant and intuitive. The unified view explains how to train a broad range of classifiers, and it organizes some of the related work. Each one of our training approaches is optimal: it employs an appropriate likelihood function, posterior, or estimator; it fully exploits the available information; and it optimizes a well-defined penalty or utility criterion. None of the approaches estimates the correct labels. For probabilistic (i.e., generative or discriminative) models with parameters θ, we showed that the optimization principle from ideal training can be retained. In ML training, the likelihood function p(y|x; θ) or p(y, x; θ) is replaced with p(z|x; ψ, θ) or p(z, x; ψ, θ), respectively. In MAP training, the posterior p(θ|y, x) is replaced by p(θ|z, x; ψ). For non-probabilistic models, we proposed MMSE training, which retains the original loss function and minimizes the in-sample MMSE estimate of the empirical-risk RV. Some related work has proposed the same form of training but did not use estimation theory to motivate it. We discussed properties of the MMSE estimator and provided a condition for when the MMSE estimator is a consistent estimator. Experiments using binary logistic regression demonstrated the effectiveness of our training approach, as well as the competitiveness of some suboptimal methods, like estimating the correct labels or voting among labelers, which are compatible with existing machine-learning infrastructure. We reasoned that regularization helps the suboptimal methods compensate for their estimation deficiencies. The experiments employed our testing methods, thus demonstrating the feasibility of training and testing with noisy labels only. Moreover, they showed that our testing methods can provide approximate posteriors and optimal estimates of ROC and P-R curves. 3. How can one compare different combinations of labelers with different abilities? The noisy-labeling process can be viewed as a broadcast channel, so mutual information quantifies the amount of information about the correct label conveyed by a group of labelers, and it facilitates comparison between different combinations of labelers. As another basis for comparison, any combination of labelers can be represented as an equivalent single labeler with some corresponding error probability. This observation implies that multiple mediocre labelers can convey information greater than or equal to that from a single expert labeler. The preceding statement is theoretical; it does not explain how to extract this information in practice. Fortunately, our training and testing methods provide a way to do so. The work in this paper culminated in training and testing experiments that confirmed the implication and showed that our methods can realize its benefits. 6.2 Expanded Workflow Truthing issues are a reality in many applications. To address them, we advocate an expanded workflow that combines this work and the complementary related work. Training must learn two models: a noisy-label model and the desired predictive model. The models can be learned separately, using methods like those in Section 1.2.1 to learn the noisy-label model and then using the training techniques from Section 3 to learn the predictive model. Alternatively, the models can be learned jointly, using techniques like those by Raykar et al. (2010), Khetan et al. (2018), or Tanno et al. (2019). After both models have been learned, the predictive model can be tested using Algorithm 1, 2, or 4 from MMSE testing. 6.3 Future Directions We close with a discussion of areas for future work. 6.3.1 Directly-Related Topics A number of aspects of MMSE testing merit further study. First, Section 2.2 applied the CLT to the common RVs U and V , and subsequent manipulations produced the approximate posteriors of the metric RVs in Section 2.3; it would be useful to consider the case when one or both of the summations in (8) and (9) contain an insufficient number of terms to justify the CLT. Second, it would be fulfilling to obtain an expression for the joint posterior (17) of (PD, PFA) along the chance line p D = p FA. Third, closed-form expressions for the maximum of (17) and (18) would simplify MAP estimation of the ROC and P-R operating points. Fourth, Section 2.4.4 discussed convergence of the empirical Bayes methods, but more detailed study is called for. Finally, Section 2.7 examined multi-class classification, and it considered accuracy and individual elements of the confusion matrix. One could investigate other multi-class classification metrics, the joint distribution of the confusion matrix, the effect of a highly imbalanced class prior, and tactics when the number of classes is large. Regarding training of a probabilistic predictive model, Section 3.3 showed that ML or MAP training could be modified for noisy labels. One could apply the forms given in Table 9 to adapt training from the ideal case to the case of truthing issues. For MMSE training of non-probabilistic predictive models, Section 3.4.3 explained that its gradient is amenable to automatic differentiation (cf. (50)), so it would be exciting to see MMSE training applied to deep neural networks. Section 3.4.5 considered consistency of the MMSE estimator of the empirical-risk RV and gave the simplest of sufficient conditions. One could derive other conditions or bounds on the estimation error. One could also investigate conditions under which MMSE training preserves consistency of the ERM principle. The work of Khetan et al. (2018), Cid-Sueiro (2012), and Cid-Sueiro et al. (2014) could be useful in this regard. The information-theoretic view was illustrated with the BSBC in Section 4.1. One could examine an asymmetric channel (i.e., a channel with different miss and false-alarm probabilities), a channel with dependent labelers, or a multi-class channel. On Truthing Issues in Supervised Classification 6.3.2 Learning and Labeler Allocation This paper has assumed that the noisy-label model is known, but in many cases it must be learned. This requirement creates a variety of allocation problems. As an example of learning allocation, suppose that one has a small set {x , y , z } with both correct and noisy labels and a much larger set {x, z} with only noisy labels. How should one partition the sets for learning the noisy-label model, training the predictive model, and testing the learned predictive model? Two examples of labeler allocation follow. Given a labeling budget and costs for labelers with different abilities, what is the most cost-effective way to acquire labels?22 Similarly, given a set of samples with different labeling difficulties (e.g., images under a variety of lighting conditions), how should the labeling effort be distributed across the set? 6.3.3 Extension to Weak Supervision The ideas in the related work and this paper could be adapted to other forms of supervised learning that involve noisy annotation, also known as weak supervision. Per Section 1.4, it would require three models: the usual predictive model, a noisy-annotation model for the imperfect annotation process, and a testing model that relates the predictions and noisy annotations. The noisy-annotation model is like the noisy-label model p(z|y)π(y) and provides estimated probabilities. It should generalize to unseen, out-of-sample realizations of (Z, Y ), so it will be learned with machine-learning methods. The testing model is analogous to p(ˆy, z|y) = p(ˆy|y)p(z|y) in (7); it characterizes in-sample performance on the testing set, so it will be learned with estimation-theoretic methods. For training, many aspects of MMSE training are likely applicable to weak supervision. Section 3.4 and Appendix E.4 allow for a generic loss function and noisy-annotation model. If the annotation set Y is continuous rather than finite, then the summations over Y must be replaced by integrals (cf. (37), (40), (41), (43), (50)), and techniques for computing or approximating the integrals will be needed. For testing of the predictive model, work tailored to the particular metrics will be required; the MMSE testing approach can provide a general strategy. Testing should follow the principle of estimating the in-sample metric RV rather than the correct annotation. First, one should identify a suitable testing model and parameters23 (cf. Section 2.1) and express the metrics as RVs (Section 2.2). Second, one should obtain the posteriors of the parameters and metric RVs (Section 2.3); if the samples are independent, then the CLT may be helpful. Third, one should develop algorithms for estimating the parameters (Section 2.4). Finally, once the parameters have been estimated, the posteriors of the metric RVs can be used to find optimal estimates of the metrics (Section 2.5). Acknowledgments 22. See Sheng et al. (2008), for example. 23. For binary classification, the OP parameters ( p D, p FA) corresponded to probability of detection and probability of false alarm, which are standard metrics, but multi-class classification introduced the conditional confusion matrix, which is not commonly used. The author gratefully thanks Rajmonda Caceres, John Holodnak, Jason Matterer, and Anna Yanchenko for their helpful comments and interest in this work, as well as Paul Monticciolo, Vijay Gadepally, Tim Dasey, and Jason Thornton for enabling this work. The author also thanks the Action Editors and anonymous reviewers for their careful attention, thoughtful questions, and helpful suggestions. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Appendix A. Metrics in Terms of Common RVs For probability of false alarm, its RV form is 1 N PN i=1 1(ˆyi = 1 and Yi = 0) 1 N PN i=1 1(Yi = 0) . The numerator is i=1 1(ˆyi = 1 and Yi = 0) = 1 i:ˆyi=1 1(Yi = 0) i:ˆyi=1 (1 1(Yi = 1)) i:ˆyi=1 1 1 i:ˆyi=1 1(Yi = 1) and the denominator is i=1 1(Yi = 0) = 1 i=1 (1 1(Yi = 1)) i=1 1(Yi = 1) = 1 (U + V ). On Truthing Issues in Supervised Classification The RV form of accuracy is i=1 1(Yi = ˆyi) i:ˆyi=0 1(Yi = 0) + 1 i:ˆyi=1 1(Yi = 1) i:ˆyi=0 (1 1(Yi = 1)) + U N (N ˆN1) V + U = U V ˆN1/N + 1. The RV form of precision is 1 N PN i=1 1(ˆyi = 1 and Yi = 1) 1 N PN i=1 1(ˆyi = 1) 1 N P i:ˆyi=1 1(Yi = 1) Fβ is obtained by taking Fβ = (1 + β2) PREC REC β2PREC + REC , substituting the expressions for PREC and REC in Table 8, and performing a little algebra. Appendix B. Ratios of Jointly Normal Random Variables This section summarizes the procedure from Marsaglia (2006, 1965) for calculating the distribution or approximating the moments of the ratio of jointly normal RVs Z and W . Let E[Z ] = µZ , var(Z ) = σ2 Z , E[W ] = µW , var(W ) = σ2 W , and ρ = cov(Z , W )/σZ σW . The density of Z /W is obtained as follows: 1. Let h = σZ p 1 ρ2; the sign of h will be determined momentarily. Also let r = σW /h, s = ρσZ /σW , b = µW /σW , and a = (µZ sµW )/h. 2. Choose the sign of h so that a and b have the same sign. Then the RV T = r(Z /W s) can be expressed in the form (a + X )/(b + Y ), where X and Y are independent N(0, 1) RVs. The density of T is p(t ) = e (a2+b2)/2 1 + q(t )e(q(t ))2/2 Z q(t ) 0 e x2/2 dx, where q(t ) = (b + at )/ q (1 + t 2), and the integral can be calculated using the error function erf(y) = (2/ π) R y 0 e τ 2 dτ. Symbol PD, REC PFA Fβ Z U ( ˆN1/N) U (1 + β2)U W U + V 1 (U + V ) β2(U + V ) + ˆN1/N µZ µU ( ˆN1/N) µU (1 + β2)µU σ2 Z σ2 U σ2 U (1 + β2)2σ2 U µW µU + µV 1 (µU + µV ) β2(µU + µV ) + ˆN1/N σ2 W σ2 U + σ2 V σ2 U + σ2 V β4(σ2 U + σ2 V ) cov(Z , W ) σ2 U σ2 U β2(1 + β2)σ2 U ρ σU Table 16: Parameters for scalar metric RVs with posteriors equal to the ratio Z /W of jointly approximately normal RVs Z and W . 3. It follows that Z /W = T /r + s, so p Z /W (ζ) = |r| p T (r(ζ s)). The density p(t ) includes the standard Cauchy density 1/ π(1 + t 2) , so technically, the moments of T of order greater than zero do not exist; i.e., R t ip(t ) dt is infinite for i {1, 2, . . .}. However, Marsaglia (2006, 4) points out that, in practice, one might be able to assume that the denominator b + Y approaches zero with negligible probability, enabling one to compute the moments (conditioned on this assumption). For example, if b = 4, then Pr(b + Y 0) = Pr(Y 4) 3.17 10 5. Marsaglia reports that, conditioned on b > 4 (or Y > 4), the mean and variance of T are approximately µT = a/(1.01 b 0.2713) and σ2 T = (a2 + 1)/(b2 + 0.108 b 3.795) µ2 T . Finally, Marsaglia also observes that, if a < 2.256 and b > 4, then T can be reasonably approximated by a normal distribution. Appendix C. Review of MMSE Estimation Let A be the unobserved RV, B be the observed RV, and let f(A) = f1(A) f D(A) T be a D-dimensional vector of scalar functions of A, with E[f2 j (A)] < , j = 1, . . . , D. Define an estimator of f(A) from B as h(B) = h1(B) h D(B) T, and define the MSE of h(B) by mse h(B), f(A) = PD j=1 E hj(B) fj(A) 2 . The goal is to find the MMSE estimator h MMSE = arg minh mse h(B), f(A) . For Section 2.4.1, A = (PD, PFA), B = ( ˆY , Z), and f(A) is the identity function f(A) A. The estimators are h1(B) = h D ˆY , Z, ψ, p(j 1) D , p(j 1) FA and h2(B) = h FA ˆY , Z, ψ, p(j 1) D , p(j 1) FA , where ψ, p(j 1) D , and p(j 1) FA are non-random parameters. In Section 2.6.2 on fully Bayesian estimation of a metric RV like accuracy, A = ACC, B = ( ˆY , Z), f(A) is the identity function, and ψ is a non-random parameter. For estimation of the empirical-risk RV in Section 3.4.1 and Appendix E.4, A = Y , B = Z, f(A) = Jpri(θ; x, Y ) = J(Y ), and an estimator is h(B) = ˆJ(θ; x, Z, ψ, π) = ˆJ(Z). For estimating On Truthing Issues in Supervised Classification the ith loss-function RV in Appendix E.4, A = Yi, B = Zi, f(A) = L( g(xi; θ), Yi) = ℓi(Yi), and an estimator is h(Zi) = ˆℓi(xi, θ, Zi) = ˆℓi(Zi). C.1 Standard Results 1. The MMSE estimator is h MMSE(B) = E[f(A)|B], or h MMSE d (B) = E[fd(A)|B], d = 1, . . . , D; that is, the MMSE estimator is the conditional mean. The MMSE estimator is an RV because it is a function of B. Given B = b, the MMSE estimate is the nonrandom quantity h MMSE(b) = E[f(A)|B = b], or h MMSE d (b) = E[fd(A)|B = b], d = 1, . . . , D. See (Van Trees, 1968, 2.4.1), (Papoulis, 1991, 7-5, 8-3), (Kay, 1993, 10.3), (Kamen and Su, 1999, Theorems 3.1, 3.4), (Oppenheim and Verghese, 2015, 8.1, 8.2). 2. The MMSE estimator is unbiased: E[h MMSE(B)] = E[f(A)]; see (Papoulis, 1991, 7-4), (Kay, 1993, 11.6), (Kamen and Su, 1999, 3.3). 3. The estimation error h MMSE(B) f(A) has the following properties. (a) The mean is zero: E h MMSE(B) f(A) = 0 from Result 2. (b) The total variance equals the MMSE, which is the MSE of the MMSE estimator: PD j=1 var h MMSE j (B) fj(A) = mse h MMSE(B), f(A) ; see (Kay, 1993, 11.6). (c) (Orthogonality principle) The error is orthogonal to every function w(B) = w1(B) w D(B) T: E h MMSE(B) f(A) Tw(B) = 0; see (Papoulis, 1991, 8-3), (Kamen and Su, 1999, Theorems 3.2, 3.3), (Oppenheim and Verghese, 2015, 8.2.1). 4. The MMSE estimator can be related to the law of total variance, which states that, for any RV G with finite variance, var(G) = E[var(G|B)]+var(E[G|B]), and E[var(G|B)] and var(E[G|B]) are the unexplained and explained variances of G, respectively; see (Blitzstein and Hwang, 2019, 9.5). (a) The MMSE is equal to the total unexplained variance of f(A): mse h MMSE(B), f(A) = PD j=1 E[var(fj(A)|B)]; see (Van Trees, 1968, 2.4), (Kay, 1993, 10.4). (b) The total variance of the MMSE estimator equals the total explained variance of f(A): PD j=1 var h MMSE j (B) = PD j=1 var( E[fj(A)|B] ), which means the MMSE estimator accounts for as much of the variation of f(A) as possible given B. (c) Hence, the total variance of f(A) equals the sum of the MMSE (the unexplained variance) and the variance of the MMSE estimator (the explained variance): j=1 var(fj(A)) = mse(h MMSE(B), f(A)) | {z } PD j=1 E[ var(fj(A)|B) ] j=1 var(h MMSE j (B)) | {z } PD j=1 var( E[fj(A)|B] ) 5. Results 2 through 4 apply to the MMSE estimator. For the MMSE estimate given B = b, the MSE equals the sum of the conditional variances of the functions being estimated: mse h MMSE(B), f(A) B = b = PD j=1 var(fj(A)|B = b); see (Oppenheim and Verghese, 2015, 8.1). C.2 Proofs and Derivations of Standard Results The derivation of Result 1 assumes B is continuous but is easily modified if B is discrete. Expand the MSE as mse h(B), f(A) = E[h2 j(B)] 2E[hj(B)fj(A)] + E[f2 j (A)] E[h2 j(B)] 2E E[hj(B)fj(A)|B] + E[f2 j (A)] Z p(b)h2 j(b) db 2 Z p(b)E[hj(B)fj(A)|B = b] db + E[f2 j (A)] Z p(b)h2 j(b) db 2 Z p(b)hj(b)E[fj(A)|B = b] db + E[f2 j (A)] , where (a) is from iterated expectations, and (b) is because hj(B) is non-random and equal to hj(b) when conditioned on B = b. For d = 1, . . . , D, take the partial derivative of the above equation with respect to hd, set it equal to zero, and solve for hd. Doing so gives hd = 2 Z p(b)hd(b) db 2 Z p(b)E[fd(A)|B = b] db, d = 1, . . . , D, = 2 Z p(b) hd(b) E[fd(A)|B = b] db, d = 1, . . . , D. Since p(b) is a probability distribution, this expression is zero when hd(b) = E[fd(A)|B = b] for any valid b, which is just the definition of the conditional mean. Also, 2mse/ h2 d = 2 R p(b) db = 2 > 0, d = 1, . . . , D, so the solution is the unique minimum. Thus, the MMSE estimator is h MMSE(B) = E[f(A)|B], or h MMSE d (B) = E[fd(A)|B], d = 1, . . . , D; and the MMSE estimate given B = b is h MMSE(b) = E[f(A)|B = b], or h MMSE d (b) = E[fd(A)|B = b], d = 1, . . . , D. This proves Result 1. For Result 2, write E[h MMSE(B)] = E E[f(A)|B] (a) = E[f(A)], where (a) applies iterated expectations. For Result 3a, unbiasedness means E h MMSE(B) f(A) = 0. For Result 3b, the preceding relation means PD j=1 var h MMSE j (B) fj(A) = PD j=1 E h MMSE j (B) fj(A) 2 = mse h MMSE(B), f(A) . To obtain Result 3c, for any valid j and b, write E (h MMSE j (B) fj(A))wj(B) B = b = h MMSE j (b)wj(b) E[fj(A)|B = b]wj(b) = E[fj(A)|B = b]wj(b) E[fj(A)|B = b]wj(b) = 0. Then E h MMSE(B) f(A) Tw(B) = PD j=1 E h MMSE j (B) fj(A) wj(B) = PD j=1 E E h MMSE j (B) fj(A) wj(B) B = PD j=1 R E h MMSE j (B) fj(A) wj(B) B = b p(b) db = PD j=1 R 0 p(b) db = 0. For Result 4a, write mse h MMSE(B), f(A) (a) = PD j=1 E E h MMSE j (B) fj(A) 2 B = PD j=1 E E fj(A) E[fj(A)|B] 2 B (b) = PD j=1 E[ var(fj(A)|B) ], where (a) uses iterated expectations and (b) applies the definition of conditional variance. Result 4b is because On Truthing Issues in Supervised Classification h MMSE j (B) = E[fj(A)|B], so their variances are equal. Applying the law of total variance to each term in PD j=1 var(fj(A)) yields Result 4c. For Result 5, conditioning the MSE on B = b gives mse h MMSE(B), f(A) B = b = PD j=1 E fj(A) E[fj(A)|B] 2 B = b = PD j=1 E fj(A) E[fj(A)|B = b] 2 B = b (a) = PD j=1 var(fj(A)|B = b), where (a) uses the definition of conditional variance given B = b. Appendix D. Review of MPE and MAP Estimation Additional derivations appear in Van Trees (1968, 2.4.1) and Kay (1998, Ch. 3). Let Y be a finite RV with domain Y, and let B be the observed, discrete RV. Denote an estimator of Y given B as h(B). The probability of error of h(B) is perror(h(B), Y ) = Pr(h(B) = Y ) = Ep(y,b)[1(h(B) = Y )]. The goal is to find the MPE estimator, namely h MPE = arg minh perror(h(B), Y ). For Section 2.6.1, Y = Yi, and B = ( ˆYi, Zi) with additional conditioning parameters ψi, p(j 1) D , p(j 1) FA . For Section 3.7, Y = Yi, and B = Zi with additional parameter ψi. Write perror(h(B), Y ) = E[1 1(h(B)=Y )] = 1 E E[1(h(B)=Y ) | B] = 1 P b p(b) E[1(h(B) = Y ) | B = b]. The last form is minimized by maximizing the conditional expectation for each b. Then E[1(h(B)=Y ) | B =b] = P y Y p(y|b)1(h(b)=y) = p Y |B(h(b) | b), and the maximum occurs when h(b) is the MAP estimate of Y given B = b; i.e., when h(b) = h MAP(b) = arg maxy Y p(y|b). Define the MAP estimator h MAP(B) as the function of B that returns h MAP(b) when B = b; then the MPE estimator is h MPE(B) = h MAP(B). Appendix E. Training with Truthing Issues E.1 Discriminative Model, Random Parameters Ideal case: Find θ to maximize the posterior p(θ|y, x), given by p(θ|y, x) (a) p(y|x, θ) p(θ|x) (b) = p(θ) QN i=1 p(yi|xi, θ), where (a) uses Bayes rule, and (b) is from (29). Truthing issues: Find θ to maximize p(θ|z, x; ψ), and write p(θ|z, x; ψ) (a) p(z|x, θ; ψ) p(θ|x; ψ) (b) = p(θ) QN i=1 P yi Y p(zi|yi; ψi)p(yi|xi, θ), where (a) uses Bayes rule, and (b) applies (31). E.2 Generative Model, Non-Random Parameters Ideal case: Find θ to maximize the likelihood function p(y, x; θ), which can be written as p(y, x; θ) = p(x|y; θ)p(y) = QN i=1 p(xi|yi; θ)π(yi). Truthing issues: Instead, find θ to maximize p(z, x; ψ, θ) = QN i=1 p(zi, xi; ψi, θ) (a) = QN i=1 P yi Y p(zi, yi, xi; ψi, θ) = QN i=1 P yi Y p(xi | zi, yi; ψi, θ)p(zi | yi; ψi, θ)p(yi; ψi, θ) (b) = QN i=1 P yi Y p(xi|yi; θ)p(zi|yi; ψi)π(yi), where (a) uses marginalization and (b) applies nondependencies. E.3 Generative Model, Random Parameters Ideal case: Find θ to maximize the posterior p(θ|y, x); then p(θ|y, x) (a) p(y, x|θ)p(θ) (b) = p(θ) QN i=1 p(xi|yi, θ)π(yi), where (a) uses Bayes rule and (b) substitutes the expression for p(y, x|θ) from the ideal case in Appendix E.2. Truthing issues: Find θ to maximize p(θ|z, x; ψ). Write p(θ|z, x; ψ) (a) p(z, x|θ; ψ)p(θ) (b) = p(θ) QN i=1 P yi Y p(xi|yi, θ)p(zi|yi; ψi)π(yi), where (a) applies Bayes rule and (b) obtains p(z, x|θ; ψ) in the same way that Appendix E.2 obtained p(z, x; ψ, θ). E.4 Non-Probabilistic Models: MMSE Estimation of Empirical-Risk RV As shorthand, denote the ith loss-function RV by ℓi(Yi) = L( g(xi; θ), Yi) and the empiricalrisk RV by R(Y ) = R(θ; x, Y ) = N 1 PN i=1 ℓi(Yi), which is (34). Also let ˆR(Z) = ˆR(θ; x, Z) be an estimator of R(Y ) with mse( ˆR(Z), R(Y )) = E[( ˆR(Z) R(Y ))2]. The MMSE estimator is Jpri(θ; x, Z, ψ, π) = ˆRMMSE(Z) = arg min ˆR mse( ˆR(Z), R(Y )). The standard result (see Appendix C) is that the MMSE estimator is the conditional mean of R(Y ) given Z: ˆRMMSE(Z) = E[R(Y )|Z], which is (35). Then ˆRMMSE(Z) = N 1 PN i=1 E[ℓi(Yi)|Z] = N 1 PN i=1 E[ℓi(Yi)|Zi] = N 1 PN i=1 ˆℓMMSE i (Zi), which is (36). The last expression is because, for an estimator ˆℓi(Zi) = ˆℓi(xi, θ, Zi) of the ith lossfunction RV ℓi(Yi), the MMSE estimator is ˆℓMMSE i (Zi) = arg minˆℓi mse(ˆℓi(Zi), ℓi(Yi)) = arg minˆℓi E[(ˆℓi(Zi) ℓi(Yi)2] = E[ℓi(Yi)|Zi]. Thus, ˆRMMSE(Z) equals the average of the MMSE estimators of the individual loss-function RVs. Likewise, the estimation error is ˆRMMSE(Z) R(Y ) = N 1 PN i=1 ˆℓMMSE i (Zi) ℓi(Yi) , the average of the estimation errors of the MMSE loss-function estimators. The estimation error of each of these estimators has E[ˆℓMMSE i (Zi) ℓi(Yi)] = 0 and var(ˆℓMMSE i (Zi) ℓi(Yi)) = mse(ˆℓMMSE i (Zi), ℓi(Yi)) = E[var(ℓi(Yi)|Zi)]. The samples are independent, so the estimation errors of the MMSE loss-function estimators are independent, too. By the CLT, the estimation error of ˆRMMSE(Z) converges in distribution to a normal RV with mean zero and variance var( ˆRMMSE(Z) R(Y )) = N 2 PN i=1 var(ˆℓMMSE i (Zi) ℓi(Yi)) = N 2 PN i=1 mse(ˆℓMMSE i (Zi), ℓi(Yi)) = N 2 PN i=1 E[var(ℓi(Yi)|Zi)], which is (47). Appendix F. Regularized Logistic Regression Training Equations Let the feature dimensionality be D, so θ has dimensionality D + 1. Define x T i = 1 x T i , and index xi and θ from zero rather than one. Logistic regression uses the model Yi|xi, θ B( g(xi; θ)), where g(xi; θ) = 1/(1 + e x T i θ), so the model assumes p(yi|xi; θ) = (1 g(xi; θ))1 yi g(xi; θ)yi. We use the L2-regularization term Jreg(θ) = 1 2N PD j=1 θ2 j; the intercept weight θ0 is not subject to regularization and is omitted from the summation. On Truthing Issues in Supervised Classification F.1 ML Training From (26) and (30), ML training in the ideal case seeks θ to minimize Jideal(θ; x, y) = 1 N log p(y|x; θ) | {z } Jpri(θ;x,y) +λJreg(θ) = 1 i=1 log p(yi|xi; θ) + λ and some algebra yields Jideal(θ; x, y) = 1 (1 yi) log 1 1 + e+ x T i θ + yi log 1 1 + e x T i θ j=1 θ2 j. (62) For the gradient, the partial derivatives with respect to θj are 1 + e x T i θ yi N θj1(j = 0), j = 0, 1, . . . , D. (63) From (28) and (32), ML training with truthing issues means minimizing JML(θ; x, z, ψ, π) = 1 N log p(z|x; ψ, θ) | {z } Jpri(θ;x,z,ψ,π) yi=0 p(yi|xi; θ)p(zi|yi; ψi) + λ yi=0 p(yi|xi; θ) Y t:zi,t = p(zi,t|yi; δi, φt) 1 + e+ x T i θ Y t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt) 1 + e x T i θ Y t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt) where (a) applies ψi = (δi, φ) along with (61), and (b) is because p Y (0|xi; θ) = 1 g(xi; θ) and p Y (1|xi; θ) = g(xi; θ). For the gradient, differentiation and some algebra yield Q t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt) Q t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt) Q t:zi,t = p Zi,t|Yi(zi,t|0; δi, φt) + e+ x T i θ Q t:zi,t = p Zi,t|Yi(zi,t|1; δi, φt) 1 + e x T i θ xi(j) + λ N θj1(j = 0), j = 0, 1, . . . , D. F.2 MMSE Training For ideal training, rewrite the first term in (62) as the empirical risk (33) by setting s = g(x; θ) = 1/(1 + e x T i θ) and defining L(s, y) = (1 y) log(1 s) y log s. Then L( g(x; θ), y) = 1 1 + e x Tθ y x(j). (64) For the gradient, plugging this equation into (48) and (49) again yields (63). For MMSE training with truthing issues, we apply (40) and minimize JMMSE(θ; x, z, ψ, π) = 1 p Y |Z(0|zi; ψi) log 1 1 + e+ x T i θ + p Y |Z(1|zi; ψi) log 1 1 + e x T i θ For the gradient, plugging (64) into (50) and simplifying yields p Y |Z(0|zi; ψi) 1 1 + e x T i θ p Y |Z(1|zi; ψi) 1 1 + e+ x T i θ N θj1(j = 0), j = 0, 1, . . . , D. Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. CRC Press, Boca Raton, FL, USA, 2nd edition, 2019. doi: 10.1201/9780429428357. Steve Branson, Grant Van Horn, and Pietro Perona. Lean crowdsourcing: Combining humans and machines in an online system. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 6109 6118, July 2017. doi: 10.1109/ CVPR.2017.647. URL https://openaccess.thecvf.com/content_cvpr_2017/html/ Branson_Lean_Crowdsourcing_Combining_CVPR_2017_paper.html. Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199 231, August 2001. doi: 10.1214/ss/1009213726. M. C. Burl, U. M. Fayyad, P. Perona, P. Smyth, and M. P. Burl. Automating the hunt for volcanoes on Venus. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 302 309, 1994. doi: 10.1109/CVPR.1994.323844. Mark J. Carlotto. Effect of errors in ground truth on classification accuracy. Intl. J. Remote Sensing, 30(18):4831 4849, September 2009. doi: 10.1080/01431160802672864. George Casella. Illustrating empirical Bayes methods. Chemometrics and Intelligent Laboratory Systems, 16(2):107 125, October 1992. doi: 10.1016/0169-7439(92)80050-E. On Truthing Issues in Supervised Classification Jes us Cid-Sueiro. Proper losses for learning from partial labels. In Neural Information Processing Systems (Neur IPS), pages 1565 1573. Curran Associates, Inc., December 2012. URL https://dl.acm.org/doi/10.5555/2999134.2999309. Jes us Cid-Sueiro, Dar ıo Garc ıa-Garc ıa, and Ra ul Santos-Rodr ıguez. Consistency of losses for learning from weak labels. In Machine Learning and Knowledge Discovery in Databases, volume 8724 of Lecture Notes in Computer Science, pages 197 210. Springer, 2014. doi: 10.1007/978-3-662-44848-9 13. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, NY, USA, 1st edition, 1991. doi for 2nd edition: 10.1002/047174882X. A. Philip Dawid and Allan M. Skene. Maximum likelihood estimation of observer errorrates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20 28, 1979. doi: 10.2307/2346806. Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1 38, 1977. URL https://www.jstor.org/stable/ 2984875. Pinar Donmez, Guy Lebanon, and Krishnakumar Balasubramanian. Unsupervised supervised learning I: Estimating classification and regression errors without labels. J. Machine Learning Research (JMLR), 11(44):1323 1351, April 2010. URL https://www. jmlr.org/papers/v11/donmez10a.html. Dheeru Dua and Casey Graff. UCI machine learning repository. URL https://archive. ics.uci.edu/ml, 2017. URL visited Feb. 1, 2017. Mark Everingham et al. The 2005 PASCAL visual object classes challenge. In Machine Learning Challenges Workshop (MLCW), volume 3944 of Lecture Notes in Computer Science, pages 117 176. Springer, 2006. doi: 10.1007/11736790 8. Benoˆıt Fr enay and Michel Verleysen. Classification in the presence of label noise: A survey. IEEE Trans. Neural Networks and Learning Systems, 25(5):845 869, 2014. doi: 10.1109/ TNNLS.2013.2292894. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119 139, August 1997. doi: 10.1006/jcss.1997.1504. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall-CRC, Boca Raton, FL, USA, 3rd edition, 2013. doi: 10.1201/b16018. John T. Holodnak, Jason T. Matterer, and William W. Streilein. Estimating classifier accuracy using noisy expert labels. Technical Report 1225, MIT Lincoln Laboratory, Lexington, MA, USA, January 2018. Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In ACM Intl. Conf. Knowledge Discovery and Data Mining (SIGKDD), Workshop on Human Computation (HCOMP), pages 64 67, 2010. doi: 10.1145/1837885. 1837906. Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy labels with dropout regularization. In IEEE Intl. Conf. Data Mining (ICDM), pages 967 972. IEEE, 2016. doi: 10.1109/ICDM.2016.0121. Edward W. Kamen and Jonathan K. Su. Introduction to Optimal Estimation. Springer Verlag, London, UK, 1999. doi: 10.1007/978-1-4471-0417-9. David R. Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1 24, February 2014. doi: 10.1287/opre.2013.1235. Steven M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1993. Steven M. Kay. Fundamentals of Statistical Signal Processing: Detection Theory. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998. Ashish Khetan, Zachary C. Lipton, and Anima Anandkumar. Learning from noisy singlylabeled data. In Intl. Conf. Learning Representations (ICLR), December 2018. doi: 10.48550/ar Xiv.1712.04577. URL https://openreview.net/forum?id=H1s UHgb0Z. Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, USA, 2009. Chuck P. Lam and David G. Stork. Evaluating classifiers by means of test data with noisy labels. In Intl. Joint Conf. Artificial Intelligence (IJCAI), pages 513 518. Morgan Kaufmann, 2003. URL http://ijcai.org/Proceedings/03/Papers/076.pdf. G abor Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79 87, January 1992. doi: 10.1016/0031-3203(92)90008-7. George Marsaglia. Ratios of normal variables and ratios of sums of uniform variables. J. Amer. Statistical Assoc., 60(309):193 204, March 1965. doi: 10.2307/2283145. George Marsaglia. Ratios of normal variables. J. Statistical Software, 16(4):1 10, May 2006. doi: 10.18637/jss.v016.i04. Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Neural Information Processing Systems (Neur IPS), pages 1196 1204. Curran Associates Inc., December 2013. URL https://dl.acm.org/doi/10.5555/ 2999611.2999745. Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Costsensitive learning with noisy labels. J. Machine Learning Research (JMLR), 18(155): 1 33, April 2018. URL https://jmlr.org/papers/v18/15-226.html. On Truthing Issues in Supervised Classification Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Neural Information Processing Systems (Neur IPS), pages 841 848. MIT Press, January 2001. URL https://dl.acm.org/doi/ 10.5555/2980539.2980648. Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. J. Artificial Intelligence Research (JAIR), 70:1373 1411, May 2021a. ISSN 1076-9757. URL https://doi.org/10.1613/jair.1.12125. Curtis G. Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. In Neural Information Processing Systems (Neur IPS) Track on Datasets and Benchmarks. Curran Associates, Inc., December 2021b. URL https://doi.org/10.48550/ar Xiv.2103.14749. Alan V. Oppenheim and George C. Verghese. Signals, Systems and Inference. Pearson Education, London, UK, 2015. Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. Mc Graw Hill, New York, NY, USA, 3rd edition, 1991. Emmanouil A. Platanios, Avinava Dubey, and Tom Mitchell. Estimating accuracy from unlabeled data: A Bayesian approach. In Intl. Conf. Machine Learning (ICML), pages 1416 1425, June 2016. URL https://proceedings.mlr.press/v48/platanios16.pdf. Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R e. Snorkel: Rapid training data creation with weak supervision. In Intl. Conf. Very Large Data Bases (VLBD), volume 11(3), pages 269 282. VLDB Endowment, November 2017. doi: 10.14778/3157794.3157797. Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. In Neural Information Processing Systems (Neur IPS), pages 3567 3575. Curran Associates, Inc., December 2016. URL https://dl.acm.org/doi/10.5555/3157382.3157497. Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerado Hemosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. J. Machine Learning Research (JMLR), 11(43):1297 1322, April 2010. URL https://jmlr.csail.mit.edu/papers/ v11/raykar10a.html. Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. Public Library of Science (PLOS) ONE, 10(3):1 21, March 2015. doi: 10.1371/journal.pone.0118432. Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197 227, June 1990. doi: 10.1007/BF00116037. Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In ACM Intl. Conf. Knowledge Discovery and Data Mining (SIGKDD), pages 614 -622, 2008. doi: 10.1145/1401890.1401965. Padhraic Smyth, Usama Fayyad, Michael Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of Venus images. In Neural Information Processing Systems (Neur IPS), pages 1085 1092. MIT Press, January 1994. URL https://dl.acm. org/doi/10.5555/2998687.2998822. Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. In Intl. Conf. Learning Representations (ICLR), April 2015. doi: 10.48550/ar Xiv.1406.2080. Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C. Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 11236 11245, 2019. doi: 10.1109/CVPR.2019.01150. Grant Van Horn, Steve Branson, Scott Loarie, Serge Belongie, and Pietro Perona. Lean multiclass crowdsourcing. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 2714 2723, 2018. doi: 10.1109/CVPR.2018.00287. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Van_Horn_Lean_ Multiclass_Crowdsourcing_CVPR_2018_paper.pdf. Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. J. Machine Learning Research (JMLR), 18(228):1 50, July 2018. URL https: //jmlr.org/papers/v18/16-315.html. Harry L. Van Trees. Detection, Estimation, and Modulation Theory. Part I: Detection, Estimation, and Linear Modulation Theory. John Wiley & Sons, New York, NY, USA, 1968. doi: 10.1002/0471221082. Vladimir Vapnik. Principles of risk minimization for learning theory. In Neural Information Processing Systems (Neur IPS), pages 831 838. Morgan-Kaufmann, December 1991. URL https://dl.acm.org/doi/10.5555/2986916.2987018. Yehuda Vardi and Cun-Hui Zhang. The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. USA (PNAS), 97(4):1423 1426, February 2000. doi: 10.1073/pnas. 97.4.1423. Peter Welinder and Pietro Perona. Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In IEEE Computer Vision and Pattern Recognition (CVPR), Workshop on Advancing Computer Vision with Humans in the Loop (ACVHL), pages 25 32, June 2010. doi: 10.1109/CVPRW.2010.5543189. Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional wisdom of crowds. In Neural Information Processing Systems (Neur IPS), pages 2424 2432. Curran Associates, Inc., 2010. URL https://dl.acm.org/doi/10.5555/2997046. 2997166. Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Neural Information Processing Systems (Neur IPS), pages 2035 2043. Curran On Truthing Issues in Supervised Classification Associates, Inc., December 2009. URL https://dl.acm.org/doi/10.5555/2984093. 2984321. Dengyong Zhou, Qiang Liu, John C. Platt, Christopher Meek, and Nihar B. Shah. Regularized minimax conditional entropy for crowdsourcing. URL https://arxiv.org/abs/ 1503.07240, March 2015.