# Training Subset Selection for Weak Supervision

Hunter Lang (MIT CSAIL, hjl@mit.edu) · Aravindan Vijayaraghavan (Northwestern University, aravindv@northwestern.edu) · David Sontag (MIT CSAIL, dsontag@mit.edu)

## Abstract

Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic [23] to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.

## 1 Introduction

Due to the difficulty of hand-labeling large amounts of training data, an increasing share of models are trained with weak supervision [30, 28]. Weak supervision uses expert-defined labeling functions to programmatically label a large amount of training data with minimal human effort. This pseudolabeled training data is used to train a classifier (e.g., a deep neural network) as if it were hand-labeled data. Labeling functions are often simple, coarse rules, so the pseudolabels derived from them are not always correct. There is an intuitive tradeoff between the coverage of the pseudolabels (how much pseudolabeled data do we use for training?) and their precision on the covered set (how accurate are the pseudolabels that we do use?). Using all the pseudolabeled training data ensures the best possible generalization to the population pseudolabeling function $\hat{Y}(X)$. On the other hand, if we can select a high-quality subset of the pseudolabeled data, then our training labels $\hat{Y}(X)$ are closer to the true label $Y$, but the smaller training set may hurt generalization. However, existing weak supervision approaches such as Snorkel [28], MeTaL [29], FlyingSquid [10], and Adversarial Label Learning [3] use all of the pseudolabeled data to train the classifier, and do not explore this tradeoff.

We present numerical experiments demonstrating that the status quo of using all the pseudolabeled data is nearly always suboptimal. Combining good pretrained representations with the cut statistic [23] for subset selection, we obtain subsets of the weakly-labeled training data where the weak labels are very accurate. By choosing examples that have the same pseudolabel as many of their nearest neighbors in the representation, the cut statistic uses the representation's geometry to identify these accurate subsets without using any ground-truth labels. Using the smaller but higher-quality training sets selected by the cut statistic improves the accuracy of weak supervision pipelines by up to 19% accuracy (absolute). Subset selection applies to any label model (Snorkel, FlyingSquid, majority vote, etc.) and any classifier, since it is a modular, intermediate step between creation of the pseudolabeled training set and training. We conclude with a theoretical analysis of a special case of weak supervision where the precision/coverage tradeoff can be made precise.
## 2 Background

The three components of a weak supervision pipeline are the labeling functions, the label model, and the end model. The labeling functions are maps $\Lambda_k : \mathcal{X} \to \mathcal{Y} \cup \{\emptyset\}$, where $\emptyset$ represents abstention. For example, for sentiment analysis, simple token-based labeling functions are effective, such as:
$$\Lambda_1(x) = \begin{cases} +1 & \text{``good''} \in x \\ \emptyset & \text{otherwise}\end{cases} \qquad\qquad \Lambda_2(x) = \begin{cases} -1 & \text{``bad''} \in x \\ \emptyset & \text{otherwise}\end{cases}$$
If the word "good" is in the input text $x$, labeling function $\Lambda_1$ outputs $+1$; likewise, when "bad" $\in x$, $\Lambda_2$ outputs $-1$. Of course, an input text could contain both "good" and "bad", so $\Lambda_1$ and $\Lambda_2$ may conflict. Resolving these conflicts is the role of the label model. Formally, the label model is a map $\hat{Y} : (\mathcal{Y} \cup \{\emptyset\})^K \to \mathcal{Y} \cup \{\emptyset\}$. That is, if we let $\Lambda(x)$ refer to the vector $(\Lambda_1(x), \ldots, \Lambda_K(x))$, then $\hat{Y}(\Lambda(x))$ is a single pseudolabel (or "weak label") derived from the vector of $K$ labeling function outputs. This resolves conflicts between the labeling functions. Note that we can also consider $\hat{Y}$ as a deterministic function of $X$. The simplest label model is majority vote, which outputs the most common label from the set of non-abstaining labeling functions:
$$\hat{Y}_{MV}(x) = \mathrm{mode}\left(\{\Lambda_k(x) : \Lambda_k(x) \neq \emptyset\}\right).$$
If all the labeling functions abstain (i.e., $\Lambda_k(x) = \emptyset$ for all $k$), then $\hat{Y}_{MV}(x) = \emptyset$. More sophisticated label models such as Snorkel [30] and FlyingSquid [10] parameterize $\hat{Y}$ to learn better aggregation rules, e.g., by accounting for the accuracy of different $\Lambda_k$'s or for correlations between pairs $(\Lambda_j, \Lambda_k)$. These parameters are learned using unlabeled data only; the methods for doing so have a rich history dating back at least to Dawid and Skene [8]. Many label models (including Snorkel and its derivatives) output a soft pseudolabel, i.e., a distribution $\hat{P}[Y \mid \Lambda_1(X), \ldots, \Lambda_K(X)]$, and set the hard pseudolabel as $\hat{Y}(X) = \arg\max_y \hat{P}[Y = y \mid \Lambda_1(X), \ldots, \Lambda_K(X)]$.

Given an unlabeled sample $\{x_i\}_{i=1}^n$, the label model produces a pseudolabeled training set $T = \{(x_i, \hat{Y}(x_i)) : \hat{Y}(x_i) \neq \emptyset\}$. The final step in the weak supervision pipeline is to use $T$ like regular training data to train an end model (such as a deep neural network), minimizing the zero-one loss
$$\hat{f} := \arg\min_{f \in \mathcal{F}} \sum_{i :\, \hat{Y}(x_i) \neq \emptyset} \mathbb{I}[f(x_i) \neq \hat{Y}(x_i)] \tag{1}$$
or a convex surrogate like cross-entropy. For many applications, we fine-tune a pretrained representation instead of training from scratch. For example, on text data, we can fine-tune a pretrained BERT model. We refer to the pretrained representation used by the end model as the end model representation, where applicable. Notably, all existing methods use the full pseudolabeled training set $T$ — all points where $\hat{Y} \neq \emptyset$ — to train the end model. In this work, we experiment with methods for choosing higher-quality subsets $T' \subset T$ and use $T'$ in (1) instead of $T$. A minimal code sketch of the labeling-function and majority-vote setup appears below.
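To make the background concrete, the following is a minimal sketch (not the authors' implementation) of keyword labeling functions and majority-vote aggregation with abstention. The names `ABSTAIN`, `lf_good`, `lf_bad`, and `majority_vote`, and the use of `+1`/`-1` label values, are illustrative choices rather than anything fixed by the paper.

```python
# Minimal sketch of keyword labeling functions and majority-vote aggregation.
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # stands in for the abstention symbol

def lf_good(x: str) -> Optional[int]:
    """Lambda_1: vote positive (+1) if 'good' appears, otherwise abstain."""
    return 1 if "good" in x.lower() else ABSTAIN

def lf_bad(x: str) -> Optional[int]:
    """Lambda_2: vote negative (-1) if 'bad' appears, otherwise abstain."""
    return -1 if "bad" in x.lower() else ABSTAIN

def majority_vote(x: str, lfs: List[Callable[[str], Optional[int]]]) -> Optional[int]:
    """Hard pseudolabel: mode of the non-abstaining labeling function outputs."""
    votes = [v for v in (lf(x) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # every labeling function abstained
    return Counter(votes).most_common(1)[0][0]

texts = ["the food was good", "bad service", "good food, bad service", "just ok"]
pseudolabels = [majority_vote(x, [lf_good, lf_bad]) for x in texts]
# -> [1, -1, 1 (the +1/-1 tie is broken by vote order here), ABSTAIN]
```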
**Related work.** The idea of selecting a subset of high-quality training data for use in fully-supervised or semi-supervised learning algorithms has a long history. It is also referred to as data pruning [1], and a significant amount of work has focused on removing mislabeled examples to improve the training process [e.g., 26, 18, 7, 25]. These works do not consider the case where the pseudolabels come from deterministic labeling functions, and most try to estimate parameters of a specific noise process that is assumed to generate the pseudolabels. Many of these approaches require iterative learning or changes to the loss function, whereas typical weak supervision pipelines do one learning step and little or no loss correction. Maheshwari et al. [21] study active subset selection for weak supervision, obtaining a small number of human labels to boost performance. In self-training [e.g., 34], an initial labeled training set is iteratively supplemented with the pseudolabeled examples on which a trained model is most confident (according to the model's probability scores). The model is retrained on the new training set in each step. Yarowsky [40] used this approach starting from a weakly-labeled training set; Yu et al. [41] and Karamanolakis et al. [15] also combine self-training with an initial weakly-labeled training set, and both have deep-model-based procedures for selecting confident data in each round. We view these weakly-supervised self-training methods as orthogonal to our approach, since their main focus is on making better use of the data that is not covered by weak rules, not on selecting good pseudolabeled subsets. Indeed, we show in Appendix B.7 that combining our method with these approaches improves their performance. Other selection schemes, not based on model confidence, have also been investigated for self-training [e.g., 44, 24].

[Figure 1: Cut statistic procedure. A representation φ is used to compute the nearest-neighbor graph G. Nodes that have the same pseudolabel as most of their neighbors are chosen for the subset T′.]

Muhlenbach et al. [23] introduced the cut statistic as a heuristic for identifying mislabeled examples in a training dataset. Li and Zhou [19] applied the cut statistic to self-training, using it to select high-quality pseudolabeled training data for each round. Zhang and Zhou [43] applied the cut statistic to co-training and used learning-with-noise results from Angluin and Laird [2] to optimize the amount of selected data in each round. Lang et al. [16] also used co-training and the cut statistic to co-train large language models such as GPT-3 [5] and T0 [31] with smaller models such as BERT [9] and RoBERTa [20]. These previous works showed that the cut statistic performs well in iterative algorithms such as self-training and co-training; we show that it works well in one-step weak supervision settings, and that it performs especially well when combined with modern pretrained representations. Our empirical study shows that this combination is very effective at selecting good pseudolabeled training data across a wide variety of label models, end models, and datasets.

As detailed in Section 3, the cut statistic relies on a good representation of the input examples $x_i$ to find good subsets. Zhu et al. [45] also used representations to identify subsets of mislabeled examples and found that methods based on representations outperform methods based on model predictions alone. They use a different ranking method and do not evaluate in weakly-supervised settings. Chen et al. [6] also use pretrained representations to improve the performance of weak supervision. They created a new representation-aware label model that uses nearest neighbors in the representation to label more data and also learns finer-grained label model parameters. In contrast, our approach applies to any label model, can be implemented in a few lines of code, and does not require representations from very large models like GPT-3 or CLIP. Combining the two approaches is an interesting direction for future work.

## 3 Subset Selection Methods for Weak Supervision

In this work, we study techniques for selecting high-quality subsets of the pseudolabeled training set $T$.
We consider two simple approaches to subset selection in this work: entropy scoring and the cut statistic. In both cases, we construct a subset $T'$ by first ranking all the examples in $T$, then selecting the top $\beta$ fraction according to the ranking. In our applications, $0 < \beta \leq 1$ is a hyperparameter tuned using a validation set. Hence, instead of the $|T|$ covered examples for training the end model in (1), we use $\beta|T|$ examples. Instead of a single, global ranking, subset selection can easily be stratified to use multiple rankings. For example, if the true label balance $P[Y]$ is known, we can use separate rankings for each set $T_y = \{x_i : \hat{Y}(x_i) = y\}$ and select the top $\beta P[Y = y]\,|T|$ points from each $T_y$. This matches the pseudolabel distribution on $T'$ to the true marginal $P[Y]$. For simplicity, we use a global ranking in this work, and our subset selection does not use $P[Y]$ or any other information about the true labels. Below we give details for the entropy and cut statistic rankings.

**Entropy score.** Entropy scoring only applies to label models that output a soft pseudolabel $\hat{P}[Y \mid \Lambda(X)]$. For this selection method, we rank examples by the Shannon entropy of the soft label, $H(\hat{P}[Y \mid \Lambda(x_i)])$, and set $T'$ to the $\beta|T|$ examples with the lowest entropy. Intuitively, the label model is most confident on the examples with the lowest entropy. If the label model is well calibrated, the weak labels should be more accurate on these examples. A hedged sketch of this ranking appears below.
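The following is a minimal sketch (not the authors' code) of entropy-based subset selection, assuming `soft_labels` is an `(n, num_classes)` array of label-model posteriors over the covered examples and `beta` is the selection fraction.

```python
# Entropy-score selection: keep the beta fraction of lowest-entropy examples.
import numpy as np

def entropy_select(soft_labels: np.ndarray, beta: float) -> np.ndarray:
    """Return indices of the beta|T| covered examples with the lowest
    soft-label entropy, i.e., where the label model is most confident."""
    eps = 1e-12
    probs = np.clip(soft_labels, eps, 1.0)
    entropy = -(probs * np.log(probs)).sum(axis=1)   # Shannon entropy H(P[Y | Lambda(x_i)])
    n_keep = int(np.ceil(beta * len(entropy)))       # keep beta * |T| examples
    return np.argsort(entropy)[:n_keep]              # lowest-entropy examples first
```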
**Cut statistic [23].** Unlike the entropy score, which relies only on the soft label distribution $\hat{P}[Y \mid \Lambda]$, the cut statistic relies on a good representation of the input examples $x_i$. Let $\phi$ be a representation for examples in $\mathcal{X}$. For example, for text data, $\phi$ could be the hidden state of the [CLS] token in the last layer of a pretrained large language model. Recall that $T = \{(x_i, \hat{Y}(x_i)) : \hat{Y}(x_i) \neq \emptyset\}$. To compute the cut statistic using $\phi$, we first form a graph $G = (V, E)$ with one vertex for each covered $x_i$ and edges connecting vertices that are $K$-nearest neighbors in $\phi$. That is, for each example $x_i$ with $\hat{Y}(x_i) \neq \emptyset$, let $NN_\phi(x_i)$ denote the set of $K$-nearest neighbors of $x_i$ in $\phi$. Then we set $V = \{i : \hat{Y}(x_i) \neq \emptyset\}$ and $E = \{(i, j) : x_i \in NN_\phi(x_j) \text{ or } x_j \in NN_\phi(x_i)\}$. For each node $i$, let $N(i) = \{j : (i, j) \in E\}$ denote its neighbors in $G$. We assign a weight $w_{ij}$ to each edge so that nodes closer together in $\phi$ have a higher edge weight: $w_{ij} = (1 + \|\phi(x_i) - \phi(x_j)\|_2)^{-1}$. We say an edge $(i, j)$ is cut if $\hat{Y}(x_i) \neq \hat{Y}(x_j)$, and capture this with the indicator variable $I_{ij} := \mathbb{I}[\hat{Y}(x_i) \neq \hat{Y}(x_j)]$. As suggested in Figure 1, if $\phi$ is a good representation, nodes with few incident cut edges should have high-quality pseudolabels: these examples have the same label as most of their neighbors. On the other hand, nodes with a large number of cut edges likely correspond to mislabeled examples. The cut statistic quantifies this idea heuristically to produce a ranking. Suppose (as a null hypothesis) that the labels $\hat{Y}$ were sampled i.i.d. from the marginal distribution $P[\hat{Y} = y]$. Vertices whose cut-edge weight falls far below its expectation under this null should be the most noise-free. For each vertex $i$, consider the test statistic $J_i = \sum_{j \in N(i)} w_{ij} I_{ij}$. The mean of $J_i$ under the null hypothesis is $\mu_i = (1 - P[\hat{Y} = \hat{Y}(x_i)]) \sum_{j \in N(i)} w_{ij}$, and the variance is $\sigma_i^2 = P[\hat{Y} = \hat{Y}(x_i)](1 - P[\hat{Y} = \hat{Y}(x_i)]) \sum_{j \in N(i)} w_{ij}^2$. Then for each $i$ we can compute the $Z$-score $Z_i = (J_i - \mu_i)/\sigma_i$ and rank examples by $Z_i$. Lower is better, since nodes with the smallest $Z_i$ have the least noisy $\hat{Y}$ assignments in $\phi$. As with entropy scoring, we set $T'$ to be the $\beta|T|$ points with the smallest values of $Z_i$. We provide code for a simple (< 30 lines) function to compute the $Z_i$ values given the representations $\{\phi(x_i)\}$ in Appendix C; a hedged sketch in the same spirit follows below. Calling this function makes it very straightforward to incorporate the cut statistic into existing weak supervision pipelines. Since the cut statistic does not require soft pseudolabels, it can also be used for label models that only produce hard labels, and for label models such as majority vote, where the soft label tends to be badly miscalibrated.
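The sketch below follows the $Z_i$ construction above but is an assumption-laden illustration, not the authors' Appendix C code: `embeddings` is taken to be an `(n, d)` array of $\phi(x_i)$ for the covered examples, `pseudolabels` an `(n,)` array of hard weak labels, and the marginal $P[\hat{Y} = y]$ is estimated empirically. It builds the graph densely for clarity; a sparse graph is preferable for large $n$.

```python
# Cut statistic sketch: rank covered examples by Z_i (smaller = likely cleaner).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_scores(embeddings: np.ndarray, pseudolabels: np.ndarray, k: int = 20) -> np.ndarray:
    n = embeddings.shape[0]
    # K-nearest-neighbor search in phi; drop the self-neighbor column.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dist, idx = nn.kneighbors(embeddings)
    dist, idx = dist[:, 1:], idx[:, 1:]

    # Symmetric adjacency with weights w_ij = 1 / (1 + ||phi(x_i) - phi(x_j)||_2);
    # an edge exists if either point is a K-NN of the other.
    w = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    w[rows, idx.ravel()] = (1.0 / (1.0 + dist)).ravel()
    w = np.maximum(w, w.T)

    # Cut indicators I_ij = 1[Y_hat(x_i) != Y_hat(x_j)] and empirical marginal P[Y_hat = y].
    cut = (pseudolabels[:, None] != pseudolabels[None, :]).astype(float)
    classes, counts = np.unique(pseudolabels, return_counts=True)
    marginal = dict(zip(classes, counts / n))
    p = np.array([marginal[y] for y in pseudolabels])  # p_i = P[Y_hat = Y_hat(x_i)]

    j_stat = (w * cut).sum(axis=1)                     # J_i = sum_j w_ij * I_ij
    mu = (1.0 - p) * w.sum(axis=1)                     # mean of J_i under the null
    var = p * (1.0 - p) * (w ** 2).sum(axis=1)         # variance of J_i under the null
    return (j_stat - mu) / np.sqrt(var + 1e-12)        # Z_i

def select_subset(embeddings, pseudolabels, beta, k=20):
    """Indices of the beta fraction of covered examples with the smallest Z_i."""
    z = cut_statistic_scores(embeddings, pseudolabels, k)
    return np.argsort(z)[: int(np.ceil(beta * len(z)))]
```

The default `k = 20` mirrors the $K = 20$ used throughout the paper's experiments.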
### 3.1 Cut Statistic Selects Better Subsets

To explore the two scoring methods, we visualize how $T'$ changes with $\beta$ for entropy scoring and the cut statistic. We used label models such as majority vote and Snorkel [30] to obtain soft labels $\hat{P}[Y \mid \Lambda(x_i)]$, and set $\hat{Y}(x_i)$ to be the argmax of the soft label. We test using the Yelp dataset from the WRENCH weak supervision benchmark [42]. The task is sentiment analysis, and the eight labeling functions $\{\Lambda_1, \ldots, \Lambda_8\}$ consist of seven keyword-based rules and one third-party sentiment polarity model. For $\phi$ in the cut statistic, we used the [CLS] token representation of a pretrained BERT model. Section 4 contains more details on the datasets and the cut statistic setup.

[Figure 2: Accuracy vs. coverage on Yelp — accuracy of the pseudolabeled training set versus the selection fraction β for five label models (Majority Vote, Snorkel (DP), Dawid-Skene, FlyingSquid, MeTaL), comparing entropy scoring and the cut statistic. A pretrained BERT model is used as φ for the cut statistic. The accuracy of the weak training labels is better for β < 1, indicating that sub-selection can select higher-quality training sets.]

For each $\beta \in \{0.1, 0.2, \ldots, 1.0\}$, Figure 2 plots the accuracy of the pseudolabels on the training subset $T'(\beta)$. This shows how training subset quality varies with the selection fraction $\beta$. We can compute this accuracy because most of the WRENCH benchmark datasets also come with ground-truth labels $Y$ (even on the training set) for evaluation. Appendix B contains the same plot for several other WRENCH datasets, along with histograms of the entropy scores and the $Z_i$ values. Figure 2 shows that combining the cut statistic with a BERT representation selects better subsets than the entropy score for all five label models tested, especially for majority vote, where the entropy score is badly miscalibrated. For a well-calibrated score, the subset accuracy should decrease as $\beta$ increases. These results suggest that the cut statistic is able to use the geometric information encoded in $\phi$ to select a more accurate subset of the weakly-labeled training data. However, this does not indicate whether that better subset actually leads to a more accurate end model. Since we could also use $\phi$ for the end model — e.g., by fine-tuning the full neural network or training a linear model on top of $\phi$ — it is possible that the training step (1) will already perform the same corrections as the cut statistic, and the end model trained on the selected subset will perform no differently from the end model trained with $\beta = 1.0$. In the following section, we focus on the cut statistic and conduct a large-scale empirical evaluation on the WRENCH benchmark to measure whether subset selection improves end model performance. Our empirical results suggest that subset selection and the end model training step are complementary: even when we use powerful representations for the end model, subset selection further improves performance, sometimes by a large margin.

## 4 Experiments

Having established that the cut statistic can effectively select weakly-labeled training subsets that are higher-quality than the original training set, we now turn to a wider empirical study to see whether this approach actually improves the performance of end models in practice.

**Datasets and models.** We evaluate our approach on the WRENCH benchmark [42] for weak supervision. We compare the status quo of full coverage ($\beta = 1.0$) to $\beta$ chosen from $\{0.1, 0.2, \ldots, 1.0\}$. We evaluate our approach with five different label models: Majority Vote (MV), the original Snorkel/Data Programming (DP) [30], Dawid-Skene (DS) [8], FlyingSquid (FS) [10], and MeTaL [29]. Following Zhang et al. [42], we use pretrained roberta-base and bert-base-cased as the end model representation for text data, and hand-specified representations for tabular data (we refer to pretrained models by their names on the Hugging Face Datasets Hub; all model weights were downloaded from the hub: https://huggingface.co/datasets). We performed all model training on NVIDIA A100 GPUs. We primarily evaluate on seven textual datasets from the WRENCH benchmark: IMDb (sentiment analysis), Yelp (sentiment analysis), YouTube (spam classification), TREC (question classification), SemEval (relation extraction), ChemProt (relation extraction), and AGNews (text classification). Full details for the datasets and the weak label sources are available in Table 5 of [42] and reproduced here in Appendix B.1. We explore several other datasets and data modalities in Sections 4.2–4.3.

**Cut statistic.** For the representation $\phi$, for text datasets we used the [CLS] token representation of a large pretrained model such as BERT or RoBERTa. For relation extraction tasks, we followed [42] and used the concatenation of the [CLS] token and the average contextual representation of the tokens in each entity span. In Section 4.3, for the tabular Census dataset we use the raw data features for $\phi$. Unless otherwise specified, we used the same representation for $\phi$ and for the initial end model. For example, when training bert-base-cased as the end model, we used bert-base-cased as $\phi$ for the cut statistic. We explore several alternatives to this choice in Section 4.2.

**Hyperparameter tuning.** Our subset selection approach introduces a new hyperparameter, $\beta$: the fraction of covered data to retain for training the classifier. To keep the hyperparameter tuning burden low, we first tune all other hyperparameters identically to Zhang et al. [42], holding $\beta$ fixed at 1.0. We then use the optimal hyperparameters (learning rate, batch size, weight decay, etc.) from $\beta = 1.0$ for a grid search over values of $\beta \in \{0.1, 0.2, \ldots, 1.0\}$, choosing the value with the best (ground-truth) validation performance. Better results could be achieved by tuning all the hyperparameters together, but this approach limits the number of possible combinations, and it matches the setting where an existing, tuned weak supervision pipeline (with $\beta = 1.0$) is adapted to use subset selection. In all of our experiments, we used $K = 20$ nearest neighbors to compute the cut statistic and performed no tuning of this value. Appendix B contains an ablation showing that performance is not sensitive to this choice. A hedged sketch of the $\beta$ grid search appears below.
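The following sketch illustrates the tuning protocol described above. `train_end_model` and `evaluate_accuracy` are hypothetical placeholders for an existing pipeline's training and validation routines, and `select_subset` is the cut statistic helper sketched earlier; all other (already-tuned) hyperparameters are reused unchanged for every $\beta$, as in the paper.

```python
# Grid search over the coverage fraction beta using a validation set.
import numpy as np

def tune_beta(embeddings, pseudolabels, covered_examples, val_data,
              train_end_model, evaluate_accuracy, betas=np.arange(0.1, 1.01, 0.1)):
    best_beta, best_acc, best_model = None, -np.inf, None
    for beta in betas:
        keep = select_subset(embeddings, pseudolabels, beta)            # cut statistic ranking
        subset = [(covered_examples[i], pseudolabels[i]) for i in keep]
        model = train_end_model(subset)                                 # same hyperparameters as beta = 1.0
        acc = evaluate_accuracy(model, val_data)                        # ground-truth validation performance
        if acc > best_acc:
            best_beta, best_acc, best_model = float(beta), acc, model
    return best_beta, best_model
```

Since $\beta = 1.0$ is included in the grid, this procedure can only underperform the status quo through validation/test mismatch.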
### 4.1 WRENCH Benchmark Performance

Table 1 compares the test performance of full coverage ($\beta = 1.0$) to the performance of the cut statistic with $\beta$ chosen according to validation performance. Standard deviations across five random initializations are shown in parentheses.

Table 1: End model test accuracy (stddev) for weak supervision with $\beta = 1$ versus weak supervision with $\beta$ selected from $\{0.1, 0.2, 0.3, \ldots, 1.0\}$ using a validation set ("+ cutstat"), shown for BERT (B) and RoBERTa (RB) end models. For these results, the cut statistic uses the same representation as the end model for $\phi$. The cut statistic broadly improves the performance of weak supervision for many (label model, dataset, end model) combinations.

| Label model | imdb | yelp | youtube | trec | semeval | chemprot | agnews |
|---|---|---|---|---|---|---|---|
| **BERT (B)** | | | | | | | |
| Majority Vote | 78.32 (2.62) | 86.85 (1.42) | 95.12 (1.27) | 66.76 (1.46) | 85.17 (0.89) | 57.44 (2.01) | 86.59 (0.47) |
| + cutstat | 81.86 (1.36) | 89.49 (0.78) | 95.60 (0.72) | 71.84 (3.00) | 92.47 (0.49) | 57.47 (1.00) | 86.26 (0.43) |
| Data Programming | 75.90 (1.44) | 76.43 (1.29) | 92.48 (1.30) | 71.20 (1.78) | 71.97 (1.57) | 51.89 (1.60) | 86.01 (0.63) |
| + cutstat | 79.07 (2.52) | 88.13 (1.46) | 93.92 (0.93) | 76.76 (1.92) | 91.07 (0.90) | 55.10 (1.49) | 85.89 (0.45) |
| Dawid-Skene | 78.86 (1.34) | 88.45 (1.42) | 88.45 (1.42) | 51.04 (1.71) | 72.40 (1.53) | 44.08 (1.37) | 86.26 (0.56) |
| + cutstat | 80.22 (1.69) | 89.04 (1.10) | 90.72 (1.27) | 57.28 (2.91) | 89.07 (1.62) | 49.07 (1.48) | 86.93 (0.22) |
| FlyingSquid | 77.46 (1.88) | 84.98 (1.44) | 91.52 (2.90) | 31.12 (2.39) | 31.83 (0.00) | 46.72 (0.96) | 86.10 (0.80) |
| + cutstat | 80.85 (1.50) | 88.75 (1.13) | 91.04 (1.23) | 33.84 (3.17) | 31.83 (0.00) | 48.65 (0.99) | 85.90 (0.39) |
| MeTaL | 78.97 (2.57) | 83.05 (1.69) | 93.36 (1.15) | 58.88 (1.22) | 58.17 (1.77) | 55.61 (1.35) | 86.06 (0.82) |
| + cutstat | 81.49 (1.51) | 88.41 (1.19) | 92.64 (0.41) | 63.80 (2.28) | 65.23 (0.91) | 58.33 (0.81) | 86.16 (0.48) |
| **RoBERTa (RB)** | | | | | | | |
| Majority Vote | 86.99 (0.55) | 88.51 (3.25) | 95.84 (1.18) | 67.60 (2.38) | 85.83 (1.22) | 57.06 (1.12) | 87.46 (0.53) |
| + cutstat | 86.69 (0.75) | 95.19 (0.23) | 96.00 (1.10) | 72.92 (1.31) | 92.07 (0.80) | 59.05 (0.56) | 88.01 (0.47) |
| Data Programming | 86.31 (1.53) | 88.73 (5.07) | 94.08 (1.48) | 71.40 (3.30) | 71.07 (1.66) | 52.52 (0.69) | 86.75 (0.24) |
| + cutstat | 86.46 (1.82) | 93.95 (0.93) | 93.04 (1.30) | 76.84 (4.09) | 86.07 (1.82) | 56.43 (1.37) | 87.76 (0.17) |
| Dawid-Skene | 85.50 (1.68) | 92.42 (1.41) | 92.48 (1.44) | 51.24 (3.50) | 70.83 (0.75) | 45.61 (2.60) | 87.29 (0.40) |
| + cutstat | 86.14 (0.60) | 93.81 (0.69) | 93.84 (0.70) | 58.48 (2.75) | 81.67 (1.33) | 52.93 (1.67) | 88.35 (0.22) |
| FlyingSquid | 85.25 (1.96) | 92.14 (2.76) | 93.52 (2.11) | 35.40 (1.32) | 31.83 (0.00) | 47.23 (1.04) | 86.56 (0.55) |
| + cutstat | 87.71 (0.76) | 94.50 (0.74) | 95.84 (0.54) | 38.16 (0.43) | 31.83 (0.00) | 50.55 (1.05) | 87.49 (0.13) |
| MeTaL | 86.16 (1.13) | 88.41 (3.25) | 92.40 (1.19) | 55.44 (1.08) | 59.53 (1.87) | 56.74 (0.58) | 86.74 (0.60) |
| + cutstat | 87.46 (0.65) | 94.03 (0.53) | 93.84 (1.38) | 69.72 (2.39) | 66.70 (0.90) | 57.40 (0.98) | 88.40 (0.38) |

[Figure 3: Box plot of the test accuracy gain from setting β < 1 across all WRENCH trials.]

The cut statistic improves the mean performance (across runs) compared to $\beta = 1.0$ in 61/70 cases, sometimes by 10–20 accuracy points (e.g., BERT, SemEval, DP). Since $\beta = 1.0$ is included in the hyperparameter search over $\beta$, the only cases where the cut statistic performs worse than $\beta = 1.0$ are due to differences in performance on the validation and test sets. The mean accuracy gain from setting $\beta < 1.0$ across all 70 trials is 3.65 points, indicating that the cut statistic is complementary to end model training. If no validation data is available to select $\beta$, we found that $\beta = 0.6$ had the best median performance gain over all label model, dataset, and end model combinations: 1.7 accuracy points compared to $\beta = 1.0$. However, we show in Section 4.4 that very small validation sets are good enough to select $\beta$. The end model trains using $\phi$, but using $\phi$ to first select a good training set further improves performance. Figure 3 displays a box plot of the accuracy gain from using sub-selection. Appendix B.2 contains plots of the end model performance versus the coverage fraction $\beta$.
In some cases, the cut statistic is competitive with COSINE [41], which does multiple rounds of self-training on the unlabeled data. Table 7 compares the two methods, and we show in Appendix B.7 how to combine them to improve performance.

For Table 1, we used the same representation for $\phi$ and for the initial end model. These results indicate that representations from very large models such as GPT-3 or CLIP are not needed to improve end model performance. However, there is no a priori reason to use the same representation for $\phi$ and for the end model initialization. Using a much larger model for $\phi$ may improve the cut statistic performance without drastically slowing down training, since we only need to perform inference with the larger model. We examine the role of the representation choice more thoroughly in Section 4.2.

### 4.2 Choice of Representation for Cut Statistic

How important is the quality of $\phi$ (the representation used for the cut statistic) for the performance of subset selection? In this section we experiment with different choices of $\phi$. To isolate the effect of $\phi$ on performance, we use the generic BERT as the end model. Performance can improve further when using more powerful representations for the end model as well, but we use a fixed end model here to explore how the choice of $\phi$ affects final performance. Our results indicate that (i) for weak supervision tasks in a specific domain (e.g., biomedical text), models pretrained on that domain perform better than general-domain models as $\phi$, and (ii) for a specific task (e.g., sentiment analysis), models pretrained on that task, but on different data, perform very well as $\phi$. We also show in Appendix B that using a larger generic model for $\phi$ can improve performance when the end model is held fixed.

[Figure 4: Accuracy vs. coverage on CDR — domain-specific pretraining versus general-domain pretraining for φ, across the five label models. Stock BERT (cutstat) is compared to BioBERT (cutstat-biobert) and PubMedBERT (cutstat-pubmed), two models pretrained on biomedical text. The domain-specific models select more accurate subsets than the generic model.]

[Figure 5: Comparison of IMDb training subset accuracy for the cut statistic with generic BERT (cutstat) and a BERT fine-tuned for sentiment analysis on the SST-2 dataset (cutstat-sst), alongside entropy scoring. The fine-tuned representation gives very high-quality training subsets when used with the cut statistic.]

**Domain-specific pretraining can help.** The CDR dataset in WRENCH has biomedical abstracts as input data. Instead of using a general-domain $\phi$ such as BERT or RoBERTa, does using a domain-specific version improve performance? Figure 4 shows that domain-specific models do improve over general-domain models when used in the cut statistic. We compare bert-base-cased to PubMedBERT-base-uncased-abstract-fulltext [11] and biobert-base-cased-v1.2 [17]. The latter two models were pretrained on biomedical text. The domain-specific models lead to higher-quality training datasets for all label models except Dawid-Skene. These gains in training dataset accuracy translate to gains in end-model performance.
Trained with $\hat{Y}$ from majority vote and using BioBERT as $\phi$, a general-domain BERT end model obtains a test F1 score of 61.14 (0.64), compared to 59.63 (0.84) when using BERT for both $\phi$ and the end model. Both methods improve over the 58.20 (0.55) obtained from training generic BERT with $\beta = 1.0$ (no sub-selection).

**Representations can transfer.** If a model is trained for a particular task (e.g., sentiment analysis) on one dataset, can we use it as $\phi$ to perform weakly-supervised learning on a different dataset? We compare two choices for $\phi$ on IMDb: regular bert-base-cased, and bert-base-cased fine-tuned with fully-supervised learning on the Stanford Sentiment Treebank (SST) dataset [37]. As indicated in Figure 5, the fine-tuned BERT representation selects a far higher-quality subset for training. This translates to better end model performance as well. Using majority vote with the fine-tuned BERT as $\phi$ leads to test performance of 87.22 (0.57), compared to 81.86 in Table 1. These results suggest that if we have a representation $\phi$ that is already useful for a task, we can effectively combine it with the cut statistic to improve performance on a different dataset.

### 4.3 Other Data Modalities

Our experiments so far have used text data, where large pretrained models like BERT and RoBERTa are natural choices for $\phi$. Here we briefly study the cut statistic on tabular and image data. The Census dataset in WRENCH consists of tabular data where the goal is to classify whether a person's income is greater than $50k from a set of 13 features. We also use these hand-crafted features for $\phi$ and train a linear model on top of the features for the end model. The Basketball dataset is a set of still images obtained from videos, and the goal is to classify whether basketball is being played in the image, using the output of an off-the-shelf object detector in the $\Lambda_k$'s. We used CLIP [27] representations of the video frames and trained a 1-hidden-layer neural network using the hyperparameter tuning space from [42]. Table 2 shows the results for these datasets. The cut statistic improves the end model performance for every label model on Census, even with the small, hand-crafted representation, and also improves the mean performance for every label model on the Basketball data.

Table 2: Test F1 of a weakly-supervised linear model (LR) on the Census dataset, which consists of 13 hand-created features, and of a 1-hidden-layer network (MLP) on CLIP representations of the Basketball dataset. Even though the Census representation does not come from a large, pretrained neural network, the cut statistic improves the performance of weak supervision for every label model; the Basketball results are noisy, but the cut statistic improves the mean performance of every label model.

| | Majority Vote | Data Programming | Dawid-Skene | FlyingSquid | MeTaL |
|---|---|---|---|---|---|
| LR, Census | 50.71 (2.18) | 21.67 (17.32) | 49.90 (0.61) | 38.23 (3.96) | 51.41 (1.45) |
| + cutstat | 57.98 (0.68) | 28.04 (17.38) | 58.49 (0.23) | 40.53 (2.26) | 54.99 (1.54) |
| MLP, Basketball | 52.29 (6.62) | 52.14 (4.80) | 22.59 (9.83) | 54.04 (12.15) | 32.99 (12.98) |
| + cutstat | 55.82 (3.98) | 54.59 (14.80) | 43.77 (12.47) | 56.60 (6.13) | 47.97 (8.89) |
### 4.4 Using a Smaller Validation Set

Many datasets used to evaluate weak supervision methods actually come with large labeled validation sets. For example, the average validation set size of the WRENCH datasets from Table 1 is over 2,500 examples. However, assuming access to a large amount of labeled validation data partially defeats the purpose of weak supervision. In this section, we show that the coverage parameter $\beta$ for the cut statistic can be selected using a much smaller validation set without compromising the performance gain over $\beta = 1.0$. We compare choosing the best model checkpoint and picking the best coverage fraction $\beta$ using (i) the full validation set and (ii) a randomly-sampled validation set of 100 examples. Table 3 shows the results for the majority vote label model; the full-validation numbers come from Table 1.

Table 3: Comparison between using the full validation set to choose $\beta$ and the model checkpoint versus using a randomly selected validation subset of 100 examples. These results use the majority vote (MV) label model. Standard deviation is reported over five random seeds used to select the validation set (not to be confused with Table 1, where standard deviation is reported over random seeds controlling the deep model initialization). Most of the drop in performance is due to the noisier checkpoint selection when using the small validation set; i.e., the difference between the best $\beta$ and $\beta = 1.0$ is similar for the full-validation and random-validation cases.

| End model | Val. size | β | imdb | yelp | youtube | trec | semeval | chemprot | agnews |
|---|---|---|---|---|---|---|---|---|---|
| BERT | full | 1.0 | 78.32 | 86.85 | 95.12 | 66.76 | 85.17 | 57.44 | 86.59 |
| BERT | full | best | 81.86 | 89.49 | 95.60 | 71.84 | 92.47 | 57.47 | 86.26 |
| BERT | 100 | 1.0 | 79.17 (2.80) | 84.88 (1.97) | 94.50 (0.32) | 65.33 (1.84) | 85.62 (0.64) | 54.99 (1.63) | 84.77 (1.11) |
| BERT | 100 | best | 79.75 (2.18) | 87.96 (1.00) | 94.40 (0.63) | 74.50 (1.83) | 92.40 (1.81) | 54.40 (2.35) | 84.96 (1.20) |
| RoBERTa | full | 1.0 | 86.99 | 88.51 | 95.84 | 67.60 | 85.83 | 57.06 | 87.46 |
| RoBERTa | full | best | 86.69 | 95.19 | 96.00 | 72.92 | 92.07 | 59.05 | 88.01 |
| RoBERTa | 100 | 1.0 | 85.74 (1.11) | 89.32 (1.49) | 95.24 (1.11) | 66.40 (1.56) | 84.38 (1.18) | 56.71 (0.55) | 85.79 (0.75) |
| RoBERTa | 100 | best | 85.24 (0.78) | 93.82 (0.48) | 96.06 (0.70) | 75.15 (2.89) | 91.24 (0.76) | 56.52 (0.44) | 87.20 (0.31) |

The difference between selecting data with the validation-optimal $\beta$ and using $\beta = 1.0$ is broadly similar between the full-validation and small-validation cases. This suggests that most of the drop in performance from full validation to small validation is due to the noisier choice of the best model checkpoint, not due to choosing a suboptimal $\beta$.

### 4.5 Discussion

**Why not correct pseudolabels with nearest neighbors?** Consider an example $x_i$ whose weak label $\hat{Y}(x_i)$ disagrees with the weak label $\hat{Y}(x_j)$ of most neighbors $j \in N(i)$. This example would get thrown out by the cut statistic selection. Instead of throwing such data points out, we could try to re-label them with the majority weak label from the neighbors. However, throwing data out is a more conservative (and hence possibly more robust) approach. For example, if the weak labels are mostly wrong on hard examples close to the true, unknown decision boundary, relabeling makes the training set worse, whereas the cut statistic ignores these points. Appendix B.3 contains an empirical comparison between subset selection and relabeling (a sketch of the relabeling alternative appears below). For the representations studied in this work, relabeling largely fails to improve training set quality and end model performance.
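The following is a hedged sketch (not the authors' Appendix B.3 code) of the relabeling alternative discussed above, contrasting it with simply discarding the flagged points. `neighbor_lists` is a hypothetical list of neighbor index sets $N(i)$ from the K-NN graph, and `z_scores` are the cut statistic values from the earlier sketch.

```python
# Relabel (rather than drop) the most-cut vertices using the neighbor majority label.
from collections import Counter
import numpy as np

def relabel_with_neighbors(pseudolabels, neighbor_lists, z_scores, beta):
    """Relabel the (1 - beta) fraction of highest-Z_i points with their neighbors'
    majority pseudolabel; cut statistic selection would instead drop these points."""
    labels = np.array(pseudolabels).copy()
    n_flag = len(labels) - int(np.ceil(beta * len(labels)))
    flagged = np.argsort(z_scores)[::-1][:n_flag]         # vertices with the most cut-edge weight
    for i in flagged:
        votes = [pseudolabels[j] for j in neighbor_lists[i]]
        labels[i] = Counter(votes).most_common(1)[0][0]   # neighbor majority label
    return labels
```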
**Why does sub-selection work?** As suggested above, subset selection can change the distribution of data points in the training set shown to the end model. For example, it may only select easy examples. However, this is already a problem in today's weak supervision methods: the full weakly-labeled training set $T$ is already biased. For example, many labeling functions are keyword-based, such as those in Section 2 ("good" → positive sentiment, "bad" → negative). In these examples, $T$ itself is a biased subset of the input distribution (only sentences that contain "good" or "bad", versus all sentences). Theoretical understanding of why weak supervision methods perform well on the uncovered set $\mathcal{X} \setminus \{x : \Lambda(x) \neq \emptyset\}$ is currently lacking, and existing generalization bounds for the end model do not capture this phenomenon. In the following section we present a special (but practically motivated) case where this bias can be avoided. In this case, we prove a closed form for the coverage-precision tradeoff of selection methods, giving subset selection some theoretical motivation.

## 5 Theoretical Results: Why Does Subset Selection Work?

We begin by presenting a theoretical setup motivated by the CheXpert [13] and MIMIC-CXR [14] datasets, where the weak labels are derived from radiology notes and the goal is to learn an end model for classifying X-ray images. Suppose for this section that we have two (possibly related) views $\psi_0(X), \psi_1(X)$ of the data $X$, i.e., $\psi_0 : \mathcal{X} \to \Psi_0$ and $\psi_1 : \mathcal{X} \to \Psi_1$. We use $\psi$ here to distinguish from $\phi$, the representation used to compute nearest neighbors for the cut statistic. For example, if the input space $\mathcal{X}$ is multi-modal, with each $x_i = (x_i^{(0)}, x_i^{(1)})$, then we can set $\psi_0$ and $\psi_1$ to project onto the individual modes (e.g., $\psi_0(X)$ = the clinical note and $\psi_1(X)$ = the X-ray). We will assume that the labeling functions $\Lambda_k(x_i)$ depend only on $\psi_0(x_i)$, and that the end model $f$ depends only on $\psi_1(x_i)$. In the multi-modal example, this means the labeling functions are defined on one view, and the end model is trained on the other view. To prove a closed form for the precision/coverage tradeoff, we make the following strong assumption relating the two views $\psi_0$ and $\psi_1$:

**Assumption 1 (Conditional independence).** The random variables $\psi_0(X), \psi_1(X)$ are conditionally independent given the true (unobserved) label $Y$. That is, for any sets $A \subset \Psi_0$, $B \subset \Psi_1$,
$$P_{X,Y}[\psi_0(X) \in A, \psi_1(X) \in B \mid Y] = P_{X,Y}[\psi_0(X) \in A \mid Y]\cdot P_{X,Y}[\psi_1(X) \in B \mid Y].$$

Note that since every $\Lambda_k$ depends only on $\psi_0(X)$, the pseudolabel $\hat{Y}$ also depends only on $\psi_0(X)$. Hence $\hat{Y}(X) = \hat{Y}(\psi_0(X))$, and likewise for an end model $f$, $f(X) = f(\psi_1(X))$. Assumption 1 implies
$$P_{X,Y}[\hat{Y}(X) \neq Y,\, \psi_1(X) \in B \mid Y] = P_{X,Y}[\hat{Y}(X) \neq Y \mid Y]\cdot P_{X,Y}[\psi_1(X) \in B \mid Y]$$
for every $B \subset \Psi_1$. In this special case, the end model training reduces to learning with class-conditional noise (CCN), since the errors $\mathbb{I}[\hat{Y}(X) \neq Y]$ are conditionally independent of the representation $\psi_1(X)$ being used for the end model. This assumption is most natural for the case of multimodal data and $\psi_0, \psi_1$ that project onto each mode, but it may also roughly apply when the representation being used for the end model (such as a BERT representation) is suitably orthogonal to the input $X$. While very restrictive, this assumption allows us to make the coverage-precision tradeoff precise.

**Theorem 1.** Suppose Assumption 1 holds and that $\mathcal{Y} = \{0, 1\}$. Define the balanced error of a classifier $f$ on labels $Y$ as
$$\mathrm{err}_{\mathrm{bal}}(f, Y) = \tfrac{1}{2}\left(P[f(X) = 0 \mid Y = 1] + P[f(X) = 1 \mid Y = 0]\right),$$
where we write $f(X)$ instead of $f(\psi_1(X))$ for convenience. Let $\hat{Y} : \Psi_0 \to \{0, 1\} \cup \{\emptyset\}$ be an arbitrary label model. Define $\alpha = P[Y = 0 \mid \hat{Y} = 1]$ and $\gamma = P[Y = 1 \mid \hat{Y} = 0]$, and suppose $\alpha + \gamma < 1$ and $P[\hat{Y} = y] > 0$ for $y \in \{0, 1\}$. These parameters measure the amount of noise in $\hat{Y}$. Define $f^{*} := \arg\min_{f \in \mathcal{F}} \mathrm{err}_{\mathrm{bal}}(f, Y)$. Let $\hat{f}$ be the classifier obtained by minimizing the empirical balanced error on $\{(x_i, \hat{Y}(x_i))\}_{i=1}^n$. Then the following holds with probability
$1 - \delta$ over the sampling of the data:
$$\mathrm{err}_{\mathrm{bal}}(\hat{f}, Y) - \mathrm{err}_{\mathrm{bal}}(f^{*}, Y) \;\leq\; \frac{1}{1 - \alpha - \gamma}\,\tilde{O}\!\left(\sqrt{\frac{\mathrm{VC}(\mathcal{F}) + \log\tfrac{1}{\delta}}{n\, P[\hat{Y} \neq \emptyset]\, \min_y P[\hat{Y} = y \mid \hat{Y} \neq \emptyset]}}\right),$$
where $\tilde{O}$ hides logarithmic factors in $n$ and $\mathrm{VC}(\mathcal{F})$.

*Proof.* For space, we defer the proof and bibliographic commentary to Appendix A.

This bound formalizes the tradeoff between the precision of the weak labels, measured by $\alpha$ and $\gamma$, and the coverage, measured by $n P[\hat{Y} \neq \emptyset]$, which for large enough samples is very close to the size of the covered training set $T = \{(x_i, \hat{Y}(x_i)) : \hat{Y}(x_i) \neq \emptyset\}$. Suppose we have a label model $\hat{Y}$ and an alternative label model $\hat{Y}'$ that abstains more often than $\hat{Y}$ (so $P[\hat{Y}' \neq \emptyset] < P[\hat{Y} \neq \emptyset]$) but also has smaller values of $\alpha$ and $\gamma$. Then according to the bound, an end model trained with $\hat{Y}'$ can have better performance, and the empirical results in Section 4 confirm this trend. This bound is useful for comparing two fixed label models $\hat{Y}$, $\hat{Y}'$ with different abstention rates and $(\alpha, \gamma)$ values. However, we have been concerned in this paper with selecting a subset $T'$ of $T$ based on a single label model $\hat{Y}$, and training using $T'$. We can represent this subset selection with a new set of pseudolabels $\{\tilde{Y}(x_i) : x_i \in T\}$ that abstains more than $\hat{Y}(x_i)$, i.e., points not chosen for $T'$ get $\tilde{Y}(x_i) = \emptyset$. However, selection for $T'$ depends on sample-level statistics, so the $\tilde{Y}$ values are not i.i.d., which complicates the generalization bound. We show in Appendix A that this can be remedied by a sample-splitting procedure: we use half of $T$ to define a refined label model $\tilde{Y} : \Psi_0 \to \mathcal{Y} \cup \{\emptyset\}$, and then use the other half of $T$ as the initial training set. This allows us to effectively reduce to the case of two fixed label models $\hat{Y}$, $\tilde{Y}$ and apply Theorem 1. We include the simpler $\hat{Y}$ versus $\hat{Y}'$ bound here because it captures the essential tradeoff without the technical difficulties.
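A purely illustrative calculation of this tradeoff is sketched below. The numbers are hypothetical (not from the paper), and the proxy assumes the reconstructed scaling above, with the balanced $\min_y$ term and constants dropped: the sample-size term shrinks as coverage $P[\hat{Y} \neq \emptyset]$ grows, while the noise term $1/(1 - \alpha - \gamma)$ shrinks as the covered labels become more precise.

```python
# Hypothetical illustration of the coverage/precision tradeoff in the bound.
import math

def bound_proxy(coverage, alpha, gamma, n=10_000, complexity=100.0):
    """Proportional to (1 / (1 - alpha - gamma)) * sqrt(complexity / (n * coverage))."""
    return (1.0 / (1.0 - alpha - gamma)) * math.sqrt(complexity / (n * coverage))

full = bound_proxy(coverage=0.8, alpha=0.20, gamma=0.20)      # more data, noisier labels (~0.186)
selected = bound_proxy(coverage=0.4, alpha=0.05, gamma=0.05)  # half the data, cleaner labels (~0.176)
print(full, selected)  # here the smaller, cleaner subset yields the smaller bound
```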
## 6 Limitations, Societal Impact, and Conclusion

Surprisingly, using less data can greatly improve the performance of weak supervision pipelines when that data is carefully selected. By exploring the tradeoff between weak label precision and coverage, subset selection allows us to select a higher-quality training set without compromising generalization performance to the population pseudolabeling function. This improves the accuracy of the end model on the true labels. In Section 5, we showed that this tradeoff can be formalized in the special setting of conditional independence. By combining the cut statistic with good data representations, we developed a technique that improves performance for five different label models, over ten datasets, and three data modalities. Additionally, the hyperparameter tuning burden is low. We introduced one new hyperparameter $\beta$ (the coverage fraction) and showed that all other hyperparameters can be re-used from the full-coverage $\beta = 1.0$ case, so existing tuned weak supervision pipelines can be easily adapted to use this technique.

However, this approach is not without limitations. The cut statistic requires a good representation $\phi$ of the input data to work well, and such a representation may not be available. However, for image or text data, pretrained representations provide natural choices for $\phi$, and our results on the Census dataset in Section 4.3 indicate that using hand-crafted features as $\phi$ can also work well. Finally, as discussed at the end of Section 4.5, subset selection can further bias the input distribution (except in special cases like the one in Section 5). However, this is already an issue with current weak supervision methods. Most methods only train on the covered data $T$. Labeling functions are typically deterministic functions of the input example (such as functions based on the presence of certain tokens), so the support of the full training set $T$ is a strict subset of the support of the true input distribution, and $T$ may additionally have a skewed distribution over its support. This underscores the need for (i) the use of a ground-truth validation set to ensure that the end model is an accurate predictor on the full distribution, (ii) in high-stakes settings, sub-group analyses such as those performed by [35], to ensure that the pseudolabels have not introduced bias against protected subgroups, and (iii) further theoretical understanding of why weakly supervised end models are able to perform well on the uncovered set $\{x : \Lambda(x) = \emptyset\}$.

## Acknowledgments and Disclosure of Funding

This work was supported by NSF AitF awards CCF-1637585 and CCF-1723344. Thanks to Hussein Mozannar for helpful conversations on Section 5 and the pointer to Woodworth et al. [39]. Thanks to Dr. Steven Horng for generously donating GPU-time on the BIDMC computing cluster [12] and to NVIDIA for their donation of GPUs also used in this work.

## References

[1] Anelia Angelova, Yaser Abu-Mostafam, and Pietro Perona. Pruning training sets for learning of object categories. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 494–501. IEEE, 2005.

[2] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[3] Chidubem Arachie and Bert Huang. Adversarial label learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3183–3190, 2019.

[4] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Summer School on Machine Learning, pages 169–207. Springer, 2003.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[6] Mayee F Chen, Daniel Yang Fu, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, and Christopher Re. Shoring up the foundations: Fusing model embeddings and weak supervision. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.

[7] Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, and Yang Liu. Learning with instance-dependent label noise: A sample sieve approach. In International Conference on Learning Representations, 2021.

[8] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

[10] Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and three-rious: Speeding up weak supervision with triplet methods. In International Conference on Machine Learning, pages 3280–3291. PMLR, 2020.
[11] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.

[12] Steven Horng. Machine Learning Core. February 2022. doi: 10.6084/m9.figshare.19104917.v2. URL https://figshare.com/articles/preprint/Machine_Learning_Core/19104917.

[13] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590–597, 2019.

[14] Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):1–8, 2019.

[15] Giannis Karamanolakis, Subhabrata Mukherjee, Guoqing Zheng, and Ahmed Hassan. Self-training with weak supervision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 845–863, 2021.

[16] Hunter Lang, Monica N Agrawal, Yoon Kim, and David Sontag. Co-training improves prompt-based learning for large language models. In International Conference on Machine Learning, pages 11985–12003. PMLR, 2022.

[17] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.

[18] Junnan Li, Richard Socher, and Steven CH Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations, 2019.

[19] Ming Li and Zhi-Hua Zhou. SETRED: Self-training with editing. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 611–621. Springer, 2005.

[20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[21] Ayush Maheshwari, Oishik Chatterjee, Krishnateja Killamsetty, Ganesh Ramakrishnan, and Rishabh Iyer. Semi-supervised data programming with subset selection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4640–4651, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.408. URL https://aclanthology.org/2021.findings-acl.408.

[22] Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In International Conference on Machine Learning, pages 125–134. PMLR, 2015.

[23] Fabrice Muhlenbach, Stéphane Lallich, and Djamel A Zighed. Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1):89–109, 2004.

[24] Subhabrata Mukherjee and Ahmed Awadallah. Uncertainty-aware self-training for few-shot text classification. Advances in Neural Information Processing Systems, 33:21199–21212, 2020.

[25] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.
[26] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.

[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[28] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, page 269. NIH Public Access, 2017.

[29] Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher Ré. Training complex models with multi-task weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4763–4771, 2019.

[30] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems, 29:3567–3575, 2016.

[31] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, 2022.

[32] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

[33] Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 489–511, Princeton, NJ, USA, 12–14 Jun 2013. PMLR. URL https://proceedings.mlr.press/v30/Scott13.html.

[34] Henry Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371, 1965.

[35] Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Irene Y Chen, and Marzyeh Ghassemi. CheXclusion: Fairness gaps in deep chest x-ray classifiers. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, pages 232–243. World Scientific, 2020.

[36] Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972.

[37] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

[38] VN Vapnik. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–281, 1971.

[39] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017.
[40] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA, June 1995. Association for Computational Linguistics. doi: 10.3115/981658.981684. URL https://aclanthology.org/P95-1026.

[41] Yue Yu, Simiao Zuo, Haoming Jiang, Wendi Ren, Tuo Zhao, and Chao Zhang. Fine-tuning pretrained language model with weak supervision: A contrastive-regularized self-training approach. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1063–1077, 2021.

[42] Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. WRENCH: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

[43] Min-Ling Zhang and Zhi-Hua Zhou. CoTrade: Confident co-training with data editing. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(6):1612–1626, 2011.

[44] Yan Zhou, Murat Kantarcioglu, and Bhavani Thuraisingham. Self-training with selection-by-rejection. In 2012 IEEE 12th International Conference on Data Mining, pages 795–803. IEEE, 2012.

[45] Zhaowei Zhu, Zihao Dong, Hao Cheng, and Yang Liu. A good representation detects noisy labels. arXiv preprint arXiv:2110.06283, 2021.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] We show both empirical and theoretical benefits of using less weakly-labeled data to improve performance.
   (b) Did you describe the limitations of your work? [Yes] In Sections 4.5 and 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] In Section 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] In Assumption 1 and in the statement of Theorem 1.
   (b) Did you include complete proofs of all theoretical results? [Yes] In Appendix A.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In Appendix C and in a ZIP file in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In Section 4.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Our tables in Section 4 include standard deviations across 5 runs.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Described in Section 4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We primarily used the WRENCH benchmark [42].
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include the code for reproducing our empirical results in the supplementary material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]