# Data Programming Using Continuous and Quality-Guided Labeling Functions

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Oishik Chatterjee, Department of CSE, IIT Bombay, India, oishik@cse.iitb.ac.in
Ganesh Ramakrishnan, Department of CSE, IIT Bombay, India, ganesh@cse.iitb.ac.in
Sunita Sarawagi, Department of CSE, IIT Bombay, India, sunita@iitb.ac.in

Abstract

Scarcity of labeled data is a bottleneck for supervised learning models. A paradigm that has evolved for dealing with this problem is data programming. An existing data programming paradigm allows human supervision to be provided as a set of discrete labeling functions (LFs) that output possibly noisy labels to input instances, and a generative model for consolidating the weak labels. We enhance and generalize this paradigm by supporting functions that output a continuous score (instead of a hard label) that noisily correlates with labels. We show across five applications that continuous LFs are more natural to program and lead to improved recall. We also show that the accuracy of existing generative models is unstable with respect to initialization, training epochs, and learning rates. We give control to the data programmer to guide the training process by providing intuitive quality guides with each LF. We propose an elegant method of incorporating these guides into the generative model. Our overall method, called CAGE, makes the data programming paradigm more reliable than other tricks based on initialization, sign-penalties, or soft-accuracy constraints.

1 Introduction

Modern machine learning systems require large amounts of labeled data. For many applications, such labeled data is created by getting humans to explicitly label each training example. A problem of perpetual interest in machine learning is reducing the tedium of such human supervision via techniques like active learning, crowd-labeling, distant supervision, and semi-supervised learning. A limitation of all these methods is that supervision is restricted to the level of individual examples.

A recently proposed paradigm (Ratner et al. 2016) is that of Data Programming. In this paradigm, humans provide several labeling functions written in any high-level programming language. Each labeling function (LF) takes as input an example and either attaches a label to it or backs off. We illustrate such LFs on one of the five tasks that we experimented with, viz., that of labeling a mention of a pair of person names in a sentence as expressing the spouse relation or not. The users construct heuristic patterns as LFs for identifying the spouse relation in a sentence containing an entity pair (E1, E2). An LF can assign +1 to indicate that the spouse relation holds for the candidate pair (E1, E2), -1 to mean that the spouse relation does not hold, and 0 to mean that the LF is unable to assert anything for this example. Specifically for the spouse relation extraction task, Table 1 lists the LFs. In isolation, each LF may neither be always correct nor complete. LFs may also produce conflicting labels. For the purpose of illustration, consider the text snippet "Michelle Obama is the mother of Malia and Sasha and the wife of Barack Obama". For the candidate pair (Michelle Obama, Barack Obama), LF1 and LF4 in Table 1 assign the label 1 whereas LF2 assigns the label -1.
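To make this concrete, below is a minimal sketch of what a discrete LF such as LF1 might look like in Python. The candidate encoding (a dict with `tokens` and entity spans `e1_span`, `e2_span`) is our own illustrative assumption, not an interface from the paper, and LF1's window-of-2 condition is omitted for brevity.

```python
# A minimal sketch (our own illustration, not the paper's code) of a discrete
# labeling function for the spouse task. A candidate is assumed to be a dict
# with the sentence tokens and the token spans of the two entity mentions.

SPOUSE_DICT = {"spouse", "married", "wife", "husband", "ex-wife", "ex-husband"}

def words_between(candidate):
    """Tokens strictly between the two entity mentions."""
    start = min(candidate["e1_span"][1], candidate["e2_span"][1])
    end = max(candidate["e1_span"][0], candidate["e2_span"][0])
    return [t.lower() for t in candidate["tokens"][start:end]]

def lf1_spouse_words(candidate):
    """Discrete LF: +1 if a spouse-dictionary word occurs between E1 and E2,
    0 (abstain) otherwise."""
    return 1 if SPOUSE_DICT & set(words_between(candidate)) else 0

# Usage on the sentence from the text:
tokens = ("Michelle Obama is the mother of Malia and Sasha "
          "and the wife of Barack Obama".split())
cand = {"tokens": tokens, "e1_span": (0, 2), "e2_span": (13, 15)}
print(lf1_spouse_words(cand))  # -> 1, because "wife" occurs between E1 and E2
```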
Ratner et al. (2016) presented a generative model for consensus on the noisy and conflicting labels assigned by the discrete LFs, to determine the probability of the correct labels. Labels thus obtained can be used for training any supervised model/classifier and evaluated on a test set.

In this paper, we present two significant extensions of the above data programming paradigm. First, the user-provided set of LFs might not be complete in their discrete forms. LF1 through LF3 in Table 1, which look for words in various hand-crafted dictionaries, may have incomplete dictionaries. A more comprehensive alternative could be to design continuous-valued LFs that return scores derived from a soft match between words in the sentence and the dictionary. As an example, for LF1 through LF3, the soft match could be obtained from the cosine similarity of pre-trained word embedding vectors (Mikolov et al. 2013) of a word in the dictionary with a word in the sentence. This enables an LF to provide a continuous class-specific score to the model, instead of a hard class label (when triggered). In Table 2, we list a continuous LF corresponding to each LF from Table 1. Such continuous LFs can expand the scope of matching to semantically similar words beyond the pre-specified words in the dictionary. For example, in the sentence "⟨E1⟩, 27, and ⟨E2⟩, 34, wed on Saturday surrounded by friends.", the word "wed" is semantically similar to "married" and would be detected by our continuous LF but missed by the discrete ones in Table 1. More generally, across applications, human experts are very often able to identify real-valued scores that correlate strongly with the label but find it difficult to discretize that score into a hard label. More examples of such scores include TF-IDF match with prototypical documents in a text-classification task, distance among entity pairs for a relation extraction task, and confidence scores from hand-crafted classifiers.

Spouse Dict = {spouse, married, wife, husband, ex-wife, ex-husband}
Family Dict = {father, mother, sister, brother, son, daughter, grandfather, grandmother, uncle, aunt, cousin}, together with the "-in-law" variant of each word
Other Dict = {boyfriend, girlfriend, boss, employee, secretary, co-worker}
Seed Set = {(Barack Obama, Michelle Obama), (Jon Bon Jovi, Dorothea Hurley), (Ron Howard, Cheryl Howard), ...}

| Id | Description |
|-----|-------------|
| LF1 | If some word in Spouse Dict is present between E1 and E2 or within 2 words of either, return 1; else return 0. |
| LF2 | If some word in Family Dict is present between E1 and E2, return -1; else return 0. |
| LF3 | If some word in Other Dict is present between E1 and E2, return -1; else return 0. |
| LF4 | If both E1 and E2 occur in Seed Set, return 1; else return 0. |
| LF5 | If the number of word tokens lying between E1 and E2 is less than 4, return 1; else return 0. |

Table 1: Discrete LFs based on dictionary lookups or thresholded distance for the spouse relationship extraction task.

| Id | Class | Description |
|-----|-------|-------------|
| LF1 | +1 | max [cosine(word-vector(u), word-vector(v)) − 0.8]+ over u ∈ Spouse Dict, v ∈ {words between E1, E2} |
| LF2 | −1 | max [cosine(word-vector(u), word-vector(v)) − 0.8]+ over u ∈ Family Dict, v ∈ {words between E1, E2} |
| LF3 | −1 | max [cosine(word-vector(u), word-vector(v)) − 0.8]+ over u ∈ Other Dict, v ∈ {words between E1, E2} |
| LF4 | +1 | max [0.2 − Norm-Edit-Dist(E1, E2, u, v)]+ over (u, v), (v, u) ∈ Seed Set |
| LF5 | +1 | [1 − (number of word tokens between E1 and E2)/5.0]+ |

Table 2: Continuous LFs corresponding to some of the discrete LFs in Table 1 for the spouse relationship extraction task.
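As an illustration of how such a continuous LF might be programmed, here is a small self-contained sketch in the style of LF1 from Table 2. The toy `embeddings` dictionary stands in for pre-trained GloVe vectors, and the function signature is our own assumption rather than the paper's code.

```python
import numpy as np

# Sketch of a continuous LF in the style of LF1 in Table 2 (our illustration).
# `embeddings` would in practice map words to pre-trained GloVe vectors; here
# it is a stand-in dictionary of toy vectors.
embeddings = {
    "married": np.array([0.9, 0.1, 0.0]),
    "wed":     np.array([0.85, 0.2, 0.05]),
    "visited": np.array([0.1, 0.9, 0.3]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lf1_continuous(words_between, spouse_dict, threshold=0.8):
    """Returns (tau, s): tau = +1 with score s if the best soft match between
    a dictionary word and a word between E1 and E2 clears the threshold;
    otherwise (0, 0.0), i.e. the LF does not trigger."""
    best = 0.0
    for u in spouse_dict:
        for v in words_between:
            if u in embeddings and v in embeddings:
                best = max(best, cosine(embeddings[u], embeddings[v]) - threshold)
    return (1, best) if best > 0 else (0, 0.0)

print(lf1_continuous(["wed", "visited"], {"married"}))  # triggers on "wed"
```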
We extend existing generative models (Ratner et al. 2016; Bach et al. 2017) to support continuous LFs. In addition to modeling the consensus distribution over all LFs (continuous and discrete), we model the distribution of scores for each continuous LF. In the supplementary material of the extended version of our paper (available at https://www.cse.iitb.ac.in/~ganesh/papers/aaai-2020-extended.pdf), we illustrate through an example why the model for continuous LFs is not a straightforward extension of the discrete counterpart.

Our second extension is designed to remove the instability of existing generative training based on unsupervised likelihood. Across several datasets we observed that the accuracy obtained by the existing models was highly sensitive to initialization, training epochs, and learning rates. In the absence of a labeled validation set, we cannot depend on prevalent tricks like early stopping to stabilize training. We give control to the data programmer to guide the training process by providing intuitive accuracy guides with each LF. In the case that the labeler develops each LF after inspecting some examples, the labeler would naturally have a rough estimate of the fraction q of examples that would be correctly labeled by the LF, amongst all examples that trigger the LF. Even otherwise, it might not be too difficult for the labeler to intuitively specify some value of q. We show that such a q serves as a user-controlled quality guide that can effectively guide the training process. For the case of continuous LFs, we use q as a rough estimate of the mean score of the continuous LF whenever it triggers correctly. A quality guide of q = 0.9 for LF1 would imply that when LF1 triggers correctly, the average embedding score is 0.9. This is easier than choosing a hard threshold on the embedding score to convert it to a discrete LF. We provide an elegant method of guiding our generative training with these user-provided accuracy estimates, and show how it surpasses simpler methods like sign penalties and data constraints. Empirically, we show that our method stabilizes unsupervised likelihood training even with very crude estimates of q. We study stability issues of the existing model with respect to training epochs and demonstrate that the proposed model is naturally more stable.

We refer to our overall approach as CAGE, which stands for Continuous And quality-Guided labEling functions. In summary, CAGE makes the following contributions: 1) It enhances the expressive power of data programming by supporting continuous labeling functions (generalizations of discrete LFs) as the unit of supervision from humans. 2) It proposes a carefully parameterized graphical model that outperforms existing models even for discrete LFs, and permits easy incorporation of user priors for continuous LFs. 3) It extends the generative model with quality guides, thereby increasing its stability and making it less sensitive to initialization. Its training is based on a principled method of regularizing the marginals of the joint model with user-provided accuracy guides. We present extensive experiments on five datasets, comparing various models for performance and stability, and present the significantly positive impact of CAGE. We show that our method of incorporating user guides leads to more reliable training than obvious ideas like sign penalties and constraint-based training.
2 Our Approach: CAGE

Let X denote the space of input instances, Y = {1, ..., K} the space of labels, and P(X, Y) their joint distribution. Our goal is to learn a model to associate a label y with an example x ∈ X. Unlike standard supervised learning, we are not provided true labels of sampled instances during training. Let the sample of m unlabeled instances be x_1, ..., x_m. Instead of the true y's, we are provided a set of n labeling functions (LFs) λ_1, λ_2, ..., λ_n such that each LF λ_j can be either discrete or continuous. Each LF λ_j is attached to a class k_j and, on an instance x_i, outputs a discrete label τ_ij = k_j when triggered and τ_ij = 0 when not triggered. If λ_j is continuous, it also outputs a score s_ij ∈ (0, 1). This is a form of weak supervision: when an LF is triggered on an instance x_i, it is proposing that the true label y should be k_j, and if continuous, it is attaching a confidence proportional to s_ij to its labeling. But to reliably and accurately infer the true label y from such weak supervision without any labeled data, we need to exploit the assumption that τ_ij is positively correlated with y. We allow the programmer of the LF to make this assumption explicit by attaching a guess on the fraction q^t_j of triggerings of the LF where the true y agrees with τ_ij. We show in the experiments that crude guesses suffice. Intuitively, this says that the user expects a q^t_j fraction of the examples on which the LF has triggered to be labeled correctly. Additionally, for a continuous LF the programmer can specify a quality index q^c_j denoting the average score s_j when there is such agreement.

Our goal is to learn to infer the correct label by creating consensus among the outputs of the LFs. Thus, the model of CAGE imposes a joint distribution between the true label y and the values τ_ij, s_ij returned by each LF λ_j on any data sample x_i drawn from the hidden distribution P(X, Y):

$$P_{\theta,\pi}(y, \tau_i, s_i) = \frac{1}{Z_\theta} \prod_{j=1}^{n} \psi_\theta(\tau_{ij}, y)\,\big(\psi_\pi(\tau_{ij}, s_{ij}, y)\big)^{\text{cont}(\lambda_j)} \qquad (1)$$

where cont(λ_j) is 1 when λ_j is a continuous LF and 0 otherwise, and θ, π denote the parameters used in defining the potentials ψ_θ, ψ_π coupling discrete and continuous variables respectively. In this factorization of the joint distribution we make the natural assumption that each LF independently provides its supervision on the true label.

The main challenge now is designing the potentials coupling the various random variables so that: (a) the parameters (θ, π) can be trained reliably using unlabeled data alone, which partially implies that the number of parameters should be limited; (b) the model is expressive enough to fit the joint distribution of the τ_j and s_j variables across a variety of datasets without relying on a labeled validation dataset for model selection and hyper-parameter tuning; and (c) the potentials reflect the bias of the programmer on the quality of the LFs in providing the true y. We will show how, without such control, it is easy to construct counter-examples where standard likelihood-based training fails miserably.

With these goals in mind, and after significant exploration, we propose the following form of potentials. For the discrete binary τ_ij variables, we chose these simple potentials:

$$\psi_\theta(\tau_{ij}, y) = \begin{cases} \exp(\theta_{jy}) & \text{if } \tau_{ij} \neq 0 \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$

Thus, for each LF we have K parameters, one per class label. An even simpler alternative would be to share θ_{jy} across different y's, as in (Bach et al. 2017), but that approach imposes undesirable restrictions on the distributions it can express. We elaborate on that in Section 2.3.
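Before moving to the continuous potentials, here is a concrete reference for how the τ and s matrices that the model consumes might be assembled. The triple-based LF encoding is our own illustrative convention, not the paper's code.

```python
import numpy as np

# Sketch (our own conventions): apply n LFs to m unlabeled instances to build
# the tau matrix of discrete firings and the s matrix of continuous scores
# used throughout Section 2. Each LF is a triple (function, k_j, is_cont);
# a continuous LF returns (fired, score), a discrete one returns fired alone.

def apply_lfs(instances, lfs):
    m, n = len(instances), len(lfs)
    tau = np.zeros((m, n), dtype=int)   # tau[i, j] = k_j if LF j triggers, else 0
    s = np.zeros((m, n))                # s[i, j] in (0, 1) when continuous LF j triggers
    for i, x in enumerate(instances):
        for j, (lf, k_j, is_cont) in enumerate(lfs):
            if is_cont:
                fired, s[i, j] = lf(x)
            else:
                fired = lf(x)
            tau[i, j] = k_j if fired else 0
    return tau, s
```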
For the case of continuous LFs, the task of designing a potential ψ_π(s_ij, τ_ij, y) that is trainable with unlabeled data and captures user bias well turned out to be significantly harder. Specifically, we wanted a form that is suited to scores that can be interpreted as confidence probabilities (lying between 0 and 1), and that captures the bias that s_ij is high when τ_ij and y agree, and low otherwise. For confidence variables, a natural parametric form of density is the Beta density. The Beta density is popularly expressed in terms of two independent parameters α > 0 and β > 0 as P(s|α, β) ∝ s^(α−1)(1 − s)^(β−1). Instead of independently learning these parameters, we chose an alternative representation that allows expression of a user prior on the expected s. We write the Beta in terms of two alternative parameters: the mean parameter q^c_j and the scale parameter π. These are related to α and β as α = q^c_j π and β = (1 − q^c_j) π. We define our continuous potential as:

$$\psi_\pi(\tau_{ij}, s_{ij}, y) = \begin{cases} \text{Beta}(s_{ij}; \alpha_a, \beta_a) & \text{if } k_j = y \text{ and } \tau_{ij} \neq 0 \\ \text{Beta}(s_{ij}; \alpha_d, \beta_d) & \text{if } k_j \neq y \text{ and } \tau_{ij} \neq 0 \\ 1 & \text{otherwise} \end{cases} \qquad (3)$$

where α_a = q^c_j π_{jy} and β_a = (1 − q^c_j) π_{jy} are the parameters of the agreement distribution, and α_d = (1 − q^c_j) π_{jy} and β_d = q^c_j π_{jy} are the parameters of the disagreement distribution, with π_{jy} constrained to be strictly positive. To impose π_{jy} > 0 while also maintaining differentiability, we reparametrize π_{jy} as exp(ρ_{jy}). Thus, we require K parameters for each continuous LF, which is the same as for a discrete LF. The Beta distribution would normally require 2K parameters, but we used the user-provided quality guide in the special manner shown above to share the mean between the agreeing and disagreeing Betas. We experimented with a large variety of other potential forms before converging on the above; we elaborate on alternatives in the experimental section.

With these potentials, the normalizer Z_θ of our joint distribution (Eqn 1) can be calculated as

$$Z_\theta = \sum_{y \in Y} \prod_{j=1}^{n} \sum_{\tau_j \in \{k_j, 0\}} \psi_\theta(\tau_j, y) \int_{s_j=0}^{1} \psi_\pi(\tau_j, s_j, y)\, ds_j = \sum_{y \in Y} \prod_{j=1}^{n} \big(1 + \exp(\theta_{jy})\big) \qquad (4)$$

The normalizer reveals two further facets of our joint distribution. First, our continuous potentials are defined such that when integrated over the s_j's we get a value of 1; hence the normalizer is independent of the continuous parameters π. That is, the continuous potentials ψ_π(τ_ij, s_ij, y) are locally normalized Bayesian probabilities P(s_ij | τ_ij, y). Second, the discrete potentials are not locally normalized; the ψ_θ(τ_j, y) cannot be interpreted as Pr(τ_j | y), because by normalizing them globally we were able to better learn the interaction among the LFs. We will show empirically that both the full Bayesian model with potentials P(y), P(τ_ij | y), and P(s_ij | τ_ij, y), and the fully undirected model where the ψ_π(τ_ij, s_ij, y) potential is un-normalized, are harder to train.

2.1 Training CAGE

Our training objective can be expressed as:

$$\max_{\theta,\pi}\; LL(\theta, \pi \mid D) + R(\theta, \pi \mid \{q^t_j\}) \qquad (5)$$

The first part maximizes the likelihood of the observed τ_i and s_i values of the training sample D = x_1, ..., x_m after marginalizing out the true y. It can be expressed as:

$$LL(\theta, \pi \mid D) = \sum_{i=1}^{m} \log \sum_{y \in Y} P_{\theta,\pi}(\tau_i, s_i, y) = \sum_{i=1}^{m} \log \sum_{y \in Y} \prod_{j=1}^{n} \psi_\theta(\tau_{ij}, y)\, \big(\psi_\pi(s_{ij}, \tau_{ij}, y)\big)^{\text{cont}(\lambda_j)} - m \log Z_\theta \qquad (6)$$

By CAGE-G, we will hereafter refer to the model in Eqn 1 whose parameters are learnt by maximizing only this (first) likelihood part of the objective and not the second part R(θ, π|{q^t_j}). R(θ, π|{q^t_j}) is a regularizer that guides the parameters with the programmer's expectation of the quality of each LF.
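As a concrete reference, the marginalized log-likelihood above can be computed as in the following condensed PyTorch sketch. The tensor layout and names are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

# Condensed sketch of the CAGE marginal log-likelihood (Eqns 1-6); our own
# tensor layout, not the authors' released code.
#   theta, rho : (n, K) parameters; pi = exp(rho) is the Beta scale (Eqn 3).
#   tau        : (m, n) ints, tau[i, j] = k_j if LF j fired on x_i, else 0.
#   s          : (m, n) scores in (0, 1); entries for discrete LFs are unused.
#   k          : (n,) class k_j of each LF, with labels in 1..K.
#   cont       : (n,) bool mask of continuous LFs.
#   qc         : (n,) quality guides q^c_j in (0, 1) (values for discrete LFs
#                are unused but must still be valid).

def log_likelihood(theta, rho, tau, s, k, cont, qc, K):
    m, n = tau.shape
    fired = (tau != 0).float()
    pi = torch.exp(rho)                              # keeps the scale positive
    # Discrete part: sum_j log psi_theta = theta[j, y] over fired LFs.
    log_pot = fired @ theta                          # (m, K)
    s_c = s.clamp(1e-4, 1 - 1e-4)
    cont_terms = []
    for y in range(1, K + 1):
        agree = (k == y).float()
        mean = agree * qc + (1 - agree) * (1 - qc)   # Beta mean: q^c_j or 1 - q^c_j
        alpha = mean * pi[:, y - 1]
        beta = (1 - mean) * pi[:, y - 1]
        log_beta = Beta(alpha, beta).log_prob(s_c)   # (m, n) by broadcasting
        cont_terms.append((fired * cont.float() * log_beta).sum(dim=1))
    log_pot = log_pot + torch.stack(cont_terms, dim=1)
    # log Z_theta = logsumexp_y sum_j log(1 + exp(theta[j, y]))   (Eqn 4)
    log_Z = torch.logsumexp(F.softplus(theta).sum(dim=0), dim=0)
    return torch.logsumexp(log_pot, dim=1).sum() - m * log_Z
```

Maximizing this term alone corresponds to CAGE-G; the regularizers discussed next are added to it.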
We start by motivating the need for the regularizer by showing simple cases that can cause likelihood-only training to yield poor accuracy.

Example 1: Sensitivity to initialization. Consider a binary classification task where the n LFs are perfect oracles that trigger only on instances whose true label matches k_j. Assume all λ_j are discrete. The likelihood of such data can be expressed as:

$$\sum_{i=1}^{m} \log\Big(\exp\Big(\sum_{j:k_j=y_i}\theta_{j1}\Big) + \exp\Big(\sum_{j:k_j=y_i}\theta_{j2}\Big)\Big) - m\log\Big(\prod_j\big(1+\exp(\theta_{j1})\big) + \prod_j\big(1+\exp(\theta_{j2})\big)\Big) \qquad (7)$$

where y_i denotes the true label of instance x_i, so that the LFs triggered on x_i are exactly those with k_j = y_i. The value of the above likelihood is totally symmetric in θ_{j1} and θ_{j2}, but the accuracy is not. We will get 100% accuracy only when the parameter for the agreeing case, θ_{jk_j}, is larger than θ_{jy} for y ≠ k_j, and 0% accuracy if θ_{jk_j} is smaller. A trick is to initialize the θ parameters carefully so that the agreeing parameters θ_{jk_j} do have larger values. However, even such careful initialization can be forgotten in less trivial cases, as we show in the next example.

Example 2: Failure in spite of good initialization. Consider a set S_1 of r LFs that assign the label 1 and a remaining set S_2 of n − r LFs that assign the label 2. Let each true class-2 instance trigger one or more LFs from S_1 and one or more LFs from S_2. Let each true class-1 instance trigger only LFs from S_1. When we initialize LFs in set S_1 such that θ_{j1} − θ_{j2} > 0 and LFs in set S_2 such that θ_{j2} − θ_{j1} > 0, we can get good accuracy. However, as training progresses, the likelihood is globally maximized when both sets of LFs favor the same class on all instances. If we further assume that the true class distribution is skewed, the LL(θ) objective quickly converges to this useless maximum. This scenario is not artificial: many real datasets (e.g., the LFs of the Spouse relation extraction data in Table 1) exhibit such trends.

A straightforward fix for the above problem is to impose a penalty on the sign of θ_{jk_j} − θ_{jy}. However, since the θ's of the LFs interact via the global normalizer Z_θ, this condition is neither necessary nor sufficient to ensure that, in the joint model P_θ(y, τ), the values of y and k_j agree more than they disagree. For globally conditioned models the parameters cannot be easily interpreted, and we need to constrain at the level of the joint distribution. One method to work with the whole distribution is to constrain the conditional P_θ(y|τ_i) over the instances where the LF triggers, requiring that the accumulated probability of the agreeing y be at least q^t_j, as follows:

$$R(\theta \mid \{q^t_j\}, D) = \sum_j \sum_{i:\tau_{ij}=k_j} \big(q^t_j - P_\theta(y=k_j \mid \tau_i)\big) \qquad (8)$$

We call this the data-driven constrained training method and refer to it as CAGE-data-G. However, a limitation of this constraint is that in a mini-batch training environment it is difficult to get enough examples per batch for a reliable estimate of the empirical accuracy, particularly for LFs that trigger infrequently.
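For reference, here is a mini-batch sketch under one plausible reading of Eqn 8 (penalizing the shortfall of the batch-averaged agreement probability below q^t_j); the hinge form and the averaging are our own assumptions.

```python
import torch

# Sketch (our reading of Eqn 8): penalize, per LF, how far the average
# agreement probability over triggered instances in the batch falls below
# the guide q^t_j. `log_pot[i, y-1]` is the unnormalized log joint
# log P(tau_i, s_i, y) from the likelihood sketch above.

def data_driven_penalty(log_pot, tau, k, qt):
    post = torch.softmax(log_pot, dim=1)        # P_theta(y | tau_i, s_i); Z cancels
    penalty = torch.zeros(())
    for j in range(tau.shape[1]):
        triggered = tau[:, j] == k[j]
        if triggered.any():
            agree = post[triggered, k[j] - 1].mean()
            penalty = penalty + torch.clamp(qt[j] - agree, min=0.0)
    return penalty                               # subtracted from the objective
```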
Next, we present our method of incorporating the user guidance into the trained model to avoid such instability.

2.2 Data-independent quality guides in CAGE

Our final approach, which worked reliably, is to regularize the parameters so that the learned joint distribution of y and τ_j matches the user-provided quality guides q^t_j over all y, τ_j values from the joint distribution P_{θ,π}. By default, this is the regularizer that we employ in CAGE. The q^t_j guide is the user's belief about the fraction of cases where y and τ_j agree when τ_j ≠ 0 (i.e., LF λ_j triggers). Using the joint distribution, we can calculate this agreement probability as P_θ(y = k_j | τ_j = k_j). This probability can be computed in closed form by marginalizing over all remaining variables in the model of Equation 1 as follows:

$$P_\theta(y = k_j \mid \tau_j = k_j) = \frac{P_\theta(y = k_j, \tau_j = k_j)}{P_\theta(\tau_j = k_j)} = \frac{M_j(k_j)\prod_{r \neq j}\big(1 + M_r(k_j)\big)}{\sum_{y \in Y} M_j(y)\prod_{r \neq j}\big(1 + M_r(y)\big)}$$

where M_j(y) = exp(θ_{jy}). We then seek to minimize the KL distance between the user-provided q^t_j and the model-calculated precision P_θ(y = k_j | τ_j = k_j), which turns out to be:

$$R(\theta \mid \{q^t_j\}) = \sum_j q^t_j \log P_\theta(y = k_j \mid \tau_j = k_j) + (1 - q^t_j) \log\big(1 - P_\theta(y = k_j \mid \tau_j = k_j)\big) \qquad (9)$$

Specifically, when the CAGE model is restricted to discrete LFs only, while also incorporating the quality guide of Eqn 9 into the objective of Eqn 6, we refer to the approach as CAGE-C. Further, when the quality guide of Eqn 9 is dropped from CAGE-C, we refer to the approach as CAGE-C-G.

2.3 Relationship of CAGE with existing models

We would like to point out that the following three simplifications of CAGE lead to existing well-known models (Ratner et al. 2016; 2017): (i) coupling the θ_{jy} parameters, (ii) ignoring quality guides, and (iii) not including continuous potentials. The design used in (Bach et al. 2017) is to assign a single parameter θ_j to each LF and share it across y as:

$$\psi^{\text{snorkel}}_j(\tau_{ij}, y) = \begin{cases} \exp(\theta_j) & \text{if } \tau_{ij} \neq 0,\; y = k_j \\ \exp(-\theta_j) & \text{if } \tau_{ij} \neq 0,\; y \neq k_j \\ 1 & \text{otherwise} \end{cases} \qquad (10)$$

After ignoring quality guides and continuous LFs, we note that the choice θ_{j,+1} = −θ_{j,−1} makes CAGE exactly the same as the model in Snorkel. However, we found that Snorkel's method of parameter sharing incorporates an unnecessary bias that P_θ(τ_ij ≠ 0 | y = k_j) = 1 − P_θ(τ_ij ≠ 0 | y ≠ k_j). Also, Snorkel's pure likelihood-based training is subject to all the sensitivity to parameter initialization and training epochs that we highlighted in Section 2.1. We show in the experiments how each of the three new extensions in CAGE is crucial to getting reliable training with the data programming paradigm.
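Before turning to the experiments, here is a sketch of this data-independent regularizer in PyTorch, computing the closed-form precision above with M_j(y) = exp(θ_{jy}); the naming is ours, and the small epsilons are numerical-safety guards.

```python
import torch
import torch.nn.functional as F

# Sketch of the quality-guide regularizer (Eqn 9); our naming.
#   theta : (n, K) discrete parameters.
#   k     : (n,) LF classes in 1..K.
#   qt    : (n,) guides q^t_j.

def guide_regularizer(theta, k, qt):
    n, K = theta.shape
    log_all = F.softplus(theta).sum(dim=0)          # (K,): log prod_r (1 + M_r(y))
    reg = torch.zeros(())
    for j in range(n):
        log_rest = log_all - F.softplus(theta[j])   # drop LF j from the product
        log_joint = theta[j] + log_rest             # log M_j(y) prod_{r!=j}(1 + M_r(y))
        prec = torch.softmax(log_joint, dim=0)[k[j] - 1]  # P(y = k_j | tau_j = k_j)
        reg = reg + qt[j] * torch.log(prec + 1e-9) \
                  + (1 - qt[j]) * torch.log(1 - prec + 1e-9)
    return reg   # added to LL in the objective of Eqn 5
```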
3 Empirical Evaluation

In this section we (1) evaluate the utility of continuous LFs vis-à-vis discrete LFs, (2) demonstrate the role of the quality guides in the stability of the unsupervised likelihood training of CAGE as well as Snorkel, and (3) perform a detailed ablation study to justify the various design elements of our generative model and its guided training procedure.

3.1 Datasets and Experiment Setup

We perform these comparisons on five different datasets.

Spouse (spo): This is a relation extraction dataset where the task is to label candidate pairs of entities in a sentence as expressing a spouse relation or not. Our train-dev-test splits and the set of discrete LFs shown in Table 1 are the same as in (Ratner et al. 2016), where the dataset was first used. For each discrete LF that checks for matches in a dictionary D of keywords, we create a continuous LF that returns s_j as the maximum cosine similarity of their word embeddings, as shown in Table 2. We used pre-trained vectors provided by GloVe (Pennington, Socher, and Manning 2014).

SMS spam (sms): This is a binary spam/no-spam classification dataset with 5574 documents split into 3700 unlabeled-train and 1872 labeled-test instances. Nine LFs are created based on (i) presence of three categories of words that are highly likely to indicate spam, (ii) presence of two categories of trigger words in certain contexts, (iii) reference to keywords indicative of first/second or third person, (iv) text characteristics such as the number of capitalized characters, presence of special characters, etc., and finally (v) an LF that is associated with the negative class, always triggers, and serves as the class prior. The LFs are explained in the supplementary material of the extended version of this paper. The continuous LFs are created in the same way as in Spouse, based on word-embedding similarity, number of capitalized characters, etc.

CDR (cdr 2018): This is also a relation extraction dataset where the task is to detect whether or not a sentence expresses a "chemical cures disease" relation. The train-dev-test splits and LFs are the same as in (Ratner et al. 2016). We did not develop any continuous LFs for CDR.

Dedup: This dataset (publicly available at https://www.cse.iitb.ac.in/~sunita/alias/) comprises 32 thousand pairs of noisy citation records with fields like Title, Author, Year, etc. The task is to detect whether the record pairs are duplicates. We have 18 continuous LFs corresponding to various text similarity functions (such as Jaccard, TF-IDF similarity, 1 − Edit Distance, etc.) computed over one or more of these fields. Each of these LFs is positively correlated with the duplicate label; we create another 18 with the score 1 − similarity for the negative class. The dataset is highly skewed, with only 0.5% of the instances being duplicates. All LFs here are continuous.

Iris: Iris is a UCI dataset with 3 classes. We split it into 105 unlabeled train and 45 labeled test examples. We create LFs from the 4 features of the data as follows: for each feature f and class y, we calculate f's mean value f̄_y amongst the examples of y from labeled data and create an LF that returns the value 1 − norm(f − f̄_y), where norm(f − f̄_y) is the normalized distance from this mean. This gives us a total of 4 × 3 = 12 continuous LFs. Each such LF has a corresponding discrete LF that is triggered if the feature is closest to the mean of its corresponding class.

Ionosphere: This is another 2-class UCI dataset, split into 245 unlabeled train and 106 labeled test instances. 64 continuous LFs are created in a manner similar to Iris.

Training Setup: For each dataset and discrete LF, we arbitrarily assigned a default discrete quality guide q^t_j = 0.9, and for continuous LFs q^c_j = 0.85. We used a learning rate of 0.01 and 100 training epochs. Parameters were initialized favorably: the agreeing parameters were initialized as θ_{jk_j} = 1 and the disagreeing parameters as θ_{jy} = −1 for y ≠ k_j. For Snorkel this is equivalent to θ_j = 1. Only for CAGE, which is trained with guides, do we initialize all parameters to 1. We show in Section 3.3 that CAGE is insensitive to initialization whereas the others are not.

Evaluation Metric: We report F1 as our accuracy measure on all binary datasets, and for the multi-class dataset Iris we measure micro-F1 across the classes. For our generative model, as well as Snorkel, the predicted label on a test instance x_i is the y for which the joint generative probability is highest, that is, argmax_y P(y, τ_i, s_i). Another measure of interest is the accuracy obtained by training a standard discriminative classifier P_w(y|x) on the examples x_i probabilistically labeled with P(y|x_i) ∝ P(y, τ_i, s_i) by the generative model. In the first part of the experiments we measure the accuracy of the labeled data produced by the generative model. In the extended version of this paper, we present the accuracy of a discriminative model trained on such a dataset.
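A sketch of this generative prediction step (our naming), reusing the unnormalized log joint `log_pot` from the likelihood sketch in Section 2.1:

```python
import torch

# Z_theta is constant in y, so taking the argmax over the unnormalized log
# joint log_pot gives argmax_y P(y, tau_i, s_i); softmax over the same
# quantity gives P(y | x_i) for training a downstream classifier.

def predict(log_pot):
    return log_pot.argmax(dim=1) + 1         # back to labels in 1..K

def posterior(log_pot):
    return torch.softmax(log_pot, dim=1)     # P(y | tau_i, s_i)
```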
We implemented our model in PyTorch (code available at https://github.com/oishik75/CAGE).

3.2 Overall Results

| | Spouse | CDR | SMS | Ion | Iris | Dedup |
|---|---|---|---|---|---|---|
| Majority | 0.17 | 0.53 | 0.23 | 0.79 | 0.84 | - |
| Snorkel | 0.41 | 0.66 | 0.34 | 0.70 | 0.87 | - |
| CAGE-C-G | 0.48 | 0.69 | 0.34 | 0.81 | 0.87 | - |
| CAGE-C | 0.50 | 0.69 | 0.45 | 0.82 | 0.87 | - |
| CAGE | 0.58 | 0.69 | 0.54 | 0.97 | 0.87 | 0.79 |

Table 3: Overall results (F1) with predictions from various generative models, contrasted with the Majority baseline.

In Table 3, we compare the performance of CAGE, in terms of F1, against the following alternatives: (i) Majority: a simple baseline wherein the label on which a majority of the LFs agree is the inferred consensus label. (ii) Snorkel: see Section 2.3. (iii) CAGE-C-G: our model without continuous LFs and without quality guides (see Section 2.2). (iv) CAGE-C: our model with quality guides but without continuous LFs (see Section 2.2). From this table we make three important observations: (1) Comparing Snorkel with CAGE-C-G, which differs only in decoupling Snorkel's shared θ_j parameters, we observe that the shared parameters of Snorkel were indeed introducing an undesirable bias. (2) Comparing CAGE-C-G and CAGE-C, we see the gains due to our quality guides. (3) Finally, comparing CAGE-C and CAGE, we see the gains from the greater expressiveness of continuous LFs. These LFs required negligible additional human programming effort beyond the discrete LFs. Compared to Snorkel, our model provides significant overall gains in F1. For datasets like Dedup, which consist only of continuous scores, CAGE is the only option. We next present a more detailed ablation study to tease out the importance of the different design elements of CAGE.

3.3 Role of the Quality Guides

We motivate the role of the quality guides in Figure 1, where we show test F1 for an increasing number of training epochs on three datasets. In these plots we consider only discrete LFs. We compare Snorkel and our model with (CAGE-C) and without (CAGE-C-G) these guides. Without the quality guides, all datasets exhibit unpredictable swings in test F1. These swings cannot be attributed to over-fitting, since on Spouse and SMS F1 improves later in training with our quality guides. Since we do not have labeled validation data to choose the correct number of epochs, the quality guides are invaluable for getting reliable accuracy in unsupervised learning.

[Figure 1 (panels: (a) Spouse, (b) SMS, (c) CDR): F1 with increasing number of training epochs, compared across Snorkel, CAGE-C-G, and CAGE-C for three datasets. For each dataset, in the absence of guides, we observe unpredictable variation in test F1 as training progresses.]

Next, we show that the stability provided by the quality guides q^t_j is robust to large deviations from the true accuracy of an LF. Our default q^t_j value was 0.9 for all LFs, irrespective of their true accuracy. We repeated our experiments with a guide of 0.8 and got the same accuracy across training epochs (see the extended version of this paper). We next ask whether knowing the true accuracy of an LF would help even more, and how robust our training is to distortion of the user's guess away from the true accuracy. We calculated the true accuracy of each LF on the dev set and distorted it with Gaussian noise with variance σ. In Figure 2 we present accuracy after 100 epochs on two datasets with increasing distortion σ. On CDR, CAGE's accuracy is very robust to distorted q^t_j, but guides are important, as we see from Figure 1(c). On Spouse, accuracy is highest with perfect values of q^t_j (σ = 0), but it stays close to this accuracy up to a distortion of 0.4.

[Figure 2: F1 with increasing distortion in the guess of the LF quality guide q^t_j.]

Sensitivity to Initialization: We carefully initialized the parameters of all models except CAGE. With random initialization, all models without guides (Snorkel and CAGE-G) provide very poor accuracy. Exact numbers are in the extended version of this paper.
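For reference, a small sketch of the distortion procedure above (treating σ as the noise scale; the clipping back to a valid guide range is our own assumption):

```python
import numpy as np

# Sketch of the guide-distortion experiment (our reading): perturb each LF's
# dev-set accuracy with Gaussian noise of scale sigma, then clip back to a
# valid guide range. The clipping bounds are our own assumption.

def distorted_guides(true_acc, sigma, seed=0):
    rng = np.random.default_rng(seed)
    noisy = true_acc + rng.normal(0.0, sigma, size=len(true_acc))
    return np.clip(noisy, 0.05, 0.95)

qt = distorted_guides(np.array([0.9, 0.7, 0.8]), sigma=0.4)
```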
Method of Enforcing Quality Guides: In Table 4, we compare F1 for the following choices: (i) CAGE-C-G+P: our model with the objective in Eqn 6 augmented with the sign penalty max(0, θ_{jy} − θ_{jk_j}) for y ≠ k_j, instead of the regularizer in Eqn 9. (ii) CAGE-C-G: our model without guides. (iii) CAGE-C,data-G: the data-driven method of incorporating quality guides (see Section 2.1). (iv) CAGE-C: our data-independent regularizer of the model's marginals with q^t_j (Eqn 9). From Table 4 we observe that CAGE-C is the only one that provides reliable gains. Thus, it is not enough to get quality guides from users; we also need to design sound methods of combining them with likelihood training.

| | Spouse | CDR | SMS | Ion |
|---|---|---|---|---|
| CAGE-C-G+P | 0.48 | 0.69 | 0.34 | 0.81 |
| CAGE-C-G | 0.48 | 0.69 | 0.34 | 0.81 |
| CAGE-C,data-G | 0.48 | 0.69 | 0.34 | 0.81 |
| CAGE-C | 0.50 | 0.69 | 0.45 | 0.82 |

Table 4: Comparing different methods of incorporating the user's quality guides on discrete LFs.

3.4 Structure of the Potentials

In defining the joint distribution P_{θ,π}(y, τ_i, s_i) (Eqn 1), we used undirected, globally normalized potentials for the discrete LFs (Eqn 2). We compare with an alternative where the joint is a pure directed Bayesian network with potentials Pr(τ_j|y) = exp(θ_{jy})/(1 + exp(θ_{jy})) on each discrete LF and a class prior Pr(y). We observe that the undirected model is better able to capture the interaction among the LFs via the global normalization Z_θ:

| | Spouse | CDR | SMS | Ion | Iris |
|---|---|---|---|---|---|
| Directed | 0.15 | 0.49 | 0.59 | 0.86 | 0.89 |
| CAGE | 0.58 | 0.69 | 0.54 | 0.97 | 0.87 |

We repeated other ablation experiments where the continuous potentials are undirected and take various forms. The results appear in the extended version of this paper and show that local normalization is crucially important for modeling the s_j of continuous LFs.

4 Related Work

Several consensus-based prediction combination algorithms (Gao et al. 2009; Kulkarni et al. 2018) exist that combine multiple model predictions to counteract the effects of data quality and model bias. There also exist label embedding approaches from the extreme classification literature (Yeh et al. 2017) that exploit inter-label correlation. While these approaches assume that the imperfect labeler's knowledge is fixed, (Fang et al. 2012) present a self-taught active learning paradigm where a crowd of imperfect labelers learn complementary knowledge from each other. However, they use instance-wise reliability of labelers to query only the most reliable labeler, without any notion of consensus. A recent work by (Chang, Amershi, and Kamar 2017) presents a collaborative crowd-sourcing approach. However, they are motivated by the problem of eliminating the burden of defining labeling guidelines a priori, and their approach harnesses labeling disagreements to identify ambiguous concepts and create semantically rich structures for post-hoc label decisions.
There is work in the crowd-labeling literature that makes use of many imperfect labelers (Kulkarni et al. 2018; Raykar et al. 2010; Yan et al. 2011; Dekel and Shamir 2009) and accounts for both labeler and model uncertainty to propose probabilistic solutions that (a) adapt conventional supervised learning algorithms to learn from multiple subjective labels, (b) evaluate them in the absence of an absolute gold standard, and (c) estimate the reliability of labelers. (Donmez and Carbonell 2008) propose a proactive learning method that jointly selects the optimal labeler and instance with a decision-theoretic approach. Some recent literature has also studied augmenting neural networks with rules in first-order logic, either to guide individual layers (Li and Srikumar 2019) or to train model weights within the constraints of the rule-based system using a student and teacher model (Hu et al. 2016).

Snorkel (Ratner et al. 2016; Bach et al. 2017; Ratner et al. 2017; Hancock et al. 2018; Varma et al. 2019) relies on domain experts manually developing heuristic and noisy LFs. Similar methods that rely on imperfect sources of labels are (Bunescu and Mooney 2007; Hearst 1992), relying on heuristics, (Mintz et al. 2009), relying on distant supervision, and (Jawanpuria, Nath, and Ramakrishnan 2015), relying on learned conjunctions of (discrete) rules. The aforementioned literature focuses exclusively on labeling suggestions that are discrete. We present a generalized generative model to aggregate heuristic labels from continuous (and discrete) LFs while also incorporating user accuracy priors.

5 Conclusion

We presented a data programming paradigm that lets the user specify labeling functions which, when triggered on instances, can also produce continuous scores. The unsupervised task of consolidating weak labels is inherently unstable and sensitive to parameter initialization and training epochs. Instead of depending on uninterpretable hyper-parameters, which can only be tuned with labeled validation data that we assume is unavailable, we let the user guide the training with interpretable quality guesses. We carefully designed the potentials and the training process to give the user more interpretable control.

Acknowledgements

This research was partly sponsored by a Google India AI/ML Research Award and partly sponsored by IBM Research, India (specifically the IBM AI Horizon Networks - IIT Bombay initiative).

References

Bach, S. H.; He, B. D.; Ratner, A.; and Ré, C. 2017. Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 273-282.

Bunescu, R. C., and Mooney, R. J. 2007. Learning to extract relations from the web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic.

Chemical-Disease Relation Extraction Task. 2018. https://github.com/HazyResearch/snorkel/tree/master/tutorials/cdr. [Online; accessed 31-March-2018].

Chang, J. C.; Amershi, S.; and Kamar, E. 2017. Revolt: Collaborative crowdsourcing for labeling machine learning datasets. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), Denver, CO, USA, 2334-2346.

Dekel, O., and Shamir, O. 2009. Good learners for evil teachers. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), Montreal, Québec, Canada, 233-240.

Donmez, P., and Carbonell, J. G. 2008. Proactive learning: cost-sensitive active learning with multiple imperfect oracles.
In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, California, USA, 619-628.

Fang, M.; Zhu, X.; Li, B.; Ding, W.; and Wu, X. 2012. Self-taught active learning from crowds. In 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium, 858-863.

Gao, J.; Liang, F.; Fan, W.; Sun, Y.; and Han, J. 2009. Graph-based consensus maximization among multiple supervised and unsupervised models. In 23rd Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada, 585-593.

Hancock, B.; Varma, P.; Wang, S.; Bringmann, M.; Liang, P.; and Ré, C. 2018. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 1884-1895.

Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In 14th International Conference on Computational Linguistics (COLING), Nantes, France, 539-545.

Hu, Z.; Ma, X.; Liu, Z.; Hovy, E. H.; and Xing, E. P. 2016. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany.

Jawanpuria, P.; Nath, J. S.; and Ramakrishnan, G. 2015. Generalized hierarchical kernel learning. Journal of Machine Learning Research 16:617-652.

Kulkarni, A.; Uppalapati, N. R.; Singh, P.; and Ramakrishnan, G. 2018. An interactive multi-label consensus labeling model for multiple labeler judgments. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), New Orleans, Louisiana, USA, 1479-1486.

Li, T., and Srikumar, V. 2019. Augmenting neural networks with first-order logic. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), Florence, Italy, 292-302.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. In Workshop Track Proceedings of the 1st International Conference on Learning Representations (ICLR), Scottsdale, Arizona, USA.

Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP), Singapore, 1003-1011.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 1532-1543.

Ratner, A. J.; Sa, C. D.; Wu, S.; Selsam, D.; and Ré, C. 2016. Data programming: Creating large training sets, quickly. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 3567-3575.

Ratner, A.; Bach, S. H.; Ehrenberg, H. R.; Fries, J. A.; Wu, S.; and Ré, C. 2017. Snorkel: Rapid training data creation with weak supervision. PVLDB 11(3):269-282.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010. Learning from crowds. Journal of Machine Learning Research 11:1297-1322.

SMS Spam Collection Data Set. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.

Spouse Relation Extraction Task. https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro.

Varma, P.; Sala, F.; He, A.; Ratner, A.; and Ré, C. 2019. Learning dependency structures for weak supervision models.
In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, California, USA, 6418-6427.

Yan, Y.; Rosales, R.; Fung, G.; and Dy, J. G. 2011. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, Washington, USA, 1161-1168.

Yeh, C.; Wu, W.; Ko, W.; and Wang, Y. F. 2017. Learning deep latent space for multi-label classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), San Francisco, California, USA, 2838-2844.