# Training Complex Models with Multi-Task Weak Supervision

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré
Department of Computer Science, Stanford University
{ajratner, bradenjh, jdunnmon, fredsala, shreyash, chrismre}@stanford.edu

## Abstract

As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply at different levels of granularity. We propose a framework for integrating and modeling such weak supervision sources by viewing them as labeling different related sub-tasks of a problem, which we refer to as the multi-task weak supervision setting. We show that by solving a matrix completion-style problem, we can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model. Theoretically, we show that the generalization error of models trained with this approach improves with the number of unlabeled data points, and characterize the scaling with respect to the task and dependency structures. On three fine-grained classification problems, we show that our approach leads to average gains of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a majority vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately.

## 1 Introduction

One of the greatest roadblocks to using modern machine learning models is collecting hand-labeled training data at the massive scale they require. In real-world settings, where domain expertise is needed and modeling goals change frequently, hand-labeling training sets is prohibitively slow, expensive, and static. For these reasons, practitioners are increasingly turning to weak supervision techniques, wherein noisier, often programmatically generated labels are used instead. Common weak supervision sources include external knowledge bases (Mintz et al. 2009; Zhang et al. 2017a; Craven and Kumlien 1999; Takamatsu, Sato, and Nakagawa 2012), heuristic patterns (Gupta and Manning 2014; Ratner et al. 2018), feature annotations (Mann and McCallum 2010; Zaidan and Eisner 2008), and noisy crowd labels (Karger, Oh, and Shah 2011; Dawid and Skene 1979). The use of these sources has led to state-of-the-art results in a range of domains (Zhang et al. 2017a; Xiao et al. 2015). A recurring theme in weak supervision is that using the full diversity of available sources is critical to training high-quality models (Ratner et al. 2018; Zhang et al. 2017a).

The key technical difficulty of weak supervision is determining how to combine the labels of multiple sources that have different, unknown accuracies, may be correlated, and may label at different levels of granularity. In our experience with users in academia and industry, the complexity of real-world weak supervision sources makes this integration phase the key time sink and stumbling block.
For example, if we are training a model to classify entities in text, we may have one available source of high-quality but coarse-grained labels (e.g., Person vs. Organization) and one source that provides lower-quality but finer-grained labels (e.g., Doctor vs. Lawyer); moreover, these sources might be correlated due to some shared component or data source (Bach et al. 2017; Varma et al. 2017). Handling such diversity requires addressing a core technical challenge: estimating the unknown accuracies of multi-granular and potentially correlated supervision sources without any labeled data.

To overcome this challenge, we propose MeTaL, a framework for modeling and integrating weak supervision sources with different unknown accuracies, correlations, and granularities. In MeTaL, we view each source as labeling one of several related sub-tasks of a problem; we refer to this as the multi-task weak supervision setting. We then show that, given the dependency structure of the sources, we can use their observed agreement and disagreement rates to recover their unknown accuracies. Moreover, we exploit the relationship structure between tasks to observe additional cross-task agreements and disagreements, effectively providing extra signal from which to learn. In contrast to previous approaches based on sampling from the posterior of a graphical model directly (Ratner et al. 2016; Bach et al. 2017), we develop a simple and scalable matrix completion-style algorithm, which we are able to analyze by applying strong matrix concentration bounds (Tropp 2015). We use this algorithm to learn and model the accuracies of diverse weak supervision sources, and then combine their labels to produce training data that can be used to supervise arbitrary models, including increasingly popular multi-task learning models (Caruana 1993; Ruder 2017).

Compared to previous methods, which only handled the single-task setting (Ratner et al. 2016; 2018) and generally considered conditionally independent sources (Anandkumar et al. 2014; Dawid and Skene 1979), we demonstrate that our multi-task-aware approach leads to average gains of 4.1 points in accuracy in our experiments, and has at least three additional benefits. First, many dependency structures between weak supervision sources may lead to non-identifiable models of their accuracies, where a unique solution cannot be recovered. We provide a compiler-like check to establish identifiability, i.e., the existence of a unique set of source accuracies, for arbitrary dependency structures, without resorting to the standard assumption of non-adversarial sources (Dawid and Skene 1979), alerting users to this potential stumbling block that we have observed in practice. Next, we provide sample complexity bounds that characterize the benefit of adding additional unlabeled data and the scaling with respect to the user-specified task and dependency structure. While previous approaches required thousands of sources to give non-vacuous bounds, we capture regimes with small numbers of sources, better reflecting the real-world uses of weak supervision we have observed. Finally, we are able to solve our proposed problem directly with SGD, leading to over 100× faster runtimes compared to prior Gibbs-sampling-based approaches (Ratner et al. 2016; Platanios et al. 2017), and enabling simple implementation using libraries like PyTorch.
We validate our framework on three fine-grained classification tasks in named entity recognition, relation extraction, and medical document classification, for which we have diverse weak supervision sources at multiple levels of granularity. We show that by modeling them as labeling hierarchically related sub-tasks and utilizing unlabeled data, we obtain an average improvement of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a basic majority-vote weak supervision baseline, and 4.1 points over data programming (Ratner et al. 2016), an existing weak supervision approach in the literature that is not multi-task-aware. We also extend our framework to handle unipolar sources that only label one class, a critical aspect of weak supervision in practice that contributes an average of 2.8 points to our gains over majority vote. From a practical standpoint, we argue that our framework represents an efficient way for practitioners to supervise modern machine learning models, including new multi-task variants, for complex tasks by opportunistically using the diverse weak supervision sources available to them. To further validate this, we have released an open-source implementation of our framework at github.com/HazyResearch/metal.

## 2 Related Work

Our work builds on and extends various settings studied in machine learning.

**Weak Supervision:** We draw motivation from recent work which models and integrates weak supervision using generative models (Ratner et al. 2016; 2018; Bach et al. 2017) and other methods (Guan et al. 2017; Khetan, Lipton, and Anandkumar 2017). These approaches, however, do not handle multi-granularity or multi-task weak supervision, require expensive sampling-based techniques that may lead to non-identifiable solutions, and leave room for sharper theoretical characterization of weak supervision scaling properties. More generally, our work is motivated by a wide range of specific weak supervision techniques, which include traditional distant supervision approaches (Mintz et al. 2009; Craven and Kumlien 1999; Zhang et al. 2017a; Hoffmann et al. 2011; Takamatsu, Sato, and Nakagawa 2012), co-training methods (Blum and Mitchell 1998), pattern-based supervision (Gupta and Manning 2014; Zhang et al. 2017a), and feature-annotation techniques (Mann and McCallum 2010; Zaidan and Eisner 2008; Liang, Jordan, and Klein 2009).

**Crowdsourcing:** Our approach also has connections to the crowdsourcing literature (Karger, Oh, and Shah 2011; Dawid and Skene 1979), and in particular to spectral and method-of-moments-based approaches (Zhang et al. 2014; Dalvi et al. 2013a; Ghosh, Kale, and McAfee 2011; Anandkumar et al. 2014). In contrast, the goal of our work is to support and explore settings not covered by the crowdsourcing literature, such as sources with correlated outputs, the proposed multi-task supervision setting, and regimes wherein a small number of labelers (weak supervision sources) each label a large number of items (data points). Moreover, we theoretically characterize the generalization performance of an end model trained with the weakly labeled data.

**Multi-Task Learning:** Our proposed approach is motivated by recent progress on multi-task learning models (Caruana 1993; Ruder 2017; Søgaard and Goldberg 2016), and in particular by their need for multiple large hand-labeled training datasets.
We note that the focus of our paper is on generating supervision for these models, not on the particular multi-task learning model being trained; we control for the latter by fixing a simple architecture in our experiments. Our work is also related to recent techniques for estimating classifier accuracies without labeled data in the presence of structural constraints (Platanios et al. 2017). We use matrix structure estimation (Loh and Wainwright 2012) and concentration bounds (Tropp 2015) for our core results.

## 3 Programming Machine Learning with Weak Supervision

As modern machine learning models become both more complex and more performant on a range of tasks, developers increasingly interact with them by programmatically generating noisier or weak supervision. These approaches of effectively programming machine learning models have recently been formalized by the following pipeline (Ratner et al. 2016; 2018): First, users provide one or more weak supervision sources, which are applied to unlabeled data to generate a set of noisy labels. These labels may overlap and conflict; we model and combine them via a label model in order to produce a final set of training labels. These labels are then used to train some discriminative model, which we refer to as the end model. This programmatic weak supervision approach can utilize sources ranging from heuristic rules to other models, and in this way can also be viewed as a pragmatic and flexible form of multi-source transfer learning.

Figure 1: A schematic of the MeTaL pipeline. To generate training data for an end model, such as a multi-task model as in our experiments, the user inputs: a task graph $G_{task}$ defining the relationships between task labels $Y_1, \ldots, Y_t$; a set of unlabeled data points $X$; a set of multi-task weak supervision sources $s_i$, each of which outputs a vector $\lambda_i$ of task labels for $X$; and the dependency structure between these sources, $G_{source}$. We train a label model to learn the accuracies of the sources, outputting a vector of probabilistic training labels $\tilde{\mathbf{Y}}$ for training the end model.

Figure 2: An example fine-grained entity classification problem, where weak supervision sources label three sub-tasks of different granularities: (i) Person vs. Organization, (ii) Doctor vs. Lawyer (or N/A), (iii) Hospital vs. Office (or N/A). The example weak supervision sources use a pattern heuristic and a dictionary lookup, respectively.

In our experience with users from science and industry, we have found it critical to utilize all available sources of weak supervision for complex modeling problems, including ones which label at multiple levels of granularity. However, this diverse, multi-granular weak supervision does not easily fit into existing paradigms. We propose a formulation where each weak supervision source labels some sub-task of a problem, which we refer to as the multi-task weak supervision setting. We consider an example:

Example 1. A developer wants to train a fine-grained Named Entity Recognition (NER) model to classify mentions of entities in the news (Figure 2). She has a multitude of available weak supervision sources which she believes have relevant signal for her problem, for example, pattern matchers, dictionaries, and pre-trained generic NER taggers. However, it is unclear how to properly use and combine them: some label phrases coarsely as PERSON versus ORGANIZATION, while others classify specific fine-grained types of people or organizations, with a range of unknown accuracies.
In our framework, she can represent these sources as labeling tasks of different granularities, e.g., $\mathcal{Y}_1 = \{\text{Person}, \text{Org}\}$, $\mathcal{Y}_2 = \{\text{Doctor}, \text{Lawyer}, \text{N/A}\}$, $\mathcal{Y}_3 = \{\text{Hospital}, \text{Office}, \text{N/A}\}$, where the label N/A applies, for example, when the type-of-person task is applied to an organization.

In our proposed multi-task supervision setting, the user specifies a set of structurally related tasks and then provides a set of weak supervision sources: user-defined functions that either label each data point or abstain, for each task, and that may have some user-specified dependency structure. These sources can be arbitrary black-box functions, and can thus subsume a range of weak supervision approaches relevant to both text and other data modalities, including pattern-based heuristics, distant supervision (Mintz et al. 2009), crowd labels, other weak or biased classifiers, declarative rules over unsupervised feature extractors (Varma et al. 2017), and more. Our goal is to estimate the unknown accuracies of these sources, combine their outputs, and use the resulting labels to train an end model.

## 4 Modeling Multi-Task Weak Supervision

The core technical challenge of the multi-task weak supervision setting is recovering the unknown accuracies of weak supervision sources given their dependency structure and a schema of the tasks they label, but without any ground-truth labeled data. We define a new algorithm for recovering the accuracies in this setting using a matrix completion-style optimization objective. We establish conditions under which the resulting estimator returns a unique solution. We then analyze the sample complexity of our estimator, characterizing its scaling with respect to the amount of unlabeled data, as well as the task schema and dependency structure, and show how the estimation error affects the generalization performance of the end model we aim to train. Finally, we highlight how our approach handles abstentions and unipolar sources, two critical scenarios in the weak supervision setting.

### A Multi-Task Weak Supervision Estimator

**Problem Setup** Let $X \in \mathcal{X}$ be a data point and $\mathbf{Y} = [Y_1, Y_2, \ldots, Y_t]^T$ be a vector of categorical task labels, $Y_i \in \{1, \ldots, k_i\}$, corresponding to $t$ tasks, where $(X, \mathbf{Y})$ is drawn i.i.d. from a distribution $\mathcal{D}$. (The variables introduced throughout this section are summarized in a glossary in the Appendix, which can be accessed at https://arxiv.org/abs/1810.02840.) The user provides a specification of how these tasks relate to each other; we denote this schema as the task structure $G_{task}$. The task structure expresses logical relationships between tasks, defining a feasible set of label vectors $\mathcal{Y}$, such that $\mathbf{Y} \in \mathcal{Y}$. For example, Figure 2 illustrates a hierarchical task structure over three tasks of different granularities pertaining to a fine-grained entity classification problem. Here, the tasks are related by logical subsumption relationships: for example, if $Y_2 = \text{DOCTOR}$, this implies that $Y_1 = \text{PERSON}$, and that $Y_3 = \text{N/A}$, since the task label $Y_3$ concerns types of organizations, which is inapplicable to persons. Thus, in this task structure, $\mathbf{Y} = [\text{PERSON}, \text{DOCTOR}, \text{N/A}]^T$ is in $\mathcal{Y}$, while $\mathbf{Y} = [\text{PERSON}, \text{N/A}, \text{HOSPITAL}]^T$ is not (see the sketch below). While task structures are often simple to define, as in the previous example, or are explicitly defined by existing resources such as ontologies or graphs, we note that if no task structure is provided, our approach becomes equivalent to modeling the $t$ tasks separately, a baseline we consider in the experiments.

In our setting, rather than observing the true label $\mathbf{Y}$, we have access to $m$ multi-task weak supervision sources $s_i \in S$ which emit label vectors $\lambda_i$ that contain labels for some subset of the $t$ tasks.
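To make the task structure and feasible set concrete, here is a minimal sketch of how the Figure 2 hierarchy might be encoded; the names and integer encoding are our illustrative choices, not part of the released API.

```python
from itertools import product

# Integer encodings for each sub-task's label set (illustrative values).
PERSON, ORG = 1, 2            # Y1
DOCTOR, LAWYER = 1, 2         # Y2 (person types)
HOSPITAL, OFFICE = 1, 2       # Y3 (organization types)
NA = 3                        # "not applicable" for Y2 and Y3

def is_feasible(y1, y2, y3):
    """Check a label vector against the subsumption constraints of Figure 2."""
    if y1 == PERSON:
        return y2 != NA and y3 == NA   # persons have a person type, no org type
    return y2 == NA and y3 != NA       # organizations: the reverse

# Enumerate the feasible set Y: only 4 of the 18 raw combinations survive.
feasible_set = [y for y in product((PERSON, ORG),
                                   (DOCTOR, LAWYER, NA),
                                   (HOSPITAL, OFFICE, NA))
                if is_feasible(*y)]
assert (PERSON, DOCTOR, NA) in feasible_set
assert (PERSON, NA, HOSPITAL) not in feasible_set
```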
Let 0 denote a null or abstaining label, and let the coverage set $\tau_i \subseteq \{1, \ldots, t\}$ be the fixed set of tasks for which the $i$th source emits non-zero labels, such that $\lambda_i \in \mathcal{Y}_{\tau_i}$. For convenience, we let $\tau_0 = \{1, \ldots, t\}$, so that $\mathcal{Y}_{\tau_0} = \mathcal{Y}$. For example, a source from our previous example might have a coverage set $\tau_i = \{1, 3\}$, emitting coarse-grained labels such as $\lambda_i = [\text{PERSON}, 0, \text{N/A}]^T$. Note that sources often label multiple tasks implicitly due to the constraints of the task structure; for example, a source that labels types of people ($Y_2$) also implicitly labels people vs. organizations ($Y_1 = \text{PERSON}$) and types of organizations ($Y_3 = \text{N/A}$). Thus, sources tailored to different tasks still have agreements and disagreements; we use this additional cross-task signal in our approach.

The user also provides the conditional dependency structure of the sources as a graph $G_{source} = (V, E)$, where $V = \{\mathbf{Y}, \lambda_1, \lambda_2, \ldots, \lambda_m\}$ (Figure 3). Specifically, if $(\lambda_i, \lambda_j)$ is not an edge in $G_{source}$, this means that $\lambda_i$ is independent of $\lambda_j$ conditioned on $\mathbf{Y}$ and the other source labels. Note that if $G_{source}$ is unknown, it can be estimated using statistical techniques such as (Bach et al. 2017). Importantly, we do not know anything about the strengths of the correlations in $G_{source}$, or the sources' accuracies.

Our overall goal is to apply the set of weak supervision sources $S = \{s_1, \ldots, s_m\}$ to an unlabeled dataset $X_U$ consisting of $n$ data points, then use the resulting weakly labeled training set to supervise an end model $f_w : \mathcal{X} \to \mathcal{Y}$ (Figure 1). This weakly labeled training set will contain overlapping and conflicting labels, from sources with unknown accuracies and correlations. To handle this, we will learn a label model $P_{\mu}(\mathbf{Y} \mid \boldsymbol{\lambda})$, parameterized by a vector of source correlations and accuracies $\mu$, which for each data point $X$ takes as input the noisy labels $\boldsymbol{\lambda} = \{\lambda_1, \ldots, \lambda_m\}$ and outputs a single probabilistic label vector $\tilde{\mathbf{Y}}$. Succinctly, given a user-provided tuple $(X_U, S, G_{source}, G_{task})$, our key technical challenge is recovering the parameters $\mu$ without access to ground-truth labels $\mathbf{Y}$.

Figure 3: An example of a weak supervision source dependency graph $G_{source}$ (left) and its junction tree representation (right), where $\mathbf{Y}$ is a vector-valued random variable with a feasible set of values, $\mathbf{Y} \in \mathcal{Y}$. Here, the outputs of sources 1 and 2 are modeled as dependent conditioned on $\mathbf{Y}$. This results in a junction tree with singleton separator sets, $\{\mathbf{Y}\}$. Here, the observable cliques are $O = \{\lambda_1, \lambda_2, \lambda_3, \lambda_4, \{\lambda_1, \lambda_2\}\} \subset \mathcal{C}$.

**Modeling Multi-Task Sources** To learn a label model over multi-task sources, we introduce sufficient statistics over the random variables in $G_{source}$. Let $\mathcal{C}$ be the set of cliques in $G_{source}$, and define an indicator random variable for the event of a clique $C \in \mathcal{C}$ taking on a set of values $y_C$:

$$\psi(C, y_C) = \mathbb{1}\left\{ \textstyle\bigcap_{i \in C} V_i = (y_C)_i \right\},$$

where $(y_C)_i \in \mathcal{Y}_{\tau_i}$. We define $\psi(C) \in \{0, 1\}^{\prod_{i \in C}(|\mathcal{Y}_{\tau_i}| - 1)}$ as the vector of indicator random variables for all combinations of all but one of the labels emitted by each variable in clique $C$ (thereby defining a minimal set of statistics), and define $\psi(\mathcal{C}')$ accordingly for any set of cliques $\mathcal{C}' \subseteq \mathcal{C}$. Then $\mu = \mathbb{E}[\psi(\mathcal{C})]$ is the vector of sufficient statistics for the label model we want to learn.

We work with two simplifying conditions in this section.
First, we consider the setting where $G_{source}$ is triangulated and has a junction tree representation with singleton separator sets. If this is not the case, edges can always be added to $G_{source}$ to make this setting hold; alternatively, we describe how our approach can directly handle non-singleton separator sets in the Appendix. Second, we use a simplified class-conditional model of the noisy labeling process, where we learn one accuracy parameter for each label value $\lambda_i$ that each source $s_i$ emits. This is equivalent to assuming that a source may have a different accuracy on each different class, but that if it emits a certain label incorrectly, it does so uniformly over the different true labels $\mathbf{Y}$. This is a more expressive model than the commonly considered one in which each source is modeled by a single accuracy parameter, e.g., in (Dawid and Skene 1979; Ratner et al. 2016), and in particular allows us to capture the unipolar setting considered later on.

**Our Approach** The chief technical difficulty in our problem is that we do not observe $\mathbf{Y}$. We overcome this by analyzing the covariance matrix of an observable subset of the cliques in $G_{source}$, leading to a matrix completion-style approach for recovering $\mu$. We leverage two pieces of information: (i) the observability of part of $\mathrm{Cov}[\psi(\mathcal{C})]$, and (ii) a result from (Loh and Wainwright 2012) which states that the inverse covariance matrix $\mathrm{Cov}[\psi(\mathcal{C})]^{-1}$ is structured according to $G_{source}$; i.e., if there is no edge between $\lambda_i$ and $\lambda_j$ in $G_{source}$, then the corresponding entries are 0.

We start by considering two disjoint subsets of $\mathcal{C}$: the set of observable cliques, $O \subseteq \mathcal{C}$, i.e., those cliques not containing $\mathbf{Y}$, and the separator set cliques of the junction tree, $\mathcal{S} \subseteq \mathcal{C}$. In the setting we consider here, $\mathcal{S} = \{\mathbf{Y}\}$ (see Figure 3). We then write the covariance matrix of the indicator variables for $O \cup \mathcal{S}$, $\mathrm{Cov}[\psi(O \cup \mathcal{S})]$, in block form, similar to (Chandrasekaran, Parrilo, and Willsky 2010), as

$$\mathrm{Cov}[\psi(O \cup \mathcal{S})] \equiv \Sigma = \begin{bmatrix} \Sigma_O & \Sigma_{OS} \\ \Sigma_{OS}^T & \Sigma_S \end{bmatrix},$$

and similarly define its inverse:

$$K = \Sigma^{-1} = \begin{bmatrix} K_O & K_{OS} \\ K_{OS}^T & K_S \end{bmatrix}.$$

Here, $\Sigma_O = \mathrm{Cov}[\psi(O)] \in \mathbb{R}^{d_O \times d_O}$ is the observable block of $\Sigma$, where $d_O = \sum_{C \in O} \prod_{i \in C}(|\mathcal{Y}_{\tau_i}| - 1)$. Next, $\Sigma_{OS} = \mathrm{Cov}[\psi(O), \psi(\mathcal{S})]$ is the unobserved block, which is a function of $\mu$, the label model parameters that we wish to recover. Finally, $\Sigma_S = \mathrm{Cov}[\psi(\mathcal{S})] = \mathrm{Cov}[\psi(\mathbf{Y})]$ is a function of the class balance $P(\mathbf{Y})$.

We make two observations about $\Sigma_S$. First, while the full form of $\Sigma_S$ is the covariance of the $|\mathcal{Y}| - 1$ indicator variables for each individual value of $\mathbf{Y}$ but one, given our simplified class-conditional label model we in fact only need a single indicator variable for $\mathbf{Y}$ (see Appendix); thus, $\Sigma_S$ is a scalar. Second, $\Sigma_S$ is a function of the class balance $P(\mathbf{Y})$, which we assume is either known or has been estimated according to the unsupervised approach we detail in the Appendix.

Thus, given $\Sigma_O$ and $\Sigma_S$, our goal is to recover the vector $\Sigma_{OS}$, from which we can recover $\mu$. Applying the block matrix inversion lemma, we have

$$K_O = \Sigma_O^{-1} + c\, \Sigma_O^{-1} \Sigma_{OS} \Sigma_{OS}^T \Sigma_O^{-1}, \tag{3}$$

where $c = \left(\Sigma_S - \Sigma_{OS}^T \Sigma_O^{-1} \Sigma_{OS}\right)^{-1} \in \mathbb{R}_+$. Let $z = \sqrt{c}\, \Sigma_O^{-1} \Sigma_{OS}$; we can then express (3) as

$$K_O = \Sigma_O^{-1} + z z^T. \tag{4}$$

The right-hand side of (4) consists of an empirically observable term, $\Sigma_O^{-1}$, and a rank-one term, $z z^T$, which we can solve for to directly recover $\mu$. For the left-hand side, we apply an extension of Corollary 1 from (Loh and Wainwright 2012) (see Appendix) to conclude that $K_O$ has graph-structured sparsity, i.e., it has zeros determined by the structure of dependencies between the sources in $G_{source}$.
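As a sanity check on the derivation of (3) and (4), the short sketch below verifies the rank-one identity numerically on a random positive-definite covariance with a scalar $\Sigma_S$ block. The synthetic numbers are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_O = 5  # dimension of the observable block

# A random positive-definite joint covariance; the last row/column plays Sigma_S.
A = rng.normal(size=(d_O + 1, d_O + 1))
Sigma = A @ A.T + (d_O + 1) * np.eye(d_O + 1)
Sigma_O = Sigma[:d_O, :d_O]
Sigma_OS = Sigma[:d_O, -1:]          # (d_O, 1) column vector
Sigma_S = Sigma[-1, -1]              # scalar, since S = {Y} here

# K_O is the upper-left block of the inverse joint covariance.
K_O = np.linalg.inv(Sigma)[:d_O, :d_O]

# Equation (4): K_O = Sigma_O^{-1} + z z^T, with z = sqrt(c) Sigma_O^{-1} Sigma_OS
# and c the positive scalar inverse Schur complement from equation (3).
Sigma_O_inv = np.linalg.inv(Sigma_O)
c = 1.0 / (Sigma_S - Sigma_OS.T @ Sigma_O_inv @ Sigma_OS)
z = np.sqrt(c) * (Sigma_O_inv @ Sigma_OS)
assert np.allclose(K_O, Sigma_O_inv + z @ z.T)  # the rank-one identity holds
```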
Together, the observability of $\Sigma_O$ and the graph-structured sparsity of $K_O$ suggest an algorithmic approach: estimate $z$ via a matrix completion problem, and thereby recover an estimate of $\mu$ (Algorithm 1). In more detail, let $\Omega$ be the set of indices $(i, j)$ where $(K_O)_{i,j} = 0$, as determined by $G_{source}$. This yields a system of equations,

$$0 = (\Sigma_O^{-1})_{i,j} + (z z^T)_{i,j} \quad \text{for } (i, j) \in \Omega, \tag{5}$$

which is now a matrix completion problem. Define $\|A\|_\Omega$ as the Frobenius norm of $A$ with entries not in $\Omega$ set to zero; then we can rewrite (5) as $\|\Sigma_O^{-1} + z z^T\|_\Omega = 0$. We solve this equation to estimate $z$, thereby recovering $\Sigma_{OS}$, from which we can directly recover the label model parameters $\mu$ algebraically.

**Checking for Identifiability** A first question is: which dependency structures $G_{source}$ lead to unique solutions for $\mu$? This question presents a stumbling block for users, who might attempt to use non-identifiable sets of correlated weak supervision sources. We provide a simple, testable condition for identifiability. Let $G_{inv}$ be the inverse graph of $G_{source}$; note that $\Omega$ is the edge set of $G_{inv}$ expanded to include all indicator random variables $\psi(\mathcal{C})$. Then, let $M_\Omega$ be a matrix with dimensions $|\Omega| \times d_O$ such that each row of $M_\Omega$ corresponds to a pair $(i, j) \in \Omega$, with 1's in positions $i$ and $j$ and 0's elsewhere. Taking the log of the squared entries of (5), we get a system of linear equations $M_\Omega \ell = q_\Omega$, where $\ell_i = \log(z_i^2)$ and $q_{(i,j)} = \log\left( ((\Sigma_O^{-1})_{i,j})^2 \right)$. Assuming we can solve this system (which we can always ensure by adding sources; see Appendix), we can uniquely recover the $z_i^2$, meaning our model is identifiable up to sign.

Given estimates of the $z_i^2$, we can see from (5) that the sign of a single $z_i$ determines the signs of all other $z_j$ reachable from $z_i$ in $G_{inv}$. Thus, to ensure a unique solution, we only need to pick a sign for each connected component of $G_{inv}$. In the case where the sources are assumed to be independent, e.g., (Dalvi et al. 2013b; Zhang et al. 2014; Dawid and Skene 1979), it suffices to assume that the sources are on average non-adversarial, i.e., to select the signs of the $z_i$ that lead to higher average accuracies of the sources. Even a single source that is conditionally independent of all the other sources will cause $G_{inv}$ to be fully connected, meaning we can use this symmetry-breaking assumption in the majority of cases, even with correlated sources. Otherwise, a sufficient condition is the standard one of assuming non-adversarial sources, i.e., that all sources have greater-than-random accuracy.

**Source Accuracy Estimation Algorithm** Now that we know when a set of sources with correlation structure $G_{source}$ is identifiable, yielding a unique $z$, we can estimate the accuracies $\mu$ using Algorithm 1. We also use the function ExpandTied, a simple algebraic expansion of tied parameters according to the simplified class-conditional model used in this section; see Appendix for details.

Algorithm 1: Source Accuracy Estimation for Multi-Task Weak Supervision

Input: observed labeling rates $\hat{\mathbb{E}}[\psi(O)]$ and covariance $\hat{\Sigma}_O$; class balance $\hat{\mathbb{E}}[\psi(\mathbf{Y})]$ and variance $\Sigma_S$; correlation sparsity structure $\Omega$

1. $\hat{z} \leftarrow \mathrm{argmin}_z\, \|\hat{\Sigma}_O^{-1} + z z^T\|_\Omega$
2. $\hat{c} \leftarrow \Sigma_S^{-1}\,(1 + \hat{z}^T \hat{\Sigma}_O \hat{z})$
3. $\hat{\Sigma}_{OS} \leftarrow \hat{\Sigma}_O \hat{z} / \sqrt{\hat{c}}$
4. $\hat{\mu} \leftarrow \hat{\Sigma}_{OS} + \hat{\mathbb{E}}[\psi(O)]\, \hat{\mathbb{E}}[\psi(\mathbf{Y})]$
5. return ExpandTied($\hat{\mu}$)
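As noted in the introduction, the completion step in Algorithm 1 can be solved directly with gradient-based methods. Below is a minimal PyTorch sketch of that step and of the remaining algebraic recovery; the function names, masking convention, and optimizer settings are our illustrative choices, and the released implementation (github.com/HazyResearch/metal) differs in its details. The sketch also ignores the sign-recovery step discussed above, which fixes the sign of $z$ per connected component of $G_{inv}$.

```python
import torch

def estimate_z(Sigma_O_hat, Omega_mask, steps=2000, lr=0.01):
    """Approximately solve z_hat = argmin_z || Sigma_O^{-1} + z z^T ||_Omega.

    Sigma_O_hat: (d_O, d_O) empirical covariance of the observable cliques.
    Omega_mask:  (d_O, d_O) 0/1 float mask selecting the entries in Omega,
                 i.e. the entries of K_O forced to zero by G_source.
    """
    Sigma_O_inv = torch.linalg.inv(Sigma_O_hat)
    z = torch.randn(Sigma_O_hat.shape[0], 1, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Squared Frobenius norm restricted to Omega.
        loss = (((Sigma_O_inv + z @ z.T) * Omega_mask) ** 2).sum()
        loss.backward()
        opt.step()
    return z.detach()

def recover_mu(z_hat, Sigma_O_hat, Sigma_S, E_psi_O, E_psi_Y):
    """Remaining algebraic steps of Algorithm 1 (before ExpandTied)."""
    c_hat = (1 + z_hat.T @ Sigma_O_hat @ z_hat) / Sigma_S
    Sigma_OS_hat = Sigma_O_hat @ z_hat / torch.sqrt(c_hat)
    return Sigma_OS_hat + E_psi_O * E_psi_Y
```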
In Figure 4, we plot the performance of our algorithm on synthetic data, showing its scaling with the number of unlabeled data points $n$ and with the density of pairwise dependencies in $G_{source}$, as well as its runtime performance compared to a prior Gibbs-sampling-based approach. Next, we theoretically analyze the scaling of the error $\|\hat{\mu} - \mu^*\|$.

Figure 4: (Left) Estimation error $\|\hat{\mu} - \mu^*\|$ decreases with increasing $n$. (Middle) Given $G_{source}$, our model successfully recovers the source accuracies even with many pairwise dependencies among sources, where a naive conditionally independent model fails. (Right) The runtime of MeTaL is independent of $n$ after an initial matrix multiply, and can thus be multiple orders of magnitude faster than Gibbs-sampling-based approaches (Ratner et al. 2016).

### Theoretical Analysis: Scaling with Diverse Multi-Task Supervision

Our ultimate goal is to train an end model using the source labels, denoised and combined by the label model $\hat{\mu}$ we have estimated. We connect the generalization error of this end model to the estimation error of Algorithm 1, ultimately showing that the generalization error scales as $n^{-\frac{1}{2}}$, where $n$ is the number of unlabeled data points. This key result establishes the same asymptotic scaling as traditionally supervised learning methods, but with respect to unlabeled data points.

Let $P_{\hat{\mu}}(\tilde{\mathbf{Y}} \mid \boldsymbol{\lambda})$ be the probabilistic label (i.e., distribution) predicted by our label model given the source labels $\boldsymbol{\lambda}$ as input, computed using the estimated $\hat{\mu}$. We then train an end multi-task discriminative model $f_w : \mathcal{X} \to \mathcal{Y}$, parameterized by $w$, by minimizing the expected loss with respect to the label model over $n$ unlabeled data points. Let $l(w, X, \mathbf{Y}) = \frac{1}{t} \sum_{s=1}^{t} l_s(w, X, Y_s)$ be a bounded multi-task loss function such that, without loss of generality, $l(w, X, \mathbf{Y}) \le 1$; then we minimize the empirical noise-aware loss:

$$\hat{w} = \mathrm{argmin}_w\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\tilde{\mathbf{Y}} \sim P_{\hat{\mu}}(\cdot \mid \boldsymbol{\lambda})} \left[ l(w, X_i, \tilde{\mathbf{Y}}) \right], \tag{6}$$

and let $\tilde{w}$ be the $w$ that minimizes the true noise-aware loss. This minimization can be performed by standard methods and is not the focus of our paper; let the solution $\hat{w}$ satisfy $\mathbb{E}[\|\hat{w} - \tilde{w}\|^2] \le \gamma$.
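Concretely, for a single task with a cross-entropy loss, the inner expectation in (6) reduces to a soft cross-entropy against the label model's probabilistic labels. The following is a minimal PyTorch sketch of this objective; it is our illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def noise_aware_loss(logits, soft_labels):
    """E_{Y ~ P_mu_hat(.|lambda)} [ cross_entropy(logits, Y) ] for one task.

    logits:      (n, k) end-model scores for k classes.
    soft_labels: (n, k) rows of P_mu_hat(Y | lambda), one distribution per point.
    """
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_labels * log_probs).sum(dim=1).mean()

# For t tasks, average the per-task losses, matching l(w, X, Y) = (1/t) sum_s l_s.
```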
We make several assumptions, following (Ratner et al. 2016): (1) that for some label model parameters $\mu^*$, sampling $(\boldsymbol{\lambda}, \mathbf{Y}) \sim P_{\mu^*}(\cdot)$ is the same as sampling from the true distribution, $(\boldsymbol{\lambda}, \mathbf{Y}) \sim \mathcal{D}$; and (2) that the task labels $Y_s$ are independent of the features of the end model given $\boldsymbol{\lambda}$ sampled from $P_{\mu^*}(\cdot)$; that is, the output of the optimal label model provides sufficient information to discern the true label. Then we have the following result:

Theorem 1. Let $\tilde{w}$ minimize the expected noise-aware loss, using weak supervision source parameters $\hat{\mu}$ estimated with Algorithm 1. Let $\hat{w}$ minimize the empirical noise-aware loss with $\mathbb{E}[\|\hat{w} - \tilde{w}\|^2] \le \gamma$, let $w^*$ minimize the expected loss $l(w, X, \mathbf{Y})$, and let the assumptions above hold. Then the generalization error is bounded by

$$\mathbb{E}\left[ l(\hat{w}, X, \mathbf{Y}) - l(w^*, X, \mathbf{Y}) \right] \le \gamma + 4|\mathcal{Y}|\, \|\hat{\mu} - \mu^*\|.$$

Thus, to control the generalization error, we must control $\|\hat{\mu} - \mu^*\|$, which we do in Theorem 2:

Theorem 2. Let $\hat{\mu}$ be an estimate of $\mu^*$ produced by Algorithm 1 run over $n$ unlabeled data points. Let $a := \left( \left( \frac{d_O}{\Sigma_S} \right)^2 \lambda_{max}(K_O) \right)^{\frac{1}{2}}$ and $b := \frac{\|\Sigma_O^{-1}\|^2}{(\Sigma_O^{-1})_{min}}$. Then we have

$$\mathbb{E}\left[ \|\hat{\mu} - \mu^*\| \right] \le \frac{16(|\mathcal{Y}| - 1)\, d_O^2}{\sqrt{n}}\, a b\, \sigma_{max}(M_\Omega^+) \left( \sqrt{d_O}\, a\, \lambda_{min}^{-1}(\Sigma_O) + 1 \right) \left( \kappa(\Sigma_O) + \lambda_{min}^{-1}(\Sigma_O) \right).$$

**Interpreting the Bound** We briefly explain the key terms controlling the bound in Theorem 2; more detail is found in the Appendix. Our primary result is that the estimation error scales as $n^{-\frac{1}{2}}$. Next, $\sigma_{max}(M_\Omega^+)$, the largest singular value of the pseudoinverse $M_\Omega^+$, has a deep connection to the density of the graph $G_{inv}$: the smaller this quantity, the more information we have about $G_{inv}$, and the easier it is to estimate the accuracies. Next, $\lambda_{min}(\Sigma_O)$, the smallest eigenvalue of the observed covariance matrix, reflects the conditioning of $\Sigma_O$; better conditioning yields easier estimation, and is roughly determined by how far from random guessing the worst weak supervision source is, as well as by how conditionally independent the sources are. $\lambda_{max}(K_O)$, the largest eigenvalue of the upper-left block of the inverse covariance matrix, similarly reflects the overall conditioning of $\Sigma$. Finally, $(\Sigma_O^{-1})_{min}$, the smallest entry of the inverse observed covariance matrix, reflects the smallest non-zero correlation between source accuracies; distinguishing between small correlations and independent sources requires more samples.

### Extensions: Abstentions and Unipolar Sources

We briefly highlight two extensions handled by our approach which we have found empirically critical: handling abstentions and modeling unipolar sources.

**Handling Abstentions** One fundamental aspect of the weak supervision setting is that sources may abstain from labeling a data point entirely; that is, they may have incomplete and differing coverage (Ratner et al. 2018; Dalvi et al. 2013b). We can easily handle this case by extending the coverage ranges $\mathcal{Y}_{\tau_i}$ of the sources to include the all-zeros vector $\mathbf{0}$, and we do so in the experiments.

**Handling Unipolar Sources** Finally, we highlight the fact that our approach models class-conditional source accuracies, motivated in particular by a case we have frequently observed in practice: unipolar weak supervision sources, i.e., sources that each label only a single class or abstain. In practice, we find that users most commonly write such unipolar sources; for example, a common template for a heuristic-based weak supervision source over text looks for a specific pattern, emits a specific label if the pattern is present, and abstains otherwise. Compared to prior approaches that did not model class-conditional accuracies, e.g., (Ratner et al. 2016), we show in our experiments that our class-conditional modeling approach yields an improvement of 2.8 points in accuracy.

## 5 Experiments

We validate our approach on three fine-grained classification problems (entity classification, relation classification, and document classification) where weak supervision sources are available at both coarser and finer-grained levels (e.g., as in Figure 2). We evaluate the predictive accuracy of end models supervised with training data produced by several approaches, finding that our approach outperforms traditional hand-labeled supervision by 20.2 points, a baseline majority-vote weak supervision approach by 6.8 points, and a prior weak supervision denoising approach (Ratner et al. 2016) that is not multi-task-aware by 4.1 points.

**Datasets** Each dataset consists of a large (3k-63k) amount of unlabeled training data and a small (200-350) amount of labeled data, which we refer to as the development set and use (a) for a traditional supervision baseline and (b) for hyperparameter tuning of the end model (see Appendix). The average number of weak supervision sources per task was 13, with sources expressed as Python functions averaging 4 lines of code and comprising a mix of pattern-matching heuristics, external knowledge base or dictionary lookups, and pre-trained models.
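For instance, a unipolar, abstaining source in this style for the entity task of Figure 2 might look like the sketch below. The constants and the preprocessed mention attribute are hypothetical, shown only to convey the flavor of such functions.

```python
DOCTOR = 1   # label encoding as in the earlier task-structure sketch
ABSTAIN = 0  # the null label from Section 4
DOCTOR_TITLES = {"dr.", "m.d.", "physician", "surgeon"}

def lf_doctor_title(x):
    """Vote DOCTOR on task Y2 if a doctor title appears near the mention; else abstain."""
    window = x.text_around_mention.lower()  # hypothetical preprocessed attribute
    return DOCTOR if any(t in window for t in DOCTOR_TITLES) else ABSTAIN
```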
In all three cases, we choose the decomposition into sub-tasks so as to align with weak supervision sources that are either available or natural to express.

**Named Entity Recognition (NER):** We represent a fine-grained named entity recognition problem, tagging entity mentions in text documents, as a hierarchy of three sub-tasks over the OntoNotes dataset (Weischedel et al. 2011): $\mathcal{Y}_1 = \{\text{Person}, \text{Organization}\}$, $\mathcal{Y}_2 = \{\text{Businessperson}, \text{Other Person}, \text{N/A}\}$, $\mathcal{Y}_3 = \{\text{Company}, \text{Other Org}, \text{N/A}\}$, where again we use N/A to represent "not applicable."

**Relation Extraction (RE):** We represent a relation extraction problem, classifying entity-entity relation mentions in text documents, as a hierarchy of six sub-tasks which concern labeling the subject, object, or subject-object pair of a candidate relation in the TACRED dataset (Zhang et al. 2017b). For example, we might label a relation as having a Person subject, a Location object, and a Place-of-Residence relation type.

**Medical Document Classification (Doc):** We represent a radiology report triaging (i.e., document classification) problem from the OpenI dataset (National Institutes of Health 2017) as a hierarchy of three sub-tasks: $\mathcal{Y}_1 = \{\text{Acute}, \text{Non-Acute}\}$, $\mathcal{Y}_2 = \{\text{Urgent}, \text{Emergent}, \text{N/A}\}$, $\mathcal{Y}_3 = \{\text{Normal}, \text{Non-Urgent}, \text{N/A}\}$.

**End Model Protocol** Our goal was to test the performance of a basic multi-task end model using training labels produced by the various approaches. We use an architecture consisting of a shared bidirectional LSTM input layer with pre-trained embeddings, shared linear intermediate layers, and a separate final linear layer ("task head") for each task. Hyperparameters were selected with an initial search for each application (see Appendix), then fixed.

**Core Validation** We compare the accuracy of the end multi-task model trained with labels from our approach versus those from three baseline approaches (Table 1):

- Traditional Supervision [Gold (Dev)]: We train the end model using the small hand-labeled development set.
- Hierarchical Majority Vote [MV]: We use a hierarchical majority vote of the weak supervision source labels: for each data point, for each task, we take the majority vote and proceed down the task tree accordingly. This procedure can be thought of as a hard decision tree, or a cascade of if-then statements as in a rule-based approach (see the sketch below).
- Data Programming [DP]: We model each task separately using the data programming approach for denoising weak supervision (Ratner et al. 2018).

In all settings, we used the same end model architecture as described above. Note that while we choose to model these problems as consisting of multiple sub-tasks, we evaluate with respect to the broad primary task of fine-grained classification (for subtask-specific scores, see Appendix).

| | NER | RE | Doc | Average |
|---|---|---|---|---|
| Gold (Dev) | 63.7 ± 2.1 | 28.4 ± 2.3 | 62.7 ± 4.5 | 51.6 |
| MV | 76.9 ± 2.6 | 43.9 ± 2.6 | 74.2 ± 1.2 | 65.0 |
| DP (Ratner et al. 2016) | 78.4 ± 1.2 | 49.0 ± 2.7 | 75.8 ± 0.9 | 67.7 |
| MeTaL | 82.2 ± 0.8 | 56.7 ± 2.1 | 76.6 ± 0.4 | 71.8 |

Table 1: Performance comparison of different supervision approaches. We compare the micro accuracy (averaged over 10 trials, with 95% confidence intervals) of an end multi-task model trained using the training labels from the hand-labeled development set (Gold Dev), hierarchical majority vote (MV), data programming (DP), and our approach (MeTaL).

Figure 5: In the OntoNotes dataset, end model accuracy scales with the amount of available unlabeled data (micro-average accuracy vs. $n$ in thousands, log scale).
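The hierarchical majority vote baseline described above might be implemented as in the following sketch; this is our illustration over the Figure 2 task structure, and the tie-breaking and empty-vote defaults are arbitrary choices here.

```python
from collections import Counter

PERSON, ORG, NA = 1, 2, 3   # label encodings as in the earlier task-structure sketch
DOCTOR, HOSPITAL = 1, 1     # default fine-grained labels, arbitrary for this sketch

def majority(votes, default):
    """Most common non-abstain vote, or a default when no source voted."""
    return Counter(votes).most_common(1)[0][0] if votes else default

def hierarchical_majority_vote(votes_y1, votes_y2, votes_y3):
    """Vote on Y1 first, then descend into only the relevant branch of the task tree.

    votes_y*: lists of non-abstain votes from the sources for each sub-task.
    Returns a feasible label vector (y1, y2, y3) per the Figure 2 task structure.
    """
    y1 = majority(votes_y1, PERSON)
    if y1 == PERSON:
        return (PERSON, majority(votes_y2, DOCTOR), NA)
    return (ORG, NA, majority(votes_y3, HOSPITAL))
```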
We observe in Table 1 that our approach of leveraging multi-granularity weak supervision leads to large gains: 20.2 points over traditional supervision with the development set, 6.8 points over hierarchical majority vote, and 4.1 points over data programming.

**Ablations** We examine individual factors:

- Unipolar Correction: Modeling unipolar sources (Section 4), which we find to be especially common when fine-grained tasks are involved, leads to an average gain of 2.8 points of accuracy in MeTaL's performance.
- Joint Task Modeling: Next, we use our algorithm to estimate the accuracies of sources for each task separately, to observe the empirical impact of modeling the multi-task setting jointly as proposed. We see average gains of 1.3 points in accuracy (see Appendix).
- End Model Generalization: Though not possible in many settings, in our experiments we can directly apply the label model to make predictions. In Figure 6, we show that the end model improves performance by an average of 3.4 points in accuracy, validating that the trained models do indeed learn to generalize beyond the provided weak supervision. Moreover, the largest generalization gain of 7 points in accuracy came from the dataset with the most available unlabeled data (n = 63k), demonstrating scaling consistent with the predictions of our theory (Figure 5). The ability to leverage additional unlabeled data and more sophisticated end models are key advantages of the weak supervision approach in practice.

| | # Train | LM | EM | Gain |
|---|---|---|---|---|
| NER | 62,547 | 75.2 | 82.2 | 7.0 |
| RE | 9,090 | 55.3 | 57.4 | 2.1 |
| Doc | 2,630 | 75.6 | 76.6 | 1.0 |

Figure 6: Using the label model (LM) predictions directly versus using an end model trained on them (EM).

## 6 Conclusion

We presented MeTaL, a framework for training models with weak supervision from diverse, multi-task sources having different granularities, accuracies, and correlations. We tackle the core challenge of recovering the unknown source accuracies via a scalable matrix completion-style algorithm, introduce theoretical bounds characterizing the key scaling with respect to unlabeled data, and demonstrate empirical gains on real-world datasets. In future work, we hope to learn the task relationship structure and to cover a broader range of settings where labeled training data is a bottleneck.

## Acknowledgements

We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M) and FA86501827865 (SDH), NIH under No. N000141712266 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and DGE-114747 (NSF GRF), ONR under No. N000141712266 (Unifying Weak Supervision), the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, Google, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, the Stanford Interdisciplinary Graduate and Bio-X fellowships, the Intelligence Community Postdoctoral Fellowship, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, Facebook, Google, Ant Financial, NEC, SAP, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

## References

Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15(1):2773-2832.
Bach, S. H.; He, B.; Ratner, A. J.; and Ré, C. 2017. Learning the structure of generative models without labeled data.

Blum, A., and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training.

Caruana, R. 1993. Multitask learning: A knowledge-based source of inductive bias.

Chandrasekaran, V.; Parrilo, P. A.; and Willsky, A. S. 2010. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, 1610-1613. IEEE.

Craven, M., and Kumlien, J. 1999. Constructing biological knowledge bases by extracting information from text sources.

Dalvi, N.; Dasgupta, A.; Kumar, R.; and Rastogi, V. 2013a. Aggregating crowdsourced binary ratings.

Dalvi, N.; Dasgupta, A.; Kumar, R.; and Rastogi, V. 2013b. Aggregating crowdsourced binary ratings.

Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 20-28.

Ghosh, A.; Kale, S.; and McAfee, P. 2011. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content.

Guan, M. Y.; Gulshan, V.; Dai, A. M.; and Hinton, G. E. 2017. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774.

Gupta, S., and Manning, C. D. 2014. Improved pattern learning for bootstrapped entity extraction.

Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations.

Karger, D. R.; Oh, S.; and Shah, D. 2011. Iterative learning for reliable crowdsourcing systems.

Khetan, A.; Lipton, Z. C.; and Anandkumar, A. 2017. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577.

Liang, P.; Jordan, M. I.; and Klein, D. 2009. Learning from measurements in exponential families.

Loh, P.-L., and Wainwright, M. J. 2012. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses.

Mann, G. S., and McCallum, A. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR 11(Feb):955-984.

Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data.

National Institutes of Health. 2017. Open-i.

Platanios, E.; Poon, H.; Mitchell, T. M.; and Horvitz, E. J. 2017. Estimating accuracy from unlabeled data: A probabilistic logic approach.

Ratner, A. J.; De Sa, C. M.; Wu, S.; Selsam, D.; and Ré, C. 2016. Data programming: Creating large training sets, quickly.

Ratner, A.; Bach, S.; Ehrenberg, H.; Fries, J.; Wu, S.; and Ré, C. 2018. Snorkel: Rapid training data creation with weak supervision.

Ruder, S. 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098.

Søgaard, A., and Goldberg, Y. 2016. Deep multi-task learning with low level tasks supervised at lower layers.

Takamatsu, S.; Sato, I.; and Nakagawa, H. 2012. Reducing wrong labels in distant supervision for relation extraction.

Tropp, J. A. 2015. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8(1-2):1-230.

Varma, P.; He, B. D.; Bajaj, P.; Khandwala, N.; Banerjee, I.; Rubin, D.; and Ré, C. 2017. Inferring generative model structure with static analysis.

Weischedel, R.; Hovy, E.; Marcus, M.; Palmer, M.; Belvin, R.; Pradhan, S.; Ramshaw, L.; and Xue, N. 2011. OntoNotes: A large training corpus for enhanced processing. Handbook of Natural Language Processing and Machine Translation. Springer.
Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; and Wang, X. 2015. Learning from massive noisy labeled data for image classification.

Zaidan, O. F., and Eisner, J. 2008. Modeling annotators: A generative approach to learning from annotator rationales.

Zhang, Y.; Chen, X.; Zhou, D.; and Jordan, M. I. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing.

Zhang, C.; Ré, C.; Cafarella, M.; De Sa, C.; Ratner, A.; Shin, J.; Wang, F.; and Wu, S. 2017a. DeepDive: Declarative knowledge base construction. Commun. ACM 60(5):93-102.

Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017b. Position-aware attention and supervised data improve slot filling.