# learning_from_explanations_with_neural_execution_tree__89fca7ba.pdf

Published as a conference paper at ICLR 2020

LEARNING FROM EXPLANATIONS WITH NEURAL EXECUTION TREE

Ziqi Wang1, Yujia Qin1, Wenxuan Zhou2, Jun Yan2, Qinyuan Ye2, Leonardo Neves3, Zhiyuan Liu1, Xiang Ren2
Tsinghua University1, University of Southern California2, Snap Research3
{ziqi-wan16, qinyj16}@mails.tsinghua.edu.cn, {liuzy}@tsinghua.edu.cn, {zhouwenx, yanjun, qinyuany, xiangren}@usc.edu, {lneves}@snap.com
(Equal contribution; the order is decided by a coin toss. The work was done when visiting USC.)

While deep neural networks have achieved impressive performance on a range of NLP tasks, these data-hungry models heavily rely on labeled data, which restricts their applications in scenarios where data annotation is expensive. Natural language (NL) explanations have been demonstrated to be very useful additional supervision: they can provide sufficient domain knowledge for generating more labeled data over new instances, while only roughly doubling annotation time. However, directly applying them to augment model learning encounters two challenges: (1) NL explanations are unstructured and inherently compositional, which calls for a modularized model to represent their semantics; (2) NL explanations often have large numbers of linguistic variants, resulting in low recall and limited generalization ability. In this paper, we propose a novel Neural Execution Tree (NExT) framework (Project: http://inklab.usc.edu/project-NExT/; Code: https://github.com/INK-USC/NExT) to augment training data for text classification using NL explanations. After transforming NL explanations into executable logical forms by semantic parsing, NExT generalizes the different types of actions specified by the logical forms for labeling data instances, which substantially increases the coverage of each NL explanation. Experiments on two NLP tasks (relation extraction and sentiment analysis) demonstrate its superiority over baseline methods. An extension to multi-hop question answering achieves a performance gain with light annotation effort.

1 INTRODUCTION

Deep neural networks have achieved state-of-the-art performance on a wide range of natural language processing tasks. However, they usually require massive labeled data, which restricts their applications in scenarios where data annotation is expensive. The traditional way of providing supervision is human-generated labels. See Figure 1 for an example: the sentiment polarity of the sentence "Quality ingredients preparation all around, and a very fair price for NYC" can be labeled as Positive. However, the label itself does not provide information about how the decision is made. A more informative method is to allow annotators to explain their decisions in natural language, so that the annotation can be generalized to other examples. Such an explanation can be "Positive, because the word price is directly preceded by fair", which can be generalized to other instances like "It has delicious food with a fair price". Natural language (NL) explanations have shown effectiveness in providing additional supervision, especially in low-resource settings (Srivastava et al., 2017; Hancock et al., 2018). Also, they can be easily collected from human annotators without significantly increasing annotation effort. However, exploiting NL explanations as supervision is challenging due to the complex nature of human language. First of all, textual data is not well structured, and thus we have to parse explanations into logical forms for machines to better utilize them.
Also, linguistic variants are ubiquitous, which makes it difficult to generalize an NL explanation for matching sentences that are semantically equivalent but use different words. When we perform exact matching with the previous example explanation, it fails to annotate sentences containing "reasonable price" or "good deal".

Figure 1: Matching new instances from a raw corpus using natural language explanations. (The figure shows the labeled sentence "Quality ingredients preparation all around, and a very fair price for NYC" with the explanation "Positive, because the word price is directly preceded by fair", which is then used to match "It has delicious food with a fair price" from the corpus.)

Attempts have been made to train classifiers with NL explanations. Srivastava et al. (2017) use NL explanations as additional features of data: they map explanations to logical forms with a semantic parser and use them to generate binary features for all instances. Hancock et al. (2018) employ a rule-based semantic parser to obtain logical forms (i.e., labeling functions) from NL explanations, which generate noisy labeled datasets used for training models. While both methods report large performance improvements, they neglect the importance of linguistic variants, resulting in very low recall. Also, their way of evaluating explanations on new instances is oversimplified (e.g., exact comparison and logic operators), which makes the resulting matching overly strict. In the example above, the sentence "Decent sushi at a fair enough price" will be rejected because of the "directly preceded" requirement.

To address these issues, we propose the Neural Execution Tree (NExT) framework, which allows deep neural networks to learn from NL explanations, as illustrated in Figure 2. Given a raw corpus and a set of NL explanations, we first parse the NL explanations into machine-actionable logical forms with a combinatory categorial grammar (CCG) based semantic parser. Different from previous work, we soften the annotation process by generalizing the predicates using neural module networks and changing the labeling process from exact matching to fuzzy matching. We introduce four types of matching modules in total, namely the String Matching Module, the Soft Counting Module, the Logical Calculation Module, and the Deterministic Function Module. We calculate matching scores and find for each instance the most similar logical form, so that all instances in the raw corpus can be assigned a label and used to train neural models.

The major contributions of our work are summarized as follows: (1) We propose a novel NExT framework to utilize NL explanations. NExT is able to model the compositionality of NL explanations and improve their generalization ability, so that neural models can leverage unlabeled data to augment model training. (2) We conduct extensive experiments on two representative tasks (relation extraction and sentiment analysis). Experimental results demonstrate the superiority of NExT over various baselines. We also adapt NExT to a multi-hop question answering task, in which it achieves a performance improvement with only 21 explanations and 5 rules.

2 LEARNING TO AUGMENT SEQUENCE MODELS WITH NL EXPLANATIONS

This section first introduces the basic concepts and notation for our problem definition.
Then we give a brief overview of our approach, followed by details of each stage.

Problem Definition. We consider the task of training classifiers with natural language explanations for text classification (e.g., relation extraction and sentiment analysis) in a low-resource setting. Specifically, given a raw corpus $S = \{x_i\}_{i=1}^{N} \subseteq X$ and a predefined label set $Y$, our goal is to learn a classifier $f_c: X \to Y$. We ask human annotators to view a subset $S'$ of the corpus $S$ and provide for each instance $x \in S'$ a label $y$ and an explanation $e$, which explains why $x$ should receive $y$. Note that $|S'| \ll |S|$, which requires our framework to learn with very limited human supervision.

Approach Overview. We develop a multi-stage learning framework to leverage NL explanations in a weakly-supervised setting. An overview of our framework is depicted in Fig. 2. Our NExT framework consists of three stages, namely an explanation parsing stage, a dataset partition stage, and a joint model learning stage.

Explanation Parsing. To leverage the unstructured human explanations $E = \{e_j\}_{j=1}^{|S'|}$, we turn them into machine-actionable logical forms (i.e., labeling functions) (Ratner et al., 2016), denoted as $F = \{f_j: X \to \{0, 1\}\}_{j=1}^{|S'|}$, where 1 indicates that the logical form matches the input sequence and 0 otherwise. To access the labels, we introduce a function $h: F \to Y$ that maps each logical form $f_j$ to the label $y_j$ of its explanation $e_j$. Examples are given in Fig. 2.

Figure 2: Overview of the NExT framework. Natural language explanations are first parsed into logical forms; for example, the explanation "Positive, because the words very nice appear within 3 words after the TERM" is parsed to "return 1 if Word(very nice) Is AtMost(Right(TERM), Num(3 tokens)) else 0". Then we partition the raw corpus $S$ into a labeled dataset $S_a$ and an unlabeled dataset $S_u = S \setminus \{x_i^a\}_{i=1}^{N_a}$. We use matching modules to provide supervision on $S_u$. Finally, supervision from both $S_a$ and $S_u$ is fed into a classifier.

We use Combinatory Categorial Grammar (CCG) based semantic parsing (Zettlemoyer & Collins, 2012; Artzi et al., 2015), an approach that couples syntax with semantics, to convert each NL explanation $e_j$ to a logical form $f_j$. Following Srivastava et al. (2017), we first compile a domain lexicon that maps each word to its syntax and logical predicate. Frequently-used predicates are listed in the Appendix. For each explanation, the parser can generate many possible logical forms based on the CCG grammar. To identify the correct one, we use a feature vector $\phi(f) \in \mathbb{R}^d$ whose elements count the number of applications of each CCG combinator (similar to Zettlemoyer & Collins (2007)). Specifically, given an explanation $e_i$, the semantic parser parameterized by $\theta \in \mathbb{R}^d$ outputs a probability distribution over all possible logical forms $Z_{e_i}$. The probability of a feasible logical form is calculated as:

$$P_\theta(f \mid e_i) = \frac{\exp\big(\theta^\top \phi(f)\big)}{\sum_{f' \in Z_{e_i}} \exp\big(\theta^\top \phi(f')\big)}.$$
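To make this scoring step concrete, here is a minimal sketch of the log-linear distribution above; the candidate parses, feature counts, and parameter values are toy placeholders rather than the paper's actual parser.

```python
import numpy as np

def parse_probabilities(theta: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Log-linear distribution P_theta(f | e) over candidate logical forms.

    theta: (d,) parser parameters.
    phi:   (K, d) feature vectors, one per candidate parse; each element counts
           how often a particular CCG combinator was applied in that parse.
    """
    scores = phi @ theta              # theta^T phi(f) for every candidate
    scores -= scores.max()            # stabilize before exponentiating
    weights = np.exp(scores)
    return weights / weights.sum()

# Toy example: 3 hypothetical candidate parses, 4 combinator-count features.
theta = np.array([0.5, -0.2, 1.0, 0.0])
phi = np.array([[1, 0, 2, 1],
                [0, 1, 1, 0],
                [2, 0, 0, 3]], dtype=float)
print(parse_probabilities(theta, phi))  # the first parse receives the most mass
```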
To learn $\theta$, we maximize the probability of $y_i$ given $e_i$, calculated by marginalizing over all logical forms that match $x_i$ (similar to Liang et al. (2013)). Formally, the objective (maximized over $\theta$) is:

$$\sum_{\substack{f:\, f(x_i) = 1,\\ h(f) = y_i}} P_\theta(f \mid e_i),$$

i.e., the total probability mass assigned to logical forms that both match $x_i$ and predict its label $y_i$. Once the optimal $\theta^*$ is derived with a gradient-based method, the parsing result for $e_i$ is defined as $f_i = \arg\max_f P_{\theta^*}(f \mid e_i)$.

Dataset Partition. After we parse the explanations $\{e_i\}_{i=1}^{|S'|}$ into $F = \{f_i\}_{i=1}^{|S'|}$, where each $f_i$ corresponds to $e_i$, we use $F$ to find exact matches in $S$ and pair them with the corresponding labels. We denote the number of instances labeled by exact matching as $N_a$. As a result, $S$ is partitioned into a labeled dataset $S_a = \{(x_i^a, y_i^a)\}_{i=1}^{N_a}$ and an unlabeled dataset $S_u = S \setminus \{x_i^a\}_{i=1}^{N_a} = \{x_j^u\}_{j=1}^{N_u}$, where $N_u = |S| - N_a$.

Joint Model Learning. The exactly matched $S_a$ can be directly used to train a classifier, while informative instances in $S_u$ would be left untouched. We propose several neural module networks, which relax the constraints in each logical form $f_j$ and substantially improve rule coverage on $S_u$; classifiers benefit from these soft-matched, pseudo-labeled instances. Trainable parameters in the neural module networks are jointly optimized with the classifier. Details of each module and the joint training method are introduced in the next section.
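As a rough illustration of the partition step described above (not the authors' released code), the sketch below splits a corpus using exact-match labeling functions; `labeling_functions` is assumed to be a list of callables returning 0/1, each paired with the label of its source explanation.

```python
from typing import Callable, List, Tuple

LabelingFunction = Tuple[Callable[[str], int], str]  # (f_j, h(f_j))

def partition_corpus(corpus: List[str],
                     labeling_functions: List[LabelingFunction]):
    """Split a raw corpus into an exactly matched labeled set S_a and the rest S_u."""
    labeled, unlabeled = [], []
    for sentence in corpus:
        label = None
        for lf, lf_label in labeling_functions:
            if lf(sentence) == 1:        # hard (exact) match
                label = lf_label
                break
        if label is not None:
            labeled.append((sentence, label))
        else:
            unlabeled.append(sentence)   # later pseudo-labeled by NExT's soft matching
    return labeled, unlabeled

# Hypothetical labeling function derived from "Positive, because 'price' is directly preceded by 'fair'".
lf = (lambda s: int("fair price" in s.lower()), "Positive")
S_a, S_u = partition_corpus(["It has delicious food with a fair price.",
                             "Decent sushi at a fair enough price."], [lf])
# S_a contains only the first sentence; the second one is left for soft matching.
```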
3 NEURAL EXECUTION TREE

Given a logical form $f$ and a sentence $x$, NExT outputs a matching score $u_s \in [0, 1]$ indicating how likely it is that the sentence $x$ satisfies the logical form $f$ and thus should be given the corresponding label $h(f)$. Specifically, NExT comprises four modules that deal with four categories of predicates, namely the String Matching Module, the Soft Counting Module, the Deterministic Function Module, and the Logical Calculation Module. Any complex logical form can be disassembled into clauses containing these four categories of predicates. The four modules are then used to evaluate each clause, and then the whole logical form, in a softened way. Fig. 3 shows how NExT builds the execution tree from an NL explanation and how it evaluates an unlabeled sentence; the figure indicates the corresponding module for each predicate in the logical form.

Figure 3: Neural Execution Tree (NExT) softly executes the logical form on the sentence. (Example for the relation Date_of_Death: the explanation "The words who died precede OBJECT by no more than three words and occur between SUBJECT and OBJECT" is parsed to the logical form "(Word(who died) Is Between(SUBJECT, OBJECT)) And (Word(who died) Is AtMost(Left(OBJECT), Num(3 Tokens)))" and evaluated on the sentence "SUBJECT was murdered on OBJECT".)

3.1 MODULES IN NEXT

String Matching Module. Given a keyword query $q$ derived from an explanation and an input sequence $x = [w_1, w_2, \dots, w_n]$, the string matching module $f_s(x, q)$ returns a sequence of scores $[s_1, s_2, \dots, s_n]$ indicating the similarity between each token $w_i$ and the query $q$. Previous work implements this operation by exact keyword search, while we augment the module with neural networks to enable capturing semantically similar words. Inspired by Li et al. (2018), for token $w_i$ we first generate $N_c$ contexts using sliding windows of various lengths. For example, if the maximum window size is 2, the contexts $c_{i0}, c_{i1}, c_{i2}$ of token $w_i$ are $[w_i]$, $[w_{i-1}; w_i]$, and $[w_i; w_{i+1}]$, respectively. We then encode each context $c_{ij}$ into a vector $z_{c_{ij}}$ by feeding pre-trained word embeddings into a bi-directional LSTM encoder (Hochreiter & Schmidhuber, 1997), whose hidden states are summarized by an attention layer (Bahdanau et al., 2014). The keyword query $q$ is directly encoded into a vector $z_q$ by a BiLSTM and attention. Finally, the scores for sentence $x$ and query $q$ are calculated by aggregating similarity scores from the different sliding windows:

$$M_{ij}(x, q) = \cos(z_{c_{ij}} D,\; z_q D), \qquad f_s(x, q) = M(x, q)\, v,$$

where $D$ is a trainable diagonal matrix and $v \in \mathbb{R}^{N_c}$ holds the trainable weight of each sliding window.

Parameters of the string matching module need to be learned from data of the form (sentence, keyword, label). To build such a training set, we randomly select spans of consecutive words in the training data as keyword queries; each query is paired with the sentence it comes from. The synthesized dataset is denoted as $\{x_i, q_i, k_i\}_{i=1}^{N_{syn}}$, where the $j$-th element of the binary sequence $k_i$ takes the value 1 if $q_i$ is extracted from that position of $x_i$ and 0 otherwise. The loss function is the binary cross-entropy:

$$\mathcal{L}_{find} = -\frac{1}{N_{syn}} \sum_{i=1}^{N_{syn}} \frac{1}{|k_i|} \Big( k_i \log f_s(x_i, q_i) + (1 - k_i) \log\big(1 - f_s(x_i, q_i)\big) \Big),$$

where the products are taken element-wise over token positions.

While pretraining with $\mathcal{L}_{find}$ enables matching of similar words, this unsupervised distributional method is poor at learning their semantic meanings. For example, the word "good" will have relatively low similarity to "great" because there is no such training data. To solve this problem, we borrow the idea of word retrofitting (Faruqui et al., 2014) and adopt a contrastive loss (Neculoiu et al., 2016) to incorporate semantic knowledge into training, using the keyword queries in labeling functions as supervision. Intuitively, the semantic meanings of two queries should be similar if they appear in the same class of labeling functions and dissimilar otherwise. More specifically, for a query $q$, we denote queries in the same class of labeling functions as $Q^+$ and queries in different classes as $Q^-$. The similarity loss is defined as:

$$\mathcal{L}_{sim} = \max_{q_1 \in Q^+} \big(\tau - \cos(z_q D,\, z_{q_1} D)\big)_+^2 \;+\; \max_{q_2 \in Q^-} \cos(z_q D,\, z_{q_2} D)_+^2,$$

where $(\cdot)_+$ denotes clipping at zero and $\tau$ is a margin. The overall objective function for the string matching module is:

$$\mathcal{L}_{string} = \mathcal{L}_{find} + \gamma\, \mathcal{L}_{sim}, \tag{1}$$

where $\gamma$ is a hyper-parameter. We pretrain the string matching module for better initialization.
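To ground the windowed similarity scoring above, here is a rough PyTorch-style sketch; the encoder, the diagonal transform $D$, and the window weights $v$ follow the description, but the dimensions, window layout, and encoder details are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StringMatcher(nn.Module):
    """Scores how well each token of a sentence matches a keyword query.

    Implements M_ij = cos(z_{c_ij} D, z_q D) over three sliding windows
    (max window size 2), aggregated with trainable weights v.
    """
    def __init__(self, embed_dim=300, hidden=150):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.D = nn.Parameter(torch.ones(2 * hidden))          # diagonal matrix, stored as its diagonal
        self.windows = [(0, 0), (-1, 0), (0, 1)]               # [w_i], [w_{i-1}; w_i], [w_i; w_{i+1}]
        self.v = nn.Parameter(torch.full((len(self.windows),), 1.0 / len(self.windows)))

    def encode(self, span_emb):                                # span_emb: (1, len, embed_dim)
        h, _ = self.encoder(span_emb)                          # (1, len, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)                 # attention over positions
        return (a * h).sum(dim=1).squeeze(0)                   # (2*hidden,)

    def forward(self, sent_emb, query_emb):
        """sent_emb: (n, embed_dim) token embeddings; query_emb: (m, embed_dim)."""
        z_q = self.encode(query_emb.unsqueeze(0)) * self.D
        n = sent_emb.size(0)
        scores = torch.zeros(n, len(self.windows))
        for i in range(n):
            for j, (lo, hi) in enumerate(self.windows):        # windows clipped at sentence boundaries
                s, e = max(i + lo, 0), min(i + hi, n - 1)
                z_c = self.encode(sent_emb[s:e + 1].unsqueeze(0)) * self.D
                scores[i, j] = F.cosine_similarity(z_c, z_q, dim=0)
        return scores @ self.v                                  # f_s(x, q): one score per token

matcher = StringMatcher()
sent = torch.randn(8, 300)    # 8 tokens, GloVe-sized embeddings
query = torch.randn(3, 300)   # a 3-token keyword query
print(matcher(sent, query).shape)   # torch.Size([8]); one similarity score per token
```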
Soft Counting Module. The soft counting module aims to relax the counting (distance) constraints defined by NL explanations. For a counting constraint such as "precede OBJECT by no more than three words", the soft counting module outputs a matching score indicating to what extent an anchor word (TERM, SUBJECT, or OBJECT) satisfies the constraint. The score is set to 1 if the position of the anchor word strictly satisfies the constraint, and decreases if the constraint is broken. For simplicity, we allow an additional range in which the score is set to $\mu \in (0, 1)$, a hyper-parameter controlling how strictly the constraints are enforced.

Deterministic Function Module. The deterministic function module deals with deterministic predicates such as Left and Right, which can only be exactly matched, because a string is either to the right or to the left of an anchor word; the probability it outputs is therefore either 0 or 1. The module handles all such predicates and outputs a mask sequence, which is fed into the tree structure and combined with other information.

Logical Calculation Module. The logical calculation module acts as a score aggregator. It can aggregate scores given by: (1) a string matching module together with a soft counting module or deterministic function module (triggered by predicates such as Is and Occur); (2) two clauses that have each already been evaluated to a score (triggered by predicates such as And and Or). In the first case, the logical calculation module computes the element-wise product of the score sequence provided by the string matching module and the mask sequence provided by the soft counting module or deterministic function module, and then applies max pooling to obtain the matching score of the current clause. In the second case, the logical calculation module aggregates the score(s) of one or more clauses based on the logic operation. The rules are defined as follows:

$$p_1 \wedge p_2 = \max(p_1 + p_2 - 1,\, 0), \qquad p_1 \vee p_2 = \min(p_1 + p_2,\, 1), \qquad \neg p = 1 - p,$$

where $p$ is the score of an input clause.
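These aggregation rules are the Łukasiewicz t-norm and t-conorm. A small self-contained sketch of evaluating a clause tree with them is given below; the clause/tree representation is an illustrative assumption, not the paper's actual data structure.

```python
from dataclasses import dataclass
from typing import List

def soft_and(p1: float, p2: float) -> float:
    return max(p1 + p2 - 1.0, 0.0)       # Lukasiewicz t-norm

def soft_or(p1: float, p2: float) -> float:
    return min(p1 + p2, 1.0)             # Lukasiewicz t-conorm

def soft_not(p: float) -> float:
    return 1.0 - p

@dataclass
class Clause:
    """A node of the execution tree: either a leaf score or a logical combination."""
    op: str                               # "leaf", "and", "or", or "not"
    score: float = 0.0                    # used when op == "leaf"
    children: List["Clause"] = None

def evaluate(clause: Clause) -> float:
    if clause.op == "leaf":
        return clause.score
    scores = [evaluate(c) for c in clause.children]
    if clause.op == "and":
        return soft_and(scores[0], scores[1])
    if clause.op == "or":
        return soft_or(scores[0], scores[1])
    return soft_not(scores[0])            # "not"

# (Word(who died) Is Between(SUBJ, OBJ)) And (Word(who died) Is AtMost(Left(OBJ), 3 tokens)),
# with the two clauses already scored 0.9 and 0.6 by the other modules.
tree = Clause("and", children=[Clause("leaf", 0.9), Clause("leaf", 0.6)])
print(evaluate(tree))                     # 0.5
```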
3.2 AUGMENTING MODEL LEARNING WITH NEXT

Algorithm 1: Learning on Unlabeled Data with NExT
Input: labeled data $S_a = \{(x_i^a, y_i^a)\}_{i=1}^{N_a}$, unlabeled data $S_u = \{x_j^u\}_{j=1}^{N_u}$, and logical forms $F = \{f_k\}_{k=1}^{|S'|}$.
Output: a classifier $f_c: X \to Y$.
1. Pretrain the String Matching Module in NExT w.r.t. $\mathcal{L}_{string}$ using Eq. 1.
2. While not converged:
   (a) Sample a labeled batch $B_a = \{(x_i^a, y_i^a)\}_{i=1}^{n}$ from $S_a$ and an unlabeled batch $B_u = \{x_j^u\}_{j=1}^{m}$ from $S_u$.
   (b) For each $x_j^u \in B_u$, calculate a pseudo label $y_j^u$ with confidence $u_j$ using NExT and $F$.
   (c) Normalize the matching scores $\{u_j\}_{j=1}^{m}$ to get $\{\omega_j\}_{j=1}^{m}$ based on Eq. 3.
   (d) Calculate $\mathcal{L}_a$ using Eq. 2, $\mathcal{L}_u$ using Eq. 4, and $\mathcal{L}_{total}$ using Eq. 5.
   (e) Update $f_c$ and the String Matching Module in NExT w.r.t. $\mathcal{L}_{total}$.

As described in Algo. 1, in each iteration we sample two batches, $B_a$ and $B_u$, from $S_a$ and $S_u$. We conduct supervised learning on $B_a$ with the labeled loss:

$$\mathcal{L}_a = -\sum_{(x_i^a, y_i^a) \in B_a} \log p(y_i^a \mid x_i^a). \tag{2}$$

To leverage $B_u$, which is also informative, for each instance $x_j^u \in B_u$ we use our matching modules to compute its matching score with every logical form. The label of the most probable logical form matched with $x_j^u$ is taken as its pseudo label $y_j^u$, along with the matching score $u_j$. (The None label, e.g., No Relation for relation extraction and Neutral for sentiment analysis, usually lacks explanations; if the entropy of the downstream model's prediction distribution over labels is lower than a threshold, a None label is given.) To ensure that the scale of the unlabeled loss is comparable to the labeled loss, we normalize the matching scores among the pseudo-labeled instances in $B_u$ as:

$$\omega_j = \frac{\exp(\theta_t u_j)}{\sum_{k=1}^{|B_u|} \exp(\theta_t u_k)}, \tag{3}$$

where $k$ indexes the instances and the hyper-parameter $\theta_t$ (temperature) controls the shape of the normalized score distribution. Based on that, the unlabeled loss is calculated as:

$$\mathcal{L}_u = -\sum_{x_j^u \in B_u} \omega_j \log p(y_j^u \mid x_j^u). \tag{4}$$

Note that the string matching module is also trainable and plays a vital role in NExT. We jointly train it with the classifier by optimizing:

$$\mathcal{L}_{total} = \mathcal{L}_a + \alpha\, \mathcal{L}_u + \beta\, \mathcal{L}_{string}, \tag{5}$$

where $\alpha$ and $\beta$ are hyper-parameters.
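A compact sketch of how the three loss terms could be combined in one training step is shown below; the classifier, the pseudo-labeling call, the string-loss callable, and the hyper-parameter values are placeholders for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def training_step(classifier, matcher, labeled_batch, unlabeled_batch,
                  pseudo_label_fn, string_loss_fn,
                  alpha=0.5, beta=0.1, temperature=1.0):
    """One joint update combining L_a (Eq. 2), L_u (Eq. 4) and L_string (Eq. 1) as in Eq. 5."""
    # Supervised loss on the exactly matched batch (Eq. 2).
    x_a, y_a = labeled_batch
    loss_a = F.cross_entropy(classifier(x_a), y_a, reduction="sum")

    # Pseudo labels (LongTensor) and matching confidences u_j in [0, 1] from NExT.
    x_u = unlabeled_batch
    y_u, u = pseudo_label_fn(matcher, x_u)
    omega = torch.softmax(temperature * u, dim=0)              # Eq. 3: normalized weights

    # Confidence-weighted unlabeled loss (Eq. 4).
    log_probs = F.log_softmax(classifier(x_u), dim=-1)
    loss_u = -(omega * log_probs.gather(1, y_u.unsqueeze(1)).squeeze(1)).sum()

    # Auxiliary loss that keeps training the string matching module (Eq. 1).
    loss_string = string_loss_fn(matcher)

    return loss_a + alpha * loss_u + beta * loss_string        # Eq. 5
```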
4 EXPERIMENTS

Tasks and Datasets. We conduct experiments on two tasks: relation extraction and aspect-term-level sentiment analysis. Relation extraction (RE) aims to identify the relation type between two entities in a sentence. For example, given the sentence "Steve Jobs founded Apple Inc.", we want to extract the triple (Steve Jobs, Apple Inc., Founder). For RE we choose two datasets, TACRED (Zhang et al., 2017) and SemEval (Hendrickx et al., 2009). Aspect-term-level sentiment analysis (SA) aims to decide the sentiment polarity with regard to a given aspect term. For example, given the sentence "Quality ingredients preparation all around, and a very fair price for NYC", the sentiment polarity of the aspect term "price" is positive, and the explanation can be "The word price is directly preceded by fair." For this task we use two customer review datasets, Restaurant and Laptop, which are part of SemEval 2014 Task 4.

Explanation Collection. We use Amazon Mechanical Turk to collect explanations for a randomly sampled set of instances in each dataset. Turkers are prompted with a list of selected predicates (see Appendix) and several examples of NL explanations. Examples of collected explanations are listed in the Appendix. Statistics of the curated explanations and intrinsic evaluation results of semantic parsing are summarized in Table 1. To ensure a low-resource setting (i.e., $|S'| \ll |S|$), in each experiment we only use a random subset of the collected explanations.

| Dataset | exps | categs | avg ops | logic/% | assertion/% | position/% | counting/% | acc/% |
|---|---|---|---|---|---|---|---|---|
| TACRED | 170 | 13 | 8.2 | 25.8 | 21.3 | 21.4 | 12.4 | 95.3 |
| SemEval | 203 | 9 | 4.2 | 32.7 | 15.9 | 26.3 | 5.5 | 84.2 |
| Laptop | 40 | 8 | 3.9 | 0.0 | 23.8 | 23.8 | 17.5 | 87.2 |
| Restaurant | 45 | 9 | 9.6 | 2.8 | 25.4 | 26.1 | 16.2 | 88.2 |

Table 1: Statistics for human-curated explanations and evaluation of semantic parsing. We report the number of NL explanations (exps), categories of predicates (categs), and operator compositions per explanation (avg ops). We also report the proportions of different types of predicates, where logic denotes logical operators (And, Or), assertion denotes assertion predicates (Occur, Contains), position denotes position predicates (Right, Between), and counting denotes counting predicates (MoreThan, AtMost). We summarize the accuracy (acc) of semantic parsing based on human evaluation.

Compared Methods. As mentioned in Sec. 2, logical forms partition the unlabeled corpus $S$ into a labeled set $S_a$ and an unlabeled set $S_u$. The labeled set $S_a$ can be directly utilized by supervised learning methods: (1) CBOW-GloVe uses bag-of-words (Mikolov et al., 2013) over GloVe embeddings (Pennington et al., 2014) to represent an instance or the surface patterns in NL explanations, and then annotates a sentence with the label of its most similar surface pattern (by cosine similarity). (2) PCNN (Zeng et al., 2015) uses piece-wise max-pooling to aggregate CNN-generated features. (3) LSTM+ATT (Bahdanau et al., 2014) adds an attention layer onto an LSTM to encode a sequence. (4) PA-LSTM (Zhang et al., 2017) combines an LSTM with entity-position-aware attention to conduct relation extraction. (5) ATAE-LSTM (Wang et al., 2016) incorporates the aspect-term information into both the embedding layer and the attention layer to help the model concentrate on different parts of a sentence. For the semi-supervised baselines, the unlabeled data $S_u$ is also introduced for training; for methods requiring rules as input, we use surface-pattern-based rules transferred from explanations. The compared semi-supervised methods are: (1) Pseudo-Labeling (Lee, 2013) first trains a classifier on the labeled dataset, then generates pseudo labels for unlabeled data by selecting the labels with maximum predicted probability. (2) Self-Training (Rosenberg et al., 2005) expands the labeled data by repeatedly selecting the batch of unlabeled data with the highest confidence and generating pseudo-labels for it, until the unlabeled data is used up. (3) Mean-Teacher (Tarvainen & Valpola, 2017) averages model weights instead of label predictions and assumes that similar data points should have similar outputs. (4) DualRE (Lin et al., 2019) jointly trains a relation prediction module and a retrieval module. Learning from explanations is categorized as a third setting, in which explanation-guided pseudo labels are generated for a downstream classifier: (1) Data Programming (Hancock et al., 2018; Ratner et al., 2016) aggregates the results of strict labeling functions for each instance and uses these pseudo-labels to train a classifier. (2) NExT (proposed work) softly applies logical forms to obtain annotations for unlabeled instances and trains a downstream classifier with these pseudo-labeled instances. The downstream classifier is BiLSTM+ATT for relation extraction and ATAE-LSTM for sentiment analysis.

(a) Relation extraction (F1, %):

| Method | TACRED | SemEval |
|---|---|---|
| LF (E) | 23.33 | 33.86 |
| CBOW-GloVe (R + S) | 34.6 ± 0.4 | 48.8 ± 1.1 |
| PCNN (Sa) | 34.8 ± 0.9 | 41.8 ± 1.2 |
| PA-LSTM (Sa) | 41.3 ± 0.8 | 57.3 ± 1.5 |
| BiLSTM+ATT (Sa) | 41.4 ± 1.0 | 58.0 ± 1.6 |
| BiLSTM+ATT (Sl) | 30.4 ± 1.4 | 54.1 ± 1.0 |
| Self-Training (Sa + Su) | 41.7 ± 1.5 | 55.2 ± 0.8 |
| Pseudo-Labeling (Sa + Su) | 41.5 ± 1.2 | 53.5 ± 1.2 |
| Mean-Teacher (Sa + Su) | 40.8 ± 0.9 | 56.0 ± 1.1 |
| Mean-Teacher (Sl + Slu) | 25.9 ± 2.2 | 52.2 ± 0.7 |
| DualRE (Sa + Su) | 32.6 ± 0.7 | 61.7 ± 0.9 |
| Data Programming (E + S) | 30.8 ± 2.4 | 43.9 ± 2.4 |
| NExT (E + S) | 45.6 ± 0.4 | 63.5 ± 1.0 |

(b) Sentiment analysis (F1, %):

| Method | Restaurant | Laptop |
|---|---|---|
| LF (E) | 7.7 | 13.1 |
| CBOW-GloVe (R + S) | 68.5 ± 2.9 | 61.5 ± 1.3 |
| PCNN (Sa) | 72.6 ± 1.2 | 60.9 ± 1.1 |
| ATAE-LSTM (Sa) | 71.1 ± 0.4 | 56.2 ± 3.6 |
| ATAE-LSTM (Sl) | 71.4 ± 0.5 | 52.0 ± 1.4 |
| Self-Training (Sa + Su) | 71.2 ± 0.5 | 57.6 ± 2.1 |
| Pseudo-Labeling (Sa + Su) | 70.9 ± 0.4 | 58.0 ± 1.9 |
| Mean-Teacher (Sa + Su) | 72.0 ± 1.5 | 62.1 ± 2.3 |
| Mean-Teacher (Sl + Slu) | 74.1 ± 0.4 | 61.7 ± 3.7 |
| Data Programming (E + S) | 71.2 ± 0.0 | 61.5 ± 0.1 |
| NExT (E + S) | 75.8 ± 0.8 | 62.8 ± 1.9 |

Table 2: Experimental results on relation extraction and sentiment analysis. Mean and standard deviation of F1 scores (%) over multiple runs are reported (5 runs for RE, 10 runs for SA). LF (E) denotes directly applying the logical forms parsed from explanations. The bracket behind each method indicates the data it uses: S denotes training data without labels, E denotes explanations, R denotes surface-pattern rules transformed from explanations; Sa denotes labeled data annotated with explanations, Su denotes the remaining unlabeled data; Sl denotes labeled data annotated using the same amount of time as creating the explanations E, and Slu denotes the remaining unlabeled data corresponding to Sl.

4.1 RESULTS OVERVIEW

Table 2 (a) lists the F1 scores of all relation extraction models; full results including precision and recall can be found in Appendix A.4. We observe that our proposed NExT consistently outperforms all baseline models in the low-resource setting. We also find that: (1) Directly applying logical forms to unlabeled data results in poor performance; this method achieves high precision but low recall.
Based on our observation of the collected dataset, this is because people tend to use detailed and specific constraints in an NL explanation to ensure they cover all aspects of the instance. As a result, instances that satisfy the constraints are correctly labeled in most cases, so precision is high; meanwhile, generalization ability is compromised, so recall is low. (2) Compared to its downstream classifier baseline (BiLSTM+ATT with Sa), NExT achieves a 4.2% absolute F1 improvement on TACRED and 5.5% on SemEval. This validates that the expansion of rule coverage by NExT is effective and provides useful information for classifier training. (3) The performance gap widens further when we take annotation effort into account. The annotation time for E and Sl is equivalent, but the performance of BiLSTM+ATT significantly degrades with the fewer instances in Sl. (4) Results of semi-supervised methods are unsatisfactory. This may be explained by the difference between the underlying data distributions of Sa and Su.

Table 2 (b) lists the performance of all sentiment analysis models. The observations are similar to those for relation extraction, which strengthens our conclusions and validates the capability of NExT.

4.2 PERFORMANCE ANALYSIS

| | TACRED | SemEval | Restaurant | Laptop |
|---|---|---|---|---|
| Full NExT | 45.6 ± 0.4 | 63.5 ± 1.0 | 75.8 ± 0.8 | 62.8 ± 1.9 |
| No counting | 44.6 ± 0.9 | 63.2 ± 0.7 | 75.6 ± 0.8 | 62.4 ± 1.9 |
| No matching | 41.8 ± 1.1 | 54.6 ± 1.2 | 71.2 ± 0.4 | 57.0 ± 2.7 |
| No $\mathcal{L}_{sim}$ | 42.5 ± 1.0 | 56.2 ± 2.9 | 70.7 ± 0.8 | 59.4 ± 0.7 |
| No $\mathcal{L}_{find}$ | 43.2 ± 1.3 | 60.2 ± 0.9 | 70.0 ± 3.5 | 58.1 ± 2.8 |

Table 3: Ablation studies on the modules of NExT and the losses of the string matching module. F1 scores on the test set are reported. We remove the soft counting module (No counting) and the string matching module (No matching) by only allowing them to give 0/1 results.

Figure 4: NExT's performance w.r.t. the amount of unlabeled data (x-axis: fraction of unlabeled data used; datasets: TACRED and Restaurant).

Effectiveness of softening logical rules. As shown in Table 3, we conduct ablation studies on TACRED and Restaurant. We remove the two modules that support soft logic (by only allowing them to give 0/1 outputs) to see how much rule softening helps in our framework. Both the soft counting module and the string matching module contribute to the performance of NExT, and the string matching module clearly plays a vital role: removing it leads to significant performance drops, which demonstrates the effectiveness of generalizing when applying logical forms. Besides, we examine the impact of $\mathcal{L}_{sim}$ and $\mathcal{L}_{find}$. Removing them severely hurts performance, indicating the importance of semantic learning when performing fuzzy matching.

Performance with different amounts of unlabeled data. To investigate how NExT's performance is affected by the amount of unlabeled data, we randomly sample 10%, 30%, 50%, and 70% of the original unlabeled dataset and rerun the experiments. As illustrated in Fig. 4, NExT benefits from larger amounts of unlabeled data. We attribute this to the high accuracy of the logical forms converted from explanations.

Figure 5: Performance of NExT vs. a traditional supervised method (LSTM+ATT on Sl) as a function of annotation time. Blue denotes NExT and red denotes the traditional supervised method; for each method, one curve shows performance and the other the number of annotations.

Superiority of explanations in data efficiency.
In the real world, with limited human-power, there is a question of whether it is more valuable to spend time explaining existing annotations than just annotating more labels. To answer this question, we conduct experiments on Performance v.s. Time on TACRED dataset. We compare the results of a supervised classfier with only labels as input and our NEx T with both labels and explanations annotated using the same annotation time. The results are listed in Figure 5, from which we can see that NEx T achieves higher performance while labeling speed reduces by half. Performance with different number of explanations. From Fig. 6 , one can clearly observe that all approaches benefit from more labeled data. Our NEx T outperforms all other baselines by a large margin, which indicates the effectiveness of leveraging knowledge embedded in NL explanations. We can also see that, the performance of NEx T with 170 explanations on TACRED equals to about 2500 labeled data using traditional supervised method. Results of Restaurant also have the same trend, which strengthens our conclusion. Besides Fig. 6, we conduct more experiments for this ablation study and make 4 results tables, see Appendix A.5 for details. Published as a conference paper at ICLR 2020 2500 2000 1500 1000 1000 1500 2000 2500 (b) Restaurant Figure 6: Performance with different number of explanations. We choose supervised semi-supervised baselines for comparison. We did experiments on TACRED and Restaurant. Gray dashed lines mean the performance with the corresponding labeled data. 4.3 ADDITIONAL EXPERIMENT ON MULTI-HOP REASONING To further test the capability of NEx T in downstream tasks, we apply it to WIKIHOP (Welbl et al., 2018) country task by fusing NEx T-matched facts into baseline model NLPROLOG (Weber et al., 2019). For a brief introduction, WIKIHOP country task requires a model to select the correct candidate ENT-Y for question Which country is ENT-X in? given a list of support sentences. As part of dataset design, the correct answer can only be found by reasoning over multiple support sentences. NLPROLOG is a model proposed for WIKIHOP. It first extracts triples from support sentences and treats the masked sentence as the relation between two entities. For example, Socrate was born in Athens is converted to (Socrate, ENT1 was born in ENT2 , Athens), where ENT1 was born in ENT2 is later embedded by SENT2VEC (Pagliardini et al., 2018) to represent the relation between ENT1 and ENT2. Triples extracted from supporting sentences are fed into a Prolog reasoner which will do backward chaining and reasoning to arrive at the target statement country(ENT-X, ENT-Y). We refer readers to (Weber et al., 2019) for in-depth introduction of NLProlog NLPROLOG. Fig. 7 shows how the framework in Fig. 2 is adjusted to suit NLPROLOG. We manually choose 3 predicates (i.e., located in, capital of, next to) and annotate 21 support sentences with natural language explanation. We get 103 strictly-matched facts (Sa) and 1407 NEx T-matched facts (Su) among the 128k unlabeled QA support sentences. Additionally, we manually write 5 rules about these 3 predicates for the Prolog solver, e.g. located in(X,Z) located in(X,Y) located in(Y,Z). Results are listed in Table 4. From the result we observe that simply adding the 103 strictly-matched facts is not making notable improvement. 
However, with the help of NEx T, a larger number of structured facts are recognized from support sentences, so that external knowledge from only 21 explanations and 5 rules improve the accuracy by 1 point. This observation validates NEx T s capability in low resource setting and highlight its potential when applied to downstream tasks. Training NLProlog Strictly matched set Fact Extraction Labeled by LF NEx T matched set Labeled by NEx T Figure 7: Adjusting NEx T Framework (Fig. 2) for NLPROLOG. |Sa| |Su| Accuracy NLProlog (published code) 0 0 74.57 + Sa 103 0 74.40 + Su (confidence >0.3) 103 340 74.74 + Su (confidence >0.2) 103 577 75.26 + Su (confidence >0.1) 103 832 75.60 Table 4: Performance of NLPROLOG when extracted facts are used as input. Average accuracy over 3 runs is reported. NLPROLOG empowered by 21 natural language explanations and 5 hand-written rules achieves 1% gain in accuracy. Published as a conference paper at ICLR 2020 5 RELATED WORK Leveraging natural language for training classifiers. Supervision in the form of natural language has been explored by many works. Srivastava et al. (2017) first demonstrate the effectiveness of NL explanations. They proposed a joint concept learning and semantic parsing method for classification problems. However, the method is very limited in that it is not able to use unlabeled data. To address this issue, Hancock et al. (2018) propose to parse the NL explanations into labeling functions and then use data programming to handle the conflict and enhancement between different labeling functions. Camburu et al. (2018) extend Stanford Natural Language Inference dataset with NL explanations and demonstrate its usefulness for various goals for training classifiers. Andreas et al. (2016) explore decomposing NL questions into linguistic substructures for learning collections of neural modules which can be assembled into deep networks. Hu et al. (2019) explore using NL instructions as compositional representation of actions for hierarchical decision making. The substructure of an instruction is summarized as a latent plan, which is then executed by another model. Rajani et al. (2019) train a language model to automatically generate NL explanations that can be used during training and inference for the task of commonsense reasoning. Weakly-supervised learning. Our work is relevant to weakly-supervised learning. Traditional systems use handcrafted rules (Hearst, 1992) or automatically learned rules (Agichtein & Gravano, 2000; Batista et al., 2015) to take a rule-based approach. Hu et al. (2019) incorporate human knowledge into neural networks by using a teacher network to teach the classifier knowledge from rules and train the classifier with labeled data. Li et al. (2018) parse regular expression to get action trees as a classifier that are composed of neural modules, so that essentially training stage is just a process of learning human knowledge. Meanwhile, if we regard those data that are exactly matched by rules as labeled data and the remaining as unlabeled data, we can apply many semi-supervised models such as self learning (Rosenberg et al., 2005), mean-teacher (Tarvainen & Valpola, 2017), and semi-supervised VAE (Xu et al., 2017). However, These models turn out to be ineffective in rule-labeled data or explanation-labeled data due to potentially large difference in label distribution. The data sparsity is also partially solved by distant supervision (Mintz et al., 2009; Surdeanu et al., 2012). 
They rely on knowledge bases (KBs) to annotate data. However, the methods introduce a lot of noises, which severely hinders the performance. Liu et al. (2017) instead propose to conduct relation extraction using annotations from heterogeneous information source. Again, predicting true labels from noisy sources is challenging. 6 CONCLUSION In this paper, we presented NEx T, a framework that augments sequence classification by exploiting NL explanations as supervision under a low resource setting. We tackled the challenges of modeling the compositionality of NL explanations and dealing with the linguistic variants. Four types of modules were introduced to generalize the different types of actions in logical forms, which substantially increase the coverage of NL explanations. A joint training algorithm was proposed to utilize information from both labeled dataset and unlabeled dataset. We conducted extensive experiments on several datasets and proved the effectiveness of our model. Future work includes extending NEx T to sequence labeling tasks and building a cross-domain semantic parser for NL explanations. ACKNOWLEDGMENTS This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, NSF SMA 18-29268, and Snap research gift. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. We would like to thank all the collaborators in USC INK research lab for their constructive feedback on the work. Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, pp. 85 94. ACM, 2000. Published as a conference paper at ICLR 2020 Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39 48, 2016. Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. Broad-coverage ccg semantic parsing with amr. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1699 1710, 2015. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ar Xiv preprint ar Xiv:1409.0473, 2014. David S Batista, Bruno Martins, and M ario J Silva. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 499 504, 2015. Oana-Maria Camburu, Tim Rockt aschel, Thomas Lukasiewicz, and Phil Blunsom. e-snli: natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pp. 9539 9549, 2018. Manaal Faruqui, Jesse Dodge, Sujay K Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexicons. ar Xiv preprint ar Xiv:1411.4166, 2014. Braden Hancock, Martin Bringmann, Paroma Varma, Percy Liang, Stephanie Wang, and Christopher R e. Training classifiers with natural language explanations. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2018, pp. 1884. NIH Public Access, 2018. Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. 
In Proceedings of the 14th conference on Computational linguistics-Volume 2, pp. 539 545. Association for Computational Linguistics, 1992. Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O S eaghdha, Sebastian Pad o, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pp. 94 99. Association for Computational Linguistics, 2009. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735 1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx. doi.org/10.1162/neco.1997.9.8.1735. Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, and Mike Lewis. Hierarchical decision making by generating and following natural language instructions. ar Xiv preprint ar Xiv:1906.00744, 2019. Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. 2013. Shen Li, Hengru Xu, and Zhengdong Lu. Generalize symbolic knowledge with neural rule engine. ar Xiv preprint ar Xiv:1808.10326, 2018. Percy Liang, Michael I Jordan, and Dan Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389 446, 2013. Hongtao Lin, Jun Yan, Meng Qu, and Xiang Ren. Learning dual retrieval module for semisupervised relation extraction. In The Web Conference, 2019. Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. Heterogeneous supervision for relation extraction: A representation learning approach. ar Xiv preprint ar Xiv:1707.00166, 2017. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111 3119, 2013. Published as a conference paper at ICLR 2020 Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 1003 1011. Association for Computational Linguistics, 2009. Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning text similarity with siamese recurrent networks. In Rep4NLP@ACL, 2016. Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of NAACL-HLT, pp. 528 540, 2018. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532 1543, 2014. Nazneen Fatema Rajani, Bryan Mc Cann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. ar Xiv preprint ar Xiv:1906.02361, 2019. Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pp. 3567 3575, 2016. Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005. Shashank Srivastava, Igor Labutov, and Tom Mitchell. Joint concept learning and semantic parsing from natural language explanations. 
In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 1527 1536, 2017. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 455 465. Association for Computational Linguistics, 2012. Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195 1204, 2017. Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606 615, 2016. L Weber, P Minervini, J M unchmeyer, U Leser, and T Rockt aschel. Nlprolog: Reasoning with weak unification for question answering in natural language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, Volume 1: Long Papers, volume 57. ACL (Association for Computational Linguistics), 2019. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287 302, 2018. Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1753 1762, 2015. Luke Zettlemoyer and Michael Collins. Online learning of relaxed ccg grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-Co NLL), pp. 678 687, 2007. Published as a conference paper at ICLR 2020 Luke S Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. ar Xiv preprint ar Xiv:1207.1420, 2012. Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. Positionaware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 35 45, 2017. Published as a conference paper at ICLR 2020 A.1 PREDICATES Following Srivastava et al. (2017), we first compile a domain lexicon that maps each word to its syntax and logical predicate. Table 5 lists some frequently used predicates in our parser, descriptions about their function and modules they belong to. 
Predicate Description Module Because, Separator Basic conjunction words None Arg X, Arg Y, Arg Subject, object or aspect term in each task Int, Token, String Primitive data types True, False Boolean operators And, Or, Not, Is, Occur Logical operators that aggregate matching scores Logical Calculation Module Left, Right, Between, Within Return True if one string is left/right/between/within some range of the other string Deterministic Function Number Of Return the number of words in a given range At Most, At Least, Direct, Counting (distance) constraints Soft Counting Module More Than, Less Than, Equals Word, Contains, Link Return a matching score sequence for a sentence and a query String Matching Module Table 5: Frequently used predicates A.2 EXAMPLES FOR COLLECTED EXPLANATIONS. OBJ-ORGANIZATION coach SUBJ-PERSON insisted he would put the club s 2-0 defeat to Palermo firmly behind him and move forward (Label) per:employee of (Explanation) there is only one word "coach" between SUBJ and OBJ Officials in Mumbai said that the two suspects , David Coleman Headley , an American with links of Pakistan , and SUBJ-PERSON , who was born in Pakistan but is a OBJ-NATIONALITY citizen , both visited Mumbai and several other Indian cities in before the attacks , and may have visited some of the sites that were attacked (Label) per:origin (Explanation) the words "is a" appear right before OBJ-NATIONALITY and the word "citizen" is right after OBJ-NATIONALITY Sem Eval 2010 Task 8 The SUBJ-O is caused by the OBJ-O of UV radiation by the oxygen and ozone (Label) Cause-Effect(e2,e1) (Explanation) the phrase "is caused by the" occurs between SUBJ and OBJ and OBJ follows SUBJ SUBJ-O are parts of the OBJ-O OBJ-O disregarded by the compiler (Label) Component-Whole(e1,e2) (Explanation) the phrase "are parts of the" occurs between SUBJ and OBJ and OBJ follows SUBJ Sem Eval 2014 Task 4 - restaurant Published as a conference paper at ICLR 2020 I am relatively new to the area and tried Pick a bgel on 2nd and was disappointed with the service and I thought the food was overated and on the pricey side (Term: food) (Label) negative (Explanation) the words "overated" is within 2 words after term The decor is vibrant and eye-pleasing with several semi-private boths on the right side of the dining hall, which are great for a date (Term: decor) (Label) positive (Explanation) the term is followed by "vibrant" and "eye-pleasing" Sem Eval 2014 Task 4 - laptop It s priced very reasonable and works very well right out of the box. (Term: works) (Label) positive (Explanation) the words "very well" occur directly after the term The DVD drive randomly pops open when it is in my backpack as well, which is annoying (Term: DVD drive) (Label) negative (Explanation) the word "annoying" occurs after term A.3 IMPLEMENTATION DETAILS We use 300-dimensional word embeddings pre-trained by Glo Ve (Pennington et al., 2014). The dropout rate is 0.96 for word embeddings and 0.5 for sentence encoder. The hidden state size of the encoder and attention layer is 300 and 200 respectively. We choose Adagrad as the optimizer and the learning rate for joint model learning is 0.5. For TACRED, we set the learning rate to 0.1 in the pretraining stage. The total epochs for pretraining are 10. The weight for Lsim is set to 0.5. The batch size for pretraining is set to 100. 
For training the classifier, the batch size for labeled data and unlabeled data is 50 and 100 respectively, the weight α for Lu is set to 0.7, the weight β for Lstring is set to 0.2, the weight γ for Lsim is set to 2.5. For Sem Eval 2010 Task 8, we set the learning rate to 0.1 in the pretraining stage. The total epochs for pretraining are 10. The weight for Lsim is set to 0.5. The batch size for pretraining is set to 10. For training the classifier, the batch size for labeled data and unlabeled data is 50 and 100 respectively, the weight α for Lu is set to 0.5, the weight β for Lstring is set to 0.1, the weight γ for Lsim is set to 2. For two datasets in Sem Eval 2014 Task 4, we set the learning rate to 0.5 in the pretraining stage. The total epochs for pretraining are 20. The weight for Lsim is set to 5. The batch size for pretraining is set to 20. For training the classifier, the batch size for labeled data and unlabeled data is 10 and 50 respectively, the weight α for Lu is set to 0.5, the weight β for Lstring is set to 0.1, the weight γ for Lsim is set to 2. For ATAE-LSTM, we set hidden state of attention layer to be 300 dimension. A.4 FULL RESULTS FOR MAIN EXPERIMENTS The full results for relation extraction and sentiment analysis are listed in Table 6 and Table 7 respectively. Published as a conference paper at ICLR 2020 TACRED Sem Eval Metric Precision Recall F1 Precision Recall F1 LF (E) 83.21 13.56 23.33 83.19 21.26 33.86 CBOW-Glo Ve (R + S) 28.2 0.7 44.9 0.9 34.6 0.4 46.8 1.3 51.2 2.2 48.8 1.1 PCNN (Sa) 43.8 1.6 28.9 1.1 34.8 0.9 51.5 1.9 35.2 1.4 41.8 1.2 PA-LSTM (Sa) 44.4 2.9 38.7 2.2 41.3 0.8 59.9 2.4 54.9 2.2 57.3 1.5 Bi LSTM+ATT (Sa) 43.8 2.0 39.4 2.6 41.4 1.0 60.0 2.1 56.2 1.3 58.0 1.6 Bi LSTM+ATT (Sl) 42.8 2.6 23.8 2.4 30.4 1.4 54.7 1.0 53.6 1.2 54.1 1.0 Data Programming (E + S) 45.9 2.8 23.3 2.6 30.8 2.4 51.3 3.5 38.8 4.2 43.9 2.4 Self Training (Sa + Su) 45.9 2.3 38.4 2.7 41.7 1.5 57.3 2.1 53.3 0.9 55.2 0.8 Pseudo Labeling (Sa + Su) 44.5 1.5 38.9 1.6 41.5 1.2 53.7 2.6 53.4 2.2 53.5 1.2 Mean Teacher (Sa + Su) 39.2 1.7 42.6 1.8 40.8 0.9 60.8 1.9 51.9 1.2 56.0 1.1 Mean Teacher (Sl + Slu) 28.3 5.7 25.4 5.8 25.9 2.2 53.1 3.8 51.6 2.4 52.2 0.7 Dual RE (Sa + Su) 38.8 4.7 28.6 2.9 32.6 0.7 64.5 0.7 59.2 2.0 61.7 0.9 NEx T (E + S) 49.2 0.9 42.4 1.3 45.6 0.4 66.3 1.4 61.0 2.2 63.5 1.0 Table 6: Full results as supplement to Table 2(a) Restaurant Laptop Metric Precision Recall F1 Precision Recall F1 LF (E) 86.5 4.0 7.7 90.0 7.1 13.1 CBOW-Glo Ve (R + S) 62.8 2.8 75.3 3.1 68.5 2.9 53.4 1.1 72.6 1.5 61.5 1.3 PCNN (Sa) 67.1 2.1 79.0 1.8 72.6 1.2 53.1 1.0 71.4 1.1 60.9 1.1 ATAE-LSTM (Sa) 65.1 0.4 78.4 0.6 71.1 0.4 49.0 3.1 66.0 4.4 56.2 3.6 ATAE-LSTM (Sl) 65.3 0.5 78.9 0.5 71.4 0.5 48.9 1.5 55.6 2.4 52.0 1.4 Data Programming (E + S) 65.0 0.0 78.8 0.0 71.2 0.0 53.4 0.1 72.5 0.1 61.5 0.1 Self Training (Sa + Su) 65.3 0.7 78.4 0.9 71.2 0.5 50.1 1.8 67.7 2.4 57.6 2.1 Pseudo Labeling (Sa + Su) 64.9 0.5 78.0 0.6 70.9 0.4 50.4 1.6 68.4 2.3 58.0 1.9 Mean Teacher (Sa + Su) 68.8 2.2 75.7 3.9 72.0 1.5 54.4 1.7 72.3 4.0 62.1 2.3 Mean Teacher (Sl + Slu) 68.3 0.8 81.0 0.4 74.1 0.4 55.0 4.1 70.3 3.3 61.7 3.7 NEx T (E + S) 69.6 0.9 83.3 1.8 75.8 0.8 54.6 1.6 73.9 2.3 62.8 1.9 Table 7: Full results as supplement to Table 2(b) A.5 PERFORMANCE WITH DIFFERENT NUMBER OF EXPLANATIONS As a supplement to Fig. 6, we show the full experimental results with different number of explanations as input in Table 8,9,10,11. Results show that our model achieves best performance compared with baseline methods. 
A.6 MODEL-AGNOSTIC Our framework is model-agnostic as it can be integrated with any downstream classifier. We conduct experiments on SA Restaurant dataset with 45 and 75 explanations using BERT as downstream classifier, and the results are summarized in Table 12. Results show that our model still outperforms baseline methods when BERT is incorporated. We observe the performance of NEx T is approaching the upper bound 85% (by feeding all data to BERT), with only 75 explanations, which again demonstrates the annotation efficiency of NEx T. A.7 CASE STUDY ON STRING MATCHING MODULE. String matching module plays a vital role in NEx T. The matching quality greatly influences the accuracy of pseudo labeling. In Fig. 8, we can see that keyword chief executive of is perfectly aligned with executive director of in the sentence, which demonstrates the effectiveness of string matching module in capturing semantic similarity. Published as a conference paper at ICLR 2020 TACRED 130 TACRED 100 Metric Precision Recall F1 Precision Recall F1 LF (E) 83.5 12.8 22.2 85.2 11.8 20.7 CBOW-Glo Ve (R + S) 26.0 2.3 39.9 5.0 31.2 0.5 24.4 1.3 41.7 3.7 30.7 0.1 PCNN (Sa) 41.8 2.7 28.8 1.8 34.1 1.1 28.2 3.4 22.2 1.3 24.8 1.9 PA-LSTM (Sa) 44.9 1.7 33.5 2.9 38.3 1.3 39.9 2.1 38.2 1.1 39.0 1.3 Bi LSTM+ATT (Sa) 40.1 2.6 36.2 3.4 37.9 1.1 36.1 0.4 37.6 3.0 36.8 1.4 Bi LSTM+ATT (Sl) 35.0 9.0 25.4 1.6 28.9 2.7 43.3 2.2 23.1 3.3 30.0 3.1 Self Training (Sa + Su) 43.6 3.3 35.1 2.1 38.7 0.0 41.9 5.9 32.0 7.4 35.5 2.5 Pseudo Labeling (Sa + Su) 44.2 1.9 34.2 1.9 38.5 0.6 39.7 2.0 34.9 3.3 37.1 1.5 Mean Teacher (Sa + Su) 38.8 0.9 35.6 1.3 37.1 0.5 37.4 4.0 37.4 0.2 37.3 2.0 Mean Teacher (Sl + Slu) 21.1 3.3 28.7 1.8 24.2 1.8 17.5 4.7 18.4 .59 17.9 5.0 Dual RE (Sa + Su) 34.9 3.6 30.5 2.3 32.3 1.0 40.6 4.3 19.1 1.5 25.9 0.6 Data Programming (E + S) 34.3 16.1 18.7 1.4 23.5 4.9 43.5 2.3 15.0 2.3 22.2 2.4 NEXT (E + S) 45.3 2.4 39.2 0.3 42.0 1.1 43.9 3.7 36.2 1.9 39.6 0.5 Table 8: TACRED results on 130 explanations and 100 explanations Sem Eval 150 Sem Eval 100 Metric Precision Recall F1 Precision Recall F1 LF (E) 85.1 17.2 28.6 90.7 9.0 16.4 CBOW-Glo Ve (R + S) 44.8 1.9 48.6 1.5 46.6 1.1 36.0 1.4 40.2 2.0 37.9 0.1 PCNN (Sa) 49.1 3.9 36.1 2.4 41.5 1.4 43.3 1.4 27.9 1.0 33.9 0.3 PA-LSTM (Sa) 58.0 1.2 52.5 0.4 55.1 0.5 55.2 1.7 37.7 0.8 44.8 0.8 Bi LSTM+ATT (Sa) 59.2 0.4 53.7 1.8 56.3 0.8 54.9 5.0 40.5 0.9 46.5 1.3 Bi LSTM+ATT (Sl) 47.6 2.6 42.0 2.3 44.6 2.5 43.7 2.6 37.6 5.0 40.3 3.7 Self Training (Sa + Su) 53.4 4.3 47.5 2.9 50.1 1.1 53.2 2.3 34.2 2.2 41.6 1.4 Pseudo Labeling (Sa + Su) 55.3 4.5 51.0 2.3 53.0 1.5 47.4 4.6 39.9 3.9 43.1 0.6 Mean Teacher (Sa + Su) 61.8 4.0 49.1 2.6 54.6 0.2 58.5 1.9 41.8 2.6 48.7 1.4 Mean Teacher (Sl + Slu) 40.6 2.0 31.2 4.5 35.2 3.6 32.7 3.0 25.6 3.1 28.6 2.2 Dual RE (Sa + Su) 61.7 3.0 56.1 3.0 58.8 3.0 61.6 1.7 39.7 1.9 48.3 1.5 Data Programming (E + S) 50.9 10.8 27.0 0.8 35.0 3.2 28.0 4.1 17.4 5.5 21.0 3.4 NEXT (E + S) 68.5 1.6 60.0 1.7 63.7 0.8 60.2 1.8 53.5 0.7 56.7 1.1 Table 9: Sem Eval results on 150 explanations and 100 explanations Laptop 55 Laptop 70 Metric Precision Recall F1 Precision Recall F1 LF (E) 90.8 9.2 16.8 89.4 9.2 16.8 CBOW-Glo Ve (R + S) 53.7 0.2 72.9 0.2 61.8 0.2 53.6 0.3 72.4 0.2 61.6 0.2 PCNN (Sa) 53.5 3.3 71.0 3.6 61.0 3.2 55.6 1.9 74.1 1.9 63.5 1.5 ATAE-LSTM (Sa) 53.5 0.4 71.9 2.2 61.3 1.0 53.7 1.2 72.9 1.8 61.9 1.5 ATAE-LSTM (Sl) 48.3 1.0 59.5 5.0 53.2 2.2 54.1 1.4 61.1 3.0 57.4 2.1 Self Training (Sa + Su) 51.3 2.6 68.6 2.7 58.7 2.6 51.2 1.4 68.6 2.2 58.7 1.6 Pseudo Labeling (Sa + 
Su) 51.8 1.7 70.3 2.3 59.7 1.9 52.4 0.8 70.9 1.5 60.3 1.0 Mean Teacher (Sa + Su) 55.1 0.9 74.1 1.6 63.2 1.1 55.9 3.3 73.0 2.6 63.2 1.7 Mean Teacher (Sl + Slu) 55.5 2.5 69.3 2.8 61.6 2.2 58.0 0.7 73.2 1.5 64.7 1.0 Data Programming (E + S) 53.4 0.0 72.6 0.0 61.5 0.0 53.5 0.1 72.5 0.1 61.6 0.1 NEXT (E + S) 56.3 1.3 75.9 2.5 64.6 1.7 56.9 0.2 77.1 0.6 65.5 0.3 Table 10: Laptop results on 55 explanations and 70 explanations Published as a conference paper at ICLR 2020 Restaurant 60 Restaurant 75 Metric Precision Recall F1 Precision Recall F1 LF (E) 86.0 3.8 7.4 85.4 6.8 12.6 CBOW-Glo Ve (R + S) 63.7 2.3 75.6 1.3 69.1 1.9 64.1 1.3 76.6 0.1 69.8 0.7 PCNN (Sa) 67.0 0.9 81.0 1.0 73.3 0.9 68.4 0.1 82.8 0.3 74.9 0.2 ATAE-LSTM (Sa) 65.2 0.6 78.5 0.2 71.2 0.3 64.7 0.4 78.3 0.4 70.8 0.4 ATAE-LSTM (Sl) 67.0 1.5 79.5 1.2 72.7 1.0 66.6 2.0 78.5 1.4 72.1 0.6 Self Training (Sa + Su) 65.2 0.2 78.7 0.5 71.3 0.2 65.7 1.1 77.2 1.1 71.0 0.1 Pseudo Labeling (Sa + Su) 64.9 0.6 77.8 1.0 70.8 0.3 64.9 0.9 77.8 1.2 70.7 1.0 Mean Teacher (Sa + Su) 68.8 2.3 76.0 2.2 72.2 1.3 73.3 3.5 79.2 3.8 76.0 1.2 Mean Teacher (Sl + Slu) 69.0 0.8 82.0 1.1 74.9 0.7 69.2 0.7 82.6 0.6 75.3 0.6 Data Programming (E + S) 65.0 0.0 78.8 0.1 71.2 0.0 65.0 0.0 78.8 0.0 71.2 0.0 NEXT (E + S) 71.0 1.4 82.8 1.1 76.4 0.4 71.9 1.5 82.8 1.9 76.9 0.7 Table 11: Restaurant results on 60 explanations and 75 explanations OBJ-PERSON , executive director of the SUBJ-ORGANIZATION at Saint Anselm College in Manchester Figure 8: Heatmap for keyword chief executive of and sentence OBJ-PERSON, executive director of the SUBJORGANIZATION at Saint Anselm College in Manchester. Results show that our string matching module can successfully grasp relevant words. ATAE-LSTM (Sa) 79.9 80.6 Self Training (Sa + Su) 80.9 81.1 Pseudo Labeling (Sa + Su) 78.7 81.0 Mean Teacher (Sa + Su) 79.3 79.8 NEx T (E + S) 81.4 82.0 Table 12: BERT experiments on Restaurant dataset using 45 and 75 explanations