# learning_from_explanations_with_neural_execution_tree__89fca7ba.pdf

Published as a conference paper at ICLR 2020

LEARNING FROM EXPLANATIONS WITH NEURAL EXECUTION TREE

Ziqi Wang1, Yujia Qin1, Wenxuan Zhou2, Jun Yan2, Qinyuan Ye2, Leonardo Neves3, Zhiyuan Liu1, Xiang Ren2
Tsinghua University1, University of Southern California2, Snap Research3
{ziqi-wan16, qinyj16}@mails.tsinghua.edu.cn, {liuzy}@tsinghua.edu.cn, {zhouwenx, yanjun, qinyuany, xiangren}@usc.edu, {lneves}@snap.com
(Equal contribution; the order is decided by a coin toss. The work was done when visiting USC.)

While deep neural networks have achieved impressive performance on a range of NLP tasks, these data-hungry models heavily rely on labeled data, which restricts their applications in scenarios where data annotation is expensive. Natural language (NL) explanations have been demonstrated to be very useful additional supervision: they can provide sufficient domain knowledge for generating more labeled data over new instances, while only roughly doubling annotation time. However, directly applying them to augment model learning encounters two challenges: (1) NL explanations are unstructured and inherently compositional, which calls for a modularized model to represent their semantics; (2) NL explanations often have large numbers of linguistic variants, resulting in low recall and limited generalization ability. In this paper, we propose a novel Neural Execution Tree (NExT) framework (Project: http://inklab.usc.edu/project-NExT/; Code: https://github.com/INK-USC/NExT) to augment training data for text classification using NL explanations. After transforming NL explanations into executable logical forms by semantic parsing, NExT generalizes the different types of actions specified by the logical forms for labeling data instances, which substantially increases the coverage of each NL explanation. Experiments on two NLP tasks (relation extraction and sentiment analysis) demonstrate its superiority over baseline methods. An extension to multi-hop question answering achieves a performance gain with light annotation effort.

1 INTRODUCTION

Deep neural networks have achieved state-of-the-art performance on a wide range of natural language processing tasks. However, they usually require massive labeled data, which restricts their applications in scenarios where data annotation is expensive. The traditional way of providing supervision is human-generated labels. See Figure 1 for an example: the sentiment polarity of the sentence "Quality ingredients preparation all around, and a very fair price for NYC" can be labeled as Positive. However, the label itself does not provide information about how the decision is made. A more informative method is to allow annotators to explain their decisions in natural language, so that the annotation can be generalized to other examples. Such an explanation can be "Positive, because the word price is directly preceded by fair", which can be generalized to other instances like "It has delicious food with a fair price". Natural language (NL) explanations have shown effectiveness in providing additional supervision, especially in low-resource settings (Srivastava et al., 2017; Hancock et al., 2018). Also, they can be easily collected from human annotators without significantly increasing annotation effort. However, exploiting NL explanations as supervision is challenging due to the complex nature of human language. First of all, textual data is not well structured, and thus we have to parse explanations into logical forms for machines to better utilize them.
Also, linguistic variants are ubiquitous, which makes it difficult to generalize an NL explanation for matching sentences that are semantically equivalent but use different words. When we perform exact matching with the previous example explanation, it fails to annotate sentences containing "reasonable price" or "good deal".

Figure 1: Matching new instances from a raw corpus using natural language explanations. (The figure shows the labeled sentence "Quality ingredients preparation all around, and a very fair price for NYC" with the explanation "Positive, because the word price is directly preceded by fair", which is then used to match "It has delicious food with a fair price" from the corpus.)

Attempts have been made to train classifiers with NL explanations. Srivastava et al. (2017) use NL explanations as additional features of data: they map explanations to logical forms with a semantic parser and use them to generate binary features for all instances. Hancock et al. (2018) employ a rule-based semantic parser to obtain logical forms (i.e., labeling functions) from NL explanations, which generate noisy labeled datasets used for training models. While both methods report large performance improvements, they neglect the importance of linguistic variants, resulting in very low recall. Also, their way of evaluating explanations on new instances is oversimplified (e.g., exact comparison and logic operators), which makes the resulting matching overly strict. In the example above, the sentence "Decent sushi at a fair enough price" will be rejected because of the "directly preceded" requirement.

To address these issues, we propose the Neural Execution Tree (NExT) framework, which allows deep neural networks to learn from NL explanations, as illustrated in Figure 2. Given a raw corpus and a set of NL explanations, we first parse the NL explanations into machine-actionable logical forms with a combinatory categorial grammar (CCG) based semantic parser. Different from previous work, we soften the annotation process by generalizing the predicates using neural module networks and changing the labeling process from exact matching to fuzzy matching. We introduce four types of matching modules in total, namely the String Matching Module, the Soft Counting Module, the Logical Calculation Module, and the Deterministic Function Module. We calculate matching scores and find for each instance the most similar logical form, so that all instances in the raw corpus can be assigned a label and used to train neural models.

The major contributions of our work are summarized as follows: (1) We propose a novel NExT framework to utilize NL explanations. NExT is able to model the compositionality of NL explanations and improve their generalization ability, so that neural models can leverage unlabeled data to augment model training. (2) We conduct extensive experiments on two representative tasks (relation extraction and sentiment analysis). Experimental results demonstrate the superiority of NExT over various baselines. We also adapt NExT to a multi-hop question answering task, in which it achieves a performance improvement with only 21 explanations and 5 rules.

2 LEARNING TO AUGMENT SEQUENCE MODELS WITH NL EXPLANATIONS

This section first introduces the basic concepts and notation for our problem definition.
Then we give a brief overview of our approach, followed by details of each stage.

Problem Definition. We consider the task of training classifiers with natural language explanations for text classification (e.g., relation extraction and sentiment analysis) in a low-resource setting. Specifically, given a raw corpus $S = \{x_i\}_{i=1}^{N} \subseteq X$ and a predefined label set $Y$, our goal is to learn a classifier $f_c: X \to Y$. We ask human annotators to view a subset $S'$ of the corpus $S$ and provide for each instance $x \in S'$ a label $y$ and an explanation $e$, which explains why $x$ should receive $y$. Note that $|S'| \ll |S|$, which requires our framework to learn with very limited human supervision.

Approach Overview. We develop a multi-stage learning framework to leverage NL explanations in a weakly-supervised setting. An overview of our framework is depicted in Fig. 2. Our NExT framework consists of three stages, namely an explanation parsing stage, a dataset partition stage, and a joint model learning stage.

Explanation Parsing. To leverage the unstructured human explanations $E = \{e_j\}_{j=1}^{|S'|}$, we turn them into machine-actionable logical forms (i.e., labeling functions) (Ratner et al., 2016), denoted as $F = \{f_j: X \to \{0, 1\}\}_{j=1}^{|S'|}$, where 1 indicates that the logical form matches the input sequence and 0 otherwise. To access the labels, we introduce a function $h: F \to Y$ that maps each logical form $f_j$ to the label $y_j$ of its explanation $e_j$. Examples are given in Fig. 2.

Figure 2: Overview of the NExT framework. Natural language explanations are first parsed into logical forms; for example, the explanation "Positive, because the words very nice appear within 3 words after the TERM" is parsed to "return 1 if Word(very nice) Is AtMost(Right(TERM), Num(3 tokens)) else 0". Then we partition the raw corpus $S$ into a labeled dataset $S_a$ and an unlabeled dataset $S_u = S \setminus \{x_i^a\}_{i=1}^{N_a}$. We use matching modules to provide supervision on $S_u$. Finally, supervision from both $S_a$ and $S_u$ is fed into a classifier.

We use Combinatory Categorial Grammar (CCG) based semantic parsing (Zettlemoyer & Collins, 2012; Artzi et al., 2015), an approach that couples syntax with semantics, to convert each NL explanation $e_j$ to a logical form $f_j$. Following Srivastava et al. (2017), we first compile a domain lexicon that maps each word to its syntax and logical predicate. Frequently-used predicates are listed in the Appendix. For each explanation, the parser can generate many possible logical forms based on the CCG grammar. To identify the correct one, we use a feature vector $\phi(f) \in \mathbb{R}^d$ whose elements count the number of applications of each CCG combinator (similar to Zettlemoyer & Collins (2007)). Specifically, given an explanation $e_i$, the semantic parser parameterized by $\theta \in \mathbb{R}^d$ outputs a probability distribution over all possible logical forms $Z_{e_i}$. The probability of a feasible logical form is calculated as:

$$P_\theta(f \mid e_i) = \frac{\exp\big(\theta^\top \phi(f)\big)}{\sum_{f' \in Z_{e_i}} \exp\big(\theta^\top \phi(f')\big)}.$$
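To make this scoring step concrete, here is a minimal sketch of the log-linear distribution above; the candidate parses, feature counts, and parameter values are toy placeholders rather than the paper's actual parser.

```python
import numpy as np

def parse_probabilities(theta: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Log-linear distribution P_theta(f | e) over candidate logical forms.

    theta: (d,) parser parameters.
    phi:   (K, d) feature vectors, one per candidate parse; each element counts
           how often a particular CCG combinator was applied in that parse.
    """
    scores = phi @ theta              # theta^T phi(f) for every candidate
    scores -= scores.max()            # stabilize before exponentiating
    weights = np.exp(scores)
    return weights / weights.sum()

# Toy example: 3 hypothetical candidate parses, 4 combinator-count features.
theta = np.array([0.5, -0.2, 1.0, 0.0])
phi = np.array([[1, 0, 2, 1],
                [0, 1, 1, 0],
                [2, 0, 0, 3]], dtype=float)
print(parse_probabilities(theta, phi))  # the first parse receives the most mass
```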
To learn $\theta$, we maximize the probability of $y_i$ given $e_i$, calculated by marginalizing over all logical forms that match $x_i$ (similar to Liang et al. (2013)). Formally, the objective (maximized over $\theta$) is:

$$\sum_{\substack{f:\, f(x_i) = 1,\\ h(f) = y_i}} P_\theta(f \mid e_i),$$

i.e., the total probability mass assigned to logical forms that both match $x_i$ and predict its label $y_i$. Once the optimal $\theta^*$ is derived with a gradient-based method, the parsing result for $e_i$ is defined as $f_i = \arg\max_f P_{\theta^*}(f \mid e_i)$.

Dataset Partition. After we parse the explanations $\{e_i\}_{i=1}^{|S'|}$ into $F = \{f_i\}_{i=1}^{|S'|}$, where each $f_i$ corresponds to $e_i$, we use $F$ to find exact matches in $S$ and pair them with the corresponding labels. We denote the number of instances labeled by exact matching as $N_a$. As a result, $S$ is partitioned into a labeled dataset $S_a = \{(x_i^a, y_i^a)\}_{i=1}^{N_a}$ and an unlabeled dataset $S_u = S \setminus \{x_i^a\}_{i=1}^{N_a} = \{x_j^u\}_{j=1}^{N_u}$, where $N_u = |S| - N_a$.

Joint Model Learning. The exactly matched $S_a$ can be directly used to train a classifier, while informative instances in $S_u$ would be left untouched. We propose several neural module networks, which relax the constraints in each logical form $f_j$ and substantially improve rule coverage on $S_u$; classifiers benefit from these soft-matched, pseudo-labeled instances. Trainable parameters in the neural module networks are jointly optimized with the classifier. Details of each module and the joint training method are introduced in the next section.
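As a rough illustration of the partition step described above (not the authors' released code), the sketch below splits a corpus using exact-match labeling functions; `labeling_functions` is assumed to be a list of callables returning 0/1, each paired with the label of its source explanation.

```python
from typing import Callable, List, Tuple

LabelingFunction = Tuple[Callable[[str], int], str]  # (f_j, h(f_j))

def partition_corpus(corpus: List[str],
                     labeling_functions: List[LabelingFunction]):
    """Split a raw corpus into an exactly matched labeled set S_a and the rest S_u."""
    labeled, unlabeled = [], []
    for sentence in corpus:
        label = None
        for lf, lf_label in labeling_functions:
            if lf(sentence) == 1:        # hard (exact) match
                label = lf_label
                break
        if label is not None:
            labeled.append((sentence, label))
        else:
            unlabeled.append(sentence)   # later pseudo-labeled by NExT's soft matching
    return labeled, unlabeled

# Hypothetical labeling function derived from "Positive, because 'price' is directly preceded by 'fair'".
lf = (lambda s: int("fair price" in s.lower()), "Positive")
S_a, S_u = partition_corpus(["It has delicious food with a fair price.",
                             "Decent sushi at a fair enough price."], [lf])
# S_a contains only the first sentence; the second one is left for soft matching.
```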
3 NEURAL EXECUTION TREE

Given a logical form $f$ and a sentence $x$, NExT outputs a matching score $u_s \in [0, 1]$ indicating how likely it is that the sentence $x$ satisfies the logical form $f$ and thus should be given the corresponding label $h(f)$. Specifically, NExT comprises four modules that deal with four categories of predicates, namely the String Matching Module, the Soft Counting Module, the Deterministic Function Module, and the Logical Calculation Module. Any complex logical form can be disassembled into clauses containing these four categories of predicates. The four modules are then used to evaluate each clause, and then the whole logical form, in a softened way. Fig. 3 shows how NExT builds the execution tree from an NL explanation and how it evaluates an unlabeled sentence; the figure indicates the corresponding module for each predicate in the logical form.

Figure 3: Neural Execution Tree (NExT) softly executes the logical form on the sentence. (Example for the relation Date_of_Death: the explanation "The words who died precede OBJECT by no more than three words and occur between SUBJECT and OBJECT" is parsed to the logical form "(Word(who died) Is Between(SUBJECT, OBJECT)) And (Word(who died) Is AtMost(Left(OBJECT), Num(3 Tokens)))" and evaluated on the sentence "SUBJECT was murdered on OBJECT".)

3.1 MODULES IN NEXT

String Matching Module. Given a keyword query $q$ derived from an explanation and an input sequence $x = [w_1, w_2, \dots, w_n]$, the string matching module $f_s(x, q)$ returns a sequence of scores $[s_1, s_2, \dots, s_n]$ indicating the similarity between each token $w_i$ and the query $q$. Previous work implements this operation by exact keyword search, while we augment the module with neural networks to enable capturing semantically similar words. Inspired by Li et al. (2018), for token $w_i$ we first generate $N_c$ contexts using sliding windows of various lengths. For example, if the maximum window size is 2, the contexts $c_{i0}, c_{i1}, c_{i2}$ of token $w_i$ are $[w_i]$, $[w_{i-1}; w_i]$, and $[w_i; w_{i+1}]$, respectively. We then encode each context $c_{ij}$ into a vector $z_{c_{ij}}$ by feeding pre-trained word embeddings into a bi-directional LSTM encoder (Hochreiter & Schmidhuber, 1997), whose hidden states are summarized by an attention layer (Bahdanau et al., 2014). The keyword query $q$ is directly encoded into a vector $z_q$ by a BiLSTM and attention. Finally, the scores for sentence $x$ and query $q$ are calculated by aggregating similarity scores from the different sliding windows:

$$M_{ij}(x, q) = \cos(z_{c_{ij}} D,\; z_q D), \qquad f_s(x, q) = M(x, q)\, v,$$

where $D$ is a trainable diagonal matrix and $v \in \mathbb{R}^{N_c}$ holds the trainable weight of each sliding window.

Parameters of the string matching module need to be learned from data of the form (sentence, keyword, label). To build such a training set, we randomly select spans of consecutive words in the training data as keyword queries; each query is paired with the sentence it comes from. The synthesized dataset is denoted as $\{x_i, q_i, k_i\}_{i=1}^{N_{syn}}$, where the $j$-th element of the binary sequence $k_i$ takes the value 1 if $q_i$ is extracted from that position of $x_i$ and 0 otherwise. The loss function is the binary cross-entropy:

$$\mathcal{L}_{find} = -\frac{1}{N_{syn}} \sum_{i=1}^{N_{syn}} \frac{1}{|k_i|} \Big( k_i \log f_s(x_i, q_i) + (1 - k_i) \log\big(1 - f_s(x_i, q_i)\big) \Big),$$

where the products are taken element-wise over token positions.

While pretraining with $\mathcal{L}_{find}$ enables matching of similar words, this unsupervised distributional method is poor at learning their semantic meanings. For example, the word "good" will have relatively low similarity to "great" because there is no such training data. To solve this problem, we borrow the idea of word retrofitting (Faruqui et al., 2014) and adopt a contrastive loss (Neculoiu et al., 2016) to incorporate semantic knowledge into training, using the keyword queries in labeling functions as supervision. Intuitively, the semantic meanings of two queries should be similar if they appear in the same class of labeling functions and dissimilar otherwise. More specifically, for a query $q$, we denote queries in the same class of labeling functions as $Q^+$ and queries in different classes as $Q^-$. The similarity loss is defined as:

$$\mathcal{L}_{sim} = \max_{q_1 \in Q^+} \big(\tau - \cos(z_q D,\, z_{q_1} D)\big)_+^2 \;+\; \max_{q_2 \in Q^-} \cos(z_q D,\, z_{q_2} D)_+^2,$$

where $(\cdot)_+$ denotes clipping at zero and $\tau$ is a margin. The overall objective function for the string matching module is:

$$\mathcal{L}_{string} = \mathcal{L}_{find} + \gamma\, \mathcal{L}_{sim}, \tag{1}$$

where $\gamma$ is a hyper-parameter. We pretrain the string matching module for better initialization.
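To ground the windowed similarity scoring above, here is a rough PyTorch-style sketch; the encoder, the diagonal transform $D$, and the window weights $v$ follow the description, but the dimensions, window layout, and encoder details are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StringMatcher(nn.Module):
    """Scores how well each token of a sentence matches a keyword query.

    Implements M_ij = cos(z_{c_ij} D, z_q D) over three sliding windows
    (max window size 2), aggregated with trainable weights v.
    """
    def __init__(self, embed_dim=300, hidden=150):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.D = nn.Parameter(torch.ones(2 * hidden))          # diagonal matrix, stored as its diagonal
        self.windows = [(0, 0), (-1, 0), (0, 1)]               # [w_i], [w_{i-1}; w_i], [w_i; w_{i+1}]
        self.v = nn.Parameter(torch.full((len(self.windows),), 1.0 / len(self.windows)))

    def encode(self, span_emb):                                # span_emb: (1, len, embed_dim)
        h, _ = self.encoder(span_emb)                          # (1, len, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)                 # attention over positions
        return (a * h).sum(dim=1).squeeze(0)                   # (2*hidden,)

    def forward(self, sent_emb, query_emb):
        """sent_emb: (n, embed_dim) token embeddings; query_emb: (m, embed_dim)."""
        z_q = self.encode(query_emb.unsqueeze(0)) * self.D
        n = sent_emb.size(0)
        scores = torch.zeros(n, len(self.windows))
        for i in range(n):
            for j, (lo, hi) in enumerate(self.windows):        # windows clipped at sentence boundaries
                s, e = max(i + lo, 0), min(i + hi, n - 1)
                z_c = self.encode(sent_emb[s:e + 1].unsqueeze(0)) * self.D
                scores[i, j] = F.cosine_similarity(z_c, z_q, dim=0)
        return scores @ self.v                                  # f_s(x, q): one score per token

matcher = StringMatcher()
sent = torch.randn(8, 300)    # 8 tokens, GloVe-sized embeddings
query = torch.randn(3, 300)   # a 3-token keyword query
print(matcher(sent, query).shape)   # torch.Size([8]); one similarity score per token
```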
Soft Counting Module. The soft counting module aims to relax the counting (distance) constraints defined by NL explanations. For a counting constraint such as "precede OBJECT by no more than three words", the soft counting module outputs a matching score indicating to what extent an anchor word (TERM, SUBJECT, or OBJECT) satisfies the constraint. The score is set to 1 if the position of the anchor word strictly satisfies the constraint, and decreases if the constraint is broken. For simplicity, we allow an additional range in which the score is set to $\mu \in (0, 1)$, a hyper-parameter controlling how strictly the constraints are enforced.

Deterministic Function Module. The deterministic function module deals with deterministic predicates such as Left and Right, which can only be exactly matched, because a string is either to the right or to the left of an anchor word; the probability it outputs is therefore either 0 or 1. The module handles all such predicates and outputs a mask sequence, which is fed into the tree structure and combined with other information.

Logical Calculation Module. The logical calculation module acts as a score aggregator. It can aggregate scores given by: (1) a string matching module together with a soft counting module or deterministic function module (triggered by predicates such as Is and Occur); (2) two clauses that have each already been evaluated to a score (triggered by predicates such as And and Or). In the first case, the logical calculation module computes the element-wise product of the score sequence provided by the string matching module and the mask sequence provided by the soft counting module or deterministic function module, and then applies max pooling to obtain the matching score of the current clause. In the second case, the logical calculation module aggregates the score(s) of one or more clauses based on the logic operation. The rules are defined as follows:

$$p_1 \wedge p_2 = \max(p_1 + p_2 - 1,\, 0), \qquad p_1 \vee p_2 = \min(p_1 + p_2,\, 1), \qquad \neg p = 1 - p,$$

where $p$ is the score of an input clause.
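These aggregation rules are the Łukasiewicz t-norm and t-conorm. A small self-contained sketch of evaluating a clause tree with them is given below; the clause/tree representation is an illustrative assumption, not the paper's actual data structure.

```python
from dataclasses import dataclass
from typing import List

def soft_and(p1: float, p2: float) -> float:
    return max(p1 + p2 - 1.0, 0.0)       # Lukasiewicz t-norm

def soft_or(p1: float, p2: float) -> float:
    return min(p1 + p2, 1.0)             # Lukasiewicz t-conorm

def soft_not(p: float) -> float:
    return 1.0 - p

@dataclass
class Clause:
    """A node of the execution tree: either a leaf score or a logical combination."""
    op: str                               # "leaf", "and", "or", or "not"
    score: float = 0.0                    # used when op == "leaf"
    children: List["Clause"] = None

def evaluate(clause: Clause) -> float:
    if clause.op == "leaf":
        return clause.score
    scores = [evaluate(c) for c in clause.children]
    if clause.op == "and":
        return soft_and(scores[0], scores[1])
    if clause.op == "or":
        return soft_or(scores[0], scores[1])
    return soft_not(scores[0])            # "not"

# (Word(who died) Is Between(SUBJ, OBJ)) And (Word(who died) Is AtMost(Left(OBJ), 3 tokens)),
# with the two clauses already scored 0.9 and 0.6 by the other modules.
tree = Clause("and", children=[Clause("leaf", 0.9), Clause("leaf", 0.6)])
print(evaluate(tree))                     # 0.5
```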
3.2 AUGMENTING MODEL LEARNING WITH NEXT

Algorithm 1: Learning on Unlabeled Data with NExT
Input: labeled data $S_a = \{(x_i^a, y_i^a)\}_{i=1}^{N_a}$, unlabeled data $S_u = \{x_j^u\}_{j=1}^{N_u}$, and logical forms $F = \{f_k\}_{k=1}^{|S'|}$.
Output: a classifier $f_c: X \to Y$.
1. Pretrain the String Matching Module in NExT w.r.t. $\mathcal{L}_{string}$ using Eq. 1.
2. While not converged:
   (a) Sample a labeled batch $B_a = \{(x_i^a, y_i^a)\}_{i=1}^{n}$ from $S_a$ and an unlabeled batch $B_u = \{x_j^u\}_{j=1}^{m}$ from $S_u$.
   (b) For each $x_j^u \in B_u$, calculate a pseudo label $y_j^u$ with confidence $u_j$ using NExT and $F$.
   (c) Normalize the matching scores $\{u_j\}_{j=1}^{m}$ to get $\{\omega_j\}_{j=1}^{m}$ based on Eq. 3.
   (d) Calculate $\mathcal{L}_a$ using Eq. 2, $\mathcal{L}_u$ using Eq. 4, and $\mathcal{L}_{total}$ using Eq. 5.
   (e) Update $f_c$ and the String Matching Module in NExT w.r.t. $\mathcal{L}_{total}$.

As described in Algo. 1, in each iteration we sample two batches, $B_a$ and $B_u$, from $S_a$ and $S_u$. We conduct supervised learning on $B_a$ with the labeled loss:

$$\mathcal{L}_a = -\sum_{(x_i^a, y_i^a) \in B_a} \log p(y_i^a \mid x_i^a). \tag{2}$$

To leverage $B_u$, which is also informative, for each instance $x_j^u \in B_u$ we use our matching modules to compute its matching score with every logical form. The label of the most probable logical form matched with $x_j^u$ is taken as its pseudo label $y_j^u$, along with the matching score $u_j$. (The None label, e.g., No Relation for relation extraction and Neutral for sentiment analysis, usually lacks explanations; if the entropy of the downstream model's prediction distribution over labels is lower than a threshold, a None label is given.) To ensure that the scale of the unlabeled loss is comparable to the labeled loss, we normalize the matching scores among the pseudo-labeled instances in $B_u$ as:

$$\omega_j = \frac{\exp(\theta_t u_j)}{\sum_{k=1}^{|B_u|} \exp(\theta_t u_k)}, \tag{3}$$

where $k$ indexes the instances and the hyper-parameter $\theta_t$ (temperature) controls the shape of the normalized score distribution. Based on that, the unlabeled loss is calculated as:

$$\mathcal{L}_u = -\sum_{x_j^u \in B_u} \omega_j \log p(y_j^u \mid x_j^u). \tag{4}$$

Note that the string matching module is also trainable and plays a vital role in NExT. We jointly train it with the classifier by optimizing:

$$\mathcal{L}_{total} = \mathcal{L}_a + \alpha\, \mathcal{L}_u + \beta\, \mathcal{L}_{string}, \tag{5}$$

where $\alpha$ and $\beta$ are hyper-parameters.
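A compact sketch of how the three loss terms could be combined in one training step is shown below; the classifier, the pseudo-labeling call, the string-loss callable, and the hyper-parameter values are placeholders for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def training_step(classifier, matcher, labeled_batch, unlabeled_batch,
                  pseudo_label_fn, string_loss_fn,
                  alpha=0.5, beta=0.1, temperature=1.0):
    """One joint update combining L_a (Eq. 2), L_u (Eq. 4) and L_string (Eq. 1) as in Eq. 5."""
    # Supervised loss on the exactly matched batch (Eq. 2).
    x_a, y_a = labeled_batch
    loss_a = F.cross_entropy(classifier(x_a), y_a, reduction="sum")

    # Pseudo labels (LongTensor) and matching confidences u_j in [0, 1] from NExT.
    x_u = unlabeled_batch
    y_u, u = pseudo_label_fn(matcher, x_u)
    omega = torch.softmax(temperature * u, dim=0)              # Eq. 3: normalized weights

    # Confidence-weighted unlabeled loss (Eq. 4).
    log_probs = F.log_softmax(classifier(x_u), dim=-1)
    loss_u = -(omega * log_probs.gather(1, y_u.unsqueeze(1)).squeeze(1)).sum()

    # Auxiliary loss that keeps training the string matching module (Eq. 1).
    loss_string = string_loss_fn(matcher)

    return loss_a + alpha * loss_u + beta * loss_string        # Eq. 5
```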
4 EXPERIMENTS

Tasks and Datasets. We conduct experiments on two tasks: relation extraction and aspect-term-level sentiment analysis. Relation extraction (RE) aims to identify the relation type between two entities in a sentence. For example, given the sentence "Steve Jobs founded Apple Inc.", we want to extract the triple (Steve Jobs, Apple Inc., Founder). For RE we choose two datasets, TACRED (Zhang et al., 2017) and SemEval (Hendrickx et al., 2009). Aspect-term-level sentiment analysis (SA) aims to decide the sentiment polarity with regard to a given aspect term. For example, given the sentence "Quality ingredients preparation all around, and a very fair price for NYC", the sentiment polarity of the aspect term "price" is positive, and the explanation can be "The word price is directly preceded by fair." For this task we use two customer review datasets, Restaurant and Laptop, which are part of SemEval 2014 Task 4.

Explanation Collection. We use Amazon Mechanical Turk to collect explanations for a randomly sampled set of instances in each dataset. Turkers are prompted with a list of selected predicates (see Appendix) and several examples of NL explanations. Examples of collected explanations are listed in the Appendix. Statistics of the curated explanations and intrinsic evaluation results of semantic parsing are summarized in Table 1. To ensure a low-resource setting (i.e., $|S'| \ll |S|$), in each experiment we only use a random subset of the collected explanations.

| Dataset | exps | categs | avg ops | logic/% | assertion/% | position/% | counting/% | acc/% |
|---|---|---|---|---|---|---|---|---|
| TACRED | 170 | 13 | 8.2 | 25.8 | 21.3 | 21.4 | 12.4 | 95.3 |
| SemEval | 203 | 9 | 4.2 | 32.7 | 15.9 | 26.3 | 5.5 | 84.2 |
| Laptop | 40 | 8 | 3.9 | 0.0 | 23.8 | 23.8 | 17.5 | 87.2 |
| Restaurant | 45 | 9 | 9.6 | 2.8 | 25.4 | 26.1 | 16.2 | 88.2 |

Table 1: Statistics for human-curated explanations and evaluation of semantic parsing. We report the number of NL explanations (exps), categories of predicates (categs), and operator compositions per explanation (avg ops). We also report the proportions of different types of predicates, where logic denotes logical operators (And, Or), assertion denotes assertion predicates (Occur, Contains), position denotes position predicates (Right, Between), and counting denotes counting predicates (MoreThan, AtMost). We summarize the accuracy (acc) of semantic parsing based on human evaluation.

Compared Methods. As mentioned in Sec. 2, logical forms partition the unlabeled corpus $S$ into a labeled set $S_a$ and an unlabeled set $S_u$. The labeled set $S_a$ can be directly utilized by supervised learning methods: (1) CBOW-GloVe uses bag-of-words (Mikolov et al., 2013) over GloVe embeddings (Pennington et al., 2014) to represent an instance or the surface patterns in NL explanations, and then annotates a sentence with the label of its most similar surface pattern (by cosine similarity). (2) PCNN (Zeng et al., 2015) uses piece-wise max-pooling to aggregate CNN-generated features. (3) LSTM+ATT (Bahdanau et al., 2014) adds an attention layer onto an LSTM to encode a sequence. (4) PA-LSTM (Zhang et al., 2017) combines an LSTM with entity-position-aware attention to conduct relation extraction. (5) ATAE-LSTM (Wang et al., 2016) incorporates the aspect-term information into both the embedding layer and the attention layer to help the model concentrate on different parts of a sentence. For the semi-supervised baselines, the unlabeled data $S_u$ is also introduced for training; for methods requiring rules as input, we use surface-pattern-based rules transferred from explanations. The compared semi-supervised methods are: (1) Pseudo-Labeling (Lee, 2013) first trains a classifier on the labeled dataset, then generates pseudo labels for unlabeled data by selecting the labels with maximum predicted probability. (2) Self-Training (Rosenberg et al., 2005) expands the labeled data by repeatedly selecting the batch of unlabeled data with the highest confidence and generating pseudo-labels for it, until the unlabeled data is used up. (3) Mean-Teacher (Tarvainen & Valpola, 2017) averages model weights instead of label predictions and assumes that similar data points should have similar outputs. (4) DualRE (Lin et al., 2019) jointly trains a relation prediction module and a retrieval module. Learning from explanations is categorized as a third setting, in which explanation-guided pseudo labels are generated for a downstream classifier: (1) Data Programming (Hancock et al., 2018; Ratner et al., 2016) aggregates the results of strict labeling functions for each instance and uses these pseudo-labels to train a classifier. (2) NExT (proposed work) softly applies logical forms to obtain annotations for unlabeled instances and trains a downstream classifier with these pseudo-labeled instances. The downstream classifier is BiLSTM+ATT for relation extraction and ATAE-LSTM for sentiment analysis.

(a) Relation extraction (F1, %):

| Method | TACRED | SemEval |
|---|---|---|
| LF (E) | 23.33 | 33.86 |
| CBOW-GloVe (R + S) | 34.6 ± 0.4 | 48.8 ± 1.1 |
| PCNN (Sa) | 34.8 ± 0.9 | 41.8 ± 1.2 |
| PA-LSTM (Sa) | 41.3 ± 0.8 | 57.3 ± 1.5 |
| BiLSTM+ATT (Sa) | 41.4 ± 1.0 | 58.0 ± 1.6 |
| BiLSTM+ATT (Sl) | 30.4 ± 1.4 | 54.1 ± 1.0 |
| Self-Training (Sa + Su) | 41.7 ± 1.5 | 55.2 ± 0.8 |
| Pseudo-Labeling (Sa + Su) | 41.5 ± 1.2 | 53.5 ± 1.2 |
| Mean-Teacher (Sa + Su) | 40.8 ± 0.9 | 56.0 ± 1.1 |
| Mean-Teacher (Sl + Slu) | 25.9 ± 2.2 | 52.2 ± 0.7 |
| DualRE (Sa + Su) | 32.6 ± 0.7 | 61.7 ± 0.9 |
| Data Programming (E + S) | 30.8 ± 2.4 | 43.9 ± 2.4 |
| NExT (E + S) | 45.6 ± 0.4 | 63.5 ± 1.0 |

(b) Sentiment analysis (F1, %):

| Method | Restaurant | Laptop |
|---|---|---|
| LF (E) | 7.7 | 13.1 |
| CBOW-GloVe (R + S) | 68.5 ± 2.9 | 61.5 ± 1.3 |
| PCNN (Sa) | 72.6 ± 1.2 | 60.9 ± 1.1 |
| ATAE-LSTM (Sa) | 71.1 ± 0.4 | 56.2 ± 3.6 |
| ATAE-LSTM (Sl) | 71.4 ± 0.5 | 52.0 ± 1.4 |
| Self-Training (Sa + Su) | 71.2 ± 0.5 | 57.6 ± 2.1 |
| Pseudo-Labeling (Sa + Su) | 70.9 ± 0.4 | 58.0 ± 1.9 |
| Mean-Teacher (Sa + Su) | 72.0 ± 1.5 | 62.1 ± 2.3 |
| Mean-Teacher (Sl + Slu) | 74.1 ± 0.4 | 61.7 ± 3.7 |
| Data Programming (E + S) | 71.2 ± 0.0 | 61.5 ± 0.1 |
| NExT (E + S) | 75.8 ± 0.8 | 62.8 ± 1.9 |

Table 2: Experimental results on relation extraction and sentiment analysis. Mean and standard deviation of F1 scores (%) over multiple runs are reported (5 runs for RE, 10 runs for SA). LF (E) denotes directly applying the logical forms parsed from explanations. The bracket behind each method indicates the data it uses: S denotes training data without labels, E denotes explanations, R denotes surface-pattern rules transformed from explanations; Sa denotes labeled data annotated with explanations, Su denotes the remaining unlabeled data; Sl denotes labeled data annotated using the same amount of time as creating the explanations E, and Slu denotes the remaining unlabeled data corresponding to Sl.

4.1 RESULTS OVERVIEW

Table 2 (a) lists the F1 scores of all relation extraction models; full results including precision and recall can be found in Appendix A.4. We observe that our proposed NExT consistently outperforms all baseline models in the low-resource setting. We also find that: (1) Directly applying logical forms to unlabeled data results in poor performance; this method achieves high precision but low recall.
Based on our observation of the collected dataset, this is because people tend to use detailed and specific constraints in an NL explanation to ensure they cover all aspects of the instance. As a result, instances that satisfy the constraints are correctly labeled in most cases, so precision is high; meanwhile, generalization ability is compromised, so recall is low. (2) Compared to its downstream classifier baseline (BiLSTM+ATT with Sa), NExT achieves a 4.2% absolute F1 improvement on TACRED and 5.5% on SemEval. This validates that the expansion of rule coverage by NExT is effective and provides useful information for classifier training. (3) The performance gap widens further when we take annotation effort into account. The annotation time for E and Sl is equivalent, but the performance of BiLSTM+ATT significantly degrades with the fewer instances in Sl. (4) Results of semi-supervised methods are unsatisfactory. This may be explained by the difference between the underlying data distributions of Sa and Su.

Table 2 (b) lists the performance of all sentiment analysis models. The observations are similar to those for relation extraction, which strengthens our conclusions and validates the capability of NExT.

4.2 PERFORMANCE ANALYSIS

| | TACRED | SemEval | Restaurant | Laptop |
|---|---|---|---|---|
| Full NExT | 45.6 ± 0.4 | 63.5 ± 1.0 | 75.8 ± 0.8 | 62.8 ± 1.9 |
| No counting | 44.6 ± 0.9 | 63.2 ± 0.7 | 75.6 ± 0.8 | 62.4 ± 1.9 |
| No matching | 41.8 ± 1.1 | 54.6 ± 1.2 | 71.2 ± 0.4 | 57.0 ± 2.7 |
| No $\mathcal{L}_{sim}$ | 42.5 ± 1.0 | 56.2 ± 2.9 | 70.7 ± 0.8 | 59.4 ± 0.7 |
| No $\mathcal{L}_{find}$ | 43.2 ± 1.3 | 60.2 ± 0.9 | 70.0 ± 3.5 | 58.1 ± 2.8 |

Table 3: Ablation studies on the modules of NExT and the losses of the string matching module. F1 scores on the test set are reported. We remove the soft counting module (No counting) and the string matching module (No matching) by only allowing them to give 0/1 results.

Figure 4: NExT's performance w.r.t. the amount of unlabeled data (x-axis: fraction of unlabeled data used; datasets: TACRED and Restaurant).

Effectiveness of softening logical rules. As shown in Table 3, we conduct ablation studies on TACRED and Restaurant. We remove the two modules that support soft logic (by only allowing them to give 0/1 outputs) to see how much rule softening helps in our framework. Both the soft counting module and the string matching module contribute to the performance of NExT, and the string matching module clearly plays a vital role: removing it leads to significant performance drops, which demonstrates the effectiveness of generalizing when applying logical forms. Besides, we examine the impact of $\mathcal{L}_{sim}$ and $\mathcal{L}_{find}$. Removing them severely hurts performance, indicating the importance of semantic learning when performing fuzzy matching.

Performance with different amounts of unlabeled data. To investigate how NExT's performance is affected by the amount of unlabeled data, we randomly sample 10%, 30%, 50%, and 70% of the original unlabeled dataset and rerun the experiments. As illustrated in Fig. 4, NExT benefits from larger amounts of unlabeled data. We attribute this to the high accuracy of the logical forms converted from explanations.

Figure 5: Performance of NExT vs. a traditional supervised method (LSTM+ATT on Sl) as a function of annotation time. Blue denotes NExT and red denotes the traditional supervised method; for each method, one curve shows performance and the other the number of annotations.

Superiority of explanations in data efficiency.
In the real world, with limited human-power, there is a question of whether it is more valuable to spend time explaining existing annotations than just annotating more labels. To answer this question, we conduct experiments on Performance v.s. Time on TACRED dataset. We compare the results of a supervised classfier with only labels as input and our NEx T with both labels and explanations annotated using the same annotation time. The results are listed in Figure 5, from which we can see that NEx T achieves higher performance while labeling speed reduces by half. Performance with different number of explanations. From Fig. 6 , one can clearly observe that all approaches benefit from more labeled data. Our NEx T outperforms all other baselines by a large margin, which indicates the effectiveness of leveraging knowledge embedded in NL explanations. We can also see that, the performance of NEx T with 170 explanations on TACRED equals to about 2500 labeled data using traditional supervised method. Results of Restaurant also have the same trend, which strengthens our conclusion. Besides Fig. 6, we conduct more experiments for this ablation study and make 4 results tables, see Appendix A.5 for details. Published as a conference paper at ICLR 2020 2500 2000 1500 1000 1000 1500 2000 2500 (b) Restaurant Figure 6: Performance with different number of explanations. We choose supervised semi-supervised baselines for comparison. We did experiments on TACRED and Restaurant. Gray dashed lines mean the performance with the corresponding labeled data. 4.3 ADDITIONAL EXPERIMENT ON MULTI-HOP REASONING To further test the capability of NEx T in downstream tasks, we apply it to WIKIHOP (Welbl et al., 2018) country task by fusing NEx T-matched facts into baseline model NLPROLOG (Weber et al., 2019). For a brief introduction, WIKIHOP country task requires a model to select the correct candidate ENT-Y for question Which country is ENT-X in? given a list of support sentences. As part of dataset design, the correct answer can only be found by reasoning over multiple support sentences. NLPROLOG is a model proposed for WIKIHOP. It first extracts triples from support sentences and treats the masked sentence as the relation between two entities. For example, Socrate was born in Athens is converted to (Socrate, ENT1 was born in ENT2 , Athens), where ENT1 was born in ENT2 is later embedded by SENT2VEC (Pagliardini et al., 2018) to represent the relation between ENT1 and ENT2. Triples extracted from supporting sentences are fed into a Prolog reasoner which will do backward chaining and reasoning to arrive at the target statement country(ENT-X, ENT-Y). We refer readers to (Weber et al., 2019) for in-depth introduction of NLProlog NLPROLOG. Fig. 7 shows how the framework in Fig. 2 is adjusted to suit NLPROLOG. We manually choose 3 predicates (i.e., located in, capital of, next to) and annotate 21 support sentences with natural language explanation. We get 103 strictly-matched facts (Sa) and 1407 NEx T-matched facts (Su) among the 128k unlabeled QA support sentences. Additionally, we manually write 5 rules about these 3 predicates for the Prolog solver, e.g. located in(X,Z) located in(X,Y) located in(Y,Z). Results are listed in Table 4. From the result we observe that simply adding the 103 strictly-matched facts is not making notable improvement. 
However, with the help of NEx T, a larger number of structured facts are recognized from support sentences, so that external knowledge from only 21 explanations and 5 rules improve the accuracy by 1 point. This observation validates NEx T s capability in low resource setting and highlight its potential when applied to downstream tasks. Training NLProlog Strictly matched set Fact Extraction Labeled by LF NEx T matched set Labeled by NEx T Figure 7: Adjusting NEx T Framework (Fig. 2) for NLPROLOG. |Sa| |Su| Accuracy NLProlog (published code) 0 0 74.57 + Sa 103 0 74.40 + Su (confidence >0.3) 103 340 74.74 + Su (confidence >0.2) 103 577 75.26 + Su (confidence >0.1) 103 832 75.60 Table 4: Performance of NLPROLOG when extracted facts are used as input. Average accuracy over 3 runs is reported. NLPROLOG empowered by 21 natural language explanations and 5 hand-written rules achieves 1% gain in accuracy. Published as a conference paper at ICLR 2020 5 RELATED WORK Leveraging natural language for training classifiers. Supervision in the form of natural language has been explored by many works. Srivastava et al. (2017) first demonstrate the effectiveness of NL explanations. They proposed a joint concept learning and semantic parsing method for classification problems. However, the method is very limited in that it is not able to use unlabeled data. To address this issue, Hancock et al. (2018) propose to parse the NL explanations into labeling functions and then use data programming to handle the conflict and enhancement between different labeling functions. Camburu et al. (2018) extend Stanford Natural Language Inference dataset with NL explanations and demonstrate its usefulness for various goals for training classifiers. Andreas et al. (2016) explore decomposing NL questions into linguistic substructures for learning collections of neural modules which can be assembled into deep networks. Hu et al. (2019) explore using NL instructions as compositional representation of actions for hierarchical decision making. The substructure of an instruction is summarized as a latent plan, which is then executed by another model. Rajani et al. (2019) train a language model to automatically generate NL explanations that can be used during training and inference for the task of commonsense reasoning. Weakly-supervised learning. Our work is relevant to weakly-supervised learning. Traditional systems use handcrafted rules (Hearst, 1992) or automatically learned rules (Agichtein & Gravano, 2000; Batista et al., 2015) to take a rule-based approach. Hu et al. (2019) incorporate human knowledge into neural networks by using a teacher network to teach the classifier knowledge from rules and train the classifier with labeled data. Li et al. (2018) parse regular expression to get action trees as a classifier that are composed of neural modules, so that essentially training stage is just a process of learning human knowledge. Meanwhile, if we regard those data that are exactly matched by rules as labeled data and the remaining as unlabeled data, we can apply many semi-supervised models such as self learning (Rosenberg et al., 2005), mean-teacher (Tarvainen & Valpola, 2017), and semi-supervised VAE (Xu et al., 2017). However, These models turn out to be ineffective in rule-labeled data or explanation-labeled data due to potentially large difference in label distribution. The data sparsity is also partially solved by distant supervision (Mintz et al., 2009; Surdeanu et al., 2012). 
They rely on knowledge bases (KBs) to annotate data. However, the methods introduce a lot of noises, which severely hinders the performance. Liu et al. (2017) instead propose to conduct relation extraction using annotations from heterogeneous information source. Again, predicting true labels from noisy sources is challenging. 6 CONCLUSION In this paper, we presented NEx T, a framework that augments sequence classification by exploiting NL explanations as supervision under a low resource setting. We tackled the challenges of modeling the compositionality of NL explanations and dealing with the linguistic variants. Four types of modules were introduced to generalize the different types of actions in logical forms, which substantially increase the coverage of NL explanations. A joint training algorithm was proposed to utilize information from both labeled dataset and unlabeled dataset. We conducted extensive experiments on several datasets and proved the effectiveness of our model. Future work includes extending NEx T to sequence labeling tasks and building a cross-domain semantic parser for NL explanations. ACKNOWLEDGMENTS This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, NSF SMA 18-29268, and Snap research gift. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. We would like to thank all the collaborators in USC INK research lab for their constructive feedback on the work. Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, pp. 85 94. ACM, 2000. Published as a conference paper at ICLR 2020 Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39 48, 2016. Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. Broad-coverage ccg semantic parsing with amr. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1699 1710, 2015. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ar Xiv preprint ar Xiv:1409.0473, 2014. David S Batista, Bruno Martins, and M ario J Silva. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 499 504, 2015. Oana-Maria Camburu, Tim Rockt aschel, Thomas Lukasiewicz, and Phil Blunsom. e-snli: natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pp. 9539 9549, 2018. Manaal Faruqui, Jesse Dodge, Sujay K Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. Retrofitting word vectors to semantic lexicons. ar Xiv preprint ar Xiv:1411.4166, 2014. Braden Hancock, Martin Bringmann, Paroma Varma, Percy Liang, Stephanie Wang, and Christopher R e. Training classifiers with natural language explanations. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2018, pp. 1884. NIH Public Access, 2018. Marti A Hearst. Automatic acquisition of hyponyms from large text corpora. 
In Proceedings of the 14th conference on Computational linguistics-Volume 2, pp. 539 545. Association for Computational Linguistics, 1992. Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O S eaghdha, Sebastian Pad o, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pp. 94 99. Association for Computational Linguistics, 2009. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735 1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx. doi.org/10.1162/neco.1997.9.8.1735. Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, and Mike Lewis. Hierarchical decision making by generating and following natural language instructions. ar Xiv preprint ar Xiv:1906.00744, 2019. Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. 2013. Shen Li, Hengru Xu, and Zhengdong Lu. Generalize symbolic knowledge with neural rule engine. ar Xiv preprint ar Xiv:1808.10326, 2018. Percy Liang, Michael I Jordan, and Dan Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389 446, 2013. Hongtao Lin, Jun Yan, Meng Qu, and Xiang Ren. Learning dual retrieval module for semisupervised relation extraction. In The Web Conference, 2019. Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. Heterogeneous supervision for relation extraction: A representation learning approach. ar Xiv preprint ar Xiv:1707.00166, 2017. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111 3119, 2013. Published as a conference paper at ICLR 2020 Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 1003 1011. Association for Computational Linguistics, 2009. Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning text similarity with siamese recurrent networks. In Rep4NLP@ACL, 2016. Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of NAACL-HLT, pp. 528 540, 2018. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532 1543, 2014. Nazneen Fatema Rajani, Bryan Mc Cann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. ar Xiv preprint ar Xiv:1906.02361, 2019. Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pp. 3567 3575, 2016. Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005. Shashank Srivastava, Igor Labutov, and Tom Mitchell. Joint concept learning and semantic parsing from natural language explanations. 
In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 1527 1536, 2017. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pp. 455 465. Association for Computational Linguistics, 2012. Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195 1204, 2017. Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606 615, 2016. L Weber, P Minervini, J M unchmeyer, U Leser, and T Rockt aschel. Nlprolog: Reasoning with weak unification for question answering in natural language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, Volume 1: Long Papers, volume 57. ACL (Association for Computational Linguistics), 2019. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287 302, 2018. Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1753 1762, 2015. Luke Zettlemoyer and Michael Collins. Online learning of relaxed ccg grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-Co NLL), pp. 678 687, 2007. Published as a conference paper at ICLR 2020 Luke S Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. ar Xiv preprint ar Xiv:1207.1420, 2012. Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. Positionaware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 35 45, 2017. Published as a conference paper at ICLR 2020 A.1 PREDICATES Following Srivastava et al. (2017), we first compile a domain lexicon that maps each word to its syntax and logical predicate. Table 5 lists some frequently used predicates in our parser, descriptions about their function and modules they belong to. 
Predicate Description Module Because, Separator Basic conjunction words None Arg X, Arg Y, Arg Subject, object or aspect term in each task Int, Token, String Primitive data types True, False Boolean operators And, Or, Not, Is, Occur Logical operators that aggregate matching scores Logical Calculation Module Left, Right, Between, Within Return True if one string is left/right/between/within some range of the other string Deterministic Function Number Of Return the number of words in a given range At Most, At Least, Direct, Counting (distance) constraints Soft Counting Module More Than, Less Than, Equals Word, Contains, Link Return a matching score sequence for a sentence and a query String Matching Module Table 5: Frequently used predicates A.2 EXAMPLES FOR COLLECTED EXPLANATIONS. OBJ-ORGANIZATION coach SUBJ-PERSON insisted he would put the club s 2-0 defeat to Palermo firmly behind him and move forward (Label) per:employee of (Explanation) there is only one word "coach" between SUBJ and OBJ Officials in Mumbai said that the two suspects , David Coleman Headley , an American with links of Pakistan , and SUBJ-PERSON , who was born in Pakistan but is a OBJ-NATIONALITY citizen , both visited Mumbai and several other Indian cities in before the attacks , and may have visited some of the sites that were attacked (Label) per:origin (Explanation) the words "is a" appear right before OBJ-NATIONALITY and the word "citizen" is right after OBJ-NATIONALITY Sem Eval 2010 Task 8 The SUBJ-O is caused by the OBJ-O of UV radiation by the oxygen and ozone (Label) Cause-Effect(e2,e1) (Explanation) the phrase "is caused by the" occurs between SUBJ and OBJ and OBJ follows SUBJ SUBJ-O are parts of the OBJ-O OBJ-O disregarded by the compiler (Label) Component-Whole(e1,e2) (Explanation) the phrase "are parts of the" occurs between SUBJ and OBJ and OBJ follows SUBJ Sem Eval 2014 Task 4 - restaurant Published as a conference paper at ICLR 2020 I am relatively new to the area and tried Pick a bgel on 2nd and was disappointed with the service and I thought the food was overated and on the pricey side (Term: food) (Label) negative (Explanation) the words "overated" is within 2 words after term The decor is vibrant and eye-pleasing with several semi-private boths on the right side of the dining hall, which are great for a date (Term: decor) (Label) positive (Explanation) the term is followed by "vibrant" and "eye-pleasing" Sem Eval 2014 Task 4 - laptop It s priced very reasonable and works very well right out of the box. (Term: works) (Label) positive (Explanation) the words "very well" occur directly after the term The DVD drive randomly pops open when it is in my backpack as well, which is annoying (Term: DVD drive) (Label) negative (Explanation) the word "annoying" occurs after term A.3 IMPLEMENTATION DETAILS We use 300-dimensional word embeddings pre-trained by Glo Ve (Pennington et al., 2014). The dropout rate is 0.96 for word embeddings and 0.5 for sentence encoder. The hidden state size of the encoder and attention layer is 300 and 200 respectively. We choose Adagrad as the optimizer and the learning rate for joint model learning is 0.5. For TACRED, we set the learning rate to 0.1 in the pretraining stage. The total epochs for pretraining are 10. The weight for Lsim is set to 0.5. The batch size for pretraining is set to 100. 
For training the classifier, the batch size for labeled data and unlabeled data is 50 and 100 respectively, the weight α for Lu is set to 0.7, the weight β for Lstring is set to 0.2, the weight γ for Lsim is set to 2.5. For Sem Eval 2010 Task 8, we set the learning rate to 0.1 in the pretraining stage. The total epochs for pretraining are 10. The weight for Lsim is set to 0.5. The batch size for pretraining is set to 10. For training the classifier, the batch size for labeled data and unlabeled data is 50 and 100 respectively, the weight α for Lu is set to 0.5, the weight β for Lstring is set to 0.1, the weight γ for Lsim is set to 2. For two datasets in Sem Eval 2014 Task 4, we set the learning rate to 0.5 in the pretraining stage. The total epochs for pretraining are 20. The weight for Lsim is set to 5. The batch size for pretraining is set to 20. For training the classifier, the batch size for labeled data and unlabeled data is 10 and 50 respectively, the weight α for Lu is set to 0.5, the weight β for Lstring is set to 0.1, the weight γ for Lsim is set to 2. For ATAE-LSTM, we set hidden state of attention layer to be 300 dimension. A.4 FULL RESULTS FOR MAIN EXPERIMENTS The full results for relation extraction and sentiment analysis are listed in Table 6 and Table 7 respectively. Published as a conference paper at ICLR 2020 TACRED Sem Eval Metric Precision Recall F1 Precision Recall F1 LF (E) 83.21 13.56 23.33 83.19 21.26 33.86 CBOW-Glo Ve (R + S) 28.2 0.7 44.9 0.9 34.6 0.4 46.8 1.3 51.2 2.2 48.8 1.1 PCNN (Sa) 43.8 1.6 28.9 1.1 34.8 0.9 51.5 1.9 35.2 1.4 41.8 1.2 PA-LSTM (Sa) 44.4 2.9 38.7 2.2 41.3 0.8 59.9 2.4 54.9 2.2 57.3 1.5 Bi LSTM+ATT (Sa) 43.8 2.0 39.4 2.6 41.4 1.0 60.0 2.1 56.2 1.3 58.0 1.6 Bi LSTM+ATT (Sl) 42.8 2.6 23.8 2.4 30.4 1.4 54.7 1.0 53.6 1.2 54.1 1.0 Data Programming (E + S) 45.9 2.8 23.3 2.6 30.8 2.4 51.3 3.5 38.8 4.2 43.9 2.4 Self Training (Sa + Su) 45.9 2.3 38.4 2.7 41.7 1.5 57.3 2.1 53.3 0.9 55.2 0.8 Pseudo Labeling (Sa + Su) 44.5 1.5 38.9 1.6 41.5 1.2 53.7 2.6 53.4 2.2 53.5 1.2 Mean Teacher (Sa + Su) 39.2 1.7 42.6 1.8 40.8 0.9 60.8 1.9 51.9 1.2 56.0 1.1 Mean Teacher (Sl + Slu) 28.3 5.7 25.4 5.8 25.9 2.2 53.1 3.8 51.6 2.4 52.2 0.7 Dual RE (Sa + Su) 38.8 4.7 28.6 2.9 32.6 0.7 64.5 0.7 59.2 2.0 61.7 0.9 NEx T (E + S) 49.2 0.9 42.4 1.3 45.6 0.4 66.3 1.4 61.0 2.2 63.5 1.0 Table 6: Full results as supplement to Table 2(a) Restaurant Laptop Metric Precision Recall F1 Precision Recall F1 LF (E) 86.5 4.0 7.7 90.0 7.1 13.1 CBOW-Glo Ve (R + S) 62.8 2.8 75.3 3.1 68.5 2.9 53.4 1.1 72.6 1.5 61.5 1.3 PCNN (Sa) 67.1 2.1 79.0 1.8 72.6 1.2 53.1 1.0 71.4 1.1 60.9 1.1 ATAE-LSTM (Sa) 65.1 0.4 78.4 0.6 71.1 0.4 49.0 3.1 66.0 4.4 56.2 3.6 ATAE-LSTM (Sl) 65.3 0.5 78.9 0.5 71.4 0.5 48.9 1.5 55.6 2.4 52.0 1.4 Data Programming (E + S) 65.0 0.0 78.8 0.0 71.2 0.0 53.4 0.1 72.5 0.1 61.5 0.1 Self Training (Sa + Su) 65.3 0.7 78.4 0.9 71.2 0.5 50.1 1.8 67.7 2.4 57.6 2.1 Pseudo Labeling (Sa + Su) 64.9 0.5 78.0 0.6 70.9 0.4 50.4 1.6 68.4 2.3 58.0 1.9 Mean Teacher (Sa + Su) 68.8 2.2 75.7 3.9 72.0 1.5 54.4 1.7 72.3 4.0 62.1 2.3 Mean Teacher (Sl + Slu) 68.3 0.8 81.0 0.4 74.1 0.4 55.0 4.1 70.3 3.3 61.7 3.7 NEx T (E + S) 69.6 0.9 83.3 1.8 75.8 0.8 54.6 1.6 73.9 2.3 62.8 1.9 Table 7: Full results as supplement to Table 2(b) A.5 PERFORMANCE WITH DIFFERENT NUMBER OF EXPLANATIONS As a supplement to Fig. 6, we show the full experimental results with different number of explanations as input in Table 8,9,10,11. Results show that our model achieves best performance compared with baseline methods. 
A.6 MODEL-AGNOSTIC Our framework is model-agnostic as it can be integrated with any downstream classifier. We conduct experiments on SA Restaurant dataset with 45 and 75 explanations using BERT as downstream classifier, and the results are summarized in Table 12. Results show that our model still outperforms baseline methods when BERT is incorporated. We observe the performance of NEx T is approaching the upper bound 85% (by feeding all data to BERT), with only 75 explanations, which again demonstrates the annotation efficiency of NEx T. A.7 CASE STUDY ON STRING MATCHING MODULE. String matching module plays a vital role in NEx T. The matching quality greatly influences the accuracy of pseudo labeling. In Fig. 8, we can see that keyword chief executive of is perfectly aligned with executive director of in the sentence, which demonstrates the effectiveness of string matching module in capturing semantic similarity. Published as a conference paper at ICLR 2020 TACRED 130 TACRED 100 Metric Precision Recall F1 Precision Recall F1 LF (E) 83.5 12.8 22.2 85.2 11.8 20.7 CBOW-Glo Ve (R + S) 26.0 2.3 39.9 5.0 31.2 0.5 24.4 1.3 41.7 3.7 30.7 0.1 PCNN (Sa) 41.8 2.7 28.8 1.8 34.1 1.1 28.2 3.4 22.2 1.3 24.8 1.9 PA-LSTM (Sa) 44.9 1.7 33.5 2.9 38.3 1.3 39.9 2.1 38.2 1.1 39.0 1.3 Bi LSTM+ATT (Sa) 40.1 2.6 36.2 3.4 37.9 1.1 36.1 0.4 37.6 3.0 36.8 1.4 Bi LSTM+ATT (Sl) 35.0 9.0 25.4 1.6 28.9 2.7 43.3 2.2 23.1 3.3 30.0 3.1 Self Training (Sa + Su) 43.6 3.3 35.1 2.1 38.7 0.0 41.9 5.9 32.0 7.4 35.5 2.5 Pseudo Labeling (Sa + Su) 44.2 1.9 34.2 1.9 38.5 0.6 39.7 2.0 34.9 3.3 37.1 1.5 Mean Teacher (Sa + Su) 38.8 0.9 35.6 1.3 37.1 0.5 37.4 4.0 37.4 0.2 37.3 2.0 Mean Teacher (Sl + Slu) 21.1 3.3 28.7 1.8 24.2 1.8 17.5 4.7 18.4 .59 17.9 5.0 Dual RE (Sa + Su) 34.9 3.6 30.5 2.3 32.3 1.0 40.6 4.3 19.1 1.5 25.9 0.6 Data Programming (E + S) 34.3 16.1 18.7 1.4 23.5 4.9 43.5 2.3 15.0 2.3 22.2 2.4 NEXT (E + S) 45.3 2.4 39.2 0.3 42.0 1.1 43.9 3.7 36.2 1.9 39.6 0.5 Table 8: TACRED results on 130 explanations and 100 explanations Sem Eval 150 Sem Eval 100 Metric Precision Recall F1 Precision Recall F1 LF (E) 85.1 17.2 28.6 90.7 9.0 16.4 CBOW-Glo Ve (R + S) 44.8 1.9 48.6 1.5 46.6 1.1 36.0 1.4 40.2 2.0 37.9 0.1 PCNN (Sa) 49.1 3.9 36.1 2.4 41.5 1.4 43.3 1.4 27.9 1.0 33.9 0.3 PA-LSTM (Sa) 58.0 1.2 52.5 0.4 55.1 0.5 55.2 1.7 37.7 0.8 44.8 0.8 Bi LSTM+ATT (Sa) 59.2 0.4 53.7 1.8 56.3 0.8 54.9 5.0 40.5 0.9 46.5 1.3 Bi LSTM+ATT (Sl) 47.6 2.6 42.0 2.3 44.6 2.5 43.7 2.6 37.6 5.0 40.3 3.7 Self Training (Sa + Su) 53.4 4.3 47.5 2.9 50.1 1.1 53.2 2.3 34.2 2.2 41.6 1.4 Pseudo Labeling (Sa + Su) 55.3 4.5 51.0 2.3 53.0 1.5 47.4 4.6 39.9 3.9 43.1 0.6 Mean Teacher (Sa + Su) 61.8 4.0 49.1 2.6 54.6 0.2 58.5 1.9 41.8 2.6 48.7 1.4 Mean Teacher (Sl + Slu) 40.6 2.0 31.2 4.5 35.2 3.6 32.7 3.0 25.6 3.1 28.6 2.2 Dual RE (Sa + Su) 61.7 3.0 56.1 3.0 58.8 3.0 61.6 1.7 39.7 1.9 48.3 1.5 Data Programming (E + S) 50.9 10.8 27.0 0.8 35.0 3.2 28.0 4.1 17.4 5.5 21.0 3.4 NEXT (E + S) 68.5 1.6 60.0 1.7 63.7 0.8 60.2 1.8 53.5 0.7 56.7 1.1 Table 9: Sem Eval results on 150 explanations and 100 explanations Laptop 55 Laptop 70 Metric Precision Recall F1 Precision Recall F1 LF (E) 90.8 9.2 16.8 89.4 9.2 16.8 CBOW-Glo Ve (R + S) 53.7 0.2 72.9 0.2 61.8 0.2 53.6 0.3 72.4 0.2 61.6 0.2 PCNN (Sa) 53.5 3.3 71.0 3.6 61.0 3.2 55.6 1.9 74.1 1.9 63.5 1.5 ATAE-LSTM (Sa) 53.5 0.4 71.9 2.2 61.3 1.0 53.7 1.2 72.9 1.8 61.9 1.5 ATAE-LSTM (Sl) 48.3 1.0 59.5 5.0 53.2 2.2 54.1 1.4 61.1 3.0 57.4 2.1 Self Training (Sa + Su) 51.3 2.6 68.6 2.7 58.7 2.6 51.2 1.4 68.6 2.2 58.7 1.6 Pseudo Labeling (Sa + 
Su) 51.8 1.7 70.3 2.3 59.7 1.9 52.4 0.8 70.9 1.5 60.3 1.0 Mean Teacher (Sa + Su) 55.1 0.9 74.1 1.6 63.2 1.1 55.9 3.3 73.0 2.6 63.2 1.7 Mean Teacher (Sl + Slu) 55.5 2.5 69.3 2.8 61.6 2.2 58.0 0.7 73.2 1.5 64.7 1.0 Data Programming (E + S) 53.4 0.0 72.6 0.0 61.5 0.0 53.5 0.1 72.5 0.1 61.6 0.1 NEXT (E + S) 56.3 1.3 75.9 2.5 64.6 1.7 56.9 0.2 77.1 0.6 65.5 0.3 Table 10: Laptop results on 55 explanations and 70 explanations Published as a conference paper at ICLR 2020 Restaurant 60 Restaurant 75 Metric Precision Recall F1 Precision Recall F1 LF (E) 86.0 3.8 7.4 85.4 6.8 12.6 CBOW-Glo Ve (R + S) 63.7 2.3 75.6 1.3 69.1 1.9 64.1 1.3 76.6 0.1 69.8 0.7 PCNN (Sa) 67.0 0.9 81.0 1.0 73.3 0.9 68.4 0.1 82.8 0.3 74.9 0.2 ATAE-LSTM (Sa) 65.2 0.6 78.5 0.2 71.2 0.3 64.7 0.4 78.3 0.4 70.8 0.4 ATAE-LSTM (Sl) 67.0 1.5 79.5 1.2 72.7 1.0 66.6 2.0 78.5 1.4 72.1 0.6 Self Training (Sa + Su) 65.2 0.2 78.7 0.5 71.3 0.2 65.7 1.1 77.2 1.1 71.0 0.1 Pseudo Labeling (Sa + Su) 64.9 0.6 77.8 1.0 70.8 0.3 64.9 0.9 77.8 1.2 70.7 1.0 Mean Teacher (Sa + Su) 68.8 2.3 76.0 2.2 72.2 1.3 73.3 3.5 79.2 3.8 76.0 1.2 Mean Teacher (Sl + Slu) 69.0 0.8 82.0 1.1 74.9 0.7 69.2 0.7 82.6 0.6 75.3 0.6 Data Programming (E + S) 65.0 0.0 78.8 0.1 71.2 0.0 65.0 0.0 78.8 0.0 71.2 0.0 NEXT (E + S) 71.0 1.4 82.8 1.1 76.4 0.4 71.9 1.5 82.8 1.9 76.9 0.7 Table 11: Restaurant results on 60 explanations and 75 explanations OBJ-PERSON , executive director of the SUBJ-ORGANIZATION at Saint Anselm College in Manchester Figure 8: Heatmap for keyword chief executive of and sentence OBJ-PERSON, executive director of the SUBJORGANIZATION at Saint Anselm College in Manchester. Results show that our string matching module can successfully grasp relevant words. ATAE-LSTM (Sa) 79.9 80.6 Self Training (Sa + Su) 80.9 81.1 Pseudo Labeling (Sa + Su) 78.7 81.0 Mean Teacher (Sa + Su) 79.3 79.8 NEx T (E + S) 81.4 82.0 Table 12: BERT experiments on Restaurant dataset using 45 and 75 explanations