# Effective Slot Filling via Weakly-Supervised Dual-Model Learning

Jue Wang¹, Ke Chen¹, Lidan Shou¹٫², Sai Wu¹, Gang Chen¹٫²
¹College of Computer Science and Technology, Zhejiang University
²State Key Laboratory of CAD&CG, Zhejiang University
{zjuwangjue,chenk,should,wusai,cg}@zju.edu.cn

## Abstract

Slot filling is a challenging task in Spoken Language Understanding (SLU). Supervised methods usually require large amounts of annotation to maintain desirable performance. A solution to relieve the heavy dependency on labeled data is to employ bootstrapping, which leverages unlabeled data. However, bootstrapping is known to suffer from semantic drift. We argue that semantic drift can be tackled by exploiting the correlation between slot values (phrases) and their respective types. By using a particular kind of weakly-labeled data, namely the plain phrases included in sentences, we propose a weakly-supervised slot filling approach. Our approach trains two models, a classifier and a tagger, which can effectively learn from each other on the weakly-labeled data. The experimental results demonstrate that our approach achieves better results than standard baselines on multiple datasets, especially in the low-resource setting.

## 1 Introduction

Slot filling is an essential and challenging task in Spoken Language Understanding (SLU). The task is usually interpreted as a sequence tagging process, during which slot values, in the form of short phrases (such as named entities), and their corresponding slot types are annotated. For example, as shown in Table 1, there are three slot values in the user utterance, namely *Atlanta*, *Toronto*, and *Friday afternoon*, whose corresponding slot types are departure location, arrival location, and departure time, respectively.

| Utterance | List | flights | from | Atlanta | to | Toronto | Friday | afternoon |
|---|---|---|---|---|---|---|---|---|
| Slot Tags | O | O | O | B-dep.loc | O | B-arr.loc | B-dep.time | I-dep.time |
| Phrases | O | O | O | B | O | B | B | I |

Table 1: An example of annotation for slot filling.

Although some existing supervised approaches (Raymond and Riccardi 2007; Yao et al. 2014; Mesnil et al. 2015) have achieved good results on slot filling, they usually require large amounts of annotated data, which are hardly available in real-world applications. Acquiring labels is costly, and probably the biggest obstacle to the application of these methods. This motivates the need for effective learning techniques that leverage unlabeled data.

Bootstrapping is a popular approach to using unlabeled data. The main idea is to extend new annotations based on existing annotations. However, this method may quickly introduce semantic drift (Curran, Murphy, and Scholz 2007): one mistaken label may cause even more wrong predictions in the subsequent iterations, causing the semantics of slot types to deviate from their original definitions. For the running example in Table 1, a model could accidentally recognize both location mentions (*Atlanta* and *Toronto*) as departure locations. Such results are probably wrong because a person can only be in one departure location at a time. With this error undetected, the slot type departure location may gradually drift from its original meaning. In an extreme case, the model may consider all location mentions to be of this type, yet the loss could still be very small!
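To make the two annotation granularities in Table 1 concrete, the snippet below encodes the same utterance once with full slot tags and once with untyped phrase tags. It is purely illustrative: the data structures are our own assumption and are not taken from the paper or its released code.

```python
# Illustrative only: hypothetical data structures for the two label granularities
# in Table 1 (full slot annotation vs. weak, untyped phrase annotation).

utterance = ["List", "flights", "from", "Atlanta", "to", "Toronto", "Friday", "afternoon"]

# Fully labeled example: BIO tags carry both the phrase boundary and the slot type.
slot_tags = ["O", "O", "O", "B-dep.loc", "O", "B-arr.loc", "B-dep.time", "I-dep.time"]

# Weakly labeled example: only phrase boundaries are known, with no slot types.
phrase_tags = ["O", "O", "O", "B", "O", "B", "B", "I"]

def strip_types(tags):
    """Reduce typed BIO tags to untyped phrase tags (the weak-label format)."""
    return [t.split("-")[0] for t in tags]

assert strip_types(slot_tags) == phrase_tags
```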
To avoid the above problem, we advocate leveraging some special weakly-labeled data. Our approach relies on a key observation: compared to slot type and value annotations, untyped plain phrases (i.e., text chunks, as shown at the bottom of Table 1) are much easier to acquire, since almost all phrases in a sentence (except for auxiliary words) can be regarded as potential slot values. Given an utterance and the phrases included in it, we can always determine the slot type for each phrase. For example, if a user wants to book an airline ticket, the input utterance may contain phrases such as the departure location, the arrival location, the departure date and time, etc. With the plethora of off-the-shelf tools for text chunking, phrases without slot labels can be collected in large numbers.

In this paper, we assume that the dataset is partially annotated, that is, only a small number of utterances are properly labeled with both slot values and their respective slot types, while the rest are all weakly labeled with phrases. To tackle semantic drift, which typically appears as a wrong slot type in our task output, we propose a solution comprising two models with their own task formulations: a Classifier that predicts the slot type given a phrase (slot value), and a Tagger that predicts the phrase (slot value) given a slot type. We design a novel training target on which both models can be collaboratively trained using plain phrases without slot labels. The need for preventing semantic drift justifies our dual-model approach: only if the two models produce consistent results will the joint loss be minimized, which reduces the possibility of a wrong slot type caused by a single model.

We summarize the contributions of our paper as follows. First, we propose a weakly-supervised dual-model learning approach for slot filling. Supplied with very limited labeled utterances, the models can be effectively trained on large sets of weakly-labeled phrases. Second, we explore variants of our method, including one that requires no additional parameters, to ensure that our main idea can be easily integrated into existing slot filling models. Third, we perform extensive experiments over several datasets, and the results show that our approach outperforms conventional supervised methods as well as bootstrapping methods, especially when given very few labeled data.¹

¹Our code is available at https://github.com/LorrinWWW/weakly-supervised-slot-filling
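As a concrete illustration of the weak labels discussed above, the sketch below shows one way to collect untyped B/I/O phrase tags with an off-the-shelf chunker. spaCy noun chunks are our own choice for the example; the paper does not prescribe a specific chunking tool.

```python
# A minimal sketch of collecting weak labels (untyped phrase tags) with an
# off-the-shelf chunker. spaCy noun chunks are used purely for illustration;
# any chunker that yields phrase spans would serve the same purpose.
import spacy

nlp = spacy.load("en_core_web_sm")

def weak_phrase_tags(text: str):
    """Return (tokens, untyped B/I/O tags) for one utterance."""
    doc = nlp(text)
    tags = ["O"] * len(doc)
    for chunk in doc.noun_chunks:          # each chunk is a token span
        tags[chunk.start] = "B"
        for i in range(chunk.start + 1, chunk.end):
            tags[i] = "I"
    return [tok.text for tok in doc], tags

tokens, tags = weak_phrase_tags("List flights from Atlanta to Toronto Friday afternoon")
print(list(zip(tokens, tags)))
```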
## 2 Related Work

Slot filling is usually treated as a sequence tagging task. Standard approaches include MEMMs (McCallum, Freitag, and Pereira 2000) and CRFs (Raymond and Riccardi 2007). Many researchers (Mesnil et al. 2013, 2015; Yao et al. 2013, 2014) apply RNNs (Huang, Xu, and Yu 2015; Lample et al. 2016) to this task with promising results. Extensions include encoder-decoder models (Liu and Lane 2016; Zhu and Yu 2017), memory networks (Chen et al. 2016), the slot-gated model (Goo et al. 2018), the label-recurrent model (Gupta, Hewitt, and Kirchhoff 2019), the SF-ID interrelated model (Haihong et al. 2019), capsule networks (Zhang et al. 2019), and the stack-propagation model (Qin et al. 2019). However, these approaches are sensitive to the size of the training data and cannot achieve acceptable results given very few labeled samples.

To address this issue, some work focuses on exploiting external knowledge via transfer learning (Yang, Salakhutdinov, and Cohen 2017), so that the model can be quickly bootstrapped in a new domain with only a handful of labeled examples (Fritzler, Logacheva, and Kretov 2019) or even without any data (Bapna et al. 2017; Shah et al. 2019; Lee and Jha 2019).

Bootstrapping (Yarowsky 1995) is a well-known technique that leverages unlabeled data. The main idea is to extend new annotations based on the existing annotations. Lee (2013) proposes to use pseudo labels generated by the model as if they were true labels. Thenmalar, Balaji, and Geetha (2015) define a pattern with a small set of training data, and then use it as a seed pattern to generate new patterns. Co-training (Blum and Mitchell 1998; Nigam et al. 2000) is similarly motivated but uses a pair of classifiers to iteratively learn and generate additional training labels. In mutual learning (Zhang et al. 2018), multiple models are trained jointly for the same task by minimizing the KL-divergence between their predictions. Another related technique for leveraging unlabeled data is the Π-model (Sajjadi, Javanmardi, and Tasdizen 2016), where the input is perturbed with noise and the model is required to produce similar predictions with and without the perturbation. Miyato et al. (2018) propose an adversarial training strategy to generate the noise.

Dual learning (He et al. 2016) was first proposed for neural machine translation. It has also been applied in SLU (Su, Huang, and Chen 2019), where natural language generation (NLG) is the dual task. Zhu, Cao, and Yu (2020) propose a semi-supervised learning approach to slot filling which improves performance by integrating a dual task of semantics-to-sentence generation. However, that work is based on a very strong assumption that semantic forms (namely intents and lists of slot-value pairs) are available. Given such slot-value pairs as a gazetteer alone, one can easily build a high-quality slot filler even by keyword lookup. In contrast, our method only leverages plain phrases produced by off-the-shelf chunking tools.

## 3 Approach

### 3.1 Model

We describe the model structures of the Tagger and the Classifier in this section. Both models share the same text encoder, which maps each word in the input utterance to a fixed-dimensional vector. Figure 1 presents the overview. The Classifier predicts the slot type for a given phrase; meanwhile, the Tagger predicts the slot value for a given slot type. Although they are two independent parts, they provide each other with supervision via a dual-model learning mechanism, as will be described in the next subsection.

*Figure 1: Dual-model overview. (a) Tagger; (b) Classifier.*

**Encoder** We first introduce the text encoder used in the Tagger and the Classifier, shown in Figure 1. For an utterance containing $n$ words $x = [x^{(i)}]_{i=0}^{n-1}$, we define the word embeddings $\mathbf{x}_w \in \mathbb{R}^{n \times d_1}$, as well as character embeddings $\mathbf{x}_c \in \mathbb{R}^{n \times d_2}$ computed by an LSTM (Lample et al. 2016). We also consider the contextualized word embeddings $\mathbf{x}_l \in \mathbb{R}^{n \times d_3}$, which are produced by pre-trained language models such as BERT (Devlin et al. 2019). We concatenate these embeddings for each word, use a linear layer to reduce the embedding size, and apply a bidirectional LSTM to compute the final word representations $\bar{\mathbf{x}} \in \mathbb{R}^{n \times d}$:

$$\bar{\mathbf{x}} = \mathrm{LSTM}(\mathrm{Linear}([\mathbf{x}_c; \mathbf{x}_w; \mathbf{x}_l])) \qquad (1)$$

where each word is represented as a $d$-dimensional vector.
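As a concrete reading of Equation 1, the following PyTorch sketch assembles the encoder from the three embedding sources. It is a minimal sketch under our own assumptions about dimensions and module choices, not the authors' released implementation; the character-LSTM and BERT features are abstracted as precomputed inputs.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Text encoder of Equation 1: concatenate word, character, and contextualized
    embeddings, project with a linear layer, then run a bidirectional LSTM."""

    def __init__(self, d1: int, d2: int, d3: int, d: int):
        super().__init__()
        self.proj = nn.Linear(d1 + d2 + d3, d)
        # Bidirectional LSTM with d/2 hidden units per direction -> d-dim outputs.
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)

    def forward(self, xw, xc, xl):
        # xw: (batch, n, d1) word embeddings
        # xc: (batch, n, d2) character-level embeddings (e.g., from a char-LSTM)
        # xl: (batch, n, d3) contextualized embeddings (e.g., from BERT)
        z = self.proj(torch.cat([xc, xw, xl], dim=-1))
        x_bar, _ = self.bilstm(z)           # (batch, n, d) final word representations
        return x_bar
```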
**Tagger** Figure 1a presents the structure of the Tagger. Given an utterance $x$ and a slot type $s$, it predicts a tag for every word in the BIO scheme (Ratinov and Roth 2009), indicating the phrase boundary for slot type $s$. We first use a bidirectional LSTM to contextualize the word embeddings $\bar{\mathbf{x}}$ defined in Equation 1. Then, for each slot type, we use a distinct linear layer to compute the scores for tags B, I, and O. Formally, we have:

$$\mathrm{logits}_{\mathrm{TAG}} = \mathrm{Linear}_s(\mathrm{LSTM}(\bar{\mathbf{x}})) \qquad (2)$$

so the tag probability distribution is:

$$\Pr\nolimits_{\theta_{\mathrm{TAG}}}(P \mid x, s) = \mathrm{softmax}(\mathrm{logits}_{\mathrm{TAG}}) \qquad (3)$$

where $\theta_{\mathrm{TAG}}$ is the set of parameters related to the Tagger; $\mathrm{Linear}_s$ means the model uses different parameters according to the given slot type $s$; and $P = [P^{(i)}]_{i=0}^{n-1}$ is a fixed-length sequence of random variables over BIO tags.

**Classifier** Figure 1b illustrates the model structure of the Classifier. Given an utterance $x$ and a phrase $p = [p^{(i)}]_{i=0}^{n-1}$ in it, the Classifier predicts the slot type of the phrase. We use the text encoder to encode the sequence of words in the utterance, denoted $\bar{\mathbf{x}}$. The input phrase $p$ is given as BIO tags, and we map each tag to a learnable embedding, so the phrase can be represented with embeddings $\mathbf{p}$. We add the utterance embeddings $\bar{\mathbf{x}}$ and phrase embeddings $\mathbf{p}$ word-wise, and then feed them to a bidirectional LSTM to obtain the contextualized hidden vectors $h = [h^{(i)}]_{i=0}^{n-1}$:

$$h = \mathrm{LSTM}(\bar{\mathbf{x}} + \mathbf{p}) \qquad (4)$$

Next, we use attention to aggregate this sequence of vectors into a single vector.
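To illustrate Equations 2-3, here is a minimal PyTorch sketch of the Tagger head on top of the shared encoder output $\bar{\mathbf{x}}$. It reflects our own simplifications rather than the authors' implementation: the slot-type-specific $\mathrm{Linear}_s$ layers are realized as one `nn.Linear` per slot type in a `ModuleList`, and batching over multiple slot types is omitted.

```python
import torch
import torch.nn as nn

class TaggerHead(nn.Module):
    """Tagger of Equations 2-3: a shared BiLSTM over the encoded utterance,
    followed by a slot-type-specific linear layer scoring the B/I/O tags."""

    def __init__(self, d: int, num_slot_types: int, num_tags: int = 3):
        super().__init__()
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        # One linear layer per slot type s (Linear_s in Equation 2).
        self.linear_s = nn.ModuleList(
            nn.Linear(d, num_tags) for _ in range(num_slot_types)
        )

    def forward(self, x_bar: torch.Tensor, slot_type: int) -> torch.Tensor:
        # x_bar: (batch, n, d) encoder output from Equation 1.
        h, _ = self.bilstm(x_bar)                    # contextualize (Equation 2)
        logits_tag = self.linear_s[slot_type](h)     # (batch, n, 3) scores for B, I, O
        return torch.softmax(logits_tag, dim=-1)     # Pr(P | x, s), Equation 3
```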