# crossdomain_slot_filling_as_machine_reading_comprehension__8801ea0d.pdf

Cross-Domain Slot Filling as Machine Reading Comprehension

Mengshi Yu , Jian Liu , Yufeng Chen , Jinan Xu and Yujie Zhang Beijing Jiaotong University, Beijing, China {19120432, jianliu, chenyf, jaxu, yjzhang}@bjtu.edu.cn

With task-oriented dialogue systems being widely applied in everyday life, slot filling, the essential component of task-oriented dialogue systems, is required to be quickly adapted to new domains that contain domain-specific slots with few or no training data. Previous methods for slot filling usually adopt a sequence labeling framework, which, however, often has limited ability when dealing with domain-specific slots. In this paper, we take a new perspective on cross-domain slot filling by framing it as a machine reading comprehension (MRC) problem. Our approach first transforms slot names into well-designed queries, which contain rich informative prior knowledge and are very helpful for the detection of domain-specific slots. In addition, we utilize a large-scale MRC dataset for pre-training, which further alleviates the data scarcity problem. Experimental results on the SNIPS and ATIS datasets show that our approach consistently outperforms the existing state-of-the-art methods by a large margin.¹

1 Introduction

Building a task-oriented dialogue system that can comprehend users' requests and satisfy their needs has been a key component in many intelligent conversation applications [Jaech et al., 2016; Gao et al., 2020; Liang et al., 2020]. As an indispensable part of task-oriented dialogue systems, slot filling aims to identify task-related slot types in certain domains. For instance, as shown in Figure 1, given the user request "book the hat for my classmates" in domain Book Restaurant, we need to fill domain-specific roles like restaurant name and party size description with "the hat" and "my classmates", respectively. Previous methods for slot filling often focus on supervised learning [Zhang and Wang, 2016; Goo et al., 2018; Wu et al., 2020], where large-scale labeled datasets are required. However, slot filling faces rapidly changing domains, and few or no target training data may

Equal Contribution. Corresponding Author. ¹ Code and data available at https://github.com/mengshiY/RCSF

[Figure 1 content]
User Request: book the hat for my classmates

1) Slot filling via sequence labeling:
book/O the/B-r_n hat/I-r_n for/O my/B-p_s_d classmates/I-p_s_d

2) Slot filling as reading comprehension:
Q_r_n: what is the name of the restaurant? A_r_n: the hat
Q_p_s_d: who are the people attending the party? A_p_s_d: my classmates
...

Figure 1: An example of slot filling via the sequence labeling framework and the reading comprehension framework. In the sequence labeling framework, slot labels are annotated in BIO format: B represents the start of a slot span, I the inside of a span, and O denotes that the word does not belong to any slot. In the reading comprehension framework, each slot type corresponds to a well-designed question Q_i, where i denotes the i-th slot we need to fill, and we use the answer A_i of the question Q_i to fill the i-th slot. r_n and p_s_d are short for the slot names restaurant_name and party_size_description, respectively.

be available in a new domain. To alleviate the data scarcity problem in target domains, we need to train a model that can borrow prior experience from source domains and adapt it to target domains quickly with limited training samples. Conventional approaches [Zhang and Wang, 2016; Goo et al., 2018; Wu et al., 2020] treat slot filling as a sequence labeling task, which assigns a label to each token in a given sequence, as shown in Figure 1. However, the sequence labeling framework is data-hungry and does not scale well to new domains that consist of domain-specific slots and usually have few or no training data. To address these issues, [Shah et al., 2019; Liu et al., 2020b; He et al., 2020] add meta-information such as slot descriptions and slot examples to capture the semantic relationship between slot types and input tokens. However, these methods also require slot definitions to be similar between training data and unseen test data. That is, if such systems face completely new slot types (unseen slots), their performance degrades significantly (as seen in our experiments on unseen slots in subsection 4.2). In this paper, we propose a new approach for cross-domain slot filling, which frames the task as a machine reading comprehension (MRC) problem [Hermann et al., 2015].

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

An example of slot filling in the MRC framework is illustrated in Figure 1. We transform each slot type we need to fill into a question, and then fill the slot by answering the question. Specifically, we design three strategies to generate the questions, which will be discussed in detail later. After the questions are generated, we build a BERT-based MRC model [Devlin et al., 2019] to answer each of the questions and synthesize the answers as the final results. To boost slot filling in low-resource scenarios, we also leverage a large-scale MRC dataset for pre-training. Compared with the traditional sequence labeling framework, the MRC framework has the advantage of introducing prior knowledge about slot information into the queries. More importantly, by converting the sequence labeling problem into an MRC problem, we can make full use of large-scale MRC datasets to learn semantic information, which is beneficial for slot filling tasks in the cross-domain setting.

To verify the effectiveness of our approach, we conduct extensive experiments on two benchmark datasets. On the SNIPS dataset, our approach achieves gains in average F1 over the current state-of-the-art model of 18.37%, 21.28% and 15.43% under the zero-shot, 20-shot and 50-shot settings, respectively. On the ATIS dataset, our approach outperforms the existing state of the art by 27.78% under the zero-shot setting, 25.69% under the 20-shot setting and 18.79% under the 50-shot setting. Moreover, further experiments show that even without model pre-training, our model still consistently achieves better results than the existing state-of-the-art approaches. In addition, we investigate the effect of different query generation strategies and find that adding high-quality slot examples to the queries can further improve model performance under the zero-shot setting. Our main contributions can be summarized as follows:

• We propose a new MRC framework to deal with cross-domain slot filling. Compared with previous sequence labeling approaches, our framework can introduce more prior knowledge into the well-designed queries, and thus improve its performance in the zero-shot setting. Moreover, by converting the slot filling task into an MRC task, we are able to utilize a large-scale supervised MRC dataset for pre-training and further improve the performance.

• We devise different strategies to transfer a slot into a query, and conduct a series of studies to explore their effects.

• We conduct extensive experiments on two commonly used datasets and show that our approach consistently outperforms the existing state-of-the-art model by a large margin.

2 Related Work

2.1 Cross-Domain Slot Filling

There are two main challenges in the cross-domain slot filling task. One is to adapt the shared slot types from source domains to target domains, and the other is to handle domain-specific slot types which have few or no supervision signals for training.

To deal with the shared slot types, a common approach is transfer learning (TF). TF aims to adapt the learned source model MS trained on source domain DS to produce a target model MT for target domain DT. TF can be categorized into data-driven transfer and model-driven transfer. Data-driven transfer approaches are based on pre-training and fine-tuning mechanisms. [Goyal et al., 2018] train MS on large-scale DS, then fine-tune MS by replacing the output layer with one corresponding to the label space of DT and further train the model on DT. [Siddhant et al., 2019] leverage large-scale unlabeled data to learn contextual embeddings, i.e., ELMo [Peters et al., 2018], before fine-tuning on DT. Different from data-driven approaches, model-driven approaches [Kim et al., 2017; Jha et al., 2018] alleviate the slot adaptation problem by enabling model re-usability. Although different domains have different slot types, common slots such as date, time and country can be shared. These approaches usually first train MS on these reusable slots, and then the outputs of MS are used to guide the training of MT for new slots.

While TF approaches can share knowledge learned on different domains, such models cannot handle unseen slots. Therefore, researchers [Bapna et al., 2017; Guerini et al., 2018; Lee and Jha, 2019; Shah et al., 2019; Liu et al., 2020b] have started to investigate zero-shot methods, which can be broadly classified into two categories. One is to train the model on slot descriptions which carry information about the slots [Bapna et al., 2017; Lee and Jha, 2019; Liu et al., 2020b]. Slots with similar meanings would have similar descriptions, so it is possible to recognize the unseen slots by training on similar seen slots. The other zero-shot approach explores the usage of slot examples [Shah et al., 2019; Guerini et al., 2018], showing that using a small number of slot examples along with slot descriptions performs better than using the slot descriptions alone. However, these zero-shot approaches simply use the slot information to match its most closely corresponding entities and require slot information to be similar between the seen slots and the unseen slots, which limits their performance. Unlike these works, we utilize the slot information in a more natural way: we transform it into natural questions and obtain slot entities by answering the questions.

2.2 Framing Other NLP Tasks as MRC

Machine Reading Comprehension models [Hermann et al., 2015] predict answer spans from a context given a query. Recently, there has been a trend of transforming NLP tasks into MRC problems. For example, [McCann et al., 2018] use the MRC framework to implement ten different NLP tasks uniformly, all achieving competitive performance. [Li et al., 2020] transform the named entity recognition (NER) task into MRC to handle the nested NER problem. [Gao et al., 2020] leverage MRC datasets and use MRC techniques to enhance the dialogue state tracking task. [Liu et al., 2020a] propose an unsupervised question generation method and utilize a BERT-based question-answering process to bridge MRC and the event extraction problem. Inspired by the great success of MRC, we exploit it to deal with the cross-domain slot filling task. To the best of our knowledge, there is currently no specific research for cross-domain


[Figure 2 content: model diagram. The query tokens q1 ... qm and the user request tokens x1 ... xn are concatenated as [CLS] q1 q2 ... qm [SEP] x1 x2 ... xn [SEP] and fed to BERT, which produces the hidden states H[CLS], H1 ... Hm, H[SEP], H1' ... Hn', H[SEP]. A linear layer with softmax yields Pstart = (p1s, p2s, ..., pns) and Pend = (p1e, p2e, ..., pne), and the start-end matching module produces the answer.
Example: User Request: book the hat for my classmates
Slots: restaurant_name = ?, party_size_description = ?, ...
Queries: Q_r_n: what is the name of the restaurant? Q_p_s_d: who are the people attending the party? ...
Answers: A_r_n: the hat; A_p_s_d: my classmates; ...]

Figure 2: Illustration of our proposed approach RCSF. Given a user request and a set of slots in the specific domain, RCSF first generates queries for all of the slots. Then one query at a time is concatenated with the user request as the input of our backbone model BERT. Next, RCSF predicts the start and end indexes based on the hidden representation generated by BERT and matches the start indexes and end indexes into answer spans through the start-end matching module. Finally, RCSF fills the slots with the answers to their related queries.

slot ﬁlling in the MRC framework. Our work mainly focuses on identifying domain-speciﬁc slot entities, which is signiﬁcantly different from previous work mentioned above.

3 Methodology

Our approach, denoted by RCSF (Reading Comprehension for Slot Filling), is depicted in Figure 2. Given a user request, we are supposed to fill its corresponding slots with tokens from it. First, RCSF generates a query for each slot with different strategies. Then, the query and the user request are concatenated as the input of the RCSF model (we use BERT as the backbone in this paper). RCSF predicts the start and end indexes based on the hidden representation of BERT. To calculate the final answer, the start indexes and end indexes are matched through the start-end matching module. Details are given in the following subsections.

3.1 Task Formulation

Given a user request $X = \{x_1, x_2, \dots, x_n\}$ with $n$ words and a predefined set of slot types $S_Y$ in domain $D$, we need to fill each slot type $y \in S_Y$ with entities in $X$. We convert the tagging-style annotated slot filling dataset to a set of (query, context, answer) triples. Each slot type $y \in S_Y$ is associated with a natural language question (query) $q_y = \{q_{y1}, q_{y2}, \dots, q_{ym}\}$, where $m$ denotes the length of the generated question. Slot tagging can thus be transformed into predicting the answer spans of the specific slot, $z_y = [(s_1, e_1), (s_2, e_2), \dots, (s_t, e_t)]$, where $s_i$ and $e_i$ denote the start and end positions of the $i$-th span, respectively, and $t$ is the number of spans ($1 \le i \le t$).
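The conversion from BIO-tagged data to (query, context, answer) triples can be sketched as follows. This is a minimal illustration rather than the authors' released code: the function names `bio_to_spans` and `to_mrc_triples`, the use of inclusive token indexes, and the extra `cuisine` slot (included only to show an empty answer) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indexes


def bio_to_spans(tags: List[str]) -> Dict[str, List[Span]]:
    """Collect the spans of every slot type from a BIO tag sequence."""
    spans: Dict[str, List[Span]] = {}
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.setdefault(current, []).append((start, i - 1))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def to_mrc_triples(words: List[str], tags: List[str], queries: Dict[str, str]):
    """Build one (query, context, answer spans) triple per slot type in the domain."""
    spans = bio_to_spans(tags)
    return [(query, words, spans.get(slot, [])) for slot, query in queries.items()]


words = "book the hat for my classmates".split()
tags = ["O", "B-restaurant_name", "I-restaurant_name",
        "O", "B-party_size_description", "I-party_size_description"]
queries = {
    "restaurant_name": "what is the restaurant name?",
    "party_size_description": "what is the party size description?",
    "cuisine": "what is the cuisine?",  # hypothetical slot with no answer in this request
}
triples = to_mrc_triples(words, tags, queries)
```

Slots that do not occur in the request receive an empty span list, which corresponds to a query with an empty answer.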

3.2 Query Generation

In our MRC framework, we first transform each slot type into its corresponding query, which contains the prior knowledge we need. The strategy used to generate queries is important for cross-domain slot filling, especially in the zero-shot setting. Since the BERT model we use is pre-trained on an MRC dataset in which queries are natural questions, we should generate natural questions as well to utilize the semantic information of the pre-trained model.

User Input: book the hat for my classmates.

| Slot | Strategy | Query |
|---|---|---|
| restaurant name | Desc. | what is the restaurant name? |
| restaurant name | Trans. | what is the name of the restaurant? |
| restaurant name | Exp. | what is the restaurant name like the maisonette or the robinson house? |
| party size description | Desc. | what is the party size description? |
| party size description | Trans. | who are the people attending the party? |
| party size description | Exp. | what is the party size description like me or my colleague? |

Table 1: An example of the three query generation strategies. Desc., Trans., and Exp. indicate queries based on slot description, back-translation, and examples, respectively. Some queries with empty answers are omitted in the above example for brevity.

Therefore, we use templates such as "what is the ___?", where the blank is filled with slot information, to construct queries. As shown by the examples in Table 1, we propose the following three strategies for query generation:

• Description: We turn slot names into their corresponding slot descriptions by replacing punctuation marks such as "_" and "." with blanks and replacing abbreviations with their original words. Queries are constructed by filling the above template with the slot descriptions directly.

• Back-translation: As above, we first use the slot descriptions to construct questions. However, simply joining the template and the slot descriptions may introduce extra noise caused by grammatical errors. To remove this noise, we translate the constructed questions into Chinese and then translate them back into English. After the round-trip translation, the queries are more natural.

• Example: The queries are constructed using slot descriptions and two slot examples from the training and validation datasets. We use the template "what is the [slot description] like [example 1] or [example 2]?".
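The Description and Example strategies are simple template fills, sketched below. The function names and the abbreviation table are hypothetical and chosen only for illustration. The Back-translation strategy is not shown because it depends on an external translation system that the text does not specify.

```python
import re
from typing import Dict, Optional


def slot_description(slot_name: str, abbreviations: Optional[Dict[str, str]] = None) -> str:
    """Replace '_' and '.' with blanks and expand abbreviations word by word."""
    table = abbreviations or {}
    words = re.sub(r"[_.]", " ", slot_name).split()
    return " ".join(table.get(word, word) for word in words)


def description_query(slot_name: str, abbreviations: Optional[Dict[str, str]] = None) -> str:
    """Description strategy: fill the template 'what is the ___?'."""
    return f"what is the {slot_description(slot_name, abbreviations)}?"


def example_query(slot_name: str, example_1: str, example_2: str,
                  abbreviations: Optional[Dict[str, str]] = None) -> str:
    """Example strategy: append two slot examples to the description."""
    desc = slot_description(slot_name, abbreviations)
    return f"what is the {desc} like {example_1} or {example_2}?"


desc_q = description_query("restaurant_name")
exp_q = example_query("party_size_description", "me", "my colleague")
# Hypothetical abbreviation table, for illustration only.
expanded = slot_description("fromloc.city_name", {"fromloc": "from location"})
```

With the slot names and examples from Table 1, these functions reproduce the Desc. and Exp. queries listed there.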

3.3 Slot Filling as Answer Prediction

Given the query $q_y$ and the context $X$, we need to extract the answer spans $z_y$ under the MRC framework. BERT [Devlin et al., 2019] is used as the backbone. As depicted in Figure 2, we concatenate the question $q_y$ and the input sentence $X$ into the input sequence $I = \{[\mathrm{CLS}], q_1, q_2, \dots, q_m, [\mathrm{SEP}], x_1, x_2, \dots, x_n, [\mathrm{SEP}]\}$ for BERT, where [CLS] and [SEP] stand for the classifier token and the sentence separator token in BERT, respectively. BERT then receives the input sequence and generates a context representation matrix $H \in \mathbb{R}^{n \times d}$, where $d$ is the dimension of the last layer of BERT.

Start and End Prediction

In the traditional MRC framework, one query usually has one answer. However, in our approach, one query may correspond to multiple answers. Therefore, we construct two binary classifiers. One is used to predict whether a token is a start index, and the other is used to predict whether a token is an end index. Given the representation matrix $H$ output by BERT, the model first predicts the probability $P_{start}$ of each token being a start index as follows:

$$L_{start} = \mathrm{Linear}(H W_{start}), \quad L_{start} \in \mathbb{R}^{n \times 2} \tag{1}$$

$$P_{start} = \mathrm{Softmax}(L_{start} V_{start}), \quad P_{start} \in \mathbb{R}^{n \times 2} \tag{2}$$

where Linear denotes a fully connected layer and Softmax represents the softmax function. $W_{start}$ and $V_{start}$ are trainable weights. The end index prediction procedure is exactly the same, except that we use other matrices $W_{end}$ and $V_{end}$ to obtain the probability matrix $P_{end}$ of each token being an end index:

$$L_{end} = \mathrm{Linear}(H W_{end}), \quad L_{end} \in \mathbb{R}^{n \times 2} \tag{3}$$

$$P_{end} = \mathrm{Softmax}(L_{end} V_{end}), \quad P_{end} \in \mathbb{R}^{n \times 2} \tag{4}$$
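A per-token binary classifier of this kind can be sketched in plain Python as below. This is a simplified illustration, not the exact layers of Eq. 1 to Eq. 4: it assumes a single affine map from the hidden size $d$ to two logits followed by a softmax, and the names `binary_head`, `weight` and `bias` are hypothetical. The same function would be applied twice, once with start parameters and once with end parameters.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_head(hidden: List[List[float]], weight: List[List[float]],
                bias: List[float]) -> List[List[float]]:
    """Map an n x d hidden matrix to n x 2 probabilities (column 1 = 'is boundary')."""
    probs = []
    for h in hidden:
        logits = [sum(h[k] * weight[k][c] for k in range(len(h))) + bias[c]
                  for c in range(2)]
        probs.append(softmax(logits))
    return probs


# Toy values: n = 2 tokens, d = 2 hidden units.
hidden = [[1.0, 0.0], [0.0, 1.0]]
weight = [[-1.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0]
p_start = binary_head(hidden, weight, bias)
```

Each row of the result sums to one, and the second column plays the role of the per-token start (or end) probability used in the matching step.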

Start-End Matching

In the context $X$, there can be multiple entities of the same category, which means we need to predict multiple start-end pairs. Traditional methods [Sun et al., 2020] obtain the start-end pairs by matching each start index with its nearest end index, which does not work well here since the predicted slot entities could overlap and we might lose the most probable one when eliminating overlaps. We therefore adopt the principle of the most probable pair first. That is, we first sort the start indexes and end indexes by their probabilities $P_{start}$ and $P_{end}$. Then we choose the top-$N$ start indexes and the top-$N$ end indexes, where $N$ is a predefined number:

$$I_{start} = \{\, i \mid P^{i}_{start} > t,\ i = 1, 2, \dots, N \,\} \tag{5}$$

$$I_{end} = \{\, j \mid P^{j}_{end} > t,\ j = 1, 2, \dots, N \,\} \tag{6}$$

where $i$ and $j$ denote the $i$-th and $j$-th rows of a matrix, respectively, and $t$ is the minimum probability of the top-$N$ indexes.

With the sets of the most probable start indexes $I_{start}$ and end indexes $I_{end}$, we calculate the probability $P^{ij}$ of each start-end pair by adding $P^{i}_{start}$ and $P^{j}_{end}$, where $P^{i}_{start}$ denotes the probability of the $i$-th token being a start token and $P^{j}_{end}$ denotes the probability of the $j$-th token being an end token. Then, we sort the matched start-end pairs by $P^{ij}$ and repeatedly choose the most probable pair that does not overlap with the pairs already chosen.
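The most-probable-pair-first principle can be sketched as a greedy selection over scored candidate pairs. The sketch below is illustrative only: `match_spans` and its parameters are hypothetical names, and the fixed `min_prob` cut-off is an added assumption (the text defines $t$ as the minimum probability among the top-$N$ indexes) so that a query with no answer can return no span.

```python
from typing import List, Tuple


def match_spans(p_start: List[float], p_end: List[float],
                top_n: int = 3, min_prob: float = 0.5) -> List[Tuple[int, int]]:
    """Greedily pick non-overlapping (start, end) pairs, most probable pair first."""

    def top_indexes(probs: List[float]) -> List[int]:
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        # Assumption: also require a minimum probability, not only top-N rank.
        return [i for i in ranked[:top_n] if probs[i] > min_prob]

    starts, ends = top_indexes(p_start), top_indexes(p_end)
    # Score each valid pair by the sum of its start and end probabilities.
    candidates = [(p_start[s] + p_end[e], s, e)
                  for s in starts for e in ends if s <= e]
    candidates.sort(key=lambda c: c[0], reverse=True)

    chosen: List[Tuple[int, int]] = []
    for _, s, e in candidates:
        # Keep the pair only if it does not overlap any pair chosen so far.
        if all(e < cs or s > ce for cs, ce in chosen):
            chosen.append((s, e))
    return sorted(chosen)


# Toy probabilities of each token being a start / an end of a span.
p_start = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
p_end = [0.1, 0.1, 0.95, 0.1, 0.1, 0.7]
spans = match_spans(p_start, p_end)
```

In this toy input the candidate pair (1, 5) scores higher than (4, 5) but overlaps the already chosen pair (1, 2), so it is skipped and (4, 5) is selected instead.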

3.4 Train and Test

To utilize the large-scale MRC dataset, we adopt a two-stage training procedure. Our MRC model is first pre-trained on the MRC dataset SQuAD2.0 [Rajpurkar et al., 2018], and then fine-tuned on queries and answers created from our slot filling datasets. In the training stage, each context $X$ is paired with two label sequences $Y_{start}$ and $Y_{end}$, which denote the ground-truth label of each token $x_i$ being the start index and the end index of an entity, respectively. The loss functions are defined as follows:

$$Loss_{start} = \mathrm{CE}(P_{start}, Y_{start}) \tag{7}$$

$$Loss_{end} = \mathrm{CE}(P_{end}, Y_{end}) \tag{8}$$

$$Loss = \lambda\, Loss_{start} + (1 - \lambda)\, Loss_{end} \tag{9}$$

where CE represents the cross-entropy loss function and $\lambda \in [0, 1]$ is a balancing factor used to control the overall training objective. In our experiments, we set $\lambda = 0.5$ according to the performance of the model on the validation set. At test time, start and end indexes are first selected separately based on Eq. 5 and Eq. 6. Then the start-end matching module is applied to align the extracted start indexes and end indexes, leading to the final extracted answers.
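The combined objective of Eq. 7 to Eq. 9 can be written out in a few lines. This sketch assumes that CE is averaged over tokens, which the text does not state, and the function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(probs: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood; probs is n x 2 and labels holds 0 or 1 per token."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def total_loss(p_start: List[List[float]], y_start: List[int],
               p_end: List[List[float]], y_end: List[int],
               lam: float = 0.5) -> float:
    """Loss = lam * Loss_start + (1 - lam) * Loss_end, as in Eq. 9."""
    return (lam * cross_entropy(p_start, y_start)
            + (1.0 - lam) * cross_entropy(p_end, y_end))


# Toy example with three tokens; token 1 starts a span and token 2 ends it.
p_start = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
p_end = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]]
y_start, y_end = [0, 1, 0], [0, 0, 1]
loss = total_loss(p_start, y_start, p_end, y_end, lam=0.5)
```

Setting `lam` to 1.0 or 0.0 recovers the start loss or the end loss alone, which makes the role of the balancing factor easy to check.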

4 Experiments

4.1 Experimental Settings

Datasets

We evaluate our framework on SNIPS [Coucke et al., 2018], a public spoken language understanding dataset which contains 39 slot types across seven domains (intents) and about 2000 samples per domain. To simulate cross-domain scenarios, we follow [Liu et al., 2020b] to split the dataset; that is, each time we choose one domain as the target domain and the other six domains as the source domains. However, the domains in SNIPS are not completely independent of each other. We therefore use another commonly used dataset, ATIS [Hemphill et al., 1990], as a target domain to test our model. It consists of 5971 utterances related to the airline travel domain with 83 slot types.

Baselines We compare our approach with the following baselines:

Concept Tagger (CT): A method proposed by [Bapna et al., 2017], which utilizes slot descriptions to boost the performance on detecting unseen slots.

Robust Zero-shot Tagger (RZT): Based on CT, [Shah et al., 2019] leverage both slot descriptions and examples to improve the robustness of zero-shot slot ﬁlling.


| Corpus | Domain | CT (0) | RZT (0) | Coach (0) | RCSF (0) | CT (20) | RZT (20) | Coach (20) | RCSF (20) | CT (50) | RZT (50) | Coach (50) | RCSF (50) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SN | Add To Playlist | 38.82 | 42.77 | 50.90 | 68.70 | 58.36 | 63.18 | 62.76 | 88.37 | 68.69 | 74.89 | 74.68 | 91.92 |
| SN | Book Restaurant | 27.54 | 30.68 | 34.01 | 63.49 | 45.65 | 50.54 | 65.97 | 85.56 | 54.22 | 54.49 | 74.82 | 89.64 |
| SN | Get Weather | 46.45 | 50.28 | 50.47 | 65.36 | 54.22 | 58.86 | 67.89 | 88.83 | 63.23 | 58.87 | 79.64 | 93.90 |
| SN | Play Music | 32.86 | 33.12 | 32.01 | 53.51 | 46.35 | 47.20 | 54.04 | 80.95 | 54.32 | 59.20 | 66.38 | 86.59 |
| SN | Rate Book | 14.54 | 16.43 | 22.06 | 36.51 | 64.37 | 63.33 | 74.68 | 93.35 | 76.45 | 76.87 | 84.62 | 94.06 |
| SN | Search Creative Work | 39.79 | 44.45 | 46.65 | 69.22 | 57.83 | 63.39 | 57.19 | 81.30 | 66.38 | 67.81 | 64.56 | 86.23 |
| SN | Search Screening Event | 13.83 | 12.25 | 25.63 | 33.54 | 48.59 | 49.18 | 67.38 | 80.46 | 70.67 | 74.58 | 83.85 | 94.22 |
| SN | Average F1 | 30.55 | 32.85 | 37.39 | 55.76 | 53.62 | 56.53 | 64.27 | 85.55 | 64.85 | 66.67 | 75.51 | 90.94 |
| AT | Airline Travel | 2.14 | 2.86 | 1.64 | 30.64 | 26.05 | 41.37 | 54.91 | 80.60 | 35.87 | 51.80 | 66.99 | 85.78 |

Table 2: F1-scores (%) on SNIPS (SN) and ATIS (AT) for different target domains under zero-shot and few-shot learning settings. The number in parentheses after each model name is the training setting: (0) zero-shot, (20) few-shot on 20 (1%) samples, (50) few-shot on 50 (2.5%) samples. Scores in each row represent the performance on the leftmost domain, and RCSF denotes our approach using queries constructed from slot descriptions. Since the SNIPS dataset consists of multiple domains, we calculate the average F1 over all domains. We also tried Coach with BERT encoders, but it did not perform better than with its original LSTM encoders.

Coarse-to-fine Approach (Coach): A two-stage method proposed by [Liu et al., 2020b], which contains a coarse-grained BIO 3-way classification and a fine-grained slot type prediction. Slot descriptions are used in the second stage to help recognize unseen slots, and template regularization is applied to further improve the slot filling performance on similar or identical slot types.

Implementation Details

We use BertForQuestionAnswering² implemented by Hugging Face as our base model, and load the pre-trained weights provided by deepset³, who pre-train the BERT model on the question answering dataset SQuAD2.0 [Rajpurkar et al., 2018]. The Adam optimizer [Kingma and Ba, 2014] is applied to optimize all parameters with a learning rate of 1e-5. We set the batch size to 64 and the maximum sequence length to 128. The patience for early stopping is set to 5. As for the baseline models, we use the implementation of [Liu et al., 2020b]⁴ and follow the same settings for a fair comparison. F1-score is used as the evaluation metric. A slot span is considered correct only if its range and slot type are both correct. We tune all hyper-parameters on the validation set and use the best checkpoint to test our model.
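The exact-match criterion for slot spans can be made concrete with a small F1 routine. This is a sketch of a micro-averaged span F1 under the stated rule (range and slot type must both match); the tuple layout and the function name `span_f1` are assumptions, and the baselines' actual evaluation script may differ in detail.

```python
from typing import Iterable, Tuple

# (sentence id, slot type, start index, end index), indexes inclusive
SlotSpan = Tuple[int, str, int, int]


def span_f1(gold: Iterable[SlotSpan], pred: Iterable[SlotSpan]) -> float:
    """Micro-averaged F1 where a span counts only if range and slot type both match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, "restaurant_name", 1, 2), (0, "party_size_description", 4, 5)]
# The second prediction has the right slot type but the wrong range, so it is an error.
pred = [(0, "restaurant_name", 1, 2), (0, "party_size_description", 5, 5)]
f1 = span_f1(gold, pred)
```

Here one of two predictions is correct and one of two gold spans is recovered, so precision, recall and F1 are all 0.5.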

4.2 Main Results and Discussions

Table 2 presents the main results of our MRC model compared to the baselines. On the SNIPS dataset, our approach outperforms the state-of-the-art model (Coach) by 18.37% in average F1 under the zero-shot setting, 21.28% under the 20-shot setting and 15.43% under the 50-shot setting, which demonstrates the effectiveness of our method. To simulate the cross-domain situation in the real world, we also test our model on the ATIS dataset with the SNIPS dataset as the training set. In this setting, Airline Travel in ATIS is considered the target domain while all seven domains in SNIPS are taken as the source domains. Our model still outperforms the existing state-of-the-art approach by a large margin, especially in

² https://github.com/huggingface/transformers
³ https://huggingface.co/deepset
⁴ https://github.com/zliucr/coach

| Method | US (Slot), 0 samples | US (Sen.), 0 samples | US (Slot), 20 samples | US (Sen.), 20 samples |
|---|---|---|---|---|
| CT | 3.47 | 27.10 | 42.16 | 50.13 |
| RZT | 1.69 | 28.28 | 41.88 | 52.56 |
| Coach | 11.66 | 34.09 | 53.96 | 64.16 |
| RCSF | 25.44 | 41.99 | 84.94 | 87.37 |

Table 3: Averaged F1-scores (%) over all target domains on SNIPS (SN) for unseen slots and unseen sentences. Scores in each row represent the performance of the leftmost method, and RCSF denotes our approach. Target Samples (0 or 20) is the number of training samples in the target domain. US (Slot) and US (Sen.) indicate results on unseen slots and unseen sentences, respectively.

zero-shot setting, which shows that our model can fully make use of the semantic information encoded by queries and has the ability to recognize unseen slots while the baseline models fail to predict unseen slots in the irrelevant target domain Airline Travel.
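The reported margins can be reproduced directly from Table 2 as differences in F1 points. The short check below copies the Table 2 values; the dictionary layout is only for illustration. For ATIS it compares against the strongest baseline in each setting, which is RZT in the zero-shot column and Coach in the few-shot columns.

```python
# Average F1 on SNIPS from Table 2 (Coach is the strongest baseline in every setting).
snips_coach = {"zero-shot": 37.39, "20-shot": 64.27, "50-shot": 75.51}
snips_rcsf = {"zero-shot": 55.76, "20-shot": 85.55, "50-shot": 90.94}

# F1 on ATIS (Airline Travel) from Table 2, baselines listed as (CT, RZT, Coach).
atis_baselines = {
    "zero-shot": (2.14, 2.86, 1.64),
    "20-shot": (26.05, 41.37, 54.91),
    "50-shot": (35.87, 51.80, 66.99),
}
atis_rcsf = {"zero-shot": 30.64, "20-shot": 80.60, "50-shot": 85.78}

# Gains are absolute differences in F1 points, rounded to two decimals.
snips_gains = {k: round(snips_rcsf[k] - snips_coach[k], 2) for k in snips_rcsf}
atis_gains = {k: round(atis_rcsf[k] - max(atis_baselines[k]), 2) for k in atis_rcsf}

print(snips_gains)
print(atis_gains)
```

The results match the figures quoted in the text (18.37, 21.28, 15.43 on SNIPS and 27.78, 25.69, 18.79 on ATIS), which shows that the percentages are absolute F1 differences rather than relative improvements.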

Analysis on Unseen Slots and Unseen Sentences

To further study the effectiveness of our approach in the zero-shot setting, we also conduct an analysis on unseen slots and unseen sentences in the target domains of SNIPS⁵. Following [Liu et al., 2020b], we separate the test set of each domain into a "seen sentence" part and an "unseen sentence" part. An utterance is categorized into the "unseen sentence" part as long as there is an unseen slot in it. Otherwise, it is categorized into the "seen sentence" part. However, the split of [Liu et al., 2020b] cannot show the real zero-shot scenario directly, because a sample with both seen slots and unseen slots is categorized into the "unseen sentence" part in their experiments. Therefore, we recalculate the F1-scores for each slot separately instead of for each sentence. In our experiments, if a slot does not exist in the remaining six source domains, it is categorized into the "unseen slot" part. Otherwise, we categorize it into the "seen slot" part.
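The two partitions described above can be sketched as set operations. The slot inventories below are a toy subset built from the playlist and playlist_owner example discussed in this section, not the full SNIPS schema, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def split_slots(target_slots: Set[str],
                source_domain_slots: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """A target slot is 'seen' if it occurs in any source domain, else 'unseen'."""
    source_pool = set().union(*source_domain_slots.values())
    seen = target_slots & source_pool
    unseen = target_slots - source_pool
    return seen, unseen


def is_unseen_sentence(slots_in_utterance: Iterable[str], unseen: Set[str]) -> bool:
    """An utterance is an 'unseen sentence' as soon as it contains one unseen slot."""
    return any(slot in unseen for slot in slots_in_utterance)


# Toy inventories (illustrative subset only).
target = {"playlist", "playlist_owner"}          # target domain: Add To Playlist
sources = {"Play Music": {"playlist"}}           # one of the source domains
seen, unseen = split_slots(target, sources)
mixed = is_unseen_sentence(["playlist", "playlist_owner"], unseen)
```

The last line illustrates the issue raised in the text: a sentence containing both a seen and an unseen slot lands in the unseen-sentence part, which is why per-slot scores give a cleaner view of zero-shot behaviour.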

⁵ Since all slot types in ATIS can be considered unseen slots in our settings, we only provide the results on SNIPS.


| Corpus | Domain | RCSF | w/o T | w/o PT |
|---|---|---|---|---|
| SN | Add To Playlist | 68.70 | 18.57 | 53.02 |
| SN | Book Restaurant | 63.49 | 27.24 | 34.80 |
| SN | Get Weather | 65.36 | 23.44 | 58.02 |
| SN | Play Music | 53.51 | 27.26 | 33.06 |
| SN | Rate Book | 36.51 | 3.21 | 24.12 |
| SN | Search Creative Work | 69.22 | 7.38 | 32.53 |
| SN | Search Screening Event | 33.54 | 23.93 | 18.70 |
| SN | Average F1 | 55.76 | 18.72 | 36.32 |
| AT | Airline Travel | 30.64 | 24.57 | 4.06 |

Table 4: F1-scores (%) on SNIPS (SN) and ATIS (AT) for different target domains under the zero-shot setting (0 training samples in the target domain). w/o T denotes that we directly test the pre-trained BERT without fine-tuning it, and w/o PT denotes that we train our model from scratch without pre-training.

Table 3 shows the average results on unseen slots and unseen sentences in the target domains. Our approach outperforms the baselines by large margins in both the unseen sentence and unseen slot settings, which indicates that our MRC framework has a positive effect in zero-shot learning scenarios even when supervised signals are insufficient. As for the unseen slot part, the baseline models largely fail to recognize the unseen slots in the target domain. In contrast, our approach can be adapted to predict the unseen slot types more quickly. Taking the unseen slot playlist owner in domain Add To Playlist as an example, the Coach model mistakenly assigns the slot label playlist, a seen slot type appearing in domain Play Music, to entities of playlist owner. However, benefiting from the pre-training, our model is able to distinguish the owner of the playlist from the playlist.

4.3 Ablation Studies

The Effect of the BERT Pre-training

Our experiments are based on BERT pre-trained with large-scale MRC data. To test the impact of the pre-trained BERT, we carry out ablation experiments. The results are shown in Table 4. Firstly, we directly test the pre-trained BERT model on the test dataset without fine-tuning it on our training dataset. We still obtain an average F1 of 18.72% on SNIPS and 24.57% on ATIS, which shows that the pre-trained BERT does contain rich semantic information and that our model fully utilizes it to boost performance. Secondly, to separate the effect of the pre-training, we train the model with randomly initialized weights. Without pre-training, the performance of our MRC model drops drastically, but it still slightly outperforms the existing state-of-the-art model, which adopts the sequence labeling framework, in low-resource scenarios. This suggests that the MRC framework is more data-efficient than sequence labeling methods.

The Effect of Query Construction Strategies

For MRC tasks, the way queries are constructed has a significant influence on the final results. Intuitively, the more information the query contains, the better its effect should be.

| Corpus | Domain | Desc. | Trans. | Exp. |
|---|---|---|---|---|
| SN | Add To Playlist | 68.70 | 65.99 | 70.35 |
| SN | Book Restaurant | 63.49 | 62.05 | 72.68 |
| SN | Get Weather | 65.36 | 67.80 | 83.17 |
| SN | Play Music | 53.51 | 53.51 | 53.84 |
| SN | Rate Book | 36.51 | 23.67 | 50.08 |
| SN | Search Creative Work | 69.22 | 67.39 | 66.59 |
| SN | Search Screening Event | 33.54 | 53.20 | 65.81 |
| SN | Average F1 | 55.76 | 56.23 | 66.08 |
| AT | Airline Travel | 30.75 | 25.99 | 32.20 |

Table 5: Results of different types of queries under the zero-shot setting (0 training samples in the target domain). Desc., Trans., and Exp. indicate queries based on slot description, back-translation, and examples, respectively.

Table 5 shows the performance of different types of queries under the zero-shot setting. Using slot descriptions and slot examples together is superior to the other two methods on average, since slot examples carry more information, which is in line with our intuition. Specifically, in domain Search Screening Event, the Example method achieves its biggest performance improvement and outperforms the Description method by 32.27%. The back-translation method does not show significant improvement in our experiments. It is effective in some domains but may harm the results in others. The reason for the performance decrease may be that some key information is erased through the round-trip translation. In domain Search Creative Work, however, the Description method achieves the best F1 score of 69.22%. The main reason is that domain Search Creative Work only contains shared slots which already exist in the training data.

5 Conclusion and Future Work

In this paper, we propose a novel MRC framework to address cross-domain slot filling. Our approach comes with two key advantages: (1) the well-designed queries encode significant prior knowledge about slot names; (2) it is capable of utilizing the semantic information of BERT pre-trained on the MRC dataset SQuAD2.0. Our method obtains new state-of-the-art results on the SNIPS and ATIS datasets in the cross-domain setting, which demonstrates its effectiveness. In the future, we would like to explore how to jointly address intent detection and slot filling using a unified MRC framework in cross-domain scenarios.

Acknowledgments

The research work described in this paper has been supported by the National Key R&D Program of China (2019YFB1405200) and the National Natural Science Foundation of China (No. 61976015, 61976016, 61876198 and 61370130). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve this paper.


References

[Bapna et al., 2017] Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. Towards zero-shot frame semantic parsing for domain scaling. arXiv preprint arXiv:1707.02363, 2017.

[Coucke et al., 2018] Alice Coucke, Thibaut Saade, et al. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, 2018.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171-4186, 2019.

[Gao et al., 2020] Shuyang Gao, Sanchit Agarwal, Di Jin, Tagyoung Chung, and Dilek Hakkani-Tur. From machine reading comprehension to dialogue state tracking: Bridging the gap. In ACL, pages 79-89, 2020.

[Goo et al., 2018] Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. Slot-gated modeling for joint slot filling and intent prediction. In NAACL, pages 753-757, 2018.

[Goyal et al., 2018] Anuj Kumar Goyal, Angeliki Metallinou, and Spyros Matsoukas. Fast and scalable expansion of natural language understanding functionality for intelligent agents. In NAACL, pages 145-152, 2018.

[Guerini et al., 2018] Marco Guerini, Simone Magnolini, Vevake Balaraman, and Bernardo Magnini. Toward zero-shot entity recognition in task-oriented conversational agents. In SIGDIAL, pages 317-326, 2018.

[He et al., 2020] Keqing He, Jinchao Zhang, Yuanmeng Yan, Weiran Xu, Cheng Niu, and Jie Zhou. Contrastive zero-shot learning for cross-domain slot filling with adversarial attack. In COLING, pages 1461-1467, 2020.

[Hemphill et al., 1990] Charles T Hemphill, John J Godfrey, and George R Doddington. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.

[Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. NIPS, 28:1693-1701, 2015.

[Jaech et al., 2016] Aaron Jaech, Larry Heck, and Mari Ostendorf. Domain adaptation of recurrent neural networks for natural language understanding. arXiv preprint arXiv:1604.00117, 2016.

[Jha et al., 2018] Rahul Jha, Alex Marin, Suvamsh Shivaprasad, and Imed Zitouni. Bag of experts architectures for model reuse in conversational language understanding. In NAACL, pages 153-161, 2018.

[Kim et al., 2017] Young-Bum Kim, Karl Stratos, and Dongchan Kim. Domain attention with an ensemble of experts. In ACL, pages 643-653, 2017.

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Lee and Jha, 2019] Sungjin Lee and Rahul Jha. Zero-shot adaptive transfer for conversational language understanding. In AAAI, volume 33, pages 6642-6649, 2019.

[Li et al., 2020] Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. A unified MRC framework for named entity recognition. In ACL, pages 5849-5859, 2020.

[Liang et al., 2020] Yunlong Liang, Fandong Meng, Ying Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. Infusing multi-source knowledge with heterogeneous graph neural network for emotional conversation generation. arXiv preprint arXiv:2012.04882, 2020.

[Liu et al., 2020a] Jian Liu, Yubo Chen, Kang Liu, Wei Bi, and Xiaojiang Liu. Event extraction as machine reading comprehension. In EMNLP, pages 1641-1651, 2020.

[Liu et al., 2020b] Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung. Coach: A coarse-to-fine approach for cross-domain slot filling. In ACL, pages 19-25, 2020.

[McCann et al., 2018] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

[Peters et al., 2018] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, pages 2227-2237, 2018.

[Rajpurkar et al., 2018] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In ACL, pages 784-789, 2018.

[Shah et al., 2019] Darsh Shah, Raghav Gupta, Amir Fayazi, and Dilek Hakkani-Tur. Robust zero-shot cross-domain slot filling with example values. In ACL, pages 5484-5490, 2019.

[Siddhant et al., 2019] Aditya Siddhant, Anuj Goyal, and Angeliki Metallinou. Unsupervised transfer learning for spoken language understanding in intelligent agents. In AAAI, volume 33, pages 4959-4966, 2019.

[Sun et al., 2020] Cong Sun, Zhihao Yang, Lei Wang, Yin Zhang, Hongfei Lin, and Jian Wang. Biomedical named entity recognition using BERT in the machine reading comprehension framework. arXiv preprint arXiv:2009.01560, 2020.

[Wu et al., 2020] Di Wu, Liang Ding, Fan Lu, and Jian Xie. SlotRefine: A fast non-autoregressive model for joint intent detection and slot filling. In EMNLP, pages 1932-1937, 2020.

[Zhang and Wang, 2016] Xiaodong Zhang and Houfeng Wang. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, volume 16, pages 2993-2999, 2016.
