The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

MA-DST: Multi-Attention-Based Scalable Dialog State Tracking

Adarsh Kumar,1 Peter Ku,2 Anuj Goyal,2 Angeliki Metallinou,2 Dilek Hakkani-Tur2
1University of Wisconsin-Madison, 2Amazon Alexa AI, Sunnyvale, CA, USA
adarsh@cs.wisc.edu, {kupeter, anujgoya, ametalli, hakkanit}@amazon.com

Abstract

Task-oriented dialog agents provide a natural language interface for users to complete their goals. Dialog State Tracking (DST), often a core component of these systems, tracks the system's understanding of the user's goal throughout the conversation. To enable accurate multi-domain DST, the model needs to encode dependencies between past utterances and slot semantics and understand the dialog context, including long-range cross-domain references. We introduce a novel architecture for this task that encodes the conversation history and slot semantics more robustly by using attention mechanisms at multiple granularities. In particular, we use cross-attention to model relationships between the context and slots at different semantic levels, and self-attention to resolve cross-domain coreferences. In addition, our proposed architecture does not rely on knowing the domain ontologies beforehand and can also be used in a zero-shot setting for new domains or unseen slot values. Our model improves the joint goal accuracy by 5% (absolute) in the full-data setting and by up to 2% (absolute) in the zero-shot setting over the present state-of-the-art on the MultiWOZ 2.1 dataset.

1 Introduction

Task-oriented dialog systems provide users with a natural language interface to achieve a goal. Modern dialog systems support complex goals that may span multiple domains. For example, during the dialog the user may ask for a hotel reservation (hotel domain) and also a taxi ride to the hotel (taxi domain), as illustrated in the example of Figure 1.
Dialog state tracking is one of the core components of task-oriented dialog systems. The dialog state can be thought of as the system's belief of the user's goal given the conversation history. For each user turn, the dialog state commonly includes the set of slot-value pairs for all slots mentioned by the user. An example is shown in Figure 1. Accurate DST is critical for task-oriented dialog, as most dialog systems rely on this state to predict the optimal next system action, such as a database query or a natural language generation (NLG) response.

Work done during internship at Amazon Alexa AI. Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Sample multi-domain dialog, spanning hotel and taxi domains, along with its dialog state.

Dialog state tracking requires understanding the semantics of the agent and user dialog so far, a challenging task since a dialog may span multiple domains and may include user or system references to slots mentioned earlier in the dialog. Data scarcity is an additional challenge, because dialog data collection is costly and time-consuming (Kang et al. 2018; Lasecki, Kamar, and Bohus 2013). As a result, it is critical to be able to train DST systems for new domains with zero or little data. Previous work formulates DST as a classification task over all possible slot values for each slot, assuming all values are available in advance (e.g., through a pre-defined ontology) (Mrkšić et al. 2016; Gao et al. 2019; Liu and Lane 2017). However, DST systems should be able to track the values of even free-form slots such as "hotel name", which typically contain out-of-vocabulary words. To overcome the limitations of ontology-based approaches, candidate-set generation based approaches have been proposed (Rastogi, Hakkani-Tur, and Heck 2017). TRADE (Wu et al.
2019) extends this idea further and proposes a decoder-based approach that uses both generation and a pointing mechanism, taking a weighted sum of a distribution over the vocabulary and a distribution over the words in the conversation history. This enables the model to produce unseen slot values, and it achieves state-of-the-art results on the MultiWOZ public benchmark (Budzianowski et al. 2018; Eric et al. 2019). We extend the work by (Wu et al. 2019) and focus on improving the encoding of dialog context and slot semantics for DST, to robustly capture important dependencies between slots and the conversation history as well as long-range coreferences in the conversation history. For this purpose, we propose a Multi-Attention DST (MA-DST) network. It contains multiple layers of cross-attention between the slot encodings and the conversation history to capture relationships at different levels of granularity, followed by a self-attention layer to help resolve references to earlier slot mentions in the dialog. We show that the proposed MA-DST leads to an absolute improvement of over 5% in joint goal accuracy over the current state-of-the-art for the MultiWOZ 2.1 dataset in the full-data setting. We also show that MA-DST can be adapted to new domains with no training data in that new domain, achieving up to a 2% absolute joint goal accuracy gain in the zero-shot setting.

2 Related Work

Dialog state tracking (DST) is a core dialog systems problem that is well studied in the literature. Earlier approaches for DST relied on Markov Decision Processes (MDPs) (Levin, Pieraccini, and Eckert 2000) and partially observable MDPs (POMDPs) (Williams and Young 2007; Thomson and Young 2010) for estimating the state updates. See (Williams, Raux, and Henderson 2016) for a review of DST challenges and earlier related work. Recent neural state tracking approaches achieve state-of-the-art performance on DST (Gao, Galley, and Li 2018).
Some of this work formulates the state tracking problem as a classification task over all possible slot values per slot (Mrkšić et al. 2016; Wen et al. 2017; Liu and Lane 2017). This assumes that an ontology containing all slot values per slot is available in advance. In practice, this is a limiting assumption, especially for free-form slots that may contain values not seen during training (Xu and Hu 2018). To address this limitation, (Rastogi, Hakkani-Tur, and Heck 2017) propose a candidate generation approach based on a bi-GRU network that selects and scores slot values from the conversation history. (Xu and Hu 2018) propose using a pointer network (Vinyals, Fortunato, and Jaitly 2015) for extracting slot values from the history. More recently, hybrid approaches which combine the candidate-set and slot-value generation approaches have appeared (Goel, Paul, and Hakkani-Tür 2019; Wu et al. 2019). Our work is most similar to TRADE (Wu et al. 2019), and extends it by proposing self-attention (Cheng, Dong, and Lapata 2016) and cross-attention (Bahdanau, Cho, and Bengio 2015) mechanisms for capturing slot and history correlations. Attention-based architectures like the Transformer (Vaswani et al. 2017) and architectures that extend it, like BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019), achieve the current state-of-the-art for many NLP tasks. We are also inspired by work in reading comprehension, where cross-attention is used to compute relations between a long passage and a query question (Zhu, Zeng, and Huang 2018; Chen et al. 2017). For benchmarking, the DSTC challenges provide a popular experimentation framework and dialog data collected through human-machine interactions. Initially, they focused on single-domain systems like bus routes (Williams et al. 2013). Wizard-of-Oz (WOZ) is also a popular framework used to collect human-human dialogs that reflect the target human-machine behavior (Wen et al. 2017; Asri et al. 2017).
Recently, the MultiWOZ 2.0 dataset, collected through WOZ for multiple domains, was introduced to address the lack of a large multi-domain DST benchmark (Budzianowski et al. 2018). (Eric et al. 2019) released an updated version, called MultiWOZ 2.1, which contains annotation corrections and new benchmark results using the current state-of-the-art approaches. Here, we use the MultiWOZ 2.1 dataset as our benchmark.

3 Model Architecture

3.1 Problem Statement

Let us denote the conversation history up to turn t as C_t = {U_1, A_1, U_2, A_2, ..., U_t}, where U_i and A_i represent the user's utterance and the agent's response at the i-th turn. Let S = {s_1, s_2, ..., s_n} denote the set of all n possible slots across all domains. Let DST_t = {s_1 : v_1, s_2 : v_2, ..., s_n : v_n} denote the dialog state at turn t, which contains all slots s_i and their corresponding values v_i. Slots that are not mentioned in the dialog history take a none value. DST consists of predicting the slot values for all slots s_i at each turn t, given the conversation history C_t.

3.2 Model Architecture Overview

Our model encodes both the slot name s_i and the conversation history so far C_t, and then decodes the slot value v_i, outputting words or special symbols for the none and dontcare values. Our proposed model consists of an encoder Enc_slot for the slot name, an encoder Enc_conv for the conversation history, a decoder Dec_gen that generates the slot value, and a three-class slot gate classifier SG that predicts the special symbols {none, dontcare, gen}, which will be described in detail later on. The model weights are shared between the slots, which makes the model more robust and scalable. This architecture is similar to (Wu et al. 2019). We propose modifications to the encoders in order to capture more fine-grained dependencies between the slot name and the conversation history.
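As a minimal illustrative sketch of the problem formulation above (slot names and values here are hypothetical, not drawn from the dataset), the per-turn dialog state DST_t can be kept as a mapping that defaults every slot to none and is overwritten by the model's per-turn predictions:

```python
from typing import Dict, List

# Illustrative subset of the slot set S; real MultiWOZ has 30+ slots.
ALL_SLOTS: List[str] = ["hotel-name", "hotel-area", "taxi-destination"]

def init_state(slots: List[str]) -> Dict[str, str]:
    """Every slot starts with the special value 'none' (unmentioned)."""
    return {s: "none" for s in slots}

def update_state(state: Dict[str, str],
                 predictions: Dict[str, str]) -> Dict[str, str]:
    """Overwrite the slots for which the model decoded a value this turn."""
    new_state = dict(state)
    new_state.update(predictions)
    return new_state

state = init_state(ALL_SLOTS)
state = update_state(state, {"hotel-name": "acorn guest house"})
```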
Also, note that the domain and slot names are concatenated into a single slot description, which we refer to as the slot name for simplicity, and encoded via the slot encoder Enc_slot. Figure 2 illustrates the proposed architecture, which we refer to as Multi-Attention DST (MA-DST).

Figure 2: Model Architecture

3.3 Encoders

Our proposed slot s_i and conversation history C_t encoders use three stages of attention: low-level cross-attention on the words, higher-level cross-attention on the hidden state representations, and self-attention within the dialog history. Below we describe the encoders bottom-up.

Enriched Word Embedding Layer. For both C_t and s_i, we first project each word into a low-dimensional space. We use a 300-dimensional GloVe embedding (Pennington, Socher, and Manning 2014) and a 100-dimensional character embedding, both of which get fine-tuned. For the conversation history C_t, we also add a 5-dimensional POS-tag embedding and a 5-dimensional NER-tag embedding. We also use the turn index for each word as a feature and initialize it as a 5-dimensional embedding. To capture the contextual meaning of words, we additionally use contextual ELMo embeddings (Peters et al. 2018). We compute 1024-dimensional ELMo embeddings for both C_t and s_i by taking a weighted average of the different ELMo layers' outputs. Instead of fine-tuning the parameters of all the ELMo layers, we learn only these combination weights while training the model. All the word-level embeddings are concatenated to generate an enriched, contextual word-level representation e:

e = [GloVe(w), CharEmbedding(w), ELMo(w), POS-tag(w), NER-tag(w), position-tag(w)]   (1)

Word-Level Cross-Attention Layer. To highlight the words in the conversation history C_t relevant to the slot s_i, we add a word-to-word attention from the conversation history to the slot. For computing the attention weights, we use symmetric scaled multiplicative attention (Huang et al. 2017) with a ReLU non-linearity.
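A minimal sketch of the enriched embedding of equation (1). The dimensions follow the paper; the embedding tables are random stand-ins, and the three-layer ELMo mixing weights are assumed to be already softmax-normalized:

```python
import numpy as np

rng = np.random.default_rng(0)

glove = rng.normal(size=300)        # pre-trained GloVe embedding
char  = rng.normal(size=100)        # character embedding
pos   = rng.normal(size=5)          # POS-tag embedding
ner   = rng.normal(size=5)          # NER-tag embedding
turn  = rng.normal(size=5)          # turn-index (position) embedding

# ELMo: only the scalar layer-combination weights are learned.
elmo_layers = rng.normal(size=(3, 1024))   # per-layer outputs for one word
mix_w = np.array([0.2, 0.3, 0.5])          # learned, softmax-normalized
elmo = mix_w @ elmo_layers                 # weighted average of layer outputs

# Equation (1): concatenate every source into one enriched vector.
e = np.concatenate([glove, char, elmo, pos, ner, turn])
```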
The weights are calculated according to equation (2) and used according to equation (3) to obtain the attended vector for each word in the conversation:

α_{jk} = exp(f(W_e e^C_j)^T D f(W_e e^s_k)) / Σ_{k'=1}^{K} exp(f(W_e e^C_j)^T D f(W_e e^s_{k'}))   (2)

â^C_j = Σ_{k=1}^{K} α_{jk} e^s_k   (3)

Here, e^C_j and e^s_k correspond to the word embeddings of the j-th word in the conversation and the k-th word in the slot name. The length of the slot name is denoted by K, and f denotes a non-linear activation, which here is a ReLU. To get the representation r^C_j for each word in the conversation history, we concatenate the attended vector with the initial word embedding:

r^C_j = [e^C_j, â^C_j]   (4)

For the slot representation, for each word k in the slot name we use the word embedding r^s_k = e^s_k. Note that symmetric scaled multiplicative attention with a ReLU non-linearity is used in all attention computations of our proposed models, as we empirically found that it gives better performance than other attention variants (e.g., multiplicative, scaled multiplicative, additive).

First Layer RNN. The computed representations r^s_k and r^C_j for each word in the slot name and the conversation history, respectively, are then passed through a Gated Recurrent Unit (GRU) (Chung et al. 2014) in order to model the temporal interactions between the words and obtain a contextual representation. For each of C_t and s_i, we use bidirectional GRUs and obtain the hidden contextual representation by averaging the hidden states of the two GRU directions per time step:

H^C_1 = [h^{C,1}_1, h^{C,1}_2, ..., h^{C,1}_J] = GRU([r^C_1, r^C_2, ..., r^C_J])   (5)

H^s_1 = [h^{s,1}_1, h^{s,1}_2, ..., h^{s,1}_K] = GRU([r^s_1, r^s_2, ..., r^s_K])   (6)

Here, H^C_1 and H^s_1 are the sequences of encoded representations for the conversation history and the slot name respectively, output by the first bidirectional GRU layer (assuming K is the slot name length and J the conversation history length in number of words).
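The word-level cross-attention of equations (2)-(4) can be sketched with NumPy as follows. Dimensions are toy values; the weight names W_e and D follow the paper, but their values here are random stand-ins (D is taken to be diagonal, one reading of the "symmetric" form):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_emb, d_att, J, K = 16, 8, 10, 3           # toy sizes
e_conv = rng.normal(size=(J, d_emb))        # conversation embeddings e^C_j
e_slot = rng.normal(size=(K, d_emb))        # slot-name embeddings e^s_k
W_e = rng.normal(size=(d_emb, d_att))
D = np.diag(rng.normal(size=d_att))         # assumed diagonal

fc = relu(e_conv @ W_e)                     # f(W_e e^C_j)
fs = relu(e_slot @ W_e)                     # f(W_e e^s_k)
alpha = softmax(fc @ D @ fs.T, axis=1)      # (J, K) weights, equation (2)
a_hat = alpha @ e_slot                      # attended vectors, equation (3)
r_conv = np.concatenate([e_conv, a_hat], axis=1)  # equation (4)
```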
Higher-Level Cross-Attention Layer. We add a cross-attention network on top of the base RNN layer to attend over the higher-level representations generated by the previous RNN layer, i.e., H^C_1 and H^s_1. We use a two-way cross-attention network: one direction from the conversation history (H^C_1) to the slot (H^s_1), and the other in the opposite direction. This is inspired by several works in reading comprehension where cross-attention is used to compute relations between a long passage and a query question (Weissenborn, Wiese, and Seiffe 2017; Chen et al. 2017). The Slot to Conversation History attention subnetwork helps highlight the words in the conversation which are relevant to the slot for which we want to generate the value. Similar to the word-level attention, the attention weights are calculated by equation (7):

α_{jk} = exp(f(V h^{C,1}_j)^T D f(V h^{s,1}_k)) / Σ_{j'=1}^{J} exp(f(V h^{C,1}_{j'})^T D f(V h^{s,1}_k))   (7)

ĥ^{s,1}_k = Σ_{j=1}^{J} α_{jk} h^{C,1}_j   (8)

We fuse the attention vector ĥ^{s,1}_k with its corresponding hidden state h^{s,1}_k for each word in the slot name as follows:

r^{s,1}_k = [h^{s,1}_k, ĥ^{s,1}_k, ĥ^{s,1}_k + h^{s,1}_k, ĥ^{s,1}_k ⊙ h^{s,1}_k]   (9)

where ⊙ is the element-wise product operation. Similarly, the Conversation to Slot attention subnetwork computes attention weights to highlight which words in the slot name are most relevant to each word in the conversation history. This enriches the word representation in the conversation history h^{C,1}_j with an attention-based representation ĥ^{C,1}_j, resulting in a new representation r^{C,1}_j. All computations are the same as in the Slot to Conversation History attention, but in the reverse direction.

Second Layer RNN. The representations r^{s,1}_k and r^{C,1}_j are then passed through a second bidirectional GRU layer to obtain h^{s,2}_k and h^{C,2}_j. This helps fuse these vectors together along with the temporal information.

Self-Attention Layer. We add a self-attention network on top of the conversation representation h^{C,2}_j.
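The fusion in equation (9), which merges an attended vector with its corresponding hidden state via concatenation, sum, and element-wise product, is easy to sketch on a tiny deterministic example:

```python
import numpy as np

def fuse(h: np.ndarray, h_hat: np.ndarray) -> np.ndarray:
    """Equation (9): [h, h_hat, h + h_hat, h * h_hat] (element-wise product)."""
    return np.concatenate([h, h_hat, h + h_hat, h * h_hat], axis=-1)

h = np.array([1.0, 2.0])        # hidden state h^{s,1}_k (toy values)
h_hat = np.array([0.5, -1.0])   # attended vector \hat{h}^{s,1}_k
r = fuse(h, h_hat)              # 4x the original dimensionality
```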
This layer helps resolve correlations between words across utterances in the conversation history. We introduce this sub-network to address cases where the user refers to slot values present in previous utterances, which is a common phenomenon in dialogs, especially multi-domain ones. Self-attention is computed as:

α_{ji} = exp(f(W h^{C,2}_j)^T D f(W h^{C,2}_i)) / Σ_{i'=1}^{J} exp(f(W h^{C,2}_j)^T D f(W h^{C,2}_{i'}))   (10)

ĥ^{C,2}_j = Σ_{i=1}^{J} α_{ji} h^{C,2}_i   (11)

The final representation r^{C,2}_j for each word in the conversation is the merged representation of the self-attended vector ĥ^{C,2}_j and the hidden state h^{C,2}_j, merged according to equation (9).

Third Layer RNN and Slot Summarization. We use a third RNN layer to get the final representation for the conversation history:

h^{C,3}_j = GRU(r^{C,2}_j), j = 1..J   (12)

Since the slot name is much shorter than the conversation history, it can be encoded with less information. Instead of using an additional RNN, we summarize the slot using a linear transformation that reduces the slot representation to a single vector:

α_k = w^T h^{s,2}_k   (13)

h^{s,3} = Σ_{k=1}^{K} α_k h^{s,2}_k   (14)

where w is a parameter learnt during training. Finally, H^{C,3} = [h^{C,3}_1, h^{C,3}_2, ..., h^{C,3}_J] is the per-word representation of the conversation history, while h^{s,3} is the summarized slot name representation; both are used at the decoding step.

3.4 Decoder and Slot Gate Classifier

The decoder network is a GRU that decodes the value v_i for slot s_i. At each decoding step i, which produces one word of the slot value, the network computes two distributions: a distribution over all in-vocabulary words (word generation distribution) and one over all words in the conversation history (word history distribution). This allows the decoder to generate unseen words that appear in the conversation history but are not present in the vocabulary of the training data.
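Before the decoder details, the slot summarization of equations (13)-(14) can be sketched as follows. The scores α_k are used directly as written in the paper, with no normalization; the hidden states and w are tiny deterministic stand-ins:

```python
import numpy as np

# Slot-name hidden states h^{s,2}_k for K = 3 words, d = 2 (toy values).
H_slot = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [1.0, 1.0]])
w = np.array([1.0, 1.0])        # learned parameter

scores = H_slot @ w             # alpha_k = w^T h^{s,2}_k, equation (13)
h_summary = scores @ H_slot     # h^{s,3} = sum_k alpha_k h^{s,2}_k, eq. (14)
# scores = [1, 2, 2]; h_summary = 1*[1,0] + 2*[0,2] + 2*[1,1]
```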
This formulation removes the dependency on a predefined ontology that contains all possible slot values, which is restrictive for free-form slots. Because of its ability to generate unseen slot values, the network is well suited for zero-shot use cases. We initialize the decoder by combining the last hidden state of the conversation history representation and the summarized slot representation:

h^{dec}_0 = W [h^{C,3}_J, h^{s,3}]   (15)

where W is a learnable parameter. At each decoding timestep i, the decoder generates a probability distribution over the vocabulary:

P^{vocab}_i = Softmax(W h^{dec}_i)   (16)

The decoder also generates a probability distribution over the words in the conversation history, P^{history}, by using a pointer network (See, Liu, and Manning 2017), i.e., computing attention weights for each word in the conversation history. To generate the final vocabulary distribution, we take a weighted sum of P^{history} and P^{vocab}:

P^{final}_i = p^{gen}_i P^{vocab}_i + (1 - p^{gen}_i) P^{history}_i   (17)

where p^{gen} is the probability of generating a word as opposed to copying it from the history, and is calculated at each decoder time step. To avoid running the decoder for slots not present in the conversation, we also train a Slot Gate classifier (Wu et al. 2019). This is a 3-way classifier which predicts among the classes {none, dontcare, gen}. Only when the classifier predicts gen do we decode the slot value. When the classifier predicts none, we assume that the slot is not present and takes a none value in the state; when it predicts dontcare, we assume the user does not care about the slot value (this appears commonly in dialog, and therefore dontcare is a special value for DST systems). The network is trained in a multi-task manner using the standard cross-entropy loss. We combine the losses of the slot generator (decoder) and the SG classifier as follows:

Loss_combined = Loss_generator + γ · Loss_classifier   (18)

where γ is a hyperparameter that is optimized empirically.
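The pointer-generator mixing of equation (17) can be sketched as follows. The copy distribution over history words must first be scattered onto vocabulary indices before the interpolation; the toy vocabulary, indices, and probabilities below are invented for illustration:

```python
import numpy as np

V = 6                                            # toy vocabulary size
p_vocab = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])   # P^vocab_i, eq. (16)
history_ids = np.array([2, 4, 2])                # vocab index of each history word
p_history_words = np.array([0.5, 0.3, 0.2])      # pointer attention over history
p_gen = 0.7                                      # generate-vs-copy gate

# Scatter-add the per-word copy probabilities onto vocabulary indices
# (np.add.at accumulates repeated indices, e.g. index 2 appears twice).
p_history = np.zeros(V)
np.add.at(p_history, history_ids, p_history_words)

# Equation (17): interpolate the two distributions with the gate p_gen.
p_final = p_gen * p_vocab + (1 - p_gen) * p_history
```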
4 Dataset

We evaluate our approach on MultiWOZ, a multi-domain Wizard-of-Oz dataset. MultiWOZ 2.0 is a recent dataset of labeled human-human written conversations spanning multiple domains and topics (Budzianowski et al. 2018). As of now, it is the largest labeled, goal-oriented, human-human conversational dataset, with around 10k dialogs, each with an average of 13.67 turns. The data spans seven domains and 37 slot types. Due to patterns of annotation errors found in MultiWOZ 2.0, (Eric et al. 2019) re-annotated the data and released a MultiWOZ 2.1 version, which corrected a significant number of errors. Table 1 lists the percentage of slots in each domain whose values changed with the MultiWOZ 2.1 re-annotation. For all our experiments, we use the MultiWOZ 2.1 data, which is shown to be cleaner and more challenging because many slots are now correctly annotated with their corrected values or dontcare instead of none. We use only five of the seven available domains - namely restaurant, hotel, attraction, taxi, and train - since the other two domains (bus, police) are only present in the training set. We use the provided train/dev/test split for our experiments.

5 Evaluation

In this section we first describe the evaluation metrics and then present the results of our experiments.

Table 1: Percentage of slot values that changed in MultiWOZ 2.1 compared to MultiWOZ 2.0.

Split   Restaurant   Taxi   Hotel   Train   Attraction
Train   13.64        3.65   26.89   7.04    12.69
Dev     22.04        3.18   20.93   5.88    12.82
Test    19.33        3.95   24.70   10.59   16.12

5.1 Metrics

The following metrics are used to evaluate DST models:

Average Slot Accuracy: The average slot accuracy is defined as the fraction of slots for which the model predicts the correct slot value.
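Both metrics can be sketched in a few lines; the gold and predicted states below are a toy example with hypothetical slot names:

```python
from typing import Dict

def slot_accuracy(gold: Dict[str, str], pred: Dict[str, str]) -> float:
    """Fraction of slots whose predicted value matches the ground truth."""
    correct = sum(1 for s in gold if pred.get(s, "none") == gold[s])
    return correct / len(gold)

def joint_goal(gold: Dict[str, str], pred: Dict[str, str]) -> int:
    """1 only if every tracked slot is predicted correctly for this turn."""
    return int(all(pred.get(s, "none") == gold[s] for s in gold))

gold = {"hotel-area": "east", "hotel-name": "none", "train-day": "monday"}
pred = {"hotel-area": "east", "hotel-name": "none", "train-day": "tuesday"}
acc = slot_accuracy(gold, pred)   # 2 of 3 slots correct
jga = joint_goal(gold, pred)      # 0: one slot is wrong
```

The joint goal metric is much stricter: a single wrong slot in a turn zeroes it out, which is why joint goal accuracy numbers are far below average slot accuracy throughout Section 5.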
For an individual dialog turn D_t, the average slot accuracy is defined as follows:

Acc_slot(D_t) = (1/n) Σ_{i=1}^{n} 1_{y_i = ŷ_i}   (19)

where y_i and ŷ_i are the ground-truth and predicted slot values for s_i respectively, n is the total number of slots, and 1_{x=y} is an indicator variable that is 1 if and only if x = y.

Joint Goal Accuracy: The joint goal accuracy is defined as the fraction of dialog turns for which the values v_i of all slots s_i are predicted correctly. If we have n total slots to track, the joint goal accuracy for an individual dialog turn D_t is defined as follows:

JGA(D_t) = 1_{(Σ_{i=1}^{n} 1_{y_i = ŷ_i}) = n}   (20)

5.2 Experiment Details

We train the encoders to jointly optimize the losses of the slot gate classifier and the slot value generator (decoder). The parameters of the model are shared across all (domain, slot) pairs, which makes this model scalable to a large number of domains and slots. We train the model using stochastic gradient descent with the Adam optimizer. We empirically optimized the learning rate in the range [0.0005, 0.001] and used 0.0005 for the final model, keeping the betas at (0.9, 0.999) and epsilon at 1×10^-8. We used a batch size of four dialog turns, and for each turn we generate all 30 slot values. We decayed the learning rate at regular intervals (3 epochs) by a factor of θ (0.25), which was empirically optimized. For ELMo, we kept a dropout of 0.5 for the contextual embedding and used L2 regularization for the ELMo weights. We used a dropout of 0.2 for all layers everywhere else. For word embeddings, we used 300-dimensional GloVe embeddings and 100-dimensional character embeddings. For all the GRU and attention layers, the hidden size is kept at 400. The weight γ for the multi-task loss function in equation (18) is kept at 1.

5.3 Results

In this section, we present the results for our model. We measure the quality of the model on joint goal accuracy and average slot accuracy, as described earlier. As our baseline for comparison, we consider the TRADE model (Wu et al.
2019), which is the present state of the art for MultiWOZ. For a fair comparison, we report numbers on the corrected MultiWOZ 2.1 dataset for both models.

In Table 2, we present the results for DST on single-domain data. We create the train, dev, and test splits of the data for a particular domain by filtering for dialogs which only contain that domain. As shown in Table 2, MA-DST outperforms TRADE for all five domains, improving the joint goal accuracy by up to 7% absolute as well as the average slot accuracy by up to 5% absolute.

Table 2: Joint goal and slot accuracy of MA-DST and TRADE on 5 single-domain datasets from MultiWOZ 2.1

Domain       MA-DST Joint   MA-DST Slot   TRADE Joint   TRADE Slot
Hotel        57.70          93.41         50.25         90.48
Train        76.47          94.87         74.47         94.30
Taxi         76.55          91.25         70.18         86.27
Restaurant   66.33          93.86         66.02         93.73
Attraction   72.49          89.38         68.48         86.89

Table 3 shows results for the multi-domain setting, where we combine all available domains during training and evaluation. We compare the accuracy of MA-DST with the TRADE baseline and four additional ablation variants of our model. These four variants capture the contribution of the different sub-networks and layers in MA-DST on top of the base encoder-decoder architecture, which is called "Our Base Model" in Table 3. Our full proposed MA-DST model achieves the highest performance on joint goal accuracy and average slot accuracy, surpassing the current state-of-the-art performance. Each of the additional layers of self- and cross-attention contributes to progressively higher accuracy on both metrics.

Table 3: Joint goal and slot accuracy of different models in the all-domain setting of MultiWOZ 2.1

Model                                       Joint   Slot
Baseline (TRADE (Wu et al. 2019))           45.6    96.62
Our Base Model                              44.0    96.15
+ Slot Gate + Word-Level Cross-Attention    47.60   97.01
+ Higher-Level Cross-Attention              49.56   97.15
+ Self-Attention + Slot Summarizer          50.55   97.21
+ ELMo (MA-DST)                             51.04   97.28
+ Ensemble                                  51.88   97.39

In Table 4 we present the zero-shot results. For these experiments, the test set contains only dialogs from the target domain, while the training set contains only dialogs from the other four domains. As shown in Table 4, MA-DST outperforms TRADE's state-of-the-art result by up to 2% on the joint goal accuracy metric.

Table 4: Joint goal accuracy of MA-DST and TRADE in the zero-shot setting for the five domains of MultiWOZ 2.1

Domain       MA-DST Joint   TRADE Joint
Hotel        16.28          14.20
Train        22.76          22.39
Taxi         59.27          59.21
Restaurant   13.56          12.59
Attraction   22.46          20.06

5.4 Error Analysis

In this section we analyze the errors made by the model on the MultiWOZ 2.1 dataset. Table 5 shows the average slot accuracy and F1 score for each domain.

Table 5: F1 score and average slot accuracy per domain

Domain       F1-Score   Slot Acc.
Hotel        0.90       97.11
Train        0.92       97.16
Taxi         0.71       97.87
Restaurant   0.94       97.41
Attraction   0.87       95.46

In terms of F1 score, the model performs worst on the taxi domain. The average slot accuracy for the taxi domain is nevertheless high because a vast number of the taxi domain's slots are none (i.e., not present in the dialog), which the model easily identifies.

Figure 3: Per-slot accuracy on the test set in the multi-domain setting for MA-DST.

Figure 3 shows the per-slot accuracy in the all-domain setting, in descending order of performance. As seen from Figure 3, the MA-DST model tends to make the most errors on open-ended slots such as restaurant-name, attraction-name, hotel-name, and train-leaveat.
These slots are difficult for the model to predict because, unlike categorical slots, they can take on a large number of possible values and are more likely to encounter unseen values. On the other end of the spectrum, we have slots like restaurant-bookday, hotel-bookstay, hotel-bookday, and train-day, for which the model achieves more than 99% average slot accuracy. As expected, most of the top-performing slots are categorical, i.e., they can take only a small number of different values from a pre-defined set.

Figure 4 analyzes the relationship between the depth of the conversation and the accuracy of MA-DST. To calculate this, we first bucket the dialog turns according to their turn index, and calculate the joint goal accuracy and average slot accuracy for each bucket. As shown in Figure 4, the joint goal accuracy and average slot accuracy for MA-DST are around 88% and 99% for turn 0, and decrease to 8% and 92% for turn 10. As expected, the model's performance degrades as the conversation becomes longer. This can be explained by the fact that longer conversations tend to be more complex and can have long-range dependencies. To study the effect of the attention layers, we compare the per-turn joint goal accuracy of our base model, which does not have the attention layers, with that of MA-DST. As can be seen from Figure 4, MA-DST performs better than our base model for both earlier and later turns, by an average margin of 4%.

To further analyze what types of errors the model makes, we manually analyzed the model's output for 20 randomly selected dialogs. Around 36% of the errors are due to wrong annotations, i.e., the model predicted the slot value correctly but the target label was wrong. For example, in turn 5 of dialog PMUL3158, restaurant book time is annotated as none, while the user mentioned 17:45 as the booking time. These kinds of annotation errors are unavoidable.
Another common error we observed was the model confusing slots of the same type. For example, in turn 3 of dialog PMUL4547, the model populates both attraction name and hotel name with "The Junction", as the user did not specify in the utterance whether The Junction is an attraction or a hotel. For a similar reason, we also see the model confuse the taxi destination and taxi departure slots quite often. A third common error type is generating a slot value that differs from the ground truth by a word or character. For example, in dialog MUL2432 the model generates the value of restaurant book time as 15.15 by directly copying it from the user utterance; however, the label is 15:15 according to the ontology. This kind of error could be resolved by fuzzy matching between the ontology and the model's prediction, but that would introduce a dependency on the ontology. We also observed that the model's accuracy for slot values labeled dontcare was only 60%; there are many annotation errors for dontcare slots in the training set, which makes this value difficult for the model to learn.

6 Conclusion

We propose a new architecture for dialog state tracking that uses multiple levels of attention to better encode relationships between the conversation history and slot semantics and to resolve long-range cross-domain coreferences. Like TRADE (Wu et al. 2019), it does not rely on knowing a complete list of possible values for a slot beforehand, and it can both generate values from the vocabulary and copy values from the conversation history. It also shares the same model weights across all (domain, slot) pairs, so it can easily be adapted to new domains and applied in a zero-shot or few-shot setting. We achieve a new state-of-the-art joint goal accuracy of 51% on the updated MultiWOZ 2.1 dataset. In the zero-shot setting we improve the state-of-the-art by over 2%.

Figure 4: Accuracy of MA-DST and our base model aggregated by dialog turns.
In the future, it is worth exploring whether the state can be carried over from the previous turn to predict the state for the current turn (rather than starting from scratch at each turn). Finally, it may be useful to capture dependencies or correlations between slots rather than independently generating values for each of them.

References

Asri, L. E.; Schulz, H.; Sharma, S.; Zumer, J.; Harris, J.; Fine, E.; Mehrotra, R.; and Suleman, K. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Budzianowski, P.; Wen, T.-H.; Tseng, B.-H.; Casanueva, I.; Ultes, S.; Ramadan, O.; and Gašić, M. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5016-5026.

Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551-561. Austin, Texas: Association for Computational Linguistics.

Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 4171-4186.
Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; and Hakkani-Tür, D. 2019. MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines.
Gao, S.; Sethi, A.; Agarwal, S.; Chung, T.; and Hakkani-Tur, D. 2019. Dialog state tracking: A neural reading comprehension approach.
Gao, J.; Galley, M.; and Li, L. 2018. Neural approaches to conversational AI. ACL and SIGIR tutorial.
Goel, R.; Paul, S.; and Hakkani-Tür, D. 2019. HyST: A hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.
Huang, H.-Y.; Zhu, C.; Shen, Y.; and Chen, W. 2017. FusionNet: Fusing via fully-aware attention with application to machine comprehension.
Kang, Y.; Zhang, Y.; Kummerfeld, J. K.; Tang, L.; and Mars, J. 2018. Data collection for dialogue system: A startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), 33–40.
Lasecki, W. S.; Kamar, E.; and Bohus, D. 2013. Conversations in the crowd: Collecting data for task-oriented dialog learning. In First AAAI Conference on Human Computation and Crowdsourcing.
Levin, E.; Pieraccini, R.; and Eckert, W. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing 8(1):11–23.
Liu, B., and Lane, I. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Proceedings of Interspeech.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach.
Mrkšić, N.; Ó Séaghdha, D.; Wen, T.-H.; Thomson, B.; and Young, S. 2016. Neural belief tracker: Data-driven dialogue state tracking.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, 1777–1788.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
Rastogi, A.; Hakkani-Tur, D.; and Heck, L. 2017. Scalable multi-domain dialogue state tracking. In Proceedings of IEEE ASRU.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Thomson, B., and Young, S. 2010. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech & Language 24(4):562–588.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, 6000–6010.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28. Curran Associates, Inc. 2692–2700.
Weissenborn, D.; Wiese, G.; and Seiffe, L. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017).
Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gašić, M.; Rojas-Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A network-based end-to-end trainable task-oriented dialogue system.
In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 438–449.
Williams, J. D., and Young, S. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21(2):393–422.
Williams, J. D.; Raux, A.; Ramachandran, D.; and Black, A. W. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, The 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 404–413.
Williams, J.; Raux, A.; and Henderson, M. 2016. The dialog state tracking challenge series: A review. Dialogue and Discourse.
Wu, C.-S.; Madotto, A.; Hosseini-Asl, E.; Xiong, C.; Socher, R.; and Fung, P. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In ACL.
Xu, P., and Hu, Q. 2018. An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 1448–1457. Melbourne, Australia: Association for Computational Linguistics.
Zhu, C.; Zeng, M.; and Huang, X. 2018. SDNet: Contextualized attention-based deep network for conversational question answering. arXiv.