# Personalized Dialogue Generation with Persona-Adaptive Attention

Qiushi Huang¹ ², Yu Zhang²\*, Tom Ko³, Xubo Liu¹, Bo Wu⁴, Wenwu Wang¹, H Tang¹\*

¹ University of Surrey &nbsp; ² Southern University of Science and Technology &nbsp; ³ ByteDance AI Lab &nbsp; ⁴ MIT-IBM Watson AI Lab

{qiushi.huang,xubo.liu,w.wang,h.tang}@surrey.ac.uk, {yu.zhang.ust,tomkocse}@gmail.com, bo.wu@ibm.com

## Abstract

Persona-based dialogue systems aim to generate consistent responses based on historical context and a predefined persona. Unlike conventional dialogue generation, persona-based dialogue needs to consider both the dialogue context and the persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, in this paper we propose an effective framework with Persona-Adaptive Attention (PAA), which adaptively integrates the weights from the persona and context information via our designed attention. In addition, a dynamic masking mechanism is applied to the PAA to not only drop redundant information in context and persona but also serve as a regularizer to avoid overfitting. Experimental results demonstrate the superiority of the proposed PAA framework over strong baselines in both automatic and human evaluation. Moreover, the proposed PAA approach performs equivalently well in a low-resource regime, achieving results similar to larger models trained in the full-data setting while using only 20% to 30% of the data. To fully exploit the effectiveness of our design, we designed several variants that handle the weighted information in different ways, showing the necessity and sufficiency of our weighting and masking designs.

## Introduction

Persona is essential for building a trustful and confident conversational system.
Recently, there has been increasing interest in incorporating explicit persona into dialogue generation models (Wolf et al. 2019; Liu et al. 2020; Song et al. 2021) since the release of publicly available datasets (Zhang et al. 2018; Dinan et al. 2019). Typically, persona information consists of several sentences describing the facts or background of the interlocutor. An example taken from the ConvAI2 dataset (Dinan et al. 2019) is shown in Figure 1. In this example, the system should consider the information in the persona sentences and generate consistent responses based on both the persona and the dialogue history.

Figure 1: An example from the ConvAI2 dataset, pairing the system's persona sentences with the dialogue turns.

One challenge in persona-based dialogue generation is that the related datasets are usually small. Since collecting dialogues for persona-based datasets requires crowdworkers to chat with each other based on provided persona profiles, building such quality datasets is expensive and time-consuming, which in turn restricts their size. For example, the ConvAI2 dataset (Dinan et al. 2019) contains only 131k utterances with fewer than 5k unique personas, much smaller than open-domain dialogue datasets such as Pushshift.io Reddit (Baumgartner et al. 2020) with roughly 1.2B utterances. Another challenge is choosing the weights between the persona and the context.

\*Corresponding authors. Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Unlike open-domain dialogue models that generate responses by considering the dialogue context alone, persona-based dialogue generation systems need to take personalized background descriptions into account along with the dialogue context. The weights between context and persona should be dynamically adjusted by the dialogue system under different situations. For example, given the user utterance "How are you?", the context-preferred answer is likely to be "I am fine.", which is safe but bland. Meanwhile, a persona-preferred answer would fuse persona information into the response, such as "I am spending time with my four sisters." Under such circumstances, the persona-preferred answer is more informative and meaningful. On the other hand, the system sometimes needs to focus on the context to make the conversation interactive and engaging. For instance, if the user says "I have two greyhounds. Their names are Tom and Jerry.", the system should focus on the context and answer "That's cute! How old are they?", which encourages the user to keep chatting with the dialogue system. From these two scenarios, it can be seen that the weights between context and persona should be adjusted accordingly, which is important for a dialogue model to build long-term relationships with users.

Most existing works on persona-based dialogue generation have primarily addressed the data scarcity challenge by utilizing external data or sophisticated training processes. For instance, Song et al. use the MNLI dataset (Williams, Nangia, and Bowman 2018) as an auxiliary task, Cao et al. augment the data through text manipulation, Roller et al. add other dialogue datasets in pretext tasks, and Liu et al. adopt multi-stage training with reinforcement learning. These works obtained decent performance, but few of them considered the second challenge.

*The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)*
To address the aforementioned second challenge, in this paper we design a Persona-Adaptive Attention (PAA) mechanism to dynamically learn the weights of the persona and context information in the proposed framework. To enhance the persona information in the PAA, we prepend the persona to the decoder input as a prompt so that the weights can capture more persona-related information. To balance the context and persona information, the PAA takes the two cross-attentions and the self-attention from the persona-prompted decoder to compute the weights for combining the latent representations from the context and persona. Moreover, inspired by findings (Welleck et al. 2019; Cao et al. 2022b,a) that not all context and persona information is useful for generating the response, we apply two dynamic masks to the weighted latent representations to not only remove redundant information but also act as a regularizer in the PAA.

Extensive experiments on the ConvAI2 dataset show that the proposed framework achieves comparable or even better performance than existing works without the use of external datasets or sophisticated training procedures. One reason is that our framework explicitly learns the weights between context and persona in the architecture design, which allows it to perform well in a low-data regime. As a byproduct, this observation indicates that the proposed framework can also alleviate the first challenge, killing two birds with one stone, and demonstrates the effectiveness of the proposed framework.

Our contributions can be summarized as follows:

- We propose the PAA in an encoder-decoder framework. This framework models the persona and context information with two separate transformer encoders, which are then fused in the persona-prompted decoder by the proposed PAA mechanism.
- Extensive experiments on the ConvAI2 dataset show that the proposed model performs comparably to or even better than strong baseline methods, with about a 30% improvement in terms of the perplexity metric.
- We demonstrate that our framework is a data-efficient architecture that can achieve comparable performance with 20% to 30% of the training data compared with a larger model such as GPT2 (Radford et al. 2019) trained on the full dataset.

## Related Work

### Persona-based Dialogue Generation

There is growing interest in persona-based dialogue generation, especially work on the PersonaChat/ConvAI2 datasets. The release of the PersonaChat dataset (Zhang et al. 2018) has provoked vibrant research into integrating explicit persona into dialogue response generation. The ConvAI2 dataset (Dinan et al. 2019) is a further split of the PersonaChat dataset that served as the dataset for the conversational competition (http://convai.io/2018/). Most works on persona-based dialogue generation are conducted on the ConvAI2 dataset, so we use it as our primary training and evaluation dataset. Zhang et al. utilized an LSTM (Hochreiter and Schmidhuber 1997) to generate a response from persona and context. Later, TransferTransfo (Wolf et al. 2019) leveraged a pre-trained language model by fine-tuning GPT2 on the dataset with concatenated input. Meanwhile, BERT over BERT (BoB) (Song et al. 2021) is composed of three BERTs (Devlin et al. 2019) and is trained with both negative log-likelihood and unlikelihood losses. BoB utilizes the MNLI dataset (Williams, Nangia, and Bowman 2018) as an auxiliary dataset to help the model recognize positive and negative samples given an anchor sample. P2 Bot (Liu et al. 2020) addressed the persona-based dialogue task by introducing a transmitter-receiver model, which is further tuned with reinforcement learning on manually designed rewards. A recent work (Cao et al.
2022b) tackled the problem in a model-agnostic fashion, providing strategies for data augmentation and curriculum learning. Distinctly different from these previous works, we propose an effective approach without the aid of external datasets or complicated training setups.

### Attention Mechanisms for Conditional Dialogue Generation

Several studies have introduced explicitly designed cross-attention to address dialogue generation, tailored either to condition sentences (Zheng et al. 2020) or to categorical labels (Zeng and Nie 2021). Zheng et al. proposed an attention-routing structure that facilitates the weight from persona information to generate the response. The attention-routing structure adds the cross-attention/self-attention results from the persona-response, context-response, and response-response pairs together to obtain a fused cross-attention that balances the weights among the different sources of input. These cross-attentions/self-attentions are also calculated in our approach. However, instead of calculating the weights with an external predictor, our approach computes them within the framework and then applies masking to the weighted cross-attention results to alleviate training difficulties. In addition, Zeng and Nie introduced a condition-aware transformer block into their model to determine the amount of condition information as a bias in the word generation probability at each position (Zeng and Nie 2021). In the condition-aware block, the keys and values from a condition (e.g., a topic label) and the context are concatenated. The block then processes the concatenated content in a cross-attention to obtain a bias term, which is added to the self-attention. Unlike the condition-aware block approach, our model generates two masks with weights to balance the information from persona and context rather than using a bias term.
In addition, our framework takes persona and context text as input, while the condition-aware transformer (Zeng and Nie 2021) uses a categorical label and context text as input.

## Methodology

### Task Formulation

Suppose that we have a persona-based conversation session $C = \{P, U\}$, where the persona $P = \{p_1, \ldots, p_e\}$ is composed of $e$ profile sentences that describe the background of an interlocutor, and the dialogue context $U = \{u_{h,1}, u_{m,1}, \ldots, u_{h,n}\}$ includes the utterances spoken interactively by the first interlocutor (e.g., human) $h$ and the second interlocutor (e.g., machine) $m$. In the persona-based dialogue generation task, $P$ represents the persona for $m$, and the conversational session always starts with $h$. The objective of this task is therefore to generate the response $r = u_{m,n}$ given the persona $P$ and the dialogue context $U$.

### Overall Framework

As depicted in Figure 2, our framework consists of two encoders and one decoder with PAA to perform the decoding process. The encoding layers use a transformer encoder architecture to encode the persona $P$ and the dialogue context $U$, respectively, into latent representations. The encoder layers are randomly initialized, while the decoder layers are initialized with pre-trained GPT2. The persona information is fed to the persona encoder as well as to the decoder as a prompt, offering strong guidance for GPT2 when decoding the target response. PAA handles the cross-attentions from the persona and context information, balancing and regularizing the two parts by weighting and masking.

### Inputs for Persona-Adaptive Attention

Before presenting the proposed PAA, in this section we introduce the decoder's self-attention and the encoder-decoder cross-attention that serve as the inputs to the PAA. First, the persona $P$ and context $U$ are processed separately by two encoders. Let $I^P = \{t^P_1, \ldots, t^P_l\}$ denote the concatenation of all sentences in $P$, where $t^P_i$ is the $i$-th token in the persona $P$ with $l$ tokens in total.
Meanwhile, $I^U = \{t^U_1, \ldots, t^U_k\}$ represents the token sequence for the concatenated context $U$. We then use bi-directional transformer encoders to encode each text span:

$$h^P = \mathrm{Encoder}_P(I^P), \quad h^U = \mathrm{Encoder}_U(I^U), \tag{1}$$

where $\mathrm{Encoder}_P$ and $\mathrm{Encoder}_U$ denote the bi-directional transformer encoders for persona and context, and $h^P \in \mathbb{R}^{l \times d}$ and $h^U \in \mathbb{R}^{k \times d}$ are the hidden states before the last pooling layer of the encoders, with $d$ the output dimension of the encoders.

Since our framework adopts the encoder-decoder structure, we process the persona-prompted response in the decoder. Specifically, to model $t^y_{r+1}$, the $(r+1)$-th token in the response, we calculate the self-attention over $I^R = \{I^P, [\mathrm{BOS}], t^y_1, \ldots, t^y_r\}$, where $[\mathrm{BOS}]$ is a special token indicating the beginning of the sentence and $t^y_i$ is the $i$-th decoded response token. Formally, the self-attention result from $I^R$ can be expressed as

$$h^R = \mathrm{SelfAttention}(I^R) + M^R, \quad \hat{h}^R = \mathrm{AddNorm}(h^R), \tag{2}$$

where $h^R, \hat{h}^R \in \mathbb{R}^{(l+r) \times d}$ and $M^R$ is the decoder's mask that makes the self-attention calculation uni-directional.

After obtaining the encoder hidden states $h^P$ and $h^U$, as well as the decoder's self-attention output $\hat{h}^R$, we calculate the cross-attention based on the $(h^P, \hat{h}^R)$ and $(h^U, \hat{h}^R)$ pairs. The cross-attention is calculated similarly to the self-attention, with $K$ and $V$ provided by the encoder and $Q$ by the decoder:

$$o^P = \mathrm{Softmax}\!\left(\frac{Q_r K_p^\top}{\sqrt{d}}\right) V_p, \quad o^U = \mathrm{Softmax}\!\left(\frac{Q_r K_u^\top}{\sqrt{d}}\right) V_u, \tag{3}$$

where $Q_r \in \mathbb{R}^{(l+r) \times d}$ denotes a linear transformation of $\hat{h}^R$; $K_p, V_p \in \mathbb{R}^{l \times d}$ denote linear transformations of $h^P$; $K_u, V_u \in \mathbb{R}^{k \times d}$ come from linear transformations of $h^U$; and $d$ is the dimension of the attention head. By calculating the cross-attentions, we obtain the correlations between the encoders and the decoder, which serve as part of the input to PAA.
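As an illustration, the scaled dot-product cross-attention of Eq. (3) can be sketched in NumPy. This is a minimal sketch under our own naming (`cross_attention`); identity projections stand in for the learned linear maps $Q_r$, $K_{*}$, $V_{*}$, so it is not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_dec, h_enc, d):
    # Eq. (3): queries come from the decoder states, keys and values
    # from an encoder's hidden states (identity projections assumed).
    scores = h_dec @ h_enc.T / np.sqrt(d)       # ((l+r) x len_enc)
    return softmax(scores) @ h_enc              # ((l+r) x d)

# Toy shapes: l + r = 6 decoder positions, d = 8 hidden size.
rng = np.random.default_rng(0)
h_R_hat = rng.normal(size=(6, 8))   # decoder self-attention output
h_P = rng.normal(size=(4, 8))       # persona encoder states (l = 4)
h_U = rng.normal(size=(5, 8))       # context encoder states (k = 5)
o_P = cross_attention(h_R_hat, h_P, d=8)
o_U = cross_attention(h_R_hat, h_U, d=8)
```

Note that both $o^P$ and $o^U$ have $(l+r)$ rows, matching the decoder length, which is what allows them to be fused position-wise in the PAA.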
### Persona-Adaptive Attention

To fuse the cross-attention results, the proposed PAA uses weighting and masking mechanisms to exploit the persona information. Specifically, we take the self-attention result $h^R$ and the cross-attention result $o^P$ as input to generate the initial weight $w_{\mathrm{persona}}$ for the persona information. The motivation behind this operation is to enable the model to consider the relationship between the persona and the response in both self-attention and cross-attention fashions. Formally,

$$m_p = \mathrm{FC}([h^R; o^P]), \quad w_{\mathrm{persona}} = \mathrm{Sigmoid}(m_p). \tag{4}$$

In Eq. (4), $[\,;\,]$ denotes the concatenation operation, and $h^R$ and $o^P$ are first mapped into $m_p \in \mathbb{R}^{(l+r) \times d}$ by a linear layer $\mathrm{FC}$, followed by a $\mathrm{Sigmoid}(\cdot)$ to obtain the initial weight for the persona cross-attention. The weight is then applied to the persona-response and context-response cross-attention results to form a complementary relationship, leading to the weighted cross-attentions $\bar{o}^P$ and $\bar{o}^U$:

$$\bar{o}^P = w_{\mathrm{persona}} \, o^P, \quad \bar{o}^U = (1 - w_{\mathrm{persona}}) \, o^U. \tag{5}$$

Figure 2: (a) The overview of our framework, including two encoders for persona and context, respectively, and a decoder with PAA to generate a response. (b) The PAA architecture balances the information flows from the two sources of input by generating dynamic masks ($H_{\mathrm{PAA}}$ is the module's output).
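The gating in Eqs. (4)–(5) can be sketched as follows. This is a NumPy illustration under our own naming; the explicit `W` and `b` parameters stand in for the learned FC layer and are illustrative assumptions.

```python
import numpy as np

def persona_weight(h_R, o_P, W, b):
    # Eq. (4): concatenate the decoder self-attention output and the
    # persona cross-attention, apply a linear layer (W, b), then a
    # sigmoid to get per-position, per-dimension weights in (0, 1).
    m_p = np.concatenate([h_R, o_P], axis=-1) @ W + b
    return 1.0 / (1.0 + np.exp(-m_p))

def weighted_streams(w_persona, o_P, o_U):
    # Eq. (5): complementary weighting of the two cross-attention streams.
    return w_persona * o_P, (1.0 - w_persona) * o_U

# Toy example with l + r = 6 positions and d = 8.
rng = np.random.default_rng(1)
h_R = rng.normal(size=(6, 8))
o_P = rng.normal(size=(6, 8))
o_U = rng.normal(size=(6, 8))
W = 0.1 * rng.normal(size=(16, 8))  # maps the 2d-concat back to d
b = np.zeros(8)
w = persona_weight(h_R, o_P, W, b)
o_P_bar, o_U_bar = weighted_streams(w, o_P, o_U)
```

Because the two streams share the single gate $w_{\mathrm{persona}}$, raising the persona weight at a position necessarily lowers the context weight there, which is what makes the trade-off complementary.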
To dynamically remove redundant information and to regularize the two input sources, we transform $w_{\mathrm{persona}}$ into $m_{\mathrm{persona}}$ and $m_{\mathrm{context}}$, the masks for the two input sources:

$$m_{\mathrm{persona}} = \mathbb{1}(w_{\mathrm{persona}} > \tau), \quad m_{\mathrm{context}} = \mathbb{1}(1 - w_{\mathrm{persona}} > \tau). \tag{6}$$

Here, the masks $m_{\mathrm{persona}}$ and $m_{\mathrm{context}}$ are produced by the binary indicator $\mathbb{1}(\cdot)$, which outputs 1 or 0 according to the given condition. The threshold $\tau$ controls the strength of the masking and is defined as $\tau = |I^U| / (|I^U| + |I^P|)$, where $|I^U|$ denotes the length of the context input and $|I^P|$ the length of the persona input. The intuition behind this setting of $\tau$ is to strengthen the masking when the context length outweighs the persona length. After obtaining the masks, we apply them to compute the weighted sum:

$$\hat{o}^P = m_{\mathrm{persona}} \odot \bar{o}^P, \quad \hat{o}^U = m_{\mathrm{context}} \odot \bar{o}^U, \quad H_{\mathrm{PAA}} = \hat{o}^P + \hat{o}^U, \tag{7}$$

where $\odot$ denotes element-wise multiplication, implementing the masking operation. The weighted masked results $\hat{o}^P$ and $\hat{o}^U$ are added together to form $H_{\mathrm{PAA}}$, the output of PAA. The balanced masked result $H_{\mathrm{PAA}}$ is then passed to the feed-forward network in the decoder, as depicted in Figure 2(a). Such transformer blocks are repeated $N_d$ times to obtain the final output.

### Learning Objective

In the training process, the objective function is the widely used negative log-likelihood loss:

$$\mathcal{L}_{\mathrm{NLL}} = -\log p_\theta(I^R \mid I^P, I^U) = -\sum_{i=1}^{|r|} \log p_\theta(t^y_i \mid I^P, I^U, t^y_{<i}),$$