# CEM: Commonsense-Aware Empathetic Response Generation

Sahand Sabour, Chujie Zheng, Minlie Huang*

The CoAI Group, DCST, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China

sahandfer@gmail.com, chujiezhengchn@gmail.com, aihuang@tsinghua.edu.cn

*Corresponding author.

## Abstract

A key trait of daily conversations between individuals is the ability to express empathy towards others, and exploring ways to implement empathy is a crucial step towards human-like dialogue systems. Previous approaches on this topic mainly focus on detecting and utilizing the user's emotion for generating empathetic responses. However, since empathy includes both aspects of affection and cognition, we argue that in addition to identifying the user's emotion, cognitive understanding of the user's situation should also be considered. To this end, we propose a novel approach for empathetic response generation, which leverages commonsense to draw more information about the user's situation and uses this additional information to further enhance the empathy expression in generated responses. We evaluate our approach on EMPATHETICDIALOGUES, a widely-used benchmark dataset for empathetic response generation. Empirical results demonstrate that our approach outperforms the baseline models in both automatic and human evaluations and can generate more informative and empathetic responses. Our code is available at https://github.com/Sahandfer/CEM.

## Introduction

Empathy is a desirable trait of human daily conversations that enables individuals to understand, perceive, and respond appropriately to the situations and feelings of others (Keskin 2014). Previous research has demonstrated that empathetic dialogue systems can improve user experience and satisfaction in multiple domains (Fitzpatrick, Darcy, and Vierhile 2017; Liu et al. 2021; Wang et al. 2021). Hence, it is important to discover ways that allow us to equip open-domain dialogue systems with empathy.

Recent work (Rashkin et al. 2019; Lin et al. 2019; Majumder et al. 2020; Li et al. 2020a) has proposed various methods of generating empathetic responses that mainly rely on detecting the user's emotion. However, empathy is a broad construct that includes aspects of affection and cognition (Davis 1983). The affective aspect is concerned with the emotional simulation in reaction to the user's experiences (Cuff et al. 2016), while the cognitive aspect aims to understand the user's situation and the implied feelings (Elliott et al. 2018). Hence, though emotion is one of the important factors of empathy, it is not the only determining factor. This is demonstrated in Figure 1, where both affective and cognitive empathy are used to form empathetic responses. For instance, in the first case, the user shares information about their emotion ("I felt pretty good") as well as their experience ("hit a new pr on the overhead press"). Accordingly, we can assume that the user is Proud of their achievement and must have worked hard to reach this level.

[Figure 1: Examples from the EMPATHETICDIALOGUES dataset in which commonsense is used to gain additional information about the user's emotion and situation before responding empathetically.]
Since these assumptions are not explicitly mentioned by the user, we as humans tend to rely on our own knowledge and commonsense reasoning to draw these implications. Therefore, we believe that providing dialogue systems with this external knowledge could play a critical role in understanding the user's situation and feelings, which leads to more informative and empathetic responses.

Towards this end, we propose the Commonsense-aware Empathetic Chatting Machine (CEM). CEM leverages external commonsense knowledge to obtain more information about the user's situation and feelings (i.e., the user's reaction, intention, desire, etc.). This additional information is used to improve the cognitive understanding and thus enhance the empathy expression in the generated responses. We evaluate our approach on EMPATHETICDIALOGUES, a widely-used benchmark dataset for empathetic response generation. Both automatic and manual evaluation results demonstrate that, compared to previous methods, CEM can generate more informative and empathetic responses.

Our contributions are summarized as follows:

- We propose to leverage commonsense to improve the understanding of interlocutors' situations and feelings, which is an important part of cognitive empathy.
- We introduce CEM, a novel approach that uses various types of commonsense reasoning to enhance empathetic response generation.
- Automatic and manual evaluations demonstrate that, with the addition of commonsense, CEM is able to generate more informative and empathetic responses compared with previous methods.

## Preliminaries

### Empathetic Dialogue Generation

Empathy is a fairly new term in the literature and therefore has no specific or widely accepted definition in the fields of social psychology and psychotherapy (Macarov 1978; Elliott et al. 2011). However, empathy is commonly known as a complex multi-dimensional construct that includes broad aspects of affection and cognition (Davis 1983; Zheng et al. 2021). Affective empathy enables us to experience the emotion of others through various emotional stimuli (Cuff et al. 2016), while cognitive empathy enables us to understand the situations and implicit mental states of others, such as intentions, causes, desires, and requirements (Elliott et al. 2018).

In recent years, research on implementing empathy in dialogue systems and generating empathetic responses has gained considerable attention. Initially, Rashkin et al. (2019) demonstrated that detecting the user's emotion is an essential part of generating empathetic responses. Lin et al. (2019) designed a separate decoder for each available emotion and softly combined their outputs. Majumder et al. (2020) proposed that empathetic responses should also mimic the user's emotion to a degree. Li et al. (2020a) leveraged user feedback and proposed a multi-resolution adversarial framework for this task. Recently, Li et al. (2020b) used commonsense knowledge from ConceptNet (Speer, Chin, and Havasi 2017) to gain a better understanding of the implied emotions within the context. However, these works usually focus on detecting the context emotion and do not pay enough attention to the cognitive aspect of empathy.

### Commonsense and Empathy

As mentioned, a major part of cognitive empathy is understanding the situations and feelings of others. When interacting with a dialogue system, the user is not expected to explicitly share all the information about their situation and how they may feel. As humans, we use our commonsense knowledge to make connections between what is explicitly mentioned and what is implied. Hence, we hypothesize that enabling dialogue systems to leverage commonsense and derive implications from what the user has explicitly shared is highly beneficial for a better understanding of the user's situation and feelings, which leads to more effective cognitive empathy and thus more empathetic responses.

In this work, we use ATOMIC (Sap et al. 2019) as our commonsense knowledge base. ATOMIC is a collection of commonsense inferences about everyday if-then events. For each event, ATOMIC infers six commonsense relations for the person involved in the event: the effect of the event on the person (xEffect), their reaction to the event (xReact), their intent before the event (xIntent), what they need in order for the event to happen (xNeed), what they would want after the event (xWant), and an inferred attribute of the person's characteristics (xAttr). Since predicting a person's attributes merely based on a given event would involve judging that person, which is not part of the empathetic process (Peloquin 1995), we neglect xAttr in our approach and use the remaining five relations.

In order to generate commonsense inferences for given events, we adopt COMET (Bosselut et al. 2019), a pre-trained GPT-2 model (Radford et al. 2018) that is fine-tuned on triplets (e, r, i) from ATOMIC, where e, r, and i are the event, the relation type, and the inferred knowledge, respectively. More specifically, we use a modified BART-based (Lewis et al. 2020) variation of COMET, which is trained on the ATOMIC-2020 dataset (Hwang et al. 2021). This model is equipped with knowledge that is not readily available to pre-trained language models and is more suitable for inferring knowledge regarding unseen events (Hwang et al. 2021). The latter is necessary for our use case, as many of the events within an empathetic conversation may not occur on a daily basis and therefore may not exist in the original ATOMIC dataset.
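To make this querying step concrete, the sketch below shows how one might obtain inferences for the five retained relations from a BART-based COMET model via Hugging Face `transformers`. This is a minimal sketch, not the paper's code: the checkpoint path is a placeholder, and the `"<event> <relation> [GEN]"` query format is an assumption based on the public ATOMIC-2020 COMET release.

```python
# Minimal sketch: querying a BART-based COMET model for commonsense
# inferences. Checkpoint path and query format are assumptions based on
# the public ATOMIC-2020 release, not details taken from the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_PATH = "path/to/comet-atomic-2020-bart"  # placeholder checkpoint
RELATIONS = ["xReact", "xWant", "xNeed", "xIntent", "xEffect"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

def generate_inferences(event: str, relation: str, n: int = 5):
    """Generate n commonsense inferences for one event-relation pair."""
    # Assumed COMET-ATOMIC-2020 query format: "<event> <relation> [GEN]".
    query = f"{event} {relation} [GEN]"
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs, num_beams=n, num_return_sequences=n, max_length=24
    )
    return [tokenizer.decode(o, skip_special_tokens=True).strip() for o in outputs]

last_utterance = "I hit a new pr on the overhead press today. I felt pretty good."
commonsense = {r: generate_inferences(last_utterance, r) for r in RELATIONS}
# e.g. commonsense["xReact"] would typically contain emotion words like "proud".
```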
## Task Formulation

We conduct our experiments on EMPATHETICDIALOGUES (Rashkin et al. 2019), a large-scale multi-turn dataset containing 25k empathetic conversations between crowdsourcing workers. The dataset also provides an emotion label for each conversation from a total of 32 available emotions. Each conversation in this dataset is between a speaker and a listener. The task requires a dialogue model to play the role of the listener and generate empathetic responses.

Formally, let $D = [u_1, u_2, u_3, \dots, u_{k-1}]$ denote a dialogue history of $k-1$ utterances, where $u_i = [w^i_1, w^i_2, w^i_3, \dots, w^i_{M_i}]$ is the $i$-th utterance, consisting of $M_i$ words. Our goal is to generate the listener's next utterance $u_k$, which should be coherent to the context, informative, and empathetic to the speaker's situation and feelings.

[Figure 2: Overview of our model (CEM).]

## Methodology

Our proposed model, CEM, is built upon the standard Transformer (Vaswani et al. 2017); its overview is illustrated in Figure 2. The process of CEM is mainly divided into five stages: context encoding, knowledge acquisition, context refinement, knowledge selection, and response generation.

### Context Encoding

Following previous work (Lin et al. 2019; Majumder et al. 2020), we concatenate the utterances in the dialogue history and prepend a special token [CLS] to obtain the context input $C = [\mathrm{CLS}] \oplus u_1 \oplus u_2 \oplus u_3 \oplus \dots \oplus u_{k-1}$, where $\oplus$ is the concatenation operation. Similar to Devlin et al. (2019), we use the final hidden representation of [CLS] as the representation of the whole sequence. We acquire the embedding $E_C$ of the sequence $C$ by summing up the word embedding, positional embedding, and dialogue state embedding. As each utterance in $C$ could be from either the listener or the speaker, we use the dialogue state embedding to distinguish between the two parties. The sequence embedding $E_C$ is then fed to a context encoder to produce the contextual representation:

$$H_{CTX} = \mathrm{Enc}_{CTX}(E_C) \tag{1}$$

where $H_{CTX} \in \mathbb{R}^{L \times d}$, $L$ is the length of the sequence, and $d$ is the hidden size of the context encoder.
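As a concrete illustration of this step, the sketch below builds $E_C$ as the sum of word, positional, and dialogue state embeddings and feeds it to a Transformer encoder, as in Equation 1. It is a minimal PyTorch sketch: the vocabulary size, hidden size, and layer counts are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Minimal sketch of CEM's context encoding (Eq. 1). Hyperparameters
    here are illustrative assumptions, not the paper's settings."""

    def __init__(self, vocab_size=30000, d=300, n_layers=2, n_heads=2, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_len, d)
        # Dialogue state embedding distinguishes the two parties:
        # 0 = speaker, 1 = listener.
        self.state_emb = nn.Embedding(2, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, state_ids):
        # token_ids, state_ids: (batch, L); position 0 holds [CLS].
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # E_C = word + positional + dialogue state embeddings.
        e_c = self.word_emb(token_ids) + self.pos_emb(positions) + self.state_emb(state_ids)
        h_ctx = self.encoder(e_c)   # H_CTX: (batch, L, d), Eq. (1)
        cls_repr = h_ctx[:, 0]      # final [CLS] representation of the sequence
        return h_ctx, cls_repr
```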
### Knowledge Acquisition

For the input sequence $C$, we respectively append five special relation tokens ([xReact], [xWant], [xNeed], [xIntent], [xEffect]) to the last utterance in the dialogue history and use COMET to generate five commonsense inferences $[cs^r_1, cs^r_2, \dots, cs^r_5]$ per relation $r$. For each relation, we concatenate the generated commonsense inferences to obtain its commonsense sequence $CS_r = cs^r_1 \oplus cs^r_2 \oplus \dots \oplus cs^r_5$.

Given that xReact represents knowledge regarding the affective state (i.e., the user's emotion) while the other relations represent knowledge regarding the cognitive state (i.e., the user's situation), we divide the relations into two groups: affective and cognitive. Accordingly, similar to the previous section, we prepend [CLS] to the cognitive sequences. As the inferences for xReact are usually emotion words (e.g., sad, happy, angry) rather than sentences, we simply use the average of their hidden representations to represent these sequences. Based on this grouping, the resulting sequences are fed to two separate affective and cognitive encoders:

$$H_{xReact} = \mathrm{Enc}_{Aff}(E_{CS_{xReact}}) \tag{2}$$

$$H_r = \mathrm{Enc}_{Cog}(E_{CS_r}) \tag{3}$$

where $H_{xReact} \in \mathbb{R}^{l_{xReact} \times d}$ and $H_r \in \mathbb{R}^{l_r \times d}$, with $l_{xReact}$ and $l_r$ being the lengths of the commonsense inference sequences, and $r \in \{\text{xWant}, \text{xNeed}, \text{xIntent}, \text{xEffect}\}$. Then, we use the average hidden representation for the affective relation and the hidden representation of [CLS] for the cognitive relations to represent these sequences, respectively:

$$h_{xReact} = \mathrm{Average}(H_{xReact}) \tag{4}$$

$$h_r = H_r[0] \tag{5}$$

where $h_{xReact}, h_r \in \mathbb{R}^d$.

### Context Refinement

Similar to Majumder et al. (2020), in order to refine the context with additional information, we first respectively concatenate each of the commonsense relation representations (Equations 4 and 5) to the context representation $H_{CTX}$ at the token level (i.e., $U_r \in \mathbb{R}^{L \times 2d}$):

$$U_{xReact}[i] = H_{CTX}[i] \oplus h_{xReact} \tag{6}$$

$$U_r[i] = H_{CTX}[i] \oplus h_r \tag{7}$$

In contrast to concatenating the representations at the sequence level (i.e., adding additional information to the end of the context representation), token-level concatenation enables us to fuse the additional knowledge with each word in the sequence. Accordingly, we use two separate encoders (affection-refined and cognition-refined), corresponding to the two groups of relations, to encode the fused representations and obtain commonsense-refined context representations for each relation:

$$H_{Aff} = \mathrm{Enc}_{CTX\text{-}Aff}(U_{xReact}) \tag{8}$$

$$H_{Cog,r} = \mathrm{Enc}_{CTX\text{-}Cog}(U_r) \tag{9}$$

where $H_{Aff}, H_{Cog,r} \in \mathbb{R}^{L \times d}$.
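The token-level fusion in Equations 6-9 can be pictured as broadcasting one relation vector across every context token before re-encoding. Below is a minimal PyTorch sketch of this operation; the refinement encoder is a stand-in (a linear projection from $2d$ back to $d$ followed by a Transformer layer, with illustrative sizes), not necessarily the paper's exact module.

```python
import torch
import torch.nn as nn

def token_level_concat(h_ctx: torch.Tensor, h_rel: torch.Tensor) -> torch.Tensor:
    """Eqs. (6)-(7): append one relation vector h_rel (batch, d) to every
    token of the context h_ctx (batch, L, d), giving (batch, L, 2d)."""
    expanded = h_rel.unsqueeze(1).expand(-1, h_ctx.size(1), -1)
    return torch.cat([h_ctx, expanded], dim=-1)

class RefinementEncoder(nn.Module):
    """Stand-in for Enc_CTX-Aff / Enc_CTX-Cog (Eqs. 8-9): maps the fused
    (batch, L, 2d) sequence back to a (batch, L, d) refined context."""
    def __init__(self, d=300, n_heads=2):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)  # fold 2d back to d before encoding
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, u):
        return self.encoder(self.proj(u))

# Usage sketch: refine the context with the xReact representation.
h_ctx = torch.randn(1, 20, 300)     # H_CTX from the context encoder
h_xreact = torch.randn(1, 300)      # h_xReact from Eq. (4)
u_xreact = token_level_concat(h_ctx, h_xreact)  # U_xReact: (1, 20, 600)
h_aff = RefinementEncoder()(u_xreact)           # H_Aff: (1, 20, 300)
```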
### Emotion Classification

In order to acquire a more accurate prediction of the user's affective state, given that we are provided with an emotion label $e^*$ for each conversation, we use the hidden representation of the [CLS] token from the affection-refined context representation to perform emotion classification:

$$h_{Aff} = H_{Aff}[0] \tag{10}$$

where $h_{Aff} \in \mathbb{R}^d$. We pass $h_{Aff}$ through a linear layer followed by a Softmax operation to produce the emotion category distribution $P_{emo} \in \mathbb{R}^q$, where $q$ is the number of available emotion categories:

$$P_{emo} = \mathrm{Softmax}(W_e h_{Aff}) \tag{11}$$

where $W_e \in \mathbb{R}^{d \times q}$ is the weight matrix of the linear layer. During training, we optimize these weights by minimizing the Cross-Entropy (CE) loss between the emotion category distribution $P_{emo}$ and the ground-truth label $e^*$:

$$\mathcal{L}_{emo} = -\log\left(P_{emo}(e^*)\right) \tag{12}$$

### Knowledge Selection

Merely using one of the commonsense representations to produce an empathetic response is not ideal. For instance, if we only rely on the affection-refined context, the generated responses would likely be about the user's emotions (e.g., "You must be proud."), whereas using the cognition-refined contexts may lead to responses that focus more on the situation (e.g., "You must have worked really hard."). Hence, we want to enable our model to generate responses based on a mixture of both affective and cognitive information. To this end, we first concatenate all five relation-refined contexts at the token level:

$$H_{Cog}[i] = \bigoplus_{r \in \{\text{xWant}, \text{xNeed}, \text{xIntent}, \text{xEffect}\}} H_{Cog,r}[i] \tag{13}$$

$$H_{Refine}[i] = H_{Aff}[i] \oplus H_{Cog}[i] \tag{14}$$

where $H_{Refine} \in \mathbb{R}^{L \times 5d}$. To highlight the more important features within the refined context representation, we apply the Sigmoid function to $H_{Refine}$ to measure the importance of each relation-refined context for response generation. Then, we multiply $H_{Refine}$ by the resulting importance scores, as done by Majumder et al. (2020). Finally, the obtained representation is passed through a Multi-Layer Perceptron (MLP) with ReLU activation, which learns how to mix the commonsense knowledge of the different relations into a combined contextualized representation:

$$\widetilde{H}_{CTX} = \mathrm{MLP}\left(\sigma(H_{Refine}) \odot H_{Refine}\right) \tag{15}$$

where $\widetilde{H}_{CTX} \in \mathbb{R}^{L \times d}$ and $\odot$ denotes element-wise multiplication.

### Response Generation

Lastly, the target response $Y = [y_1, \dots, y_T]$ with length $T$, which has the same meaning as $u_k$ in the Task Formulation section, is generated by the decoder token by token: $P(y_t \mid y_{<t}, C)$.
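To make the knowledge selection step concrete, the following sketch implements Equations 13-15: the five refined contexts are concatenated per token, gated by a Sigmoid over the concatenation, and mixed back down to dimension $d$ by a ReLU MLP. It is a minimal sketch: the MLP depth and hidden size are assumptions, as the paper does not specify them here.

```python
import torch
import torch.nn as nn

class KnowledgeSelector(nn.Module):
    """Minimal sketch of CEM's knowledge selection (Eqs. 13-15).
    MLP depth and hidden size are illustrative assumptions."""

    def __init__(self, d=300):
        super().__init__()
        # ReLU MLP mixing the 5d-dimensional fused features down to d.
        self.mlp = nn.Sequential(
            nn.Linear(5 * d, d),
            nn.ReLU(),
            nn.Linear(d, d),
        )

    def forward(self, h_aff, h_cog):
        # Eqs. (13)-(14): token-level concatenation of the affection-refined
        # context (batch, L, d) and the four cognition-refined contexts,
        # giving H_Refine with shape (batch, L, 5d).
        h_refine = torch.cat([h_aff] + h_cog, dim=-1)
        # Eq. (15): sigmoid gate scores feature importance, then the MLP
        # produces the combined representation (batch, L, d).
        gated = torch.sigmoid(h_refine) * h_refine
        return self.mlp(gated)

# Usage sketch with random tensors standing in for the refined contexts.
h_aff = torch.randn(1, 20, 300)                      # H_Aff
h_cog = [torch.randn(1, 20, 300) for _ in range(4)]  # H_Cog,r per relation
h_ctx_tilde = KnowledgeSelector()(h_aff, h_cog)      # (1, 20, 300)
```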