# Variational Memory Encoder-Decoder

Hung Le, Truyen Tran, Thin Nguyen and Svetha Venkatesh
Applied AI Institute, Deakin University, Geelong, Australia
{lethai,truyen.tran,thin.nguyen,svetha.venkatesh}@deakin.edu.au

Introducing variability while maintaining coherence is a core task in learning to generate utterances in conversation. Standard neural encoder-decoder models and their extensions using conditional variational autoencoders often result in either trivial or digressive responses. To overcome this, we explore a novel approach that injects variability into the neural encoder-decoder via the use of external memory as a mixture model, namely the Variational Memory Encoder-Decoder (VMED). By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over the others in both metric-based and qualitative evaluations.

## 1 Introduction

Recent advances in generative modeling have led to the exploration of generative tasks. While generative models such as GAN [12] and VAE [19, 29] have been applied successfully to image generation, learning generative models for sequential discrete data is a long-standing problem. Early attempts to generate sequences using RNNs [13] and neural encoder-decoder models [17, 35] gave promising results, but the deterministic nature of these models proves to be inadequate in many realistic settings. Tasks such as translation, question-answering and dialog generation would benefit from stochastic models that can produce a variety of outputs for an input. For example, there are several ways to translate a sentence from one language to another, multiple answers to a question and multiple responses to an utterance in conversation.

Another line of research that has captured attention recently is memory-augmented neural networks (MANNs). Such models have larger memory capacity and thus remember temporally distant information in the input sequence, and they provide a RAM-like mechanism to support model execution. MANNs have been successfully applied to long sequence prediction tasks [14, 33], demonstrating great improvement when compared to other recurrent models. However, the role of memory in sequence generation has not been well understood.

For tasks involving language understanding and production, handling intrinsic uncertainty and latent variations is necessary. The choice of words and grammar may change erratically depending on speaker intentions, moods and previous language used. The underlying RNN in neural sequential models finds it hard to capture these dynamics, and its outputs are often trivial or too generic [23]. One way to overcome these problems is to introduce variability into these models. Unfortunately, sequential data such as speech and natural language is a hard place to inject variability [30], since it requires coherence of grammar and semantics yet allows freedom of word choice. We propose a novel hybrid approach that integrates MANN and VAE, called the Variational Memory Encoder-Decoder (VMED), to model the sequential properties and inject variability in sequence generation tasks.
We introduce latent random variables to model the variability observed in the data and to capture dependencies between the latent variables across timesteps. Our assumption is that there are latent variables governing the output at each timestep. In the conversation context, for instance, the latent space may represent the speaker's hidden intention and mood, which dictate word choice and grammar. For a rich latent multimodal space, we use a Mixture of Gaussians (MoG), because a spoken word's latent intention and mood can come from different modes, e.g., whether the speaker is asking or answering, or whether she/he is happy or sad. By modeling the latent space as an MoG where each mode associates with some memory slot, we aim to capture multiple modes of the speaker's intention and mood when producing a word in the response. Since the decoder in our model has multiple read heads, the MoG can be computed directly from the content of the chosen memory slots. Our external memory plays the role of a mixture model distribution generating the latent variables, which are used to produce the output and take part in updating the memory for future generative steps.

To train our model, we adapt the Stochastic Gradient Variational Bayes (SGVB) framework [19]. Instead of minimizing the KL divergence directly, we resort to its variational approximation [15] to accommodate the MoG in the latent space. We show that minimizing the approximation results in KL divergence minimization. We further derive an upper bound on our total timestep-wise KL divergence and demonstrate that minimizing the upper bound is equivalent to fitting a continuous function by a scaled MoG.

We validate the proposed model on the task of conversational response generation. This task serves as a nice testbed for the model because an utterance in a conversation is conditioned on previous utterances, the intention and the mood of the speaker. Finally, we evaluate our model on two open-domain and two closed-domain conversational datasets. The results demonstrate that our proposed VMED gains significant improvement over state-of-the-art alternatives.

## 2 Preliminaries

### 2.1 Memory-augmented Encoder-Decoder Architecture

A memory-augmented encoder-decoder (MAED) consists of two neural controllers linked via an external memory. This is a natural extension of read-write MANNs to handle sequence-to-sequence problems. In MAED, the memory serves as a compressor that encodes the input sequence into its memory slots, capturing the most essential information. Then, a decoder attends to these memory slots looking for the cues that help to predict the output sequence. MAED has recently demonstrated promising results in machine translation [5, 37] and healthcare [20, 21, 28]. In this paper, we advance a recent MAED known as DC-MANN described in [20], where the powerful DNC [14] is chosen as the external memory. In DNC, memory accesses and updates are executed via the controller's reading and writing heads at each timestep. Given the current input $x_t$ and a set of $K$ previous read values from memory $r_{t-1} = \left[r^1_{t-1}, r^2_{t-1}, ..., r^K_{t-1}\right]$, the controllers compute read-weight vectors $w^{i,r}_t$ and a write-weight vector $w^w_t$ for addressing the memory $M_t$. There are two versions of decoding in DC-MANN: write-protected and writable memory. We prefer to allow writing to the memory during inference because, in this work, we focus on generating diverse output sequences, which requires a dynamic memory for both the encoding and decoding processes.
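To make the memory addressing concrete, the following is a minimal sketch of content-based reading with $K$ soft read heads in a PyTorch setting. It is a simplification of DNC addressing that keeps only the content-based path (it omits DNC's temporal linkage and usage-based allocation), and all names such as `MemoryReader` and `num_heads` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F


class MemoryReader(torch.nn.Module):
    """Simplified content-based addressing with K soft read heads.

    Each head emits a key and a sharpness scalar, compares the key against
    every memory slot by cosine similarity, and returns an attention-weighted
    sum of the slots as its read value.
    """

    def __init__(self, hidden_dim, word_dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.word_dim = word_dim
        # One key and one sharpness (beta) per read head, produced from the
        # controller's hidden state.
        self.key_net = torch.nn.Linear(hidden_dim, num_heads * word_dim)
        self.beta_net = torch.nn.Linear(hidden_dim, num_heads)

    def forward(self, h, memory):
        # h: [batch, hidden_dim], memory: [batch, num_slots, word_dim]
        keys = self.key_net(h).view(-1, self.num_heads, self.word_dim)
        betas = F.softplus(self.beta_net(h)).unsqueeze(-1)           # [B, K, 1]
        # Cosine similarity between each key and each memory slot.
        sim = F.cosine_similarity(
            keys.unsqueeze(2), memory.unsqueeze(1), dim=-1)          # [B, K, S]
        read_weights = F.softmax(betas * sim, dim=-1)                # [B, K, S]
        read_values = torch.einsum('bks,bsd->bkd', read_weights, memory)
        return read_values, read_weights
```

The read weights returned here play the role of $w^{i,r}_t$, and the read values the role of $r^i_t$, in the notation above.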
### 2.2 Conditional Variational Autoencoder (CVAE) for Conversation Generation

A dyadic conversation can be represented via three random variables: the conversation context $x$ (all the chat before the response utterance), the response utterance $y$ and a latent variable $z$, which is used to capture the latent distribution over the reasonable responses. A variational autoencoder conditioned on $x$ (CVAE) is trained to maximize the conditional log likelihood of $y$ given $x$, which involves an intractable marginalization over the latent variable $z$, i.e.,

$$p(y \mid x) = \int_z p(y, z \mid x)\, dz = \int_z p(y \mid x, z)\, p(z \mid x)\, dz \quad (1)$$

Fortunately, CVAE can be efficiently trained with the Stochastic Gradient Variational Bayes (SGVB) framework [19] by maximizing the variational lower bound of the conditional log likelihood. In a typical CVAE, $z$ is assumed to follow a multivariate Gaussian distribution with a diagonal covariance matrix, which is conditioned on $x$ as $p_\phi(z \mid x)$, and a recognition network $q_\theta(z \mid x, y)$ is used to approximate the true posterior distribution $p(z \mid x, y)$. The variational lower bound becomes:

$$\mathcal{L}(\phi, \theta; y, x) = -\mathrm{KL}\left(q_\theta(z \mid x, y) \,\|\, p_\phi(z \mid x)\right) + \mathbb{E}_{q_\theta(z \mid x, y)}\left[\log p(y \mid x, z)\right] \le \log p(y \mid x) \quad (2)$$

With the introduction of the neural approximator $q_\theta(z \mid x, y)$ and the reparameterization trick [18], we can apply standard back-propagation to compute the gradient of the variational lower bound. Fig. 1(a) depicts the graphical model for this approach in the case of using CVAE.

Figure 1: Graphical models of the vanilla CVAE (a) and our proposed VMED (b)
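As a reference point for the memory-based variant introduced next, here is a minimal sketch of the per-example lower bound in Eq. (2) for a diagonal-Gaussian prior and posterior, using the reparameterization trick and the closed-form Gaussian KL. The function and argument names are illustrative assumptions, not identifiers from the paper.

```python
import torch


def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I), keeping the sample differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps


def cvae_elbo(recon_log_prob, mu_q, logvar_q, mu_p, logvar_p):
    """Variational lower bound of Eq. (2) with diagonal-Gaussian prior/posterior.

    recon_log_prob: log p(y | x, z) evaluated at z drawn from the posterior,
    shape [batch]. mu_*/logvar_*: parameters of q(z|x,y) and p(z|x), [batch, d].
    """
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions.
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q
                + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum(dim=-1)
    return recon_log_prob - kl   # maximize this (or minimize its negative)
```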
## 3 Variational Memory Encoder-Decoder

Built upon CVAE and partly inspired by VRNN [8], we introduce a novel memory-augmented variational recurrent network dubbed Variational Memory Encoder-Decoder (VMED). With an external memory module, VMED explicitly models the dependencies between latent random variables across subsequent timesteps. However, unlike the VRNN, which uses the hidden values of an RNN to model the latent distribution as a Gaussian, our VMED uses read values $r$ from an external memory $M$ as a Mixture of Gaussians (MoG) to model the latent space. This choice of MoG also leads to new formulations for the prior $p_\phi$ and the posterior $q_\theta$ mentioned in Eq. (2). The graphical representation of our model is shown in Fig. 1(b).

### 3.1 Generative Process

The VMED includes a CVAE at each timestep of the decoder. These CVAEs are conditioned on the context sequence via $K$ read values $r_{t-1} = \left[r^1_{t-1}, r^2_{t-1}, ..., r^K_{t-1}\right]$ from the external memory. Since the read values are conditioned on the previous state of the decoder $h^d_{t-1}$, our model takes into account the temporal structure of the output. Unlike other designs of CVAE, where there is often only one CVAE with a Gaussian prior for the whole decoding process, our model keeps reading the external memory to produce the prior as a Mixture of Gaussians at every timestep. At the $t$-th step of generating an utterance in the output sequence, the decoder reads $K$ read values from the memory, representing the $K$ modes of the MoG. This multi-modal prior reflects the fact that given a context $x$, there are different modes of uttering the output word $y_t$, which a single mode cannot fully capture. The MoG prior distribution is modeled as:

$$g_t = p_\phi(z_t \mid x, r_{t-1}) = \sum_{i=1}^{K} \pi^{i,x}_t\!\left(x, r^i_{t-1}\right) \mathcal{N}\!\left(z_t;\, \mu^{i,x}_t\!\left(x, r^i_{t-1}\right),\, \sigma^{i,x}_t\!\left(x, r^i_{t-1}\right)^2 I\right) \quad (3)$$

We treat the mean $\mu^{i,x}_t$ and standard deviation (s.d.) $\sigma^{i,x}_t$ of each Gaussian distribution in the prior as neural functions of the context sequence $x$ and the read vectors from the memory. The context is encoded into the memory by an $\mathrm{LSTM}^E$ encoder. In decoding, the decoder $\mathrm{LSTM}^D$ attends to the memory and chooses $K$ read vectors. We split each read vector into two parts $r^{i,\mu}$ and $r^{i,\sigma}$, which are used to compute the mean and s.d., respectively: $\mu^{i,x}_t = r^{i,\mu}_{t-1}$, $\sigma^{i,x}_t = \mathrm{softplus}\left(r^{i,\sigma}_{t-1}\right)$. Here we use the softplus function for computing the s.d. to ensure positiveness. The mode weight $\pi^{i,x}_t$ is chosen based on the read attention weights $w^{i,r}_{t-1}$ over the memory slots. Since we use soft attention, a read value is computed from all slots, yet the main contribution comes from the one with the highest attention score. Thus, we pick the maximum attention score in each read weight and normalize to obtain the mode weights: $\pi^{i,x}_t = \max\left(w^{i,r}_{t-1}\right) / \sum_{i=1}^{K} \max\left(w^{i,r}_{t-1}\right)$.

Armed with the prior, we follow a recurrent generative process by alternately using the memory to compute the MoG and using the latent variable $z$ sampled from the MoG to update the memory and produce the output conditional distribution. The pseudo-algorithm of the generative process is given in Algorithm 1.

**Algorithm 1** VMED Generation
Require: $p_\phi$, $r^1_0, r^2_0, ..., r^K_0$, $h^d_0$, $y_0$
1: for $t = 1, ..., T$ do
2: &nbsp;&nbsp;Sample $z_t \sim p_\phi(z_t \mid x, r_{t-1})$ as in Eq. (3)
3: &nbsp;&nbsp;Compute $o^d_t, h^d_t = \mathrm{LSTM}^D\left(\left[y_{t-1}, z_t\right], h^d_{t-1}\right)$
4: &nbsp;&nbsp;Compute the conditional distribution $p(y_t \mid x, z_{\le t}) = \mathrm{softmax}\left(W_{out}\, o^d_t\right)$
5: &nbsp;&nbsp;Update the memory and read $\left[r^1_t, r^2_t, ..., r^K_t\right]$ using $h^d_t$ as in DNC
6: &nbsp;&nbsp;Generate the output $y_t = \operatorname*{argmax}_{y \in \mathrm{Vocab}} p(y_t = y \mid x, z_{\le t})$

### 3.2 Neural Posterior Approximation

At each step of the decoder, the true posterior $p(z_t \mid x, y)$ is approximated by a neural function of $x$, $y$ and $r_{t-1}$, denoted as $q_\theta(z_t \mid x, y, r_{t-1})$. Here, we use a Gaussian distribution to approximate the posterior. The unimodal posterior is chosen because, given a response $y$, it is reasonable to assume that only one mode of the latent space is responsible for this response. Also, choosing a unimodal distribution allows the reparameterization trick during training and reduces the complexity of the KL divergence computation. The approximated posterior is computed by the following equation:

$$f_t = q_\theta(z_t \mid x, y_{\le t}, r_{t-1}) = \mathcal{N}\!\left(z_t;\, \mu^{x,y}_t\!\left(x, y_{\le t}, r_{t-1}\right),\, \sigma^{x,y}_t\!\left(x, y_{\le t}, r_{t-1}\right)^2 I\right) \quad (4)$$

with mean $\mu^{x,y}_t$ and s.d. $\sigma^{x,y}_t$. We use an $\mathrm{LSTM}^U$ utterance encoder to model the ground-truth utterance sequence up to the $t$-th timestep, $y_{\le t}$. The $t$-th hidden value of the $\mathrm{LSTM}^U$ is used to represent the given data in the posterior: $h^u_t = \mathrm{LSTM}^U\left(y_t, h^u_{t-1}\right)$. The neural posterior combines the read values $r_t = \sum_{i=1}^{K} \pi^{i,x}_t r^i_{t-1}$ together with the ground-truth data to produce the Gaussian posterior: $\mu^{x,y}_t = W_\mu\left[r_t, h^u_t\right]$, $\sigma^{x,y}_t = \mathrm{softplus}\left(W_\sigma\left[r_t, h^u_t\right]\right)$. In these equations, we use learnable weight matrices $W_\mu$ and $W_\sigma$ as a recognition network to compute the mean and s.d. of the posterior, ensuring that the distribution has the same dimension as the prior. We apply the reparameterization trick to compute the random variable sampled from the posterior as $z_t = \mu^{x,y}_t + \sigma^{x,y}_t \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. Intuitively, the reparameterization trick bridges the gap between the generation model and the inference model during training.
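To make Eqs. (3)-(4) concrete, below is a minimal PyTorch-style sketch of how the MoG prior parameters can be derived from the $K$ read vectors and their attention weights, and how the Gaussian posterior is formed and sampled with the reparameterization trick. Function and variable names (`mog_prior_params`, `r_bar`, etc.) are illustrative assumptions, not identifiers from the authors' code.

```python
import torch
import torch.nn.functional as F


def mog_prior_params(read_values, read_weights):
    """Mixture-of-Gaussians prior of Eq. (3) built from K memory reads.

    read_values:  [batch, K, 2 * z_dim] -- each read vector is split into a
                  mean half and a (pre-softplus) std half.
    read_weights: [batch, K, num_slots] -- soft read attention over slots.
    """
    mu, sigma_raw = read_values.chunk(2, dim=-1)
    sigma = F.softplus(sigma_raw)
    # Mode weights: max attention score of each head, normalized over heads.
    max_scores, _ = read_weights.max(dim=-1)                  # [batch, K]
    pi = max_scores / max_scores.sum(dim=-1, keepdim=True)    # [batch, K]
    return pi, mu, sigma


def sample_mog(pi, mu, sigma):
    """Draw z_t from the MoG prior: pick a mode, then sample that Gaussian."""
    mode = torch.multinomial(pi, num_samples=1)               # [batch, 1]
    idx = mode.unsqueeze(-1).expand(-1, 1, mu.size(-1))       # [batch, 1, z_dim]
    mu_k = mu.gather(1, idx).squeeze(1)
    sigma_k = sigma.gather(1, idx).squeeze(1)
    return mu_k + sigma_k * torch.randn_like(mu_k)


def gaussian_posterior(r_bar, h_u, w_mu, w_sigma):
    """Gaussian posterior of Eq. (4) from mixed read value and utterance state.

    r_bar: [batch, read_dim] -- pi-weighted combination of the K read values.
    h_u:   [batch, hidden_dim] -- hidden state of the utterance encoder LSTM^U.
    w_mu, w_sigma: torch.nn.Linear recognition layers mapping to z_dim.
    """
    joint = torch.cat([r_bar, h_u], dim=-1)
    mu = w_mu(joint)
    sigma = F.softplus(w_sigma(joint))
    z = mu + sigma * torch.randn_like(mu)                     # reparameterization
    return z, mu, sigma
```

In this sketch, `sample_mog` corresponds to drawing from the prior at test time, while `gaussian_posterior` corresponds to the training-time posterior sample, mirroring the training/testing split described in the next subsection.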
### 3.3 Learning

In the training phase, the neural posterior is used to produce the latent variable $z_t$. The read values from the memory are used directly as the MoG priors, and the priors are trained to approximate the posterior by reducing the KL divergence. During testing, the decoder uses the prior to generate the latent variable $z_t$, from which the output is computed. The training and testing diagram is illustrated in Fig. 2.

Figure 2: Training and testing of VMED

The objective function becomes a timestep-wise variational lower bound, following a derivation similar to that presented in [8]:

$$\mathcal{L}(\theta, \phi; y, x) = \mathbb{E}_{q}\left[\sum_{t=1}^{T}\left(-\mathrm{KL}\left(q_\theta(z_t \mid x, y_{\le t}, r_{t-1}) \,\|\, p_\phi(z_t \mid x, r_{t-1})\right) + \log p(y_t \mid x, z_{\le t})\right)\right]$$

where $q = q_\theta(z_{\le T} \mid x, y_{\le T}, r$
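The per-timestep KL term above compares a single Gaussian posterior with an MoG prior, which has no closed form; the paper resorts to the variational approximation of [15]. Below is a hedged sketch of one way that term could be computed: when the posterior is a single Gaussian, the approximation of [15] reduces to $-\log \sum_i \pi_i \exp\left(-\mathrm{KL}(f_t \,\|\, g^i_t)\right)$. The function names, tensor shapes and the `kl_weight` argument are assumptions for illustration, not the authors' implementation.

```python
import torch


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ), summed over z_dim."""
    return (torch.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5).sum(dim=-1)


def kl_gaussian_vs_mog(mu_q, sigma_q, pi, mu_p, sigma_p):
    """Variational approximation of KL between a Gaussian posterior f_t and an
    MoG prior g_t (single-Gaussian case of the approximation in [15]):
        D(f_t || g_t) ~= -log sum_i pi_i * exp(-KL(f_t || g_t^i)).

    mu_q, sigma_q: [batch, z_dim]; pi: [batch, K]; mu_p, sigma_p: [batch, K, z_dim].
    """
    kls = gaussian_kl(mu_q.unsqueeze(1), sigma_q.unsqueeze(1), mu_p, sigma_p)  # [batch, K]
    return -torch.logsumexp(torch.log(pi + 1e-8) - kls, dim=-1)


def timestep_loss(log_p_yt, mu_q, sigma_q, pi, mu_p, sigma_p, kl_weight=1.0):
    """Negative of one term of the timestep-wise lower bound: NLL + weighted KL."""
    return -log_p_yt + kl_weight * kl_gaussian_vs_mog(mu_q, sigma_q, pi, mu_p, sigma_p)
```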