Published as a conference paper at ICLR 2023

A MIXTURE-OF-EXPERT APPROACH TO RL-BASED DIALOGUE MANAGEMENT

Yinlam Chow, Aza Tulepbergenov, Ofir Nachum, Dhawal Gupta, Moonkyung Ryu, Mohammad Ghavamzadeh, Craig Boutilier
Google Research
{yinlamchow, atulep, ofirnachum, dhawgupta, mkryu, ghavamza, cboutilier}@google.com

ABSTRACT

Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and their ability to carry on rich conversations remain a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word level, and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary. As a result, they struggle to produce a successful and engaging dialogue, even if they are warm-started with a pre-trained LM. To address this issue, we develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) an LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a particular attribute or personality, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. Our MoE approach provides greater flexibility to generate sensible utterances with different intents and allows RL to focus on conversation-level DM. We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in terms of the diversity and sensibility of the generated utterances and the overall DM performance.

1 INTRODUCTION

With the tremendous advancements in natural language understanding and generation, increasing attention has been directed toward constructing intelligent dialogue agents that can carry out engaging conversations with users. Such interactions can be open-ended, contain different topics, and often involve an underlying task, such as negotiation, information exchange, or recommendation. Therefore, to satisfy the user, a good dialogue agent should not only generate natural responses, but also be capable of pursuing the task's objectives and adapting to the user's feedback on-the-fly. A standard solution is to train the dialogue agent using behavioral cloning, where the agent is a language model (LM) that imitates the utterances in the training set (Gašić et al., 2011; Fatemi et al., 2016). By leveraging deep neural networks, e.g., RNNs (Sutskever et al., 2014) and Transformers (Vaswani et al., 2017), an LM encodes the conversation into a low-dimensional dialogue state and predicts the next utterance, but steering such generation for particular purposes remains an open question. Several works studied ways to fine-tune an LM to generate text for specific contexts (Ziegler et al., 2019; Ficler and Goldberg, 2017). Others learn a single steerable LM that is capable of generating utterances for multiple specific intents (Gu et al., 2017; Chen et al., 2018; Subramani et al., 2019; Dathathri et al., 2019). While these LMs produce fluent and relevant responses, it is unclear how to control them so that they systematically pursue goals during multi-turn dialogue conversations.
Another popular approach is to view dialogue management (DM) as a control problem and use reinforcement learning (RL) to optimize the agent's policy (which is often an LM itself). Using RL for dialogue systems has a long history. Earlier work relies on specific, hand-crafted semantic states (Levin and Pieraccini, 1997; Singh et al., 2002; Walker, 2000) or partially observable belief states (Williams and Young, 2007; Young et al., 2010), in which the agent chooses the best hand-crafted dialogue act at each turn, with the goal of either satisfying the user (Shah et al., 2018), completing the task (Shi and Yu, 2018), or responding to the user's query (Serban et al., 2017a). However, the application of these approaches is limited to problems whose action space can be captured by hand-crafted representations, and they cannot handle complex conversations. On the other hand, more recent approaches use deep learning to extract semantic representations from conversation histories, treat these representations as dialogue belief states, and apply RL to learn a word-level generative DM agent (Jaques et al., 2019; Li et al., 2016; 2017; Shin et al., 2020). However, since there are innumerable possible language utterances, and thus the action space of the RL problem is extremely large, the agent often plans poorly and generates incomprehensible utterances (Zhao et al., 2019). Another issue is that RL only optimizes a scalar reward, while the aforementioned methods often need to optimize for both the quality of the generated utterance, e.g., ease of answering (Li et al., 2016), fluency (Li et al., 2017; 2019), and diversity (Yarats and Lewis, 2018), and the goal, e.g., conversation length (Zhou et al., 2020), user's sentiment (Hancock et al., 2019), and task completion (Verma et al., 2022; Jang et al., 2021). Moreover, defining the reward as a weighted combination of these metrics is not ideal, since the hand-picked weights often do not reflect the underlying success criteria.
To address the above issues related to using RL in dialogue management (DM) systems, we propose an RL-based DM agent using a novel mixture-of-expert (MoE) approach. Our MoE approach is based on a mixture-of-expert language model (MoE-LM), which consists of three main components: 1) an LM (a probabilistic encoder and a decoder) capable of learning diverse semantics for conversation histories, and as a result generating diverse utterances, which we refer to as the primitive LM or LM_0; 2) a number of specialized LMs (or experts), {LM_i}_{i=1}^m, each of which is constructed using the latent space learned by LM_0, but has been trained such that it is capable of generating utterances corresponding to a certain intent or personality; and 3) an RL-based dialogue manager (DM) that at each turn, given the latent state shared by the experts {LM_i}_{i=0}^m and the utterance action(s) they suggest, chooses one among them for the agent to execute. Our MoE-LM can be seen as a special case of hierarchical LMs (e.g., Serban et al. 2017a; Zhao et al. 2019; Saleh et al. 2020), but it differs from them because it learns both the LMs (experts) and the DM. Moreover, the DM in MoE-LM is a policy conditioned on both the latent state and the actions suggested by the experts, and not just the state, as is common in hierarchical RL.
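To make the interplay among these three components concrete, here is a minimal Python sketch of one dialogue turn (the class and function names are ours, chosen for illustration; this is not the authors' implementation): the shared encoder produces a latent state, every expert proposes a candidate utterance, and the dialogue manager ranks the candidates with its learned Q-function.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type aliases: a latent state is a feature vector, an utterance is a string.
Latent = Sequence[float]
Encoder = Callable[[List[str]], Latent]          # Phi: conversation history -> latent state z
Expert = Callable[[Latent], str]                 # LM_i: latent state -> candidate utterance
Scorer = Callable[[Latent, str], float]          # Q(z, a): value of candidate utterance a in state z

@dataclass
class MoEDialogueAgent:
    encoder: Encoder
    experts: List[Expert]   # LM_0 (primitive) plus the m specialized experts
    q_value: Scorer         # learned by the RL-based dialogue manager (Section 6)

    def respond(self, history: List[str]) -> str:
        z = self.encoder(history)                      # Step 1: encode the conversation history
        candidates = [lm(z) for lm in self.experts]    # Step 2: each expert proposes an utterance
        # Step 3: the DM picks among the m+1 candidates, not among all token sequences.
        return max(candidates, key=lambda a: self.q_value(z, a))

# Toy usage with stub components standing in for trained models.
agent = MoEDialogueAgent(
    encoder=lambda hist: [float(len(hist))],
    experts=[lambda z: "Tell me more about that.", lambda z: "That sounds wonderful!"],
    q_value=lambda z, a: float(len(a)),
)
print(agent.respond(["Hi!", "I just got back from a trip."]))
```

The key structural point is that the manager only ever chooses among m + 1 candidate utterances, rather than searching the full token-level action space.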
The primitive LM (LM_0) plays an important role in this model because it learns diverse semantics for conversation histories and allows the agent to generate a wide variety of utterances. This diversity is also shared with the specialized LMs (experts) and gives them flexibility in generating their (more) specialized utterances. Another important feature of MoE-LM is its modularity, which facilitates adding and removing specialized LMs (experts). Moreover, this hierarchical architecture allows us to solve an RL problem with much smaller state and action spaces, which is quite important for the quality of the learned policy. Finally, since the candidate utterances are generated by experts with different intents, instead of combining all agent-user signals into a single RL reward, our DM agent can focus on optimizing the specific goal of the conversation task.
We start the paper with a brief introduction of LMs and the use of Markov decision processes (MDPs) in modeling dialogue management problems in Section 2. We then describe the overall architecture of our MoE-LM in Section 3, followed by the detailed implementation of each of its three main components (described in the above paragraph) in Sections 4 to 6. Finally, in Section 7, we demonstrate the effectiveness of our MoE-LM in open-domain dialogues, in terms of both its ability to generate diverse and sensible utterances and its overall DM performance.

2 PRELIMINARIES

Language Models (LMs). In this work, we employ seq2seq LMs to generate the next utterances in a dialogue. We assume access to a dataset of the form D = {(X^(k), Y^(k))}_{k=1}^{|D|}, where each X = X^(k) is an L-turn conversation history X = {X_l}_{l=0}^{L-1} and Y is its next utterance. We denote by N_X an upper bound on the length (number of tokens) of each utterance X_l in X.¹ The role of an LM is to predict the probability of the next utterance Y, consisting of N tokens, conditioned on the conversation history X, i.e., p(Y | X). In the Transformer architecture (Vaswani et al., 2017), the LM first encodes the conversation history X using an encoder Φ into an (L · N_X)-length sequence of embeddings {(z_{l,0}, ..., z_{l,N_X-1})}_{l=0}^{L-1}, where each z_{l,n} is a vector in the latent space. For notational convenience, we concatenate these embeddings into a single embedding z ∈ Z ⊆ R^d and denote the overall dimension of the latent space by d. In the RNN architecture (Serban et al., 2016), the LM's encoder Φ directly maps the conversation history X to a latent state z ∈ Z ⊆ R^d. In both architectures, the next utterance Ŷ = {ŷ_n}_{n=1}^N is sampled token-by-token from the decoder Ψ, i.e., ŷ_n ~ Ψ(· | ŷ_0, ..., ŷ_{n-1}; z), where ŷ_0 is a fixed initial (start-of-sentence) token (Chien and Kuo, 2019) and the latent state is denoted by z = Φ(X).²
¹If the actual utterance X_l has fewer tokens than N_X, it is padded by a specific token and masked.
²Note that we use Y for the next utterance in the dataset and Ŷ for the one predicted by the LM.
Figure 1: (Left) MoE-LM architecture. (Right) Sample utterance workflow generated by a MoE-LM trained with Reddit data. Step 1: Φ encodes the conversation history. Step 2: each G_i generates candidate bot utterances. Step 3: µ selects the bot response by Q-score ranking and post-processing.
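As a toy illustration of the notation above (the encoder and decoder here are trivial stand-ins rather than trained models; nothing in this snippet is the paper's code), the function below maps a conversation history to a latent state z and then samples a next utterance token-by-token from a decoder distribution conditioned on z and the previously generated tokens.

```python
import random
from typing import List, Sequence

VOCAB = ["<sos>", "<eos>", "hello", "how", "are", "you", "good", "thanks"]

def encode(history: List[str]) -> Sequence[int]:
    """Toy stand-in for the encoder Phi: maps a conversation history to a latent vector z."""
    return [len(history), sum(len(u) for u in history) % 7]

def decoder_probs(prefix: List[str], z: Sequence[int]) -> List[float]:
    """Toy stand-in for the decoder Psi(. | y_0..y_{n-1}; z): a distribution over the vocabulary."""
    scores = [(hash((tok, tuple(prefix), tuple(z))) % 100) + 1 for tok in VOCAB]
    total = sum(scores)
    return [s / total for s in scores]

def sample_utterance(history: List[str], max_len: int = 8) -> List[str]:
    z = encode(history)
    tokens = ["<sos>"]                         # y_0 is a fixed start-of-sentence token
    for _ in range(max_len):
        probs = decoder_probs(tokens, z)
        nxt = random.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens[1:]

print(sample_utterance(["hello how are you"]))
```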
Markov Decision Processes (MDPs) have been used to model dialogue management problems in a variety of settings (Li et al., 2016; Asadi and Williams, 2016; Jaques et al., 2019). In such MDPs, denoted by M = (S, A, P, r, s_0, γ), the state space S represents the tokenized conversation history and the initial state s_0 ∈ S is the initial user query. The action space A is also the tokenized language space, with each action a ∈ A being the agent's next utterance (a fixed-length, N_X, sequence of tokens). The transition kernel P models the user's response to the action taken by the agent (bot). Finally, the reward function r measures the user's satisfaction. In these MDPs, we can think of the entire LM as a policy π that maps conversation histories to next utterances, and solve them by finding a policy with maximum expected discounted return, i.e., π* ∈ argmax_π J(π) := E[Σ_{t=0}^∞ γ^t r_t | P, s_0, π]. Note that the size of the tokenized state and action spaces grows exponentially with the size of the vocabulary. This makes it intractable to solve the MDP even for a medium-size vocabulary. As a result, it would be quite desirable to develop a novel MDP paradigm that is more amenable to RL-based DM systems.

3 MIXTURE OF EXPERTS (MOE) LANGUAGE MODEL

We start by explaining how a MoE language model (MoE-LM) can enrich the bot's utterances and improve the overall performance of the DM. While our approach is applicable to any DM system, we use open-domain dialogue (Sankar et al., 2019) as a running example to show how MoE-LM-based agents can improve user satisfaction, measured by an improvement in the user's sentiment or engagement. Intuitively, a good DM agent should possess different behaviors (e.g., inquisitive, explorative, relevant, soothing, empathetic, complimentary, provoking) and swiftly decide which intent to use to pivot a conversation, build rapport, pique the user's interests, improve their mood, etc. To achieve this goal, we require the LM to have a language representation (primitive discovery) that captures different semantics, in order to encode different conversations and avoid generating dull and repetitive responses. We also need a machinery (expert construction) to embed different intents into sub-models of this LM, so that they can behave accordingly when prompted, and respond efficiently. Finally, with various candidate utterances available, the DM module of this LM should understand the current level of user satisfaction and determine which response is the most appropriate. Motivated by these observations, we construct our MoE-LM in three steps, as shown in Figure 1. We give the main idea behind each step here and leave their detailed descriptions to Sections 4, 5, and 6.
Step 1: Primitive Discovery. We first employ the dataset D, introduced in Section 2, and learn a language model LM_0 = (Φ, G_0, Ψ) consisting of a stochastic encoder (i.e., an encoder Φ and a latent space distribution G_0 that maps the encoded conversation into a latent distribution) and a decoder Ψ. The stochastic encoder (Φ, G_0) comprises an encoder Φ that maps tokenized conversation histories X to a latent space Z ⊆ R^d, i.e., z = Φ(X) ∈ Z, which is then used to construct a parameterized d-dimensional Gaussian distribution G_0(z′|z) = N(µ_0(z), σ_0²(z) I_{d×d}) over R^d.
The decoder Ψ predicts the next utterance Ŷ_0 (token-by-token) conditioned on the point z′ sampled from the latent distribution, i.e., Ŷ_0 ~ Ψ(·|z′),³ z′ ~ G_0(·|z). We denote by LM_0(Y|X) := E_{z′~G_0(·|z), z=Φ(X)}[Ψ(Y|z′)] the primitive, and learn it using a loss function that, in addition to predicting the next utterance accurately, encourages diversity and generalization in the learned latent space Z (see Eq. 1 and Fig. 2). As we will explain in Section 4, our loss function is inspired by those in prior work, and more specifically by the one in OPAL (Ajay et al., 2020), an unsupervised learning method for discovering primitive skills in trajectories that are used by downstream RL tasks.
³In practice, we can use both latent states as the input to the decoder model, i.e., Ψ(Ŷ_0|z′, z).
Step 2: Expert Construction. Given the latent space Z, encoder (Φ, G_0), and decoder Ψ learned in Step 1, we now learn m latent distributions {G_i}_{i=1}^m, each defined as G_i(z′|z) = N(µ_i(z), σ_i²(z) I_{d×d}). Intuitively, each G_i corresponds to an attribute, e.g., an intent or a personality (in the case of a chatbot), and generates samples in specific parts of the latent space Z. This results in m LMs, {LM_i}_{i=1}^m, LM_i = (Φ, G_i, Ψ), each of which corresponds to a specialized version of the original LM, LM_0, and serves as an expert in our MoE-LM. Upon receiving a conversation history X, each expert LM_i generates a candidate (or more) for the next utterance Ŷ_i in certain parts of the language space that are compatible with its attribute (personality). As we will explain in Section 5, each G_i is learned using a loss function that encourages its corresponding LM, LM_i, to generate utterances consistent with its attribute (see Eq. 2).
Step 3: Dialogue Manager (DM). The dialogue manager, denoted by µ, takes as input the encoded conversation history z = Φ(X) and the candidate action utterances generated by the experts {Ŷ_i}_{i=0}^m, and selects one of them as the action for the bot to execute, i.e., Ŷ ~ µ(· | z, {Ŷ_i}_{i=0}^m). We will describe how the DM is trained using reinforcement learning (RL) in Section 6.

4 PRIMITIVE DISCOVERY IN MOE-LM

Motivated by the literature in the reinforcement and imitation learning fields (Ajay et al., 2020), we propose to learn the primitive LM, LM_0, in our MoE-LM by solving the following KL-constrained optimization problem that aims at capturing diverse semantics:

min_{(Φ, G_0, Ψ), q}  Ê_{z′~q(·|z,Y), z=Φ(X)}[−log Ψ(Y|z′)],  s.t.  Ê_{z=Φ(X)}[KL(q(z′|z, Y) || G_0(z′|z))] ≤ ε_KL,   (1)

where Ê is the empirical expectation over (X, Y) in the dataset D, q is a distribution over the latent space conditioned on the encoded conversation history z and the target utterance Y, and ε_KL is a positive real-valued threshold. Using (1), we learn LM_0 = (Φ, G_0, Ψ) by maximizing the log-likelihood, while enforcing consistency between the latent variable z′ predicted by G_0(·|z) and q(·|z, Y) via the KL constraint. The distribution q(·|z, Y) is a Gaussian N(µ_q(z, Φ′(Y)), σ_q²(z, Φ′(Y)) I_{d×d}), in which Φ′ is a pre-trained encoder for the target utterance Y, and the mean µ_q(·,·) and variance σ_q²(·,·) are trainable models. One reason for using a separate encoder Φ′ for the target utterance Y is to avoid overfitting Φ (i.e., to avoid back-propagating gradients through Φ with Y as input).
Connection to VAE-like objectives. In practice, we implement the KL constraint in (1) as a penalty weighted by an appropriately chosen coefficient β. Thus, one may interpret the objective in (1) as a variation of β-VAE (Burgess et al., 2018).
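A minimal sketch of how the penalized form of objective (1) could be implemented is given below, assuming PyTorch and simple feed-forward stand-ins for the posterior q(z′|z, Y), the prior G_0(z′|z), and the decoder Ψ (the module names and toy shapes are ours; the real decoder is an autoregressive LM rather than a per-position classifier). The reconstruction term is the negative log-likelihood of the target tokens, and the KL constraint appears as a β-weighted penalty.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Maps features to the mean and log-variance of a diagonal Gaussian over the latent space."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, h):
        mu, logvar = self.net(h).chunk(2, dim=-1)
        return mu, logvar

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(-1)

def primitive_loss(z, y_feat, y_tokens, posterior, prior, decoder, beta=0.1):
    """Penalized version of objective (1): NLL of Y under the decoder + beta * KL(q || G_0).

    z:        encoded conversation history, Phi(X)            [batch, d]
    y_feat:   encoding of the target utterance, Phi'(Y)       [batch, d]
    y_tokens: target token ids                                [batch, N]
    """
    mu_q, logvar_q = posterior(torch.cat([z, y_feat], dim=-1))        # q(z'|z, Y)
    mu_p, logvar_p = prior(z)                                         # G_0(z'|z)
    z_prime = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterized sample
    logits = decoder(z_prime)                                         # [batch, N, vocab]
    nll = F.cross_entropy(logits.flatten(0, 1), y_tokens.flatten(), reduction="none")
    nll = nll.view(y_tokens.shape).sum(-1)
    return (nll + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)).mean()

# Toy shapes: d = 16 latent dims, N = 5 tokens, vocabulary of 50, batch of 4.
d, N, V, B = 16, 5, 50, 4
posterior, prior = GaussianHead(2 * d, d), GaussianHead(d, d)
decoder = nn.Sequential(nn.Linear(d, N * V), nn.Unflatten(1, (N, V)))
loss = primitive_loss(torch.randn(B, d), torch.randn(B, d),
                      torch.randint(V, (B, N)), posterior, prior, decoder)
loss.backward()
```

Smaller β (equivalently, a larger ε_KL) weakens the pull toward the prior G_0, which is what lets distinct (X, Y) pairs occupy distinct regions of the latent space.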
Due to the connection to VAEs, one may draw similarities between our method and existing dialogue approaches such as VHRED (Serban et al., 2017b) and VHCR (Park et al., 2018). However, we emphasize that there are key differences, and these may be explained by first understanding how the objective in (1) encourages diversity, which is key to good primitive learning. Namely, it is important that primitive discovery learns an encoder-decoder pair (Φ, Ψ) that can be modulated by the choice of z′; i.e., changing z′ while fixing X should lead to different distributions over generated utterances. The objective in (1) encourages this diversity by conditioning the latent variable z′ on both the target utterance Y and z = Φ(X), i.e., z′ ~ q(·|z, Y). In contrast, the KL constraint is used to make sure that the stochastic encoder G_0(·|z) of our primitive LM is not too varied for different Y, and thus has a limiting effect on diversity. For example, in the extreme when ε_KL = 0 (or β → ∞ when used as a regularizer), there will be no specialization of the latent space for different Y. Although β → ∞ is an extreme case, degenerate behavior can also happen when β = 1, i.e., in the traditional variational loss used by VHRED and VHCR. Specifically, it is well-known that the traditional VAE loss is an upper bound on the negative log-likelihood of the data under a stochastic encoder-decoder parameterization. Thus, if the data can be modeled by a single LM, then a VAE-optimal decoder can simply ignore G_0, leading to a degenerate latent space, as observed in previous work (Park et al., 2018). This is precisely the reason that, in our approach, we weaken the KL constraint (ε_KL ≫ 0 or, equivalently, β ≪ 1). This enables our approach to more reliably guarantee that a unique z′ represents each distinct conversation pair (X, Y), thus capturing diverse semantic modalities and enabling easier downstream specialization.
In the mathematical results below, we formalize the claim above, namely, that the log-likelihood objective in (1) leads to a learned (Φ, Ψ) that can easily recover any arbitrary desired LM by specializing the latent space G. We begin with a definition that characterizes the coverage of an arbitrary LM on the conditional conversation data distribution P_D(Y|X).
Definition 1. LM_{D,ε} is an ε-common LM of data D if E_D[TV(LM_{D,ε}(Y|X) || P_D(Y|X))] ≤ ε.
Leveraging Theorem 4.1 in Ajay et al. (2020), we now present the theoretical result characterizing the representational power of our primitive encoder-decoder pair (Φ, Ψ) on data D.
Lemma 1. Let (Φ, q, Ψ) be the solution to (1) with Ê_{z′~q(·|z,Y), z=Φ(X)}[−log Ψ(Y|z′)] = κ. Then there exists LM := (Φ, G, Ψ) such that E_D[TV(LM_{D,ε}(Y|X) || LM(Y|X))] ≤ ε + √((κ + H)/2), where G(z′|z) = E_{Y~D}[q(z′|z, Y)] and H = E_D[log P_D(Y|X)] is a constant depending on D.
The result above shows that, as long as LM_{D,ε} is ε-common in D, there exists a specialization of the latent space G that, when paired with (Φ, Ψ), can approximately recover LM_{D,ε}. The quality of the approximation is a function of how well the objective in (1) was optimized (i.e., κ) and of ε. In practice, we construct the primitive by replacing G with G_0, i.e., LM_0 = (Φ, G_0, Ψ), because G_0(z′|z) can be viewed as a distillation of q(z′|z, Y). This theoretical result also motivates the next section, where we explain our algorithm's Step 2: Expert Construction.
Specifically, we show how to use the trained encoder-decoder pair (Φ, Ψ) to learn a spectrum of different specialized experts parameterized by different latent distributions G_i.

5 EXPERT CONSTRUCTION WITH PLUG-AND-PLAY LANGUAGE MODELS

To complete the MoE framework, one needs to systematically create a gamut of different experts LM_i, ∀i ∈ {1, ..., m}, with each generating candidate utterances of different intents. By viewing each expert as a distribution over particular behaviors in the conversation data D, we leverage the results of Section 4 and Lemma 1 and adopt a universal encoder-decoder (Φ, Ψ) among all the experts. Therefore, each expert i is only parameterized by an arbitrary d-dimensional latent distribution (e.g., Gaussian), and it samples certain regions of the latent space Z. Following the terminology from Dathathri et al. (2019), these experts can all be categorized as plug-and-play language models (PPLMs). Creating experts is handy because it only requires learning new latent distributions, while switching between experts amounts to sampling from a different distribution.
Denote by ℓ_i(X, Y) ∈ R a real-valued label that characterizes the intent of expert i ∈ {1, ..., m}, e.g., determined by an off-the-shelf sentiment classifier. We train the latent distribution G_i of expert i by solving the optimization problem

max_{G_i}  Ê_{z′~G_i(·|z), z=Φ(X), Y~Ψ(·|z′)}[ℓ_i(X, Y)].   (2)

Unlike the weighted maximum likelihood approach considered in Dathathri et al. (2019), which assigns weight ℓ_i to training samples that correspond to expert i, we propose to learn each expert via reward maximization and treat ℓ_i as a reward signal w.r.t. expert i to be maximized. Interestingly, this approach is also linked to reinforcement learning (RL), in which both the state and action spaces are the latent space Z, and the policy is the latent distribution G_i. The main benefit of our approach is that it does not require the target utterance Y from data D and is thus less vulnerable to data-imbalance issues in D on certain intents. Notice from (2) that the reward-maximization problem is myopic, i.e., the above RL problem has a discount factor of 0. The main motivation is that, unlike dialogue management, which is a sequential decision-making problem, here we want each expert to possess particular behaviors, and this can readily be done via greedy maximization. Long-term dialogue optimization will be handled by the dialogue manager rather than these experts. For example, in the case of a Gaussian G_i, we use the standard REINFORCE (Sutton et al., 1999) algorithm to learn the model parameters (µ_i, σ_i²) of G_i according to
{µ_i, σ_i} ← {µ_i, σ_i} + η · E_{z′~G_i(·|z), Y~Ψ(·|z′)}[ℓ_i(X, Y) ∇_{µ_i, σ_i} log G_i(z′|z)],  i ∈ {1, ..., m},
where η > 0 is the learning rate. To reduce the variance of these estimates, we also adopt the baseline-reduction technique (Greensmith et al., 2004) in policy gradient, which replaces ℓ_i(X, Y) with ℓ̄_i(X, Y) := ℓ_i(X, Y) − E_{Y′~Ψ(·|Φ(X))}[ℓ_i(X, Y′)]. Following arguments from Lemma 1 and Lemma 4.0.1 in Ajay et al. (2020), in the following we quantify the sub-optimality of expert LM_i.
Corollary 1. Denote the i-th reward-maximization objective by L_i(LM) := Ê_{Y~LM(·|X)}[ℓ_i(X, Y)]. Suppose an optimal LM for this objective, LM_{i,*} ∈ argmax_{LM} L_i(LM), is ε-common in D. Moreover, let G_i* be in the argmax of (2). Then, with expert LM_i = (Φ, G_i*, Ψ) and (κ, H) from Lemma 1, we have |L_i(LM_i) − L_i(LM_{i,*})| ≤ 2‖ℓ_i‖_∞ (ε + √((κ + H)/2)).
While it may be obvious that optimizing G_i w.r.t. (2) encourages expert LM_i to capture the behaviors encouraged by ℓ_i, this corollary has two further implications: (i) since the sub-optimality of LM_i compared to the oracle LM_{i,*} is bounded by the quantity defined in Lemma 1, it justifies using the primitive (Φ, Ψ), which optimizes (1), for expert construction; (ii) the sub-optimality further depends on ε, quantifying how well LM_{i,*} is represented in the original dataset D.
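The sketch below shows one way the expert update could look (our PyTorch stand-in, not the authors' code): expert i owns only a diagonal Gaussian G_i(z′|z), whose parameters are updated by REINFORCE using the label ℓ_i as a myopic reward, with a simple batch-mean baseline in place of the primitive-based baseline described above.

```python
import torch
import torch.nn as nn

class LatentExpert(nn.Module):
    """Expert i is only a latent Gaussian G_i(z'|z); the encoder Phi and decoder Psi are shared and frozen."""
    def __init__(self, d: int):
        super().__init__()
        self.mu_head = nn.Linear(d, d)
        self.log_sigma = nn.Parameter(torch.zeros(d))

    def dist(self, z: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mu_head(z), self.log_sigma.exp())

def reinforce_step(expert, optimizer, z, decode, label_fn):
    """One REINFORCE update on G_i, treating the expert label l_i as a (discount-0) reward.

    z:        encoded conversation histories Phi(X), shape [batch, d]
    decode:   frozen decoder Psi, mapping a latent sample z' to an utterance string
    label_fn: expert label l_i(utterance) -> float (e.g., an off-the-shelf sentiment score)
    """
    dist = expert.dist(z)
    z_prime = dist.sample()                              # z' ~ G_i(.|z); sampling blocks gradients
    rewards = torch.tensor([label_fn(decode(zp)) for zp in z_prime])
    advantage = rewards - rewards.mean()                 # batch-mean baseline for variance reduction
    log_prob = dist.log_prob(z_prime).sum(-1)            # log G_i(z'|z), summed over latent dims
    loss = -(advantage * log_prob).mean()                # ascend E[l_i(X,Y) * grad log G_i(z'|z)]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stub decoder and label function.
d = 16
expert = LatentExpert(d)
opt = torch.optim.Adam(expert.parameters(), lr=1e-3)
reinforce_step(expert, opt, torch.randn(8, d),
               decode=lambda zp: "great to hear!" if zp.sum() > 0 else "oh no.",
               label_fn=lambda u: 1.0 if "great" in u else -1.0)
```

Because only (µ_i, σ_i) are trainable, adding a new expert amounts to instantiating another small Gaussian head on top of the frozen primitive.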
6 REINFORCEMENT LEARNING FOR MOE-LM DIALOGUE MANAGER

We now describe the dialogue manager (DM) of our MoE-LM and propose RL algorithms to train it. As mentioned in Section 3, the DM is a policy µ that takes the encoded conversation history z = Φ(X) and the m + 1 candidate action utterances generated by the experts {Ŷ_i}_{i=0}^m,⁴ and stochastically selects one of them to execute, i.e., Ŷ ~ µ(· | z, {Ŷ_i}_{i=0}^m). Note that each expert i ∈ {0, ..., m} is an LM, LM_i, that acts as a policy π_i(·|X) and maps each conversation history X to an utterance Ŷ_i. With this architecture, we address the large size of the state and action spaces of the original MDP, which grows exponentially with the size of the vocabulary. As described in Section 2, the state and action spaces of the original MDP are the tokenized conversation history and the tokenized language space, respectively, while here the DM chooses among m + 1 actions (a much smaller and finite action space) given the latent space Z (a continuous state space) of encoded conversations. It is important to note that our MoE-LM is different from other hierarchical architectures (Kulkarni et al., 2016), in which the decision at the high level is to choose a low-level controller based only on the current state of the system. In MoE-LM, the DM observes both the current state and the actions suggested by the experts and then chooses one among them.
⁴For simplicity, we assume that each expert generates only a single candidate utterance at each step. It would be straightforward to extend this to multiple (and even a varying number of) candidate utterances.
We consider two RL settings to solve this specialized MDP. The first one is offline RL, in which the policy must be learned from the collected conversations D without further (online) interactions. Offline RL requires optimizing a policy while minimizing the deviation from the behavior policy to avoid errors due to data covariate shift. Among numerous offline RL algorithms (Kumar et al., 2020; Carta et al., 2021; Jaques et al., 2019), one effective algorithm for learning the DM policy µ is IQL (Kostrikov et al., 2021). Given conversation data D, IQL first computes the critic functions (Q_θ(z, a), V_φ(z)) by solving
min_{θ,φ}  E_{(z,a,r,z⁺)∈D}[(r + γ V_φ(z⁺) − Q_θ(z, a))²] + λ E_{(z,a)∈D}[L_2^τ(Q_θ(z, a) − V_φ(z))],
where z is the encoded conversation, a is the bot utterance, z⁺ is the next encoded conversation, r is the conversation-level reward, λ > 0 is a tunable weight, and L_2^τ is the expectile-regression operator (Koenker and Hallock, 2001) that estimates the top τ-expectile statistics of the Q-function random variable (approximated by the value function V_φ). It then extracts the DM policy µ via greedification over the finite set of MoE candidate utterances: µ(a | z, {Ŷ_i}_{i=0}^m) = argmax_{a ∈ {Ŷ_i}_{i=0}^m} Q_θ(z, a). Intuitively, IQL leverages the generalization capacity of the critic functions to estimate the value of the best action without directly querying the values of unseen actions. This makes it less conservative than most offline RL methods, which either constrain the policy's actions to be in-distribution or solve a behavior-regularized policy optimization problem.
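Below is a compact sketch of the IQL-style training step described above (hypothetical critic networks and feature shapes; an approximation of the method rather than the authors' code): V_φ is fit to the top τ-expectile of Q_θ with an asymmetric squared loss, Q_θ regresses onto the one-step target r + γV_φ(z⁺), and the DM policy is obtained by ranking the experts' candidate utterances by Q_θ.

```python
import torch
import torch.nn as nn

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss L_2^tau(diff); tau > 0.5 pushes V toward the upper expectile of Q."""
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, z, a_feat, reward, z_next, gamma=0.8, tau=0.7):
    """IQL-style critic losses on encoded conversations.

    z, z_next: encoded conversation before/after the bot turn   [batch, d]
    a_feat:    embedding of the executed bot utterance          [batch, d_a]
    """
    with torch.no_grad():
        target = reward + gamma * v_net(z_next).squeeze(-1)       # TD target uses V, not max_a Q
    q = q_net(torch.cat([z, a_feat], dim=-1)).squeeze(-1)
    q_loss = (q - target).pow(2).mean()
    v = v_net(z).squeeze(-1)
    v_loss = expectile_loss(q.detach() - v, tau)                  # expectile regression of Q onto V
    return q_loss, v_loss

def dm_select(q_net, z, candidate_feats):
    """Greedification over the m+1 expert candidates: pick the utterance with the highest Q(z, a)."""
    scores = [q_net(torch.cat([z, a], dim=-1)).squeeze(-1) for a in candidate_feats]
    return int(torch.stack(scores).argmax())

# Toy usage with small MLP critics and random features.
d, d_a, B = 16, 8, 4
q_net = nn.Sequential(nn.Linear(d + d_a, 32), nn.ReLU(), nn.Linear(32, 1))
v_net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
q_loss, v_loss = iql_losses(q_net, v_net, torch.randn(B, d), torch.randn(B, d_a),
                            torch.randn(B), torch.randn(B, d))
(q_loss + v_loss).backward()
print(dm_select(q_net, torch.randn(1, d), [torch.randn(1, d_a) for _ in range(3)]))
```

Because the candidate set is finite, the final argmax is exact; no policy network over the full utterance space is needed.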
The second RL setting for learning the DM policy µ is model-based RL (MBRL) (Shah et al., 2018; Wei et al., 2018). While this paradigm can be applied to any online/offline RL algorithm, we demonstrate it with simple DQN (Mnih et al., 2013). Here, we first learn a user utterance model P_user(X⁺|X, a) := E_{z=Φ_user([X,a])}[Ψ_user(X⁺|z)] via maximum likelihood, then generate data D_MB, whose next state ŝ⁺ encodes the next conversation generated from roll-outs and the corresponding candidate actions, solve the Bellman error minimization
min_θ  Σ_{(s,a,r,ŝ⁺)∈D_MB} ( r + γ Q_{θ_target}(ŝ⁺, argmax_{a⁺∈{0,...,m}} Q_θ(ŝ⁺, a⁺)) − Q_θ(s, a) )²,
and obtain the DM policy µ via the aforementioned greedification step. The benefit of MBRL over offline RL is two-fold. First, one can easily obtain a high-fidelity user utterance model (Peng et al., 2020) by simply fine-tuning a large LM (e.g., GPT-3 (Floridi and Chiriatti, 2020)). Second, with sufficient dialogue roll-outs that capture many different scenarios, MBRL can generally be more data-efficient and less prone to distributional shift than offline RL.

7 EXPERIMENTS

We evaluate our MoE approach on two open-domain benchmarks that are common within the RL-based dialogue management literature (Jaques et al., 2019). The first one is the Cornell Movie corpus (Danescu-Niculescu-Mizil and Lee, 2011), which consists of conversations between speakers in different movie lines and has a median conversation length of 3 utterances. The second is the Reddit Casual corpus (Ghandeharioun et al., 2019), a subset of the Reddit corpus that only contains casual conversations on various topics of at least 3 turns and a median of 7 utterances. We conduct several experiments to test the efficacy of the different parts of the MoE-LM, namely (i) the predictive power and diversity of the primitive, (ii) the quality of the experts, and (iii) the overall DM performance. For each metric, we report mean ± standard error over 100 conversations sampled from the evaluation set. We also ran an ablation study on 4 transformer-based MoE-LMs, namely MoE-1, MoE-2, MoE-3, and MoE-4, to understand how performance is affected by different model architectures, language encoders, and latent generators. MoE-1 and MoE-2 use a simpler architecture, while MoE-3 and MoE-4 use the same encoder architecture as BERT (Devlin et al., 2018). MoE-1 uses much smaller latent distribution models {G_i} than MoE-2; MoE-3 uses the pre-trained BERT encoder, while MoE-4 trains it from scratch. Details of these models can be found in Appendix B.3.
EXP 1: Comparing Primitive Models. We compare the quality of the latent representations learned by the 4 MoE-LMs (via optimizing Eq. 1) and 2 baselines (standard Transformer (Vaswani et al., 2017) and VHRED⁵ (Serban et al., 2017b)). To assess their quality, for each test conversation we generated 25 utterances and report the following 3 metrics: (i) Diversity: one minus the sparsity (Hurley and Rickard, 2009) of the singular values of the embedded utterances, i.e., Diversity({Ŷ_i}) := 1 − (√d − ‖σ_SVD‖₁/‖σ_SVD‖₂)/(√d − 1) ∈ [0, 1], where σ_SVD := SVD({Φ_SE(Ŷ_i)}_{i=1}^{25}) and Φ_SE is a pre-trained sentence encoder (e.g., USE (Cer et al., 2018)); (ii) Dist-{1, 2, 3} (Li et al., 2015): the ratio of unique {1, 2, 3}-grams in the generated utterances; (iii) Perplexity: the perplexity score of the generated utterance w.r.t. a GPT-2 LM, which correlates better with human judgment of semantic fluency (Pang et al., 2020).
⁵The VHRED model implementation is identical to that in Jaques et al. (2019) to ensure fair comparisons.
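The Diversity metric above can be computed directly from sentence embeddings. Below is a small NumPy sketch (the sentence encoder itself is not included; any pre-trained encoder such as USE could supply the 25 embeddings): it takes the singular values of the stacked embeddings and returns one minus their Hoyer sparsity.

```python
import numpy as np

def diversity(embeddings: np.ndarray) -> float:
    """1 - sparsity of the singular-value spectrum of the embedded utterances.

    embeddings: array of shape [num_utterances, embed_dim], e.g., 25 sentence embeddings.
    Returns a value in [0, 1]; higher means the generated utterances span more directions.
    """
    sigma = np.linalg.svd(embeddings, compute_uv=False)   # singular values, length d = min(n, dim)
    d = len(sigma)
    hoyer = (np.sqrt(d) - np.linalg.norm(sigma, 1) / np.linalg.norm(sigma, 2)) / (np.sqrt(d) - 1)
    return float(1.0 - hoyer)

# Sanity checks with synthetic embeddings:
rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=(1, 512)), (25, 1))    # 25 copies of one utterance -> rank 1
varied = rng.normal(size=(25, 512))                        # 25 unrelated embeddings
print(diversity(identical))   # close to 0 (spectrum concentrated on one singular value)
print(diversity(varied))      # much larger (spectrum spread across many singular values)
```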
These metrics measure both accuracy and semantic diversity. Qualitatively, we also measure the fluency and diversity of the LMs using human ratings (see Appendix B.8 for details).
The results of the above experiments are reported in Tables 1 and 6 (Appendix A.1), and sample utterances generated by these LMs can be found in Appendix A.3. Human evaluations of the diversity/fluency of the different LMs are given in Table 4. In comparison with the baselines (Transformer and VHRED), generally (i) transformer-based LMs outperform VHRED due to their attention mechanism, which explicitly encodes sequential semantic information, and (ii) the MoE-LMs achieve considerably better diversity without sacrificing much accuracy (i.e., their perplexity scores remain quite low). Qualitatively, the sample utterances generated by the Transformer are closer to the targets than those of MoE-2 and MoE-4, likely because the Transformer tends to memorize the corpus (Kharitonov et al., 2021). In contrast, the MoE-LMs generate utterances that either have contexts similar to the targets but paraphrased, or similar structures but different contexts, demonstrating their generalizability. Human evaluations also show that MoE-2 and MoE-4 generate more diverse utterances while retaining sufficient semantic fluency. Among the different MoE-LMs, MoE-2 and MoE-4 have the best performance; in particular, MoE-4 has better diversity, while MoE-2 has lower perplexity. This corroborates our hypotheses that (i) jointly training the encoder and decoder with Eq. 1 seems necessary to encourage semantic diversity (as opposed to using a pre-trained BERT encoder, which maximizes likelihood), and (ii) sufficient representation power is necessary for G_0 to match the posterior distribution in order to capture the different semantics in D. In Figs. 2a and 2b, we visualize the latent space of 400 conversation data samples for both Transformer and MoE-4. The latent states of MoE-4 are much more dispersed across the embedding space, implying that most conversations get encoded uniquely. In contrast, the latent space of the Transformer has many clusters, suggesting it is more prone to generating similar utterances even for different input conversations and is thus less generalizable.

| Method | Diversity | Dist-1 | Dist-2 | Dist-3 | Perplexity |
|---|---|---|---|---|---|
| MoE-1 | 0.069 ± 0.03 | 0.27 | 0.66 | 0.75 | 27.43 ± 18.49 |
| MoE-2 | 0.14 ± 0.05 | 0.35 | 0.77 | 0.90 | 38.81 ± 17.34 |
| MoE-3 | 0.089 ± 0.04 | 0.29 | 0.75 | 0.90 | 41.35 ± 26.68 |
| MoE-4 | 0.16 ± 0.04 | 0.38 | 0.80 | 0.95 | 50.17 ± 28.11 |
| Transformer | 0.087 ± 0.03 | 0.26 | 0.65 | 0.85 | 19.23 ± 15.46 |
| VHRED | 0.09 ± 0.04 | 0.35 | 0.70 | 0.79 | 79.77 ± 39.61 |

Table 1: Accuracy (Perplexity) and Diversity of Language Primitives Trained with Reddit.

EXP 2: Quality of Experts. We compare the performance of the experts learned by the 4 MoE-LMs (where the experts are separately trained by optimizing Eq. 2) with 2 baselines (WD (Holtzman et al., 2018) and PPLM (Dathathri et al., 2019)). To study the sub-optimality gap in Corollary 1, we also include the performance of Transformer-based end-to-end expert LMs that are individually optimized with REINFORCE (Li et al., 2016), using the expert labels {ℓ_i} as rewards. Inspired by Ghandeharioun et al. (2019) on how bot behaviors are characterized, we use the following label functions to define the intents of the experts (a minimal sketch of such label functions is given at the end of EXP 2):
(i) pos-sent(Y), neg-sent(Y), joy(Y), optimism(Y), anger(Y), and sadness(Y) quantify 6 different sentiment tones and are constructed by a RoBERTa-based sentiment detector (Liao et al., 2021) that predicts whether an utterance is of positive or negative sentiment, and whether it falls into any of the 4 more refined emotions {joy, optimism, sadness, anger}; (ii) sent-coh(X, Y) measures empathy, i.e., the bot's sentiment coherence with the user's; (iii) question(Y) outputs 1 when a bot question is detected and 0 otherwise; (iv) exp(X, Y) quantifies exploration, i.e., the tendency to avoid repetitive contexts (see Appendix B.7 for details). Qualitatively, we also measure the fluency and expert skills of the LMs using human ratings (see Appendix B.8 for details).

Figure 2: Latent Space Visualizations. Panels: (a) Transformer primitive (t-SNE); (b) MoE-4 primitive (t-SNE); (c) MoE-4 sentiment experts (PCA); (d) MoE-4 emotion experts (PCA). Figures (a) and (b) compare the two primitive representations; Figures (c) and (d) illustrate how experts (of different sentiments and emotions) are represented by latent clusters.

| Method | User Tot. Sent. | User Sent. Trans. | Perplexity |
|---|---|---|---|
| MoE-4 IQL | 4.55 ± 0.38 | 2.88 ± 0.35 | 45.53 ± 26.71 |
| MoE-4 Ens-Q | 4.14 ± 0.41 | 2.36 ± 0.31 | 43.77 ± 28.24 |
| MoE-4 KLC | 3.94 ± 0.25 | 2.21 ± 0.24 | 38.35 ± 16.88 |
| VHRL | 3.85 ± 0.28 | 2.19 ± 0.28 | 55.81 ± 24.21 |
| VHRL-KLC | 3.95 ± 0.19 | 2.16 ± 0.33 | 64.05 ± 36.98 |
| VHRL-SAC | 3.93 ± 0.28 | 2.19 ± 0.32 | 62.06 ± 40.43 |

Table 2: Performance (w.r.t. User Satisfaction in Conversation) of MBRL-based DM Trained with Reddit.

The results of the above experiments are reported in Tables 3 and 8 (Appendix A.1), with sample utterances reported in Appendices A.4 to A.10. Results of the human evaluation of different LMs w.r.t. fluency and the different expert skills are given in Table 5. Compared with the baseline LMs, the experts created under the MoE-LM framework, especially under MoE-2 and MoE-4, generally better capture all the different language intents (whereas WD and PPLM appear to capture negative sentiments and emotions much more effectively than the other behaviors), demonstrating the efficacy of our approach, which constructs specialized experts on a diverse language space via reward maximization (instead of weighted MLE). Human evaluations also show that MoE-4 is most effective in generating semantically fluent utterances that possess a wide range of expert characteristics. Similar to the ablation study in EXP 1, the experts associated with MoE-2 and MoE-4 are also among the best in capturing language intents. Interestingly, with the Reddit data the experts in MoE-4 perform best, while with much less data (Cornell) the best experts are built upon the simpler MoE-2 architecture. We conjecture this difference is due to over-fitting issues faced by the larger LMs (MoE-4) when there is insufficient data for expert fine-tuning. In Figs. 2c and 2d, we visualize the latent space of the sentiment-based experts in MoE-4, each with 400 conversation data samples. Notice that the sentiment experts' latent distributions are clearly separated (because positive and negative sentiments have opposite behaviors), while the emotion experts' latent distributions have more gradual separations and even some overlaps (because, e.g., joy and optimism are quite similar, while joy and anger are quite different). This validates that our MoE-LM represents different behaviors in separate regions of the latent space and justifies our structural prior of modeling each expert as a specialized version of the primitive LM, whose latent distribution focuses on particular regions.
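As referenced in the expert label-function list above, each expert only needs a scalar label ℓ_i(X, Y). The sketch below gives simple stand-in implementations of question(Y) and sent-coh(X, Y) (the sentiment scorer is a placeholder for the off-the-shelf RoBERTa-based classifier used in the paper, which we do not reproduce here).

```python
from typing import Callable, List

def question_label(bot_utterance: str) -> float:
    """question(Y): 1 if the bot utterance looks like a question, else 0 (heuristic stand-in)."""
    wh = ("what", "why", "how", "when", "where", "who", "do you", "are you")
    text = bot_utterance.strip().lower()
    return 1.0 if text.endswith("?") or text.startswith(wh) else 0.0

def sentiment_coherence_label(user_context: List[str], bot_utterance: str,
                              sentiment_fn: Callable[[str], float]) -> float:
    """sent-coh(X, Y): empathy proxy, i.e., how closely the bot's sentiment tracks the user's.

    sentiment_fn is a placeholder for an off-the-shelf classifier returning a score in [-1, 1].
    """
    user_sent = sentiment_fn(user_context[-1]) if user_context else 0.0
    bot_sent = sentiment_fn(bot_utterance)
    return 1.0 - abs(user_sent - bot_sent) / 2.0          # 1 = perfectly matched sentiment

# Toy usage with a trivial keyword-based sentiment stand-in.
toy_sentiment = lambda s: 1.0 if "great" in s.lower() else (-1.0 if "awful" in s.lower() else 0.0)
print(question_label("How was your day?"))                                                       # 1.0
print(sentiment_coherence_label(["Today was great!"], "That's great to hear!", toy_sentiment))   # 1.0
```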
| Method | Question | Exploration | Positive Sent. | Negative Sent. | Sent. Coherence | Joy | Optimism | Anger | Sadness |
|---|---|---|---|---|---|---|---|---|---|
| MoE-1 | 0.65 ± 0.20 | 0.45 ± 0.17 | 1.13 ± 0.21 | 0.35 ± 0.19 | 0.50 ± 0.38 | 0.96 ± 0.26 | -0.21 ± 0.56 | 0.54 ± 0.58 | 0.99 ± 0.83 |
| MoE-2 | 0.95 ± 0.27 | 0.47 ± 0.21 | 3.29 ± 0.33 | 1.42 ± 0.38 | 0.51 ± 0.40 | 1.99 ± 0.38 | 1.25 ± 0.43 | 1.48 ± 0.39 | 2.01 ± 0.46 |
| MoE-3 | 0.41 ± 0.35 | 0.50 ± 0.24 | 1.23 ± 0.78 | 0.99 ± 0.48 | 0.66 ± 0.35 | 1.02 ± 0.29 | 0.49 ± 0.51 | 0.53 ± 0.49 | 1.10 ± 0.48 |
| MoE-4 | 0.96 ± 0.37 | 0.51 ± 0.31 | 3.41 ± 0.55 | 1.80 ± 0.34 | 0.52 ± 0.31 | 2.05 ± 0.55 | 1.57 ± 0.44 | 1.42 ± 0.42 | 1.97 ± 0.36 |
| WD | 0.05 ± 0.03 | 0.15 ± 0.37 | -0.50 ± 0.74 | 1.01 ± 0.48 | 0.51 ± 0.20 | -0.51 ± 0.39 | -0.84 ± 0.76 | 1.00 ± 0.44 | 1.27 ± 0.67 |
| PPLM | 0.20 ± 0.25 | 0.48 ± 0.28 | 0.44 ± 0.41 | 0.69 ± 0.22 | 0.53 ± 0.31 | 0.31 ± 0.29 | 0.40 ± 0.55 | 0.71 ± 0.46 | 0.98 ± 0.59 |
| Trans. RL* | 0.99 ± 0.23 | 0.54 ± 0.18 | 3.53 ± 1.64 | 1.89 ± 1.20 | 0.72 ± 0.30 | 2.88 ± 0.96 | 1.80 ± 0.59 | 1.62 ± 0.75 | 2.35 ± 0.62 |

Table 3: Quality of Each Expert PPLM Trained on the Reddit Casual Dataset w.r.t. its Trained Label. Trans. RL* Corresponds to Individually Optimized LMs Using the Expert Labels {ℓ_i} as Rewards.

| Method | Avg. Fluency | Diversity |
|---|---|---|
| MoE-1 | 0.72 ± 0.02 | 0.51 ± 0.02 |
| MoE-2 | 0.75 ± 0.02 | 0.54 ± 0.02 |
| MoE-3 | 0.72 ± 0.02 | 0.42 ± 0.03 |
| MoE-4 | 0.65 ± 0.03 | 0.69 ± 0.02 |
| Transformer | 0.70 ± 0.02 | 0.47 ± 0.02 |

Table 4: Phase 1 Raters' Evaluation.

| Method | Avg. Fluency | S_Question | S_Pos. Sent | S_Neg. Sent | S_Joy | S_Anger |
|---|---|---|---|---|---|---|
| MoE-3 | 0.76 ± 0.04 | 0.27 ± 0.05 | 0.64 ± 0.04 | 0.26 ± 0.05 | 0.41 ± 0.04 | 0.33 ± 0.05 |
| MoE-4 | 0.82 ± 0.02 | 0.74 ± 0.04 | 0.46 ± 0.04 | 0.44 ± 0.03 | 0.51 ± 0.04 | 0.43 ± 0.03 |
| Transformer | 0.78 ± 0.03 | 0.12 ± 0.03 | 0.29 ± 0.04 | 0.19 ± 0.04 | 0.38 ± 0.04 | 0.33 ± 0.05 |

Table 5: Phase 2 Raters' Evaluation (Reddit Casual Models).

EXP 3: MoE-RL Against DialoGPT Simulated Users. We compare the dialogue management performance of MoE-LMs whose MoE-based DMs µ are trained with different methods, namely (i) IQL (Kostrikov et al., 2021), (ii) Ensemble DQN (Carta et al., 2021), and (iii) KL-control (Jaques et al., 2019), with 3 standard RL-based DM baselines using the VHRL architecture (Saleh et al., 2020): (i) REINFORCE (Li et al., 2016), (ii) KL-control, and (iii) SAC (Haarnoja et al., 2018). According to the results on expert quality in EXP 2, we pick the MoE-2 and MoE-4 frameworks for the Cornell and Reddit tasks, respectively. For systematic evaluation, we perform the experiment by having these RL agents interact with a DialoGPT (Zhang et al., 2019) simulated user environment for a maximum of 5 turns. The DM task is to maximize total user satisfaction at the conversation level, which is measured by both (i) the user's overall sentiment and (ii) the user's sentiment transition. To construct an immediate reward that serves as a surrogate for user satisfaction, we set
r(X, a, X⁺) = λ₁ · sent(X⁺) + λ₂ · ( sent(X⁺) − (1−γ)/(1−γ^L) · Σ_{l=0}^{L−1} γ^l sent(X_l) ),
where the linear combination weights (λ₁, λ₂) = (0.75, 0.25) correlate with Ghandeharioun et al. (2019), and sent(X) is the same RoBERTa-based sentiment labeler as in EXP 2, which assigns a score in [−1, 1] that is proportional to the positive-sentiment and inversely proportional to the negative-sentiment prediction probabilities. To ensure that the baseline RL DM methods can also possess certain bot-level features, e.g., question, positive sentiment, etc., besides the above RL reward for user satisfaction we also optimize a linear combination of bot-based rewards when training the baseline models; see Appendix B of Saleh et al. (2020) for more details. Since the DM problem lasts at most 5 turns, we use this as the effective horizon and set γ = 1 − 1/5 = 0.8.
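A small sketch of how this per-turn reward could be computed from sentiment scores is shown below (the sentiment model is a placeholder; λ₁, λ₂, and γ follow the values quoted above, and the second term compares the next user sentiment against a discounted average of the past sentiments).

```python
from typing import Callable, List

def user_satisfaction_reward(history: List[str], next_user_utterance: str,
                             sentiment_fn: Callable[[str], float],
                             lam1: float = 0.75, lam2: float = 0.25,
                             gamma: float = 0.8) -> float:
    """r(X, a, X+) = lam1 * sent(X+) + lam2 * (sent(X+) - discounted average sentiment of X).

    history:      past utterances X_0, ..., X_{L-1}
    sentiment_fn: placeholder for the RoBERTa-based labeler, returning a score in [-1, 1]
    """
    sent_next = sentiment_fn(next_user_utterance)
    L = len(history)
    if L == 0:
        baseline = 0.0
    else:
        weights = [gamma ** l for l in range(L)]
        # Normalize so the weights sum to 1: (1 - gamma) / (1 - gamma^L) * sum_l gamma^l = 1.
        norm = (1.0 - gamma) / (1.0 - gamma ** L)
        baseline = norm * sum(w * sentiment_fn(x) for w, x in zip(weights, history))
    return lam1 * sent_next + lam2 * (sent_next - baseline)

# Toy usage with a trivial keyword-based sentiment stand-in.
toy_sentiment = lambda s: 1.0 if "love" in s.lower() else (-1.0 if "hate" in s.lower() else 0.0)
print(user_satisfaction_reward(["I hate Mondays.", "Work was fine."], "I love this idea!", toy_sentiment))
```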
The results of the above experiments (performed in both offline RL and MBRL settings) are reported in Table 2, Table 7 (Appendix A.1), and Tables 9 and 10 (Appendix A.2), with sample utterances reported in Appendix A.11. Our experiments show that MoE-LMs outperform most baselines on DM performance. We attribute this finding to three factors: (i) since MoE-LM restricts the action space to a smaller set of candidate utterances generated by the experts (whose quality is validated in EXP 2), the corresponding RL problem becomes simpler and requires less data (especially in Cornell) to solve; (ii) unlike the baseline RL methods, which need to optimize both bot and user signals, the MoE DM agents focus on optimizing the user-satisfaction goal and are therefore more effective; (iii) among the different RL settings, MBRL, which first learns a user utterance model (the model uses the same encoder as the primitive LM and learns a separate decoder for user-utterance prediction) and then does DM, performs much better than the offline RL methods, which only moderately improve upon the primitive LM (behavior policy). IQL-based dialogue managers are among the best across different settings, potentially because IQL is more robust to covariate shift than standard RL methods, e.g., Ens-Q and SAC, and yet is less conservative than the behavior-regularized algorithms, e.g., KLC. Interestingly, our MoE-LMs also have lower (better) perplexity scores than the other methods. This may be due to the fact that MoE-LM uses the pre-trained encoder and decoder from the primitive LM, which are optimized for generalization and accuracy, while the other RL methods may distort their language representations to create utterances that maximize reward but become less natural.

8 CONCLUDING REMARKS

We developed a mixture-of-expert (MoE) approach for RL-based dialogue management (DM). Our MoE language model (MoE-LM) comprises three main components: (i) an LM that can generate diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) that can produce utterances corresponding to a particular attribute or intent, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. To understand how well our MoE approach generates diverse and sensible utterances, and solves DM problems, we evaluated it on two open-domain dialogue tasks and compared it with SOTA baselines. Our results showed that MoE-LM (i) improves the diversity of text generation, (ii) can generate utterances with specific intents, and (iii) yields better overall performance. We consider our work a step forward in creating steerable LMs that possess different intents and in training RL-based DMs that can carry on rich conversations. Future work includes improving the language representation with information-theoretic approaches, fine-tuning the experts based on the DM objective, extending the RL agent to track users' behaviors (via abstract belief states) and plan upon them, preventing RL dialogue agents from generating harmful behaviors (via enforcing safety constraints), and evaluating our MoE-LM on more realistic problems, such as information retrieval, recommendation, and negotiation.

REFERENCES

A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611, 2020.
K. Asadi and J. Williams. Sample-efficient deep reinforcement learning for dialog control.
arXiv preprint arXiv:1612.06000, 2016.
C. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
S. Carta, A. Ferreira, A. S. Podda, D. Reforgiato Recupero, and A. Sanna. Multi-DQN: An ensemble of deep Q-learning agents for stock market forecasting. Expert Systems with Applications, 164:113820, 2021.
D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
Y. Chen, V. Li, K. Cho, and S. Bowman. A stable and effective learning strategy for trainable greedy decoding. arXiv preprint arXiv:1804.07915, 2018.
J. Chien and C. Kuo. Markov recurrent neural network language model. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 807–813. IEEE, 2019.
C. Danescu-Niculescu-Mizil and L. Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. arXiv preprint arXiv:1106.3077, 2011.
S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
A. Van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv e-prints, 2018.
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
M. Fatemi, L. Asri, H. Schulz, J. He, and K. Suleman. Policy networks with two-stage training for dialogue systems. arXiv preprint arXiv:1606.03152, 2016.
J. Ficler and Y. Goldberg. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633, 2017.
L. Floridi and M. Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4):681–694, 2020.
M. Gašić, F. Jurčíček, B. Thomson, K. Yu, and S. Young. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 312–317. IEEE, 2011.
A. Ghandeharioun, J. Shen, N. Jaques, C. Ferguson, N. Jones, A. Lapedriza, and R. Picard. Approximating interactive human evaluation with self-play for open-domain dialog systems. Advances in Neural Information Processing Systems, 32, 2019.
E. Greensmith, P. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.
J. Gu, K. Cho, and V. Li. Trainable greedy decoding for neural machine translation. arXiv preprint arXiv:1702.02429, 2017.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.
D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
B. Hancock, A. Bordes, P. Mazare, and J. Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019.
A. Holtzman, J. Buys, M. Forbes, A. Bosselut, D. Golub, and Y. Choi. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087, 2018.
N. Hurley and S. Rickard. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723–4741, 2009.
Y. Jang, J. Lee, and K. Kim. GPT-critic: Offline reinforcement learning for end-to-end task-oriented dialogue systems. In International Conference on Learning Representations, 2021.
N. Jaques, A. Ghandeharioun, J. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
E. Kharitonov, M. Baroni, and D. Hupkes. How BPE affects memorization in transformers. arXiv preprint arXiv:2110.02782, 2021.
R. Koenker and K. Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143–156, 2001.
I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.
T. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in Neural Information Processing Systems, 29, 2016.
A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
E. Levin and R. Pieraccini. A stochastic model of computer-human interaction for learning dialogue strategies. In Eurospeech, volume 97, pages 1883–1886. Citeseer, 1997.
J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
Z. Li, J. Kiseleva, and M. De Rijke. Dialogue generation: From imitation learning to inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6722–6729, 2019.
W. Liao, B. Zeng, X. Yin, and P. Wei. An improved aspect-category sentiment analysis model for text sentiment analysis based on RoBERTa. Applied Intelligence, 51(6):3522–3533, 2021.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
B. Pang, E. Nijkamp, W. Han, L. Zhou, Y. Liu, K. Tu, et al. Towards holistic and automatic evaluation of open-domain dialogue generation. 2020.
Y. Park, J. Cho, and G. Kim. A hierarchical latent structure for variational conversation modeling. arXiv preprint arXiv:1804.03424, 2018.
B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328, 2020.
A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-resolution images with VQ-VAE.
A. Saleh, N. Jaques, A. Ghandeharioun, J. Shen, and R. Picard. Hierarchical reinforcement learning for open-domain dialog. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8741–8748, 2020.
C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y. Bengio. Do neural dialog systems use the conversation history effectively? An empirical study.
arXiv preprint arXiv:1906.01603, 2019.
I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
I. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, M. Pieper, S. Chandar, N. Ke, et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017a.
I. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017b.
P. Shah, D. Hakkani-Tur, B. Liu, and G. Tür. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 41–51, 2018.
W. Shi and Z. Yu. Sentiment adaptive end-to-end dialog systems. arXiv preprint arXiv:1804.10731, 2018.
J. Shin, P. Xu, A. Madotto, and P. Fung. Generating empathetic responses by looking ahead the user's sentiment. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7989–7993. IEEE, 2020.
R. Shu, T. Nguyen, Y. Chow, T. Pham, K. Than, M. Ghavamzadeh, S. Ermon, and H. Bui. Predictive coding for locally-linear control. In International Conference on Machine Learning, pages 8862–8871. PMLR, 2020.
S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
N. Subramani, S. Bowman, and K. Cho. Can unconditional language models recover arbitrary sentences? Advances in Neural Information Processing Systems, 32, 2019.
I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
S. Verma, J. Fu, M. Yang, and S. Levine. CHAI: A chatbot AI for task-oriented dialogue with offline reinforcement learning. arXiv preprint arXiv:2204.08426, 2022.
M. Walker. An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email. Journal of Artificial Intelligence Research, 12:387–416, 2000.
W. Wei, Q. Le, A. Dai, and J. Li. AirDialogue: An environment for goal-oriented dialogue research. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3844–3854, 2018.
J. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
M. Yang and O. Nachum. Representation matters: Offline pretraining for sequential decision making. In International Conference on Machine Learning, pages 11784–11794. PMLR, 2021.
D. Yarats and M. Lewis. Hierarchical text generation and planning for strategic dialogue. In International Conference on Machine Learning, pages 5591–5599. PMLR, 2018.
S. Young, M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson, and K. Yu. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174, 2010.
Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.
T. Zhao, K. Xie, and M. Eskenazi. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858, 2019.
L. Zhou, J. Gao, D. Li, and H. Shum. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93, 2020.
D. Ziegler, N. Stiennon, J. Wu, T. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.