# Teaching Machines to Ask Questions

Kaichun Yao¹, Libo Zhang²\*, Tiejian Luo¹, Lili Tao³, Yanjun Wu²

¹ University of the Chinese Academy of Sciences
² Institute of Software, Chinese Academy of Sciences
³ University of the West of England

yaokaichun@outlook.com, libo@iscas.ac.cn, tiejian@ucas.ac.cn, Lili.Tao@uwe.ac.uk, yanjun@iscas.ac.cn

\* Corresponding author: Libo Zhang (libo@iscas.ac.cn).

**Abstract.** We propose a novel neural network model that aims to generate diverse and human-like natural language questions. Our model not only directly captures the variability in possible questions by using a latent variable, but also generates certain types of questions¹ by introducing an additional observed variable. We deploy our model in the generative adversarial network (GAN) framework and modify the discriminator so that it not only evaluates question authenticity but also predicts the question type. Our model is trained and evaluated on the question-answering dataset SQuAD, and the experimental results show that the proposed model is able to generate diverse and readable questions with a specified attribute.

## 1 Introduction

Automatic question generation aims to generate natural questions from a given text passage and a target answer. It has many important applications. One potential use of question generation is in education, where automatic tutoring systems generate natural questions for reading comprehension materials [Heilman and Smith, 2010]. In dialogue systems, question generation techniques can help dialogue agents launch a conversation or request feedback [Li et al., 2017a]. In the medical field, question generation systems are also used as a clinical tool for evaluating or improving mental health [Colby et al., 1971]. As the reverse task of question answering, question generation also has the potential for building large-scale question-answer pair corpora [Serban et al., 2016].

Traditional methods for question generation mainly use rigid heuristic rules to convert an input passage into corresponding questions [Heilman, 2011; Chali and Hasan, 2015]. However, these approaches rely strongly on human-designed transformation and generation rules and therefore cannot be easily adapted to other domains. Recently, neural-network-based question generation methods aim at training an end-to-end system to generate natural language questions from text without human-designed rules [Zhou et al., 2017]. That work adapts the sequence-to-sequence approach [Cho et al., 2014] to question generation: an encoder encodes the text passage and other auxiliary information (the answer or context), and a decoder then sequentially outputs the question words.

However, existing encoder-decoder models generate only a single question for one text passage. Given a text passage and a question, a unique answer can be found in the passage, but multiple questions can be asked given a passage and an answer. Learning the variety of valid questions is an important but overlooked problem in many existing methods [Zhou et al., 2017; Yuan et al., 2017]. In order to capture the diversity of the potential questions, our method aims to produce a series of questions from a given passage and a given answer. Humans ask different types of questions; in the SQuAD dataset [Rajpurkar et al., 2016] they are classified into six types: WHO/WHOM, WHAT, WHICH, HOW, WHEN and OTHER².

¹ We classify the questions into six types: WHO, WHAT, WHICH, HOW, WHEN and OTHER.
² The six question types account for about 11.1%, 57.7%, 7.1%, 10.5%, 6.3% and 7.3% of the training data, respectively.
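For concreteness, a question-type label of this kind can be read off from a question's leading interrogative word. The rule below is a purely illustrative sketch under our own assumption (the exact labeling heuristic is not spelled out here), not the authors' code:

```python
# Hypothetical helper: bucket a question into one of the six types
# (WHO, WHAT, WHICH, HOW, WHEN, OTHER) by its first interrogative word.
# This heuristic is an assumption made for illustration only.
def question_type(question: str) -> str:
    tokens = question.strip().lower().split()
    first = tokens[0] if tokens else ""
    if first in ("who", "whom"):
        return "WHO"
    if first in ("what", "which", "how", "when"):
        return first.upper()
    return "OTHER"

assert question_type("When did the war end?") == "WHEN"
assert question_type("Name the capital of France.") == "OTHER"
```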
Our method also aims to learn to generate certain types of questions.

In this paper, we teach machines to ask the right questions: natural language questions are generated by learning from a given text passage and a targeted answer. The key idea of our work is to model question generation as a one-to-many problem. Given a text passage and an answer, multiple valid questions may exist. We model a probability distribution over the potential questions using a latent variable. This allows us to generate diverse questions by drawing samples from the learned distribution and reconstructing the word sequence via a decoder neural network. Meanwhile, the observed question-type labels represent the salient attributes of questions and are provided to the model as conditioning information in order to generate questions of certain types.

We apply adversarial training to the natural-language question generation task, in which we simultaneously train a generative model G and a discriminative model D. We use the latent variable and the observed variable in G to learn a distribution over potential questions, generating diverse questions of certain types. We also modify D so that it encourages G to produce sequences that are indistinguishable from human-generated questions and that contain the specified attributes.

The main contributions of this paper are as follows:

- We present a novel natural-language question generation model deployed in the GAN framework [Goodfellow et al., 2014], which consists of a generative model G and a discriminative model D. D encourages G to generate more readable and diverse questions of certain types.
- The variational auto-encoder (VAE) [Kingma and Welling, 2013] is adopted in the generative model G by introducing a latent variable to capture question diversity. The question type is regarded as the salient attribute and serves as an observed variable, which is used to learn a disentangled representation from the latent distribution and to produce certain types of questions.
- We train and evaluate our model on the SQuAD dataset, and the experimental results show that our model is able to generate certain types of questions with high readability and diversity.

## 2 Related Work

### 2.1 Question Generation

Automatic question generation has drawn increasing attention from the natural language generation community in recent years. The majority of earlier work uses rule-based methods that transform a sentence into related questions [Heilman, 2011] via manually constructed templates. However, these methods strongly depend on manually crafted rules and therefore cannot be easily adapted to other domains. Instead of generating questions from text, a neural network method has also been trained to convert knowledge base triples into factoid questions from structured data [Serban et al., 2016]. More recently, neural-network-based question generation models aim at training an end-to-end system to generate natural language questions from text without human-designed transformation and generation rules [Zhou et al., 2017; Yuan et al., 2017; Du et al., 2017]. These works extend the sequence-to-sequence model [Cho et al., 2014] by enriching the encoder with auxiliary feature information to help generate better sentence encodings.
A decoder with an attention mechanism [Bahdanau et al., 2014] then produces a natural language question for the sentence.

### 2.2 Deep Generative Models

Deep-neural-network-based generative models are widely used in text generation tasks such as text summarization [See et al., 2017], machine translation [Hu et al., 2017] and dialogue systems [Li et al., 2017b], and they have drawn a lot of attention. A recurrent neural network (RNN) based generative model is proposed in [Graves, 2013] to generate the next word conditioned on the previous word sequence. Encoder-decoder generative models [Cho et al., 2014] use an RNN to encode an input text sentence into a fixed vector, and generate a new output text sequence from that vector using a second RNN. These models usually produce one corresponding text sequence from a given text sequence. Recently, the VAE [Kingma and Welling, 2013] has become a popular deep generative framework for text generation tasks such as dialogue systems [Zhao et al., 2017] and image captioning [Rezende et al., 2014]. The VAE encodes the input sequence into a latent hidden space and then utilizes a decoder network to rebuild the original input by sampling from this space, aiming to capture the variability in potential generated sequences. GAN [Goodfellow et al., 2014] is an alternative training approach for generative models, in which the training procedure is a minimax two-player game: a generative model is trained to generate outputs, while a discriminative model evaluates them for authenticity. To apply GAN to text generation, SeqGAN [Yu et al., 2016] models text generation as a sequential decision-making process and utilizes policy gradient methods [Sutton et al., 1999] to train the generative model.

## 3 Neural Question Generation Model

In this section, we introduce our neural question generation model. As depicted in Figure 1, we apply adversarial training to the natural-language question generation task, in which we simultaneously train a generative model G and a discriminative model D. The generative model employs an encoder-decoder architecture [Cho et al., 2014]. It is based on conditional variational autoencoders [Zhao et al., 2017] that capture diversity in the encoder, introducing the latent variable $z$ and the observation variable $c$ in G in order to learn a distribution over potential questions given an answer. We also make a simple modification to D by adding an auxiliary classifier that distinguishes the types of questions. Following the standard adversarial training procedure [Goodfellow et al., 2014], we first pre-train the generative model by generating questions given the passages and answers. We then pre-train the discriminator by providing positive examples from the human-generated questions and negative examples produced by the pre-trained generator. After pre-training, the generator and discriminator are trained alternately.

### 3.1 Generative Model

**Model Description.** The question generation task is to model the true probability of a question $Y$ given an input text passage $X$ and an answer $A$. We denote our model distribution by $P(Y \mid X, A)$. We introduce a latent variable $z$ which is used to learn the latent distribution over valid questions. We then define the conditional distribution $P(Y, z \mid X, A) = P(Y \mid z, X, A)\,P(z \mid X, A)$, and our goal is to use deep neural networks (parametrized by $\theta$) to approximate $P(z \mid X, A)$ and $P(Y \mid z, X, A)$. We refer to $P_\theta(z \mid X, A)$ as the prior network and $P_\theta(Y \mid z, X, A)$ as the question generation decoder.
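For clarity, this factorization implies that the model distribution is recovered by marginalizing the decoder over the prior network; the integral is intractable to evaluate exactly, which is what motivates the variational training objective introduced below. This is a restatement of the standard latent-variable setup rather than an additional assumption:

```latex
% Marginal likelihood implied by P(Y, z | X, A) = P(Y | z, X, A) P(z | X, A):
% the latent variable z must be integrated out, which is intractable in general
% and motivates the evidence lower bound (ELBO) used for training.
\begin{equation*}
  P(Y \mid X, A) \;=\; \int P_{\theta}(Y \mid z, X, A)\, P_{\theta}(z \mid X, A)\, \mathrm{d}z .
\end{equation*}
```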
The generative process of $Y$ can then be depicted as: (1) sample the latent variable $z$ from the prior network $P_\theta(z \mid X, A)$; (2) generate $Y$ through the question generation decoder $P_\theta(Y \mid z, X, A)$. We assume the latent variable $z$ follows a multivariate Gaussian distribution with a diagonal covariance matrix. At training time, we follow the variational autoencoder framework [Kingma and Welling, 2013; Sohn et al., 2015] and introduce an approximation network $Q_\phi(z \mid X, A)$ to approximate the true posterior distribution $P_\theta(z \mid X, A)$.

*Figure 1: The neural question generation model deployed in the GAN framework (⊕ denotes the concatenation of the input vectors).*

We thus have the following evidence lower bound (ELBO) [Sohn et al., 2015]:

$$\log P(Y \mid X, A) \;\ge\; -\,\mathrm{KL}\big(Q_\phi(z \mid X, A)\,\|\,P_\theta(z \mid X, A)\big) + \mathbb{E}_{Q_\phi(z \mid X, A)}\big[\log P_\theta(Y \mid z, X, A)\big] \tag{1}$$

We point out that existing encoder-decoder models encode an input passage $X$ and answer $A$ as a single fixed representation. Hence, all of the possible questions corresponding to $X$ and $A$ must be stored within the decoder's probability distribution $P(Y \mid X, A)$, and during decoding it is hard to disentangle these possible questions. However, our question generation model contains a stochastic component $z$ in the decoder $P(Y \mid z, X, A)$, and so by sampling different $z$ and then performing maximum likelihood decoding on $P(Y \mid z, X, A)$, we hope to tease apart the questions stored in the probability distribution $P(Y \mid X, A)$.

In our question generation task, the generative model uses the latent variable $z$ to learn the distribution over potential questions and to generate diverse questions by sampling different $z$. However, it is hard to produce questions with specific attributes or types by randomly sampling from $z$. For instance, every question has a corresponding type, and we would like to generate questions of a specific type. Inspired by ACGAN [Odena et al., 2017], we introduce another observed variable, the question type label $c$, which is independent of $z$, to learn a disentangled representation from $z$. Our model can encode useful information into $z$, and we want to tease apart this information by using the additional observed variable $c$. We update the ELBO as follows:

$$\log P(Y \mid X, A) \;\ge\; -\,\mathrm{KL}\big(Q_\phi(z \mid X, A)\,\|\,P_\theta(z \mid X, A)\big) + \mathbb{E}_{Q_\phi(z \mid X, A)}\big[\log P_\theta(Y \mid (z, c), X, A)\big] \tag{2}$$

Note that this loss decomposes into two parts: the KL divergence between the approximate posterior and the prior, and the cross-entropy loss between the model distribution and the data distribution. Finally, the model is trained by minimizing the following loss:

$$\mathcal{J}_{ml}(\theta, \phi) = \mathrm{KL}\big(Q_\phi(z \mid X, A)\,\|\,P_\theta(z \mid X, A)\big) - \mathbb{E}_{Q_\phi(z \mid X, A)}\big[\log P_\theta(Y \mid (z, c), X, A)\big] \tag{3}$$

**Model Implementation.** We are given an input text passage $X = \{x_1, \ldots, x_n\}$, a question $Y = \{y_1, \ldots, y_m\}$ and the corresponding answer $A = \{a_1, \ldots, a_l\}$, where $n$, $m$ and $l$ are the lengths of the text passage, the ground-truth question and the answer, respectively. The sequence elements $x_i$, $y_j$ and $a_k$ are represented by pre-trained GloVe embedding vectors [Pennington et al., 2014]. At the encoding stage, we first augment each passage word embedding with a binary feature that indicates whether the passage word belongs to the answer. Then we use a bidirectional RNN [Schuster and Paliwal, 1997] with long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] cells as the context encoder, which runs over the augmented passage sequence and generates the corresponding hidden state vectors $h^d = \{h^d_1, \ldots, h^d_n\}$ for the input tokens.
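The following is a minimal PyTorch sketch of this context encoder, assuming a single-layer bidirectional LSTM with the binary answer-position feature appended to the GloVe embeddings; class and variable names are illustrative rather than taken from the paper, and the notation mirrors the definitions given next:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """BiLSTM context encoder over passage embeddings augmented with a binary
    answer-position feature (a sketch under stated assumptions, not the authors' code)."""

    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        # +1 input dimension for the "is this token inside the answer span?" feature
        self.rnn = nn.LSTM(emb_dim + 1, hidden_dim,
                           batch_first=True, bidirectional=True)

    def forward(self, passage_emb, answer_mask):
        # passage_emb: (batch, n, emb_dim) pre-trained GloVe vectors
        # answer_mask: (batch, n), 1.0 where the token lies inside the answer
        augmented = torch.cat([passage_emb, answer_mask.unsqueeze(-1)], dim=-1)
        h_d, (h_n, _) = self.rnn(augmented)
        # h_d: (batch, n, 2*hidden_dim) -- the states h^d_1..h^d_n, each the
        # concatenation of the forward and backward hidden states
        # h_D: passage representation from the final forward and backward states
        h_D = torch.cat([h_n[0], h_n[1]], dim=-1)
        return h_d, h_D
```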
Here, $h^d_i$ is the concatenation of the forward hidden state $\overrightarrow{h^d_i}$ and the backward hidden state $\overleftarrow{h^d_i}$ of the RNN, i.e., $h^d_i = [\overrightarrow{h^d_i}; \overleftarrow{h^d_i}]$ (we use the notation $[\,\cdot\,;\,\cdot\,]$ to denote the concatenation of two vectors throughout the paper). Meanwhile, we obtain the semantic representation of the passage $h^D$ by concatenating the last hidden states of the forward and backward RNNs, i.e., $h^D = [\overrightarrow{h^d_n}; \overleftarrow{h^d_1}]$.

We assume that the answer $A$ consists of the sequence of words $\{x_s, \ldots, x_e\}$ in the passage, where $s$ and $e$ are the start and end positions of the answer words in the passage, s.t. $1 \le s \le e \le n$. In order to obtain the semantic representation $h^a$ of the answer $A$, we concatenate the hidden states $\{h^d_s, \ldots, h^d_e\}$ of the context encoder corresponding to the answer word positions in the passage with the answer word embeddings $\{a_s, \ldots, a_e\}$, i.e., $[h^d_i; a_i]$ for $s \le i \le e$. We form $h^a$ by average pooling the concatenated answer representations over these positions:

$$h^a = \frac{1}{e - s + 1}\sum_{i=s}^{e}[h^d_i; a_i] \tag{4}$$

We also run a bidirectional LSTM RNN as a question encoder over the word embeddings of the question, and concatenate the final states of the forward and backward RNNs to obtain the question representation $h^q = [\overrightarrow{h^q_m}; \overleftarrow{h^q_1}]$.

Assuming that the latent variable $z$ follows a Gaussian distribution, the approximation network is $Q_\phi(z \mid X, A) \sim \mathcal{N}(\mu, \sigma^2 I)$ and the prior network is $P_\theta(z \mid X, A) \sim \mathcal{N}(\mu', \sigma'^2 I)$. Once we obtain the passage representation $h^D$, the answer representation $h^a$ and the question representation $h^q$, we can calculate the means and variances of $Q$ and $P$:

$$\begin{bmatrix} \mu \\ \log(\sigma^2) \end{bmatrix} = W_q\,[h^D; h^a; h^q] + b_q \tag{5}$$

$$\begin{bmatrix} \mu' \\ \log(\sigma'^2) \end{bmatrix} = W_{p2}\big(\tanh(W_{p1}\,[h^D; h^a] + b_{p1})\big) + b_{p2} \tag{6}$$

where $W_q$, $b_q$, $W_{p1}$, $b_{p1}$, $W_{p2}$ and $b_{p2}$ are learnable parameters. We then obtain samples of $z$ from the approximation network $Q$ (training) or the prior network $P$ (testing) using the reparameterization trick [Kingma and Welling, 2013], and concatenate $h^D$, $h^a$, $z$ and the question type label $c$ (in one-hot representation). Finally, we initialize the hidden state of the decoder with a nonlinear transformation of this concatenated representation, $s_0 = \tanh(W_0[h^D; h^a; z; c] + b_0)$, where $W_0$ and $b_0$ are learnable parameters.

At the decoding stage, we employ an attention-based LSTM decoder to decode the source passage and answer information and generate questions. At decoding time step $t$, the LSTM decoder reads the previous word embedding $y_{t-1}$ and the previous context vector $C_{t-1}$ to compute the new hidden state $s_t$. The context vector $C_t$ for the current time step $t$ is computed through the attention mechanism [Luong et al., 2015], which matches the current decoder state with each hidden state $h^d_i$ of the context encoder to get an importance score. The importance scores are then normalized, and the current context vector is obtained as their weighted sum:

$$\begin{aligned} s_t &= \mathrm{LSTM}(y_{t-1}, C_{t-1}, s_{t-1}) \\ e_{t,i} &= v^{T}\tanh(W_e s_{t-1} + U_e h^d_i) \\ \alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{i'=1}^{n}\exp(e_{t,i'})} \\ C_t &= \sum_{i=1}^{n}\alpha_{t,i}\, h^d_i \end{aligned}$$

where $W_e$, $U_e$ and $v$ are learnable parameters. Finally, the probability of each target word $y_t$ is predicted based on all the words that are generated previously (i.e., y