# Improving Neural Question Generation Using Answer Separation

*The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)*

Yanghoon Kim,<sup>1,2</sup> Hwanhee Lee,<sup>1</sup> Joongbo Shin,<sup>1</sup> Kyomin Jung<sup>1,2</sup>
<sup>1</sup>Seoul National University, Seoul, Korea
<sup>2</sup>Automation and Systems Research Institute, Seoul National University, Seoul, Korea
{ad26kr,wanted1007,jbshin,kjung}@snu.ac.kr

## Abstract

Neural question generation (NQG) is the task of generating a question from a given passage with deep neural networks. Previous NQG models suffer from the problem that a significant proportion of the generated questions include words from the question target, resulting in the generation of unintended questions. In this paper, we propose answer-separated seq2seq, which better utilizes the information from both the passage and the target answer. By replacing the target answer in the original passage with a special token, our model learns to identify which interrogative word should be used. We also propose a new module termed keyword-net, which helps the model better capture the key information in the target answer and generate an appropriate question. Experimental results demonstrate that our answer separation method significantly reduces the number of improper questions that include answers. Consequently, our model significantly outperforms previous state-of-the-art NQG models.

## Introduction

Neural question generation (NQG) is the task of generating questions from a given passage with deep neural networks. One of its key applications is to generate questions for educational materials (Heilman and Smith 2010). It is also used as a way to improve question answering (QA) systems (Duan et al. 2017; Tang et al. 2017; 2018) or to help chatbots start and continue a conversation (Mostafazadeh et al. 2016).

Automatic question generation (QG) from a passage is a challenging task due to the unstructured nature of textual data. One of the major issues in NQG is how to take the question target, referred to as the target answer, into account. Specifying the question target is necessary for generating natural questions, because there can be multiple target answers in a passage, as in the following example. In Figure 1(a), the passage "John Francis O'Hara was elected president of Notre Dame in 1934." contains several candidates to be asked about, such as the person "John Francis O'Hara", the location "Notre Dame", and the number "1934". Without taking the target answer as an additional input, existing NQG models such as (Du, Shao, and Cardie 2017) tend to generate questions without a specific target. This is a fundamental limitation, since recent NQG systems mostly rely on the RNN sequence-to-sequence model (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015), and RNNs do not have the ability to model high-level variability (Serban et al. 2017). To overcome this limitation, most recent NQG models incorporate the target answer information by using the answer position feature (Zhou et al. 2017; Song et al. 2018).

Figure 1: An example of the overall idea for QG in this paper. Questions generated by existing NQG models tend to include words from the answer, resulting in improper questions. Replacing the answer with a special token effectively prevents answer words from appearing in the question, resulting in the generation of the desired questions.
However, these approaches have a critical issue: a significant proportion of the generated questions include words from the target answer. For example, Figure 1(b) shows the improperly generated question "Who was elected John Francis?", which exposes some words in the answer.¹ This problem results from the tendency of the sequence-to-sequence model to include all of the information from the passage (Amplayo, Lim, and Hwang 2018). It becomes more severe with the recent trend of NQG models adopting the copy mechanism (Gulcehre et al. 2016) to encourage words from the original passage to appear in the question.

¹This example was actually generated by our base model, which is introduced later in the paper.

This study focuses on resolving this problem by separating the target answer from the original passage. For example, the masked passage "⟨a⟩ was elected president of Notre Dame in 1934." in Figure 1(c) still contains enough information to generate the desired question in Figure 1(d), because the term "president" is mostly about someone's position. Interestingly, even though the target answer is replaced with a special token in the passage, we can infer the appropriate interrogative word from the contextual information in the remaining part of the passage. Therefore, we expect that separating the target answer will prevent the answer inclusion problem.

In this paper, we develop a novel architecture named answer-separated seq2seq, which treats the passage and the target answer separately for better utilization of the information from both sides. The first step in our NQG model is answer masking: we literally replace the target answer with the mask token ⟨a⟩ and keep the corresponding target answer apart. The masked passage is encoded by an RNN encoder inside our model. Separating the target answer from the passage in this way helps our model identify the question type related to the target answer, because the model learns to capture the position and contextual information of the target answer with the help of the ⟨a⟩ token. Furthermore, we propose a new module called keyword-net as a part of answer-separated seq2seq, which extracts key information from the target answer that was kept apart. The keyword-net keeps our NQG model consistently aware of the target answer, supplementing the information deficiency caused by answer separation. This module is inspired by how people keep the target answer in mind when they ask questions. Lastly, we adopt the retrieval-style word generator proposed by (Ma et al. 2018), which better captures word semantics during the generation process.

When we evaluate our answer-separated seq2seq on the SQuAD dataset (Rajpurkar et al. 2016), our model outperforms previous state-of-the-art NQG models by a considerable margin. We empirically demonstrate the impact of answer separation in three ways: the rare appearance of the target answer in the generated questions, the better prediction of interrogative words, and the high attention weights from the ⟨a⟩ token to interrogative words. Furthermore, a machine comprehension system trained only on questions generated by our model achieves comparable results.
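To make the masking step concrete, here is a minimal sketch (our own illustration, not the paper's code); it assumes whitespace-tokenized input and writes the ⟨a⟩ token as `<a>`:

```python
def mask_answer(passage_tokens, answer_start, answer_len, mask_token="<a>"):
    """Replace the target answer span with a single mask token and keep
    the answer tokens apart as a separate input to the model."""
    answer = passage_tokens[answer_start:answer_start + answer_len]
    masked = (passage_tokens[:answer_start]
              + [mask_token]
              + passage_tokens[answer_start + answer_len:])
    return masked, answer

tokens = "john francis o hara was elected president of notre dame in 1934 .".split()
masked, answer = mask_answer(tokens, 0, 4)
# masked -> ['<a>', 'was', 'elected', 'president', 'of', 'notre', 'dame', 'in', '1934', '.']
# answer -> ['john', 'francis', 'o', 'hara']
```

The masked passage and the separated answer are then fed to two separate encoders, as described above.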
## Related Work

Recently, there have been several NQG models that are end-to-end trainable from (passage, question, answer) triplets written in natural language. (Du, Shao, and Cardie 2017) first dealt with end-to-end learning for the question generation problem, using a sequence-to-sequence model with an attention mechanism and achieving better performance than rule-based question generation methods in both automatic and human evaluations. However, their model did not take the target answer into account, resulting in the generation of questions full of randomness. To generate more plausible questions, (Zhou et al. 2017) utilized answer positions to make the model aware of the target answer, and used NER tags and POS tags as additional features. (Song et al. 2018) utilized the multi-perspective context matching algorithm of (Wang, Hamza, and Florian 2017) to exploit the interaction between the target answer and the passage when collecting relevant contextual information. Both works employed a copy mechanism (Gulcehre et al. 2016) to reflect the phenomenon by which many of the words in the original passage are copied into the generated question. However, neither of them dealt with the issue that many of the generated questions include target answers, and the copy mechanism can intensify this problem. To tackle this problem, this paper focuses on developing an NQG model that utilizes the target answer as separate knowledge.

Additionally, there have been several works that utilize question generation to improve question answering systems. (Duan et al. 2017) crawled an external QA dataset and generated questions from it through their retrieval-based and generation-based question generation methods. Using the generated questions as additional data for training the QA system, they demonstrated that their question generation model helps to improve QA systems. More recently, (Tang et al. 2018) presented a joint training algorithm that improves both the question answering system and the question generation model.

To the best of our knowledge, none of the previous works has focused on the issue that a significant proportion of generated questions include words in the target answers.

## Task Definition

Given a passage $X^p = (x^p_1, \dots, x^p_n)$ and a target answer $X^a = (x^a_1, \dots, x^a_m)$ as input, the NQG model aims to generate a question $Y = (y_1, \dots, y_T)$ asking about the target answer $X^a$ in the passage $X^p$. The NQG task is defined as finding the best $\bar{Y}$ that maximizes the conditional likelihood given $X^p$ and $X^a$:

$$\bar{Y} = \underset{Y}{\arg\max}\ P(Y \mid X^p, X^a) \tag{1}$$

$$= \underset{Y}{\arg\max}\ \prod_{t=1}^{T} P(y_t \mid X^p, X^a, y_{<t}) \tag{2}$$
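Equation (2) is what the model maximizes during training via teacher forcing. As a sketch under our own naming (not the paper's code), the sequence log-likelihood can be computed from the decoder's per-step vocabulary scores:

```python
import numpy as np

def sequence_log_likelihood(step_logits, target_ids):
    """log P(Y | X^p, X^a) = sum_t log P(y_t | X^p, X^a, y_<t).
    step_logits: [T, V] array of decoder scores under teacher forcing;
    target_ids:  length-T sequence of gold question token ids."""
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        m = logits.max()
        log_z = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
        total += logits[y] - log_z                    # log-softmax at the gold token
    return total
```

Negating this quantity and averaging over a mini-batch gives the cross-entropy loss optimized in the experiments below.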
## Dataset

We evaluate on the SQuAD dataset (Rajpurkar et al. 2016). Since its test set is not publicly available, (Du, Shao, and Cardie 2017) and (Zhou et al. 2017) re-divided it into train/dev/test splits and extracted passages from the paragraphs that contain the target answers; we call these data split-1 and data split-2 in the following. For data split-1, since (Du, Shao, and Cardie 2017) does not include the target answers, (Song et al. 2018) extracted them from each passage to form (passage, question, answer) triplets. As a result, data split-1 and data split-2 contain 70,484/10,570/11,877 and 86,635/8,965/8,964 triplets, respectively. We tokenize both data splits with Stanford CoreNLP (Manning et al. 2014) and then lower-case them.

## Implementation Details

We implement our models in TensorFlow 1.4 and train them on a single GTX 1080 Ti. The hyperparameters of our proposed model are as follows. Our model consists of two one-layer encoders, one each for encoding passages and target answers, and a one-layer decoder that generates questions. The number of hidden units in both encoders and the decoder is 350. For both the encoder and decoder, we use the 34k most frequent words in the training corpus, replacing the rest with the ⟨unk⟩ token. We use 300-dimensional pre-trained GloVe embeddings (Pennington, Socher, and Manning 2014), trained on a 6-billion-token corpus, for initialization and freeze them during training. Weight normalization is applied to the attention module, and dropout with $P_{drop} = 0.4$ is applied to both the RNNs and the attention module. The layer size of keyword-net is set to 4.

## Training and Inference

During training, we optimize the cross-entropy loss function with gradient descent, using the Adam optimizer (Kingma and Ba 2014) with an initial learning rate of 0.001. The mini-batch size for each update is 128, and the model is trained for up to 17 epochs. At test time, we conduct beam search with beam width 10 and length penalty weight 2.1. Decoding stops when the end-of-sentence token is generated. The performances of all our models are reported as mean and standard deviation (mean ± std).

## Named Entity Replacement

To further improve model performance, we pre-process the data with a very simple technique. Since most named entities do not appear often, replacing them with representative tokens not only reduces unknown words but also helps capture the grammatical structure. We look up the named entity tags for the tokens in the given passage and replace each entity with the corresponding tag, making sure that the same entity is always assigned the same tag. NER tags are extracted with the named entity tagger in Stanford CoreNLP. For passages that contain different named entities with the same tag, we distinguish them with different subscripts, such as Person1 and Person2. We store a matching table between named entities and tags, which is used to post-process the generated questions.
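A minimal sketch of this replacement scheme and its post-processing follows; the span format and helper names are our assumptions, with `entity_spans` standing in for Stanford CoreNLP tagger output:

```python
from collections import OrderedDict

def replace_entities(tokens, entity_spans):
    """Replace each named entity with a tag like Person1, Person2, keeping
    a matching table so generated questions can be post-processed back.
    `entity_spans` is a sorted list of non-overlapping (start, end, ner_type)
    triples over token indices."""
    table = OrderedDict()   # tag -> original entity string
    counts = {}             # ner_type -> running subscript
    out, prev = [], 0
    for start, end, ner_type in entity_spans:
        out.extend(tokens[prev:start])
        entity = " ".join(tokens[start:end])
        tag = next((t for t, e in table.items() if e == entity), None)
        if tag is None:  # the same entity always gets the same tag
            counts[ner_type] = counts.get(ner_type, 0) + 1
            tag = f"{ner_type}{counts[ner_type]}"
            table[tag] = entity
        out.append(tag)
        prev = end
    out.extend(tokens[prev:])
    return out, table

def restore_entities(question_tokens, table):
    """Post-process: map tags in a generated question back to entities."""
    return [table.get(tok, tok) for tok in question_tokens]

tokens = "john smith met jane doe in paris".split()
spans = [(0, 2, "Person"), (3, 5, "Person"), (6, 7, "Location")]
masked, table = replace_entities(tokens, spans)
# masked -> ['Person1', 'met', 'Person2', 'in', 'Location1']
# table  -> {'Person1': 'john smith', 'Person2': 'jane doe', 'Location1': 'paris'}
```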
## Evaluation Methods

Following (Zhou et al. 2017; Song et al. 2018), we compare the performance of NQG models with three evaluation metrics: BLEU-4, METEOR, and ROUGE-L, which are standard evaluation metrics for machine translation and text summarization. We use the evaluation package published by (Chen et al. 2015).

**BLEU-4** measures the quality of the candidate by counting the matching 4-grams between the candidate and the reference text.

**METEOR** compares the candidate with the reference in terms of exact, stem, synonym, and paraphrase matches between words and phrases.

**ROUGE-L** assesses the candidate based on the longest common subsequence shared by the candidate and the reference text.

## Performance Comparison

We compare our model with previous state-of-the-art NQG models. Since there exist two different data splits, processed by (Du, Shao, and Cardie 2017) and (Zhou et al. 2017) respectively, we conduct experiments on both. To figure out the effect of each module, we also conduct ablation tests against some key modules:

- **ASs2s** denotes the complete answer-separated seq2seq model.
- **ASs2s-⟨a⟩** is answer-separated seq2seq without replacing the target answer in the original passage.
- **ASs2s-keyword** is answer-separated seq2seq without keyword-net.
- **ASs2s-ASdec** is answer-separated seq2seq without the answer-separated decoder, using a general LSTM decoder instead.

| Model | BLEU-4 (split-1) | METEOR (split-1) | ROUGE-L (split-1) | BLEU-4 (split-2) |
|---|---|---|---|---|
| … | 14.37 ± 0.28 | 18.95 ± 0.24 | 42.06 ± 0.27 | 14.05 ± 0.30 |
| ASs2s | 16.20 ± 0.32 | 19.92 ± 0.20 | 43.96 ± 0.25 | 16.17 ± 0.35 |

Table 1: Evaluation of our model and previous NQG models with three metrics: BLEU-4, METEOR, and ROUGE-L.

As shown in Table 1, ASs2s outperforms all of the previous NQG models on both data splits by a large margin, showing that separate utilization of the target answer information plays an important role in generating the intended questions. With the help of the answer-separated decoder, ASs2s-⟨a⟩ still outperforms the previous NQG models except for ROUGE-L on data split-1. However, there is a considerable decrease in all metrics compared to the complete model; this confirms that the answer separation is what prevents generated questions from including the answer. Similarly, ASs2s-keyword shows a large drop in performance, which verifies that keyword-net has a real impact on performance. ASs2s-ASdec shows an even larger decrease in all metrics compared to ASs2s. This is a natural result, because without the answer-separated decoder, the model has to generate questions relying only on the context around the target answer position, without knowledge of the target answer.

## Impact of Answer Separation

Answer separation helps the model generate the right question for the given target answer. Since the base model does not utilize the target answer information, we further define seq2seq+AP (Answer Position) as the base model with the answer position feature (Zhou et al. 2017) for comparison. We show the benefits of answer-separated seq2seq in three aspects.

### Answer Copying Frequency

If an NQG model captures the question target well, the generated question will rarely include the target answer. We verify this assumption by computing the percentage of generated questions that include their target answers. Since (Du, Shao, and Cardie 2017) ignores the target answer, we choose seq2seq+AP to represent (Du, Shao, and Cardie 2017) with the answer position feature. Further, we choose the previous state-of-the-art model (Song et al. 2018) for comparison, because both (Zhou et al. 2017) and (Song et al. 2018) use the copy mechanism.

| Model | Complete | Partial |
|---|---|---|
| seq2seq+AP | 0.8% | 17.3% |
| (Song et al. 2018) | 2.9% | 24.0% |
| ASs2s | 0.6% | 9.5% |

Table 2: Percentage of complete/partial inclusion of the target answer in generated questions.

As shown in Table 2, the percentage of target answers that are either completely or partially included in the generated questions is significantly lower for our model. We also make an interesting observation: even though (Song et al. 2018) is the previous state-of-the-art NQG model, it generates more questions that are irrelevant to the target answer than seq2seq+AP does. This indicates a negative effect of the copy mechanism: the target answer inside the passage is unintentionally copied into the generated question.
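The paper does not spell out the exact matching rule behind Table 2; the following is one plausible token-level reading (our assumption) of complete and partial inclusion:

```python
def contains_sublist(seq, sub):
    """True if `sub` occurs as a contiguous subsequence of `seq`."""
    return any(seq[i:i + len(sub)] == sub
               for i in range(len(seq) - len(sub) + 1))

def inclusion_stats(questions, answers):
    """Fraction of generated questions that completely or partially
    contain their target answer, over lower-cased tokens. (A real
    implementation might also filter stopwords in the partial-match
    case; the paper does not specify.)"""
    complete = partial = 0
    for question, answer in zip(questions, answers):
        q_tokens, a_tokens = question.lower().split(), answer.lower().split()
        if contains_sublist(q_tokens, a_tokens):
            complete += 1
        elif any(tok in q_tokens for tok in a_tokens):
            partial += 1
    n = len(questions)
    return complete / n, partial / n
```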
### Interrogative Word Prediction

To figure out the effect of answer-separated seq2seq on question type prediction, we compare the recall of interrogative word prediction between the questions generated by answer-separated seq2seq and by seq2seq+AP. We group questions into 8 categories: what, how, when, which, where, who, why, and yes/no.

| Model | what | how | when | which | where | who | why | yes/no |
|---|---|---|---|---|---|---|---|---|
| seq2seq+AP | 77.3% | 56.2% | 19.4% | 3.4% | 12.1% | 36.7% | 23.7% | 5.3% |
| ASs2s | 82.0% | 74.1% | 43.3% | 6.1% | 46.3% | 67.8% | 28.5% | 6.2% |

Table 3: Recall of interrogative word prediction.

As shown in Table 3, answer-separated seq2seq has a better recall score than seq2seq+AP in all categories. In particular, the recall for the question types how, when, where, and who improves by a large margin. Both models' recall for the question type what is very high, because what makes up more than half of the whole training set (55.4%). Both models' recall for the type which is very low. This may result from the fact that some combinations like "which year" and "which person" may be generated as where and who, respectively. For the question types why and yes/no, which make up only 1.5% and 1.2% of the training set respectively, both models perform poorly due to the small amount of data.

### Attention from ⟨a⟩

We verify the effect of replacing the answer with ⟨a⟩ by comparing attention matrices. Given the passage "john francis o hara was elected president of notre dame in 1934." and the target answer "john francis o hara", Figure 3(a) and Figure 3(c) show the attention matrices produced by our answer-separated seq2seq and by seq2seq+AP, respectively. As shown in Figure 3(a), the interrogative word who receives most of its attention weight from the ⟨a⟩ token in our answer-separated seq2seq. Furthermore, our model generates a question that is exactly related to the target answer. With additional answer position features, as in Figure 3(c), only a part of the answer is attended to while generating the interrogative word who. In this case, if the answer carries contextual information, the model may omit it, generating an unintended question. Also, the generated question contains "john francis", which is part of the target answer. We infer that the encoder tends to utilize more information from the word embeddings than from the answer position features, since the word embeddings carry far more information.

Figure 3: (a) and (b) show attention matrices of our model given a passage with two different target answers. (c) shows an attention matrix of seq2seq+AP given the same passage and target answer as (a).

## Question Generation for Machine Comprehension

By training a machine comprehension system on synthetic data generated by our model, we verify that our model is able to generate natural and fluent questions. By changing the position of the ⟨a⟩ token, we can easily produce various questions with our model. Figures 3(a) and 3(b) show an example where we use our model to generate two different questions corresponding to two different target answers from the same input passage.

We experiment with QANet (Yu et al. 2018) on the SQuAD dataset to verify whether the questions generated by our model are valid. Since most of the answers correspond to named entities, we use words and phrases that are named entities in the training part of data split-1 as target answers, and pair those answers with the corresponding passages. We also make sure that the selected answers do not overlap with the answers in the original SQuAD dataset, because our NQG model is trained with the target answers provided with SQuAD; if the answers overlapped, our model might generate exactly the same questions as the gold questions. To organize the dataset in the same format as SQuAD, i.e., (paragraph, question, answer position) triplets, we trace each passage in data split-1 back to its original paragraph and re-compute the answer position as well. We finally build a synthetic dataset of about 50k questions and train the machine comprehension system only on this synthetic data.
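A sketch of this synthetic-data pipeline, under stated assumptions: `ner_tagger` stands in for the Stanford CoreNLP entity tagger, `nqg_model.generate` for the trained ASs2s model, and all names are ours rather than from the paper's code:

```python
def build_synthetic_qa(passages, gold_answer_spans, ner_tagger, nqg_model):
    """Build (paragraph, question, answer position) triplets by asking the
    NQG model about named-entity spans that do not overlap gold answers."""
    synthetic = []
    for pid, passage in passages.items():
        gold = gold_answer_spans.get(pid, [])
        for start, end, entity in ner_tagger(passage):
            # skip spans overlapping answers seen during NQG training
            if any(start < g_end and g_start < end for g_start, g_end in gold):
                continue
            question = nqg_model.generate(passage, answer_span=(start, end))
            synthetic.append({"paragraph": passage,
                              "question": question,
                              "answer": entity,
                              "answer_start": start})
    return synthetic
```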
| Answers | Exact Match (EM) | F1 score |
|---|---|---|
| ALL | 22.72 | 31.58 |
| NER | 49.09 | 56.57 |

Table 4: Performance of the machine comprehension system trained only on synthetic data generated by our NQG model.

As shown in Table 4, the machine comprehension system achieves an EM/F1 score of 22.72/31.58 on the public SQuAD dev set. This result is far below the 68.78/78.56 obtained when the model is trained on the original training set. However, considering that our synthetic data only contains target answers that are single named entities, we further check the EM/F1 score on the partial dev set whose answers are single named entities. We find that in the 10k dev set, about 40 percent of the data has an answer that is a single named entity, and the machine comprehension system achieves an EM/F1 score of 49.09/56.57 on that portion of the data. Since SQuAD is a human-made dataset, this result sufficiently shows that our answer-separated seq2seq can generate valid questions that are acceptable both to humans and to machine comprehension systems.

## Conclusion

In this paper, we investigate the advantages of answer separation in neural question generation. We observe that existing NQG models suffer from a serious problem: a significant proportion of generated questions include words in the question target, resulting in the generation of unintended questions. To overcome this problem, we introduce a novel NQG architecture that treats the passage and the target answer separately to better utilize the information from both sides. Experimental results show that our model has a strong ability to generate the right question for the target answer in the passage. As a result, it yields a substantial improvement over previous state-of-the-art models.

## Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2016M3C4A7952632) and by the Industrial Strategic Technology Development Program (No. 10073144) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

## References

Amplayo, R. K.; Lim, S.; and Hwang, S.-w. 2018. Entity commonsense representation for neural abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 697-707.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Du, X.; Shao, J.; and Cardie, C. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1342-1352.

Duan, N.; Tang, D.; Chen, P.; and Zhou, M. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 866-874.

Gulcehre, C.; Ahn, S.; Nallapati, R.; Zhou, B.; and Bengio, Y. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 140-149.

Heilman, M., and Smith, N. A. 2010. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 609-617. Association for Computational Linguistics.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Ma, S.; Sun, X.; Li, W.; Li, S.; Li, W.; and Ren, X. 2018. Query and output: Generating words by querying distributed word representations for paraphrase generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 196-206.

Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.

Mostafazadeh, N.; Misra, I.; Devlin, J.; Mitchell, M.; He, X.; and Vanderwende, L. 2016. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1802-1813.

Nallapati, R.; Zhou, B.; dos Santos, C.; Gulcehre, C.; and Xiang, B. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 280-290.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383-2392.

Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, 3776-3784.

Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, 3295-3301.

Song, L.; Wang, Z.; Hamza, W.; Zhang, Y.; and Gildea, D. 2018. Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 569-574.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.

Tang, D.; Duan, N.; Qin, T.; Yan, Z.; and Zhou, M. 2017. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027.

Tang, D.; Duan, N.; Yan, Z.; Zhang, Z.; Sun, Y.; Liu, S.; Lv, Y.; and Zhou, M. 2018. Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1564-1574.

Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 4144-4150. AAAI Press.

Yu, A. W.; Dohan, D.; Luong, M.-T.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In Proceedings of the International Conference on Learning Representations.
Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; and Zhou, M. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, 662-671. Springer.