# Integrating Linguistic Knowledge to Sentence Paraphrase Generation

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Zibo Lin,1,2 Ziran Li,1,2 Ning Ding,1,2 Hai-Tao Zheng,1,2 Ying Shen,3 Wei Wang,1,2 Cong-Zhi Zhao4
1 Department of Computer Science and Technology, Tsinghua University
2 Tsinghua Shenzhen International Graduate School, Tsinghua University
3 School of Electronics and Computer Engineering, Peking University Shenzhen Graduate School
4 Giiso Information Technology Co., Ltd
{lzb18, lizr18}@mails.tsinghua.edu.cn
Equal contribution. Corresponding author: zheng.haitao@sz.tsinghua.edu.cn.

Abstract

Paraphrase generation aims to rewrite a text with different words while keeping the same meaning. Previous work performs the task based solely on the given dataset while ignoring the availability of external linguistic knowledge. However, it is intuitive that a model can generate more expressive and diverse paraphrases with the help of such knowledge. To fill this gap, we propose the Knowledge-Enhanced Paraphrase Network (KEPN), a Transformer-based framework that can leverage external linguistic knowledge to facilitate paraphrase generation. (1) The model integrates synonym information from the external linguistic knowledge into the paraphrase generator, which is used to guide the decision on whether to generate a new word or replace it with a synonym. (2) To locate the synonym pairs more accurately, we adopt an incremental encoding scheme to incorporate position information of each synonym. Besides, a multi-task architecture is designed to help the framework jointly learn the selection of synonym pairs and the generation of expressive paraphrases. Experimental results on both English and Chinese datasets show that our method significantly outperforms the state-of-the-art approaches in terms of both automatic and human evaluation.

Introduction

Paraphrase generation is a fundamental task in natural language processing, which aims to restate a text with different words while keeping the meaning approximately the same as the original. Automatic paraphrase generation can be applied to many scenarios to promote the study of natural language processing. For example, question answering systems are often sensitive to the way questions are asked, and rephrasing questions can help people get better answers in many real-world question answering applications (Fader, Zettlemoyer, and Etzioni 2014). Additionally, paraphrases can also help diversify the responses of dialogue assistants (Shah et al. 2018), augment training data (Yang et al. 2019b), and extend the coverage of semantic parsers (Berant and Liang 2014).

Figure 1: An example of paraphrase generation guided by linguistic knowledge.

Recently, various neural models have been put forward for automatic paraphrase generation, modeling the task as a Seq2Seq learning problem from the original sentence to the target paraphrase (Prakash et al. 2016; Gupta et al. 2018; Li et al. 2019a). Although these methods generate fluent and grammatically correct restatements, their performance is far from perfect, because such data-driven methods can only make limited modifications to the original text, such as changing word order or part of speech, and therefore lack lexical and phrasal diversity.
If a model can be guided by external linguistic knowledge such as thesauri, it can replace a word (especially a rare word) in the sentence with a corresponding synonym and thus generate a more complete and expressive paraphrase. For instance, as shown in Figure 1, to rewrite the input sentence "What causes impoverishment in this region?" into the target sentence "What are the causes of poverty in this area?", we need synonym pairs provided by thesauri such as (impoverishment, poverty) and (region, area). It is also worth noting that the word "impoverishment" in the original sentence is a low-frequency word. Without the guidance of external linguistic knowledge, this word is likely to be masked as an unknown tag in the restatement.

Some prior work has been conducted to introduce external linguistic knowledge into paraphrase generation. Specifically, Cao et al. (2017) propose a Seq2Seq model with a copying decoder and equip the decoder with a paraphrase collocation table to regulate the generated words. Huang et al. (2019) adopt an extra synonym dictionary to guide the paraphrase decoder, in which a soft attention mechanism is used to learn the semantic vectors of both the original words and the synonyms. However, such models face the following two issues. First, they pay little attention to the location of each paraphrase pair, which makes it difficult for the decoder to make accurate use of the introduced knowledge. Second, they tend to copy the low-frequency words in the original sentence to alleviate the issue of out-of-vocabulary (OOV) words. This kind of copy mechanism is not in line with the intention of paraphrase generation, because the task prefers to use different words to rewrite the original text.

To address the issues mentioned above, we propose the Knowledge-Enhanced Paraphrase Network (KEPN), a Transformer-based framework that can leverage synonym information provided by external linguistic knowledge to facilitate paraphrase generation. (1) The model retrieves a set of synonyms of words in the source sentence from external thesauri and uses a soft attention mechanism to compute the weighted sum of the synset embeddings. The weighted synonym representation is then combined with the hidden vector of the decoder to guide the decision on whether to generate a new word or replace it with a synonym. (2) To locate synonym pairs more accurately, we adopt an incremental encoding scheme to incorporate position information of each synonym into the decoder. What's more, we design a multi-task architecture with synonym labeling as an auxiliary task. The synonym labeling task aims to identify the position of each synonym in the input sentence, which helps the model jointly learn the selection of synonym pairs and the generation of expressive paraphrases.

We conduct a series of experiments on both English and Chinese benchmark datasets for paraphrase generation. In addition, because most of the existing paraphrase datasets are derived from question matching corpora, in which the sentences are all short questions, we construct a new Chinese paraphrase dataset named TCNP (Translation-based Chinese News Paraphrase) for more diversity in test domains. Experimental results on all datasets show that our model significantly outperforms state-of-the-art methods on automatic evaluation metrics, with improvements of 1.0-1.9 points in BLEU. We also perform a qualitative human evaluation to assess the quality of the generated paraphrases.
The results indicate that the generated paraphrases are well-formed, diverse, and relevant to the input sentence. The code of our model is publicly available at https://github.com/LINMouMouZiBo/KEPN.

Related Work

Paraphrase generation is the task of rewriting a sentence with different words so that it remains semantically equivalent (Madnani and Dorr 2010). Feature-based methods (McKeown 1983; Bolshakov and Gelbukh 2004; Carl, Schmidt, and Schütz 2005) are widely used in paraphrase generation, but they rely heavily on hand-crafted rules and are hard to scale up. Recent efforts involving neural methods have achieved great success by modeling the task as a Seq2Seq learning problem (Sutskever, Vinyals, and Le 2014) from the original sentence to the target paraphrase. As a pioneer, Prakash et al. (2016) first explore deep learning models for paraphrase generation through a stacked LSTM network. Following Prakash's work, Gupta et al. (2018) combine an LSTM with a Variational Autoencoder (VAE) to generate multiple paraphrases. Further, Li et al. (2019a) propose a network with multiple encoders and decoders to generate paraphrases at different granularity levels. However, these methods perform the task based solely on the given dataset, ignoring the availability of external linguistic knowledge.

Recently, introducing external structured knowledge into designed models has achieved great success in many studies of natural language processing (Zhou et al. 2018; Yang et al. 2019a; Li et al. 2019b). Inspired by this, various methods have been proposed to use extra knowledge to improve paraphrase generation. Specifically, Cao et al. (2017) introduce a Seq2Seq model that fuses two decoders, in which the generated words are restricted to the paraphrase table of the current sentence. Huang et al. (2019) propose a method that generates paraphrases under the guidance of an extra dictionary and uses soft attention to learn synonym semantic vectors. Moreover, Wang et al. (2019) first exploit the multi-head attention mechanism (Vaswani et al. 2017) for paraphrase generation and utilize external resources (PropBank labels) for further improvement. However, these studies pay little attention to the location of each paraphrase pair, which makes it difficult to make accurate use of the introduced knowledge. By contrast, our model applies external thesauri to facilitate paraphrase generation, with an incremental encoding scheme and a multi-task architecture for better locating synonym pairs.

Methodology

The overall architecture of the proposed KEPN is shown in Figure 2 and consists of three parts: (1) the Sentence Encoder, which captures contextual features of each word in the input sentence; (2) the Paraphrase Decoder, which generates a paraphrase with the guidance of linguistic knowledge through a soft attention mechanism; and (3) Synonym Labeling, which serves as an auxiliary task in our multi-task architecture to help the decoder make better use of synonym information. In the following subsections, we first give the definition of the paraphrase generation task and then introduce the three parts in detail.

Paraphrase Definition

Given a sentence $x = \{x_1, ..., x_n\}$, sentence paraphrasing aims to generate another sentence $y = \{y_1, ..., y_m\}$ from $x$. Here, the lengths of $x$ and $y$ may not be equal, but the sentences $x$ and $y$ are required to have the same semantic meaning. In our work, we assume access to a corpus of linguistic knowledge $D = \{(w_i, s_i)\}_{i=1}^{N}$, which specifically refers to a synonym table. In table $D$, $w_i$ is treated as a raw word and $s_i$ is a synonym of $w_i$. Our goal is to learn a paraphrase generator that uses $D$ to generate a paraphrase $y$ for a sentence $x$.
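For concreteness, the following toy instance (hypothetical tokens echoing the example in Figure 1, not taken from any of the datasets) sketches how $x$, $y$, and the synonym table $D$ relate:

```python
# Hypothetical toy instance of the task setup, based on the example in Figure 1.
# x: tokenized source sentence, y: tokenized target paraphrase.
x = ["what", "causes", "impoverishment", "in", "this", "region", "?"]
y = ["what", "are", "the", "causes", "of", "poverty", "in", "this", "area", "?"]

# Linguistic knowledge D: a synonym table mapping a raw word w_i to a synonym s_i.
# In the paper D comes from Tongyici Cilin (Extended) for Chinese and WordNet for
# English; the entries below are illustrative only.
D = {"impoverishment": "poverty", "region": "area"}
```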
Figure 2: The overall framework of the Knowledge-Enhanced Paraphrase Network (KEPN). The source sequence is fed into a Sentence Encoder and then read by a raw Transformer decoder. Finally, the output of the basic decoder is combined with the context representation of synonyms to generate the target sequence.

Sentence Encoder

The Sentence Encoder first converts the input word sequence into embedding vectors and then encodes the input via multi-head attention.

Input Representation. At the beginning of the Sentence Encoder, the input sentence $x = \{x_1, ..., x_n\}$ is represented as a sequence of embedding vectors $e = \{e_1, ..., e_n\}$ by looking up a word embedding matrix. The matrix is initialized with pretrained embeddings and optimized as parameters during training. Apart from the word embedding, a position embedding vector $v_i$ is introduced to encode the position information of the $i$-th token in the sentence (Vaswani et al. 2017). The position embedding has the same dimension as the word embedding and is formulated as:

$$v_i[2j] = \sin\left(i / 10000^{2j/d_{model}}\right), \quad (1)$$
$$v_i[2j+1] = \cos\left(i / 10000^{2j/d_{model}}\right), \quad (2)$$

where $i$ is the position of the word in sentence $x$, $j$ is the index of the dimension, and $d_{model}$ is the number of dimensions. The input vector $k$ of our Sentence Encoder is the sum of the word embedding $e$ and the position embedding $v$:

$$k = e + v. \quad (3)$$

Encoder. With the sequence embedding as input, the Sentence Encoder circumvents token-by-token encoding with a parallel encoding step that uses token position information. The encoder is composed of a stack of 6 identical blocks, which are formulated as:

$$\mathrm{Block}(Q, K, V) = \mathrm{LNorm}(\mathrm{FFNN}(m)) + m, \quad (4)$$
$$m = \mathrm{LNorm}(\mathrm{MultiAttn}(Q, K, V)) + Q, \quad (5)$$

where FFNN is a fully connected feed-forward network and LNorm stands for layer normalization. MultiAttn is the crucial building block of the encoder, which allows the model to jointly attend to information from different representation subspaces at different positions. It operates on queries $Q$, keys $K$, and values $V$ as follows:

$$\mathrm{MultiAttn}(Q, K, V) = (h_1, ..., h_i)\,W, \quad (6)$$
$$h_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \quad (7)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V, \quad (8)$$

where $W$, $W_i^Q$, $W_i^K$, and $W_i^V$ are trainable parameters. Following the above procedure, the output $z$ of the Sentence Encoder, inferred from the embedding $k$, can be written compactly as:

$$h_i = \begin{cases} \mathrm{Block}_i(h_{i-1}, h_{i-1}, h_{i-1}) & i \geq 1 \\ k & i = 0 \end{cases} \quad (9)$$

The encoder output $z$ is the semantic representation of the input and is fed into the decoder to drive word generation step by step. Furthermore, we treat the encoder as a shared module in our multi-task architecture: the output $z$ is used not only by the decoder but also by the Synonym Labeling task (more details are given in the following subsections).
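As a reference point, here is a minimal PyTorch sketch of the input representation (Eq. 1-3) and a single encoder block (Eq. 4-5). It is not the authors' released implementation; the function and class names, dimensions, and hyperparameters below are placeholders for illustration.

```python
import torch
import torch.nn as nn

def positional_encoding(n_pos: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position embeddings v_i (Eq. 1-2); assumes d_model is even."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)              # (n_pos, 1)
    div = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(n_pos, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions (Eq. 1)
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions (Eq. 2)
    return pe

class EncoderBlock(nn.Module):
    """One encoder block following Eq. 4-5: normalized sublayer plus residual."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffnn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                  nn.Linear(d_ff, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffnn = nn.LayerNorm(d_model)

    def forward(self, q, k, v):
        m = self.norm_attn(self.attn(q, k, v, need_weights=False)[0]) + q   # Eq. 5
        return self.norm_ffnn(self.ffnn(m)) + m                              # Eq. 4

# Input representation (Eq. 3): word embedding plus position embedding.
d_model, vocab_size, seq_len = 512, 30000, 16          # toy sizes, not the paper's settings
embed = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, seq_len))    # a batch of 2 toy sentences
k0 = embed(tokens) + positional_encoding(seq_len, d_model)   # k = e + v
z = EncoderBlock(d_model, n_heads=8, d_ff=2048)(k0, k0, k0)  # one of the 6 stacked blocks (Eq. 9)
```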
Paraphrase Decoder

The input of the Paraphrase Decoder is the representation vector from the Sentence Encoder, along with synonym-position pairs provided by external linguistic knowledge. The Paraphrase Decoder first acts as a basic Transformer decoder to generate a draft vector. Then a set of synonyms is retrieved and represented through a soft attention mechanism. Finally, the synonym information is used to revise the draft by replacing some words with synonyms, which adds more diversity at the lexical and phrasal level.

The basic Transformer decoder is almost identical to the encoding block, with the addition of one more multi-head attention layer before the feed-forward layer. The final output of the decoder $y'_t$ is formulated as:

$$h_i = \begin{cases} \mathrm{Block}_i(m, z, z) & i \geq 1 \\ y_{t-1} & i = 0 \end{cases} \quad (10)$$

where $m$ is calculated by Eq. 5, in which $Q$, $K$, and $V$ are all replaced by $h_{i-1}$.

Synonym Retrieval. In our work, synonym-position pairs are retrieved from a thesaurus and act as the linguistic knowledge that improves the diversity of paraphrase results. Given a sentence $x$, we first retrieve a set of synonyms $P = \{s_i\}_{i=1}^{M}$ from the synonym table $D$. To locate each synonym accurately, we also add $p_i$, the index that marks the position of $s_i$'s corresponding word in the sentence $x$, to each synonym in $P$. Finally, we obtain a set of synonym-position pairs $P = \{(s_i, p_i)\}_{i=1}^{M}$. Two public thesauri are used in our experiments: Tongyici Cilin (Extended) and WordNet. The Extended Tongyici Cilin is a Chinese synonym table collected by HIT-SCIR, containing 9,995 different pairs of synonyms. WordNet, released by Miller (1995), is a well-known lexical database of English, in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms.

Synonym Pairs Representation. The input synonym-position pairs $P = \{(s_i, p_i)\}_{i=1}^{M}$ need to be converted into vector representations before being fed into the Paraphrase Decoder. Specifically, each synonym is represented by looking up the word embedding matrix shared with the Sentence Encoder. If the synonym is a phrase, we sum the embeddings of its words to obtain a phrase vector. For the position, we calculate the position vector $p_i$ by Eq. 1 and Eq. 2, following the positional encoding layer of the Transformer. The position vector links the synonyms to words in the input sentence, which guides the decoder to pay more attention to the location of each paraphrase pair.

The initial output of the basic decoder is a draft vector, which by itself cannot produce an expressive paraphrase. Thus, we further use a soft attention mechanism to integrate the synonym-position pair representations into the decoder. The synonym information $c_t$ is calculated as follows:

$$c_t = \sum_{i=1}^{M} a_{i,t}\, s_i \oplus \sum_{i=1}^{M} a'_{i,t}\, p_i, \quad (11)$$
$$a_{i,t} = \frac{\exp(g(y'_t, s_i))}{\sum_{i=1}^{M} \exp(g(y'_t, s_i))}, \quad (12)$$
$$g(y'_t, s_i) = V \tanh(W [y'_t \oplus s_i]), \quad (13)$$

where $V$ and $W$ are parameters and $\oplus$ denotes concatenation. $y'_t$ is the output of the basic decoder from Eq. 10, and $a'_{i,t}$ is calculated in the same way as $a_{i,t}$ but with the synonym $s_i$ replaced by the position $p_i$. Finally, a softmax layer is introduced to compute the probability distribution of the word at time step $t$:

$$y_t = \mathrm{softmax}(W_y [y'_t \oplus c_t]), \quad (14)$$

where $W_y$ is a parameter matrix that projects the vector to the dimension of the output vocabulary. At each decoder step, the generation probability $y_t$ is computed until an end symbol is produced or the maximum length of the generated sentence is reached. The loss function of paraphrase generation is chosen to minimize the negative log-likelihood of the generated words:

$$\mathcal{L}_{gen} = -\sum_{t} \ln p(y_t \mid y_{<t}, x).$$
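To make the synonym-guided generation step concrete, the sketch below implements one reading of Eq. 11-14 in PyTorch. Two details are assumptions rather than confirmed by the paper: the two weighted sums in Eq. 11 are combined by concatenation (matching the $\oplus$ used in Eq. 13-14), and the same scoring parameters $V$ and $W$ are reused to compute the position weights $a'_{i,t}$. The class and variable names are illustrative, and this is not the authors' released code.

```python
import torch
import torch.nn as nn

class SynonymSoftAttention(nn.Module):
    """Sketch of the synonym/position soft attention and output layer (Eq. 11-14)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.W = nn.Linear(2 * d_model, d_model, bias=False)       # W in Eq. 13
        self.V = nn.Linear(d_model, 1, bias=False)                 # V in Eq. 13
        self.W_y = nn.Linear(3 * d_model, vocab_size, bias=False)  # W_y in Eq. 14

    def scores(self, y_draft, mem):
        # g(y'_t, .) = V tanh(W [y'_t (+) .]) for every synonym (or position) vector.
        q = y_draft.unsqueeze(1).expand(-1, mem.size(1), -1)                 # (B, M, d)
        return self.V(torch.tanh(self.W(torch.cat([q, mem], dim=-1)))).squeeze(-1)  # (B, M)

    def forward(self, y_draft, syn_emb, pos_emb):
        # y_draft: (B, d) draft vector y'_t from the basic decoder (Eq. 10)
        # syn_emb / pos_emb: (B, M, d) embeddings of the synonym-position pairs
        a_syn = torch.softmax(self.scores(y_draft, syn_emb), dim=-1)  # a_{i,t}, Eq. 12
        a_pos = torch.softmax(self.scores(y_draft, pos_emb), dim=-1)  # a'_{i,t}
        c_t = torch.cat([(a_syn.unsqueeze(-1) * syn_emb).sum(dim=1),
                         (a_pos.unsqueeze(-1) * pos_emb).sum(dim=1)], dim=-1)  # Eq. 11 (concat assumed)
        logits = self.W_y(torch.cat([y_draft, c_t], dim=-1))          # Eq. 14 before the softmax
        return torch.log_softmax(logits, dim=-1)   # log p(y_t | y_<t, x), used by the NLL loss

# Example: a batch of 2 draft vectors attending over M = 3 synonym-position pairs.
B, M, d_model, vocab = 2, 3, 512, 30000
layer = SynonymSoftAttention(d_model, vocab)
log_probs = layer(torch.randn(B, d_model),
                  torch.randn(B, M, d_model),
                  torch.randn(B, M, d_model))
```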
We split the dataset according to the sentence length L and report the results of the various versions of our model on these two metrics.

Model          Ns    No    Ns    No    Ns    No
Transformer    0.03  0.15  0.28  0.21  0.34  0.29
KEPN-sub-pos   0.05  0.14  0.27  0.19  0.87  0.24
KEPN           0.06  0.16  0.31  0.19  1.18  0.21
KEPN-add-SL    0.06  0.14  0.39  0.14  1.42  0.15
Ground Truth   0.13  0.0   0.43  0.0   1.40  0.0

Table 3: The average numbers of synonyms (Ns) and OOV words (No) that appear in the output sentences of different ablated versions of KEPN on the TCNP dataset (columns grouped by sentence length L).

From the results in Table 3, we can find that adding either the position encoding scheme or the synonym labeling task improves the performance of the model on both metrics. This indicates that both components not only bring more diversity to the generated results but also alleviate the OOV issue by replacing a rare word with a corresponding synonym. What's more, we can also observe that as the sentence length grows, the advantage of our model becomes more obvious. One possible reason is that the longer the sentence, the more synonyms can be utilized.

Influence of Sentence Length. To explore the ability to learn long-term dependencies, we evaluate all models on the TCNP dataset, which is split into five parts according to the length of the sentences. The curve of our network (KEPN-add-SL) is always above the others in Figure 3 and descends more smoothly when the sentence length exceeds 16. More interestingly, when the length of sentences ranges from 1 to 5, the RNN-based models outperform the Transformer models. This is caused by the difference in network structure between RNN-based models and Transformer models: RNN-based models are adept at capturing the semantics of short sentences with the help of internal memory units, while Transformer models can capture dependencies between words regardless of their distance in the sequence, which makes the Transformer more powerful than RNNs for modeling long sentences.

Figure 3: BLEU scores of sentences of different lengths on TCNP. The sentences are split into six groups, and the lengths of the sentences in each group fall into the same range ("16-20" means the length ranges from 16 to 20).

Human Evaluation

Although the quantitative results show that our network outperforms other approaches, we also conduct a human evaluation to assess the real quality of the generated paraphrases. We randomly select 200 groups of source sentences from the test sets of both the English and Chinese datasets. Five well-educated university students are asked to score each sentence according to the following three criteria: 1) Relevance (the paraphrase is semantically close to the source sentence); 2) Fluency (the paraphrase is fluent as a natural language sentence, and the grammar is correct); 3) Diversity (the paraphrase uses more varied expressions compared with the source sentence).

Results in Table 4 show that the generated sentences of our network have the highest relevance to the source sentence.

Methods        Quora               LCQMC
               Rel.  Flu.  Div.    Rel.  Flu.  Div.
VAE-SVG        3.04  3.57  2.87    3.05  3.46  2.72
Transformer    3.76  4.23  3.08    4.28  4.50  2.88
KEPN           4.03  4.36  3.38    4.49  4.66  3.01
Ground Truth   4.26  4.44  3.84    4.73  4.88  3.58

Table 4: Human evaluation results of our network. Each assessor gives three scores (Relevance, Fluency, and Diversity, abbreviated as Rel., Flu., and Div.) to each paraphrase, each ranging from 0 to 5, where 0 is the worst and 5 is the best.

Figure 4: Some cases generated by different models. Texts in the same color are synonym pairs.

Besides, the scores of both the Transformer and KEPN are high in fluency, indicating that the generated paraphrases are well-formed and grammatically correct. In terms of Diversity, most of the methods do not perform well, while KEPN achieves an improvement of 0.2-0.3 points over the others. This result demonstrates that by introducing synonym information from thesauri, the model can replace words in the original sentence with synonyms and thus generate more expressive and diverse paraphrases. Some cases are listed in Figure 4.
For the first Chinese example from TCNP, we can see that VAE-SVG generates an unreadable sentence. A possible reason is that the training set contains few sentences about illness, and VAE-SVG fails in an unfamiliar domain without the help of external knowledge. The Transformer produces an unk word, because the word "species of disease" is a low-frequency word in Chinese and is thus replaced by an unk tag. By contrast, our network replaces the rare word with the corresponding synonym "disease" thanks to the synonym pairs provided by the thesauri, successfully alleviating the issue of OOV words. Besides, unlike the Transformer, our network replaces more words with synonyms in its output, making the paraphrase more diverse and expressive. For the second English case from Quora, we find that neither the generation of the Transformer nor that of VAE-SVG conveys the same meaning as the original sentence. In contrast, our network not only matches the input but also generates a new word, "faster", rather than copying the word "quickly" from the input sentence.

Figure 5: A visualization of synonym-output attention. Each column is an attention weight distribution over the synonyms. Darker colors correspond to higher weights.

Moreover, we also examine the soft attention mechanism of our network for indirect evidence of the contribution of synonyms. For instance, the output sentence in Figure 5 is rewritten from "The medicine has widespread usage in high-income countries due to less side effects". When replacing the word "medicine" in the input sentence, there are three synonyms (i.e., "drug", "pill", "remedy") to choose from. The three candidates in Figure 5 receive different degrees of attention, and the attention weight of "drug" is the highest and wins the vote. This illustrates that the dynamic interaction between the attention weights and the generated text helps our network choose correct synonyms, improving the diversity of the output.

Conclusion

In this paper, we present a Knowledge-Enhanced Paraphrase Network for paraphrase generation that edits the original sentence with synonym information provided by external linguistic knowledge. To locate the synonym pairs more accurately, a multi-task architecture with synonym labeling as an auxiliary task is also designed. Experiments on both Chinese and English datasets demonstrate that our network significantly outperforms existing methods on the paraphrase generation task.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (Grant No. 61773229 and 61972219), Shenzhen Giiso Information Technology Co. Ltd., the National Natural Science Foundation of Guangdong Province (Grant No. 2018A030313422), and the Overseas Cooperation Research Fund of Graduate School at Shenzhen, Tsinghua University (Grant No. HW2018002).

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 1-15.
Berant, J., and Liang, P. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1415-1425.
Bolshakov, I. A., and Gelbukh, A. 2004. Synonymous paraphrasing using WordNet and the Internet. In International Conference on Application of Natural Language to Information Systems, 312-323. Springer.
Cao, Z.; Luo, C.; Li, W.; and Li, S. 2017. Joint copying and restricted generation for paraphrase. In Thirty-First AAAI Conference on Artificial Intelligence.
Carl, M.; Schmidt, P.; and Schütz, J. 2005. Reversible template-based shake & bake generation. In Proceedings of MT Summit X, Workshop on EBMT, 17-25.
Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1608-1618. Sofia, Bulgaria: Association for Computational Linguistics.
Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1156-1165. ACM.
Gupta, A.; Agarwal, A.; Singh, P.; and Rai, P. 2018. A deep generative framework for paraphrase generation. In Thirty-Second AAAI Conference on Artificial Intelligence.
Huang, S.; Wu, Y.; Wei, F.; and Luan, Z. 2019. Dictionary-guided editing networks for paraphrase generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6546-6553.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
Lavie, A., and Agarwal, A. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, 228-231. Association for Computational Linguistics.
Li, Z.; Jiang, X.; Shang, L.; and Liu, Q. 2019a. Decomposable neural paraphrase generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3403-3414. Florence, Italy: Association for Computational Linguistics.
Li, Z.; Ding, N.; Liu, Z.; Zheng, H.; and Shen, Y. 2019b. Chinese relation extraction with multi-grained information and external linguistic knowledge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4377-4386.
Liu, X.; Chen, Q.; Deng, C.; Zeng, H.; Chen, J.; Li, D.; and Tang, B. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, 1952-1962.
Madnani, N., and Dorr, B. J. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics 36(3):341-387.
McKeown, K. R. 1983. Paraphrasing questions using given and new information. Computational Linguistics 9(1):1-10.
Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39-41.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318. Association for Computational Linguistics.
Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 425-430.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
Prakash, A.; Hasan, S. A.; Lee, K.; Datla, V.; Qadir, A.; Liu, J.; and Farri, O. 2016. Neural paraphrase generation with stacked residual LSTM networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2923-2934.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, 1073-1083.
Shah, P.; Hakkani-Tür, D.; Tür, G.; Rastogi, A.; Bapna, A.; Nayak, N.; and Heck, L. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104-3112.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.
Wang, S.; Gupta, R.; Chang, N.; and Baldridge, J. 2019. A task in a suit and a tie: Paraphrase generation with semantic augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 7176-7183.
Yang, M.; Chen, L.; Chen, X.; Wu, Q.; Zhou, W.; and Shen, Y. 2019a. Knowledge-enhanced hierarchical attention for community question answering with multi-task and adaptive learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 5349-5355. AAAI Press.
Yang, M.; Yin, W.; Qu, Q.; Tu, W.; Shen, Y.; and Chen, X. 2019b. Neural attentive network for cross-domain aspect-level sentiment classification. IEEE Transactions on Affective Computing.
Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 4623-4629.