# Entailment Relation Aware Paraphrase Generation

Abhilasha Sancheti,¹,² Balaji Vasan Srinivasan,² Rachel Rudinger¹
¹University of Maryland, College Park ²Adobe Research
sancheti@umd.edu, balsrini@adobe.com, rudinger@umd.edu

We introduce a new task of entailment relation aware paraphrase generation, which aims at generating a paraphrase conforming to a given entailment relation (e.g., equivalent, forward entailing, or reverse entailing) with respect to a given input. We propose a reinforcement learning-based, weakly-supervised paraphrasing system, ERAP, that can be trained using existing paraphrase and natural language inference (NLI) corpora without an explicit task-specific corpus. A combination of automated and human evaluations shows that ERAP generates paraphrases that conform to the specified entailment relation and are of good quality compared to baselines and uncontrolled paraphrasing systems. Using ERAP to augment training data for a downstream textual entailment task improves performance over an uncontrolled paraphrasing system and introduces fewer training artifacts, indicating the benefit of explicit control during paraphrasing.

1 Introduction

A paraphrase is an alternative surface form in the same language expressing the same semantic content as the original form (Madnani and Dorr 2010). Although the logical definition of paraphrase requires strict semantic equivalence (or bi-directional entailment (Androutsopoulos and Malakasiotis 2010)) between a sequence and its paraphrase, data-driven paraphrasing accepts a broader definition of approximate semantic equivalence (Bhagat and Hovy 2013). Moreover, existing automatically curated paraphrase resources do not align with this logical definition. For instance, pivot-based paraphrasing rules extracted by Ganitkevitch, Van Durme, and Callison-Burch (2013) contain hypernym or hyponym pairs, e.g., due to variation in the discourse structure of translations, and unrelated pairs, e.g., due to misalignments or polysemy in the foreign language. While this flexibility of approximate semantic equivalence allows for greater diversity in expressing a sequence, it comes at the cost of the ability to precisely control the semantic entailment relationship (henceforth entailment relation) between a sequence and its paraphrase. This trade-off severely limits the applicability of paraphrasing systems or resources to a variety of downstream natural language understanding (NLU) tasks (e.g., machine translation, question answering, information retrieval, and natural language inferencing (Pavlick et al. 2015)) (Figure 1). For instance, semantic divergences in machine translation have been shown to degrade translation performance (Carpuat, Vyas, and Niu 2017; Pham et al. 2018).

Figure 1: An entailment-unaware system might output approximately equivalent paraphrases. Label-preserving augmentations generated using such a system for a textual entailment task can result in incorrect labels (red). Explicit entailment relation control in an entailment-aware system helps reduce such incorrectly labeled augmentations (green).
Existing works identify the directionality (forward, reverse, bi-directional, or no implication) of paraphrase and inference rules (Bhagat, Pantel, and Hovy 2007), and add semantics (natural logic entailment relationships such as equivalence, forward or reverse entailment, etc.) to data-driven paraphrasing resources (Pavlick et al. 2015), leading to improvements in lexical expansion and proof-based RTE systems, respectively. However, entailment relation control in paraphrase generation is, to our knowledge, a relatively unexplored topic, despite its potential benefit to downstream applications (Madnani and Dorr 2010) such as multi-document summarization (MDS) or information retrieval (IR), wherein such control could allow the MDS (or IR) system to choose either the more specific (reverse entailing) or the more general (forward entailing) sentence (or query) depending on the purpose of the summary (or user needs).

To address the lack of entailment relation control in paraphrasing systems, we introduce a new task of entailment relation aware paraphrase generation: given a sequence and an entailment relation, generate a paraphrase which conforms to the given entailment relation. We consider three entailment relations (controls) in the spirit of monotonicity calculus (Valencia 1991): (1) Equivalence (≡) refers to semantically equivalent paraphrases (e.g., synonyms) where the input sequence entails its paraphrase and vice-versa; (2) Forward Entailment (⊑) refers to paraphrases that lose information from the input or generalize it (e.g., hypernyms), i.e., the input sequence entails its paraphrase; (3) Reverse Entailment (⊒) refers to paraphrases that add information to the input or make it more specific (e.g., hyponyms), i.e., the input sequence is entailed by its paraphrase. The unavailability of paraphrase pairs annotated with such a relation makes it infeasible to directly train a sequence-to-sequence model for this task. Collecting such annotations for existing large paraphrase corpora such as ParaBank (Hu et al. 2019b) or ParaNMT (Wieting and Gimpel 2018) is expensive due to scale. We address this challenge in three ways: (1) by building a novel entailment relation oracle based on the natural language inference (NLI) task (Bowman et al. 2015a; Williams, Nangia, and Bowman 2018) to obtain weak supervision for the entailment relation for existing paraphrase corpora; (2) by recasting an existing NLI dataset, SICK (Marelli et al. 2014), into a small supervised dataset for this task; and (3) by proposing the Entailment Relation Aware Paraphraser (ERAP), a reinforcement learning-based (RL-based) weakly-supervised system that can be trained using only existing paraphrase and NLI corpora, with or without weak supervision for the entailment relation.

Intrinsic and extrinsic evaluations show the advantage of entailment relation aware (henceforth entailment-aware) paraphrasing systems over entailment-unaware (standard uncontrolled paraphrase generation) counterparts. Intrinsic evaluation of ERAP (via a combination of automatic and human measures) on the recasted SICK (§3) dataset shows that generated paraphrases conform to the given entailment relation with high accuracy while maintaining good or improved paraphrase quality when compared against entailment-unaware baselines.
Extrinsic data-augmentation experiments (§5) on the textual entailment task show that augmenting training sets using an entailment-aware paraphrasing system leads to improved performance over an entailment-unaware paraphrasing system, and makes the downstream model less susceptible to making incorrect predictions on adversarial examples.

2 Entailment Relation Aware Paraphraser

Task Definition. Given a sequence of tokens $X = [x_1, \dots, x_n]$ and an entailment relation $R \in \{\text{Equivalence } (\equiv), \text{Forward Entailment } (\sqsubseteq), \text{Reverse Entailment } (\sqsupseteq)\}$, we generate a paraphrase $Y = [y_1, \dots, y_m]$ such that the entailment relationship between X and Y is R. $\hat{Y}$ denotes the generated paraphrase and Y the reference paraphrase.

Neural paraphrasing systems (Prakash et al. 2016; Li et al. 2018) employ a supervised sequence-to-sequence model to generate paraphrases. However, building a supervised model for this task requires paraphrase pairs with entailment relation annotations. To address this, we propose an RL-based paraphrasing system, ERAP, which can be trained with existing paraphrase and NLI corpora without any additional annotations. ERAP (Figure 2) consists of a paraphrase generator (§2.1) and an evaluator (§2.2) comprising various scorers that assess the quality of generated paraphrases along different aspects. Scores from the evaluator are combined (§2.3) to provide feedback to the generator in the form of rewards. Employing RL allows us to explicitly optimize the generator over measures accounting for the quality of generated paraphrases, including non-differentiable ones.

Figure 2: ERAP: the generator takes in a sequence X and an entailment relation R, and outputs a paraphrase $\hat{Y}$. $\hat{Y}$ is scored by the various scorers in the evaluator, and a combined score (the reward) is sent back to train the generator. The hypothesis-only adversary is adversarially trained on $\hat{Y}$ and predictions from the entailment relation consistency scorer.

2.1 Paraphrase Generator

The generator is a transformer-based (Vaswani et al. 2017) sequence-to-sequence model which takes (X, R) and generates $\hat{Y}$. We denote the generator as $G(\hat{Y} \mid X, R; \theta_g)$, where $\theta_g$ refers to the parameters of the generator. We incorporate the entailment relation as a special token prepended to the input sequence. This way, the entailment relation receives special treatment (Kobus, Crego, and Senellart 2017) and the generator learns to generate paraphrases for a given X and R.

2.2 Paraphrase Evaluator

The evaluator comprises several scorers to assess the quality of the generated paraphrase along three aspects: semantic similarity with the input, expression diversity from the input, and entailment relation consistency. It provides rewards for the paraphrases generated by the generator as feedback, which is used to update the parameters of the generator. We describe the various scorers below.

Semantic Similarity Scorer provides a reward which encourages the generated paraphrase $\hat{Y}$ to have a similar meaning as the input sequence X. We use MoverScore (Zhao et al. 2019) to measure the semantic similarity between the generated paraphrase and the input, denoted as $r_s(X, \hat{Y})$. MoverScore combines contextualized representations with word mover's distance (Kusner et al. 2015) and has shown high correlation with human judgments of text quality.

Expression Diversity Scorer rewards the generated paraphrase for using different tokens or surface forms to express the input. We measure this aspect by computing n-gram dissimilarity (inverse BLEU (Papineni et al. 2002)):

$$r_d(X, \hat{Y}) = 1 - \mathrm{BLEU}(\hat{Y}, X) \quad (1)$$

Following Hu et al. (2019b), we use a modified BLEU without the length penalty to avoid generating short paraphrases, which can otherwise result in high inverse BLEU scores.
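To make the two pieces above concrete, here is a minimal sketch, not the authors' code, of (1) prepending the entailment relation as a control token to the generator input (§2.1) and (2) the expression-diversity score of Eq. 1, using a simplified n-gram-precision BLEU with no brevity penalty. The token strings, helper names, and the smoothing constant are our assumptions.

```python
from collections import Counter

RELATION_TOKENS = {"equivalence": "<EQUIV>", "forward": "<FWD>", "reverse": "<REV>"}

def build_generator_input(sequence: str, relation: str) -> str:
    """Prepend the relation control token so the seq2seq generator conditions on it."""
    return f"{RELATION_TOKENS[relation]} {sequence}"

def _ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    total = max(sum(hyp_ngrams.values()), 1)
    return (overlap + 1e-9) / total  # tiny smoothing to avoid zero precisions

def diversity_score(x: str, y_hat: str, max_n: int = 4) -> float:
    """r_d(X, Y_hat) = 1 - BLEU(Y_hat, X), treating the input X as the reference."""
    hyp, ref = y_hat.lower().split(), x.lower().split()
    bleu = 1.0
    for n in range(1, max_n + 1):        # geometric mean over 1..4-grams, no brevity penalty
        bleu *= _ngram_precision(hyp, ref, n)
    bleu = bleu ** (1.0 / max_n)
    return 1.0 - bleu

print(build_generator_input("a man is playing a guitar", "reverse"))
print(round(diversity_score("a man is playing a guitar", "a person is playing a guitar"), 3))
```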
Entailment Relation Consistency Scorer is a novel scorer designed to reward the generated paraphrase for adhering to the given entailment relation R. To compute the reward, we build an oracle O(X, Y) (details in §3) based on natural language inference (NLI) and use the likelihood of the given entailment relation under the Oracle as the score: $r_l(X, \hat{Y}, R) = O(l = R \mid X, \hat{Y})$. As will be discussed further in §4.3, we found that the entailment relation consistency scorer can result in the generator learning simple heuristics (e.g., adding the same adjective such as "desert", or trailing tokens like "and says" or "with mexico", for ⊒, or producing short outputs for ⊑), leading to degenerate paraphrases with a high consistency score. Inspired by the idea of hypothesis-only baselines (Poliak et al. 2018) for the NLI task, we build a novel RoBERTa-based Hypothesis-only Adversary, $A(l \mid \hat{Y})$, to penalize generated paraphrases that resort to such heuristics. The adversary is a 3-class classifier trained on the paraphrases generated during the training phase, with the Oracle prediction for the $(X, \hat{Y})$ pair as the ground truth. The adversary loss is

$$\mathcal{L}(A) = -\sum_{c=1}^{C} O(l = c \mid X, \hat{Y}) \log A(l = c \mid \hat{Y}), \quad (2)$$

where C = 3 is the number of entailment relations. Training the adversary in this way helps it adapt to the heuristics taken by the generator during the course of training. The generator and the adversary are trained alternately, similar to a GAN (Goodfellow et al. 2014) setup. The penalty is computed as the likelihood of the entailment relation being R under the adversary, $p_l(\hat{Y}, R) = A(l = R \mid \hat{Y})$. We only penalize those generated paraphrases for which the predicted relation is the same as the input relation, because an incorrect prediction indicates that no heuristic was taken by the generator.

2.3 Reinforcement Learning Setup

The output paraphrases from the generator are sent to the scorers for evaluation. The various scores from the scorers are combined to give feedback (in the form of a reward) to the generator to update its parameters and improve the quality of the generated paraphrases conforming to the given relation. We emphasize that although the scores from our scorers are not differentiable with respect to $\theta_g$, we can still use them by employing RL (the REINFORCE algorithm (Williams 1992)) to update the parameters of the generator. In the RL paradigm, the state at time t is defined as $s_t = (X, R, \hat{Y}_{1:t-1})$, where $\hat{Y}_{1:t-1}$ refers to the first t−1 tokens already generated in the paraphrase. The action at time t is the t-th token to be generated. Let V be the vocabulary and T the maximum output length. The total expected reward of the current generator is then given by

$$J(G) = \sum_{t=1}^{T} \mathbb{E}_{\hat{Y}_{1:t-1} \sim G}\Big[\sum_{y_t \in V} P(y_t \mid s_t)\, Q(s_t, y_t)\Big],$$

where $P(y_t \mid s_t)$ is the likelihood of token $y_t$ given the current state $s_t$, and $Q(s_t, y_t)$ is the cumulative discounted reward for a paraphrase extended from $\hat{Y}_{1:t-1}$. The total reward Q is defined as the sum of the token-level rewards:

$$Q(s_t, y_t) = \sum_{\tau=t}^{T} \gamma^{\tau - t}\, r(s_\tau, y_\tau), \quad (3)$$

where $r(s_\tau, y_\tau)$ is the reward of token $y_\tau$ at state $s_\tau$, and $\gamma \in (0, 1)$ is a discounting factor so that future rewards have decreasing weights, since their estimates are less accurate. If we consider that $\hat{Y}_{1:t-1}$ has been given, then for every $y_t$ the total expected reward becomes

$$\sum_{y_t \in V} P(y_t \mid s_t)\, Q(s_t, y_t). \quad (4)$$
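The following is a minimal sketch, under our own assumptions, of the discounted return in Eq. 3 together with a standard REINFORCE-style surrogate loss (Williams 1992). The paper states the objective J(G) as an expectation over the vocabulary (Eq. 4); the sketch uses the common sampled approximation based on the log-probability of the token actually generated, and the value of GAMMA is assumed rather than taken from the paper.

```python
import math

GAMMA = 0.95  # assumed discount factor; the paper only states gamma is in (0, 1)

def discounted_returns(rewards, gamma=GAMMA):
    """Q(s_t, y_t) = sum_{tau >= t} gamma^(tau - t) * r(s_tau, y_tau)  (Eq. 3)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def reinforce_loss(log_probs, rewards, gamma=GAMMA):
    """Negative sampled objective: -sum_t log P(y_t | s_t) * Q(s_t, y_t)."""
    q_values = discounted_returns(rewards, gamma)
    return -sum(lp * q for lp, q in zip(log_probs, q_values))

# toy usage: three generated tokens with their per-step rewards and log-probabilities
print(reinforce_loss(log_probs=[math.log(0.4), math.log(0.7), math.log(0.9)],
                     rewards=[0.1, 0.05, 0.2]))
```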
Sequence Sampling. To obtain $r(s_t, y_t)$ at each time step t, we need scores for each token. However, by design these scorers only evaluate complete sequences rather than single tokens or partial sequences. We therefore use the roll-out technique (Yu et al. 2017), where the generator rolls out a given sub-sequence $\hat{Y}_{1:t}$ into a complete sequence by sampling the remaining part $\hat{Y}_{t+1:T}$. Following Gong et al. (2019), we use a combination of beam search and multinomial sampling to balance reward-estimation accuracy at each time step and diversity of the generated sequence. We first generate a reference paraphrase $\hat{Y}^{\text{ref}}_{1:T}$ using beam search and draw n samples of complete sequences $\hat{Y}_{1:T}$ by rolling out the sub-sequence $\hat{Y}^{\text{ref}}_{1:t}$ using multinomial sampling to estimate the reward at each time step t.

Reward Estimation. We send the n samples of complete sequences drawn from the sub-sequence $\hat{Y}^{\text{ref}}_{1:t}$ to the scorers. The combined score $f(s_t, y_t)$ for an action $y_t$ at state $s_t$ is computed by averaging the scores of the complete sequences rolled out from $\hat{Y}^{\text{ref}}_{1:t}$:

$$f(s_t, y_t) = \frac{1}{n}\sum_{i=1}^{n}\Big[\alpha\,\big(r_l(X, \hat{Y}^i, R) - p_l(\hat{Y}^i, R)\big) + \beta\, r_s(X, \hat{Y}^i) + \delta\, r_d(X, \hat{Y}^i)\Big], \quad (5)$$

where α, β, δ, and n are hyperparameters empirically set to 0.4, 0.4, 0.2, and 2, respectively. These parameters control the trade-off between the different aspects of this multi-objective task. Following Siddique, Oymak, and Hristidis (2020), we threshold¹ the scorers' outputs so that the final reward maintains a good balance across the various scores. For example, generating diverse tokens at the expense of losing too much semantic similarity is not desirable. Similarly, copying the input sequence as-is into the generation is clearly not a paraphrase (i.e., $r_s(X, \hat{Y}) = 1$). We define the reward $r(s_t, y_t)$ for action $y_t$ at state $s_t$ as:

$$r(s_t, y_t) = \begin{cases} f(s_t, y_t) - f(s_{t-1}, y_{t-1}), & t > 1,\\ f(s_1, y_1), & t = 1 \end{cases} \quad (6)$$

¹If $0.3 \le r_s(X, \hat{Y}) \le 0.98$ then the score is used as-is; otherwise it is set to 0. Similarly, if $r_s(X, \hat{Y}) > 0$ after thresholding then $r_d$, $r_l$, and $p_l$ are computed as defined; otherwise they are set to 0.

The discounted cumulative reward $Q(s_t, y_t)$ is then computed from the rewards $r(s_\tau, y_\tau)$ at each time step using Eq. 3, and the total expected reward is derived using Eq. 4. The generator loss $\mathcal{L}(G)$ is defined as $-J(G)$.
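As a recap of the reward combination in Eqs. 5-6 and footnote 1, here is a minimal sketch under our own assumptions: `oracle` and `adversary` stand in for the NLI-based classifiers of §2.2 and are assumed to return a dict mapping each relation to its probability; all helper names and calling conventions are ours, while the weights follow the stated hyperparameters.

```python
ALPHA, BETA, DELTA = 0.4, 0.4, 0.2

def consistency_reward(oracle, x, y_hat, relation):
    """r_l(X, Y_hat, R): Oracle likelihood of the requested relation."""
    return oracle(x, y_hat)[relation]

def adversary_penalty(adversary, y_hat, relation):
    """p_l(Y_hat, R): hypothesis-only likelihood, applied only when the adversary
    already predicts R from Y_hat alone (a sign that a heuristic was used)."""
    probs = adversary(y_hat)
    return probs[relation] if max(probs, key=probs.get) == relation else 0.0

def combined_score(r_l, p_l, r_s, r_d):
    """One roll-out's contribution to f(s_t, y_t); Eq. 5 averages this over n=2 roll-outs."""
    r_s = r_s if 0.3 <= r_s <= 0.98 else 0.0   # footnote 1: similarity threshold
    if r_s == 0.0:                              # other scores count only if similarity survives
        r_l = p_l = r_d = 0.0
    return ALPHA * (r_l - p_l) + BETA * r_s + DELTA * r_d

def token_rewards(step_scores):
    """Eq. 6: r(s_1, y_1) = f(s_1, y_1); r(s_t, y_t) = f(s_t, y_t) - f(s_{t-1}, y_{t-1})."""
    return [step_scores[0]] + [curr - prev for prev, curr in zip(step_scores, step_scores[1:])]

# toy usage with already-computed per-step combined scores
print(token_rewards([0.30, 0.42, 0.40, 0.55]))  # approximately [0.30, 0.12, -0.02, 0.15]
```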
2.4 Training Details

Pre-training has been shown to be critical for RL to work in unsupervised settings (Siddique, Oymak, and Hristidis 2020; Gong et al. 2019); therefore, we pre-train the generator on existing large paraphrase corpora, e.g., ParaBank (Hu et al. 2019b) or ParaNMT (Wieting and Gimpel 2018), in two ways: (1) Entailment-aware uses the Oracle (§3) to obtain entailment relations for paraphrase pairs in the train set of the paraphrase corpora, filters out the semantically-divergent (§3) pairs, upsamples or downsamples to balance the data across relations, and trains the generator with weak supervision for the entailment relation and gold paraphrases; and (2) Entailment-unaware trains the generator on the paraphrase pairs as-is, without any entailment relation. Pre-training is done in a supervised manner with the cross-entropy loss and offers an immediate benefit: the generator learns paraphrasing transformations and gets a warm start, leading to faster model training.

RL-based Fine-tuning. We fine-tune the generator using feedback from the evaluator on the recasted SICK dataset (details in §3). For practical purposes, our RL fine-tuning approach only requires input sequences, without any annotations for the entailment relation or ground-truth paraphrases. However, for a fair comparison against supervised or weakly-supervised baselines (§4.1), we use the gold entailment relation for recasted SICK during RL fine-tuning.

3 Collecting Labeled Paraphrase Data

Entailment-aware paraphrasing requires paraphrase pairs annotated with an entailment relation. However, collecting such annotations for large paraphrase corpora such as ParaBank² (Hu et al. 2019b) is too costly. To obtain entailment relations automatically, we train an NLI classifier and use it to derive the entailment relations as described below.

Entailment Relation Oracle. NLI is a standard natural language understanding task of determining whether a hypothesis h is true (entailment³ E), false (contradiction C), or undetermined (neutral N) given a premise p (MacCartney 2009). To build an entailment relation oracle, O(X, Y), we first train a RoBERTa-based (Liu et al. 2019) 3-class classifier, $o(l \mid \langle p, h\rangle)$, to predict the uni-directional (E, N, C) labels given a ⟨p, h⟩ pair. This classifier is then run forwards (⟨X, Y⟩) and backwards (⟨Y, X⟩) on the paraphrase pairs to get the uni-directional predictions, which are further used to derive the entailment relations as in Eq. 7. The Oracle is used to generate weak supervision for entailment relations for existing paraphrase corpora, and to assess the generated paraphrases for relation consistency. We only focus on the ≡, ⊑, and ⊒ relations, as contradictory, neutral, or invalid pairs are considered semantically-divergent sentence pairs.

$$O(X, Y) = \begin{cases} \equiv & \text{if } o(l \mid \langle X, Y\rangle) = E \text{ and } o(l \mid \langle Y, X\rangle) = E\\ \sqsubseteq & \text{if } o(l \mid \langle X, Y\rangle) = E \text{ and } o(l \mid \langle Y, X\rangle) = N\\ \sqsupseteq & \text{if } o(l \mid \langle X, Y\rangle) = N \text{ and } o(l \mid \langle Y, X\rangle) = E\\ C & \text{if } o(l \mid \langle X, Y\rangle) = C \text{ and } o(l \mid \langle Y, X\rangle) = C\\ N & \text{if } o(l \mid \langle X, Y\rangle) = N \text{ and } o(l \mid \langle Y, X\rangle) = N\\ \text{Invalid} & \text{otherwise} \end{cases} \quad (7)$$

²ParaBank consists of 50 million high-quality English paraphrases obtained by training a Czech-English neural machine translation (NMT) system and adding lexical constraints to the NMT decoding procedure.
³Entailment in NLI is a uni-directional relation, while Equivalence is a bi-directional entailment relation.

Recasting SICK Dataset. SICK (Marelli et al. 2014) is an NLI dataset created from sentences describing the same picture or video, which are near-paraphrases. It consists of sentence pairs (p, h) with human-annotated NLI labels for both directions ⟨p, h⟩ and ⟨h, p⟩. We recast this dataset to obtain paraphrase pairs with entailment relation annotations derived from the gold bi-directional labels in the same way as O. We only consider the sentence pairs which were created by combining meaning-preserving transformations (details in appendix). We augment this data by adding valid samples obtained by reversing sentence pairs (for p ≡ h we add h ≡ p, and for p ⊑ h we add h ⊒ p). Data statistics are in Table 1.

Table 1: Dataset statistics. The ≡, ⊑, ⊒, and Others columns are for the recasted SICK dataset, and the E, N, C columns for SICK NLI; E, N, C denote entailment, neutral, and contradiction, respectively. Others refers to a neutral or invalid relation.

| Split | ≡ | ⊑ | ⊒ | Others | E | N | C |
|-------|------|-----|-----|--------|------|------|-----|
| Train | 1344 | 684 | 684 | 420 | 1274 | 2524 | 641 |
| Dev | 196 | 63 | 63 | 43 | 143 | 281 | 71 |
| Test | 1386 | 814 | 814 | 494 | 1404 | 2790 | 712 |

Oracle Evaluation. We train the NLI classifier o on existing NLI datasets, namely MNLI (Williams, Nangia, and Bowman 2018), SNLI (Bowman et al. 2015a), and SICK (Marelli et al. 2014), as well as diagnostic datasets such as HANS (McCoy, Pavlick, and Linzen 2019) and others introduced in Glockner, Shwartz, and Goldberg (2018) and Min et al. (2020), using the cross-entropy loss. Combining diagnostic datasets during training has been shown to improve the robustness of NLI systems, which can otherwise resort to simple lexical or syntactic heuristics (Glockner, Shwartz, and Goldberg 2018; Poliak et al. 2018) to perform well on the task.
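To make the relation derivation of Eq. 7 concrete, here is a minimal sketch, not the authors' code: `nli` stands in for the RoBERTa-based classifier $o(l \mid \langle p, h\rangle)$ and is assumed to return one of the strings "E", "N", or "C"; the function and label names are our assumptions, while the mapping follows Eq. 7.

```python
def oracle_relation(nli, x: str, y: str) -> str:
    """Derive the entailment relation between X and Y from two directional NLI calls."""
    fwd, bwd = nli(x, y), nli(y, x)     # run the classifier forwards and backwards
    if fwd == "E" and bwd == "E":
        return "equivalence"            # X entails Y and Y entails X
    if fwd == "E" and bwd == "N":
        return "forward_entailment"     # X entails Y only
    if fwd == "N" and bwd == "E":
        return "reverse_entailment"     # Y entails X only
    if fwd == "C" and bwd == "C":
        return "contradiction"
    if fwd == "N" and bwd == "N":
        return "neutral"
    return "invalid"                    # any other combination of labels
```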
The accuracy of $o(l \mid \langle p, h\rangle)$ on the combined test sets of the datasets used for training is 92.32%, and the accuracy of the Entailment Relation Oracle O(X, Y) on the test set of the recasted SICK dataset is 81.55%. Before using the Oracle to obtain weak supervision for entailment relations for training purposes, we validate it by manually annotating 50 random samples from ParaBank; 78% of the annotated relations were the same as the Oracle predictions when the C, N, and Invalid labels were combined.

4 Intrinsic Evaluation

Here we provide details on the entailment-aware and unaware comparison models, and the evaluation measures.

4.1 Comparison Models

To contextualize ERAP's performance, we train several related models, including supervised and weakly-supervised, entailment-aware and unaware models, to obtain lower- and upper-bound performance on recasted SICK, as follows: (1) the generator is trained on recasted SICK in an entailment-aware (S2S-A) and unaware (S2S-U) supervised setting; (2) the generator is pre-trained on the ParaBank dataset in an entailment-aware (Pre-train-A) and unaware (Pre-train-U) setting and tested directly on the test set of recasted SICK; (3) the pre-trained generators are fine-tuned on recasted SICK in an entailment-aware (Fine-tune-A) and unaware (Fine-tune-U) supervised setting; (4) multiple outputs (k ∈ {1, 5, 10, 20}) are sampled using nucleus sampling (Holtzman et al. 2019) from S2S-U (RR-S2S-U) or Fine-tune-U (RR-FT-U) and re-ranked based on the combined score $f(s_t, y_t)$; the highest-scoring output is taken as the final output for RR-S2S-U and RR-FT-U.

4.2 Evaluation Measures

Automatic evaluation of paraphrase quality is primarily done using iBLEU (Sun and Zhou 2012), which penalizes copying from the input. Following Liu et al. (2020), we also report BLEU (Papineni et al. 2002) (up to 4-grams) and Diversity (Div, measured identically to Eq. 1) scores to understand the trade-off between these measures. We also compute R-Con, defined as the percentage of test examples for which the entailment relation predicted by the Oracle is the same as the given entailment relation.

Human evaluation is conducted on 4 aspects: (1) semantic similarity, which measures the closeness in meaning between the paraphrase and the input on a scale of 5 (Li et al. 2018); (2) diversity in expression, which measures whether different tokens or surface forms are used in the paraphrase with respect to the input, on a scale of 5 (Siddique, Oymak, and Hristidis 2020); (3) grammaticality, which measures whether the paraphrase is well-formed and comprehensible, on a scale of 5 (Li et al. 2018); and (4) relation consistency, which measures the % of examples for which the annotated entailment relation is the same as the input relation. Three annotations per sample are collected for similarity, diversity, and grammaticality using Amazon Mechanical Turk (AMT), and the authors (blinded to the identity of the model and following proper guidelines) manually annotate for relation consistency, as it is more technical and AMT annotators were unable to get the qualification questions correct. More details are in the Appendix.
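The following is a minimal sketch of two of the automatic measures above, under our own assumptions. R-Con follows the definition in the text; for iBLEU we use the Sun and Zhou (2012) formulation α·BLEU(output, reference) − (1−α)·BLEU(output, input), with α = 0.8 as an assumed value since the paper does not state it. The `bleu` and `oracle_relation` callables and the argument conventions are placeholders, not the authors' interfaces.

```python
def ibleu(bleu, output: str, reference: str, source: str, alpha: float = 0.8) -> float:
    """iBLEU: reward similarity to the reference, penalize similarity to the input."""
    return alpha * bleu(output, reference) - (1 - alpha) * bleu(output, source)

def relation_consistency(oracle_relation, examples) -> float:
    """R-Con: % of examples whose Oracle-predicted relation matches the requested one.
    Each example is a tuple (input, generated_paraphrase, requested_relation)."""
    matches = sum(1 for x, y_hat, rel in examples if oracle_relation(x, y_hat) == rel)
    return 100.0 * matches / max(len(examples), 1)
```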
4.3 Results and Analysis

To use paraphrasing models for downstream tasks, we need to ensure that the generated paraphrases conform to the specified entailment relation and are of good quality.

Automatic evaluation. We first evaluate the pre-trained generators on a held-out set from ParaBank containing 500 examples for each relation. Table 2 shows that the entailment-aware generator outperforms its unaware counterpart across all the measures. This boost is obtained with weak supervision for the entailment relation, demonstrating the good quality of the weak supervision.

Table 2: Evaluation of the generator pre-trained on ParaBank in the entailment-aware (✓) and unaware (✗) settings.

| Aware | BLEU | Div | iBLEU | R-Con |
|-------|-------|-------|-------|-------|
| ✗ | 32.54 | 46.57 | 17.78 | – |
| ✓ | 33.08 | 58.24 | 19.06 | 72.34 |

Next, we evaluate the ERAP variants against the comparison models (§4.1) on the recasted SICK test samples belonging to the ≡, ⊑, and ⊒ relations and report the results in Table 3.⁴ Entailment-aware (-A) variants outperform the corresponding unaware (-U) variants on iBLEU, while also outperforming the majority-class (i.e., ≡) copy-input baseline (except for S2S-A). Weakly-supervised pre-training helps boost performance in terms of iBLEU and R-Con, as evident from the higher scores of the Fine-tune-A(U) models over S2S-A(U). The poor performance of the S2S variants is due to the small dataset size and the much harder multi-objective task. Re-ranked outputs from Fine-tune-U achieve higher iBLEU and consistency scores than Pre-train-A, which is explicitly trained with weak supervision for the relation; however, this comes at the computational cost of sampling multiple⁵ (k=20) outputs. The improved performance of the Fine-tuned models over S2S indicates the importance of pre-training. Both ERAP variants achieve higher iBLEU and consistency than their lower-bounding (Pre-trained) models, but the outputs show less diversity in expression and make conservative lexical or syntactic changes. These results look encouraging until we notice Copy-input (last row), which achieves high BLEU and iBLEU, indicating that these metrics fail to punish copying the input verbatim (an observation consistent with Niu et al. (2020)).

⁴We report analogous results for ParaNMT in the Appendix, available at https://arxiv.org/pdf/2203.10483.pdf.
⁵We report results for k ∈ {1, 5, 10} in the Appendix.

Table 3: Automatic evaluation of ERAP against the comparison models described in §4.1. R-Con is measured only for models conditioned (R-T) on R at test time. Fine-tune models are upper bounds and Pre-train models are lower bounds; for ERAP-U, only the pre-training is done in the entailment-unaware setting. Boldface marks the best in each block, and a separate marker the best overall.

| Model | R-T | BLEU | Div | iBLEU | R-Con |
|-------|-----|-------|-------|-------|-------|
| Pre-train-U | ✗ | 14.92 | 76.73 | 7.53 | – |
| Pre-train-A | ✓ | 17.20 | 74.25 | 8.75 | 65.53 |
| S2S-U | ✗ | 30.93 | 59.88 | 17.62 | – |
| S2S-A | ✓ | 31.44 | 63.90 | 18.77 | 38.42 |
| RR-S2S-U | ✓ | 30.06 | 64.51 | 17.26 | 51.86 |
| RR-FT-U | ✓ | 41.44 | 53.67 | 23.96 | 66.85 |
| ERAP-U | ✓ | 19.37 | 69.70 | 9.43 | 66.89 |
| ERAP-A | ✓ | 28.20 | 59.35 | 14.43 | 68.61 |
| Fine-tune-U | ✗ | 41.62 | 51.42 | 23.79 | – |
| Fine-tune-A | ✓ | 45.21 | 51.60 | 26.73 | 70.24 |
| Copy-input | – | 51.42 | 0.00 | 21.14 | 45.98 |

Ablation analysis of each scorer. We demonstrate the effectiveness of each scorer in ERAP via an ablation study in Table 4.

Table 4: Ablation of scorers in ERAP. Con, Sim, and Div refer to the relation consistency, semantic similarity, and expression diversity scorers. Underlining in the original marks more copying of the input (Div column) and the presence of heuristics in the outputs (R-Con column) as compared to the gold references.

| Model | BLEU | Div | iBLEU | R-Con |
|-------|-------|-------|-------|-------|
| Gold-reference | – | 48.58 | – | 81.55 |
| Pre-train-A | 17.20 | 74.25 | 8.75 | 65.53 |
| +Con | 24.82 | 58.55 | 12.29 | 96.75 |
| +Con+Sim | 39.78 | 42.05 | 20.24 | 94.72 |
| +Con+Sim+Div | 21.68 | 68.41 | 11.29 | 93.60 |
| ERAP-A | 28.20 | 40.65 | 14.43 | 68.61 |

Figure 3: Qualitative outputs: (1) showing the effectiveness of the various scorers, (2) showing a heuristic learned in the absence of the hypothesis-only adversary, and (3) outputs from various models.
Using only the consistency scorer to reward the generated paraphrases, a significant improvement in consistency score is observed compared to Pre-train-A and the gold references. However, this high score may come at the cost of semantic similarity (e.g., example 1 in Figure 3), wherein the output conforms to the relation at the cost of losing much of the content. Adding the similarity scorer helps retain some of the content (higher BLEU and iBLEU) but results in copying (low diversity) from the input. Adding the diversity scorer helps introduce diversity in expression. However, the model is still prone to heuristics (e.g., losing most of the content from the input (example 1 in Figure 3), or adding irrelevant tokens such as "with mexico" or "desert" (example 2 in Figure 3)) to ensure a high consistency score. Introducing the adversary reduces the heuristics learned by the generator. Together, all the scorers help maintain a good balance for this multi-objective task.

Human evaluation. We report the human evaluation for 25 test outputs each from 8 models on 4 measures in Table 5. ERAP-A achieves the highest consistency while maintaining a good balance between similarity, diversity, and grammaticality. RR-S2S-U has the highest diversity, which comes at the cost of semantic similarity and grammaticality (e.g., example 3 in Figure 3). A strikingly different observation is the high similarity and low diversity of the Pre-trained variants, reinforcing the issues with existing automatic measures.

Table 5: Average scores across 3 annotators for Similarity (Sim, α=0.65), Diversity (Div, α=0.55), and Grammaticality (Gram, α=0.72), and the % of outputs with the correct specified relation for R-Con (α=0.70). Moderate to strong inter-rater reliability is observed with Krippendorff's α.

| Model | R-T | Sim | Div | Gram | R-Con |
|-------|-----|------|------|------|-------|
| Pre-train-U | ✗ | 4.60 | 2.62 | 4.73 | – |
| Pre-train-A | ✓ | 4.67 | 2.60 | 4.67 | 48.00 |
| RR-S2S-U | ✓ | 2.72 | 3.15 | 3.46 | 24.00 |
| RR-FT-U | ✓ | 3.05 | 2.89 | 4.27 | 28.00 |
| ERAP-U | ✓ | 3.98 | 2.85 | 4.10 | 40.00 |
| ERAP-A | ✓ | 3.95 | 2.68 | 4.42 | 64.00 |
| Fine-tune-U | ✗ | 3.87 | 3.10 | 4.83 | – |
| Fine-tune-A | ✓ | 3.80 | 3.04 | 4.68 | 48.00 |

5 Extrinsic Evaluation

The intrinsic evaluations show that ERAP produces quality paraphrases while adhering to the specified entailment relation. Next, we examine the utility of entailment-aware paraphrasing models over unaware models for a downstream application, namely paraphrastic data augmentation for the textual entailment task. Given two sentences, a premise p and a hypothesis h, the task of textual entailment is to determine if a human would infer that h is true from p. Prior work has shown that paraphrastic augmentation of textual entailment datasets improves performance (Hu et al. 2019a); however, these approaches make the simplifying assumption that entailment relations are preserved under paraphrase, which is not always the case (see Figure 1; 30% of ParaBank pairs were found to be semantically divergent using the Oracle). We use the SICK NLI dataset for this task because we have a paraphrasing system trained on a similar data distribution.⁶ We hypothesize that entailment-aware augmentations will result in fewer label violations, and thus overall improved performance on the textual entailment task. Moreover, explicit control over the entailment relation allows a greater variety of augmentations to be generated with entailment-aware models (an exhaustive list of label-preserving augmentations based on the entailment relation between a p (or h) and its paraphrase is presented in Table 6).

⁶Note that we retained the train, test, and development sets of the SICK NLI dataset in the recasted SICK dataset, and therefore the paraphrasing models have only seen the train set.

Table 6: Various augmentations for ⟨p, h⟩ with label E/NE (entails / does not entail), grouped by type (and corresponding projected labels) according to the entailment composition rules defined in MacCartney (2009). p′ (h′), pr (hr), and pf (hf) denote the ≡, ⊒, and ⊑ paraphrase of p (h), respectively.

| Type (Label) | Augmentation Pairs |
|--------------|--------------------|
| ≡ (E/NE) | ⟨p′, h⟩, ⟨p, h′⟩, ⟨p′, h′⟩ |
| ⊒ (E/NE) | ⟨pr, h⟩, ⟨pr, h′⟩ |
| ⊑ (E/U) | ⟨p, hf⟩, ⟨p′, hf⟩, ⟨pr, hf⟩ |
| Unknown (U/U) | ⟨pf, h⟩, ⟨pf, hf⟩, ⟨p, hr⟩, ⟨p′, hr⟩, ⟨pr, hr⟩, ⟨pf, hr⟩, ⟨pf, h′⟩ |
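As a minimal sketch, following the Table 6 groupings as reconstructed above (p′/h′: equivalent, pr/hr: reverse-entailing, pf/hf: forward-entailing paraphrases), the projected label for an augmented pair could be computed as follows; the function and label names are our assumptions, and pairs whose projected label is unknown are simply dropped.

```python
from typing import Optional

def project_label(label: str,
                  premise_rel: Optional[str],
                  hypothesis_rel: Optional[str]) -> Optional[str]:
    """label is 'E' (entails) or 'NE'; *_rel is the relation used to paraphrase that side
    ('equivalence', 'reverse', 'forward'), or None if that side is kept unchanged."""
    premise_ok = premise_rel in (None, "equivalence", "reverse")   # p, p', or pr
    hypothesis_ok = hypothesis_rel in (None, "equivalence")        # h or h'
    if premise_ok and hypothesis_ok:
        return label                              # (E/NE) rows of Table 6
    if premise_ok and hypothesis_rel == "forward":
        return label if label == "E" else None    # (E/U) row: NE is no longer reliable
    return None                                   # Unknown (U/U) rows

# toy usage: an entailed pair whose hypothesis is replaced by a forward-entailing paraphrase
print(project_label("E", None, "forward"))   # -> 'E'
print(project_label("NE", None, "forward"))  # -> None
```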
Paraphrastic Data Augmentation. We generate paraphrases for all premises p ∈ P and hypotheses h ∈ H present in the train set of SICK NLI using entailment-aware and unaware models. We obtain augmentation data by combining all the paraphrases (generated using entailment-aware models) with the original data and labeling them as per Table 6. Augmentation paraphrases generated from entailment-unaware models are (naïvely) assumed to hold the ≡ relation. RoBERTa-based binary classifiers are trained on the original dataset along with the paraphrastic augmentations to predict whether p entails h.

Susceptibility to Augmentation Artifacts. If paraphrastic augmentations introduce noisy training examples with incorrectly projected labels, this could lead to what we call augmentation artifacts in downstream models. We posit that paraphrastically augmented textual entailment (henceforth, PATE) models trained on entailment-aware augmentations will be less susceptible to such artifacts than models trained with entailment-unaware augmentations. To test this, we generate augmentations for the test set of SICK NLI and manually annotate 1253 augmented samples to obtain 218 incorrectly labeled examples. We evaluate PATE models on these examples (referred to as adversarial test examples).

Extrinsic Results. We report the accuracy of PATE models on the original SICK development and test sets as well as on the adversarial test examples in Table 7. As per our hypothesis, models trained with augmentations generated using entailment-aware models achieve improved accuracy on both the original and the adversarial test samples over those trained with entailment-unaware augmentations. The textual entailment model trained only on SICK NLI data performs the best on the adversarial test set, as expected, showing that although augmentation helps boost the performance of a model, it introduces augmentation artifacts during training.

Table 7: Accuracy of PATE models on the Original (O-) and Adversarial (A-) datasets. FT/ERAP refers to the Fine-tuned/proposed model used for generating augmentations; the type of augmentation used (as per Table 6) is given in parentheses. U/A denote the entailment-unaware (-aware) variant.

| Data | R-T | O-Dev | O-Test | A-Test |
|------|-----|-------|--------|--------|
| SICK NLI | – | 95.56 | 93.78 | 83.02 |
| +FT-U (≡) | ✗ | 95.15 | 93.68 | 69.72 |
| +FT-A (≡) | ✓ | 95.35 | 94.62 | 77.98 |
| +FT-A (≡, ⊒) | ✓ | 95.76 | 93.95 | 75.69 |
| +ERAP-A (≡) | ✓ | 95.15 | 94.58 | 78.44 |
| +ERAP-A (≡, ⊒) | ✓ | 95.15 | 93.86 | 69.72 |

6 Related Work

Paraphrase generation is a common NLP task with widespread applications. Earlier approaches are rule-based (Barzilay, McKeown, and Elhadad 1999; Ellsworth and Janin 2007) or data-driven (Madnani and Dorr 2010). Recent supervised deep learning approaches use LSTMs (Prakash et al. 2016), VAEs (Gupta et al. 2018), pointer-generator networks (See, Liu, and Manning 2017), and transformer-based (Li et al. 2019) sequence-to-sequence models. Li et al. (2018) use RL for supervised paraphrasing. Unsupervised paraphrasing is a challenging and emerging NLP task with limited efforts.
Bowman et al. (2015b) train a VAE to sample less controllable paraphrases. Others use Metropolis-Hastings sampling (Miao et al. 2019), simulated annealing (Liu et al. 2020), or dynamic blocking (Niu et al. 2020) to add constraints to the decoder at test time. Siddique, Oymak, and Hristidis (2020) use RL to maximize an expected reward based on adequacy, fluency, and diversity. Our RL-based approach draws inspiration from this work while introducing the oracle and the hypothesis-only adversary. Controllable text generation is a closely related field, with efforts to add lexical (Hu et al. 2019a; Garg et al. 2021) or syntactic control (Iyyer et al. 2018; Chen et al. 2019; Goyal and Durrett 2020) to improve the diversity of paraphrases. However, ours is the first work which introduces a semantic control for paraphrase generation. Style transfer is a related field that aims at transforming an input to adhere to a specified target attribute (e.g., sentiment, formality). RL has been used to explicitly reward the output for adhering to a target attribute (Gong et al. 2019; Sancheti et al. 2020; Luo et al. 2019; Liu, Neubig, and Wieting 2020; Goyal et al. 2021). Those target attributes are a function of the output only and are defined at a lexical level. In contrast, we consider a relation control which is a function of both the input and the output, and is defined at a semantic level.

7 Conclusion

We introduce a new task of entailment-relation-aware paraphrase generation and propose an RL-based weakly-supervised model (ERAP) that can be trained without a task-specific corpus. Additionally, an existing NLI corpus is recasted to curate a small annotated dataset for this task, and we provide performance bounds for it. A novel Oracle is proposed to obtain weak supervision for relation control for existing paraphrase corpora. ERAP is shown to generate paraphrases conforming to the specified relation while maintaining paraphrase quality. Intrinsic and extrinsic experiments demonstrate the utility of entailment-relation control, indicating a fruitful direction for future research.

References

Androutsopoulos, I.; and Malakasiotis, P. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38: 135-187.
Barzilay, R.; McKeown, K.; and Elhadad, M. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 550-557.
Bhagat, R.; and Hovy, E. 2013. What is a paraphrase? Computational Linguistics, 463-472.
Bhagat, R.; Pantel, P.; and Hovy, E. 2007. LEDIR: An unsupervised algorithm for learning directionality of inference rules. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015a. A large annotated corpus for learning natural language inference. In Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, 632-642. Association for Computational Linguistics (ACL).
Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015b. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
Carpuat, M.; Vyas, Y.; and Niu, X. 2017. Detecting cross-lingual semantic divergence for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, 69-79.
Chen, M.; Tang, Q.; Wiseman, S.; and Gimpel, K. 2019. Controllable Paraphrase Generation with a Syntactic Exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5972-5984.
Ellsworth, M.; and Janin, A. 2007. Mutaphrase: Paraphrasing with FrameNet. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 143-150.
Ganitkevitch, J.; Van Durme, B.; and Callison-Burch, C. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 758-764.
Garg, S.; Prabhu, S.; Misra, H.; and Srinivasaraghavan, G. 2021. Unsupervised contextual paraphrase generation using lexical control and reinforcement learning. arXiv preprint arXiv:2103.12777.
Glockner, M.; Shwartz, V.; and Goldberg, Y. 2018. Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 650-655.
Gong, H.; Bhat, S.; Wu, L.; Xiong, J.; and Hwu, W.-M. 2019. Reinforcement Learning Based Text Style Transfer without Parallel Training Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
Goyal, N.; Srinivasan, B. V.; Anandhavelu, N.; and Sancheti, A. 2021. Multi-Style Transfer with Discriminative Feedback on Disjoint Corpus. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3500-3510.
Goyal, T.; and Durrett, G. 2020. Neural Syntactic Preordering for Controlled Paraphrase Generation. In Annual Meeting of the Association for Computational Linguistics.
Gupta, A.; Agarwal, A.; Singh, P.; and Rai, P. 2018. A deep generative framework for paraphrase generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Hu, J. E.; Khayrallah, H.; Culkin, R.; Xia, P.; Chen, T.; Post, M.; and Van Durme, B. 2019a. Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 839-850.
Hu, J. E.; Rudinger, R.; Post, M.; and Van Durme, B. 2019b. ParaBank: Monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6521-6528.
Iyyer, M.; Wieting, J.; Gimpel, K.; and Zettlemoyer, L. 2018. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1875-1885.
Kobus, C.; Crego, J. M.; and Senellart, J. 2017. Domain Control for Neural Machine Translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 372-378.
Kusner, M.; Sun, Y.; Kolkin, N.; and Weinberger, K. 2015. From word embeddings to document distances. In International Conference on Machine Learning, 957-966. PMLR.
Li, Z.; Jiang, X.; Shang, L.; and Li, H. 2018. Paraphrase Generation with Deep Reinforcement Learning. In EMNLP.
Li, Z.; Jiang, X.; Shang, L.; and Liu, Q. 2019. Decomposable neural paraphrase generation. arXiv preprint arXiv:1906.09741.
Liu, X.; Mou, L.; Meng, F.; Zhou, H.; Zhou, J.; and Song, S. 2020. Unsupervised Paraphrasing by Simulated Annealing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 302-312.
Liu, Y.; Neubig, G.; and Wieting, J. 2020. On Learning Text Style Transfer with Direct Rewards. arXiv preprint arXiv:2010.12771.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Luo, F.; Li, P.; Zhou, J.; Yang, P.; Chang, B.; Sui, Z.; and Sun, X. 2019. A dual reinforcement learning framework for unsupervised text style transfer. arXiv preprint arXiv:1905.10060.
MacCartney, B. 2009. Natural language inference. Stanford University.
Madnani, N.; and Dorr, B. J. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3): 341-387.
Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R.; et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, 216-223. Reykjavik.
McCoy, T.; Pavlick, E.; and Linzen, T. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3428-3448.
Miao, N.; Zhou, H.; Mou, L.; Yan, R.; and Li, L. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6834-6842.
Min, J.; McCoy, R. T.; Das, D.; Pitler, E.; and Linzen, T. 2020. Syntactic Data Augmentation Increases Robustness to Inference Heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2339-2352.
Niu, T.; Yavuz, S.; Zhou, Y.; Wang, H.; Keskar, N. S.; and Xiong, C. 2020. Unsupervised paraphrase generation via dynamic blocking. arXiv preprint arXiv:2010.12885.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318.
Pavlick, E.; Bos, J.; Nissim, M.; Beller, C.; Van Durme, B.; and Callison-Burch, C. 2015. Adding semantics to data-driven paraphrasing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1512-1522.
Pham, M. Q.; Crego, J. M.; Senellart, J.; and Yvon, F. 2018. Fixing translation divergences in parallel corpora for neural MT. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2967-2973.
Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; and Van Durme, B. 2018. Hypothesis Only Baselines in Natural Language Inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics.
Prakash, A.; Hasan, S. A.; Lee, K.; Datla, V.; Qadir, A.; Liu, J.; and Farri, O. 2016. Neural paraphrase generation with stacked residual LSTM networks. arXiv preprint arXiv:1610.03098.
Sancheti, A.; Krishna, K.; Srinivasan, B. V.; and Natarajan, A. 2020. Reinforced rewards framework for text style transfer. Advances in Information Retrieval, 12035: 545.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
Siddique, A.; Oymak, S.; and Hristidis, V. 2020. Unsupervised paraphrasing via deep reinforcement learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1800-1809.
Sun, H.; and Zhou, M. 2012. Joint learning of a dual SMT system for paraphrase generation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 38-42.
Valencia, V. M. S. 1991. Studies on natural logic and categorial grammar. Universiteit van Amsterdam.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.
Wieting, J.; and Gimpel, K. 2018. ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 451-462.
Williams, A.; Nangia, N.; and Bowman, S. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112-1122.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3): 229-256.
Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Zhao, W.; Peyrard, M.; Liu, F.; Gao, Y.; Meyer, C. M.; and Eger, S. 2019. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 563-578.