# Unsupervised Text Generation by Learning from Search

Jingjing Li1, Zichao Li2, Lili Mou3, Xin Jiang2, Michael R. Lyu1, Irwin King1
1The Chinese University of Hong Kong  2Huawei Noah's Ark Lab  3University of Alberta; Alberta Machine Intelligence Institute (Amii)
{lijj,lyu,king}@cse.cuhk.edu.hk  {li.zichao,jiang.xin}@huawei.com  doublepower.mou@gmail.com

Abstract

In this work, we propose TGLS, a novel framework for unsupervised Text Generation by Learning from Search. We start by applying a strong search algorithm (in particular, simulated annealing) towards a heuristically defined objective that (roughly) estimates the quality of sentences. Then, a conditional generative model learns from the search results, and meanwhile smooths out the noise of search. The alternation between search and learning can be repeated for performance bootstrapping. We demonstrate the effectiveness of TGLS on two real-world natural language generation tasks, unsupervised paraphrasing and text formalization. Our model significantly outperforms unsupervised baseline methods on both tasks. In particular, it achieves performance comparable to strong supervised methods for paraphrase generation.1

1Code is available at https://github.com/jingjingli01/TGLS

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

1 Introduction

Text generation refers to a wide range of tasks involving generating natural language, including machine translation [19, 20, 18], sentence simplification [29, 41], and text summarization [5, 1]. The recent success of neural text generation relies heavily on large parallel data for training, which may not be available in real-world natural language processing (NLP) applications. In this work, we consider unsupervised text generation, where no parallel data is available. This setting is more challenging, and has significant potential in both scientific research (e.g., low-resource language processing) and industrial applications (e.g., cold start for a new NLP application).

Early work tackles unsupervised text generation by rules or templates [47, 27]. While such approaches do not require parallel corpora, the generated sentences are highly subject to the rules, and hence lack the flexibility of natural language. Other work constructs pseudo-parallel data, which is only feasible for certain tasks like unsupervised machine translation [19].

Recently, researchers have developed search-based techniques for unsupervised text generation [28, 17, 37, 23], where a heuristically defined scoring function evaluates the quality of a sentence, involving language fluency, semantic compliance, and other task-specific aspects. The algorithm then performs word-level edits (such as word deletion, insertion, and replacement) to search towards a (possibly local) optimum of the scoring function. With a reasonably designed scoring function, such approaches are shown to be effective in a variety of applications like paraphrase generation [28, 23], sentence summarization [37], and text simplification [17].

However, the above search-based approach has two major drawbacks: 1) The inference efficiency is low. To obtain an output sentence, the search algorithm performs a few hundred steps of local edits and re-evaluations. This is considerably slower than an autoregressive decoder, which generates words sequentially.
2) The search could yield noisy results, since the scoring function is defined heuristically and the search is conducted locally in a discrete sentence space.

To this end, we propose a new framework for unsupervised Text Generation by Learning from Search (TGLS), which contains a strong search module that explores the sentence space, as well as a learning module that learns from the search results. For the search module, we adopt the simulated annealing (SA) algorithm. At each step, SA proposes a local edit by a neural network, and then either accepts or rejects the proposal based on a heuristically defined scoring function. For learning, we employ two methods to train a conditional generative model: word-level cross-entropy (CE) loss and sequence-level max-margin (MM) loss. Within TGLS, search and learning boost each other in an iterative fashion: the search results serve as pseudo-references for training the conditional generator, which in turn benefits SA search by providing a more meaningful initial state.

As for implementation, TGLS involves two pretrained language models: 1) the uni-directional GPT2 [33], which is suitable for likelihood-based fluency evaluation and conditional generation; and 2) the bi-directional RoBERTa [24], which is better at semantic evaluation and contextual word-level prediction.

The main contributions of our paper include: 1) We propose TGLS, a principled framework for unsupervised text generation; TGLS can be applied to different tasks as long as the output resembles the input and its quality can be roughly estimated by a heuristically defined scoring function. 2) We successfully incorporate large-scale pretrained language models into our TGLS framework. 3) We conduct experiments on two different tasks: paraphrasing and text formalization. In both experiments, TGLS significantly outperforms unsupervised baseline methods. Moreover, TGLS achieves performance comparable to recent supervised models [7] on the paraphrasing task. 4) For text formalization (an example of text style transfer), we are also the first to design a search-based method, and we further extend it into the proposed TGLS framework.

2 Approach

Our TGLS framework involves two stages of search and learning. In the first stage, we perform simulated annealing (SA) search [23] and treat the obtained output sentences as pseudo-references. Then, we train an autoregressive GPT2 [33] as the text generator by word-level cross-entropy (CE) supervised learning, which enables our model to learn quickly. In the second stage, the GPT2 performs beam search, and its output is taken as the initial state of the SA algorithm again for iterative performance improvement. Then, we perform max-margin (MM) learning to better distinguish between higher-scored sentences and other high-probability but sub-optimal sentences. Figure 1 provides an overview of the two stages of search and learning in TGLS.

2.1 Simulated Annealing Search

Search-based text generation [28, 23] relies on a heuristically defined objective function $s(\mathbf{y}|\mathbf{x})$ that (roughly) evaluates the quality of an output sequence $\mathbf{y}$ given the input $\mathbf{x}$ (usually, one or a few sentences). Typically, the objective involves language fluency $s_\text{lm}(\mathbf{y})$, semantic compliance $s_\text{semantic}(\mathbf{x}, \mathbf{y})$, and other task-specific scorers $s_\text{task}(\mathbf{y}, \cdot)$. These individual scorers are combined by the product of experts [13]:

$$s(\mathbf{y}|\mathbf{x}) = s_\text{lm}(\mathbf{y}) \cdot s_\text{semantic}(\mathbf{x}, \mathbf{y}) \cdot s_\text{task}(\mathbf{y}, \cdot). \quad (1)$$
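For concreteness, the short Python sketch below shows one way the product of experts in Eq. (1) could be computed; the placeholder scorers, their toy return values, and the log-space combination are illustrative assumptions, not the authors' implementation (in TGLS, each scorer additionally carries a weighting exponent, introduced below).

```python
import math

def product_of_experts(y, x, scorers):
    """Combine individual scorers multiplicatively, as in Eq. (1).
    Summing logs avoids numerical underflow for long sentences."""
    log_s = sum(math.log(max(s(y, x), 1e-12)) for s in scorers)
    return math.exp(log_s)

# Placeholder scorers standing in for s_lm, s_semantic, and s_task;
# in TGLS these would query GPT2, RoBERTa, and a task-specific metric.
def fluency_score(y, x):
    return 0.5

def semantic_score(y, x):
    return 0.8

def task_score(y, x):
    return 0.6

s = product_of_experts("a candidate sentence", "the input sentence",
                       [fluency_score, semantic_score, task_score])
print(f"s(y|x) = {s:.3f}")   # 0.5 * 0.8 * 0.6 = 0.240
```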
We adopt simulated annealing (SA) [16, 23], which performs local stochastic search to maximize this objective. Concretely, SA starts from an initial candidate output sentence $\mathbf{y}^{(0)}$, which is set to the input $\mathbf{x}$ in our first-stage SA; in the second stage, it is the output of our GPT2 model. At a search step $t$, SA proposes a new candidate $\mathbf{y}'$ by a local edit of $\mathbf{y}^{(t)}$, namely, word insertion, deletion, or replacement. The proposal $\mathbf{y}'$ is accepted with probability

$$p(\text{accept} \mid \mathbf{y}', \mathbf{y}^{(t)}, \mathbf{x}, T) = \min\left\{1, \exp\left(\frac{s(\mathbf{y}'|\mathbf{x}) - s(\mathbf{y}^{(t)}|\mathbf{x})}{T}\right)\right\}.$$

Then, $\mathbf{y}^{(t+1)} = \mathbf{y}'$ if $\mathbf{y}'$ is accepted, or otherwise $\mathbf{y}^{(t+1)} = \mathbf{y}^{(t)}$. In SA, $T$ is a temperature controlling how greedy the search algorithm is. Usually, $T$ is high at the beginning of search so as to be more explorative, and is then cooled down to achieve a better (local) optimum. Although we follow the generic SA framework of text generation as in [23], the objective function and the proposal are largely redesigned, as detailed below.

[Figure 1: Overview of TGLS. (a) First-stage search by simulated annealing (SA). (b) First-stage learning by cross-entropy (CE) loss. (c) Second-stage search by SA. (d) Second-stage learning by max-margin (MM) loss. The horizontal axis represents the sentence space.]

Fluency scorer ($s_\text{lm}$). The fluency of a sentence can oftentimes be approximated by a language model's predicted probability. Previous search-based work uses recurrent neural networks for fluency evaluation [28, 23]. In our work, we use the large-scale pretrained GPT2 model [33]. For an output $\mathbf{y} = y_1 \cdots y_n$, the language fluency scorer is the joint likelihood of $\mathbf{y}$, given by $s_\text{lm}(\mathbf{y}) = \left(\prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})\right)^{\alpha}$, where $\alpha$ is a hyperparameter balancing $s_\text{lm}$ against the other scorers in (1). In fact, we use the vocabulary of GPT2 with byte-pair encoding (BPE), and $y_i$ here is a token after BPE segmentation. Our GPT2 is fine-tuned on non-parallel in-domain corpora to learn the specificity of a task.

Semantic scorer ($s_\text{semantic}$). In this part, we extend the semantic scorers of [23] with RoBERTa [24]. Fine-tuning details are presented in Appendix A. Compared with the autoregressive GPT2 used for fluency evaluation, RoBERTa is pretrained by masked language modeling and is better at feature representation. Let $\mathbf{x} = (x_1, \ldots, x_m)$ be a sentence. RoBERTa computes a contextualized representation of a word in the sentence as $\text{RoBERTa}(x_i, \mathbf{x})$. A word-level semantic scorer evaluates how much keyword information (detected by RAKE [36]) is preserved, given by the least matched keyword of $\mathbf{x}$:

$$s_\text{word}(\mathbf{y}, \mathbf{x}) = \min_{k \in \text{keywords}(\mathbf{x})} \max_{y_i \in \mathbf{y}} \text{RoBERTa}(k, \mathbf{x})^\top \text{RoBERTa}(y_i, \mathbf{y}). \quad (2)$$

A sentence-level semantic scorer evaluates the cosine similarity of two sentence vectors, $s_\text{sent}(\mathbf{y}, \mathbf{x}) = \frac{\bar{\mathbf{y}}^\top \bar{\mathbf{x}}}{\|\bar{\mathbf{y}}\| \, \|\bar{\mathbf{x}}\|}$, where the sentence vector is given by the RoBERTa feature of the [BOS] token padded at the beginning of a sentence, i.e., $\bar{\mathbf{x}} = \text{RoBERTa}(\text{[BOS]}, \mathbf{x})$, and $\bar{\mathbf{y}}$ is computed analogously. Finally, the semantic scorer is the product of the word- and sentence-level scores:

$$s_\text{semantic}(\mathbf{y}, \mathbf{x}) = s_\text{word}(\mathbf{y}, \mathbf{x})^{\beta} \cdot s_\text{sent}(\mathbf{y}, \mathbf{x})^{\gamma}, \quad (3)$$

where $\beta$ and $\gamma$ are weighting hyperparameters.

Task-specific scorers. We apply TGLS to two tasks: paraphrasing and text formalization. For paraphrasing, the goal is to generate a semantically similar but lexically different sentence. Previous work [23] uses the BLEU score to penalize the n-gram overlap between the output and the input: $s_\text{paraphrase}(\mathbf{y}, \mathbf{x}) = (1 - \text{BLEU}(\mathbf{y}, \mathbf{x}))^{\delta}$, which is also adopted in our work. Here, $\delta$ is a weighting hyperparameter for the task-specific scorer.

For text formalization, the goal is to transform an informal sentence into the formal style [35], which is an example of text style transfer. We follow the setting of most text style-transfer work [14], where we assume the style labels are available, but no parallel supervision is given. We train a classifier that predicts the probability of the style, also based on the RoBERTa features. Then, the task-specific scorer becomes $s_\text{formality}(\mathbf{y}) = p(\text{formal} \mid \text{RoBERTa}(\text{[BOS]}, \mathbf{y}))^{\delta}$, where $\delta$ is the weighting hyperparameter for this task.
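To tie the pieces together, here is a minimal sketch of the annealed accept/reject loop described above. The scorer and proposal functions are placeholders (the actual edit operators, Replace, Insert, and Delete, are described next), and the step count, initial temperature, and cooling schedule are illustrative assumptions rather than the paper's settings.

```python
import math
import random

def simulated_annealing(x, score, propose, steps=200, t_init=1.0, cooling=0.97):
    """Sketch of the SA loop: propose a local edit, then accept it with
    probability min(1, exp((s(y') - s(y)) / T)); T is gradually cooled."""
    y = list(x)              # first-stage SA starts from the input sentence
    t = t_init
    for _ in range(steps):
        y_new = propose(y)
        delta = score(y_new, x) - score(y, x)
        if delta >= 0 or random.random() < math.exp(delta / t):
            y = y_new        # accept the proposal
        t *= cooling         # cool down: the search becomes greedier over time
    return y

# Toy usage: the "objective" prefers shorter outputs and the proposal deletes
# a random word; real TGLS uses s(y|x) of Eq. (1) and RoBERTa-based edits.
def toy_score(y, x):
    return -len(y)

def toy_propose(y):
    if len(y) <= 1:
        return y
    i = random.randrange(len(y))
    return y[:i] + y[i + 1:]

x = "this toy example will gradually be shortened by the search".split()
print(" ".join(simulated_annealing(x, toy_score, toy_propose)))
```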
Proposal of local edits. At a step $t$ of SA search, a new candidate $\mathbf{y}'$ is proposed from $\mathbf{y}^{(t)}$ by local editing. SA randomly picks a position to edit, as well as one of the following operators: Replace, Insert, and Delete. For Replace, the model suggests a candidate word at the chosen slot based on the posterior distribution induced by $s(\mathbf{y}|\mathbf{x})$. For efficiency, previous work [28, 23] evaluates only the top-$K$ candidate words, suggested by a forward and a backward language model. In our work, we adopt RoBERTa to evaluate the posterior probability of a word, where the word embedding of RoBERTa at this slot is masked. The Insert edit also suggests a word from the posterior, predicting a word given the newly added [MASK] token and the context. This complies with RoBERTa's pretraining criterion of masked language modeling and is able to suggest high-quality candidate words. The Delete operator simply removes the word at the chosen position.

In text formalization, we also have rule-based local edits (e.g., "we are" substituting "we're"), which are retrieved from PPDB [30]. Previous sequence-to-sequence approaches on this task adopt manually designed rules as a preprocessing step [35] or as additional input concatenated with the raw sentence [46]. Our unsupervised TGLS, on the other hand, can easily make use of such off-the-shelf resources, and can easily filter out the noise by rejecting bad candidates.

In short, the SA search component in our TGLS mainly follows [23], but we redesign the scoring functions and the proposals. The main focus of this paper is to couple search and learning, especially the methods of training a machine learning model that learns from the search results, as follows.

2.2 Word-Level Cross-Entropy (CE) Learning

As mentioned in Section 1, the local search algorithm is computationally inefficient at inference time, because it requires a few hundred steps of edits and re-evaluations for each sample. Our intuition is to train a conditional generative model, GPT2, based on SA's search results. Specifically, we concatenate an input $\mathbf{x}$ and SA's searched sequence $\mathbf{y}^{(\text{SA})}$ with a special separating token [SEP] in between, and train GPT2 with losses on the $\mathbf{y}$-part. Therefore, the GPT2 is able to generate an output sequence directly from $p(\mathbf{y}|\mathbf{x})$ in an autoregressive way.

Given a source sequence $\mathbf{x}$, the objective is the word-by-word cross-entropy (CE) loss, given by

$$J_\text{CE} = -\sum_{i=1}^{N} \sum_{v} y^{(\text{SA})}_{i,v} \log p^{(\text{GPT2})}\!\left(y_i = v \mid \mathbf{y}^{(\text{SA})}_{<i}, \mathbf{x}\right),$$

where $y^{(\text{SA})}_{i,v}$ is a binary value indicating whether the $i$th word is $v$ or not in SA's output for this data sample, $N$ is the length of $\mathbf{y}^{(\text{SA})}$, and $p^{(\text{GPT2})}(y_i = v \mid \mathbf{y}^{(\text{SA})}_{<i}, \mathbf{x})$ is predicted by the GPT2.

The word-level CE learning in TGLS adopts the standard teacher-forcing technique with SA's output being the pseudo-reference, i.e., during training, the GPT2 model learns the probability at step $i$ assuming all previous words are correctly given as $\mathbf{y}^{(\text{SA})}_{<i}$.
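As an illustration of this word-level CE step, the sketch below fine-tunes GPT2 on a single (input, pseudo-reference) pair using the Hugging Face transformers library; the library choice, the toy sentence pair, and the hyperparameters are assumptions for illustration, not the authors' code. The key point is that the input and SA's output are concatenated with a [SEP] token, and the loss is masked so that only the $\mathbf{y}$-part is supervised.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"sep_token": "[SEP]"})   # GPT2 has no [SEP] by default
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Toy (input, pseudo-reference) pair; in TGLS, y_sa comes from the SA search.
x = "how can i improve my english speaking skills"
y_sa = "what can i do to speak english more fluently"

enc = tokenizer(x + " [SEP] " + y_sa, return_tensors="pt")
input_ids = enc["input_ids"]

# Teacher forcing with the pseudo-reference: positions labelled -100 are
# ignored by the loss, so only the y-part (after [SEP]) contributes.
labels = input_ids.clone()
sep_pos = (input_ids[0] == tokenizer.sep_token_id).nonzero()[0].item()
labels[0, : sep_pos + 1] = -100

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
print(f"word-level CE loss: {loss.item():.3f}")
```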