# Copy Is All You Need

Published as a conference paper at ICLR 2023

Tian Lan, Deng Cai, Yan Wang, Heyan Huang, Xian-Ling Mao
Tencent AI Lab
School of Computer Science and Technology, Beijing Institute of Technology
{lantiangmftby,thisisjcykcd,yanwang.branden}@gmail.com, {hhy63,maoxl}@bit.edu.cn
Contributed equally. Corresponding authors.

## ABSTRACT

The dominant text generation models compose the output by sequentially selecting words from a fixed vocabulary. In this paper, we formulate text generation as progressively copying text segments (e.g., words or phrases) from an existing text collection. We compute contextualized representations of meaningful text segments and index them using efficient vector search toolkits. The task of text generation is then decomposed into a series of copy-and-paste operations: at each time step, we seek suitable text spans from the text collection rather than selecting from a standalone vocabulary. Experiments on the standard language modeling benchmark (WikiText-103) show that our approach achieves better generation quality according to both automatic and human evaluations. Besides, its inference efficiency is comparable to that of token-level autoregressive models thanks to the reduction of decoding steps. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific text collection without extra training. Finally, we observe that our approach attains additional performance gains by simply scaling up to larger text collections, again without further training.

Our source code is publicly available at https://github.com/gmftbyGMFTBY/Copyisallyouneed.

## 1 INTRODUCTION

Most neural language models (LMs) process text generation tasks by making a series of next-token predictions in an autoregressive manner (Radford et al., 2019; Dai et al., 2019; Khandelwal et al., 2020; Shi et al., 2022). Specifically, LMs generate a next-token distribution over a fixed vocabulary for any given prefix. The next token is then selected by a chosen decoding method, such as greedy search or nucleus sampling (Holtzman et al., 2020). This process continues until some stop condition is reached: for example, a special end-of-generation token is emitted, or the generated text reaches the maximum length limit.

Unlike traditional neural language models, we reformulate text generation as copying text segments from existing text collections. The text segments can be of variable lengths, including single words and multi-word phrases. For clarity, we use the term phrase to refer to any contiguous text segment; a single word can also be seen as a phrase of length 1. We compute a contextualized vector representation for each phrase and pack them into an offline index. At each decoding step, a suitable phrase is retrieved from the offline index and appended to the current prefix. In other words, the next-token predictions in traditional neural language models are replaced by a series of copy-and-paste operations.
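To make the copy-and-paste view concrete, below is a minimal sketch of the two stages it implies: an offline stage that embeds phrases from a text collection into a vector index, and an online stage that repeatedly retrieves the best-matching phrase for the current prefix and appends it. This is an illustration, not the paper's actual architecture: FAISS stands in for "efficient vector search toolkits", and `encode` is a hypothetical placeholder for the trained, context-aware encoders.

```python
import numpy as np
import faiss  # an efficient vector-search toolkit

DIM = 128

def encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a trained encoder: maps text to a unit-norm vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(DIM).astype("float32")
    return v / np.linalg.norm(v)

# Offline: enumerate phrases from the text collection and index their representations.
phrases = ["the quick brown fox", "jumps over", "the lazy dog", "."]
index = faiss.IndexFlatIP(DIM)                    # inner product = cosine on unit vectors
index.add(np.stack([encode(p) for p in phrases]))

# Online: each decoding step is a copy (retrieve) followed by a paste (append).
prefix = "A short fable:"
for _ in range(3):
    query = encode(prefix).reshape(1, -1)         # contextualized prefix representation
    _, ids = index.search(query, 1)               # copy: nearest phrase in the index
    prefix += " " + phrases[ids[0][0]]            # paste: extend the prefix with the phrase
print(prefix)
```

In the actual system, the toy `encode` would be replaced by trained encoders and the index would cover phrases from the entire text collection, but the retrieve-then-append loop is the same idea.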
Our proposed model, named COG (short for COPY-GENERATOR), enjoys the following advantages. First, our method selects phrases in specific contexts rather than standalone tokens in a fixed vocabulary, which potentially allows for more accurate candidate representation and selection. Second, our method allows training-free adaptation to new knowledge sources, because the text collection can be updated in a plug-and-play fashion; this benefits application scenarios such as domain adaptation and data expansion/filtering. Third, our method allows a sequence of multiple tokens (i.e., a multi-word phrase) to be generated in a single step, which reduces the total number of decoding steps and leads to improved inference efficiency.

We conduct extensive experiments to verify the effectiveness of our proposed COG. On the standard language modeling benchmark (WikiText-103), COG substantially outperforms standard baselines on automatic metrics (26.14 vs. 23.43 MAUVE (Pillutla et al., 2021)) and human evaluation (48% vs. 28% human preference). Moreover, when we directly switch the text collection from the WikiText-103 corpus to a domain-specific corpus, Law-MT (Koehn & Knowles, 2017), COG outperforms strong baselines in this domain adaptation setting (28.14 vs. 26.85 MAUVE and 52% vs. 36% human preference) without any domain-specific training. Furthermore, when we scale up the text collection of COG to a larger one, the En-Wiki dataset, we obtain an additional gain (26.97 vs. 23.43 MAUVE), again without any further training.

Our contributions can be summarized as follows:

- We propose COG, a method that reformulates text generation tasks as a series of copy-and-paste operations from existing text collections.
- We show that COG can outperform standard neural language model baselines on existing language modeling benchmarks.
- We demonstrate that COG allows for training-free adaptation to larger text collections and domain-specific text collections.

## 2 BACKGROUND: NEURAL TEXT GENERATION

Neural text generation can be divided into two categories: (1) unconditional text generation and (2) conditional text generation. Unconditional text generation (or language modeling) aims to generate a coherent text continuation given a prefix. In this case, language models perform generation using a density estimate over sequences, $p_\theta(x)$. Conditional text generation aims to generate text under some condition $c$ and instead estimates the conditional probability $p_\theta(x \mid c)$. Its typical applications include machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and summarization (See et al., 2017). Throughout this paper, our discussion focuses on unconditional text generation; however, our approach can be readily adapted to conditional text generation as well.

The canonical approach to language modeling factors the generation process in an autoregressive, left-to-right manner:

$$
p_\theta(x_{0:n}) = \prod_{i=1}^{n} p(x_i \mid x_{<i}),
$$

where $x_{<i}$ denotes the prefix $x_0, \ldots, x_{i-1}$.
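As a concrete illustration of this factorization (not part of the paper), the sketch below scores a sentence with an off-the-shelf token-level LM by summing the conditional log-probabilities from the product above. The choice of GPT-2 and the Hugging Face transformers API are assumptions of this example, not prescribed by the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tokenizer(text, return_tensors="pt").input_ids            # x_0, ..., x_n
with torch.no_grad():
    logits = model(ids).logits                                  # one next-token distribution per position
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)           # log p(. | x_{<i}) for each position
token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
seq_lp = token_lp.sum().item()                                  # log prod_{i>=1} p(x_i | x_{<i})
print("sequence log-probability:", seq_lp)
```

Each decoding step of such a model samples one token from the fixed vocabulary; COG instead replaces this per-token selection with retrieval of variable-length phrases, as described in the introduction.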