# NUGGET: Neural Agglomerative Embeddings of Text

Guanghui Qin¹, Benjamin Van Durme¹

¹Department of Computer Science, Johns Hopkins University, USA. Correspondence to: Guanghui Qin.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text can vary. We propose a solution called NUGGET, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate that NUGGET outperforms related approaches in tasks involving semantic comparison. Finally, we illustrate that these compact units allow for expanding the contextual window of a language model (LM), suggesting new future LMs that can condition on larger amounts of content.

## 1. Introduction

> You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!

Embedding language into dense representations is a central pursuit in modern Natural Language Processing and Machine Learning. Recent work on text encoding has largely focused on fixed-dimensional representations that use either one or a constant number of vectors, e.g., DAN (Iyyer et al., 2015), DPR (Karpukhin et al., 2020), or TSDAE (Wang et al., 2021). At the other extreme, COLBERT (Khattab & Zaharia, 2020) represents and indexes content by storing the final BERT (Devlin et al., 2019) layer encoding of nearly every input token. Unfortunately, a fixed-dimensional representation risks not scaling to long texts, while a solution like COLBERT comes at significant cost. We propose that a flexible balance can be found, leading to a semantically useful level of granularity (Rudinger et al., 2017).

Figure 1. Three approaches to embedding text. Token-level models map each token to a vector, while passage-level models map the whole passage into a single vector. NUGGET generates a dynamic number of vectors, where each nugget encodes a segment of text.

Our solution, NUGGET, is an encoding strategy employing hard attention to map linguistic input into a fractional number of dynamically selected embeddings called nuggets. As the nugget selection process is non-differentiable, we build a residual connection between the selector and the decoder to allow gradient propagation, enabling the model to be trained end-to-end via tasks such as autoencoding or machine translation. This approach allows the number of vectors to grow with input length, trading performance against memory via a configurable compression ratio. NUGGET leads to an intrinsically interesting representation, where the encoder learns to favor clausal text delimiters, such as punctuation and conjunction words. Moreover, without any explicit guidance during training, each resultant nugget encodes a contiguous segment of text preceding these clausal delimiters, as illustrated in fig. 1. We demonstrate that, extrinsically, these nuggets outperform prior unsupervised approaches in experiments on document-level paraphrase selection and related passage retrieval.
Finally, through an experiment on language modeling we show that NUGGET can provide context information to other models in an efficient way. Looking ahead, we believe fractional representation strategies like NUGGET will allow for exciting new developments in large language models (LLMs). As nuggets support highly accurate reconstruction, they hold promise as a compressed unit of language that could enable scaling LLMs to condition on significantly longer textual inputs.

## 2. Background

**Token-level embeddings** are commonly used in NLP. To map tokens to individual vectors, Pennington et al. (2014) use the word co-occurrence matrix as features, while Mikolov et al. (2013) map words to vectors by training a model to reconstruct the context. Instead of static mappings, encoders such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and BART (Lewis et al., 2020) generate contextualized token embeddings.

**Unsupervised methods for passage embedding** Early related work modeled passages as topic distributions (Landauer et al., 1998; Blei et al., 2003). With neural networks, researchers map a sentence into one or a fixed number of vectors. Some derive a sentence representation from a pretrained encoder without fine-tuning (Wang & Kuo, 2020; Li et al., 2020). Others treat it as an unsupervised learning task. Kiros et al. (2015) train a sentence encoder by predicting the surrounding sentences. Bowman et al. (2016), Wang et al. (2021), and Mahabadi et al. (2021) explore autoencoding to map sentences into single vectors. With a contrastive objective, Carlsson et al. (2021) learn to produce similar representations of the same sentence with two independent encoders, while SimCSE (Gao et al., 2021) uses different dropout masks on the same encoder. Giorgi et al. (2021) is similar but relies on document structure to identify positive sentence pairs. Recently, Li et al. (2022) propose to model texts by denoising a sequence of Gaussian vectors, leading to better controllability.

**Supervised methods for passage embedding** To construct datasets for general-purpose sentence encoders, it is common to extract sentence pairs from datasets such as natural language inference and question answering (Conneau et al., 2017). SBERT (Reimers & Gurevych, 2019) fine-tunes the BERT model (Devlin et al., 2019) and uses mean pooling over the token embeddings as the sentence encoding. In dense information retrieval, documents are mapped into vectors to measure their similarity. Some models simply reuse the token-level encodings: Khattab & Zaharia (2020) use all token embeddings as the index of the document, while Karpukhin et al. (2020) reuse only the embedding of the CLS token. Gao & Callan (2021) and Oğuz et al. (2022) show that continual training can produce information-rich CLS representations. The methods mentioned above use a single vector or all tokens as the representation. Tan et al. (2022) increase the number of vectors by introducing pseudo sentences, while Zhang et al. (2022) append "view" pseudo tokens to the BERT (Devlin et al., 2019) self-attention; both produce a fixed number of vectors, regardless of input length. Rudinger et al. (2017), who helped inspire this work, decompose sentences into a variable number of propositional embeddings, relying on a linguistic processing pipeline.
## 3. Approach

We use a modified transformer encoder-decoder architecture. Let $w = \{w_i\}_{i=1}^{n}$ denote the input sequence, where $n$ is the number of tokens. A transformer encoder maps the tokens into contextualized embeddings: $X = \mathrm{Encoder}(w)$, where $X \in \mathbb{R}^{n \times d}$ and $d$ is the hidden dimension. Instead of feeding the entire $X$ into the transformer decoder, we use a *nugget generator*, denoted $\mathrm{Nugget}$, to produce a latent variable $Z$ that is fed as the input of the decoder:

$$Z = \mathrm{Nugget}(X), \qquad p(y \mid Z) = \mathrm{Decoder}(Z), \tag{1}$$

where $Z \in \mathbb{R}^{k \times d}$, $k \le n$ is the number of nuggets generated by $\mathrm{Nugget}$, and $y$ is the target sequence. Note that $k$ is not a constant and depends on $X$. $\mathrm{Decoder}$ is a transformer module with causal masking and is conditioned on $Z$ via cross-attention. In the remainder of this section we introduce the form of $\mathrm{Nugget}$ and the corresponding training strategies.

### 3.1. Nugget Generator

Instead of producing vectors that do not correspond to actual tokens, such as the CLS embedding or average pooling over all token embeddings, we leverage the fact that contextual token embeddings carry the semantics of their surrounding texts, and use them as document representations. We use a feedforward network to measure the amount of context information in every token embedding, then select the most informative vectors as the output:

$$s = \mathrm{FFN}(X), \tag{2}$$
$$\hat{X} = \mathrm{TopK}_k(s, X), \tag{3}$$
$$Z = \mathrm{Nugget}(X) = \hat{X} W^V, \tag{4}$$

where $s \in \mathbb{R}^n$ is a list of scores, $\mathrm{TopK}_k$ is an operator that picks the top $k$ elements of $X$ sorted by $s$, $\hat{X} \in \mathbb{R}^{k \times d}$ are the selected embeddings, $W^V$ is a trainable parameter, and $Z \in \mathbb{R}^{k \times d}$ are the latent variables, called nuggets.

**Choice of k** If we let $k$ be a constant, then $\mathrm{Nugget}$ falls back to a fixed-dimensional representation. Instead, we let $k$ grow with the length of the text by setting $k = \lceil r \cdot n \rceil$, where the compression ratio $0 < r \le 1$ is a hyperparameter.

**Alternative viewpoint** Equivalently, one can also view $\mathrm{Nugget}$ as hard attention. Let $q \in \mathbb{R}^d$ denote a trainable query vector, and use $X$ as both keys and values. We can regard eq. (2) as the attention logits:

$$s = (q W^Q)(X W^K)^\top,$$

where $W^Q, W^K \in \mathbb{R}^{d \times d}$ are trainable parameters. In the next step, instead of aggregating the values $X$, we use hard attention to take the top-$k$ values in $X W^V$ with $s$ as keys.

### 3.2. Ensuring Differentiability

Note that the $\mathrm{TopK}$ operator in eq. (3) is not differentiable, so the parameters in eq. (2) do not receive any gradient signals. Therefore, we build a residual connection between the encoder and the decoder to propagate gradients back to $\mathrm{Nugget}$. Specifically, we append the attention logits $s$ to the cross-attention in the decoder:

$$a_\iota = (Z W^Q)(x^{\mathrm{tgt}} W^K)^\top + s, \tag{5}$$

where $a_\iota$ is the vector of cross-attention logits for the target token $x^{\mathrm{tgt}}$ in one attention head at one of the decoder layers; it is fed into a softmax operator to produce an attention distribution. Note that we have replaced the source tokens with the nuggets $Z$. In addition to attending to the nugget vectors, the attention score directly takes into account the nugget logits $s$. As the cross-attention is differentiable, it can be viewed as a residual connection that allows the gradients to be back-propagated to the hard-attention parameters. The architecture of NUGGET is shown in fig. 2.

Figure 2. The architecture of NUGGET. The diode symbol means that the gradient cannot be back-propagated.
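To make the mechanism concrete, the following PyTorch-style sketch shows one way eqs. (2)–(5) could be realized. This is not the authors' implementation: the class name `NuggetSelector`, the two-layer scorer, and the `cross_attention_logits` helper are illustrative choices of ours, and batching, multiple heads, and the softmax over the logits are omitted.

```python
import math
import torch
import torch.nn as nn


class NuggetSelector(nn.Module):
    """Sketch of the nugget generator (eqs. 2-4): score every token
    embedding with an FFN and keep the top-k as nuggets."""

    def __init__(self, d_model: int, ratio: float = 0.1):
        super().__init__()
        self.ratio = ratio
        self.scorer = nn.Sequential(                     # FFN in eq. (2)
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, 1)
        )
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # W^V in eq. (4)

    def forward(self, x: torch.Tensor):
        # x: (n, d) contextualized token embeddings from the encoder
        n = x.size(0)
        k = max(1, math.ceil(self.ratio * n))            # k grows with input length
        s = self.scorer(x).squeeze(-1)                   # (n,) nugget logits
        top = torch.topk(s, k).indices                   # hard, non-differentiable selection
        z = self.w_v(x[top])                             # (k, d) nuggets, eq. (4)
        return z, s[top]                                 # selected logits reused in eq. (5)


def cross_attention_logits(z, s_sel, x_tgt, w_q, w_k):
    """Sketch of eq. (5): cross-attention logits of one target token over
    the k nuggets, with the selected nugget logits added as a residual."""
    return (z @ w_q) @ (x_tgt @ w_k) + s_sel             # (k,) logits; softmax follows
```

Because `s[top]` enters the logits additively, the scorer still receives gradients through the decoder's soft cross-attention even though the top-k selection itself provides none.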
**Gradient analysis** To interpret the gradients on $s$, we can rewrite them as

$$\frac{\partial \ell}{\partial s} = \sum_{\iota} \frac{\partial \ell}{\partial a_\iota}, \tag{6}$$

where $\ell$ is the loss value and the summation over the subscript $\iota$ is taken over all target tokens, attention heads, and decoder layers. Eq. (6) shows that the gradient on $s$ is proportional to the gradients on all $a_\iota$. Consequently, the nugget logit $s_i$ tends to increase if the model tends to pay more attention to the corresponding nugget vector $z_i$. As the bottleneck of the model is the limited number of nuggets, the model learns to select the token embeddings that contain the maximal amount of contextual information.

Unlike previous work with residual connections (He et al., 2017), the purpose of eq. (5) in NUGGET is to propagate gradients to the logits $s$, which otherwise could not be learned. The absolute values of $s$ do not greatly affect the cross-attention of the decoder, and we do not observe much performance difference when ablating $s$ in eq. (5) during inference.

### 3.3. Informed Nugget Encoding

The assumption behind NUGGET is that certain tokens function as nuggets that aggregate the surrounding semantics. However, nugget selection is performed after encoding, so it cannot affect the encoder's attention behavior. To inform the encoder of the selected nuggets, we move the calculation of $s$ earlier, to the $l$-th layer of the encoder:

$$s = \mathrm{FFN}(X^{(l)}), \tag{7}$$

where $X^{(l)}$ are the hidden states of the encoder at the $l$-th layer, and we suppose the encoder has $L \ge l$ layers in total. With $s$ and the compression ratio $r$, we can tell apart the nugget and non-nugget tokens. Akin to the segment embeddings in Devlin et al. (2019), we add two type embedding vectors, denoted $e_n$ and $e_o$, to the hidden states of nugget and non-nugget tokens at the $l$-th layer, which are then fed into the next layer:

$$X^{(l+1)} = \mathrm{SelfAttn}(X^{(l)} + E), \tag{8}$$

where $E \in \mathbb{R}^{n \times d}$ is the type embedding matrix. We call this the nugget feedback. Note that the encoding $X$ used in eq. (3) is still taken from the last layer. The updated nugget encoding is illustrated in fig. 3.

**Stabilized training** In practice, we found that the training of nugget selection in eq. (2) can be unstable when the features fed into eq. (8) are being updated. We adopted the common practice for fine-tuning pretrained LMs (Howard & Ruder, 2018) and froze the bottom $l$ layers of the encoder, which stabilized our training curves.¹

¹Freezing the bottom layers may also help preserve the multilingual ability of a pretrained multilingual language model; this was not tested in our experiments.

Figure 3. The encoder of NUGGET with feedback. The bottom $l$ layers do not receive gradient signals from back-propagation.

### 3.4. Learning

The model parameters $\theta$ are optimized by minimizing the negative log-likelihood

$$-\sum_{(w, y) \in \mathcal{D}} \log p(y \mid w; \theta),$$

where the inputs $w$ and outputs $y$ are sampled from the dataset $\mathcal{D}$. The dataset $\mathcal{D}$ can be a monolingual corpus, in which case $y$ is identical to $w$ and NUGGET is trained as an autoencoder. Following previous work (Wang et al., 2021), we may randomly delete tokens from $w$ as noise. The dataset can also be bitext, in which case the target document $y$ is a translation of $w$, and NUGGET is trained as a machine translation model (McCann et al., 2017).
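As a schematic illustration of the nugget feedback in section 3.3, a minimal sketch is given below. It assumes a hypothetical `layers` list of encoder layers callable as `layer(x)`, a `scorer` FFN as in eq. (7), and two learned type vectors `e_nugget` and `e_other`; it approximates the idea rather than reproducing the authors' code.

```python
import math
import torch


def encode_with_feedback(layers, embeddings, scorer, ratio, l, e_nugget, e_other):
    """Sketch of eqs. (7)-(8): score tokens at layer l, mark nugget and
    non-nugget positions with type embeddings, then finish encoding.
    `layers` is a hypothetical list of transformer encoder layers."""
    x = embeddings                                    # (n, d) input embeddings
    with torch.no_grad():                             # bottom l layers are frozen
        for layer in layers[:l]:
            x = layer(x)
    s = scorer(x).squeeze(-1)                         # eq. (7): scores from layer-l states
    k = max(1, math.ceil(ratio * x.size(0)))
    nugget_idx = torch.topk(s, k).indices             # provisional nugget positions
    type_emb = e_other.expand(x.size(0), -1).clone()  # e_o for non-nugget tokens
    type_emb[nugget_idx] = e_nugget                   # e_n for nugget tokens
    x = x + type_emb                                  # eq. (8): feed back the selection
    for layer in layers[l:]:                          # remaining, trainable layers
        x = layer(x)
    return x, s, nugget_idx                           # last-layer states are used in eq. (3)
```

Running the scorer on frozen layer-$l$ states keeps the selection features stable during training, mirroring the stabilization practice described above.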
## 4. Experiment Setup

While we could apply the NUGGET concept to a variety of existing models, for the experiments here we build on the architecture of BART (Lewis et al., 2020). We start from the checkpoint of Tang et al. (2020), a model with 12 encoder layers and 12 decoder layers that is optimized for many-to-many machine translation. It contains 602M parameters, with 256M in the embedding matrix, 152M in the encoder, and 203M in the decoder.

For the dataset, we use the English-to-Chinese subset of the WMT19 corpus (Barrault et al., 2019), the same corpus used by Tang et al. (2020). WMT19 is comprised of individual sentences, and we concatenate adjacent sentences to recover the document structure, similar to the practice of Junczys-Dowmunt (2019). We limit each document to a maximum length of 128 sub-words. The model is trained to translate English documents into Chinese documents. For the autoencoding (AE) objective, we use English documents on both the source and target sides. We explored different compression ratios $r$ from 0.05 to 0.25. We freeze the bottom 3 layers ($l = 3$), as described in section 3.3, across our main experiments, and we provide a study of the effect of the number of frozen layers in section 7.1. More training details are given in appendix B.1.

## 5. Intrinsic Evaluation

In this section, we conduct experiments to investigate the impact of the compression ratio $r$. We also discuss the behavior of the nuggets and their relationship to the textual forms.

### 5.1. What is a sufficient compression ratio?

The compression ratio $r$ controls the trade-off between space efficiency and the semantic completeness of the nuggets. To find a sufficient compression ratio prior to applying NUGGET to downstream tasks, we propose to decode texts from the generated nuggets using beam search with a beam size of 5 and to measure their difference from the inputs with the BLEU metric (Papineni et al., 2002). We evaluate the model on the dev set of the English-to-Chinese subset of WMT19, where sentences are concatenated into documents with a maximum length of 128 tokens. The experimental results are shown in fig. 4.

Figure 4. The micro-averaged BLEU value of the texts generated from nuggets, with the input document as the reference. Note that r = 0.0 indicates that a single vector is used for each document. Results are reported on the dev set of WMT19.

With both the AE and MT training objectives, performance starts to saturate at a compression ratio of r = 0.1. This shows that with 10% of tokens as nuggets, the model has already gained sufficient information about the source documents. In the case of autoencoding, the BLEU value is higher than 0.99 when $r \ge 0.1$, meaning NUGGET reconstructs the inputs nearly verbatim, achieving almost lossless text encoding.
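The round-trip check in section 5.1 amounts to encoding a document into nuggets, decoding it back with beam search, and scoring the reconstruction against the original. A minimal sketch follows, assuming hypothetical `encode_to_nuggets` and `generate_from_nuggets` wrappers around a trained autoencoding model and using `sacrebleu` merely as an example BLEU implementation; the paper's exact BLEU configuration is not specified here.

```python
import sacrebleu


def reconstruction_bleu(model, documents, ratio=0.1, beam_size=5):
    """Decode each document from its nuggets and score the reconstructions
    against the inputs with corpus BLEU, as in the section 5.1 evaluation.
    `encode_to_nuggets` / `generate_from_nuggets` are hypothetical helpers."""
    hypotheses, references = [], []
    for doc in documents:
        nuggets = model.encode_to_nuggets(doc, ratio=ratio)       # k = ceil(r * n) vectors
        reconstruction = model.generate_from_nuggets(nuggets, beam_size=beam_size)
        hypotheses.append(reconstruction)
        references.append(doc)
    # sacrebleu expects a list of reference streams, hence the extra brackets
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```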
### 5.2. What is selected as nuggets?

Instead of uniformly selecting tokens, the scorer (eq. (2)) of NUGGET prefers certain tokens. Fig. 5 shows the 6 most frequent tokens selected by NUGGET; they are mostly delimiter words, such as punctuation tokens (commas and periods), conjunctions, and prepositions.

Figure 5. The 6 most frequent tokens selected by NUGGET. We show their ratio in the nuggets under the AE and MT training objectives, compared to their ratio in normal texts. The statistics are sampled from 128 documents of lengths up to 128. The compression ratio is set to r = 0.1 for both models.

Previous work on transformer language models shows that a large amount of self-attention focuses on delimiter tokens, such as punctuation, and that they may be used as no-ops (Clark et al., 2019). However, our study suggests that they may also serve as summary tokens, as predicting the end of a segment requires the model to understand the semantics of the preceding text. It is worth noting that in our case study, NUGGET prefers EOS while BOS is never selected, contrary to the practice of Wang et al. (2021). Also, NUGGET does not simply select the most frequent tokens: for example, the type "the", which makes up 5.2% of all tokens in the corpus, accounts for only 0.7% of selected nuggets.

An example text is shown in fig. 6, where commas, periods, and the conjunction "and" are selected as nuggets. We note that the preference of NUGGET for text delimiters is not specific to English; in appendix D, we show results similar to fig. 5 in 9 other languages.

> Natural language processing is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Figure 6. Example text processed by NUGGET (shown above). Tokens in darker colors have higher scores, and those with green backgrounds are selected as nuggets. The compression ratio is set to r = 0.1 and AE is the training objective.

Figure 7. The red curve shows the distribution of token indices in the input documents of the 3rd, 6th, and 9th nuggets, and the blue curve shows the probability gain of every token given the corresponding nugget. The distribution is averaged over 10k documents. The compression ratio r is set to 0.1.

Figure 8. The probability gain conditioned on a single nugget. Graphs are averaged over all nuggets of 10k documents by centering the nugget and showing the relative indices of the tokens. The ratio r is set to 0.1. Refer to appendix C for a complete version.

### 5.3. What is encoded in each nugget?

The model is optimized to encode information into nuggets, but it is unclear how that information is distributed across them. We therefore propose a method to probe the semantics of individual nuggets. We run teacher-forced decoding on a document with a model trained with the autoencoding objective, but expose only one nugget during decoding. Suppose the $j$-th nugget is exposed; we then calculate the probability gain by $g^j_i = p(y_i \mid y$