# Learning to Tokenize for Generative Retrieval

Weiwei Sun¹, Lingyong Yan², Zheng Chen¹, Shuaiqiang Wang², Haichao Zhu², Pengjie Ren¹, Zhumin Chen¹, Dawei Yin², Maarten de Rijke³, Zhaochun Ren⁴

¹Shandong University, China  ²Baidu Inc., China  ³University of Amsterdam, The Netherlands  ⁴Leiden University, The Netherlands

{sunnweiwei,lingyongy}@gmail.com, yindawei@acm.org, m.derijke@uva.nl, z.ren@liacs.leidenuniv.nl

## Abstract

As a new paradigm in information retrieval, generative retrieval directly generates a ranked list of document identifiers (docids) for a given query using generative language models (LMs). How to assign each document a unique docid (referred to as document tokenization) is a critical problem, because it determines whether the generative retrieval model can precisely retrieve any document by simply decoding its docid. Most existing methods adopt rule-based tokenization, which is ad hoc and does not generalize well. In contrast, in this paper we propose a novel document tokenization learning method, GENRET, which learns to encode the complete document semantics into docids. GENRET learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. We develop a progressive training scheme to capture the autoregressive nature of docids and diverse clustering techniques to stabilize the training process. Given the semantic-embedded docids of any set of documents, the generative retrieval model can learn to generate the most relevant docid purely according to the docids' semantic relevance to the query. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets. GENRET establishes a new state of the art on the NQ320K dataset. Compared to generative retrieval baselines, GENRET achieves significant improvements on unseen documents. Moreover, GENRET also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

## 1 Introduction

Document retrieval plays an essential role in web search applications and various downstream knowledge-intensive tasks by identifying relevant documents to satisfy users' queries. Recently, generative retrieval has emerged as a new paradigm for document retrieval [1, 5, 37, 41, 46, 47] that directly generates a ranked list of document identifiers (docids) for a given query using generative language models (LMs). Unlike dense retrieval [9, 13, 23, 42], generative retrieval presents an end-to-end solution for document retrieval tasks [37]. It also offers a promising approach to better exploit the capabilities of recent large LMs [1, 41].

As shown in Figure 1, document tokenization aims to tokenize each document in the corpus into a sequence of discrete tokens, i.e., a docid. Document tokenization plays a crucial role in generative retrieval, as it defines how documents are distributed in the semantic space [37], and how to define docids remains an open problem. Most previous generative methods employ rule-based document tokenizers, such as generating titles or URLs [5, 46], or clustering results from off-the-shelf document embeddings [37, 41]. Such rule-based methods are usually ad hoc and do not generalize well. In particular, the resulting docids may work well for retrieving documents that have been seen during training, but generalize poorly to unlabeled documents [17, 20].
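As a concrete illustration of the clustering-based tokenizers mentioned above, the sketch below assigns each document a docid by recursively clustering off-the-shelf document embeddings, so that a docid is the path of cluster indices in the resulting tree. The function name and the parameters `k` and `max_leaf` are illustrative assumptions, not values taken from the cited work.

```python
# Minimal sketch of a rule-based tokenizer: hierarchical k-means over
# pre-computed document embeddings; a docid is the sequence of cluster
# indices along a document's path in the cluster tree.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_docids(embeddings: np.ndarray, k: int = 10, max_leaf: int = 100):
    """Return one docid (tuple of ints) per document."""
    docids = [[] for _ in range(len(embeddings))]

    def recurse(indices: np.ndarray) -> None:
        if len(indices) <= max_leaf:
            # Within a small leaf cluster, disambiguate by local position.
            for pos, idx in enumerate(indices):
                docids[idx].append(pos)
            return
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings[indices])
        for c in range(k):
            members = indices[labels == c]
            for idx in members:
                docids[idx].append(c)
            recurse(members)

    recurse(np.arange(len(embeddings)))
    return [tuple(d) for d in docids]
```

Because the cluster tree is fixed once built from the embeddings, such docids cannot adapt to new documents or to the retrieval objective, which is the limitation that motivates learning the tokenization instead.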
Figure 1: An overview of our proposed method. The proposed method utilizes a document tokenization model to convert a given document into a sequence of discrete tokens, referred to as a docid. This tokenization allows the original document to be reconstructed by a reconstruction model. Subsequently, an autoregressive generation model is employed to retrieve documents by generating their respective docids.

To address the above problem, we propose GENRET, a document tokenization learning framework that learns to tokenize a document into semantic docids via a discrete auto-encoding scheme. GENRET consists of a shared sequence-to-sequence-based document tokenization model, a generative retrieval model, and a reconstruction model. In the proposed auto-encoding learning scheme, the tokenization model learns to convert documents into discrete docids, which are subsequently used by the reconstruction model to reconstruct the original document. The generative retrieval model is trained to generate docids in an autoregressive manner for a given query. The three models are optimized in an end-to-end fashion to achieve seamless integration.

Two challenges typically arise when using auto-encoding to optimize a generative retrieval model: (i) docids with an autoregressive nature, and (ii) docids with diversity. To address the first challenge, and to stabilize the training of GENRET, we devise a progressive training scheme. This training scheme allows for stable training of the model by fixing optimized prefix docids z
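To make the discrete auto-encoding scheme described above more concrete, the following is a minimal sketch in which a tokenization head snaps document states to their nearest codebook entries (yielding discrete docid tokens) and a reconstruction head is trained to recover the document embedding from the selected codes. This is a generic VQ-style illustration under our own assumptions (dimensions, loss weights, straight-through gradients, a shared codebook), not the paper's exact model or objective.

```python
# Sketch of discrete auto-encoding for docid learning (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteDocidAutoEncoder(nn.Module):
    def __init__(self, hidden: int = 768, codebook_size: int = 512, docid_len: int = 4):
        super().__init__()
        self.docid_len = docid_len
        # One codebook shared across docid positions (a simplifying assumption).
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Stand-ins for the seq2seq tokenization and reconstruction models.
        self.tokenizer_head = nn.Linear(hidden, hidden * docid_len)
        self.reconstructor = nn.Linear(hidden * docid_len, hidden)

    def forward(self, doc_emb: torch.Tensor):
        """doc_emb: (batch, hidden) document representation from any encoder."""
        b, h = doc_emb.shape
        states = self.tokenizer_head(doc_emb).view(b, self.docid_len, h)
        # Snap each state to its nearest codebook entry -> discrete docid tokens.
        codes = self.codebook.weight.unsqueeze(0).expand(b, -1, -1)
        docid = torch.cdist(states, codes).argmin(dim=-1)   # (batch, docid_len)
        quantized = self.codebook(docid)                      # (batch, docid_len, hidden)
        # Straight-through estimator so the tokenizer still receives gradients.
        quantized = states + (quantized - states).detach()
        # Reconstruct the document embedding from the discrete codes.
        recon = self.reconstructor(quantized.reshape(b, -1))
        recon_loss = F.mse_loss(recon, doc_emb)
        # Commitment-style term pulls states toward their chosen codes.
        commit_loss = F.mse_loss(states, self.codebook(docid).detach())
        return docid, recon_loss + 0.25 * commit_loss
```

The sketch only shows the quantization-and-reconstruction loop for a single document; the retrieval model, the end-to-end coupling of the three models, and the progressive fixing of already-learned docid prefixes are described in the following sections.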