# Generative Retrieval Meets Multi-Graded Relevance

Yubao Tang1,2, Ruqing Zhang1,2, Jiafeng Guo1,2, Maarten de Rijke3, Wei Chen1,2, Xueqi Cheng1,2
1CAS Key Lab of Network Data Science and Technology, ICT, CAS
2University of Chinese Academy of Sciences
3University of Amsterdam
{tangyubao21b,zhangruqing,guojiafeng,chenwei2022,cxq}@ict.ac.cn, m.derijke@uva.nl

Abstract. Generative retrieval represents a novel approach to information retrieval. It uses an encoder-decoder architecture to directly produce relevant document identifiers (docids) for queries. While this method offers benefits, current approaches are limited to scenarios with binary relevance data, overlooking the potential for documents to have multi-graded relevance. Extending generative retrieval to accommodate multi-graded relevance poses challenges, including the need to reconcile likelihood probabilities for docid pairs and the possibility of multiple relevant documents sharing the same identifier. To address these challenges, we introduce a framework called GRaded Generative Retrieval (GR2). GR2 focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. First, we create identifiers that are both semantically relevant and sufficiently distinct to represent individual documents effectively. This is achieved by jointly optimizing the relevance and distinctness of docids through a combination of docid generation and autoencoder models. Second, we incorporate information about the relationship between relevance grades to guide the training process. We use a constrained contrastive training strategy to bring the representations of queries and the identifiers of their relevant documents closer together, based on their respective relevance grades. Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR2.
1 Introduction

Generative retrieval (GR) [51] is a new paradigm for information retrieval (IR), in which all information in a corpus is encoded into the model parameters and a ranked list is produced directly by a single parametric model. In essence, a sequence-to-sequence (Seq2Seq) encoder-decoder architecture is used to directly predict identifiers (docids) of documents that are relevant to a given query. Recent studies have achieved impressive retrieval performance on many search tasks [15, 53, 67, 92].

Current work on GR mainly focuses on binary relevance scenarios, where a binary division into relevant and irrelevant categories is assumed [20, 52, 83], and a query is usually labeled with a single relevant document [38, 57] or multiple relevant documents that share the same relevance grade [33]. The standard Seq2Seq objective, maximum likelihood estimation (MLE) of the output sequence with teacher forcing, has been used extensively in GR due to its simplicity. However, in real-world search scenarios, documents may have different degrees of relevance [18, 71, 72, 82, 91], as binary relevance may not sufficiently represent fine-grained relevance. In traditional learning-to-rank (LTR), multi-graded relevance judgments [23, 69] are widely considered, with nDCG [35] and ERR [12] being particularly popular metrics. For modeling multi-graded relevance in LTR, a popular approach is the pairwise method [10, 84], which weights documents by their relevance grades and predicts the relative order of a document pair. The learning objective currently used in GR differs significantly from common LTR algorithms: the standard Seq2Seq objective emphasizes one-to-one associations between queries and docids, aiming to generate a single most relevant docid.

Corresponding author. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).
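Multi-graded metrics such as nDCG reward rankings that place higher-grade documents earlier. As a concrete illustration, here is a minimal nDCG computation using the common exponential-gain formulation (function names are ours, not from the paper):

```python
import math

def dcg(relevances):
    # Exponential gain: a grade-3 document contributes (2^3 - 1) = 7,
    # discounted by log2 of (1-based rank + 1).
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=None):
    # Normalize by the DCG of the ideal (grade-descending) ordering.
    k = k or len(relevances)
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Grades 0-3; demoting high-grade documents lowers nDCG.
print(ndcg_at_k([3, 2, 0, 1]))  # ≈ 0.993
print(ndcg_at_k([0, 1, 2, 3]))  # ≈ 0.548
```

Binary relevance collapses all the grades above into 0/1, which is exactly the fine-grained signal the paper argues current GR objectives ignore.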
Inspired by pairwise methods in LTR, a straightforward approach to extending GR to multiple grades is to have the GR model assign higher generation likelihood to docids with higher relevance grades than to those with lower grades. The docid likelihood is the product of the likelihoods of each token in the generated docid. Docids commonly have different lengths, as a fixed length might not adequately encompass diverse document semantics. However, this variation in docid lengths within the corpus leads to systematically smaller likelihood scores for longer docids. Although some GR work [44, 78, 92, 93, 102] uses a pairwise or listwise loss for optimization, it still only considers binary relevance or requires complex multi-stage optimization. Besides, essential topics in multi-graded relevant documents may be similar, emphasizing the need for a one-to-one correspondence between document content and its identifier to ensure distinctness. Consequently, harnessing a GR model's capabilities for multi-graded relevance ranking in a relatively succinct manner remains a non-trivial challenge.

To this end, we consider multi-graded generative retrieval and propose a novel GRaded Generative Retrieval (GR2) framework with three key features: (1) To enhance docid distinctness while ensuring relevance to document semantics, we introduce a regularized fusion approach with two modules: (i) a docid generation module, which produces pseudo-queries based on the original documents as the docids; and (ii) an autoencoder module, which reconstructs the target docids from their corresponding representations. We train them jointly to ensure that each docid representation is close to its corresponding document representation while far from other docid representations. (2) For the mapping from a query to its relevant docids, we design a multi-graded constrained contrastive (MGCC) loss to capture the relationships between labels with different relevance grades.
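The length effect described above is easy to see numerically: because the docid likelihood is a product of per-token probabilities, a longer docid scores lower even at identical per-token confidence. A minimal sketch with illustrative values (not from the paper):

```python
import math

def docid_log_likelihood(token_probs):
    # The docid likelihood is the product of per-token probabilities;
    # summing logs avoids floating-point underflow for long docids.
    return sum(math.log(p) for p in token_probs)

# Two docids with identical per-token confidence (0.9) but different lengths.
short_docid = [0.9] * 4   # 4-token docid
long_docid = [0.9] * 12   # 12-token docid

print(docid_log_likelihood(short_docid))  # ≈ -0.42
print(docid_log_likelihood(long_docid))   # ≈ -1.26
# The longer docid scores lower purely due to its length, so raw likelihoods
# of different-length docids are not directly comparable for pairwise ordering.
```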
Considering the incomparability of likelihood probabilities associated with docids of varying lengths, we convert queries and docids into representations in the embedding space. The core idea is to pull the representation of a given query towards those of its relevant docids, while simultaneously pushing it away from the representations of irrelevant docids in the mini-batch. To maintain the order between relevance grades in the embedding space, the strength of the pull is determined by the relevance grades of the docids. MGCC differs from pairwise methods in LTR [9, 10, 84] in that it introduces more specific grade penalties and constraints to regulate the relative distances between query representations and docid representations of different grades. (3) We explore two learning scenarios, i.e., supervised learning and pre-training, to learn generative retrieval models with our GR2 framework. Importantly, our method for obtaining docids applies to both multi-graded and binary relevance data, and it reduces to the supervised contrastive approach in binary relevance scenarios.

Our main contributions are: (i) We introduce a general GR2 framework for both binary and multi-graded relevance scenarios, built on relevant and distinct docids and on information about the relationship between relevance labels. (ii) In experiments on 5 representative document retrieval datasets, GR2 achieves a significant 14% relative improvement in P@20 on the Gov 500K dataset over the SOTA GR baseline RIPOR [85]. (iii) Even in low-resource scenarios, our method performs well, surpassing BM25 on two datasets; on large-scale datasets, it achieves results comparable to RIPOR.

2 Related Work

Learning to rank (LTR). LTR ranks candidate documents for queries using ranking functions, employing pointwise, pairwise, and listwise approaches.
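The pull-push idea can be sketched as a grade-weighted contrastive objective over embeddings. This is an illustrative simplification, not the authors' exact MGCC loss (which adds explicit constraints on relative distances between grades); the weighting scheme and all names here are our assumptions:

```python
import numpy as np

def graded_contrastive_loss(query, docids, grades, tau=0.1):
    """Grade-weighted contrastive sketch (illustrative, not the paper's MGCC loss).

    query:  (dim,) query embedding
    docids: (n, dim) docid embeddings in the mini-batch
    grades: (n,) relevance grades; 0 marks in-batch negatives
    """
    q = query / np.linalg.norm(query)
    d = docids / np.linalg.norm(docids, axis=1, keepdims=True)
    sims = d @ q / tau                       # temperature-scaled cosine sims
    log_denom = np.log(np.exp(sims).sum())   # softmax normalizer over the batch
    loss, total_weight = 0.0, 0.0
    for sim, grade in zip(sims, grades):
        if grade > 0:
            # Higher-grade docids receive a stronger pull toward the query.
            loss += grade * (log_denom - sim)
            total_weight += grade
    return loss / max(total_weight, 1.0)

query = np.array([1.0, 0.0])
aligned = np.array([[1.0, 0.0], [0.0, 1.0]])     # relevant docid matches query
misaligned = np.array([[0.0, 1.0], [1.0, 0.0]])  # relevant docid points away
print(graded_contrastive_loss(query, aligned, [2, 0])
      < graded_contrastive_loss(query, misaligned, [2, 0]))  # True
```

With all positive grades set to 1, this sketch reduces to a standard supervised contrastive loss, mirroring the reduction to binary relevance noted above.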
In the pointwise approach, a query is modeled with a single document, similar to using MLE in GR, making it difficult to capture global associations. Pairwise LTR treats document pairs as instances, e.g., LambdaRank [10], LTRGR [44], and RIPOR [92]. The MGCC loss proposed here aligns with the pairwise approach. The listwise approach [78, 88] treats entire document lists as instances, which involves high optimization costs.

Generative retrieval. GR has been proposed as a new paradigm for IR in which documents are returned using model parameters only [51]: a single model directly generates relevant documents for a query. Inspired by this blueprint, several proposals [7, 14, 22, 40, 43, 79, 85] learn a Seq2Seq model by simultaneously addressing the two key issues below.

Key issue 1: Building associations between documents and docids. Widely used docid designs are either pre-defined or learnable [77]. Pre-defined docids are fixed during training, e.g., document titles [22], semantically structured strings [56, 79, 85], n-grams [7, 14, 44], pseudo-queries [76], URLs [68, 102, 104], and product quantization codes [13, 104]. Learnable docids are tailored to retrieval tasks, e.g., discrete numbers [74], important word sets [86, 96, 97], and residual quantization codes [92, 93]. Though effective in some tasks, these docid designs have limitations: titles and URLs rely on metadata, n-grams demand storage of all n-grams, quantization codes lack interpretability, and learning optimal learnable docids involves a complex learning process. Considering performance, implementation complexity, and storage requirements, pre-defined pseudo-queries are a promising compromise. Unfortunately, semantically similar documents might have similar or even identical docids, making it challenging for the GR model to distinguish them in both binary and multi-graded relevance scenarios.
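As background, the pairwise approach can be illustrated with a RankNet-style logistic loss over a document pair, where the higher-grade document should outscore the lower-grade one (a generic sketch of the family, not a reimplementation of the cited methods):

```python
import math

def pairwise_logistic_loss(score_higher_grade, score_lower_grade):
    # RankNet-style pairwise loss: small when the higher-grade document
    # outscores the lower-grade one, large when the pair is inverted.
    return math.log(1.0 + math.exp(-(score_higher_grade - score_lower_grade)))

print(pairwise_logistic_loss(2.0, 0.5))  # correctly ordered pair: small loss
print(pairwise_logistic_loss(0.5, 2.0))  # inverted pair: large loss
```

Methods such as LambdaRank additionally weight each pair by the change in a listwise metric (e.g., nDCG) induced by swapping the two documents.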
To generate diverse and relevant docids, we propose a docid fusion approach based on pseudo-queries.

Key issue 2: Mapping queries to relevant docids. Given a query, a GR model outputs its relevant docids by maximizing the output sequence likelihood, which is only suitable for binary relevance. If a query has only one relevant document, it is paired with its relevant docid; if a query has multiple relevant documents at the same grade, it is paired with multiple relevant docids, and the relative order of these docids in the returned list is random. Such a learning objective cannot handle search tasks with multi-graded relevance, which limits its efficacy for general IR problems. While certain GR studies [44, 92, 93, 102] employ a pairwise or listwise loss for optimization, they remain limited to binary relevance or necessitate intricate multi-stage optimization processes. In this work, we use all available relevance labels to enable multi-graded GR. For more related work, please refer to Appendix C.

3 Preliminaries

Document retrieval. Assume that $\mathcal{L}_q = \{1, \ldots, l, \ldots, L\}$ is the grade set representing different degrees of relevance. We assume a total order between the grades, $L > L-1 > \cdots > 1$, $l \in \mathcal{L}_q$. Let $q$ be a query from the query set $Q$, and $D_q = \{d_1, d_2, \ldots, d_N\}$ the set of $N$ relevant documents for $q$, selected from the large document collection $D$; $D_Q$ is the relevant document set for $Q$. We write $D_q^l$ for the documents with grade $l$ for $q$. The document retrieval task is to find a retrieval model $f$ that produces the ranked list of relevant documents for a given query, i.e., $\pi_f(q) := [\arg\max_d^{(1)} f(q, d), \arg\max_d^{(2)} f(q, d), \ldots]$, where $\arg\max_d^{(i)} f(q, d)$ denotes the $i$-th ranked document for $q$ over $D$ given by $f$ via matching the query and documents. The model $f$ is optimized by minimizing its loss function over labeled data, i.e., $\min_f \sum_{Q} \sum_{D} \mathcal{L}(f; q, D_q)$.

Multi-graded generative retrieval.
In an end-to-end architecture, the GR process directly returns a ranked list for a given query, without a physical index component. Assume that the indexing mechanism to represent docids is $I: D \rightarrow \mathit{ID}$, where $\mathit{ID}$ is the corresponding docid set. For the observed relevant document set $D_q$, $q \in Q$, the indexing mechanism $I$ maps the documents to the docid set $\mathit{ID}_q = \{\mathit{ID}_q^l \mid l = 1, \ldots, L\}$, in which the docid $id^l \in \mathit{ID}_q^l$ is at grade $l$. The GR model observes pairs of a query and a docid with multi-graded relevance labels under the indexing mechanism $I$, i.e., $\{Q, \mathit{ID}_Q\}$, where $\mathit{ID}_Q$ denotes $\{\mathit{ID}_q^l \mid l = 1, \ldots, L, q \in Q\}$. Given $Q$, the model $g: Q \rightarrow \mathit{ID}_Q$ autoregressively generates a ranked list of candidate docids in descending order of output likelihood conditioned on each query. Mathematically, $g(q; \theta) = P_\theta(id \mid q) = \prod_{t \in [1, |id|]} p_\theta(w_t \mid q, w_{<t})$, where $w_t$ is the $t$-th token of $id$ and $w_{<t}$ denotes its preceding tokens.
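The autoregressive scoring above can be sketched directly: each candidate docid's log-likelihood is the sum of its per-token log-probabilities, and candidates are ranked in descending order of that sum (the per-token scores below are hypothetical, and `rank_docids` is our helper, not part of the paper):

```python
def docid_log_likelihood(token_log_probs):
    # log P(id | q) = sum over t of log p(w_t | q, w_<t)
    return sum(token_log_probs)

def rank_docids(candidates):
    # candidates: {docid: [per-token log-probs under the model]}.
    # Return docids in descending order of sequence log-likelihood.
    return sorted(candidates,
                  key=lambda d: docid_log_likelihood(candidates[d]),
                  reverse=True)

# Hypothetical per-token log-probs for three candidate docids.
scores = {
    "docid_a": [-0.1, -0.2],
    "docid_b": [-0.5, -0.4],
    "docid_c": [-0.05, -0.1],
}
print(rank_docids(scores))  # ['docid_c', 'docid_a', 'docid_b']
```

In practice GR systems decode with constrained beam search so that only valid docids in $\mathit{ID}$ can be generated; this sketch only shows the scoring and ordering step.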