# AUTOREGRESSIVE ENTITY RETRIEVAL

Published as a conference paper at ICLR 2021

Nicola De Cao¹,², Gautier Izacard²,³,⁴, Sebastian Riedel²,⁵, Fabio Petroni²
¹University of Amsterdam, ²Facebook AI Research, ³ENS, PSL University, ⁴Inria, ⁵University College London
nicola.decao@gmail.com, {gizacard, sriedel, fabiopetroni}@fb.com
*Work done during internship with Facebook AI Research.*

## ABSTRACT

Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach leads to several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion and conditioned on the context. This enables us to mitigate the aforementioned technical issues since: (i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross-encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach, experimenting with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. Code and pre-trained models are available at https://github.com/facebookresearch/GENRE.

## 1 INTRODUCTION

The ability to retrieve the correct entity from large Knowledge Bases (KBs) given a textual input is a fundamental building block for several applications (Ferrucci, 2012; Slawski, 2015; Yang et al., 2018a). Most commercial recommendation systems, for instance, include in their pipelines components to detect and disambiguate entity mentions in open text, in order to isolate relevant concepts from non-meaningful data (Slawski, 2015; Yang et al., 2018a). Other examples are chatbots and question answering systems, which are often equipped with retrieval components to surface specific KB entries (e.g., Wikipedia articles) to find knowledge for sustaining a conversation or answering a question (Ferrucci, 2012; Chen et al., 2017; Lewis et al., 2020b; Roller et al., 2020).
Although there has been extensive previous work on entity retrieval (e.g., Hoffart et al., 2011; Piccinno & Ferragina, 2014; Huang et al., 2015; Le & Titov, 2018; Logeswaran et al., 2019; Broscheit, 2019; Wu et al., 2020, to name just a few), there is a common design choice to most current solutions: entities are associated with a unique atomic label and the retrieval problem can be interpreted as multi-class classification across these labels. The match between input and label is calculated through a bi-encoder (Wu et al., 2020; Karpukhin et al., 2020): a dot product between dense vector encodings of the input and the entity's meta information (such as title and description).

(a) Type specification. (b) Composing from context. (c) Translation. (d) Entity normalization. (e) Implicit factual knowledge. (f) Exact copy.

Figure 1: Examples of entities correctly retrieved by GENRE (we show only the top-3 of the ranking). On the top, three entity disambiguation instances; on the bottom, three document retrieval instances, two for open-domain question answering and one for fact checking. All of them are cast as sequence-to-sequence problems while inference is done using constrained beam search. Gold entities are in bold. Sub-captions indicate the type of interaction between the input context and the entity names required.

Critically, this formulation enables sub-linear search using modern maximum-inner-product-search libraries (Johnson et al., 2019) and hence supports retrieving from large entity databases. Unfortunately, the classifier approach to entity retrieval also has several shortcomings. First, unless a costly cross-encoder is used for re-ranking (Wu et al., 2020), the dot product can miss fine-grained interactions between input and entity meta information (Humeau et al., 2020). Second, storing dense vectors for the whole KB requires a large memory footprint, especially in real-world scenarios (i.e., ~24GB to store 1024-dimensional vectors for all of the ~6M Wikipedia pages), and the size grows linearly with the addition of new entities. Third, computing an exact softmax over all entities is very expensive, hence current solutions need to subsample negative data (Logeswaran et al., 2019; Karpukhin et al., 2020) at training time. Tuning an appropriately hard set of negative instances can be challenging and time-consuming. Finally, existing systems can suffer from a cold-start problem since they cannot represent entities about which they have not yet gathered sufficient information, in the form, for instance, of a textual description or a set of relations with the existing entities.
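To make the classifier formulation above concrete, the following is a minimal sketch of bi-encoder retrieval. The encoder is a stub (a seeded random projection) standing in for a learned dense encoder, and the entity set and query are illustrative placeholders, not any specific system's implementation:

```python
import zlib
import numpy as np

EMB_DIM = 1024  # 1024-dimensional vectors, as in the footprint estimate above

def encode(text: str) -> np.ndarray:
    # Stub for a learned dense encoder (e.g., one tower of a BERT bi-encoder).
    seed = zlib.crc32(text.encode())
    v = np.random.default_rng(seed).standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

# One dense vector per entity must be pre-computed and stored: at fp32 this
# is ~6e6 entities * 1024 dims * 4 bytes ≈ 24GB for all of Wikipedia, and
# the index grows with every entity added.
entity_names = ["Metropolis (comics)", "Metropolis (1927 film)", "France"]
entity_index = np.stack([encode(name) for name in entity_names])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Context-entity affinity is a single dot product per entity: fast
    # (sub-linear with a MIPS library such as FAISS) but coarse-grained.
    scores = entity_index @ encode(query)
    return [entity_names[i] for i in np.argsort(-scores)[:k]]

print(retrieve("Superman is the guardian of Metropolis."))
```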
The treatment of entity identifiers as atomic labels in a classifier ignores the fact that we often have unambiguous, highly structured and compositional entity names. Wikipedia, for instance, associates unique titles to articles,[^1] that may be the name of the subject or a description of its topic, as well as potentially distinctive information to disambiguate[^2] (see Figure 1 for some examples). These entity names often interact with mention contexts in a predictable and regular fashion. For example, entity names are often identical to the mention strings that refer to them (e.g., Fig. 1f). When this is not possible, they might be composed of tokens in the context (e.g., Fig. 1b), include a type specification that can be inferred (e.g., Fig. 1a), be the translation of the string mention (e.g., Fig. 1c), require normalization such as referring to the correct alias of a mention (e.g., Fig. 1d), or require factual knowledge that might be stored in the parameters of a model (e.g., Fig. 1e). These observations suggest that inputs could be translated into unique entity names, word by word, instead of being classified among a huge set of options.

[^1]: We use entity name to refer to the corresponding Wikipedia article title throughout the rest of the paper.
[^2]: Often in the form of a description in parentheses after the name. Wikipedia naming conventions are described in https://en.wikipedia.org/wiki/Wikipedia:Article_titles.

In this paper, we propose GENRE (for Generative ENtity REtrieval), the first entity retriever that exploits a sequence-to-sequence architecture to generate entity names in an autoregressive fashion conditioned on the context. Concretely, GENRE uses a transformer-based architecture, pre-trained with a language modeling objective (i.e., we use BART weights from Lewis et al. (2020a)) and fine-tuned to generate entity names. This architecture has been shown to retain factual knowledge to some extent (Petroni et al., 2019) and language translation skills (Radford et al., 2019), among other things, both desirable properties for an entity retriever. Naturally, the generated output might not always be a valid entity name. To solve this problem, GENRE employs a constrained decoding strategy that forces each generated name to be in a predefined candidate set.

The autoregressive formulation allows us to directly capture the aforementioned relations between context and entity name, effectively cross-encoding both. Also, the memory footprint required is orders of magnitude smaller than that of current systems, since the parameters of a sequence-to-sequence model scale linearly with the vocabulary size, not the entity count. Moreover, the exact softmax can be computed efficiently for each output token (i.e., all non-gold tokens are considered negative), thereby eliminating the need for negative data downsampling. Finally, our model never accesses any explicit meta information about an entity beyond its title, hence new entities can be added by simply appending their unambiguous name to the candidate set (e.g., Fig. 1b refers to an entity added after training).

We empirically evaluate the performance of GENRE on more than 20 datasets, spanning three families of tasks: (i) entity disambiguation, using popular datasets and settings (both in- and out-of-domain); (ii) end-to-end entity linking, with the GERBIL benchmarking tool (Röder et al., 2018), by using a novel dynamically markup-constrained decoding strategy; (iii) document retrieval, with the recently proposed KILT benchmark (Petroni et al., 2020b), which spans 5 different sub-tasks. Our models achieve state-of-the-art or very competitive results on nearly all datasets, often with substantial improvement (+13.7 precision points on average for retrieval on KILT). Further, we show that, compared with recent models, GENRE requires substantially less memory (~20 times smaller footprint on average). Finally, we demonstrate that our model can be applied in scenarios where the only entity information available is its name.

We organize the paper as follows: in Section 2 we describe our problem formulation. Then, in Section 3 we present GENRE, and eventually in Section 4 we extensively evaluate our method in the aforementioned settings. We will release code and pre-processed data to reproduce our experiments.

## 2 ENTITY RETRIEVAL

We assume we have a collection of entities E (e.g., Wikipedia articles) where each entity is an entry in a Knowledge Base (KB) such as Wikipedia. We want to approach the following retrieval problem: given a textual input source x (e.g., a question), a model has to return the most relevant entities from E with respect to x. We assume that each e ∈ E is uniquely assigned a textual representation (i.e., its name): a sequence of tokens y (e.g., Wikipedia pages are identified by their titles). A particular instance of this problem is Entity Disambiguation (ED) (see Figure 1 for an example), where an input x is annotated with a mention and a system has to select either its corresponding entity from E, or to predict that there is no corresponding entry in the KB. Another instance is page-level Document Retrieval (DR), where the input x is intended as a query and E as a collection of documents identified by their unique titles (e.g., Wikipedia articles).

We address the retrieval problem with a sequence-to-sequence model that generates textual entity identifiers (i.e., entity names). Concretely, GENRE ranks each e ∈ E by computing a score with an autoregressive formulation:

$$\text{score}(e \mid x) = p_\theta(y \mid x) = \prod_{i=1}^{N} p_\theta(y_i \mid y_{<i}, x)$$

where y is the sequence of N tokens in the identifier of e, and θ the parameters of the model.
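As a concrete illustration of this scoring rule, the following minimal sketch computes score(e|x) by summing token log-probabilities under a BART sequence-to-sequence model. It assumes the Hugging Face transformers API and an off-the-shelf checkpoint; GENRE fine-tunes such a model to generate entity names, which this sketch does not do, and the example context and candidates are illustrative:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.eval()

def score_entity(context: str, entity_name: str) -> float:
    """log p(y|x): sum of token log-probabilities of the entity name."""
    inputs = tokenizer(context, return_tensors="pt")
    labels = tokenizer(entity_name, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing `labels` makes the decoder predict each token of the
        # entity name given all previous ones (teacher forcing).
        logits = model(**inputs, labels=labels).logits
    # The softmax here is exact over the ~50k-token vocabulary, not over
    # millions of entities: all non-gold tokens act as negatives.
    log_probs = logits.log_softmax(dim=-1)
    positions = torch.arange(labels.shape[1])
    return log_probs[0, positions, labels[0]].sum().item()

# Rank a small candidate set for an Entity Disambiguation input.
context = "Superman is the guardian of [START_ENT] Metropolis [END_ENT]."
candidates = ["Metropolis (comics)", "Metropolis (1927 film)"]
print(sorted(candidates, key=lambda e: -score_entity(context, e)))
```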
[Figure 5 plot: accuracy (y-axis) versus number of incoming links in Wikipedia (x-axis, 2⁰ to 2¹⁷, plus a "None" bin), with the data distribution overlaid.]

Figure 5: Accuracy per number of incoming links in Wikipedia on the validation sets of all KILT datasets except ELI5 (as it is fundamentally different from the others). We also show the data distribution of the number of incoming links. Intuitively, a page/entity with few incoming links has been observed less than highly connected pages/entities. Indeed, for pages/entities that are never linked (first bin on the left) the average accuracy is 20% lower than the global average (78.6%). However, for pages/entities linked at least once it is above the global average. This indicates that GENRE seems effective at linking rare entities.

```text
ID: 87d95287-707e-4bd9-9633-ca0c611a4a3a World Without Superma:8
inputs: [..] When Superman leaves Earth for New Krypton, he appoints, newly
  freed from the Phantom Zone, to take his place as guardian of [START_ENT]
  Metropolis [END_ENT]. Mon-El assumes the secret identity of Johnathan Kent
  as a tribute to Clark's adoptive father, posing as Clark's cousin. [..]
gold output: Metropolis (comics)
predicted outputs: [
  (Metropolis (comics), -0.09),
  (Themyscira (DC Comics), -1.09),
  (Metropolis (disambiguation), -1.27),
  (Superman (comic book), -1.51),
  (Superman (Earth-Two), -1.52)
]
```

Figure 6: Example of a GENRE prediction for named entity disambiguation on KILT WNED. The input is plain text where a mention is flagged with two special start and end tokens, [START_ENT] and [END_ENT]. The output is a ranked list of entities (we report the log-likelihood as well).

```text
ID: sfq_18245
inputs: Which Florentine painter (1535-1607) used the name Bronzino after
  the death of his uncle?
gold output: Bronzino
predicted outputs: [
  (Florence, -0.37),
  (Bronzino, -0.62),
  (Niccolò Machiavelli, -0.64),
  (Giorgio de Chirico, -0.71),
  (Vitruvian Man, -0.73)
]
```

(a) TriviaQA (open-domain question answering).

```text
ID: 4713
inputs: Tool has won three Oscars.
gold output: Tool (band)
predicted outputs: [
  (Tool (band), -0.08),
  (Tool (disambiguation), -1.59),
  (Machine Head (band), -1.73),
  (Language Arts (album), -1.97),
  (Machine Gun (band), -2.12)
]
```

(b) FEVER (fact checking).

Figure 7: Examples of GENRE predictions for the retrieval task on KILT. The input is a query and the output is a ranked list of Wikipedia article titles (we also report the log-likelihood of the solutions).
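Ranked lists with log-likelihoods like those in Figures 6 and 7 can be obtained with beam search over the decoder, returning the k best sequences together with their scores. A minimal sketch, again assuming the Hugging Face generate API and the model/tokenizer from the earlier sketch (without the constraint to valid entity names, which is sketched after Figure 9):

```python
def topk_predictions(query: str, k: int = 5) -> list[tuple[str, float]]:
    """Return the k best generated names with their beam scores."""
    inputs = tokenizer(query, return_tensors="pt")
    out = model.generate(
        **inputs,
        num_beams=k,
        num_return_sequences=k,
        max_new_tokens=32,
        return_dict_in_generate=True,
        output_scores=True,
    )
    names = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
    # `sequences_scores` holds the length-normalized sequence log-likelihoods,
    # analogous to the numbers reported next to each prediction above.
    return list(zip(names, out.sequences_scores.tolist()))

print(topk_predictions("Tool has won three Oscars."))
```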
```text
ID: 1106testa SOCCER
inputs: SOCCER - RESULT IN SPANISH FIRST DIVISION . MADRID 1996-08-31
  Result of game played in the Spanish first division on Saturday :
  Deportivo Coruna 1 Real Madrid 1.
gold output: SOCCER - RESULT IN [SPANISH](Spain) FIRST DIVISION .
  [MADRID](Madrid) 1996-08-31 Result of game played in the [Spanish](Spain)
  first division on Saturday : Deportivo Coruna 1
  [Real Madrid](Real Madrid C.F.) 1.
predicted output: SOCCER - RESULT IN [SPANISH](Spain) FIRST DIVISION .
  [MADRID](Madrid) 1996-08-31 Result of game played in the [Spanish](Spain)
  first division on Saturday : [Deportivo](Deportivo de La Coruna) Coruna 1
  [Real Madrid](Real Madrid C.F.) 1.
gold spans: [
  [19, 7, Spain],
  [44, 6, Madrid],
  [91, 7, Spain],
  [147, 11, Real Madrid C.F.]
]
predicted spans: [
  [19, 7, Spain],
  [44, 6, Madrid],
  [91, 7, Spain],
  [128, 9, Deportivo de La Coruna],
  [147, 11, Real Madrid C.F.]
]

Micro precision: 0.80
Micro recall: 1.00
Micro F1: 0.88
```

Figure 8: Example of a GENRE prediction for end-to-end entity linking on AIDA. The input is plain text and the output is a markup string where the links are Wikipedia titles. Spans are in the format $(s_i, l_i, t_i)$: start of the mention, length of the mention, and title, respectively.

Figure 9: Example of a prefix tree (trie) structure where the allowed entity identifiers are "English language", "English literature" and "France". Note that at the root there is the start-of-sequence token SOS and all leaves are end-of-sequence tokens EOS. Since more than one sequence has the same prefix (i.e., "English"), this ends up being an internal node whose branches are the possible continuations.
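The following is a minimal sketch of the trie of Figure 9 and of how it can constrain generation so that every finished beam is a valid entity name. It reuses the model and tokenizer from the earlier sketches; `prefix_allowed_tokens_fn` is the Hugging Face generate hook (the released GENRE code uses the same mechanism), while the candidate set, query, and padding guard are illustrative choices:

```python
class Trie:
    """Prefix tree over token sequences, as in Figure 9."""

    def __init__(self, sequences: list[list[int]]):
        self.root = {}
        for seq in sequences:
            node = self.root
            for token in seq:
                node = node.setdefault(token, {})

    def allowed(self, prefix: list[int]) -> list[int]:
        """Tokens that may follow `prefix`. Shared prefixes (e.g., the tokens
        of "English") become internal nodes with several branches."""
        node = self.root
        for token in prefix:
            node = node.get(token, {})
        # Guard for exhausted branches (after EOS) so beam search never
        # receives an empty constraint set.
        return list(node.keys()) or [tokenizer.pad_token_id]

# Each candidate name is prefixed with the decoder start token (the SOS of
# Figure 9); the tokenizer already appends the end-of-sequence token (EOS).
candidates = ["English language", "English literature", "France"]
trie = Trie(
    [[model.config.decoder_start_token_id] + tokenizer(name).input_ids
     for name in candidates]
)

out = model.generate(
    **tokenizer("Shakespeare wrote his plays in English.", return_tensors="pt"),
    num_beams=3,
    # At every decoding step, only continuations that stay inside the trie
    # are allowed, so the output is always one of the candidate names.
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.allowed(sent.tolist()),
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```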