Published as a conference paper at ICLR 2022

KNOWLEDGE INFUSED DECODING

Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, Ahmed Hassan Awadallah
Dartmouth College, Microsoft, Northwestern University
{ruibo.liu.gr, soroush.vosoughi}@dartmouth.edu
{zheng, shagup, ragaonka, milads, hassanam}@microsoft.com
cygao@u.northwestern.edu

ABSTRACT

Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from the pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. Hence, they tend to suffer from counterfactual or hallucinatory generation when used in knowledge-intensive natural language generation (NLG) tasks. Recent remedies to this problem focus on modifying either the pre-training or task fine-tuning objectives to incorporate knowledge, which normally require additional costly training or architecture modification of LMs for practical applications. We present Knowledge Infused Decoding (KID), a novel decoding algorithm for generative LMs, which dynamically infuses external knowledge into each step of the LM decoding. Specifically, we maintain a local knowledge memory based on the current context, interacting with a dynamically created external knowledge trie, and continuously update the local memory as a knowledge-aware constraint to guide decoding via reinforcement learning. On six diverse knowledge-intensive NLG tasks, task-agnostic LMs (e.g., GPT-2 and BART) armed with KID outperform many task-optimized state-of-the-art models, and show particularly strong performance in few-shot scenarios over seven related knowledge-infusion techniques. Human evaluation confirms KID's ability to generate more relevant and factual language for the input context when compared with multiple baselines. Finally, KID also alleviates exposure bias and provides stable generation quality when generating longer sequences. Code for KID is available at https://github.com/microsoft/KID.

1 INTRODUCTION

Pre-trained language models (LMs) have been shown to capture rich semantic and syntactic features, as demonstrated by their state-of-the-art performance on many generation tasks such as abstractive summarization (Zhang et al., 2020b; Chu & Liu, 2019) and dialogue generation (Roller et al., 2021; Zhang et al., 2020c). Their generations, however, can be quite limited in scenarios requiring knowledge: they frequently struggle with issues such as being easily misguided by phrase co-occurrence (e.g., "Talk? Birds can talk."), failing to handle negation (e.g., "The theory of relativity was not developed by Einstein.") (Kassner & Schütze, 2020), and being unable to compare commonsense concepts, such as time (Qin et al., 2021) and digits (Talmor et al., 2020).

To enhance the performance of LMs on knowledge-intensive NLG tasks¹, prior studies have proposed to re-train LMs with knowledge-aware objectives (Zhou et al., 2021; Xiong et al., 2020; Zhang et al., 2019; Khandelwal et al., 2020) or to add special architectures to encode knowledge (Bosselut et al., 2019; Logan et al., 2019; Peters et al., 2019b) from external resources (e.g., knowledge graphs such as CONCEPTNET (Speer et al., 2017) and ATOMIC (Sap et al., 2019)). These methods, though yielding impressive results on many downstream tasks, can be computationally expensive.
More importantly, knowledge implicitly parameterized in LM architectures is difficult to revise and expand (Lewis et al., 2020b), and wrong generations are hard to diagnose due to the lack of interpretability (Talmor et al., 2020), which heavily limits their real-world applications.

¹We define knowledge-intensive NLG tasks as those whose input context alone does not provide complete knowledge for a legitimate and plausible generation.

More recent retrieval-based models try to tackle these problems by augmenting inputs with retrieved knowledge evidence (Lee et al., 2019; Guu et al., 2020). For example, RAG (Lewis et al., 2020b) leverages a non-parametric memory to access extensive knowledge (in the form of unstructured documents), and jointly fine-tunes a parametric LM (i.e., BART (Lewis et al., 2020a)) to enable knowledge-aware generation. A key limitation of such methods is that they retrieve documents only once, grounding them in the static input context, and thus cannot support the dynamic nature of the context as new tokens are generated. The static knowledge becomes a major problem for tasks where longer and more abstractive generation is expected, such as open-ended story generation (Mostafazadeh et al., 2016), multi-turn dialogues (Zhao et al., 2020), and conversation summarization (Gliwa et al., 2019). Moreover, in a recent study, Krishna et al. (2021) replaced the knowledge retriever in RAG with a random retriever and found little difference in the resulting performance on a long-form QA task named ELI5 (Fan et al., 2019b), indicating that the model may not actually be grounding its text generation in the retrieved documents.

To address these limitations, we present in this work a novel decoding algorithm, KID, aiming to better infuse knowledge into generation in a dynamic manner. Instead of solely relying on static knowledge retrieved at the beginning, KID dynamically searches for promising continuations from the retrieved knowledge during each step of LM decoding, to guide the current step of generation. Specifically, KID maintains a local knowledge memory, interacts it with a knowledge trie dynamically created from the retrieved supporting documents, and updates the local memory as a knowledge-aware constraint to guide the generation. The key intuition behind KID is that existing LM pre-training objectives are usually defined at the token level and do not explicitly model concept-centric knowledge (Xiong et al., 2020), which motivates us to reshape the probability mass at each decoding step towards the distribution of entities in the knowledge.

The contribution of this work is three-fold. First, we introduce KID as a model- and task-agnostic decoding method that integrates knowledge on the fly and can be applied to various knowledge-intensive tasks with different generative LMs. Second, from a decoding perspective, on six knowledge-intensive NLG tasks, GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020a) equipped with KID significantly outperform conventional beam search or sampling decoding by a large margin. Third, from a knowledge-infusion perspective, unlike seven strong knowledge-infusion baselines that require either additional re-training or special architecture modifications, KID leverages knowledge more effectively as a light-weight knowledge-infusion solution. Additionally, in few-shot scenarios KID significantly improves over them, demonstrating its generalization ability in low-resource and domain-shifting regimes.
2 RELATED WORK

We briefly review existing work on enhancing LMs with external knowledge and representative decoding algorithms for generation.

Enhancing Language Models with Knowledge. Large language models implicitly encode knowledge in their parameters, but with limits (Petroni et al., 2019; Kassner & Schütze, 2020; Lin et al., 2020a). Several architectures and objective functions have been proposed to explicitly encode external knowledge (Sun et al., 2020; Logan et al., 2019; Roberts et al., 2020; Levine et al., 2020), or to augment LM pre-training data with retrieved knowledge (Lewis et al., 2020b; Guu et al., 2020; Lee et al., 2019). However, Talmor et al. (2020) note that the reasoning ability of such LMs is strongly tied to the context seen during pre-training and is thus hard to generalize to new domains. Built on top of LMs, KID incorporates extra knowledge from external resources (Wikipedia) and thus shows strong performance on knowledge-intensive NLG tasks.

Better Decoding Algorithms. Two common strategies dominate the decoding algorithms used by most generative models: beam search, which maximizes likelihood within a local horizon (due to the finite beam size), and sampling decoding (e.g., top-k sampling (Fan et al., 2018; Holtzman et al., 2018)), which relies on randomness. Holtzman et al. (2020) find that beam search often produces generic and repetitive generations, while top-k sampling tends to pick unlikely tokens, which creates incoherent and unrelated sequences. Existing attempts to mitigate these problems include re-ranking (Adiwardana et al., 2020; Shao et al., 2017), adding control signals (Zhang et al., 2018; Xing et al., 2017), and self-adaptive truncation (Welleck et al., 2020; Peters et al., 2019a). None of these decoding algorithms consider integrating knowledge into the generation process. Reflective decoding (West et al., 2021) and DELOREAN (Qin et al., 2020) are two recent decoding algorithms that focus on abductive commonsense reasoning; Reflective decoding in particular has the potential to be extended to other knowledge-intensive tasks. We compare it with KID in our experiments.

3 KNOWLEDGE INFUSED DECODING

We detail the implementation of KID in this section. As shown in Figure 1, KID comprises three steps: retrieving relevant knowledge (§3.1), constructing external and local knowledge memories (§3.2), and guiding the current decoding step under the constraint of the knowledge trie (§3.3).

3.1 RETRIEVING EXTENSIVE KNOWLEDGE FROM WIKIPEDIA

The first step of KID is to retrieve several context-relevant documents to ground the subsequent generation. We use DPR (Karpukhin et al., 2020) as our general-purpose knowledge retriever, which projects contexts and relevant documents into a 768-dimensional shared embedding space using a bi-encoder network (i.e., two independent BERTs (Devlin et al., 2019)). Here, the documents refer to the 100-word chunks of Wikipedia passages released by RAG (Lewis et al., 2020b), a total of 21M documents, which serve as our knowledge source $Z$. We pre-load the weights from the latest checkpoint of DPR (March 2021), as it improves retrieval performance by using mined negative samples and contrastive learning, which is also suggested by Jernite (2020). During retrieval, we perform maximum inner-product search (MIPS) with faiss² accelerated by GPU (Johnson et al., 2019). Formally, we retrieve the $k$ most relevant documents $z_{[1,\ldots,k]} \subseteq Z$ for a context $x_{context}$ as:

$$z_{[1,\ldots,k]} = \big\{\, z_i \in Z \;\big|\; \mathrm{top}\text{-}k\,\{\mathrm{BERT}(x_{context})^{\top}\, \mathrm{BERT}(z_i)\} \,\big\}, \tag{1}$$

where $\mathrm{BERT}(\cdot)$ denotes the vector encoded by BERT. The number of retrieved documents $k$ is a task-specific hyper-parameter; we discuss its impact on performance in §4.3.

²The faiss project can be found here: https://github.com/facebookresearch/faiss.
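As a concrete illustration of this retrieval step, the following minimal sketch pairs off-the-shelf DPR bi-encoders with an exact inner-product faiss index. The Hugging Face checkpoint names and the two toy documents are assumptions made for illustration; the paper uses a March 2021 DPR checkpoint and the full 21M-passage Wikipedia dump with a GPU-accelerated index.

```python
# Minimal sketch of Step 1 (dense retrieval with MIPS), assuming the Hugging Face
# DPR checkpoints below; the actual setup uses a March 2021 DPR checkpoint and the
# 21M Wikipedia chunks released with RAG.
import faiss
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Toy stand-in for the 21M 100-word Wikipedia chunks (knowledge source Z).
docs = ["Iceland is a Nordic island country in the North Atlantic Ocean ...",
        "Reykjavik is the capital and largest city of Iceland ..."]

with torch.no_grad():
    doc_vecs = ctx_enc(**ctx_tok(docs, padding=True, truncation=True,
                                 return_tensors="pt")).pooler_output  # (num_docs, 768)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product (MIPS) index
index.add(doc_vecs.numpy())                   # use faiss-gpu for the full corpus

def retrieve(x_context: str, k: int = 2):
    """Return the k documents with the largest inner product to the context, as in Eq. (1)."""
    with torch.no_grad():
        q_vec = q_enc(**q_tok(x_context, return_tensors="pt")).pooler_output  # (1, 768)
    scores, ids = index.search(q_vec.numpy(), k)
    return [docs[i] for i in ids[0]]

print(retrieve("Tell me something about Iceland.", k=1))
```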
3.2 CONSTRUCTING EXTERNAL AND LOCAL KNOWLEDGE MEMORIES

We convert the multi-document input retrieved in the previous step into compressed knowledge memories, in order to 1) allow relevant knowledge to be easily identified, and 2) reduce the memory footprint of the knowledge, whose size grows linearly with the number of retrieved documents. Following the design choices of previous successful methods (Bosselut et al., 2021; Huang et al., 2020; Fan et al., 2019a), we adopt co-reference resolution and open information extraction (Open IE) (Stanovsky et al., 2018) to convert plain text into triplet form³. For example, a knowledge statement like "Iceland is a Nordic island country in the North Atlantic Ocean and it is the most sparsely populated country in Europe." will be converted into subject-relation-object triplets such as ⟨subj: Iceland, rel: is, obj: Nordic island country⟩, ⟨subj: Iceland, rel: is, obj: most sparsely populated country in Europe⟩, etc.

To account for overlapping elements among these triplets, we use a prefix tree (namely the knowledge trie $G_{ext}$) to store and organize the extracted triplets, setting the subject of each triplet as the key in $G_{ext}$. Note that unlike common character-level dictionary tries (Chen et al., 2020; Germann et al., 2009), each triplet in $G_{ext}$ is stored in token units, as our goal is to efficiently query and traverse the stored knowledge triplets. A tree structure encoding knowledge is appealing for knowledge-intensive NLG tasks, since 1) the non-cyclic structure helps reduce repetition in generations, and 2) querying a prefix tree completes in constant time with respect to the knowledge size ($O(|x_{context}|)$) and does not involve any costly traversal over a graph (Ji et al., 2020; Zhang et al., 2020a), regardless of the scale of the grounding knowledge (normally $|x_{context}| \ll |k \text{ Wiki docs}|$). We also maintain a local memory $G_{loc}$ (a first-in-first-out list) that records all the mentioned entities in the current context, to focus on concept-centric knowledge (Zhou et al., 2021; Lin et al., 2020b). More details about how we construct and query these knowledge memories can be found in Appendix A.2.

³As an empirical trick, we also remove the stop words in the documents, as they seldom carry knowledge.

Figure 1: Overview of our KID decoding algorithm. For a given context $x_{context}$, we first retrieve the $k$ most relevant Wikipedia documents $z_{[1,\ldots,k]}$ with a knowledge retriever (Step 1), and then convert them into a compressed knowledge trie $G_{ext}$ (Step 2). Meanwhile, the local memory $G_{loc}$, which is a first-in-first-out list, keeps track of entities in the current context; in the final step (Step 3), it continuously queries the knowledge trie with a maximum number of hops $h_{max}$. The current LM decoding step is then guided by the query results via policy gradient to generate the new token $y_i$.
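To make these data structures concrete, here is a minimal sketch of a token-level knowledge trie keyed by triplet subjects, together with a first-in-first-out local memory. It assumes the subject-relation-object triplets have already been produced by a co-reference resolution + Open IE pipeline; the class and method names (KnowledgeTrie, query, max_hops) are illustrative rather than taken from the KID codebase, and the hop semantics are simplified relative to the paper.

```python
# Minimal sketch of Step 2 (memory construction), assuming triplets have already
# been extracted with co-reference resolution + Open IE; names are illustrative.
from collections import deque

class KnowledgeTrie:
    """Token-level prefix tree G_ext keyed by the subject of each triplet."""
    def __init__(self):
        self.root = {}

    def insert(self, subj, rel, obj):
        node = self.root
        # Store the triplet as a token path: subject tokens -> relation tokens -> object tokens.
        for token in subj.lower().split() + rel.lower().split() + obj.lower().split():
            node = node.setdefault(token, {})

    def query(self, entity, max_hops=2):
        """Return the tokens reachable from `entity` within `max_hops` trie levels."""
        node = self.root.get(entity.lower())
        if node is None:
            return set()
        frontier, reachable = [node], set()
        for _ in range(max_hops):
            next_frontier = []
            for n in frontier:
                reachable.update(n.keys())
                next_frontier.extend(n.values())
            frontier = next_frontier
        return reachable

triplets = [("Iceland", "is", "Nordic island country"),
            ("Iceland", "is", "most sparsely populated country in Europe")]

g_ext = KnowledgeTrie()
for s, r, o in triplets:
    g_ext.insert(s, r, o)

g_loc = deque(maxlen=16)   # local memory G_loc: FIFO list of entities seen in the context
g_loc.append("Iceland")

# Tokens the decoder should be nudged towards at the current step:
guidance = set().union(*(g_ext.query(e, max_hops=2) for e in g_loc))
print(guidance)
```

Querying the trie touches only the token paths rooted at the entities currently held in the local memory, which is why the lookup cost depends on the context rather than on the total amount of retrieved knowledge.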
3.3 KNOWLEDGE TRIE CONSTRAINED DECODING VIA POLICY GRADIENT

Background. Current generative LMs are trained to maximize the probability of generating the ground-truth token at each decoding step. Assuming $y_{1:T} = \{y_1, y_2, \ldots, y_T\}$ is the ground-truth output sequence for a given context $x_{context}$, the MLE objective minimizes the following loss:

$$\mathcal{L}_{\mathrm{MLE}} = -\sum_{t=1}^{T} \log p\big(y_t \mid y_1, \ldots, y_{t-1}, x_{context}\big). \tag{2}$$

In knowledge-intensive NLG tasks, however, it has been reported that the MLE training objective does not explicitly model knowledge, and thus the LM often produces counterfactual generations through surface-level misguidance (Zhou et al., 2021; Petroni et al., 2019). Furthermore, the teacher-forcing algorithm used in MLE training leads to the exposure bias problem (Ranzato et al., 2016): the LM has access to the ground-truth sequence up to the next token during training, but lacks such a signal during testing, which causes accumulated errors when generating longer sequences (Paulus et al., 2018). Both problems heavily limit the performance of popular LMs on diverse knowledge-intensive NLG tasks. One remedy is to learn a generation policy that not only maximizes knowledge correctness but also alleviates exposure bias in longer generation, which can be made possible with policy gradient in reinforcement learning (RL).

Knowledge Trie Constrained Decoding. To formulate NLG as an RL problem, we define the state as the generated tokens before step $t$ (i.e., $s_t = y_{<t}$)