# Integrative Decoding: Improve Factuality via Implicit Self-Consistency

Published as a conference paper at ICLR 2025.

**Yi Cheng¹, Xiao Liang², Yeyun Gong³, Wen Xiao⁴, Song Wang⁴, Yuji Zhang⁵, Wenjun Hou¹, Kaishuai Xu¹, Wenge Liu¹, Wenjie Li¹, Jian Jiao³, Qi Chen³, Peng Cheng³, Wayne Xiong³**

¹The Hong Kong Polytechnic University · ²Tsinghua University · ³Microsoft Research · ⁴Microsoft Azure AI · ⁵University of Illinois at Urbana-Champaign

alyssa.cheng@connect.polyu.hk

**Abstract.** Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove remarkably effective at improving the factual accuracy of large language models. Nonetheless, existing methods usually place strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID) to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processing them concurrently, with the next token selected at each decoding step by aggregating all of their corresponding predictions. In essence, this simple approach implicitly incorporates self-consistency into the decoding objective. Extensive evaluation shows that ID consistently enhances factuality across a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks.
The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.

## 1 Introduction

Despite notable advancements across various domains, large language models (LLMs) remain notorious for their tendency to produce non-factual and erroneous content, a phenomenon commonly known as hallucination (Lewis et al., 2020; Ji et al., 2023). Prior research has shown that repeated sampling is a very effective methodology for enhancing factual accuracy (Wang et al., 2023; Shi et al., 2022; Chen et al., 2023). It involves sampling multiple responses to the same prompt, followed by a careful selection of the most accurate one or the synthesis of a refined output from the sampled responses. Notably, as the number of sampled responses increases, the performance gains often continue to rise in an almost log-linear manner, as recently highlighted by Brown et al. (2024). This suggests the existence of inference-time scaling laws, implying the potential of repeated sampling to progressively push the model closer to its theoretical performance ceiling.

Despite this immense promise, a central challenge in this methodology remains: how to effectively identify the non-factual content within the sample collection and thereby produce a final, accurate output. The degree of self-consistency (SC), which measures the level of agreement among an LLM's different outputs, has proven to be a useful indicator for addressing this issue (Wang et al., 2023; Shi et al., 2022; Chen et al., 2023; Thirukovalluru et al., 2024; Malon & Zhu, 2024; Mündler et al., 2024; Manakul et al., 2023). It has been observed that statements consistently present across a range of sampled responses are more likely to be truthful than those appearing sporadically or inconsistently across outputs.
However, most SC-based methods for improving factuality impose strict constraints on the format of the task output, largely limiting their applicability. Because consistency across free-form responses is difficult to measure, previous studies usually only consider tasks where consistency can be defined as an exact match between the answers parsed from the responses (Wang et al., 2023; Huang et al., 2023a; Shi et al., 2022; Li et al., 2022), such as arithmetic problems and multiple-choice questions. This naturally leads us to ask: how can we further unlock the potential of self-consistency and repeated sampling in open-ended generation tasks? One straightforward way is to concatenate all sampled responses in a prompt and directly instruct the LLM to select the most self-consistent one, as done in Chen et al. (2023). Nonetheless, such practice substantially increases the input length, posing excessive demands on the model's long-text processing capability.

This work was conducted during Yi Cheng's internship at Microsoft Research. All code and data are available at https://github.com/YiCheng98/IntegrativeDecoding.

Table 1: Comparison between ID and previous approaches that utilize self-consistency to improve factuality on open-ended generation tasks. Input length indicates the length relative to that of one sampled response from standard prompting (with k representing the number of sampled responses).

| Method | How to Check Self-consistency | Input Length | Inference Latency | Factuality Improvement |
| --- | --- | --- | --- | --- |
| USC (Chen et al., 2023) | Prompting | k | Medium | Medium |
| SR (Madaan et al., 2024) | CoT Reasoning | k | Medium | Medium |
| FSC (Wang et al., 2024a) | CoT Reasoning | k | Medium | High |
| SE-SL (Wang et al., 2024b) | Numerous Prompting | 1 | High | High |
| SE-RG (Wang et al., 2024b) | Prompting & Clustering | 1 | High | High |
| Integrative Decoding | ICL & Decoding-time Implicit Integration | 1 | Medium | Higher |
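The exact-match voting that these format-constrained SC methods rely on can be sketched in a few lines. This is an illustrative sketch, not the exact procedure of any cited method; the function name and toy answers are our own, and a real system would first parse a short answer out of each free-form response:

```python
from collections import Counter

def self_consistency_vote(parsed_answers):
    """Majority vote over answers parsed from k sampled responses.

    This is the classic self-consistency recipe for tasks with a
    short, canonical answer (arithmetic, multiple choice), where
    'consistency' reduces to an exact string match.
    """
    counts = Counter(parsed_answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# Five sampled answers to the same arithmetic problem.
samples = ["42", "42", "41", "42", "40"]
print(self_consistency_vote(samples))  # prints 42
```

This reduction to exact matching is precisely what fails for open-ended generation, where two factually consistent responses rarely share a single canonical string.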
Another line of research treats each response as a collection of statements and assesses the consistency between each pair of statements through clustering (Thirukovalluru et al., 2024) or iterative LLM prompting (Mündler et al., 2024; Wang et al., 2024a,b). This requires numerous iterations of inference, particularly for longer outputs, leading to inefficiency. Due to these issues, prior attempts to apply SC in open-ended tasks cannot generalize effectively to long-form generation, and they struggle to scale with an increasing number of sampled responses.

In this paper, we present Integrative Decoding (ID), a novel decoding strategy designed to improve factuality by implicitly incorporating self-consistency into its decoding objective. ID begins with repeated sampling. For each sampled response in the collection, ID constructs a new input by concatenating the response with the original prompt. Essentially, this input instructs the model to respond to the instruction again with reference to a previously sampled response. ID then processes these inputs concurrently for decoding, with the next token selected at each inference step by integrating all of their predictions. During this process, each input acts as a representative for the sampled response it contains, voting for the tokens that are semantically consistent with that response. ID aggregates these votes and thereby achieves the best overall consistency across all sampled responses. Compared with existing approaches that utilize self-consistency to improve factuality on open-ended generation tasks, ID does not rely on additional prompting or chain-of-thought reasoning to explicitly verify consistency; moreover, it achieves substantial improvements in factuality with relatively low inference latency and only a slight burden on the model's long-text processing capability (see Table 1 for detailed comparisons).
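One decoding step of this scheme can be sketched as follows. This is our own minimal illustration rather than the authors' implementation, and the specific aggregation rule (averaging per-input next-token log-probabilities) is an assumption made for concreteness; the paper's exact integration formula may differ:

```python
def integrative_decode_step(per_input_log_probs):
    """Select the next token by aggregating predictions from k inputs.

    `per_input_log_probs` holds one next-token log-probability vector
    per constructed input (original prompt + one sampled response).
    Each input 'votes' for tokens consistent with its response; the
    token with the highest average score across inputs wins.
    """
    k = len(per_input_log_probs)
    vocab_size = len(per_input_log_probs[0])
    avg_scores = [
        sum(log_probs[v] for log_probs in per_input_log_probs) / k
        for v in range(vocab_size)
    ]
    return max(range(vocab_size), key=lambda v: avg_scores[v])

# Toy example: 3 constructed inputs voting over a 3-token vocabulary.
votes = [
    [-0.1, -1.0, -2.0],  # input 1 prefers token 0
    [-2.0, -0.5, -1.0],  # input 2 prefers token 1
    [-1.5, -0.2, -3.0],  # input 3 prefers token 1
]
print(integrative_decode_step(votes))  # token 1 wins on average
```

The chosen token is then appended to every input's running output, so all k continuations stay synchronized through the generation.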
We evaluate ID over six series of LLMs with varying scales. ID consistently enhances the factuality of all these LLMs by a large margin on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) datasets, demonstrating robustness from sentence- to document-level generation. Moreover, the performance gains of ID progressively amplify as the number of sampled responses increases, indicating its potential to scale with repeated sampling.

### Preliminaries: Self-consistency as an Indicator for Factuality

Previous studies found that the degree of self-consistency among an LLM's different sampled responses can serve as a useful indicator for hallucination detection (Manakul et al., 2023; Farquhar et al., 2024). Facts that are consistently supported across an LLM's sampled responses are more likely to be factual than those that appear only sporadically or inconsistently across outputs. Formally, given a prompt $x$ and a response $\hat{y}$ that consists of a series of statements $S = \{s_1, s_2, \dots, s_n\}$, the factuality score of $s_i$ can be estimated by measuring its consistency with the other sampled responses $R = \{r_1, r_2, \dots, r_k\}$ to the same prompt $x$:

$$f(s_i) = \frac{1}{|R|} \sum_{r_j \in R} P(\text{consistent} \mid s_i, r_j), \tag{1}$$

where $f(s_i)$ refers to the estimated factuality score of the statement $s_i$ and $P(\text{consistent} \mid s_i, r_j)$ is the probability that $s_i$ is supported by the response $r_j$. These responses can be obtained through sampling algorithms, such as temperature sampling (Ficler & Goldberg, 2017) or nucleus sampling (Holtzman et al., 2020). The overall factuality score of the response $\hat{y}$ can thereby be estimated as:

$$F(\hat{y}) = \frac{1}{|S||R|} \sum_{s_i \in S} \sum_{r_j \in R} P(\text{consistent} \mid s_i, r_j) = \frac{1}{|R|} \sum_{r_j \in R} f(\hat{y}, r_j), \tag{2}$$

where $f(\hat{y}, r_j) = \frac{1}{|S|} \sum_{s_i \in S} P(\text{consistent} \mid s_i, r_j)$, representing the overall degree to which $\hat{y}$ is supported by the response $r_j$.
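Equations (1) and (2) translate directly into code once the pairwise support probabilities $P(\text{consistent} \mid s_i, r_j)$ are available (in practice they would come from an NLI model or an LLM judge; the helper names below are our own):

```python
def statement_factuality(support_probs):
    """Eq. (1): f(s_i) = (1/|R|) * sum_j P(consistent | s_i, r_j)."""
    return sum(support_probs) / len(support_probs)

def response_factuality(support_matrix):
    """Eq. (2): F(y) = (1/(|S||R|)) * sum_i sum_j P(consistent | s_i, r_j).

    `support_matrix[i][j]` is the probability that statement s_i is
    supported by sampled response r_j.
    """
    per_statement = [statement_factuality(row) for row in support_matrix]
    return sum(per_statement) / len(per_statement)

# Two statements checked against two sampled responses.
support = [
    [1.0, 0.5],  # s_1: fully supported by r_1, partially by r_2
    [0.0, 0.5],  # s_2: contradicted by r_1, partially supported by r_2
]
print(statement_factuality(support[0]))  # 0.75
print(response_factuality(support))      # 0.5
```

Note that averaging per-statement scores (the first form of Eq. 2) and averaging per-response support scores $f(\hat{y}, r_j)$ (the second form) are the same double sum, just grouped differently.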
### Formalization of the Decoding Objective

The established insights about the role of self-consistency in hallucination detection indicate that the response most consistent with the others tends to be the most factual one. This motivates us to develop a decoding method that, given several sampled responses, can generate a new output that maintains strong overall consistency with all of them while preserving its own coherence. Formally, given an input prompt $x$, a decoding method searches for an output $\hat{y}$ by solving:

$$\hat{y} = \arg\max_{y \in Y} H(x, y), \tag{3}$$

where $Y$ refers to the set of all possible token sequences and $H(x, y)$ is the objective function. Common decoding algorithms, such as beam search, take the decoding objective $H(x, y)$ to be $\log p_\theta(y \mid x) = \sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)$.
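For contrast with ID, this standard objective is usually maximized only approximately, e.g. by greedy decoding. The sketch below is our own illustration; the toy next-token table merely stands in for a real language model so the example is runnable:

```python
def greedy_decode(next_token_log_probs, max_len, eos=0):
    """Greedily approximate argmax_y sum_t log p(y_t | y_<t, x).

    `next_token_log_probs(prefix)` maps a token prefix to a dict of
    token -> log-probability, standing in for a real LM's conditional
    distribution. Greedy search picks the locally best token each step,
    so it is a heuristic, not an exact maximizer of the objective.
    """
    y = []
    for _ in range(max_len):
        dist = next_token_log_probs(tuple(y))
        token = max(dist, key=dist.get)
        y.append(token)
        if token == eos:  # stop at the end-of-sequence token
            break
    return y

# A deterministic toy 'model' over a tiny vocabulary {0 (EOS), 1, 2}.
def toy_model(prefix):
    table = {
        (): {1: -0.1, 2: -2.0},
        (1,): {2: -0.1, 0: -1.0},
        (1, 2): {0: -0.1, 2: -3.0},
    }
    return table.get(prefix, {0: 0.0})

print(greedy_decode(toy_model, max_len=5))  # [1, 2, 0]
```

ID keeps this token-by-token search procedure but replaces the single-input objective $\log p_\theta(y \mid x)$ with an objective that also rewards consistency with the sampled responses.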