# Variational Open-Domain Question Answering

Valentin Liévin¹², Andreas Geert Motzfeldt¹, Ida Riis Jensen¹, Ole Winther¹²³⁴

## Abstract

Retrieval-augmented models have proven to be effective in natural language processing tasks, yet there remains a lack of research on their optimization using variational inference. We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models, focusing on open-domain question answering and language modelling. The VOD objective, a self-normalized estimate of the Rényi variational bound, approximates the task marginal likelihood and is evaluated under samples drawn from an auxiliary sampling distribution (cached retriever and/or approximate posterior). It remains tractable, even for retriever distributions defined on large corpora. We demonstrate VOD's versatility by training reader-retriever BERT-sized models on multiple-choice medical exam questions. On the MedMCQA dataset, we outperform the domain-tuned Med-PaLM by +5.3% despite using 2,500× fewer parameters. Our retrieval-augmented BioLinkBERT model scored 62.9% on the MedMCQA and 55.0% on the MedQA-USMLE. Last, we show the effectiveness of our learned retriever component in the context of medical semantic search.

*Equal contribution. ¹Section for Cognitive Systems, Technical University of Denmark, Denmark. ²FindZebra, Denmark. ³Center for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Denmark. ⁴Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark. Correspondence to: Valentin Liévin.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 1. Parameter efficiency. Answering accuracy of baseline methods and of VOD (BioLinkBERT backbone) on MedMCQA.

## 1. Introduction

Scaling Transformer-based (Vaswani et al., 2017) language models (LMs) with larger datasets and more parameters (Radford et al., 2018; Kaplan et al., 2020; Hoffmann et al., 2022) led to sustained improvements in various downstream tasks.¹ However, large language models (LLMs) may reach a plateau in their performance due to the limitations of the implicit knowledge they possess, which can be incomplete, flawed, or out-of-date.

Open-domain question answering (ODQA) consists of augmenting LMs with external knowledge bases indexed with a retrieval mechanism. This approach was popularized in the question-answering setting by Chen et al. (2017) and was later applied to the task of language modelling itself (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2021; Izacard et al., 2022). However, optimizing deep retrievers is challenging unless there is a set of annotated evidence documents that are sufficiently aligned with the target task, as explored in Karpukhin et al. (2020); Qu et al. (2021); Khattab & Zaharia (2020). An alternative approach is to model the whole collection of documents as a latent variable (Lee et al., 2019), but this still poses challenges for optimization, especially considering that documents are discrete quantities.²

This research fills a gap in the literature by exploring the optimization of retrieval-augmented models using variational inference. We introduce a probabilistic framework that extends Rényi divergence variational inference (Li & Turner, 2016), allowing us to estimate the marginal task likelihood and its gradient by sampling from an approximate posterior.

¹Find a benchmark of LLMs in Srivastava et al. (2022); read about LLMs in Brown et al. (2020); Rae et al. (2021); Chowdhery et al. (2022); Thoppilan et al. (2022); Hoffmann et al. (2022); Smith et al. (2022); Zhang et al. (2022); Lieber et al. (2021); Fedus et al. (2021); Laurençon et al. (2022).

²Learn more about discrete latent variable optimization in Hinton et al. (1995); Le et al. (2018); Mnih & Gregor (2014); Mnih & Rezende (2016); van den Oord et al. (2017); Tucker et al. (2017); Grathwohl et al. (2017); Masrani et al. (2019); Liévin et al. (2020).
The proposed framework is versatile and applies to various settings, including extractive, generative, and multiple-choice models for open-domain question answering, as well as the training of retrieval-enhanced language models. To demonstrate the effectiveness of the framework, we train reader-retriever BioLinkBERT models end-to-end on multiple-choice medical QA tasks and achieve a new state-of-the-art on the MedMCQA of 62.9%, outperforming the current 540B-parameter domain-tuned Med-PaLM by +5.3% (Singhal et al., 2022) using 2,500× fewer parameters (Figure 1). On the challenging MedQA-USMLE, we score 55.0%: a new state-of-the-art in the open-domain setting. We highlight the main contributions of this paper as follows:

1. The VOD framework: tractable, consistent, end-to-end training of retrieval-augmented models.
2. Popularizing Rényi divergence variational inference for natural language tasks.
3. Truncated retriever parameterization: relaxing the top-K retriever approximation to using the top P ≥ K documents.

In addition to our theoretical contributions, we release MedWiki: a subset of Wikipedia tailored to the MedMCQA and USMLE datasets for low-resource research.

## 2. VOD: a Probabilistic Framework for Retrieval-augmented Tasks

Let a question $q$ be defined in a space $\Omega$ (e.g., the space of sequences of tokens) and the set of possible answers be $\mathcal{A} \subseteq \Omega$ with a correct answer denoted $a \in \mathcal{A}$. We introduce a corpus of $N$ documents $\mathcal{D} := \{d_1, \dots, d_N\} \subseteq \Omega^N$. In open-domain tasks, we are interested in modelling the marginal task likelihood with a reader-retriever model $p_\theta(a, d|q) := p_\theta(a|d, q)\, p_\theta(d|q)$ parameterized by $\theta$:

$$p_\theta(a|q) := \sum_{d \in \mathcal{D}} \underbrace{p_\theta(a|d, q)}_{\text{reader}} \, \underbrace{p_\theta(d|q)}_{\text{retriever}} \,. \tag{1}$$

Variational inference (Jordan et al., 1999; Kingma & Welling, 2013; Burda et al., 2015) allows estimating the marginal task likelihood eq. (1) using samples drawn from an approximate posterior $r_\phi(d|a, q)$. This consists of evaluating the evidence lower bound (ELBO), a log-likelihood lower bound.³ In open-domain applications, the approximate posterior, with parameter $\phi$, can be defined using either a keyword-search engine (BM25; Robertson & Zaragoza (2009)), a checkpoint of $p_\theta(d|q)$, or a model learned jointly.

We introduce the VOD framework in four acts: i) why Rényi divergence variational inference can aid likelihood-based learning, ii) the VOD objective: a tractable self-normalized importance sampling estimate of the Rényi bound, iii) a truncated retriever parameterization that generalizes existing approaches, and iv) a discussion of the application of the VOD framework.

³$\mathcal{L}_{\text{ELBO}}(a, q) := \log p_\theta(a|q) - D_{\mathrm{KL}}\left(r_\phi(d|a, q) \,\|\, p_\theta(d|a, q)\right)$

### 2.1. Rényi Divergence Variational Inference

Figure 2. The core component of the VOD framework: the importance-weighted Rényi variational bound (IW-RVB) as a function of the parameter $\alpha \in [0, 1]$ and the number of samples $K \ge 1$. As $K$ increases and $\alpha$ decreases towards 0, the IW-RVB becomes a more accurate estimate of the task likelihood, illustrating how VOD optimizes retrieval-augmented models through the manipulation of $\alpha$ and $K$. See how the parameter $\alpha$ affects the training dynamics in Figure 7, Appendix G.

Rényi divergence variational inference (Li & Turner, 2016) extends traditional variational inference (Jordan et al., 1999; Kingma & Welling, 2013). Given a parameter $\alpha < 1$ and the importance weight $w_{\theta,\phi}(a, q, d) := p_\theta(a, d|q)\, r_\phi^{-1}(d|a, q)$, the variational Rényi bound (RVB) is defined as

$$\mathcal{L}_\alpha(a, q) := \frac{1}{1-\alpha} \log \mathbb{E}_{r_\phi(d|a,q)}\left[ w_{\theta,\phi}^{1-\alpha}(a, q, d) \right] \,. \tag{2}$$

The RVB is a lower bound of the marginal log-likelihood for $\alpha \ge 0$ and is extended by continuity in $\alpha = 1$ as $\mathcal{L}_{\alpha=1}(a, q) := \lim_{\alpha \to 1} \mathcal{L}_\alpha(a, q)$, where it equals the ELBO. In practice, the RVB and its gradients can be estimated using $K$ documents sampled from $r_\phi(d|a, q)$. The resulting importance sampling estimate yields another bound: the importance-weighted RVB (IW-RVB; Li & Turner (2016)):

$$\hat{\mathcal{L}}^K_\alpha(a, q) := \frac{1}{1-\alpha} \log \frac{1}{K} \sum_{i=1}^{K} w_{\theta,\phi}^{1-\alpha}(a, q, d_i)\,, \qquad d_1, \dots, d_K \overset{\text{iid}}{\sim} r_\phi(d|a, q)\,, \tag{3}$$

which aligns with the importance-weighted bound (IWB; Burda et al. (2015)) in $\alpha = 0$. To sum up, the main properties of the RVB and the IW-RVB are ($\alpha \ge 0$):

$$\mathcal{L}_{\alpha=0}(a, q) = \log p_\theta(a|q)\,, \qquad \mathcal{L}_{\alpha=1}(a, q) = \mathcal{L}_{\text{ELBO}}(a, q)\,,$$

$$\mathcal{L}_{\alpha \ge 0}(a, q) \le \log p_\theta(a|q)\,, \qquad \mathbb{E}\big[\hat{\mathcal{L}}^K_\alpha(a, q)\big] \le \mathcal{L}_\alpha(a, q) \,.$$
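To make eq. (3) concrete, below is a minimal PyTorch sketch of the IW-RVB estimator; the function name and tensor layout are ours, and the computation is carried out in log-space for numerical stability.

```python
import math
import torch

def iw_rvb(log_p_joint: torch.Tensor, log_r: torch.Tensor, alpha: float) -> torch.Tensor:
    """K-sample estimate of the IW-RVB, eq. (3), for alpha < 1.

    log_p_joint: [K] values of log p_theta(a, d_i | q), with d_i ~ r_phi(d|a,q)
    log_r:       [K] values of log r_phi(d_i | a, q) for the same documents
    """
    K = log_p_joint.shape[0]
    log_w = log_p_joint - log_r                         # log importance weights
    # (1 / (1 - alpha)) * log( (1/K) * sum_i w_i^(1 - alpha) ), in log-space
    lse = torch.logsumexp((1.0 - alpha) * log_w, dim=0)
    return (lse - math.log(K)) / (1.0 - alpha)
```

Setting `alpha=0` recovers the IWB of Burda et al. (2015); the ELBO is the $\alpha \to 1$ limit and must be handled separately (it reduces to the average of `log_w`).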
**RVB gradient** The gradient of the RVB w.r.t. $\theta$ is

$$\nabla_\theta \mathcal{L}_\alpha(a, q) = \mathbb{E}_{r_\phi}\left[ \overline{w}^{1-\alpha}_{\theta,\phi}(a, q, d)\, \nabla_\theta \log p_\theta(a, d|q) \right] \,, \tag{4}$$

where the normalized importance weight is defined as

$$\overline{w}^{1-\alpha}_{\theta,\phi}(a, q, d) := \frac{w^{1-\alpha}_{\theta,\phi}(a, q, d)}{\mathbb{E}_{r_\phi(d'|a,q)}\left[ w^{1-\alpha}_{\theta,\phi}(a, q, d') \right]} \,. \tag{5}$$

In this paper, we consider the sampling distribution $r_\phi$ to be static and therefore do not estimate the gradient w.r.t. the approximate posterior. Optimizing the parameter $\phi$ jointly with $\theta$ can be done by application of importance sampling coupled with variance reduction techniques (Burda et al., 2015; Mnih & Rezende, 2016; Le et al., 2018; Masrani et al., 2019; Kool et al., 2019b; Liévin et al., 2020).

**Stabilizing training using the RVB** Considering the optimization of the parameter $\phi$, a looser bound (e.g., the ELBO) might be preferred to a tighter one (e.g., the IWB).⁴ In this paper, we explore interpolating between variational bounds using the parameter $\alpha$ of the RVB. We argue that, even for a non-trainable parameter $\phi$, optimizing a looser bound can overcome early optimization challenges.

For $\alpha = 0$, the RVB aligns with the marginal log-likelihood independently of the choice of the approximate posterior. However, when the importance weight $w_{\theta,\phi}(a, q, d)$ suffers from high variance, so does the Monte Carlo estimate of the marginal likelihood and its gradient.⁵ For $\alpha = 1$, the RVB matches the ELBO, and the gradients restricted to the reader and retriever decompose as:

$$\nabla_{\theta_{\text{READ.}}} \mathcal{L}_{\alpha=1}(a, q) = \mathbb{E}_{r_\phi(d|a,q)}\left[ \nabla_\theta \log p_\theta(a|d, q) \right]$$

$$\nabla_{\theta_{\text{RETR.}}} \mathcal{L}_{\alpha=1}(a, q) = -\nabla_\theta D_{\mathrm{KL}}\left( r_\phi(d|a, q) \,\|\, p_\theta(d|q) \right) \,.$$

Maximizing the ELBO corresponds to optimizing the reader and the retriever disjointly. On the reader side, this equals maximizing the answer likelihood $p_\theta(a|d, q)$ in expectation over $r_\phi(d|a, q)$, independently of the value of $p_\theta(d|q)$. On the retriever side, this corresponds to matching the learned retriever $p_\theta(d|q)$ with the approximate posterior, which can be seen as an instance of knowledge distillation of the posterior into the retriever. After an initial learning phase, the RVB can be smoothly interpolated from the ELBO to the marginal task likelihood by controlling the parameter $\alpha$.

⁴Using hybrid ELBO/IWB objectives has been explored in Rainforth et al. (2018); interpolating the RVB has been explored in Liévin et al. (2020).

⁵See Kong (1992); Owen (2013); Nowozin (2015) for an introduction to and discussion of variance and importance sampling.
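In practice, eqs. (4)–(5) can be estimated with a softmax over the $K$ sampled documents. The sketch below (our naming; PyTorch) builds a surrogate loss whose gradient matches this estimate, assuming a static $r_\phi$ as stated above.

```python
import torch

def rvb_surrogate_loss(log_p_joint: torch.Tensor, log_r: torch.Tensor, alpha: float) -> torch.Tensor:
    """Surrogate scalar whose gradient estimates eq. (4) from K samples.

    The normalized weights of eq. (5) are approximated by a softmax over the
    K sampled documents; they are detached because r_phi is kept static.
    """
    log_w = (log_p_joint - log_r).detach()
    w_bar = torch.softmax((1.0 - alpha) * log_w, dim=0)   # self-normalized w^(1 - alpha)
    return -(w_bar * log_p_joint).sum()                   # minimize the negative bound
```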
### 2.2. VOD objective

In ODQA applications, the IW-RVB eq. (3) is generally intractable due to the normalization constant of the retriever eq. (8a), which requires evaluating all documents. The VOD objective is an approximation of the IW-RVB which can be evaluated using $K$ documents sampled without replacement from $r_\phi(d|a, q)$. It is defined as

$$\hat{\mathcal{L}}^K_\alpha(a, q) := \frac{1}{1-\alpha} \log \sum_{i=1}^{K} s_i\, \hat{v}^{1-\alpha}_{\theta,\phi}(a, q, d_i)\,, \qquad (d_1, s_1), \dots, (d_K, s_K) \overset{\text{priority}}{\sim} r_\phi(d|a, q)\,, \tag{6}$$

where the self-normalized importance weight $\hat{v}_{\theta,\phi}$ is defined using the un-normalized retrieval density ratio $\zeta(d) := p_\theta(d|q)\, r_\phi^{-1}(d|a, q)$ as

$$\hat{v}_{\theta,\phi}(a, q, d_i) := \frac{p_\theta(a|q, d_i)\, \zeta(d_i)}{\sum_{j=1}^{K} s_j\, \zeta(d_j)} \,. \tag{7}$$

The set of documents $d_1, \dots, d_K$ is sampled without replacement from $r_\phi(d|a, q)$ using priority sampling (Duffield et al., 2007). The sampling procedure comes with importance weights $s_1, \dots, s_K$ defined such that, for a function $h(d)$, $\sum_{i=1}^{K} s_i\, h(d_i) \approx \mathbb{E}_{r_\phi(d|a,q)}[h(d)]$. We present priority sampling at greater length in Appendix A.

The VOD objective and its gradient are consistent (i.e., converge to the RVB in the limit $K \to N$ with probability one) and can be evaluated with complexity $O(K)$, whereas the IW-RVB is of complexity $O(N)$. Furthermore, the VOD objective approximates the IW-RVB, which itself is guaranteed to approximate the marginal task log-likelihood more tightly as $K \to N$ (Burda et al., 2015). The VOD objective is derived in Appendix B; the VOD gradient is defined in Appendix C. Our implementation of the sampling methods and the VOD objective is available at http://github.com/VodLM/vod.

### 2.3. Truncated retriever parameterization

The VOD framework is compatible with retrievers defined on the whole corpus ($N$ documents). However, in our approach, we truncate the retriever to consider only the top $P$ documents, where $K < P \ll N$. $K$ refers to the number of sampled documents, while $P$ represents the pool of documents from which the top $K$ documents are selected. This truncation provides two key advantages: i) it enables efficient caching of document scores, as only $P$ documents need to be stored in memory, and ii) the value $P$ serves as an exploration-exploitation threshold: a higher value of $P$ yields greater diversity in document sampling, promoting exploration, while a smaller value of $P$ ensures that, during training, all documents in the set $\mathcal{T}_\phi$ are more likely to be visited, facilitating exploitation of the available information.

Assume the retrieval distributions are described by score functions $f_\theta : \Omega^2 \to \mathbb{R}$ and $f_\phi : \Omega^3 \to \mathbb{R}$. We define the truncated retrievers as⁶

$$p_\theta(d|q) := \frac{\mathbb{1}[d \in \mathcal{T}_\phi]\, \exp f_\theta(d, q)}{\sum_{d' \in \mathcal{T}_\phi} \exp f_\theta(d', q)} \tag{8a}$$

$$r_\phi(d|a, q) := \frac{\mathbb{1}[d \in \mathcal{T}_\phi]\, \exp f_\phi(a, q, d)}{\sum_{d' \in \mathcal{T}_\phi} \exp f_\phi(a, q, d')} \tag{8b}$$

where $\mathcal{T}_\phi$ is the set of the top $P \ll N$ documents ranked by the score $f_\phi(a, q, d)$. The score functions $f_\theta$ and $f_\phi$ can be implemented using BM25 and/or contextual vector representations extracted using pretrained language models such as DPR or ColBERT (Karpukhin et al., 2020; Khattab & Zaharia, 2020), for instance using a dual-encoder model $f_\theta(d, q) = \mathrm{BERT}_\theta(d)^T \mathrm{BERT}_\theta(q)$ and $f_\phi(a, q, d) = \mathrm{BERT}_\phi([q; a])^T \mathrm{BERT}_\phi(d)$, where $\mathrm{BERT}$ is the function that returns the output of a BERT model at the CLS token and $[\,\cdot\,;\,\cdot\,]$ is the concatenation operator.

⁶When $P > K$, evaluating the retriever density eq. (8a) is generally intractable due to the sum over the $P$ documents.
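To illustrate eq. (8a) with the dual-encoder scores above, here is a minimal sketch built on Hugging Face Transformers; the checkpoint id and helper names are our assumptions, the top-$P$ support is assumed to be precomputed, and gradients are omitted since this sketch only scores documents.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "michiyasunaga/BioLinkBERT-base"   # assumed checkpoint; any BERT-style encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def cls_embed(texts: list[str]) -> torch.Tensor:
    """Return the encoder output at the CLS token, one row per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def truncated_retriever_log_probs(query: str, top_p_docs: list[str]) -> torch.Tensor:
    """log p_theta(d | q) over the truncated support T_phi of eq. (8a)."""
    q = cls_embed([query])                  # [1, H]
    d = cls_embed(top_p_docs)               # [P, H]
    scores = d @ q.squeeze(0)               # f_theta(d, q) = BERT(d)^T BERT(q), shape [P]
    return torch.log_softmax(scores, dim=0)
```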
Retrieving the top $P$ documents is efficient when using Elasticsearch⁷ and/or faiss (Johnson et al., 2021).

⁷http://www.elastic.co/

### 2.4. Applying VOD

In this paper, we show how to apply the VOD framework to multiple-choice ODQA. Nevertheless, VOD is general-purpose and designed for latent variable models defined on a discrete and finite space. In NLP, it applies to a wide range of settings such as generative, extractive, and multiple-choice ODQA, as well as retrieval-augmented language modelling. Find a non-exhaustive list of examples in Appendix E.

## 3. Related work

VOD aids the development of retrieval-augmented models for language modelling (LM) tasks. In this section, we review previous work on retrieval for LM and compare it to VOD (summarized with references in Table 1).

Table 1. Deep retrievers in the literature, detailing whether training was end-to-end and variational, as well as the size of the retriever support during training.

| Method | Retriever training | End-to-end learning | Posterior guided | Retriever support |
|---|---|---|---|---|
| DPR¹ | Supervised | ✗ | ✗ | – |
| ColBERT² | Supervised | ✗ | ✗ | – |
| Contriever³ | Self-supervised | ✗ | ✗ | – |
| FiD⁴ | Frozen DPR dual-encoder | ✗ | ✗ | – |
| RETRO⁵ | Frozen BERT dual-encoder | ✗ | ✗ | – |
| ORQA⁶ | Self-supervised + MLL* | ✓ | (✓) | top-K doc. |
| RAG⁷ | MLL* + frozen DPR doc. encoder | ✓ | (✓) | top-K doc. |
| REALM⁸ | Self-supervised + MLL* | ✓ | ✗ | top-K doc. |
| EMDR-2⁹ | Self-supervised + Expect.-Max. | ✓ | ✓ | top-K doc. |
| Hindsight¹⁰ | ColBERT init. + ELBO + MLL* | ✓ | ✓ | top-K doc. |
| VOD | Rényi variational bound | ✓ | ✓ | top-P doc. |

¹Karpukhin et al. (2020), ²Khattab et al. (2021), ³Izacard et al. (2021), ⁴Izacard & Grave (2020), ⁵Borgeaud et al. (2021), ⁶Lee et al. (2019), ⁷Lewis et al. (2020), ⁸Guu et al. (2020), ⁹Sachan et al. (2021), ¹⁰Paranjape et al. (2021). *MLL: marginal log-likelihood. $K \le P \ll N$ (K: # of documents in a batch; N: corpus size; P: chosen).

**Learning to search** Retrieval-based training has gained much attention for improving pre-trained LMs. ORQA and Contriever proposed a self-supervised approach using contrastive learning to match a text passage with its context (Inverse Cloze Task; Lee et al. (2019)), an approach widely adopted in pre-training to enable zero-shot retrieval. In contrast, DPR and ColBERT use supervised contrastive learning with questions paired to annotated documents. This method has sparked many retrieval-augmented attempts, such as FiD, RETRO, and RAG, to enhance auto-regressive LMs conditioned on a frozen retriever. ORQA and REALM, later followed by RAG, EMDR, Hindsight, and VOD, proposed optimizing both a retrieval component and a reader or language modelling component end-to-end by maximizing the marginal log-likelihood (MLL).

**Posterior guided supervision** Many efforts have been devoted to leveraging external knowledge with posterior-guided supervision. EMDR learns a retriever end-to-end with an Expectation-Maximization objective evaluated under the posterior distribution $p_\theta(d|a, q) \propto p_\theta(d|q)\, p_\theta(a|d, q)$, while Hindsight optimizes the variational lower bound (ELBO) evaluated under a target-aware approximate posterior $r_\phi(d|a, q)$. Among previous methods, Hindsight is most akin to VOD, as both methods rely on maximizing a variational bound. Nonetheless, VOD introduces the more general Rényi variational bound, which allows modelling the sampling distribution explicitly. Ultimately, a more principled approach makes VOD more versatile and capable of handling a wider range of problems.
**Navigating large knowledge bases** The large size of knowledge bases such as Wikipedia makes it computationally intractable to consider all $N$ documents when computing the MLL. To address this, all related methods rely on a strict truncation of the retriever to the top-$K$ cached documents. In contrast to these approaches, which are limited to a fixed set of $K$ documents, we propose a truncated retriever parameterization that works hand-in-hand with our principled objective to handle the top $P > K$ documents. Ultimately, this allows for more diverse document sampling during training and reduces the bias induced by truncating the retriever distribution. In Appendix D, we show that the top-K MLL is a special case of VOD for $K = P$ and $\alpha = 0$.

## 4. Experiments

In this section, we present the medical domain tasks and datasets, results on end-to-end multiple-choice ODQA, and its application to information retrieval. The code and datasets are available on GitHub.⁸

### 4.1. Datasets

The datasets utilized for the medical domain are summarized in Table 2. We introduce MedWiki, a subset of Wikipedia targeted to medical QA tasks.

Table 2. Medical QA datasets and corpora used in our study, including MedMCQA, USMLE, and the FindZebra (FZ) queries and corpus, with MedWiki as the knowledge base for all QA tasks. Question counts are given for the train/validation/test splits.

| Datasets | MedMCQA | USMLE | FZ queries |
|---|---|---|---|
| Questions | 182.8K/4.2K/6.1K | 10.2K/1.3K/1.3K | 248 |

| Corpora | Wikipedia | MedWiki | FZ corpus |
|---|---|---|---|
| Articles | 6.6M | 293.6K | 30.7K |
| Passages | – | 7.8M | 711.9K |

**MedMCQA** (Pal et al., 2022) is a large-scale multiple-choice question answering dataset collected from Indian medical school entrance exams (AIIMS and NEET-PG). It covers several medical topics (dentistry, pathology, surgery, preventive medicine, etc.) and question types (diagnosis, recalling expert factual knowledge, mathematical problems, etc.).

**MedQA-USMLE** (Jin et al., 2021) is a collection of medical questions from the US medical board exam. The questions aim to assess human doctors' medical knowledge and decision-making. Each question includes a medical history, vital signs (e.g., blood pressure, temperature), and possibly a specific analysis (e.g., a CT scan).

**MMLU** (Hendrycks et al., 2021) is a dataset for assessing the knowledge acquired during pre-training by evaluating models in a zero-shot setting. The test set comprises 57 tasks spanning different domains. We limit our analysis to the subcategories psychology, biology, and health.⁹

**MedWiki** We release the MedWiki corpus (under MIT license): a collection of 4.5% of the articles from the English Wikipedia, targeted to the MedMCQA and USMLE datasets. The MedWiki corpus was built by querying each answer option from the MedMCQA and USMLE datasets against the Wikipedia API. Read more in Appendix H.

**FindZebra corpus & queries** FindZebra is a search tool for assisting in the diagnosis of rare diseases, built on open-source information retrieval software (BM25) tailored to this problem (Dragusin et al., 2013). The FindZebra corpus indexes a collection of curated articles from various reputable databases: GARD, GeneReviews, Genetics Home Reference, OMIM, Orphanet, and Wikipedia. Each article is referenced with a Concept Unique Identifier (CUI) from the Unified Medical Language System (UMLS; Bodenreider (2004)).

⁸https://github.com/findzebra/fz-openqa

⁹The subcategory professional_medicine corresponds to the MedQA-USMLE questions.
We use a collection of 248 publicly available search queries (FZ queries). Each query is labelled with a reference diagnostic, allowing us to benchmark medical search engines.¹⁰

### 4.2. VOD for multiple-choice QA

In the multiple-choice question answering (MCQA) setting, we consider a vector of $M$ answer options $\mathbf{A} = [a_1, \dots, a_M]$, where $\star$ denotes the index of the correct option. Similarly, we define a vector of $M$ queries as $\mathbf{Q} = [q_1, \dots, q_M]$, where $q_j = [q; a_j]$ represents the concatenation of the question with the answer option of index $j$. Additionally, we denote a vector of $M$ documents $\mathbf{D} = [d_1, \dots, d_M] \in \mathcal{D}^M$, and the set of $M$-combinations of documents as $\mathcal{D}^{(M)}$, which contains $N^M$ document vectors. The marginal likelihood is defined as follows:

$$p_\theta(a_\star|\mathbf{Q}) := \sum_{\mathbf{D} \in \mathcal{D}^{(M)}} p_\theta(\mathbf{D}|\mathbf{Q})\, p_\theta(a_\star|\mathbf{D}, \mathbf{Q}) \,. \tag{9}$$

To model this problem, we introduce i) a reader model $g_\theta : \Omega^2 \to \mathbb{R}$, which evaluates the likelihood of answer option $j \in [1, \dots, M]$ given the query and a tuple of documents, and ii) truncated retriever models $p_\theta(d|q_j)$ and $r_\phi(d|q_j)$, which retrieve $K$ documents specific to each answer option. As described in eq. (8a), these models are parameterized by scores $f_\theta(d, q_j)$ and $f_\phi(d, q_j)$, respectively. The reader and retriever models are defined as:

$$p_\theta(a_\star|\mathbf{D}, \mathbf{Q}) := \frac{\exp g_\theta(d_\star, q_\star)}{\sum_{j=1}^{M} \exp g_\theta(d_j, q_j)} \,, \tag{10}$$

$$p_\theta(\mathbf{D}|\mathbf{Q}) = \prod_{j=1}^{M} p_\theta(d_j|q_j)\,, \qquad r_\phi(\mathbf{D}|\mathbf{Q}) = \prod_{j=1}^{M} r_\phi(d_j|q_j) \,.$$

The VOD objective can be applied to approximate the marginal likelihood $p_\theta(a_\star|\mathbf{Q})$ defined in eq. (9). In practice, the VOD objective in a multiple-choice setting implies the retrieval of $KM$ documents per query, resulting in a conditional answering likelihood that encompasses $K^M$ unique combinations. For further details, refer to Appendix E.4.

### 4.3. Experimental Setup

We implement a DPR-like dual-encoder architecture for the retriever with a shared backbone and implement the multiple-choice reader following Devlin et al. (2018). We use the domain-specific BioLinkBERT (Yasunaga et al., 2022) as the backbone for both models and use the MedWiki corpus for all QA experiments. This results in a total of 2 × 110M = 220M parameters: a small retrieval-augmented language model. All experiments were conducted on a single node of 8 RTX 5000 GPUs using half-precision. Further details can be found in Appendix F.

**Hybrid approximate posterior** We parameterize the score $f_\phi$ of the sampling distribution using a composite BM25 score combined with a checkpoint of the retriever score $f_\theta$, denoted $f^{\text{ckpt}}_\phi$. Specifically, we sample documents using

$$f_\phi(a, q, d) := f^{\text{ckpt}}_\phi(d, [q; a]) + \tau^{-1}\left( \mathrm{BM25}(q, d) + \beta\, \mathrm{BM25}(a, d) \right) \,, \tag{11}$$

where $\tau = 5$ and $\beta$ is a parameter scaled proportionally to the ratio of question and answer lengths $L_q / L_a$, to ensure that the BM25 score of the question does not outweigh the answer score. We use $\beta = 1 + 0.5 \max\{0, \log(L_q / L_a)\}$.¹¹ At initialization $f_\theta$ is uninformative, so we set $f^{\text{ckpt}}_\phi = 0$. Combining the two scores may provide a more robust sampling distribution, as it utilizes both the previously learned information and the BM25 relevance of the query to the document.

¹⁰https://huggingface.co/datasets/findzebra

¹¹We picked the parameters to target a relatively high sampling entropy; no extensive hyperparameter search was performed.
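As a direct transcription of eq. (11), here is a small Python sketch; the function name and arguments are ours, and the BM25 scores are assumed to come from an external index (e.g., Elasticsearch).

```python
import math

def hybrid_posterior_score(bm25_q: float, bm25_a: float, f_ckpt: float,
                           len_q: int, len_a: int, tau: float = 5.0) -> float:
    """Sampling score f_phi(a, q, d) of eq. (11).

    Combines the cached retriever checkpoint score with the BM25 scores of
    the question and of the answer against document d; beta rescales the
    answer term so that long questions do not drown it out.
    """
    beta = 1.0 + 0.5 * max(0.0, math.log(len_q / len_a))
    return f_ckpt + (bm25_q + beta * bm25_a) / tau
```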
**Training, periodic re-indexing and annealing** We organize the training into rounds of $T$ steps, similarly to Khattab et al. (2021). As the model is exposed to a progressively larger portion of the dataset over multiple rounds, we expect optimization to result in improved generalization capabilities. At the beginning of each round, for each question-answer pair $q_j$, we retrieve the set of top-$P$ documents $\mathcal{T}_\phi$ and cache the set of values $\{f_\phi(a_j, q, d) \mid d \in \mathcal{T}_\phi\}$; in the first round, $f^{\text{ckpt}}_\phi$ is set to zero. During the first round, we anneal the RVB parameter $\alpha$ from 1 to 0 to stabilize early training by distilling the cached BM25 score $f_\phi(a, q, d) = 0 + \tau^{-1}(\mathrm{BM25}(q, d) + \beta\, \mathrm{BM25}(a, d))$ into the trainable retriever score $f_\theta(d, q)$, as shown in Figure 3. At each training iteration, we sample a set of $K = 8$ documents from $\mathcal{T}_\phi$ for each of the $M = 4$ question-answer pairs and evaluate the VOD objective and its gradient using the cached values of $f_\phi(a_j, q, d)$.

Figure 3. During training, VOD incorporates periodic updates of the cached models. In the initial period, the sampling distribution $r_\phi(d|a, q)$ can be chosen as a domain-specific baseline (BM25). Additionally, a parameter $\alpha > 0$ can be utilized to guide the optimization of $\theta$. Note the approximations $\hat{\mathcal{L}}^K_{\alpha=1} \approx \mathcal{L}_{\text{ELBO}}$ and $\hat{\mathcal{L}}^K_{\alpha=0} \approx \log p_\theta$, demonstrated in the experimental curves in Appendix G.

**Evaluation** At evaluation time, we estimate the likelihood of each answer option using $C = 10$ Monte-Carlo samples, each containing $MK = 4 \times 8 = 32$ documents, using the estimate defined in eq. (6) (see Appendix E.4). Leveraging more samples at inference time allows for approximating the answer likelihood more robustly, as it allows for testing a greater number of combinations of documents.
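As an illustration, one natural way to pool the $C$ per-draw estimates is to average the likelihood estimates in log-space; the sketch below rests on that assumption (the exact pooling rule is specified in Appendix E.4), with names of our choosing.

```python
import math
import torch

def pooled_answer_log_likelihood(vod_estimates: torch.Tensor) -> torch.Tensor:
    """Pool C Monte-Carlo VOD estimates of log p_theta(a_j | Q).

    vod_estimates: [C, M] per-draw log-likelihood estimates, one column per
    answer option. Returns an [M] vector; the prediction is its argmax.
    """
    C = vod_estimates.shape[0]
    return torch.logsumexp(vod_estimates, dim=0) - math.log(C)
```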
### 4.4. QA Benchmark

**MedMCQA** We report the validation and test accuracy of the VOD framework applied to BioLinkBERT (base) and the baselines in Table 3. VOD outperforms both the disjoint BERT-based methods and the recent Med-PaLM (540B parameters) with a new state-of-the-art test accuracy of 62.9%, +0.2% over Codex 5-shot CoT. This is an improvement of +5.3% over Med-PaLM despite using 2,500× fewer parameters. VOD scored a +7.6% improvement over the BioLinkBERT reader with a static BM25 retriever, and +15.9% over the PubMedBERT reader coupled with a DPR retriever.

Table 3. Open-domain question answering accuracy.

| Method | Params. | Finetuning | MedMCQA Valid. | MedMCQA Test | USMLE Valid. | USMLE Test |
|---|---|---|---|---|---|---|
| VOD BioLinkBERT+BM25 | 110M | MedMCQA | 51.6 | 55.3 | – | – |
| VOD BioLinkBERT+BM25 | 110M | USMLE | – | – | 41.0 | 40.4 |
| VOD 2× BioLinkBERT | 220M | MedMCQA | 58.3 | 62.9 | 47.2 | 46.8 |
| VOD 2× BioLinkBERT | 220M | USMLE | – | – | 45.8 | 44.7 |
| VOD 2× BioLinkBERT | 220M | MedMCQA → USMLE† | – | – | 53.6 | 55.0 |
| Disjoint PubMedBERT+DPR¹ | 220M | MedMCQA | 43.0 | 47.0 | – | – |
| Disjoint PubMedBERT+BM25² | 110M | USMLE | – | – | – | 38.1 |
| Disjoint BioLinkBERT+BM25³ | 110M | USMLE | – | – | – | 40.0 |
| Disjoint BioLinkBERT-L+BM25³ | 340M | USMLE | – | – | – | 44.6 |
| Reader only PubMedGPT⁴ | 2.7B | MedMCQA+USMLE | – | – | – | 50.3 |
| Reader only Galactica⁵ | 120B | MedMCQA | – | 52.9 | – | 44.4 |
| Reader only Codex 5-shot CoT⁶ | 175B | – | 59.7 | 62.7 | – | 60.2 |
| Reader only FLAN-PaLM⁷ | 540B | – | – | 56.5 | – | 60.3 |
| Reader only Med-PaLM⁷ | 540B | MedMCQA+USMLE | – | 57.6 | – | 67.6 |
| Random uniform | – | – | 25.0 | 25.0 | 25.0 | 25.0 |
| Human passing score⁶ | – | – | 50.0 | 50.0 | 60.0 | 60.0 |
| Human merit candidate⁶ | – | – | 90.0 | 90.0 | 87.0 | 87.0 |

¹Results from Pal et al. (2022), model from Gu et al. (2021); ²Gu et al. (2021); ³Yasunaga et al. (2022); ⁴Venigalla et al. (2022); ⁵Taylor et al. (2022); ⁶Liévin et al. (2022); ⁷Singhal et al. (2022). †First pretrained on MedMCQA, then finetuned on the USMLE.

**MedQA-USMLE** The validation and test accuracy are shown in Table 3. We found that using VOD with a BioLinkBERT backbone outperforms a BioLinkBERT reader coupled with a BM25 retriever, even when the latter uses the larger version of BioLinkBERT (44.7% for VOD, 40.0% for disjoint BioLinkBERT, 44.6% for the disjoint large BioLinkBERT). Due to the small size of MedQA-USMLE, pretraining on MedMCQA proved beneficial: MedMCQA pretraining with USMLE fine-tuning resulted in VOD achieving a 55.0% test accuracy, a +10.4% improvement over a large BioLinkBERT model with a BM25 retriever. However, Med-PaLM scores +12.6% higher accuracy than the best VOD model.

**MMLU** Table 4 compares the zero-shot performance of VOD, GPT-3, and UnifiedQA on the subcategories psychology, biology, and health. We reused the BioLinkBERT VOD model trained on MedMCQA only. VOD achieved an average accuracy of 54.8% across all 12 tasks, surpassing both GPT-3 (47.0%) and UnifiedQA (48.7%). In particular, VOD excelled in medical_genetics (+36.0%), professional_medicine (+14.4%), and anatomy (+12.5%). Although GPT-3 and UnifiedQA showed competitive results in certain areas, VOD's higher accuracy highlights its robustness across a wider set of medical tasks.

Table 4. Zero-shot accuracy on MMLU (%).

| Task | Subcategory | UnifiedQA | GPT-3 | VOD |
|---|---|---|---|---|
| medical_genetics | health | 40.0 | 40.0 | 76.0 |
| high_school_psychology | psychology | 70.0 | 61.0 | 60.6 |
| college_biology | biology | 40.0 | 45.0 | 59.7 |
| anatomy | health | 43.0 | 46.0 | 58.5 |
| clinical_knowledge | health | 57.0 | 50.0 | 58.5 |
| professional_medicine | health | 43.0 | 38.0 | 57.4 |
| nutrition | health | 48.0 | 50.0 | 56.5 |
| high_school_biology | biology | 53.0 | 48.0 | 55.2 |
| college_medicine | health | 43.0 | 47.0 | 46.8 |
| human_aging | health | 55.0 | 50.0 | 44.4 |
| virology | health | 43.0 | 44.0 | 42.2 |
| professional_psychology | psychology | 49.0 | 45.0 | 42.2 |
| Average | – | 48.7 | 47.0 | 54.8 |

### 4.5. Ablation Study

In Figure 4, we report the performance of a VOD model for multiple variational bounds and diverse truncated retriever support sizes (the number of cached top-$P$ documents).¹²

Figure 4. Answering accuracy and retriever entropy. (a) Variational bounds: effect of the choice of parameter $\alpha$ (ELBO: $\alpha = 1$, IWB: $\alpha = 0$, RVB/VOD: interpolating $\alpha$ from 1 to 0), all using $P = 100$. (b) Exploration/exploitation: effect of the support size $P$ of the truncated retrievers. We sampled $MK = 4 \times 8$ documents per question, resulting in $K^M = 4096$ document combinations (hence the maximum effective sample size is 4096). Higher $P$ values lead to smaller effective sample sizes and slower learning, but better end performance.

**Variational bounds** We tested multiple variational bounds as methods to optimize the model: the ELBO, the importance-weighted bound (IWB), and the RVB. The ELBO and IWB are special cases of the RVB. For the RVB, we annealed the parameter $\alpha$, as in the main experiments, and found that this method resulted in the highest answering accuracy while also resulting in low retriever entropy. This suggests that the retriever was also optimized at a faster rate.

**Exploration vs. exploitation** We experimented with values of $P \in \{8, 32, 100\}$. Using the highest value, $P = 100$, resulted in a smaller effective sample size,¹³ slower learning, but ultimately higher accuracy.

¹²To reduce overall running costs, we used a dual-encoder reader with score function $g_\theta(d, q) = \mathrm{BERT}(q)^T \mathrm{BERT}(d)$.

¹³The effective sample size is correlated with the inverse of the variance of $w_{\theta,\phi}(a, q, d)$; it is a popular diagnostic for importance sampling. See Kong (1992); Owen (2013); Nowozin (2015).

### 4.6. Information retrieval

Despite good QA accuracy, the ability of VOD to yield a meaningful retriever component through the proposed reader-retriever end-to-end training remains, at this point of the paper, to be proven. We therefore benchmarked a VOD retriever trained on MedMCQA against the FindZebra API,¹⁴ which connects to a specialized BM25 search engine targeted at medical professionals (Dragusin et al., 2013). The comparison was done using the set of FindZebra queries and corpus, where searching documents with a BERT-based retriever translates into a nearest-neighbour search problem in the embedding space, which we visualize in Appendix G.

¹⁴https://www.findzebra.com/api/
**Re-purposing MCQA retrievers for semantic search** The BioLinkBERT VOD model trained on the MedMCQA dataset has a retriever component that is trained to rank documents using question-answer pairs $[q; a]$ as inputs (see eq. (10)). Thus, further task adaptation is required to rank documents based solely on queries, without answer options (i.e., using a model $p_\theta(d|q)$ instead of $p_\theta(d|[q; a_j])$). To address this, we use the retriever to teach a query-only student model, which corresponds to knowledge distillation (Hinton et al., 2015). Given pairs of MedMCQA questions and answers $(q, a_\star)$, this translates into minimizing

$$\mathcal{L}_{\text{DISTILL.}} = D_{\mathrm{KL}}\big( \underbrace{r_\phi(d \mid [q; a_\star])}_{\text{MCQA teacher (question+answer)}} \,\big\|\, \underbrace{p_\theta(d \mid q)}_{\text{student (question only)}} \big) \,. \tag{12}$$

**Metrics** In line with Dragusin et al. (2013), we evaluate retrieval by recording the first article that matches the reference CUI (disease concept) and report 100 × the mean reciprocal rank (MRR) and the fraction of queries for which the correct article is returned in the top 20.¹⁵

**Retrieval performance** We evaluated the VOD retriever with and without distillation, a hybrid retriever combining the VOD and BM25 scores (defined as $f^{\text{VOD+BM25}}_\theta(d, q) := f_\theta(d, q) + \tau^{-1}\, \mathrm{BM25}(d, q)$ with $\tau = 5$), and BM25 alone. We found that a VOD retriever trained on MedMCQA via distillation can be competitive with the FindZebra API, and it achieves the best performance when combined with a simple BM25 baseline, resulting in an MRR of 38.9.

Table 5. Retrieval performance on the FindZebra benchmark for a BioLinkBERT retriever trained using VOD on MedMCQA and one trained using task-specific distillation, with and without coupling with a BM25 score during evaluation.

| Method | Distillation | MRR | Hit@20 |
|---|---|---|---|
| VOD | ✗ | 27.8 | 56.9 |
| VOD | ✓ | 31.7 | 58.1 |
| VOD + BM25 | ✓ | 38.9 | 64.1 |
| BM25 | – | 26.4 | 48.4 |
| FindZebra API | – | 30.1 | 59.3 |

**Retriever samples** In Appendix G, Table 7, we present examples of a distilled VOD retriever's top-1 ranked passages, including two successes and two failures. The top-ranked documents were mostly relevant, but the retriever struggled with long keyword-based queries, as shown in row #4. This is likely due to the discrepancy between training on MedMCQA and evaluating on FZ queries.

¹⁵We re-used two of the metrics introduced in the original study. We considered the MRR more adequate than NDCG because not all documents with a relevant CUI can be considered a relevant match; only the highest-ranking one is essential.
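For reference, below is a minimal Python sketch of the two metrics as described above; function and variable names are ours.

```python
def fz_metrics(ranked_cuis: list[list[str]], target_cuis: list[str], k: int = 20):
    """Mean reciprocal rank (x100) and Hit@k over FindZebra-style queries.

    ranked_cuis: per query, the CUIs of the returned articles in rank order.
    target_cuis: per query, the reference diagnosis CUI.
    Only the first (highest-ranking) article matching the reference CUI counts.
    """
    rr, hits = [], []
    for ranking, target in zip(ranked_cuis, target_cuis):
        rank = next((i + 1 for i, cui in enumerate(ranking) if cui == target), None)
        rr.append(1.0 / rank if rank else 0.0)
        hits.append(1.0 if rank is not None and rank <= k else 0.0)
    n = len(rr)
    return 100 * sum(rr) / n, 100 * sum(hits) / n
```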
## 5. Discussion

**Knowledge vs. reasoning tasks** The VOD framework was evaluated on the MedMCQA and USMLE datasets using only BERT-based models. The MedMCQA dataset is designed to evaluate the knowledge of entry-level medical students, whereas the USMLE dataset targets trained medical professionals, who are expected to possess not only a comprehensive understanding of medicine but also the ability to reason about complex medical problems. The results demonstrate the effectiveness of the VOD framework on these specific tasks; however, we speculate that a BERT-sized model may not be sufficient for handling reasoning-intensive questions. As reported in previous studies, larger models like PaLM and Codex have shown exceptional performance on reasoning-heavy questions (Singhal et al., 2022; Liévin et al., 2022).

**Large-scale datasets** The nature of the task is not the sole factor limiting the performance of VOD. We showed that an initial round of training on the larger MedMCQA dataset (182k samples) strongly benefits performance on the USMLE dataset (10k samples).¹⁶ This suggests that VOD might benefit from larger-scale training, including other tasks such as retrieval-augmented language modelling.

¹⁶Pretraining on MedMCQA improved downstream USMLE accuracy by +10.3% compared to training on USMLE only.

**Importance sampling** In contrast to other methods, VOD requires defining the sampling distribution explicitly and thus makes it possible to diagnose the suitability of the sampling distribution. As utilized in Figure 4, we suggest relying on the effective sample size diagnostic to measure the robustness of the likelihood estimates. A small effective sample size, with a value close to one, hints at a mismatch between the sampling distribution $r_\phi(d|a, q)$ and the posterior $p_\theta(d|a, q)$. In that case, the sampling distribution should be adapted and/or optimized end-to-end with the model. Furthermore, the $\alpha$ parameter of the VOD objective can be increased towards one to target looser variational bounds, which often come with a better optimization profile (Rainforth et al., 2018).
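The effective sample size mentioned above can be computed directly from the log importance weights; a minimal PyTorch sketch (our naming):

```python
import torch

def effective_sample_size(log_w: torch.Tensor) -> torch.Tensor:
    """Kong's ESS diagnostic, (sum_i w_i)^2 / sum_i w_i^2, in log-space.

    log_w: [K] log importance weights log w_theta,phi(a, q, d_i).
    Values close to 1 hint at a mismatch between r_phi and the posterior.
    """
    return torch.exp(2 * torch.logsumexp(log_w, 0) - torch.logsumexp(2 * log_w, 0))
```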
**Approximating the IW-RVB** The VOD objective serves as an approximate estimate of the IW-RVB, although its approximation error remains unaddressed. While the VOD objective is consistent w.r.t. the RVB (Appendix B), its reliance on self-normalization introduces a deviation from the strict guarantee of being a lower bound of the marginal log-likelihood, which the IW-RVB provides. Nonetheless, self-normalized importance sampling is generally preferred over un-normalized approaches due to its ability to reduce variance. To thoroughly understand the bias of the VOD objective and its gradient, additional theoretical analysis is required. Despite this, the VOD objective has demonstrated sufficient robustness to enable end-to-end training of retrieval-augmented systems and to efficiently bridge the performance gap with larger, non-retrieval-augmented language models, as shown in Figure 1.

## 6. Conclusion

In conclusion, this study has provided a comprehensive examination of methods for enhancing retrieval-augmented models through variational inference. The proposed probabilistic framework, VOD, is a promising solution for achieving tractable, consistent, and end-to-end training of retrieval-augmented models. Through a series of extensive experiments on multiple-choice medical exam questions, utilizing the MedMCQA and MedQA-USMLE datasets, the effectiveness of the proposed framework has been demonstrated. The findings indicate that leveraging the Rényi variational bound yields better end-to-end performance while also optimizing at a faster rate. Additionally, this study has introduced a truncated retriever parameterization with variable support size $P$, which generalizes the existing top-K parameterization and allows for likelihood-based optimization over a fuller range of documents. Furthermore, the results have shown that VOD outperforms the state-of-the-art Codex and the domain-tuned Med-PaLM on MedMCQA in terms of both accuracy and parameter efficiency. In the future, we plan to investigate variations of VOD to enhance its versatility in modelling other datasets and tasks, as well as to explore the possibility of jointly learning the approximate posterior. Overall, this research provides a promising direction for designing and training likelihood-based models for retrieval-augmented tasks. We hope this research will help popularize recent advances in variational inference and importance sampling, in the field of natural language processing and beyond.

## Acknowledgements

VL's work was funded in part by Google DeepMind through a PhD grant. OW's work was funded in part by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). VL and OW acknowledge support from the Pioneer Centre for AI, DNRF grant number P1.

## References

Bodenreider, O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(Database issue):D267–70, January 2004. doi: 10.1093/nar/gkh061.

Borgeaud, S., Mensch, A., Hoffmann, J., and others. Improving language models by retrieving from trillions of tokens. December 2021.

Brown, T., Mann, B., Ryder, N., and others. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. September 2015.

Chen, D., Fisch, A., Weston, J., and Bordes, A. Reading Wikipedia to answer open-domain questions. March 2017.

Chowdhery, A., Narang, S., Devlin, J., and others. PaLM: Scaling language modeling with pathways. April 2022.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. October 2018.

Dragusin, R., Petcu, P., Lioma, C., Larsen, B., Jørgensen, H. L., Cox, I. J., Hansen, L. K., Ingwersen, P., and Winther, O. FindZebra: A search engine for rare diseases. International Journal of Medical Informatics, 82(6):528–538, June 2013. doi: 10.1016/j.ijmedinf.2013.01.005.
Duffield, N., Lund, C., and Thorup, M. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM, 54(6):32–es, December 2007. doi: 10.1145/1314690.1314696.

Falcon, W. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.

Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. October 2017.

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., and Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23, October 2021. doi: 10.1145/3458754.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In Daumé III, H. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 3929–3938. PMLR, 2020.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint, 2015.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, May 1995. doi: 10.1126/science.7761831.

Hoffmann, J., Borgeaud, S., Mensch, A., and others. Training compute-optimal large language models, 2022.

Izacard, G. and Grave, E. Leveraging passage retrieval with generative models for open domain question answering. July 2020.

Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. Unsupervised dense information retrieval with contrastive learning. December 2021.

Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. Few-shot learning with retrieval augmented language models. August 2022.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., and Szolovits, P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, July 2021. doi: 10.3390/app11146421.

Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, July 2021. doi: 10.1109/tbdata.2019.2921572.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, November 1999. doi: 10.1023/A:1007665907178.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. January 2020.
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-T. Dense passage retrieval for open-domain question answering. April 2020.

Khattab, O. and Zaharia, M. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48. Association for Computing Machinery, New York, NY, USA, July 2020. doi: 10.1145/3397271.3401075.

Khattab, O., Potts, C., and Zaharia, M. Relevance-guided supervision for OpenQA with ColBERT. Transactions of the Association for Computational Linguistics, 9:929–944, September 2021. doi: 10.1162/tacl_a_00405.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. December 2013.

Kong, A. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep., 1992.

Kool, W., van Hoof, H., and Welling, M. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3499–3508. PMLR, 2019a.

Kool, W., van Hoof, H., and Welling, M. Buy 4 REINFORCE samples, get a baseline for free! March 2019b.

Laurençon, H., Saulnier, L., Wang, T., and others. The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset. June 2022.

Le, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. Revisiting reweighted wake-sleep for models with stochastic control flow. May 2018.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, February 2020. doi: 10.1093/bioinformatics/btz682.

Lee, K., Chang, M.-W., and Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1612.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. October 2019.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 9459–9474. Curran Associates, Inc., 2020.
Li, Y. and Turner, R. E. Rényi divergence variational inference. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1073–1081. Curran Associates, Inc., 2016.

Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021.

Liévin, V., Dittadi, A., Christensen, A., and Winther, O. Optimal variance control of the score-function gradient estimator for importance-weighted bounds. In Advances in Neural Information Processing Systems, volume 33, pp. 16591–16602, 2020.

Liévin, V., Hother, C. E., and Winther, O. Can large language models reason about medical questions? July 2022.

Masrani, V., Le, T. A., and Wood, F. The thermodynamic variational objective. June 2019.

Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. January 2014.

Mnih, A. and Rezende, D. Variational inference for Monte Carlo objectives. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2188–2196, New York, New York, USA, 2016. PMLR.

Nowozin, S. Effective sample size in importance sampling. http://www.nowozin.net/sebastian/blog/effective-sample-size-in-importance-sampling.html, September 2015. Accessed: 2022-5-9.

Owen, A. B. Monte Carlo theory, methods and examples. 2013.

Pal, A., Umapathi, L. K., and Sankarasubbu, M. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Flores, G., Chen, G. H., Pollard, T., Ho, J. C., and Naumann, T. (eds.), Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pp. 248–260. PMLR, 2022.

Paranjape, A., Khattab, O., Potts, C., Zaharia, M., and Manning, C. D. Hindsight: Posterior-guided training of retrievers for improved open-ended generation. October 2021.

Paszke, A., Gross, S., Massa, F., Lerer, A., and others. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.

Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, W. X., Dong, D., Wu, H., and Wang, H. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5835–5847, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.466.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Rae, J. W., Borgeaud, S., Cai, T., and others. Scaling language models: Methods, analysis & insights from training Gopher. December 2021.
L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J.-B., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B., Weidinger, L., Gabriel, I., Isaac, W., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. Scaling language models: Methods, analysis & insights from training gopher. December 2021. Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. February 2018. Robertson, S. and Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333 389, 2009. ISSN 1554-0669. doi: 10.1561/1500000019. Sachan, D. S., Reddy, S., Hamilton, W., Dyer, C., and Yogatama, D. End-to-End training of Multi-Document reader and retriever for Open-Domain question answering. Neur IPS, 2021. Singhal, K., Azizi, S., Tu, T., Sara Mahdavi, S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Scharli, N., Chowdhery, A., Mansfield, P., Aguera y Arcas, B., Webster, D., Corrado, G. S., Matias, Y., Chou, K., Gottweis, J., Tomasev, N., Liu, Y., Rajkomar, A., Barral, J., Semturs, C., Karthikesalingam, A., and Natarajan, V. Large language models encode clinical knowledge. December 2022. Variational Open-Domain Question Answering Smith, S., Patwary, M., Norick, B., Le Gresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., Zhang, E., Child, R., Aminabadi, R. Y., Bernauer, J., Song, X., Shoeybi, M., He, Y., Houston, M., Tiwary, S., and Catanzaro, B. Using Deep Speed and megatron to train Megatron-Turing NLG 530b, a Large-Scale generative language model, 2022. Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., Xiang, A., Parrish, A., Nie, A., Hussain, A., Askell, A., Dsouza, A., Slone, A., Rahane, A., Iyer, A. S., Andreassen, A., Madotto, A., Santilli, A., Stuhlmüller, A., Dai, A., La, A., Lampinen, A., Zou, A., Jiang, A., Chen, A., Vuong, A., Gupta, A., Gottardi, A., Norelli, A., Venkatesh, A., Gholamidavoodi, A., Tabassum, A., Menezes, A., Kirubarajan, A., Mullokandov, A., Sabharwal, A., Herrick, A., Efrat, A., Erdem, A., Karaka s, A., Roberts, B. R., Loe, B. S., Zoph, B., Bojanowski, B., Özyurt, B., Hedayatnia, B., Neyshabur, B., Inden, B., Stein, B., Ekmekci, B., Lin, B. Y., Howald, B., Diao, C., Dour, C., Stinson, C., Argueta, C., Ramírez, C. F., Singh, C., Rathkopf, C., Meng, C., Baral, C., Wu, C., Callison Burch, C., Waites, C., Voigt, C., Manning, C. D., Potts, C., Ramirez, C., Rivera, C. E., Siro, C., Raffel, C., Ashcraft, C., Garbacea, C., Sileo, D., Garrette, D., Hendrycks, D., Kilman, D., Roth, D., Freeman, D., Khashabi, D., Levy, D., González, D. M., Perszyk, D., Hernandez, D., Chen, D., Ippolito, D., Gilboa, D., Dohan, D., Drakard, D., Jurgens, D., Datta, D., Ganguli, D., Emelin, D., Kleyko, D., Yuret, D., Chen, D., Tam, D., Hupkes, D., Misra, D., Buzan, D., Mollo, D. 
C., Yang, D., Lee, D.-H., Shutova, E., Cubuk, E. D., Segal, E., Hagerman, E., Barnes, E., Donoway, E., Pavlick, E., Rodola, E., Lam, E., Chu, E., Tang, E., Erdem, E., Chang, E., Chi, E. A., Dyer, E., Jerzak, E., Kim, E., Manyasi, E. E., Zheltonozhskii, E., Xia, F., Siar, F., Martínez-Plumed, F., Happé, F., Chollet, F., Rong, F., Mishra, G., Winata, G. I., de Melo, G., Kruszewski, G., Parascandolo, G., Mariani, G., Wang, G., Jaimovitch-López, G., Betz, G., Gur-Ari, G., Galijasevic, H., Kim, H., Rashkin, H., Hajishirzi, H., Mehta, H., Bogar, H., Shevlin, H., Schütze, H., Yakura, H., Zhang, H., Wong, H. M., Ng, I., Noble, I., Jumelet, J., Geissinger, J., Kernion, J., Hilton, J., Lee, J., Fisac, J. F., Simon, J. B., Koppel, J., Zheng, J., Zou, J., Koco n, J., Thompson, J., Kaplan, J., Radom, J., Sohl-Dickstein, J., Phang, J., Wei, J., Yosinski, J., Novikova, J., Bosscher, J., Marsh, J., Kim, J., Taal, J., Engel, J., Alabi, J., Xu, J., Song, J., Tang, J., Waweru, J., Burden, J., Miller, J., Balis, J. U., Berant, J., Frohberg, J., Rozen, J., Hernandez-Orallo, J., Boudeman, J., Jones, J., Tenenbaum, J. B., Rule, J. S., Chua, J., Kanclerz, K., Livescu, K., Krauth, K., Gopalakr- ishnan, K., Ignatyeva, K., Markert, K., Dhole, K. D., Gimpel, K., Omondi, K., Mathewson, K., Chiafullo, K., Shkaruta, K., Shridhar, K., Mc Donell, K., Richardson, K., Reynolds, L., Gao, L., Zhang, L., Dugan, L., Qin, L., Contreras-Ochando, L., Morency, L.-P., Moschella, L., Lam, L., Noble, L., Schmidt, L., He, L., Colón, L. O., Metz, L., Senel, L. K., Bosma, M., Sap, M., ter Hoeve, M., Farooqi, M., Faruqui, M., Mazeika, M., Baturan, M., Marelli, M., Maru, M., Quintana, M. J. R., Tolkiehn, M., Giulianelli, M., Lewis, M., Potthast, M., Leavitt, M. L., Hagen, M., Schubert, M., Baitemirova, M. O., Arnaud, M., Mc Elrath, M., Yee, M. A., Cohen, M., Gu, M., Ivanitskiy, M., Starritt, M., Strube, M., Sw edrowski, M., Bevilacqua, M., Yasunaga, M., Kale, M., Cain, M., Xu, M., Suzgun, M., Tiwari, M., Bansal, M., Aminnaseri, M., Geva, M., Gheini, M., T, M. V., Peng, N., Chi, N., Lee, N., Krakover, N. G.-A., Cameron, N., Roberts, N., Doiron, N., Nangia, N., Deckers, N., Muennighoff, N., Keskar, N. S., Iyer, N. S., Constant, N., Fiedel, N., Wen, N., Zhang, O., Agha, O., Elbaghdadi, O., Levy, O., Evans, O., Casares, P. A. M., Doshi, P., Fung, P., Liang, P. P., Vicol, P., Alipoormolabashi, P., Liao, P., Liang, P., Chang, P., Eckersley, P., Htut, P. M., Hwang, P., Miłkowski, P., Patil, P., Pezeshkpour, P., Oli, P., Mei, Q., Lyu, Q., Chen, Q., Banjade, R., Rudolph, R. E., Gabriel, R., Habacker, R., Delgado, R. R., Millière, R., Garg, R., Barnes, R., Saurous, R. A., Arakawa, R., Raymaekers, R., Frank, R., Sikand, R., Novak, R., Sitelew, R., Le Bras, R., Liu, R., Jacobs, R., Zhang, R., Salakhutdinov, R., Chi, R., Lee, R., Stovall, R., Teehan, R., Yang, R., Singh, S., Mohammad, S. M., Anand, S., Dillavou, S., Shleifer, S., Wiseman, S., Gruetter, S., Bowman, S. R., Schoenholz, S. S., Han, S., Kwatra, S., Rous, S. A., Ghazarian, S., Ghosh, S., Casey, S., Bischoff, S., Gehrmann, S., Schuster, S., Sadeghi, S., Hamdan, S., Zhou, S., Srivastava, S., Shi, S., Singh, S., Asaadi, S., Gu, S. S., Pachchigar, S., Toshniwal, S., Upadhyay, S., Shyamolima, Debnath, Shakeri, S., Thormeyer, S., Melzi, S., Reddy, S., Makini, S. P., Lee, S.-H., Torene, S., Hatwar, S., Dehaene, S., Divic, S., Ermon, S., Biderman, S., Lin, S., Prasad, S., Piantadosi, S. T., Shieber, S. 
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022.

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. November 2022.

Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhao, V., Zhou, Y., Chang, C.-C., Krivokon, I., Rusch, W., Pickett, M., Srinivasan, P., Man, L., Meier-Hellstern, K., Morris, M. R., Doshi, T., Santos, R. D., Duke, T., Soraker, J., Zevenbergen, B., Prabhakaran, V., Diaz, M., Hutchinson, B., Olson, K., Molina, A., Hoffman-John, E., Lee, J., Aroyo, L., Rajakumar, R., Butryna, A., Lamm, M., Kuzmina, V., Fenton, J., Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. LaMDA: Language models for dialog applications. January 2022.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2627–2636. Curran Associates, Inc., 2017.

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. November 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. ISSN 1049-5258.

Venigalla, A., Frankle, J., and Carbin, M. PubMed GPT: A domain-specific large language model for biomedical text. https://www.mosaicml.com/blog/introducing-pubmed-gpt, 2022. Accessed: 2022-12-16.

Vieira, T. Estimating means in a finite universe. https://timvieira.github.io/blog/post/2017/07/03/estimating-means-in-a-finite-universe/, 2017. Accessed: 2022.

Yasunaga, M., Leskovec, J., and Liang, P. LinkBERT: Pretraining language models with document links. March 2022.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: Open pre-trained transformer language models, 2022.

A. Priority sampling

Figure 5. Estimation of the weighted average $\mu = \mathbb{E}_p[g]$ with weights $p_i := \exp f_i / \sum_{j=1}^N \exp f_j$, where $f_i \sim \mathcal{N}(0, 3^2)$ and $N = 100$.
We compare standard Monte Carlo (sampling with replacement) with priority sampling and with self-normalized priority sampling (both sampling without replacement). In the left panel, we use $g_i = f_i$; in the right panel, we use independent values $g_i \sim \mathcal{N}(0, 3^2)$ (sampled independently of $f_i$). We report the 80% CI over 10k estimates, each with $K = 1, \dots, 100$. Priority sampling exhibits higher variance than standard MC when $g_i = f_i$; self-normalized priority sampling achieves lower variance than standard MC.

Given a set of probabilities $p_1, \dots, p_N$ and a function with values $f_1, \dots, f_N$, priority sampling (Duffield et al., 2007) allows estimating the sum $\sum_{i=1}^N p_i f_i$ using a subset of $K < N$ samples drawn without replacement.17 For a sequence of random weights $u_1, \dots, u_N \overset{\text{iid}}{\sim} \mathrm{Uniform}(0, 1]$, we define the priority keys $p_i / u_i$, set $\tau$ to be the $(K+1)$-th largest key, and define the set of $K$ samples $S = \{i \in [1, N] \mid p_i / u_i > \tau\}$. Using importance weights $s_i := \max(p_i, \tau)$, priority sampling yields an unbiased estimate of the weighted mean:

$$\mathbb{E}_{p(u_1, \dots, u_N)}\Big[ \sum_{i \in S} s_i f_i \Big] = \sum_{i=1}^N p_i f_i \,. \tag{13}$$

Self-normalized importance sampling. Empirically, the estimator eq. (13) might suffer from high variance. We follow Kool et al. (2019a) and use self-normalized importance weights $\bar{s}_i := s_i / \sum_{j \in S} s_j$ to reduce variance at the cost of introducing a bias. The resulting estimator $\sum_{i \in S} \bar{s}_i f_i$ is biased but consistent: it equals the true expected value for $K = N$. The VOD objective uses self-normalized priority sampling.

Illustration. In Figure 5, we visualize the variance of three estimators in two cases: a standard Monte Carlo (MC) estimator, a priority sampling estimator, and a priority sampling estimator with self-normalized weights. In both cases, the variance of the self-normalized priority estimate is upper-bounded by the variance of the standard MC estimate and converges to zero at a faster rate. The un-normalized priority estimator suffers from large variance in the case $g_i = f_i$, whereas the self-normalized priority estimator benefits from lower variance in both cases.

Product of priority sampling estimates. Let $Z = [z_1, \dots, z_M]$ be a vector of $M$ independent variables defined on sets $Z_1, \dots, Z_M$, each of size $N$. The vector $Z$ is defined on the set $Z^{(M)} = Z_1 \times \dots \times Z_M$, the Cartesian product of the $M$ sets, which corresponds to $N^M$ combinations. Given a probability distribution $p(Z) = \prod_{j=1}^M p(z_j)$, we draw $K$ samples for each component using priority sampling:

$$S_j = \{z_{j,1}, \dots, z_{j,K}\} \tag{14a}$$
$$(z_{j,1}, s_j[z_{j,1}]), \dots, (z_{j,K}, s_j[z_{j,K}]) \overset{\text{priority}}{\sim} p(z_j) \,. \tag{14b}$$

17 We recommend Vieira (2017) for a great introduction to priority sampling.

Combining the per-component priority samples and defining a product priority weight allows estimating the average of a function $h(Z)$ weighted by $p(Z)$. Defining the product of priority weights as $s(Z) := \prod_{j=1}^M s_j[z_j]$, we have:

$$\begin{aligned}
\mathbb{E}_{p(Z)}[h(Z)] &= \mathbb{E}_{p(z_1)} \cdots \mathbb{E}_{p(z_M)}[h(Z)] & \text{(15a)} \\
&\approx \sum_{z_1 \in S_1} s_1[z_1] \cdots \sum_{z_M \in S_M} s_M[z_M]\, h(Z) & \text{(15b)} \\
&= \sum_{z_1 \in S_1} \cdots \sum_{z_M \in S_M} s_1[z_1] \cdots s_M[z_M]\, h(Z) & \text{(15c)} \\
&= \sum_{Z \in S^{(M)}} s(Z)\, h(Z) \,. & \text{(15d)}
\end{aligned}$$

B. VOD objective

Given a reader model $p_\theta(a|d, q)$, a retriever model $p_\theta(d|q)$ and a proposal $r_\phi(d|a, q)$, the VOD objective is:

$$\hat{\mathcal{L}}_\alpha^K(a, q) := \frac{1}{1-\alpha} \log \sum_{i=1}^K s_i\, \hat{v}_{\theta,\phi}^{1-\alpha}(a, q, d_i) \tag{16a}$$
$$(d_1, s_1), \dots, (d_K, s_K) \overset{\text{priority}}{\sim} r_\phi(d|a, q) \tag{16b}$$
$$\hat{v}_{\theta,\phi}(a, q, d_i) := \frac{p_\theta(a|q, d_i)\, \zeta(d_i)}{\sum_{j=1}^K s_j\, \zeta(d_j)} \,. \tag{16c}$$

The VOD objective is a self-normalized importance sampling estimate of the RVB, and thus converges with probability one (consistency).
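For concreteness, the following is a minimal NumPy sketch of priority sampling with self-normalized weights (Appendix A), producing the pairs $(d_i, s_i)$ consumed by eq. (16). This is our own illustration under the definitions above, not the authors' released implementation; all function and variable names are hypothetical.

```python
import numpy as np

def priority_sample(p, K, rng):
    """Draw K of N items without replacement via priority sampling
    (Duffield et al., 2007) and return self-normalized weights."""
    assert K < len(p)
    u = rng.uniform(np.finfo(float).tiny, 1.0, size=len(p))  # random keys u_i in (0, 1)
    keys = p / u                          # priority keys p_i / u_i
    order = np.argsort(-keys)             # sort keys, largest first
    S = order[:K]                         # indices of the K largest keys
    tau = keys[order[K]]                  # (K+1)-th largest key
    s = np.maximum(p[S], tau)             # unbiased weights s_i = max(p_i, tau), eq. (13)
    return S, s / s.sum()                 # self-normalize: biased but consistent

# Toy check: estimate E_p[f] = sum_i p_i f_i with K << N samples.
rng = np.random.default_rng(0)
f = rng.normal(0.0, 3.0, size=100)        # values f_i ~ N(0, 3^2), as in Figure 5
p = np.exp(f) / np.exp(f).sum()           # weights p_i = softmax(f)_i
S, s_bar = priority_sample(p, K=16, rng=rng)
print((s_bar * f[S]).sum(), "vs exact", (p * f).sum())
```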
Denoting $T_\phi$ the support of $p_\theta(d|q)$, we have:

$$\lim_{K \to |T_\phi|} \underbrace{\hat{\mathcal{L}}_\alpha^K(a, q)}_{\text{VOD}} = \underbrace{\mathcal{L}_\alpha(a, q)}_{\text{RVB}} \,. \tag{17}$$

Without loss of generality, we consider a joint reader-retriever model $p_\theta(a, d|q) = p_\theta(a|d, q)\, p_\theta(d|q)$ with retriever and sampling distribution defined on a support of documents $T_\phi$18 and parameterized as

$$p_\theta(d|q) := Z_\theta^{-1} \exp f_\theta(d, q), \qquad r_\phi(d|a, q) := Z_\phi^{-1} \exp f_\phi(a, d, q) \tag{18}$$
$$Z_\theta := \sum_{d \in T_\phi} \exp f_\theta(d, q), \qquad Z_\phi := \sum_{d \in T_\phi} \exp f_\phi(a, d, q) \,. \tag{19}$$

In this section, we first detail the properties of the VOD objective: its complexity and its relation to the importance-weighted Rényi variational bound (IW-RVB). As a second step, we derive the VOD objective and prove that it is consistent: the VOD objective converges to the IW-RVB with probability 1 as $K \to |T_\phi|$.

B.1. Complexity O(K)

Evaluating the VOD objective eq. (16a) only requires evaluating $p_\theta(a|d, q)$ (complexity $O(1)$, generally one BERT/LM call) and the retrieval score $f_\theta(d, q)$ for each document $d_1, \dots, d_K$ (complexity $O(1 + K)$, generally one BERT/LM call per document plus one call to encode the query $q$). Evaluating the VOD objective does not require evaluating the constant $Z_\theta$ (complexity $O(P)$, one call for each document in the set $T_\phi$). This results in a computational complexity of $O(2 + K) = O(K)$.19

18 $T_\phi$ can be chosen as the entire corpus of documents.
19 The scores $f_\phi(d_1), \dots, f_\phi(d_K)$ of the sampling distribution are computed offline and can therefore be ignored.

B.2. VOD, IW-RVB, ELBO and marginal likelihood

Using a set $d_1, \dots, d_K \sim r_\phi(d|a, q)$ sampled with replacement, the importance-weighted Rényi variational bound (IW-RVB) is defined as:

$$\hat{\mathcal{L}}_\alpha^{K,\mathrm{IW}}(a, q) := \frac{1}{1-\alpha} \log \frac{1}{K} \sum_{i=1}^K w_{\theta,\phi}^{1-\alpha}(a, q, d_i) \,. \tag{20}$$

The IW-RVB is a lower bound of the log-likelihood and, for $\alpha = 0$, increasing the number of samples results in a tighter log-likelihood lower bound (Burda et al., 2015):

$$\mathcal{L}_{\text{ELBO}}(a, q) \le \hat{\mathcal{L}}_{\alpha=0}^{K,\mathrm{IW}}(a, q) \le \hat{\mathcal{L}}_{\alpha=0}^{K+1,\mathrm{IW}}(a, q) \le \log p_\theta(a|q) \,. \tag{21}$$

At $\alpha = 1$, the RVB is defined by continuity as the ELBO (Li & Turner, 2016). In that case, increasing the number of Monte Carlo samples $K$ does not result in a tighter bound:

$$\mathbb{E}_{r_\phi(d_1,\dots,d_K|a,q)}\big[\hat{\mathcal{L}}_{\alpha \to 1}^{K,\mathrm{IW}}(a, q)\big] = \mathbb{E}_{r_\phi(d_1,\dots,d_K|a,q)}\Big[\frac{1}{K}\sum_{i=1}^K \log w_{\theta,\phi}(a, q, d_i)\Big] = \mathbb{E}_{r_\phi(d|a,q)}\big[\log w_{\theta,\phi}(a, q, d)\big] = \mathcal{L}_{\text{ELBO}}(a, q) \,. \tag{22}$$

The VOD objective is a self-normalized importance sampling estimate of the RVB, whereas the IW-RVB is a standard importance sampling estimate. The VOD objective only differs from the IW-RVB because (i) VOD relies on self-normalized priority sampling (eq. (28a)), and (ii) the normalizing constant $Z_\theta Z_\phi^{-1}$ in the expression of the importance weight $w_{\theta,\phi}(a, q, d)$ is estimated with a self-normalized priority sampling estimate (eq. (28b)).

B.3. Derivation of the VOD objective

In this section, we derive the VOD objective. We begin by expressing the ratio of normalization constants $Z_\theta / Z_\phi$ as a function of $\zeta$ (section B.3.1), and then apply this identity to approximate the importance weight $w_{\theta,\phi}(a, q, d)$ (section B.3.2). We conclude by deriving the VOD objective: an approximation of the IW-RVB using (i) priority sampling and (ii) the importance weight estimate (section B.3.3).

B.3.1. RATIO OF NORMALIZING CONSTANTS $Z_\theta / Z_\phi$

The quantity $Z_\theta / Z_\phi$ can be expressed as a function of the ratio of un-normalized retriever densities $\zeta(d) := \exp f_\theta(d, q) / \exp f_\phi(a, d, q)$ using the following identity:

$$Z_\theta Z_\phi^{-1} = \mathbb{E}_{r_\phi(d|a,q)}[\zeta(d)] \,. \tag{23}$$

Proof. The equality arises from the definition of the right-hand term:

$$\mathbb{E}_{r_\phi(d|a,q)}[\zeta(d)] := \sum_{d \in T_\phi} r_\phi(d|a, q)\, \frac{\exp f_\theta(d, q)}{\exp f_\phi(a, d, q)} \tag{24a}$$
$$= \sum_{d \in T_\phi} \frac{\exp f_\phi(a, d, q)}{Z_\phi}\, \frac{\exp f_\theta(d, q)}{\exp f_\phi(a, d, q)} = Z_\theta Z_\phi^{-1} \,. \tag{24b}$$
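As a quick numerical illustration of identity (23), the sketch below (ours, under the parameterization of eqs. (18)–(19); sampling with replacement for simplicity, where the paper uses priority sampling) checks that a Monte Carlo estimate of $\mathbb{E}_{r_\phi}[\zeta(d)]$ recovers $Z_\theta / Z_\phi$ on a toy corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000                                  # toy corpus size |T_phi|
f_theta = rng.normal(size=N)               # retriever scores f_theta(d, q)
f_phi = rng.normal(size=N)                 # proposal scores f_phi(a, d, q)

Z_theta, Z_phi = np.exp(f_theta).sum(), np.exp(f_phi).sum()
r = np.exp(f_phi) / Z_phi                  # proposal r_phi(d | a, q), eq. (18)

# Monte Carlo estimate of E_r[zeta(d)], with zeta(d) = exp(f_theta - f_phi)
idx = rng.choice(N, size=256, p=r)
zeta = np.exp(f_theta[idx] - f_phi[idx])
print(f"estimate: {zeta.mean():.4f}  exact Z_theta/Z_phi: {Z_theta / Z_phi:.4f}")
```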
B.3.2. ESTIMATION OF THE IMPORTANCE WEIGHT $w_{\theta,\phi}$

The importance weight $w_{\theta,\phi}(a, q, d)$ can be approximated using $K$ retrieval scores $f_\theta(d_1), \dots, f_\theta(d_K)$:

$$w_{\theta,\phi}(a, q, d) \approx \hat{v}_{\theta,\phi}(a, q, d) := \frac{p_\theta(a|q, d)\, \zeta(d)}{\sum_{j=1}^K s_j\, \zeta(d_j)} \tag{25a}$$
$$(d_1, s_1), \dots, (d_K, s_K) \overset{\text{priority}}{\sim} r_\phi(d|a, q) \,. \tag{25b}$$

Proof. Using eq. (23), we can express $w_{\theta,\phi}(a, q, d)$ as a function of the un-normalized retriever density ratio $\zeta$:

$$w_{\theta,\phi}(a, d, q) := \frac{p_\theta(a|d, q)\, p_\theta(d|q)}{r_\phi(d|a, q)} \tag{26a}$$
$$= p_\theta(a|d, q)\, \zeta(d)\, \big(Z_\theta Z_\phi^{-1}\big)^{-1} \tag{26b}$$
$$= p_\theta(a|d, q)\, \zeta(d)\, \mathbb{E}_{r_\phi(d'|a,q)}[\zeta(d')]^{-1} \,. \tag{26c}$$

The expected value of $\zeta(d)$ can be estimated via Monte Carlo. Using priority sampling with samples $d_1, \dots, d_K \sim r_\phi(d|a, q)$ and normalized priority weights $s_1, \dots, s_K$ (section A), we obtain:

$$w_{\theta,\phi}(a, d, q) \approx \frac{p_\theta(a|d, q)\, \zeta(d)}{\sum_{j=1}^K s_j\, \zeta(d_j)} = \hat{v}_{\theta,\phi}(a, d, q) \,. \tag{27}$$

B.3.3. THE VOD OBJECTIVE

Given document samples $d_1, \dots, d_K \overset{\text{priority}}{\sim} r_\phi(d|a, q)$ with self-normalized priority weights $s_1, \dots, s_K$, the VOD objective $\hat{\mathcal{L}}_\alpha^K(a, q)$ is an approximation of the IW-RVB ($\hat{\mathcal{L}}_\alpha^{K,\mathrm{IW}}(a, q)$, eq. (20)):

$$\hat{\mathcal{L}}_\alpha^{K,\mathrm{IW}}(a, q) \approx \frac{1}{1-\alpha} \log \sum_{i=1}^K s_i\, w_{\theta,\phi}^{1-\alpha}(a, q, d_i) \qquad \text{(priority sampling)} \tag{28a}$$
$$\approx \frac{1}{1-\alpha} \log \sum_{i=1}^K s_i\, \hat{v}_{\theta,\phi}^{1-\alpha}(a, q, d_i) = \hat{\mathcal{L}}_\alpha^K(a, q) \,. \qquad \text{(inserting eq. (25a))} \tag{28b}$$

B.4. VOD consistency

In a nutshell, the VOD objective is biased because some normalization terms are estimated via Monte Carlo. Nevertheless, the estimates used as denominators are themselves consistent, which makes the resulting VOD objective consistent as well. In contrast to the IW-RVB eq. (20), the VOD objective $\hat{\mathcal{L}}_\alpha^K$ is not guaranteed to be a lower bound of the marginal log-likelihood. Nonetheless, the VOD objective and its gradient are consistent: they converge to their target expressions (RVB) in the limit $K \to |T_\phi| < \infty$.

Proof. Self-normalized priority sampling is consistent. Given an arbitrary function $h$ such that $|h(x)| < \infty$ and $K$ priority samples $(x_1, s_1), \dots, (x_K, s_K) \overset{\text{priority}}{\sim} p(x)$, where $x \in X$, $|X| < \infty$:

$$\lim_{K \to |X|} \sum_i \frac{s_i}{\sum_j s_j}\, h(x_i) = \frac{\mathbb{E}_{p(x)}[h(x)]}{\mathbb{E}_{p(x)}[1]} = \mathbb{E}_{p(x)}[h(x)] \,. \tag{29}$$

Assuming $|\zeta(d)| < \infty$, this result implies that $\hat{v}_{\theta,\phi}$ is a consistent estimate of the importance weight $w_{\theta,\phi}$:

$$\lim_{K \to |T_\phi|} \hat{v}_{\theta,\phi}(a, q, d) = \lim_{K \to |T_\phi|} \frac{p_\theta(a|d, q)\, \zeta(d)}{\sum_{j=1}^K s_j\, \zeta(d_j)} \tag{30a}$$
$$= p_\theta(a|d, q)\, \zeta(d)\, \mathbb{E}_{r_\phi(d'|a,q)}[\zeta(d')]^{-1} \tag{30b}$$
$$= p_\theta(a|d, q)\, \zeta(d)\, \big(Z_\theta Z_\phi^{-1}\big)^{-1} \tag{30c}$$
$$= w_{\theta,\phi}(a, q, d) \,. \tag{30d}$$

The VOD objective relies on the importance weight estimates, which are themselves consistent. Therefore, for $\alpha < 1$:

$$\lim_{K \to |T_\phi|} \hat{\mathcal{L}}_\alpha^K(a, q) = \lim_{K \to |T_\phi|} \frac{1}{1-\alpha} \log \sum_{i=1}^K s_i\, \hat{v}_{\theta,\phi}^{1-\alpha}(a, q, d_i) \tag{31a}$$
$$= \frac{1}{1-\alpha} \log \mathbb{E}_{r_\phi(d|a,q)}\Big[\lim_{K \to |T_\phi|} \hat{v}_{\theta,\phi}^{1-\alpha}(a, q, d)\Big] \tag{31b}$$
$$= \frac{1}{1-\alpha} \log \mathbb{E}_{r_\phi(d|a,q)}\big[w_{\theta,\phi}^{1-\alpha}(a, q, d)\big] \tag{31c}$$
$$= \mathcal{L}_\alpha(a, q) = \lim_{K \to |T_\phi|} \hat{\mathcal{L}}_\alpha^{K,\mathrm{IW}}(a, q) \,. \tag{31d}$$

C. VOD gradient

The VOD gradient w.r.t. the parameter $\theta$ corresponds to a self-normalized importance sampling estimate of the RVB gradient. It corresponds to the IW-RVB gradient derived in Li & Turner (2016), except that further approximations are required to keep the expression tractable. The VOD gradient is expressed as

$$\mu_{\theta,\alpha,K}^{\text{VOD}} := \sum_{i=1}^K \frac{s_i\, \big(p_\theta(a|d_i, q)\, \zeta(d_i)\big)^{1-\alpha}}{\sum_{j=1}^K s_j\, \big(p_\theta(a|d_j, q)\, \zeta(d_j)\big)^{1-\alpha}}\, \big(\nabla_\theta \log p_\theta(a|d_i, q) + h(d_i, q)\big) \approx \nabla_\theta \mathcal{L}_\alpha(a, q) \tag{32}$$
$$(d_1, s_1), \dots, (d_K, s_K) \overset{\text{priority}}{\sim} r_\phi(d|a, q)$$
$$h(d_i, q) := \nabla_\theta f_\theta(d_i, q) - \sum_{j=1}^K \frac{s_j\, \zeta(d_j)}{\sum_{k=1}^K s_k\, \zeta(d_k)}\, \nabla_\theta f_\theta(d_j, q) \approx \nabla_\theta \log p_\theta(d_i|q) \,. \tag{33}$$

The VOD gradient is consistent: it converges to the exact gradient $\nabla_\theta \mathcal{L}_\alpha(a, q)$ with probability one:

$$\lim_{K \to |T_\phi|} \mu_{\theta,\alpha,K}^{\text{VOD}} = \nabla_\theta \mathcal{L}_\alpha(a, q) \,. \tag{34}$$

The estimation of the gradient of the VOD objective w.r.t. the parameter $\phi$ is left to future work. In all experiments included in this paper, the parameter $\phi$ is non-trainable.
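In an autodiff framework, eq. (32) can be realized as a surrogate loss in which the normalized weights are treated as constants, so that backpropagation through $\log p_\theta(a|d_i, q)$ and $f_\theta(d_i, q)$ reproduces the weighted sum. The PyTorch sketch below is our own illustration, not the authors' released code; all names are hypothetical.

```python
import torch

def vod_surrogate_loss(log_p_a, f_theta, log_s, f_phi, alpha):
    """Surrogate whose gradient w.r.t. theta-parameters matches eq. (32).
    All inputs have shape (K,):
      log_p_a - reader log-likelihoods log p_theta(a | d_i, q)
      f_theta - retriever scores f_theta(d_i, q)
      log_s   - log self-normalized priority weights (constants)
      f_phi   - cached proposal scores f_phi(a, d_i, q) (constants)
    """
    log_zeta = f_theta - f_phi                   # log zeta(d_i)
    # normalized weights of eq. (32); the common denominator of
    # v_hat (eq. 25a) cancels inside the softmax
    w = torch.softmax(log_s + (1 - alpha) * (log_p_a + log_zeta), 0).detach()
    # self-normalized weights s_i * zeta(d_i) of eq. (36f)
    zeta_bar = torch.softmax(log_s + log_zeta, 0).detach()
    h = f_theta - (zeta_bar * f_theta).sum()     # eq. (33), via autograd
    return -(w * (log_p_a + h)).sum()            # minimize the negative

# Toy usage: theta enters through log_p_a and f_theta.
K = 8
log_p_a = torch.randn(K, requires_grad=True)
f_theta = torch.randn(K, requires_grad=True)
loss = vod_surrogate_loss(log_p_a, f_theta,
                          torch.log_softmax(torch.randn(K), 0),
                          torch.randn(K), alpha=0.5)
loss.backward()
print(f_theta.grad)
```

Detaching the weights ensures autograd yields exactly the weighted sum of eq. (32); the scalar value of the surrogate is not itself the VOD bound.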
Proof. Using the results from the previous section, the gradient of the RVB w.r.t. the parameter $\theta$ can be estimated as:

$$\nabla_\theta \mathcal{L}_\alpha(a, q) \propto \mathbb{E}_{r_\phi(d|a,q)}\big[w_{\theta,\phi}^{1-\alpha}(a, q, d)\, \nabla_\theta \log p_\theta(a, d|q)\big] \tag{35a}$$
$$\nabla_\theta \mathcal{L}_\alpha(a, q) = \mathbb{E}_{r_\phi(d|a,q)}\left[\frac{w_{\theta,\phi}^{1-\alpha}(a, q, d)}{\mathbb{E}_{r_\phi(d'|a,q)}\big[w_{\theta,\phi}^{1-\alpha}(a, q, d')\big]}\, \nabla_\theta \log p_\theta(a, d|q)\right] \tag{35b}$$
$$= \mathbb{E}_{r_\phi(d|a,q)}\left[\frac{\big(p_\theta(a|d, q)\, \zeta(d)\, Z_\theta^{-1} Z_\phi\big)^{1-\alpha}}{\mathbb{E}_{r_\phi(d'|a,q)}\Big[\big(p_\theta(a|d', q)\, \zeta(d')\, Z_\theta^{-1} Z_\phi\big)^{1-\alpha}\Big]}\, \nabla_\theta \log p_\theta(a, d|q)\right] \tag{35c}$$
$$\approx \sum_{d \in S} \frac{s(d)\, \big(p_\theta(a|d, q)\, \zeta(d)\big)^{1-\alpha}}{\sum_{d' \in S} s(d')\, \big(p_\theta(a|d', q)\, \zeta(d')\big)^{1-\alpha}}\, \nabla_\theta \log p_\theta(a, d|q) \,. \tag{35d}$$

Another approximation is required to estimate $\nabla_\theta \log p_\theta(a, d|q) = \nabla_\theta \log p_\theta(a|d, q) + \nabla_\theta \log p_\theta(d|q)$ without paying the price of evaluating $Z_\theta$. We approximate the term $\nabla_\theta \log p_\theta(d|q)$ using:

$$\nabla_\theta \log p_\theta(d|q) = \nabla_\theta f_\theta(d, q) - \nabla_\theta \log Z_\theta \tag{36a}$$
$$= \nabla_\theta f_\theta(d, q) - \frac{\nabla_\theta Z_\theta}{Z_\theta} \tag{36b}$$
$$= \nabla_\theta f_\theta(d, q) - \sum_{d' \in T_\phi} p_\theta(d'|q)\, \nabla_\theta f_\theta(d', q) \tag{36c}$$
$$= \nabla_\theta f_\theta(d, q) - \sum_{d' \in T_\phi} r_\phi(d'|a, q)\, \frac{p_\theta(d'|q)}{r_\phi(d'|a, q)}\, \nabla_\theta f_\theta(d', q) \tag{36d}$$
$$= \nabla_\theta f_\theta(d, q) - \mathbb{E}_{r_\phi(d'|a,q)}\left[\frac{\zeta(d')}{\mathbb{E}_{r_\phi(d''|a,q)}[\zeta(d'')]}\, \nabla_\theta f_\theta(d', q)\right] \tag{36e}$$
$$\approx \nabla_\theta f_\theta(d, q) - \sum_{i=1}^K \frac{s_i\, \zeta(d_i)}{\sum_{j=1}^K s_j\, \zeta(d_j)}\, \nabla_\theta f_\theta(d_i, q) \,. \tag{36f}$$

This approximation is also consistent because self-normalized priority sampling is consistent (direct application of eq. (29)).

D. VOD and REALM

Using the truncated retriever $p_\theta(d|q)$ defined on the support $T_\phi$ of the top-$K{=}P$ documents ranked by a cached score $f_\phi$:

$$p_\theta(d|q) := \frac{\mathbb{1}[d \in T_\phi]\, \exp f_\theta(d, q)}{\sum_{i=1}^K \exp f_\theta(d_i, q)} \,, \tag{37}$$

the VOD objective aligns with REALM at $\alpha = 0$. This corresponds to the marginal log-likelihood truncated to the top $K$ documents (the first step is a direct application of priority sampling being consistent):

$$\underbrace{\hat{\mathcal{L}}_{\alpha=0}^{K=P}(a, q)}_{\text{VOD}} = \log \sum_{i=1}^K r_\phi(d_i|a, q)\, w_{\theta,\phi}(a, q, d_i) = \log \sum_{i=1}^K p_\theta(d_i, a|q) = \underbrace{\log p_\theta(a|q)}_{\text{REALM}} \tag{38}$$

E. Applications of the VOD framework

In this section, we detail how to apply the VOD framework to the tasks of language modelling as well as extractive, generative and multiple-choice ODQA. We also detail a solution for jointly optimizing multi-document readers (FiD).

E.1. Generative and extractive ODQA

The model $p_\theta(a|d, q)$ is a machine reading comprehension component that can be implemented either using an extractive approach, as done in the original BERT (Devlin et al., 2018), or using a generative approach (Lewis et al., 2019). Applying the VOD framework to generative and extractive ODQA simply requires plugging the likelihood of the corresponding machine reading comprehension model $p_\theta(a|d, q)$ into the VOD objective and gradient (equations 6 and 32).

E.2. Retrieval-augmented language modelling

We consider the variable $a = [a_1, \dots, a_T]$ to be a sequence of tokens of length $T$ and omit the conditioning variable $q$. The retriever model pθ(dt|a