# Uncertainty Estimation in Autoregressive Structured Prediction

Published as a conference paper at ICLR 2021

Andrey Malinin (Yandex, Higher School of Economics), am969@yandex-team.ru
Mark Gales (ALTA Institute, University of Cambridge), mjfg@eng.cam.ac.uk

## ABSTRACT

Uncertainty estimation is important for ensuring the safety and robustness of AI systems. While most research in the area has focused on unstructured prediction tasks, limited work has investigated general uncertainty estimation approaches for structured prediction. Thus, this work aims to investigate uncertainty estimation for autoregressive structured prediction tasks within a single unified and interpretable probabilistic ensemble-based framework. We consider uncertainty estimation for sequence data at the token level and the complete sequence level; interpretations for, and applications of, various measures of uncertainty; and discuss both the theoretical and practical challenges associated with obtaining them. This work also provides baselines for token-level and sequence-level error detection, and sequence-level out-of-domain input detection, on the WMT'14 English-French and WMT'17 English-German translation and LibriSpeech speech recognition datasets.

## 1 INTRODUCTION

Neural networks (NNs) have become the dominant approach in numerous applications (Simonyan & Zisserman, 2015; Mikolov et al., 2013; 2010; Bahdanau et al., 2015; Vaswani et al., 2017; Hinton et al., 2012) and are being widely deployed in production. As a consequence, predictive uncertainty estimation is becoming an increasingly important research area, as it enables improved safety in automated decision making (Amodei et al., 2016). Important advancements have been the definition of baseline tasks and metrics (Hendrycks & Gimpel, 2016) and the development of ensemble approaches, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017)¹. Ensemble-based uncertainty estimates have been successfully applied to detecting misclassifications, out-of-distribution inputs and adversarial attacks (Carlini & Wagner, 2017; Smith & Gal, 2018; Malinin & Gales, 2019), and to active learning (Kirsch et al., 2019). Crucially, they allow total uncertainty to be decomposed into data uncertainty, the intrinsic uncertainty associated with the task, and knowledge uncertainty, which is the model's uncertainty in the prediction due to a lack of understanding of the data (Malinin, 2019)². Estimates of knowledge uncertainty are particularly useful for detecting anomalous and unfamiliar inputs (Kirsch et al., 2019; Smith & Gal, 2018; Malinin & Gales, 2019; Malinin, 2019).

Despite recent advances, most work on uncertainty estimation has focused on unstructured tasks, such as image classification. Meanwhile, uncertainty estimation within a general, unsupervised, probabilistically interpretable ensemble-based framework for structured prediction tasks, such as language modelling, machine translation (MT) and speech recognition (ASR), has received little attention. Previous work has examined bespoke supervised confidence estimation techniques for each task separately (Evermann & Woodland, 2000; Liao & Gales, 2007; Ragni et al., 2018; Chen et al., 2017; Koehn, 2009; Kumar & Sarawagi, 2019), which construct an "error-detection" model on top of the original ASR/NMT system. While useful, these approaches suffer from a range of limitations.
Firstly, they require token-level supervision, typically obtained via minimum edit-distance alignment to a ground-truth transcription (ASR) or translation (NMT), which can itself be noisy. Secondly, such token-level supervision is generally inappropriate for translation, as it does not account for the validity of re-arrangements. Thirdly, we are unable to determine whether an error is due to knowledge or data uncertainty. Finally, this error-detection model is itself subject to the pitfalls of the original system, such as domain shift and noise. Thus, unsupervised uncertainty-estimation methods are more desirable.

Recently, however, initial investigations into unsupervised uncertainty estimation for structured prediction have appeared. The nature of data uncertainty for translation tasks was examined in (Ott et al., 2018a). Estimation of sequence-level and word-level uncertainty via Monte-Carlo Dropout ensembles has been investigated for machine translation (Xiao et al., 2019; Wang et al., 2019; Fomicheva et al., 2020). However, these works focus on machine translation, consider only a small range of ad-hoc uncertainty measures, provide limited theoretical analysis of their properties and do not make their limitations explicit. Furthermore, they do not identify or tackle the challenges in estimating uncertainty that arise from the exponentially large output space. Finally, to our knowledge, no work has examined uncertainty estimation for autoregressive ASR models.

This work examines uncertainty estimation for structured prediction tasks within a general, probabilistically interpretable ensemble-based framework. The five core contributions are as follows. First, we derive information-theoretic measures of both total uncertainty and knowledge uncertainty at both the token level and the sequence level, make explicit the challenges involved and state any assumptions made. Second, we introduce a novel uncertainty measure, reverse mutual information, which has a set of desirable attributes for structured uncertainty estimation. Third, we examine a range of Monte-Carlo approximations for sequence-level uncertainty. Fourth, for structured tasks there is a choice of how ensembles of models can be combined; we examine how this choice impacts predictive performance and derived uncertainty measures. Fifth, we explore the practical challenges associated with obtaining uncertainty estimates for structured prediction tasks and provide performance baselines for token-level and sequence-level error detection, and out-of-domain (OOD) input detection, on the WMT'14 English-French and WMT'17 English-German translation datasets and the LibriSpeech ASR dataset.

¹ An in-depth comparison of ensemble methods was conducted in (Ashukha et al., 2020; Ovadia et al., 2019).
² Data and knowledge uncertainty are sometimes also called aleatoric and epistemic uncertainty.

## 2 UNCERTAINTY FOR STRUCTURED PREDICTION

In this section we develop an ensemble-based uncertainty estimation framework for structured prediction and introduce a novel uncertainty measure. We take a Bayesian viewpoint on ensembles, as it yields an elegant probabilistic framework within which interpretable uncertainty estimates can be obtained. The core of the Bayesian approach is to treat the model parameters $\boldsymbol{\theta}$ as random variables and place a prior $p(\boldsymbol{\theta})$ over them, so that a posterior $p(\boldsymbol{\theta}|\mathcal{D})$ can be computed via Bayes' rule, where $\mathcal{D}$ is the training data.
Unfortunately, exact Bayesian inference is intractable for neural networks, and it is necessary to consider an explicit or implicit approximation $q(\boldsymbol{\theta})$ to the true posterior $p(\boldsymbol{\theta}|\mathcal{D})$ in order to generate an ensemble. A number of different approaches to generating ensembles have been developed, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017); an overview is available in (Ashukha et al., 2020; Ovadia et al., 2019).

Consider an ensemble of models $\{\mathrm{P}(\mathbf{y}|\mathbf{x};\boldsymbol{\theta}^{(m)})\}_{m=1}^{M}$ sampled from an approximate posterior $q(\boldsymbol{\theta})$, where each model captures the mapping between variable-length sequences of inputs $\mathbf{x} = \{x_1, \ldots, x_T\} \in \mathcal{X}$ and targets $\mathbf{y} = \{y_1, \ldots, y_L\} \in \mathcal{Y}$, with $x_t \in \{w_1, \ldots, w_V\}$ and $y_l \in \{\omega_1, \ldots, \omega_K\}$. The predictive posterior is obtained by taking the expectation over the ensemble:

$$
\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D}) = \mathbb{E}_{q(\boldsymbol{\theta})}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})\big] \approx \frac{1}{M}\sum_{m=1}^{M}\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta}^{(m)}), \quad \boldsymbol{\theta}^{(m)} \sim q(\boldsymbol{\theta}) \approx p(\boldsymbol{\theta}|\mathcal{D}) \tag{1}
$$

The total uncertainty in the prediction of $\mathbf{y}$ is given by the entropy of the predictive posterior:

$$
\underbrace{\mathcal{H}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big]}_{\text{Total Uncertainty}} = \mathbb{E}_{\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})}\big[-\ln\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big] = -\sum_{\mathbf{y}\in\mathcal{Y}}\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\ln\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D}) \tag{2}
$$

The sources of uncertainty can be decomposed via the mutual information $\mathcal{I}$ between $\boldsymbol{\theta}$ and $\mathbf{y}$:

$$
\underbrace{\mathcal{I}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{q(\boldsymbol{\theta})}\big[\mathcal{H}[\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})]\big]}_{\text{Expected Data Uncertainty}} \tag{3}
$$

Mutual information (MI) is a measure of disagreement between models in the ensemble, and therefore a measure of knowledge uncertainty (Malinin, 2019). It can be expressed as the difference between the entropy of the predictive posterior and the expected entropy of each model in the ensemble. The former is a measure of total uncertainty and the latter is a measure of data uncertainty (Depeweg et al., 2017). Another measure of ensemble diversity is the expected pairwise KL-divergence (EPKL):

$$
\mathcal{K}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big] = \mathbb{E}_{q(\boldsymbol{\theta})q(\tilde{\boldsymbol{\theta}})}\Big[\mathrm{KL}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})\,\big\|\,\mathrm{P}(\mathbf{y}|\mathbf{x},\tilde{\boldsymbol{\theta}})\big]\Big], \quad q(\boldsymbol{\theta}) \approx p(\boldsymbol{\theta}|\mathcal{D}) \tag{4}
$$

where $q(\tilde{\boldsymbol{\theta}}) = q(\boldsymbol{\theta})$ and $\tilde{\boldsymbol{\theta}}$ is a dummy variable. This measure is an upper bound on the mutual information, obtainable via Jensen's inequality. A novel measure of diversity which we introduce in this work is the reverse mutual information (RMI) between each model and the predictive posterior:

$$
\mathcal{M}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big] = \mathbb{E}_{q(\boldsymbol{\theta})}\Big[\mathrm{KL}\big[\mathrm{P}(\mathbf{y}|\mathbf{x},\mathcal{D})\,\big\|\,\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta})\big]\Big], \quad q(\boldsymbol{\theta}) \approx p(\boldsymbol{\theta}|\mathcal{D}) \tag{5}
$$

This is the reverse-KL-divergence counterpart to the mutual information (3), and has not been previously explored. As will be shown in the next section, RMI is particularly attractive for estimating uncertainty in structured prediction. Interestingly, RMI is the difference between EPKL and MI:

$$
\mathcal{M}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big] = \mathcal{K}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big] - \mathcal{I}\big[\mathbf{y},\boldsymbol{\theta}\,|\,\mathbf{x},\mathcal{D}\big] \geq 0 \tag{6}
$$

While mutual information, EPKL and RMI all yield estimates of knowledge uncertainty, only mutual information cleanly decomposes into total and data uncertainty; EPKL and RMI do not yield clean measures of total and data uncertainty, respectively. For details see Appendix A.

Unfortunately, in practice we cannot construct a model which directly yields a distribution over the infinite set of variable-length sequences $\mathbf{y} \in \mathcal{Y}$, nor can we take expectations over this set. Instead, autoregressive models are used to factorize the joint distribution over $\mathbf{y}$ into a product of conditionals over a finite set of classes, such as words or BPE tokens (Sennrich et al., 2015):

$$
\mathrm{P}(\mathbf{y}|\mathbf{x},\boldsymbol{\theta}) = \prod_{l=1}^{L}\mathrm{P}(y_l|\mathbf{y}_{<l},\mathbf{x},\boldsymbol{\theta})
$$
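To make the quantities above concrete, the following is a minimal NumPy sketch (not the authors' code; the function name, variable names and toy numbers are illustrative assumptions) that evaluates total uncertainty, expected data uncertainty, MI, EPKL and RMI at a single decoding step, treating each ensemble member's conditional distribution $\mathrm{P}(y_l|\mathbf{y}_{<l},\mathbf{x},\boldsymbol{\theta}^{(m)})$ as one row of a matrix.

```python
import numpy as np

def token_uncertainties(probs, eps=1e-12):
    """Token-level uncertainty measures for one decoding step.

    probs: (M, K) array; row m holds ensemble member m's conditional
    distribution over the K tokens, i.e. P(y_l | y_<l, x, theta^(m)).
    """
    post = probs.mean(axis=0)                           # predictive posterior, Eq. (1)
    log_probs = np.log(probs + eps)

    total = -np.sum(post * np.log(post + eps))          # total uncertainty, Eq. (2)
    data = -np.mean(np.sum(probs * log_probs, axis=1))  # expected data uncertainty
    mi = total - data                                   # mutual information, Eq. (3)

    # RMI: expected KL from the predictive posterior to each member, Eq. (5)
    mean_log = log_probs.mean(axis=0)
    rmi = np.sum(post * (np.log(post + eps) - mean_log))

    epkl = mi + rmi                                     # EPKL via the identity in Eq. (6)
    return {"total": total, "data": data, "mi": mi, "epkl": epkl, "rmi": rmi}


# Toy usage: M = 3 ensemble members over a K = 4 token vocabulary.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
print(token_uncertainties(probs))
```

The sequence-level analogues of these measures cannot be computed this way, since the sums over $\mathcal{Y}$ in equations (2)-(6) are intractable; this is precisely the motivation for the Monte-Carlo approximations examined later in the paper.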