# Uncertainty Calibration for Ensemble-Based Debiasing Methods

Ruibin Xiong (1,2,3), Yimeng Chen (2,4), Liang Pang (2,5), Xueqi Cheng (1,2), Zhiming Ma (2,4), Yanyan Lan (6)

1. CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
3. Baidu Inc.
4. Academy of Mathematics and Systems Science, Chinese Academy of Sciences
5. Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences
6. Institute for AI Industry Research, Tsinghua University

{xiongruibin18, chenyimeng14}@mails.ucas.ac.cn, {cxq, pangliang}@ict.ac.cn, mazm@amt.ac.cn, lanyanyan@tsinghua.edu.cn

Equal contribution. Work done while Yimeng Chen was interning at the Institute for AI Industry Research, Tsinghua University. Corresponding author. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Ensemble-based debiasing methods have been shown effective in mitigating the reliance of classifiers on specific dataset biases, by exploiting the output of a bias-only model to adjust the learning target. In this paper, we focus on the bias-only model in these ensemble-based methods, which plays an important role but has not gained much attention in the existing literature. Theoretically, we prove that the debiasing performance can be damaged by inaccurate uncertainty estimations of the bias-only model. Empirically, we show that existing bias-only models fall short of producing accurate uncertainty estimations. Motivated by these findings, we propose to conduct calibration on the bias-only model, thus achieving a three-stage ensemble-based debiasing framework, including bias modeling, model calibrating, and debiasing. Experimental results on NLI and fact verification tasks show that our proposed three-stage debiasing framework consistently outperforms the traditional two-stage one in out-of-distribution accuracy.

## 1 Introduction

Machine learning models have achieved remarkable performance on natural language understanding [10; 7; 28] and computer vision [16; 17]. However, observations have shown that these models have difficulty generalizing well in out-of-distribution settings [30; 40; 2; 11], which limits their application to real-world scenarios. A major cause of this failure is the reliance of the model on specific dataset biases [33]. For instance, McCoy et al. [23] have shown that sentence pairs with high word overlap in MNLI are easily classified as the label entailment, even if they have different relations.

A growing body of literature recognizes debiasing as an important direction in machine learning and natural language processing [38; 3; 4; 34]. Within these works, ensemble-based debiasing (EBD) methods [15; 22; 8; 39; 5] have attracted considerable interest within the community, as they have shown promising improvements in out-of-distribution performance. EBD methods, e.g., PoE [8], DRiFt [15], and Inverse-Reweight [39], usually adopt a two-stage framework. Firstly, a biased predictor, namely the bias-only model, is trained on the bias features only. Its output is then used to adjust the learning target of the main model through different ensembling strategies. Previous works are mainly limited to designing different ensembling strategies, without considering the bias-only model, which clearly plays an essential role in the whole process.
In this paper, we focus on investigating the bias-only model in EBD methods. We theoretically reveal that the quality of the predictive uncertainty estimation given by the bias-only model is crucial for the debiasing performance of EBD methods. Specifically, we prove that the out-of-distribution accuracy of the debiased model is monotonically decreasing in the calibration error of the bias-only model once the error exceeds a threshold.⁴ Moreover, by theoretically analyzing the decline of in-distribution performance caused by debiasing, we show that there exist cases in which uncertainty calibration can also mitigate this side effect. Empirically, we show that the bias-only models employed by existing methods on both natural language inference and fact verification tasks fail to produce accurate uncertainty estimations. These findings indicate that the calibration property of current bias-only models is critical for further improvement of EBD methods.

⁴ This condition is more general when the ground-truth labeling based on the signal features has low certainty. Such cases exist in natural language understanding (NLU) tasks, where the ground-truth label for a sample is not unique but inherently forms a distribution, as shown by recent empirical studies [26; 25]. That is why we focus our empirical study on NLU tasks.

Motivated by the theoretical analysis and empirical study, we introduce an additional calibration stage into the previous EBD methods. In this stage, the bias-only model is calibrated with model-agnostic calibration methods to obtain more accurate predictive uncertainty estimation. Specifically, two typical calibration methods are used in this paper, i.e., temperature scaling [12] and Dirichlet calibration [19]. After that, the calibrated bias-only model is used to train the main model with off-the-shelf ensembling strategies. In this way, we extend the traditional two-stage EBD framework to a three-stage one, including bias Modeling, model Calibrating, and Debiasing, named MoCaD for short. To demonstrate the effectiveness of our proposed framework, we conduct experiments on four challenging benchmarks for two NLU tasks, i.e., natural language inference and fact verification. Experimental results show that our framework significantly improves the out-of-distribution performance compared with the traditional two-stage one. Moreover, our theoretical results are well verified by the empirical observations in real scenarios.

Our main contributions are threefold:

- We explore, both theoretically and empirically, the effect of the bias-only model in EBD methods. Consequently, a critical problem is revealed: existing bias-only models are poorly calibrated, which hurts the debiasing performance.
- We propose a model-agnostic three-stage EBD framework to tackle the above problem.
- We conduct extensive experiments on four challenging datasets for two different tasks, and the experimental results show the superiority of our proposed framework over the traditional two-stage one.

## 2 Related Work

**Dataset bias.** Various biases have been found in different NLU benchmarks. For example, models with partial input can perform much better than majority-class baselines on NLI and fact verification datasets [13; 27; 30]. Many multi-hop questions can be solved by single-hop models in recent multi-hop QA datasets [24; 6]. Similar phenomena have been observed in many other tasks, such as reading comprehension [18] and visual question answering [1]. On these biased datasets, many models exploit such superficial cues to achieve remarkable performance instead of capturing the underlying intrinsic principles, leading to poor generalization on out-of-distribution datasets where the relation between bias features and labels is changed [23; 30; 21].

**Ensemble-based debiasing (EBD) methods.**
EBD methods are a family of model-agnostic debiasing methods that reduce the reliance of models on specific dataset biases. In these methods, a bias-only model is used to assist the debiasing training of the main model. Most EBD methods, e.g., PoE [8], DRiFt [15], and Inverse-Reweight [39], can be formalized as a two-stage framework. It is commonly assumed that the dataset bias is known a priori. In the first stage, the bias-only model is trained to capture the dataset bias by leveraging the pre-defined bias features. Then the bias-only model is used to adjust the learning target of the main model with different ensembling strategies. Recently, some works have started to improve EBD methods by exploring the bias-only models. For example, Utama et al. [35], Sanh et al. [29], and Clark et al. [9] focus on relaxing the basic assumption of many EBD methods, i.e., that the dataset bias is known a priori. They exploit different prior knowledge to obtain bias-only models, e.g., models that are shallow [35] or have limited capacity [29; 9] are considered to be biased. Unlike these works, we theoretically study the essential effect of the bias-only model on the final debiasing performance and show how to improve it in the algorithm design process. Please note that some works [22; 9] jointly learn the bias-only model and the debiased main model in an end-to-end manner. However, since it is difficult to quantify the impact of the bias-only model in this scheme, we mainly focus on the typical two-stage methods [8; 15; 39; 35; 29].

## 3 Formalization of EBD Methods

In this section, we formalize EBD methods and introduce the related notations. Consider a general classification task, where the target is to map an input value $x \in \mathcal{X}$ of an input random variable $X$ to a target label $y \in \mathcal{Y}$ of a target random variable $Y$. We denote features of $x$ that have invariant relations with the label as signal $x_s$, e.g., the sentiment words in sentiment analysis. Conversely, features whose correlation with the label $Y$ is spurious and prone to change in the out-of-distribution setting are denoted as bias $x_b$, e.g., the length of input sentences in NLU tasks. The corresponding random variables are denoted as $X_S$ and $X_B$, respectively. Now suppose that on a training dataset $D$ with $(X, Y) \sim P_D(\mathcal{X} \times \mathcal{Y})$, $X_B$ and $Y$ are spuriously correlated. The goal of debiasing is to learn a classifier that models $P_D(Y|X_S)$ with invariant out-of-distribution performance.

The following decomposition forms the theoretical basis for EBD methods: for $x \in \mathcal{X}$, with its corresponding features $X_B = x_b$ and $X_S = x_s$,

$$P_D(Y|X=x) \;\propto\; P_D(Y|X_B=x_b)\, P_D(Y|X_S=x_s)\, \frac{1}{P_D(Y)}, \tag{1}$$

where $P_D(Y|X_B=x_b)$ is the conditional probability distribution of $Y$ given the value of the bias features $X_B$, $P_D(Y|X_S=x_s)$ represents the true principle we would like to learn, and $P_D(Y|X=x)$ is the conditional distribution of $Y$ given all features, which is usually approximated by directly applying statistical machine learning methods to the training data. This decomposition can be deduced under the constraint that $X_S \perp X_B \mid Y$, as shown in [8; 15; 9]. We further prove that it also holds under the assumptions in [39] (see the appendix). The theoretical analysis in this paper is conducted under the same constraint as in [8; 15; 9].
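As a quick numerical sanity check of Eq. (1) (our illustration, not part of the paper's derivation), the sketch below builds a toy discrete distribution in which $X_S \perp X_B \mid Y$ holds by construction and confirms the decomposition; the distributions and variable names are arbitrary.

```python
# Toy check of P(Y|x) ∝ P(Y|x_b) P(Y|x_s) / P(Y) under X_S ⟂ X_B | Y.
import numpy as np

rng = np.random.default_rng(0)
n_y, n_s, n_b = 3, 4, 5

p_y = rng.dirichlet(np.ones(n_y))                # P(Y)
p_s_given_y = rng.dirichlet(np.ones(n_s), n_y)   # P(X_S | Y), shape (n_y, n_s)
p_b_given_y = rng.dirichlet(np.ones(n_b), n_y)   # P(X_B | Y), shape (n_y, n_b)

# Joint P(Y, X_S, X_B) built so that X_S ⟂ X_B | Y holds by construction.
joint = p_y[:, None, None] * p_s_given_y[:, :, None] * p_b_given_y[:, None, :]

# Conditionals read off directly from the joint.
p_y_given_x = joint / joint.sum(axis=0, keepdims=True)              # P(Y | x_s, x_b)
p_y_given_s = joint.sum(axis=2) / joint.sum(axis=(0, 2))[None, :]   # P(Y | x_s)
p_y_given_b = joint.sum(axis=1) / joint.sum(axis=(0, 1))[None, :]   # P(Y | x_b)

# Right-hand side of Eq. (1), renormalised over Y.
recon = (p_y_given_b[:, None, :] * p_y_given_s[:, :, None]) / p_y[:, None, None]
recon = recon / recon.sum(axis=0, keepdims=True)

print(np.max(np.abs(recon - p_y_given_x)))   # numerically zero: the decomposition holds
```

Dropping the conditional-independence constraint generally breaks this identity, which is why the constraint (or the alternative assumptions of [39]) is needed.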
From this decomposition, the true principle $P_D(Y|X_S)$ can be recovered by adjusting the learning target with $P_D(Y|X_B)$. This is exactly the basic idea of EBD methods. Most EBD methods follow a two-stage framework. In the first stage, a bias-only model $f^B: \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}$ is trained to approximate $P_D(Y|X_B)$. Then it is employed to adjust the learning target in either a direct or an indirect way.

Direct methods such as Inverse-Reweight [39] reweight the distribution by the inverse of the probability induced by the bias-only model to approximate the true principle. The objective function of the main model $f^M: \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}$ becomes:

$$\min_{f^M} \; \mathbb{E}_{X,Y \sim P_D}\Big[\tfrac{1}{p^b_Y(X)}\, L_c\big(Y, p^m(X)\big)\Big], \tag{2}$$

where $p^b(X) = \{p^b_1(X), p^b_2(X), \ldots, p^b_{|\mathcal{Y}|}(X)\}$ and $p^m(X) = \{p^m_1(X), p^m_2(X), \ldots, p^m_{|\mathcal{Y}|}(X)\}$ denote the uncertainty estimations, i.e., the prediction probabilities given by $f^B$ and $f^M$ respectively, and $L_c$ represents the cross-entropy loss function. On the other hand, indirect methods usually utilize the output of the bias-only model to adjust the loss function of the main model, and the learning target becomes:

$$\min_{f^M} \; \mathbb{E}_{X,Y \sim P_D}\Big[L_c\big(Y, m\big(q^b(X) \odot q^m(X)\big)\big)\Big], \tag{3}$$

where $m$ is the normalization function, and $q^b(X)$, $q^m(X)$ are vectors proportional to $p^b(X)$ and $p^m(X)$ respectively. Specifically, PoE [8; 35] directly uses the probability output, while DRiFt [15] and Sanh et al. [29] use the exponential of the logits. In Learned-Mixin [8], a variant of PoE, $q^b(X)$ is changed to $(p^b(X))^{g(X)}$, where $g(X)$ is a trainable gate function. For both direct and indirect methods, by the property of the cross-entropy loss [14], the optimal main model $f^M$ satisfies $p^m \propto P_D(Y|X)/p^b$. Therefore, we have $p^m \propto P_D(Y|X_S)$ when $p^b \propto P_D(Y|X_B)$, which guarantees the effectiveness of the existing EBD methods. Please note that Learned-Mixin does not satisfy this property due to the trainable gate function.
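To make the two families concrete, here is a minimal PyTorch-style sketch of the objectives in Eq. (2) and Eq. (3) (our illustration; the function names are ours, and `bias_probs` is assumed to be the probability output $p^b(X)$ of a frozen bias-only model).

```python
import torch
import torch.nn.functional as F

def inverse_reweight_loss(main_logits, bias_probs, labels):
    """Eq. (2): cross-entropy reweighted by 1 / p^b_Y(x)."""
    ce = F.cross_entropy(main_logits, labels, reduction="none")
    p_b_y = bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1).clamp_min(1e-6)
    return (ce / p_b_y).mean()

def poe_loss(main_logits, bias_probs, labels):
    """Eq. (3) with q^b = p^b, q^m = p^m: cross-entropy on the normalised product."""
    combined = F.log_softmax(main_logits, dim=1) + torch.log(bias_probs.clamp_min(1e-6))
    # cross_entropy re-applies log-softmax, which performs the normalisation m(.)
    return F.cross_entropy(combined, labels)
```

At inference time only the main model $f^M$ is used; the bias-only term enters the computation only through the training loss.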
## 4 Analysis of the Bias-only Model

Bias-only models are critical to EBD methods, since their outputs are used to help recover the unbiased distribution. However, far too little attention has been paid to them in previous research. In this section, we theoretically quantify the effect of bias-only outputs on the final debiasing performance and empirically show the weakness of existing bias-only models.

### 4.1 Theoretical Analysis

According to the discussion in Section 3, the optimal main model $f^M$ induces the following conditional probability:

$$P_{D,f^M}(Y=i \mid X) := \frac{P_D(Y=i \mid X)/p^b_i(X)}{\sum_{j \in \mathcal{Y}} P_D(Y=j \mid X)/p^b_j(X)}. \tag{4}$$

For arbitrary $x \in \mathcal{X}$, we define

$$Y^*(x) := \arg\max_{i \in \mathcal{Y}} P_D(Y=i \mid X_S=x_s), \qquad \tilde{Y}(x) := \arg\max_{i \in \mathcal{Y}} P_{D,f^M}(Y=i \mid X=x). \tag{5}$$

Here $Y^*(x)$ stands for the predicted label given by the intrinsic principle, and $\tilde{Y}(x)$ is the label prediction given by the debiased main model. With these notations, the debiasing performance can be defined as $\mathbb{E}_{X \sim P_D(X)}\big[\mathbb{1}\big(\tilde{Y}(X) = Y^*(X)\big)\big]$. As the major factor related to the bias-only model in the concerned quantity $\tilde{Y}(X)$ is $p^b_i(X)$, i.e., the uncertainty estimation, we investigate the effect of the bias-only model on the debiasing performance from this aspect. Without loss of generality, we consider the binary classification problem with $\mathcal{Y} = \{0, 1\}$ and a balanced label distribution. To divide and conquer, we conduct the theoretical analysis on a set of samples on which the bias-only model produces the same uncertainty estimation, i.e., $S_{f^B}(l) := \{x \mid p^b_0(x) = l\}$, $l \in [0, 1]$.

Specifically, the quality of the uncertainty estimation of the bias-only model on $S_{f^B}(l)$ can be measured by the calibration error, defined as $|l - P_D(Y=0 \mid S_{f^B}(l))|$. The debiasing performance on $S_{f^B}(l)$ is defined as $P_D(\{x \in S_{f^B}(l) \mid \tilde{Y}(x) = Y^*(x)\})$, i.e., the probability of the subset of $S_{f^B}(l)$ on which the main model gives the same prediction as the intrinsic principle. The following theorem formalizes the precise result: the debiasing performance is a monotonically decreasing function of the calibration error once it exceeds a deviation threshold $\delta(l_0, \epsilon, \alpha)$. Here $\alpha := \min_{X_S} \max_{i \in \{0,1\}} P_D(Y=i \mid X_S)$ denotes the global certainty level of the true principle $P_D(Y \mid X_S)$.

**Theorem 1.** For any $l \in [0, 1]$, assume that $\exists\, l_0$ s.t. $P_D(Y=0 \mid X_B) \in (l_0 - \epsilon,\, l_0 + \epsilon)$ when $X$ takes values in $S_{f^B}(l)$. If the calibration error $|l - P_D(Y=0 \mid S_{f^B}(l))| \geq \delta(l_0, \epsilon, \alpha) > 0$, the debiasing performance $P_D(\{x \in S_{f^B}(l) \mid \tilde{Y}(x) = Y^*(x)\})$ declines as $|l - P_D(Y=0 \mid S_{f^B}(l))|$ increases, where $\delta(l_0, \epsilon, \alpha)$ is a constant depending on $l_0$, $\epsilon$ and $\alpha$. When $\alpha < \frac{1}{2} + \frac{\epsilon}{2l_0(1-l_0)+2\epsilon^2}$, we have $0 \leq \delta(l_0, \epsilon, \alpha) < 2\epsilon$, where $2\epsilon - \frac{\epsilon}{2l_0(1-l_0)+2\epsilon^2} < \frac{1}{2}$. Otherwise $C < \delta(l_0, \epsilon, \alpha) < 2\epsilon + C$, where $0 < C := (l_0 - \epsilon) - \frac{l_0+\epsilon}{(l_0+\epsilon)+(1-l_0-\epsilon)\frac{\alpha}{1-\alpha}}$, which increases as $\alpha$ increases.

The threshold in this theorem depends on the latent constants $l_0$, $\epsilon$, and $\alpha$. Here $l_0$ and $\epsilon$ define the range of $P_D(Y=0 \mid X_B)$ on $S_{f^B}(l)$. As these constants are related to the posterior characteristics of $f^B$, we verify the generality of this condition with empirical facts in Section 6. Note that the deviation threshold decreases as the certainty level $\alpha$ decreases. That means the same calibration error is more likely to exceed the threshold under a smaller $\alpha$, resulting in a more considerable decrease in debiasing performance. As a result, the condition in Theorem 1 is more general and significant when the true principle $P_D(Y \mid X_S)$ has low certainty, for example, in NLU tasks, as supported by empirical evidence in [26; 25].

Figure 1: Reliability diagrams of the bias-only models on MNLI (ECE = 9.83) and FEVER (ECE = 7.12). The x-axis is the predictive probability of the bias-only model, and the y-axis is the frequency. The wide blue bars show the weighted average of the observed class portion over all classes within each bin, and the narrow red bars show the gap between the observed class portion and the predictive probability of the bias-only model.

We also theoretically analyze the effect of the bias-only model on the in-distribution performance, which is defined as $\mathbb{E}_{X \sim P_D(X)}\big[\mathbb{1}\big(\tilde{Y}(X) = \hat{Y}(X)\big)\big]$, where $\hat{Y}(x) := \arg\max_{i \in \mathcal{Y}} P_D(Y=i \mid X=x)$ denotes the label given by the ideal predictor on $D$. The result is shown in the following theorem.

**Theorem 2.** For any $x \in \mathcal{X}$, $\tilde{Y}(x) \neq \hat{Y}(x)$ if and only if $p^b_{\hat{Y}(x)}(x) > P_D(Y = \hat{Y}(x) \mid X = x)$.

Theorem 2 gives a possible explanation for the decrease in in-distribution performance of EBD-debiased models: an in-distribution error occurs when the predictive uncertainty estimation of the bias-only model on $\hat{Y}(x)$ is higher than the conditional probability of $\hat{Y}(x)$. This indicates that the in-distribution error is non-decreasing as the range of the uncertainty estimation of the bias-only model increases. As an important case, when the bias-only model is over-confident [12], decreasing its calibration error can improve both the in-distribution and the out-of-distribution performance of the debiased model, according to the two theorems.
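To make the two results concrete, the following toy computation (ours, not from the paper) evaluates the debiased posterior of Eq. (4) on a single binary example: a calibrated bias-only estimate recovers the intrinsic label $Y^*$, while an over-confident estimate flips the prediction, and the flip happens exactly when $p^b_0(x)$ exceeds $P_D(Y=0 \mid X=x)$, matching Theorem 2.

```python
# Illustration only: how a calibration error in the bias-only estimate
# distorts the debiased prediction of Eq. (4)-(5). All numbers are made up.
import numpy as np

p_s = np.array([0.7, 0.3])       # assumed P_D(Y | X_S = x_s): intrinsic label Y* = 0
p_b_true = np.array([0.8, 0.2])  # assumed true P_D(Y | X_B = x_b)

# P_D(Y | X = x) implied by Eq. (1) with a uniform P_D(Y)
p_x = p_s * p_b_true
p_x = p_x / p_x.sum()            # ≈ [0.903, 0.097], so the in-distribution label Ŷ = 0

def debiased_posterior(p_x, p_b_est):
    """Eq. (4): divide out the bias-only estimate and renormalise."""
    q = p_x / p_b_est
    return q / q.sum()

for l in [0.80, 0.90, 0.95]:     # bias-only estimate p^b_0(x); 0.80 is perfectly calibrated
    post = debiased_posterior(p_x, np.array([l, 1.0 - l]))
    print(f"p^b_0={l:.2f}  debiased P(Y=0|x)={post[0]:.3f}  prediction={post.argmax()}")
# With the calibrated estimate the debiased prediction recovers Y* = 0; once the
# estimate is over-confident enough (p^b_0 > P_D(Y=0|x) ≈ 0.903) it flips to 1.
```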
To sum up, our theoretical study shows that both the debiasing and the in-distribution performance of EBD methods are affected by the uncertainty estimation of the bias-only model. Please note that both Theorem 1 and Theorem 2 can be generalized to multi-class scenarios, with a more complex form. For simplicity, we only discuss the binary case.

### 4.2 Empirical Analysis

According to recent machine learning studies, the uncertainty estimations of many widely used machine learning classifiers are not reliable [20; 12; 32; 36]. This indicates that existing bias-only classifiers may fail to produce good uncertainty estimations, which can hurt the debiasing performance, as demonstrated by our theoretical results. To quantify the effect, we further conduct an empirical study of the quality of existing bias-only models with respect to uncertainty estimation.

Specifically, we experiment on two typical public datasets, MNLI and FEVER. Their experimental settings and detailed analysis can be found in Section 6.1. For MNLI, we consider the syntactic bias [23] and use hand-crafted features to train a bias-only model, the same as in [8]. For FEVER, we consider the claim-only bias [30] and train a claim-only model as the bias-only model, as in [34]. After that, we use the classwise reliability diagram [19] to check the calibration error based on data binning. We adopt the classwise expected calibration error [19] as a measure to quantify the quality of the uncertainty estimation, denoted as ECE for short, where a lower value indicates better-calibrated uncertainty estimation.

Now we introduce our experimental results. The classwise reliability diagrams on the MNLI and FEVER training sets are plotted in Figure 1(a) and Figure 1(b), respectively. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal; the deviation from the diagonal therefore represents the calibration error. From the results, we can see that existing bias-only models suffer from inaccurate uncertainty estimation on both datasets.
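For reference, the classwise-ECE used above can be computed with a simple binning routine along the following lines (a sketch in the spirit of Kull et al. [19]; the function name, bin count, and binning details are ours and may differ from the exact evaluation code).

```python
# Classwise expected calibration error: average the binned per-class ECE over classes.
import numpy as np

def classwise_ece(probs, labels, n_bins=10):
    """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) ints."""
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for j in range(k):
        p_j = probs[:, j]                     # predicted probability of class j
        y_j = (labels == j).astype(float)     # observed class-j indicator
        for b in range(n_bins):
            lo, hi = edges[b], edges[b + 1]
            if b == 0:
                mask = (p_j >= lo) & (p_j <= hi)
            else:
                mask = (p_j > lo) & (p_j <= hi)
            if mask.any():
                gap = abs(p_j[mask].mean() - y_j[mask].mean())
                total += (mask.sum() / n) * gap   # weight each bin by its sample share
    return total / k
```

Applied to the bias-only model's probabilities on the training set (and multiplied by 100), this yields the kind of percentages reported later in Table 3.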
## 5 The MoCaD Framework

To overcome the unreliable predictive uncertainty problem, we introduce a calibration operation for the bias-only model, yielding a Modeling, Calibrating and Debiasing framework, named MoCaD for short. Our framework consists of three stages. Firstly, we train a bias-only model to model $P_D(Y|X_B)$. Secondly, we use model-agnostic calibration methods to reduce the calibration error of the bias-only model. The calibrated bias-only model is finally employed to conduct the debiasing process through the existing ensembling strategies.

### 5.1 Bias Modeling

In the first stage, we train a bias-only model to approximate $P_D(Y|X_B)$, similar to previous works [8; 15]. When the dataset bias is identified, i.e., the bias features $X_B$ are known a priori [8; 15], the bias-only model can be obtained by using only the pre-defined $X_B$ to predict the label $y$ with the cross-entropy loss. For example, in NLI, many specific linguistic phenomena in hypothesis sentences, such as negation, are highly correlated with certain inference classes [27]. In this case, hypothesis sentences are used as inputs to train an NLI model as the bias-only model. When the dataset bias is unknown, a shallow or weak model can be built as the bias-only model, as in [35; 29].

### 5.2 Model Calibrating

We propose to utilize model-agnostic calibration methods to reduce the calibration error of the bias-only models. Specifically, two typical calibration methods, temperature scaling [12] and the Dirichlet calibrator [19], are used in this paper. The calibrated bias-only model is denoted as $\tilde{f}^B$.

Temperature scaling is a simple but effective calibration method. It learns a single scalar parameter, the temperature, which is applied to the logits before the last softmax layer. Specifically, denoting the logit output of the bias-only model on sample $(X, Y)$ as $z^b(X)$, abbreviated as $z^b$, temperature scaling corrects the output as $\tilde{p}^b = \mathrm{softmax}(z^b/T)$, where $T$ is the temperature, learned with the cross-entropy loss.

The Dirichlet calibrator is derived from the Dirichlet distribution likelihood. The transformed probability is computed as $\tilde{p}^b = \mathrm{softmax}(W \ln p^b + b)$, where $W$ and $b$ stand for the linear transformation matrix and the intercept term, which are optimized by the cross-entropy loss equipped with ODIR (Off-Diagonal and Intercept Regularisation) to prevent over-fitting [19].

Please note that temperature scaling does not change the predicted label, because the argmax of the softmax function remains unchanged. In other words, it only changes the uncertainty estimation and maintains the model's accuracy. Unlike temperature scaling, the Dirichlet calibrator can change the prediction accuracy. Empirically, we observed that the Dirichlet calibrator improves the accuracy of all bias-only models in our experiments (see the Appendix for details). In both methods, the calibration error is expected to be reduced by learning the parameters with the cross-entropy loss.
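The two calibrators can be fit on held-out predictions of the frozen bias-only model roughly as follows (a hedged sketch, not the authors' implementation; the ODIR penalty below is a simplified stand-in for the regularizer of Kull et al. [19]).

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Temperature scaling [12]: learn one scalar T, return a logits -> probs function."""
    log_t = torch.zeros(1, requires_grad=True)      # optimise log T so that T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(logits / log_t.exp(), labels).backward()
        opt.step()
    T = log_t.exp().item()
    return lambda z: F.softmax(z / T, dim=1)

class DirichletCalibrator(torch.nn.Module):
    """Dirichlet calibration [19]: affine map on log-probabilities, softmax on top."""
    def __init__(self, n_classes):
        super().__init__()
        self.lin = torch.nn.Linear(n_classes, n_classes)

    def forward(self, probs):
        return F.softmax(self.lin(torch.log(probs.clamp_min(1e-12))), dim=1)

    def odir_penalty(self, lam):
        # Penalise off-diagonal weights and the intercept (simplified ODIR).
        W, b = self.lin.weight, self.lin.bias
        off_diag = W - torch.diag(torch.diagonal(W))
        return lam * (off_diag.pow(2).sum() + b.pow(2).sum())

def fit_dirichlet(probs, labels, steps=500, lr=0.01, lam=0.06):  # lam as in Section 6.1
    cal = DirichletCalibrator(probs.shape[1])
    opt = torch.optim.Adam(cal.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.nll_loss(torch.log(cal(probs).clamp_min(1e-12)), labels) + cal.odir_penalty(lam)
        loss.backward()
        opt.step()
    return cal
```

Note that temperature scaling rescales all logits by the same positive factor, so the argmax (and hence the accuracy) is unchanged, whereas the Dirichlet calibrator's full affine map can change predicted labels, consistent with the discussion above.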
### 5.3 Debiasing

The final step is to train the main model $f^D$ with the calibrated bias-only model $\tilde{f}^B$. Specifically, $\tilde{f}^B$ is applied with the existing ensembling strategies to make the main model $f^D$ approximate the true principle $P_D(Y|X_S)$, by adjusting the learning target of the main model, as described in Section 3. The design of the main model is highly dependent on the concerned task, as indicated by previous works. For example, a BERT-based classifier is usually used in NLI [15], and a Bottom-Up Top-Down VQA model is usually adopted in VQA [8].

## 6 Experiments

In this section, we conduct experiments on different real-world datasets to answer two questions: (1) whether our proposed MoCaD framework improves the debiasing performance of EBD methods; (2) whether the experimental results are consistent with the theoretical findings.

### 6.1 Experimental Settings

We describe our experimental settings, including datasets, models and some training details. More details are provided in the Appendix.

**Datasets and bias-only models.** We conduct experiments on both fact verification and natural language inference, which are commonly used tasks in debiasing [8; 34; 35]. We follow these works to choose the datasets and design the bias-only models.

Fact verification requires models to validate a claim in the context of evidence. For this task, we use the training dataset provided by the FEVER challenge [31]. The processing and split of the dataset into training/development sets follow Schuster et al. [30].⁵ It has been shown that FEVER exhibits a claim-only bias, where claim sentences often contain words highly indicative of the target label [30]. The bias-only model is therefore trained to predict labels using only claim sentences. Finally, the Fever-Symmetric datasets [30] (both version 1 and version 2) are used as the test sets for evaluation.

⁵ https://github.com/TalSchuster/FeverSymmetric

Natural language inference aims to infer the relationship between a premise and a hypothesis. Recent studies have shown that various biases exist in the widely used NLI datasets [27; 13; 23]. In this paper, we conduct our experiments on MNLI [37] and consider both known and unknown biases. For known bias, we first consider the syntactic bias, e.g., the lexical overlap between premise and hypothesis sentences is strongly correlated with the entailment label [23]. The bias-only model is thus a classifier whose input is a set of hand-crafted features indicating how words are shared between the two sentences, the same as in [8]. HANS (Heuristic Analysis for NLI Systems) [23] is utilized as the challenging dataset for evaluation. We then consider the hypothesis-only bias, which means that the relation between premise and hypothesis can be predicted using only the hypothesis. The bias-only model is thus a classifier trained to predict labels using only the hypothesis. In this experiment, we still use MNLI as the training set and employ two hard MNLI datasets [13; 21] for evaluation. The hard subsets are derived from the MNLI Mismatched dataset with two different strategies: (1) a neural classifier is trained on hypothesis sentences, and the wrongly classified instances are treated as hard instances; (2) patterns in hypothesis sentences that are highly correlated with specific labels are extracted as surface patterns, and samples that contradict the indications of those surface patterns are recognized as hard samples. The two challenging datasets are therefore referred to as Hard-CD (Classifier Detected) and Hard-SP (Surface Pattern), corresponding to their creation strategies. For unknown bias, following Utama et al. [35], we build a shallow model as the bias-only model, which has the same architecture as the main model and is trained on a subset of the MNLI training set. We then use HANS as the challenging dataset for evaluation, as in Utama et al. [35].

**Baselines and configurations.** We experiment with 8 implementations of MoCaD, i.e., two different calibrators combined with four different ensembling strategies. The two calibrators are temperature scaling and the Dirichlet calibrator, and the four ensembling strategies are those of Product-of-Experts (PoE), Learned-Mixin (LMin), DRiFt, and Inverse-Reweight (Inv-R). We compare the performance of these implementations with their corresponding two-stage EBD methods. We denote the different implementations of MoCaD by the name of the corresponding EBD method with the calibrator name as a subscript, using TempS and Dirichlet for temperature scaling and the Dirichlet calibrator, respectively. In our experiments, we adopt a BERT-based classifier as the main model and follow the standard setup for sentence-pair classification [10]. The cross-entropy trained model (denoted as CE) is also included as a baseline, to show the difference between the debiased and un-debiased models. To tackle the high performance variance on challenging datasets observed by Clark et al. [8], we run each experiment five times and report the mean scores and standard deviations. For each task, we utilize training configurations that have been proven to work well in previous studies and keep the same bias-only model for all methods. For Learned-Mixin, the entropy term weight is set to the value suggested by Utama et al. [34]. For the Dirichlet calibrator, we set $\lambda = 0.06$ for all experiments, based on the in-distribution performance on the development sets.
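Putting the pieces together, the experimental pipeline can be outlined schematically as below (a sketch under our reading of the setup; `bias_model`, `calibrator`, and `ensemble_loss` are placeholders, e.g. the temperature-scaling and PoE sketches given earlier, and this is not the released training code).

```python
import torch

def train_mocad(main_model, bias_model, calibrator, ensemble_loss,
                train_loader, epochs=3, lr=2e-5):
    # Stage 1 (bias modeling) is assumed to be done already: `bias_model` was
    # trained on the pre-defined bias features and stays frozen below.
    # Stage 2 (model calibrating): `calibrator` maps the frozen bias-only
    # model's raw outputs to calibrated probabilities p^b(x).
    opt = torch.optim.Adam(main_model.parameters(), lr=lr)
    for _ in range(epochs):                              # Stage 3: debiasing
        for inputs, labels in train_loader:
            with torch.no_grad():
                bias_probs = calibrator(bias_model(inputs))
            loss = ensemble_loss(main_model(inputs), bias_probs, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return main_model
```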
### 6.2 Experimental Results

Now we show our experimental results to answer the aforementioned two questions.

Table 1 shows the experimental results on FEVER. We can see that, for both calibrators, MoCaD outperforms the corresponding EBD methods, including Learned-Mixin, on both the Fever-Symmetric v1 and v2 datasets. Comparing the calibrators, Dirichlet consistently performs better than TempS. Please note that the label distribution of the development set differs from that of the training set on FEVER, which explains why Dirichlet sometimes obtains better in-distribution performance than the cross-entropy baseline.

Table 1: Classification accuracy on FEVER.

| Method | ID | Symm. v1 | Symm. v2 |
|---|---|---|---|
| CE | 87.1 ± 0.6 | 56.5 ± 0.9 | 63.9 ± 0.9 |
| PoE | 84.0 ± 1.0 | 62.0 ± 1.3 | 65.9 ± 0.6 |
| PoE_TempS | 82.0 ± 0.9 | 63.3 ± 0.9 | 66.4 ± 0.8 |
| PoE_Dirichlet | 87.1 ± 1.0 | 65.9 ± 1.1 | 69.1 ± 0.8 |
| DRiFt | 84.2 ± 1.2 | 62.3 ± 1.5 | 65.9 ± 0.7 |
| DRiFt_TempS | 81.7 ± 0.9 | 63.5 ± 1.3 | 66.5 ± 0.7 |
| DRiFt_Dirichlet | 87.4 ± 1.2 | 65.7 ± 1.4 | 69.0 ± 1.3 |
| Inv-R | 84.3 ± 0.8 | 60.8 ± 1.2 | 65.2 ± 1.0 |
| Inv-R_TempS | 83.8 ± 0.6 | 61.5 ± 0.9 | 65.4 ± 0.7 |
| Inv-R_Dirichlet | 87.0 ± 0.8 | 63.8 ± 2.2 | 68.2 ± 1.7 |
| LMin | 84.7 ± 1.8 | 59.8 ± 2.7 | 65.3 ± 1.1 |
| LMin_TempS | 84.9 ± 1.7 | 60.0 ± 2.5 | 65.6 ± 1.5 |
| LMin_Dirichlet | 87.5 ± 1.1 | 61.5 ± 2.4 | 67.1 ± 1.3 |

Table 2: Classification accuracy on MNLI.

| Method | Syntactic: ID | Syntactic: HANS | Hypothesis-only: ID | Hypothesis-only: Hard-CD | Hypothesis-only: Hard-SP | Unknown: ID | Unknown: HANS |
|---|---|---|---|---|---|---|---|
| CE | 84.2 ± 0.2 | 61.2 ± 3.2 | 84.2 ± 0.2 | 76.8 ± 0.4 | 72.6 ± 2.0 | 84.2 ± 0.2 | 61.2 ± 3.2 |
| PoE | 82.8 ± 0.4 | 68.1 ± 3.4 | 83.2 ± 0.2 | 79.4 ± 0.4 | 76.8 ± 2.4 | 80.7 ± 0.2 | 69.0 ± 2.4 |
| PoE_TempS | 83.9 ± 0.3 | 69.1 ± 2.8 | 82.9 ± 0.3 | 79.6 ± 0.4 | 77.4 ± 2.4 | 82.1 ± 0.2 | 69.9 ± 1.6 |
| PoE_Dirichlet | 84.1 ± 0.3 | 70.7 ± 1.5 | 82.7 ± 0.4 | 79.4 ± 0.2 | 77.6 ± 2.1 | 82.3 ± 0.3 | 70.7 ± 1.0 |
| DRiFt | 81.8 ± 0.4 | 66.5 ± 4.0 | 83.5 ± 0.4 | 79.5 ± 0.6 | 76.3 ± 1.6 | 80.2 ± 0.3 | 69.1 ± 1.3 |
| DRiFt_TempS | 83.0 ± 0.4 | 69.7 ± 1.8 | 83.1 ± 0.2 | 79.6 ± 0.2 | 77.4 ± 3.3 | 81.5 ± 0.3 | 70.0 ± 0.9 |
| DRiFt_Dirichlet | 83.6 ± 0.3 | 69.8 ± 1.9 | 82.8 ± 0.3 | 79.6 ± 0.2 | 79.0 ± 1.6 | 81.9 ± 0.6 | 69.4 ± 1.1 |
| Inv-R | 82.5 ± 0.1 | 68.4 ± 1.2 | 83.1 ± 0.2 | 78.4 ± 0.5 | 77.1 ± 2.0 | 78.7 ± 4.8 | 64.7 ± 2.6 |
| Inv-R_TempS | 83.6 ± 0.2 | 69.4 ± 1.6 | 82.8 ± 0.2 | 78.6 ± 0.2 | 77.9 ± 1.7 | 81.4 ± 0.5 | 65.8 ± 0.9 |
| Inv-R_Dirichlet | 83.7 ± 0.4 | 69.4 ± 1.3 | 82.5 ± 0.2 | 78.9 ± 0.4 | 80.8 ± 2.0 | 81.5 ± 0.2 | 68.2 ± 0.8 |
| LMin | 84.1 ± 0.3 | 65.5 ± 3.7 | 80.5 ± 0.3 | 80.0 ± 0.4 | 78.2 ± 2.0 | 83.1 ± 0.3 | 66.5 ± 1.1 |
| LMin_TempS | 84.1 ± 0.2 | 63.2 ± 2.7 | 80.5 ± 0.6 | 80.3 ± 0.2 | 80.8 ± 3.6 | 83.3 ± 0.2 | 66.2 ± 1.0 |
| LMin_Dirichlet | 84.3 ± 0.3 | 62.7 ± 2.6 | 80.1 ± 0.5 | 79.8 ± 0.4 | 83.2 ± 2.2 | 82.7 ± 0.2 | 66.4 ± 1.2 |

Table 2 shows the experimental results on MNLI with respect to known and unknown biases. The main results are similar to those on FEVER, i.e., calibration benefits the debiasing performance, and Dirichlet obtains better results than TempS, for all EBD methods except Learned-Mixin on HANS. This indicates that, for both known and unknown dataset biases, MoCaD outperforms the corresponding EBD methods. Please note that, because a trainable gate function is added in Learned-Mixin, its optimal bias-only model is different from the others and does not fit our theoretical assumptions. In particular, the performance gap between the baselines and our methods is relatively small on Hard-CD. This may be due to the fact that the construction of Hard-CD depends on a specific biased model.

#### 6.2.1 Empirical Verification of Theorem 1

Now we analyze whether the improvement in debiasing performance agrees with our theoretical study in Theorem 1. That is, calibrated models achieve better uncertainty estimation, leading to better debiasing performance.
Figure 2: Debiasing performance obtained with the bias-only model vs the quality of its predictive uncertainty, measured by classwise-ECE (lower is better); the curves correspond to Symm. v1 and Symm. v2.

Figure 3: In-distribution performance (accuracy) of the main model vs temperature: (a) Syntactic Bias; (b) Hypothesis-only Bias.

To facilitate the study, we report the classwise-ECE of the calibrated bias-only models on the different training datasets, as shown in Table 3. In the table, Un-Cal, TempS, and Dirichlet denote the bias-only model without calibration, with temperature scaling, and with the Dirichlet calibrator, respectively. From the results, we can see that the calibrated bias-only models achieve better uncertainty estimation on all datasets, for both calibrators. Comparing the two calibrators, the Dirichlet calibrator performs better because of its higher expressive power. Considering further the debiasing improvements in Tables 1 and 2, we can see that the empirical findings are consistent with our theory.

Table 3: Classwise-ECE of the calibrated bias-only models on different training datasets.

| | FEVER | HANS | MNLI | Unknown |
|---|---|---|---|---|
| Un-Cal | 7.11 | 9.83 | 3.01 | 7.41 |
| TempS | 6.23 | 7.70 | 2.38 | 3.07 |
| Dirichlet | 1.73 | 4.47 | 0.87 | 1.45 |

Furthermore, we conduct a more detailed experiment on MNLI and FEVER, regarding the syntactic bias and the claim-only bias respectively. Specifically, we adopt the ensembling strategy of PoE, calibrate the bias-only models with the Dirichlet calibrator, and save models at different checkpoints. We then consider the debiasing performance obtained with bias-only models of different uncertainty estimation quality, measured by classwise-ECE. The results are plotted in Figure 2. We can see that as the classwise-ECE grows, i.e., the calibration error of the bias-only model grows, the accuracy on the test set decreases, i.e., the debiasing performance drops. These results directly support Theorem 1.

#### 6.2.2 Empirical Verification of Theorem 2

Theorem 2 reveals the relation between the confidence of the bias-only model and the in-distribution error of the main model. That is, if the confidence, i.e., the uncertainty estimation of the bias-only model on the predicted label, is reduced, the in-distribution error of the main model decreases. Since the label distribution changes on the development set of FEVER, we only consider the results on MNLI. From Table 2, the in-distribution performance increases in the syntactic and unknown bias scenarios and decreases in the hypothesis-only bias scenario, for most implementations. That is because the syntactic and unknown bias-only models are over-confident, while the hypothesis-only bias-only model is under-confident, as shown in our Appendix. These results are in accordance with our theory.

We provide a detailed experiment to further explain the relationship revealed in Theorem 2. Specifically, we adopt the ensembling strategy of PoE and take temperature scaling as the calibrator, because the temperature parameter controls the confidence of the calibrated model: the larger the temperature, the less confident the obtained model. We manually set the temperature parameter from 0.7 to 1.5 with a step of 0.1, and record the in-distribution accuracy on the development set of the main model trained with PoE and the corresponding calibrated bias-only model. The results are plotted in Figure 3. They show that when the bias-only model is less confident, the in-distribution performance of the main model improves, which verifies Theorem 2.
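The temperature-sweep procedure just described can be outlined as follows (a schematic sketch; `bias_logits_fn`, `train_main_with_poe`, and `evaluate` are hypothetical stand-ins for the actual training and evaluation routines).

```python
import numpy as np
import torch.nn.functional as F

def sweep_temperature(bias_logits_fn, train_main_with_poe, evaluate,
                      temperatures=np.arange(0.7, 1.5 + 1e-9, 0.1)):
    """For each manually chosen T, rescale the bias-only logits, train the main
    model with PoE against the resulting probabilities, and record dev accuracy."""
    results = {}
    for T in temperatures:
        # Larger T -> flatter, less confident bias-only probabilities.
        bias_probs_fn = lambda x, T=T: F.softmax(bias_logits_fn(x) / T, dim=1)
        main_model = train_main_with_poe(bias_probs_fn)
        results[round(float(T), 1)] = evaluate(main_model, split="dev")
    return results
```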
## 7 Conclusions and Future Work

This paper theoretically and empirically reveals an important problem, ignored in previous studies: existing bias-only models in EBD methods are poorly calibrated, leading to unsatisfactory debiasing performance. To tackle this problem, we propose a three-stage EBD framework (MoCaD), including bias modeling, model calibrating, and debiasing. Extensive experiments on natural language inference and fact verification tasks show that MoCaD outperforms the corresponding EBD methods, for both known and unknown dataset biases. Furthermore, our detailed empirical analyses verify the correctness of our theorems. We believe that our study will draw people's attention to the bias-only model, which has the potential to become an interesting research direction in the debiasing literature.

A limitation of this paper is that our empirical studies focus on NLU tasks. Further experimental results on image classification show inconsistent improvements (see the appendix). A possible reason is that image classes (e.g., birds or elephants) are less disputed than language concepts (e.g., entailment or neutral). Thus the invariant mechanism for image classification has higher certainty, reducing the impact of calibration error on debiasing according to our theoretical analysis. In the future, we plan to extend our investigations to end-to-end EBD methods and more tasks besides NLU.

## Acknowledgments and Disclosure of Funding

This work is supported by the National Key R&D Program of China under Grant No. 2020AAA0105200, and the National Natural Science Foundation of China (NSFC) under Grants No. 61773362 and 61906180.

## References

[1] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, 2016.

[2] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.

[3] Yonatan Belinkov, Adam Poliak, Stuart M. Shieber, Benjamin Van Durme, and Alexander Rush. On adversarial removal of hypothesis-only bias in natural language inference. NAACL HLT 2019, page 256, 2019.

[4] Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.463. URL https://www.aclweb.org/anthology/2020.acl-main.463.

[5] Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. RUBi: Reducing unimodal biases for visual question answering. In Advances in Neural Information Processing Systems, pages 841–852, 2019.

[6] Jifan Chen and Greg Durrett. Understanding dataset design choices for multi-hop reasoning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4026–4032, 2019.

[7] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, 2017.

[8] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4060–4073, 2019.
[9] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Learning to model and ignore dataset bias with mixed capacity ensembles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3031–3045, 2020.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[11] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.

[12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

[13] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, 2018.

[14] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009. ISBN 9780387848846. URL https://books.google.com.hk/books?id=eBSgoAEACAAJ.

[15] He He, Sheng Zha, and Haohan Wang. Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 132–142, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-6115. URL https://www.aclweb.org/anthology/D19-6115.

[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[18] Divyansh Kaushik and Zachary C. Lipton. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, 2018.

[19] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12316–12326, 2019.

[20] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 6402–6413. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf.
[21] Tianyu Liu, Zheng Xin, Baobao Chang, and Zhifang Sui. HypoNLI: Exploring the artificial patterns of hypothesis-only bias in natural language inference. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6852–6860, 2020.

[22] Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. End-to-end bias mitigation by modelling biases in corpora. In Annual Meeting of the Association for Computational Linguistics, 2020.

[23] Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, 2019.

[24] Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4249–4257, 2019.

[25] Yixin Nie, Xiang Zhou, and Mohit Bansal. What can we learn from collective human opinions on natural language inference data? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9131–9143, 2020.

[26] Ellie Pavlick and Tom Kwiatkowski. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694, 2019.

[27] Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, 2018.

[28] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[29] Victor Sanh, Thomas Wolf, Yonatan Belinkov, and Alexander M. Rush. Learning from others' mistakes: Avoiding dataset biases without modeling them. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Hf3qXoiNkR.

[30] Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel Roberto Filizzola Ortiz, Enrico Santus, and Regina Barzilay. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3410–3416, 2019.

[31] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.

[32] Sunil Thulasidasan, Gopinath Chennupati, Jeff A. Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 13888–13899. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/36ad8b5f42db492827016448975cc22d-Paper.pdf.

[33] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528, 2011. doi: 10.1109/CVPR.2011.5995347.
[34] Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. arXiv preprint arXiv:2005.00315, 2020.

[35] Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. Towards debiasing NLU models from unknown biases. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7597–7610, 2020.

[36] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 3459–3467. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/vaicenavicius19a.html.

[37] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018.

[38] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

[39] Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Shiyu Chang, Mo Yu, Conghui Zhu, and Tiejun Zhao. Selection bias explorations and debias methods for natural language sentence matching datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4418–4429, 2019.

[40] Yuan Zhang, Jason Baldridge, and Luheng He. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, 2019.