# Flexible Instance-Specific Rationalization of NLP Models

George Chrysostomou, Nikolaos Aletras
Department of Computer Science, University of Sheffield
gchrysostomou1@sheffield.ac.uk, n.aletras@sheffield.ac.uk

Abstract

Recent research on model interpretability in natural language processing extensively uses feature scoring methods for identifying which parts of the input are the most important for a model to make a prediction (i.e. explanation or rationale). However, previous research has shown that there is no clear best scoring method across various text classification tasks, while practitioners typically have to make several other ad-hoc choices regarding the length and the type of the rationale (e.g. short or long, contiguous or not). Inspired by this, we propose a simple yet effective and flexible method that allows selecting optimally for each data instance: (1) a feature scoring method; (2) the length; and (3) the type of the rationale. Our method is inspired by input erasure approaches to interpretability, which assume that the most faithful rationale for a prediction should be the one with the highest difference between the model's output distribution using the full text and the text after removing the rationale as input, respectively. Evaluation on four standard text classification datasets shows that our proposed method provides more faithful, comprehensive and highly sufficient explanations compared to using a fixed feature scoring method, rationale length and type. More importantly, we demonstrate that a practitioner is not required to make any ad-hoc choices in order to extract faithful rationales using our approach.

1 Introduction

Large pre-trained transformer-based language models such as BERT (Devlin et al. 2019; Bommasani et al. 2021) currently dominate performance across language understanding benchmarks (Wang et al. 2019). These developments have opened up new challenges on how to extract faithful explanations (i.e. rationales1), which accurately represent the true reasons behind a model's prediction when adapted to downstream tasks (Jacovi and Goldberg 2020).2 3

1 We use these terms interchangeably throughout the paper.
2 Code for experiments available at: https://github.com/GChrysostomou/instance-specific-rationale
3 We provide an extended version of our work with an appendix at: https://arxiv.org/abs/2104.08219

Recent studies use feature scoring (i.e. attribution) methods such as gradient and attention-based scores (Arras et al. 2016; Sundararajan, Taly, and Yan 2017; Jain and Wallace 2019; Chrysostomou and Aletras 2021b) to identify important (i.e. salient) segments of the input and subsequently extract them as rationales (Jain et al. 2020; Treviso and Martins 2020). However, a single feature scoring method is typically applied across the whole dataset (i.e. globally). This might not be optimal for individual instances, resulting in less faithful explanations (Jacovi and Goldberg 2020; Atanasova et al. 2020). Additionally, rationales are usually extracted using a pre-defined fixed length (i.e. the ratio of a rationale compared to the full input sequence) and type (i.e. top-k terms or contiguous) globally. We hypothesize that using a fixed length or type for different instances could result in shorter (i.e.
not sufficient for explaining a model's prediction) or longer than needed rationales, reducing rationale faithfulness, whilst finding the optimal explanation length is an open problem (Zhang et al. 2021). Moreover, to extract rationales, practitioners are currently required to make assumptions about the rationale parameters (i.e. feature scoring method, length and type), whilst different choices of parameters might substantially affect the faithfulness of the rationales.

In this paper, we propose a simple yet effective method that operates at instance level and mitigates the a priori selection of a specific (1) feature scoring method; (2) length; and (3) type when extracting faithful rationales. Our proposed method is flexible and allows the automatic selection of some or all of these instance-specific parameters. Inspired by erasure methods, it functions by computing differences between a model's output distributions obtained using the full input sequence and the input without the rationale, respectively. We base this on the assumption that by removing important tokens from the sequence, we should observe large divergences in the model's predicted distribution (Nguyen 2018; Serrano and Smith 2019; DeYoung et al. 2020), resulting in more faithful rationales (Atanasova et al. 2020; Chen and Ji 2020).

The contributions of our work are thus as follows:
- To the best of our knowledge, we are the first to propose a method for instance-specific faithful rationale extraction;
- We empirically demonstrate that rationales extracted with an instance-specific flexible feature scoring method, length and type using our proposed method are more comprehensive than rationales with fixed, pre-defined parameters;
- We show that our method results in consistently highly sufficient rationales, mitigating the variability in faithfulness of different feature scoring methods across datasets when used globally, i.e. the same for all instances (Atanasova et al. 2020).

2 Background and Related Work

Rationale Extraction

Given a trained model M, an input x = [x1, . . . , xT] and a predicted distribution over classes Y, rationale extraction methods seek to identify the most important subset R ⊆ x of the input for explaining the model's prediction. There are two common approaches for extracting rationales. The first consists of two modules jointly trained on an end-task, e.g. sentiment analysis (Lei, Barzilay, and Jaakkola 2016; Bastings, Aziz, and Titov 2019). The first module extracts the rationale (i.e. typically by learning to select which inputs should be masked) and the second module is trained using only the rationale. The second approach consists of using feature scoring (or attribution) methods (i.e. salience metrics) to first identify important parts of the input and then extract the rationales from M (Jain et al. 2020; Treviso and Martins 2020; DeYoung et al. 2020). A limitation of the first approach is that the models are hard to train compared to the latter and often do not reach high accuracy (Jain et al. 2020). Regarding the latter approach, a limitation is that the same feature scoring method is applied to all instances in a given dataset, irrespective of whether that method is the best for a particular instance (Atanasova et al. 2020; Jacovi and Goldberg 2020), while finding a suitable explanation length is an open problem (Zhang et al. 2021).
Computing Input Importance

Feature scoring methods Ω compute input importance scores ω for each token in the sequence x, such that ω = Ω(M, x, Y). High scores indicate that the associated tokens contributed more towards a model's prediction. Subsequently, R is extracted by selecting the K highest scored tokens (or the K-gram for contiguous rationales) in a sequence (DeYoung et al. 2020; Jain et al. 2020). A common approach to computing ω is by calculating the gradients of the prediction with respect to the input (Kindermans et al. 2016; Li et al. 2016; Arras et al. 2016; Sundararajan, Taly, and Yan 2017; Bastings and Filippova 2020). Jain et al. (2020) use attention weights to attribute token importance for rationale extraction, while Treviso and Martins (2020) propose sparse attention. Li et al. (2016) compute input importance scores by measuring the difference in a model's prediction between keeping and omitting each token, with Kim et al. (2020) also suggesting input marginalization as an alternative to token omission. Another way is using sparse linear meta-models that are easier to interpret (Ribeiro, Singh, and Guestrin 2016). Atanasova et al. (2020), however, show that sparse linear meta-models are not as faithful as gradient-based approaches for interpreting large language models.

Evaluating Rationale Faithfulness

Having extracted R, we typically need to evaluate how faithful that explanation is for a model's prediction. Several studies evaluate the faithfulness of explanations by training a separate classifier on an end-task using only the rationales as input (Jain et al. 2020; Treviso and Martins 2020). These classifiers are inherently faithful, as they are trained only on the rationales (Jain et al. 2020). Other studies compare the ability of different feature scoring methods to identify important tokens by using word erasure, i.e. masking (Samek et al. 2017; Serrano and Smith 2019; Atanasova et al. 2020; Chen and Ji 2020; DeYoung et al. 2020; Zhang et al. 2021; Chrysostomou and Aletras 2021a). The intuition is that removing the most important tokens should result in a larger difference in the output probabilities than removing less important tokens, and will also lead to drops in classification accuracy (Robnik-Šikonja and Kononenko 2008; Nguyen 2018; Atanasova et al. 2020). DeYoung et al. (2020) use erasure to evaluate the comprehensiveness and sufficiency of rationales. Carton, Rathore, and Tan (2020) suggest normalizing these metrics using the predictions of the model with a baseline input, to allow for a fairer comparison across models and datasets.

3 Instance-Specific Rationale Extraction

Our aim is to address the one-size-fits-all, ad-hoc approach of previous work on rationale extraction with feature scoring methods, which typically extracts rationales using the same feature scoring method, length and type across all instances in a dataset. Inspired by word erasure approaches (Nguyen 2018; Serrano and Smith 2019; DeYoung et al. 2020), we mask the tokens that constitute a rationale and record the difference δ in a model's output distribution between using the full text and the reduced input. Our main assumption is that a sufficiently faithful rationale is the one that will result in the largest δ (Atanasova et al. 2020; Chen and Ji 2020; DeYoung et al. 2020).
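For concreteness, the erasure comparison above can be sketched as follows. This is a minimal illustration assuming a PyTorch sequence classifier in the style of Hugging Face transformers (i.e. the model returns an object with a logits field); the helper names and the use of JSD as the divergence are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of the erasure comparison: mask the candidate rationale,
# re-run the model and measure the divergence between the two output
# distributions. Model handling is illustrative (Hugging Face-style classifier).
import torch
import torch.nn.functional as F

def output_distribution(model, input_ids, attention_mask):
    """Predicted class distribution Y for one instance."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return F.softmax(logits, dim=-1)

def mask_rationale(input_ids, rationale_idx, mask_token_id):
    """Build the reduced input by replacing rationale positions with the mask token."""
    masked = input_ids.clone()
    masked[:, rationale_idx] = mask_token_id
    return masked

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

def rationale_divergence(model, input_ids, attention_mask, rationale_idx, mask_token_id):
    """delta = JSD(M(x), M(x with R masked)): larger means masking R changed the prediction more."""
    y_full = output_distribution(model, input_ids, attention_mask)
    y_masked = output_distribution(
        model, mask_rationale(input_ids, rationale_idx, mask_token_id), attention_mask
    )
    return jensen_shannon(y_full, y_masked)
```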
Following this assumption, we can extract rationales by selecting for each instance a specific (1) feature scoring method; (2) length; and (3) type.4

4 Similar to Jain et al. (2020), we consider two rationale types: (a) TOPK: tokens ranked by a feature scoring method, treating each word in the input sequence independently; and (b) CONTIGUOUS: a span of input tokens of length K with the highest overall score computed by a feature scoring method.

Instance-level Feature Scoring Selection

Given a set of M feature scoring methods {Ω1, . . . , ΩM}, we extract a rationale R as follows:
1. For each Ωi in the set we compute input importance scores ωi = Ωi(M, x, Y);
2. We subsequently select the K highest scored tokens (TOPK) or the highest scored K-gram (CONTIGUOUS) to form a rationale Ri, where K is the rationale length;
3. For each rationale we compute the difference δi between the reference model output (using the full text input) and the model output having masked the rationale, such that:

δi = D(Y, Y^m_i) = D(M(x), M(x\Ri))

where D is the function used to compute the difference between the two outputs;
4. We select the rationale R with the highest difference δmax = max({δ1, . . . , δi, . . . , δM}).

For computing δ, we experiment with the following divergence metrics (D): (a) Kullback-Leibler divergence (KL); (b) Jensen-Shannon divergence (JSD); (c) Perplexity (PERP.); and (d) Predicted Class Probability (CLASSDIFF).

Instance-level Rationale Length Selection

For computing the rationale length k at instance level and extracting the rationale R using a single feature scoring method Ω, we propose the following steps:
1. Given Ω, we first compute input importance scores ω = Ω(M, x, Y);
2. We then iterate over candidate lengths k = 1, . . . , N, where N is the fixed, pre-defined rationale length and k the candidate rationale length at the current iteration. We set N as the upper bound rationale length for our approach to make results comparable with fixed length rationales;
3. At each iteration we begin by masking the top k tokens (as indicated by ω) to form a candidate rationale Rk. When using TOPK we mask the k highest scored tokens, whilst with CONTIGUOUS we mask the highest scored k-gram;
4. We compute the difference δk between the reference model output Y and the model output having masked the candidate rationale, Y^m_k = M(x\Rk);
5. We record every δ until k = N and extract the rationale R with the highest difference δmax = max({δ1, . . . , δk, . . . , δN}), where the k at δmax is the computed rationale length.5

5 We also experimented with early stopping, whereby we stop iterating when the differences between δk and the δmax observed up to k remain under a specified threshold; however, this resulted in reduced performance.

Instance-level Rationale Type Selection

In a similar way to selecting a feature scoring method, our approach can also be used to select between different rationale types (i.e. CONTIGUOUS or TOPK) for each instance in the dataset.

Finally, our approach is flexible and can be easily modified to support selecting any of these parameters (i.e. feature scoring method, rationale length and rationale type) while keeping the rest fixed, or selecting any combination of them; a sketch of the combined selection loop follows below. An important benefit of our approach is that we extract rationales with different settings for each instance rather than using uniform settings globally (i.e. across the whole dataset), which we empirically demonstrate to be beneficial for faithfulness (§5).
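A sketch of the per-instance search over scorers, lengths and rationale types is given below, reusing rationale_divergence from the previous snippet. The scorer interface, the candidate grid and the skip rate (introduced as a performance-time trade-off in Section 4) are assumptions made for illustration rather than the authors' exact implementation.

```python
# Sketch of the instance-level selection: try every available feature scoring
# method, candidate length and rationale type, and keep the combination with
# the largest divergence delta. Names and the grid are illustrative.
import itertools
import torch

def topk_rationale(scores, k):
    """TOPK type: indices of the k highest scored tokens."""
    return torch.topk(scores, k).indices

def contiguous_rationale(scores, k):
    """CONTIGUOUS type: indices of the k-gram with the highest total score."""
    window_sums = scores.unfold(0, k, 1).sum(-1)
    start = int(window_sums.argmax())
    return torch.arange(start, start + k)

def select_instance_rationale(model, input_ids, attention_mask, mask_token_id,
                              scorers, max_len_ratio=0.2, skip=0.02):
    """Pick the (scorer, length, type) combination with the largest delta."""
    seq_len = int(attention_mask.sum())
    upper = max(1, int(max_len_ratio * seq_len))   # fixed upper bound N
    step = max(1, int(skip * seq_len))             # skip rate to reduce forward passes
    best = {"delta": float("-inf")}
    for name, scorer in scorers.items():           # e.g. {"attention": ..., "input_x_grad": ...}
        scores = scorer(model, input_ids, attention_mask)   # token importance omega
        for k, rtype in itertools.product(range(1, upper + 1, step), ("topk", "contiguous")):
            idx = topk_rationale(scores, k) if rtype == "topk" else contiguous_rationale(scores, k)
            delta = float(rationale_divergence(model, input_ids, attention_mask, idx, mask_token_id))
            if delta > best["delta"]:
                best = {"delta": delta, "scorer": name, "length": k,
                        "type": rtype, "rationale": idx}
    return best
```

Keeping any of the three parameters fixed amounts to shrinking the corresponding part of the search grid (e.g. a single scorer, a single k, or a single rationale type).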
4 Experimental Setup

Tasks

For our experiments we use the following datasets (details in Table 1):

Data     |W|  C  Train / Dev / Test         F1          N
SST      18   2  6,920 / 872 / 1,821        90.1 ± 0.2  20%
AG       36   4  102,000 / 18,000 / 7,600   93.5 ± 0.2  20%
Ev.Inf.  363  3  5,789 / 684 / 720          83.0 ± 1.6  10%
M.RC     305  2  24,029 / 3,214 / 4,848     73.2 ± 1.7  20%

Table 1: Dataset statistics including average words per instance (|W|), number of classes (C), data splits, F1 macro performance and the fixed, pre-defined rationale ratio across all instances (N).

- SST: Binary sentiment classification without neutral sentences (Socher et al. 2013).
- AG: News articles categorized into Science, Sports, Business, and World topics (Corso, Gulli, and Romani 2005).
- Evidence Inference (Ev.Inf.): Abstract-only biomedical articles describing randomized controlled trials. The task is to infer the relationship between a given intervention and comparator with respect to an outcome (Lehman et al. 2019).
- MultiRC (M.RC): A reading comprehension task with questions having multiple correct answers that depend on information from multiple sentences (Khashabi et al. 2018). Following DeYoung et al. (2020) and Jain et al. (2020), we convert this to a binary classification task where each rationale/question/answer triplet forms an instance and each candidate answer is labeled as True/False.

Similar to Jain et al. (2020), we use BERT (Devlin et al. 2019) for SST and AG; SciBERT (Beltagy, Lo, and Cohan 2019) for Ev.Inf.; and RoBERTa (Liu et al. 2019) for M.RC.

Feature Scoring Methods

We use a random baseline and six other feature scoring methods (to compute input importance scores), similar to Jain et al. (2020) and Serrano and Smith (2019); illustrative sketches of two of them follow the list below.

- Random (RAND): Random allocation of token importance.
- Attention (α): Token importance corresponding to normalized attention scores (Jain et al. 2020).
- Scaled Attention (α∇α): Scales the attention scores αi with their corresponding gradients, ∇αi = ∂ŷ/∂αi (Serrano and Smith 2019).
- InputXGrad (x∇x): Attributes input importance by multiplying the input by its gradient with respect to the predicted class, where ∇xi = ∂ŷ/∂xi (Kindermans et al. 2016; Atanasova et al. 2020).
- Integrated Gradients (IG): Ranking words by computing the integral of the gradients taken along a straight path from a baseline input (zero embedding vector) to the original input (Sundararajan, Taly, and Yan 2017).
- DeepLift: Ranking words according to the difference between the activation of each neuron and a reference activation (Shrikumar, Greenside, and Kundaje 2017).
- LIME: Ranking words by learning an interpretable model locally around the prediction (Ribeiro, Singh, and Guestrin 2016).
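For illustration, two of these scorers can be sketched against a generic Hugging Face-style transformer classifier as below; the choice of last-layer [CLS] attention averaged over heads is one common instantiation and an assumption here, as are the function names.

```python
# Illustrative sketches of two scorers that plug into the `scorers` dict of the
# earlier selection snippet. The attention layer/head aggregation is one common
# choice and not necessarily the exact setting used in the paper.
import torch

def attention_scores(model, input_ids, attention_mask):
    """alpha: attention from [CLS] to every token, last layer, averaged over heads."""
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    output_attentions=True)
    return out.attentions[-1][:, :, 0, :].mean(1).squeeze(0)

def input_x_grad(model, input_ids, attention_mask):
    """x (times) grad x: gradient of the predicted class w.r.t. the input embeddings,
    multiplied by the embeddings and summed over the embedding dimension."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=attention_mask)
    pred = int(out.logits.argmax(-1))
    out.logits[0, pred].backward()
    return (embeds.grad * embeds).sum(-1).abs().squeeze(0).detach()
```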
Evaluating Explanation Faithfulness

F1 macro: Similar to Arras et al. (2017), we measure the F1 macro performance of model M when masking the rationale in the original input (x\R). A key difference in our approach is that we use the predicted labels of the model with the full input as gold labels, as we are interested in the faithfulness of explanations for the predictions of the model. Larger drops in F1 scores indicate that the extracted rationale is more faithful.6

6 We also conducted experiments using the dataset gold labels, with comparable results.

Normalized Sufficiency (Norm Suff): We measure the degree to which the extracted rationales are sufficient for a model to make a prediction (DeYoung et al. 2020). Similar to Carton, Rathore, and Tan (2020), we bind sufficiency between 0 and 1 and use the reverse difference so that higher is better. We modify this metric and measure the normalized sufficiency (Carton, Rathore, and Tan 2020) such that:

Suff(x, ŷ, R) = 1 − max(0, p(ŷ|x) − p(ŷ|R))
NormSuff(x, ŷ, R) = [Suff(x, ŷ, R) − Suff(x, ŷ, 0)] / [1 − Suff(x, ŷ, 0)]     (1)

where Suff(x, ŷ, 0) is the sufficiency of a baseline input (zeroed out sequence) and ŷ the model predicted class using the full text x as input, such that ŷ = arg max(Y).

Normalized Comprehensiveness (Norm Comp): We measure the extent to which a rationale is needed for a prediction (DeYoung et al. 2020). For an explanation to be highly comprehensive, the model's prediction when masking the rationale should differ substantially from the model's prediction with the full text. Similarly to Carton, Rathore, and Tan (2020), we bind this metric between 0 and 1 and normalize it. We compute it by:

Comp(x, ŷ, R) = max(0, p(ŷ|x) − p(ŷ|x\R))
NormComp(x, ŷ, R) = Comp(x, ŷ, R) / [1 − Suff(x, ŷ, 0)]     (2)

A small sketch of both metrics is given at the end of this section.

We do not conduct human experiments to evaluate explanation faithfulness since these are only relevant to explanation plausibility (i.e. how understandable a rationale is to humans (Jacovi and Goldberg 2020)) and, in practice, faithfulness and plausibility do not correlate (Atanasova et al. 2020). Finally, we do not compare with select-then-predict methods (Lei, Barzilay, and Jaakkola 2016; Jain et al. 2020), as we are interested in faithfully explaining the model M and not in forming inherently faithful classifiers.

Performance-Time Trade-off

Input erasure approaches typically require N forward passes to compute a rationale length (see §3) when removing one token at a time. Similar to Nguyen (2018) and Atanasova et al. (2020), we expedite this process when selecting a rationale length by skipping every X% of tokens. For our work, we use a 2% skip rate, which led to a seven-fold reduction in the time required to compute rationales for datasets comprising long sequences, such as M.RC and Ev.Inf., with comparable performance in faithfulness to the slower process of removing one token at a time.
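Equations (1) and (2) translate directly into code; the sketch below assumes the three probability vectors (full input, rationale-only input, baseline input) have already been computed, e.g. with the output_distribution helper from the first snippet, and clips the normalized scores to [0, 1] in the spirit of Carton, Rathore, and Tan (2020).

```python
# Direct transcription of Eq. (1) and Eq. (2). p_* are 1-D class-probability
# vectors (full input, reduced input, baseline/zeroed input) and `pred` is the
# class predicted from the full input.
def _suff(p_full, p_reduced, pred):
    return 1.0 - max(0.0, float(p_full[pred]) - float(p_reduced[pred]))

def normalized_sufficiency(p_full, p_rationale, p_baseline, pred, eps=1e-8):
    suff = _suff(p_full, p_rationale, pred)
    suff_0 = _suff(p_full, p_baseline, pred)          # sufficiency of the baseline input
    norm = (suff - suff_0) / max(1.0 - suff_0, eps)   # Eq. (1)
    return min(1.0, max(0.0, norm))                   # bound to [0, 1]

def normalized_comprehensiveness(p_full, p_without_rationale, p_baseline, pred, eps=1e-8):
    comp = max(0.0, float(p_full[pred]) - float(p_without_rationale[pred]))
    suff_0 = _suff(p_full, p_baseline, pred)
    norm = comp / max(1.0 - suff_0, eps)              # Eq. (2)
    return min(1.0, max(0.0, norm))                   # bound to [0, 1]
```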
5 Results

Selecting Instance-specific Feature Scoring

Figure 1: F1 macro (lower is better), mean Norm Suff (higher is better) and mean Norm Comp (higher is better), when using any single feature scoring method across all instances in a dataset and our proposed method of selecting a feature scoring method for each instance (OURS), for TOPK rationale types. Panels: (a) F1 macro; (b) Norm Suff; (c) Norm Comp.

Figure 1 compares the faithfulness of extracted rationales when using our proposed method for selecting an instance-specific feature scoring method (OURS) and our baselines, which use a single fixed pre-defined feature scoring method globally (i.e. across all instances in a dataset). We measure faithfulness using F1 macro (lower is better), mean Norm Suff and mean Norm Comp (higher is better, respectively). For clarity, we show results using the TOPK rationale type.7

7 Also for clarity, all results presented in this work use JSD for D; the other divergence functions performed comparably.

Overall, results demonstrate that rationales extracted with our proposed approach are highly sufficient and comprehensive. In fact, our approach results in more sufficient rationales than all single feature scoring methods in AG and is comparable with the best Norm Suff scores in the remainder of the datasets. This suggests that even when rationales extracted with our proposed method are not the most sufficient, they are consistently highly sufficient (i.e. rationales extracted with our approach are significantly more sufficient than fixed, pre-defined feature scoring methods in 18 out of 24 test cases). Compared to our six baselines, the rationales extracted with our approach are significantly more comprehensive across all four datasets (Wilcoxon Rank Sum, p < .05).

Additionally, the larger drops in F1 macro performance demonstrate that rationales extracted with our proposed approach are more necessary for a model to make a prediction compared to a globally used, pre-defined feature scoring approach. Our results strengthen the hypothesis that whilst some feature scoring methods are better than others globally, they might not be optimal for all instances in a dataset (Jacovi and Goldberg 2020), and our approach helps mitigate that. Similar to Atanasova et al. (2020), we observe that the faithfulness performance of single feature scoring methods varies across datasets. For example, LIME returns more comprehensive rationales than α∇α in MultiRC, but is outperformed by the latter in SST. By returning consistently highly comprehensive and sufficient rationales, our proposed method helps reduce the variability in faithfulness performance observed when using any single feature scoring method across datasets.

Selecting Instance-specific Rationale Length

Table 2 shows the Relative Improvement (R.I.) ratio in mean Norm Suff and Norm Comp (>1.0 is better) between rationales extracted using a fixed pre-defined length (see N in Table 1) and rationales extracted using our method with instance-specific length, across feature scoring methods and datasets. For brevity we do not include results with F1 macro, where we make similar observations to comprehensiveness.

                       Norm Suff                   Norm Comp
FEAT          SST  M.RC  AG   Ev.Inf.     SST  M.RC  AG   Ev.Inf.
TOPK
  DeepLift    0.9  0.8   0.8  1.1         0.8  1.1   1.0  1.0
  LIME        1.0  0.7   0.9  0.9         0.9  1.1   1.0  1.0
  α           0.9  0.9   0.7  0.8         0.8  1.1   0.9  1.2
  α∇α         0.9  0.9   0.8  0.9         1.0  1.1   0.9  1.0
  IG          0.9  0.9   0.8  0.9         0.9  1.1   1.0  1.1
  x∇x         1.0  0.8   0.7  0.8         0.9  1.1   0.9  1.2
CONTIGUOUS
  DeepLift    0.9  0.9   0.8  1.2         0.9  1.1   1.3  1.5
  LIME        0.9  0.7   0.8  0.9         1.0  1.1   1.2  1.3
  α           0.9  0.9   0.7  0.9         0.7  1.1   1.0  1.2
  α∇α         0.9  0.8   0.8  0.9         1.0  1.1   1.1  1.1
  IG          0.9  0.8   0.8  1.0         1.0  1.2   1.2  1.4
  x∇x         0.9  0.8   0.7  1.0         1.0  1.1   1.0  1.3

Table 2: Relative Improvement (R.I.) ratios for mean Norm Suff and mean Norm Comp between fixed length rationales (see N in Table 1) and rationales with instance-specific length extracted using our method (>1.0 is better), grouped by rationale type.

Overall, rationales extracted using our approach are on average shorter than fixed length rationales. Specifically, rationale length drops from 20% to 16% on average in SST and AG; from 20% to 15% in M.RC; and from 10% to 7% in Ev.Inf. Norm Suff scores indicate that our on-average shorter rationales are overall slightly less, yet comparably, sufficient compared to longer, fixed-length rationales. For example, in SST, rationales with instance-specific length are 0.9-1.0 times as sufficient as rationales with pre-defined length. We find this particularly evident in datasets such as M.RC and Ev.Inf., where our rationales are on average 4-5% shorter (approximately 15 tokens shorter on average for α in M.RC) but still retain comparable sufficiency, while in some cases improving it (e.g. 1.2 R.I. in Ev.Inf. with DeepLift).

We also note that rationales extracted with instance-specific length are more comprehensive in most cases, despite being shorter on average compared to fixed-length rationales. For example, in Ev.Inf., CONTIGUOUS rationales with IG are 1.4 times more comprehensive when we select their length at instance level. Results also indicate that using our proposed method benefits CONTIGUOUS rationales more than TOPK for comprehensiveness, leading to increased R.I.
in the majority of cases. Overall, these findings support our initial hypothesis that in certain cases a rationale with longer than needed length might contain unnecessary information and adversely impact its comprehensiveness.

Selecting Instance-specific Feature Scoring, Length and Type

Table 3 shows mean Norm Suff and Norm Comp scores when using our proposed method to select at instance level (I-L) a combination of: (1) the feature scoring method (FEAT); (2) the rationale length (LEN); and (3) the rationale type (TYPE). For comparison, we also show scores of the best performing fixed (FIX) feature scoring function, rationale type and length (see Figure 1).

                                Norm Suff                   Norm Comp
TYPE        LEN   FEAT          SST  M.RC  AG   Ev.Inf.     SST  M.RC  AG   Ev.Inf.
TOPK        FIX   FIX           .68  .12   .37  .43         .54  .42   .28  .80
TOPK        I-L   FIX           .61  .11   .30  .37         .52  .46   .27  .82
TOPK        FIX   I-L           .63  .09   .44  .38         .57  .59   .41  .84
TOPK        I-L   I-L           .59  .07   .38  .36         .55  .62   .39  .86
CONTIGUOUS  FIX   FIX           .71  .07   .41  .85         .46  .47   .17  .55
CONTIGUOUS  I-L   FIX           .63  .06   .33  .78         .47  .54   .19  .62
CONTIGUOUS  FIX   I-L           .67  .07   .42  .82         .46  .60   .22  .59
CONTIGUOUS  I-L   I-L           .61  .05   .33  .76         .48  .65   .24  .67
I-L         I-L   I-L           .60  .06   .39  .49         .57  .69   .41  .88

Table 3: Mean Norm Suff and Norm Comp scores when we select at instance level (I-L) a combination of the: (1) rationale length (LEN); (2) feature scoring method (FEAT); and (3) rationale type (TYPE). {TYPE}-FIX-FIX and {TYPE}-I-L-FIX values are from the highest scoring feature scoring method (see Figure 1). Bold values denote the highest performing combination per column (higher is better).

We first observe that the highest Norm Suff scores across three datasets (SST, M.RC, Ev.Inf.) are obtained by the best performing fixed feature scoring method with fixed length and rationale type. Additionally, the best performing combination of our proposed approach for sufficiency is when we only select the feature scoring method, keeping the length and type fixed. This combination results in the highest Norm Suff scores in AG (.44 with the TOPK type compared to .42, which is the second best, with CONTIGUOUS) and competitive Norm Suff scores with the highest scoring combination elsewhere (e.g. .82 in Ev.Inf. with CONTIGUOUS compared to .85). We assume that combinations which include instance-specific lengths do not perform as well for sufficiency due to the shorter rationale length, which we have previously shown to partially degrade rationale sufficiency.

Finally, our results demonstrate that we obtain highly comprehensive rationales when selecting all parameters (FEAT + LEN + TYPE) at instance level using our approach. In fact, this results in higher Norm Comp scores compared to any other setting combination across all datasets. For example, in M.RC, selecting all parameters results in a Norm Comp score of .69, which is .22 units higher than the rationales extracted with fixed feature scoring method, length and type. This highlights the efficacy of our approach in extracting highly comprehensive rationales, without requiring strong a priori assumptions about rationale parameters.

Ablation Study

We finally perform an ablation study to examine the behavior and effectiveness of our approach by sequentially removing one feature scoring method at a time and measuring changes in F1 macro, Norm Suff and Norm Comp.
The intuition is that, for our approach to be effective, we should observe drops in faithfulness scores when removing feature attribution methods (i.e. we should extract more faithful rationales when having more feature scoring options to choose from). Figure 2 shows the results.

Figure 2: F1 macro (lower is better), mean Norm Suff and mean Norm Comp (higher is better), when extracting rationales with our approach given decreasing numbers of feature scoring methods. Panels: (a) F1 macro; (b) Norm Suff; (c) Norm Comp.

We first observe that removing one feature scoring method at a time results in increases in F1 macro (lower is better) and drops in Norm Comp scores (higher is better). This demonstrates that the faithfulness of the rationales extracted with our approach deteriorates as the number of feature scoring methods becomes smaller, highlighting the efficacy of our proposed approach. For example, in Ev.Inf., removing α∇α results in a drop of .14 in mean Norm Comp (.84 when including α∇α compared to .70 without it). On the other hand, we also observe that our method can still benefit from feature scoring methods that achieve low Norm Comp scores when used standalone, resulting in improvements in comprehensiveness and drops in F1 macro (e.g. α in SST). This indicates that our approach steadily improves rationale faithfulness for a model's predictions given a larger pool of available feature scoring methods.

Results show a deterioration in Norm Suff scores as the number of feature scoring methods becomes smaller, showing that, in the majority of the datasets, our method results in more sufficient rationales when presented with a larger list of available feature scoring methods. We hypothesize that this is not true for MultiRC due to the already low Norm Suff scores of the rationales (e.g. no more than 0.12). By using all six feature scoring methods, our approach produces highly sufficient rationales and is comparable to the set that achieved the highest sufficiency. For example, in Ev.Inf., using all feature scoring methods results in a Norm Suff score of approximately .38, compared to .39 for the highest scoring feature scoring set (all except LIME) and .15 for the lowest scoring one (x∇x). We also tested different combinations of feature scoring methods with similar observations.

Finally, we experimented with doubling the upper bound of the rationale length (from N to 2N) for both fixed length rationales and our proposed approach. Our approach still yielded more comprehensive rationales compared to the fixed-length ones, which were also highly sufficient.
6 Qualitative Analysis

Table 4 shows examples of the qualitative comparison between our approach (Ours) for selecting at instance level (I-L) a combination of the: (1) rationale length (LEN); and (2) feature scoring method (FEAT), against our baseline of fixed-length rationales from a fixed feature scoring method.

Example 1 (Data: AG, Id: test 4614)
[FIXED-LEN + α]: ... game last Friday night will stand , the CFL announced yesterday. While a review ...
[I-L-LEN + α (Ours)]: ... game last Friday night will stand , the CFL announced yesterday. While a review ...
[Predicted Topic || True Topic]: Decreased significantly || Decreased significantly

Example 2 (Data: Ev.Inf., Id: 3162205 2)
[FIXED-LEN + α∇α]: ... computed tomography ( 3D - CT ) scans . ABSTRACT.RESULTS : The control sides treated with an autograft showed significantly better Lenke scores than the study sides treated with β - CPP at 3 and 6 months postoperatively , but there was no difference between the two sides at 12 months . The fusion ..
[I-L-LEN + α∇α (Ours)]: ... computed tomography ( 3D - CT ) scans . ABSTRACT.RESULTS : The control sides treated with an autograft showed significantly better Lenke scores than the study sides treated with β - CPP at 3 ...
[Predicted Relationship || True Relationship]: Increased significantly || No significant difference

Example 3 (Data: SST, Id: test 694)
[FIXED-LEN + α]: ... Frontal is the antidote for Soderbergh fans who think he's gone too commercial ...
[I-L-LEN + I-L-FEAT (Ours)]: ... Frontal is the antidote for Soderbergh fans who think he's gone too commercial ...
[Predicted Sentiment || True Sentiment]: Negative || Positive

Example 4 (Data: SST, Id: test 1039)
[FIXED-LEN + α]: It's just incredibly dull.
[I-L-LEN + I-L-FEAT (Ours)]: It's just incredibly dull.
[Predicted Sentiment || True Sentiment]: Negative || Negative

Table 4: Examples when using our approach (Ours) to select at instance level (I-L) a combination of the: (1) rationale length (LEN); and (2) feature scoring method (FEAT), against our baseline of fixed-length rationales from a fixed feature scoring method.

Concise rationales: Example 1 presents an instance from AG. Our approach extracts a rationale that is six tokens shorter than the one with fixed length, while also achieving a higher Norm Comp score. However, the fixed length rationale scores higher in Norm Suff. We can assume from this that sufficiency positively correlates with rationale length.

Error analysis: Our assumption is that if a model makes a wrong prediction, we should be able to extract the rationale that best demonstrates what led to the wrong prediction. Example 2 shows an instance from Ev.Inf., where the model has wrongly predicted that Lenke scores at 12 months have 'increased significantly' instead of the correct 'no significant difference'. Surprisingly, both rationales recorded maximum scores (1.0) in Norm Suff and Norm Comp. We observe that the correct answer is included in the fixed length rationale, yet the model made a wrong prediction. On the contrary, our rationale highlights something directly related to its prediction. Example 3 presents an instance from SST, where the fixed-length rationale and the instance-specific rationale (ours) attend to different sections of the text. Our rationale scored lower for Norm Suff; however, we observe that it aligns more closely with the predicted sentiment.

When using a fixed pre-defined length is not sufficient: Example 4 presents a different scenario, where the fixed-length rationale for SST is at 20% whilst the upper bound N for our rationale is at 40%. The intuition is that in certain cases a fixed rationale length might not be sufficient for all instances to explain a prediction. We argue that our approach highlighted something more informative for the task ('incredibly dull' compared to 'incredibly'), due to removing the restriction of a pre-defined fixed length.

7 Conclusions

We have proposed a simple yet effective approach for selecting at instance level the (1) feature scoring method; (2) length; and (3) type of the rationale. We empirically demonstrated that rationales extracted with our approach are significantly more comprehensive and highly sufficient, while being shorter, compared to rationales extracted with a fixed feature scoring method, length and type. Finally, we consider our work an important step towards instance-level faithful rationalization, while finding the most sufficient rationale remains an interesting direction for future work.

Acknowledgments

NA is supported by EPSRC grant EP/V055712/1, part of the European Commission CHIST-ERA programme, call 2019 XAI: Explainable Machine Learning-based Artificial Intelligence.
References

Arras, L.; Horn, F.; Montavon, G.; Müller, K.-R.; and Samek, W. 2016. Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, 1–7. Berlin, Germany: Association for Computational Linguistics.

Arras, L.; Montavon, G.; Müller, K.-R.; and Samek, W. 2017. Explaining Recurrent Neural Network Predictions in Sentiment Analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 159–168. Copenhagen, Denmark: Association for Computational Linguistics.

Atanasova, P.; Simonsen, J. G.; Lioma, C.; and Augenstein, I. 2020. A Diagnostic Study of Explainability Techniques for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3256–3274. Online: Association for Computational Linguistics.

Bastings, J.; Aziz, W.; and Titov, I. 2019. Interpretable Neural Predictions with Differentiable Binary Variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2963–2977. Florence, Italy: Association for Computational Linguistics.

Bastings, J.; and Filippova, K. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 149–155. Online: Association for Computational Linguistics.

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3615–3620. Hong Kong, China: Association for Computational Linguistics.

Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258.

Carton, S.; Rathore, A.; and Tan, C. 2020. Evaluating and Characterizing Human Rationales. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9294–9307. Online: Association for Computational Linguistics.

Chen, H.; and Ji, Y. 2020. Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4236–4251. Online: Association for Computational Linguistics.

Chrysostomou, G.; and Aletras, N. 2021a. Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8189–8200. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Chrysostomou, G.; and Aletras, N. 2021b. Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 477–488. Online: Association for Computational Linguistics.

Corso, G. M. D.; Gulli, A.; and Romani, F. 2005. Ranking a stream of news.
In Ellis, A.; and Hagino, T., eds., Proceedings of the 14th International Conference on World Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005, 97–106. ACM.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.

DeYoung, J.; Jain, S.; Rajani, N. F.; Lehman, E.; Xiong, C.; Socher, R.; and Wallace, B. C. 2020. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4443–4458. Online: Association for Computational Linguistics.

Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4198–4205. Online: Association for Computational Linguistics.

Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3543–3556. Minneapolis, Minnesota: Association for Computational Linguistics.

Jain, S.; Wiegreffe, S.; Pinter, Y.; and Wallace, B. C. 2020. Learning to Faithfully Rationalize by Construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4459–4473. Online: Association for Computational Linguistics.

Khashabi, D.; Chaturvedi, S.; Roth, M.; Upadhyay, S.; and Roth, D. 2018. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 252–262. New Orleans, Louisiana: Association for Computational Linguistics.

Kim, S.; Yi, J.; Kim, E.; and Yoon, S. 2020. Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3154–3167. Online: Association for Computational Linguistics.

Kindermans, P.-J.; Schütt, K.; Müller, K.-R.; and Dähne, S. 2016. Investigating the influence of noise and distractors on the interpretation of neural networks. arXiv preprint arXiv:1611.07270.

Lehman, E.; DeYoung, J.; Barzilay, R.; and Wallace, B. C. 2019. Inferring Which Medical Treatments Work from Reports of Clinical Trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3705–3717. Minneapolis, Minnesota: Association for Computational Linguistics.

Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing Neural Predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 107–117. Austin, Texas: Association for Computational Linguistics.

Li, J.; Chen, X.; Hovy, E.; and Jurafsky, D. 2016. Visualizing and Understanding Neural Models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 681–691.
San Diego, California: Association for Computational Linguistics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Nguyen, D. 2018. Comparing Automatic and Human Evaluation of Local Explanations for Text Classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1069–1078. New Orleans, Louisiana: Association for Computational Linguistics.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Krishnapuram, B.; Shah, M.; Smola, A. J.; Aggarwal, C. C.; Shen, D.; and Rastogi, R., eds., Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 1135–1144. ACM.

Robnik-Šikonja, M.; and Kononenko, I. 2008. Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5): 589–600.

Samek, W.; Binder, A.; Montavon, G.; Lapuschkin, S.; and Müller, K. 2017. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11): 2660–2673.

Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2931–2951. Florence, Italy: Association for Computational Linguistics.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning Important Features Through Propagating Activation Differences. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3145–3153. PMLR.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. Seattle, Washington, USA: Association for Computational Linguistics.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, 3319–3328. PMLR.

Treviso, M.; and Martins, A. F. T. 2020. The Explanation Game: Towards Prediction Explainability through Sparse Communication. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 107–118. Online: Association for Computational Linguistics.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Zhang, W.; Huang, Z.; Zhu, Y.; Ye, G.; Cui, X.; and Zhang, F. 2021. On Sample Based Explanation Methods for NLP: Faithfulness, Efficiency and Semantic Evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 5399–5411.
Online: Association for Computational Linguistics.