# Flexible Instance-Specific Rationalization of NLP Models

George Chrysostomou, Nikolaos Aletras
Department of Computer Science, University of Sheffield
gchrysostomou1@sheffield.ac.uk, n.aletras@sheffield.ac.uk

Abstract

Recent research on model interpretability in natural language processing extensively uses feature scoring methods for identifying which parts of the input are the most important for a model to make a prediction (i.e. explanation or rationale). However, previous research has shown that there is no clear best scoring method across various text classification tasks, while practitioners typically have to make several other ad-hoc choices regarding the length and the type of the rationale (e.g. short or long, contiguous or not). Inspired by this, we propose a simple yet effective and flexible method that allows selecting optimally for each data instance: (1) a feature scoring method; (2) the length; and (3) the type of the rationale. Our method is inspired by input erasure approaches to interpretability, which assume that the most faithful rationale for a prediction should be the one with the highest difference between the model's output distribution using the full text and the text after removing the rationale as input, respectively. Evaluation on four standard text classification datasets shows that our proposed method provides more faithful, comprehensive and highly sufficient explanations compared to using a fixed feature scoring method, rationale length and type. More importantly, we demonstrate that a practitioner is not required to make any ad-hoc choices in order to extract faithful rationales using our approach.

1 Introduction

Large pre-trained transformer-based language models such as BERT (Devlin et al. 2019; Bommasani et al. 2021) currently dominate performance across language understanding benchmarks (Wang et al. 2019). These developments have opened up new challenges on how to extract faithful explanations (i.e. rationales1), which accurately represent the true reasons behind a model's prediction when adapted to downstream tasks (Jacovi and Goldberg 2020).2 3

1 We use these terms interchangeably throughout the paper.
2 Code for experiments available at: https://github.com/GChrysostomou/instance-specific-rationale
3 We provide an extended version of our work with an appendix at: https://arxiv.org/abs/2104.08219

Recent studies use feature scoring (i.e. attribution) methods such as gradient and attention-based scores (Arras et al. 2016; Sundararajan, Taly, and Yan 2017; Jain and Wallace 2019; Chrysostomou and Aletras 2021b) to identify important (i.e. salient) segments of the input and subsequently extract them as rationales (Jain et al. 2020; Treviso and Martins 2020). However, a single feature scoring method is typically applied across the whole dataset (i.e. globally). This might not be optimal for individual instances, resulting in less faithful explanations (Jacovi and Goldberg 2020; Atanasova et al. 2020). Additionally, rationales are usually extracted using a pre-defined fixed length (i.e. the ratio of a rationale compared to the full input sequence) and type (i.e. top-k terms or contiguous) globally. We hypothesize that using a fixed length or type for different instances could result in shorter (i.e.
not sufficient for explaining a model's prediction) or longer than needed rationales, reducing rationale faithfulness, whilst finding the optimal explanation length is an open problem (Zhang et al. 2021). Moreover, to extract rationales, practitioners are currently required to make assumptions about the rationale parameters (i.e. feature scoring method, length and type), whilst different choices of parameters might substantially affect the faithfulness of the rationales.

In this paper, we propose a simple yet effective method that operates at instance level and mitigates the a priori selection of a specific (1) feature scoring method; (2) length; and (3) type when extracting faithful rationales. Our proposed method is flexible and allows the automatic selection of some or all of these instance-specific parameters. Inspired by erasure methods, it functions by computing differences between a model's output distributions obtained using the full input sequence and the input without the rationale, respectively. We base this on the assumption that by removing important tokens from the sequence, we should observe large divergences in the model's predicted distribution (Nguyen 2018; Serrano and Smith 2019; DeYoung et al. 2020), resulting in more faithful rationales (Atanasova et al. 2020; Chen and Ji 2020).

The contributions of our work are thus as follows:
- To the best of our knowledge, we are the first to propose a method for instance-specific faithful rationale extraction;
- We empirically demonstrate that rationales extracted with an instance-specific flexible feature scoring method, length and type using our proposed method are more comprehensive than rationales with fixed, pre-defined parameters;
- We show that our method results in consistently highly sufficient rationales, mitigating the variability in faithfulness of different feature scoring methods across datasets when used globally, i.e. the same for all instances (Atanasova et al. 2020).

2 Background and Related Work

Rationale Extraction

Given a trained model M, an input x = [x1, . . . , xT] and a predicted distribution over classes Y, rationale extraction methods seek to identify the most important subset R ⊆ x of the input for explaining the model's prediction. There are two common approaches for extracting rationales. The first consists of two modules jointly trained on an end-task, e.g. sentiment analysis (Lei, Barzilay, and Jaakkola 2016; Bastings, Aziz, and Titov 2019). The first module extracts the rationale (i.e. typically by learning to select which inputs should be masked) and the second module is trained using only the rationale. The second approach consists of using feature scoring (or attribution) methods (i.e. salience metrics) to first identify important parts of the input and then extract the rationales from M (Jain et al. 2020; Treviso and Martins 2020; DeYoung et al. 2020). A limitation of the first approach is that the models are hard to train compared to the latter and often do not reach high accuracy (Jain et al. 2020). Regarding the latter approach, a limitation is that the same feature scoring method is applied to all instances in a given dataset, irrespective of whether that method is the best for a particular instance (Atanasova et al. 2020; Jacovi and Goldberg 2020), while finding a suitable explanation length is an open problem (Zhang et al. 2021).
Computing Input Importance

Feature scoring methods Ω compute input importance scores ω for each token in the sequence x, such that ω = Ω(M, x, Y). High scores indicate that the associated tokens contributed more towards a model's prediction. Subsequently, R is extracted by selecting the K highest scored tokens (or the K-gram for contiguous rationales) in a sequence (DeYoung et al. 2020; Jain et al. 2020). A common approach to computing ω is by calculating the gradients of the prediction with respect to the input (Kindermans et al. 2016; Li et al. 2016; Arras et al. 2016; Sundararajan, Taly, and Yan 2017; Bastings and Filippova 2020). Jain et al. (2020) use attention weights to attribute token importance for rationale extraction, while Treviso and Martins (2020) propose sparse attention. Li et al. (2016) compute input importance scores by measuring the difference in a model's prediction between keeping and omitting each token, with Kim et al. (2020) also suggesting input marginalization as an alternative to token omission. Another way is using sparse linear meta-models that are easier to interpret (Ribeiro, Singh, and Guestrin 2016). Atanasova et al. (2020), however, show that sparse linear meta-models are not as faithful as gradient-based approaches for interpreting large language models.

Evaluating Rationale Faithfulness

Having extracted R, we typically need to evaluate how faithful that explanation is for a model's prediction. Several studies evaluate the faithfulness of explanations by training a separate classifier on an end-task using only the rationales as input (Jain et al. 2020; Treviso and Martins 2020). These classifiers are inherently faithful, as they are trained only on the rationales (Jain et al. 2020). Other studies compare the ability of different feature scoring methods to identify important tokens by using word erasure, i.e. masking (Samek et al. 2017; Serrano and Smith 2019; Atanasova et al. 2020; Chen and Ji 2020; DeYoung et al. 2020; Zhang et al. 2021; Chrysostomou and Aletras 2021a). The intuition is that removing the most important tokens should result in a larger difference in the output probabilities than removing less important tokens, and will also lead to drops in classification accuracy (Robnik-Šikonja and Kononenko 2008; Nguyen 2018; Atanasova et al. 2020). DeYoung et al. (2020) use erasure to evaluate the comprehensiveness and sufficiency of rationales. Carton, Rathore, and Tan (2020) suggest normalizing these metrics using the predictions of the model with a baseline input, to allow for a fairer comparison across models and datasets.

3 Instance-Specific Rationale Extraction

Our aim is to address the one-size-fits-all, ad-hoc approach of previous work on rationale extraction with feature scoring methods, which typically extracts rationales using the same feature scoring method, length and type across all instances in a dataset. Inspired by word erasure approaches (Nguyen 2018; Serrano and Smith 2019; DeYoung et al. 2020), we mask the tokens that constitute a rationale and record the difference δ in a model's output distribution between using the full text and the reduced input. Our main assumption is that a sufficiently faithful rationale is the one that will result in the largest δ (Atanasova et al. 2020; Chen and Ji 2020; DeYoung et al. 2020).
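For concreteness, the erasure comparison above can be sketched as follows. This is a minimal illustration assuming a PyTorch sequence classifier in the style of Hugging Face transformers (i.e. the model returns an object with a logits field); the helper names and the use of JSD as the divergence are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of the erasure comparison: mask the candidate rationale,
# re-run the model and measure the divergence between the two output
# distributions. Model handling is illustrative (Hugging Face-style classifier).
import torch
import torch.nn.functional as F

def output_distribution(model, input_ids, attention_mask):
    """Predicted class distribution Y for one instance."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return F.softmax(logits, dim=-1)

def mask_rationale(input_ids, rationale_idx, mask_token_id):
    """Build the reduced input by replacing rationale positions with the mask token."""
    masked = input_ids.clone()
    masked[:, rationale_idx] = mask_token_id
    return masked

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

def rationale_divergence(model, input_ids, attention_mask, rationale_idx, mask_token_id):
    """delta = JSD(M(x), M(x with R masked)): larger means masking R changed the prediction more."""
    y_full = output_distribution(model, input_ids, attention_mask)
    y_masked = output_distribution(
        model, mask_rationale(input_ids, rationale_idx, mask_token_id), attention_mask
    )
    return jensen_shannon(y_full, y_masked)
```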
Following this assumption, we can extract rationales by selecting for each instance a specific (1) feature scoring method; (2) length; and (3) type.4

4 Similar to Jain et al. (2020), we consider two rationale types: (a) TOPK: tokens ranked by a feature scoring method, treating each word in the input sequence independently; and (b) CONTIGUOUS: a span of input tokens of length K with the highest overall score computed by a feature scoring method.

Instance-level Feature Scoring Selection

Given a set of M feature scoring methods {Ω1, . . . , ΩM}, we extract a rationale R as follows:
1. For each Ωi in the set we compute input importance scores ωi = Ωi(M, x, Y);
2. We subsequently select the K highest scored tokens (TOPK) or the highest scored K-gram (CONTIGUOUS) to form a rationale Ri, where K is the rationale length;
3. For each rationale we compute the difference δi between the reference model output (using the full text input) and the model output having masked the rationale, such that:

δi = D(Y, Y^m_i) = D(M(x), M(x\Ri))

where D is the function used to compute the difference between the two outputs;
4. We select the rationale R with the highest difference δmax = max({δ1, . . . , δi, . . . , δM}).

For computing δ, we experiment with the following divergence metrics (D): (a) Kullback-Leibler divergence (KL); (b) Jensen-Shannon divergence (JSD); (c) Perplexity (PERP.); and (d) Predicted Class Probability (CLASSDIFF).

Instance-level Rationale Length Selection

For computing the rationale length k at instance level and extracting the rationale R using a single feature scoring method Ω, we propose the following steps:
1. Given Ω, we first compute input importance scores ω = Ω(M, x, Y);
2. We then iterate over candidate lengths k = 1, . . . , N, where N is the fixed, pre-defined rationale length and k the candidate rationale length at the current iteration. We set N as the upper bound rationale length for our approach to make results comparable with fixed length rationales;
3. At each iteration we begin by masking the top k tokens (as indicated by ω) to form a candidate rationale Rk. When using TOPK we mask the k highest scored tokens, whilst with CONTIGUOUS we mask the highest scored k-gram;
4. We compute the difference δk between the reference model output Y and the model output having masked the candidate rationale, Y^m_k = M(x\Rk);
5. We record every δ until k = N and extract the rationale R with the highest difference δmax = max({δ1, . . . , δk, . . . , δN}), where the k at δmax is the computed rationale length.5

5 We also experimented with early stopping, whereby we stop iterating when the differences between δk and the δmax observed up to k remain under a specified threshold; however, this resulted in reduced performance.

Instance-level Rationale Type Selection

In a similar way to selecting a feature scoring method, our approach can also be used to select between different rationale types (i.e. CONTIGUOUS or TOPK) for each instance in the dataset.

Finally, our approach is flexible and can be easily modified to support selecting any of these parameters (i.e. feature scoring method, rationale length and rationale type) while keeping the rest fixed, or selecting any combination of them; a sketch of the combined selection loop follows below. An important benefit of our approach is that we extract rationales with different settings for each instance rather than using uniform settings globally (i.e. across the whole dataset), which we empirically demonstrate to be beneficial for faithfulness (§5).
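A sketch of the per-instance search over scorers, lengths and rationale types is given below, reusing rationale_divergence from the previous snippet. The scorer interface, the candidate grid and the skip rate (introduced as a performance-time trade-off in Section 4) are assumptions made for illustration rather than the authors' exact implementation.

```python
# Sketch of the instance-level selection: try every available feature scoring
# method, candidate length and rationale type, and keep the combination with
# the largest divergence delta. Names and the grid are illustrative.
import itertools
import torch

def topk_rationale(scores, k):
    """TOPK type: indices of the k highest scored tokens."""
    return torch.topk(scores, k).indices

def contiguous_rationale(scores, k):
    """CONTIGUOUS type: indices of the k-gram with the highest total score."""
    window_sums = scores.unfold(0, k, 1).sum(-1)
    start = int(window_sums.argmax())
    return torch.arange(start, start + k)

def select_instance_rationale(model, input_ids, attention_mask, mask_token_id,
                              scorers, max_len_ratio=0.2, skip=0.02):
    """Pick the (scorer, length, type) combination with the largest delta."""
    seq_len = int(attention_mask.sum())
    upper = max(1, int(max_len_ratio * seq_len))   # fixed upper bound N
    step = max(1, int(skip * seq_len))             # skip rate to reduce forward passes
    best = {"delta": float("-inf")}
    for name, scorer in scorers.items():           # e.g. {"attention": ..., "input_x_grad": ...}
        scores = scorer(model, input_ids, attention_mask)   # token importance omega
        for k, rtype in itertools.product(range(1, upper + 1, step), ("topk", "contiguous")):
            idx = topk_rationale(scores, k) if rtype == "topk" else contiguous_rationale(scores, k)
            delta = float(rationale_divergence(model, input_ids, attention_mask, idx, mask_token_id))
            if delta > best["delta"]:
                best = {"delta": delta, "scorer": name, "length": k,
                        "type": rtype, "rationale": idx}
    return best
```

Keeping any of the three parameters fixed amounts to shrinking the corresponding part of the search grid (e.g. a single scorer, a single k, or a single rationale type).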
4 Experimental Setup

Tasks

For our experiments we use the following datasets (details in Table 1):

Data     |W|  C  Train / Dev / Test         F1          N
SST      18   2  6,920 / 872 / 1,821        90.1 ± 0.2  20%
AG       36   4  102,000 / 18,000 / 7,600   93.5 ± 0.2  20%
Ev.Inf.  363  3  5,789 / 684 / 720          83.0 ± 1.6  10%
M.RC     305  2  24,029 / 3,214 / 4,848     73.2 ± 1.7  20%

Table 1: Dataset statistics including average words per instance (|W|), number of classes (C), data splits, F1 macro performance and the fixed, pre-defined rationale ratio across all instances (N).

- SST: Binary sentiment classification without neutral sentences (Socher et al. 2013).
- AG: News articles categorized into Science, Sports, Business, and World topics (Corso, Gulli, and Romani 2005).
- Evidence Inference (Ev.Inf.): Abstract-only biomedical articles describing randomized controlled trials. The task is to infer the relationship between a given intervention and comparator with respect to an outcome (Lehman et al. 2019).
- MultiRC (M.RC): A reading comprehension task with questions having multiple correct answers that depend on information from multiple sentences (Khashabi et al. 2018). Following DeYoung et al. (2020) and Jain et al. (2020), we convert this to a binary classification task where each rationale/question/answer triplet forms an instance and each candidate answer is labeled as True/False.

Similar to Jain et al. (2020), we use BERT (Devlin et al. 2019) for SST and AG; SciBERT (Beltagy, Lo, and Cohan 2019) for Ev.Inf.; and RoBERTa (Liu et al. 2019) for M.RC.

Feature Scoring Methods

We use a random baseline and six other feature scoring methods (to compute input importance scores), similar to Jain et al. (2020) and Serrano and Smith (2019); illustrative sketches of two of them follow the list below.

- Random (RAND): Random allocation of token importance.
- Attention (α): Token importance corresponding to normalized attention scores (Jain et al. 2020).
- Scaled Attention (α∇α): Scales the attention scores αi with their corresponding gradients, ∇αi = ∂ŷ/∂αi (Serrano and Smith 2019).
- InputXGrad (x∇x): Attributes input importance by multiplying the input by its gradient with respect to the predicted class, where ∇xi = ∂ŷ/∂xi (Kindermans et al. 2016; Atanasova et al. 2020).
- Integrated Gradients (IG): Ranking words by computing the integral of the gradients taken along a straight path from a baseline input (zero embedding vector) to the original input (Sundararajan, Taly, and Yan 2017).
- DeepLift: Ranking words according to the difference between the activation of each neuron and a reference activation (Shrikumar, Greenside, and Kundaje 2017).
- LIME: Ranking words by learning an interpretable model locally around the prediction (Ribeiro, Singh, and Guestrin 2016).
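For illustration, two of these scorers can be sketched against a generic Hugging Face-style transformer classifier as below; the choice of last-layer [CLS] attention averaged over heads is one common instantiation and an assumption here, as are the function names.

```python
# Illustrative sketches of two scorers that plug into the `scorers` dict of the
# earlier selection snippet. The attention layer/head aggregation is one common
# choice and not necessarily the exact setting used in the paper.
import torch

def attention_scores(model, input_ids, attention_mask):
    """alpha: attention from [CLS] to every token, last layer, averaged over heads."""
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    output_attentions=True)
    return out.attentions[-1][:, :, 0, :].mean(1).squeeze(0)

def input_x_grad(model, input_ids, attention_mask):
    """x (times) grad x: gradient of the predicted class w.r.t. the input embeddings,
    multiplied by the embeddings and summed over the embedding dimension."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=attention_mask)
    pred = int(out.logits.argmax(-1))
    out.logits[0, pred].backward()
    return (embeds.grad * embeds).sum(-1).abs().squeeze(0).detach()
```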
Evaluating Explanation Faithfulness

F1 macro: Similar to Arras et al. (2017), we measure the F1 macro performance of model M when masking the rationale in the original input (x\R). A key difference in our approach is that we use the predicted labels of the model with the full input as gold labels, as we are interested in the faithfulness of explanations for the predictions of the model. Larger drops in F1 scores indicate that the extracted rationale is more faithful.6

6 We also conducted experiments using the dataset gold labels, with comparable results.

Normalized Sufficiency (Norm Suff): We measure the degree to which the extracted rationales are sufficient for a model to make a prediction (DeYoung et al. 2020). Similar to Carton, Rathore, and Tan (2020), we bind sufficiency between 0 and 1 and use the reverse difference so that higher is better. We modify this metric and measure the normalized sufficiency (Carton, Rathore, and Tan 2020) such that:

Suff(x, ŷ, R) = 1 − max(0, p(ŷ|x) − p(ŷ|R))
NormSuff(x, ŷ, R) = [Suff(x, ŷ, R) − Suff(x, ŷ, 0)] / [1 − Suff(x, ŷ, 0)]     (1)

where Suff(x, ŷ, 0) is the sufficiency of a baseline input (zeroed out sequence) and ŷ the model predicted class using the full text x as input, such that ŷ = arg max(Y).

Normalized Comprehensiveness (Norm Comp): We measure the extent to which a rationale is needed for a prediction (DeYoung et al. 2020). For an explanation to be highly comprehensive, the model's prediction when masking the rationale should differ substantially from the model's prediction with the full text. Similarly to Carton, Rathore, and Tan (2020), we bind this metric between 0 and 1 and normalize it. We compute it by:

Comp(x, ŷ, R) = max(0, p(ŷ|x) − p(ŷ|x\R))
NormComp(x, ŷ, R) = Comp(x, ŷ, R) / [1 − Suff(x, ŷ, 0)]     (2)

A small sketch of both metrics is given at the end of this section.

We do not conduct human experiments to evaluate explanation faithfulness since these are only relevant to explanation plausibility (i.e. how understandable a rationale is to humans (Jacovi and Goldberg 2020)) and, in practice, faithfulness and plausibility do not correlate (Atanasova et al. 2020). Finally, we do not compare with select-then-predict methods (Lei, Barzilay, and Jaakkola 2016; Jain et al. 2020), as we are interested in faithfully explaining the model M and not in forming inherently faithful classifiers.

Performance-Time Trade-off

Input erasure approaches typically require N forward passes to compute a rationale length (see §3) when removing one token at a time. Similar to Nguyen (2018) and Atanasova et al. (2020), we expedite this process when selecting a rationale length by skipping every X% of tokens. For our work, we use a 2% skip rate, which led to a seven-fold reduction in the time required to compute rationales for datasets comprising long sequences, such as M.RC and Ev.Inf., with comparable performance in faithfulness to the slower process of removing one token at a time.
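Equations (1) and (2) translate directly into code; the sketch below assumes the three probability vectors (full input, rationale-only input, baseline input) have already been computed, e.g. with the output_distribution helper from the first snippet, and clips the normalized scores to [0, 1] in the spirit of Carton, Rathore, and Tan (2020).

```python
# Direct transcription of Eq. (1) and Eq. (2). p_* are 1-D class-probability
# vectors (full input, reduced input, baseline/zeroed input) and `pred` is the
# class predicted from the full input.
def _suff(p_full, p_reduced, pred):
    return 1.0 - max(0.0, float(p_full[pred]) - float(p_reduced[pred]))

def normalized_sufficiency(p_full, p_rationale, p_baseline, pred, eps=1e-8):
    suff = _suff(p_full, p_rationale, pred)
    suff_0 = _suff(p_full, p_baseline, pred)          # sufficiency of the baseline input
    norm = (suff - suff_0) / max(1.0 - suff_0, eps)   # Eq. (1)
    return min(1.0, max(0.0, norm))                   # bound to [0, 1]

def normalized_comprehensiveness(p_full, p_without_rationale, p_baseline, pred, eps=1e-8):
    comp = max(0.0, float(p_full[pred]) - float(p_without_rationale[pred]))
    suff_0 = _suff(p_full, p_baseline, pred)
    norm = comp / max(1.0 - suff_0, eps)              # Eq. (2)
    return min(1.0, max(0.0, norm))                   # bound to [0, 1]
```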
5 Results

Selecting Instance-specific Feature Scoring

Figure 1: F1 macro (lower is better), mean Norm Suff (higher is better) and mean Norm Comp (higher is better), when using any single feature scoring method across all instances in a dataset and our proposed method of selecting a feature scoring method for each instance (OURS), for TOPK rationale types. Panels: (a) F1 macro; (b) Norm Suff; (c) Norm Comp.

Figure 1 compares the faithfulness of extracted rationales when using our proposed method for selecting an instance-specific feature scoring method (OURS) and our baselines, which use a single fixed pre-defined feature scoring method globally (i.e. across all instances in a dataset). We measure faithfulness using F1 macro (lower is better), mean Norm Suff and mean Norm Comp (higher is better, respectively). For clarity, we show results using the TOPK rationale type.7

7 Also for clarity, all results presented in this work use JSD for D; the other divergence functions performed comparably.

Overall, results demonstrate that rationales extracted with our proposed approach are highly sufficient and comprehensive. In fact, our approach results in more sufficient rationales than all single feature scoring methods in AG and is comparable with the best Norm Suff scores in the remainder of the datasets. This suggests that even when rationales extracted with our proposed method are not the most sufficient, they are consistently highly sufficient (i.e. rationales extracted with our approach are significantly more sufficient than fixed, pre-defined feature scoring methods in 18 out of 24 test cases). Compared to our six baselines, the rationales extracted with our approach are significantly more comprehensive across all four datasets (Wilcoxon Rank Sum, p < .05).

Additionally, the larger drops in F1 macro performance demonstrate that rationales extracted with our proposed approach are more necessary for a model to make a prediction compared to a globally used, pre-defined feature scoring approach. Our results strengthen the hypothesis that whilst some feature scoring methods are better than others globally, they might not be optimal for all instances in a dataset (Jacovi and Goldberg 2020), and our approach helps mitigate that. Similar to Atanasova et al. (2020), we observe that the faithfulness performance of single feature scoring methods varies across datasets. For example, LIME returns more comprehensive rationales than α∇α in MultiRC, but is outperformed by the latter in SST. By returning consistently highly comprehensive and sufficient rationales, our proposed method helps reduce the variability in faithfulness performance observed when using any single feature scoring method across datasets.

Selecting Instance-specific Rationale Length

Table 2 shows the Relative Improvement (R.I.) ratio in mean Norm Suff and Norm Comp (>1.0 is better) between rationales extracted using a fixed pre-defined length (see N in Table 1) and rationales extracted using our method with instance-specific length, across feature scoring methods and datasets. For brevity we do not include results with F1 macro, where we make similar observations to comprehensiveness.

                       Norm Suff                   Norm Comp
FEAT          SST  M.RC  AG   Ev.Inf.     SST  M.RC  AG   Ev.Inf.
TOPK
  DeepLift    0.9  0.8   0.8  1.1         0.8  1.1   1.0  1.0
  LIME        1.0  0.7   0.9  0.9         0.9  1.1   1.0  1.0
  α           0.9  0.9   0.7  0.8         0.8  1.1   0.9  1.2
  α∇α         0.9  0.9   0.8  0.9         1.0  1.1   0.9  1.0
  IG          0.9  0.9   0.8  0.9         0.9  1.1   1.0  1.1
  x∇x         1.0  0.8   0.7  0.8         0.9  1.1   0.9  1.2
CONTIGUOUS
  DeepLift    0.9  0.9   0.8  1.2         0.9  1.1   1.3  1.5
  LIME        0.9  0.7   0.8  0.9         1.0  1.1   1.2  1.3
  α           0.9  0.9   0.7  0.9         0.7  1.1   1.0  1.2
  α∇α         0.9  0.8   0.8  0.9         1.0  1.1   1.1  1.1
  IG          0.9  0.8   0.8  1.0         1.0  1.2   1.2  1.4
  x∇x         0.9  0.8   0.7  1.0         1.0  1.1   1.0  1.3

Table 2: Relative Improvement (R.I.) ratios for mean Norm Suff and mean Norm Comp between fixed length rationales (see N in Table 1) and rationales with instance-specific length extracted using our method (>1.0 is better), grouped by rationale type.

Overall, rationales extracted using our approach are on average shorter than fixed length rationales. Specifically, rationale length drops from 20% to 16% on average in SST and AG; from 20% to 15% in M.RC; and from 10% to 7% in Ev.Inf. Norm Suff scores indicate that our on-average shorter rationales are overall slightly less, yet comparably, sufficient compared to longer, fixed-length rationales. For example, in SST, rationales with instance-specific length are 0.9-1.0 times as sufficient as rationales with pre-defined length. We find this particularly evident in datasets such as M.RC and Ev.Inf., where our rationales are on average 4-5% shorter (approximately 15 tokens shorter on average for α in M.RC) but still retain comparable sufficiency, while in some cases improving it (e.g. 1.2 R.I. in Ev.Inf. with DeepLift).

We also note that rationales extracted with instance-specific length are more comprehensive in most cases, despite being shorter on average compared to fixed-length rationales. For example, in Ev.Inf., CONTIGUOUS rationales with IG are 1.4 times more comprehensive when we select their length at instance level. Results also indicate that using our proposed method benefits CONTIGUOUS rationales more than TOPK for comprehensiveness, leading to increased R.I.
in the majority of cases. Overall, these findings support our initial hypothesis that in certain cases a rationale with longer than needed length might contain unnecessary information and adversely impact its comprehensiveness.

Selecting Instance-specific Feature Scoring, Length and Type

Table 3 shows mean Norm Suff and Norm Comp scores when using our proposed method to select at instance level (I-L) a combination of: (1) the feature scoring method (FEAT); (2) the rationale length (LEN); and (3) the rationale type (TYPE). For comparison, we also show scores of the best performing fixed (FIX) feature scoring function, rationale type and length (see Figure 1).

                                Norm Suff                   Norm Comp
TYPE        LEN   FEAT          SST  M.RC  AG   Ev.Inf.     SST  M.RC  AG   Ev.Inf.
TOPK        FIX   FIX           .68  .12   .37  .43         .54  .42   .28  .80
TOPK        I-L   FIX           .61  .11   .30  .37         .52  .46   .27  .82
TOPK        FIX   I-L           .63  .09   .44  .38         .57  .59   .41  .84
TOPK        I-L   I-L           .59  .07   .38  .36         .55  .62   .39  .86
CONTIGUOUS  FIX   FIX           .71  .07   .41  .85         .46  .47   .17  .55
CONTIGUOUS  I-L   FIX           .63  .06   .33  .78         .47  .54   .19  .62
CONTIGUOUS  FIX   I-L           .67  .07   .42  .82         .46  .60   .22  .59
CONTIGUOUS  I-L   I-L           .61  .05   .33  .76         .48  .65   .24  .67
I-L         I-L   I-L           .60  .06   .39  .49         .57  .69   .41  .88

Table 3: Mean Norm Suff and Norm Comp scores when we select at instance level (I-L) a combination of the: (1) rationale length (LEN); (2) feature scoring method (FEAT); and (3) rationale type (TYPE). {TYPE}-FIX-FIX and {TYPE}-I-L-FIX values are from the highest scoring feature scoring method (see Figure 1). Bold values denote the highest performing combination per column (higher is better).

We first observe that the highest Norm Suff scores across three datasets (SST, M.RC, Ev.Inf.) are obtained by the best performing fixed feature scoring method with fixed length and rationale type. Additionally, the best performing combination of our proposed approach for sufficiency is when we only select the feature scoring method, keeping the length and type fixed. This combination results in the highest Norm Suff scores in AG (.44 with the TOPK type compared to .42, which is the second best, with CONTIGUOUS) and competitive Norm Suff scores with the highest scoring combination elsewhere (e.g. .82 in Ev.Inf. with CONTIGUOUS compared to .85). We assume that combinations which include instance-specific lengths do not perform as well for sufficiency due to the shorter rationale length, which we have previously shown to partially degrade rationale sufficiency.

Finally, our results demonstrate that we obtain highly comprehensive rationales when selecting all parameters (FEAT + LEN + TYPE) at instance level using our approach. In fact, this results in higher Norm Comp scores compared to any other setting combination across all datasets. For example, in M.RC, selecting all parameters results in a Norm Comp score of .69, which is .22 units higher than the rationales extracted with fixed feature scoring method, length and type. This highlights the efficacy of our approach in extracting highly comprehensive rationales, without requiring strong a priori assumptions about rationale parameters.

Ablation Study

We finally perform an ablation study to examine the behavior and effectiveness of our approach by sequentially removing one feature scoring method at a time and measuring changes in F1 macro, Norm Suff and Norm Comp.
The intuition is that, for our approach to be effective, we should observe drops in faithfulness scores when removing feature attribution methods (i.e. we should extract more faithful rationales when having more feature scoring options to choose from). Figure 2 shows the results.

Figure 2: F1 macro (lower is better), mean Norm Suff and mean Norm Comp (higher is better), when extracting rationales with our approach given decreasing numbers of feature scoring methods. Panels: (a) F1 macro; (b) Norm Suff; (c) Norm Comp.

We first observe that removing one feature scoring method at a time results in increases in F1 macro (lower is better) and drops in Norm Comp scores (higher is better). This demonstrates that the faithfulness of the rationales extracted with our approach deteriorates as the number of feature scoring methods becomes smaller, highlighting the efficacy of our proposed approach. For example, in Ev.Inf., removing α∇α results in a drop of .14 in mean Norm Comp (.84 when including α∇α compared to .70 without it). On the other hand, we also observe that our method can still benefit from feature scoring methods that achieve low Norm Comp scores when used standalone, resulting in improvements in comprehensiveness and drops in F1 macro (e.g. α in SST). This indicates that our approach steadily improves rationale faithfulness for a model's predictions given a larger pool of available feature scoring methods.

Results show a deterioration in Norm Suff scores as the number of feature scoring methods becomes smaller, showing that, in the majority of the datasets, our method results in more sufficient rationales when presented with a larger list of available feature scoring methods. We hypothesize that this is not true for MultiRC due to the already low Norm Suff scores of the rationales (e.g. no more than 0.12). By using all six feature scoring methods, our approach produces highly sufficient rationales and is comparable to the set that achieved the highest sufficiency. For example, in Ev.Inf., using all feature scoring methods results in a Norm Suff score of approximately .38, compared to .39 for the highest scoring feature scoring set (all except LIME) and .15 for the lowest scoring one (x∇x). We also tested different combinations of feature scoring methods with similar observations.

Finally, we experimented with doubling the upper bound of the rationale length (from N to 2N) for both fixed length rationales and our proposed approach. Our approach still yielded more comprehensive rationales compared to the fixed-length ones, which were also highly sufficient.
6 Qualitative Analysis

Table 4 shows examples of the qualitative comparison between our approach (Ours) for selecting at instance level (I-L) a combination of the: (1) rationale length (LEN); and (2) feature scoring method (FEAT), against our baseline of fixed-length rationales from a fixed feature scoring method.

Example 1 (Data: AG, Id: test 4614)
[FIXED-LEN + α]: ... game last Friday night will stand , the CFL announced yesterday. While a review ...
[I-L-LEN + α (Ours)]: ... game last Friday night will stand , the CFL announced yesterday. While a review ...
[Predicted Topic || True Topic]: Decreased significantly || Decreased significantly

Example 2 (Data: Ev.Inf., Id: 3162205 2)
[FIXED-LEN + α∇α]: ... computed tomography ( 3D - CT ) scans . ABSTRACT.RESULTS : The control sides treated with an autograft showed significantly better Lenke scores than the study sides treated with β - CPP at 3 and 6 months postoperatively , but there was no difference between the two sides at 12 months . The fusion ..
[I-L-LEN + α∇α (Ours)]: ... computed tomography ( 3D - CT ) scans . ABSTRACT.RESULTS : The control sides treated with an autograft showed significantly better Lenke scores than the study sides treated with β - CPP at 3 ...
[Predicted Relationship || True Relationship]: Increased significantly || No significant difference

Example 3 (Data: SST, Id: test 694)
[FIXED-LEN + α]: ... Frontal is the antidote for Soderbergh fans who think he's gone too commercial ...
[I-L-LEN + I-L-FEAT (Ours)]: ... Frontal is the antidote for Soderbergh fans who think he's gone too commercial ...
[Predicted Sentiment || True Sentiment]: Negative || Positive

Example 4 (Data: SST, Id: test 1039)
[FIXED-LEN + α]: It's just incredibly dull.
[I-L-LEN + I-L-FEAT (Ours)]: It's just incredibly dull.
[Predicted Sentiment || True Sentiment]: Negative || Negative

Table 4: Examples when using our approach (Ours) to select at instance level (I-L) a combination of the: (1) rationale length (LEN); and (2) feature scoring method (FEAT), against our baseline of fixed-length rationales from a fixed feature scoring method.

Concise rationales: Example 1 presents an instance from AG. Our approach extracts a rationale that is six tokens shorter than the one with fixed length, while also achieving a higher Norm Comp score. However, the fixed length rationale scores higher in Norm Suff. We can assume from this that sufficiency positively correlates with rationale length.

Error analysis: Our assumption is that if a model makes a wrong prediction, we should be able to extract the rationale that best demonstrates what led to the wrong prediction. Example 2 shows an instance from Ev.Inf., where the model has wrongly predicted that Lenke scores at 12 months have 'increased significantly' instead of the correct 'no significant difference'. Surprisingly, both rationales recorded maximum scores (1.0) in Norm Suff and Norm Comp. We observe that the correct answer is included in the fixed length rationale, yet the model made a wrong prediction. On the contrary, our rationale highlights something directly related to its prediction. Example 3 presents an instance from SST, where the fixed-length rationale and the instance-specific rationale (ours) attend to different sections of the text. Our rationale scored lower for Norm Suff; however, we observe that it aligns more closely with the predicted sentiment.

When using a fixed pre-defined length is not sufficient: Example 4 presents a different scenario, where the fixed-length rationale for SST is at 20% whilst the upper bound N for our rationale is at 40%. The intuition is that in certain cases a fixed rationale length might not be sufficient for all instances to explain a prediction. We argue that our approach highlighted something more informative for the task ('incredibly dull' compared to 'incredibly'), due to removing the restriction of a pre-defined fixed length.

7 Conclusions

We have proposed a simple yet effective approach for selecting at instance level the (1) feature scoring method; (2) length; and (3) type of the rationale. We empirically demonstrated that rationales extracted with our approach are significantly more comprehensive and highly sufficient, while being shorter, compared to rationales extracted with a fixed feature scoring method, length and type. Finally, we consider our work an important step towards instance-level faithful rationalization, while finding the most sufficient rationale remains an interesting direction for future work.

Acknowledgments

NA is supported by EPSRC grant EP/V055712/1, part of the European Commission CHIST-ERA programme, call 2019 XAI: Explainable Machine Learning-based Artificial Intelligence.
References

Arras, L.; Horn, F.; Montavon, G.; Müller, K.-R.; and Samek, W. 2016. Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, 1–7. Berlin, Germany: Association for Computational Linguistics.

Arras, L.; Montavon, G.; Müller, K.-R.; and Samek, W. 2017. Explaining Recurrent Neural Network Predictions in Sentiment Analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 159–168. Copenhagen, Denmark: Association for Computational Linguistics.

Atanasova, P.; Simonsen, J. G.; Lioma, C.; and Augenstein, I. 2020. A Diagnostic Study of Explainability Techniques for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3256–3274. Online: Association for Computational Linguistics.

Bastings, J.; Aziz, W.; and Titov, I. 2019. Interpretable Neural Predictions with Differentiable Binary Variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2963–2977. Florence, Italy: Association for Computational Linguistics.

Bastings, J.; and Filippova, K. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 149–155. Online: Association for Computational Linguistics.

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3615–3620. Hong Kong, China: Association for Computational Linguistics.

Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M. S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. 2021. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258.

Carton, S.; Rathore, A.; and Tan, C. 2020. Evaluating and Characterizing Human Rationales. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9294–9307. Online: Association for Computational Linguistics.

Chen, H.; and Ji, Y. 2020. Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4236–4251. Online: Association for Computational Linguistics.

Chrysostomou, G.; and Aletras, N. 2021a. Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8189–8200. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Chrysostomou, G.; and Aletras, N. 2021b. Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 477–488. Online: Association for Computational Linguistics.

Corso, G. M. D.; Gulli, A.; and Romani, F. 2005. Ranking a stream of news.
In Ellis, A.; and Hagino, T., eds., Proceedings of the 14th International Conference on World Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005, 97–106. ACM.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.

DeYoung, J.; Jain, S.; Rajani, N. F.; Lehman, E.; Xiong, C.; Socher, R.; and Wallace, B. C. 2020. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4443–4458. Online: Association for Computational Linguistics.

Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4198–4205. Online: Association for Computational Linguistics.

Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3543–3556. Minneapolis, Minnesota: Association for Computational Linguistics.

Jain, S.; Wiegreffe, S.; Pinter, Y.; and Wallace, B. C. 2020. Learning to Faithfully Rationalize by Construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4459–4473. Online: Association for Computational Linguistics.

Khashabi, D.; Chaturvedi, S.; Roth, M.; Upadhyay, S.; and Roth, D. 2018. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 252–262. New Orleans, Louisiana: Association for Computational Linguistics.

Kim, S.; Yi, J.; Kim, E.; and Yoon, S. 2020. Interpretation of NLP models through input marginalization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3154–3167. Online: Association for Computational Linguistics.

Kindermans, P.-J.; Schütt, K.; Müller, K.-R.; and Dähne, S. 2016. Investigating the influence of noise and distractors on the interpretation of neural networks. arXiv preprint arXiv:1611.07270.

Lehman, E.; DeYoung, J.; Barzilay, R.; and Wallace, B. C. 2019. Inferring Which Medical Treatments Work from Reports of Clinical Trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3705–3717. Minneapolis, Minnesota: Association for Computational Linguistics.

Lei, T.; Barzilay, R.; and Jaakkola, T. 2016. Rationalizing Neural Predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 107–117. Austin, Texas: Association for Computational Linguistics.

Li, J.; Chen, X.; Hovy, E.; and Jurafsky, D. 2016. Visualizing and Understanding Neural Models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 681–691.
San Diego, California: Association for Computational Linguistics.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Nguyen, D. 2018. Comparing Automatic and Human Evaluation of Local Explanations for Text Classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1069–1078. New Orleans, Louisiana: Association for Computational Linguistics.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Krishnapuram, B.; Shah, M.; Smola, A. J.; Aggarwal, C. C.; Shen, D.; and Rastogi, R., eds., Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, 1135–1144. ACM.

Robnik-Šikonja, M.; and Kononenko, I. 2008. Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5): 589–600.

Samek, W.; Binder, A.; Montavon, G.; Lapuschkin, S.; and Müller, K. 2017. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11): 2660–2673.

Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2931–2951. Florence, Italy: Association for Computational Linguistics.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning Important Features Through Propagating Activation Differences. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3145–3153. PMLR.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. Seattle, Washington, USA: Association for Computational Linguistics.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, 3319–3328. PMLR.

Treviso, M.; and Martins, A. F. T. 2020. The Explanation Game: Towards Prediction Explainability through Sparse Communication. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 107–118. Online: Association for Computational Linguistics.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Zhang, W.; Huang, Z.; Zhu, Y.; Ye, G.; Cui, X.; and Zhang, F. 2021. On Sample Based Explanation Methods for NLP: Faithfulness, Efficiency and Semantic Evaluation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 5399–5411.
Online: Association for Computational Linguistics.