Published as a conference paper at ICLR 2021

EVALUATION OF SIMILARITY-BASED EXPLANATIONS

Kazuaki Hanawa1,2, Sho Yokoi2,1, Satoshi Hara3, Kentaro Inui2,1
RIKEN Center for Advanced Intelligence Project1, Tohoku University2, Osaka University3
kazuaki.hanawa@riken.jp, yokoi@ecei.tohoku.ac.jp, satohara@ar.sanken.osaka-u.ac.jp, inui@ecei.tohoku.ac.jp

ABSTRACT

Explaining the predictions made by complex machine learning models helps users understand and accept the predicted outputs with confidence. One promising way is similarity-based explanation, which provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners select appropriate relevance metrics and to aid further research on designing better relevance metrics for explanations.

1 INTRODUCTION

Explaining the predictions made by complex machine learning models helps users understand and accept the predicted outputs with confidence (Ribeiro et al., 2016; Lundberg & Lee, 2017; Guidotti et al., 2018; Adadi & Berrada, 2018; Molnar, 2020). Instance-based explanations are a popular type of explanation that achieve this goal by presenting one or several training instances that support the predictions of a model. Several types of instance-based explanations have been proposed, such as explaining with instances similar to the instance of interest (i.e., the test instance in question) (Charpiat et al., 2019; Barshan et al., 2020); harmful instances that degrade the performance of models (Koh & Liang, 2017; Khanna et al., 2019); counter-examples that contrast how a prediction can be changed (Wachter et al., 2018); and irregular instances (Kim et al., 2016). Among these, we focus on the first one, the type of explanation that gives one or several training instances that are similar to the test instance in question and the corresponding model predictions. We refer to this type of instance-based explanation as similarity-based explanation. A similarity-based explanation is of the form "I (the model) think this image is cat because similar images I saw in the past were also cat." This type of explanation is analogous to the way humans make decisions by referring to their prior experiences (Klein & Calderwood, 1988; Klein, 1989; Read & Cesa, 1991). Hence, it tends to be easy to understand even for users with little expertise in machine learning. A report stated that with this type of explanation, users tend to have higher confidence in model predictions compared to explanations that present contributing features (Cunningham et al., 2003).

In the instance-based explanation paradigm, including similarity-based explanation, a relevance metric $R(z, z') \in \mathbb{R}$ is typically used to quantify the relationship between two instances, $z = (x, y)$ and $z' = (x', y')$.

Definition 1 (Instance-based Explanation Using Relevance Metric).
Let $D = \{z_{\mathrm{train}}^{(i)} = (x_{\mathrm{train}}^{(i)}, y_{\mathrm{train}}^{(i)})\}_{i=1}^{N}$ be a set of training instances and $x_{\mathrm{test}}$ be a test input of interest whose predicted output is given by $\hat{y}_{\mathrm{test}} = f(x_{\mathrm{test}})$ with a predictive model $f$. An instance-based explanation method gives the most relevant training instance $z^* \in D$ to the test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ by $z^* = \arg\max_{z_{\mathrm{train}} \in D} R(z_{\mathrm{test}}, z_{\mathrm{train}})$ using a relevance metric $R(z_{\mathrm{test}}, z_{\mathrm{train}})$.

Previously proposed relevance metrics include similarity (Caruana et al., 1999), kernel functions (Kim et al., 2016; Khanna et al., 2019), and the influence function (Koh & Liang, 2017).

Table 1: The relevance metrics and their evaluation results. For the model randomization test, the results that passed the test are colored. For the identical class test and identical subclass test, the results with the five highest average evaluation scores are colored. The details of the relevance metrics, the evaluation criteria, and the evaluation procedures can be found in Sections 1.2, 3, and 4, respectively.

| Relevance Metric | Abbrv. | Model Randomization Test | Identical Class Test | Identical Subclass Test |
| --- | --- | --- | --- | --- |
| ℓ2, φ(z) = x | ℓ2^x | Failed | 0.615 ± 0.261 | 0.644 ± 0.264 |
| ℓ2, φ(z) = h_last | ℓ2^last | Passed | 0.880 ± 0.106 | 0.631 ± 0.237 |
| ℓ2, φ(z) = h_all | ℓ2^all | Failed | 0.848 ± 0.128 | 0.691 ± 0.211 |
| Cosine, φ(z) = x | cos^x | Failed | 0.669 ± 0.248 | 0.621 ± 0.242 |
| Cosine, φ(z) = h_last | cos^last | Passed | 0.888 ± 0.098 | 0.636 ± 0.234 |
| Cosine, φ(z) = h_all | cos^all | Failed | 0.871 ± 0.110 | 0.738 ± 0.166 |
| Dot, φ(z) = x | dot^x | Failed | 0.336 ± 0.187 | 0.346 ± 0.201 |
| Dot, φ(z) = h_last | dot^last | Failed | 0.579 ± 0.344 | 0.284 ± 0.122 |
| Dot, φ(z) = h_all | dot^all | Failed | 0.630 ± 0.353 | 0.488 ± 0.267 |
| Influence Function | IF | Passed | 0.372 ± 0.270 | 0.309 ± 0.174 |
| Relative IF | RIF | Passed | 0.779 ± 0.309 | 0.659 ± 0.266 |
| Fisher Kernel | FK | Passed | 0.226 ± 0.103 | 0.180 ± 0.076 |
| Grad-Dot | GD | Passed | 0.701 ± 0.287 | 0.403 ± 0.131 |
| Grad-Cos | GC | Passed | 0.996 ± 0.009 | 0.753 ± 0.196 |

An immediate critical question is which relevance metric is appropriate for which type of instance-based explanation. There is no doubt that different types of explanations require different metrics. Despite its potential importance, however, this question has been little explored. Given this background, in this study, we focused on similarity-based explanation and investigated its appropriate relevance metrics through comprehensive experiments.1

Contributions We provide the first answer to the question of which relevance metrics have desirable properties for similarity-based explanation. For this purpose, we propose to use three minimal-requirement tests to evaluate various relevance metrics in terms of their appropriateness. The first test is the model randomization test originally proposed by Adebayo et al. (2018) for evaluating saliency-based methods, and the other two tests, the identical class test and the identical subclass test, are newly designed in this study. As summarized in Table 1, our experiments revealed that (i) the cosine similarity of gradients performs best, which is probably a recommended choice for similarity-based explanation in practice, and (ii) some relevance metrics demonstrated poor performance on the identical class and identical subclass tests, indicating that their use should be deprecated for similarity-based explanation. We also analyzed the reasons behind the success and failure of the metrics. We expect these insights to help practitioners in selecting appropriate relevance metrics.
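As a concrete illustration of Definition 1, the following is a minimal sketch of retrieving the most relevant training instance under a given relevance metric, here the cosine similarity between feature vectors. It is a sketch under the stated assumptions: the feature vectors are assumed to be precomputed, and the function and variable names are ours for illustration and are not taken from the released implementation. Any metric in Table 1 can be substituted for the cosine relevance below.

```python
# Minimal sketch of Definition 1: retrieve the training instance most relevant
# to a test instance under a chosen relevance metric R(z_test, z_train).
# All names are illustrative, not taken from the paper's released code.
import numpy as np

def cosine_relevance(phi_test: np.ndarray, phi_train: np.ndarray) -> np.ndarray:
    """Cosine relevance between one test feature vector and N training feature vectors."""
    num = phi_train @ phi_test
    den = np.linalg.norm(phi_train, axis=1) * np.linalg.norm(phi_test) + 1e-12
    return num / den

def most_relevant_instance(phi_test, phi_train):
    """Return the index of arg max_{z_train in D} R(z_test, z_train) and all scores."""
    scores = cosine_relevance(phi_test, phi_train)
    return int(np.argmax(scores)), scores

# Toy usage: 100 training instances with 8-dimensional features.
rng = np.random.default_rng(0)
phi_train = rng.normal(size=(100, 8))
phi_test = rng.normal(size=8)
idx, scores = most_relevant_instance(phi_test, phi_train)
print(idx, scores[idx])
```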
1.1 PRELIMINARIES

Notations For vectors $a, b \in \mathbb{R}^p$, we denote the dot product by $\langle a, b \rangle := \sum_{i=1}^{p} a_i b_i$, the $\ell_2$ norm by $\|a\| := \sqrt{\langle a, a \rangle}$, and the cosine similarity by $\cos(a, b) := \langle a, b \rangle / (\|a\| \|b\|)$.

Classification Problem We consider a standard classification problem as the evaluation benchmark, which is the most actively explored application of instance-based explanations. The model is the conditional probability $p(y \mid x; \theta)$ with parameter $\theta$. Let $\hat{\theta}$ be a trained parameter $\hat{\theta} = \arg\min_{\theta} L_{\mathrm{train}} := \frac{1}{N} \sum_{i=1}^{N} \ell(z_{\mathrm{train}}^{(i)}; \theta)$, where the loss function $\ell$ is the cross entropy $\ell(z; \theta) = -\log p(y \mid x; \theta)$ for an input-output pair $z = (x, y)$. The model classifies a test input $x_{\mathrm{test}}$ by assigning the class with the highest probability $\hat{y}_{\mathrm{test}} = \arg\max_{y} p(y \mid x_{\mathrm{test}}; \hat{\theta})$.

1.2 RELEVANCE METRICS

We present an overview of the two types of relevance metrics considered in this study, namely similarity metrics and gradient-based metrics. To the best of our knowledge, all major relevance metrics proposed thus far can be classified under these two types. Table 1 presents a list of the metrics and their abbreviations.

Similarity Metrics We consider the following popular similarity metrics with a feature map $\phi(z)$.

ℓ2 Metric: $R_{\ell_2}(z, z') := -\|\phi(z) - \phi(z')\|_2$, which is a typical choice for nearest neighbor methods (Hastie et al., 2009; Abu Alfeilat et al., 2019).

Cosine Metric: $R_{\cos}(z, z') := \cos(\phi(z), \phi(z'))$, which is commonly used in natural language processing tasks (Mikolov et al., 2013; Arora et al., 2017; Conneau et al., 2017).

Dot Metric: $R_{\mathrm{dot}}(z, z') := \langle \phi(z), \phi(z') \rangle$, which is a kernel function used in kernel models such as SVM (Schölkopf et al., 2002; Fan et al., 2005; Bien & Tibshirani, 2011).

As the feature map $\phi(z)$, we consider (i) the identity map $\phi(z) = x$; (ii) the last hidden layer $\phi(z) = h_{\mathrm{last}}$, which is the latent representation of input $x$ one layer before the output in a deep neural network; and (iii) all hidden layers $\phi(z) = h_{\mathrm{all}}$, where $h_{\mathrm{all}} = [h_1, h_2, \ldots, h_{\mathrm{last}}]$ is the concatenation of all latent representations in the network. Note that the metrics with the identity map merely measure the similarity of inputs without model information. We adopt these metrics as naive baselines to contrast with the other, more advanced metrics that utilize model information.

Gradient-based Metrics Gradient-based metrics use a gradient $g_z^{\hat{\theta}} := \nabla_{\theta} \ell(z; \hat{\theta})$ to measure the relevance. We consider five metrics: Influence Function (IF) (Koh & Liang, 2017), Relative IF (RIF) (Barshan et al., 2020), Fisher Kernel (FK) (Khanna et al., 2019), Grad-Dot (GD) (Yeh et al., 2018; Charpiat et al., 2019), and Grad-Cos (GC) (Perronnin et al., 2010; Charpiat et al., 2019). See Appendix A for further detail.

IF: $R_{\mathrm{IF}}(z, z') := \langle g_z^{\hat{\theta}}, H^{-1} g_{z'}^{\hat{\theta}} \rangle$

RIF: $R_{\mathrm{RIF}}(z, z') := \cos(H^{-1/2} g_z^{\hat{\theta}}, H^{-1/2} g_{z'}^{\hat{\theta}})$

FK: $R_{\mathrm{FK}}(z, z') := \langle g_z^{\hat{\theta}}, I^{-1} g_{z'}^{\hat{\theta}} \rangle$

GD: $R_{\mathrm{GD}}(z, z') := \langle g_z^{\hat{\theta}}, g_{z'}^{\hat{\theta}} \rangle$

GC: $R_{\mathrm{GC}}(z, z') := \cos(g_z^{\hat{\theta}}, g_{z'}^{\hat{\theta}})$

where $H$ and $I$ are the Hessian and Fisher information matrices of the loss $L_{\mathrm{train}}$, respectively.

1 Our implementation is available at https://github.com/k-hanawa/criteria_for_instance_based_explanation

2 RELATED WORK

Model-specific Explanation Aside from the relevance metrics, there is another approach to similarity-based explanation that uses specific models designed to provide explanations (Kim et al., 2014; Plötz & Roth, 2018; Chen et al., 2019). We set aside these specific models and focus on generic relevance metrics because of their applicability to a wide range of problems.
Evaluation of Metrics for Improving Classification Accuracy In several machine learning problems, the metrics between instances play an essential role. For example, the distance between instances is essential for distance-based methods such as nearest neighbor methods (Hastie et al., 2009). Another example is kernel models, where the kernel function represents the relationship between two instances (Schölkopf et al., 2002). Several studies have evaluated the desirable metrics for specific tasks (Hussain et al., 2011; Hu et al., 2016; Li & Li, 2018; Abu Alfeilat et al., 2019). These studies aimed to find metrics that could improve the classification accuracy. Different from these accuracy-based evaluations, our goal in this study is to evaluate the validity of relevance metrics for similarity-based explanation; thus, the findings of these previous studies are not directly applicable to our goal.

Evaluation of Explanations A variety of desiderata have been argued as requirements for explanations, such as faithfulness (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020), plausibility (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019), robustness (Alvarez-Melis & Jaakkola, 2018), and readability (Wang & Rudin, 2015; Yang et al., 2017; Angelino et al., 2017). It is important to evaluate the existing explanation methods considering these requirements. However, there is no standard test established for evaluating these requirements, and designing such tests still remains an open problem (Doshi-Velez & Kim, 2017; Jacovi & Goldberg, 2020). In this study, as the first empirical study evaluating the existing relevance metrics for similarity-based explanation, we take an alternative approach by designing minimal-requirement tests for two primary requirements, namely faithfulness and plausibility. With this alternative approach, we can avoid the difficulty of directly evaluating these primary requirements.

3 EVALUATION CRITERIA FOR SIMILARITY-BASED EXPLANATION

This study aims to investigate the relevance metrics with desirable properties for similarity-based explanation. In this section, we propose three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. If a relevance metric fails one of the tests, we can conclude that the metric does not meet the minimal requirements; thus, its use would be deprecated. The first test (the model randomization test) assesses whether each relevance metric satisfies the minimal requirement for the faithfulness of explanation, which requires that an explanation of a model prediction reflect the underlying inference process (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020). The latter two tests (the identical class and identical subclass tests) are designed to assess relevance metrics in terms of the plausibility of the explanations they produce (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019), which requires explanations to be sufficiently convincing to users.

3.1 MODEL RANDOMIZATION TEST

Explanations that are irrelevant to a model should be avoided because such fake explanations can mislead users. Thus, any valid relevance metric should be model-dependent, which constitutes the first requirement. We use the model randomization test of Adebayo et al. (2018) to assess whether a given relevance metric satisfies this minimal requirement for faithfulness.
If a relevance metric produces almost the same explanations for the same inputs on two models with different inference processes, it is likely to ignore the underlying model, i.e., the metric is independent of the model. Thus, we can evaluate whether a metric is model-dependent by comparing explanations from two different models. In the test, a typical choice of the two models is a well-trained model that can predict the output well and a randomly initialized model that can make only poor predictions. These two models have different inference processes; hence, their explanations should be different.

Definition 2 (Model Randomization Test). Let $R$ denote the relevance metric of interest. Let $f$ and $f_{\mathrm{rand}}$ be a well-trained model and a randomly initialized model, respectively. For given $R$, $f$, and a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$, let $\pi_f$ be a permutation of the indices of the training instances based on the degree of relevance to the given test instance, i.e., $R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(1))}) \ge R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(2))}) \ge \cdots \ge R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(N))})$. We define $\pi_{f_{\mathrm{rand}}}$ accordingly. Then, we require $\pi_f$ and $\pi_{f_{\mathrm{rand}}}$ to have a small rank correlation.

If relevance metric $R$ is independent of the model, it produces the same permutation for both $f$ and $f_{\mathrm{rand}}$, and their rank correlation becomes one. If the rank correlation is significantly smaller than one and close to zero, we can confirm that the relevance metric is model-dependent.
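The following is a minimal sketch of the comparison in Definition 2, assuming that the relevance scores of all training instances for one test instance have already been computed under both the trained model and the randomized model; the function and variable names are illustrative, and the synthetic scores below stand in for real metric outputs.

```python
# Minimal sketch of the model randomization test (Definition 2), assuming the
# relevance scores of every training instance for one test instance have been
# computed under (i) the trained model f and (ii) a randomly initialized model
# f_rand. Names and data are illustrative.
import numpy as np
from scipy.stats import spearmanr

def model_randomization_correlation(scores_trained: np.ndarray,
                                    scores_random: np.ndarray) -> float:
    """Spearman rank correlation between the two relevance orderings.
    Values close to zero indicate a model-dependent relevance metric."""
    rho, _ = spearmanr(scores_trained, scores_random)
    return float(rho)

# Toy usage with synthetic relevance scores over N = 1000 training instances.
rng = np.random.default_rng(0)
scores_trained = rng.normal(size=1000)
scores_random = rng.normal(size=1000)        # unrelated scores -> rho near 0
print(model_randomization_correlation(scores_trained, scores_random))
print(model_randomization_correlation(scores_trained, scores_trained))  # rho = 1
```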
3.2 IDENTICAL CLASS TEST

The second minimal requirement is that the raised similar instance should belong to the same class as the test instance, as shown in Figure 1. The violation of this requirement leads to nonsensical explanations such as "I think this image is cat because a similar image I saw in the past was dog." in Figure 1. When users encounter such explanations, they might question the validity of the model predictions and ignore the predictions even if the underlying model is valid. This observation leads to the identical class test below.

Definition 3 (Identical Class Test). We require that the most similar (relevant) instance of a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ is a training instance of the same class as the given test instance.
$$\arg\max_{z = (x, y) \in D} R(z_{\mathrm{test}}, z) = \tilde{z} = (\tilde{x}, \tilde{y}) \;\Longrightarrow\; \tilde{y} = \hat{y}_{\mathrm{test}}. \quad (1)$$

Although this test may look trivial, some relevance metrics do not satisfy this minimal requirement, as demonstrated in Section 4.2.

Figure 1: Valid and invalid examples for the identical class test.

Figure 2: Valid and invalid examples for the identical subclass test.

3.3 IDENTICAL SUBCLASS TEST

The third minimal requirement is that the raised similar instance should belong to the same subclass as that of the test instance when the classes consist of latent subclasses, as shown in Figure 2. For example, consider the problem of classifying images of CIFAR10 into two classes, i.e., animal and vehicle. The animal class consists of images from subclasses such as cat and frog, while the vehicle class consists of images from subclasses such as airplane and automobile. Under the presence of subclasses, the violation of this requirement leads to nonsensical explanations such as "I think this image (cat) is animal because a similar image (frog) I saw in the past was also animal." in Figure 2. This observation leads to the identical subclass test below.

Definition 4 (Identical Subclass Test). Let $s(z)$ denote the subclass, within its class $y$, of an instance $z = (x, y)$. We require that the most similar (relevant) instance of a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ is a training instance of the same subclass as the test instance, under the assumption that the prediction of the test instance is correct, $\hat{y}_{\mathrm{test}} = y_{\mathrm{test}}$.2
$$\arg\max_{z \in D} R(z_{\mathrm{test}}, z) = \tilde{z} \;\Longrightarrow\; s(\tilde{z}) = s(z_{\mathrm{test}}). \quad (2)$$

2 We require correct predictions in this test because the subclass does not match in incorrect cases.

In the experiments, we used modified datasets: we split each dataset into two new classes (A and B) by randomly assigning the existing classes to either of the two. The two new classes now contain the original classes as subclasses that are mutually exclusive and collectively exhaustive, which can be used for the identical subclass test.

3.4 DISCUSSIONS ON VALIDITY OF CRITERIA

Here, we discuss the validity of the new criteria, i.e., the identical class and identical subclass tests.

Why do relevance metrics that cannot pass these tests matter? Dietvorst et al. (2015) revealed a bias in humans, called algorithm aversion, which states that people tend to ignore an algorithm once they have seen it make errors. It should be noted that explanations that do not satisfy the identical class test or identical subclass test appear to be logically broken, as shown in Figures 1 and 2. Given such logically broken explanations, users will consider that the models are making errors, even if they are making accurate predictions. Eventually, the users will start to ignore the models.

Is the identical subclass test necessary? This is an essential requirement for ensuring that the explanations are plausible to any user. Some users may not consider the explanations that violate the identical subclass test to be logically broken. For example, some users may find a frog to be an appropriate explanation for a cat being animal by inferring a taxonomy of the classes (e.g., both have eyes). However, we cannot expect all users to infer the same taxonomy. Therefore, if there is a discrepancy between the explanation and the taxonomy inferred by a user, the user will consider the explanation to be implausible. To make explanations plausible to any user, instances of the same subclass need to be provided.

Is random class assignment in the identical subclass test appropriate? We adopted random assignment to evaluate the performance of each metric independently of the underlying taxonomy. If a specific taxonomy were considered in the evaluations, a metric that performed well with it would be highly valued. Random assignment eliminates such effects, and we can purely measure the performance of the metrics themselves.

Do classification models actually recognize subclasses? Is the identical subclass test suitable for evaluating the explanations of predictions made by practical models? It is true that if a model ignores subclasses in its training and inference processes, any explanation will fail the test. We conducted simple preliminary experiments and confirmed that the practical classification models used in this study capture the subclasses. See Appendix E for further detail.
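Before turning to the evaluation, the following is a minimal sketch of how the success rates for the identical class and identical subclass tests (Definitions 3 and 4) can be computed once the most relevant training instance has been retrieved for every test instance. It is only a sketch under that assumption; all names are illustrative.

```python
# Minimal sketch of the success-rate computation for the identical class and
# identical subclass tests (Definitions 3 and 4). It assumes the most relevant
# training instance has already been retrieved for each test instance.
import numpy as np

def identical_class_success_rate(y_pred_test, y_retrieved_train):
    """Fraction of test instances whose most relevant training instance
    shares the predicted class (Definition 3)."""
    return float(np.mean(np.asarray(y_pred_test) == np.asarray(y_retrieved_train)))

def identical_subclass_success_rate(y_pred_test, y_true_test,
                                    sub_test, sub_retrieved_train):
    """Fraction of correctly predicted test instances whose most relevant
    training instance shares the latent subclass (Definition 4)."""
    correct = np.asarray(y_pred_test) == np.asarray(y_true_test)
    match = np.asarray(sub_test) == np.asarray(sub_retrieved_train)
    return float(np.mean(match[correct]))

# Toy usage with a handful of test instances.
print(identical_class_success_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))      # 0.8
print(identical_subclass_success_rate([0, 1, 1], [0, 1, 0],
                                      ["cat", "frog", "car"],
                                      ["cat", "deer", "car"]))             # 0.5
```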
4 EVALUATION RESULTS

Here, we examine the validity of the relevance metrics with respect to the three minimal requirements. For this evaluation, we used two image datasets (MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009)), two text datasets (TREC (Li & Roth, 2002), AGNews (Zhang et al., 2015)), and two tabular datasets (Vehicle (Dua & Graff, 2017), Segment (Dua & Graff, 2017)). As benchmarks, we employed logistic regression and deep neural networks trained on these datasets. Details of the datasets, models, and computing infrastructure used in this study are provided in Appendix B.

Procedure We repeated the following procedure 10 times for each evaluation test.

1. Train a model using a subset of the training instances.3 Then, randomly sample 500 test instances from the test set.4
2. For each test instance, compute the relevance score for all instances used for training.
3. (a) For the model randomization test, compute the Spearman rank correlation coefficient between the relevance scores from the trained model and those from the randomized model. (b) For the identical class and identical subclass tests, compute the success rate, which is the ratio of test instances that passed the test.

3 We randomly sampled 10% of MNIST and CIFAR10; 50% of TREC, Vehicle, and Segment; and 5% of AGNews.
4 For the identical subclass test, we sampled instances with correct predictions only.

In this section, we mainly present the results for CIFAR10 with CNN and AGNews with Bi-LSTM. The other results were similar and can be found in Appendix F.

Result Summary We summarize the main results before discussing the individual results.

- ℓ2^last, cos^last, and the gradient-based metrics scored low correlations in the model randomization test for all datasets and models, indicating that they are model-dependent.
- GC performed the best in most of the identical class and identical subclass tests; thus, GC would be the recommended choice in practice.
- The dot metrics as well as IF, FK, and GD performed poorly on the identical class test and identical subclass test. In Section 5, we analyze why some relevance metrics succeed or fail in these two tests.

4.1 RESULT OF MODEL RANDOMIZATION TEST

Figure 3 shows the Spearman rank correlation coefficients for the model randomization test. The similarities with the identity feature map, ℓ2^x, cos^x, and dot^x, are irrelevant to the model, and their correlations are trivially one. In the figures, the other metrics scored correlations close to zero, indicating that they are model-dependent. However, the correlations of ℓ2^all, cos^all, and dot^last were observed to be more than 0.7 on the MNIST and Vehicle datasets (see Appendix F). Therefore, we conclude that these relevance metrics failed the model randomization test because they can raise instances irrelevant to the model for some datasets.

Figure 3: Result of the model randomization test on (a) CIFAR10 with CNN and (b) AGNews with Bi-LSTM (average correlation ± std. for the ℓ2, cosine, dot, and gradient-based metrics). Correlations close to zero are ideal.

Figure 4: Results of the identical class test and identical subclass test on (a) CIFAR10 with CNN and (b) AGNews with Bi-LSTM (average success rate ± std.).

4.2 RESULTS OF IDENTICAL CLASS AND IDENTICAL SUBCLASS TESTS

Figure 4 depicts the success rates for the identical class and identical subclass tests. We also summarize the average success rates of our experiments in Table 1. It is noteworthy that GC performed consistently well on the identical class and identical subclass tests for all the datasets and models used in the experiments (see Appendix F).
In contrast, some relevance metrics such as the dot metrics as well as IF, FK, and GD performed poorly on both tests. The reasons for their failure are discussed in the next section.

To conclude, the results of our evaluations indicate that only GC performed well on all tests. That is, only GC seems to meet the minimal requirements; thus, it would be a recommended choice for similarity-based explanation.

5 WHY SOME METRICS ARE SUCCESSFUL AND WHY SOME ARE NOT

We observed that the dot metrics and gradient-based metrics such as IF, FK, and GD failed the identical class and identical subclass tests, in contrast to GC, which exhibited remarkable performance. Here, we analyze the reasons why the aforementioned metrics failed while GC performed well. In Appendix D, we also discuss a way to repair IF, FK, and GD to improve their performance based on the findings in this section.

Figure 5: Distributions of norms of the feature maps of all training instances (colored) and the instances selected by the identical class test (meshed) on CIFAR10 with CNN.

Figure 6: Training instances frequently selected in the identical class test with multiple test instances on CIFAR10 with CNN, the cosine between them, and the norms of the training instances.

Failure of Dot Metrics and Gradient-based Metrics To understand the failure, we reformulate IF, FK, and GD as dot metrics of the form $R_{\mathrm{dot}}(z_{\mathrm{test}}, z_{\mathrm{train}}) = \langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}) \rangle$ so that the following discussion is valid for any relevance metric of this form. It is evident that IF, FK, and GD can be expressed in this form by defining the feature maps as $\phi(z) = H^{-1/2} g_z^{\hat{\theta}}$, $\phi(z) = I^{-1/2} g_z^{\hat{\theta}}$, and $\phi(z) = g_z^{\hat{\theta}}$, respectively. Given a criterion, let $z_{\mathrm{train}}^{(i)}$ be a desirable instance for a test instance $z_{\mathrm{test}}$. The failures of the dot metrics indicate the existence of an undesirable instance $z_{\mathrm{train}}^{(j)}$ such that $\langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(i)}) \rangle < \langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(j)}) \rangle$. The following sufficient condition for $z_{\mathrm{train}}^{(j)}$ is useful to understand the failure:
$$\|\phi(z_{\mathrm{train}}^{(i)})\| < \|\phi(z_{\mathrm{train}}^{(j)})\| \cos(\phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(j)})). \quad (3)$$
The condition implies that any instance with an extremely large norm and a cosine only slightly larger than zero can be a candidate for $z_{\mathrm{train}}^{(j)}$. In our experiments, we observed that the condition on the norm is especially crucial. As shown in Figure 5, even though instances with significantly large norms were scarce, only such extreme instances were selected as relevant instances by IF, FK, and GD. This indicates that these metrics tend to consider such extreme instances as relevant. In contrast, GC was not attracted by large norms because it completely cancels the norm through normalization. Figure 6 shows some training instances frequently selected in the identical class test on CIFAR10 with CNN. When using IF, FK, and GD, these training instances were frequently selected irrespective of their classes because the training instances had large norms. In these metrics, the term $\cos(\phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}))$ seems to have a negligible effect. In contrast, GC successfully selected instances of the same class and ignored those with large norms.
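The following toy example illustrates condition (3) with synthetic vectors that are not taken from our experiments: a weakly aligned feature vector with a very large norm outranks a well-aligned one under the dot product, whereas the cosine, which normalizes the norms away, is unaffected.

```python
# Toy numerical illustration of condition (3), assuming gradient feature maps
# phi(z); the vectors below are synthetic, not taken from the experiments.
import numpy as np

phi_test = np.array([1.0, 0.0])                 # test-instance feature vector
phi_same = np.array([0.9, 0.1])                 # well-aligned, moderate norm
phi_outlier = np.array([5.0, 50.0])             # weakly aligned, very large norm

def dot_rel(a, b):                              # dot-product relevance (IF/FK/GD form)
    return float(a @ b)

def cos_rel(a, b):                              # cosine relevance (GC form)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The large-norm outlier wins under the dot product ...
print(dot_rel(phi_test, phi_same), dot_rel(phi_test, phi_outlier))   # 0.9 < 5.0
# ... but not under the cosine, which cancels the norms through normalization.
print(cos_rel(phi_test, phi_same), cos_rel(phi_test, phi_outlier))   # ~0.994 > ~0.0995
```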
Success of GC We now analyze why GC performed well, specifically in the identical class test. To simplify the discussion, we consider linear logistic regression whose conditional distribution $p(y \mid x; \theta)$ is given by the $y$-th entry of $\sigma(Wx)$, where $\sigma$ is the softmax function, $\theta = W \in \mathbb{R}^{C \times d}$, and $C$ and $d$ denote the number of classes and the dimensionality of $x$, respectively. With some algebra, we obtain $R_{\mathrm{GC}}(z, z') = \cos(r_z, r_{z'}) \cos(x, x')$ for $z = (x, y)$ and $z' = (x', y')$, where $r_z = \sigma(Wx) - e_y$ is the residual of the prediction on $z$ and $e_y$ is a vector whose $y$-th entry is one and zero otherwise. See Appendix C for the derivation. Here, the term $\cos(r_z, r_{z'})$ plays an essential role in GC. By definition, $(r_z)_c \ge 0$ if $c \ne y$ and $(r_z)_c \le 0$ otherwise. Thus, $\cos(r_z, r_{z'}) \ge 0$ always holds when $y = y'$, while $\cos(r_z, r_{z'})$ can be negative when $y \ne y'$. Hence, the chance of $R_{\mathrm{GC}}(z, z')$ being positive is larger for instances from the same class than for those from a different class.

Figure 7 shows that $\cos(r_z, r_{z'})$ is essential also for deep neural networks. Here, for each test instance $z_{\mathrm{test}}$ on CIFAR10 with CNN, we randomly sampled two training instances $z_{\mathrm{train}}$ (one with the same class and the other with a different class), and computed $R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}})$ and $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$.

Figure 7: Distributions of $R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}})$ and $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ for training instances with the same / different classes on CIFAR10 with CNN.

We also note that $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ alone was not helpful for the identical subclass test, whose success rate was around the chance level. We thus conjecture that while $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ is particularly helpful for the identical class test, the use of the entire gradient is still essential for GC to work effectively.

6 CONCLUSION

We investigated and determined relevance metrics that are effective for similarity-based explanation. For this purpose, we evaluated whether the metrics satisfy the minimal requirements for similarity-based explanation. In this study, we conducted three tests, namely, the model randomization test of Adebayo et al. (2018) to evaluate whether the metrics are model-dependent, and two newly designed tests, the identical class and identical subclass tests, to evaluate whether the metrics can provide plausible explanations. Quantitative evaluations based on these tests revealed that the cosine similarity of gradients performs best, which would be a recommended choice in practice. We also observed that some relevance metrics do not meet the requirements; thus, the use of such metrics would not be appropriate for similarity-based explanation. We expect our insights to help practitioners in selecting appropriate relevance metrics, and also to help further research on designing better relevance metrics for instance-based explanations.

Finally, we present two future directions for this study. First, the proposed criteria evaluated only limited aspects of the faithfulness and plausibility of relevance metrics. Thus, it is important to investigate further criteria for more detailed evaluations. Second, in addition to similarity-based explanation, it is necessary to consider the evaluation of other explanation methods, such as counter-examples.
We expect this study to be the first step toward the rigorous evaluation of several instance-based explanation methods.

ACKNOWLEDGMENTS

We thank Dr. Ryo Karakida and Dr. Takanori Maehara for their helpful advice. We also thank the Overfit Summer Seminar5 for an opportunity that inspired this research. Additionally, we are grateful to our laboratory members for their helpful comments. Sho Yokoi was supported by JST, ACT-X Grant Number JPMJAX200S, Japan. Satoshi Hara was supported by JSPS KAKENHI Grant Number 20K19860, and JST, PRESTO Grant Number JPMJPR20C8, Japan.

5 https://sites.google.com/view/mimaizumi/event/mlcamp2018

REFERENCES

Haneen Arafat Abu Alfeilat, Ahmad B.A. Hassanat, Omar Lasassmeh, Ahmad S. Tarawneh, Mahmoud Bashir Alhasanat, Hamzeh S. Eyal Salman, and V.B. Surya Prasath. Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data, 7(4):221–248, 2019.

Amina Adadi and Mohammed Berrada. Peeking Inside the Black-box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6:52138–52160, 2018.

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems 31, pp. 9505–9515, 2018.

David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.

Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. RelatIF: Identifying Explanatory Training Samples via Relative Influence. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 1899–1909, 2020.

Jacob Bien and Robert Tibshirani. Prototype Selection for Interpretable Classification. Annals of Applied Statistics, 5(4):2403–2424, 2011.

Rich Caruana, Hooshang Kangarloo, John David N. Dionisio, Usha Sinha, and David Johnson. Case-Based Explanation of Non-Case-Based Learning Methods. In Proceedings of the AMIA Symposium, pp. 212–215, 1999.

Guillaume Charpiat, Nicolas Girard, Loris Felardos, and Yuliya Tarabalka. Input Similarity from the Neural Network Perspective. In Advances in Neural Information Processing Systems 32, pp. 5342–5351, 2019.

Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Advances in Neural Information Processing Systems 32, pp. 8930–8941, 2019.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680, 2017.

Pádraig Cunningham, Dónal Doyle, and John Loughrey. An Evaluation of the Usefulness of Case-Based Explanation. In International Conference on Case-Based Reasoning, pp. 122–130. Springer, 2003.

Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err.
Journal of Experimental Psychology: General, 144(1):114–126, 2015.

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin. Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918, 2005.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5):1–42, 2018.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

Li-Yu Hu, Min-Wei Huang, Shih-Wen Ke, and Chih-Fong Tsai. The Distance Function Effect on k-Nearest Neighbor Classification for Medical Datasets. SpringerPlus, 5(1):1304, 2016.

Muhammad Hussain, Summrina Kanwal Wajid, Ali Elzaart, and Mohammed Berbar. A Comparison of SVM Kernel Functions for Breast Cancer Detection. In Proceedings of the 8th International Conference on Computer Graphics, Imaging and Visualization, pp. 145–150, 2011.

Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205, 2020.

Rajiv Khanna, Been Kim, Joydeep Ghosh, and Sanmi Koyejo. Interpreting Black Box Predictions using Fisher Kernels. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89, pp. 3382–3390, 2019.

Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. In Advances in Neural Information Processing Systems 27, pp. 1952–1960, 2014.

Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability. In Advances in Neural Information Processing Systems 29, pp. 2280–2288, 2016.

Gary A Klein. Strategies of Decision Making. Technical report, 1989.

Gary A Klein and Roberta Calderwood. How Do People Use Analogues to Make Decisions? In Proceedings of the DARPA Workshop on Case-Based Reasoning, 1988, pp. 209–223, 1988.

Pang Wei Koh and Percy Liang. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894, 2017.

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. An Evaluation of the Human-Interpretability of Explanation. 2019. URL http://arxiv.org/abs/1902.00006.

Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 131–138, 2019.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing Neural Predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 107–117, 2016.
Xin Li and Dan Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, 2002.

Zhou Li and Chunxiang Li. Selection of Kernel Function for Least Squares Support Vector Machines in Downburst Wind Speed Forecasting. In Proceedings of the 11th International Symposium on Computational Intelligence and Design, volume 2, pp. 337–341, 2018.

Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, pp. 4765–4774, 2017.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.

Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.

Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-Scale Image Retrieval With Compressed Fisher Vectors. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3384–3391, 2010.

Tobias Plötz and Stefan Roth. Neural Nearest Neighbors Networks. In Advances in Neural Information Processing Systems 31, pp. 1087–1098, 2018.

Stephen J Read and Ian L Cesa. This Reminds Me of the Time When...: Expectation Failures in Reminding and Explanation. Journal of Experimental Social Psychology, 27(1):1–25, 1991.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Bernhard Schölkopf, Alexander J Smola, and Francis Bach. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Julia Strout, Ye Zhang, and Raymond J. Mooney. Do Human Rationales Improve Machine Explanations? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 56–62, 2019.

Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018.

Fulton Wang and Cynthia Rudin. Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013–1022, 2015.

Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian rule lists. In International Conference on Machine Learning, pp. 3921–3930, 2017.

Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K Ravikumar. Representer Point Selection for Explaining Deep Neural Networks. In Advances in Neural Information Processing Systems 31, pp. 9291–9301, 2018.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pp. 649–657, 2015.
A GRADIENT-BASED METRICS

In gradient-based metrics, we consider a model with parameter $\theta$, its loss $\ell(z; \theta)$, and its gradient $\nabla_{\theta} \ell(z; \theta)$ to measure relevance, where $z = (x, y)$ is an input-output pair.

Influence Function (Koh & Liang, 2017) Koh & Liang (2017) proposed to measure relevance according to how much the test loss would increase if the training instance were omitted from the training set. Here, the model parameter trained using all of the training set is denoted by $\hat{\theta}$, and the parameter trained using all of the training set except the $i$-th instance $z_{\mathrm{train}}^{(i)}$ is denoted by $\hat{\theta}_{-i}$. The relevance metric proposed by Koh & Liang (2017) is then defined as the difference between the test losses under the parameters $\hat{\theta}$ and $\hat{\theta}_{-i}$ as follows:
$$R_{\mathrm{IF}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) := \ell(z_{\mathrm{test}}; \hat{\theta}_{-i}) - \ell(z_{\mathrm{test}}; \hat{\theta}). \quad (4)$$
Here, a greater value indicates that the loss on the test instance increases drastically when the $i$-th training instance is removed from the training set. Thus, the $i$-th training instance is essential for predicting the test instance; therefore, it is highly relevant. In practice, the following approximation is used to avoid computing $\hat{\theta}_{-i}$ explicitly:
$$R_{\mathrm{IF}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) \approx \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), H^{-1} \nabla_{\theta} \ell(z_{\mathrm{train}}^{(i)}; \hat{\theta}) \rangle, \quad (5)$$
where $H$ is the Hessian matrix of the loss $L_{\mathrm{train}}$.

Relative IF (Barshan et al., 2020) Barshan et al. (2020) proposed to measure relevance according to how much the test loss would increase if the training instance were omitted from the training set, under the constraint that the expected squared change in loss is sufficiently small,6 which is a modified version of the influence function. Relative IF is computed as the cosine similarity of $\phi(z) = H^{-1/2} \nabla_{\theta} \ell(z; \hat{\theta})$:
$$R_{\mathrm{RIF}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \cos(H^{-1/2} \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), H^{-1/2} \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta})). \quad (6)$$

6 This metric is called ℓ-RelatIF by Barshan et al. (2020).

Fisher Kernel (Khanna et al., 2019) Khanna et al. (2019) proposed to measure the relevance of instances using the Fisher kernel as follows:
$$R_{\mathrm{FK}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) := \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), I^{-1} \nabla_{\theta} \ell(z_{\mathrm{train}}^{(i)}; \hat{\theta}) \rangle, \quad (7)$$
where $I$ is the Fisher information matrix of the loss $L_{\mathrm{train}}$.

Grad-Dot, Grad-Cos (Perronnin et al., 2010; Yeh et al., 2018; Charpiat et al., 2019) Charpiat et al. (2019) proposed to measure relevance according to how much the loss would decrease when a small update is applied to the model using the training instance. This can be computed as the dot product of the loss gradients, which we refer to as Grad-Dot:
$$R_{\mathrm{GD}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta}) \rangle. \quad (8)$$
Note that a similar metric is studied by Yeh et al. (2018) as the representer point value. As a modification of Grad-Dot, Charpiat et al. (2019) also proposed the following cosine version, which we refer to as Grad-Cos:
$$R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \cos(\nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta})). \quad (9)$$
Note that the use of the cosine between the gradients is also proposed by Perronnin et al. (2010).
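The following sketch summarizes the five gradient-based metrics above. It assumes that the loss gradients and (positive definite) Hessian and Fisher information matrices have already been obtained, e.g., with an automatic differentiation library, and it uses explicit matrix solves for clarity, whereas the experiments rely on approximations such as the conjugate gradient method of Koh & Liang (2017). All names and the toy data are illustrative.

```python
# Minimal sketch of the gradient-based relevance metrics (Eqs. (5)-(9)),
# assuming precomputed loss gradients g_test, g_train and SPD matrices H and I.
import numpy as np

def influence(g_test, g_train, H):
    return float(g_test @ np.linalg.solve(H, g_train))              # Eq. (5)

def relative_if(g_test, g_train, H):
    # Cosine of H^{-1/2} g, with H^{-1/2} obtained from the eigendecomposition.
    w, V = np.linalg.eigh(H)
    H_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    a, b = H_inv_sqrt @ g_test, H_inv_sqrt @ g_train
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))    # Eq. (6)

def fisher_kernel(g_test, g_train, I):
    return float(g_test @ np.linalg.solve(I, g_train))              # Eq. (7)

def grad_dot(g_test, g_train):
    return float(g_test @ g_train)                                   # Eq. (8)

def grad_cos(g_test, g_train):
    return float(g_test @ g_train /
                 (np.linalg.norm(g_test) * np.linalg.norm(g_train)))  # Eq. (9)

# Toy usage with a 3-dimensional parameter space.
rng = np.random.default_rng(0)
g_test, g_train = rng.normal(size=3), rng.normal(size=3)
A = rng.normal(size=(3, 3))
H = A @ A.T + np.eye(3)          # symmetric positive definite stand-in for the Hessian
I = np.diag([1.0, 2.0, 3.0])     # stand-in for the Fisher information matrix
print(influence(g_test, g_train, H), relative_if(g_test, g_train, H),
      fisher_kernel(g_test, g_train, I), grad_dot(g_test, g_train),
      grad_cos(g_test, g_train))
```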
B EXPERIMENTAL SETUP

B.1 DATASETS AND MODELS

MNIST (LeCun et al., 1998) The MNIST dataset is used for handwritten digit image classification tasks. Here, input x is an image of a handwritten digit, and the output y consists of 10 classes ('0' to '9'). We adopted logistic regression and a CNN as the classification models. The CNN has six convolutional layers, with a max-pooling layer after every two convolutional layers. The features obtained by these layers are fed into a global average pooling layer followed by a single linear layer. The number of output channels of all the convolutional layers is set to 16. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 5,500 randomly sampled training instances to train the models.

CIFAR10 (Krizhevsky, 2009) The CIFAR10 dataset is used for object recognition tasks. Here, input x is an image containing a certain object, and output y consists of 10 classes, e.g., bird or airplane. Note that we used the same models as for the MNIST dataset. In addition, we adopted MobileNetV2 (Sandler et al., 2018) as a model with a higher performance than the previous models. We trained the models using the Adam optimizer with a learning rate of 0.001. In the experiments, we first pre-trained the models using all the training instances of CIFAR10, and then trained the models using 5,000 randomly sampled training instances. Without the pre-training, the classification performance of the models dropped significantly. Note that we did not examine IF and FK on MobileNetV2 because the matrix inverse in these metrics required too much time to compute even with the conjugate gradient approximation proposed by Koh & Liang (2017).

TREC (Li & Roth, 2002) The TREC dataset is used for question classification tasks. Here, input x is a question sentence, and output y is a question category consisting of six classes, e.g., LOC and NUM. We used bag-of-words logistic regression and a two-layer Bi-LSTM as the classification models. In the Bi-LSTM, the last state is fed into one linear layer. The word embedding dimension is set to 16, and the dimension of the LSTM is set to 16 as well. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 2,726 randomly sampled training instances to train the models.

AGNews (Zhang et al., 2015) The AGNews dataset is used for news article classification tasks. Here, input x is a sentence, and output y is a category comprising four classes, e.g., business and sports. We used the same models as for TREC. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 6,000 randomly sampled training instances to train the models.

Vehicle (Dua & Graff, 2017) The Vehicle dataset is used for vehicle type classification tasks. Here, the input x consists of 18 features, and the output y is a type of vehicle comprising four classes, e.g., bus and van. We used logistic regression and a three-layer MLP as the classification models. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 423 randomly sampled training instances to train the models.

Segment (Dua & Graff, 2017) The Segment dataset is used for image classification tasks. Here, the input x consists of 19 features, and the output y consists of seven classes, e.g., sky and window. We used the same models as for Vehicle. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 924 randomly sampled training instances to train the models.

B.2 COMPUTING INFRASTRUCTURE

In our experiments, training of the models was run on an NVIDIA GTX 1080 GPU with an Intel Xeon Silver 4112 CPU and 64GB RAM. Testing and computing the relevance metrics were run on a Xeon E5-2680 v2 CPU with 256GB RAM.

C DERIVATION OF GC FOR LINEAR LOGISTIC REGRESSION

We consider linear logistic regression whose conditional distribution $p(y \mid x; \theta)$ is given by the $y$-th entry of $\sigma(Wx)$, where $\sigma$ is the softmax function, $\theta = W \in \mathbb{R}^{C \times d}$, and $C$ and $d$ are the number of classes and the dimensionality of $x$, respectively. Recall that the cross entropy loss for linear logistic regression is given as
$$\ell(z; \theta) = -\sum_{c=1}^{C} y_c \langle w_c, x \rangle + \log \sum_{c'=1}^{C} \exp(\langle w_{c'}, x \rangle), \quad (10)$$
where $W = [w_1, w_2, \ldots, w_C]^{\top}$. Let $e_y$ be a vector whose $y$-th entry is one and zero otherwise. Then, the gradient of the loss with respect to $w_c$ can be expressed as
$$\nabla_{w_c} \ell(z; \theta) = (\sigma(Wx) - e_y)_c \, x = (r_z)_c \, x, \quad (11)$$
where $r_z = \sigma(Wx) - e_y$ is the residual of the prediction on $z$. Hence, we have
$$\langle \nabla_{\theta} \ell(z; \theta), \nabla_{\theta} \ell(z'; \theta) \rangle = \sum_{c=1}^{C} \langle \nabla_{w_c} \ell(z; \theta), \nabla_{w_c} \ell(z'; \theta) \rangle \quad (12)$$
$$= \sum_{c=1}^{C} (r_z)_c (r_{z'})_c \langle x, x' \rangle \quad (13)$$
$$= \langle r_z, r_{z'} \rangle \langle x, x' \rangle, \quad (14)$$
which yields
$$R_{\mathrm{GC}}(z, z') = \frac{\langle r_z, r_{z'} \rangle \langle x, x' \rangle}{\|r_z\| \|x\| \, \|r_{z'}\| \|x'\|} \quad (15)$$
$$= \cos(r_z, r_{z'}) \cos(x, x'). \quad (16)$$
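As a numerical sanity check of the derivation above, the following toy sketch verifies Eq. (16) for a randomly drawn logistic-regression parameter matrix and two synthetic instances; all names and values are illustrative and not taken from the experiments.

```python
# Toy numerical check of Eq. (16): for linear logistic regression, the Grad-Cos
# relevance factorizes as cos(r_z, r_z') * cos(x, x'). Synthetic model and data.
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 6                                   # number of classes, input dimension
W = rng.normal(size=(C, d))                   # logistic-regression parameters

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def grad_loss(x, y):
    """Gradient of the cross-entropy loss w.r.t. W, flattened (Eq. (11))."""
    r = softmax(W @ x) - np.eye(C)[y]          # residual r_z
    return np.outer(r, x).ravel(), r

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x1, y1 = rng.normal(size=d), 2
x2, y2 = rng.normal(size=d), 0
g1, r1 = grad_loss(x1, y1)
g2, r2 = grad_loss(x2, y2)

print(cos(g1, g2))                 # R_GC(z, z')
print(cos(r1, r2) * cos(x1, x2))   # cos(r_z, r_z') * cos(x, x'): same value
```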
D REPAIRING GRADIENT-BASED METRICS

As described in Section 5, we found that training instances with extremely large norms were selected as relevant by IF, FK, and GD. Thus, to repair these metrics, we need to design metrics that can ignore instances with large norms. A simple yet effective way of repairing the metrics is to use the ℓ2 distance or the cosine instead of the dot product. As Figure 4 shows, the ℓ2 and cosine metrics performed better than the dot metrics. Indeed, the ℓ2 metrics do not favor instances with large norms, which lead to large ℓ2 distances, and, through normalization, the cosine metrics completely ignore the effect of the norms. We name the repaired metrics of IF, FK, and GD based on the ℓ2 metric ℓ2^IF, ℓ2^FK, and ℓ2^GD, respectively, and the repaired metrics based on the cosine metric cos^IF, cos^FK, and cos^GD, respectively.7 We observed that these repaired metrics attained higher success rates on several evaluation criteria. The details of the results can be found in Appendix F.

7 Note that cos^IF is the same as RIF and cos^GD is the same as GC.

E DO THE MODELS CAPTURE SUBCLASSES?

The identical subclass test requires the model to obtain internal representations that can distinguish subclasses. Here, we confirm that this condition is satisfied for all the datasets and models we used in the experiments. We consider that a model captures the subclasses if the latent representation h_all has cluster structures. Figure 9 visualizes h_all for each dataset and model using UMAP (McInnes et al., 2018). The figures show that the instances from different subclasses are not mixed completely at random. MNIST and TREC have relatively clear cluster structures, while CIFAR10 and AGNews have vague clusters without explicit boundaries. These figures imply that the models capture subclasses (although perhaps not perfectly).

F COMPLETE EVALUATION RESULTS

F.1 FULL RESULTS

We show the complete results of the model randomization test in Table 2, the identical class test in Table 3, and the identical subclass test in Table 4. The results presented here are consistent with our observations in Section 4.

Table 2: Average Spearman rank correlation coefficients ± std. of each relevance metric for the model randomization test. The repaired metrics (Appendix D) are marked with a prefix. The results whose average score falls within the 95% confidence interval of the null distribution that the correlation is zero, which is [-0.088, 0.088], are colored.
MNIST CIFAR10 TREC Model CNN logreg Mobilenet V2 CNN logreg Bi-LSTM logreg Parameter size 12K 8K 2.2M 12K 31K 20K 7K Accuracy 0.98 0.00 0.92 0.00 0.89 0.01 0.72 0.02 0.35 0.01 0.86 0.01 0.81 0.02 ℓx 2 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 ℓlast 2 .15 .01 - .07 .00 .05 .01 - .19 .01 - ℓall 2 .79 .00 - .02 .01 .13 .01 - .25 .02 - cosx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 coslast .24 .00 - .07 .01 .04 .01 - .17 .02 - cosall .78 .00 - .02 .01 .09 .01 - .26 .03 - dotx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 dotlast .39 .01 - .05 .01 .04 .01 - .25 .01 - dotall .80 .00 - .00 .01 .12 .01 - .26 .03 - IF .05 .00 .00 .00 .05 .01 .04 .01 .04 .01 .01 .01 .06 .01 ℓIF 2 .00 .02 .11 .00 .01 .02 .03 .02 .05 .01 .00 .02 .13 .02 cos IF .02 .00 .05 .00 .04 .01 .03 .01 .03 .01 .01 .01 .03 .01 FK .02 .01 .02 .01 .02 .01 .01 .01 .03 .01 .01 .00 .03 .00 ℓFK 2 .10 .04 .05 .00 .16 .05 .12 .02 .03 .01 .14 .03 .15 .01 cos FK .00 .02 .05 .01 .05 .01 .03 .01 .01 .01 .07 .02 .03 .00 GD .08 .02 .01 .01 .03 .01 .02 .01 .04 .01 .04 .01 .02 .02 GC .07 .03 .03 .01 .02 .02 .01 .03 .05 .01 .04 .02 .01 .01 ℓgrad 2 .09 .04 .13 .01 .09 .04 .07 .02 .06 .01 .02 .02 .10 .02 AGNews Vehicle Segment Model Bi-LSTM logreg MLP logreg MLP logreg Parameter size 27K 9K 1K 76 1K 140 Accuracy 0.80 0.02 0.80 0.01 0.77 0.02 0.77 0.01 0.98 0.01 0.97 0.00 ℓx 2 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 ℓlast 2 .07 .01 - .16 .04 - .62 .15 - ℓall 2 .17 .01 - .20 .10 - .78 .08 - cosx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 coslast .07 .01 - .09 .18 - .60 .09 - cosall .12 .02 - .01 .13 - .77 .06 - dotx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 dotlast .07 .01 - .85 .33 - .61 .23 - dotall .20 .01 - .97 .03 - .72 .16 - IF .03 .01 .05 .01 .01 .03 .01 .02 .00 .01 .01 .02 ℓIF 2 .04 .02 .13 .00 .18 .24 .01 .28 .03 .13 .10 .26 cos IF .02 .01 .03 .01 .01 .03 .01 .05 .04 .10 .01 .05 FK .04 .01 .03 .00 .01 .06 .02 .07 .01 .02 .00 .02 ℓFK 2 .17 .04 .14 .00 .18 .21 .01 .17 .05 .07 .02 .20 cos FK .00 .03 .03 .00 .08 .13 .04 .12 .01 .03 .01 .04 GD .04 .01 .03 .01 .01 .11 .02 .05 .01 .03 .02 .03 GC .01 .02 .04 .01 .02 .13 .06 .11 .00 .06 .01 .05 ℓgrad 2 .01 .02 .14 .00 .13 .21 .11 .23 .02 .12 .09 .21 Published as a conference paper at ICLR 2021 2.5 0.0 2.5 5.0 7.5 10.0 12.5 15.0 12.5 2 4 5 6 7 (a) MNIST with CNN. y = A. 2 0 2 4 6 8 10 12 14 4 12 0 1 3 8 9 (b) MNIST with CNN. y = B. 7 6 5 4 3 2 1 0 cat airplane truck dog horse (c) CIFAR10 with Mobile Net V2. y = A. 2 1 0 1 2 3 4 5 6 5 ship frog automobile deer bird (d) CIFAR10 with Mobile Net V2. y = B. 4 3 2 1 0 1 2 3 4 cat airplane truck dog horse (e) CIFAR10 with CNN. y = A. 3 2 1 0 1 2 3 4 ship frog automobile deer bird (f) CIFAR10 with CNN. y = B. NUM HUM ENTY Figure 8: TREC with LSTM. y = A. 15 10 5 0 5 10 20 LOC DESC ABBR (a) TREC with LSTM. y = B. 10 Business World (b) AGNews with LSTM. y = A. 1 2 3 4 5 6 7 8 9 22 Sci/Tech Sports (c) AGNews with LSTM. y = B. Figure 9: visualization of hall in each dataset and model using UMAP. Published as a conference paper at ICLR 2021 Table 3: Average success rate std. of each relevancy metric for identical class test. The metrics prefixed with are the ones we have repaired. The results with the average success rate over 0.5 are colored. 
MNIST CIFAR10 TREC Model CNN logreg Mobilenet V2 CNN logreg Bi-LSTM logreg Parameter size 12K 8K 2.2M 12K 31K 20K 7K Accuracy 0.98 0.00 0.92 0.00 0.89 0.01 0.72 0.02 0.35 0.01 0.86 0.01 0.81 0.02 ℓx 2 .93 .01 .88 .01 .26 .02 .26 .02 .24 .02 .70 .00 .75 .00 ℓlast 2 .99 .01 - 1.00 .00 .75 .02 - .89 .00 - ℓall 2 .98 .00 - .93 .02 .61 .02 - .88 .00 - cosx .94 .01 .88 .01 .30 .03 .29 .02 .26 .02 .73 .00 .76 .00 coslast .99 .01 - 1.00 .00 .78 .02 - .89 .00 - cosall .98 .00 - .97 .01 .71 .02 - .90 .00 - dotx .69 .02 .68 .02 .09 .02 .10 .01 .11 .02 .33 .00 .34 .00 dotlast .67 .02 - 1.00 .00 .20 .02 - .93 .00 - dotall .96 .01 - .96 .01 .31 .01 - .93 .00 - IF .09 .01 .26 .02 - .10 .01 .09 .02 .29 .00 .86 .00 ℓIF 2 .72 .01 .62 .02 - .14 .01 .14 .01 .98 .00 .95 .00 cos IF .82 .01 .69 .02 - .12 .01 .13 .02 .99 .00 .96 .00 FK .10 .01 .21 .02 - .20 .02 .20 .02 .28 .00 .24 .00 ℓFK 2 .77 .02 .93 .01 - .82 .01 .98 .00 .99 .00 .96 .00 cos FK .92 .01 .97 .01 - .93 .01 .99 .00 1.00 .00 .96 .00 GD .30 .01 .87 .01 .26 .03 .71 .02 1.00 .00 .49 .00 1.00 .00 GC .99 .00 1.00 .00 .99 .00 .99 .00 1.00 .00 1.00 .00 1.00 .00 ℓgrad 2 .94 .01 .99 .00 .97 .01 .99 .00 1.00 .00 1.00 .00 1.00 .00 AGNews Vehicle Segment Model Bi-LSTM logreg MLP logreg MLP logreg Parameter size 27K 9K 1K 76 1K 140 Accuracy 0.80 0.02 0.80 0.01 0.77 0.02 0.77 0.01 0.98 0.01 0.97 0.00 ℓx 2 .39 .02 .40 .02 .65 .02 .62 .02 .93 .01 .92 .01 ℓlast 2 .84 .02 - .72 .03 - .97 .01 - ℓall 2 .84 .01 - .72 .03 - .96 .01 - cosx .47 .01 .51 .02 .66 .02 .63 .02 .91 .01 .90 .01 coslast .85 .01 - .74 .04 - .97 .01 - cosall .84 .01 - .73 .04 - .96 .01 - dotx .28 .02 .47 .02 .25 .00 .27 .01 .37 .01 .37 .01 dotlast .89 .01 - .26 .02 - .17 .06 - dotall .90 .01 - .27 .06 - .13 .01 - IF .24 .01 .67 .02 .39 .16 .78 .08 .15 .03 .52 .07 ℓIF 2 .99 .00 .92 .01 .88 .06 .95 .01 .79 .13 .80 .05 cos IF 1.00 .00 .97 .01 .96 .02 .99 .01 .84 .11 .92 .08 FK .32 .01 .29 .03 .31 .18 .26 .17 .15 .04 .17 .10 ℓFK 2 .94 .01 .68 .02 .93 .04 .94 .03 .86 .06 .95 .02 cos FK .95 .01 .84 .02 .99 .01 .99 .01 .97 .02 .99 .01 GD .76 .01 1.00 .00 .90 .10 .98 .02 .30 .14 .55 .11 GC 1.00 .00 1.00 .00 1.00 .00 1.00 .00 .97 .02 1.00 .00 ℓgrad 2 1.00 .00 1.00 .00 .99 .01 1.00 .00 .90 .05 .99 .01 Published as a conference paper at ICLR 2021 Table 4: Average success rate std. of each relevancy metric for identical subclass test. The metrics prefixed with are the ones we have repaired. The results with the average success rate over 0.5 are colored. 
Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.99±0.00, 0.88±0.01, 0.92±0.01, 0.84±0.03, 0.71±0.03, 0.86±0.01, 0.81±0.02
ℓ2^x: .93±.01, .96±.02, .26±.02, .29±.04, .31±.03, .78±.03, .78±.02
ℓ2^last: .89±.02, -, .29±.04, .35±.04, -, .76±.02, -
ℓ2^all: .97±.01, -, .49±.04, .38±.03, -, .77±.03, -
cos^x: .95±.01, .96±.02, .29±.03, .31±.04, .31±.03, .82±.02, .81±.02
cos^last: .89±.02, -, .32±.03, .33±.03, -, .75±.02, -
cos^all: .98±.00, -, .71±.04, .50±.03, -, .77±.02, -
dot^x: .70±.03, .75±.03, .09±.02, .11±.03, .09±.02, .33±.03, .34±.03
dot^last: .24±.04, -, .22±.02, .20±.01, -, .40±.03, -
dot^all: .94±.01, -, .68±.03, .25±.03, -, .59±.03, -
IF: .12±.01, .39±.05, -, .06±.02, .08±.02, .31±.02, .49±.03
ℓ2^IF: .62±.04, .76±.03, -, .17±.02, .12±.02, .68±.02, .79±.02
cos^IF: .70±.02, .87±.02, -, .15±.02, .09±.02, .72±.01, .75±.03
FK: .19±.03, .14±.02, -, .11±.01, .11±.02, .30±.02, .16±.02
ℓ2^FK: .81±.02, .76±.03, -, .31±.03, .24±.02, .73±.03, .78±.02
cos^FK: .91±.02, .85±.02, -, .37±.03, .23±.02, .81±.02, .79±.01
GD: .42±.05, .48±.03, .20±.02, .24±.03, .21±.04, .45±.02, .60±.02
GC: .97±.01, .98±.01, .54±.03, .43±.04, .39±.03, .81±.01, .87±.02
ℓ2^grad: .91±.02, .95±.01, .28±.03, .38±.03, .34±.03, .78±.02, .88±.02

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 38, 1K, 40
Accuracy: 0.80±0.02, 0.80±0.01, 0.73±0.02, 0.73±0.01, 0.94±0.01, 0.90±0.01
ℓ2^x: .40±.02, .41±.01, .67±.03, .65±.02, .95±.01, .95±.01
ℓ2^last: .53±.02, -, .64±.05, -, .95±.02, -
ℓ2^all: .58±.01, -, .66±.04, -, .96±.01, -
cos^x: .49±.02, .53±.02, .68±.04, .66±.03, .92±.01, .93±.01
cos^last: .54±.01, -, .68±.06, -, .93±.01, -
cos^all: .59±.02, -, .67±.03, -, .94±.01, -
dot^x: .28±.02, .48±.02, .26±.03, .26±.03, .38±.01, .41±.02
dot^last: .52±.02, -, .27±.03, -, .15±.03, -
dot^all: .54±.02, -, .28±.04, -, .13±.02, -
IF: .25±.02, .48±.01, .34±.12, .54±.09, .16±.02, .49±.08
ℓ2^IF: .56±.02, .77±.02, .76±.14, .86±.06, .65±.10, .43±.12
cos^IF: .56±.02, .80±.02, .86±.09, .91±.08, .62±.11, .86±.05
FK: .28±.01, .25±.02, .16±.08, .20±.05, .16±.07, .10±.05
ℓ2^FK: .56±.02, .63±.02, .73±.13, .62±.09, .73±.13, .93±.03
cos^FK: .61±.02, .73±.02, .80±.06, .67±.10, .81±.10, .96±.02
GD: .50±.02, .54±.02, .47±.09, .43±.03, .34±.08, .37±.08
GC: .65±.02, .72±.02, .82±.06, .83±.07, .81±.10, .96±.01
ℓ2^grad: .61±.02, .73±.03, .72±.10, .75±.09, .75±.10, .90±.03

F.2 ADDITIONAL RESULTS

The identical class test requires the most relevant instance to be of the same class as the test instance. In practice, users can be more confident about a model's output if several instances are provided as evidence. In other words, we expect not only the most relevant instance but also the first few relevant instances to be of the same class. This observation leads to the following additional criterion, which is a generalization of the identical class test.

Definition 5 (Top-k Identical Class Test). For z_test = (x_test, ŷ_test), let z̃_j = (x̃_j, ỹ_j) be the training instance with the j-th largest relevance score. Then, we require ỹ_j = ŷ_test for all j ∈ {1, 2, . . . , k}.

The same observation also applies to the identical subclass test, which leads to the following criterion.

Definition 6 (Top-k Identical Subclass Test). For z_test = (x_test, ŷ_test), let z̃_j = (x̃_j, ỹ_j) be the training instance with the j-th largest relevance score. Then, we require s(z̃_j) = s(z_test) for all j ∈ {1, 2, . . . , k}.

We show the results of the top-10 identical class test in Table 5 and those of the top-10 identical subclass test in Table 6.
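As a concrete reading of Definition 5, the following minimal sketch (our own illustration, not the authors' released code) computes the top-k identical class test success rate from precomputed relevance scores; the names relevance, train_labels, and test_pred_labels are hypothetical placeholders for the corresponding arrays.

```python
import numpy as np

def topk_identical_class_success(relevance, train_labels, test_pred_labels, k=10):
    """Fraction of test instances whose k most relevant training instances all
    share the test instance's predicted label (Definition 5).

    relevance:        (n_test, n_train) array of scores R(z_test, z_train)
    train_labels:     (n_train,) training labels
    test_pred_labels: (n_test,)  predicted labels of the test instances
    """
    # Indices of the k training instances with the largest relevance per test instance.
    topk = np.argsort(-relevance, axis=1)[:, :k]
    # The test succeeds only if all k retrieved labels match the predicted label.
    hits = train_labels[topk] == test_pred_labels[:, None]
    return hits.all(axis=1).mean()
```

With k = 1 this reduces to the identical class test; replacing the labels with subclass assignments s(·) gives the top-k identical subclass test of Definition 6.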
G EXAMPLES OF EACH EXPLANATION METHOD

We show some examples of the relevant instances selected by several relevance metrics on CIFAR10 with CNN in Figures 10 and 11, and on AGNews with LSTM in Tables 7 and 8. We show examples of both correct (Figure 10 and Table 7) and incorrect (Figure 11 and Table 8) predictions. As mentioned in Section 5, the relevance metrics based on the dot product of the gradient, such as IF, FK, and GD, tend to select instances with large norms; accordingly, non-typical instances are selected in these examples.
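To make this norm effect concrete, here is a minimal sketch (our own illustration under simplified assumptions, not the paper's implementation) contrasting a dot-product relevance of loss gradients with its cosine counterpart on precomputed gradient vectors; computing the gradients themselves is model-specific and omitted.

```python
import numpy as np

def grad_dot(g_test, g_train):
    """Dot-product relevance of loss gradients (in the spirit of GD).
    The score scales with the norm of each training gradient, so atypical,
    high-loss training instances with large gradients tend to dominate."""
    return g_train @ g_test

def grad_cos(g_test, g_train):
    """Cosine relevance of loss gradients (in the spirit of GC).
    Normalizing by the norms removes the bias toward large-gradient instances."""
    g_train_unit = g_train / np.linalg.norm(g_train, axis=1, keepdims=True)
    g_test_unit = g_test / np.linalg.norm(g_test)
    return g_train_unit @ g_test_unit

# Toy example: a poorly aligned but large-norm gradient outranks a well-aligned
# one under the dot product, while the cosine keeps the aligned instance first.
g_test = np.array([1.0, 0.0])
g_train = np.array([[0.9, 0.1],   # well aligned, moderate norm
                    [5.0, 8.0]])  # poorly aligned, large norm
print(grad_dot(g_test, g_train))  # [0.9, 5.0]  -> large-norm instance ranked first
print(grad_cos(g_test, g_train))  # approx. [0.99, 0.53] -> aligned instance first
```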
Table 5: Average success rate ± std. of each relevance metric for the top-10 identical class test. The metrics ℓ2^IF, cos^IF, ℓ2^FK, cos^FK, and ℓ2^grad are the ones we have repaired. Results with an average success rate over 0.5 are colored.

Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.98±0.00, 0.92±0.00, 0.89±0.01, 0.72±0.02, 0.35±0.01, 0.86±0.01, 0.81±0.02
ℓ2^x: .63±.02, .63±.02, .00±.00, .00±.00, .00±.00, .23±.00, .23±.00
ℓ2^last: .95±.01, -, .98±.01, .30±.01, -, .68±.00, -
ℓ2^all: .89±.01, -, .64±.05, .14±.01, -, .66±.00, -
cos^x: .67±.02, .65±.02, .00±.00, .00±.00, .00±.00, .24±.00, .24±.00
cos^last: .97±.01, -, .98±.01, .33±.02, -, .69±.00, -
cos^all: .92±.00, -, .84±.03, .23±.02, -, .68±.00, -
dot^x: .19±.01, .20±.02, .00±.00, .00±.00, .00±.00, .05±.00, .05±.00
dot^last: .42±.03, -, .98±.01, .04±.01, -, .75±.00, -
dot^all: .88±.01, -, .79±.03, .05±.01, -, .84±.00, -
IF: .00±.00, .00±.00, -, .00±.00, .00±.00, .00±.00, .24±.00
ℓ2^IF: .25±.01, .10±.01, -, .00±.00, .00±.00, .83±.00, .47±.00
cos^IF: .59±.02, .17±.01, -, .00±.00, .00±.00, .91±.00, .65±.00
FK: .00±.00, .02±.01, -, .00±.00, .06±.01, .01±.00, .00±.00
ℓ2^FK: .23±.03, .65±.02, -, .25±.02, .87±.01, .90±.00, .71±.00
cos^FK: .59±.01, .82±.02, -, .54±.02, .93±.01, .95±.00, .77±.00
GD: .00±.00, .41±.02, .00±.00, .15±.01, 1.00±.00, .11±.00, .99±.00
GC: .95±.01, .99±.01, .92±.02, .92±.01, 1.00±.00, .96±.00, 1.00±.00
ℓ2^grad: .57±.02, .95±.01, .78±.03, .80±.01, .99±.00, .94±.00, 1.00±.00

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 76, 1K, 140
Accuracy: 0.80±0.02, 0.80±0.01, 0.77±0.02, 0.77±0.01, 0.98±0.01, 0.97±0.00
ℓ2^x: .00±.00, .00±.00, .09±.02, .09±.02, .60±.01, .60±.01
ℓ2^last: .48±.03, -, .19±.07, -, .77±.03, -
ℓ2^all: .46±.01, -, .16±.06, -, .74±.03, -
cos^x: .01±.00, .02±.01, .10±.02, .10±.01, .44±.02, .44±.02
cos^last: .51±.03, -, .22±.07, -, .78±.03, -
cos^all: .48±.02, -, .17±.06, -, .72±.04, -
dot^x: .01±.00, .01±.00, .15±.12, .16±.13, .23±.02, .23±.02
dot^last: .64±.03, -, .13±.11, -, .05±.06, -
dot^all: .66±.03, -, .15±.11, -, .00±.01, -
IF: .00±.00, .02±.01, .01±.01, .10±.03, .00±.00, .10±.03
ℓ2^IF: .94±.01, .20±.02, .25±.13, .47±.05, .32±.15, .48±.05
cos^IF: .97±.01, .48±.01, .42±.12, .61±.03, .63±.16, .83±.12
FK: .00±.00, .00±.00, .05±.11, .08±.11, .00±.01, .03±.06
ℓ2^FK: .61±.02, .06±.01, .55±.19, .64±.12, .32±.17, .60±.14
cos^FK: .71±.03, .15±.01, .85±.06, .85±.08, .78±.08, .92±.03
GD: .55±.02, .98±.01, .56±.19, .70±.05, .09±.08, .37±.05
GC: 1.00±.00, 1.00±.00, .95±.04, 1.00±.00, .84±.08, .97±.02
ℓ2^grad: .99±.01, .98±.00, .81±.09, .95±.03, .43±.20, .84±.06

Table 6: Average success rate ± std. of each relevance metric for the top-10 identical subclass test. The metrics ℓ2^IF, cos^IF, ℓ2^FK, cos^FK, and ℓ2^grad are the ones we have repaired. Results with an average success rate over 0.5 are colored.

Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.99±0.00, 0.88±0.01, 0.92±0.01, 0.84±0.03, 0.71±0.03, 0.86±0.01, 0.81±0.02
ℓ2^x: .64±.02, .71±.03, .00±.00, .00±.00, .00±.00, .27±.05, .25±.02
ℓ2^last: .54±.04, -, .00±.00, .00±.00, -, .30±.02, -
ℓ2^all: .85±.02, -, .08±.02, .01±.00, -, .34±.02, -
cos^x: .67±.02, .74±.03, .00±.00, .01±.01, .00±.00, .28±.04, .27±.02
cos^last: .57±.05, -, .00±.00, .00±.00, -, .30±.02, -
cos^all: .89±.02, -, .16±.02, .02±.01, -, .34±.02, -
dot^x: .21±.02, .23±.03, .00±.00, .00±.00, .00±.00, .05±.01, .05±.01
dot^last: .08±.02, -, .00±.00, .00±.00, -, .14±.01, -
dot^all: .79±.03, -, .13±.02, .01±.01, -, .17±.02, -
IF: .01±.01, .00±.00, -, .00±.00, .00±.00, .00±.00, .01±.01
ℓ2^IF: .14±.03, .16±.02, -, .00±.00, .00±.00, .11±.02, .24±.02
cos^IF: .37±.02, .35±.04, -, .00±.00, .00±.00, .22±.03, .25±.02
FK: .00±.00, .00±.00, -, .00±.00, .00±.00, .00±.00, .00±.00
ℓ2^FK: .22±.02, .30±.03, -, .00±.00, .00±.00, .28±.04, .26±.02
cos^FK: .58±.02, .46±.04, -, .00±.00, .00±.00, .41±.03, .25±.02
GD: .00±.00, .01±.01, .01±.01, .00±.00, .00±.00, .10±.02, .01±.00
GC: .86±.03, .87±.02, .06±.02, .01±.01, .01±.01, .37±.03, .37±.02
ℓ2^grad: .50±.03, .69±.04, .02±.01, .00±.00, .00±.00, .24±.03, .34±.02

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 38, 1K, 40
Accuracy: 0.80±0.02, 0.80±0.01, 0.73±0.02, 0.73±0.01, 0.94±0.01, 0.90±0.01
ℓ2^x: .00±.00, .00±.00, .10±.00, .09±.00, .62±.02, .64±.02
ℓ2^last: .01±.00, -, .07±.00, -, .66±.07, -
ℓ2^all: .01±.01, -, .08±.00, -, .70±.05, -
cos^x: .01±.00, .02±.01, .10±.00, .09±.00, .46±.02, .48±.02
cos^last: .01±.00, -, .06±.00, -, .60±.07, -
cos^all: .02±.01, -, .10±.00, -, .62±.08, -
dot^x: .01±.00, .02±.01, .00±.00, .00±.00, .24±.02, .25±.02
dot^last: .01±.00, -, .00±.00, -, .01±.02, -
dot^all: .02±.00, -, .00±.00, -, .02±.05, -
IF: .00±.00, .00±.00, .02±.00, .10±.00, .00±.00, .17±.04
ℓ2^IF: .01±.00, .10±.01, .02±.00, .20±.00, .15±.11, .17±.08
cos^IF: .01±.00, .20±.02, .02±.00, .38±.00, .35±.14, .66±.07
FK: .00±.00, .00±.00, .02±.00, .00±.00, .00±.00, .00±.00
ℓ2^FK: .01±.00, .02±.01, .18±.00, .02±.00, .15±.14, .61±.09
cos^FK: .01±.00, .09±.01, .10±.00, .01±.00, .53±.09, .75±.07
GD: .01±.00, .06±.01, .39±.00, .39±.00, .09±.05, .13±.08
GC: .04±.00, .10±.01, .16±.00, .30±.00, .52±.09, .64±.04
ℓ2^grad: .02±.01, .08±.01, .37±.00, .30±.00, .18±.11, .51±.08

[Figure 10: Relevant instances selected for random test inputs with correct predictions using several relevance metrics on CIFAR10 with CNN. Image panels omitted; the shown test inputs are predicted as frog and as airplane.]

[Figure 11: Relevant instances selected for random test inputs with incorrect predictions using several relevance metrics on CIFAR10 with CNN. Image panels omitted; the shown test inputs are predicted as deer and as automobile.]

Table 7: Relevant instances selected for random test inputs with correct predictions using several relevance metrics on AGNews with LSTM. Out-of-vocabulary words are followed by [unk].
Test input: "kerry widens lead in california , poll finds ( reuters )" (Gold: World, Predict: World)
ℓ2^x: "in brief" (Sci/Tech)
ℓ2^last: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
ℓ2^all: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
cos^x: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
cos^last: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
cos^all: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
dot^x: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
dot^last: "eurozone finance ministers debate action on oil as prices surge ( afp )" (World)
dot^all: "business cash for bush campaign , lawyers[unk] for kerry ( reuters )" (World)
IF: "greek judoka[unk] dies in hospital after balcony[unk] suicide leap[unk]" (Sports)
ℓ2^IF: "world front" (World)
cos^IF: "arafat family bickers[unk] over medical[unk] records of palestinian leader" (World)
FK: "linux # 39;s latest moneymaker[unk]" (Business)
ℓ2^FK: "china launches zy-2[unk] resource[unk] satellite" (Sci/Tech)
cos^FK: "china launches zy-2[unk] resource[unk] satellite" (Sci/Tech)
GD: "judge adjourns[unk] ba[unk] # 39;asyir[unk] # 39;s trial until nov. 4" (World)
GC: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
ℓ2^grad: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)

Test input: "some people not eligible[unk] to get in on google ipo" (Gold: Sci/Tech, Predict: Sci/Tech)
ℓ2^x: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
ℓ2^last: "european judge probes microsoft antitrust case" (Sci/Tech)
ℓ2^all: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^x: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^last: "breakthrough in hydrogen[unk] fuel research" (Sci/Tech)
cos^all: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "earnings alert : novell sees weakness[unk] in it spending" (Sci/Tech)
dot^all: "siemens backs new wireless technology" (Sci/Tech)
IF: "matching[unk] wits[unk] on politics" (Sports)
ℓ2^IF: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^IF: "congress probes fda in vioxx case" (Business)
FK: "bin laden tape urges oil attack" (Business)
ℓ2^FK: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^FK: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
GD: "issue 65 news hound[unk] : this week in gaming" (Sci/Tech)
GC: "google responds[unk] to google news china controversy[unk]" (Sci/Tech)
ℓ2^grad: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)

Table 8: Relevant instances selected for random test inputs with incorrect predictions using several relevance metrics on AGNews with LSTM. Out-of-vocabulary words are followed by [unk].
Test input: "ibm to hire even[unk] more new workers" (Gold: Sci/Tech, Predict: Business)
ℓ2^x: "athletes[unk] to watch[unk]" (Sports)
ℓ2^last: "tech stocks tumble[unk] after chip makers warn" (Business)
ℓ2^all: "microsoft foe[unk] wins in settlement" (Sci/Tech)
cos^x: "volkswagen[unk] workers stage new stoppages[unk]" (Business)
cos^last: "tech stocks tumble[unk] after chip makers warn" (Business)
cos^all: "microsoft revenue tops forecast" (Business)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "google up in market debut after bumpy[unk] ipo ( reuters )" (Business)
dot^all: "google up in market debut after bumpy[unk] ipo ( reuters )" (Business)
IF: "greek judoka[unk] dies in hospital after balcony[unk] suicide leap[unk]" (Sports)
ℓ2^IF: "ibm # 39;s third - quarter earnings and revenue up" (Business)
cos^IF: "arafat family bickers[unk] over medical[unk] records of palestinian leader" (World)
FK: "great white sharks[unk] given new protection" (World)
ℓ2^FK: "ibm # 39;s third - quarter earnings and revenue up" (Business)
cos^FK: "ibm to buy danish[unk] firms" (Business)
GD: "some question speed of intel chief bill ( ap )" (World)
GC: "ibm shrugs[unk] off industry blues[unk] in q3" (Business)
ℓ2^grad: "ibm # 39;s third - quarter earnings and revenue up" (Business)

Test input: "tougher[unk] rules wo n t soften[unk] law s game" (Gold: Sports, Predict: Sci/Tech)
ℓ2^x: "profiting[unk] from moore[unk] s law" (Business)
ℓ2^last: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
ℓ2^all: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
cos^x: "profiting[unk] from moore[unk] s law" (Business)
cos^last: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
cos^all: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "world s top game players battle for cash ( ap )" (Sci/Tech)
dot^all: "sportsnetwork[unk] game preview" (Sports)
IF: "top grades[unk] rising again for gcses[unk]" (World)
ℓ2^IF: "calif. oks toughest[unk] auto emissions[unk] rules" (World)
cos^IF: "un envoy headed to darfur" (World)
FK: "yankee[unk] batters[unk] hit wall" (Sports)
ℓ2^FK: "a flat panel does n t always[unk] compute[unk]" (Sci/Tech)
cos^FK: "a flat panel does n t always[unk] compute[unk]" (Sci/Tech)
GD: "issue 65 news hound[unk] : this week in gaming" (Sci/Tech)
GC: "atari[unk] announces first 64-bit[unk] game" (Sci/Tech)
ℓ2^grad: "atari[unk] announces first 64-bit[unk] game" (Sci/Tech)