Published as a conference paper at ICLR 2021

EVALUATION OF SIMILARITY-BASED EXPLANATIONS

Kazuaki Hanawa1,2, Sho Yokoi2,1, Satoshi Hara3, Kentaro Inui2,1
RIKEN Center for Advanced Intelligence Project1, Tohoku University2, Osaka University3
kazuaki.hanawa@riken.jp, yokoi@ecei.tohoku.ac.jp, satohara@ar.sanken.osaka-u.ac.jp, inui@ecei.tohoku.ac.jp

ABSTRACT

Explaining the predictions made by complex machine learning models helps users understand and accept the predicted outputs with confidence. One promising way is similarity-based explanation, which provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners select appropriate relevance metrics and to aid further research on designing better relevance metrics for explanations.

1 INTRODUCTION

Explaining the predictions made by complex machine learning models helps users understand and accept the predicted outputs with confidence (Ribeiro et al., 2016; Lundberg & Lee, 2017; Guidotti et al., 2018; Adadi & Berrada, 2018; Molnar, 2020). Instance-based explanations are a popular type of explanation that achieve this goal by presenting one or several training instances that support the predictions of a model. Several types of instance-based explanations have been proposed, such as explaining with instances similar to the instance of interest (i.e., the test instance in question) (Charpiat et al., 2019; Barshan et al., 2020); harmful instances that degrade the performance of models (Koh & Liang, 2017; Khanna et al., 2019); counter-examples that contrast how a prediction can be changed (Wachter et al., 2018); and irregular instances (Kim et al., 2016). Among these, we focus on the first one, the type of explanation that gives one or several training instances that are similar to the test instance in question and the corresponding model predictions. We refer to this type of instance-based explanation as similarity-based explanation. A similarity-based explanation is of the form "I (the model) think this image is cat because similar images I saw in the past were also cat." This type of explanation is analogous to the way humans make decisions by referring to their prior experiences (Klein & Calderwood, 1988; Klein, 1989; Read & Cesa, 1991). Hence, it tends to be easy to understand even for users with little expertise in machine learning. A report stated that with this type of explanation, users tend to have higher confidence in model predictions compared to explanations that present contributing features (Cunningham et al., 2003).

In the instance-based explanation paradigm, including similarity-based explanation, a relevance metric $R(z, z') \in \mathbb{R}$ is typically used to quantify the relationship between two instances, $z = (x, y)$ and $z' = (x', y')$.

Definition 1 (Instance-based Explanation Using Relevance Metric).
Let $D = \{z_{\mathrm{train}}^{(i)} = (x_{\mathrm{train}}^{(i)}, y_{\mathrm{train}}^{(i)})\}_{i=1}^{N}$ be a set of training instances and $x_{\mathrm{test}}$ be a test input of interest whose predicted output is given by $\hat{y}_{\mathrm{test}} = f(x_{\mathrm{test}})$ with a predictive model $f$. An instance-based explanation method gives the most relevant training instance $z^* \in D$ to the test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ by $z^* = \arg\max_{z_{\mathrm{train}} \in D} R(z_{\mathrm{test}}, z_{\mathrm{train}})$ using a relevance metric $R(z_{\mathrm{test}}, z_{\mathrm{train}})$.

Previously proposed relevance metrics include similarity (Caruana et al., 1999), kernel functions (Kim et al., 2016; Khanna et al., 2019), and the influence function (Koh & Liang, 2017).

Table 1: The relevance metrics and their evaluation results. For the model randomization test, the results that passed the test are colored. For the identical class test and identical subclass test, the results with the five highest average evaluation scores are colored. The details of the relevance metrics, the evaluation criteria, and the evaluation procedures can be found in Sections 1.2, 3, and 4, respectively.

| Relevance Metric | Abbrv. | Model Randomization Test | Identical Class Test | Identical Subclass Test |
| --- | --- | --- | --- | --- |
| ℓ2, φ(z) = x | ℓ2^x | Failed | 0.615 ± 0.261 | 0.644 ± 0.264 |
| ℓ2, φ(z) = h_last | ℓ2^last | Passed | 0.880 ± 0.106 | 0.631 ± 0.237 |
| ℓ2, φ(z) = h_all | ℓ2^all | Failed | 0.848 ± 0.128 | 0.691 ± 0.211 |
| Cosine, φ(z) = x | cos^x | Failed | 0.669 ± 0.248 | 0.621 ± 0.242 |
| Cosine, φ(z) = h_last | cos^last | Passed | 0.888 ± 0.098 | 0.636 ± 0.234 |
| Cosine, φ(z) = h_all | cos^all | Failed | 0.871 ± 0.110 | 0.738 ± 0.166 |
| Dot, φ(z) = x | dot^x | Failed | 0.336 ± 0.187 | 0.346 ± 0.201 |
| Dot, φ(z) = h_last | dot^last | Failed | 0.579 ± 0.344 | 0.284 ± 0.122 |
| Dot, φ(z) = h_all | dot^all | Failed | 0.630 ± 0.353 | 0.488 ± 0.267 |
| Influence Function | IF | Passed | 0.372 ± 0.270 | 0.309 ± 0.174 |
| Relative IF | RIF | Passed | 0.779 ± 0.309 | 0.659 ± 0.266 |
| Fisher Kernel | FK | Passed | 0.226 ± 0.103 | 0.180 ± 0.076 |
| Grad-Dot | GD | Passed | 0.701 ± 0.287 | 0.403 ± 0.131 |
| Grad-Cos | GC | Passed | 0.996 ± 0.009 | 0.753 ± 0.196 |

An immediate critical question is which relevance metric is appropriate for which type of instance-based explanation. There is no doubt that different types of explanations require different metrics. Despite its potential importance, however, this question has been little explored. Given this background, in this study, we focused on similarity-based explanation and investigated its appropriate relevance metrics through comprehensive experiments.1

Contributions We provide the first answer to the question of which relevance metrics have desirable properties for similarity-based explanation. For this purpose, we propose to use three minimal-requirement tests to evaluate various relevance metrics in terms of their appropriateness. The first test is the model randomization test originally proposed by Adebayo et al. (2018) for evaluating saliency-based methods, and the other two tests, the identical class test and the identical subclass test, are newly designed in this study. As summarized in Table 1, our experiments revealed that (i) the cosine similarity of gradients performs best, which is probably a recommended choice for similarity-based explanation in practice, and (ii) some relevance metrics demonstrated poor performance on the identical class and identical subclass tests, indicating that their use should be deprecated for similarity-based explanation. We also analyzed the reasons behind the success and failure of the metrics. We expect these insights to help practitioners in selecting appropriate relevance metrics.
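As a concrete illustration of Definition 1, the following is a minimal sketch of retrieving the most relevant training instance under a given relevance metric, here the cosine similarity between feature vectors. It is a sketch under the stated assumptions: the feature vectors are assumed to be precomputed, and the function and variable names are ours for illustration and are not taken from the released implementation. Any metric in Table 1 can be substituted for the cosine relevance below.

```python
# Minimal sketch of Definition 1: retrieve the training instance most relevant
# to a test instance under a chosen relevance metric R(z_test, z_train).
# All names are illustrative, not taken from the paper's released code.
import numpy as np

def cosine_relevance(phi_test: np.ndarray, phi_train: np.ndarray) -> np.ndarray:
    """Cosine relevance between one test feature vector and N training feature vectors."""
    num = phi_train @ phi_test
    den = np.linalg.norm(phi_train, axis=1) * np.linalg.norm(phi_test) + 1e-12
    return num / den

def most_relevant_instance(phi_test, phi_train):
    """Return the index of arg max_{z_train in D} R(z_test, z_train) and all scores."""
    scores = cosine_relevance(phi_test, phi_train)
    return int(np.argmax(scores)), scores

# Toy usage: 100 training instances with 8-dimensional features.
rng = np.random.default_rng(0)
phi_train = rng.normal(size=(100, 8))
phi_test = rng.normal(size=8)
idx, scores = most_relevant_instance(phi_test, phi_train)
print(idx, scores[idx])
```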
1.1 PRELIMINARIES

Notations For vectors $a, b \in \mathbb{R}^p$, we denote the dot product by $\langle a, b \rangle := \sum_{i=1}^{p} a_i b_i$, the $\ell_2$ norm by $\|a\| := \sqrt{\langle a, a \rangle}$, and the cosine similarity by $\cos(a, b) := \langle a, b \rangle / (\|a\| \|b\|)$.

Classification Problem We consider a standard classification problem as the evaluation benchmark, which is the most actively explored application of instance-based explanations. The model is the conditional probability $p(y \mid x; \theta)$ with parameter $\theta$. Let $\hat{\theta}$ be a trained parameter $\hat{\theta} = \arg\min_{\theta} L_{\mathrm{train}} := \frac{1}{N} \sum_{i=1}^{N} \ell(z_{\mathrm{train}}^{(i)}; \theta)$, where the loss function $\ell$ is the cross entropy $\ell(z; \theta) = -\log p(y \mid x; \theta)$ for an input-output pair $z = (x, y)$. The model classifies a test input $x_{\mathrm{test}}$ by assigning the class with the highest probability $\hat{y}_{\mathrm{test}} = \arg\max_{y} p(y \mid x_{\mathrm{test}}; \hat{\theta})$.

1.2 RELEVANCE METRICS

We present an overview of the two types of relevance metrics considered in this study, namely similarity metrics and gradient-based metrics. To the best of our knowledge, all major relevance metrics proposed thus far can be classified under these two types. Table 1 presents a list of the metrics and their abbreviations.

Similarity Metrics We consider the following popular similarity metrics with a feature map $\phi(z)$.

ℓ2 Metric: $R_{\ell_2}(z, z') := -\|\phi(z) - \phi(z')\|_2$, which is a typical choice for nearest neighbor methods (Hastie et al., 2009; Abu Alfeilat et al., 2019).

Cosine Metric: $R_{\cos}(z, z') := \cos(\phi(z), \phi(z'))$, which is commonly used in natural language processing tasks (Mikolov et al., 2013; Arora et al., 2017; Conneau et al., 2017).

Dot Metric: $R_{\mathrm{dot}}(z, z') := \langle \phi(z), \phi(z') \rangle$, which is a kernel function used in kernel models such as SVM (Schölkopf et al., 2002; Fan et al., 2005; Bien & Tibshirani, 2011).

As the feature map $\phi(z)$, we consider (i) the identity map $\phi(z) = x$; (ii) the last hidden layer $\phi(z) = h_{\mathrm{last}}$, which is the latent representation of input $x$ one layer before the output in a deep neural network; and (iii) all hidden layers $\phi(z) = h_{\mathrm{all}}$, where $h_{\mathrm{all}} = [h_1, h_2, \ldots, h_{\mathrm{last}}]$ is the concatenation of all latent representations in the network. Note that the metrics with the identity map merely measure the similarity of inputs without model information. We adopt these metrics as naive baselines to contrast with the other, more advanced metrics that utilize model information.

Gradient-based Metrics Gradient-based metrics use a gradient $g_z^{\hat{\theta}} := \nabla_{\theta} \ell(z; \hat{\theta})$ to measure the relevance. We consider five metrics: Influence Function (IF) (Koh & Liang, 2017), Relative IF (RIF) (Barshan et al., 2020), Fisher Kernel (FK) (Khanna et al., 2019), Grad-Dot (GD) (Yeh et al., 2018; Charpiat et al., 2019), and Grad-Cos (GC) (Perronnin et al., 2010; Charpiat et al., 2019). See Appendix A for further detail.

IF: $R_{\mathrm{IF}}(z, z') := \langle g_z^{\hat{\theta}}, H^{-1} g_{z'}^{\hat{\theta}} \rangle$

RIF: $R_{\mathrm{RIF}}(z, z') := \cos(H^{-1/2} g_z^{\hat{\theta}}, H^{-1/2} g_{z'}^{\hat{\theta}})$

FK: $R_{\mathrm{FK}}(z, z') := \langle g_z^{\hat{\theta}}, I^{-1} g_{z'}^{\hat{\theta}} \rangle$

GD: $R_{\mathrm{GD}}(z, z') := \langle g_z^{\hat{\theta}}, g_{z'}^{\hat{\theta}} \rangle$

GC: $R_{\mathrm{GC}}(z, z') := \cos(g_z^{\hat{\theta}}, g_{z'}^{\hat{\theta}})$

where $H$ and $I$ are the Hessian and Fisher information matrices of the loss $L_{\mathrm{train}}$, respectively.

1 Our implementation is available at https://github.com/k-hanawa/criteria_for_instance_based_explanation

2 RELATED WORK

Model-specific Explanation Aside from the relevance metrics, there is another approach to similarity-based explanation that uses specific models designed to provide explanations (Kim et al., 2014; Plötz & Roth, 2018; Chen et al., 2019). We set aside these specific models and focus on generic relevance metrics because of their applicability to a wide range of problems.
Evaluation of Metrics for Improving Classification Accuracy In several machine learning problems, the metrics between instances play an essential role. For example, the distance between instances is essential for distance-based methods such as nearest neighbor methods (Hastie et al., 2009). Another example is kernel models, where the kernel function represents the relationship between two instances (Schölkopf et al., 2002). Several studies have evaluated the desirable metrics for specific tasks (Hussain et al., 2011; Hu et al., 2016; Li & Li, 2018; Abu Alfeilat et al., 2019). These studies aimed to find metrics that could improve the classification accuracy. Different from these accuracy-based evaluations, our goal in this study is to evaluate the validity of relevance metrics for similarity-based explanation; thus, the findings of these previous studies are not directly applicable to our goal.

Evaluation of Explanations A variety of desiderata have been argued as requirements for explanations, such as faithfulness (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020), plausibility (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019), robustness (Alvarez-Melis & Jaakkola, 2018), and readability (Wang & Rudin, 2015; Yang et al., 2017; Angelino et al., 2017). It is important to evaluate the existing explanation methods considering these requirements. However, there is no standard test established for evaluating these requirements, and designing such tests still remains an open problem (Doshi-Velez & Kim, 2017; Jacovi & Goldberg, 2020). In this study, as the first empirical study evaluating the existing relevance metrics for similarity-based explanation, we take an alternative approach by designing minimal-requirement tests for two primary requirements, namely faithfulness and plausibility. With this alternative approach, we can avoid the difficulty of directly evaluating these primary requirements.

3 EVALUATION CRITERIA FOR SIMILARITY-BASED EXPLANATION

This study aims to investigate the relevance metrics with desirable properties for similarity-based explanation. In this section, we propose three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. If a relevance metric fails one of the tests, we can conclude that the metric does not meet the minimal requirements; thus, its use would be deprecated. The first test (the model randomization test) assesses whether each relevance metric satisfies the minimal requirement for the faithfulness of explanation, which requires that an explanation of a model prediction reflect the underlying inference process (Adebayo et al., 2018; Lakkaraju et al., 2019; Jacovi & Goldberg, 2020). The latter two tests (the identical class and identical subclass tests) are designed to assess relevance metrics in terms of the plausibility of the explanations they produce (Lei et al., 2016; Lage et al., 2019; Strout et al., 2019), which requires explanations to be sufficiently convincing to users.

3.1 MODEL RANDOMIZATION TEST

Explanations that are irrelevant to a model should be avoided because such fake explanations can mislead users. Thus, any valid relevance metric should be model-dependent, which constitutes the first requirement. We use the model randomization test of Adebayo et al. (2018) to assess whether a given relevance metric satisfies this minimal requirement for faithfulness.
If a relevance metric produces almost the same explanations for the same inputs on two models with different inference processes, it is likely to ignore the underlying model, i.e., the metric is independent of the model. Thus, we can evaluate whether a metric is model-dependent by comparing explanations from two different models. In the test, a typical choice of the two models is a well-trained model that can predict the output well and a randomly initialized model that can make only poor predictions. These two models have different inference processes; hence, their explanations should be different.

Definition 2 (Model Randomization Test). Let $R$ denote the relevance metric of interest. Let $f$ and $f_{\mathrm{rand}}$ be a well-trained model and a randomly initialized model, respectively. For given $R$, $f$, and a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$, let $\pi_f$ be a permutation of the indices of the training instances based on the degree of relevance to the given test instance, i.e., $R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(1))}) \ge R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(2))}) \ge \cdots \ge R(z_{\mathrm{test}}, z_{\mathrm{train}}^{(\pi_f(N))})$. We define $\pi_{f_{\mathrm{rand}}}$ accordingly. Then, we require $\pi_f$ and $\pi_{f_{\mathrm{rand}}}$ to have a small rank correlation.

If relevance metric $R$ is independent of the model, it produces the same permutation for both $f$ and $f_{\mathrm{rand}}$, and their rank correlation becomes one. If the rank correlation is significantly smaller than one and close to zero, we can confirm that the relevance metric is model-dependent.
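The following is a minimal sketch of the comparison in Definition 2, assuming that the relevance scores of all training instances for one test instance have already been computed under both the trained model and the randomized model; the function and variable names are illustrative, and the synthetic scores below stand in for real metric outputs.

```python
# Minimal sketch of the model randomization test (Definition 2), assuming the
# relevance scores of every training instance for one test instance have been
# computed under (i) the trained model f and (ii) a randomly initialized model
# f_rand. Names and data are illustrative.
import numpy as np
from scipy.stats import spearmanr

def model_randomization_correlation(scores_trained: np.ndarray,
                                    scores_random: np.ndarray) -> float:
    """Spearman rank correlation between the two relevance orderings.
    Values close to zero indicate a model-dependent relevance metric."""
    rho, _ = spearmanr(scores_trained, scores_random)
    return float(rho)

# Toy usage with synthetic relevance scores over N = 1000 training instances.
rng = np.random.default_rng(0)
scores_trained = rng.normal(size=1000)
scores_random = rng.normal(size=1000)        # unrelated scores -> rho near 0
print(model_randomization_correlation(scores_trained, scores_random))
print(model_randomization_correlation(scores_trained, scores_trained))  # rho = 1
```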
3.2 IDENTICAL CLASS TEST

The second minimal requirement is that the raised similar instance should belong to the same class as the test instance, as shown in Figure 1. The violation of this requirement leads to nonsensical explanations such as "I think this image is cat because a similar image I saw in the past was dog." in Figure 1. When users encounter such explanations, they might question the validity of the model predictions and ignore the predictions even if the underlying model is valid. This observation leads to the identical class test below.

Definition 3 (Identical Class Test). We require that the most similar (relevant) instance of a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ is a training instance of the same class as the given test instance.
$$\arg\max_{z = (x, y) \in D} R(z_{\mathrm{test}}, z) = \tilde{z} = (\tilde{x}, \tilde{y}) \;\Longrightarrow\; \tilde{y} = \hat{y}_{\mathrm{test}}. \quad (1)$$

Although this test may look trivial, some relevance metrics do not satisfy this minimal requirement, as demonstrated in Section 4.2.

Figure 1: Valid and invalid examples for the identical class test.

Figure 2: Valid and invalid examples for the identical subclass test.

3.3 IDENTICAL SUBCLASS TEST

The third minimal requirement is that the raised similar instance should belong to the same subclass as that of the test instance when the classes consist of latent subclasses, as shown in Figure 2. For example, consider the problem of classifying images of CIFAR10 into two classes, i.e., animal and vehicle. The animal class consists of images from subclasses such as cat and frog, while the vehicle class consists of images from subclasses such as airplane and automobile. Under the presence of subclasses, the violation of this requirement leads to nonsensical explanations such as "I think this image (cat) is animal because a similar image (frog) I saw in the past was also animal." in Figure 2. This observation leads to the identical subclass test below.

Definition 4 (Identical Subclass Test). Let $s(z)$ denote the subclass, within its class $y$, of an instance $z = (x, y)$. We require that the most similar (relevant) instance of a test instance $z_{\mathrm{test}} = (x_{\mathrm{test}}, \hat{y}_{\mathrm{test}})$ is a training instance of the same subclass as the test instance, under the assumption that the prediction of the test instance is correct, $\hat{y}_{\mathrm{test}} = y_{\mathrm{test}}$.2
$$\arg\max_{z \in D} R(z_{\mathrm{test}}, z) = \tilde{z} \;\Longrightarrow\; s(\tilde{z}) = s(z_{\mathrm{test}}). \quad (2)$$

2 We require correct predictions in this test because the subclass does not match in incorrect cases.

In the experiments, we used modified datasets: we split each dataset into two new classes (A and B) by randomly assigning the existing classes to either of the two. The two new classes now contain the original classes as subclasses that are mutually exclusive and collectively exhaustive, which can be used for the identical subclass test.

3.4 DISCUSSIONS ON VALIDITY OF CRITERIA

Here, we discuss the validity of the new criteria, i.e., the identical class and identical subclass tests.

Why do relevance metrics that cannot pass these tests matter? Dietvorst et al. (2015) revealed a bias in humans, called algorithm aversion, which states that people tend to ignore an algorithm once they have seen it make errors. It should be noted that explanations that do not satisfy the identical class test or identical subclass test appear to be logically broken, as shown in Figures 1 and 2. Given such logically broken explanations, users will consider that the models are making errors, even if they are making accurate predictions. Eventually, the users will start to ignore the models.

Is the identical subclass test necessary? This is an essential requirement for ensuring that the explanations are plausible to any user. Some users may not consider the explanations that violate the identical subclass test to be logically broken. For example, some users may find a frog to be an appropriate explanation for a cat being animal by inferring a taxonomy of the classes (e.g., both have eyes). However, we cannot expect all users to infer the same taxonomy. Therefore, if there is a discrepancy between the explanation and the taxonomy inferred by a user, the user will consider the explanation to be implausible. To make explanations plausible to any user, instances of the same subclass need to be provided.

Is random class assignment in the identical subclass test appropriate? We adopted random assignment to evaluate the performance of each metric independently of the underlying taxonomy. If a specific taxonomy were considered in the evaluations, a metric that performed well with it would be highly valued. Random assignment eliminates such effects, and we can purely measure the performance of the metrics themselves.

Do classification models actually recognize subclasses? Is the identical subclass test suitable for evaluating the explanations of predictions made by practical models? It is true that if a model ignores subclasses in its training and inference processes, any explanation will fail the test. We conducted simple preliminary experiments and confirmed that the practical classification models used in this study capture the subclasses. See Appendix E for further detail.
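Before turning to the evaluation, the following is a minimal sketch of how the success rates for the identical class and identical subclass tests (Definitions 3 and 4) can be computed once the most relevant training instance has been retrieved for every test instance. It is only a sketch under that assumption; all names are illustrative.

```python
# Minimal sketch of the success-rate computation for the identical class and
# identical subclass tests (Definitions 3 and 4). It assumes the most relevant
# training instance has already been retrieved for each test instance.
import numpy as np

def identical_class_success_rate(y_pred_test, y_retrieved_train):
    """Fraction of test instances whose most relevant training instance
    shares the predicted class (Definition 3)."""
    return float(np.mean(np.asarray(y_pred_test) == np.asarray(y_retrieved_train)))

def identical_subclass_success_rate(y_pred_test, y_true_test,
                                    sub_test, sub_retrieved_train):
    """Fraction of correctly predicted test instances whose most relevant
    training instance shares the latent subclass (Definition 4)."""
    correct = np.asarray(y_pred_test) == np.asarray(y_true_test)
    match = np.asarray(sub_test) == np.asarray(sub_retrieved_train)
    return float(np.mean(match[correct]))

# Toy usage with a handful of test instances.
print(identical_class_success_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))      # 0.8
print(identical_subclass_success_rate([0, 1, 1], [0, 1, 0],
                                      ["cat", "frog", "car"],
                                      ["cat", "deer", "car"]))             # 0.5
```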
4 EVALUATION RESULTS

Here, we examine the validity of the relevance metrics with respect to the three minimal requirements. For this evaluation, we used two image datasets (MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009)), two text datasets (TREC (Li & Roth, 2002), AGNews (Zhang et al., 2015)), and two tabular datasets (Vehicle (Dua & Graff, 2017), Segment (Dua & Graff, 2017)). As benchmarks, we employed logistic regression and deep neural networks trained on these datasets. Details of the datasets, models, and computing infrastructure used in this study are provided in Appendix B.

Procedure We repeated the following procedure 10 times for each evaluation test.

1. Train a model using a subset of the training instances.3 Then, randomly sample 500 test instances from the test set.4
2. For each test instance, compute the relevance score for all instances used for training.
3. (a) For the model randomization test, compute the Spearman rank correlation coefficient between the relevance scores from the trained model and those from the randomized model. (b) For the identical class and identical subclass tests, compute the success rate, which is the ratio of test instances that passed the test.

3 We randomly sampled 10% of MNIST and CIFAR10; 50% of TREC, Vehicle, and Segment; and 5% of AGNews.
4 For the identical subclass test, we sampled instances with correct predictions only.

In this section, we mainly present the results for CIFAR10 with CNN and AGNews with Bi-LSTM. The other results were similar and can be found in Appendix F.

Result Summary We summarize the main results before discussing the individual results.

- ℓ2^last, cos^last, and the gradient-based metrics scored low correlations in the model randomization test for all datasets and models, indicating that they are model-dependent.
- GC performed the best in most of the identical class and identical subclass tests; thus, GC would be the recommended choice in practice.
- The dot metrics as well as IF, FK, and GD performed poorly on the identical class test and identical subclass test. In Section 5, we analyze why some relevance metrics succeed or fail in these two tests.

4.1 RESULT OF MODEL RANDOMIZATION TEST

Figure 3 shows the Spearman rank correlation coefficients for the model randomization test. The similarities with the identity feature map, ℓ2^x, cos^x, and dot^x, are irrelevant to the model, and their correlations are trivially one. In the figures, the other metrics scored correlations close to zero, indicating that they are model-dependent. However, the correlations of ℓ2^all, cos^all, and dot^last were observed to be more than 0.7 on the MNIST and Vehicle datasets (see Appendix F). Therefore, we conclude that these relevance metrics failed the model randomization test because they can raise instances irrelevant to the model for some datasets.

Figure 3: Result of the model randomization test on (a) CIFAR10 with CNN and (b) AGNews with Bi-LSTM (average correlation ± std. for the ℓ2, cosine, dot, and gradient-based metrics). Correlations close to zero are ideal.

Figure 4: Results of the identical class test and identical subclass test on (a) CIFAR10 with CNN and (b) AGNews with Bi-LSTM (average success rate ± std.).

4.2 RESULTS OF IDENTICAL CLASS AND IDENTICAL SUBCLASS TESTS

Figure 4 depicts the success rates for the identical class and identical subclass tests. We also summarize the average success rates of our experiments in Table 1. It is noteworthy that GC performed consistently well on the identical class and identical subclass tests for all the datasets and models used in the experiments (see Appendix F).
In contrast, some relevance metrics such as the dot metrics as well as IF, FK, and GD performed poorly on both tests. The reasons for their failure are discussed in the next section.

To conclude, the results of our evaluations indicate that only GC performed well on all tests. That is, only GC seems to meet the minimal requirements; thus, it would be a recommended choice for similarity-based explanation.

5 WHY SOME METRICS ARE SUCCESSFUL AND WHY SOME ARE NOT

We observed that the dot metrics and gradient-based metrics such as IF, FK, and GD failed the identical class and identical subclass tests, in contrast to GC, which exhibited remarkable performance. Here, we analyze the reasons why the aforementioned metrics failed while GC performed well. In Appendix D, we also discuss a way to repair IF, FK, and GD to improve their performance based on the findings in this section.

Figure 5: Distributions of norms of the feature maps of all training instances (colored) and the instances selected by the identical class test (meshed) on CIFAR10 with CNN.

Figure 6: Training instances frequently selected in the identical class test with multiple test instances on CIFAR10 with CNN, the cosine between them, and the norms of the training instances.

Failure of Dot Metrics and Gradient-based Metrics To understand the failure, we reformulate IF, FK, and GD as dot metrics of the form $R_{\mathrm{dot}}(z_{\mathrm{test}}, z_{\mathrm{train}}) = \langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}) \rangle$ so that the following discussion is valid for any relevance metric of this form. It is evident that IF, FK, and GD can be expressed in this form by defining the feature maps as $\phi(z) = H^{-1/2} g_z^{\hat{\theta}}$, $\phi(z) = I^{-1/2} g_z^{\hat{\theta}}$, and $\phi(z) = g_z^{\hat{\theta}}$, respectively. Given a criterion, let $z_{\mathrm{train}}^{(i)}$ be a desirable instance for a test instance $z_{\mathrm{test}}$. The failures of the dot metrics indicate the existence of an undesirable instance $z_{\mathrm{train}}^{(j)}$ such that $\langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(i)}) \rangle < \langle \phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(j)}) \rangle$. The following sufficient condition for $z_{\mathrm{train}}^{(j)}$ is useful to understand the failure:
$$\|\phi(z_{\mathrm{train}}^{(i)})\| < \|\phi(z_{\mathrm{train}}^{(j)})\| \cos(\phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}^{(j)})). \quad (3)$$
The condition implies that any instance with an extremely large norm and a cosine only slightly larger than zero can be a candidate for $z_{\mathrm{train}}^{(j)}$. In our experiments, we observed that the condition on the norm is especially crucial. As shown in Figure 5, even though instances with significantly large norms were scarce, only such extreme instances were selected as relevant instances by IF, FK, and GD. This indicates that these metrics tend to consider such extreme instances as relevant. In contrast, GC was not attracted by large norms because it completely cancels the norm through normalization. Figure 6 shows some training instances frequently selected in the identical class test on CIFAR10 with CNN. When using IF, FK, and GD, these training instances were frequently selected irrespective of their classes because the training instances had large norms. In these metrics, the term $\cos(\phi(z_{\mathrm{test}}), \phi(z_{\mathrm{train}}))$ seems to have a negligible effect. In contrast, GC successfully selected instances of the same class and ignored those with large norms.
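The following toy example illustrates condition (3) with synthetic vectors that are not taken from our experiments: a weakly aligned feature vector with a very large norm outranks a well-aligned one under the dot product, whereas the cosine, which normalizes the norms away, is unaffected.

```python
# Toy numerical illustration of condition (3), assuming gradient feature maps
# phi(z); the vectors below are synthetic, not taken from the experiments.
import numpy as np

phi_test = np.array([1.0, 0.0])                 # test-instance feature vector
phi_same = np.array([0.9, 0.1])                 # well-aligned, moderate norm
phi_outlier = np.array([5.0, 50.0])             # weakly aligned, very large norm

def dot_rel(a, b):                              # dot-product relevance (IF/FK/GD form)
    return float(a @ b)

def cos_rel(a, b):                              # cosine relevance (GC form)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The large-norm outlier wins under the dot product ...
print(dot_rel(phi_test, phi_same), dot_rel(phi_test, phi_outlier))   # 0.9 < 5.0
# ... but not under the cosine, which cancels the norms through normalization.
print(cos_rel(phi_test, phi_same), cos_rel(phi_test, phi_outlier))   # ~0.994 > ~0.0995
```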
Success of GC We now analyze why GC performed well, specifically in the identical class test. To simplify the discussion, we consider linear logistic regression whose conditional distribution $p(y \mid x; \theta)$ is given by the $y$-th entry of $\sigma(Wx)$, where $\sigma$ is the softmax function, $\theta = W \in \mathbb{R}^{C \times d}$, and $C$ and $d$ denote the number of classes and the dimensionality of $x$, respectively. With some algebra, we obtain $R_{\mathrm{GC}}(z, z') = \cos(r_z, r_{z'}) \cos(x, x')$ for $z = (x, y)$ and $z' = (x', y')$, where $r_z = \sigma(Wx) - e_y$ is the residual of the prediction on $z$ and $e_y$ is a vector whose $y$-th entry is one and zero otherwise. See Appendix C for the derivation. Here, the term $\cos(r_z, r_{z'})$ plays an essential role in GC. By definition, $(r_z)_c \ge 0$ if $c \ne y$ and $(r_z)_c \le 0$ otherwise. Thus, $\cos(r_z, r_{z'}) \ge 0$ always holds when $y = y'$, while $\cos(r_z, r_{z'})$ can be negative when $y \ne y'$. Hence, the chance of $R_{\mathrm{GC}}(z, z')$ being positive is larger for instances from the same class than for those from a different class.

Figure 7 shows that $\cos(r_z, r_{z'})$ is essential also for deep neural networks. Here, for each test instance $z_{\mathrm{test}}$ on CIFAR10 with CNN, we randomly sampled two training instances $z_{\mathrm{train}}$ (one with the same class and the other with a different class), and computed $R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}})$ and $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$.

Figure 7: Distributions of $R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}})$ and $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ for training instances with the same / different classes on CIFAR10 with CNN.

We also note that $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ alone was not helpful for the identical subclass test, whose success rate was around the chance level. We thus conjecture that while $\cos(r_{z_{\mathrm{test}}}, r_{z_{\mathrm{train}}})$ is particularly helpful for the identical class test, the use of the entire gradient is still essential for GC to work effectively.

6 CONCLUSION

We investigated and determined relevance metrics that are effective for similarity-based explanation. For this purpose, we evaluated whether the metrics satisfy the minimal requirements for similarity-based explanation. In this study, we conducted three tests, namely, the model randomization test of Adebayo et al. (2018) to evaluate whether the metrics are model-dependent, and two newly designed tests, the identical class and identical subclass tests, to evaluate whether the metrics can provide plausible explanations. Quantitative evaluations based on these tests revealed that the cosine similarity of gradients performs best, which would be a recommended choice in practice. We also observed that some relevance metrics do not meet the requirements; thus, the use of such metrics would not be appropriate for similarity-based explanation. We expect our insights to help practitioners in selecting appropriate relevance metrics, and also to help further research on designing better relevance metrics for instance-based explanations.

Finally, we present two future directions for this study. First, the proposed criteria evaluated only limited aspects of the faithfulness and plausibility of relevance metrics. Thus, it is important to investigate further criteria for more detailed evaluations. Second, in addition to similarity-based explanation, it is necessary to consider the evaluation of other explanation methods, such as counter-examples.
We expect this study to be the first step toward the rigorous evaluation of several instance-based explanation methods.

ACKNOWLEDGMENTS

We thank Dr. Ryo Karakida and Dr. Takanori Maehara for their helpful advice. We also thank the Overfit Summer Seminar5 for an opportunity that inspired this research. Additionally, we are grateful to our laboratory members for their helpful comments. Sho Yokoi was supported by JST, ACT-X Grant Number JPMJAX200S, Japan. Satoshi Hara was supported by JSPS KAKENHI Grant Number 20K19860, and JST, PRESTO Grant Number JPMJPR20C8, Japan.

5 https://sites.google.com/view/mimaizumi/event/mlcamp2018

REFERENCES

Haneen Arafat Abu Alfeilat, Ahmad B.A. Hassanat, Omar Lasassmeh, Ahmad S. Tarawneh, Mahmoud Bashir Alhasanat, Hamzeh S. Eyal Salman, and V.B. Surya Prasath. Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data, 7(4):221–248, 2019.

Amina Adadi and Mohammed Berrada. Peeking Inside the Black-box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6:52138–52160, 2018.

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity Checks for Saliency Maps. In Advances in Neural Information Processing Systems 31, pp. 9505–9515, 2018.

David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.

Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. RelatIF: Identifying Explanatory Training Samples via Relative Influence. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 1899–1909, 2020.

Jacob Bien and Robert Tibshirani. Prototype Selection for Interpretable Classification. Annals of Applied Statistics, 5(4):2403–2424, 2011.

Rich Caruana, Hooshang Kangarloo, John David N. Dionisio, Usha Sinha, and David Johnson. Case-Based Explanation of Non-Case-Based Learning Methods. In Proceedings of the AMIA Symposium, pp. 212–215, 1999.

Guillaume Charpiat, Nicolas Girard, Loris Felardos, and Yuliya Tarabalka. Input Similarity from the Neural Network Perspective. In Advances in Neural Information Processing Systems 32, pp. 5342–5351, 2019.

Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K Su. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Advances in Neural Information Processing Systems 32, pp. 8930–8941, 2019.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680, 2017.

Pádraig Cunningham, Dónal Doyle, and John Loughrey. An Evaluation of the Usefulness of Case-Based Explanation. In International Conference on Case-Based Reasoning, pp. 122–130. Springer, 2003.

Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err.
Journal of Experimental Psychology: General, 144(1):114–126, 2015.

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin. Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889–1918, 2005.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5):1–42, 2018.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

Li-Yu Hu, Min-Wei Huang, Shih-Wen Ke, and Chih-Fong Tsai. The Distance Function Effect on k-Nearest Neighbor Classification for Medical Datasets. SpringerPlus, 5(1):1304, 2016.

Muhammad Hussain, Summrina Kanwal Wajid, Ali Elzaart, and Mohammed Berbar. A Comparison of SVM Kernel Functions for Breast Cancer Detection. In Proceedings of the 8th International Conference on Computer Graphics, Imaging and Visualization, pp. 145–150, 2011.

Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205, 2020.

Rajiv Khanna, Been Kim, Joydeep Ghosh, and Sanmi Koyejo. Interpreting Black Box Predictions using Fisher Kernels. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89, pp. 3382–3390, 2019.

Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification. In Advances in Neural Information Processing Systems 27, pp. 1952–1960, 2014.

Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples Are Not Enough, Learn to Criticize! Criticism for Interpretability. In Advances in Neural Information Processing Systems 29, pp. 2280–2288, 2016.

Gary A Klein. Strategies of Decision Making. Technical report, 1989.

Gary A Klein and Roberta Calderwood. How Do People Use Analogues to Make Decisions? In Proceedings of the DARPA Workshop on Case-Based Reasoning, 1988, pp. 209–223, 1988.

Pang Wei Koh and Percy Liang. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894, 2017.

Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.

Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. An Evaluation of the Human-Interpretability of Explanation. 2019. URL http://arxiv.org/abs/1902.00006.

Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 131–138, 2019.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing Neural Predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 107–117, 2016.
Xin Li and Dan Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, 2002.

Zhou Li and Chunxiang Li. Selection of Kernel Function for Least Squares Support Vector Machines in Downburst Wind Speed Forecasting. In Proceedings of the 11th International Symposium on Computational Intelligence and Design, volume 2, pp. 337–341, 2018.

Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, pp. 4765–4774, 2017.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.

Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020.

Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-Scale Image Retrieval With Compressed Fisher Vectors. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3384–3391, 2010.

Tobias Plötz and Stefan Roth. Neural Nearest Neighbors Networks. In Advances in Neural Information Processing Systems 31, pp. 1087–1098, 2018.

Stephen J Read and Ian L Cesa. This Reminds Me of the Time When...: Expectation Failures in Reminding and Explanation. Journal of Experimental Social Psychology, 27(1):1–25, 1991.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Bernhard Schölkopf, Alexander J Smola, and Francis Bach. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Julia Strout, Ye Zhang, and Raymond J. Mooney. Do Human Rationales Improve Machine Explanations? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 56–62, 2019.

Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 31(2):841–887, 2018.

Fulton Wang and Cynthia Rudin. Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013–1022, 2015.

Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable Bayesian rule lists. In International Conference on Machine Learning, pp. 3921–3930, 2017.

Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K Ravikumar. Representer Point Selection for Explaining Deep Neural Networks. In Advances in Neural Information Processing Systems 31, pp. 9291–9301, 2018.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pp. 649–657, 2015.
A GRADIENT-BASED METRICS

In gradient-based metrics, we consider a model with parameter $\theta$, its loss $\ell(z; \theta)$, and its gradient $\nabla_{\theta} \ell(z; \theta)$ to measure relevance, where $z = (x, y)$ is an input-output pair.

Influence Function (Koh & Liang, 2017) Koh & Liang (2017) proposed to measure relevance according to how much the test loss would increase if the training instance were omitted from the training set. Here, the model parameter trained using all of the training set is denoted by $\hat{\theta}$, and the parameter trained using all of the training set except the $i$-th instance $z_{\mathrm{train}}^{(i)}$ is denoted by $\hat{\theta}_{-i}$. The relevance metric proposed by Koh & Liang (2017) is then defined as the difference between the test losses under the parameters $\hat{\theta}$ and $\hat{\theta}_{-i}$ as follows:
$$R_{\mathrm{IF}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) := \ell(z_{\mathrm{test}}; \hat{\theta}_{-i}) - \ell(z_{\mathrm{test}}; \hat{\theta}). \quad (4)$$
Here, a greater value indicates that the loss on the test instance increases drastically when the $i$-th training instance is removed from the training set. Thus, the $i$-th training instance is essential for predicting the test instance; therefore, it is highly relevant. In practice, the following approximation is used to avoid computing $\hat{\theta}_{-i}$ explicitly:
$$R_{\mathrm{IF}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) \approx \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), H^{-1} \nabla_{\theta} \ell(z_{\mathrm{train}}^{(i)}; \hat{\theta}) \rangle, \quad (5)$$
where $H$ is the Hessian matrix of the loss $L_{\mathrm{train}}$.

Relative IF (Barshan et al., 2020) Barshan et al. (2020) proposed to measure relevance according to how much the test loss would increase if the training instance were omitted from the training set, under the constraint that the expected squared change in loss is sufficiently small,6 which is a modified version of the influence function. Relative IF is computed as the cosine similarity of $\phi(z) = H^{-1/2} \nabla_{\theta} \ell(z; \hat{\theta})$:
$$R_{\mathrm{RIF}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \cos(H^{-1/2} \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), H^{-1/2} \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta})). \quad (6)$$

6 This metric is called ℓ-RelatIF by Barshan et al. (2020).

Fisher Kernel (Khanna et al., 2019) Khanna et al. (2019) proposed to measure the relevance of instances using the Fisher kernel as follows:
$$R_{\mathrm{FK}}(z_{\mathrm{test}}, z_{\mathrm{train}}^{(i)}) := \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), I^{-1} \nabla_{\theta} \ell(z_{\mathrm{train}}^{(i)}; \hat{\theta}) \rangle, \quad (7)$$
where $I$ is the Fisher information matrix of the loss $L_{\mathrm{train}}$.

Grad-Dot, Grad-Cos (Perronnin et al., 2010; Yeh et al., 2018; Charpiat et al., 2019) Charpiat et al. (2019) proposed to measure relevance according to how much the loss would decrease when a small update is applied to the model using the training instance. This can be computed as the dot product of the loss gradients, which we refer to as Grad-Dot:
$$R_{\mathrm{GD}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \langle \nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta}) \rangle. \quad (8)$$
Note that a similar metric is studied by Yeh et al. (2018) as the representer point value. As a modification of Grad-Dot, Charpiat et al. (2019) also proposed the following cosine version, which we refer to as Grad-Cos:
$$R_{\mathrm{GC}}(z_{\mathrm{test}}, z_{\mathrm{train}}) := \cos(\nabla_{\theta} \ell(z_{\mathrm{test}}; \hat{\theta}), \nabla_{\theta} \ell(z_{\mathrm{train}}; \hat{\theta})). \quad (9)$$
Note that the use of the cosine between the gradients is also proposed by Perronnin et al. (2010).
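The following sketch summarizes the five gradient-based metrics above. It assumes that the loss gradients and (positive definite) Hessian and Fisher information matrices have already been obtained, e.g., with an automatic differentiation library, and it uses explicit matrix solves for clarity, whereas the experiments rely on approximations such as the conjugate gradient method of Koh & Liang (2017). All names and the toy data are illustrative.

```python
# Minimal sketch of the gradient-based relevance metrics (Eqs. (5)-(9)),
# assuming precomputed loss gradients g_test, g_train and SPD matrices H and I.
import numpy as np

def influence(g_test, g_train, H):
    return float(g_test @ np.linalg.solve(H, g_train))              # Eq. (5)

def relative_if(g_test, g_train, H):
    # Cosine of H^{-1/2} g, with H^{-1/2} obtained from the eigendecomposition.
    w, V = np.linalg.eigh(H)
    H_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    a, b = H_inv_sqrt @ g_test, H_inv_sqrt @ g_train
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))    # Eq. (6)

def fisher_kernel(g_test, g_train, I):
    return float(g_test @ np.linalg.solve(I, g_train))              # Eq. (7)

def grad_dot(g_test, g_train):
    return float(g_test @ g_train)                                   # Eq. (8)

def grad_cos(g_test, g_train):
    return float(g_test @ g_train /
                 (np.linalg.norm(g_test) * np.linalg.norm(g_train)))  # Eq. (9)

# Toy usage with a 3-dimensional parameter space.
rng = np.random.default_rng(0)
g_test, g_train = rng.normal(size=3), rng.normal(size=3)
A = rng.normal(size=(3, 3))
H = A @ A.T + np.eye(3)          # symmetric positive definite stand-in for the Hessian
I = np.diag([1.0, 2.0, 3.0])     # stand-in for the Fisher information matrix
print(influence(g_test, g_train, H), relative_if(g_test, g_train, H),
      fisher_kernel(g_test, g_train, I), grad_dot(g_test, g_train),
      grad_cos(g_test, g_train))
```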
B EXPERIMENTAL SETUP

B.1 DATASETS AND MODELS

MNIST (LeCun et al., 1998) The MNIST dataset is used for handwritten digit image classification tasks. Here, input x is an image of a handwritten digit, and the output y consists of 10 classes ('0' to '9'). We adopted logistic regression and a CNN as the classification models. The CNN has six convolutional layers, with a max-pooling layer after every two convolutional layers. The features obtained by these layers are fed into a global average pooling layer followed by a single linear layer. The number of output channels of all the convolutional layers is set to 16. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 5,500 randomly sampled training instances to train the models.

CIFAR10 (Krizhevsky, 2009) The CIFAR10 dataset is used for object recognition tasks. Here, input x is an image containing a certain object, and output y consists of 10 classes, e.g., bird or airplane. Note that we used the same models as for the MNIST dataset. In addition, we adopted MobileNetV2 (Sandler et al., 2018) as a model with a higher performance than the previous models. We trained the models using the Adam optimizer with a learning rate of 0.001. In the experiments, we first pre-trained the models using all the training instances of CIFAR10, and then trained the models using 5,000 randomly sampled training instances. Without the pre-training, the classification performance of the models dropped significantly. Note that we did not examine IF and FK on MobileNetV2 because the matrix inverse in these metrics required too much time to compute even with the conjugate gradient approximation proposed by Koh & Liang (2017).

TREC (Li & Roth, 2002) The TREC dataset is used for question classification tasks. Here, input x is a question sentence, and output y is a question category consisting of six classes, e.g., LOC and NUM. We used bag-of-words logistic regression and a two-layer Bi-LSTM as the classification models. In the Bi-LSTM, the last state is fed into one linear layer. The word embedding dimension is set to 16, and the dimension of the LSTM is set to 16 as well. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 2,726 randomly sampled training instances to train the models.

AGNews (Zhang et al., 2015) The AGNews dataset is used for news article classification tasks. Here, input x is a sentence, and output y is a category comprising four classes, e.g., business and sports. We used the same models as for TREC. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 6,000 randomly sampled training instances to train the models.

Vehicle (Dua & Graff, 2017) The Vehicle dataset is used for vehicle type classification tasks. Here, the input x consists of 18 features, and the output y is a type of vehicle comprising four classes, e.g., bus and van. We used logistic regression and a three-layer MLP as the classification models. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 423 randomly sampled training instances to train the models.

Segment (Dua & Graff, 2017) The Segment dataset is used for image classification tasks. Here, the input x consists of 19 features, and the output y consists of seven classes, e.g., sky and window. We used the same models as for Vehicle. We trained the models using the Adam optimizer with a learning rate of 0.001. We used 924 randomly sampled training instances to train the models.

B.2 COMPUTING INFRASTRUCTURE

In our experiments, training of the models was run on an NVIDIA GTX 1080 GPU with an Intel Xeon Silver 4112 CPU and 64GB RAM. Testing and computing the relevance metrics were run on a Xeon E5-2680 v2 CPU with 256GB RAM.

C DERIVATION OF GC FOR LINEAR LOGISTIC REGRESSION

We consider linear logistic regression whose conditional distribution $p(y \mid x; \theta)$ is given by the $y$-th entry of $\sigma(Wx)$, where $\sigma$ is the softmax function, $\theta = W \in \mathbb{R}^{C \times d}$, and $C$ and $d$ are the number of classes and the dimensionality of $x$, respectively. Recall that the cross entropy loss for linear logistic regression is given as
$$\ell(z; \theta) = -\sum_{c=1}^{C} y_c \langle w_c, x \rangle + \log \sum_{c'=1}^{C} \exp(\langle w_{c'}, x \rangle), \quad (10)$$
where $W = [w_1, w_2, \ldots, w_C]^{\top}$. Let $e_y$ be a vector whose $y$-th entry is one and zero otherwise. Then, the gradient of the loss with respect to $w_c$ can be expressed as
$$\nabla_{w_c} \ell(z; \theta) = (\sigma(Wx) - e_y)_c \, x = (r_z)_c \, x, \quad (11)$$
where $r_z = \sigma(Wx) - e_y$ is the residual of the prediction on $z$. Hence, we have
$$\langle \nabla_{\theta} \ell(z; \theta), \nabla_{\theta} \ell(z'; \theta) \rangle = \sum_{c=1}^{C} \langle \nabla_{w_c} \ell(z; \theta), \nabla_{w_c} \ell(z'; \theta) \rangle \quad (12)$$
$$= \sum_{c=1}^{C} (r_z)_c (r_{z'})_c \langle x, x' \rangle \quad (13)$$
$$= \langle r_z, r_{z'} \rangle \langle x, x' \rangle, \quad (14)$$
which yields
$$R_{\mathrm{GC}}(z, z') = \frac{\langle r_z, r_{z'} \rangle \langle x, x' \rangle}{\|r_z\| \|x\| \, \|r_{z'}\| \|x'\|} \quad (15)$$
$$= \cos(r_z, r_{z'}) \cos(x, x'). \quad (16)$$
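As a numerical sanity check of the derivation above, the following toy sketch verifies Eq. (16) for a randomly drawn logistic-regression parameter matrix and two synthetic instances; all names and values are illustrative and not taken from the experiments.

```python
# Toy numerical check of Eq. (16): for linear logistic regression, the Grad-Cos
# relevance factorizes as cos(r_z, r_z') * cos(x, x'). Synthetic model and data.
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 6                                   # number of classes, input dimension
W = rng.normal(size=(C, d))                   # logistic-regression parameters

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def grad_loss(x, y):
    """Gradient of the cross-entropy loss w.r.t. W, flattened (Eq. (11))."""
    r = softmax(W @ x) - np.eye(C)[y]          # residual r_z
    return np.outer(r, x).ravel(), r

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x1, y1 = rng.normal(size=d), 2
x2, y2 = rng.normal(size=d), 0
g1, r1 = grad_loss(x1, y1)
g2, r2 = grad_loss(x2, y2)

print(cos(g1, g2))                 # R_GC(z, z')
print(cos(r1, r2) * cos(x1, x2))   # cos(r_z, r_z') * cos(x, x'): same value
```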
D REPAIRING GRADIENT-BASED METRICS

As described in Section 5, we found that training instances with extremely large norms were selected as relevant by IF, FK, and GD. Thus, to repair these metrics, we need to design metrics that can ignore instances with large norms. A simple yet effective way of repairing the metrics is to use the ℓ2 distance or the cosine instead of the dot product. As Figure 4 shows, the ℓ2 and cosine metrics performed better than the dot metrics. Indeed, the ℓ2 metrics do not favor instances with large norms, which lead to large ℓ2 distances, and, through normalization, the cosine metrics completely ignore the effect of the norms. We name the repaired metrics of IF, FK, and GD based on the ℓ2 metric ℓ2^IF, ℓ2^FK, and ℓ2^GD, respectively, and the repaired metrics based on the cosine metric cos^IF, cos^FK, and cos^GD, respectively.7 We observed that these repaired metrics attained higher success rates on several evaluation criteria. The details of the results can be found in Appendix F.

7 Note that cos^IF is the same as RIF and cos^GD is the same as GC.

E DO THE MODELS CAPTURE SUBCLASSES?

The identical subclass test requires the model to obtain internal representations that can distinguish subclasses. Here, we confirm that this condition is satisfied for all the datasets and models we used in the experiments. We consider that a model captures the subclasses if the latent representation h_all has cluster structures. Figure 9 visualizes h_all for each dataset and model using UMAP (McInnes et al., 2018). The figures show that the instances from different subclasses are not mixed completely at random. MNIST and TREC have relatively clear cluster structures, while CIFAR10 and AGNews have vague clusters without explicit boundaries. These figures imply that the models capture subclasses (although perhaps not perfectly).

F COMPLETE EVALUATION RESULTS

F.1 FULL RESULTS

We show the complete results of the model randomization test in Table 2, the identical class test in Table 3, and the identical subclass test in Table 4. The results presented here are consistent with our observations in Section 4.

Table 2: Average Spearman rank correlation coefficients ± std. of each relevance metric for the model randomization test. The repaired metrics (Appendix D) are marked with a prefix. The results whose average score falls within the 95% confidence interval of the null distribution that the correlation is zero, which is [-0.088, 0.088], are colored.
MNIST CIFAR10 TREC Model CNN logreg Mobilenet V2 CNN logreg Bi-LSTM logreg Parameter size 12K 8K 2.2M 12K 31K 20K 7K Accuracy 0.98 0.00 0.92 0.00 0.89 0.01 0.72 0.02 0.35 0.01 0.86 0.01 0.81 0.02 ℓx 2 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 ℓlast 2 .15 .01 - .07 .00 .05 .01 - .19 .01 - ℓall 2 .79 .00 - .02 .01 .13 .01 - .25 .02 - cosx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 coslast .24 .00 - .07 .01 .04 .01 - .17 .02 - cosall .78 .00 - .02 .01 .09 .01 - .26 .03 - dotx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 dotlast .39 .01 - .05 .01 .04 .01 - .25 .01 - dotall .80 .00 - .00 .01 .12 .01 - .26 .03 - IF .05 .00 .00 .00 .05 .01 .04 .01 .04 .01 .01 .01 .06 .01 ℓIF 2 .00 .02 .11 .00 .01 .02 .03 .02 .05 .01 .00 .02 .13 .02 cos IF .02 .00 .05 .00 .04 .01 .03 .01 .03 .01 .01 .01 .03 .01 FK .02 .01 .02 .01 .02 .01 .01 .01 .03 .01 .01 .00 .03 .00 ℓFK 2 .10 .04 .05 .00 .16 .05 .12 .02 .03 .01 .14 .03 .15 .01 cos FK .00 .02 .05 .01 .05 .01 .03 .01 .01 .01 .07 .02 .03 .00 GD .08 .02 .01 .01 .03 .01 .02 .01 .04 .01 .04 .01 .02 .02 GC .07 .03 .03 .01 .02 .02 .01 .03 .05 .01 .04 .02 .01 .01 ℓgrad 2 .09 .04 .13 .01 .09 .04 .07 .02 .06 .01 .02 .02 .10 .02 AGNews Vehicle Segment Model Bi-LSTM logreg MLP logreg MLP logreg Parameter size 27K 9K 1K 76 1K 140 Accuracy 0.80 0.02 0.80 0.01 0.77 0.02 0.77 0.01 0.98 0.01 0.97 0.00 ℓx 2 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 ℓlast 2 .07 .01 - .16 .04 - .62 .15 - ℓall 2 .17 .01 - .20 .10 - .78 .08 - cosx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 coslast .07 .01 - .09 .18 - .60 .09 - cosall .12 .02 - .01 .13 - .77 .06 - dotx 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 1.00 .00 dotlast .07 .01 - .85 .33 - .61 .23 - dotall .20 .01 - .97 .03 - .72 .16 - IF .03 .01 .05 .01 .01 .03 .01 .02 .00 .01 .01 .02 ℓIF 2 .04 .02 .13 .00 .18 .24 .01 .28 .03 .13 .10 .26 cos IF .02 .01 .03 .01 .01 .03 .01 .05 .04 .10 .01 .05 FK .04 .01 .03 .00 .01 .06 .02 .07 .01 .02 .00 .02 ℓFK 2 .17 .04 .14 .00 .18 .21 .01 .17 .05 .07 .02 .20 cos FK .00 .03 .03 .00 .08 .13 .04 .12 .01 .03 .01 .04 GD .04 .01 .03 .01 .01 .11 .02 .05 .01 .03 .02 .03 GC .01 .02 .04 .01 .02 .13 .06 .11 .00 .06 .01 .05 ℓgrad 2 .01 .02 .14 .00 .13 .21 .11 .23 .02 .12 .09 .21 Published as a conference paper at ICLR 2021 2.5 0.0 2.5 5.0 7.5 10.0 12.5 15.0 12.5 2 4 5 6 7 (a) MNIST with CNN. y = A. 2 0 2 4 6 8 10 12 14 4 12 0 1 3 8 9 (b) MNIST with CNN. y = B. 7 6 5 4 3 2 1 0 cat airplane truck dog horse (c) CIFAR10 with Mobile Net V2. y = A. 2 1 0 1 2 3 4 5 6 5 ship frog automobile deer bird (d) CIFAR10 with Mobile Net V2. y = B. 4 3 2 1 0 1 2 3 4 cat airplane truck dog horse (e) CIFAR10 with CNN. y = A. 3 2 1 0 1 2 3 4 ship frog automobile deer bird (f) CIFAR10 with CNN. y = B. NUM HUM ENTY Figure 8: TREC with LSTM. y = A. 15 10 5 0 5 10 20 LOC DESC ABBR (a) TREC with LSTM. y = B. 10 Business World (b) AGNews with LSTM. y = A. 1 2 3 4 5 6 7 8 9 22 Sci/Tech Sports (c) AGNews with LSTM. y = B. Figure 9: visualization of hall in each dataset and model using UMAP. Published as a conference paper at ICLR 2021 Table 3: Average success rate std. of each relevancy metric for identical class test. The metrics prefixed with are the ones we have repaired. The results with the average success rate over 0.5 are colored. 
MNIST CIFAR10 TREC Model CNN logreg Mobilenet V2 CNN logreg Bi-LSTM logreg Parameter size 12K 8K 2.2M 12K 31K 20K 7K Accuracy 0.98 0.00 0.92 0.00 0.89 0.01 0.72 0.02 0.35 0.01 0.86 0.01 0.81 0.02 ℓx 2 .93 .01 .88 .01 .26 .02 .26 .02 .24 .02 .70 .00 .75 .00 ℓlast 2 .99 .01 - 1.00 .00 .75 .02 - .89 .00 - ℓall 2 .98 .00 - .93 .02 .61 .02 - .88 .00 - cosx .94 .01 .88 .01 .30 .03 .29 .02 .26 .02 .73 .00 .76 .00 coslast .99 .01 - 1.00 .00 .78 .02 - .89 .00 - cosall .98 .00 - .97 .01 .71 .02 - .90 .00 - dotx .69 .02 .68 .02 .09 .02 .10 .01 .11 .02 .33 .00 .34 .00 dotlast .67 .02 - 1.00 .00 .20 .02 - .93 .00 - dotall .96 .01 - .96 .01 .31 .01 - .93 .00 - IF .09 .01 .26 .02 - .10 .01 .09 .02 .29 .00 .86 .00 ℓIF 2 .72 .01 .62 .02 - .14 .01 .14 .01 .98 .00 .95 .00 cos IF .82 .01 .69 .02 - .12 .01 .13 .02 .99 .00 .96 .00 FK .10 .01 .21 .02 - .20 .02 .20 .02 .28 .00 .24 .00 ℓFK 2 .77 .02 .93 .01 - .82 .01 .98 .00 .99 .00 .96 .00 cos FK .92 .01 .97 .01 - .93 .01 .99 .00 1.00 .00 .96 .00 GD .30 .01 .87 .01 .26 .03 .71 .02 1.00 .00 .49 .00 1.00 .00 GC .99 .00 1.00 .00 .99 .00 .99 .00 1.00 .00 1.00 .00 1.00 .00 ℓgrad 2 .94 .01 .99 .00 .97 .01 .99 .00 1.00 .00 1.00 .00 1.00 .00 AGNews Vehicle Segment Model Bi-LSTM logreg MLP logreg MLP logreg Parameter size 27K 9K 1K 76 1K 140 Accuracy 0.80 0.02 0.80 0.01 0.77 0.02 0.77 0.01 0.98 0.01 0.97 0.00 ℓx 2 .39 .02 .40 .02 .65 .02 .62 .02 .93 .01 .92 .01 ℓlast 2 .84 .02 - .72 .03 - .97 .01 - ℓall 2 .84 .01 - .72 .03 - .96 .01 - cosx .47 .01 .51 .02 .66 .02 .63 .02 .91 .01 .90 .01 coslast .85 .01 - .74 .04 - .97 .01 - cosall .84 .01 - .73 .04 - .96 .01 - dotx .28 .02 .47 .02 .25 .00 .27 .01 .37 .01 .37 .01 dotlast .89 .01 - .26 .02 - .17 .06 - dotall .90 .01 - .27 .06 - .13 .01 - IF .24 .01 .67 .02 .39 .16 .78 .08 .15 .03 .52 .07 ℓIF 2 .99 .00 .92 .01 .88 .06 .95 .01 .79 .13 .80 .05 cos IF 1.00 .00 .97 .01 .96 .02 .99 .01 .84 .11 .92 .08 FK .32 .01 .29 .03 .31 .18 .26 .17 .15 .04 .17 .10 ℓFK 2 .94 .01 .68 .02 .93 .04 .94 .03 .86 .06 .95 .02 cos FK .95 .01 .84 .02 .99 .01 .99 .01 .97 .02 .99 .01 GD .76 .01 1.00 .00 .90 .10 .98 .02 .30 .14 .55 .11 GC 1.00 .00 1.00 .00 1.00 .00 1.00 .00 .97 .02 1.00 .00 ℓgrad 2 1.00 .00 1.00 .00 .99 .01 1.00 .00 .90 .05 .99 .01 Published as a conference paper at ICLR 2021 Table 4: Average success rate std. of each relevancy metric for identical subclass test. The metrics prefixed with are the ones we have repaired. The results with the average success rate over 0.5 are colored. 
Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.99±0.00, 0.88±0.01, 0.92±0.01, 0.84±0.03, 0.71±0.03, 0.86±0.01, 0.81±0.02
ℓ2^x: .93±.01, .96±.02, .26±.02, .29±.04, .31±.03, .78±.03, .78±.02
ℓ2^last: .89±.02, -, .29±.04, .35±.04, -, .76±.02, -
ℓ2^all: .97±.01, -, .49±.04, .38±.03, -, .77±.03, -
cos^x: .95±.01, .96±.02, .29±.03, .31±.04, .31±.03, .82±.02, .81±.02
cos^last: .89±.02, -, .32±.03, .33±.03, -, .75±.02, -
cos^all: .98±.00, -, .71±.04, .50±.03, -, .77±.02, -
dot^x: .70±.03, .75±.03, .09±.02, .11±.03, .09±.02, .33±.03, .34±.03
dot^last: .24±.04, -, .22±.02, .20±.01, -, .40±.03, -
dot^all: .94±.01, -, .68±.03, .25±.03, -, .59±.03, -
IF: .12±.01, .39±.05, -, .06±.02, .08±.02, .31±.02, .49±.03
ℓ2^IF: .62±.04, .76±.03, -, .17±.02, .12±.02, .68±.02, .79±.02
cos^IF: .70±.02, .87±.02, -, .15±.02, .09±.02, .72±.01, .75±.03
FK: .19±.03, .14±.02, -, .11±.01, .11±.02, .30±.02, .16±.02
ℓ2^FK: .81±.02, .76±.03, -, .31±.03, .24±.02, .73±.03, .78±.02
cos^FK: .91±.02, .85±.02, -, .37±.03, .23±.02, .81±.02, .79±.01
GD: .42±.05, .48±.03, .20±.02, .24±.03, .21±.04, .45±.02, .60±.02
GC: .97±.01, .98±.01, .54±.03, .43±.04, .39±.03, .81±.01, .87±.02
ℓ2^grad: .91±.02, .95±.01, .28±.03, .38±.03, .34±.03, .78±.02, .88±.02

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 38, 1K, 40
Accuracy: 0.80±0.02, 0.80±0.01, 0.73±0.02, 0.73±0.01, 0.94±0.01, 0.90±0.01
ℓ2^x: .40±.02, .41±.01, .67±.03, .65±.02, .95±.01, .95±.01
ℓ2^last: .53±.02, -, .64±.05, -, .95±.02, -
ℓ2^all: .58±.01, -, .66±.04, -, .96±.01, -
cos^x: .49±.02, .53±.02, .68±.04, .66±.03, .92±.01, .93±.01
cos^last: .54±.01, -, .68±.06, -, .93±.01, -
cos^all: .59±.02, -, .67±.03, -, .94±.01, -
dot^x: .28±.02, .48±.02, .26±.03, .26±.03, .38±.01, .41±.02
dot^last: .52±.02, -, .27±.03, -, .15±.03, -
dot^all: .54±.02, -, .28±.04, -, .13±.02, -
IF: .25±.02, .48±.01, .34±.12, .54±.09, .16±.02, .49±.08
ℓ2^IF: .56±.02, .77±.02, .76±.14, .86±.06, .65±.10, .43±.12
cos^IF: .56±.02, .80±.02, .86±.09, .91±.08, .62±.11, .86±.05
FK: .28±.01, .25±.02, .16±.08, .20±.05, .16±.07, .10±.05
ℓ2^FK: .56±.02, .63±.02, .73±.13, .62±.09, .73±.13, .93±.03
cos^FK: .61±.02, .73±.02, .80±.06, .67±.10, .81±.10, .96±.02
GD: .50±.02, .54±.02, .47±.09, .43±.03, .34±.08, .37±.08
GC: .65±.02, .72±.02, .82±.06, .83±.07, .81±.10, .96±.01
ℓ2^grad: .61±.02, .73±.03, .72±.10, .75±.09, .75±.10, .90±.03

F.2 ADDITIONAL RESULTS

The identical class test requires the most relevant instance to be of the same class as the test instance. In practice, users can be more confident about a model's output if several instances are provided as evidence. In other words, we expect not only the most relevant instance but also the first few relevant instances to be of the same class. This observation leads to the following additional criterion, which is a generalization of the identical class test.

Definition 5 (Top-k Identical Class Test). For z_test = (x_test, ŷ_test), let z̃_j = (x̃_j, ỹ_j) be the training instance with the j-th largest relevance score. Then, we require ỹ_j = ŷ_test for all j ∈ {1, 2, . . . , k}.

The same observation also applies to the identical subclass test, which leads to the following criterion.

Definition 6 (Top-k Identical Subclass Test). For z_test = (x_test, ŷ_test), let z̃_j = (x̃_j, ỹ_j) be the training instance with the j-th largest relevance score. Then, we require s(z̃_j) = s(z_test) for all j ∈ {1, 2, . . . , k}.

We show the results of the top-10 identical class test in Table 5 and those of the top-10 identical subclass test in Table 6.
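As a concrete reading of Definition 5, the following minimal sketch (our own illustration, not the authors' released code) computes the top-k identical class test success rate from precomputed relevance scores; the names relevance, train_labels, and test_pred_labels are hypothetical placeholders for the corresponding arrays.

```python
import numpy as np

def topk_identical_class_success(relevance, train_labels, test_pred_labels, k=10):
    """Fraction of test instances whose k most relevant training instances all
    share the test instance's predicted label (Definition 5).

    relevance:        (n_test, n_train) array of scores R(z_test, z_train)
    train_labels:     (n_train,) training labels
    test_pred_labels: (n_test,)  predicted labels of the test instances
    """
    # Indices of the k training instances with the largest relevance per test instance.
    topk = np.argsort(-relevance, axis=1)[:, :k]
    # The test succeeds only if all k retrieved labels match the predicted label.
    hits = train_labels[topk] == test_pred_labels[:, None]
    return hits.all(axis=1).mean()
```

With k = 1 this reduces to the identical class test; replacing the labels with subclass assignments s(·) gives the top-k identical subclass test of Definition 6.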
G EXAMPLES OF EACH EXPLANATION METHOD

We show some examples of the relevant instances selected by several relevance metrics on CIFAR10 with CNN in Figures 10 and 11, and on AGNews with LSTM in Tables 7 and 8. We show examples of both correct (Figure 10 and Table 7) and incorrect (Figure 11 and Table 8) predictions. As mentioned in Section 5, the relevance metrics based on the dot product of the gradient, such as IF, FK, and GD, tend to select instances with large norms; accordingly, non-typical instances are selected in these examples.
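To make this norm effect concrete, here is a minimal sketch (our own illustration under simplified assumptions, not the paper's implementation) contrasting a dot-product relevance of loss gradients with its cosine counterpart on precomputed gradient vectors; computing the gradients themselves is model-specific and omitted.

```python
import numpy as np

def grad_dot(g_test, g_train):
    """Dot-product relevance of loss gradients (in the spirit of GD).
    The score scales with the norm of each training gradient, so atypical,
    high-loss training instances with large gradients tend to dominate."""
    return g_train @ g_test

def grad_cos(g_test, g_train):
    """Cosine relevance of loss gradients (in the spirit of GC).
    Normalizing by the norms removes the bias toward large-gradient instances."""
    g_train_unit = g_train / np.linalg.norm(g_train, axis=1, keepdims=True)
    g_test_unit = g_test / np.linalg.norm(g_test)
    return g_train_unit @ g_test_unit

# Toy example: a poorly aligned but large-norm gradient outranks a well-aligned
# one under the dot product, while the cosine keeps the aligned instance first.
g_test = np.array([1.0, 0.0])
g_train = np.array([[0.9, 0.1],   # well aligned, moderate norm
                    [5.0, 8.0]])  # poorly aligned, large norm
print(grad_dot(g_test, g_train))  # [0.9, 5.0]  -> large-norm instance ranked first
print(grad_cos(g_test, g_train))  # approx. [0.99, 0.53] -> aligned instance first
```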
Table 5: Average success rate ± std. of each relevance metric for the top-10 identical class test. The metrics ℓ2^IF, cos^IF, ℓ2^FK, cos^FK, and ℓ2^grad are the ones we have repaired. Results with an average success rate over 0.5 are colored.

Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.98±0.00, 0.92±0.00, 0.89±0.01, 0.72±0.02, 0.35±0.01, 0.86±0.01, 0.81±0.02
ℓ2^x: .63±.02, .63±.02, .00±.00, .00±.00, .00±.00, .23±.00, .23±.00
ℓ2^last: .95±.01, -, .98±.01, .30±.01, -, .68±.00, -
ℓ2^all: .89±.01, -, .64±.05, .14±.01, -, .66±.00, -
cos^x: .67±.02, .65±.02, .00±.00, .00±.00, .00±.00, .24±.00, .24±.00
cos^last: .97±.01, -, .98±.01, .33±.02, -, .69±.00, -
cos^all: .92±.00, -, .84±.03, .23±.02, -, .68±.00, -
dot^x: .19±.01, .20±.02, .00±.00, .00±.00, .00±.00, .05±.00, .05±.00
dot^last: .42±.03, -, .98±.01, .04±.01, -, .75±.00, -
dot^all: .88±.01, -, .79±.03, .05±.01, -, .84±.00, -
IF: .00±.00, .00±.00, -, .00±.00, .00±.00, .00±.00, .24±.00
ℓ2^IF: .25±.01, .10±.01, -, .00±.00, .00±.00, .83±.00, .47±.00
cos^IF: .59±.02, .17±.01, -, .00±.00, .00±.00, .91±.00, .65±.00
FK: .00±.00, .02±.01, -, .00±.00, .06±.01, .01±.00, .00±.00
ℓ2^FK: .23±.03, .65±.02, -, .25±.02, .87±.01, .90±.00, .71±.00
cos^FK: .59±.01, .82±.02, -, .54±.02, .93±.01, .95±.00, .77±.00
GD: .00±.00, .41±.02, .00±.00, .15±.01, 1.00±.00, .11±.00, .99±.00
GC: .95±.01, .99±.01, .92±.02, .92±.01, 1.00±.00, .96±.00, 1.00±.00
ℓ2^grad: .57±.02, .95±.01, .78±.03, .80±.01, .99±.00, .94±.00, 1.00±.00

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 76, 1K, 140
Accuracy: 0.80±0.02, 0.80±0.01, 0.77±0.02, 0.77±0.01, 0.98±0.01, 0.97±0.00
ℓ2^x: .00±.00, .00±.00, .09±.02, .09±.02, .60±.01, .60±.01
ℓ2^last: .48±.03, -, .19±.07, -, .77±.03, -
ℓ2^all: .46±.01, -, .16±.06, -, .74±.03, -
cos^x: .01±.00, .02±.01, .10±.02, .10±.01, .44±.02, .44±.02
cos^last: .51±.03, -, .22±.07, -, .78±.03, -
cos^all: .48±.02, -, .17±.06, -, .72±.04, -
dot^x: .01±.00, .01±.00, .15±.12, .16±.13, .23±.02, .23±.02
dot^last: .64±.03, -, .13±.11, -, .05±.06, -
dot^all: .66±.03, -, .15±.11, -, .00±.01, -
IF: .00±.00, .02±.01, .01±.01, .10±.03, .00±.00, .10±.03
ℓ2^IF: .94±.01, .20±.02, .25±.13, .47±.05, .32±.15, .48±.05
cos^IF: .97±.01, .48±.01, .42±.12, .61±.03, .63±.16, .83±.12
FK: .00±.00, .00±.00, .05±.11, .08±.11, .00±.01, .03±.06
ℓ2^FK: .61±.02, .06±.01, .55±.19, .64±.12, .32±.17, .60±.14
cos^FK: .71±.03, .15±.01, .85±.06, .85±.08, .78±.08, .92±.03
GD: .55±.02, .98±.01, .56±.19, .70±.05, .09±.08, .37±.05
GC: 1.00±.00, 1.00±.00, .95±.04, 1.00±.00, .84±.08, .97±.02
ℓ2^grad: .99±.01, .98±.00, .81±.09, .95±.03, .43±.20, .84±.06

Table 6: Average success rate ± std. of each relevance metric for the top-10 identical subclass test. The metrics ℓ2^IF, cos^IF, ℓ2^FK, cos^FK, and ℓ2^grad are the ones we have repaired. Results with an average success rate over 0.5 are colored.

Columns: MNIST (CNN, logreg); CIFAR10 (Mobilenet V2, CNN, logreg); TREC (Bi-LSTM, logreg)
Parameter size: 12K, 8K, 2.2M, 12K, 31K, 20K, 7K
Accuracy: 0.99±0.00, 0.88±0.01, 0.92±0.01, 0.84±0.03, 0.71±0.03, 0.86±0.01, 0.81±0.02
ℓ2^x: .64±.02, .71±.03, .00±.00, .00±.00, .00±.00, .27±.05, .25±.02
ℓ2^last: .54±.04, -, .00±.00, .00±.00, -, .30±.02, -
ℓ2^all: .85±.02, -, .08±.02, .01±.00, -, .34±.02, -
cos^x: .67±.02, .74±.03, .00±.00, .01±.01, .00±.00, .28±.04, .27±.02
cos^last: .57±.05, -, .00±.00, .00±.00, -, .30±.02, -
cos^all: .89±.02, -, .16±.02, .02±.01, -, .34±.02, -
dot^x: .21±.02, .23±.03, .00±.00, .00±.00, .00±.00, .05±.01, .05±.01
dot^last: .08±.02, -, .00±.00, .00±.00, -, .14±.01, -
dot^all: .79±.03, -, .13±.02, .01±.01, -, .17±.02, -
IF: .01±.01, .00±.00, -, .00±.00, .00±.00, .00±.00, .01±.01
ℓ2^IF: .14±.03, .16±.02, -, .00±.00, .00±.00, .11±.02, .24±.02
cos^IF: .37±.02, .35±.04, -, .00±.00, .00±.00, .22±.03, .25±.02
FK: .00±.00, .00±.00, -, .00±.00, .00±.00, .00±.00, .00±.00
ℓ2^FK: .22±.02, .30±.03, -, .00±.00, .00±.00, .28±.04, .26±.02
cos^FK: .58±.02, .46±.04, -, .00±.00, .00±.00, .41±.03, .25±.02
GD: .00±.00, .01±.01, .01±.01, .00±.00, .00±.00, .10±.02, .01±.00
GC: .86±.03, .87±.02, .06±.02, .01±.01, .01±.01, .37±.03, .37±.02
ℓ2^grad: .50±.03, .69±.04, .02±.01, .00±.00, .00±.00, .24±.03, .34±.02

Columns: AGNews (Bi-LSTM, logreg); Vehicle (MLP, logreg); Segment (MLP, logreg)
Parameter size: 27K, 9K, 1K, 38, 1K, 40
Accuracy: 0.80±0.02, 0.80±0.01, 0.73±0.02, 0.73±0.01, 0.94±0.01, 0.90±0.01
ℓ2^x: .00±.00, .00±.00, .10±.00, .09±.00, .62±.02, .64±.02
ℓ2^last: .01±.00, -, .07±.00, -, .66±.07, -
ℓ2^all: .01±.01, -, .08±.00, -, .70±.05, -
cos^x: .01±.00, .02±.01, .10±.00, .09±.00, .46±.02, .48±.02
cos^last: .01±.00, -, .06±.00, -, .60±.07, -
cos^all: .02±.01, -, .10±.00, -, .62±.08, -
dot^x: .01±.00, .02±.01, .00±.00, .00±.00, .24±.02, .25±.02
dot^last: .01±.00, -, .00±.00, -, .01±.02, -
dot^all: .02±.00, -, .00±.00, -, .02±.05, -
IF: .00±.00, .00±.00, .02±.00, .10±.00, .00±.00, .17±.04
ℓ2^IF: .01±.00, .10±.01, .02±.00, .20±.00, .15±.11, .17±.08
cos^IF: .01±.00, .20±.02, .02±.00, .38±.00, .35±.14, .66±.07
FK: .00±.00, .00±.00, .02±.00, .00±.00, .00±.00, .00±.00
ℓ2^FK: .01±.00, .02±.01, .18±.00, .02±.00, .15±.14, .61±.09
cos^FK: .01±.00, .09±.01, .10±.00, .01±.00, .53±.09, .75±.07
GD: .01±.00, .06±.01, .39±.00, .39±.00, .09±.05, .13±.08
GC: .04±.00, .10±.01, .16±.00, .30±.00, .52±.09, .64±.04
ℓ2^grad: .02±.01, .08±.01, .37±.00, .30±.00, .18±.11, .51±.08

[Figure 10: Relevant instances selected for random test inputs with correct predictions using several relevance metrics on CIFAR10 with CNN. Image panels omitted; the shown test inputs are predicted as frog and as airplane.]

[Figure 11: Relevant instances selected for random test inputs with incorrect predictions using several relevance metrics on CIFAR10 with CNN. Image panels omitted; the shown test inputs are predicted as deer and as automobile.]

Table 7: Relevant instances selected for random test inputs with correct predictions using several relevance metrics on AGNews with LSTM. Out-of-vocabulary words are followed by [unk].
Test input: "kerry widens lead in california , poll finds ( reuters )" (Gold: World, Predict: World)
ℓ2^x: "in brief" (Sci/Tech)
ℓ2^last: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
ℓ2^all: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
cos^x: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
cos^last: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
cos^all: "strong hurricane approaches[unk] bahamas[unk] , florida ( reuters )" (Sci/Tech)
dot^x: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
dot^last: "eurozone finance ministers debate action on oil as prices surge ( afp )" (World)
dot^all: "business cash for bush campaign , lawyers[unk] for kerry ( reuters )" (World)
IF: "greek judoka[unk] dies in hospital after balcony[unk] suicide leap[unk]" (Sports)
ℓ2^IF: "world front" (World)
cos^IF: "arafat family bickers[unk] over medical[unk] records of palestinian leader" (World)
FK: "linux # 39;s latest moneymaker[unk]" (Business)
ℓ2^FK: "china launches zy-2[unk] resource[unk] satellite" (Sci/Tech)
cos^FK: "china launches zy-2[unk] resource[unk] satellite" (Sci/Tech)
GD: "judge adjourns[unk] ba[unk] # 39;asyir[unk] # 39;s trial until nov. 4" (World)
GC: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)
ℓ2^grad: "reuters poll : bush holds two - point lead over kerry ( reuters )" (World)

Test input: "some people not eligible[unk] to get in on google ipo" (Gold: Sci/Tech, Predict: Sci/Tech)
ℓ2^x: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
ℓ2^last: "european judge probes microsoft antitrust case" (Sci/Tech)
ℓ2^all: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^x: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^last: "breakthrough in hydrogen[unk] fuel research" (Sci/Tech)
cos^all: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "earnings alert : novell sees weakness[unk] in it spending" (Sci/Tech)
dot^all: "siemens backs new wireless technology" (Sci/Tech)
IF: "matching[unk] wits[unk] on politics" (Sports)
ℓ2^IF: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^IF: "congress probes fda in vioxx case" (Business)
FK: "bin laden tape urges oil attack" (Business)
ℓ2^FK: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
cos^FK: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)
GD: "issue 65 news hound[unk] : this week in gaming" (Sci/Tech)
GC: "google responds[unk] to google news china controversy[unk]" (Sci/Tech)
ℓ2^grad: "insiders[unk] get rich[unk] through google ipo" (Sci/Tech)

Table 8: Relevant instances selected for random test inputs with incorrect predictions using several relevance metrics on AGNews with LSTM. Out-of-vocabulary words are followed by [unk].
Test input: "ibm to hire even[unk] more new workers" (Gold: Sci/Tech, Predict: Business)
ℓ2^x: "athletes[unk] to watch[unk]" (Sports)
ℓ2^last: "tech stocks tumble[unk] after chip makers warn" (Business)
ℓ2^all: "microsoft foe[unk] wins in settlement" (Sci/Tech)
cos^x: "volkswagen[unk] workers stage new stoppages[unk]" (Business)
cos^last: "tech stocks tumble[unk] after chip makers warn" (Business)
cos^all: "microsoft revenue tops forecast" (Business)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "google up in market debut after bumpy[unk] ipo ( reuters )" (Business)
dot^all: "google up in market debut after bumpy[unk] ipo ( reuters )" (Business)
IF: "greek judoka[unk] dies in hospital after balcony[unk] suicide leap[unk]" (Sports)
ℓ2^IF: "ibm # 39;s third - quarter earnings and revenue up" (Business)
cos^IF: "arafat family bickers[unk] over medical[unk] records of palestinian leader" (World)
FK: "great white sharks[unk] given new protection" (World)
ℓ2^FK: "ibm # 39;s third - quarter earnings and revenue up" (Business)
cos^FK: "ibm to buy danish[unk] firms" (Business)
GD: "some question speed of intel chief bill ( ap )" (World)
GC: "ibm shrugs[unk] off industry blues[unk] in q3" (Business)
ℓ2^grad: "ibm # 39;s third - quarter earnings and revenue up" (Business)

Test input: "tougher[unk] rules wo n t soften[unk] law s game" (Gold: Sports, Predict: Sci/Tech)
ℓ2^x: "profiting[unk] from moore[unk] s law" (Business)
ℓ2^last: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
ℓ2^all: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
cos^x: "profiting[unk] from moore[unk] s law" (Business)
cos^last: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
cos^all: "devil[unk] rays[unk] stuck[unk] in florida hours[unk] before game" (Sports)
dot^x: "italians[unk] , canadians[unk] gather[unk] to honour[unk] living legend[unk] : vc[unk] winner smoky[unk] smith[unk] ( canadian press )"
dot^last: "world s top game players battle for cash ( ap )" (Sci/Tech)
dot^all: "sportsnetwork[unk] game preview" (Sports)
IF: "top grades[unk] rising again for gcses[unk]" (World)
ℓ2^IF: "calif. oks toughest[unk] auto emissions[unk] rules" (World)
cos^IF: "un envoy headed to darfur" (World)
FK: "yankee[unk] batters[unk] hit wall" (Sports)
ℓ2^FK: "a flat panel does n t always[unk] compute[unk]" (Sci/Tech)
cos^FK: "a flat panel does n t always[unk] compute[unk]" (Sci/Tech)
GD: "issue 65 news hound[unk] : this week in gaming" (Sci/Tech)
GC: "atari[unk] announces first 64-bit[unk] game" (Sci/Tech)
ℓ2^grad: "atari[unk] announces first 64-bit[unk] game" (Sci/Tech)