# Accelerating the Global Aggregation of Local Explanations

Alon Mor, Yonatan Belinkov, Benny Kimelfeld
Technion - Israel Institute of Technology, Haifa, Israel
almr16@campus.technion.ac.il, belinkov@technion.ac.il, bennyk@cs.technion.ac.il

**Abstract.** Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the tokens. Aggregating local explanations over a dataset provides a global explanation of the model. Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. However, standard aggregation methods bear a high computational cost: a naïve implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a simple user running in the scope of a short analysis session. We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-k words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30×, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact.

## Introduction

A particular paradigm for local explanations consists of algorithms that compute a numeric estimate of each token's contribution to the decision of the model, also known as input attribution (Danilevsky et al. 2020a). Many of the common techniques are computationally intensive. For instance, the score that Anchor (Ribeiro, Singh, and Guestrin 2018) assigns to a token is the probability that the model keeps its decision intact when the document undergoes random perturbations, as long as the token remains unchanged. LIME (Ribeiro, Singh, and Guestrin 2016) derives a linear bag-of-words predictor of the model's outcome in a small area around the document and scores each token by its learned coefficient. The SHAP score (Lundberg and Lee 2017) views the tokens as players in the cooperative game of forming the model's decision, and applies the well-known Shapley value to attribute a portion of the profit to each player.

Figure 1: Left: local explanations in a collection of documents for a spam detection task over Amazon toy reviews (e.g., "My daughter got this for christmas and loves it!", "Great service and shipping. Thank you. Because of you I had less stress for christmas"). Top-right: top-10 terms resulting from the global aggregation of (local) anchors: great, love, game, fun, christmas, friends, grandson, family, favorite, enjoy. Bottom-right: global aggregation quality as a function of time (sec) for sum, avg, and our aggregation; our approach leads to higher quality at a fraction of the computation time.

As a running illustration, Figure 1
shows documents from a spam-detection task of Amazon reviews on toys and games, with words marked as anchors, meaning that their score by Anchor exceeds a threshold. In turn, an approach to global explanation is the aggregation of the attribution scores computed by methods such as the aforementioned, intended to quantify the impact of each term on the model (Arras et al. 2016). More precisely, we begin with a set of instances to classify (e.g., the training set or the test set of a learning setup), compute a score for each token, and then estimate the impact of every word by aggregating the scores that it obtained in its occurrences in documents. Figure 1 shows the top-10 impactful words based on the anchors; to illustrate an insight, Christmas is one of the most impactful terms that encourage the model (BERT; Devlin et al. 2018) to classify a document as spam.

Like any form of global explanation, we would like it to be used effectively by users of varying skills to gain valuable insights on the model: what it learned in training, which biases it has towards specific phrases, which adversarial examples can shed light on possible weaknesses, and so on. For instance, we would like to present the top-k impactful terms in online analysis frameworks such as the FIND system (Lertvittayakumjorn, Specia, and Toni 2020) that provides attribution-based insights.

Yet, the global aggregation of a local attribution, like the above ones, entails critical challenges, since the latter is designed to explain a single decision by the model. For one, how exactly should we aggregate the scores? The local scores are inherently noisy, and so, if we simply sum up the scores of all occurrences, we end up giving an advantage to frequent words that are irrelevant to the model (e.g., a function word like "the" is likely to receive enough noisy scores that sum up to a high global score). If we normalize by the number of occurrences, then we promote rare terms that matter only in insignificant cases (e.g., a word that appeared only once in the dataset would be scored highly). Either way, the aggregation falls short of the expectation to explain the model.

Another major challenge, regardless of which aggregation is used, is the computational cost. The statistical approaches are arguably designed to be practical as local explanations, where we wish to provide an online answer for a single instance; yet, global aggregation should, in principle, apply the costly computation to every token of every document. For instance, a direct application of the release of Anchor on a collection of 10k documents took us almost two days, which code optimizations shortened to a little more than an hour. Again, such performance renders global aggregation ineffective for online analysis pipelines.

We focus on the global aggregation of the Anchor explanations and tackle the above challenges by making the following contributions. First, we design a probabilistic model that targets the importance of words as explanations and, at the same time, accounts for noise and irrelevant occurrences. Second, we develop several optimizations of global aggregation over Anchor towards an effective implementation. Some of the optimizations are specific to Anchor, while others apply to every global aggregation of contribution scores.
Specifically, we design an anytime algorithm that maintains a set of top-k candidates and approaches the top score after a fraction of the naïve time, with an improvement of up to 30×. Third, we design and conduct an experimental study that shows the effectiveness of our solutions. For that, we adapt the evaluation method of Guidotti et al. (2018) to the task of top-k impactful terms. This is illustrated in Figure 1: after a few seconds, our anytime algorithm produced a set of words that outperforms mere sum and average (which terminate in ten minutes).

**Related work.** The local attribution scores we discussed fall in the category of perturbation-based explanations (Anchor) or simplification-based explanations (LIME, SHAP). An earlier perturbation-based explanation is occlusion (Zeiler and Fergus 2014). Another common category is gradient-based explanations, such as Saliency (Simonyan, Vedaldi, and Zisserman 2014), Input X Gradient (Shrikumar et al. 2016), and others (Pruthi et al. 2020; Sundararajan, Taly, and Yan 2017); these assume white-box access to the model, while perturbation-based methods can operate with any model in a black-box manner. See Atanasova et al. (2020) for a comparative analysis. Other scores are derived from an attention mechanism incorporated in the model (Lu et al. 2019; Xie et al. 2017). Layer-wise Relevance Propagation (LRP) extracts from the network a linear model where, similarly to LIME, the coefficients are used as scores (Arras et al. 2016). Aggregation of local attributions was studied by Ebert, Jakobovits, and Filippova (2022), who applied aggregation to the gradient-based local attribution of Bastings et al. (2022). Aggregation of LRP was done by Lertvittayakumjorn, Specia, and Toni (2020) and Gholizadeh and Zhou (2021). These publications focused on representing the whole space of local attributions (e.g., via clustering, word clouds, and representative embeddings), and did not focus on the challenge of execution cost. Later in the paper, we refer to work on the aggregation of LIME scores (Ribeiro, Singh, and Guestrin 2016; van der Linden, Haned, and Kanoulas 2019) that also did not focus on the computational aspects.

Token scoring is one of a plethora of explanation forms proposed for machine-learning models. Other popular approaches derive from the original model different kinds of insights, such as nearby interpretable (or surrogate) models, deterministic rules, and examples that highlight nuances of the model. These can be found in relevant surveys such as Guidotti et al. (2019), Danilevsky et al. (2020b), and Atanasova et al. (2020), to name a few.

## Formal Setup

A document $d$ is a finite sequence $w_1, \ldots, w_m$ of words. For $i = 1, \ldots, m$, we call the pair $(w_i, i)$ a token of $d$. By a slight abuse of notation, we may identify $d$ with its set of tokens; hence, $(w, i) \in d$ means that the $i$th word of $d$ is $w$. We denote by $D$ the set of all documents. By a predictor we refer to any function $f : D \to C$ that maps documents to some domain $C$. In particular, a binary classifier is a predictor where $C = \{0, 1\}$. In our evaluation, we will make the (conventional) assumption that $f(d)$ is the most likely class according to an associated probability function $\hat{f} : D \times C \to [0, 1]$ that defines a distribution over $C$ for each document $d$.

**Anchors.** The anchor concept has been defined for general modalities and predicates (properties of the input instance) that can take the role of an explanation of a prediction (Ribeiro, Singh, and Guestrin 2018).
We define it in the textual domain, where the predicates are token memberships. We assume that every document $d$ is associated with a perturbator, which is a distribution $\pi_d$ over $D$. We later discuss the actual perturbator used in the public Anchor implementation. We also assume a numerical threshold $\tau \in [0, 1]$. Let $f$ be a predictor, and let $(w, i) \in d$ be a token of $d$. We say that $(w, i)$ is an anchor (of $d$ w.r.t. $f$) if

$$\Pr_{x \sim \pi_d}[f(x) = f(d) \mid (w, i) \in x] \;\geq\; \tau \qquad (1)$$

In words, $(w, i)$ is an anchor if the document retains the same prediction under perturbation, with high probability, as long as the word $w$ is kept at the $i$th position in the perturbation. This probability is referred to as the precision of $(w, i)$. We denote by $\mathrm{Anc}(d)$ the set of anchors of a document $d$.

The algorithm uses a Masked Language Model (MLM) as the perturbator of $d$. Samples from $\pi_d$ are created by masking tokens of $d$ and later unmasking them using DistilBERT (Sanh et al. 2019). The output of the MLM on a masked $(w, i)$ is a distribution over words that can fit the mask. Out of the $\zeta$ (500 here) words with the highest probability, a word is sampled according to its probability. If there are multiple masks, then the process repeats iteratively until all replacements are applied.
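To make this perturbation process concrete, the following is a minimal sketch built on the Hugging Face fill-mask pipeline with DistilBERT; the function `perturb` and its sampling details are our simplified rendering of the process described above, not the exact code of the Anchor release.

```python
import random
from transformers import pipeline

# Minimal sketch of the MLM perturbator, assuming the Hugging Face
# fill-mask pipeline with DistilBERT (not the exact Anchor release code).
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def perturb(words, fixed, zeta=500, mask_prob=0.5):
    """Sample one perturbed document from (an approximation of) pi_d.

    words: the tokens w_1, ..., w_m of the document d.
    fixed: positions i whose word must stay unchanged (e.g., a candidate anchor).
    zeta:  number of top MLM suggestions to sample from.
    """
    out = list(words)
    masked = [i for i in range(len(out))
              if i not in fixed and random.random() < mask_prob]
    for i in masked:  # unmask iteratively, one position at a time
        out[i] = fill_mask.tokenizer.mask_token
        suggestions = fill_mask(" ".join(out), top_k=zeta)
        weights = [s["score"] for s in suggestions]
        out[i] = random.choices(suggestions, weights=weights, k=1)[0]["token_str"]
    return out
```

Estimating the precision of a token then amounts to drawing such perturbations that keep the token fixed and counting how often the model's prediction is preserved.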
**Global aggregation.** Let $S$ be a finite collection of documents, and let $f : D \to C$ be a predictor that we wish to explore in the context of $S$. (For instance, $S$ can be the training set that is used for the construction of $f$.) We denote by $W(S)$ the set of words that occur in $S$. By global aggregation we refer to any numerical function $G$ that maps every word $w \in W(S)$ and $c \in C$ to a number $G(w, c)$. Intuitively, a high $G(w, c)$ means that $w$ has a high impact on $f$ taking the value $c$ on a given document. We denote by $S[c]$ the collection of documents in $S$ classified as $c$. We also denote by $A^+(w, c)$ and $A^-(w, c)$ the number of occurrences of a word $w \in W(S)$ where $w$ is considered an anchor for $c$ and a non-anchor for $c$, respectively:

$$A^+(w, c) \stackrel{\text{def}}{=} \sum_{d \in S[c]} |\{i \mid (w, i) \in \mathrm{Anc}(d)\}| \qquad (2)$$

$$A^-(w, c) \stackrel{\text{def}}{=} \sum_{d \in S[c]} |\{i \mid (w, i) \in d \setminus \mathrm{Anc}(d)\}| \qquad (3)$$

Ribeiro, Singh, and Guestrin (2016) proposed the function

$$G_{sq}(w, c) \stackrel{\text{def}}{=} \sqrt{A^+(w, c)} \qquad (4)$$

Several aggregations have been proposed by van der Linden, Haned, and Kanoulas (2019). The one that performed best on binary classification is the Global Average Importance that, in our context, normalizes $A^+(w, c)$ by the number of occurrences of $w$:

$$G_{av}(w, c) \stackrel{\text{def}}{=} \frac{A^+(w, c)}{A^+(w, c) + A^-(w, c)} \qquad (5)$$

Note that $G_{av}(w, c)$ does not distinguish between rare words that occur as anchors and words that are frequently anchors. In fact, the maximal score is obtained by a word that occurs once and, in that occurrence, is an anchor. On the other hand, $G_{sq}(w, c)$ is sensitive to the noise of the Anchor algorithm, since it rewards frequent words (e.g., stop words) that are occasionally identified as anchors.

Another aggregation that van der Linden, Haned, and Kanoulas (2019) proposed is $G_h(w, c)$, which weighs a word by its score for the different classes. The idea is that the multiplicity of classes entails a penalty, since it indicates low relevance to the specific class. Let $\tilde{h}(w, c)$ be the normalized $G_{sq}(w, c)$, that is, $\tilde{h}(w, c) \stackrel{\text{def}}{=} G_{sq}(w, c) / \sum_{c' \in C} G_{sq}(w, c')$. Note that $\tilde{h}(w, \cdot)$ can be viewed as a distribution of $w$'s importance across the classes. The Shannon entropy of this distribution is $H_w \stackrel{\text{def}}{=} -\sum_{c \in C} \tilde{h}(w, c) \log \tilde{h}(w, c)$. A low $H_w$ implies that $w$ impacts a specific class. Let $H_{\min}$ and $H_{\max}$ be the minimum and maximum $H_w$ across all words $w$. Since rare words might occur in a specific class (thus having low entropy), $G_h$ is defined using $H_w$ and $G_{sq}$ as follows:

$$G_h(w, c) \stackrel{\text{def}}{=} \left(1 - \frac{H_w - H_{\min}}{H_{\max} - H_{\min}}\right) G_{sq}(w, c) \qquad (6)$$

Hence, a high entropy yields a low factor over $G_{sq}$, and rare words have a low $G_{sq}$, both resulting in a low $G_h$. While $G_h(w, c)$ aims to address the problems of $G_{av}(w, c)$ and $G_{sq}$, we have found that anchors commonly appear in at most one class; thus, the entropy is very low for most words to begin with. In the next section, we will propose a new aggregation that aims to overcome these weaknesses.
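For reference, here is a small sketch of these count-based aggregations, assuming the anchors have already been computed per document; the container layout (lists of words with sets of anchor positions) is our illustration, not the data structures of our release.

```python
import math
from collections import Counter

def count_anchors(docs_of_class, anchor_positions):
    """A+(w, c) and A-(w, c) for one class c (Equations (2)-(3)).

    docs_of_class:    S[c], each document given as a list of words.
    anchor_positions: per document, the set {i | (w_i, i) in Anc(d)}.
    """
    a_plus, a_minus = Counter(), Counter()
    for words, anchors in zip(docs_of_class, anchor_positions):
        for i, w in enumerate(words):
            (a_plus if i in anchors else a_minus)[w] += 1
    return a_plus, a_minus

def g_sq(w, a_plus):                                 # Equation (4)
    return math.sqrt(a_plus[w])

def g_av(w, a_plus, a_minus):                        # Equation (5)
    return a_plus[w] / (a_plus[w] + a_minus[w])

def g_h(w, c, a_plus_by_class, h_min, h_max):        # Equation (6)
    sq = {c2: math.sqrt(ap[w]) for c2, ap in a_plus_by_class.items()}
    total = sum(sq.values()) or 1.0
    # Shannon entropy H_w of w's normalized importance across classes
    h_w = -sum((s / total) * math.log(s / total) for s in sq.values() if s > 0)
    return (1 - (h_w - h_min) / (h_max - h_min)) * sq[c]
```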
**Problem definition: top-k terms.** We consider the following computational task. We have a set $S$ of documents, a predictor $f$, an aggregation function $G$ for anchors, and a number $k$. The goal is to find, for a given class $c$ of $f$, a set $T_c$ of $k$ words in $W(S)$ with maximal $G(w, c)$. We will refer to these words as the top terms. While we wish to find $T_c$ in interactive time, a naive aggregation can be prohibitively slow. We might be willing to settle for an approximation, that is, a set $T'_c$ of $k$ words that has similar quality to $T_c$. Next, we discuss how this quality can be measured.

For evaluating global aggregations, van der Linden, Haned, and Kanoulas (2019) proposed $\mathrm{AOPC}_{\mathrm{global}}$, which adapts the Area Over the Perturbation Curve of Samek et al. (2015). The idea is to measure how the removal of high-score terms impacts the model's predictions. In our terms, $\mathrm{AOPC}_{\mathrm{global}}$ is calculated on an aggregation $G$ by progressively removing from each document the top-$k$ terms $w$ in that document by decreasing $G(w, c)$. One curve shows better performance than another if it is higher and its initial slope is steeper (indicating better identification of the top influencing terms).

Our goal is to measure the quality of the set $T_c$ of top-$k$ terms. Hence, in our experiments, we adapt $\mathrm{AOPC}_{\mathrm{global}}$ so that we remove from each document $d$ only the terms in $T_c$; in particular, documents that do not intersect with $T_c$ remain unchanged (in contrast to $\mathrm{AOPC}_{\mathrm{global}}$, which removes $k$ terms from each document). Formally, let $f : D \to C$ be a predictor, $c \in C$ a class, and $T_c$ the set of top-$k$ terms found by the aggregation $G$ and a class $c$. Let $w_1, \ldots, w_k$ be the words in $T_c$ ordered by decreasing $G(\cdot, c)$. For a document $d$, let $d_i$ be the document $d$ with every occurrence of a word from $w_1, \ldots, w_i$ removed. We define

$$\mathrm{AOPC}_k(G, c) \stackrel{\text{def}}{=} \frac{1}{k + 1} \operatorname{avg}_{d \in S[c]} \sum_{i=1}^{k} \left(\hat{f}(d, c) - \hat{f}(d_i, c)\right) \qquad (7)$$

where $\hat{f}(d, c)$ is the probability that the classification model of $f$ assigns to $c$ for the document $d$. In words, $\mathrm{AOPC}_k(G, c)$ is the average drop in the probability of the class $c$, over all documents and prefixes of $T_c$.
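A minimal sketch of this measure, assuming access to a callable `f_hat(words, c)` that returns the model's probability for class c; the function and argument names are our illustration.

```python
def aopc_k(top_terms, docs_of_class, f_hat, c):
    """AOPC_k of Equation (7) for the top terms T_c, ordered by decreasing G(., c).
    f_hat(words, c) is assumed to return the model's probability of class c."""
    k = len(top_terms)
    total = 0.0
    for words in docs_of_class:               # d in S[c]
        base = f_hat(words, c)                # f^(d, c)
        for i in range(1, k + 1):
            prefix = set(top_terms[:i])       # w_1, ..., w_i
            d_i = [w for w in words if w not in prefix]
            total += base - f_hat(d_i, c)     # drop in probability for d_i
    return total / (len(docs_of_class) * (k + 1))
```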
## Probability-Based Global Aggregation

Consider a document collection $S$, a predictor $f : D \to C$, and a class $c \in C$. Let $d = (w_1, \ldots, w_\ell)$ be a document in $S[c]$. We consider a simple generative model that produces a random document $X_d = (w'_1, \ldots, w'_\ell)$ of the same length as $d$, assuming that each word is generated randomly, yet anchor words and non-anchor words are taken from different distributions. More precisely, each $w'_i$ is selected randomly and independently using the following process:

- If $(w_i, i) \in \mathrm{Anc}(d)$, then do as follows. With probability $\alpha$, select a word $w \in W(S)$ with the anchor probability $q(w, c)$; with probability $1 - \alpha$ (i.e., the anchor is wrong), select a word $w$ from $W(S)$ with the probability $p(w, c)$.
- Otherwise, select $w \in W(S)$ with probability $p(w, c)$.

Here, $\alpha$ is a parameter. The probabilities $p(w, c)$ and $q(w, c)$ are unknown and are chosen to maximize the probability of $S$:

$$(p^*, q^*) \stackrel{\text{def}}{=} \operatorname{argmax}_{p, q} \prod_{d \in S[c]} \Pr[d = X_d] \qquad (8)$$

Intuitively, words $w$ with a high $q^*(w, c)$ are likely to be used as anchors but not necessarily as non-anchors. Hence, we use $q^*$ as our global aggregation:

$$G_{pr}(w, c) \stackrel{\text{def}}{=} q^*(w, c) \qquad (9)$$

We estimate $q^*$ and $p^*$ via Maximum Likelihood Estimation, using Lagrange multipliers (Sargent 2000) for local maxima; see the long version of this paper (Mor, Belinkov, and Kimelfeld 2023). The resulting estimations $\tilde{q}$ and $\tilde{p}$ are as follows. Recall that $\alpha$ is a parameter, and recall $A^+(w, c)$ and $A^-(w, c)$ from Equations (2) and (3), respectively:

$$\tilde{q}(w, c) = \frac{1}{\alpha} \cdot \frac{A^+(w, c)}{\sum_{w'} A^+(w', c)} - \left(\frac{1}{\alpha} - 1\right) \cdot \frac{A^-(w, c)}{\sum_{w'} A^-(w', c)}, \qquad \tilde{p}(w, c) = \frac{A^-(w, c)}{\sum_{w'} A^-(w', c)} \qquad (10)$$

The resulting $\tilde{q}$ is not necessarily equal to $q^*$, since it is not necessarily positive. To get a probability (which may be different from $q^*$), we apply Laplace smoothing (Chen and Goodman 1999) with the absolute minimum value of $\tilde{q}$ as the smoothing parameter. Let $\tilde{q}_{\min} \stackrel{\text{def}}{=} \min_{w \in W(S[c])} \tilde{q}(w, c)$, and let $L(g, \beta)$ be the Laplace smoothing of the function $g$ with a smoothing parameter $\beta$. We then define $q^*(w, c) = L(\tilde{q}, |\tilde{q}_{\min}|)$ and $p^*(w, c) = \tilde{p}(w, c)$. Laplace smoothing is monotone; thus, it preserves the word ordering: $q^*(w_1, c) \geq q^*(w_2, c)$ whenever $\tilde{q}(w_1, c) \geq \tilde{q}(w_2, c)$.
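A sketch of this estimation, reusing the A+/A- counters from the earlier sketch; the add-β form of the smoothing is our reading of $L(g, \beta)$ and should be taken as an assumption rather than the exact implementation.

```python
def g_pr(a_plus, a_minus, vocab, alpha=0.5):
    """Sketch of G_pr (Equations (9)-(10)): the smoothed estimate q* for one class."""
    sum_plus = sum(a_plus[w] for w in vocab) or 1
    sum_minus = sum(a_minus[w] for w in vocab) or 1
    q_tilde = {w: (a_plus[w] / sum_plus) / alpha
                  - (1 / alpha - 1) * (a_minus[w] / sum_minus)
               for w in vocab}
    beta = abs(min(q_tilde.values()))            # |q~_min|, the smoothing parameter
    denom = sum(q_tilde.values()) + beta * len(vocab)
    # Add-beta (Laplace) smoothing: shifts all values to be non-negative and
    # renormalizes; it is monotone, so the ranking by q~ is preserved.
    return {w: (q_tilde[w] + beta) / denom for w in vocab}
```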
## Runtime Optimizations

In this section, we propose algorithms and optimization techniques for accelerating the computation of the top-$k$ terms.

### Incremental Evaluation (Anytime)

Our goal is to retrieve the top-$k$ words in the dataset. Thus, we adopt an incremental evaluation that maintains the best top-$k$ found at each step of the algorithm. This allows us to provide the user with informative early results in a pay-as-you-go (anytime) manner. Yet, some of the scores are not known during the computation, since they may require the processing of the entire dataset. Hence, we use a heuristic pseudo-score $\tilde{G}(w, c)$ instead of each exact aggregation $G(w, c)$. We traverse the documents in a predefined order (that we discuss next). Denote the word with the lowest pseudo-score in the top-$k$ group by $w_{\min}$. A word $w$ enters the top-$k$ group if $\tilde{G}(w, c) > \tilde{G}(w_{\min}, c)$. Since the results of the Anchor algorithm are unknown for all documents during its run, the pseudo-score when reaching document $d_i$ is calculated with respect to only the first $i$ documents. Specifically, denote by $A^+_i(w, c)$ and $A^-_i(w, c)$ the values $A^+(w, c)$ and $A^-(w, c)$, respectively, restricted to the first $i$ documents in $S[c]$. For $G_{sq}$, $G_{av}$ and $G_h$, the pseudo-score is calculated simply by replacing $A$ with $A_i$ in the definitions. In the case of $G_{pr}$, we use the estimates $\tilde{q}$ and $\tilde{p}$ of Equation (10) and, again, apply the replacement of $A$ with $A_i$. As for the order, we traversed the documents by descending confidence of the classification, assuming that this encourages anchors early on. (We also tried other orders and got similar results.)

**Candidate filtering.** By utilizing the maintained set of top candidates, we can avoid the expensive computation for an unlikely candidate by applying an optimistic, fast-to-compute upper bound on its score: a word is filtered out if its upper bound is lower than the minimal pseudo-score in the current set of candidates. The upper bound for a word $w$ considers the occurrences $(w, i)$ of the word in the remaining documents $d$, and it is the pseudo-score under the assumption that $(w, i) \in \mathrm{Anc}(d)$ for every such $d$ and $i$. While this bound might appear optimistic, this optimization accounts for a 30%-40% reduction of the computation's running time.

### Accelerating Anchor

Recall that a major challenge in computing the top terms is the expensive computation of anchors. Yet, we are interested in an aggregation over many applications of the Anchor algorithm, and so, we might settle for lower accuracy and confidence. Hence, we change the hyperparameters of the Anchor algorithm, as follows.

**Masking.** Recall that Anchor generates documents by masking words in the instance and filling them using a DistilBERT MLM model. Each mask is filled out of 500 MLM suggestions. We observed that decreasing this number (to just 50 in our experiments) reduced the execution cost without harming the quality, and sometimes even improved it.

**Anchor precision and confidence.** The precision threshold $\tau$ of Equation (1) determines the number of documents that we sample using the replacement alternatives produced by the masking. During runtime, we reduce $\tau$ for a word to reflect its past occurrences as an anchor. Consequently, we generate fewer documents to decide whether the word is an anchor. Instead of $\tau$, we use

$$\tau(w, c) = \tau - \omega \cdot G_{pr}(w, c)/N_w \qquad (11)$$

where $N_w$ is the number of occurrences of $w$ in the dataset, and $\omega$ is a hyperparameter. We use $\omega = 0.4$ in our experiments, so that $\tau(w, c)$ cannot get below 0.55. In addition, the Anchor implementation by Ribeiro, Singh, and Guestrin (2018) estimates, for each anchor candidate $w$, a confidence score that the inequality of Equation (1) is correct. To be an anchor, this confidence needs to exceed $1 - \delta$. Higher values of $\delta$ lead to a reduction in the number of needed samples. Hence, since we aggregate results, we can settle for higher values of $\delta$.

### Additional Optimizations

In addition to the optimizations described in this section, we applied several accelerations that are fairly straightforward and standard. First, we filtered out stop words from the set of top-term candidates, as we perceive them as non-informative terms regarding the explanation of the model. We also filtered out rare words, where we defined a word to be rare if it occurs fewer times than some threshold (5 in our experiments). Our experiments confirm that this filtering provides considerable acceleration for a negligible loss of quality. Finally, we also experiment with applying the entire algorithm to a sample of the documents rather than the entire training set.
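A minimal sketch of the anytime loop with the candidate filtering described above; the helpers `pseudo_score`, `upper_bound`, and `run_anchor` stand for the components discussed in this section and are assumptions, not the actual API of our release.

```python
def anytime_top_k(docs, k, pseudo_score, upper_bound, run_anchor):
    """Anytime top-k maintenance with candidate filtering (a sketch).

    docs:             S[c], traversed by descending classification confidence.
    pseudo_score(w):  current pseudo-score, computed from A+_i and A-_i.
    upper_bound(w):   optimistic score assuming every remaining occurrence
                      of w is an anchor.
    run_anchor(d, t): the (expensive) anchor decision for token t in d;
                      assumed to update the running counts A+_i / A-_i.
    """
    candidates = set()
    for d in docs:
        for i, w in enumerate(d):
            if len(candidates) >= k and w not in candidates:
                g_min = min(pseudo_score(v) for v in candidates)
                if upper_bound(w) <= g_min:
                    continue              # cannot beat the current top-k: skip
            run_anchor(d, (w, i))
            candidates.add(w)
            if len(candidates) > k:       # evict the weakest candidate
                candidates.remove(min(candidates, key=pseudo_score))
        # early, informative result after every document
        yield sorted(candidates, key=pseudo_score, reverse=True)
```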
## Experiments and Results

In our experimental study, we aim to understand the effectiveness of the aggregations and optimizations that we proposed. More precisely, we investigate, empirically, how well and how fast each alternative finds the top-$k$ terms. The code and data are available at https://github.com/alonm16/anchor.

We find the top-$k$ terms for several classification tasks, using $k = 20$. In each task, the documents are organized into three collections: training, validation, and test. The predictor $f$ is a classifier trained on the training set, chosen as the model checkpoint with the best validation-set accuracy. We experimented with three models: logistic regression, BERT (Devlin et al. 2018), and DeBERTa (He et al. 2020). We report the results only for DeBERTa; the other two behave similarly and are discussed in the long version (Mor, Belinkov, and Kimelfeld 2023). The set $S$ of documents that we aggregate over is the test set, as in the experimental study of van der Linden, Haned, and Kanoulas (2019). All experiments were conducted on a machine with Intel Xeon Gold 6336Y 2.40GHz CPUs (24 cores each, 96 logical CPUs overall), 512GB of RAM, and eight 50GB Nvidia A40 GPUs, running Ubuntu 20.04 LTS. The algorithms were programmed in Python 3.10 with the libraries CUDA 11.6, PyTorch 2.0, and NumPy 1.23.

**Classification tasks.** We restricted each dataset to documents of at most 200 characters, since the influence of each token on a longer text diminishes, and we got fewer anchors and less meaningful results. (The length bound also influences the execution cost; our experiments show that the runtime grows linearly with this bound.) We used the following tasks from Kaggle; URL references to the tasks can be found in the long version (Mor, Belinkov, and Kimelfeld 2023).

- *Coronavirus tweets (sentiment).* Tweets classified into five sentiments: extremely negative, negative, neutral, positive, and extremely positive. We combined extremely negative with negative, and extremely positive with positive, since there were too few anchors that distinguish between the extreme and normal classes. The dataset consists of 16,000, 4200, and 12,500 documents (training, validation, test).
- *The Social Dilemma tweets (sentiment).* Tweets classified into three sentiments: positive, negative, and neutral. The dataset consists of 3200, 1000, and 3000 documents.
- *Amazon reviews: Toys & Games (spam).* Amazon product reviews in the Toys and Games category, classified into spam and non-spam. The dataset consists of 15,000, 3800, and 11,000 documents.
- *Amazon reviews: Home & Kitchen (spam).* Similar to the previous one, but in the category Home and Kitchen. The dataset consists of 12,000, 3000, and 9000 documents.

**Compared aggregations.** We compare the aggregation functions proposed in the paper, as well as their optimized versions. The list of functions includes $G_{sq}$ (Equation (4)), $G_{av}$ (Equation (5)), $G_h$ (Equation (6)), and $G_{pr}$ (Equation (9)) with $\alpha = 0.5$. As for $G_{av}$, we also experiment with a quick fix of its weakness of promoting rare words: the function $G^{+5}_{av}$ is the same as $G_{av}$, except that every word is filtered out if it has fewer than five occurrences in the training set. As baselines, we use two aggregations $G(w, c)$. The first ignores the Anchor algorithm and simply measures the probability of class $c$ among the documents that include $w$:

$$G_{base}(w, c) \stackrel{\text{def}}{=} \frac{|\{d \in S[c] \mid w \in d\}|}{|\{d \in S \mid w \in d\}|} \qquad (12)$$

(Formally, $w \in d$ means $(w, i) \in d$ for some $i$.) The second is $G^{-1}_{pr} \stackrel{\text{def}}{=} 1/G_{pr}$, the inverse of $G_{pr}$, used as a sanity check.
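For completeness, a one-function sketch of the $G_{base}$ baseline; the `labels` argument and the document encoding are our illustration.

```python
def g_base(w, docs, labels, c):
    """G_base of Equation (12): among the documents that contain w,
    the fraction classified as c. `docs` are lists of words; `labels`
    are the predicted classes f(d)."""
    containing = [lab for words, lab in zip(docs, labels) if w in set(words)]
    return containing.count(c) / len(containing) if containing else 0.0
```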
### Results

Figure 2: $\mathrm{AOPC}_k(G, c)$ of different aggregation functions $G$ (y-axis) for varying $k$ (x-axis).

Figure 2 shows the results for the aggregation functions. Each box corresponds to one task and one class $c$, and it shows $\mathrm{AOPC}_k(G, c)$ for the top terms at different times. Each line includes a shaded error band based on repetitions with 5 seeds. Observe that $G_{pr}$ consistently begins with the steepest curve, and its overall height exceeds the baselines. Also note that $G_{av}$ is inferior to the rest. In contrast, $G_{sq}$ and $G_h$ perform similarly to $G_{pr}$. Still, we later inspect the three in several case studies and argue that the results of $G_{pr}$ are more useful, even though this is not captured by $\mathrm{AOPC}_k$.

Figure 3: $\mathrm{AOPC}_k(G, c)$ for different versions $G$ of $G_{pr}$ and $k = 20$, as a function of the computation time. "Optimized" refers to the combination of all runtime optimizations described in the paper.

Figure 4: $\mathrm{AOPC}_k(G, c)$ for different optimizations of $G_{pr}$ and $k = 20$, as a function of the computation time. "Optimized" refers to the combination of all optimizations.

Figures 3 and 4 show the quality of $T_c$ when applying the aggregation in an anytime manner. We measure the quality of the set of $k$ candidates at different times during the computation, until completion. Hence, each chart shows the change of $\mathrm{AOPC}_k$ over time, as well as the total running time. Figure 3 shows that increasing $\delta$ improves both the running time and the quality of $G_{pr}$. An exception is "dilemma +", where we get a reduced runtime but a slight decrease in quality. This suggests that the aggregation compensates for the reduction in confidence (number of samples) for the Anchor algorithm. It also confirms the conjecture that the default value of $\delta$, namely 0.1, is too strict within aggregation. Combining the remaining optimizations shortens the overall running time; while it reduced the quality in the "corona +" experiment, it generally did not impair and sometimes improved the quality (e.g., "corona -" and "home-spam -/+"). This experiment shows that, overall, the optimizations are highly beneficial. Figure 4 shows that each optimization alone accelerates the computation. The confidence optimization (Equation (11)) incurs the most significant reduction in quality. In contrast, the masking optimization consistently improves the overall quality. Applying all optimizations together yields the shortest running time, but generally at the cost of lowering the $\mathrm{AOPC}_k$ scores (with the exception of "corona -", where the impact on the score is positive).

Figure 5: Ratio of shared terms for different versions of $G_{pr}$.

In Figure 5, we measure the percentage of shared terms between the $T_c$ of $G_{pr}$ and those of its optimizations. The intersection is generally around 80%. Note that a drop in the intersection does not necessarily imply a drop in quality, since different terms can have similar quality (as shown in Figures 3 and 4).

### Case Studies

We now discuss several case studies from our experiments. Table 1 shows the top terms for the toy-spam dataset under $G_{pr}$. We can see that terms for spam reviews are mostly positive (promoting products). On the other hand, non-spam reviews use more negative adjectives and commonly include customer complaints. (As an exception, "thought" occurs in non-spams in 80% of the cases, and is typically used negatively to describe disappointment with a product.)

| Class | Top-10 terms |
|---|---|
| toy-spam - | small, broke, disappointed, waste, would, smaller, cheap, thought, money, poor |
| toy-spam + | great, love, loves, fun, favorite, awesome, wonderful, loved, classic, perfect |

Table 1: Top-10 terms: negative vs. positive (no stop words).

Table 2 shows the results for the toy-spam (negative) and home-spam (positive) datasets. We can see that $G_{sq}$ includes many common, insignificant words, such as "my", "it", "and", "the", and so on. The function $G_{av}$ selects rare words such as "mouths", "files", "visited", and "wars". The function $G_{pr}$ is balanced and selects distinctive terms such as "wonderful" and "happy" (as positives) or "broke" and "waste" (as negatives).

| Class | Aggregation | Top-10 terms |
|---|---|---|
| toy-spam - | Gpr | not, but, disappointment, broke, small, it, would, waste, only, price |
| toy-spam - | Gav | narrower, tipping, expanding, mouths, stains, tone, files, obscure, faint, health |
| toy-spam - | Gsq | not, but, it, the, disappointment, to, this, broke, was, small |
| home-spam + | Gpr | great, love, best, excellent, perfect, loves, wonderful, happy, pleased, good |
| home-spam + | Gav | castle, boys, visited, wars, kumar, vibrant, lavender, implement, farewell, silky |
| home-spam + | Gsq | great, love, best, good, and, my, I, perfect, loves, ever |

Table 2: Comparing different aggregations of the toy-spam and home-spam datasets.

Table 3 shows the top terms of the corona dataset for $G_{pr}$ and $G_{sq}$, each term annotated with its position under the other aggregation. We can see, for example, that the positions of common words like "corona" and "19" drop under $G_{pr}$, while less common words that appeared more often as anchors stay at the top. As insights on the dataset, the reader can see that "hand" and "help" are significant to the positive class, where the former is typically used in the context of hand sanitizing.

| Aggregation | Top terms (with position by the other function) |
|---|---|
| Gpr | hand, like, help, good, safe, please, thank, great (0-7), free (9), support (11), thanks (12), well (13), best (15), positive (18), better (21), care (20), love (22), safety (25), relief (26) |
| Gsq | hand, like, help, good, please, safe, thank, great (0-7), 19 (48), free (8), co (751), support (9), thanks (10), well (11), corona (752), best (12), store (33), grocery (42), positive (13) |

Table 3: Top terms of the corona dataset under $G_{pr}$ and $G_{sq}$. Each term is annotated with its position by the other function.

**Counterfactual examples.** In Table 4, we report the following experiment, inspired by the work of Wallace et al. (2019) on the impact of concatenated text on the model's performance. For various tasks and classes, we manually generated short sentences with their top-10 terms of $G_{pr}$ (Optimized). Each sentence is appended to all documents with an opposite label. We then measured the drop in the overall accuracy of the classifier due to the change; that is, we compared the accuracy before and after the change. We then repeated the measurement when replacing the term with another word of a similar nature. Importantly, the word replacement is such that the meaning of the sentence should barely impact the classification, so one could expect similar drops. Nevertheless, we can see that applying this change reduces the accuracy by a considerable amount (around 50%). This suggests that our method indeed finds significant terms for the model.

| Dataset | Addition | Drop (%) |
|---|---|---|
| toy-spam - | I was a bit (disappointed / unsatisfied) with this game's performance. | 32/2 |
| toy-spam - | It is just a (small / usual) item. | 60/8 |
| toy-spam + | This store contained (classic / board) games. | 52/0 |
| toy-spam + | That game's theme is (love / animals). | 44/6 |
| corona + | I bought (hand / -) sanitizers. | 44/2 |
| corona + | People should (support / back) others more. | 62/1 |
| corona - | People shouldn't fret over this (crisis / situation). | 62/2 |
| corona - | The pandemic affected (crude / -) oil prices. | 57/7 |

Table 4: Sentences with top-10 anchors by $G_{pr}$ (the first word in each pair). Each sentence is appended to all documents with the opposite label. The drop in average accuracy is shown for the anchor / the replacement word.
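A small sketch of this counterfactual measurement, assuming a `predict` callable that maps a document string to a class label; the names are our illustration.

```python
def counterfactual_drop(docs, labels, target_class, trigger, predict):
    """Append `trigger` to every document whose label differs from
    `target_class` and report the drop in overall accuracy."""
    def accuracy(ds):
        return sum(predict(d) == y for d, y in zip(ds, labels)) / len(ds)
    modified = [d + " " + trigger if y != target_class else d
                for d, y in zip(docs, labels)]
    return accuracy(docs) - accuracy(modified)
```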
## Conclusions

We studied the problem of identifying the top-$k$ terms under an aggregation of their identification as anchors or non-anchors in the dataset. We proposed the probabilistic aggregation $G_{pr}$ as a way of accounting for both the frequency of words and their likelihood of being anchors. Global aggregation over the Anchor explanations incurs a prohibitive computational cost. We proposed techniques for considerably accelerating the identification of the top-$k$ terms and showed experimentally that we obtain an anytime solution that is much more useful for online analysis, reducing the time from hours to minutes and seconds. We focused on single-word terms, and it is left for future work to study the case of multi-word terms, where the challenge is bigger due to the number of candidates for the top-$k$ terms. Finally, while the runtime optimizations were applied to Anchor, the general framework can be adapted to any method of local attribution scores, and such adaptations are also left for future work.

## Acknowledgments

This research was supported by the German-Israeli Foundation for Scientific Research and Development (grant I1502-407.6/2019), the German Research Foundation (project 412400621), the Israel Science Foundation (grant 448/20), an Azrieli Foundation Early Career Faculty Fellowship, and an AI Alignment grant from Open Philanthropy.

## References

Arras, L.; Horn, F.; Montavon, G.; Müller, K.; and Samek, W. 2016. Explaining Predictions of Non-Linear Classifiers in NLP. In Rep4NLP@ACL, 1-7. Association for Computational Linguistics.

Atanasova, P.; Simonsen, J. G.; Lioma, C.; and Augenstein, I. 2020. A Diagnostic Study of Explainability Techniques for Text Classification. In EMNLP (1), 3256-3274. Association for Computational Linguistics.

Bastings, J.; Ebert, S.; Zablotskaia, P.; Sandholm, A.; and Filippova, K. 2022. Will You Find These Shortcuts? A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification. In EMNLP, 976-991. Association for Computational Linguistics.

Chen, S. F.; and Goodman, J. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4): 359-394.

Danilevsky, M.; Qian, K.; Aharonov, R.; Katsis, Y.; Kawas, B.; and Sen, P. 2020a. A Survey of the State of Explainable AI for Natural Language Processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 447-459. Suzhou, China: Association for Computational Linguistics.

Danilevsky, M.; Qian, K.; Aharonov, R.; Katsis, Y.; Kawas, B.; and Sen, P. 2020b. A Survey of the State of Explainable AI for Natural Language Processing. In AACL/IJCNLP, 447-459. Association for Computational Linguistics.

Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Ebert, S.; Jakobovits, A. S.; and Filippova, K. 2022. Understanding Text Classification Data and Models Using Aggregated Input Salience. CoRR, abs/2211.05485.

Gholizadeh, S.; and Zhou, N. 2021. Model Explainability in Deep Learning Based Natural Language Processing. CoRR, abs/2106.07410.

Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; and Pedreschi, D. 2019. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv., 51(5): 93:1-93:42.

Guidotti, R.; Monreale, A.; Turini, F.; Pedreschi, D.; and Giannotti, F. 2018. A Survey Of Methods For Explaining Black Box Models. CoRR, abs/1802.01933.

He, P.; Liu, X.; Gao, J.; and Chen, W. 2020. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. CoRR, abs/2006.03654.

Lertvittayakumjorn, P.; Specia, L.; and Toni, F. 2020. FIND: Human-in-the-Loop Debugging Deep Text Classifiers. In EMNLP (1), 332-348. Association for Computational Linguistics.

Lu, J.; Zhang, C.; Xie, Z.; Ling, G.; Zhou, T. C.; and Xu, Z. 2019. Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection. In ACL (1), 44-50.
Association for Computational Linguistics.

Lundberg, S. M.; and Lee, S. 2017. A unified approach to interpreting model predictions. CoRR, abs/1705.07874.

Mor, A.; Belinkov, Y.; and Kimelfeld, B. 2023. Accelerating the Global Aggregation of Local Explanations. arXiv:2312.07991.

Pruthi, G.; Liu, F.; Kale, S.; and Sundararajan, M. 2020. Estimating Training Data Influence by Tracing Gradient Descent. In NeurIPS.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. CoRR, abs/1602.04938.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-Precision Model-Agnostic Explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Samek, W.; Binder, A.; Montavon, G.; Bach, S.; and Müller, K. 2015. Evaluating the visualization of what a Deep Neural Network has learned. CoRR, abs/1509.06321.

Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Sargent, R. 2000. Optimal control. Journal of Computational and Applied Mathematics, 124(1): 361-371. Numerical Analysis 2000. Vol. IV: Optimization and Nonlinear Equations.

Shrikumar, A.; Greenside, P.; Shcherbina, A.; and Kundaje, A. 2016. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. arXiv, abs/1605.01713.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR, abs/1312.6034.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In ICML, volume 70 of Proceedings of Machine Learning Research, 3319-3328. PMLR.

van der Linden, I.; Haned, H.; and Kanoulas, E. 2019. Global Aggregations of Local Explanations for Black Box models. CoRR, abs/1907.03039.

Wallace, E.; Feng, S.; Kandpal, N.; Gardner, M.; and Singh, S. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., EMNLP-IJCNLP, 2153-2162. Hong Kong, China: Association for Computational Linguistics.

Xie, Q.; Ma, X.; Dai, Z.; and Hovy, E. H. 2017. An Interpretable Knowledge Transfer Model for Knowledge Base Completion. In ACL (1), 950-962. Association for Computational Linguistics.

Zeiler, M. D.; and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks. In ECCV (1), volume 8689 of Lecture Notes in Computer Science, 818-833. Springer.