# Evaluating and Aggregating Feature-based Model Explanations

Umang Bhatt¹,², Adrian Weller¹,³ and José M. F. Moura²
¹University of Cambridge, ²Carnegie Mellon University, ³The Alan Turing Institute
{usb20, aw665}@cam.ac.uk, moura@ece.cmu.edu

A feature-based model explanation denotes how much each input feature contributes to a model's output for a given data point. As the number of proposed explanation functions grows, we lack quantitative evaluation criteria to help practitioners know when to use which explanation function. This paper proposes quantitative evaluation criteria for feature-based explanations: low sensitivity, high faithfulness, and low complexity. We devise a framework for aggregating explanation functions. We develop a procedure for learning an aggregate explanation function with lower complexity and then derive a new aggregate Shapley value explanation function that minimizes sensitivity.

## 1 Introduction

There has been great interest in understanding black-box machine learning models via post-hoc explanations. Much of this work has focused on feature-level importance scores for how much a given input feature contributes to a model's output. These techniques are popular amongst machine learning scientists who want to sanity check a model before deploying it in the real world [Bhatt et al., 2020]. Many feature-based explanation functions are gradient-based techniques that analyze the gradient flow through a model to determine salient input features [Shrikumar et al., 2017; Sundararajan et al., 2017]. Other explanation functions perturb input values to a reference value and measure the change in the model's output [Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017]. With many candidate explanation functions, machine learning practitioners find it difficult to pick which explanation function best captures how a model reaches a specific output for a given input. Though there has been work on qualitatively evaluating feature-based explanation functions with human subjects [Lage et al., 2019], there has been little exploration into formalizing quantitative techniques for evaluating model explanations. Recent work has created auxiliary tasks to test if attribution is assigned to relevant inputs [Yang and Kim, 2019] and has developed tools to verify if the features important to an explanation function are relevant to the model itself [Camburu et al., 2019].

Borrowing from the humanities, we motivate three criteria for assessing a feature-based explanation: sensitivity, faithfulness, and complexity. Philosophy of science research has advocated for explanations that vary proportionally with changes in the system being explained [Lipton, 2003]; as such, explanation functions should be insensitive to perturbations in the model inputs, especially if the model output does not change. Capturing relevancy faithfully is helpful in an explanation [Ruben, 2015]. Since humans cannot process a lot of information at once, some have argued for minimal model explanations that contain only relevant and representative features [Batterman and Rice, 2014]; therefore, an explanation should not be complex (i.e., it should use few features). In this paper, we first define these three distinct criteria: low sensitivity, high faithfulness, and low complexity. With many explanation function choices, we then propose methods for learning an aggregate explanation function that combines explanation functions.
If we want to find the simplest explanation from a set of explanations, then we can aggregate explanations to minimize the complexity of the resulting explanation. If we want to learn a smoother explanation function that varies slowly as inputs are perturbed, we can leverage an aggregation scheme that learns a less sensitive explanation function. To the best of our knowledge, we are the first to rigorously explore aggregation of various explanations while placing explanation evaluation on an objective footing. To that end, we highlight the contributions of this paper:

- We describe three desirable criteria for feature-based explanation functions: low sensitivity, high faithfulness, and low complexity.
- We develop an aggregation framework for combining explanation functions.
- We create two techniques that reduce explanation complexity by aggregating explanation functions.
- We derive an approximation for Shapley-value explanations by aggregating explanations from a point's nearest neighbors, minimizing explanation sensitivity and resembling how humans reason in medical settings.

## 2 Preliminaries

Restricting to supervised classification settings, let $f$ be a black-box predictor that maps an input $x \in \mathbb{R}^d$ to an output $f(x) \in \mathcal{Y}$. An explanation function $g$ from a family of explanation functions $\mathcal{G}$ takes in a predictor $f$ and a point of interest $x$ and returns importance scores $g(f, x) = \phi_x \in \mathbb{R}^d$ for all features, where $g(f, x)_i = \phi_{x,i}$ (simplified to $\phi_i$ in context) is the importance of (or attribution for) feature $x_i$ of $x$. By $g_j$, we refer to a particular explanation function, usually from a set of explanation functions $G_m = \{g_1, g_2, \ldots, g_m\}$. We denote by $D : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ a distance metric over explanations, while $\rho : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ denotes a distance metric over the inputs. An evaluation criterion $\mu$ takes in a predictor $f$, an explanation function $g$, and an input $x$, and outputs a scalar $\mu(f, g; x)$. $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ refers to a dataset of input-output pairs, and $\mathcal{D}_x$ denotes all $x_i$ in $\mathcal{D}$.

## 3 Evaluating Explanations

With the number of techniques to develop feature-level explanations growing in the explainability literature, picking which explanation function $g$ to use can be difficult. In order to study the aggregation of explanation functions, we define three desiderata of an explanation function $g$.

### 3.1 Desideratum: Low Sensitivity

We want to ensure that, if inputs are near each other and their model outputs are similar, then their explanations should be close to each other. Assuming $f$ is differentiable, we desire an explanation function $g$ to have low sensitivity in the region around a point of interest $x$, implying local smoothness of $g$. While [Melis and Jaakkola, 2018] codified the property, [Ghorbani et al., 2019] empirically tested explanation function sensitivity. We follow the convention of the former and define max sensitivity and average sensitivity in the neighborhood of a point of interest $x$. Let $N_r = \{z \in \mathcal{D}_x \mid \rho(x, z) \leq r,\ f(x) = f(z)\}$ be the neighborhood of datapoints within a radius $r$ of $x$.

Definition 1 (Max Sensitivity). Given a predictor $f$, an explanation function $g$, distance metrics $D$ and $\rho$, a radius $r$, and a point $x$, we define the max sensitivity of $g$ at $x$ as:

$$\mu_M(f, g, r; x) = \max_{z \in N_r} D\big(g(f, x), g(f, z)\big)$$
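Before turning to the average variant, the following is a minimal Python sketch (our illustration, not code from the paper) of how the max sensitivity of Definition 1 could be estimated. It assumes `f` returns a vector of class scores, `explain` is any attribution function $g(f, \cdot)$, and the neighborhood is built from a pool of candidate points with $\rho = \ell_\infty$ and $D = \ell_2$, matching the choices used later in the experiments; the average sensitivity defined next would simply replace the max with a mean over draws from $P_x$.

```python
import numpy as np

def max_sensitivity(f, explain, x, candidates, radius, rho=None, dist=None):
    """Estimate the max sensitivity of Definition 1 at a point x.

    `explain(f, x)` returns an attribution vector; `candidates` is a pool of
    datapoints from which the neighborhood N_r is built. By default, rho is the
    l-infinity distance over inputs and dist the l-2 distance over explanations.
    """
    rho = rho or (lambda a, b: np.max(np.abs(a - b)))
    dist = dist or (lambda a, b: np.linalg.norm(a - b))
    pred_x = np.argmax(f(x))                      # assumes f returns class scores
    # N_r: nearby points that receive the same prediction as x
    neighbors = [z for z in candidates
                 if rho(x, z) <= radius and np.argmax(f(z)) == pred_x]
    if not neighbors:
        return 0.0
    phi_x = explain(f, x)
    return max(dist(phi_x, explain(f, z)) for z in neighbors)
```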
Definition 2 (Average Sensitivity). Given a predictor $f$, an explanation function $g$, distance metrics $D$ and $\rho$, a radius $r$, and a distribution $P_x(\cdot)$ over the inputs centered at a point $x$, we define the average sensitivity of $g$ at $x$ as:

$$\mu_A(f, g, r; x) = \int D\big(g(f, x), g(f, z)\big)\, P_x(z)\, dz$$

### 3.2 Desideratum: High Faithfulness

Faithfulness has been defined in [Yeh et al., 2019]. The feature importance scores from $g$ should correspond to the important features of $x$ for $f$; as such, when we set particular features $x_s$ to a baseline value $\bar{x}_s$, the change in the predictor's output should be proportional to the sum of attribution scores of the features in $x_s$. We measure this as the correlation between the sum of the attributions of $x_s$ and the difference in output when setting those features to a reference baseline. For a subset of indices $S \subseteq \{1, 2, \ldots, d\}$, $x_s = \{x_i,\ i \in S\}$ denotes a sub-vector of input features that partitions the input, $x = x_s \cup x_c$. $x_{[x_s = \bar{x}_s]}$ denotes an input where $x_s$ is set to a reference baseline while $x_c$ remains unchanged: $x_{[x_s = \bar{x}_s]} = \bar{x}_s \cup x_c$. When $|S| = d$, $x_{[x_s = \bar{x}_s]} = \bar{x}$.

Remark (Reference Baselines). Recent work has discussed how to pick a proper reference baseline $\bar{x}$. [Sundararajan et al., 2017] suggests using a baseline where $f(\bar{x}) \approx 0$, while others have proposed taking the baseline to be the mean of the training data. [Chang et al., 2019] notes that the baseline can be learned using generative modeling.

Definition 3 (Faithfulness). Given a predictor $f$, an explanation function $g$, a point $x$, and a subset size $|S|$, we define the faithfulness of $g$ to $f$ at $x$ as:

$$\mu_F(f, g; x) = \operatorname{corr}_{S \in \binom{[d]}{|S|}}\left(\sum_{i \in S} g(f, x)_i,\ f(x) - f\big(x_{[x_s = \bar{x}_s]}\big)\right)$$

For our experiments, we fix $|S|$ and then randomly sample subsets $x_s$ of the fixed size from $x$ to estimate the correlation. Since we do not see all $\binom{[d]}{|S|}$ subsets in our calculation of faithfulness, we may not get an accurate estimate of the criterion. Though hard to codify and even harder to aggregate, faithfulness is desirable, as it demonstrates that an explanation captures which features the predictor uses to generate an output for a given input. Learning global feature importances that highlight, in expectation, which features a predictor relies on is a challenging problem left to future work.

### 3.3 Desideratum: Low Complexity

A complex explanation is one that uses all $d$ features in its explanation of which features of $x$ are important to $f$. Though this explanation may be faithful to the model (as defined above), it may be too difficult for the user to understand (especially if $d$ is large). We define a fractional contribution distribution, where $|\cdot|$ denotes absolute value:

$$P_g(i) = \frac{|g(f, x)_i|}{\sum_{j \in [d]} |g(f, x)_j|}; \qquad P_g = \{P_g(1), \ldots, P_g(d)\}$$

Note that $P_g$ is a valid probability distribution. Let $P_g(i)$ denote the fractional contribution of feature $x_i$ to the total magnitude of the attribution. If every feature had equal attribution, the explanation would be complex (even if it is faithful). The simplest explanation would be concentrated on one feature. We define complexity as the entropy of $P_g$.

Definition 4 (Complexity). Given a predictor $f$, an explanation function $g$, and a point $x$, the complexity of $g$ at $x$ is:

$$\mu_C(f, g; x) = \mathbb{E}_i\big[-\ln(P_g(i))\big] = -\sum_{i=1}^{d} P_g(i) \ln\big(P_g(i)\big)$$
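To make the last two criteria concrete, here is a minimal Python sketch (ours, not part of the paper) of the complexity entropy and of a sampled estimate of faithfulness. It assumes `f` returns a vector of class scores, `phi` is the attribution vector $g(f, x)$, and `baseline` supplies the reference values $\bar{x}$; the subset-sampling strategy mirrors the experimental protocol described later, but the helper names are our own.

```python
import numpy as np

def complexity(phi, eps=1e-12):
    """Entropy of the fractional-contribution distribution (Definition 4)."""
    p = np.abs(phi) / (np.sum(np.abs(phi)) + eps)
    return -np.sum(p * np.log(p + eps))

def faithfulness(f, phi, x, baseline, subset_size, n_subsets=50, rng=None):
    """Estimate faithfulness (Definition 3) by sampling feature subsets of a
    fixed size, replacing them with baseline values, and correlating the summed
    attributions with the drop in the predicted score of the original class."""
    rng = rng or np.random.default_rng(0)
    cls = np.argmax(f(x))                       # class whose score we track
    attr_sums, output_drops = [], []
    for _ in range(n_subsets):
        S = rng.choice(len(x), size=subset_size, replace=False)
        x_masked = x.copy()
        x_masked[S] = baseline[S]               # set x_s to the reference baseline
        attr_sums.append(phi[S].sum())
        output_drops.append(f(x)[cls] - f(x_masked)[cls])
    return np.corrcoef(attr_sums, output_drops)[0, 1]
```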
## 4 Aggregating Explanations

Given a trained predictor $f$, a set of explanation functions $G_m = \{g_1, \ldots, g_m\}$, a criterion to optimize $\mu$, and a set of inputs $\mathcal{D}_x$, we want to find an aggregate explanation function $g_{agg}$ that satisfies $\mu$ at least as well as any $g_i \in G_m$. Let $h(\cdot)$ represent some function that combines $m$ explanations into a consensus $g_{agg} = h(G_m)$. We now explore different candidates for $h(\cdot)$.

### 4.1 Convex Combination

Suppose we have two different explanation functions $g_1$ and $g_2$ and have chosen a criterion $\mu$ to evaluate a $g$. Consider an aggregate explanation $g_{agg} = h(g_1, g_2)$. A potential $h(\cdot)$ is a convex combination, where $g_{agg} = h(g_1, g_2) = w\, g_1 + (1 - w)\, g_2 = \mathbf{w}^\top G_m$.

Proposition 1. If $D$ is the $\ell_2$ distance and $\mu = \mu_A$ (average sensitivity), the following holds:

$$\mu_A(g_{agg}) \leq w\, \mu_A(g_1) + (1 - w)\, \mu_A(g_2)$$

Proof. Assuming $P_x(z)$ is uniform, we can apply the triangle inequality and the convexity of $D$ to arrive at the above.

A convex combination of explanation functions thus yields an aggregate explanation function that is at most as sensitive as any of the explanation functions taken alone. In order to learn $w$ given $g_1$ and $g_2$, we set up the following objective:

$$w^* = \arg\min_{w} \mathbb{E}_{x \in \mathcal{D}_x}\big[\mu_A(g_{agg}(f, x))\big] \quad (1)$$

Assuming a uniform distribution around all $x \in \mathcal{D}_x$, we can rewrite this as:

$$w^* = \arg\min_{w} \iint D\big(g_{agg}(x), g_{agg}(z)\big)\, P_x(z)\, dz\, dx$$

By Cauchy-Schwarz, we get the following:

$$w^* \leq \arg\min_{w} \iint D(a, b)\, dz\, dx$$

where $a = w\, g_1(f, x) + (1 - w)\, g_2(f, x)$ and $b = w\, g_1(f, z) + (1 - w)\, g_2(f, z)$. This implies that the objective is minimized when one element of the weight vector is 0 and the other is 1. Therefore, a convex combination of two explanation functions, found by solving Equation (1), will be at most as sensitive as the least sensitive explanation function.

### 4.2 Centroid Aggregation

Another sensible candidate for $h(\cdot)$ to combine $m$ explanation functions is based on centroids with respect to some distance function $D : \mathcal{G} \times \mathcal{G} \to \mathbb{R}$, so that:

$$g_{agg} \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_{g_i \in G_m}\big[D(g, g_i)^p\big] = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{m} D(g, g_i)^p$$

where $p$ is a positive constant. The simplest examples of distances are the $\ell_2$ and $\ell_1$ distances with real-valued attributions, where $\mathcal{G} \subseteq \mathbb{R}^d$.

Proposition 2. When $D$ is the $\ell_2$ distance and $p = 2$, the aggregate explanation is the feature-wise sample mean:

$$g_{agg}(f, x) = g_{avg}(f, x) = \frac{1}{m} \sum_{i=1}^{m} g_i(f, x) \quad (2)$$

Proposition 3. When $D$ is the $\ell_1$ distance and $p = 1$, the aggregate explanation is the feature-wise sample median:

$$g_{agg}(f, x) = \operatorname{med}\{G_m\}$$

Propositions 2 and 3 follow from standard results in statistics: the mean minimizes the sum of squared differences and the median minimizes the sum of absolute deviations [Berger, 2013]. We could obtain rank-valued attributions by taking any quantitative vector-valued attributions and ranking features according to their values. If $D$ is the Kendall tau distance with rank-valued attributions, where $\mathcal{G} \subseteq S_d$ (the set of permutations over $d$ features), then the aggregation mechanism resulting from computing the centroid is the Kemeny-Young rule. For rank-valued attributions, any aggregation mechanism falls under the rank aggregation problem in social choice theory, for which many practical voting rules exist [Bhatt et al., 2019a].

We now analyze the error of a candidate $g_{agg}$. Suppose the optimal explanation for $x$ using $f$ is $g^*(f, x)$ and suppose $g_{agg}$ is the mean explanation for $x$ in Equation (2). Let $\epsilon_{i,x} = \|g^*(f, x) - g_i(f, x)\|$ be the error between the optimal explanation and the $i$th explanation function.

Proposition 4. The error between the aggregate explanation $g_{agg}(f, x)$ and the optimal explanation $g^*(f, x)$ satisfies:

$$\mathbb{E}_{x \in \mathcal{D}_x}\big[\epsilon_{agg,x}\big] \leq \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \epsilon_{j, x_i}$$

Proof. For a fixed $x$, we have:

$$\epsilon_{agg,x} = \|g^*(f, x) - g_{agg}(f, x)\| = \frac{1}{m}\Big\|m\, g^*(f, x) - \sum_{i=1}^{m} g_i(f, x)\Big\| \leq \frac{1}{m} \sum_{i=1}^{m} \|g^*(f, x) - g_i(f, x)\| = \frac{1}{m} \sum_{i=1}^{m} \epsilon_{i,x}$$

Averaging across $\mathcal{D}_x$, we obtain the result.

Hence, by aggregating, we do at least as well as the average error of using a single explanation function alone.
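The centroid aggregates above reduce to one-liners in practice. The sketch below (our illustration, not the paper's code) stacks the $m$ attribution vectors row-wise and also shows one simple way to fit the convex weight of Equation (1), using a grid search over a Monte Carlo estimate of average sensitivity; the helper names and the grid-search strategy are our own assumptions, not the authors' procedure.

```python
import numpy as np

def aggregate_mean(attributions):
    """Feature-wise sample mean (Proposition 2): the l2, p=2 centroid."""
    return np.mean(attributions, axis=0)

def aggregate_median(attributions):
    """Feature-wise sample median (Proposition 3): the l1, p=1 centroid."""
    return np.median(attributions, axis=0)

def convex_weight_grid(phi1, phi2, neighborhood_pairs, n_grid=101):
    """Pick w in [0, 1] for g_agg = w*g1 + (1-w)*g2 by grid search over a Monte
    Carlo estimate of average sensitivity (Equation 1). `phi1`/`phi2` map a point
    to its attribution vector; `neighborhood_pairs` is a list of (x, z) pairs
    with z drawn from the neighborhood of x."""
    def avg_sens(w):
        dists = [np.linalg.norm((w * phi1(x) + (1 - w) * phi2(x))
                                - (w * phi1(z) + (1 - w) * phi2(z)))
                 for x, z in neighborhood_pairs]
        return np.mean(dists)
    grid = np.linspace(0.0, 1.0, n_grid)
    return grid[int(np.argmin([avg_sens(w) for w in grid]))]
```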
Many gradient-based explanation functions fit to noise [Hooker et al., 2019]. One way to reduce noise is to aggregate by ensembling or averaging: as shown in Proposition 4, the error of the aggregate is at most the average error of the individual explanation functions.

## 5 Lowering Complexity Via Aggregation

In this section, we describe iterative algorithms for aggregating explanation functions to obtain a $g_{agg}(f, x)$ with lower complexity while combining $m$ candidate explanation functions $G_m = \{g_1, \ldots, g_m\}$. We desire a $g_{agg}(f, x)$ that contains information from all candidate explanations $g_i(f, x)$ yet has entropy less than or equal to that of each explanation $g_i(f, x)$. As discussed, a reasonable candidate for an aggregate explanation function is the sample mean given by Equation (2). We may want $g_{agg}(f, x)$ to approach the sample mean $g_{avg}(f, x)$; however, the sample mean may have greater complexity than each $g_i(f, x)$.

For example, let $g_1(f, x) = [1, 0]^T$ and $g_2(f, x) = [0, 1]^T$. The sample mean is $g_{avg}(f, x) = [0.5, 0.5]^T$. Both $g_1$ and $g_2$ have the minimum possible complexity of 0, while $g_{avg}$ has the maximum possible complexity, $\log(2)$. Our aggregation technique must ensure that $g_{agg}(f, x)$ approaches $g_{avg}(f, x)$ while guaranteeing that $g_{agg}(f, x)$ has complexity less than or equal to that of each $g_i(f, x)$. We now present two approaches for learning a lower-complexity explanation, visually represented in Figure 1.

Figure 1: Visual examples of the two complexity-lowering aggregation algorithms: the gradient-descent style (a) and region shrinking (b) methods, using explanation functions $g_1$, $g_2$, $g_3$.

### 5.1 Gradient-Descent Style Method

Our first approach is similar to gradient descent. Starting from each $g_i(f, x)$, we iteratively move towards $g_{avg}(f, x)$ in each of the $d$ directions (i.e., changing the $k$th feature by a small amount) if the complexity decreases with that move. We stop moving when the complexity no longer decreases or $g_{avg}(f, x)$ is reached. Simultaneously, we start from $g_{avg}(f, x)$ and iteratively move towards each $g_i(f, x)$ in each of the $d$ directions if the complexity decreases. We stop moving when the complexity no longer decreases or any of the $g_i(f, x)$ is reached. The final $g_{agg}(f, x)$ is the location with the smallest complexity across these 2d different walks. Since we only move if the complexity decreases and we start from each $g_i(f, x)$, the entropy of $g_{agg}(f, x)$ is guaranteed to be less than or equal to the entropy of every $g_i(f, x)$.

### 5.2 Region Shrinking Method

In our second approach, we consider the closed region $R$ given by the convex hull of all the explanations $g_i(f, x)$. Notice that region $R$ initially contains $g_{avg}$. We use an iterative approach to find the minimum-complexity point in the region $R$. As before, we consider the convex combination formed by two explanation functions, $g_i$ and $g_j$. Using convex optimization, we find the value on the line segment between $g_i$ and $g_j$ that has the minimum complexity; essentially, we iteratively shrink the region. For the region shrinking method, the convex combination formed by $g_i$ and $g_j$ is:

$$w\, g_i + (1 - w)\, g_j, \quad w \in [0, 1]$$

For every pair of functions in $G_m$, we find the point that attains the minimum complexity on the convex combination of the pair, producing a new set of candidates $G'_m$. $g_{agg}$ is the element of $G'_m$ with minimal complexity after $K$ iterations. In each iteration, a point is kept only if it has the minimum complexity of all points in a convex combination. Thus, the minimum complexity of the set $G'_m$ decreases or remains constant with each iteration.
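The region-shrinking procedure can be read as a pairwise line search; below is a minimal Python sketch under that reading (our code, not the authors'). A grid search over $w$ stands in for the convex optimization step, and the candidate set is simply rebuilt from all pairs each round, so `n_iters` should be kept small; these are our simplifying assumptions.

```python
import numpy as np

def entropy_complexity(phi, eps=1e-12):
    """Complexity (Definition 4): entropy of the fractional contributions."""
    p = np.abs(phi) / (np.sum(np.abs(phi)) + eps)
    return -np.sum(p * np.log(p + eps))

def region_shrink(attributions, n_iters=5, n_grid=51):
    """Region-shrinking aggregation sketch: repeatedly replace every pair of
    candidate explanations by the minimum-complexity point on the segment
    joining them, then return the least complex candidate."""
    candidates = [np.asarray(a, dtype=float) for a in attributions]
    grid = np.linspace(0.0, 1.0, n_grid)   # line search in place of convex optimization
    for _ in range(n_iters):
        if len(candidates) < 2:
            break
        new_candidates = []
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                points = [w * candidates[i] + (1 - w) * candidates[j] for w in grid]
                new_candidates.append(min(points, key=entropy_complexity))
        candidates = new_candidates   # note: this set can grow combinatorially
    return min(candidates, key=entropy_complexity)
```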
## 6 Lowering Sensitivity Via Aggregation

To construct an aggregate explanation function $g$ that minimizes sensitivity, we would need to ensure that a test point's explanation is a function of the explanations of its nearest neighbors under $\rho$. This is a natural analog for how humans reason: we use past similar events (training data) and facts about the present (individual features) to make decisions [Bhatt et al., 2019b]. We now contribute a new explanation function $g_{AVA}$ that combines the Shapley value explanations of a test point's nearest neighbors to explain the test point.

### 6.1 Shapley Value Review

Borrowing from game theory, Shapley values denote the marginal contributions of a player to the payoff of a coalitional game. Let $T$ be the number of players and let $v : 2^{[T]} \to \mathbb{R}$ be the characteristic function, where $v(S)$ denotes the worth (contribution) of the players in $S \subseteq [T]$. The Shapley value of player $i$'s contribution (averaging player $i$'s marginal contributions over all possible subsets $S$) is:

$$\phi_i(v) = \sum_{S \subseteq [T] \setminus \{i\}} \frac{|S|!\,(T - |S| - 1)!}{T!}\,\big(v(S \cup \{i\}) - v(S)\big)$$

Let $\Phi \in \mathbb{R}^T$ be the Shapley value contribution vector for all players in the game, where $\phi_i(v)$ is the $i$th element of $\Phi$.

### 6.2 Shapley Values as Explanations

In the feature importance literature, we formulate a similar problem where the game's payoff is the predictor's output $y = f(x)$, the players are the $d$ features of $x$, and the $\phi_i$ values represent the contribution of $x_i$ to the game $f(x)$. Let the characteristic function give the importance score of a subset of features $x_s$, where $\mathbb{E}_Y[\cdot \mid x]$ denotes an expectation over $P_f(\cdot \mid x)$:

$$v_x(S) = -\,\mathbb{E}_Y\!\left[\log \frac{1}{P_f(Y \mid x_s)} \,\middle|\, x\right]$$

This characteristic function denotes the negative of the expected number of bits required to encode the predictor's output based on the features in a subset $S$ [Chen et al., 2019]. Shapley value contributions can be approximated via Monte Carlo sampling [Štrumbelj and Kononenko, 2014] or via weighted least squares [Lundberg and Lee, 2017].

### 6.3 Aggregate Valuation of Antecedents

We now explore how to explain a test point in terms of the Shapley value explanations of its neighbors. Termed Aggregate Valuation of Antecedents (AVA), we derive an explanation function that explains a data point in terms of the explanations of its neighbors. We do the following: suppose we want to find an explanation $g_{AVA}(f, x_{test})$ for a point of interest $x_{test}$. First we find the $k$ nearest neighbors of $x_{test}$ under $\rho$, denoted $N_k(x_{test}, \mathcal{D})$:

$$N_k(x_{test}, \mathcal{D}) = \arg\min_{N \subseteq \mathcal{D}_x,\ |N| = k}\ \sum_{z \in N} \rho(x_{test}, z)$$

We define $g_{AVA}(f, x_{test}) = \Phi_{x_{test}}$ as the explanation function where:

$$g_{AVA}(f, x_{test})_i = \phi_i(v_{AVA}) = \sum_{z \in N_k(x_{test})} \frac{g_{SHAP}(f, z)_i}{\rho(x_{test}, z)} = \sum_{z \in N_k(x_{test})} \frac{\phi_i(v_z)}{\rho(x_{test}, z)}$$

In essence, we weight each neighbor's Shapley value contribution by the inverse of its distance to the test point. AVA is closely related to bootstrap aggregation from classical statistics, as we take an average of model outputs to improve explanation function stability.
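A minimal sketch of AVA (ours, not the authors' released code) could look as follows. It assumes `shap_explain(f, z)` returns the Shapley value vector $\Phi_z$ for a training point (e.g., from any SHAP implementation), takes $\rho = \ell_\infty$ as in the experiments, and normalizes the inverse-distance weights to sum to 1, matching the convex-combination form discussed below.

```python
import numpy as np

def ava_explanation(f, shap_explain, x_test, train_points, k=5, rho=None, eps=1e-12):
    """Aggregate Valuation of Antecedents (AVA), sketched: average the Shapley
    value explanations of the k nearest training points, weighting each by the
    inverse of its distance to the test point (weights normalized to sum to 1)."""
    rho = rho or (lambda a, b: np.max(np.abs(a - b)))      # l-infinity, as in the experiments
    dists = np.array([rho(x_test, z) for z in train_points])
    nearest = np.argsort(dists)[:k]                        # indices of N_k(x_test)
    weights = 1.0 / (dists[nearest] + eps)                 # inverse-distance weights
    weights /= weights.sum()                               # normalized convex combination
    neighbor_phis = np.stack([shap_explain(f, train_points[i]) for i in nearest])
    return weights @ neighbor_phis
```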
Theorem 5. $g_{AVA}(f, x_{test})$ is a Shapley value explanation.

Proof. We want to show that $g_{AVA}(f, x_{test}) = \Phi_{x_{test}}$ is indeed a vector of Shapley values. Let $g_{SHAP}(f, z) = \Phi_z$ be the vector of Shapley value contributions for a point $z \in N_k$. By [Lundberg and Lee, 2017], we know $g_{SHAP}(f, z)_i = \phi_i(v_z)$ is a unique Shapley value for the characteristic function $v_z$. By linearity of Shapley values [Shapley, 1953], we know that:

$$\phi_i(v_{z_1} + v_{z_2}) = \phi_i(v_{z_1}) + \phi_i(v_{z_2}) \quad (3)$$

This means that $\Phi_{z_1} + \Phi_{z_2}$ yields a unique Shapley value contribution vector for the characteristic function $v_{z_1} + v_{z_2}$. By linearity (or additivity), we know for any scalar $\alpha$:

$$\alpha\, \phi_i(v_z) = \phi_i(\alpha\, v_z) \quad (4)$$

This means $\alpha \Phi_z$ yields a unique Shapley value contribution vector for the characteristic function $\alpha v_z$. Now define:

$$\Phi_{x_{test}} = \sum_{z \in N_k(x_{test})} \frac{\Phi_z}{\rho(x_{test}, z)}$$

We can conclude that $\Phi_{x_{test}}$ is a vector of Shapley values.

While [Sundararajan et al., 2017] takes a path integral from a fixed reference baseline $\bar{x}$ and [Lundberg and Lee, 2017] only considers attribution along the straight-line path between $\bar{x}$ and $x_{test}$, AVA takes a weighted average of attributions along paths from training points in $N_k$ to $x_{test}$. AVA can similarly be thought of as a convex combination of explanation functions, where the explanation functions are the explanations of the neighbors of $x_{test}$ and the weights are $\rho(x_{test}, z)^{-1}$. Though the weights are guaranteed to be non-negative, we normalize them to sum to 1 and rewrite the AVA formulation as:

$$g_{AVA}(f, x_{test}) = \frac{\Phi_{x_{test}}}{\rho_{tot}}, \qquad \text{where } \rho_{tot} = \sum_{z \in N_k(x_{test})} \rho(x_{test}, z)^{-1}$$

Notice that this formulation is a specific convex combination as described before; therefore, AVA will result in lower sensitivity than $g_{SHAP}(f, x)$ alone.

### 6.4 Medical Connection

Similar to how a model uses input features to reach an output, medical professionals learn how to proactively search for risk predictors in a patient. Medical professionals not only use patient attributes (e.g., vital signs, personal information) to make a diagnosis but also leverage experiences with past patients; for example, if a doctor treated a rare disease over a decade ago, then that experience can be crucial when attributes alone are uninformative about how to diagnose [Goold and Lipkin Jr, 1999]. This is analogous to close training points affecting a predictor's output. AVA combines the attributions of past training points (past patients) to explain an unseen test point (the current patient). When using the MIMIC dataset [Johnson et al., 2016], AVA models the aforementioned intuition.

## 7 Experiments

We now report some empirical results. We evaluate models trained on the following datasets: Adult, Iris [Dua and Graff, 2017], MIMIC [Johnson et al., 2016], and MNIST [LeCun et al., 1998]. We use the following explanation functions: SHAP [Lundberg and Lee, 2017], Shapley Sampling (SS) [Štrumbelj and Kononenko, 2014], Gradient Saliency (Grad) [Baehrens et al., 2010], Grad*Input (G*I) [Shrikumar et al., 2017], Integrated Gradients (IG) [Sundararajan et al., 2017], and DeepLIFT (DL) [Shrikumar et al., 2017]. For all tabular datasets, we train a multilayer perceptron (MLP) with leaky-ReLU activations using the Adam optimizer. For Iris [Dua and Graff, 2017], we train our model to 96% test accuracy. For Adult [Dua and Graff, 2017], our model has 82% test accuracy. As motivated in Section 6.4, we use MIMIC (Medical Information Mart for Intensive Care III) [Johnson et al., 2016]; we extract seventeen real-valued features deemed critical for sepsis prediction, per [Purushotham et al., 2018]. Our model achieves 91% test accuracy on this task. For MNIST [LeCun et al., 1998], our model is a convolutional neural network with 90% test accuracy. For experiments with a baseline $\bar{x}$, the zero baseline sets features to 0 and the average baseline uses the average feature value in $\mathcal{D}$. Before aggregating, we unit-norm all explanations. For the complexity criterion, we take the positive $\ell_1$ norm. We set $D = \ell_2$ and $\rho = \ell_\infty$.
### 7.1 Faithfulness $\mu_F$

In Table 2, we report faithfulness results for various explanation functions. When evaluating, we take the average of multiple runs where, in each run, we see at least 50 datapoints; for each datapoint, we randomly select $|S|$ features and replace them with baseline values. We then calculate the Pearson correlation coefficient between the predicted logits of each modified test point and the average explanation attribution for only the subset of features. We notice that, as subset size increases, faithfulness increases until the subset is large enough to contain all informative features. We find that Shapley values, approximated with weighted least squares, are the most faithful explanation function for the smaller datasets.

| Method | Adult (subset size 2) | Iris (subset size 2) | MIMIC (subset size 10) | MIMIC (subset size 20) |
|---|---|---|---|---|
| SHAP | (62, 60) | (67, 68) | (31, 36) | (37, 47) |
| SS | (46, 27) | (32, 36) | (59, 58) | (38, 45) |
| Grad | (30, 53) | (14, 16) | (37, 41) | (28, 63) |
| G*I | (38, 39) | (27, 30) | (54, 48) | (59, 43) |
| IG | (47, 33) | (60, 57) | (66, 51) | (68, 51) |
| DL | (58, 43) | (46, 48) | (84, 54) | (43, 45) |

Table 2: Faithfulness $\mu_F$ averaged over a test set; each entry is (zero baseline, training-average baseline). Exact quantities can be obtained by dividing table entries by $10^2$.

### 7.2 Max and Average Sensitivity $\mu_M$ and $\mu_A$

In Table 3, we report the max and average sensitivities for various explanation functions. To evaluate the sensitivity criterion, we sample a set of test points from $\mathcal{D}$ and an additional larger set of training points. We then find the training points that fall within a radius-$r$ neighborhood of each test point and compute the distance between each nearby training point's explanation and the test point's explanation to obtain a mean and a max. We average over ten random runs of this procedure. Sensitivity is highly dependent on the dimensionality $d$ and on the radius $r$. We find that sensitivity decreases as $r$ increases. Empirically, for MIMIC, Shapley values approximated by weighted least squares (SHAP) are the least sensitive.

| Method | Adult (r = 2) | Iris (r = 0.2) | MIMIC (r = 4) |
|---|---|---|---|
| SHAP | (60, 54) | (310, 287) | (6, 5) |
| SS | (191, 168) | (477, 345) | (83, 81) |
| Grad | (60, 50) | (68, 66) | (28, 28) |
| G*I | (86, 71) | (298, 279) | (77, 50) |
| IG | (19, 17) | (495, 462) | (19, 15) |
| DL | (74, 74) | (850, 820) | (135, 111) |

Table 3: Sensitivity, reported as (max $\mu_M$, average $\mu_A$). Exact quantities can be obtained by dividing table entries by $10^3$.
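For reference, the dataset-level loop that produces each (max, average) entry of Table 3 could be sketched as follows; this is our reconstruction of the stated protocol, omitting the ten-run averaging and the same-prediction condition from the definition of $N_r$ for brevity.

```python
import numpy as np

def sensitivity_table_entry(f, explain, test_points, train_points, radius,
                            rho=None, dist=None):
    """Estimate the (max, average) sensitivity entries reported in Table 3:
    compare each test point's explanation against the explanations of training
    points inside its radius-r neighborhood, then average over test points."""
    rho = rho or (lambda a, b: np.max(np.abs(a - b)))     # l-infinity over inputs
    dist = dist or (lambda a, b: np.linalg.norm(a - b))   # l-2 over explanations
    max_sens, avg_sens = [], []
    for x in test_points:
        phi_x = explain(f, x)
        gaps = [dist(phi_x, explain(f, z)) for z in train_points if rho(x, z) <= radius]
        if gaps:
            max_sens.append(max(gaps))
            avg_sens.append(np.mean(gaps))
    return np.mean(max_sens), np.mean(avg_sens)
```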
### 7.3 MNIST Complexity $\mu_C$

In Table 1, we provide a qualitative example of the gradient-descent style and region-shrinking methods for lowering the complexity of explanations from a model trained on MNIST. We show an example with images since it illustrates the notion of lower complexity well; however, other data types (e.g., tabular) might be better suited for complexity optimization.

| | Best (DeepLIFT) | Convex | Gradient-Descent | Region-Shrinking |
|---|---|---|---|---|
| $\mu_C$ | 3.688 | 3.685 | 3.575 | 3.208 |

Table 1: Qualitative example of aggregation to lower complexity ($\mu_C$) on an MNIST input (attribution visualizations not reproduced here). We show that it is possible to lower complexity slightly with both of our approaches; note that achieving the lowest possible complexity on an image would imply that all attribution is placed on a single pixel.

### 7.4 AVA

Our empirical findings support the use of an AVA explanation if low sensitivity is desired. [Ghorbani et al., 2019] note that perturbation-based explanations (like $g_{SHAP}$) are less sensitive than their gradient-based counterparts. In Table 4, we show that AVA explanations not only have lower sensitivities in all experiments but also are less complex (depending on the radius $r$ and the number of features $d$). After finding the average distance between pairs of points, we use $r = 1$ for Adult, $r = 0.3$ for Iris, and $r = 10$ for MIMIC.

| | Adult | Iris | MIMIC |
|---|---|---|---|
| $\mu_A(f, g_{SHAP})$ | 0.16 ± 0.11 | 0.22 ± 0.25 | 0.47 ± 0.12 |
| $\mu_A(f, g_{AVA})$ | 0.07 ± 0.07 | 0.13 ± 0.18 | 0.31 ± 0.13 |
| $\mu_M(f, g_{SHAP})$ | 0.68 ± 0.13 | 1.20 ± 0.36 | 0.83 ± 0.17 |
| $\mu_M(f, g_{AVA})$ | 0.52 ± 0.11 | 1.18 ± 0.28 | 0.72 ± 0.22 |
| $\mu_C(f, g_{SHAP})$ | 1.94 ± 0.26 | 1.36 ± 0.36 | 2.33 ± 0.23 |
| $\mu_C(f, g_{AVA})$ | 1.93 ± 0.24 | 1.24 ± 0.32 | 2.61 ± 0.29 |

Table 4: AVA lowers the sensitivity of Shapley value explanations across all datasets. When $d$ is small (fewer features), AVA explanations are slightly less complex.

## 8 Conclusion

Borrowing from earlier work in social science and the philosophy of science, we codify low sensitivity, high faithfulness, and low complexity as three desirable properties of explanation functions. We define these three properties for feature-based explanation functions, develop an aggregation scheme for learning combinations of various explanation functions, and devise schemes to learn explanations with lower complexity (iterative approaches) and lower sensitivity (AVA). We hope that this work will provide practitioners with a principled way to evaluate feature-based explanations and to learn an explanation that aggregates and optimizes for the criteria desired by end users. Though we consider one criterion at a time, future work could further axiomatize our criteria, explore the interaction between different evaluation criteria, and devise a multi-objective optimization approach to finding a desirable explanation; for example, can we develop a procedure for learning a less sensitive and less complex explanation function simultaneously?

## Acknowledgements

We thank reviewers for their feedback. We thank Pradeep Ravikumar, John Shi, Brian Davis, Kathleen Ruan, Javier Antorán, and James Allingham for their comments. UB acknowledges support from DeepMind and the Leverhulme Trust via the CFI. AW acknowledges support from the David MacKay Newton Research Fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.

## References

[Baehrens et al., 2010] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. JMLR, 11(Jun):1803–1831, 2010.

[Batterman and Rice, 2014] Robert W. Batterman and Collin C. Rice. Minimal model explanations. Philosophy of Science, 81(3):349–376, 2014.

[Berger, 2013] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 2013.

[Bhatt et al., 2019a] Umang Bhatt, Pradeep Ravikumar, et al. Building human-machine trust via interpretability. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9919–9920, 2019.

[Bhatt et al., 2019b] Umang Bhatt, Pradeep Ravikumar, and José M. F. Moura. Towards aggregating weighted feature attributions. arXiv:1901.10040, 2019.

[Bhatt et al., 2020] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. Explainable machine learning in deployment. ACM Conference on Fairness, Accountability, and Transparency (FAT*), 2020.
[Camburu et al., 2019] Oana-Maria Camburu, Eleonora Giunchiglia, Jakob Foerster, Thomas Lukasiewicz, and Phil Blunsom. Can I trust the explainer? Verifying post-hoc explanatory methods. arXiv:1910.02065, 2019.

[Chang et al., 2019] Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.

[Chen et al., 2019] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. L-Shapley and C-Shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2019.

[Dua and Graff, 2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[Ghorbani et al., 2019] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681–3688, 2019.

[Goold and Lipkin Jr, 1999] Susan Dorr Goold and Mack Lipkin Jr. The doctor-patient relationship: challenges, opportunities, and strategies. Journal of General Internal Medicine, 14(Suppl 1):S26, 1999.

[Hooker et al., 2019] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pages 9734–9745, 2019.

[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016.

[Lage et al., 2019] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J. Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 59–67, 2019.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Lipton, 2003] Peter Lipton. Inference to the Best Explanation. Routledge, 2003.

[Lundberg and Lee, 2017] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 4765–4774, 2017.

[Melis and Jaakkola, 2018] David Alvarez-Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems (NeurIPS 2018), 2018.

[Purushotham et al., 2018] Sanjay Purushotham, Chuizheng Meng, Zhengping Che, and Yan Liu. Benchmarking deep learning models on large healthcare datasets. Journal of Biomedical Informatics, 83:112–134, 2018.

[Ruben, 2015] David-Hillel Ruben. Explaining Explanation. Routledge, 2015.

[Shapley, 1953] Lloyd S. Shapley. A value for n-person games. In Contributions to the Theory of Games II, pages 307–317, 1953.

[Shrikumar et al., 2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (ICML 2017), pages 3145–3153. Journal of Machine Learning Research, 2017.

[Štrumbelj and Kononenko, 2014] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.
[Sundararajan et al., 2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (ICML 2017), pages 3319–3328. Journal of Machine Learning Research, 2017.

[Yang and Kim, 2019] Mengjiao Yang and Been Kim. BIM: Towards quantitative evaluation of interpretability methods with ground truth. arXiv:1907.09701, 2019.

[Yeh et al., 2019] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I. Inouye, and Pradeep K. Ravikumar. On the (in)fidelity and sensitivity of explanations. In Advances in Neural Information Processing Systems, pages 10965–10976, 2019.