Published as a conference paper at ICLR 2021

SHAPLEY EXPLAINABILITY ON THE DATA MANIFOLD

Christopher Frye, Damien de Mijolla, Tom Begley, Laurence Cowton, Megan Stanley, Ilya Feige
Faculty, 54 Welbeck Street, London, UK

ABSTRACT

Explainability in AI is crucial for model development, compliance with regulation, and providing operational nuance to predictions. The Shapley framework for explainability attributes a model's predictions to its input features in a mathematically principled and model-agnostic way. However, general implementations of Shapley explainability make an untenable assumption: that the model's features are uncorrelated. In this work, we demonstrate unambiguous drawbacks of this assumption and develop two solutions to Shapley explainability that respect the data manifold. One solution, based on generative modelling, provides flexible access to data imputations; the other directly learns the Shapley value-function, providing performance and stability at the cost of flexibility. While off-manifold Shapley values can (i) give rise to incorrect explanations, (ii) hide implicit model dependence on sensitive attributes, and (iii) lead to unintelligible explanations in higher-dimensional data, on-manifold explainability overcomes these problems.

1 INTRODUCTION

Explainability in AI is central to the practical impact of AI on society, thus making it critical to get right. Many dichotomies exist within the field: between local and global explanations (Ribeiro et al., 2016), between post hoc and intrinsic interpretability (Rudin, 2019), and between model-agnostic and model-specific methods (Shrikumar et al., 2017). In this work we focus on local, post hoc, model-agnostic explainability, as it provides insight into individual model predictions, does not limit model expressiveness, and is comparable across model types.

In this context, explainability can be treated as a problem of attribution. Shapley values (Shapley, 1953) provide the unique attribution method satisfying a set of intuitive axioms: e.g. they capture all interactions between features and sum to the model prediction. The Shapley approach to explainability has matured over the last two decades (Lipovetsky & Conklin, 2001; Kononenko et al., 2010; Štrumbelj & Kononenko, 2014; Datta et al., 2016; Lundberg & Lee, 2017).

Implementations of Shapley explainability suffer from a problem common across model-agnostic methods: they involve marginalisation over features, achieved by splicing data points together and evaluating the model on highly unrealistic inputs (e.g. Fig. 1). Such splicing would only be justified if all features were independent; otherwise, spliced data lies off the data manifold.

Outside the Shapley paradigm, emerging explainability methods have begun to address this problem. See e.g. Anders et al. (2020) for a general treatment of the off-manifold problem in gradient-based explainability. See also Chang et al. (2019) and Agarwal et al. (2019) for image-specific explanations that respect the data distribution.

Within Shapley explainability, initial work towards remedying the off-manifold problem has emerged; e.g. Aas et al. (2019) and Sundararajan & Najmi (2019) explore empirical and kernel-based estimation techniques, but these methods do not scale to complex data. A satisfactorily general and performant solution to computing Shapley values on the data manifold has yet to appear and is a focus of this work.

Our main contributions are twofold:
Sec. 3 compares on- and off-manifold explainability, focusing on novel and unambiguous shortcomings of off-manifold Shapley values. In particular, we show that off-manifold explanations are often incorrect, and that they can hide implicit model dependence on sensitive features.

Sec. 4 develops two methods to compute on-manifold Shapley values on general data sets: (i) a flexible generative-modelling technique to learn the data's conditional distributions, and (ii) a simple supervised-learning technique that targets the Shapley value-function directly. We demonstrate the effectiveness of these methods on higher-dimensional data with experiments.

Figure 1: An MNIST digit, a coalition of pixels in a Shapley calculation, and 5 off-manifold splices.

2 BACKGROUND ON SHAPLEY EXPLAINABILITY

The Shapley value (Shapley, 1953) is a method from cooperative game theory that distributes credit for the total value v(N) earned by a team N = {1, 2, ..., n} among its players:

    \phi_v(i) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ v(S \cup \{i\}) - v(S) \right]    (1)

where the value function v(S) indicates the value that a coalition of players S would earn without their other teammates. The Shapley value \phi_v(i) represents player i's marginal value-added upon joining the team, averaged over all orderings in which the team can be constructed.

In supervised learning, let f_y(x) be a model's predicted probability that data point x belongs to class y.[1] To apply Shapley attribution to model explainability, one interprets the features {x_1, ..., x_n} as players in a game and the output f_y(x) as their earned value. To compute Shapley values, one must define a value function representing the model's output on a coalition x_S \subseteq {x_1, ..., x_n}. As the model is undefined on the partial input x_S, the standard implementation (Lundberg & Lee, 2017) samples out-of-coalition features, x'_{\bar{S}} where \bar{S} = N \setminus S, unconditionally from the data distribution:

    v^{\text{off}}_{f_y(x)}(S) = \mathbb{E}_{p(x')} \left[ f_y(x_S \cup x'_{\bar{S}}) \right]    (2)

We refer to this value function, and the corresponding Shapley values, as lying off the data manifold, since splices x_S \cup x'_{\bar{S}} generically lie far from the data distribution. Alternatively, conditioning out-of-coalition features x'_{\bar{S}} on in-coalition features x_S would result in an on-manifold value function:

    v^{\text{on}}_{f_y(x)}(S) = \mathbb{E}_{p(x'|x_S)} \left[ f_y(x') \right]    (3)

The conditional distribution p(x'|x_S) is not empirically accessible in practical scenarios with high-dimensional data or many-valued (e.g. continuous) features. A performant method to compute on-manifold Shapley values on general data is until now lacking and a focus of this work.

Shapley values \phi_{f_y(x)}(i) provide local explainability for the model's prediction on data point x. To understand the model's global behaviour, one aggregates the \phi_{f_y(x)}(i)'s into global Shapley values:

    \Phi_f(i) = \mathbb{E}_{p(x,y)} \left[ \phi_{f_y(x)}(i) \right]    (4)

where p(x, y) is the labelled-data distribution. Global Shapley values can be seen as a special case of the global explanation framework introduced by Covert et al. (2020). As a consequence of the axioms (Shapley, 1953) satisfied by the \phi_{f_y(x)}(i)'s, global Shapley values satisfy a sum rule:

    \sum_{i \in N} \Phi_f(i) = \mathbb{E}_{p(x,y)} \left[ f_y(x) \right] - \mathbb{E}_{p(x')} \mathbb{E}_{p(y)} \left[ f_y(x') \right]    (5)

One interprets the global Shapley value \Phi_f(i) as the portion of model accuracy attributable to the i-th feature. Indeed, the first term in Eq. (5) is the accuracy one achieves by sampling labels from f's predicted probability distribution over classes. The offset term, which relates to class balance, is not attributable to any individual feature.

[1] The results of this paper can be applied to regression problems by reinterpreting f_y(x) as the model's predicted value rather than its predicted probability.
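To make the estimation of Eqs. (1)-(3) concrete, the sketch below (our own illustration, not the authors' code) estimates Shapley values by Monte Carlo sampling of feature orderings, using the off-manifold value function of Eq. (2); swapping `off_manifold_value` for a conditional imputer would yield the on-manifold values of Eq. (3). The names `model_fn` (a callable returning f_y for the class being explained) and `X_background` (a reference sample of the data) are assumptions for illustration.

```python
import numpy as np

def off_manifold_value(model_fn, x, coalition, X_background):
    """v_off(S) of Eq. (2): splice x_S onto background rows drawn from p(x')."""
    splices = X_background.copy()
    if coalition:
        splices[:, coalition] = x[coalition]   # keep in-coalition features fixed at x_S
    return model_fn(splices).mean()            # E_{p(x')}[ f_y(x_S ∪ x'_{S̄}) ]

def shapley_values(model_fn, x, X_background, n_permutations=200, seed=0):
    """phi(i) of Eq. (1), averaged over random orderings in which features join S."""
    rng = np.random.default_rng(seed)
    n = len(x)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        coalition = []
        v_prev = off_manifold_value(model_fn, x, coalition, X_background)
        for i in rng.permutation(n):
            coalition.append(int(i))
            v_new = off_manifold_value(model_fn, x, coalition, X_background)
            phi[i] += v_new - v_prev           # marginal contribution of feature i
            v_prev = v_new
    return phi / n_permutations
```

Global values in the sense of Eq. (4) would then follow by averaging `shapley_values` over labelled data points.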
3 EVIDENCE IN FAVOUR OF ON-MANIFOLD EXPLAINABILITY

The key differences between on- and off-manifold Shapley values are a subject of ongoing discussion; see Sundararajan & Najmi (2019) or Chen et al. (2020) for recent overviews. Here we focus on theoretical arguments and experimental evidence, yet to appear in the literature, in favour of the on-manifold approach. We begin with mathematically precise differences between on- and off-manifold methods in Sec. 3.1 and present unambiguous drawbacks of off-manifold Shapley values in Sec. 3.2.

3.1 ON- VERSUS OFF-MANIFOLD DIFFERENCES MADE PRECISE

Suppose the model's input features x_1, ..., x_n are the result of a data-generating process seeded by unobserved latent variables z_1, ..., z_d. Then there exist functional relationships

    x_i = g_i(z_1, ..., z_d; \epsilon_i)  for  i = 1, ..., n    (6)

where \epsilon_i represents noise in x_i. In the limit of small \epsilon_i's, there are d directions in which a data point (x_1, ..., x_n) can be perturbed while remaining consistent with the data distribution: these correspond to perturbations in z_1, ..., z_d in Eq. (6). The data thus lives on a d-dimensional manifold in the ambient n-dimensional feature space, and therefore satisfies n - d constraints on the x_i's:

    \Psi_k(x_1, ..., x_n) = 0  for  k = 1, ..., n - d    (7)

On-manifold Shapley values evaluate the model on inputs that satisfy these constraints, while the off-manifold approach uses spliced data that generically break them. For a more detailed and mathematically precise discussion of the data manifold in this context, see Anders et al. (2020).

ALGEBRAIC MODEL DEPENDENCE CAN BE MISLEADING

Any model f_y(x) can be written in many algebraic forms that all evaluate identically on the data manifold. To show this, one can add any of the n - d constraints from Eq. (7) to any of the model's n input slots. This changes the model's algebraic form but does not affect the model's output on the data, since each constraint equals zero on-manifold. The model f_y(x) thus belongs to an n(n - d)-dimensional equivalence class of functions that behave indistinguishably on the data.

On-manifold Shapley values provide the same explanation for any two models that evaluate identically on the data distribution, because on-manifold explanations do not involve evaluation anywhere else. Off-manifold Shapley values provide different explanations for two models in the same equivalence class, as spliced data in the off-manifold value function break the constraints of Eq. (7).

HIDDEN MODEL DEPENDENCE ON SENSITIVE ATTRIBUTES

This is not an academic concern: it follows that off-manifold explanations are vulnerable to adversarial model perturbations that hide dependence on select input features (Dombrowski et al., 2019; Slack et al., 2020). Dimanov et al. (2020) demonstrated that the off-manifold Shapley value for a sensitive feature like gender could be reduced to near zero via this vulnerability.

To see how this can happen, suppose that input feature x_1 represents gender and formally solve one of the constraints in Eq. (7) for x_1. The result, say x_1 = \Psi(x_2, ..., x_n), can then be used to transform any model f_y(x) into another

    \tilde{f}_y(x_2, ..., x_n) = f_y\left( \Psi(x_2, ..., x_n), x_2, ..., x_n \right)    (8)

that has no algebraic dependence on gender x_1 but behaves identically to f_y(x) on the data manifold. The off-manifold Shapley value for gender in \tilde{f} would vanish, since the off-manifold value function of Eq. (2) depends on x_1 only through \tilde{f} (i.e. not at all). This result is problematic, since the two models behave equivalently on the data and thus possess the same gender bias. In contrast, the on-manifold Shapley values for f and \tilde{f} would be identical, as x_1 dependence enters the on-manifold value function of Eq. (3) through the conditional expectation value. In a sense, on-manifold Shapley values represent the model's dependence on the information content of each feature, rather than the model's algebraic dependence.
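As a hypothetical illustration of Eq. (8) (our own, not code from the paper), the wrapper below removes all algebraic dependence on a chosen feature, given any callable `psi_fn` that reconstructs that feature from the remaining ones on the data manifold. The wrapped model agrees with the original wherever the constraint holds, yet an off-manifold value function sees no dependence on the suppressed column; `model_fn`, `psi_fn`, and `suppressed_idx` are assumed names.

```python
import numpy as np

def suppress_feature(model_fn, psi_fn, suppressed_idx):
    """Return f~(x) = f(Psi(x_rest), x_rest): no algebraic dependence on the suppressed feature."""
    def wrapped(X):
        X = np.array(X, dtype=float, copy=True)
        # Overwrite the suppressed column with its on-manifold reconstruction, so the
        # returned model never reads the value actually supplied for that feature.
        X[:, suppressed_idx] = psi_fn(np.delete(X, suppressed_idx, axis=1))
        return model_fn(X)
    return wrapped
```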
Figure 2: (a) Vulnerability of off-manifold explanations to hidden model dependence (Census Income data). (b) Explanations of a fixed model compared to a model that is retrained on each Shapley coalition of features (Drug Consumption data).

We can demonstrate this on UCI Census Income data (Dua & Graff, 2017). We trained a neural network to predict whether an individual's income exceeds $50k based on demographic features in the data. Coral bars in Fig. 2(a) display global Shapley values for this 'Original model'. (On-manifold values were computed with the unsupervised method developed in Sec. 4.1.) We then trained an alternative model by fine-tuning the neural network above on a loss function that penalises model dependence on sex; see App. B for full details of this experiment. This resulted in a 'Suppressed model' whose predictions are identical to the original model's on 98.5% of the data. Teal bars in Fig. 2(a) display global Shapley values for this model. Note that the off-manifold Shapley value for sex is zero despite the similar behaviour exhibited by the original and suppressed models on the data. In contrast, on-manifold Shapley values explain both models similarly.

ON-MANIFOLD SHAPLEY VALUES IN THE OPTIMAL-MODEL LIMIT

Here we present a result that strengthens the connection between on-manifold Shapley values and the data distribution: in the limit of an optimal model of the data, on-manifold Shapley values converge to an explanation of how the information in the data associates with the labelled outcomes. To show why this holds, suppose the predicted probability f_y(x) converges to the true underlying distribution p(y|x). In this optimal-model limit (which is approached in the limit of abundant data and high model capacity) the on-manifold value function of Eq. (3) becomes

    v^{\text{on}}_{f_y(x)}(S) \to \int dx'_{\bar{S}} \, p(x'_{\bar{S}} \mid x_S) \, p(y \mid x_S \cup x'_{\bar{S}}) = p(y \mid x_S)    (9)

which shows that value is attributed to x_i based on x_i's predictivity of the label y.

We can demonstrate this on UCI Drug Consumption data (Dua & Graff, 2017). Using the 10 binary features listed in Fig. 2(b) (Mushrooms, Ecstasy, etc.), we trained a random forest f to predict whether individuals had consumed an 11th drug: LSD. As the data contains just 10 binary features, we were able to empirically sample the conditional distributions in the on-manifold value function, Eq. (3). See Fig. 2(b) for the resulting off- and on-manifold global Shapley values.

Next we fit a separate random forest g_S to each coalition S of features, 2^10 models in total, in the spirit e.g. of Štrumbelj et al. (2009). We used the accuracy A(g_S) of each model, in the sense of Eq. (5), as the value function for an additional Shapley computation:

    \Phi_g(i) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (n - |S| - 1)!}{n!} \left[ A(g_{S \cup \{i\}}) - A(g_S) \right]    (10)

where \Phi_g(i) is directly the average gain in accuracy that results from adding feature i to the set of inputs. These values are labelled 'Model retraining' in Fig. 2(b). Note their agreement with the on-manifold explanation of the fixed random forest f. On-manifold Shapley values thus indicate which features in the data are most predictive of the label.
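The sketch below (our own reconstruction, assuming a small number of features, integer labels 0..K-1, and a scikit-learn classifier) illustrates the 'Model retraining' computation of Eq. (10): one model g_S per coalition, scored with the stochastic accuracy that appears in Eq. (5), with accuracy gains combined under the Shapley weights.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stochastic_accuracy(model, X, y):
    """A(g_S): accuracy when labels are sampled from the model's predicted probabilities."""
    return model.predict_proba(X)[np.arange(len(y)), y].mean()

def retraining_shapley(X_train, y_train, X_test, y_test):
    n = X_train.shape[1]
    # Value of the empty coalition: predict from the class balance alone (cf. Eq. (5)).
    class_rate = np.bincount(y_train) / len(y_train)
    acc = {(): class_rate[y_test].mean()}
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            g = RandomForestClassifier(random_state=0).fit(X_train[:, list(S)], y_train)
            acc[S] = stochastic_accuracy(g, X_test[:, list(S)], y_test)
    phi = np.zeros(n)
    for S, a_S in acc.items():
        for i in range(n):
            if i in S:
                continue
            S_i = tuple(sorted(S + (i,)))
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += weight * (acc[S_i] - a_S)   # Eq. (10): accuracy gain from adding i
    return phi
```

This requires 2^n retrained models, so it is only feasible for small feature sets such as the 10-feature Drug Consumption data.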
This consistency check allows us to show in passing that Tree SHAP (Lundberg et al., 2018; 2020) does not provide a method for on-manifold explainability. Observe in Fig. 2(b) that Tree SHAP roughly tracks the off-manifold explanation, albeit larger on the most predictive feature and somewhat smaller on the others. This occurs because trees tend to split on high-predictivity features first, and Tree SHAP privileges early-splitting features in an otherwise off-manifold calculation.

Figure 3: An individual outlier (a), its off- and on-manifold explanations (b & c), the error rate in explanations (d), and the distribution of model outputs on Shapley coalitions (e & f).

3.2 UNAMBIGUOUS SHORTCOMINGS OF OFF-MANIFOLD EXPLAINABILITY

Whereas above we clarified precise differences between on- and off-manifold Shapley values, in this section we focus on unambiguous drawbacks of the off-manifold approach.

UNCONTROLLED MODEL BEHAVIOUR OFF-MANIFOLD

Sec. 3.1 might lead one to believe that off-manifold Shapley values provide insight into the algebraic dependence of a model. However, the off-manifold approach of evaluating the model on spliced in-distribution data does not constitute a controlled study of such dependence. Off-manifold Shapley values serve as a perilously uncontrolled technique, especially for complex nonlinear models such as neural networks.

Indeed, it is widely known that deep-learning models are not robust to distributional shift (Nguyen et al., 2015; Goodfellow et al., 2015). Still, off-manifold Shapley values evaluate the model outside its domain of validity, where it is untrained and potentially wildly misbehaved. This garbage-in-garbage-out problem is the clearest reason to avoid the off-manifold approach. Since this point has been documented in the literature (Hooker & Mentch, 2019), here we simply provide an example: Fig. 1 shows a binary MNIST digit (LeCun & Cortes, 2010), a coalition of pixels, and 5 random splices that would be used to compute an off-manifold explanation.

OUTLIER DETECTION EXPLAINED INCORRECTLY OFF-MANIFOLD

To demonstrate that off-manifold Shapley values frequently lead to incorrect explanations, here we offer an example on synthetic data where the ground-truth explanation is known. We generated 10^4 synthetic data points, each consisting of 20 real-valued features, for the purpose of outlier detection. We split the dataset between 99% inliers and 1% outliers, with the classes generated according to:

    p_{\text{in}}(x_1, ..., x_{20}) = \frac{1}{2} \sum_{z = \pm 1} \prod_{i=1}^{20} \mathcal{N}[z, \sigma^2](x_i)    (11)

    p_{\text{out}}(x_1, ..., x_{20}) = \frac{1}{2} \sum_{z = \pm 1} \prod_{i=1}^{5} \mathcal{N}[-z, \sigma^2](x_i) \prod_{i=6}^{20} \mathcal{N}[z, \sigma^2](x_i)    (12)

That is, there is a single binary latent variable z. For inliers, each feature is an independent noisy reading of the latent z. For outliers, the first 5 features are centred instead around its opposite, -z. An example outlier (with \sigma = 0.05) is shown in Fig. 3(a). We generated one such data set for each \sigma \in \{0.01, 0.03, ..., 0.15\} in order to study the effect of noise on explanation errors.
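A minimal sketch (our own, with z taken to be ±1, which Eqs. (11)-(12) leave implicit) of the synthetic outlier data and of the isolation-forest detector used next; the function name `make_outlier_data` is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def make_outlier_data(n_samples=10_000, n_features=20, n_flipped=5, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.choice([-1.0, 1.0], size=n_samples)                # binary latent per point
    X = z[:, None] + rng.normal(0.0, sigma, (n_samples, n_features))
    y = np.zeros(n_samples, dtype=int)
    y[: n_samples // 100] = 1                                  # 1% outliers
    # Outliers: the first `n_flipped` features are centred around the opposite latent, -z.
    X[y == 1, :n_flipped] = (-z[y == 1])[:, None] + rng.normal(
        0.0, sigma, ((y == 1).sum(), n_flipped))
    perm = rng.permutation(n_samples)                          # shuffle inliers and outliers
    return X[perm], y[perm]

X, y = make_outlier_data(sigma=0.05)
detector = IsolationForest(random_state=0).fit(X)
scores = -detector.score_samples(X)    # larger score indicates a more anomalous point
```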
We fit an isolation forest (Liu et al., 2008) to perform outlier detection on each synthetic dataset, achieving 100% accuracy in every case. We computed the off- and on-manifold value functions of Eqs. (2) and (3) for each isolation forest by sampling the probability distributions directly, as these can be inferred from Eqs. (11) and (12). Figs. 3(b) and 3(c) show the resulting local Shapley values for the example outlier from Fig. 3(a).

The ground-truth explanation of why Fig. 3(a) represents an outlier is that its first 5 features break correlations that exist across 99% of the data. The on-manifold explanation of Fig. 3(c) correctly attributes the 5 largest Shapley values to features x_1, ..., x_5. The off-manifold explanation of Fig. 3(b) is unambiguously incorrect: feature x_7 receives a larger value than x_2, x_4, and x_5.

We consider an explanation to be erroneous if x_1, ..., x_5 do not receive the 5 largest Shapley values. To show the frequency of incorrect explanations, Fig. 3(d) displays the off- and on-manifold error rates as a function of the noise \sigma in the synthetic data set. Incorrect explanations are commonplace off-manifold: one-quarter are in error in the presence of minimal noise, and two-thirds are incorrect at \sigma = 0.15. The on-manifold error rate is dramatically lower across this range.

Figs. 3(e) and 3(f) show the root cause of off-manifold errors. These histograms display the distribution of model outputs when evaluated on Shapley coalitions in the off- and on-manifold calculations for \sigma = 0.05. In particular, Fig. 3(e) shows the model evaluated on inlier coalitions which do not include x_1, ..., x_5. Note that model outputs for on-manifold coalitions agree with the model evaluated on the actual data, while off-manifold coalitions follow a very different distribution. Since a positive model output indicates a predicted outlier, Fig. 3(e) shows that the off-manifold calculation itself fabricates outliers through its splicing procedure.

Similarly, Fig. 3(f) shows the model evaluated on outlier coalitions which do include x_1, ..., x_5. Note that model outputs are similar for on-manifold coalitions and actual outliers, whereas off-manifold coalitions again differ dramatically. This is a manifestation of uncontrolled model behaviour off the data manifold, and it ultimately leads to erroneous off-manifold explanations.

Figure 4: Negative global Shapley values arise off-manifold, in both (a) synthetic and (b) real data. (Panels: (a) synthetic data generation, (b) synthetic data, (c) Abalone data; vertical axes show the global Shapley value.)

BREAKDOWN IN GLOBAL SHAPLEY VALUES OFF-MANIFOLD

To demonstrate that global Shapley values can be misleading off-manifold as well, we generated an additional synthetic data set according to the process in Fig. 4(a). The data has two binary features and a binary label. We fit a decision tree to this data, resulting in a precise match to Fig. 4(a). Note that the features x_0 and x_1 are positively correlated, both with each other and with the label y. However, with x_0 fixed, the likelihood of y = 1 decreases slightly from x_1 = 0 to x_1 = 1. One might think of x_0 as disease severity, x_1 as treatment intensity, and y as mortality rate.

Fig. 4(b) displays global Shapley values for this model. The global Shapley values are positive on-manifold, consistent with their interpretation as the portion of model accuracy attributable to each feature.
Off-manifold, however, a negative value results from placing too much weight on splices, e.g. with (x_0, x_1, y) = (0, 1, 1), that occur less frequently in the actual data. The negative value would erroneously indicate that x_1 is detrimental to the model's overall performance.

We can demonstrate this on real data using UCI Abalone data (Dua & Graff, 2017). We trained a neural network to classify abalone as younger than or older than the median age based on physical characteristics. Fig. 4(c) displays global Shapley values for this model. (On-manifold values were computed using techniques developed in Sec. 4.1; see App. B for details.)

Observe the drastic difference between the on- and off-manifold explanations in Fig. 4(c). This is due to the tight correlations between features in the data (4 weights and 3 lengths), which make the data manifold low-dimensional and important. Notice further the large negative off-manifold global Shapley value, negating its interpretation as the portion of model accuracy attributable to that feature.

4 SCALABLE APPROACHES TO ON-MANIFOLD SHAPLEY VALUES

In Sec. 3 we computed on-manifold Shapley values for simple data by estimating p(x'|x_S) from the empirical data distribution or, for synthetic data, by knowing this distribution analytically. Here we introduce two performant methods to compute on-manifold Shapley values on general data. Sec. 4.1 develops the theory underlying our methods, and Sec. 4.2 presents additional experimental results.

4.1 THEORETICAL DEVELOPMENT OF ON-MANIFOLD METHODS

Here we develop two methods to learn the on-manifold value function: (i) an unsupervised technique to learn the conditional distribution p(x'|x_S), and (ii) a supervised technique to learn the value function directly.

UNSUPERVISED APPROACH

One can use unsupervised learning to learn the conditional distributions p(x'|x_S) that appear in the on-manifold value function. Here we take an approach similar to Ivanov et al. (2019) to learn these distributions with variational inference. See Douglas et al. (2017) and Belghazi et al. (2019) for alternative techniques for learning conditional distributions that could be used here instead.

Our specific approach includes two model components. The first is a variational autoencoder (Kingma & Welling, 2014; Rezende et al., 2014), with encoder q_\phi(z|x) and decoder p_\theta(x|z). The second is a masked encoder, r_\psi(z|x_S), whose goal is to map the coalition x_S to a distribution in latent space that agrees with the encoder q_\phi(z|x) as well as possible. A model of p(x'|x_S) is then provided by the composition:

    \hat{p}(x'|x_S) = \int dz \, p_\theta(x'|z) \, r_\psi(z|x_S)    (13)

and a good fit to the data should maximise \hat{p}(x'|x_S). A lower bound on its log-likelihood is given by

    \mathcal{L}_0 = \mathbb{E}_{q_\phi(z|x')} \left[ \log p_\theta(x'|z) \right] - D_{\mathrm{KL}}\left( q_\phi(z|x') \,\|\, r_\psi(z|x_S) \right)    (14)

While \mathcal{L}_0 could be used on its own as the objective function to learn \hat{p}(x'|x_S), this would leave the variational distribution q_\phi(z|x) unconstrained, at odds with our goal of learning a smooth-manifold structure in latent space. This concern can be mitigated by

    \mathcal{L}_{\text{reg}} = D_{\mathrm{KL}}\left( q_\phi(z|x) \,\|\, p(z) \right)

which regularises q_\phi(z|x) by penalising differences from a smooth (e.g. unit normal) prior distribution p(z). We thus include \mathcal{L}_{\text{reg}} as a regularisation term in our unsupervised objective: \mathcal{L} = \mathcal{L}_0 + \beta \mathcal{L}_{\text{reg}}.
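A minimal PyTorch sketch (our own, not the authors' implementation) of this objective, with a single-mode Gaussian masked encoder and a unit-variance Gaussian decoder for concreteness; the sign convention below treats both the negative bound and the regulariser as quantities to minimise. The callables `encoder`, `decoder`, and `masked_encoder` are assumed networks returning (mean, log-variance) pairs, or reconstruction means in the decoder's case.

```python
import torch
import torch.distributions as D

def unsupervised_loss(encoder, decoder, masked_encoder, x, mask, beta=1.0):
    """One minibatch of the objective L = L0 + beta * Lreg, arranged for minimisation.
    x: (batch, n) data; mask: (batch, n) binary, with 1 marking in-coalition features x_S."""
    x_masked = x * mask + (-1.0) * (1 - mask)          # out-of-coalition slots set to -1

    mu_q, logvar_q = encoder(x)                        # q_phi(z | x')
    q = D.Normal(mu_q, (0.5 * logvar_q).exp())
    mu_r, logvar_r = masked_encoder(x_masked)          # r_psi(z | x_S), one Gaussian mode here
    r = D.Normal(mu_r, (0.5 * logvar_r).exp())
    prior = D.Normal(torch.zeros_like(mu_q), torch.ones_like(mu_q))

    z = q.rsample()                                    # reparameterised latent sample
    x_recon = decoder(z)                               # mean of p_theta(x' | z)
    log_px = -((x - x_recon) ** 2).sum(dim=1)          # Gaussian log-likelihood up to a constant

    L0 = log_px - D.kl_divergence(q, r).sum(dim=1)     # lower bound of Eq. (14)
    Lreg = D.kl_divergence(q, prior).sum(dim=1)        # KL(q || p), smooth-prior regulariser
    return (-L0 + beta * Lreg).mean()                  # minimise -L0 + beta * KL
```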
METRIC FOR THE LEARNT VALUE FUNCTION

The unsupervised method presented above leads to a learnt estimate of the conditional distribution, and thus to an estimate of the on-manifold value function: \hat{v}_{f_y(x)}(S) = \mathbb{E}_{\hat{p}(x'|x_S)}[f_y(x')]. With the goal of judging the performance of this estimate, consider the following formal quantity:

    \mathrm{mse}(x_S, y) = \mathbb{E}_{p(x'|x_S)} \left[ \left( f_y(x') - \hat{v}_{f_y(x)}(S) \right)^2 \right]    (15)

This quantity is minimal with respect to \hat{v}_{f_y(x)}(S) when \hat{v}_{f_y(x)}(S) = \mathbb{E}_{p(x'|x_S)}[f_y(x')], in agreement with the definition, Eq. (3), of the on-manifold value function. We can then quantitatively judge the performance of the unsupervised model \hat{p}(x'|x_S) by computing

    \mathrm{MSE} = \mathbb{E}_{p(x)} \, \mathbb{E}_{S \sim \text{Shapley}} \, \mathbb{E}_{y \sim \text{Unif}} \left[ \left( f_y(x) - \hat{v}_{f_y(x)}(S) \right)^2 \right]    (16)

Note that this is precisely Eq. (15) averaged over coalitions S drawn from the Shapley sum,[2] features x_S \sim p(x_S) drawn from the data, and labels y drawn uniformly over classes. Moreover, the mean-square error in Eq. (16) is easy to estimate using the empirical distribution p(x) and the learnt model \hat{p}(x'|x_S), thus providing an unambiguous metric to judge the outcome of the unsupervised approach.

[2] In more detail, here we sample coalitions from Eq. (1), where the probability assigned to each coalition is the combinatorial factor |S|!(n - |S| - 1)!/n!.

Figure 5: Validation of unsupervised and supervised techniques for computing on-manifold Shapley values (Drug Consumption data; vertical axis: global Shapley value). Comparison against the empirical ground truth, which appeared as 'On manifold' in Fig. 2(b).

Table 1: Performance and stability, in terms of MSE, of supervised and unsupervised approaches. Performance is compared with off-manifold splicing and, where accessible, the empirical optimum.

    DATA SET   OFF-MANIFOLD   UNSUPERVISED      SUPERVISED        EMPIRICAL
    DRUG       0.0634         0.0536 ± 0.0007   0.0441 ± 0.0002   0.0436
    ABALONE    0.0647         0.0293 ± 0.0009   0.0200 ± 0.0001   -
    CENSUS     0.0344         0.0300 ± 0.0006   0.0250 ± 0.0001   -
    MNIST      0.0448         0.0257 ± 0.0005   0.0121 ± 0.0001   -

SUPERVISED APPROACH

The MSE metric of Eq. (16) supports a supervised approach to learning the on-manifold value function directly: one can define a surrogate model g_y(x_S) that operates on coalitions of features x_S (e.g. by masking out-of-coalition features) and that is trained to minimise the loss:

    \mathcal{L} = \mathbb{E}_{p(x)} \, \mathbb{E}_{S \sim \text{Shapley}} \, \mathbb{E}_{y \sim \text{Unif}} \left[ \left( f_y(x) - g_y(x_S) \right)^2 \right]    (17)

As discussed above Eq. (16), this loss is minimised as the surrogate model g_y(x_S) approaches the on-manifold value function \mathbb{E}_{p(x'|x_S)}[f_y(x')] of the model-to-be-explained.
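A minimal PyTorch sketch (ours, with assumed names `surrogate` and `model_probs`) of the supervised objective of Eq. (17). Coalitions are sampled by drawing a random feature ordering and a uniform cut, which gives coalition sizes uniform over {0, ..., n} and members uniform at each size, a standard way of sampling the coalitions that arise in the Shapley sum; out-of-coalition features are masked with -1 as in App. A.

```python
import torch

def sample_shapley_coalitions(batch_size, n_features, device=None):
    """Random coalition masks via a random feature ordering and a uniform cut point."""
    ranks = torch.argsort(torch.rand(batch_size, n_features, device=device), dim=1)
    cut = torch.randint(0, n_features + 1, (batch_size, 1), device=device)
    return (ranks < cut).float()                        # 1 marks an in-coalition feature

def surrogate_loss(surrogate, x, model_probs):
    """MSE of Eq. (17): g_y(x_S) regresses the model's full-input prediction f_y(x)."""
    mask = sample_shapley_coalitions(x.shape[0], x.shape[1], device=x.device)
    x_masked = x * mask + (-1.0) * (1 - mask)           # -1 masks out-of-coalition features
    return ((surrogate(x_masked) - model_probs) ** 2).mean()
```

In practice `surrogate` would be trained with an optimiser over minibatches, resampling masks at each step; at explanation time, g_y(x_S) is read off directly in place of the sampled expectation of Eq. (3).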
4.2 ADDITIONAL EXPERIMENTS

Sec. 3 above presented two experiments using the scalable on-manifold methods developed here. In particular, Fig. 2(a) applied the unsupervised method to Census Income data, showing that on-manifold Shapley values detect hidden model dependence on sensitive features, and Fig. 4(c) applied both methods to Abalone data, showing that global Shapley values remain positive and interpretable on-manifold. In this section, we perform additional experiments to study the performance and stability of Sec. 4.1's methods, as well as their effectiveness on higher-dimensional data.

PERFORMANCE AND STABILITY

Our implementations of the unsupervised and supervised approaches to on-manifold Shapley values are summarised in Apps. A and B. Both approaches lead to broadly similar results. Fig. 5 compares the two techniques on the Drug Consumption data, where explanations are compared against the ground-truth empirical computation from Fig. 2(b).

The unsupervised approach is flexible but untargeted: \hat{p}(x'|x_S) is data-specific but model-agnostic, accommodating explanations for many models trained on the same data. The supervised approach must be retrained on each model, but it entails direct minimisation of the MSE. The supervised method is thus expected to achieve higher accuracy. We confirmed this on all data sets studied in this paper; see Table 1 for a numerical comparison of the MSEs.

In Table 1, central values indicate the test-set MSE achieved by each method. The table compares the unsupervised and supervised methods against off-manifold splicing, showing significant improvement over this baseline. Note that an MSE of zero is not achievable, because f_y(x) in Eq. (16) or (17) is not fully determined by the partial input x_S. For the Drug Consumption data, where we can compute p(x'|x_S) empirically, the optimal MSE happens to be 0.0436.

Uncertainties in Table 1 represent the standard deviation in test-set MSE upon repeating each method with fixed hyperparameters 10 times. (Uncertainties are absent for the off-manifold and empirical columns, as these do not involve training a separate model.) The table thus indicates that the supervised method offers increased stability as compared to the unsupervised approach.

The supervised method is more efficient as well: while the unsupervised technique estimates the value function by sampling from \hat{p}(x'|x_S), the supervised approach learns the value function directly. The supervised method thus requires far fewer model evaluations to match the standard error of the unsupervised method: roughly 10 times fewer in our experiments.

EXAMPLE ON MNIST

To demonstrate on-manifold explainability on higher-dimensional data, we trained a fully connected network on binary MNIST (LeCun & Cortes, 2010) and explained random digits in Fig. 6(a).

Figure 6: (a) Randomly drawn MNIST digits explained on / off manifold. Red / blue pixels indicate positive / negative Shapley values, and the colour scale in each column is fixed. (b) Shapley summand as a function of coalition size, averaged over coalitions, pixels, and the MNIST test set. (Vertical axis in (b): normalised v(S ∪ {i}) − v(S).)

Despite having the same sum over pixels (as controlled by the local version of Eq. (5)) and explaining the same model prediction, each on-manifold explanation is more concentrated, with more interpretable structure, than its off-manifold counterpart. The handwritten strokes are clearly visible on-manifold, with key off-stroke regions highlighted as well. Off-manifold explanations generally display lower intensities, spread less informatively across the digit region.

These off-manifold explanations are a result of splices as in Fig. 1. With such unrealistic input, the model's output is uncontrolled and less informative. In fact, it is only on very large coalitions of pixels, subject to minimal splicing, that the model can make intelligent predictions off-manifold. This is confirmed in Fig. 6(b), which shows the average Shapley summand as a function of coalition size on MNIST. Note that primarily large coalitions underpin off-manifold explanations, whereas far fewer pixels are required on-manifold, consistent with the low-dimensional manifold underlying the data.

5 CONCLUSION

In this work, we conducted a careful study of the off-manifold problem in AI explainability. We presented important distinctions between on- and off-manifold explainability and provided experimental evidence for several novel shortcomings of the off-manifold approach.
We then introduced two techniques to compute on-manifold Shapley values on general data: one technique learns to impute features on the data manifold, while the other learns the Shapley value-function directly. In so doing, we provided compelling evidence against the use of off-manifold explainability, and demonstrated that on-manifold Shapley values offer a viable approach to AI explainability in real-world contexts.

REFERENCES

K. Aas, M. Jullum, and A. Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values, 2019. [arXiv:1903.10464].
C. Agarwal, D. Schonfeld, and A. Nguyen. Removing input features via a generative model to explain their attributions to classifier's decisions, 2019. [arXiv:1910.04256].
C. J. Anders, P. Pasliev, A.-K. Dombrowski, K.-R. Müller, and P. Kessel. Fairwashing explanations with off-manifold detergent. In International Conference on Machine Learning, 2020.
M. Belghazi, M. Oquab, and D. Lopez-Paz. Learning about an exponential amount of conditional distributions. In Advances in Neural Information Processing Systems, 2019.
C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.
H. Chen, J. D. Janizek, S. Lundberg, and S.-I. Lee. True to the model or true to the data?, 2020. [arXiv:2006.16234].
I. Covert, S. Lundberg, and S.-I. Lee. Understanding global feature contributions through additive importance measures, 2020. [arXiv:2004.00668].
A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In IEEE Symposium on Security and Privacy, 2016.
B. Dimanov, U. Bhatt, M. Jamnik, and A. Weller. You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods. In SafeAI @ AAAI, 2020.
A.-K. Dombrowski, M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, 2019.
L. Douglas, I. Zarov, K. Gourgoulias, C. Lucas, C. Hart, A. Baker, M. Sahani, Y. Perov, and S. Johri. A universal marginalizer for amortized inference in generative models. In NIPS Workshop on Approximate Bayesian Inference, 2017.
D. Dua and C. Graff. UCI machine learning repository, 2017. [archive.ics.uci.edu/ml].
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
G. Hooker and L. Mentch. Please stop permuting features: An explanation and alternatives, 2019. [arXiv:1905.03151].
O. Ivanov, M. Figurnov, and D. Vetrov. Variational autoencoder with arbitrary conditioning. In International Conference on Learning Representations, 2019.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
I. Kononenko et al. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 2010.
Y. LeCun and C. Cortes. MNIST database, 2010. [yann.lecun.com/exdb/mnist].
S. Lipovetsky and M. Conklin. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 2001.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In IEEE International Conference on Data Mining, 2008.
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 2017.
S. M. Lundberg, G. G. Erion, and S.-I. Lee. Consistent individualized feature attribution for tree ensembles, 2018. [arXiv:1802.03888].
S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S.-I. Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2020.
A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition, 2015.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining, 2016.
C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019.
L. S. Shapley. A value for n-person games. In Contributions to the Theory of Games, 1953.
A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, 2017.
D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AIES @ AAAI, 2020.
E. Štrumbelj and I. Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 2014.
E. Štrumbelj, I. Kononenko, and M. Robnik-Šikonja. Explaining instance classifications with interactions of subsets of feature values. Data & Knowledge Engineering, 2009.
M. Sundararajan and A. Najmi. The many Shapley values for model explanation. In International Conference on Machine Learning, 2019.

A IMPLEMENTATION DETAILS

For the unsupervised approach, we modelled the encoder q_\phi(z|x) as a diagonal normal distribution with mean and variance determined by a neural network:

    q_\phi(z|x) = \mathcal{N}\left( \mu_\phi(x), \sigma_\phi(x) \right)    (18)

We modelled the decoder p_\theta(x|z) as a product distribution:

    p_\theta(x|z) = \prod_i p_\theta(x_i|z)    (19)

where the distribution type (e.g. normal, categorical) of each x_i is chosen per data set and each distribution's parameters are determined by a shared neural network. We modelled the masked encoder r_\psi(z|x_S) as a Gaussian mixture:

    r_\psi(z|x_S) = \sum_j w^{(j)}_\psi(x_S) \, \mathcal{N}\left( \mu^{(j)}_\psi(x_S), \sigma^{(j)}_\psi(x_S) \right)    (20)

To allow r_\psi(z|x_S) to accept variable-size coalitions x_S as input, we simply masked out-of-coalition features with a special value (-1) that never appears in the data.

The unsupervised method has several hyperparameters: \beta, which multiplies the regularisation term; the number of components in Eq. (20); as well as the architecture and optimisation of the networks involved. For each experiment in this paper, we tuned hyperparameters to minimise the MSE of Eq. (16) on a held-out validation set; see App. B for numerical details.
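A minimal PyTorch sketch (an illustrative assumption, not the paper's code) of the masked encoder r_\psi(z|x_S) of Eq. (20) as a mixture of diagonal Gaussians, with out-of-coalition features masked by the special value -1 as described above; the layer sizes mirror the ranges in Table 2, and the class name `MaskedEncoder` is ours.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class MaskedEncoder(nn.Module):
    def __init__(self, n_features, hidden_dim=128, latent_dim=4, n_modes=2):
        super().__init__()
        self.n_modes, self.latent_dim = n_modes, latent_dim
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            # per mode: one mixture logit, a mean vector, and a log-std vector
            nn.Linear(hidden_dim, n_modes * (1 + 2 * latent_dim)),
        )

    def forward(self, x, mask):
        x_masked = x * mask + (-1.0) * (1 - mask)      # -1 never occurs in the data
        params = self.net(x_masked).view(-1, self.n_modes, 1 + 2 * self.latent_dim)
        logits = params[..., 0]                         # mixture weights w^{(j)}
        mu = params[..., 1:1 + self.latent_dim]         # component means mu^{(j)}
        sigma = params[..., 1 + self.latent_dim:].exp() # component scales sigma^{(j)}
        components = D.Independent(D.Normal(mu, sigma), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), components)
```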
For the supervised approach, we modelled g_y(x_S) using a neural network, again masking out-of-coalition features (with -1) to accommodate variable-size coalitions x_S. This method's hyperparameters, relating to architecture and optimisation, were similarly tuned to minimise the validation-set MSE; see App. B for details.

B DETAILS OF EXPERIMENTS

Here we provide numerical details for the experiments presented in the paper.

B.1 DRUG CONSUMPTION EXPERIMENT

On the Drug Consumption data from the UCI repository (Dua & Graff, 2017), we used 10 binary features from the data set (Mushrooms, Ecstasy, etc., as displayed in Fig. 5) to predict whether individuals had ever consumed an 11th drug: LSD. The explanations of Fig. 2(b) and Fig. 5 describe a random forest fit with default sklearn parameters and max_features = None, which achieves 82.2% test-set accuracy amidst a 57 : 43 class balance.

In Fig. 2(b), global off-manifold Shapley values were computed using 10^6 Monte Carlo samples of Eq. (4). For each labelled data point (x, y) sampled from the test set, a single permutation was drawn to estimate Eq. (1) and a single data point x' was drawn to estimate the off-manifold value function of Eq. (2). In all the figures of this paper, bar height represents the mean that resulted from Monte Carlo sampling, and error bars display the standard error of the mean.

Global on-manifold Shapley values in Fig. 2(b) were computed similarly, but in this case using the on-manifold value function of Eq. (3). For each sampled coalition x_S, a random data point x' was drawn from the test set, with the crucial requirement that x'_S = x_S. In the text, we refer to this as empirically estimating the conditional distribution p(x'|x_S). Such empirical estimation is only possible because this data set has a small number of all-binary features.

Tree SHAP values in Fig. 2(b) were computed with the SHAP package (Lundberg & Lee, 2017) with model_output = 'margin' and feature_perturbation = 'tree_path_dependent'.

The values labelled 'Model retraining' in Fig. 2(b) were computed by fitting a separate random forest g_S for each coalition S of features in the data set: 2^10 models in all. We used these models to compute the sum of Eq. (10), where A(g_S) represents a variant of model g_S's accuracy: it is the accuracy achieved if one predicts labels by drawing stochastically from g_S's predicted probability distribution (as opposed to deterministically drawing the maximum-probability class).
The unsupervised method involved training a variational autoencoder as described in Sec. 4.1 and App. A. The encoder, decoder, and masked encoder were each modelled using fully connected networks, trained using early stopping with patience 100. We scanned over a grid of hidden layer sizes and learning rates as in Eq. (21) as well as latent dimension = {2, 4, 8, 16} (22) latent modes = {1, 2} regularisation β = {0.05, 0.1, 0.5, 1} choosing the point with minimal validation-set MSE; see Table 2. Unsupervised values in Fig. 5 correspond to 106 Monte Carlo samples. B.2 CENSUS INCOME EXPERIMENT To produce the explanations of Fig. 2(a) we used the Census Income data set from the UCI repository (Dua & Graff, 2017). The data contains 49k individuals from the 1994 US Census, as well as 13 features which we used to predict whether annual income exceeded $50k. We trained a fully connected network (hidden layer size 50, default sklearn parameters, and early stopping), achieving a test-set accuracy of 85% amidst a 76 : 24 class balance. The Shapley values for this model are labelled Original model in Fig. 2(a). These were computed exactly as described in App. B.1, except that the supervised method used 5k epochs, and the unsupervised method used patience 50. Optimised hyperparameters are given in Table 2. The onmanifold values in Fig. 2(a) were computed using the unsupervised method. While the supervised method does not appear in the figure, it was performed to complete Table 1. We also fine-tuned the Original model to suppress the importance of sex. Motivated by Dimanov et al. (2020) we added a term to the loss that penalises the finite difference in the model output with Published as a conference paper at ICLR 2021 respect to sex (as this is a discrete feature). The modified loss function thus becomes i=1 L f(xi), yi + α f xi | do(sex = 1) f xi | do(sex = 0) (23) where L is the cross-entropy loss, f xi | do(sex = j) denotes f evaluated on the data point xi with the value for sex replaced with j, and α is a hyperparameter controlling the trade-off between optimising the accuracy and minimising the effect of sex. We fine-tuned the model for an additional 200 epochs with α = 3. The resulting model agrees with the baseline on over 98.5% of the data, and has the same test-set accuracy. Shapley values for this model are labelled Suppressed model in Fig. 2(a). B.3 ABALONE EXPERIMENT The Abalone data set from the UCI repository (Dua & Graff, 2017) contains 8 features corresponding to physical measurements (see Fig. 4c) which we used to classify abalone as younger than or older than the median age. We trained a neural network to perform this task with hidden layer size 100, default sklearn parameters, and early stopping obtaining a test-set accuracy of 78%. Shapley values in Fig. 4(c) were computed exactly as described in App. B.1, except that the supervised method involved training for 5k epochs. Optimised hyperparameters are given in Table 2. B.4 MNIST EXPERIMENT For binary MNIST (Le Cun & Cortes, 2010), we trained a fully connected network (hidden layer size 512, default sklearn parameters, and early stopping) achieving 98% test-set accuracy. The digits in Fig. 6(a) were randomly drawn from the test set. Shapley values in Fig. 6(a) were computed exactly as described in App. B.1, except that the supervised method involved training for 2k epochs, and the on-manifold explanations are based on 16k Monte Carlo samples per pixel. Optimised hyperparameters are given in Table 2. 
B.3 ABALONE EXPERIMENT

The Abalone data set from the UCI repository (Dua & Graff, 2017) contains 8 features corresponding to physical measurements (see Fig. 4(c)), which we used to classify abalone as younger than or older than the median age. We trained a neural network to perform this task with hidden layer size 100, default sklearn parameters, and early stopping, obtaining a test-set accuracy of 78%. Shapley values in Fig. 4(c) were computed exactly as described in App. B.1, except that the supervised method involved training for 5k epochs. Optimised hyperparameters are given in Table 2.

B.4 MNIST EXPERIMENT

For binary MNIST (LeCun & Cortes, 2010), we trained a fully connected network (hidden layer size 512, default sklearn parameters, and early stopping), achieving 98% test-set accuracy. The digits in Fig. 6(a) were randomly drawn from the test set. Shapley values in Fig. 6(a) were computed exactly as described in App. B.1, except that the supervised method involved training for 2k epochs, and the on-manifold explanations are based on 16k Monte Carlo samples per pixel. Optimised hyperparameters are given in Table 2. The on-manifold explanations in Fig. 6(a) were computed using the supervised method. While the unsupervised method does not appear in the figure, it was performed to complete Table 1. The average uncertainty, which is not shown in Fig. 6(a), is roughly 0.002, stated as a fraction of the maximum Shapley value in each image.