On the Tractability of SHAP Explanations

Guy Van den Broeck,1 Anton Lykov,1 Maximilian Schleich,2 Dan Suciu2
1 University of California, Los Angeles
2 University of Washington, Seattle

SHAP explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite a lot of recent interest from both academia and industry, it is not known whether SHAP explanations of common machine learning models can be computed efficiently. In this paper, we establish the complexity of computing the SHAP explanation in three important settings. First, we consider fully-factorized data distributions, and show that the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. This fully-factorized setting is often used to simplify the SHAP computation, yet our results show that the computation can be intractable for commonly used models such as logistic regression. Going beyond fully-factorized distributions, we show that computing SHAP explanations is already intractable for a very simple setting: computing SHAP explanations of trivial classifiers over naive Bayes distributions. Finally, we show that even computing SHAP over the empirical distribution is #P-hard.

1 Introduction

Machine learning is increasingly applied in high-stakes decision making. As a consequence, there is growing demand for the ability to explain the predictions of machine learning models. One popular explanation technique is to compute feature-attribution scores, in particular using the Shapley values from cooperative game theory (Roth 1988) as a principled aggregation measure to determine the influence of individual features on the prediction of the model. Shapley-value-based explanations have several desirable properties (Datta, Sen, and Zick 2016), which is why they have attracted a lot of interest in academia as well as industry in recent years (see, e.g., Gade et al. (2019)).

Štrumbelj and Kononenko (2014) show that Shapley values can be used to explain arbitrary machine learning models. Datta, Sen, and Zick (2016) use Shapley-value-based explanations as part of a broader framework for algorithmic transparency. Lundberg and Lee (2017) use Shapley values in a framework that unifies various explanation techniques, and they coined the term SHAP explanation. They show that the SHAP explanation is effective in explaining predictions in the medical domain; see Lundberg et al. (2020). More recently there has been a lot of work on the tradeoffs of variants of the original SHAP explanations, e.g., Sundararajan and Najmi (2020), Kumar et al. (2020), Janzing, Minorics, and Bloebaum (2020), Merrick and Taly (2020), and Aas, Jullum, and Løland (2019).

Despite all of this interest, there is considerable confusion about the tractability of computing SHAP explanations. The SHAP explanation determines the influence of a given feature by systematically computing the expected value of the model given subsets of the features. As a consequence, the complexity of computing SHAP explanations depends on the predictive model as well as on assumptions about the underlying data distribution. Lundberg et al.
(2020) describe a polynomial-time algorithm for computing the SHAP explanation over decision trees, but online discussions have pointed out that this algorithm is not correct as stated. We present a concrete example of this shortcoming in the supplementary material, in Appendix A. In contrast, for fully-factorized distributions, Bertossi et al. (2020) prove that there are models for which computing the SHAP explanation is #P-hard. A contemporaneous paper by Arenas et al. (2020) shows that computing the SHAP explanation for tractable logical circuits over uniform and fully-factorized binary data distributions is tractable. In general, the complexity of the SHAP explanation is open.

In this paper we consider the original formulation of the SHAP explanation by Lundberg and Lee (2017) and analyze its computational complexity under the following data distributions and model classes:

1. First, we consider fully-factorized distributions, which are the simplest possible data distributions. Fully-factorized distributions capture the assumption that the model's features are independent, which is a commonly used assumption to simplify the computation of the SHAP explanations; see for example Lundberg and Lee (2017). For fully-factorized distributions and any prediction model, we show that the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. It follows that there are classes of models for which the computation is tractable (e.g., linear regression, decision trees, tractable circuits), while for other models, including commonly used ones such as logistic regression and neural nets with sigmoid activation functions, it is #P-hard.

2. Going beyond fully-factorized distributions, we show that computing the SHAP explanation becomes intractable already for the simplest probabilistic model that does not assume feature independence: naive Bayes. As a consequence, the complexity of computing SHAP explanations on such data distributions is also intractable for many classes of models, including linear and logistic regression.

3. Finally, we consider the empirical distribution, and prove that computing SHAP explanations is #P-hard for this class of distributions. This result implies that the algorithm by Lundberg et al. (2020) cannot be fixed to compute the exact SHAP explanations over decision trees in polynomial time.

2 Background and Problem Statement

Suppose our data is described by $n$ indexed features $X = \{X_1, \dots, X_n\}$. Each feature variable $X$ takes a value from a finite domain $\mathrm{dom}(X)$. A data instance $x = (x_1, \dots, x_n)$ consists of a value $x \in \mathrm{dom}(X)$ for every feature $X$. The instance space is denoted $\mathcal{X} = \mathrm{dom}(X_1) \times \cdots \times \mathrm{dom}(X_n)$, so that $x \in \mathcal{X}$. We are also given a learned function $F : \mathcal{X} \to \mathbb{R}$ that computes a prediction $F(x)$ on each instance $x$. Throughout this paper we assume that the prediction $F(x)$ can be computed in polynomial time in $n$.

For a particular prediction $F(x)$, the goal of local explanations is to clarify why the function $F$ gave its prediction on instance $x$, usually by attributing credit to the features. We will focus on local explanations that are inspired by game-theoretic Shapley values (Datta, Sen, and Zick 2016; Lundberg and Lee 2017). Specifically, we will work with the SHAP explanations as defined by Lundberg and Lee (2017).
2.1 SHAP Explanations

To produce SHAP explanations, one needs an additional ingredient: a probability distribution $\Pr(X)$ over the features, which we call the data distribution. We will use this distribution to reason about partial instances. Concretely, for a set of indices $S \subseteq [n] = \{1, \dots, n\}$, we let $x_S$ denote the restriction of the complete instance $x$ to those features $X_S$ with indices in $S$. Abusing notation, we will also use $x_S$ to denote the probabilistic event $X_S = x_S$.

Under this data distribution, it now becomes possible to ask for the expected value of the predictive function $F$. Clearly, for a complete data instance $x$ we have that $E[F \mid x] = F(x)$, as there is no uncertainty about the features. However, for a partial instance $x_S$, which does not assign values to the features outside of $X_S$, we appeal to the data distribution $\Pr$ to compute the expectation of the function $F$ as $E_{\Pr}[F \mid x_S] = \sum_{x' \in \mathcal{X}} F(x') \Pr(x' \mid x_S)$.

The SHAP explanation framework draws from Shapley values in cooperative game theory. Given a particular instance $x$, it considers the features $X$ to be players in a coalition game: the game of making a prediction for $x$. SHAP explanations are defined in terms of a set function $v_{F,x,\Pr} : 2^{X} \to \mathbb{R}$. Its purpose is to evaluate the value of each coalition of players/features $X_S \subseteq X$ in making the prediction $F(x)$ under data distribution $\Pr$. Concretely, following Lundberg and Lee (2017), this value function is the conditional expectation of the function $F$:

$v_{F,x,\Pr}(X_S) \;\stackrel{\mathrm{def}}{=}\; E_{\Pr}[F \mid x_S]$. (1)

We will elide $F$, $x$, and $\Pr$ when they are clear from context. Our goal, however, is to assign credit to individual features. In the context of a coalition $X_S$, the contribution of an individual feature $X \notin X_S$ is given by

$c(X, X_S) \;\stackrel{\mathrm{def}}{=}\; v(X_S \cup \{X\}) - v(X_S)$, (2)

where each term is implicitly w.r.t. the same $F$, $x$, and $\Pr$. Finally, the SHAP explanation computes a score for each feature $X \in X$ averaged over all possible contexts, and thus measures the influence feature $X$ has on the outcome. Let $\pi$ be a permutation on the set of features $X$, i.e., $\pi$ fixes a total order on all features, and let $\pi_{<X}$ denote the set of features that come before $X$ in the order $\pi$. The SHAP score of feature $X$ averages its contribution over all permutations:

$\mathrm{SHAP}_{F,x,\Pr}(X) \;\stackrel{\mathrm{def}}{=}\; \frac{1}{|X|!} \sum_{\pi} c(X, \pi_{<X})$. (3)

Two properties of this score follow directly from the definition. First, the SHAP score is linear in the function being explained: for functions $F, G$ and constants $\alpha, \beta$,

$\mathrm{SHAP}_{\alpha F + \beta G}(X) = \alpha\,\mathrm{SHAP}_F(X) + \beta\,\mathrm{SHAP}_G(X)$. (4)

Second, summing the scores of all features telescopes the contributions, which yields

$\sum_{i=1,n} \mathrm{SHAP}_F(X_i) = F(x) - E[F]$. (5)

2.2 Problem Statement

Throughout the paper we take the instance whose prediction we want to explain to be $e = (1, \dots, 1)$; this is without loss of generality, since the domain values of each feature can always be renamed. We study the following computational problems, parameterized by a class $\mathcal{F}$ of functions and a class $\mathsf{PR}$ of data distributions. The computation problem F-SHAP($\mathcal{F}$, $\mathsf{PR}$) is: given a function $F \in \mathcal{F}$, a distribution $\Pr \in \mathsf{PR}$, and a feature $X$, compute $\mathrm{SHAP}_{F,e,\Pr}(X)$. The decision problem D-SHAP($\mathcal{F}$, $\mathsf{PR}$) is: given additionally a threshold $t$, decide whether $\mathrm{SHAP}_{F,e,\Pr}(X) > t$.

To establish the complexities of these problems, we use standard notions of reductions. A polynomial-time reduction from a problem A to a problem B, denoted by A ≤P B, and also called a Cook reduction, is a polynomial-time algorithm for the problem A with access to an oracle for the problem B. We write A ≡P B when both A ≤P B and B ≤P A. In the remainder of this paper we study the computational complexity of these problems for natural hypothesis classes $\mathcal{F}$ that are popular in machine learning, as well as common classes of data distributions $\mathsf{PR}$, including those most often used to compute SHAP explanations.
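To make Equations (1)–(3) and the identity (5) concrete, the following minimal sketch computes SHAP scores directly from the definition, by enumerating all permutations and all instances. It is exponential in the number of features and is only meant as intuition, not as an algorithm; the model and the joint distribution are illustrative placeholders.

```python
from itertools import permutations
from math import factorial

# A toy joint distribution over 3 binary features, given explicitly as
# {instance: probability}. Any data distribution could be plugged in here.
PR = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def F(x):
    # Hypothetical model to be explained (here a small linear score).
    return 1.0 * x[0] + 2.0 * x[1] - 0.5 * x[2]

def cond_expectation(e, S):
    # Equation (1): v(X_S) = E[F | x_S], the expectation of F over instances
    # that agree with e on the features indexed by S.
    num = sum(p * F(x) for x, p in PR.items() if all(x[i] == e[i] for i in S))
    den = sum(p for x, p in PR.items() if all(x[i] == e[i] for i in S))
    return num / den if den > 0 else 0.0  # convention for zero-probability events

def shap(e, i):
    # Equation (3): average, over all feature orderings, of the contribution
    # (Equation (2)) of feature i when it joins the features preceding it.
    n, total = len(e), 0.0
    for pi in permutations(range(n)):
        before = set(pi[:pi.index(i)])
        total += cond_expectation(e, before | {i}) - cond_expectation(e, before)
    return total / factorial(n)

e = (1, 1, 1)
scores = [shap(e, i) for i in range(3)]
# The last two numbers agree, illustrating Equation (5): sum_i SHAP(X_i) = F(e) - E[F].
print(scores, sum(scores), F(e) - cond_expectation(e, set()))
```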
3 SHAP over Fully-Factorized Distributions

We start our study of the complexity of SHAP by considering the simplest probability distribution: a fully-factorized distribution, where all features are independent. There are both practical and computational reasons why it makes sense to assume a fully-factorized data distribution when computing SHAP explanations. First, functions $F$ are often the product of a supervised learning algorithm that does not have access to a generative model of the data; it is purely discriminative. Hence, it is convenient to make the practical assumption that the data distribution is fully factorized, and therefore easy to estimate. Second, fully-factorized distributions are highly tractable; for example, they make it easy to compute expectations of linear regression functions (Khosravi et al. 2019b) and other hard inference tasks (Vergari et al. 2020). Lundberg and Lee (2017) indeed observe that computing the SHAP explanation on an arbitrary data distribution is challenging and consider using fully-factorized distributions (Sec. 4, Eq. 11). Other prior work on computing explanations also uses fully-factorized distributions of features, e.g., Datta, Sen, and Zick (2016); Štrumbelj and Kononenko (2014).

As we will show, the SHAP explanation can be computed efficiently for several popular classifiers when the distribution is fully factorized. Yet, such simple data distributions are not a guarantee for tractability: computing SHAP scores will be intractable for some other common classifiers.

3.1 Equivalence to Computing Expectations

Before studying various function classes, we prove a key result that connects the complexity of SHAP explanations to the complexity of computing expectations. Let $\mathsf{IND}_n$ be the class of fully-factorized probability distributions over $n$ discrete and independent random variables $X_1, \dots, X_n$. That is, for every instance $(x_1, \dots, x_n) \in \mathcal{X}$, we have that $\Pr(X_1 = x_1, \dots, X_n = x_n) = \prod_i \Pr(X_i = x_i)$. Let $\mathsf{IND} \stackrel{\mathrm{def}}{=} \bigcup_{n \geq 0} \mathsf{IND}_n$. We show that for every function class $\mathcal{F}$, the complexity of F-SHAP($\mathcal{F}$, IND) is the same as that of the fully-factorized expectation problem.

Definition 3 (Fully-Factorized Expectation Problem). Let $\mathcal{F}$ be a class of real-valued functions with discrete inputs. The fully-factorized expectation problem for $\mathcal{F}$, denoted E($\mathcal{F}$), is the following: given a function $F \in \mathcal{F}$ and a probability distribution $\Pr \in \mathsf{IND}$, compute $E_{\Pr}[F]$.

We know from Equation 5 that for any function $F$ over $n$ features, E({F}) ≤P F-SHAP({F}, IND$_n$), because $E[F] = F(x) - \sum_{i=1,n} \mathrm{SHAP}_F(X_i)$. In this section we prove that the converse holds too:

Theorem 1. For any function $F : \mathcal{X} \to \mathbb{R}$, we have that F-SHAP({F}, IND$_n$) ≡P E({F}).

In other words, for any function $F$, the complexity of computing the SHAP scores is the same as the complexity of computing the expected value $E[F]$ under a fully-factorized data distribution. One direction of the proof is immediate: E({F}) ≤P F-SHAP({F}, IND$_n$) because, if we are given an oracle to compute $\mathrm{SHAP}_F(X_i)$ for every feature $X_i$, then we can obtain $E[F]$ from Equation 5 (recall that we assumed that $F(x)$ is computable in polynomial time). The hard part of the proof is the opposite direction: we will show in Sec. 3.2 how to compute $\mathrm{SHAP}_F(X_i)$ given an oracle for computing $E[F]$. Theorem 1 immediately extends to classes of functions $\mathcal{F}$, and to any number of variables, and therefore implies that F-SHAP($\mathcal{F}$, IND) ≡P E($\mathcal{F}$). Sections 3.3 and 3.4 will discuss the consequences of this result, by delineating which function classes support tractable SHAP explanations, and which do not. The next section is devoted to proving our main technical result.
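As a concrete instance of Definition 3, the following minimal sketch computes the fully-factorized expectation $E[F]$ by brute-force enumeration; the weights and probabilities are illustrative only. For the linear model used here, the result coincides with evaluating $F$ at the vector of feature means (mean imputation), a shortcut we return to in Section 3.3.

```python
from itertools import product

p = [0.3, 0.6, 0.9]              # Pr(X_i = 1) for independent binary features
w0, w = 0.5, [1.0, -2.0, 0.25]   # hypothetical linear model F(x) = w0 + w . x

def F(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def prob(x):
    # Probability of a complete instance under the product distribution.
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi if xi == 1 else 1.0 - pi
    return out

# E(F): fully-factorized expectation by explicit enumeration (Definition 3).
expectation = sum(prob(x) * F(x) for x in product([0, 1], repeat=len(p)))

# For a *linear* model the same value is obtained by mean imputation,
# i.e. evaluating F at the feature means; the two printed numbers agree.
print(expectation, F(p))
```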
3.2 Proof of Theorem 1

We start with the special case when all features $X$ are binary: $\mathrm{dom}(X) = \{0, 1\}$. We denote by $\mathsf{INDB}_n$ the class of fully-factorized distributions over binary domains.

Theorem 2. For any function $F : \{0, 1\}^n \to \mathbb{R}$, we have that F-SHAP({F}, INDB$_n$) ≡P E({F}).

Proof. We prove only F-SHAP({F}, INDB$_n$) ≤P E({F}); the opposite direction follows immediately from Equation 5. We will assume w.l.o.g. that $F$ has $n + 1$ binary features $X = \{X_0\} \cup X'$ and show how to compute $\mathrm{SHAP}_F(X_0)$ using repeated calls to an oracle for computing $E[F]$, i.e., the expectation of the same function $F$, but over fully-factorized distributions with different probabilities.

The probability distribution $\Pr$ is given to us by $n + 1$ rational numbers, $p_i \stackrel{\mathrm{def}}{=} \Pr(X_i = 1)$, $i = 0, n$; obviously, $\Pr(X_i = 0) = 1 - p_i$. Recall that the instance whose outcome we want to explain is $e = (1, \dots, 1)$, and that for any set $S \subseteq [n]$ we write $e_S$ for the event $\bigwedge_{i \in S} (X_i = 1)$. Then, we have that

$\mathrm{SHAP}(X_0) = \sum_{k=0}^{n} \frac{k!\,(n-k)!}{(n+1)!}\, D_k$, (6)

where $D_k = \sum_{S \subseteq [n] : |S| = k} \big( E[F \mid e_{S \cup \{0\}}] - E[F \mid e_S] \big)$.

Let $F_0 \stackrel{\mathrm{def}}{=} F[X_0 := 0]$ and $F_1 \stackrel{\mathrm{def}}{=} F[X_0 := 1]$ (both are functions in $n$ binary features, $X' = \{X_1, \dots, X_n\}$). Then:

$E[F \mid e_{S \cup \{0\}}] = E[F_1 \mid e_S]$
$E[F \mid e_S] = E[F_0 \mid e_S] \, (1 - p_0) + E[F_1 \mid e_S] \, p_0$

and therefore $D_k$ is given by:

$D_k = (1 - p_0) \sum_{S \subseteq [n] : |S| = k} \big( E[F_1 \mid e_S] - E[F_0 \mid e_S] \big).$

For any function $G$, Equation 1 defines the value $v_{G,e,\Pr}(X_S)$ as $E[G \mid e_S]$. Abusing notation, we write $v_{G,k}$ for the sum of these quantities over all sets $S$ of cardinality $k$:

$v_{G,k} \stackrel{\mathrm{def}}{=} \sum_{S \subseteq [n], |S| = k} E[G \mid e_S]$. (7)

We will prove the following claim.

Claim 1. Let $G$ be a function over $n$ binary variables. Then the $n + 1$ quantities $v_{G,0}, \dots, v_{G,n}$ can be computed in polynomial time, using $n + 1$ calls to an oracle for E({G}).

Note that an oracle for E({F}) is also an oracle for both E({F$_0$}) and E({F$_1$}), by simply setting $\Pr(X_0 = 1) = 0$ or $\Pr(X_0 = 1) = 1$ respectively. Therefore, Claim 1 proves Theorem 2, by applying it once to $F_0$ and once to $F_1$ in order to derive all the quantities $v_{F_0,k}$ and $v_{F_1,k}$, thereby computing $D_k$, and finally computing $\mathrm{SHAP}_F(X_0)$ using Equation 6.

It remains to prove Claim 1. Fix a function $G$ over $n$ binary variables and let $v_k = v_{G,k}$. Let $p_j = \Pr(X_j = 1)$, for $j = 1, n$, define the distribution over which we need to compute $v_0, \dots, v_n$. We will prove the following additional claim.

Claim 2. Given any real number $z > 0$, consider the distribution $\Pr_z(X_j = 1) = p'_j \stackrel{\mathrm{def}}{=} \frac{p_j + z}{1 + z}$, for $j = 1, n$, and let $E_z[G]$ denote $E[G]$ under the distribution $\Pr_z$. We then have that

$\sum_{k=0,n} z^k v_k = (1 + z)^n E_z[G]$. (8)

Assuming Claim 2 holds, we prove Claim 1. Choose any $n + 1$ distinct values for $z$, use the oracle to compute the quantities $E_{z_0}[G], \dots, E_{z_n}[G]$, and form a system of $n + 1$ linear equations (8) with unknowns $v_0, \dots, v_n$. Next, observe that its matrix is a non-singular Vandermonde matrix, hence the system has a unique solution, which can be computed in polynomial time.

It remains to prove Claim 2. Because of independence, the probability of an instance $x \in \{0, 1\}^n$ is $\Pr(x) = \prod_{i : x_i = 1} p_i \prod_{i : x_i = 0} (1 - p_i)$, where $x_i$ looks up the value of feature $X_i$ in instance $x$. Similarly, $\Pr_z(x) = \prod_{i : x_i = 1} p'_i \prod_{i : x_i = 0} (1 - p'_i)$. Using direct calculations we derive:

$\Pr(x) \prod_{i : x_i = 1} \Big(1 + \frac{z}{p_i}\Big) = (1 + z)^n \Pr_z(x)$. (9)

Separately we also derive the following identity, using the fact that $\Pr(e_S) = \prod_{i \in S} p_i$ by independence:

$E[G \mid e_S] = \frac{1}{\prod_{i \in S} p_i} \sum_{x : x_S = e_S} G(x) \Pr(x)$. (10)

We are now in a position to prove Claim 2:

$\sum_{k=0,n} z^k v_k = \sum_{k=0,n} z^k \sum_{S \subseteq [n] : |S| = k} E[G \mid e_S] = \sum_{S \subseteq [n]} z^{|S|} E[G \mid e_S] = \sum_{S \subseteq [n]} \frac{z^{|S|}}{\prod_{i \in S} p_i} \sum_{x : x_S = e_S} G(x) \Pr(x).$

The last step follows from Equation 10. Next, we exchange the summations $\sum_S$ and $\sum_x$, after which we apply the identity $\sum_{S \subseteq A} \prod_{i \in S} u_i = \prod_{i \in A} (1 + u_i)$:

$= \sum_{x \in \{0,1\}^n} G(x) \Pr(x) \sum_{S \subseteq \{i : x_i = 1\}} \prod_{i \in S} \frac{z}{p_i} = \sum_{x \in \{0,1\}^n} G(x) \Pr(x) \prod_{i : x_i = 1} \Big(1 + \frac{z}{p_i}\Big) = (1 + z)^n \sum_{x \in \{0,1\}^n} G(x) \Pr_z(x) = (1 + z)^n E_z[G].$

The final step uses Equation 9. This completes the proof of Claim 2 as well as Theorem 2.
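The proof of Theorem 2 is constructive. The sketch below follows it for small binary models: the brute-force `expectation` function stands in for whatever efficient expectation oracle the model class admits, the Vandermonde system recovers the quantities $v_{G,k}$ of Equation (7), and Equation (6) assembles the SHAP score of $X_0$. All model weights and probabilities are illustrative, and floating-point linear algebra replaces the exact rational arithmetic of the proof.

```python
import numpy as np
from itertools import product
from math import factorial

def expectation(G, q):
    """Brute-force stand-in for the oracle E[G] under the product distribution
    with Pr(X_i = 1) = q[i]; Theorem 2 only assumes *some* such oracle."""
    total = 0.0
    for x in product([0, 1], repeat=len(q)):
        w = 1.0
        for xi, qi in zip(x, q):
            w *= qi if xi else 1 - qi
        total += w * G(x)
    return total

def shap_x0(F, p):
    """SHAP score of X_0 at e = (1,...,1) for F over n+1 binary features,
    following the proof of Theorem 2; assumes 0 < p_i < 1 for i >= 1."""
    p0, q = p[0], p[1:]
    n = len(q)
    F0 = lambda x: F((0,) + tuple(x))   # F[X_0 := 0]
    F1 = lambda x: F((1,) + tuple(x))   # F[X_0 := 1]

    def v(G):
        # Claim 1: recover v_k = sum_{|S|=k} E[G | e_S], k = 0..n, from n+1
        # oracle calls via the linear system of Equation (8):
        #   sum_k z^k v_k = (1+z)^n E_z[G], with Pr_z(X_j=1) = (p_j + z)/(1 + z).
        zs = [float(t + 1) for t in range(n + 1)]            # distinct z > 0
        A = np.array([[z ** k for k in range(n + 1)] for z in zs])
        b = np.array([(1 + z) ** n *
                      expectation(G, [(qj + z) / (1 + z) for qj in q])
                      for z in zs])
        return np.linalg.solve(A, b)    # non-singular Vandermonde system

    v0, v1 = v(F0), v(F1)
    # Equation (6): SHAP(X_0) = sum_k k!(n-k)!/(n+1)! * D_k,
    # with D_k = (1 - p_0) * (v_{F1,k} - v_{F0,k}).
    return sum(factorial(k) * factorial(n - k) / factorial(n + 1)
               * (1 - p0) * (v1[k] - v0[k]) for k in range(n + 1))

# Illustrative model and probabilities (all values hypothetical).
F = lambda x: x[0] * (1 + x[1]) + 0.5 * x[2]
print(shap_x0(F, [0.4, 0.7, 0.2]))
```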
Next, we generalize this result from binary features to arbitrary discrete features. Fix a function with $n$ inputs, $F : \mathcal{X} \to \mathbb{R}$ with $\mathcal{X} = \prod_i \mathrm{dom}(X_i)$, where each domain is an arbitrary finite set, $\mathrm{dom}(X_i) = \{1, 2, \dots, m_i\}$; we assume w.l.o.g. that $m_i > 1$. A fully-factorized probability space $\Pr \in \mathsf{IND}_n$ is defined by numbers $p_{ij} \in [0, 1]$, $i = 1, n$, $j = 1, m_i$, such that, for all $i$, $\sum_j p_{ij} = 1$. Given $F$ and $\Pr$ over the domain $\prod_i \mathrm{dom}(X_i)$, we define their projections $F^\pi$, $\Pr^\pi$ over the binary domain $\{0, 1\}^n$ as follows. For any instance $x \in \{0, 1\}^n$, let $T(x)$ denote the event asserting that $X_j = 1$ iff $x_j = 1$. Formally,

$T(x) \stackrel{\mathrm{def}}{=} \bigwedge_{j : x_j = 1} (X_j = 1) \;\wedge\; \bigwedge_{j : x_j = 0} (X_j \neq 1).$

Then, the projections are defined as follows: for $x \in \{0, 1\}^n$,

$\Pr^\pi(x) \stackrel{\mathrm{def}}{=} \Pr(T(x)) \qquad F^\pi(x) \stackrel{\mathrm{def}}{=} E[F \mid T(x)]$. (11)

Notice that $F^\pi$ depends both on $F$ and on the probability distribution $\Pr$. Intuitively, the projection only distinguishes between $X_j = 1$ and $X_j \neq 1$, for example:

$F^\pi(1, 0, 0) = E[F \mid X_1 = 1, X_2 \neq 1, X_3 \neq 1]$
$\Pr^\pi(1, 0, 0) = \Pr(X_1 = 1, X_2 \neq 1, X_3 \neq 1)$

We prove the following result in Appendix B:

Proposition 3. Let $F : \mathcal{X} \to \mathbb{R}$ be a function with $n$ input features, and $\Pr \in \mathsf{IND}_n$ a fully-factorized distribution over $\mathcal{X}$. Then (1) for any feature $X_j$, $\mathrm{SHAP}_{F,\Pr}(X_j) = \mathrm{SHAP}_{F^\pi,\Pr^\pi}(X_j)$, and (2) E({F$^\pi$}) ≤P E({F}).

Item (1) states that the SHAP score of $F$ computed over the probability space $\Pr$ is the same as that of its projection $F^\pi$ (which depends on $\Pr$) over the projected probability space $\Pr^\pi$. Item (2) says that, for any probability space over $\{0, 1\}^n$ (not necessarily $\Pr^\pi$), we can compute $E[F^\pi]$ in polynomial time given access to an oracle for computing $E[F]$.

We can now complete the proof of Theorem 1, by showing that F-SHAP({F}, IND$_n$) ≤P E({F}). Given a function $F$ and a probability space $\Pr \in \mathsf{IND}_n$, in order to compute $\mathrm{SHAP}_{F,\Pr}(X_j)$, by item (1) of Proposition 3 it suffices to show how to compute $\mathrm{SHAP}_{F^\pi,\Pr^\pi}(X_j)$. By Theorem 2, we can compute the latter given access to an oracle for computing $E[F^\pi]$. Finally, by item (2) of the proposition, we can compute $E[F^\pi]$ given an oracle for computing $E[F]$.

3.3 Tractable Function Classes

Given the polynomial-time equivalence between computing SHAP explanations and computing expectations under fully-factorized distributions, a natural next question is: which real-world hypothesis classes in machine learning support efficient computation of SHAP scores?

Corollary 4. For the following function classes $\mathcal{F}$, computing SHAP scores F-SHAP($\mathcal{F}$, IND) is in polynomial time in the size of the representations of the function $F \in \mathcal{F}$ and the fully-factorized distribution $\Pr \in \mathsf{IND}$.

1. Linear regression models
2. Decision and regression trees
3. Random forests or additive tree ensembles
4. Factorization machines, regression circuits
5. Boolean functions in d-DNNF, binary decision diagrams
6. Bounded-treewidth Boolean functions in CNF

These are all consequences of Theorem 1, and of the fact that computing fully-factorized expectations E($\mathcal{F}$) for these function classes $\mathcal{F}$ is in polynomial time; a sketch for regression trees (item 2) is given below.
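As an illustration of item 2 of Corollary 4, the following minimal sketch computes the expectation of a small regression tree under a fully-factorized distribution. The tree encoding and all numbers are hypothetical; the sketch assumes each feature is tested at most once along any root-to-leaf path, so that conditioning on a branch simply routes the recursion.

```python
# A regression tree encoded as nested tuples: a leaf is a float; an internal
# node is (feature_index, subtree_if_0, subtree_if_1).
TREE = (0,
        (1, 2.0, -1.0),            # branch X_0 = 0
        (2, 0.5, (1, 3.0, 4.0)))   # branch X_0 = 1

p = [0.3, 0.6, 0.9]  # Pr(X_i = 1) for independent binary features (illustrative)

def expectation(node, p):
    """E[tree] under the product distribution: root-to-leaf paths are mutually
    exclusive, so recurse and weight each branch by its probability."""
    if not isinstance(node, tuple):
        return node                    # leaf: constant prediction
    i, if_zero, if_one = node
    return (1 - p[i]) * expectation(if_zero, p) + p[i] * expectation(if_one, p)

print(expectation(TREE, p))
```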
Concretely, we have the following observations about fully-factorized expectations:

1. Expectations of linear regression functions are efficiently computed by mean imputation (Khosravi et al. 2019b). The tractability of SHAP on linear regression models is well known. In fact, Štrumbelj and Kononenko (2014) provide a closed-form formula for this case.

2. Paths from root to leaf in a decision or regression tree are mutually exclusive. The expected value of the tree is therefore the sum of the expected values of each path, which are tractable to compute within IND; see Khosravi et al. (2020).

3. Additive mixtures of trees, as obtained through bagging or boosting, are tractable, by the linearity of expectation.

4. Factorization machines extend linear regression models with feature interaction terms and factorize the parameters of the higher-order terms (Rendle 2010). Their expectations remain easy to compute. Regression circuits are a graph-based generalization of linear regression. Khosravi et al. (2019a) provide an algorithm to efficiently take their expectation w.r.t. a probabilistic circuit distribution, which is trivial to construct for the fully-factorized case.

The remaining tractable cases are Boolean functions. Computing fully-factorized expectations of Boolean functions is widely known as the weighted model counting task (WMC) (Sang, Beame, and Kautz 2005; Chavira and Darwiche 2008). WMC has been extensively studied both in the theory and the AI communities, and the precise complexity of E($\mathcal{F}$) is known for many families of Boolean functions $\mathcal{F}$. These results immediately carry over to the F-SHAP($\mathcal{F}$, IND) problem through Theorem 1:

5. Expectations can be computed in time linear in the size of various circuit representations, called d-DNNF, which include binary decision diagrams (OBDD, FBDD) and SDDs (Bryant 1986; Darwiche and Marquis 2002). (In contemporaneous work, Arenas et al. (2020) also show that the SHAP explanation is tractable for d-DNNFs, but for the more restricted class of uniform data distributions.)

6. Bounded-treewidth CNFs are efficiently compiled into OBDD circuits (Ferrara, Pan, and Vardi 2005), and thus enjoy tractable expectations.

To conclude this section, the reader may wonder about the algorithmic complexity of solving F-SHAP($\mathcal{F}$, IND) with an oracle for E($\mathcal{F}$) under the reduction in Section 3.2. Briefly, we require a linear number of calls to the oracle, as well as time in $O(n^3)$ for solving a system of linear equations. Hence, for those classes, such as d-DNNF circuits, where expectations are linear in the size of the (circuit) representation of $F$, computing F-SHAP($\mathcal{F}$, IND) is also linear in the representation size and polynomial in $n$.

3.4 Intractable Function Classes

The polynomial-time equivalence of Theorem 1 also implies that computing SHAP scores must be intractable whenever computing fully-factorized expectations is intractable. This section reviews some of those function classes $\mathcal{F}$, including some for which the computational hardness of E($\mathcal{F}$) is well known. We begin, however, with a more surprising result.

Logistic regression is one of the simplest and most widely used machine learning models, yet it is conspicuously absent from Corollary 4. We prove that computing the expectation of a logistic regression model is #P-hard, even under a uniform data distribution, which is of independent interest. A logistic regression model is a parameterized function $F(x) \stackrel{\mathrm{def}}{=} \sigma(w \cdot x)$, where $w = (w_0, w_1, \dots, w_n)$ is a vector of weights, $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function, $x \stackrel{\mathrm{def}}{=} (1, x_1, x_2, \dots, x_n)$, and $w \cdot x \stackrel{\mathrm{def}}{=} \sum_{i=0,n} w_i x_i$ is the dot product. Note that we define the logistic regression function to output probabilities, not data labels. Let LOGIT$_n$ denote the class of logistic regression functions with $n$ variables, and LOGIT $= \bigcup_n$ LOGIT$_n$. We prove the following:

Theorem 5. Computing the expectation of a logistic regression model w.r.t. a uniform data distribution is #P-hard.

The full proof in Appendix C is by reduction from counting the solutions to the number partitioning problem. Because the uniform distribution is contained in IND, and following Theorem 1, we immediately obtain:

Corollary 6. The computational problems E(LOGIT) and F-SHAP(LOGIT, IND) are both #P-hard.
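The hardness proof itself is the reduction from number partitioning in Appendix C, but the following small sketch illustrates one symptom of it: the mean-imputation shortcut that makes linear models tractable (Corollary 4, item 1) gives the wrong answer for a sigmoid output, so computing $E[\sigma(w \cdot X)]$ seemingly requires reasoning about exponentially many instances. All weights and probabilities are illustrative.

```python
from itertools import product
from math import exp, prod

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Hypothetical logistic regression model F(x) = sigmoid(w0 + w . x) and a
# uniform distribution over three independent binary features (Theorem 5).
w0, w = -1.0, [2.0, 2.0, -3.0]
p = [0.5, 0.5, 0.5]

def F(x):
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

# Exact expectation: a sum over all 2^n instances ...
exact = sum(F(x) * prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
            for x in product([0, 1], repeat=len(p)))

# ... whereas applying the sigmoid to the mean of the linear score
# (the linear-model shortcut) yields a different number.
shortcut = F(p)
print(exact, shortcut)   # approximately 0.430 vs. 0.378
```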
We are now ready to list general function classes for which computing the SHAP explanation is #P-hard.

Corollary 7. For the following function classes $\mathcal{F}$, computing SHAP scores F-SHAP($\mathcal{F}$, IND) is #P-hard in the size of the representations of the function $F \in \mathcal{F}$ and the fully-factorized distribution $\Pr \in \mathsf{IND}$.

1. Logistic regression models (Corollary 6)
2. Neural networks with sigmoid activation functions
3. Naive Bayes classifiers, logistic circuits
4. Boolean functions in CNF or DNF

Our intractability results stem from these observations:

2. Each neuron is a logistic regression model, and therefore this class subsumes LOGIT.

3. The conditional distribution used by a naive Bayes classifier is known to be equivalent to a logistic regression model (Ng and Jordan 2002). Logistic circuits are a graph-based classification model that subsumes logistic regression (Liang and Van den Broeck 2019).

4. For general CNFs and DNFs, weighted model counting, and therefore E($\mathcal{F}$), is #P-hard. This is true even for very restricted classes, such as monotone 2CNF and 2DNF functions, and Horn clause logic (Wei and Selman 2005).

4 Beyond Fully-Factorized Distributions

Features in real-world data distributions are not independent. In order to capture more realistic assumptions about the data when computing SHAP scores, one needs a more intricate probabilistic model. In this section we prove that the complexity of computing the SHAP explanation quickly becomes intractable, even over the simplest probabilistic models, namely naive Bayes models. To make computing the SHAP explanation as easy as possible, we will assume that the function $F$ simply outputs the value of one feature. We show that even in this case, even for function classes that are tractable under fully-factorized distributions, computing SHAP explanations becomes computationally hard.

Let NBN$_n$ denote the family of naive Bayes networks over $n + 1$ binary variables $X = \{X_0, X_1, \dots, X_n\}$, where $X_0$ is a parent of all other features:

$\Pr(X) = \Pr(X_0) \prod_{i=1,n} \Pr(X_i \mid X_0).$

As usual, the class NBN $\stackrel{\mathrm{def}}{=} \bigcup_{n \geq 0}$ NBN$_n$. We write $X_0$ for the function $F$ that returns the value of feature $X_0$; that is, $F(x) = x_0$. We prove the following.

Theorem 8. The decision problem D-SHAP({X$_0$}, NBN) is NP-hard.

The proof in Appendix D is by reduction from the number partitioning problem, similar to the proof of Corollary 6. We note that the subset sum problem was also used to prove related hardness results, e.g., for proving hardness of the Shapley value in network games (Elkind et al. 2008).

This result is in sharp contrast with the complexity of the SHAP score over fully-factorized distributions in Section 3. There, the complexity was dictated by the choice of the function class $\mathcal{F}$. Here, the function is as simple as possible, yet computing SHAP is hard. This ruins any hope of achieving tractability by restricting the function, and it motivates us to restrict the probability distribution in the next section. This result is also surprising because it is efficient to compute marginal probabilities (such as the expectation of $X_0$) and conditional probabilities in naive Bayes distributions.
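To see the contrast, consider the trivial prediction function $F(x) = x_0$ of Theorem 8. Each individual value-function term $E[X_0 \mid e_S]$ of Equation (1) is a single Bayes'-rule computation, linear in $|S|$; the minimal sketch below shows this with hypothetical naive Bayes parameters. Theorem 8 says that combining the exponentially many such terms of Equation (3) into SHAP($X_0$) is nevertheless hard.

```python
from math import prod

# Hypothetical naive Bayes parameters over binary X_0, X_1, ..., X_n:
# prior = Pr(X_0 = 1); cpt[i] = (Pr(X_{i+1} = 1 | X_0 = 0), Pr(X_{i+1} = 1 | X_0 = 1)).
prior = 0.4
cpt = [(0.9, 0.2), (0.3, 0.8), (0.5, 0.7)]

def expected_x0_given_ones(S):
    """E[X_0 | X_{i+1} = 1 for all i in S] = Pr(X_0 = 1 | e_S), by Bayes' rule.
    Each such conditional expectation takes time linear in |S|."""
    like1 = prior * prod(cpt[i][1] for i in S)
    like0 = (1 - prior) * prod(cpt[i][0] for i in S)
    return like1 / (like1 + like0)

print(expected_x0_given_ones(set()))   # the marginal: Pr(X_0 = 1) = prior
print(expected_x0_given_ones({0, 2}))  # conditioning on the first and third children
```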
Theorem 8 immediately extends to a large class of probability distributions and functions. We say that $F$ depends only on $X_i$ if there exist two constants $c_0 \neq c_1$ such that $F(x) = c_0$ when $x_i = 0$ and $F(x) = c_1$ when $x_i = 1$. In other words, $F$ ignores all variables other than $X_i$, and does depend on $X_i$. We then have the following.

Corollary 9. The problem D-SHAP($\mathcal{F}$, PR) is NP-hard when PR is any of the following classes of distributions:

1. Naive Bayes, bounded-treewidth Bayesian networks
2. Bayesian networks, Markov networks, factor graphs
3. Decomposable probabilistic circuits

and when $\mathcal{F}$ is any class that contains some function $F$ that depends only on $X_0$, including the class of linear regression models and all the classes listed in Corollaries 4 and 7.

This corollary follows from two simple observations. First, each of the classes of probability distributions listed in the corollary can represent a naive Bayes network over binary variables $X$. For example, a Markov network will consist of $n$ factors $f_1(X_0, X_1), f_2(X_0, X_2), \dots, f_n(X_0, X_n)$; similar simple arguments prove that all the other classes can represent naive Bayes, including tractable probabilistic circuits such as sum-product networks (Vergari et al. 2020). Second, for each function that depends only on $X_0$, there exist two distinct constants $c_0 \neq c_1 \in \mathbb{R}$ such that $F(x) = c_0$ when $x_0 = 0$ and $F(x) = c_1$ when $x_0 = 1$. For example, if we consider the class of logistic regression functions $F(x) = \sigma(\sum_i w_i x_i)$, then we choose the weights $w_0 = 1$, $w_1 = \dots = w_n = 0$ and obtain $F(x) = 1/2$ when $x_0 = 0$ and $F(x) = 1/(1 + e^{-1})$ when $x_0 = 1$. Over the binary domain $\{0, 1\}$ such a function is equivalent to $F(x) = (c_1 - c_0) x_0 + c_0$, and therefore, by the linearity of the SHAP explanation (Equation 4), we have $\mathrm{SHAP}_F(X_0) = (c_1 - c_0) \cdot \mathrm{SHAP}_{X_0}(X_0)$ (because the SHAP explanation of a constant function $c_0$ is 0), for which, by Theorem 8, the decision problem is NP-hard.

We end this section by proving that Theorem 8 continues to hold even if the prediction function $F$ is the value of some leaf node of a (bounded-treewidth) Bayesian network. In other words, the hardness of the SHAP explanation is not tied to the function returning the root of the network, and applies to more general functions.

Corollary 10. The SHAP decision problem for Bayesian networks with latent variables is NP-hard, even if the function $F$ returns a single leaf variable of the network.

The full proof is given in Appendix E.
5 SHAP on Empirical Distributions

In supervised learning one does not require a generative model of the data; instead, the model is trained on some concrete data set: the training data. When some probabilistic model is needed, the training data itself is conveniently used as a probability model, called the empirical distribution. This distribution captures dependencies between features, while its set of possible worlds is limited to those in the data set. For example, the intent of the Kernel SHAP algorithm by Lundberg and Lee (2017) is to compute the SHAP explanation on the empirical distribution. In another example, Aas, Jullum, and Løland (2019) extend Kernel SHAP to work with dependent features, by estimating the conditional probabilities from the empirical distribution.

Compared to the data distributions considered in the previous sections, the empirical distribution has one key advantage: it has many fewer possible worlds with positive probability, which suggests increased tractability. Unfortunately, in this section we prove that computing the SHAP explanation over the empirical distribution is #P-hard in general.

To simplify the presentation, this section assumes that all features are binary: $\mathrm{dom}(X_j) = \{0, 1\}$. The probability distribution is given by a 0/1-matrix $d = (x_{ij})_{i \in [m], j \in [n]}$, where each row $(x_{i1}, \dots, x_{in})$ is an outcome with probability $1/m$. One can think of $d$ as a dataset with $n$ features and $m$ data instances, where each row $(x_{i1}, \dots, x_{in})$ is one data instance. Repeated rows are possible: if a row occurs $k$ times, then its probability is $k/m$. We denote by EMP the class of empirical distributions. The predictive function can be any function $F : \{0, 1\}^n \to \mathbb{R}$. As our data distribution is no longer strictly positive, we adopt the standard convention that $E[F \mid X_S = 1] = 0$ when $\Pr(X_S = 1) = 0$. Recall from Section 2.2 that, by convention, we compute the SHAP explanation w.r.t. the instance $e = (1, 1, \dots, 1)$, which is without loss of generality.

Somewhat surprisingly, the complexity of computing the SHAP explanation of a function $F$ over the empirical distribution given by a matrix $d$ is related to the problem of computing the expectation of a certain CNF formula associated with $d$.

Definition 4. The positive, partitioned 2CNF formula, PP2CNF, associated with a matrix $d \in \{0, 1\}^{m \times n}$ is:

$\Phi_d \stackrel{\mathrm{def}}{=} \bigwedge_{(i,j) : x_{ij} = 0} (U_i \vee V_j).$

Thus, a PP2CNF formula is over $m + n$ Boolean variables $U_1, \dots, U_m, V_1, \dots, V_n$, and has only positive clauses. The matrix $d$ dictates which clauses are present. A quasi-symmetric probability distribution is a fully-factorized distribution over the $m + n$ variables for which there exist two numbers $p, q \in [0, 1]$ such that for every $i = 1, m$, $\Pr(U_i = 1) = p$ or $\Pr(U_i = 1) = 1$, and for every $j = 1, n$, $\Pr(V_j = 1) = q$ or $\Pr(V_j = 1) = 1$. In other words, all variables $U_1, \dots, U_m$ have the same probability $p$, or have probability 1, and similarly for the variables $V_1, \dots, V_n$. We denote by EQS(PP2CNF) the expectation computation problem for PP2CNF over quasi-symmetric probability distributions. EQS(PP2CNF) is #P-hard, because computing $E[\Phi_d]$ under the uniform distribution (i.e., $\Pr(U_1 = 1) = \dots = \Pr(V_n = 1) = 1/2$) is already #P-hard (Provan and Ball 1983).
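The following minimal sketch makes Definition 4 concrete: it builds $\Phi_d$ from a small illustrative matrix and evaluates its expectation under a quasi-symmetric distribution by brute force. The enumeration is exponential and only illustrates what the (in general #P-hard) problem EQS asks for; the simpler case where every $U_i$ has probability $p$ and every $V_j$ has probability $q$ is used here.

```python
from itertools import product

# Definition 4: the PP2CNF associated to a 0/1 matrix d has one clause
# (U_i v V_j) for every 0-entry d[i][j]. A small illustrative matrix:
d = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]
m, n = len(d), len(d[0])

clauses = [(i, j) for i, row in enumerate(d) for j, x in enumerate(row) if x == 0]
print(clauses)   # [(0, 1), (1, 0), (2, 2)], i.e. (U1 v V2) ^ (U2 v V1) ^ (U3 v V3)

def phi(us, vs):
    """Truth value of Phi_d under an assignment to U_1..U_m, V_1..V_n."""
    return all(us[i] or vs[j] for i, j in clauses)

def eqs_expectation(p, q):
    """Brute-force E[Phi_d] under the quasi-symmetric distribution where every
    U_i has probability p and every V_j has probability q (problem EQS)."""
    total = 0.0
    for bits in product([0, 1], repeat=m + n):
        us, vs = bits[:m], bits[m:]
        w = 1.0
        for b in us:
            w *= p if b else 1 - p
        for b in vs:
            w *= q if b else 1 - q
        total += w * phi(us, vs)
    return total

print(eqs_expectation(0.5, 0.5))   # the uniform case of Provan and Ball (1983)
```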
We prove:

Theorem 11. Let EMP be the class of empirical distributions, and let $\mathcal{F}$ be any class of functions such that, for each $i$, it includes some function that depends only on $X_i$. Then we have that F-SHAP($\mathcal{F}$, EMP) ≡P EQS(PP2CNF). As a consequence, the problem F-SHAP($\mathcal{F}$, EMP) is #P-hard in the size of the empirical distribution.

The theorem is surprising, because the set of possible outcomes of an empirical distribution is small. This is unlike all the distributions discussed earlier, for example those mentioned in Corollary 9, which have $2^n$ possible outcomes, where $n$ is the number of features. In particular, given an empirical distribution $d$, one can compute the expectation $E[F]$ in polynomial time for any function $F$, by doing just one iteration over the data. Yet, computing the SHAP explanation of $F$ is #P-hard.

Theorem 11 implies hardness of SHAP explanations on the empirical distribution for a large class of functions.

Corollary 12. The problem F-SHAP($\mathcal{F}$, EMP) is #P-hard, when EMP is the class of empirical distributions and $\mathcal{F}$ is any class such that, for each feature $X_i$, the class contains some function that depends only on $X_i$. This includes all the function classes listed in Corollaries 4 and 7.

For instance, any class of Boolean functions that contains the $n$ single-variable functions $F \stackrel{\mathrm{def}}{=} X_i$, for $i = 1, n$, falls under this corollary. Section 4 showed an example of how the class of logistic regression functions falls under this corollary as well.

The proof of Theorem 11 follows from the following technical lemma, which is of independent interest:

Lemma 13. We have that:
1. For every matrix $d$, F-SHAP($\mathcal{F}$, d) ≤P EQS({$\Phi_d$}).
2. EQS(PP2CNF) ≤P F-SHAP($\mathcal{F}$, EMP).

The proof of the lemma is given in Appendices F and G. The first item says that we can compute the SHAP explanation in polynomial time using an oracle for computing $E[\Phi_d]$ over quasi-symmetric distributions. The oracle is called only on the PP2CNF $\Phi_d$ associated with the data $d$, but it may be called repeatedly, with different probabilities of the Boolean variables. This is somewhat surprising because the SHAP explanation is over an empirical distribution, while $E[\Phi_d]$ is taken over a fully-factorized distribution; there is no connection between these two distributions. This item immediately implies F-SHAP($\mathcal{F}$, EMP) ≤P EQS(PP2CNF), where EMP is the class of empirical distributions $d$, since the formula $\Phi_d$ is in the class PP2CNF. The second item says that a weak form of the converse also holds. It states that we can compute in polynomial time the expectation $E[\Phi]$ over a quasi-symmetric probability distribution by using an oracle for computing SHAP explanations over several matrices $d$, not necessarily restricted to the matrix associated with $\Phi$. Together, the two items of the lemma prove Theorem 11.

We end this section with a comment on the Tree SHAP algorithm in Lundberg et al. (2020), which is computed over a distribution defined by a tree-based model. Our result implies that the problem that Tree SHAP tries to solve is #P-hard. This follows immediately by observing that every empirical distribution $d$ can be represented by a binary tree of size polynomial in the size of $d$. The tree examines the attributes in the order $X_1, X_2, \dots, X_n$, and each decision node for $X_i$ has two branches: $X_i = 0$ and $X_i = 1$. A branch that does not exist in the matrix $d$ ends in a leaf with label 0. A complete branch that corresponds to a row $x_{i1}, x_{i2}, \dots, x_{in}$ in $d$ ends in a leaf with label $1/m$ (or $k/m$ if that row occurs $k$ times in $d$). The size of this tree is no larger than twice the size of the matrix (because of the extra dead-end branches). This concludes our study of SHAP explanations on the empirical distribution.
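A minimal sketch of the construction just described, with a hypothetical data matrix: the tree tests $X_1, X_2, \dots, X_n$ in order, branches absent from the data end in a 0-leaf, and a complete branch for a row that occurs $k$ times ends in a leaf labeled $k/m$.

```python
# A small illustrative dataset d: rows are instances over 3 binary features.
d = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
m = len(d)

def build(rows, depth, n):
    """Binary tree representing the empirical distribution of `rows`: internal
    nodes are (feature_index, subtree_if_0, subtree_if_1), leaves are
    probabilities (0 for dead-end branches, k/m for a row occurring k times)."""
    if not rows:
        return 0.0                     # branch not present in the data
    if depth == n:
        return len(rows) / m           # k/m for a repeated row
    zeros = [r for r in rows if r[depth] == 0]
    ones = [r for r in rows if r[depth] == 1]
    return (depth, build(zeros, depth + 1, n), build(ones, depth + 1, n))

print(build(d, 0, len(d[0])))
```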
6 Perspectives and Conclusions

We establish the complexity of computing the SHAP explanation in three important settings. First, we consider fully-factorized data distributions and show that, for any prediction model, the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. It follows that there are commonly used models, such as logistic regression, for which computing SHAP explanations is intractable. Going beyond fully-factorized distributions, we show that computing SHAP explanations is also intractable for simple functions and simple distributions: naive Bayes and empirical distributions.

The recent literature on SHAP explanations predominantly studies tradeoffs of variants of the original SHAP formulation, and relies on approximation algorithms to compute the explanations. These approximation algorithms, however, tend to make simplifying assumptions which can lead to counter-intuitive explanations; see, e.g., Slack et al. (2020). We believe that more focus should be given to the computational complexity of SHAP explanations. In particular, which classes of machine learning models can be explained efficiently using the SHAP scores? Our results show that, under the assumption of fully-factorized data distributions, there are classes of models for which the SHAP explanations can be computed in polynomial time. In future work, we plan to explore if there are classes of models for which the complexity of the SHAP explanations is tractable under more complex data distributions, such as the ones defined by tractable probabilistic circuits (Vergari et al. 2020) or tractable symmetric probability spaces (Van den Broeck, Meert, and Darwiche 2014; Beame et al. 2015).

Acknowledgements

This work is partially supported by NSF grants IIS-1907997, IIS-1954222, IIS-1943641, IIS-1956441, CCF-1837129, DARPA grant N66001-17-2-4032, a Sloan Fellowship, and gifts by Intel and Facebook Research. Schleich is supported by a RelationalAI fellowship. The authors would like to thank YooJung Choi for valuable discussions on the proof of Theorem 5.

References

Aas, K.; Jullum, M.; and Løland, A. 2019. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464.

Arenas, M.; Barceló, P.; Bertossi, L.; and Monet, M. 2020. The Tractability of SHAP-Score-Based Explanations over Deterministic and Decomposable Boolean Circuits. arXiv preprint arXiv:2007.14045.

Beame, P.; Van den Broeck, G.; Gribkoff, E.; and Suciu, D. 2015. Symmetric Weighted First-Order Model Counting. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, 313-328.

Bertossi, L.; Li, J.; Schleich, M.; Suciu, D.; and Vagena, Z. 2020. Causality-Based Explanation of Classification Outcomes. In Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, DEEM '20. New York, NY, USA: Association for Computing Machinery.

Bryant, R. E. 1986. Graph-based algorithms for boolean function manipulation. Computers, IEEE Transactions on 100(8): 677-691.

Chavira, M.; and Darwiche, A. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence 172(6-7): 772-799.

Darwiche, A.; and Marquis, P. 2002. A knowledge compilation map. Journal of Artificial Intelligence Research 17: 229-264.

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, 598-617.

Elkind, E.; Goldberg, L. A.; Goldberg, P. W.; and Wooldridge, M. J. 2008. A tractable and expressive class of marginal contribution nets and its applications. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), Estoril, Portugal, May 12-16, 2008, Volume 2, 1007-1014.

Ferrara, A.; Pan, G.; and Vardi, M. Y. 2005. Treewidth in verification: Local vs. global. In International Conference on Logic for Programming Artificial Intelligence and Reasoning, 489-503. Springer.

Gade, K.; Geyik, S. C.; Kenthapadi, K.; Mithal, V.; and Taly, A. 2019. Explainable AI in Industry. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 3203-3204. New York, NY, USA: Association for Computing Machinery.

Janzing, D.; Minorics, L.; and Bloebaum, P. 2020. Feature relevance quantification in explainable AI: A causal problem. Volume 108 of Proceedings of Machine Learning Research, 2907-2916. PMLR.

Khosravi, P.; Choi, Y.; Liang, Y.; Vergari, A.; and Van den Broeck, G. 2019a. On Tractable Computation of Expected Predictions. In Advances in Neural Information Processing Systems 32 (NeurIPS).

Khosravi, P.; Liang, Y.; Choi, Y.; and Van den Broeck, G. 2019b. What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 2716-2724.
Khosravi, P.; Vergari, A.; Choi, Y.; Liang, Y.; and Van den Broeck, G. 2020. Handling Missing Data in Decision Trees: A Probabilistic Approach. In The Art of Learning with Missing Values Workshop at ICML (Artemiss).

Kumar, I. E.; Venkatasubramanian, S.; Scheidegger, C.; and Friedler, S. 2020. Problems with Shapley-value-based explanations as feature importance measures. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119.

Liang, Y.; and Van den Broeck, G. 2019. Learning Logistic Circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI).

Lundberg, S. M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J. M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; and Lee, S. 2020. From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence 2: 56-67.

Lundberg, S. M.; Erion, G. G.; and Lee, S.-I. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.

Lundberg, S. M.; and Lee, S. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (NIPS), 4765-4774.

Merrick, L.; and Taly, A. 2020. The Explanation Game: Explaining Machine Learning Models Using Shapley Values. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, 17-38. Springer.

Ng, A. Y.; and Jordan, M. I. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, 841-848.

Provan, J. S.; and Ball, M. O. 1983. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected. SIAM J. Comput. 12(4): 777-788.

Rendle, S. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995-1000. IEEE.

Roth, A. E. 1988. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge Univ. Press.

Sang, T.; Beame, P.; and Kautz, H. A. 2005. Performing Bayesian inference by weighted model counting. In AAAI, volume 5, 475-481.

Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In AAAI/ACM Conference on AI, Ethics, and Society (AIES).

Štrumbelj, E.; and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41(3): 647-665.

Sundararajan, M.; and Najmi, A. 2020. The many Shapley values for model explanation. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119.

Van den Broeck, G.; Meert, W.; and Darwiche, A. 2014. Skolemization for Weighted First-Order Model Counting. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014.

Vergari, A.; Choi, Y.; Peharz, R.; and Van den Broeck, G. 2020. Probabilistic Circuits: Representations, Inference, Learning and Applications. AAAI Tutorial.

Wei, W.; and Selman, B. 2005. A new approach to model counting. In International Conference on Theory and Applications of Satisfiability Testing, 324-339. Springer.