# Axiomatic Aggregations of Abductive Explanations

Gagan Biradar¹, Yacine Izza², Elita Lobo¹, Vignesh Viswanathan¹, Yair Zick¹
¹University of Massachusetts, Amherst, USA
²CREATE, National University of Singapore, Singapore
{gbiradar,elobo,vviswanathan,yzick}@umass.edu, izza@comp.nus.edu.sg

The recent criticisms of the robustness of post hoc model approximation explanation methods (like LIME and SHAP) have led to the rise of model-precise abductive explanations. For each data point, abductive explanations provide a minimal subset of features that are sufficient to generate the outcome. While theoretically sound and rigorous, abductive explanations suffer from a major issue: there can be several valid abductive explanations for the same data point. In such cases, providing a single abductive explanation can be insufficient; on the other hand, providing all valid abductive explanations can be incomprehensible due to their size. In this work, we solve this issue by aggregating the many possible abductive explanations into feature importance scores. We propose three aggregation methods: two based on power indices from cooperative game theory and a third based on a well-known measure of causal strength. We characterize these three methods axiomatically, showing that each of them uniquely satisfies a set of desirable properties. We also evaluate them on multiple datasets and show that these explanations are robust to the attacks that fool SHAP and LIME.

## Introduction

The increasing use of complex machine learning (predictive) models in high-stakes domains like finance (Ozbayoglu, Gudelek, and Sezer 2020) and healthcare (Pandey et al. 2022; Qayyum et al. 2021) necessitates the design of methods to accurately explain the decisions of these models. Many such methods have been proposed by the AI community. Most of these methods (like SHAP (Lundberg and Lee 2017) and LIME (Ribeiro, Singh, and Guestrin 2016)) explain model decisions by sampling points and evaluating model behavior around a point of interest. While useful in many settings, this class of model approximation-based methods has faced criticism for being unable to fully capture model behavior (Rudin 2019; Huang and Marques-Silva 2023) and for being easily manipulable (Slack et al. 2020). The main issue with these methods stems from the fact that model approximation-based explanation measures use the model's output on only a small fraction of the possible input points.

This has led to the rise of model-precise abductive explanations (AXp's) (Shih, Choi, and Darwiche 2018; Ignatiev, Narodytska, and Marques-Silva 2019), which use the underlying model's structure to compute rigorous explanations. AXp's are simple: they provide a minimal set of features that are sufficient to generate the outcome. In other words, a set of features S forms an AXp for a particular point of interest x if, no matter how we modify the values of the features outside S, the outcome will not change. Despite being simple, concise, and theoretically sound, AXp's suffer from a major flaw: there may be several possible AXp's for a given data point. Consider the following example:

Example 1. Suppose that we train a simple rule-based model f for algorithmic loan approval, using the features "Age", "Purpose", "Credit Score", and "Bank Balance".
The rule-based model has the following closed-form expression:

$$f(\vec{x}) = (\text{Age} > 20 \wedge \text{Purpose} = \text{Education}) \vee (\text{Credit} > 700) \vee (\text{Bank} > 50000)$$

In simple words, if the applicant has an age greater than 20 and is applying for education purposes, the loan is accepted; otherwise, if the applicant has a credit score greater than 700 or a bank account balance greater than 50000, the loan is accepted. Consider a user with the following details: x = (Age = 30, Purpose = Education, Credit = 750, Bank = 60000). There are three explanations for this point: (Age = 30, Purpose = Education), (Credit = 750), and (Bank = 60000).

In this example, if we provide the AXp (Age = 30, Purpose = Education) to the user, they can infer that their age and purpose played a big role in their decision. However, note that it would be incorrect to infer anything else. The user cannot even tell if the features which are absent from the explanation played any role in their acceptance. In fact, the user still does not know whether the feature Age (present in the explanation) was more important than the feature Credit Score (absent in the explanation). Arguably, Credit Score is more relevant than Age since it is present in a singleton AXp. However, no user presented with only one AXp can make this conclusion.

We propose to aggregate AXp's into importance scores for each feature. Feature importance scores are an extremely well-studied class of explanations (Barocas, Selbst, and Raghavan 2020). As seen with the widespread use of measures like SHAP and LIME, the simple structure of feature importance scores makes them easy to understand and visualize. We propose to use these feature importance scores to give users a comprehensive understanding of model behavior that is impossible to obtain from a single AXp.

## Our Contributions

Conceptual. We present three aggregation measures: the Responsibility Index, the Deegan-Packel Index, and the Holler-Packel Index. The Responsibility index is based on the degree of responsibility, a well-known causal strength quantification metric (Chockler, Halpern, and Kupferman 2008). The Deegan-Packel and Holler-Packel indices are based on power indices from the cooperative game theory literature (Deegan and Packel 1978; Holler 1982; Holler and Packel 1983).

Theoretical. For each of these measures, we present an axiomatic characterization, in line with theoretical results in the model explainability community (Patel, Strobel, and Zick 2021; Lundberg and Lee 2017; Datta, Sen, and Zick 2016; Sundararajan and Najmi 2020). Since we deal with aggregating abductive explanations as opposed to conventional model outputs, our proof styles and axioms are novel.

Empirical. We empirically evaluate our measures, comparing them with well-known feature importance measures: SHAP (Lundberg and Lee 2017) and LIME (Ribeiro, Singh, and Guestrin 2016). Our experimental results demonstrate the robustness of our methods, showing specifically that they are capable of identifying biases in a model that SHAP and LIME cannot identify.

## Related Work

Abductive explanations were first formally defined in Ignatiev, Narodytska, and Marques-Silva (2019) as a generalization of prime implicant explanations defined in Shih, Choi, and Darwiche (2018).
For most commonly used machine learning models, computing abductive explanations is an intractable problem (Marques-Silva and Ignatiev 2022); hence, computing abductive explanations for these models often requires using NP oracles (e.g., SAT/SMT, MILP, etc.). These oracles have been used in different ways to compute abductive explanations for different classes of models. For example, MILP encodings have been used for neural networks (Ignatiev, Narodytska, and Marques-Silva 2019) and SMT encodings have been used for tree ensembles (Ignatiev et al. 2022). For less complex models such as monotonic classifiers and naive Bayes classifiers, polynomial-time algorithms to compute abductive explanations are known (Marques-Silva et al. 2020, 2021). The main focus of these papers has been the runtime of the proposed algorithms. There are fewer papers analyzing the quality of the output abductive explanations.

Notably, the work of Audemard et al. (2022) is also motivated by the fact that there can be several abductive explanations for a single data point; however, their solution is radically different from ours. They propose to use the explainer's preferences over the set of explanations to find a preferred abductive explanation to provide to the user. Huang and Marques-Silva (2023) observe that SHAP (Lundberg and Lee 2017) often fails to identify features that are irrelevant to the prediction of a data point, i.e., it assigns a positive score to features that never appear in any abductive explanation. They propose aggregating abductive explanations as an alternative to SHAP but do not propose any concrete measures to do so. Our work answers this call with three axiomatically justified aggregation measures. In work developed independently and in parallel to ours, Yu, Ignatiev, and Stuckey (2023) also build on the observations of Huang and Marques-Silva (2023) and develop a MARCO-like method (Liffiton et al. 2016) for computing feature importance explanations by aggregating abductive explanations. Their work proposes two aggregation measures, formal feature attribution (ffa) and weighted ffa, which correspond exactly to the Holler-Packel and Deegan-Packel indices respectively. The results shown in Yu, Ignatiev, and Stuckey (2023) demonstrate empirically the efficiency of the proposed solution. We remark, however, that their work does not offer an axiomatic characterization of these measures, and focuses solely on empirical performance.

There has also been recent work generalizing abductive explanations to probabilistic abductive explanations (Wäldchen et al. 2021; Arenas et al. 2022; Izza et al. 2023). Probabilistic abductive explanations allow users to trade off precision for size, resulting in smaller explanations with lower precision, i.e., smaller explanations which are not as robust as abductive explanations.

Our work also contributes novel feature importance measures. Feature importance measures have been well studied in the literature, with measures like SHAP (Lundberg and Lee 2017) and LIME (Ribeiro, Singh, and Guestrin 2016) gaining significant popularity. There are several other measures in the literature, many offering variants of the Shapley value (Sundararajan and Najmi 2020; Frye, Rowat, and Feige 2020; Sundararajan, Taly, and Yan 2017). Other works use the Banzhaf index (Patel, Strobel, and Zick 2021) and necessity and sufficiency scores (Galhotra, Pradhan, and Salimi 2021; Watson et al. 2021).

## Preliminaries

We denote vectors by $\vec{x}$ and $\vec{y}$. We denote the $i$-th and $j$-th indices of the vector $\vec{x}$ by $x_i$ and $x_j$.
Given a set $S$, we denote the restricted vector containing only the indices $i \in S$ by $\vec{x}_S$. We also use $[k]$ to denote the set $\{1, 2, \dots, k\}$. We have a set of features $N = \{1, 2, \dots, n\}$, where each $i \in N$ has a domain $\mathbb{X}_i$. We use $\mathbb{X} = \prod_{i \in N} \mathbb{X}_i$ to denote the domain of the input space. We are given a model of interest $f \in \mathcal{F}$ that maps input vectors $\vec{x} \in \mathbb{X}$ to a binary output variable $y \in \{0, 1\}$. In the local post hoc explanation problem, we would like to explain the output of the model of interest $f$ on a point of interest $\vec{x}$.

We work with two forms of model explanations in this paper. The first is that of feature importance weights (or feature importance scores): feature importance weights provide a score to each feature proportional to its importance in the generation of the outcome $f(\vec{x})$. Commonly used feature importance measures are LIME (Ribeiro, Singh, and Guestrin 2016) and SHAP (Lundberg and Lee 2017). Second, an abductive explanation (AXp) for a point of interest $\vec{x}$ is a minimal subset of features which are sufficient to generate the outcome $f(\vec{x})$. More formally, an AXp (as defined by Ignatiev et al. (2022)) corresponds to a subset-minimal set of features $S$ such that

$$\forall \vec{y} \in \mathbb{X}, \quad (\vec{y}_S = \vec{x}_S) \Rightarrow (f(\vec{y}) = f(\vec{x})). \qquad (1)$$

By subset minimality, if $S$ satisfies (1), then no proper subset of $S$ satisfies (1). We use $M(\vec{x}, f)$ to denote the set of AXp's for a point of interest $\vec{x}$ under a model of interest $f$. We also use $M_i(\vec{x}, f)$ to denote the subset of $M(\vec{x}, f)$ containing all the AXp's with the feature $i$. Our goal is to create aggregation measures that map $M(\vec{x}, f)$ to an importance score for each feature $i \in N$.

## A Cooperative Game Theory Perspective

In this paper, we propose to aggregate abductive explanations into feature importance scores (feature importance scores computed using AXp's are also referred to as formal feature attribution (FFA) in Yu, Ignatiev, and Stuckey (2023)). A common approach used to compute feature importance scores is via modeling the problem as a cooperative game (Patel, Strobel, and Zick 2021; Datta, Sen, and Zick 2016; Lundberg and Lee 2017). This formulation allows us both to tap into the existing literature on power indices (like the Shapley value) to create feature importance measures, and to use theoretical techniques from the literature to provide axiomatic characterizations for new measures. In this paper, we do both.

A simple cooperative game (Chalkiadakis, Elkind, and Wooldridge 2011) $(N, v)$ is defined over a set of players $N$ and a monotone binary value function $v : 2^N \to \{0, 1\}$ (recall that a set function $v$ is monotone if for all $S \subseteq T \subseteq N$, $v(S) \le v(T)$). The set of players, in our setting (and several others (Patel, Strobel, and Zick 2021; Datta, Sen, and Zick 2016; Lundberg and Lee 2017)), is the set of features $N$ of the model of interest. The value function $v$ loosely represents the value of each (sub)set of players; in model explanations, the value function represents the joint importance of a set of features in generating the outcome. A set $S \subseteq N$ is referred to as a minimal winning set if $v(S) = 1$ and for all proper subsets $T \subsetneq S$, $v(T) = 0$. Minimal winning sets are a natural analog of AXp's in the realm of cooperative game theory. There are specific power indices, like the Deegan-Packel index (Deegan and Packel 1978) and the Holler-Packel index (Holler and Packel 1983; Holler 1982), which take as input the set of all minimal winning sets and output a score corresponding to each player (in our case, feature) in the cooperative game. These measures are natural candidates to convert AXp's into feature importance scores.
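To make definition (1) concrete, the following is a minimal brute-force sketch, for illustration only; it is not an algorithm used in this paper, which instead relies on the oracle-based methods discussed in the related work. It enumerates the AXp's of a classifier over small, finite feature domains; the `loan_model` function and the tiny domains are hypothetical stand-ins for Example 1.

```python
from itertools import combinations, product

def is_sufficient(model, x, domains, S):
    """Check condition (1): fixing the features in S to their values in x
    forces the model's output no matter how the remaining features vary."""
    target = model(x)
    free = [i for i in range(len(x)) if i not in S]
    for values in product(*(domains[i] for i in free)):
        y = list(x)
        for i, v in zip(free, values):
            y[i] = v
        if model(tuple(y)) != target:
            return False
    return True

def enumerate_axps(model, x, domains):
    """Brute-force enumeration of all AXp's: subset-minimal sufficient sets."""
    n = len(x)
    sufficient = [set(S) for k in range(n + 1)
                  for S in combinations(range(n), k)
                  if is_sufficient(model, x, domains, set(S))]
    # Keep only the subset-minimal sufficient sets.
    return [S for S in sufficient if not any(T < S for T in sufficient)]

# Hypothetical encoding of Example 1; feature order: Age, Purpose, Credit, Bank.
def loan_model(x):
    age, purpose, credit, bank = x
    return int((age > 20 and purpose == "Education") or credit > 700 or bank > 50000)

domains = [(18, 30), ("Education", "Other"), (650, 750), (10000, 60000)]  # tiny toy domains
x = (30, "Education", 750, 60000)
print(enumerate_axps(loan_model, x, domains))  # [{2}, {3}, {0, 1}], i.e. {Credit}, {Bank}, {Age, Purpose}
```

For the point of Example 1 this sketch returns the three AXp's discussed above; for realistic models the domain product grows exponentially, which is why the NP-oracle-based approaches cited in the related work are needed.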
## A Framework for Abductive Explanation Aggregation

Formally, we define an abductive explanation aggregator (or simply an aggregator) as a function that maps a point $\vec{x}$ and a model $f$ to a vector in $\mathbb{R}^n$ using only the AXp's of the point $\vec{x}$ under the model $f$; the output vector can be interpreted as importance scores for each feature. For any arbitrary aggregator $\beta : \mathbb{X} \times \mathcal{F} \to \mathbb{R}^n$, we use $\beta_i(\vec{x}, f)$ as the importance weight given to the $i$-th feature for a specific datapoint-model pair $(\vec{x}, f)$.

In order to design meaningful aggregators, we take an axiomatic approach: we start with a set of desirable properties and then find the unique aggregator which satisfies these properties. This is a common approach in explainable machine learning (Datta, Sen, and Zick 2016; Lundberg and Lee 2017; Sundararajan, Taly, and Yan 2017; Patel, Strobel, and Zick 2021) and, more recently, Boolean functions (Harder et al. 2023). The popular Shapley value (Young 1985) is the unique measure that satisfies four desirable properties: Monotonicity, Symmetry, Null Feature, and Efficiency. However, the exact definitions of these four properties in the characterization of the Shapley value do not extend to our setting. Moreover, the Shapley value does not aggregate AXp's (or more generally, minimal winning sets). Therefore, for our axiomatic characterization, we formally define variants of these properties, keeping the spirit of these definitions intact. We present these definitions below.

α-Monotonicity: Let $\alpha$ be some function that quantifies the relevance of the set of AXp's a feature $i$ is present in. A feature importance score is monotonic with respect to $\alpha$ if, for each feature $i$ and datapoint-model pair $(\vec{x}, f)$, the importance score given to $i$ is monotonic with respect to $\alpha(M_i(\vec{x}, f))$. In simple words, the higher the rank of the set of AXp's containing a feature (according to $\alpha$), the higher its importance score. The ranking function $\alpha$ can capture several intuitive desirable properties. For example, if we want features present in a larger number of AXp's to receive a higher score, we can simply set $\alpha(\mathcal{S}) = |\mathcal{S}|$. Otherwise, if we want features present in smaller explanations to receive a higher score, we set $\alpha(\mathcal{S}) = \min_{S \in \mathcal{S}} |S|$. Formally, let $\alpha : 2^{2^N} \to Y$ be a function that ranks sets of AXp's, i.e., maps every set of AXp's to an element of a partially ordered set $Y$. An aggregator $\beta$ is said to satisfy α-monotonicity if for any two datapoint-model pairs $(\vec{x}, f)$ and $(\vec{y}, g)$ and a feature $i$, $\alpha(M_i(\vec{x}, f)) \succeq \alpha(M_i(\vec{y}, g))$ implies $\beta_i(\vec{x}, f) \ge \beta_i(\vec{y}, g)$. Additionally, if the feature $i$ has the same set of AXp's under $(\vec{x}, f)$ and $(\vec{y}, g)$, i.e., $M_i(\vec{x}, f) = M_i(\vec{y}, g)$, then $\beta_i(\vec{x}, f) = \beta_i(\vec{y}, g)$.

Symmetry: This property requires that the index of a feature should not affect its score. That is, the score of feature $i$ should not change if we change its position. Given a permutation $\pi : N \to N$, we define $\pi\vec{x}$ as the reordering of the feature values in $\vec{x}$ according to $\pi$. In addition, given a permutation $\pi : N \to N$, we define $\pi f$ as the function that results from permuting the input point using $\pi$ before computing the output. More formally, $\pi f(\vec{x}) = f(\pi\vec{x})$.
We are now ready to formally define the symmetry property: an aggregator $\beta$ satisfies Symmetry if for any datapoint-model pair $(\vec{x}, f)$ and a permutation $\pi$, $\pi\beta(\vec{x}, f) = \beta(\pi\vec{x}, \pi^{-1}f)$.

Null Feature: If a feature is not present in any abductive explanation, it is given a score of 0. This property explicitly sets a baseline value for importance scores. More formally, an aggregator $\beta$ satisfies Null Feature if for any datapoint-model pair $(\vec{x}, f)$ and any feature $i$, $M_i(\vec{x}, f) = \emptyset$ implies that $\beta_i(\vec{x}, f) = 0$.

C-Efficiency: This property requires the scores output by aggregators to sum up to a fixed value; in other words, for any datapoint-model pair $(\vec{x}, f)$, $\sum_{i \in N} \beta_i(\vec{x}, f)$ must be a fixed value. Not only does efficiency bound the importance scores, but it also ensures that features are not always given a trivial score of 0. The fixed value may depend on the aggregator $\beta$, the model $f$, and the datapoint $\vec{x}$. To capture this, we define a function $C$ that maps each datapoint-model pair $(\vec{x}, f)$ to a real value. An aggregator $\beta$ is C-efficient if for any datapoint-model pair $(\vec{x}, f)$, $\sum_{i \in N} \beta_i(\vec{x}, f) = C(\vec{x}, f)$.

We deliberately define the above properties flexibly. There are different reasonable choices of α-monotonicity and C-efficiency, each leading to a different aggregation measure (Table 1). In what follows, we formally present these choices and mathematically find the measures they characterize. It is worth noting, as shown by Huang and Marques-Silva (2023), that the popular SHAP framework fails to satisfy the Null Feature property, while all the measures we propose in this paper are guaranteed to satisfy it.

| Measure | Definition | α-Monotonicity | C-Efficiency |
|---|---|---|---|
| Holler-Packel Index | $\eta_i(\vec{x}, f) = \lvert M_i(\vec{x}, f)\rvert$ | $\alpha(\mathcal{S}) = \mathcal{S}$, and $\alpha(\mathcal{S}) \succeq \alpha(\mathcal{T})$ iff $\mathcal{S} \supseteq \mathcal{T}$ | $C(\vec{x}, f) = \sum_{i \in N} \lvert M_i(\vec{x}, f)\rvert$ |
| Deegan-Packel Index | $\phi_i(\vec{x}, f) = \sum_{S \in M_i(\vec{x}, f)} \frac{1}{\lvert S\rvert}$ | $\alpha(\mathcal{S}) = \mathcal{S}$, and $\alpha(\mathcal{S}) \succeq \alpha(\mathcal{T})$ iff $\mathcal{S} \supseteq \mathcal{T}$ | $C(\vec{x}, f) = \lvert M(\vec{x}, f)\rvert$ |
| Responsibility Index | $\rho_i(\vec{x}, f) = \max_{S \in M_i(\vec{x}, f)} \frac{1}{\lvert S\rvert}$ | $\alpha(\mathcal{S}) = \min_{S \in \mathcal{S}} \lvert S\rvert$ | NA |

Table 1: A summary of the α and C values from the Monotonicity and Efficiency properties, respectively, of each measure defined in this paper. All three measures satisfy Symmetry and Null Feature. The Responsibility index satisfies an alternative efficiency property which is incomparable to C-efficiency.

## The Holler-Packel Index

We start with the Holler-Packel index, named after the power index in cooperative game theory (Holler 1982; Holler and Packel 1983). The Holler-Packel index measures the importance of each feature as the number of AXp's that contain it. More formally, the Holler-Packel index of a feature $i$ (denoted by $\eta_i(\vec{x}, f)$) is given by

$$\eta_i(\vec{x}, f) = |M_i(\vec{x}, f)|. \qquad (2)$$

The Holler-Packel index satisfies a property we call Minimal Monotonicity. This property corresponds to α-Monotonicity when $\alpha(\mathcal{S}) = \mathcal{S}$ and $\alpha(\mathcal{S}) \succeq \alpha(\mathcal{T})$ if and only if $\mathcal{S} \supseteq \mathcal{T}$. Minimal Monotonicity (loosely speaking) ensures that features present in a larger number of AXp's get a higher importance score. The Holler-Packel index also satisfies C-Efficiency where $C(\vec{x}, f)$ is defined as $\sum_{i \in N} |M_i(\vec{x}, f)|$. We refer to this property as $\left(\sum_{i \in N} |M_i(\vec{x}, f)|\right)$-Efficiency for clarity. Our first result shows that the Holler-Packel index is the only index that satisfies Minimal Monotonicity, Symmetry, Null Feature, and $\left(\sum_{i \in N} |M_i(\vec{x}, f)|\right)$-Efficiency.

Theorem 1. The only aggregator that satisfies Minimal Monotonicity, Symmetry, Null Feature, and $\left(\sum_{i \in N} |M_i(\vec{x}, f)|\right)$-Efficiency is the Holler-Packel index given by (2).
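Given an already-enumerated set of AXp's (for instance, from the brute-force sketch above), the Holler-Packel index of equation (2) reduces to a per-feature count. The sketch below assumes AXp's are represented as Python sets of feature indices; it is an illustration, not the authors' implementation.

```python
def holler_packel(axps, n):
    """Holler-Packel index (2): the number of AXp's containing each feature."""
    eta = [0] * n
    for S in axps:          # each AXp is a set of feature indices
        for i in S:
            eta[i] += 1
    return eta

# Example 1 (feature order: Age, Purpose, Credit, Bank); AXp's are {Age, Purpose}, {Credit}, {Bank}.
print(holler_packel([{0, 1}, {2}, {3}], 4))  # [1, 1, 1, 1]
```

On Example 1 every feature lies in exactly one AXp, so all four features receive a score of 1 and the scores sum to $\sum_{i \in N} |M_i(\vec{x}, f)| = 4$, as the efficiency property above requires.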
The Holler-Packel index was used as a heuristic AXp aggregator in prior work under the term hit rate (Marques-Silva et al. 2020). Theorem 1 theoretically justifies the hit rate.

## The Deegan-Packel Index

Next, we present the Deegan-Packel index. This method is also named after the similar game-theoretic power index (Deegan and Packel 1978). The Deegan-Packel index, like the Holler-Packel index, counts the number of AXp's a feature is included in, but unlike the Holler-Packel index, each AXp is given a weight inversely proportional to its size. This ensures that smaller AXp's are prioritized over larger AXp's. Formally, the Deegan-Packel index is defined as follows:

$$\phi_i(\vec{x}, f) = \sum_{S \in M_i(\vec{x}, f)} \frac{1}{|S|}. \qquad (3)$$

Note that this aggregator also satisfies Minimal Monotonicity, Symmetry, and Null Feature. However, the Deegan-Packel index satisfies a different notion of C-Efficiency. The efficiency notion satisfied by the Deegan-Packel index corresponds to C-Efficiency where $C(\vec{x}, f)$ is defined as $|M(\vec{x}, f)|$. We refer to this efficiency notion as $|M(\vec{x}, f)|$-Efficiency for clarity. Our second result shows that the Deegan-Packel index uniquely satisfies Minimal Monotonicity, Symmetry, Null Feature, and $|M(\vec{x}, f)|$-Efficiency.

Theorem 2. The only aggregator that satisfies Minimal Monotonicity, Symmetry, Null Feature, and $|M(\vec{x}, f)|$-Efficiency is the Deegan-Packel index given by (3).

## The Responsibility Index

We now present our third and final aggregator, the Responsibility index, named after the degree of responsibility (Chockler, Halpern, and Kupferman 2008; Chockler and Halpern 2004) used commonly to measure causal strength. The Responsibility index (denoted by $\rho$) of a feature is the inverse of the size of the smallest AXp containing that feature. More formally,

$$\rho_i(\vec{x}, f) = \begin{cases} \max_{S \in M_i(\vec{x}, f)} \frac{1}{|S|} & M_i(\vec{x}, f) \neq \emptyset \\ 0 & M_i(\vec{x}, f) = \emptyset \end{cases} \qquad (4)$$

To characterize this aggregator, we require different versions of Monotonicity and Efficiency. Our new monotonicity property requires aggregators to provide a higher score to features present in smaller AXp's. We refer to this property as Minimum Size Monotonicity: this corresponds to α-Monotonicity where, given a set of AXp's $\mathcal{S}$, we let $\alpha(\mathcal{S}) = \min_{S \in \mathcal{S}} |S|$. The new efficiency property does not fit into the C-Efficiency framework used so far and is easier to define as two new properties: Unit Efficiency and Contraction. Unit Efficiency requires that the score given to any feature present in a singleton AXp be 1. This property is used to upper bound the score given to a feature.

Unit Efficiency: For any datapoint-model pair $(\vec{x}, f)$, $M_i(\vec{x}, f) = \{\{i\}\}$ implies $\rho_i(\vec{x}, f) = 1$.

To define the contraction property, we define the contraction operation on the set of features $N$: we replace a subset of features $T \subseteq N$ by a single feature $[T]$ corresponding to the set. The contracted data point $\vec{x}_{[T]}$ is the same point as $\vec{x}$, but we treat all the features in $T$ as a single feature $[T]$. The contraction property requires that a contracted feature $[T]$ does not receive a score greater than the sum of the scores given to the individual features in $T$.

Contraction: For any subset $T$ that does not contain a null feature (i.e., a feature not included in any AXp), we have $\rho_{[T]}(\vec{x}_{[T]}, f) \le \sum_{i \in T} \rho_i(\vec{x}, f)$. Moreover, equality holds if $T \in \arg\min_{S \in M_i(\vec{x}, f)} |S|$ for all $i \in T$. In other words, equality holds iff $T$ is the smallest AXp for all the features in $T$.

The contraction property bounds the gain one gets by combining features and ensures that the total attribution that a set of features receives when combined does not exceed the sum of the individual attributions of each element in the set.
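Before turning to the characterization, here is an analogous sketch (again assuming AXp's are given as Python sets of feature indices; not the authors' code) of the Deegan-Packel index (3) and the Responsibility index (4).

```python
def deegan_packel(axps, n):
    """Deegan-Packel index (3): each AXp contributes 1/|S| to every feature it contains."""
    phi = [0.0] * n
    for S in axps:
        for i in S:
            phi[i] += 1.0 / len(S)
    return phi

def responsibility(axps, n):
    """Responsibility index (4): 1 over the size of the smallest AXp containing the feature;
    features appearing in no AXp keep a score of 0."""
    rho = [0.0] * n
    for S in axps:
        for i in S:
            rho[i] = max(rho[i], 1.0 / len(S))
    return rho

# Example 1 (hypothetical feature order: Age, Purpose, Credit, Bank):
axps = [{0, 1}, {2}, {3}]
print(deegan_packel(axps, 4))   # [0.5, 0.5, 1.0, 1.0]; scores sum to |M(x, f)| = 3
print(responsibility(axps, 4))  # [0.5, 0.5, 1.0, 1.0]
```

On Example 1 both indices assign 1.0 to Credit and Bank and 0.5 each to Age and Purpose, and the Deegan-Packel scores sum to $|M(\vec{x}, f)| = 3$, matching its efficiency property. The two indices differ once a feature appears in several AXp's of different sizes: the Deegan-Packel index adds up the contributions, while the Responsibility index keeps only the largest one.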
We are now ready to present our characterization of the Responsibility index.

Theorem 3. The Responsibility index is the only aggregator which satisfies Minimum Size Monotonicity, Unit Efficiency, Contraction, Symmetry, and Null Feature.

## Impossibilities

The framework discussed above can be used to axiomatically characterize several indices. Our axiomatic approach also offers insights as to what can be accomplished by aggregating AXp's. We prove that some choices of α and C may create a set of properties that are impossible to satisfy simultaneously. For example, the Shapley value's efficiency property stipulates that all Shapley values must sum to 1. Somewhat surprisingly, this is not possible when taking an AXp approach.

Proposition 4. There exists no aggregator satisfying Minimal Monotonicity, Symmetry, Null Feature, and 1-Efficiency.

All the indices described in this section inherit the precision and robustness of AXp's while simultaneously satisfying a set of desirable properties. In what follows, we demonstrate the value of this robustness empirically.

## Empirical Evaluation

To showcase the robustness of the explanations generated by our methods, we study their empirical behavior against adversarial attacks proposed by Slack et al. (2020). Specifically, we investigate whether our framework successfully uncovers underlying biases in adversarial classifiers that popular explanation methods like LIME and SHAP often fail to identify (Slack et al. 2020). We describe the details of the datasets used in our experiments below.

Compas (Angwin et al. 2016): This dataset contains information about the demographics, criminal records, and Compas risk scores of 6172 individual defendants from Broward County, Florida. Individuals are labeled with either a high or low risk score, with race as the sensitive feature.

German Credit (Dua and Graff 2017): This dataset contains financial and demographic information on 1000 loan applicants. Each candidate is labeled as either a good or bad loan candidate. The sensitive feature is gender.

[Figure 1: A pictorial description of the attack model. An OOD classifier routes a point x to the unbiased classifier if x is OOD and to the biased classifier otherwise. OOD is short for out-of-distribution.]

Attack Model. We evaluate the robustness of our explanation methods using the adversarial attacks proposed by Slack et al. (2020) for LIME and SHAP. The underlying attack model is a two-level adversarial classifier in both adversarial attacks. The first level of the adversarial classifier is an out-of-distribution (OOD) classifier that predicts if a point is OOD or not. The second level of the adversarial classifier consists of a biased and an unbiased prediction model, both of which predict the target label for a given data point. The biased prediction model makes predictions solely based on the sensitive feature in the dataset. In contrast, the unbiased prediction model makes predictions using features that are uncorrelated with the sensitive feature.
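The two-level structure can be summarized by the short sketch below (hypothetical function names; the actual attack implementations are due to Slack et al. (2020)): an OOD classifier decides which prediction model answers a query, so the bias surfaces only on in-distribution points.

```python
def adversarial_classifier(x, is_ood, biased_model, unbiased_model):
    """Two-level attack model: out-of-distribution points are answered by the
    unbiased model, while in-distribution points are answered by the model
    that uses only the sensitive feature (see Figure 1)."""
    return unbiased_model(x) if is_ood(x) else biased_model(x)
```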
With the Compas dataset, the biased model uses the Race feature for predicting the target label. In contrast, the unbiased model uses either one or two uncorrelated synthetic features (UC1, UC2) for predicting the label. When two uncorrelated features are used, the label is the XOR of their values. Similarly, with the German Credit dataset, the unbiased and biased models use the Loan Rate As Percent Of Income feature and the Gender feature for predicting the target label, respectively.

For a given data point, the adversarial classifier first uses the OOD classifier to predict whether the given data point is OOD. If the data point is OOD, the adversarial classifier uses the unbiased prediction model to predict the target label; else, the adversarial classifier uses the biased classifier to predict the target label (see Figure 1). Most points in the dataset are classified as in-distribution and, therefore, the prediction of the attack model for these points will be made solely using the sensitive feature of the dataset. Since the explanations generated by popular methods like LIME and SHAP tend to be heavily influenced by the predictions of the classifier model on OOD data points, this attack is designed to hide the underlying biases of the model by ensuring the bias is only applied to in-distribution data points. For each method (SHAP and LIME), Slack et al. (2020) use a different attack model with the same high-level approach described above. Implementation details of each attack are outlined in the extended version of the paper (Biradar et al. 2023); code is available at https://github.com/elitalobo/aggrxp.

| Features | LIME (1st / 2nd / 3rd) | Responsibility (1st / 2nd / 3rd) | Holler-Packel (1st / 2nd / 3rd) | Deegan-Packel (1st / 2nd / 3rd) |
|---|---|---|---|---|
| Race | 0.0 / 0.0 / 0.0 | 0.921 / 0.079 / 0.0 | 0.845 / 0.148 / 0.007 | 0.845 / 0.148 / 0.007 |
| UC1 | 0.492 / 0.508 / 0.0 | 0.601 / 0.399 / 0.0 | 0.157 / 0.843 / 0.0 | 0.157 / 0.843 / 0.0 |
| UC2 | 0.508 / 0.492 / 0.0 | 0.601 / 0.399 / 0.0 | 0.157 / 0.843 / 0.0 | 0.157 / 0.843 / 0.0 |

Table 2: This table shows the results of the LIME attack experiment on the Compas dataset. Each row represents the frequency of occurrence (fraction of test points) of either the sensitive feature (Race) or an uncorrelated feature (UC1, UC2) in the top 3 positions when ranked based on their LIME scores, Responsibility indices, Holler-Packel indices, and Deegan-Packel indices. LIME explanations do not uncover the underlying biases of the attack model, whereas the Responsibility index, Deegan-Packel index, and Holler-Packel index successfully uncover the underlying biases of the attack model in the explanations they generate.

| Features | SHAP (1st / 2nd / 3rd) | Responsibility (1st / 2nd / 3rd) | Holler-Packel (1st / 2nd / 3rd) | Deegan-Packel (1st / 2nd / 3rd) |
|---|---|---|---|---|
| Race | 0.416 / 0.238 / 0.141 | 0.946 / 0.044 / 0.01 | 0.867 / 0.036 / 0.052 | 0.867 / 0.039 / 0.057 |
| UC1 | 0.252 / 0.249 / 0.172 | 0.608 / 0.316 / 0.067 | 0.146 / 0.47 / 0.215 | 0.146 / 0.552 / 0.138 |
| UC2 | 0.215 / 0.249 / 0.304 | 0.618 / 0.297 / 0.08 | 0.148 / 0.466 / 0.213 | 0.148 / 0.554 / 0.133 |

Table 3: This table shows the results of the SHAP attack experiment on the Compas dataset. Each row represents the frequency of occurrence (fraction of test points) of either the sensitive feature (Race) or an uncorrelated feature (UC1, UC2) in the top 3 positions when ranked based on their SHAP scores, Responsibility indices, Holler-Packel indices, and Deegan-Packel indices.

Experimental Setup. We split a given dataset into train and test datasets in all our experiments. We use the training dataset to train OOD classifiers for the LIME and SHAP attacks and the test dataset to evaluate our methods' robustness. To generate explanations using our proposed AXp aggregators, we must first compute the set of all AXp's for the adversarial classifier model.
We do this using the MARCO algorithm (Liffiton et al. 2016). After generating the complete set of AXp's for the adversarial classifier, we compute the feature importance scores using each of our methods: the Holler-Packel index, the Deegan-Packel index, and the Responsibility index. We use these feature importance scores as explanations for each point in the test dataset. We compare our methods with LIME and SHAP, computed using their respective publicly available libraries (Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2016).

Evaluating Robustness to Adversarial LIME and SHAP Attacks. For each data point in the test dataset, we rank features based on the feature importance scores given by each explanation method. Note that we allow multiple features to hold the same rank if they have the same importance scores. For each explanation method, we compute the fraction of data points in which the sensitive and uncorrelated features appear in the top three positions. Since most of the points in the test dataset are in-distribution and classified as such by the OOD classifier, any good explanation method should identify that the adversarial classifier makes its prediction largely based on the sensitive feature for most of the points in the test dataset. In other words, the sensitive feature should receive a high importance score.

Table 2 shows the percentage of data points for which the sensitive attribute (i.e., Race) and the uncorrelated features (UC1 and UC2) appear in the top three positions when features are ranked using LIME and our methods in the LIME attack experiment on the Compas dataset. Similarly, Table 3 shows the percentage of data points for which the sensitive attribute (i.e., Race) and the uncorrelated features (UC1 and UC2) appear in the top three positions when features are ranked using SHAP and our methods in the SHAP attack experiment for the Compas dataset. Since the biased classifier is used to predict the label for almost all the test points, we expect the explanations to assign a high feature importance score to the sensitive feature.
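To make the evaluation protocol concrete, the following is a small sketch (a hypothetical helper, not the released experiment code) of how the rank frequencies reported in Tables 2-5 can be computed from per-point importance scores; features with equal scores share the same rank, as described above.

```python
def top_k_frequencies(score_vectors, feature_names, k=3):
    """score_vectors: one list of importance scores per test point (one score per feature).
    Returns, for each feature, the fraction of test points where it occupies rank 1..k.
    Features with equal scores share the same rank."""
    counts = {name: [0] * k for name in feature_names}
    for scores in score_vectors:
        for name, s in zip(feature_names, scores):
            # Rank = 1 + number of features with a strictly larger score (ties share a rank).
            rank = 1 + sum(other > s for other in scores)
            if rank <= k:
                counts[name][rank - 1] += 1
    n = len(score_vectors)
    return {name: [c / n for c in per_rank] for name, per_rank in counts.items()}

# Toy usage with two test points and three features (hypothetical scores):
freqs = top_k_frequencies([[0.9, 0.1, 0.0], [0.2, 0.2, 0.7]], ["Race", "UC1", "UC2"])
print(freqs["Race"])  # [0.5, 0.5, 0.0]: Race is ranked 1st in half the points, 2nd in the other half
```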
However, we observe that in the LIME attack experiment, LIME does not always assign high scores to the sensitive feature Race; as a result, Race does not appear in the top three positions at all when two uncorrelated features are used. The uncorrelated features are incorrectly ranked higher than the sensitive feature. On the other hand, the Responsibility index, the Holler-Packel index, and the Deegan-Packel index assign the highest feature importance scores to Race: Race appears in the top position for the majority of the instances (> 84%). It is important to note that the instances in which our explanation methods do not assign a high importance score to the Race feature are the instances that the OOD classifier classifies as OOD. We observe a similar pattern to LIME in the SHAP attack experiment. In this experiment, AXp aggregators rank Race as the most important feature in at least 86% of the test data, whereas SHAP ranks Race as the most important feature only for 41.6% of the returned explanations.

| Features | LIME (1st / 2nd / 3rd) | Responsibility (1st / 2nd / 3rd) | Holler-Packel (1st / 2nd / 3rd) | Deegan-Packel (1st / 2nd / 3rd) |
|---|---|---|---|---|
| Gender | 0.0 / 1.0 / 0.0 | 1.0 / 0.0 / 0.0 | 1.0 / 0.0 / 0.0 | 1.0 / 0.0 / 0.0 |
| LR | 1.0 / 0.0 / 0.0 | 0.46 / 0.54 / 0.0 | 0.0 / 0.69 / 0.31 | 0.0 / 0.72 / 0.28 |

Table 4: This table shows the results of the LIME attack experiment on the German Credit dataset. Each row represents the frequency of occurrence (fraction of test points) of either the sensitive feature (Gender) or an uncorrelated feature (LR, Loan Rate As Percent Of Income) in the top 3 positions when ranked based on their LIME scores, Responsibility indices, Holler-Packel indices, and Deegan-Packel indices.

| Features | SHAP (1st / 2nd / 3rd) | Responsibility (1st / 2nd / 3rd) | Holler-Packel (1st / 2nd / 3rd) | Deegan-Packel (1st / 2nd / 3rd) |
|---|---|---|---|---|
| Gender | 0.0 / 0.41 / 0.01 | 0.93 / 0.04 / 0.03 | 0.87 / 0.07 / 0.02 | 0.87 / 0.07 / 0.02 |
| LR | 0.96 / 0.0 / 0.04 | 0.55 / 0.44 / 0.01 | 0.17 / 0.81 / 0.02 | 0.17 / 0.82 / 0.0 |

Table 5: This table shows the results of the SHAP attack experiment on the German Credit dataset. Each row represents the frequency of occurrence (fraction of test points) of either the sensitive feature (Gender) or an uncorrelated feature (LR, Loan Rate As Percent Of Income) in the top 3 positions when ranked based on their SHAP scores, Responsibility indices, Holler-Packel indices, and Deegan-Packel indices.

We see similar results with the German Credit dataset, reported in Table 4 and Table 5. In both the LIME and SHAP attacks, we observe that LIME and SHAP place the Loan Rate As Percent Of Income feature in the top position for most of the delivered explanations, while the sensitive feature Gender does not appear in the top position in any instance. In contrast, the Responsibility Index, the Holler-Packel Index, and the Deegan-Packel Index correctly assign the highest feature importance score to the sensitive feature Gender for most of the data points; Gender appears in the top position in > 87% of the instances in both the LIME and SHAP attack experiments. Clearly, we can conclude that our AXp aggregators generate more robust and reliable explanations than LIME and SHAP.

## Conclusion and Future Work

In this work, we aggregate abductive explanations into feature importance scores. We present three methods that aggregate abductive explanations, showing that each of them uniquely satisfies a set of desirable properties. We also empirically evaluate each of our methods, showing that they are robust to attacks that SHAP and LIME are vulnerable to. Our focus in this paper has been the axiomatic characterization and comparison of different measures. We believe an empirical comparison of the three methods we propose is also worth exploring in future work. This study is likely to yield insights into the differences in applicability of each of our three methods, further leading to a deeper understanding of how abductive explanations should be aggregated.

## Acknowledgments

This research was supported in part by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program.

## References

Angwin, J.; Larson, J.; Mattu, S.; and Kirchner, L. 2016. Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica.

Arenas, M.; Barceló, P.; Orth, M. A. R.; and Subercaseaux, B. 2022. On Computing Probabilistic Explanations for Decision Trees. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS).
Audemard, G.; Bellart, S.; Bounia, L.; Koriche, F.; Lagniez, J.-M.; and Marquis, P. 2022. On Preferred Abductive Explanations for Decision Trees and Random Forests. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), 643–650.

Barocas, S.; Selbst, A. D.; and Raghavan, M. 2020. The Hidden Assumptions behind Counterfactual Explanations and Principal Reasons. In Proceedings of the 3rd ACM Conference on Fairness, Accountability, and Transparency (FAccT), 80–89.

Biradar, G.; Izza, Y.; Lobo, E.; Viswanathan, V.; and Zick, Y. 2023. Axiomatic Aggregations of Abductive Explanations. CoRR, abs/2310.03131.

Chalkiadakis, G.; Elkind, E.; and Wooldridge, M., eds. 2011. Computational Aspects of Cooperative Game Theory. Morgan & Claypool Publishers, 1st edition.

Chockler, H.; and Halpern, J. Y. 2004. Responsibility and Blame: A Structural-Model Approach. Journal of Artificial Intelligence Research, 22: 93–115.

Chockler, H.; Halpern, J. Y.; and Kupferman, O. 2008. What Causes a System to Satisfy a Specification? ACM Trans. Comput. Logic, 9(3).

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In IEEE Symposium on Security and Privacy, 598–617.

Deegan, J.; and Packel, E. 1978. A new index of power for simple n-person games. International Journal of Game Theory, 7: 113–123.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository.

Frye, C.; Rowat, C.; and Feige, I. 2020. Asymmetric Shapley Values: Incorporating Causal Knowledge into Model-agnostic Explainability. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS).

Galhotra, S.; Pradhan, R.; and Salimi, B. 2021. Explaining Black-Box Algorithms Using Probabilistic Contrastive Counterfactuals. arXiv:2103.11972.

Harder, H.; Jantsch, S.; Baier, C.; and Dubslaff, C. 2023. A Unifying Formal Approach to Importance Values in Boolean Functions. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), 2728–2737.

Holler, M. J. 1982. Forming Coalitions and Measuring Voting Power. Political Studies, 30: 262–271.

Holler, M. J.; and Packel, E. W. 1983. Power, Luck and the Right Index. Journal of Economics, 43: 21–29.

Huang, X.; and Marques-Silva, J. 2023. The Inadequacy of Shapley Values for Explainability. arXiv:2302.08160.

Ignatiev, A.; Izza, Y.; Stuckey, P. J.; and Marques-Silva, J. 2022. Using MaxSAT for Efficient Explanations of Tree Ensembles. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), 3776–3785.

Ignatiev, A.; Narodytska, N.; and Marques-Silva, J. 2019. Abduction-Based Explanations for Machine Learning Models. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI).

Izza, Y.; Huang, X.; Ignatiev, A.; Narodytska, N.; Cooper, M. C.; and Marques-Silva, J. 2023. On computing probabilistic abductive explanations. Int. J. Approx. Reason., 159.

Liffiton, M. H.; Previti, A.; Malik, A.; and Marques-Silva, J. 2016. Fast, flexible MUS enumeration. Constraints: An Int. J., 21(2): 223–250.

Lundberg, S. M.; and Lee, S.-I. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS), 4768–4777.

Marques-Silva, J.; Gerspacher, T.; Cooper, M. C.; Ignatiev, A.; and Narodytska, N. 2020. Explaining Naive Bayes and Other Linear Classifiers with Polynomial Time and Delay. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS).
Marques-Silva, J.; Gerspacher, T.; Cooper, M. C.; Ignatiev, A.; and Narodytska, N. 2021. Explanations for Monotonic Classifiers. In Proceedings of the 38th International Conference on Machine Learning (ICML), 7469–7479.

Marques-Silva, J.; and Ignatiev, A. 2022. Delivering Trustworthy AI through Formal XAI. In AAAI, 12342–12350.

Ozbayoglu, A. M.; Gudelek, M. U.; and Sezer, O. B. 2020. Deep learning for financial applications: A survey. Applied Soft Computing, 93: 106384.

Pandey, B.; Kumar Pandey, D.; Pratap Mishra, B.; and Rhmann, W. 2022. A comprehensive survey of deep learning in the field of medical imaging and medical natural language processing: Challenges and research directions. Journal of King Saud University - Computer and Information Sciences, 34: 5083–5099.

Patel, N.; Strobel, M.; and Zick, Y. 2021. High Dimensional Model Explanations: An Axiomatic Approach. In Proceedings of the 4th ACM Conference on Fairness, Accountability, and Transparency (FAccT), 401–411. ACM.

Qayyum, A.; Qadir, J.; Bilal, M.; and Al-Fuqaha, A. 2021. Secure and Robust Machine Learning for Healthcare: A Survey. IEEE Reviews in Biomedical Engineering, 14: 156–180.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining (KDD), 1135–1144.

Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5): 206–215.

Shih, A.; Choi, A.; and Darwiche, A. 2018. A Symbolic Approach to Explaining Bayesian Network Classifiers. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 5103–5111.

Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In Proceedings of the 3rd AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).

Sundararajan, M.; and Najmi, A. 2020. The Many Shapley Values for Model Explanation. In Proceedings of the 37th International Conference on Machine Learning (ICML), 9269–9278.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 3319–3328.

Wäldchen, S.; MacDonald, J.; Hauch, S.; and Kutyniok, G. 2021. The Computational Complexity of Understanding Binary Classifier Decisions. J. Artif. Intell. Res., 70: 351–387.

Watson, D.; Gultchin, L.; Taly, A.; and Floridi, L. 2021. Local Explanations via Necessity and Sufficiency: Unifying Theory and Practice. arXiv:2103.14651.

Young, H. 1985. Monotonic Solutions of Cooperative Games. International Journal of Game Theory, 14: 65–72.

Yu, J.; Ignatiev, A.; and Stuckey, P. J. 2023. On Formal Feature Attribution and Its Approximation. CoRR, abs/2307.03380.