# FOOLING SHAP WITH STEALTHILY BIASED SAMPLING

Published as a conference paper at ICLR 2023

Gabriel Laberge¹, Ulrich Aïvodji², Satoshi Hara³, Mario Marchand⁴, Foutse Khomh¹
¹Polytechnique Montréal, Québec ²École de technologie supérieure, Québec ³Osaka University, Japan ⁴Université Laval, Québec
{gabriel.laberge,foutse.khomh}@polymtl.ca, ulrich.aivodji@etsmtl.ca, satohara@ar.sanken.osaka-u.ac.jp, mario.marchand@ift.ulaval.ca

ABSTRACT

SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of fairness audits, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups, while remaining undetected. More precisely, experiments performed on real-world datasets show that our attack can yield up to a 90% relative decrease in the amplitude of the sensitive feature attribution. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.

1 INTRODUCTION

As Machine Learning (ML) becomes increasingly ubiquitous in high-stakes decision contexts (e.g., healthcare, finance, and justice), concerns about its potential to produce discriminatory models are becoming prominent. Auditing toolkits (Adebayo et al., 2016; Saleiro et al., 2018; Bellamy et al., 2018) are increasingly popular as a means to prevent the deployment of unfair models. However, although auditing toolkits can help model designers promote fairness, they can also be manipulated to mislead both end-users and external auditors. For instance, a recent study by Fukuchi et al. (2020) has shown that malicious model designers can produce a benchmark dataset as fake evidence of the fairness of a model even though the model itself is unfair. Another approach to assess the fairness of ML systems is to explain their outcomes in a post-hoc manner (Guidotti et al., 2018). For instance, SHAP (Lundberg & Lee, 2017) has risen in popularity as a means to extract model-agnostic local feature attributions. Feature attributions are meant to convey how much the model relies on certain features to make a decision at a specific input. The use of feature attributions for fairness auditing is desirable in cases where the interest lies in the direct impact of the sensitive attributes on the output of the model. One such situation arises in the context of causal fairness (Chikahara et al., 2021). In some practical cases, the outputs cannot be independent of the sensitive attribute unless much prediction accuracy is sacrificed. For example, any decision based on physical strength is statistically correlated with gender for biological reasons. The problem in such a situation is not the statistical bias itself (such as demographic parity), but whether the decision is based on physical strength or on gender, i.e., the attribution of each feature.
The focus of this study is on manipulating feature attributions so that the dependence on sensitive features is hidden and the audit is misled into believing the model is fair even when it is not. Recently, several studies reported that such a manipulation is possible, e.g., by modifying the black-box model to be explained (Slack et al., 2020; Begley et al., 2020; Dimanov et al., 2020), by manipulating the computation algorithms of feature attributions (Aïvodji et al., 2019), or by poisoning the data distribution (Baniecki et al., 2021; Baniecki & Biecek, 2022). With these findings in mind, the current advice to auditors is not to rely solely on the reported feature attributions for fairness auditing. A question then arises: what evidence can we expect in addition to the feature attributions, and can it constitute valid evidence of fairness? In this study, we show that one can craft fake evidence of fairness for SHAP explanations, which provides the first negative answer to the latter question. In particular, we show that we can produce not only manipulated feature attributions but also a benchmark dataset serving as fake evidence of fairness. The benchmark dataset ensures that external auditors reproduce the reported feature attributions using the existing SHAP library. In our study, we leverage the idea of stealthily biased sampling, introduced by Fukuchi et al. (2020), to cherry-pick which data points to include in the benchmark. Moreover, the use of stealthily biased sampling allows us to keep the manipulation undetected by making the distribution of the benchmark sufficiently close to the true data distribution. Figure 1 illustrates the impact of our attack in an explanation scenario with the Adult Income dataset.

Figure 1: Example of our attack on the Adult Income dataset. After the attack, the feature gender moved from having the most negative attribution to the 6th most negative, hence hiding some of the model bias.

Our contributions can be summarized as follows:

- Theoretically, we formalize a notion of foreground distribution that can be used to extend Local Shapley Values (LSV) to Global Shapley Values (GSV), which can decompose fairness metrics among the features (Section 2.2). Moreover, we formalize the task of manipulating the GSV as a Minimum Cost Flow (MCF) problem (Section 4).
- Experimentally (Section 5), we illustrate the impact of the proposed manipulation attack on a synthetic dataset and four popular datasets, namely Adult Income, COMPAS, Marketing, and Communities. We observe that the proposed attack can reduce the importance of a sensitive feature while keeping the data manipulation undetected by the audit.

Our results indicate that SHAP explanations are not robust and can be manipulated when it comes to explaining the difference in outcomes between groups. Even worse, our results confirm that we can craft a benchmark dataset so that the manipulated feature attributions are reproducible by external audits. Hence, we alert auditors to treat post-hoc explanation methods with skepticism, even when they are accompanied by additional evidence.
2 SHAPLEY VALUES

2.1 LOCAL SHAPLEY VALUES

Shapley values are omnipresent in post-hoc explainability because of their fundamental mathematical properties (Shapley, 1953) and their implementation in the popular SHAP Python library (Lundberg & Lee, 2017). SHAP provides local explanations in the form of feature attributions, i.e., given an input of interest $x$, SHAP returns a score $\phi_i \in \mathbb{R}$ for each feature $i = 1, 2, \ldots, d$. These scores are meant to convey how much the model $f$ relies on feature $i$ to make its decision $f(x)$. Shapley values have a long background in coalitional game theory, where multiple players collaborate toward a common outcome. In the context of explaining model decisions, the players are the input features and the common outcome is the model output $f(x)$. In coalitional games, players (features) are either present or absent. Since one cannot physically remove an input feature once the model has been fitted, SHAP removes features by replacing them with a baseline value $z$. This leads to the Local Shapley Value (LSV) $\phi_i(f, x, z)$, which respects the so-called efficiency axiom (Lundberg & Lee, 2017):

$$\sum_{i=1}^{d} \phi_i(f, x, z) = f(x) - f(z). \tag{1}$$

Simply put, the difference between the model prediction at $x$ and at the baseline $z$ is shared among the different features. Additional details on the computation of LSV are presented in Appendix B.1.

2.2 GLOBAL SHAPLEY VALUES

LSV are local because they explain the prediction at a specific $x$ and rely on a single baseline input $z$. Since model auditing requires a more global analysis of model behavior, we must understand the predictions at multiple inputs $x \sim \mathcal{F}$ sampled from a distribution $\mathcal{F}$ called the foreground. Moreover, because the choice of baseline is somewhat ambiguous, the baselines are sampled $z \sim \mathcal{B}$ from a distribution $\mathcal{B}$ colloquially referred to as the background. Taking inspiration from Begley et al. (2020), we can compute Global Shapley Values (GSV) by averaging LSV over both foreground and background distributions.

Definition 2.1.
$$\Phi_i(f, \mathcal{F}, \mathcal{B}) := \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi_i(f, x, z)\right], \quad i = 1, 2, \ldots, d. \tag{2}$$

Proposition 2.1. The GSV have the following property:
$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)]. \tag{3}$$

2.3 MONTE-CARLO ESTIMATES

In practice, computing expectations w.r.t. the whole background and foreground distributions may be prohibitive, and hence Monte-Carlo estimates are used. For instance, when a dataset is used to represent a background distribution, explainers in the SHAP library such as the Exact Explainer and Tree Explainer will subsample this dataset¹ by selecting 100 instances uniformly at random when the size of the dataset exceeds 100. More formally, let

$$C(S, \omega) := \sum_{x^{(j)} \in S} \omega_j\, \delta(x^{(j)}) \tag{4}$$

represent a categorical distribution over a finite set of input examples $S$, where $\delta(\cdot)$ is the Dirac probability measure, $\omega_j \ge 0\ \forall j$, and $\sum_j \omega_j = 1$. Estimating expectations with Monte-Carlo amounts to sampling $M$ instances

$$S_0 \sim \mathcal{F}^M, \qquad S_1 \sim \mathcal{B}^M, \tag{5}$$

and computing the plug-in estimate

$$\hat{\Phi}(f, S_0, S_1) := \Phi\big(f, C(S_0, 1/M), C(S_1, 1/M)\big) = \frac{1}{M^2} \sum_{x^{(i)} \in S_0} \sum_{z^{(j)} \in S_1} \phi(f, x^{(i)}, z^{(j)}). \tag{6}$$

When a set of samples is a singleton (e.g., $S_1 = \{z^{(j)}\}$), we use the convention $\hat{\Phi}(f, S_0, \{z^{(j)}\}) = \hat{\Phi}(f, S_0, z^{(j)})$ to improve readability. In Appendix B.2, $\hat{\Phi}(f, S_0, S_1)$ is shown to be a consistent and asymptotically normal estimate of $\Phi(f, \mathcal{F}, \mathcal{B})$, meaning that one can compute approximate confidence intervals around $\hat{\Phi}$ to capture $\Phi$ with high probability. In practice, the estimates $\hat{\Phi}$ are employed as the model explanation, which we see as a vulnerability. As discussed in Section 4, this Monte-Carlo estimation is the key ingredient that allows a dishonest entity to manipulate the GSV in its favor.

¹https://github.com/slundberg/shap/blob/0662f4e9e6be38e658120079904899cccda59ff8/shap/maskers/_tabular.py#L54-L55
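To make the plug-in estimate of Equation 6 concrete, the following sketch (our own illustration, not the paper's code base) uses a linear model, for which the interventional LSV has the closed form $\phi_i(f, x, z) = w_i (x_i - z_i)$, and checks Proposition 2.1 numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 100
w = rng.normal(size=d)                 # linear model f(x) = w @ x
f = lambda X: X @ w

# Closed-form interventional LSV of a linear model; Eq. 1 holds exactly.
def lsv(x, z):
    return w * (x - z)

S0 = rng.normal(loc=1.0, size=(M, d))  # foreground samples x ~ F
S1 = rng.normal(loc=0.0, size=(M, d))  # background samples z ~ B

# Plug-in GSV estimate (Eq. 6): average the LSV over all (x, z) pairs.
gsv = sum(lsv(x, z) for x in S0 for z in S1) / M**2

# Proposition 2.1: the GSV sum to the difference of mean outputs.
assert np.isclose(gsv.sum(), f(S0).mean() - f(S1).mean())
```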
3 AUDIT SCENARIO

This section introduces an audit scenario to which the proposed attack on SHAP can apply. The scenario involves two parties: a company and an audit. The company has a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{0, 1\}$ that contains $N$ input-target tuples, and also has a model $f : \mathcal{X} \to [0, 1]$ that is meant to be deployed in society. The binary feature with index $s$ (i.e., $x_s \in \{0, 1\}$) represents a sensitive feature with respect to which the model should not explicitly discriminate. Both the data $D$ and the model $f$ are highly private, so the company is very careful when providing information about them to the audit. Hence, $f$ is a black box from the point of view of the audit.

Figure 2: Illustrations of the audit scenario. (a) The data initially provided by the company to the audit is $f(D_0)$ and $f(D_1)$, i.e., the model predictions for all instances in the private dataset for different values of $x_s$. This data can later be used by the audit to assess whether or not the subsets $S_0', S_1'$ provided by the company were cherry-picked. (b) Two models $f_1$ and $f_2$ (decision boundaries in dashed lines) with perfect accuracy exhibit a disparity in outcomes w.r.t. the groups with $x_s < 0$ and $x_s > 0$. Here, $\Phi_s(f_1, \mathcal{F}, \mathcal{B}) = 1$ while $\Phi_s(f_2, \mathcal{F}, \mathcal{B}) = 0$. Hence, $f_2$ is only indirectly unfair toward $x_s$, because of correlations in the data.

At first, the audit asks the company for the necessary data to compute fairness metrics, e.g., the Demographic Parity (Dwork et al., 2012), the Predictive Equality (Corbett-Davies et al., 2017), or the Equal Opportunity (Hardt et al., 2016). Note that our attack applies as long as the fairness metric is a difference in model expectations over subgroups. For simplicity, the audit decides to compute the Demographic Parity

$$\mathbb{E}[f(x) \mid x_s = 0] - \mathbb{E}[f(x) \mid x_s = 1], \tag{7}$$

and therefore demands access to the model outputs for all inputs with different values of the sensitive feature: $f(D_0)$ and $f(D_1)$, where $D_0 = \{x^{(i)} : x^{(i)}_s = 0\}$ and $D_1 = \{x^{(i)} : x^{(i)}_s = 1\}$ are subsets of the input data of sizes $N_0$ and $N_1$ respectively. Doing so does not force the company to share values of features other than $x_s$, nor does it require direct access to the inner workings of the proprietary model. Hence, this demand respects privacy requirements and the company accepts to share the model outputs across all instances; see Figure 2a. At this point, the audit confirms that the model is indeed biased in favor of $x_s = 1$ and puts in question the ability of the company to deploy such a model.

Now, the company argues that, although the model exhibits a disparity in outcomes, this does not mean that the model explicitly uses the feature $x_s$ to make its decisions. If such is the case, then the disparity could be explained by other features statistically associated with $x_s$, some of which may be acceptable grounds for decisions. To verify this claim, the audit decides to employ post-hoc techniques to explain the disparity. Since the model is a black box, the audit shall compute the GSV. The foreground $\mathcal{F}$ and background $\mathcal{B}$ are chosen to be the data distributions conditioned on $x_s = 0$ and $x_s = 1$ respectively:

$$\mathcal{F} := C(D_0, 1/N_0), \qquad \mathcal{B} := C(D_1, 1/N_1). \tag{8}$$
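As a minimal illustration of the audit's first request (a hypothetical sketch with our own variable names), the Demographic Parity of Equation 7 can be computed directly from the shared model outputs $f(D_0)$ and $f(D_1)$:

```python
import numpy as np

def demographic_parity(f_D0: np.ndarray, f_D1: np.ndarray) -> float:
    """Equation 7: difference in mean model output between the group
    with x_s = 0 (outputs f_D0) and the group with x_s = 1 (outputs f_D1)."""
    return f_D0.mean() - f_D1.mean()

# Hypothetical model outputs shared by the company (cf. Figure 2a).
rng = np.random.default_rng(0)
f_D0 = rng.beta(2, 5, size=4000)  # predictions for instances with x_s = 0
f_D1 = rng.beta(5, 2, size=6000)  # predictions for instances with x_s = 1
print(f"Demographic parity: {demographic_parity(f_D0, f_D1):+.3f}")
```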
According to Equation 3, the resulting GSV will sum up to the Demographic Parity (cf. Equation 7). If the sensitive feature has a large negative GSV $\Phi_s$, then the model is explicitly relying on $x_s$ to make its decisions and the company would be forbidden from deploying it. If the GSV has a small amplitude, however, the company could still argue in favor of deploying the model in spite of its disparate outcomes. Indeed, the difference in outcomes could then be attributed to more acceptable features. See Figure 2b for a toy example illustrating this reasoning.

To compute the GSV, the audit demands the two datasets of inputs $D_0$ and $D_1$, as well as the ability to query the black box $f$ at arbitrary points. Because of privacy concerns over sharing values of $x$ across the whole dataset, and because the GSV must be estimated with Monte-Carlo, both parties agree that the company shall only provide subsets $S_0 \subset D_0$ and $S_1 \subset D_1$ of size $M$ to the audit, so they can compute a Monte-Carlo estimate $\hat{\Phi}(f, S_0, S_1)$. The company first estimates the GSV on its own by choosing $S_0, S_1$ uniformly at random from $\mathcal{F}$ and $\mathcal{B}$ (cf. Equation 5) and observes that $\hat{\Phi}_s$ indeed has a large negative value. The company realizes it must carefully select which data points will be sent; otherwise, the audit may observe the bias toward $x_s = 1$ and the model will not be deployed. Moreover, the company understands that the audit already has access to the data $f(D_0)$ and $f(D_1)$ representing the model predictions on the whole dataset (see Figure 2a). Therefore, if the company does not share subsets $S_0, S_1$ chosen uniformly at random from $D_0, D_1$, it is possible for the audit to detect the fraud with a statistical test comparing $f(S_0)$ to $f(D_0)$ and $f(S_1)$ to $f(D_1)$. The company needs a method to select misleading subsets $S_0', S_1'$ whose GSV is manipulated in its favor while remaining undetected by the audit. Such a method is the subject of the next section.

4 FOOL SHAP WITH STEALTHILY BIASED SAMPLING

4.1 MANIPULATION

To fool the audit, the company can decide to indeed sub-sample $S_0'$ uniformly at random, $S_0' \sim \mathcal{F}^M$. Then, given this choice of foreground data, it can repeatedly sub-sample $S_1' \sim \mathcal{B}^M$ and choose the set $S_1'$ leading to the smallest $|\hat{\Phi}_s(f, S_0', S_1')|$. We call this method brute-force. Its issue is that, by sub-sampling $S_1'$ from $\mathcal{B}$, it takes an enormous number of repetitions to reduce the attribution, since the estimate $\hat{\Phi}_s(f, S_0', S_1')$ is concentrated around the population GSV $\Phi_s(f, \mathcal{F}, \mathcal{B})$. A more clever method is to re-weight the background distribution before sampling from it, i.e., define $\mathcal{B}'_\omega := C(D_1, \omega)$ with $\omega \neq 1/N_1$ and then sub-sample $S_1' \sim \mathcal{B}'^M_\omega$. To make the model look fairer, the company needs the $\hat{\Phi}_s$ computed with these cherry-picked points to have a small magnitude.

Proposition 4.1. Let $S_0'$ be fixed, and let $\overset{p}{\to}$ represent convergence in probability as the size $M$ of the set $S_1' \sim \mathcal{B}'^M_\omega$ increases; we have

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \sum_{z^{(j)} \in D_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}). \tag{9}$$

We note that the coefficients $\hat{\Phi}_s(f, S_0', z^{(j)})$ in Equation 9 are tractable and can be computed and stored by the company. We discuss in more detail how to compute them in Appendix B.3. An additional requirement is that the non-uniform distribution $\mathcal{B}'_\omega$ remains similar to the original $\mathcal{B}$; otherwise, the fraud could be detected by the audit. In this work, the notion of similarity between distributions is captured by the Wasserstein distance in output space.
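The following sketch (again a stylized linear-model example of our own, not the authors' implementation) illustrates Proposition 4.1: once the per-baseline coefficients are pre-computed, any candidate weighting $\omega$ can be evaluated with a single dot product.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, N1, s = 5, 100, 500, 0                 # feature 0 plays the sensitive role
w = rng.normal(size=d)
lsv_s = lambda x, z: w[s] * (x[s] - z[s])    # closed-form LSV of a linear model

S0_prime = rng.normal(loc=1.0, size=(M, d))  # fixed foreground sample S0'
D1 = rng.normal(loc=0.0, size=(N1, d))       # full background dataset

# Coefficients of Eq. 9: Phi_hat_s(f, S0', z^(j)) for every z^(j) in D1.
coefs = np.array([np.mean([lsv_s(x, z) for x in S0_prime]) for z in D1])

uniform = np.full(N1, 1 / N1)
print("uniform background:", coefs @ uniform)

# Concentrating the weights on baselines with small-magnitude coefficients
# shrinks the limiting attribution of Eq. 9.
small = np.abs(coefs) < np.quantile(np.abs(coefs), 0.5)
omega = small / small.sum()
print("biased background: ", coefs @ omega)
```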
Definition 4.1 (Wasserstein Distance). Any probability measure $\pi$ over $D_1 \times D_1$ is called a coupling measure between $\mathcal{B}$ and $\mathcal{B}'_\omega$, denoted $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$, if $1/N_1 = \sum_j \pi_{ij}$ and $\omega_j = \sum_i \pi_{ij}$. The Wasserstein distance between $\mathcal{B}$ and $\mathcal{B}'_\omega$ mapped to the output space is defined as

$$W(\mathcal{B}, \mathcal{B}'_\omega) = \min_{\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)} \sum_{i,j} |f(z^{(i)}) - f(z^{(j)})|\, \pi_{ij}, \tag{10}$$

a.k.a. the cost of the optimal transport plan that redistributes the mass from one distribution to the other.

We propose Algorithm 1 to compute the weights $\omega$ by minimizing the magnitude of the GSV while maintaining a small Wasserstein distance. The trade-off between attribution manipulation and proximity to the data is tuned via a hyper-parameter $\lambda > 0$. We show in Appendix A.2 that the optimization problem at line 5 of Algorithm 1 can be reformulated as a Minimum Cost Flow (MCF) and hence solved in polynomial time (more precisely $\tilde{O}(N_1^{2.5})$, as in Fukuchi et al. (2020)).

    Algorithm 1 Compute non-uniform weights
    1: procedure COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))}_j, λ)
    2:   β := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ]
    3:   B := C(D_1, 1/N_1)                                ▷ Unmanipulated background
    4:   B'_ω := C(D_1, ω)                                 ▷ Manipulated background as a function of ω
    5:   ω ← argmin_ω  β Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) + λ W(B, B'_ω)   ▷ Optimization problem
    6:   return ω

4.2 DETECTION

We now discuss ways the audit can detect manipulation of the sampling procedure. Recall that the audit has previously been given access to $f(D_0), f(D_1)$, representing the model outputs across all instances in the private dataset. The audit will then be given sub-samples $S_0', S_1'$ of $D_0, D_1$, on which it can compute the output of the model and compare with $f(D_0), f(D_1)$. To assess whether or not the sub-samples provided by the company were sampled uniformly at random, the audit has to conduct statistical tests. The null hypothesis of these tests is that $S_0', S_1'$ were sampled uniformly at random from $D_0, D_1$. The detection procedure, Algorithm 2, with significance $\alpha$, uses both the Kolmogorov–Smirnov and Wald tests with Bonferroni corrections (i.e., the $\alpha/4$ thresholds in the algorithm). The Kolmogorov–Smirnov and Wald tests are discussed in more detail in Appendix C.

    Algorithm 2 Detection with significance α
    1: procedure DETECT_FRAUD(f(D_0), f(D_1), f(S'_0), f(S'_1), α, M)
    2:   for i = 0, 1 do
    3:     f(S_i) ← C(f(D_i), 1/N_i)^M                     ▷ Subsample without cheating
    4:     p-value-KS ← KS( f(S_i), f(S'_i) )              ▷ KS test comparing f(S_i) and f(S'_i)
    5:     p-value-Wald ← Wald( f(S'_i), f(D_i) )          ▷ Wald test
    6:     if p-value-KS < α/4 or p-value-Wald < α/4 then  ▷ Reject the null hypothesis
    7:       return 1
    8:   return 0
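As a rough sketch of one group's half of Algorithm 2 (using our own naming, not the paper's released code), the two tests can be run with `scipy`: `ks_2samp` for the Kolmogorov–Smirnov test and a hand-rolled Wald z-test.

```python
import numpy as np
from scipy import stats

def detect_fraud(f_D: np.ndarray, f_S_prime: np.ndarray,
                 alpha: float = 0.05, seed: int = 0) -> bool:
    """Reject H0 (the subset was sampled uniformly at random) if either
    test falls below the Bonferroni-corrected level alpha / 4."""
    rng = np.random.default_rng(seed)
    M = len(f_S_prime)

    # KS test: compare the company's subset to an honest uniform subsample.
    f_S = rng.choice(f_D, size=M, replace=False)
    p_ks = stats.ks_2samp(f_S, f_S_prime).pvalue

    # Wald test: is the subset mean compatible with the population mean?
    z = (f_S_prime.mean() - f_D.mean()) / (f_D.std() / np.sqrt(M))
    p_wald = 2 * stats.norm.sf(abs(z))

    return (p_ks < alpha / 4) or (p_wald < alpha / 4)
```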
4.3 WHOLE PROCEDURE

The procedure returning the subsets $S_0', S_1'$ is presented in Algorithm 3. It conducts a log-space search between $\lambda_{\min}$ and $\lambda_{\max}$ for the hyper-parameter $\lambda$ (line 6) in order to explore the possible attacks. For each value of $\lambda$, the attacker runs Algorithm 1 to obtain $\mathcal{B}'_\omega$ (line 7), then repeatedly samples $S_1' \sim \mathcal{B}'^M_\omega$ (line 10) and attempts to detect the fraud (line 11). The attacker chooses the $\mathcal{B}'_\omega$ that minimizes the magnitude of $\hat{\Phi}_s$ while having a detection rate below some threshold $\tau$ (line 12). An example of a search over $\lambda$ on a real-world dataset is presented in Figure 3.

    Algorithm 3 Fool SHAP
    1: procedure FOOL_SHAP(f, D_0, D_1, M, λ_min, λ_max, τ, α)
    2:   S'_0 ← C(D_0, 1/N_0)^M                            ▷ S'_0 is sampled without cheating
    3:   Compute Φ̂_s(f, S'_0, z^(j)) ∀ z^(j) ∈ D_1         ▷ cf. Section B.3
    4:   B* ← C(D_1, 1/N_1)
    5:   Φ*_s ← (1/N_1) Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j))      ▷ Initialize the solution
    6:   for λ = λ_max, …, λ_min do
    7:     ω ← COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))}_j, λ)
    8:     Detection ← 0
    9:     for rep = 1, …, 100 do                          ▷ Estimate the detection rate
    10:      S'_1 ← B'^M_ω
    11:      Detection += DETECT_FRAUD(f(D_0), f(D_1), f(S'_0), f(S'_1), α, M)
    12:    if |Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j))| < |Φ*_s| and Detection < 100τ then
    13:      B* ← B'_ω
    14:      Φ*_s ← Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j))      ▷ Update the solution
    15:  S'_1 ← B*^M                                       ▷ Cherry-pick by sampling from the non-uniform background
    16:  return S'_0, S'_1

One limitation of Fool SHAP is that it manipulates a single sensitive feature. In Appendix E.4, we present a possible extension of Algorithm 1 to handle multiple sensitive features, along with preliminary results on its effectiveness. A second limitation is that it only applies to interventional Shapley values, which break feature correlations. This choice was made because most methods in the SHAP library² are interventional. Future work should port Fool SHAP to observational Shapley values, which use conditional expectations to remove features (Frye et al., 2020).

²Except the Tree Explainer when no background data is provided.

Figure 3: Example of a log-space search over values of λ using an XGBoost classifier fitted on Adult. (a) The detection rate as a function of the parameter λ of the attack; the attacker uses a detection-rate threshold τ = 10%. (b) For each value of λ, the vertical slice of the 11 curves gives the GSV obtained with the resulting B'_ω. The goal is to reduce the amplitude of the sensitive feature (red curve).

4.4 CONTRIBUTIONS

The first technique to fool SHAP with perturbations of the background distribution was a genetic algorithm (Baniecki & Biecek, 2022). Although promising, the cross-over and mutation operations it employs to perturb data do not take feature correlations into account and can therefore generate unrealistic data. Moreover, its objective does not enforce similarity between the original and manipulated backgrounds. We show in Appendix E.3 that these limitations lead to systematic fraud detection. Hence, our contributions are two-fold. First, by perturbing the background via non-uniform weights over pre-existing instances (i.e., $\mathcal{B}'_\omega := C(D_1, \omega)$) rather than a genetic algorithm, we avoid the issue of non-realistic data. Second, by considering the Wasserstein distance, we can control the similarity between the original and fake backgrounds. Since the Stealthily Biased Sampling technique introduced by Fukuchi et al. (2020) also leverages a non-uniform distribution over data points and the Wasserstein distance, it makes sense to adapt it to fool SHAP. Still, the approach of Fukuchi et al. differs from ours. In their work, they minimize the Wasserstein distance while enforcing a hard constraint on the number of instances that land in the different bins of the target and sensitive feature. That way, they can set the Demographic Parity to any given value while staying close to the original data. In our setting of manipulating the model explanation, we leave the Demographic Parity intact and instead manipulate its feature attribution. In terms of the optimization objective, we now minimize a Shapley value with a soft constraint on the Wasserstein distance.
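For intuition, the optimization at line 5 of Algorithm 1 can also be written directly as a small linear program: substituting $\omega_j = \sum_i \pi_{ij}$ makes the transport plan $\pi$ the only variable. The sketch below is a hypothetical illustration with `scipy.optimize.linprog`; the paper instead solves the equivalent, more scalable MCF of Appendix A.2.

```python
import numpy as np
from scipy.optimize import linprog

def compute_weights(f_D1: np.ndarray, coefs: np.ndarray, lam: float) -> np.ndarray:
    """Line 5 of Algorithm 1 as an LP over the coupling pi (an N x N matrix).
    f_D1[i] = f(z^(i)); coefs[j] = Phi_hat_s(f, S0', z^(j))."""
    N = len(f_D1)
    beta = np.sign(coefs.sum())
    # Routing mass pi_ij costs lam * |f(z_i) - f(z_j)| + beta * coefs[j],
    # since omega_j = sum_i pi_ij collects one coefs[j] per unit of mass.
    cost = lam * np.abs(f_D1[:, None] - f_D1[None, :]) + beta * coefs[None, :]
    # Row-sum constraints: sum_j pi_ij = 1/N, the uniform mass of B.
    A_eq = np.kron(np.eye(N), np.ones(N))
    b_eq = np.full(N, 1.0 / N)
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    pi = res.x.reshape(N, N)
    return pi.sum(axis=0)  # omega = column sums of the optimal coupling

# Tiny usage example with random outputs and coefficients.
rng = np.random.default_rng(0)
f_D1, coefs = rng.uniform(size=30), rng.normal(scale=0.01, size=30)
omega = compute_weights(f_D1, coefs, lam=1.0)
print(omega.sum(), coefs @ omega)  # weights sum to 1; manipulated attribution
```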
5 EXPERIMENTS

5.1 TOY EXPERIMENT

The task is predicting which individuals will be hired for a job that requires carrying heavy objects. The causal graph for this toy dataset is presented in Figure 4 (left). We observe that sex (S) influences height (H), and that both of these features influence the muscular mass (M). In the end, the hiring decisions (Y) are based only on the two attributes relevant to the job: H and M. Two noise features N1, N2 were also added. More details and justifications for this causal graph are discussed in Appendix D.1. Since strength and height (two important qualifications for applicants) are correlated with sex, any model $f$ that fits the data will exhibit some disparity in hiring rates between sexes. However, if the model decisions do not rely strongly on feature S, the company can argue in favor of deployment. GSV are used by the audit to measure the amount by which the model relies on the sex feature; see Figure 4 (middle). By employing Fool SHAP with $M = 100$, the company can reduce the GSV of feature S considerably compared to the brute-force and genetic algorithms. More importantly, the audit is not able to detect that the provided samples $S_0', S_1'$ were cherry-picked; see Figure 4 (right). More results are presented in Appendix E.1.

Figure 4: Toy example. Left: causal graph. Middle: GSV for the different attacks with M = 100. Right: comparison of the CDFs of the Fool SHAP subsets f(S'_0), f(S'_1) and the CDFs over the whole data f(D_0), f(D_1). Here the audit cannot detect the fraud using its detection algorithm.

5.2 DATASETS

We consider four standard datasets from the FAccT literature, namely COMPAS, Adult-Income, Marketing, and Communities. COMPAS regroups 6,150 records from criminal offenders in Florida collected from 2013–2014. The binary classification task consists in predicting who will re-offend within two years. The sensitive feature $s$ is race, with values $x_s = 0$ for African-American and $x_s = 1$ for Caucasian. Adult Income contains demographic attributes of 48,842 individuals from the 1994 U.S. census. It is a binary classification problem with the goal of predicting whether or not a particular person makes more than 50K USD per year. The sensitive feature $s$ in this dataset is gender, which takes values $x_s = 0$ for female and $x_s = 1$ for male. Marketing involves information on 41,175 customers of a Portuguese bank, and the binary classification task is to predict who will subscribe to a term deposit. The sensitive attribute is age, taking values $x_s = 0$ for ages 30–60 and $x_s = 1$ for ages outside 30–60. Communities & Crime contains per-capita violent crime rates for 1,994 communities in the US. The binary classification task is to predict which communities have crime rates below the median. The sensitive attribute is Percent White, taking values $x_s = 0$ for Percent White < 90% and $x_s = 1$ for Percent White >= 90%.

Three model types were considered: Multi-Layered Perceptrons (MLP), Random Forests (RF), and eXtreme Gradient Boosted trees (XGB). One model of each type was fitted on each dataset for 5 different train/test split seeds, resulting in 60 models in total. Values of the test-set accuracy and demographic parity for each model type and dataset are presented in Appendix D.2.

5.3 DETECTOR CALIBRATION

Table 1: False positive rates (%) of the detector, i.e., the frequency at which $S_0, S_1$ are considered cherry-picked when they are not. No rate should be above 5%.

| Dataset     | MLP | RF  | XGB |
|-------------|-----|-----|-----|
| COMPAS      | 4.0 | 4.6 | 4.0 |
| Adult       | 4.3 | 4.3 | 4.2 |
| Marketing   | n/a | 4.9 | 5.0 |
| Communities | n/a | 3.8 | 4.2 |
Detector calibration refers to the assessment that, assuming the null hypothesis to be true, the probability of rejecting it (i.e., a false positive) is bounded by the significance level $\alpha$. Remember that the null hypothesis of the audit's detector is that the sets $S_0', S_1'$ provided by the company are sampled uniformly from $D_0, D_1$. Hence, to test the detector, the audit can sample its own subsets $f(S_0), f(S_1)$ uniformly at random from $f(D_0), f(D_1)$, run the detection algorithm, and count the number of detections over 1000 repetitions. Table 1 shows the false positive rates over the five train/test splits using a significance level $\alpha = 5\%$. We observe that the false positive rates are indeed bounded by $\alpha$ for all model types and datasets, implying that the detector employed by the audit is calibrated.

5.4 ATTACK RESULTS AND DISCUSSION

Figure 5: Relative decrease in amplitude of the sensitive feature attribution induced by the various attacks (Brute, Genetic, Ours) on SHAP, for the COMPAS, Adult, Marketing, and C&C datasets.

The first step of the attack (line 3 of Algorithm 3) requires that the company run SHAP on its own and compute the coefficients necessary to run Algorithm 1. For the COMPAS and Adult datasets, the Exact Explainer of SHAP was used. Since Marketing and Communities contain more than 15 features, and since the Exact Explainer scales exponentially with the number of features, we were restricted to using the Tree Explainer (Lundberg et al., 2020) on these datasets. The Tree Explainer avoids the exponential cost of Shapley values but is only applicable to tree-based models such as RFs and XGBs. Therefore, we could not conduct the attack on MLPs fitted on Marketing and Communities. The following step is to solve the MCF for various values of $\lambda$ (line 7 of Algorithm 3). As stated previously, solving the MCF can be done in polynomial time in terms of $N_1$, which was tractable for small datasets like COMPAS and Communities, but not for larger datasets like Adult and Marketing. To address this issue, as was done in Fukuchi et al. (2020), we computed the manipulated weights multiple times using 5 bootstrap sub-samples of $D_1$ of size 2000 to obtain a set of weights $\omega^{[1]}, \omega^{[2]}, \ldots, \omega^{[5]}$, which we average to obtain the final weights $\omega$.

Results of 46 attacks with $M = 200$ are shown in Figure 5. Specific examples of the conducted attacks are presented in Appendix E.2. As a point of reference, we also show results for the brute-force and genetic algorithms. To make comparisons to our attack more meaningful, the brute-force method was only allowed to run for the same amount of time it took to search for the non-uniform weights $\omega$ (about 30–180 seconds). Also, the genetic algorithm ran for 400 iterations and was stopped early after 10 consecutive detections. We note that, across all datasets, Fool SHAP leads to greater reductions of the sensitive feature attribution than brute-force search and the genetic perturbations of the background. Focusing on Fool SHAP, for the COMPAS and Marketing datasets we observe median reductions in amplitude of about 90%. This means that our attack can considerably reduce the apparent importance of the sensitive attribute. For the Adult and Communities datasets, the median reduction in amplitude is about 50%, meaning that we typically halve the importance of the sensitive feature.
Still, looking at the maximum reduction in amplitude for Adult-Income and Communities, we note that one attack managed to reduce the amplitude by 90%. Therefore, luck can play a part in the degree of success of Fool SHAP, which is to be expected from data-driven attacks. Finally, the audit was consistently unable to detect the fraud using statistical tests. This observation raises concerns about the risk that SHAP explanations can be attacked to return not only manipulated attributions but also non-detectable fake evidence of fairness.

6 CONCLUSION

To conclude, we proposed a novel attack on Shapley values that does not require modifying the model but rather manipulates the sampling procedure used to estimate expectations w.r.t. the background distribution. We showed on a toy example and four fairness datasets that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups using SHAP. Crucially, the sampling manipulation is hard to detect by an audit that is given limited access to the data and model. These results raise concerns about the viability of using Shapley values to assess model fairness. We leave as future work the use of Shapley values to decompose other fairness metrics such as predictive equality and equal opportunity. Moreover, we wish to move to use cases beyond fairness, as we believe that the demonstrated vulnerability of Shapley values can apply to many other properties such as safety and security.

7 ETHICS STATEMENT

The main objective of this work is to raise awareness about the risk of manipulation of SHAP explanations and the undetectability of such manipulations. As such, it aims at exposing the potential negative societal impacts of relying on such explanations. It remains possible that malicious model producers could use this attack to mislead end-users or cheat during an audit. However, we believe this paper makes a significant step toward increasing the vigilance of the community and fostering the development of trustworthy explanation methods. Furthermore, by showing how fairness can be manipulated in explanation contexts, this work contributes to the research on the certification of the fairness of automated decision-making systems.

8 REPRODUCIBILITY STATEMENT

The source code of all our experiments is available online³. Moreover, experimental details are provided in Appendix D.2 for the interested reader.

³https://github.com/gablabc/Fool_SHAP

REFERENCES

Julius A Adebayo et al. FairML: Toolbox for diagnosing bias in predictive modeling. Master's thesis, Massachusetts Institute of Technology, 2016.

Ulrich Aïvodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. Fairwashing: the risk of rationalization. In International Conference on Machine Learning, pp. 161–170. PMLR, 2019.

Hubert Baniecki and Przemyslaw Biecek. Manipulating SHAP via adversarial data perturbations (student abstract). 2022.

Hubert Baniecki, Wojciech Kretowicz, and Przemyslaw Biecek. Fooling partial dependence via data poisoning. arXiv preprint arXiv:2105.12837, 2021.

Tom Begley, Tobias Schwedes, Christopher Frye, and Ilya Feige. Explainability for fair machine learning. arXiv preprint arXiv:2010.07389, 2020.

Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943, 2018.
Yoichi Chikahara, Shinsaku Sakaue, Akinori Fujino, and Hisashi Kashima. Learning individually fair classifier with path-specific causal-effect constraint. In International Conference on Artificial Intelligence and Statistics, pp. 145–153. PMLR, 2021.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller. You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods. In SafeAI@AAAI, 2020.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Christopher Frye, Damien de Mijolla, Tom Begley, Laurence Cowton, Megan Stanley, and Ilya Feige. Shapley explainability on the data manifold. arXiv preprint arXiv:2006.01272, 2020.

Kazuto Fukuchi, Satoshi Hara, and Takanori Maehara. Faking fairness via stealthily biased sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 412–419, 2020.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Ian Janssen, Steven B Heymsfield, ZiMian Wang, and Robert Ross. Skeletal muscle mass and distribution in 468 men and women aged 18–88 yr. Journal of Applied Physiology, 2000.

A J Lee. U-statistics: Theory and Practice. Routledge, 2019.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.

Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020.

Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.

Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T Rodolfa, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, pp. 307–317, 1953.

Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180–186, 2020.

Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.

A.1 PROOFS FOR GLOBAL SHAPLEY VALUES (GSV)

Proposition A.1 (Proposition 2.1). The GSV have the following property:

$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)]. \tag{11}$$

Proof.
As a reminder, we have defined the vector

$$\Phi(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi(f, x, z)\right], \tag{12}$$

whose components sum up to

$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \sum_{i=1}^{d} \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi_i(f, x, z)\right] = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\Big[\sum_{i=1}^{d}\phi_i(f, x, z)\Big] = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[f(x) - f(z)\right] = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{z\sim\mathcal{B}}[f(z)] = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)],$$

where the third equality uses the efficiency axiom (Equation 1) and the last step simply renames a dummy variable. ∎

Proposition A.2 (Proposition 4.1). Let $S_0'$ be fixed, and let $\overset{p}{\to}$ represent convergence in probability as the size $M$ of the set $S_1' \sim \mathcal{B}'^M_\omega$ increases; then

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}). \tag{18}$$

Proof. First note that

$$\hat{\Phi}(f, S_0', S_1') = \frac{1}{M^2}\sum_{z^{(j)}\in S_1'}\sum_{x^{(i)}\in S_0'} \phi(f, x^{(i)}, z^{(j)}) = \frac{1}{M}\sum_{z^{(j)}\in S_1'} \hat{\Phi}(f, S_0', z^{(j)}). \tag{19}$$

Since $S_0'$ is assumed fixed, the only random variable in $\hat{\Phi}_s(f, S_0', z^{(j)})$ is $z^{(j)}$, an instance sampled from $\mathcal{B}'_\omega$. Therefore, we can define $\psi(z) := \hat{\Phi}_s(f, S_0', z)$ and we get

$$\hat{\Phi}_s(f, S_0', S_1') = \frac{1}{M}\sum_{z^{(j)}\in S_1'} \psi(z^{(j)}) \quad \text{with } S_1' \sim \mathcal{B}'^M_\omega. \tag{20}$$

By the weak law of large numbers, the following holds as $M$ goes to infinity (Wasserman, 2004, Theorem 5.6):

$$\frac{1}{M}\sum_{z^{(j)}\in S_1'}\psi(z^{(j)}) \overset{p}{\to} \mathbb{E}_{z\sim\mathcal{B}'_\omega}[\psi(z)]. \tag{21}$$

Now, as a reminder, the manipulated background distribution is $\mathcal{B}'_\omega := C(D_1, \omega)$ with $\omega \neq 1/N_1$. Therefore

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \mathbb{E}_{z\sim C(D_1,\omega)}[\psi(z)] = \sum_{j=1}^{N_1}\omega_j\,\psi(z^{(j)}) = \sum_{j=1}^{N_1}\omega_j\,\hat{\Phi}_s(f, S_0', z^{(j)}),$$

concluding the proof. ∎

A.2 PROOFS FOR OPTIMIZATION PROBLEM

A.2.1 TECHNICAL LEMMAS

We provide some technical lemmas that will be essential when proving Theorem A.1. These are presented for completeness and are not intended as contributions by the authors. Let us first write the formal definition of the minimum of a function.

Definition A.1 (Minimum). Given some function $f : \mathcal{D} \to \mathbb{R}$, the minimum of $f$ over $\mathcal{D}$ (denoted $f^\star$) is defined as follows:

$$f^\star = \min_{x\in\mathcal{D}} f(x) \iff \exists x^\star \in \mathcal{D} \text{ s.t. } f^\star = f(x^\star) \le f(x)\ \forall x \in \mathcal{D}.$$

Basically, the notion of minimum coincides with the infimum $\inf f(\mathcal{D})$ (the highest lower bound) when this lower bound is attained for some $x \in \mathcal{D}$. By the Extreme Value Theorem, the minimum always exists when $\mathcal{D}$ is compact and $f$ is continuous. For the rest of this appendix, we shall only study optimization problems where points of the domain set $\mathcal{D} = \{(x, y) : x \in \mathcal{X}, y \in \mathcal{Y}_x \subset \mathcal{Y}\}$ can be selected by the following procedure:

1. Choose some $x \in \mathcal{X}$.
2. Given the selected $x$, choose some $y \in \mathcal{Y}_x \subset \mathcal{Y}$, where the set $\mathcal{Y}_x$ is non-empty and depends on the value of $x$.

When optimizing functions over such domains, one can optimize in two steps, as highlighted in the following lemma.

Lemma A.1. Given a compact domain $\mathcal{D}$ of the form described above and a continuous objective function $f : \mathcal{D} \to \mathbb{R}$, the minimum $f^\star$ is attained for some $(x^\star, y^\star)$ and the following holds:

$$\min_{(x,y)\in\mathcal{D}} f(x, y) = \min_{x\in\mathcal{X}} \min_{y\in\mathcal{Y}_x} f(x, y).$$

Proof. Let $\tilde{f}(x) := \inf_{y\in\mathcal{Y}_x} f(x, y)$, which is a well-defined function on $\mathcal{X}$. We can then take its infimum $f^\star = \inf_{x\in\mathcal{X}} \tilde{f}(x)$. But is $f^\star$ an infimum of $f(\mathcal{D})$? By the definition of infimum, $f^\star \le \tilde{f}(x)\ \forall x \in \mathcal{X}$ and $\inf_{y\in\mathcal{Y}_x} f(x, y) \le f(x, y)\ \forall y \in \mathcal{Y}_x$, so that $f^\star$ is a lower bound of $f(\mathcal{D})$. In fact, it is the highest lower bound possible, so

$$\inf_{(x,y)\in\mathcal{D}} f(x, y) = \inf_{x\in\mathcal{X}} \inf_{y\in\mathcal{Y}_x} f(x, y). \tag{23}$$

By the Extreme Value Theorem, since $\mathcal{D}$ is compact and $f$ is continuous, there exists $(x^\star, y^\star) \in \mathcal{D}$ s.t. $f^\star = \inf_{(x,y)\in\mathcal{D}} f(x, y) = \min_{(x,y)\in\mathcal{D}} f(x, y) = f(x^\star, y^\star)$. Since the infimum is attained on the left-hand side of Equation 23, it must also be attained on the right-hand side, and therefore we can replace all inf with min in Equation 23, leading to the desired result. ∎

Lemma A.2.
Given a compact domain $\mathcal{D}$ of the form described above and two continuous functions $h : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$, then

$$\min_{(x,y)\in\mathcal{D}} \big(h(x) + g(y)\big) = \min_{x\in\mathcal{X}} \Big(h(x) + \min_{y\in\mathcal{Y}_x} g(y)\Big).$$

Proof. Applying Lemma A.1 with the function $f(x, y) := h(x) + g(y)$ proves the lemma. ∎

Figure 6: Graph $G$ on which we solve the MCF. Edges from the source $s$ to a left vertex $\ell_j$ have cost $a(e) = \beta\hat{\Phi}_s(f, S_0', z^{(j)})$, capacity $c(e) = \infty$, and flow $f(e) = \tilde{\omega}_j$; edges from $\ell_j$ to $r_i$ have cost $a(e) = \lambda|f(z^{(i)}) - f(z^{(j)})|$, capacity $c(e) = \infty$, and flow $f(e) = \tilde{\pi}_{i,j}$; edges from $r_i$ to the sink $t$ have cost $a(e) = 0$, capacity $c(e) = 1$, and flow $f(e) = 1$. Note that the total amount of flow is $d = N_1$ and there are $N_1$ left and right vertices $\ell_j, r_i$.

A.2.2 MINIMUM COST FLOWS

Let $G = (V, E)$ be a graph with vertices $v \in V$ and directed edges $e \in E \subset V \times V$, let $c : E \to \mathbb{R}$ be a capacity and $a : E \to \mathbb{R}$ be a cost. Moreover, let $s, t \in V$ be two special vertices called the source and the sink respectively, and let $d \in \mathbb{R}$ be a total flow. The Minimum-Cost Flow (MCF) problem on $G$ consists of finding the flow function $f : E \to \mathbb{R}$ that minimizes the total cost

$$\sum_{e\in E} a(e) f(e) \quad \text{s.t.} \quad 0 \le f(e) \le c(e)\ \forall e \in E, \qquad \sum_{e\in\delta^+(u)} f(e) - \sum_{e\in\delta^-(u)} f(e) = \begin{cases} 0 & u \in V \setminus \{s, t\} \\ d & u = s \\ -d & u = t, \end{cases}$$

where $\delta^+(u) := \{(u, v) \in E\}$ and $\delta^-(u) := \{(v, u) \in E\}$ are the outgoing and incoming edges of $u$. The terminology of flow arises from the constraint that, for vertices that are neither the source nor the sink, the outgoing flow must equal the incoming one, which is reminiscent of conservation laws in fluid dynamics. We shall refer to $f((u, v))$ as the flow from $u$ to $v$.

Now that we have introduced minimum cost flows, let us specify the graph that will be employed to manipulate the GSV; see Figure 6. We label the flow going from the source $s$ to one of the left vertices as $\tilde{\omega}_j = \omega_j \times N_1$, and the flow going from $\ell_j$ to $r_i$ as $\tilde{\pi}_{i,j} = \pi_{i,j} \times N_1$. The required flow is fixed at $d = N_1$.

Theorem A.1. Solving the MCF of Figure 6 leads to a solution of the linear program in Algorithm 1.

Proof. We begin by showing that the flow conservation constraints in the MCF imply that $\pi$ is a coupling measure (i.e., $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$) and that $\omega$ is constrained to the probability simplex $\Delta(N_1)$. Applying the conservation law on the left side of the graph leads to the conclusion that the flows entering the vertices $\ell_j$ must sum up to $N_1$: $\sum_{j=1}^{N_1} \tilde{\omega}_j = N_1$, which implies that $\omega$ must be part of the probability simplex. By conservation, the amount of flow that leaves a specific vertex $\ell_j$ must also be $\tilde{\omega}_j$, hence $\sum_i \tilde{\pi}_{ij} = \tilde{\omega}_j$. For any edge going from $r_i$ to the sink $t$, the flow must be exactly 1: there are $N_1$ edges with capacity $c(e) = 1$ going into the sink, and the sink must receive an incoming flow of $N_1$. As a consequence of the conservation law on a specific vertex $r_i$, the amount of flow that goes into each $r_i$ is also 1: $\sum_j \tilde{\pi}_{ij} = 1$. Putting everything together, from the conservation laws on $G$, we have that $\omega \in \Delta(N_1)$ and $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$.

Now, to make the parallel between the MCF and Algorithm 1, we must use Lemma A.2. Note that $\omega$ is restricted to the probability simplex, while $\pi$ is restricted to be a coupling measure. Importantly, the set of all possible coupling measures $\Pi(\mathcal{B}, \mathcal{B}'_\omega)$ is different for each $\omega$ because $\mathcal{B}'_\omega$ depends on $\omega$. Hence, the domain has the same structure as the ones tackled in Lemma A.2 (where $x \in \mathcal{X}$ becomes $\omega \in \Delta(N_1)$ and $y \in \mathcal{Y}_x$ becomes $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$). Also, the set of possible $\omega$ and $\pi$ is a bounded simplex in $\mathbb{R}^{N_1(N_1+1)}$, so it is compact, and the objective function of the MCF is linear, thus continuous. Hence, we can apply Lemma A.2 to the MCF.
Indeed,

$$\min \sum_{e\in E} f(e)\,a(e) = \min_{\tilde{\omega}, \tilde{\pi}} \sum_{j=1}^{N_1} \beta\,\tilde{\omega}_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \tilde{\pi}_{ij}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\tilde{\omega}, \tilde{\pi}} \sum_{j=1}^{N_1} \beta\,\frac{\tilde{\omega}_j}{N_1}\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \frac{\tilde{\pi}_{ij}}{N_1}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\omega\in\Delta(N_1),\,\pi\in\Pi(\mathcal{B},\mathcal{B}'_\omega)} \beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \pi_{i,j}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\omega\in\Delta(N_1)} \Big(\beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda\, W(\mathcal{B}, \mathcal{B}'_\omega)\Big) \quad \text{(cf. Lemma A.2)},$$

which (up to the multiplicative constant $N_1$) is a solution of the linear program of Algorithm 1. ∎

B SHAPLEY VALUES

B.1 LOCAL SHAPLEY VALUES (LSV)

We introduce Local Shapley Values (LSV) more formally. First, as explained earlier, Shapley values are based on coalitional game theory, where the different features work together toward a common outcome $f(x)$. In a game, the features can either be present or absent, which is simulated by replacing some features with a baseline value $z$.

Definition B.1 (The Replace Function). Let $x$ be an input of interest, $S \subseteq \{1, 2, \ldots, d\}$ be a subset of input features that are considered active, and $z$ be a baseline input; then the replace function $r_S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is defined as

$$r_S(z, x)_i = \begin{cases} x_i & \text{if } i \in S \\ z_i & \text{otherwise.} \end{cases} \tag{25}$$

We note that this function is meant to activate the features in $S$. Now, if we let $\pi$ be a random permutation of the $d$ features, and $\pi_{<i}$ denote all features that appear before $i$ in $\pi$, the LSV are computed via

$$\phi_i(f, x, z) := \mathbb{E}_{\pi\sim\Omega}\big[f(r_{\pi_{<i}\cup\{i\}}(z, x)) - f(r_{\pi_{<i}}(z, x))\big], \quad i = 1, 2, \ldots, d, \tag{26}$$

where $\Omega$ is the uniform distribution over the $d!$ permutations. Observe that the computation of LSV scales poorly with the number of features $d$; hence model-agnostic computations are only possible on datasets with few features, such as COMPAS and Adult-Income. For datasets with larger numbers of features, the Tree Explainer algorithm (Lundberg et al., 2020) can be used to compute the LSV (cf. Equation 26) in polynomial time, given that one is explaining a tree-based model.
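In practice, LSV with a chosen baseline can be obtained from the SHAP library directly. The following sketch is our own usage example (with a hypothetical model and data): passing a single-row background to the Exact Explainer makes each returned attribution equal to $\phi_i(f, x, z)$ for that baseline $z$, so the efficiency axiom can be checked numerically.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical model and data standing in for the company's f and D.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)
f = lambda data: model.predict_proba(data)[:, 1]

z = X[:1]                                # a single baseline z
explainer = shap.explainers.Exact(f, z)  # background reduced to {z}
phi = explainer(X[:10]).values           # phi[i, k] = phi_k(f, x^(i), z)

# Efficiency axiom (Eq. 1): the attributions sum to f(x) - f(z).
assert np.allclose(phi.sum(axis=1), f(X[:10]) - f(z))
```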
Such statistics are unbiased and asymptotically normal estimates of Φkpf, F, Bq E x F z B rhpx, zqs. (30) The asymptotic normality of two-samples U-statistics is characterized by the following Theorem (Lee, 2019, Section 3.7.1). Theorem B.1. Let pΦk pΦkpf, S0, S1q be a two-samples U-statistic with |S0| N, |S1| M, moreover let hpx, zq have finite first and second moments, then the following holds for any δ P s0, 1r lim N MÑ8 s.t. N{p N MqÑp Pp0,1q P ˆ |pΦk Φk| ě F 1 Np0,1qp1 δ{2q σ2 10 p σ2 01 1 p where σ2 10 Vx Fr Ez Brhpx, zqs s and σ2 01 Vz Br Ex Frhpx, zqs s. Proposition B.1 follows from this Theorem by choosing N M, p 0.5 and noticing that having a model with bounded outputs (f : X Ñ r0, 1s) implies that |hpx, zq| ď 1 @x, z P X which means that hpx, zq has bounded first and second moments. B.3 COMPUTE THE LSV Running Algorithm 1 requires computing the coefficients pΦspf, S1 0, zpjqq for j 1, 2, . . . , N1. To compute them, first note that they can be written in terms of LSV for all instances in S1 0 pΦspf, S1 0, zpjqq 1 xpiq PS1 0 φspf, xpiq, zpjqq. (31) The LSV φspf, xpiq, zpjqq are computed deeply in the SHAP code and are not directly accessible using the current API. Hence, we had to access them using Monkey-Patching i.e. we modified the Exact Explainer class so that it stores the LSV as one of its attributes. The attribute can then be accessed as seen in Figure 7. The code is provided as a fork the SHAP repository. For the Tree Explainer, because its source code is in C++ and wrapped in Python, we found it simpler to simply rewrite our own version of the algorithm in C++ so that it directly returns the LSV, instead of Monkey-Patching the Tree Explainer. Figure 7: How we extract the LSV from the Exact Explainer via Monkey-Patching. Published as a conference paper at ICLR 2023 C STATISTICAL TESTS C.1 KS TEST A first test that can be conducted is a two-samples Kolmogorov-Smirnov (KS) test (Massey Jr, 1951). If we let p FSpxq 1 z PS 1pz ď xq (32) be the empirical CDF of observations in the set S. Given two sets S and S1, the KS statistic is KSp S, S1q sup x PR | p FSpxq p FS1pxq|. (33) Under the null-hypothesis H0 : S D|S|, S1 D|S1| for some univariate distribution D, this statistic is expected to not be too large with high probability. Hence, when the company provides the subsets S1 0, S1 1, the audit can sample their own two subsets fp S0q, fp S1q uniformly at random from fp D0q, fp D1q and compute the statistics KSpfp S0q, fp S1 0qq and KSpfp S1q, fp S1 1qq to detect a fraud. C.2 WALD TEST An alternative is the Wald test, which is based on the central limit theorem. If S1 BM, then the empirical average of the model output over S1 is asymptotically normally distributed as M increases Waldpfp S1q, fp Bqq : z Pfp S1q z µ M ù Np0, 1q, (34) where µ : Ez fp Bqrzs and σ2 : Vz fp Bqrzs are the expected value and variance of the model output across the whole background. The same reasoning holds for S0 and the foreground F. Applying the Wald test with significance α would detect fraud when | Waldpfp S1 1q, fp Bqq | ą F 1 Np0,1qp1 α{2q, (35) where F 1 Np0,1q is the inverse of the CDF of a standard normal variable. Published as a conference paper at ICLR 2023 D METHODOLOGICAL DETAILS D.1 TOY EXAMPLE The toy dataset was constructed to closely match the results of the following empirical study comparing skeletal mass distributions between men and women (Janssen et al., 2000). Firstly, the sex feature was sampled from a Bernoulli S Bernoullip0.5q. (36) According to Table 1 of Janssen et al. 
D METHODOLOGICAL DETAILS

D.1 TOY EXAMPLE

The toy dataset was constructed to closely match the results of an empirical study comparing skeletal muscle mass distributions between men and women (Janssen et al., 2000). First, the sex feature was sampled from a Bernoulli:

$$S \sim \text{Bernoulli}(0.5). \tag{36}$$

According to Table 1 of Janssen et al. (2000), the average height of women participants was 163 cm, while it was 177 cm for men; both height distributions had the same standard deviation of 7 cm. Hence we sampled height via

$$H \mid S = \text{man} \sim N(177, 49), \qquad H \mid S = \text{woman} \sim N(163, 49). \tag{37}$$

It was noted in Janssen et al. (2000) that there is approximately a linear relationship between height and skeletal muscle mass for both sexes. Therefore, we computed the muscle mass M as

$$M \mid \{H = h, S = \text{man}\} = 0.186\,h + 5\epsilon, \qquad M \mid \{H = h, S = \text{woman}\} = 0.128\,h + 4\epsilon, \qquad \epsilon \sim N(0, 1). \tag{38}$$

The values of the coefficients 0.186 and 0.128 and of the noise levels 5 and 4 were chosen so that the distributions of $M \mid S$ approximately match those of Table 1 in Janssen et al. (2000). Finally, the target was chosen following

$$Y \mid \{H = h, M = m\} \sim \text{Bernoulli}\big(P(h, m)\big) \quad \text{with} \quad P(h, m) = \big(1 + \exp\{100\cdot\mathbb{1}(h < 160) - 0.3(m - 28)\}\big)^{-1}. \tag{39}$$

Simply put, the chances of being hired in the past (Y) were essentially zero for individuals shorter than 160 cm. Moreover, individuals with a higher skeletal muscle mass were given better chances of being admitted; yet, individuals with less muscle mass could still occasionally be given the job. In the end, we generated 6000 samples, leading to the following disparity in Y:

$$P(Y = 1 \mid S = \text{man}) = 0.733, \qquad P(Y = 1 \mid S = \text{woman}) = 0.110. \tag{40}$$
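For concreteness, the generative process of Equations 36–40 can be reproduced in a few lines of numpy (our own sketch; the noise features N1, N2 of Section 5.1 are assumed to be standard normals, since their exact distribution is not specified here).

```python
import numpy as np

def sample_toy_data(n: int = 6000, seed: int = 0):
    rng = np.random.default_rng(seed)
    s = rng.binomial(1, 0.5, size=n)                 # Eq. 36: 1 = man, 0 = woman
    h = np.where(s == 1,
                 rng.normal(177, 7, size=n),         # Eq. 37 (std dev 7 cm)
                 rng.normal(163, 7, size=n))
    eps = rng.normal(size=n)
    m = np.where(s == 1, 0.186 * h + 5 * eps,        # Eq. 38
                         0.128 * h + 4 * eps)
    p = 1 / (1 + np.exp(100 * (h < 160) - 0.3 * (m - 28)))  # Eq. 39
    y = rng.binomial(1, p)
    n1, n2 = rng.normal(size=n), rng.normal(size=n)  # assumed noise features
    return np.column_stack([s, h, m, n1, n2]), y

X, y = sample_toy_data()
# Hiring rates per sex, to be compared with Eq. 40.
print(y[X[:, 0] == 1].mean(), y[X[:, 0] == 0].mean())
```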
Table 2: Model test accuracy % (mean ± stddev).

| Dataset     | MLP        | RF         | XGB        |
|-------------|------------|------------|------------|
| COMPAS      | 68.2 ± 0.9 | 67.7 ± 0.8 | 68.6 ± 0.8 |
| Adult       | 85.6 ± 0.3 | 86.3 ± 0.2 | 87.1 ± 0.1 |
| Marketing   | n/a        | 91.1 ± 0.1 | 91.4 ± 0.3 |
| Communities | n/a        | 83 ± 2     | 82 ± 2     |

Table 3: Model demographic parity (mean ± stddev).

| Dataset     | MLP         | RF            | XGB           |
|-------------|-------------|---------------|---------------|
| COMPAS      | 0.12 ± 0.01 | 0.11 ± 0.01   | 0.11 ± 0.02   |
| Adult       | 0.20 ± 0.01 | 0.19 ± 0.01   | 0.192 ± 0.004 |
| Marketing   | n/a         | 0.104 ± 0.005 | 0.11 ± 0.01   |
| Communities | n/a         | 0.50 ± 0.01   | 0.54 ± 0.02   |

D.2 REAL DATA

The datasets were first divided into train/test subsets with a 4/5 ratio. The models were trained on the training set and evaluated on the test set. All categorical features for COMPAS, Adult, and Marketing were one-hot encoded, which resulted in totals of 11, 40, and 61 columns for each dataset respectively. A simple 50-step random search was conducted to fine-tune the hyper-parameters with cross-validation on the training set. The resulting test-set performance and demographic parities for all models and datasets, aggregated over 5 random data splits, are reported in Tables 2 and 3 respectively.

E ADDITIONAL RESULTS

E.1 TOY EXAMPLE

Figure 8: Additional results on the toy example. (a) Iterations of the genetic algorithm. (b) CDFs for the genetic algorithm. (c) CDFs for brute-force. (d) CDFs for Fool SHAP.

Figure 8 presents additional results for the toy example. More specifically, Figure 8 (a) illustrates the evolution of the detection and of the amplitude of the sensitive feature during the genetic algorithm. We note that beyond 90 iterations, the detector is systematically able to assert that the dataset is manipulated. The smallest amplitude that can be reached via the genetic algorithm without being detected is around 0.05. Figures 8 (b), (c), and (d) show the CDFs of $f(S_1')$ where $S_1'$ is chosen via the genetic algorithm, brute-force, and Fool SHAP respectively. We observe that Fool SHAP is the method whose resulting CDF for $f(S_1')$ is closest to the CDF of $f(D_1)$; this is why the audit is not able to detect the fraud using statistical tests. The fact that Fool SHAP generates fake CDFs that are close to the data CDFs is a consequence of minimizing the Wasserstein distance. These results highlight the superiority of Fool SHAP compared to the brute-force approach and the genetic algorithm.

E.2 EXAMPLES OF ATTACKS

In this section, we present 8 specific examples of the attacks that were conducted on COMPAS, Adult, Marketing, and Communities.

Figure 9: Attack of RF fitted on COMPAS. Left: GSV before and after the attack with M = 200; the sensitive attribute is race. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) and the CDFs over the whole data f(D_0), f(D_1).

Figure 10: Attack of XGB fitted on COMPAS. Left: GSV before and after the attack with M = 200; the sensitive attribute is race. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 11: Attack of XGB fitted on Adult. Left: GSV before and after the attack with M = 200; the sensitive attribute is gender. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 12: Attack of RF fitted on Adult. Left: GSV before and after the attack with M = 200; the sensitive attribute is gender. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 13: Attack of RF fitted on Marketing. Left: GSV before and after the attack with M = 200; the sensitive attribute is age. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 14: Attack of XGB fitted on Marketing. Left: GSV before and after the attack with M = 200; the sensitive attribute is age. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.
[Figure: left, bar chart of the Shap values of Pct Kids2Par, Pct Pers Dense Hous, Num Illeg, Pct White>90, pct WInv Inc, Pct Illeg, Pct Pop Under Pov, Male Pct Divorce, Num Under Pov, Pct Fam2Par, Pct Less9th Grade, Total Pct Div, Pct Hous Less3BR, Female Pct Div, Med Own Cost, Pct Inc No Mtg, race Pct Hisp, Pct Larg House Fam, black Per Cap, Pct Teen2Par, Pct Young Kids2Par, Pct Vacant Boarded, and Pct Unemployed (Original vs. Manipulated); right, CDFs over the model output.]
Figure 15: Attack of XGB fitted on Communities. Left: GSVs before and after the attack with M = 200. As a reminder, the sensitive attribute is Pct White>90. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) with the CDFs over the whole data f(D_0), f(D_1).

[Figure: left, bar chart of the Shap values of Pct Kids2Par, Pct Illeg, Pct White>90, Pct Fam2Par, Pct Pers Dense Hous, pct WInv Inc, Female Pct Div, Pct Young Kids2Par, Total Pct Div, Pct Pop Under Pov, Num Under Pov, race Pct Hisp, Pct Teen2Par, Male Pct Divorce, Pct Larg House Fam, Pct Less9th Grade, pct WPub Asst, Num Immig, Pct Pers Own Occup, Pct Hous No Phone, and Hous Vacant (Original vs. Manipulated); right, CDFs over the model output.]
Figure 16: Attack of RF fitted on Communities. Left: GSVs before and after the attack with M = 200. As a reminder, the sensitive attribute is Pct White>90. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) with the CDFs over the whole data f(D_0), f(D_1).

E.3 GENETIC ALGORITHM

This section motivates the use of stealthily biased sampling to perturb Shapley values in place of the method of Baniecki et al. (2021), which fools SHAP by perturbing the background dataset S'_1 via a genetic algorithm. In said genetic algorithm, a population of P fake background datasets {S'_1^(k)}_{k=1}^P evolves iteratively following three biological mechanisms (a schematic sketch of the loop is given below):

- Cross-Over: two parents produce two children by exchanging some of their feature values.
- Mutation: some individuals are perturbed with small Gaussian noise.
- Selection: the individuals S'_1^(k) with the smallest amplitudes |Φ_s(f, S'_0, S'_1^(k))| are selected for the next generation.

Although the use of a genetic algorithm makes the method of Baniecki et al. (2021) very versatile, its main drawback is that nothing constrains the perturbed background to remain similar to the original one. Moreover, the mutation and cross-over operations ignore the correlations between features, so the perturbed dataset can contain unrealistic instances. To highlight these issues, Figure 17 presents the first two principal components of D_1 and S'_1 for the XGB models used in Section 5.4. On COMPAS and Marketing especially, we see that the fake samples S'_1 lie in regions outside of the data manifold. For Adult-Income and Communities, the fake data overlaps more with the original one, but this could be an artifact of visualizing only 2 dimensions. For a more rigorous analysis of the similarity between S'_1 and D_1, we must study the detection rate of the audit detector.
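For concreteness, here is the schematic sketch of the evolutionary loop announced above. It is not the implementation of Baniecki et al. (2021): the function `amplitude`, which stands in for the computation of |Φ_s(f, S'_0, ·)|, as well as the population size, noise scale, and cross-over rule, are all hypothetical placeholders.

```python
import numpy as np

def genetic_attack(S1, amplitude, pop_size=20, n_iters=100,
                   mutation_scale=0.05, rng=None):
    """Schematic genetic perturbation of a background dataset S1
    (shape: n_samples x n_features) to shrink a SHAP amplitude."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Initial population: noisy copies of the original background.
    pop = [S1 + rng.normal(0.0, mutation_scale, S1.shape)
           for _ in range(pop_size)]
    for _ in range(n_iters):
        children = []
        for _ in range(pop_size // 2):
            # Cross-over: two parents swap a random subset of features.
            i, j = rng.choice(pop_size, size=2, replace=False)
            mask = rng.random(S1.shape[1]) < 0.5
            c1, c2 = pop[i].copy(), pop[j].copy()
            c1[:, mask], c2[:, mask] = pop[j][:, mask], pop[i][:, mask]
            children += [c1, c2]
        # Mutation: small Gaussian noise on the children.
        children = [c + rng.normal(0.0, mutation_scale, c.shape)
                    for c in children]
        # Selection: keep the individuals with the smallest amplitudes.
        pop = sorted(pop + children, key=amplitude)[:pop_size]
    return min(pop, key=amplitude)
```

Note that nothing in this loop keeps the population close to the original data, which is precisely the weakness discussed above.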
[Figure: four scatter plots with axes Component 1 and Component 2, for (a) COMPAS, (b) Adult-Income, (c) Marketing, and (d) Communities.]
Figure 17: First two principal components of D_1 (blue) and of the S'_1 (red) returned by the genetic algorithm on XGB models.

To study this detection rate, Figures 18 and 19 present the amplitude reduction and the detection rate after a given number of iterations of the genetic algorithm. These curves show the average and standard deviation across the 5 train/test splits employed in our main experiments. Moreover, the curves were smoothed with moving averages of window 20 to make them more readable. On the Marketing and Communities datasets, we see that for both the XGB and Random Forest models, the detector is quickly able to assert that the data was manipulated. We suspect the genetic algorithm cannot fool the detector on these two datasets because they contain a large number of features (Marketing has 20, Communities has 98). Such a large number of features makes it harder to perturb samples while staying close to the original data manifold, and since the model behavior is unpredictable outside of the data manifold, the genetic algorithm cannot guarantee that the CDF of f(S'_1) will remain close to the CDF of f(D_1). For Adult-Income, the detection rate appears lower, but the largest reduction in amplitude of the sensitive feature was still only about 15%, even after 2.5 hours of run-time.

Contrary to the genetic algorithm, our method Fool SHAP addresses both constraints of making the fake data realistic and keeping it close to the original dataset. Indeed, our objective is tuned to ensure that the Wasserstein distance between the original and perturbed data remains small. Moreover, since we do not generate new samples but rather apply non-uniform weights to pre-existing ones, we do not run the risk of generating unrealistic data.

[Figure: four panels, (a) COMPAS, (b) Adult-Income, (c) Marketing, and (d) Communities, showing the Relative Amplitude Decrease and the Detection Rate as functions of the number of Iterations.]
Figure 18: Iterations of the genetic algorithm applied to 5 XGB models per dataset.

[Figure: same four panels as in Figure 18.]
Figure 19: Iterations of the genetic algorithm applied to 5 RF models per dataset.
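Since our objective explicitly penalizes the Wasserstein distance, checking how far the reweighted background drifts from the original one is straightforward. Below is a minimal sketch of such a check on the model outputs, using scipy's one-dimensional `wasserstein_distance`; the arrays `outputs` and `omega` are hypothetical stand-ins for f(D_1) and the weights returned by our attack.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 1000
outputs = rng.beta(2, 5, size=n)      # stand-in for the outputs f(D_1)
omega = rng.dirichlet(np.ones(n))     # stand-in for the attack weights

# W1 distance between the uniform output distribution and the
# reweighted one: the smaller it is, the harder the attack is to detect.
w1 = wasserstein_distance(outputs, outputs,
                          u_weights=np.full(n, 1.0 / n),
                          v_weights=omega)
print(f"W1 between original and biased output distributions: {w1:.4f}")
```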
E.4 MULTIPLE SENSITIVE ATTRIBUTES

We present preliminary results for settings where one wishes to manipulate the Shapley values of multiple sensitive features s belonging to a set S. For example, in our experiments, we considered gender as the sensitive attribute of the Adult-Income dataset and showed that one can diminish the attribution of this feature. Nonetheless, two other features of Adult-Income share information with gender: relationship and marital-status. Indeed, relationship can take the value wife and marital-status can take the value widowed, which are both proxies of gender=female. For this reason, these two features may also be considered sensitive, and decision-making that relies strongly on them may not be acceptable. Hence, we must derive a method that reduces the total attributions of the features in S = {gender, relationship, marital-status}.

We first let β_s := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ] for every s ∈ S. In our experiments, these signs are typically negative. The proposed approach is to minimize the ℓ1 norm

$$\Big\| \big( \hat{\Phi}_s(f, S_0', S_1') \big)_{s \in \mathcal{S}} \Big\|_1 := \sum_{s \in \mathcal{S}} \big| \hat{\Phi}_s(f, S_0', S_1') \big|, \tag{41}$$

which we interpret as the total amount of disparity that we can attribute to the sensitive attributes. Remember that Φ̂_s(f, S'_0, S'_1) converges in probability to Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) (cf. Proposition 4.1). Therefore, minimizing the ℓ1 norm amounts to minimizing

$$\sum_{s \in \mathcal{S}} \beta_s \sum_{z^{(j)} \in \mathcal{D}_1} \omega_j \, \hat{\Phi}_s(f, S_0', z^{(j)}) = \sum_{z^{(j)} \in \mathcal{D}_1} \omega_j \sum_{s \in \mathcal{S}} \beta_s \, \hat{\Phi}_s(f, S_0', z^{(j)}), \tag{42}$$

which is again a linear function of the weights. We present Algorithm 4 as an overload of Algorithm 1 that now supports multiple sensitive attributes as inputs.

Algorithm 4 Compute non-uniform weights for multiple sensitive attributes s ∈ S
1: procedure COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))})
2:   β_s := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ] for all s ∈ S
3:   B := C(D_1, 1/N_1)    ▷ Unmanipulated background
4:   B'_ω := C(D_1, ω)     ▷ Manipulated background as a function of ω
5:   ω ← argmin_ω Σ_{z^(j) ∈ D_1} ω_j Σ_{s ∈ S} β_s Φ̂_s(f, S'_0, z^(j)) + λ W(B, B'_ω)
6:   return ω

The only difference in the resulting MCF is that we must use the cost a(e) = Σ_{s ∈ S} β_s Φ̂_s(f, S'_0, z^(j)) for the edges (s, ℓ_j) in the graph G of Figure 6. This new algorithm is guaranteed to diminish the ℓ1 norm of the attributions of all sensitive features. However, this does not imply that every sensitive attribute will diminish in amplitude: minimizing a sum of quantities does not guarantee that each individual quantity diminishes. For example, 4 + 7 = 11 is smaller than 6 + 6 = 12 even though the second term increased from 6 to 7. Still, we see reducing the ℓ1 norm as a natural way to hide the total amount of disparity that is attributable to the sensitive features. Another important methodological change concerns the selection of the optimal hyper-parameter λ in Algorithm 3: at line 12, we now use the ℓ1 norm Σ_{s ∈ S} | Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) | as the selection criterion.

Figures 20 and 21 present preliminary results of attacks on three RFs/XGBs fitted on Adult with different train/test splits. We note that, in all cases, the three sensitive features had large negative attributions before the attack. By applying our method, we can considerably reduce the amplitude of two of the sensitive attributes, while the attribution of the remaining sensitive feature stays approximately constant or becomes slightly more negative. We leave large-scale experiments with multiple sensitive features on various datasets as future work.
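To make the linearity argument of Eq. (42) concrete, the following sketch evaluates the objective of Algorithm 4 for a given weight vector. It is only an illustration: `phi` collects hypothetical per-sample attributions Φ̂_s(f, S'_0, z^(j)), and a crude penalty stands in for the Wasserstein term, which our actual method handles through the MCF solver.

```python
import numpy as np

def multi_attribute_cost(phi, omega, lam, wasserstein_penalty):
    """Objective of Algorithm 4 for given non-uniform weights.

    phi   : (N1, |S|) per-sample attributions, one column per
            sensitive feature s in S.
    omega : (N1,) non-negative weights summing to one.
    """
    # Signs beta_s of the unweighted attribution sums.
    beta = np.sign(phi.sum(axis=0))
    # Linear cost of Eq. (42): sum_j omega_j * sum_s beta_s * phi[j, s].
    linear_cost = omega @ (phi * beta).sum(axis=1)
    return linear_cost + lam * wasserstein_penalty(omega)

# Hypothetical usage with 3 sensitive features and negative attributions:
rng = np.random.default_rng(0)
phi = -np.abs(rng.normal(size=(500, 3)))
uniform = np.full(500, 1.0 / 500)
penalty = lambda w: np.abs(w - uniform).sum()  # crude stand-in for W(B, B'_w)
print(multi_attribute_cost(phi, uniform, lam=1e3, wasserstein_penalty=penalty))
```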
[Figure: three rows (one per train/test split seed) of paired plots over λ ranging from 10^2 to 10^4 on a log scale; left column, Detection Rate (%); right column, GSV curves for the sensitive features and the other features.]
Figure 20: Example of log-space search over values of λ using RF classifiers fitted on Adult with three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter λ of the attack. (Right) For each value of λ, the vertical slice of the 11 curves gives the GSVs obtained with the resulting B'_ω. The goal is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.

[Figure: same layout as Figure 20.]
Figure 21: Example of log-space search over values of λ using XGB classifiers fitted on Adult with three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter λ of the attack. (Right) For each value of λ, the vertical slice of the 11 curves gives the GSVs obtained with the resulting B'_ω. The goal is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.
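Finally, recall that the attack does not hand the weights ω to the auditor but a concrete benchmark dataset. A minimal sketch of how such a misleading background of size M could be materialized from the learned weights is given below; plain weighted sampling is used here only as an approximation of the stealthily biased sampling of Fukuchi et al. (2020), which instead rounds the weights into integer sample counts.

```python
import numpy as np

def biased_background(D1, omega, M=200, rng=None):
    """Draw a misleading background S'_1 of size M by sampling rows
    of D1 (shape: N1 x n_features) with the non-uniform weights omega."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(D1), size=M, replace=True, p=omega)
    return D1[idx]

# Hypothetical usage: the resulting S1_fake is handed to the auditor,
# who reproduces the manipulated attributions with the SHAP library.
# S1_fake = biased_background(D1, omega, M=200)
```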