# FOOLING SHAP WITH STEALTHILY BIASED SAMPLING

Published as a conference paper at ICLR 2023

Gabriel Laberge¹, Ulrich Aïvodji², Satoshi Hara³, Mario Marchand⁴, Foutse Khomh¹
¹Polytechnique Montréal, Québec ²École de technologie supérieure, Québec ³Osaka University, Japan ⁴Université Laval, Québec
{gabriel.laberge,foutse.khomh}@polymtl.ca, ulrich.aivodji@etsmtl.ca, satohara@ar.sanken.osaka-u.ac.jp, mario.marchand@ift.ulaval.ca

ABSTRACT

SHAP explanations aim at identifying which features contribute the most to the difference in model prediction at a specific input versus a background distribution. Recent studies have shown that they can be manipulated by malicious adversaries to produce arbitrary desired explanations. However, existing attacks focus solely on altering the black-box model itself. In this paper, we propose a complementary family of attacks that leave the model intact and manipulate SHAP explanations using stealthily biased sampling of the data points used to approximate expectations w.r.t. the background distribution. In the context of fairness audits, we show that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups, while remaining undetected. More precisely, experiments performed on real-world datasets show that our attack can yield up to a 90% relative decrease in the amplitude of the sensitive feature attribution. These results highlight the manipulability of SHAP explanations and encourage auditors to treat them with skepticism.

1 INTRODUCTION

As Machine Learning (ML) becomes increasingly ubiquitous in high-stakes decision contexts (e.g., healthcare, finance, and justice), concerns about its potential to produce discriminatory models are becoming prominent. Auditing toolkits (Adebayo et al., 2016; Saleiro et al., 2018; Bellamy et al., 2018) are increasingly popular as a means to prevent the deployment of unfair models. However, although auditing toolkits can help model designers promote fairness, they can also be manipulated to mislead both end-users and external auditors. For instance, a recent study by Fukuchi et al. (2020) has shown that malicious model designers can produce a benchmark dataset as fake evidence of the fairness of a model even though the model itself is unfair. Another approach to assess the fairness of ML systems is to explain their outcomes in a post-hoc manner (Guidotti et al., 2018). For instance, SHAP (Lundberg & Lee, 2017) has risen in popularity as a means to extract model-agnostic local feature attributions. Feature attributions are meant to convey how much the model relies on certain features to make a decision at a specific input. The use of feature attributions for fairness auditing is desirable in cases where the interest lies in the direct impact of the sensitive attributes on the output of the model. One such situation arises in the context of causal fairness (Chikahara et al., 2021). In some practical cases, the outputs cannot be independent of the sensitive attribute unless much prediction accuracy is sacrificed. For example, any decision based on physical strength is statistically correlated with gender for biological reasons. The problem in such a situation is not the statistical bias itself (such as demographic parity), but whether the decision is based on physical strength or on gender, i.e., the attribution of each feature.
The focus of this study is on manipulating feature attributions so that the dependence on sensitive features is hidden and the audit is misled into believing the model is fair even when it is not. Recently, several studies reported that such a manipulation is possible, e.g., by modifying the black-box model to be explained (Slack et al., 2020; Begley et al., 2020; Dimanov et al., 2020), by manipulating the computation algorithms of feature attributions (Aïvodji et al., 2019), or by poisoning the data distribution (Baniecki et al., 2021; Baniecki & Biecek, 2022). With these findings in mind, the current advice to auditors is not to rely solely on the reported feature attributions for fairness auditing. A question then arises: what evidence can we expect in addition to the feature attributions, and can it constitute valid evidence of fairness? In this study, we show that one can craft fake evidence of fairness for SHAP explanations, which provides the first negative answer to the latter question. In particular, we show that we can produce not only manipulated feature attributions but also a benchmark dataset serving as fake evidence of fairness. The benchmark dataset ensures that external auditors reproduce the reported feature attributions using the existing SHAP library. In our study, we leverage the idea of stealthily biased sampling, introduced by Fukuchi et al. (2020), to cherry-pick which data points to include in the benchmark. Moreover, the use of stealthily biased sampling allows us to keep the manipulation undetected by making the distribution of the benchmark sufficiently close to the true data distribution. Figure 1 illustrates the impact of our attack in an explanation scenario with the Adult Income dataset.

Figure 1: Example of our attack on the Adult Income dataset. After the attack, the feature gender moved from having the most negative attribution to the 6th most negative, hence hiding some of the model bias.

Our contributions can be summarized as follows:

- Theoretically, we formalize a notion of foreground distribution that can be used to extend Local Shapley Values (LSV) to Global Shapley Values (GSV), which can decompose fairness metrics among the features (Section 2.2). Moreover, we formalize the task of manipulating the GSV as a Minimum Cost Flow (MCF) problem (Section 4).
- Experimentally (Section 5), we illustrate the impact of the proposed manipulation attack on a synthetic dataset and four popular datasets, namely Adult Income, COMPAS, Marketing, and Communities. We observe that the proposed attack can reduce the importance of a sensitive feature while keeping the data manipulation undetected by the audit.

Our results indicate that SHAP explanations are not robust and can be manipulated when it comes to explaining the difference in outcomes between groups. Even worse, our results confirm that we can craft a benchmark dataset so that the manipulated feature attributions are reproducible by external audits. Hence, we alert auditors to treat post-hoc explanation methods with skepticism, even when they are accompanied by additional evidence.
2 SHAPLEY VALUES

2.1 LOCAL SHAPLEY VALUES

Shapley values are omnipresent in post-hoc explainability because of their fundamental mathematical properties (Shapley, 1953) and their implementation in the popular SHAP Python library (Lundberg & Lee, 2017). SHAP provides local explanations in the form of feature attributions, i.e., given an input of interest $x$, SHAP returns a score $\phi_i \in \mathbb{R}$ for each feature $i = 1, 2, \ldots, d$. These scores are meant to convey how much the model $f$ relies on feature $i$ to make its decision $f(x)$. Shapley values have a long background in coalitional game theory, where multiple players collaborate toward a common outcome. In the context of explaining model decisions, the players are the input features and the common outcome is the model output $f(x)$. In coalitional games, players (features) are either present or absent. Since one cannot physically remove an input feature once the model has been fitted, SHAP removes features by replacing them with a baseline value $z$. This leads to the Local Shapley Value (LSV) $\phi_i(f, x, z)$, which respects the so-called efficiency axiom (Lundberg & Lee, 2017):

$$\sum_{i=1}^{d} \phi_i(f, x, z) = f(x) - f(z). \tag{1}$$

Simply put, the difference between the model prediction at $x$ and at the baseline $z$ is shared among the different features. Additional details on the computation of LSV are presented in Appendix B.1.

2.2 GLOBAL SHAPLEY VALUES

LSV are local because they explain the prediction at a specific $x$ and rely on a single baseline input $z$. Since model auditing requires a more global analysis of model behavior, we must understand the predictions at multiple inputs $x \sim \mathcal{F}$ sampled from a distribution $\mathcal{F}$ called the foreground. Moreover, because the choice of baseline is somewhat ambiguous, the baselines are sampled $z \sim \mathcal{B}$ from a distribution $\mathcal{B}$ colloquially referred to as the background. Taking inspiration from Begley et al. (2020), we can compute Global Shapley Values (GSV) by averaging LSV over both foreground and background distributions.

Definition 2.1.
$$\Phi_i(f, \mathcal{F}, \mathcal{B}) := \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi_i(f, x, z)\right], \quad i = 1, 2, \ldots, d. \tag{2}$$

Proposition 2.1. The GSV have the following property:
$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)]. \tag{3}$$

2.3 MONTE-CARLO ESTIMATES

In practice, computing expectations w.r.t. the whole background and foreground distributions may be prohibitive, and hence Monte-Carlo estimates are used. For instance, when a dataset is used to represent a background distribution, explainers in the SHAP library such as the Exact Explainer and Tree Explainer will subsample this dataset¹ by selecting 100 instances uniformly at random when the size of the dataset exceeds 100. More formally, let

$$C(S, \omega) := \sum_{x^{(j)} \in S} \omega_j\, \delta(x^{(j)}) \tag{4}$$

represent a categorical distribution over a finite set of input examples $S$, where $\delta(\cdot)$ is the Dirac probability measure, $\omega_j \ge 0\ \forall j$, and $\sum_j \omega_j = 1$. Estimating expectations with Monte-Carlo amounts to sampling $M$ instances

$$S_0 \sim \mathcal{F}^M, \qquad S_1 \sim \mathcal{B}^M, \tag{5}$$

and computing the plug-in estimate

$$\hat{\Phi}(f, S_0, S_1) := \Phi\big(f, C(S_0, 1/M), C(S_1, 1/M)\big) = \frac{1}{M^2} \sum_{x^{(i)} \in S_0} \sum_{z^{(j)} \in S_1} \phi(f, x^{(i)}, z^{(j)}). \tag{6}$$

When a set of samples is a singleton (e.g., $S_1 = \{z^{(j)}\}$), we use the convention $\hat{\Phi}(f, S_0, \{z^{(j)}\}) = \hat{\Phi}(f, S_0, z^{(j)})$ to improve readability. In Appendix B.2, $\hat{\Phi}(f, S_0, S_1)$ is shown to be a consistent and asymptotically normal estimate of $\Phi(f, \mathcal{F}, \mathcal{B})$, meaning that one can compute approximate confidence intervals around $\hat{\Phi}$ to capture $\Phi$ with high probability. In practice, the estimates $\hat{\Phi}$ are employed as the model explanation, which we see as a vulnerability. As discussed in Section 4, this Monte-Carlo estimation is the key ingredient that allows a dishonest entity to manipulate the GSV in its favor.

¹https://github.com/slundberg/shap/blob/0662f4e9e6be38e658120079904899cccda59ff8/shap/maskers/_tabular.py#L54-L55
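To make the plug-in estimate of Equation 6 concrete, the following sketch (our own illustration, not the paper's code base) uses a linear model, for which the interventional LSV has the closed form $\phi_i(f, x, z) = w_i (x_i - z_i)$, and checks Proposition 2.1 numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 100
w = rng.normal(size=d)                 # linear model f(x) = w @ x
f = lambda X: X @ w

# Closed-form interventional LSV of a linear model; Eq. 1 holds exactly.
def lsv(x, z):
    return w * (x - z)

S0 = rng.normal(loc=1.0, size=(M, d))  # foreground samples x ~ F
S1 = rng.normal(loc=0.0, size=(M, d))  # background samples z ~ B

# Plug-in GSV estimate (Eq. 6): average the LSV over all (x, z) pairs.
gsv = sum(lsv(x, z) for x in S0 for z in S1) / M**2

# Proposition 2.1: the GSV sum to the difference of mean outputs.
assert np.isclose(gsv.sum(), f(S0).mean() - f(S1).mean())
```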
3 AUDIT SCENARIO

This section introduces an audit scenario to which the proposed attack on SHAP can apply. The scenario involves two parties: a company and an audit. The company has a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \{0, 1\}$ that contains $N$ input-target tuples, and also has a model $f : \mathcal{X} \to [0, 1]$ that is meant to be deployed in society. The binary feature with index $s$ (i.e., $x_s \in \{0, 1\}$) represents a sensitive feature with respect to which the model should not explicitly discriminate. Both the data $D$ and the model $f$ are highly private, so the company is very careful when providing information about them to the audit. Hence, $f$ is a black box from the point of view of the audit.

Figure 2: Illustrations of the audit scenario. (a) The data initially provided by the company to the audit is $f(D_0)$ and $f(D_1)$, i.e., the model predictions for all instances in the private dataset for different values of $x_s$. This data can later be used by the audit to assess whether or not the subsets $S_0', S_1'$ provided by the company were cherry-picked. (b) Two models $f_1$ and $f_2$ (decision boundaries in dashed lines) with perfect accuracy exhibit a disparity in outcomes w.r.t. the groups with $x_s < 0$ and $x_s > 0$. Here, $\Phi_s(f_1, \mathcal{F}, \mathcal{B}) = 1$ while $\Phi_s(f_2, \mathcal{F}, \mathcal{B}) = 0$. Hence, $f_2$ is only indirectly unfair toward $x_s$, because of correlations in the data.

At first, the audit asks the company for the necessary data to compute fairness metrics, e.g., the Demographic Parity (Dwork et al., 2012), the Predictive Equality (Corbett-Davies et al., 2017), or the Equal Opportunity (Hardt et al., 2016). Note that our attack applies as long as the fairness metric is a difference in model expectations over subgroups. For simplicity, the audit decides to compute the Demographic Parity

$$\mathbb{E}[f(x) \mid x_s = 0] - \mathbb{E}[f(x) \mid x_s = 1], \tag{7}$$

and therefore demands access to the model outputs for all inputs with different values of the sensitive feature: $f(D_0)$ and $f(D_1)$, where $D_0 = \{x^{(i)} : x^{(i)}_s = 0\}$ and $D_1 = \{x^{(i)} : x^{(i)}_s = 1\}$ are subsets of the input data of sizes $N_0$ and $N_1$ respectively. Doing so does not force the company to share values of features other than $x_s$, nor does it require direct access to the inner workings of the proprietary model. Hence, this demand respects privacy requirements and the company accepts to share the model outputs across all instances; see Figure 2a. At this point, the audit confirms that the model is indeed biased in favor of $x_s = 1$ and puts in question the ability of the company to deploy such a model.

Now, the company argues that, although the model exhibits a disparity in outcomes, this does not mean that the model explicitly uses the feature $x_s$ to make its decisions. If such is the case, then the disparity could be explained by other features statistically associated with $x_s$, some of which may be acceptable grounds for decisions. To verify this claim, the audit decides to employ post-hoc techniques to explain the disparity. Since the model is a black box, the audit shall compute the GSV. The foreground $\mathcal{F}$ and background $\mathcal{B}$ are chosen to be the data distributions conditioned on $x_s = 0$ and $x_s = 1$ respectively:

$$\mathcal{F} := C(D_0, 1/N_0), \qquad \mathcal{B} := C(D_1, 1/N_1). \tag{8}$$
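As a minimal illustration of the audit's first request (a hypothetical sketch with our own variable names), the Demographic Parity of Equation 7 can be computed directly from the shared model outputs $f(D_0)$ and $f(D_1)$:

```python
import numpy as np

def demographic_parity(f_D0: np.ndarray, f_D1: np.ndarray) -> float:
    """Equation 7: difference in mean model output between the group
    with x_s = 0 (outputs f_D0) and the group with x_s = 1 (outputs f_D1)."""
    return f_D0.mean() - f_D1.mean()

# Hypothetical model outputs shared by the company (cf. Figure 2a).
rng = np.random.default_rng(0)
f_D0 = rng.beta(2, 5, size=4000)  # predictions for instances with x_s = 0
f_D1 = rng.beta(5, 2, size=6000)  # predictions for instances with x_s = 1
print(f"Demographic parity: {demographic_parity(f_D0, f_D1):+.3f}")
```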
According to Equation 3, the resulting GSV will sum up to the Demographic Parity (cf. Equation 7). If the sensitive feature has a large negative GSV $\Phi_s$, then the model is explicitly relying on $x_s$ to make its decisions and the company would be forbidden from deploying it. If the GSV has a small amplitude, however, the company could still argue in favor of deploying the model in spite of its disparate outcomes. Indeed, the difference in outcomes could then be attributed to more acceptable features. See Figure 2b for a toy example illustrating this reasoning.

To compute the GSV, the audit demands the two datasets of inputs $D_0$ and $D_1$, as well as the ability to query the black box $f$ at arbitrary points. Because of privacy concerns over sharing values of $x$ across the whole dataset, and because the GSV must be estimated with Monte-Carlo, both parties agree that the company shall only provide subsets $S_0 \subset D_0$ and $S_1 \subset D_1$ of size $M$ to the audit, so they can compute a Monte-Carlo estimate $\hat{\Phi}(f, S_0, S_1)$. The company first estimates the GSV on its own by choosing $S_0, S_1$ uniformly at random from $\mathcal{F}$ and $\mathcal{B}$ (cf. Equation 5) and observes that $\hat{\Phi}_s$ indeed has a large negative value. The company realizes it must carefully select which data points will be sent; otherwise, the audit may observe the bias toward $x_s = 1$ and the model will not be deployed. Moreover, the company understands that the audit already has access to the data $f(D_0)$ and $f(D_1)$ representing the model predictions on the whole dataset (see Figure 2a). Therefore, if the company does not share subsets $S_0, S_1$ chosen uniformly at random from $D_0, D_1$, it is possible for the audit to detect the fraud with a statistical test comparing $f(S_0)$ to $f(D_0)$ and $f(S_1)$ to $f(D_1)$. The company needs a method to select misleading subsets $S_0', S_1'$ whose GSV is manipulated in its favor while remaining undetected by the audit. Such a method is the subject of the next section.

4 FOOL SHAP WITH STEALTHILY BIASED SAMPLING

4.1 MANIPULATION

To fool the audit, the company can decide to indeed sub-sample $S_0'$ uniformly at random, $S_0' \sim \mathcal{F}^M$. Then, given this choice of foreground data, it can repeatedly sub-sample $S_1' \sim \mathcal{B}^M$ and choose the set $S_1'$ leading to the smallest $|\hat{\Phi}_s(f, S_0', S_1')|$. We call this method brute-force. Its issue is that, by sub-sampling $S_1'$ from $\mathcal{B}$, it takes an enormous number of repetitions to reduce the attribution, since the estimate $\hat{\Phi}_s(f, S_0', S_1')$ is concentrated around the population GSV $\Phi_s(f, \mathcal{F}, \mathcal{B})$. A more clever method is to re-weight the background distribution before sampling from it, i.e., define $\mathcal{B}'_\omega := C(D_1, \omega)$ with $\omega \neq 1/N_1$ and then sub-sample $S_1' \sim \mathcal{B}'^M_\omega$. To make the model look fairer, the company needs the $\hat{\Phi}_s$ computed with these cherry-picked points to have a small magnitude.

Proposition 4.1. Let $S_0'$ be fixed, and let $\overset{p}{\to}$ represent convergence in probability as the size $M$ of the set $S_1' \sim \mathcal{B}'^M_\omega$ increases; we have

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \sum_{z^{(j)} \in D_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}). \tag{9}$$

We note that the coefficients $\hat{\Phi}_s(f, S_0', z^{(j)})$ in Equation 9 are tractable and can be computed and stored by the company. We discuss in more detail how to compute them in Appendix B.3. An additional requirement is that the non-uniform distribution $\mathcal{B}'_\omega$ remains similar to the original $\mathcal{B}$; otherwise, the fraud could be detected by the audit. In this work, the notion of similarity between distributions is captured by the Wasserstein distance in output space.
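The following sketch (again a stylized linear-model example of our own, not the authors' implementation) illustrates Proposition 4.1: once the per-baseline coefficients are pre-computed, any candidate weighting $\omega$ can be evaluated with a single dot product.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, N1, s = 5, 100, 500, 0                 # feature 0 plays the sensitive role
w = rng.normal(size=d)
lsv_s = lambda x, z: w[s] * (x[s] - z[s])    # closed-form LSV of a linear model

S0_prime = rng.normal(loc=1.0, size=(M, d))  # fixed foreground sample S0'
D1 = rng.normal(loc=0.0, size=(N1, d))       # full background dataset

# Coefficients of Eq. 9: Phi_hat_s(f, S0', z^(j)) for every z^(j) in D1.
coefs = np.array([np.mean([lsv_s(x, z) for x in S0_prime]) for z in D1])

uniform = np.full(N1, 1 / N1)
print("uniform background:", coefs @ uniform)

# Concentrating the weights on baselines with small-magnitude coefficients
# shrinks the limiting attribution of Eq. 9.
small = np.abs(coefs) < np.quantile(np.abs(coefs), 0.5)
omega = small / small.sum()
print("biased background: ", coefs @ omega)
```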
Definition 4.1 (Wasserstein Distance). Any probability measure $\pi$ over $D_1 \times D_1$ is called a coupling measure between $\mathcal{B}$ and $\mathcal{B}'_\omega$, denoted $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$, if $1/N_1 = \sum_j \pi_{ij}$ and $\omega_j = \sum_i \pi_{ij}$. The Wasserstein distance between $\mathcal{B}$ and $\mathcal{B}'_\omega$ mapped to the output space is defined as

$$W(\mathcal{B}, \mathcal{B}'_\omega) = \min_{\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)} \sum_{i,j} |f(z^{(i)}) - f(z^{(j)})|\, \pi_{ij}, \tag{10}$$

a.k.a. the cost of the optimal transport plan that redistributes the mass from one distribution to the other.

We propose Algorithm 1 to compute the weights $\omega$ by minimizing the magnitude of the GSV while maintaining a small Wasserstein distance. The trade-off between attribution manipulation and proximity to the data is tuned via a hyper-parameter $\lambda > 0$. We show in Appendix A.2 that the optimization problem at line 5 of Algorithm 1 can be reformulated as a Minimum Cost Flow (MCF) and hence solved in polynomial time (more precisely $\tilde{O}(N_1^{2.5})$, as in Fukuchi et al. (2020)).

    Algorithm 1 Compute non-uniform weights
    1: procedure COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))}_j, λ)
    2:   β := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ]
    3:   B := C(D_1, 1/N_1)                                ▷ Unmanipulated background
    4:   B'_ω := C(D_1, ω)                                 ▷ Manipulated background as a function of ω
    5:   ω ← argmin_ω  β Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) + λ W(B, B'_ω)   ▷ Optimization problem
    6:   return ω

4.2 DETECTION

We now discuss ways the audit can detect manipulation of the sampling procedure. Recall that the audit has previously been given access to $f(D_0), f(D_1)$, representing the model outputs across all instances in the private dataset. The audit will then be given sub-samples $S_0', S_1'$ of $D_0, D_1$, on which it can compute the output of the model and compare with $f(D_0), f(D_1)$. To assess whether or not the sub-samples provided by the company were sampled uniformly at random, the audit has to conduct statistical tests. The null hypothesis of these tests is that $S_0', S_1'$ were sampled uniformly at random from $D_0, D_1$. The detection procedure, Algorithm 2, with significance $\alpha$, uses both the Kolmogorov–Smirnov and Wald tests with Bonferroni corrections (i.e., the $\alpha/4$ thresholds in the algorithm). The Kolmogorov–Smirnov and Wald tests are discussed in more detail in Appendix C.

    Algorithm 2 Detection with significance α
    1: procedure DETECT_FRAUD(f(D_0), f(D_1), f(S'_0), f(S'_1), α, M)
    2:   for i = 0, 1 do
    3:     f(S_i) ← C(f(D_i), 1/N_i)^M                     ▷ Subsample without cheating
    4:     p-value-KS ← KS( f(S_i), f(S'_i) )              ▷ KS test comparing f(S_i) and f(S'_i)
    5:     p-value-Wald ← Wald( f(S'_i), f(D_i) )          ▷ Wald test
    6:     if p-value-KS < α/4 or p-value-Wald < α/4 then  ▷ Reject the null hypothesis
    7:       return 1
    8:   return 0
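As a rough sketch of one group's half of Algorithm 2 (using our own naming, not the paper's released code), the two tests can be run with `scipy`: `ks_2samp` for the Kolmogorov–Smirnov test and a hand-rolled Wald z-test.

```python
import numpy as np
from scipy import stats

def detect_fraud(f_D: np.ndarray, f_S_prime: np.ndarray,
                 alpha: float = 0.05, seed: int = 0) -> bool:
    """Reject H0 (the subset was sampled uniformly at random) if either
    test falls below the Bonferroni-corrected level alpha / 4."""
    rng = np.random.default_rng(seed)
    M = len(f_S_prime)

    # KS test: compare the company's subset to an honest uniform subsample.
    f_S = rng.choice(f_D, size=M, replace=False)
    p_ks = stats.ks_2samp(f_S, f_S_prime).pvalue

    # Wald test: is the subset mean compatible with the population mean?
    z = (f_S_prime.mean() - f_D.mean()) / (f_D.std() / np.sqrt(M))
    p_wald = 2 * stats.norm.sf(abs(z))

    return (p_ks < alpha / 4) or (p_wald < alpha / 4)
```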
4.3 WHOLE PROCEDURE

The procedure returning the subsets $S_0', S_1'$ is presented in Algorithm 3. It conducts a log-space search between $\lambda_{\min}$ and $\lambda_{\max}$ for the hyper-parameter $\lambda$ (line 6) in order to explore the possible attacks. For each value of $\lambda$, the attacker runs Algorithm 1 to obtain $\mathcal{B}'_\omega$ (line 7), then repeatedly samples $S_1' \sim \mathcal{B}'^M_\omega$ (line 10) and attempts to detect the fraud (line 11). The attacker chooses the $\mathcal{B}'_\omega$ that minimizes the magnitude of $\hat{\Phi}_s$ while having a detection rate below some threshold $\tau$ (line 12). An example of a search over $\lambda$ on a real-world dataset is presented in Figure 3.

    Algorithm 3 Fool SHAP
    1: procedure FOOL_SHAP(f, D_0, D_1, M, λ_min, λ_max, τ, α)
    2:   S'_0 ← C(D_0, 1/N_0)^M                            ▷ S'_0 is sampled without cheating
    3:   Compute Φ̂_s(f, S'_0, z^(j)) ∀ z^(j) ∈ D_1         ▷ cf. Section B.3
    4:   B* ← C(D_1, 1/N_1)
    5:   Φ*_s ← (1/N_1) Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j))      ▷ Initialize the solution
    6:   for λ = λ_max, …, λ_min do
    7:     ω ← COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))}_j, λ)
    8:     Detection ← 0
    9:     for rep = 1, …, 100 do                          ▷ Estimate the detection rate
    10:      S'_1 ← B'^M_ω
    11:      Detection += DETECT_FRAUD(f(D_0), f(D_1), f(S'_0), f(S'_1), α, M)
    12:    if |Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j))| < |Φ*_s| and Detection < 100τ then
    13:      B* ← B'_ω
    14:      Φ*_s ← Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j))      ▷ Update the solution
    15:  S'_1 ← B*^M                                       ▷ Cherry-pick by sampling from the non-uniform background
    16:  return S'_0, S'_1

One limitation of Fool SHAP is that it manipulates a single sensitive feature. In Appendix E.4, we present a possible extension of Algorithm 1 to handle multiple sensitive features, along with preliminary results on its effectiveness. A second limitation is that it only applies to interventional Shapley values, which break feature correlations. This choice was made because most methods in the SHAP library² are interventional. Future work should port Fool SHAP to observational Shapley values, which use conditional expectations to remove features (Frye et al., 2020).

²Except the Tree Explainer when no background data is provided.

Figure 3: Example of a log-space search over values of λ using an XGBoost classifier fitted on Adult. (a) The detection rate as a function of the parameter λ of the attack; the attacker uses a detection-rate threshold τ = 10%. (b) For each value of λ, the vertical slice of the 11 curves gives the GSV obtained with the resulting B'_ω. The goal is to reduce the amplitude of the sensitive feature (red curve).

4.4 CONTRIBUTIONS

The first technique to fool SHAP with perturbations of the background distribution was a genetic algorithm (Baniecki & Biecek, 2022). Although promising, the cross-over and mutation operations it employs to perturb data do not take feature correlations into account and can therefore generate unrealistic data. Moreover, its objective does not enforce similarity between the original and manipulated backgrounds. We show in Appendix E.3 that these limitations lead to systematic fraud detection. Hence, our contributions are two-fold. First, by perturbing the background via non-uniform weights over pre-existing instances (i.e., $\mathcal{B}'_\omega := C(D_1, \omega)$) rather than a genetic algorithm, we avoid the issue of non-realistic data. Second, by considering the Wasserstein distance, we can control the similarity between the original and fake backgrounds. Since the Stealthily Biased Sampling technique introduced by Fukuchi et al. (2020) also leverages a non-uniform distribution over data points and the Wasserstein distance, it makes sense to adapt it to fool SHAP. Still, the approach of Fukuchi et al. differs from ours. In their work, they minimize the Wasserstein distance while enforcing a hard constraint on the number of instances that land in the different bins of the target and sensitive feature. That way, they can set the Demographic Parity to any given value while staying close to the original data. In our setting of manipulating the model explanation, we leave the Demographic Parity intact and instead manipulate its feature attribution. In terms of the optimization objective, we now minimize a Shapley value with a soft constraint on the Wasserstein distance.
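For intuition, the optimization at line 5 of Algorithm 1 can also be written directly as a small linear program: substituting $\omega_j = \sum_i \pi_{ij}$ makes the transport plan $\pi$ the only variable. The sketch below is a hypothetical illustration with `scipy.optimize.linprog`; the paper instead solves the equivalent, more scalable MCF of Appendix A.2.

```python
import numpy as np
from scipy.optimize import linprog

def compute_weights(f_D1: np.ndarray, coefs: np.ndarray, lam: float) -> np.ndarray:
    """Line 5 of Algorithm 1 as an LP over the coupling pi (an N x N matrix).
    f_D1[i] = f(z^(i)); coefs[j] = Phi_hat_s(f, S0', z^(j))."""
    N = len(f_D1)
    beta = np.sign(coefs.sum())
    # Routing mass pi_ij costs lam * |f(z_i) - f(z_j)| + beta * coefs[j],
    # since omega_j = sum_i pi_ij collects one coefs[j] per unit of mass.
    cost = lam * np.abs(f_D1[:, None] - f_D1[None, :]) + beta * coefs[None, :]
    # Row-sum constraints: sum_j pi_ij = 1/N, the uniform mass of B.
    A_eq = np.kron(np.eye(N), np.ones(N))
    b_eq = np.full(N, 1.0 / N)
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    pi = res.x.reshape(N, N)
    return pi.sum(axis=0)  # omega = column sums of the optimal coupling

# Tiny usage example with random outputs and coefficients.
rng = np.random.default_rng(0)
f_D1, coefs = rng.uniform(size=30), rng.normal(scale=0.01, size=30)
omega = compute_weights(f_D1, coefs, lam=1.0)
print(omega.sum(), coefs @ omega)  # weights sum to 1; manipulated attribution
```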
5 EXPERIMENTS

5.1 TOY EXPERIMENT

The task is predicting which individuals will be hired for a job that requires carrying heavy objects. The causal graph for this toy dataset is presented in Figure 4 (left). We observe that sex (S) influences height (H), and that both of these features influence the muscular mass (M). In the end, the hiring decisions (Y) are based only on the two attributes relevant to the job: H and M. Two noise features N1, N2 were also added. More details and justifications for this causal graph are discussed in Appendix D.1. Since strength and height (two important qualifications for applicants) are correlated with sex, any model $f$ that fits the data will exhibit some disparity in hiring rates between sexes. However, if the model decisions do not rely strongly on feature S, the company can argue in favor of deployment. GSV are used by the audit to measure the amount by which the model relies on the sex feature; see Figure 4 (middle). By employing Fool SHAP with $M = 100$, the company can reduce the GSV of feature S considerably compared to the brute-force and genetic algorithms. More importantly, the audit is not able to detect that the provided samples $S_0', S_1'$ were cherry-picked; see Figure 4 (right). More results are presented in Appendix E.1.

Figure 4: Toy example. Left: causal graph. Middle: GSV for the different attacks with M = 100. Right: comparison of the CDFs of the Fool SHAP subsets f(S'_0), f(S'_1) and the CDFs over the whole data f(D_0), f(D_1). Here the audit cannot detect the fraud using its detection algorithm.

5.2 DATASETS

We consider four standard datasets from the FAccT literature, namely COMPAS, Adult-Income, Marketing, and Communities. COMPAS regroups 6,150 records from criminal offenders in Florida collected from 2013–2014. The binary classification task consists in predicting who will re-offend within two years. The sensitive feature $s$ is race, with values $x_s = 0$ for African-American and $x_s = 1$ for Caucasian. Adult Income contains demographic attributes of 48,842 individuals from the 1994 U.S. census. It is a binary classification problem with the goal of predicting whether or not a particular person makes more than 50K USD per year. The sensitive feature $s$ in this dataset is gender, which takes values $x_s = 0$ for female and $x_s = 1$ for male. Marketing involves information on 41,175 customers of a Portuguese bank, and the binary classification task is to predict who will subscribe to a term deposit. The sensitive attribute is age, taking values $x_s = 0$ for ages 30–60 and $x_s = 1$ for ages outside 30–60. Communities & Crime contains per-capita violent crime rates for 1,994 communities in the US. The binary classification task is to predict which communities have crime rates below the median. The sensitive attribute is Percent White, taking values $x_s = 0$ for Percent White < 90% and $x_s = 1$ for Percent White >= 90%.

Three model types were considered: Multi-Layered Perceptrons (MLP), Random Forests (RF), and eXtreme Gradient Boosted trees (XGB). One model of each type was fitted on each dataset for 5 different train/test split seeds, resulting in 60 models in total. Values of the test-set accuracy and demographic parity for each model type and dataset are presented in Appendix D.2.

5.3 DETECTOR CALIBRATION

Table 1: False positive rates (%) of the detector, i.e., the frequency at which $S_0, S_1$ are considered cherry-picked when they are not. No rate should be above 5%.

| Dataset     | MLP | RF  | XGB |
|-------------|-----|-----|-----|
| COMPAS      | 4.0 | 4.6 | 4.0 |
| Adult       | 4.3 | 4.3 | 4.2 |
| Marketing   | n/a | 4.9 | 5.0 |
| Communities | n/a | 3.8 | 4.2 |
Detector calibration refers to the assessment that, assuming the null hypothesis to be true, the probability of rejecting it (i.e., a false positive) is bounded by the significance level $\alpha$. Remember that the null hypothesis of the audit's detector is that the sets $S_0', S_1'$ provided by the company are sampled uniformly from $D_0, D_1$. Hence, to test the detector, the audit can sample its own subsets $f(S_0), f(S_1)$ uniformly at random from $f(D_0), f(D_1)$, run the detection algorithm, and count the number of detections over 1000 repetitions. Table 1 shows the false positive rates over the five train/test splits using a significance level $\alpha = 5\%$. We observe that the false positive rates are indeed bounded by $\alpha$ for all model types and datasets, implying that the detector employed by the audit is calibrated.

5.4 ATTACK RESULTS AND DISCUSSION

Figure 5: Relative decrease in amplitude of the sensitive feature attribution induced by the various attacks (Brute, Genetic, Ours) on SHAP, for the COMPAS, Adult, Marketing, and C&C datasets.

The first step of the attack (line 3 of Algorithm 3) requires that the company run SHAP on its own and compute the coefficients necessary to run Algorithm 1. For the COMPAS and Adult datasets, the Exact Explainer of SHAP was used. Since Marketing and Communities contain more than 15 features, and since the Exact Explainer scales exponentially with the number of features, we were restricted to using the Tree Explainer (Lundberg et al., 2020) on these datasets. The Tree Explainer avoids the exponential cost of Shapley values but is only applicable to tree-based models such as RFs and XGBs. Therefore, we could not conduct the attack on MLPs fitted on Marketing and Communities. The following step is to solve the MCF for various values of $\lambda$ (line 7 of Algorithm 3). As stated previously, solving the MCF can be done in polynomial time in terms of $N_1$, which was tractable for small datasets like COMPAS and Communities, but not for larger datasets like Adult and Marketing. To address this issue, as was done in Fukuchi et al. (2020), we computed the manipulated weights multiple times using 5 bootstrap sub-samples of $D_1$ of size 2000 to obtain a set of weights $\omega^{[1]}, \omega^{[2]}, \ldots, \omega^{[5]}$, which we average to obtain the final weights $\omega$.

Results of 46 attacks with $M = 200$ are shown in Figure 5. Specific examples of the conducted attacks are presented in Appendix E.2. As a point of reference, we also show results for the brute-force and genetic algorithms. To make comparisons to our attack more meaningful, the brute-force method was only allowed to run for the same amount of time it took to search for the non-uniform weights $\omega$ (about 30–180 seconds). Also, the genetic algorithm ran for 400 iterations and was stopped early after 10 consecutive detections. We note that, across all datasets, Fool SHAP leads to greater reductions of the sensitive feature attribution than brute-force search and the genetic perturbations of the background. Focusing on Fool SHAP, for the COMPAS and Marketing datasets we observe median reductions in amplitude of about 90%. This means that our attack can considerably reduce the apparent importance of the sensitive attribute. For the Adult and Communities datasets, the median reduction in amplitude is about 50%, meaning that we typically halve the importance of the sensitive feature.
Still, looking at the maximum reduction in amplitude for Adult-Income and Communities, we note that one attack managed to reduce the amplitude by 90%. Therefore, luck can play a part in the degree of success of Fool SHAP, which is to be expected from data-driven attacks. Finally, the audit was consistently unable to detect the fraud using statistical tests. This observation raises concerns about the risk that SHAP explanations can be attacked to return not only manipulated attributions but also non-detectable fake evidence of fairness.

6 CONCLUSION

To conclude, we proposed a novel attack on Shapley values that does not require modifying the model but rather manipulates the sampling procedure used to estimate expectations w.r.t. the background distribution. We showed on a toy example and four fairness datasets that our attack can reduce the importance of a sensitive feature when explaining the difference in outcomes between groups using SHAP. Crucially, the sampling manipulation is hard to detect by an audit that is given limited access to the data and model. These results raise concerns about the viability of using Shapley values to assess model fairness. We leave as future work the use of Shapley values to decompose other fairness metrics such as predictive equality and equal opportunity. Moreover, we wish to move to use cases beyond fairness, as we believe that the demonstrated vulnerability of Shapley values can apply to many other properties such as safety and security.

7 ETHICS STATEMENT

The main objective of this work is to raise awareness about the risk of manipulation of SHAP explanations and the undetectability of such manipulations. As such, it aims at exposing the potential negative societal impacts of relying on such explanations. It remains possible that malicious model producers could use this attack to mislead end-users or cheat during an audit. However, we believe this paper makes a significant step toward increasing the vigilance of the community and fostering the development of trustworthy explanation methods. Furthermore, by showing how fairness can be manipulated in explanation contexts, this work contributes to the research on the certification of the fairness of automated decision-making systems.

8 REPRODUCIBILITY STATEMENT

The source code of all our experiments is available online³. Moreover, experimental details are provided in Appendix D.2 for the interested reader.

³https://github.com/gablabc/Fool_SHAP

REFERENCES

Julius A Adebayo et al. FairML: Toolbox for diagnosing bias in predictive modeling. Master's thesis, Massachusetts Institute of Technology, 2016.

Ulrich Aïvodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. Fairwashing: the risk of rationalization. In International Conference on Machine Learning, pp. 161–170. PMLR, 2019.

Hubert Baniecki and Przemyslaw Biecek. Manipulating SHAP via adversarial data perturbations (student abstract). 2022.

Hubert Baniecki, Wojciech Kretowicz, and Przemyslaw Biecek. Fooling partial dependence via data poisoning. arXiv preprint arXiv:2105.12837, 2021.

Tom Begley, Tobias Schwedes, Christopher Frye, and Ilya Feige. Explainability for fair machine learning. arXiv preprint arXiv:2010.07389, 2020.

Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943, 2018.
Yoichi Chikahara, Shinsaku Sakaue, Akinori Fujino, and Hisashi Kashima. Learning individually fair classifier with path-specific causal-effect constraint. In International Conference on Artificial Intelligence and Statistics, pp. 145–153. PMLR, 2021.

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller. You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods. In SafeAI@AAAI, 2020.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Christopher Frye, Damien de Mijolla, Tom Begley, Laurence Cowton, Megan Stanley, and Ilya Feige. Shapley explainability on the data manifold. arXiv preprint arXiv:2006.01272, 2020.

Kazuto Fukuchi, Satoshi Hara, and Takanori Maehara. Faking fairness via stealthily biased sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 412–419, 2020.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Ian Janssen, Steven B Heymsfield, ZiMian Wang, and Robert Ross. Skeletal muscle mass and distribution in 468 men and women aged 18–88 yr. Journal of Applied Physiology, 2000.

A J Lee. U-statistics: Theory and Practice. Routledge, 2019.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.

Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1):56–67, 2020.

Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.

Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T Rodolfa, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577, 2018.

Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, pp. 307–317, 1953.

Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180–186, 2020.

Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.

A.1 PROOFS FOR GLOBAL SHAPLEY VALUES (GSV)

Proposition A.1 (Proposition 2.1). The GSV have the following property:

$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)]. \tag{11}$$

Proof.
As a reminder, we have defined the vector

$$\Phi(f, \mathcal{F}, \mathcal{B}) = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi(f, x, z)\right], \tag{12}$$

whose components sum up to

$$\sum_{i=1}^{d} \Phi_i(f, \mathcal{F}, \mathcal{B}) = \sum_{i=1}^{d} \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[\phi_i(f, x, z)\right] = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\Big[\sum_{i=1}^{d}\phi_i(f, x, z)\Big] = \mathbb{E}_{x\sim\mathcal{F},\, z\sim\mathcal{B}}\left[f(x) - f(z)\right] = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{z\sim\mathcal{B}}[f(z)] = \mathbb{E}_{x\sim\mathcal{F}}[f(x)] - \mathbb{E}_{x\sim\mathcal{B}}[f(x)],$$

where the third equality uses the efficiency axiom (Equation 1) and the last step simply renames a dummy variable. ∎

Proposition A.2 (Proposition 4.1). Let $S_0'$ be fixed, and let $\overset{p}{\to}$ represent convergence in probability as the size $M$ of the set $S_1' \sim \mathcal{B}'^M_\omega$ increases; then

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}). \tag{18}$$

Proof. First note that

$$\hat{\Phi}(f, S_0', S_1') = \frac{1}{M^2}\sum_{z^{(j)}\in S_1'}\sum_{x^{(i)}\in S_0'} \phi(f, x^{(i)}, z^{(j)}) = \frac{1}{M}\sum_{z^{(j)}\in S_1'} \hat{\Phi}(f, S_0', z^{(j)}). \tag{19}$$

Since $S_0'$ is assumed fixed, the only random variable in $\hat{\Phi}_s(f, S_0', z^{(j)})$ is $z^{(j)}$, an instance sampled from $\mathcal{B}'_\omega$. Therefore, we can define $\psi(z) := \hat{\Phi}_s(f, S_0', z)$ and we get

$$\hat{\Phi}_s(f, S_0', S_1') = \frac{1}{M}\sum_{z^{(j)}\in S_1'} \psi(z^{(j)}) \quad \text{with } S_1' \sim \mathcal{B}'^M_\omega. \tag{20}$$

By the weak law of large numbers, the following holds as $M$ goes to infinity (Wasserman, 2004, Theorem 5.6):

$$\frac{1}{M}\sum_{z^{(j)}\in S_1'}\psi(z^{(j)}) \overset{p}{\to} \mathbb{E}_{z\sim\mathcal{B}'_\omega}[\psi(z)]. \tag{21}$$

Now, as a reminder, the manipulated background distribution is $\mathcal{B}'_\omega := C(D_1, \omega)$ with $\omega \neq 1/N_1$. Therefore

$$\hat{\Phi}_s(f, S_0', S_1') \overset{p}{\to} \mathbb{E}_{z\sim C(D_1,\omega)}[\psi(z)] = \sum_{j=1}^{N_1}\omega_j\,\psi(z^{(j)}) = \sum_{j=1}^{N_1}\omega_j\,\hat{\Phi}_s(f, S_0', z^{(j)}),$$

concluding the proof. ∎

A.2 PROOFS FOR OPTIMIZATION PROBLEM

A.2.1 TECHNICAL LEMMAS

We provide some technical lemmas that will be essential when proving Theorem A.1. These are presented for completeness and are not intended as contributions by the authors. Let us first write the formal definition of the minimum of a function.

Definition A.1 (Minimum). Given some function $f : \mathcal{D} \to \mathbb{R}$, the minimum of $f$ over $\mathcal{D}$ (denoted $f^\star$) is defined as follows:

$$f^\star = \min_{x\in\mathcal{D}} f(x) \iff \exists x^\star \in \mathcal{D} \text{ s.t. } f^\star = f(x^\star) \le f(x)\ \forall x \in \mathcal{D}.$$

Basically, the notion of minimum coincides with the infimum $\inf f(\mathcal{D})$ (the highest lower bound) when this lower bound is attained for some $x \in \mathcal{D}$. By the Extreme Value Theorem, the minimum always exists when $\mathcal{D}$ is compact and $f$ is continuous. For the rest of this appendix, we shall only study optimization problems where points of the domain set $\mathcal{D} = \{(x, y) : x \in \mathcal{X}, y \in \mathcal{Y}_x \subset \mathcal{Y}\}$ can be selected by the following procedure:

1. Choose some $x \in \mathcal{X}$.
2. Given the selected $x$, choose some $y \in \mathcal{Y}_x \subset \mathcal{Y}$, where the set $\mathcal{Y}_x$ is non-empty and depends on the value of $x$.

When optimizing functions over such domains, one can optimize in two steps, as highlighted in the following lemma.

Lemma A.1. Given a compact domain $\mathcal{D}$ of the form described above and a continuous objective function $f : \mathcal{D} \to \mathbb{R}$, the minimum $f^\star$ is attained for some $(x^\star, y^\star)$ and the following holds:

$$\min_{(x,y)\in\mathcal{D}} f(x, y) = \min_{x\in\mathcal{X}} \min_{y\in\mathcal{Y}_x} f(x, y).$$

Proof. Let $\tilde{f}(x) := \inf_{y\in\mathcal{Y}_x} f(x, y)$, which is a well-defined function on $\mathcal{X}$. We can then take its infimum $f^\star = \inf_{x\in\mathcal{X}} \tilde{f}(x)$. But is $f^\star$ an infimum of $f(\mathcal{D})$? By the definition of infimum, $f^\star \le \tilde{f}(x)\ \forall x \in \mathcal{X}$ and $\inf_{y\in\mathcal{Y}_x} f(x, y) \le f(x, y)\ \forall y \in \mathcal{Y}_x$, so that $f^\star$ is a lower bound of $f(\mathcal{D})$. In fact, it is the highest lower bound possible, so

$$\inf_{(x,y)\in\mathcal{D}} f(x, y) = \inf_{x\in\mathcal{X}} \inf_{y\in\mathcal{Y}_x} f(x, y). \tag{23}$$

By the Extreme Value Theorem, since $\mathcal{D}$ is compact and $f$ is continuous, there exists $(x^\star, y^\star) \in \mathcal{D}$ s.t. $f^\star = \inf_{(x,y)\in\mathcal{D}} f(x, y) = \min_{(x,y)\in\mathcal{D}} f(x, y) = f(x^\star, y^\star)$. Since the infimum is attained on the left-hand side of Equation 23, it must also be attained on the right-hand side, and therefore we can replace all inf with min in Equation 23, leading to the desired result. ∎

Lemma A.2.
Given a compact domain $\mathcal{D}$ of the form described above and two continuous functions $h : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$, then

$$\min_{(x,y)\in\mathcal{D}} \big(h(x) + g(y)\big) = \min_{x\in\mathcal{X}} \Big(h(x) + \min_{y\in\mathcal{Y}_x} g(y)\Big).$$

Proof. Applying Lemma A.1 with the function $f(x, y) := h(x) + g(y)$ proves the lemma. ∎

Figure 6: Graph $G$ on which we solve the MCF. Edges from the source $s$ to a left vertex $\ell_j$ have cost $a(e) = \beta\hat{\Phi}_s(f, S_0', z^{(j)})$, capacity $c(e) = \infty$, and flow $f(e) = \tilde{\omega}_j$; edges from $\ell_j$ to $r_i$ have cost $a(e) = \lambda|f(z^{(i)}) - f(z^{(j)})|$, capacity $c(e) = \infty$, and flow $f(e) = \tilde{\pi}_{i,j}$; edges from $r_i$ to the sink $t$ have cost $a(e) = 0$, capacity $c(e) = 1$, and flow $f(e) = 1$. Note that the total amount of flow is $d = N_1$ and there are $N_1$ left and right vertices $\ell_j, r_i$.

A.2.2 MINIMUM COST FLOWS

Let $G = (V, E)$ be a graph with vertices $v \in V$ and directed edges $e \in E \subset V \times V$, let $c : E \to \mathbb{R}$ be a capacity and $a : E \to \mathbb{R}$ be a cost. Moreover, let $s, t \in V$ be two special vertices called the source and the sink respectively, and let $d \in \mathbb{R}$ be a total flow. The Minimum-Cost Flow (MCF) problem on $G$ consists of finding the flow function $f : E \to \mathbb{R}$ that minimizes the total cost

$$\sum_{e\in E} a(e) f(e) \quad \text{s.t.} \quad 0 \le f(e) \le c(e)\ \forall e \in E, \qquad \sum_{e\in\delta^+(u)} f(e) - \sum_{e\in\delta^-(u)} f(e) = \begin{cases} 0 & u \in V \setminus \{s, t\} \\ d & u = s \\ -d & u = t, \end{cases}$$

where $\delta^+(u) := \{(u, v) \in E\}$ and $\delta^-(u) := \{(v, u) \in E\}$ are the outgoing and incoming edges of $u$. The terminology of flow arises from the constraint that, for vertices that are neither the source nor the sink, the outgoing flow must equal the incoming one, which is reminiscent of conservation laws in fluid dynamics. We shall refer to $f((u, v))$ as the flow from $u$ to $v$.

Now that we have introduced minimum cost flows, let us specify the graph that will be employed to manipulate the GSV; see Figure 6. We label the flow going from the source $s$ to one of the left vertices as $\tilde{\omega}_j = \omega_j \times N_1$, and the flow going from $\ell_j$ to $r_i$ as $\tilde{\pi}_{i,j} = \pi_{i,j} \times N_1$. The required flow is fixed at $d = N_1$.

Theorem A.1. Solving the MCF of Figure 6 leads to a solution of the linear program in Algorithm 1.

Proof. We begin by showing that the flow conservation constraints in the MCF imply that $\pi$ is a coupling measure (i.e., $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$) and that $\omega$ is constrained to the probability simplex $\Delta(N_1)$. Applying the conservation law on the left side of the graph leads to the conclusion that the flows entering the vertices $\ell_j$ must sum up to $N_1$: $\sum_{j=1}^{N_1} \tilde{\omega}_j = N_1$, which implies that $\omega$ must be part of the probability simplex. By conservation, the amount of flow that leaves a specific vertex $\ell_j$ must also be $\tilde{\omega}_j$, hence $\sum_i \tilde{\pi}_{ij} = \tilde{\omega}_j$. For any edge going from $r_i$ to the sink $t$, the flow must be exactly 1: there are $N_1$ edges with capacity $c(e) = 1$ going into the sink, and the sink must receive an incoming flow of $N_1$. As a consequence of the conservation law on a specific vertex $r_i$, the amount of flow that goes into each $r_i$ is also 1: $\sum_j \tilde{\pi}_{ij} = 1$. Putting everything together, from the conservation laws on $G$, we have that $\omega \in \Delta(N_1)$ and $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$.

Now, to make the parallel between the MCF and Algorithm 1, we must use Lemma A.2. Note that $\omega$ is restricted to the probability simplex, while $\pi$ is restricted to be a coupling measure. Importantly, the set of all possible coupling measures $\Pi(\mathcal{B}, \mathcal{B}'_\omega)$ is different for each $\omega$ because $\mathcal{B}'_\omega$ depends on $\omega$. Hence, the domain has the same structure as the ones tackled in Lemma A.2 (where $x \in \mathcal{X}$ becomes $\omega \in \Delta(N_1)$ and $y \in \mathcal{Y}_x$ becomes $\pi \in \Pi(\mathcal{B}, \mathcal{B}'_\omega)$). Also, the set of possible $\omega$ and $\pi$ is a bounded simplex in $\mathbb{R}^{N_1(N_1+1)}$, so it is compact, and the objective function of the MCF is linear, thus continuous. Hence, we can apply Lemma A.2 to the MCF.
Indeed,

$$\min \sum_{e\in E} f(e)\,a(e) = \min_{\tilde{\omega}, \tilde{\pi}} \sum_{j=1}^{N_1} \beta\,\tilde{\omega}_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \tilde{\pi}_{ij}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\tilde{\omega}, \tilde{\pi}} \sum_{j=1}^{N_1} \beta\,\frac{\tilde{\omega}_j}{N_1}\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \frac{\tilde{\pi}_{ij}}{N_1}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\omega\in\Delta(N_1),\,\pi\in\Pi(\mathcal{B},\mathcal{B}'_\omega)} \beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda \sum_{i,j} \pi_{i,j}\, |f(z^{(i)}) - f(z^{(j)})|$$

$$= N_1 \min_{\omega\in\Delta(N_1)} \Big(\beta \sum_{j=1}^{N_1} \omega_j\, \hat{\Phi}_s(f, S_0', z^{(j)}) + \lambda\, W(\mathcal{B}, \mathcal{B}'_\omega)\Big) \quad \text{(cf. Lemma A.2)},$$

which (up to the multiplicative constant $N_1$) is a solution of the linear program of Algorithm 1. ∎

B SHAPLEY VALUES

B.1 LOCAL SHAPLEY VALUES (LSV)

We introduce Local Shapley Values (LSV) more formally. First, as explained earlier, Shapley values are based on coalitional game theory, where the different features work together toward a common outcome $f(x)$. In a game, the features can either be present or absent, which is simulated by replacing some features with a baseline value $z$.

Definition B.1 (The Replace Function). Let $x$ be an input of interest, $S \subseteq \{1, 2, \ldots, d\}$ be a subset of input features that are considered active, and $z$ be a baseline input; then the replace function $r_S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is defined as

$$r_S(z, x)_i = \begin{cases} x_i & \text{if } i \in S \\ z_i & \text{otherwise.} \end{cases} \tag{25}$$

We note that this function is meant to activate the features in $S$. Now, if we let $\pi$ be a random permutation of the $d$ features, and $\pi_{<i}$ denote all features that appear before $i$ in $\pi$, the LSV are computed via

$$\phi_i(f, x, z) := \mathbb{E}_{\pi\sim\Omega}\big[f(r_{\pi_{<i}\cup\{i\}}(z, x)) - f(r_{\pi_{<i}}(z, x))\big], \quad i = 1, 2, \ldots, d, \tag{26}$$

where $\Omega$ is the uniform distribution over the $d!$ permutations. Observe that the computation of LSV scales poorly with the number of features $d$; hence model-agnostic computations are only possible on datasets with few features, such as COMPAS and Adult-Income. For datasets with larger numbers of features, the Tree Explainer algorithm (Lundberg et al., 2020) can be used to compute the LSV (cf. Equation 26) in polynomial time, given that one is explaining a tree-based model.
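In practice, LSV with a chosen baseline can be obtained from the SHAP library directly. The following sketch is our own usage example (with a hypothetical model and data): passing a single-row background to the Exact Explainer makes each returned attribution equal to $\phi_i(f, x, z)$ for that baseline $z$, so the efficiency axiom can be checked numerically.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical model and data standing in for the company's f and D.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)
f = lambda data: model.predict_proba(data)[:, 1]

z = X[:1]                                # a single baseline z
explainer = shap.explainers.Exact(f, z)  # background reduced to {z}
phi = explainer(X[:10]).values           # phi[i, k] = phi_k(f, x^(i), z)

# Efficiency axiom (Eq. 1): the attributions sum to f(x) - f(z).
assert np.allclose(phi.sum(axis=1), f(X[:10]) - f(z))
```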
Such statistics are unbiased and asymptotically normal estimates of Φkpf, F, Bq E x F z B rhpx, zqs. (30) The asymptotic normality of two-samples U-statistics is characterized by the following Theorem (Lee, 2019, Section 3.7.1). Theorem B.1. Let pΦk pΦkpf, S0, S1q be a two-samples U-statistic with |S0| N, |S1| M, moreover let hpx, zq have finite first and second moments, then the following holds for any δ P s0, 1r lim N MÑ8 s.t. N{p N MqÑp Pp0,1q P ˆ |pΦk Φk| ě F 1 Np0,1qp1 δ{2q σ2 10 p σ2 01 1 p where σ2 10 Vx Fr Ez Brhpx, zqs s and σ2 01 Vz Br Ex Frhpx, zqs s. Proposition B.1 follows from this Theorem by choosing N M, p 0.5 and noticing that having a model with bounded outputs (f : X Ñ r0, 1s) implies that |hpx, zq| ď 1 @x, z P X which means that hpx, zq has bounded first and second moments. B.3 COMPUTE THE LSV Running Algorithm 1 requires computing the coefficients pΦspf, S1 0, zpjqq for j 1, 2, . . . , N1. To compute them, first note that they can be written in terms of LSV for all instances in S1 0 pΦspf, S1 0, zpjqq 1 xpiq PS1 0 φspf, xpiq, zpjqq. (31) The LSV φspf, xpiq, zpjqq are computed deeply in the SHAP code and are not directly accessible using the current API. Hence, we had to access them using Monkey-Patching i.e. we modified the Exact Explainer class so that it stores the LSV as one of its attributes. The attribute can then be accessed as seen in Figure 7. The code is provided as a fork the SHAP repository. For the Tree Explainer, because its source code is in C++ and wrapped in Python, we found it simpler to simply rewrite our own version of the algorithm in C++ so that it directly returns the LSV, instead of Monkey-Patching the Tree Explainer. Figure 7: How we extract the LSV from the Exact Explainer via Monkey-Patching. Published as a conference paper at ICLR 2023 C STATISTICAL TESTS C.1 KS TEST A first test that can be conducted is a two-samples Kolmogorov-Smirnov (KS) test (Massey Jr, 1951). If we let p FSpxq 1 z PS 1pz ď xq (32) be the empirical CDF of observations in the set S. Given two sets S and S1, the KS statistic is KSp S, S1q sup x PR | p FSpxq p FS1pxq|. (33) Under the null-hypothesis H0 : S D|S|, S1 D|S1| for some univariate distribution D, this statistic is expected to not be too large with high probability. Hence, when the company provides the subsets S1 0, S1 1, the audit can sample their own two subsets fp S0q, fp S1q uniformly at random from fp D0q, fp D1q and compute the statistics KSpfp S0q, fp S1 0qq and KSpfp S1q, fp S1 1qq to detect a fraud. C.2 WALD TEST An alternative is the Wald test, which is based on the central limit theorem. If S1 BM, then the empirical average of the model output over S1 is asymptotically normally distributed as M increases Waldpfp S1q, fp Bqq : z Pfp S1q z µ M ù Np0, 1q, (34) where µ : Ez fp Bqrzs and σ2 : Vz fp Bqrzs are the expected value and variance of the model output across the whole background. The same reasoning holds for S0 and the foreground F. Applying the Wald test with significance α would detect fraud when | Waldpfp S1 1q, fp Bqq | ą F 1 Np0,1qp1 α{2q, (35) where F 1 Np0,1q is the inverse of the CDF of a standard normal variable. Published as a conference paper at ICLR 2023 D METHODOLOGICAL DETAILS D.1 TOY EXAMPLE The toy dataset was constructed to closely match the results of the following empirical study comparing skeletal mass distributions between men and women (Janssen et al., 2000). Firstly, the sex feature was sampled from a Bernoulli S Bernoullip0.5q. (36) According to Table 1 of Janssen et al. 
D METHODOLOGICAL DETAILS

D.1 TOY EXAMPLE

The toy dataset was constructed to closely match the results of an empirical study comparing skeletal muscle mass distributions between men and women (Janssen et al., 2000). First, the sex feature was sampled from a Bernoulli:

$$S \sim \text{Bernoulli}(0.5). \tag{36}$$

According to Table 1 of Janssen et al. (2000), the average height of women participants was 163 cm, while it was 177 cm for men; both height distributions had the same standard deviation of 7 cm. Hence we sampled height via

$$H \mid S = \text{man} \sim N(177, 49), \qquad H \mid S = \text{woman} \sim N(163, 49). \tag{37}$$

It was noted in Janssen et al. (2000) that there is approximately a linear relationship between height and skeletal muscle mass for both sexes. Therefore, we computed the muscle mass M as

$$M \mid \{H = h, S = \text{man}\} = 0.186\,h + 5\epsilon, \qquad M \mid \{H = h, S = \text{woman}\} = 0.128\,h + 4\epsilon, \qquad \epsilon \sim N(0, 1). \tag{38}$$

The values of the coefficients 0.186 and 0.128 and of the noise levels 5 and 4 were chosen so that the distributions of $M \mid S$ approximately match those of Table 1 in Janssen et al. (2000). Finally, the target was chosen following

$$Y \mid \{H = h, M = m\} \sim \text{Bernoulli}\big(P(h, m)\big) \quad \text{with} \quad P(h, m) = \big(1 + \exp\{100\cdot\mathbb{1}(h < 160) - 0.3(m - 28)\}\big)^{-1}. \tag{39}$$

Simply put, the chances of being hired in the past (Y) were essentially zero for individuals shorter than 160 cm. Moreover, individuals with a higher skeletal muscle mass were given better chances of being admitted; yet, individuals with less muscle mass could still occasionally be given the job. In the end, we generated 6000 samples, leading to the following disparity in Y:

$$P(Y = 1 \mid S = \text{man}) = 0.733, \qquad P(Y = 1 \mid S = \text{woman}) = 0.110. \tag{40}$$
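For concreteness, the generative process of Equations 36–40 can be reproduced in a few lines of numpy (our own sketch; the noise features N1, N2 of Section 5.1 are assumed to be standard normals, since their exact distribution is not specified here).

```python
import numpy as np

def sample_toy_data(n: int = 6000, seed: int = 0):
    rng = np.random.default_rng(seed)
    s = rng.binomial(1, 0.5, size=n)                 # Eq. 36: 1 = man, 0 = woman
    h = np.where(s == 1,
                 rng.normal(177, 7, size=n),         # Eq. 37 (std dev 7 cm)
                 rng.normal(163, 7, size=n))
    eps = rng.normal(size=n)
    m = np.where(s == 1, 0.186 * h + 5 * eps,        # Eq. 38
                         0.128 * h + 4 * eps)
    p = 1 / (1 + np.exp(100 * (h < 160) - 0.3 * (m - 28)))  # Eq. 39
    y = rng.binomial(1, p)
    n1, n2 = rng.normal(size=n), rng.normal(size=n)  # assumed noise features
    return np.column_stack([s, h, m, n1, n2]), y

X, y = sample_toy_data()
# Hiring rates per sex, to be compared with Eq. 40.
print(y[X[:, 0] == 1].mean(), y[X[:, 0] == 0].mean())
```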
Table 2: Model test accuracy % (mean ± stddev).

| Dataset     | MLP        | RF         | XGB        |
|-------------|------------|------------|------------|
| COMPAS      | 68.2 ± 0.9 | 67.7 ± 0.8 | 68.6 ± 0.8 |
| Adult       | 85.6 ± 0.3 | 86.3 ± 0.2 | 87.1 ± 0.1 |
| Marketing   | n/a        | 91.1 ± 0.1 | 91.4 ± 0.3 |
| Communities | n/a        | 83 ± 2     | 82 ± 2     |

Table 3: Model demographic parity (mean ± stddev).

| Dataset     | MLP         | RF            | XGB           |
|-------------|-------------|---------------|---------------|
| COMPAS      | 0.12 ± 0.01 | 0.11 ± 0.01   | 0.11 ± 0.02   |
| Adult       | 0.20 ± 0.01 | 0.19 ± 0.01   | 0.192 ± 0.004 |
| Marketing   | n/a         | 0.104 ± 0.005 | 0.11 ± 0.01   |
| Communities | n/a         | 0.50 ± 0.01   | 0.54 ± 0.02   |

D.2 REAL DATA

The datasets were first divided into train/test subsets with a 4/5 ratio. The models were trained on the training set and evaluated on the test set. All categorical features for COMPAS, Adult, and Marketing were one-hot encoded, which resulted in totals of 11, 40, and 61 columns for each dataset respectively. A simple 50-step random search was conducted to fine-tune the hyper-parameters with cross-validation on the training set. The resulting test-set performance and demographic parities for all models and datasets, aggregated over 5 random data splits, are reported in Tables 2 and 3 respectively.

E ADDITIONAL RESULTS

E.1 TOY EXAMPLE

Figure 8: Additional results on the toy example. (a) Iterations of the genetic algorithm. (b) CDFs for the genetic algorithm. (c) CDFs for brute-force. (d) CDFs for Fool SHAP.

Figure 8 presents additional results for the toy example. More specifically, Figure 8 (a) illustrates the evolution of the detection and of the amplitude of the sensitive feature during the genetic algorithm. We note that beyond 90 iterations, the detector is systematically able to assert that the dataset is manipulated. The smallest amplitude that can be reached via the genetic algorithm without being detected is around 0.05. Figures 8 (b), (c), and (d) show the CDFs of $f(S_1')$ where $S_1'$ is chosen via the genetic algorithm, brute-force, and Fool SHAP respectively. We observe that Fool SHAP is the method whose resulting CDF for $f(S_1')$ is closest to the CDF of $f(D_1)$; this is why the audit is not able to detect the fraud using statistical tests. The fact that Fool SHAP generates fake CDFs that are close to the data CDFs is a consequence of minimizing the Wasserstein distance. These results highlight the superiority of Fool SHAP compared to the brute-force approach and the genetic algorithm.

E.2 EXAMPLES OF ATTACKS

In this section, we present 8 specific examples of the attacks that were conducted on COMPAS, Adult, Marketing, and Communities.

Figure 9: Attack of RF fitted on COMPAS. Left: GSV before and after the attack with M = 200; the sensitive attribute is race. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) and the CDFs over the whole data f(D_0), f(D_1).

Figure 10: Attack of XGB fitted on COMPAS. Left: GSV before and after the attack with M = 200; the sensitive attribute is race. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 11: Attack of XGB fitted on Adult. Left: GSV before and after the attack with M = 200; the sensitive attribute is gender. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 12: Attack of RF fitted on Adult. Left: GSV before and after the attack with M = 200; the sensitive attribute is gender. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 13: Attack of RF fitted on Marketing. Left: GSV before and after the attack with M = 200; the sensitive attribute is age. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.

Figure 14: Attack of XGB fitted on Marketing. Left: GSV before and after the attack with M = 200; the sensitive attribute is age. Right: comparison of the CDFs of the misleading subsets and the CDFs over the whole data.
[Figure: left, bar chart of the Shap values of Pct Kids2Par, Pct Pers Dense Hous, Num Illeg, Pct White>90, pct WInv Inc, Pct Illeg, Pct Pop Under Pov, Male Pct Divorce, Num Under Pov, Pct Fam2Par, Pct Less9th Grade, Total Pct Div, Pct Hous Less3BR, Female Pct Div, Med Own Cost, Pct Inc No Mtg, race Pct Hisp, Pct Larg House Fam, black Per Cap, Pct Teen2Par, Pct Young Kids2Par, Pct Vacant Boarded, and Pct Unemployed (Original vs. Manipulated); right, CDFs over the model output.]
Figure 15: Attack of XGB fitted on Communities. Left: GSVs before and after the attack with M = 200. As a reminder, the sensitive attribute is Pct White>90. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) with the CDFs over the whole data f(D_0), f(D_1).

[Figure: left, bar chart of the Shap values of Pct Kids2Par, Pct Illeg, Pct White>90, Pct Fam2Par, Pct Pers Dense Hous, pct WInv Inc, Female Pct Div, Pct Young Kids2Par, Total Pct Div, Pct Pop Under Pov, Num Under Pov, race Pct Hisp, Pct Teen2Par, Male Pct Divorce, Pct Larg House Fam, Pct Less9th Grade, pct WPub Asst, Num Immig, Pct Pers Own Occup, Pct Hous No Phone, and Hous Vacant (Original vs. Manipulated); right, CDFs over the model output.]
Figure 16: Attack of RF fitted on Communities. Left: GSVs before and after the attack with M = 200. As a reminder, the sensitive attribute is Pct White>90. Right: comparison of the CDFs of the misleading subsets f(S'_0), f(S'_1) with the CDFs over the whole data f(D_0), f(D_1).

E.3 GENETIC ALGORITHM

This section motivates the use of stealthily biased sampling to perturb Shapley values in place of the method of Baniecki et al. (2021), which fools SHAP by perturbing the background dataset S'_1 via a genetic algorithm. In said genetic algorithm, a population of P fake background datasets {S'_1^(k)}_{k=1}^P evolves iteratively following three biological mechanisms (a schematic sketch of the loop is given below):

- Cross-Over: two parents produce two children by exchanging some of their feature values.
- Mutation: some individuals are perturbed with small Gaussian noise.
- Selection: the individuals S'_1^(k) with the smallest amplitudes |Φ_s(f, S'_0, S'_1^(k))| are selected for the next generation.

Although the use of a genetic algorithm makes the method of Baniecki et al. (2021) very versatile, its main drawback is that nothing constrains the perturbed background to remain similar to the original one. Moreover, the mutation and cross-over operations ignore the correlations between features, so the perturbed dataset can contain unrealistic instances. To highlight these issues, Figure 17 presents the first two principal components of D_1 and S'_1 for the XGB models used in Section 5.4. On COMPAS and Marketing especially, we see that the fake samples S'_1 lie in regions outside of the data manifold. For Adult-Income and Communities, the fake data overlaps more with the original one, but this could be an artifact of visualizing only 2 dimensions. For a more rigorous analysis of the similarity between S'_1 and D_1, we must study the detection rate of the audit detector.
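For concreteness, here is the schematic sketch of the evolutionary loop announced above. It is not the implementation of Baniecki et al. (2021): the function `amplitude`, which stands in for the computation of |Φ_s(f, S'_0, ·)|, as well as the population size, noise scale, and cross-over rule, are all hypothetical placeholders.

```python
import numpy as np

def genetic_attack(S1, amplitude, pop_size=20, n_iters=100,
                   mutation_scale=0.05, rng=None):
    """Schematic genetic perturbation of a background dataset S1
    (shape: n_samples x n_features) to shrink a SHAP amplitude."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Initial population: noisy copies of the original background.
    pop = [S1 + rng.normal(0.0, mutation_scale, S1.shape)
           for _ in range(pop_size)]
    for _ in range(n_iters):
        children = []
        for _ in range(pop_size // 2):
            # Cross-over: two parents swap a random subset of features.
            i, j = rng.choice(pop_size, size=2, replace=False)
            mask = rng.random(S1.shape[1]) < 0.5
            c1, c2 = pop[i].copy(), pop[j].copy()
            c1[:, mask], c2[:, mask] = pop[j][:, mask], pop[i][:, mask]
            children += [c1, c2]
        # Mutation: small Gaussian noise on the children.
        children = [c + rng.normal(0.0, mutation_scale, c.shape)
                    for c in children]
        # Selection: keep the individuals with the smallest amplitudes.
        pop = sorted(pop + children, key=amplitude)[:pop_size]
    return min(pop, key=amplitude)
```

Note that nothing in this loop keeps the population close to the original data, which is precisely the weakness discussed above.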
[Figure: four scatter plots with axes Component 1 and Component 2, for (a) COMPAS, (b) Adult-Income, (c) Marketing, and (d) Communities.]
Figure 17: First two principal components of D_1 (blue) and of the S'_1 (red) returned by the genetic algorithm on XGB models.

To study this detection rate, Figures 18 and 19 present the amplitude reduction and the detection rate after a given number of iterations of the genetic algorithm. These curves show the average and standard deviation across the 5 train/test splits employed in our main experiments. Moreover, the curves were smoothed with moving averages of window 20 to make them more readable. On the Marketing and Communities datasets, we see that for both the XGB and Random Forest models, the detector is quickly able to assert that the data was manipulated. We suspect the genetic algorithm cannot fool the detector on these two datasets because they contain a large number of features (Marketing has 20, Communities has 98). Such a large number of features makes it harder to perturb samples while staying close to the original data manifold, and since the model behavior is unpredictable outside of the data manifold, the genetic algorithm cannot guarantee that the CDF of f(S'_1) will remain close to the CDF of f(D_1). For Adult-Income, the detection rate appears lower, but the largest reduction in amplitude of the sensitive feature was still only about 15%, even after 2.5 hours of run-time.

Contrary to the genetic algorithm, our method Fool SHAP addresses both constraints of making the fake data realistic and keeping it close to the original dataset. Indeed, our objective is tuned to ensure that the Wasserstein distance between the original and perturbed data remains small. Moreover, since we do not generate new samples but rather apply non-uniform weights to pre-existing ones, we do not run the risk of generating unrealistic data.

[Figure: four panels, (a) COMPAS, (b) Adult-Income, (c) Marketing, and (d) Communities, showing the Relative Amplitude Decrease and the Detection Rate as functions of the number of Iterations.]
Figure 18: Iterations of the genetic algorithm applied to 5 XGB models per dataset.

[Figure: same four panels as in Figure 18.]
Figure 19: Iterations of the genetic algorithm applied to 5 RF models per dataset.
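Since our objective explicitly penalizes the Wasserstein distance, checking how far the reweighted background drifts from the original one is straightforward. Below is a minimal sketch of such a check on the model outputs, using scipy's one-dimensional `wasserstein_distance`; the arrays `outputs` and `omega` are hypothetical stand-ins for f(D_1) and the weights returned by our attack.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 1000
outputs = rng.beta(2, 5, size=n)      # stand-in for the outputs f(D_1)
omega = rng.dirichlet(np.ones(n))     # stand-in for the attack weights

# W1 distance between the uniform output distribution and the
# reweighted one: the smaller it is, the harder the attack is to detect.
w1 = wasserstein_distance(outputs, outputs,
                          u_weights=np.full(n, 1.0 / n),
                          v_weights=omega)
print(f"W1 between original and biased output distributions: {w1:.4f}")
```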
E.4 MULTIPLE SENSITIVE ATTRIBUTES

We present preliminary results for settings where one wishes to manipulate the Shapley values of multiple sensitive features s belonging to a set S. For example, in our experiments, we considered gender as the sensitive attribute of the Adult-Income dataset and showed that one can diminish the attribution of this feature. Nonetheless, two other features of Adult-Income share information with gender: relationship and marital-status. Indeed, relationship can take the value wife and marital-status can take the value widowed, which are both proxies of gender=female. For this reason, these two features may also be considered sensitive, and decision-making that relies strongly on them may not be acceptable. Hence, we must derive a method that reduces the total attributions of the features in S = {gender, relationship, marital-status}.

We first let β_s := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ] for every s ∈ S. In our experiments, these signs are typically negative. The proposed approach is to minimize the ℓ1 norm

$$\Big\| \big( \hat{\Phi}_s(f, S_0', S_1') \big)_{s \in \mathcal{S}} \Big\|_1 := \sum_{s \in \mathcal{S}} \big| \hat{\Phi}_s(f, S_0', S_1') \big|, \tag{41}$$

which we interpret as the total amount of disparity that we can attribute to the sensitive attributes. Remember that Φ̂_s(f, S'_0, S'_1) converges in probability to Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) (cf. Proposition 4.1). Therefore, minimizing the ℓ1 norm amounts to minimizing

$$\sum_{s \in \mathcal{S}} \beta_s \sum_{z^{(j)} \in \mathcal{D}_1} \omega_j \, \hat{\Phi}_s(f, S_0', z^{(j)}) = \sum_{z^{(j)} \in \mathcal{D}_1} \omega_j \sum_{s \in \mathcal{S}} \beta_s \, \hat{\Phi}_s(f, S_0', z^{(j)}), \tag{42}$$

which is again a linear function of the weights. We present Algorithm 4 as an overload of Algorithm 1 that now supports multiple sensitive attributes as inputs.

Algorithm 4 Compute non-uniform weights for multiple sensitive attributes s ∈ S
1: procedure COMPUTE_WEIGHTS(D_1, {Φ̂_s(f, S'_0, z^(j))})
2:   β_s := sign[ Σ_{z^(j) ∈ D_1} Φ̂_s(f, S'_0, z^(j)) ] for all s ∈ S
3:   B := C(D_1, 1/N_1)    ▷ Unmanipulated background
4:   B'_ω := C(D_1, ω)     ▷ Manipulated background as a function of ω
5:   ω ← argmin_ω Σ_{z^(j) ∈ D_1} ω_j Σ_{s ∈ S} β_s Φ̂_s(f, S'_0, z^(j)) + λ W(B, B'_ω)
6:   return ω

The only difference in the resulting MCF is that we must use the cost a(e) = Σ_{s ∈ S} β_s Φ̂_s(f, S'_0, z^(j)) for the edges (s, ℓ_j) in the graph G of Figure 6. This new algorithm is guaranteed to diminish the ℓ1 norm of the attributions of all sensitive features. However, this does not imply that every sensitive attribute will diminish in amplitude: minimizing a sum of quantities does not guarantee that each individual quantity diminishes. For example, 4 + 7 = 11 is smaller than 6 + 6 = 12 even though the second term increased from 6 to 7. Still, we see reducing the ℓ1 norm as a natural way to hide the total amount of disparity that is attributable to the sensitive features. Another important methodological change concerns the selection of the optimal hyper-parameter λ in Algorithm 3: at line 12, we now use the ℓ1 norm Σ_{s ∈ S} | Σ_{z^(j) ∈ D_1} ω_j Φ̂_s(f, S'_0, z^(j)) | as the selection criterion.

Figures 20 and 21 present preliminary results of attacks on three RFs/XGBs fitted on Adult with different train/test splits. We note that, in all cases, the three sensitive features had large negative attributions before the attack. By applying our method, we can considerably reduce the amplitude of two of the sensitive attributes, while the attribution of the remaining sensitive feature stays approximately constant or becomes slightly more negative. We leave large-scale experiments with multiple sensitive features on various datasets as future work.
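To make the linearity argument of Eq. (42) concrete, the following sketch evaluates the objective of Algorithm 4 for a given weight vector. It is only an illustration: `phi` collects hypothetical per-sample attributions Φ̂_s(f, S'_0, z^(j)), and a crude penalty stands in for the Wasserstein term, which our actual method handles through the MCF solver.

```python
import numpy as np

def multi_attribute_cost(phi, omega, lam, wasserstein_penalty):
    """Objective of Algorithm 4 for given non-uniform weights.

    phi   : (N1, |S|) per-sample attributions, one column per
            sensitive feature s in S.
    omega : (N1,) non-negative weights summing to one.
    """
    # Signs beta_s of the unweighted attribution sums.
    beta = np.sign(phi.sum(axis=0))
    # Linear cost of Eq. (42): sum_j omega_j * sum_s beta_s * phi[j, s].
    linear_cost = omega @ (phi * beta).sum(axis=1)
    return linear_cost + lam * wasserstein_penalty(omega)

# Hypothetical usage with 3 sensitive features and negative attributions:
rng = np.random.default_rng(0)
phi = -np.abs(rng.normal(size=(500, 3)))
uniform = np.full(500, 1.0 / 500)
penalty = lambda w: np.abs(w - uniform).sum()  # crude stand-in for W(B, B'_w)
print(multi_attribute_cost(phi, uniform, lam=1e3, wasserstein_penalty=penalty))
```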
[Figure: three rows (one per train/test split seed) of paired plots over λ ranging from 10^2 to 10^4 on a log scale; left column, Detection Rate (%); right column, GSV curves for the sensitive features and the other features.]
Figure 20: Example of log-space search over values of λ using RF classifiers fitted on Adult with three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter λ of the attack. (Right) For each value of λ, the vertical slice of the 11 curves gives the GSVs obtained with the resulting B'_ω. The goal is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.

[Figure: same layout as Figure 20.]
Figure 21: Example of log-space search over values of λ using XGB classifiers fitted on Adult with three sensitive attributes. Each row is a different train/test split seed. (Left) The detection rate as a function of the parameter λ of the attack. (Right) For each value of λ, the vertical slice of the 11 curves gives the GSVs obtained with the resulting B'_ω. The goal is to reduce the amplitude of all sensitive features (red curves) in order to hide their contribution to the disparity in model outcomes.
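Finally, recall that the attack does not hand the weights ω to the auditor but a concrete benchmark dataset. A minimal sketch of how such a misleading background of size M could be materialized from the learned weights is given below; plain weighted sampling is used here only as an approximation of the stealthily biased sampling of Fukuchi et al. (2020), which instead rounds the weights into integer sample counts.

```python
import numpy as np

def biased_background(D1, omega, M=200, rng=None):
    """Draw a misleading background S'_1 of size M by sampling rows
    of D1 (shape: N1 x n_features) with the non-uniform weights omega."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.choice(len(D1), size=M, replace=True, p=omega)
    return D1[idx]

# Hypothetical usage: the resulting S1_fake is handed to the auditor,
# who reproduces the manipulated attributions with the SHAP library.
# S1_fake = biased_background(D1, omega, M=200)
```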