# Foiling Explanations in Deep Neural Networks

Published in Transactions on Machine Learning Research (08/2023)

**Snir Vitrack Tamam** (snirvt@gmail.com), Department of Computer Science, Ben-Gurion University, Israel
**Raz Lapid** (razla@post.bgu.ac.il), Department of Computer Science, Ben-Gurion University, Israel & DeepKeep, Israel
**Moshe Sipper** (sipper@bgu.ac.il), Department of Computer Science, Ben-Gurion University, Israel

Reviewed on OpenReview: https://openreview.net/forum?id=wvLQMHtyLk

## Abstract

Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance over many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone, sans a reasoning of how said answer was derived, is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image that hardly influence the network's output, we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-and-data XAI-agnostic adversarial attack on XAI algorithms, only requires access to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We compare our method's performance on two benchmark datasets, CIFAR100 and ImageNet, using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. AttaXAI successfully manipulates an image such that several XAI methods output a specific explanation map.
To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory. The code is available at https://github.com/razla/Foiling-Explanations-in-Deep-Neural-Networks.

Keywords: deep learning, computer vision, adversarial attack, evolutionary algorithm, explainable artificial intelligence

*Equal contribution.*

## 1 Introduction

Recent research has revealed that deep learning-based, image-classification systems are vulnerable to adversarial instances, which are designed to deceive algorithms by introducing perturbations to benign images (Carlini & Wagner, 2017; Madry et al., 2017; Xu et al., 2018; Goodfellow et al., 2014; Croce & Hein, 2020). A variety of strategies have been developed to generate adversarial instances, and they fall under two broad categories, differing in the underlying threat model: white-box attacks (Moosavi-Dezfooli et al., 2016; Kurakin et al., 2018) and black-box attacks (Chen et al., 2017; Lapid et al., 2022). In a white-box attack, the attacker has access to the model's parameters, including weights, gradients, etc. In a black-box attack, the attacker has limited information or no information at all; the attacker generates adversarial instances using either a different model, a model's raw output (also called logits), or no model at all, the goal being for the result to transfer to the target model (Tramèr et al., 2017; Inkawhich et al., 2019).

In order to render a model more interpretable, various explainable algorithms have been conceived. Van Lent et al. (2004) coined the term Explainable Artificial Intelligence (XAI), which refers to AI systems that can "explain their behavior either during execution or after the fact."
In-depth research into XAI methods has been sparked by the success of Machine Learning (ML) systems, particularly Deep Learning (DL), in a variety of domains, and by the difficulty in intuitively understanding the outputs of complex models, namely, how a DL model arrived at a specific decision for a given input. Explanation techniques have drawn increased interest in recent years due to their potential to reveal hidden properties of deep neural networks (Došilović et al., 2018). For safety-critical applications, interpretability is essential, and sometimes even legally required.

The importance assigned to each input feature for the overall classification result may be observed through explanation maps, which can be used to offer explanations. Such maps can be used to create defenses and detectors for adversarial attacks (Walia et al., 2022; Fidel et al., 2020; Kao et al., 2022). Figures 1 and 2 show examples of explanation maps, generated by five different methods discussed in Section 2.

In this paper, we show that these explanation maps can be transformed into any target map, using only the maps and the network's output probability vector. This is accomplished by adding a perturbation to the input that is usually unnoticeable to the human eye. This perturbation has minimal effect on the neural network's output; therefore, in addition to the classification outcome, the probability vector of all classes remains virtually identical.

Our contribution. As mentioned earlier, recent studies have shown that adversarial attacks may be used to undermine DNN predictions (Goodfellow et al., 2014; Papernot et al., 2016b; Carlini & Wagner, 2017; Lapid et al., 2022). Several other papers have attacked XAI with success, but to our knowledge they all rely on access to the neural network's gradient, which is usually not available in real-world scenarios (Ghorbani et al., 2019; Dombrowski et al., 2019).
Herein, we propose a black-box algorithm, AttaXAI, which enables manipulation of an image through a barely noticeable perturbation, without the use of any model internals, such that the explanation fits any given target explanation. Further, the robustness of the XAI techniques is tested as well. We study AttaXAI's efficiency on 2 benchmark datasets, 4 different models, and 5 different XAI methods.

The next section presents related work. Section 3 presents AttaXAI, followed by experiments and results in Section 4. Section 5 analyzes our results, Section 6 provides a discussion, and Section 7 offers concluding remarks.

## 2 Related Work

The ability of explanation maps to detect even the smallest visual changes was shown by Ghorbani et al. (2019), who perturbed a given image, causing the explanation map to change, without any specific target.

Figure 1: Explanation maps for 5 images using 5 different explanation methods (columns: Original, LRP, DeepLIFT, Gradient, Grad x Input, G-Backprop). Dataset: ImageNet. Model: VGG16.

Kuppa & Le-Khac (2020) designed a black-box attack to examine the security aspects of the gradient-based XAI approach, including consistency, accuracy, and confidence, using tabular datasets. Zhang et al. (2020) demonstrated a class of white-box attacks that provide adversarial inputs, which deceive both the interpretation models and the deep-learning models. They studied their method using four different explanation algorithms. Xu et al. (2018) demonstrated that a subtle adversarial perturbation intended to mislead classifiers might cause a significant change in a class-specific network interpretability map. The goal of Dombrowski et al. (2019) was to precisely replicate a given target map using gradient descent with respect to the input image.
Although this work showed an intriguing phenomenon, it is a less-realistic scenario, since the attacker has full access to the targeted model. We aimed to veer towards a more-realistic scenario and show that we can achieve similar results using no information about the model besides the probability output vector and the explanation map. To our knowledge, our work is the first to introduce a black-box attack on XAI gradient-based methods in the domain of image classification.

We will employ the following explanation techniques in this paper:

1. Gradient: Utilizing the saliency map, g(x) = ∂f(x)/∂x, one may measure how small perturbations in each pixel alter the prediction of the model, f(x) (Simonyan et al., 2013).

Figure 2: Explanation maps for 5 images using 5 different explanation methods (columns: Original, LRP, DeepLIFT, Gradient, Grad x Input, G-Backprop). Dataset: CIFAR100. Model: VGG16.

2. Gradient x Input: The explanation map is calculated by multiplying the input by the partial derivatives of the output with regard to the input, g(x) = ∂f(x)/∂x ⊙ x (Shrikumar et al., 2016).

3. Guided Backpropagation: A variant of the Gradient explanation, where the gradient's negative components are zeroed while backpropagating through the non-linearities of the model (Springenberg et al., 2014).

4. Layer-wise Relevance Propagation (LRP): With this technique, pixel importance propagates backwards across the network from top to bottom (Bach et al., 2015; Montavon et al., 2019). The general propagation rule is the following:

       R_j = Σ_k [ α_j ρ(w_jk) / ( ϵ + Σ_{0,j} α_j ρ(w_jk) ) ] R_k,    (1)

   where j and k are two neurons of any two consecutive layers, R_k and R_j are the relevance maps of layers k and j, respectively, α_j is the activation of neuron j, ρ is a function that transforms the weights, and ϵ is a small positive increment.
   In order to propagate relevance scores to the input layer (image pixels), the method applies an alternate propagation rule that properly handles pixel values received as input:

       R_i = Σ_j [ ( α_i w_ij − l_i w⁺_ij − h_i w⁻_ij ) / Σ_i ( α_i w_ij − l_i w⁺_ij − h_i w⁻_ij ) ] R_j,    (2)

   where l_i and h_i are the lower and upper bounds of pixel values, and w⁺_ij and w⁻_ij denote the positive and negative parts of the weights, respectively.

5. Deep Learning Important FeaTures (DeepLIFT): compares a neuron's activation to its "reference activation", and then calculates contribution scores based on the difference. DeepLIFT has the potential to separately take into account positive and negative contributions, which might help to identify dependencies that other methods might have overlooked (Shrikumar et al., 2017).

## 3 AttaXAI

This section presents our algorithm, AttaXAI, discussing first evolution strategies, and then delving into the algorithmic details.

### 3.1 Evolution Strategies

Our algorithm is based on Evolution Strategies (ES), a family of heuristic search techniques that draw their inspiration from natural evolution (Beyer & Schwefel, 2002; Hansen et al., 2015). Each iteration (aka generation) involves perturbing (through mutation) a population of vectors (genotypes) and assessing their objective-function value (fitness value). The population of the next generation is created by combining the vectors with the highest fitness values, a process that is repeated until a stopping condition is met. AttaXAI belongs to the class of Natural Evolution Strategies (NES) (Wierstra et al., 2014; Glasmachers et al., 2010), which includes several algorithms in the ES class that differ in the way they represent the population, and in their mutation and recombination operators. With NES, the population is sampled from a distribution π_ψt, which evolves through multiple iterations (generations); we denote the population samples by Z_t.
Through stochastic gradient descent NES attempts to maximize the population's average fitness, E_{Z∼π_ψ}[f(Z)], given a fitness function, f(·). A version of NES we found particularly useful for our case was used to solve common reinforcement learning (RL) problems (Salimans et al., 2017).

We used Evolution Strategies (ES) in our work for the following reasons:

- Global optima search. ES are effective at finding global optima in complex, high-dimensional search spaces, like those we encounter when trying to generate adversarial examples. ES can often bypass local optima that would trap traditional gradient-based methods. This ability is crucial for our work, where we assumed a black-box threat model.
- Parallelizable computation. ES are highly parallelizable, allowing us to simultaneously compute multiple fitness values. This is a significant advantage given our high computational requirements; using ES allows us to significantly reduce run times.
- No need for backpropagation. ES do not require gradient information and hence eschew backpropagation, which is advantageous in scenarios where the models do not have easily computable gradients or are black-box models.

Search gradients. The core idea of NES is to use search gradients to update the parameters of the search distribution (Wierstra et al., 2014). The search gradient can be defined as the gradient of the expected fitness: denoting by π a distribution with parameters ψ, π(z|ψ) is the probability density function of a given sample z. With f(z) denoting the fitness of a sample z, the expected fitness under the search distribution can be written as:

    J(ψ) = E_ψ[f(z)] = ∫ f(z) π(z|ψ) dz.    (3)

The gradient with respect to the distribution parameters can be expressed as:

    ∇_θ J(θ) = ∇_θ ∫ f(z) π(z|θ) dz
             = ∫ f(z) ∇_θ π(z|θ) dz
             = ∫ f(z) [ ∇_θ π(z|θ) / π(z|θ) ] π(z|θ) dz
             = ∫ [ f(z) ∇_θ log π(z|θ) ] π(z|θ) dz
             = E_θ[ f(z) ∇_θ log π(z|θ) ].

From these results we can approximate the gradient with Monte Carlo (Metropolis & Ulam, 1949) samples z_1, z_2, ..., z_λ:

    ∇_θ J(θ) ≈ (1/λ) Σ_{k=1}^{λ} f(z_k) ∇_θ log π(z_k|θ).    (4)

In our experiments we sampled from a Gaussian distribution when calculating the search gradients.

Latin Hypercube Sampling (LHS). LHS is a form of stratified sampling scheme, which improves the coverage of the sampling space. It is done by dividing a given cumulative distribution function into M non-overlapping intervals of equal y-axis length, and randomly choosing one value from each interval to obtain M samples. It ensures that each interval contains the same number of samples, thus producing good uniformity and symmetry (Wang et al., 2022). We used both LHS and standard sampling in our experimental setup.

### 3.2 Algorithm

AttaXAI is an evolutionary algorithm (EA), which explores a space of images for adversarial instances that fool a given explanation method. This space of images is determined by a given input image, a model, and a loss function. The algorithm generates a perturbation for the given input image such that it fools the explanation method. More formally, we consider a neural network, f: ℝ^(h×w×c) → ℝ^K, which classifies a given image, x ∈ ℝ^(h×w×c), where h, w, c are the image's height, width, and channel count, respectively, to one of K predetermined categories, with the predicted class given by k = argmax_i f(x)_i. The explanation map is represented by the function g: ℝ^(h×w×c) → ℝ^(h×w), which links each image to an explanation map of the same height and width, where each coordinate specifies the influence of each pixel on the network's output.
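For the Gaussian search distributions used here, the log-derivatives in Equation 4 have closed forms: ∇_μ log N(z|μ, σ²I) = (z − μ)/σ² and ∇_σ log N(z|μ, σ²I) = ((z − μ)² − σ²)/σ³. A minimal NumPy sketch of the resulting search-gradient step, on a toy 4-pixel problem (all names and hyperparameters here are illustrative, not taken from the paper's code):

```python
import numpy as np

def nes_step(mu, sigma, fitness, lam, lr_mu, lr_sigma, rng):
    """One NES update of the search distribution N(mu, sigma^2 I), using the
    Monte Carlo estimate of Equation 4. `fitness` is a loss, so we *descend*."""
    z = mu + sigma * rng.standard_normal((lam,) + mu.shape)  # population of samples
    f = np.array([fitness(zk) for zk in z])                  # fitness evaluations
    g_mu = (z - mu) / sigma**2                               # grad_mu log N(z | mu, sigma^2)
    g_sigma = ((z - mu) ** 2 - sigma**2) / sigma**3          # grad_sigma log N(z | mu, sigma^2)
    grad_mu = (f[:, None] * g_mu).mean(axis=0)               # Equation 4, for the mean
    grad_sigma = (f[:, None] * g_sigma).mean(axis=0).mean()  # scalar sigma: average over dims
    return mu - lr_mu * grad_mu, sigma - lr_sigma * grad_sigma

# Toy check: drive a 4-pixel "image" toward a target by minimizing squared error.
rng = np.random.default_rng(0)
target = np.array([0.2, 0.8, 0.5, 0.1])
mu, sigma = np.zeros(4), 0.5
for _ in range(300):
    mu, sigma = nes_step(mu, sigma, lambda z: np.sum((z - target) ** 2),
                         lam=50, lr_mu=0.05, lr_sigma=0.001, rng=rng)
```

Note that the update touches only samples of the objective, never its gradient, which is what makes the scheme viable in a black-box setting; σ also shrinks as the search converges, concentrating the population around the optimum.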
AttaXAI explores the space of images through evolution, ultimately producing an adversarial image; it does so by continually updating a Gaussian probability distribution, used to sample the space of images. By continually improving this distribution the search improves. We begin by sampling perturbations from an isotropic normal distribution, N(0, σ²I). Then we add them to the original image, x, and feed them to the model. By doing so, we can approximate the gradient of the expected fitness function. With an approximation of the gradient at hand we can advance in that direction by updating the search distribution parameters. A schematic of our algorithm is shown in Figure 3, with a full pseudocode provided in Algorithm 1.

Figure 3: Schematic of proposed algorithm. Individual perturbations are sampled from the population's distribution N(0, σ²I) and fed into the model (Feature 1 and Feature 2 are image features, e.g., two pixel values; in reality the dimensionality is much higher). Then the fitness function, i.e., the loss, is calculated using the output probability vectors and the explanation maps to approximate the gradient and update the distribution parameters.

### 3.3 Fitness Function

Given an image, x ∈ ℝ^(h×w×c), a specific explanation method, g: ℝ^(h×w×c) → ℝ^(h×w), a target image, x_target, and a target explanation map, g(x_target), we seek an adversarial perturbation, δ ∈ ℝ^(h×w×c), such that the following properties of the adversarial instance, x_adv = x + δ, hold:

1. The network's prediction remains almost constant, i.e., f(x) ≈ f(x_adv).
2. The explanation vector of x_adv is close to the target explanation map, g(x_target), i.e., g(x_adv) ≈ g(x_target).
3. The adversarial instance, x_adv, is close to the original image, x, i.e., x ≈ x_adv.
We achieve such perturbations by optimizing the following fitness function of the evolutionary algorithm:

    L = α ‖g(x_adv) − g(x_target)‖₂² + β ‖f(x_adv) − f(x)‖₂²    (5)

The first term ensures that the altered explanation map, g(x_adv), is close to the target explanation map, g(x_target); the second term pushes the network to produce the same output probability vector. The hyperparameters α, β ∈ ℝ⁺ determine the respective weightings of the fitness components. In order to use our approach, we only need the output probability vector, f(x_adv), and the target explanation map, g(x_target). Unlike white-box methods, we do not presuppose anything about the targeted model, its architecture, dataset, or training process. This makes our approach more realistic. Minimizing the fitness value is the ultimate objective. Essentially, the value is better if the proper class's logit remains the same and the explanation map looks similar to the targeted explanation map:

    argmin_{x_adv} L = α ‖g(x_adv) − g(x_target)‖₂² + β ‖f(x_adv) − f(x)‖₂²    (6)

    Algorithm 1 AttaXAI
    Input:
      x       – original image
      y       – original image's label
      x_expl  – target explanation map
      x_pred  – logit values of x
      model   – model to be used
      G       – maximum number of generations
      λ       – population size
      σ       – initial standard deviation value
      α       – explanation-loss weight
      β       – prediction-loss weight
      η_x̂     – mean learning rate
      η_σ     – standard-deviation learning rate
    Output:
      x̂       – adversarial image

     1: x̂ ← x                                   # main loop
     2: for g = 1, 2, ..., G do
     3:   for k = 1, 2, ..., λ do
     4:     draw sample z_k ∼ N(x̂, σ²I)         # z_k is used to perturb an image
     5:     evaluate fitness f(z_k) = fitness(z_k)
     6:     calculate log-derivative ∇_x̂ log N(z_k | x̂, σ²)
     7:     calculate log-derivative ∇_σ log N(z_k | x̂, σ²)
     8:   ∇_x̂ J = (1/λ) Σ_{k=1}^{λ} f(z_k) ∇_x̂ log N(z_k | x̂, σ²)
     9:   ∇_σ J = (1/λ) Σ_{k=1}^{λ} f(z_k) ∇_σ log N(z_k | x̂, σ²)
    10:   x̂ ← x̂ + η_x̂ ∇_x̂ J
    11:   σ ← σ + η_σ ∇_σ J

    12: function fitness(z):
    13:   z_expl = XAI(z, y)
    14:   z_pred = model(z)
    15:   expl_loss = ‖x_expl − z_expl‖₂²
    16:   pred_loss = ‖x_pred − z_pred‖₂²
    17:   return α · expl_loss + β · pred_loss

## 4 Experiments and Results

Assessing the algorithm over a particular configuration of model, dataset, and explanation technique involves running it over 100 pairs of randomly selected images. We used 2 datasets: CIFAR100 and ImageNet (Deng et al., 2009). For CIFAR100 we used the VGG16 (Simonyan & Zisserman, 2014) and MobileNet (Howard et al., 2017) models, and for ImageNet we used VGG16 and Inception (Szegedy et al., 2015); the models are pretrained. For ImageNet, VGG16 has an accuracy of 73.3% and Inception-v3 has an accuracy of 78.8%. For CIFAR100, VGG16 has an accuracy of 72.9% and MobileNet has an accuracy of 69.0% (these are top-1 accuracy values; for ImageNet, top-5 accuracy values are: VGG16 91.5%, Inception-v3 94.4%; and for CIFAR100: VGG16 91.2%, MobileNet 91.0%). We chose these models because they are commonly used in the Computer Vision community for many downstream tasks (Haque et al., 2019; Bhatia et al., 2019; Ning et al., 2017; Younis et al., 2020; Venkateswarlu et al., 2020).

The experimental setup is summarized in Algorithm 2: Choose 100 random image pairs from the given dataset. For each image pair compute a target explanation map, g(x_target), for one of the two images. With a budget of 50,000 queries to the model, Algorithm 1 perturbs the second image, aiming to replicate the desired g(x_target). We assume the model outputs both the output probability vector and the explanation map per query, which is a realistic scenario nowadays, with XAI algorithms being part of real-world applications (Payrovnaziri et al., 2020; Giuste et al., 2022; Tjoa & Guan, 2020).
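To make Algorithm 1 concrete, the following self-contained toy sketch runs the attack loop end to end. The "model" is a tiny softmax classifier and its "explanation" a plain gradient saliency map; both are stand-ins for the paper's pretrained networks and Captum-generated maps, and every name and hyperparameter here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
W = 0.5 * rng.standard_normal((3, 16))           # toy 3-class linear "model" on 4x4 images

def model(x):
    """Stand-in for the black-box classifier: softmax probability vector f(x)."""
    logits = W @ x.ravel()
    e = np.exp(logits - logits.max())
    return e / e.sum()

def explanation(x, y):
    """Stand-in gradient saliency map g(x): |d log p_y / dx| for this softmax model."""
    p = model(x)
    return np.abs(W[y] - p @ W).reshape(4, 4)

def fitness(z, x_expl, x_pred, y, alpha=100.0, beta=1.0):
    """Equation 5: alpha*||g(z) - g(x_target)||^2 + beta*||f(z) - f(x)||^2."""
    expl_loss = np.sum((explanation(z, y) - x_expl) ** 2)
    pred_loss = np.sum((model(z) - x_pred) ** 2)
    return alpha * expl_loss + beta * pred_loss

x, x_target = rng.uniform(0, 1, (4, 4)), rng.uniform(0, 1, (4, 4))
y = int(np.argmax(model(x)))                     # predicted class of the original image
x_pred, x_expl = model(x), explanation(x_target, y)

mu, sigma, lam = x.copy(), 0.1, 40               # search distribution N(mu, sigma^2 I)
best, best_f = x.copy(), fitness(x, x_expl, x_pred, y)
for _ in range(400):                             # generations; queries = 400 * 40
    eps = rng.standard_normal((lam, 4, 4))
    f_vals = np.array([fitness(mu + sigma * e, x_expl, x_pred, y) for e in eps])
    f_norm = (f_vals - f_vals.mean()) / (f_vals.std() + 1e-12)   # fitness shaping
    grad_mu = (f_norm[:, None, None] * eps).mean(axis=0) / sigma # search gradient wrt mean
    mu = mu - 0.01 * grad_mu                     # descend: fitness is a loss
    cur = fitness(mu, x_expl, x_pred, y)
    if cur < best_f:                             # keep the best candidate seen so far
        best, best_f = mu.copy(), cur

x_adv = best
```

The within-generation fitness normalization ("shaping") is a common NES trick (Salimans et al., 2017) rather than part of Algorithm 1 as printed, which instead adapts σ alongside the mean; either way, the loop needs only fitness values, matching the black-box threat model.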
    Algorithm 2 Experimental setup (per dataset and model)
    Input:
      dataset – dataset to be used
      model   – model to be used
      G       – maximum number of generations
      λ       – population size
      σ       – initial standard deviation value
      α       – explanation-loss weight
      β       – prediction-loss weight
    Output:
      performance scores

    1: for i ← 1 to 100 do
    2:   Randomly choose a pair of images, x and x_target, from dataset
    3:   Generate x_adv by running Algorithm 1 with x and x_target (and all other input parameters)
    4:   Save performance statistics

In order to balance the two contradicting terms in Equation 5, we chose hyperparameters that empirically proved to work, in terms of forcing the optimization process to find a solution that satisfies the two objectives, following the work done in Dombrowski et al. (2019): α = 1e11, β = 1e6 for ImageNet, and α = 1e7, β = 1e6 for CIFAR100. After every generation the learning rate was decreased through multiplication by a factor of 0.999. We tested drawing the population samples both independently and identically distributed (iid) and through Latin hypercube sampling (LHS). The explanations were generated using the Captum repository (Kokhlikyan et al., 2020), a unified and generic model-interpretability library for PyTorch.

Figures 4 through 7 show samples of our results. Specifically, Figure 4 shows AttaXAI-generated attacks for images from ImageNet using the VGG16 model, against each of the 5 explanation methods: LRP, DeepLIFT, Gradient, Gradient x Input, and Guided Backpropagation; Figure 5 shows AttaXAI-generated attacks for images from ImageNet using the Inception model; Figure 6 shows AttaXAI-generated attacks for images from CIFAR100 using the VGG16 model; and Figure 7 shows AttaXAI-generated attacks for images from CIFAR100 using the MobileNet model.
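The Latin hypercube sampling used above as an alternative to iid sampling of the Gaussian population can be sketched as follows, using only the standard library's inverse normal CDF (all names are illustrative):

```python
import numpy as np
from statistics import NormalDist

def lhs_normal(m, dim, rng):
    """Draw m Latin-hypercube samples from N(0, I) in `dim` dimensions:
    split [0, 1] into m equal intervals, draw one uniform per interval and
    dimension, shuffle the interval order independently per dimension, then
    map through the inverse normal CDF."""
    u = (np.arange(m)[:, None] + rng.uniform(size=(m, dim))) / m  # one draw per stratum
    for d in range(dim):
        u[:, d] = u[rng.permutation(m), d]      # decouple strata across dimensions
    u = np.clip(u, 1e-12, 1 - 1e-12)            # guard the CDF endpoints
    return np.vectorize(NormalDist().inv_cdf)(u)

rng = np.random.default_rng(0)
z = lhs_normal(10, 3, rng)                      # 10 stratified samples in 3 dimensions
```

Each of the 10 samples falls in a distinct decile of the normal distribution along every dimension, which is what gives LHS its improved coverage and symmetry over iid draws.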
Note that our primary objective has been achieved: having generated an adversarial image (x_adv), virtually identical to the original (x), the explanation (g) of the adversarial image (x_adv) is now, incorrectly, that of the target image (x_target); essentially, the two rightmost columns of Figures 4-7 are identical. Furthermore, the class prediction remains the same, i.e., argmax_i f(x)_i = argmax_i f(x_adv)_i.

## 5 Conclusions

We examined the multitude of runs in-depth, producing several graphs, which are provided in full in the extensive Appendix. We present the mean of the losses for various query budgets in Table 1. Below, we summarize several observations we made:

- Our algorithm was successful in that f(x_adv) ≈ f(x) for all x_adv generated, and when applying argmax the original label remained unchanged.
- For most hyperparameter values examined, our approach converges for ImageNet using VGG, reaching 1e-10 MSE loss between g(x_adv) and g(x_target), for every XAI method except Guided Backpropagation, which was found to be more robust than the other techniques in this configuration (roughly 10x more robust).

Figure 4: Attacks generated by AttaXAI (columns: x, x_target, x_adv, g(x), g(x_target), g(x_adv)). Dataset: ImageNet. Model: VGG16. Shown for the 5 explanation methods described in the text: LRP, DeepLIFT, Gradient, Gradient x Input, Guided Backpropagation (denoted G-Backprop in the figure). Note that our primary objective has been achieved: having generated an adversarial image (x_adv), virtually identical to the original (x), the explanation (g) of the adversarial image (x_adv) is now, incorrectly, that of the target image (x_target); essentially, the two rightmost columns are identical.

- For all the experiments we witnessed that the Gradient XAI method showed the smallest mean squared error (MSE) between g(x_adv) and g(x_target), i.e., it was the least robust.
The larger the MSE between g(x_adv) and g(x_target), the better the explanation algorithm can handle our perturbed image.

- For VGG16 (Figures 8 and 14), Gradient XAI showed the smallest median MSE between g(x_adv) and g(x_target), while Guided Backpropagation showed the largest. This means that using Gradient XAI's output as an explanation incurs the greatest risk, while using Guided Backpropagation's output as an explanation incurs the smallest risk.
- For Inception (Figure 11), Gradient XAI and Guided Backpropagation exhibited the smallest median MSE, while LRP, Gradient x Input, and DeepLIFT displayed similar results.
- For MobileNet (Figure 17), Gradient XAI exhibited the smallest MSE, while Guided Backpropagation showed the largest MSE, rendering it more robust than the other techniques in this configuration.

Figure 5: Attacks generated by AttaXAI (columns: x, x_target, x_adv, g(x), g(x_target), g(x_adv)). Dataset: ImageNet. Model: Inception.

Table 1: Different evaluations of our algorithm as a function of the number of queries to the model. Best (lowest) per experiment boldfaced.

| Model | Loss | 10k | 20k | 30k | 40k | 50k |
|---|---|---|---|---|---|---|
| VGG16-CIFAR100 | Input | **1.0e-2** | 1.8e-2 | 2.3e-2 | 2.8e-2 | 3.4e-2 |
| | Explanation | 1.1e-6 | 9.9e-7 | 9.1e-7 | 8.6e-7 | **8.2e-7** |
| | Output | 3.6e-4 | 2.0e-4 | 1.1e-4 | 7.3e-5 | **4.1e-5** |
| MobileNet-CIFAR100 | Input | **1.0e-2** | 1.7e-2 | 2.3e-2 | 2.8e-2 | 3.4e-2 |
| | Explanation | 1.4e-6 | 1.2e-6 | 1.1e-6 | 1.1e-6 | **1.0e-6** |
| | Output | 7.4e-5 | 3.1e-5 | 1.9e-5 | 1.4e-5 | **1.3e-5** |
| VGG16-ImageNet | Input | **1.0e-2** | 1.7e-2 | 2.2e-2 | 2.7e-2 | 3.3e-2 |
| | Explanation | 1.2e-9 | 1.0e-9 | 9.0e-10 | 8.3e-10 | **7.9e-10** |
| | Output | 1.0e-5 | 7.5e-6 | 6.2e-6 | **5.4e-6** | 5.6e-6 |
| Inception-v3-ImageNet | Input | **1.0e-2** | 1.7e-2 | 2.2e-2 | 2.7e-2 | 3.3e-2 |
| | Explanation | 7.5e-10 | 6.5e-10 | 6.1e-10 | 5.8e-10 | **5.6e-10** |
| | Output | 1.9e-5 | 1.3e-5 | 1.1e-5 | **9.5e-6** | 1.0e-5 |

- MobileNet is more robust than VGG16 in that it attains higher MSE scores irrespective of the XAI method used. We surmise that this is due to the larger number of parameters in VGG16.
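The robustness comparisons above all reduce to the MSE between the adversarial and target explanation maps; a minimal sketch of that metric (the function name is illustrative):

```python
import numpy as np

def explanation_mse(g_adv, g_target):
    """Mean squared error between two explanation maps of shape (h, w)."""
    g_adv = np.asarray(g_adv, dtype=float)
    g_target = np.asarray(g_target, dtype=float)
    assert g_adv.shape == g_target.shape, "maps must share a resolution"
    return float(np.mean((g_adv - g_target) ** 2))

rng = np.random.default_rng(1)
g_target = rng.uniform(size=(224, 224))          # a target explanation map
noise = 1e-3 * rng.standard_normal((224, 224))   # a nearly perfect attack result
```

A lower score means the attack came closer to the target map, i.e., the attacked XAI method was less robust.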
Figure 6: Attacks generated by AttaXAI (columns: x, x_target, x_adv, g(x), g(x_target), g(x_adv)). Dataset: CIFAR100. Model: VGG16.

The query budget was 50,000 for all experiments. In many runs the distance between the target explanation and the adversarial explanation reached a plateau after roughly 25,000 queries.

## 6 Discussion

Broader impact. Our research aims to expose vulnerabilities of explainable artificial intelligence (XAI) to adversarial attacks, which is of paramount importance for a wide array of domains. For example, in healthcare, XAI is increasingly used in decision-making processes, such as disease diagnosis and treatment recommendations (Zhang et al., 2022; Muneer & Rasool, 2022). If these systems are susceptible to adversarial attacks, the consequences could be potentially life-threatening. In the financial sector, XAI systems are used for tasks such as fraud detection and credit-risk assessment (Cirqueira et al., 2021; Psychoula et al., 2021). Vulnerabilities could lead to substantial financial losses. In cybersecurity, XAI is used to detect anomalies and potential threats, and weaknesses could undermine the entire security infrastructure of an organization (Srivastava et al., 2022; Capuano et al., 2022). Thus, we believe our research has broad implications, as it aims to foster the development of robust, secure XAI systems, which retain reliability under adversarial conditions.

Figure 7: Attacks generated by AttaXAI (columns: x, x_target, x_adv, g(x), g(x_target), g(x_adv)). Dataset: CIFAR100. Model: MobileNet.

Potential real-world value. Our study offers several real-world benefits. First, by identifying the potential weaknesses in XAI systems, we provide crucial insights for organizations to take necessary precautions and develop countermeasures against adversarial attacks.
This can help prevent extensive damage, both in terms of financial losses and trust erosion, which may result from a successful attack. Additionally, our research might prompt organizations to invest more in AI security, thereby promoting the development of more resilient XAI systems. Policymakers can also utilize our research findings to create or amend legislation surrounding AI and cybersecurity. As AI technologies become increasingly integrated into our daily lives, establishing proper regulatory frameworks is crucial to ensure their safe, ethical, and responsible use. Our research provides empirical evidence that can guide these legislative efforts, contributing to a more-secure digital society.

Query cost and adversarial effectiveness. The relationship between query cost and adversarial effectiveness is critical in assessing attacks against deep neural networks, particularly for manipulating XAI maps. This paper's proposed black-box attack leverages a query budget of 50,000, striking a balance between computational demand and attack success. While increasing the query budget may offer more precision, this would be resource-intensive. Conversely, a lower query budget may be less effective but more computationally efficient. The use of 50,000 queries in our attack demonstrates a vulnerability in deep learning models that can be exploited without exorbitant computational costs, revealing potential security issues and improvement areas. Future research could focus on techniques to lower query cost while maintaining adversarial effectiveness, as well as methods to bolster model resilience against such attacks. This study, therefore, offers valuable insights into the trade-off between query cost and adversarial effectiveness.

Future directions. Our research also paves the way for future studies on adversarial attacks.
For one, while our study focuses on certain types of XAI techniques, future research could explore the susceptibility of other XAI techniques to adversarial attacks, broadening our understanding of this critical security issue. Further, we highlight the need for additional research on the development of more-advanced defensive mechanisms against adversarial attacks. As adversaries continually evolve their tactics, it is imperative that our defenses evolve as well. Moreover, our research underscores the importance of considering the socio-ethical implications of adversarial attacks. As AI becomes more pervasive, attacks against these systems have the potential to cause widespread disruption and harm. Thus, future research should also consider the societal and ethical consequences of these attacks, informing policies and practices aimed at mitigating these potential harms.

In summary, the broader impact and potential real-world value of our study are substantial, extending across various domains, ranging from healthcare to finance to cybersecurity. By offering a foundation for future research and by suggesting practical strategies to mitigate adversarial attacks, our study contributes to the ongoing effort to secure AI systems and ensure their responsible use.

## 7 Concluding Remarks

Recently, practitioners have started to use explanation approaches more frequently. We demonstrated how focused, undetectable modifications to the input data can result in arbitrary and significant adjustments to the explanation map. We showed that explanation maps of several known explanation algorithms may be modified at will. Importantly, this is feasible with a black-box approach, while maintaining the output of the model. We tested AttaXAI against the ImageNet and CIFAR100 datasets using 4 different network models. It is obvious that neural networks operate quite differently from humans, capturing fundamentally distinct properties.
In addition, further work is required in the XAI domain to make XAI algorithms more reliable. This work has shown that explanations are easily foiled without any recourse to internal information, raising questions regarding XAI-based defenses and detectors. This work has also investigated the robustness of various XAI methods, revealing that Gradient XAI is the least robust XAI method and Guided Backpropagation is the most robust one.

Future suggestions. In our study we examined how to attack a model's (XAI) explanation for a given input, prediction, and XAI method. Some questions still remain:

- A way to predict whether an XAI attack will be successful, and how many queries will be needed.
- A better metric for a successful XAI attack, since in our results we observed that a smaller L2 distance does not necessarily translate to a more convincing attack.
- A way to eliminate the need for model feedback, i.e., to go fully black-box. Applying XAI attacks via transferability (Papernot et al., 2016a; Xie et al., 2019; Wang et al., 2021) might be a way to move forward.
- When developing new XAI methods, finding ways to render them more robust to adversarial attacks.

## Acknowledgement

We thank the reviewers for a fruitful discussion. This research was partially supported by the Israeli Innovation Authority through the Trust.AI consortium.

## References

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies - a comprehensive introduction. Natural Computing, 1(1):3–52, 2002.

Yajurv Bhatia, Aman Bajpayee, Deepanshu Raghuvanshi, and Himanshu Mittal.
Image captioning using Google's Inception-ResNet-v2 and recurrent neural network. In 2019 Twelfth International Conference on Contemporary Computing (IC3), pp. 1–6. IEEE, 2019.

Nicola Capuano, Giuseppe Fenza, Vincenzo Loia, and Claudio Stanzione. Explainable artificial intelligence in cybersecurity: A survey. IEEE Access, 10:93575–93600, 2022.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26, 2017.

Douglas Cirqueira, Markus Helfert, and Marija Bezbradica. Towards design principles for user-centric explainable AI in fraud detection. In Artificial Intelligence in HCI: Second International Conference, AI-HCI 2021, Held as Part of the 23rd HCI International Conference, HCII 2021, Virtual Event, July 24–29, 2021, Proceedings, pp. 21–40. Springer, 2021.

Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pp. 2206–2216. PMLR, 2020.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. Advances in Neural Information Processing Systems, 32, 2019.

Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić. Explainable artificial intelligence: A survey.
In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 0210–0215. IEEE, 2018.

Gil Fidel, Ron Bitton, and Asaf Shabtai. When explainability meets adversarial learning: Detecting adversarial examples using SHAP signatures. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2020.

Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3681–3688, 2019.

Felipe Giuste, Wenqi Shi, Yuanda Zhu, Tarun Naren, Monica Isgut, Ying Sha, Li Tong, Mitali Gupte, and May D Wang. Explainable artificial intelligence methods in combating pandemics: A systematic review. IEEE Reviews in Biomedical Engineering, 2022.

Tobias Glasmachers, Tom Schaul, Sun Yi, Daan Wierstra, and Jürgen Schmidhuber. Exponential natural evolution strategies. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 393–400, 2010.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Nikolaus Hansen, Dirk V Arnold, and Anne Auger. Evolution strategies. In Springer Handbook of Computational Intelligence, pp. 871–898. Springer, 2015.

Md Foysal Haque, Hye-Youn Lim, and Dae-Seong Kang. Object detection based on VGG with ResNet network. In 2019 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1–3. IEEE, 2019.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Nathan Inkawhich, Wei Wen, Hai Helen Li, and Yiran Chen.
Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7066–7074, 2019.

Ching-Yu Kao, Junhao Chen, Karla Markert, and Konstantin Böttinger. Rectifying adversarial inputs using XAI techniques. In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 573–577. IEEE, 2022.

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. Captum: A unified and generic model interpretability library for PyTorch, 2020.

Aditya Kuppa and Nhien-An Le-Khac. Black box attacks on explainable artificial intelligence (XAI) methods in cyber security. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2020.

Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pp. 99–112. Chapman and Hall/CRC, 2018.

Raz Lapid, Zvika Haramaty, and Moshe Sipper. An evolutionary, gradient-free, query-efficient, black-box algorithm for generating adversarial instances in deep convolutional neural networks. Algorithms, 15(11), 2022.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, Wojciech Samek, and Klaus-Robert Müller. Layer-wise relevance propagation: An overview. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 193–209, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.

Salman Muneer and Malik Atta Rasool. A systematic review: Explainable artificial intelligence (XAI) based disease prediction. International Journal of Advanced Sciences and Computing, 1(1):1–6, 2022.

Chengcheng Ning, Huajun Zhou, Yan Song, and Jinhui Tang. Inception single shot multibox detector for object detection. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 549–554. IEEE, 2017.

Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.

Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE, 2016b.

Seyedeh Neelufar Payrovnaziri, Zhaoyi Chen, Pablo Rengifo-Moreno, Tim Miller, Jiang Bian, Jonathan H Chen, Xiuwen Liu, and Zhe He. Explainable artificial intelligence models using real-world electronic health record data: A systematic scoping review. Journal of the American Medical Informatics Association, 27(7):1173–1185, 2020.

Ismini Psychoula, Andreas Gutmann, Pradip Mainali, Sharon H Lee, Paul Dunphy, and Fabien Petitcolas. Explainable machine learning for fraud detection. Computer, 54(10):49–59, 2021.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713, 2016.
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153. PMLR, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Gautam Srivastava, Rutvij H Jhaveri, Sweta Bhattacharya, Sharnil Pandya, Praveen Kumar Reddy Maddikunta, Gokul Yenduri, Jon G Hall, Mamoun Alazab, Thippa Reddy Gadekallu, et al. XAI for cybersecurity: State of the art, challenges, open issues and future directions. arXiv preprint arXiv:2206.03585, 2022.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE Transactions on Neural Networks and Learning Systems, 32(11):4793–4813, 2020.

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

Michael Van Lent, William Fisher, and Michael Mancuso. An explainable artificial intelligence system for small-unit tactical behavior. In Proceedings of the National Conference on Artificial Intelligence, pp. 900–907. AAAI Press, 2004.
Isunuri B Venkateswarlu, Jagadeesh Kakarla, and Shree Prakash. Face mask detection using MobileNet and global pooling block. In 2020 IEEE 4th Conference on Information & Communication Technology (CICT), pp. 1–5. IEEE, 2020.

Savita Walia, Krishan Kumar, Saurabh Agarwal, and Hyunsung Kim. Using XAI for deep learning-based image manipulation detection with Shapley additive explanation. Symmetry, 14(8):1611, 2022.

Dan Wang, Jiayu Lin, and Yuan-Gen Wang. Query-efficient adversarial attack based on Latin hypercube sampling. arXiv preprint arXiv:2207.02391, 2022.

Xiaosen Wang, Xuanran He, Jingdong Wang, and Kun He. Admix: Enhancing the transferability of adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16158–16167, 2021.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.

Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2730–2739, 2019.

Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Quanfu Fan, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. Structured adversarial attack: Towards general implementation and better interpretability. arXiv preprint arXiv:1808.01664, 2018.

Ayesha Younis, Li Shixin, Shelembi Jn, and Zhang Hai. Real-time object detection using pre-trained deep learning models MobileNet-SSD. In Proceedings of 2020 the 6th International Conference on Computing and Data Engineering, pp. 44–48, 2020.

Xinyang Zhang, Ningfei Wang, Hua Shen, Shouling Ji, Xiapu Luo, and Ting Wang. Interpretable deep learning under fire. In 29th USENIX Security Symposium (USENIX Security 20), 2020.
Yiming Zhang, Ying Weng, and Jonathan Lund. Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics, 12(2):237, 2022.

The following figures provide our full qualitative results. Hyperparameters: n_pop, population size of the evolutionary algorithm; lr, learning rate of the gradient-approximation step for updating the attack; LS, whether Latin hypercube sampling or regular sampling is used.

Figure 8: Similarity in terms of MSE between the target explanation map, g(x_target), and the best final adversarial explanation map, g(x_adv), for 8 different hyperparameter configurations. Dataset: ImageNet. Model: VGG16. The Gradient XAI method is the most susceptible to attacks, while Guided Backpropagation is the hardest to attack; DeepLIFT, LRP, and Gradient x Input are similar.

Figure 9: Similarity in terms of MSE between the input image, x, and the best final adversarial image, x_adv, for 8 different hyperparameter configurations. Dataset: ImageNet. Model: VGG16. A higher learning rate and a smaller population size (i.e., more gradient steps) contribute to the perturbation of the image. The sampling method has no effect.

Figure 10: MSE loss value as a function of evolutionary generation for 8 different hyperparameter configurations. Dataset: ImageNet. Model: VGG16.

Figure 11: Similarity in terms of MSE between the target explanation map, g(x_target), and the best final adversarial explanation map, g(x_adv), for 8 different hyperparameter configurations. Dataset: ImageNet. Model: Inception. The Gradient XAI and Guided Backpropagation methods are the most susceptible to attacks, while DeepLIFT, LRP, and Gradient x Input are less susceptible.

Figure 12: MSE loss value for the input image versus the chosen adversarial image for 8 different hyperparameter configurations. Dataset: ImageNet. Model: Inception. A higher learning rate and a smaller population size (i.e., more gradient steps) contribute to the perturbation of the image. The sampling method has no effect.

Figure 13: MSE loss value as a function of evolutionary generation for 8 different hyperparameter configurations. Dataset: ImageNet. Model: Inception.

Figure 14: Similarity in terms of MSE between the target explanation map, g(x_target), and the best final adversarial explanation map, g(x_adv), for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: VGG16. The Gradient XAI method is the most susceptible to attacks, while Guided Backpropagation and DeepLIFT are the hardest to attack; LRP and Gradient x Input are similar.

Figure 15: MSE loss value for the input image versus the chosen adversarial image for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: VGG16. A higher learning rate and a smaller population size (i.e., more gradient steps) contribute to the perturbation of the image. The sampling method has no effect.

Figure 16: MSE loss value as a function of evolutionary generation for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: VGG16.

Figure 17: Similarity in terms of MSE between the target explanation map, g(x_target), and the best final adversarial explanation map, g(x_adv), for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: MobileNet. The Gradient XAI method is the most susceptible to attacks, while Guided Backpropagation is the hardest to attack; DeepLIFT, LRP, and Gradient x Input are similar.

Figure 18: MSE loss value for the input image versus the chosen adversarial image for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: MobileNet. A higher learning rate and a smaller population size (i.e., more gradient steps) contribute to the perturbation of the image. The sampling method has no effect.

Figure 19: MSE loss value as a function of evolutionary generation for 8 different hyperparameter configurations. Dataset: CIFAR100. Model: MobileNet.
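The MSE similarity reported in these figures can be sketched as follows. This is a minimal illustration; normalizing each explanation map to [0, 1] before comparison is our own assumption for making the distance comparable across XAI methods, not necessarily the paper's exact preprocessing.

```python
import numpy as np

def explanation_mse(map_a, map_b):
    """MSE between two explanation maps, each min-max normalized to [0, 1]
    (normalization is an illustrative assumption, not the paper's exact recipe)."""
    def norm(m):
        m = np.asarray(m, dtype=float)
        span = m.max() - m.min()
        return (m - m.min()) / span if span > 0 else np.zeros_like(m)
    return float(np.mean((norm(map_a) - norm(map_b)) ** 2))
```

Identical maps score 0, and the score grows as the adversarial explanation map drifts away from the target, which is the quantity tracked per generation in Figures 10, 13, 16, and 19.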