# Fairwashing Explanations with Off-Manifold Detergent

Christopher J. Anders¹, Plamen Pasliev¹, Ann-Kathrin Dombrowski¹, Klaus-Robert Müller¹ ² ³, Pan Kessel¹

¹Machine Learning Group, Technische Universität Berlin, Germany. ²Max-Planck-Institut für Informatik, Saarbrücken, Germany. ³Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea. Correspondence to: Pan Kessel, Klaus-Robert Müller.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Explanation methods promise to make black-box classifiers more transparent. As a result, it is hoped that they can act as proof for a sensible, fair and trustworthy decision-making process of the algorithm and thereby increase its acceptance by the end-users. In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. Specifically, we show that, for any classifier $g$, one can always construct another classifier $\tilde{g}$ which has the same behavior on the data (same train, validation, and test error) but has arbitrarily manipulated explanation maps. We derive this statement theoretically using differential geometry and demonstrate it experimentally for various explanation methods, architectures, and datasets. Motivated by our theoretical insights, we then propose a modification of existing explanation methods which makes them significantly more robust.

## 1. Introduction

Explanation methods (see (Samek et al., 2019) and references therein for a detailed overview) are increasingly adopted by machine learning practitioners and incorporated into standard deep learning libraries (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018). The interest in explainability is partly driven by the hope that explanations can act as proof for a sensible, fair, and trustworthy decision-making process (Aïvodji et al., 2019; Lapuschkin et al., 2019). As an example, a bank could provide explanations for its rejection of a loan application. By doing so, the bank can demonstrate that the decision was not based on illegal or ethically questionable features. It can furthermore provide feedback to the customer. In some situations, an explanation of an algorithmic decision may even be required by law.

However, this hope is based on the assumption that explanations faithfully reflect the underlying mechanisms of the algorithmic decision. In this work, we demonstrate unequivocally that this assumption should not be made carelessly because explanations can be easily manipulated. In more detail, we show theoretically that for any classifier $g$, one can always find another classifier $\tilde{g}$ which agrees with the original $g$ on the entire data manifold but has (almost) completely controlled explanations. This surprising result is established using techniques of differential geometry. We then demonstrate experimentally that one can easily construct such manipulated classifiers $\tilde{g}$. In the example above, a bank could use a manipulated classifier $\tilde{g}$ that uses mainly unethical features, such as the gender of the applicant, but has explanations which suggest that the decision was only based on financial features.

Briefly put, the manipulability of explanations arises from the fact that the data manifold is typically low-dimensional compared to its high-dimensional embedding space. The training process only determines the classifier in directions along the manifold.
However, many explanation methods are mainly sensitive to directions orthogonal to the data manifold. Since these directions are undetermined by training, they can be changed at will. This theoretical insight allows us to propose a modification to explanation methods which makes them significantly more robust with respect to such manipulations: the explanation is projected along tangential directions of the data manifold. We show, both theoretically and experimentally, that these tangent-space-projected (tsp) explanations are indeed significantly more robust. We thereby establish a novel and exciting connection between the fields of explainability and manifold learning.

In summary, our main contributions are as follows:

- Using differential geometry, we establish theoretically that popular explanation methods can be easily manipulated.
- We validate our theoretical predictions in detailed experiments for various explanation methods, classifier architectures, and datasets, as well as for different tasks.
- We propose a modification to existing explanation methods which makes them more robust with respect to these manipulations. In doing so, we relate explainability to manifold learning.

### 1.1. Related Works

This work was crucially inspired by (Heo et al., 2019). In this reference, adversarial model manipulation for explanations is proposed. Specifically, the authors empirically show that one can train models such that they have structurally different explanations while suffering only a very mild drop in classification accuracy compared to their unmanipulated counterparts. For example, the adversarial model manipulation can change the positions of the most relevant pixels in each image or increase the overall sum of relevances in a certain subregion of the images. Contrary to their work, we analyze this problem theoretically. Our analysis leads us to demonstrate a stronger form of manipulability: the model can be manipulated such that it structurally reproduces arbitrary target explanations while keeping all class probabilities the same for all data points. Our theoretical insights not only illuminate the underlying reasons for the manipulability but also allow us to develop modifications of existing explanation methods which make them more robust.

Another approach (Kindermans et al., 2019) adds a constant shift to the input image, which is then eliminated by changing the bias of the first layer. For some methods, this leads to a change in the explanation map. Contrary to our approach, this requires a shift in the data. In (Adebayo et al., 2018), explanation maps are changed by randomization of (some of) the network weights. This is different from our method as it dramatically changes the output of the network and is proposed as a consistency check of explanations. In (Dombrowski et al., 2019) and (Ghorbani et al., 2019), it is shown that explanations can be manipulated by an infinitesimal change in the input while the output of the network is approximately unchanged. Contrary to this approach, we manipulate the model and keep the input unchanged.

### 1.2. Explanation Methods

We consider a classifier $g : \mathbb{R}^D \to \mathbb{R}^K$ which classifies an input $x \in \mathbb{R}^D$ into $K$ categories, with the predicted class given by $k = \arg\max_i g(x)_i$. The explanation method is denoted by $h_g : \mathbb{R}^D \to \mathbb{R}^D$ and associates with an input $x$ an explanation map $h_g(x)$ whose components encode the relevance score of each input component for the classifier's prediction.
We note that, by convention, explanation maps are usually calculated with respect to the classifier before applying the final softmax non-linearity (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018). Throughout the paper, we will therefore denote this function as $g$. We use the following explanation methods:

- **Gradient:** The map $h_g(x) = \frac{\partial g}{\partial x}(x)$ is used and quantifies how infinitesimal perturbations in each pixel change the prediction $g(x)$ (Simonyan et al., 2014; Baehrens et al., 2010).
- **$x\odot$Grad:** This method uses the map $h_g(x) = x \odot \frac{\partial g}{\partial x}(x)$ (Shrikumar et al., 2017). For linear models, the exact contribution of each pixel to the prediction is obtained.
- **Integrated Gradients:** This method defines
  $$h_g(x) = (x - \bar{x}) \odot \int_0^1 \frac{\partial g}{\partial x}\big(\bar{x} + t (x - \bar{x})\big)\, \mathrm{d}t \,,$$
  where $\bar{x}$ is a suitable baseline. We refer to the original reference (Sundararajan et al., 2017) for more details.
- **Layer-wise Relevance Propagation (LRP):** This method (Bach et al., 2015; Montavon et al., 2017) propagates relevance backwards through the network. In our experiments, we use the following setup: for the output layer, relevance is given by
  $$R^L_i = \delta_{i,k} = \begin{cases} 1, & \text{for } i = k \\ 0, & \text{for } i \neq k \end{cases}\,,$$
  which is then propagated backwards through all layers but the first using the $z^+$-rule
  $$R^l_i = \sum_j \frac{x^l_i \, (W^l)^+_{ji}}{\sum_{i'} x^l_{i'} \, (W^l)^+_{ji'} + \epsilon} \, R^{l+1}_j \,, \tag{1}$$
  where $(W^l)^+$ denotes the positive weights of the $l$-th layer, $x^l$ is the activation vector of the $l$-th layer, and $\epsilon > 0$ is a small constant ensuring numerical stability. For the first layer, we use the $z^B$-rule to account for the bounded input domain,
  $$R^0_i = \sum_j \frac{x^0_i W^0_{ji} - l_i (W^0)^+_{ji} - h_i (W^0)^-_{ji}}{\sum_{i'} \big( x^0_{i'} W^0_{ji'} - l_{i'} (W^0)^+_{ji'} - h_{i'} (W^0)^-_{ji'} \big)} \, R^1_j \,,$$
  where $l_i$ and $h_i$ are the lower and upper bounds of the input domain respectively, and $(W^0)^-$ denotes the negative weights of the first layer. For theoretical analysis, we consider the $\epsilon$-rule in all layers for simplicity. This rule is obtained by substituting $(W^l)^+ \to W^l$ in (1). We refer to the resulting method as $\epsilon$-LRP.

This choice of methods is necessarily not exhaustive. However, it covers two classes of attribution methods, i.e. propagation-based and gradient-based explanations. Furthermore, the chosen methods are widely used in practice (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018).
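As a concrete illustration of the gradient-based maps above, the following PyTorch snippet computes Gradient, $x\odot$Grad, and a Riemann-sum approximation of Integrated Gradients. It is our own sketch, not the paper's released code: it assumes a module `model` that returns pre-softmax scores for a batch of one input, and all function names are ours.

```python
import torch

def gradient_explanation(model, x, target):
    """Gradient map: derivative of the pre-softmax score g(x)_target w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target]
    grad, = torch.autograd.grad(score, x)
    return grad

def input_times_gradient(model, x, target):
    """x * Gradient map (Shrikumar et al., 2017)."""
    return x * gradient_explanation(model, x, target)

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients (Sundararajan et al., 2017)."""
    if baseline is None:
        baseline = torch.zeros_like(x)   # assumed baseline; the paper leaves it method-specific
    total = torch.zeros_like(x)
    for t in torch.linspace(0.0, 1.0, steps):
        point = baseline + t * (x - baseline)
        total += gradient_explanation(model, point, target)
    return (x - baseline) * total / steps
```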
## 2. Manipulation of Explanations

In this section, we will theoretically deduce that explanation methods can be arbitrarily manipulated by adversarially training a model.

### 2.1. Mathematical Background

In the following, we will briefly summarize the basic tools of differential geometry before applying them in the context of explainability in the next section. For additional technical details, we refer to Appendix A.1.

A $D$-dimensional manifold $M$ is a topological space which locally resembles $\mathbb{R}^D$. More precisely, for each $p \in M$, there exists a subset $U \subset M$ containing $p$ and a diffeomorphism $\phi : U \to \tilde{U} \subset \mathbb{R}^D$. The pair $(U, \phi)$ is called a coordinate chart and the component functions $x^i$ of $\phi(p) = (x^1(p), \ldots, x^D(p))$ are called coordinates. A $d$-dimensional submanifold $S$ is a subset of $M$ which is itself a $d$-dimensional manifold; $M$ is called the embedding manifold of $S$. A properly embedded submanifold $S \subset M$ is a submanifold embedded in $M$ which is also closed as a set.

Let $p \in M$ be a point on a manifold $M$ and $\gamma : \mathbb{R} \to M$ with $\gamma(0) = p$ a curve through the point $p$. The set of tangent vectors $\frac{\mathrm{d}}{\mathrm{d}t}\gamma(t)\big|_{t=0}$ of all curves through $p$ forms a vector space of dimension $D$. This vector space is known as the tangent space $T_p M$. Let $(U, \phi)$ be a coordinate chart on $M$ with coordinates $x$. We can then define
$$\phi \circ \lambda_k(t) = \big(x^1(p), \ldots, x^k(p) + t, \ldots, x^D(p)\big) \quad \text{with } k \in \{1, \ldots, D\}\,.$$
This implicitly defines curves $\lambda_k : \mathbb{R} \to M$ through $p$. We denote the corresponding tangent vectors as $\partial_k := \frac{\mathrm{d}}{\mathrm{d}t}\lambda_k(t)\big|_{t=0}$ and it can be shown that they form a basis of the tangent space $T_p M$.

A vector field $V$ on $M$ associates with every point $x \in M$ an element of the corresponding tangent space, i.e. $V(x) \in T_x M$. (More rigorously, vector fields are defined in terms of the tangent bundle; we refrain from introducing bundles for accessibility.) A conservative vector field $V$ is a vector field that is the gradient of a function $f : M \to \mathbb{R}$, i.e. $V(x) = \nabla f(x)$. For submanifolds $S$, there are two different notions of vector fields. A vector field $V$ on the submanifold $S$ associates to every point on $S$ a vector in its corresponding tangent space $T_x S$, i.e. $V(x) \in T_x S$. A vector field $V$ along the submanifold $S$ associates to every point on $S$ a vector in the corresponding tangent space of the embedding manifold $M$, i.e. $V(x) \in T_x M$. These concepts can be related as follows: the tangent space $T_x M$ can be decomposed into the tangent space $T_x S$ of $S$ and its orthogonal complement $T_x S^\perp$, i.e. $T_x M = T_x S \oplus T_x S^\perp$. A vector field along $S$ which only takes values in the first summand $T_x S$ is also a vector field on $S$.

With these definitions, we can now state a crucial theorem for our theoretical analysis. In Appendix A.1, we show that:

**Theorem 1** Let $S \subset M$ be a $d$-dimensional submanifold properly embedded in the $D$-dimensional manifold $M$. Let $V = \sum_{i=d+1}^{D} v^i \partial_i$ be a conservative vector field along $S$ which assigns a vector in $T_p S^\perp$ to each $p \in S$. For any smooth function $f : S \to \mathbb{R}$, there exists a smooth extension $F : M \to \mathbb{R}$ such that
$$F|_S = f \,,$$
where $F|_S$ denotes the restriction of $F$ to the submanifold $S$. Furthermore, the derivative of the extension $F$ is given by
$$\nabla F(x) = \big(\partial_1 f(x), \ldots, \partial_d f(x), v^{d+1}(x), \ldots, v^{D}(x)\big)$$
for all $x \in S$.

Technical details notwithstanding, this theorem states that a function $f$ defined on a submanifold $S$ can be extended to the entire embedding manifold $M$, and that the extension's derivatives orthogonal to the submanifold $S$ can be freely chosen. This theorem is a generalization of the well-known submanifold extension lemma (see, for example, Lemma 5.34 in (Lee, 2012)) in that it not only shows that an extension exists but also that one has control over the gradient of the extension $F$. While we could not find such a statement in the literature, we suspect that it is entirely obvious to differential geometers but typically not needed for their purposes.
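To make Theorem 1 concrete, consider the following toy instance (our own illustration, not taken from the paper): the data manifold is the horizontal axis in the plane, and the derivative of the extension orthogonal to it can be set to any prescribed function.

```latex
% Toy instance of Theorem 1: M = R^2, S = the x_1-axis.
\begin{align*}
  &M = \mathbb{R}^2, \qquad S = \{x \in \mathbb{R}^2 : x_2 = 0\}, \qquad f : S \to \mathbb{R} \text{ smooth.}\\
  &\text{For any smooth } v : S \to \mathbb{R} \text{, define the extension } F(x_1, x_2) = f(x_1) + v(x_1)\, x_2.\\
  &\text{Then } F|_S = f \quad \text{and} \quad \nabla F(x_1, 0) = \big(\partial_1 f(x_1),\; v(x_1)\big),\\
  &\text{i.e. } F \text{ agrees with } f \text{ on } S \text{ while its derivative orthogonal to } S \text{ is freely chosen.}
\end{align*}
```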
### 2.2. Explanation Manipulation: Theory

From Theorem 1, it follows under a mild assumption that one can always construct a model $\tilde{g}$ such that it closely reproduces arbitrary target explanations but has the same training, validation, and test loss as the original model $g$.

Assumption: the data lies on a $d$-dimensional submanifold $S$ properly embedded in the manifold $M = \mathbb{R}^D$. The data manifold $S$ is of much lower dimensionality than its embedding space $M$, i.e. $d \ll D$. We stress that this assumption is also known as the manifold conjecture and is expected to hold across a wide range of machine learning tasks. We refer to (Goodfellow et al., 2016) for a detailed discussion.

Under this assumption, the following theorem can be derived for the Gradient, $x\odot$Grad, and $\epsilon$-LRP methods (only the proof for the Gradient method is given here; see Appendix A.2 for the other methods):

**Theorem 2** Let $h_g : \mathbb{R}^D \to \mathbb{R}^D$ be the explanation of a classifier $g : \mathbb{R}^D \to \mathbb{R}$ with bounded derivatives $|\partial_i g(x)| \le C \in \mathbb{R}^+$ for $i = 1, \ldots, D$. For a given target explanation $h^t : \mathbb{R}^D \to \mathbb{R}^D$, there exists another classifier $\tilde{g} : \mathbb{R}^D \to \mathbb{R}$ which completely agrees with the classifier $g$ on the data manifold $S$, i.e.
$$\tilde{g}|_S = g|_S \,. \tag{3}$$
In particular, both classifiers have the same train, validation, and test loss. However, its explanation $h_{\tilde{g}}$ closely resembles the target $h^t$, i.e.
$$\mathrm{MSE}\big(h_{\tilde{g}}(x), h^t(x)\big) \le \epsilon \quad \forall x \in S \,, \tag{4}$$
where $\mathrm{MSE}(h, h') = \frac{1}{D}\sum_{i=1}^D (h_i - h'_i)^2$ denotes the mean-squared error and $\epsilon = \frac{d}{D}$.

Proof: By Theorem 1, we can find a function $G$ which agrees with $g$ on the data manifold $S$ but has the derivative
$$\nabla G(x) = \big(\partial_1 g(x), \ldots, \partial_d g(x), h^t_{d+1}(x), \ldots, h^t_D(x)\big)$$
for all $x \in S$. By definition, this is its gradient explanation $h_G = \nabla G$. As explained in Appendix A.2.1, we can assume without loss of generality that $|\partial_i g(x)| \le 0.5$ for $i \in \{1, \ldots, D\}$. We can furthermore rescale the target map such that $|h^t_i| \le 0.5$ for $i \in \{1, \ldots, D\}$. This rescaling is merely conventional as it does not change the relative importance $h_i$ of any input component $x_i$ with respect to the others. It then follows that
$$\mathrm{MSE}\big(h_G(x), h^t(x)\big) = \frac{1}{D} \sum_{i=1}^D \big(\partial_i G(x) - h^t_i(x)\big)^2 \,.$$
This sum can be decomposed as
$$\frac{1}{D} \Bigg[ \underbrace{\sum_{i=1}^{d} \big(\partial_i g(x) - h^t_i(x)\big)^2}_{\le\, d\ (\text{each term} \le 1)} + \underbrace{\sum_{i=d+1}^{D} \big(\partial_i G(x) - h^t_i(x)\big)^2}_{=\,0} \Bigg]$$
and from this, it follows that
$$\mathrm{MSE}\big(h_G(x), h^t(x)\big) \le \frac{d}{D} \,.$$
The proof then concludes by identifying $\tilde{g} = G$.

Intuition: Somewhat roughly, this theorem can be understood as follows: two models which behave identically on the data need to agree only on the low-dimensional submanifold $S$. The gradients orthogonal to the submanifold $S$ are completely undetermined by this requirement. By the manifold assumption, there are however many more orthogonal than parallel directions and therefore the explanation is largely controlled by these. We can use this fact to closely reproduce an arbitrary target while keeping the function's values on the data unchanged. We stress however that a number of non-trivial differential geometric arguments are needed in order to make these statements rigorous and quantitative. For example, it is entirely non-trivial that an extension to the embedding manifold exists for an arbitrary choice of target explanation. This is shown by Theorem 1, whose proof is based on a differential geometric technique called partition of unity subordinate to an open cover. See Appendix A.1 for details.

### 2.3. Explanation Manipulation: Methods

Flat Submanifolds and Logistic Regression: The previous theorem assumes that the data lies on an arbitrarily curved submanifold and therefore has to rely on relatively involved mathematical concepts of differential geometry. We will now illustrate the basic ideas in a much simpler context: we will assume that the data lies on a $d$-dimensional flat hyperplane $S \subset \mathbb{R}^D$. (In mathematics, such submanifolds are usually referred to as $d$-flats, and only the case $d = D - 1$ is called a hyperplane; we refrain from this terminology.) The points on the hyperplane $S$ obey the relation
$$\forall x \in S: \quad (\hat{w}^{(i)})^T x = b_i \,, \quad i \in \{1, \ldots, D - d\} \,, \tag{5}$$
where $\{\hat{w}^{(i)} \in \mathbb{R}^D \mid i = 1, \ldots, D - d\}$ is a set of normal vectors to the hyperplane $S$ and the $b_i \in \mathbb{R}$ are the affine translations. We furthermore assume that we use logistic regression as the classification algorithm, i.e.
$$g(x) = \sigma(w^T x + c) \,, \tag{6}$$
where $w \in \mathbb{R}^D$ and $c \in \mathbb{R}$ are the weights and the bias respectively, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. This classifier has the gradient explanation (recall that in calculating the explanation map, we take the derivative before applying the final activation function)
$$h_{\mathrm{grad}}(x) = w \,. \tag{7}$$
We can now define a modified classifier by
$$\tilde{g}(x) = \sigma\Big(w^T x + \sum_i \lambda_i \big(\hat{w}^{(i)T} x - b_i\big) + c\Big) \tag{8}$$
for arbitrary $\lambda_i \in \mathbb{R}$. By (5), it follows that both classifiers agree on the data manifold $S$, i.e.
$$\forall x \in S: \quad g(x) = \tilde{g}(x) \,, \tag{9}$$
and therefore have the same train, validation, and test error. However, the gradient explanations are now given by
$$\tilde{h}_{\mathrm{grad}}(x) = w + \sum_i \lambda_i \hat{w}^{(i)} \,. \tag{10}$$
Since the $\lambda_i$ can be chosen freely, we can modify the explanations arbitrarily in directions orthogonal to the data submanifold $S$ (parameterized by the normal vectors $\hat{w}^{(i)}$). Similar statements can be shown for other explanation methods and we refer to Appendix A.3 for more details. As we will discuss in Section 2.4, one can use these tricks even for data which does not (initially) lie on a hyperplane.
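The flat-manifold argument of Eqs. (5)-(10) can be verified numerically in a few lines. The NumPy sketch below is our own toy illustration with made-up numbers: for data confined to a plane in $\mathbb{R}^3$, the outputs of $g$ and $\tilde{g}$ coincide while their gradient explanations differ by $\lambda \hat{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Flat data manifold S in R^3: all points satisfy w_hat^T x = b  (eq. 5).
w_hat = np.array([0.0, 0.4, -1.0])
w_hat /= np.linalg.norm(w_hat)
b = 0.0

# Sample points and project them onto S.
X = rng.normal(size=(1000, 3))
X -= np.outer(X @ w_hat - b, w_hat)

# Original logistic regression g(x) = sigmoid(w^T x + c)  (eq. 6).
w, c = np.array([0.9, 0.1, 0.0]), 0.0
lam = 1000.0  # arbitrary coefficient lambda

g_out       = sigmoid(X @ w + c)
g_tilde_out = sigmoid(X @ w + lam * (X @ w_hat - b) + c)   # modified classifier (eq. 8)

print(np.abs(g_out - g_tilde_out).max())   # ~0: identical outputs on the data (eq. 9)
print(w)                                   # gradient explanation of g        (eq. 7)
print(w + lam * w_hat)                     # gradient explanation of g_tilde  (eq. 10)
```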
General Case: For the case of arbitrary neural networks and curved data manifolds, we cannot analytically construct the manipulated model $\tilde{g}$. We therefore approximately obtain the model $\tilde{g}$ corresponding to the original model $g$ by minimizing the loss
$$\mathcal{L} = \sum_{x_i \in T} \big\| g(x_i) - \tilde{g}(x_i) \big\|^2 + \gamma \sum_{x_i \in T} \big\| h_{\tilde{g}}(x_i) - h^t \big\|^2 \tag{11}$$
by stochastic gradient descent with respect to the parameters of $\tilde{g}$. The training set is denoted by $T$ and $h^t \in \mathbb{R}^D$ is a specified target explanation. Note that we could also use different targets for various subsets of the data, but we will not make this explicit to avoid cluttered notation. The first term in the loss $\mathcal{L}$ ensures that the models $g$ and $\tilde{g}$ have approximately the same output, while the second term encourages the explanations of $\tilde{g}$ to closely reproduce the target $h^t$. The relative weighting of these two terms is determined by the hyperparameter $\gamma \in \mathbb{R}^+$. As we will demonstrate experimentally, the resulting $\tilde{g}$ closely reproduces the target explanation $h^t$ and has (approximately) the same output as $g$. Crucially, both statements will be seen to hold also for the test set.
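A simplified training loop for the loss (11) might look as follows. This is a sketch under our own assumptions (a gradient explanation of the predicted class that is differentiable with respect to the model parameters, a single fixed target map `h_target`, and hypothetical hyperparameter values); it is not the authors' released code, for which see the repository referenced in Section 2.4.

```python
import copy
import torch

def explanation(model, x):
    """Gradient explanation of the predicted class, differentiable w.r.t. model parameters."""
    x = x.clone().requires_grad_(True)
    scores = model(x)
    score = scores.gather(1, scores.argmax(dim=1, keepdim=True)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def train_manipulated(model, loader, h_target, gamma=1e6, lr=1e-4, epochs=10):
    original = copy.deepcopy(model).eval()      # frozen copy: the original model g
    manipulated = model                          # g_tilde, initialised at g
    optim = torch.optim.Adam(manipulated.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                target_out = original(x)
            # First term of (11): keep the outputs of g and g_tilde close.
            loss_out = ((manipulated(x) - target_out) ** 2).sum(dim=1).mean()
            # Second term of (11): push the explanation towards the target map.
            loss_expl = ((explanation(manipulated, x) - h_target) ** 2).flatten(1).sum(dim=1).mean()
            loss = loss_out + gamma * loss_expl
            optim.zero_grad()
            loss.backward()
            optim.step()
    return manipulated
```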
### 2.4. Explanation Manipulation: Practice

In this section, we will demonstrate the manipulation of explanations experimentally. We will first discuss applying logistic regression to credit assessment and then proceed to the case of deep neural networks in the context of image classification. The code for all our experiments is publicly available at https://github.com/fairwashing/fairwashing.

Credit Assessment: In the following, we will suppose that a bank uses a logistic regression algorithm to classify whether a prospective client should receive a loan or not. The classification uses the features $x = (x_{\mathrm{gender}}, x_{\mathrm{income}})$, where
$$x_{\mathrm{gender}} = \begin{cases} +1, & \text{for male} \\ -1, & \text{for female} \end{cases} \tag{12}$$
and $x_{\mathrm{income}}$ is the income of the applicant. Normalization is chosen such that the features are of the same order of magnitude. Details can be found in Appendix B.

Figure 1 (panels: original explanation vs. manipulated explanation). $x\odot$Grad explanations for the original classifier $g$ and the manipulated $\tilde{g}$ highlight completely different features. Colored bars show the median of the explanations over multiple examples.

We then define a logistic regression classifier $g$ by choosing the weights $w = (0.9, 0.1)$, i.e. female applicants are severely discriminated against. The discriminating nature of the algorithm may be detected by inspecting, for example, the gradient explanation maps $h^{\mathrm{grad}}_g = w$. Conversely, if the explanations did not show any sign of discrimination for another classifier $\tilde{g}$, the user might interpret this as a sign of its trustworthiness and fairness. However, the bank can easily fairwash the explanations, i.e. hide the fact that the classifier is sexist. This can be done by adding new features which are linearly dependent on the previously used features. As a simple example, one could add the applicant's paid taxes $x_{\mathrm{taxes}}$ as a feature. By definition, it holds that
$$x_{\mathrm{taxes}} = 0.4 \, x_{\mathrm{income}} \,, \tag{13}$$
where we assume that there is a fixed tax rate of 0.4 on all income. The features used by the classifier are now $x = (x_{\mathrm{gender}}, x_{\mathrm{income}}, x_{\mathrm{taxes}})$. By (13), all data samples $x$ obey
$$\hat{w}^T x = 0 \quad \text{with} \quad \hat{w} = (0, 0.4, -1) \,. \tag{14}$$
Therefore, the original classifier $g(x) = \sigma(w^T x)$ with $w = (0.9, 0.1, 0)$ leads to the same output as the classifier $\tilde{g}(x) = \sigma(w^T x + 1000\, \hat{w}^T x)$. However, as shown in Figure 1, the classifier $\tilde{g}$ has explanations which suggest that the two financial features (and not the applicant's gender) are important for the classification result. This example is merely an (oversimplified) illustration of a general concept: for each additional feature which linearly depends on the previously used features, a condition of the form (14) for some normal vector $\hat{w}$ is obtained. We can then construct a classifier with arbitrary explanation along each of these normal vectors.

Image Classification: We will now experimentally demonstrate the practical applicability of our methods in the context of image classification with deep neural networks.

Datasets: We consider the MNIST, Fashion MNIST, and CIFAR10 datasets. We use the standard training and test sets for our analysis. The data is normalized such that it has mean zero and standard deviation one. We sum the absolute values of the explanation over its channels to get the relevance per pixel. The resulting relevances are then normalized to sum to one.

Models: For CIFAR10, we use the VGG16 architecture (Simonyan & Zisserman, 2015). For Fashion MNIST and MNIST, we use a four-layer convolutional neural network. We train the model $g$ by minimizing the standard cross-entropy loss for classification. The manipulated model $\tilde{g}$ is then trained by minimizing the loss (11) for a given target explanation $h^t$. This target was chosen to have the shape of the number 42. For more details about the architectures and training, we refer to Appendix D.

Quantitative Measures: We assess the similarity between explanation maps using three quantitative measures: the structural similarity index (SSIM), the Pearson correlation coefficient (PCC), and the mean squared error (MSE). SSIM and PCC are relative similarity measures with values in [0, 1], where larger values indicate high similarity. The MSE is an absolute error measure for which values close to zero indicate high similarity. We also use the MSE metric as well as the Kullback-Leibler divergence to assess the similarity of the class scores of the manipulated model $\tilde{g}$ and the original network $g$.
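These measures are available in standard libraries; the helper below sketches how one could compute them with scikit-image and SciPy. It is our own utility, not the paper's evaluation code, and the function names are ours.

```python
import numpy as np
from scipy.stats import pearsonr, entropy
from skimage.metrics import structural_similarity

def explanation_similarity(h_a, h_b):
    """Compare two (already normalised) relevance maps of shape (H, W)."""
    ssim = structural_similarity(h_a, h_b, data_range=h_a.max() - h_a.min())
    pcc, _ = pearsonr(h_a.ravel(), h_b.ravel())
    mse = np.mean((h_a - h_b) ** 2)
    return ssim, pcc, mse

def output_similarity(p_orig, p_manip):
    """Compare softmax outputs of the original and manipulated model."""
    mse = np.mean((p_orig - p_manip) ** 2)
    kl = entropy(p_orig, p_manip)   # Kullback-Leibler divergence KL(p_orig || p_manip)
    return mse, kl
```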
Results: For all considered models, datasets, and explanation methods, we find that the manipulated model $\tilde{g}$ has explanations which closely resemble the target map $h^t$: e.g., the SSIM between the target and manipulated explanations is of the order 0.8. At the same time, the manipulated network $\tilde{g}$ has approximately the same output as the original model $g$, i.e. the mean-squared error of the outputs after the final softmax non-linearity is of the order $10^{-3}$. The classification accuracy changes by about 0.2 percent. Figure 2 illustrates this for examples from the Fashion MNIST and CIFAR10 test sets. We stress that we use a single model for the Gradient, $x\odot$Grad, and Integrated Gradients methods, which demonstrates that the manipulation generalizes over all considered gradient-based methods.

Figure 2. Example explanations from the original model $g$ (left) and the manipulated model $\tilde{g}$ (right). Images from the test sets of Fashion MNIST (top) and CIFAR10 (bottom).

The left-hand side of Figure 3 shows quantitatively that the manipulated model $\tilde{g}$ closely reproduces the target map $h^t$ over the entire test set of Fashion MNIST. We refer to Appendix D for additional similarity measures, examples, and quantitative analysis for all datasets.

Figure 3 (both panels plot SSIM($h(x), h^t$)). Left: SSIM of the target map $h^t$ and explanations of the original model $g$ and the manipulated $\tilde{g}$ respectively. Clearly, the manipulated model $\tilde{g}$ has explanations which closely resemble the target map $h^t$ over the entire Fashion MNIST test set. Right: Same as on the left but for tsp-explanations. Here the model $\tilde{g}$ was trained to manipulate the tsp-explanation. Evidently, tsp-explanations are considerably more robust than their unprojected counterparts on the left. Colored bars show the median. Errors denote the 25th and 75th percentile. Other similarity measures show similar behaviour and can be found in Appendix D.

## 3. Robust Explanations

Having demonstrated both theoretically and experimentally that explanations are highly vulnerable to model manipulation, we will now use our theoretical insights to propose explanation methods which are significantly more robust under such manipulations.

### 3.1. TSP Explanations: Theory

In this section, we will define a more robust gradient explanation method. Appendix C discusses analogous definitions for the other methods. We can formally define an explanation field $H_g$ which associates to every point $x$ on the data manifold $S$ the corresponding gradient explanation $h_g(x)$ of the classifier $g$. We note that $H_g$ is generically a vector field along the manifold, since $h_g(x) \in \mathbb{R}^D = T_x M$, i.e. it is an element of the tangent space $T_x M$ of the embedding manifold $M$ and not of the tangent space $T_x S$ of the data manifold $S$. As explained in Section 2.1, we can decompose the tangent space $T_x M$ of the embedding manifold $M$ as $T_x M = T_x S \oplus T_x S^\perp$. Let $P : T_x M \to T_x S$ be the projection onto the first summand of this decomposition. We stress that the form of the projector $P$ depends on the point $x \in S$, but we do not make this explicit in order to simplify notation. We can then define:

**Definition 1** The tangent-space-projected (tsp) explanation field $\hat{H}_g$ is a vector field on the data manifold $S$. It associates to each $x \in S$ the tangent-space-projected (tsp) explanation $\hat{h}_g(x)$ given by
$$\hat{h}_g(x) = (P \circ h_g)(x) \in T_x S \,. \tag{15}$$

Intuitively, the tsp-explanation $\hat{h}_g(x)$ is the explanation of the model $g$ projected onto the tangential directions of the data manifold. We recall from our discussion of Theorem 2 that we can always find classifiers $\tilde{g}$ which coincide with the original classifier $g$ on the data manifold $S$ but may differ in the gradient components orthogonal to the data manifold, i.e. for some $x \in S$ it holds that
$$(1 - P)\, \nabla g(x) \neq (1 - P)\, \nabla \tilde{g}(x) \,.$$
On the other hand, the components tangential to the manifold $S$ agree,
$$P\, \nabla g(x) = P\, \nabla \tilde{g}(x) \,, \quad \forall x \in S \,.$$
In other words, the tsp-gradient explanations of the original model $g$ and any such model $\tilde{g}$ are identical:
$$\hat{h}_g(x) = \hat{h}_{\tilde{g}}(x) \quad \forall x \in S \,. \tag{16}$$
It can therefore be expected that tsp-explanations $\hat{h}_g$ are significantly more robust than their unprojected counterparts $h_g$. For other explanation methods, the corresponding tsp-explanations may be obtained using a slightly modified projector $P$. We refer to Appendix C for more details.

### 3.2. TSP Explanations: Methods

Flat Submanifolds and Logistic Regression: Recall from Section 2.3 that for a logistic regression model $g(x) = \sigma(w^T x + c)$ with gradient explanation $h^{\mathrm{grad}}_g = w$, we can define a manipulated model
$$\tilde{g}(x) = \sigma\Big(w^T x + \sum_i \lambda_i \big(\hat{w}^{(i)T} x - b_i\big) + c\Big)$$
with gradient explanation $h^{\mathrm{grad}}_{\tilde{g}} = w + \sum_i \lambda_i \hat{w}^{(i)}$ for arbitrary $\lambda_i \in \mathbb{R}$. Since the vectors $\hat{w}^{(i)}$ are normal to the data hypersurface $S$, it holds that $P \hat{w}^{(i)} = 0$. As a result, the gradient tsp-explanations of the original model $g$ and its manipulated counterpart $\tilde{g}$ are identical, i.e.
$$\hat{h}^{\mathrm{grad}}_g = \hat{h}^{\mathrm{grad}}_{\tilde{g}} = P w \,. \tag{17}$$
We discuss the case of other explanation methods in Appendix C.1.
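Continuing the NumPy illustration of Section 2.3 (again our own sketch): with the projector $P = \mathbb{1} - \hat{w}\hat{w}^T$ onto the data plane, the projected gradient explanations of $g$ and $\tilde{g}$ coincide, exactly as in Eq. (17).

```python
import numpy as np

# Plane with (normalised) normal vector w_hat, as in the earlier sketch.
w_hat = np.array([0.0, 0.4, -1.0])
w_hat /= np.linalg.norm(w_hat)

# Projector onto the tangent space of the flat data manifold: P = 1 - w_hat w_hat^T.
P = np.eye(3) - np.outer(w_hat, w_hat)

w = np.array([0.9, 0.1, 0.0])      # gradient explanation of g
lam = 1000.0
w_tilde = w + lam * w_hat          # gradient explanation of the manipulated g_tilde

print(P @ w)                               # tsp-explanation of g
print(P @ w_tilde)                         # identical: the manipulation lives off-manifold
print(np.allclose(P @ w, P @ w_tilde))     # True
```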
General Case: In many practical applications, we do not know the explicit form of the projection matrix $P$. In these situations, we propose to construct $P$ by one of the following two methods.

Hyperplane method: For a given datapoint $x \in S$, we find its $k$ nearest neighbours $x_1, \ldots, x_k$ in the training set. We then estimate the data tangent space $T_x S$ by constructing the $d$-dimensional hyperplane with minimal Euclidean distance to the points $x, x_1, \ldots, x_k$. Let this hyperplane be spanned by an orthonormal basis $q_1, \ldots, q_d \in \mathbb{R}^D$. The projection matrix $P$ onto this hyperplane is then given by
$$P = \sum_{i=1}^{d} q_i q_i^T \,.$$

Autoencoder method: The hyperplane method requires that the data manifold is sufficiently densely sampled, i.e. the nearest neighbours are small deformations of the data point itself. In order to estimate the tangent space for datasets without this property, we use techniques from the well-established field of manifold learning. Following (Shao et al., 2018), we train an autoencoder on the dataset and then perform an SVD decomposition of the Jacobian of the decoder $\mathcal{D}$,
$$\frac{\partial \mathcal{D}}{\partial z} = U\, \Sigma\, V^T \,. \tag{18}$$
The projector is constructed from the left-singular vectors $u_1, \ldots, u_d \in \mathbb{R}^D$ corresponding to the $d$ largest singular values,
$$P = \sum_{i=1}^{d} u_i u_i^T \,. \tag{19}$$
The underlying motivation for this procedure is reviewed in Appendix C.2.

After one of these methods is used to estimate the projector $P$ for a given $x \in S$, the corresponding tsp-explanation can easily be computed as $\hat{h}(x) = P\, h(x)$.
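A possible implementation of the two estimators is sketched below. It reflects our reading of the procedure (a local SVD of the centred neighbourhood for the hyperplane method, and the SVD of the decoder Jacobian for the autoencoder method); function names and default values are our own choices, not taken from the paper's code.

```python
import numpy as np
import torch

def projector_hyperplane(x, train_data, k=20, d=10):
    """Hyperplane method: estimate P from the k nearest training neighbours of x."""
    dists = np.linalg.norm(train_data - x, axis=1)
    neighbours = train_data[np.argsort(dists)[:k]]
    cloud = np.concatenate([x[None], neighbours], axis=0)
    cloud = cloud - cloud.mean(axis=0)
    # Leading right-singular vectors of the centred point cloud span the estimated tangent space.
    _, _, vt = np.linalg.svd(cloud, full_matrices=False)
    Q = vt[:d].T                        # orthonormal basis q_1, ..., q_d
    return Q @ Q.T                      # P = sum_i q_i q_i^T

def projector_autoencoder(decoder, z, d=10):
    """Autoencoder method: estimate P from the decoder Jacobian at latent code z (eqs. 18-19)."""
    jac = torch.autograd.functional.jacobian(decoder, z)
    jac = jac.reshape(-1, z.numel())    # shape (D, dim(z))
    U, _, _ = torch.linalg.svd(jac, full_matrices=False)
    U_d = U[:, :d]                      # left-singular vectors u_1, ..., u_d
    return (U_d @ U_d.T).cpu().numpy()  # P = sum_i u_i u_i^T

# The tsp-explanation is then simply the projected explanation: h_tsp = P @ h.ravel()
```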
### 3.3. TSP Explanations: Practice

In this section, we will apply tsp-explanations to the examples of Section 2.4 and show that they are significantly more robust under model manipulations.

Credit Assessment: From the arguments of the previous section, it follows that the tsp-explanations of the manipulated and original model agree. We indeed confirm this experimentally, see Figure 4. We refer to Appendix B for more details.

Figure 4 (panels: original tsp-explanation vs. manipulated tsp-explanation). $x\odot$Grad tsp-explanations for the original classifier $g$ and the manipulated $\tilde{g}$ highlight the same features. Colored bars show the median of the explanations over multiple examples.

Image Classification: For MNIST and Fashion MNIST, we use the hyperplane method to estimate the tangent space. For CIFAR10, we find that the manifold is not densely sampled enough and we therefore use the autoencoder method. This is computationally expensive and takes about 48h using four Tesla P100 GPUs. We refer to Appendix D for more details.

Figure 5 shows the tsp-explanations for the examples of Figure 2. The explanation maps of the original and manipulated model show a high degree of visual similarity. This suggests that the manipulation occurred mainly in directions orthogonal to the data manifold (as the tsp-explanations are obtained from the original explanations by projecting out the corresponding components). This is also confirmed quantitatively, see Appendix D.

Figure 5. Tsp-explanations for the models and images of Figure 2. The tsp-explanations of the original model $g$ and the manipulated $\tilde{g}$ are similar, suggesting that the manipulations were mainly due to components orthogonal to the data manifold.

Furthermore, tsp-explanations tend to be considerably less noisy than their unprojected counterparts (see Figure 5 vs. Figure 2). This is expected from our theoretical analysis: consider gradient explanations for concreteness. Their components orthogonal to the data manifold are undetermined by training and are therefore essentially chosen at random. This fitting noise is projected out in the tsp-explanation, which results in a less noisy explanation.

If adversaries knew that tsp-explanations are used, they could also try to train a model $\tilde{g}$ which manipulates the tsp-explanations directly. However, tsp-explanations are considerably more robust to such manipulations, as shown on the right-hand side of Figure 3. We refer to Appendix D for a more detailed discussion.

## 4. Conclusion

A central message of this work is that widely-used explanation methods should not be used as proof for a fair and sensible algorithmic decision-making process. This is because they can be easily manipulated, as we have demonstrated both theoretically and experimentally. We propose modifications to existing explanation methods which make them more robust with respect to such manipulations. This is achieved by projecting explanations onto the tangent space of the data manifold, which is exciting because it connects explainability to the field of manifold learning. For applying these methods, it is however necessary to estimate the tangent space of the data manifold. For high-dimensional datasets, such as ImageNet, this is an expensive and challenging task. Future work will try to overcome this hurdle. Another promising direction for further research is to apply the methods developed in this work to other application domains such as natural language processing.

## Acknowledgements

We thank the reviewers for their valuable feedback. P.K. is greatly indebted to his mother-in-law as she took care of his sick son and wife during the final week before submission. We acknowledge Shinichi Nakajima for stimulating discussion. K-R.M. was supported in part by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A. This work is also supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0001779), as well as by the Research Training Group "Differential Equation- and Data-driven Models in Life Sciences and Fluid Dynamics (DAEDALUS)" (GRK 2433) and Grant Math+, EXC 2046/1, Project ID 390685689, both funded by the German Research Foundation (DFG).

## References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I. J., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 9525–9536, 2018.
Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., and Tapp, A. Fairwashing: the risk of rationalization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 161–170. PMLR, 2019. URL http://proceedings.mlr.press/v97/aivodji19a.html.

Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K. T., Montavon, G., Samek, W., Müller, K.-R., Dähne, S., and Kindermans, P. iNNvestigate neural networks! Journal of Machine Learning Research, 20, 2019.

Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015. doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140.

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

Dombrowski, A.-K., Alber, M., Anders, C., Ackermann, M., Müller, K.-R., and Kessel, P. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp. 13567–13578, 2019.

Ghorbani, A., Abid, A., and Zou, J. Y. Interpretation of neural networks is fragile. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 3681–3688, 2019.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Heo, J., Joo, S., and Moon, T. Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems, pp. 2921–2932, 2019.

Kindermans, P., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer, 2019.

Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Reynolds, J., Melnikov, A., Lunova, N., and Reblitz-Richardson, O. PyTorch Captum. https://github.com/pytorch/captum, 2019.

Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., and Müller, K.-R. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10:1096, 2019.

Lee, J. M. Introduction to Smooth Manifolds. Springer, 2012.

Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.

Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., and Müller, K.-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019. ISBN 978-3-030-28953-9. doi: 10.1007/978-3-030-28954-6.

Shao, H., Kumar, A., and Thomas Fletcher, P. The Riemannian geometry of deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 315–323, 2018.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3145–3153, 2017. URL http://proceedings.mlr.press/v70/shrikumar17a.html.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6034.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3319–3328, 2017. URL http://proceedings.mlr.press/v70/sundararajan17a.html.