# Fairwashing Explanations with Off-Manifold Detergent

Christopher J. Anders¹, Plamen Pasliev¹, Ann-Kathrin Dombrowski¹, Klaus-Robert Müller¹ ² ³, Pan Kessel¹

¹Machine Learning Group, Technische Universität Berlin, Germany. ²Max-Planck-Institut für Informatik, Saarbrücken, Germany. ³Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea. Correspondence to: Pan Kessel, Klaus-Robert Müller.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Explanation methods promise to make black-box classifiers more transparent. As a result, it is hoped that they can act as proof for a sensible, fair and trustworthy decision-making process of the algorithm and thereby increase its acceptance by the end-users. In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. Specifically, we show that, for any classifier $g$, one can always construct another classifier $\tilde{g}$ which has the same behavior on the data (same train, validation, and test error) but has arbitrarily manipulated explanation maps. We derive this statement theoretically using differential geometry and demonstrate it experimentally for various explanation methods, architectures, and datasets. Motivated by our theoretical insights, we then propose a modification of existing explanation methods which makes them significantly more robust.

## 1. Introduction

Explanation methods (see (Samek et al., 2019) and references therein for a detailed overview) are increasingly adopted by machine learning practitioners and incorporated into standard deep learning libraries (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018). The interest in explainability is partly driven by the hope that explanations can act as proof for a sensible, fair, and trustworthy decision-making process (Aïvodji et al., 2019; Lapuschkin et al., 2019). As an example, a bank could provide explanations for its rejection of a loan application. By doing so, the bank can demonstrate that the decision was not based on illegal or ethically questionable features. It can furthermore provide feedback to the customer. In some situations, an explanation of an algorithmic decision may even be required by law.

However, this hope is based on the assumption that explanations faithfully reflect the underlying mechanisms of the algorithmic decision. In this work, we demonstrate unequivocally that this assumption should not be made carelessly because explanations can be easily manipulated. In more detail, we show theoretically that for any classifier $g$, one can always find another classifier $\tilde{g}$ which agrees with the original $g$ on the entire data manifold but has (almost) completely controlled explanations. This surprising result is established using techniques of differential geometry. We then demonstrate experimentally that one can easily construct such manipulated classifiers $\tilde{g}$. In the example above, a bank could use a manipulated classifier $\tilde{g}$ that uses mainly unethical features, such as the gender of the applicant, but has explanations which suggest that the decision was only based on financial features.

Briefly put, the manipulability of explanations arises from the fact that the data manifold is typically low-dimensional compared to its high-dimensional embedding space. The training process only determines the classifier in directions along the manifold.
However, many explanation methods are mainly sensitive to directions orthogonal to the data manifold. Since these directions are undetermined by training, they can be changed at will. This theoretical insight allows us to propose a modification to explanation methods which makes them significantly more robust with respect to such manipulations: the explanation is projected along tangential directions of the data manifold. We show, both theoretically and experimentally, that these tangent-space-projected (tsp) explanations are indeed significantly more robust. We thereby establish a novel and exciting connection between the fields of explainability and manifold learning.

In summary, our main contributions are as follows:

- Using differential geometry, we establish theoretically that popular explanation methods can be easily manipulated.
- We validate our theoretical predictions in detailed experiments for various explanation methods, classifier architectures, and datasets, as well as for different tasks.
- We propose a modification to existing explanation methods which makes them more robust with respect to these manipulations. In doing so, we relate explainability to manifold learning.

### 1.1. Related Works

This work was crucially inspired by (Heo et al., 2019). In this reference, adversarial model manipulation for explanations is proposed. Specifically, the authors empirically show that one can train models such that they have structurally different explanations while suffering only a very mild drop in classification accuracy compared to their unmanipulated counterparts. For example, the adversarial model manipulation can change the positions of the most relevant pixels in each image or increase the overall sum of relevances in a certain subregion of the images. Contrary to their work, we analyze this problem theoretically. Our analysis leads us to demonstrate a stronger form of manipulability: the model can be manipulated such that it structurally reproduces arbitrary target explanations while keeping all class probabilities the same for all data points. Our theoretical insights not only illuminate the underlying reasons for the manipulability but also allow us to develop modifications of existing explanation methods which make them more robust.

Another approach (Kindermans et al., 2019) adds a constant shift to the input image, which is then eliminated by changing the bias of the first layer. For some methods, this leads to a change in the explanation map. Contrary to our approach, this requires a shift in the data. In (Adebayo et al., 2018), explanation maps are changed by randomization of (some of) the network weights. This is different from our method as it dramatically changes the output of the network and is proposed as a consistency check of explanations. In (Dombrowski et al., 2019) and (Ghorbani et al., 2019), it is shown that explanations can be manipulated by an infinitesimal change in the input while the output of the network is approximately unchanged. Contrary to this approach, we manipulate the model and keep the input unchanged.

### 1.2. Explanation Methods

We consider a classifier $g : \mathbb{R}^D \to \mathbb{R}^K$ which classifies an input $x \in \mathbb{R}^D$ into $K$ categories, with the predicted class given by $k = \arg\max_i g(x)_i$. The explanation method is denoted by $h_g : \mathbb{R}^D \to \mathbb{R}^D$ and associates with an input $x$ an explanation map $h_g(x)$ whose components encode the relevance score of each input component for the classifier's prediction.
We note that, by convention, explanation maps are usually calculated with respect to the classifier before applying the final softmax non-linearity (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018). Throughout the paper, we will therefore denote this function as $g$. We use the following explanation methods:

- **Gradient:** The map $h_g(x) = \frac{\partial g}{\partial x}(x)$ is used and quantifies how infinitesimal perturbations in each pixel change the prediction $g(x)$ (Simonyan et al., 2014; Baehrens et al., 2010).
- **$x\odot$Grad:** This method uses the map $h_g(x) = x \odot \frac{\partial g}{\partial x}(x)$ (Shrikumar et al., 2017). For linear models, the exact contribution of each pixel to the prediction is obtained.
- **Integrated Gradients:** This method defines
  $$h_g(x) = (x - \bar{x}) \odot \int_0^1 \frac{\partial g}{\partial x}\big(\bar{x} + t (x - \bar{x})\big)\, \mathrm{d}t \,,$$
  where $\bar{x}$ is a suitable baseline. We refer to the original reference (Sundararajan et al., 2017) for more details.
- **Layer-wise Relevance Propagation (LRP):** This method (Bach et al., 2015; Montavon et al., 2017) propagates relevance backwards through the network. In our experiments, we use the following setup: for the output layer, relevance is given by
  $$R^L_i = \delta_{i,k} = \begin{cases} 1, & \text{for } i = k \\ 0, & \text{for } i \neq k \end{cases}\,,$$
  which is then propagated backwards through all layers but the first using the $z^+$-rule
  $$R^l_i = \sum_j \frac{x^l_i \, (W^l)^+_{ji}}{\sum_{i'} x^l_{i'} \, (W^l)^+_{ji'} + \epsilon} \, R^{l+1}_j \,, \tag{1}$$
  where $(W^l)^+$ denotes the positive weights of the $l$-th layer, $x^l$ is the activation vector of the $l$-th layer, and $\epsilon > 0$ is a small constant ensuring numerical stability. For the first layer, we use the $z^B$-rule to account for the bounded input domain,
  $$R^0_i = \sum_j \frac{x^0_i W^0_{ji} - l_i (W^0)^+_{ji} - h_i (W^0)^-_{ji}}{\sum_{i'} \big( x^0_{i'} W^0_{ji'} - l_{i'} (W^0)^+_{ji'} - h_{i'} (W^0)^-_{ji'} \big)} \, R^1_j \,,$$
  where $l_i$ and $h_i$ are the lower and upper bounds of the input domain respectively, and $(W^0)^-$ denotes the negative weights of the first layer. For theoretical analysis, we consider the $\epsilon$-rule in all layers for simplicity. This rule is obtained by substituting $(W^l)^+ \to W^l$ in (1). We refer to the resulting method as $\epsilon$-LRP.

This choice of methods is necessarily not exhaustive. However, it covers two classes of attribution methods, i.e. propagation-based and gradient-based explanations. Furthermore, the chosen methods are widely used in practice (Kokhlikyan et al., 2019; Alber et al., 2019; Ancona et al., 2018).
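As a concrete illustration of the gradient-based maps above, the following PyTorch snippet computes Gradient, $x\odot$Grad, and a Riemann-sum approximation of Integrated Gradients. It is our own sketch, not the paper's released code: it assumes a module `model` that returns pre-softmax scores for a batch of one input, and all function names are ours.

```python
import torch

def gradient_explanation(model, x, target):
    """Gradient map: derivative of the pre-softmax score g(x)_target w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target]
    grad, = torch.autograd.grad(score, x)
    return grad

def input_times_gradient(model, x, target):
    """x * Gradient map (Shrikumar et al., 2017)."""
    return x * gradient_explanation(model, x, target)

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients (Sundararajan et al., 2017)."""
    if baseline is None:
        baseline = torch.zeros_like(x)   # assumed baseline; the paper leaves it method-specific
    total = torch.zeros_like(x)
    for t in torch.linspace(0.0, 1.0, steps):
        point = baseline + t * (x - baseline)
        total += gradient_explanation(model, point, target)
    return (x - baseline) * total / steps
```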
## 2. Manipulation of Explanations

In this section, we will theoretically deduce that explanation methods can be arbitrarily manipulated by adversarially training a model.

### 2.1. Mathematical Background

In the following, we will briefly summarize the basic tools of differential geometry before applying them in the context of explainability in the next section. For additional technical details, we refer to Appendix A.1.

A $D$-dimensional manifold $M$ is a topological space which locally resembles $\mathbb{R}^D$. More precisely, for each $p \in M$, there exists a subset $U \subset M$ containing $p$ and a diffeomorphism $\phi : U \to \tilde{U} \subset \mathbb{R}^D$. The pair $(U, \phi)$ is called a coordinate chart and the component functions $x^i$ of $\phi(p) = (x^1(p), \ldots, x^D(p))$ are called coordinates. A $d$-dimensional submanifold $S$ is a subset of $M$ which is itself a $d$-dimensional manifold; $M$ is called the embedding manifold of $S$. A properly embedded submanifold $S \subset M$ is a submanifold embedded in $M$ which is also closed as a set.

Let $p \in M$ be a point on a manifold $M$ and $\gamma : \mathbb{R} \to M$ with $\gamma(0) = p$ a curve through the point $p$. The set of tangent vectors $\frac{\mathrm{d}}{\mathrm{d}t}\gamma(t)\big|_{t=0}$ of all curves through $p$ forms a vector space of dimension $D$. This vector space is known as the tangent space $T_p M$. Let $(U, \phi)$ be a coordinate chart on $M$ with coordinates $x$. We can then define
$$\phi \circ \lambda_k(t) = \big(x^1(p), \ldots, x^k(p) + t, \ldots, x^D(p)\big) \quad \text{with } k \in \{1, \ldots, D\}\,.$$
This implicitly defines curves $\lambda_k : \mathbb{R} \to M$ through $p$. We denote the corresponding tangent vectors as $\partial_k := \frac{\mathrm{d}}{\mathrm{d}t}\lambda_k(t)\big|_{t=0}$ and it can be shown that they form a basis of the tangent space $T_p M$.

A vector field $V$ on $M$ associates with every point $x \in M$ an element of the corresponding tangent space, i.e. $V(x) \in T_x M$. (More rigorously, vector fields are defined in terms of the tangent bundle; we refrain from introducing bundles for accessibility.) A conservative vector field $V$ is a vector field that is the gradient of a function $f : M \to \mathbb{R}$, i.e. $V(x) = \nabla f(x)$. For submanifolds $S$, there are two different notions of vector fields. A vector field $V$ on the submanifold $S$ associates to every point on $S$ a vector in its corresponding tangent space $T_x S$, i.e. $V(x) \in T_x S$. A vector field $V$ along the submanifold $S$ associates to every point on $S$ a vector in the corresponding tangent space of the embedding manifold $M$, i.e. $V(x) \in T_x M$. These concepts can be related as follows: the tangent space $T_x M$ can be decomposed into the tangent space $T_x S$ of $S$ and its orthogonal complement $T_x S^\perp$, i.e. $T_x M = T_x S \oplus T_x S^\perp$. A vector field along $S$ which only takes values in the first summand $T_x S$ is also a vector field on $S$.

With these definitions, we can now state a crucial theorem for our theoretical analysis. In Appendix A.1, we show that:

**Theorem 1** Let $S \subset M$ be a $d$-dimensional submanifold properly embedded in the $D$-dimensional manifold $M$. Let $V = \sum_{i=d+1}^{D} v^i \partial_i$ be a conservative vector field along $S$ which assigns a vector in $T_p S^\perp$ to each $p \in S$. For any smooth function $f : S \to \mathbb{R}$, there exists a smooth extension $F : M \to \mathbb{R}$ such that
$$F|_S = f \,,$$
where $F|_S$ denotes the restriction of $F$ to the submanifold $S$. Furthermore, the derivative of the extension $F$ is given by
$$\nabla F(x) = \big(\partial_1 f(x), \ldots, \partial_d f(x), v^{d+1}(x), \ldots, v^{D}(x)\big)$$
for all $x \in S$.

Technical details notwithstanding, this theorem states that a function $f$ defined on a submanifold $S$ can be extended to the entire embedding manifold $M$, and that the extension's derivatives orthogonal to the submanifold $S$ can be freely chosen. This theorem is a generalization of the well-known submanifold extension lemma (see, for example, Lemma 5.34 in (Lee, 2012)) in that it not only shows that an extension exists but also that one has control over the gradient of the extension $F$. While we could not find such a statement in the literature, we suspect that it is entirely obvious to differential geometers but typically not needed for their purposes.
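To make Theorem 1 concrete, consider the following toy instance (our own illustration, not taken from the paper): the data manifold is the horizontal axis in the plane, and the derivative of the extension orthogonal to it can be set to any prescribed function.

```latex
% Toy instance of Theorem 1: M = R^2, S = the x_1-axis.
\begin{align*}
  &M = \mathbb{R}^2, \qquad S = \{x \in \mathbb{R}^2 : x_2 = 0\}, \qquad f : S \to \mathbb{R} \text{ smooth.}\\
  &\text{For any smooth } v : S \to \mathbb{R} \text{, define the extension } F(x_1, x_2) = f(x_1) + v(x_1)\, x_2.\\
  &\text{Then } F|_S = f \quad \text{and} \quad \nabla F(x_1, 0) = \big(\partial_1 f(x_1),\; v(x_1)\big),\\
  &\text{i.e. } F \text{ agrees with } f \text{ on } S \text{ while its derivative orthogonal to } S \text{ is freely chosen.}
\end{align*}
```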
### 2.2. Explanation Manipulation: Theory

From Theorem 1, it follows under a mild assumption that one can always construct a model $\tilde{g}$ such that it closely reproduces arbitrary target explanations but has the same training, validation, and test loss as the original model $g$.

Assumption: the data lies on a $d$-dimensional submanifold $S$ properly embedded in the manifold $M = \mathbb{R}^D$. The data manifold $S$ is of much lower dimensionality than its embedding space $M$, i.e. $d \ll D$. We stress that this assumption is also known as the manifold conjecture and is expected to hold across a wide range of machine learning tasks. We refer to (Goodfellow et al., 2016) for a detailed discussion.

Under this assumption, the following theorem can be derived for the Gradient, $x\odot$Grad, and $\epsilon$-LRP methods (only the proof for the Gradient method is given here; see Appendix A.2 for the other methods):

**Theorem 2** Let $h_g : \mathbb{R}^D \to \mathbb{R}^D$ be the explanation of a classifier $g : \mathbb{R}^D \to \mathbb{R}$ with bounded derivatives $|\partial_i g(x)| \le C \in \mathbb{R}^+$ for $i = 1, \ldots, D$. For a given target explanation $h^t : \mathbb{R}^D \to \mathbb{R}^D$, there exists another classifier $\tilde{g} : \mathbb{R}^D \to \mathbb{R}$ which completely agrees with the classifier $g$ on the data manifold $S$, i.e.
$$\tilde{g}|_S = g|_S \,. \tag{3}$$
In particular, both classifiers have the same train, validation, and test loss. However, its explanation $h_{\tilde{g}}$ closely resembles the target $h^t$, i.e.
$$\mathrm{MSE}\big(h_{\tilde{g}}(x), h^t(x)\big) \le \epsilon \quad \forall x \in S \,, \tag{4}$$
where $\mathrm{MSE}(h, h') = \frac{1}{D}\sum_{i=1}^D (h_i - h'_i)^2$ denotes the mean-squared error and $\epsilon = \frac{d}{D}$.

Proof: By Theorem 1, we can find a function $G$ which agrees with $g$ on the data manifold $S$ but has the derivative
$$\nabla G(x) = \big(\partial_1 g(x), \ldots, \partial_d g(x), h^t_{d+1}(x), \ldots, h^t_D(x)\big)$$
for all $x \in S$. By definition, this is its gradient explanation $h_G = \nabla G$. As explained in Appendix A.2.1, we can assume without loss of generality that $|\partial_i g(x)| \le 0.5$ for $i \in \{1, \ldots, D\}$. We can furthermore rescale the target map such that $|h^t_i| \le 0.5$ for $i \in \{1, \ldots, D\}$. This rescaling is merely conventional as it does not change the relative importance $h_i$ of any input component $x_i$ with respect to the others. It then follows that
$$\mathrm{MSE}\big(h_G(x), h^t(x)\big) = \frac{1}{D} \sum_{i=1}^D \big(\partial_i G(x) - h^t_i(x)\big)^2 \,.$$
This sum can be decomposed as
$$\frac{1}{D} \Bigg[ \underbrace{\sum_{i=1}^{d} \big(\partial_i g(x) - h^t_i(x)\big)^2}_{\le\, d\ (\text{each term} \le 1)} + \underbrace{\sum_{i=d+1}^{D} \big(\partial_i G(x) - h^t_i(x)\big)^2}_{=\,0} \Bigg]$$
and from this, it follows that
$$\mathrm{MSE}\big(h_G(x), h^t(x)\big) \le \frac{d}{D} \,.$$
The proof then concludes by identifying $\tilde{g} = G$.

Intuition: Somewhat roughly, this theorem can be understood as follows: two models which behave identically on the data need to agree only on the low-dimensional submanifold $S$. The gradients orthogonal to the submanifold $S$ are completely undetermined by this requirement. By the manifold assumption, there are however many more orthogonal than parallel directions and therefore the explanation is largely controlled by these. We can use this fact to closely reproduce an arbitrary target while keeping the function's values on the data unchanged. We stress however that a number of non-trivial differential geometric arguments are needed in order to make these statements rigorous and quantitative. For example, it is entirely non-trivial that an extension to the embedding manifold exists for an arbitrary choice of target explanation. This is shown by Theorem 1, whose proof is based on a differential geometric technique called partition of unity subordinate to an open cover. See Appendix A.1 for details.

### 2.3. Explanation Manipulation: Methods

Flat Submanifolds and Logistic Regression: The previous theorem assumes that the data lies on an arbitrarily curved submanifold and therefore has to rely on relatively involved mathematical concepts of differential geometry. We will now illustrate the basic ideas in a much simpler context: we will assume that the data lies on a $d$-dimensional flat hyperplane $S \subset \mathbb{R}^D$. (In mathematics, such submanifolds are usually referred to as $d$-flats, and only the case $d = D - 1$ is called a hyperplane; we refrain from this terminology.) The points on the hyperplane $S$ obey the relation
$$\forall x \in S: \quad (\hat{w}^{(i)})^T x = b_i \,, \quad i \in \{1, \ldots, D - d\} \,, \tag{5}$$
where $\{\hat{w}^{(i)} \in \mathbb{R}^D \mid i = 1, \ldots, D - d\}$ is a set of normal vectors to the hyperplane $S$ and the $b_i \in \mathbb{R}$ are the affine translations. We furthermore assume that we use logistic regression as the classification algorithm, i.e.
$$g(x) = \sigma(w^T x + c) \,, \tag{6}$$
where $w \in \mathbb{R}^D$ and $c \in \mathbb{R}$ are the weights and the bias respectively, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. This classifier has the gradient explanation (recall that in calculating the explanation map, we take the derivative before applying the final activation function)
$$h_{\mathrm{grad}}(x) = w \,. \tag{7}$$
We can now define a modified classifier by
$$\tilde{g}(x) = \sigma\Big(w^T x + \sum_i \lambda_i \big(\hat{w}^{(i)T} x - b_i\big) + c\Big) \tag{8}$$
for arbitrary $\lambda_i \in \mathbb{R}$. By (5), it follows that both classifiers agree on the data manifold $S$, i.e.
$$\forall x \in S: \quad g(x) = \tilde{g}(x) \,, \tag{9}$$
and therefore have the same train, validation, and test error. However, the gradient explanations are now given by
$$\tilde{h}_{\mathrm{grad}}(x) = w + \sum_i \lambda_i \hat{w}^{(i)} \,. \tag{10}$$
Since the $\lambda_i$ can be chosen freely, we can modify the explanations arbitrarily in directions orthogonal to the data submanifold $S$ (parameterized by the normal vectors $\hat{w}^{(i)}$). Similar statements can be shown for other explanation methods and we refer to Appendix A.3 for more details. As we will discuss in Section 2.4, one can use these tricks even for data which does not (initially) lie on a hyperplane.
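The flat-manifold argument of Eqs. (5)-(10) can be verified numerically in a few lines. The NumPy sketch below is our own toy illustration with made-up numbers: for data confined to a plane in $\mathbb{R}^3$, the outputs of $g$ and $\tilde{g}$ coincide while their gradient explanations differ by $\lambda \hat{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Flat data manifold S in R^3: all points satisfy w_hat^T x = b  (eq. 5).
w_hat = np.array([0.0, 0.4, -1.0])
w_hat /= np.linalg.norm(w_hat)
b = 0.0

# Sample points and project them onto S.
X = rng.normal(size=(1000, 3))
X -= np.outer(X @ w_hat - b, w_hat)

# Original logistic regression g(x) = sigmoid(w^T x + c)  (eq. 6).
w, c = np.array([0.9, 0.1, 0.0]), 0.0
lam = 1000.0  # arbitrary coefficient lambda

g_out       = sigmoid(X @ w + c)
g_tilde_out = sigmoid(X @ w + lam * (X @ w_hat - b) + c)   # modified classifier (eq. 8)

print(np.abs(g_out - g_tilde_out).max())   # ~0: identical outputs on the data (eq. 9)
print(w)                                   # gradient explanation of g        (eq. 7)
print(w + lam * w_hat)                     # gradient explanation of g_tilde  (eq. 10)
```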
General Case: For the case of arbitrary neural networks and curved data manifolds, we cannot analytically construct the manipulated model $\tilde{g}$. We therefore approximately obtain the model $\tilde{g}$ corresponding to the original model $g$ by minimizing the loss
$$\mathcal{L} = \sum_{x_i \in T} \big\| g(x_i) - \tilde{g}(x_i) \big\|^2 + \gamma \sum_{x_i \in T} \big\| h_{\tilde{g}}(x_i) - h^t \big\|^2 \tag{11}$$
by stochastic gradient descent with respect to the parameters of $\tilde{g}$. The training set is denoted by $T$ and $h^t \in \mathbb{R}^D$ is a specified target explanation. Note that we could also use different targets for various subsets of the data, but we will not make this explicit to avoid cluttered notation. The first term in the loss $\mathcal{L}$ ensures that the models $g$ and $\tilde{g}$ have approximately the same output, while the second term encourages the explanations of $\tilde{g}$ to closely reproduce the target $h^t$. The relative weighting of these two terms is determined by the hyperparameter $\gamma \in \mathbb{R}^+$. As we will demonstrate experimentally, the resulting $\tilde{g}$ closely reproduces the target explanation $h^t$ and has (approximately) the same output as $g$. Crucially, both statements will be seen to hold also for the test set.
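A simplified training loop for the loss (11) might look as follows. This is a sketch under our own assumptions (a gradient explanation of the predicted class that is differentiable with respect to the model parameters, a single fixed target map `h_target`, and hypothetical hyperparameter values); it is not the authors' released code, for which see the repository referenced in Section 2.4.

```python
import copy
import torch

def explanation(model, x):
    """Gradient explanation of the predicted class, differentiable w.r.t. model parameters."""
    x = x.clone().requires_grad_(True)
    scores = model(x)
    score = scores.gather(1, scores.argmax(dim=1, keepdim=True)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def train_manipulated(model, loader, h_target, gamma=1e6, lr=1e-4, epochs=10):
    original = copy.deepcopy(model).eval()      # frozen copy: the original model g
    manipulated = model                          # g_tilde, initialised at g
    optim = torch.optim.Adam(manipulated.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                target_out = original(x)
            # First term of (11): keep the outputs of g and g_tilde close.
            loss_out = ((manipulated(x) - target_out) ** 2).sum(dim=1).mean()
            # Second term of (11): push the explanation towards the target map.
            loss_expl = ((explanation(manipulated, x) - h_target) ** 2).flatten(1).sum(dim=1).mean()
            loss = loss_out + gamma * loss_expl
            optim.zero_grad()
            loss.backward()
            optim.step()
    return manipulated
```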
### 2.4. Explanation Manipulation: Practice

In this section, we will demonstrate the manipulation of explanations experimentally. We will first discuss applying logistic regression to credit assessment and then proceed to the case of deep neural networks in the context of image classification. The code for all our experiments is publicly available at https://github.com/fairwashing/fairwashing.

Credit Assessment: In the following, we will suppose that a bank uses a logistic regression algorithm to classify whether a prospective client should receive a loan or not. The classification uses the features $x = (x_{\mathrm{gender}}, x_{\mathrm{income}})$, where
$$x_{\mathrm{gender}} = \begin{cases} +1, & \text{for male} \\ -1, & \text{for female} \end{cases} \tag{12}$$
and $x_{\mathrm{income}}$ is the income of the applicant. Normalization is chosen such that the features are of the same order of magnitude. Details can be found in Appendix B.

Figure 1 (panels: original explanation vs. manipulated explanation). $x\odot$Grad explanations for the original classifier $g$ and the manipulated $\tilde{g}$ highlight completely different features. Colored bars show the median of the explanations over multiple examples.

We then define a logistic regression classifier $g$ by choosing the weights $w = (0.9, 0.1)$, i.e. female applicants are severely discriminated against. The discriminating nature of the algorithm may be detected by inspecting, for example, the gradient explanation maps $h^{\mathrm{grad}}_g = w$. Conversely, if the explanations did not show any sign of discrimination for another classifier $\tilde{g}$, the user might interpret this as a sign of its trustworthiness and fairness. However, the bank can easily fairwash the explanations, i.e. hide the fact that the classifier is sexist. This can be done by adding new features which are linearly dependent on the previously used features. As a simple example, one could add the applicant's paid taxes $x_{\mathrm{taxes}}$ as a feature. By definition, it holds that
$$x_{\mathrm{taxes}} = 0.4 \, x_{\mathrm{income}} \,, \tag{13}$$
where we assume that there is a fixed tax rate of 0.4 on all income. The features used by the classifier are now $x = (x_{\mathrm{gender}}, x_{\mathrm{income}}, x_{\mathrm{taxes}})$. By (13), all data samples $x$ obey
$$\hat{w}^T x = 0 \quad \text{with} \quad \hat{w} = (0, 0.4, -1) \,. \tag{14}$$
Therefore, the original classifier $g(x) = \sigma(w^T x)$ with $w = (0.9, 0.1, 0)$ leads to the same output as the classifier $\tilde{g}(x) = \sigma(w^T x + 1000\, \hat{w}^T x)$. However, as shown in Figure 1, the classifier $\tilde{g}$ has explanations which suggest that the two financial features (and not the applicant's gender) are important for the classification result. This example is merely an (oversimplified) illustration of a general concept: for each additional feature which linearly depends on the previously used features, a condition of the form (14) for some normal vector $\hat{w}$ is obtained. We can then construct a classifier with arbitrary explanation along each of these normal vectors.

Image Classification: We will now experimentally demonstrate the practical applicability of our methods in the context of image classification with deep neural networks.

Datasets: We consider the MNIST, Fashion MNIST, and CIFAR10 datasets. We use the standard training and test sets for our analysis. The data is normalized such that it has mean zero and standard deviation one. We sum the absolute values of the explanation over its channels to get the relevance per pixel. The resulting relevances are then normalized to sum to one.

Models: For CIFAR10, we use the VGG16 architecture (Simonyan & Zisserman, 2015). For Fashion MNIST and MNIST, we use a four-layer convolutional neural network. We train the model $g$ by minimizing the standard cross-entropy loss for classification. The manipulated model $\tilde{g}$ is then trained by minimizing the loss (11) for a given target explanation $h^t$. This target was chosen to have the shape of the number 42. For more details about the architectures and training, we refer to Appendix D.

Quantitative Measures: We assess the similarity between explanation maps using three quantitative measures: the structural similarity index (SSIM), the Pearson correlation coefficient (PCC), and the mean squared error (MSE). SSIM and PCC are relative similarity measures with values in [0, 1], where larger values indicate high similarity. The MSE is an absolute error measure for which values close to zero indicate high similarity. We also use the MSE metric as well as the Kullback-Leibler divergence to assess the similarity of the class scores of the manipulated model $\tilde{g}$ and the original network $g$.
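These measures are available in standard libraries; the helper below sketches how one could compute them with scikit-image and SciPy. It is our own utility, not the paper's evaluation code, and the function names are ours.

```python
import numpy as np
from scipy.stats import pearsonr, entropy
from skimage.metrics import structural_similarity

def explanation_similarity(h_a, h_b):
    """Compare two (already normalised) relevance maps of shape (H, W)."""
    ssim = structural_similarity(h_a, h_b, data_range=h_a.max() - h_a.min())
    pcc, _ = pearsonr(h_a.ravel(), h_b.ravel())
    mse = np.mean((h_a - h_b) ** 2)
    return ssim, pcc, mse

def output_similarity(p_orig, p_manip):
    """Compare softmax outputs of the original and manipulated model."""
    mse = np.mean((p_orig - p_manip) ** 2)
    kl = entropy(p_orig, p_manip)   # Kullback-Leibler divergence KL(p_orig || p_manip)
    return mse, kl
```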
Results: For all considered models, datasets, and explanation methods, we find that the manipulated model $\tilde{g}$ has explanations which closely resemble the target map $h^t$: e.g., the SSIM between the target and manipulated explanations is of the order 0.8. At the same time, the manipulated network $\tilde{g}$ has approximately the same output as the original model $g$, i.e. the mean-squared error of the outputs after the final softmax non-linearity is of the order $10^{-3}$. The classification accuracy changes by about 0.2 percent. Figure 2 illustrates this for examples from the Fashion MNIST and CIFAR10 test sets. We stress that we use a single model for the Gradient, $x\odot$Grad, and Integrated Gradients methods, which demonstrates that the manipulation generalizes over all considered gradient-based methods.

Figure 2. Example explanations from the original model $g$ (left) and the manipulated model $\tilde{g}$ (right). Images from the test sets of Fashion MNIST (top) and CIFAR10 (bottom).

The left-hand side of Figure 3 shows quantitatively that the manipulated model $\tilde{g}$ closely reproduces the target map $h^t$ over the entire test set of Fashion MNIST. We refer to Appendix D for additional similarity measures, examples, and quantitative analysis for all datasets.

Figure 3 (both panels plot SSIM($h(x), h^t$)). Left: SSIM of the target map $h^t$ and explanations of the original model $g$ and the manipulated $\tilde{g}$ respectively. Clearly, the manipulated model $\tilde{g}$ has explanations which closely resemble the target map $h^t$ over the entire Fashion MNIST test set. Right: Same as on the left but for tsp-explanations. Here the model $\tilde{g}$ was trained to manipulate the tsp-explanation. Evidently, tsp-explanations are considerably more robust than their unprojected counterparts on the left. Colored bars show the median. Errors denote the 25th and 75th percentile. Other similarity measures show similar behaviour and can be found in Appendix D.

## 3. Robust Explanations

Having demonstrated both theoretically and experimentally that explanations are highly vulnerable to model manipulation, we will now use our theoretical insights to propose explanation methods which are significantly more robust under such manipulations.

### 3.1. TSP Explanations: Theory

In this section, we will define a more robust gradient explanation method. Appendix C discusses analogous definitions for the other methods. We can formally define an explanation field $H_g$ which associates to every point $x$ on the data manifold $S$ the corresponding gradient explanation $h_g(x)$ of the classifier $g$. We note that $H_g$ is generically a vector field along the manifold, since $h_g(x) \in \mathbb{R}^D = T_x M$, i.e. it is an element of the tangent space $T_x M$ of the embedding manifold $M$ and not of the tangent space $T_x S$ of the data manifold $S$. As explained in Section 2.1, we can decompose the tangent space $T_x M$ of the embedding manifold $M$ as $T_x M = T_x S \oplus T_x S^\perp$. Let $P : T_x M \to T_x S$ be the projection onto the first summand of this decomposition. We stress that the form of the projector $P$ depends on the point $x \in S$, but we do not make this explicit in order to simplify notation. We can then define:

**Definition 1** The tangent-space-projected (tsp) explanation field $\hat{H}_g$ is a vector field on the data manifold $S$. It associates to each $x \in S$ the tangent-space-projected (tsp) explanation $\hat{h}_g(x)$ given by
$$\hat{h}_g(x) = (P \circ h_g)(x) \in T_x S \,. \tag{15}$$

Intuitively, the tsp-explanation $\hat{h}_g(x)$ is the explanation of the model $g$ projected onto the tangential directions of the data manifold. We recall from our discussion of Theorem 2 that we can always find classifiers $\tilde{g}$ which coincide with the original classifier $g$ on the data manifold $S$ but may differ in the gradient components orthogonal to the data manifold, i.e. for some $x \in S$ it holds that
$$(1 - P)\, \nabla g(x) \neq (1 - P)\, \nabla \tilde{g}(x) \,.$$
On the other hand, the components tangential to the manifold $S$ agree,
$$P\, \nabla g(x) = P\, \nabla \tilde{g}(x) \,, \quad \forall x \in S \,.$$
In other words, the tsp-gradient explanations of the original model $g$ and any such model $\tilde{g}$ are identical:
$$\hat{h}_g(x) = \hat{h}_{\tilde{g}}(x) \quad \forall x \in S \,. \tag{16}$$
It can therefore be expected that tsp-explanations $\hat{h}_g$ are significantly more robust than their unprojected counterparts $h_g$. For other explanation methods, the corresponding tsp-explanations may be obtained using a slightly modified projector $P$. We refer to Appendix C for more details.

### 3.2. TSP Explanations: Methods

Flat Submanifolds and Logistic Regression: Recall from Section 2.3 that for a logistic regression model $g(x) = \sigma(w^T x + c)$ with gradient explanation $h^{\mathrm{grad}}_g = w$, we can define a manipulated model
$$\tilde{g}(x) = \sigma\Big(w^T x + \sum_i \lambda_i \big(\hat{w}^{(i)T} x - b_i\big) + c\Big)$$
with gradient explanation $h^{\mathrm{grad}}_{\tilde{g}} = w + \sum_i \lambda_i \hat{w}^{(i)}$ for arbitrary $\lambda_i \in \mathbb{R}$. Since the vectors $\hat{w}^{(i)}$ are normal to the data hypersurface $S$, it holds that $P \hat{w}^{(i)} = 0$. As a result, the gradient tsp-explanations of the original model $g$ and its manipulated counterpart $\tilde{g}$ are identical, i.e.
$$\hat{h}^{\mathrm{grad}}_g = \hat{h}^{\mathrm{grad}}_{\tilde{g}} = P w \,. \tag{17}$$
We discuss the case of other explanation methods in Appendix C.1.
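Continuing the NumPy illustration of Section 2.3 (again our own sketch): with the projector $P = \mathbb{1} - \hat{w}\hat{w}^T$ onto the data plane, the projected gradient explanations of $g$ and $\tilde{g}$ coincide, exactly as in Eq. (17).

```python
import numpy as np

# Plane with (normalised) normal vector w_hat, as in the earlier sketch.
w_hat = np.array([0.0, 0.4, -1.0])
w_hat /= np.linalg.norm(w_hat)

# Projector onto the tangent space of the flat data manifold: P = 1 - w_hat w_hat^T.
P = np.eye(3) - np.outer(w_hat, w_hat)

w = np.array([0.9, 0.1, 0.0])      # gradient explanation of g
lam = 1000.0
w_tilde = w + lam * w_hat          # gradient explanation of the manipulated g_tilde

print(P @ w)                               # tsp-explanation of g
print(P @ w_tilde)                         # identical: the manipulation lives off-manifold
print(np.allclose(P @ w, P @ w_tilde))     # True
```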
General Case: In many practical applications, we do not know the explicit form of the projection matrix $P$. In these situations, we propose to construct $P$ by one of the following two methods.

Hyperplane method: For a given datapoint $x \in S$, we find its $k$ nearest neighbours $x_1, \ldots, x_k$ in the training set. We then estimate the data tangent space $T_x S$ by constructing the $d$-dimensional hyperplane with minimal Euclidean distance to the points $x, x_1, \ldots, x_k$. Let this hyperplane be spanned by an orthonormal basis $q_1, \ldots, q_d \in \mathbb{R}^D$. The projection matrix $P$ onto this hyperplane is then given by
$$P = \sum_{i=1}^{d} q_i q_i^T \,.$$

Autoencoder method: The hyperplane method requires that the data manifold is sufficiently densely sampled, i.e. the nearest neighbours are small deformations of the data point itself. In order to estimate the tangent space for datasets without this property, we use techniques from the well-established field of manifold learning. Following (Shao et al., 2018), we train an autoencoder on the dataset and then perform an SVD decomposition of the Jacobian of the decoder $\mathcal{D}$,
$$\frac{\partial \mathcal{D}}{\partial z} = U\, \Sigma\, V^T \,. \tag{18}$$
The projector is constructed from the left-singular vectors $u_1, \ldots, u_d \in \mathbb{R}^D$ corresponding to the $d$ largest singular values,
$$P = \sum_{i=1}^{d} u_i u_i^T \,. \tag{19}$$
The underlying motivation for this procedure is reviewed in Appendix C.2.

After one of these methods is used to estimate the projector $P$ for a given $x \in S$, the corresponding tsp-explanation can easily be computed as $\hat{h}(x) = P\, h(x)$.
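A possible implementation of the two estimators is sketched below. It reflects our reading of the procedure (a local SVD of the centred neighbourhood for the hyperplane method, and the SVD of the decoder Jacobian for the autoencoder method); function names and default values are our own choices, not taken from the paper's code.

```python
import numpy as np
import torch

def projector_hyperplane(x, train_data, k=20, d=10):
    """Hyperplane method: estimate P from the k nearest training neighbours of x."""
    dists = np.linalg.norm(train_data - x, axis=1)
    neighbours = train_data[np.argsort(dists)[:k]]
    cloud = np.concatenate([x[None], neighbours], axis=0)
    cloud = cloud - cloud.mean(axis=0)
    # Leading right-singular vectors of the centred point cloud span the estimated tangent space.
    _, _, vt = np.linalg.svd(cloud, full_matrices=False)
    Q = vt[:d].T                        # orthonormal basis q_1, ..., q_d
    return Q @ Q.T                      # P = sum_i q_i q_i^T

def projector_autoencoder(decoder, z, d=10):
    """Autoencoder method: estimate P from the decoder Jacobian at latent code z (eqs. 18-19)."""
    jac = torch.autograd.functional.jacobian(decoder, z)
    jac = jac.reshape(-1, z.numel())    # shape (D, dim(z))
    U, _, _ = torch.linalg.svd(jac, full_matrices=False)
    U_d = U[:, :d]                      # left-singular vectors u_1, ..., u_d
    return (U_d @ U_d.T).cpu().numpy()  # P = sum_i u_i u_i^T

# The tsp-explanation is then simply the projected explanation: h_tsp = P @ h.ravel()
```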
### 3.3. TSP Explanations: Practice

In this section, we will apply tsp-explanations to the examples of Section 2.4 and show that they are significantly more robust under model manipulations.

Credit Assessment: From the arguments of the previous section, it follows that the tsp-explanations of the manipulated and original model agree. We indeed confirm this experimentally, see Figure 4. We refer to Appendix B for more details.

Figure 4 (panels: original tsp-explanation vs. manipulated tsp-explanation). $x\odot$Grad tsp-explanations for the original classifier $g$ and the manipulated $\tilde{g}$ highlight the same features. Colored bars show the median of the explanations over multiple examples.

Image Classification: For MNIST and Fashion MNIST, we use the hyperplane method to estimate the tangent space. For CIFAR10, we find that the manifold is not densely sampled enough and we therefore use the autoencoder method. This is computationally expensive and takes about 48h using four Tesla P100 GPUs. We refer to Appendix D for more details.

Figure 5 shows the tsp-explanations for the examples of Figure 2. The explanation maps of the original and manipulated model show a high degree of visual similarity. This suggests that the manipulation occurred mainly in directions orthogonal to the data manifold (as the tsp-explanations are obtained from the original explanations by projecting out the corresponding components). This is also confirmed quantitatively, see Appendix D.

Figure 5. Tsp-explanations for the models and images of Figure 2. The tsp-explanations of the original model $g$ and the manipulated $\tilde{g}$ are similar, suggesting that the manipulations were mainly due to components orthogonal to the data manifold.

Furthermore, tsp-explanations tend to be considerably less noisy than their unprojected counterparts (see Figure 5 vs. Figure 2). This is expected from our theoretical analysis: consider gradient explanations for concreteness. Their components orthogonal to the data manifold are undetermined by training and are therefore essentially chosen at random. This fitting noise is projected out in the tsp-explanation, which results in a less noisy explanation.

If adversaries knew that tsp-explanations are used, they could also try to train a model $\tilde{g}$ which manipulates the tsp-explanations directly. However, tsp-explanations are considerably more robust to such manipulations, as shown on the right-hand side of Figure 3. We refer to Appendix D for a more detailed discussion.

## 4. Conclusion

A central message of this work is that widely-used explanation methods should not be used as proof for a fair and sensible algorithmic decision-making process. This is because they can be easily manipulated, as we have demonstrated both theoretically and experimentally. We propose modifications to existing explanation methods which make them more robust with respect to such manipulations. This is achieved by projecting explanations onto the tangent space of the data manifold, which is exciting because it connects explainability to the field of manifold learning. For applying these methods, it is however necessary to estimate the tangent space of the data manifold. For high-dimensional datasets, such as ImageNet, this is an expensive and challenging task. Future work will try to overcome this hurdle. Another promising direction for further research is to apply the methods developed in this work to other application domains such as natural language processing.

## Acknowledgements

We thank the reviewers for their valuable feedback. P.K. is greatly indebted to his mother-in-law as she took care of his sick son and wife during the final week before submission. We acknowledge Shinichi Nakajima for stimulating discussion. K-R.M. was supported in part by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A. This work is also supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0001779), as well as by the Research Training Group "Differential Equation- and Data-driven Models in Life Sciences and Fluid Dynamics (DAEDALUS)" (GRK 2433) and Grant Math+, EXC 2046/1, Project ID 390685689, both funded by the German Research Foundation (DFG).

## References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I. J., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 9525–9536, 2018.
Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., and Tapp, A. Fairwashing: the risk of rationalization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 161–170. PMLR, 2019. URL http://proceedings.mlr.press/v97/aivodji19a.html.

Alber, M., Lapuschkin, S., Seegerer, P., Hägele, M., Schütt, K. T., Montavon, G., Samek, W., Müller, K.-R., Dähne, S., and Kindermans, P. iNNvestigate neural networks! Journal of Machine Learning Research, 20, 2019.

Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015. doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140.

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

Dombrowski, A.-K., Alber, M., Anders, C., Ackermann, M., Müller, K.-R., and Kessel, P. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp. 13567–13578, 2019.

Ghorbani, A., Abid, A., and Zou, J. Y. Interpretation of neural networks is fragile. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 3681–3688, 2019.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Heo, J., Joo, S., and Moon, T. Fooling neural network interpretations via adversarial model manipulation. In Advances in Neural Information Processing Systems, pp. 2921–2932, 2019.

Kindermans, P., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer, 2019.

Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Reynolds, J., Melnikov, A., Lunova, N., and Reblitz-Richardson, O. PyTorch Captum. https://github.com/pytorch/captum, 2019.

Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., and Müller, K.-R. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10:1096, 2019.

Lee, J. M. Introduction to Smooth Manifolds. Springer, 2012.

Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.

Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K., and Müller, K.-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019. ISBN 978-3-030-28953-9. doi: 10.1007/978-3-030-28954-6.

Shao, H., Kumar, A., and Thomas Fletcher, P. The Riemannian geometry of deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 315–323, 2018.
Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3145–3153, 2017. URL http://proceedings.mlr.press/v70/shrikumar17a.html.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6034.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3319–3328, 2017. URL http://proceedings.mlr.press/v70/sundararajan17a.html.