Published as a conference paper at ICLR 2020

EXPLANATION BY PROGRESSIVE EXAGGERATION

Sumedha Singla
Department of Computer Science, University of Pittsburgh

Brian Pollack, Junxiang Chen
Department of Biomedical Informatics, University of Pittsburgh

Kayhan Batmanghelich
Department of Biomedical Informatics, Department of Computer Science, Intelligent Systems Program, University of Pittsburgh

ABSTRACT

As machine learning methods see greater adoption and implementation in high-stakes applications such as medical image diagnosis, the need for model interpretability and explanation has become more critical. Classical approaches that assess feature importance (e.g., saliency maps) do not explain how and why a particular region of an image is relevant to the prediction. We propose a method that explains the outcome of a classification black-box by gradually exaggerating the semantic effect of a given class. Given a query input to a classifier, our method produces a progressive set of plausible variations of that query, which gradually changes the posterior probability from its original class to its negation. These counterfactually generated samples preserve features unrelated to the classification decision, so that a user can employ our method as a tuning knob to traverse the data manifold while crossing the decision boundary. Our method is model agnostic and only requires the output value and gradient of the predictor with respect to its input.

1 INTRODUCTION

With the explosive adoption of deep learning for real-world applications, explanation and model interpretability have received substantial attention from the research community (Kim, 2015; Doshi-Velez & Kim, 2017; Molnar, 2019; Guidotti et al., 2019). Explaining the outcome of a model in high-stakes applications, such as medical diagnosis from radiology images, is of paramount importance for detecting hidden biases in data (Cramer et al., 2018), evaluating the fairness of the model (Doshi-Velez & Kim, 2017), and building trust in the system (Glass et al., 2008). For example, consider evaluating a computer-aided diagnosis of Alzheimer's disease from medical images. The physician should be able to assess whether the model pays attention to age-related or disease-related variations in an image in order to trust the system. Given a query, our model provides an explanation that gradually exaggerates the semantic effect of one class, which is equivalent to traversing the decision boundary from one side to the other.

Although not always clear-cut, there are subtle differences between interpretability and explanation (Turner, 2016). While the former mainly focuses on building or approximating models that are locally or globally interpretable (Ribeiro et al., 2016), the latter aims at explaining a predictor a posteriori. The explanation approach does not compromise prediction performance. However, a rigorous definition of what constitutes a good explanation remains elusive. Some researchers have focused on providing feature importance (e.g., in the form of a heatmap (Selvaraju et al., 2017)) for the features that influence the outcome of the predictor. In some applications (e.g., diagnosis with medical images), the causal changes are spread out across a large number of features (i.e., large portions of the image are impacted by a disease). Therefore, a heatmap may not be informative or useful, as almost all image features are highlighted. Furthermore, those methods do not explain why a predictor returns a given outcome.
Others have introduced local occlusions or perturbations to the input (Zhou et al., 2014; Fong & Vedaldi, 2017), assessing which manipulations have the largest impact on the predictions. There is also recent interest in generating counterfactual inputs that would change the black-box classification decision with respect to the query input (Goyal et al., 2019; Liu et al., 2019). Local perturbations of a query are not guaranteed to generate realistic or plausible inputs, which diminishes the usefulness of the explanation, especially for end users (e.g., physicians). We argue that the explanation should depend not only on the predictor function but also on the data. Therefore, it is reasonable to train a model that learns from the data as well as from the black-box classifier (e.g., (Chang et al., 2019; Dabkowski & Gal, 2017; Fong & Vedaldi, 2017)).

Our proposed method falls into the local explanation paradigm. Our approach is model agnostic and only requires access to the predictor's values and its gradient with respect to the input. Given a query input to a black-box, we aim at explaining the outcome by providing plausible and progressive variations of the query that result in a change to the output. The plausibility property ensures that the perturbation is natural-looking. A user can employ our method as a tuning knob to progressively transform inputs, traverse the decision boundary from one side to the other, and gain understanding of how the predictor makes its decisions. We introduce three principles for an explanation function that can be used beyond our application of interest. We evaluate our method on a set of benchmarks as well as on real medical imaging data. Our experiments show that the counterfactually generated samples are realistic-looking and, in the real medical application, satisfy an external evaluation. We also show that the method can be used to detect bias in the training of the predictor.

Consider a black-box classifier that maps an input space $\mathcal{X}$ (e.g., images) to an output space $\mathcal{Y}$ (e.g., labels). In this paper, we consider binary classification problems where $\mathcal{Y} = \{-1, +1\}$. To model the black-box, we use $f(x) = P(y|x)$ to denote the posterior probability of the classification. We assume that $f$ is a differentiable function and that we have access to its value as well as its gradient with respect to the input, $\nabla_x f(x)$.

Figure 1: (a) The schematic of the method: $f$ is the black-box function producing the posterior probability. $\delta$ is the required change in the black-box's output $f(x)$. $I_f(x, \delta)$ is an explainer function for $f$, which shifts the value of $f(x)$ by $\delta$. $E(\cdot)$ is an encoder that maps the data manifold $\mathcal{M}_x$ to the embedding manifold $\mathcal{M}_z$. $x_\delta$ is an abbreviation for $I_f(x, \delta)$. (b) The architecture of our model: $E$ is the encoder, $G^\delta_f$ denotes the conditional generator $G(\cdot, c_f(x, \delta))$, $f$ is the black-box, and $D$ is the discriminator. The circles denote loss functions.

We view the (visual) explanation of the black-box as a generative process that produces an input for the black-box that slightly perturbs the current prediction ($f(x) + \delta$) while remaining plausible and realistic. By repeating this process towards each end of the binary classification spectrum, we can traverse the prediction space from one end to the other and exaggerate the underlying effect. We conceptualize the traversal from one side of the decision boundary to the other as walking across a data manifold, $\mathcal{M}_x$.
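To make the assumed black-box interface concrete, the following is a minimal sketch (ours, not the authors' released code) of a wrapper exposing only the two signals the method requires: the posterior value $f(x)$ and its input gradient $\nabla_x f(x)$. The `model` argument and its single-logit, sigmoid-activated output are illustrative assumptions.

```python
import torch

class BlackBox:
    """Wraps a differentiable binary classifier so that only f(x) and its
    input-gradient are exposed, as assumed by the explainer."""

    def __init__(self, model):
        self.model = model  # assumed: any torch module returning one logit per input

    def posterior(self, x):
        # f(x) = P(y = 1 | x), a scalar in [0, 1] for each input
        with torch.no_grad():
            return torch.sigmoid(self.model(x))

    def posterior_and_grad(self, x):
        # returns f(x) and d f(x) / d x, the only signals the method needs
        x = x.clone().requires_grad_(True)
        fx = torch.sigmoid(self.model(x))
        grad = torch.autograd.grad(fx.sum(), x)[0]
        return fx.detach(), grad
```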
We assume the walk has a fixed step size and that each step of the walk makes a change of $\delta$ to the posterior probability of the classifier, $f$. Since the output of $f$ is bounded between $[0, 1]$, we can take at most $1/\delta$ steps. Each positive (negative) step increases (decreases) the posterior probability of the previous step. We assume that there is a low-dimensional embedding space ($\mathcal{M}_z$) that encodes the walk. An encoder, $E: \mathcal{M}_x \rightarrow \mathcal{M}_z$, maps an input, $x$, from the data manifold, $\mathcal{M}_x$, to the embedding space. A generator, $G: \mathcal{M}_z \rightarrow \mathcal{M}_x$, takes both the embedding coordinate and the number of steps and maps them back to the data manifold (see Figure 1).

We use $I_f(\cdot, \cdot)$ to denote the explainer function. Formally, $I_f(x, \delta): \mathcal{X} \times \mathbb{R} \rightarrow \mathcal{X}$ is a function that takes two arguments: a query image $x$ and the desired perturbation $\delta$. This function generates a perturbed image, which is then passed through the function $f$. The difference between the outputs of $f$ given the original image and the perturbed image should be the desired change, i.e., $f(x_\delta) - f(x) = \delta$. We use $x_\delta$ to denote $I_f(x, \delta)$. This formulation enables us to use $\delta$ as a knob to exaggerate the visual explanations of the query sample as it crosses the decision boundary given by the function $f$. Our proposed interpretability function $I_f$ should satisfy the following properties:

1. Data Consistency: perturbed samples generated by $I_f$ should lie on the data manifold, $\mathcal{M}_x$, to be consistent with the real data. In other words, the generated samples should look realistic when compared to other samples.
2. Compatibility with $f$: changing the second argument of $I_f(x, \cdot)$ should produce the desired outcome from the classifier $f$, i.e., $f(I_f(x, \delta)) \approx f(x) + \delta$.
3. Self Consistency: applying the reverse perturbation should recover the original image, i.e., $I_f(I_f(x, \delta), -\delta) = x$. Also, setting $\delta$ to zero should return the query, i.e., $I_f(x, 0) = x$.

Each criterion is enforced via a loss function; these are discussed in the following sections.

2.1 DATA CONSISTENCY

We adopt the Generative Adversarial Network (GAN) framework for our model (Goodfellow et al., 2014). GANs implicitly model the underlying data distribution by setting up a min-max game between a generative network ($G$) and a discriminative network ($D$):
$$\mathcal{L}_{\text{GAN}}(D, G) = \mathbb{E}_{x \sim P(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim P_z}\big[\log\big(1 - D(G(z))\big)\big],$$
where $z$ is the noise variable and $P_z$ its canonical distribution. There has been significant progress toward improving GAN stability as well as sample quality (Brock et al., 2019; Karras et al., 2019). The advantage of GANs is that they produce realistic-looking samples without an explicit likelihood assumption about the underlying probability distribution. This property is appealing for our application. Furthermore, we need to provide the desired amount of perturbation to the black-box, $f$. Hence, we use a Conditional GAN (cGAN), which allows the incorporation of a context as a condition to the GAN (Mirza & Osindero, 2014; Miyato & Koyama, 2018). To define the condition, we fix the step size, $\delta$, and discretize the walk, which effectively cuts the posterior probability range of the predictor (i.e., $[0, 1]$) into $1/\delta$ equally-sized bins. Hence, one can view the perturbation from $f(x)$ to $f(x) + \delta$ as changing the bin index from the current value $c_f(x, 0)$ to $c_f(x, \delta)$, where $c_f(x, \delta)$ returns the bin index of $f(x) + \delta$. We use $c_f(x, \delta)$ as the condition to the cGAN.
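As an illustration of this discretization, here is a minimal sketch (our own, with an assumed step size of 0.1) of computing the bin-index condition $c_f(x, \delta)$ from the current posterior and the requested change.

```python
def condition(f_x, delta, step=0.1):
    """Bin index c_f(x, delta): [0, 1] is cut into 1/step equally-sized bins
    and the condition is the index of the bin containing f(x) + delta."""
    n_bins = int(round(1.0 / step))                  # e.g. 10 bins for step = 0.1
    target = min(max(f_x + delta, 0.0), 1.0 - 1e-6)  # clamp inside [0, 1)
    return int(target * n_bins)                      # bin index in {0, ..., n_bins - 1}

# Example: f(x) = 0.23 and a requested change of +0.3 with step 0.1
# lands in bin 5 (target posterior 0.53).
print(condition(0.23, 0.3))  # 5
```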
The cGAN optimizes the following loss function:
$$\mathcal{L}_{\text{cGAN}}(D, G) = \mathbb{E}_{(x, c) \sim P(x, c)}\big[\log D(x, c)\big] + \mathbb{E}_{z \sim P_z, c \sim P_c}\big[\log\big(1 - D(G(z, c), c)\big)\big], \quad (1)$$
where $c$ denotes a condition. Instead of generating random samples from $P_z$, we use the output of an encoder, $E(x)$, as input to the generator. Finally, the explainer function is defined as:
$$I_f(x, \delta) = G(E(x), c_f(x, \delta)). \quad (2)$$
Our architecture is based on the Projection GAN (Miyato & Koyama, 2018), a modification of the cGAN. An advantage of the Projection GAN is that it scales well with the number of classes, allowing $\delta \rightarrow 0$. The Projection GAN imposes the following structure on the discriminator loss function:
$$\mathcal{L}_{\text{cGAN}}(D, \hat{G})(x, c) = \log \frac{p_{\text{data}}(c|x)}{q(c|x)} + \log \frac{p_{\text{data}}(x)}{q(x)} := r(c|x) + \psi(\phi(\hat{G}(z))), \quad (3)$$
where $\mathcal{L}_{\text{cGAN}}(D, \hat{G})$ indicates the loss function in Eq. 1 when $\hat{G}$ is fixed, and $\phi(\cdot)$ and $\psi(\cdot)$ are networks producing vector (feature) and scalar outputs, respectively. The term $r(c|x)$ is a conditional ratio function, which is discussed in Section 2.2.

2.2 COMPATIBILITY WITH THE BLACK BOX

In our model, the condition $c$ is an ordered variable, i.e., $c_f(x, \delta_1) < c_f(x, \delta_2)$ when $\delta_1 < \delta_2$. Therefore, we adapt the first term in Eq. 3 to account for ordinal multi-class regression by transforming $c_f(x, \delta)$ into $1/\delta - 1$ binary classification terms (Frank & Hall, 2001): $r(c = k \mid x) := \sum_i \cdots$

The group of abnormal images, $x^c$ ($f(x) > 0.9$), contains real images of abnormal x-rays positive for Cardiomegaly. For $x^h$ we generated counterfactuals $x^c_\delta$ such that $f(x^c_\delta) > 0.9$. Similarly, counterfactuals for $x^c$ are derived as $x^h_\delta$ such that $f(x^h_\delta) < 0.1$. In Figure 4 (b), we show the distribution of heart size in the four groups. We reported the dependent t-test statistics for the paired samples ($x^h$, $x^c_\delta$) and ($x^c$, $x^h_\delta$). A significant p-value (< 0.001) rejected the null hypothesis (i.e., that the two groups have similar distributions). We also reported the independent two-sample t-test statistics for the healthy ($x^h$ and $x^h_\delta$, p-value > 0.01) and abnormal ($x^c$ and $x^c_\delta$, p-value > 0.01) populations. Given the higher p-values, we cannot reject the null hypothesis of identical average distributions with high confidence. The explanations derived by our model successfully captured the change in heart size while generating counterfactual explanations.

Figure 4: Cardiomegaly is associated with a large heart size. (a) The positive correlation between the heart size and the response of the classifier, $f(x)$. (b) Comparison of the distribution of the heart size in the four groups. (c) The drop in accuracy of the classifier as we perturb the most relevant pixels (relevance calculated from the saliency map) in the image.

4.3 SALIENCY MAP

Saliency maps show the importance of each pixel of an image in the context of classification. Our method is not designed to produce a saliency map as a continuous score for every feature of the input. Instead, we extract an approximate saliency map by quantifying the regions that change the most when comparing explanations at the opposing ends of the classification spectrum. For each query image, we generate two visual explanations corresponding to the two extremes of the decision boundary, $f(x_\delta) = 0$ and $f(x_\delta) = 1$. The absolute difference between these explanations is our saliency map. Figure 5 shows the saliency map obtained from our method and its comparison with popular gradient-based methods. We restrict the saliency maps obtained from the different methods to have positive values and normalize them to the range [0, 1].
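The following is a minimal sketch of this saliency extraction, assuming an `explainer(x, delta)` callable that implements $I_f$ and a `black_box(x)` callable returning $f(x)$; the channel handling and normalization details are our assumptions, not the paper's.

```python
import numpy as np

def saliency_from_counterfactuals(x, explainer, black_box):
    """Approximate saliency map: absolute difference between the explanations
    generated at the two extremes of the decision boundary, i.e. pushing the
    posterior towards 0 and towards 1, normalized to [0, 1]."""
    f_x = black_box(x)
    x_neg = explainer(x, 0.0 - f_x)   # counterfactual with f(x_delta) close to 0
    x_pos = explainer(x, 1.0 - f_x)   # counterfactual with f(x_delta) close to 1
    diff = np.abs(x_pos - x_neg)      # per-pixel change across the spectrum
    if diff.ndim == 3:                # collapse color channels if present
        diff = diff.max(axis=-1)
    return (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
```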
Subjectively, the saliency maps produced by our method are well localized and are comparable to those of the other methods. We adapted the metric introduced in Samek et al. (2016) to compare the different saliency maps. In an iterative procedure, we progressively replace a percentage of the most relevant pixels in an image (as given by the saliency map) with random values sampled from a uniform distribution. We observe the corresponding change in the classification performance, as shown in Figure 4 (c). All the methods experience a drop in the accuracy of the classifier as the fraction of perturbed pixels increases. The saliency maps produced by our model are significantly better than random maps and are comparable to the other saliency map methods. It should be noted that there are many ways to quantify important regions in an image using the series of explanations generated by our method. We did not optimize to find the best saliency map and show results for only one such approach.

Figure 5: Our comparison with popular gradient-based saliency map methods on the prediction task of identifying smiling faces in the CelebA dataset.

4.4 BIAS DETECTION

Our model can discover confounding bias in the data used for training the black-box classifier. Confounding bias provides an alternative explanation for an association between the data and the target label. For example, a classifier trained to predict the presence of a disease may make decisions based on hidden attributes like gender, race, or age. In a simulated experiment, we trained two classifiers to identify smiling vs. not-smiling images in the CelebA dataset. The first classifier, $f_{\text{Biased}}$, is trained on a biased dataset, confounded with gender such that all smiling images are of male faces. We train a second classifier, $f_{\text{No-biased}}$, on an unbiased dataset, with data uniformly distributed with respect to gender. Note that we evaluate both classifiers on the same validation set. Additionally, we assume access to a proxy oracle classifier, $f_{\text{Gender}}$, that perfectly classifies the confounding attribute, i.e., gender.

As shown in Cohen et al. (2018), if the training data for the GAN is biased, then the inference will reflect that bias. In Figure 6, we compare the explanations generated for the two classifiers. The visual explanations for the biased classifier change the gender as they increase the amount of smile. We adapted the confounding metric proposed in Joshi et al. (2018) to summarize our results in Table 3. Given the data $\mathcal{D} = \{(x_i, y_i, a_i)\}$, with $x_i \in \mathcal{X}$ and $y_i, a_i \in \mathcal{Y}$, we say that a classifier is confounded by an attribute $a$ if the generated explanation $x_\delta$ has a different value of that attribute, as compared to the query image $x$, when processed through the oracle classifier $f_{\text{Gender}}$. The metric is formally defined as $\mathbb{E}_{\mathcal{D}}\big[\mathbb{1}\big(g(x_\delta) \neq a\big)\big]$, where $g$ denotes the oracle classifier. For the biased classifier, the oracle function predicted the female class for the majority of the images, while the unbiased classifier is consistent with the true gender distribution of the validation set. Thus, the fraction of generated explanations that changed the confounding attribute (gender) was found to be high for the biased classifier.
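A minimal sketch of this confounding metric under our reading of the definition; the oracle callable `f_gender` and the data layout are assumptions for illustration.

```python
def confounding_metric(explanations, attributes, f_gender):
    """Fraction of generated explanations x_delta whose confounding attribute
    (gender, as predicted by the oracle classifier) differs from the attribute
    a of the corresponding query image."""
    flips = sum(int(f_gender(x_d) != a) for x_d, a in zip(explanations, attributes))
    return flips / len(explanations)
```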
| Black-box classifier | Smiling | Not-Smiling |
|---|---|---|
| $f_{\text{Biased}}$ | Male: 0.52, Female: 0.48, Overall: 0.12 | Male: 0.18, Female: 0.82, Overall: 0.35 |
| $f_{\text{No-biased}}$ | Male: 0.48, Female: 0.52, Overall: 0.07 | Male: 0.47, Female: 0.53, Overall: 0.08 |

Table 3: Confounding metric for bias detection. For the target labels Smiling and Not-Smiling, the explanations are generated using the conditions $f(x) + \delta > 0.9$ and $f(x) + \delta < 0.1$, respectively. The Male and Female values quantify the fraction of the generated explanations classified as male or female, respectively, by the oracle classifier $f_{\text{Gender}}$. The Overall value quantifies the fraction of generated explanations that have a different gender compared to the query image. A small Overall value indicates less bias.

Figure 6: The visual explanations for two classifiers, both trained to classify the Smiling attribute on the CelebA dataset. For each example, the top row shows results from the Biased classifier, whose data distribution is confounded with gender. The bottom row shows explanations from the No-Biased classifier, trained on a data distribution uniform with respect to gender. The top label indicates the output of the classifier and the bottom label is the output of an oracle classifier for the confounding attribute, gender. The visual explanations for the Biased classifier change the gender as they add smile to the face.

5 CONCLUSION

In this paper, we proposed a novel interpretation method that explains the decision of a black-box classifier by producing natural-looking, gradual perturbations of the query image, resulting in an equivalent change in the output of the classifier. We evaluated our model on two very different datasets, including a medical imaging dataset. Our model produces high-quality explanations while preserving the identity of the query image. Our analysis shows that our explanations are consistent with the definition of the target disease without explicitly using that information. Our method can also be used to generate a saliency map in a model-agnostic setting. In addition to the interpretability advantages, our proposed method can also identify plausible confounding biases in a classifier.

REFERENCES

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9505–9515, 2018.

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

Edmund Kwadwo Kwakye Brakohiapa, Benard Ohene Botwe, Benjamin Dabo Sarkodie, Eric Kwesi Ofori, and Jerry Coleman. Radiographic determination of cardiomegaly using cardiothoracic ratio and transverse cardiac diameter: can one size fit all? Part one. The Pan African Medical Journal, 27:201, 2017. ISSN 1937-8688. doi: 10.11604/pamj.2017.27.201.12017. URL http://www.ncbi.nlm.nih.gov/pubmed/28904726, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC5579422.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.

Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining Image Classifiers by Counterfactual Generation, 2019. URL https://openreview.net/forum?id=B1MXz20cYQ.
Joseph Paul Cohen, Margaux Luck, and Sina Honari. Distribution matching losses can hallucinate features in medical image translation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 11070 LNCS, pp. 529–536. Springer, 2018. ISBN 9783030009274. doi: 10.1007/978-3-030-00928-1_60.

Henriette Cramer, Jean Garcia-Gathright, Aaron Springer, and Sravana Reddy. Assessing and addressing algorithmic bias in practice. Interactions, 25(6):58–63, 2018.

Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976, 2017.

Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603, 2018.

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

Ruth C. Fong and Andrea Vedaldi. Interpretable Explanations of Black Boxes by Meaningful Perturbation. In Proceedings of the IEEE International Conference on Computer Vision, 2017. ISBN 9781538610329. doi: 10.1109/ICCV.2017.371.

Eibe Frank and Mark Hall. A simple approach to ordinal classification. In European Conference on Machine Learning, pp. 145–156, 2001.

Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3681–3688, 2019.

Alyssa Glass, Deborah L. McGuinness, and Michael Wolverton. Toward establishing trust in adaptive agents. In Proceedings of the 13th International Conference on Intelligent User Interfaces, pp. 227–236. ACM, 2008.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual Visual Explanations. In International Conference on Machine Learning, pp. 2376–2384, 2019. URL http://arxiv.org/abs/1904.07451.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2019.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison, 2019. URL http://arxiv.org/abs/1901.07031.

Shalmali Joshi, Oluwasanmi Koyejo, Been Kim, and Joydeep Ghosh. xGEMs: Generating Examplars to Explain Black-Box Models. arXiv preprint arXiv:1806.08867, 2018.
Shalmali Joshi, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems. CoRR, abs/1907.09615, 2019. URL http://arxiv.org/abs/1907.09615.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.

Been Kim. Interactive and interpretable machine learning models for human machine collaboration. PhD thesis, Massachusetts Institute of Technology, 2015.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280, 2017.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In 34th International Conference on Machine Learning, ICML 2017, 2017. ISBN 9781510855144.

Shusen Liu, Bhavya Kailkhura, Donald Loveland, and Yong Han. Generative Counterfactual Introspection for Explainable Deep Learning. arXiv, 2019.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. CoRR, abs/1411.1784, 2014. URL http://arxiv.org/abs/1411.1784.

Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByS1VpgRZ.

Christoph Molnar. Interpretable Machine Learning. GitHub, 2019. https://christophm.github.io/interpretable-ml-book/.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. URL http://arxiv.org/abs/1505.04597.

Pouya Samangouei, Ardavan Saeedi, Liam Nakagawa, and Nathan Silberman. ExplainGAN: Model explanation via decision boundary crossing transformations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 666–681, 2018.

Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2016.

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. Computing Research Repository, 2013.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for Simplicity: The All Convolutional Net. In ICLR (workshop track), 2015.
Ryan Turner. A model explanation system. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2016.

B. van Ginneken, M. B. Stegmann, and M. Loog. Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database. Medical Image Analysis, 10(1):19–40, 2006.

Anthony J. Viera, Joanne M. Garrett, et al. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37(5):360–363, 2005.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. Computing Research Repository, 2014.

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017. URL http://arxiv.org/abs/1703.10593.

A.1 IMPLEMENTATION DETAILS

The architecture for the generator and discriminator is adapted from Miyato & Koyama (2018). The image encoding learned by the encoder $E(x)$ is fed into the generator. The condition $c_f(x, \delta)$ is passed to each ResNet block in the generator using conditional batch normalization. The generator has five ResNet blocks, where each block consists of BN-ReLU-Conv3-BN-ReLU-Conv3; BN is batch normalization, ReLU is the activation function, and Conv3 is a convolution layer. The encoder uses the same structure but downsamples the image. The discriminator has five ResNet blocks, each of which has the form ReLU-Conv3-ReLU-Conv3.
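As a rough illustration of one such generator block, here is a minimal PyTorch sketch assuming plain (unconditional) batch normalization, 3x3 kernels, equal input and output channels, and an identity skip connection; the paper's blocks instead condition the batch-norm layers on $c_f(x, \delta)$.

```python
import torch
import torch.nn as nn

class GeneratorResBlock(nn.Module):
    """BN-ReLU-Conv3-BN-ReLU-Conv3 with a residual (skip) connection.
    The paper conditions the batch-norm layers on c_f(x, delta); here we
    use plain BatchNorm2d to keep the sketch self-contained."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

# Example: a 64-channel feature map passes through one block with its shape unchanged
block = GeneratorResBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```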
A.2 XGEM IMPLEMENTATION

We refer to Joshi et al. (2019) for the implementation of xGEM. First, a VAE is trained to generate face images. The VAE used is available at https://github.com/LynnHo/VAE-Tensorflow. All settings and architectures were set to their default values. The original code generates images of dimension 64x64; we extended the given network to produce images with dimensions 128x128. The pretrained VAE is then extended to incorporate the cross-entropy loss for flipping the label of the query image. The model evaluates the cross-entropy loss by passing the generated image through the classifier. Figure 7 shows the qualitative difference between the explanations generated by our proposed method and xGEM.

A.3 EXTENDED RESULTS FOR EVALUATING THE CRITERIA OF THE EXPLAINER

Here, we provide results for four more prediction tasks on the CelebA dataset: no-beard or beard, heavy makeup or light makeup, black hair or not black hair, and bangs or no-bangs. Figure 8 shows the qualitative results, an extended version of the results in Figure 2. We evaluated the results from these prediction tasks for compatibility with the black-box $f$ (see Figure 9), data consistency, and self consistency (see Table 4).

| Prediction Task | FID (Present) | FID (Absent) | FID (Overall) | LSC | FVA |
|---|---|---|---|---|---|
| Smiling vs Not-smiling | 46.9 | 56.3 | 35.8 | 88.0 | 85.3 |
| Young vs Old | 67.5 | 74.4 | 53.4 | 81.6 | 72.2 |
| No beard vs Beard | 79.2 | 72.3 | 45.4 | 89.6 | 83.3 |
| Heavy makeup vs Light makeup | 64.9 | 98.2 | 39.2 | 89.2 | 75.3 |
| Black hair vs Not black hair | 55.8 | 72.8 | 34.8 | 79.4 | 81.6 |
| Bangs vs No bangs | 54.1 | 57.8 | 40.6 | 76.5 | 87.3 |

Table 4: Our model's results for six prediction tasks on the CelebA dataset. The FID (Fréchet Inception Distance) score measures data consistency, i.e., the quality of the generated explanations; lower FID is better. LSC (Latent Space Closeness) quantifies the fraction of the population for which the generated explanation is nearer to its query image than any other generated explanation in the embedding space. FVA (Face Verification Accuracy) measures the percentage of cases in which the query image and the generated explanation have the same face identity according to a model trained on VGGFace2. Higher LSC and FVA are better.

A.4 HUMAN EVALUATION

We used Amazon Mechanical Turk (AMT) to conduct human experiments to demonstrate that the progressive exaggeration produced by our model is visually perceivable to humans. We presented AMT workers with three tasks. In the first task, we evaluated whether humans can detect the relative order between two explanations produced for a given image. We asked the AMT workers, "Given two images of the same person, in which image is the person younger (or smiling more)?" (see Figure 10). We experimented with 200 query images and generated two pairs of explanations for each query image (i.e., 400 hits). The first pair (easy) imposed that the two images are samples from opposite ends of the explanation spectrum (counterfactuals), while the second pair (hard) made no such assumption. In the second task, we evaluated whether humans can identify the target class for which our model has provided the explanations. We asked the AMT workers, "What is changing in the images? (age, smile, hair-style or beard)". We experimented with 100 query images from each of the four attributes (i.e., 400 hits). In the third task, we demonstrate that our model can help the user identify problems like possible bias in the black-box training. Here, we used the same setting as in the second task but also showed explanations generated for a biased classifier. We asked the AMT workers, "What is changing in the images? (smile, or smile and gender)" (see Figure 10). We generated explanations for 200 query images each, from a biased-classifier ($f_{\text{Biased}}$) explainer from Section 4.4 and an unbiased-classifier ($f_{\text{No-biased}}$) explainer (i.e., 400 hits). In all three tasks, we collected eight votes for each task, evaluated them against the ground truth, and used the majority vote for calculating accuracy.

We summarize our results in Table 5. In the first task, the annotators achieved high accuracy for the easy pair, where there was a significant difference between the two explanation images, as compared to the hard pair, where the two explanations can have very subtle differences. Overall, the annotators were successful in identifying the relative order between the two explanation images. In the second task, the annotators were generally successful in correctly identifying the target class. The target class bangs proved to be the most difficult to identify, which was expected, as the generated images for bangs were qualitatively the most subtle. For the third task, the correct answer was always the target class, i.e., "smile". In the case of the biased-classifier explainer, the annotators selected "Smile and Gender" 12.5% of the time. The gradual progression made by the explainer for the biased classifier was very subtle and changed large regions of the face compared to the unbiased explainer. The difference is much more visible when we compare the explanations generated for the same query image by the biased and no-biased classifiers, as in Figure 6.
But in a realistic scenario, the no-biased classifier would not be available to compare against. Nevertheless, the annotators detected bias at roughly the same level of accuracy as our classifier (Table 3). Future work could improve upon bias detection.

| Annotation Task | Overall Accuracy | Overall κ-statistic | Sub-category | Accuracy | κ-statistic |
|---|---|---|---|---|---|
| Task-1 (Age) | 83.5% | 0.41 (Moderate) | Hard | 73% | 0.31 (Fair) |
| | | | Easy | 94% | 0.51 (Moderate) |
| Task-1 (Smile) | 77.5% | 0.28 (Fair) | Hard | 66% | 0.23 (Fair) |
| | | | Easy | 89.5% | 0.32 (Fair) |
| Task-2 (Identify Target Class) | 77% | 0.35 (Fair) | Age | 72% | - |
| | | | Smile | 99% | - |
| | | | Bangs | 50% | - |
| | | | Beard | 87% | - |
| Task-3 (Bias Detection) | 93.75% | 0.14 (Slight) | f_Biased | 87.5% | 0.09 (Slight) |
| | | | f_No-biased | 100% | 0.02 (Slight) |

Table 5: Summary of the human evaluation results. The κ-statistic measures inter-rater agreement for qualitative classification of items into mutually exclusive categories. One possible interpretation of κ, as given in Viera et al. (2005), is: < 0.0: Poor, 0.01–0.20: Slight, 0.21–0.40: Fair, 0.41–0.60: Moderate, 0.61–0.80: Substantial, and 0.81–1.00: Almost perfect agreement.

A.5 EVALUATING CLASS DISCRIMINATION

In multi-label settings, multiple labels can be true for a given image. In this test, we evaluated the sensitivity of our generated explanations to the class being explained. We consider a classifier trained to identify multiple attributes in face images from the CelebA dataset: young, smiling, black-hair, no-beard, and bangs. We used our model to generate explanations while considering one of the attributes as the target. Ideally, an explanation model trained to explain a target attribute should produce explanations consistent with the query image on all attributes besides the target. Figure 11 plots the fraction of the generated explanations that have a flipped source attribute as compared to the query image. Each column represents one source attribute. Each row is one run of our method to explain a given target attribute.

A.6 ABLATION STUDY

Our proposed model has three types of loss functions: the adversarial loss from the cGAN, the KL loss, and the reconstruction loss. The three losses enforce the three properties of our proposed explainer function: data consistency, compatibility with $f$, and self-consistency, respectively. In the ablation study, we quantify the importance of each of these components by training different models that differ in one hyper-parameter while the rest are kept fixed ($\lambda_{\text{cGAN}} = 1$, $\lambda_f = 1$, and $\lambda_{\text{rec}} = 100$). For data consistency, we evaluate the Fréchet Inception Distance (FID). The FID score measures the visual quality of the generated explanations by comparing them with real images. We show results for two groups. In the first group, we consider real and generated images for which the classifier has high confidence in the presence of the target label, i.e., $f(x_\delta), f(x) \in [0.9, 1.0]$. In the second group, the target label is absent, i.e., $f(x_\delta), f(x) \in [0.0, 0.1)$. We also report an overall score by considering all the real images and generated explanations together. For compatibility with $f$, we plot the desired output of the classifier, $f(x) + \delta$, against the actual output of the classifier, $f(x_\delta)$, on the generated explanations. For self consistency, we calculate the Latent Space Closeness (LSC) measure and the Face Verification Accuracy (FVA). LSC quantifies the fraction of the population for which the generated explanation is nearer to its query image than any other generated explanation in the embedding space. FVA measures the percentage of instances in which the query image and the generated explanation have the same face identity according to a model trained on VGGFace2.
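A minimal sketch of how LSC could be computed from encoder embeddings; the pairwise Euclidean-distance convention and array layout are our assumptions, not a specification from the paper.

```python
import numpy as np

def latent_space_closeness(query_emb, expl_emb):
    """LSC: fraction of queries whose own generated explanation is the nearest
    explanation in embedding space. Both inputs have shape (n, d); row i of
    expl_emb is the embedding of the explanation generated for query i."""
    # pairwise distances between every query and every explanation
    dists = np.linalg.norm(query_emb[:, None, :] - expl_emb[None, :, :], axis=-1)
    own_is_nearest = dists.argmin(axis=1) == np.arange(len(query_emb))
    return float(own_is_nearest.mean())
```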
For the ablation study, we consider the prediction task of young vs. old on the CelebA dataset. Figure 12 shows the results for compatibility with $f$. Table 6 summarizes the results for data consistency and self-consistency.

| λ_cGAN | λ_f | λ_rec | FID (Present) | FID (Absent) | FID (Overall) | LSC | FVA |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 100 | 69.7 | 105.7 | 67.2 | 96.1 | 99.8 |
| 1 | 1 | 100 | 67.5 | 74.4 | 53.4 | 81.6 | 72.2 |
| 10 | 1 | 100 | 89.4 | 105.2 | 63.0 | 68.0 | 82.7 |
| 100 | 1 | 100 | 71.6 | 80.6 | 44.26 | 75.3 | 18.0 |
| 1 | 0 | 100 | 66.2 | 66.2 | 44.9 | 77.2 | 99.4 |
| 1 | 1 | 100 | 67.5 | 74.4 | 53.4 | 81.6 | 72.2 |
| 1 | 10 | 100 | 95.5 | 90.4 | 62.4 | 71.83 | 96.8 |
| 1 | 100 | 100 | 77.4 | 73.1 | 71.2 | 55.4 | 42.23 |
| 1 | 1 | 0 | 116.2 | 118.9 | 72.2 | 16.6 | 0.0 |
| 1 | 1 | 1 | 63.0 | 78.6 | 61.6 | 32.2 | 5.5 |
| 1 | 1 | 10 | 87.6 | 83.6 | 65.7 | 71.5 | 88.8 |
| 1 | 1 | 100 | 67.5 | 74.4 | 53.4 | 81.6 | 72.2 |

Table 6: Ablation results of our model on the prediction task of young vs. old on the CelebA dataset. The FID (Fréchet Inception Distance) score measures the quality of the generated explanations; lower FID is better. LSC (Latent Space Closeness) quantifies the fraction of the population for which the generated explanation is nearer to its query image than any other generated explanation in the embedding space. FVA (Face Verification Accuracy) measures the percentage of cases in which the query image and the generated explanation have the same face identity according to a model trained on VGGFace2. Higher LSC and FVA are better.

Figure 7: Visual explanations generated for three prediction tasks on the CelebA dataset. The first column shows the query image, followed by the corresponding generated explanations.

Figure 8: Visual explanations generated for six prediction tasks on the CelebA dataset. The first column shows the query image, followed by the corresponding generated explanations. The values above each image are the output of the classifier $f$.

Figure 9: Plot of the expected outcome from the classifier, $f(x) + \delta$, against the actual response of the classifier on the generated explanations, $f(x_\delta)$. The monotonically increasing trend shows a positive correlation between $f(x) + \delta$ and $f(x_\delta)$; thus the generated explanations are consistent with the expected condition.

Figure 10: The interface for the human evaluation conducted using Amazon Mechanical Turk (AMT). Task-1 evaluated whether humans can detect the relative order between two explanations. Task-2 evaluated whether humans can identify the target class for which our model has provided the explanations. Task-3 demonstrated that our model can help the user identify problems like possible bias in the black-box training.

Figure 11: Each cell is the fraction of the generated explanations that have flipped a source attribute as compared to the query image. The x-axis is the source attribute and the y-axis is the target attribute for which the explanation is generated. Note: this is not a confusion matrix.

Figure 12: Ablation study showing the effect of the KL loss term. Plot of the expected outcome from the classifier, $f(x) + \delta$, against the actual response of the classifier on the generated explanations, $f(x_\delta)$.