Published as a conference paper at ICLR 2019

EXPLAINING IMAGE CLASSIFIERS BY COUNTERFACTUAL GENERATION

Chun-Hao Chang, Elliot Creager, Anna Goldenberg, & David Duvenaud
University of Toronto, Vector Institute
{kingsley,creager,duvenaud}@cs.toronto.edu
anna.goldenberg@utoronto.ca

ABSTRACT

When an image classifier makes a prediction, which parts of the image are relevant and why? We can rephrase this question to ask: which parts of the image, if they were not seen by the classifier, would most change its decision? Producing an answer requires marginalizing over images that could have been seen but weren't. We can sample plausible image in-fills by conditioning a generative model on the rest of the image. We then optimize to find the image regions that most change the classifier's decision after in-fill. Our approach contrasts with ad-hoc in-filling approaches, such as blurring or injecting noise, which generate inputs far from the data distribution and ignore informative relationships between different parts of the image. Our method produces more compact and relevant saliency maps, with fewer artifacts compared to previous methods.

1 INTRODUCTION

The decisions of powerful image classifiers are difficult to interpret. Saliency maps are a tool for interpreting differentiable classifiers that, given a particular input example and output class, compute the sensitivity of the classification with respect to each input dimension. Fong & Vedaldi (2017) and Dabkowski & Gal (2017) cast saliency computation as an optimization problem informally described by the following question: which inputs, when replaced by an uninformative reference value, maximally change the classifier output? Because these methods use heuristic reference values, e.g., blurred input (Fong & Vedaldi, 2017) or random colors (Dabkowski & Gal, 2017), they ignore the context of the surrounding pixels, often producing unnatural in-filled images (Figure 2). If we think of a saliency map as interrogating the neural network classifier, these approaches have to deal with the somewhat unusual question of how the classifier responds to images outside of its training distribution.

To encourage explanations that are consistent with the data distribution, we modify the question at hand: which region, when replaced by plausible alternative values, would maximally change classifier output? In this paper we provide a new model-agnostic framework for computing and visualizing the feature importance of any differentiable classifier, based on variational Bernoulli dropout (Gal & Ghahramani, 2016). We marginalize out the masked region, conditioning the generative model on the non-masked parts of the image to sample counterfactual inputs that either change or preserve classifier behavior. By leveraging a powerful in-filling conditional generative model we produce saliency maps on ImageNet that identify relevant and concentrated pixels better than existing methods.

2 RELATED WORK

Gradient-based approaches (Simonyan et al., 2013; Springenberg et al., 2014; Zhang et al., 2016; Selvaraju et al., 2016) derive a saliency map for a given input example and class target by computing the gradient of the classifier output with respect to each component (e.g., pixel) of the input. The reliance on local gradient information induces a bias due to gradient saturation or discontinuity in the DNN activations (Shrikumar et al., 2017). Adebayo et al. (2018) observed that some gradient-based saliency computations reflect an inductive bias due to the convolutional architecture, which is independent of the network parameter values.
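To make this gradient-based baseline concrete, the following is a minimal sketch (not part of the paper) of a vanilla gradient saliency map in PyTorch; the pretrained ResNet-50, the random stand-in input, and the variable names are illustrative assumptions.

```python
import torch
import torchvision.models as models

# Sketch of vanilla gradient saliency (Simonyan et al., 2013).
# Assumption: a pretrained ResNet-50 and an already-preprocessed input tensor.
model = models.resnet50(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(x)
target_class = logits.argmax(dim=1).item()
# Gradient of the target-class logit with respect to every input pixel.
logits[0, target_class].backward()
# Collapse the channel dimension to obtain one saliency value per pixel.
saliency = x.grad.abs().max(dim=1)[0]  # shape (1, 224, 224)
```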
[Figure 1 panels: (a) classifier, (b) heuristic in-filling, (c) generative in-filling.]
Figure 1: Graphical models. (1a) $p_M(c \mid x)$ is the classifier whose behavior we wish to analyze. (1b) To explain its response to a particular input $x$, we partition the input into a masked (unobserved) region $x_r$ and its complement, $x = x_r \cup x_{\setminus r}$. We then replace $x_r$ with an uninformative reference value $\hat{x}_r$ to test how important the region $x_r$ is to the classifier output $p_M(c \mid \hat{x}_r, x_{\setminus r})$. Heuristic in-filling (Fong & Vedaldi, 2017) computes $\hat{x}_r$ ad hoc, e.g., by image blur. This biases the explanation when the samples $[\hat{x}_r, x_{\setminus r}]$ deviate from the data distribution $p(x_r, x_{\setminus r})$. (1c) We instead sample $\hat{x}_r$ efficiently from a conditional generative model, $\hat{x}_r \sim p_G(\hat{x}_r \mid x_{\setminus r})$, which respects the data distribution.

Reference-based approaches analyze the sensitivity of classifier outputs to the substitution of certain inputs/pixels with an uninformative reference value. Shrikumar et al. (2017) linearly approximate this change in classifier output using an algorithm resembling backpropagation. This method is efficient and addresses gradient discontinuity, but ignores nonlinear interactions between inputs. Chen et al. (2018) optimize a variational bound on the mutual information between a subset of inputs and the target, using a variational family that sets input features outside the chosen subset to zero. In both cases, the choice of a background value as the reference limits applicability to simple image domains with static backgrounds like MNIST.

Zintgraf et al. (2017) compute the saliency of a pixel (or image patch) by treating it as unobserved and marginalizing it out, then measuring the change in classification outcome. This approach is similar in spirit to ours. The key difference is that where Zintgraf et al. (2017) iteratively execute this computation for each region, we leverage a variational Bernoulli distribution to efficiently search for an optimal solution while encouraging sparsity. This reduces computational complexity and allows us to model the interaction between disjoint regions of input space.

Fong & Vedaldi (2017) compute saliency by optimizing the change in classifier outputs with respect to a perturbed input, expressed as the pixel-wise convex combination of the original input with a reference image. They offer three heuristics for choosing the reference: mean input pixel value (typically gray), Gaussian noise, and blurred input. Dabkowski & Gal (2017) amortize the cost of estimating these perturbations by training an auxiliary neural network.

3 PROPOSED METHOD

Dabkowski & Gal (2017) propose two objectives for computing the saliency map:

Smallest Deletion Region (SDR) considers a saliency map as an answer to the question: What is the smallest input region that could be removed and swapped with alternative reference values in order to minimize the classification score?

Smallest Supporting Region (SSR) instead poses the question: What is the smallest input region that could be substituted into a fixed reference input in order to maximize the classification score?

Solving these optimization problems (which we formalize below) involves a search over input masks, and necessitates reference values to be substituted inside (SDR) or outside (SSR) the masked region.
These values were previously chosen heuristically, e.g., the mean pixel value per channel. We instead consider inputs inside (SDR) or outside (SSR) the masked region as unobserved variables to be marginalized efficiently by sampling from a strong conditional generative model. (Whereas Zintgraf et al. (2017) iteratively marginalize single patches conditioned on their surroundings, we model a more expressive conditional distribution that considers the joint interaction of non-contiguous pixels.) We describe our approach for an image application where the input comprises pixels, but our method is more broadly applicable to any domain where the classifier is differentiable.

[Figure 2 columns: Input; heuristics: Mean, Blur, Random; generative methods: Local, VAE, CA.]
Figure 2: Computed saliency for a variety of in-filling techniques. The classifier predicts the correct label, 'drake'. Each saliency map (top row) results from maximizing the in-class confidence of mixing a minimal region (red) of the original image with some reference image in the complementary (blue) region. The resulting mixture (bottom row) is fed to the classifier. We compare six methods for computing the reference: three heuristics and three generative models. We argue that strong generative models, e.g., the Contextual Attention GAN (CA) (Yu et al., 2018), ameliorate in-fill artifacts, making explanations more plausible under the data distribution.

Figure 3: Visualization of reference-value infilling methods under a centered mask. The ResNet output probability of the correct class is shown (as a percentage) for each imputed image.

Consider an input image $x$ comprising $U$ pixels, a class $c$, and a classifier with output distribution $p_M(c \mid x)$. Denote by $r$ a subset of the input pixels that implies a partition of the input, $x = x_r \cup x_{\setminus r}$. We refer to $r$ as a region, although it may be disjoint. We are interested in the classifier output when $x_r$ is unobserved, which can be expressed by marginalization as

$p_M(c \mid x_{\setminus r}) = \mathbb{E}_{x_r \sim p(x_r \mid x_{\setminus r})}\left[p_M(c \mid x_{\setminus r}, x_r)\right]$.   (1)

We then approximate $p(x_r \mid x_{\setminus r})$ by some generative model with distribution $p_G(x_r \mid x_{\setminus r})$ (specific implementations are discussed in Section 4.1). Then, given a binary mask $z \in \{0, 1\}^U$ (where $z_u = 0$ means the $u$-th pixel of $x$ is dropped out; the remaining image is denoted $x_{z=0}$) and the original image $x$, we define an infilling function $\phi$ as a convex mixture of the input and reference with binary weights,

$\phi(x, z) = z \odot x + (1 - z) \odot \hat{x}$, where $\hat{x} \sim p_G(\hat{x} \mid x_{z=0})$.   (2)

The infilling function is stochastic due to the randomness in $\hat{x}$.

3.1 OBJECTIVE FUNCTIONS

The classification score function $s_M(c)$ represents a score of classifier confidence on class $c$; in our experiments we use the log-odds:

$s_M(c \mid x) = \log p_M(c \mid x) - \log(1 - p_M(c \mid x))$.   (3)

SDR seeks a mask $z$ yielding a low classification score when a small number of reference pixels are mixed into the masked regions. Without loss of generality (a single mask $z^*$ can be recovered using the point-mass distribution $q_{z^*}(z) = \delta(z = z^*)$), we can specify a parameterized distribution over masks $q_\theta(z)$ and optimize its parameters.
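As a concrete illustration of Equations 2 and 3, the following is a minimal PyTorch sketch (not the released FIDO code); the `generator` call signature, the tensor shapes, and the clamping constant are illustrative assumptions.

```python
import torch

def infill(x, z, generator):
    """Equation 2: phi(x, z) = z * x + (1 - z) * x_hat, with x_hat ~ p_G(. | x_{z=0}).

    x: image tensor of shape (1, 3, H, W); z: mask of shape (1, 1, H, W) with entries
    in {0, 1} (z = 0 marks dropped-out pixels). `generator` stands in for any
    conditional in-filler that completes the dropped region given the kept pixels;
    its (masked_image, mask) signature is an assumption of this sketch.
    """
    x_hat = generator(x * z, z)           # plausible completion of the dropped region
    return z * x + (1.0 - z) * x_hat      # keep observed pixels, in-fill the rest

def log_odds(classifier, x, target_class):
    """Equation 3: s_M(c | x) = log p_M(c | x) - log(1 - p_M(c | x))."""
    probs = torch.softmax(classifier(x), dim=1)
    p = probs[:, target_class].clamp(1e-6, 1 - 1e-6)   # numerical safety
    return torch.log(p) - torch.log(1.0 - p)
```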
[Figure 4 panel confidences: Input p(c|x) = 1.000; Realtime p(c|x̂) = 0.448; FIDO-CA (ours) p(c|x̂) = 0.999; FIDO-CA \ Head p(c|x̂) = 0.945; FIDO-CA \ Body p(c|x̂) = 0.347.]
Figure 4: Classifier confidence on infilled images. Given an input, FIDO-CA finds a minimal pixel region that preserves the classifier score following in-fill by CA-GAN (Yu et al., 2018). Dabkowski & Gal (2017) (Realtime) assigns saliency coarsely around the central object, and the heuristic infill reduces the classifier score. We mask further regions (head and body) of the FIDO-CA saliency map by hand and observe a drop in the infilled classifier score. The label for this image is 'goose'.

The SDR problem is a minimization with respect to $\theta$ of

$\mathcal{L}_{\mathrm{SDR}}(\theta) = \mathbb{E}_{q_\theta(z)}\left[s_M(c \mid \phi(x, z)) + \lambda \lVert z \rVert_1\right]$.   (4)

On the other hand, SSR aims to find a masked region that maximizes the classification score while penalizing the size of the mask. For sign consistency with the previous problem, we express this as a minimization with respect to $\theta$ of

$\mathcal{L}_{\mathrm{SSR}}(\theta) = \mathbb{E}_{q_\theta(z)}\left[-s_M(c \mid \phi(x, z)) + \lambda \lVert 1 - z \rVert_1\right]$.   (5)

Naively searching over all possible $z$ is exponentially costly in the number of pixels $U$. Therefore we specify $q_\theta(z)$ as a factorized Bernoulli:

$q_\theta(z) = \prod_{u=1}^{U} q_{\theta_u}(z_u) = \prod_{u=1}^{U} \mathrm{Bern}(z_u \mid \theta_u)$.   (6)

This corresponds to applying Bernoulli dropout (Srivastava et al., 2014) to the input pixels and optimizing the per-pixel dropout rate. $\theta$ is our saliency map, since it has the same dimensionality as the input and provides the probability of each pixel being marginalized (SDR) or retained (SSR) prior to classification. We call our method FIDO because it uses a strong generative model (see Section 4.1) to Fill-In the Drop-Out region. To optimize $\theta$ through the discrete random mask $z$, we follow Gal et al. (2017) in computing biased gradients via the Concrete distribution (Maddison et al., 2017; Jang et al., 2017); we use temperature 0.1. We initialize all dropout rates $\theta$ to 0.5, since we find this increases convergence speed and avoids trivial solutions. We optimize using Adam (Kingma & Ba, 2014) with learning rate 0.05 and linearly decay the learning rate over 300 batches in all our experiments. Our PyTorch implementation takes about one minute on a single GPU to finish one image.
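A minimal sketch of this optimization loop, reusing the `infill` and `log_odds` helpers sketched after Equation 3. The Concrete relaxation, temperature 0.1, initialization of θ at 0.5, and Adam with learning rate 0.05 follow the text; the logit parameterization of θ, the retained-region form of the sparsity term, and the omission of the learning-rate decay are simplifications of this sketch, not the released FIDO implementation.

```python
import torch
from torch.distributions import RelaxedBernoulli

def fido_ssr(x, target_class, classifier, generator,
             lam=1e-3, n_steps=300, batch_size=8):
    """Hypothetical sketch of the FIDO-SSR inner loop with the Concrete relaxation."""
    h, w = x.shape[-2:]
    theta_logit = torch.zeros(1, 1, h, w, requires_grad=True)   # sigmoid(0) = 0.5
    optimizer = torch.optim.Adam([theta_logit], lr=0.05)

    for _ in range(n_steps):
        theta = torch.sigmoid(theta_logit)
        # Minibatch of relaxed Bernoulli ("Concrete") masks: differentiable samples in (0, 1).
        q_z = RelaxedBernoulli(temperature=torch.tensor(0.1),
                               probs=theta.expand(batch_size, -1, -1, -1))
        z = q_z.rsample()
        x_mix = infill(x.expand(batch_size, -1, -1, -1), z, generator)
        score = log_odds(classifier, x_mix, target_class)        # shape (batch_size,)
        # SSR-style loss: keep the classifier score high while encouraging a small
        # retained region (lambda is the sparsity hyperparameter of Algorithm 2).
        loss = (-score + lam * z.sum(dim=(1, 2, 3))).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.sigmoid(theta_logit).detach()                   # per-pixel saliency map
```

The released implementation additionally optimizes θ at a coarser resolution and applies a total-variation penalty, as described in Section 3.2; both are omitted here for brevity.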
3.2 COMPARISON TO FONG & VEDALDI (2017)

Fong & Vedaldi (2017) compute saliency by directly optimizing a continuous mask $z \in [0, 1]^U$ under the SDR objective, with $\hat{x}$ chosen heuristically; we call this approach Black Box Meaningful Perturbations (BBMP). We instead optimize the parameters of a Bernoulli dropout distribution $q_\theta(z)$, which enables us to sample reference values $\hat{x}$ from a learned generative model. Our method uses mini-batches of samples $z \sim q_\theta(z)$ to efficiently explore the huge space of binary masks and obtain uncertainty estimates, whereas BBMP is limited to a local search around the current point estimate of the mask $z$. See Figure 5 for a pseudo-code comparison. In Appendix A.1 we investigate how the choice of algorithm affects the resulting saliency maps.

Algorithm 1: BBMP (Fong & Vedaldi, 2017)
  Input: image x, classifier score s_M, sparsity hyperparameter λ, objective L ∈ {L_SSR, L_SDR}
  Initialize a single mask z ∈ [0, 1]^d
  while loss L is not converged do
    Clip z to [0, 1]
    Compute the reference image x̂ heuristically
    Compute the in-fill φ(x, z) = x ⊙ z + x̂ ⊙ (1 − z)
    With φ, compute L by Equation 4 or 5
    Update z with ∇_z L
  end while
  Return z as the per-feature saliency map

Algorithm 2: FIDO (ours)
  Input: image x, classifier score s_M, sparsity hyperparameter λ, objective L ∈ {L_SSR, L_SDR}, generative model p_G
  Initialize dropout rates θ ∈ [0, 1]^d
  while loss L is not converged do
    Sample a minibatch of masks z ∈ {0, 1}^d ∼ Bern(θ)
    Sample the reference image x̂ ∼ p_G(x̂ | z, x)
    Compute the in-fill φ(x, z) = x ⊙ z + x̂ ⊙ (1 − z)
    With φ, compute L by Equation 4 or 5
    Update θ with ∇_θ L (Maddison et al., 2017)
  end while
  Return θ as the per-feature saliency map

Figure 5: Pseudo-code comparison. Differences between the approaches are shown in blue.

To avoid unnatural artifacts in $\phi(x, z)$, Fong & Vedaldi (2017) and Dabkowski & Gal (2017) additionally include two forms of regularization: upsampling and total-variation penalization. Upsampling is used to optimize a coarser $\theta$ (e.g., 56×56 pixels), which is upsampled to the full dimensionality (e.g., 224×224) using bilinear interpolation. The total-variation penalty smooths $\theta$ via an $\ell_2$ penalty between spatially adjacent $\theta_u$. To avoid losing too much signal to regularization, we use upsampling size 56 and total-variation weight 0.01 unless otherwise mentioned. We examine the individual effects of these regularization terms in Appendices A.2 and A.4, respectively.

4 EXPERIMENTS

We first evaluate the various infilling strategies and objective functions for FIDO. We then compare explanations under several classifier architectures. In Section 4.5 we show that FIDO saliency maps outperform BBMP (Fong & Vedaldi, 2017) in a successive pixel removal task where pixels are in-filled by a generative model (instead of being set to a heuristic value). FIDO also outperforms the method of Dabkowski & Gal (2017) on the so-called Saliency Metric on ImageNet. Appendices A.1–A.6 provide further analysis, including consistency and the effects of additional regularization.

4.1 INFILLING METHODS

We describe several methods for producing the reference value $\hat{x}$ (sketched in code at the end of this subsection). The heuristics do not depend on $z$ and are taken from the literature. The proposed generative approaches, which produce $\hat{x}$ by conditioning on the non-masked inputs $x_{z=0}$, are novel to saliency computation.

Heuristics: Mean sets each pixel of $\hat{x}$ according to its per-channel mean across the training data. Blur generates $\hat{x}$ by blurring $x$ with a Gaussian kernel ($\sigma = 10$) (Fong & Vedaldi, 2017). Random samples $\hat{x}$ from independent per-pixel, per-channel uniform colors with Gaussian noise ($\sigma = 0.2$).

Generative models: Local computes $\hat{x}$ as the average value of the surrounding non-dropped-out pixels $x_{z=0}$ (we use a 15×15 window). VAE is an image-completion Variational Autoencoder (Iizuka et al., 2017); using the predictive mean of the decoder network worked better than sampling. CA is the Contextual Attention GAN (Yu et al., 2018); we use the authors' pre-trained model.

Figure 3 compares these methods under a centered mask. The heuristic in-fills appear far from the distribution of natural images. This is ameliorated by using a strong generative model like CA, which in-fills texture consistent with the surroundings. See Appendix A.5 for a quantitative comparison.
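The following is a hedged sketch of the heuristic and Local reference values, assuming (1, 3, H, W) image tensors in [0, 1] and a (1, 1, H, W) binary mask; the constants follow the text (blur σ = 10, noise σ = 0.2, 15×15 local window), but the ImageNet channel means and the exact form of the random reference are assumptions.

```python
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)  # assumed channel means

def mean_reference(x):
    # Per-channel training-set mean at every pixel.
    return IMAGENET_MEAN.expand_as(x)

def blur_reference(x, sigma=10.0):
    # Gaussian blur of the input (Fong & Vedaldi, 2017), applied as a separable kernel.
    radius = int(3 * sigma)
    coords = torch.arange(-radius, radius + 1, dtype=torch.float32)
    k = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, 1, -1).repeat(3, 1, 1, 1)
    x = F.conv2d(x, k, padding=(0, radius), groups=3)                      # horizontal pass
    return F.conv2d(x, k.transpose(2, 3), padding=(radius, 0), groups=3)   # vertical pass

def random_reference(x, sigma=0.2):
    # Per-pixel, per-channel uniform colors with Gaussian noise (after Dabkowski & Gal, 2017).
    return (torch.rand_like(x) + sigma * torch.randn_like(x)).clamp(0, 1)

def local_reference(x, z, window=15):
    # Average of the surrounding non-dropped-out pixels within a 15x15 window.
    pad = window // 2
    kept_sum = F.avg_pool2d(x * z, window, stride=1, padding=pad) * window ** 2
    kept_cnt = F.avg_pool2d(z.expand_as(x), window, stride=1, padding=pad) * window ** 2
    return kept_sum / kept_cnt.clamp(min=1.0)   # pixels with no kept neighbors become 0
```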
4.2 COMPARING THE SDR AND SSR OBJECTIVE FUNCTIONS

Here we examine the choice of objective function between $\mathcal{L}_{\mathrm{SDR}}$ and $\mathcal{L}_{\mathrm{SSR}}$; see Figure 6. We observed more artifacts in the $\mathcal{L}_{\mathrm{SDR}}$ saliency maps, especially when a weak in-filling method (Mean) is used. We suspect this unsatisfactory behavior is due to the relative ease of optimizing $\mathcal{L}_{\mathrm{SDR}}$. There are many degrees of freedom in input space that can increase the probability of any of the 999 classes besides $c$; this property is exploited when creating adversarial examples (Szegedy et al., 2013). Since it is more difficult to infill unobserved pixels that increase the probability of a particular class $c$, we believe $\mathcal{L}_{\mathrm{SSR}}$ encourages FIDO to find explanations more consistent with the classifier's training distribution. It is also possible that background texture is easier for a conditional generative model to fit. To mitigate the effect of artifacts, we use $\mathcal{L}_{\mathrm{SSR}}$ for the remaining experiments.

Figure 6: Choice of objective between $\mathcal{L}_{\mathrm{SDR}}$ and $\mathcal{L}_{\mathrm{SSR}}$. The classifier (ResNet) gives correct predictions for all the images. We show the $\mathcal{L}_{\mathrm{SDR}}$ and $\mathcal{L}_{\mathrm{SSR}}$ saliency maps under two infilling methods, Mean and CA. Here red means important and blue means non-important. We find that $\mathcal{L}_{\mathrm{SDR}}$ is more susceptible to artifacts in the resulting saliency maps than $\mathcal{L}_{\mathrm{SSR}}$.

4.3 COMPARING INFILLING METHODS

Here we demonstrate the merits of using a strong generative model, which produces substantially fewer artifacts and a more concentrated saliency map. In Figure 7 we generate saliency maps for the different infilling techniques by interpreting ResNet using $\mathcal{L}_{\mathrm{SSR}}$ with sparsity penalty $\lambda = 10^{-3}$. We observed a susceptibility of the heuristic in-filling methods (Mean, Blur, Random) to artifacts in the resulting saliency maps, which may fool edge filters in the low-level layers of the network. The use of generative in-filling (Local, VAE, CA) tends to mitigate this effect; we believe these methods encourage in-filled images to lie closer to the natural image manifold. We quantify the artifacts in the saliency maps by a proxy: the proportion of the MAP configuration ($\theta > 0.5$) that lies outside of the ground-truth bounding box. FIDO-CA produces the fewest artifacts by this metric (Figure 8).

Figure 7: Comparison of saliency maps under different infilling methods for FIDO-SSR using ResNet. Heuristic baselines (Mean, Blur, and Random) tend to produce more artifacts, while generative approaches (Local, VAE, CA) produce explanations more focused on the targets.

Figure 8: Proportion of the saliency map outside the bounding box, for different in-filling methods evaluating ResNet trained on ImageNet (1,971 images). Lower is better.
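For concreteness, a minimal sketch of this proxy (not from the paper's released code); `theta` and `bbox_mask` are assumed to be aligned (H, W) tensors.

```python
import torch

def fraction_outside_bbox(theta, bbox_mask, threshold=0.5):
    """Proxy artifact measure from Section 4.3: the fraction of the MAP
    configuration (theta > 0.5) lying outside the ground-truth bounding box.

    theta:     per-pixel saliency probabilities, shape (H, W)
    bbox_mask: binary tensor, 1 inside any ground-truth box, shape (H, W)
    """
    salient = theta > threshold
    if salient.sum() == 0:
        return 0.0
    outside = salient & ~bbox_mask.bool()
    return (outside.sum().float() / salient.sum().float()).item()
```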
4.4 INTERPRETING VARIOUS CLASSIFIER ARCHITECTURES

We use FIDO-CA to compute saliency for the same image under three classifier architectures: AlexNet, VGG, and ResNet; see Figure 9. Each architecture correctly classified all the examples. We observed a qualitative difference in how the classifiers prioritize different input regions (according to the saliency maps). For example, in the last image we can see that AlexNet focuses more on the body region of the bird, while VGG and ResNet focus more on the head features.

Figure 9: Comparison of saliency maps for several classifier architectures. We compare three networks, AlexNet, VGG, and ResNet, using FIDO-CA with $\lambda = 10^{-3}$.

4.5 QUANTITATIVE EVALUATION

We follow Fong & Vedaldi (2017) and Shrikumar et al. (2017) in measuring the classifier's sensitivity to successively altering pixels in order of their saliency scores. Intuitively, the best saliency map should compactly identify relevant pixels, so that the predictions are changed with a minimum number of altered pixels. Whereas previous works flipped salient pixel values or set them to zero, we note that this moves the classifier inputs out of distribution. We instead drop out pixels in saliency order and infill their values with our strongest generative model, CA-GAN. To make the log-odds score suppression comparable between images, we normalize per image by the final log-odds suppression score (all pixels infilled). In Figure 10 we evaluate ResNet, carrying out our scoring procedure on 1,533 randomly selected, correctly predicted ImageNet validation images, and report the number of pixels required to reduce the normalized log-odds score by a given percentage. We evaluate FIDO under various in-filling strategies as well as BBMP with the Blur and Random in-filling strategies. We put both algorithms on equal footing by using $\lambda = 10^{-3}$ for FIDO and BBMP (see Appendix A.1 for further comparisons). We find that strong generative infilling (VAE and CA) yields more parsimonious saliency maps, which is consistent with our qualitative comparisons. FIDO-CA can achieve a given normalized log-odds score suppression using fewer pixels than competing methods. While FIDO-CA may be better adapted to evaluation using CA-GAN, we note that the other generative in-filling approaches (FIDO-Local and FIDO-VAE) still outperform heuristic in-filling when evaluated with CA-GAN.

Figure 10: Number of salient pixels required to change the normalized classification score (x-axis: percent of normalized log-odds score suppression, 10-40%; y-axis: minimal removed-region size in pixels; methods: BBMP-Blur, BBMP-Random, FIDO-Mean, FIDO-Blur, FIDO-Random, FIDO-Local, FIDO-VAE, FIDO-CAGAN). Pixels are sorted by saliency score and successively replaced with CA-GAN in-filled values. We select $\lambda = 10^{-3}$ for BBMP and FIDO. Lower is better.

We compare our algorithm to several strong baselines on two established metrics. We first evaluate whether the FIDO saliency map can solve weakly supervised localization (WSL) (Dabkowski & Gal, 2017). After thresholding the saliency map $\theta$ above 0.5, we compute the smallest bounding box containing all salient pixels. This prediction is correct if it has an intersection-over-union (IoU) ratio over 0.5 with any of the ground-truth bounding boxes. Using FIDO with various infilling methods, we report the average error rate across all 50,000 validation images in Table 1. We evaluate the authors' pre-trained model of Dabkowski & Gal (2017) (their PyTorch model from https://github.com/PiotrDabkowski/pytorch-saliency), denoted as Realtime in the results. We also include five baselines: Max (entire input as the bounding box), Center (centered bounding box occupying half the image), Grad (Simonyan et al., 2013), Deconvnet (Springenberg et al., 2014), and Grad-CAM (Selvaraju et al., 2016). We follow the mean-thresholding procedure of Fong & Vedaldi (2017): we normalize the heatmap between 0 and 1 and binarize it at the threshold $\alpha \mu_i$, where $\mu_i$ is the average heatmap value for image $i$. We then take the smallest bounding box that encompasses the entire binarized heatmap. We search $\alpha$ between 0 and 5 with a step size of 0.2 on a holdout set to obtain the minimum WSL error; the best $\alpha$ are 1.2, 2, and 1, respectively.

| | Max | Center | Grad | Deconv | Grad-CAM | Realtime | Mean | Blur | Random | Local | VAE | CA | BBox |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WSL error | 59.7% | 46.9% | 52.1% | 49.4% | 41.9% | 47.8% | 53.1% | 51.6% | 50.8% | 46.7% | 50.2% | 57.0% | 0.0% |
| SM | 1.029 | 0.414 | 0.577 | 0.627 | 0.433 | 0.076 | 0.836 | 0.766 | 0.728 | 0.543 | 0.433 | -0.021 | 0.392 |

Table 1: Weakly Supervised Localization (WSL) error and Saliency Metric (SM). All methods are evaluated under ResNet on the 50k-image ImageNet validation set; Mean through CA denote FIDO with the corresponding in-filling method. For both metrics, lower is better.
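A hedged sketch of this thresholding-and-bounding-box procedure and the IoU check; the box convention and the fallback behavior when no pixel exceeds the threshold are assumptions of this illustration, not the paper's evaluation code.

```python
import torch

def wsl_bounding_box(saliency, alpha=1.0):
    """Mean thresholding (after Fong & Vedaldi, 2017): normalize the heatmap to [0, 1],
    binarize at alpha times its mean, and return the smallest box containing all
    binarized pixels as (row_min, row_max, col_min, col_max). `saliency` is (H, W)."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    mask = s > alpha * s.mean()
    rows, cols = torch.where(mask)
    if rows.numel() == 0:                  # nothing above threshold: fall back to full image
        return 0, s.shape[0] - 1, 0, s.shape[1] - 1
    return rows.min().item(), rows.max().item(), cols.min().item(), cols.max().item()

def iou(box_a, box_b):
    """Intersection-over-union of two (row_min, row_max, col_min, col_max) boxes;
    a localization counts as correct when IoU > 0.5 with any ground-truth box."""
    r0, r1 = max(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    c0, c1 = max(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, r1 - r0 + 1) * max(0, c1 - c0 + 1)
    area = lambda b: (b[1] - b[0] + 1) * (b[3] - b[2] + 1)
    return inter / float(area(box_a) + area(box_b) - inter)
```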
FIDO-CA frugally assigns saliency to contextually important pixels while preserving classifier confidence (Figure 4), so we do not necessarily expect our saliency maps to correlate with the typically large human-labeled bounding boxes. The reliance on human-labeled bounding boxes makes WSL suboptimal for evaluating saliency maps, so we also evaluate the so-called Saliency Metric proposed by Dabkowski & Gal (2017), which eschews the human-labeled bounding boxes. The smallest bounding box $A$ is computed as before. The image is then cropped using this bounding box and upscaled to its original size. The Saliency Metric is $\log \max(\mathrm{Area}(A), 0.05) - \log p(c \mid \mathrm{CropAndUpscale}(x, A))$, the log ratio between the bounding-box area and the in-class classifier probability after upscaling. This metric represents the concentration of information about the label within the bounded region. From the superior performance of FIDO-CA we conclude that a strong generative model regularizes explanations towards the natural image manifold and finds concentrated regions of features relevant to the classifier's prediction.
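Since both the main comparison and the ablation below report this metric, here is a minimal sketch of the Saliency Metric computation; the classifier interface, the box convention shared with the WSL sketch above, and the numerical floor inside the second logarithm are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def saliency_metric(classifier, x, target_class, box):
    """Saliency Metric of Dabkowski & Gal (2017) as described in Section 4.5:
    log(max(area(A), 0.05)) - log p(c | CropAndUpscale(x, A)).
    `box` is (row_min, row_max, col_min, col_max); x has shape (1, 3, H, W)."""
    _, _, h, w = x.shape
    r0, r1, c0, c1 = box
    # Bounding-box area as a fraction of the image, floored at 0.05.
    area = (r1 - r0 + 1) * (c1 - c0 + 1) / float(h * w)
    crop = x[:, :, r0:r1 + 1, c0:c1 + 1]
    upscaled = F.interpolate(crop, size=(h, w), mode='bilinear', align_corners=False)
    p = torch.softmax(classifier(upscaled), dim=1)[0, target_class]
    return math.log(max(area, 0.05)) - math.log(p.item() + 1e-12)
```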
Figure 11: Examples from the ablation study. We show how each of our two innovations, FIDO (searching over Bernoulli masks) and generative in-filling, improves on previous methods that adopt BBMP with heuristic infilling (e.g., Blur and Random). Specifically, we compare against a new variant, BBMP-CA, that uses strong generative in-filling (CA-GAN) by thresholding the continuous masks; we test a variety of decreasing thresholds. We find that both FIDO and generative in-filling (CA-GAN) are needed to produce compact saliency maps (the right-most column) that retain class information. See Appendix B for more qualitative examples and Appendix A.7 for quantitative results.

4.6 ABLATION STUDY

Can existing algorithms be improved by adding an in-filling generative model without modeling a discrete distribution over per-feature masks? And does filling in the dropped-out region suffice without an expressive generative model? We carried out an ablation study that suggests the answer is no on both counts. We compare FIDO-CA to a BBMP variant that uses CA-GAN infilling (called BBMP-CA); we also evaluate FIDO with heuristic infilling (FIDO-Blur, FIDO-Random). Because the continuous mask of BBMP does not naturally partition the features into observed/unobserved, BBMP-CA first thresholds the mask, $r = \mathbb{I}(z > \tau)$, before generating the reference $\phi(x_r, x_{\setminus r})$ with a sample from CA-GAN. We sweep the value of $\tau$ over 1, 0.7, 0.5, 0.3, 0.1, and 0. We find BBMP-CA is brittle with respect to its threshold value, producing either too spread-out or too stringent saliency maps (Figure 11). We observed that FIDO-Blur and FIDO-Random produce more concentrated saliency maps than their BBMP counterparts with fewer artifacts, while FIDO-CA produces the most concentrated region on the target with the fewest artifacts. Each of these baselines was evaluated on the two quantitative metrics (Appendix A.7); BBMP-CA considerably underperformed relative to FIDO-CA.

5 SCOPE AND LIMITATIONS

Because classifier behavior is ill-defined for out-of-distribution inputs, any explanation that relies on out-of-distribution feature values is unsatisfactory. By modeling the input distribution via an expressive generative model, we can encourage explanations that rely on counterfactual inputs close to the natural manifold. However, our performance is then upper-bounded by the ability of the generative model to capture the conditional input density. Fortunately, this bound will improve alongside future improvements in generative modeling.

6 CONCLUSION

We proposed FIDO, a new framework for explaining differentiable classifiers that uses adaptive Bernoulli dropout with strong generative in-filling to combine the best properties of recently proposed methods (Fong & Vedaldi, 2017; Dabkowski & Gal, 2017; Zintgraf et al., 2017). We compute saliency by marginalizing over plausible alternative inputs, revealing concentrated pixel areas that preserve label information. Through quantitative comparisons we find that the FIDO saliency map provides more parsimonious explanations than existing methods. FIDO provides novel but relevant explanations for the classifier in question by highlighting contextual information relevant to the prediction and consistent with the training distribution. We released the code in PyTorch at https://github.com/zzzace2000/FIDO-saliency.

ACKNOWLEDGMENTS

We thank Yu et al. (2018) for releasing the code and pretrained CA-GAN model that make this work possible. We also thank Jesse Bettencourt and David Madras for their helpful suggestions.

REFERENCES

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525-9536, 2018.

Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pp. 882-891, 2018.

Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967-6976, 2017.

R. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050-1059, 2016.

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581-3590, 2017.

Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (Proc. of SIGGRAPH), 2017.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.

Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, pp. 3145-3153. JMLR.org, 2017.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505-5514, 2018.

Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision (ECCV), 2016.

Luisa M. Zintgraf, Taco S. Cohen, Tameem Adel, and Max Welling. Visualizing deep neural network decisions: Prediction difference analysis. In International Conference on Learning Representations, 2017.

A FURTHER ANALYSIS

A.1 COMPARISONS OF BBMP AND FIDO

Here we compare FIDO with two previously proposed methods: BBMP with the Blur in-filling strategy (Fong & Vedaldi, 2017) and BBMP with the Random in-filling strategy (Dabkowski & Gal, 2017). One potential concern in qualitatively comparing these methods is that each method might have a different sensitivity to the sparsity parameter $\lambda$. Subjectively, we observe that BBMP requires a roughly five times higher sparsity penalty $\lambda$ to get visually comparable saliency maps. In our comparisons we sweep $\lambda$ over a reasonable range for each method and show the resulting sequence of increasingly sparse saliency maps (Figure 12). We use $\lambda = 5 \times 10^{-4}, 10^{-3}, 2 \times 10^{-3}, 5 \times 10^{-3}$. We observe that all methods are prone to artifacts in the low-$\lambda$ regime, so the appropriate selection of this value is clearly important. Interestingly, BBMP-Blur and BBMP-Random find artifacts of different quality: small patches and pixels for Blur, and structured off-object lines for Random. FIDO with CA arguably gives the best saliency maps, producing fewer artifacts and concentrating saliency on small regions of the images.

Figure 12: BBMP vs. FIDO saliency maps with the sparsity penalty $\lambda$ increasing from left to right. We compare BBMP under the Blur and Random in-filling strategies with FIDO under the CA in-filling strategy. Note that BBMP and FIDO use different $\lambda$ here. The heuristics produce more artifacts (BBMP-Blur) or spurious lines (BBMP-Random) compared to our method (FIDO-CAGAN).

A.2 UPSAMPLING EFFECT

Here we examine the effect of learning a reduced-dimensionality $\theta$ that is upsampled to the full image size during optimization. We consider a variety of upsampling rates; in a slight abuse of terminology, we refer to the upsampling size as the square root of the dimensionality of $\theta$ before upsampling, so a smaller size implies more upsampling. In Figure 13, we demonstrate two examples with different upsampling sizes under the Mean and CA infilling methods with the SSR objective.
The weaker infilling strategy, Mean, apparently requires stronger regularization than CA to avoid artifacts. Note that although CA produces far fewer artifacts than Mean, it still produces some small artifacts outside of the objects, which is undesirable. We therefore choose an upsampling size of 56 for the rest of our experiments, to balance detail against the removal of artifacts.

Figure 13: Comparison of the upsampling effect for the Mean and CA infilling methods with no total-variation penalty. The upsampling regularization removes artifacts, especially for the weaker infilling method, Mean.

A.3 STABILITY

To assess the stability of our method, we run it with different random seeds and check whether the resulting maps are similar. In Figure 14, our method produces similar saliency maps for four different random seeds.

Figure 14: Testing the stability of our method with four different random seeds. They produce similar saliency maps. (Using the CA infilling method with ResNet and $\lambda = 10^{-3}$.)

A.4 TOTAL VARIATION EFFECT

Here we test the effect of the total-variation prior regularization in Figure 15. We find that total variation can further reduce the adversarial artifacts, while risking the loss of signal when the total-variation penalty is too strong.

A.5 ANALYSIS OF GENERATIVE MODEL INFILLING

Here we quantitatively compare the in-filling strategies. The generative approaches (VAE and CA) produce visually sharper images than the four other baselines. Since we expect that this random removal of pixels should not remove the target information, we use the classification probability of ResNet as our metric for how well each infilling method recovers the target prediction. We quantitatively evaluate this probability on 1,000 validation images in Figure 16. We find that VAE and CA consistently outperform the other methods, yielding higher target probability. We also note that all the heuristic baselines (Mean, Blur, Random) perform much worse: owing to their heuristic nature, the images they generate are unlikely under the distribution of natural images, which leads to poor classifier performance.

Figure 15: Total-variation regularization effect. We show the saliency maps for four increasing levels of total-variation (TV) regularization. Strong regularization risks removing signal.

[Figure 16 box-plot categories: Input, Mean, Blur, Random, Local, VAE, CA-GAN; y-axis: probability.]
Figure 16: Box plot of the classifier probability under different infilling methods with respect to randomly masked pixels, using ResNet on 1,000 images. The generative models (VAE and CA) perform much better in terms of classifier probability.

A.6 BATCH SIZE EFFECTS

Figure 17 shows the effect of batch size on the saliency map. We found unsatisfactory results for batch sizes less than 4, which we attribute to the high variance in the resulting gradient estimates.

Figure 17: Batch-size effect on the final saliency output. We observed unsatisfactory results for batch sizes less than 4.

A.7 ABLATION STUDY

We show the performance of BBMP-CA with various thresholds $\tau$ on both WSL and SM on a subset of 1,000 images in Table 2. We also show more qualitative examples in Figure 22. We find that BBMP-CA is relatively brittle across different thresholds $\tau$.
We also perform the flipping experiment in Figure 18 and show our FIDO-CA substantially outperforms BBMP-CA with varying different thresholds. BBMP BBMP-CA FIDO Blur Random 0 0.1 0.2 0.3 0.4 0.5 0.7 1 Blur Random CA BBox WSL 61.3% 55.0% 68.6% 90.2% 89.1% 79.1% 57.5% 63.1% 63.5% 63.5% 48.7% 53.3% 52.1% 0.0% SM 1.097 0.912 0.722 1.426 1.272 0.535 0.595 1.103 1.123 1.117 0.763 0.784 -0.092 0.432 Table 2: Weakly Supervised Localization (WSL) error and Saliency Metric (SM) with comparisons on BBMP-CA with varying thresholds τ. FIDO (various in-filling methods) evaluating Res Net on 1, 000 validation sets of Image Net. For both metrics the lower the better. 10% 20% 30% 40% Percent of Normalized Log-odds Score Suppression Minimal Removed Region Size (pixel) BBMP-Blur BBMP-Random BBMP-CA BBMP-CA 0. BBMP-CA 0.1 BBMP-CA 0.3 BBMP-CA 0.5 BBMP-CA 1. FIDO FIDO-Blur FIDO-Random FIDO-CAGAN Figure 18: Number of salient pixels required to change normalized classification score with comparison to BBMP-CA across variety of thresholds. Pixels are sorted by saliency score and successively replaced with CA-GAN in-filled values. The lower the better. B MORE EXAMPLES Figure 19 shows several more infilled counterfactual images, along with the counterfactuals produced by the method from Dabkowski & Gal (2017). More examples comparing the various FIDO infilling approaches can be found in Figure 20 and 21. Published as a conference paper at ICLR 2019 Input Realtime FIDO-CA Input Realtime FIDO-CA 0.999 0.057 1.000 0.233 0.075 0.736 0.999 0.997 1.000 1.000 0.000 1.000 Figure 19: More examples of classifier confidence on infilled images. Realtime denotes the method of Dabkowski & Gal (2017); FIDO-CA is our method with CA-GAN infilling (Yu et al., 2018). Classifier confidence p(c|ˆx) is reported below the input and each infilled image. We hypothesize by that FIDO-CA is able to isolate compact pixel areas of contextual information. For example, in the upper right image pixels in the net region around the fish are highlighted; this context information is missing from the Realtime saliency map but are apparently relevant to the classifier s prediction. These 4 examples are bulbul, tench, junco, and ptarmigan respectively. Published as a conference paper at ICLR 2019 Figure 20: Additional Saliency Maps for FIDO under total variation 0.01 with a variety of in-filling methods. We include the method from Dabkowski & Gal (2017) in the right-most column. Published as a conference paper at ICLR 2019 Figure 21: Additional Saliency Maps for FIDO under total variation 0.01 with a variety of in-filling methods. We include the method from Dabkowski & Gal (2017) in the right-most column. Published as a conference paper at ICLR 2019 Figure 22: Additional examples of ablation study.