DANCE: Enhancing saliency maps using decoys

Yang Young Lu*1, Wenbo Guo*2, Xinyu Xing2, William Stafford Noble3

Abstract

Saliency methods can make deep neural network predictions more interpretable by identifying a set of critical features in an input sample, such as pixels that contribute most strongly to a prediction made by an image classifier. Unfortunately, recent evidence suggests that many saliency methods perform poorly, especially in situations where gradients are saturated, inputs contain adversarial perturbations, or predictions rely upon inter-feature dependence. To address these issues, we propose a framework, DANCE, which improves the robustness of saliency methods by following a two-step procedure. First, we introduce a perturbation mechanism that subtly varies the input sample without changing its intermediate representations. Using this approach, we can gather a corpus of perturbed (decoy) data samples while ensuring that the perturbed and original input samples follow similar distributions. Second, we compute saliency maps for the decoy samples and propose a new method to aggregate saliency maps. With this design, we offset the influence of gradient saturation. From a theoretical perspective, we show that the aggregated saliency map not only captures inter-feature dependence but, more importantly, is robust against previously described adversarial perturbation methods. Our empirical results suggest that, both qualitatively and quantitatively, DANCE outperforms existing methods in a variety of application domains.1

*Equal contribution. 1Department of Genome Sciences, University of Washington, Seattle, WA, USA. 2College of Information Sciences and Technology, The Pennsylvania State University, State College, PA, USA. 3Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA. Correspondence to: Xinyu Xing, William Stafford Noble.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1The Apache-licensed source code of DANCE will be available at https://bitbucket.org/noblelab/dance.

1. Introduction

Deep neural networks (DNNs) deliver remarkable performance in an increasingly wide range of application domains, but they often do so in an inscrutable fashion, delivering predictions without accompanying explanations. In a practical setting such as automated analysis of pathology images, if a patient sample is classified as malignant, then the physician will want to know which parts of the image contribute to this diagnosis. Thus, in general, a DNN that delivers interpretations alongside its predictions will enhance the credibility and utility of its predictions for end users (Lipton, 2016). In this paper, we focus on a popular branch of explanation methods, often referred to as saliency methods, which aim to find input features (e.g., image pixels or words) that strongly influence the network predictions (Simonyan et al., 2013; Selvaraju et al., 2016; Binder et al., 2016; Shrikumar et al., 2017; Smilkov et al., 2017; Sundararajan et al., 2017; Ancona et al., 2018). Saliency methods typically rely on back-propagation from the network's output to its input to assign a saliency score to each feature, such that higher scores indicate higher importance to the output prediction.
Despite attracting increasing attention, saliency methods suffer from several fundamental limitations:

- Gradient saturation (Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017) may lead to the problem that the gradients of important features have small magnitudes, breaking the implicit assumption that important features, in general, correspond to large gradients. This issue can be triggered when the DNN outputs are flat in the vicinity of important features.
- Importance isolation (Singla et al., 2019) refers to the problem that gradient-based saliency methods evaluate feature importance in an isolated fashion, implicitly assuming that the other features are fixed.
- Perturbation sensitivity (Ghorbani et al., 2017; Kindermans et al., 2017; Levine et al., 2019) refers to the observation that even imperceptible, random perturbations or a simple shift transformation of the input data may lead to a large change in the resulting saliency scores.

In this paper, we propose a novel saliency method, Decoy-enhANCEd saliency (DANCE), to tackle these limitations. At a high level, DANCE generates the saliency score of an input by aggregating the saliency scores of multiple perturbed copies of this input. Specifically, given an input sample of interest, DANCE first generates a population of perturbed samples, referred to as decoys, that perfectly mimic the neural network's intermediate representation of the original input. These decoys are used to model the variation of an input sample originating from either sensor noise or adversarial attacks. The decoy construction procedure draws inspiration from the knockoff framework, proposed recently by Barber & Candès (2015) in the setting of error-controlled feature selection, where the core idea is to generate knockoff features that perfectly mimic the empirical dependence structure among the original features.

In brief, the current paper makes three primary contributions. First, we propose a framework to perturb input samples to produce corresponding decoys that preserve the input distribution, in the sense that the intermediate representations of the original input data and the decoys are indistinguishable. We formulate decoy generation as an optimization problem, applicable to diverse deep neural network architectures. Second, we develop a decoy-enhanced saliency score by aggregating the saliency maps of the generated decoys. By design, this score naturally offsets the impact of gradient saturation. From a theoretical perspective, we show how the proposed score can simultaneously reflect the joint effects of other dependent features and achieve robustness to adversarial perturbations. Third, we demonstrate empirically that DANCE outperforms existing saliency methods, both qualitatively and quantitatively, on three real-world applications. We also quantify DANCE's advantage over existing saliency methods in terms of robustness against various adversarial attacks.

2. Related work

A variety of saliency methods have been proposed in the literature. Some, such as edge detectors and Guided Backpropagation (Springenberg et al., 2014), are independent of the predictive model (Nie et al., 2018; Adebayo et al., 2018).2 Others are designed only for specific architectures (e.g., Grad-CAM (Selvaraju et al., 2016) for CNNs, DeConvNet for CNNs with ReLU activations (Zeiler & Fergus, 2014)).
In this paper, instead of exhaustively evaluating all saliency methods, we apply our method to three saliency methods that do depend on the predictor (i.e., that pass the sanity checks in Adebayo et al. (2018) and Sixt et al. (2020)) and are applicable to diverse DNN architectures.

The vanilla gradient method (Simonyan et al., 2013) calculates the gradient of the class score with respect to the input x, defined as

$$E_{grad}(x; F^c) = \nabla_x F^c(x).$$

SmoothGrad (Smilkov et al., 2017) seeks to reduce noise in the saliency map by averaging over explanations of noisy copies of an input, defined as

$$E_{sg}(x; F^c) = \frac{1}{n} \sum_{i=1}^{n} E_{grad}(x + g_i; F^c),$$

where the $g_i \sim N(0, \sigma^2)$ indicate the noise vectors. The integrated gradient method (Sundararajan et al., 2017) starts from a baseline input $x_0$ and sums over the gradient with respect to scaled versions of the input, ranging from the baseline to the observed input, defined as

$$E_{ig}(x; F^c) = (x - x_0) \odot \int_0^1 \nabla_x F^c(x_0 + \alpha(x - x_0))\, d\alpha.$$

Note that the input gradient and DeepLIFT (Shrikumar et al., 2017) methods are strongly related to the integrated gradient method, as shown by Ancona et al. (2018). (A minimal code sketch of these three base methods is given at the end of this section.)

2Sixt et al. (2020) shows that LRP (Binder et al., 2016) is independent of the parameters of certain layers.

We do not empirically compare to several other categories of methods. Counterfactual-based methods work under the same setup as saliency methods, providing explanations for the predictions of a pre-trained DNN model (Sturmfels et al., 2020). These methods identify the important subregions within an input image by perturbing the subregions (by adding noise, rescaling (Sundararajan et al., 2017), blurring (Fong & Vedaldi, 2017), or inpainting (Chang et al., 2019)) and measuring the resulting changes in the predictions (Ribeiro et al., 2016; Lundberg & Lee, 2017; Chen et al., 2018; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017; Chang et al., 2019; Yousefzadeh & O'Leary, 2019; Goyal et al., 2019). Although these methods do identify meaningful subregions in practice, they exhibit several limitations. First, counterfactual-based methods implicitly assume that the regions containing the object contribute most to the prediction (Fan et al., 2017). However, Moosavi-Dezfooli et al. (2017) showed that counterfactual-based methods are also vulnerable to adversarial attacks, which force these methods to output unrelated background, rather than the meaningful objects, as important subregions. Second, the counterfactual images may be potentially far away from the training distribution, causing ill-defined classifier behavior (Burns et al., 2019; Hendrycks & Dietterich, 2019).

In addition to these limitations, counterfactual-based methods and our decoy-based method are fundamentally different in three ways. First, the former seeks the minimum set of features to exclude in order to minimize the prediction score, or to include in order to maximize the prediction score (Fong & Vedaldi, 2017), whereas our approach aims to characterize the influence of each feature on the prediction score. Second, counterfactual-based methods explicitly consider the decision boundary by comparing each image to the closest image on the other side of the boundary. In contrast, the proposed method considers the decision boundary only implicitly, by calculating variants of the gradient. Third, unlike counterfactual images, which could potentially be out-of-distribution, decoys are plausibly constructed in the sense that their intermediate representations are indistinguishable from those of the original input data by design. Because of these limitations and differences, we do not compare our method with counterfactual-based methods.
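For concreteness, the following is a minimal PyTorch sketch of the three base methods defined above (and used again in Section 4.1). It is illustrative rather than the authors' released implementation; the `model` (a classifier returning class scores for a batch of one input), the noise scale, the number of noisy copies, and the number of integration steps are all assumptions.

```python
import torch

def vanilla_gradient(model, x, c):
    """E_grad(x; F^c) = dF^c(x)/dx for a single input x of shape [1, ...]."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, c]          # F^c(x)
    score.backward()
    return x.grad.detach()

def smoothgrad(model, x, c, n=50, sigma=0.15):
    """Average vanilla gradients over n noisy copies x + g_i, with g_i ~ N(0, sigma^2)."""
    grads = [vanilla_gradient(model, x + sigma * torch.randn_like(x), c)
             for _ in range(n)]
    return torch.stack(grads).mean(dim=0)

def integrated_gradients(model, x, c, x0=None, steps=50):
    """(x - x0) * integral_0^1 dF^c(x0 + a(x - x0)) da, approximated by a Riemann sum."""
    x0 = torch.zeros_like(x) if x0 is None else x0
    total = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, steps):
        total += vanilla_gradient(model, x0 + a * (x - x0), c)
    return (x - x0) * total / steps
```

Each of these maps plays the role of $E(\cdot\,; F^c)$ that the decoy-enhanced score of Section 3.4 aggregates over decoys.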
In addition to saliency methods and counterfactual-based methods, several other types of interpretation methods have been proposed that either aim for a different goal or have a different setup. For example, recent research (e.g., Ribeiro et al. (2016); Lundberg & Lee (2017); Chen et al. (2018; 2019b)) designed techniques to explain a black-box model, where the model's internal weights are inaccessible. Koh & Liang (2017) and some follow-up work (Yeh et al., 2018; Koh et al., 2019) tried to find the training points that are most influential for a given test sample. Other efforts have been made to train a more interpretable DNN classifier (Fan et al., 2017; Zołna et al., 2019; Alvarez-Melis & Jaakkola, 2018; Toneva & Wehbe, 2019), to synthesize samples that represent the model predictions (Ghorbani et al., 2019; Chen et al., 2019a), or to identify noise-tolerant features (Ikeno & Hara, 2018; Schulz et al., 2020). However, due to the differences in task and setup, we do not consider these methods in this paper.

3.1. Problem setup

Consider a multi-label classification task in which a pretrained neural network model implements a function $F: \mathbb{R}^d \mapsto \mathbb{R}^C$ that maps from a given input $x \in \mathbb{R}^d$ to $C$ predicted classes. The score for each class $c \in \{1, \ldots, C\}$ is $F^c(x)$, and the predicted class is the one with maximum score, i.e., $\arg\max_{c \in \{1, \ldots, C\}} F^c(x)$. A saliency method aims to assign to each feature a saliency score, encoded in a saliency map $E(x; F^c): \mathbb{R}^d \mapsto \mathbb{R}^d$, in which features with higher scores have higher importance relative to the final prediction. Given a pre-trained neural network model $F$ with $L$ layers, an input $x$, and a saliency method $E$ such that $E(x; F)$ is a saliency map of the same dimensions as $x$, DANCE operates in two steps: generating decoys and aggregating the saliency maps of the decoys (Figure 1A).

[Figure 1. Overview of DANCE. (A) The DANCE workflow: a decoy generator produces decoy images from the original image and patch masks, the pretrained network yields label-dependent gradients and saliency maps for the decoys, and these are aggregated into a decoy-enhanced saliency map. (B) The swapping operation between original and decoy images, which yields identical intermediate representations.]

3.2. Decoy definition

Say that $F_\ell: \mathbb{R}^d \mapsto \mathbb{R}^{d_\ell}$ is the function instantiated by the given network, which maps from an input $x \in \mathbb{R}^d$ to its intermediate representation $F_\ell(x) \in \mathbb{R}^{d_\ell}$ at layer $\ell \in \{1, 2, \ldots, L\}$. A vector $\tilde{x} \in \mathbb{R}^d$ is said to be a decoy of $x \in \mathbb{R}^d$ at a specified layer $\ell$ if the following swappable condition is satisfied:

$$F_\ell(x) = F_\ell(x_{swap(\tilde{x}, K)}), \quad \text{for swappable features } K \subseteq \{1, \ldots, d\}. \qquad (1)$$

Here, the $swap(\tilde{x}, K)$ operation swaps features between $x$ and $\tilde{x}$ based on the elements in $K$. In this work, $K$ represents a small, meaningful feature set, such as a small region/segment in an image or a group of words (embeddings) in a sentence. Take an image recognition task, for example. Assume $K = \{10\}$ and $\tilde{x}$ is a zero matrix; then $x_{swap(\tilde{x}, K)}$ is a new image that is identical to $x$ except that its tenth pixel is set to zero. An illustrative explanation of the swap operator is shown in Figure 1(B). Using the swappable condition, we aim to ensure that the original image $x$ and its decoy $\tilde{x}$ are indistinguishable in terms of the intermediate representation at layer $\ell$. Note in particular that the construction of decoys relies solely on the first $\ell$ layers of the neural network, $F_1, F_2, \ldots, F_\ell$, and is independent of the succeeding layers $F_{\ell+1}, \ldots, F_L$. As such, $\tilde{x}$ is conditionally independent of the classification task $F(x)$ given the input $x$; i.e., $\tilde{x} \perp F(x) \mid x$.
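As a concrete reading of the swap operator and the swappable condition in Equation (1), here is a minimal sketch for a single (unbatched) input. The `forward_to_layer` helper standing in for $F_\ell$ is hypothetical, and the tolerance `eps` mirrors the relaxed constraint used later in Equation (2) rather than exact equality.

```python
import torch

def swap(x, x_tilde, K):
    """x_swap(x_tilde, K): a copy of x whose features indexed by K are taken from x_tilde."""
    x_swapped = x.clone()
    x_swapped.view(-1)[K] = x_tilde.view(-1)[K]
    return x_swapped

def satisfies_swappable_condition(forward_to_layer, x, x_tilde, K, layer, eps=1e-3):
    """Check F_l(x) ~= F_l(x_swap(x_tilde, K)) up to tolerance eps (Equation (1))."""
    fx = forward_to_layer(x, layer)                     # F_l(x)
    fs = forward_to_layer(swap(x, x_tilde, K), layer)   # F_l(x_swap(x_tilde, K))
    return (fx - fs).abs().max().item() <= eps
```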
3.3. Decoy generation

To identify decoys satisfying the swappable condition, we solve the following optimization problem:

$$\max_{\tilde{x} \in [x_{min}, x_{max}]^d} \; \| ((\tilde{x} - x) \cdot s)_+ \|_1 \quad \text{s.t.} \quad \| F_\ell(\tilde{x}) - F_\ell(x) \|_\infty \le \epsilon, \;\; (\tilde{x} - x) \odot (1 - M) = 0. \qquad (2)$$

Here, $(\cdot)_+ = \max(\cdot, 0)$, and the operators $\|\cdot\|_1$ and $\|\cdot\|_\infty$ correspond to the $L_1$ and $L_\infty$ norms, respectively. $M \in \{0, 1\}^d$ is a specified binary mask, where $M_i = 0$ indicates that the $i$th features of $x$ and $\tilde{x}$ are kept the same (realized by the constraint $(\tilde{x} - x) \odot (1 - M) = 0$). In other words, we take $x$ and $\tilde{x}$ to be indistinguishable except for the swappable features indicated by the mask (i.e., $x_{swap(\tilde{x}, (1 - M))} = x$). The value of each feature in the decoy $\tilde{x}$ is restricted to lie in a legitimate value range, i.e., $[x_{min}, x_{max}]$ (e.g., pixel values should lie in $[0, 255]$). We further impose the constraint $\| F_\ell(\tilde{x}) - F_\ell(x) \|_\infty \le \epsilon$, which ensures that the generated decoy satisfies the swappable condition described in Equation (1).

As illustrated in Figure 1, a population of $n$ patch masks is constructed subject to the principle that each swappable patch is covered at least once. Because each swappable patch is small (e.g., a small region/segment in an image), assigning each patch mask to a single patch would be computationally expensive. Accordingly, we aggregate multiple patches into a combined patch mask for computational efficiency (see Supplementary Section S1 for details). Empirical results suggest that DANCE is robust to the number of patches that are aggregated into each mask (Figure 2C).

As shown later in Section 3.4, DANCE aims to capture the range of the saliency maps among all decoys. To achieve this, we first need to estimate the range of values among the decoys, by estimating the range of perturbation values that can be added to the input without violating the swappable condition. In other words, we maximize the deviation between $\tilde{x}$ and $x$ from both the positive and negative directions, i.e., $s = +1$ and $s = -1$. As shown in Equation (2), for each specified mask $M$, we compute two decoys: one for the positive deviation (i.e., $s = +1$) and one for the negative deviation (i.e., $s = -1$). More details about how to optimize Equation (2) can be found in Supplementary Section S1.

3.4. Decoy-enhanced saliency scores

We denote the generated decoys as $\tilde{x}^1, \tilde{x}^2, \ldots, \tilde{x}^{2n}$, i.e., $n$ decoys in the positive direction and $n$ in the negative direction. For these decoys, we then apply a given saliency method $E$ to yield the corresponding decoy saliency maps $E(\tilde{x}^1; F), E(\tilde{x}^2; F), \ldots, E(\tilde{x}^{2n}; F)$. With these decoy saliency maps in hand, for each feature $x_i$ in $x$, we characterize its saliency score variation by using the population of saliency scores $\tilde{E}_i = \{E(\tilde{x}^1; F^c)_i, E(\tilde{x}^2; F^c)_i, \ldots, E(\tilde{x}^{2n}; F^c)_i\}$. We define the decoy-enhanced saliency score $Z_i$ for each feature $x_i$ as

$$Z_i = \max(\tilde{E}_i) - \min(\tilde{E}_i). \qquad (3)$$

Here, $Z_i$ is determined by the empirical range of the decoy saliency scores. Ideally, important features will have large values and unimportant ones will have small values.
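The two-step procedure can be summarized in a short sketch. The version below approximates the constrained problem in Equation (2) with a fixed-weight penalty (the paper optimizes it via the procedure described in Supplementary Section S1, which involves a Lagrange multiplier) and then applies the range aggregation of Equation (3). The `forward_to_layer` and `saliency_fn` helpers, the penalty weight `lam`, the step count, and the learning rate are assumptions.

```python
import torch

def generate_decoy(forward_to_layer, x, mask, layer, s=+1,
                   eps=1e-3, lam=10.0, steps=200, lr=0.01,
                   x_min=0.0, x_max=1.0):
    """Penalty-based approximation of Eq. (2): maximize ||((x_tilde - x) * s)_+||_1
    while keeping ||F_l(x_tilde) - F_l(x)||_inf <= eps and perturbing only masked features."""
    fx = forward_to_layer(x, layer).detach()
    delta = torch.zeros_like(x, requires_grad=True)   # perturbation, applied only where mask == 1
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_tilde = torch.clamp(x + delta * mask, x_min, x_max)
        gain = torch.clamp((x_tilde - x) * s, min=0).sum()            # ||((x_tilde - x) s)_+||_1
        drift = (forward_to_layer(x_tilde, layer) - fx).abs().max()   # ||F_l(x_tilde) - F_l(x)||_inf
        loss = -gain + lam * torch.clamp(drift - eps, min=0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.clamp(x + delta.detach() * mask, x_min, x_max)

def decoy_enhanced_scores(saliency_fn, decoys):
    """Eq. (3): Z_i = max over decoys of E(x_tilde)_i minus min over decoys of E(x_tilde)_i."""
    maps = torch.stack([saliency_fn(d) for d in decoys])   # shape [2n, ...]
    return maps.max(dim=0).values - maps.min(dim=0).values
```

A decoy pair (s = +1 and s = -1) is generated for each patch mask, and all 2n resulting saliency maps feed into `decoy_enhanced_scores`.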
3.5. Theoretical insights

In this section, we analyze the proposed saliency score theoretically. For ease of exposition, we carry out the theoretical analysis using the vanilla gradient as the base saliency method. In particular, we take a convolutional neural network with the ReLU activation function as an example to discuss why the proposed interpretation method can account for inter-feature dependence while also improving explanatory robustness. It should be noted that, while we conduct our theoretical analysis in the setting of convolutional neural networks (CNNs) with a specific activation function, the conclusions drawn from the analysis can be extended to other feed-forward neural architectures and other activation functions (see Supplementary Section S4 for more details).

Consider a CNN with $L$ hidden blocks, with each block $\ell$ containing a convolutional layer with a filter of size $\sqrt{s_\ell} \times \sqrt{s_\ell}$ and a max pooling layer with pooling size $\sqrt{s_\ell} \times \sqrt{s_\ell}$. (We set the pooling size equal to the kernel size in each block for simplicity.) The input to this CNN is $x \in \mathbb{R}^d$, unrolled from a $\sqrt{d} \times \sqrt{d}$ matrix. Similarly, we also unroll each convolutional filter into $g_\ell \in \mathbb{R}^{s_\ell}$, where $g_\ell$ is indexed as $(g_\ell)_j$ for $j \in J_\ell$. Here, $J_\ell$ corresponds to the index shift in matrix form from the top-left to the bottom-right element. For example, a $3 \times 3$ convolutional filter (i.e., $s_\ell = 9$) is indexed by $J_\ell = \{-\sqrt{d}-1, \ldots, -1, 0, 1, \ldots, \sqrt{d}+1\}$. The output of the network is the probability vector $p \in \mathbb{R}^C$ generated by the softmax function, where $C$ is the total number of classes. Such a network can be represented as

$$m_\ell = \mathrm{pool}(\mathrm{relu}(g_\ell * m_{\ell-1})) \;\; \text{for } \ell = 1, 2, \ldots, L, \qquad o = W_{L+1}^T m_L + b_{L+1}, \qquad p = \mathrm{softmax}(o),$$

where $\mathrm{relu}(\cdot)$ and $\mathrm{pool}(\cdot)$ indicate the ReLU and pooling operators, $m_\ell \in \mathbb{R}^{d_\ell}$ is the output of block $\ell$ (with $m_0 = x$), and $g_\ell * m_{\ell-1} \in \mathbb{R}^{d_{\ell-1}}$ represents the convolution operation in that block. We assume for simplicity that the convolution retains the input shape.

Consider an input $x$ and its decoy $\tilde{x}$, generated by swapping features in $K$. For each feature $i \in K$, we have the following theorem for the decoy-enhanced saliency score $Z_i$:

Theorem 1. In the aforementioned setting, $Z_i$ is bounded by

$$Z_i \le \frac{1}{C_1} \Bigl| \sum_{k \in K} (\tilde{x}^+_k - \tilde{x}^-_k)(H_x)_{k,i} \Bigr|. \qquad (4)$$

Here, $C_1 > 0$ is a bounded constant and $H_x$ is the Hessian of $F^c(x)$ at $x$, where $(H_x)_{i,k} = \frac{\partial^2 F^c}{\partial x_i \partial x_k}$. $\tilde{x}^+$ and $\tilde{x}^-$ refer to the decoys that maximize and minimize $E(\tilde{x}; F^c)$, respectively.

Theorem 1 implies that the proposed saliency score is determined by the second-order Hessian terms $(H_x)_{k,i}$ within the same swappable feature set. The score explicitly models the feature dependencies in the swappable feature set via this second-order Hessian, potentially capturing meaningful patterns such as edges, texture, etc.

In addition to enabling representation of inter-feature dependence, Theorem 1 sheds light on the robustness of the proposed saliency score against adversarial attack. To illustrate the robustness improvement of our method, we introduce the following proposition.

Proposition 1. Given an input $x$ and the corresponding adversarial sample $\hat{x}$, if both $|\tilde{x}_i - x_i| \le C_2 \delta_i$ and $|\tilde{\hat{x}}_i - \hat{x}_i| \le C_2 \delta_i$ can be obtained, where $C_2 > 0$ is a bounded constant and $\delta_i = |E(\hat{x}; F)_i - E(x; F)_i|$, then the following relation is guaranteed:

$$|(Z_{\hat{x}})_i - (Z_x)_i| \le |E(\hat{x}; F)_i - E(x; F)_i|. \qquad (5)$$

Given an adversarial sample $\hat{x}$ (i.e., a perturbed version of $x$), we say a saliency method is not robust against $\hat{x}$ if the deviation of the corresponding explanation, $\delta_i = |E(\hat{x}; F)_i - E(x; F)_i|$ for $i \in \{1, 2, \ldots, d\}$, is large. According to the proposition above, the deviation of our decoy-enhanced saliency score is always no larger than that of the base saliency method when the stated condition is satisfied. This indicates that, when the condition holds, our saliency method guarantees stronger resistance to the adversarial perturbation. To ensure the conditions $|\tilde{x}_i - x_i| \le C_2 \delta_i$ and $|\tilde{\hat{x}}_i - \hat{x}_i| \le C_2 \delta_i$, we can introduce them as additional constraints in Equation (2). In the following sections, unless stated otherwise, the saliency scores used in our evaluation are all derived with this constraint imposed. The proofs and in-depth analyses of Theorem 1 and Proposition 1 can be found in Supplementary Sections S2 and S3.
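The intuition behind Theorem 1 can be illustrated numerically: expanding the vanilla gradient around $x$, the saliency of a decoy differs from that of $x$ by roughly $\sum_{k}(\tilde{x}_k - x_k)(H_x)_{k,i}$, so the range of decoy saliencies tracks Hessian-weighted inter-feature terms. The snippet below only checks this first-order expansion on a tiny smooth toy function (softplus replaces ReLU so that the Hessian is non-trivial); it is not the theorem's CNN setting or its proof, and every quantity in it is invented for illustration.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
d = 6
W1, W2 = torch.randn(8, d), torch.randn(1, 8)
f = lambda x: (W2 @ torch.nn.functional.softplus(W1 @ x)).squeeze()  # smooth toy F^c

x = torch.randn(d)
x_tilde = x.clone()
K = [1, 2]
x_tilde[K] += 0.05 * torch.randn(len(K))   # small perturbation on the swappable features

grad = lambda z: torch.autograd.grad(f(z), z)[0]
g_x = grad(x.requires_grad_(True))         # vanilla gradient at x
g_t = grad(x_tilde.requires_grad_(True))   # vanilla gradient at the decoy
H = hessian(f, x.detach())                 # (H_x)_{k,i}

lhs = g_t - g_x                                   # change in the vanilla-gradient saliency map
rhs = H @ (x_tilde.detach() - x.detach())         # sum_k (x_tilde_k - x_k) (H_x)_{k,i}
print(lhs, rhs)                                   # nearly equal for small perturbations
```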
4. Experiments

To evaluate the effectiveness of DANCE, we perform extensive experiments on deep learning models that target three tasks: image classification, sentiment analysis, and network intrusion detection. Our results suggest that DANCE, in conjunction with state-of-the-art saliency methods, makes already good saliency maps even more intuitively coherent. DANCE also quantitatively achieves better alignment with truly important features and demonstrates stronger robustness to adversarial manipulation. Descriptions of the datasets and experimental setup can be found in Supplementary Section S5.

4.1. Baseline methods

We applied DANCE in conjunction with three state-of-the-art saliency methods: vanilla gradient, integrated gradient, and SmoothGrad. (See Supplementary Section S9 for results from additional saliency methods such as ExpGrad (Sturmfels et al., 2020), VarGrad (Hooker et al., 2019), and Grad-CAM (Selvaraju et al., 2016).) As noted in Section 2, a prerequisite for the saliency methods we consider is dependence on the predictor. To confirm that conjunction with DANCE does not violate this prerequisite, we carried out a sanity check on the ImageNet dataset. The results (Supplementary Section S6) show that our method does indeed depend on the predictor.

One significant challenge when comparing different saliency methods is that each method produces a raw saliency map with its own distribution. Therefore, to facilitate a fair comparison among all methods, we used a consistent post-processing scheme to normalize all methods. Specifically, we selected top-K normalization, i.e., constructing a binary saliency map by retaining only the top K features ranked by each method. We then set the saliency value of the selected features to 1 and of the remaining features to 0. Here we chose K as the top 20% of all features, and we show that our results are robust to variation in the choice of K in Supplementary Section S11. It is worth mentioning that in this paper we do not consider another common normalization scheme, 0-1 normalization (i.e., linearly rescaling the saliency values to the range [0, 1]), because 0-1 normalization leads to a biased estimate of the evaluation metric (see Section 4.2).

4.2. Evaluation metric

Intuitively, we prefer a saliency method that highlights features that align closely with the predictions (e.g., highlights the object of interest in an image or the words indicating the sentiment of a sentence). To measure how well a saliency map achieves qualitative coherence, we use the fidelity metric (Dabkowski & Gal, 2017), defined as

$$S_F(E(\cdot\,; F^c), x) = -\log F^c(E(x; F^c) \odot x), \qquad (6)$$

where $c$ indicates the predicted class of input $x$, and $E(x; F^c)$ is the top-K-retained binary saliency map described above. $E(x; F^c) \odot x$ performs entry-wise multiplication between $E(x; F^c)$ and $x$, encoding the overlap between the object of interest and the concentration of the saliency map. The rationale behind this metric is that, by viewing the saliency score of a feature as its contribution to the predicted class, a good saliency method will highlight more important features and thus give rise to higher predicted class scores and lower metric values. Note that, to guarantee a fair comparison among different saliency methods, it is important to retain the same number of important features for evaluation. Without such a scheme, pathological cases such as $E(x; F^c) = 1$ (i.e., all saliency values equal to 1) would trivially achieve the best (lowest) fidelity score, which is particularly problematic for alternative schemes such as 0-1 normalization.
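The following is a minimal sketch of how the top-K normalization and Equation (6) combine in practice. It assumes, as above, a `model` mapping a batch of one input to class scores; whether $F^c$ is taken as a softmax probability or a raw score is an assumption of this sketch.

```python
import torch

def top_k_binarize(saliency, frac=0.2):
    """Keep the top 20% of features (by saliency) as 1 and set the rest to 0."""
    flat = saliency.flatten()
    k = max(1, int(frac * flat.numel()))
    binary = torch.zeros_like(flat)
    binary[flat.topk(k).indices] = 1.0
    return binary.view_as(saliency)

def fidelity(model, x, saliency, c, frac=0.2):
    """S_F = -log F^c(E(x; F^c) * x): lower is better (the retained features preserve the prediction)."""
    mask = top_k_binarize(saliency, frac)
    prob = torch.softmax(model(mask * x), dim=-1)[0, c]
    return -torch.log(prob).item()
```

Lower values indicate that the retained 20% of features are sufficient to preserve the predicted class score.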
[Figure 2. Performance evaluation on ImageNet. (A) Visualization of saliency maps on foreground and background objects, with and without decoys, for the gradient, SmoothGrad, and integrated gradient methods, together with difference maps (saliency with decoys minus saliency without decoys). (B) Fidelity comparison of the original saliency method (i.e., without decoys), our method (i.e., decoys with range aggregation), and its alternatives: replacing the decoy generation (Equation (2)) with constant perturbation (constant with range aggregation) or noise perturbation (noise with range aggregation), and replacing the decoy aggregation (Equation (3)) with mean aggregation (decoys with mean aggregation). See Supplementary Section S14 for more statistics about the performance differences between our method and the baselines. (C) Performance with respect to varying patch size (3, 5, 7, 9) and different numbers of decoys.]

4.3. Performance in various applications

4.3.1. PERFORMANCE ON THE IMAGENET DATASET

To evaluate the effectiveness of DANCE, we first applied DANCE to randomly sampled images from the ImageNet dataset (Russakovsky et al., 2015), with a pretrained VGG16 model (Simonyan & Zisserman, 2014) (see Supplementary Section S7 for applicability to diverse CNN architectures such as AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016)). The 3 × 3 image patches are treated as swappable features in generating decoys. A side-by-side comparison (Figure 2(A)) suggests that decoys consistently help to reduce noise and produce more visually coherent saliency maps. For example, the original integrated gradient method highlights the region of the dog's head in a scattered fashion, which is also revealed by the difference plot.
In contrast, the decoy-enhanced integrated gradient method not only highlights the missing body but also identifies the dog's head in greater detail, including the ears, cheek, and nose (see Supplementary Section S13 for more visualization examples). The visual coherence is also quantitatively supported by the saliency fidelity score.

To further evaluate the necessity of the two steps in our method (i.e., decoy generation and aggregation), we carried out a control experiment in which each step was replaced with an alternative. Specifically, as alternatives to the decoy generation, we used an image in which all pixel values are either replaced with a single mean pixel value or contaminated with Gaussian white noise. For the decoy aggregation, we calculated the mean saliency score as the alternative. As shown in Figure 2(B), our method, which incorporates both steps, yields the best performance. This validates the effectiveness of our two-step approach.

Third, to evaluate the computational efficiency of DANCE, we carried out a fidelity comparison with respect to the number of decoys to optimize. As discussed in Section 3.3, multiple swappable patches are aggregated into one combined patch mask for computational efficiency. Consequently, the mask multiplicity (i.e., the number of swappable patches per mask) is inversely proportional to the number of decoys to optimize. Figure 2(C) shows that our method achieves stable fidelity scores across a wide range of decoy numbers. Furthermore, as shown in Supplementary Section S10, the computational cost to optimize a single decoy in DANCE is negligible compared to even the fastest vanilla gradient-based saliency method. This analysis confirms that our system can produce good saliency maps without introducing too much computational cost.

[Figure 3. Results obtained from the SST dataset. (A) Visualization of word-level saliency maps, with and without decoys, for the gradient, integrated gradient, and SmoothGrad methods on a positive-sentiment and a negative-sentiment sentence; normalized saliency values are shown for better distinction. (B) Fidelity comparison of the original saliency method, our method, and its alternatives: replacing the decoy generation (Equation (2)) with constant or noise perturbation, and replacing the decoy aggregation (Equation (3)) with mean aggregation. See Supplementary Section S14 for more statistics about the performance differences between our method and the baselines.]
Finally, in Supplementary Section S11, we report a sensitivity analysis of the other hyper-parameters (i.e., the swappable feature size P, the targeted network layer ℓ, and the initial Lagrange multiplier λ). The results show that our method is insensitive to substantial variation in these hyper-parameters. This is an important property, because users do not need to extensively tune the hyper-parameters when using our method.

4.3.2. PERFORMANCE ON THE STANFORD SENTIMENT TREEBANK (SST) DATASET

To further evaluate the effectiveness of DANCE, we applied the method to randomly sampled sentences from the Stanford Sentiment Treebank (SST). We trained a two-layer CNN (Kim, 2014) that takes pretrained word embeddings as input (Pennington et al., 2014) (see Supplementary Section S6 for more details about the experimental setup). As suggested by Guan et al. (2019), the average saliency value across all dimensions of a word embedding is regarded as the word-level saliency value. The word embeddings are treated as swappable features when generating decoys. As shown in Figure 3(A), a side-by-side comparison suggests that our method consistently helps to produce semantically more meaningful saliency maps. For example, in a sentence with negative sentiment, keywords associated with negation, such as "no" and "not", are more strongly highlighted by the decoy-enhanced saliency methods. The semantic coherence is also quantitatively supported by the saliency fidelity (Figure 3(B)). We also tested the alternatives mentioned above: constant perturbation (replacing the decoy generation with the mean embedding of the whole dictionary) and noise perturbation, both with range aggregation, as well as decoys with mean aggregation. Figure 3(B) shows that our method outperforms these alternatives.

To demonstrate the effectiveness of DANCE on models other than CNNs, we carried out experiments on a multilayer perceptron trained on a network intrusion dataset. The results (Supplementary Section S8) are consistent with those on CNNs, thereby confirming our method's applicability to non-CNN architectures.

4.4. Robustness to adversarial attacks

An important design philosophy of DANCE is to use decoys to model the variation of an input sample originating from either sensor noise or unknown perturbations. We therefore hypothesized that DANCE may be particularly robust to adversarial manipulations of images. To test this hypothesis, we evaluated the robustness of our method to adversarial manipulations of images subject to three popular attacks (Ghorbani et al., 2017): (1) the top-k attack, which seeks to decrease the scores of the top k most important features; (2) the target attack, which aims to increase the importance of a pre-specified region in the input image; and (3) the mass-center attack, which aims to spatially shift the mass center of the original saliency map. Here, we specify the bottom-right 4 × 4 region of the original image for the target attack and select k = 5000 in the top-k attack (see Supplementary Section S6 for detailed setups).

We use the sensitivity metric (Alvarez-Melis & Jaakkola, 2018) to quantify the robustness of a saliency method $E$ to attack, defined as

$$S_S(E(\cdot\,; F^c), x, \hat{x}) = \| E(x; F^c) - E(\hat{x}; F^c) \|_2,$$

where $\hat{x}$ is the perturbed version of $x$. A small $S_S$ value means that similar inputs do not lead to substantially different saliency maps.
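A minimal sketch of the sensitivity computation, with the base and decoy-enhanced explainers passed in as interchangeable callables (the `dance_score` wrapper named in the comment is hypothetical):

```python
import torch

def sensitivity(saliency_fn, x, x_adv):
    """S_S = || E(x; F^c) - E(x_adv; F^c) ||_2 ; smaller means the explanation is more stable."""
    return torch.norm(saliency_fn(x) - saliency_fn(x_adv), p=2).item()

# Example usage on one attacked image:
# ss_base  = sensitivity(lambda z: vanilla_gradient(model, z, c), x, x_adv)
# ss_dance = sensitivity(lambda z: dance_score(model, z, c), x, x_adv)  # dance_score: hypothetical wrapper around Eqs. (2)-(3)
```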
[Figure 4. Robustness to adversarial attacks on images. (A) Visualization of saliency maps under the top-k, mass-center, and target attacks, with and without decoys, for the gradient, SmoothGrad, and integrated gradient methods. (B)-(D) The decoy-enhanced saliency score is compared to the original saliency score under each attack, evaluated by sensitivity. See Supplementary Section S14 for more statistics about the performance differences between our method and the baselines.]

As shown in Figure 4(A), a side-by-side comparison suggests that decoys consistently yield lower sensitivity scores and help to produce more visually coherent saliency maps, mitigating the impact of various adversarial attacks (see the Supplementary Material for more examples). The visual coherence and robustness to adversarial attacks are also quantitatively supported by Figures 4(B)-(D).

5. Discussion and conclusion

In this work, we propose DANCE, a method for computing, from a given saliency method, decoy-enhanced saliency scores that yield more accurate and robust saliency maps. We formulate decoy generation as an optimization problem, applicable to diverse DNN architectures. We demonstrate the superior performance of our method relative to three standard saliency methods, both qualitatively and quantitatively, even in the presence of various adversarial perturbations to the image. From a theoretical perspective, by deriving a closed-form solution, we show that the proposed score can provably compensate for the limitations of existing saliency methods by reflecting the joint effects of other dependent features and maintaining robustness to adversarial perturbations. We also demonstrate the computational efficiency of DANCE, showing that the cost of optimizing a single decoy is small; thus our technique can improve upon existing saliency methods without introducing too much computational overhead.

This work points to several promising directions for future research. First, DANCE is designed for non-linear models such as feed-forward DNNs, which are most in need of interpretation. Future work will explore the extension of our method to other models (e.g., linear models and recurrent neural networks) and to inputs with categorical or discrete features.
Second, recent work (Etmann et al., 2019; Chen et al., 2019c; Chalasani et al., 2020) shows that adversarial training can improve a DNN's interpretability. It is worth exploring whether DANCE could further enhance the quality of saliency maps derived from such adversarially retrained classifiers. Finally, a promising direction is to reframe interpretability as hypothesis testing and use decoys to deliver a set of salient features subject to false discovery rate control at a pre-specified level (Burns et al., 2019; Lu et al., 2018).

Acknowledgments

We would like to thank the anonymous reviewers and the meta-reviewer for their helpful comments. This project was supported in part by NSF grant CNS-1718459, by NSF grant CNS-1954466, and by ONR grant N00014-20-1-2008.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Proc. of NeurIPS, 2018.

Alvarez-Melis, D. and Jaakkola, T. S. Towards robust interpretability with self-explaining neural networks. In Proc. of NeurIPS, 2018.

Ancona, M., Ceolini, E., Öztireli, C., and Gross, M. Towards better understanding of gradient-based attribution methods for deep neural networks. In Proc. of ICLR, 2018.

Barber, R. F. and Candès, E. J. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 2015.

Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R., and Samek, W. Layer-wise relevance propagation for neural networks with local renormalization layers. In Proc. of ICANN, 2016.

Burns, C., Thomason, J., and Tansey, W. Interpreting black box models via hypothesis testing. arXiv:1904.00045, 2019.

Chalasani, P., Chen, J., Chowdhury, A. R., Wu, X., and Jha, S. Concise explanations of neural networks using adversarial training. In Proc. of ICML, 2020.

Chang, C.-H., Creager, E., Goldenberg, A., and Duvenaud, D. Explaining image classifiers by counterfactual generation. In Proc. of ICLR, 2019.

Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., and Su, J. K. This looks like that: deep learning for interpretable image recognition. In Proc. of NeurIPS, 2019a.

Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. Learning to explain: An information-theoretic perspective on model interpretation. In Proc. of ICML, 2018.

Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. L-Shapley and C-Shapley: Efficient model interpretation for structured data. In Proc. of ICLR, 2019b.

Chen, J., Wu, X., Rastogi, V., Liang, Y., and Jha, S. Robust attribution regularization. In Proc. of NeurIPS, 2019c.

Dabkowski, P. and Gal, Y. Real time image saliency for black box classifiers. In Proc. of NeurIPS, 2017.

Etmann, C., Lunz, S., Maass, P., and Schönlieb, C.-B. On the connection between adversarial robustness and saliency map interpretability. arXiv:1905.04172, 2019.

Fan, L., Zhao, S., and Ermon, S. Adversarial localization network. In Proc. of NeurIPS LLD Workshop, 2017.

Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In Proc. of ICCV, 2017.

Ghorbani, A., Abid, A., and Zou, J. Interpretation of neural networks is fragile. arXiv:1710.10547, 2017.

Ghorbani, A., Wexler, J., Zou, J. Y., and Kim, B. Towards automatic concept-based explanations. In Proc. of NeurIPS, 2019.

Goyal, Y., Wu, Z., Ernst, J., Batra, D., Parikh, D., and Lee, S. Counterfactual visual explanations. In Proc. of ICML, 2019.
Guan, C., Wang, X., Zhang, Q., Chen, R., He, D., and Xie, X. Towards a deep and unified understanding of deep neural models in NLP. In Proc. of ICML, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proc. of CVPR, 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In Proc. of ICLR, 2019.

Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. A benchmark for interpretability methods in deep neural networks. In Proc. of NeurIPS, 2019.

Ikeno, K. and Hara, S. Maximizing invariant data perturbation with stochastic optimization. arXiv:1807.05077, 2018.

Kim, Y. Convolutional neural networks for sentence classification. In Proc. of EMNLP, 2014.

Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. arXiv:1711.00867, 2017.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proc. of ICML, 2017.

Koh, P. W. W., Ang, K.-S., Teo, H., and Liang, P. S. On the accuracy of influence functions for measuring group effects. In Proc. of NeurIPS, 2019.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proc. of NeurIPS, 2012.

Levine, A., Singla, S., and Feizi, S. Certifiably robust interpretation in deep learning. arXiv:1905.12105, 2019.

Lipton, Z. C. The mythos of model interpretability. arXiv:1606.03490, 2016.

Lu, Y., Fan, Y., Lv, J., and Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Proc. of NeurIPS, 2018.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Proc. of NeurIPS, 2017.

Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. Universal adversarial perturbations. In Proc. of CVPR, 2017.

Nie, W., Zhang, Y., and Patel, A. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In Proc. of ICML, 2018.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proc. of EMNLP, 2014.

Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of KDD, 2016.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

Schulz, K., Sixt, L., Tombari, F., and Landgraf, T. Restricting the flow: Information bottlenecks for attribution. In Proc. of ICLR, 2020.

Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., and Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv:1611.07450, 2016.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proc. of ICML, 2017.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.

Singla, S., Wallace, E., Feng, S., and Feizi, S. Understanding impacts of high-order loss approximations and features in deep learning interpretation. arXiv:1902.00407, 2019.
Sixt, L., Granz, M., and Landgraf, T. When explanations lie: Why many modified BP attributions fail. In Proc. of ICML, 2020.

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv:1706.03825, 2017.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014.

Sturmfels, P., Lundberg, S., and Lee, S.-I. Visualizing the impact of feature attribution baselines. Distill, 2020.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proc. of ICML, 2017.

Toneva, M. and Wehbe, L. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Proc. of NeurIPS, 2019.

Yeh, C.-K., Kim, J., Yen, I. E.-H., and Ravikumar, P. K. Representer point selection for explaining deep neural networks. In Proc. of NeurIPS, 2018.

Yousefzadeh, R. and O'Leary, D. P. Interpreting neural networks using flip points. arXiv:1903.08789, 2019.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Proc. of ECCV, 2014.

Zołna, K., Geras, K. J., and Cho, K. Classifier-agnostic saliency map extraction. In Proc. of AAAI, 2019.