# Robust Attribution Regularization

Jiefeng Chen¹ Xi Wu² Vaibhav Rastogi² Yingyu Liang¹ Somesh Jha¹,³
¹University of Wisconsin-Madison ²Google ³XaiPient
(Equal contribution. Work done while at UW-Madison.)

An emerging problem in trustworthy machine learning is to train models that produce robust interpretations for their predictions. We take a step towards solving this problem through the lens of axiomatic attribution of neural networks. Our theory is grounded in the recent work, Integrated Gradients (IG) [STY17], on axiomatically attributing a neural network's output change to its input change. We propose training objectives in classic robust optimization models to achieve robust IG attributions. Our objectives give principled generalizations of previous objectives designed for robust predictions, and they naturally degenerate to classic soft-margin training for one-layer neural networks. We also generalize previous theory and prove that the objectives for different robust optimization models are closely related. Experiments demonstrate the effectiveness of our method, and also point to intriguing problems which hint at the need for better optimization techniques or better neural network architectures for robust attribution training.

1 Introduction

Trustworthy machine learning has received considerable attention in recent years. An emerging problem to tackle in this domain is to train models that produce reliable interpretations for their predictions. For example, a pathology prediction model may predict certain images as containing a malignant tumor. One would then hope that under visually indistinguishable perturbations of an image, similar sections of the image, instead of entirely different ones, account for the prediction. However, as Ghorbani, Abid, and Zou [GAZ17] convincingly demonstrated, for existing models one can generate minimal perturbations that substantially change model interpretations while keeping their predictions intact. Unfortunately, while the robust prediction problem of machine learning models is well known and has been extensively studied in recent years (for example, [MMS+17a, SND18, WK18], and also the tutorial by Madry and Kolter [KM18]), there has only been limited progress on the problem of robust interpretations.

In this paper we take a step towards solving this problem by viewing it through the lens of axiomatic attribution of neural networks, and propose Robust Attribution Regularization. Our theory is grounded in the recent work, Integrated Gradients (IG) [STY17], on axiomatically attributing a neural network's output change to its input change. Specifically, given a model $f$, two input vectors $x, x'$, and an input coordinate $i$, $\mathrm{IG}^f_i(x, x')$ defines a path integration (parameterized by a curve from $x$ to $x'$) that assigns a number to the $i$-th input as its contribution to the change of the model's output from $f(x)$ to $f(x')$. IG enjoys several natural theoretical properties (such as the Axiom of Completeness³) that other related methods violate.

Footnotes: Due to lack of space and for completeness, we put some definitions (such as coupling) in Section B.1. Code for this paper is publicly available at https://github.com/jfc43/robust-attribution-regularization. ³The Axiom of Completeness says that summing up the attributions of all components gives $f(x') - f(x)$.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
[Figure 1 panels, left to right: NATURAL (Top-1000 Intersection: 0.1%, Kendall's Correlation: 0.2607), IG-NORM (Top-1000 Intersection: 58.8%, Kendall's Correlation: 0.6736), IG-SUM-NORM (Top-1000 Intersection: 60.1%, Kendall's Correlation: 0.6951).]

Figure 1: Attribution robustness comparing different models. Top-1000 Intersection and Kendall's Correlation are rank correlations between original and perturbed saliency maps. NATURAL is the naturally trained model; IG-NORM and IG-SUM-NORM are models trained using our robust attribution method. We use the attribution attacks described in [GAZ17] to perturb the attributions while keeping predictions intact. For all images, the models give the correct prediction "Windflower". However, the saliency maps (also called feature importance maps), computed via IG, show that attributions of the naturally trained model are very fragile, either visually or quantitatively as measured by correlation analyses, while models trained using our method are much more robust in their attributions.

We briefly overview our approach. Given a loss function $\ell$ and a data generating distribution $P$, our Robust Attribution Regularization objective contains two parts: (1) achieving a small loss over the distribution $P$, and (2) keeping the IG attributions of the loss over $P$ close to the IG attributions over $Q$ whenever distributions $P$ and $Q$ are close to each other. We can naturally encode these two goals in two classic robust optimization models. (1) In the uncertainty set model [BTEGN09], where we treat sample points as nominal points and assume that the true sample points come from a certain vicinity around them, this gives

$$\mathbb{E}_{(x,y)\sim P}\Big[\ell(x, y; \theta) + \lambda \max_{x'\in N(x,\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_h(x, x'; r)\big)\Big],$$

where $\mathrm{IG}^{\ell_y}_h(\cdot)$ is the attribution of the loss $\ell_y$ w.r.t. the neurons in an intermediate layer $h$, and $s(\cdot)$ is a size function (e.g., $\|\cdot\|_2$) measuring the size of IG. (2) In the distributional robustness model [SND18, MEK15], where closeness between $P$ and $Q$ is measured using metrics such as the Wasserstein distance, this gives

$$\mathbb{E}_{Z\sim P}[\ell(Z;\theta)] + \lambda \sup_{Q;\, M\in\mathcal{Q}(P,Q)} \mathbb{E}_{M=(Z,Z')}\big[d_{\mathrm{IG}}(Z, Z')\big] \quad \text{s.t. } \mathbb{E}_{M=(Z,Z')}\big[c(Z, Z')\big] \le \rho.$$

In this formulation, $\mathcal{Q}(P,Q)$ is the set of couplings of $P$ and $Q$, and $M = (Z, Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as $\|\cdot\|_2$, measuring the cost of an adversary perturbing $z$ to $z'$, and $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be close to each other. $d_{\mathrm{IG}}$ is a metric measuring the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and over the couplings $\mathcal{Q}(P,Q)$.

We provide theoretical characterizations of our objectives. First, we show that they give principled generalizations of previous objectives designed for robust predictions. Specifically, under weak instantiations of the size function $s(\cdot)$ and of how we estimate IG computationally, we can leverage axioms satisfied by IG to recover the robust prediction objective of [MMS+17a], the input gradient regularization objective of [RD18], and also the distributionally robust prediction objective of [SND18]. These results provide theoretical evidence that robust prediction training can provide some control over robust interpretations. Second, for one-layer neural networks, we prove that instantiating $s(\cdot)$ as the $\ell_1$-norm coincides with instantiating $s(\cdot)$ as sum, and further coincides with classic soft-margin training, which implies that for generalized linear classifiers, soft-margin training will robustify both predictions and interpretations.
Finally, we generalize previous theory on distributionally robust prediction [SND18] to our objectives, and show that they are closely related.

Through detailed experiments we study the effect of our method in robustifying attributions. On the MNIST, Fashion-MNIST, GTSRB and Flower datasets, we report encouraging improvement in attribution robustness. Compared with naturally trained models, we show significantly improved attribution robustness, as well as prediction robustness. Compared with Madry et al.'s model [MMS+17a] trained for robust predictions, we demonstrate comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. We observe that even when our training stops, the attribution regularization term remains much more significant than the natural loss term. We discuss this problem and point out that current optimization techniques may not have effectively optimized our objectives. These results hint at the need for better optimization techniques or new neural network architectures that are more amenable to robust attribution training.

The rest of the paper is organized as follows: Section 2 briefly reviews necessary background. Section 3 presents our framework for robustifying attributions and proves theoretical characterizations. Section 4 presents instantiations of our method and their optimization, and we report experimental results in Section 5. Finally, Section 6 concludes with a discussion on future directions.

2 Preliminaries

Axiomatic attribution and Integrated Gradients. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a real-valued function, and let $x$ and $x'$ be two input vectors. Given that the function value changes from $f(x)$ to $f(x')$, a basic question is: how do we attribute the function value change to the input variables? A recent work by Sundararajan, Taly and Yan [STY17] provides an axiomatic answer to this question. Formally, let $r : [0, 1] \to \mathbb{R}^d$ be a curve such that $r(0) = x$ and $r(1) = x'$. Integrated Gradients (IG) for input variable $i$ is defined as the following integral:

$$\mathrm{IG}^f_i(x, x'; r) = \int_0^1 \frac{\partial f(r(t))}{\partial r_i(t)}\,\frac{\partial r_i(t)}{\partial t}\,dt, \tag{1}$$

which formalizes the contribution of the $i$-th variable as the integration of the $i$-th partial derivative as we move along the curve $r$. Let $\mathrm{IG}^f(x, x'; r)$ be the vector whose $i$-th component is $\mathrm{IG}^f_i$; then $\mathrm{IG}^f$ satisfies some natural axioms. For example, the Axiom of Completeness says that summing all coordinates gives the change of function value: $\mathrm{sum}(\mathrm{IG}^f(x, x'; r)) = \sum_{i=1}^d \mathrm{IG}^f_i(x, x'; r) = f(x') - f(x)$. We refer readers to the paper [STY17] for the other axioms IG satisfies.
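As a concrete illustration of Equation (1) and its Riemann-sum estimate, below is a minimal sketch (not from the paper's codebase) that approximates IG along the straight-line curve $r(t) = x + t(x' - x)$; the names `integrated_gradients` and `grad_f` (a user-supplied gradient oracle) are illustrative. The final check illustrates the Axiom of Completeness numerically.

```python
import numpy as np

def integrated_gradients(grad_f, x, x_prime, m=50):
    """Approximate IG^f(x, x'; r) with m equally spaced points on the line."""
    delta = x_prime - x
    avg_grad = np.zeros_like(x, dtype=float)
    for k in range(1, m + 1):
        avg_grad += grad_f(x + (k / m) * delta)   # gradient of f at r(k/m)
    avg_grad /= m
    return delta * avg_grad                        # coordinate-wise attribution

# Completeness check: the attributions should sum to roughly f(x') - f(x).
f = lambda z: float(z[0] ** 2 + 3 * z[1])
grad_f = lambda z: np.array([2 * z[0], 3.0])
x, x_prime = np.zeros(2), np.array([1.0, 2.0])
attr = integrated_gradients(grad_f, x, x_prime)
print(attr.sum(), f(x_prime) - f(x))               # both close to 7.0
```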
Integrated Gradients for an intermediate layer. We can generalize the theory of IG to an intermediate layer of neurons. The key insight is to leverage the fact that Integrated Gradients is a curve integration. Given some hidden layer $h = [h_1, \ldots, h_l]$, computed by a function $h(x)$ induced by the previous layers, one can naturally view the previous layers as inducing a curve $h \circ r$ which moves from $h(x)$ to $h(x')$ as we move from $x$ to $x'$ along the curve $r$. Viewed this way, we can naturally compute IG for $h$ in a way that leverages all layers of the network.⁴ (⁴Proofs are deferred to Section B.2.)

Lemma 1. Let $r : [0, 1] \to \mathbb{R}^d$ be a curve such that $r(0) = x$ and $r(1) = x'$, and let $h$ be the function induced by the layers before $h$. Then for a differentiable $f$, the attribution for $h_i$ is
$$\mathrm{IG}^f_{h_i}(x, x'; r) = \int_0^1 \frac{\partial f(h(r(t)))}{\partial h_i}\,\frac{\partial h_i(r(t))}{\partial t}\,dt. \tag{2}$$
The corresponding summation approximation is
$$\mathrm{IG}^f_{h_i}(x, x') \approx \frac{1}{m}\sum_{k=1}^{m} \frac{\partial f(h(r(k/m)))}{\partial h_i}\,\frac{\partial h_i(r(t))}{\partial t}\Big|_{t=k/m}. \tag{3}$$

3 Robust Attribution Regularization

In this section we propose objectives for achieving robust attribution, and study their connections with existing robust training objectives. At a high level, given a loss function $\ell$ and a data generating distribution $P$, our objectives contain two parts: (1) achieving a small loss over the data generating distribution $P$, and (2) keeping the IG attributions of the loss over $P$ close to the IG attributions over a distribution $Q$ whenever $P$ and $Q$ are close to each other. We can naturally encode these two goals in existing robust optimization models. Below we do so for two popular models: the uncertainty set model and the distributional robustness model.

3.1 Uncertainty Set Model

In the uncertainty set model, for any sample $(x, y) \sim P$ from a data generating distribution $P$, we think of it as a nominal point and assume that the real sample comes from a neighborhood $N(x, \varepsilon)$ around $x$. In this case, given any intermediate layer $h$, we propose the following objective function:

$$\mathbb{E}_{(x,y)\sim P}\Big[\ell(x, y; \theta) + \lambda \max_{x'\in N(x,\varepsilon)} s\big(\mathrm{IG}^{\ell_y}_h(x, x'; r)\big)\Big], \tag{4}$$

where $\lambda \ge 0$ is a regularization parameter, $\ell_y$ is the loss function with the label $y$ fixed, $\ell_y(x; \theta) = \ell(x, y; \theta)$, $r : [0, 1] \to \mathbb{R}^d$ is a curve parameterization from $x$ to $x'$, and $\mathrm{IG}^{\ell_y}$ is the integrated gradients of $\ell_y$, which therefore gives the attribution of the change of $\ell_y$ as we go from $x$ to $x'$. $s(\cdot)$ is a size function that measures the size of the attribution. (We stress that this regularization term depends on the model parameters $\theta$ through the loss function $\ell_y$.)

We now study some particular instantiations of the objective (4). Specifically, we recover existing robust training objectives under weak instantiations (such as choosing $s(\cdot)$ as the summation function, which is not a metric, or using a crude approximation of IG), and also derive new instantiations that are natural extensions of existing ones.

Proposition 1 (Madry et al.'s robust prediction objective). If we set $\lambda = 1$ and let $s(\cdot)$ be the sum function (summing all components of a vector), then for any curve $r$ and any intermediate layer $h$, (4) is exactly the objective proposed by Madry et al. [MMS+17a], namely $\mathbb{E}_{(x,y)\sim P}\big[\max_{x'\in N(x,\varepsilon)} \ell(x', y; \theta)\big]$.

We note that: (1) sum is a weak size function which does not give a metric. (2) As a result, while this robust prediction objective falls within our framework and regularizes robust attributions, it allows a small regularization term even when attributions actually change significantly but cancel each other in summation. Therefore, the control over robust attributions can be weak.

Proposition 2 (Input gradient regularization). For any $\lambda' > 0$ and $q \ge 1$, if we set $\lambda = \lambda'/\varepsilon^q$, $s(\cdot) = \|\cdot\|_q^q$, and use only the first term of the summation approximation (3) to approximate IG, then (4) becomes exactly the input gradient regularization of Drucker and LeCun [DL92], where the per-example objective is $\ell(x, y; \theta) + \lambda'\|\nabla_x \ell(x, y; \theta)\|_q^q$.
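To see how Proposition 2 arises, the following short derivation (our own expansion, assuming $N(x,\varepsilon)$ is the $\ell_\infty$ ball of radius $\varepsilon$, keeping one segment of (3) so that $\mathrm{IG}^{\ell_y}_i(x, x') \approx (x'_i - x_i)\,\partial\ell_y(x)/\partial x_i$, with constant factors absorbed into $\lambda'$) may be helpful:

$$\max_{\|x'-x\|_\infty \le \varepsilon}\; \big\| (x'-x) \odot \nabla_x \ell_y(x) \big\|_q^q
  \;=\; \sum_{i=1}^d \max_{|x'_i - x_i|\le \varepsilon} |x'_i - x_i|^q \left|\frac{\partial \ell_y(x)}{\partial x_i}\right|^q
  \;=\; \varepsilon^q\, \big\|\nabla_x \ell_y(x)\big\|_q^q ,$$

so multiplying by $\lambda = \lambda'/\varepsilon^q$ leaves the regularizer $\lambda'\|\nabla_x \ell(x, y; \theta)\|_q^q$, i.e., input gradient regularization.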
In the above we have considered instantiations with a weak size function (the summation function), which recovers Madry et al.'s objective, and with a weak approximation of IG (keeping only the first term), which recovers input gradient regularization. In the next example, we pick a nontrivial size function, the $\ell_1$-norm $\|\cdot\|_1$, and use the precise IG, but then use a trivial "intermediate layer": the loss output $\ell_y$.

Proposition 3 (Regularizing by the attribution of the loss output). Let us set $\lambda = 1$, $s(\cdot) = \|\cdot\|_1$, and $h = \ell_y$ (the output layer of the loss function!). Then the per-example objective becomes $\ell_y(x) + \max_{x'\in N(x,\varepsilon)} |\ell_y(x') - \ell_y(x)|$.

We note that this loss function is a surrogate for Madry et al.'s loss function, because
$$\ell_y(x) + \max_{x'\in N(x,\varepsilon)} |\ell_y(x') - \ell_y(x)| \;\ge\; \ell_y(x) + \max_{x'\in N(x,\varepsilon)} \big(\ell_y(x') - \ell_y(x)\big) \;=\; \max_{x'\in N(x,\varepsilon)} \ell_y(x').$$
Therefore, even at such a trivial instantiation, robust attribution regularization provides interesting guarantees.

3.2 Distributional Robustness Model

A different but popular model for robust optimization is the distributional robustness model. In this case we consider a family of distributions $\mathcal{P}$, each member of which is supposed to be a slight variation of a base distribution $P$. The goal of robust optimization is then that certain objective functions obtain stable values over this entire family. Here we apply the same underlying idea to the distributional robustness model: one should get a small loss value over the base distribution $P$, and for any distribution $Q \in \mathcal{P}$, the IG-based attributions should change only a little if we move from $P$ to $Q$. This is formalized as
$$\mathbb{E}_{Z\sim P}[\ell(Z;\theta)] + \lambda \sup_{Q\in\mathcal{P}} \big\{ W_{d_{\mathrm{IG}}}(P, Q) \big\},$$
where $W_{d_{\mathrm{IG}}}(P, Q)$ is the Wasserstein distance between $P$ and $Q$ under a distance metric $d_{\mathrm{IG}}$.⁶ We use the subscript IG to highlight that this metric is related to integrated gradients. We propose again $d_{\mathrm{IG}}(z, z') = s(\mathrm{IG}^{\ell}_h(z, z'))$. We are particularly interested in the case where $\mathcal{P}$ is a Wasserstein ball around the base distribution $P$ under a perturbation cost metric $c(\cdot,\cdot)$, which gives the regularization term $\lambda \sup_{Q:\,W_c(P,Q)\le\rho} \{ W_{d_{\mathrm{IG}}}(P, Q) \}$.

An unsatisfying aspect of this objective, as one can observe, is that $W_{d_{\mathrm{IG}}}$ and $W_c$ can take two different couplings, while intuitively we want to use only one coupling to transport $P$ to $Q$. For example, this objective allows us to pick a coupling $M_1$ under which we achieve $W_{d_{\mathrm{IG}}}$ (recall that the Wasserstein distance is an infimum over couplings), and a different coupling $M_2$ under which we achieve $W_c$, while under $M_1 = (Z, Z')$ we have $\mathbb{E}_{(z,z')\sim M_1}[c(z, z')] > \rho$, violating the constraint. This motivates the following modification:
$$\mathbb{E}_{Z\sim P}[\ell(Z;\theta)] + \lambda \sup_{Q;\, M\in\mathcal{Q}(P,Q)} \mathbb{E}_{M=(Z,Z')}\big[d_{\mathrm{IG}}(Z, Z')\big] \quad \text{s.t. } \mathbb{E}_{M=(Z,Z')}\big[c(Z, Z')\big] \le \rho. \tag{5}$$
In this formulation, $\mathcal{Q}(P,Q)$ is the set of couplings of $P$ and $Q$, and $M = (Z, Z')$ is one coupling. $c(\cdot,\cdot)$ is a metric, such as $\|\cdot\|_2$, measuring the cost of an adversary perturbing $z$ to $z'$. $\rho$ is an upper bound on the expected perturbation cost, thus constraining $P$ and $Q$ to be close to each other. $d_{\mathrm{IG}}$ is a metric measuring the change of attributions from $Z$ to $Z'$, where we want a large $d_{\mathrm{IG}}$-change under a small $c$-change. The supremum is taken over $Q$ and over the couplings $\mathcal{Q}(P,Q)$.

Proposition 4 (Wasserstein prediction robustness). Let $s(\cdot)$ be the summation function and $\lambda = 1$. Then for any curve $r$ and any layer $h$, (5) reduces to $\sup_{Q:\,W_c(P,Q)\le\rho} \mathbb{E}_Q[\ell(Z;\theta)]$, which is the objective proposed by Sinha, Namkoong, and Duchi [SND18] for robust predictions.

Lagrange relaxation. For any $\gamma \ge 0$, the Lagrange relaxation of (5) is
$$\mathbb{E}_{Z\sim P}[\ell(Z;\theta)] + \lambda \sup_{Q;\, M\in\mathcal{Q}(P,Q)} \mathbb{E}_{M=(Z,Z')}\big[d_{\mathrm{IG}}(Z, Z') - \gamma\, c(Z, Z')\big], \tag{6}$$
where the supremum is taken over $Q$ (unconstrained) and over all couplings of $P$ and $Q$; we want to find a coupling under which the IG attributions change a lot, while the perturbation cost from $P$ to $Q$ with respect to $c$ is small.

(⁶For supervised learning problems where $P$ is over pairs $Z = (X, Y)$, we use the same treatment as in [SND18], so that the cost function is defined as $c(z, z') = c_x(x, x') + \infty\cdot\mathbf{1}\{y \ne y'\}$. All our theory carries over to such $c$, which has range $\mathbb{R}_+ \cup \{\infty\}$.)
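To make the relaxed regularizer concrete, here is a hedged sketch (not the paper's implementation) of approximating the inner problem $\sup_{z'}\{d_{\mathrm{IG}}(z, z') - \gamma c(z, z')\}$, whose pointwise form is justified by Theorem 1 below, via gradient ascent in PyTorch. The helper name `inner_sup` is illustrative; `ig_fn` is assumed to be a differentiable summation approximation of IG of the loss (for instance a helper like the one sketched in Section 4 below), $d_{\mathrm{IG}}$ is taken as the $\ell_1$ norm of IG, and $c(z, z') = \|z - z'\|_2^2$, step size, and step count are placeholder choices.

```python
import torch

def inner_sup(ig_fn, z, gamma=1.0, lr=0.05, steps=40):
    """Approximate sup_{z'} { ||IG(z, z')||_1 - gamma * ||z - z'||_2^2 } from (6)."""
    # Start from a small random perturbation of z; ig_fn must be built with
    # torch.autograd.grad(..., create_graph=True) so the penalty is differentiable.
    z_prime = (z + 0.01 * torch.randn_like(z)).detach().requires_grad_(True)
    for _ in range(steps):
        penalty = ig_fn(z, z_prime).abs().sum() - gamma * ((z_prime - z) ** 2).sum()
        grad, = torch.autograd.grad(penalty, z_prime)
        z_prime = (z_prime + lr * grad).detach().requires_grad_(True)
    return z_prime.detach()
```

Smaller values of $\gamma$ tolerate larger perturbation costs, mirroring the role of the budget $\rho$ in (5).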
Recall that $g : \mathbb{R}^d \times \mathbb{R}^d \to \bar{\mathbb{R}}$ is a normal integrand if for each $\alpha$, the mapping $z \mapsto \{z' \mid g(z, z') \le \alpha\}$ is closed-valued and measurable [RW09]. Our next two theorems generalize the duality theory in [SND18] to a much larger, but natural, class of objectives. Write $d^\gamma_{\mathrm{IG}}(z, z') = d_{\mathrm{IG}}(z, z') - \gamma\, c(z, z')$.

Theorem 1. Suppose $c(z, z) = 0$ and $d_{\mathrm{IG}}(z, z) = 0$ for any $z$, and suppose $\gamma c(z, z') - d_{\mathrm{IG}}(z, z')$ is a normal integrand. Then
$$\sup_{Q;\, M\in\mathcal{Q}(P,Q)} \mathbb{E}_{M=(Z,Z')}\big[d^\gamma_{\mathrm{IG}}(Z, Z')\big] = \mathbb{E}_{z\sim P}\Big[\sup_{z'} \big\{d^\gamma_{\mathrm{IG}}(z, z')\big\}\Big].$$
Consequently, (6) is equal to the following:
$$\mathbb{E}_{z\sim P}\Big[\ell(z; \theta) + \lambda \sup_{z'} \big\{d_{\mathrm{IG}}(z, z') - \gamma\, c(z, z')\big\}\Big]. \tag{7}$$

The assumption $d_{\mathrm{IG}}(z, z) = 0$ is true for what we propose, and $c(z, z) = 0$ is true for any typical cost such as $\ell_p$ distances. The normal integrand assumption is also very weak; e.g., it is satisfied when $d_{\mathrm{IG}}$ is continuous and $c$ is closed convex. Note that (7) and (4) are very similar, and so we use (4) for the rest of the paper. Finally, given Theorem 1, we are also able to connect (5) and (7) with the following duality result:

Theorem 2. Suppose $c(z, z) = 0$ and $d_{\mathrm{IG}}(z, z) = 0$ for any $z$, and suppose $\gamma c(z, z') - d_{\mathrm{IG}}(z, z')$ is a normal integrand. For any $\rho > 0$, there exists $\gamma \ge 0$ such that the optimal solutions of (7) are optimal for (5).

3.3 One-Layer Neural Networks

We now consider the special case of one-layer neural networks, where the loss function takes the form $\ell(x, y; w) = g(-y\langle w, x\rangle)$, $w$ is the vector of model parameters, $x$ is a feature vector, $y$ is a label, and $g$ is nonnegative. We take $s(\cdot)$ to be $\|\cdot\|_1$, which corresponds to a strong instantiation that does not allow attributions to cancel each other. Interestingly, we prove that for natural choices of $g$ this is exactly Madry et al.'s objective [MMS+17a], which corresponds to $s(\cdot) = \mathrm{sum}(\cdot)$. That is, the strong ($s(\cdot) = \|\cdot\|_1$) and weak ($s(\cdot) = \mathrm{sum}(\cdot)$) instantiations coincide for one-layer neural networks. This says that for generalized linear classifiers, robust interpretation coincides with robust prediction, and further with classic soft-margin training.

Theorem 3. Suppose that $g$ is differentiable, non-decreasing, and convex. Then for $\lambda = 1$, $s(\cdot) = \|\cdot\|_1$, and the $\ell_\infty$ neighborhood $N(x, \varepsilon) = \{x' : \|x' - x\|_\infty \le \varepsilon\}$, (4) reduces to Madry et al.'s objective:
$$\sum_i \max_{\|x'_i - x_i\|_\infty \le \varepsilon} g\big(-y_i\langle w, x'_i\rangle\big) \;\;\text{(Madry et al.'s objective)} \;=\; \sum_i g\big(-y_i\langle w, x_i\rangle + \varepsilon\|w\|_1\big) \;\;\text{(soft-margin)}.$$
Natural losses, such as the negative log-likelihood and the softplus hinge loss, satisfy the conditions of this theorem.

4 Instantiations and Optimizations

In this section we discuss instantiations of (4) and how to optimize them. We start by presenting two objectives instantiated from our method: (1) IG-NORM and (2) IG-SUM-NORM. Then we discuss how to use gradient descent to optimize these objectives.

IG-NORM. As our first instantiation, we pick $s(\cdot) = \|\cdot\|_1$, $h$ to be the input layer, and $r$ to be the straight line connecting $x$ and $x'$. This gives
$$\mathbb{E}_{(x,y)\sim P}\Big[\ell(x, y; \theta) + \lambda \max_{x'\in N(x,\varepsilon)} \big\|\mathrm{IG}^{\ell_y}(x, x')\big\|_1\Big]. \quad \text{(IG-NORM)}$$

IG-SUM-NORM. In the second instantiation we combine the sum size function and the norm size function, and define $s(\cdot) = \mathrm{sum}(\cdot) + \beta\|\cdot\|_1$, where $\beta \ge 0$ is a regularization parameter. With the same $h$ and $r$ as above, and setting $\lambda = 1$, our method simplifies to
$$\mathbb{E}_{(x,y)\sim P}\Big[\max_{x'\in N(x,\varepsilon)} \big\{\ell(x', y; \theta) + \beta\big\|\mathrm{IG}^{\ell_y}(x, x')\big\|_1\big\}\Big], \quad \text{(IG-SUM-NORM)}$$
which can be viewed as appending an extra robust IG term to $\ell(x', y; \theta)$.

Gradient descent optimization. We propose the following gradient descent framework to optimize the objectives. The framework is parameterized by an adversary $A$ which is supposed to solve the inner max by finding a point $x^\star$ that changes the attribution significantly. Specifically, given a point $(x, y)$ at time step $t$ during SGD training, we have the following two steps (this can easily be generalized to mini-batches):

Attack step. We run $A$ on $(x, y)$ to find $x^\star$ that produces a large inner max term (that is, $\|\mathrm{IG}^{\ell_y}(x, x^\star)\|_1$ for IG-NORM, and $\ell(x^\star, y; \theta) + \beta\|\mathrm{IG}^{\ell_y}(x, x^\star)\|_1$ for IG-SUM-NORM).

Gradient step. Fixing $x^\star$, we compute the gradient of the corresponding objective with respect to $\theta$, and then update the model. A sketch of one such training step is given below.
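The following is a minimal PyTorch sketch of one IG-NORM training step following these two steps. It is our own illustration rather than the paper's TensorFlow implementation; it assumes `model` returns logits, labels `y` are integer class indices, inputs lie in [0, 1], the adversary $A$ is PGD, and the step sizes and counts (`eps`, `alpha`, `pgd_steps`, `lam`, `m`) are placeholders.

```python
import torch
import torch.nn.functional as F

def ig_of_loss(model, x, x_adv, y, m=10):
    """Summation approximation of IG of the loss ell_y along the straight line
    from x to x_adv; create_graph=True keeps it differentiable for both steps."""
    delta = x_adv - x
    grad_sum = 0.0
    for k in range(1, m + 1):
        point = x + (k / m) * delta
        if not point.requires_grad:
            point.requires_grad_(True)
        loss = F.cross_entropy(model(point), y)
        g, = torch.autograd.grad(loss, point, create_graph=True)
        grad_sum = grad_sum + g
    return delta * grad_sum / m

def ig_norm_step(model, opt, x, y, eps=0.3, alpha=0.01, pgd_steps=10, lam=1.0, m=10):
    # Attack step: PGD adversary A maximizing ||IG^{ell_y}(x, x*)||_1 over N(x, eps).
    x_adv = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0, 1)
    for _ in range(pgd_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        reg = ig_of_loss(model, x, x_adv, y, m).abs().sum()
        g, = torch.autograd.grad(reg, x_adv)
        x_adv = x_adv + alpha * g.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    # Gradient step: minimize loss(x) + lambda * ||IG(x, x*)||_1 w.r.t. theta.
    opt.zero_grad()
    obj = F.cross_entropy(model(x), y) \
        + lam * ig_of_loss(model, x, x_adv.detach(), y, m).abs().sum()
    obj.backward()
    opt.step()
    return obj.item()
```

The hyperparameters mirror those summarized in Table 1 in the next section; in particular, the number of IG segments `m` may be chosen differently in the attack and gradient steps for efficiency.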
Important objective parameters. In both the attack and the gradient steps we need to differentiate IG (in the attack step, $\theta$ is fixed and we differentiate w.r.t. $x'$, while in the gradient step this is reversed), and this induces a set of objective parameters to tune for optimization, summarized in Table 1. Differentiating the summation approximation of IG amounts to computing second partial derivatives. We rely on the auto-differentiation capability of TensorFlow [ABC+16] to compute second derivatives.

| Parameter | Description |
| --- | --- |
| Adversary $A$ | Adversary to find $x^\star$. Since our goal is simply to maximize the inner term in a neighborhood, in this paper we choose Projected Gradient Descent for this purpose. |
| $m$ in the attack step | To differentiate IG in the attack step, we use the summation approximation of IG; this is the number of segments used for the approximation. |
| $m$ in the gradient step | Same as above, but in the gradient step. We keep this $m$ separate for efficiency reasons. |
| $\lambda$ | Regularization parameter for IG-NORM. |
| $\beta$ | Regularization parameter for IG-SUM-NORM. |

Table 1: Optimization parameters.

5 Experiments

We now perform experiments using our method. We ask the following questions: (1) Comparing models trained by our method and naturally trained models at test time, do we maintain the accuracy on unperturbed test inputs? (2) At test time, if we use the attribution attacks mentioned in [GAZ17] to perturb attributions while keeping predictions intact, how does the attribution robustness of our models compare with that of the naturally trained models? (3) Finally, how does the attribution robustness of our models compare with that of models trained with weak instantiations for robust predictions?

To answer these questions, we perform experiments on four classic datasets: MNIST [LCB98], Fashion-MNIST [XRV17], GTSRB [SSSI12], and Flower [NZ06]. In summary, our findings are the following: (1) Our method results in a very small drop in test accuracy compared with naturally trained models. (2) On the other hand, our method gives significantly better attribution robustness, as measured by correlation analyses. (3) Finally, our models yield comparable prediction robustness (sometimes even better), while consistently improving attribution robustness. In the rest of the section we give more details.

Evaluation setup. In this work we use IG to compute attributions (i.e., feature importance maps), which, as demonstrated by [GAZ17], is more robust compared to other related methods (note that IG also enjoys other theoretical properties). To attack attributions while retaining model predictions, we use the Iterative Feature Importance Attacks (IFIA) proposed by [GAZ17]. Due to lack of space, we defer details of parameters and other settings to the appendix. We use two metrics to measure attribution robustness (i.e., how similar the attributions are between original and perturbed images); a small sketch of both follows.

Kendall's tau rank order correlation. Attribution methods rank all of the features in order of their importance, so we use the rank correlation [Ken38] to compare similarity between interpretations.

Top-k intersection. We compute the size of the intersection of the k most important features before and after perturbation.
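Below is a minimal sketch of these two metrics, assuming flattened attribution maps given as NumPy arrays and using `scipy.stats.kendalltau`; whether to rank features by absolute attribution values is an implementation choice we assume here, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

def topk_intersection(attr_orig, attr_pert, k=100):
    """Fraction of the k most important features shared before/after the attack."""
    top_orig = set(np.argsort(-np.abs(attr_orig).ravel())[:k])
    top_pert = set(np.argsort(-np.abs(attr_pert).ravel())[:k])
    return len(top_orig & top_pert) / k

def rank_correlation(attr_orig, attr_pert):
    """Kendall's tau rank order correlation between two attribution maps."""
    tau, _ = kendalltau(np.abs(attr_orig).ravel(), np.abs(attr_pert).ravel())
    return tau
```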
Compared with [GAZ17], we use Kendall's tau correlation instead of Spearman's rank correlation. The reason is that we found that on the GTSRB and Flower datasets, Spearman's correlation is not consistent with visual inspection and often produces overly high correlations. In comparison, Kendall's tau correlation consistently produces lower correlations and aligns better with visual inspection. Finally, when computing attribution robustness, we only consider the test samples that are correctly classified by the model.

[Figure 2: Experiment results on MNIST, Fashion-MNIST, GTSRB and Flower.]

Comparing with natural models. Panels (a), (b), (c), and (d) of Figure 2 show that, compared with naturally trained models, robust attribution training gives significant improvements in attribution robustness (measured by either median or confidence intervals). The exact numbers are recorded in Table 2: compared with naturally trained models (rows where Approach is NATURAL), robust attribution training has significantly better adversarial accuracy and attribution robustness, while having a small drop in natural accuracy (denoted by Nat Acc.).

Ineffective optimization. We observe that even when our training stops, the attribution regularization term remains much more significant than the natural loss term. For example, for IG-NORM, when training stops on MNIST, $\ell(x)$ typically stays around 1, but $\|\mathrm{IG}(x, x')\|_1$ stays at 10–20. This indicates that the optimization has not been very effective in minimizing the regularization term. There are two possible reasons for this: (1) Because we use the summation approximation of IG, we are forced to compute second derivatives, which may not be numerically stable for deep networks. (2) The network architecture may be inherently unsuitable for robust attributions, rendering the optimization hard to converge.

Comparing with robust prediction models. Finally, we compare with Madry et al.'s models, which are trained for robust prediction. We use Adv Acc. to denote adversarial accuracy (prediction accuracy on perturbed inputs). Again, Top-K Inter. denotes the average top-K intersection (K = 100 for the MNIST, Fashion-MNIST and GTSRB datasets, K = 1000 for Flower), and Rank Corr. denotes the average Kendall's rank order correlation. Table 2 gives the details of the results. As we can see, our models give comparable adversarial accuracy, and are sometimes even better (on the Flower dataset). On the other hand, we are consistently better in terms of attribution robustness.

| Dataset | Approach | Nat Acc. | Adv Acc. | Top-K Inter. | Rank Corr. |
| --- | --- | --- | --- | --- | --- |
| MNIST | NATURAL | 99.17% | 0.00% | 46.61% | 0.1758 |
| MNIST | Madry et al. | 98.40% | 92.47% | 62.56% | 0.2422 |
| MNIST | IG-NORM | 98.74% | 81.43% | 71.36% | 0.2841 |
| MNIST | IG-SUM-NORM | 98.34% | 88.17% | 72.45% | 0.3111 |
| Fashion-MNIST | NATURAL | 90.86% | 0.01% | 39.01% | 0.4610 |
| Fashion-MNIST | Madry et al. | 85.73% | 73.01% | 46.12% | 0.6251 |
| Fashion-MNIST | IG-NORM | 85.13% | 65.95% | 59.22% | 0.6171 |
| Fashion-MNIST | IG-SUM-NORM | 85.44% | 70.26% | 72.08% | 0.6747 |
| GTSRB | NATURAL | 98.57% | 21.05% | 54.16% | 0.6790 |
| GTSRB | Madry et al. | 97.59% | 83.24% | 68.85% | 0.7520 |
| GTSRB | IG-NORM | 97.02% | 75.24% | 74.81% | 0.7555 |
| GTSRB | IG-SUM-NORM | 95.68% | 77.12% | 74.04% | 0.7684 |
| Flower | NATURAL | 86.76% | 0.00% | 8.12% | 0.4978 |
| Flower | Madry et al. | 83.82% | 41.91% | 55.87% | 0.7784 |
| Flower | IG-NORM | 85.29% | 24.26% | 64.68% | 0.7591 |
| Flower | IG-SUM-NORM | 82.35% | 47.06% | 66.33% | 0.7974 |

Table 2: Experiment results including prediction accuracy, prediction robustness, and attribution robustness.
6 Discussion and Conclusion

This paper builds a theory to robustify model interpretations through the lens of axiomatic attribution of neural networks. We show that our theory gives principled generalizations of previous formulations for robust predictions, and we characterize our objectives for one-layer neural networks. We believe that our work opens many intriguing avenues for future research, and we discuss a few topics below.

Why do we want robust attributions? Model attributions are facts about model behaviors. While robust attribution does not necessarily mean that the attribution is correct, a model with brittle attributions can never be trusted. To this end, it seems interesting to examine attribution methods other than Integrated Gradients.

Robust attribution leads to more human-aligned attribution. Our proposed training scheme requires both prediction correctness and robust attributions, and therefore it encourages learning invariant features from the data that are also highly predictive. In our experiments, we found an intriguing phenomenon: our regularized models produce attributions that are much more aligned with human perception (for example, see Figure 1). Our results are aligned with the recent work [TSE+19, EIS+19].

Robust attribution may help tackle spurious correlations. In view of our discussion so far, we think it is plausible that robust attribution regularization can help remove spurious correlations, because intuitively spurious correlations should not be reliably attributable. Future research on this potential connection seems warranted.

Difficulty of optimization. While our experimental results are encouraging, we observe that when training stops, the attribution regularization term remains significant (typically around tens to hundreds), which indicates ineffective optimization of the objectives. A main problem here is network depth: as depth increases, we get very unstable trajectories of gradient descent, which seems related to the use of second-order information during robust attribution optimization (due to the summation approximation, the training objectives contain first-order gradient terms). It is therefore natural to further study better optimization techniques or better architectures for robust attribution training.

7 Acknowledgments

This work is partially supported by Air Force Grant FA9550-18-1-0166, National Science Foundation (NSF) Grants CCF-FMitF-1836978, SaTC-Frontiers-1804648 and CCF-1652140, and ARO grant number W911NF-17-1-0405.

References

[ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.

[BTEGN09] A. Ben-Tal, L. El Ghaoui, and A. S. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, October 2009.

[DL92] Harris Drucker and Yann LeCun. Improving generalization performance using double backpropagation. IEEE Trans. Neural Networks, 3(6):991–997, 1992.

[EIS+19] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019.

[GAZ17] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. CoRR, abs/1710.10547, 2017.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Ken38] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[KM18] Zico Kolter and Aleksander Madry. Adversarial robustness: theory and practice, 2018. https://adversarial-ml-tutorial.org/.

[LCB98] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.

[Lue97] David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1997.

[MEK15] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116, 2015.

[MMS+17a] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.

[MMS+17b] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[NZ06] M-E. Nilsback and Andrew Zisserman. A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1447–1454. IEEE, 2006.

[RD18] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 1660–1669, 2018.

[RW09] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[SND18] Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In 6th International Conference on Learning Representations, ICLR 2018, 2018.

[SSSI12] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.

[STY17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328, 2017.

[SVZ13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[TSE+19] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.

[WK18] Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 5283–5292, 2018.

[XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.