# Interpreting Robust Optimization via Adversarial Influence Functions

Zhun Deng 1, Cynthia Dwork 1, Jialiang Wang 1, Linjun Zhang 2

1 John A. Paulson School of Engineering and Applied Sciences, Harvard University. 2 Department of Statistics, Rutgers University. Correspondence to: Zhun Deng.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Robust optimization has been widely used in modern data science, especially in adversarial training. However, little research has been done to quantify how robust optimization changes the optimizers and the prediction losses compared to standard training. In this paper, inspired by the influence function in robust statistics, we introduce the Adversarial Influence Function (AIF) as a tool to investigate the solution produced by robust optimization. The proposed AIF enjoys a closed form and can be calculated efficiently. To illustrate the usage of AIF, we apply it to study model sensitivity, a quantity defined to capture the change of prediction losses on the natural data after implementing robust optimization. We use AIF to analyze how model complexity and randomized smoothing affect the model sensitivity with respect to specific models. We further derive AIF for kernel regressions, with a particular application to neural tangent kernels, and experimentally demonstrate the effectiveness of the proposed AIF. Lastly, the theory of AIF is extended to distributional robust optimization.

1. Introduction

Robust optimization is a classic field of optimization theory that seeks to achieve a certain measure of robustness against uncertainty in the parameters or inputs involved (Ben-Tal et al., 2009; Beyer & Sendhoff, 2007). Recently, it has been used to address a concern about deep neural networks: they are vulnerable to adversarial perturbations (Goodfellow et al., 2014; Szegedy et al., 2013). In supervised learning, given input $x$, output $y$ and a certain loss function $l$, adversarial training through robust optimization for a model $M$ is formulated as
$$\min_{\theta^M\in\Theta}\; E_{x,y}\max_{\delta\in R(x)} l(\theta^M, x+\delta, y, M), \qquad (1)$$
where $R(x)$ is some constraint set, which is usually taken as a small neighborhood of $x$ in robust optimization. For example, in image recognition (He et al., 2016), an adversarial attack should be small so that it is visually imperceptible.

Although adversarial training through robust optimization has achieved great success in defending against adversarial attacks (Madry et al., 2017), the influence of such adversarial training on predictions is under-explored, even for a simple model $M$. In particular, let us define the regular optimizer and the robust optimizer respectively:
$$\theta^M_{\min} := \arg\min_{\theta^M\in\Theta} E_{x,y}\, l(\theta^M, x, y, M), \qquad \theta^M_{\varepsilon,\min} := \arg\min_{\theta^M\in\Theta} E_{x,y}\max_{\delta\in R(x,\varepsilon)} l(\theta^M, x+\delta, y, M). \qquad (2)$$
It is unclear how $E_{x,y}\,l(\theta^M_{\varepsilon,\min}, x, y, M)$, the prediction loss on the original data with the robust optimizer, performs compared to the optimal prediction loss $E_{x,y}\,l(\theta^M_{\min}, x, y, M)$. The difficulty in studying this question is the underlying NP-hardness of solving robust optimization. Even for simple models, say quadratic models, the robust optimization problem is NP-hard if the constraint set is polyhedral (Minoux, 2010).
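In practice, the inner maximization in Eq. (1) is approximated numerically, most commonly by projected gradient ascent in the spirit of (Madry et al., 2017). The sketch below is a minimal illustration for a single example rather than the exact procedure used in this paper; the gradient oracle `grad_x_loss` is a hypothetical helper supplied by the user.

```python
import numpy as np

def pgd_perturbation(x, y, theta, grad_x_loss, radius, steps=20, lr=None, p=2):
    """Approximate argmax_{||delta||_p <= radius} l(theta, x + delta, y)
    by projected gradient ascent. `grad_x_loss(theta, x, y)` returns the
    gradient of the loss with respect to the input x (hypothetical helper)."""
    lr = lr if lr is not None else 2.5 * radius / steps
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_x_loss(theta, x + delta, y)
        if p == np.inf:
            delta = np.clip(delta + lr * np.sign(g), -radius, radius)  # ascent step + l_inf projection
        else:  # p == 2
            delta = delta + lr * g / (np.linalg.norm(g) + 1e-12)       # normalized ascent step
            nrm = np.linalg.norm(delta)
            if nrm > radius:
                delta = delta * (radius / nrm)                         # project back onto the l_2 ball
    return delta
```

Here `radius` plays the role of the size of the constraint set $R(x)$; in the notation introduced in Section 2 it corresponds to $\varepsilon\, E_{x\sim P_x}\|x\|_p$.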
To address this problem, drawing inspiration from the influence function in robust statistics (Croux & Haesbroeck, 1999; Hampel, 1974; Huber & Ronchetti, 2009), which characterizes how the prediction loss changes when a small fraction of data points is contaminated, we propose the Adversarial Influence Function (AIF) to investigate the influence of robust optimization on the prediction loss. Taking advantage of small perturbations, the AIF has a closed-form expression and can be calculated efficiently. Moreover, the AIF enables us to analyze the prediction error without implementing the robust optimization, which typically takes a long time due to the computational burden of searching for adversaries.

The rest of the paper is organized as follows. Section 2 lays out the setup and notation. Section 3 defines model sensitivity, which is used to understand how robust optimization affects the predictions. To efficiently approximate the model sensitivity, Section 4 introduces the AIF. Further, in Section 5, we present several case studies, applying the proposed AIF to theoretically analyze the relationship between model sensitivity and model complexity as well as randomized smoothing. In Section 6, we extend the AIF theory to kernel regressions and distributional robust optimization.

1.1. Related work

Adversarial training and robust optimization. Since (Goodfellow et al., 2014) proposed adversarial training, many innovative methods have been invented to improve the performance of adversarial training, such as (Agarwal et al., 2018; Liu & Hsieh, 2019; Shafahi et al., 2019; Yin et al., 2018). Earlier work only added adversarial examples in a few rounds during training, and many of those defenses have been evaded by new attacks (Athalye et al., 2018). In (Madry et al., 2017), the authors proposed to use projected gradient ascent and obtained the state-of-the-art result. They further pointed out that adversarial training can be formulated through the lens of robust optimization. Nevertheless, robust optimization has a very deep root in engineering (Taguchi & Phadke, 1989), but many robust optimization problems are NP-hard (Minoux, 2010), and solving such problems heavily relies on high-speed computers and their exponentially increasing FLOPS rates (Park et al., 2006). Our adversarial influence function may bridge the gap between theoretical analysis and engineering implementation of robust optimization to a certain degree, and improve our understanding of robust optimization.

Robust statistics. Robust statistics has recently been applied to machine learning and has achieved impressive successes. (Koh & Liang, 2017) used the influence function to understand the predictions of a black-box model. (Debruyne et al., 2008; Liu et al., 2014) and (Christmann & Steinwart, 2004) used the influence function for model selection and cross-validation in kernel methods. Recently, (Bayraktar & Lai, 2018) extended the influence function to the adversarial setting and investigated the adversarial robustness of multivariate M-estimators. We remark here that their adversarial influence function is different from ours: they focused on the influence on parameter inference, while ours focuses on the influence of robust optimization on the prediction.

2. Setup and Notation

In this paper, we consider the task of mapping an $m$-dimensional input $x\in\mathcal X\subseteq\mathbb R^m$ to a scalar output $y\in\mathcal Y$, with joint distribution $(x,y)\sim P_{x,y}$ and marginal distributions $x\sim P_x$, $y\sim P_y$.
We have a training dataset $(X^t, Y^t) = \{(x^t_1, y^t_1), \dots, (x^t_{n_t}, y^t_{n_t})\}$ and an evaluation dataset $(X^e, Y^e) = \{(x^e_1, y^e_1), \dots, (x^e_{n_e}, y^e_{n_e})\}$. For a given model architecture $M$, the loss function is denoted as $l(\theta^M, x, y, M)$ with parameter $\theta^M\in\Theta\subseteq\mathbb R^d$ (we will sometimes omit $M$ in $l$ when this causes no confusion). For robust optimization, we focus on studying the constraint set
$$R(x,\varepsilon) = \{\omega\in\mathcal X : \|\omega - x\|_p \le \varepsilon\, E_{x\sim P_x}\|x\|_p\}$$
with small $\varepsilon$, where $\|\cdot\|_p$ is the $\ell_p$ norm. This type of constraint set is also called an $\ell_p$-attack in adversarial learning; it implies that the adversaries are allowed to observe the whole dataset and are able to contaminate each data point $x_i$ a little bit. This is commonly used in adversarial training for image classification in machine learning, and the constant factor $E_{x\sim P_x}\|x\|_p$ is for scale consideration.[1] Further, we denote the empirical versions of the minimizers for regular optimization and robust optimization in Eq. (2):
$$\hat\theta^M_{\min} := \arg\min_{\theta^M\in\Theta}\sum_{i=1}^{n_t} l(\theta^M, x^t_i, y^t_i, M), \qquad \hat\theta^M_{\varepsilon,\min} := \arg\min_{\theta^M\in\Theta}\sum_{i=1}^{n_t}\max_{\delta_i\in\hat R(x^t_i,\varepsilon)} l(\theta^M, x^t_i+\delta_i, y^t_i, M),$$
where $\hat R(x^t_i,\varepsilon) = \{u\in\mathcal X : \|u - x^t_i\|_p \le \varepsilon\,\hat E_{x^t}\|x\|_p\}$, with $\hat E_{x^t}$ being the expectation with respect to the empirical probability distribution of $x^t$.

We use $\mathrm{sgn}(x)$ to denote the sign function: $\mathrm{sgn}(x) = 1$ if $x > 0$, $\mathrm{sgn}(x) = 0$ if $x = 0$, and $-1$ otherwise. We also use $[n]$ to denote the set $\{1, 2, \dots, n\}$. Further, we use the notation $o_p$ and $O_p$, where for a sequence of random variables $X_n$, $X_n = o_p(a_n)$ means $X_n/a_n \to 0$ in probability, and $X_n = O_p(b_n)$ means that for any $\varepsilon > 0$, there is a constant $K$ such that $P(|X_n| \le K b_n) \ge 1 - \varepsilon$.

[1] For standard MNIST, the average $\ell_2$ norm of $x$ is 9.21 with dimension $m = 28\times 28$. The attack size does not have to be small, but $\varepsilon$, as the ratio of the magnitude of the adversarial attack to the average magnitude of the images, is small.

3. Model Sensitivity

In order to quantify how robust optimization affects predictions, we first define the model sensitivity with respect to the robust optimization.

Definition 3.1 (ε-sensitivity/adversarial cost). For a given model $M$, the ε-sensitivity/adversarial cost is defined as
$$S_\varepsilon(M) := E_{x,y}\,l(\theta^M_{\varepsilon,\min}, x, y, M) - E_{x,y}\,l(\theta^M_{\min}, x, y, M).$$

The ε-sensitivity/adversarial cost quantifies how much robust optimization increases the expected loss, and this loss also indicates the additional cost of being adversarially robust. Besides this straightforward interpretation, one can also interpret $S_\varepsilon(M)$ as a trade-off between prediction loss and robustness for model architecture $M$: the optimizer $\theta^M_{\varepsilon,\min}$ is more adversarially robust but inflates the prediction loss compared to $\theta^M_{\min}$. For fixed $\varepsilon$, an architecture $M$ with small ε-sensitivity can achieve adversarial robustness through robust optimization without sacrificing the performance on the original data too much. We also say an architecture $M$ with smaller ε-sensitivity is more stable.

Since $\theta^M_{\min}$ is the minimizer of $E_{x,y}\,l(\theta^M, x, y, M)$ over $\theta^M$, if we further have $\theta^M_{\min}\in\Theta^\circ$, where $\Theta^\circ$ denotes the interior of $\Theta$, and $l$ is twice differentiable, then by Taylor expansion we have
$$S_\varepsilon(M) = \frac12\big(\Delta\theta^M_{\varepsilon,\min}\big)^T\, E_{x,y}\nabla^2 l(\theta^M_{\min}, x, y, M)\,\Delta\theta^M_{\varepsilon,\min} + o\big(\|\Delta\theta^M_{\varepsilon,\min}\|_2^2\big),$$
where $\Delta\theta^M_{\varepsilon,\min} = \theta^M_{\varepsilon,\min} - \theta^M_{\min}$, and the remainder is negligible if $\varepsilon$ is small enough. Given the training set $(X^t, Y^t)$ and the evaluation set $(X^e, Y^e)$, we define the empirical ε-sensitivity
$$\hat S_\varepsilon(M) := \frac12\big(\Delta\hat\theta^M_{\varepsilon,\min}\big)^T\, E_{\hat P_{x^e,y^e}}\nabla^2 l(\hat\theta^M_{\min}, x, y, M)\,\Delta\hat\theta^M_{\varepsilon,\min}, \qquad (3)$$
obtained by omitting the remainder $o(\|\Delta\hat\theta^M_{\varepsilon,\min}\|_2^2)$, where $\Delta\hat\theta^M_{\varepsilon,\min} = \hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min}$.
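Before introducing the AIF, it is instructive to see what computing $\hat S_\varepsilon(M)$ directly entails. The sketch below is a minimal illustration rather than the authors' code: it fits a small simulated linear regression (mirroring the simulation later used for Figure 1), obtains $\hat\theta^M_{\varepsilon,\min}$ by gradient descent on the robust risk (for a linear model with squared loss and an $\ell_2$ attack the inner maximization has a closed form), and compares the plain loss gap on the evaluation data with the quadratic form of Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (two standard Gaussian features, as in the Figure 1 experiment)
n_t, n_e, m = 500, 500, 2
Xt, Xe = rng.normal(size=(n_t, m)), rng.normal(size=(n_e, m))
beta_true = np.array([2.0, -3.4])
yt = Xt @ beta_true + 0.1 * rng.normal(size=n_t)
ye = Xe @ beta_true + 0.1 * rng.normal(size=n_e)

def avg_loss(theta, X, y):
    return 0.5 * np.mean((y - X @ theta) ** 2)

# Regular optimizer: ordinary least squares
theta_min = np.linalg.lstsq(Xt, yt, rcond=None)[0]

# Robust optimizer for an l2 attack of radius r = eps * average ||x||_2.
# For a linear model with squared loss the inner maximization has the closed form
#   max_{||d||_2 <= r} (y - theta^T (x + d))^2 = (|y - theta^T x| + r ||theta||_2)^2,
# so the robust empirical risk can be minimized directly by gradient descent.
eps = 0.05
r = eps * np.mean(np.linalg.norm(Xt, axis=1))

def robust_grad(theta):
    res = yt - Xt @ theta
    worst = np.abs(res) + r * np.linalg.norm(theta)
    d_abs = -np.sign(res)[:, None] * Xt                   # gradient of |res_i| w.r.t. theta
    d_pen = r * theta / (np.linalg.norm(theta) + 1e-12)   # gradient of r * ||theta||_2
    return np.mean(worst[:, None] * (d_abs + d_pen), axis=0)

theta_eps = theta_min.copy()
for _ in range(2000):
    theta_eps -= 0.05 * robust_grad(theta_eps)

# Empirical sensitivity: loss gap on the evaluation data (Definition 3.1) ...
S_direct = avg_loss(theta_eps, Xe, ye) - avg_loss(theta_min, Xe, ye)
# ... and its quadratic approximation, Eq. (3), with evaluation Hessian X_e^T X_e / n_e
dtheta = theta_eps - theta_min
S_quad = 0.5 * dtheta @ (Xe.T @ Xe / n_e) @ dtheta
print(S_direct, S_quad)
```

The next section shows how the AIF approximates the same quantity without ever solving the robust optimization problem.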
Notice that Eq. (3) involves $\hat\theta^M_{\varepsilon,\min}$, the solution of robust optimization, which, even for simple models and loss functions (such as linear regression with quadratic loss), does not have a closed-form expression and is computationally heavy to obtain. In the following sections, we address this problem by introducing the AIF, which provides an efficient way to approximate and analyze $S_\varepsilon(M)$. For simplicity of illustration, we drop the superscripts $t, e$ and use the generic notation $(X, Y) = \{(x_1, y_1), \dots, (x_n, y_n)\}$ for a general dataset in the following sections when there is no ambiguity.

4. Adversarial Influence Function

Unless explicitly stated otherwise, we mainly consider the case where the empirical risk $\sum_{i=1}^n l(\theta^M, x^t_i, y^t_i; M)$ is twice differentiable and strongly convex. A relaxation of these conditions will be discussed in Section 4.1. In order to approximate $\hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min}$ for small $\varepsilon$, we use the approximation
$$\hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min} \approx \varepsilon^\alpha\,\frac{d(\hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min})}{d\varepsilon^\alpha}\Big|_{\varepsilon=0^+} = \varepsilon^\alpha\,\frac{d\hat\theta^M_{\varepsilon,\min}}{d\varepsilon^\alpha}\Big|_{\varepsilon=0^+},$$
where $\alpha > 0$ is the smallest positive real number such that the limit $\lim_{\varepsilon\to0^+}(\hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min})/\varepsilon^\alpha$ is nonzero. Throughout this section, all the cases we consider have $\alpha = 1$; more general cases will be discussed in Section 6.2. Formally, we define the adversarial influence function as follows.

Definition 4.1 (Adversarial Influence Function). For a given model $M$, the adversarial influence function (AIF) is defined as
$$I(M) := \frac{d\theta^M_{\varepsilon,\min}}{d\varepsilon}\Big|_{\varepsilon=0^+}.$$

The AIF measures the changing trend of the optimizer under robust optimization in the limiting sense. With the help of the AIF, we then approximate $S_\varepsilon(M)$ by
$$S_\varepsilon(M) \approx \frac{\varepsilon^2}{2}\, I(M)^T\, E_{x,y}\nabla^2 l(\theta^M_{\min}, x, y, M)\, I(M) \qquad (4)$$
when $\varepsilon$ is small. Next we provide a specific characterization of the empirical adversarial influence function. We denote $\hat I(M) = d\hat\theta^M_{\varepsilon,\min}/d\varepsilon\,\big|_{\varepsilon=0^+}$ as the empirical version of the AIF. Besides, we denote the perturbation vector as $\Delta = (\delta_1^T, \dots, \delta_n^T)^T$. Further, for given $(X, Y)$ and $M$, we define $g(\theta^M, \Delta) = \frac1n\sum_{i=1}^n l(\theta^M, x_i+\delta_i, y_i; M)$ when we only consider the optimization over $(\theta^M, \Delta)$.

Theorem 4.1. Suppose $\mathcal X$, $\mathcal Y$ and $\Theta$ are compact spaces, the loss function $l(\theta, x, y)$ is three times continuously differentiable in $(\theta, x)\in\Theta\times\mathcal X$ for any given $y\in\mathcal Y$, and the empirical Hessian matrix $\hat H_{\hat\theta^M_{\min}} = \frac1n\sum_{i=1}^n\nabla^2_\theta l(\hat\theta^M_{\min}, x_i, y_i)$ is positive definite. Further, assume the empirical risk $\sum_{i=1}^n l(\theta^M, x^t_i, y^t_i; M)$ is twice differentiable and strongly convex, $g(\cdot, \Delta)$ is differentiable for every $\Delta$, $\nabla_\theta g(\theta^M, \Delta)$ is continuous on $\Theta\times\mathcal X$, $\hat\theta^M_{\min}$ lies in the interior of $\Theta$, and $\nabla_x l(\hat\theta^M_{\min}, x_i, y_i, M) \neq 0$ for all $i\in[n]$. Then we have
$$\hat I(M) = -\hat H^{-1}_{\hat\theta^M_{\min}}\,\Phi, \qquad (5)$$
where $\Phi = \frac1n\sum_{i=1}^n\nabla_{x,\theta}\, l(\hat\theta^M_{\min}, x_i, y_i)\; E_{x\sim\hat P_x}\|x\|_p\;\phi_i$ and $\phi_i = (\psi_1, \psi_2, \dots, \psi_m)^T$, with
$$\psi_k = b_k^{q-1}\Big(\sum_{k=1}^m b_k^q\Big)^{-1/p}\mathrm{sgn}\big(\nabla_{x_{\cdot,k}}\, l(\hat\theta^M_{\min}, x_i, y_i, M)\big).$$
Here, $b_k = \big|\nabla_{x_{\cdot,k}}\, l(\hat\theta^M_{\min}, x_i, y_i, M)\big|$, $x_{\cdot,k}$ is the $k$-th coordinate of the vector $x$ (for instance, $x_j = (x_{j,1}, x_{j,2}, \dots, x_{j,m})^T$), and $p\ge 0$ and $q\ge 0$ are conjugate, such that $1/p + 1/q = 1$.

Remark 1. The compactness condition is easy to satisfy, since for any distribution $D$ and integer $n$, we can take a sufficiently large constant $R > 0$, which is allowed to depend on $n$, such that all $n$ samples are contained in the ball $B(0, R)$ with high probability. Besides, if the input $x$ is of high dimension, the computational bottleneck is mainly in inverting the empirical Hessian. We can use techniques such as conjugate gradients and stochastic estimation, as suggested in (Koh & Liang, 2017), to reduce the computational cost.
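To make Theorem 4.1 concrete, the following minimal sketch (not the authors' code) evaluates Eq. (5) for the linear regression model with squared loss $l = \frac12(y-\theta^Tx)^2$ under an $\ell_2$ attack ($p = q = 2$). In this special case a short calculation gives $\phi_i = \nabla_x l/\|\nabla_x l\|_2 = -\mathrm{sgn}(y_i-\theta^Tx_i)\,\theta/\|\theta\|_2$ and $\nabla_{x,\theta}\, l = x_i\theta^T - (y_i - \theta^Tx_i)\,I$. For brevity the same sample is used for fitting and for the Hessian in the sensitivity approximation of Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eps = 500, 2, 0.05
X = rng.normal(size=(n, m))
y = X @ np.array([2.0, -3.4]) + 0.1 * rng.normal(size=n)

# Standard empirical minimizer of the squared loss l = 0.5 * (y - theta^T x)^2
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
res = y - X @ theta_hat                      # residuals y_i - theta^T x_i
c = np.mean(np.linalg.norm(X, axis=1))       # scale factor E_hat ||x||_2  (p = q = 2)

# Empirical Hessian: H = (1/n) sum_i grad^2_theta l = X^T X / n
H = X.T @ X / n

# Phi = (1/n) sum_i grad_{x,theta} l(theta_hat, x_i, y_i) * c * phi_i, where for the
# squared loss grad_{x,theta} l = x_i theta^T - res_i * I and, for the l2 attack,
# phi_i = grad_x l / ||grad_x l||_2 = -sgn(res_i) * theta / ||theta||_2.
Phi = np.zeros(m)
for xi, ri in zip(X, res):
    mixed = np.outer(xi, theta_hat) - ri * np.eye(m)
    phi_i = -np.sign(ri) * theta_hat / np.linalg.norm(theta_hat)
    Phi += mixed @ (c * phi_i)
Phi /= n

I_hat = -np.linalg.solve(H, Phi)               # closed-form AIF of Eq. (5)
S_aif = 0.5 * eps**2 * I_hat @ H @ I_hat       # sensitivity approximation, Eq. (4)
print(I_hat, S_aif)
```

The resulting approximation can be compared against the brute-force estimate from the sketch in Section 3; agreement up to higher-order terms as $\varepsilon\to0$ is exactly the behavior that Figure 1 below illustrates.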
The above theorem provides a closed-form expression for the first-order AIF, and therefore a closed-form approximation of the model sensitivity $S_\varepsilon(M)$. One nice property of this approximation is that it does not depend on the optimization algorithm, but only on the model $M$ and the distribution of $(x, y)$. This attribute makes model sensitivity an inherent property of the model $M$ and the data distribution, making it a potential new rule for model selection: model sensitivity can help us pick models whose prediction results will not be greatly affected after robust optimization.

We show the effectiveness of the approximation by AIF in Figure 1. We plot two error curves,
$$\hat I(n,\varepsilon) := \big\|(\hat\theta^M_{\varepsilon,\min} - \hat\theta^M_{\min})/\varepsilon - \hat I(M)\big\|_2 \quad\text{and}\quad \hat S(n,\varepsilon) := \Big|\hat S_\varepsilon(M)/\varepsilon^2 - \tfrac12\,\hat I(M)^T E_{\hat P_{x^e,y^e}}\nabla^2 l(\hat\theta^M_{\min}, x, y, M)\,\hat I(M)\Big|,$$
where the sample size is $n$. Theoretically, we expect $\hat I(n,\varepsilon)$ and $\hat S(n,\varepsilon)$ to go to 0 as $\varepsilon$ goes to 0. In all the experiments in this paper, we use projected gradient descent (PGD) for robust optimization to obtain $\hat\theta^M_{\varepsilon,\min}$. In Figure 1, we can see that as $\varepsilon$ becomes smaller, $\hat I(n,\varepsilon)$ and $\hat S(n,\varepsilon)$ gradually go to 0. We remark that we do not let $\varepsilon$ be exactly 0 in our experiments, since PGD cannot obtain the exact optimal solutions for $\hat\theta^M_{\min}$ and $\hat\theta^M_{\varepsilon,\min}$; the existing system error becomes dominant if $\varepsilon$ is too small and returns abnormally large values after division by $\varepsilon$. This also motivates us to introduce the AIF to obtain an accurate approximation. The model we use is a linear regression model with 500 inputs drawn from a two-dimensional standard Gaussian, i.e. $x\sim N(0, I)$. We fit $y$ with $y = 2x_1 - 3.4x_2 + \eta$ and $\eta\sim 0.1\cdot N(0, 1)$.

Figure 1. Effectiveness of AIF and model sensitivity for the linear regression model. From the monotonic relationship between $\varepsilon$ and $\hat I(n,\varepsilon)$, $\hat S(n,\varepsilon)$, we verify the effectiveness of AIF and model sensitivity. Here, the sample size is $n = 500$.

Remark 2. It is straightforward to derive asymptotic normality for the AIF by the central limit theorem (Durrett, 2019), which can be used to construct confidence intervals for $I(M)$. Specifically, if we denote $\zeta_i := -\hat H^{-1}_{\hat\theta^M_{\min}}\nabla_{\theta,x}\, l(\hat\theta^M_{\min}, x_i, y_i)\, E_{x\sim\hat P_x}\|x\|_p\,\phi_i$, $\hat\mu_n := \frac1n\sum_{i=1}^n\zeta_i$, and $\hat\Sigma_n := \frac1n\sum_{i=1}^n(\zeta_i - \hat\mu_n)(\zeta_i - \hat\mu_n)^T$, then by classic statistical theory we obtain
$$\sqrt n\,\hat\Sigma_n^{-1/2}\big(\hat I(M) - \hat\mu_n\big) \xrightarrow{\;D\;} N(0, I_d)$$
as $n$ goes to infinity, where $N(0, I_d)$ denotes the standard multivariate normal distribution and $\xrightarrow{D}$ denotes convergence in distribution.

4.1. Non-convex, non-convergence cases

In the previous discussion, we considered the case where the empirical loss is strongly convex. We now briefly discuss non-convex and non-convergence cases.

Well-separated condition. In the proof of Theorem 4.1, we actually only need $\hat\theta^M_{\min}$ to be the global minimum and the empirical Hessian matrix to be positive definite at the point $\hat\theta^M_{\min}$; the landscape is allowed to have many local minima. The uniqueness assumption can also be formulated in a more elementary way: assume the smoothness of the loss function $l$ over $\mathcal X\times\Theta$, compactness of $\Theta$, and that $E_{(x,y)\sim P_{x,y}}\, l(\theta^M, x, y, M)$ has a single global minimum lying in the interior of $\Theta$, with a positive definite Hessian matrix, and that this minimum is well-separated, meaning that for any $\omega > 0$ there exists $\kappa > 0$ such that for all $\theta^M$ with $\|\theta^M - \theta^M_{\min}\| > \omega$ we have $|E_{x,y}\, l(\theta^M, x, y, M) - E_{x,y}\, l(\theta^M_{\min}, x, y, M)| > \kappa$. By classic statistical theory, $\hat\theta^M_{\min}$ will be a global minimum if the sample size is large enough.
The well-separated condition relaxes the convexity condition in Theorem 4.1. However, the validity of Theorem 4.1 still requires $\hat\theta^M_{\min}$ to be the global minimum of the empirical risk, which in practice is hard to find. Another alternative relaxation is to use a surrogate loss.

Surrogate losses. In practice, we may try to obtain $\theta^M_{\min}$ by running SGD with early stopping or on non-convex objectives, and get a solution $\tilde\theta^M_{\min}$ which is different from $\theta^M_{\min}$. As in (Koh & Liang, 2017), we can form a convex quadratic approximation of the loss around $\tilde\theta^M_{\min}$, i.e.,
$$\tilde l(\theta^M, x, y) = l(\tilde\theta^M_{\min}, x, y) + \nabla_\theta l(\tilde\theta^M_{\min}, x, y)^T(\theta^M - \tilde\theta^M_{\min}) + \frac12(\theta^M - \tilde\theta^M_{\min})^T\big(\nabla^2_\theta l(\tilde\theta^M_{\min}, x, y) + \lambda I\big)(\theta^M - \tilde\theta^M_{\min}),$$
where $\lambda$ is a damping term that removes the negative eigenvalues of the Hessian. One can show that the results of Theorem 4.1 hold with this surrogate loss.

5. Case studies of Adversarial Influence Functions

To illustrate the usage of adversarial influence functions, we use them to explore the relationship between model complexity, randomized smoothing, and model sensitivity.

5.1. Model Complexity and Model Sensitivity

Throughout this paper, we use the term model complexity as a general term referring to 1) the number of features included in the predictive model, and 2) the model capacity, such as whether the model is linear or non-linear. As observed in the prior literature (Fawzi et al., 2018; Kurakin et al., 2017; Madry et al., 2017), model complexity is closely related to adversarial robustness; that is, when the model capacity increases, the ε-sensitivity/adversarial cost first increases and then decreases. However, such a phenomenon is only empirical and lacks theoretical justification. In this subsection, we theoretically explore how model complexity affects the model sensitivity/adversarial cost by studying specific models with different model capacities and different numbers of features included in the predictive model.

5.1.1. MODEL CAPACITY AND MODEL SENSITIVITY

We start with the relationship between model capacity and model sensitivity via two simple and commonly used model classes, with the dimension of the inputs being fixed.

Linear regression models (L) and quadratic models (Q). We consider the class of linear models $L = \{f_\beta(x) = \beta^Tx : x, \beta\in\mathbb R^m\}$ and the class of quadratic models $Q = \{f_{\beta,A}(x) = \beta^Tx + x^TAx : x, \beta\in\mathbb R^m, A\in\mathbb R^{m\times m}\}$. Apparently, the class of quadratic models has a larger model capacity and is more flexible than that of linear models. In the following theorem, we show that larger model capacity does not necessarily lead to smaller sensitivity.

Theorem 5.1. We fit the data $(x_i, y_i)$ by $L$ and $Q$. For simplicity of presentation, assume the sample sizes of both the training and the testing sample are $n$. Suppose the underlying true generating process is $y = x^T\beta_1 + (\beta_2^Tx)^2 + \xi$, where $x\sim N(0, \sigma_x^2 I_m)\in\mathbb R^m$, $\xi\sim N(0, \sigma_\xi^2)$ and is independent of $x$. For an $\ell_2$ or $\ell_\infty$ attack,

I. when $\big(2\sigma_x^2\|\beta_2\|_2^2 - \sqrt{2/\pi}\,\sigma_\xi\big)^2 > \big(1 + 2m\sigma_x^2\max\{\sigma_x^2, 1\}\big)\tfrac{2}{\pi}\sigma_\xi^2$, we have $\hat S_\varepsilon(L) > \hat S_\varepsilon(Q) + O_p\big(\varepsilon^2\sqrt{\cdots}\big)$;

II. when $\big(2\sigma_x^2\|\beta_2\|_2^2 + \sqrt{2/\pi}\,\sigma_\xi\big)^2 < \frac{1}{\min\{1, \frac34\sigma_x^2\}}\big(1 + m\sigma_x^2 - 2\sigma_x^2\log m\big)\tfrac{3}{2\pi\sigma_\xi^2}\,\varepsilon$, then $\hat S_\varepsilon(L) < \hat S_\varepsilon(Q) + O_p\big(\varepsilon^2\sqrt{\cdots}\big)$.

From Theorem 5.1, unlike for adversarial robustness, we can see that the model sensitivity does not have a monotonic relationship with the model capacity; such a monotonic relationship only holds when the model has high complexity (when $\|\beta_2\|_2$ is large).
Therefore, when $n$ is sufficiently large, the result implies that a larger model capacity does not necessarily lead to a model with smaller sensitivity.

5.1.2. NUMBER OF FEATURES AND MODEL SENSITIVITY

Another important aspect of model complexity is the number of features included in the predictive model. Many model selection techniques, such as LASSO, AIC and BIC, have been developed over the years. Given the newly introduced concept of model sensitivity, it is interesting to take model sensitivity into consideration during model selection. For example, if for a specific model, including more features results in a smaller model sensitivity, then for the sake of adversarial robustness, we should include more features even if this leads to feature redundancy. For instance, the following results study the relationship between model sensitivity and the number of features included in linear models when $x_i$ has certain structure, such as $\mathrm{Cov}(x_i) = \sigma_x^2 I_m$ for some constant $\sigma_x$.

Theorem 5.2. Suppose that the data $(x_i, y_i)$'s are i.i.d. samples drawn from a joint distribution $P_{x,y}$. Denote the sample sizes of the training and testing samples by $n_{\mathrm{train}}$ and $n_{\mathrm{test}}$ respectively. Let $m$ be the dimension of the input $x$, and $\beta^L_{\min} = \arg\min_\beta E_{P_{x,y}}(y - \beta^Tx)^2$. Define $\eta^L_i = y_i - (\beta^L_{\min})^Tx_i$, and assume $E[x_i\,\mathrm{sgn}(\eta^L_i)] = 0$ and $\mathrm{Cov}(x_i) = \sigma_x^2 I_m$. Then, for an $\ell_2$ attack,
$$\hat S_\varepsilon(L) = \varepsilon^2\big(E_{x\sim\hat P_x}\|x\|_2\big)^2\big(E|\eta^L_i|\big)^2\sigma_x^{-2} + O_P\Big(\varepsilon^2\Big(\sqrt{\tfrac{1}{n_{\mathrm{train}}}} + \sqrt{\tfrac{m}{n_{\mathrm{test}}}}\Big)\Big).$$

Given this theorem, we now consider a specific case where we apply this result to a random effect model.

Corollary 5.1. Consider the random effect model $y = \beta^Tx + \xi$, where $x\in\mathbb R^M$, $\beta_1, \dots, \beta_M \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$, and $\xi\sim N(0, \sigma_\xi^2)$. Further, we assume $x$ is a random design with distribution $x_1, \dots, x_n \overset{\mathrm{i.i.d.}}{\sim} N(0, \sigma_x^2 I_M)$. Then, when we only include $m$ features in the linear predictive model, the resulting model sensitivity is
$$\hat S_\varepsilon(L) = \frac{4\varepsilon^2\,\Gamma^2(\frac{m+1}{2})}{\pi\,\Gamma^2(\frac{m}{2})}\big((M-m)\sigma_x^2 + \sigma_\xi^2\big) + O_P\Big(\varepsilon^2\Big(\sqrt{\tfrac{1}{n_{\mathrm{train}}}} + \sqrt{\tfrac{m}{n_{\mathrm{test}}}}\Big)\Big),$$
where $\Gamma(\cdot)$ is the Gamma function, $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$.

Since $\Gamma^2(\frac{m+1}{2})/\Gamma^2(\frac{m}{2}) \asymp \frac{m}{2}$, there is a universal constant $C$ such that $\hat S_\varepsilon(L) \asymp C m\big((M-m)\sigma_x^2 + \sigma_\xi^2\big) = -C\sigma_x^2 m^2 + C(M\sigma_x^2 + \sigma_\xi^2)m$. This also implies that a larger model capacity does not necessarily lead to a model with smaller sensitivity. Specifically, when $m$ is small, including more features in the linear model results in larger model sensitivity. In contrast, when $m$ is large, i.e. in the high-complexity regime, including more features leads to smaller model sensitivity.
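The increase-then-decrease pattern can be seen directly from the leading term of Corollary 5.1. The sketch below (an illustration with arbitrary parameter values, not the authors' code) evaluates that term over the number of included features $m$, ignoring the $O_P$ sampling-error term.

```python
import math

def sensitivity_vs_features(M=50, sigma_x=1.0, sigma_xi=0.5, eps=0.05):
    """Leading term of Corollary 5.1 as a function of the number m of
    included features (illustrative parameter values)."""
    out = []
    for m in range(1, M + 1):
        # Gamma^2((m+1)/2) / Gamma^2(m/2), computed stably on the log scale
        gamma_ratio = math.exp(2 * (math.lgamma((m + 1) / 2) - math.lgamma(m / 2)))
        s = 4 * eps**2 / math.pi * gamma_ratio * ((M - m) * sigma_x**2 + sigma_xi**2)
        out.append((m, s))
    return out

# Since the Gamma ratio grows roughly like m/2, the term behaves like
# m * ((M - m) * sigma_x^2 + sigma_xi^2): it first increases in m and then decreases.
for m, s in sensitivity_vs_features()[::10]:
    print(m, round(s, 4))
```

With these illustrative values the leading term rises for small $m$ and falls once $m$ passes roughly $M/2$, consistent with the discussion above.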
Next, we consider a broader class of functions: general regression models.

General regression models (GL). In general regression models, suppose we use a $d$-dimensional basis $v^{GL}(x) = (v^{GL}_1(x), \dots, v^{GL}_d(x))^T\in\mathbb R^d$ to approximate $y$ ($d$ can be a function of $m$), and obtain the coefficients by solving
$$\hat\theta^{GL}_{\min} = \arg\min_\theta\sum_{i=1}^n\big(y_i - \theta^Tv^{GL}(x_i)\big)^2,$$
where the loss function is $l(\theta, x_i, y_i, GL) = \frac12\big(y_i - \theta^Tv^{GL}(x_i)\big)^2$. By Theorem 4.1, it is straightforward to obtain
$$\hat I(GL) = -\hat H^{-1}_{\hat\theta^{GL}_{\min}}\Phi = -\mathrm{Cov}\big(v^{GL}(x)\big)^{-1}\Phi + O_P(\cdots),$$
where $\mathrm{Cov}(v^{GL}(x))$ is the covariance matrix of $v^{GL}(x)$ and
$$\Phi = \frac{E_{x\sim\hat P_x}\|x\|_2}{n}\sum_{i=1}^n\bigg[\frac{\big|(\hat\theta^{GL}_{\min})^Tv^{GL}(x_i) - y_i\big|}{\big\|\nabla_x(\hat\theta^{GL}_{\min})^Tv^{GL}(x_i)\big\|_2}\big(\nabla_xv^{GL}(x_i)\big)^T\nabla_xv^{GL}(x_i)\,\hat\theta^{GL}_{\min} + v^{GL}(x_i)\,\big\|\nabla_x(\hat\theta^{GL}_{\min})^Tv^{GL}(x_i)\big\|_2\,\mathrm{sgn}\big((\hat\theta^{GL}_{\min})^Tv^{GL}(x_i) - y_i\big)\bigg].$$
Consequently,
$$\hat S_\varepsilon(GL) = \frac{\varepsilon^2}{2}\,\Phi^T\mathrm{Cov}\big(v^{GL}(x)\big)^{-1}\Phi + O_P\big(\varepsilon^2\sqrt{\cdots}\big). \qquad (6)$$
Notice that the linear regression model is a special case if we take $v^{GL}(x) = x$. However, the expression of the model sensitivity for general regression models is very complex and hard to analyze directly most of the time. Instead of directly studying Eq. (6), we simplify the expression further by providing an upper bound to shed some light.

Theorem 5.3. Suppose that the data $(x_i, y_i)$'s are i.i.d. samples drawn from a joint distribution $P_{x,y}$. Let $m$ be the dimension of the input $x$, and $\theta^{GL}_{\min} = \arg\min_\theta E_{P_{x,y}}\big(y - \theta^Tv^{GL}(x_i)\big)^2$. Let $\eta^{GL}_i = y_i - (\theta^{GL}_{\min})^Tv^{GL}(x_i)$ and assume $E[x_i\,\mathrm{sgn}(\eta^{GL}_i)] = 0$. Then
$$\hat S_\varepsilon(GL) \le \varepsilon^2\big(E_{x\sim\hat P_x}\|x\|_2\big)^2\,\frac{E\big[\big\|(\nabla_xv^{GL}(x_i))^T\nabla_xv^{GL}(x_i)\big\|_2\big]}{\lambda_{\min}\big(E[v^{GL}(x_i)v^{GL}(x_i)^T]\big)}\,E\big[|\eta^{GL}_i|\big]^2.$$

The following example illustrates how our upper bound can be used to demonstrate the trend of the model sensitivity as the number of included features changes.

Example 5.1. Suppose $v^{GL}(x) = (x^T, (x^2)^T)^T$, where $x$ consists of random features such that each coordinate of $x$ is i.i.d. drawn from the uniform distribution on $(-1, 1)$, and $y = x^T\beta_1 + \frac{\beta_2^Tx^2}{\|x\|_2} + \xi$, where $\xi\sim N(0, \sigma_\xi^2)$ and is independent of $x$. As a result,
$$\frac{E\big[\big\|(\nabla_xv^{GL}(x_i))^T\nabla_xv^{GL}(x_i)\big\|_2\big]}{\lambda_{\min}\big(E[v^{GL}(x_i)v^{GL}(x_i)^T]\big)} = O(1),$$
regardless of the number of features $m$. Besides, $E|\eta^{GL}_i|$ decreases as $m$ increases, and thus the upper bound decreases as $m$ increases. In the experiments in Figure 2(a), we show the trend of $\hat S_\varepsilon(GL)$ with sample size $n = 5000$ and $\sigma_\xi = 0.1$, averaging the results over 1000 repetitions.

Figure 2. (a) Illustration of the relationship between the number of features and model sensitivity for the model in Example 5.1. Experimentally, the general trend of $\hat S_\varepsilon(GL)$ with respect to $m$ is decreasing (though not strictly for every $m$), as the upper bound suggests. (b) Effectiveness of AIF for kernel regression with NTK on MNIST. The monotonic trend in $\varepsilon$ is still clearly observed, though the values are larger than in the previous example in Figure 1 due to the high dimensionality of MNIST.

5.2. Randomized Smoothing and Model Sensitivity

As the last case study of AIF, we investigate the effect of randomized smoothing (Cohen et al., 2019), a technique inspired by differential privacy, on adversarial robustness. Randomized smoothing has achieved impressive empirical success as a defense mechanism against $\ell_2$ adversarial attacks. The core technique is adding isotropic noise $\vartheta\sim N(0, \sigma_r^2 I)$ to the inputs so that for any output range $O$,
$$P\Big(\frac1n\sum_{i=1}^n l(\theta^M, x_i + \vartheta_i, y_i, M)\in O\Big) \quad\text{is close to}\quad P\Big(\frac1n\sum_{i=1}^n l(\theta^M, x_i + \delta_i + \vartheta_i, y_i, M)\in O\Big)$$
for constrained $\|\delta_i\|_2$. The following theorem provides insight into how randomized smoothing affects the model sensitivity of linear regression models.

Theorem 5.4. Use the same notation as in Theorem 5.2. Suppose that the data $(x_i, y_i)$'s are i.i.d. samples drawn from a joint distribution $P_{x,y}$, and $E[x_i\,\mathrm{sgn}(\eta^L_i)] = 0$, $\mathrm{Cov}(x_i) = \sigma_x^2 I_m$, and $\mathrm{Var}(\eta^L_i) = \sigma_\eta^2$. When we fit $y$ with $\tilde x = x + \vartheta$, where $\vartheta$ is distributed as $N(0, \sigma_r^2 I_m)$, then
$$\frac{\hat S_\varepsilon(L_{\mathrm{noise}})}{\hat S_\varepsilon(L)} = \frac{\sigma_x^2/\sigma_\xi^2}{\sigma_x^2 + \sigma_r^2}\Big(\frac{2\sigma_r^2\sigma_x^2}{\sigma_x^2 + \sigma_r^2}\,\|\beta^L_{\min}\|_2^2 + \sigma_\xi^2\Big).$$
Here, $L_{\mathrm{noise}}$ denotes the linear model with randomized smoothing, obtained by adding input noise. This theorem tells us that when $\sigma_r$ is large, we have $\hat S_\varepsilon(L_{\mathrm{noise}}) \le \hat S_\varepsilon(L)$ asymptotically, and $\hat S_\varepsilon(L_{\mathrm{noise}})$ becomes smaller with larger $\sigma_r$. In other words, randomized smoothing helps reduce the sensitivity in this case.
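As a quick numerical illustration of Theorem 5.4, the sketch below evaluates the displayed ratio on a grid of noise levels; the parameter values ($\sigma_x$, $\sigma_\xi$, $\|\beta^L_{\min}\|_2^2$) are arbitrary choices for illustration only, not values used in the paper.

```python
def smoothing_ratio(sigma_r, sigma_x=1.0, sigma_xi=1.0, beta_norm_sq=1.0):
    """Ratio S_eps(L_noise) / S_eps(L) from Theorem 5.4 (illustrative parameters)."""
    lead = (sigma_x**2 / sigma_xi**2) / (sigma_x**2 + sigma_r**2)
    inner = 2 * sigma_r**2 * sigma_x**2 / (sigma_x**2 + sigma_r**2) * beta_norm_sq + sigma_xi**2
    return lead * inner

for sigma_r in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    print(sigma_r, round(smoothing_ratio(sigma_r), 3))
# The ratio equals 1 at sigma_r = 0 and decays toward 0 as sigma_r grows
# (it need not be monotone for small sigma_r), matching the discussion above.
```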
6. Further Extensions

In this section, we extend the theory of AIF to kernel regressions and distributional robust optimization. First, we derive the AIF for kernel regressions in Section 6.1. In particular, we are interested in how well the AIF characterizes the change of optimizers with neural tangent kernels (NTK), whose equivalence to infinitely wide neural networks has been well established in the recent literature (Du et al., 2018; Jacot et al., 2018). In Section 6.2, we further extend our theory to compute the AIF for distributional robust optimization.

6.1. AIF of kernel regressions

We consider the kernel regression model of the following form:
$$\hat L_n(\theta, X, Y) := \frac1n\sum_{i=1}^n\Big(y_i - \sum_{j=1}^n K(x_i, x_j)\theta_j\Big)^2 + \lambda\|\theta\|_2^2, \qquad (7)$$
where $\theta = (\theta_1, \dots, \theta_n)^T$ and $\lambda > 0$. Now let us denote $g(\theta, \Delta) = \hat L_n(\theta, X+\Delta, Y)$; we will calculate the empirical adversarial influence function $\hat I(K)$ for the kernel $K$. Notice that in kernel regression, the loss term $\big(y_i - \sum_{j=1}^n K(x_i, x_j)\theta_j\big)^2$ includes all the data points in one single term, which is different from the summation form of the loss function in Theorem 4.1. Fortunately, the technique used to prove Theorem 4.1 can still be applied here with slight modification. We obtain the following corollary for the adversarial influence function $\hat I(K)$ in kernel regression.

Corollary 6.1. Suppose $\mathcal X$, $\mathcal Y$ and $\Theta$ are compact spaces, $\hat L_n$ is three times continuously differentiable on $\Theta\times\mathcal X$, $g(\cdot, \Delta)$ is differentiable for every $\Delta$, $\nabla_\theta g(\theta, \Delta)$ is continuous on $\Theta\times\mathcal X$, the minimizer $\hat\theta_{\min}$ lies in the interior of $\Theta$, and $\nabla_{x_i}\hat L_n(\hat\theta_{\min}, X, Y)\neq 0$ for all $i\in[n]$. Then we have
$$\hat I(K) = -\Big(\sum_{i=1}^n K(x_i)K(x_i)^T + n\lambda I\Big)^{-1}\sum_{i=1}^n\sum_{k=1}^n\Big(K(x_i)^T\hat\theta_{\min}\, I_n + K(x_i)\hat\theta_{\min}^T - y_i I_n\Big)K_{x_i,x_k}\beta_k.$$
In the above formula, $K(x_i) = \big(K(x_i, x_1), K(x_i, x_2), \dots, K(x_i, x_n)\big)^T$ and
$$K_{x_i,x_k} = \Big(\frac{\partial K(x_i, x_1)}{\partial x_k}, \dots, \frac{\partial K(x_i, x_n)}{\partial x_k}\Big)^T.$$
The $z$-th coordinate of $\beta_k$ is
$$\beta_{k,z} = c_z^{q-1}\Big(\sum_{z=1}^m c_z^q\Big)^{-1/p}\mathrm{sgn}\big(\nabla_{x_k}\hat L_n(\hat\theta_{\min})_z\big)\, E_{x\sim\hat P_x}\|x\|_p,$$
with $c_z = \big|\nabla_{x_k}\hat L_n(\hat\theta_{\min})_z\big|$, where $\nabla_{x_k}\hat L_n(\hat\theta_{\min})_z$ is short for the $z$-th coordinate of $\nabla_{x_k}\hat L_n(\hat\theta_{\min}, X, Y)$:
$$\nabla_{x_k}\hat L_n(\theta, X, Y) = \frac2n\sum_{i=1}^n\big(K(x_i)^T\hat\theta_{\min} - y_i\big)K_{x_i,x_k}^T\hat\theta_{\min}.$$

Neural tangent kernels. The intimate connection between kernel regression and over-parametrized two-layer neural networks has been studied in the literature; see (Du et al., 2018; Jacot et al., 2018). In this section, we apply Corollary 6.1 to two-layer neural networks in the over-parametrized setting. Specifically, we consider a two-layer ReLU-activated neural network with $b$ neurons in the hidden layer:
$$f_{W,a}(x) = \frac{1}{\sqrt b}\sum_{r=1}^b a_r\,\sigma(w_r^Tx),$$
where $x\in\mathbb R^m$ denotes the input, $w_1, \dots, w_b\in\mathbb R^m$ are the weight vectors in the first layer, and $a_1, \dots, a_b\in\mathbb R$ are the weights in the second layer. Further, we denote $W = (w_1, \dots, w_b)\in\mathbb R^{m\times b}$ and $a = (a_1, \dots, a_b)^T\in\mathbb R^b$. Suppose we have $n$ samples $S = \{(x_i, y_i)\}_{i=1}^n$ and assume $\|x_i\|_2 = 1$ for simplicity. We train the neural network by randomly initialized gradient descent on the quadratic loss over the data $S$. In particular, we initialize the parameters randomly: $w_r\sim N(0, \kappa^2 I)$, $a_r\sim U(-1, 1)$, for all $r\in[b]$. Jacot et al. (2018) showed that such a network converges to the solution produced by kernel regression with the so-called Neural Tangent Kernel (NTK) matrix:
$$\mathrm{NTK}_{ij} = \frac{x_i^Tx_j\big(\pi - \arccos(x_i^Tx_j)\big)}{2\pi}.$$
In Figure 2(b), we experimentally demonstrate the effectiveness of the AIF approximation in kernel regression with the neural tangent kernel on MNIST. The estimation is based on averaging over 10 draws of 300 random examples from MNIST.
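For readers who want to reproduce the kernel object used here, the sketch below (an illustration with synthetic data, not the MNIST experiment) builds the NTK matrix for unit-norm inputs and solves the regularized kernel regression of Eq. (7); the target function and the values of $n$ and $\lambda$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 100, 10, 1e-3

# Unit-norm inputs, as assumed in the two-layer ReLU setting above
X = rng.normal(size=(n, m))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(X @ rng.normal(size=m))            # an arbitrary smooth target for illustration

# NTK matrix for the infinitely wide two-layer ReLU network:
# NTK_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2 pi)
G = np.clip(X @ X.T, -1.0, 1.0)               # clip inner products for numerical safety of arccos
K = G * (np.pi - np.arccos(G)) / (2 * np.pi)

# Kernel regression of Eq. (7): minimize (1/n) ||y - K theta||^2 + lam ||theta||^2,
# whose minimizer solves (K^T K + n * lam * I) theta = K^T y.
theta_hat = np.linalg.solve(K.T @ K + n * lam * np.eye(n), K.T @ y)
fitted = K @ theta_hat
print(np.mean((fitted - y) ** 2))             # training mean squared error
```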
6.2. Distributional adversarial influence function

Another popular way to formulate adversarial attacks is through distributional robust optimization (DRO), where instead of perturbing $x$ within a certain distance, one perturbs $(x, y)$ in a distributional sense. For a model $M$, the corresponding distributional robust optimization with respect to the $u$-Wasserstein distance $W_u$ for $u\in[1,\infty)$ regarding the $\ell_p$-norm is formulated as $\min_{\theta^M}\mathrm{OPT}(\varepsilon;\theta^M)$, where $\mathrm{OPT}(\varepsilon;\theta^M)$ is defined as
$$\mathrm{OPT}(\varepsilon;\theta^M) := \max_{\tilde P_{x,y}:\,W_u(\tilde P_{x,y}, P_{x,y})\le\varepsilon} E_{\tilde P_{x,y}}\, l(\theta^M, x, y; M).$$
Here, $W_u(D, \tilde D) = \inf\big\{\int\|x - y\|_p^u\, d\gamma(x, y) : \gamma\in\Pi(D, \tilde D)\big\}^{1/u}$ for two distributions $D$ and $\tilde D$, where $\Pi(D, \tilde D)$ denotes the set of couplings of $D$ and $\tilde D$. However, it is not clear whether
$$\theta^{M,\mathrm{DRO}}_{\varepsilon,\min} := \arg\min_{\theta^M}\mathrm{OPT}\big(\varepsilon\, E_{\hat P_x}\|x\|_p;\theta^M\big)$$
is well-defined, since the optimizer may not be unique. Moreover, the corresponding sample version of the optimizer $\theta^{M,\mathrm{DRO}}_{\varepsilon,\min}$ is not easy to obtain via regular optimization methods if we just replace the distribution $P_{x,y}$ by its empirical distribution, since it is hard to obtain the corresponding worst-case form of $\tilde P_{x,y}$. As a result, we focus on defining the empirical distributional adversarial influence function for a special approximation algorithm and state its limit. Interested readers are referred to the following result in (Staib & Jegelka, 2017) and (Gao & Kleywegt, 2016) for a way to properly find an approximation of $\tilde P_{x,y}$.

Lemma 6.1 (A variation of Corollary 2(iv) in (Gao & Kleywegt, 2016)). Suppose that for all $y$, $l(\theta^M, x, y; M)$ is $L$-Lipschitz as a function of $x$. Define
$$\mathrm{EMP}(\varepsilon) := \max_{\delta_1, \dots, \delta_n}\frac1n\sum_{i=1}^n l(\theta^M, x_i+\delta_i, y_i, M) \quad\text{subject to}\quad \Big(\frac1n\sum_{i=1}^n\|\delta_i\|_p^u\Big)^{1/u}\le\varepsilon.$$
Then we have $|\mathrm{EMP}(\varepsilon) - \mathrm{OPT}(\varepsilon;\theta^M)| \le LD/n$, where $D$ bounds the maximum deviation of a single point.

Lemma 6.1 provides a direction for defining an algorithm-dependent empirical DAIF $\hat I_{\mathrm{DRO}}(M)$. For a given model $M$, the corresponding empirical distributional adversarial influence function is defined as
$$\hat I_{\mathrm{DRO}}(M) := \frac{d\hat\theta^{M,\mathrm{DRO}}_{\varepsilon,\min}}{d\varepsilon}\Big|_{\varepsilon=0^+} \quad\text{such that}\quad \hat\theta^{M,\mathrm{DRO}}_{\varepsilon,\min}\in\arg\min_{\theta^M\in\Theta}\mathrm{EMP}\big(\varepsilon\, E_{\hat P_x}\|x\|_p\big).$$
We use "$\in\arg\min$" here since there may not be a unique minimizer, but the limit $\hat I_{\mathrm{DRO}}(M)$ is still unique and well-defined. Similarly, we can provide a closed form of the distributional adversarial influence function.

Theorem 6.1. Under the settings of Theorem 4.1,
$$\hat I_{\mathrm{DRO}}(M) = -\hat H^{-1}_{\hat\theta^M_{\min}}\,\varrho\; n^{\frac{1-u}{u}}, \qquad (8)$$
where $\varrho = \nabla_{x,\theta}\, l(\hat\theta^M_{\min}, x_J, y_J)\, E_{\hat P_x}\|x\|_p\,\phi_J$, $\phi_i$ is defined as in Theorem 4.1, and $J$ is the index $J = \arg\max_i\|\nabla_x l(\hat\theta^M_{\min}, x_i, y_i)\|_q$.

We remark that from Eq. (8) we can see that if $u > 1$, more training data results in a smaller norm of $\hat I_{\mathrm{DRO}}(M)$, since there is a factor $n^{(1-u)/u}$.

7. Conclusions and Future Work

To achieve adversarial robustness, robust optimization has been widely used in the training of deep neural networks, while its theoretical aspects are under-explored. In this work we first propose the AIF to quantify the influence of robust optimization theoretically. The proposed AIF is then used to efficiently approximate the model sensitivity, which is usually NP-hard to compute in practice. We then apply the AIF to study the relationship between model sensitivity and model complexity. Moreover, applying the AIF to randomized smoothing, we find that adding noise to the input during training helps reduce the model sensitivity. Further, the theories are extended to kernel regression models and distributional robust optimization.

Based on the newly introduced tool AIF, we suggest two main directions for future research. First, we can study how to use AIF to select the model with the greatest adversarial robustness. Due to the computational efficiency of AIF, it is a natural idea to use AIF for model selection.
Such an idea can be used for tuning-parameter selection in statistical models such as high-dimensional regression and factor analysis, and can be further extended to the selection of neural network depth and width. Second, AIF can be extended to study more phenomena in adversarial training, for instance, the relationship between low-dimensional representations and adversarial robustness. Recently, Lezama et al. (2018) and Sanyal et al. (2018) empirically observed that using learned low-dimensional representations as the input to neural networks is substantially more adversarially robust, but a theoretical exploration of this phenomenon is still lacking.

Acknowledgements

This work is in part supported by NSF award 1763665 and NSF DMS-2015378.

References

Agarwal, N., Gonen, A., and Hazan, E. Learning in non-convex games with an optimization oracle. arXiv preprint arXiv:1810.07362, 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Bayraktar, E. and Lai, L. On the adversarial robustness of robust estimators. arXiv preprint arXiv:1806.03801, 2018.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization, volume 28. Princeton University Press, 2009.

Beyer, H.-G. and Sendhoff, B. Robust optimization - a comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34):3190-3218, 2007.

Christmann, A. and Steinwart, I. On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5(Aug):1007-1034, 2004.

Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

Croux, C. and Haesbroeck, G. Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. Journal of Multivariate Analysis, 71(2):161-190, 1999.

Debruyne, M., Hubert, M., and Suykens, J. A. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9(Oct):2377-2400, 2008.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Durrett, R. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.

Fawzi, A., Fawzi, O., and Frossard, P. Analysis of classifiers' robustness to adversarial perturbations. Machine Learning, 107(3):481-508, 2018.

Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Hampel, F. R. Contributions to the theory of robust estimation. Ph.D. thesis.

Hampel, F. R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383-393, 1974.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Huber, P. and Ronchetti, E. Robust Statistics. John Wiley & Sons, Inc., 2009.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1885-1894. JMLR.org, 2017.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. 2017.

Lezama, J., Qiu, Q., Musé, P., and Sapiro, G. OLE: Orthogonal low-rank embedding, a plug and play geometric loss for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8109-8118, 2018.

Liu, X. and Hsieh, C.-J. Rob-GAN: Generator, discriminator, and adversarial attacker. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11234-11243, 2019.

Liu, Y., Jiang, S., and Liao, S. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In International Conference on Machine Learning, pp. 324-332, 2014.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Minoux, M. Robust network optimization under polyhedral demand uncertainty is NP-hard. Discrete Applied Mathematics, 158(5):597-603, 2010.

Park, G.-J., Lee, T.-H., Lee, K. H., and Hwang, K.-H. Robust design: An overview. AIAA Journal, 44(1):181-191, 2006.

Sanyal, A., Kanade, V., and Torr, P. H. Learning low-rank representations. arXiv preprint arXiv:1804.07090, 2018.

Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! In Advances in Neural Information Processing Systems, pp. 3353-3364, 2019.

Staib, M. and Jegelka, S. Distributionally robust deep learning as a generalization of adversarial training. In NIPS Workshop on Machine Learning and Computer Security, 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Taguchi, G. and Phadke, M. S. Quality engineering through design optimization. In Quality Control, Robust Design, and the Taguchi Method, pp. 77-96. Springer, 1989.

Yin, D., Ramchandran, K., and Bartlett, P. Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914, 2018.