Published as a conference paper at ICLR 2019

SIGNSGD VIA ZEROTH-ORDER ORACLE

Sijia Liu, Pin-Yu Chen (MIT-IBM Watson AI Lab, IBM Research); Xiangyi Chen, Mingyi Hong (University of Minnesota, Twin Cities)

ABSTRACT

In this paper, we design and analyze a new zeroth-order (ZO) stochastic optimization algorithm, ZO-signSGD, which enjoys dual advantages of gradient-free operations and signSGD. The latter requires only the sign information of gradient estimates, but is able to achieve a comparable or even better convergence speed than SGD-type algorithms. Our study shows that ZO-signSGD requires √d times more iterations than signSGD, leading to a convergence rate of O(√d/√T) under some mild conditions, where d is the number of optimization variables and T is the number of iterations. In addition, we analyze the effects of different types of gradient estimators on the convergence of ZO-signSGD, and propose several variants of ZO-signSGD that achieve O(√d/√T) convergence rate. On the application side, we explore the connection between ZO-signSGD and black-box adversarial attacks in robust deep learning. Our empirical evaluations on the image classification datasets MNIST and CIFAR-10 demonstrate the superior performance of ZO-signSGD for generating adversarial examples from black-box neural networks.

1 INTRODUCTION

Zeroth-order (gradient-free) optimization has attracted an increasing amount of attention for solving machine learning (ML) problems in scenarios where explicit expressions for the gradients are difficult or infeasible to obtain. One recent application of great interest is the generation of prediction-evasive adversarial examples, e.g., crafted images with imperceptible perturbations that deceive a well-trained image classifier into misclassification.
However, the black-box nature of the optimization limits the practical design of adversarial examples: internal configurations and operating mechanisms of public ML systems (e.g., Google Cloud Vision API) are not revealed to practitioners, and the only mode of interaction with the system is to submit inputs and receive the corresponding predicted outputs (Papernot et al., 2017; Liu et al., 2017; Chen et al., 2017; Tu et al., 2018; Ilyas et al., 2018b; Cheng et al., 2018; Bhagoji et al., 2018). It was observed in both white-box and black-box settings¹ that simply leveraging the sign information of gradient estimates of an attacking loss can achieve superior empirical performance in generating adversarial examples (Goodfellow et al., 2015; Madry et al., 2018; Ilyas et al., 2018a). Spurred by that, this paper proposes a zeroth-order (ZO) sign-based descent algorithm (we call it ZO-signSGD) for solving black-box optimization problems, e.g., the design of black-box adversarial examples. The convergence behavior and algorithmic stability of the proposed ZO-signSGD algorithm are carefully studied in both theory and practice.

In the first-order setting, a sign-based stochastic gradient descent method, known as signSGD, was analyzed by (Bernstein et al., 2018; Balles & Hennig, 2017). It was shown in (Bernstein et al., 2018) that signSGD not only reduces the per-iteration cost of communicating gradients, but also can yield a faster empirical convergence speed than SGD (Kingma & Ba, 2015). That is because, although the sign operation compresses each gradient coordinate to a single bit, it can mitigate the negative effect of extremely noisy components of the gradient. Theoretically, signSGD achieves O(1/√T) convergence rate under the condition of a sufficiently large mini-batch size, where T denotes the total number of iterations. The work in (Balles & Hennig, 2017) established a connection between signSGD and Adam under a restrictive convexity analysis.
Prior to (Bernstein et al., 2018; Balles & Hennig, 2017), although signSGD was not formally defined, the fast gradient sign method (Goodfellow et al., 2015) for generating white-box adversarial examples already obeyed the algorithmic protocol of signSGD. The effectiveness of signSGD has also been witnessed in robust adversarial training of deep neural networks (DNNs) (Madry et al., 2018). Given the advantages of signSGD, one may wonder whether it can be generalized for ZO optimization and what the corresponding convergence rate is. In this paper, we answer these questions affirmatively.

¹ White-box (vs. black-box) indicates whether knowledge of the target model is available a priori.

Contributions. We summarize our key contributions as follows.
- We propose a new ZO algorithm, ZO-signSGD, and rigorously prove its convergence rate of O(√d/√T) under mild conditions. Our convergence analysis applies to mini-batch sampling schemes both with and without replacement. In particular, the ZO sign-based gradient descent algorithm can be treated as a special case of our proposed ZO-signSGD algorithm.
- We carefully study the effects of different types of gradient estimators on the convergence of ZO-signSGD, and propose three variants of ZO-signSGD for both centralized and distributed ZO optimization.
- We conduct extensive synthetic experiments to thoroughly benchmark the performance of ZO-signSGD and to investigate its parameter sensitivity. We also demonstrate the superior performance of ZO-signSGD for generating adversarial examples from black-box DNNs.
Related work. Other types of ZO algorithms have been developed for convex and nonconvex optimization, where the full gradient is approximated via a random or deterministic gradient estimate (Jamieson et al., 2012; Nesterov & Spokoiny, 2015; Ghadimi & Lan, 2013; Duchi et al., 2015; Gao et al., 2014; Shamir, 2017; Hajinezhad et al., 2017; Ghadimi et al., 2016; Lian et al., 2016; Liu et al., 2018b;c; Wang et al., 2018). Examples include ZO-SGD (Ghadimi & Lan, 2013), ZO stochastic coordinate descent (ZO-SCD) (Lian et al., 2016), and ZO stochastic variance reduced gradient descent (ZO-SVRG) (Liu et al., 2018c;a; Gu et al., 2016). Both ZO-SGD and ZO-SCD can achieve O(√d/√T) convergence rate. ZO-SVRG can further improve the iteration complexity to O(d/T), but it suffers from an increased function query complexity due to the additional variance reduction step, known as gradient blending (Liu et al., 2018c), compared to ZO-SGD. Existing work shows that ZO algorithms align with the iteration complexity of their first-order counterparts up to a slowdown factor given by a small-degree polynomial of the problem size d.

2 SIGNSGD & ITS CONNECTION TO ADVERSARIAL MACHINE LEARNING

In this section, we provide a background on signSGD, together with the problem setup of our interest. In particular, we show that commonly used methods for generating adversarial attacks fall into the framework of signSGD.

Preliminaries on signSGD. Consider a nonconvex finite-sum problem of the form

minimize_x f(x) := (1/n) Σ_{i=1}^n f_i(x),    (1)

where x ∈ R^d are the optimization variables and {f_i} are n individual nonconvex cost functions. The finite-sum form (1) encompasses many ML problems, ranging from generalized linear models to neural networks. If the gradients of {f_i} are available, then problem (1) can be solved by many first-order methods such as SGD, SCD, and signSGD.
The method of our interest is signSGD, which, different from SGD and SCD, takes the sign of the gradient (or its estimate) as the descent direction. It was recently shown in (Bernstein et al., 2018) that signSGD is quite robust to gradient noise and yields fast empirical convergence. Algorithm 1 provides a generic sign-based gradient descent framework that encapsulates different variants of signSGD. In Algorithm 1, GradEstimate(·) signifies a general gradient estimation procedure, which adopts either a stochastic gradient estimate in the first-order setting (Bernstein et al., 2018) or a function-difference-based random gradient estimate in the ZO setting (Nesterov & Spokoiny, 2015; Duchi et al., 2015). We call the ZO variant of signSGD ZO-signSGD, which will be elaborated on in Sec. 3.

Algorithm 1 Generic sign-based gradient descent
1: Input: learning rates {δ_k}, initial value x_0, and number of iterations T
2: for k = 0, 1, ..., T − 1 do
3:   ĝ_k ← GradEstimate(x_k)  # applies to both first- and zeroth-order gradient estimates
4:   sign-gradient update
        x_{k+1} = x_k − δ_k sign(ĝ_k),    (2)
     where sign(x) takes element-wise signs of x

Adversarial attacks meet signSGD. It is now widely known that ML models (e.g., deep neural networks) are vulnerable to adversarial attacks, which craft inputs (e.g., images) with imperceptible perturbations to cause incorrect classification (Szegedy et al., 2013; Goodfellow et al., 2015; Kurakin et al., 2017; Lin et al., 2019). The resulting inputs crafted by adversaries are known as adversarial examples. Investigating adversarial examples not only helps to understand the limitations of learning models, but also provides opportunities to improve the models' robustness (Papernot et al., 2016; Athalye et al., 2018; Madry et al., 2018). In what follows, we show that the generation of adversarial examples in (Goodfellow et al., 2015; Kurakin et al., 2017) can be interpreted through signSGD.
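As a concrete illustration, the sign-gradient update (2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: GradEstimate is passed in as a callable, so the same loop covers both a first-order stochastic gradient (signSGD) and a ZO estimate (ZO-signSGD).

```python
import numpy as np

def sign_descent(grad_estimate, x0, step_sizes):
    """Algorithm 1 sketch: x_{k+1} = x_k - delta_k * sign(g_k).

    `grad_estimate(x)` may return a first-order stochastic gradient
    (signSGD) or a zeroth-order estimate (ZO-signSGD); the update
    only uses its element-wise sign.
    """
    x = np.asarray(x0, dtype=float)
    for delta_k in step_sizes:
        g_hat = grad_estimate(x)
        x = x - delta_k * np.sign(g_hat)  # update (2)
    return x
```

For example, on f(x) = ½‖x‖₂² the exact gradient is x itself, and three unit steps from (4, −4) move the iterate to (1, −1), one unit per coordinate per step regardless of the gradient's magnitude.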
Let x_0 denote the natural (legitimate) input of an ML model associated with the true label t_0, and let x' = x_0 + δ' be the adversarial example to be designed, where δ' denotes the adversarial perturbation. If f(x, t_0) is the training loss of a learning model, then the goal of a (white-box) adversarial attack is to find a minimal perturbation δ' that suffices to mislead the learning model, namely, to maximize the loss f(x_0 + δ', t_0). Taking the first-order approximation of f(x', t_0) around x_0, we obtain f(x', t_0) ≈ f(x_0, t_0) + ⟨∇_x f(x_0, t_0), δ'⟩. By constraining the strength of the perturbation to an ℓ∞ ball of small radius ϵ (i.e., ‖δ'‖∞ ≤ ϵ), the linear approximation of f(x', t_0) is maximized at δ' = ϵ sign(∇_x f(x_0, t_0)) (Shaham et al., 2018). Therefore, the generation of adversarial examples proposed in (Goodfellow et al., 2015) obeys the sign-gradient update rule in (2), x' = x_0 + ϵ sign(∇_x f(x_0, t_0)). Such a connection between adversarial example generation and signSGD also holds for other attacks, e.g., the iterative target attack method (Kurakin et al., 2017). Similarly, so-called black-box attacks (Ilyas et al., 2018a; Bhagoji et al., 2018) are associated with our proposed ZO-signSGD algorithm.

3 ZO-SIGNSGD FOR BLACK-BOX OPTIMIZATION

One limitation of signSGD (Bernstein et al., 2018) is its need for first-order information, i.e., stochastic gradients. However, there is a large practical demand for solving ML problems where explicit expressions for the gradients are difficult or infeasible to obtain, e.g., the generation of adversarial examples from black-box neural networks as discussed in Secs. 1 and 2.

Gradient estimation via a ZO oracle. In the ZO setting where first-order information is unavailable, the gradient estimator at Step 3 of Algorithm 1 has access only to function values of {f_i(x)} at a query point x. Based on that, we construct a ZO gradient estimate through a forward difference of two function values (Nesterov & Spokoiny, 2015; Gao et al., 2014; Duchi et al., 2015).
In Algorithm 1, GradEstimate(x) is then specified as

GradEstimate(x) = (1/(bq)) Σ_{i∈I_k} Σ_{j=1}^q ∇̂f_i(x; u_{i,j}),   ∇̂f_i(x; u_{i,j}) := (d [f_i(x + µu_{i,j}) − f_i(x)] / µ) u_{i,j},    (3)

where x = x_k in Algorithm 1, I_k is a mini-batch of size |I_k| = b, {u_{i,j}}_{j=1}^q are i.i.d. random directions drawn from the uniform distribution over the unit sphere, and ∇̂f_i(x; u_{i,j}) gives a two-point random gradient estimate with direction u_{i,j} and smoothing parameter µ > 0. We remark that the random direction vectors in (3) can also be drawn from the standard Gaussian distribution (Nesterov & Spokoiny, 2015). However, the uniform distribution can be more useful in practice since it is supported on a bounded set rather than the whole real space required for a Gaussian. We highlight that, unlike the first-order stochastic gradient estimate, the ZO gradient estimate (3) is a biased approximation to the true gradient of f. Instead, it is unbiased with respect to the gradient of the randomized smoothing function f_µ (Duchi et al., 2012; Gao et al., 2014),

f_µ(x) = E_v[f(x + µv)] = (1/n) Σ_{i=1}^n E_v[f_i(x + µv)] = (1/n) Σ_{i=1}^n f_{i,µ}(x),    (4)

where f_{i,µ} is the randomized smoothing version of f_i, and the random variable v follows the uniform distribution over the unit Euclidean ball. Clearly, there exists a gap between a ZO gradient estimate and the true gradient of f, but as will become evident later, this gap can be measured through the smoothing function f_µ.

Motivations for ZO-signSGD. Compared to SGD-type methods, the fast empirical convergence of signSGD and ZO-signSGD has been shown in the application of generating white-box and black-box adversarial examples (Goodfellow et al., 2015; Madry et al., 2018; Ilyas et al., 2018a). As mentioned in (Bernstein et al., 2018), the sign operation can mitigate the negative effect of (coordinate-wise) gradient noise of large variance.
Recall that the ZO gradient estimate is a biased approximation to the true gradient and thus can suffer from larger noise variance than (first-order) stochastic gradients. In this context, one benefits from ZO-signSGD due to its robustness to gradient noise. In Appendix 1, we provide two concrete examples (Fig. A1 and Fig. A2) to confirm this analysis. In Fig. A1, we show the robustness of ZO-signSGD against sparse noise perturbation through a toy quadratic optimization problem, originally introduced in (Bernstein et al., 2018) to motivate the fast convergence of signSGD over SGD. In Fig. A2, we show that gradient estimation via a ZO oracle indeed encounters gradient noise of large variance; taking the sign of a gradient estimate can thus scale down its extremely noisy components.

ZO-signSGD & technical challenges beyond signSGD. Algorithm 1 becomes ZO-signSGD when the ZO gradient estimate (3) is applied. We note that the extension from first order to ZO is nontrivial, as the proposed ZO-signSGD algorithm exhibits three key differences from signSGD.

First, ZO-signSGD has a milder assumption on the choice of mini-batch sampling. Recall that signSGD in (Bernstein et al., 2018) achieves O(1/√T) convergence rate under the condition that the mini-batch size is sufficiently large, b = O(T). However, this condition holds only when the mini-batch is randomly selected from [n] with replacement, which is unusual when n ≪ T. Here [n] represents the integer set {1, 2, ..., n}. Moreover, signSGD fails to cover signGD when b = n, since sampling with replacement generally leads to I_k ≠ [n] even if b = n. In the proposed ZO-signSGD algorithm, we relax this assumption on mini-batch sampling.

Second, in ZO-signSGD both the ZO gradient estimator and the sign operator give rise to approximation errors relative to the true gradient.
Although the statistical properties of ZO gradient estimates can be acquired with the aid of the randomized smoothing function (4), the use of mini-batch sampling without replacement introduces extra difficulty in bounding the variance of ZO gradient estimates, since the mini-batch samples are no longer independent. Moreover, the sign-based descent algorithm evaluates the convergence error in the ℓ₁-norm geometry, leading to a mismatch with the ℓ₂-norm-based gradient variance. Besides translating the gradient norm from ℓ₁ to ℓ₂, the probabilistic convergence method of (Ghadimi & Lan, 2013) is used to bound the eventual convergence error of ZO-signSGD. Finally, beyond the standard ZO gradient estimator (3), we cover multiple variants of ZO-signSGD for centralized or distributed optimization.

4 CONVERGENCE ANALYSIS OF ZO-SIGNSGD

In this section, we begin by stating the assumptions used in our analysis. We then derive the convergence rate of ZO-signSGD for nonconvex optimization. The assumptions on problem (1) are as follows.

A1: The functions {f_i} have L-Lipschitz continuous gradients, where L ∈ (0, ∞).

A2: At iteration k, the gradient of f_i is upper bounded as ‖∇f_i(x_k)‖₂ ≤ σ for i ∈ [n].

Both A1 and A2 are standard assumptions in the nonconvex optimization literature (Bernstein et al., 2018; Reddi et al., 2018; Chen et al., 2018). A1 implies the L-smoothness of f_i, namely, for any x and y we have f_i(x) − f_i(y) ≤ ⟨∇f_i(y), x − y⟩ + (L/2)‖x − y‖₂². A2 implies the bounded variance of ∇f_i in (Bernstein et al., 2018, Assumption 3), namely, (1/n) Σ_{i=1}^n ‖∇f_i(x) − ∇f(x)‖₂² ≤ 4σ², where we have used the fact that ‖∇f(x)‖₂ ≤ σ under A2. Throughout the paper, we assume that problem (1) is solvable, namely, f(x*) > −∞, where x* is an optimal solution. We recall that Algorithm 1 becomes ZO-signSGD when the gradient estimation step (3) is applied.
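For concreteness, the forward-difference estimation step (3) can be sketched as follows. This is an illustrative NumPy sketch under our own interface assumptions (the component functions f_i are passed in as a single callable f(i, x)); it is not the authors' code.

```python
import numpy as np

def zo_grad_estimate(f, x, batch, q, mu, rng):
    """Forward-difference ZO gradient estimate, cf. (3).

    f(i, x): value of the i-th component function f_i at x.
    batch:   mini-batch I_k of component indices (|I_k| = b).
    q:       number of random directions per component.
    mu:      smoothing parameter (> 0).
    Directions u_{i,j} are drawn uniformly from the unit sphere.
    """
    d = x.size
    g = np.zeros(d)
    for i in batch:
        for _ in range(q):
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)  # normalized Gaussian = uniform on the sphere
            g += d * (f(i, x + mu * u) - f(i, x)) / mu * u
    return g / (len(batch) * q)
```

As a sanity check, for a linear function f(x) = cᵀx the smoothing in (4) changes nothing (∇f_µ = c), so averaging over many directions recovers c up to Monte-Carlo error.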
For nonconvex problems, the convergence of an algorithm is typically measured in terms of stationarity, e.g., using ‖∇f(x)‖₂² for SGD (Ghadimi & Lan, 2013) and ‖∇f(x)‖₁ for signSGD (Bernstein et al., 2018). For the latter, the ℓ₁ geometry arises when quantifying the stochasticity introduced through the (non-linear) sign operation. Different from signSGD, ZO-signSGD only obtains a biased estimate of the true gradient. In Proposition 1, we bypass this bias by leveraging the randomized smoothing technique used for ZO optimization (Gao et al., 2014; Nesterov & Spokoiny, 2015; Duchi et al., 2015).

Proposition 1 Under A1, the outputs {x_k}_{k=0}^{T−1} of ZO-signSGD, i.e., Algorithm 1 with (3), satisfy

Σ_{k=0}^{T−1} δ_k E[‖∇f_µ(x_k)‖₁] ≤ E[f_µ(x_0) − f_µ(x_T)] + 2√d Σ_{k=0}^{T−1} δ_k √(E[‖ĝ_k − ∇f_µ(x_k)‖₂²]) + (dL/2) Σ_{k=0}^{T−1} δ_k²,    (5)

where the expectation is taken with respect to all the randomness of ZO-signSGD, f_µ is the randomized smoothing function of f in (4), and ĝ_k = GradEstimate(x_k) in (3). Proof: See Appendix 2.

In Proposition 1, the rationale behind introducing the smoothing function f_µ is that ∇f_µ(x_k) is the mean of the ZO gradient estimate ĝ_k. The convergence of ZO-signSGD is thus linked to the variance of ĝ_k, i.e., E[‖ĝ_k − ∇f_µ(x_k)‖₂²]. This crucial relationship in Proposition 1 holds for a general class of signSGD-type algorithms that use different ZO gradient estimators. Spurred by (5), we next investigate the second-order moment of ĝ_k in Proposition 2.

Proposition 2 Under A1 and A2, the variance of the ZO gradient estimate ĝ_k is upper bounded by

E[‖ĝ_k − ∇f_µ(x_k)‖₂²] ≤ (4α_b(q + 1)/(bq)) σ² + ((2α_b + β_b)/(bq)) C(d, µ),    (6)

where C(d, µ) := 2dσ² + µ²L²d²/2. In (6), α_b and β_b are Boolean variables depending on the choice of mini-batch sampling:

α_b = 1, β_b = 0 for mini-batch sampling with replacement; α_b = I(b < n), β_b = I(b > 1) for mini-batch sampling without replacement,    (7)

where I(x > a) is the indicator function of x with respect to the constraint x > a, i.e., I(x > a) = 1 if x > a and 0 otherwise.
Proof: See Appendix 3.

Compared to the variance bound σ²/b for the stochastic gradient estimate of ∇f in signSGD (Bernstein et al., 2018), Proposition 2 provides a general result for the ZO gradient estimate ĝ_k. The bound in (6) contains two parts: h₁ := (4α_b(q + 1)/(bq)) σ² and h₂ := ((2α_b + β_b)/(bq)) C(d, µ). The former, h₁ = O(σ²/b), characterizes the reduced variance (using b mini-batch samples) of the stochastic gradient estimate of the smoothing function f_µ, while the latter, h₂ = O(C(d, µ)/(bq)), reveals the dimension-dependent variance induced by the ZO gradient estimate using b mini-batch samples and q random directions. If a stochastic gradient estimate of ∇f were used, as in signSGD, then h₂ would be eliminated and the variance bound in (6) would reduce to σ²/b. Furthermore, Proposition 2 covers mini-batch sampling both with and without replacement, while signSGD only considers the former case. For the latter case, Proposition 2 implies that if b = n (i.e., I_k = [n], ZO-signGD), then the variance E[‖ĝ_k − ∇f_µ(x_k)‖₂²] reduces to O(C(d, µ)/(nq)), corresponding to α_b = 0 and β_b = 1 in (7). In the other extreme case of b = 1, the two mini-batch schemes become identical, corresponding to α_b = 1 and β_b = 0. Proposition 2 also implies that using large b and q reduces the variance of the gradient estimate, which further improves the convergence rate.

With the aid of Propositions 1 and 2, we can then obtain the convergence rate of ZO-signSGD in terms of stationarity of the original function f. The remaining difficulty is to bound the gap between ∇f and its smoothed version ∇f_µ. It has been shown in (Gao et al., 2014; Nesterov & Spokoiny, 2015) that there is a tight relationship between f_µ and f, given that the former is the convolution of the latter with the density function of the random perturbation v in (4). We present the convergence rate of ZO-signSGD in Theorem 1.
Theorem 1 Under A1 and A2, if we randomly pick x_R from {x_k}_{k=0}^{T−1} with probability P(R = k) = δ_k / Σ_{k=0}^{T−1} δ_k, then the convergence rate of ZO-signSGD is given by

E[‖∇f(x_R)‖₂] ≤ 2(f(x_0) − f* + µ²L) / (Σ_{k=0}^{T−1} δ_k) + dL (Σ_{k=0}^{T−1} δ_k²) / (Σ_{k=0}^{T−1} δ_k) + µLd + 2√d √( (4α_b(q + 1)σ² + C(d, µ)(2α_b + β_b)) / (bq) ),    (8)

where f* denotes the minimum value of f. Proof: See Appendix 4.

In Theorem 1, we translate the gradient norm from ℓ₁ to ℓ₂ and adopt a probabilistic output x_R (Ghadimi & Lan, 2013; Lei et al., 2017) to avoid an exhaustive search over {x_k} for min_k ‖∇f(x_k)‖₂. Note that the convergence rate of ZO-signSGD depends on the learning rate δ_k, the problem size d, the smoothing parameter µ, the mini-batch size b, and the number of random perturbations q for ZO gradient estimation. We next obtain the explicit dependence on these parameters by specializing Theorem 1. If δ_k = δ = O(1/√(dT)) and µ = O(1/√(dT)), then the convergence rate in (8) simplifies to

E[‖∇f(x_R)‖₂] ≤ O( √d/√T + √( (α_b q + (α_b + β_b) d) / (bq) ) ),    (9)

where α_b and β_b were defined in (7), and 1 ≤ α_b + β_b ≤ 2.

We provide several key insights into the convergence rate of ZO-signSGD through (9). First, the convergence rate of ZO-signSGD is measured through ‖∇f(x_R)‖₂ rather than its squared counterpart ‖∇f(x_R)‖₂², where the latter is used to measure the convergence of ZO-SGD. Recall from (Ghadimi & Lan, 2013, Theorem 3.2 & Corollary 3.3) that ZO-SGD yields the convergence error E[‖∇f(x_R)‖₂²] ≤ O(√d/√T). Since ‖∇f(x_R)‖₂² ≤ ‖∇f(x_R)‖₂ when ‖∇f(x_R)‖₂ ≤ 1, the convergence of ZO-signSGD meets a stricter criterion than that of ZO-SGD. The possible downside of ZO-signSGD is that, in the worst case, it suffers an additional error of order O(√(1/b) + √(d/(bq))). These results imply that ZO-signSGD may converge only to a neighborhood of a stationary point, but with a fast convergence speed; the size of the neighborhood is controlled by the mini-batch size b and the number of random direction vectors q.
Also, our convergence analysis applies to mini-batch sampling both with and without replacement. When b ∈ [1, n), ZO-signSGD achieves the rate O(√d/√T + √(1/b) + √(d/(bq))) regardless of the choice of mini-batch sampling. When b = n, it is known from (9) that mini-batch sampling without replacement recovers ZO-signGD, yielding the convergence rate O(√d/√T + √(d/(nq))). By contrast, mini-batch sampling with replacement leads to the worse convergence rate O(√d/√T + √(1/n) + √(d/(nq))). Clearly, when b = n and n < T, ZO-signSGD using mini-batch sampling with replacement fails to achieve the rate O(√d/√T) regardless of the choice of q, while ZO-signSGD using mini-batch sampling without replacement recovers O(√d/√T) once q = O(T/n). When b > n, ZO-signSGD is restricted to mini-batch sampling with replacement. Similar to signSGD (Bernstein et al., 2018), we can then obtain the O(√d/√T) convergence rate with b = O(T) and q = O(T/n), where the dependence on q is induced by the use of ZO gradient estimation.

5 VARIANTS OF ZO-SIGNSGD

Here we study three variants of ZO-signSGD, in which the gradient is estimated using (a) the central difference of function values, (b) the sign of ZO gradient estimates with majority vote, or (c) the sign of ZO gradient estimates with majority vote for distributed optimization. That is,

a) GradEstimate(x) = (1/(bq)) Σ_{i∈I_k} Σ_{j=1}^q (d [f_i(x + µu_{i,j}) − f_i(x − µu_{i,j})] / (2µ)) u_{i,j},    (10)
b) GradEstimate(x) = (1/(bq)) Σ_{i∈I_k} Σ_{j=1}^q sign(∇̂f_i(x; u_{i,j})),    (11)
c) GradEstimate(x) = (1/M) Σ_{m=1}^M sign( (1/(b_m q)) Σ_{i∈I_{m,k}} Σ_{j=1}^q ∇̂f_i(x; u_{i,j}) ),    (12)

where {u_{i,j}} and ∇̂f_i(x; u_{i,j}) are defined as in (3). The gradient estimator (12) is proposed for distributed optimization over a star network consisting of M agents and one central processor. Each agent m ∈ [M] has access only to n_m data points (in terms of n_m individual costs {f_i}) with Σ_{m=1}^M n_m = n, and the mini-batch I_{m,k} of each agent satisfies b_m = |I_{m,k}| ∈ [1, n_m].
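The two ingredients of these variants can be sketched as follows: a central-difference estimator in the spirit of (10) (shown here for a single cost function), and the element-wise majority vote over signs used in (11) and (12), which equals the sign of the summed signs. This is an illustrative sketch under our own interface assumptions, not the authors' code.

```python
import numpy as np

def central_diff_estimate(f, x, q, mu, rng):
    """Central-difference ZO estimate for a single cost f, cf. (10)."""
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        g += d * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / q

def majority_vote_sign(grad_estimates):
    """Element-wise majority vote over the signs of M estimates, cf. (11)-(12):
    the median of signs equals the sign of their sum."""
    signs = np.sign(np.asarray(grad_estimates))  # shape (M, d)
    return np.sign(signs.sum(axis=0))
```

For example, three workers reporting gradient estimates (1, −2), (3, −1), (−1, 2) vote 2-to-1 for the direction (+, −), so the majority vote returns (1, −1) even though the workers disagree.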
According to (12), the central processor receives the 1-bit compressed gradients sign((1/(b_m q)) Σ_{i∈I_{m,k}} Σ_{j=1}^q ∇̂f_i(x; u_{i,j})) from the M agents, performs the sign-based descent step (2), and sends its 1-bit update back to every agent.

The ZO gradient estimator (10) was used in (Shamir, 2017) for bandit convex optimization and in (Ilyas et al., 2018a) for designing black-box adversarial attacks. Compared to the forward-difference form (3), the central difference (10) requires b(q − 1) additional function queries per gradient estimate. At the cost of these extra function queries, one may wonder whether the convergence rate of ZO-signSGD can be further improved.

Corollary 1 Suppose that the conditions of Theorem 1 hold. Then ZO-signSGD with the gradient estimator (10) yields the same convergence rate as ZO-signSGD with the estimator (3).

Proof: Recall that Proposition 1 is independent of the specific form of the gradient estimator, and thus holds for (10). Although Proposition 2 relies on the second-order moments of each gradient estimator, we prove that under A1 and A2, both (3) and (10) maintain the same statistical properties. As a result, Proposition 2 and Theorem 1 also hold for (10); see details in Appendix 5.

We next study the gradient estimator (11), whose sign is equivalent to the majority vote (i.e., the element-wise median) of the signs of the individual gradient estimates {∇̂f_i(x; u_{i,j})}. It was shown in (Bernstein et al., 2018) that signSGD with majority vote has a better convergence rate under the additional assumption of a unimodal symmetric noise distribution for coordinate-wise gradient estimates. In Corollary 2, we show that such a speed-up in convergence can also be achieved by ZO-signSGD with majority vote, which we refer to as ZO-M-signSGD.

Corollary 2 Suppose that the conditions of Theorem 1 hold, and that the distribution of the gradient noise is unimodal and symmetric.
Then ZO-M-signSGD with δ_k = O(1/√(dT)) and µ = O(1/√(dT)) yields

E[‖∇f(x_R)‖₂] ≤ O( √d/√T + √(d/(bq)) ).    (13)

Proof: See Appendix 6.

Recall from Theorem 1 that, under the same parameter setting as Corollary 2, ZO-signSGD yields the worst-case rate O(√d/√T + √(1/b) + √(d/(bq))). It is clear from (13) that the error correction term of order √(1/b) is eliminated in ZO-M-signSGD. This improvement in convergence is achieved under the condition of unimodal symmetric gradient noise. We remark that, different from the stochastic gradient noise studied in (Bernstein et al., 2018), the ZO gradient estimation noise can violate this assumption. For example, in the scalar case, if the gradient estimate g follows the distribution g = 1 with probability 0.9 and g = −10 with probability 0.1, then E[g] = −0.1 < 0 and sign(E[g]) < 0; however, E[sign(g)] = 0.8 > 0. This implies that, without the symmetry assumption, the sign of the gradient estimates with majority vote (E[sign(g)]) can point in the opposite direction of the sign of the averaged gradient (sign(E[g])). Our results in the next section show that ZO-M-signSGD may not outperform ZO-signSGD.

Lastly, we focus on the gradient estimator (12), whose sign can be interpreted as the majority vote of M distributed agents about the sign of the true gradient (Bernstein et al., 2018). The resulting variant of ZO-signSGD is called ZO-D-signSGD, and its convergence rate is given in Corollary 3. Compared to ZO-M-signSGD for centralized optimization, ZO-D-signSGD suffers an extra error correction term O(√(d/n)) in the distributed setting. It is also worth mentioning that if M = n and q = 1, then the gradient estimator (12) reduces to (11) with I_k = [n]. In this case, Corollaries 2 and 3 reach a consensus on the O(√d/√T + √(d/n)) convergence error.

Corollary 3 Suppose that the conditions of Corollary 2 hold. Then ZO-D-signSGD with b_m = n/M, δ_k = O(1/√(dT)) and µ = O(1/√(dT)) yields

E[‖∇f(x_R)‖₂] ≤ O( √d/√T + √(d/n) + √(d/(nq)) ).    (14)

Proof: See Appendix 7.
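The mismatch between the sign of the mean and the mean of the signs under asymmetric noise can be checked by direct arithmetic on a two-point distribution (an illustrative check only; the specific values are chosen so that the mean is slightly negative while most of the probability mass is positive):

```python
import numpy as np

# Asymmetric gradient noise: g = 1 with prob. 0.9, g = -10 with prob. 0.1.
values = np.array([1.0, -10.0])
probs = np.array([0.9, 0.1])

mean_g = float(probs @ values)              # E[g] = 0.9 - 1.0 = -0.1
mean_sign = float(probs @ np.sign(values))  # E[sign(g)] = 0.9 - 0.1 = 0.8
```

Here sign(E[g]) = −1 while E[sign(g)] = 0.8 > 0: the majority vote points in the opposite direction of the averaged gradient, which is exactly what a symmetry assumption rules out.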
Figure 1: Performance comparison of ZO-signSGD, ZO-M-signSGD, ZO-SGD, ZO-SCD, signSGD and SGD on a synthetic dataset. The solid line represents the loss/accuracy averaged over 10 independent trials with random initialization, and the shaded region indicates the standard deviation over the random trials. (a)-(b): Training loss and test accuracy versus iterations. (c)-(d): Effects of the mini-batch size b and the number of random direction vectors q on the convergence of the studied algorithms; (c) presents the training loss versus iterations, and (d) is a heat map of the final loss for different values of b and q. (e)-(f): Effects of the problem size d; (e) shows the final training loss versus d, and (f) presents the convergence trajectory for d ∈ {200, 400}.

6 EXPERIMENTS

In this section, we empirically show the effectiveness of ZO-signSGD and validate its convergence behavior on both synthetic and real-world datasets, namely MNIST and CIFAR-10. For the synthetic experiment, we study the problem of binary classification in a least-squares formulation. For the real-world application, we design adversarial examples from black-box neural networks as discussed in Sec. 2. Throughout this section, we compare ZO-signSGD and its variants with SGD, signSGD (Bernstein et al., 2018), ZO-SGD (Ghadimi & Lan, 2013), and ZO-SCD (Lian et al., 2016).

Binary classification. We consider the least-squares problem with a nonconvex loss function (Xu et al., 2017; Liu et al., 2018b),

min_{x∈R^d} (1/n) Σ_{i=1}^n (y_i − 1/(1 + e^{−a_iᵀx}))²,

which satisfies Assumption A2 with σ = max_i{2‖a_i‖₂}. Here, instead of the conventional (convex) logistic regression cost, the least-squares formulation is used to align with our nonconvex theoretical analysis.
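The nonconvex least-squares loss can be written down directly; the snippet below also generates a synthetic dataset in the same spirit (the planted parameter x_true used to label the data is our own illustrative assumption, not a detail taken from the paper):

```python
import numpy as np

def loss(x, A, y):
    """(1/n) * sum_i (y_i - 1 / (1 + exp(-a_i^T x)))^2."""
    p = 1.0 / (1.0 + np.exp(-A @ x))
    return float(np.mean((y - p) ** 2))

# Synthetic data: a_i ~ N(0, I); labels thresholded at probability 0.5
# under a planted parameter x_true (hypothetical choice for illustration).
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = (1.0 / (1.0 + np.exp(-A @ x_true)) > 0.5).astype(float)
```

Note that at x = 0 every predicted probability is exactly 0.5, so the loss equals 0.25; the planted x_true attains a strictly smaller loss.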
To generate the synthetic dataset, we randomly draw samples $\{a_i\}$ from $N(0, I)$ and set the label $y_i = 1$ if $1/(1 + e^{-a_i^T x}) > 0.5$, and $y_i = 0$ otherwise. The number of training samples $\{a_i, y_i\}$ is $n = 2000$, against 200 testing samples. We find the best constant learning rate for each algorithm via a greedy search over $\eta \in [0.001, 0.1]$ (see Appendix 8.1 for details), and we choose the smoothing parameter $\mu = 10/\sqrt{Td}$. Unless specified otherwise, we set $b = q = 10$, $T = 5000$ and $d = 100$.

In Fig. 1, we report the training loss, the test accuracy, and the effects of algorithmic parameters on the convergence of the studied algorithms. We observe from Fig. 1-(a) and (b) that ZO-signSGD outperforms the other ZO algorithms, and that signSGD yields the best convergence performance once first-order information is available. In Fig. 1-(c) and (d), the convergence of the ZO algorithms improves as $b$ and $q$ increase; in particular, ZO-signSGD and ZO-M-signSGD at $b = q = 30$ approach the best result, provided by signSGD. In Fig. 1-(e) and (f), the convergence of all algorithms degrades as the problem size $d$ increases; however, ZO-signSGD and ZO-M-signSGD converge faster than ZO-SGD and ZO-SCD.

In Fig. 2, we demonstrate the convergence trajectories of different variants of ZO-signSGD for $b \in \{40, 400\}$. To make a fair comparison between ZO-signSGD and ZO-D-signSGD, each of the $M = 40$ agents uses a mini-batch of size $b/M$. As we can see, ZO-signSGD outperforms ZO-M-signSGD and ZO-D-signSGD, and the convergence improves as the mini-batch size increases. Moreover, in all examples, ZO-signSGD and its variants converge to moderate accuracy much faster than ZO-SGD, within only a few tens of iterations.

Figure 2: Training loss (left) and testing accuracy (right) of ZO-signSGD, ZO-M-signSGD, ZO-D-signSGD and ZO-SGD versus iterations.

Figure 3: Black-box attacking loss versus iterations for (a) MNIST ID 2, (b) MNIST ID 34, (c) CIFAR-10 ID 6, (d) CIFAR-10 ID 10. The solid marker indicates the iteration number at which the first successful adversarial example is found, and its loss corresponds to the squared ℓ2 distortion.

Generating black-box adversarial examples. Here we study adversarial robustness by generating adversarial examples from a black-box image classifier trained by a deep neural network (DNN); see the problem formulation in Appendix 8.2. We recall from Sec. 2 that the task of black-box adversarial attack falls within the category of ZO optimization, as one can only access the input-output relation of the DNN while crafting adversarial examples. The DNN models trained on MNIST and CIFAR-10 (Carlini & Wagner, 2017) serve as the zeroth-order oracle (https://github.com/carlini/nn_robust_attacks). We select one image from each class of MNIST and CIFAR-10 and separately implement black-box attacks using the same attacking loss function (see Appendix 8.2) but with different ZO optimization algorithms (ZO-SGD, ZO-signSGD and ZO-M-signSGD). We also set the same parameters for each method, i.e., $\mu = 0.01$, $q = 9$, and $\delta = 0.05$ for MNIST and $\delta = 0.0005$ for CIFAR-10, to accommodate the dimension factor $d$. Moreover, we benchmark their performance against the natural evolution strategy (NES) based two-point gradient estimator of (Ilyas et al., 2018a) for solving the same attacking loss function.

Table 1: Iteration comparison of attacking black-box DNN on MNIST (image ID 2).
(Each row pair gives the iteration count and the corresponding top-1 prediction; the attack succeeds when the prediction changes from the true label 1 to 4.)

ZO-SGD          Iteration:     0  40  80 120 160 200 240 280 312 356
                Classified as: 1   1   1   1   1   1   1   1   4   4
ZO-signSGD      Iteration:     0  40  80 120 145 202 240 280 320 359
                Classified as: 1   1   1   1   4   4   4   4   4   4
ZO-M-signSGD    Iteration:     0  40  80 120 142 200 240 279 320 360
                Classified as: 1   1   1   1   4   4   4   4   4   4
ZO-NES          Iteration:     0  40  80 120 160 200 258 287 321 359
                Classified as: 1   1   1   1   1   1   4   4   4   4

In the NES-based benchmark, the sign of the gradient estimate is also used in the descent step; we call the resulting black-box attack generation method ZO-NES. Similar to (10), NES computes the ZO gradient estimate using the central difference of two function values. One iteration of ZO-NES thus requires 2q function queries, so we set q = 5 to align with the number of function queries used by the other ZO methods. All methods use the same natural image as the initial point for finding adversarial examples. Fig. 3 shows the black-box attacking loss versus iterations (more results are given in Appendix 8.3). We find that ZO-signSGD usually takes significantly fewer iterations than the other methods to find the first successful adversarial example with a similar attacking loss. For MNIST, the average iteration count over all attacked images in Table A1 to find the first successful adversarial example is 184 for ZO-SGD, 103 for ZO-signSGD, 151 for ZO-M-signSGD, and 227 for ZO-NES; the corresponding average ℓ2 distortion is 2.345 for ZO-SGD, 2.381 for ZO-signSGD, 2.418 for ZO-M-signSGD, and 2.488 for ZO-NES. For CIFAR-10, the average iteration count over all attacked images in Table A2 is 302 for ZO-SGD, 250 for ZO-signSGD, 389 for ZO-M-signSGD, and 363 for ZO-NES, with corresponding average ℓ2 distortion 0.177, 0.208, 0.219 and 0.235, respectively. As a visual illustration, we compare in Table 1 the adversarial examples of a hand-written digit 1 produced by each attacking method at different iterations, corresponding to Fig. 3-(a).
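The NES-style estimator used by ZO-NES can be sketched as follows (a minimal illustration with our own naming, not the authors' code): it is the central-difference counterpart of the forward-difference estimate, costing two function queries per direction.

```python
import numpy as np

def zo_grad_central(f, x, mu, q, rng):
    # Central-difference ZO gradient estimate: averages
    # d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over q unit directions,
    # at a cost of 2q function queries per estimate.
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / (2.0 * mu)) * (f(x + mu * u) - f(x - mu * u)) * u
    return g / q
```

For a quadratic f the central difference along u is exact, so with many directions the estimate concentrates around the true gradient; an attack such as ZO-NES then descends along the sign of this estimate.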
As we can see, ZO-signSGD and ZO-M-signSGD require roughly 54% fewer iterations (around 600 fewer model queries) than ZO-SGD to find the first successful adversarial example. For the first successful adversarial example, we observe that ZO-signSGD yields slightly higher ℓ2 distortion than ZO-SGD. This is not surprising, since Theorem 1 suggests that ZO-signSGD might not converge to a solution of very high accuracy, but it converges very fast to the moderate accuracy that suffices for black-box attacks. Note that the first successful adversarial examples generated by the different ZO methods are all visually similar to the original images but lead to different top-1 predictions; see more results in Appendix 8.3. In addition, we observe that ZO-NES is not as effective as ZO-signSGD in either query efficiency (measured by the number of iterations to achieve the first successful attack) or attack distortion. Thus, compared to ZO-NES, ZO-signSGD offers a provable and efficient black-box adversarial attacking method.

7 CONCLUSION

Motivated by the impressive convergence behavior of (first-order) signSGD and its empirical success in crafting adversarial examples from black-box ML models, in this paper we rigorously prove the $O(\sqrt{d}/\sqrt{T})$ convergence rate of ZO-signSGD and its variants under mild conditions. Compared to signSGD, ZO-signSGD suffers a slowdown (proportional to the problem size d) in convergence rate; however, it enjoys the advantage of gradient-free operations. Compared to other ZO algorithms, we corroborate the superior performance of ZO-signSGD on both synthetic and real-world datasets, particularly in its application to black-box adversarial attacks. In the future, we would like to generalize our analysis to nonsmooth and nonconvex constrained optimization problems.

ACKNOWLEDGMENTS

This work was supported by the MIT-IBM Watson AI Lab.
Mingyi Hong and Xiangyi Chen are supported partly by NSF grant CMMI-1727757 and by AFOSR grant 15RT0767.

REFERENCES

A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
L. Balles and P. Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. arXiv preprint arXiv:1705.07774, 2017.
J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
A. N. Bhagoji, W. He, B. Li, and D. Song. Practical black-box attacks on deep neural networks using efficient query mechanisms. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–169, 2018.
N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57, 2017.
P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM, 2017.
X. Chen, S. Liu, R. Sun, and M. Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.
M. Cheng, T. Le, P.-Y. Chen, J. Yi, H. Zhang, and C.-J. Hsieh. Query-efficient hard-label black-box attack: An optimization-based approach. arXiv preprint arXiv:1807.04457, 2018.
J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
X. Gao, B. Jiang, and S. Zhang.
On the information-adaptive variants of the ADMM: An iteration complexity perspective. Optimization Online, 12, 2014.
S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR 2015, arXiv preprint arXiv:1412.6572, 2015.
B. Gu, Z. Huo, and H. Huang. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016.
D. Hajinezhad, M. Hong, and A. Garcia. Zenith: A zeroth-order distributed algorithm for multi-agent nonconvex optimization. 2017.
A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, 2018a.
A. Ilyas, L. Engstrom, and A. Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018b.
K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2012.
D. Kinga and J. B. Adam. A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial machine learning at scale. ICLR 2017, arXiv preprint arXiv:1611.01236, 2017. URL http://arxiv.org/abs/1611.01236.
L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pp. 2345–2355, 2017.
X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu.
A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pp. 3054–3062, 2016.
J. Lin, C. Gan, and S. Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations (ICLR), 2019.
L. Liu, M. Cheng, C.-J. Hsieh, and D. Tao. Stochastic zeroth-order optimization via variance reduction method. arXiv preprint arXiv:1805.11811, 2018a.
S. Liu, J. Chen, P.-Y. Chen, and A. O. Hero. Zeroth-order online ADMM: Convergence analysis and applications. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 288–297, April 2018b.
S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1805.10367, 2018c.
Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. ICLR, arXiv preprint arXiv:1611.02770, 2017.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2015.
N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pp. 372–387. IEEE, 2016.
N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
U.
Shaham, Y. Yamada, and S. Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 2018.
O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
C.-C. Tu, P. Ting, P.-Y. Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. arXiv preprint arXiv:1805.11770, 2018.
Y. Wang, S. Du, S. Balakrishnan, and A. Singh. Stochastic zeroth-order optimization in high dimensions. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pp. 1356–1365. PMLR, April 2018.
P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017.

1 MOTIVATING EXAMPLES OF FAST CONVERGENCE OF ZO-SIGNSGD

1.1 EXAMPLE OF SPARSE NOISE PERTURBATION

We consider minimizing the function $f(x) = \frac{1}{2}\|x\|_2^2$. Similar to (Bernstein et al., 2018, Figure A.1), we assume that the ZO gradient estimate of $f(x)$ and the first-order gradient $\nabla f(x) = x$ both suffer from a sparse noise vector $v$, where $v_1 \sim N(0, 100^2)$ and $v_i = 0$ for $i \geq 2$. As a result, the descent direction used at iteration $t$ is $\hat\nabla f(x_t) + v$ or $\nabla f(x_t) + v$, respectively. Fig. A1 presents the convergence performance of five algorithms: SGD, signSGD, ZO-SGD, ZO-signSGD, and the variant of ZO-signSGD using the central-difference gradient estimator (10). Here we tune a constant learning rate, finding 0.001 best for SGD and ZO-SGD, and 0.01 best for signSGD and its ZO variants.
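This toy comparison can be reproduced in a few lines (a sketch under the setup above, with step sizes matching the tuned values; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 500
x_sgd = 5.0 * np.ones(d)     # descends along the noisy gradient itself
x_sign = 5.0 * np.ones(d)    # descends along only its sign
for _ in range(T):
    v = np.zeros(d)
    v[0] = rng.normal(0.0, 100.0)       # sparse, extremely noisy first coordinate
    x_sgd -= 0.001 * (x_sgd + v)        # gradient of f(x) = 0.5*||x||^2 is x
    x_sign -= 0.01 * np.sign(x_sign + v)
f_sgd = 0.5 * np.dot(x_sgd, x_sgd)
f_sign = 0.5 * np.dot(x_sign, x_sign)
```

The sign step is immune to the magnitude of $v_1$, so the nine clean coordinates of `x_sign` approach zero quickly while only the corrupted coordinate stalls; for the plain gradient step, the corrupted coordinate instead dominates the iterates.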
As we can see, the sign-based first-order and ZO algorithms converge much faster than the stochastic-gradient-based algorithms. This is not surprising: the presence of the extremely noisy component $v_1$ leads to an inaccurate gradient value and thus degrades the convergence of SGD and ZO-SGD. By contrast, the sign information is more robust to outliers, which leads to the better convergence of signSGD and its variants. We also note that the convergence trajectory of ZO-signSGD using the central-difference gradient estimator (10) coincides with that using the forward-difference gradient estimator (3).

Figure A1: Comparison of gradient-based and gradient-sign-based first-order and ZO algorithms in the sparse-noise example. The solid line represents the loss averaged over 10 independent trials with random initialization, and the shaded region indicates the standard deviation of results over random trials. Left: loss versus iterations for SGD, signSGD, ZO-SGD, ZO-signSGD, and ZO-signSGD with the central-difference estimator (10). Right: local regions highlighting the effect of the gradient estimators (3) and (10) on the convergence of ZO-signSGD.

1.2 STATISTICS OF GRADIENT ESTIMATES

The intuition behind why ZO-signSGD can outperform ZO-SGD is that the sign operation mitigates the negative effect of (coordinate-wise) gradient noise of large variance. To confirm this point, we examine the coordinate-wise variance of the gradient noise during an entire training run of the binary classifier from the first experiment of Sec. 6. At each iteration, we perform an additional 100 random trials to obtain the statistics of the gradient estimates. In Fig. A2-(a), we present the ℓ1 norm of the mean of the gradient estimates (over 100 trials) versus the number of iterations.
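The statistics reported in Fig. A2 can be emulated on a toy function (a sketch with our own helper names, not the authors' code): repeat the ZO gradient estimate many times at a fixed point, then inspect the mean and the coordinate-wise variance.

```python
import numpy as np

def zo_grad(f, x, mu, q, rng):
    # Forward-difference ZO gradient estimate over q random unit directions
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / mu) * (f(x + mu * u) - f(x)) * u
    return g / q

f = lambda z: 0.5 * np.dot(z, z)          # toy loss with gradient x
x = np.array([1.0, -2.0, 3.0, -4.0])
rng = np.random.default_rng(1)
trials = np.stack([zo_grad(f, x, 1e-4, 10, rng) for _ in range(100)])
mean_l1_err = np.linalg.norm(trials.mean(axis=0) - x, 1)  # bias of the mean estimate
coord_var = trials.var(axis=0)            # coordinate-wise noise variance
```

The mean stays close to the true gradient while the per-coordinate variance is substantial; it is exactly this large-variance noise that taking the sign of the estimate suppresses.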
As we can see, both signSGD and ZO-signSGD outperform SGD and ZO-SGD, evidenced by a faster decrease of the ℓ1 norm of the gradient estimate. In Fig. A2-(b), we present the coordinate-wise gradient-noise variance (over 100 trials at each coordinate) versus the number of iterations. It is not surprising that, compared to the first-order methods, the ZO methods suffer gradient noise of larger variance. In this scenario, we benefit from ZO-signSGD, since taking the sign of the gradient estimate scales down extremely noisy components. Indeed, we observe a significant decrease of the noise variance for ZO-signSGD compared to ZO-SGD.

Figure A2: Statistics of gradient estimates during an entire training run of the binary classifier from the first experiment of Sec. 6. (a) The ℓ1 norm of the mean of gradient estimates versus iteration. (b) Coordinate-wise gradient-noise variance versus iteration. The solid line represents the variance averaged over all coordinates, and the shaded region indicates the corresponding standard deviation over all coordinates at each iteration.

2 PROOF OF PROPOSITION 1

Based on the definition of the smoothing function $f_\mu$, for any $x$ and $y$ we have
$$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_2 = \|\mathbb{E}_v[\nabla_x f(x + \mu v) - \nabla_y f(y + \mu v)]\|_2 \leq \mathbb{E}_v[\|\nabla_x f(x + \mu v) - \nabla_y f(y + \mu v)\|_2] \leq L\|x - y\|_2, \quad (15)$$
where the first inequality holds due to Jensen's inequality, and the second inequality holds due to A1. It is known from (15) that $f_\mu$ has $L$-Lipschitz continuous gradient. By the $L$-smoothness of $f_\mu$, we obtain
$$f_\mu(x_{k+1}) \leq f_\mu(x_k) + \langle \nabla f_\mu(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|_2^2 \overset{(2)}{=} f_\mu(x_k) - \delta_k \langle \nabla f_\mu(x_k), \mathrm{sign}(\hat g_k) \rangle + \frac{L}{2}\delta_k^2 \|\mathrm{sign}(\hat g_k)\|_2^2 = f_\mu(x_k) - \delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k \sum_{i=1}^d |(\nabla f_\mu(x_k))_i|\, I\big[\mathrm{sign}(\hat g_{k,i}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_i)\big], \quad (16)$$
where $(\nabla f_\mu(x))_i$ denotes the $i$th element of $\nabla f_\mu(x)$. Taking expectation of both sides of (16), we obtain
$$\mathbb{E}[f_\mu(x_{k+1}) - f_\mu(x_k)] \leq -\delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k \sum_{i=1}^d |(\nabla f_\mu(x_k))_i|\, \mathrm{Prob}\big[\mathrm{sign}(\hat g_{k,i}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_i)\big]. \quad (17)$$
Similar to (Bernstein et al., 2018, Theorem 1), we relax $\mathrm{Prob}[\mathrm{sign}(\hat g_{k,i}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_i)]$ by Markov's inequality,
$$\mathrm{Prob}\big[\mathrm{sign}(\hat g_{k,i}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_i)\big] \leq \mathrm{Prob}\big[|\hat g_{k,i} - (\nabla f_\mu(x_k))_i| \geq |(\nabla f_\mu(x_k))_i|\big] \leq \frac{\mathbb{E}[|\hat g_{k,i} - (\nabla f_\mu(x_k))_i|]}{|(\nabla f_\mu(x_k))_i|}. \quad (18)$$
Substituting (18) into (17), we obtain
$$\mathbb{E}[f_\mu(x_{k+1}) - f_\mu(x_k)] \leq -\delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k \sum_{i=1}^d \mathbb{E}[|\hat g_{k,i} - (\nabla f_\mu(x_k))_i|] = -\delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k\, \mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_1] \leq -\delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k \sqrt{d}\, \mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2] \leq -\delta_k \|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k \sqrt{d}\, \sqrt{\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2^2]}, \quad (19)$$
where the second inequality holds due to $\|x\|_1 \leq \sqrt{d}\|x\|_2$, and the last inequality holds by applying Jensen's inequality to the concave function $\sqrt{\cdot}$, i.e., $\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2] \leq \sqrt{\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2^2]}$. Summing both sides of (19), we then obtain (5).

3 PROOF OF PROPOSITION 2

We recall from (3) that
$$\hat g_k = \frac{1}{b}\sum_{i \in \mathcal{I}_k} \hat\nabla f_i(x_k), \qquad \hat\nabla f_i(x_k) = \frac{1}{q}\sum_{j=1}^q \hat\nabla f_i(x_k; u_{i,j}). \quad (20)$$
Let $z_i := \hat\nabla f_i(x_k) - \nabla f_\mu(x_k)$ and $z_{i,j} := \hat\nabla f_i(x_k; u_{i,j}) - \nabla f_\mu(x_k)$. Thus,
$$\hat g_k - \nabla f_\mu(x_k) = \frac{1}{b}\sum_{i \in \mathcal{I}_k} z_i = \frac{1}{bq}\sum_{i \in \mathcal{I}_k}\sum_{j=1}^q z_{i,j}, \quad (21)$$
where there are two sources of randomness: a) mini-batch sampling $i \in \mathcal{I}_k$, and b) random direction sampling $u = u_{i,j}$. These two sources of randomness are independent, and the random direction samples $\{u_{i,j}\}$ are i.i.d. Next, we discuss two types of mini-batch sampling: a) mini-batch samples without replacement, and b) mini-batch samples with replacement. Suppose that $\mathcal{I}_k$ is a uniform random subset of $[n]$ (no replacement); motivated by (Lei et al., 2017, Lemma A.1), we introduce a new variable $W_i = I(i \in \mathcal{I}_k)$, where $I$ is the indicator function, i.e., $I(i \in \mathcal{I}_k) = 1$ if $i \in \mathcal{I}_k$, and $0$ otherwise. As a result, we have
$$\mathbb{E}[W_i^2] = \mathbb{E}[W_i] = \frac{b}{n}, \qquad \mathbb{E}[W_i W_j] = \frac{b(b-1)}{n(n-1)}, \quad i \neq j. \quad (22)$$
From (21), the variance of $\hat g_k$ is given by
$$\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2^2] = \frac{1}{b^2}\Big(\sum_{i=1}^n \mathbb{E}[W_i^2]\,\mathbb{E}_u[\|z_i\|_2^2] + \sum_{i \neq j} \mathbb{E}[W_i W_j]\,\langle \mathbb{E}_u[z_i], \mathbb{E}_u[z_j]\rangle\Big) = \frac{1}{bn}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] + \frac{b-1}{b\,n(n-1)}\sum_{i \neq j} \langle \mathbb{E}_u[z_i], \mathbb{E}_u[z_j]\rangle \overset{(a)}{=} \frac{1}{bn}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] - \frac{b-1}{b\,n(n-1)}\sum_{i=1}^n \|\mathbb{E}_u[z_i]\|_2^2 = \frac{n-b}{b\,n(n-1)}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] + \frac{b-1}{b\,n(n-1)}\sum_{i=1}^n \big(\mathbb{E}_u[\|z_i\|_2^2] - \|\mathbb{E}_u[z_i]\|_2^2\big) \overset{(b)}{=} \frac{n-b}{b\,n(n-1)}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] + \frac{b-1}{b\,n(n-1)}\sum_{i=1}^n \mathbb{E}_u[\|\hat\nabla f_i(x_k) - \nabla f_{i,\mu}(x_k)\|_2^2]. \quad (23)$$
In (23), the equality (a) holds since $\mathbb{E}_u[z_i] = \nabla f_{i,\mu}(x_k) - \nabla f_\mu(x_k)$ and $\nabla f_\mu(x) = \frac{1}{n}\sum_{i=1}^n \nabla f_{i,\mu}(x)$ from (4), where we have used the fact that $\mathbb{E}_u[\hat\nabla f_i(x_k)] = \nabla f_{i,\mu}(x_k)$ (Liu et al., 2018c, Lemma 1); recall that $f_{i,\mu}$ denotes the smoothing function of $f_i$. The above implies $\sum_{i=1}^n \mathbb{E}_u[z_i] = \sum_i \nabla f_{i,\mu}(x_k) - n\nabla f_\mu(x_k) = 0$, so that $\sum_{i \neq j}\langle \mathbb{E}_u[z_i], \mathbb{E}_u[z_j]\rangle = -\sum_{i=1}^n \|\mathbb{E}_u[z_i]\|_2^2$. The equality (b) holds due to $\mathbb{E}_u[\|z_i\|_2^2] - \|\mathbb{E}_u[z_i]\|_2^2 = \mathbb{E}_u[\|z_i - \mathbb{E}_u[z_i]\|_2^2] = \mathbb{E}_u[\|\hat\nabla f_i(x_k) - \nabla f_{i,\mu}(x_k)\|_2^2]$.

On the other hand, suppose that the mini-batch $\mathcal{I}_k$ contains i.i.d. samples (namely, with replacement); the vectors $\{z_i\}$ are then i.i.d. under both mini-batch sampling and random direction sampling. Therefore, we obtain
$$\mathbb{E}\Big[\Big\|\frac{1}{b}\sum_{i \in \mathcal{I}_k} z_i\Big\|_2^2\Big] = \frac{1}{b}\,\mathbb{E}_{i,u}[\|z_i\|_2^2] = \frac{1}{bn}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2], \quad (24)$$
where the cross terms vanish since $\mathbb{E}_{i \in \mathcal{I}_k, u}[z_i] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_u[z_i] = \frac{1}{n}\sum_{i=1}^n \nabla f_{i,\mu}(x_k) - \nabla f_\mu(x_k) = 0$. Combining (23) and (24), we obtain
$$\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2^2] \leq \frac{\alpha_b}{bn}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] + \frac{\beta_b}{bn}\sum_{i=1}^n \mathbb{E}_u[\|\hat\nabla f_i(x_k) - \nabla f_{i,\mu}(x_k)\|_2^2]. \quad (25)$$
In (25), $\alpha_b = 1$ and $\beta_b = 0$ if the mini-batch contains i.i.d. samples from $[n]$ with replacement, and $\alpha_b = I(b < n)$ and $\beta_b = I(b > 1)$ if samples are randomly selected without replacement. We next bound $\frac{1}{n}\sum_i \mathbb{E}_u[\|z_i\|_2^2]$ in (25):
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}_u[\|z_i\|_2^2] \overset{(21)}{=} \frac{1}{nq^2}\sum_i \Big(\sum_j \mathbb{E}_u[\|z_{i,j}\|_2^2] + \sum_{j \neq k}\langle \mathbb{E}_u[z_{i,j}], \mathbb{E}_u[z_{i,k}]\rangle\Big) = \frac{1}{nq^2}\sum_i \big(q\,\mathbb{E}_u[\|z_{i,1}\|_2^2] + (q^2 - q)\,\|\mathbb{E}_u[z_{i,1}]\|_2^2\big) = \frac{1}{nq}\sum_i \mathbb{E}_u[\|z_{i,1}\|_2^2] + \frac{q-1}{nq}\sum_i \|\mathbb{E}_u[z_{i,1}]\|_2^2, \quad (26)$$
where we have used the facts that $\mathbb{E}_u[z_{i,j}] = \mathbb{E}_u[z_{i,1}]$ and $\mathbb{E}_u[\|z_{i,j}\|_2^2] = \mathbb{E}_u[\|z_{i,1}\|_2^2]$ for any $j$, since the random direction vectors $\{u_{i,j}\}_{j=1}^q$ are i.i.d. samples. In (26), we further bound $\frac{1}{n}\sum_i \mathbb{E}_u[\|z_{i,1}\|_2^2]$:
$$\frac{1}{n}\sum_i \mathbb{E}_u[\|z_{i,1}\|_2^2] = \frac{1}{n}\sum_i \mathbb{E}_u\big[\|\hat\nabla f_i(x_k; u_{i,1}) - \nabla f_\mu(x_k)\|_2^2\big] = \frac{1}{n}\sum_i \mathbb{E}_u\big[\|\hat\nabla f_i - \nabla f_{i,\mu} + \nabla f_{i,\mu} - \nabla f_\mu\|_2^2\big] \leq \frac{2}{n}\sum_i \mathbb{E}_u[\|\hat\nabla f_i - \nabla f_{i,\mu}\|_2^2] + \frac{2}{n}\sum_i \|\nabla f_{i,\mu} - \nabla f_\mu\|_2^2, \quad (27)$$
where, for ease of notation, $\hat\nabla f_i := \hat\nabla f_i(x_k; u_{i,1})$, $\nabla f_{i,\mu} := \nabla f_{i,\mu}(x_k)$ and $\nabla f_\mu := \nabla f_\mu(x_k)$. According to (Liu et al., 2018c, Lemma 1), the first term on the RHS of (27) yields
$$\mathbb{E}_u[\|\hat\nabla f_i - \nabla f_{i,\mu}\|_2^2] \leq 2d\|\nabla f_i\|_2^2 + \frac{\mu^2 L^2 d^2}{2} \leq 2d\sigma^2 + \frac{\mu^2 L^2 d^2}{2} := C(d, \mu), \quad (28)$$
where the last inequality holds due to A2.
Based on the definition of $f_\mu$, the second term on the RHS of (27) yields
$$\frac{1}{n}\sum_i \|\nabla f_{i,\mu} - \nabla f_\mu\|_2^2 = \frac{1}{n}\sum_i \big\|\mathbb{E}_v[\nabla f_i(x_k + \mu v) - \nabla f(x_k + \mu v)]\big\|_2^2 \leq \frac{1}{n}\sum_i \mathbb{E}_v\big[\|\nabla f_i(x_k + \mu v) - \nabla f(x_k + \mu v)\|_2^2\big] \leq 4\sigma^2, \quad (29)$$
where we have used Jensen's inequality and the bound $\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|_2^2 \leq 4\sigma^2$ under A2. Substituting (28) and (29) into (27), we have
$$\frac{1}{n}\sum_i \mathbb{E}_u[\|z_{i,1}\|_2^2] \leq 4d\sigma^2 + \mu^2 L^2 d^2 + 8\sigma^2 = 2C(d, \mu) + 8\sigma^2, \quad (30)$$
where $C(d, \mu)$ was defined in (28). We are now ready to bound (26). Based on $\frac{1}{n}\sum_i \|\mathbb{E}_u[z_{i,1}]\|_2^2 = \frac{1}{n}\sum_i \|\nabla f_{i,\mu} - \nabla f_\mu\|_2^2 \leq 4\sigma^2$ from (29), and substituting (30) into (26), we obtain
$$\frac{1}{n}\sum_i \mathbb{E}_u[\|z_i\|_2^2] \leq \frac{2C(d, \mu) + 8\sigma^2}{q} + \frac{4(q-1)\sigma^2}{q}. \quad (31)$$
In (25), we also need to bound $\mathbb{E}_u[\|\hat\nabla f_i(x_k) - \nabla f_{i,\mu}(x_k)\|_2^2]$. Since $\hat\nabla f_i(x_k)$ is the average of the $q$ i.i.d. estimates $\hat\nabla f_i(x; u_{i,j})$,
$$\mathbb{E}_u[\|\hat\nabla f_i(x_k) - \nabla f_{i,\mu}(x_k)\|_2^2] \overset{(a)}{=} \frac{1}{q}\,\mathbb{E}_u[\|\hat\nabla f_i(x; u_{i,1}) - \nabla f_{i,\mu}(x)\|_2^2] \leq \frac{2d\sigma^2 + \mu^2 L^2 d^2/2}{q} = \frac{C(d, \mu)}{q}, \quad (32)$$
where the equality (a) holds since $\mathbb{E}_u[\hat\nabla f_i(x; u_{i,j})] = \nabla f_{i,\mu}(x)$ for any $j$, given by (Liu et al., 2018c, Lemma 1). Substituting (31) and (32) into (25),
$$\mathbb{E}[\|\hat g_k - \nabla f_\mu(x_k)\|_2^2] \leq \frac{\alpha_b}{b}\cdot\frac{2C(d, \mu) + 4\sigma^2 + 4\sigma^2 q}{q} + \frac{\beta_b}{b}\cdot\frac{C(d, \mu)}{q} = \frac{4\alpha_b(q+1)}{bq}\sigma^2 + \frac{C(d, \mu)}{bq}(2\alpha_b + \beta_b). \quad (33)$$

4 PROOF OF THEOREM 1

Substituting (6) into (5), we obtain
$$\sum_{k=0}^{T-1} \delta_k\, \mathbb{E}[\|\nabla f_\mu(x_k)\|_1] \leq \mathbb{E}[f_\mu(x_0) - f_\mu(x_T)] + 2\sqrt{\frac{d}{bq}}\sqrt{4\alpha_b(q+1)\sigma^2 + C(d, \mu)(2\alpha_b + \beta_b)}\,\sum_{k=0}^{T-1}\delta_k + \frac{dL}{2}\sum_{k=0}^{T-1}\delta_k^2. \quad (34)$$
It is known from (Liu et al., 2018c, Lemma 1) that
$$\|\nabla f(x)\|_2^2 \leq 2\|\nabla f_\mu(x)\|_2^2 + \frac{\mu^2 L^2 d^2}{2}, \qquad |f_\mu(x) - f(x)| \leq \frac{\mu^2 L}{2}. \quad (35)$$
From (35) we have $f_\mu(x_0) \leq f(x_0) + \frac{\mu^2 L}{2}$ and $f^* \geq f_\mu^* - \frac{\mu^2 L}{2}$, where $f_\mu^* = \min_x f_\mu(x)$ and $f^* = \min_x f(x)$. This yields $f_\mu(x_0) - f_\mu^* \leq f(x_0) - f^* + \mu^2 L$, and thus
$$f_\mu(x_0) - f_\mu(x_T) \leq f_\mu(x_0) - f_\mu^* \leq f(x_0) - f^* + \mu^2 L. \quad (36)$$
Substituting (36) into (34), we obtain
$$\sum_{k=0}^{T-1} \delta_k\, \mathbb{E}[\|\nabla f_\mu(x_k)\|_1] \leq f(x_0) - f^* + \mu^2 L + 2\sqrt{\frac{d}{bq}}\sqrt{4\alpha_b(q+1)\sigma^2 + C(d, \mu)(2\alpha_b + \beta_b)}\,\sum_{k=0}^{T-1}\delta_k + \frac{dL}{2}\sum_{k=0}^{T-1}\delta_k^2. \quad (37)$$
Since $\|\nabla f_\mu(x_k)\|_2 \leq \|\nabla f_\mu(x_k)\|_1$, dividing both sides of (37) by $\sum_{k=0}^{T-1}\delta_k$ yields
$$\sum_{k=0}^{T-1} \frac{\delta_k}{\sum_{k=0}^{T-1}\delta_k}\, \mathbb{E}[\|\nabla f_\mu(x_k)\|_2] \leq \frac{f(x_0) - f^* + \mu^2 L}{\sum_{k=0}^{T-1}\delta_k} + 2\sqrt{\frac{d}{bq}}\sqrt{4\alpha_b(q+1)\sigma^2 + C(d, \mu)(2\alpha_b + \beta_b)} + \frac{dL}{2}\cdot\frac{\sum_{k=0}^{T-1}\delta_k^2}{\sum_{k=0}^{T-1}\delta_k}. \quad (38)$$
By introducing a random variable $R$ with probability $P(R = k) = \frac{\delta_k}{\sum_{k=0}^{T-1}\delta_k}$, we then obtain
$$\mathbb{E}[\|\nabla f_\mu(x_R)\|_2] = \mathbb{E}\big[\mathbb{E}_R[\|\nabla f_\mu(x_R)\|_2]\big] = \mathbb{E}\Big[\sum_{k=0}^{T-1} P(R = k)\,\|\nabla f_\mu(x_k)\|_2\Big]. \quad (39)$$
Based on (35), we further have
$$\mathbb{E}[\|\nabla f(x_R)\|_2] \leq \mathbb{E}\Big[\sqrt{2\|\nabla f_\mu(x_R)\|_2^2 + \frac{\mu^2 L^2 d^2}{2}}\Big] \leq \sqrt{2}\,\mathbb{E}[\|\nabla f_\mu(x_R)\|_2] + \frac{\mu L d}{\sqrt{2}}, \quad (40)$$
where we have used the fact that $\sqrt{a^2 + b^2} \leq a + b$ for $a, b \geq 0$. Substituting (38) and (39) into (40), we finally obtain (8).

5 PROOF OF COROLLARY 1

Upon defining $\tilde\nabla f_i(x; u) = \frac{d\,[f_i(x + \mu u) - f_i(x - \mu u)]\,u}{2\mu}$ (against $\hat\nabla f_i(x; u)$ in (3)), our major task is to derive the first- and second-order moments of $\tilde\nabla f_i(x; u)$. Given $x$, we first study the mean of $\tilde\nabla f_i(x; u)$:
$$\mathbb{E}_u[\tilde\nabla f_i(x; u)] = \mathbb{E}_u\Big[\frac{d}{2\mu}(f_i(x + \mu u) - f_i(x - \mu u))u\Big] = \mathbb{E}_u\Big[\frac{d}{2\mu}f_i(x + \mu u)u\Big] + \mathbb{E}_u\Big[\frac{d}{2\mu}f_i(x - \mu u)(-u)\Big] \overset{(a)}{=} \mathbb{E}_u\Big[\frac{d}{\mu}f_i(x + \mu u)u\Big] \overset{(b)}{=} \mathbb{E}_u\Big[\frac{d}{\mu}(f_i(x + \mu u) - f_i(x))u\Big] = \mathbb{E}_u[\hat\nabla f_i(x; u)] \overset{(c)}{=} \nabla f_{i,\mu}(x), \quad (41)$$
where (a) holds since the distribution of $u$ is symmetric around the origin, (b) holds since $\mathbb{E}[u] = 0$, and (c) holds since $\nabla f_{i,\mu}(x) = \mathbb{E}_u[\frac{d}{\mu}f_i(x + \mu u)u]$, obtained from (Gao et al., 2014, Lemma 4.1). It is clear from (41) that $\mathbb{E}_u[\tilde\nabla f_i(x; u)] = \mathbb{E}_u[\hat\nabla f_i(x; u)]$. We next study the second-order moment of $\tilde\nabla f_i(x; u)$:
$$\mathbb{E}_u[\|\tilde\nabla f_i(x; u)\|_2^2] = \frac{d^2}{4\mu^2}\,\mathbb{E}_u\big[(f_i(x + \mu u) - f_i(x - \mu u))^2\,\|u\|_2^2\big] \leq \frac{d^2}{2\mu^2}\,\mathbb{E}_u\big[(f_i(x + \mu u) - f_i(x))^2 + (f_i(x) - f_i(x - \mu u))^2\big] = \mathbb{E}_u\Big[\Big\|\frac{d}{\mu}(f_i(x + \mu u) - f_i(x))u\Big\|_2^2\Big] = \mathbb{E}_u[\|\hat\nabla f_i(x; u)\|_2^2], \quad (42)$$
where we have used the facts that $\|u\|_2 = 1$ and that the distribution of $u$ is symmetric. Based on (41) and (42), we can conclude that both (3) and (10) maintain the same statistical properties. Following the proofs of Proposition 2 and Theorem 1, it is not difficult to reach the convergence rate in (8).

6 PROOF OF COROLLARY 2

Let $\hat g_k^{i,j} := \hat\nabla f_i(x_k; u_{i,j})$. If we replace $\hat g_k$ with $\hat g_k^{i,j}$ in (18), we then have
$$|(\nabla f_\mu(x_k))_l|\,\mathrm{Prob}\big[\mathrm{sign}(\hat g_{k,l}^{i,j}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] \leq \mathbb{E}\big[|\hat g_{k,l}^{i,j} - (\nabla f_\mu(x_k))_l|\big], \quad (43)$$
where $\hat g_{k,l}^{i,j}$ is the $l$th coordinate of $\hat g_k^{i,j}$. By letting $b = 1$ and $q = 1$ in Proposition 2, we know that $\mathbb{E}[\|\hat g_k^{i,j} - \nabla f_\mu(x_k)\|_2^2]$ is upper bounded.
Moreover, by Jensen's inequality we have
$$\mathbb{E}\big[|\hat g_{k,l}^{i,j} - (\nabla f_\mu(x_k))_l|\big] \leq \sqrt{\mathbb{E}\big[(\hat g_{k,l}^{i,j} - (\nabla f_\mu(x_k))_l)^2\big]} := \xi_l, \quad (44)$$
where $\xi_l$ is finite since $\mathbb{E}[\|\hat g_k^{i,j} - \nabla f_\mu(x_k)\|_2^2]$ is upper bounded. Substituting (44) into (43), we have
$$|(\nabla f_\mu(x_k))_l|\,\mathrm{Prob}\big[\mathrm{sign}(\hat g_{k,l}^{i,j}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] \leq \xi_l. \quad (45)$$
With the new gradient estimate $\tilde g_k = \sum_{i \in \mathcal{I}_k}\sum_{j=1}^q \mathrm{sign}(\hat g_k^{i,j})$ in (11), we need to bound
$$\mathrm{Prob}\big[\mathrm{sign}(\tilde g_{k,l}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big], \quad (46)$$
where $\tilde g_{k,l}$ is the $l$th coordinate of $\tilde g_k$. We recall that $\hat g_{k,l}^{i,j}$ is an unbiased stochastic approximation of the gradient component $(\nabla f_\mu(x_k))_l$ with variance at most $\xi_l^2$. Under the assumption that the noise distribution is unimodal and symmetric, we know from (Bernstein et al., 2018, Lemma D1) that
$$\mathrm{Prob}\big[\mathrm{sign}(\hat g_{k,l}^{i,j}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] := Q \leq \begin{cases} \dfrac{2}{9}\cdot\dfrac{1}{S^2}, & S > \dfrac{2}{\sqrt{3}}, \\[4pt] \dfrac{1}{2} - \dfrac{S}{2\sqrt{3}}, & \text{otherwise}, \end{cases} \qquad Q < \frac{1}{2}, \quad (47)$$
where $S := |(\nabla f_\mu(x_k))_l| / \xi_l$. Let $Z$ count the number of estimates $\{\hat g_{k,l}^{i,j}\}$ yielding the correct sign of $(\nabla f_\mu(x_k))_l$. Thus, the probability of error in (46) is equal to
$$\mathrm{Prob}\big[\mathrm{sign}(\tilde g_{k,l}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] = \mathrm{Prob}\Big[Z \leq \frac{bq}{2}\Big]. \quad (48)$$
Following the proof of (Bernstein et al., 2018, Theorem 2b) under (47), it is not difficult to obtain
$$\mathrm{Prob}\Big[Z \leq \frac{bq}{2}\Big] \leq \frac{1}{\sqrt{bq}\,S}, \quad (49)$$
and therefore
$$|(\nabla f_\mu(x_k))_l|\,\mathrm{Prob}\big[\mathrm{sign}(\tilde g_{k,l}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] \leq \frac{\xi_l}{\sqrt{bq}}. \quad (50)$$
Replacing $\hat g_k$ with $\tilde g_k$ in (17), and letting $\xi := [\xi_1, \ldots, \xi_d]^T$, we obtain
$$\mathbb{E}[f_\mu(x_{k+1}) - f_\mu(x_k)] \leq -\delta_k\|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + 2\delta_k\sum_{l=1}^d |(\nabla f_\mu(x_k))_l|\,\mathrm{Prob}\big[\mathrm{sign}(\tilde g_{k,l}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] \overset{(50)}{\leq} -\delta_k\|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + \frac{2\delta_k}{\sqrt{bq}}\|\xi\|_1 \leq -\delta_k\|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + \frac{2\delta_k\sqrt{d}}{\sqrt{bq}}\|\xi\|_2 \overset{(44)}{=} -\delta_k\|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + \frac{2\delta_k\sqrt{d}}{\sqrt{bq}}\sqrt{\mathbb{E}[\|\hat g_k^{i,j} - \nabla f_\mu(x_k)\|_2^2]}. \quad (51)$$
Compared with (19), the standard deviation $\sqrt{\mathbb{E}[\|\hat g_k^{i,j} - \nabla f_\mu(x_k)\|_2^2]}$ is reduced by the factor $1/\sqrt{bq}$. According to Proposition 2 with $b = q = 1$, we obtain
$$\mathbb{E}\big[\|\hat g_k^{i,j} - \nabla f_\mu(x_k)\|_2^2\big] \leq 8\alpha_b\sigma^2 + (2\alpha_b + \beta_b)\,C(d, \mu). \quad (52)$$
Based on (51)-(52) and following the proof of Theorem 1, we have
$$\mathbb{E}[\|\nabla f(x_R)\|_2] \leq \sqrt{2}\,\Big[\frac{f(x_0) - f^* + \mu^2 L}{\sum_{k=0}^{T-1}\delta_k} + 2\sqrt{\frac{d}{bq}}\sqrt{8\alpha_b\sigma^2 + C(d, \mu)(2\alpha_b + \beta_b)} + \frac{dL}{2}\cdot\frac{\sum_{k=0}^{T-1}\delta_k^2}{\sum_{k=0}^{T-1}\delta_k}\Big] + \frac{\mu L d}{\sqrt{2}}, \quad (53)$$
where $C(d, \mu) := 2d\sigma^2 + \mu^2 L^2 d^2 / 2$.
If $\mu = O(\frac{1}{d\sqrt{T}})$ and $\delta_k = O(\frac{1}{\sqrt{dT}})$, then the convergence rate simplifies to $O(\sqrt{d}/\sqrt{T} + d/\sqrt{bq})$.

7 PROOF OF COROLLARY 3

Let $\hat g_k^m := \frac{1}{b_m q}\sum_{i \in \mathcal{I}_{m,k}}\sum_{j=1}^q \hat\nabla f_i(x_k; u_{i,j})$ and $\tilde g_k = \frac{1}{M}\sum_{m=1}^M \mathrm{sign}(\hat g_k^m)$. Following (43)-(50), we can similarly obtain
$$|(\nabla f_\mu(x_k))_l|\,\mathrm{Prob}\big[\mathrm{sign}(\tilde g_{k,l}) \neq \mathrm{sign}((\nabla f_\mu(x_k))_l)\big] \leq \frac{\tilde\xi_l}{\sqrt{M}}, \quad (54)$$
where $\tilde\xi_l := \sqrt{\mathbb{E}[(\hat g_{k,l}^m - (\nabla f_\mu(x_k))_l)^2]}$. By mimicking the derivation in (51), we have
$$\mathbb{E}[f_\mu(x_{k+1}) - f_\mu(x_k)] \leq -\delta_k\|\nabla f_\mu(x_k)\|_1 + \frac{\delta_k^2 d L}{2} + \frac{2\delta_k\sqrt{d}}{\sqrt{M}}\sqrt{\mathbb{E}[\|\hat g_k^m - \nabla f_\mu(x_k)\|_2^2]}. \quad (55)$$
According to Proposition 2 with $b = \lceil n/M \rceil$, we obtain
$$\mathbb{E}\big[\|\hat g_k^m - \nabla f_\mu(x_k)\|_2^2\big] \leq \frac{4(q+1)}{bq}\sigma^2 + \frac{3C(d, \mu)}{bq}. \quad (56)$$
Based on (55)-(56) and following the proof of Theorem 1, we have
$$\mathbb{E}[\|\nabla f(x_R)\|_2] \leq \sqrt{2}\,\Big[\frac{f(x_0) - f^* + \mu^2 L}{\sum_{k=0}^{T-1}\delta_k} + 2\sqrt{\frac{d}{Mbq}}\sqrt{4(q+1)\sigma^2 + 3C(d, \mu)} + \frac{dL}{2}\cdot\frac{\sum_{k=0}^{T-1}\delta_k^2}{\sum_{k=0}^{T-1}\delta_k}\Big] + \frac{\mu L d}{\sqrt{2}}, \quad (57)$$
where $C(d, \mu) := 2d\sigma^2 + \mu^2 L^2 d^2 / 2$. Since $\mu = O(\frac{1}{d\sqrt{T}})$, $\delta_k = O(\frac{1}{\sqrt{dT}})$ and $bM \approx n$, the convergence rate simplifies to $O(\sqrt{d/n} + d/\sqrt{nq})$.

8 SUPPLEMENTAL EXPERIMENTS

8.1 SYNTHETIC EXPERIMENTS

In Fig. A3, we demonstrate the effect of the learning rate $\delta$ on the training loss of SGD, signSGD, ZO-SGD, ZO-SCD, ZO-signSGD and ZO-M-signSGD. We observe that, compared to the gradient-based algorithms (SGD, ZO-SGD and ZO-SCD), the gradient-sign-based algorithms support a more flexible choice of learning rates (corresponding to less variance), since the sign operation has a normalization effect that reduces oscillations in convergence. We find $\delta = 0.1$ best for SGD, $\delta = 0.009$ best for signSGD, $\delta = 0.1$ best for ZO-SGD and ZO-SCD, $\delta = 0.0178$ best for ZO-signSGD, and $\delta = 0.0501$ best for ZO-M-signSGD.

Figure A3: The training loss at the last iteration versus the constant learning rate $\delta \in [10^{-3}, 0.1]$. The solid line represents the loss averaged over 10 independent trials with random initialization, and the shaded region indicates the standard deviation of results over random trials.

8.2 BLACK-BOX ATTACK FORMULATION

The target DNN classifier F = [F1, F2, . . .
, FK] takes an image as input and produces classification predictions (here, a probability distribution) over K image classes, where $F_k$ denotes the prediction for class $k$. Given F, an adversarial example $x$ of a legitimate example $x_0$ is visually similar to $x_0$ but yields a different top-1 prediction than $x_0$. Let $(x_0, t_0)$ denote a legitimate image $x_0$ and its ground-truth label $t_0 \in \{1, 2, \ldots, K\}$. Without loss of generality, we assume the pixel values of $x_0$ lie within $[-0.5, 0.5]^d$. Using the tanh transformation $x = \tanh(w)/2$ on the adversarial image $x$, where $w \in \mathbb{R}^d$, we adopt the untargeted black-box attacking loss designed in (Chen et al., 2017),
$$\underset{w \in \mathbb{R}^d}{\text{minimize}} \;\; c \cdot \max\Big\{\log F_{t_0}(\tanh(w)/2) - \max_{j \neq t_0}\log F_j(\tanh(w)/2),\; 0\Big\} + \|\tanh(w)/2 - x_0\|_2^2, \quad (58)$$
where $c$ is a regularization coefficient and $x = \tanh(w)/2$ ensures that $x$ lies within the valid image space $[-0.5, 0.5]^d$. The first term in the attacking objective is a hinge loss that penalizes the top-1 prediction of $x$ being $t_0$; the $\log F(\cdot)$ operation preserves the class prediction ranking and better handles numerical stability. The second term encourages visual similarity between $x$ and $x_0$ by penalizing their squared ℓ2 difference (i.e., the squared distortion). In our experiments, we set $c = 1$ for MNIST and $c = 0.1$ for CIFAR-10. We also note that, different from the use of signed gradients in existing black-box attacks (e.g., Ilyas et al. (2018a); Bhagoji et al. (2018)) driven by an explicit ℓ∞ perturbation constraint, the use and benefit of ZO-signSGD for solving (58) in our experiments are non-trivial, since the attacking objective does not impose any ℓ∞ constraint.

8.3 ADDITIONAL BLACK-BOX ATTACKING RESULTS ON MNIST AND CIFAR-10

Table A1: First successful adversarial examples attacking black-box DNN on MNIST.
(Each column is one attacked image; rows give the original top-1 prediction and the top-1 prediction of the first successful adversarial example found by each method.)

Image ID                 3  2  38  44  4  15  21  34  177  7
Originally classified as 0  1   2   3  4   5   6   7    8  9
ZO-SGD                   6  4   1   5  9   3   5   1    0  4
ZO-signSGD               6  4   1   5  9   3   5   1    0  4
ZO-M-signSGD             6  4   1   5  9   3   5   1    0  4
ZO-NES                   6  4   1   5  9   3   5   1    0  4

Table A2: First successful adversarial examples attacking black-box DNN on CIFAR-10.

Image ID                 10        6           25     68   36     33    5     17     1           28
Originally classified as airplane  automobile  bird   cat  deer   dog   frog  horse  ship        truck
ZO-SGD                   bird      truck       truck  dog  horse  cat   cat   dog    truck       automobile
ZO-signSGD               bird      truck       truck  dog  horse  cat   cat   dog    truck       automobile
ZO-M-signSGD             bird      truck       truck  dog  horse  cat   cat   dog    automobile  automobile
ZO-NES                   bird      truck       truck  dog  horse  cat   cat   dog    automobile  automobile

Table A3: Iteration comparison of attacking black-box DNN on MNIST (image ID 177); each row pair gives the iteration count and the corresponding top-1 prediction.

ZO-SGD          Iteration:     0  30  60  90 120 162 181 210 241 271
                Classified as: 8   8   8   8   8   0   0   0   0   0
ZO-signSGD      Iteration:     0  30  60  90 114 151 180 210 241 271
                Classified as: 8   8   8   8   0   0   0   0   0   0
ZO-M-signSGD    Iteration:     0  30  60  90 120 150 180 198 240 271
                Classified as: 8   8   8   8   8   8   8   0   0   0
ZO-NES          Iteration:     0  40  80 120 160 200 240 280 320 349
                Classified as: 8   8   8   8   8   8   8   8   8   0

Figure A4: Additional plots of black-box attacking loss versus iteration on MNIST, for image IDs 3, 4, 7, 15, 21, 38, 44 and 177 (panels (a)-(h)). The solid marker indicates the iteration number at which the first successful adversarial example is found, and its loss corresponds to the squared ℓ2 distortion.
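For reference, the untargeted attacking loss (58) of Appendix 8.2 can be sketched as follows (a minimal illustration, not the authors' code; `log_probs_fn` stands in for the black-box model's log class-probabilities, and all names are our own):

```python
import numpy as np

def attack_loss(w, x0, log_probs_fn, t0, c=1.0):
    # Untargeted black-box attacking loss of Eq. (58):
    # hinge term on the (log) margin of the true class t0, plus squared l2 distortion.
    x = np.tanh(w) / 2.0                          # maps w into the valid image space [-0.5, 0.5]^d
    logp = log_probs_fn(x)
    margin = logp[t0] - np.max(np.delete(logp, t0))
    return c * max(margin, 0.0) + np.sum((x - x0) ** 2)
```

Since only function values of `attack_loss` are needed, any of the ZO methods above can minimize it; a successful attack drives the hinge term to zero while the distortion term keeps $x$ close to $x_0$.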
Figure A5: Additional plots of black-box attacking loss versus iteration on CIFAR-10, for image IDs 1, 5, 17, 25, 28, 33, 36 and 68 (panels (a)-(h)). The solid marker indicates the iteration number at which the first successful adversarial example is found, and its loss corresponds to the squared ℓ2 distortion.