Published as a conference paper at ICLR 2020

# SIGN-OPT: A QUERY-EFFICIENT HARD-LABEL ADVERSARIAL ATTACK

Minhao Cheng1*, Simranjit Singh1, Patrick Chen1, Pin-Yu Chen2, Sijia Liu2, Cho-Jui Hsieh1
1Department of Computer Science, UCLA, 2IBM Research
{mhcheng, simranjit, patrickchen, chohsieh}@cs.ucla.edu, {pin-yu.chen, sijia.liu}@ibm.com
*Equal contribution.

ABSTRACT

We study the most practical problem setup for evaluating the adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where only a limited number of model queries are allowed and only the hard-label decision is returned for a queried input. Several algorithms have been proposed for this problem, but they typically require a huge number (>20,000) of queries to attack a single example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that the hard-label attack can be modeled as an optimization problem whose objective function can be evaluated by binary search with additional model queries, so that a zeroth-order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of the gradient in any direction instead of the gradient itself, which can be done with a single query. Using this single-query oracle for retrieving the sign of a directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. We find that the Sign-OPT attack consistently requires 5 to 10 times fewer queries than the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation.

1 INTRODUCTION

It has been shown that neural networks are vulnerable to adversarial examples (Szegedy et al., 2016; Goodfellow et al., 2015; Carlini & Wagner, 2017; Athalye et al., 2018). Given a victim neural network model and a correctly classified example, an adversarial attack aims to compute a small perturbation such that, once the perturbation is added, the original example is misclassified. Many adversarial attacks have been proposed in the literature. Most of them consider the white-box setting, where the attacker has full knowledge of the victim model, so gradient-based optimization can be used for the attack. Popular examples include the C&W (Carlini & Wagner, 2017) and PGD (Madry et al., 2017) attacks. On the other hand, some more recent attacks consider the probability (soft-label) black-box setting, where the attacker does not know the victim model's structure and weights but can iteratively query the model and obtain the corresponding probability outputs. In this setting, although the gradient (of the output probability with respect to the input) is not directly computable, it can still be estimated using finite differences, and many attacks are based on this idea (Chen et al., 2017; Ilyas et al., 2018a; Tu et al., 2019; Jun et al., 2018). In this paper, we consider the most challenging and practical attack setting, the hard-label black-box setting, where the model is hidden from the attacker and the attacker can only make queries and observe the corresponding hard-label decisions (e.g., predicted labels) of the model.
A commonly used algorithm in this setting, the Boundary attack (Brendel et al., 2017), is based on random walks on the decision surface, but it does not have any convergence guarantee. More recently, Cheng et al. (2019) showed that finding the minimum adversarial perturbation in the hard-label setting can be reformulated as another optimization problem (we call this Cheng's formulation in this paper). This new formulation enjoys the benefit of having a smooth boundary in most tasks, and its function value can be computed using hard-label queries. Therefore, the authors of (Cheng et al., 2019) are able to use standard zeroth-order optimization to solve the new formulation. Although their algorithm converges quickly, it still requires a large number of queries (e.g., 20,000) to attack a single image, since every function evaluation in Cheng's formulation has to be computed by a binary search that requires tens of queries.

In this paper, we follow the same optimization formulation as (Cheng et al., 2019), which has the advantage of smoothness, but instead of using finite differences to estimate the magnitude of the directional derivative, we propose to evaluate its sign using only a single query. With this single-query sign oracle, we design novel algorithms for solving Cheng's formulation, and we theoretically prove and empirically demonstrate a significant reduction in the number of queries required for hard-label black-box attack. Our contributions are summarized below:

- Novelty in terms of adversarial attack. We introduce an efficient approach to compute the sign of the directional derivative of Cheng's formulation using a single query, and based on this technique we develop a novel optimization algorithm called Sign-OPT for hard-label black-box attack.
- Novelty in terms of optimization. Our method can be viewed as a new zeroth-order optimization algorithm that features the fast convergence of signSGD. Instead of directly taking the sign of the gradient estimate, our algorithm also utilizes the scale of the random directions. This makes existing analyses inapplicable to our case, and we provide a new recipe for proving the convergence of this new optimizer.
- We conduct comprehensive experiments on several datasets and models. We show that the proposed algorithm consistently reduces the query count by 5 to 10 times across different models and datasets, suggesting a practical and query-efficient robustness evaluation tool. Furthermore, on most datasets our algorithm finds an adversarial example with smaller distortion than previous approaches.

2 RELATED WORK

White-box attack. Since it was first shown that neural networks are easily fooled by adversarial examples (Goodfellow et al., 2015), a large body of work has been proposed in the white-box attack setting, where the classifier f is completely exposed to the attacker. For neural networks, under this assumption, back-propagation can be conducted on the target model because both the network structure and the weights are known to the attacker. Algorithms including (Goodfellow et al., 2015; Kurakin et al., 2016; Carlini & Wagner, 2017; Chen et al., 2018; Madry et al., 2017) are then proposed based on gradient computation. Recently, the BPDA attack introduced by Athalye et al. (2018) bypasses models with obfuscated gradients and is shown to successfully circumvent many defenses.
In addition to typical attacks based on small ℓp-norm perturbations, non-ℓp perturbations such as scaling or shifting have also been considered (Zhang et al., 2019).

Black-box attack. Recently, the black-box setting has been drawing rapidly increasing attention. In the black-box setting, the attacker can query the model but has no (direct) access to any internal information inside the model. Although there are some works based on transfer attacks (Papernot et al., 2017), we consider query-based attacks in this paper. Depending on the model's feedback for a given query, an attack can be classified as a soft-label or hard-label attack. In the soft-label setting, the model outputs a probability score for each decision. Chen et al. (2017) uses coordinate-wise finite differences to approximately estimate changes in the output probabilities and performs coordinate descent to conduct the attack. Ilyas et al. (2018a) uses natural evolution strategies (NES) to approximately estimate the gradient directly. Later, several variants (Ilyas et al., 2018b; Tu et al., 2019) were proposed that utilize side information to further speed up the attack procedure. Alzantot et al. (2019) uses an evolutionary algorithm as a black-box optimizer for the soft-label setting. Recently, Al-Dujaili & O'Reilly (2019) proposed the SignHunter algorithm, based on signSGD (Bernstein et al., 2018), to achieve a more query-efficient sign estimate when crafting black-box adversarial examples from soft-label information.

In the hard-label case, only the final decision, i.e. the top-1 predicted class, is observed. As a result, the attacker can only make queries to acquire the corresponding hard-label decisions instead of the probability outputs. Brendel et al. (2017) first studied this problem and proposed an algorithm based on random walks near the decision boundary. By selecting a random direction and projecting it onto a boundary sphere in each iteration, it aims to generate a high-quality adversarial example. The query-limited attack of Ilyas et al. (2018a) tries to estimate the output probability scores with model queries and turn the hard-label problem into a soft-label one. Cheng et al. (2019) instead reformulates the hard-label attack as an optimization problem that finds the direction producing the shortest distance to the decision boundary. The recent arXiv paper (Chen et al., 2019) applied a zeroth-order sign oracle to improve the Boundary attack and also demonstrated significant improvement. The major differences from our algorithm are that we propose a new zeroth-order gradient descent algorithm, provide its algorithmic convergence guarantees, and aim to improve the query complexity of the attack formulation proposed in (Cheng et al., 2019). For completeness, we also compare with this method in Section A.1. Moreover, (Chen et al., 2019) uses a one-point gradient estimate, which is unbiased but may have larger variance than the gradient estimate used in our paper. Thus, we observe in Section A.1 that although they are slightly faster in the initial stage, Sign-OPT catches up and eventually reaches a slightly better solution.

3 PROPOSED METHOD

We follow the same formulation as (Cheng et al., 2019) and consider the hard-label attack as the problem of finding the direction with the shortest distance to the decision boundary.
Specifically, for a given example x0, its true label y0 and the hard-label black-box function f : R^d → {1, ..., K}, the objective function g : R^d → R (for the untargeted attack) can be written as:

    min_θ g(θ)   where   g(θ) = arg min_{λ>0} λ  s.t.  f(x0 + λ θ/‖θ‖) ≠ y0.   (1)

It has been shown that this objective function is usually smooth, and g can be evaluated locally by a binary search procedure. At each binary search step, we query the function f(x0 + λ θ/‖θ‖) and determine whether the distance to the decision boundary in the direction θ is greater or smaller than λ based on the hard-label prediction¹. As the objective function is computable, the directional derivative of g can be estimated by finite differences:

    ∇̂g(θ; u) := (g(θ + ϵu) − g(θ)) / ϵ,   (2)

where u is a random Gaussian vector and ϵ > 0 is a very small smoothing parameter. This is a standard zeroth-order oracle for estimating directional derivatives, and based on it many different zeroth-order optimization algorithms can be applied to minimize g.

Figure 1: Illustration.

For example, Cheng et al. (2019) used the random derivative-free algorithm of Nesterov & Spokoiny (2017) to solve problem (1). However, each computation of (2) requires many hard-label queries due to the binary search, so Cheng et al. (2019) still requires a huge number of queries despite having fast convergence. In this work, we introduce an algorithm that greatly improves the query complexity over Cheng et al. (2019). Our algorithm is based on the following key ideas: (i) one does not need very accurate values of the directional derivative for the algorithm to converge, and (ii) there exists an imperfect but informative estimate of the directional derivative of g that can be computed with a single query.

¹Note that binary search only works in a small local region; in the more general case g(θ) has to be computed by a fine-grained search plus binary search, as discussed in Cheng et al. (2019).

Algorithm 1: Sign-OPT attack
  Input: Hard-label model f, original image x0, initial direction θ0;
  for t = 1, 2, ..., T do
      Randomly sample u1, ..., uQ from a Gaussian or uniform distribution;
      Compute ĝ ← (1/Q) Σ_{q=1}^{Q} sign(g(θt + ϵuq) − g(θt)) · uq;
      Update θ_{t+1} ← θt − η ĝ;
      Evaluate g(θt) using the same search algorithm as in Cheng et al. (2019);
  end

3.1 A SINGLE QUERY ORACLE

As mentioned before, the previous approach requires computing g(θ + ϵu) − g(θ), which consumes many queries. However, based on the definition of g(·), we can compute the sign of this value, sign(g(θ + ϵu) − g(θ)), using a single query. Considering the untargeted attack case, the sign can be computed by

    sign(g(θ + ϵu) − g(θ)) = { +1, if f(x0 + g(θ) (θ + ϵu)/‖θ + ϵu‖) = y0;  −1, otherwise. }   (3)

This is illustrated in Figure 1. Essentially, for a new direction θ + ϵu, we test whether a point at the original distance g(θ) from x0 in this direction lies inside or outside the decision boundary, i.e., whether the produced perturbation results in a wrong prediction by the classifier. If the produced perturbation lies outside the boundary, i.e., f(x0 + g(θ)(θ + ϵu)/‖θ + ϵu‖) ≠ y0, the new direction has a smaller distance to the decision boundary and thus gives a smaller value of g; this indicates that u is a descent direction for minimizing g.

3.2 SIGN-OPT ATTACK

By sampling Q random Gaussian vectors, we can form the imperfect gradient estimate

    ∇̂g(θ) ≈ ĝ := Σ_{q=1}^{Q} sign(g(θ + ϵuq) − g(θ)) uq,   (4)

which only requires Q queries.
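To make the single-query oracle in (3) and the estimator in (4) concrete, here is a minimal NumPy sketch. It is not the authors' released implementation (see the repository linked in Section 4); the function names, the Gaussian sampling, and defaults such as ϵ are illustrative assumptions, and f is assumed to return only the predicted hard label.

```python
import numpy as np

def sign_of_directional_derivative(f, x0, y0, theta, u, g_theta, eps=1e-3):
    """Single-query oracle of Eq. (3), untargeted case.

    Returns sign(g(theta + eps*u) - g(theta)) with ONE model query: step a
    distance g(theta) along the perturbed direction and check whether the
    hard label is still y0 (+1) or has already flipped (-1).
    """
    new_dir = theta + eps * u
    new_dir = new_dir / np.linalg.norm(new_dir)
    return 1.0 if f(x0 + g_theta * new_dir) == y0 else -1.0

def sign_opt_gradient_estimate(f, x0, y0, theta, g_theta, Q=200, eps=1e-3):
    """Imperfect gradient estimate of Eq. (4) / Algorithm 1, using Q queries."""
    grad = np.zeros_like(theta)
    for _ in range(Q):
        u = np.random.randn(*theta.shape)
        s = sign_of_directional_derivative(f, x0, y0, theta, u, g_theta, eps)
        grad += s * u          # keep the full vector u, weighted by the oracle sign
    return grad / Q            # Algorithm 1 averages over the Q samples
```

One Sign-OPT iteration would then update θ ← θ − η·ĝ and re-evaluate g(θ) with the fine-grained plus binary search of Cheng et al. (2019).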
We then use this imperfect gradient estimate to update the search direction θ as θ ← θ − η ĝ with a step size η, and use the same search procedure to compute g(θ) up to a certain accuracy. The detailed procedure is shown in Algorithm 1.

We note that Liu et al. (2019) designed a zeroth-order signSGD (ZO-signSGD) algorithm for soft-label black-box attack (not the hard-label setting). They use ∇̂g(θ) ≈ ĝ := Σ_{q=1}^{Q} sign((g(θ + ϵuq) − g(θ)) uq) and show that it can achieve a comparable or even better convergence rate than zeroth-order stochastic gradient descent by using only the sign information of the gradient estimate. Although it is possible to combine ZO-signSGD with our proposed single query oracle for solving the hard-label attack, their estimator takes the sign of the whole vector and thus ignores the magnitude of uq, which leads to slower convergence in practice (please refer to Section 4.4 and Figure 5(b) for more details).

To the best of our knowledge, no previous analysis can be used to prove the convergence of Algorithm 1. In the following, we show that Algorithm 1 does converge and, furthermore, with a convergence rate similar to that of (Liu et al., 2019), despite using a different gradient estimator.

Assumption 1. The function g(θ) is L-smooth with a finite value of L.

Assumption 2. At any iteration step t, the gradient of the function g is upper bounded by ‖∇g(θt)‖2 ≤ σ.

Theorem 3.1. Suppose that the conditions in the assumptions hold, and the distribution of the gradient noise is unimodal and symmetric. Then, the Sign-OPT attack with learning rate ηt = O(1/(Q√(dT))) gives the following bound on E[‖∇g(θ)‖2]:

    E[‖∇g(θ)‖2] = O(√d/√T + √d/√Q).

The proof can be found in subsection A.2. The main difference from the original analysis of Liu et al. (2019) is that they only deal with the sign of each element, while our analysis also takes the magnitude of each element of uq into account.

3.3 OTHER GRADIENT ESTIMATIONS

Note that the value sign(g(θ + ϵu) − g(θ)) computed by our single query oracle is actually the sign of the directional derivative:

    sign(⟨∇g(θ), u⟩) = sign( lim_{ϵ→0} (g(θ + ϵu) − g(θ))/ϵ ) = sign(g(θ + ϵu) − g(θ))

for a small ϵ. Therefore, we can use this information to estimate the original gradient. The Sign-OPT approach in the previous section uses Σ_q sign(⟨∇g(θ), uq⟩) uq as an estimate of the gradient. Letting yq := sign(⟨∇g(θ), uq⟩), a more accurate gradient estimate can be cast as the following constrained optimization problem: find a vector z such that sign(⟨z, uq⟩) = yq for all q = 1, ..., Q. This is equivalent to a hard-constraint SVM problem where each uq is a training sample and yq is the corresponding label. The gradient can then be recovered by solving the following quadratic programming problem:

    min_z  zᵀz   s.t.   yq (zᵀuq) ≥ 1,  q = 1, ..., Q.   (5)

By solving this problem, we can get a good estimate of the gradient. As explained earlier, each yq can be determined with a single query. Therefore, we propose a variant of Sign-OPT, called the SVM-OPT attack. The detailed procedure is shown in Algorithm 2. We present an empirical comparison of our two algorithms in subsection 4.1.

Algorithm 2: SVM-OPT attack
  Input: Hard-label model f, original image x0, initial direction θ0;
  for t = 1, 2, ..., T do
      Sample u1, ..., uQ from a Gaussian distribution or an orthogonal basis;
      Solve z defined by (5);
      Update θ_{t+1} ← θt − η z;
      Evaluate g(θt) using the search algorithm in (Cheng et al., 2019);
  end
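As a companion to Algorithm 2, the following sketch recovers a gradient direction from the Q single-query signs by solving the hard-margin-SVM-style QP in (5). It is only an illustrative solution using SciPy's general-purpose SLSQP solver (the paper does not specify a solver); `signs[q]` is assumed to hold sign(g(θ + ϵu_q) − g(θ)) and the rows of `U` the sampled directions u_q.

```python
import numpy as np
from scipy.optimize import minimize

def svm_opt_gradient_estimate(signs, U):
    """Solve Eq. (5):  min_z z^T z  s.t.  signs[q] * (z^T u_q) >= 1 for all q.

    signs : (Q,) array of +/-1 returned by the single-query oracle.
    U     : (Q, d) array whose rows are the random directions u_q.
    Returns a unit-norm estimate of the gradient direction.
    """
    # Hard-margin inequality constraints: signs * (U @ z) - 1 >= 0
    constraints = [{"type": "ineq", "fun": lambda z: signs * (U @ z) - 1.0}]
    z0 = (signs[:, None] * U).mean(axis=0)   # Sign-OPT estimate as a warm start
    res = minimize(lambda z: z @ z, z0, jac=lambda z: 2.0 * z,
                   method="SLSQP", constraints=constraints)
    z = res.x if res.success else z0         # fall back to the warm start
    return z / (np.linalg.norm(z) + 1e-12)
```

A dedicated QP solver would be preferable when Q is large; the fallback to the Sign-OPT estimate is simply a safeguard for the rare case where the solver cannot satisfy all Q constraints.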
4 EXPERIMENTAL RESULTS

We evaluate the Sign-OPT algorithm for attacking black-box models in the hard-label setting on three standard datasets - MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al.) and ImageNet-1000 (Deng et al., 2009) - and compare it with existing methods. For a fair and easy comparison, we use the CNN networks provided by (Carlini & Wagner, 2017), which have also been used by previous hard-label attacks. Specifically, for both MNIST and CIFAR-10, the model consists of nine layers in total - four convolutional layers, two max-pooling layers and two fully-connected layers. Further details about implementation, training and parameters are available in (Carlini & Wagner, 2017). As reported in (Carlini & Wagner, 2017) and (Cheng et al., 2019), we were able to achieve an accuracy of 99.5% on MNIST and 82.5% on CIFAR-10. For ImageNet-1000, we use the pretrained ResNet-50 (He et al., 2016) network provided by torchvision (Marcel & Rodriguez, 2010), which achieves a top-1 accuracy of 76.15%. In our experiments, we found that Sign-OPT and SVM-OPT perform quite similarly in terms of query efficiency. Hence we compare only the Sign-OPT attack with previous approaches and provide a comparison between Sign-OPT and SVM-OPT in subsection 4.1. We compare the following attacks:

- Sign-OPT attack (black box): The approach presented in this paper.
- OPT-based attack (black box): The method proposed in Cheng et al. (2019), which uses a randomized gradient-free method to optimize the same objective function. We use the implementation provided at https://github.com/LeMinhThong/blackbox-attack.
- Boundary attack (black box): The method proposed in Brendel et al. (2017). This is compared only in the L2 setting, as it is designed for that norm. We use the implementation provided in Foolbox (https://github.com/bethgelab/foolbox).
- Guessing Smart attack (black box): The method proposed in (Brunner et al., 2018). This attack enhances the Boundary attack by biasing sampling towards three priors. Note that one of the priors assumes access to a model similar to the target model; for a fair comparison, we do not incorporate this bias in our experiments. We use the implementation provided at https://github.com/ttbrunner/biased_boundary_attack.
- C&W attack (white box): One of the most popular methods in the white-box setting, proposed in Carlini & Wagner (2017). We use the C&W L2-norm attack as a baseline for white-box attack performance.

Figure 2: Example of Sign-OPT targeted attack. L2 distortions and queries used are shown above and below the images. First two rows: example comparison of the Sign-OPT attack and the OPT attack. Third and fourth rows: examples of the Sign-OPT attack on CIFAR-10 and ImageNet.

For each attack, we randomly sample 100 examples from the validation set and generate adversarial perturbations for them. For the untargeted attack, we only consider examples that are correctly predicted by the model, and for the targeted attack, we consider examples that are not already predicted as the target label by the model. To compare different methods, we mainly use median distortion as the metric. The median distortion at x queries is the median adversarial perturbation over all examples achieved by a method using fewer than x queries. Since all the hard-label attack algorithms start from an adversarial example and keep reducing the distortion, they always provide an adversarial example if stopped at any time, and median distortion is therefore the most suitable metric to compare their performance. Besides, we also report the success rate (SR) at x queries for a given threshold ϵ, which is the percentage of examples that achieve an adversarial perturbation below ϵ with fewer than x queries. We evaluate the success rate at different thresholds, which depend on the dataset being used. For comparison of different algorithms in each setting, we chose the same set of examples across all attacks.
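The two evaluation metrics just described can be computed from per-example attack traces as in the following sketch (an illustrative reconstruction, not code from the paper); each trace is assumed to be an array of (queries_used, best_distortion_so_far) pairs recorded while an attack runs on one example.

```python
import numpy as np

def median_distortion(traces, x):
    """Median over examples of the best distortion reached with fewer than x queries."""
    best = []
    for t in traces:                      # t: (n_i, 2) array of (queries, distortion)
        mask = t[:, 0] < x
        best.append(t[mask, 1].min() if mask.any() else np.inf)
    return float(np.median(best))

def success_rate(traces, x, eps):
    """Fraction of examples whose distortion drops below eps within x queries."""
    hits = [bool(np.any((t[:, 0] < x) & (t[:, 1] < eps))) for t in traces]
    return float(np.mean(hits))
```

The median-distortion curves reported below correspond to evaluating median_distortion(traces, x) on a grid of query budgets x, and the success-rate plots to success_rate(traces, x, eps) for dataset-dependent thresholds eps.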
Implementation details: To optimize Algorithm 1, we estimate the step size η using the same line search procedure implemented in Cheng et al. (2019). At the cost of a relatively small number of additional queries, this provides a significant speedup in the optimization. Similar to Cheng et al. (2019), g(θ) in the last step of Algorithm 1 is approximated via binary search. The initial θ0 in Algorithm 1 is calculated by evaluating g(θ) on 100 random directions and taking the best one. We provide our implementation publicly².

²https://github.com/cmhcbb/attackbox
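The initialization and step-size search described above could look roughly like the sketch below. This is a hedged reconstruction: the exact schedule of the line search in Cheng et al. (2019) and the helper eval_g (the fine-grained plus binary search for g(θ)) are assumptions, not the released implementation.

```python
import numpy as np

def initialize_theta(f, x0, y0, eval_g, num_dirs=100):
    """Pick theta_0 by evaluating g on random directions and keeping the best one."""
    best_theta, best_g = None, np.inf
    for _ in range(num_dirs):
        theta = np.random.randn(*x0.shape)
        if f(x0 + theta) != y0:           # only directions that already flip the label
            g_val = eval_g(theta)         # distance to the boundary along theta
            if g_val < best_g:
                best_theta, best_g = theta, g_val
    return best_theta, best_g

def line_search_step(eval_g, theta, g_theta, grad, eta0=1.0, beta=0.5, max_tries=15):
    """Backtracking search on the step size: shrink eta until g actually decreases."""
    eta = eta0
    for _ in range(max_tries):
        cand = theta - eta * grad
        g_cand = eval_g(cand)
        if g_cand < g_theta:
            return cand, g_cand, eta
        eta *= beta
    return theta, g_theta, 0.0            # no improving step found; keep current theta
```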
4.1 COMPARISON BETWEEN SIGN-OPT AND SVM-OPT

Figure 3: Median L2 distortion vs queries. First two: comparison of the Sign-OPT and SVM-OPT attacks for MNIST and CIFAR-10. Third: performance of Sign-OPT for different values of Q.

In our experiments, we found that the performance in terms of queries of these two attacks is remarkably similar in all settings (both L2/L∞ and targeted/untargeted) and on all datasets. We present a comparison for MNIST and CIFAR-10 (L2 norm-based) for both targeted and untargeted attacks in Figure 3. We see that the median distortion achieved for a given number of queries is quite on par for Sign-OPT and SVM-OPT.

Number of queries per gradient estimate: In Figure 3, we show a comparison of the Sign-OPT attack with different values of Q. Our experiments suggest that Q does not have an impact on the convergence point reached by the algorithm. However, small values of Q provide a noisy gradient estimate and hence delayed convergence to an adversarial perturbation, while large values of Q require a large amount of time per gradient estimate. After fine-tuning on a small set of examples, we found that Q = 200 provides a good balance between the two. Hence, we set Q = 200 for all experiments in this section.

4.2 UNTARGETED ATTACK

In this attack, the objective is to generate an adversarial example from an original image such that the model's prediction differs from that of the original image. Figure 4 provides an elaborate comparison of different attacks in the L2 case for the three datasets. The Sign-OPT attack consistently outperforms the current approaches in terms of queries. Not only is Sign-OPT more efficient in terms of queries, in most cases it also converges to a lower distortion than is possible with other hard-label attacks. Furthermore, we observe that Sign-OPT converges to a solution comparable with the C&W white-box attack (better on CIFAR-10, worse on MNIST, comparable on ImageNet). This is significant for a hard-label attack algorithm, since we are given very limited information.

Figure 4: Untargeted attack: median distortion vs queries for different datasets.

We highlight some of the comparisons of the Boundary attack, OPT-based attack and Sign-OPT attack (L2 norm-based) in Table 1. In particular, for the ImageNet dataset on the ResNet-50 model, the Sign-OPT attack reaches a median distortion below 3.0 in less than 30k queries, while other attacks need more than 200k queries for the same.

4.3 TARGETED ATTACK

In the targeted attack, the goal is to generate an adversarial perturbation for an image so that the resulting image is predicted as a specified target label. For each example, we randomly specify the target label, keeping it consistent across different attacks. We calculate the initial θ0 in Algorithm 1 using 100 samples of the target label class from the training dataset, and this θ0 is the same across different attacks. Figure 2 shows some examples of adversarial examples generated by the Sign-OPT attack and the OPT-based attack. The first two rows show a comparison of the Sign-OPT and OPT attacks on an example from the MNIST dataset. The figures show adversarial examples generated at almost the same number of queries for both attacks. The Sign-OPT method generates an L2 adversarial perturbation of 0.94 in 6k queries for this particular example, while the OPT-based attack requires 35k queries for the same. Figure 5 displays a comparison among different attacks in the targeted setting. In our experiments, the average distortion achieved by the white-box C&W attack on the MNIST dataset is 1.51, for which Sign-OPT requires 12k queries while others need more than 120k queries. We present a comparison of the success rates of different attacks on the CIFAR-10 dataset in Figure 6 for both the targeted and untargeted cases.

Figure 5: (a) Targeted attack: median distortion vs queries of different attacks on MNIST and CIFAR-10. (b) Comparing Sign-OPT and ZO-signSGD with and without the single query oracle (SQO).

4.4 THE POWER OF SINGLE QUERY ORACLE

In this subsection, we conduct several experiments to demonstrate the effectiveness of our proposed single query oracle in the hard-label adversarial attack setting. The ZO-signSGD algorithm (Liu et al., 2019) was proposed for soft-label black-box attack, and we extend it to the hard-label setting. A straightforward way is to simply apply ZO-signSGD to the hard-label objective proposed in Cheng et al. (2019), estimate the gradient using binary search as in (Cheng et al., 2019), and take its sign. In Figure 5(b), we clearly observe that simply combining ZO-signSGD and Cheng et al. (2019) is not efficient. With the proposed single-query sign oracle, we can also reduce the query count of this method, as demonstrated in Figure 5(b). This verifies the effectiveness of the single query oracle, which can universally improve many different optimization methods in the hard-label attack setting. We note that Sign-OPT still improves over ZO-signSGD with the single query oracle because, instead of directly taking the sign of the gradient estimate, our algorithm also utilizes the scale of the random directions u. In other words, signSGD's gradient estimate always has a fixed norm, while the norm of our gradient estimate takes into account the magnitude of u. Therefore, our Sign-OPT optimization algorithm is fundamentally different from (Liu et al., 2019) or any other proposed signSGD variant. Our method can be viewed as a new zeroth-order optimization algorithm that features the fast convergence of signSGD.
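For concreteness, the two update directions contrasted in this subsection can be written side by side as in the short sketch below (illustrative only; `signs[q]` is assumed to hold the oracle output sign(g(θ + ϵu_q) − g(θ)) and the rows of `U` the sampled directions u_q).

```python
import numpy as np

def sign_opt_direction(signs, U):
    # Sign-OPT (Eq. 4): the random vectors u_q are kept, so their magnitudes
    # contribute to the update and the estimate's norm varies.
    return (signs[:, None] * U).mean(axis=0)

def zo_signsgd_sqo_direction(signs, U):
    # ZO-signSGD combined with the single query oracle: only element-wise signs
    # survive, so the magnitude information in each u_q is discarded.
    return np.sign(signs[:, None] * U).mean(axis=0)
```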
Figure 6: Success rate vs queries for CIFAR-10 (L2 norm-based attack). The first two and last two panels depict untargeted and targeted attacks respectively. The success rate threshold is given at the top of each plot.

Table 1: L2 untargeted attack - comparison of average L2 distortion achieved using a given number of queries for different attacks. SR stands for success rate.

| Attack | MNIST #Queries | Avg L2 | SR (ϵ=1.5) | CIFAR-10 #Queries | Avg L2 | SR (ϵ=0.5) | ImageNet (ResNet-50) #Queries | Avg L2 | SR (ϵ=3.0) |
|---|---|---|---|---|---|---|---|---|---|
| Boundary attack | 4,000 | 4.24 | 1.0% | 4,000 | 3.12 | 2.3% | 4,000 | 209.63 | 0% |
| | 8,000 | 4.24 | 1.0% | 8,000 | 2.84 | 7.6% | 30,000 | 17.40 | 16.6% |
| | 14,000 | 2.13 | 16.3% | 12,000 | 0.78 | 29.2% | 160,000 | 4.62 | 41.6% |
| OPT attack | 4,000 | 3.65 | 3.0% | 4,000 | 0.77 | 37.0% | 4,000 | 83.85 | 2.0% |
| | 8,000 | 2.41 | 18.0% | 8,000 | 0.43 | 53.0% | 30,000 | 16.77 | 14.0% |
| | 14,000 | 1.76 | 36.0% | 12,000 | 0.33 | 61.0% | 160,000 | 4.27 | 34.0% |
| Guessing Smart | 4,000 | 1.74 | 41.0% | 4,000 | 0.29 | 75.0% | 4,000 | 16.69 | 12.0% |
| | 8,000 | 1.69 | 42.0% | 8,000 | 0.25 | 80.0% | 30,000 | 13.27 | 12.0% |
| | 14,000 | 1.68 | 43.0% | 12,000 | 0.24 | 80.0% | 160,000 | 12.88 | 12.0% |
| Sign-OPT attack | 4,000 | 1.54 | 46.0% | 4,000 | 0.26 | 73.0% | 4,000 | 23.19 | 8.0% |
| | 8,000 | 1.18 | 84.0% | 8,000 | 0.16 | 90.0% | 30,000 | 2.99 | 50.0% |
| | 14,000 | 1.09 | 94.0% | 12,000 | 0.13 | 95.0% | 160,000 | 1.21 | 90.0% |
| C&W (white-box) | - | 0.88 | 99.0% | - | 0.25 | 85.0% | - | 1.51 | 80.0% |

5 CONCLUSION

We developed a new and ultra query-efficient algorithm for adversarial attacks in the hard-label black-box setting. Using the same smooth reformulation as Cheng et al. (2019), we design a novel zeroth-order oracle that can compute the sign of the directional derivative of the attack objective using a single query. Equipped with this single-query oracle, we design a new optimization algorithm that dramatically reduces the number of queries compared with Cheng et al. (2019). We prove the convergence of the proposed algorithm and show that our new algorithm is overwhelmingly better than current hard-label black-box attacks.

ACKNOWLEDGEMENT

This work is based upon work supported by the Department of Energy National Energy Technology Laboratory under Award Number DE-OE0000911 and by NSF under IIS1719097.

REFERENCES

Abdullah Al-Dujaili and Una-May O'Reilly. There are no bit parts for sign bits in black-box attacks. arXiv preprint arXiv:1902.06894, 2019.

Moustafa Alzantot, Yash Sharma, Supriyo Chakraborty, Huan Zhang, Cho-Jui Hsieh, and Mani B. Srivastava. GenAttack: Practical black-box attacks with gradient-free optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1111-1119, 2019.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 560-569. PMLR, 2018. URL http://proceedings.mlr.press/v80/bernstein18a.html.

Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.

Thomas Brunner, Frederik Diehl, Michael Truong Le, and Alois Knoll. Guessing smart: Biased sampling for efficient black-box adversarial attacks. arXiv preprint arXiv:1812.09803, 2018.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39-57. IEEE, 2017.

Jianbo Chen, Michael I. Jordan, and Martin J. Wainwright. HopSkipJumpAttack: A query-efficient decision-based attack. arXiv preprint arXiv:1904.02144, 2019.

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15-26. ACM, 2017.

Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Minhao Cheng, Thong Le, Pin-Yu Chen, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. Query-efficient hard-label black-box attack: An optimization-based approach. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlk6iRqKX.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. IEEE, 2009.

John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pp. 2142-2151, 2018a.

Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv preprint arXiv:1807.07978, 2018b.

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3640-3649, 2018.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pp. 3727-3737. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7630-zeroth-order-stochastic-variance-reduction-for-nonconvex-optimization.pdf.

Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong. signSGD via zeroth-order oracle. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJe-DsC5Fm.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Sébastien Marcel and Yann Rodriguez. Torchvision: the machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pp. 1485-1488, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-933-6. doi: 10.1145/1873951.1874254. URL http://doi.acm.org/10.1145/1873951.1874254.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506-519. ACM, 2017.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In AAAI, 2019.

Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S. Dhillon, and Cho-Jui Hsieh. The limitations of adversarial training and the blind-spot attack. arXiv preprint arXiv:1901.04684, 2019.

A.1 COMPARISON WITH HOPSKIPJUMPATTACK

A recent paper (Chen et al., 2019) applied a zeroth-order sign oracle to improve the Boundary attack and also demonstrated significant improvement. The major differences from our algorithm are that we propose a new zeroth-order gradient descent algorithm, provide its algorithmic convergence guarantees, and aim to improve the query complexity of the attack formulation proposed in (Cheng et al., 2019). Notably, HopSkipJumpAttack only provides a bias and variance analysis (Theorems 2 and 3) without a convergence rate analysis. Also, HopSkipJumpAttack uses a one-point gradient estimate, compared to the two-point gradient estimate used by Sign-OPT. Therefore, although the estimate is unbiased, it has large variance: it achieves a successful attack faster but generates a worse adversarial example with larger distortion than ours. For completeness, we also compare with this method as follows. Figure 7 shows a comparison of Sign-OPT and HopSkipJumpAttack on the CIFAR-10 and MNIST datasets for the L2 norm-based attack.
We find in our experiments that the performance of both attacks is comparable in terms of queries consumed. In some cases, Sign-OPT converges to a better solution.

Figure 7: Comparison with HopSkipJumpAttack for CIFAR-10 and MNIST: median distortion vs queries. (U) denotes the untargeted attack and (T) the targeted attack.

A.2 PROOF OF THEOREM 3.1

Define the following notation:

    ∇̂g(θt; uq) := sign(g(θt + ϵuq) − g(θt)) uq,
    ∇̄g(θt; uq) := (1/ϵ)(g(θt + ϵuq) − g(θt)) uq,
    ∇̃g(θt; uq) := sign( (1/ϵ)(g(θt + ϵuq) − g(θt)) uq ).

Thus we can write the corresponding gradient estimates as follows:

    ĝt := (1/Q) Σ_{q=1}^{Q} sign(g(θt + ϵuq) − g(θt)) uq = (1/Q) Σ_{q=1}^{Q} ∇̂g(θt; uq),
    ḡt := (1/Q) Σ_{q=1}^{Q} (1/ϵ)(g(θt + ϵuq) − g(θt)) uq = (1/Q) Σ_{q=1}^{Q} ∇̄g(θt; uq),
    g̃t := (1/Q) Σ_{q=1}^{Q} sign( (1/ϵ)(g(θt + ϵuq) − g(θt)) uq ) = (1/Q) Σ_{q=1}^{Q} ∇̃g(θt; uq).

Clearly, we have ∇̃g(θt; uq) = sign(∇̄g(θt; uq)), and we can relate ∇̃g(θt; uq) and ∇̂g(θt; uq) by writing ∇̂g(θt; uq) = Gq ⊙ ∇̃g(θt; uq), where Gq ∈ R^d is the element-wise absolute value of the vector uq (i.e., Gq = (|uq,1|, |uq,2|, ..., |uq,d|)ᵀ).

Note that the zeroth-order gradient estimate ∇̄g(θt; uq) is a biased approximation of the true gradient of g. Instead, it is an unbiased estimate of the gradient of the randomized smoothing function gϵ(θ) = E_u[g(θ + ϵu)] (Duchi et al., 2012). Our analysis is based on the following two assumptions:

Assumption 1. The function g is L-smooth with a finite value of L.

Assumption 2. At any iteration step t, the gradient of g is upper bounded by ‖∇g(θt)‖2 ≤ σ.

To prove the convergence of the proposed method, we need information on the variance of the update ∇̄g(θt; uq). Here, we introduce a lemma from previous work.

Lemma 1. The variance of the zeroth-order gradient estimate ∇̄g(θt; uq) is upper bounded by

    E[ ‖∇̄g(θt; uq) − ∇gϵ(θt)‖²₂ ] ≤ 4(Q + 1)σ² + 2C(d, ϵ),   where C(d, ϵ) := 2dσ² + ϵ²L²d²/2.

Proof of Lemma 1. The lemma can be proved using Proposition 2 in Liu et al. (2019) with b = 1 and q = Q. When b = 1 there is no difference between sampling with and without replacement, and we use the with-replacement case to obtain the above bound. By taking Q = 1, we know that E[‖∇̄g(θt; uq) − ∇gϵ(θt)‖²₂] is upper bounded. By Jensen's inequality, we also know that

    E[ |(∇̄g(θt; uq) − ∇gϵ(θt))_l| ] ≤ √( E[ ((∇̄g(θt; uq) − ∇gϵ(θt))_l)² ] ) := δ_l,   (6)

where δ_l denotes the upper bound on the l-th coordinate of E[|∇̄g(θt; uq) − ∇gϵ(θt)|], and δ_l is finite since E[‖∇̄g(θt; uq) − ∇gϵ(θt)‖²₂] is upper bounded.

Next, we bound Prob[sign((ĝt)_l) ≠ sign((∇gϵ(θt))_l)] with the following lemma.

Lemma 2.  |(∇gϵ(θt))_l| · Prob[sign((ĝt)_l) ≠ sign((∇gϵ(θt))_l)] ≤ δ_l / √Q.

Proof of Lemma 2. Similar to Bernstein et al. (2018), we first relax Prob[sign((∇̄g(θt; uq))_l) ≠ sign((∇gϵ(θt))_l)] by Markov's inequality:

    Prob[sign((∇̄g(θt; uq))_l) ≠ sign((∇gϵ(θt))_l)] ≤ Prob[ |(∇̄g(θt; uq) − ∇gϵ(θt))_l| ≥ |(∇gϵ(θt))_l| ]
        ≤ E[ |(∇̄g(θt; uq) − ∇gϵ(θt))_l| ] / |(∇gϵ(θt))_l| ≤ δ_l / |(∇gϵ(θt))_l|,

where the last inequality comes from eq. (6). Recall that (∇̄g(θt; uq))_l is an unbiased estimate of (∇gϵ(θt))_l. Under the assumption that the noise distribution is unimodal and symmetric, Lemma D1 of Bernstein et al. (2018) gives

    Prob[sign((∇̄g(θt; uq))_l) ≠ sign((∇gϵ(θt))_l)] := M ≤ { 2/(9S²), if S > 2/√3;  1/2 − S/(2√3), otherwise } < 1/2,

where S := |(∇gϵ(θt))_l| / δ_l. Note that this probability bound applies uniformly to all q ≤ Q regardless of the magnitude |(uq)_l|. That is,

    Prob[ sign( Σ_{q=1}^{Q} |(uq)_l| sign((∇̄g(θt; uq))_l) ) ≠ sign((∇gϵ(θt))_l) ] = Prob[ sign( Σ_{q=1}^{Q} sign((∇̄g(θt; uq))_l) ) ≠ sign((∇gϵ(θt))_l) ].   (7)
This is true because, when all |(uq)_l| = 1, Prob[sign(Σ_{q=1}^{Q} sign((∇̄g(θt; uq))_l)) ≠ sign((∇gϵ(θt))_l)] is the probability that a majority vote over the Q estimates yields the wrong sign, i.e., the sum of Q Bernoulli trials (a binomial distribution) with error rate M. Since the error probability M is independent of the sampling of |(uq)_l|, computing Prob[sign(Σ_{q=1}^{Q} |(uq)_l| sign((∇̄g(θt; uq))_l)) ≠ sign((∇gϵ(θt))_l)] can be thought of as running the Q Bernoulli experiments and then independently drawing a weight from the unit interval for each of the Q experiments. Since the weights are uniform, the expected weight on the correct counts and on the incorrect counts is the same (equal to 1/2). Therefore, the probability Prob[sign(Σ_{q=1}^{Q} |(uq)_l| sign((∇̄g(θt; uq))_l)) ≠ sign((∇gϵ(θt))_l)] is the same as for the original unweighted binomial distribution, which establishes eq. (7).

Notice that, by our notation, sign((∇̄g(θt; uq))_l) = (∇̃g(θt; uq))_l, and thus (1/Q) Σ_{q=1}^{Q} sign((∇̄g(θt; uq))_l) = (g̃t)_l. Let Z count the number of estimates (∇̄g(θt; uq))_l yielding the correct sign of (∇gϵ(θt))_l. The probability in eq. (7) can then be written as

    Prob[sign((ĝt)_l) ≠ sign((∇gϵ(θt))_l)] = P[Z ≤ Q/2].

Following the derivation of Theorem 2b in Bernstein et al. (2018), we get

    |(∇gϵ(θt))_l| · Prob[sign((ĝt)_l) ≠ sign((∇gϵ(θt))_l)] ≤ δ_l / √Q.   (8)

We also need a few more lemmas on the properties of the function g.

Lemma 3.  gϵ(θ1) − gϵ(θT) ≤ gϵ(θ1) − g* + ϵ²L, where g* := min_θ g(θ).

Proof of Lemma 3. The proof can be found in Lemma C of Liu et al. (2018).

Lemma 4.  E[‖∇g(θ)‖2] ≤ 2 E[‖∇gϵ(θ)‖2] + ϵLd/2.

Proof of Lemma 4. The proof can be found in Liu et al. (2019).

Theorem 1. Suppose that the conditions in the assumptions hold, and the distribution of the gradient noise is unimodal and symmetric. Then, the Sign-OPT attack with learning rate ηt = O(1/(Q√(dT))) gives the following bound on E[‖∇g(θ)‖2]:

    E[‖∇g(θ)‖2] = O(√d/√T + √d/√Q).

Proof of Theorem 1. From the L-smoothness assumption we have

    gϵ(θ_{t+1}) ≤ gϵ(θt) + ⟨∇gϵ(θt), θ_{t+1} − θt⟩ + (L/2)‖θ_{t+1} − θt‖²₂
               = gϵ(θt) − ηt ⟨∇gϵ(θt), ĝt⟩ + (L/2) ηt² ‖ĝt‖²₂
               ≤ gϵ(θt) − ηt ‖Ḡt ⊙ ∇gϵ(θt)‖₁ + (dL/2) ηt² ‖Ḡt‖² + 2 ηt Σ_{l=1}^{d} |(Ḡt)_l| |(∇gϵ(θt))_l| Prob[sign((ĝt)_l) ≠ sign((∇gϵ(θt))_l)],

where Ḡt is defined by (Ḡt)_l = Σ_{q=1}^{Q} (Gq)_l (∇̃g(θt; uq))_l = Σ_{q=1}^{Q} |(uq)_l| (∇̃g(θt; uq))_l. Applying eq. (8) and then eq. (6) to the last term yields

    gϵ(θ_{t+1}) ≤ gϵ(θt) − ηt ‖Ḡt ⊙ ∇gϵ(θt)‖₁ + (dL/2) ηt² ‖Ḡt‖² + 2 ηt ‖Ḡt‖ √( E[ ((∇̄g(θt; uq) − ∇gϵ(θt))_l)² ] ) / √Q.

Thus we have

    ηt ‖Ḡt ⊙ ∇gϵ(θt)‖₁ ≤ gϵ(θt) − gϵ(θ_{t+1}) + (dL/2) ηt² ‖Ḡt‖² + 2 ηt ‖Ḡt‖ √( E[ ((∇̄g(θt; uq) − ∇gϵ(θt))_l)² ] ) / √Q,

and hence, defining η̂t := ηt ‖Ḡt‖,

    η̂t ‖∇gϵ(θt)‖₁ ≤ gϵ(θt) − gϵ(θ_{t+1}) + (dL/2) η̂t² + 2 η̂t √( E[ ((∇̄g(θt; uq) − ∇gϵ(θt))_l)² ] ) / √Q.

Summing these inequalities over all t, taking expectations on both sides, and applying Lemma 1, we get

    Σ_{t=1}^{T} η̂t E[‖∇gϵ(θt)‖₁] ≤ E[gϵ(θ1) − gϵ(θT)] + (dL/2) Σ_{t=1}^{T} η̂t² + 2 Σ_{t=1}^{T} η̂t √( (4(Q + 1)σ² + 2C(d, ϵ)) / Q ).

Substituting Lemma 3 into the above inequality gives

    Σ_{t=1}^{T} η̂t E[‖∇gϵ(θt)‖₁] ≤ gϵ(θ1) − g* + ϵ²L + (dL/2) Σ_{t=1}^{T} η̂t² + 2 Σ_{t=1}^{T} η̂t √( (4(Q + 1)σ² + 2C(d, ϵ)) / Q ).

Since ‖·‖2 ≤ ‖·‖1, we can divide both sides by Σ_{t=1}^{T} η̂t to get

    Σ_{t=1}^{T} (η̂t / Σ_{s=1}^{T} η̂s) E[‖∇gϵ(θt)‖2] ≤ (gϵ(θ1) − g* + ϵ²L) / Σ_{t=1}^{T} η̂t + (dL/2) (Σ_{t=1}^{T} η̂t²) / (Σ_{t=1}^{T} η̂t) + 2 √( (4(Q + 1)σ² + 2C(d, ϵ)) / Q ).
Define a new random variable R with probability P(R = t) = η̂t / Σ_{s=1}^{T} η̂s; then

    E[‖∇gϵ(θR)‖2] = E[ E_R[‖∇gϵ(θR)‖2] ] = E[ Σ_{t=1}^{T} P(R = t) ‖∇gϵ(θt)‖2 ].

Substituting all the quantities into Lemma 4, we get

    E[‖∇g(θR)‖2] ≤ 2(gϵ(θ1) − g* + ϵ²L) / Σ_{t=1}^{T} η̂t + dL (Σ_{t=1}^{T} η̂t²) / (Σ_{t=1}^{T} η̂t) + ϵLd/2 + 4 √( (4(Q + 1)σ² + 2C(d, ϵ)) / Q ).

By choosing ϵ = O(1/(d√T)) and ηt = O(1/(Q√(dT))), the convergence rate stated in Theorem 1 follows.