# Adversarial Vulnerability of Randomized Ensembles

Hassan Dbouk 1, Naresh R. Shanbhag 1

1 Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, USA. Correspondence to: Hassan Dbouk.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Despite the tremendous success of deep neural networks across various tasks, their vulnerability to imperceptible adversarial perturbations has hindered their deployment in the real world. Recently, works on randomized ensembles have empirically demonstrated significant improvements in adversarial robustness over standard adversarially trained (AT) models with minimal computational overhead, making them a promising solution for safety-critical, resource-constrained applications. However, this impressive performance raises the question: are these robustness gains provided by randomized ensembles real? In this work we address this question both theoretically and empirically. We first establish theoretically that commonly employed robustness evaluation methods such as adaptive PGD provide a false sense of security in this setting. Subsequently, we propose a theoretically-sound and efficient adversarial attack algorithm (ARC) capable of compromising randomized ensembles even in cases where adaptive PGD fails to do so. We conduct comprehensive experiments across a variety of network architectures, training schemes, datasets, and norms to support our claims, and empirically establish that randomized ensembles are in fact more vulnerable to ℓp-bounded adversarial perturbations than even standard AT models. Our code can be found at https://github.com/hsndbk4/ARC.

Figure 1. The adversarial vulnerability of randomized ensembles across various network architectures (from left to right: ResNet-20, MobileNetV1, VGG-16, ResNet-18, and WideResNet-28-4) against ℓ∞ norm-bounded perturbations on CIFAR-10. Compared to a standard adversarially trained (AT) single model, a randomized ensemble of two models obtained via BAT (Pinot et al., 2020) offers significant improvement in robustness at iso-FLOPs, when evaluated using an adaptive PGD adversary. However, the robust accuracy of the BAT ensemble suffers a massive drop and becomes worse than even the single-model AT baseline when evaluated using the proposed ARC adversary, rendering the ensemble obsolete.

## 1. Introduction

Deep neural networks (DNNs) are ubiquitous today, cementing themselves as the de facto learning model for various machine learning tasks (Sarwar et al., 2001; He et al., 2016; Girshick, 2015; Farhadi & Redmon, 2018; Devlin et al., 2018; Zhang et al., 2017). Yet, their vulnerability to imperceptible adversarial perturbations (Szegedy et al., 2013; Goodfellow et al., 2014; Ilyas et al., 2019) has caused major concerns regarding their security and hindered their deployment in safety-critical applications. Various attack algorithms for crafting such perturbations have been proposed in the literature (Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017); the majority of them rely on moving the input along the gradient to maximize a certain loss function.
Powerful attacks, such as projected gradient descent (PGD) (Madry et al., 2018), perform this process iteratively until a perturbation is found, and can successfully fool undefended DNNs, causing them to misclassify every sample in the test set. In order to robustify DNNs against adversarial perturbations, several defense strategies have been proposed (Cisse et al., 2017; Pang et al., 2019; Yang et al., 2019; Zhang et al., 2019; Madry et al., 2018). While some of them were subsequently broken with more powerful attacks (adaptive attacks) (Tramèr et al., 2020; Athalye et al., 2018), adversarial training (AT) (Goodfellow et al., 2014; Zhang et al., 2019; Madry et al., 2018) remains the strongest empirical defense that has thus far stood the test of time.

Recently, the work of (Pinot et al., 2020) showed, from a game-theoretic point of view, that the robustness of any deterministic classifier can be outperformed by a randomized one. Specifically, (Pinot et al., 2020) proposed constructing randomized classifiers by randomly sampling from an ensemble of classifiers trained via boosted adversarial training (BAT), and empirically demonstrated impressive (≈15%) improvements in robust accuracy over single-model standard AT classifiers (see Fig. 1). Randomized ensembles also present a promising way of achieving high robustness with limited compute resources, since the number of FLOPs per inference is fixed¹, a problem that has garnered much interest recently (Dbouk & Shanbhag, 2021; Sehwag et al., 2020; Guo et al., 2020).

¹Assuming all the models in the ensemble share the same architecture.

However, this recent success of randomized ensembles raises the following question: are the robustness gains provided by randomized ensembles real? In this work, we tackle this question both theoretically and empirically. Specifically, we make the following contributions:

- We prove that standard attack algorithms such as PGD suffer from a fundamental flaw when employed to attack randomized ensembles of linear classifiers: there is no guarantee that they will find an ℓp-bounded adversarial perturbation even when one exists (inconsistency).
- We propose a provably (for linear classifiers) consistent and efficient adversarial perturbation algorithm, the Attacking Randomized ensembles of Classifiers (ARC) algorithm, tailored to evaluating the robustness of randomized ensembles against ℓp-bounded perturbations.
- We employ the ARC algorithm to demonstrate empirically that randomized ensembles of DNNs are in fact more vulnerable to ℓp-bounded perturbations than standard AT DNNs (see Fig. 1).
- We conduct comprehensive experiments across a variety of network architectures, training schemes, datasets, and norms to support our observations.

Our work suggests the need for improved adversarial training algorithms for randomized ensembles.

## 2. Background and Related Work

Adaptive Attacks: Benchmarking new defenses against adaptive attacks, i.e., attacks carefully designed to target a given defense, has become standard practice today, thanks to the works of (Athalye et al., 2018; Tramèr et al., 2020). The authors of (Athalye et al., 2018) identify the phenomenon of obfuscated gradients, a type of gradient masking, that leads to a false sense of security in defenses against adversarial examples.
By identifying and addressing different types of obfuscated gradients, such as stochastic gradients due to randomization, they were able to circumvent and fool most defenses proposed at the time. More recently, (Tramèr et al., 2020) identified and circumvented thirteen defenses from top venues (NeurIPS, ICML, and ICLR), despite most employing adaptive attacks, by customizing these attacks to the specific defense. Therefore, when evaluating a new defense, it is imperative not only to re-purpose existing adaptive attacks that circumvented a prior defense, but also to adapt these attacks to target the new defense. Our work falls in the category of adaptive attacks for randomized ensembles. Specifically, we identify an unforeseen vulnerability of randomized ensembles by demonstrating the shortcomings of commonly used PGD-based attacks and then proposing an adaptive attack that successfully circumvents them.

Ensembles and Robustness: Traditionally, ensembling techniques such as bagging (Breiman, 1996) and boosting (Dietterich, 2000; Freund & Schapire, 1997) have been a popular means of improving the performance of machine learning models. Building on that success, and in order to address the apparent vulnerability of deep nets, a recent line of work (Kariyappa & Qureshi, 2019; Pang et al., 2019; Sen et al., 2019; Yang et al., 2020; 2021) has proposed the use of DNN ensembles. The intuition is that promoting some form of diversity between the different members of the ensemble would alleviate the adversarial transferability that DNNs exhibit and, as a byproduct, improve overall robustness. Specifically, GAL (Kariyappa & Qureshi, 2019) proposes maximizing the cosine distance between the members' gradients, whereas EMPIR (Sen et al., 2019) leverages extreme model quantization to enforce ensemble diversity. ADP (Pang et al., 2019) proposes a training regularizer that promotes high diversity in the non-maximal predictions of the different members. These earlier works have been subsequently broken by adaptive and more powerful attacks (Tramèr et al., 2020). More recently, DVERGE (Yang et al., 2020) presents a robust ensemble training approach that diversifies the non-robust features of the different members via an adversarial training objective function. TRS (Yang et al., 2021), on the other hand, provides a theoretical framework for understanding adversarial transferability and establishes bounds showing that model smoothness and input gradient diversity are both required for low transferability. The authors also propose an ensemble training algorithm to empirically reduce transferability. While none of these works deal with randomized ensembles explicitly, their proposed techniques can be seamlessly adapted to that setting, where our work shows that existing powerful attacks such as PGD can provide a false sense of robustness.

Randomization and Robustness: Randomization has been shown, both theoretically and empirically, to have a very strong connection with adversarial robustness. Randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) proves a tight robustness guarantee in the ℓ2 norm for classifiers smoothed with Gaussian noise. Prior to (Lecuyer et al., 2019), no certified defense had been shown to be feasible on ImageNet. In the same spirit, SNAP (Patil et al., 2021) enhances the robustness of single-attack ℓ∞ AT frameworks against the union of perturbation types via shaped noise augmentation.
Recently, the work of (Pinot et al., 2020) showed, from a game-theoretic point of view, that the robustness of any deterministic classifier can be outperformed by a randomized one when evaluated against deterministic attack strategies. A follow-up work by (Meunier et al., 2021) removes any assumptions on the attack strategy and further motivates the use of randomization to improve robustness. Building on their framework, the authors of (Pinot et al., 2020) propose training randomized ensembles of DNNs using their BAT algorithm and empirically demonstrate impressive robustness gains over individual adversarially trained DNNs on CIFAR-10, when evaluated against a strong PGD adversary. However, our work demonstrates that these robustness gains give a false sense of security, and we propose a new attack algorithm better suited for randomized ensembles.

## 3. Preliminaries

### 3.1. Problem Setup

Let F = {f1, f2, ..., fM} be a collection of M ∈ ℕ arbitrary C-ary differentiable classifiers fi : ℝ^D → ℝ^C. A classifier fi assigns the label m ∈ [C] to a data-point x ∈ ℝ^D if and only if [fi(x)]m > [fi(x)]j ∀ j ∈ [C] \ {m}, where [fi(x)]j is the jth component of fi(x). A randomized ensemble classifier (REC) g is defined over the tuple (F, α) via Pr{g(x) = fd(x)} = αd, where d ∈ [M] is sampled independently of x. This independence assumption is crucial, as the choice of the sampled classifier cannot be tampered with by any adversary. For any labeled data-point (x, y) ∈ ℝ^D × [C], define L as the expected classification accuracy of g:

$$L(x, y, \alpha) = \mathbb{E}_d\left[\mathbb{1}\Big\{\arg\max_{j \in [C]} [f_d(x)]_j = y\Big\}\right] = \sum_{i=1}^{M} \alpha_i \, \mathbb{1}\Big\{\arg\max_{j \in [C]} [f_i(x)]_j = y\Big\} \tag{1}$$

where 1{·} is the indicator function. Note that we have L(x, y, α) ≤ 1, with equality if and only if all the classifiers in F correctly classify x. Similarly, L(x, y, α) ≥ 0, with equality if and only if all the classifiers misclassify x. We always assume that αi > 0; otherwise we can remove fi from the ensemble as it is not used.

Definition 3.1. A perturbation δ ∈ ℝ^D is called adversarial to (F, α) for the labeled data-point (x, y) if:

$$L(x + \delta, y, \alpha) < L(x, y, \alpha) \tag{2}$$

That is, the mean classification accuracy of the randomized ensemble strictly decreases when x is perturbed by δ. To that end, an adversary's objective is to search for a norm-bounded adversarial perturbation δ* such that:

$$\delta^* = \underset{\delta: \|\delta\|_p \leq \epsilon}{\arg\min} \; L(x + \delta, y, \alpha) \tag{3}$$

Let A be an adversarial perturbation generation algorithm that takes as input an REC (F, α), a labeled data-point (x, y), and a norm bound ϵ, and generates a perturbation δA with ‖δA‖p ≤ ϵ while attempting to solve the optimization problem in (3).

Definition 3.2. Given an REC (F, α), a labeled data-point (x, y), and a norm bound ϵ: an adversarial perturbation generation algorithm A is said to be consistent if it finds a norm-bounded adversarial δA whenever L(x, y, α) = 1 and a norm-bounded adversarial δ exists.

Definition 3.2 implies that a consistent algorithm will always find an adversarial perturbation, if one exists, when all the classifiers in the ensemble correctly classify x. It does not imply that it will find the optimal adversarial perturbation δ*. Conversely, an inconsistent adversarial algorithm can fail to find a norm-bounded adversarial perturbation under these conditions and thus provide an overly optimistic estimate of adversarial robustness. Note that the condition L(x, y, α) = 1 is not too restrictive, since DNNs typically exhibit very high accuracies on clean samples.
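To make the REC notation concrete, the following is a minimal NumPy sketch (ours, not from the paper's code release) that evaluates the expected accuracy of Eq. (1) for a toy ensemble of linear classifiers and checks Definition 3.1 for a candidate perturbation; the helper names `expected_accuracy` and `is_adversarial` are illustrative.

```python
import numpy as np

def expected_accuracy(ensemble, alpha, x, y):
    """Expected classification accuracy L(x, y, alpha) of an REC, per Eq. (1).

    ensemble: list of callables f_i(x) -> logits of shape (C,)
    alpha: sampling probabilities of shape (M,), summing to 1
    """
    correct = np.array([np.argmax(f(x)) == y for f in ensemble], dtype=float)
    return float(np.dot(alpha, correct))

def is_adversarial(ensemble, alpha, x, y, delta):
    """Definition 3.1: delta is adversarial iff it strictly decreases L."""
    return expected_accuracy(ensemble, alpha, x + delta, y) < expected_accuracy(ensemble, alpha, x, y)

# Toy example: an REC of two linear 3-class classifiers on R^2.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
ensemble = [lambda u, W=W1: W @ u, lambda u, W=W2: W @ u]
alpha = np.array([0.5, 0.5])

x, y = np.array([1.0, -0.5]), 0
delta = 0.1 * rng.standard_normal(2)
print(expected_accuracy(ensemble, alpha, x, y), is_adversarial(ensemble, alpha, x, y, delta))
```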
### 3.2. Projected Gradient Descent

Optimization-based attacks, such as PGD (Madry et al., 2018; Maini et al., 2020), search for adversarial perturbations around the data sample (x, y) for some classifier f and loss function l by solving the following maximization:

$$\delta^* = \underset{\delta: \|\delta\|_p \leq \epsilon}{\arg\max} \; l\big(f(x + \delta), y\big) \tag{4}$$

in an iterative fashion ∀ k ∈ [K]:

$$\delta^{(k)} = \Pi_\epsilon^p\Big(\delta^{(k-1)} + \eta\, \mu_p\big(\nabla_{x} l(x + \delta^{(k-1)}, y)\big)\Big) \tag{5}$$

where δ^(0) is typically a random initial guess inside the ℓp ball of radius ϵ, η is the step size, µp computes the unit-norm ℓp steepest direction, and Π^p_ϵ computes the projection onto the ℓp ball of radius ϵ. A typical choice of loss function is the cross-entropy loss (Madry et al., 2018). To simplify notation, we will use l(x + δ, y) instead of l(f(x + δ), y). For instance, when p = ∞, the update equation in (5) reduces to:

$$\delta^{(k)} = \mathrm{clip}\big(\delta^{(k-1)} + \eta\, \mathrm{sgn}(g), -\epsilon, \epsilon\big) \tag{6}$$

where g is the gradient of the loss as in (5).

Adaptive PGD for RECs: As pointed out by (Athalye et al., 2018; Tramèr et al., 2020), randomized defenses require caution when evaluating robustness. In particular, one should use the expectation over transformation (EOT) method to avoid gradient masking. The discrete nature of our setup allows for exact computation of expectations and eliminates the need for Monte Carlo sampling (Pinot et al., 2020). Therefore, in order to solve the optimization in (3) and evaluate the robustness of randomized ensembles, (Pinot et al., 2020) adapted the PGD update rule (5) by replacing the loss function with its expected value E[l(x + δ, y)], which can be computed efficiently. Interestingly, we find that it is in fact this adaptation of PGD that leads to an overly optimistic estimate of the robustness of randomized ensembles, a point we address in the next section.

## 4. Limitations of Adaptive PGD

We analyze the nature of adversarial perturbations obtained via adaptive PGD (APGD) for the special case of an REC of binary linear classifiers (BLCs). In doing so, we unveil the reason underlying APGD's weakness in attacking RECs. Assume an ensemble (F, α) of M BLCs:

$$f_i(x) = w_i^{\mathrm{T}} x + b_i \quad \forall i \in [M] \tag{7}$$

where wi ∈ ℝ^D and bi ∈ ℝ are the weight and bias, respectively, of the ith BLC. The BLC fi assigns label 1 to x if and only if fi(x) > 0, and −1 otherwise². The associated REC is given by Pr{g(x) = fd(x)} = αd, where d ∈ [M]. We direct the reader to Appendix A for detailed derivations and proofs of the results in this section.

²For notational convenience, we assume y ∈ {−1, 1} for binary classification.

Figure 2. Illustration of an auxiliary classifier f̃ for a randomized ensemble of two BLCs f1 and f2 with equiprobable sampling. The APGD output δ when attacking f1 and f2 is equivalent to the output of PGD when attacking f̃.

### 4.1. Attacking the Auxiliary Classifier

The APGD update rule employs the expected loss E[l(x + δ, y)] in (5), where l is the binary cross-entropy loss. Now consider the gradient g of the expected loss. By linearity of expectation, we have:

$$g = \nabla_x\, \mathbb{E}[l(x + \delta, y)] = -y \sum_{i=1}^{M} \alpha_i \lambda_i w_i \tag{8}$$

where the λi's belong to the open interval (0, 1). Examining the expression in (8), we observe that the direction of g is determined purely by a linear combination of the classifiers' weight vectors. With that in mind, if we define an auxiliary classifier f̃ such that:

$$\tilde{f}(x) = \Big(\sum_{i=1}^{M} \alpha_i \lambda_i w_i\Big)^{\mathrm{T}} x + \tilde{b} \tag{9}$$

for some arbitrary b̃ ∈ ℝ, then it is easy to see that the APGD attack is actually equivalent to the standard PGD attack evaluated against the auxiliary classifier f̃.
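The auxiliary-classifier observation in Eq. (8) is easy to verify numerically. The sketch below (ours, assuming the standard sigmoid-based binary cross-entropy, consistent with the derivation in Appendix A) computes the exact APGD gradient of the expected loss for a toy REC of BLCs and confirms that it coincides with the gradient induced by a single linear classifier whose weight is the conic combination of the members' weights.

```python
import numpy as np

def bce_grad(w, b, x, y):
    """Gradient of the binary cross-entropy loss l(f(x), y) w.r.t. x
    for a binary linear classifier f(x) = w^T x + b with y in {-1, +1}.
    Analytically this equals -y * lambda * w with lambda in (0, 1)."""
    z = 1.0 / (1.0 + np.exp(-(w @ x + b)))          # P(label = +1 | x)
    lam = z if y == -1 else 1.0 - z                  # lambda in (0, 1)
    return -y * lam * w

rng = np.random.default_rng(1)
M, D = 3, 5
W = rng.standard_normal((M, D))
b = rng.standard_normal(M)
alpha = np.array([0.5, 0.3, 0.2])
x, y = rng.standard_normal(D), 1

# Gradient of the expected loss used by APGD (exact, since M is finite).
g = sum(alpha[i] * bce_grad(W[i], b[i], x, y) for i in range(M))

# Same gradient as a single "auxiliary" linear classifier whose weight
# is the conic combination sum_i alpha_i * lambda_i * w_i, as in Eq. (8).
lam = np.array([1.0 / (1.0 + np.exp(y * (W[i] @ x + b[i]))) for i in range(M)])
w_aux = (alpha * lam) @ W
print(np.allclose(g, -y * w_aux))   # True: APGD effectively "sees" only the auxiliary classifier
```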
More generally, we have the following result:

Theorem 4.1 (Auxiliary Classifiers). For any REC (F, α) consisting of BLCs and any data-point (x, y), there exists a sequence of auxiliary BLCs {f̃^(k)}, k = 1, ..., K, such that in the kth iteration, the output of APGD against the REC is equal to the output of PGD against f̃^(k).

### 4.2. Inconsistency of Adaptive PGD

Theorem 4.1 highlights a fundamental flaw underlying APGD when applied to RECs: the APGD iterates target an auxiliary classifier that may or may not exist in the REC being attacked. For instance, Fig. 2 illustrates the case of an M = 2 REC. Assuming equiprobable sampling, clearly choosing either δ1 or δ2 will fool the ensemble half the time on average. However, due to the choice of norm radius, the APGD-generated perturbation cannot fool either classifier, and thus gives an incorrect and optimistic estimate of the REC's robustness.

Another implication of Theorem 4.1 is that APGD might prematurely converge to a fixed-point, that is, δ^(k) = δ^(k−1), which can happen when the auxiliary classifier becomes ill-defined and reduces to f̃^(k)(x) = b̃. Furthermore, these fundamental flaws in APGD can be exploited by ensemble training techniques to artificially boost the robustness of the randomized ensemble. In fact, prime examples of this phenomenon include adapting standard ensembling techniques (Kariyappa & Qureshi, 2019; Yang et al., 2021) that promote input gradient dissimilarity to the randomized setting, something we explore in Section 6. One final implication of Theorem 4.1 is the following result regarding APGD:

Theorem 4.2 (Inconsistency). The APGD algorithm for randomized ensembles of classifiers is inconsistent.

Theorem 4.2 provides further theoretical justification of the shortcomings of APGD when attacking randomized ensembles. The inconsistency of APGD leads to an overly optimistic robustness evaluation in the case of DNNs (see Fig. 1).

## 5. The ARC Algorithm

To properly assess the robustness of randomized ensembles, and to avoid the pitfalls of APGD seen in Section 4, we present a novel attack algorithm for solving (3), namely the Attacking Randomized ensembles of Classifiers (ARC) algorithm.

### 5.1. Binary Linear Classifiers

Analogous to the APGD analysis in Section 4, we first present a simpler version of ARC (Algorithm 1) tailored to randomized ensembles of BLCs. In this setting, the ARC algorithm iterates over all members of the ensemble once in a greedy fashion. At iteration i ∈ [M], a candidate perturbation δ̂ is obtained by updating the perturbation δ from the previous iteration as follows:

$$\hat{\delta} = \gamma(\delta + \beta g) \tag{10}$$

where γ > 0 is chosen such that ‖δ̂‖p = ϵ, g is the unit ℓp-norm optimal perturbation (Melachrinoudis, 1997) for fooling classifier fi, and β is adaptively chosen based on lines 10-13 of Algorithm 1 to guarantee that δ̂ can fool fi, irrespective of δ, as long as the shortest ℓp distance between fi and x is less than the norm radius ϵ. Subsequently, the ARC algorithm updates its iterate δ with δ̂ as long as the average classification accuracy satisfies:

$$L(x + \hat{\delta}, y, \alpha) \leq L(x + \delta, y, \alpha) \tag{11}$$

that is, δ̂ does not increase the robust accuracy. Note that ρ > 0 in line 13 is an arbitrarily small positive constant to avoid boundary conditions. In our implementation, we set ρ = 0.05ϵ.

Algorithm 1: The ARC Algorithm for BLCs
1: Input: REC (F, α), labeled data-point (x, y), norm p, and radius ϵ.
2: Output: Adversarial perturbation δ such that ‖δ‖p ≤ ϵ.
3: Initialize δ ← 0, v ← L(x, y, α), q ← p/(p − 1).
4: Define the ordered index set I such that αi ≥ αj ∀ i, j ∈ I with i ≤ j (members are visited in decreasing order of sampling probability).
5: for i ∈ I do
6: // optimal unit ℓp-norm adversarial direction for fi
7: g ← −y |wi|^(q−1) ⊙ sgn(wi) / ‖wi‖q^(q−1)
8: // shortest ℓp distance between x and fi
9: ζ ← |fi(x)| / ‖wi‖q
10: if ζ ≥ ϵ or i = 1 then
11: β ← ϵ
12: else
13: β ← (ϵ/(ϵ − ζ)) (|y wiᵀδ| / ‖wi‖q + ζ) + ρ
14: end if
15: δ̂ ← ϵ (δ + βg) / ‖δ + βg‖p // candidate δ̂ such that ‖δ̂‖p = ϵ
16: v̂ ← L(x + δ̂, y, α)
17: // if robustness does not increase, update δ
18: if v̂ ≤ v then
19: δ ← δ̂, v ← v̂
20: end if
21: end for
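For concreteness, here is a compact NumPy rendering of Algorithm 1, specialized to p = q = 2 to keep the dual-norm expressions simple. It mirrors the pseudocode above but is only an illustrative sketch, not the authors' released implementation; the function name `arc_blc` is ours.

```python
import numpy as np

def arc_blc(W, b, alpha, x, y, eps, rho_frac=0.05):
    """Sketch of ARC for an REC of binary linear classifiers (Algorithm 1),
    specialized to the l2 norm (p = q = 2). y is in {-1, +1}.
    W: (M, D) weights, b: (M,) biases, alpha: (M,) sampling probabilities."""
    def acc(delta):
        # expected classification accuracy L(x + delta, y, alpha), Eq. (1)
        return float(np.dot(alpha, (y * (W @ (x + delta) + b) > 0).astype(float)))

    order = np.argsort(-alpha)                 # visit members in decreasing order of alpha
    delta, v = np.zeros_like(x), acc(np.zeros_like(x))
    rho = rho_frac * eps
    for t, i in enumerate(order):
        w_norm = np.linalg.norm(W[i])
        g = -y * W[i] / w_norm                 # unit l2 direction towards f_i's boundary
        zeta = abs(W[i] @ x + b[i]) / w_norm   # shortest l2 distance from x to f_i
        if zeta >= eps or t == 0:              # t == 0 plays the role of "i = 1" in line 10
            beta = eps
        else:
            beta = eps / (eps - zeta) * (abs(y * (W[i] @ delta)) / w_norm + zeta) + rho
        cand = delta + beta * g
        cand = eps * cand / np.linalg.norm(cand)   # rescale so ||cand||_2 = eps
        if acc(cand) <= v:                         # keep only if accuracy does not increase
            delta, v = cand, acc(cand)
    return delta
```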
In contrast to APGD, the ARC algorithm is provably consistent, as stated below:

Theorem 5.1 (Consistency). The ARC algorithm for randomized ensembles of binary linear classifiers is consistent.

The implication of Theorem 5.1 (see Appendix A for the proof) is that the perturbation generated via the ARC algorithm is guaranteed to be adversarial to (F, α), given that x is correctly classified by all members, provided such a perturbation exists. Thus, we have constructed a theoretically-sound algorithm for attacking randomized ensembles of binary linear classifiers.

### 5.2. Differentiable Multiclass Classifiers

We now extend the ARC algorithm presented in Algorithm 1 to the more general case of C-ary differentiable classifiers, such as DNNs. When dealing with a non-linear differentiable classifier f, we can linearize its behavior around an input x in order to obtain estimates of both the shortest ℓp distance to the decision boundary ζ and the corresponding unit ℓp-norm direction g. Using a first-order Taylor series expansion at x, we approximate f:

$$f(u) \approx \bar{f}(u) = f(x) + \nabla f(x)^{\mathrm{T}}(u - x) \tag{12}$$

where ∇f(x) = [∇[f(x)]1 | ... | ∇[f(x)]C] ∈ ℝ^{D×C} is the Jacobian matrix. Let m ∈ [C] be the label assigned to x by f; we can construct the approximate decision region R̄m(x) around x as follows:

$$\bar{\mathcal{R}}_m(x) = \Big\{u \in \mathbb{R}^D : [\bar{f}(u)]_m > [\bar{f}(u)]_j \;\; \forall j \neq m\Big\} = \bigcap_{j \in [C]\setminus\{m\}} \Big\{u \in \mathbb{R}^D : \tilde{w}_j^{\mathrm{T}}(u - x) + \tilde{h}_j > 0\Big\} \tag{13}$$

where h̃j = [f(x)]m − [f(x)]j and w̃j = ∇[f(x)]m − ∇[f(x)]j ∀ j ∈ [C] \ {m}. The decision boundary of R̄m(x) is captured via the C − 1 hyper-planes Hj defined by:

$$\mathcal{H}_j = \Big\{u \in \mathbb{R}^D : \tilde{w}_j^{\mathrm{T}}(u - x) + \tilde{h}_j = 0\Big\} \tag{14}$$

Therefore, in order to obtain ζ and g, we find the closest hyper-plane Hn:

$$n = \underset{j \in [C]\setminus\{m\}}{\arg\min} \; \frac{|\tilde{h}_j|}{\|\tilde{w}_j\|_q} = \underset{j \in [C]\setminus\{m\}}{\arg\min} \; \tilde{\zeta}_j \tag{15}$$

and then compute:

$$\zeta = \tilde{\zeta}_n \quad \& \quad g = -\frac{|\tilde{w}_n|^{q-1} \odot \mathrm{sgn}(\tilde{w}_n)}{\|\tilde{w}_n\|_q^{q-1}} \tag{16}$$

where q ≥ 1 is the dual norm of p, ⊙ is the Hadamard product, and, for a vector w, |w|^r denotes the element-wise operation with entries |wi|^r ∀ i ∈ [D]. Using these equations, we generalize Algorithm 1 for BLCs to obtain ARC in Algorithm 2. The ARC algorithm is iterative. At each iteration k ∈ [K], ARC performs a local linearization approximation. Due to this approximation, we limit the local perturbation radius to η, a hyper-parameter often referred to as the step size, akin to PGD.
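The linearization step in Eqs. (12)-(16) maps directly onto automatic differentiation. Below is a small PyTorch sketch (ours, with a toy MLP standing in for an ensemble member) that forms the hyper-planes w̃j, h̃j from the Jacobian, picks the closest one, and returns the distance estimate ζ and the unit ℓp-norm direction g for the dual norm q.

```python
import torch

def closest_hyperplane(f, x, q=2.0):
    """Local linearization step of ARC (Eqs. 12-16), sketched with autograd.
    f: differentiable classifier mapping (D,) -> (C,) logits; x: input of shape (D,).
    Returns (zeta, g): estimated lp distance to the nearest decision boundary and
    the corresponding unit lp-norm adversarial direction (p is the dual of q)."""
    J = torch.autograd.functional.jacobian(f, x)        # (C, D): rows are gradients of each logit
    logits = f(x)
    m = int(torch.argmax(logits))                        # label assigned to x
    idx = [j for j in range(logits.numel()) if j != m]
    w = J[m].unsqueeze(0) - J[idx]                       # (C-1, D): w~_j = grad_m - grad_j
    h = logits[m] - logits[idx]                          # (C-1,):   h~_j = [f(x)]_m - [f(x)]_j
    wq = torch.linalg.vector_norm(w, ord=q, dim=1)       # ||w~_j||_q
    zetas = h.abs() / wq
    n = int(torch.argmin(zetas))                         # closest hyper-plane, Eq. (15)
    g = -(w[n].abs() ** (q - 1)) * torch.sign(w[n]) / wq[n] ** (q - 1)   # Eq. (16)
    return zetas[n].item(), g

# Toy usage with a small random MLP (10 classes, 32-dimensional input).
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x = torch.randn(32)
zeta, g = closest_hyperplane(lambda u: net(u), x, q=2.0)
print(zeta, torch.linalg.vector_norm(g, ord=2.0))        # distance estimate, and ||g||_2 ~ 1.0
```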
Algorithm 2: The ARC Algorithm
1: Input: REC (F, α), labeled data-point (x, y), number of steps K, step size η, norm p, and radius ϵ.
2: Output: Adversarial perturbation δ such that ‖δ‖p ≤ ϵ.
3: Initialize δ ← 0, v ← L(x, y, α), q ← p/(p − 1).
4: Define the ordered index set I such that αi ≥ αj ∀ i, j ∈ I with i ≤ j.
5: for k ∈ [K] do
6: δl ← 0, vl ← v // initialize for local search
7: for i ∈ I do
8: x̄ ← x + δ + δl
9: // the label assigned to x̄ by fi
10: m ← argmax_{j ∈ [C]} [fi(x̄)]j
11: w̃j ← ∇[fi(x̄)]m − ∇[fi(x̄)]j ∀ j ∈ [C] \ {m}
12: h̃j ← [fi(x̄)]m − [fi(x̄)]j ∀ j ∈ [C] \ {m}
13: // get closest hyper-plane to x̄ when fi is linearized
14: n ← argmin_{j ∈ [C]\{m}} |h̃j| / ‖w̃j‖q
15: g ← −|w̃n|^(q−1) ⊙ sgn(w̃n) / ‖w̃n‖q^(q−1)
16: ζ ← |h̃n| / ‖w̃n‖q
17: if ζ ≥ η or i = 1 then
18: β ← η
19: else
20: β ← (η/(η − ζ)) (|w̃nᵀδl| / ‖w̃n‖q + ζ) + ρ
21: end if
22: δ̂l ← η (δl + βg) / ‖δl + βg‖p
23: // project onto the ℓp ball of radius ϵ and center x
24: δ̂ ← Π^p_ϵ(δ + δ̂l), v̂l ← L(x + δ̂, y, α)
25: if v̂l ≤ vl then
26: δl ← δ̂l, vl ← v̂l // update the local variables
27: end if
28: end for
29: // update the global variables after the localized search
30: δ̂ ← Π^p_ϵ(δ + δl), v̂ ← L(x + δ̂, y, α)
31: if v̂ ≤ v then
32: δ ← δ̂, v ← v̂
33: end if
34: end for

### 5.3. Practical Considerations

The ARC algorithm is extremely efficient to implement in practice. Automatic differentiation packages such as PyTorch (Paszke et al., 2017) can carry out the linearization procedures in a seamless fashion. However, datasets with a large number of classes C, e.g., CIFAR-100 and ImageNet, can pose a practical challenge for Algorithm 2. This is due to the O(C) search performed in (15), which recurs for each iteration k and classifier fi in the ensemble. However, we have observed that (15) can be efficiently and accurately approximated by limiting the search to a fixed subset of hyper-planes of size G ∈ [C − 1]; e.g., G = 4 reduces the evaluation time by more than 14× (for CIFAR-100) without compromising accuracy. We shall use this version of ARC for evaluating such datasets, and refer the reader to Appendix D for further details.

## 6. Experiments

We conduct comprehensive experiments to demonstrate the effectiveness of ARC in generating norm-bounded adversarial perturbations for RECs, as compared to APGD. Specifically, we experiment with a variety of network architectures (VGG-16 (Simonyan & Zisserman, 2014), ResNet-20, ResNet-18 (He et al., 2016), WideResNet-28-4 (Zagoruyko & Komodakis, 2016), and MobileNetV1 (Howard et al., 2017)), datasets (SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100, and ImageNet (Krizhevsky et al., 2012)), and norms (ℓ2 and ℓ∞). We use the tried-and-tested publicly available implementation of PGD from (Rice et al., 2020). For all our experiments, the same ℓp norm is used during both training and evaluation. Further details on the training/evaluation setup can be found in Appendix B.

For BAT (Pinot et al., 2020) comparisons, we first train a standard AT model using PGD (Rice et al., 2020), which serves both as an independent robust baseline and as the first model (f1) in the REC to be constructed. Following the BAT procedure, we then train a second model (f2) using the adversarial samples of the first model only. The same network architecture is used for both models. When evaluating the robust accuracy of an REC, we compute the true expectation via (1), in accordance with (Pinot et al., 2020).

### 6.2. Results

We first showcase the effectiveness of ARC on CIFAR-10 using five network architectures and both ℓ2 and ℓ∞ norms. Table 1 summarizes the robust accuracies for various models and adversaries.

Table 1. Comparison between ARC and adaptive PGD when attacking randomized ensembles trained via BAT (Pinot et al., 2020) across various network architectures and norms on the CIFAR-10 dataset. We use the standard radii ϵ2 = 128/255 and ϵ∞ = 8/255 for ℓ2- and ℓ∞-bounded perturbations, respectively. Robust accuracy is reported in [%].

| Network | Norm | AT (M = 1): PGD | REC (M = 2): APGD | REC (M = 2): ARC | Diff. |
|---|---|---|---|---|---|
| ResNet-20 | ℓ2 | 62.43 | 69.21 | 55.44 | 13.77 |
| ResNet-20 | ℓ∞ | 45.66 | 61.10 | 40.71 | 20.39 |
| MobileNetV1 | ℓ2 | 66.39 | 67.92 | 59.43 | 8.49 |
| MobileNetV1 | ℓ∞ | 49.23 | 59.27 | 44.59 | 14.68 |
| VGG-16 | ℓ2 | 66.08 | 66.96 | 59.20 | 7.76 |
| VGG-16 | ℓ∞ | 49.02 | 57.82 | 42.93 | 14.89 |
| ResNet-18 | ℓ2 | 69.16 | 70.16 | 65.88 | 4.28 |
| ResNet-18 | ℓ∞ | 51.73 | 61.61 | 47.43 | 14.18 |
| WideResNet-28-4 | ℓ2 | 69.91 | 71.48 | 62.95 | 8.53 |
| WideResNet-28-4 | ℓ∞ | 51.88 | 63.86 | 48.65 | 15.21 |

Following the approach of (Pinot et al., 2020), we find the optimal sampling probability α = (α, 1 − α) that maximizes the robust accuracy via a grid search using APGD. We report the corresponding robust accuracies using both APGD and ARC for the same optimal sampling probability, to ensure a fair comparison.
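As a brief illustration of this evaluation protocol, the sketch below (ours; the per-model correctness matrix is synthetic) computes the exact expected robust accuracy of Eq. (1) over a test set and scans the sampling probability α = (a, 1 − a) of a two-model REC. In practice the correctness matrix must be regenerated for every candidate α, since the APGD or ARC perturbations themselves depend on the sampling probabilities.

```python
import numpy as np

def rec_robust_accuracy(correct, alpha):
    """Exact expected robust accuracy of an REC over a test set, per Eq. (1).
    correct: (M, N) boolean array; correct[i, n] indicates whether member f_i
    classifies the n-th *attacked* test sample correctly. alpha: (M,) probabilities."""
    return float(alpha @ correct.mean(axis=1))

# Toy illustration of the alpha = (a, 1 - a) grid search for a two-model REC.
rng = np.random.default_rng(0)
correct = rng.random((2, 1000)) < np.array([[0.45], [0.05]])   # hypothetical per-model outcomes
for a in np.linspace(0.5, 1.0, 6):
    print(round(a, 1), round(rec_robust_accuracy(correct, np.array([a, 1 - a])), 3))
```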
We consistently find that α ≈ 0.9, that is, the REC heavily favors the robust model (f1). Table 1 provides overwhelming evidence that: 1) BAT-trained RECs perform significantly better than their respective AT counterparts when APGD is used for robustness evaluation, 2) employing ARC instead of APGD results in a massive drop in the robust accuracy of the same REC, and 3) the true robust accuracy of the REC (obtained via ARC) is worse than that of the AT baseline (M = 1). For example, a randomized ensemble of ResNet-20s achieves 61.1% robust accuracy when using APGD, a ≈15% increase over the AT baseline. However, when using ARC, we find that the robustness of the ensemble is in fact 40.7%, which is: 1) much lower (≈20%) than what APGD claims and 2) also lower (≈5%) than the AT baseline.

The conclusions of Table 1 are further corroborated by Table 2, where we now compare ARC and APGD across four datasets. These empirical results illustrate both the validity and usefulness of our theoretical results, as they convincingly show that APGD is indeed ill-suited for evaluating the robustness of RECs and can provide a false sense of robustness.

Table 2. Comparison between ARC and adaptive PGD when attacking randomized ensembles trained via BAT (Pinot et al., 2020) across various datasets and norms. We use ResNet-18 for ImageNet and ResNet-20 for the SVHN, CIFAR-10, and CIFAR-100 datasets. Robust accuracy is reported in [%].

| Dataset | Norm | Radius (ϵ) | AT (M = 1): PGD | REC (M = 2): APGD | REC (M = 2): ARC | Diff. |
|---|---|---|---|---|---|---|
| SVHN | ℓ2 | 128/255 | 68.35 | 74.66 | 60.15 | 14.51 |
| SVHN | ℓ∞ | 8/255 | 53.55 | 65.99 | 52.01 | 13.98 |
| CIFAR-10 | ℓ2 | 128/255 | 62.43 | 69.21 | 55.44 | 13.77 |
| CIFAR-10 | ℓ∞ | 8/255 | 45.66 | 61.10 | 40.71 | 20.39 |
| CIFAR-100 | ℓ2 | 128/255 | 34.60 | 41.91 | 28.92 | 12.99 |
| CIFAR-100 | ℓ∞ | 8/255 | 22.29 | 33.37 | 17.45 | 15.92 |
| ImageNet | ℓ2 | 128/255 | 47.61 | 49.62 | 42.09 | 7.53 |
| ImageNet | ℓ∞ | 4/255 | 24.33 | 35.92 | 19.54 | 16.38 |

### 6.3. Robustness Stress Tests

In this section, we conduct robustness stress tests to further ensure that our claims are well-founded and not due to a technicality.

Parameter Sweeps: We sweep the sampling probability (Fig. 3a), the number of iterations (Fig. 3b), and the norm radius (Fig. 3c) for an REC of two ResNet-20s obtained via ℓ∞ BAT on the CIFAR-10 dataset. For similar experiments across more norms and datasets, please refer to Appendix C.2. In all cases, we find that the ARC adversary is consistently a much stronger attack against RECs than APGD. Moreover, we observe that α = 1 always results in the highest ensemble robustness when properly evaluated with ARC.

Figure 3. Comparison between ARC and APGD using an REC of ResNet-20s obtained via ℓ∞ BAT (M = 2) on CIFAR-10. Robust accuracy vs.: (a) sampling probability α, (b) number of iterations K, and (c) norm radius ϵ.
This renders BAT RECs obsolete, since the best policy is to choose the deterministic AT-trained model f1.

Stronger PGD: We also experiment with stronger versions of PGD using random initialization (R) (Madry et al., 2018) and 10 random restarts (RR) (Maini et al., 2020). Table 3 reports the corresponding robust accuracies. Interestingly enough, we find that APGD+RR is actually slightly less effective than vanilla APGD and APGD+R. While this observation might seem counter-intuitive at first, it can be reconciled using our analysis from Section 4. By design, the expected loss associated with APGD+RR perturbations is guaranteed to be larger than that of APGD+R³. However, we have established (see Fig. 2) that the perturbation that maximizes the expected loss can perform worse than other sub-optimal perturbations, which provides further evidence that APGD is indeed fundamentally flawed, and that ARC is better suited for evaluating the robustness of RECs.

Table 3. Robust accuracy [%] of different versions of APGD compared to ARC for ResNet-20 RECs on CIFAR-10. We use the standard radii ϵ2 = 128/255 and ϵ∞ = 8/255 for ℓ2- and ℓ∞-bounded perturbations, respectively.

| Attack Method | ℓ2 | ℓ∞ |
|---|---|---|
| APGD | 69.21 | 61.10 |
| APGD + R | 69.28 | 62.80 |
| APGD + RR | 70.14 | 62.95 |
| ARC | 55.44 | 40.71 |

³Assuming they share the same random seed.

### 6.4. Other Ensemble Training Methods

As alluded to in Section 2, one can construct randomized ensembles from any pre-trained set of classifiers, regardless of the training procedure. To that end, we leverage the publicly available DVERGE (Yang et al., 2020) trained ResNet-20 models on CIFAR-10 to construct and evaluate such RECs. Figure 4a plots the cross-robustness matrix of three DVERGE models. As expected, each model is quite robust (≈87%) to the other models' adversarial perturbations (obtained via PGD). However, the models are completely compromised when subjected to their own adversarial perturbations. Due to the apparent symmetry between the models, we construct a randomized ensemble with equiprobable sampling αi = 1/3. We evaluate the robustness of the REC against both the APGD and ARC adversaries, and sweep the norm radius ϵ in the process. As seen in Fig. 4b, the ARC adversary is much more effective at fooling the REC than APGD. For instance, at the typical radius ϵ = 8/255, the APGD robust accuracy is 43.36%, on par with the single-model AT baseline (see Table 1). In contrast, the ARC robust accuracy is 14.88%, yielding a massive difference in robustness (≈28%). In the interest of space, we defer the ADP (Pang et al., 2019) and TRS (Yang et al., 2021) experiments to Appendix C.4. We also explore constructing RECs using independent adversarially trained models in Appendix C.5.

Figure 4. Robustness performance of an REC constructed from ℓ∞ DVERGE-trained models (M = 3): (a) cross-robustness matrix, and (b) robust accuracy vs. norm radius ϵ using both ARC and APGD.

## 7. Conclusion

We have demonstrated, both theoretically and empirically, that the proposed ARC algorithm is better suited for evaluating the robustness of randomized ensembles of classifiers. In contrast, current adaptive attacks provide an overly optimistic estimate of robustness. Specifically, our work successfully compromises existing empirical defenses such as BAT (Pinot et al., 2020). While we also experiment with other methods for constructing RECs, we point out the paucity of work on randomized ensemble defenses. Despite the vulnerability of existing methods, the hardware efficiency of randomized ensembles is an attractive feature.
We believe that constructing robust randomized ensembles is a promising research direction, and our work advocates the need for improved defense methods including certifiable defenses, i.e., defenses that are provably robust against a specific adversary. Acknowledgements This work was supported by the Center for Brain-Inspired Computing (C-BRIC) and the Artificial Intelligence Hardware (AIHW) program funded by the Semiconductor Research Corporation (SRC) and the Defense Advanced Research Projects Agency (DARPA). Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274 283. PMLR, 2018. Breiman, L. Bagging predictors. Machine learning, 24(2): 123 140, 1996. Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39 57. IEEE, 2017. Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854 863. PMLR, 2017. Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310 1320. PMLR, 2019. Dbouk, H. and Shanbhag, N. Generalized depthwiseseparable convolutions for adversarially robust and efficient neural networks. Advances in Neural Information Processing Systems, 34, 2021. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1 15. Springer, 2000. Farhadi, A. and Redmon, J. YOLOv3: An incremental improvement. In Computer Vision and Pattern Recognition, pp. 1804 02767, 2018. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Guo, M., Yang, Y., Xu, R., Liu, Z., and Lin, D. When NAS meets robustness: In search of robust architectures against adversarial attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 631 640, 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ar Xiv preprint ar Xiv:1704.04861, 2017. Ilyas, A., Santurkar, S., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. Advances in neural information processing systems, 32, 2019. Kariyappa, S. and Qureshi, M. K. Improving adversarial robustness of ensembles with diversity training. ar Xiv preprint ar Xiv:1901.09981, 2019. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. 
Adversarial Vulnerability of Randomized Ensembles Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097 1105, 2012. Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. ar Xiv preprint ar Xiv:1607.02533, 2016. Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656 672. IEEE, 2019. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https:// openreview.net/forum?id=r Jz IBf ZAb. Maini, P., Wong, E., and Kolter, Z. Adversarial robustness against the union of multiple perturbation models. In International Conference on Machine Learning, pp. 6640 6650. PMLR, 2020. Melachrinoudis, E. An analytical solution to the minimum lp-norm of a hyperplane. Journal of Mathematical Analysis and Applications, 211(1):172 189, 1997. Meunier, L., Scetbon, M., Pinot, R. B., Atif, J., and Chevaleyre, Y. Mixed nash equilibria in the adversarial examples game. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7677 7687. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/ v139/meunier21a.html. Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574 2582, 2016. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. Neur IPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. Pang, T., Xu, K., Du, C., Chen, N., and Zhu, J. Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning, pp. 4970 4979. PMLR, 2019. Papernot, N., Mc Daniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pp. 582 597. IEEE, 2016. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., De Vito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in Py Torch. In NIPS Autodiff Workshop, 2017. Patil, A. D., Tuttle, M., Schwing, A. G., and Shanbhag, N. R. Robustifying ℓ adversarial training to the union of perturbation models. ar Xiv preprint ar Xiv:2105.14710, 2021. Pinot, R., Ettedgui, R., Rizk, G., Chevaleyre, Y., and Atif, J. Randomization matters how to defend against strong adversarial attacks. In International Conference on Machine Learning, pp. 7717 7727. PMLR, 2020. Rauber, J., Brendel, W., and Bethge, M. Foolbox: A python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning, 2017. URL http://arxiv.org/abs/1707. 04131. Rice, L., Wong, E., and Kolter, Z. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093 8104. PMLR, 2020. 
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Itembased collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285 295, 2001. Sehwag, V., Wang, S., Mittal, P., and Jana, S. HYDRA: Pruning adversarially robust neural networks. Advances in Neural Information Processing Systems (Neur IPS), 7, 2020. Sen, S., Ravindran, B., and Raghunathan, A. EMPIR: Ensembles of mixed precision deep networks for increased robustness against adversarial attacks. In International Conference on Learning Representations, 2019. Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems (Neur IPS), 32, 2019. Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Adversarial Vulnerability of Randomized Ensembles Tram er, F., Carlini, N., Brendel, W., and Madry, A. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems, 33, 2020. Yang, H., Zhang, J., Dong, H., Inkawhich, N., Gardner, A., Touchet, A., Wilkes, W., Berry, H., and Li, H. DVERGE: Diversifying vulnerabilities for enhanced robust generation of ensembles. Advances in Neural Information Processing Systems, 33, 2020. Yang, Y., Zhang, G., Katabi, D., and Xu, Z. ME-net: Towards effective adversarial robustness with matrix estimation. ar Xiv preprint ar Xiv:1905.11971, 2019. Yang, Z., Li, L., Xu, X., Zuo, S., Chen, Q., Zhou, P., Rubinstein, B., Zhang, C., and Li, B. TRS: Transferability reduced ensemble via promoting gradient diversity and model smoothness. Advances in Neural Information Processing Systems, 34, 2021. Zagoruyko, S. and Komodakis, N. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472 7482. PMLR, 2019. Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello edge: Keyword spotting on microcontrollers. ar Xiv preprint ar Xiv:1711.07128, 2017. Adversarial Vulnerability of Randomized Ensembles In this section, we provide proofs for Theorems 4.1, 4.2 & 5.1 as well as any omitted derivations. We first state the following result due to (Melachrinoudis, 1997) on the ℓp norm projection of any point onto a hyper-plane: Lemma A.1 (Melachrinoudis). For any hyper-plane H = {u RD : w Tu + b = 0} RD, point x RD, and p 1, then the ℓp norm projection z of x onto H can be obtained analytically as follows: z = x + ζg = x ζsgn(f(x))|w|q 1 sgn(w) w q 1 q (17) ζ = min z H z x p = z x p = w q = |f(x)| is the shortest ℓp distance between x and H, q 1 satisfying 1 p = 1 defines the dual norm of p, is the Hadamard product, g is the unit ℓp norm vector from x towards the decision boundary, and m = |w|r denotes the element-wise operation mi = |wi|r i [D]. The only dependence g has on x is the sign of f(x), which essentially determines which side of the hyper-plane x lies in. The nice thing about Lemma A.1 is that it provides us with the tools for computing the optimal adversarial perturbations for linear classifiers, for all valid norms p 1. A.1. 
Fooling a Binary Linear Classifier We also state and prove a simple and useful result for fooling binary linear classifiers. Let (x, y) RD { 1, 1} be a labeled data point that is correctly classified by f. That is we have y(w Tx + b) > 0. Let δ RD such that x = x + δ is misclassified by f. That is, y(w T x + b) < 0. What can we say about δ? Lemma A.2. For any (x, y) RD { 1, 1} correctly classified by f and valid norm p 1, δ is adversarial to f at (x, y) if and only if the following conditions hold yw Tδ > 0 & w q > ζ (19) w q = y f(x) is the shortest ℓp distance between x and the decision boundary of f. Proof. We can write: y(w T x + b) = y(w Tx + b) + yw Tδ (21) Since y(w Tx+b) > 0 holds by assumption, y(w T x+b) is negative if and only if yw Tδ > y(w Tx+b) > 0. This implies we need δ to have a positive projection on yw. Dividing both sides by w q, establishes the second condition as well. A.2. Derivation of (8) Let (F, α) be an REC of M BLCs: fi(x) = w T i x + bi i [M] (22) where wi RD and bi R are the weight and bias, respectively, of the ith BLC. The BLC fi assigns label 1 to x if and only if fi(x) > 0, and 1 otherwise. Define the binary cross-entropy loss l for classifier f evaluated at (x, y): l(x, y) = l (f(x), y) = log ef(x) log 1 1 + ef(x) = log (z) y log (1 z) (1 y) (23) Adversarial Vulnerability of Randomized Ensembles where z = ef(x) 1+ef(x) is the empirical probability of assigning label 1 to x by f and y = 1+y 2 {0, 1} is the binarized version of the signed label y { 1, 1}. Computing the gradient of l: Let us first compute: x log (z) = x log ef(x) x log 1 + ef(x) = xf(x) 1 1 + ef(x) xef(x) 1 + ef(x) w = (1 z)w Similarly, we compute: x log (1 z) = 0 x log 1 + ef(x) = 1 1 + ef(x) xef(x) 1 + ef(x) w = zw The gradient of l(x, y) w.r.t. x can thus be computed: xl(x, y) = x log (z) y x log (1 z) (1 y) = (1 z) yw + z(1 y)w = ( (1 z) y + z(1 y))w = ( y + z y + z z y)w = ( y + z)w Note that z (0, 1) and y {0, 1}, therefore y + z has the same sign as y. If y = 1, then y = 1, and 1 + z ( 1, 0) is negative. If y = 1, then y = 0, and z (0, 1) is positive. Therefore, we can write: xl(x, y) = yλw (27) λ = | (1 z) y + z(1 y)| = (1 z) y + z(1 y) 2 1 + y + (1 y)ef(x) 1 + ef(x) (0, 1) (28) Gradient of the expected loss: Using (27), we can compute the gradient g of the expected loss E [l(x + δ, y)] employed by APGD: g = x E [l(x + δ, y)] = x i=1 [αili(x + δ, y)] i=1 [αi xli(x + δ, y)] = i=1 [αi( yλiwi)] where i [M]: 2 1 + y + (1 y)efi(x+δ) 1 + efi(x+δ) (0, 1) (30) Adversarial Vulnerability of Randomized Ensembles A.3. Proof of Theorem 4.1 We provide a more detailed proof for Theorem 4.1, restated below: Theorem (Restated). For any REC (F, α) consisting of BLCs and any data-point (x, y), there exists a sequence of auxiliary BLCs { f (k)}K k=1 such that in the kth iteration, the output of APGD against the REC is equal to the output of PGD against f (k). Proof. Let (F, α) be an arbitrary REC of BLCs: fi(x) = w T i x + bi and (x, y) RD { 1, 1} be an arbitrary labeled data point. Recall the APGD update rule is: δ(k) = Πp ϵ δ(k 1) + ηµp x E h l(x + δ(k 1), y) i δ(k 1) + ηµp h αili(x + δ(k 1), y) i!! δ(k 1) + ηµp h αi xli(x + δ(k 1), y) i!! δ(k 1) + ηµp h αig(k 1) i i!! where li(x, y) = l(fi(x), y) is the binary cross-entropy loss evaluated at fi(x). 
We can compute the gradients g(k 1) i analytically i [M], k [K]: g(k 1) i = yλ(k 1) i wi (32) λ(k 1) i = 1 2 1 + y + (1 y)efi(x+δ(k 1)) 1 + efi(x+δ(k 1)) (0, 1) (33) We construct the kth auxiliary classifier as follows: f (k) = w T kx + bk = i=1 αiλ(k 1) i wi x + bk (34) for some arbitrary bk R. We now proceed to prove the result via induction on k, assuming both the APGD and PGD algorithms share the same random initialization δ(0). Let { δ(k)} be the sequence of PGD outputs when attacking the auxiliary classifier f (k). Assume that δ(k 1) = δ(k 1) for some k, i.e., the perturbation generated by APGD applied to the REC is equal to the perturbation generated by PGD applied to the auxiliary classifier. Thus, the kth PGD output will be: δ(k) = Πp ϵ δ(k 1) + ηµp xl f (k)(x + δ(k 1)), y = Πp ϵ δ(k 1) + ηµp y λ(k) wk = Πp ϵ δ(k 1) + ηµp ( y wk) δ(k 1) + ηµp i=1 αiλ(k 1) i wi δ(k 1) + ηµp h αig(k 1) i i!! which is equal to the APGD output in (31). Note that in (35) we use the following property: µp(γu) = sgn(γ)µp(u) u RD, γ R (36) Adversarial Vulnerability of Randomized Ensembles coupled with the fact that λ(k) (0, 1). Therefore, we have established the inductive step k. Since both algorithms start from the same random initialization, then we also establish the base case: δ(0) = δ(0). A.4. Proof of Theorem 4.2 We provide a more detailed proof for Theorem 4.2, restated below: Theorem (Restated). The APGD algorithm for randomized ensembles of classifiers is inconsistent. Proof. To prove inconsistency, we construct an example where the algorithm fails to generate an adversarial perturbation given one exists. Consider the following setup: 1. two binary linear classifiers with w1 = w2 = w = 0, and b1 = b2 = b > 0 2. equiprobable sampling α = (0.5, 0.5) 3. (x, y) = (0, 1) with norm-bound ϵ = 2b w p It is easy to show that both BLCs correctly classify x, since f1(x) = f2(x) = b > 0, thus we have the condition L(x, y, α) = 1. We also have a trivial choice of δ = ϵw/ w p, which satisfies both necessary and sufficient conditions for fooling f2 shown below: w T 2δ = w Tδ = ϵ w Tw w p = ϵ w 2 2 w p > 0 & w T 2δ / w2 q = w Tδ w q = ϵ w 2 2 w q w p = 2b w q > ζ2 (37) However, we observe that δ does not fool f1, since y(w1Tδ) = y(2b) < 0. Therefore, we have constructed a norm-bounded perturbation δ such that L(x + δ, y, α) = 0.5 < 1. Next, we show that the APGD algorithm evaluated against this REC cannot generate a norm-bounded perturbation that is adversarial. Recall that, starting from some random δ(0) such that δ(0) p ϵ, the APGD algorithm performs the following update rule: δ(k) = Πp ϵ δ(k 1) + ηµp x E h l(x + δ(k 1), y) i δ(k 1) + ηµp h αili(x + δ(k 1), y) i!! δ(k 1) + ηµp h αi xli(x + δ(k 1), y) i!! δ(k 1) + ηµp h αig(k 1) i i!! where li(x, y) = l(fi(x), y) is the binary cross-entropy loss evaluated for fi(x). We can compute the gradients g(k 1) i analytically i {1, 2}, k [K]: g(k 1) i = yλ(k 1) i wi (39) λ(k 1) i = 1 2 1 + y + (1 y)efi(x+δ(k 1)) 1 + efi(x+δ(k 1)) (0, 1) (40) Adversarial Vulnerability of Randomized Ensembles Assume that δ(0) satisfies w Tδ(0) = 0, therefore we have f1(x + δ(0)) = f2(x + δ(0)) = b. 
The expected gradient below is employed by APGD to obtain the next perturbation δ(1): h αig(0) i i = y h αiλ(0) i wi i = y λ(0) 1 w λ(0) 2 w = y 2w λ(0) 1 λ(0) 2 1 2 1 + y + (1 y)ef1(x+δ(0)) 1 + ef1(x+δ(0)) 1 2 1 + y + (1 y)ef2(x+δ(0)) 1 + ef2(x+δ(0)) 2 1 + y + (1 y)eb 2 1 + y + (1 y)eb This implies that the APGD algorithm is stuck in a fixed-point k: δ(k) = δ(0), which means the final perturbation obtained via PGD δAPGD = δ(0) is a non-adversarial perturbation since w Tδ(0) = 0. Hence, we have L(x + δAPGD, y, α) = L(x, y, α) = 1. A.5. Proof of Theorem 5.1 We now can prove the main result, restated below: Theorem (Restated). The ARC algorithm for randomized ensembles of binary linear classifiers is consistent. Proof. To prove consistency, we need to show that for all random BLCs (F, α), valid norms p 1, norm-bounds ϵ, and labeled data-points (x, y) such that L(x, y, α) = 1, the ARC generated perturbation δARC will result in L(x+δARC, y, α) < 1, given that δ that satisfies δ p ϵ and L(x + δ, y, α) < 1. First, let us establish some properties of the ARC algorithm. Define δ(i) to be the perturbation vector δ at the end of the ith iteration of the ARC algorithm. Similarly, define v(i) to be the cost function L(x + δ(i), y, α). The algorithm is initialized with δ(0) = 0 and v(0) = L(x, y, α) = 1. Subsequently, the algorithm terminates with δARC = δ(M) with a final average classification v(M) = L(x + δARC, y, α). We note that the sequence {v(i)} is non-increasing by design, with v(i) L(x, y, α). This implies that the algorithm will never increase the mean classification, and if m [M] such that v(m) < v(m 1), then the algorithm terminates with v(M) v(m) < 1. Therefore, to show consistency, it is enough to find one such m [M]. Assume that L(x, y, α) = 1 and δ such that δ p ϵ and L(x + δ, y, α) < 1, then from Lemma A.2 we know that m [M] such that: yw T mδ > 0 & wm q > ζm (42) ζm = |fm(x)| wm q = y fm(x) Combining the second inequality from (42) with H older s inequality, we get: wm q δ p w T mδ > ζm wm q (44) which implies that ζm < ϵ since δ p ϵ. We now have two cases to consider: Case 1: If m = 1, the candidate perturbation will be ˆδ = ϵg, where g = y |wm|q 1 sgn(wm) wm q 1 q (45) We will now show that ˆδ satisfies both necessary and sufficient conditions to fool fm. First, we compute the dot-product: ywm Tˆδ = ywm T (ϵg) = yϵwm Tg = ϵ wm q > 0 (46) Adversarial Vulnerability of Randomized Ensembles which is strictly positive, and the last equality uses the following: ywm Tg = ( y)2 i=1 |wm,i|q 1sgn(wm,i)wm,i = 1 wm q 1 q i=1 |wm,i|q 1|wm,i| = 1 wm q 1 q i=1 |wm,i|q = 1 wm q 1 q wm q q = wm q Second, we compute the following quantity: wm Tˆδ wm q = ywm Tˆδ wm q = ϵ wm q wm q = ϵ > ζm (48) which is strictly greater than ζm. Therefore, the candidate ˆδ satisfies both necessary and sufficient conditions for fooling fm. Case 2: If m > 1, then assume that up until the (m 1)th iteration the ARC algorithm has not found any adversarial perturbation, that is v(m 1) = 1, otherwise the algorithm will terminate with an adversarial perturbation. Let u = δ(m 1), and by assumption we have L(x+u, y, α) = 1. During the mth iteration, we have the candidate perturbation ˆδ = γ(u+βg) where g = y |wm|q 1 sgn(wm) wm q 1 q (49) and γ = ϵ u + βg p > 0 (50) to ensure ˆδ p = ϵ. Since ζm < ϵ, the choice of β for the mth iteration will be: where ρ > 0 is a fixed arbitrarily small positive number. Note that u p can either be 0 or ϵ by design. If u p = 0, then the same proof from case 1 follows, otherwise assume u p = ϵ. 
We will now show that ˆδ satisfies both necessary and sufficient conditions to fool fm. First, we compute the dot-product: ywm Tˆδ = yγwm T (u + βg) = yγwm Tu yβγwm Tg = yγwm Tu + βγ wm q (52) Therefore, in order to satisfy the first condition in (42), we need ywm Tˆδ to be strictly positive, which is satisfied by any β satisfying: However, using the fact that ϵ > ζm > 0, our choice of β satisfies this constraint since: + ρ > ϵ ϵ ζm Second, we compute the following quantity required for the second condition: wm Tˆδ wm q = ywm Tˆδ wm q = yγwm Tu + βγ wm q Adversarial Vulnerability of Randomized Ensembles In order to satisfy the second condition, we need: yγwm Tu + βγ wm q wm q > ζm (56) which is equivalent to: Recall that: γ = ϵ u + βg p ϵ u p + |β| g p = ϵ ϵ + β (58) where the first inequality stems from the triangle inequality of the ℓp norm, and the second equality uses the fact that u p = ϵ, g p = 1, and β > 0 by design. Using the inequality in (58), we can modify the inequality condition in (57) to: wm q + ζm(ϵ + β) which means it is enough to require that: wm q + ζm (60) for the second condition to hold. It is easy to see our choice of β satisfies this strict inequality (due to ρ > 0). In fact, we can easily show that ywm Tu wm q + ζm is guaranteed to be positive in this scenario due to our assumption that x + u is correctly classified by fm, which removes the looseness of the absolute value in our choice of β. Thus we have shown that, given the assumptions, we can find a m [M] such that x + δ(m) is misclassified by fm, which implies v(m) < v(m 1) = 1, which concludes the proof that the ARC algorithm is consistent. Adversarial Vulnerability of Randomized Ensembles B. Experimental Setup Details B.1. Training Hyper-parameters In this section, we provide the training details alongside the choice of hyper-parameters for all our experiments. For SVHN, CIFAR-10, and CIFAR-100 datasets, we establish strong adversarially trained baselines (via PGD) following the approach of (Rice et al., 2020) which utilizes early stopping to avoid robust over-fitting. As for Image Net, we use the popular free training (FT) (Shafahi et al., 2019) method for faster and more efficient adversarial training. We employ standard data augmentation techniques (random cropping and random horizontal flips) for all models. Following standard practice, we also apply input normalization as part of the model, so that the adversary operates on physical images x [0, 1]D. For all our experiments, the same ℓp norm is used during both training and evaluation. For BAT training, as mentioned in Section 6, we re-use the AT models as the first model (f1) in the REC, and then train a second model (f2) using the adversarial samples of the first model only. The same network architecture, training algorithm, and hyper-parameters will be used for both models. A single workstation with two NVIDIA Tesla P100 GPUs is used for running all the training experiments. Below are the hyper-parameter details for each dataset: SVHN: Models are trained for 200 epochs, using a PGD adversary with K = 7 iterations with: ϵ = 8/255 and η = 2/255 for ℓ AT, and ϵ = 128/255 and η = 32/255 for ℓ2 AT. We use stochastic gradient descent (SGD) with momentum (0.9), 128 mini-batch size, and a step-wise learning rate decay set initially at 0.1 and divided by 10 at epochs 100 and 150. We employ weight decay of 2 10 4. 
CIFAR-10: Models are trained for 200 epochs, using a PGD adversary with K = 7 iterations with: ϵ = 8/255 and η = 2/255 for ℓ∞ AT, and ϵ = 128/255 and η = 32/255 for ℓ2 AT. We use SGD with momentum (0.9), a mini-batch size of 128, and a step-wise learning rate decay set initially at 0.1 and divided by 10 at epochs 100 and 150. We employ a weight decay of 2 × 10^-4 for ResNet-20 and MobileNetV1, and 5 × 10^-4 for VGG-16, ResNet-18, and WideResNet-28-4.

CIFAR-100: Models are trained for 200 epochs, using a PGD adversary with K = 7 iterations with: ϵ = 8/255 and η = 2/255 for ℓ∞ AT, and ϵ = 128/255 and η = 32/255 for ℓ2 AT. We use SGD with momentum (0.9), a mini-batch size of 128, and a step-wise learning rate decay set initially at 0.1 and divided by 10 at epochs 100 and 150. We employ a weight decay of 2 × 10^-4.

ImageNet: Models are trained for a total of 90 epochs with mini-batch replay m = 4, using: ϵ = 4/255 for ℓ∞ FT, and ϵ = 76/255 for ℓ2 FT. We use stochastic gradient descent (SGD) with momentum (0.9), a mini-batch size of 256, and a step-wise learning rate decay set initially at 0.1 and divided by 10 every 30 epochs. We employ a weight decay of 1 × 10^-4.

B.2. Evaluation Details

We evaluate the robustness of standard AT models using PGD (Madry et al., 2018). As for randomized ensembles, we establish baselines for comparison using the adaptive version of PGD (APGD) (Pinot et al., 2020), which uses the expected loss function (see Section 3). We use our proposed ARC algorithm as well, to demonstrate the vulnerability of RECs. In Appendix C.3, we investigate different methods for constructing adaptive and non-adaptive attacks using both PGD and C&W (Carlini & Wagner, 2017). The PGD/APGD attack details for each dataset are listed below. The same configurations are used for ARC as well, except for ℓ∞ evaluations, where we find that setting η = ϵ yields the best results.

SVHN, CIFAR-10, CIFAR-100: We use K = 20 iterations, with: ϵ = 8/255 and η = 2/255 for ℓ∞ PGD, and ϵ = 128/255 and η = 32/255 for ℓ2 PGD.

ImageNet: We use K = 50 iterations, with: ϵ = 4/255 and η = 1/255 for ℓ∞ PGD, and ϵ = 128/255 and η = 32/255 for ℓ2 PGD.

C. Additional Experiments and Comparisons

C.1. Individual Model Performance

Table 4. Clean and robust accuracies of the individual members of BAT-trained RECs from Table 2. Model robustness is measured via both PGD and ARC.

| Dataset | Norm | Model | Clean Accuracy [%] | Robust Accuracy [%] (PGD) | Robust Accuracy [%] (ARC) |
|---|---|---|---|---|---|
| SVHN | ℓ2 | f1 | 93.07 | 68.35 | 65.50 |
| SVHN | ℓ2 | f2 | 91.23 | 0.00 | 0.00 |
| SVHN | ℓ∞ | f1 | 90.53 | 53.55 | 49.01 |
| SVHN | ℓ∞ | f2 | 88.31 | 0.00 | 0.00 |
| CIFAR-10 | ℓ2 | f1 | 83.36 | 62.43 | 61.13 |
| CIFAR-10 | ℓ2 | f2 | 91.26 | 0.13 | 0.00 |
| CIFAR-10 | ℓ∞ | f1 | 75.96 | 45.66 | 42.36 |
| CIFAR-10 | ℓ∞ | f2 | 77.64 | 0.00 | 0.00 |
| CIFAR-100 | ℓ2 | f1 | 53.43 | 34.60 | 31.88 |
| CIFAR-100 | ℓ2 | f2 | 63.74 | 0.00 | 0.00 |
| CIFAR-100 | ℓ∞ | f1 | 44.37 | 22.29 | 18.36 |
| CIFAR-100 | ℓ∞ | f2 | 34.10 | 0.00 | 0.00 |
| ImageNet | ℓ2 | f1 | 62.91 | 47.60 | 45.96 |
| ImageNet | ℓ2 | f2 | 64.60 | 0.00 | 0.00 |
| ImageNet | ℓ∞ | f1 | 52.01 | 24.33 | 21.03 |
| ImageNet | ℓ∞ | f2 | 55.70 | 0.00 | 0.00 |

In this section, we provide more details on the performance of the individual models from the BAT RECs constructed in Table 2 across different datasets. Table 4 reports the clean and robust accuracies of all the individual models. The robust accuracy is measured via both standard PGD and our ARC adversary. We consistently find the performance of ARC and PGD in the standard single-model setting to be similar, with ARC being slightly better. One thing to note is that, due to the nature of BAT, the second model f2 is never robust to its own adversarial perturbations. However, as shown in Fig. 5, the second model is very successful at defending against the adversarial samples of the first model, which is expected.
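For reference, evaluating an REC under any fixed attack reduces to an α-weighted average of member correctness on the attacked inputs. The NumPy sketch below illustrates this computation; the array shapes and function name are illustrative placeholders, not code from our released repository.

```python
import numpy as np

def rec_robust_accuracy(member_logits, labels, alpha):
    """Expected robust accuracy of an REC on a fixed set of (perturbed) inputs.

    member_logits: (M, N, C) logits of the M ensemble members on N perturbed inputs
    labels:        (N,) ground-truth labels
    alpha:         (M,) sampling probabilities of the defender (sums to 1)
    """
    preds = member_logits.argmax(axis=-1)                 # (M, N) member predictions
    correct = (preds == labels[None, :]).astype(float)    # (M, N) correctness indicators
    # Average over the defender's sampling distribution, then over the data.
    return float((alpha[:, None] * correct).sum(axis=0).mean())

# Tiny usage example with placeholder logits: M = 2 members, N = 4 inputs, C = 3 classes.
rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 4, 3))
labels = rng.integers(0, 3, size=4)
alpha = np.array([0.5, 0.5])
print(rec_robust_accuracy(logits, labels, alpha))
```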
[Figure 5 panels: SVHN, CIFAR-10, CIFAR-100, ImageNet.]
Figure 5. The cross-robustness matrices of all BAT RECs from Table 2. As expected, BAT training induces an asymmetric ensemble where the first model is robust and the second model is completely compromised. The false sense of robustness from BAT RECs stems from the high robustness of the second model to the adversarial samples of the first model.

C.2. More Parameter Sweeps

In this section, we expand on the results in Section 6.3. Specifically, Fig. 6 provides a more complete version of Fig. 3 with added CIFAR-100 results, across the two norms ℓ∞ and ℓ2. Across all datasets, norms, and parameter values, Fig. 6 consistently shows that the randomized ensembles obtained via BAT are more vulnerable than originally claimed. The ARC algorithm is much more effective at generating adversarial examples for RECs, in contrast to APGD.

[Figure 6 panels: ℓ2 and ℓ∞ on CIFAR-10 and CIFAR-100; robust accuracy [%] vs. attack iterations K and radius ϵ.]
Figure 6. More comparisons between ARC and APGD using RECs of ResNet-20s obtained via BAT (M = 2).

C.3. More on Adaptive Attacks

In this section, we explore different ways of deploying PGD-based adversaries for attacking RECs and compare them with ARC. We also employ the C&W (Carlini & Wagner, 2017) attack for ℓ2 norm-bounded perturbations and adapt it to our setting.

C.3.1. ADAPTIVE PGD

In Section 3, we introduced the standard PGD attack, which performs the following update rule ∀k ∈ [K]:

δ^(k) = Π_ϵ^p ( δ^(k−1) + η μ_p( ∇_x l(f(x + δ^(k−1)), y) ) )   (61)

where l is the loss function, f is the differentiable deterministic classifier, and (x, y) is the labeled data-point. It is clear that some changes have to be made in order to use PGD for a randomized ensemble (F, α). We propose some intuitive changes below.

PGD: Perhaps the simplest modification one can make is to randomize the attack as well! If the defender picks a model f_i at random from the ensemble with probability α_i, for some i ∈ [M], then the attacker can also independently sample a model f_j at random with probability α_j and use it to construct the adversarial perturbation for (F, α) via PGD. Again, the discrete nature of our setup allows us to compute exactly the expected robust accuracy when both the attacker and the defender are randomizing their strategies:

Σ_{j=1}^{M} α_j L(x + δ_j, y, α)   (62)

where δ_j is the adversarial perturbation obtained via PGD when attacking model f_j, and L is the conditional expected robust accuracy from (1).

PGD-1: Since the second model in BAT RECs is never robust, one logical attack to consider is attacking only the robust model f_1 and ignoring the second, non-robust model. That is, we use standard PGD to attack f_1, and use the generated δ to attack the REC.

APGD: The attacker can also adapt PGD by using the expected value of the loss function, modifying (61) as follows:

δ^(k) = Π_ϵ^p ( δ^(k−1) + η μ_p( ∇_x E[ l(f(x + δ^(k−1)), y) ] ) ) = Π_ϵ^p ( δ^(k−1) + η μ_p( Σ_{i=1}^{M} α_i ∇_x l(f_i(x + δ^(k−1)), y) ) )   (63)

Recall that this is the adaptive PGD proposed in (Pinot et al., 2020) and what we adopted in this paper (see Sections 3 and 4) for the robustness evaluation of randomized ensembles.
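For concreteness, a minimal PyTorch sketch of the APGD update (63) for the ℓ∞ case is given below; it re-implements the stated update rule with placeholder model and loss choices, and omits restarts and other details of the actual evaluation code.

```python
import torch
import torch.nn.functional as F

def apgd_linf(models, alpha, x, y, eps, eta, steps):
    """Adaptive PGD (expected-loss EOT) against a randomized ensemble, l_inf version.

    models: list of M differentiable classifiers f_i
    alpha:  list of M sampling probabilities
    """
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta.requires_grad_(True)
        # Expected loss over the ensemble, as in Eq. (63).
        loss = sum(a * F.cross_entropy(f(x + delta), y) for f, a in zip(models, alpha))
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + eta * grad.sign()        # steepest-ascent step (mu_p for p = inf)
            delta = delta.clamp(-eps, eps)           # projection onto the l_inf ball
            delta = (x + delta).clamp(0.0, 1.0) - x  # keep the perturbed image in [0, 1]^D
    return delta.detach()
```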
APGD-L: Another method for computing the EOT for randomized ensembles is to use the expected classifier outputs (logits) instead of the expected loss, which yields the following change to (61):

δ^(k) = Π_ϵ^p ( δ^(k−1) + η μ_p( ∇_x l( E[ f(x + δ^(k−1)) ], y) ) ) = Π_ϵ^p ( δ^(k−1) + η μ_p( ∇_x l( Σ_{i=1}^{M} α_i f_i(x + δ^(k−1)), y) ) )   (64)

The authors of (Pinot et al., 2020) were also aware of this modification. However, they argued that averaging the logits may cause discrepancies, since logits from two different models can have completely different ranges of values. The compelling argument for using the loss function is that it provides an elegant and natural way of normalizing the outputs. Note that our analysis in Section 4 is also applicable to this version of adaptive PGD, and therefore we expect it to inherit the shortcomings of APGD and provide a false sense of robustness for RECs.

C.3.2. C&W ATTACK

The C&W attack (Carlini & Wagner, 2017) is another popular and powerful attack, originally designed for compromising defensive distillation (Papernot et al., 2016) in neural networks. For a given classifier f and data-point (x, y), the C&W attack finds an adversarial perturbation by solving the following constrained optimization problem:

δ* = arg min_{δ: x+δ ∈ [0,1]^D} ‖δ‖_2 + c·h(x + δ)   (65)

where h(x + δ) is some cost function such that h(x + δ) < 0 if and only if arg max_i [f(x + δ)]_i ≠ y, and c > 0 is a constant that is often optimized via binary search. Note that the C&W attack (Carlini & Wagner, 2017) tries to find the minimum ℓ2 norm perturbation that fools f, such that the perturbed image satisfies the box constraints. The resulting perturbation is therefore not guaranteed to be norm-bounded by some ϵ. However, a common workaround is to simply project the final perturbation onto the ℓ2 ball of radius ϵ. Solving (65) is very challenging from an optimization point of view. To facilitate the optimization, the authors propose a change of variable:

δ = (1/2)(tanh(θ) + 1) − x   (66)

where tanh is an element-wise hyperbolic tangent, which guarantees that δ + x satisfies the box constraints ∀θ ∈ R^D. Therefore, the C&W attack solves for θ via:

θ* = arg min_{θ ∈ R^D} ‖(1/2)(tanh(θ) + 1) − x‖_2^2 + c·h( (1/2)(tanh(θ) + 1) )   (67)

where h is:

h(u) = max( [f(u)]_y − max_{i ∈ [C]\{y}} [f(u)]_i , −κ )   (68)

and κ controls the confidence (usually set to 0). Finally, the optimization in (67) can be solved via gradient-based algorithms, such as SGD or ADAM (Kingma & Ba, 2014), while c > 0 is chosen using binary search. Analogous to adapting PGD, we experiment with the following variants of the C&W attack for RECs:

C&W: The attacker randomly samples a classifier and attacks it via the classical C&W adversary.

C&W-1: The attacker only considers the first (robust) model of BAT RECs via the classical C&W adversary, akin to PGD-1.

AC&W: The attacker uses the expected value of h when solving (67), akin to APGD.

AC&W-L: The attacker uses the expected output of the REC (expected logits) when solving (67), akin to APGD-L.

To ensure a proper implementation, we adopt the open-source Foolbox (Rauber et al., 2017) implementation of the C&W attack. We adopt the common attack hyper-parameters: 9 binary-search steps with an initial constant c = 0.001, an optimizer learning rate of 0.01, confidence κ = 0, and a total of K = 50 attack iterations.
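To make the AC&W adaptation concrete, the PyTorch sketch below spells out the margin h from (68) and the expected-margin objective optimized over θ; it is a simplified illustration (single fixed c, no binary search, placeholder model handles), not the Foolbox-based implementation we actually use.

```python
import torch

def cw_margin(logits, y, kappa=0.0):
    """The margin h(u) in Eq. (68): max([f(u)]_y - max_{i != y} [f(u)]_i, -kappa)."""
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, y.unsqueeze(1), float('-inf'))    # mask out the true class
    best_other = others.max(dim=1).values
    return torch.clamp(true_logit - best_other, min=-kappa)

def acw_objective(models, alpha, theta, x, y, c=1e-3, kappa=0.0):
    """AC&W objective: C&W distance term plus the *expected* margin over the REC members."""
    x_adv = 0.5 * (torch.tanh(theta) + 1.0)              # change of variable, Eq. (66)
    dist = ((x_adv - x) ** 2).flatten(1).sum(dim=1)      # squared l2 distance term, Eq. (67)
    exp_margin = sum(a * cw_margin(f(x_adv), y, kappa) for f, a in zip(models, alpha))
    return (dist + c * exp_margin).mean()
```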
C.3.3. ARC VS. ADAPTIVE ATTACKS

We now apply all the proposed variants of PGD and C&W to attack the RECs of ResNet-20s on CIFAR-10 (from Section 6.3). Note that the C&W variants are only evaluated against the ℓ2-trained REC. Figure 7 plots the robust accuracy of all methods while sweeping the defender's sampling probability α = (α, 1 − α). We make the following observations:

The ARC algorithm outperforms all variants of PGD and C&W across all values of α, and by massive margins (∼20%) for α ∈ [0.5, 0.9]. ARC demonstrates that there is no benefit in using RECs, since the highest robustness achieved corresponds to α = 1. Note that α = 1 reduces the REC to the deterministic adversarially trained classifier f_1.

Adapting PGD via the expected loss, instead of computing the expected logits, yields little to no difference, and the same can be said for C&W as well.

Interestingly, attacking the first model only is a much more effective attack than any existing adaptive attack in the high sampling probability regime (α ∈ [0.8, 1]). Despite this fact, ARC remains a much stronger attack for all values of α.

The robustness evaluations of APGD and PGD follow completely different trends, with both providing a false sense of robustness. The same can be said for AC&W and C&W as well.

[Figure 7 legend: PGD, PGD-1, APGD, APGD-L, CW, CW-1, ACW, ACW-L, ARC; robust accuracy [%] vs. sampling probability α.]
Figure 7. Robust accuracy vs. sampling probability of the ℓ∞ (left) and ℓ2 (right) BAT RECs of ResNet-20s on CIFAR-10. Robust accuracy is evaluated using the various adaptive attacks detailed in Appendix C.3 and the proposed ARC attack algorithm.

C.4. More on Ensemble Training Methods

In this section, we expand on the results from Section 6.4, where we constructed RECs from pre-trained DVERGE models (Yang et al., 2020). In addition to DVERGE, we experiment with ADP (Pang et al., 2019) and TRS (Yang et al., 2021) using an ensemble of three ResNet-20s on CIFAR-10. We used the pre-trained ADP models made publicly available by (Yang et al., 2020), and trained TRS models using their publicly released code on GitHub (Yang et al., 2021).

We first plot the cross-robustness matrices for each ensemble in Fig. 8. None of the classifiers in the ensembles are actually robust, which is expected. These training methods are designed to improve the robustness to transfer attacks, which is why we notice high cross-robustness in all three matrices. We construct RECs for each method via equiprobable sampling, and evaluate the robustness against both APGD and ARC in Fig. 9 for different norm radius ϵ values. A common observation across all three methods is that ARC consistently outperforms APGD, with large margins for TRS and DVERGE. These experiments demonstrate that RECs are vulnerable regardless of the training procedure, thus motivating the need for improved adversarial training algorithms for randomized ensembles.

[Figure 8 cross-robustness values, row-wise: (a) DVERGE: 0.01%, 87.33%, 87.33% / 87.31%, 0.00%, 87.15% / 87.33%, 87.27%, 0.00%; (b) ADP: 0.60%, 52.70%, 52.40% / 43.49%, 0.53%, 47.48% / 47.22%, 49.36%, 0.99%; (c) TRS: 0.00%, 84.49%, 78.55% / 83.83%, 0.00%, 81.36% / 85.90%, 83.84%, 0.00%.]
Figure 8. Cross-robustness matrices for: (a) DVERGE, (b) ADP, and (c) TRS trained ensembles of three ResNet-20s on CIFAR-10.

C.5. Randomized Ensembles from Independent Adversarially Trained Classifiers

In this section, we investigate the performance of RECs constructed from independent adversarially trained (IAT) deep nets. Specifically, we train M ResNet-20s using ℓ2 AT on CIFAR-10.
[Figure 9: robust accuracy [%] vs. norm radius ϵ for the three ensembles.]
Figure 9. Robust accuracy vs. norm radius ϵ using both ARC and APGD for: (a) DVERGE, (b) ADP, and (c) TRS trained ensembles of three ResNet-20s on CIFAR-10.

We use different random initializations for each network by changing the random seed. Doing so results in a symmetric ensemble, as seen via the cross-robustness matrix in Fig. 10a. In this setting, all the models are robust with ∼62% robust accuracy, as opposed to the asymmetric BAT setting where the first model is robust and the second model is completely compromised. Using equiprobable sampling, we construct RECs from the M models and compare the performance of both ARC and APGD with varying ensemble size M in Fig. 10b, e.g., M = 2 means we only use the first two models and ignore the rest. Both ARC and APGD follow the same trend, since the robustness of the ensemble increases with M. The ARC algorithm is consistently stronger than APGD, albeit the gap is relatively small compared to BAT. This is expected since all models in the ensemble are robust. We do note, however, that the improvement in robustness is limited and requires a large ensemble size to be significant, e.g., < 3% improvement with M = 8. Finally, we strongly believe that future ensemble training methods need to employ IAT RECs as a baseline.

[Figure 10: (a) 8×8 cross-robustness matrix (diagonal entries ≈62%, off-diagonal entries ≈71-72%); (b) robust accuracy [%] vs. ensemble size M.]
Figure 10. Robustness performance of an REC constructed from ℓ2 independent adversarially trained models (M = 8): (a) cross-robustness matrix, and (b) robust accuracy vs. size of the ensemble M using both ARC and APGD.

C.6. Visualization of Adversarial Samples

In this section, we provide some visual examples of ℓ∞ norm-bounded adversarial perturbations obtained via APGD and ARC. Specifically, Fig. 11 shows three examples from our ImageNet experiments, where each clean image (unperturbed) is correctly classified by a randomized ensemble of ResNet-18s (from Table 2). The APGD adversarial samples are also correctly classified by both networks in the ensemble, and thus are unable to fool the ensemble. In contrast, the ARC adversarial samples completely compromise the ensemble, i.e., they fool both networks. All adversarial perturbations are ℓ∞ norm-bounded with ϵ = 4/255.

[Figure 11 panels: Clean Sample, APGD Sample, APGD Perturbation, ARC Perturbation.]
Figure 11. Visual examples of ℓ∞ norm-bounded adversarial perturbations obtained via APGD and ARC, evaluated against a randomized ensemble of ResNet-18s on ImageNet. All perturbations are ℓ∞ norm-bounded with ϵ = 4/255.
The APGD and clean samples are correctly classified by all members of the ensemble, whereas the ARC samples are misclassified by all members of the same ensemble.

D. Scalability of the ARC Algorithm

In this section, we expand on Section 5.3 and provide more details on approximating the search procedure in (15) in the ARC algorithm. Recall that the goal is to find the shortest ℓp distance ζ between x and the hyper-planes that capture the decision boundary of R_m(x), as well as the corresponding unit ℓp-norm direction g. The C − 1 hyper-planes are defined by:

H_j = { u ∈ R^D : w_j^T (u − x) + h_j = 0 }   ∀j ∈ [C] \ {m}   (69)

where m ∈ [C] is the label assigned to x by f, h_j = [f(x)]_m − [f(x)]_j, and w_j = ∇_x[f(x)]_m − ∇_x[f(x)]_j ∀j ∈ [C] \ {m}. Therefore, in order to obtain ζ and g, we find the closest hyper-plane:

n = arg min_{j ∈ [C]\{m}} h_j / ‖w_j‖_q = arg min_{j ∈ [C]\{m}} ζ_j   (70)

and then compute:

ζ = ζ_n   and   g = −|w_n|^{q−1} ⊙ sgn(w_n) / ‖w_n‖_q^{q−1}   (71)

The search in (70) can be computationally demanding for datasets with a large number of classes, e.g., C = 100 for CIFAR-100 or C = 1000 for ImageNet, as the gradient computation routine required for finding w_j is very compute-intensive and has to be executed C − 1 times, during every iteration k and for every classifier f_i. To alleviate this issue, we perform an approximate search procedure. Intuitively, we expect the closest hyper-plane n to correspond to the output logit [f(x)]_n that is closest to [f(x)]_m. Therefore, instead of searching over all C − 1 hyper-planes, we only need to search over 1 ≤ G ≤ C − 1 hyper-planes that correspond to the G largest logits [f(x)]_j, j ≠ m. Specifically, we propose approximating the search as follows:

1. compute the quantities h_j ∀j ∈ [C] \ {m} as before

2. construct the sorted index set J = {j_1, ..., j_G} ⊂ [C] \ {m} such that:

0 < h_{j_1} ≤ h_{j_2} ≤ ... ≤ h_{j_G} ≤ h_t   ∀t ∈ [C] \ ({m} ∪ J)   (72)

3. search for the closest hyper-plane over the restricted set J:

n = arg min_{j ∈ J} h_j / ‖w_j‖_q = arg min_{j ∈ J} ζ_j   (73)

Table 5. The robust accuracy and run time of the proposed approximate version of ARC evaluated against two RECs on CIFAR-100. The evaluation was run on a single NVIDIA 1080 Ti GPU.

| Search size (G) | Run time [min] | Robust accuracy ℓ2 [%] | Robust accuracy ℓ∞ [%] |
|---|---|---|---|
| 1 | 2.06 | 29.79 | 18.59 |
| 2 | 2.56 | 29.08 | 17.71 |
| 3 | 3.05 | 28.97 | 17.55 |
| 4 | 3.56 | 28.92 | 17.45 |
| 10 | 6.44 | 28.88 | 17.32 |
| 50 | 26.46 | 28.88 | 17.32 |
| 99 | 49.91 | 28.88 | 17.32 |

The parameter G controls the accuracy-efficiency trade-off of the approximation, and setting G = C − 1 yields the exact search in (70). To demonstrate the efficacy of this approximation, we use this version of ARC to evaluate the robustness of two RECs of ResNet-20s obtained via ℓ2 BAT and ℓ∞ BAT on CIFAR-100. We also measure the total evaluation time, i.e., the total time our script requires to evaluate the robustness of the REC using all 10000 samples of the CIFAR-100 test set. We use a workstation with a single NVIDIA 1080 Ti GPU and iterate over the test set with a mini-batch size of 256 for all evaluations. Table 5 reports the corresponding robustness and required evaluation time for different values of G. Table 5 demonstrates that using G = 4 provides nearly the same robustness evaluation as the full ARC algorithm (G = 99), while requiring less than 4 minutes to run, as opposed to the 50 minutes required by G = 99. This massive reduction in evaluation time, while maintaining the fidelity of the robustness evaluation, allows us to scale the ARC algorithm to more complex datasets.
Therefore, in all of our CIFAR-100 and ImageNet experiments, we will use this version of ARC with G = 4.
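The PyTorch sketch below illustrates this restricted search for a single input; it is a simplified, illustrative re-implementation of steps 1-3 above (one backward pass per candidate, no batching), not the optimized routine in our released code.

```python
import torch

def approx_closest_hyperplane(f, x, m, q, G=4):
    """Approximate the search in Eq. (70) by restricting it to the G smallest logit gaps.

    f: differentiable classifier mapping a batch of one input x to logits of shape (1, C)
    m: label assigned to x by f;  q: dual exponent of the attack norm p (1/p + 1/q = 1)
    """
    with torch.no_grad():
        logits = f(x).squeeze(0)                       # (C,)
        gaps = logits[m] - logits                      # h_j = [f(x)]_m - [f(x)]_j
        gaps[m] = float('inf')                         # exclude j = m from the search
        cand = torch.topk(-gaps, k=G).indices          # indices of the G smallest h_j

    best_zeta, best_g = None, None
    for j in cand.tolist():
        # w_j is the gradient of [f(x)]_m - [f(x)]_j w.r.t. x (one backward pass per candidate).
        x_req = x.clone().requires_grad_(True)
        out = f(x_req).squeeze(0)
        w = torch.autograd.grad(out[m] - out[j], x_req)[0].flatten()
        zeta = (out[m] - out[j]).detach() / w.norm(p=q)
        if best_zeta is None or zeta < best_zeta:
            best_zeta = zeta
            # Unit-l_p direction toward the selected hyper-plane, as in Eq. (71).
            best_g = -(w.abs() ** (q - 1)) * w.sign() / w.norm(p=q) ** (q - 1)
    return best_zeta, best_g.view_as(x)
```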