Published as a conference paper at ICLR 2023

WHY ADVERSARIAL TRAINING CAN HURT ROBUST ACCURACY

Jacob Clarysse1, Julia Hörrmann2, Fanny Yang1
1. Department of Computer Science, ETH Zürich
2. Department of Mathematics, ETH Zürich
{jacob.clarysse;fan.yang}@inf.ethz.ch; {julia.hoerrmann}@stat.math.ethz.ch

ABSTRACT

Machine learning classifiers with high test accuracy often perform poorly under adversarial perturbations. It is commonly believed that adversarial training alleviates this issue. In this paper, we demonstrate that, surprisingly, the opposite can be true for a natural class of perceptible perturbations: even though adversarial training helps when enough data is available, it may in fact hurt robust generalization in the small sample size regime. We first prove this phenomenon for a high-dimensional linear classification setting with noiseless observations. Using intuitive insights from the proof, we then identify perturbations on standard image datasets for which this behavior persists. Specifically, it occurs for perceptible perturbations that effectively reduce class information, such as object occlusions or corruptions.

1 INTRODUCTION

Figure 1: On the Waterbirds dataset attacked by the adversarial illumination attack, adversarial training (yellow) yields higher robust error than standard training (blue) when the sample size is small, even though it helps for large sample sizes and in a setting where the standard error of standard training is small (see App. D for details).

Today's best-performing classifiers are vulnerable to adversarial attacks (Goodfellow et al., 2015; Szegedy et al., 2014) and exhibit high robust error: for many inputs, their predictions change under adversarial perturbations, even though the true class stays the same. Such content-preserving (Gilmer et al., 2018), consistent (Raghunathan et al., 2020) attacks can be either perceptible or imperceptible. For image datasets, most work to date studies imperceptible attacks that are based on perturbations with limited strength or attack budget. These include bounded ℓp-norm perturbations (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016), small transformations using image processing techniques (Ghiasi et al., 2019; Zhao et al., 2020; Laidlaw et al., 2021; Luo et al., 2018) or nearby samples on the data manifold (Lin et al., 2020; Zhou et al., 2020). Even though they do not visibly change the image by definition, imperceptible attacks can often successfully fool a learned classifier. On the other hand, perturbations that naturally occur and are physically realizable are commonly perceptible. Some perceptible perturbations specifically target the object to be recognized: these include occlusions (e.g. stickers placed on traffic signs (Eykholt et al., 2018) or masks of different sizes that cover important features of human faces (Wu et al., 2020)) or corruptions that are caused by the image capturing process (animals that move faster than the shutter speed or objects that are not well-lit, see Figure 2). Others transform the whole image and are not confined to the object itself, such as rotations, translations or corruptions (Engstrom et al., 2019; Kang et al., 2019). In this paper, we refer to such perceptible attacks as directed attacks.
In contrast to other attacks, they effectively reduce useful class information in the input for any model, without necessarily changing the true label; we say that they are directed and consistent, which we define more formally in Section 2. For example, a stop sign with a small sticker could partially cover the text without losing its semantic meaning. Similarly, a flying bird captured with a long exposure time can induce motion blur in the final image without becoming unrecognizable to the observer.

Figure 2: Examples of directed attacks on CIFAR-10 and the Waterbirds dataset. In Figure 2a, we corrupt the image with a black mask of size 2×2, and in Figures 2c and 2d we change the lighting conditions (darkening) and apply motion blur to the bird in the image, respectively. All perturbations reduce the information about the class in the images: they are the result of directed attacks. (e) Directed attacks are a subset of perceptible attacks.

In the literature so far, it is widely acknowledged that adversarial training with the same perturbation type and budget as during test time often achieves significantly lower robust error than standard training (Madry et al., 2018; Zhang et al., 2019; Bai et al., 2021). In contrast, we show that adversarial training not only increases standard error (Zhang et al., 2019; Tsipras et al., 2019; Stutz et al., 2019; Raghunathan et al., 2020), but surprisingly, in the low sample regime, adversarial training may even increase the robust error compared to standard training! Figure 1 illustrates the main message of our paper on the Waterbirds dataset: although adversarial training with directed attacks outperforms standard training when enough training samples are available, it is inferior when the sample size is small (but still large enough to obtain a small standard test error). Our contributions are as follows:

- We prove that, almost surely, adversarially training a linear classifier on separable data yields a monotonically increasing robust error as the perturbation budget grows. We further establish high-probability non-asymptotic lower bounds on the robust error gap between adversarial and standard training.
- Our proof provides intuition for why this lower bound on the gap is particularly large for directed attacks in the low sample regime.
- We observe empirically for different directed attacks on real-world image datasets that this behavior persists: adversarial training for directed attacks hurts robust accuracy when the sample size is small.

2 ROBUST CLASSIFICATION

We first introduce our robust classification setting more formally by defining the notions of adversarial robustness, directed attacks and adversarial training used throughout the paper.

Adversarially robust classifiers For inputs x ∈ R^d, we consider multi-class classifiers associated with parameterized functions f_θ : R^d → R^K if K > 2 and f_θ : R^d → R if K = 2, where K is the number of labels. For example, f_θ(x) could be a linear model (as in Section 3) or a neural network (as in Section 4). The output label predictions are obtained by h(f_θ(x)) = sign(f_θ(x)) for K = 2 and h(f_θ(x)) = arg max_{k∈{1,...,K}} f_θ(x)_k for K > 2. In order to convince practitioners to use machine learning models in the wild, it is key to demonstrate that they exhibit robustness.
One kind of robustness is that they do not change their prediction when the input is subject to consistent perturbations, i.e. small class-preserving perturbations. Mathematically speaking, for the underlying joint data distribution P, the model should have a small ϵte-robust error, defined as

Err(θ; ϵte) := E_{(x,y)∼P} [ max_{x′ ∈ T(x; ϵte)} ℓ(f_θ(x′), y) ],   (1)

where ℓ(f_θ(x), y) is 0 if the class determined by h(f_θ(x)) equals y and 1 otherwise. Further, T(x; ϵte) denotes a perturbation set around x of a certain transformation type with size ϵte. Note that the (standard) error E_{(x,y)∼P} ℓ(f_θ(x), y) of a classifier corresponds to Err(θ; 0), the robust error evaluated at ϵte = 0.

Directed attacks The inner maximization in Equation 1 is often called the adversarial attack of the input x for the model f_θ, and the corresponding solution is referred to as the adversarial example. In this paper, we consider directed attacks that effectively reduce the information about the true classes, with image-based examples depicted in Figure 2. For linear classification, we analyze directed attacks in the form of additive perturbations that are constrained to the direction of the optimal decision boundary (see details in Section 3.1). In particular, note that the set of directed perturbations is restricted to directions attacking the Bayes optimal classifier.

Adversarial training A common approach to obtain classifiers with good robust accuracy is to minimize the training objective L_ϵtr with a surrogate robust classification loss L,

L_ϵtr(θ) := (1/n) Σ_{i=1}^{n} max_{x′_i ∈ T(x_i; ϵtr)} L(y_i f_θ(x′_i)),   (2)

also called adversarial training. In practice, we often use the cross-entropy loss L(z) = log(1 + e^{−z}) and minimize the robust objective using first-order optimization methods such as (stochastic) gradient descent. SGD is also the algorithm that we focus on in both the theoretical and experimental sections. When the desired type of robustness is known in advance, it is standard practice to use the same perturbation set for training as for testing, i.e. T(x; ϵtr) = T(x; ϵte). For example, Madry et al. (2018) show that the robust error sharply increases for ϵtr < ϵte. In this paper, we demonstrate that for directed attacks in the small sample size regime, in fact, the opposite is true.

3 THEORETICAL RESULTS

In this section, we prove for linear functions f_θ(x) = θ⊤x that in the case of directed attacks, robust generalization deteriorates with increasing ϵtr. The proof, albeit in a simple setting, explains why adversarial training fails in the high-dimensional regime for such attacks.

3.1 SETTING

We now introduce the precise linear setting used in our theoretical results.

Data model We assume that the ground truth and hypothesis class are given by linear functions f_θ(x) = θ⊤x and that the sample size n is smaller than the ambient dimension d minus one. The generative distribution P_r is similar to (Tsipras et al., 2019; Nagarajan & Kolter, 2019): the label y ∈ {+1, −1} is drawn with equal probability and the covariate vector is sampled as x = [y·r/2, x̃] with the random vector x̃ ∈ R^{d−1} drawn from a standard normal distribution, i.e. x̃ ∼ N(0, σ²I_{d−1}). We would like to learn a classifier that has low robust error by using a dataset D = {(x_i, y_i)}_{i=1}^{n} with n i.i.d. samples from P_r. Intuitively, the separation distance r reflects the signal strength of the data distribution. Notice that the distribution P_r is noiseless: for a given input x, the label y = sign(x[1]) is deterministic.
Further, the Bayes optimal linear classifier (also referred to as the ground truth) is parameterized by the first standard coordinate vector, θ⋆ = e₁.¹ By definition, the ground truth is robust against all perturbations that do not change the sign of the first coordinate of the sample, i.e. consistent perturbations, and hence so is the optimal robust classifier.

Directed attacks In this paper, we focus on consistent directed attacks that by definition efficiently concentrate their attack budget to reduce the class information. For our linear setting this information lies in the first entry. Hence, we can model such attacks by additive perturbations in the first dimension,

T(x; ϵ) = {x′ = x + δ | δ = βe₁ and −ϵ ≤ β ≤ ϵ}.   (3)

Note that this attack is always in the direction of the signal dimension, i.e. of the Bayes optimal classifier or, equivalently, the ground truth. Furthermore, when ϵ < r/2, it is a consistent directed attack. Observe how this is different from ℓp-attacks: an ℓp-attack, depending on the model, may add a perturbation that only has a very small component in the signal direction.

¹Note that the result holds more generally for non-sparse models that are not axis-aligned, by way of a simple rotation z = Ux. In that case the distribution is characterized by θ⋆ = u₁, where u₁ is the first column vector of U, and a rotated Gaussian in the d−1 dimensions orthogonal to θ⋆.

Figure 3: Experimental verification of Theorem 3.1. (a) Robust error increase with ϵtr: we set d = 1000, r = 12 and n = 50 and plot the robust error gap between standard and adversarial training as a function of the adversarial budget ϵtr for 5 independent experiments (blue), together with the lower bound given in Theorem 3.1 (gray). (b) Standard vs. adversarial training: the robust error of standard and adversarial training with ϵtr = 4.5. (c) Effect of overparameterization: the error gap and the lower bound of Theorem 3.1. In (b) and (c), we set d = 10000 and vary the number of samples n. For more experimental details see Appendix C.

Robust max-ℓ2-margin classifier We study the classifier that results from running gradient descent on the adversarial logistic loss. A long line of work (Soudry et al., 2018; Ji & Telgarsky, 2019; Chizat & Bach, 2020; Nacson et al., 2019; Liu et al., 2020) studies the implicit bias of (S)GD on the (standard) logistic loss with separable data and shows directional convergence to the max-margin solution. For the adversarial logistic loss and linear models in particular, (S)GD converges to the robust max-ℓ2-margin solution (Li et al., 2020),

θ̂_ϵtr := arg max_{‖θ‖₂≤1} min_{i∈[n], x′_i∈T(x_i;ϵtr)} y_i θ⊤x′_i.   (4)

Even though our result is proven for the max-ℓ2-margin classifier, it can easily be extended to other interpolators.
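The following minimal sketch illustrates this setup end to end. It is an illustrative reimplementation under our own choices (sampler, step size, number of steps), not the code used for the experiments. For the directed perturbation set of Equation 3 the inner maximization is available in closed form, y_i θ⊤x_i − ϵtr|θ[1]|, so adversarial logistic regression reduces to gradient descent on that worst-case margin loss; by the implicit-bias results above, the normalized iterate approximates the robust max-ℓ2-margin solution of Equation 4.

```python
# Illustrative sketch (not the code used for the experiments): adversarial logistic
# regression in the linear setting of Section 3.1 with the directed perturbation set of
# Eq. (3). For this set the inner maximization is available in closed form:
#   max_{|beta| <= eps} L(y * theta^T (x + beta * e_1)) = L(y * theta^T x - eps * |theta[1]|).
import numpy as np
from scipy.special import expit

def sample_pr(n, d, r, sigma, rng):
    """Draw n samples from P_r: x = [y * r/2, x_tilde], x_tilde ~ N(0, sigma^2 I_{d-1})."""
    y = rng.choice([-1.0, 1.0], size=n)
    x_tilde = sigma * rng.standard_normal((n, d - 1))
    return np.column_stack([y * r / 2.0, x_tilde]), y

def adversarial_logistic_gd(X, y, eps_tr, lr=0.1, steps=20000):
    """Gradient descent on the eps_tr-robust logistic loss (Eq. 2). For linear models the
    normalized iterate approximates the robust max-l2-margin solution of Eq. (4)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ theta) - eps_tr * abs(theta[0])   # worst-case directed margins
        s = expit(-margins)                                   # -dL/dmargin for L(z) = log(1 + e^-z)
        dmargin = y[:, None] * X
        dmargin[:, 0] -= eps_tr * np.sign(theta[0])
        theta -= lr * (-(s[:, None] * dmargin).mean(axis=0))
    return theta / np.linalg.norm(theta)

def robust_error(theta, eps_te, r, sigma, rng, n_test=10000):
    """Monte Carlo estimate of Err(theta; eps_te) under the directed attack."""
    X, y = sample_pr(n_test, len(theta), r, sigma, rng)
    return np.mean(y * (X @ theta) - eps_te * abs(theta[0]) <= 0)

rng = np.random.default_rng(0)
d, n, r, sigma = 1000, 50, 12.0, 1.0
X, y = sample_pr(n, d, r, sigma, rng)
for eps_tr in [0.0, 2.0, 4.0]:
    theta = adversarial_logistic_gd(X, y, eps_tr)
    print(f"eps_tr={eps_tr}: signal weight={theta[0]:.3f}, "
          f"robust error={robust_error(theta, 2.0, r, sigma, rng):.3f}")
```

On typical draws, the signal weight θ[1] of the learned direction shrinks as ϵtr grows, in line with the closed-form expression in Equation 8 below.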
3.2 MAIN RESULTS

We are now ready to characterize the ϵte-robust error as a function of ϵtr, the separation r, the dimension d and the sample size n of the data. In the theorem statement we use the quantities

φmin = σ γ̃min / (r/2 − ϵte)  and  φmax = σ γ̃max / (r/2 − ϵte),

which arise from concentration bounds for the singular values of the random data matrix; here γ̃min and γ̃max are the high-probability bounds on the max-ℓ2-margin of the non-signal components given in Lemma A.2 of Appendix A. Further, let ε̃ < r/2 denote a threshold depending on φmax (its exact form is given in the proof) and denote by Φ the cumulative distribution function of a standard normal.

Theorem 3.1. Assume d − 1 > n. For test samples from P_r, a perturbation set of type T as in Equation 3 and any 0 ≤ ϵte < r/2, the following holds for the ϵte-robust error (Equation 1) of the classifier resulting from ϵtr-adversarial training:

1. The ϵte-robust error of the ϵtr-robust max-margin estimator reads
   Err(θ̂_ϵtr; ϵte) = Φ((ϵtr − r/2) / φ̃)
   for a random quantity φ̃ > 0 depending on σ, r, ϵte, and is hence strictly increasing in the adversarial training budget ϵtr.

2. With probability at least 1 − δ, we further have φmin ≤ φ̃ ≤ φmax and the following lower bound on the robust error increase from adversarially training with budget ϵtr:
   Err(θ̂_ϵtr; ϵte) − Err(θ̂_0; ϵte) ≥ Φ((min{ϵtr, ε̃} − r/2) / φmin) − Φ(−(r/2) / φmin).

The proof can be found in Appendix A and primarily relies on estimates of the singular values of high-dimensional random matrices. Note that the theorem holds for any 0 ≤ ϵte < r/2 and hence also directly applies to the standard error by setting ϵte = 0.

Figure 4: (a) Robust error vs. ϵtr: we set d = 1000 and r = 12 and plot the robust error as a function of the adversarial training budget ϵtr for different d/n. (b) Robust error decomposition: the robust error decomposed into susceptibility and standard error as a function of the adversarial budget ϵtr. Full experimental details can be found in Appendix C. (c) Intuition in 2D: the effect of adversarial training with directed attacks is captured by the yellow dotted lines; adversarially perturbed training points move closer to the true boundary, which in turn tilts the decision boundary more heavily in the wrong direction.

In Figure 3, we empirically confirm the statements of Theorem 3.1 by performing multiple experiments on synthetic datasets as described in Subsection 3.1 with different choices of d/n and ϵtr. In the first statement, we prove that for small-sample-size (n < d − 1) noiseless data, almost surely, the robust error increases monotonically with the adversarial training budget ϵtr > 0. In Figure 3a, we plot the robust error gap between standard and adversarial logistic regression as a function of the adversarial training budget ϵtr for 5 runs. The second statement establishes a simplified lower bound on the robust error increase of adversarial training (for a fixed ϵtr = ϵte) compared to standard training. In Figures 3a and 3c, we show how the lower bound closely predicts the robust error gap in our synthetic experiments. Furthermore, by the dependence of φmin on the overparameterization ratio d/n, the lower bound on the robust error gap is amplified for large d/n. Indeed, Figure 3c shows how the error gap increases with d/n both theoretically and experimentally. However, when d/n increases above a certain threshold, the gap decreases again, as standard training fails to learn the signal and yields a high error (see Figure 3b).
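For a quick numerical reading of the theorem, the sketch below evaluates the closed-form robust error of statement 1 and the gap lower bound of statement 2. The concentration expressions used for γ̃min and γ̃max (and hence for φmin, φmax and ε̃) are our reconstruction of Lemma A.2 and should be treated as assumptions rather than a verbatim restatement.

```python
# Sketch: evaluating Theorem 3.1 numerically. The expressions
#   gamma_min = sigma*(sqrt(d-1) - sqrt(n) - t)/sqrt(n),
#   gamma_max = sigma*(sqrt(d-1) + sqrt(n) + t)/sqrt(n),
# and eps_tilde = r/2 - phi_max are reconstructions/assumptions, not verbatim from the paper.
import numpy as np
from scipy.stats import norm

d, n, r, sigma, eps_te, delta = 1000, 50, 12.0, 1.0, 2.0, 0.05
t = np.sqrt(2 * np.log(2 / delta))        # concentration level for probability 1 - delta

gamma_min = sigma * (np.sqrt(d - 1) - np.sqrt(n) - t) / np.sqrt(n)
gamma_max = sigma * (np.sqrt(d - 1) + np.sqrt(n) + t) / np.sqrt(n)
phi_min = sigma * gamma_min / (r / 2 - eps_te)
phi_max = sigma * gamma_max / (r / 2 - eps_te)
eps_tilde = r / 2 - phi_max               # assumed form of the threshold in statement 2

def robust_error(eps_tr, phi):
    """Statement 1: Err(theta_hat_{eps_tr}; eps_te) = Phi((eps_tr - r/2) / phi)."""
    return norm.cdf((eps_tr - r / 2) / phi)

def gap_lower_bound(eps_tr):
    """Statement 2: lower bound on Err(eps_tr) - Err(0), with the min{eps_tr, eps_tilde} cap."""
    e = min(eps_tr, eps_tilde)
    return norm.cdf((e - r / 2) / phi_min) - norm.cdf(-(r / 2) / phi_min)

for eps_tr in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"eps_tr={eps_tr:.1f}: Err in [{robust_error(eps_tr, phi_min):.4f}, "
          f"{robust_error(eps_tr, phi_max):.4f}], gap lower bound >= {gap_lower_bound(eps_tr):.4f}")
```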
3.3 PROOF INTUITION

The reason that adversarial training hurts robust generalization is an extreme robust vs. standard error trade-off. We now provide intuition for the effect of directed attacks and the low sample regime on the ϵtr-robust max-ℓ2-margin solution by decomposing the robust error Err(θ; ϵte). Notice that the ϵte-robust error Err(θ; ϵte) can be written as the probability of the union of two events: the event that the classifier based on θ is wrong and the event that the classifier is susceptible to attacks:

Err(θ; ϵte) = E_{(x,y)∼P} [ I{y f_θ(x) < 0} ∨ max_{x′∈T(x;ϵte)} I{f_θ(x) f_θ(x′) < 0} ] ≤ Err(θ; 0) + Susc(θ; ϵte),   (7)

where Susc(θ; ϵte) is the expectation of the maximization term in Equation 7. Susc(θ; ϵte) represents the ϵte-attack-susceptibility of the classifier induced by θ and Err(θ; 0) its standard error. In our linear setting, we can also lower bound the robust error by Err(θ; 0) + ½ Susc(θ; ϵte). Hence, Equation 7 suggests that the robust error can only be small if both the standard error and the susceptibility are small. In Figure 4b, we plot the decomposition of the robust error into standard error and susceptibility for adversarial logistic regression with increasing ϵtr. We observe that increasing ϵtr increases the standard error too drastically compared to the decrease in susceptibility, leading to a drop in robust accuracy. For completeness, in Appendix B, we provide upper and lower bounds for the susceptibility score.

We now explain why, in the small sample size regime, adversarial training with directed attacks (Equation 3) may increase the standard error to the extent that it dominates the decrease in susceptibility. A key observation is that the robust max-ℓ2-margin solution of a dataset D = {(x_i, y_i)}_{i=1}^{n} maximizes the minimum margin min_{i∈[n]} (y_i θ⊤x_i − ϵtr |θ[1]|), where θ[j] refers to the j-th entry of the vector θ. Therefore, it simply corresponds to the max-ℓ2-margin solution of the dataset shifted towards the decision boundary, D_ϵtr = {(x_i − y_i ϵtr sign(θ̂_ϵtr[1]) e₁, y_i)}_{i=1}^{n}. Using this fact, we obtain a closed-form expression for the (normalized) max-margin solution (Equation 4) as a function of ϵtr that reads

θ̂_ϵtr = 1/√((r − 2ϵtr)² + 4γ̃²) · [r − 2ϵtr, 2γ̃ θ̃],   (8)

where ‖θ̃‖₂ = 1 and γ̃ > 0 is a random quantity associated with the max-ℓ2-margin solution of the (d−1)-dimensional Gaussian inputs orthogonal to the signal direction (see Lemma A.1 in Appendix A). In high dimensions, with high probability any two Gaussian random vectors are far apart; in our distributional setting, this corresponds to the training points being far apart in the non-signal directions. In Figure 4c, we illustrate the phenomenon using a 2D cartoon, where the few samples in the dataset are all far apart in the non-signal direction. We see how shifting the dataset closer to the true decision boundary may result in a max-margin solution (yellow) that aligns much worse with the ground truth (gray), compared to the estimator learned from the original points (blue). Even though the new (robust max-margin) classifier (yellow) is less susceptible to attacks in the signal dimension, it also uses the signal dimension less. Mathematically, this is reflected in the expression for the max-margin solution in Equation 8: the first (signal) dimension is used less as ϵtr increases.
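The decomposition in Equation 7 for classifiers of the form of Equation 8 can also be checked by simple Monte Carlo simulation. In the sketch below, θ̃ and γ̃ are stand-ins (a fixed random unit vector and the typical margin scale σ√((d−1)/n)) rather than the actual max-margin quantities of a training set, so only the qualitative behavior is meaningful.

```python
# Sketch: Monte Carlo check of Err(theta; 0) <= Err(theta; eps_te) <= Err(theta; 0) + Susc(theta; eps_te)
# (Eq. 7) for classifiers of the closed form of Eq. (8). theta_tilde and gamma are stand-ins,
# not the actual max-margin quantities of a training set.
import numpy as np

rng = np.random.default_rng(0)
d, n, r, sigma, eps_te = 1000, 50, 12.0, 1.0, 2.0

theta_tilde = rng.standard_normal(d - 1)
theta_tilde /= np.linalg.norm(theta_tilde)
gamma = sigma * np.sqrt((d - 1) / n)          # typical margin scale (stand-in)

def theta_hat(eps_tr):
    """Closed form of Eq. (8): [r - 2*eps_tr, 2*gamma*theta_tilde], normalized."""
    v = np.concatenate([[r - 2 * eps_tr], 2 * gamma * theta_tilde])
    return v / np.linalg.norm(v)

def decompose(theta, n_test=10000):
    """Estimate standard error, susceptibility and robust error under the directed attack."""
    y = rng.choice([-1.0, 1.0], size=n_test)
    X = np.column_stack([y * r / 2, sigma * rng.standard_normal((n_test, d - 1))])
    clean = y * (X @ theta)
    std_err = np.mean(clean <= 0)
    susc = np.mean(np.abs(X @ theta) <= eps_te * abs(theta[0]))   # prediction can be flipped
    robust_err = np.mean(clean - eps_te * abs(theta[0]) <= 0)
    return std_err, susc, robust_err

for eps_tr in [0.0, 2.0, 4.0]:
    s, u, rob = decompose(theta_hat(eps_tr))
    print(f"eps_tr={eps_tr}: std={s:.3f} susc={u:.3f} robust={rob:.3f} union bound={s + u:.3f}")
```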
3.4 GENERALITY OF THE RESULTS

In this section we discuss how Theorem 3.1 might generalize to other perturbation sets and models.

Signal direction is known The type of additive perturbations used in Theorem 3.1, defined in Equation 3, is explicitly constrained to the direction of the true signal. This choice is reminiscent of corruptions where every possible perturbation in the set is directly targeted at the object to be recognized, such as motion blur of moving objects. Such corruptions are also studied in the context of domain generalization and adaptation (Schneider et al., 2020). Directed attacks in general, however, may also consist of perturbation sets that are only strongly biased towards the true signal direction and that find the true signal direction only when the inner maximization is exact. The following corollary extends Theorem 3.1 to small ℓ1-perturbations

T(x; ϵ) = {x′ = x + δ | ‖δ‖₁ ≤ ϵ},   (9)

for 0 < ϵ < r/2, which reflect such attacks. We state the corollary here and give the proof in Appendix A.

Corollary 3.2. Theorem 3.1 also holds for the estimator in Equation 4 with the perturbation sets defined in Equation 9.

The proof uses the fact that the inner maximization effectively results in a sparse perturbation equivalent to the attack resulting from the perturbation set defined in Equation 3.

Other models Motivated by the implicit bias results for (stochastic) gradient descent on the logistic loss, Theorem 3.1 is proven for the max-ℓ2-margin solution. We conjecture that for the data distribution in Section 3, adversarial training can hurt robust generalization also for other models with zero training error (interpolators in short). For example, AdaBoost is a widely used algorithm that converges to the max-ℓ1-margin classifier (Telgarsky, 2013). One might argue that for a sparse ground truth, the max-ℓ1-margin classifier should (at least in the noiseless case) have the right inductive bias to alleviate large bias in high dimensions. Hence, in many cases the (sparse) max-ℓ1-margin solution might align with the ground truth for a given dataset. However, we conjecture that even in this case, the robust max-ℓ1-margin solution would be misled into choosing a wrong sparse solution. This can be seen with the help of the cartoon illustration in Figure 4c.

4 REAL-WORLD EXPERIMENTS

In this section, we demonstrate that the proof intuition for the linear case may generalize to more complex models. Specifically, the insights from Section 3 helped us to identify realistic directed attacks on standard image datasets for which adversarial training hurts robust accuracy in the low sample regime. In what follows, we present experimental results for corruption attacks on the Waterbirds dataset. Due to space constraints, results on the mask attacks on CIFAR-10 can be found in Appendix E. The corresponding experimental details and more results on additional image datasets (such as the hand gesture dataset) can be found in Appendices D, E and F.

Figure 5: Experiments on the Waterbirds dataset with the adversarial illumination attack and ϵte = 0.3. We plot the mean and the standard deviation of the mean over several independent experiments. (a) The robust error increases with larger ϵtr in the low sample size regime. (b) For n = 20, the robust error decomposition as in Equation 7 with increasing ϵtr: while the susceptibility decreases slightly, the increase in standard error is much more severe, resulting in an increase in robust error. (c) The robust error of standard and adversarial training as a function of the number of samples, where the smallest sample size still yields a small (< 10%) standard test error for standard training. While adversarial training hurts for small sample sizes, it helps for larger sample sizes. For more experimental details see App. D.

4.1 DATASETS AND MODELS

We consider three datasets: the Waterbirds dataset, CIFAR-10 and a hand gesture dataset. Due to space constraints, we describe CIFAR-10 and the hand gesture dataset in Appendices E and F. Apart from CIFAR-10 and the hand gesture dataset, we build a new version of the Waterbirds dataset, consisting of images of water- and landbirds of size 256×256 and labels that distinguish the two types of birds. Using code provided by Sagawa et al. (2020), we construct the dataset as follows: first, we sample equally many water- and landbirds from the CUB-200 dataset (Welinder et al., 2010).
Then, we segment the birds and paste them onto a background image that is randomly sampled (without replacement) from the Places-256 dataset (Zhou et al., 2017). Following the choice of Sagawa et al. (2020), we use as models a ResNet-50 and a ResNet-18 that were both pretrained on ImageNet and achieve near perfect standard accuracy. In Appendix D, we complement the results of this section by reporting the results of similar experiments with different architectures.

4.2 IMPLEMENTATION OF THE DIRECTED ATTACKS

In this section, we consider two attacks on the Waterbirds dataset: motion blur and adversarial illumination, as depicted in Figure 2. In Appendix E, we also discuss the mask attack, which mimics physically realizable occlusions of objects in images (Eykholt et al., 2018; Wu et al., 2020). Motion blur, on the other hand, may arise naturally when photographing fast-moving objects with a slow shutter speed. Lastly, adversarial illumination may result from adversarial lighting conditions. Next, we describe the motion blur and adversarial illumination attacks in more detail.

Motion blur For the Waterbirds dataset we can implement motion blur attacks on the object (the bird) specifically, a natural corruption that could occur if birds move at speeds that are faster than the shutter speed. The aim is to be robust against all motion blur severity levels up to Mmax = 15. To simulate motion blur, we apply a motion blur filter with a kernel of size M to the segmented bird before we paste it onto the background image. We can change the severity level of the motion blur by increasing the kernel size of the filter; see Appendix D for concrete expressions of the motion blur kernel. Intuitively, the worst attack should be the most severe blur, rendering a search over a range of severities superfluous. However, similar to rotations, this is not necessarily true in practice since the training loss of neural networks is generally nonconvex. Therefore, for an exact evaluation of the robust error at test time, we perform a full grid search over all kernel sizes in [1, 2, . . . , Mmax]. We refer to Figure 2d and Appendix D for an illustration of our motion blur attack. During training time, we perform an approximate search over kernels with sizes 2i for i = 1, . . . , Mmax/2.

Adversarial illumination As a second attack on the Waterbirds dataset, we consider adversarial illumination. The adversary can darken or brighten the bird without corrupting the background of the image. The attack aims to model images where the object of interest is hidden in shadows or placed against bright light. To compute the adversarial illumination attack, we modify the brightness of the segmented bird by adding a constant a ∈ [−ϵte, ϵte] to all pixel values before pasting the bird onto the background image. With an analogous argument as for the adversarial search for motion blur, exact evaluation requires an actual search over the interval [−ϵte, ϵte]. We find the most adversarial lighting level, i.e. the value of a, by partitioning the interval [−ϵte, ϵte] into K equidistant steps and performing a full list-search over all steps. See Figure 2c and Appendix D for an illustration of the adversarial illumination attack. We choose K = 65 at test time and K = 33 at training time.
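The sketch below illustrates how such attacks can be implemented and evaluated with an exact list search. It is an illustrative sketch under our own assumptions: the horizontal-line motion-blur kernel, the mask-based compositing and the toy model at the end are placeholders, not the exact implementation used for the experiments (the paper's kernel is specified in Appendix D).

```python
# Illustrative sketch of the two directed attacks on a Waterbirds-style example.
# Assumptions: `bird` and `background` are float32 arrays in [0, 1] of shape (H, W, 3),
# `mask` has shape (H, W, 1) and segments the bird; the kernel below (ones on the middle
# row) is one common motion-blur kernel, used here as a stand-in.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import convolve

def motion_blur(bird, kernel_size):
    """Blur only the segmented bird with a normalized K x K horizontal-line kernel."""
    if kernel_size <= 1:
        return bird
    k = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    k[kernel_size // 2, :] = 1.0 / kernel_size
    return np.stack([convolve(bird[..., c], k, mode="nearest") for c in range(3)], axis=-1)

def compose(bird, mask, background):
    """Paste the (possibly corrupted) bird onto the background image."""
    return np.clip(mask * bird + (1.0 - mask) * background, 0.0, 1.0).astype(np.float32)

def to_batch(img):
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)

def worst_case(model, label, candidates):
    """Exact inner maximization: full list search over candidate perturbed images."""
    losses = [F.cross_entropy(model(to_batch(c)), torch.tensor([label])) for c in candidates]
    return candidates[int(torch.stack(losses).argmax())]

def motion_blur_candidates(bird, mask, background, kernel_sizes):
    return [compose(motion_blur(bird, m), mask, background) for m in kernel_sizes]

def illumination_candidates(bird, mask, background, eps, n_steps):
    """Additive brightness shift a in [-eps, eps] applied to the bird pixels only."""
    return [compose(bird + a, mask, background) for a in np.linspace(-eps, eps, n_steps)]

# Toy usage with random data and a small model (stand-ins for the ResNet and real images).
H = W = 64
bird = np.random.rand(H, W, 3).astype(np.float32)
background = np.random.rand(H, W, 3).astype(np.float32)
mask = (np.random.rand(H, W, 1) > 0.5).astype(np.float32)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * H * W, 2))
x_blur = worst_case(model, 1, motion_blur_candidates(bird, mask, background, range(1, 16)))
x_light = worst_case(model, 1, illumination_candidates(bird, mask, background, eps=0.3, n_steps=65))
```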
Adversarial training For all datasets and attacks, we run SGD until convergence on the robust cross-entropy loss (Equation 2). In each iteration, we search for an adversarial example as described above and update the weights using a gradient with respect to the resulting perturbed example (Goodfellow et al., 2015; Madry et al., 2018). For every experiment, we choose the learning rate and weight decay parameters that minimize the robust error on a hold-out dataset.
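A single training step of this procedure can be sketched as follows; the discrete severity search plays the role of the inner maximization in Equation 2. The helper names and the toy usage at the bottom are our own illustrations, not the experimental code.

```python
# Sketch of one adversarial-training step with a discrete attack search, as used for the
# motion-blur and illumination attacks (approximate grid during training). Illustrative only;
# the per-severity batches would be produced by attack functions such as those sketched above.
import torch
import torch.nn.functional as F

def robust_training_step(model, optimizer, candidate_batches, labels):
    """candidate_batches: list of tensors of shape (B, C, H, W), one entry per attack severity.
    Picks, for every example, the severity with the highest loss and takes an SGD step on it."""
    model.eval()
    with torch.no_grad():
        # per-example loss for every candidate severity: shape (num_candidates, B)
        losses = torch.stack([
            F.cross_entropy(model(xb), labels, reduction="none") for xb in candidate_batches
        ])
        worst = losses.argmax(dim=0)                          # (B,) index of the worst severity
    stacked = torch.stack(candidate_batches)                  # (num_candidates, B, C, H, W)
    worst_inputs = stacked[worst, torch.arange(labels.shape[0])]

    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(worst_inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Minimal usage example with a toy model and random "severity" batches.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
labels = torch.randint(0, 2, (8,))
candidates = [torch.rand(8, 3, 32, 32) for _ in range(5)]     # e.g. 5 blur severities
print(robust_training_step(model, opt, candidates, labels))
```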
Figure 6: Experiments on the (subsampled) Waterbirds dataset using the motion blur attack. (a) Even though adversarial training hurts robust generalization for low sample size (n = 20), it helps for n = 50. (b) For n = 20, the decomposition of the robust error into standard error and susceptibility as a function of the adversarial budget ϵtr: the increase in standard error is more severe than the drop in susceptibility, leading to a slight increase in robust error. (c) The robust error of standard and adversarial training, in settings where the test error after standard training is small, as a function of the number of samples. While adversarial training hurts for small sample sizes, it helps for larger sample sizes. For each experiment we plot the mean and the standard deviation of the mean over independent experiments. For more experimental details see App. D.

4.3 ADVERSARIAL TRAINING CAN HURT ROBUST GENERALIZATION

We now present our experimental results on the Waterbirds dataset for both the motion blur and the adversarial illumination attack. First of all, Figures 5a and 6a show that the phenomenon characterized in the linear setting by Theorem 3.1 also occurs for directed attacks on the Waterbirds dataset: as we increase the adversarial training budget ϵtr starting from zero (standard training), the robust error monotonically increases. Furthermore, to gain intuition as described in Section 3.3, we also plot the robust error decomposition (Equation 7), consisting of the standard error and the susceptibility, in Figures 5b and 6b. Recall that we measure susceptibility as the fraction of data points in the test set for which the classifier predicts a different class under an adversarial attack. As in our linear example, we observe an increase in robust error despite a slight drop in susceptibility, because of the more severe increase in standard error. Moreover, Figures 1 and 6c show that, analogous to our linear example, this phenomenon is specific to the low sample regime: for large sample sizes adversarial training outperforms standard training as expected. Note again that even the smallest sample size is large enough to yield a standard test error below 10% for standard training. Similar experiments for CIFAR-10 can be found in Appendix E. Finally, we empirically confirm in Appendix D.8 that our phenomenon is specific to directed attacks: for undirected attacks such as bounded ℓ∞ and ℓ2 ball perturbations, adversarial training helps robust generalization also in the low sample size regime.

4.4 DISCUSSION

We now discuss how different algorithmic choices, motivated by related work, might affect how adversarial training hurts robust generalization.

Catastrophic overfitting Often the worst-case perturbation during adversarial training is found using an approximate algorithm such as SGD. It is common belief that using the strongest attack (in the motion blur case, a full grid search) during training also results in better robust generalization. In particular, the literature on catastrophic overfitting shows that weaker attacks during training lead to bad performance against stronger attacks during testing (Wong et al., 2020; Andriushchenko & Flammarion, 2020; Li et al., 2021). Our results suggest the opposite in the low sample size regime for directed attacks: the weaker the attack during training, the better adversarial training performs.

Robust overfitting Recent work observes empirically (Rice et al., 2020) and theoretically (Sanyal et al., 2020; Donhauser et al., 2021) that perfectly minimizing the adversarial loss during training might in fact be suboptimal for robust generalization; that is, classical regularization techniques might lead to higher robust accuracy. This phenomenon is often referred to as robust overfitting. Can the phenomenon be mitigated using standard regularization techniques? In Appendix D we shed light on this question and show that adversarial training hurts robust generalization even when standard regularization methods such as early stopping are used.

5 RELATED WORK

Robust and non-robust useful features In the words of Ilyas et al. (2019) and Springer et al. (2021), we can describe the intuition behind our phenomenon as follows: for directed attacks, all robust features become less useful, but adversarial training uses robust features more. In the small sample size regime, n < d − 1 in particular, robust learning assigns too much weight to the robust (possibly non-useful) features, which then dominate the non-robust (but useful) features. Even though these works define the concepts, they do not make our statement; rather, they show that adversarial training reduces the reliance on non-robust but possibly useful features.

Small sample size and robustness A direct consequence of Theorem 3.1 is that in order to achieve the same robust error as standard training, adversarial training requires more samples. This statement might remind the reader of sample complexity results for robust generalization in Schmidt et al. (2018); Yin et al. (2019); Khim & Loh (2018). While those results compare sample complexity bounds for standard vs. robust error, our theorem statement compares two algorithms, standard vs. adversarial training, with respect to the robust error.

Trade-off between standard and robust error Many papers have observed that even though adversarial training decreases the robust error compared to standard training, it may lead to an increase in standard test error (Madry et al., 2018; Zhang et al., 2019). For example, Tsipras et al. (2019); Zhang et al. (2019); Javanmard et al. (2020); Dobriban et al. (2020); Chen et al. (2020) study settings where the Bayes optimal robust classifier is not equal to the Bayes optimal (standard) classifier (i.e. the perturbations are inconsistent or the dataset is non-separable). Raghunathan et al. (2020) study consistent perturbations, as in our paper, and prove that for small sample size, fitting adversarial examples can increase standard error even in the absence of noise. Empirically, Dong et al. (2021); Mendonça et al. (2022) show that for ℓp-attacks low-quality data might be the main cause of the trade-off. While the aforementioned works focus on the decrease in standard error, we prove that for directed attacks, in the small sample regime, adversarial training may in fact increase the robust error.

Mitigation of the trade-off A long line of work has proposed procedures to mitigate the trade-off between robust and standard accuracy.
For example Alayrac et al. (2019); Carmon et al. (2019); Zhai et al. (2019); Raghunathan et al. (2020) study robust self training, which leverages a large set of unlabelled data, while Lee et al. (2020); Lamb et al. (2019); Xu et al. (2020) use data augmentation by interpolation. Ding et al. (2020); Balaji et al. (2019); Cheng et al. (2020) on the other hand propose to use adaptive perturbation budgets ϵtr that vary across inputs. The intuition behind our theoretical analysis suggests that the standard mitigation procedures for imperceptible perturbations may not work for perceptible directed attacks, because all relevant features are non-robust. We leave a thorough empirical study as interesting future work. 6 SUMMARY AND FUTURE WORK This paper aims to caution the practitioner against blindly following current widespread practices to increase the robust performance of machine learning models. Specifically, adversarial training is currently recognized to be one of the most effective defense mechanisms for ℓp-perturbations, significantly outperforming robust performance of standard training. However, we prove that in the low sample size regime this common wisdom is not applicable for consistent directed attacks, which efficiently focus their attack budget to target the ground truth class information. In terms of follow-up work on directed attacks in the low sample regime, there are some concrete questions that would be interesting to explore. For example, as discussed in Section 5, it would be useful to test whether some methods to mitigate the standard accuracy vs. robustness trade-off would also relieve the perils of adversarial training for directed attacks. Further, we hypothesize that when few samples are available, one should avoid training with attacks that may heavily reduce class information, independently of the attacks at test time. If this hypothesis were confirmed, it would break with yet another general rule that the best defense perturbation type should always match the attack during evaluation. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENT Supported by the Hasler Foundation grant number 21050. Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Alhussein Fawzi, Robert Stanforth, and Pushmeet Kohli. Are labels required for improving adversarial robustness? Advances in Neural Information Processing Systems, pp. 12214 12223, 2019. Maksym Andriushchenko and Nicolas Flammarion. Understanding and improving fast adversarial training. Advances in Neural Information Processing Systems, 2020. Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. Recent advances in adversarial training for adversarial robustness. In Zhi-Hua Zhou (ed.), The 30th International Joint Conference on Artificial Intelligence, pp. 4312 4321. International Joint Conferences on Artificial Intelligence Organization, 2021. Yogesh Balaji, Tom Goldstein, and Judy Hoffman. Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets. ar Xiv preprint ar Xiv:1910.08051, 2019. G. Bradski. The Open CV Library. Dr. Dobb s Journal of Software Tools, 2000. Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. In The 33rd International Conference on Neural Information Processing Systems, pp. 11192 11203, 2019. Lin Chen, Yifei Min, Mingrui Zhang, and Amin Karbasi. More data can expand the generalization gap between adversarially robust and standard models. 
In The 36th International Conference on Machine Learning, pp. 1670 1680, 2020. Minhao Cheng, Qi Lei, Pin-Yu Chen, Inderjit Dhillon, and Cho-Jui Hsieh. Cat: Customized adversarial training for improved robustness. ar Xiv preprint ar Xiv:2002.06789, 2020. L ena ıc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In The 7th International Conference on Learning Theory, pp. 1305 1338, 2020. Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In The 37th International Conference on Machine Learning, pp. 2206 2216, 2020. Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. Mma training: Direct input space margin maximization through adversarial training. In The 8th International Conference on Learning Representations, 2020. Edgar Dobriban, Hamed Hassani, David Hong, and Alexander Robey. Provable tradeoffs in adversarially robust classification. ar Xiv preprint ar Xiv:2006.05161, 2020. Chengyu Dong, Liyuan Liu, and Jingbo Shang. Data quality matters for adversarial training: An empirical study, 2021. Konstantin Donhauser, Alexandru Tifrea, Michael Aerni, Reinhard Heckel, and Fanny Yang. Interpolation can hurt robust generalization even when there is no noise. The 36th conference on Advances in Neural Information Processing Systems, 2021. Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In The 36th International Conference on Machine Learning, pp. 1802 1811, 2019. Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1625 1634, 2018. Published as a conference paper at ICLR 2023 Amin Ghiasi, Ali Shafahi, and Tom Goldstein. Breaking certified defenses: semantic adversarial examples with spoofed robustness certificates. In The 6th International Conference on Learning Representations, 2019. Justin Gilmer, Ryan P Adams, Ian Goodfellow, David Andersen, and George E Dahl. Motivating the rules of the game for adversarial example research. ar Xiv preprint ar Xiv:1807.06732, 2018. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In The 3th International Conference on Learning Representations, pp. 1 10, 2015. Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In The 33rd conference on Advances in Neural Information Processing Systems, pp. 125 136, 2019. Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pp. 2034 2078, 2020. Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In The 32nd Conference on Learning Theory, pp. 1772 1798, 2019. Daniel Kang, Yi Sun, Tom Brown, Dan Hendrycks, and Jacob Steinhardt. Transfer of adversarial robustness between perturbation types. ar Xiv e-prints, pp. ar Xiv 1905, 2019. Justin Khim and Po-Ling Loh. Adversarial risk bounds via function transformation. ar Xiv preprint ar Xiv:1810.09519, 2018. Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. 
Perceptual adversarial robustness: Defense against unseen threat models. In The 9th International Conference on Learning Representation, 2021. Alex Lamb, Vikas Verma, Juho Kannala, and Yoshua Bengio. Interpolated adversarial training: Achieving robust neural networks without sacrificing too much accuracy. In The 12th ACM Workshop on Artificial Intelligence and Security, pp. 95 103, 2019. Saehyung Lee, Hyungyu Lee, and Sungroh Yoon. Adversarial vertex mixup: Toward better adversarially robust generalization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Bai Li, Shiqi Wang, Suman Jana, and Lawrence Carin. Towards understanding fast adversarial training. ar Xiv preprint ar Xiv:2006.03089, 2021. Yan Li, Ethan X.Fang, Huan Xu, and Tuo Zhao. Implicit bias of gradient descent based adversarial training on separable data. In The 8th International Conference on Learning Representations, 2020. Wei-An Lin, Chun Pong Lau, Alexander Levine, Rama Chellappa, and Soheil Feizi. Dual manifold adversarial robustness: Defense against lp and non-lp adversarial attacks. In The 34th conference on Advances in Neural Information Processing Systems, pp. 3487 3498, 2020. Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, and Sabine S usstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. In The 35th conference on Advances in Neural Information Processing Systems, pp. 21476 21487, 2020. Bo Luo, Yannan Liu, Lingxiao Wei, and Qiang Xu. Towards imperceptible and robust adversarial example attacks against neural networks. In The 32nd AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 2018. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In The 6th International Conference on Learning Representations, 2018. Tom as Mantec on, Carlos R. del Blanco, Fernando Jaureguizar, and Narciso Garc ıa. A real-time gesture recognition system using near-infrared imagery. PLOS ONE, pp. 1 17, 2019. Published as a conference paper at ICLR 2023 Marcele OK Mendonc a, Javier Maroto, Pascal Frossard, and Paulo SR Diniz. Adversarial training with informed data selection. In The 30th European Signal Processing Conference (EUSIPCO), pp. 608 612, 2022. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR), pp. 2574 2582, 2016. Abdullah Mujahid, Mazhar Javed Awan, Awais Yasin, Mazin Abed Mohammed, Robertas Damaˇseviˇcius, Rytis Maskeli unas, and Karrar Hameed Abdulkareem. Real-time hand gesture recognition based on deep learning yolov3 model. Applied Sciences, 2021. Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3051 3059, 2019. Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In The 33d conference on Advances in Neural Information Processing Systems, pp. 11611 11622, 2019. Munir Oudah, Ali Al-Naji, and Javaan Chahl. Hand gesture recognition based on computer vision: A review of techniques. Journal of Imaging, 2020. 
Huy Phan. huyvnphan/pytorch cifar10, 1 2021. Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. In The 37th International Conference on Machine Learning, pp. 7909 7919, 2020. Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In The 37th International Conference on Machine Learning, pp. 8093 8104, 2020. Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In The 7th International Conference on Learning Representations, 2020. Amartya Sanyal, Puneet K Dokania, Varun Kanade, and Philip Torr. How benign is benign overfitting? In The 8th International Conference on Learning Representations, 2020. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In The 32nd conference Advances in Neural Information Processing Systems, 2018. Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In The 34th conference on Advances in Neural Information Processing Systems, pp. 11539 11551, 2020. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, pp. 1 57, 2018. Jacob M Springer, Melanie Mitchell, and Garrett T Kenyon. Adversarial perturbations are not so weird: Entanglement of robust and non-robust features in neural network classifiers. ar Xiv preprint ar Xiv:2102.05110, 2021. David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6967 6987, 2019. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In The 2nd International Conference on Learning Representations, 2014. Matus Telgarsky. Margins, shrinkage, and boosting. In The 30th International Conference on Machine Learning, pp. 307 315, 2013. Published as a conference paper at ICLR 2023 Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In The 7th International Conference on Learning Representations, 2019. Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010. Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training. In The 8th International Conference on Learning Representations, 2020. Tong Wu, Liang Tong, and Yevgeniy Vorobeychik. Defending against physically realizable attacks on image classification. In The 8th International Conference on Learning Representations, 2020. Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In The AAAI Conference on Artificial Intelligence, pp. 6502 6509, 2020. Shuai Yang, Prashan Premaratne, and Peter Vial. Hand gesture recognition: An overview. 
In The 5th IEEE International Conference on Broadband Network Multimedia Technology, pp. 63 69, 2013. Dong Yin, Ramchandran Kannan, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In The 36th International conference on machine learning, pp. 7085 7094, 2019. Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. ar Xiv preprint ar Xiv:1906.00555, 2019. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In The 36th International Conference on Machine Learning, pp. 7472 7482, 2019. Zhengyu Zhao, Zhuoran Liu, and Martha Larson. Towards large yet imperceptible adversarial image perturbations with perceptual color distance. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1039 1048, 2020. Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. Jianli Zhou, Chao Liang, and Jun Chen. Manifold projection for adversarial defense on face recognition. In The 16th European Conference on Computer Vision, pp. 288 305, 2020. Published as a conference paper at ICLR 2023 A THEORETICAL STATEMENTS FOR THE LINEAR MODEL Before we present the proof of the theorem, we introduce two lemmas are of separate interest that are used throughout the proof of Theorem 1. Recall that the definition of the (standard normalized) maximum-ℓ2-margin solution (max-margin solution in short) of a dataset D = {(xi, yi)}n i=1 corresponds to bθ := arg max θ 2 1 min i [n] yiθ xi, (10) by simply setting ϵtr = 0 in Equation 4. The ℓ2-margin of bθ then reads mini [n] yibθ xi. Furthermore for a dataset D = {(xi, yi)}n i=1 we refer to the induced dataset e D as the dataset with covariate vectors stripped of the first element, i.e. e D = {( xi, yi)}n i=1 := {((xi)[2:d], yi)}n i=1, (11) where (xi)[2:d] refers to the last d 1 elements of the vector xi. Furthermore, remember that for any vector z, z[j] refers to the j-th element of z and ej denotes the j-th canonical basis vector. Further, recall the distribution Pr as defined in Section 3.1: the label y {+1, 1} is drawn with equal probability and the covariate vector is sampled as x = [y r 2, x] where x Rd 1 is a random vector drawn from a standard normal distribution, i.e. x N(0, σ2Id 1). We generally allow r, used to sample the training data, to differ from rtest, which is used during test time. The following lemma derives a closed-form expression for the normalized max-margin solution for any dataset with fixed separation r in the signal component, and that is linearly separable in the last d 1 coordinates with margin γ. Lemma A.1. Let D = {(xi, yi)}n i=1 be a dataset that consists of points (x, y) Rd { 1} and x[1] = y r 2, i.e. the covariates xi are deterministic in their first coordinate given yi with separation distance r. Furthermore, let the induced dataset e D also be linearly separable by the normalized max-ℓ2-margin solution θ with an ℓ2-margin γ. Then, the normalized max-margin solution of the original dataset D is given by bθ = 1 p h r, 2 γ θ i . (12) Further, the standard accuracy of bθ for data drawn from Prtest reads Prtest(Y bθ X > 0) = Φ r rtest The proof can be found in Section A.3. 
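Since the displayed formulas of Lemma A.1 are partially garbled in this copy, the sketch below spells out the closed form we reconstruct from the surrounding proof, θ̂ = [r, 2γ̃θ̃]/√(r² + 4γ̃²), and checks it numerically. The large-C linear SVM is only an approximate stand-in for a hard-margin max-ℓ2-margin solver, so treat the whole snippet as an assumption-laden illustration rather than a restatement of the lemma.

```python
# Sketch: numerical check of (our reading of) Lemma A.1. For data with x[1] = y*r/2, the
# normalized max-l2-margin solution should be
#   theta_hat = [r, 2*gamma*theta_tilde] / sqrt(r^2 + 4*gamma^2),
# where theta_tilde and gamma are the normalized max-margin direction and margin of the
# induced dataset (covariates without the first coordinate). The hard-margin SVM is
# approximated by a large-C linear SVM without intercept.
import numpy as np
from sklearn.svm import LinearSVC

def max_margin(X, y):
    """Approximate normalized max-l2-margin direction and its margin."""
    svm = LinearSVC(C=1e6, fit_intercept=False, max_iter=200000, tol=1e-8)
    svm.fit(X, y)
    w = svm.coef_.ravel()
    w = w / np.linalg.norm(w)
    return w, float(np.min(y * (X @ w)))

rng = np.random.default_rng(0)
d, n, r, sigma = 200, 30, 6.0, 1.0
y = rng.choice([-1.0, 1.0], size=n)
X_tilde = sigma * rng.standard_normal((n, d - 1))
X = np.column_stack([y * r / 2, X_tilde])

theta_hat, _ = max_margin(X, y)                 # max-margin of the full dataset D
theta_tilde, gamma = max_margin(X_tilde, y)     # max-margin of the induced dataset D_tilde

prediction = np.concatenate([[r], 2 * gamma * theta_tilde])
prediction /= np.linalg.norm(prediction)
print("alignment between SVM solution and the closed form:",
      float(np.abs(theta_hat @ prediction)))    # should be close to 1
```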
The next lemma provides high probability upper and lower bounds for the margin γ of e D when xi are drawn from the normal distribution. Lemma A.2. Let e D = {( xi, yi)}n i=1 be a random dataset where yi { 1} are equally distributed and xi N(0, σId 1) for all i, and γ is the maximum ℓ2 margin that can be written as γ = max θ 2 1 min i [n] yi θ xi. Then, for any t 0, with probability greater than 1 2e t2 2 , we have γmin(t) γ γmax(t) where γmax(t) = σ n + 1 + t n , γmin(t) = σ A.1 PROOF OF THEOREM 3.1 Given a dataset D = {(xi, yi)} drawn from Pr, it is easy to see that the (normalized) ϵtr-robust max-margin solution 4 of D with respect to signal-attacking perturbations T(ϵtr; xi) as defined in Equation 3, can be written as bθϵtr = arg max θ 2 1 min i [n],x i T (xi;ϵtr) yiθ x i = arg max θ 2 1 min i [n],|β| ϵtr yiθ (xi + βe1) = arg max θ 2 1 min i [n] yiθ (xi yiϵtr sign(θ[1])e1). Published as a conference paper at ICLR 2023 Note that by definition, it is equivalent to the (standard normalized) max-margin solution bθ of the shifted dataset Dϵtr = {(xi yiϵtr sign(θ[1])e1, yi)}n i=1. Since Dϵtr satisfies the assumptions of Lemma A.1, it then follows directly that the normalized ϵtr-robust max-margin solution reads bθϵtr = 1 p (r 2ϵtr)2 + 4 γ2 h r 2ϵtr, 2 γ θ i , (14) by replacing r by r 2ϵtr in Equation 12. Similar to above, θ Rd 1 is the (standard normalized) max-margin solution of {( xi, yi)}n i=1 and γ the corresponding margin. Proof of 1. We can now compute the ϵte-robust accuracy of the ϵtr-robust max-margin estimator bθϵtr for a given dataset D as a function of γ. Note that in the expression of bθϵtr, all values are fixed for a fixed dataset, while 0 ϵtr r 2 γmax can be chosen. First note that for a test distribution Pr, the ϵte-robust accuracy, defined as one minus the robust error (Equation 1), for a classifier associated with a vector θ, can be written as Acc(θ; ϵte) = EX,Y Pr I{ min x T (X;ϵte) Y θ x > 0} (15) = EX,Y Pr I{Y θ X ϵteθ[1] > 0} = EX,Y Pr I{Y θ (X Y ϵte sign(θ[1])e1) > 0} Now, recall that by Equation 14 and the assumption in the theorem, we have r 2ϵtr > 0, so that sign(bθϵtr) = 1. Further, using the definition of the T(ϵtr; x) in Equation 3 and by definition of the distribution Pr, we have X[1] = Y r 2. Plugging into Equation 15 then yields Acc(bθϵtr; ϵte) = EX,Y Pr h I{Y bθϵtr (X Y ϵtee1) > 0} i = EX,Y Pr h I{Y bθϵtr (X 1 + Y r 2 ϵte e1) > 0} i = Pr 2ϵte(Y bθϵtr X > 0) where X 1 is a shorthand for the random vector X 1 = (0; X[2], . . . , X[d]). The assumptions in Lemma A.1 (Dϵtr is linearly separable) are satisfied whenever the n < d 1 samples are distinct, i.e. with probability one. Hence applying Lemma A.1 with rtest = r 2ϵte and r = r 2ϵtr yields Acc(bθϵtr; ϵte) = Φ r(r 2ϵte) 4σ γ ϵtr r 2ϵte Theorem statement a) then follows by noting that Φ is a monotically decreasing function in ϵtr. The expression for the robust error then follows by noting that 1 Φ( z) = Φ(z) for any z R and defining ϕ = σ γ r/2 ϵte . (17) Proof of 2. First define ϕmin, ϕmax using γmin, γmax as in Equation 17. Then we have by Equation 16 Err(bθϵtr; ϵte) Err(bθ0; ϵte) = Acc(bθ0; ϵte) Acc(bθϵtr; ϵte) By plugging in t = q n in Lemma A.2, we obtain that with probability at least 1 δ we have and equivalently ϕmin ϕ ϕmax. Published as a conference paper at ICLR 2023 Now note the general fact that for all ϕ 2x the density function f( ϕ; x) = 1 ϕ2 is monotonically increasing in ϕ. 
By assumption of the theorem, ϕ 2(r/2 ϵtr)(r/2 ϵte) so that f( ϕ; x) f(ϕmin; x) for all x [r/2 ϵtr, r/2] and therefore Z r/2 ϕ2 dx Z r/2 2πϕmin E x2 ϕ2 dx = Φ r/2 and the statement is proved. A.2 PROOF OF COROLLARY 3.2 We now show that Theorem 3.1 also holds for ℓ1-ball perturbations with at most radius ϵ. Following similar steps as in Equation 14, the ϵtr-robust max-margin solution for ℓ1-perturbations can be written as bθϵtr := arg max θ 2 1 min i [n] yiθ (xi yiϵtr sign(θ[j (θ)])ej (θ)) (18) where j (θ) := arg maxj |θj| is the index of the maximum absolute value of θ. We now prove by contradiction that the robust max-margin solution for this perturbation set 9 is equivalent to the solution 14 for the perturbation set 3. We start by assuming that bθϵtr does not solve Equation 14, which is equivalent to assuming 1 j (bθϵtr) by definition. We now show how this assumption leads to a contradiction. Define the shorthand j := j (bθϵtr) 1. Since bθϵtr is the solution of 18, by definition, we have that bθϵtr is also the max-margin solution of the shifted dataset Dϵtr := (xi yiϵtr sign(θ[j +1])ej +1, yi). Further, note that by the assumption that 1 j (bθϵtr), this dataset Dϵtr consists of input vectors xi = (yi r 2, xi yiϵtr sign(θ[j +1])ej +1). Hence via Lemma A.1, bθϵtr can be written as bθϵtr = 1 p r2 4( γϵtr)2 [r, 2 γϵtr θϵtr], (19) where θϵtr is the normalized max-margin solution of e D := ( xi yiϵtr sign( θ[j ])ej , yi). We now characterize θϵtr. Note that by assumption, j = j ( θϵtr) = arg maxj | θϵtr [j]|. Hence, the normalized max-margin solution θϵtr is the solution of θϵtr := arg max θ 2 1 min i [n] yi θ xi ϵtr| θ[j ]| (20) Observe that the minimum margin of this estimator γϵtr = mini [n] yi( θϵtr) xi ϵtr| θϵtr [j ]| decreases with ϵtr as the problem becomes harder γϵtr γ, where the latter is equivalent to the margin of θϵtr for ϵtr = 0. Since r > 2 γmax by assumption in the Theorem, by Lemma A.2 with probability at least 1 2E α2(d 1) n , we then have that r > 2 γ 2 γϵtr. Given the closed form of bθϵtr in Equation 19, it directly follows that bθϵtr [1] = r > 2 γϵtr θϵtr 2 = bθϵtr [2:d] 2 and hence 1 j (bθϵtr). This contradicts the original assumption 1 j (bθϵtr) and hence we established that bθϵtr for the ℓ1-perturbation set 9 has the same closed form 14 as for the perturbation set 3. The final statement is proved by using the analogous steps as in the proof of 1. and 2. to obtain the closed form of the robust accuracy of bθϵtr. A.3 PROOF OF LEMMA A.1 We start by proving that bθ is of the form bθ = h a1, a2 θ i , (21) for a1, a2 > 0. Denote by H(θ) the plane through the origin with normal θ. We define d ((x, y), H(θ)) as the signed euclidean distance from the point (x, y) D Pr to the plane H(θ). The signed Published as a conference paper at ICLR 2023 euclidean distance is the defined as the euclidean distance from x to the plane if the point (x, y) is correctly predicted by θ, and the negative euclidean distance from x to the plane otherwise. We rewrite the definition of the max l2-margin classifier. It is the classifier induced by the normalized vector bθ, such that max θ Rd min (x,y) D d ((x, y) , H(θ)) = min (x,y) D d (x, y) , H(bθ) . We use that D is deterministic in its first coordinate and get max θ min (x,y) D d ((x, y) , H(θ)) = max θ min (x,y) D y(θ[1]x[1] + θ x) = max θ θ1 r 2 + min (x,y) D y θ x. Because r > 0, the maximum over all θ has bθ[1] 0. Take any a > 0 such that θ 2 = a. By definition the max l2-margin classifier, θ, maximizes min(x,y) D d ((x, y) , H(θ)). 
Therefore, θ̂ is of the form of Equation 21. Note that all classifiers induced by vectors of the form of Equation 21 classify D correctly. Next, we aim to find expressions for a₁ and a₂ such that Equation 21 is the normalized max-ℓ2-margin classifier. The distance from any x ∈ D to H(θ̂) is

d(x, H(θ̂)) = a₁x[1] + a₂ θ̃⊤x̃.

Using that x[1] = y r/2 and that the second term equals a₂ d(x̃, H(θ̃)), we get

d(x, H(θ̂)) = a₁ r/2 + a₂ d(x̃, H(θ̃)) = a₁ r/2 + √(1 − a₁²) d(x̃, H(θ̃)).  (22)

Let (x̃, ỹ) ∈ D̃ be the point closest in Euclidean distance to H(θ̃). This point is also the closest point in Euclidean distance to H(θ̂), because by Equation 22 d(x, H(θ̂)) is strictly decreasing for decreasing d(x̃, H(θ̃)). We maximize the minimum margin d(x, H(θ̂)) with respect to a₁. Define the vectors a = [a₁, a₂] and v = [r/2, d(x̃, H(θ̃))]. Using the dual norm, we find that a = v/‖v‖₂. Plugging this expression for a into Equation 21 yields that θ̂ is proportional to [r, 2γ̃ θ̃], as claimed.

For the second part of the lemma, we first decompose

P_{rtest}(Y θ̂⊤X > 0) = (1/2) P_{rtest}(θ̂⊤X > 0 | Y = 1) + (1/2) P_{rtest}(θ̂⊤X < 0 | Y = −1).

We can further write

P_{rtest}(θ̂⊤X > 0 | Y = 1) = P_{rtest}( Σ_{i=2}^d θ̂[i]X[i] > −θ̂[1]X[1] | Y = 1 ) = P( Σ_{i=1}^{d−1} θ̃[i]X̃[i] > −r·rtest/(4γ̃) ) = 1 − Φ( −r·rtest/(4σγ̃) ) = Φ( r·rtest/(4σγ̃) ),  (23)

where Φ is the standard normal cumulative distribution function. The second equality follows by multiplying both sides by the normalization constant, and the third equality is due to the fact that Σ_{i=1}^{d−1} θ̃[i]X̃[i] is a zero-mean Gaussian with variance σ²‖θ̃‖₂² = σ², since θ̃ is normalized. Correspondingly, we can write

P_{rtest}(θ̂⊤X < 0 | Y = −1) = P( Σ_{i=1}^{d−1} θ̃[i]X̃[i] < r·rtest/(4γ̃) ) = Φ( r·rtest/(4σγ̃) ),  (24)

so that combining Equations 23 and 24 we obtain P_{rtest}(Y θ̂⊤X > 0) = Φ( r·rtest/(4σγ̃) ). This concludes the proof of the lemma.

A.4 PROOF OF LEMMA A.2

The proof plan is as follows. We start from the definition of the max-ℓ2-margin of a dataset and rewrite it as an expression that involves a random matrix with independent standard normal entries. This allows us to prove the upper and lower bounds on the max-ℓ2-margin in Sections A.4.1 and A.4.2, respectively, using non-asymptotic estimates on the singular values of Gaussian random matrices.

Given the dataset D̃ = {(x̃_i, y_i)}_{i=1}^n, we define the random matrix X = [x̃₁⊤; x̃₂⊤; …; x̃_n⊤] ∈ R^{n×(d−1)}, where x̃_i ~ N(0, σ²I_{d−1}). Let V be the class of all perfect predictors of D̃. For a matrix A and vector b, we denote by |Ab| the vector whose entries correspond to the absolute values of the entries of Ab. Then, by definition,

γ̃ = max_{v∈V, ‖v‖₂=1} min_{j∈[n]} |Xv|[j] = max_{v∈V, ‖v‖₂=1} min_{j∈[n]} σ|Qv|[j],  (26)

where Q = (1/σ)X is the scaled data matrix. In the sequel we use the operator norm of a matrix A ∈ R^{n×(d−1)}, ‖A‖₂ = sup_{v∈R^{d−1}, ‖v‖₂=1} ‖Av‖₂, and denote the maximum and minimum singular values of A by s_max(A) and s_min(A), respectively.

A.4.1 UPPER BOUND

Since the minimum entry of the vector |Qv| must be smaller than ‖Qv‖₂/√n ≤ ‖Q‖₂/√n, we can upper bound γ̃ by γ̃ ≤ σ(1/√n)‖Q‖₂. Noting that ‖Q‖₂ = s_max(Q) and controlling its deviation around the expectation, it follows from Corollary 5.35 of Vershynin (2010) that for all t ≥ 0:

P[ s_max(Q) ≤ √(d−1) + √n + t ] ≥ 1 − 2e^{−t²/2}.

Therefore, with probability greater than 1 − 2e^{−t²/2},

γ̃ ≤ σ( 1 + (√(d−1) + t)/√n ).

A.4.2 LOWER BOUND

By the definition in Equation 26, if we find a vector v ∈ V with ‖v‖₂ = 1 such that, for some a > 0, it holds that min_{j∈[n]} |Xv|[j] > a, then γ̃ > a. Recall the definition of the max-ℓ2-margin in Equation 26.
As n < d − 1, the random matrix Q is a wide matrix, i.e. it has more columns than rows, and therefore its minimal singular value is 0. Furthermore, Q has rank n almost surely, and hence for every c > 0 there exists a v ∈ R^{d−1} such that

σQv = 1_n c > 0,  (27)

where 1_n denotes the all-ones vector of dimension n. The smallest non-zero singular value of Q, s_{min,nonzero}(Q), equals the smallest non-zero singular value of its transpose Q⊤. Therefore, there also exists a v ∈ V with ‖v‖₂ = 1 such that

γ̃ ≥ min_{j∈[n]} σ|Qv|[j] ≥ σ s_{min,nonzero}(Q) (1/√n),  (28)

where we used the fact that any unit vector v in the span of the eigenvectors with non-zero eigenvalues satisfies ‖Qv‖₂ ≥ s_{min,nonzero}(Q), together with the existence of a solution v for any right-hand side as in Equation 27. By Corollary 5.35 of Vershynin (2010), with probability greater than 1 − 2e^{−t²/2} for any t ≥ 0, we have

γ̃ ≥ σ( (√(d−1) − t)/√n − 1 ).  (29)

B BOUNDS ON THE SUSCEPTIBILITY SCORE

In Theorem 3.1 we give non-asymptotic bounds on the robust and standard error of a linear classifier trained with adversarial logistic regression, and we use the decomposition of the robust error into standard error and susceptibility to gain intuition for how adversarial training may hurt robust generalization. In this section, we complete the result of Theorem 3.1 by also deriving non-asymptotic bounds on the susceptibility score of the max-ℓ2-margin classifier. Using the results in Appendix A, we can prove the following Corollary B.1, which gives non-asymptotic bounds on the susceptibility score.

Corollary B.1. Assume d − 1 > n, and consider the ϵte-susceptibility on test samples from P_r with 2ϵte < r and the perturbation sets in Equations 3 and 9. For ϵtr < r/2 − γ̃_max, with probability at least 1 − 2e^{−α²(d−1)/2} for any 0 < α < 1 over the draw of a dataset D with n samples from P_r, the ϵte-susceptibility is upper and lower bounded by the expressions obtained from

Susc(θ̂_ϵtr; ϵte) = Φ( (r − 2ϵtr)(ϵte − r/2)/(2σγ̃) ) − Φ( (r − 2ϵtr)(−ϵte − r/2)/(2σγ̃) )

by substituting the high-probability bounds γ̃_min ≤ γ̃ ≤ γ̃_max of Lemma A.2.

We give the proof in Subsection B.1. Observe that the bounds on the susceptibility score in Corollary B.1 each consist of two terms, where the second term decreases with ϵtr while the first term increases. We recognize the following two regimes: either the max-ℓ2-margin classifier is close to the ground truth e₁, or it is not. Clearly, the ground-truth classifier has zero susceptibility, and hence classifiers close to the ground truth also have low susceptibility. On the other hand, if the max-ℓ2-margin classifier is not close to the ground truth, then putting less weight on the first coordinate increases invariance to perturbations along the first direction. Recall that, by Lemma A.1, increasing ϵtr decreases the weight on the first coordinate of the max-ℓ2-margin classifier. Furthermore, in the low sample size regime we are likely not close to the ground truth. Therefore, the regime where the susceptibility decreases with increasing ϵtr dominates in the low sample size regime.

To confirm the result of Corollary B.1, we plot the mean and standard deviation of the susceptibility score over 5 independent experiments. The results are depicted in Figure 7. We see that for low standard error, when the classifier is reasonably close to the optimal classifier, the susceptibility increases slightly with increasing adversarial budget. However, increasing the adversarial training budget ϵtr further causes the susceptibility score to drop sharply. Hence, we can recognize both regimes and validate that the second regime indeed dominates in the low sample size setting.
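For readers who want to reproduce this picture qualitatively, the following is a minimal simulation sketch (our own code, not the authors' released implementation): it approximates the max-ℓ2-margin of the non-signal coordinates with a large-C linear SVM without intercept, compares it against γ̃_min(t) and γ̃_max(t), and evaluates the closed-form susceptibility derived in the proof of Corollary B.1 below. All function names are ours, and the SVM is only an approximation of the exact max-margin solution.

    # Simulation sketch (our own code, not the authors' implementation).
    import numpy as np
    from scipy.stats import norm
    from sklearn.svm import LinearSVC

    def margin_bounds(n, d, sigma, t):
        lo = sigma * (np.sqrt(d - 1) / np.sqrt(n) - 1 - t / np.sqrt(n))
        hi = sigma * (np.sqrt(d - 1) / np.sqrt(n) + 1 + t / np.sqrt(n))
        return lo, hi

    def empirical_margin(n, d, sigma, rng):
        X = rng.normal(0.0, sigma, size=(n, d - 1))   # non-signal coordinates
        y = rng.choice([-1, 1], size=n)
        svm = LinearSVC(C=1e8, fit_intercept=False, loss="hinge", max_iter=200000)
        svm.fit(X, y)                                  # large C approximates the hard-margin solution
        w = svm.coef_.ravel()
        return float(np.min(y * (X @ w)) / np.linalg.norm(w))

    def susceptibility(gamma, r, sigma, eps_tr, eps_te):
        # closed form from the proof of Corollary B.1
        a = (r - 2 * eps_tr) * (eps_te - r / 2) / (2 * sigma * gamma)
        b = (r - 2 * eps_tr) * (-eps_te - r / 2) / (2 * sigma * gamma)
        return norm.cdf(a) - norm.cdf(b)

    rng = np.random.default_rng(0)
    n, d, sigma, r, eps_te, t = 50, 1000, 1.0, 6.0, 2.5, 1.0   # settings of Figure 7
    gamma = empirical_margin(n, d, sigma, rng)
    print("margin:", round(gamma, 2), "bounds:", margin_bounds(n, d, sigma, t))
    for eps_tr in [0.0, 0.5, 1.0, 2.0]:
        print("eps_tr =", eps_tr, "Susc:", round(susceptibility(gamma, r, sigma, eps_tr, eps_te), 3))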
B.1 PROOF OF COROLLARY B.1

We prove the statement by bounding the robustness of a linear classifier. Recall that the robustness of a classifier is the probability that it does not change its prediction under an adversarial attack, so the susceptibility score is given by

Susc(θ̂_ϵtr; ϵte) = 1 − Rob(θ̂_ϵtr; ϵte).  (31)

The proof idea is as follows: since the perturbations are along the first basis direction e₁, we compute the distance from the robust max-ℓ2-margin classifier θ̂_ϵtr to a point (X, Y) ~ P. The robustness of θ̂_ϵtr is then given by the probability that the distance along e₁ from X to the decision plane induced by θ̂_ϵtr is greater than ϵte. Lastly, we use the non-asymptotic bounds of Lemma A.2.

Recall that, by Lemma A.1, the robust max-ℓ2-margin classifier is of the form

θ̂_ϵtr = (1/√((r − 2ϵtr)² + 4γ̃²)) [r − 2ϵtr, 2γ̃ θ̃].  (32)

Let (X, Y) ~ P. The distance along e₁ from X to the decision plane H(θ̂_ϵtr) induced by θ̂_ϵtr is given by

d_{e₁}(X, H(θ̂_ϵtr)) = | X[1] + (2γ̃/(r − 2ϵtr)) Σ_{i=2}^d θ̃[i]X[i] |.

Let N be a standard normal random variable. Since ‖θ̃‖₂ = 1 by definition, and a sum of Gaussian random variables is again a Gaussian random variable, we can write

d_{e₁}(X, H(θ̂_ϵtr)) = | X[1] + (2γ̃σ/(r − 2ϵtr)) N |.

The robustness of θ̂_ϵtr is given by the probability that d_{e₁}(X, H(θ̂_ϵtr)) > ϵte. Hence, using that X[1] = Y r/2 with probability 1 and the symmetry of N,

Rob(θ̂_ϵtr; ϵte) = P( | r/2 + (2γ̃σ/(r − 2ϵtr)) N | > ϵte ).  (33)

Figure 7: We set r = 6, d = 1000, n = 50 and ϵte = 2.5. (a) The average susceptibility score and its standard deviation over 5 independent experiments; note how the bounds closely predict the susceptibility score. (b) For comparison, we also plot the robust error decomposition into susceptibility and standard error. Even though the susceptibility decreases, the robust error increases with increasing adversarial budget ϵtr.

We can rewrite Equation 33 in the form

Rob(θ̂_ϵtr; ϵte) = P( N > (r − 2ϵtr)(ϵte − r/2)/(2σγ̃) ) + P( N < (r − 2ϵtr)(−ϵte − r/2)/(2σγ̃) ).

Recall that N is a standard normal random variable and denote by Φ the standard normal cumulative distribution function. By the definition of the cumulative distribution function, we find that

Rob(θ̂_ϵtr; ϵte) = 1 − Φ( (r − 2ϵtr)(ϵte − r/2)/(2σγ̃) ) + Φ( (r − 2ϵtr)(−ϵte − r/2)/(2σγ̃) ).

Substituting the bounds on γ̃ from Lemma A.2 gives the non-asymptotic bounds on the robustness score and, by Equation 31, also on the susceptibility score.

C EXPERIMENTAL DETAILS ON THE LINEAR MODEL

In this section, we provide detailed experimental settings for Figures 3 and 4. We implement adversarial logistic regression using stochastic gradient descent with a learning rate of 0.01. Note that logistic regression converges only logarithmically to the robust max-ℓ2-margin solution; as a consequence of this slow convergence, we train for up to 10^7 epochs. Both during training and at test time we solve max_{x'_i∈T(x_i;ϵtr)} L(y_i f_θ(x'_i)) exactly, and hence we measure the robust error exactly. Unless specified otherwise, we set σ = 1, r = 12 and ϵte = 4.
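As a concrete illustration of this training procedure, the following is a minimal sketch (our own reimplementation, not the authors' code) of adversarial logistic regression for the linear setting: for the perturbation set T(x; ϵ) = {x + βe₁ : |β| ≤ ϵ}, the inner maximization has the closed form y θ⊤x − ϵ|θ[1]|, so the adversarial loss and its gradient can be computed exactly. We use full-batch gradient descent and far fewer epochs than the 10^7 reported above, which is enough for a qualitative picture.

    # Minimal sketch (our reimplementation, not the authors' code) of adversarial
    # logistic regression with exact inner maximization along e1.
    import numpy as np
    from scipy.special import expit

    def sample_data(n, d, r, sigma, rng):
        y = rng.choice([-1.0, 1.0], size=n)
        X = rng.normal(0.0, sigma, size=(n, d))
        X[:, 0] = y * r / 2.0                          # signal coordinate x[1] = y*r/2
        return X, y

    def adv_logistic_regression(X, y, eps_tr, lr=0.01, epochs=50_000):
        n, d = X.shape
        theta = np.zeros(d)
        theta[0] = 1e-3                                # fix the sign of theta[1] at initialization
        for _ in range(epochs):
            margins = y * (X @ theta) - eps_tr * np.abs(theta[0])   # exact worst-case margins
            s = expit(-margins)                        # logistic-loss gradient weights
            grad = -(s[:, None] * (y[:, None] * X)).mean(axis=0)
            grad[0] += s.mean() * eps_tr * np.sign(theta[0])
            theta -= lr * grad
        return theta / np.linalg.norm(theta)

    def robust_error(theta, X, y, eps_te):
        return float(np.mean(y * (X @ theta) - eps_te * np.abs(theta[0]) <= 0))

    rng = np.random.default_rng(0)
    Xtr, ytr = sample_data(n=50, d=1000, r=12.0, sigma=1.0, rng=rng)
    Xte, yte = sample_data(n=100_000, d=1000, r=12.0, sigma=1.0, rng=rng)
    for eps_tr in [0.0, 1.0, 3.0, 5.0]:
        theta = adv_logistic_regression(Xtr, ytr, eps_tr)
        print("eps_tr =", eps_tr, "robust error:", robust_error(theta, Xte, yte, eps_te=4.0))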
Experimental details on Figure 3 (a) We draw 5 datasets with n = 50 samples and input dimension d = 1000 from the distribution P. We then run adversarial logistic regression on all 5 datasets with adversarial training budgets ϵtr = 1 to 5. To compute the resulting robust error gap of all the obtained classifiers, we use a test set of size 10^6. Lastly, we compute the lower bound given in part 2. of Theorem 3.1. (b) We draw 5 datasets of different sizes n between 50 and 10^4. We take an input dimension of d = 10^4 and plot the mean and standard deviation of the robust error after adversarial and standard logistic regression over the 5 samples. (c) We again draw 5 datasets for each d/n constellation and compute the robust error gap for each dataset.

Experimental details on Figure 4 For both (a) and (b) we set d = 1000 and ϵte = 4, and vary the adversarial training budget ϵtr from 1 to 5. For every constellation of n and ϵtr, we draw 10 datasets and show the average and standard deviation of the resulting robust errors. In (b), we set n = 50.

D EXPERIMENTAL DETAILS ON THE WATERBIRDS DATASET

In this section, we discuss the experimental details and the construction of the Waterbirds dataset in more detail. We also provide ablation studies of attack parameters such as the size of the motion blur kernel, plots of the robust error decomposition with increasing n, and some experiments using early stopping.

D.1 THE WATERBIRDS DATASET

To build the Waterbirds dataset, we use the CUB-200 dataset Welinder et al. (2010), which contains images and labels of 200 bird species, and 4 background classes (forest, jungle/bamboo, water ocean, water lake natural) of the Places dataset Zhou et al. (2017). The aim is to recognize whether the bird in a given image is a waterbird (e.g. an albatross) or a landbird (e.g. a woodpecker). To create the dataset, we randomly sample equally many water- as landbirds from the CUB-200 dataset. Thereafter, we sample a random background image for each bird image. Then, we use the segmentations provided in the CUB-200 dataset to segment the birds from their original images and paste them onto the randomly sampled backgrounds. The resulting images have a size of 256 x 256. Moreover, we also resize the segmentations such that we have the correct segmentation profiles of the birds in the new dataset as well. For the concrete implementation, we use the code provided by Sagawa et al. (2020).

Figure 8: An ablation study of the motion blur kernel size M, which corresponds to the severity level of the blur (panel (a) shows the original image). For increasing M, the severity of the motion blur increases. In particular, note that for M = 15 and even M = 20 the bird remains recognizable: we do not semantically change the class, i.e. the perturbations are consistent.

D.2 EXPERIMENTAL TRAINING DETAILS

Following the example of Sagawa et al. (2020), we use a ResNet50 or ResNet18 pretrained on the ImageNet dataset for all experiments in the main text, a weight decay of 10^-4, and train for 300 epochs using the Adam optimizer. Extensive fine-tuning of the learning rate resulted in an optimal learning rate of 0.006 for all experiments using the adversarial illumination attack and a pretrained ResNet50. For the experiments considering the adversarial illumination attack with a pretrained VGG19 or DenseNet121 network, we found optimal learning rates of 0.001 and 0.002, respectively. Lastly, we found that for all experiments using the motion blur attack a learning rate of 0.0011 was optimal. Adversarial training is implemented as suggested in Madry et al. (2018): at each iteration we find the worst-case perturbation with an exact or approximate method.
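The following sketch illustrates this training scheme for attacks whose inner maximization is a list search over finitely many candidate perturbations, as is the case for the directed attacks used here. It is our own simplified rendering, not the released code, and `attack_candidates` is a hypothetical helper that returns one perturbed copy of the batch per candidate parameter.

    # Sketch (not the authors' code) of adversarial training with a list-search
    # inner maximization over a finite set of candidate perturbations.
    import torch
    import torch.nn.functional as F

    def worst_case_batch(model, xs_candidates, y):
        # xs_candidates: tensor of shape (C, B, 3, H, W), one perturbed batch per candidate
        with torch.no_grad():
            losses = torch.stack([F.cross_entropy(model(x_c), y, reduction="none")
                                  for x_c in xs_candidates])        # (C, B) per-sample losses
            worst = losses.argmax(dim=0)                            # worst candidate per sample
        return torch.stack([xs_candidates[worst[b], b] for b in range(y.shape[0])])

    def adversarial_training_epoch(model, loader, optimizer, attack_candidates):
        model.train()
        for x, y in loader:
            x_adv = worst_case_batch(model, attack_candidates(x), y)
            optimizer.zero_grad()
            F.cross_entropy(model(x_adv), y).backward()
            optimizer.step()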
In all our experiments, the resulting classifier interpolates the training set. We plot the mean over all runs and the standard deviation of the mean.

D.3 SPECIFICS OF THE MOTION BLUR ATTACK

Fast-moving objects or animals are hard to photograph due to motion blur. Hence, when trying to classify or detect moving objects from images, it is imperative that the classifier is robust against reasonable levels of motion blur. We implement the attack as follows: first, we segment the bird from the original image, then we apply a blur filter, and lastly we paste the blurred bird back onto the background. We can apply more severe blur by enlarging the kernel of the filter; see Figure 8 for an ablation study of the kernel size.

The motion blur filter is implemented as follows. We use a kernel of size M x M and fill row (M − 1)/2 of the kernel with the value 1/M. Thereafter, we use the 2D convolution implementation of OpenCV (filter2D) Bradski (2000) to convolve the kernel with the image. Note that applying a rotation to the kernel before the convolution changes the direction of the resulting motion blur. Lastly, we find the most detrimental level of motion blur using a list search over all levels up to M_max.

Figure 9: An ablation study of the different lighting changes of the adversarial illumination attack, for offsets ϵ ranging from −0.3 to 0.3 around the original image. Even though the directed attack perturbs the signal component in the image, the bird remains recognizable in all cases.

Figure 10: The robust error decomposition ((a) robust error, (b) standard error, (c) susceptibility) of the experiments depicted in Figure 10a. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that, in comparison to standard training, the reduction in susceptibility achieved by adversarial training is minimal in the low sample size regime. Moreover, the increase in standard error of adversarial training is quite severe, leading to an overall increase in robust error in the low sample size regime.

D.4 SPECIFICS OF THE ADVERSARIAL ILLUMINATION ATTACK

An adversary can hide objects using poor lighting conditions, which can for example arise from shadows or bright spots. To model poor lighting conditions on the object only (i.e. targeted at the object), we use the adversarial illumination attack. The attack is constructed as follows: first, we segment the bird from the background. Then we apply an additive constant ϵ to the bird, where the absolute value of the constant satisfies |ϵ| ≤ ϵte = 0.3. Thereafter, we clip the pixel values of the bird image to [0, 1], and lastly we paste the bird back onto the background. See Figure 9 for an ablation of the attack parameter ϵ.

It is non-trivial to (approximately) find the worst perturbation. We find an approximate solution by searching over all perturbations in increments of size ϵte/K_max. Denote by seg the segmentation profile of the image x. We consider all perturbed images of the form

x_pert = (1 − seg) ⊙ x + seg ⊙ (x + ϵte (K/K_max) 1),  K ∈ {−K_max, …, K_max},

where 1 denotes the all-ones image. During training time we set K_max = 16 and therefore search over 33 possible images. During test time we search over 65 images (K_max = 32).
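The two attack components described in D.3 and D.4 can be sketched as follows (our approximation of the described procedures, not the released code): a horizontal motion blur kernel applied with OpenCV's filter2D, and the candidate generation for the adversarial illumination attack.

    # Sketch (our approximation, not the released code) of the two directed attacks.
    import numpy as np
    import cv2
    import torch

    def motion_blur(img_uint8, M):
        # img_uint8: HxWx3 image; M: kernel size controlling the blur severity
        kernel = np.zeros((M, M), dtype=np.float32)
        kernel[(M - 1) // 2, :] = 1.0 / M             # horizontal line filter; rotate it for other directions
        return cv2.filter2D(img_uint8, -1, kernel)    # ddepth = -1 keeps the input depth

    def illumination_candidates(x, seg, eps=0.3, k_max=16):
        # x: (B, 3, H, W) images in [0, 1]; seg: (B, 1, H, W) bird segmentation masks
        candidates = []
        for k in range(-k_max, k_max + 1):
            shifted = torch.clamp(x + eps * k / k_max, 0.0, 1.0)   # shift and clip the bird pixels
            candidates.append(seg * shifted + (1.0 - seg) * x)     # keep the background untouched
        return torch.stack(candidates)                              # (2*k_max + 1, B, 3, H, W)

Together with worst_case_batch from D.2, illumination_candidates plays the role of attack_candidates; the motion blur attack is analogous, with the list search running over kernel sizes up to M_max and the blur applied to the segmented bird only.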
D.5 EARLY STOPPING

In all our experiments on the Waterbirds dataset, a parameter search led to an optimal weight decay and learning rate of 10^-4 and 0.006, respectively. Another common regularization technique is early stopping, where one stops training at the epoch where the classifier achieves minimal robust error on a hold-out dataset. To understand whether early stopping can mitigate the effect of adversarial training aggravating robust generalization in comparison to standard training, we perform the following experiment. On the Waterbirds dataset of size n = 20, under the adversarial illumination attack, we compare standard training with early stopping and adversarial training (ϵtr = ϵte = 0.3) with early stopping. Over several independent experiments, early-stopped adversarial training has an average robust error of 33.5%, while early-stopped standard training achieves 29.1%. Hence, early stopping decreases the robust error gap, but does not close it.

D.6 ERROR DECOMPOSITION WITH INCREASING n

In Figures 10a and 11a, we see that adversarial training hurts robust generalization in the small sample size regime. For completeness, we plot the robust error decomposition for adversarial and standard training in Figure 10. We see that in the low sample size regime, the drop in susceptibility that adversarial training achieves in comparison to standard training is much smaller than the increase in standard error. Conversely, in the high sample size regime, the drop in susceptibility of adversarial training over standard training is much bigger than the increase in standard error.

D.7 DIFFERENT ARCHITECTURES

For completeness, we also performed similar experiments on the Waterbirds dataset using the adversarial illumination attack with network architectures other than the pretrained ResNet50. In particular, we considered the following pretrained network architectures: VGG19 and DenseNet121. See Figure 12 for the results. We observe that across models, adversarial training hurts in the low sample size regime, but helps when enough data is available.

Figure 11: The robust error decomposition ((a) robust error, (b) standard error, (c) susceptibility) of the experiments depicted in Figure 6. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that, in comparison to standard training, the reduction in susceptibility for adversarial training is minimal in the low sample size regime. Moreover, the increase in standard error of adversarial training is quite severe, leading to an overall increase in robust error in the low sample size regime.

Figure 12: The robust error of adversarial training and standard training with increasing sample size, using the adversarial illumination attack with ϵte = 0.3, for the VGG19 and DenseNet121 architectures. We depict the mean and the standard deviation of the mean over multiple runs. Observe that across models, adversarial training hurts in the low sample size regime, but helps when enough samples are available.

D.8 UNDIRECTED ATTACKS ON THE WATERBIRDS DATASET

In this section, we analyse adversarial training for ℓ2- and ℓ∞-ball perturbations in the small sample size regime. We observe that while adversarial training hurts standard generalization, it helps robust generalization.

Adversarial training with ℓ2-balls We train and test with small ℓ2-balls, ϵte = 0.2, such that the networks trained with standard training achieve non-zero robust accuracy and the networks trained with adversarial training achieve non-trivial standard accuracy.
We see in Figure 13 that adversarial training with ℓ2-balls hurts standard generalization while improving robust generalization. Moreover, Figure 14 shows that also in the very small sample size regime, adversarial training with increasing ϵtr increases the standard error but reduces the susceptibility.

Adversarial training with ℓ∞-balls We also consider ℓ∞-ball perturbations. We see in Figure 15 that even for the smallest perturbation budget, ϵte = 2/255, standard training has a robust error of 100 percent. In contrast, adversarial training achieves low, but non-zero, robust error.

Experimental details We use an ImageNet-pretrained ResNet34 and train for 300 epochs. Moreover, for reliable evaluation of the robust error and susceptibility under these attacks we use AutoAttack Croce & Hein (2020). All networks were trained such that they interpolate the training dataset and achieve low robust error with non-trivial standard error. For the networks trained with standard training we use a learning rate of 0.006, and for the networks trained with adversarial training we use a learning rate of 5·10^-4. We also train with a weight decay of 10^-4, a batch size of 8 and a momentum of 0.9 for all networks. We train at least 3 networks for every setting and report the mean and standard deviation of the mean of the standard error, robust error and susceptibility over the three runs.

Figure 13: The robust error decomposition ((a) robust error, (b) standard error, (c) susceptibility) for adversarial training with ℓ2-balls of size ϵtr = 0.2 and test adversaries with ℓ2-balls of size ϵte = 0.2. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that even though adversarial training hurts standard generalization, it improves robust generalization as it decreases the susceptibility of the classifiers.

Figure 14: The robust error decomposition of adversarial training as a function of ϵtr in the small sample size regime n = 20, with test adversaries given by ℓ2-balls of size ϵte = 0.2. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that even though adversarial training hurts standard generalization, it improves robust generalization as it decreases the susceptibility of the classifiers with increasing ϵtr.

Figure 15: The robust error decomposition ((a) robust error, (b) standard error, (c) susceptibility) for adversarial training with ℓ∞-balls of size ϵtr = 2/255 and test adversaries with ℓ∞-balls of size ϵte = 2/255. The plots depict the mean and standard deviation of the mean over several independent experiments. We see that even though adversarial training hurts standard generalization, it improves robust generalization as it decreases the susceptibility of the classifiers.

E EXPERIMENTAL DETAILS ON CIFAR-10

In this section, we give the experimental details for the CIFAR-10-based experiments shown in Figures 1 and 17.

Subsampling CIFAR-10 In all our experiments we subsample CIFAR-10 to simulate the low sample size regime. We ensure that all subsampled versions contain an equal number of samples from each class. Hence, if we subsample to 500 training images, each class has exactly 50 images, drawn uniformly from the 5k training images of the respective class.
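A minimal sketch of the class-balanced subsampling described above (a hypothetical helper, not the paper's code):

    # Hypothetical helper (not the paper's code): class-balanced subsampling of
    # CIFAR-10 to n_total training images, i.e. n_total/10 images per class.
    import numpy as np
    from torchvision.datasets import CIFAR10
    from torch.utils.data import Subset

    def balanced_subsample(dataset: CIFAR10, n_total: int, seed: int = 0) -> Subset:
        rng = np.random.default_rng(seed)
        targets = np.array(dataset.targets)
        per_class = n_total // 10
        idx = np.concatenate([rng.choice(np.where(targets == c)[0], size=per_class, replace=False)
                              for c in range(10)])
        return Subset(dataset, idx.tolist())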
Mask perturbation on CIFAR-10 On CIFAR-10, we consider the square black-mask attack, where the adversary can mask a square of size ϵte x ϵte in the image by setting its pixel values to zero. To ensure that the mask cannot cover all the information about the true class in the image, we restrict the size of the mask to at most 2 x 2, while allowing all possible locations of the mask in the targeted image. For exact robust error evaluation, we perform a full grid search over all possible locations at test time. We show an example of a black-mask attack on each of the classes of CIFAR-10 in Figure 16. During training, a full grid search is computationally intractable, so we use an approximate attack similar to Wu et al. (2020) at training time: we identify the K = 16 most promising mask locations with the following heuristic. First, we analyse the gradient of the cross-entropy loss with respect to the input, ∇_x L(f_θ(x), y); masks that cover a part of the image where the gradient is large are more likely to increase the loss. Hence, we compute the K mask locations (i, j) for which ‖∇_x L(f_θ(x), y)[i:i+2, j:j+2]‖₁ is largest, and then take, via a full list search over these candidates, the mask that incurs the highest loss. Our intuition from the theory predicts that a higher K, and hence a more exact defense, only increases the robust error of adversarial training, since the mask can then more efficiently cover important information about the class.

Figure 16: We show an example of a mask perturbation for all 10 classes of CIFAR-10. Even though the attack occludes part of the image, a human can still easily classify all images correctly.

Figure 17: The robust error decomposition into standard error and susceptibility for varying attack strength K. We see that the larger K, the lower the susceptibility, but the higher the standard error.

Experimental training details For all our experiments on CIFAR-10, we adjusted the code provided by Phan (2021). As is typical for CIFAR-10, we augment the data with random cropping and horizontal flipping. For the experiments with results depicted in Figures 1 and 17, we use a ResNet18 network and train for 100 epochs. We tune the learning rate and weight decay for low robust error. For standard training, we use a learning rate of 0.01 with equal weight decay. For adversarial training, we use a learning rate of 0.015 and a weight decay of 10^-4. We run each experiment three times for every dataset with different initialization seeds, and plot the average and standard deviation over the runs. For the experiments in Figures 1 and 18 we use an attack strength of K = 4. Recall that we perform a full grid search at test time and hence obtain a good approximation of the robust accuracy and susceptibility score.

Increasing training attack strength We investigate the influence of the attack strength K on the robust error of adversarial training. We take ϵtr = 2 and n = 500 and vary K. The results are depicted in Figure 17. We see that for increasing K the susceptibility decreases, but the standard error increases more severely, resulting in an increasing robust error.
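The following sketch shows our reading of the approximate training-time attack described above (not the released code): score all 2 x 2 mask locations by the ℓ1-mass of the input gradient under them, keep the K highest-scoring locations, and return the masked image with the largest loss among those candidates.

    # Sketch (our reading of the heuristic, not the released code) of the
    # approximate black-mask attack used at training time.
    import torch
    import torch.nn.functional as F

    def black_mask_attack(model, x, y, mask_size=2, k=16):
        x = x.clone().requires_grad_(True)            # x: (B, 3, 32, 32), y: (B,)
        F.cross_entropy(model(x), y).backward()       # parameter grads from this pass are discarded later
        score = x.grad.abs().sum(dim=1, keepdim=True)                       # (B, 1, 32, 32)
        window = F.avg_pool2d(score, mask_size, stride=1) * mask_size ** 2  # gradient mass per mask location
        topk = window.flatten(1).topk(k, dim=1).indices                     # (B, k) candidate locations
        n_pos = window.shape[-1]
        x_adv = x.detach().clone()
        best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
        with torch.no_grad():
            for j in range(k):
                cand = x.detach().clone()
                rows, cols = topk[:, j] // n_pos, topk[:, j] % n_pos
                for b in range(x.shape[0]):
                    r, c = int(rows[b]), int(cols[b])
                    cand[b, :, r:r + mask_size, c:c + mask_size] = 0.0      # black square mask
                loss = F.cross_entropy(model(cand), y, reduction="none")
                improved = loss > best_loss
                best_loss = torch.where(improved, loss, best_loss)
                x_adv[improved] = cand[improved]
        return x_adv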
Robust error decomposition In Figure 1, we see that the robust error increases for adversarial training compared to standard training in the low sample size regime, but the opposite holds when enough samples are available. For completeness, we provide a full decomposition of the robust error into standard error and susceptibility for standard and adversarial training; we plot the decomposition in Figure 18.

F STATIC HAND GESTURE RECOGNITION

The goal of static hand gesture or posture recognition is to recognize hand gestures, such as a pointing index finger or the okay-sign, from static data such as images Oudah et al. (2020); Yang et al. (2013). Hand gesture recognition is currently used primarily in the interaction between computers and humans Oudah et al. (2020); typical practical applications are found in gaming, assisted living, and virtual reality Mujahid et al. (2021). In the following, we conduct experiments on a hand gesture recognition dataset constructed by Mantecón et al. (2019), which consists of near-infrared stereo images obtained with the Leap Motion device. First, we crop or segment the images, after which we use logistic regression for classification. We see that adversarial logistic regression deteriorates robust generalization with increasing ϵtr.

Static hand-gesture dataset We use the dataset made available by Mantecón et al. (2019). This dataset consists of near-infrared stereo images taken with the Leap Motion device and provides detailed skeleton data; we base our analysis on the images only. The size of the images is 640 x 240 pixels. The dataset consists of 16 classes of hand poses performed by 25 different people. We note that the variety between the different people is relatively wide: there are men and women with different postures and hand sizes. However, different samples taken from the same person are alike. We consider binary classification between the index pose and the L pose, and take as a training set 30 images from each of the users 16 to 25. This results in a training dataset of 300 samples. We show two examples from the training dataset in Figure 19, each corresponding to a different class. Observe that the near-infrared images darken the background and successfully highlight the hand pose. As a test dataset, we take 10 images of each of the two classes from users 1 to 10, resulting in a test dataset of size 200.

Cropping the dataset To speed up training and ease the classification problem, we crop the images from a size of 640 x 240 to a size of 200 x 200. We crop the images using a basic image segmentation technique in order to stay as close as possible to real-world applications. The aim is to crop the images such that the hand gesture is centered within the cropped image. For every user in the training set, we crop an image of the L pose and the index pose by hand; we call these images the training masks {mask_l}_{l=1}^{20}. The more a particular window of an image resembles a mask, the more likely it is that the window captures the hand gesture correctly. Moreover, the near-infrared images are such that the hands of a person are brighter than the person's surroundings. Based on these two observations, we define the best segment or window, identified by its upper-left coordinates (i, j), for an image x as the solution of the following optimization problem:

argmin_{i∈[440], j∈[40]}  Σ_{l=1}^{20} ‖ mask_l − x_{[i:i+200, j:j+200]} ‖₂² − (1/2) ‖ x_{[i:i+200, j:j+200]} ‖₁.  (34)

Equation 34 is solved using a full grid search. We use the result to crop both training and test images. Upon manual inspection of the cropped images, close to all images were perfectly cropped. We replace the handful of poorly cropped training images with hand-cropped counterparts.
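A brute-force sketch of the cropping rule in Equation 34 follows (a hypothetical implementation; the exact weighting of the two terms reflects our reconstruction of the equation above):

    # Hypothetical implementation of the cropping rule in Equation 34
    # (full grid search; slow on purpose, mirroring the description above).
    import numpy as np

    def best_crop_corner(image, masks, size=200):
        # image: (240, 640) near-infrared frame; masks: list of 20 hand-cropped (200, 200) templates
        best, best_val = (0, 0), np.inf
        for i in range(image.shape[1] - size):        # horizontal offsets, i in [440]
            for j in range(image.shape[0] - size):    # vertical offsets, j in [40]
                window = image[j:j + size, i:i + size]
                val = sum(np.sum((m - window) ** 2) for m in masks) - 0.5 * np.sum(np.abs(window))
                if val < best_val:
                    best, best_val = (i, j), val
        return best                                    # upper-left corner of the chosen crop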
Square-mask perturbations Since we use logistic regression, we perform a full grid search to find the best adversarial perturbation at training and test time. For completeness, the upper-left coordinates (i, j) of the optimal black-mask perturbation of size ϵtr x ϵtr can be found as the solution of

argmax_{i∈[200−ϵtr], j∈[200−ϵtr]}  Σ_{l,m∈[ϵtr]} θ[i+l, j+m].  (35)

The algorithm is rather slow, as we iterate over all possible windows. We show a black-mask perturbation on an L-pose image in Figure 20c.

Figure 18: The robust error decomposition into standard error and susceptibility on the subsampled CIFAR-10 datasets after adversarial and standard training. For small sample sizes, adversarial training has higher robust error than standard training.

Figure 19: Examples of the original images of the considered hand gestures: the L sign in Figure 19a and the index sign in Figure 19b. Observe that the near-infrared images highlight the hand pose well and blend out much of the non-useful or noisy background.

Figure 20: Examples of the cropped hand-gesture images ((a) cropped L pose, (b) cropped index pose, (c) black-mask perturbation). The hands are centered and the images have a size of 200 x 200. Figure 20c shows an example of the square black-mask perturbation.

Results We run adversarial logistic regression with square-mask perturbations on the cropped dataset, vary the adversarial training budget, and plot the results in Figure 21. We observe that adversarial logistic regression deteriorates robust generalization. Because we use logistic regression, we are able to visualize the classifier: given the classifier induced by θ, we can visualize how it classifies the images by plotting the min-max rescaled weights (θ − min_{i∈[d]} θ[i]) / (max_{i∈[d]} θ[i] − min_{i∈[d]} θ[i]) ∈ [0, 1]^d. Recall that the class prediction of our predictor for a data point (x, y) is given by sign(θ⊤x) ∈ {±1}. The lighter parts of the resulting image correspond to the class with label 1 and the darker patches to the class with label −1.

Figure 21: The standard error and robust error for varying adversarial training budget ϵtr. We see that the larger ϵtr, the higher the robust error.

We plot the classifiers obtained by standard logistic regression and by adversarial logistic regression with training budgets ϵtr of 10 and 25 in Figure 22. The darker parts of the classifier correspond to patches that are typically bright for the L pose; complementarily, the lighter patches correspond to patches that are typically bright for the index pose. We see that for adversarial logistic regression the background noise is much higher than for standard logistic regression. In other words, adversarial logistic regression puts more weight on non-signal parts of the images to classify the training dataset and hence exhibits worse performance on the test dataset.

Figure 22: We visualize the logistic regression solutions ((a) ϵtr = 0, (b) ϵtr = 10, (c) ϵtr = 25). In Figure 22a we plot the vector that induces the classifier obtained after standard training. In Figures 22b and 22c we plot the vectors obtained after training with square-mask perturbations of size 10 and 25, respectively.
Note the enhanced non-signal background correlations in the regions highlighted by the red circles in the visualizations of the adversarially trained classifiers.
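For illustration, the grid search of Equation 35 and the classifier visualization described above can be sketched as follows (hypothetical helpers, not the paper's code; Equation 35 is implemented as reconstructed above):

    # Hypothetical helpers (not the paper's code): the grid search of Equation 35
    # and the min-max rescaling used to visualize a linear classifier as an image.
    import numpy as np

    def best_mask_location(theta_img, eps_tr):
        # theta_img: (200, 200) classifier weights reshaped to image form
        best, best_val = (0, 0), -np.inf
        for i in range(theta_img.shape[0] - eps_tr):
            for j in range(theta_img.shape[1] - eps_tr):
                val = theta_img[i:i + eps_tr, j:j + eps_tr].sum()
                if val > best_val:
                    best, best_val = (i, j), val
        return best                                    # upper-left corner of the chosen mask

    def visualize_classifier(theta_img):
        # rescale the weights to [0, 1]: lighter pixels vote for label +1, darker ones for -1
        return (theta_img - theta_img.min()) / (theta_img.max() - theta_img.min())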