# Overfitting in adversarially robust deep learning

Leslie Rice *1, Eric Wong *2, J. Zico Kolter 1

Abstract

It is common practice in deep learning to use overparameterized networks and train for as long as possible; numerous studies show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models (ℓ∞ and ℓ2). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting.

*Equal contribution. 1Computer Science Department, Carnegie Mellon University, Pittsburgh PA, USA. 2Machine Learning Department, Carnegie Mellon University, Pittsburgh PA, USA. Correspondence to: Leslie Rice, Eric Wong.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. The learning curves (robust and standard error on the train and test sets over 200 epochs) for a robustly trained model replicating the experiment done by Madry et al. (2017) on CIFAR-10. The curves demonstrate robust overfitting: shortly after the first learning rate decay the model momentarily attains 43.2% robust error, and is actually more robust than the model at the end of training, which only attains 51.4% robust test error against a 10-step ℓ∞ PGD adversary with radius ϵ = 8/255. The learning rate is decayed at 100 and 150 epochs.

1. Introduction

One of the surprising characteristics of deep learning is the relative lack of overfitting seen in practice (Zhang et al., 2016). Deep learning models can often be trained to zero training error, effectively memorizing the training set, seemingly without causing any detrimental effects on generalization performance. This phenomenon has been widely studied from both theoretical (Neyshabur et al., 2017) and empirical perspectives (Belkin et al., 2019), and remains such a hallmark of deep learning practice that it is often taken for granted.

In this paper, we consider the empirical question of overfitting in a similar, but slightly different domain: the setting of adversarial training for robust networks. Adversarial training is a method for hardening classifiers against adversarial attacks, i.e.
small perturbations to the input which can drastically change a classifier's predictions, that involves training the network on adversarially perturbed inputs instead of on clean data (Goodfellow et al., 2014). It is generally regarded as one of the strongest empirical defenses against these attacks (Madry et al., 2017).

A key finding of our paper is that, unlike in traditional deep learning, overfitting is a dominant phenomenon in adversarially robust training of deep networks. That is, adversarially robust training has the property that, after a certain point, further training will continue to substantially decrease the robust training loss of the classifier while increasing the robust test loss. This is shown, for instance, in Figure 1 for adversarial training on CIFAR-10, where the robust test error dips immediately after the first learning rate decay, and only increases beyond this point. We show that this phenomenon, which we refer to as robust overfitting, can be observed on multiple datasets beyond CIFAR-10, such as SVHN, CIFAR-100, and ImageNet.

Motivated by this initial finding, we make several contributions in this paper to further study and diagnose this problem. First, we emphasize that virtually all of the recent gains in adversarial performance from newer algorithms beyond simple projected gradient descent (PGD) based adversarial training (Mosbach et al., 2018; Xie et al., 2019; Yang et al., 2019; Zhang et al., 2019c) can be attained by a much simpler approach: using early stopping. Specifically, by just using an earlier checkpoint, the robust performance of adversarially trained deep networks can be drastically improved, to the point where the original PGD-based adversarial training method can actually achieve the same robust performance as state-of-the-art methods. For example, vanilla PGD-based adversarial training (Madry et al., 2017) can achieve 43.2% robust test error against an ℓ∞ PGD adversary with radius 8/255 on CIFAR-10 when training is stopped early, on par with the 43.4% robust test error reported by TRADES (Zhang et al., 2019c) against the same adversary. This phenomenon is not unique to ℓ∞ perturbations and is also seen in ℓ2 adversarial training. For instance, early stopping a CIFAR-10 model trained against an ℓ2 adversary with radius 128/255 can decrease the robust test error from 31.1% to 28.4%.

Second, we study various empirical properties of overfitting for adversarially robust training and how they relate to standard training. Since the effects of such overfitting appear closely tied to the learning rate schedule, we begin by investigating how changes to the learning rate schedule affect the prevalence of robust overfitting and its impact on model performance. We next explore how known connections between hypothesis class size and generalization in deep networks translate to the robust setting, and show that the double descent generalization curves seen in standard training (Belkin et al., 2019) also hold for robust training (Nakkiran et al., 2019). However, although this is used as a justification for the lack of overfitting in the standard setting, surprisingly, changing the hypothesis class size does not actually mitigate the robust overfitting that is observed during training.

Our final contribution is to investigate several techniques for preventing robust overfitting.
We first explore the effects of classic statistical approaches for combating overfitting beyond early stopping, namely explicit ℓ1 and ℓ2 regularization. We then study more modern approaches using data augmentation, including cutout (DeVries & Taylor, 2017), mixup (Zhang et al., 2017), and semi-supervised learning methods, which are known to empirically reduce overfitting in deep networks. Ultimately, while these methods can mitigate robust overfitting to varying degrees, when trained to convergence, we find that no other approach to combating robust overfitting performs better than simple early stopping. In fact, even combining regularization methods with early stopping tends not to significantly improve on early stopping alone. We find that the one exception is data augmentation with semi-supervised learning, where although the test performance can vary wildly even when training has converged, at select epochs it is possible to find a model with improved robust performance over simple early stopping. Code for reproducing all the experiments in this paper, along with pretrained model weights and training logs, can be found at https://github.com/locuslab/robust_overfitting.[1]

[1] Since there are over 75 models trained in this paper, we selected a subset of pretrained models to release (e.g. those for Wide ResNets, since those take the most time to train and achieve the best performance in the paper).

2. Background and related work

One of the first approaches to adversarial training used a single-step gradient-based method for generating adversarial examples known as the fast gradient sign method (FGSM) (Goodfellow et al., 2014). The adversary was later extended to take multiple smaller steps, in a technique known as the basic iterative method (Kurakin et al., 2016), and eventually reincorporated into adversarial training with random restarts, commonly referred to as projected gradient descent (PGD) adversarial training (Madry et al., 2017). Further improvements to both the PGD adversary and the training procedure include incorporating momentum into the adversary (Dong et al., 2018), leveraging matrix estimation (Yang et al., 2019), logit pairing (Mosbach et al., 2018), and feature denoising (Xie et al., 2019). Most notably, Zhang et al. (2019c) proposed the TRADES method for adversarial training, which balances the trade-off between standard and robust errors and achieves state-of-the-art performance on several benchmarks.

Because PGD training is significantly more time-consuming than standard training, several works have focused on improving the efficiency of adversarial training by reducing the computational complexity of calculating gradients and reducing the number of attack iterations (Shafahi et al., 2019; Zhang et al., 2019a; Wong et al., 2020). Separate works have also expanded the general PGD adversarial training algorithm to different threat models, including image transformations (Engstrom et al., 2017; Xiao et al., 2018a), different distance metrics (Wong et al., 2019), and multiple threat models (Maini et al., 2019; Tramèr & Boneh, 2019).

Other adversarial defenses that have been proposed were not always successful, such as distillation (Papernot et al., 2016; Carlini & Wagner, 2017b) and detection of adversarial examples (Metzen et al., 2017; Feinman et al., 2017; Carlini & Wagner, 2017a; Tao et al., 2018; Carlini, 2019), which eventually were defeated by stronger attacks.
Adversarial examples were also believed to be ineffective in the real world across different viewpoints (Lu et al., 2017) until proven otherwise (Athalye et al., 2017), and a large number of adversarial defenses were shown to rely on obfuscated gradients and were ultimately rendered ineffective (Athalye et al., 2018), including thermometer encoding (Buckman et al., 2018) and various preprocessing techniques (Guo et al., 2017; Song et al., 2017).

Because many defenses were broken by stronger adversaries, a separate but related line of work has looked at generating certificates which can guarantee or prove robustness of the network output to norm-bounded adversarial perturbations. While not always scalable to large convolutional networks, methods for generating these robustness certificates range from using Satisfiability Modulo Theories (SMT) solvers (Ehlers, 2017; Huang et al., 2017; Katz et al., 2017) and mixed-integer linear programs (Tjeng et al., 2019) for exact certificates, to semidefinite programming (SDP) solvers for relaxed but still accurate certificates (Raghunathan et al., 2018a;b; Fazlyab et al., 2019). Other methods focus on generating more tractable but relaxed certificates, which provide looser guarantees but can be optimized during training. These methods leverage techniques such as duality and linear programming (Wong & Kolter, 2017; Dvijotham et al.; Wong et al., 2018; Salman et al., 2019b; Zhang et al., 2019b), randomized smoothing (Cohen et al., 2019; Lecuyer et al., 2019; Salman et al., 2019a), distributional robustness (Sinha et al., 2017), abstract interpretation (Gehr et al., 2018; Mirman et al., 2018; Singh et al., 2018), and interval bound propagation (Gowal et al., 2018). Another approach is to use theoretically justified training heuristics (Croce et al., 2018; Xiao et al., 2018b) which result in models that are verifiable by an independent certification method.

Highly relevant to this work are studies of the general problem of overfitting in machine learning. Both regularization (Friedman et al., 2001) and early stopping (Strand, 1974) have been well studied in classical statistical settings to reduce overfitting and improve generalization, and connections between the two have been established in various settings such as kernel boosting algorithms (Wei et al., 2017), least squares regression (Ali et al., 2018), and strongly convex problems (Suggala et al., 2018). Although ℓ2 regularization (also known as weight decay) is commonly used for training deep networks (Krogh & Hertz, 1992), early stopping is less commonly used, despite having been studied as an implicit regularizer for controlling model complexity in neural networks at least 30 years ago (Morgan & Bourlard, 1990).[2] Indeed, it is now known that the standard bias-variance trade-off from classical statistical learning theory fails to explain why deep networks can generalize so well (Zhang et al., 2016). Consequently, it is now standard practice in many modern deep learning tasks to train for as long as possible and use large overparameterized models, since test set performance typically continues to improve past the point of dataset interpolation, in what is known as double descent generalization (Belkin et al., 2019; Nakkiran et al., 2019). The generalization gap for robust deep networks has also been studied from a learning-theoretic perspective in the context of data complexity (Schmidt et al., 2018) and Rademacher complexity (Yin et al., 2018).

[2] It is common practice in deep learning to save the best checkpoint, which can be seen as early stopping. However, in the standard setting, the test loss tends to gradually improve over training, and so the best checkpoint tends to just select the best performance at the end of training, rather than stopping before the training loss has converged.
Also relevant to this work are methods specific to deep learning that empirically reduce overfitting and improve the performance of deep networks. For example, Dropout is a commonly used stochastic regularization technique that randomly drops units and their connections from the network during training (Srivastava et al., 2014), with the intent of preventing complex co-adaptations on the training data. Data augmentation is another technique frequently used when training deep networks that has been empirically shown to reduce overfitting. Cutout (DeVries & Taylor, 2017) is a form of data augmentation that randomly masks out a section of the input during training, which can be considered as augmenting the dataset with occlusions. Another technique known as mixup (Zhang et al., 2017) trains on convex combinations of pairs of data points and their corresponding labels to encourage linear behavior in between data points. Semi-supervised learning methods augment the dataset with unlabeled data, and have been shown to improve generalization when used in the adversarially robust setting (Carmon et al., 2019; Zhai et al., 2019; Alayrac et al., 2019).

3. Adversarial training and robust overfitting

In order to learn networks that are robust to adversarial examples, a commonly used method is adversarial training, which solves the following robust optimization problem:

$$\min_\theta \sum_i \max_{\delta \in \Delta} \ell(f_\theta(x_i + \delta), y_i), \qquad (1)$$

where $f_\theta$ is a network with parameters $\theta$, $(x_i, y_i)$ is a training example, $\ell$ is the loss function, and $\Delta$ is the perturbation set. Typically the perturbation set $\Delta$ is chosen to be an $\ell_p$-norm ball (e.g. the $\ell_2$ and $\ell_\infty$ perturbations which we consider in this paper), such that $\Delta = \{\delta : \|\delta\|_p \le \epsilon\}$ for $\epsilon > 0$. Adversarial training approximately solves the inner optimization problem, also known as the robust loss, using some adversarial attack method, typically projected gradient descent (PGD), and then updates the model parameters $\theta$ using gradient descent (Madry et al., 2017). For example, an $\ell_\infty$ PGD adversary starts at some random initial perturbation $\delta^{(0)}$ and iteratively adjusts the perturbation with the following gradient steps, while projecting back onto the $\ell_\infty$ ball with radius $\epsilon$:

$$\tilde{\delta} = \delta^{(t)} + \alpha \cdot \mathrm{sign}\big(\nabla_x \ell(f_\theta(x + \delta^{(t)}), y)\big), \qquad \delta^{(t+1)} = \max(\min(\tilde{\delta}, \epsilon), -\epsilon). \qquad (2)$$

We denote error rates when attacked by a PGD adversary as the robust error, and error rates on the clean, unperturbed data as the standard error.
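To make equations (1) and (2) concrete, below is a minimal PyTorch-style sketch of ℓ∞ PGD adversarial training. It illustrates the general algorithm rather than reproducing the released implementation; the function names, the hyperparameter defaults (ϵ = 8/255, step size 2/255, 10 steps), and the omission of input-range clipping are simplifications chosen for illustration.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization of Eq. (1) with the l-inf PGD steps of Eq. (2)."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)  # random start delta^(0)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # ascent step on the loss, then project back onto the l-inf ball of radius eps
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # clamping x + delta to the valid pixel range is omitted here for brevity
    return delta.detach()

def adversarial_training_epoch(model, loader, opt, device):
    """One epoch of PGD-based adversarial training: the outer minimization of Eq. (1)."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        delta = pgd_linf(model, x, y)
        loss = F.cross_entropy(model(x + delta), y)  # robust loss on the perturbed batch
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The outer loop updates the network only on the perturbed examples, which is the standard (non-TRADES) formulation discussed in this section.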
3.1. Robust overfitting: a general phenomenon for adversarially robust deep learning

In the standard, non-robust deep learning setting, it is common practice to train for as long as possible to minimize the training loss, as modern convergence curves for deep learning generally show that the test loss continues to decrease along with the training loss. On the contrary, for the setting of adversarially robust training we make the following discovery:

Unlike in the standard setting of deep networks, overfitting for adversarially robust training can result in worse test set performance.

This phenomenon, which we refer to as robust overfitting, results in convergence curves as shown earlier in Figure 1. Although training appears normal in the earlier stages, after the learning rate decays, the robust test error briefly decreases but then begins to increase as training progresses. This behavior indicates that the optimal performance is not obtained at the end of training, unlike in standard training of deep networks.

We find that robust overfitting occurs across a variety of datasets, algorithmic approaches, and perturbation threat models, indicating that it is a general property of the adversarial training formulation and not specific to a particular problem, as can be seen in Table 1 for ℓ∞ and ℓ2 perturbations on SVHN, CIFAR-10, CIFAR-100, and ImageNet. A more detailed and expanded version of this table summarizing the full extent of robust overfitting, as well as the corresponding learning curves for each setting, can be found in Appendix A. We consistently find a significant gap between the best robust test performance during training and the final robust test performance at the end of training, observing an increase of 8.2% robust error for CIFAR-10 and 22.8% robust error for ImageNet against an ℓ∞ adversary, to highlight a few. Robust overfitting is also not specific to PGD-based adversarial training: it affects faster adversarial training methods such as FGSM adversarial training[3] (Wong et al., 2020) as well as top-performing algorithms for adversarially robust training such as TRADES (Zhang et al., 2019c).

[3] Wong et al. (2020) also observe a different form of overfitting specific to FGSM adversarial training, which they refer to as catastrophic overfitting. This is separate behavior from the robust overfitting described in this paper, and the specifics of this distinction are discussed further in Appendix A.4.

Table 1. Robust performance showing the occurrence of robust overfitting across datasets and perturbation threat models. The best robust test error is the lowest test error observed during training. The final robust test error is averaged over the last five epochs. The difference between final and best robust test error indicates the degradation in robust performance during training.

| Dataset | Norm | Radius | Final robust test error (%) | Best (%) | Diff (%) |
| --- | --- | --- | --- | --- | --- |
| SVHN | ℓ∞ | 8/255 | 45.6 ± 0.40 | | |
| SVHN | ℓ2 | 128/255 | 26.4 ± 0.27 | | |
| CIFAR-10 | ℓ∞ | 8/255 | 51.4 ± 0.41 | 43.2 | 8.2 |
| CIFAR-10 | ℓ2 | 128/255 | 31.1 ± 0.46 | 28.4 | 2.7 |
| CIFAR-100 | ℓ∞ | 8/255 | 78.6 ± 0.39 | | |
| CIFAR-100 | ℓ2 | 128/255 | 62.5 ± 0.09 | | |
| ImageNet | ℓ∞ | 4/255 | 85.5 ± 8.87 | 62.7 | 22.8 |
| ImageNet | ℓ2 | 128/255 | 94.8 ± 1.16 | 63.0 | 31.8 |

Learning rate schedules and robust overfitting. Since the change in performance appears to be closely linked with the first drop in the scheduled learning rate decay, we explore how different learning rate schedules affect robust overfitting on CIFAR-10, as shown in Figure 2, with complete descriptions of the various learning rate schedules in Appendix B.1. In summary, we find that smoother learning rate schedules (which take smaller decay steps or interpolate the change in learning rate over epochs) simply result in smoother curves that still exhibit robust overfitting. Furthermore, with each smoother learning rate schedule, the best robust test performance during training is strictly worse than the best robust test performance during training with the discrete piecewise decay schedule. In fact, the parameters of the discrete piecewise decay schedule can even be tuned to slightly exacerbate the sudden improvement in performance after the first learning rate decay step, which we discuss further in Appendix B.2.
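For reference, a minimal sketch of the discrete piecewise decay schedule from Figure 1 is shown below, with the learning rate dropped at epochs 100 and 150 of a 200-epoch run; the base learning rate, momentum, weight decay, and the 10x decay factor are illustrative assumptions rather than values quoted from this section.

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for the actual network
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Piecewise decay: drop the learning rate at epochs 100 and 150 (assumed 10x drops).
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one epoch of adversarial training with the current learning rate ...
    scheduler.step()  # the robust test error typically dips right after the first drop
```

The smoother alternatives compared in Figure 2 (linear, cyclic, cosine, and smaller multiple decays) replace the scheduler above while keeping the rest of training fixed.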
Figure 2. Robust test error over training epochs for various learning rate schedules on CIFAR-10 (piecewise decay, multiple decay, linear decay, cyclic, and cosine). None of the alternative smoother learning rate schedules can achieve a peak performance competitive with the standard piecewise decay learning rate, indicating that the peak performance is obtained by having a single discrete jump. Note that the multiple decay schedule is actually run for 500 epochs, but is compressed into this plot for a clear comparison.

3.2. Mitigating robust overfitting with early stopping

Proper early stopping, an old form of implicit regularization, calculates a metric on a hold-out validation set to determine when to stop training in order to prevent overfitting. Since the test performance does not monotonically improve during adversarially robust training due to robust overfitting, it is advantageous for robust networks to use early stopping to achieve the best possible robust performance.

We find that, for example, the TRADES approach relies heavily on using the best robust performance on the test set from an earlier checkpoint in order to achieve its top reported result of 43.4% robust error against an ℓ∞ PGD adversary with radius 8/255 on CIFAR-10, a number which is typically viewed as a substantial algorithmic improvement in adversarial robustness over standard PGD-based adversarial training. In our own reproduction of the TRADES experiment, we confirm that allowing the TRADES algorithm to train until convergence results in significant degradation of robust performance, as seen in Figure 3. Specifically, the robust test error of the model at the checkpoint with the best performance on the test set is 44.1%, whereas the robust test error of the model at the end of training has increased to 50.6%.[4]

[4] We used the public implementation of TRADES available at https://github.com/yaodongyu/TRADES and simply ran it to completion using the same learning rate decay schedule used by Madry et al. (2017).

Figure 3. Learning curves showing standard and robust error rates for a Wide ResNet model trained with TRADES on CIFAR-10. Early stopping after the initial learning rate decay is crucial in order to achieve the 43.4% robust test error reported by Zhang et al. (2019c), which eventually degrades to 50.6% robust test error when the training has converged.

Surprisingly, when we early stop vanilla PGD-based adversarial training, selecting the model checkpoint with the best performance on the test set, we find that PGD-based adversarial training performs just as well as more recent algorithmic approaches such as TRADES. Specifically, when using the same architecture (a Wide ResNet with depth 28 and width factor 10) and the same 20-step PGD adversary for evaluation used by Zhang et al. (2019c) for TRADES, the model checkpoint with the best performance on the test set from vanilla PGD-based adversarial training achieves 42.3% robust test error, which is actually slightly better than the best reported result for TRADES from Zhang et al. (2019c).[5] Similarly, we find early stopping to be a factor in the robust test performance of publicly released pre-trained ImageNet models (Engstrom et al., 2019).
Continuing to train these models degrades the robust test performance from 62.7% to 85.5% robust test error for ℓ∞ robustness at ϵ = 4/255, and from 63.0% to 94.8% robust test error for ℓ2 robustness at ϵ = 128/255. This shows that these models are also susceptible to robust overfitting and benefit greatly from early stopping.[6] The corresponding learning curves are shown in Appendix A.3.

[5] We found our implementation of the PGD adversary to be slightly more effective, increasing the robust test error of the TRADES model and the PGD-trained model to 45.0% and 43.2% respectively.

[6] We use the publicly available framework from https://github.com/madrylab/robustness and continue training checkpoints obtained from the authors using the same learning parameters.

Figure 4. Learning curves (robust loss on the train, validation, and test sets) for a CIFAR-10 pre-activation ResNet18 model trained with a hold-out validation set of 1,000 examples. We find that the hold-out validation set is enough to reflect the test set performance, and stopping based on the validation set is able to prevent overfitting and recover 46.9% robust test error, in comparison to the 46.7% achieved by the best-performing model checkpoint.

Validation-based early stopping. Early stopping based on test set performance, however, leaks test set information and goes against the traditional machine learning paradigm. Instead, we find that it is still possible to recover the best test performance achieved during training with a true hold-out validation set. By holding out 1,000 examples from the CIFAR-10 training set for validation purposes, we use validation-based early stopping to achieve 46.9% robust error on the test set without looking at the test set, in comparison to the 46.7% robust error achieved by the best-performing model checkpoint for a pre-activation ResNet18. The resulting validation curve during training closely matches the test curve, as seen in Figure 4, and suggests that although robust overfitting degrades the robust test set performance, selecting the best checkpoint in adversarially robust training for deep networks still does not appear to significantly overfit to the test set (as has been previously observed in the standard, non-robust setting (Recht et al., 2018)).
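A minimal sketch of this validation-based early stopping procedure is given below, reusing `pgd_linf` and `adversarial_training_epoch` from the earlier sketch; the stand-in architecture, optimizer settings, and batch sizes are illustrative assumptions rather than the exact experimental configuration.

```python
import copy
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)   # stand-in for the pre-activation ResNet18
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 150], gamma=0.1)

# Hold out 1,000 of the 50,000 CIFAR-10 training examples as a validation set.
full_train = datasets.CIFAR10("data", train=True, download=True, transform=transforms.ToTensor())
train_set, val_set = random_split(full_train, [49000, 1000])
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128)

def robust_error(net, loader):
    """Error rate under the PGD adversary (pgd_linf from the earlier sketch)."""
    net.eval()
    wrong = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        delta = pgd_linf(net, x, y)
        wrong += (net(x + delta).argmax(1) != y).sum().item()
        total += y.numel()
    return wrong / total

best_err, best_state = float("inf"), None
for epoch in range(200):
    adversarial_training_epoch(model, train_loader, opt, device)  # from the earlier sketch
    scheduler.step()
    err = robust_error(model, val_loader)
    if err < best_err:   # keep the checkpoint with the lowest robust validation error
        best_err, best_state = err, copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)  # deploy the early-stopped model rather than the final one
```

Selecting the checkpoint on the held-out split rather than on the test set is what keeps this procedure within the traditional machine learning paradigm discussed above.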
3.3. Reconciling double descent curves

Modern generalization curves for deep learning typically show improved test set performance for increased model complexity beyond data point interpolation, in what is known as double descent (Belkin et al., 2019). This suggests that overfitting by increasing model complexity using overparameterized neural networks is beneficial and improves test set performance. However, this appears to be at odds with the main findings of this paper: since training for longer can also be viewed as increasing model complexity, the fact that training for longer results in worse test set performance seems to contradict double descent.

We find that, while increasing either training time or architecture size can be viewed as increasing model complexity, these two approaches actually have separate effects: training for longer degrades the robust test set performance regardless of architecture size, while increasing the model architecture size still improves the robust test set performance despite robust overfitting. This was briefly noted by Nakkiran et al. (2019) for the ℓ2 robust setting, and so in this section we show that this generally holds also in the ℓ∞ robust setting. We explore these properties by training multiple adversarially robust Wide ResNets (Zagoruyko & Komodakis, 2016) with varying widths to control model complexity. In Figure 5, we see that no matter how large the model architecture is, robust overfitting still results in a significant gap between the best and final robust test performance. However, we also see that adversarially robust training still produces the double descent generalization curve, as the robust test performance increases and then decreases again with architecture size, suggesting that double descent and robust overfitting are separate phenomena. Even the lowest robust test error achieved during training continues to descend with increased model complexity, suggesting that larger architecture sizes are still beneficial for adversarially robust training despite robust overfitting. More details and learning curves for a wide range of architecture sizes can be found in Appendix C.

Figure 5. Generalization curves (train and test robust error versus width factor) depicting double descent for adversarially robust generalization, where hypothesis class complexity is controlled by varying the width factor of a Wide ResNet. Each final model point represents the average performance over the last five epochs of training until convergence with the corresponding width factor. The best checkpoint refers to the lowest robust test error achieved by a model checkpoint during training, and illustrates the significant gap in performance between the best and final models resulting from robust overfitting.
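The sweep behind Figure 5 can be sketched roughly as follows: for each width factor, a Wide ResNet is adversarially trained to convergence while recording the lowest robust test error reached at any checkpoint and the error averaged over the last five epochs. The `WideResNet(depth=28, widen_factor=k)` constructor is a hypothetical placeholder for any standard WRN-28-k implementation, the list of width factors is illustrative, and `train_loader`, `test_loader`, `device`, and the helper functions come from the earlier sketches.

```python
import torch

results = {}
for k in [1, 2, 5, 10, 15, 20]:                     # width factors controlling model complexity
    model = WideResNet(depth=28, widen_factor=k).to(device)  # hypothetical WRN-28-k constructor
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[100, 150], gamma=0.1)
    errs = []
    for epoch in range(200):
        adversarial_training_epoch(model, train_loader, opt, device)
        scheduler.step()
        errs.append(robust_error(model, test_loader))         # robust test error per epoch
    # "best checkpoint" vs. "final model" (average over the last five epochs), as in Figure 5
    results[k] = {"best": min(errs), "final": sum(errs[-5:]) / 5}
```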
Table 2. Robust performance of PGD-based adversarial training with different regularization methods on CIFAR-10 using a PreActResNet18, for ℓ∞ with radius 8/255. The best robust test error is the lowest test error achieved during training, whereas the final robust test error is averaged over the last five epochs. Each of the regularization methods listed is trained using the optimally chosen hyperparameter. Pure early stopping is done with a validation set.

| Regularization method | Final robust test error (%) | Best (%) | Diff (%) |
| --- | --- | --- | --- |
| Early stopping w/ val | 46.9 | 46.7 | 0.2 |
| ℓ1 regularization | 53.0 ± 0.39 | 48.6 | 4.4 |
| ℓ2 regularization | 55.2 ± 0.40 | 46.4 | 8.8 |
| Cutout | 48.8 ± 0.79 | 46.7 | 2.1 |
| Mixup | 49.1 ± 1.32 | 46.3 | 2.8 |
| Semi-supervised | 47.1 ± 4.32 | 40.2 | 6.9 |

4. Alternative methods to prevent robust overfitting

In this section, we explore whether common methods for combating overfitting in standard training are successful at mitigating robust overfitting in adversarial training. We run a series of ablation studies on CIFAR-10 using classical and modern regularization techniques, yet ultimately find that no technique performs as well in isolation as early stopping, as shown in Table 2 (a more detailed table including standard error can be found in Appendix D.2). Unless otherwise stated, we begin each experiment with the standard setup for PGD-based adversarial training with a 10-step adversary with step size 2/255, using a pre-activation ResNet18 (He et al., 2016); details of the training procedure and the PGD adversary can be found in Appendix D.1. All experiments in this section were run with one GeForce RTX 2080 Ti, unless a Wide ResNet was trained, in which case two GPUs were used.

4.1. Explicit regularization

A classical method for preventing overfitting is to add an explicit regularization term to the loss, penalizing the complexity of the model parameters. Specifically, the term is typically of the form λΩ(θ), where θ contains the model parameters, Ω(θ) is some regularization penalty, and λ is a hyperparameter controlling the regularization effect. A typical choice for Ω is ℓp regularization for p ∈ {1, 2}, where ℓ2 regularization is canonically known as weight decay and commonly used in standard training of deep networks, and ℓ1 regularization is known to induce sparsity.

We explore the effects of ℓ1 and ℓ2 regularization on robust overfitting when training robust networks, and sweep across a range of hyperparameter values, as seen in Figure 6 for ℓ2.[7] Although explicit regularization does improve the performance to some degree, on its own it is still not as effective as early stopping, with the best explicit regularizer achieving 55.2% robust test error using ℓ2 regularization with parameter λ = 5 × 10⁻². Additionally, neither of these regularization techniques can completely remove the detrimental effects of robust overfitting without drastically over-regularizing the model, which is shown and discussed further in Appendix D.3, along with the corresponding plots for ℓ1 regularization.

[7] Proper parameter regularization only applies the penalty to the weights w of the affine transformations at each layer, excluding the bias terms and batch normalization parameters.

Figure 6. Robust performance on the train and test set for varying degrees of ℓ2 regularization (best checkpoint and final model, over λ from 10⁻⁴ to 10²). ℓ2 regularization is unable to match the performance of early stopping without also using early stopping, even with an optimally chosen hyperparameter of λ = 5 × 10⁻³, which achieves 55.2% robust test error.
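As an illustration of the penalty described above, the sketch below adds λ·Ω(θ) to the robust loss and, following footnote 7, applies it only to multi-dimensional (affine) weight tensors so that biases and batch normalization parameters are excluded; the filtering heuristic and the default λ are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def lp_penalty(model, p=2):
    """Omega(theta): sum of |w|^p over affine weights only; 1-D parameters
    (biases and batch-norm scales/shifts) are excluded, following footnote 7."""
    return sum(param.abs().pow(p).sum() for param in model.parameters() if param.ndim > 1)

def regularized_robust_step(model, x, y, opt, lam=5e-3, p=2):
    """One update on the robust loss plus the explicit lambda * Omega(theta) penalty."""
    delta = pgd_linf(model, x, y)                   # inner maximization from the earlier sketch
    loss = F.cross_entropy(model(x + delta), y) + lam * lp_penalty(model, p)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Setting p=1 gives the ℓ1 variant; sweeping `lam` over several orders of magnitude corresponds to the hyperparameter scan shown in Figure 6.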
4.2. Data augmentation for deep learning

Data augmentation has been empirically shown to reduce overfitting in modern deep learning tasks involving very high-dimensional data by enhancing the quantity and diversity of the training data. Such techniques range from simple augmentations like random cropping and horizontal flipping to more recent approaches leveraging unlabeled data for semi-supervised learning, and some work has argued that robust deep learning requires more data than standard deep learning (Schmidt et al., 2018).

Figure 7. Robust performance on the train and test set with cutout across varying patch lengths (best checkpoint and final model, cutout length 1 to 20). Even with the optimal patch length of 14, cutout does not surpass the performance of early stopping, achieving at best 48.8% robust test error at the end of training.

Cutout and mixup. Recent data augmentation techniques for deep networks, such as cutout (DeVries & Taylor, 2017) and mixup (Zhang et al., 2017), are known to reduce overfitting and improve generalization in the standard training setting. We scan a range of hyperparameters for these approaches where applicable, and find a similar story to that of explicit ℓp regularization: either the regularization effect of cutout and mixup is too low to prevent robust overfitting, or it is too high and the model is over-regularized, as seen in Figure 7 for cutout. When trained to convergence, neither cutout nor mixup is as effective as early stopping, achieving at best 48.8% robust test error for cutout with a patch length of 14 and 49.1% robust test error for mixup with α = 1.4.[8] The corresponding plots for mixup and the learning curves for both methods are in Appendix D.4, where we see significant robust overfitting for cutout but less so for mixup, which appears to be more regularized.

[8] We used the public implementations of cutout and mixup available at https://github.com/davidcpage/cifar10-fast and https://github.com/facebookresearch/mixup-cifar10.

Semi-supervised learning. We additionally consider a semi-supervised data augmentation technique (Carmon et al., 2019; Zhai et al., 2019; Alayrac et al., 2019) which uses a standard classifier to label unlabeled data for use in robust training. Although there is a large gap between the best and final robust performance shown in Table 2, we find that this is primarily driven by high variance in the robust test error during training rather than by robust overfitting, even when the model has converged, as seen in Figure 8. Due to this variance, the final model's average robust performance of 47.1% robust test error is similar to the performance obtained by early stopping. By combining early stopping with semi-supervised data augmentation, this variance can be avoided. In fact, we find that the combination of early stopping and semi-supervised data augmentation is the only method that results in significant improvement over early stopping alone, resulting in 40.2% robust test error. Experimental details and further discussion for this approach can be found in Appendix E.[9]

Figure 8. Learning curves (robust and standard error on the train and test sets) for robust training with semi-supervised data augmentation, where we do not see a severe case of robust overfitting. When the robust training error has converged, there is a significant amount of variance in the robust test error, so the average final model performance is on par with pure early stopping. Combining early stopping with semi-supervised data augmentation to avoid this variance is the only method we find that significantly improves on pure early stopping, reaching 40.2% robust test error.

[9] We used the data from https://github.com/yaircarmon/semisup-adv containing 500K pseudo-labeled Tiny Images.
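As a rough illustration of the semi-supervised augmentation above (in the spirit of Carmon et al. (2019)), the sketch below labels unlabeled images with a standard, non-robust classifier and mixes them into the data used for adversarial training; `standard_model`, `unlabeled_loader`, and the simple concatenation strategy are placeholders rather than the exact pipeline used for the 500K pseudo-labeled Tiny Images.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

@torch.no_grad()
def pseudo_label(standard_model, unlabeled_loader, device):
    """Label unlabeled images with a classifier trained on clean (non-adversarial) data."""
    standard_model.eval()
    xs, ys = [], []
    for x in unlabeled_loader:                        # loader assumed to yield batches of images only
        x = x.to(device)
        xs.append(x.cpu())
        ys.append(standard_model(x).argmax(1).cpu())  # hard pseudo-labels
    return TensorDataset(torch.cat(xs), torch.cat(ys))

# Mix the pseudo-labeled data into the labeled training set (both kept as TensorDatasets so
# the two sources collate identically), then run PGD-based adversarial training on the union
# using the earlier sketches (pgd_linf / adversarial_training_epoch).
labeled = TensorDataset(torch.stack([x for x, _ in train_set]),
                        torch.tensor([y for _, y in train_set]))
augmented = ConcatDataset([labeled, pseudo_label(standard_model, unlabeled_loader, device)])
augmented_loader = DataLoader(augmented, batch_size=128, shuffle=True)
```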
5. Conclusion

Unlike in standard training, overfitting in robust adversarial training degrades test set performance during training in a wide variety of settings. While overfitting with larger architecture sizes results in better test set generalization, it does not reduce the effect of robust overfitting. Our extensive suite of experiments testing the effect of implicit and explicit regularization methods on preventing overfitting found that most of these techniques tend to over-regularize the model or do not prevent robust overfitting, and none of them in isolation improves upon early stopping. Especially due to the prevalence of robust overfitting in adversarial training, we particularly urge the community to use validation sets when performing model selection in this regime, and to analyze the learning curves of their models. This work exposes a key difference in generalization properties between standard and robust training, which is not fully explained by either classical statistics or modern deep learning, and re-establishes the competitiveness of the simplest adversarial training baseline.

Acknowledgements

Leslie Rice was funded by support from the Bosch Center for AI under contract 0087016732PCR. Eric Wong was funded by support from the Bosch Center for AI under contract 0087016732PCR, and by a fellowship from the Siebel Scholars Foundation.

References

Alayrac, J.-B., Uesato, J., Huang, P.-S., Fawzi, A., Stanforth, R., and Kohli, P. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems, pp. 12192-12202, 2019.

Ali, A., Kolter, J. Z., and Tibshirani, R. J. A continuous-time view of early stopping for least squares regression. arXiv preprint arXiv:1810.10082, 2018.

Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.

Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. Thermometer encoding: One hot way to resist adversarial examples. 2018.

Carlini, N. Is AmI (attacks meet interpretability) robust to adversarial examples? arXiv preprint arXiv:1902.02322, 2019.

Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3-14. ACM, 2017a.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017b.

Carmon, Y., Raghunathan, A., Schmidt, L., Liang, P., and Duchi, J. C. Unlabeled data improves adversarial robustness. arXiv preprint arXiv:1905.13736, 2019.

Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of relu networks via maximization of linear regions. arXiv preprint arXiv:1810.07481, 2018.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., and Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185-9193, 2018.

Dvijotham, K., Stanforth, R., Gowal, S., Mann, T. A., and Kohli, P. A dual approach to scalable verification of deep networks.

Ehlers, R. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis, pp. 269-286. Springer, 2017.

Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

Engstrom, L., Ilyas, A., Santurkar, S., and Tsipras, D. Robustness (python library), 2019. URL https://github.com/MadryLab/robustness.
Fazlyab, M., Morari, M., and Pappas, G. J. Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. arXiv preprint arXiv:1903.01287, 2019.

Feinman, R., Curtin, R. R., Shintre, S., and Gardner, A. B. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.

Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.

Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., and Vechev, M. AI2: Safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2018.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Mann, T., and Kohli, P. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.

Guo, C., Rana, M., Cisse, M., and Van Der Maaten, L. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016.

Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pp. 3-29. Springer, 2017.

Katz, G., Barrett, C., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97-117. Springer, 2017.

Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950-957, 1992.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656-672. IEEE, 2019.

Lu, J., Sibai, H., Fabry, E., and Forsyth, D. No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Maini, P., Wong, E., and Kolter, J. Z. Adversarial robustness against the union of multiple perturbation models. arXiv preprint arXiv:1909.04068, 2019.

Metzen, J. H., Genewein, T., Fischer, V., and Bischoff, B. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.

Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575-3583, 2018.

Morgan, N. and Bourlard, H. Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems, pp. 630-637, 1990.

Mosbach, M., Andriushchenko, M., Trost, T., Hein, M., and Klakow, D. Logit pairing methods can fool gradient-based attacks. arXiv preprint arXiv:1810.12042, 2018.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947-5956, 2017.

Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582-597. IEEE, 2016.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018a.

Raghunathan, A., Steinhardt, J., and Liang, P. S. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10877-10887, 2018b.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451, 2018.

Salman, H., Yang, G., Li, J., Zhang, P., Zhang, H., Razenshteyn, I., and Bubeck, S. Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584, 2019a.

Salman, H., Yang, G., Zhang, H., Hsieh, C.-J., and Zhang, P. A convex relaxation barrier to tight robustness verification of neural networks. In Advances in Neural Information Processing Systems, pp. 9832-9842, 2019b.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014-5026, 2018.

Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pp. 10802-10813, 2018.

Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464-472. IEEE, 2017.

Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Strand, O. N. Theory and methods related to the singular-function expansion and Landweber's iteration for integral equations of the first kind. SIAM Journal on Numerical Analysis, 11(4):798-825, 1974.

Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pp. 10608-10619, 2018.

Tao, G., Ma, S., Liu, Y., and Zhang, X. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, pp. 7717-7728, 2018.
Tjeng, V., Xiao, K. Y., and Tedrake, R. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyGIdiRqtm.

Tramèr, F. and Boneh, D. Adversarial training and robustness for multiple perturbations. arXiv preprint arXiv:1904.13000, 2019.

Wei, Y., Yang, F., and Wainwright, M. J. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pp. 6065-6075, 2017.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

Wong, E., Schmidt, F., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pp. 8400-8409, 2018.

Wong, E., Schmidt, F. R., and Kolter, J. Z. Wasserstein adversarial examples via projected Sinkhorn iterations. arXiv preprint arXiv:1902.07906, 2019.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJx040EFvH.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. arXiv preprint arXiv:1801.02612, 2018a.

Xiao, K. Y., Tjeng, V., Shafiullah, N. M., and Madry, A. Training for faster adversarial robustness verification via inducing ReLU stability. arXiv preprint arXiv:1809.03008, 2018b.

Xie, C., Wu, Y., Maaten, L. v. d., Yuille, A. L., and He, K. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501-509, 2019.

Yang, Y., Zhang, G., Katabi, D., and Xu, Z. ME-Net: Towards effective adversarial robustness with matrix estimation. arXiv preprint arXiv:1905.11971, 2019.

Yin, D., Ramchandran, K., and Bartlett, P. Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914, 2018.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhai, R., Cai, T., He, D., Dan, C., He, K., Hopcroft, J., and Wang, L. Adversarially robust generalization just requires more unlabeled data. arXiv preprint arXiv:1906.00555, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, H., Chen, H., Xiao, C., Li, B., Boning, D., and Hsieh, C.-J. Towards stable and efficient training of verifiably robust neural networks. arXiv preprint arXiv:1906.06316, 2019b.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019c.