# Understanding Catastrophic Overfitting in Single-step Adversarial Training

Hoki Kim, Woojin Lee, Jaewook Lee
Seoul National University, Seoul, Korea
ghrl9613@snu.ac.kr, wj926@snu.ac.kr, jaewook@snu.ac.kr

Abstract

Although fast adversarial training has demonstrated both robustness and efficiency, the problem of catastrophic overfitting has been observed. This is a phenomenon in which, during single-step adversarial training, robust accuracy against projected gradient descent (PGD) suddenly decreases to 0% after a few epochs, whereas robust accuracy against the fast gradient sign method (FGSM) increases to 100%. In this paper, we demonstrate that catastrophic overfitting is very closely related to the characteristic of single-step adversarial training that it uses only adversarial examples with the maximum perturbation, rather than all adversarial examples in the adversarial direction, which leads to decision boundary distortion and a highly curved loss surface. Based on this observation, we propose a simple method that not only prevents catastrophic overfitting, but also overrides the belief that it is difficult to prevent multi-step adversarial attacks with single-step adversarial training.

1 Introduction

Adversarial examples are perturbed inputs that are designed to deceive machine learning classifiers by adding adversarial noise to the original data. Although such perturbations are sufficiently subtle to be undetectable by humans, they result in incorrect classifications. Since deep-learning models were found to be vulnerable to adversarial examples (Szegedy et al. 2013), a line of work has been proposed to mitigate the problem and improve the robustness of models. Among the numerous defensive methods, projected gradient descent (PGD) adversarial training (Madry et al. 2017) is one of the most successful approaches for achieving robustness against adversarial attacks.

Although PGD adversarial training serves as a strong defensive algorithm, because it relies on a multi-step adversarial attack, a high computational cost is required for the multiple forward and backward propagations during batch training. To overcome this issue, other studies on reducing the computational burden of adversarial training using single-step adversarial attacks (Goodfellow, Shlens, and Szegedy 2014) have been proposed (Shafahi et al. 2019; Wong, Rice, and Kolter 2020). Among them, inspired by Shafahi et al. (2019), Wong, Rice, and Kolter (2020) suggested fast adversarial training, a modified version of fast gradient sign method (FGSM) adversarial training designed to be as effective as PGD adversarial training.

Figure 1: Visualization of a distorted decision boundary. The origin indicates the original image x, the label of which is dog. Here, v1 is the direction of a single-step adversarial perturbation and v2 is a random direction. The adversarial image x + v1 is classified with the correct label, although there is a distorted interval in which x + k·v1 is misclassified even when k is less than 1. Due to this decision boundary distortion, single-step adversarial training becomes vulnerable to multi-step adversarial attacks.
Fast adversarial training has demonstrated both robustness and efficiency; however, it suffers from the problem of catastrophic overfitting, a phenomenon in which robustness against PGD suddenly decreases to 0%, whereas robustness against FGSM rapidly increases. Wong, Rice, and Kolter (2020) first discovered this issue and suggested the use of early stopping to prevent it. Later, it was found that catastrophic overfitting also occurs in other single-step adversarial training methods, such as free adversarial training (Andriushchenko and Flammarion 2020). In this regard, a few attempts have been made to discover the underlying reason for catastrophic overfitting, and methods have been proposed to prevent this failure (Andriushchenko and Flammarion 2020; Vivek and Babu 2020; Li et al. 2020). However, these approaches are computationally inefficient or do not provide a fundamental reason for the problem.

In this study, we first analyze the differences before and after catastrophic overfitting. We then identify the relationship between distortion of the decision boundary and catastrophic overfitting. Unlike the previous notion that a larger perturbation implies a stronger attack, we discover that during single-step adversarial training a smaller perturbation is sometimes sufficient to fool the model, whereas the model remains robust against larger perturbations. We call this phenomenon decision boundary distortion. Figure 1 shows an example of decision boundary distortion by visualizing the loss surface. The model is robust to perturbations whose magnitude equals the maximum perturbation ϵ, but not to certain smaller perturbations. When decision boundary distortion occurs, the model becomes more robust against a single-step adversarial attack but reveals fatal weaknesses against multi-step adversarial attacks, which leads to catastrophic overfitting.

Through extensive experiments, we empirically discovered the relationship between single-step adversarial training and decision boundary distortion, and found that the problem with single-step adversarial training is the fixed magnitude of the perturbation, not the direction of the attack. Based on this observation, we present a simple algorithm that determines the appropriate magnitude of the perturbation for each image and prevents catastrophic overfitting.

Contributions. We discovered the decision boundary distortion phenomenon that occurs during single-step adversarial training and the underlying connection between decision boundary distortion and catastrophic overfitting. We suggest a simple method that prevents decision boundary distortion by searching for the appropriate step size for each image; this method not only prevents catastrophic overfitting, but also achieves near 100% accuracy on the training examples against PGD. We evaluate robustness of the proposed method against various adversarial attacks (FGSM, PGD, and AutoAttack (Croce and Hein 2020)) and demonstrate that the proposed method can provide sufficient robustness without catastrophic overfitting.

2 Background and Related Work

2.1 Adversarial Robustness

There are two major movements for building a robust model: provable defenses and adversarial training. A considerable number of studies related to provable defenses of deep-learning models have been published. Provable defenses attempt to provide provable guarantees for robust performance, using techniques such as
linear relaxations (Wong and Kolter 2018; Zhang et al. 2019a), interval bound propagation (Gowal et al. 2018; Lee, Lee, and Park 2020), and randomized smoothing (Cohen, Rosenfeld, and Kolter 2019; Salman et al. 2019). However, provable defenses are computationally inefficient and show unsatisfactory performance compared to adversarial training.

Adversarial training is an approach that augments the training data with adversarial examples generated by adversarial attacks (Goodfellow, Shlens, and Szegedy 2014; Madry et al. 2017; Tramèr et al. 2017). Because this approach is simple and achieves high empirical robustness against various attacks, it has been widely used and developed along with other deep learning methods such as mix-up (Zhang et al. 2017; Lamb et al. 2019; Pang, Xu, and Zhu 2019) and unsupervised training (Alayrac et al. 2019; Najafi et al. 2019; Carmon et al. 2019). In this study, we focus on adversarial training.

Given an example (x, y) ∈ D, let ℓ(x, y; θ) = ℓ(f_θ(x), y) denote the loss function of a deep learning model f with parameters θ. Then, adversarial training with a maximum perturbation ϵ can be formalized as follows:

$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\Big[\max_{\delta \in B(x,\epsilon)} \ell(x+\delta, y;\theta)\Big] \tag{1}$$

The perturbation δ lies in B(x, ϵ), which denotes the ϵ-ball around an example x with respect to a specific distance measure. The most commonly used distance measures are L0, L2, and L∞. In this study, we use the L∞ measure.

However, the above optimization is considered NP-hard because it contains a non-convex min-max problem. Thus, instead of solving the inner maximization problem exactly, adversarial attacks are used to find the perturbation δ. The fast gradient sign method (FGSM) (Goodfellow, Shlens, and Szegedy 2014) is the simplest adversarial attack, which uses the sign of the gradient to find an adversarial image x′. Because FGSM requires only one gradient computation, it is considered the least expensive adversarial attack (Goodfellow, Shlens, and Szegedy 2014; Madry et al. 2017).

$$x' = x + \epsilon \cdot \mathrm{sgn}\big(\nabla_x \ell(x, y;\theta)\big) \tag{2}$$

Projected gradient descent (PGD) (Madry et al. 2017) uses multiple gradient steps to generate more powerful adversarial examples. With a step size α, PGD can be formalized as follows:

$$x^{t+1} = \Pi_{B(x,\epsilon)}\big(x^{t} + \alpha \cdot \mathrm{sgn}(\nabla_x \ell(x^{t}, y;\theta))\big) \tag{3}$$

where Π_{B(x,ϵ)} refers to the projection onto the ϵ-ball B(x, ϵ) and x^t is the adversarial example after t steps. A larger number of steps allows us to explore more of B(x, ϵ). Note that PGDn corresponds to PGD with n steps (or iterations); for instance, PGD7 indicates that the number of PGD steps is 7.

2.2 Single-step Adversarial Attack versus Multi-step Adversarial Attack

Single-step adversarial training was previously believed to be a non-robust method because it produces nearly 0% accuracy against PGD (Madry et al. 2017). Moreover, a model trained using FGSM has been confirmed to exhibit typical characteristics, such as gradient masking, which indicates that a single-step gradient is insufficient to find decent adversarial examples (Tramèr et al. 2017). For these reasons, a number of studies have focused on multi-step adversarial attacks.

Contrary to this perception, however, free adversarial training (Shafahi et al. 2019) has achieved remarkable performance with a single-step gradient by using redundant batches and accumulated perturbations. Following Shafahi et al. (2019), Wong, Rice, and Kolter (2020) proposed fast adversarial training, which uses FGSM with a uniform random initialization. Fast adversarial training shows almost equivalent performance to PGD adversarial training (Madry et al. 2017) and free adversarial training (Shafahi et al. 2019). The perturbation used in fast adversarial training is constructed as

$$\eta = \mathrm{Uniform}(-\epsilon, \epsilon), \qquad \delta = \eta + \alpha \cdot \mathrm{sgn}\big(\nabla_{\eta}\, \ell(x+\eta, y;\theta)\big), \qquad x' = x + \delta \tag{4}$$
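For concreteness, the three attacks above can be written in a few lines of PyTorch. The sketch below is our illustrative reading of Equations (2)–(4), not code from the paper; `model` is any classifier returning logits, and inputs are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    # Equation (2): one signed-gradient step of size eps from the clean image.
    x_req = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_req), y)
    grad = torch.autograd.grad(loss, x_req)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd_attack(model, x, y, eps, alpha, steps):
    # Equation (3): repeated signed-gradient steps of size alpha, each followed by a
    # projection back onto the L-infinity eps-ball around x.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

def fgsm_rs_attack(model, x, y, eps, alpha):
    # Equation (4): FGSM taken from a uniform random start (fast adversarial training).
    eta = torch.empty_like(x).uniform_(-eps, eps)
    x_eta = (x + eta).clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_eta), y)
    grad = torch.autograd.grad(loss, x_eta)[0]
    # Clamping delta back into the eps-ball follows common fast-AT implementations.
    delta = torch.clamp(eta + alpha * grad.sign(), -eps, eps)
    return (x + delta).clamp(0, 1).detach()
```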
2.3 Catastrophic Overfitting

Although fast adversarial training performs well in a short time, a previously undiscovered phenomenon has been identified: after a few epochs of single-step adversarial training, robustness of the model against PGD decreases sharply. This phenomenon is called catastrophic overfitting. Fast adversarial training (Wong, Rice, and Kolter 2020) uses early stopping to temporarily avoid catastrophic overfitting by tracking robust accuracy against PGD on the training batches. To apply early stopping, robustness against PGD must be continuously monitored. Furthermore, with early stopping, standard accuracy does not reach its maximum potential (Andriushchenko and Flammarion 2020).

To resolve these shortcomings and gain a deeper understanding of catastrophic overfitting, a line of work has been proposed. Vivek and Babu (2020) identified that catastrophic overfitting arises from early overfitting to FGSM. To prevent this type of overfitting, the authors introduced dropout scheduling and demonstrated stable adversarial training for up to 100 epochs. In addition, Li et al. (2020) trained a model with FGSM at first and then switched to PGD when there was a large decrease in PGD accuracy. Andriushchenko and Flammarion (2020) found that the abnormal behavior of a single filter leads to a nonlinear model with single-layer convolutional networks. Based on this observation, they proposed a regularization method, GradAlign, which maximizes cos(∇_x ℓ(x, y; θ), ∇_x ℓ(x + η, y; θ)) and prevents catastrophic overfitting by inducing gradient alignment.

However, even with an increased understanding of catastrophic overfitting and methods for its prevention, a key question remains unanswered: What characteristic of single-step adversarial attacks is the cause of catastrophic overfitting? In this paper, we discuss the cause of catastrophic overfitting in the context of single-step adversarial training. We then propose a new simple method that facilitates stable single-step adversarial training, wherein longer training can produce a higher standard accuracy with sufficient adversarial robustness.

Figure 2: (CIFAR-10) Analysis of catastrophic overfitting. (a) Robust accuracy and distortion: robust accuracy of fast adversarial training against FGSM (red) and PGD7 (blue); distortion (green) denotes the ratio of images with a distorted interval, as defined in Equation (5). (b) Mean absolute value of the PGD7 perturbation E[||δ_PGD7||_1] (purple) and L2 norm of the gradients of the images E[||∇_x ℓ||_2] (orange). Dashed black lines correspond to the 240th batch, which is the start point of catastrophic overfitting in both plots.

3 Revisiting Catastrophic Overfitting

First, to analyze catastrophic overfitting, we start by recording robust accuracy of fast adversarial training on CIFAR-10 (Krizhevsky, Hinton et al. 2009). The maximum perturbation ϵ is fixed to 8/255. We use FGSM and PGD7 to verify robust accuracy with the same setting ϵ = 8/255 and a step size α = 2/255. Figure 2 shows statistics on the training batch when catastrophic overfitting occurs (the 71st of 200 epochs). In plot (a), after 240 batches, robustness against PGD7 begins to decrease rapidly; conversely, robustness against FGSM increases.
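The per-batch curves in plot (a) can be traced by evaluating each training batch immediately after its update. A minimal sketch, reusing the hypothetical `fgsm_attack` and `pgd_attack` helpers from the earlier sketch:

```python
import torch

def batch_robust_accuracy(model, x, y, eps=8/255, alpha=2/255):
    # Robust accuracy of one training batch against FGSM and PGD-7 (illustrative only).
    was_training = model.training
    model.eval()
    x_fgsm = fgsm_attack(model, x, y, eps)                 # attack generation needs input gradients
    x_pgd = pgd_attack(model, x, y, eps, alpha, steps=7)
    with torch.no_grad():                                  # the accuracy check itself does not
        acc_fgsm = (model(x_fgsm).argmax(1) == y).float().mean().item()
        acc_pgd = (model(x_pgd).argmax(1) == y).float().mean().item()
    if was_training:
        model.train()
    return acc_fgsm, acc_pgd
```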
Plot (b) shows the mean absolute value of the PGD7 perturbation E[||δ_PGD7||_1] and the squared L2 norm of the gradient of the images E[||∇_x ℓ||_2] for each batch. After catastrophic overfitting, there is a clear trend of decreasing mean perturbation. This is consistent with the phenomenon in which the perturbations of the catastrophically overfitted model are located away from the maximum perturbation, unlike those of the model that is stopped early (Wong, Rice, and Kolter 2020). Concurrently, a significant increase in the squared L2 norm of the gradient is also observed; the highest point indicates a large difference, approximately 35 times greater than before catastrophic overfitting.

These two observations, a low magnitude of perturbations and a high gradient norm, make us wonder what the loss surface looks like.

Figure 3: The process by which a normal decision boundary turns into a distorted decision boundary. (Left) The loss surface before catastrophic overfitting, with an FGSM adversarial direction v1 and a random direction v2. The red point denotes an adversarial example x + v1 generated from the original image x, the label of which is dog. (Middle) The changed loss surface after learning the adversarial example x + v1; here, v1 is the same vector as on the left. A distorted interval begins to occur for the first time. (Right) As training continues, the distorted decision boundary grows uncontrollably, such that robustness against multi-step adversarial attacks decreases.

Figure 3 illustrates the progress of adversarial training in which catastrophic overfitting occurs. The loss surface of the perturbed example is shown, where the green spot denotes the original image and the red spot denotes the adversarial example used for adversarial training in the batch. The v1 axis indicates the direction of FGSM, whereas the v2 axis is a random direction. The true label of the original sample is dog; hence, the purple area indicates where the perturbed sample is correctly classified, whereas the blue area indicates a misclassified area.

On the left side of Figure 3, we can easily observe that the model is robust against FGSM. However, after training on the batch, an interval vulnerable to perturbations smaller than the maximum perturbation ϵ appears, whereas the model is still robust against FGSM. This distorted interval implies that an adversarial example with a larger perturbation is weaker than one with a smaller perturbation, which is contrary to the conventional belief that a larger magnitude of perturbation induces a stronger attack. As a result, a model with a distorted interval is vulnerable to multi-step adversarial attacks, which can search the vulnerable region further inside B(x, ϵ). As training continues, the area of the distorted interval increases, as shown in the figure on the right. It is easy to see that the model is now perfectly overfitted to FGSM, yet loses its robustness to smaller perturbations. We call this phenomenon decision boundary distortion.

Evidence of decision boundary distortion is also shown in Figure 2 (b). When robustness against PGD7 sharply decreases to 0%, the mean absolute value of the PGD7 perturbation E[||δ_PGD7||_1] decreases. This indicates that, when catastrophic overfitting arises, a perturbation smaller than the maximum perturbation ϵ is enough to fool the model, which implies that a distorted interval exists.
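The loss surfaces in Figures 1 and 3 are cross-sections of the loss over the plane spanned by the FGSM direction v1 and a random direction v2. A hedged sketch of such a visualization (the grid range, resolution, and the Rademacher choice for v2 are our own, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def loss_surface_grid(model, x, y, eps, n=41):
    # Cross-section of the loss over x + a*v1 + b*v2, where v1 is the signed-gradient
    # (FGSM) direction and v2 is a random signed direction, with a, b in [-eps, eps].
    # x: a single image of shape (C, H, W); y: its label as a 0-dim long tensor.
    x0 = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x0.unsqueeze(0)), y.unsqueeze(0))
    v1 = torch.autograd.grad(loss, x0)[0].sign()
    v2 = torch.randn_like(x).sign()
    coords = torch.linspace(-eps, eps, n)
    surface = torch.zeros(n, n)
    with torch.no_grad():
        for i, a in enumerate(coords):
            for j, b in enumerate(coords):
                logits = model((x + a * v1 + b * v2).clamp(0, 1).unsqueeze(0))
                surface[i, j] = F.cross_entropy(logits, y.unsqueeze(0))
    return surface  # plot with, e.g., matplotlib's contourf or plot_surface
```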
In addition, while the decision boundary is becoming distorted, as shown in the right panel of Figure 3, the loss surface inevitably becomes highly curved, which matches the observed increase in the L2 norm of the gradients of the images E[||∇_x ℓ||_2]. This is also consistent with previous research (Andriushchenko and Flammarion 2020). Andriushchenko and Flammarion (2020) argued that ∇_x ℓ(x, y; θ) and ∇_x ℓ(x + η, y; θ) tend to be perpendicular in catastrophically overfitted models, where η is drawn from a uniform distribution U(−ϵ, ϵ). Considering that a highly curved loss surface implies (∇_x ℓ(x, y; θ))^T (∇_x ℓ(x + η, y; θ)) ≈ 0 in high dimensions, the reason why GradAlign (Andriushchenko and Flammarion 2020) can avoid catastrophic overfitting might be that gradient alignment leads the model to learn a linear loss surface, which reduces the chance of developing a distorted decision boundary.

We next numerically measured the degree of decision boundary distortion. To do so, we first define a new measure, distortion d. Given a deep learning model f and a loss function ℓ, distortion d can be formalized as follows:

$$S_D = \{\, x \mid \exists\, k \in (0,1)\ \text{s.t.}\ f(x + k\cdot\epsilon\cdot\mathrm{sgn}(\nabla_x \ell)) \neq y \,\},$$
$$S_N = \{\, x \mid f(x) = y,\ f(x + \epsilon\cdot\mathrm{sgn}(\nabla_x \ell)) = y \,\},$$
$$d = \frac{|S_D \cap S_N|}{|S_N|} \tag{5}$$

where (x, y) is an example drawn from the dataset D. However, because the loss function of the model is not known explicitly, we use a number of samples to estimate distortion d. In all experiments, we tested 100 samples in the adversarial direction δ = ϵ·sgn(∇_x ℓ) for each example. Indeed, we can see in Figure 2 (a) that distortion increases when catastrophic overfitting arises.

To verify that decision boundary distortion is generally related to catastrophic overfitting, we demonstrate how distortion and robustness against PGD7 change during training. We conducted an experiment on five different models: fast adversarial training (Fast Adv.) (Wong, Rice, and Kolter 2020), PGD2 (PGD2 Adv.) (Madry et al. 2017), TRADES (Zhang et al. 2019b), GradAlign (Andriushchenko and Flammarion 2020), and the proposed method (Ours). All models were tested with ϵ = 8/255. The step size α is set to α = 1.25ϵ, α = ϵ/2, and α = ϵ/4 for fast adversarial training, PGD2 adversarial training, and TRADES, respectively. We also conducted the same experiment on PGD adversarial training with different numbers of steps; however, because these showed results similar to PGD2 adversarial training, we include only PGD2. TRADES is trained with seven steps.

Figure 4: (CIFAR-10) Robust accuracy and distortion on the training batch for each epoch. The two multi-step adversarial training methods show zero distortion during the entire training time and reach nearly 100% PGD7 accuracy. By contrast, fast adversarial training shows high distortion and eventually collapses after the 71st epoch. The proposed method successfully avoids such problems and achieves a high PGD7 accuracy similar to multi-step adversarial training (best viewed in color).

The key observation in Figure 4 is that the point where decision boundary distortion begins in fast adversarial training (the 22nd epoch) is identical to the point where robustness against PGD7 sharply decreases; that is, where catastrophic overfitting occurs. Then, when decision boundary distortion disappears (the 45th to 72nd epochs), the model immediately recovers its robust accuracy. After the 72nd epoch, the model once again suffers catastrophic overfitting and never regains its robustness, with high distortion. Hence, we conclude that there is a close connection between decision boundary distortion and the vulnerability of the model against multi-step adversarial attacks.
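Distortion can be estimated exactly as described above: sample points along the adversarial direction for each image and check the conditions defining S_D and S_N. The sketch below follows our reading of Equation (5), including the normalization by |S_N|; `model` and `loader` are placeholders.

```python
import torch
import torch.nn.functional as F

def estimate_distortion(model, loader, eps, n_samples=100):
    # Estimate d from Equation (5): among images correctly classified both at x and at
    # the FGSM corner x + eps*sgn(grad) (the set S_N), count those misclassified at some
    # interior point x + k*eps*sgn(grad) with 0 < k < 1 (membership in S_D).
    ks = torch.linspace(0, 1, n_samples + 2)[1:-1]   # n_samples interior points of (0, 1)
    n_distorted, n_normal = 0, 0
    for x, y in loader:
        x_req = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_req), y)
        direction = eps * torch.autograd.grad(loss, x_req)[0].sign()
        with torch.no_grad():
            correct_clean = model(x).argmax(1) == y
            correct_corner = model(x + direction).argmax(1) == y
            in_sn = correct_clean & correct_corner
            broken = torch.zeros_like(in_sn)
            for k in ks:                             # 100 forward passes per batch; slow but simple
                broken |= model(x + k * direction).argmax(1) != y
        n_normal += in_sn.sum().item()
        n_distorted += (in_sn & broken).sum().item()
    return n_distorted / max(n_normal, 1)
```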
4 Stable Single-Step Adversarial Training

Based on the results in Section 3, we assume that a distorted decision boundary might be the reason for catastrophic overfitting. Here, we stress that the major cause of a distorted decision boundary is that single-step adversarial training uses a point at a fixed distance ϵ from the original image x as the adversarial image x′, instead of an optimal solution of the inner maximization in Equation (1). Single-step attacks implicitly assume that the loss is locally linear inside the ϵ-ball. Under this linearity assumption, the most powerful adversarial perturbation δ would be exactly ϵ·sgn(∇_x ℓ), where ϵ is the maximum perturbation, and the following relation should be satisfied:

$$\ell(x+\delta) - \ell(x) = (\nabla_x \ell)^{\top} \delta = (\nabla_x \ell)^{\top}\, \epsilon\cdot\mathrm{sgn}(\nabla_x \ell) = \epsilon\,\|\nabla_x \ell\|_1$$

However, as confirmed in the previous section, decision boundary distortion with a highly curved loss surface is observed during the training phase, which indicates that ϵ is no longer the strongest adversarial step size in the direction of δ. Thus, the linear approximation of the inner maximization is not satisfied when a distorted decision boundary arises.

To resolve this issue, we suggest a simple fix that prevents catastrophic overfitting by forcing the model to verify the inner interval of the adversarial direction. In this case, the appropriate magnitude of the perturbation should be taken into consideration instead of always using ϵ:

$$\delta = \epsilon\cdot\mathrm{sgn}(\nabla_x \ell), \qquad \arg\max_{k \in [0,1]} \ell(x + k\cdot\delta, y;\theta) \tag{7}$$

Here, we introduce k, which denotes the scaling parameter for the original adversarial direction sgn(∇_x ℓ). In contrast to previous single-step adversarial training, which uses a fixed value k = 1, an appropriate scaling parameter k helps the model train on stronger adversarial examples as follows:

$$\delta = \epsilon\cdot\mathrm{sgn}(\nabla_x \ell), \qquad k^{*} = \min\{\, k \in [0,1] \mid y \neq f(x + k\cdot\delta;\theta) \,\},$$
$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\big[\ell(x + k^{*}\cdot\delta, y;\theta)\big] \tag{8}$$

In this way, regardless of the linearity assumption, we can train the model with stronger adversarial examples that induce an incorrect classification in the adversarial direction. Simultaneously, we can also detect a distorted decision boundary by inspecting the inside of the distorted interval, as shown in Figure 3.

However, because we do not know the explicit loss function of the model, forward propagation is the only way to check the adversarial images in the single-step adversarial attack direction. Hence, we propose the following simple method. First, we calculate the single-step adversarial direction δ. Next, we choose multiple checkpoints (x + (1/c)·δ, ..., x + ((c−1)/c)·δ, x + δ). Here, c denotes the number of checkpoints, excluding the clean image x, which is tested in advance during the single-step adversarial attack process. We then feed all checkpoints to the model and verify whether the predicted label ŷ_j matches the correct label y for each checkpoint x + (j/c)·δ with j ∈ {1, ..., c}. Among the incorrectly classified images and the clean image x, the one with the smallest j is selected; if all checkpoints are correctly classified, the adversarial image x′ = x + δ is used. Algorithm 1 summarizes the proposed method.

Algorithm 1: Stable single-step adversarial training
Parameters: B mini-batches, a perturbation size ϵ, a step size α, and c checkpoints for a network f_θ
for i = 1, ..., B do
    η = Uniform(−ϵ, ϵ)
    ŷ_{i,0} = f_θ(x_i + η)
    δ = η + α·sgn(∇_η ℓ(ŷ_{i,0}, y_i))
    for j = 1, ..., c do
        ŷ_{i,j} = f_θ(x_i + (j/c)·δ)
    end for
    x′_i = x_i + min({k | ŷ_{i,k} ≠ y_i} ∪ {c})·δ/c
    θ = θ − ∇_θ ℓ(f_θ(x′_i), y_i)
end for
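A PyTorch-style sketch of one mini-batch update of Algorithm 1 follows. It is our illustrative reading of the pseudocode, not the authors' released implementation (that is linked in Section 5); `model` and `opt` are an arbitrary classifier and optimizer, and the default hyper-parameters mirror the CIFAR-10 settings used later.

```python
import torch
import torch.nn.functional as F

def stable_single_step_update(model, opt, x, y, eps=8/255, alpha=1.25 * 8/255, c=3):
    # 1) Single-step adversarial direction delta from a uniform random start (cf. Equation (4)).
    eta = torch.empty_like(x).uniform_(-eps, eps)
    x_eta = (x + eta).requires_grad_(True)
    out0 = model(x_eta)
    grad = torch.autograd.grad(F.cross_entropy(out0, y), x_eta)[0]
    delta = (eta + alpha * grad.sign()).detach()

    # 2) For each image, the smallest scale k/c whose checkpoint x + (k/c)*delta is
    #    misclassified; k = c (the full delta) is the fallback when every checkpoint is
    #    correct, and k = 0 reuses the prediction at x + eta computed during the attack.
    k_sel = torch.full((x.size(0),), float(c), device=x.device)
    with torch.no_grad():
        for j in range(c, 0, -1):      # descending, so the smallest misclassified j wins
            wrong_j = model((x + (j / c) * delta).clamp(0, 1)).argmax(1) != y
            k_sel = torch.where(wrong_j, torch.full_like(k_sel, float(j)), k_sel)
        k_sel[out0.argmax(1) != y] = 0.0
    x_adv = (x + (k_sel / c).view(-1, 1, 1, 1) * delta).clamp(0, 1)

    # 3) Ordinary training step on the selected examples.
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()
```

The checkpoint loop runs under `torch.no_grad`, so the extra cost over plain fast adversarial training is c forward passes per batch, which is what the complexity analysis below counts.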
Suppose the model has L layers with n neurons per layer. Then, the time complexity of forward propagation is O(Ln²). Considering that backward propagation has the same time complexity, the generation of one adversarial example requires O(2Ln²) in total. Thus, with c checkpoints, the proposed method consumes O((c + 4)Ln²), because it requires one adversarial direction, O(2Ln²); forward propagation for c checkpoints, O(cLn²); and one optimization step, O(2Ln²). Compared to PGD2 adversarial training, which demands O(6Ln²), the proposed method requires more time when c > 2. However, the proposed method does not require additional memory for computing gradients at the checkpoints, because we do not need to track a history of variables for backward propagation; hence, larger validation batch sizes can be considered. Indeed, the empirical results described in Section 5 indicate that the proposed method consumes less time than PGD2 adversarial training for c ≤ 4.

Figure 4 shows that the proposed method successfully avoids catastrophic overfitting despite using a single-step adversarial attack. Furthermore, the proposed model not only achieves nearly 100% robustness against PGD7, which fast adversarial training cannot accomplish, but also maintains zero distortion until the end of training. This is the opposite of the common understanding that single-step adversarial training methods cannot perfectly defend the model against multi-step adversarial attacks.

The proposed model learns from the image with the smallest perturbation among the incorrectly classified adversarial images. In other words, during the initial stages, the model outputs incorrect predictions for almost every image, such that min({k | ŷ_{i,k} ≠ y_i} ∪ {c}) = 0 in Algorithm 1. As additional batches are trained, the average maximum perturbation E[||δ||_∞] increases, as shown in Figure 5, where δ = x′ − x and x′ is selected by the proposed method.

Figure 5: (CIFAR-10) Comparison of PGD7 accuracy on the training batch between fast adversarial training with ϵ-scheduling and the proposed method. The dashed line indicates the average maximum perturbation E[||δ||_∞] calculated from the proposed method for each epoch, which is used as the maximum perturbation of fast adversarial training.

Thus, the proposed method may appear to be simply a variation of ϵ-scheduling. To point out the difference, fast adversarial training with ϵ-scheduling is also considered: for each epoch, we use the average maximum perturbation E[||δ||_∞] calculated from the proposed method as the maximum perturbation ϵ. The result is summarized in Figure 5. Notably, ϵ-scheduling cannot help fast adversarial training avoid catastrophic overfitting. The main difference between ϵ-scheduling and the proposed method is that, whereas ϵ-scheduling uniformly applies the same magnitude of perturbation to every image, the proposed method gradually increases the magnitude of the perturbation appropriately by considering the loss surface of each image. Therefore, in contrast to ϵ-scheduling, the proposed method successfully prevents catastrophic overfitting, despite the same average perturbation size being used during the training process.
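As noted in the complexity discussion above, the checkpoint tests need no backward pass, so they can even be folded into a single batched forward pass under `torch.no_grad`. The sketch below is our own implementation choice rather than anything specified in the paper (and it omits the k = 0 clean-image check for brevity):

```python
import torch

def select_scale_batched(model, x, y, delta, c):
    # All c checkpoints along delta evaluated in one batched forward pass under
    # torch.no_grad: no autograd history is kept, which is why the checkpoint test
    # adds no gradient memory (a batched variant of the loop in the earlier sketch).
    n = x.size(0)
    ks = torch.arange(1, c + 1, device=x.device, dtype=x.dtype) / c              # shape (c,)
    with torch.no_grad():
        pts = x.unsqueeze(1) + ks.view(1, c, 1, 1, 1) * delta.unsqueeze(1)       # (n, c, C, H, W)
        preds = model(pts.flatten(0, 1).clamp(0, 1)).argmax(1).view(n, c)
        wrong = preds != y.unsqueeze(1)                                          # (n, c) bool
        # smallest misclassified k/c per image; fall back to 1.0 (the full delta)
        candidates = torch.where(wrong, ks.unsqueeze(0).expand(n, c),
                                 torch.ones_like(wrong, dtype=x.dtype))
        scale = candidates.min(dim=1).values
    return scale  # use as: x_adv = (x + scale.view(-1, 1, 1, 1) * delta).clamp(0, 1)
```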
5 Adversarial Robustness

In this section, we conduct a set of experiments on CIFAR-10 (Krizhevsky, Hinton et al. 2009) and Tiny ImageNet (Le and Yang 2015), using PreAct ResNet-18 (He et al. 2016). Input normalization and data augmentation, including 4-pixel padding, random crop, and horizontal flip, are applied. We use SGD with a learning rate of 0.01, momentum of 0.9, and weight decay of 5e-4. To check whether catastrophic overfitting occurs, we set the total number of epochs to 200. The learning rate decays by a factor of 0.2 at epochs 60, 120, and 160. All experiments were conducted on a single NVIDIA TITAN V over three different random seeds. Our implementation in PyTorch (Paszke et al. 2019) with Torchattacks (Kim 2020) is available at https://github.com/Harry24k/catastrophic-overfitting.

During training, the maximum perturbation ϵ was set to 8/255. For PGD adversarial training, we use a step size of α = max(2/255, ϵ/n), where n is the number of steps. TRADES uses α = 2/255 and seven steps for generating adversarial images. Following Wong, Rice, and Kolter (2020), we use α = 1.25ϵ for fast adversarial training and the proposed method. The regularization parameter β for the gradient alignment of GradAlign is set to 0.2, as suggested by Andriushchenko and Flammarion (2020).

First, we check whether our method shows the same behavior on Tiny ImageNet as that described in the previous section. Figure 6 shows that the proposed method also successfully prevents catastrophic overfitting on this larger dataset. PGD7 accuracy decreases rapidly after the 49th epoch only for fast adversarial training, not for the others, including the proposed method. The full results, with the change in distortion, are shown in Appendix B.

Figure 6: (Tiny ImageNet) PGD7 accuracy on the training batch.

We then evaluate robustness on the test set. FGSM and PGD50 with 10 random restarts are used for evaluating robustness of the models. Furthermore, to estimate robustness accurately and detect gradient obfuscation (Athalye, Carlini, and Wagner 2018), we also consider PGD50 adversarial images generated from a Wide ResNet 40-10 (Zagoruyko and Komodakis 2016) trained on clean images (Black-box), and AutoAttack (AA), one of the latest strong adversarial attacks, proposed by Croce and Hein (2020). Tables 1 and 2 summarize the results.

| | Method | Standard | FGSM | PGD50 | Black-box | AA | Time (h) |
|---|---|---|---|---|---|---|---|
| Multi-step | PGD2 Adv. | 86.6 ± 0.8 | 49.7 ± 2.6 | 36.0 ± 2.3 | 85.6 ± 0.8 | 34.8 ± 2.1 | 4.5 |
| | PGD4 Adv. | 86.0 ± 0.8 | 49.6 ± 3.0 | 36.7 ± 2.9 | 85.3 ± 0.8 | 35.4 ± 2.6 | 6.7 |
| | PGD7 Adv. | 84.4 ± 0.2 | 51.5 ± 0.1 | 40.5 ± 0.1 | 83.8 ± 0.2 | 39.4 ± 0.2 | 11.1 |
| | TRADES | 85.3 ± 0.4 | 50.7 ± 1.6 | 39.3 ± 1.9 | 84.4 ± 0.4 | 38.6 ± 2.0 | 15.1 |
| Single-step | Fast Adv. | 84.5 ± 4.3 | 95.1 ± 6.8 | 0.1 ± 0.1 | 80.8 ± 8.7 | 0.0 ± 0.0 | 3.2 |
| | GradAlign | 83.9 ± 0.2 | 44.3 ± 0.0 | 31.7 ± 0.2 | 83.3 ± 0.3 | 30.9 ± 0.2 | 13.6 |
| | Ours (c = 2) | 86.8 ± 0.3 | 48.3 ± 0.5 | 32.5 ± 0.2 | 85.9 ± 0.1 | 30.9 ± 0.2 | 3.5 |
| | Ours (c = 3) | 87.7 ± 0.8 | 50.5 ± 2.4 | 33.9 ± 2.3 | 86.7 ± 0.9 | 32.3 ± 2.2 | 3.9 |
| | Ours (c = 4) | 87.8 ± 0.9 | 50.5 ± 2.3 | 33.7 ± 2.4 | 87.0 ± 0.8 | 32.2 ± 2.4 | 4.4 |

Table 1: Standard and robust accuracy (%) and training time (hours) on CIFAR-10.

| Method | Standard | FGSM | PGD50 | Time (h) |
|---|---|---|---|---|
| PGD2 Adv. | 46.3 ± 1.2 | 14.7 ± 2.7 | 10.3 ± 2.7 | 27.7 |
| Fast Adv. | 26.2 ± 0.7 | 49.0 ± 5.7 | 0.0 ± 0.0 | 19.6 |
| Ours (c = 3) | 49.6 ± 1.5 | 12.5 ± 0.1 | 7.8 ± 0.1 | 25.7 |

Table 2: Standard and robust accuracy (%) and training time (hours) on Tiny ImageNet.

From Table 1, we can see that multi-step adversarial training methods yield more robust models but generally require a longer computation time. In particular, TRADES requires over 15 hours, which is 5 times slower than the proposed method. Among the single-step adversarial training methods, fast adversarial training is computationally efficient; however, because catastrophic overfitting occurred, it shows 0% accuracy against PGD50 and AA. Interestingly, we observe that fast adversarial training achieves a higher accuracy on FGSM adversarial images than on clean images in both datasets, which does not appear in other methods.
The accuracy when applying FGSM only to correctly classified images is 84.4% on CIFAR-10, whereas all other numbers remain almost unchanged when we apply attacks only to correctly classified clean images. We note that this is another characteristic of the catastrophically overfitted model, which we describe in more detail in Appendix B. The proposed method, by contrast, shows the best standard accuracy and the best robustness against PGD50, Black-box, and AA among the single-step methods, with a shorter training time. GradAlign also provides sufficient robustness; however, it takes 3 times longer to train than the proposed method.

As shown in Table 2, similar results are observed on Tiny ImageNet. We include the results of the main competitors: PGD2 adversarial training, fast adversarial training, and the proposed method with c = 3, which shows the best performance on CIFAR-10. Here again, the proposed method shows high standard accuracy and adversarial robustness close to that of PGD2 adversarial training. We provide additional experiments with different settings, such as a cyclic learning rate schedule, in Appendix C.

6 Conclusion

In this study, we empirically showed that catastrophic overfitting is closely related to decision boundary distortion by analyzing the loss surface and robustness of models during training. Decision boundary distortion provides a reliable explanation of the phenomenon in which a catastrophically overfitted model becomes vulnerable to multi-step adversarial attacks while achieving high robustness against single-step adversarial attacks. Based on these observations, we suggested a new simple method that determines the appropriate magnitude of the perturbation for each image. Further, we evaluated robustness of the proposed method against various adversarial attacks and showed that single-step adversarial training can provide sufficient robustness without the occurrence of any catastrophic overfitting.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2019R1A2C2002358).

References

Alayrac, J.-B.; Uesato, J.; Huang, P.-S.; Fawzi, A.; Stanforth, R.; and Kohli, P. 2019. Are Labels Required for Improving Adversarial Robustness? In Advances in Neural Information Processing Systems, 12214–12223.

Andriushchenko, M.; and Flammarion, N. 2020. Understanding and improving fast adversarial training. Advances in Neural Information Processing Systems 33.

Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.

Carmon, Y.; Raghunathan, A.; Schmidt, L.; Duchi, J. C.; and Liang, P. S. 2019. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, 11192–11203.

Cohen, J. M.; Rosenfeld, E.; and Kolter, J. Z. 2019. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918.

Croce, F.; and Hein, M. 2020. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Gowal, S.; Dvijotham, K.; Stanforth, R.; Bunel, R.; Qin, C.; Uesato, J.; Arandjelovic, R.; Mann, T.; and Kohli, P. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Kim, H. 2020. Torchattacks: A PyTorch Repository for Adversarial Attacks. arXiv preprint arXiv:2010.01950.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Citeseer.

Lamb, A.; Verma, V.; Kannala, J.; and Bengio, Y. 2019. Interpolated adversarial training: Achieving robust neural networks without sacrificing too much accuracy. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, 95–103.

Le, Y.; and Yang, X. 2015. Tiny ImageNet visual recognition challenge. CS 231N 7: 7.

Lee, S.; Lee, J.; and Park, S. 2020. Lipschitz-Certifiable Training with a Tight Outer Bound. Advances in Neural Information Processing Systems 33.

Li, B.; Wang, S.; Jana, S.; and Carin, L. 2020. Towards Understanding Fast Adversarial Training. arXiv preprint arXiv:2006.03089.

Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Najafi, A.; Maeda, S.-i.; Koyama, M.; and Miyato, T. 2019. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems, 5541–5551.

Pang, T.; Xu, K.; and Zhu, J. 2019. Mixup inference: Better exploiting mixup to defend adversarial attacks. arXiv preprint arXiv:1909.11515.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8026–8037.

Salman, H.; Li, J.; Razenshteyn, I.; Zhang, P.; Zhang, H.; Bubeck, S.; and Yang, G. 2019. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems, 11292–11303.

Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial training for free! In Advances in Neural Information Processing Systems, 3358–3369.

Smith, L. N.; and Topin, N. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, 1100612. International Society for Optics and Photonics.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; and McDaniel, P. 2017. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204.

Vivek, B.; and Babu, R. V. 2020. Single-step adversarial training with dropout scheduling. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 947–956. IEEE.

Wong, E.; and Kolter, Z. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 5286–5295.

Wong, E.; Rice, L.; and Kolter, J. Z. 2020. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994.

Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang, H.; Chen, H.; Xiao, C.; Gowal, S.; Stanforth, R.; Li, B.; Boning, D.; and Hsieh, C.-J. 2019a. Towards stable and efficient training of verifiably robust neural networks. arXiv preprint arXiv:1906.06316.

Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Zhang, H.; Yu, Y.; Jiao, J.; Xing, E. P.; Ghaoui, L. E.; and Jordan, M. I. 2019b. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573.