# wasserstein_distributional_robustness_of_neural_networks__bf2c0f14.pdf Wasserstein distributional robustness of neural networks Xingjian Bai Department of Computer Science University of Oxford, UK xingjian.bai@sjc.ox.ac.uk Guangyi He Mathematical Institute University of Oxford, UK guangyihe2002@outlook.com Yifan Jiang Mathematical Institute University of Oxford, UK yifan.jiang@maths.ox.ac.uk Jan Obłój Mathematical Institute University of Oxford, UK jan.obloj@maths.ox.ac.uk Deep neural networks are known to be vulnerable to adversarial attacks (AA). For an image recognition task, this means that a small perturbation of the original can result in the image being misclassified. Design of such attacks as well as methods of adversarial training against them are subject of intense research. We re-cast the problem using techniques of Wasserstein distributionally robust optimization (DRO) and obtain novel contributions leveraging recent insights from DRO sensitivity analysis. We consider a set of distributional threat models. Unlike the traditional pointwise attacks, which assume a uniform bound on perturbation of each input data point, distributional threat models allow attackers to perturb inputs in a nonuniform way. We link these more general attacks with questions of out-of-sample performance and Knightian uncertainty. To evaluate the distributional robustness of neural networks, we propose a first-order AA algorithm and its multistep version. Our attack algorithms include Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) as special cases. Furthermore, we provide a new asymptotic estimate of the adversarial accuracy against distributional threat models. The bound is fast to compute and first-order accurate, offering new insights even for the pointwise AA. It also naturally yields out-of-sample performance guarantees. We conduct numerical experiments on CIFAR-10, CIFAR-100, Image Net datasets using DNNs on Robust Bench to illustrate our theoretical results. Our code is available at https://github.com/Jan Obloj/W-DRO-Adversarial-Methods. 1 Introduction Model uncertainty is a ubiquitous phenomenon across different fields of science. In decision theory and economics, it is often referred to as the Knightian uncertainty (Knight, 1921), or the unknown unknowns, to distinguish it from the risk which stems from the randomness embedded by design in the scientific process, see Hansen and Marinacci (2016) for an overview. Transcribing to the context of data science, risk refers to the randomness embedded in a training by design, e.g., through random initialization, drop-outs etc., and uncertainty encompasses the extent to which the dataset is an adequate description of reality. Robustness, the ability to perform well under uncertainty, thus relates to several themes in ML including adversarial attacks, out-of-sample performance and Corresponding author. www.maths.ox.ac.uk/people/jan.obloj 37th Conference on Neural Information Processing Systems (Neur IPS 2023). out-of-distribution performance. In this work, we mainly focus on the former but offer a unified perspective on robustness in all of its facets. Vulnerability of DNNs to crafted adversarial attacks (AA), diagnosed in Biggio et al. (2013), Goodfellow et al. (2015), relates to the ability of an attacker to manipulate network s outputs by changing the input images only slightly often in ways imperceptible to a human eye. As such, AA are of key importance for security-sensitive applications and an active field of research. 
Most works so far have focused on attacks under pointwise lp-bounded image distortions but a growing stream of research, pioneered by Staib and Jegelka (2017) and Sinha et al. (2018), frames the problem using Wasserstein distributionally robust optimization (DRO). We offer novel contributions to this literature. Our key contributions can be summarized as follows. 1) We propose a unified approach to adversarial attacks and training based on sensitivity analysis for Wasserstein DRO. We believe this approach, leveraging results from Bartl et al. (2021), is better suited for gradient-based optimization methods than duality approach adopted in most of the works to date. We further link the adversarial accuracy to the adversarial loss, and investigate the out-of-sample performance. 2) We derive a general adversarial attack method. As a special case, this recovers the classical FGSM attack lending it a further theoretical underpinning. However, our method also allows one to carry out attacks under a distributional threat model which, we believe, has not been done before. We also propose a rectified DLR loss suitable for the distributional attacks. 3) We develop asymptotically certified bounds on adversarial accuracy, applicable to a general threat, including the classical pointwise perturbations. The bounds are first-order accurate and much faster to compute than, e.g., the Auto Attack (Croce and Hein, 2020) benchmark. The performance of our methods is documented using CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Image Net (Deng et al., 2009) datasets and neural networks from Robust Bench (Croce et al., 2021). 2 Related Work Adversarial Attack (AA). Original research focused on the pointwise lp-bounded image distortion. Numerous attack methods under this threat model have been proposed in the literature, including Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), Projected Gradient Descent (PGD) (Madry et al., 2018), CW attack (Carlini and Wagner, 2017), etc. In these white-box attacks, the attacker has full knowledge of the neural network. There are also black-box attacks, such as Zeroth Order Optimization (ZOO) (Chen et al., 2017), Boundary Attack (Brendel et al., 2018), and Query-limited Attack (Ilyas et al., 2018). Auto Attack (Croce and Hein, 2020), an ensemble of white-box and blackbox attacks, provides a useful benchmark for pointwise lp-robustness of neural networks. Notably Hua et al. (2022) considered AA with lp distance replaced by a proxy for human eye evaluation. Adversarial Defense. Early works on data augmentation (Goodfellow et al., 2015, Madry et al., 2018, Tramèr et al., 2018) make use of strong adversarial attacks to augment the training data with adversarial examples; more recent works (Gowal et al., 2021, Xing et al., 2022, Wang et al., 2023) focus on adding randomness to training data through generative models such as GANs and diffusion models. Zhang et al. (2019) consider the trade-off between robustness and accuracy of a neural network via TRADES, a regularized loss. Analogous research includes MART (Wang et al., 2020) and SCORE (Pang et al., 2022). Other loss regularization methods such as adversarial distributional training (Dong et al., 2020) and adversarial weight perturbation (Wu et al., 2020) have been shown to smooth the loss landscape and improve the robustness. In addition, various training techniques can be overlaid to improve robustness, including droup-out layers, early stopping and parameter fine-tuning Sehwag et al. (2020). 
The closest to our setting are Sinha et al. (2018), García Trillos and García Trillos (2022) which employ Wasserstein penalization and constraint respectively. However, so far, even the papers which used distributional threat models to motivate DRO-based training methods actually used classical pointwise attacks to evaluate robustness of their trained DNNs. This highlights the novelty of our distributional threat attack. Robust Performance Bounds. Each AA method gives a particular upper bound on the adversarial accuracy of the network. In contrast, research on certified robustness aims at classifying images which are robust to all possible attacks allowed in the threat model and thus providing an attack-agnostic lower bound on the classification accuracy. To verify robustness of images, deterministic methods using off-the-shelf solvers (Tjeng et al., 2019), relaxed linear programming (Wong and Kolter, 2018, Weng et al., 2018a) or semi-definite programming (Raghunathan et al., 2018, Dathathri et al., 2020) have been applied. Hein and Andriushchenko (2017), Weng et al. (2018b) derive Lipschitz-based metrics to characterize the maximum distortion an image can uphold; Cohen et al. (2019) constructs a certifiable classifier by adding smooth noise to the original classifier; see Li et al. (2023) for a review. Distributionally Robust Optimization (DRO). Mathematically, it is formulated as a min-max problem inf θPΘ sup QPP EQrfθp Zqs, (1) where we minimize the worst-case loss over all possible distributions Q P P. In financial economics, such criteria appear in the context of multi-prior preferences, see (Gilboa and Schmeidler, 1989, Föllmer and Weber, 2015). We refer to (Rahimian and Mehrotra, 2019) for a survey of the DRO. We focus on the Wasserstein ambiguity set P Bδp Pq, which is a ball centered at the reference distribution P with radius δ under the Wasserstein distance. We refer to Gao and Kleywegt (2022) for a discussion of many advantages of this distance. In particular, measures close to each other can have different supports which is key in capturing data perturbations, see Sinha et al. (2018). Staib and Jegelka (2017) interpreted pointwise adversarial training as a special case of Wasserstein DRO (W-DRO). Volpi et al. (2018) utilized W-DRO to improve network performance on unseen data distributions. More recently, Bui et al. (2022) unified various classical adversarial training methods, such as PGD-AT, TRADES, and MART, under the W-DRO framework. W-DRO, while compelling theoretically, is often numerically intractable. In the literature, two lines of research have been proposed to tackle this problem. The duality approach rewrites (1), changing the sup to a univariate inf featuring a transform of fθ. We refer to Mohajerin Esfahani and Kuhn (2018) for the data-driven case, Blanchet and Murthy (2019), Bartl et al. (2020), Gao and Kleywegt (2022) for general probability measures and Huang et al. (2022) for a further application with coresets. The second approach, which we adopt here, considers the first order approximation to the original DRO problem. This can be seen as computing the sensitivity of the value function with respect to the model uncertainty as derived in Bartl et al. (2021), see also Lam (2016), García Trillos and García Trillos (2022) for analogous results in different setups. 3 Preliminaries Image Classification Task. An image is interpreted as a tuple px, yq where the feature vector x P X encodes the graphic information and y P Y t1, . . . 
, mu denotes the class, or tag, of the image. W.l.o.g., we take X r0, 1sn. A distribution of labelled images corresponds to a probability measure P on X ˆ Y. We are given the training set Dtr and the test set Dtt, subsets of X ˆ Y, i.i.d. sampled from P. We denote p P (resp. q P) the empirical measure of points in the training set (resp. test set), i.e., p P 1 |Dtr| ř px,yq PDtr δpx,yq. A neural network is a map fθ : X Ñ Rm fθpxq f l f 1pxq, where f ipxq σpwix biq, σ is a nonlinear activation function, and θ twi, bi : 1 ď i ď lu is the collection of parameters. We denote S the set of images equipped with their labels generated by fθ, i.e., S ! px, yq P X ˆ Y : arg max 1ďiďm fθpxqi tyu ) . The aim of image classification is to find a network fθ with high (clean) prediction accuracy A : Pp Sq EP r1Ss. To this end, fθ is trained solving2 the stochastic optimization problem inf θPΘ EP r Lpfθpxq, yqs, (2) where Θ denotes the set of admissible parameters, and L is a (piecewise) smooth loss function, e.g., cross entropy loss3CE : Rm ˆ Y Ñ R given by CEpz, yq plog softmaxpzqqy. (3) 2In practice, P is not accessible and we use p P or q P instead, e.g., in (2) we replace P with p P and then compute the clean accuracy as q Pp Sq. In our experiments we make it clear which dataset is used. 3By convention, cross entropy is a function of two probability measures. In this case, we implicitly normalize the logit z by applying softmax, and we associate a class y with the Dirac measure δy. Wasserstein Distances. Throughout, pp, qq is a pair of conjugate indices, 1{p 1{q 1, with 1 ď p ď 8. We consider a norm } } on X and denote } } its dual, } x} suptxx, xy : }x} ď 1u. Our main interest is in } } } }r the lr-norm for which } } } }s, where pr, sq are conjugate indices, 1 ď r ď 8. We consider adversarial attacks which perturb the image feature x but not its label y. Accordingly, we define a pseudo distance4 d on X ˆ Y as dppx1, y1q, px2, y2qq }x1 x2} 81ty1 y2u. (4) We denote Πp P, Qq the set of couplings between px, yq and px1, y1q whose first margin is P and second margin is Q, and T#P : P T 1 denotes the pushforward measure of P under a map T. The p-Wasserstein distance, 1 ď p ă 8, between probability measures P and Q on X ˆ Y is Wpp P, Qq : inf t Eπrdppx1, y1q, px2, y2qqps : π P Πp P, Qqu1{p . (5) The 8-Wasserstein distance W8 is given by W8p P, Qq : inftπ ess sup dppx1, y1q, px2, y2qq : π P Πp P, Qqu. (6) We denote the p-Wasserstein ball centered at P with radius δ by Bδp Pq. We mainly consider the cases where p, r P t2, 8u. Intuitively, we can view p as the index of image-wise flexibility and r as the index of pixel-wise flexibility. Unless p 1 is explicitly allowed, p ą 1 in what follows. 4 Wasserstein Distributional Robustness: adversarial attacks and training W-DRO Formulation. The Wasserstein DRO (W-DRO) formulation of a DNN training task is given by: inf θPΘ sup QPBδp P q EQr Lpfθpxq, yqs, (7) where Bδp Pq is the p-Wasserstein ball centered at P and δ denotes the budget of the adversarial attack. In practice, P is not accessible and is replaced with p P. When p 8, the above adversarial loss coincides with the pointwise adversarial loss of Madry et al. (2018) given by inf θPΘ EP rsupt Lpfθpx1q, yq : }x1 x} ď δus. Recently, Bui et al. (2022) considered a more general criterion they called unified distributional robustness. 
It can be re-cast equivalently as an extended W-DRO formulation using couplings: inf θPΘ sup πPΠδp P, q Eπr Jθpx, y, x1, y1qs, (8) where Πδp P, q is the set of couplings between px, yq and px1, y1q whose first margin is P and the second margin is within a Wasserstein δ-ball centered at P. This formulation was motivated by the observation that for p 8, taking Jθpx, y, x1, y1q Lpfθpxq, yq βLpfθpxq, fθpx1qq, it retrieves the TRADES loss of (Zhang et al., 2019) given by inf θPΘ EP Lpfθpxq, yq β sup x1:}x x1}ďδ Lpfθpxq, fθpx1qq ı . W-DRO Sensitivity. In practice, training using (7), let alone (8), is computationally infeasible. To back propagate θ it is essential to understand the inner maximization problem denoted by V pδq sup QPBδp P q EQr Jθpx, yqs, where we write Jθpx, yq Lpfθpxq, yq. One can view the adversarial loss V pδq as a certain regularization of the vanilla loss. Though we are not able to compute the exact value of V pδq for neural networks with sufficient expressivity, DRO sensitivity analysis results allow us to derive a numerical approximation to V pδq and further apply gradient-based optimization methods. This is the main novelty of our approach previous works considering a W-DRO formulation mostly relied on duality results in the spirit of Blanchet and Murthy (2019) to rewrite (7). 4Our results can be adapted to regression tasks where the class label y is continuous and sensitive to the perturbation. In such a setting a different d would be appropriate. Assumption 4.1. We assume the map px, yq ÞÑ Jθpx, yq is L-Lipschitz under d, i.e., |Jθpx1, y1q Jθpx2, y2q| ď Ldppx1, y1q, px2, y2qq. The above assumption is weaker than the L-smoothness assumption encountered in the literature which requires Lipschitz continuity of gradients of Jθ, see for example Sinha et al. (2018)[Assumption B] and Volpi et al. (2018)[Assumptions 1 & 2]. We also remark that our assumption holds for any continuously differentiable Jθ as the feature space X r0, 1sn is compact. The following result follows readily from (Bartl et al., 2021, Theorem 2.2) and its proof. Theorem 4.1. Under Assumption 4.1, the following first order approximations hold: (i) V pδq V p0q δΥ opδq, where Υ EP } x Jθpx, yq}q 1{q . (ii) V pδq EQδr Jθpx, yqs opδq, where Qδ px, yq ÞÑ x δhp x Jθpx, yqq}Υ 1 x Jθpx, yq}q 1 , y ı and h is uniquely determined by xhpxq, xy }x} . The above holds for any probability measure, in particular with P replaced consistently by an empirical measure p P or q P. In Figure 1, we illustrate the performance of our first order approximation of the adversarial loss on CIFAR-10 (Krizhevsky, 2009) under different threat models. 0.00 0.01 0.02 0.03 0.04 (W , l ) Threat Model 0.0 0.1 0.2 0.3 0.4 0.5 0.6 (W2, l2) Threat Model Figure 1: Performance of the first order approximation for the W-DRO value derived in Theorem 4.1. Left: Wide Res Net-28-10 (Gowal et al., 2020) under CE loss (3) and p W8, l8q threat model with δ 1{255, . . . , 10{255. Right: Wide Res Net-28-10 (Wang et al., 2023) under Re DLR loss (10) and p W2, l2q threat models with δ 1{16, . . . , 10{16. WD-Adversarial Accuracy. We consider an attacker with perfect knowledge of the network fθ and the data distribution P, aiming to minimize the prediction accuracy of fθ under an admissible attack. Complementing the W-DRO training formulation, Staib and Jegelka (2017), Sinha et al. 
(2018) proposed Wasserstein distributional threat models under which an attack is admissible if the resulting attacked distribution Q stays in the p-Wasserstein ball Bδp Pq, where δ is the attack budget, i.e., the tolerance for distributional image distortion. We define the adversarial accuracy as: Aδ : inf QPBδp P q Qp Sq inf QPBδp P q EQr1Ss. (9) Note that Aδ is decreasing in δ with A0 A, the clean accuracy. For p 8, the Wasserstein distance essentially degenerates to the uniform distance between images and hence the proposed threat model coincides with the popular pointwise threat model. For 1 ď p ă 8, the distributional threat model is strictly stronger than the pointwise one, as observed in Staib and Jegelka (2017, Prop. 3.1). Intuitively, it is because the attacker has a greater flexibility and can perturb images close to the decision boundary only slightly while spending more of the attack budget on images farther away from the boundary. The threat is also closely related to out-of-distribution generalization, see Shen et al. (2021) for a survey. WD-Adversarial Attack. We propose Wasserstein distributionally adversarial attack methods. As mentioned above, to date, even the papers which used distributional threat models to motivate DRO-based training methods then used classical pointwise attacks to evaluate robustness of their trained DNNs. Our contribution is novel and enabled by the explicit first-order expression for the distributional attack in Theorem 4.1(ii), which is not accessible using duality methods. We recall the Difference of Logits Ratio (DLR) loss of Croce and Hein (2020). If we write z pz1, . . . , zmq fθpxq for the output of a neural network, and zp1q ě ě zpmq are the order statistics of z, then DLR loss is given by zp1q zp3q , if zy zp1q, zp1q zp3q , else. The combination of CE loss and DLR loss has been widely shown as an effective empirical attack for pointwise threat models. However, under distributional threat models, intuitively, an effective attack should perturb more aggressively images classified far from the decision boundary and leave the misclassified images unchanged. Consequently, neither CE loss nor DLR loss are appropriate this intuition is confirmed in our numerical experiments, see Table 1 for details. To rectify this, we propose Re DLR (Rectified DLR) loss: Re DLRpz, yq p DLRq pz, yq zp1q zp3q , if zy zp1q, 0, else. (10) Its key property is to leave unaffected those images that are already misclassified. Our experiments show it performs superior to CE or DLR. An attack is performed using the test data set. For a given loss function, our proposed attack is: xt 1 projδ xt αhp x Jθpxt, yqq}qΥ 1 x Jθpxt, yq}q 1 , (11) where α is the step size and projδ is a projection which ensures the empirical measure q P t 1 : 1 |Dtt| ř px,yq PDtt δpxt 1,yq stays inside the Wasserstein ball Bδp q Pq. In the case p r 8, one can verify hpxq sgnpxq and write (11) as xt 1 projδ xt α sgnp x Jθpxt, yqq . This gives exactly Fast Gradient Sign Method (single step) and Projected Gradient Descent (multistep) proposed in Goodfellow et al. (2015), Madry et al. (2018) and we adopt the same labels for our more general algorithms.5 A pseudocode for the above attack is summarized in Appendix C. Finally, note that Theorem 4.1 offers computationally tractable approximations to the W-DRO adversarial training objectives (7) and (8). In Appendix D we propose two possible training methods but do not evaluate their performance and otherwise leave this topic to future research. 
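To make the update (11) concrete, the following PyTorch-style sketch implements one attack step for the (Wp, lr) threats with p, r in {2, ∞}. It is an illustrative sketch rather than the released implementation: all function and variable names are ours for exposition, the mini-batch empirical measure stands in for P, and the projection proj_δ of Appendix C is applied separately after the step.

```python
import torch

def w_attack_step(model, loss_fn, x, y, alpha, p=2, r=2):
    """One first-order distributional attack step, cf. (11).

    x: current adversarial images (batch), y: labels, alpha: step size.
    p: Wasserstein index, r: pixel-wise norm index, both in {2, float('inf')}.
    loss_fn is assumed to return per-image losses (reduction='none').
    proj_delta (Appendix C) is applied afterwards to return to B_delta(P).
    """
    q = 1.0 if p == float("inf") else p / (p - 1)    # conjugate of p
    s = 1.0 if r == float("inf") else r / (r - 1)    # conjugate of r (dual norm index)

    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y).sum()
    g = torch.autograd.grad(loss, x)[0]              # grad_x J_theta(x, y)
    g_norm = g.flatten(1).norm(p=s, dim=1)           # per-image l_s norm of the gradient

    # h(g): maximizer of <h, g> over the unit l_r ball
    if s == 1.0:
        h = torch.sign(g)                            # r = inf  ->  sign direction
    else:
        h = g / g_norm.clamp_min(1e-12).view(-1, 1, 1, 1)

    # per-image scaling (||g||_s / Upsilon)^(q-1); constant 1 when q = 1 (p = inf)
    if q == 1.0:
        scale = torch.ones_like(g_norm)
    else:
        upsilon = g_norm.pow(q).mean().pow(1.0 / q)  # Upsilon estimated on the batch
        scale = (g_norm / upsilon.clamp_min(1e-12)).pow(q - 1)

    x_adv = x.detach() + alpha * h * scale.view(-1, 1, 1, 1)
    return x_adv.clamp(0.0, 1.0)                     # keep images in [0, 1]^n
```

For p = r = ∞ the scaling is identically one and the step reduces to x + α sgn(∇x Jθ(x, y)), i.e., the FGSM (single step) or PGD (iterated) update; for p = 2 the budget is spent non-uniformly, in proportion to the size of the input gradient.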
5 Performance Bounds Understanding how a DNN classifier will perform outside the training data set is of key importance. We leverage the DRO sensitivity results now to obtain a lower bound on Aδ. We then use results on convergence of empirical measures in Fournier and Guillin (2015) to translate our lower bound into guarantees on out-of-sample performance. Bounds on Adversarial Accuracy. We propose the following metric of robustness: A P r0, 1s. Previous works mostly focus on the maximum distortion a neural network can withhold to retain certain adversarial performance, see Hein and Andriushchenko (2017), Weng et al. (2018b) for 5To stress the Wasserstein attack and the particular loss function we may write, e.g., W-PGD-Re DLR. local robustness and Bastani et al. (2016) for global robustness. However, there is no immediate connection between such a maximum distortion and the adversarial accuracy, especially in face of a distributionally adversarial attack. In contrast, since A A0 is known, computing Rδ is equivalent to computing Aδ. We choose to focus on the relative loss of accuracy as it provides a convenient normalization: 0 ď Rδ ď 1. Rδ 1 corresponds to a very robust architecture which performs as well under attacks as it does on clean test data, while Rδ 0 corresponds to an architecture which loses all of its predictive power under an adversarial attack. Together the couple p A, Aδq thus summarizes the performance of a given classifier. However, computing Aδ is difficult and time-consuming. Below, we develop a simple and efficient method to calculate theoretical guaranteed bounds on R and thus also on Aδ. Assumption 5.1. We assume that for any Q P Bδp Pq (i) 0 ă Qp Sq ă 1. (ii) Wpp Qp |Sq, Pp |Sqq Wpp Qp |Scq, Pp |Scqq opδq, where Sc p X ˆ Yqz S and the conditional distribution is given by Qp E|Sq Qp E X Sq{Qp Sq. The first condition stipulates non-degeneracy: the classifier does not perform perfectly but retains some accuracy under attacks. The second condition says the classes are well-separated: for δ small enough an admissible attack can rarely succeed. We write the adversarial loss condition on the correctly classified images and misclassified images as Cpδq sup QPBδp P q EQr Jθpx, yq|Ss and Wpδq sup QPBδp P q EQr Jθpx, yq|Scs. We note that an upper bound on Rδ is given by any adversarial attack. In particular, Rδ ď Ru δ : Qδp Sq{A. (12) Theorem 5.1. Under Assumptions 4.1 and 5.1, we have an asymptotic lower bound as δ Ñ 0 Rδ ě Wp0q V pδq Wp0q V p0q opδq r Rl δ opδq R l δ opδq, (13) where the first order approximations are given by r Rl δ Wp0q EQδr Jθpx, yqs Wp0q V p0q and R l δ Wp0q V p0q δΥ Wp0q V p0q . (14) The equality between the lower bound and the two first-order approximations r Rl δ and R l δ follows from Theorem 4.1. Consequently, Rl δ : mint r Rl δ, R l δu allows us to estimate the model robustness without performing any sophisticated adversarial attack. Our experiments, detailed below, show the bound is reliable for small δ and is orders of magnitude faster to compute than Rδ even in the classical case of pointwise attacks. The proof is reported in Appendix A. Its key ingredient is the following tower-like property. Proposition 5.2. Under Assumptions 4.1 and 5.1, we have V pδq sup QPBδp P q EQr Cpδq1S Wpδq1Scs opδq. Bounds on Out-of-Sample Performance. Our results on distributionally adversarial robustness translate into bounds for performance of the trained DNN on unseen data. 
We rely on the results of Fournier and Guillin (2015) and refer to Lee and Raginsky (2018) for analogous applications to finite sample guarantees and to Gao (2022) for further results and discussion. We fix 1 ă p ă n{2 and let N |Dtr|, M |Dtt|. If sampling of data from P is described on a probability space pΩ, F, Pq then p P is a random measure on this space and, by ergodic theorem, P-a.s., it converges weakly to P as N Ñ 8. In fact, Wpp p P, Pq converges to zero P-a.s. Crucially the rates of convergence were obtained in Dereich et al. (2013), Fournier and Guillin (2015) and yield Er Wpp p P, Pqs ď KN 1 n and Pp Wpp p P, Pq ě εq ď K expp KNεnq, (15) where K is a constant depending on p and n which can be computed explicitly, see for example Guo and Obłój (2019, Appendix). This, with triangle inequality and Theorem 5.1, gives Corollary 5.3. Under Assumptions 4.1 and 5.1 on measure p P, with probability at least 1 2K expp Kεn mint M, Nuq it holds that q A q Pp Sq ě p A p Rl 2ε opεq. Next results provide a finer statistical guarantee on the out-of-sample performance for robust (W-DRO) training. Its proof is reported in Appendix B. Theorem 5.4. Under Assumption 4.1, with probability at least 1 K expp KNεnq we have V pδq ď p V pδq ε sup QPB δp p P q EQ} x Jθpx, yq}q s 1{q opεq ď p V pδq Lε where B δp p Pq arg max QPBδp p P q EQr Jθpx, yqs and constant K only depends on p and n. Our lower bound estimate in Theorem 5.1 can be restated as p Aδ : p A p Aδ ď p V pδq p V p0q x Wp0q p Cp0q opδq. We now use Theorem 5.4 to bound Apδq, the shortfall of the adversarial accuracy under P, using quantities evaluated under p P. Corollary 5.5. Under Assumptions 4.1 and 5.1, with probability at least 1 K expp KNδnq it holds that Aδp Pq ď p V pδq p V p0q x Wp0q p Cp0q 2Lδ x Wp0q p Cp0q opδq. We remark that the above results are easily extended to the out-of-sample performance on the test set, via the triangle inequality Wpp p P, q Pq ď Wpp p P, Pq Wpp P, q Pq. By using complexity measures such as entropy integral (Lee and Raginsky, 2018), Rademacher complexity (Gao, 2022, Gao et al., 2022) a further analysis can be undertaken for inf θPΘ sup QPBδp P q EQr Jθpx, yqs and inf θPΘ sup QPBδp p P q EQr Jθpx, yqs. (16) In particular, a dimension-free estimate of out-of-sample performance is obtained in (Gao, 2022) under a Lipschitz framework with light-tail reference measures. 6 Numerical Experiments Experimental Setting. We conduct experiments on a high performance computing server equipped with 49 GPU nodes. The algorithms are implemented in Python. Experiments are conducted on CIFAR-10, CIFAR-100, and Image Net datasets. Numerical results are consistent across different datasets, and we present the results on CIFAR-10 in body paragraph only for the sake of brevity. Results on CIFAR-100 and Image Net are reported in Appendix F. CIFAR-10 (Krizhevsky, 2009) comprises 60,000 color images across 10 mutually exclusive classes, with 6,000 images per class. Each image contains 32 ˆ 32 pixels in 3 color channels. We normalize the input feature as a vector x P r0, 1s3ˆ32ˆ32. The dataset is further divided into training and test sets, containing 50,000 and 10,000 images respectively. We evaluate the robustness of neural networks on the test set only. We consider four threat models p Wp, lrq with p, r P t2, 8u with different range of attack budget δ depending on the relative strength of the attack. 
E.g., roughly speaking, if an l8-attack modifies one third of the pixels of an image with strength 4/255, then it corresponds to an l2-attack with strength 1/2. When clear from the context, we drop the δ subscript. We take top neural networks from Robust Bench (Croce et al., 2021), a lively maintained repository that records benchmark robust neural networks on CIFAR-10 against pointwise attacks. For pointwise threat models p W8, lrq, Robust Bench reports Aδ obtained using Auto Attack (Croce and Hein, 2020) for l8, δ 8{255 and l2, δ 1{2, see Appendix H. However, due to high computational cost of Auto Attack, we apply PGD-50 based on CE and DLR losses as a substitute to obtain the reference adversarial accuracy for attacks with relatively small budgets δ 2{255, 4{255 for l8 and δ 1{8, 1{4 for l2. For distributional threat models p W2, lrq, there is no existing benchmark attacking method. Therefore, W-PGD attack (11) based on Re DLR loss is implemented to obtain the reference adversarial accuracy Aδ. All PGD attacks are run with 50 iteration steps and take between 1 and 12 hours to run on a single GPU environment. Bounds Rl, Ru compute ca. 50 times faster. Table 1: Comparison of adversarial accuracy of neural networks on Robust Bench under different empirical attacks. Set attack budget δ 8{255 for l8 threat models and δ 1{2 for l2 threat models. Methods Auto Attack W-PGD-CE W-PGD-DLR W-PGD-Re DLR l8 57.66% 61.32% 79.00% 45.46% l2 75.78% 74.62% 78.69% 61.69% Distributionally Adversarial Attack. We report in Table 1 the average accuracy of top neural networks on Robust Bench against pointwise and distributional attacks under different loss functions. The predicted drop in accuracy between a pointwise, i.e., 8-W-DRO attack and a distributional 2-W-DRO attack is only realized using the Re DLR loss. In Figure 2, we compare the adversarial accuracy of robust networks on Robust Bench against pointwise threat models and distributional threat models. We notice a significant drop of the adversarial accuracy even for those neural networks robust against pointwise threat models. 30% 40% 50% 60% 70% 80% 90% 100% Adversarial Accuracy (W , l ) Adversarial Accuracy (W2, l ) δ = 8/255 y = x 50% 60% 70% 80% 90% 100% Adversarial Accuracy (W , l2) Adversarial Accuracy (W2, l2) δ = 1/2 y = x Figure 2: Shortfall of WD-adversarial accuracy with different metrics l8 (left) and l2 (right). Bounds on Adversarial Accuracy. We report in Table 2 the computation time of our proposed bounds Rl δ mint r Rl δ, R l δu in (14) and Ru δ in (12) with the computation time of Rδ obtained from Auto Attack. Computing our proposed bounds Rl, Ru is orders of magnitude faster than performing an attack to estimate R. This also holds for distributional threat attacks. Table 2: Computation times of p W8, l8q, δ 8{255 attack for one mini-batch of size 100, in seconds. We compute R by Auto Attack and average the computation time over models on Robust Bench grouped by their architecture. Pre Act Res Net-18 Res Net-18 Res Net-50 WRN-28-10 WRN-34-10 WRN-70-16 R 197 175 271 401 456 2369 Rl&Ru 0.52 0.49 0.17 0.55 0.53 1.46 To illustrate the applications of Theorem 5.1, we plot the bounds Rl and Ru against R for neural networks on Robust Bench. The results are plotted in Figure 3 and showcase the applicability of our bounds across different architectures.6 Note that smaller δ values are suitable for the stronger W2-distributional attack. For pointwise threat models (top row) we compute the bounds using CE loss. 
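In practice, computing these bounds on a batch requires only per-image losses, input-gradient norms and the accuracy under the one-step attack Qδ. The sketch below (PyTorch-style, illustrative names; loss_fn with reduction='none'; q = s = 1 corresponds to the pointwise (W∞, l∞) setting) assembles Rl = min{R̃l, R̄l} from (14) and Ru from (12).

```python
import torch

@torch.no_grad()
def correct_mask(model, x, y):
    return (model(x).argmax(dim=1) == y).float()      # indicator of S

def robustness_bounds(model, loss_fn, x, y, x_adv, delta, q=1.0, s=1.0):
    """Estimate R^l and R^u of Theorem 5.1 on a batch (empirical measure).

    x_adv: images perturbed by the one-step attack Q_delta (e.g. W-FGSM);
    the batch must contain at least one misclassified image (Assumption 5.1).
    """
    correct = correct_mask(model, x, y)
    A = correct.mean()                                 # clean accuracy

    x_ = x.clone().requires_grad_(True)
    loss = loss_fn(model(x_), y)                       # per-image J_theta(x, y)
    g = torch.autograd.grad(loss.sum(), x_)[0]
    g_norm = g.flatten(1).norm(p=s, dim=1)

    V0 = loss.mean()                                   # V(0)
    W0 = loss[correct == 0].mean()                     # W(0): loss on misclassified images
    upsilon = g_norm.pow(q).mean().pow(1.0 / q)        # Upsilon

    with torch.no_grad():
        V_adv = loss_fn(model(x_adv), y).mean()        # E_{Q_delta}[J_theta]
        A_adv = correct_mask(model, x_adv, y).mean()   # Q_delta(S)

    denom = W0 - V0
    R_bar = (W0 - V0 - delta * upsilon) / denom        # first-order lower bound
    R_tilde = (W0 - V_adv) / denom                     # attack-based lower bound
    R_low = min(R_bar.item(), R_tilde.item())
    R_up = (A_adv / A).item()                          # upper bound Q_delta(S)/A
    return R_low, R_up
```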
For distributional threat models (bottom row), reference adversarial accuracy is obtained from a 6We use all 60 available networks on Robust Bench (model zoo) for l8 and all 20 available networks for l2. W-PGD-Re DLR attack and, accordingly, we use Re DLR loss to compute Ru and Rl. In this case, the width of the gap between our upper and lower bounds varies significantly for different DNNs. To improve the bounds, instead of Rl, we could estimate V pδq and use the lower bound in (13). This offers a trade-off between computational time and accuracy which is explored further in Appendix E. 0.88 0.90 0.92 0.94 0.96 0.98 (W , l ) Threat Model with δ = 2/255 0.750 0.775 0.800 0.825 0.850 0.875 0.900 0.925 0.950 (W , l ) Threat Model with δ = 4/255 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 (W , l2) Threat Model with δ = 1/8 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00 (W , l2) Threat Model with δ = 1/4 0.75 0.80 0.85 0.90 0.95 1.00 (W2, l ) Threat Model with δ = 1/510 0.6 0.7 0.8 0.9 1.0 (W2, l ) Threat Model with δ = 1/255 0.90 0.92 0.94 0.96 0.98 1.00 (W2, l2) Threat Model with δ = 1/32 0.800 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000 (W2, l2) Threat Model with δ = 1/16 Figure 3: Ru & Rl versus R. Top row: W8-attack with bounds computed based on CE loss across neural networks on Robust Bench. Bottom row: W2-attack for the same sets of neural networks with bounds computed using Re DLR loss. 7 Limitations and Future Work Limitations. Our theoretical results are asymptotic and their validity is confined to the linear approximation regime. We believe that the empirical results we presented from all the leaderboard models on Robust Bench across different attack types provide an overwhelming evidence that our results are valid and relevant for the range of attack budget δ considered in AA settings. However, as δ increases we are likely to go outside of the linear approximation regime, see Figure 1. In Appendix E we plot the results for pointwise attack with δ 8{255 where some of neural networks have a lower bound Rl greater than the reference R. We do not have theoretical results to provide guarantees on the range of applicability but, in Appendix E, discuss a possible rule of thumb solution. Future Work. We believe our research opens up many avenues for future work. These include: developing stronger attacks under distributional threat models, testing the performance of the two training algorithms derived here and investigating further sensitivity-based ones, as well as analyzing the relation between the values and optimizers in (16), verifying empirical performance of our out-ofsample results, including Corollary 5.5, and extending these to out-of-distribution performance. Broader Impact. Our work contributes to the understanding of robustness of DNN classifiers. We believe it can help users in designing and testing DNN architectures. It also offers a wider viewpoint on the question of robustness and naturally links the questions of adversarial attacks, out-of-sample performance, out-of-distribution performance and Knightian uncertainty. We provide computationally efficient tools to evaluate robustness of DNNs. However, our results are asymptotic and hence valid for small attacks and we acknowledge the risk that some users may try to apply the methods outside of their applicable regimes. Finally, in principle, our work could also enhance understanding of malicious agents aiming to identify and attack vulnerable DNN-based classifiers. 
Acknowledgements The authors are grateful to Johannes Wiesel for his most helpful comments and suggestions in the earlier stages of this project. JO gratefully acknowledges the support from St John s College, Oxford. YJ s research is supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). XB and GH s work, part of their research internship with JO, was supported by the Mathematical Institute and, respectively, St John s College and St Anne s College, Oxford. D. Bartl, S. Drapeau, and L. Tangpi. Computational aspects of robust optimized certainty equivalents and option pricing. Mathematical Finance, 30(1):287 309, 2020. D. Bartl, S. Drapeau, J. Obłój, and J. Wiesel. Sensitivity analysis of Wasserstein distributionally robust optimization problems. Proc. R. Soc. A., 477(2256):20210176, Dec. 2021. O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, and A. Criminisi. Measuring neural net robustness with constraint. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndi c, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 387 402, Berlin, Heidelberg, 2013. Springer. J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565 600, May 2019. W. Brendel, J. Rauber, and M. Bethge. Decision based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018. A. T. Bui, T. Le, Q. H. Tran, H. Zhao, and D. Phung. A unified Wasserstein distributional robustness framework for adversarial training. In International Conference on Learning Representations, Jan. 2022. N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39 57. IEEE, May 2017. P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15 26. ACM, Nov. 2017. J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, pages 1310 1320. PMLR, May 2019. F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter free attacks. In Proceedings of the 37th International Conference on Machine Learning, pages 2206 2216. PMLR, Nov. 2020. F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein. Robust Bench: A standardized adversarial robustness benchmark, Oct. 2021. S. Dathathri, K. Dvijotham, A. Kurakin, A. Raghunathan, J. Uesato, R. R. Bunel, S. Shankar, J. Steinhardt, I. Goodfellow, P. S. Liang, and P. Kohli. Enabling certification of verification agnostic networks via memory efficient semidefinite programming. In Advances in Neural Information Processing Systems, volume 33, pages 5318 5331. Curran Associates, Inc., 2020. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. S. Dereich, M. Scheutzow, and R. 
Schottstedt. Constructive quantization: approximation by empirical measures. In Annales de l IHP Probabilités et statistiques, volume 49, pages 1183 1203, 2013. Y. Dong, Z. Deng, T. Pang, J. Zhu, and H. Su. Adversarial distributional training for robust deep learning. In Advances in Neural Information Processing Systems, volume 33, pages 8270 8283. Curran Associates, Inc., 2020. H. Föllmer and S. Weber. The axiomatic approach to risk measures for capital determination. Annual Review of Financial Economics, 7(1):301 337, 2015. N. Fournier and A. Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707 738, 2015. R. Gao. Finite-sample guarantees for wasserstein distributionally robust optimization: breaking the curse of dimensionality. Operations Research, 2022. R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of OR, Aug. 2022. R. Gao, X. Chen, and A. J. Kleywegt. Wasserstein distributionally robust optimization and variation regularization. Operations Research, 2022. C. A. García Trillos and N. García Trillos. On the regularized risk of distributionally robust learning over deep neural networks. Res Math Sci, 9(3):54, Aug. 2022. I. Gilboa and D. Schmeidler. Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18(2):141 153, 1989. I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. S. Gowal, C. Qin, J. Uesato, T. Mann, and P. Kohli. Uncovering the limits of adversarial training against norm-bounded adversarial examples. ar Xiv preprint ar Xiv:2010.03593, 2020. S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann. Improving robustness using generated data. In Advances in Neural Information Processing Systems, volume 34, pages 4218 4233. Curran Associates, Inc., 2021. G. Guo and J. Obłój. Computational methods for martingale optimal transport problems. The Annals of Applied Probability, 29(6):3311 3347, 2019. L. P. Hansen and M. Marinacci. Ambiguity aversion and model misspecification: an economic perspective. Statistical Science, 31(4):511 515, 2016. M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. X. Hua, H. Xu, J. Blanchet, and V. A. Nguyen. Human imperceptible attacks and applications to improve fairness. In 2022 Winter Simulation Conference (WSC), pages 2641 2652, 2022. R. Huang, J. Huang, W. Liu, and H. Ding. Coresets for Wasserstein distributionally robust optimization problems. In Advances in Neural Information Processing Systems, Oct. 2022. A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In Proceedings of the 35th International Conference on Machine Learning, pages 2137 2146. PMLR, July 2018. F. H. Knight. Risk, Uncertainty and Profit. Boston, New York, Houghton Mifflin Company, 1921. A. Krizhevsky. Learning multiple layers of features from tiny images. 2009. H. Lam. Robust sensitivity analysis for stochastic systems. Mathematics of OR, 41(4):1248 1275, Nov. 2016. M. Larsson, J. Park, and J. Wiesel. On concentration of the empirical measure for general transport costs, 2023. J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. 
Advances in Neural Information Processing Systems, 31, 2018. L. Li, T. Xie, and B. Li. So K: Certified robustness for deep neural networks. In 44th IEEE Symposium on Security and Privacy. IEEE, 2023. A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. P. Mohajerin Esfahani and D. Kuhn. Data driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program., 171(1): 115 166, Sept. 2018. J. L. M. Olea, C. Rush, A. Velez, and J. Wiesel. The out-of-sample prediction error of the squareroot-lasso and related estimators, 2023. T. Pang, M. Lin, X. Yang, J. Zhu, and S. Yan. Robustness and accuracy could be reconcilable by (proper) definition. In Proceedings of the 39th International Conference on Machine Learning, pages 17258 17277. PMLR, June 2022. A. Raghunathan, J. Steinhardt, and P. S. Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review, Aug. 2019. V. Sehwag, S. Wang, P. Mittal, and S. Jana. HYDRA: pruning adversarially robust neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 19655 19666. Curran Associates, Inc., 2020. Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui. Towards out-of-distribution generalization: a survey, Aug. 2021. A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. M. Staib and S. Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, volume 3, page 4, 2017. V. Tjeng, K. Y. Xiao, and R. Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019. F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. Mc Daniel. Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, 2018. R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in Neural Information Processing Systems, 31, 2018. Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2020. Z. Wang, T. Pang, C. Du, M. Lin, W. Liu, and S. Yan. Better diffusion models further improve adversarial training. In Proceedings of the 40th International Conference on Machine Learning, 2023. L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning, and I. Dhillon. Towards fast computation of certified robustness for Re LU networks. In Proceedings of the 35th International Conference on Machine Learning, pages 5276 5285. PMLR, July 2018a. T. W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, and L. Daniel. Evaluating the robustness of neural networks: an extreme value theory approach. In International Conference on Learning Representations. International Conference on Learning Representations, ICLR, 2018b. E. Wong and Z. Kolter. 
Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, pages 5286 5295. PMLR, July 2018. D. Wu, S.-T. Xia, and Y. Wang. Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems, volume 33, pages 2958 2969. Curran Associates, Inc., 2020. Y. Xing, Q. Song, and G. Cheng. Why do artificially generated data help adversarial robustness. In Advances in Neural Information Processing Systems, Oct. 2022. H. Zhang, Y. Yu, J. Jiao, E. Xing, L. E. Ghaoui, and M. Jordan. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, pages 7472 7482. PMLR, May 2019. A Bounds on Adversarial Accuracy Recall that Proposition 5.2 states the following tower-like property V pδq sup QPBδp P q EQr Cpδq1S Wpδq1Scs opδq, where Cpδq sup QPBδp P q EQr Jθpx, yq|Ss and Wpδq sup QPBδp P q EQr Jθpx, yq|Scs. Proof of Proposition 5.2. One direction follows directly from the usual tower property of conditional expectation: V pδq sup QPBδp P q EQr Jpx, yqs sup QPBδp P q EQr EQr Jpx, yq|σp Sqss ď sup QPBδp P q EQr Cpδq1S Wpδq1Scs. For the other direction, note that Qp E|Sq Qp E X Sq{Qp Sq and Qp E|Scq Qp E X Scq{Qp Scq, are well-defined for any Borel E by Assumption 5.1. Take an arbitrary ε ą 0 and Qc, Qw P Bδp Pq such that EQcr Jpx, yq|Ss ě Cpδq ε and EQwr Jpx, yq|Scs ě Wpδq ε. We further take Q P Bδp Pq such that EQ r Cpδq1S Wpδq1Scs ě sup QPBδp P q EQr Cpδq1S Wpδq1Scs ε, and write distribution r Q given by r Qp Eq Qcp E|Sq Q p Sq Qwp E|Scq Q p Scq. These give us sup QPBδp P q EQr Cpδq1S Wpδq1Scs ďEQ r Cpδq1S Wpδq1Scs ε ďEQ EQcr Jθpx, yq|Ss1S EQwr Jθpx, yq|Scs1Sc 3ε E r Qr Jθpx, yqs 3ε. Recall Assumption 5.1 (ii) gives for any Q P Bδp Pq Wpp Qp |Sq, Pp |Sqq Wpp Qp |Scq, Pp |Scqq opδq. Now we take πc P Πp Q p |Sq, Qcp |Sqq such that Eπcrdp X, Y qps Wpp Q p |Sq, Qcp |Sqqp, and similarly πw P Πp Q p |Scq, Qwp |Scqq. Then by definition of r Q, we have π Q p Sqπc Q p Scqπw P Πp Q , r Qq. Moreover, we derive Wpp Q , r Qqp ď Eπrdp X, Y qps Q p Sq Eπcrdp X, Y qps Q p Scq Eπwrdp X, Y qps Q p Sq Wpp Q p |Sq, Qcp |Sqqp Q p Scq Wpp Q p |Scq, Qwp |Scqqp which implies Wpp P, r Qq δ opδq by triangle inequality. Hence, by L-Lipschitzness of Jθ and (Bartl et al., 2021, Appendix Corollary 7.5) we obtain sup QPBδp P q EQr Cpδq1S Wpδq1Scs ď V pδ opδqq 3ε ď V pδq opδq 3ε. Finally, by taking ε Ñ 0, we deduce V pδq sup QPBδp P q EQr Cpδq1S Wpδq1Scs opδq which concludes the proof of Proposition 5.2. Now, we present the proof of the lower bound estimate in Theorem 5.1, A ě Wp0q V p0q Wp0q V p0q. Proof of Theorem 5.1. By adding and subtracting Wpδq1S, Proposition 5.2 now gives V pδq sup QPBδp P q EQr Cpδq1S Wpδq1Scs opδq sup QPBδp P q EQr Wpδq p Wδ Cpδqq1Ss opδq Wpδq p Wpδq Cpδqq Aδ opδq. Naturally, we can rewrite loss V p0q using the usual tower property V p0q ACp0q p1 Aq Wp0q Wp0q p Wp0q Cp0qq A. Together these yield V pδq V p0q p1 Aδqp Wpδq Wp0qq Aδp Cpδq Cp0qq p Wp0q Cp0qqp A Aδq opδq ě p Wp0q Cp0qqp A Aδq opδq. Plugging in Cp0q r V p0q p1 Aq Wp0qs{A completes the proof. B Bounds on Out-of-Sample Performance The concentration inequality in Fournier and Guillin (2015) is pivotal to derive the out-of-sample performance bounds. It characterizes how likely an empirical measure can deviate from its generating distribution. 
We note that recently (Larsson et al., 2023, Olea et al., 2023) obtained concentration inequalities considering other transport costs, including sliced Wasserstein distances which are of particular interest for regression tasks. Recall we denote p P as the empirical measure of training set Dtr with size N and q P as the empirical measure of test set Dtt with size M. We use the same p, q notations to denote quantities computed from p P and q P, respectively. We restate the concentration inequality as Pp Wpp p P, Pq ě εq ď K expp KNεnq, where K is a constant only depending on n and p. For convenience, K might change from line to line in this section. Together with Theorem 5.1, we give an out-of-sample clean accuracy guarantee in Corollary 5.3. Proof of Corollary 5.3. By triangle inequality and concentration inequality, we have Pp Wpp q P, p Pq ě 2εq ď Pp Wpp q P, Pq ě εq Pp Wpp P, p Pq ě εq ď 2K expp Kεn mint M, Nuq. With Theorem 5.1, it implies that with probability at least 1 2K expp Kεn mint M, Nuq, we have q A ě p A2ε ě p A p Rl 2ε opεq. The following results provide a guarantee on the out-of-sample adversarial performance. Proof of Theorem 5.4. Estimates in Fournier and Guillin (2015) imply that with probability at least 1 K expp KNεnq, we have Bδp p Pq Ď Bδ εp Pq. Hence, we derive V pδq ď p V pδ εq. On the other hand, since Jθ is L-Lipschitz, from (Bartl et al., 2021, Appendix Corollary 7.5) we obtain p V pδ εq p V pδq ε sup QPB δp p P q EQ} x Jθpx, yq}q 1{q opεq, where B δp p Pq arg max QPBδp p P q EQr Jθpx, yqs. Combining above results, we conclude that with probability at least 1 K expp KNεnq V pδq ď p V pδq ε sup QPB δp p P q EQ} x Jθpx, yq}q 1{q opεq ď p V pδq Lε. Proof Corollary 5.5. By Theorem 5.1, we have Aδp Pq ď V pδq V p0q Wp0q Cp0q opδq. We now bound the numerator and denominator separately. By taking ε δ in Theorem 5.4, we have with probability at least 1 K expp KNδnq, V pδq ď p V pδq Lδ. Similarly, we can also show that V p0q ě p V p0q Lδ. Hence, we have with probability at least 1 K expp KNδnq, V pδq V p0q ď p V pδq p V p0q 2Lδ. For the denominator, notice that Pp p P P Bδp Pqq ě 1 K expp KN exppδnqq. By Assumption 5.1 (ii), we have with probability at least 1 K expp KNδnq, Wpp p Pp |Sq, Pp |Sqq Wpp p Pp |Scq, Pp |Scqq opδq. This implies |Cp0q Wp0q p Cp0q x Wp0q| opδq. Combining above results, we conclude that with probability at least 1 K expp KNδnq, Aδp Pq ď p V pδq p V p0q x Wp0q p Cp0q 2Lδ x Wp0q p Cp0q opδq. C W-PGD Algorithm We give the details of W-PGD algorithm implemented in this paper. Recall in Section 4, we propose xt 1 projδ xt αhp x Jθpxt, yqq}Υ 1 x Jθpxt, yq}q 1 s , (17) where we take } } } }r and hence h is given by xhpxq, xy }x}s. In particular, hpxq sgnpxq for s 1 and hpxq x{}x}2 for s 2. The projection step projδ is based on any off-the-shelf optimal transport solver S which pushes the adversarial images back into the Wasserstein ball along the geodesics. The solver S gives the optimal matching T between the original test data Dtt and the perturbed test data D1 tt. Formally, projδ maps x1 ÞÑ x δd 1p Tpxq xq, where d ! 1 |Dtt| ř px,yq PDtt }Tpxq x}p r )1{p the Wasserstein distance between q P and q P 1. See Algorithm 1 for pseudocodes. In numerical experiments, due to the high computational cost of the OT solver, we always couple each image with its own perturbation. 
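To make the projection step concrete, the NumPy/SciPy sketch below (illustrative names) implements proj_δ for two equal-weight empirical measures. As a stand-in for an off-the-shelf OT solver S it uses scipy's linear_sum_assignment, since optimal transport between two equal-size empirical measures reduces to an assignment problem; with use_ot=False it falls back to the identity coupling used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def proj_delta(X_clean, X_adv, delta, p=2, r=2, use_ot=False):
    """Project perturbed images back into the W_p ball of radius delta
    around the clean empirical measure, moving along matched geodesics.

    X_clean, X_adv: arrays of shape (M, d), flattened images in [0, 1]^d.
    With use_ot=False each image is coupled with its own perturbation.
    """
    if use_ot:
        # equal-weight empirical measures: OT is an assignment problem
        # (the M x M x d cost tensor is shown only for small M)
        cost = np.linalg.norm(X_clean[:, None, :] - X_adv[None, :, :],
                              ord=r, axis=2) ** p
        _, cols = linear_sum_assignment(cost)
        T = X_adv[cols]                                 # T(x_i): matched perturbed image
    else:
        T = X_adv                                       # identity coupling

    d = np.mean(np.linalg.norm(T - X_clean, ord=r, axis=1) ** p) ** (1.0 / p)
    if d <= delta:
        return X_adv                                    # already inside the ball
    X_proj = X_clean + (delta / d) * (T - X_clean)      # x -> x + (delta/d)(T(x) - x)
    return np.clip(X_proj, 0.0, 1.0)
```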
Algorithm 1: W-PGD Attack Input: Model parameter θ, attack strength δ, ratio r, iteration step I, OT solver S ; Data: Test set Dtt tpx, yq|px, yq Pu with size M; def projδp Dtt, D1 ttq: T S p Dtt, D1 ttq; // Generate transport map from OT solver d ! 1 M ř px,yq PDtt }Tpxq x}p r )1{p ; // Calculate the Wasserstein distance for px, yq in Dtt do x1 Ð x δd 1p Tpxq xq; // Project back to the Wasserstein ball x1 Ð clamppx1, 0, 1q; return D1 tt. def attackp Dttq: α Ð rδ{I; // Calculate stepsize Dadv tt Ð Dtt; for 1 ď i ď I do px,yq PDadv tt } x Jθpx, yq}q s 1{q ; // Calculate Υ for px, yq P Dadv tt do px, yq Ð x αhp x Jθpx, yqq}Υ 1 x Jθpx, yq}q 1 s , y ; Dadv tt projδp Dtt, Dadv tt q; return Dadv tt . D Wasserstein Distributionally Adversarial Training Theorem 4.1 offers natural computationally tractable approximations to the W-DRO training objective inf θPΘ sup QPBδp P q EQr Jθpx, yqs and its extension inf θPΘ sup πPΠδp P, q Eπr Jθpx, y, x1, y1qs. First, consider a regularized optimization problem: inf θPΘ EP r Jθpx, yq δΥs. The extra regularization term δΥ allows us to approximate the W-DRO objective above. A similar approach has been studied in García Trillos and García Trillos (2022) in the context of Neural ODEs, and Sinha et al. (2018) considered Wasserstein distance penalization and used duality. Training is done by replacing P with p P. Note that pΥ is a statistics over the whole data set. In order to implement stochastic gradient descent method, an asynchronous update of the parameters is needed. We consider } } } }s and, by a direct calculation, obtain θ pΥ pΥ1 q E p P rx x θJθpx, yq, sgnp x Jθpx, yqqy} x Jθpx, yq}q 1 s s. We calculate the term pΥ1 q from the whole training dataset using parameter θ from previous epoch; we estimate E p P rx x θJθpx, yq, sgnp x Jθpx, yqqy} x Jθpx, yq}q 1 s s on a mini-batch and update current parameter θ after each batch. At the end of an epoch, we update θ to θ. See Algorithm 2 for pseudocodes. Another classical approach consists in clean training the network but on the adversarial perturbed data. To this we employ the Wasserstein FGSM attack described in Section 4 and shift training data px, yq by px, yq ÞÑ projδ x δhp x Jθpx, yqq}pΥ 1 x Jθpx, yq}q 1 s , y . Similarly to the discussion above, an asynchronous update of parameters is applied, see Algorithm 3 for details. Empirical evaluation of the performance of Algorithms 2 and 3 is left for future research. Algorithm 2: Loss Regularization Input: Initial parameter θ0, hyperparameter δ, learning rate η; Data: Training set Dtr tpx, yq|px, yq Pu with size N; θ Ð θ0, θ Ð θ0; repeat px,yq PDtr } x Jθ px, yq}q s 1{q ; // Calculate Υ from θ Generate a mini-batch B with size |B|; // Calculate gradient θJθpx, yq θJθpx, yq 1 |B| ř px,yq PB θJθpx, yq; // Calculate gradient θΥ θΥ Υ1 q 1 px,yq PBx x θJθpx, yq, hp x Jθpx, yqqy} x Jθpx, yq}q 1 s ; // Update θ by stochastic gradient descent θ Ð θ ηp θJθpx, yq δ θΥq; until the end of epoch; θ Ð θ; until the end condition. Algorithm 3: Adversarial Data Perturbation Input: Initial parameter θ0, hyperparameter δ, learning rate η; Data: Training set Dtr tpx, yq|px, yq Pu with size N; θ Ð θ0, θ Ð θ0; repeat px,yq PDtr } x Jθ px, yq}q s 1{q ; // Calculate Υ from θ Generate a mini-batch B with size |B|; // Do W-FGSM attack on B px, yq Ð projδpx δhp x Jθpx, yqq}Υ 1 x Jθpx, yq}q 1 s q, y ; // Update θ by stochastic gradient descent θ Ð θ η 1 px,yq PB θJθpx, yq; until the end of epoch; θ Ð θ; until the end condition. 
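As a concrete illustration of the loss-regularization approach, the PyTorch-style sketch below performs one optimizer step on the surrogate objective E[Jθ(x, y)] + δΥ. For simplicity it estimates Υ on the mini-batch and differentiates it through double backpropagation, rather than using the asynchronous dataset-wide term of Algorithm 2; names are illustrative.

```python
import torch

def regularized_training_step(model, loss_fn, optimizer, x, y, delta, q=2.0, s=2.0):
    """One SGD step on the first-order W-DRO surrogate  E[J_theta] + delta * Upsilon.

    loss_fn is assumed to return per-image losses (reduction='none').
    Upsilon = (E ||grad_x J||_s^q)^{1/q} is estimated on the mini-batch and
    its theta-gradient is obtained by double backpropagation (create_graph=True).
    """
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y).mean()                  # empirical E[J_theta(x, y)]

    # gradient penalty: keep the graph so d(Upsilon)/d(theta) is available
    g = torch.autograd.grad(loss, x, create_graph=True)[0]
    g_norm = g.flatten(1).norm(p=s, dim=1)
    upsilon = g_norm.pow(q).mean().pow(1.0 / q)

    objective = loss + delta * upsilon
    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return loss.item(), upsilon.item()
```

For q = s = 1, i.e., the pointwise (W∞, l∞) threat, the penalty becomes δ times the mean l1 norm of the input gradient, recovering a standard input-gradient regularization.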
E Robust Performance Bounds

As pointed out in Section 7, with attack budget δ = 8/255 some neural networks have lower bounds Rl surpassing the reference value R obtained from AutoAttack. This is because, for most neural networks, δ = 8/255 lies outside the linear approximation region of the adversarial loss V(δ), so the first-order approximations in Theorem 4.1 underestimate V(δ). In Figure 4, we plot our proposed bounds Ru and Rl against R for the (W∞, l∞) threat model with δ = 8/255. In general, to compute more accurate lower bounds on the adversarial accuracy, as explained in Section 6, we can consider the first lower bound in (13). We thus introduce Rl(n) given by

Rl(n) = (W(0) − V(δ, n)) / (W(0) − V(0)),

where V(δ, n) is the approximated adversarial loss computed by a W-PGD-(n) attack. In Figures 5 and 6, we include plots for different bounds of R under W2 threat models which illustrate the changing performance of the lower bound in Theorem 5.1 as V(δ) is computed to an increasing accuracy. We achieve this by performing a W-PGD-(n) attack, where n = 5, 50. An n = 50 attack takes 10 times more computational time than the n = 5 attack, and the latter is 5 times more computationally costly than the one-step bound Rl. The plots thus illustrate a trade-off between computational time and accuracy of the proposed lower bound. For clarity, we also point out that Figure 5 has a different scaling from the other plots.

Figure 4: Ru & Rl versus R. Bounds are computed based on CE loss across neural networks on RobustBench under a (W∞, l∞) threat model with budget δ = 8/255. At this δ the linear approximation may be inefficient, as seen from the blue dots crossing the diagonal.

In practice, we would compare our one-step-attack bound Rl with the bound obtained by iterating the attack several steps: we propose W-PGD(5) in our paper and report Rl(5). If the difference between the two is small, it indicates that the linear approximation is working very well. If the difference is significant, we would use Rl(5) and possibly compare it against Rl(10). If we observe that V(δ) is convex, as for the CE loss under a (W∞, l∞) attack, the lower bound should decrease with additional steps. In this case, the one-step bound may in fact fail to be a lower bound for δ too large, as shown in Figure 4. If we observe that V(δ) is concave, as for the ReDLR loss and our W2 attacks, the one-step lower bound might be too low and will increase with additional PGD steps. This is visible in Figures 5 and 6.

Figure 5: Comparison of lower bounds computed from different W-PGD-(n) attacks, where n = 5, 50. Bounds are computed based on CE loss across neural networks on RobustBench under (W2, l∞) threat models with budget δ = 1/510, 1/255.

F Experiments on Other Datasets

We illustrate our theoretical results on the CIFAR-100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009) datasets. We analyze all the networks included on the RobustBench leaderboards. Only the l∞ norm on images is considered, as RobustBench does not provide any models for l2 on these datasets. Similar to CIFAR-10, CIFAR-100 is a low-resolution dataset but with 100 classes. Each class contains 500 training images and 100 testing images, and each image has 32 × 32 pixels in 3 color channels.
Figure 6: Comparison of lower bounds computed from different W-PGD-(n) attacks, where n = 5, 50. Bounds are computed based on the ReDLR loss across neural networks on RobustBench under (W_2, l_2) threat models with budgets δ = 1/32, 1/16.

For the ImageNet dataset, we use a subset of 5000 images selected by RobustBench. The image size may vary with the preprocessing; a typical size is 224 × 224 pixels in 3 color channels. The numerical results mirror those for CIFAR-10 in the main body.

Before discussing the numerical results, we remark that the computation time of our proposed methods does not suffer from the curse of dimensionality. Although computing Wasserstein distances in high dimensions can be costly in general, this is not an issue for our methods thanks to the explicit first-order sensitivity formulae: the approximation to the worst-case Wasserstein adversarial attack is explicit and only requires computing the gradient ∇_x J_θ(x, y). In consequence, the size of the image only affects the first layer of the DNN and the number of classes only affects the last layer; the computational time is thus mostly determined by the (hidden) architecture of the DNN under consideration.

In Figure 7, we compare the adversarial accuracy under pointwise and distributional threat models. For the W_∞ threat model, we apply a combination of PGD-50 attacks based on the CE and DLR losses to obtain the reference adversarial accuracy; for the W_2 threat model, we apply a 50-step W-PGD attack based on the ReDLR loss to obtain the reference WD-adversarial accuracy. As for CIFAR-10, we observe a drop in adversarial accuracy under distributional threat models on CIFAR-100 and ImageNet.

In Figure 8, we plot the bounds R_l and R_u against R on CIFAR-100 and ImageNet. We compute bounds under the W_∞ attack based on the CE loss across neural networks on RobustBench, and under the W_2 attack for the same sets of neural networks with bounds computed using the ReDLR loss, where R_l(5) denotes the lower bound (eq. (13)) computed from a 5-step W-PGD attack. For pointwise threat models, our upper and lower bounds sandwich the true (AutoAttack) accuracy ratio very well at a fraction of the computational cost. For distributional threat models, our bounds work but their tightness is model-dependent. We also display the improved bound obtained with 5 steps of the W-PGD attack, which leads to a marked improvement in the lower bound's performance. This is akin to Figure 5 in Appendix E.

G Comparison to Existing Attack Methods

In this section, we compare our proposed W-PGD attack with existing attack methods. We reiterate that our method focuses on distributional threats under a given budget δ. While the W-DRO formulation has been explored in numerous works, as discussed, we believe that none of them proposed an AA method specifically tailored to the distributional threat under a given attack budget. We illustrate the main difference between our sensitivity approach and the duality approach employed in Sinha et al. (2018), Volpi et al. (2018), Hua et al. (2022), etc. When using the duality of W-DRO, the optimization involves a minimization over the Lagrange multiplier λ. However, in the aforementioned works, a fixed λ is used. While their choice of λ may be optimal for some budget δ, there is no universal way to achieve an optimal attack for a given budget δ.
In contrast, our attack is an explicit first-order approximation designed to be optimal for a given budget δ. It only involves ∇_x J_θ(x, y) and its norms, which are fast to compute using standard methods.

Figure 7: Shortfall of WD-adversarial accuracy on CIFAR-100 (left) and ImageNet (right). Each panel plots the adversarial accuracy under the (W_2, l_∞) threat model against that under the (W_∞, l_∞) threat model with δ = 8/255.

Figure 8: R_u & R_l versus R on CIFAR-100 (top) and ImageNet (bottom). Panels correspond to the (W_∞, l_∞) threat model with δ = 2/255 and 4/255 and to the (W_2, l_∞) threat model with δ = 1/510 and 1/255.

Although there is no existing baseline for the distributional threat, it is possible to reverse-engineer a comparison setup ex post. We employ the CW attack, which gives the minimum distortion needed to successfully attack an image. Due to the computational cost of the CW attack, the experiment is conducted on the first 1000 correctly classified images from the ImageNet subset used by RobustBench. By taking the square root of the mean squared minimum distortion over this chosen dataset, we obtain a reference (W_2, l_∞) budget of 0.000711 for our distributional attacks. Employing our W-PGD(50)-ReDLR attack results in 934 of the 1000 images being successfully attacked. The CW attack, which by construction is successful on all 1000 images, took ca. 4.5 h to run (on a Tesla V100-SXM2-32GB-LS GPU), and its runtime scales linearly with the number of images since the attack has to be performed image by image. In contrast, our attack took ca. 50 min to run (on an NVIDIA A100-SXM4-40GB GPU). In addition, we use mini-batches, which allows for further speed-ups depending on the available GPU memory.

H Additional Numerical Results

In Tables 3 and 4, we report the clean accuracy and the adversarial accuracy under different threat models for all but 4 neural networks available in the RobustBench model zoo. All networks are named after their labels on RobustBench. We remove the Standard l_∞ and Standard l_2 networks because they are not robust to any attacks. In addition, we also remove Kang2021Stable, which contains neural ODE blocks, and Ding2020MMA, which has a large gap between the adversarial accuracies obtained from PGD and AutoAttack. In Tables 5, 6, 7 and 8, we include the complete list of R and its bounds used in Figure 3. We write ΔR_u = R_u − R and ΔR_l = R_l − R.

Table 3: Complete list of adversarial accuracies under (W_∞, l_2) and (W_2, l_2) threat models.
Columns: Network; Clean; then (W_∞, W_2) adversarial accuracy at δ = 1/8, δ = 1/4, δ = 1/2.
Augustin2020Adversarial  91.08  87.88  82.03  84.48  73.17  72.91  56.66
Augustin2020Adversarial_34_10  92.23  89.20  83.70  85.80  74.06  76.25  56.33
Augustin2020Adversarial_34_10_extra  93.96  91.63  84.86  88.32  75.04  78.79  59.73
Ding2020MMA  88.02  83.57  77.69  78.44  69.67  66.09  53.47
Engstrom2019Robustness  90.83  86.81  81.99  82.37  73.77  69.24  57.77
Gowal2020Uncovering  90.89  87.62  83.19  84.26  75.83  74.50  60.94
Gowal2020Uncovering_extra  94.73  92.64  88.08  89.52  79.74  80.53  60.18
Rade2021Helper_R18_ddpm  90.57  88.03  83.57  84.58  77.26  76.15  62.74
Rebuffi2021Fixing_28_10_cutmix_ddpm  91.79  89.26  86.09  86.43  81.06  78.80  67.18
Rebuffi2021Fixing_70_16_cutmix_ddpm  92.41  90.36  87.27  87.69  81.67  80.42  67.93
Rebuffi2021Fixing_70_16_cutmix_extra  95.74  93.78  89.73  91.18  81.59  82.32  63.43
Rebuffi2021Fixing_R18_cutmix_ddpm  90.33  87.48  84.24  84.48  78.38  75.86  63.85
Rice2020Overfitting  88.68  85.07  80.62  80.35  73.28  67.68  56.83
Rony2019Decoupling  89.04  84.74  78.47  79.25  69.39  66.44  51.15
Sehwag2021Proxy  90.93  88.42  85.50  85.66  80.50  77.24  67.96
Sehwag2021Proxy_R18  89.76  87.08  83.34  83.87  77.62  74.41  65.02
Wang2023Better_WRN-28-10  95.16  93.44  88.99  90.79  82.58  83.68  68.65
Wang2023Better_WRN-70-16  95.54  93.76  89.94  91.79  83.47  84.97  70.09
Wu2020Adversarial  88.51  85.32  81.33  82.05  75.31  73.66  62.21

Table 4: Complete list of adversarial accuracies under (W_∞, l_∞) and (W_2, l_∞) threat models.
Columns: Network; Clean; then (W_∞, W_2) adversarial accuracy at δ = 2/255, δ = 4/255, δ = 8/255.
Addepalli2021Towards_RN18  80.23  73.85  68.54  66.70  58.76  51.06  42.91
Addepalli2021Towards_WRN34  85.32  79.87  73.41  73.18  64.60  58.04  48.12
Addepalli2022Efficient_RN18  85.71  79.16  71.75  71.37  59.59  52.48  37.84
Addepalli2022Efficient_WRN_34_10  88.70  82.95  74.75  75.52  63.35  57.81  42.03
Andriushchenko2020Understanding  79.85  72.30  66.16  64.45  55.83  43.93  37.86
Carmon2019Unlabeled  89.69  84.59  77.02  77.87  66.59  59.53  46.32
Chen2020Adversarial  86.04  79.68  67.45  71.61  51.23  51.56  29.66
Chen2020Efficient  85.34  78.48  70.91  70.64  59.24  51.12  39.32
Chen2021LTD_WRN34_10  85.21  79.78  72.74  72.97  61.86  56.94  41.48
Chen2021LTD_WRN34_20  86.03  80.71  75.31  74.19  64.95  57.71  44.87
Cui2020Learnable_34_10  88.22  81.95  71.25  74.20  57.88  52.86  40.14
Cui2020Learnable_34_20  88.70  82.20  72.16  74.46  57.74  53.57  37.73
Dai2021Parameterizing  87.02  82.39  77.35  76.96  69.28  61.55  52.43
Debenedetti2022Light_XCiT-L12  91.73  86.59  78.09  79.42  66.77  57.58  44.90
Debenedetti2022Light_XCiT-M12  91.30  86.14  77.83  78.89  66.86  57.27  43.90
Debenedetti2022Light_XCiT-S12  90.06  84.91  77.11  77.50  65.35  56.14  44.28
Engstrom2019Robustness  87.03  80.34  74.30  72.22  63.43  49.25  41.85
Gowal2020Uncovering_28_10_extra  89.48  84.78  78.08  78.77  67.93  62.80  48.51
Gowal2020Uncovering_34_20  85.64  79.84  73.22  73.47  62.78  56.86  42.87
Gowal2020Uncovering_70_16  85.29  79.81  72.89  73.47  62.25  57.20  43.29
Gowal2020Uncovering_70_16_extra  91.10  86.86  79.81  81.06  70.01  65.88  50.55
Gowal2021Improving_28_10_ddpm_100m  87.50  83.37  78.46  78.30  70.50  63.44  54.95
Gowal2021Improving_70_16_ddpm_100m  88.74  84.76  80.48  80.08  73.27  66.11  59.34
Gowal2021Improving_R18_ddpm_100m  87.35  82.08  76.83  76.22  68.16  58.63  51.17
Hendrycks2019Using  87.11  81.36  74.48  74.21  63.45  54.92  44.19
Huang2020Self  83.48  77.59  69.33  70.55  58.46  53.34  38.96
Huang2021Exploring  90.56  85.77  77.78  79.50  67.52  61.56  47.30
Huang2021Exploring_ema  91.23  86.84  79.28  80.79  69.02  62.54  48.46
Huang2022Revisiting_WRN-A4  91.59  87.35  79.76  81.69  69.62  65.79  50.22
Jia2022LAS-AT_34_10  84.98  79.26  73.00  72.44  62.88  56.26  43.92
Jia2022LAS-AT_70_16  85.66  80.29  73.90  73.52  63.76  57.61  44.19
Pang2020Boosting  85.14  79.34  71.93  72.60  61.17  53.74  39.40
Pang2022Robustness_WRN28_10  88.61  83.73  78.46  77.16  69.51  61.04  51.67
Pang2022Robustness_WRN70_16  89.01  84.77  79.57  78.58  70.61  63.35  52.93
Rade2021Helper_ddpm  88.16  83.26  77.54  76.83  67.49  60.97  47.37
Rade2021Helper_extra  91.47  86.72  80.47  80.29  69.48  62.83  47.57
Rade2021Helper_R18_ddpm  86.86  81.12  75.67  74.39  64.92  57.09  43.99
Rade2021Helper_R18_extra  89.02  83.21  77.16  76.36  66.10  57.67  43.38
Rebuffi2021Fixing_106_16_cutmix_ddpm  88.50  84.32  78.88  78.83  70.59  64.64  52.31
Rebuffi2021Fixing_28_10_cutmix_ddpm  87.33  82.36  76.33  76.08  66.35  60.75  46.73
Rebuffi2021Fixing_70_16_cutmix_ddpm  88.54  84.31  78.64  78.79  68.95  64.25  49.75
Rebuffi2021Fixing_70_16_cutmix_extra  92.23  88.02  81.21  82.76  70.39  66.58  47.97
Rebuffi2021Fixing_R18_ddpm  83.53  77.99  71.46  71.68  61.83  56.66  44.26
Rice2020Overfitting  85.34  79.60  74.06  72.82  64.81  53.42  43.66
Sehwag2020Hydra  88.98  83.49  76.02  76.18  65.38  57.14  44.37
Sehwag2021Proxy  86.68  81.72  76.91  76.39  68.84  60.27  53.29
Sehwag2021Proxy_R18  84.59  79.25  73.87  72.74  64.73  55.54  46.58
Sehwag2021Proxy_ResNest152  87.21  82.55  78.08  77.30  70.78  62.79  56.67
Sitawarin2020Improving  86.84  80.21  74.16  72.47  63.54  50.72  43.25
Sridhar2021Robust  89.46  84.34  77.42  78.03  66.42  59.66  46.45
Sridhar2021Robust_34_15  86.53  81.45  73.95  75.63  64.18  60.41  45.77
Wang2020Improving  87.51  82.15  74.59  75.47  63.17  56.29  42.57
Wang2023Better_WRN-28-10  92.44  88.40  81.51  83.10  71.42  67.31  52.03
Wang2023Better_WRN-70-16  93.26  89.72  82.88  84.90  73.95  70.69  55.17
Wong2020Fast  83.34  75.77  69.60  67.23  58.38  43.21  36.96
Wu2020Adversarial  85.36  79.69  73.33  72.90  62.64  56.17  42.28
Wu2020Adversarial_extra  88.25  82.98  76.83  76.61  66.45  60.04  46.70
Zhang2019Theoretically  84.92  78.96  71.68  71.15  60.30  53.08  40.05
Zhang2019You  87.20  79.42  72.40  70.29  60.61  44.83  36.65
Zhang2020Attacks  84.52  78.48  71.10  71.32  59.45  53.51  40.76
Zhang2020Geometry  89.36  84.01  81.57  77.10  73.45  59.64  53.50

Table 5: Complete list of R and its bounds under the (W_∞, l_2) threat model based on the CE loss.
Columns: Network; then (R, ΔR_u, ΔR_l) at δ = 1/8 and δ = 1/4.
Augustin2020Adversarial  0.9649  0.0026  -0.0153  0.9275  0.0083  -0.0324
Augustin2020Adversarial_34_10  0.9669  0.0035  -0.0097  0.9303  0.0115  -0.0201
Augustin2020Adversarial_34_10_extra  0.9752  0.0031  -0.0150  0.9400  0.0084  -0.0239
Ding2020MMA  0.9496  0.0016  -0.0176  0.8912  0.0081  -0.0434
Engstrom2019Robustness  0.9557  0.0019  -0.0087  0.9069  0.0057  -0.0256
Gowal2020Uncovering  0.9640  0.0022  -0.0063  0.9271  0.0045  -0.0191
Gowal2020Uncovering_extra  0.9779  0.0013  -0.0121  0.9451  0.0050  -0.0209
Rade2021Helper_R18_ddpm  0.9720  0.0017  -0.0124  0.9339  0.0061  -0.0209
Rebuffi2021Fixing_28_10_cutmix_ddpm  0.9724  0.0034  -0.0107  0.9416  0.0060  -0.0232
Rebuffi2021Fixing_70_16_cutmix_ddpm  0.9778  0.0021  -0.0137  0.9489  0.0052  -0.0259
Rebuffi2021Fixing_70_16_cutmix_extra  0.9795  0.0018  -0.0091  0.9524  0.0040  -0.0184
Rebuffi2021Fixing_R18_cutmix_ddpm  0.9686  0.0032  -0.0150  0.9352  0.0061  -0.0339
Rice2020Overfitting  0.9593  0.0030  -0.0178  0.9061  0.0090  -0.0342
Rony2019Decoupling  0.9517  0.0015  -0.0144  0.8899  0.0077  -0.0316
Sehwag2021Proxy  0.9724  0.0027  -0.0099  0.9420  0.0065  -0.0235
Sehwag2021Proxy_R18  0.9703  0.0028  -0.0133  0.9344  0.0057  -0.0271
Wang2023Better_WRN-28-10  0.9819  0.0013  -0.0098  0.9541  0.0047  -0.0147
Wang2023Better_WRN-70-16  0.9814  0.0015  -0.0069  0.9609  0.0029  -0.0163
Wu2020Adversarial  0.9640  0.0024  -0.0088  0.9270  0.0051  -0.0221

Table 6: Complete list of R and its bounds under the (W_2, l_2) threat model based on the ReDLR loss.
Columns: Network; then (R, ΔR_u, ΔR_l) at δ = 1/32 and δ = 1/16.
Augustin2020Adversarial  0.9844  0.0058  -0.0358  0.9723  0.0131  -0.0505
Augustin2020Adversarial_34_10  0.9868  0.0026  -0.0518  0.9738  0.0094  -0.0743
Augustin2020Adversarial_34_10_extra  0.9879  0.0049  -0.0452  0.9756  0.0105  -0.0652
Ding2020MMA  0.9788  0.0087  -0.0694  0.9644  0.0148  -0.1079
Engstrom2019Robustness  0.9834  0.0036  -0.0798  0.9719  0.0085  -0.1250
Gowal2020Uncovering  0.9837  0.0048  -0.0581  0.9719  0.0110  -0.0918
Gowal2020Uncovering_extra  0.9915  0.0026  -0.0574  0.9855  0.0039  -0.0898
Rade2021Helper_R18_ddpm  0.9879  0.0055  -0.0530  0.9802  0.0075  -0.0871
Rebuffi2021Fixing_28_10_cutmix_ddpm  0.9902  0.0034  -0.0572  0.9826  0.0069  -0.0912
Rebuffi2021Fixing_70_16_cutmix_ddpm  0.9920  0.0023  -0.0610  0.9852  0.0055  -0.0966
Rebuffi2021Fixing_70_16_cutmix_extra  0.9922  0.0020  -0.0539  0.9851  0.0050  -0.0852
Rebuffi2021Fixing_R18_cutmix_ddpm  0.9894  0.0034  -0.0711  0.9791  0.0081  -0.1104
Rice2020Overfitting  0.9859  0.0046  -0.0825  0.9738  0.0104  -0.1263
Rony2019Decoupling  0.9829  0.0045  -0.0855  0.9672  0.0116  -0.1267
Sehwag2021Proxy  0.9918  0.0018  -0.0767  0.9824  0.0047  -0.1165
Sehwag2021Proxy_R18  0.9890  0.0039  -0.0558  0.9789  0.0097  -0.0868
Wang2023Better_WRN-28-10  0.9902  0.0023  -0.0492  0.9832  0.0054  -0.0775
Wang2023Better_WRN-70-16  0.9919  0.0020  -0.0494  0.9850  0.0043  -0.0780
Wu2020Adversarial  0.9867  0.0037  -0.0704  0.9759  0.0092  -0.1114

Table 7: Complete list of R and its bounds under the (W_∞, l_∞) threat model based on the CE loss.
Columns: Network; then (R, ΔR_u, ΔR_l) at δ = 2/255 and δ = 4/255.
Addepalli2021Towards_RN18  0.9205  0.0172  -0.0234  0.8314  0.0403  -0.0390
Addepalli2021Towards_WRN34  0.9361  0.0144  -0.0266  0.8577  0.0374  -0.0399
Addepalli2022Efficient_RN18  0.9236  0.0126  -0.0194  0.8327  0.0329  -0.0312
Addepalli2022Efficient_WRN_34_10  0.9352  0.0097  -0.0167  0.8514  0.0275  -0.0218
Andriushchenko2020Understanding  0.9054  0.0125  -0.0200  0.8071  0.0296  -0.0496
Carmon2019Unlabeled  0.9431  0.0056  -0.0122  0.8682  0.0236  -0.0144
Chen2020Adversarial  0.9261  0.0074  -0.0319  0.8323  0.0206  -0.0495
Chen2020Efficient  0.9199  0.0111  -0.0107  0.8273  0.0293  -0.0177
Chen2021LTD_WRN34_10  0.9363  0.0099  -0.0224  0.8564  0.0275  -0.0347
Chen2021LTD_WRN34_20  0.9382  0.0105  -0.0280  0.8624  0.0322  -0.0476
Cui2020Learnable_34_10  0.9290  0.0070  -0.0178  0.8410  0.0196  -0.0324
Cui2020Learnable_34_20  0.9267  0.0079  -0.0150  0.8395  0.0224  -0.0296
Dai2021Parameterizing  0.9468  0.0069  -0.0173  0.8844  0.0139  -0.0362
Debenedetti2022Light_XCiT-L12  0.9440  0.0041  -0.0154  0.8658  0.0162  -0.0233
Debenedetti2022Light_XCiT-M12  0.9435  0.0049  -0.0164  0.8641  0.0194  -0.0233
Debenedetti2022Light_XCiT-S12  0.9428  0.0048  -0.0239  0.8605  0.0198  -0.0357
Engstrom2019Robustness  0.9231  0.0101  -0.0203  0.8298  0.0263  -0.0406
Gowal2020Uncovering_28_10_extra  0.9476  0.0072  -0.0149  0.8803  0.0221  -0.0214
Gowal2020Uncovering_34_20  0.9323  0.0081  -0.0156  0.8579  0.0206  -0.0331
Gowal2020Uncovering_70_16  0.9357  0.0055  -0.0144  0.8614  0.0155  -0.0279
Gowal2020Uncovering_70_16_extra  0.9535  0.0050  -0.0129  0.8898  0.0196  -0.0137
Gowal2021Improving_28_10_ddpm_100m  0.9528  0.0073  -0.0192  0.8949  0.0177  -0.0347
Gowal2021Improving_70_16_ddpm_100m  0.9551  0.0061  -0.0147  0.9024  0.0169  -0.0265
Gowal2021Improving_R18_ddpm_100m  0.9397  0.0065  -0.0166  0.8726  0.0117  -0.0387
Hendrycks2019Using  0.9340  0.0057  -0.0250  0.8519  0.0215  -0.0447
Huang2020Self  0.9294  0.0075  -0.0139  0.8451  0.0270  -0.0201
Huang2021Exploring  0.9471  0.0051  -0.0091  0.8779  0.0227  -0.0091
Huang2021Exploring_ema  0.9519  0.0054  -0.0103  0.8856  0.0226  -0.0103
Huang2022Revisiting_WRN-A4  0.9537  0.0052  -0.0101  0.8919  0.0167  -0.0127
Jia2022LAS-AT_34_10  0.9327  0.0102  -0.0173  0.8524  0.0262  -0.0277
Jia2022LAS-AT_70_16  0.9372  0.0070  -0.0178  0.8585  0.0229  -0.0262
Pang2020Boosting  0.9319  0.0181  -0.0349  0.8526  0.0430  -0.0611
Pang2022Robustness_WRN28_10  0.9449  0.0095  -0.0199  0.8708  0.0229  -0.0325
Pang2022Robustness_WRN70_16  0.9524  0.0056  -0.0220  0.8828  0.0225  -0.0331
Rade2021Helper_ddpm  0.9444  0.0068  -0.0195  0.8715  0.0188  -0.0317
Rade2021Helper_extra  0.9481  0.0057  -0.0158  0.8778  0.0186  -0.0247
Rade2021Helper_R18_ddpm  0.9339  0.0098  -0.0203  0.8564  0.0220  -0.0407
Rade2021Helper_R18_extra  0.9347  0.0095  -0.0183  0.8578  0.0220  -0.0370
Rebuffi2021Fixing_106_16_cutmix_ddpm  0.9528  0.0063  -0.0196  0.8907  0.0226  -0.0295
Rebuffi2021Fixing_28_10_cutmix_ddpm  0.9432  0.0088  -0.0193  0.8713  0.0262  -0.0286
Rebuffi2021Fixing_70_16_cutmix_ddpm  0.9522  0.0071  -0.0204  0.8899  0.0208  -0.0311
Rebuffi2021Fixing_70_16_cutmix_extra  0.9544  0.0051  -0.0169  0.8973  0.0191  -0.0282
Rebuffi2021Fixing_R18_ddpm  0.9337  0.0078  -0.0140  0.8584  0.0242  -0.0270
Rice2020Overfitting  0.9327  0.0091  -0.0273  0.8532  0.0199  -0.0548
Sehwag2020Hydra  0.9383  0.0055  -0.0123  0.8561  0.0262  -0.0126
Sehwag2021Proxy  0.9428  0.0073  -0.0150  0.8813  0.0166  -0.0348
Sehwag2021Proxy_R18  0.9369  0.0091  -0.0209  0.8598  0.0240  -0.0364
Sehwag2021Proxy_ResNest152  0.9466  0.0062  -0.0138  0.8864  0.0151  -0.0308
Sitawarin2020Improving  0.9237  0.0084  -0.0190  0.8345  0.0243  -0.0428
Sridhar2021Robust  0.9428  0.0063  -0.0112  0.8722  0.0221  -0.0171
Sridhar2021Robust_34_15  0.9413  0.0067  -0.0065  0.8740  0.0208  -0.0089
Wang2020Improving  0.9387  0.0119  -0.0216  0.8624  0.0362  -0.0303
Wang2023Better_WRN-28-10  0.9563  0.0067  -0.0149  0.8990  0.0166  -0.0248
Wang2023Better_WRN-70-16  0.9620  0.0049  -0.0149  0.9104  0.0165  -0.0239
Wong2020Fast  0.9093  0.0128  -0.0282  0.8068  0.0301  -0.0608
Wu2020Adversarial  0.9336  0.0068  -0.0161  0.8540  0.0225  -0.0267
Wu2020Adversarial_extra  0.9403  0.0073  -0.0139  0.8681  0.0228  -0.0230
Zhang2019Theoretically  0.9298  0.0061  -0.0190  0.8381  0.0238  -0.0289
Zhang2019You  0.9107  0.0073  -0.0196  0.8061  0.0203  -0.0506
Zhang2020Attacks  0.9285  0.0082  -0.0167  0.8438  0.0257  -0.0267
Zhang2020Geometry  0.9402  0.0187  -0.0307  0.8629  0.0404  -0.0514

Table 8: Complete list of R and its bounds under the (W_2, l_∞) threat model based on the ReDLR loss.
Columns: Network; then (R, ΔR_u, ΔR_l) at δ = 1/510 and δ = 1/255.
Addepalli2021Towards_RN18  0.9762  0.0080  -0.1301  0.9533  0.0202  -0.1902
Addepalli2021Towards_WRN34  0.9774  0.0102  -0.0849  0.9590  0.0197  -0.1257
Addepalli2022Efficient_RN18  0.9740  0.0088  -0.1012  0.9534  0.0195  -0.1480
Addepalli2022Efficient_WRN_34_10  0.9749  0.0124  -0.0645  0.9563  0.0231  -0.0942
Andriushchenko2020Understanding  0.9679  0.0140  -0.0969  0.9438  0.0294  -0.1357
Carmon2019Unlabeled  0.9775  0.0079  -0.0803  0.9584  0.0178  -0.1162
Chen2020Adversarial  0.9639  0.0182  -0.0351  0.9371  0.0351  -0.0485
Chen2020Efficient  0.9713  0.0147  -0.0742  0.9495  0.0281  -0.1073
Chen2021LTD_WRN34_10  0.9758  0.0140  -0.0622  0.9578  0.0236  -0.0960
Chen2021LTD_WRN34_20  0.9780  0.0078  -0.1077  0.9635  0.0124  -0.1615
Cui2020Learnable_34_10  0.9694  0.0121  -0.0663  0.9462  0.0228  -0.0838
Cui2020Learnable_34_20  0.9710  0.0132  -0.0722  0.9492  0.0231  -0.0982
Dai2021Parameterizing  0.9814  0.0082  -0.0825  0.9654  0.0164  -0.1206
Debenedetti2022Light_XCiT-L12  0.9760  0.0076  -0.0730  0.9579  0.0181  -0.1063
Debenedetti2022Light_XCiT-M12  0.9781  0.0127  -0.0477  0.9574  0.0280  -0.0682
Debenedetti2022Light_XCiT-S12  0.9795  0.0078  -0.0911  0.9620  0.0161  -0.1370
Engstrom2019Robustness  0.9746  0.0086  -0.1117  0.9550  0.0174  -0.1600
Gowal2020Uncovering_28_10_extra  0.9791  0.0075  -0.0813  0.9623  0.0153  -0.1213
Gowal2020Uncovering_34_20  0.9748  0.0110  -0.0786  0.9546  0.0220  -0.1166
Gowal2020Uncovering_70_16  0.9722  0.0162  -0.0583  0.9552  0.0232  -0.0906
Gowal2020Uncovering_70_16_extra  0.9801  0.0070  -0.0786  0.9639  0.0136  -0.1145
Gowal2021Improving_28_10_ddpm_100m  0.9814  0.0070  -0.0852  0.9690  0.0119  -0.1281
Gowal2021Improving_70_16_ddpm_100m  0.9838  0.0072  -0.0778  0.9726  0.0114  -0.1150
Gowal2021Improving_R18_ddpm_100m  0.9797  0.0069  -0.1049  0.9626  0.0160  -0.1520
Hendrycks2019Using  0.9778  0.0090  -0.0963  0.9582  0.0183  -0.1398
Huang2020Self  0.9714  0.0117  -0.0764  0.9478  0.0252  -0.1082
Huang2021Exploring  0.9770  0.0109  -0.0556  0.9593  0.0208  -0.0803
Huang2021Exploring_ema  0.9792  0.0091  -0.0625  0.9623  0.0189  -0.0923
Huang2022Revisiting_WRN-A4  0.9810  0.0088  -0.0521  0.9646  0.0181  -0.0766
Jia2022LAS-AT_34_10  0.9762  0.0139  -0.0743  0.9575  0.0232  -0.1118
Jia2022LAS-AT_70_16  0.9751  0.0138  -0.0707  0.9583  0.0219  -0.1091
Pang2020Boosting  0.9745  0.0093  -0.0244  0.9528  0.0195  -0.0330
Pang2022Robustness_WRN28_10  0.9823  0.0055  -0.1311  0.9707  0.0095  -0.1988
Pang2022Robustness_WRN70_16  0.9836  0.0078  -0.0882  0.9711  0.0139  -0.1378
Rade2021Helper_ddpm  0.9808  0.0068  -0.0936  0.9672  0.0124  -0.1430
Rade2021Helper_extra  0.9809  0.0066  -0.0846  0.9674  0.0130  -0.1298
Rade2021Helper_R18_ddpm  0.9778  0.0056  -0.1266  0.9627  0.0098  -0.1904
Rade2021Helper_R18_extra  0.9784  0.0127  -0.0720  0.9587  0.0244  -0.1073
Rebuffi2021Fixing_106_16_cutmix_ddpm  0.9829  0.0046  -0.0787  0.9681  0.0130  -0.1163
Rebuffi2021Fixing_28_10_cutmix_ddpm  0.9816  0.0050  -0.1001  0.9651  0.0115  -0.1436
Rebuffi2021Fixing_70_16_cutmix_ddpm  0.9824  0.0072  -0.0694  0.9680  0.0141  -0.1035
Rebuffi2021Fixing_70_16_cutmix_extra  0.9809  0.0056  -0.0718  0.9661  0.0126  -0.1058
Rebuffi2021Fixing_R18_ddpm  0.9758  0.0104  -0.0855  0.9564  0.0194  -0.1276
Rice2020Overfitting  0.9773  0.0082  -0.1061  0.9593  0.0159  -0.1547
Sehwag2020Hydra  0.9758  0.0096  -0.0805  0.9577  0.0192  -0.1201
Sehwag2021Proxy  0.9818  0.0073  -0.0831  0.9657  0.0162  -0.1202
Sehwag2021Proxy_R18  0.9795  0.0071  -0.0964  0.9609  0.0154  -0.1370
Sehwag2021Proxy_ResNest152  0.9805  0.0144  -0.0361  0.9667  0.0243  -0.0573
Sitawarin2020Improving  0.9747  0.0088  -0.1054  0.9511  0.0204  -0.1505
Sridhar2021Robust  0.9766  0.0087  -0.0832  0.9594  0.0160  -0.1240
Sridhar2021Robust_34_15  0.9749  0.0105  -0.0602  0.9537  0.0221  -0.0858
Wang2020Improving  0.9738  0.0195  -0.0134  0.9552  0.0341  -0.0185
Wang2023Better_WRN-28-10  0.9840  0.0054  -0.0767  0.9692  0.0129  -0.1120
Wang2023Better_WRN-70-16  0.9835  0.0049  -0.0747  0.9727  0.0074  -0.1121
Wong2020Fast  0.9674  0.0163  -0.0901  0.9448  0.0268  -0.1281
Wu2020Adversarial  0.9776  0.0089  -0.0837  0.9577  0.0221  -0.1215
Wu2020Adversarial_extra  0.9778  0.0077  -0.0963  0.9588  0.0158  -0.1419
Zhang2019Theoretically  0.9774  0.0132  -0.0670  0.9565  0.0277  -0.1003
Zhang2019You  0.9720  0.0135  -0.1073  0.9502  0.0247  -0.1580
Zhang2020Attacks  0.9724  0.0208  -0.0225  0.9504  0.0386  -0.0312
Zhang2020Geometry  0.9875  0.0035  -0.1267  0.9777  0.0068  -0.1838