A PAC-Bayes Analysis of Adversarial Robustness

Paul Viallard1, Guillaume Vidot23, Amaury Habrard1, Emilie Morvant1
1 Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Étienne, France
2 Airbus Opérations S.A.S.
3 University of Toulouse, Institut de Recherche en Informatique de Toulouse, France
Paul Viallard and Guillaume Vidot contributed equally to this work.

Abstract

We propose the first general PAC-Bayesian generalization bounds for adversarial robustness, which estimate, at test time, how invariant a model will be to imperceptible perturbations of its input. Instead of deriving a worst-case analysis of the risk of a hypothesis over all possible perturbations, we leverage the PAC-Bayesian framework to bound the risk averaged over the perturbations for majority votes (over the whole class of hypotheses). Our theoretically founded analysis has the advantage of providing general bounds (i) that are valid for any kind of attack (i.e., adversarial attacks), (ii) that are tight thanks to the PAC-Bayesian framework, and (iii) that can be directly minimized during the learning phase to obtain a model robust to different attacks at test time.

1 Introduction

While machine learning algorithms are able to solve a huge variety of tasks, Szegedy et al. [2014] pointed out a crucial weakness: the possibility to generate samples similar to the originals (i.e., with no change, or an insignificant one, recognizable by the human eye) but with a different outcome from the algorithm. This phenomenon, known as adversarial examples, contributes to the impossibility of ensuring the safety of machine learning algorithms for safety-critical applications such as aeronautic functions (e.g., vision-based navigation), autonomous driving, or medical diagnosis (see, e.g., Huang et al. [2020]). Adversarial robustness is thus a critical issue in machine learning that studies the ability of a model to be robust or invariant to perturbations of its input. A perturbed input that fools the model is usually called an adversarial example. In other words, an adversarial example can be defined as an example that has been modified by an imperceptible noise (or by a noise that does not exceed a threshold) but that leads to a misclassification. One line of research is referred to as adversarial robustness verification [e.g., Gehr et al., 2018, Huang et al., 2017, Singh et al., 2019, Tsuzuku et al., 2018], where the objective is to formally check that the neighborhood of each sample does not contain any adversarial example. This kind of method comes with some limitations such as scalability or over-approximation [Gehr et al., 2018, Katz et al., 2017, Singh et al., 2019]. In this paper we stand in another setting called adversarial attack/defense [e.g., Papernot et al., 2016, Goodfellow et al., 2015, Madry et al., 2018, Carlini and Wagner, 2017, Zantedeschi et al., 2017, Kurakin et al., 2017]. An adversarial attack consists in finding perturbed examples that defeat machine learning algorithms, while adversarial defense techniques enhance their adversarial robustness to make the attacks useless. While a lot of methods exist, adversarial robustness suffers from a lack of general theoretical understanding (see Section 2.2). To tackle this issue, we propose in this paper to formulate adversarial robustness through the lens of a well-founded statistical machine learning theory called PAC-Bayes, introduced by Shawe-Taylor and Williamson [1997] and McAllester [1998].
This theory has the advantage of providing tight generalization bounds in average over the set of hypotheses considered (leading to bounds for a weighted majority vote over this set), in contrast to other theories such as VC-dimension or Rademacher-based approaches that give a worst-case analysis, i.e., valid uniformly for all the hypotheses. We start by defining our setting, called adversarially robust PAC-Bayes. The idea consists in considering an averaged adversarial robustness risk, which corresponds to the probability that the model misclassifies a perturbed example (this can be seen as a risk averaged over the perturbations). This measure can be too optimistic and not informative enough, since for each example we sample only one perturbation. Thus we also define an averaged-max adversarial risk as the probability that there exists at least one perturbation (taken in a set of sampled perturbations) that leads to a misclassification. These definitions, based on averaged quantities, have the advantage (i) of remaining suitable for the PAC-Bayesian framework and majority vote classifiers and (ii) of being related to the classical adversarial robustness risk. Then, for each of our adversarial risks, we derive a PAC-Bayesian generalization bound that is valid for any kind of attack. From an algorithmic point of view, these bounds can be directly minimized in order to learn a majority vote that is robust on average to attacks. We empirically illustrate that our framework is able to provide generalization guarantees with non-vacuous bounds for the adversarial risk while ensuring efficient protection against adversarial attacks.

Organization of the paper. Section 2 recalls basics on usual adversarial robustness. We state our new adversarial robustness PAC-Bayesian setting along with our theoretical results in Section 3, and we empirically show its soundness in Section 4. All the proofs of the results are deferred to the Appendix.

2 Basics on adversarial robustness

2.1 General setting

We tackle binary classification tasks with the input space X = R^d and the output/label space Y = {-1, +1}. We assume that D is a fixed but unknown distribution on X×Y. An example is denoted by (x, y) ∈ X×Y. Let S = {(x_i, y_i)}_{i=1}^m be the learning sample consisting of m examples drawn i.i.d. from D; we denote the distribution of such an m-sample by D^m. Let H be a set of real-valued functions from X to [-1, +1] called voters or hypotheses. Usually, given a learning sample S ∼ D^m, a learner aims at finding the best hypothesis h from H that commits as few errors as possible on unseen data from D. One wants to find h ∈ H that minimizes the true risk R_D(h) on D, defined as

    R_D(h) = E_{(x,y)∼D} ℓ(h, (x, y)),   (1)

where ℓ : H × X × Y → R_+ is the loss function. In practice, since D is unknown, we cannot compute R_D(h); we usually deal with the empirical risk R_S(h) estimated on S and defined as

    R_S(h) = (1/m) Σ_{i=1}^m ℓ(h, (x_i, y_i)).

From a classic ideal machine learning standpoint, we are able to learn a well-performing classifier with strong guarantees on unseen data, and even to measure how much the model will be able to generalize on D (e.g., with generalization bounds). However, in real-life applications at classification time, an imperceptible perturbation of the input (e.g., due to a malicious attack or a noise) can have a bad influence on the classification performance on unseen data [Szegedy et al., 2014]: the usual guarantees do not stand anymore.
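To make these definitions concrete, here is a minimal illustration (not the authors' code) of the empirical risk R_S(h), instantiated with a toy linear voter and the 0-1 loss; the data and the hypothesis are synthetic assumptions.

```python
# Minimal illustration (not the authors' code) of the empirical risk R_S(h),
# here instantiated with a toy linear voter and the 0-1 loss I(h(x) != y).
import numpy as np

def empirical_risk(h, X, y):
    """R_S(h) = (1/m) * sum_i l(h, (x_i, y_i)) with the 0-1 loss."""
    predictions = np.sign(h(X))          # h is real-valued; its sign is the predicted label
    return float(np.mean(predictions != y))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # m = 100 examples in R^5
y = rng.choice([-1, 1], size=100)        # labels in {-1, +1}
w = rng.normal(size=5)
print(empirical_risk(lambda inputs: inputs @ w, X, y))
```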
Such an imperceptible perturbation can be modeled by a (relatively small) noise added to the input. Let b > 0 and let ‖·‖ be an arbitrary norm (the most commonly used being the ℓ_1, ℓ_2 and ℓ_∞ norms); the set of possible noises B is defined by

    B = { ε ∈ R^d | ‖ε‖ ≤ b }.

The learner aims to find an adversarially robust classifier, i.e., a classifier that is robust on average over (x, y) ∼ D to the noises in B. More formally, one wants to minimize the adversarial robust true risk R^ROB_D(h) defined as

    R^ROB_D(h) = E_{(x,y)∼D} max_{ε∈B} ℓ(h, (x+ε, y)).   (2)

As in the classic setting, since D is unknown, R^ROB_D(h) cannot be directly computed, and one usually deals with the empirical adversarial risk

    R^ROB_S(h) = (1/m) Σ_{i=1}^m max_{ε∈B} ℓ(h, (x_i+ε, y_i)).

That being said, a learned classifier h should be robust to adversarial attacks that aim at finding an adversarial example x+ε_{(x,y)} to fool h for a given example (x, y), where ε_{(x,y)} is defined as

    ε_{(x,y)} ∈ argmax_{ε∈B} ℓ(h, (x+ε, y)).   (3)

In consequence, adversarial defense mechanisms often rely on adversarial attacks by replacing the original examples with adversarial ones during the learning phase; this procedure is called adversarial training. Even if other defenses exist, adversarial training appears to be one of the most efficient defense mechanisms [Ren et al., 2020].

2.2 Related works

Adversarial Attacks/Defenses. Numerous methods2 exist to solve or approximate the optimization of Equation (3). Among them, the Fast Gradient Sign Method (FGSM, Goodfellow et al. [2015]) is an attack consisting in generating a noise in the direction of the gradient of the loss function with respect to the input x. Kurakin et al. [2017] introduced IFGSM, an iterative version of FGSM: at each iteration, one repeats FGSM and adds to x a noise that is the sign of the gradient of the loss with respect to x. Following the same principle as IFGSM, Madry et al. [2018] proposed a method based on Projected Gradient Descent (PGD) that includes a random initialization of x before the optimization. Another technique, known as the Carlini and Wagner attack [Carlini and Wagner, 2017], aims at finding adversarial examples x+ε_{(x,y)} that are as close as possible to the original x, i.e., the attack should be as imperceptible as possible. However, producing such an imperceptible perturbation leads to a high running time in practice. Contrary to the most popular techniques that look for a model with a low adversarial robust risk (Equation (2)), our work stands in another line of research where the idea is to relax this worst-case risk measure by considering an adversarial robust risk averaged over the noises instead of a max-based formulation [see, e.g., Zantedeschi et al., 2017, Hendrycks and Dietterich, 2019]. Our averaged formulation is introduced in Section 3.

2 The reader can refer to Ren et al. [2020] for a survey on adversarial attacks and defenses.
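To illustrate the attacks recalled above, here is a minimal PyTorch sketch of an iterative-FGSM/PGD-style ℓ_∞ attack: an optional random start (as in PGD), k gradient-sign steps of size b/k, and a projection onto the noise set B. The toy model and the margin-based loss are assumptions made only for this example, not the setups of the cited papers.

```python
# A minimal l_inf IFGSM/PGD-style attack sketch (illustrative only): k gradient-sign
# steps of size b/k, an optional random start, and a projection onto the noise set B.
import torch

def pgd_linf(model, x, y, b=0.1, k=20, random_init=True):
    step = b / k
    delta = torch.empty_like(x).uniform_(-b, b) if random_init else torch.zeros_like(x)
    for _ in range(k):
        delta.requires_grad_(True)
        scores = model(x + delta)                        # real-valued outputs in [-1, +1]
        loss = 0.5 * (1.0 - y * scores).mean()           # a simple margin-based (linear) loss
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = (delta + step * grad.sign()).clamp(-b, b)   # FGSM step + projection onto B
    return (x + delta).detach()

torch.manual_seed(0)
w = torch.randn(10)
model = lambda inputs: torch.tanh(inputs @ w)            # toy voter with outputs in [-1, +1]
x, y = torch.randn(5, 10), torch.tensor([1.0, -1.0, 1.0, 1.0, -1.0])
x_adv = pgd_linf(model, x, y)
```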
Generalization Bounds. Recently, a few generalization bounds for adversarial robustness have been introduced [e.g., Khim and Loh, 2018, Yin et al., 2019, Montasser et al., 2019, 2020, Cohen et al., 2019, Salman et al., 2019]. The results of Khim and Loh and of Yin et al. are Rademacher complexity-based bounds. The former makes use of a surrogate of the adversarial risk; the latter provides bounds in the specific case of neural networks and linear classifiers, and involves an unavoidable polynomial dependence on the dimension of the input. Montasser et al. study robust PAC-learning for PAC-learnable classes with finite VC-dimension for unweighted majority votes that have been robustified with a boosting algorithm. However, their algorithm requires considering all possible adversarial perturbations for each example, which is intractable in practice, and their bound also suffers from a large constant, as indicated at the end of the proof of Montasser et al. [2019, Theorem 3.1]. Cohen et al. provide bounds that estimate the minimum noise needed to obtain an adversarial example (in the case of perturbations expressed as Gaussian noise), while our results give the probability of being fooled by an adversarial example. Salman et al. leverage Cohen et al.'s method and adversarial training in order to get tighter bounds. Moreover, Farnia et al. [2019] present margin-based bounds on the adversarial robust risk for specific neural networks and attacks (such as FGSM or PGD). While they make use of a classical PAC-Bayes bound, their result is not a PAC-Bayesian analysis and stands in the family of uniform-convergence bounds [see Nagarajan and Kolter, 2019, Ap. J for details]. In this paper, we provide PAC-Bayes bounds for general models expressed as majority votes; their bounds are thus not directly comparable to ours.

3 Adversarially robust PAC-Bayes

Although a few theoretical results exist, the majority of works come either without theoretical guarantees or with very specific theoretical justifications. In the following, we aim at giving a different point of view on adversarial robustness based on the so-called PAC-Bayesian framework. By leveraging this framework, we derive a general generalization bound for adversarial robustness based on an averaged notion of risk that allows us to learn robust models at test time. We introduce below our new setting, referred to as adversarially robust PAC-Bayes.

3.1 Adversarially robust majority vote

The PAC-Bayesian framework provides practical and theoretical tools to analyze majority vote classifiers. Given the voters set H and a learning sample S as defined in Section 2, our goal is no longer to learn one classifier from H but to learn a well-performing weighted combination of the voters of H, the weights being modeled by a distribution Q on H. This distribution, called the posterior distribution, is learned from S given a prior distribution P on H. The learned weighted combination is called the Q-weighted majority vote and is defined by

    ∀x ∈ X,  H_Q(x) = sign( E_{h∼Q} h(x) ).   (4)

In the rest of the paper, we consider the 0-1 loss function classically used for majority votes in PAC-Bayes, defined as ℓ(h, (x, y)) = I(h(x) ≠ y), with I(a) = 1 if a is true and 0 otherwise. In this context, the adversarial perturbation related to Equation (3) becomes

    ε_{(x,y)} ∈ argmax_{ε∈B} I(H_Q(x+ε) ≠ y).   (5)

Optimizing this problem is intractable due to the non-convexity of H_Q induced by the sign function. Note that the adversarial attacks of the literature (such as PGD or IFGSM) aim at finding the optimal perturbation ε_{(x,y)}, but in practice one considers an approximation of this perturbation. Hence, instead of searching for the noise that maximizes the chance of fooling the algorithm, we propose to model the perturbation according to an example-dependent distribution. First, let ω_{(x,y)} be a distribution on the set of possible noises B that depends on an example (x, y) ∈ X×Y. Then we denote by D̃ the distribution on (X×Y)×B defined as D̃((x, y), ε) = D(x, y) ω_{(x,y)}(ε), which further permits to generate perturbed examples.
To estimate our risks (defined below) for a given example (x_i, y_i) ∼ D, we consider a set of n perturbations sampled from ω_{(x_i,y_i)}, denoted by E_i = {ε_i^j}_{j=1}^n. Then we consider as a learning set the (m×n)-sample S = {((x_i, y_i), E_i)}_{i=1}^m ∈ (X×Y×B^n)^m. In other words, each ((x_i, y_i), E_i) ∈ S is sampled from a distribution, denoted by D̃_n, such that D̃_n((x_i, y_i), E_i) = D(x_i, y_i) ∏_{j=1}^n ω_{(x_i,y_i)}(ε_i^j). Then, inspired by the works of Zantedeschi et al. [2017] and Hendrycks and Dietterich [2019], we define our averaged adversarial risk as follows.

Definition 1 (Averaged Adversarial Risk). For any distribution D̃ on (X×Y)×B, for any distribution Q on H, the averaged adversarial risk of H_Q is defined as

    R_D̃(H_Q) = Pr_{((x,y),ε)∼D̃} ( H_Q(x+ε) ≠ y ) = E_{((x,y),ε)∼D̃} I( H_Q(x+ε) ≠ y ).

The empirical averaged adversarial risk computed on an (m×n)-sample S = {((x_i, y_i), E_i)}_{i=1}^m is

    R_S(H_Q) = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I( H_Q(x_i+ε_i^j) ≠ y_i ).

As we will show in Proposition 3, the risk R_D̃(H_Q) can be considered optimistic with respect to ε_{(x,y)} of Equation (5). Indeed, instead of taking the perturbation that maximizes the loss, a single perturbation ε is drawn from a distribution. Hence, it can lead to a non-informative risk regarding the occurrence of adversarial examples. To overcome this, we propose an extension that we refer to as the averaged-max adversarial risk.

Definition 2 (Averaged-Max Adversarial Risk). For any distribution D̃ on (X×Y)×B, for any distribution Q on H, the averaged-max adversarial risk of H_Q is defined as

    A_D̃n(H_Q) = Pr_{((x,y),E)∼D̃_n} ( ∃ε ∈ E,  H_Q(x+ε) ≠ y ).

The empirical averaged-max adversarial risk computed on an (m×n)-sample S = {((x_i, y_i), E_i)}_{i=1}^m is

    A_S(H_Q) = (1/m) Σ_{i=1}^m max_{ε∈E_i} I( H_Q(x_i+ε) ≠ y_i ).

For an example (x, y) ∼ D, instead of checking whether one perturbed example x+ε is adversarial, we sample n perturbed examples x+ε^1, ..., x+ε^n and check whether at least one of them is adversarial.

3.2 Relations between the adversarial risks

Proposition 3 below shows the intrinsic relationships between the classical adversarial risk R^ROB_D(H_Q) and our two relaxations R_D̃(H_Q) and A_D̃n(H_Q). In particular, Proposition 3 shows that the larger n, the number of perturbed examples, the higher the chance to get an adversarial example and hence to be close to the adversarial risk R^ROB_D(H_Q).

Proposition 3. For any distribution D̃ on (X×Y)×B, for any distribution Q on H, for any (n, n′) ∈ N^2 with n ≥ n′ ≥ 1, we have

    R_D̃(H_Q) ≤ A_D̃n′(H_Q) ≤ A_D̃n(H_Q) ≤ R^ROB_D(H_Q).   (6)

The left-hand side of Equation (6) confirms that the averaged adversarial risk R_D̃(H_Q) is optimistic with respect to the classical R^ROB_D(H_Q). Proposition 4 estimates how close R_D̃(H_Q) can be to R^ROB_D(H_Q).

Proposition 4. For any distribution D̃ on (X×Y)×B, for any distribution Q on H, we have

    R^ROB_D(H_Q) − TV(ρ ‖ τ) ≤ R_D̃(H_Q),

where ρ and τ are distributions on X×Y, and ρ(x′, y′), respectively τ(x′, y′), corresponds to the probability of drawing a perturbed example (x+ε, y) with ((x, y), ε) ∼ D̃, respectively an adversarial example (x+ε_{(x,y)}, y) with (x, y) ∼ D. We have

    ρ(x′, y′) = Pr_{((x,y),ε)∼D̃} [ x+ε = x′, y = y′ ],  and  τ(x′, y′) = Pr_{(x,y)∼D} [ x+ε_{(x,y)} = x′, y = y′ ],   (7)

and TV(ρ ‖ τ) = (1/2) E_{(x′,y′)∼τ} | ρ(x′,y′)/τ(x′,y′) − 1 | is the Total Variation (TV) distance between ρ and τ.

Note that ε_{(x,y)} depends on Q, and hence τ depends on Q. From Equation (7), R^ROB_D(H_Q) and R_D̃(H_Q) can be rewritten (see Lemmas 8 and 9 in Appendix B) with τ and ρ respectively as

    R_D̃(H_Q) = Pr_{(x′,y′)∼ρ} [ H_Q(x′) ≠ y′ ],  and  R^ROB_D(H_Q) = Pr_{(x′,y′)∼τ} [ H_Q(x′) ≠ y′ ].

Finally, Propositions 3 and 4 relate the averaged adversarial risk R_D̃(H_Q) to the standard adversarial risk R^ROB_D(H_Q). Indeed, by merging the two propositions we obtain

    R^ROB_D(H_Q) − TV(ρ ‖ τ) ≤ R_D̃(H_Q) ≤ A_D̃n(H_Q) ≤ R^ROB_D(H_Q).   (8)

Hence, the smaller the TV distance TV(ρ ‖ τ), the closer the averaged adversarial risk R_D̃(H_Q) is to R^ROB_D(H_Q), and the more probable it is that an example ((x, y), ε) sampled from D̃ is adversarial, i.e., our averaged adversarial example looks like a specific adversarial example. Moreover, Equation (8) justifies that the PAC-Bayesian point of view makes sense for adversarial learning with theoretical guarantees: the PAC-Bayesian guarantees we derive in the next section for our adversarial risks also give some guarantees on the standard risk R^ROB_D(H_Q).
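As an illustration of Definitions 1 and 2, the following NumPy sketch estimates the empirical risks R_S(H_Q) and A_S(H_Q) of a Q-weighted majority vote from sampled perturbation sets E_i; the voters, the weights Q, and the choice of ω_{(x,y)} (uniform on an ℓ_∞ ball) are toy assumptions, not the paper's experimental setup.

```python
# NumPy sketch (illustrative assumptions only) of the empirical averaged adversarial
# risk R_S(H_Q) of Definition 1 and the averaged-max risk A_S(H_Q) of Definition 2
# for the majority vote H_Q(x) = sign(E_{h~Q} h(x)) of Equation (4).
import numpy as np

rng = np.random.default_rng(0)
d, m, n, b = 5, 200, 10, 0.1
W = rng.normal(size=(3, d))                   # three toy voters h_k(x) = tanh(w_k . x)
Q = np.array([0.5, 0.3, 0.2])                 # posterior weights on the voters

def majority_vote(X):
    return np.sign(np.tanh(X @ W.T) @ Q)      # sign of the Q-weighted average of the voters

X = rng.normal(size=(m, d))
y = np.sign(rng.normal(size=m))
E = rng.uniform(-b, b, size=(m, n, d))        # E_i: n perturbations per example, drawn from a
                                              # toy omega_{(x_i,y_i)} (uniform on the l_inf ball)

errors = np.stack([majority_vote(X + E[:, j]) != y for j in range(n)], axis=1)  # shape (m, n)
R_S = errors.mean()                           # Definition 1: error rate over the m*n perturbed examples
A_S = errors.any(axis=1).mean()               # Definition 2: at least one adversarial perturbation in E_i
print(R_S, A_S)
```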
3.3 PAC-Bayesian bounds on the adversarially robust majority vote

First of all, since the risks R_D̃(H_Q) and A_D̃n(H_Q) are not differentiable due to the indicator function, we propose to use a common surrogate in PAC-Bayes (known as the Gibbs risk): instead of considering the risk of the Q-weighted majority vote, we consider the expectation over Q of the individual risks of the voters of H. In our case, we define the surrogates with the linear loss (1/2)(1 − y·h(x)) as

    r_D̃(H_Q) = E_{((x,y),ε)∼D̃} E_{h∼Q} (1/2)(1 − y·h(x+ε)),  and  a_D̃n(H_Q) = E_{((x,y),E)∼D̃_n} E_{h∼Q} max_{ε∈E} (1/2)(1 − y·h(x+ε)).

The next theorem relates these surrogates to our risks, implying that a generalization bound for r_D̃(H_Q), resp. for a_D̃n(H_Q), leads to a generalization bound for R_D̃(H_Q), resp. A_D̃n(H_Q).

Theorem 5. For any distribution D̃ on (X×Y)×B and any distribution Q on H, for any n > 1, we have

    R_D̃(H_Q) ≤ 2 r_D̃(H_Q),  and  A_D̃n(H_Q) ≤ 2 a_D̃n(H_Q).

Theorem 6 below presents our PAC-Bayesian generalization bounds for r_D̃(H_Q). Before that, it is important to mention that the empirical counterpart of r_D̃(H_Q) is computed on S, which is composed of samples that are not independently distributed, meaning that the classical proof technique is not applicable. The trick here is to make use of a result of Ralaivola et al. [2010] that provides a chromatic PAC-Bayes bound, i.e., a bound that supports non-independent data.

Theorem 6. For any distribution D̃ on (X×Y)×B, for any set of voters H, for any prior P on H, for any n, with probability at least 1−δ over S, for all posteriors Q on H, we have

    kl( r_S(H_Q) ‖ r_D̃(H_Q) ) ≤ (1/m) [ KL(Q‖P) + ln((m+1)/δ) ],   (9)

and

    r_D̃(H_Q) ≤ r_S(H_Q) + √( (1/(2m)) [ KL(Q‖P) + ln((m+1)/δ) ] ),   (10)

where r_S(H_Q) = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n E_{h∼Q} (1/2)(1 − y_i·h(x_i+ε_i^j)), kl(a‖b) = a ln(a/b) + (1−a) ln((1−a)/(1−b)), and KL(Q‖P) = E_{h∼Q} ln(Q(h)/P(h)) is the KL-divergence between Q and P.

Surprisingly, this theorem states bounds that do not depend on the number of perturbed examples n but only on the number of original examples m. The reason is that the n perturbed examples are inter-dependent (see the proof in Appendix). Note that Equation (9) is expressed as a Seeger [2002]-type bound and is tighter but less interpretable than Equation (10), which is expressed as a McAllester [1998]-type bound; these bounds involve the usual trade-off between the empirical risk r_S(H_Q) and KL(Q‖P).
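For instance, given the empirical surrogate risk r_S(H_Q), the divergence KL(Q‖P), m and δ, the two bounds of Theorem 6 can be evaluated as follows. This is a sketch assuming the complexity term (KL(Q‖P)+ln((m+1)/δ))/m as stated above; the Seeger-style bound of Equation (9) is obtained by inverting the binary kl divergence with a bisection.

```python
# Sketch (under the constants stated in Theorem 6 above) of the two bound values,
# given the empirical surrogate risk r_S, KL(Q||P), m and delta; Equation (9) is
# inverted with a bisection on the binary kl divergence.
import math

def binary_kl(a, b):
    """kl(a||b) = a*ln(a/b) + (1-a)*ln((1-a)/(1-b))."""
    eps = 1e-12
    a = min(max(a, eps), 1.0 - eps)
    b = min(max(b, eps), 1.0 - eps)
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def theorem6_bounds(r_S, kl_QP, m, delta=0.05):
    rhs = (kl_QP + math.log((m + 1) / delta)) / m         # complexity term of Theorem 6
    mcallester = min(1.0, r_S + math.sqrt(rhs / 2.0))     # Equation (10)
    lo, hi = r_S, 1.0                                     # kl-inverse: sup{p >= r_S : kl(r_S||p) <= rhs}
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binary_kl(r_S, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo, mcallester                                 # (Seeger-style Eq. (9), McAllester-style Eq. (10))

print(theorem6_bounds(r_S=0.05, kl_QP=2.0, m=5000))
```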
We now state a generalization bound for a_D̃n(H_Q). Since this quantity involves a max term, we cannot use the same trick as for Theorem 6. To bypass this issue, we use the TV distance between two artificial distributions on E_i. Given ((x_i, y_i), E_i) ∈ S, let γ_i be an arbitrary distribution on E_i, and, given h ∈ H, let γ_i^h be the Dirac distribution on E_i such that γ_i^h(ε) = 1 if ε = argmax_{ε′∈E_i} (1 − y_i·h(x_i+ε′)) (i.e., if ε maximizes the linear loss), and 0 otherwise.

Theorem 7. For any distribution D̃ on (X×Y)×B, for any set of voters H, for any prior P on H, for any n, with probability at least 1−δ over S, for all posteriors Q on H, for all i ∈ {1, ..., m}, for all distributions γ_i on E_i independent of the voters h ∈ H, we have

    a_D̃n(H_Q) ≤ (1/m) Σ_{i=1}^m E_{h∼Q} max_{ε∈E_i} (1/2)(1 − y_i·h(x_i+ε)) + √( [ KL(Q‖P) + ln(2√m/δ) ] / (2m) ),   (11)

and

    a_D̃n(H_Q) ≤ a_S(H_Q) + (1/m) Σ_{i=1}^m E_{h∼Q} TV(γ_i^h ‖ γ_i) + √( [ KL(Q‖P) + ln(2√m/δ) ] / (2m) ),   (12)

where a_S(H_Q) = (1/m) Σ_{i=1}^m max_{ε∈E_i} E_{h∼Q} (1/2)(1 − y_i·h(x_i+ε)), and TV(γ_i^h ‖ γ_i) = (1/2) Σ_{ε∈E_i} | γ_i^h(ε) − γ_i(ε) | is the TV distance between γ_i^h and γ_i.

To minimize the true averaged-max risk a_D̃n(H_Q) with Equation (11), we have to minimize a trade-off between KL(Q‖P) (i.e., how close the posterior weights are to the prior ones) and the empirical risk (1/m) Σ_{i=1}^m E_{h∼Q} max_{ε∈E_i} (1/2)(1 − y_i·h(x_i+ε)). However, to compute this empirical risk, the loss of each voter on each perturbation has to be calculated, which can be time-consuming. With Equation (12), we propose an alternative that can be efficiently optimized using (1/m) Σ_{i=1}^m E_{h∼Q} TV(γ_i^h ‖ γ_i) and the empirical average-max risk a_S(H_Q). Intuitively, Equation (12) can be seen as a trade-off between the empirical risk, which reflects the robustness of the majority vote, and two penalization terms: the KL term and the TV term. The KL-divergence KL(Q‖P) controls how much the posterior Q can differ from the prior P, while the TV term E_{h∼Q} TV(γ_i^h ‖ γ_i) controls the diversity of the voters, i.e., the ability of the voters to be fooled by the same adversarial example. From an algorithmic point of view, an interesting behavior is that the bound of Equation (12) holds for all distributions γ_i on E_i. This suggests that, given (x_i, y_i), we want to find γ_i minimizing E_{h∼Q} TV(γ_i^h ‖ γ_i). Ideally, this term tends to 0 when γ_i is close3 to the γ_i^h and all voters have their loss maximized by the same perturbation ε ∈ E_i.

3 Note that, since γ_i^h is a Dirac distribution, we have E_{h∼Q} TV(γ_i^h ‖ γ_i) = E_{h∼Q} [ 1 − γ_i(ε^h) ], where ε^h = argmax_{ε∈E_i} (1 − y_i·h(x_i+ε)).

To learn a well-performing majority vote, one solution is to minimize the right-hand side of the bounds, meaning that we would like to find a good trade-off between a low empirical risk (r_S(H_Q) or a_S(H_Q)) and a low divergence KL(Q‖P) between the learned posterior weights and the prior ones.
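As an illustration of the TV term and of footnote 3, the sketch below computes E_{h∼Q} TV(γ_i^h ‖ γ_i) for one example from the matrix of linear losses of the voters on the sampled perturbations of E_i; choosing γ_i as the Dirac on the perturbation maximizing the Q-averaged loss is only one possible heuristic, not a choice imposed by Theorem 7.

```python
# Illustration of the TV penalty of Theorem 7 for a single example (x_i, y_i):
# losses[h, j] is the linear loss of voter h on the j-th perturbation of E_i, and,
# by footnote 3, E_{h~Q} TV(gamma_i^h || gamma_i) = E_{h~Q}[1 - gamma_i(eps^h)].
import numpy as np

def tv_penalty(losses, Q):
    eps_h = losses.argmax(axis=1)                 # argmax perturbation of each voter (its Dirac gamma_i^h)
    gamma_i = np.zeros(losses.shape[1])
    gamma_i[np.argmax(Q @ losses)] = 1.0          # heuristic gamma_i: Dirac on the Q-averaged argmax
    return float(Q @ (1.0 - gamma_i[eps_h]))      # E_{h~Q}[1 - gamma_i(eps^h)]

rng = np.random.default_rng(0)
losses = rng.uniform(size=(3, 10))                # 3 voters, n = 10 sampled perturbations
print(tv_penalty(losses, np.array([0.5, 0.3, 0.2])))
```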
4 Experimental evaluation on differentiable decision trees

In this section, we illustrate the soundness of our framework in the context of differentiable decision tree learning. First of all, we describe the learning procedure designed from our theoretical results.

4.1 From the bounds to an algorithm

We consider a finite voters set H consisting of differentiable decision trees [Kontschieder et al., 2016], where each h ∈ H is parametrized by a weight vector w^h. Inspired by Masegosa et al. [2020], we learn the decision trees of H and a data-dependent prior distribution P from a first learning set S′ (independent from S); this is a common approach in PAC-Bayes [Parrado-Hernández et al., 2012, Lever et al., 2013, Dziugaite and Roy, 2018, Dziugaite et al., 2021]. Then, the posterior distribution is learned from the second learning set S by minimizing the bounds. This means we need to minimize the risk and the KL-divergence term. Our two-step learning procedure is summarized in Algorithm 1.

Step 1. Starting from an initial prior P_0 and an initial set of voters H_0, where each voter h is parametrized by a weight vector w^h_0, the objective of this step is to construct the hypothesis set H and the prior distribution P given as input to Step 2 for minimizing the bound. To do so, at each epoch t of Step 1, we learn from S′ an intermediate prior P_t on an intermediate hypothesis set H_t consisting of voters h parametrized by the weights w^h_t; note that the optimization in Line 9 is done with respect to w_t = {w^h_t}_{h∈H_t}. At each iteration of the optimizer, from Lines 4 to 7, for each (x, y) of the current batch S̄′ of S′, we attack the majority vote H_{P_t} to obtain a perturbed example x+ε. Then, in Lines 8 and 9, we perform a forward pass in the majority vote with the perturbed examples and update the weights w_t and the prior P_t according to the linear loss. To sum up, from Lines 11 to 20, at the end of Step 1 the prior P and the hypothesis set H given to Step 2 are the ones associated with the best epoch t* ∈ {1, ..., T′}, i.e., the epoch minimizing R_S̃t(H_{P_t}), where S̃_t = {attack(x, y) | (x, y) ∈ S} is the perturbed set obtained by attacking the majority vote H_{P_t}.

Step 2. Starting from the prior P on H and the learning set S, we perform the same process as in Step 1, except that the considered objective function corresponds to the desired bound to optimize (Line 30, denoted B(·)). For the sake of readability, we defer to Appendix G the definition of B(·) for Equations (9) and (12). Note that the intermediate priors do not depend on S, since they are learned from S′: the bounds are then valid.

Algorithm 1: Average Adversarial Training with Guarantee
Require: S, S′: disjoint learning sets; T, T′: numbers of epochs; P_0: initial prior; H_0 (with weights w_0): initial hypothesis set; attack(·): the attack function; B(·): the objective function associated with a bound.

Step 1 (prior and voters set construction):
 1: for t from 1 to T′ do
 2:   P_t ← P_{t−1} and H_t ← H_{t−1} (w_t ← w_{t−1})
 3:   for all batches S̄′ of S′ do
 4:     for all (x, y) ∈ S̄′ do
 5:       (x+ε, y) ← attack(x, y)
 6:       S̄′ ← (S̄′ \ {(x, y)}) ∪ {(x+ε, y)}
 7:     end for
 8:     Update P_t with ∇_{P_t} R_S̄′(H_{P_t})
 9:     Update w_t with ∇_{w_t} R_S̄′(H_{P_t})
10:   end for
11:   S̃_t ← ∅
12:   for all (x, y) ∈ S do
13:     (x+ε, y) ← attack(x, y)
14:     S̃_t ← S̃_t ∪ {(x+ε, y)}
15:   end for
16:   t* ← argmin_{t′∈{1,...,t}} R_S̃t′(H_{P_t′})
17:   P ← P_{t*}
18:   H ← H_{t*}
19: end for
20: return (P, H)

Step 2 (bound minimization):
21: (P, H) ← output of Step 1
22: Q_0 ← P
23: for t from 1 to T do
24:   for all batches S̄ of S do
25:     Q_t ← Q_{t−1}
26:     for all (x, y) ∈ S̄ do
27:       (x+ε, y) ← attack(x, y)
28:       S̄ ← (S̄ \ {(x, y)}) ∪ {(x+ε, y)}
29:     end for
30:     Update Q_t with ∇_{Q_t} B_S̄(H_{Q_t})
31:   end for
32:   S̃_t ← ∅
33:   for all (x, y) ∈ S do
34:     (x+ε, y) ← attack(x, y)
35:     S̃_t ← S̃_t ∪ {(x+ε, y)}
36:   end for
37:   t* ← argmin_{t′∈{1,...,t}} B_S̃t′(H_{Q_t′})
38:   Q ← Q_{t*}
39: end for
40: return (Q, H)

4.2 Experiments4

4 The source code is available at https://github.com/paulviallard/NeurIPS21-PB-Robustness.

In this section, we empirically illustrate that our PAC-Bayesian framework for adversarial robustness is able to provide generalization guarantees with non-vacuous bounds for the adversarial risk.

Setting. We stand in a white-box setting, meaning that the attacker knows the voters set H, the prior distribution P, and the posterior one Q. We empirically study two attacks with the ℓ2-norm and the ℓ∞-norm: Projected Gradient Descent (PGD, Madry et al. [2018]) and the iterative version of FGSM (IFGSM, Kurakin et al. [2017]). We fix the number of iterations at k=20 and the step size at b/k for PGD and IFGSM (where b=1 for the ℓ2-norm and b=0.1 for the ℓ∞-norm). One specificity of our setting is that we deal with the perturbation distribution ω_{(x,y)}. We therefore propose PGDU and IFGSMU, two variants of PGD and IFGSM. To attack an example with PGDU or IFGSMU, we proceed with the following steps: (1) we attack the prior majority vote H_P with PGD or IFGSM and obtain a first perturbation ε^0; (2) we sample n uniform noises ξ^1, ..., ξ^n with entries between −10^−2 and +10^−2; (3) we set the i-th perturbation to ε^i = ε^0 + ξ^i. Note that, for PGDU and IFGSMU, after one attack we thus end up with n=100 perturbed examples. We set n=1 when these attacks are used as a defense mechanism in Algorithm 1: since the adversarial training is iterative, we do not need to sample numerous perturbations for each example; we sample a new perturbation each time the example is forwarded through the decision trees.
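The construction of the n perturbed examples used by PGDU and IFGSMU can be sketched as follows; this is a hedged illustration in which the base perturbation ε^0 is assumed to come from any PGD/IFGSM-style attack on the prior majority vote H_P.

```python
# Sketch of the PGDU/IFGSMU construction of the n perturbed examples (Section 4.2):
# a base perturbation eps0 (assumed to come from a PGD/IFGSM-style attack on the
# prior majority vote H_P) is shifted by n uniform noises in [-1e-2, +1e-2].
import torch

def perturbation_set(x, eps0, n=100, width=1e-2):
    xi = torch.empty(n, *x.shape).uniform_(-width, width)    # step (2): n uniform noises xi^1..xi^n
    return x.unsqueeze(0) + eps0.unsqueeze(0) + xi            # step (3): the i-th perturbation is eps0 + xi^i

torch.manual_seed(0)
x = torch.randn(10)                                           # toy example
eps0 = 0.1 * torch.sign(torch.randn(10))                      # toy l_inf base perturbation of magnitude b = 0.1
perturbed = perturbation_set(x, eps0)                         # shape (n, 10): n perturbed copies of x
```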
We also consider a naive defense, referred to as UNIF, that only adds a uniform noise such that the p-norm of the added noise is lower than b. We study the following scenarios of defense/attack. These scenarios correspond to all the pairs (Defense, Attack) belonging to {−, UNIF, PGD, IFGSM} × {−, PGD, IFGSM} for the baseline, and to {−, UNIF, PGDU, IFGSMU} × {−, PGDU, IFGSMU} for our setting, where "−" means that we do not defend, respectively that the attack returns the original example (note that replacing PGDU and IFGSMU by PGD and IFGSM, i.e., "Attack without U", is used to compute the classical adversarial risk R^ROB(·)).

Datasets and algorithm description. We perform our experiments on six binary classification tasks from MNIST [LeCun et al., 1998] (1vs7, 4vs9, 5vs6) and Fashion-MNIST [Xiao et al., 2017] (Coat vs Shirt, Sandal vs Ankle Boot, Top vs Pullover). We decompose the learning set into two disjoint subsets: S′ of around 7,000 examples (to learn the prior and the voters) and S of exactly 5,000 examples (to learn the posterior). We keep as test set T the original test set, which contains around 2,000 examples. Moreover, we need a perturbed test set, denoted by T̃, to compute our averaged(-max) adversarial risks. Depending on the scenario, T̃ is constructed from T by attacking the prior model H_P with PGDU or IFGSMU with n=100 (more details are given in Appendix). We run Algorithm 1 for Equation (9) (Theorem 6), respectively Equation (12) (Theorem 7), and we compute our risk R_T̃(H_Q), respectively A_T̃(H_Q), the bound value, and the usual adversarial risk R^ROB_T(H_Q) associated with the learned model. Note that, during the evaluation of the bounds, we have to compute our relaxed adversarial risks R_S(H_Q) and A_S(H_Q) on S. For Step 1, the initial prior P_0 is fixed to the uniform distribution, and the initial set of voters H_0 is constructed with weights initialized with the Xavier initializer [Glorot and Bengio, 2010] and biases initialized at 0 (more details are given in Appendix). During Step 2, to optimize the bound, we fix the confidence parameter δ=0.05, and we consider two settings for the set of voters: H as output by Step 1, and the set HSIGN = {h′(·) = sign(h(·)) | h ∈ H}, for which the theoretical results remain valid (we will see that in this latter situation we are able to better minimize the TV term of Theorem 7). For the two steps, we use the Adam optimizer [Kingma and Ba, 2015] for T = T′ = 20 epochs with a learning rate of 10^−2 and a batch size of 64.

Table 1: Test risks and bounds for MNIST 1vs7 with n=100 perturbations, for all pairs (Defense, Attack) and the two voter sets HSIGN and H (ℓ2-norm, b=1; each cell reports the value for HSIGN / H). The results in bold in the original table correspond to the best value between HSIGN and H. To quantify the gap between our risks and the classical definition, we also report the risk of our models against the classical attacks, i.e., with PGDU and IFGSMU replaced by PGD or IFGSM (without sampling from the uniform distribution): these are the R^ROB_T(H_Q) columns ("Attack without U"). Since Eq. (12) upper-bounds Eq. (11) thanks to the TV term, we compute the two bound values of Theorem 7.

Algo. 1 with Eq. (9):

| Defense | Attack | R^ROB_T(H_Q) (attack without U) | R_T̃(H_Q) | Bound of Th. 6 |
|---|---|---|---|---|
| − | − | .005 / .005 | .005 / .005 | .017 / .019 |
| − | PGDU | .245 / .255 | .263 / .276 | .577 / .448 |
| − | IFGSMU | .084 / .086 | .066 / .080 | .170 / .185 |
| UNIF | − | .005 / .005 | .005 / .005 | .018 / .019 |
| UNIF | PGDU | .151 / .146 | .151 / .158 | .355 / .292 |
| UNIF | IFGSMU | .063 / .061 | .031 / .035 | .088 / .114 |
| PGDU | − | .006 / .007 | .006 / .007 | .023 / .024 |
| PGDU | PGDU | .028 / .030 | .021 / .025 | .065 / .064 |
| PGDU | IFGSMU | .021 / .022 | .013 / .016 | .043 / .045 |
| IFGSMU | − | .006 / .007 | .006 / .007 | .019 / .021 |
| IFGSMU | PGDU | .040 / .041 | .033 / .035 | .086 / .094 |
| IFGSMU | IFGSMU | .021 / .022 | .013 / .014 | .039 / .049 |

Algo. 1 with Eq. (12):

| Defense | Attack | R^ROB_T(H_Q) (attack without U) | A_T̃(H_Q) | Th. 7 – Eq. (12) | Th. 7 – Eq. (11) |
|---|---|---|---|---|---|
| − | − | .005 / .005 | .005 / .005 | .099 / .100 | .099 / .100 |
| − | PGDU | .315 / .313 | .325 / .326 | .801 / 1.667 | .684 / .515 |
| − | IFGSMU | .117 / .113 | .106 / .110 | .356 / 1.431 | .286 / .251 |
| UNIF | − | .005 / .005 | .005 / .005 | .099 / .100 | .099 / .100 |
| UNIF | PGDU | .183 / .178 | .190 / .189 | .531 / 1.620 | .454 / .355 |
| UNIF | IFGSMU | .071 / .070 | .056 / .054 | .248 / 1.405 | .200 / .186 |
| PGDU | − | .006 / .007 | .006 / .007 | .102 / .103 | .102 / .103 |
| PGDU | PGDU | .028 / .029 | .025 / .028 | .143 / 1.389 | .137 / .136 |
| PGDU | IFGSMU | .022 / .022 | .018 / .019 | .125 / 1.362 | .121 / .119 |
| IFGSMU | − | .006 / .007 | .006 / .007 | .100 / .102 | .100 / .102 |
| IFGSMU | PGDU | .040 / .039 | .040 / .038 | .184 / 1.368 | .166 / .163 |
| IFGSMU | IFGSMU | .021 / .022 | .018 / .021 | .131 / 1.329 | .122 / .123 |
Figure 1: Visualization of the impact of the TV term in Equation (12). The left (respectively right) bar plot shows the bounds for the set of voters HSIGN (respectively H). We plot the bounds for all the scenarios of Table 1 that use the TV distance, i.e., all except the pairs (−, −). In orange we represent the value of the TV term, while in blue we represent all the remaining terms of the bound.

Analysis of the results. For the sake of readability, we exhibit the detailed results for one task (MNIST:1vs7) and all the pairs (Defense, Attack) with the ℓ2-norm in Table 1, and we report in Figure 1 the influence of the TV term in the bound of Theorem 7 (Equation (12)). The detailed results on the other tasks are reported in Appendix; we provide in Figure 2 an overview of the results obtained on all the tasks for the pairs (Defense, Attack) with Defense=Attack and with HSIGN. First of all, from Table 1, the bounds of Theorem 6 are tighter than the ones of Theorem 7: this is an expected result, since we showed that the averaged-max adversarial risk A_D̃n(H_Q) is more pessimistic than its averaged counterpart R_D̃(H_Q). Note that the bound values of Equation (11) are tighter than the ones of Equation (12); this is expected since Equation (11) is a lower bound on Equation (12). Second, the bounds with HSIGN are all informative (lower than 1) and give insightful guarantees for our models. For Theorem 7 (Equation (12)) with H, while the risks are comparable to those obtained with HSIGN, the bound values are greater than 1, meaning that we no longer have a guarantee on the learned model. As we can observe in Figure 1, this is due to the TV term involved in the bound. Considering HSIGN when optimizing the objective associated with Equation (12) helps to control the TV term. Even if the bounds are non-vacuous for Theorem 6 with H, the best models with the best guarantees are obtained with HSIGN. This is confirmed by the R^ROB_T(H_Q) columns, which are always worse than R_T̃(H_Q) and mostly worse than A_T̃(H_Q) with HSIGN. The performance obtained with HSIGN can be explained by the fact that the sign saturates the output of the voters, which makes the majority vote more robust to noises.
Thus, we focus the rest of the analysis on the results obtained with HSIGN. Third, we observe that the naive defense UNIF is able to improve the risks R_T̃(H_Q) and A_T̃(H_Q), but the improvement with the defenses based on PGDU and IFGSMU is much more significant, specifically against a PGDU attack (up to 13 times better). We observe the same phenomenon for both bounds (Theorems 6 and 7). This is an interesting fact, because this behavior confirms that we are able to learn models that are robust against the tested attacks, with theoretical guarantees. Lastly, from Figure 2 and Table 1, it is important to notice that the gap between the classical risk and our relaxed risks is small, meaning that our relaxations are not too optimistic. Despite the pessimism of the classical risk R^ROB_T(H_Q), it remains consistent with our bounds, i.e., it is lower than the bounds. In other words, in addition to giving upper bounds for our risks R_T̃(H_Q) and A_T̃(H_Q), our bounds give non-vacuous guarantees on the classical risk R^ROB_T(H_Q).

Figure 2 (one panel per quantity: R_T̃(H_Q), the bound of Theorem 6, A_T̃(H_Q), and the bound of Equation (11); datasets: Fashion:COvsSH, Fashion:SAvsBO, Fashion:TOvsPU, MNIST:1vs7, MNIST:4vs9, MNIST:5vs6): Visualization of the risk and bound values when Defense=Attack and the set of voters is HSIGN. Results obtained with the PGDU (respectively IFGSMU) defense are represented by a star (respectively a circle); as a reminder, R^ROB_T(H_Q) is computed with a PGD (respectively IFGSM) attack. The dashed line corresponds to the bisecting line y=x. For R_T̃(H_Q) and A_T̃(H_Q), the closer the datasets are to the bisecting line, the more accurate our relaxed risk is compared to the classical adversarial risk R^ROB_T(H_Q). For the bounds, the closer the datasets are to the bisecting line, the tighter the bound.

5 Conclusion

To the best of our knowledge, our work is the first that studies adversarial robustness from a general standpoint through the lens of the PAC-Bayesian framework. We started by formalizing a new adversarial robustness setting (for binary classification) specialized for models that can be expressed as a weighted majority vote; we refer to this setting as Adversarially Robust PAC-Bayes. This formulation allowed us to derive PAC-Bayesian generalization bounds on the adversarial risk of general majority votes. We illustrated the usefulness of this setting on the training of (differentiable) decision trees. Our contribution is mainly theoretical, and it does not appear to directly lead to potentially negative societal impact. This work gives rise to many interesting questions and lines of future research. Some perspectives will focus on extending our results to other classification settings such as multiclass or multilabel classification. Another line of research could focus on taking advantage of other tools of the PAC-Bayesian literature. Among them, we can make use of other bounds on the risk of the majority vote that take into consideration the diversity between the individual voters, for example the C-bound [Lacasse et al., 2006] or, more recently, the tandem loss [Masegosa et al., 2020]. Another very recent PAC-Bayesian bound for majority votes that deserves investigation in the case of adversarial robustness is the one proposed by Zantedeschi et al. [2021], which has the advantage of being directly optimizable with the 0-1 loss. Last but not least, in real-life applications, one often wants to combine different input sources (from different sensors, cameras, etc.). Being able to combine these sources in an effective way is then a key issue.
We believe that our new adversarial robustness setting can offer theoretical guarantees and well-founded algorithms when the model we learn is expressed as a majority vote, whether for ensemble methods with weak voters [e.g., Roy et al., 2011, Lorenzen et al., 2019], for the fusion of classifiers [e.g., Morvant et al., 2014], or for multimodal/multiview learning [e.g., Sun et al., 2017, Goyal et al., 2019].

Acknowledgments and Disclosure of Funding

This work was partially supported by the French Project APRIORI ANR-18-CE23-0015. G. Vidot is supported by the ANRT through the CIFRE convention N°2019/0507. We also thank all the anonymous reviewers for their constructive comments and the time they took to review our work.

References

Nicholas Carlini and David Wagner. Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy, 2017.
Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. In ICML, 2019.
Gintare Karolina Dziugaite and Daniel Roy. Data-dependent PAC-Bayes priors via differential privacy. In NeurIPS, 2018.
Gintare Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, Gabriel Arpino, and Daniel Roy. On the role of data in PAC-Bayes. In AISTATS, 2021.
Farzan Farnia, Jesse Zhang, and David Tse. Generalizable Adversarial Training via Spectral Normalization. In ICLR, 2019.
Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In IEEE Symposium on Security and Privacy, 2018.
Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk Bounds for the Majority Vote: From a PAC-Bayesian Analysis to a Learning Algorithm. JMLR, 2015.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. In ICLR, 2015.
Anil Goyal, Emilie Morvant, Pascal Germain, and Massih-Reza Amini. Multiview Boosting by Controlling the Diversity and the Accuracy of View-specific Voters. Neurocomputing, 2019.
Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In ICLR, 2019.
Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety Verification of Deep Neural Networks. In CAV, 2017.
Xiaowei Huang, Daniel Kroening, Wenjie Ruan, James Sharp, Youcheng Sun, Emese Thamo, Min Wu, and Xinping Yi. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review, 2020.
Guy Katz, Clark Barrett, David Dill, Kyle Julian, and Mykel Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In CAV, 2017.
Justin Khim and Po-Ling Loh. Adversarial Risk Bounds for Binary Classification via Function Transformation. CoRR, 2018.
Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò. Deep Neural Decision Forests. In IJCAI, 2016.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial Machine Learning at Scale. In ICLR, 2017.
Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier. In NIPS, 2006.
Yann LeCun, Corinna Cortes, and Christopher Burges. THE MNIST DATASET of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.
Guy Lever, François Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 2013.
Stephan Sloth Lorenzen, Christian Igel, and Yevgeny Seldin. On PAC-Bayesian bounds for random forests. Machine Learning, 2019.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR, 2018.
Andrés Masegosa, Stephan Sloth Lorenzen, Christian Igel, and Yevgeny Seldin. Second Order PAC-Bayesian Bounds for the Weighted Majority Vote. In NeurIPS, 2020.
David McAllester. Some PAC-Bayesian Theorems. In COLT, 1998.
Omar Montasser, Steve Hanneke, and Nathan Srebro. VC Classes are Adversarially Robustly Learnable, but Only Improperly. In COLT, 2019.
Omar Montasser, Steve Hanneke, and Nathan Srebro. Reducing Adversarially Robust Learning to Non-Robust PAC Learning. In NeurIPS, 2020.
Emilie Morvant, Amaury Habrard, and Stéphane Ayache. Majority Vote of Diverse Classifiers for Late Fusion. In S+SSPR, 2014.
Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS, 2019.
Yuki Ohnishi and Jean Honorio. Novel Change of Measure Inequalities with Applications to PAC-Bayesian Bounds and Monte Carlo Estimation. In AISTATS, 2021.
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The Limitations of Deep Learning in Adversarial Settings. In IEEE EuroS&P, 2016.
Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. JMLR, 2012.
Liva Ralaivola, Marie Szafranski, and Guillaume Stempfel. Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes. JMLR, 2010.
David Reeb, Andreas Doerr, Sebastian Gerwinn, and Barbara Rakitsch. Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds. In NeurIPS, 2018.
Kui Ren, Tianhang Zheng, and Xue Liu. Adversarial Attacks and Defenses in Deep Learning. Engineering, 2020.
Jean-Francis Roy, François Laviolette, and Mario Marchand. From PAC-Bayes Bounds to Quadratic Programs for Majority Votes. In ICML, 2011.
Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers. In NeurIPS, 2019.
Edward Scheinerman and Daniel Ullman. Fractional Graph Theory: A Rational Approach to the Theory of Graphs. Courier Corporation, 2011.
Matthias Seeger. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification. JMLR, 2002.
John Shawe-Taylor and Robert Williamson. A PAC Analysis of a Bayesian Estimator. In COLT, 1997.
Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. Boosting Robustness Certification of Neural Networks. In ICLR, 2019.
Shiliang Sun, John Shawe-Taylor, and Liang Mao. PAC-Bayes analysis of multi-view learning. Inf. Fusion, 2017.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks. In NeurIPS, 2018.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR, 2017.
Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher Complexity for Adversarially Robust Generalization. In ICML, 2019.
Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient Defenses Against Adversarial Attacks. In ACM Workshop on Artificial Intelligence and Security, AISec@CCS, 2017.
Valentina Zantedeschi, Paul Viallard, Emilie Morvant, Rémi Emonet, Amaury Habrard, Pascal Germain, and Benjamin Guedj. Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound. In NeurIPS, 2021.