Certifying Confidence via Randomized Smoothing

Aounon Kumar, University of Maryland, aounon@umd.edu
Alexander Levine, University of Maryland, alevine0@cs.umd.edu
Soheil Feizi, University of Maryland, sfeizi@cs.umd.edu
Tom Goldstein, University of Maryland, tomg@cs.umd.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Randomized smoothing has been shown to provide good certified-robustness guarantees for high-dimensional classification problems. It uses the probabilities of predicting the top two most-likely classes around an input point under a smoothing distribution to generate a certified radius for a classifier's prediction. However, most smoothing methods do not give us any information about the confidence with which the underlying classifier (e.g., a deep neural network) makes a prediction. In this work, we propose a method to generate certified radii for the prediction confidence of the smoothed classifier. We consider two notions for quantifying confidence: the average prediction score of a class and the margin by which the average prediction score of one class exceeds that of another. We modify the Neyman-Pearson lemma (a key theorem in randomized smoothing) to design a procedure for computing the certified radius within which the confidence is guaranteed to stay above a certain threshold. Our experimental results on the CIFAR-10 and ImageNet datasets show that using information about the distribution of the confidence scores allows us to achieve a significantly better certified radius than ignoring it. Thus, we demonstrate that extra information about the base classifier at the input point can help improve certified guarantees for the smoothed classifier. Code for the experiments is available at https://github.com/aounon/cdf-smoothing.

1 Introduction

Deep neural networks have been shown to be vulnerable to adversarial attacks, in which a nearly imperceptible perturbation is added to an input image to completely alter the network's prediction [37, 31, 12, 21]. Several empirical defenses have been proposed over the years to produce classifiers that are robust to such perturbations [20, 4, 16, 8, 30, 14, 11]. However, without robustness guarantees, it is often the case that these defenses are broken by stronger attacks [5, 1, 40, 39]. Certified defenses, such as those based on convex relaxation [41, 33, 35, 6, 36] and interval-bound propagation [13, 17, 9, 32], address this issue by producing robustness guarantees within a neighborhood of an input point. However, due to the complexity of present-day neural networks, these methods have seen limited use on high-dimensional datasets such as ImageNet.

Randomized smoothing has recently emerged as the state-of-the-art technique for certifying adversarial robustness with the scalability to handle datasets as large as ImageNet [22, 28, 7, 34]. This defense uses a base classifier, e.g., a deep neural network, to make predictions. Given an input image, a smoothing method queries the top class label at a large number of points in a Gaussian distribution surrounding the image, and returns the label with the majority vote. If the input image is perturbed slightly, the new voting population overlaps greatly with the smoothing distribution around the original image, and so the vote outcome can change only a small amount.
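To make the voting procedure just described concrete, here is a minimal sketch of vote-based smoothed prediction. It is an illustration only, not the certification procedure of Cohen et al. [7] and not the authors' released code; `base_classifier` is a hypothetical stand-in for any model that returns a class label.

```python
# Minimal sketch of majority-vote Gaussian smoothing (illustration only).
import numpy as np

def smoothed_prediction(base_classifier, x, sigma=0.25, num_samples=1000, seed=0):
    """Return the class receiving the most votes under Gaussian noise around x."""
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(num_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)  # sample from N(x, sigma^2 I)
        label = base_classifier(noisy)                    # top-class label at the noisy point
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy usage: a linear two-class "classifier" on 2-D inputs.
toy_classifier = lambda z: int(z.sum() > 0)
print(smoothed_prediction(toy_classifier, np.array([0.3, 0.2])))
```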
Conventional smoothing throws away a lot of information about class labels, and has limited capabilities that make its outputs difficult to use for decision making. Conventional classification networks with a softmax layer output a confidence score that can be interpreted as the degree of certainty the network has about the class label [15]. This is a crucial piece of information in real-world decision-making applications such as self-driving cars [3] and disease-diagnosis networks [18], where safety is paramount. In contrast, standard Gaussian smoothing methods take binary votes at each randomly sampled point, i.e., each point votes either for or against the most likely class, without conveying any information about how confident the network is in the class label. This may lead to scenarios where a point has a large certified radius but the underlying classifier has a low confidence score. For example, imagine a 2-way classifier for which a large portion, say 95%, of the sampled points predict the same class. In this case, the certified radius will be very large (indicating that this image is not an ℓ2-bounded adversarial example). However, it could be that each point predicts the top class with very low confidence. In this case, one should have very low confidence in the class label, despite the strength of the adversarial certificate. A Gaussian smoothing classifier counts a 51% confidence vote exactly the same way as a 99% confidence vote, and this important information is erased.

In this work, we restore confidence information in certified classifiers by proposing a method that produces class labels with a certified confidence score. Instead of taking a vote at each Gaussian sample around the input point, we average the confidence scores from the underlying base classifier for each class. The prediction of our smoothed classifier is given by the argmax of the expected scores of all the classes. Using the probability distribution of the confidence scores under the Gaussian, we produce a lower bound on how much the expected confidence score of the predicted class can be manipulated by a bounded perturbation to the input image. To do this, we adapt the Neyman-Pearson lemma, the fundamental theorem that characterizes the worst-case behaviour of the classifier under regular (binary) voting, to leverage the distributional information about the confidence scores. The lower bound we obtain is monotonically decreasing with the ℓ2-norm of the perturbation and can be expressed as a linear combination of the Gaussian CDF at different points. This allows us to design an efficient binary-search-based algorithm to compute the radius within which the expected score is guaranteed to be above a given threshold.

Our method endows smoothed classifiers with the new and important capability of producing confidence scores. We study two notions of measuring confidence: the average prediction score of a class, and the margin by which the average prediction score of one class exceeds that of another. The average prediction score is the expected value of the activations in the final softmax layer of a neural network under the smoothing distribution. A class is guaranteed to be the predicted class if its average prediction score is greater than one half (since softmax values add up to one) or if it maintains a positive margin over all the other classes.
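A minimal sketch, under the same caveats as before, of the score-averaging smoothed classifier described above: the base classifier's softmax scores are averaged over Gaussian samples, the prediction is the argmax of the averages, and the expected margin is estimated from the same samples. `base_scores` is a hypothetical stand-in returning a softmax vector; this is our reading of the description, not the reference implementation.

```python
# Sketch of the score-averaging smoothed classifier and its margin (illustration only).
import numpy as np

def smoothed_prediction_scores(base_scores, x, sigma=0.25, num_samples=1000, seed=0):
    """Average prediction scores, predicted class, and an estimate of the expected margin."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=(num_samples,) + x.shape)
    scores = np.stack([base_scores(z) for z in noisy])   # shape (m, k): one softmax vector per sample
    avg = scores.mean(axis=0)                            # average prediction score per class
    top = int(np.argmax(avg))                            # prediction of the smoothed classifier
    others = np.delete(scores, top, axis=1)
    margin = float((scores[:, top] - others.max(axis=1)).mean())  # estimate of E[h_i - max_{j != i} h_j]
    return top, avg, margin

# Toy usage: a 3-class softmax over a linear map of 2-D inputs.
W = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]])
toy_scores = lambda z: np.exp(W @ z) / np.exp(W @ z).sum()
print(smoothed_prediction_scores(toy_scores, np.array([0.1, -0.2])))
```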
For both these measures, along with the bounds described above, we also derive naive lower bounds on the expected score at a perturbed input point that do not use the distribution of the scores. We perform experiments on the CIFAR-10 and ImageNet datasets which show that using information about the distribution of the scores allows us to achieve better certified guarantees than the naive method.

Related work: Randomized smoothing as a technique to design certifiably robust machine learning models has been studied amply in recent years. It has been used to produce certified robustness against additive threat models, such as ℓ1 [22, 38], ℓ2 [7, 29, 27] and ℓ0-norm [23, 26] bounded adversaries, as well as non-additive threat models, such as Wasserstein adversarial attacks [24]. A derandomized version has been shown to provide robustness guarantees for patch attacks [25]. Smoothed classifiers that use the average confidence scores have been studied in [34] to achieve better certified robustness through adversarial training. A recent work uses the median score to generate certified robustness for regression models [43]. The differential-privacy-based defense studied in [22] is capable of providing a guarantee on the test accuracy of a robust model under adversarial attack. Various limitations of randomized smoothing, such as its inapplicability to high-dimensional problems for ℓ∞-robustness, have been studied in [19, 42, 2].

2 Background and Notation

Gaussian smoothing, introduced by Cohen et al. in 2019, relies on a base classifier, which is a mapping f : R^d → Y, where R^d is the input space and Y is a set of k classes. It defines a smoothed classifier f̄ as

f̄(x) = argmax_{c ∈ Y} P(f(x + δ) = c),

where δ ∼ N(0, σ²I) is sampled from an isometric Gaussian distribution with variance σ². It returns the class that is most likely to be sampled by the Gaussian distribution centered at point x. Let p1 and p2 be the probabilities of sampling the top two most-likely classes. Then, f̄ is guaranteed to be constant within an ℓ2-ball of radius (σ/2)(Φ⁻¹(p1) - Φ⁻¹(p2)), where Φ⁻¹ is the inverse CDF of the standard Gaussian distribution [7]. For a practical certification algorithm, a lower bound p̲1 on p1 and an upper bound p̄2 = 1 - p̲1 on p2, holding with probability 1 - α for a given α ∈ (0, 1), are obtained, and the certified radius is given by R = σΦ⁻¹(p̲1). This analysis is tight for ℓ2 perturbations; the bound is achieved by a worst-case classifier in which all the points in the top class are restricted to a half-space separated by a hyperplane orthogonal to the direction of the perturbation.

In our discussion, we diverge from the standard notation described above and assume that the base classifier f maps points in R^d to a k-tuple of confidence scores. Thus, f : R^d → (a, b)^k for some a, b ∈ R with a < b¹. We define the smoothed version of the classifier as

f̄(x) = E_{δ ∼ N(0, σ²I)}[f(x + δ)],

which is the expectation of the class scores under the Gaussian distribution centered at x. The final prediction is made by taking an argmax of the expected scores. This definition has been studied by Salman et al. in [34] to develop an attack against smoothed classifiers, which, when used in an adversarial training setting, helps boost the performance of conventional smoothing. The goal of this work is to identify a radius around an image x within which the expected confidence score of the predicted class i, i.e. f̄_i(x) = E[f_i(x + δ)], remains above a given threshold c ∈ (a, b)².

¹ (a, b) denotes the open interval between a and b.
² f̄_i(x) denotes the i-th component of f̄(x).
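For later comparison with the confidence certificates developed in section 3, the following small numeric sketch evaluates the standard vote-based certified radius recalled at the start of this section; the probability values below are made up for illustration.

```python
# Numeric sketch of the standard (vote-based) certified radius recalled above.
from scipy.stats import norm

sigma = 0.25
p1, p2 = 0.85, 0.10            # hypothetical top-two class probabilities under smoothing
radius_two_class = 0.5 * sigma * (norm.ppf(p1) - norm.ppf(p2))   # (sigma/2)(Phi^-1(p1) - Phi^-1(p2))

p1_lower = 0.80                # hypothetical high-probability lower bound on p1
radius_practical = sigma * norm.ppf(p1_lower)                    # sigma * Phi^-1(p_lower)

print(radius_two_class, radius_practical)
```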
We measure confidence using two different notions. The first measure is the average prediction score of a class as output by the final softmax layer. We denote the prediction score function by h : R^d → (0, 1)^k and define the average for class i as h̄_i(x) = E[h_i(x + δ)]. The second one is the margin m_i(x) = h_i(x) - max_{j ≠ i} h_j(x) by which class i beats every other class in the softmax prediction score. In section 4, we show that the expected margin m̄_i(x) = E[m_i(x + δ)] for the predicted class is a lower bound on the gap in average prediction scores of the top two class labels. Thus, m̄_i(x) > 0 implies that i is the predicted class.

3 Certifying Confidence Scores

Standard Gaussian smoothing for establishing certified class labels essentially works by averaging binary (0/1) votes from every image in a Gaussian cloud around the input image, x. It then establishes the worst-case class boundary given the recorded vote, and produces a certificate. The same machinery can be applied to produce a naive certificate for confidence score; rather than averaging binary votes, we simply average scores. We then produce the worst-case class distribution, in which each class lives in a separate half-space, and generate a certificate for this worst case.

However, the naive certificate described above throws away a lot of information. When continuous-valued scores are recorded, we obtain not only the average score, but also the distribution of scores around the input point. By using this distributional information, we can potentially create a much stronger certificate. To see why, consider the extreme case of a flat classifier function for which every sample in the Gaussian cloud around x returns the same top-class prediction score of 0.55. In this case, the average score is 0.55 as well. For a function where the distribution of score votes is concentrated at 0.55 (or at any other value greater than 1/2), the average score will always remain at that value for any perturbation to x, thus yielding an infinite certified radius. However, when using the naive approach that throws away the distribution, the worst-case class boundary with average vote 0.55 is one with confidence score 1.0 everywhere in a half-space occupying 0.55 probability, and 0.0 in a half-space with 0.45 probability. This worst case, which uses only the average vote, produces a very small certified radius, in contrast to the infinite radius we could obtain from observing the distribution of votes.

Below, we first provide a simple bound that produces a certificate by averaging scores around the input image and directly applying the framework from [7]. Then, we describe a more refined method that uses distributional information to obtain stronger bounds.

3.1 A baseline method using Gaussian means

In this section, we describe a method that uses only the average confidence over the Gaussian distribution surrounding x, and not the distribution of values, to bound how much the expected score can change when x is perturbed within an ℓ2 radius of R units. This is a straightforward extension of Cohen et al.'s [7] work to our framework. It shows that regardless of the behaviour of the base classifier f, its smoothed version f̄ changes slowly, which is similar to the observation of bounded Lipschitzness made by Salman et al. in [34] (Lemma 2). The worst-case classifier in this case assumes value a in one half-space and b in the other, with a linear boundary between the two, as illustrated in figure 1a.
The following theorem formally states the bounds, the proof of which is deferred to the appendix³.

Theorem 1. Let e̲_i(x) and ē_i(x) be a lower bound and an upper bound, respectively, on the expected score f̄_i(x) for class i, and let p̲_i(x) = (e̲_i(x) - a)/(b - a) and p̄_i(x) = (ē_i(x) - a)/(b - a). Then, for a perturbation x′ of the input x such that ‖x - x′‖₂ ≤ R,

f̄_i(x′) ≥ b Φ_σ(Φ_σ⁻¹(p̲_i(x)) - R) + a(1 - Φ_σ(Φ_σ⁻¹(p̲_i(x)) - R))    (1)

and

f̄_i(x′) ≤ b Φ_σ(Φ_σ⁻¹(p̄_i(x)) + R) + a(1 - Φ_σ(Φ_σ⁻¹(p̄_i(x)) + R)),

where Φ_σ is the CDF of the univariate Gaussian distribution with variance σ², i.e., N(0, σ²).

3.2 Proposed certificate

The bounds in section 3.1 are a simple application of the Neyman-Pearson lemma to our framework. But this method discards a lot of information about how the class scores are distributed in the Gaussian around the input point. Rather than consolidating the confidence scores from the samples into an expectation, we propose a method that uses the cumulative distribution function of the confidence scores to obtain improved bounds on the expected class scores. Given an input x, we draw m samples from the Gaussian distribution around x. We use the prediction of the base classifier f on these points to generate bounds on the distribution function of the scores for the predicted class. These bounds, in turn, allow us to bound the amount by which the expected score of the class will decrease under an ℓ2 perturbation. Finally, we apply binary search to compute the radius for which this lower bound on the expected score remains above c.

Consider the sampling of scores around an image x using a Gaussian distribution. Let the probability with which the score of class i is above s be p_{i,s}(x) = P_{δ ∼ N(0,σ²I)}(f_i(x + δ) ≥ s). For point x and class i, consider the random variable Z = f_i(x + δ), where δ ∼ N(0, σ²I). Let F(s) = P(Z ≤ s) be the cumulative distribution function of Z and F_m(s) = (1/m) Σ_{j=1}^{m} 1{Z_j ≤ s} be its empirical estimate. For a given α ∈ (0, 1), the Dvoretzky-Kiefer-Wolfowitz inequality [10] says that, with probability 1 - α, the true CDF is bounded by the empirical CDF as follows:

F_m(s) - ϵ ≤ F(s) ≤ F_m(s) + ϵ  for all s,  where ϵ = √(ln(1/α)/(2m)).

Thus, p_{i,s}(x) is also bounded within ϵ of its empirical estimate (1/m) Σ_{j=1}^{m} 1{f_i(x + δ_j) ≥ s}. The following theorem bounds the expected class score under an ℓ2 perturbation using bounds on the cumulative distribution of the scores.

³ A separate proof of this theorem for σ = 1, using Lemma 2 from Salman et al. [34], is also included in the appendix.

Figure 1: Worst-case classifier behaviour using (a) the naive approach and (b) the CDF-based method. As the center of the distribution moves from x to x′, the probability mass of the higher values of the score function (indicated in red) decreases and that of the lower values (indicated in blue) increases, bringing down the value of the expected score.

Theorem 2. For class i, let a < s_1 ≤ s_2 ≤ ... ≤ s_n < b be n real numbers, and let p̄_{i,s_j}(x) and p̲_{i,s_j}(x) be upper and lower bounds on p_{i,s_j}(x), respectively, derived using the Dvoretzky-Kiefer-Wolfowitz inequality with probability 1 - α, for a given α ∈ (0, 1). Then, for a perturbation x′ of the input x such that ‖x - x′‖₂ ≤ R,

f̄_i(x′) ≥ a + (s_1 - a) Φ_σ(Φ_σ⁻¹(p̲_{i,s_1}(x)) - R) + Σ_{j=2}^{n} (s_j - s_{j-1}) Φ_σ(Φ_σ⁻¹(p̲_{i,s_j}(x)) - R)    (2)

f̄_i(x′) ≤ s_1 + (b - s_n) Φ_σ(Φ_σ⁻¹(p̄_{i,s_n}(x)) + R) + Σ_{j=1}^{n-1} (s_{j+1} - s_j) Φ_σ(Φ_σ⁻¹(p̄_{i,s_j}(x)) + R),

where Φ_σ is the CDF of the univariate Gaussian distribution with variance σ², i.e., N(0, σ²). The above bounds are tight for ℓ2 perturbations.
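The two lower bounds can be transcribed directly from equations (1) and (2). The sketch below does this with scipy's standard-normal CDF, using the identity Φ_σ(t) = Φ(t/σ); it is an illustration of the formulas above with hypothetical input values, not the released cdf-smoothing code.

```python
# Lower bounds on the expected class score at radius R, transcribed from eqs. (1) and (2).
# Phi_sigma(Phi_sigma^{-1}(p) - R) = Phi(Phi^{-1}(p) - R/sigma), so only the standard
# normal CDF/quantile are needed.  A sketch, not the released implementation.
import numpy as np
from scipy.stats import norm

def naive_lower_bound(p_lower, R, sigma, a=0.0, b=1.0):
    """Theorem 1 lower bound, with p_lower = (e_lower - a) / (b - a)."""
    q = norm.cdf(norm.ppf(p_lower) - R / sigma)
    return b * q + a * (1.0 - q)

def cdf_lower_bound(s, p_lower, R, sigma, a=0.0):
    """Theorem 2 lower bound.

    s       : thresholds a < s_1 <= ... <= s_n
    p_lower : lower bounds on P(f_i(x + delta) >= s_j), one per threshold
    """
    s = np.asarray(s, dtype=float)
    q = norm.cdf(norm.ppf(np.asarray(p_lower, dtype=float)) - R / sigma)
    steps = np.concatenate(([s[0] - a], np.diff(s)))   # (s_1 - a), (s_2 - s_1), ...
    return a + float(np.sum(steps * q))

# Hypothetical numbers: three levels, sigma = 0.25, perturbation radius R = 0.25.
print(naive_lower_bound(0.70, R=0.25, sigma=0.25))
print(cdf_lower_bound([0.3, 0.5, 0.8], [0.95, 0.85, 0.60], R=0.25, sigma=0.25))
```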
The worst-case classifier for the lower bound is one in which the class score decreases from b to a in steps, taking values s_n, s_{n-1}, . . . , s_1 at each level. Figure 1b illustrates this case for three intermediate levels. A similar worst-case scenario can be constructed for the upper bound as well, where the class score increases from a to b along the direction of the perturbation. Even though our theoretical results allow us to derive both upper and lower bounds for the expected scores, we restrict ourselves to the lower bound in our experimental results. We provide a proof sketch for this theorem in section 3.3. Our experimental results show that the CDF-based approach beats the naive bounds in practice by a significant margin, showing that having more information about the classifier at the input point can help achieve better guarantees.

Computing the certified radius: Both the bounds in theorem 2 are monotonic in R. So, in order to find a certified radius, up to a precision τ, such that the lower (upper) bound is above (below) a certain threshold, we can apply binary search, which requires at most O(log(1/τ)) evaluations of the bound.

3.3 Proof of Theorem 2

We present a brief proof for theorem 2. We use a slightly modified version of the Neyman-Pearson lemma (stated in [7]), which we prove in the appendix.

Lemma 3 (Neyman & Pearson, 1933). Let X and Y be random variables in R^d with densities µ_X and µ_Y. Let h : R^d → (a, b) be a function. Then:

1. If S = {z ∈ R^d | µ_Y(z)/µ_X(z) ≤ t} for some t > 0 and P(h(X) ≥ s) ≥ P(X ∈ S), then P(h(Y) ≥ s) ≥ P(Y ∈ S).

2. If S = {z ∈ R^d | µ_Y(z)/µ_X(z) ≥ t} for some t > 0 and P(h(X) ≥ s) ≤ P(X ∈ S), then P(h(Y) ≥ s) ≤ P(Y ∈ S).

Set X to be the smoothing distribution at an input point x and Y to be that at x + ϵ for some perturbation vector ϵ. For a class i, define sets S̲_{i,j} = {z ∈ R^d | µ_Y(z)/µ_X(z) ≤ t_{i,j}} for some t_{i,j} > 0, such that P(X ∈ S̲_{i,j}) = p̲_{i,s_j}(x). Similarly, define sets S̄_{i,j} = {z ∈ R^d | µ_Y(z)/µ_X(z) ≥ t′_{i,j}} for some t′_{i,j} > 0, such that P(X ∈ S̄_{i,j}) = p̄_{i,s_j}(x). Since P(f_i(X) ≥ s_j) ≥ P(X ∈ S̲_{i,j}), using lemma 3 we can say that P(f_i(Y) ≥ s_j) ≥ P(Y ∈ S̲_{i,j}). Therefore,

E[f_i(Y)] ≥ s_n P(f_i(Y) ≥ s_n) + s_{n-1}(P(f_i(Y) ≥ s_{n-1}) - P(f_i(Y) ≥ s_n)) + · · · + s_1(P(f_i(Y) ≥ s_1) - P(f_i(Y) ≥ s_2)) + a(1 - P(f_i(Y) ≥ s_1))
= a + (s_1 - a) P(f_i(Y) ≥ s_1) + Σ_{j=2}^{n} (s_j - s_{j-1}) P(f_i(Y) ≥ s_j)
≥ a + (s_1 - a) P(Y ∈ S̲_{i,1}) + Σ_{j=2}^{n} (s_j - s_{j-1}) P(Y ∈ S̲_{i,j}).

Similarly, P(f_i(X) ≥ s_j) ≤ P(X ∈ S̄_{i,j}) implies P(f_i(Y) ≥ s_j) ≤ P(Y ∈ S̄_{i,j}) as per lemma 3. Therefore,

E[f_i(Y)] ≤ b P(f_i(Y) ≥ s_n) + s_n(P(f_i(Y) ≥ s_{n-1}) - P(f_i(Y) ≥ s_n)) + · · · + s_1(1 - P(f_i(Y) ≥ s_1))
= (b - s_n) P(f_i(Y) ≥ s_n) + Σ_{j=1}^{n-1} (s_{j+1} - s_j) P(f_i(Y) ≥ s_j) + s_1
≤ s_1 + (b - s_n) P(Y ∈ S̄_{i,n}) + Σ_{j=1}^{n-1} (s_{j+1} - s_j) P(Y ∈ S̄_{i,j}).

Since we are smoothing using an isometric Gaussian distribution with variance σ², we have µ_X = N(x, σ²I) and µ_Y = N(x + ϵ, σ²I). Then, for some t and β,

µ_Y(z)/µ_X(z) ≤ t  ⟺  ϵᵀz ≤ β    and    µ_Y(z)/µ_X(z) ≥ t  ⟺  ϵᵀz ≥ β.

Thus, each of the sets S̲_{i,j} and S̄_{i,j} is a half-space defined by a hyperplane orthogonal to the direction of the perturbation. This simplifies our analysis to one dimension, namely, the one along the perturbation. For each of the sets S̲_{i,j} and S̄_{i,j}, we can find a point on the real number line, Φ_σ⁻¹(p̲_{i,s_j}(x)) and Φ_σ⁻¹(p̄_{i,s_j}(x)) respectively, such that the probability of a Gaussian sample falling in that set is equal to the Gaussian CDF at that point. Therefore,

f̄_i(x + ϵ) ≥ a + (s_1 - a) Φ_σ(Φ_σ⁻¹(p̲_{i,s_1}(x)) - R) + Σ_{j=2}^{n} (s_j - s_{j-1}) Φ_σ(Φ_σ⁻¹(p̲_{i,s_j}(x)) - R)

f̄_i(x + ϵ) ≤ s_1 + (b - s_n) Φ_σ(Φ_σ⁻¹(p̄_{i,s_n}(x)) + R) + Σ_{j=1}^{n-1} (s_{j+1} - s_j) Φ_σ(Φ_σ⁻¹(p̄_{i,s_j}(x)) + R),

which completes the proof of theorem 2.
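The binary search described under "Computing the certified radius" above can be a few lines; the sketch below assumes `lower_bound_at` is any monotonically decreasing function of R, for instance the cdf_lower_bound sketch given earlier with its other arguments fixed. Again, this is an illustration, not the released code.

```python
# Binary search for the certified radius: the largest R (up to precision tau) at
# which a monotonically decreasing lower bound stays above the threshold c.
def certified_radius(lower_bound_at, c, r_max=4.0, tau=1e-4):
    """Largest R in [0, r_max], up to precision tau, with lower_bound_at(R) >= c."""
    if lower_bound_at(0.0) < c:       # the expected score is not certified even at the input itself
        return 0.0
    if lower_bound_at(r_max) >= c:    # bound never drops below c within the search range
        return r_max
    lo, hi = 0.0, r_max
    while hi - lo > tau:              # O(log(1/tau)) evaluations of the bound
        mid = 0.5 * (lo + hi)
        if lower_bound_at(mid) >= c:
            lo = mid
        else:
            hi = mid
    return lo

# Example (hypothetical numbers), reusing the cdf_lower_bound sketch from section 3.2:
# R = certified_radius(lambda R: cdf_lower_bound([0.3, 0.5, 0.8], [0.95, 0.85, 0.60],
#                                                R, sigma=0.25), c=0.5)
```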
We would like to note here that although we use the Gaussian distribution for smoothing, the modified Neyman-Pearson lemma does not make any assumptions on the shape of the distributions, which allows this proof to be adapted for other smoothing distributions as well.

4 Confidence measures

We study two notions of confidence: the average prediction score of a class and the margin of average prediction score between two classes. Usually, neural networks make their predictions by outputting a prediction score for each class and then taking the argmax of the scores. Let h : R^d → (0, 1)^k be a classifier mapping input points to prediction scores between 0 and 1 for each class. We assume that the scores are generated by a softmax-like layer, i.e., 0 < h_i(x) < 1 for all i ∈ {1, . . . , k} and Σ_i h_i(x) = 1. For δ ∼ N(0, σ²I), we define the average prediction score for a class i as h̄_i(x) = E[h_i(x + δ)]. The final prediction of the smoothed classifier is made by taking an argmax over the average prediction scores of all the classes, i.e., argmax_i h̄_i(x). Thus, if for a class j, h̄_j(x) > 0.5, then j = argmax_i h̄_i(x).

Now, we define the margin m at point x for a class i as

m_i(x) = h_i(x) - max_{j ≠ i} h_j(x).

Thus, if i is the class with the highest prediction score, m_i(x) is the lead it has over the second-highest class (figure 2). And, for any other class, m_i(x) is the negative of the difference between the score of that class and the score of the highest class. We define the average margin at point x under the smoothing distribution as m̄_i(x) = E[m_i(x + δ)].

Figure 2: Margin.

For a pair of classes i and j, we have

h̄_i(x) - h̄_j(x) = E[h_i(x + δ)] - E[h_j(x + δ)] = E[h_i(x + δ) - h_j(x + δ)] ≥ E[h_i(x + δ) - max_{j′ ≠ i} h_{j′}(x + δ)] = E[m_i(x + δ)] = m̄_i(x),

i.e., h̄_i(x) ≥ h̄_j(x) + m̄_i(x). Thus, if m̄_i(x) > 0, then class i must have the highest average prediction score, making it the predicted class under this notion of confidence measure.

5 Experiments

We conduct several experiments to motivate the use of certified confidence and to validate the effectiveness of our proposed CDF-based certificate.

5.1 Does certified radius correlate with confidence score?

A classifier can fail because of an adversarial attack, or because of epistemic uncertainty: a class label may be uncertain or wrong because of a lack of useful features, or because the model was not trained on sufficient representative data. The use of certified confidence is motivated by the observation that the original Gaussian averaging, which certifies the security of class labels, does not convey whether the user should be confident in the label, because it neglects epistemic uncertainty. We demonstrate this with a simple experiment. In figure 3, we show plots of softmax prediction score vs. certified radius obtained using the smoothed ResNet-110 and ResNet-50 classifiers trained by Cohen et al. in [7] for CIFAR-10 and ImageNet respectively. The noise level σ used for this experiment was 0.25. For both models, the certified radii correlate very little with the prediction scores for the input images. The CIFAR-10 plot has points with high scores but small radii, while for ImageNet, we see a lot of points with low scores but high radii. This motivates the need for certifying confidence; a high radius does not imply high confidence of the underlying classifier. This lack of correlation is visualized in figure 4.
Figure 4: Certified radius does not correlate well with human visual confidence or network confidence score. Low-radius images on the left have high confidence scores (e.g., radius < 0.2 with score > 0.7, and radius < 0.4 with score > 0.6), while the high-radius images on the right all have low confidence scores (e.g., radius > 0.5 with score < 0.4, and radius > 0.9 with score < 0.1). There is not a pronounced visual difference between low- and high-radius images.

Figure 3: Prediction score vs. certified radius.

In the plots, CIFAR-10 images tend to have a higher prediction score than ImageNet images, which is potentially due to the fact that the ImageNet dataset has many more classes than the CIFAR-10 dataset, driving the softmax scores down. There is a hard limit (≈ 0.95 for ImageNet) on the largest radius that can be generated by Cohen et al.'s certifying algorithm, which causes a lot of the ImageNet points to accumulate at this radius value. This limit comes from the fact that even if all the samples around an input image vote for the same class, the lower bound on the top-class probability is strictly less than one, which keeps the certified radius finite.

5.2 Evaluating the strength of bounds

We use the ResNet-110 and ResNet-50 models trained by Cohen et al. in [7] on the CIFAR-10 and ImageNet datasets respectively to generate confidence certificates. These models have been pre-trained with varying levels of Gaussian noise σ in the training data. We use the same σ for certifying confidences as well. We use the same number of samples m = 100,000 and value of α = 0.001 as in [7]. We set s_1, s_2, . . . , s_n in theorem 2 such that the number of confidence score values falling in each of the intervals (a, s_1), (s_1, s_2), . . . , (s_n, b) is the same. We sort the scores from the m samples in increasing order and set s_i to be the element at position 1 + (i - 1)m/n in this order. We chose this method of splitting the range (a, b), instead of splitting at regular steps, to keep the intervals well balanced.

We present results for both notions of confidence measure: average prediction score and margin. Figure 5 plots certified accuracy, using the naive bound and the CDF-based method, for different threshold values for the top-class average prediction score and the margin at various radii for σ = 0.25. The same experiments for σ = 0.50 are included in the appendix. Each line is for a given threshold on the confidence score. The solid lines represent certificates derived using the CDF bound and the dashed lines are for ones using the naive bound. For the baseline certificate (1), we use Hoeffding's inequality to get a lower bound on the expected top-class confidence score e̲_i(x) that holds with probability 1 - α, for a given α ∈ (0, 1):

e̲_i(x) ≥ (1/m) Σ_{j=1}^{m} f_i(x + δ_j) - (b - a) √(ln(1/α)/(2m)).

This bound is a reasonable choice because p̲_i(x) differs from the empirical estimate by the same amount √(ln(1/α)/(2m)) as p̲_{i,s}(x) in the proposed CDF-based certificate. In the appendix, we also show that the baseline certificate, even with the best-possible lower bound for e̲_i(x), cannot beat our method in most cases. We see a significant improvement in certified accuracy (e.g. at radius = 0.25) when certification is done using the CDF method instead of the naive bound. The confidence measure based on the margin between average prediction scores yields slightly better certified accuracy when thresholded at zero than the other measure.
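Putting the pieces of this subsection together, the sketch below reflects our reading of the procedure described above (the function name and defaults are illustrative, not taken from the released code): sort the m sampled top-class scores, place the thresholds s_j so that each interval holds the same number of scores, and apply the DKW correction to obtain the lower bounds p̲_{i,s_j} that feed theorem 2.

```python
# Sketch of the level selection and DKW lower bounds described above (illustrative only).
import numpy as np

def dkw_levels(scores, n_levels=20, alpha=0.001):
    """Thresholds s_1..s_n (balanced quantiles) and lower bounds on P(score >= s_j)."""
    scores = np.asarray(scores, dtype=float)
    m = len(scores)
    eps = np.sqrt(np.log(1.0 / alpha) / (2.0 * m))        # DKW half-width
    ordered = np.sort(scores)
    idx = [(i - 1) * m // n_levels for i in range(1, n_levels + 1)]  # 0-based positions (i-1)m/n
    s = ordered[idx]
    p_hat = np.array([(scores >= sj).mean() for sj in s])  # empirical P(score >= s_j)
    p_lower = np.clip(p_hat - eps, 0.0, 1.0)
    return s, p_lower

# Example with synthetic top-class scores from m = 100,000 noisy samples; the output
# (s, p_lower) can be passed to the cdf_lower_bound sketch from section 3.2.
rng = np.random.default_rng(0)
fake_scores = np.clip(rng.normal(0.7, 0.1, size=100_000), 1e-6, 1.0 - 1e-6)
s, p_lower = dkw_levels(fake_scores)
```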
Figure 5: Certified accuracy vs. radius at different cutoffs for the average confidence score with σ = 0.25: (a) average prediction score (CIFAR-10), (b) margin (CIFAR-10), (c) average prediction score (ImageNet), (d) margin (ImageNet). Solid and dashed lines represent certificates computed with and without the CDF bound respectively.

6 Conclusion

While standard certificates can guarantee that a decision is secure, they contain little information about how confident the user should be in the assigned label. We present a method that certifies the confidence scores, rather than the labels, of images. By leveraging information about the distribution of confidence scores around an input image, we produce certificates that beat a naive bound based on a direct application of the Neyman-Pearson lemma. The results in this work show that certificates can be strengthened by incorporating more information into the worst-case bound than just the average vote. We hope this line of research leads to methods for strengthening smoothing certificates based on other information sources, such as properties of the base classifier or the spatial distribution of votes.

7 Broader Impact

We design procedures that equip randomized smoothing with certified prediction confidence, an important property for any real-world decision-making system to have. In applications where robustness is key, like credit scoring and disease-diagnosis systems, it is important to know how certain the prediction model is about the output, so that a human expert can take over if the model's confidence is low. However, this method does not produce any guarantees on the calibration of the underlying model itself. It could happen that the confidence measure used to determine the degree of certainty of the model does not actually reflect the probability of the prediction being correct. In other words, our method depends on the underlying classifier having high accuracy in order to perform reliably. With certificates, there is always a risk of conveying a false sense of confidence, but hopefully, by producing interpretable risk scores along with certificates, our work will help mitigate this problem.

8 Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by DARPA GARD, DARPA QED4RML, and the National Science Foundation Directorate of Mathematical Sciences. This project was supported in part by NSF CAREER AWARD 1942230, grants from NIST award 60NANB20D134, HR00112090132 and a Simons Fellowship on Foundations of Deep Learning.

References

[1] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 274-283, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

[2] Avrim Blum, Travis Dick, Naren Manoj, and Hongyang Zhang. Random smoothing might be unable to certify ℓ∞ robustness for high-dimensional images, 2020.

[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars, 2016.

[4] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[5] Nicholas Carlini and David A. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pages 3-14, 2017.

[6] Ping-yeh Chiang, Renkun Ni, Ahmed Abdelkader, Chen Zhu, Christoph Studer, and Tom Goldstein. Certified defenses for adversarial patches. In 8th International Conference on Learning Representations, 2020.

[7] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1310-1320, Long Beach, California, USA, 09-15 Jun 2019. PMLR.

[8] Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[9] Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O'Donoghue, Jonathan Uesato, and Pushmeet Kohli. Training verified learners with learned verifiers, 2018.

[10] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist., 27(3):642-669, 1956. doi: 10.1214/aoms/1177728174. URL https://doi.org/10.1214/aoms/1177728174.

[11] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku. Adversarial and clean data are not twins. CoRR, abs/1704.04960, 2017.

[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[13] Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models, 2018.

[14] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick D. McDaniel. On the (statistical) detection of adversarial examples. CoRR, abs/1702.06280, 2017.

[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.

[16] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[17] Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4081-4091, 2019. doi: 10.18653/v1/D19-1419. URL https://doi.org/10.18653/v1/D19-1419.
[18] Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263-274, 2011. ISSN 1067-5027. doi: 10.1136/amiajnl-2011-000291. URL https://doi.org/10.1136/amiajnl-2011-000291.

[19] Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. CoRR, abs/2002.03239, 2020. URL https://arxiv.org/abs/2002.03239.

[20] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=BJm4T4Kgx.

[21] Cassidy Laidlaw and Soheil Feizi. Functional adversarial attacks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 10408-10418, 2019. URL http://papers.nips.cc/paper/9228-functional-adversarial-attacks.

[22] Mathias Lécuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pages 656-672, 2019.

[23] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi S. Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 4911-4922, 2019.

[24] Alexander Levine and Soheil Feizi. Wasserstein smoothing: Certified robustness against Wasserstein adversarial attacks, 2019.

[25] Alexander Levine and Soheil Feizi. (De)randomized smoothing for certifiable defense against patch attacks. CoRR, abs/2002.10733, 2020. URL https://arxiv.org/abs/2002.10733.

[26] Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 4585-4593. AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/view/5888.

[27] Alexander Levine, Aounon Kumar, Thomas Goldstein, and Soheil Feizi. Tight second-order certificates for randomized smoothing, 2020.

[28] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 9459-9469, 2019.
[29] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness, 2019. URL https://openreview.net/forum?id=SyxaYsAqY7.

[30] Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5775-5783, 2017.

[31] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[32] Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3578-3586. PMLR, 10-15 Jul 2018. URL http://proceedings.mlr.press/v80/mirman18b.html.

[33] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 10900-10910, Red Hook, NY, USA, 2018. Curran Associates Inc.

[34] Hadi Salman, Jerry Li, Ilya P. Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 11289-11300, 2019.

[35] Sahil Singla and Soheil Feizi. Robustness certificates against adversarial examples for ReLU networks. CoRR, abs/1902.01235, 2019.

[36] Sahil Singla and Soheil Feizi. Second-order provable defenses against adversarial attacks, 2020.

[37] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[38] Jiaye Teng, Guang-He Lee, and Yang Yuan. ℓ1 adversarial robustness certificates: a randomized smoothing approach, 2020. URL https://openreview.net/forum?id=H1lQIgrFDS.

[39] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses, 2020.

[40] Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aäron van den Oord. Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5032-5041, 2018.

[41] Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5283-5292, 2018.

[42] Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes, 2020.

[43] Ping-yeh Chiang, Michael J. Curry, Ahmed Abdelkader, Aounon Kumar, John Dickerson, and Tom Goldstein.
Detection as regression: Certified object detection by median smoothing, 2020.