# understanding_noiseaugmented_training_for_randomized_smoothing__14a8bfc9.pdf

Published in Transactions on Machine Learning Research (04/2023)

Understanding Noise-Augmented Training for Randomized Smoothing

Ambar Pal ambar@jhu.edu Department of Computer Science & Mathematical Institute of Data Science Johns Hopkins University Baltimore, MD 21218, USA

Jeremias Sulam jsulam1@jhu.edu Department of Biomedical Engineering & Mathematical Institute of Data Science Johns Hopkins University Baltimore, MD 21218, USA

Reviewed on Open Review: https: // openreview. net/ forum? id= fvyh6m DWFr

Randomized smoothing is a technique for providing provable robustness guarantees against adversarial attacks while making minimal assumptions about a classiﬁer. This method relies on taking a majority vote of any base classiﬁer over multiple noise-perturbed inputs to obtain a smoothed classiﬁer, and it remains the tool of choice to certify deep and complex neural network models. Nonetheless, non-trivial performance of such smoothed classiﬁer crucially depends on the base model being trained on noise-augmented data, i.e., on a smoothed input distribution. While widely adopted in practice, it is still unclear how this noisy training of the base classiﬁer precisely aﬀects the risk of the robust smoothed classiﬁer, leading to heuristics and tricks that are poorly understood. In this work we analyze these trade-oﬀs theoretically in a binary classiﬁcation setting, proving that these common observations are not universal. We show that, without making stronger distributional assumptions, no beneﬁt can be expected from predictors trained with noise-augmentation, and we further characterize distributions where such beneﬁt is obtained. Our analysis has direct implications to the practical deployment of randomized smoothing, and we illustrate some of these via experiments on CIFAR-10 and MNIST, as well as on synthetic datasets.

1 Introduction

Machine learning classiﬁers are known to be vulnerable to adversarial attacks, wherein a small, humanimperceptible additive perturbation to the input is able to produce a change in the predicted class (Szegedy et al., 2013). Because of clear security implications (Kurakin et al., 2016), this phenomenon has sparked increasing amounts of work dedicated to devising defense strategies (Metzen et al., 2017; Gu and Rigazio, 2014; Madry et al., 2017) and correspondingly more sophisticated attacks (Carlini and Wagner, 2017; Athalye et al., 2018; Tramer et al., 2020), with each group trying to triumph over the other in an arms-race of sorts. As a result, an increasing number of works have begun providing theoretical guarantees and understanding of adversarial attacks and defenses, studying learning theoretic questions (Shafahi et al., 2018; Cullina et al., 2018; Bubeck et al., 2018; Tsipras et al., 2018), understanding the sample complexity of robust learning (Schmidt et al., 2018; Yin et al., 2018; Tu et al., 2019; Awasthi et al., 2019), characterizing the optimality of attacks and defenses (Pal and Vidal, 2020), analyzing the implications of robust representations (Awasthi et al., 2020; Sulam et al., 2020; Muthukumar and Sulam, 2022), and more.

Published in Transactions on Machine Learning Research (04/2023)

One of the central objects of study in this setting are robustness certiﬁcates, which provide provable guarantees on the prediction of models as long as the input is not perturbed beyond a speciﬁc contamination level (Cohen et al., 2019; Raghunathan et al., 2018; Yang et al., 2020). In this vein, randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) employs a base classiﬁer, typically vulnerable to adversarial attacks, and derives from it a smoothed version by taking a majority vote of the base classiﬁer s outputs on several (stochastic) noise-perturbed versions of the input. While making minimal assumptions about the base model, the resulting smoothed predictor is provably robust to input perturbations of bounded ℓp norm (Yang et al., 2020). Because this is applicable to any given predictor, randomized smoothing has arguably become the most practical certiﬁed defense technique against adversarial examples for deep learning models, which are often too complex be analyzed and certify otherwise.

Nonetheless, the accuracy of the resulting smoothed model diﬀers from that of its base classiﬁer. It has been empirically observed that smoothing a base classiﬁer out of the box, i.e., without any modiﬁcations to the network weights or structure, does not provide good performance, for example in terms of certiﬁed accuracy (Cohen et al., 2019; Gao et al., 2019). Tricks like noise-augmented training for the base classiﬁer need to be employed in order to obtain meaningful results and defense certiﬁcates. This phenomenon, while pervasive in practice, does not have a good theoretical understanding. There are several open questions surrounding the relationship between the base classiﬁer and its smoothed version: Why is noise-augmented training useful when training the base classiﬁer? How does the performance of the smoothed classiﬁer depend on the distribution of the training noise? Is noise perturbed training always needed for all data-distributions, and if not, could we determine this before modifying the training procedure? In this paper, we take a step towards answering some of these questions for a general binary classiﬁcation task in Rd.

In summary, our contributions are as follows:

1. For a general bounded data-distribution, we derive an upper bound on the benign risk obtained from smoothing a classiﬁer that has been trained with noise augmentation. Surprisingly, this bound suggests that noise-augmented training can in fact be harmful to the benign risk of the ﬁnal smoothed classiﬁer, which is contrary to the observed phenomenon in practical applications.

2. We then characterize a family of data distributions for which the above bound is tight, and for which training on noise-augmented data is only detrimental. These distributions are characterized by a notion of large interference distance, which we formalize.

3. We show that this notion of interference distance captures some of the trade-oﬀs in randomized smoothing, and we prove that there exist a family of distributions where beneﬁts can be obtained from noise-augmented training as observed in practice.

4. Our empirical experiments suggest, ﬁrstly, that real data distributions lie in the low-interference distance regime, and hence noise-augmented training is beneﬁcial to randomized smoothing. Secondly, contrary to practice, our proofs suggest that the parameters of the noise-distribution for noiseaugmented training and that of randomized smoothing need not be the same, and that allowing for diﬀerent values of these parameters lead to improved results. Our experiments on MNIST and CIFAR-10 conﬁrm this theoretical intuition.

The rest of the paper is organized as follows. In Section 2 we describe preliminaries and set up our notation and framework. In Section 3 we introduce our main results informally, obtaining conditions that characterize data distributions where noise-augmented training provides an advantage, and those where it does not. Section 4 presents our main results in greater technical detail. Lastly, we conduct our synthetic and real data experiments in Section 5, and conclude in Section 7 highlighting our answers to the questions posed above.

2 Preliminaries and Setup

We consider a binary classiﬁcation 1 task on data X X Rd with labels Y Y = {0, 1}. The random variable X follows a data distribution over the space X with PDF p X. In turn, the label or response variable

1We comment on extensions to multiple classes later in Section 4.

Published in Transactions on Machine Learning Research (04/2023)

Y follows a distribution p Y over its corresponding space. We assume that X is bounded (we take diameter 1 for simplicity). A supervised learning task is deﬁned by the joint distribution p X,Y , and the goal is to obtain a predictor h: X [0, 1] so that h is a good approximation for the conditional expectation E[Y |X = x]. These distributions are unknown, however, so the learning problem consists of ﬁnding such a predictor from a set of samples, typically identically and independently distributed from p X,Y .

In this work we will not dwell much on the learning problem, and instead we will assume we are given access to a function h: X [0, 1] which is a good approximation 2 for the conditional p Y |X. The base classiﬁer is given by the composition of h with a discretizing decision mapping, ψ: [0, 1] Y. In our setting, ψ(z) = 1[z 0.5], where 1[ ] denotes the indicator function. Note that if h is the real conditional distribution of the label for a given input, this coincides3 with the Bayes classiﬁer for this problem.

Adversarial examples are small additive perturbations designed for an input sample such that the predicted label at the perturbed input is incorrect. Typically these perturbations are constrained to be in some ℓq norm ball, i.e., δ q ϵ, so as to be imperceptible to humans. It has been extensively shown that models that achieve excellent accuracy in normal settings can misclassify samples perturbed even with very small values of ϵ (Goodfellow et al., 2014). The goal of certiﬁed robustness methods is to guarantee that the output of a certain model at a given input4, ψ(h)(x), will not change when contaminated by perturbations in a ℓq-ball with radius of at most ϵ . That is, certifying that ψ(h)(x) = ψ(h)(x + δ) for every δ : δ q ϵ .

Randomized smoothing (Cohen et al., 2019) achieves this by smoothing the classiﬁer ψ(h) with an isotropic distribution. More precisely, the smoothed classiﬁer Smooth(ψ(h)) is constructed from ψ(h) by

Smoothp V (ψ(h))(x) def = arg max c Y P[ψ(h)(x + V ) = c], (1)

where V p V . The choice of the distribution p V centrally depends on the norm constraint of the adversarial perturbation. Randomized smoothing was ﬁrst introduced as a certiﬁcation method against ℓ2-bounded perturbations, for which p V = N(0, β2Id). However, this has been extended to other norms in a series of recent works (Lecuyer et al., 2019; Li et al., 2019; Yang et al., 2020), yielding correspondingly diﬀerent distributions p V . In particular, for the binary case with ℓ2-bounded perturbations, randomized smoothing measures the probability of the predicted class (say 1) after smoothing as s = P(h(x + V ) = 1). Then, it provides a certiﬁed radius ϵ = βΦ 1(s) that depends on this probability s (0.5, 1] as well as the variance of the smoothing distribution β2, where Φ 1 is the inverse of the standard Gaussian CDF. Before continuing, it will be useful for our discussions to employ the following equivalent form of randomized smoothing.

Proposition 2.1. When p V is symmetric, and for the binary classiﬁcation task deﬁned above, the smoothed classiﬁer is given by Smoothp V (ψ(h))(x) = ψ((ψ(h) p V ))(x), where denotes the convolution operation.

Proof. Consider the random variable V p V and simplify Smoothp V (ψ(h))(x) as,

Smoothp V (ψ(h))(x) = arg max c {0,1} P[ψ(h)(x + V ) = c]

= ψ(P[ψ(h)(x + V ) = 1]) = ψ Z 1[ψ(h)(x + v) = 1]p V (v)dv

= ψ Z ψ(h)(x + v)p V (v)dv = ψ(ψ(h) p V )(x).

In this manuscript, we will work with single-parameter smoothing distributions pβ. Accordingly, we will simplify our notation to denote the smoothed classiﬁer as Smoothβ(ψ(h))(x).

We consider the setting where a certain minimum level of robustness is required at deployment, say ϵ . In other words, we want our certiﬁed radius to be at least ϵ . From the previous discussion, this implies that

2Our results are developed for h = E[Y |X], but we show that most of them can be extended to the approximate case in Appendix G. 3Up to the deterministic nature of ψ(z) whenever z = 0.5. 4For simplicity of notation in our analysis, we use ψ(h)(x) to denote the composition ψ h at x.

Published in Transactions on Machine Learning Research (04/2023)

βΦ 1(s) ϵ , implying that we need a minimum strength for randomized smoothing, say β def = (1/Φ 1(s))ϵ . Then, given that one will be required to employ Smoothβ (ψ(h))(x), how should h be obtained?

To make the above question precise, we recall the deﬁnition of the risk of a classiﬁer f under the datadistribution p X,Y as R(f) = P(f(X) = Y ). When f = Smoothβ (ψ(h)), this becomes

R(Smoothβ (ψ(h))) = P(Smoothβ (ψ(h))(X) = Y ).

As mentioned in the introduction, obtaining h via natural training typically leads to a certiﬁable predictor with higher error than that of the base classiﬁer, i.e., R(Smoothβ (ψ(h))) > R(ψ(h)). This is expected, as the classiﬁer ψ(h) was precisely trained to minimize the risk R(ψ(h)), whereas its smoothed counterpart Smoothβ(ψ(h)), was not. Such a phenomenon can also be regarded as an out-of-distribution (OOD) problem, albeit in a very speciﬁc setting where the distribution shift is produced by a certiﬁcation method. Indeed, it has been widely demonstrated in practice that naïvely smoothing any base classiﬁer leads to a signiﬁcant degradation in benign accuracy, i.e., accuracy on non-adversarially-corrupted samples, and hence to a lower certiﬁed-accuracy. Besides the well established empirical evidence reported in the literature (Gao et al., 2020; Cohen et al., 2019), we will also showcase this phenomenon in our experiments.

To alleviate the discrepancy, practitioners have resorted to retraining the base classiﬁer ψ(h) with a noiseaugmented data distribution instead, denoted here by ps X. This is achieved by adding noise V p V to the data variable X p X to obtain the noise-augmented data Xs = X + V ps X. Following general intuition, in practice one employs the same smoothing distribution for noise-augmentation as that employed for the certiﬁcation stage. If we let ps X be the PDF of Xs, it is well known that ps X is the convolution of p X and p V , i.e. ps X = p X p V .

Throughout this work, we will make minimal assumptions about the underlying data distribution, the parameterization of the predictor, and the (ﬁnite) dimension of the data. However, we will assume that the hypothesis class is rich enough, and the learning algorithm good enough, such that the true conditional expectations are learnt successfully. As we show in Appendix B, this implies that the Bayes classiﬁers ψ(h) and ψ(h p V ) are learnt successfully under data distributions PX and P s X, respectively. Therefore, a classiﬁer learnt on the noise-augmented data results in Smoothβ(ψ(h p V )), which will be the central object of our study.

While the assumption of learning the Bayes classiﬁers exactly might seem stringent, much of our results can be adapted to relax this assumption while maintaining our general proof technique, assuming a controlled diﬀerence between the Bayes and the obtained classiﬁer. In Section 4 and Appendix G, we discuss this further and extend our results to handle errors in learning. On the one hand, our analyses of the Bayes classiﬁers are informative because they reﬂect the best possible predictors that can be learned from data. On the other hand, and importantly, we will illustrate how these assumptions are reasonable and suﬃcient to depict what is observed in relevant scenarios. Indeed, our theoretical results and simulations on synthetic experiments will resemble the observations in natural image data, presented in Section 5.

As earlier, we will employ noise-augmentation distributions parameterized by a single parameter. Accordingly, we will denote our noise-augmentation distribution as pα to highlight the only parameter, α. For instance, when augmenting data with a Gaussian we have that p V = pα = N(0, α2I); and in the case of a uniform distribution supported over a ℓ2 ball of radius α, one has p V = pα = Unif(B2(0, α)). In this way, the ﬁnal smoothed classiﬁer is given by Smoothβ(ψ(h pα)). A key goal of our work is to understand the interplay between these two parameters: β (controlling the certiﬁed radius) and α (controlling the noise-augmentation of the data distribution).

3 Main Results

Our central aim in this section is to theoretically characterize the empirically observed diﬀerence between the benign risks of the original classiﬁer ψ(h), and that of the smoothed classiﬁer trained on noise-augmented data, Smoothβ(ψ(h pα)). In other words, we want to obtain an estimate of the excess risk

α,β(h) = R(Smoothβ(ψ(h pα))) R(ψ(h)), (2)

Published in Transactions on Machine Learning Research (04/2023)

Small ζh Large ζh

ψ(h) ψ(h pα) Smoothβ(ψ(h pα))

|ψ(h) Smoothβ(ψ(h pα))|

Figure 1: Theorem 3.1 and Theorem 3.3 deal with the cases of large (top row) and small (bottom row) interference distance ζh, respectively, where ζh is the average of the pairwise distances between the orange regions, denoted by the black arrows. The text after Theorem 3.3 provides a detailed description of the diﬀerent regions in the ﬁgure.

where the risk measures the probability of error over p X,Y (see Section 1). We will present our result conceptually and slightly informally here, and we will provide a more detailed version of these later in Section 4.

In our ﬁrst result, we demonstrate that the behaviour observed in practice (namely, that performing noiseaugmented training is beneﬁcial5) is not universal. We do this by showing that there exists a class of distributions H1 with a certain minimal interference distance property where training the classiﬁer on a noise-augmented distribution before performing the smoothing for certiﬁcation is in fact detrimental. The interference distance ζh for a classiﬁer ψ(h) can be informally thought of as the average ℓ2 distance between any two disjoint input regions classiﬁed as 1 by ψ(h) (normalized by the size of the domain). Figure 1 provides a simple illustration, where ζh is the average length of all the black arrows. We defer the formal deﬁnition of this property to Section 4, where we will use tighter characterizations by using the minimum and maximum of such lengths.

Theorem 3.1 (No need for noise-augmentation). There exists a class of distributions H1 with large interference distance, such that, for all h H1, training on a smoothed distribution has no beneﬁt; that is 0,β(h) < α,β(h) for all α, β > 0.

This result is of importance because it sheds light on the kind of behaviour that can be expected while making minimal assumptions of the (conditional) distribution h. Indeed, we now present a result that upper bounds α,β(h) as a function of the smoothing parameters, α, β.

Theorem 3.2 (Simpliﬁed Upper-Bound). For any h with bounded support, α,β(h) Gα,β, where Gα,β is a monotonically increasing function of both α and β.

We pause to make a few remarks about Theorem 3.2. First, we see that Gα,β increases with the certiﬁcation parameter β, suggesting that the excess risk α,β may increase as we increase β. This behavior is expected and reﬂects what is seen in practice. Recall that the inverse of the smoothness parameter of the classiﬁer Smoothβ(ψ(h pα)) is proportional to the randomized smoothing strength, β. Hence as β increases, the classiﬁer changes more slowly, and more error is incurred in the spiky regions of the domain where the true class ﬂuctuates rapidly. This demonstrates the well known accuracy-robustness trade-oﬀ: larger β leads to better robustness certiﬁcates but worse benign performance (Gao et al., 2020; Cohen et al., 2019).

5Note that we are not studying the robust risk of the smoothed classiﬁer. Nevertheless, understanding properties of the benign risk is an essential step towards understanding the robust risk, as the former is a lower bound on the latter.

Published in Transactions on Machine Learning Research (04/2023)

Secondly, we see that Gα,β increases monotonically with α. This suggests that, for any ﬁxed value of β, the excess risk α,β is minimized at α = 0; i.e, it is better not to perform noise-augmentation during training. This is contrary to the behavior of randomized smoothing in practice: to obtain a smoothed classiﬁer Smoothβ(ψ(h pα)) with a good benign risk a practitioner typically sets α = β > 0. One might be tempted to conclude that this seeming contradiction stems from the fact that the bound in Theorem 3.2 is too loose to be informative. Yet, we ﬁnd that this is not the case. In fact, Theorem 3.2 reﬂects the behaviour of α,β(h) for a general h with bounded support, and this upper bound is tight for functions h H1, i.e., the family of distributions from Theorem 3.1.

If retraining on noise-augmented data is not universally needed, then for what family of distributions is it beneﬁcial? It turns out that our notion of interference distance ζh controls this trade-oﬀ, as we now show.

Theorem 3.3 (Noise-Augmentation helps for distributions with small interference distance). There exists a class of distributions H2 with small interference distance, such that, for all h H2, training on noiseaugmented data is favorable; i.e., there exists α, β > 0 such that α,β(h) < 0,β(h).

Before delving into our detailed results in Section 4, we summarize a few key implications of Theorem 3.1 and Theorem 3.3 for randomized smoothing, as illustrated in the example in Figure 1. There, a classiﬁer ψ(h) predicts the class 1 in the orange regions of the input space X, and class 0 otherwise. In the large separation regime that is captured by the family of distributions in H1, the orange regions are far apart and have a very small eﬀect on each other (this is seen pictorially as each positive regions shrinks uniformly upon smoothing as if there were no other positive regions). In these cases, training on noisy data leads to each of the orange regions to shrink in the ﬁnal smoothed classiﬁer Smoothβ(ψ(h pα)) for any α > 0, and the smoothed classiﬁer with noisy training has larger benign risk than training with no noise (Figure 1 top row). On the other hand, as the separation parameter decreases the orange regions get closer to each other eventually having a signiﬁcant eﬀect on their neighbors after smoothing (Figure 1 bottom row). In this regime, it becomes possible to set α > 0 to achieve better benign risk after smoothing compared to smoothing a classiﬁer not trained with noisy data (i.e. with α = 0). Such distributions are captured in H2. For these distributions, the result in Theorem 3.2 is in fact not tight, and thus beneﬁt can be derived from setting α > 0.

With these informal results in mind, we now present them with greater rigour in the next section, before moving to the numerical experiments later in Section 5.

4 Detailed Results

In this section we will expand and make our results from Section 3 more precise. We will ﬁrstly prove in Section 4.1 the phenomenon observed in the large separation regime illustrated in the top panel of Figure 1, and demonstrated in our experiments (Figure 3). Next, in Section 4.2, we will construct data distributions that have a speciﬁc structure, following the phenomenon observed in the small separation regime, illustrated in the bottom panel of Figure 1.

Given a conditional distribution h, we deﬁne a partition of the input space X as X = I I+, where I+

contains all the points where the classiﬁer outputs 1, i.e., I+ = {x X : ψ(h)(x) = 1}. Further, we think of I+ as being composed of disjoint positive partitions I1, I2, . . . , IM such that each partition is simply connected 6. I is the rest of the space, i.e., X \ I+, and is partitioned analogously. In other words,

I+ = I1 I2 . . . IM and I = X \ I+,

where denotes the disjoint union. Note that how to exactly obtain a valid partition above is unspeciﬁed we construct the partitions explicitly for Theorems 4.1 and 4.3, and Theorem 4.2 holds for any valid partitioning. In Figure 1, these regions are colored orange. When a classiﬁer is trained on data perturbed by noise pα, and then smoothed by noise pβ, each of these positive partitions change in some way. Denote these latter partitions I1(α, β), . . . , IM(α, β). In the following two subsections, we will show how the diﬀerence between

6A subset of the space X is deﬁned in the standard topological sense to be simply connected, if it is a connected region containing no holes, i.e., every curve can be continuously contracted to a point.

Published in Transactions on Machine Learning Research (04/2023)

Ik and Ik(α, β) changes under diﬀerent separation regimes, and how this in turn determines the behavior of α,β.

4.1 Large Separation Regime

We will ﬁrst formalize and prove a version of Theorem 3.1 by constructing suitable distributions H1 with a large interference distance. We will then generalize the proof strategy to arbitrary distributions to obtain Theorem 3.2.

Unless speciﬁed otherwise, we assume that the noise distributions (i.e., those for smoothing) pθ(x) we are working with are nice, in the following speciﬁc sense. Nice probability distributions are decreasing in their argument x, and are spherically symmetric. Formally, a probability distribution p over Rd parameterized by a scalar α R is deﬁned to be nice, if for all x1, x2 Rd, p satisﬁes the properties

(Decreasing in argument norm) pα(x1) pα(x2), if x1 2 x2 2, and,

(Spherically symmetric) pα(x1) = pα(x2), if x1 2 = x2 2.

Nice distributions comprise Uniform and Gaussian as special cases. Throughout the paper, we consider the family of pα and pβ to be the same (e.g., both Uniform or both Gaussian). This is not required for our analyses, and simple extensions of our results could generalize further to these having diﬀerent forms.

The lower interference distance for a given h is now deﬁned as the minimum distance between any two positive partitions, i.e., ζh = mini =j dist(Ii, Ij). The lower interference distance over a family H is deﬁned as ζH = minh ζh. We can now state the detailed version of Theorem 3.1. The full proof, along with further properties of these nice distributions, can be found in Appendix D.

Theorem 4.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1 > max(α, β), such that for all h H1 we have 0,β(h) < α,β(h) for all 0 < α < ζH1 and all 0 < β < ζH1.

While we tightly bound α,β from above and below in Theorem 4.1, we are able to do so by assuming that the positive partitions I+ are perfectly spherical. In the following result, we show that we can in fact generalize our proof technique to obtain an upper-bound for α,β in Theorem 4.2 that only assumes that h is supported on the (bounded) data domain X, with partitions of arbitrary shape. In doing so, we obtain an upper-bound that might be loose in general. However, we note that there exist simple distributions (H1) where the upper-bound in Theorem 4.2 is tight, implying that it is the best one could hope for without assuming anything else about h. The full proof can be found in Appendix E.

Theorem 4.2. For nice noise distributions pα, pβ, for any h supported on a bounded domain X,

k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)), (3)

where rk α = r

Ψ 1 α 0.5 0.5+τ if Ik is a positive partition, i.e., Ik I+ and rk α = r

Ψ 1 α 0.5 0.5 τ otherwise,

i.e., Ik I . Further, rk α,β = rk α + q

Ψ 1 β (0.5), and Ψα, Ψβ are the CDFs of the distribution of z 2 2 when z pα and z pβ, respectively. p X(S) denotes the measure of the set S under p X. The upper bound (3) holds for any choice of {(ωk h,τ, ˆxk)} such that ˆxk is the center of a ball of radius ωk h,τ completely contained in Ik,τ deﬁned as Ik,τ = {x Ik : |h(x) 0.5| τ}. We can choose the sequence {(ωk h,τ, ˆxk)} such that the upper bound (3) is minimized.

This results upper bounds the excess risk by one minus the sum of the measures of balls under the data distribution. Each partition Ik contributes to the upper bound in (3) via the diﬀerence (ωk rk). The ﬁrst term ωk h,τ denotes the inradius of a subset of Ik, deﬁned as the portion of Ik classiﬁed with a conﬁdence margin τ, i.e., Ik,τ = {x Ik : |h(x) 0.5| τ}. The second term rk captures the shrinkage produced by smoothing. Speciﬁcally, rk α denotes the shrinkage caused in Ik while moving from the original classiﬁer to the

Published in Transactions on Machine Learning Research (04/2023)

c1 c2 c3 c4

c1 c2 c3 c4

ψ(ψ( h) pβ0)

c1 c2 c3 c4

c1 c2 c3 c4

ψ(ψ( h pα) pβ0)

Noise-Augmented

Randomized Smoothing

Randomized Smoothing

Figure 2: An example h H2 where noise-augmented training is beneﬁcial for randomized smoothing. The orange and blue shaded bars denote the regions where the smoothed classiﬁers make errors, i.e., ψ( h) = Smoothβ0(ψ( h)) and ψ( h) = Smoothβ0(ψ( h pα0)) respectively. Assuming a suitable data-distribution p X, the orange region can have a higher p X-mass than the blue region, i.e., α0,β0( h) < 0,β0( h). See the text below for details of the four sub-ﬁgures.

noise-trained classiﬁer, i.e., ψ(h) ψ(h pα). Similarly, rk α,β denotes the shrinkage caused while moving from the noise-trained classiﬁer to the randomized smoothed classiﬁer, i.e., ψ(h pα) ψ(ψ(h pα) pβ). For any ﬁxed β, as we increase the noise-augmentation strength α, the shrinkage rk α,β also increases, leading to an increase in the upper bound in Eq. (3).

4.2 Small Separation Regime

We will show that, unlike what is reﬂected by Theorem 4.1, there exists a family of distributions H2 characterized by a small separation distance where α,β0 is indeed minimized at some α > 0, which is the behavior observed in practice in common image datasets.

We will begin with an illustrative one-dimensional example of a data distribution supported on X R where we will show that the interference distance ζ explicitly controls whether any beneﬁt can be obtained by noise-augmentation. We will observe the basic structure required in h for this to occur, and then generalize this structure to construct examples supported on X Rd forming the family H2 required for Theorem 4.3.

1 dimensional example We will now walk through the four panels of Figure 2. First, we let h be deﬁned as

( 0 x ( 0.5, c1) (c2, c3) (c4, 0.5) 1 otherwise, (4)

where the speciﬁc values of c1, . . . c4 are not important for the example but can be found in Appendix F. Additionally, let p X be uniform in [c1, c4] this assumption is not needed, but simpliﬁes the following discussion. The resulting ψ( h) is plotted in the top-left panel of Figure 2.

For smoothing, we use the nice distribution pβ0 = Unif([ β0/2, β0/2]) where β0 is just large enough, so that (ψ( h) pβ0)(x) = 0 for all x [c1, c4]. In other words, randomized smoothing ensures that the smoothed classiﬁer Smoothβ0(ψ( h)) outputs a consistent label across the input, hence ensuring robust classiﬁcation (and in particular, that label is 0). However, observe that the benign risk of the smoothed classiﬁer R(Smoothβ0(ψ(h))) is very high, as it makes an error whenever x [c1, c2] [c3, c4]. This situation is shown in the bottom left panel of Figure 2.

Published in Transactions on Machine Learning Research (04/2023)

We now consider the nice noise distribution for noise-augmented training, deﬁned as pα0 = Unif([ α0/2, α0/2]). For a suitable value of α, we can ensure that ψ( h pα0)(x) = 1 for all x [c1, c4]. This is shown in the top right panel of Figure 2. But now, since β0 was just enough to obtain Smoothβ0(ψ( h)) = 0, the additional positive mass in [c2, c3] causes the classiﬁcation to ﬂip, and now Smoothβ0(ψ( h) pα0)(x) = 1 for all x [c1, c4]. The benign risk of the smoothed classiﬁer now is much lower than earlier, as it only makes an error on the small crevice [c2, c3]. This situation is shown in the bottom right panel of Figure 2. Thus, one can choose h, p, α0, β0 such that R(Smoothβ0(ψ( h) pα0)) < R(Smoothβ0(ψ( h))), (5)

demonstrating that training with noise augmentation is indeed beneﬁcial when h has a low interference distance. The speciﬁc values of α0, β0 can be found in Appendix F.

Fixing the noise-augmentation distribution pα0 and the smoothing distribution pβ0, we now modify h to h by increasing the interference distance |c2 c3| while maintaining the same structure, i.e. h(x) = 0 when x ( 0.5, c1) ( c2, c3) ( c4, 0.5), and h(x) = 1 otherwise. When | c2 c3| > α0/2, noise-augmentation with pα0 is no longer eﬀective, i.e., ψ(ψ( h) pα0) = ψ(ψ( h) pα0). Intuitively, this is caused by insuﬃcient positive mass near any class-0 point for the prediction to ﬂip after noise-augmentation and this can be veriﬁed using the values of α0, β0 provided in Appendix F). Applying randomized smoothing on ψ(ψ( h) pα0) now simply leads to Smoothβ0(ψ( h) pα0)(x) = 0 everywhere, implying that the risk after smoothing is high, i.e.,

R(Smoothβ0(ψ( h) pα0)) R(Smoothβ0(ψ( h))). (6)

Thus, we have shown via this example that the interference distance directly controls whether we get any beneﬁt out of training with noise-augmentation (5) or not (6).

The distance |c2 c3| above corresponds to a small upper interference distance ζ, which we deﬁne now as the maximum distance between any two positive partitions, i.e., ζ = maxi =j dist(Ii, Ij). With this deﬁnition, we can now extend the ideas in the simple example above to general constructions in d dimensions.

Theorem 4.3. There exist nice distributions pα, pβ, a family H2 with low interference distance ζH2, and data-distributions p X, such that for all h H2 we have 0,β0(h) > α,β0(h) for some α > 0, β0 > 0.

We pause again to understand some implications of Theorem 4.3. From the one-dimensional example and the proof of Theorem 4.3, we see that training with noisy data, i.e., α > 0, is better for distributions with low interference distance. We conjecture that natural distributions follow this structure as well, containing regions corresponding to one class that are packed closely in the domain. Interestingly, we ﬁnd evidence in the favor of this conjecture in our experiments in Section 5. Moreover, this result also shows that there is no reason for the noise level α to be the same as the smoothing level β for ﬁnal smoothed classiﬁer to obtain its lowest risk. This suggests that the common practice of using α = β for randomized smoothing (e.g., see (Cohen et al., 2019)) might not be optimal. Indeed, we will shortly demonstrate in our experimental section that better choices for smoothing can be found by relaxing the constraint that the training noise level α should be equal to β. Additionally, our ﬁndings might provide a theoretical foundation for recent works (Alfarra et al., 2020; Anderson and Sojoudi, 2022; Súkeník et al., 2021) on obtaining an adaptive β for smoothing at each input point.

Exactly Learning Bayes Classiﬁers We pause here to comment on the assumptions we made in analyzing randomized smoothing on a noise-smoothed distribution, Smoothβ(ψ(h pα)). As we mentioned above and as we prove in Appendix B our analysis assumed that one has access to the true conditional distribution of E[Y |Xs]. However, in more realistic settings, the learned predictor might deviate from this optimal value. We now argue that our analysis is still relevant in these cases, too.

On the one hand, in Theorem G.1, we provide an extension of our result in Theorem 4.1 by considering a predictor that is an inexact approximation to the conditional expectation. In other words, we assume that training with noise-augmentation results in a function g(x) that is not too far from the true smoothed conditional distribution, |g(x) h pα(x)| η for all x X. As we show in Appendix G, our same proof technique allows us to show that indeed, there exist distributions where no beneﬁt can be expected from

Published in Transactions on Machine Learning Research (04/2023)

randomized smoothing, even in this more general case where the assumption of learning the true conditional distribution is relaxed.

On the other hand, it is also not hard to see that our results that provide an upper bound to the excess risk based on Bayes classiﬁer, α,β(h), are also useful in characterizing upper bounds to more realistic predictors that might deviate from the true conditional distribution. Indeed, given access to g that is only an approximation to h pα, one can show (see our derivation in Appendix G) that the excess risk of smoothing g can be upper bounded by

α,β(g, h) α,β(h) + p X ((h pα)(X) = g(X)) . (7)

As a result, our upper bound in Theorem 4.2 also provides an upper bound to the excess risk of more general predictors g and the tightness of this bound is naturally controlled by the mass under p X of the errors between g and h pα.

Extension to Multiple Classes Although our theory was developed for binary classiﬁcation, a standard extension to K-way classiﬁcation is possible, using the technique followed while certifying multi-class Randomized Smoothing based classiﬁers Cohen et al. (2019), Yang et al. (2020). In this context, the class Y = 0 now represents one of the K-classes, and Y = 1 represents all the remaining classes. All our theorems would then extend, having an additional parameter for the class viewed as 0.

Having established our theoretical results, we now turn to an empirical veriﬁcation of some of them on synthetic datasets, as well as MNIST and CIFAR-10.

5 Experiments

Synthetic Experiments. We begin by conducting synthetic experiments7 with distributions resembling those used in the proofs of our existence results Theorems 3.1 and 3.3. We take X = [0, 100] [0, 100] as the data domain, and sample the positive partitions of h as spheres at random locations in the domain (see Appendix A for more details about the construction) to obtain the data-distribution hζ, where the superscript ζ denotes the interference distance. Figure 3 shows a few examples of hζ for diﬀerent values of ζ.

ζ = 0 ζ = 10 ζ = 20 ζ = 30

Figure 3: Examples of ψ(hζ) for diﬀerent values of ζ. The blue regions denote ψ(hζ)(x) = 0 and the orange regions denote ψ(hζ)(x) = 1.

In Figure 4, we report quantitative results using our synthetic distributions as described above. Over several choices of interference distance ζ, we ﬁrst sample several synthetic data distributions hζ. We then plot the quantity α,β 0,β, which is the diﬀerence between the benign risks of smoothing a noise-augmented classiﬁer, i.e., R(Smoothβ(ψ(h pα)), and smoothing a classiﬁer not trained with noise-augmentation, i.e., R(Smoothβ(ψ(h))). We increase the smoothing strength β moving from left to right in Figure 4, and plot the diﬀerence in the risks as a function of α for each β. Recall that noise-augmentation is useful only when we can

7We provide code for all of our experiments at https://github.com/ambarpal/randomized-smoothing.

Published in Transactions on Machine Learning Research (04/2023)

ﬁnd α0, β0 such that α0,β0 0,β0 < 0. Accordingly, we plot a dashed line whenever noise-augmentation is useful, i.e., α0, β0 α0,β0 < 0,β0 and solid otherwise, i.e., α, β α,β 0,β.

At high ζ, we are operating in the large interference distance regime and the results from Section 4.1 apply. Those results dictate that α,β0 should be minimized at α = 0 for any ﬁxed β0 > 0. Indeed, this phenomenon can be seen in Figure 4 for ζ = 3, where we see that the line remains solid for all β.

On the other hand, at lower ζ, we go into the small interference distance regime and the results from Section 4.2 apply. Those results dictate that α,β0 should be minimized at some α > 0 given a large enough, but ﬁxed β0 > 0. Indeed this phenomenon is seen in Figure 4, where the lines tend to become dashed as β increases. Thus, our synthetic experiments conﬁrm the predictions from our theory.

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α

Δα, β Δ0, β

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α

0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α

Figure 4: Synthetic Experiments showing the variation in α,β. The line is dashed if there exists α > 0 such that α,β < 0,β, and solid otherwise.

MNIST, CIFAR-10 Experiments. We now proceed to conduct experiments with real data distributions. For a range of ﬁnely spaced α in [0, 1], we train a standard CNN hα with isotropic Gaussian noise-augmentation with variance α2, i.e., pα = N(0, α2I). For each of these trained models, we use the isotropic-Gaussian with variance β2, with β [0, 1], as the smoothing distribution, i.e., pβ = N(0, β2I), to obtain the smoothed classiﬁer Smoothβ(ψ(h pα)). We then plot the standard empirical estimate of the risk, ˆ α,β = P

(x,y) Stest 1[Smoothβ(hα)(x) = y] over the test set Stest, for diﬀerent α, β in Figure 5.

The ﬁrst observation that we make from the MNIST and CIFAR-10 plots in Figure 5 is that a non-zero data-augmentation is always beneﬁcial, i.e., α > 0. While not an implication arising from our theory, this behavior is reminiscent of our results in Theorem 4.3, suggesting that real data distributions lie in the small interference distance regime. Additionally, we note the similarity of the synthetic experiment plots at low ζ in Figure 4 and the real-data curves in Figure 5 (upto scaling of the α-axis, and the fact that we subtract 0,β for normalization in Figure 4), providing empirical validation for our theoretical assumptions.

The second observation we make from Figure 5 is that, as suggested by Theorem 4.3, the smoothing parameter need not be the same as the data augmentation parameter for the best performance of the smoothed classiﬁer. This phenomenon has also been observed empirically in prior work Alfarra et al. (2020), Salman et al. (2019), Zhai et al. (2020), and our results provide a theoretical foundation for such observations.

6 Related Work

In this work, we study randomized smoothing, which is currently the method of choice for obtaining robustness guarantees for deep, complex neural network classiﬁers. Since the early proposals of stochastic smoothing (Liu et al., 2018), many works have derived robustness certiﬁcates against adversarial attacks for smoothed classiﬁers (Lecuyer et al., 2019; Li et al., 2018; Cohen et al., 2019). These certiﬁcates obtain a radius r around an input x such that the network prediction is robust in an ℓp ball of radius r around x. Studied

Published in Transactions on Machine Learning Research (04/2023)

(b) CIFAR-10

Figure 5: Plot of ˆ α,β for MNIST (5a) and CIFAR-10 (5b). For each ﬁxed value of β = β0, the legend α

(and the corresponding shaded disk on the plot) shows the minimizer of the α,β0 curve.

balls include p = 0 (Levine and Feizi, 2020a; Lee et al., 2019), p = 1 (Teng et al., 2019), p = 2 (Cohen et al., 2019; Salman et al., 2020; 2019) and p = (Zhang, 2002). Certiﬁcates have also been extended to non-ℓp smoothing distributions (Levine and Feizi, 2020b). For diﬀerent notions of optimality, the optimal certiﬁcates have also been characterized (Yang et al., 2020; Kumar et al., 2020).

However, there has been lesser attention given to the impact that randomized smoothing has on the benign accuracy of the smoothed classiﬁer, and a theoretical understanding for the need or lack thereof for noise-augmented training of the classiﬁer in order to produce an accurate smoothed classiﬁer. Recent work in Gao et al. (2020), which is closest to ours in spirit, takes a ﬁrst step towards this goal by analyzing the class of hypotheses that are realizable when training on smoothed data. Their result shows that this re-training strictly reduces the hypothesis class whenever the smoothing strength is above a threshold. In this way, Gao et al. (2020) provide some theoretical basis for the reduction of accuracy of the so smoothed-trained classiﬁer. Elsewhere, Mohapatra et al. (2021) theoretically demonstrate that randomized smoothing leads to shrinkage of class boundaries for simple data distributions. Importantly, their deﬁnition of shrinkage is that the bounding sphere (or cone, for semi-bounded regions) of the decision boundary becomes smaller. This is diﬀerent from our analysis, as a shrunk decision region Rσ might not be a subset of the original region R under this deﬁnition. In comparison, our universal result in Theorem 4.2 holds in much more generality for arbitrary (bounded) data-distributions, and our existence results Theorems 4.1 and 4.3 construct speciﬁc synthetic datasets. While Mohapatra et al. (2021) only analyse the distance f Smoothβ(f) 2 given a trained classiﬁer f, we directly bound the risk R(Smoothβ(f)), and shrinkage does show up in some parts of our analysis. Our proof techniques are able to handle both noise-augmentation of strength α and randomized smoothing of strength β simultaneously, allowing us to discover cases where noise-augmentation helps randomized smoothing (in other words, showing cases where the combined eﬀect might not be a shrinkage of the decision boundaries). Some recent works have tried to move away from having the same smoothing and noisy-training distribution, by studying input-dependent smoothing (Alfarra et al., 2020; Anderson and Sojoudi, 2022; Súkeník et al., 2021). Lastly, and in a very diﬀerent context of graph convolutional networks (GCNs), recent results in Keriven (2022) show that a positive level of smoothing (deﬁned as the number of GCN layers) can be beneﬁcial for a supervised learning task, before becoming detrimental once smoothing is too large. Studying further connections between our analysis and smoothing for GCNNs constitutes an interesting direction of research.

Published in Transactions on Machine Learning Research (04/2023)

7 Conclusion, Limitations and Future Work

In this work, we provided a theoretical understanding of how noise-augmented training aﬀects classiﬁers defended using randomized smoothing. We identiﬁed a parameter of the data distribution, which we dubbed interference distance, that, for certain families of data-distributions, determines whether noise-augmented training could be beneﬁcial for randomized smoothing. Using this parameter, we showed there exist data distributions where noise-augmented training reduces the accuracy of the ﬁnal smoothed classiﬁer, contrary to common observation. Our upper bound to the benign risk in Theorem 4.2 for very general conditional distributions and which is tight for distributions presented in Theorem 4.1 demonstrates that no beneﬁt of randomized smoothing is possible without further structural distributional assumptions. In turn, we showed the existence of distributions with small interference distance in Theorem 4.3, where improvements by noise-augmented training is possible. We complemented our theoretical results with empirical validation, suggesting that real-world distributions lie in regimes where the parameter ζ is small, and so noise-augmented training can indeed help randomized smoothing if the smoothing strength is chosen properly. Contrary to intuition, the proof of our theoretical result in Theorem 4.3, and our experiments, suggest that this smoothing strength need not be the same as the noise-augmentation strength for best performance of the ﬁnal predictor.

We now revisit the open questions that we posed in Section 1. Firstly, why is noise-augmentation helpful while training the base classiﬁer? We saw in the bottom row of Figure 1 that noise augmented training helps the smoothed classiﬁer have good benign accuracy when we are in the low-interference distance regime, as it alleviates the degradation of the risk incurred by randomized smoothing at deployment time. Secondly, how does the performance of the smoothed classiﬁer depend on the training noise distribution? We saw in Theorems 4.1 and 4.3 that in the high interference regime, training with low to zero noise leads to good performance of the smoothed classiﬁer. On the other hand, in the low-interference regime, the training noise can be tuned to extract good performance of the smoothed classiﬁer. These conclusions are derived for the cases when the same family of distribution is used for augmentation during training and for smoothing at deployment time. Future work could consider extending these to diﬀerent classes of distribution. Thirdly, is noise-augmented training needed for all data-distributions? The answer is no, noise-augmentation is not always eﬀective in reducing the risk of the classiﬁer after randomized smoothing, and the interference-distance parameter can distinguish between families of data-distributions where noise-augmentation helps randomized smoothing. It remains unclear how to eﬃciently determine this parameter for real data-distributions, which would allow us to determine whether noise-augmented training is needed at all, or what the optimal parameters of the augmentation distribution should be. This remains matter of future research.

We note that our results are independent of the learning aspect of the problem, i.e., we assume that the learning algorithm is good enough to obtain the predictor with the lowest possible risk. These assumptions, while simplistic, nonetheless allowed us to characterize observations that have practical relevance. Though we provide some results on how to relax such assumptions in Appendix G, we see this as a starting point for future work that should explore these trade-oﬀs in more generality. Additionally, our experiments and theory suggest that the best data-augmentation strength is not the same as the smoothing strength in general, and we hope that future work can theoretically quantify and provide precise estimates for this quantity. Finally, in this paper, we studied the benign risk of classiﬁers after smoothing. Future work could extend these results to the study of the robust risk (which is lower bounded by the benign risk), and extend our techniques to quantify how the former depends on the noise-augmentation distribution.

Acknowledgements

The authors thank René Vidal for insightful comments, as well as the anonymous reviewers for their valuable suggestions, which helped improve this manuscript. AP acknowledges the support of DARPA (via GARD HR00112020010) and NSF (via Grant 1934979). JS acknowledges support from DARPA GARD HR00112020004.

Published in Transactions on Machine Learning Research (04/2023)

Motasem Alfarra, Adel Bibi, Philip HS Torr, and Bernard Ghanem. Data dependent randomized smoothing. ar Xiv preprint ar Xiv:2012.04351, 2020.

Brendon G Anderson and Somayeh Sojoudi. Certiﬁed robustness via locally biased randomized smoothing. Proceedings of Machine Learning Research vol, 144:1 14, 2022.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ar Xiv preprint ar Xiv:1802.00420, 2018.

Pranjal Awasthi, Vaggos Chatziafratis, Xue Chen, and Aravindan Vijayaraghavan. Adversarially robust low dimensional representations. ar Xiv preprint ar Xiv:1911.13268, 2019.

Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, and Aravindan Vijayaraghavan. Adversarial robustness via robust low rank representations. ar Xiv preprint ar Xiv:2007.06555, 2020.

Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. ar Xiv preprint ar Xiv:1805.10204, 2018.

Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artiﬁcial Intelligence and Security, pages 3 14, 2017.

Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certiﬁed adversarial robustness via randomized smoothing. ar Xiv preprint ar Xiv:1902.02918, 2019.

Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. Pac-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, pages 230 241, 2018.

Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pages 13009 13020, 2019.

Yue Gao, Harrison Rosenberg, Kassem Fawaz, Somesh Jha, and Justin Hsu. Analyzing accuracy loss in randomized smoothing defenses. ar Xiv preprint ar Xiv:2003.01595, 2020.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014.

Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. ar Xiv preprint ar Xiv:1412.5068, 2014.

Nicolas Keriven. Not too little, not too much: a theoretical analysis of graph (over) smoothing. ar Xiv preprint ar Xiv:2205.12156, 2022.

Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certiﬁable robustness. In International Conference on Machine Learning, pages 5458 5467. PMLR, 2020.

Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016.

Richard J Larsen and Morris L Marx. An introduction to mathematical statistics. Prentice Hall, 2005.

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certiﬁed robustness to adversarial examples with diﬀerential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656 672. IEEE, 2019.

Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certiﬁcates of adversarial robustness for randomly smoothed classiﬁers. Advances in Neural Information Processing Systems, 32, 2019.

Published in Transactions on Machine Learning Research (04/2023)

Alexander Levine and Soheil Feizi. Robustness certiﬁcates for sparse adversarial attacks by randomized ablation. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 34, pages 4585 4593, 2020a.

Alexander Levine and Soheil Feizi. Wasserstein smoothing: Certiﬁed robustness against wasserstein adversarial attacks. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3938 3947. PMLR, 2020b.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certiﬁed adversarial robustness with additive noise. ar Xiv preprint ar Xiv:1809.03113, 2018.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certiﬁed adversarial robustness with additive noise. Advances in neural information processing systems, 32, 2019.

Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369 385, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017.

Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoﬀ. On detecting adversarial perturbations. ar Xiv preprint ar Xiv:1702.04267, 2017.

Jeet Mohapatra, Ching-Yun Ko, Lily Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Hidden cost of randomized smoothing. In International Conference on Artiﬁcial Intelligence and Statistics, pages 4033 4041. PMLR, 2021.

Ramchandran Muthukumar and Jeremias Sulam. Adversarial robustness of sparse local lipschitz predictors. ar Xiv preprint ar Xiv:2202.13216, 2022.

Ambar Pal and René Vidal. A game theoretic analysis of additive adversarial attacks and defenses. ar Xiv preprint ar Xiv:2009.06530, 2020.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certiﬁed defenses against adversarial examples. ar Xiv preprint ar Xiv:1801.09344, 2018.

Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classiﬁers. Advances in Neural Information Processing Systems, 32, 2019.

Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, and J Zico Kolter. Black-box smoothing: A provable defense for pretrained classiﬁers. ar Xiv preprint ar Xiv:2003.01908, 2020.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014 5026, 2018.

Ali Shafahi, W Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? ar Xiv preprint ar Xiv:1809.02104, 2018.

Peter Súkeník, Aleksei Kuvshinov, and Stephan Günnemann. Intriguing properties of input-dependent randomized smoothing. ar Xiv preprint ar Xiv:2110.05365, 2021.

Jeremias Sulam, Ramchandran Muthukumar, and Raman Arora. Adversarial robustness of supervised sparse coding. Advances in neural information processing systems, 33:2110 2121, 2020.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013.

Published in Transactions on Machine Learning Research (04/2023)

Jiaye Teng, Guang-He Lee, and Yang Yuan. ℓ1 adversarial robustness certiﬁcates: a randomized smoothing approach. 2019.

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. ar Xiv preprint ar Xiv:2002.08347, 2020.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018.

Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach. In Advances in Neural Information Processing Systems, pages 12259 12269, 2019.

Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693 10705. PMLR, 2020.

Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. ar Xiv preprint ar Xiv:1810.11914, 2018.

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403 2412, 2018.

Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. Macer: Attack-free and scalable robust training via maximizing certiﬁed radius. ar Xiv preprint ar Xiv:2001.02378, 2020.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527 550, 2002.

Published in Transactions on Machine Learning Research (04/2023)

A Experimental Details

Synthetic Experiments To construct our synthetic distributions, we take X = [0, 100] [0, 100] as the data domain. We start with an empty set I = {} and radius r = 10. We repeat the following steps 500 times: (1) Sample a center c in X uniformly at random. (2) Test whether c c 2 > ζ + 2r for all spheres Bℓ2(c , r) I. If true, add Bℓ2(c , r) to I, otherwise reject this sample. At the end of this process, we obtain a set of spheres I satisfying the property c c 2 ζ + 2r for all c = c , where ζ is the interference distance.

MNIST Experiments We take 100 samples from each of pα, pβ per image x in the MNIST test set to compute a single point α,β. The ﬁgures are quite stable against random initialization due to the MNIST test set size of 10000 samples. We use the following architecture for the neural network classiﬁer hα:

Input(28, 28) Conv3,1(1, 32) Conv3,1(32, 64) Linear(9216, 128) Linear(128, 2)

In the above, the input image has dimension 28 28, and convolution layers with a ﬁlter size of 3 and a stride of 1 are applied. The notation Conv3,1(a, b) denotes that the input depth is a and the number of ﬁlters is b for the convolution. Finally, the notation Linear(a, b) denotes a aﬃne transformation with input dimension a and output dimension b.

CIFAR-10 Experiments We use the DLA Architecture (Yu et al., 2018) for the neural network classiﬁer. For obtaining hα, we use standard noise augmentation on the CIFAR-10 dataset, i.e., we augment every training sample with noise sampled from pα. For obtaining the smoothed classiﬁer Smoothβ(hα), we use 200 noise samples from pβ for every image. We report results aggregated over 4000 images from the CIFAR-10 test set.

B Derivation of Bayes Classiﬁers

Recall that X and Y denote the data and label random variables, respectively, such that (X, Y ) p X,Y . In this binary setting, the Bayes classiﬁer, i.e. the classiﬁer that minimizes the probability of misclassiﬁcation, can be obtained by thresholding the conditional expectation E[Y |X = x] at 0.5. We assume that the learning algorithm results in a predicted score h(x) = E[Y |X = x]. The prediction of this model is given by ψ(h).

Now, we instead train on noisy data, Xs ps X, where Xs = X + V such that V p V . It is well known that ps X = p X p V , where denotes the convolution operator (Larsen and Marx, 2005). Further, we can show now that the conditional expectation is E[Y |Xs = x] = h p V :

E[Y |Xs = x] = p(Y = 1|Xs = x) = p(Y = 1|X + V = x)

v p(Y = 1|X + V = x, V = v)p V (v)dv

v p(Y = 1|X = x v, V = v)p V (v)dv

v p(Y = 1|X = x v)p V (v)dv (8)

v h(x v)p V (v)dv = (h p V )(x),

where (8) uses the fact that the noise variable V is independent of the data variable X. As earlier, the Bayes classiﬁer is now given by a thresholding of the conditional expectation, i.e., ψ(h p V ).

Published in Transactions on Machine Learning Research (04/2023)

C Nice Distributions and their Properties

Nice Distributions. A probability distribution p over Rd parameterized by a scalar α R is deﬁned to be nice, if p satisﬁes the properties (9) and (10):

(Decreasing in argument norm) pα(x1) pα(x2), if x1 2 x2 2 (9)

(Spherically symmetric) pα(x1) = pα(x2), if x1 2 = x2 2 (10)

For ease of understanding of the rest of the section, one can think of p to be the uniform distribution supported on a ball of radius θ, i.e., pθ = Unif(Bℓ2(0, θ)), which is a nice distribution.

Shifted CDF. We deﬁne the multivariate cumulative distribution function (CDF) w.r.t. x X Rd as

Φα(r, x) = Z

t 2 r pα(x t)dt (11)

For nice distributions p, we can show the following properties for the CDF:

(Decreasing in argument Norm) Φα(r, x1) Φα(r, x2), if x1 2 x2 2 (12)

(Spherically symmetric) Φα(r, x1) = Φα(r, x2), if x1 2 = x2 2 (13)

(Increasing in radius) Φα(r1, x) Φα(r2, x), if r1 r2 (14)

To see (13), we choose any x1, x2 with x1 = x2 . Now, note that the integral in (11) can be rewritten as Φα(r, x) = R

t Bℓ2(x,r) pα(t)dt, which in other words computes the mass of the pα contained in an ℓ2 ball of radius r around x. Since x1 = x2 , we can ﬁnd a rotation matrix Q that maps x1 to x2 as Qx1 = x2. Observe that the same rotation matrix Q also maps the entire ball B(x1, r) to the ball B(x2, r). Finally, (13) follows from by observing that pα(x) = pα(Qx) for any rotation matrix Q as x 2 = Qx 2.

Now that we know that Φα(r, x) is spherically symmetric in the second argument, we can focus our attention to the restriction of Φ along the standard basis vector e1 for showing (12). For any r > 0, consider the function φ: R R deﬁned as φ(c) = Φ(r, ce1) for a scalar c. Points x1, x2 with x1 x2 correspond to c1 = x1 and c2 = x2 in the sense that Φ(r, x1) = φ(c1) and Φ(r, x2) = φ(c2), and c1 c2 0. Deﬁne D1 = Bℓ2(c1e1, r) and D2 = Bℓ2(c2e1, r). We have,

t D1 pα(t)dt

t (D1 D2) pα(t)dt + Z

t (D1\D2) pα(t)dt

t (D1 D2) pα(t)dt + Z

t (D2\D1) pα(t)dt (15)

To obtain the inequality in (15), observe that D1, D2 are spheres of the same radius with their centers c1e1, c2e2 such that 0 c2 c1. This ensures that any point in D2 \ D1 has a lower ℓ2 norm than any point in D1 \ D2 , which in turn means that the every term in the integral R

t (D2\D1) pα(t)dt is lower bounded by every term in the integral R

t (D1\D2) pα(t)dt, using Property (9) of pα.

Due to properties (12) and (13), we see that Φα(r, x) is non-increasing as x 2 increases. As a result, we can deﬁne the function Aα,r : R R as

Aα,r(c) = max {x: Φα(r,x) c} x 2, (16)

which computes the maximum ℓ2 norm that x 2 can take before the CDF falls below c. This function will be useful to us later when we reason about how much does smoothing a classiﬁer contract or expand its positive regions.

Published in Transactions on Machine Learning Research (04/2023)

We ﬁrstly note that Aα,r is decreasing in c. This can be seen as increasing c makes the constraint tighter in (16), and hence the optimal objective value can only decrease when we increase c.

Secondly, we note that Aα,r(Φα(r, x0)) = x0 2, for any x0 X, α 0 and r 0. This can be seen by observing that for x = x0, we have c0 = Φα(r, x0), and thus x0 is a feasible point for the optimization problem max{x: Φα(r,x) c0} x 2. Hence Aα,r(c0) x0 2. Then, by property (12), we know that Φα(r, x ) Φα(r, x0) whenever x 2 x0 2, implying that any such x will not be feasible. This shows that Aα,r(c0) x0 2, completing the argument.

D Proof of Theorem 4.1

Theorem 4.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1 > max(α, β), such that for all h H1 we have 0,β(h) < α,β(h) for all 0 < α < ζH1 and all 0 < β < ζH1.

Proof. Recall that h(x) = p(Y = 1|X = x). Let I1, I2, . . . , Ik be spheres such that Ij = Bℓ2(cj, rj). Deﬁne h(x) = 0.5 + τ whenever x I1 I2 . . . Ik, and 0 otherwise.

Large Interference Distance Recall that ζh = mini =j dist(Ii, Ij), and assume that ζh > max(α, β).

Recall the CDF Φα(x, r) = R

t 2 r pα(x t)dt, which is a function decreasing monotonically in x 2. Also recall the function Aα,r with the property Aα,r(Φα(x, r)) = x 2 for all x. For any x Ij, we can obtain h pα as

h pα(x) = Z

Ij h(t)pα(x t)dt + Z

X\Ij h(t)pα(x t)dt

Ij (0.5 + τ)pα(x t)dt + Z

X\Ij h(t) 0 dt

= (0.5 + τ)Φα(x cj, rj).

α Shrinkage The regions where the classiﬁer ψ(h pα) predicts 1 are given by Ij,α = {x Ij : h pα 0.5}. This region Ij,α is an α-shrinkage of Ij and can be obtained as

(0.5 + τ)Φα(x cj, rj) 0.5 = x cj 2 Aα,rj

0.5 0.5 + τ

def = rj,α, (17)

where we have used the deﬁnition of the function Aα,r.

Now, the regions where the classiﬁer Smoothβ(h pα) predicts 1 are similarly given by Ij,α,β = {x Ij,α : ψ(h pα) pβ 0.5}. We can follow the same steps as earlier on the α-shrunk balls Ij,α = Bℓ2(cj, rj,α). This time, for x Ij,α we have

Smoothβ(ψ(h pα))(x) = Z

Ij,α ψ(h pα)(t)pβ(x t)dt + Z

X\Ij,α ψ(h pα)(t)pβ(x t)dt

Ij,α 1 pβ(x t)dt = Φβ(x cj, rj,α).

β Shrinkage Again, we can ﬁnd the β-shrinkage in Ij,α as

Φβ(x cj, rj,α) 0.5 = x cj 2 Aα,rj,α(0.5) def = rj,α,β.

Published in Transactions on Machine Learning Research (04/2023)

We have thus obtained Ij,α,β as the α, β-shrunk balls, i.e., Ij,α,β = Bℓ2(cj, rj,α,β). We can now compute the risk of the smoothed classiﬁer as

R(Smoothβ(ψ(h pα)))

X |Smoothβ(ψ(h pα))(x) Y |p(x, y)dxdy

X |Smoothβ(ψ(h pα))(x) 1|p(1|x)p X(x)dx + Z

X |Smoothβ(ψ(h pα)(x)|p(0|x)p X(x)dx.

Let I = I1 I2 . . . Ik be the positive regions, and Iα,β = I1,α,β I2,α,β . . . Ik,α,β be the shrunk positive regions. For the ﬁrst term above, we note that

|Smoothβ(ψ(h pα))(x) 1|p(1|x) =

0, x Iα,β 0.5 + τ, x I \ Iα,β 0, x X \ I .

Similarly, for the second term, we see that

Smoothβ(ψ(h pα))(x)p(0|x) =

0.5 τ, x Iα,β 0, x I \ Iα,β 0, x X \ I .

Substituting into the integrals, we obtain the risk of the smoothed classiﬁer

R(Smoothβ(ψ(h pα))) = (0.5 + τ)p X(I \ Iα,β) + (0.5 τ)p X(Iα,β)

= (0.5 + τ)p X(I) 2τp X(Iα,β).

Recalling that p X( ) is positive everywhere in the domain, and using I0,β Iα,β for 0 < α, β < ζh in the above, we obtain p X(I0,β) > p X(Iα,β) = 2τp X(Iα,β) > 2τp X(I0,β).

In other words, we have

R(Smoothβ(ψ(h pα))) > R(Smoothβ(ψ(h))) = α,β > 0,β.

E Proof of Theorem 4.2

Theorem 4.2. For nice noise distributions pα, pβ, for any h supported on a bounded domain X,

k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)), (3)

where rk α = r

Ψ 1 α 0.5 0.5+τ if Ik is a positive partition, i.e., Ik I+ and rk α = r

Ψ 1 α 0.5 0.5 τ otherwise,

i.e., Ik I . Further, rk α,β = rk α + q

Ψ 1 β (0.5), and Ψα, Ψβ are the CDFs of the distribution of z 2 2 when z pα and z pβ, respectively. p X(S) denotes the measure of the set S under p X. The upper bound (3) holds for any choice of {(ωk h,τ, ˆxk)} such that ˆxk is the center of a ball of radius ωk h,τ completely contained in Ik,τ deﬁned as Ik,τ = {x Ik : |h(x) 0.5| τ}. We can choose the sequence {(ωk h,τ, ˆxk)} such that the upper bound (3) is minimized.

Deﬁnition E.1 (Inradius). The inradius of a set S is deﬁned as the largest radius r such that an ℓ2 ball of radius r is completely contained in S. In other words, there exists x S such that Bℓ2(x, r ) S.

Published in Transactions on Machine Learning Research (04/2023)

In order to bound α,β(h), we will follow a strategy which generalizes the arguments in Theorem 4.1 to any arbitrary conditional distribution h(x) = p(Y = 1|X = x). Our ﬁrst step would be to upper-bound the positive partitions of ψ(h pα), i.e., the regions where the classiﬁer trained with noise-augmentation predicts 1. This will be then be used along with properties of randomized smoothing to upper-bound the positive partitions of Smoothβ(ψ(h pα)), i.e., the regions where the ﬁnal smoothed classiﬁer predicts 1. These regions would ﬁnally be used to upper bound the risk of the smoothed classiﬁer as a function of α and β.

Proof. Let I+ denote the set of maximal, non-empty simply connected subsets I X of the input space such that h(x) 0.5 for each x I. Pick any I I+ and let Iτ I be such that h(x) 0.5 + τ for each x Iτ and 0 τ < 0.5.

Bounding region where ψ(h pα)(x) = 1. Let J1 Iτ be the maximal subset of Iτ such that the noise-trained classiﬁer ψ(h pα) predicts 1 in J1, i.e. J1 = {x Iτ : C1(x) is satisﬁed }, where C1(x) is the condition (h pα)(x) 0.5. We will ﬁrstly lower bound the area in J1 by using a stronger condition C2(x) R

Iτ x(τ + 0.5)pα(δ)dδ 0.5. To show that C2 is a stronger condition than C1, we will show that C2(x) = 1 implies C1(x) = 1, as follows:

(h pα)(x) = Z

X h(x δ)pα(δ)dδ

Iτ h(δ)pα(x δ)dδ

Iτ (τ + 0.5)pα(x δ)dδ = Z

Iτ x (τ + 0.5)pα(δ)dδ

Deﬁning J2 = {x Iτ : C2(x) = 1}, we see that J2 J1. We will further lower bound J2 by considering the largest d-dimensional ball centered at 0 that can be inscribed in Iτ x for each x. Deﬁne the condition C3(x, r) as the following:

C3(x, r) (C2(x) = 1 and B(x, r) Iτ)

Iτ x (τ + 0.5)pα(δ)dδ 0.5 | {z } I

and B(x, r) Iτ | {z } II

The set J3(r) = {x Iτ : C3(x, r) = 1} is a subset of J2 for all r > 0. In words, J3(r) denotes the set of points in J2 which are at least r distance away from the boundary of Iτ. Finally, we combine the two conditions I and II in C3(x, r) to obtain our ﬁnal condition C4(x, r).

B(0,r) (τ + 0.5)pα(δ)dδ 0.5 and B(x, r) Iτ (19)

Note that B(x, r) Iτ is the same as B(0, r) Iτ x. Hence, the integral in (19) is over a smaller set than in (18), from where it follows that C4 is a stronger condition than C3 and the set J4(r) = {x Iτ : C4(x, r) = 1} is a subset of J3(r) for all r > 0.

We now simplify C4(r) by using the distribution of z 2 2 where z pα, as follows.

B(0,r) pα(δ)dδ = Pr z pα[ z 2 2 r2]

Published in Transactions on Machine Learning Research (04/2023)

In the above, Ψα is the CDF of the distribution of z 2 2 when z pα. Setting (τ + 0.5)Ψα(r2) 0.5 gives us the following condition

B(x, r) Iτ such that r

0.5 0.5 + τ

0.5 0.5 + τ

Our ﬁnal subset J5(α) = {x Iτ : C5(x) = 1} denotes the set of points in I which are at least rα = r

Ψ 1 α 0.5 0.5+τ away from the boundary of Iτ. In other words, J5(α) Iτ is a rα-contraction of Iτ.

Bound region where Smoothβ(ψ(h pα)) = 1. We will now follow the same general strategy as in the previous part of the proof. Let K1 J5(α) be the maximal subset of J5(α) such that the smoothed classiﬁer Smoothβ(ψ(h pα) predicts 1 in K1, i.e., K1 = {x J5(α): D1(x) is satisﬁed }, where D1(x) is the condition (Smoothβ(ψ(h pα))(x) 0.5. We will ﬁrstly lower bound the area in K1 by using a stronger condition D2(x) R

J5(α) x pβ(δ)dδ 0.5. To show that D2 is a stronger condition than D1, we will show that D2(x) = 1 implies D1(x) = 1, as follows:

(Smoothβ(ψ(h pα))(x) = Z

X (ψ(h pα))(x δ)pβ(δ)dδ

J5(α) (ψ(h pα))(δ)pβ(x δ)dδ

J5(α) pβ(x δ)dδ = Z

J5(α) x pβ(δ)dδ

Deﬁning K2 = {x J5(α): D2(x) = 1}, we see that K2 K1. We will further lower bound K2 by considering the largest d-dimensional ball centered at 0 that can be inscribed in J5(α) x for each x. Deﬁne the condition D3(x, r) as the following:

D3(x, r) (D2(x) = 1 and B(x, r) J5(α))

J5(α) x pβ(δ)dδ 0.5 and B(x, r) J5(α)

The set K3(r) = {x J5(α) : D3(x, r) = 1} is a subset of K2 for all r > 0. In words, K3(r) denotes the set of points in K2 which are atleast r distance away from the boundary of J5(α). Finally, we combine the two conditions in D3(x, r) to obtain our ﬁnal condition D4(x, r).

B(0,r) pβ(δ)dδ 0.5 and B(x, r) J5(α) (21)

Note that B(x, r) J5(α) is the same as B(0, r) J5(α) x. Hence, the integral in (21) is over a smaller set than in (20), from where it follows that D4 is a stronger condition than D3 and the set K4(r) = {x J5(α): D4(x, r) = 1} is a subset of K3(r) for all r > 0.

We now simplify K4(r) by using the distribution of z 2 2 where z pβ, as follows.

B(0,r) pβ(δ)dδ = Pr[ z 2 2 r2] where z pβ

Published in Transactions on Machine Learning Research (04/2023)

In the above, Ψβ is the CDF of the distribution of z 2 2 when z pβ. Setting Ψβ(r2) 0.5 gives us the following condition

D5(x) B(x, r) J5(α) such that r q

Ψ 1 β (0.5)

Ψ 1 β (0.5) J5(α)

Our ﬁnal subset K5(α, β) = {x Iτ : D5(x) = 1} denotes the set of points in I which are at least

rα,β = rα + q

Ψ 1 β (0.5) away from the boundary of Iτ. In other words, K5(α, β) Iτ is a rα,β-contraction of Iτ.

We recall that h is such that the subset Iτ has inradius ωh,τ. By the deﬁnition of inradius, this means that there exists a ball B(ˆx, ωh,τ) for some ˆx Iτ such that B(ˆx, ωh,τ) Iτ. Hence, the contraction of Iτ by rα,β, i.e., K5(α, β), is a superset of the contraction of B(ˆx, ωh,τ) by rα,β,

B(ˆx, ωh,τ rα,β) K5(α, β). ( )

Risk Upper Bound Recall that X p X, Y p Y are the data and the response variables respectively. Let S be the random variable obtained from the smoothed, noise-augmented classiﬁer as S = Smoothβ(ψ(h pα))(X). Let T be the random variable obtained from the original classiﬁer T = ψ(h)(X). We know that the risk of the original classiﬁer is given by

R(ψ(h)) = Pr[ψ(h)(X) = Y ] = p(T = 1, Y = 0) + p(T = 0, Y = 1).

Deﬁne I+ = {x: ψ(h)(x) = 1} and I = {x: ψ(h)(x) = 0} as the regions of the input space classiﬁed as 1 and 0 respectively by the base classiﬁer. With this deﬁnition, we can rewrite the base risk as

R(ψ(h)) = p(X I+, Y = 0) + p(X I , Y = 1).

Let K+ I+ be any subset classiﬁed as 1 by the smoothed classiﬁer, i.e.,

K+ {x: Smoothβ(ψ(h pα))(x) = 1} I+.

Similarly, K {x: Smoothβ(ψ(h pα))(x) = 0} I .

We will think of the sets K as shrunk versions of I, and where the smoothed classiﬁer predicts the correct label. Speciﬁcally, we will let K+ be the union of all the contractions K5(α, β) of the positive partitions of ψ(h), where K5(α, β) was obtained earlier. Similarly, we will let K be the union of the contractions of all the negative partitions of ψ(h). This ensures that

K def = X \ (K+ K ) = X \ [

k K5,k(α, β),

where we deﬁne K as the rest of the space.

The risk of the smoothed classiﬁer is

R(Smoothβ(ψ(h pα))) = p(S = 1, Y = 0) + p(S = 0, Y = 1),

where the ﬁrst term decomposes into

p(S = 1, Y = 0) = p(S = 1, Y = 0, X K+) + p(S = 1, Y = 0, X K)

= p(Y = 0, X K+) + p(S = 1, Y = 0, X K),

Published in Transactions on Machine Learning Research (04/2023)

noting that p(S = 1, Y = 0, X K ) = 0, as X K implies S = 0. Similarly, the second term decomposes as

p(S = 0, Y = 1) = p(Y = 1, X K ) + p(S = 0, Y = 1, X K).

Combining, we have

R(Smoothβ(ψ(h pα))) = p(Y = 0, X K+) + p(Y = 1, X K )+

p(S = 1, Y = 0, X K) + p(S = 0, Y = 1, X K)

= p(Y = 0, X K+) + p(Y = 1, X K ) + p(S = Y, X K).

The above expression says that the errors made by the smoothed classiﬁer can be decomposed into the errors made in the regions K, and the error everywhere else K. Now, since K+ I+, K I , we have

R(Smoothβ(ψ(h pα))) p(Y = 0, X I+) + p(Y = 1, X I ) + p(S = Y, X K)

R(ψ(h)) + p(S = Y, X K)

= R(Smoothβ(ψ(h pα))) R(ψ(h)) + p(X K) = α,β(h) p(X K) (*)

As K5(α, β) K5(0, 0) for any α, β 0, we have p(X Kα,β) p(X K0,0), showing that the upper-bound (*) decreases as α increases at any ﬁxed β.

Finally, notice that the upper-bound (*) becomes tight at α = β = 0, K+ = I+, K = I , K = {}, as then R(Smooth0(ψ(h p0)) = R(ψ(h)) and p(X K) = 0.

Finally, we can use ( ) in (*) to obtain

k p X(Bℓ2(ˆxk, ωk h,τ rk α,β)),

for any choice of the points {ˆxk} such that ˆxk is such that Bℓ2(ˆxk, ωk h,τ) Ik,τ.

Finally,the quantities Ψ 1 α 0.5 0.5 τ and Ψ 1 β (0.5) must be non-negative for the above balls to be well deﬁned.

This in turn implies that ωk h,τ rk α,β 0 for all k, i.e., the inradius of each partition is at least as large as the amount of shrinkage that partition goes through. These requirements are not additional constraints on h, but rather a property of the proof technique: whenever the inradius is too small, the partition vanishes from the upper bound. To capture this subtlety, we can write the upper bound as

k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)),

where (c)+ denotes the positive part of c.

F Proof of Theorem 4.3

Theorem 4.3. There exist nice distributions pα, pβ, a family H2 with low interference distance ζH2, and data-distributions p X, such that for all h H2 we have 0,β0(h) > α,β0(h) for some α > 0, β0 > 0.

Proof. We will ﬁrstly derive general conditions on h H2 and α0, β0 > 0 such that α0,β0(h) < 0,β0(h). We will then show that the H2 as restricted by these conditions is not empty, by providing a particular h that satisﬁes these conditions. We will assume that the domain is bounded.

Lower-Bound on β. We ﬁrst of all require β to be large enough such that ψ(ψ(h) pβ0)(x) = 0 for all x X. This is to ensure that without noise-augmentation, the smoothed classiﬁer predicts 0 everywhere. In other words, x X we require Z

t X ψ(h)(x t)pβ0(t)dt 0.5. (22)

Condition (22) implies β0 β for some β, as will become clear later in (26).

Published in Transactions on Machine Learning Research (04/2023)

Acceptable Range of α. Secondly, we require ψ(h pα0)(x) = 1 over a set X, where we think of the size of X as being close to the size of X in the sense p X(x X \ X) ϵ for a small ϵ. This subset X denotes the part of the input where the noise-trained classiﬁer predicts 1. In other words, for all x X, we require Z

t X h(x t)pα0(t)dt 0.5. (23)

Condition (23) implies lower and upper bounds α α0 α, as will become clear later in (27), (28).

Upper-Bound on β. Thirdly, we require ψ(ψ(h pα0) pβ0)(x) = 1 for all x ˆ X, where we again think of the size of ˆ X as being close to the size of X, i.e., p X( X \ ˆ X) ϵ. This ensures that the smoothed classiﬁer predicts 1 in almost all of the region where the noise-trained classiﬁer predicts 1. In other words, for all x ˆ X, we require Z

t X ψ(h pα0)(x t)pβ0(t)dt 0.5. (24)

Condition (24) implies an upper bound β β0, as will become clear in (29).

Small Interference-Distance. Finally, we require that the data-distribution places more mass on the label 1, i.e.

p(Y = 1) > 1

2 + ϵ. (25)

Condition (25) ensures that h has a small upper-interference distance. Informally, this is expected: (25) says that the positive partitions of ψ(h) need to have a certain mass under p, and the domain is bounded, so the partitions cannot be far apart. Formally, this will become clear in (30) when we demonstrate a particular example h where the above constraints are satisﬁed. Before that, we will prove that the above conditions are suﬃcient to guarantee α0,β0(h) < 0,β0(h).

Let Sα,β be the random variable Smoothβ(ψ(h pα))(X), X p X, and simplify 0,β0 as

0,β0(h) = p(Y = S0,β0) Using (22)

> p(Y = 0) + 2ϵ Using (25)

p(Y = 0, X ˆ X) + ϵ + ϵ

p(Y = 1, X ˆ X) + p X( X \ ˆ X) + p X(X \ X)

= p(Y = Sα0,β0, X ˆ X) + p(X X \ ˆ X) Using (24)

p(Y = Sα0,β0, X ˆ X) + p(Y = Sα0,β0, X X \ ˆ X)

= α0,β0(h).

We will consider the one-dimensional example h described in Section 4.2, and show that it satisﬁes the above constraints. The domain can be taken to be X = [ 0.25, 0.25]. h is deﬁned by specifying c1, c2, c3, c4. We can take c1 = 0.25, c2 = 0.25 + ω, c4 = 0.25, c3 = 0.25 ω, for some ω 0.25. Here ω is the inradius of the positive partitions of h, similar to what we saw in Theorem 4.2. Additionally, the upper interference distance ζ is the distance between the positive partitions, i.e., ζ = (0.25 ω) ( 0.25 + ω) = 0.5 2ω. The smoothing distribution p is uniform in a ball, as pθ = Unif([ θ/2, θ/2]), and the marginal distribution p X is uniform over [ 0.25, 0.25].

The lower bound on β (22) then requires for all x X, we have R β/2 β/2 ψ(h)(x t) (1/β)dt 0.5. A stronger

condition is (1/β) R 0.25 0.25 ψ(h)(t)dt 0.5 which implies a lower bound on β as

(1/β)(2ω) 0.5 = β 4ω. (26)

Published in Transactions on Machine Learning Research (04/2023)

The acceptable range of α is deﬁned by (23). We take X = X (hence ϵ = 0). For x [ 0.25 + ω, 0.25 ω], (23) gives 2 (0.25 ω)(1/α) 0.5 = α 4(0.25 ω). (27)

For x [ 0.25, 0.25 + ω] [0.25 ω, 0.25], a stronger condition to (23) gives

ω(1/α) 0.5 = α 2ω. (28)

The upper bound on β is given by (24). As ϵ = 0, we have ˆ X = X, which gives

(0.5) (1/β) 0.5 = β 1.. (29)

Finally, Eq. (25) gives a condition on ω:

2 = ω > 0.125. (30)

The proof is now complete with the observation that ω = 0.23, α = 0.1, β = 0.93 satisﬁes these constraints, showing that h H2 as 0,0.93( h) > 0.1,0.93( h).

Note that the condition (30) on ω implied an upper-bound on the interference distance ζ, as ζ = 0.5 2ω. This shows that when h satisﬁes the condition (25), h has a low interference distance.

G Accommodating Errors in Learning Bayes Classiﬁers.

While obtaining a randomized-smoothed classiﬁer in a real learning scenario, we might deviate from the Bayes classiﬁer in any stage of the process. Speciﬁcally, we have 2 stages: (1) Learn a classiﬁer on the noisy data (Xs, Y ) and (2) Smooth the resultant classiﬁer using randomized smoothing.

In Theorem 4.1, we assumed that we are able to exactly learn the Bayes Classiﬁer for (Xs, Y ) in stage (1), i.e. ψ(h pα). This assumption can be relaxed while maintaining the same general proof technique, albeit by paying with further slack in the obtained bounds due to the inexactness of the learned classiﬁer. We now sketch a modiﬁcation of the argument in Appendix D that allows a bounded deviation from the Bayes classiﬁer.

We start with the ﬁrst column of Figure 1, i.e. ψ(h), and train the classiﬁer on data perturbed by noise pα. In the ideal case sketched above, this leads precisely to the second column, i.e., ψ(h pα). In the non-ideal scenario, we assume that we obtain a classiﬁer g in stage (1) after training on (Xs, Y ), such that for all x, we have |g(x) h pα(x)| η. In other words, the maximum error in the trained classiﬁer compared to the actual conditional is no larger than η. For clarity, we still maintain that stage (2) is done exactly, i.e. that given ψ(g) we are able to obtain Smoothβ(ψ(g)) exactly as our ﬁnal randomized-smoothed classiﬁer.

As a result, we are interested in analysing the modiﬁed excess risk

α,β(g, h) = R(Smoothβ(ψ(g))) R(ψ(h)), (31)

for all (g, h) such that g h pα η. Eq. (31) captures the increase in the benign risk due to randomized smoothing an imperfectly learned classiﬁer on the noise-augmented data. Additionally, we deﬁne

η α,β(h) = max {g : g h pα η} α,β(g, h), (32)

to be the maximum excess risk when using an imperfectly learned g. Similarly, we deﬁne

η α,β(h) = min {g : g h pα η} α,β(g, h), (33)

to be the minimum excess risk when using an imperfectly learned g. We now have the following modiﬁed version of Theorem 4.1.

Published in Transactions on Machine Learning Research (04/2023)

Theorem G.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1, such that for all h H1 we have η 0,β(h) < η α,β(h), and η 0,β(h) < η α,β(h) for all smoothing parameters α, β such that ζH1 > 2 max(α, β), and η < 0.5.

Proof. Similar to Theorem 4.1, let I1, I2, . . . , Ik be spheres such that Ij = Bℓ2(cj, r) for some positive radius r > 0. Deﬁne h(x) = 0.5 + τ whenever x I1 I2 . . . Ik, and 0 otherwise, for some 0 τ < 0.5.

Large Interference Distance We assume that ζh > 2 max(α, β), where note the additional factor 2 in the required lower bound as compared to Theorem 4.1. This stricter condition will become useful when we deal with errors in learning the bayes classiﬁer.

α Shrinkage Recall that we earlier analyzed the region Ij,α = {x Ij : h pα(x) 0.5}. In this new non-ideal setting, it is now useful to consider the regions I+η j,α deﬁned as

I+η j,α = {x Ij,α : h pα(x) 0.5 + η}. (34)

The inaccurate classiﬁer g would have ψ(g)(x) = 1 for all x i I+η i,α, as g(x) h pα(x) η 0.5.

Similarly, the sets {I η j,α} can be deﬁned as the set of maximal simply connected regions forming a partition of {x: h pα(x) 0.5 η}. There is a natural correspondence between I η j,α, I+η j,α and Ij,α, which will become clear once we obtain the explicit forms of these sets. Note that ψ(g)(x) = 0 for all x i I η i,α, as g(x) h pα(x) + η 0.5.

We can characterize I+η j,α explicitly as the set of all x satisfying

(0.5 + τ)Φα(x cj, r) 0.5 + η = x cj 2 Aα,r

def = r+η α , (35)

obtaining the shrunk ball I+η j,α = B(cj, r+η α ). Similarly, we see that I η j,α is the ball I η j,α = B(cj, r η α ) with radius

r η α def = Aα,r

Handling Inaccuracy in g We now perform randomized smoothing on this classiﬁer g, and obtain ψ(ψ(g) pβ).

For x Ij we have that

Smoothβ(ψ(g))(x) = Z

I+η j,α ψ(g)(t)pβ(x t)dt + Z

I η j,α\I+η j,α ψ(g)(t)pβ(x t)dt + Z

X\I η j,α ψ(g)(t)pβ(x t)dt.

We will now split the domain of the last integral in (36) as

X \ I η j,α = i =j I η j,α X \ i I η j,α ,

where we note that the integral over the second set in the union, i.e., R

X\ i I η i,α ψ(g)(t)pβ(x t)dt is zero, by

the observation above that ψ(g)(t) = 0 for all t j I η j,α. This leaves us with the integral over the ﬁrst set, which we will now show to be 0 as well.

For the ﬁrst set, we want to see whether there is any point t I η i,α, i = j such that ψ(g)(t) = 1 and pβ(x t) > 0. This can happen only when g(t) 0.5 and x t 2 β. Now consider the triangle formed by the points {x, t, ci}, and apply the reverse triangle inequality to get

t ci 2 ci x 2 x t 2 2 max(α, β) + r β α + r,

Published in Transactions on Machine Learning Research (04/2023)

where the interference distance condition was used for ci x > 2 max(α, β). But now we know that g(t) h pα(t) η < 0.5, and that h pα(t) = 0 for any t at a distance more than α + r distance away from all the centers ci. This implies that g(t) < 0.5, which further implies that ψ(g)(t) = 0. Hence, the integral over the ﬁrst set i =j I η j,α is 0.

The ﬁrst term in (36) is simply

I+η j,α ψ(g)(t)pβ(x t)dt = Z

I+η j,α 1 pβ(x t)dt = Φβ(x cj, r+η j,α).

For the second term in (36), we can only produce upper and lower bounds, as

I η j,α\I+η j,α ψ(g)(t)pβ(x t)dt Z

I η j,α\I+η j,α 1 pβ(x t)dt

Combining, we have

Φβ(x cj, r+η α ) Smoothβ(ψ(g))(x) Φβ(x cj, r η α ) (37)

where the lower-bound occurs when g(x) takes the lowest possible value for all x I+η j,α \ I η j,α. Similarly, the upper-bound occurs when g(x) takes the highest possible value in the same region.

β Shrinkage Compared Theorem 4.1, we now only have an inequality (37) for the smoothed classiﬁer at any point x. As a result, we cannot determine the positive regions Ij,α,β = {x Ij : Smoothβ(ψ(g))(x) 0.5} exactly. Instead, the following set inclusions follow from (37):

Bℓ2(cj, Aα,r+η α (0.5)) Ij,α,β Bℓ2(cj, Aα,r η α (0.5)).

Bounding Risk Similar to Theorem 4.1, we can now compute the risk of the smoothed classiﬁer as

R(Smoothβ(ψ(h pα))) (38)

X |Smoothβ(ψ(h pα))(x) 1|p(1|x)p X(x)dx + Z

X Smoothβ(ψ(h pα)(x)p(0|x)p X(x)dx. (39)

Let I = I1 I2 . . . Ik are the positive regions for the base classiﬁer ψ(h), and let I+η α,β = j Bℓ2(cj, Aα,r+η α ) and I η α,β = j Bℓ2(cj, Aα,r η α (0.5)) be the upper and lower bounds to the positive regions of the smoothed classiﬁer Smoothβ(ψ(g)), i.e., Iα,β = j Ij,α,β.

Let rα,β,η = Aα,r η α (0.5) denote the largest possible radius of the positive partitions after smoothing. We now have two cases, rα,β,η r (Case A), or r < rα,β,η (Case B). When η = 0, the positive partitions always shrink (as we saw in Theorem 3.1), and we are in Case A. As we increase η beyond a certain η η0, we go into Case B, where the positive partitions have potentially dilated after smoothing. We handle both cases separately.

Case A Deﬁne S I η α,β \ I+η α,β to be the subset of I η α,β \ I+η α,β where Smoothβ(ψ(g))(x) = 1. For the ﬁrst integral in (39), we obtain

|Smoothβ(ψ(g))(x) 1|p(1|x) =

0, x I+η α,β 0, x S 0.5 + τ, x (I η α,β \ I+η α,β) \ S 0.5 + τ, x I \ I η α,β 0, x X \ I

Published in Transactions on Machine Learning Research (04/2023)

For the second integral in (39), we obtain

Smoothβ(ψ(g))(x)p(0|x) =

0.5 τ, x I+η α,β 0.5 τ, x S 0, x (I η α,β \ I+η α,β) \ S 0, x I \ I η α,β 0, x X \ I

Substituting into the integrals, we obtain

R(Smoothβ(ψ(g))) = (0.5 τ)p X(I+η α,β S) + (0.5 + τ)p X ((I η α,β \ I+η α,β) \ S) (I \ I η α,β) ,

which is minimized when S is as large as possible, and vice-versa. This gives us the risk bounds

(0.5 τ)p X(I η α,β) + (0.5 + τ)p X(I \ I η α,β) R (0.5 τ)p X(I+η α,β) + (0.5 + τ)p X(I \ I+η α,β)

(0.5 + τ)p X(I) 2τp X(I η α,β) R(ψ(h)) | {z } η α,β(h)

α,β(g, h) (0.5 + τ)p X(I) 2τp X(I+η α,β) R(ψ(h)) | {z }

As α increases for a ﬁxed β, η, τ, both the upper and lower bounds in (40) increase, following the same argument as Theorem 4.1.

Case B Deﬁne T1 I η α,β \I to be the subset of I η α,β \I where Smoothβ(ψ(g))(x) = 1. Deﬁne T2 I \I+η α,β to be the subset of I \ I+η α,β where Smoothβ(ψ(g))(x) = 1. For the ﬁrst integral in (39), we obtain

|Smoothβ(ψ(g))(x) 1|p(1|x) =

0, x I+η α,β 0, x T2 0.5 + τ, x (I \ I+η α,β) \ T2 0, x X \ I

For the second integral in (39), we obtain

Smoothβ(ψ(g))(x)p(0|x) =

0.5 τ, x I+η α,β 0.5 τ, x T2 0, x (I \ I+η α,β) \ T2 1, x T1 0, x (I η α,β \ I) \ T1 0, x X \ I η α,β

Again substituting into the integrals, we obtain

R(Smoothβ(ψ(g))) = (0.5 + τ)p X((I \ I+η α,β) \ T2) + (0.5 τ)p X(T2) + p X(T1) + (0.5 τ)p X(I+η α,β),

which is minimized when T1 is empty and T2 is as large as possible, and maximized when T1 is as large as possible and T2 is empty. This gives the risk bounds

(0.5 τ)p X(I) R(Smoothβ(ψ(g))) (0.5 + τ)p X(I) 2τp X(I+η α,β) + p X(I η α,β \ I)

(0.5 τ)p X(I) R(ψ(h)) | {z } η α,β(h)

α,β(g, h) (0.5 + τ)p X(I) 2τp X(I+η α,β) + p X(I η α,β \ I) R(ψ(h)) | {z }

Note that the lower bound for the risk in (41) is equal to the Bayes Error - this is expected as in the best scenario, the classiﬁer after smoothing can be idential to the original classiﬁer in Case B. For the upper bound, we observe that as we increase α at a ﬁxed β, η, τ, the set I+η α,β shrinks, which causes the upper bound to increase.

We have hence shown that in both Case A and Case B, the risk of the smoothed classiﬁer increases as α is increased from 0, modulo approximations due to errors in learning the bayes classiﬁer.

Published in Transactions on Machine Learning Research (04/2023)

Upper Bound for General g We show here that via a simple application of Theorem 4.2, we can upper bound α,β(g, h) for general g, h:

α,β(g, h) α,β(h) + p X ((h pα)(X) = g(X)) (42)

To show (42), we let Sg denote the random variable Smoothβ(ψ(g))(X), and Sh denote the random variable Smoothβ(ψ(h) pα)(X). We use the following simple sequence of upper-bounds:

α,β(g, h) = R(Smoothβ(ψ(g))) R(ψ(h))

= p(Sg = Y ) R(ψ(h))

= p(Sg = Y, Sh = Sg) R(ψ(h)) + p(Sg = Y, Sh = Sg)

p(Sh = Y ) R(ψ(h)) + p(Sh = Sg)

= α,β(h) + p(Sh = Sg)

α,β(h) + p((h pα)(X) = g(X))