# understanding_noiseaugmented_training_for_randomized_smoothing__14a8bfc9.pdf Published in Transactions on Machine Learning Research (04/2023) Understanding Noise-Augmented Training for Randomized Smoothing Ambar Pal ambar@jhu.edu Department of Computer Science & Mathematical Institute of Data Science Johns Hopkins University Baltimore, MD 21218, USA Jeremias Sulam jsulam1@jhu.edu Department of Biomedical Engineering & Mathematical Institute of Data Science Johns Hopkins University Baltimore, MD 21218, USA Reviewed on Open Review: https: // openreview. net/ forum? id= fvyh6m DWFr Randomized smoothing is a technique for providing provable robustness guarantees against adversarial attacks while making minimal assumptions about a classifier. This method relies on taking a majority vote of any base classifier over multiple noise-perturbed inputs to obtain a smoothed classifier, and it remains the tool of choice to certify deep and complex neural network models. Nonetheless, non-trivial performance of such smoothed classifier crucially depends on the base model being trained on noise-augmented data, i.e., on a smoothed input distribution. While widely adopted in practice, it is still unclear how this noisy training of the base classifier precisely affects the risk of the robust smoothed classifier, leading to heuristics and tricks that are poorly understood. In this work we analyze these trade-offs theoretically in a binary classification setting, proving that these common observations are not universal. We show that, without making stronger distributional assumptions, no benefit can be expected from predictors trained with noise-augmentation, and we further characterize distributions where such benefit is obtained. Our analysis has direct implications to the practical deployment of randomized smoothing, and we illustrate some of these via experiments on CIFAR-10 and MNIST, as well as on synthetic datasets. 1 Introduction Machine learning classifiers are known to be vulnerable to adversarial attacks, wherein a small, humanimperceptible additive perturbation to the input is able to produce a change in the predicted class (Szegedy et al., 2013). Because of clear security implications (Kurakin et al., 2016), this phenomenon has sparked increasing amounts of work dedicated to devising defense strategies (Metzen et al., 2017; Gu and Rigazio, 2014; Madry et al., 2017) and correspondingly more sophisticated attacks (Carlini and Wagner, 2017; Athalye et al., 2018; Tramer et al., 2020), with each group trying to triumph over the other in an arms-race of sorts. As a result, an increasing number of works have begun providing theoretical guarantees and understanding of adversarial attacks and defenses, studying learning theoretic questions (Shafahi et al., 2018; Cullina et al., 2018; Bubeck et al., 2018; Tsipras et al., 2018), understanding the sample complexity of robust learning (Schmidt et al., 2018; Yin et al., 2018; Tu et al., 2019; Awasthi et al., 2019), characterizing the optimality of attacks and defenses (Pal and Vidal, 2020), analyzing the implications of robust representations (Awasthi et al., 2020; Sulam et al., 2020; Muthukumar and Sulam, 2022), and more. Published in Transactions on Machine Learning Research (04/2023) One of the central objects of study in this setting are robustness certificates, which provide provable guarantees on the prediction of models as long as the input is not perturbed beyond a specific contamination level (Cohen et al., 2019; Raghunathan et al., 2018; Yang et al., 2020). In this vein, randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) employs a base classifier, typically vulnerable to adversarial attacks, and derives from it a smoothed version by taking a majority vote of the base classifier s outputs on several (stochastic) noise-perturbed versions of the input. While making minimal assumptions about the base model, the resulting smoothed predictor is provably robust to input perturbations of bounded ℓp norm (Yang et al., 2020). Because this is applicable to any given predictor, randomized smoothing has arguably become the most practical certified defense technique against adversarial examples for deep learning models, which are often too complex be analyzed and certify otherwise. Nonetheless, the accuracy of the resulting smoothed model differs from that of its base classifier. It has been empirically observed that smoothing a base classifier out of the box, i.e., without any modifications to the network weights or structure, does not provide good performance, for example in terms of certified accuracy (Cohen et al., 2019; Gao et al., 2019). Tricks like noise-augmented training for the base classifier need to be employed in order to obtain meaningful results and defense certificates. This phenomenon, while pervasive in practice, does not have a good theoretical understanding. There are several open questions surrounding the relationship between the base classifier and its smoothed version: Why is noise-augmented training useful when training the base classifier? How does the performance of the smoothed classifier depend on the distribution of the training noise? Is noise perturbed training always needed for all data-distributions, and if not, could we determine this before modifying the training procedure? In this paper, we take a step towards answering some of these questions for a general binary classification task in Rd. In summary, our contributions are as follows: 1. For a general bounded data-distribution, we derive an upper bound on the benign risk obtained from smoothing a classifier that has been trained with noise augmentation. Surprisingly, this bound suggests that noise-augmented training can in fact be harmful to the benign risk of the final smoothed classifier, which is contrary to the observed phenomenon in practical applications. 2. We then characterize a family of data distributions for which the above bound is tight, and for which training on noise-augmented data is only detrimental. These distributions are characterized by a notion of large interference distance, which we formalize. 3. We show that this notion of interference distance captures some of the trade-offs in randomized smoothing, and we prove that there exist a family of distributions where benefits can be obtained from noise-augmented training as observed in practice. 4. Our empirical experiments suggest, firstly, that real data distributions lie in the low-interference distance regime, and hence noise-augmented training is beneficial to randomized smoothing. Secondly, contrary to practice, our proofs suggest that the parameters of the noise-distribution for noiseaugmented training and that of randomized smoothing need not be the same, and that allowing for different values of these parameters lead to improved results. Our experiments on MNIST and CIFAR-10 confirm this theoretical intuition. The rest of the paper is organized as follows. In Section 2 we describe preliminaries and set up our notation and framework. In Section 3 we introduce our main results informally, obtaining conditions that characterize data distributions where noise-augmented training provides an advantage, and those where it does not. Section 4 presents our main results in greater technical detail. Lastly, we conduct our synthetic and real data experiments in Section 5, and conclude in Section 7 highlighting our answers to the questions posed above. 2 Preliminaries and Setup We consider a binary classification 1 task on data X X Rd with labels Y Y = {0, 1}. The random variable X follows a data distribution over the space X with PDF p X. In turn, the label or response variable 1We comment on extensions to multiple classes later in Section 4. Published in Transactions on Machine Learning Research (04/2023) Y follows a distribution p Y over its corresponding space. We assume that X is bounded (we take diameter 1 for simplicity). A supervised learning task is defined by the joint distribution p X,Y , and the goal is to obtain a predictor h: X [0, 1] so that h is a good approximation for the conditional expectation E[Y |X = x]. These distributions are unknown, however, so the learning problem consists of finding such a predictor from a set of samples, typically identically and independently distributed from p X,Y . In this work we will not dwell much on the learning problem, and instead we will assume we are given access to a function h: X [0, 1] which is a good approximation 2 for the conditional p Y |X. The base classifier is given by the composition of h with a discretizing decision mapping, ψ: [0, 1] Y. In our setting, ψ(z) = 1[z 0.5], where 1[ ] denotes the indicator function. Note that if h is the real conditional distribution of the label for a given input, this coincides3 with the Bayes classifier for this problem. Adversarial examples are small additive perturbations designed for an input sample such that the predicted label at the perturbed input is incorrect. Typically these perturbations are constrained to be in some ℓq norm ball, i.e., δ q ϵ, so as to be imperceptible to humans. It has been extensively shown that models that achieve excellent accuracy in normal settings can misclassify samples perturbed even with very small values of ϵ (Goodfellow et al., 2014). The goal of certified robustness methods is to guarantee that the output of a certain model at a given input4, ψ(h)(x), will not change when contaminated by perturbations in a ℓq-ball with radius of at most ϵ . That is, certifying that ψ(h)(x) = ψ(h)(x + δ) for every δ : δ q ϵ . Randomized smoothing (Cohen et al., 2019) achieves this by smoothing the classifier ψ(h) with an isotropic distribution. More precisely, the smoothed classifier Smooth(ψ(h)) is constructed from ψ(h) by Smoothp V (ψ(h))(x) def = arg max c Y P[ψ(h)(x + V ) = c], (1) where V p V . The choice of the distribution p V centrally depends on the norm constraint of the adversarial perturbation. Randomized smoothing was first introduced as a certification method against ℓ2-bounded perturbations, for which p V = N(0, β2Id). However, this has been extended to other norms in a series of recent works (Lecuyer et al., 2019; Li et al., 2019; Yang et al., 2020), yielding correspondingly different distributions p V . In particular, for the binary case with ℓ2-bounded perturbations, randomized smoothing measures the probability of the predicted class (say 1) after smoothing as s = P(h(x + V ) = 1). Then, it provides a certified radius ϵ = βΦ 1(s) that depends on this probability s (0.5, 1] as well as the variance of the smoothing distribution β2, where Φ 1 is the inverse of the standard Gaussian CDF. Before continuing, it will be useful for our discussions to employ the following equivalent form of randomized smoothing. Proposition 2.1. When p V is symmetric, and for the binary classification task defined above, the smoothed classifier is given by Smoothp V (ψ(h))(x) = ψ((ψ(h) p V ))(x), where denotes the convolution operation. Proof. Consider the random variable V p V and simplify Smoothp V (ψ(h))(x) as, Smoothp V (ψ(h))(x) = arg max c {0,1} P[ψ(h)(x + V ) = c] = ψ(P[ψ(h)(x + V ) = 1]) = ψ Z 1[ψ(h)(x + v) = 1]p V (v)dv = ψ Z ψ(h)(x + v)p V (v)dv = ψ(ψ(h) p V )(x). In this manuscript, we will work with single-parameter smoothing distributions pβ. Accordingly, we will simplify our notation to denote the smoothed classifier as Smoothβ(ψ(h))(x). We consider the setting where a certain minimum level of robustness is required at deployment, say ϵ . In other words, we want our certified radius to be at least ϵ . From the previous discussion, this implies that 2Our results are developed for h = E[Y |X], but we show that most of them can be extended to the approximate case in Appendix G. 3Up to the deterministic nature of ψ(z) whenever z = 0.5. 4For simplicity of notation in our analysis, we use ψ(h)(x) to denote the composition ψ h at x. Published in Transactions on Machine Learning Research (04/2023) βΦ 1(s) ϵ , implying that we need a minimum strength for randomized smoothing, say β def = (1/Φ 1(s))ϵ . Then, given that one will be required to employ Smoothβ (ψ(h))(x), how should h be obtained? To make the above question precise, we recall the definition of the risk of a classifier f under the datadistribution p X,Y as R(f) = P(f(X) = Y ). When f = Smoothβ (ψ(h)), this becomes R(Smoothβ (ψ(h))) = P(Smoothβ (ψ(h))(X) = Y ). As mentioned in the introduction, obtaining h via natural training typically leads to a certifiable predictor with higher error than that of the base classifier, i.e., R(Smoothβ (ψ(h))) > R(ψ(h)). This is expected, as the classifier ψ(h) was precisely trained to minimize the risk R(ψ(h)), whereas its smoothed counterpart Smoothβ(ψ(h)), was not. Such a phenomenon can also be regarded as an out-of-distribution (OOD) problem, albeit in a very specific setting where the distribution shift is produced by a certification method. Indeed, it has been widely demonstrated in practice that naïvely smoothing any base classifier leads to a significant degradation in benign accuracy, i.e., accuracy on non-adversarially-corrupted samples, and hence to a lower certified-accuracy. Besides the well established empirical evidence reported in the literature (Gao et al., 2020; Cohen et al., 2019), we will also showcase this phenomenon in our experiments. To alleviate the discrepancy, practitioners have resorted to retraining the base classifier ψ(h) with a noiseaugmented data distribution instead, denoted here by ps X. This is achieved by adding noise V p V to the data variable X p X to obtain the noise-augmented data Xs = X + V ps X. Following general intuition, in practice one employs the same smoothing distribution for noise-augmentation as that employed for the certification stage. If we let ps X be the PDF of Xs, it is well known that ps X is the convolution of p X and p V , i.e. ps X = p X p V . Throughout this work, we will make minimal assumptions about the underlying data distribution, the parameterization of the predictor, and the (finite) dimension of the data. However, we will assume that the hypothesis class is rich enough, and the learning algorithm good enough, such that the true conditional expectations are learnt successfully. As we show in Appendix B, this implies that the Bayes classifiers ψ(h) and ψ(h p V ) are learnt successfully under data distributions PX and P s X, respectively. Therefore, a classifier learnt on the noise-augmented data results in Smoothβ(ψ(h p V )), which will be the central object of our study. While the assumption of learning the Bayes classifiers exactly might seem stringent, much of our results can be adapted to relax this assumption while maintaining our general proof technique, assuming a controlled difference between the Bayes and the obtained classifier. In Section 4 and Appendix G, we discuss this further and extend our results to handle errors in learning. On the one hand, our analyses of the Bayes classifiers are informative because they reflect the best possible predictors that can be learned from data. On the other hand, and importantly, we will illustrate how these assumptions are reasonable and sufficient to depict what is observed in relevant scenarios. Indeed, our theoretical results and simulations on synthetic experiments will resemble the observations in natural image data, presented in Section 5. As earlier, we will employ noise-augmentation distributions parameterized by a single parameter. Accordingly, we will denote our noise-augmentation distribution as pα to highlight the only parameter, α. For instance, when augmenting data with a Gaussian we have that p V = pα = N(0, α2I); and in the case of a uniform distribution supported over a ℓ2 ball of radius α, one has p V = pα = Unif(B2(0, α)). In this way, the final smoothed classifier is given by Smoothβ(ψ(h pα)). A key goal of our work is to understand the interplay between these two parameters: β (controlling the certified radius) and α (controlling the noise-augmentation of the data distribution). 3 Main Results Our central aim in this section is to theoretically characterize the empirically observed difference between the benign risks of the original classifier ψ(h), and that of the smoothed classifier trained on noise-augmented data, Smoothβ(ψ(h pα)). In other words, we want to obtain an estimate of the excess risk α,β(h) = R(Smoothβ(ψ(h pα))) R(ψ(h)), (2) Published in Transactions on Machine Learning Research (04/2023) Small ζh Large ζh ψ(h) ψ(h pα) Smoothβ(ψ(h pα)) |ψ(h) Smoothβ(ψ(h pα))| Figure 1: Theorem 3.1 and Theorem 3.3 deal with the cases of large (top row) and small (bottom row) interference distance ζh, respectively, where ζh is the average of the pairwise distances between the orange regions, denoted by the black arrows. The text after Theorem 3.3 provides a detailed description of the different regions in the figure. where the risk measures the probability of error over p X,Y (see Section 1). We will present our result conceptually and slightly informally here, and we will provide a more detailed version of these later in Section 4. In our first result, we demonstrate that the behaviour observed in practice (namely, that performing noiseaugmented training is beneficial5) is not universal. We do this by showing that there exists a class of distributions H1 with a certain minimal interference distance property where training the classifier on a noise-augmented distribution before performing the smoothing for certification is in fact detrimental. The interference distance ζh for a classifier ψ(h) can be informally thought of as the average ℓ2 distance between any two disjoint input regions classified as 1 by ψ(h) (normalized by the size of the domain). Figure 1 provides a simple illustration, where ζh is the average length of all the black arrows. We defer the formal definition of this property to Section 4, where we will use tighter characterizations by using the minimum and maximum of such lengths. Theorem 3.1 (No need for noise-augmentation). There exists a class of distributions H1 with large interference distance, such that, for all h H1, training on a smoothed distribution has no benefit; that is 0,β(h) < α,β(h) for all α, β > 0. This result is of importance because it sheds light on the kind of behaviour that can be expected while making minimal assumptions of the (conditional) distribution h. Indeed, we now present a result that upper bounds α,β(h) as a function of the smoothing parameters, α, β. Theorem 3.2 (Simplified Upper-Bound). For any h with bounded support, α,β(h) Gα,β, where Gα,β is a monotonically increasing function of both α and β. We pause to make a few remarks about Theorem 3.2. First, we see that Gα,β increases with the certification parameter β, suggesting that the excess risk α,β may increase as we increase β. This behavior is expected and reflects what is seen in practice. Recall that the inverse of the smoothness parameter of the classifier Smoothβ(ψ(h pα)) is proportional to the randomized smoothing strength, β. Hence as β increases, the classifier changes more slowly, and more error is incurred in the spiky regions of the domain where the true class fluctuates rapidly. This demonstrates the well known accuracy-robustness trade-off: larger β leads to better robustness certificates but worse benign performance (Gao et al., 2020; Cohen et al., 2019). 5Note that we are not studying the robust risk of the smoothed classifier. Nevertheless, understanding properties of the benign risk is an essential step towards understanding the robust risk, as the former is a lower bound on the latter. Published in Transactions on Machine Learning Research (04/2023) Secondly, we see that Gα,β increases monotonically with α. This suggests that, for any fixed value of β, the excess risk α,β is minimized at α = 0; i.e, it is better not to perform noise-augmentation during training. This is contrary to the behavior of randomized smoothing in practice: to obtain a smoothed classifier Smoothβ(ψ(h pα)) with a good benign risk a practitioner typically sets α = β > 0. One might be tempted to conclude that this seeming contradiction stems from the fact that the bound in Theorem 3.2 is too loose to be informative. Yet, we find that this is not the case. In fact, Theorem 3.2 reflects the behaviour of α,β(h) for a general h with bounded support, and this upper bound is tight for functions h H1, i.e., the family of distributions from Theorem 3.1. If retraining on noise-augmented data is not universally needed, then for what family of distributions is it beneficial? It turns out that our notion of interference distance ζh controls this trade-off, as we now show. Theorem 3.3 (Noise-Augmentation helps for distributions with small interference distance). There exists a class of distributions H2 with small interference distance, such that, for all h H2, training on noiseaugmented data is favorable; i.e., there exists α, β > 0 such that α,β(h) < 0,β(h). Before delving into our detailed results in Section 4, we summarize a few key implications of Theorem 3.1 and Theorem 3.3 for randomized smoothing, as illustrated in the example in Figure 1. There, a classifier ψ(h) predicts the class 1 in the orange regions of the input space X, and class 0 otherwise. In the large separation regime that is captured by the family of distributions in H1, the orange regions are far apart and have a very small effect on each other (this is seen pictorially as each positive regions shrinks uniformly upon smoothing as if there were no other positive regions). In these cases, training on noisy data leads to each of the orange regions to shrink in the final smoothed classifier Smoothβ(ψ(h pα)) for any α > 0, and the smoothed classifier with noisy training has larger benign risk than training with no noise (Figure 1 top row). On the other hand, as the separation parameter decreases the orange regions get closer to each other eventually having a significant effect on their neighbors after smoothing (Figure 1 bottom row). In this regime, it becomes possible to set α > 0 to achieve better benign risk after smoothing compared to smoothing a classifier not trained with noisy data (i.e. with α = 0). Such distributions are captured in H2. For these distributions, the result in Theorem 3.2 is in fact not tight, and thus benefit can be derived from setting α > 0. With these informal results in mind, we now present them with greater rigour in the next section, before moving to the numerical experiments later in Section 5. 4 Detailed Results In this section we will expand and make our results from Section 3 more precise. We will firstly prove in Section 4.1 the phenomenon observed in the large separation regime illustrated in the top panel of Figure 1, and demonstrated in our experiments (Figure 3). Next, in Section 4.2, we will construct data distributions that have a specific structure, following the phenomenon observed in the small separation regime, illustrated in the bottom panel of Figure 1. Given a conditional distribution h, we define a partition of the input space X as X = I I+, where I+ contains all the points where the classifier outputs 1, i.e., I+ = {x X : ψ(h)(x) = 1}. Further, we think of I+ as being composed of disjoint positive partitions I1, I2, . . . , IM such that each partition is simply connected 6. I is the rest of the space, i.e., X \ I+, and is partitioned analogously. In other words, I+ = I1 I2 . . . IM and I = X \ I+, where denotes the disjoint union. Note that how to exactly obtain a valid partition above is unspecified we construct the partitions explicitly for Theorems 4.1 and 4.3, and Theorem 4.2 holds for any valid partitioning. In Figure 1, these regions are colored orange. When a classifier is trained on data perturbed by noise pα, and then smoothed by noise pβ, each of these positive partitions change in some way. Denote these latter partitions I1(α, β), . . . , IM(α, β). In the following two subsections, we will show how the difference between 6A subset of the space X is defined in the standard topological sense to be simply connected, if it is a connected region containing no holes, i.e., every curve can be continuously contracted to a point. Published in Transactions on Machine Learning Research (04/2023) Ik and Ik(α, β) changes under different separation regimes, and how this in turn determines the behavior of α,β. 4.1 Large Separation Regime We will first formalize and prove a version of Theorem 3.1 by constructing suitable distributions H1 with a large interference distance. We will then generalize the proof strategy to arbitrary distributions to obtain Theorem 3.2. Unless specified otherwise, we assume that the noise distributions (i.e., those for smoothing) pθ(x) we are working with are nice, in the following specific sense. Nice probability distributions are decreasing in their argument x, and are spherically symmetric. Formally, a probability distribution p over Rd parameterized by a scalar α R is defined to be nice, if for all x1, x2 Rd, p satisfies the properties (Decreasing in argument norm) pα(x1) pα(x2), if x1 2 x2 2, and, (Spherically symmetric) pα(x1) = pα(x2), if x1 2 = x2 2. Nice distributions comprise Uniform and Gaussian as special cases. Throughout the paper, we consider the family of pα and pβ to be the same (e.g., both Uniform or both Gaussian). This is not required for our analyses, and simple extensions of our results could generalize further to these having different forms. The lower interference distance for a given h is now defined as the minimum distance between any two positive partitions, i.e., ζh = mini =j dist(Ii, Ij). The lower interference distance over a family H is defined as ζH = minh ζh. We can now state the detailed version of Theorem 3.1. The full proof, along with further properties of these nice distributions, can be found in Appendix D. Theorem 4.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1 > max(α, β), such that for all h H1 we have 0,β(h) < α,β(h) for all 0 < α < ζH1 and all 0 < β < ζH1. While we tightly bound α,β from above and below in Theorem 4.1, we are able to do so by assuming that the positive partitions I+ are perfectly spherical. In the following result, we show that we can in fact generalize our proof technique to obtain an upper-bound for α,β in Theorem 4.2 that only assumes that h is supported on the (bounded) data domain X, with partitions of arbitrary shape. In doing so, we obtain an upper-bound that might be loose in general. However, we note that there exist simple distributions (H1) where the upper-bound in Theorem 4.2 is tight, implying that it is the best one could hope for without assuming anything else about h. The full proof can be found in Appendix E. Theorem 4.2. For nice noise distributions pα, pβ, for any h supported on a bounded domain X, k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)), (3) where rk α = r Ψ 1 α 0.5 0.5+τ if Ik is a positive partition, i.e., Ik I+ and rk α = r Ψ 1 α 0.5 0.5 τ otherwise, i.e., Ik I . Further, rk α,β = rk α + q Ψ 1 β (0.5), and Ψα, Ψβ are the CDFs of the distribution of z 2 2 when z pα and z pβ, respectively. p X(S) denotes the measure of the set S under p X. The upper bound (3) holds for any choice of {(ωk h,τ, ˆxk)} such that ˆxk is the center of a ball of radius ωk h,τ completely contained in Ik,τ defined as Ik,τ = {x Ik : |h(x) 0.5| τ}. We can choose the sequence {(ωk h,τ, ˆxk)} such that the upper bound (3) is minimized. This results upper bounds the excess risk by one minus the sum of the measures of balls under the data distribution. Each partition Ik contributes to the upper bound in (3) via the difference (ωk rk). The first term ωk h,τ denotes the inradius of a subset of Ik, defined as the portion of Ik classified with a confidence margin τ, i.e., Ik,τ = {x Ik : |h(x) 0.5| τ}. The second term rk captures the shrinkage produced by smoothing. Specifically, rk α denotes the shrinkage caused in Ik while moving from the original classifier to the Published in Transactions on Machine Learning Research (04/2023) c1 c2 c3 c4 c1 c2 c3 c4 ψ(ψ( h) pβ0) c1 c2 c3 c4 c1 c2 c3 c4 ψ(ψ( h pα) pβ0) Noise-Augmented Randomized Smoothing Randomized Smoothing Figure 2: An example h H2 where noise-augmented training is beneficial for randomized smoothing. The orange and blue shaded bars denote the regions where the smoothed classifiers make errors, i.e., ψ( h) = Smoothβ0(ψ( h)) and ψ( h) = Smoothβ0(ψ( h pα0)) respectively. Assuming a suitable data-distribution p X, the orange region can have a higher p X-mass than the blue region, i.e., α0,β0( h) < 0,β0( h). See the text below for details of the four sub-figures. noise-trained classifier, i.e., ψ(h) ψ(h pα). Similarly, rk α,β denotes the shrinkage caused while moving from the noise-trained classifier to the randomized smoothed classifier, i.e., ψ(h pα) ψ(ψ(h pα) pβ). For any fixed β, as we increase the noise-augmentation strength α, the shrinkage rk α,β also increases, leading to an increase in the upper bound in Eq. (3). 4.2 Small Separation Regime We will show that, unlike what is reflected by Theorem 4.1, there exists a family of distributions H2 characterized by a small separation distance where α,β0 is indeed minimized at some α > 0, which is the behavior observed in practice in common image datasets. We will begin with an illustrative one-dimensional example of a data distribution supported on X R where we will show that the interference distance ζ explicitly controls whether any benefit can be obtained by noise-augmentation. We will observe the basic structure required in h for this to occur, and then generalize this structure to construct examples supported on X Rd forming the family H2 required for Theorem 4.3. 1 dimensional example We will now walk through the four panels of Figure 2. First, we let h be defined as ( 0 x ( 0.5, c1) (c2, c3) (c4, 0.5) 1 otherwise, (4) where the specific values of c1, . . . c4 are not important for the example but can be found in Appendix F. Additionally, let p X be uniform in [c1, c4] this assumption is not needed, but simplifies the following discussion. The resulting ψ( h) is plotted in the top-left panel of Figure 2. For smoothing, we use the nice distribution pβ0 = Unif([ β0/2, β0/2]) where β0 is just large enough, so that (ψ( h) pβ0)(x) = 0 for all x [c1, c4]. In other words, randomized smoothing ensures that the smoothed classifier Smoothβ0(ψ( h)) outputs a consistent label across the input, hence ensuring robust classification (and in particular, that label is 0). However, observe that the benign risk of the smoothed classifier R(Smoothβ0(ψ(h))) is very high, as it makes an error whenever x [c1, c2] [c3, c4]. This situation is shown in the bottom left panel of Figure 2. Published in Transactions on Machine Learning Research (04/2023) We now consider the nice noise distribution for noise-augmented training, defined as pα0 = Unif([ α0/2, α0/2]). For a suitable value of α, we can ensure that ψ( h pα0)(x) = 1 for all x [c1, c4]. This is shown in the top right panel of Figure 2. But now, since β0 was just enough to obtain Smoothβ0(ψ( h)) = 0, the additional positive mass in [c2, c3] causes the classification to flip, and now Smoothβ0(ψ( h) pα0)(x) = 1 for all x [c1, c4]. The benign risk of the smoothed classifier now is much lower than earlier, as it only makes an error on the small crevice [c2, c3]. This situation is shown in the bottom right panel of Figure 2. Thus, one can choose h, p, α0, β0 such that R(Smoothβ0(ψ( h) pα0)) < R(Smoothβ0(ψ( h))), (5) demonstrating that training with noise augmentation is indeed beneficial when h has a low interference distance. The specific values of α0, β0 can be found in Appendix F. Fixing the noise-augmentation distribution pα0 and the smoothing distribution pβ0, we now modify h to h by increasing the interference distance |c2 c3| while maintaining the same structure, i.e. h(x) = 0 when x ( 0.5, c1) ( c2, c3) ( c4, 0.5), and h(x) = 1 otherwise. When | c2 c3| > α0/2, noise-augmentation with pα0 is no longer effective, i.e., ψ(ψ( h) pα0) = ψ(ψ( h) pα0). Intuitively, this is caused by insufficient positive mass near any class-0 point for the prediction to flip after noise-augmentation and this can be verified using the values of α0, β0 provided in Appendix F). Applying randomized smoothing on ψ(ψ( h) pα0) now simply leads to Smoothβ0(ψ( h) pα0)(x) = 0 everywhere, implying that the risk after smoothing is high, i.e., R(Smoothβ0(ψ( h) pα0)) R(Smoothβ0(ψ( h))). (6) Thus, we have shown via this example that the interference distance directly controls whether we get any benefit out of training with noise-augmentation (5) or not (6). The distance |c2 c3| above corresponds to a small upper interference distance ζ, which we define now as the maximum distance between any two positive partitions, i.e., ζ = maxi =j dist(Ii, Ij). With this definition, we can now extend the ideas in the simple example above to general constructions in d dimensions. Theorem 4.3. There exist nice distributions pα, pβ, a family H2 with low interference distance ζH2, and data-distributions p X, such that for all h H2 we have 0,β0(h) > α,β0(h) for some α > 0, β0 > 0. We pause again to understand some implications of Theorem 4.3. From the one-dimensional example and the proof of Theorem 4.3, we see that training with noisy data, i.e., α > 0, is better for distributions with low interference distance. We conjecture that natural distributions follow this structure as well, containing regions corresponding to one class that are packed closely in the domain. Interestingly, we find evidence in the favor of this conjecture in our experiments in Section 5. Moreover, this result also shows that there is no reason for the noise level α to be the same as the smoothing level β for final smoothed classifier to obtain its lowest risk. This suggests that the common practice of using α = β for randomized smoothing (e.g., see (Cohen et al., 2019)) might not be optimal. Indeed, we will shortly demonstrate in our experimental section that better choices for smoothing can be found by relaxing the constraint that the training noise level α should be equal to β. Additionally, our findings might provide a theoretical foundation for recent works (Alfarra et al., 2020; Anderson and Sojoudi, 2022; Súkeník et al., 2021) on obtaining an adaptive β for smoothing at each input point. Exactly Learning Bayes Classifiers We pause here to comment on the assumptions we made in analyzing randomized smoothing on a noise-smoothed distribution, Smoothβ(ψ(h pα)). As we mentioned above and as we prove in Appendix B our analysis assumed that one has access to the true conditional distribution of E[Y |Xs]. However, in more realistic settings, the learned predictor might deviate from this optimal value. We now argue that our analysis is still relevant in these cases, too. On the one hand, in Theorem G.1, we provide an extension of our result in Theorem 4.1 by considering a predictor that is an inexact approximation to the conditional expectation. In other words, we assume that training with noise-augmentation results in a function g(x) that is not too far from the true smoothed conditional distribution, |g(x) h pα(x)| η for all x X. As we show in Appendix G, our same proof technique allows us to show that indeed, there exist distributions where no benefit can be expected from Published in Transactions on Machine Learning Research (04/2023) randomized smoothing, even in this more general case where the assumption of learning the true conditional distribution is relaxed. On the other hand, it is also not hard to see that our results that provide an upper bound to the excess risk based on Bayes classifier, α,β(h), are also useful in characterizing upper bounds to more realistic predictors that might deviate from the true conditional distribution. Indeed, given access to g that is only an approximation to h pα, one can show (see our derivation in Appendix G) that the excess risk of smoothing g can be upper bounded by α,β(g, h) α,β(h) + p X ((h pα)(X) = g(X)) . (7) As a result, our upper bound in Theorem 4.2 also provides an upper bound to the excess risk of more general predictors g and the tightness of this bound is naturally controlled by the mass under p X of the errors between g and h pα. Extension to Multiple Classes Although our theory was developed for binary classification, a standard extension to K-way classification is possible, using the technique followed while certifying multi-class Randomized Smoothing based classifiers Cohen et al. (2019), Yang et al. (2020). In this context, the class Y = 0 now represents one of the K-classes, and Y = 1 represents all the remaining classes. All our theorems would then extend, having an additional parameter for the class viewed as 0. Having established our theoretical results, we now turn to an empirical verification of some of them on synthetic datasets, as well as MNIST and CIFAR-10. 5 Experiments Synthetic Experiments. We begin by conducting synthetic experiments7 with distributions resembling those used in the proofs of our existence results Theorems 3.1 and 3.3. We take X = [0, 100] [0, 100] as the data domain, and sample the positive partitions of h as spheres at random locations in the domain (see Appendix A for more details about the construction) to obtain the data-distribution hζ, where the superscript ζ denotes the interference distance. Figure 3 shows a few examples of hζ for different values of ζ. ζ = 0 ζ = 10 ζ = 20 ζ = 30 Figure 3: Examples of ψ(hζ) for different values of ζ. The blue regions denote ψ(hζ)(x) = 0 and the orange regions denote ψ(hζ)(x) = 1. In Figure 4, we report quantitative results using our synthetic distributions as described above. Over several choices of interference distance ζ, we first sample several synthetic data distributions hζ. We then plot the quantity α,β 0,β, which is the difference between the benign risks of smoothing a noise-augmented classifier, i.e., R(Smoothβ(ψ(h pα)), and smoothing a classifier not trained with noise-augmentation, i.e., R(Smoothβ(ψ(h))). We increase the smoothing strength β moving from left to right in Figure 4, and plot the difference in the risks as a function of α for each β. Recall that noise-augmentation is useful only when we can 7We provide code for all of our experiments at https://github.com/ambarpal/randomized-smoothing. Published in Transactions on Machine Learning Research (04/2023) find α0, β0 such that α0,β0 0,β0 < 0. Accordingly, we plot a dashed line whenever noise-augmentation is useful, i.e., α0, β0 α0,β0 < 0,β0 and solid otherwise, i.e., α, β α,β 0,β. At high ζ, we are operating in the large interference distance regime and the results from Section 4.1 apply. Those results dictate that α,β0 should be minimized at α = 0 for any fixed β0 > 0. Indeed, this phenomenon can be seen in Figure 4 for ζ = 3, where we see that the line remains solid for all β. On the other hand, at lower ζ, we go into the small interference distance regime and the results from Section 4.2 apply. Those results dictate that α,β0 should be minimized at some α > 0 given a large enough, but fixed β0 > 0. Indeed this phenomenon is seen in Figure 4, where the lines tend to become dashed as β increases. Thus, our synthetic experiments confirm the predictions from our theory. 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α Δα, β Δ0, β 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 α Figure 4: Synthetic Experiments showing the variation in α,β. The line is dashed if there exists α > 0 such that α,β < 0,β, and solid otherwise. MNIST, CIFAR-10 Experiments. We now proceed to conduct experiments with real data distributions. For a range of finely spaced α in [0, 1], we train a standard CNN hα with isotropic Gaussian noise-augmentation with variance α2, i.e., pα = N(0, α2I). For each of these trained models, we use the isotropic-Gaussian with variance β2, with β [0, 1], as the smoothing distribution, i.e., pβ = N(0, β2I), to obtain the smoothed classifier Smoothβ(ψ(h pα)). We then plot the standard empirical estimate of the risk, ˆ α,β = P (x,y) Stest 1[Smoothβ(hα)(x) = y] over the test set Stest, for different α, β in Figure 5. The first observation that we make from the MNIST and CIFAR-10 plots in Figure 5 is that a non-zero data-augmentation is always beneficial, i.e., α > 0. While not an implication arising from our theory, this behavior is reminiscent of our results in Theorem 4.3, suggesting that real data distributions lie in the small interference distance regime. Additionally, we note the similarity of the synthetic experiment plots at low ζ in Figure 4 and the real-data curves in Figure 5 (upto scaling of the α-axis, and the fact that we subtract 0,β for normalization in Figure 4), providing empirical validation for our theoretical assumptions. The second observation we make from Figure 5 is that, as suggested by Theorem 4.3, the smoothing parameter need not be the same as the data augmentation parameter for the best performance of the smoothed classifier. This phenomenon has also been observed empirically in prior work Alfarra et al. (2020), Salman et al. (2019), Zhai et al. (2020), and our results provide a theoretical foundation for such observations. 6 Related Work In this work, we study randomized smoothing, which is currently the method of choice for obtaining robustness guarantees for deep, complex neural network classifiers. Since the early proposals of stochastic smoothing (Liu et al., 2018), many works have derived robustness certificates against adversarial attacks for smoothed classifiers (Lecuyer et al., 2019; Li et al., 2018; Cohen et al., 2019). These certificates obtain a radius r around an input x such that the network prediction is robust in an ℓp ball of radius r around x. Studied Published in Transactions on Machine Learning Research (04/2023) (b) CIFAR-10 Figure 5: Plot of ˆ α,β for MNIST (5a) and CIFAR-10 (5b). For each fixed value of β = β0, the legend α (and the corresponding shaded disk on the plot) shows the minimizer of the α,β0 curve. balls include p = 0 (Levine and Feizi, 2020a; Lee et al., 2019), p = 1 (Teng et al., 2019), p = 2 (Cohen et al., 2019; Salman et al., 2020; 2019) and p = (Zhang, 2002). Certificates have also been extended to non-ℓp smoothing distributions (Levine and Feizi, 2020b). For different notions of optimality, the optimal certificates have also been characterized (Yang et al., 2020; Kumar et al., 2020). However, there has been lesser attention given to the impact that randomized smoothing has on the benign accuracy of the smoothed classifier, and a theoretical understanding for the need or lack thereof for noise-augmented training of the classifier in order to produce an accurate smoothed classifier. Recent work in Gao et al. (2020), which is closest to ours in spirit, takes a first step towards this goal by analyzing the class of hypotheses that are realizable when training on smoothed data. Their result shows that this re-training strictly reduces the hypothesis class whenever the smoothing strength is above a threshold. In this way, Gao et al. (2020) provide some theoretical basis for the reduction of accuracy of the so smoothed-trained classifier. Elsewhere, Mohapatra et al. (2021) theoretically demonstrate that randomized smoothing leads to shrinkage of class boundaries for simple data distributions. Importantly, their definition of shrinkage is that the bounding sphere (or cone, for semi-bounded regions) of the decision boundary becomes smaller. This is different from our analysis, as a shrunk decision region Rσ might not be a subset of the original region R under this definition. In comparison, our universal result in Theorem 4.2 holds in much more generality for arbitrary (bounded) data-distributions, and our existence results Theorems 4.1 and 4.3 construct specific synthetic datasets. While Mohapatra et al. (2021) only analyse the distance f Smoothβ(f) 2 given a trained classifier f, we directly bound the risk R(Smoothβ(f)), and shrinkage does show up in some parts of our analysis. Our proof techniques are able to handle both noise-augmentation of strength α and randomized smoothing of strength β simultaneously, allowing us to discover cases where noise-augmentation helps randomized smoothing (in other words, showing cases where the combined effect might not be a shrinkage of the decision boundaries). Some recent works have tried to move away from having the same smoothing and noisy-training distribution, by studying input-dependent smoothing (Alfarra et al., 2020; Anderson and Sojoudi, 2022; Súkeník et al., 2021). Lastly, and in a very different context of graph convolutional networks (GCNs), recent results in Keriven (2022) show that a positive level of smoothing (defined as the number of GCN layers) can be beneficial for a supervised learning task, before becoming detrimental once smoothing is too large. Studying further connections between our analysis and smoothing for GCNNs constitutes an interesting direction of research. Published in Transactions on Machine Learning Research (04/2023) 7 Conclusion, Limitations and Future Work In this work, we provided a theoretical understanding of how noise-augmented training affects classifiers defended using randomized smoothing. We identified a parameter of the data distribution, which we dubbed interference distance, that, for certain families of data-distributions, determines whether noise-augmented training could be beneficial for randomized smoothing. Using this parameter, we showed there exist data distributions where noise-augmented training reduces the accuracy of the final smoothed classifier, contrary to common observation. Our upper bound to the benign risk in Theorem 4.2 for very general conditional distributions and which is tight for distributions presented in Theorem 4.1 demonstrates that no benefit of randomized smoothing is possible without further structural distributional assumptions. In turn, we showed the existence of distributions with small interference distance in Theorem 4.3, where improvements by noise-augmented training is possible. We complemented our theoretical results with empirical validation, suggesting that real-world distributions lie in regimes where the parameter ζ is small, and so noise-augmented training can indeed help randomized smoothing if the smoothing strength is chosen properly. Contrary to intuition, the proof of our theoretical result in Theorem 4.3, and our experiments, suggest that this smoothing strength need not be the same as the noise-augmentation strength for best performance of the final predictor. We now revisit the open questions that we posed in Section 1. Firstly, why is noise-augmentation helpful while training the base classifier? We saw in the bottom row of Figure 1 that noise augmented training helps the smoothed classifier have good benign accuracy when we are in the low-interference distance regime, as it alleviates the degradation of the risk incurred by randomized smoothing at deployment time. Secondly, how does the performance of the smoothed classifier depend on the training noise distribution? We saw in Theorems 4.1 and 4.3 that in the high interference regime, training with low to zero noise leads to good performance of the smoothed classifier. On the other hand, in the low-interference regime, the training noise can be tuned to extract good performance of the smoothed classifier. These conclusions are derived for the cases when the same family of distribution is used for augmentation during training and for smoothing at deployment time. Future work could consider extending these to different classes of distribution. Thirdly, is noise-augmented training needed for all data-distributions? The answer is no, noise-augmentation is not always effective in reducing the risk of the classifier after randomized smoothing, and the interference-distance parameter can distinguish between families of data-distributions where noise-augmentation helps randomized smoothing. It remains unclear how to efficiently determine this parameter for real data-distributions, which would allow us to determine whether noise-augmented training is needed at all, or what the optimal parameters of the augmentation distribution should be. This remains matter of future research. We note that our results are independent of the learning aspect of the problem, i.e., we assume that the learning algorithm is good enough to obtain the predictor with the lowest possible risk. These assumptions, while simplistic, nonetheless allowed us to characterize observations that have practical relevance. Though we provide some results on how to relax such assumptions in Appendix G, we see this as a starting point for future work that should explore these trade-offs in more generality. Additionally, our experiments and theory suggest that the best data-augmentation strength is not the same as the smoothing strength in general, and we hope that future work can theoretically quantify and provide precise estimates for this quantity. Finally, in this paper, we studied the benign risk of classifiers after smoothing. Future work could extend these results to the study of the robust risk (which is lower bounded by the benign risk), and extend our techniques to quantify how the former depends on the noise-augmentation distribution. Acknowledgements The authors thank René Vidal for insightful comments, as well as the anonymous reviewers for their valuable suggestions, which helped improve this manuscript. AP acknowledges the support of DARPA (via GARD HR00112020010) and NSF (via Grant 1934979). JS acknowledges support from DARPA GARD HR00112020004. Published in Transactions on Machine Learning Research (04/2023) Motasem Alfarra, Adel Bibi, Philip HS Torr, and Bernard Ghanem. Data dependent randomized smoothing. ar Xiv preprint ar Xiv:2012.04351, 2020. Brendon G Anderson and Somayeh Sojoudi. Certified robustness via locally biased randomized smoothing. Proceedings of Machine Learning Research vol, 144:1 14, 2022. Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ar Xiv preprint ar Xiv:1802.00420, 2018. Pranjal Awasthi, Vaggos Chatziafratis, Xue Chen, and Aravindan Vijayaraghavan. Adversarially robust low dimensional representations. ar Xiv preprint ar Xiv:1911.13268, 2019. Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, and Aravindan Vijayaraghavan. Adversarial robustness via robust low rank representations. ar Xiv preprint ar Xiv:2007.06555, 2020. Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. ar Xiv preprint ar Xiv:1805.10204, 2018. Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3 14, 2017. Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. ar Xiv preprint ar Xiv:1902.02918, 2019. Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. Pac-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, pages 230 241, 2018. Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pages 13009 13020, 2019. Yue Gao, Harrison Rosenberg, Kassem Fawaz, Somesh Jha, and Justin Hsu. Analyzing accuracy loss in randomized smoothing defenses. ar Xiv preprint ar Xiv:2003.01595, 2020. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. ar Xiv preprint ar Xiv:1412.5068, 2014. Nicolas Keriven. Not too little, not too much: a theoretical analysis of graph (over) smoothing. ar Xiv preprint ar Xiv:2205.12156, 2022. Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning, pages 5458 5467. PMLR, 2020. Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016. Richard J Larsen and Morris L Marx. An introduction to mathematical statistics. Prentice Hall, 2005. Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656 672. IEEE, 2019. Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019. Published in Transactions on Machine Learning Research (04/2023) Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4585 4593, 2020a. Alexander Levine and Soheil Feizi. Wasserstein smoothing: Certified robustness against wasserstein adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 3938 3947. PMLR, 2020b. Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. ar Xiv preprint ar Xiv:1809.03113, 2018. Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. Advances in neural information processing systems, 32, 2019. Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369 385, 2018. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. ar Xiv preprint ar Xiv:1702.04267, 2017. Jeet Mohapatra, Ching-Yun Ko, Lily Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Hidden cost of randomized smoothing. In International Conference on Artificial Intelligence and Statistics, pages 4033 4041. PMLR, 2021. Ramchandran Muthukumar and Jeremias Sulam. Adversarial robustness of sparse local lipschitz predictors. ar Xiv preprint ar Xiv:2202.13216, 2022. Ambar Pal and René Vidal. A game theoretic analysis of additive adversarial attacks and defenses. ar Xiv preprint ar Xiv:2009.06530, 2020. Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. ar Xiv preprint ar Xiv:1801.09344, 2018. Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019. Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, and J Zico Kolter. Black-box smoothing: A provable defense for pretrained classifiers. ar Xiv preprint ar Xiv:2003.01908, 2020. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014 5026, 2018. Ali Shafahi, W Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? ar Xiv preprint ar Xiv:1809.02104, 2018. Peter Súkeník, Aleksei Kuvshinov, and Stephan Günnemann. Intriguing properties of input-dependent randomized smoothing. ar Xiv preprint ar Xiv:2110.05365, 2021. Jeremias Sulam, Ramchandran Muthukumar, and Raman Arora. Adversarial robustness of supervised sparse coding. Advances in neural information processing systems, 33:2110 2121, 2020. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Published in Transactions on Machine Learning Research (04/2023) Jiaye Teng, Guang-He Lee, and Yang Yuan. ℓ1 adversarial robustness certificates: a randomized smoothing approach. 2019. Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. ar Xiv preprint ar Xiv:2002.08347, 2020. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach. In Advances in Neural Information Processing Systems, pages 12259 12269, 2019. Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693 10705. PMLR, 2020. Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. ar Xiv preprint ar Xiv:1810.11914, 2018. Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403 2412, 2018. Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. Macer: Attack-free and scalable robust training via maximizing certified radius. ar Xiv preprint ar Xiv:2001.02378, 2020. Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527 550, 2002. Published in Transactions on Machine Learning Research (04/2023) A Experimental Details Synthetic Experiments To construct our synthetic distributions, we take X = [0, 100] [0, 100] as the data domain. We start with an empty set I = {} and radius r = 10. We repeat the following steps 500 times: (1) Sample a center c in X uniformly at random. (2) Test whether c c 2 > ζ + 2r for all spheres Bℓ2(c , r) I. If true, add Bℓ2(c , r) to I, otherwise reject this sample. At the end of this process, we obtain a set of spheres I satisfying the property c c 2 ζ + 2r for all c = c , where ζ is the interference distance. MNIST Experiments We take 100 samples from each of pα, pβ per image x in the MNIST test set to compute a single point α,β. The figures are quite stable against random initialization due to the MNIST test set size of 10000 samples. We use the following architecture for the neural network classifier hα: Input(28, 28) Conv3,1(1, 32) Conv3,1(32, 64) Linear(9216, 128) Linear(128, 2) In the above, the input image has dimension 28 28, and convolution layers with a filter size of 3 and a stride of 1 are applied. The notation Conv3,1(a, b) denotes that the input depth is a and the number of filters is b for the convolution. Finally, the notation Linear(a, b) denotes a affine transformation with input dimension a and output dimension b. CIFAR-10 Experiments We use the DLA Architecture (Yu et al., 2018) for the neural network classifier. For obtaining hα, we use standard noise augmentation on the CIFAR-10 dataset, i.e., we augment every training sample with noise sampled from pα. For obtaining the smoothed classifier Smoothβ(hα), we use 200 noise samples from pβ for every image. We report results aggregated over 4000 images from the CIFAR-10 test set. B Derivation of Bayes Classifiers Recall that X and Y denote the data and label random variables, respectively, such that (X, Y ) p X,Y . In this binary setting, the Bayes classifier, i.e. the classifier that minimizes the probability of misclassification, can be obtained by thresholding the conditional expectation E[Y |X = x] at 0.5. We assume that the learning algorithm results in a predicted score h(x) = E[Y |X = x]. The prediction of this model is given by ψ(h). Now, we instead train on noisy data, Xs ps X, where Xs = X + V such that V p V . It is well known that ps X = p X p V , where denotes the convolution operator (Larsen and Marx, 2005). Further, we can show now that the conditional expectation is E[Y |Xs = x] = h p V : E[Y |Xs = x] = p(Y = 1|Xs = x) = p(Y = 1|X + V = x) v p(Y = 1|X + V = x, V = v)p V (v)dv v p(Y = 1|X = x v, V = v)p V (v)dv v p(Y = 1|X = x v)p V (v)dv (8) v h(x v)p V (v)dv = (h p V )(x), where (8) uses the fact that the noise variable V is independent of the data variable X. As earlier, the Bayes classifier is now given by a thresholding of the conditional expectation, i.e., ψ(h p V ). Published in Transactions on Machine Learning Research (04/2023) C Nice Distributions and their Properties Nice Distributions. A probability distribution p over Rd parameterized by a scalar α R is defined to be nice, if p satisfies the properties (9) and (10): (Decreasing in argument norm) pα(x1) pα(x2), if x1 2 x2 2 (9) (Spherically symmetric) pα(x1) = pα(x2), if x1 2 = x2 2 (10) For ease of understanding of the rest of the section, one can think of p to be the uniform distribution supported on a ball of radius θ, i.e., pθ = Unif(Bℓ2(0, θ)), which is a nice distribution. Shifted CDF. We define the multivariate cumulative distribution function (CDF) w.r.t. x X Rd as Φα(r, x) = Z t 2 r pα(x t)dt (11) For nice distributions p, we can show the following properties for the CDF: (Decreasing in argument Norm) Φα(r, x1) Φα(r, x2), if x1 2 x2 2 (12) (Spherically symmetric) Φα(r, x1) = Φα(r, x2), if x1 2 = x2 2 (13) (Increasing in radius) Φα(r1, x) Φα(r2, x), if r1 r2 (14) To see (13), we choose any x1, x2 with x1 = x2 . Now, note that the integral in (11) can be rewritten as Φα(r, x) = R t Bℓ2(x,r) pα(t)dt, which in other words computes the mass of the pα contained in an ℓ2 ball of radius r around x. Since x1 = x2 , we can find a rotation matrix Q that maps x1 to x2 as Qx1 = x2. Observe that the same rotation matrix Q also maps the entire ball B(x1, r) to the ball B(x2, r). Finally, (13) follows from by observing that pα(x) = pα(Qx) for any rotation matrix Q as x 2 = Qx 2. Now that we know that Φα(r, x) is spherically symmetric in the second argument, we can focus our attention to the restriction of Φ along the standard basis vector e1 for showing (12). For any r > 0, consider the function φ: R R defined as φ(c) = Φ(r, ce1) for a scalar c. Points x1, x2 with x1 x2 correspond to c1 = x1 and c2 = x2 in the sense that Φ(r, x1) = φ(c1) and Φ(r, x2) = φ(c2), and c1 c2 0. Define D1 = Bℓ2(c1e1, r) and D2 = Bℓ2(c2e1, r). We have, t D1 pα(t)dt t (D1 D2) pα(t)dt + Z t (D1\D2) pα(t)dt t (D1 D2) pα(t)dt + Z t (D2\D1) pα(t)dt (15) To obtain the inequality in (15), observe that D1, D2 are spheres of the same radius with their centers c1e1, c2e2 such that 0 c2 c1. This ensures that any point in D2 \ D1 has a lower ℓ2 norm than any point in D1 \ D2 , which in turn means that the every term in the integral R t (D2\D1) pα(t)dt is lower bounded by every term in the integral R t (D1\D2) pα(t)dt, using Property (9) of pα. Due to properties (12) and (13), we see that Φα(r, x) is non-increasing as x 2 increases. As a result, we can define the function Aα,r : R R as Aα,r(c) = max {x: Φα(r,x) c} x 2, (16) which computes the maximum ℓ2 norm that x 2 can take before the CDF falls below c. This function will be useful to us later when we reason about how much does smoothing a classifier contract or expand its positive regions. Published in Transactions on Machine Learning Research (04/2023) We firstly note that Aα,r is decreasing in c. This can be seen as increasing c makes the constraint tighter in (16), and hence the optimal objective value can only decrease when we increase c. Secondly, we note that Aα,r(Φα(r, x0)) = x0 2, for any x0 X, α 0 and r 0. This can be seen by observing that for x = x0, we have c0 = Φα(r, x0), and thus x0 is a feasible point for the optimization problem max{x: Φα(r,x) c0} x 2. Hence Aα,r(c0) x0 2. Then, by property (12), we know that Φα(r, x ) Φα(r, x0) whenever x 2 x0 2, implying that any such x will not be feasible. This shows that Aα,r(c0) x0 2, completing the argument. D Proof of Theorem 4.1 Theorem 4.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1 > max(α, β), such that for all h H1 we have 0,β(h) < α,β(h) for all 0 < α < ζH1 and all 0 < β < ζH1. Proof. Recall that h(x) = p(Y = 1|X = x). Let I1, I2, . . . , Ik be spheres such that Ij = Bℓ2(cj, rj). Define h(x) = 0.5 + τ whenever x I1 I2 . . . Ik, and 0 otherwise. Large Interference Distance Recall that ζh = mini =j dist(Ii, Ij), and assume that ζh > max(α, β). Recall the CDF Φα(x, r) = R t 2 r pα(x t)dt, which is a function decreasing monotonically in x 2. Also recall the function Aα,r with the property Aα,r(Φα(x, r)) = x 2 for all x. For any x Ij, we can obtain h pα as h pα(x) = Z Ij h(t)pα(x t)dt + Z X\Ij h(t)pα(x t)dt Ij (0.5 + τ)pα(x t)dt + Z X\Ij h(t) 0 dt = (0.5 + τ)Φα(x cj, rj). α Shrinkage The regions where the classifier ψ(h pα) predicts 1 are given by Ij,α = {x Ij : h pα 0.5}. This region Ij,α is an α-shrinkage of Ij and can be obtained as (0.5 + τ)Φα(x cj, rj) 0.5 = x cj 2 Aα,rj 0.5 0.5 + τ def = rj,α, (17) where we have used the definition of the function Aα,r. Now, the regions where the classifier Smoothβ(h pα) predicts 1 are similarly given by Ij,α,β = {x Ij,α : ψ(h pα) pβ 0.5}. We can follow the same steps as earlier on the α-shrunk balls Ij,α = Bℓ2(cj, rj,α). This time, for x Ij,α we have Smoothβ(ψ(h pα))(x) = Z Ij,α ψ(h pα)(t)pβ(x t)dt + Z X\Ij,α ψ(h pα)(t)pβ(x t)dt Ij,α 1 pβ(x t)dt = Φβ(x cj, rj,α). β Shrinkage Again, we can find the β-shrinkage in Ij,α as Φβ(x cj, rj,α) 0.5 = x cj 2 Aα,rj,α(0.5) def = rj,α,β. Published in Transactions on Machine Learning Research (04/2023) We have thus obtained Ij,α,β as the α, β-shrunk balls, i.e., Ij,α,β = Bℓ2(cj, rj,α,β). We can now compute the risk of the smoothed classifier as R(Smoothβ(ψ(h pα))) X |Smoothβ(ψ(h pα))(x) Y |p(x, y)dxdy X |Smoothβ(ψ(h pα))(x) 1|p(1|x)p X(x)dx + Z X |Smoothβ(ψ(h pα)(x)|p(0|x)p X(x)dx. Let I = I1 I2 . . . Ik be the positive regions, and Iα,β = I1,α,β I2,α,β . . . Ik,α,β be the shrunk positive regions. For the first term above, we note that |Smoothβ(ψ(h pα))(x) 1|p(1|x) = 0, x Iα,β 0.5 + τ, x I \ Iα,β 0, x X \ I . Similarly, for the second term, we see that Smoothβ(ψ(h pα))(x)p(0|x) = 0.5 τ, x Iα,β 0, x I \ Iα,β 0, x X \ I . Substituting into the integrals, we obtain the risk of the smoothed classifier R(Smoothβ(ψ(h pα))) = (0.5 + τ)p X(I \ Iα,β) + (0.5 τ)p X(Iα,β) = (0.5 + τ)p X(I) 2τp X(Iα,β). Recalling that p X( ) is positive everywhere in the domain, and using I0,β Iα,β for 0 < α, β < ζh in the above, we obtain p X(I0,β) > p X(Iα,β) = 2τp X(Iα,β) > 2τp X(I0,β). In other words, we have R(Smoothβ(ψ(h pα))) > R(Smoothβ(ψ(h))) = α,β > 0,β. E Proof of Theorem 4.2 Theorem 4.2. For nice noise distributions pα, pβ, for any h supported on a bounded domain X, k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)), (3) where rk α = r Ψ 1 α 0.5 0.5+τ if Ik is a positive partition, i.e., Ik I+ and rk α = r Ψ 1 α 0.5 0.5 τ otherwise, i.e., Ik I . Further, rk α,β = rk α + q Ψ 1 β (0.5), and Ψα, Ψβ are the CDFs of the distribution of z 2 2 when z pα and z pβ, respectively. p X(S) denotes the measure of the set S under p X. The upper bound (3) holds for any choice of {(ωk h,τ, ˆxk)} such that ˆxk is the center of a ball of radius ωk h,τ completely contained in Ik,τ defined as Ik,τ = {x Ik : |h(x) 0.5| τ}. We can choose the sequence {(ωk h,τ, ˆxk)} such that the upper bound (3) is minimized. Definition E.1 (Inradius). The inradius of a set S is defined as the largest radius r such that an ℓ2 ball of radius r is completely contained in S. In other words, there exists x S such that Bℓ2(x, r ) S. Published in Transactions on Machine Learning Research (04/2023) In order to bound α,β(h), we will follow a strategy which generalizes the arguments in Theorem 4.1 to any arbitrary conditional distribution h(x) = p(Y = 1|X = x). Our first step would be to upper-bound the positive partitions of ψ(h pα), i.e., the regions where the classifier trained with noise-augmentation predicts 1. This will be then be used along with properties of randomized smoothing to upper-bound the positive partitions of Smoothβ(ψ(h pα)), i.e., the regions where the final smoothed classifier predicts 1. These regions would finally be used to upper bound the risk of the smoothed classifier as a function of α and β. Proof. Let I+ denote the set of maximal, non-empty simply connected subsets I X of the input space such that h(x) 0.5 for each x I. Pick any I I+ and let Iτ I be such that h(x) 0.5 + τ for each x Iτ and 0 τ < 0.5. Bounding region where ψ(h pα)(x) = 1. Let J1 Iτ be the maximal subset of Iτ such that the noise-trained classifier ψ(h pα) predicts 1 in J1, i.e. J1 = {x Iτ : C1(x) is satisfied }, where C1(x) is the condition (h pα)(x) 0.5. We will firstly lower bound the area in J1 by using a stronger condition C2(x) R Iτ x(τ + 0.5)pα(δ)dδ 0.5. To show that C2 is a stronger condition than C1, we will show that C2(x) = 1 implies C1(x) = 1, as follows: (h pα)(x) = Z X h(x δ)pα(δ)dδ Iτ h(δ)pα(x δ)dδ Iτ (τ + 0.5)pα(x δ)dδ = Z Iτ x (τ + 0.5)pα(δ)dδ Defining J2 = {x Iτ : C2(x) = 1}, we see that J2 J1. We will further lower bound J2 by considering the largest d-dimensional ball centered at 0 that can be inscribed in Iτ x for each x. Define the condition C3(x, r) as the following: C3(x, r) (C2(x) = 1 and B(x, r) Iτ) Iτ x (τ + 0.5)pα(δ)dδ 0.5 | {z } I and B(x, r) Iτ | {z } II The set J3(r) = {x Iτ : C3(x, r) = 1} is a subset of J2 for all r > 0. In words, J3(r) denotes the set of points in J2 which are at least r distance away from the boundary of Iτ. Finally, we combine the two conditions I and II in C3(x, r) to obtain our final condition C4(x, r). B(0,r) (τ + 0.5)pα(δ)dδ 0.5 and B(x, r) Iτ (19) Note that B(x, r) Iτ is the same as B(0, r) Iτ x. Hence, the integral in (19) is over a smaller set than in (18), from where it follows that C4 is a stronger condition than C3 and the set J4(r) = {x Iτ : C4(x, r) = 1} is a subset of J3(r) for all r > 0. We now simplify C4(r) by using the distribution of z 2 2 where z pα, as follows. B(0,r) pα(δ)dδ = Pr z pα[ z 2 2 r2] Published in Transactions on Machine Learning Research (04/2023) In the above, Ψα is the CDF of the distribution of z 2 2 when z pα. Setting (τ + 0.5)Ψα(r2) 0.5 gives us the following condition B(x, r) Iτ such that r 0.5 0.5 + τ 0.5 0.5 + τ Our final subset J5(α) = {x Iτ : C5(x) = 1} denotes the set of points in I which are at least rα = r Ψ 1 α 0.5 0.5+τ away from the boundary of Iτ. In other words, J5(α) Iτ is a rα-contraction of Iτ. Bound region where Smoothβ(ψ(h pα)) = 1. We will now follow the same general strategy as in the previous part of the proof. Let K1 J5(α) be the maximal subset of J5(α) such that the smoothed classifier Smoothβ(ψ(h pα) predicts 1 in K1, i.e., K1 = {x J5(α): D1(x) is satisfied }, where D1(x) is the condition (Smoothβ(ψ(h pα))(x) 0.5. We will firstly lower bound the area in K1 by using a stronger condition D2(x) R J5(α) x pβ(δ)dδ 0.5. To show that D2 is a stronger condition than D1, we will show that D2(x) = 1 implies D1(x) = 1, as follows: (Smoothβ(ψ(h pα))(x) = Z X (ψ(h pα))(x δ)pβ(δ)dδ J5(α) (ψ(h pα))(δ)pβ(x δ)dδ J5(α) pβ(x δ)dδ = Z J5(α) x pβ(δ)dδ Defining K2 = {x J5(α): D2(x) = 1}, we see that K2 K1. We will further lower bound K2 by considering the largest d-dimensional ball centered at 0 that can be inscribed in J5(α) x for each x. Define the condition D3(x, r) as the following: D3(x, r) (D2(x) = 1 and B(x, r) J5(α)) J5(α) x pβ(δ)dδ 0.5 and B(x, r) J5(α) The set K3(r) = {x J5(α) : D3(x, r) = 1} is a subset of K2 for all r > 0. In words, K3(r) denotes the set of points in K2 which are atleast r distance away from the boundary of J5(α). Finally, we combine the two conditions in D3(x, r) to obtain our final condition D4(x, r). B(0,r) pβ(δ)dδ 0.5 and B(x, r) J5(α) (21) Note that B(x, r) J5(α) is the same as B(0, r) J5(α) x. Hence, the integral in (21) is over a smaller set than in (20), from where it follows that D4 is a stronger condition than D3 and the set K4(r) = {x J5(α): D4(x, r) = 1} is a subset of K3(r) for all r > 0. We now simplify K4(r) by using the distribution of z 2 2 where z pβ, as follows. B(0,r) pβ(δ)dδ = Pr[ z 2 2 r2] where z pβ Published in Transactions on Machine Learning Research (04/2023) In the above, Ψβ is the CDF of the distribution of z 2 2 when z pβ. Setting Ψβ(r2) 0.5 gives us the following condition D5(x) B(x, r) J5(α) such that r q Ψ 1 β (0.5) Ψ 1 β (0.5) J5(α) Our final subset K5(α, β) = {x Iτ : D5(x) = 1} denotes the set of points in I which are at least rα,β = rα + q Ψ 1 β (0.5) away from the boundary of Iτ. In other words, K5(α, β) Iτ is a rα,β-contraction of Iτ. We recall that h is such that the subset Iτ has inradius ωh,τ. By the definition of inradius, this means that there exists a ball B(ˆx, ωh,τ) for some ˆx Iτ such that B(ˆx, ωh,τ) Iτ. Hence, the contraction of Iτ by rα,β, i.e., K5(α, β), is a superset of the contraction of B(ˆx, ωh,τ) by rα,β, B(ˆx, ωh,τ rα,β) K5(α, β). ( ) Risk Upper Bound Recall that X p X, Y p Y are the data and the response variables respectively. Let S be the random variable obtained from the smoothed, noise-augmented classifier as S = Smoothβ(ψ(h pα))(X). Let T be the random variable obtained from the original classifier T = ψ(h)(X). We know that the risk of the original classifier is given by R(ψ(h)) = Pr[ψ(h)(X) = Y ] = p(T = 1, Y = 0) + p(T = 0, Y = 1). Define I+ = {x: ψ(h)(x) = 1} and I = {x: ψ(h)(x) = 0} as the regions of the input space classified as 1 and 0 respectively by the base classifier. With this definition, we can rewrite the base risk as R(ψ(h)) = p(X I+, Y = 0) + p(X I , Y = 1). Let K+ I+ be any subset classified as 1 by the smoothed classifier, i.e., K+ {x: Smoothβ(ψ(h pα))(x) = 1} I+. Similarly, K {x: Smoothβ(ψ(h pα))(x) = 0} I . We will think of the sets K as shrunk versions of I, and where the smoothed classifier predicts the correct label. Specifically, we will let K+ be the union of all the contractions K5(α, β) of the positive partitions of ψ(h), where K5(α, β) was obtained earlier. Similarly, we will let K be the union of the contractions of all the negative partitions of ψ(h). This ensures that K def = X \ (K+ K ) = X \ [ k K5,k(α, β), where we define K as the rest of the space. The risk of the smoothed classifier is R(Smoothβ(ψ(h pα))) = p(S = 1, Y = 0) + p(S = 0, Y = 1), where the first term decomposes into p(S = 1, Y = 0) = p(S = 1, Y = 0, X K+) + p(S = 1, Y = 0, X K) = p(Y = 0, X K+) + p(S = 1, Y = 0, X K), Published in Transactions on Machine Learning Research (04/2023) noting that p(S = 1, Y = 0, X K ) = 0, as X K implies S = 0. Similarly, the second term decomposes as p(S = 0, Y = 1) = p(Y = 1, X K ) + p(S = 0, Y = 1, X K). Combining, we have R(Smoothβ(ψ(h pα))) = p(Y = 0, X K+) + p(Y = 1, X K )+ p(S = 1, Y = 0, X K) + p(S = 0, Y = 1, X K) = p(Y = 0, X K+) + p(Y = 1, X K ) + p(S = Y, X K). The above expression says that the errors made by the smoothed classifier can be decomposed into the errors made in the regions K, and the error everywhere else K. Now, since K+ I+, K I , we have R(Smoothβ(ψ(h pα))) p(Y = 0, X I+) + p(Y = 1, X I ) + p(S = Y, X K) R(ψ(h)) + p(S = Y, X K) = R(Smoothβ(ψ(h pα))) R(ψ(h)) + p(X K) = α,β(h) p(X K) (*) As K5(α, β) K5(0, 0) for any α, β 0, we have p(X Kα,β) p(X K0,0), showing that the upper-bound (*) decreases as α increases at any fixed β. Finally, notice that the upper-bound (*) becomes tight at α = β = 0, K+ = I+, K = I , K = {}, as then R(Smooth0(ψ(h p0)) = R(ψ(h)) and p(X K) = 0. Finally, we can use ( ) in (*) to obtain k p X(Bℓ2(ˆxk, ωk h,τ rk α,β)), for any choice of the points {ˆxk} such that ˆxk is such that Bℓ2(ˆxk, ωk h,τ) Ik,τ. Finally,the quantities Ψ 1 α 0.5 0.5 τ and Ψ 1 β (0.5) must be non-negative for the above balls to be well defined. This in turn implies that ωk h,τ rk α,β 0 for all k, i.e., the inradius of each partition is at least as large as the amount of shrinkage that partition goes through. These requirements are not additional constraints on h, but rather a property of the proof technique: whenever the inradius is too small, the partition vanishes from the upper bound. To capture this subtlety, we can write the upper bound as k p X(Bℓ2(ˆxk, (ωk h,τ rk α,β)+)), where (c)+ denotes the positive part of c. F Proof of Theorem 4.3 Theorem 4.3. There exist nice distributions pα, pβ, a family H2 with low interference distance ζH2, and data-distributions p X, such that for all h H2 we have 0,β0(h) > α,β0(h) for some α > 0, β0 > 0. Proof. We will firstly derive general conditions on h H2 and α0, β0 > 0 such that α0,β0(h) < 0,β0(h). We will then show that the H2 as restricted by these conditions is not empty, by providing a particular h that satisfies these conditions. We will assume that the domain is bounded. Lower-Bound on β. We first of all require β to be large enough such that ψ(ψ(h) pβ0)(x) = 0 for all x X. This is to ensure that without noise-augmentation, the smoothed classifier predicts 0 everywhere. In other words, x X we require Z t X ψ(h)(x t)pβ0(t)dt 0.5. (22) Condition (22) implies β0 β for some β, as will become clear later in (26). Published in Transactions on Machine Learning Research (04/2023) Acceptable Range of α. Secondly, we require ψ(h pα0)(x) = 1 over a set X, where we think of the size of X as being close to the size of X in the sense p X(x X \ X) ϵ for a small ϵ. This subset X denotes the part of the input where the noise-trained classifier predicts 1. In other words, for all x X, we require Z t X h(x t)pα0(t)dt 0.5. (23) Condition (23) implies lower and upper bounds α α0 α, as will become clear later in (27), (28). Upper-Bound on β. Thirdly, we require ψ(ψ(h pα0) pβ0)(x) = 1 for all x ˆ X, where we again think of the size of ˆ X as being close to the size of X, i.e., p X( X \ ˆ X) ϵ. This ensures that the smoothed classifier predicts 1 in almost all of the region where the noise-trained classifier predicts 1. In other words, for all x ˆ X, we require Z t X ψ(h pα0)(x t)pβ0(t)dt 0.5. (24) Condition (24) implies an upper bound β β0, as will become clear in (29). Small Interference-Distance. Finally, we require that the data-distribution places more mass on the label 1, i.e. p(Y = 1) > 1 2 + ϵ. (25) Condition (25) ensures that h has a small upper-interference distance. Informally, this is expected: (25) says that the positive partitions of ψ(h) need to have a certain mass under p, and the domain is bounded, so the partitions cannot be far apart. Formally, this will become clear in (30) when we demonstrate a particular example h where the above constraints are satisfied. Before that, we will prove that the above conditions are sufficient to guarantee α0,β0(h) < 0,β0(h). Let Sα,β be the random variable Smoothβ(ψ(h pα))(X), X p X, and simplify 0,β0 as 0,β0(h) = p(Y = S0,β0) Using (22) > p(Y = 0) + 2ϵ Using (25) p(Y = 0, X ˆ X) + ϵ + ϵ p(Y = 1, X ˆ X) + p X( X \ ˆ X) + p X(X \ X) = p(Y = Sα0,β0, X ˆ X) + p(X X \ ˆ X) Using (24) p(Y = Sα0,β0, X ˆ X) + p(Y = Sα0,β0, X X \ ˆ X) = α0,β0(h). We will consider the one-dimensional example h described in Section 4.2, and show that it satisfies the above constraints. The domain can be taken to be X = [ 0.25, 0.25]. h is defined by specifying c1, c2, c3, c4. We can take c1 = 0.25, c2 = 0.25 + ω, c4 = 0.25, c3 = 0.25 ω, for some ω 0.25. Here ω is the inradius of the positive partitions of h, similar to what we saw in Theorem 4.2. Additionally, the upper interference distance ζ is the distance between the positive partitions, i.e., ζ = (0.25 ω) ( 0.25 + ω) = 0.5 2ω. The smoothing distribution p is uniform in a ball, as pθ = Unif([ θ/2, θ/2]), and the marginal distribution p X is uniform over [ 0.25, 0.25]. The lower bound on β (22) then requires for all x X, we have R β/2 β/2 ψ(h)(x t) (1/β)dt 0.5. A stronger condition is (1/β) R 0.25 0.25 ψ(h)(t)dt 0.5 which implies a lower bound on β as (1/β)(2ω) 0.5 = β 4ω. (26) Published in Transactions on Machine Learning Research (04/2023) The acceptable range of α is defined by (23). We take X = X (hence ϵ = 0). For x [ 0.25 + ω, 0.25 ω], (23) gives 2 (0.25 ω)(1/α) 0.5 = α 4(0.25 ω). (27) For x [ 0.25, 0.25 + ω] [0.25 ω, 0.25], a stronger condition to (23) gives ω(1/α) 0.5 = α 2ω. (28) The upper bound on β is given by (24). As ϵ = 0, we have ˆ X = X, which gives (0.5) (1/β) 0.5 = β 1.. (29) Finally, Eq. (25) gives a condition on ω: 2 = ω > 0.125. (30) The proof is now complete with the observation that ω = 0.23, α = 0.1, β = 0.93 satisfies these constraints, showing that h H2 as 0,0.93( h) > 0.1,0.93( h). Note that the condition (30) on ω implied an upper-bound on the interference distance ζ, as ζ = 0.5 2ω. This shows that when h satisfies the condition (25), h has a low interference distance. G Accommodating Errors in Learning Bayes Classifiers. While obtaining a randomized-smoothed classifier in a real learning scenario, we might deviate from the Bayes classifier in any stage of the process. Specifically, we have 2 stages: (1) Learn a classifier on the noisy data (Xs, Y ) and (2) Smooth the resultant classifier using randomized smoothing. In Theorem 4.1, we assumed that we are able to exactly learn the Bayes Classifier for (Xs, Y ) in stage (1), i.e. ψ(h pα). This assumption can be relaxed while maintaining the same general proof technique, albeit by paying with further slack in the obtained bounds due to the inexactness of the learned classifier. We now sketch a modification of the argument in Appendix D that allows a bounded deviation from the Bayes classifier. We start with the first column of Figure 1, i.e. ψ(h), and train the classifier on data perturbed by noise pα. In the ideal case sketched above, this leads precisely to the second column, i.e., ψ(h pα). In the non-ideal scenario, we assume that we obtain a classifier g in stage (1) after training on (Xs, Y ), such that for all x, we have |g(x) h pα(x)| η. In other words, the maximum error in the trained classifier compared to the actual conditional is no larger than η. For clarity, we still maintain that stage (2) is done exactly, i.e. that given ψ(g) we are able to obtain Smoothβ(ψ(g)) exactly as our final randomized-smoothed classifier. As a result, we are interested in analysing the modified excess risk α,β(g, h) = R(Smoothβ(ψ(g))) R(ψ(h)), (31) for all (g, h) such that g h pα η. Eq. (31) captures the increase in the benign risk due to randomized smoothing an imperfectly learned classifier on the noise-augmented data. Additionally, we define η α,β(h) = max {g : g h pα η} α,β(g, h), (32) to be the maximum excess risk when using an imperfectly learned g. Similarly, we define η α,β(h) = min {g : g h pα η} α,β(g, h), (33) to be the minimum excess risk when using an imperfectly learned g. We now have the following modified version of Theorem 4.1. Published in Transactions on Machine Learning Research (04/2023) Theorem G.1. For nice noise distributions pα = Unif(Bℓ2(0, α)) and pβ = Unif(Bℓ2(0, β)), there exists H1 with interference distance ζH1, such that for all h H1 we have η 0,β(h) < η α,β(h), and η 0,β(h) < η α,β(h) for all smoothing parameters α, β such that ζH1 > 2 max(α, β), and η < 0.5. Proof. Similar to Theorem 4.1, let I1, I2, . . . , Ik be spheres such that Ij = Bℓ2(cj, r) for some positive radius r > 0. Define h(x) = 0.5 + τ whenever x I1 I2 . . . Ik, and 0 otherwise, for some 0 τ < 0.5. Large Interference Distance We assume that ζh > 2 max(α, β), where note the additional factor 2 in the required lower bound as compared to Theorem 4.1. This stricter condition will become useful when we deal with errors in learning the bayes classifier. α Shrinkage Recall that we earlier analyzed the region Ij,α = {x Ij : h pα(x) 0.5}. In this new non-ideal setting, it is now useful to consider the regions I+η j,α defined as I+η j,α = {x Ij,α : h pα(x) 0.5 + η}. (34) The inaccurate classifier g would have ψ(g)(x) = 1 for all x i I+η i,α, as g(x) h pα(x) η 0.5. Similarly, the sets {I η j,α} can be defined as the set of maximal simply connected regions forming a partition of {x: h pα(x) 0.5 η}. There is a natural correspondence between I η j,α, I+η j,α and Ij,α, which will become clear once we obtain the explicit forms of these sets. Note that ψ(g)(x) = 0 for all x i I η i,α, as g(x) h pα(x) + η 0.5. We can characterize I+η j,α explicitly as the set of all x satisfying (0.5 + τ)Φα(x cj, r) 0.5 + η = x cj 2 Aα,r def = r+η α , (35) obtaining the shrunk ball I+η j,α = B(cj, r+η α ). Similarly, we see that I η j,α is the ball I η j,α = B(cj, r η α ) with radius r η α def = Aα,r Handling Inaccuracy in g We now perform randomized smoothing on this classifier g, and obtain ψ(ψ(g) pβ). For x Ij we have that Smoothβ(ψ(g))(x) = Z I+η j,α ψ(g)(t)pβ(x t)dt + Z I η j,α\I+η j,α ψ(g)(t)pβ(x t)dt + Z X\I η j,α ψ(g)(t)pβ(x t)dt. We will now split the domain of the last integral in (36) as X \ I η j,α = i =j I η j,α X \ i I η j,α , where we note that the integral over the second set in the union, i.e., R X\ i I η i,α ψ(g)(t)pβ(x t)dt is zero, by the observation above that ψ(g)(t) = 0 for all t j I η j,α. This leaves us with the integral over the first set, which we will now show to be 0 as well. For the first set, we want to see whether there is any point t I η i,α, i = j such that ψ(g)(t) = 1 and pβ(x t) > 0. This can happen only when g(t) 0.5 and x t 2 β. Now consider the triangle formed by the points {x, t, ci}, and apply the reverse triangle inequality to get t ci 2 ci x 2 x t 2 2 max(α, β) + r β α + r, Published in Transactions on Machine Learning Research (04/2023) where the interference distance condition was used for ci x > 2 max(α, β). But now we know that g(t) h pα(t) η < 0.5, and that h pα(t) = 0 for any t at a distance more than α + r distance away from all the centers ci. This implies that g(t) < 0.5, which further implies that ψ(g)(t) = 0. Hence, the integral over the first set i =j I η j,α is 0. The first term in (36) is simply I+η j,α ψ(g)(t)pβ(x t)dt = Z I+η j,α 1 pβ(x t)dt = Φβ(x cj, r+η j,α). For the second term in (36), we can only produce upper and lower bounds, as I η j,α\I+η j,α ψ(g)(t)pβ(x t)dt Z I η j,α\I+η j,α 1 pβ(x t)dt Combining, we have Φβ(x cj, r+η α ) Smoothβ(ψ(g))(x) Φβ(x cj, r η α ) (37) where the lower-bound occurs when g(x) takes the lowest possible value for all x I+η j,α \ I η j,α. Similarly, the upper-bound occurs when g(x) takes the highest possible value in the same region. β Shrinkage Compared Theorem 4.1, we now only have an inequality (37) for the smoothed classifier at any point x. As a result, we cannot determine the positive regions Ij,α,β = {x Ij : Smoothβ(ψ(g))(x) 0.5} exactly. Instead, the following set inclusions follow from (37): Bℓ2(cj, Aα,r+η α (0.5)) Ij,α,β Bℓ2(cj, Aα,r η α (0.5)). Bounding Risk Similar to Theorem 4.1, we can now compute the risk of the smoothed classifier as R(Smoothβ(ψ(h pα))) (38) X |Smoothβ(ψ(h pα))(x) 1|p(1|x)p X(x)dx + Z X Smoothβ(ψ(h pα)(x)p(0|x)p X(x)dx. (39) Let I = I1 I2 . . . Ik are the positive regions for the base classifier ψ(h), and let I+η α,β = j Bℓ2(cj, Aα,r+η α ) and I η α,β = j Bℓ2(cj, Aα,r η α (0.5)) be the upper and lower bounds to the positive regions of the smoothed classifier Smoothβ(ψ(g)), i.e., Iα,β = j Ij,α,β. Let rα,β,η = Aα,r η α (0.5) denote the largest possible radius of the positive partitions after smoothing. We now have two cases, rα,β,η r (Case A), or r < rα,β,η (Case B). When η = 0, the positive partitions always shrink (as we saw in Theorem 3.1), and we are in Case A. As we increase η beyond a certain η η0, we go into Case B, where the positive partitions have potentially dilated after smoothing. We handle both cases separately. Case A Define S I η α,β \ I+η α,β to be the subset of I η α,β \ I+η α,β where Smoothβ(ψ(g))(x) = 1. For the first integral in (39), we obtain |Smoothβ(ψ(g))(x) 1|p(1|x) = 0, x I+η α,β 0, x S 0.5 + τ, x (I η α,β \ I+η α,β) \ S 0.5 + τ, x I \ I η α,β 0, x X \ I Published in Transactions on Machine Learning Research (04/2023) For the second integral in (39), we obtain Smoothβ(ψ(g))(x)p(0|x) = 0.5 τ, x I+η α,β 0.5 τ, x S 0, x (I η α,β \ I+η α,β) \ S 0, x I \ I η α,β 0, x X \ I Substituting into the integrals, we obtain R(Smoothβ(ψ(g))) = (0.5 τ)p X(I+η α,β S) + (0.5 + τ)p X ((I η α,β \ I+η α,β) \ S) (I \ I η α,β) , which is minimized when S is as large as possible, and vice-versa. This gives us the risk bounds (0.5 τ)p X(I η α,β) + (0.5 + τ)p X(I \ I η α,β) R (0.5 τ)p X(I+η α,β) + (0.5 + τ)p X(I \ I+η α,β) (0.5 + τ)p X(I) 2τp X(I η α,β) R(ψ(h)) | {z } η α,β(h) α,β(g, h) (0.5 + τ)p X(I) 2τp X(I+η α,β) R(ψ(h)) | {z } As α increases for a fixed β, η, τ, both the upper and lower bounds in (40) increase, following the same argument as Theorem 4.1. Case B Define T1 I η α,β \I to be the subset of I η α,β \I where Smoothβ(ψ(g))(x) = 1. Define T2 I \I+η α,β to be the subset of I \ I+η α,β where Smoothβ(ψ(g))(x) = 1. For the first integral in (39), we obtain |Smoothβ(ψ(g))(x) 1|p(1|x) = 0, x I+η α,β 0, x T2 0.5 + τ, x (I \ I+η α,β) \ T2 0, x X \ I For the second integral in (39), we obtain Smoothβ(ψ(g))(x)p(0|x) = 0.5 τ, x I+η α,β 0.5 τ, x T2 0, x (I \ I+η α,β) \ T2 1, x T1 0, x (I η α,β \ I) \ T1 0, x X \ I η α,β Again substituting into the integrals, we obtain R(Smoothβ(ψ(g))) = (0.5 + τ)p X((I \ I+η α,β) \ T2) + (0.5 τ)p X(T2) + p X(T1) + (0.5 τ)p X(I+η α,β), which is minimized when T1 is empty and T2 is as large as possible, and maximized when T1 is as large as possible and T2 is empty. This gives the risk bounds (0.5 τ)p X(I) R(Smoothβ(ψ(g))) (0.5 + τ)p X(I) 2τp X(I+η α,β) + p X(I η α,β \ I) (0.5 τ)p X(I) R(ψ(h)) | {z } η α,β(h) α,β(g, h) (0.5 + τ)p X(I) 2τp X(I+η α,β) + p X(I η α,β \ I) R(ψ(h)) | {z } Note that the lower bound for the risk in (41) is equal to the Bayes Error - this is expected as in the best scenario, the classifier after smoothing can be idential to the original classifier in Case B. For the upper bound, we observe that as we increase α at a fixed β, η, τ, the set I+η α,β shrinks, which causes the upper bound to increase. We have hence shown that in both Case A and Case B, the risk of the smoothed classifier increases as α is increased from 0, modulo approximations due to errors in learning the bayes classifier. Published in Transactions on Machine Learning Research (04/2023) Upper Bound for General g We show here that via a simple application of Theorem 4.2, we can upper bound α,β(g, h) for general g, h: α,β(g, h) α,β(h) + p X ((h pα)(X) = g(X)) (42) To show (42), we let Sg denote the random variable Smoothβ(ψ(g))(X), and Sh denote the random variable Smoothβ(ψ(h) pα)(X). We use the following simple sequence of upper-bounds: α,β(g, h) = R(Smoothβ(ψ(g))) R(ψ(h)) = p(Sg = Y ) R(ψ(h)) = p(Sg = Y, Sh = Sg) R(ψ(h)) + p(Sg = Y, Sh = Sg) p(Sh = Y ) R(ψ(h)) + p(Sh = Sg) = α,β(h) + p(Sh = Sg) α,β(h) + p((h pα)(X) = g(X))