# Intriguing Properties of Input-Dependent Randomized Smoothing

Peter Súkeník (Institute of Science and Technology Austria, Klosterneuburg, Austria), Aleksei Kuvshinov (Technical University of Munich, School of Computation, Information and Technology, Munich, Germany), Stephan Günnemann (Technical University of Munich; Munich Data Science Institute, Munich, Germany). Correspondence to: Peter Súkeník.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. Randomized smoothing is currently considered the state-of-the-art method for obtaining certifiably robust classifiers. Despite its remarkable performance, the method is associated with various serious problems, such as certified accuracy waterfalls, a certification vs. accuracy trade-off, and even fairness issues. Input-dependent smoothing approaches have been proposed with the intention of overcoming these flaws. However, we demonstrate that these methods lack formal guarantees, and so the resulting certificates are not justified. We show that, in general, input-dependent smoothing suffers from the curse of dimensionality, forcing the variance function to have low semi-elasticity. On the other hand, we provide a theoretical and practical framework that enables the usage of input-dependent smoothing even in the presence of the curse of dimensionality, under strict restrictions. We present one concrete design of the smoothing variance function and test it on CIFAR10 and MNIST. Our design mitigates some of the problems of classical smoothing and is formally underpinned, yet further improvement of the design is still necessary.

1. Introduction

Deep neural networks are among the dominant machine learning methods in use today. They achieve state-of-the-art performance in a variety of applications such as computer vision and natural language processing. The key property that makes neural networks so powerful is their expressivity (Gühring et al., 2020). However, as a price, they possess a weakness: vulnerability to adversarial attacks (Szegedy et al., 2013; Biggio et al., 2013). An adversarial attack on a sample $x_0$ is a point $x_1$ such that the distance $d(x_0, x_1)$ is small, yet the predictions of the model $f$ on $x_0$ and $x_1$ differ. Such examples are often easy to construct, for example by optimizing for a change in the prediction $f(x)$ (Biggio et al., 2013). Even worse, these attacks are present even if the model's prediction on $x$ is unequivocal. This property is highly undesirable, because in several sensitive applications, misclassifying a sample just because it does not follow the natural distribution might lead to serious and harmful consequences. A well-known example is a sticker placed on a traffic sign, which could confuse a self-driving car and cause an accident (Eykholt et al., 2018). As a result, the robustness of classifiers against adversarial examples has become a strongly discussed topic. Though many methods claim to provide robust classifiers, only some of them are certifiably robust, i.e. the robustness is mathematically guaranteed. Certifiability turns out to be essential, since more sophisticated attacks can break empirical defenses (Carlini & Wagner, 2017). Currently, the dominant method for achieving certifiable robustness is randomized smoothing (RS). This clever idea of getting rid of adversarial examples by randomizing the input was introduced by Lecuyer et al.
(2019) and Li et al. (2019) and fully formalized and improved by Cohen et al. (2019). Let $f$ be a classifier assigning inputs $x \in \mathbb{R}^N$ to one of the classes $C \in \mathcal{C}$. Given a random perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the smoothed classifier $g$ built from $f$ is defined as

$$g(x) = \arg\max_{C \in \mathcal{C}} \; \mathbb{P}\big(f(x + \epsilon) = C\big).$$

In other words, the smoothed classifier predicts the class that has the highest probability under the sampling of $f(x + \epsilon)$. Consequently, an adversarial attack $x'$ on $f$ is less dangerous for $g$, because $g$ does not look directly at $x'$, but rather at its whole neighborhood, in a weighted manner. In this way we get rid of local artifacts that $f$ possesses, hence the name "smoothing". It turns out that $g$ enjoys strong robustness guarantees against attacks bounded by a specifically computed $l_2$-norm threshold, especially if $f$ is trained under Gaussian noise augmentation (Cohen et al., 2019).

Unfortunately, since the introduction of RS, several serious problems have been reported in connection with the technique. Cohen et al. (2019) mention two of them. The first is the usage of lower confidence bounds to estimate the leading class's probability. With high probability, this leads to smaller reported certified radiuses compared with the true ones. Moreover, it yields a theoretical threshold that upper-bounds the maximal possible certified radius and causes the "certified accuracy waterfalls", which significantly decrease the certified accuracy. This problem is particularly pronounced for small levels of the smoothing variance $\sigma^2$, which motivates the use of a larger variance. Second, RS suffers from a robustness vs. accuracy trade-off: the bigger the smoothing standard deviation $\sigma$, the smaller the clean accuracy of the smoothed classifier. This motivates the use of rather smaller levels of $\sigma$. Third, as pointed out by Mohapatra et al. (2020a), RS smooths the decision boundary of $f$ in such a way that bounded or convex regions begin to shrink as $\sigma$ increases, while unbounded and anti-convex regions expand. As the authors empirically demonstrate, this creates an imbalance in the class-wise accuracies (accuracies computed per class) of $g$ and causes serious fairness issues. Therefore, smaller values of $\sigma$ are again preferable. See Appendix A for a detailed discussion.

Clearly, the usage of a global, constant $\sigma$ is suboptimal. For samples close to the decision boundary, we want to use a small $\sigma$, so that $f$ and $g$ have similar decision boundaries and the expressivity of $f$ is not lost (where not necessary). On the other hand, far from the decision boundary of $f$, where the probability of the dominant class is close to 1, we need a bigger $\sigma$ to avoid severe under-certification (see Appendix A). Altogether, using a non-constant $\sigma(x)$ rather than a constant $\sigma$, a suitable smoothing variance could be chosen for each input to achieve optimal robustness. To support this reasoning, we present a toy example. We train a network on a 2D dataset of circular shape, with the classes being two complementary sectors, one of which has a very small angle. In Figure 1 we show the difference between constant and input-dependent $\sigma$. Using the non-constant $\sigma(x)$ defined in Equation 1, we obtain an improvement both in terms of the certified radiuses and the clean accuracy. For more details see Appendix A.
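In practice, the smoothed classifier $g$ can only be evaluated approximately, by Monte-Carlo sampling of $f(x + \epsilon)$. The following minimal Python sketch illustrates this approximation of the prediction rule defined above; the base classifier `f`, the sample count `n` and the noise scale are placeholders, not part of the paper's specification.

```python
import numpy as np

def smoothed_predict(f, x, sigma, n=1000, rng=None):
    """Monte-Carlo approximation of g(x) = argmax_C P(f(x + eps) = C),
    with eps ~ N(0, sigma^2 I). `f` maps a batch of inputs to integer labels."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(scale=sigma, size=(n,) + x.shape)
    labels = f(x[None, ...] + noise)      # shape (n,), integer class labels
    counts = np.bincount(labels)
    return int(np.argmax(counts))         # empirical majority class
```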
Even though there are some works introducing this concept (see Appendix C), most of them lack mathematical reasoning about the correctness of their method or the fairness of their comparison, which, as we show, turns out to be critical.

The main contributions of this work are fourfold. First, we generalize the methodology of Cohen et al. (2019) to the case of input-dependent RS (IDRS), obtaining useful and important insights about how to use the Neyman-Pearson lemma in this general case. Second and most importantly, we show that IDRS suffers from the curse of dimensionality in the sense that the semi-elasticity coefficient $r$ of $\sigma(x)$ (a positive number such that $|\log\sigma(x_0) - \log\sigma(x_1)| \le r\,\|x_0 - x_1\|$ for all $x_0, x_1 \in \mathbb{R}^N$) is restricted to be very small in a high-dimensional setting. This means that even if we wanted to vary $\sigma(x)$ significantly with varying $x$, we cannot. The maximal reasonable speed of change of $\sigma(x)$ turns out to be almost too small to handle, especially in high dimensions. Third, in contrast, we also study the conditions on $\sigma(x)$ under which it is applicable in the high-dimensional regime and prepare the theoretical framework necessary to build an efficient certification algorithm. We are the first to do so for $\sigma(x)$ functions that are not locally constant (as in (Wang et al., 2021)). Finally, we provide a concrete design of the $\sigma(x)$ function, test it extensively and compare it to classical RS on the CIFAR10 and MNIST datasets. We discuss to what extent the method treats the issues mentioned above.

2. IDRS and the Curse of Dimensionality

Let $\mathcal{C}$ be the set of classes, $f : \mathbb{R}^N \to \mathcal{C}$ a classifier (referred to as the base classifier), $\sigma : \mathbb{R}^N \to \mathbb{R}$ a non-negative function and $\mathcal{P}(\mathcal{C})$ a set of distributions over $\mathcal{C}$. We call $G_f : \mathbb{R}^N \to \mathcal{P}(\mathcal{C})$ the smoothed class probability predictor if $G_f(x)_C = \mathbb{P}(f(x + \epsilon) = C)$, where $\epsilon \sim \mathcal{N}(0, \sigma(x)^2 I)$, and $g_f : \mathbb{R}^N \to \mathcal{C}$ is called the smoothed classifier if $g_f(x) = \arg\max_{C \in \mathcal{C}} G_f(x)_C$. We will often omit the subscript $f$ in $g_f$, since it is usually clear from the context which base classifier $g$ corresponds to. Furthermore, let $A := g(x)$ refer to the most likely class under the random variable $f(\mathcal{N}(x, \sigma^2 I))$, and let $B$ denote the second most likely class. Define $p_A = G_f(x)_A$ and $p_B = G_f(x)_B$ as the respective probabilities. It is important to note that in practice it is impossible to estimate $p_A$ and $p_B$ precisely. Instead, $p_A$ is estimated by a lower confidence bound (LCB) of the relative occurrence of class $A$ in the predictions of $f$, given a certain number of Monte-Carlo samples $n$ and a confidence level $\alpha$. The estimate is denoted $\underline{p_A}$. Similarly to Cohen et al. (2019), we use the exact Clopper-Pearson interval to estimate the LCB (a short sketch of this estimation is given below, after the derivation scheme). The same applies to $p_B$. We work with $l_2$-norms, denoted $\|x\|$. When we speak of the certified radius at a sample $x_0$, we always mean the biggest $R \ge 0$ for which the underlying theory provides the guarantee $g(x_0) = g(x_1)$ for all $x_1$ with $\|x_0 - x_1\| \le R$.

First of all, we summarize the main steps in the derivation of the certified radius around $x_0$ using any method that relies on the Neyman-Pearson lemma (e.g. that of Cohen et al. (2019)).

1. For a potential adversary $x_1$, specify the worst-case classifier $f^*$ such that $\mathbb{P}(f^*(\mathcal{N}(x_0, \sigma^2 I)) = A) = p_A$, while $\mathbb{P}(f^*(\mathcal{N}(x_1, \sigma^2 I)) = B)$ is maximized.

2. Express $G_{f^*}(x_1)_B = \mathbb{P}(f^*(\mathcal{N}(x_1, \sigma^2 I)) = B)$ as a function of $x_1$.

3. Determine the conditions on $x_1$ (possibly related to $\|x_0 - x_1\|$) under which this probability stays at most 0.5. From these conditions, derive the certified radius.
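As a concrete illustration of the estimation step, the sketch below computes the one-sided Clopper-Pearson lower confidence bound from Monte-Carlo class counts and, for the constant-$\sigma$ baseline with $p_B = 1 - p_A$, the resulting radius $\sigma\,\Phi^{-1}(\underline{p_A})$ of Cohen et al. (2019). The counts and parameters are illustrative placeholders only.

```python
import numpy as np
from scipy.stats import beta, norm

def clopper_pearson_lower(k, n, alpha):
    """One-sided Clopper-Pearson lower confidence bound for a binomial
    proportion: k successes out of n trials, confidence level 1 - alpha."""
    if k == 0:
        return 0.0
    return beta.ppf(alpha, k, n - k + 1)

def certified_radius_constant_sigma(k, n, sigma, alpha=0.001):
    """Constant-sigma certified l2 radius with p_B = 1 - p_A,
    i.e. R = sigma * Phi^{-1}(p_A_lower); returns 0 when no class dominates."""
    pa_lower = clopper_pearson_lower(k, n, alpha)
    if pa_lower <= 0.5:
        return 0.0  # abstain: cannot certify any radius
    return sigma * norm.ppf(pa_lower)

# illustrative numbers: 990 out of 1000 noisy samples voted for class A
print(certified_radius_constant_sigma(k=990, n=1000, sigma=0.5))
```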
Figure 1. Motivating toy experiment. We use a constant $\sigma = 0.6$ and an input-dependent $\sigma(x)$ equal on average to the constant $\sigma$. Left: dataset and the variance function depicted as circles with radius $\sigma(x)$ centered at the data points. Middle: zoomed-in part of the dataset and decision boundaries of the smoothed classifiers with constant $\sigma$ (red) and input-dependent $\sigma(x)$ (green). Note that we recover a part of the misclassified data points by using a more appropriate smoothing strength close to the decision boundary. Right: certified accuracy plot. The waterfall effect vanishes, since the points far from the decision boundary are certified with a correspondingly large $\sigma(x)$.

Cohen et al. (2019) proceed in this way to obtain a tight certified radius $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$. Unfortunately, their result is not directly applicable to the input-dependent case. A constant $\sigma$ simplifies the derivation of $f^*$, which turns out to be a linear classifier; this is no longer the case for non-constant $\sigma(x)$. Therefore, we generalize the methodology of Cohen et al. (2019). We put $p_B = 1 - p_A$ for simplicity (yet it is not necessary to assume this, see Appendix B.5). Let $x_0$ be the point to certify, $x_1$ the potential adversary point, $\delta = x_1 - x_0$ the shift, and $\sigma_0 = \sigma(x_0)$, $\sigma_1 = \sigma(x_1)$ the standard deviations used at $x_0$ and $x_1$, respectively. Furthermore, let $q_i$ be the density and $P_i$ the probability measure corresponding to $\mathcal{N}(x_i, \sigma_i^2 I)$, $i \in \{0, 1\}$.

Lemma 2.1. Out of all possible classifiers $f$ such that $G_f(x_0)_B \le p_B = 1 - p_A$, the one for which $G_f(x_0 + \delta)_B$ is maximized predicts class $B$ in a region determined by the likelihood ratio:

$$B = \left\{ x \in \mathbb{R}^N : \frac{q_1(x)}{q_0(x)} \ge r \right\},$$

where $r$ is fixed such that $P_0(B) = p_B$. Note that we use $B$ to denote both the class and the decision region of that class.

We use this lemma to compute the decision boundary of the worst-case classifier $f^*$.

Theorem 2.2. If $\sigma_0 > \sigma_1$, then the set $B$ is an $N$-dimensional ball with center $S_>$ and radius $R_> := R_>(\sigma_0, \sigma_1, \delta, N, r)$, defined in Appendix F, where

$$S_> = x_0 + \frac{\sigma_0^2}{\sigma_0^2 - \sigma_1^2}\,\delta.$$

If $\sigma_0 < \sigma_1$, then the set $B$ is the complement of an $N$-dimensional ball with center $S_<$ and radius $R_< := R_<(\sigma_0, \sigma_1, \delta, N, r)$, expressed in Appendix F, where

$$S_< = x_0 - \frac{\sigma_0^2}{\sigma_1^2 - \sigma_0^2}\,\delta.$$

Figure 2. Decision regions of the worst-case classifier $f^*$. Left: $\sigma_0 > \sigma_1$. Right: $\sigma_0 < \sigma_1$.

As we depict in Figure 2, both resulting balls are centered on the line connecting $x_0$ and $x_1$. Moreover, the centers of the balls are always further from $x_0$ than $x_1$ is from $x_0$ (even in the case $\sigma_0 < \sigma_1$). In both cases the radius depends on $p_A$ (since $r$ is fixed such that $P_0(B) = p_B$), and the ball may or may not cover $x_0$ and/or $x_1$. Note that if $\sigma_0 = \sigma_1$, which can happen even in the input-dependent regime, the worst-case region is the half-space described by Cohen et al. (2019).

Computing the probability of a ball under an isotropic Gaussian measure is more challenging than that of a half-space; in fact, there is no closed-form expression for it. However, this probability is connected to the non-central chi-squared distribution (NCCHSQ). More precisely, the probability of an $N$-dimensional ball centered at $z$ with radius $r$ under $\mathcal{N}(0, I)$ equals the cumulative distribution function (cdf) of an NCCHSQ with $N$ degrees of freedom and non-centrality parameter $\|z\|^2$, evaluated at $r^2$.
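This identity is easy to use, and to sanity-check, numerically. The sketch below evaluates the probability of a ball under $\mathcal{N}(0, I)$ with scipy's non-central chi-squared distribution and compares it against Monte-Carlo sampling; the dimension, center and radius are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import ncx2

N = 50                          # dimension
rng = np.random.default_rng(0)
z = rng.normal(size=N)          # ball center (arbitrary example)
r = 8.0                         # ball radius (arbitrary example)

# P( ||X - z|| <= r ) for X ~ N(0, I_N): the squared distance ||X - z||^2
# follows a non-central chi-squared law with df = N and nc = ||z||^2.
p_exact = ncx2.cdf(r**2, df=N, nc=np.dot(z, z))

# Monte-Carlo check of the same probability.
X = rng.normal(size=(200_000, N))
p_mc = np.mean(np.sum((X - z) ** 2, axis=1) <= r**2)

print(p_exact, p_mc)            # the two values agree up to MC error
```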
With this knowledge, we can express $P_0(B)$ and $P_1(B)$ in terms of the cdf of the NCCHSQ as follows.

Theorem 2.3. Let $R_{<,>}$ denote the radius from Theorem 2.2, with the sign chosen according to the inequality between $\sigma_0$ and $\sigma_1$, and let $\chi^2_N(\lambda, t)$ denote the cdf of the NCCHSQ with $N$ degrees of freedom and non-centrality parameter $\lambda$, evaluated at $t$. The probabilities of the corresponding ball under $P_0$ and $P_1$ are

$$\chi^2_N\!\left(\frac{\sigma_0^2}{(\sigma_0^2 - \sigma_1^2)^2}\,\|\delta\|^2,\; \frac{R_{<,>}^2}{\sigma_0^2}\right) \quad\text{and}\quad \chi^2_N\!\left(\frac{\sigma_1^2}{(\sigma_0^2 - \sigma_1^2)^2}\,\|\delta\|^2,\; \frac{R_{<,>}^2}{\sigma_1^2}\right).$$

For $\sigma_0 > \sigma_1$ these are directly $P_0(B)$ and $P_1(B)$; for $\sigma_0 < \sigma_1$, where $B$ is the complement of the ball, $P_0(B)$ and $P_1(B)$ are one minus the respective expressions.

Note that both Theorem 2.2 and Theorem 2.3 also work for $\delta = 0$ (see the proofs in Appendix F). In this case we encounter a ball centered at $x_0 = x_1$, and all the cdf functions become cdfs of the central chi-squared distribution.

We have expressed the probabilities of the decision region $B$ of the worst-case classifier using the cdf of the NCCHSQ. Now, how do we certify? We start with the certification for just two points: $x_0$, for which the certified radius is in question, and its potential adversary $x_1$. We ask under which circumstances $x_1$ can be certified from the point of view of $x_0$. To obtain $P_1(B)$ as a function of $x_1$, as in step 2 of the certified radius derivation scheme from Section 2, we first need to fix $P_0(B) = 1 - p_A = p_B$. Having $x_0$, $x_1$, $p_A$ and $\sigma_0 > \sigma_1$, we find the value $R^2$ (the squared ball radius normalized by $\sigma_0^2$) for which $P_0(B) = \chi^2_N\!\left(\|\delta\|^2 \sigma_0^2 / (\sigma_0^2 - \sigma_1^2)^2,\; R^2\right) = 1 - p_A = p_B$ simply by applying the quantile function:

$$R^2 = \chi^2_{N,\mathrm{qf}}\!\left(\|\delta\|^2 \sigma_0^2 / (\sigma_0^2 - \sigma_1^2)^2,\; 1 - p_A\right).$$

In the next step, we substitute it directly into

$$P_1(B) = \chi^2_N\!\left(\|\delta\|^2 \sigma_1^2 / (\sigma_0^2 - \sigma_1^2)^2,\; R^2 \sigma_0^2 / \sigma_1^2\right).$$

This way we obtain $P_1(B)$ and can judge whether $P_1(B) < 1/2$ or not. A similar computation can be done if $\sigma_0 < \sigma_1$. Denote $a := \|\delta\|$. We express $P_1(B)$ more compactly as a function of $a$: for $\sigma_0 > \sigma_1$,

$$\xi_>(a) := P_1(B) = \chi^2_N\!\left(\frac{\sigma_1^2}{(\sigma_0^2 - \sigma_1^2)^2}\,a^2,\; \frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{N,\mathrm{qf}}\!\left(\frac{\sigma_0^2}{(\sigma_0^2 - \sigma_1^2)^2}\,a^2,\; 1 - p_A\right)\right),$$

and for $\sigma_0 < \sigma_1$,

$$\xi_<(a) := P_1(B) = 1 - \chi^2_N\!\left(\frac{\sigma_1^2}{(\sigma_1^2 - \sigma_0^2)^2}\,a^2,\; \frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{N,\mathrm{qf}}\!\left(\frac{\sigma_0^2}{(\sigma_1^2 - \sigma_0^2)^2}\,a^2,\; p_A\right)\right),$$

where $\chi^2_{N,\mathrm{qf}}(\lambda, q)$ denotes the corresponding quantile function. With this in mind, given $x_0$, $x_1$, $p_A$, $\sigma_0$, $\sigma_1$, we can certify $x_1$ w.r.t. $x_0$ simply by choosing the correct sign ($<$ or $>$), computing $\xi_<(\|x_0 - x_1\|)$ or $\xi_>(\|x_0 - x_1\|)$ and comparing it with 0.5. Sample plots of these $\xi$ functions can be found in Appendix B.

Now we are ready to discuss the curse of dimensionality. The problem is that for a high dimension $N$ and $\sigma_0$, $\sigma_1$ differing a lot from each other, the $\xi$ functions are already large at 0, even for a considerably small $p_B$. For a fixed ratio $\sigma_0/\sigma_1$ and probability $p_B$, $\xi(0)$ increases with growing dimension $N$ and soon becomes bigger than 0.5. Together with the monotonicity of the $\xi$ functions, this means that no $x_1$ can be certified w.r.t. $x_0$ if $\sigma_0$, $\sigma_1$ are used. The more dissimilar $\sigma_0$ and $\sigma_1$ are, the smaller the dimension $N$ needs to be for this situation to occur. If we want to certify $x_1$ at a reasonable distance from $x_0$, we need to use similar $\sigma_0$, $\sigma_1$. This restricts the variability of the $\sigma(x)$ function. We formalize the curse of dimensionality in the following theorems; we discuss in more detail why it is present in Appendix B.2.

Theorem 2.4 (curse of dimensionality). Let $x_0$, $x_1$, $p_A$, $\sigma_0$, $\sigma_1$ and $N$ be as usual. The following two implications hold:

1. If $\sigma_0 > \sigma_1$ and
$$N\left(\log\frac{\sigma_1^2}{\sigma_0^2} + 1 - \frac{\sigma_1^2}{\sigma_0^2}\right) < 2 \log(1 - p_A),$$
then $x_1$ is not certified w.r.t. $x_0$.

2. If $\sigma_0 < \sigma_1$ and
$$N\left(\log\frac{\sigma_1^2}{\sigma_0^2} + 1 - \frac{\sigma_1^2}{\sigma_0^2}\right) < 2 \log(1 - p_A),$$
then $x_1$ is not certified w.r.t. $x_0$.

Corollary 2.5 (one-sided simpler bound). Let $x_0$, $x_1$, $p_A$, $\sigma_0$, $\sigma_1$ and $N$ be as usual and assume now $\sigma_0 > \sigma_1$. Then, whenever $\sigma_1/\sigma_0$ lies below a simpler, explicit one-sided threshold depending only on $N$ and $p_A$, $x_1$ is not certified w.r.t. $x_0$.

Note that both Theorem 2.4 and Corollary 2.5 can be adjusted to the case where we have a separate estimate $p_B$ and do not put $p_B = 1 - p_A$ (see Appendix B.5). We emphasize that the bounds obtained in Theorem 2.4 are very tight.
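To see how restrictive, and how tight, these conditions are in practice, the following sketch (written for this exposition, not taken from the paper's code) computes the threshold on $\sigma_1/\sigma_0$ implied by the condition of Theorem 2.4 as stated above and the numerical threshold at which $\xi_>(0)$ reaches 0.5; at $\delta = 0$ only the central chi-squared distribution is needed. The dimensions and probabilities mirror those of Table 1 below.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def xi_gt_zero(ratio, p_a, n):
    """xi_>(0) for sigma_1/sigma_0 = ratio < 1: at delta = 0 the non-centrality
    vanishes, so only central chi-squared cdf/quantile are needed."""
    return chi2.cdf(chi2.ppf(1.0 - p_a, n) / ratio**2, n)

def theoretical_threshold(p_a, n):
    """Smallest sigma_1/sigma_0 allowed by Theorem 2.4 (case sigma_0 > sigma_1):
    solve N*(log t + 1 - t) = 2*log(1 - p_A) for t = (sigma_1/sigma_0)^2."""
    f = lambda t: n * (np.log(t) + 1.0 - t) - 2.0 * np.log(1.0 - p_a)
    return np.sqrt(brentq(f, 1e-6, 1.0 - 1e-12))

def practical_threshold(p_a, n):
    """Smallest sigma_1/sigma_0 for which xi_>(0) still stays below 0.5."""
    return brentq(lambda rho: xi_gt_zero(rho, p_a, n) - 0.5, 1e-3, 1.0 - 1e-9)

for name, n in [("MNIST", 28 * 28), ("CIFAR10", 32 * 32 * 3)]:
    for p_a in (0.9, 0.99, 0.999):
        print(name, p_a, theoretical_threshold(p_a, n), practical_threshold(p_a, n))
```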
In other words, if the ratio $\sigma_1/\sigma_0$ is just slightly bigger than the minimal possible threshold from Theorem 2.4, $\xi_>(0)$ becomes smaller than 0.5, and similarly for $\xi_<(0)$. This is because the only two estimates used in the proof of Theorem 2.4 are an estimate of the median, which is very tight and constant with respect to $N$, and the multiplicative Chernoff bound, which is generally considered tight as well and improves for larger $N$. The tightness is depicted in Figure 3, where we plot the minimal possible threshold on $\sigma_1/\sigma_0$ given by Theorem 2.4 and the minimal threshold for which $\xi_>(0) < 0.5$, as functions of $N$. To get a better feeling for the concrete numbers, we provide the theoretical threshold values from Theorem 2.4 in Table 1.

Figure 3. Plots depicting the tightness of the results of Theorem 2.4. In both figures, the biggest possible threshold on $\sigma_1/\sigma_0$ for which the condition in Theorem 2.4 is satisfied (theoretical threshold) and the numerically computed threshold for which $\xi_>(0)$ exceeds 0.5 (practical threshold) are depicted. Left: plot for $p_A = 0.9$. Right: plot for $p_A = 0.999$.

Table 1. Theoretical lower thresholds for $\sigma_1/\sigma_0$ for different data dimensions and class-A probabilities. The ImageNet spatial size is assumed to be 3x256x256, while the CIFAR10 spatial size is 32x32x3 and the MNIST spatial size is 28x28x1.

             p_A = 0.9   p_A = 0.99   p_A = 0.999   p_A = 0.99993
  MNIST      0.946       0.924        0.908         0.892
  CIFAR10    0.973       0.961        0.953         0.945
  ImageNet   0.997       0.995        0.994         0.993

If $\sigma_1/\sigma_0$ is smaller than the threshold, we are not able to certify any pair of $x_0$, $x_1$ using $\sigma_0$, $\sigma_1$. The results in Table 1 are very restrictive. Assume we have a CIFAR10 sample with $p_A = 0.999$. For such a probability, a constant $\sigma = 0.5$ is more than sufficient to guarantee a certified radius of more than 1. However, in the non-constant regime, to certify $R \ge 1$ we first need to guarantee that no sample within distance 1 from $x_0$ uses $\sigma_1 < 0.953\,\sigma_0$. To strengthen this statement even further, note that in practice one needs $\sigma_1$ to be much closer to $\sigma_0$ still. Why? The results of Theorem 2.4 lower-bound the $\xi$ functions at 0. However, since the $\xi$ functions are strictly increasing (as shown in Appendix F), one usually needs $\sigma_0$ and $\sigma_1$ to be much closer to each other to guarantee that $\xi(a)$ is smaller than 0.5 at $a \ne 0$. This not only forces the $\sigma(x)$ function to have a really small semi-elasticity, but also makes it problematic to define a stochastic $\sigma(x)$. For more, see Appendix B.2.

To fully understand how the curse of dimensionality affects the usage of IDRS, we mention two more significant effects. First, with increasing dimension the average distance between samples tends to grow as $\sqrt{N}$. This allows $\sigma(x)$ to change over a larger distance. On the other hand, the average level of $\sigma(x)$ (like 0.12, 0.25, ...) also needs to be scaled as $\sqrt{N}$ with increasing dimension, and the bigger the average level of $\sigma(x)$ we use, the more the semi-elasticity of $\sigma(x)$ is restricted by Theorem 2.4 and Theorem 3.2. Altogether, these two effects combine into a final trend: for $\sigma_0$ and $\sigma_1$ being the standard deviations used at two random test samples, $|\sigma_0/\sigma_1 - 1|$ is restricted to go to 0 as $1/\sqrt{N}$. For a detailed explanation, see Appendix B.4.

3. How to Use IDRS Properly

As we discuss above, the usage of IDRS is challenging. How can we obtain valid, mathematically justified certified radiuses? Fix some design $\sigma(x)$.
If $\sigma(x)$ is not trivial, then to get a certified radius at $x_0$ we need to go over all possible adversaries $x_1$ in the neighborhood of $x_0$ and compute $\sigma_1$ and $\xi_{<,>}(a)$. The certified radius is then the infimum of $\|x_0 - x_1\|$ over all uncertified points $x_1$. Of course, this is a priori infeasible. Fortunately, the $\xi$ functions possess a property that helps to simplify this procedure. For convenience, we extend the notation of $\xi$ so that $\xi(a, \sigma_1)$ additionally denotes the dependence on the value $\sigma_1$.

Theorem 3.1. Let $x_0$, $x_1$, $p_A$, $\sigma_0$ be as usual and denote $\|x_0 - x_1\|$ by $R$. Then the following two statements hold:

1. Let $\sigma_1 \le \sigma_0$. Then, for all $\sigma_2$ with $\sigma_1 \le \sigma_2 \le \sigma_0$: if $\xi_>(R, \sigma_2) > 0.5$, then $\xi_>(R, \sigma_1) > 0.5$.

2. Let $\sigma_1 \ge \sigma_0$. Then, for all $\sigma_2$ with $\sigma_0 \le \sigma_2 \le \sigma_1$: if $\xi_<(R, \sigma_2) > 0.5$, then $\xi_<(R, \sigma_1) > 0.5$.

Theorem 3.1 serves as a monotonicity property. The main gain is that for each distance $R$ from $x_0$ it suffices to consider just two adversaries: the one with the biggest $\sigma_1$ (if bigger than $\sigma_0$) and the one with the smallest $\sigma_1$ (if smaller than $\sigma_0$). If we cannot certify some point $x_1$ at distance $R$ from $x_0$, then we will surely not be able to certify at least one of the two adversaries with the most extreme $\sigma_1$ values. This, however, does not suffice for most reasonable $\sigma(x)$ designs, since it might still be too hard to determine the two most extreme $\sigma_1$ values at a given distance from $x_0$. Therefore, we assume that $\sigma(x)$ is semi-elastic with coefficient $r$. Then we have the guarantee that

$$\sigma(x_0)\exp(-r a) \le \sigma(x_1) \le \sigma(x_0)\exp(r a)$$

holds for $a = \|x_0 - x_1\|$. Thus, we bound the worst-case extreme $\sigma_1$ values for every distance $a$. Using this, we guarantee the following certified radius.

Theorem 3.2. Let $\sigma(x)$ be an $r$-semi-elastic function and $x_0$, $p_A$, $N$, $\sigma_0$ as usual. Then the certified radius at $x_0$ guaranteed by our method is

$$CR(x_0) = \sup\left\{ R \ge 0 \;:\; \xi_>\!\big(R,\, \sigma_0 \exp(-r R)\big) < 0.5 \ \wedge\ \xi_<\!\big(R,\, \sigma_0 \exp(r R)\big) < 0.5 \right\}.$$

If the underlying set is empty, which is equivalent to $\underline{p_A} \le 0.5$ (this might occur in practice, as we use the lower bound $\underline{p_A}$ instead of $p_A$), we cannot certify any radius.

Note that Theorem 3.2 can be adjusted to the case where we have a separate estimate $p_B$ and do not put $p_B = 1 - p_A$ (see Appendix B.5). Since a bigger semi-elasticity constant of $\sigma(x)$ yields worse certifications, it is important to estimate the constant tightly. Even with a good estimate of $r$, we still get smaller certified radiuses than if we could use $\sigma(x)$ exactly, but that is a price that is inevitable for the feasibility of the method. The resulting algorithm is then very easy: we just pick a sufficiently dense grid of possible radiuses and determine the smallest one for which either $\xi_>(R, \sigma_0 \exp(-r R))$ or $\xi_<(R, \sigma_0 \exp(r R))$ exceeds 0.5. The only non-trivial part is how to evaluate the $\xi$ functions. For small values of $R$, $\exp(\pm r R)$ is very close to 1 and, from the definition of the $\xi$ functions, this results in extremely large inputs to the cdf and quantile function of the NCCHSQ. To avoid numerical problems, we employ a simple workaround in which thresholds on $\sigma_1$ are imposed, so that for $R$ small enough these thresholds are used instead of $\sigma_0 \exp(\pm r R)$. Unfortunately, numerical stability still prevents the usage of this method on really high-dimensional datasets like ImageNet. For more details on the implementation, see Appendix D.
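A minimal sketch of this certification loop is given below. It is our own illustrative implementation, not the authors' released code: it evaluates the two ξ functions, as written above, with scipy's non-central chi-squared cdf and quantile, and scans a radius grid as in Theorem 3.2. It deliberately omits the numerical-stability thresholds on σ1 discussed in the previous paragraph and is therefore only intended for moderate dimensions.

```python
import numpy as np
from scipy.stats import ncx2

def xi_greater(a, p_a, sigma0, sigma1, n):
    """xi_>(a): worst-case class-B probability at distance a when sigma1 < sigma0."""
    d2 = (sigma0**2 - sigma1**2) ** 2
    r2 = ncx2.ppf(1.0 - p_a, df=n, nc=sigma0**2 * a**2 / d2)   # normalized radius^2
    return ncx2.cdf(r2 * sigma0**2 / sigma1**2, df=n, nc=sigma1**2 * a**2 / d2)

def xi_less(a, p_a, sigma0, sigma1, n):
    """xi_<(a): worst-case class-B probability at distance a when sigma1 > sigma0
    (the worst-case region is the complement of a ball)."""
    d2 = (sigma1**2 - sigma0**2) ** 2
    r2 = ncx2.ppf(p_a, df=n, nc=sigma0**2 * a**2 / d2)
    return 1.0 - ncx2.cdf(r2 * sigma0**2 / sigma1**2, df=n, nc=sigma1**2 * a**2 / d2)

def certified_radius(p_a, sigma0, r, n, r_max=4.0, steps=400):
    """Certified radius of Theorem 3.2 for an r-semi-elastic sigma(x):
    largest grid radius R for which both worst cases stay below 0.5.
    For very small R the non-centrality parameters become huge, which is the
    numerical issue handled by the thresholds described in the text."""
    if p_a <= 0.5:
        return 0.0
    best = 0.0
    for R in np.linspace(r_max / steps, r_max, steps):
        lo, hi = sigma0 * np.exp(-r * R), sigma0 * np.exp(r * R)
        if xi_greater(R, p_a, sigma0, lo, n) >= 0.5 or xi_less(R, p_a, sigma0, hi, n) >= 0.5:
            break
        best = R
    return best

# illustrative call: CIFAR10-sized input, p_A estimate 0.99, sigma_0 = 0.5, rate r = 0.01
print(certified_radius(p_a=0.99, sigma0=0.5, r=0.01, n=32 * 32 * 3))
```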
4. The Design of σ(x) and Experiments

The only missing ingredient needed to finally be able to use IDRS is the $\sigma(x)$ function itself. As we have seen, this function has to be $r$-semi-elastic for a rather small $r$, and ideally deterministic. Yet it should at least roughly fulfill the requirements imposed by the motivation: it should take large values for points far from the decision boundary of $f$ and rather small values for points close to it. Adhering to these restrictions, we use the following function:

$$\sigma(x) = \sigma_b \exp\left(\frac{r}{k} \sum_{x_i \in N_k(x)} \big(\|x - x_i\| - m\big)\right), \qquad (1)$$

where $\sigma_b$ is a base standard deviation, $r$ the required semi-elasticity, $\{x_i\}_{i=1}^d$ the training set, $N_k(x)$ the set of $k$ nearest neighbors of $x$ and $m$ a normalization constant. Intuitively, if a sample is far from all other samples, it will be far from the decision boundary, unless the network overfits to this sample. On the other hand, dense clusters of samples are more likely to be positioned near the decision boundary, since such clusters have high leverage on the network's weights, forcing the decision boundary to adapt well to the geometry of the cluster (for more insight into the relation between the geometry of the data and the distance from the decision boundary see, for example, Baldock et al. (2021)). To use such a function, however, we first prove that it is indeed $r$-semi-elastic.

Theorem 4.1. The standard deviation function $\sigma(x)$ defined in Equation 1 is $r$-semi-elastic.
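Computationally, this variance function only requires k-nearest-neighbor distances to the training set. The sketch below assumes the form of Equation 1 as written above (base scale σ_b, rate r, k neighbors, normalization constant m) and uses brute-force distances; for large training sets an approximate nearest-neighbor index would be the natural replacement.

```python
import numpy as np

def sigma_x(x, train_X, sigma_b, r, k=20, m=5.0):
    """Input-dependent smoothing scale of Equation 1:
    sigma(x) = sigma_b * exp( (r/k) * sum_{x_i in N_k(x)} (||x - x_i|| - m) ).
    `train_X` has shape (d, N); `x` has shape (N,)."""
    dists = np.linalg.norm(train_X - x[None, :], axis=1)
    knn = np.partition(dists, k - 1)[:k]        # the k smallest distances
    return sigma_b * np.exp((r / k) * np.sum(knn - m))

# illustrative usage with random data standing in for flattened images
rng = np.random.default_rng(0)
train_X = rng.random((1000, 32 * 32 * 3))
x = rng.random(32 * 32 * 3)
print(sigma_x(x, train_X, sigma_b=0.25, r=0.01, k=20, m=5.0))
```

Because the sum of the $k$ smallest distances changes by at most $k\,\|x_0 - x_1\|$ between two inputs, the logarithm of this function changes by at most $r\,\|x_0 - x_1\|$, which is the $r$-semi-elasticity claimed in Theorem 4.1.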
We test our IDRS and the $\sigma(x)$ function extensively. For both the CIFAR10 (Krizhevsky, 2009) and MNIST (LeCun et al., 1999) datasets, we analyze a series of different experimental setups, including experiments with input-dependent train-time Gaussian data augmentation. We present a direct comparison of our method with the constant-$\sigma$ method using the evaluation strategy of (Cohen et al., 2019); all other experiments, including ablation studies and the discussion of hyperparameter selection, are presented in Appendix E. Here, we compare the evaluations of (Cohen et al., 2019) for $\sigma = 0.12, 0.25, 0.50$ with our evaluations, setting $\sigma_b = \sigma$, $r = 0.01, 0.02$, $k = 20$, $m = 5, 1.5$ (for CIFAR10 and MNIST, respectively), applied to models trained with Gaussian data augmentation, using a constant standard deviation roughly equal to the average test-time $\sigma(x)$ or the test-time $\sigma$. For CIFAR10, these levels of train-time standard deviation are $\sigma_{tr} = 0.126, 0.263, 0.53$ and for MNIST $\sigma_{tr} = 0.124, 0.258, 0.517$. In this way, the levels of $\sigma(x)$ used in the direct comparison range from values roughly equal to the constant $\sigma$ of (Cohen et al., 2019) to higher values. The results are depicted in Figure 4.

Figure 4. Comparison of certified accuracy plots for the method of Cohen et al. (2019) and our work.

From Figure 4 we see that we outperform the constant $\sigma$ for small levels of $\sigma$, such as 0.12 or 0.25. At higher levels of $\sigma$ we are, in general, worse (see the explanation in Appendix B.3). The most visible improvement is the mitigation of the truncation of certified accuracy (the certified accuracy waterfall). To comment on the other two issues, we provide Tables 2 and 3 with the clean accuracies and the class-wise accuracy standard deviations, respectively. These results are averages over 8 independent runs; in Table 2, the displayed error values are empirical standard deviations.

Table 2. Clean accuracies for both input-dependent and constant-σ evaluation strategies on CIFAR10 (C) and MNIST (M).

                            dataset   σ = 0.12          σ = 0.25          σ = 0.50
  r = 0.01, σtr increased   C         0.852 ± 0.002     0.780 ± 0.013     0.673 ± 0.008
  r = 0.00                  C         0.851 ± 0.006     0.792 ± 0.004     0.674 ± 0.018
  r = 0.01, σtr increased   M         0.9912 ± 0.0003   0.9910 ± 0.0006   0.9881 ± 0.0003
  r = 0.00                  M         0.9914 ± 0.0004   0.9907 ± 0.0004   0.9886 ± 0.0005

Table 3. Class-wise accuracy (i.e. the number of correctly classified points of a certain class divided by the size of that class) standard deviations for both input-dependent and constant-σ evaluation strategies on CIFAR10 (C) and MNIST (M).

                            dataset   σ = 0.12   σ = 0.25   σ = 0.50
  r = 0.01, σtr increased   C         0.076      0.099      0.120
  r = 0.00                  C         0.076      0.097      0.122
  r = 0.01, σtr increased   M         0.00775    0.00777    0.00930
  r = 0.00                  M         0.00751    0.00778    0.00934

From Tables 2 and 3, we draw two conclusions. First, it is not easy to judge the robustness vs. accuracy trade-off, because the differences in clean accuracies are not statistically significant in any of the experiments (not even for CIFAR10 and $\sigma = 0.25$, where the difference is at least pronounced). However, the general trend in Table 2 indicates that the clean accuracies tend to decrease slightly with increasing rate $r$. The decrease is roughly equivalent to the drop in accuracy we would observe if we simply used a constant $\sigma$ during evaluation with its value set to the average $\sigma(x)$. In that case, the certified radiuses would also be roughly equal on average, but ours would still encounter a less severe certified accuracy drop. Second, the standard deviations of the class-wise accuracies, which serve as a good measure of the impact of the shrinking phenomenon and the subsequent fairness, do not change significantly after applying the non-constant RS.

5. Related Work

Since the vulnerability of deep neural networks to adversarial attacks was noticed by Szegedy et al. (2013); Biggio et al. (2013), a lot of effort has been put into making neural nets more robust. There are two types of solutions: empirical and certified defenses. While empirical defenses suggest heuristics to make models robust, certified approaches additionally provide a way to compute a mathematically valid robust radius. One of the most effective empirical defenses, adversarial training (Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017), is based on the intuitive idea of using adversarial examples for training. Unfortunately, together with adversarial training, other promising empirical defenses were subsequently broken by more sophisticated adversarial methods (for instance by Carlini & Wagner (2017); Athalye & Carlini (2018); Athalye et al. (2018), among others). Among many certified defenses (Tsuzuku et al., 2018; Anil et al., 2019; Hein & Andriushchenko, 2017; Wong & Kolter, 2018; Raghunathan et al., 2018; Mirman et al., 2018; Weng et al., 2018), one of the most successful so far is RS. While Lecuyer et al. (2019) introduced the method within the context of differential privacy, Li et al. (2019) proceeded via Rényi divergences. Possibly the most prominent work on RS is that of Cohen et al. (2019), where the authors fully established RS and proved tight certification guarantees. Later, many authors worked further with RS. The work of Yang et al. (2020) generalizes the certification provided by Cohen et al. (2019) to certifications with respect to general $l_p$ norms and provides the optimal smoothing distributions for each of the norms. Other works point out different problems or weaknesses of RS such as the curse of dimensionality (Kumar et al., 2020; Hayes, 2020; Wu et al., 2021), robustness vs.
accuracy trade-off (Gao et al., 2020) or a shrinking phenomenon (Mohapatra et al., 2020a), which yields serious fairness issues (Mohapatra et al., 2020a). The work of Mohapatra et al. (2020b) improves RS further by introducing the first-order information about g. In this work, authors not only estimate g(x), but also g(x), making more restrictions on the possible base models f that might have created g. Zhai et al. (2020) and Salman et al. (2019) improve the training procedure of f to yield better robustness guarantees of g. Salman et al. (2019) directly use adversarial training of the base classifier f. Finally, Zhai et al. (2020) introduce the so-called soft smoothing, which enables to compute gradients directly for g and construct a training method, which optimizes directly for the robustness of g via the gradient descent. To address several issues connected to randomized smoothing, there have already been four works that introduce the usage of IDRS. Wang et al. (2021) divide RN into several regions Ri, i {1, . . . , K} and optimize for σi, i {1, . . . , K} locally, such that σi is a most suitable choice for the region Ri. Yet this work partially solves some problems of randomized smoothing, it also possesses some practical and philosophical issues (see Appendix C). Alfarra et al. (2020); Eiras et al. (2021); Chen et al. (2021) suggest to optimize for locally optimal σi, for each sample xi from the test set. A similar strategy is proposed by these works in the training phase, with the intention of obtaining the base model f that is most suitable for the construction of the smoothed classifier g. They demonstrate, that by using this input-dependent approach, one can overcome some of the main problems of randomized smoothing. However, as we demonstrate in Appendix C, their methodology is not valid and therefore their results are not trustworthy. In the latest version of their work, Alfarra et al. (2020) themselves point out this issue and try to fix it by a very similar strategy to the one of Wang et al. (2021). Thus, their work still carries all the problems connected to this method. 6. Conclusions We show in this work that input-dependent randomized smoothing suffers from the curse of dimensionality. In the high-dimensional regime, the usage of input-dependent σ(x) is put under strict constraints. The σ(x) function is forced to have very small semi-elasticity. This is in conflict with some recent works, which have used the input-dependent randomized smoothing without mathematical justification and therefore claim invalid results, or results of questionable comparability and practical usability. It seems that inputdependent randomized smoothing has limited potential of improvement over the classical, constant-σ RS. Moreover, due to numerical instability, the computation of certified radiuses on high-dimensional datasets like Image Net remains to be an open challenge. On the other hand, we prepare a ready-to-use mathematically underlined framework for the usage of the inputdependent RS and show that it works well for small to medium-sized problems. We also show, via extensive experiments, that our concrete design of the σ(x) function reasonably mitigates the truncation issue connected to constant-σ RS and is capable of mitigating the robustness vs. accuracy one on simpler datasets. The most intriguing and promising direction for the future work lies in the development of new σ(x) functions, which could treat the mentioned issues even more efficiently. 
This improvement is necessary to make IDRS able to significantly beat constant smoothing, as it happens in our toy example in Appendix A. The difference between CIFAR10 and MNIST and the toy dataset is that σ(x) from equation 1 is well-suited for the toy dataset s geometry, but not to the same extent for the geometry of images. One possible reason is that this σ(x) does not correspond to the distances from the decision boundary across the entire dataset, because the euclidean distances between images do not align with the distances in images content . We believe, however, that improvements in σ(x) design are possible and we leave the investigation in this matter as an open and attractive research question. One way to go is to define a metric on the input space that would better reflect the geometry of images and convolutional neural networks. Alfarra, M., Bibi, A., Torr, P. H., and Ghanem, B. Data dependent randomized smoothing. ar Xiv preprint ar Xiv:2012.04351, 2020. Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 291 301. PMLR, 09 15 Jun 2019. URL http://proceedings.mlr. press/v97/anil19a.html. Athalye, A. and Carlini, N. On the robustness of the cvpr 2018 white-box adversarial example defenses. ar Xiv preprint ar Xiv:1804.03286, 2018. Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing de- Intriguing Properties of Input-Dependent Randomized Smoothing fenses to adversarial examples. In International conference on machine learning, pp. 274 283. PMLR, 2018. Baldock, R., Maennel, H., and Neyshabur, B. Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems, 34:10876 10889, 2021. Biggio, B., Corona, I., Maiorca, D., Nelson, B., ˇSrndi c, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Blockeel, H., Kersting, K., Nijssen, S., and ˇZelezn y, F. (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 387 402, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-40994-3. Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pp. 3 14, 2017. Chen, C., Kong, K., Yu, P., Luque, J., Goldstein, T., and Huang, F. Insta-rs: Instance-wise randomized smoothing for improved robustness and accuracy. ar Xiv preprint ar Xiv:2103.04436, 2021. Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310 1320. PMLR, 2019. Eiras, F., Alfarra, M., Kumar, M. P., Torr, P. H., Dokania, P. K., Ghanem, B., and Bibi, A. Ancer: Anisotropic certification via sample-wise volume maximization. ar Xiv preprint ar Xiv:2107.04570, 2021. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1625 1634, 2018. Gao, Y., Rosenberg, H., Fawaz, K., Jha, S., and Hsu, J. Analyzing accuracy loss in randomized smoothing defenses. ar Xiv preprint ar Xiv:2003.01595, 2020. Goodfellow, I. J., Shlens, J., and Szegedy, C. 
Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. G uhring, I., Raslan, M., and Kutyniok, G. Expressivity of deep neural networks. ar Xiv preprint ar Xiv:2007.04759, 2020. Hayes, J. Extensions and limitations of randomized smoothing for robustness guarantees. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 786 787, 2020. Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings. neurips.cc/paper/2017/file/ e077e1a544eec4f0307cf5c3c721d944-Paper. pdf. Krizhevsky, A. Learning multiple layers of features from tiny images, 2009. URL http://www.cs.toronto.edu/ kriz/ learning-features-2009-TR.pdf. Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning, pp. 5458 5467. PMLR, 2020. Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. ar Xiv preprint ar Xiv:1611.01236, 2016. Le Cun, Y., Cortes, C., and Burges, C. J. The mnist database of handwritten digits, 1999. URL http: //yann.lecun.com/exdb/mnist/. Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656 672. IEEE, 2019. Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. Advances in Neural Information Processing Systems, 32:9464 9474, 2019. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3578 3586. PMLR, 10 15 Jul 2018. URL http://proceedings.mlr.press/ v80/mirman18b.html. Mohapatra, J., Ko, C.-Y., Liu, S., Chen, P.-Y., Daniel, L., et al. Rethinking randomized smoothing for adversarial robustness. ar Xiv preprint ar Xiv:2003.01249, 2020a. Intriguing Properties of Input-Dependent Randomized Smoothing Mohapatra, J., Ko, C.-Y., Weng, T.-W., Chen, P.-Y., Liu, S., and Daniel, L. Higher-order certification for randomized smoothing. Advances in Neural Information Processing Systems, 33, 2020b. Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018. Robert, C. On some accurate bounds for the quantiles of a non-central chi squared distribution. Statistics & probability letters, 10(2):101 106, 1990. Salman, H., Li, J., Razenshteyn, I. P., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In Neur IPS, 2019. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Tsuzuku, Y., Sato, I., and Sugiyama, M. 
Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings. neurips.cc/paper/2018/file/ 485843481a7edacbfce101ecb1e4d2a8-Paper. pdf. Van Erven, T. and Harremos, P. R enyi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, 2014. Wang, L., Zhai, R., He, D., Wang, L., and Jian, L. Pretrainto-finetune adversarial training via sample-wise randomized smoothing. 2021. URL https://openreview. net/forum?id=Te1a Z2my PIu. Weng, L., Zhang, H., Chen, H., Song, Z., Hsieh, C.- J., Daniel, L., Boning, D., and Dhillon, I. Towards fast computation of certified robustness for Re LU networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5276 5285. PMLR, 10 15 Jul 2018. URL http://proceedings.mlr.press/ v80/weng18a.html. Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286 5295. PMLR, 2018. Wu, Y., Bojchevski, A., Kuvshinov, A., and G unnemann, S. Completing the picture: Randomized smoothing suffers from the curse of dimensionality for a large family of distributions. In International Conference on Artificial Intelligence and Statistics, pp. 3763 3771. PMLR, 2021. Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I., and Li, J. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp. 10693 10705. PMLR, 2020. Zhai, R., Dan, C., He, D., Zhang, H., Gong, B., Ravikumar, P., Hsieh, C.-J., and Wang, L. Macer: Attack-free and scalable robust training via maximizing certified radius. ar Xiv preprint ar Xiv:2001.02378, 2020. Intriguing Properties of Input-Dependent Randomized Smoothing A. The Issues of Constant σ Smoothing A.1. Toy Example To better demonstrate our ideas, we prepared a two-dimensional simple toy dataset. This dataset can be seen in Figure 5. The dataset is generated in polar coordinates, having uniform angle and the distance distributed as a square root of suitable chi-square distribution. The classes are positioned in a circle sectors, one in a sector with a very sharp angle. The number of training samples is 500 for each class, number of test samples is 100 for each class (except demonstrative figures, where we increased it to 300). The model that was trained on this dataset was a simple fully connected three-layer neural network with Re LU activations and a maximal width of 20. A.2. Undercertification Caused by the Use of Lower Confidence Bounds As we mention in Section 1, one can not usually obtain exact values of p A and p B. However, it is obvious, that for vast majority of evaluated samples, p A < p A and p B > p B. Given the nature of our certified radius, it follows that R < R, where R denotes the certified radius coming from the certification procedure with p A and p B, while R here stands for the certified radius corresponding to true values p A, p B. It is, therefore, clear, that we face a certain level of under-certification. But how serious under-certification it is? Assume the case with a linear base classifier. Imagine, that we move the point x further and further away from the decision boundary. 
Therefore, p A 1. At some point, the probability will be so large, that with high probability, all n samplings in our evaluation of p A will be classified as A, obtaining ˆp A = 1 - the empirical probability. The lower confidence bound p A is therefore bounded by having ˆp A = 1. Thus, from some point, the certification will yield the same p A regardless of the true value of p A. So in practice, we have an upper bound on the certified radius R in the case of the linear boundary. In Figure 7 (left), we see the truncation effect. Using σ = 1, from a distance of roughly 4, we can no longer achieve a better certified radius, despite its theoretical value equals the distance. Similarly, if we fix a distance of x from decision boundary and vary σ, for very small values of σ, the value of Φ 1(p A) will no longer increase, but the values of σ will pull R towards zero. This behaviour is depicted in Figure 7 (right). We can also look at it differently - what is the ratio between Φ 1(p A) and Φ 1(p A) for different values of p A? Since R = σΦ 1(p A) and R = σΦ 1(p A), the ratio represents the undercertification rate . In Figure 6 we plot Φ 1(p A) Φ 1(p A) as a function of p A for two different ranges of values. The situation is worst for very small and very big values of p A. In the case of very big values, this can be explained due to extreme nature of Φ 1. For small values of p A, it can be explained as a consequence of a fact, that even small difference between p A and p A will yield big ratio between Φ 1(p A) and Φ 1(p A) due to the fact, that these values are close to 0. If we look at the left plot on Figure 8 we see, that the certified accuracy plots also possess the truncations. Above some radius, no sample is certified anymore. The problem is obviously more serious for small values of σ. On the right plot of Figure 8, we see, that samples far from the decision boundary are obviously under-certified. We can also see, that certified Figure 5. The toy dataset. Intriguing Properties of Input-Dependent Randomized Smoothing Figure 6. The ratio between certified radius if using lower confidence bounds and if using exact values for the case of linear boundary. Figure 7. Left: Certified radius as a function of distance in linear boundary case. The truncation is due to the use of lower confidence bounds. The parameters are n = 100000, α = 0.001, σ = 1. Right: Certified radius for a point x at fixed distance 1 from linear boundary as a function of used σ. The undercertification follows from usage of lower confidence bounds. radiuses remain constant, even though in reality they would increase with increasing distance from the decision boundary. All the observations so far motivate us to use rather large values of σ in order to avoid the truncation problem. However, as we will see in the next sections, using a large σ carries a different, yet equally serious burden. A.3. Robustness vs. Accuracy Trade-off As we demonstrate in the previous subsection, it is be useful to use large values of σ to prevent the under-certification. But does it come without a prize? If we have a closer look at Figure 8 (right), we might notice, that the accuracy on the threshold 0, i.e. clean accuracy , decreases as σ increases. This effect has been noticed in the literature (Cohen et al., 2019; Gao et al., 2020; Mohapatra et al., 2020a) and is called robustness vs. accuracy tradeoff. There are several reasons, why this problem occurs. 
Generally, changing σ changes the decision boundary of g and we might assume, that due to the high complexity of the boundary of f, the decision boundary of g becomes smoother. If σ is too large, however, the decision boundary will be so smooth, that it might lose some amount of the base classifier s expressivity. Another reason for the accuracy drop is also the increase in the number of samples, for which the evaluation is abstained. This is because using big values of σ makes more classes within the reach of our distribution , making the p A and p A small. If p A < 0.5 and we do not estimate p B but set p B = 1 p A, then we are not able to classify the sample as class A, yet we cannot classify it as a different class either, which forces us to abstain. To demonstrate these results, we computed not only Intriguing Properties of Input-Dependent Randomized Smoothing Figure 8. Results of certification on toy dataset. Left: Certified accuracy for different levels of σ. Right: Certified radiuses and decision boundary of g visualized directly on test set. Accuracy Abstention rate Misclassification rate σ = 0.12 0.814 0.038 0.148 σ = 0.25 0.748 0.086 0.166 σ = 0.50 0.652 0.166 0.182 σ = 1.00 0.472 0.29 0.238 Table 4. Accuracies, rates of abstentions and misclassification rates of (Cohen et al., 2019) evaluation for different levels of σ. the clean accuracies of (Cohen et al., 2019) evaluations but also the abstention rates. Results are depicted in the Table 4. From the table, it is obvious, that the abstention rate is possibly even bigger cause of accuracy drop than the clean misclassification . This problem can be partially solved if one estimated p B together with p A too. In this way, using big σ yields generally small estimated class probabilities, but since p A p B, the problematic p B p A occur just very rarely. Another option is to increase the number of Monte-Carlo samplings for the classification decision, what is almost for free. Yet another reason for the decrease in the accuracy is the so-called shrinking phenomenon, which we will discuss in the next subsection. In contrast with the truncation effect, the robustness vs. accuracy trade-off motivates the usage of smaller values of σ in order to prevent the accuracy loss, which is definitely a very serious issue. A.4. Shrinking Phenomenon How exactly does the decision boundary of g change, as we change the σ? For instance, if f is a linear classifier, then the boundary does not change at all. To answer this question, we employ the following experiment: For our toy base classifier f on our toy dataset, we increase σ and plot the heatmap of f, g, together with its decision boundary. This experiment is depicted on Figure 9. As we see from the plots, increasing σ causes several effects. First of all, the heatmap becomes more and more blurred, what proves, that stronger smoothing implies stronger smoothness. Second, crucial, effect is that the bigger the σ, the smaller the decision boundary of a submissive class is. The shrinkage becomes pronounced from σ = 1. Already for σ = 4, there is hardly any decision boundary anymore. Generally, as σ , g will predict the class with the biggest volume in the input space (in the case of bounded input space, like in image domain, this is very well defined). For extreme values of sigma, the p A will practically just be the ratio between the volume of A and the actual volume of the input space (for bounded input spaces). 
Following from these results, but also from basic intuition, it seems, that an undesired effect becomes present as σ increases - the bounded/convex regions become to shrink, like in Figure 9, while the unbounded/big/anti-convex regions expand. This Intriguing Properties of Input-Dependent Randomized Smoothing Figure 9. Heatmaps and decision boudnary of base classifier (top left) and the smoothed classifier for increasing levels of σ. As σ increases, the classifier is more smooth and the decision boundary recedes. is called shrinking phenomenon. (Mohapatra et al., 2020a) investigate this effect rather closely. They define the shrinkage and vanishing of regions formally and prove rigorously, that if σ , bounded regions, or semi-bounded regions (see (Mohapatra et al., 2020a)) will eventually vanish. We formulate the main result in this direction. Theorem A.1. Let us have K the number of classes and the dimension N. Assume, that we have some bounded decision region D for a specific class roughly centered around 0. Further assume, that this is the only region where the class is classified. Let R be a smallest radius such that D BR(0). Then, this decision region will vanish at most for σ R Proof. The idea of the proof is not very hard. First, the authors prove, that the smoothed region will be a subset of the smoothed BR(0). Then, they upper-bound the vanishing threshold of such a ball in two steps. First, they show, that if 0 is not classified as the class, then no other point will be (this is quite an intuitive geometrical statement. The BR(0) has the biggest probability under N(x, σ2I) if x 0). Second, they upper-bound the threshold for σ, under which BR(0) will have probability below 1 K (since they use slightly different setting as (Cohen et al., 2019)) for the N(x, σ2I). Using some insights about incomplete gamma function, which is known to be also the cdf of central chi-square distribution, and some other integration tricks, they obtain the resulting bound. Besides Theorem A.1, authors also claim many other statements abound shrinking, including shrinking of semi-bounded regions. Moreover, they also conduct experiments on CIFAR10 and Image Net to support their theoretical findings. They also point out serious fairness issue that comes out as a consequence of the shrinkage phenomenon. For increasing levels of σ, they measure the class-wise clean accuracy of the smoothed classifier. If f is trained with Gaussian data augmentation (what is known to be a good practice in randomized smoothing), using σ = 0.12, the worst class cat has test accuracy of 67%, while the best class automobile attains the accuracy of 92%. The figures, however, change drastically, if we use σ = 1 instead. In this case, the worst predicted class cat has accuracy of poor 22%, while ship has reasonable accuracy 68%. As authors claim, this is a consequence of the fact that samples of cat are situated more in bounded, convex regions, that suffer from shrinking, while samples of ship are mostly placed in expanded regions of anti-convex shape that will expand as the σ grows. In addition, the authors also show, that the Gaussian data augmentation or adversarial training will reduce the shrinking phenomenon just partially and for moderate and high values of σ, this effect will be present anyway. We must emphasize, that this is a serious fairness issue, that has to be treated before randomized smoothing can be fully used in practice. 
For instance, if we trained a neural network to classify humans into several categories, fairness of classification is inevitable and the neural network cannot be used until this issue is solved. Intriguing Properties of Input-Dependent Randomized Smoothing Dimension σ (σb) r Accuracy 2 0.5 - 0.943 2 0.4 0.2 0.96 2 0.5 0.2 0.943 6 0.5 - 0.946 6 0.4 0.1 0.963 18 1.0 - 0.86 18 0.8 0.05 0.886 60 1.0 - 0.83 60 0.8 0.03 0.85 60 1.0 0.03 0.83 180 2.0 - 0.713 180 1.9 0.01 0.726 400 2.0 - 0.623 400 1.95 0.005 0.623 Table 5. Clean accuracies of different evaluations of our toy experiment. Similarly as the robustness vs. accuracy trade-off, this issue also motivates to use rather smaller values of σ. We see, that it is not possible to address all three problems consistently because they disagree on whether to use smaller, or bigger values of σ. A.5. Experiments on High-dimensional toy Dataset In this subsection, we present the results of our motivational experiment on a synthetic dataset. Before reading this section, please read our main text, because we will use the necessary notation of the paper. The dataset we evaluated our method on is a generalization of the dataset visualized on Figure 1. The data points from one class lie in a cone of small angle and the points are generated such that the density is higher near the vertex of the cone (which is put in origin). Points from other class are generated from a spherically symmetrical distribution (where points sampled into the cone are excluded) with density again highest in the center (note, that the density peak is more pronounced than in the case of normal distribution, where the density around the center resembles uniform distribution). This dataset is chosen so that the σ(x) function designed in Equation 1 well corresponds to the geometry of the decision boundary. Moreover it is chosen so that the conic decision region will shrink rather fast with increasing σ. The motivation of this example is to show that if the σ(x) function is well-designed, our IDRS can outperform the constant RS considerably. The setup of our experiment is as follows: We evaluate dimensions N = 2, 6, 18, 60, 180, 400. The σ used for constant smoothing is σ = 0.5, 0.5, 1.0, 1.0, 2.0, 2.0 respectively. The σb used is 0.4, 0.5 for N = 2, 0.4 for N = 6, 0.8 for N = 18, 0.8, 1.0 for N = 60, 1.9 for N = 180 and 1.95 for N = 400. The rates are r = 0.2, 0.1, 0.05, 0.03, 0.01, 0.005 respectively. The training was executed without data augmentation (because samples from different classes are very close to each other). Moreover, we have set maximal σ(x) threshold for numerical purposes, because some samples were outliers and were way too far from other samples (and if the σ(x) is way too big, the method encounteres numerical problems). In this case we set σ(x) 5σb, but we are aware that also much bigger thresholds would have been possible. We present our comparisons in Figure 10 and Table 5. From both the Figure 10 an Table 5 it is clear that the IDRS can outperform the constant σ RS considerably, if we use really suitable σ(x) function. We manage to improve significantly the certified radiuses without losing a single correct classification. On the other hand, in cases where σb < σ, we outperform constant σ both in clean accuracy and in certified radiuses. This example is synthetic and designed in our favour. The main message is not how perfect our design of σ(x) is, but the fact, that if σ(x) is designed well, the IDRS can bring real advantages, even in moderate dimensions. 
B. More on Theory

B.1. Generalization of Results by Li et al. (2019)

In the main text we mostly focus on generalizing the methods of (Cohen et al., 2019), because these methods yield tight radiuses and because the application of the Neyman-Pearson lemma is elegant. However, the methodology of (Li et al., 2019) can also be generalized to input-dependent RS. To do so, we need some auxiliary statements about the Rényi divergence.

Lemma B.1. The Rényi divergence between two one-dimensional normal distributions is

D_α(N(μ1, σ1²) || N(μ0, σ0²)) = α(μ1 − μ0)² / (2σ_α²) + 1/(1 − α) · log( σ_α / (σ1^(1−α) σ0^α) ),

provided that σ_α² := (1 − α)σ1² + ασ0² ≥ 0.

Proof. See (Van Erven & Harremos, 2014).

Note that this statement induces some assumptions on how σ0, σ1 and α relate. If σ0 > σ1, the required inequality holds for any α > 0, α ≠ 1. If σ0 < σ1, then α is restricted and we need to keep that in mind.

Lemma B.2. Assume we have one-dimensional distributions P1, P2, ..., PN and Q1, Q2, ..., QN such that each pair with the same index is defined on a common space. Then, on the product space with the product σ-algebra, the following identity holds:

D_α(P1 ⊗ P2 ⊗ ... ⊗ PN || Q1 ⊗ Q2 ⊗ ... ⊗ QN) = Σ_{i=1}^{N} D_α(Pi || Qi).

Proof. See (Van Erven & Harremos, 2014).

Using these two statements, we can now derive a formula for the Rényi divergence between two multivariate isotropic normal distributions.

Proposition B.3.

D_α(N(x1, σ1²I) || N(x0, σ0²I)) = α‖x0 − x1‖² / (2σ1² + 2α(σ0² − σ1²)) + N/(1 − α) · log( σ_α / (σ1^(1−α) σ0^α) ).

Proof. The important property needed here is that isotropic Gaussian distributions factorize into one-dimensional independent marginals. In other words,

N(x1, σ1²I) = N(x11, σ1²) ⊗ N(x12, σ1²) ⊗ ... ⊗ N(x1N, σ1²),

and analogously for x0. Therefore, using Lemma B.2,

D_α(N(x1, σ1²I) || N(x0, σ0²I)) = Σ_{i=1}^{N} D_α(N(x1i, σ1²) || N(x0i, σ0²)).

Now it suffices to plug in the formula from Lemma B.1,

D_α(N(x1i, σ1²) || N(x0i, σ0²)) = α(x1i − x0i)² / (2σ_α²) + 1/(1 − α) · log( σ_α / (σ1^(1−α) σ0^α) ),

with 2σ_α² = 2σ1² + 2α(σ0² − σ1²). Summing over i, the result follows.
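The closed form above is easy to evaluate numerically. The following sketch implements it (argument order as in the statement, with x1 the perturbed mean) and prints its value at x0 = x1 for a few dimensions, which previews the dimension dependence discussed below; the test values of α, σ0, σ1 are arbitrary.

```python
import numpy as np

def renyi_iso_gaussians(alpha, x1, x0, sigma1, sigma0):
    """D_alpha(N(x1, sigma1^2 I) || N(x0, sigma0^2 I)) via Lemmas B.1 and B.2."""
    x1, x0 = np.asarray(x1, dtype=float), np.asarray(x0, dtype=float)
    n = x0.size
    sig_a2 = (1 - alpha) * sigma1 ** 2 + alpha * sigma0 ** 2
    if sig_a2 <= 0 or alpha == 1:
        raise ValueError("alpha outside the admissible range for these variances")
    mean_term = alpha * np.sum((x1 - x0) ** 2) / (2 * sig_a2)
    var_term = n / (1 - alpha) * (0.5 * np.log(sig_a2)
                                  - (1 - alpha) * np.log(sigma1)
                                  - alpha * np.log(sigma0))
    return mean_term + var_term

# grows linearly in the dimension even with identical means
for n in (2, 30, 3072):
    print(n, renyi_iso_gaussians(2.0, np.zeros(n), np.zeros(n), 0.9, 1.0))
```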
To obtain a certified radius, we also need a result from (Li et al., 2019) which guarantees that two measures on the set of classes share the mode if the Rényi divergence between them is small enough.

Lemma B.4. Let P = (p1, p2, ..., pK) and Q = (q1, q2, ..., qK) be two discrete measures on C. Let p_A, p_B correspond to the two biggest probabilities in P. Let M_1(a, b) = (a + b)/2 and M_{1−α}(a, b) = ((a^(1−α) + b^(1−α))/2)^(1/(1−α)). If

D_α(Q || P) ≤ −log(1 − 2M_1(p_A, p_B) + 2M_{1−α}(p_A, p_B)),

then P and Q agree on the class with maximal assigned probability.

Proof. The lemma can be proved by directly computing, via a constrained optimization problem (with variables p_i, q_i, i ∈ {1,...,K}) and its KKT conditions, the minimal D_α needed for the two measures to disagree on the class with maximal probability. For details, consult (Li et al., 2019).

Having an explicit formula for the Rényi divergence, we can mimic the methodology of (Li et al., 2019) to obtain a certified radius.

Theorem B.5. Given x0, p_A, p_B, σ0 and N, the certified radius squared valid for all x1 at which a fixed σ1 is used is

R² = sup_{α ∈ S_{σ0,σ1}} (2σ1² + 2α(σ0² − σ1²))/α · [ −log(1 − 2M_1(p_A, p_B) + 2M_{1−α}(p_A, p_B)) − N/(1−α) · log( σ_α / (σ1^(1−α) σ0^α) ) ],

where S_{σ0,σ1} = ℝ₊ if σ0 > σ1 and S_{σ0,σ1} = (0, σ1²/(σ1² − σ0²)) if σ0 < σ1.

Proof. Fix x1 and assume α ∈ S_{σ0,σ1}. By the post-processing inequality for the Rényi divergence,

D_α(f(x1 + N(0, σ1²I)) || f(x0 + N(0, σ0²I))) ≤ D_α(x1 + N(0, σ1²I) || x0 + N(0, σ0²I)) = α‖x0 − x1‖² / (2σ1² + 2α(σ0² − σ1²)) + N/(1−α) · log( σ_α / (σ1^(1−α) σ0^α) ).

By Lemma B.4, it therefore suffices that for some α ∈ S_{σ0,σ1}

α‖x0 − x1‖² / (2σ1² + 2α(σ0² − σ1²)) + N/(1−α) · log( σ_α / (σ1^(1−α) σ0^α) ) ≤ −log(1 − 2M_1(p_A, p_B) + 2M_{1−α}(p_A, p_B)).

Rewriting with respect to ‖x0 − x1‖²,

‖x0 − x1‖² ≤ (2σ1² + 2α(σ0² − σ1²))/α · [ −log(1 − 2M_1(p_A, p_B) + 2M_{1−α}(p_A, p_B)) − N/(1−α) · log( σ_α / (σ1^(1−α) σ0^α) ) ].

The certified radius squared is then obtained by taking the supremum of the right-hand side over α ∈ S_{σ0,σ1}.

Note that this theorem assumes that, except at x0, σ1 is used everywhere. It would require further work to generalize it to general σ(x) functions, but for the point we want to make this is not necessary. Looking at the expression, the term −N/(1−α) · log(σ_α/(σ1^(1−α)σ0^α)) depends heavily on N; even for a ratio σ0/σ1 close to 1 it takes strongly negative values in high dimensions. The term −log(1 − 2M_1(p_A, p_B) + 2M_{1−α}(p_A, p_B)) is far less sensitive with respect to p_A, and for large N it is easily dominated by the first term. Therefore, the higher the dimension N, the bigger p_A or the closer to 1 the ratio σ0/σ1 has to be in order to obtain even a valid certified radius, let alone a large one. This shows that the method of (Li et al., 2019) also suffers from the curse of dimensionality, as we knew it must. The method is not useful for large N, because the conditions on p_A, σ0, σ1 become so extreme that barely any inputs would yield a positive certified radius. This is depicted in Figure 11.

Figure 11. The certified radius as a function of dimension. Parameters are p_A = 0.99, σ0 = 1, σ1 = 0.8.

The key reason why this happens when done via Rényi divergences is that while the divergence D_α(N(x1, σ1²I) || N(x0, σ0²I)) grows independently of the dimension as ‖x0 − x1‖ grows, it drastically increases with N even if x1 = x0. This reflects the fact that if σ0 ≠ σ1, then the more dimensions we have, the more dissimilar N(x1, σ1²I) and N(x0, σ0²I) are. One can think of it as a consequence of the standard statistical fact that the more data we have, the more confidently we can reject the null hypothesis σ0 = σ1 when it is false. Since isotropic normal distributions can be treated as a sample of one-dimensional normal distributions, this is in accordance with our multivariate setting.
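The behaviour shown in Figure 11 can be reproduced directly from Theorem B.5. The sketch below evaluates the bound with p_B = 1 − p_A, approximating the supremum over α by a grid search; a negative value means that no valid radius can be certified. The grid bounds and step counts are arbitrary choices.

```python
import numpy as np

def li_certified_radius_sq(p_a, sigma0, sigma1, n, n_grid=2000):
    """Certified radius squared from Theorem B.5 with p_B = 1 - p_A."""
    p_b = 1.0 - p_a
    hi = 50.0 if sigma0 > sigma1 else sigma1 ** 2 / (sigma1 ** 2 - sigma0 ** 2)
    alphas = np.linspace(1e-3, 0.999 * hi, n_grid)
    alphas = alphas[np.abs(alphas - 1.0) > 1e-3]            # exclude alpha = 1
    best = -np.inf
    for a in alphas:
        sig_a2 = (1 - a) * sigma1 ** 2 + a * sigma0 ** 2
        m1 = 0.5 * (p_a + p_b)
        m1a = ((p_a ** (1 - a) + p_b ** (1 - a)) / 2) ** (1 / (1 - a))
        gap = -np.log(1 - 2 * m1 + 2 * m1a)
        div0 = n / (1 - a) * (0.5 * np.log(sig_a2)
                              - (1 - a) * np.log(sigma1) - a * np.log(sigma0))
        best = max(best, (2 * sig_a2 / a) * (gap - div0))
    return best

for n in (2, 10, 50, 200, 1000):                             # parameters as in Figure 11
    print(n, li_certified_radius_sq(0.99, 1.0, 0.8, n))
```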
B.2. The Explanation of the Curse of Dimensionality

In Section 2 we show that input-dependent RS suffers from the curse of dimensionality. Here we elaborate on this phenomenon and try to explain why it occurs. First, it is clear from Subsection B.1 that the generalized method of (Li et al., 2019) also suffers from the curse of dimensionality, because the Rényi divergence between two isotropic Gaussians with different variances grows considerably with the dimension. This suggests that input-dependent RS might suffer from the curse of dimensionality in general. To motivate this idea further, we present the following easy observation.

Theorem B.6. Denote by R_C the certified radius obtained for p_A and σ0 at x0 assuming the constant σ0 and following the certification of (Cohen et al., 2019).¹ Assume that we certify each x1 by assuming the worst-case classifier as in Theorem 2.2. Then, for any x0, any function σ(x) and any p_A, the following inequality holds: R ≤ R_C, where R denotes the certified radius obtained from the input-dependent certification.

¹ The C in the subscript of the certified radius may stand both for "constant" and for "Cohen et al."

Proof. Fix x1 and σ1. From Theorem 2.2 we know that the worst-case classifier f* defines a ball B such that P0(B) = 1 − p_A. It follows that the linear classifier f_l and the half-space B_l, which assume the constant σ0 also at x1 and which are worst-case for σ0 with P0(B_l) = 1 − p_A, are not worst-case when σ1 is used instead. Therefore P1(B_l) ≤ P1(B). Moreover, let P1^C be the probability measure corresponding to N(x1, σ0²I), i.e. the measure assuming the constant σ0. It is easy to see that P1^C(B_l) > 0.5 ⟺ P1(B_l) > 0.5, because the probability of a half-space under an isotropic normal distribution is bigger than one half if and only if the mean is contained in the half-space. Assume, for contradiction, that R > R_C. Then there exists a particular x1 such that P1^C(B_l) > 0.5 > P1(B), because otherwise there would be no point causing R > R_C. However, P1^C(B_l) > 0.5 implies P1(B_l) > 0.5, thus P1(B_l) > P1(B), which is a contradiction.

This theorem shows that we can never achieve a better certified radius at x0, using σ0 and probability p_A, than the one given by the certification of (Cohen et al., 2019). Of course, this does not make non-constant σ useless, since σ0 itself can vary across inputs. The question is how much we lose by using a non-constant σ. To get a better intuition, we plot the functions ξ< and ξ> under different setups in Figure 12, together with P1(B_l) from the proof of Theorem B.6. From the top row we deduce that the dimension N has a very significant impact on the probabilities and therefore also on the certified radius. We particularly point out that even ξ>(0) and ξ<(0) can have a significant margin with respect to the probability coming out of the linear classifier.² Already for N = 90, we are not able to certify p_A = 0.99 for a rather conservative value of the ratio σ0/σ1. From the middle row we see that bringing the ratio σ0/σ1 closer to 1 mitigates this effect strongly; for instance, for σ0 = 1, σ1 = 0.95 the difference between P1(B) and P1(B_l) almost disappears. The bottom row compares ξ>(a), ξ<(a) and the respective linear-classifier probabilities. We can see that the case σ0 < σ1 might impose even stronger restrictions on our certification (although we deduce this just from the picture).
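The size of ξ>(0), ξ<(0) compared with the linear-classifier probability can be computed directly from (non)central chi-square cdfs. The sketch below evaluates the probability of the worst-case ball B (centred at x0 with mass 1 − p_A under N(x0, σ0²I)) under N(x1, σ1²I), alongside Cohen et al.'s worst-case half-space probability under constant σ0; it only reproduces these ball probabilities, not the full ξ>/ξ< definitions from the main text, and the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import chi2, ncx2, norm

def worst_case_ball_prob(p_a, sigma0, sigma1, dist, n):
    """P under N(x1, sigma1^2 I) of the ball B centred at x0 with P0(B) = 1 - p_A,
    where dist = ||x1 - x0||."""
    R2 = sigma0 ** 2 * chi2.ppf(1.0 - p_a, df=n)          # radius of the worst-case ball
    nc = dist ** 2 / sigma1 ** 2
    q = R2 / sigma1 ** 2
    return chi2.cdf(q, df=n) if nc == 0 else ncx2.cdf(q, df=n, nc=nc)

def worst_case_halfspace_prob(p_a, sigma0, dist):
    """Cohen et al.'s worst-case half-space probability at distance `dist`, constant sigma0."""
    return norm.cdf(norm.ppf(1.0 - p_a) + dist / sigma0)

for n in (2, 30, 90, 3072):
    print(n, worst_case_ball_prob(0.99, 1.0, 0.9, dist=0.0, n=n),
          worst_case_halfspace_prob(0.99, 1.0, dist=0.0))
```

Even at dist = 0, the ball probability pulls far away from 1 − p_A as the dimension grows, which is exactly the margin discussed above.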
What is the reason for ξ>(a), ξ<(a) being so big even at a = 0? The problem is the following. Assume σ0 > σ1. If x0 = x1, the worst-case classifier coming from Theorem 2.2 yields a ball B centered right at x0 such that P0(B) = 1 − p_A. Looking at P1(B), we have the same ball centered directly at the mean, but the variance of the distribution is smaller. Using the spherical symmetry of the isotropic Gaussian distribution, this is equivalent to evaluating the probability of a bigger ball. If we fix σ0 and σ1 and look at the ratio of probabilities P1(B)/P0(B) with increasing N, the curse of dimensionality comes into play. For N = 2 the ratio is not too big; however, for N = 3072, as in CIFAR10, it is far bigger. This can be intuitively seen from a property of the chi-square distribution (which appears in the case x0 = x1): while its expectation is N, its standard deviation is only √(2N), so that √(Var(χ²_N))/E(χ²_N) → 0 as N → ∞.

B.3. Why Does Input-dependent Smoothing Work Better for Small σ Values?

As can be observed in Section 4 and Appendix E, the bigger the σb = σ we use, the harder it is to keep up with the standards of constant smoothing. An interesting question is why the usage of a small σb = σ is helpful for input-dependent smoothing. Assume a fixed σ, say σb = σ = 0.12. The theoretical bound on the certified radius, given 100000 Monte-Carlo samples and confidence level 0.001 and using constant smoothing, is about 0.48. Having σ(x) ≈ 0.12, we cannot expect a much bigger certified radius. Therefore, if we follow Theorem 3.2, the values of exp(−rR) and exp(rR) at the critical distance 0.5 are much closer to 1 than the values of exp(−rR) and exp(rR) would be if we used σb = σ = 0.50 instead, where the critical values of R can be much bigger than 0.5. Therefore, the gain in P1(B) imposed by the curse of dimensionality, compared to P1(B) assuming constant σ, is not yet severe. This means that the loss in certified radius caused by the curse of dimensionality is much less pronounced on the active range of certified radiuses (those for which constant smoothing still works) than it would be with a big σb = σ. To support this idea, Figure 13 depicts the certified radius as a function of the distance from the decision boundary, assuming f to be a linear classifier, using σb = σ = 0.12 and σb = σ = 0.50 for comparison.

[Footnote 2] Notice the similarity with the Rényi divergence, which also takes a positive value even for x0 = x1 whenever σ0 ≠ σ1, and then grows rather reasonably with the distance.

Figure 12. Plots of ξ>(a), ξ<(a) for different setups. The parameter coding is [σ0, σ1, N, p_B]. Top: ξ>(a) left, ξ<(a) right, varying values of N. Center: on the left, ξ>(a) for varying σ1; on the right, ξ>(a) for varying p_B. Bottom: ξ>(a) and ξ<(a) compared.

Figure 13. Comparison of the certified radius as a function of distance for constant and input-dependent smoothing. Left: σb = σ = 0.12, right: σb = σ = 0.50.
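The ceiling of roughly 0.48 quoted in B.3 above can be reproduced from the Clopper-Pearson bound: even if all Monte-Carlo samples agree, the one-sided lower confidence bound on p_A at level α with n samples is α^(1/n) < 1, which caps the reported radius σΦ⁻¹(p_A). A minimal sketch (the exact constant depends on the confidence-bound convention used):

```python
from scipy.stats import norm

def max_certifiable_radius(sigma, n_samples=100_000, alpha=0.001):
    """Largest radius Cohen-style certification can report with n Monte-Carlo samples:
    the Clopper-Pearson lower bound on p_A when all samples agree is alpha**(1/n)."""
    p_a_lower = alpha ** (1.0 / n_samples)
    return sigma * norm.ppf(p_a_lower)

print(max_certifiable_radius(0.12))   # about 0.46, close to the ~0.48 ceiling quoted above
print(max_certifiable_radius(0.50))   # about 1.9 for sigma_b = sigma = 0.50
```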
B.4. How Does the Curse of Dimensionality Affect the Total Possible Variability of σ(x)?

Fix a certain type of task, say RGB image classification with images of similar objects, but consider many possible resolutions (dimensions N). Given two random images x0, x1 from the test set, what is the biggest reasonable value of |σ(x0)/σ(x1) − 1|? Theoretically, the expression is bounded by exp(r‖x0 − x1‖) − 1, where r is the semi-elasticity constant of σ(x). However, the average distance between two samples from a test set of constant size but increasing dimension scales as √N; therefore, with constant r, this upper bound increases. The increasing distance between samples is thus a counter-effect to the curse of dimensionality: in simple words, we have more distance over which to change σ(x0) into σ(x1). Even if r decreased as 1/√N, the increasing distances would cancel the effect of the curse of dimensionality, and the maximal reasonable value of |σ(x0)/σ(x1) − 1| would remain roughly constant with respect to N.

However, we need to take another effect into account. As the dimension increases, the average distance of samples from the decision boundary increases as well. This is because distances in general grow with the dimension; if we assume that the number of intersections of the line segment between x0 and x1 with the decision boundary of the network stays roughly constant, then the average distance from the decision boundary also grows as √N. To compensate for this, we need to scale the basic level of σ(x) (which we later call σb and which can be understood as the general offset of σ(x)) as √N too, because the maximal attainable certified radius, given a fixed confidence level α and number of Monte-Carlo samples, is a constant multiple of σ(x). However, with an increased σ we need to decrease the semi-elasticity rate r in order to obtain full certifications (see also Appendix B.3 for the intuition behind this). As a sketch of the argument, we provide a simple computation of the approximate asymptotic behavior of |σ(x0)/σ(x1) − 1|. By Theorem 2.5, being able to certify a radius of order c√N imposes an upper bound on exp(r · c√N) that tightens with the dimension (this is just a necessary condition; for sufficiency, the left-hand side must be much closer to 1). After a simple manipulation, we obtain that the rate scales as r ∼ 1/N. Consequently,

|σ(x0)/σ(x1) − 1| ≤ exp(r‖x0 − x1‖) − 1 ≈ r‖x0 − x1‖ = O(√N/N) = O(1/√N),

so the admissible relative variability of σ(x) shrinks with the dimension despite the growing distances.

B.5. Does the Curse of Dimensionality Apply in the Multi-class Regime?

In the main text we presented a setup where p_B is set to 1 − p_A. This is equivalent to pretending that there are just two classes. By not estimating the proper value of p_B we lose some power, and the resulting certified radius is smaller than it could have been had we estimated p_B as well; this is most pronounced for datasets with many classes. The natural question, therefore, is whether we could avoid the curse of dimensionality by properly estimating p_B together with p_A. The answer is no. The problem is that the theory in Section 2 already implicitly works with an estimate of p_B in the form of 1 − p_A. The theory would work with any other estimate of p_B as well; assuming a constant p_B instead of a constant p_A, as we did in Section 2, therefore yields the same conclusions. Moreover, there is neither a theoretical nor a practical reason why p_B should decrease with increasing dimension. This insight also applies to the use of input-dependent RS in practice. The assumption p_B = 1 − p_A is no more important in Section 3 than in Section 2. Therefore, we can apply our method also with a p_B obtained directly by Monte-Carlo sampling for the class B (or by any other estimation method).

C. Competitive Work

As we mention in Section 1, the idea to use input-dependent RS is not new. It appeared in 2020 and 2021 in at least four works from three completely distinct groups of authors, even though none of these works has been successfully published yet. We find it necessary to comment on all of them for two orthogonal reasons. First, it is good practice to compare our work with competing work, to see the pros and cons of these similar approaches and to what extent they differ. Second, we are convinced that three of these four works claim results which are not mathematically valid.
We consider this a particularly critical problem in a domain such as certifiable robustness, which is by definition based on rigorous mathematical certificates.

C.1. The Work of Wang et al. (2021)

In this work, the authors have two main contributions. First, they propose a two-phase training in which, during the second phase, a roughly optimal σi is found for each sample xi, and xi is then augmented with σi as the augmentation standard deviation; the authors call this method "pretrain to finetune". Second, they provide a specific version of input-dependent RS. Essentially, they try to overcome the mathematical problems connected to the usage of a non-constant σ(x) by splitting the input space into so-called robust regions Ri, within which a constant σi is guaranteed to be used. All certified balls are guaranteed to lie within just one of these robust regions, ensuring that a constant level of σ is used within each certified region. The authors test this method on CIFAR10 and MNIST and show that it can outperform existing state-of-the-art approaches, mainly on the more complex CIFAR10 dataset. However, we make several points which render the results of this work, as well as the proposed method, less impressive:

- The computational complexity of both their train-time and test-time algorithms seems to be quite high.
- The final smoothed classifier depends on the order of the incoming samples. As a consequence, it is not clear whether the method works well for every permutation of the would-be tested samples. This creates another adversarial attack possibility: attacking the final smoothed classifier by manipulating the test set so that the order of the samples is inappropriate for the good functionality of the final smoothed classifier.
- Even more, the dependence on the order of the would-be tested samples makes it necessary that the same smoothed classifier is used throughout a test session in real-world applications. For instance, a camera recognizing faces to approve entry to a high-security building would need to keep the same model for its whole functional life, because restarting the model would enable attackers to craft attacks on predictions from the previous session. This might significantly restrict the practical usability of the method.

C.2. The Works of Alfarra et al. (2020) and Eiras et al. (2021)

In these works, similarly to (Chen et al., 2021), the authors suggest optimizing, at each test point x, for the σ(x) that maximizes the certified radius given by (Zhai et al., 2020), which is an extension of (Cohen et al., 2019)'s certified radius to soft smoothing. The optimization for σ(x) differs between the works but is similar in some respects (as will be discussed). Besides, all three works further propose an input-dependent training procedure, in which σ(x), the standard deviation of the Gaussian data augmentation, is also optimized. Altogether, the authors claim strong improvements over all previous impactful works such as (Cohen et al., 2019; Zhai et al., 2020; Salman et al., 2019). The only significant difference between (Alfarra et al., 2020) and (Eiras et al., 2021) (which have a strong overlap in authors) is that in (Eiras et al., 2021) the authors build upon (Alfarra et al., 2020) and move from isotropic smoothing to smoothing with certain anisotropic distributions.
As mentioned, the authors first deviate from the setup of (Cohen et al., 2019) and turn to the setup introduced by (Zhai et al., 2020), i.e. they use the soft smoothed classifier G defined as

G_F(x)_C = E_{δ ~ N(0, σ²I)} F(x + δ)_C.

The key property of soft smoothed classifiers is that the result of (Cohen et al., 2019) on the certified radius holds for them too.

Theorem C.1 (certified radius for soft smoothed classifiers). Let G be the soft smoothed probability predictor and let x be such that G(x)_A ≥ E_A ≥ E_B ≥ G(x)_B. Then the smoothed classifier g is robust at x with radius

R = (σ/2)(Φ⁻¹(E_A) − Φ⁻¹(E_B)) = (σ/2)(Φ⁻¹(E_A) + Φ⁻¹(1 − E_B)),

where Φ⁻¹ denotes the quantile function of the standard normal distribution.

Proof. Provided in (Zhai et al., 2020).

Note that, just as in the hard version of this theorem, it is essential to provide lower and upper confidence bounds for G(x)_A and G(x)_B; otherwise the theorem cannot be used with the required probability that the certified radius is valid. Denote by G(x, σ) the soft smoothed classifier using σ at x. The authors propose to use the following theoretical σ(x) function:

σ(x) = argmax_{σ>0} (σ/2)(Φ⁻¹(G(x, σ)_A) − Φ⁻¹(G(x, σ)_B)).   (2)

It is of course not possible to optimize this particular function, since it is not known. It is also not feasible to run the Monte-Carlo sampling for each σ, because that is too costly and, due to the stochasticity, it would lead to a discontinuous function. The treatment of this problem is probably the most pronounced difference between (Alfarra et al., 2020) and (Chen et al., 2021). (Alfarra et al., 2020) use the simple observation N(0, σ²I) ≡ σN(0, I). Let δi, i ∈ {1,...,M}, be an i.i.d. sample from N(0, I). Obviously, G(x, σ)_A ≈ (1/M) Σ_{i=1}^M F(x + σδi)_A, since this is just the empirical mean of the theoretical expectation. Expression 2 can then be approximated as

σ(x) = argmax_{σ>0} (σ/2)( Φ⁻¹((1/M) Σ_{i=1}^M F(x + σδi)_A) − Φ⁻¹((1/M) Σ_{i=1}^M F(x + σδi)_B) ).   (3)

Here M is the number of Monte-Carlo samples used to approximate the function. Note that this function is a random realization of a stochastic process in σ, driven by the stochasticity in the sample δi, i ∈ {1,...,M}. To find its maximum, the authors propose to use simple gradient ascent, which is possible due to the simple differentiable form of Expression 3; this differentiability is one of the main motivations to switch from hard to soft randomized smoothing. We can now state the exact optimization algorithm of (Alfarra et al., 2020).

Algorithm 1 Data dependent certification (Alfarra et al., 2020)
function OPTIMIZESIGMA(F, x, β, σ0, M, K):
    for k = 0, ..., K do
        sample δ1, ..., δM ~ N(0, I)
        φ(σk) ← (1/M) Σ_{i=1}^M F(x + σk δi)
        ÊA(σk) ← max_C φ(σk)_C
        ÊB(σk) ← max_{C ≠ A} φ(σk)_C
        R(σk) ← (σk/2)(Φ⁻¹(ÊA(σk)) − Φ⁻¹(ÊB(σk)))
        σ_{k+1} ← σk + β ∇_{σk} R(σk)
    σ ← σK
    return σ

Note that, done in this way, the algorithm can be viewed as stochastic gradient ascent. After obtaining σ ≈ σ(x), the authors run the Monte-Carlo sampling to estimate the certified radius exactly as in (Cohen et al., 2019), but with σ(x) instead of a global σ. Using this algorithm, the authors achieve significant improvements over (Cohen et al., 2019)'s results, in particular getting rid of the first problem mentioned in Appendix A, the truncation issue. For the results, we refer to (Alfarra et al., 2020).
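For illustration, the following is a minimal runnable sketch of the optimization loop in Algorithm 1. F is assumed to return a vector of class probabilities; the analytic gradient is replaced by a finite-difference estimate so the sketch stays framework-free, and the [0.02, 0.98] clamping used by the authors (discussed below) is replaced by a numerical clip that merely avoids infinite quantiles. The default hyperparameters are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def optimize_sigma_alfarra(F, x, beta=0.1, sigma0=0.5, M=100, K=100, eps=1e-4):
    """Sketch of Algorithm 1: gradient ascent on the empirical soft certified radius."""
    def radius(sigma, deltas):
        phi = np.mean([F(x + sigma * d) for d in deltas], axis=0)
        e_b, e_a = np.sort(phi)[-2:]                          # runner-up and top class
        e_a, e_b = np.clip([e_a, e_b], 1e-6, 1 - 1e-6)
        return 0.5 * sigma * (norm.ppf(e_a) - norm.ppf(e_b))

    sigma = sigma0
    for _ in range(K):
        deltas = [np.random.randn(*np.shape(x)) for _ in range(M)]
        grad = (radius(sigma + eps, deltas) - radius(sigma - eps, deltas)) / (2 * eps)
        sigma = max(sigma + beta * grad, 1e-3)                # keep sigma positive
    return sigma
```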
We now give several comments on this algorithm and method. To begin with, in this optimization the authors do not adjust the estimated expectations, i.e. they use raw estimates rather than lower confidence bounds. This is not incorrect, since these estimates are not used directly for the estimation of the certified radius, but it is inconsistent with the resulting estimation. In other words, the authors optimize a slightly different function than the one they then use. The difference is not very big, except for extreme values of E_A, where it might be really significant. To partially overcome this inconsistency, the authors further (without comment) clamp ÊA(σk) and ÊB(σk) to the interval [0.02, 0.98]; i.e. if ÊA(σk) > 0.98, it is set to 0.98, and this is also taken into account in the computation of the gradients. This way the authors get rid of the inconvenient issue that if G(x)_A ≈ 1, then ÊA(σk) ≈ 1 for σk ≈ 0, which might cause a very big value of Φ⁻¹(ÊA(σk)) and thus a strong inconsistency with what would be obtained if a lower confidence bound were used instead. However, the clamping causes an even stronger inconsistency in the end. Note that if G(x)_A ≈ 1, then the true value of E_A(σk) is really close to 1, yielding high values of Φ⁻¹(E_A(σk)). This value would be far better approximated by the lower confidence bound than by the clamping, since the lower confidence bound of 1 for M = 100000 and α = 0.001 is more than 0.9999, while the clamped value is just 0.98. This makes small values of σ highly disadvantageous, since σ/2 → 0 as σ → 0, yet Φ⁻¹(ÊA(σk)) is stuck at Φ⁻¹(0.98). In other words, the authors artificially force the resulting σ(x) to be big enough that E_A(σ(x)) is at most roughly 0.98. This assumption is not commented on in the article and might result in non-transparent behavior.

Second, the authors use M = 1 in their experiments. This can be interpreted as using batch size 1 in classical SGD. We suppose that this small batch size is suboptimal, since it yields an extremely high variance of the gradient.

Third, during the search for σ(x) it is not taken into account whether the prediction is correct or not. This is, of course, a scientifically correct approach, since we cannot look at the label of the test sample before the very final evaluation. However, it is also problematic, since the function in Expression 2 might attain its optimum at a σ(x) which leads to a misclassification that could have been avoided if a constant σ had been used instead. To illustrate this issue, assume F(x) = 1(B1(0)), i.e. F predicts class 1 if and only if ‖x‖ ≤ 1 and class 0 otherwise. Assume we are certifying x ≈ 0 and that σ0 in Algorithm 1 is initialized such that class 0 already dominates. Then the gradient ∇_{σk} R(σk) is positive in all steps: F(σδi) is obviously non-increasing in σ, so the number of points classified as class 1 for a fixed sample δi, i ∈ {1,...,M}, decreases, making ÊA(σk) non-decreasing in σk, while σk/2 is strictly increasing in σk. As a result, σk diverges to ∞ as k → ∞. However, the point x ≈ 0 is classified as class 1, so we obtain a misclassification which is, moreover, assigned a very high certified radius.
This issue is actually even more general: the function in Expression 2 in most cases (over the unbounded domain ℝ^N) possesses no global maximum, because usually lim_{σ→∞} (σ/2)(Φ⁻¹(G(x, σ)_A) − Φ⁻¹(G(x, σ)_B)) = ∞. This is easy to see, for instance, for F(x) = 1(B1(0)), but it is the case for any hard classifier for which one decision region becomes dominant in volume as the radius around some x0 goes to infinity. If some region becomes dominant (for instance, if all other regions are bounded), then σ/2 grows while Φ⁻¹(G(x, σ)_A) − Φ⁻¹(G(x, σ)_B) either grows too or stagnates, making the whole function strictly increasing with a sufficiently high slope. This also puts the hyperparameter K under closer inspection: what is its effect on the performance of the algorithm? From the previous paragraph it seems that this parameter serves not only as a scaled "number of epochs", but also as a stability parameter, which, however, has a practical rather than a theoretical justification. Another issue is that the function in Expression 2 might be non-convex and possess many different local maxima, of which not all (or rather just a few) are actually reasonable. Therefore, Algorithm 1 is very sensitive to initialization.

However, probably the biggest issue of all is connected to the impossibility result shown in Section 2, which implies that Algorithm 1 actually yields invalid certified radiuses. Why is that? First, we must justify that our impossibility result is applicable also to soft randomized smoothing. This is because classifiers of the type F(x)_C = 1(x ∈ R_C), with R_C the decision region of class C, are among the applicable classifiers for which G(x, σ)_A = E_A. For such classifiers there is no difference between soft and hard smoothing and, moreover, E_A corresponds to p_A from our setup. This way we can construct the worst-case classifiers F exactly as in our setup, and therefore the same worst-case classifiers and subsequent adversarial examples apply here as well. In other words, for a fixed value of the soft smoothed G(x, σ)_A = E_A we can set p_A = E_A and find the worst-case hard classifier F, defined as the indicator of the worst-case ball, which yields an E_B given by P1(B) from Theorem 2.3 at some queried point x1.

As we have seen in the previous paragraphs, the σ(x) produced by Algorithm 1 is very unstable and stochastic: it depends heavily on F, σ0, K, β, M and, of course, on δi, i ∈ {1,...,M}, in each iteration of the for-loop. Now, for instance for CIFAR10 and p_A = 0.99, the minimal admissible ratio of σ1 to σ0 is more than 0.96. It is hard to believe that such an unstable, highly stochastic and non-regularized (except via K, β) method yields a σ(x) varying slowly enough that, within the certified radius around x0, there is no x1 for which σ1 deviates from σ0 by more than this strict threshold. This is even more pronounced on ImageNet, where the minimal admissible ratio of σ1 to σ0 is above 0.99 for any p_A or E_A.

Even without the help of the curse of dimensionality, we can construct a counterexample for which the algorithm does not yield a valid certified radius. Assume again F(x) = 1(B1(0)) and a modest dimension N = 2, and assume we try to certify the point x0 = [50, 0]. The theoretical σ-dependent objective from Equation 2 for this point is depicted in Figure 14; it is monotonically increasing.

Figure 14. The theoretical certified radius as in Expression 2. The function is monotonically increasing on the interval [0, 100] and keeps increasing beyond it.
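The monotone growth in Figure 14 can be verified exactly: for F(x) = 1(B1(0)) and a point at distance 50 from the origin, the soft probability of the dominating class 0 is one minus a noncentral chi-square cdf, so the objective of Expression 2 can be evaluated in closed form. The clip in the sketch below only guards against infinite quantiles, so the printed values understate the true, even larger ones.

```python
import numpy as np
from scipy.stats import ncx2, norm

def soft_radius_indicator_ball(dist, sigma, dim=2):
    """Objective of Expression 2 for F(x) = 1(B_1(0)) at distance `dist` from the origin:
    E_A = P(x + sigma*eps stays outside the unit ball), E_B = 1 - E_A."""
    e_a = 1.0 - ncx2.cdf(1.0 / sigma ** 2, df=dim, nc=dist ** 2 / sigma ** 2)
    e_a = np.clip(e_a, 1e-12, 1 - 1e-12)
    return 0.5 * sigma * (norm.ppf(e_a) - norm.ppf(1.0 - e_a))

for s in (1.0, 5.0, 20.0, 50.0, 100.0):
    print(s, soft_radius_indicator_ball(dist=50.0, sigma=s))   # strictly increasing in sigma
```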
We can see that the resulting σ(x0) will be as big as the regularizers K and β in Algorithm 1 allow. Therefore, if we run the algorithm with K high enough, the resulting certified radius will surely be far bigger than 50. However, if we certify the point x0' = [0, 0] and start with, say, σ0 = 0.25, then the σ-dependent certified radius in Expression 2 is decreasing in σ, yielding σ(x0') < 0.25, which results in the classification of class 1. The point [0, 0] lies within the certified range of [50, 0], yet it is not classified the same, because [50, 0] is obviously classified as class 0. This is therefore a counterexample to the validity of (Alfarra et al., 2020)'s certification method and their results. Even though our counterexample is a bit extreme, and one could argue that such a situation would not occur in practice, we must emphasize that it is constructed even without the help of the curse of dimensionality. In practice it fully suffices that for x0 and some certified radius R at x0, there exists an x1 within the range of this certified radius such that σ1 is sufficiently dissimilar to σ0. If such a situation occurs, then R is surely not a valid certified radius.

Remark C.2. In the latest version of their work, the authors themselves point out the issue outlined here. To overcome it, they propose the following fix. Let the test set be {xi}, i = 1,...,d. The samples are processed consecutively from 1 to d. After obtaining a radius Ri for a sample xi, the ball around xi with radius Ri is reserved, so that every point falling into this ball from now on will be classified as the same class as xi. They also adjust a newly computed certified radius Rj of a sample xj so that the ball around xj does not intersect any previously reserved ball around xi, i < j. If a sample falls into an already reserved region, the prediction on this sample is forced to be the prediction of the sample owning the region, and the certified radius is set such that the ball around the new sample is a subset of the ball reserved by the old sample.

Let us note that this fix by far does not solve the issues. Technically, it partially removes the mathematical invalidity of the certified radiuses, since now we at least have a guarantee that, within a session for which we store all previous test samples, there are no direct violations of the certificates. However, the fix brings many other subtle and serious problems. Similarly to Wang et al. (2021), it depends on the order of the incoming samples, creating possibilities for new adversarial attacks; to compare fairly, one would need to take the worst certified-accuracy result over all orderings of the test set. The method also requires storing all previously checked test samples, which might be a really big burden in inference-heavy applications such as autonomous driving, among many others. Classical randomized smoothing, as well as our framework, provides guarantees that are session-agnostic: given a fixed model f and a fixed variance function σ(x) (possibly constant), the certified radius for a sample x0 holds anywhere in the world, at any time, on any device. The model of Alfarra et al. (2020), in contrast, requires synchronization of all testing devices; if this synchronization is not provided, the guarantees hold just locally. Imagine our cell phones use some fixed classifier.
If the phones were not synchronized, then within the certified radius around x0 provided by one phone, there might be an adversarial example on some other phone. Probably the biggest issue is that the method uses a formula for the computation of certified radiuses which completely loses its meaning when the guarantees are actually based just on the fix. Cohen et al. (2019)'s formula for the certified radius is completely tailored to constant smoothing; used in this way, it is just an arbitrary formula. A very simple and naive method would then outperform everything in existence: it suffices to classify with f, or with a g created with a very small constant σ, and then assign a constant certified radius to all samples, computed such that, given the size of the test set (which is known, for instance, in scientific comparisons) and the average distances between closest samples, the radius is just small enough to statistically avoid intersections of certified regions. For instance, for CIFAR10 and 10000 test samples, certified radiuses of 4 or 5 would be completely safe and would hardly create any intersections. It is, however, clear that such a method does not make sense. Why does it look appealing at first sight? The problem is that in practical comparisons on CIFAR or MNIST there are no clashes of certified regions; that would no longer be the case if the test sets were bigger, since the average distance to the closest neighbor decreases as the number of test samples d grows. One could argue that there might be something like a "blessing of dimensionality": the distances between samples grow with the dimension, so the chance of clashes becomes smaller and smaller, and extremely large amounts of test data would be needed to create problems. However, this reasoning is not correct, since with growing dimension the required σ also grows in order to provide appropriately big certified radiuses. From this point of view, the number of test samples necessary to create trouble is roughly constant with respect to the dimension and depends just on the topology of the data distribution.

C.3. The Work of Chen et al. (2021)

The methodology of (Chen et al., 2021) is rather similar to that of (Alfarra et al., 2020). The biggest difference lies in the optimization of Expression 2. Instead of stochastic gradient ascent, they use a more sophisticated version of grid search, the so-called multiple-start fast gradient sign search (FGSS). Simply speaking, this method first generates a set of pairs (σ0, s)_i, i ∈ {1,...,K}, and then, for each of the i runs, it runs a j-indexed while-loop in which step j increases σ²_j = σ²_0 + js to σ²_{j+1} = σ²_0 + (j+1)s and checks whether the σ-dependent empirical certified radius in Expression 3 increases or not.
If it does, they continue until j exceeds some threshold T; if not, they break and report σi as the σj from the inner step at which the while-loop was broken. After obtaining σi for i ∈ {1,...,K}, they choose σ(x) as the one that maximizes Expression 3. More concretely, their multiple-start fast gradient sign search is given in Algorithm 2.

Algorithm 2 Instance-wise multiple-start FGSS (Chen et al., 2021)
function OPTIMIZESIGMA(F, x, σ0, M, K, T):
    generate (l_k, s_k), k ∈ {1,...,K}
    for k = 1,...,K do
        sample δ1,...,δM ~ N(0, l_k σ0² I)
        φ(√l_k σ0) ← (1/M) Σ_{i=1}^M F(x + δi)
        ÊA(√l_k σ0) ← max_C φ(√l_k σ0)_C
        R(√l_k σ0) ← √l_k σ0 Φ⁻¹(ÊA(√l_k σ0))
        m_k ← R(√l_k σ0)
        while l_k ∈ [1, T] do
            sample δ1,...,δM ~ N(0, (l_k + s_k) σ0² I)
            φ(√(l_k + s_k) σ0) ← (1/M) Σ_{i=1}^M F(x + δi)
            ÊA(√(l_k + s_k) σ0) ← max_C φ(√(l_k + s_k) σ0)_C
            R(√(l_k + s_k) σ0) ← √(l_k + s_k) σ0 Φ⁻¹(ÊA(√(l_k + s_k) σ0))
            if R(√(l_k + s_k) σ0) ≥ R(√l_k σ0) then
                l_k ← l_k + s_k
                m_k ← R(√(l_k + s_k) σ0)
            else break
    σ(x) ← max_{k ∈ {1,...,K}} m_k
    return σ(x)

It is not entirely clear from the text of (Chen et al., 2021) how exactly l_k, s_k are sampled, but it is stated that the interval for l is [1, 16] and for s it is (−1, 1). Moreover, the authors do not provide code, and from the text it seems that they do not use lower confidence bounds during the evaluation of the certified radiuses, which we consider a serious mistake (if this is really the case). However, our comments below apply regardless of the lower confidence bounds. Generally, this method has most of the disadvantages mentioned in Section C.2: it optimizes the same function, Expression 2, via its empirical version, Expression 3, which means that the objective may have several local optima and, in general, no global optimum (and in most cases its limit is infinity). As before, there is no control over the correctness of the prediction, i.e. many or all local optima might lead to misclassification. On the other hand, the authors use M = 500 (the effective batch size), which is definitely more reasonable than M = 1 as in (Alfarra et al., 2020). Furthermore, they use multiple initializations, making the optimization more robust and improving the chances of obtaining the global, or at least a very good local, maximum.

However, the main problem, the curse of dimensionality yielding invalid results, is even more pronounced here. Unlike the continuous approach of (Alfarra et al., 2020), here the authors sample, for each x0, just a discrete grid (a more complex one, since there are multiple initializations) of possible values of σ(x). For instance, if s = 1, then the smallest possible ratio between the σ's corresponding to two consecutive values of l in Algorithm 2 is √(15/16) ≈ 0.97. On ImageNet this already makes it impossible to certify some x1 with respect to x0 whenever their σ's fall on different grid points, and the same holds for a lot of samples on CIFAR10. Of course, the fact that s is randomly sampled from (−1, 1) makes this counter-argument more difficult, but it is, again, highly unlikely that this highly stochastic method, with no control over σ(x), yields a function with sufficiently small semi-elasticity. Therefore, the impressive results of (Chen et al., 2021) are, unfortunately, also scientifically invalid.
Algorithm 3 Pseudocode for certification and prediction of our method, based on (Cohen et al., 2019)
# evaluate g at x0
function PREDICT(f, σ0, x0, n, α):
    counts ← SampleUnderNoise(f, x0, n, σ0)
    ĉA, ĉB ← top two indices in counts
    nA, nB ← counts[ĉA], counts[ĉB]
    if BinomPValue(nA, nA + nB, 0.5) ≤ α then return ĉA
    else return ABSTAIN

# certify the robustness of g around x0
function CERTIFY(f, σ0, x0, n0, n, α):
    counts0 ← SampleUnderNoise(f, x0, n0, σ0)
    ĉA ← top index in counts0
    counts ← SampleUnderNoise(f, x0, n, σ0)
    pA ← LowerConfBound(counts[ĉA], n, 1 − α)
    if pA > 1/2 then return prediction ĉA and radius ComputeCertifiedRadius(σ0, r, N, pA, num_steps)
    else return ABSTAIN

function COMPUTECERTIFIEDRADIUS(σ0, r, N, pA, num_steps):
    radiuses ← linspace(num_steps)
    for R in radiuses do
        σ11 ← σ0 exp(−rR)
        σ12 ← σ0 exp(rR)
        xi_bigger ← ξ>(R, σ11)
        xi_lower ← ξ<(R, σ12)
        if max{xi_bigger, xi_lower} > 0.5 then break
    return R

D. Implementation Details

Even though our algorithm is rather simple, there are some details that should be discussed before one can safely use it in practice. First, we show the actual procedure in Algorithm 3. Note that the function ComputeCertifiedRadius is a bit more involved than depicted in Algorithm 3: we do not use a simple for-loop, but a rather efficient search method. Theoretically, this algorithm works perfectly. In practice, however, there is an issue: since σ11 and σ12 are extremely close to σ0 for small tested radiuses R, the noncentral chi-square (NCCHSQ) functions receive extremely large inputs, making the results numerically unstable. To prevent this, we use a simple trick. Since assuming a more extreme σ1 at a given distance R can only hurt us, we can avoid the numerical issues by introducing thresholds σt < σ0 and σT > σ0 and using min{σt, σ0 exp(−rR)} and max{σT, σ0 exp(rR)} as the lower and upper σ's in the evaluation. This avoids the numerical issues, because σt, σT can be chosen so that σ1 stays bounded away from σ0, keeping the arguments of the NCCHSQ functions moderate, while the certification remains correct thanks to Lemma 3.1. The drawbacks of this workaround are, first, that it reduces the certification power, since it assumes σ1's that are even more extreme than the theoretically guaranteed worst-case possibilities, and second, more importantly, that it requires some engineering to design σt and σT. It is suboptimal to use a single constant value for these thresholds, because the numerical problems occur at different ratios σ1/σ0 for different class probabilities p_A and dimensions N. This requires designing specific functions σt(p_A) and σT(p_A) for each dimension N on which we want to apply the method. For instance, we use σt(p_A) = 0.9993 + 0.001 log10(p_B) and σT(p_A) = 1/σt(p_A) for CIFAR10, while for MNIST we use σt(p_A) = 0.9988 + 0.001 log10(p_B) and σT(p_A) = 1/σt(p_A). To design such functions, one plots the function plot_real_probability_of_a_ball_with_fixed_variances_as_fcn_of_dist, which computes the ξ functions, for the particular N and several values of p_A, and checks whether it computes correctly. As an example, we provide such plots for well- and ill-behaved setups in Figure 15.

Figure 15. Well- and ill-behaved plots of the function plot_real_probability_of_a_ball_with_fixed_variances_as_fcn_of_dist, which computes the ξ functions. The parameter coding is [σ0, σ1, N, p_A].
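A compact sketch of the clamped radius search described above is given below. The callables xi_bigger(R, σ1) and xi_lower(R, σ1) are assumed to evaluate the ξ> and ξ< functions of the main text (for example via the noncentral chi-square probabilities sketched in Appendix B); they are not re-derived here, and the grid length and bounds are illustrative.

```python
import numpy as np

def compute_certified_radius(xi_bigger, xi_lower, sigma0, r,
                             num_steps=200, r_max=4.0, sigma_t=None, sigma_T=None):
    """Grid search of Algorithm 3 with the sigma_t / sigma_T numerical clamp."""
    for R in np.linspace(0.0, r_max, num_steps):
        sigma11 = sigma0 * np.exp(-r * R)
        sigma12 = sigma0 * np.exp(r * R)
        if sigma_t is not None:
            sigma11 = min(sigma_t, sigma11)   # assume a sigma_1 at least as extreme as sigma_t
        if sigma_T is not None:
            sigma12 = max(sigma_T, sigma12)   # and at least as extreme as sigma_T on the other side
        if max(xi_bigger(R, sigma11), xi_lower(R, sigma12)) > 0.5:
            return R                          # first grid point at which certification fails
    return r_max
```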
Another performance trick is to not evaluate ξ at every grid point Ri, but to evaluate only every i²-th point R_{i²} until we reach a value > 0.5, and then to search just the indices in the interval [(i−1)², i²], where i is the first iteration for which ξ>(R_{i²}, σ1) ≥ 0.5.

E. More on Experiments and Ablations

Before presenting further results, we must emphasize that our certification procedure is barely any slower than that of (Cohen et al., 2019). More specifically, given 100000 iterations of Monte-Carlo sampling, certification of one CIFAR10 sample using (Cohen et al., 2019)'s algorithm takes 15 seconds on our machine, while certification of a sample using our Algorithm 3 takes 15-20 seconds, depending on the σb, r setup. If at least one of σb and r is not small, our additional computation runs practically instantly; if both parameters are small, one evaluation can take up to 5 seconds, depending on the exact values of the parameters and on p_A. Note that this part of the certification is dimension-independent and therefore takes the same time on much higher-dimensional problems. Besides the actual certification, we have to compute σ(x) for each test example; this part is executed before the actual certification and usually takes around 1 minute on our machine on CIFAR10. All in all, even in the worst-case scenario, our method runs at most 1/3 longer than the old method on CIFAR10. On MNIST, the ratio between our runtime and the original one is higher, since MNIST is a lower-dimensional problem. However, since our part of the evaluation is practically independent of the setup (except for the values of σb and r, which can cause only a bounded slow-down), our algorithm does not add any asymptotic time complexity.

E.1. How to Choose the Hyperparameters?

Our design of the σ(x) function defined in Equation 1 uses several hyperparameters: r for the rate, m for the scaling, σb for the base sigma and k for the k-nearest neighbors. How do we choose them? The m parameter depends on our goals. We can set it so that σ(x) attains its lowest values at σb, by setting it roughly equal to the minimal average distance from the k nearest neighbors across, for instance, the training samples. Another possibility is to set it roughly equal to the average distance from the k nearest neighbors, to ensure that the average σ(x) roughly corresponds to σb. The k parameter needs to be set with two objectives in mind. It should not be too small, because then the distance from the k nearest neighbors would be too noisy; on the other hand, we do not want it too high, because then it would not change fast enough with the position of x. A suitable value can be obtained by looking at histograms of the average distances from the k nearest neighbors and choosing a k for which the histogram is sufficiently spread out while k is still not too small.

Figure 16. Comparison of certified accuracy plots for (Cohen et al., 2019) and our work. For each plot, the same base model f is used for evaluation.

Table 6. Clean accuracies for Cohen's models and our non-constant σ(x) models.
            σ = 0.12   σ = 0.25   σ = 0.50
r = 0.00    0.831      0.766      0.658
r = 0.005   0.830      0.766      0.654
r = 0.01    0.828      0.762      0.649
r = 0.015   0.826      -          -
Table 7. Standard deviations of class-wise accuracies for different levels of σ and r.
            σ = 0.12   σ = 0.25   σ = 0.50
r = 0.00    0.084      0.108      0.131
r = 0.005   0.086      0.112      0.135
r = 0.01    0.088      0.119      0.142

The r parameter needs to be chosen so that we gain a significant advantage over constant smoothing, but it cannot be too big, because otherwise the curse of dimensionality would apply. The value can be decided either by trial and error, or by plotting the certified radius assuming a linear classifier, or from Theorem 2.4, setting the rate low enough that, within the expected range of certified radiuses, the ratio σ1/σ0 cannot get anywhere near the theoretical thresholds implied by Theorem 2.4. The σb is the base σ and should be chosen according to the level of smoothing variance we want to use; more discussion on this can be found in (Cohen et al., 2019).

E.2. Comparison with Cohen et al. (2019)'s Methodology on CIFAR10

Here we compare (Cohen et al., 2019)'s evaluations for σ = 0.12, 0.25, 0.50 with our evaluations run directly on the models trained by (Cohen et al., 2019), setting σb = σ, r = 0.005, 0.01 (and additionally 0.015 for σb = σ = 0.12), k = 20 and m = 5. In this way, the levels of σ(x) used in the direct comparison rise from values roughly equal to (Cohen et al., 2019)'s constant σ to higher values. The results are depicted in Figure 16. Note that this evaluation is done on the models trained directly by (Cohen et al., 2019), so the variance of the Gaussian data augmentation is not entirely consistent with the optimal variance for non-constant σ, which should either equal σ(x) itself or be constant but on average equal to σ(x). The results are similar to those in Section 4. Note that for σb = σ = 0.50 the curse of dimensionality is most pronounced, as explained in Appendix B. Further, we provide Tables 6 and 7, which display the clean accuracies and the class-wise standard deviations. The results are, again, similar to those in Section 4.

Figure 17. Comparison of certified accuracy plots for (Cohen et al., 2019) and our work on MNIST. For each plot, the same base model f is used for evaluation. The term trr stands for train-time rate; it will be discussed later and can be ignored for now.

Table 8. Clean accuracies for Cohen's models and our non-constant σ(x) models on MNIST.
            σ = 0.12   σ = 0.25   σ = 0.50
r = 0.00    0.9913     0.9910     0.9888
r = 0.005   0.9914     0.9912     0.9885
r = 0.01    0.9914     0.9910     0.9887
r = 0.02    0.9914     0.9912     0.9876
r = 0.05    0.9914     0.9906     0.9836

Table 9. Standard deviations of class-wise accuracies for different levels of σ and r. The printed values are 100 times the actual standard deviations.
            σ = 0.12   σ = 0.25   σ = 0.50
r = 0.00    0.677      0.729      0.909
r = 0.005   0.659      0.735      0.905
r = 0.01    0.659      0.722      0.9318
r = 0.02    0.659      0.713      0.960
r = 0.05    0.715      0.796      1.159

E.3. Comparison with Cohen et al. (2019)'s Methodology on MNIST

Here we present a comparison similar to the one in Subsection E.2, but on MNIST and with models trained by us. Again, the setup is as in Section 4. We compare σ = σb = 0.12, 0.25, 0.50 with test-time rates r = 0.005, 0.01, 0.02, 0.05 and a train-time level of σ again equal to σ = σb. It is important to note that we use a different normalization constant m for MNIST: in CIFAR10 we set m = 5, while in MNIST the suitable m is 1.5. This way we ensure that the smallest σ(x) values on the test set roughly equal σb = σ.
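To make the roles of σb, r, m and k concrete, the following is a sketch of one possible k-nearest-neighbour variance function in the spirit of Equation 1. The exact functional form of Equation 1 is stated in the main text; the form assumed here, σb times an exponential of the shifted average k-NN distance, is only illustrative, although it is consistent with the way m and k are described in E.1 and above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def make_sigma_fn(train_X, sigma_b, r, m, k=20):
    """Illustrative input-dependent variance: sigma(x) = sigma_b * exp(r * (d_k(x) - m)),
    where d_k(x) is the average distance from x to its k nearest training points.
    This is an assumed form, not the exact Equation 1."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)

    def sigma(x):
        dists, _ = nn.kneighbors(np.atleast_2d(x))
        return sigma_b * np.exp(r * (dists.mean(axis=1) - m))

    return sigma

# setting m to the average k-NN distance over the training set makes the average
# sigma(x) roughly equal to sigma_b, matching the recommendation in E.1
```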
The certified accuracy plots are depicted in Figure 17. We also add the clean accuracy table and the table of class-wise clean-accuracy standard deviations (Tables 8, 9). All the results are, again, very similar to those presented in Section 4, even though the gain in certified accuracies is marginally smaller, since our evaluations run on models trained with an on-average smaller train-time data-augmentation standard deviation σb = σ.

E.4. Investigation of the Effect of Training with Input-dependent Gaussian Augmentation

It has been shown by many works that, apart from a good test-time certification method, appropriate training also plays a very important role in the final robustness of the smoothed classifier g. Already (Cohen et al., 2019) realize this and propose training with Gaussian data augmentation with constant σ. They experiment with different levels of σ during training and conclude that training with the same level of σ that will later be used at test time is usually the most suitable option. The question of the best possible training to boost certified robustness did not stay without the interest of other researchers.

Figure 18. The certified accuracies of our procedure on CIFAR10 for σb = 0.12, 0.25, 0.50, rate r = 0.01 and training rates trr = 0.0, 0.01, 0.04, 0.1.

Both (Zhai et al., 2020) and (Salman et al., 2019) try to improve the training and propose two different, yet interesting and effective, training methods. While (Zhai et al., 2020) manage to incorporate the adversarial robustness into the training loss function, thus training directly for robustness, (Salman et al., 2019) propose to use adversarial training to obtain more robust classifiers. Both (Alfarra et al., 2020) and (Chen et al., 2021) already propose training with an input-dependent σ as the variance of the Gaussian data augmentation. Both proceed similarly to the test time: to obtain the training σ(x), they optimize for the value that would maximize the certified accuracy of the training samples.

In this section, we propose and test our own training method. We again use Gaussian data augmentation with an input-dependent σ(x), but we suggest using the simple σ(x) defined in Equation 1, i.e. the same σ(x) during training as during testing (up to the parametrization, which might differ). Note that, unlike the certification, the training procedure does not require any mathematical analysis or certification: it is entirely up to us how we train the base classifier f, and the way of training does not influence the validity of the subsequent certification guarantees at test time. However, it is good to have a reasonable training procedure, because otherwise we would obtain a model that is satisfactory neither in terms of clean accuracy nor in terms of adversarial robustness. In the subsequent analysis, we evaluate and compare our certification procedure on models trained with different training parametrizations. For this particular section, we run the comparison only on the CIFAR10 dataset. For each test-time σb, r, we evaluate our method with these parameters on base models f trained with the same σb but different levels of the training rate trr. The training rate trr plays exactly the same role as the evaluation rate r, but is used exclusively during training. Note that this makes our σ(x) different during training and testing, since it is parametrized with different rates.
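For completeness, a minimal PyTorch-style sketch of one epoch of training with input-dependent Gaussian augmentation follows. Here sigma_fn stands for a per-sample variance function (e.g. a k-NN-based one parametrized with the train-time rate trr) and is assumed to return a tensor with one value per sample; the reshaping assumes image tensors, and everything apart from the per-sample noise level is standard noise-augmented training.

```python
import torch

def train_epoch_idrs(model, loader, optimizer, sigma_fn, device="cuda"):
    """One epoch of Gaussian data augmentation with per-sample noise levels."""
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        sigma = sigma_fn(x).view(-1, 1, 1, 1).to(device)      # one sigma per sample (assumed shape)
        x_noisy = x + sigma * torch.randn_like(x)             # input-dependent augmentation
        optimizer.zero_grad()
        loss = loss_fn(model(x_noisy), y)
        loss.backward()
        optimizer.step()
```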
In Figure 18 we plot CIFAR10 evaluations of our method for rate r = 0.01 and all levels of σb = 0.12, 0.25, 0.50, where each of these test-time setups is evaluated on four different levels of the train-time rate trr = 0.0, 0.01, 0.04, 0.1. From the results we judge that our training procedure works satisfactorily. It can generally outperform constant-σ training, yet the standard accuracy vs. robustness trade-off is present in some cases. If we train with a small train-time rate, the improvement in certified accuracies is not very pronounced (the apparent behavior for σb = σ = 0.50 is slightly misleading, being just a result of the variance of clean accuracy across different training runs), but we also lose almost no clean accuracy. Increasing the rate to trr = 0.04 results in much more pronounced improvements in the high certified accuracies, yet it also comes at the price of a clean accuracy drop, especially for large σ levels. An even bigger training rate, such as trr = 0.1, seems to be too big: it brings almost no improvement over trr = 0.04, yet loses a large amount of clean accuracy. These results suggest that input-dependent training with a carefully chosen training rate for σ(x) can lead to significant improvements in certifiable robustness. However, the optimal trr seems to depend on σb; therefore, for each value of σb, some effort has to be invested into finding the optimal hyperparameters.

Besides, we were also interested in whether using an input-dependent σ(x) during training influences the class-wise accuracy balance. In Table 10 we report the standard deviations of the class-wise accuracies.

Table 10. Standard deviations of class-wise accuracies for different levels of σ and trr, under constant rate r = 0.01.
             σ = 0.12   σ = 0.25   σ = 0.50
trr = 0.00   0.084      0.107      0.153
trr = 0.01   0.078      0.099      0.126
trr = 0.04   0.068      0.081      0.117
trr = 0.1    0.088      0.099      0.230

We can observe that, unlike the purely input-dependent evaluation, input-dependent training is partially capable of mitigating the effects of the shrinking. For instance, trr = 0.04 for σb = 0.12 provides an obvious improvement in establishing class-wise balance. Similarly successful are the trainings with trr = 0.01 for σb = 0.12 and both trr = 0.01, 0.04 for σb = 0.25. Also for σ = 0.50 the mitigation is present for small enough training rates. However, we must emphasize that if the training rate is too big, the disbalance between class accuracies is re-established and in some cases even magnified. Therefore, we must be careful to choose a training rate appropriate for the given σb, r.

E.5. Why Do We Not Compare with the Current State of the Art?

Since we claim one type of improvement over Cohen et al. (2019)'s model (experiment-wise) and do not develop a new state-of-the-art training method, we do not find it necessary to compare with the methods of Salman et al. (2019) and Zhai et al. (2020). It is obvious that we would outperform these methods with respect to the certified accuracy waterfalls, since they focus on the training phase. Since we do not outperform the RS of Cohen et al. (2019) in terms of clean accuracies or class-wise accuracies, we do not believe we would outperform the two more recent methods in these metrics either. Moreover, we find the comparison with Cohen et al. (2019)'s work most appropriate, since we extend the theory built by them.
E.6. Ablations

Even though our results so far might look impressive, we cannot claim that they are fully due to our particular method until we exclude the possibility that other effects play an essential role in the improvement over the work of Cohen et al. (2019). To investigate whether our particular method dominates the contribution to the performance boost, we conduct several ablation studies: first, we study the variance of our evaluations and trainings; second, we study the effect of the input-dependent test-time randomized smoothing; and third, we study the effect of the input-dependent train-time data augmentation.

E.6.1. VARIANCE OF THE EVALUATION

To find out whether there is a significant variance in the evaluation of the certified radiuses, we conduct a simple experiment: we train a single model on CIFAR10 and evaluate our method on this model for the very same setup of parameters multiple times. This way, the only stochasticity present is in the Monte-Carlo sampling, which influences the evaluation of the certified radiuses. We pick the parameters as follows: σb = 0.50, r = 0.01, trr = 0.0, since σb = 0.50 turns out to have the biggest variance in the training. The results are depicted in Figure 19. From the results, it is obvious that the variance in the evaluation phase is absolutely negligible. Therefore, there is no need to run the same evaluation setup multiple times.

Figure 19. The variance of evaluation. Parameters are σb = 0.50, r = 0.01, trr = 0.0; the evaluated model is the same for all runs. There are 7 runs on CIFAR10.

E.6.2. VARIANCE OF THE TRAINING

To estimate the variance of the training, we train several models for one specific training setup and evaluate them with the same evaluation setup (knowing that there is no variance in the evaluation phase, this is equivalent to measuring the training variance directly). We pick our usual non-constant σ(x) for the evaluation, but we train with constant-variance data augmentation. The concrete parameters we work with are σb ∈ {0.12, 0.25, 0.50}, r = 0.01, trr = 0.0, and we run 9 trainings for each of these parameter configurations. Then we run the full certification to see not only the variance in clean accuracy, but also the variance in the certified radiuses. The results are depicted in Figure 20. From the figures we see that the variance of the training is strongly σb-dependent. The most volatile clean accuracy is present for the case σb = 0.50. Fortunately, however, the biggest variability is present in the clean accuracy, and the curves seem to be less scattered in the areas of high certified radiuses. The concrete standard deviations of the clean accuracies are in Table 11. The standard deviations of the clean accuracies for the MNIST dataset and the same parameters are in Table 12. Since the differences in accuracies of the different methods are very subtle, it is hard to obtain statistically trustworthy results.

Figure 20. The certified accuracies of our procedure on CIFAR10 for σb = 0.12, 0.25, 0.50, rate r = 0.01 and training rate trr = 0.0, evaluated on 9 different trained models for each of the setups.

                         σ = 0.12   σ = 0.25   σ = 0.50
accuracy                   0.61%      0.40%      1.86%
abstention rate            0.17%      0.34%      0.59%
misclassification rate     0.60%      0.24%      1.48%
Table 11. Standard deviations of clean accuracies, abstention rates and misclassification rates for 9 runs of each parameter configuration on CIFAR10.
For instance, if 0.4% (Table 11) were the true standard deviation of the σb = 0.25 runs, we would need 16 runs to decrease the standard error of the estimated mean accuracy to 0.1%, which might be considered precise enough (the standard error scales as 1/√n with the number of runs n). To do the same in the case of σb = 0.50 on CIFAR10, we would need roughly 400 runs to decrease the standard error below 0.1%. Therefore, the results we provide in the subsequent subsections, being the average of just 8 runs, have to be taken modulo the remaining variance, which might still be considerable.

                         σ = 0.12   σ = 0.25   σ = 0.50
accuracy                  0.036%     0.042%     0.044%
abstention rate           0.037%     0.027%     0.058%
misclassification rate    0.043%     0.029%     0.021%
Table 12. Standard deviations of clean accuracies, abstention rates and misclassification rates for 8 runs of each parameter configuration on MNIST.

E.6.3. EFFECT OF INPUT-DEPENDENT EVALUATION

In this ablation study, we compare the certification method for a particular σb, r = 0.01, trr = 0.0 with the constant-σ certification method with Cσb, r = 0.0, trr = 0.0, where C is an appropriate constant. The motivation behind such an experiment is that our σ(x) is generally bigger than σb, while originally we compared this method to the constant σ = σb evaluation. Therefore, on average, samples in our method enjoy bigger values of σ(x). A natural question is whether we could not obtain the same performance boost using just the constant-σ method, with Cσb > σb set to a value which roughly corresponds to the average of σ(xi) over the test set {xi, i ∈ {1, . . . , T}}. The problem with using a bigger Cσb is that we encounter a performance drop and a more severe case of shrinking, but we need to check to what extent the performance drop is present in the input-dependent σ(x) method. Comparing the performance drops of the larger constant Cσb and the input-dependent σ(x), which is on average larger (but on average the same as Cσb), we will be able to answer to what degree the usage of the input-dependent σ(x) is really justified.

If we remind ourselves that

\sigma(x) = \sigma_b \exp\Big( r \Big( \frac{1}{k} \sum_{x_i \in N_k(x)} \|x - x_i\| - m \Big) \Big),

then we see that the constant C we are searching for is the mean (or rather the median) value of

\exp\Big( r \Big( \frac{1}{k} \sum_{x_i \in N_k(x)} \|x - x_i\| - m \Big) \Big)

over the test set. Fortunately, empirically, the mean and the median of the above expression are roughly equal for both CIFAR10 and MNIST, so we are not forced to choose between them. For r = 0.01, m = 5, we choose the rounded value C = exp(0.05) on CIFAR10. For r = 0.01, m = 1.5, as in MNIST, the constant is set to C = 1.035. In the end, the values of Cσ used in this experiment are Cσ = 0.126, 0.263, 0.53 for CIFAR10 and Cσ = 0.124, 0.258, 0.517 for MNIST. To obtain a fair comparison, though, we evaluate the input-dependent σ(x) evaluation strategy on models trained with a constant Cσb standard deviation of the Gaussian augmentation. This is because this level of σ is equal to the mean value of σ(x), and we believe that such a training data-augmentation standard deviation is more consistent with our σ(x) function.

We provide the plots of single-run evaluations of certified accuracies for CIFAR10 in Figure 21 and for MNIST in Figure 22. The models on which we evaluate differ, because for the increased constant-σ evaluations we also needed to use an increased level of the training data-augmentation variance. From the figures, it is obvious that our method is not able to outperform the constant-σ method using the same mean σ in terms of certified accuracy, not even for our strongest setting σb = 0.12.
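Returning to the choice of C above, the following is a minimal sketch of how the constant could be estimated, assuming the Equation-1 form of σ(x) restated there. The k-NN distances are computed against the training set; the value of k and the function names are ours and purely illustrative.

```python
import numpy as np

def avg_knn_distance(x, train_set, k):
    # Mean l2 distance from x to its k nearest neighbors in the training set.
    dists = np.linalg.norm(train_set - x.reshape(1, -1), axis=1)
    return np.sort(dists)[:k].mean()

def estimate_C(test_set, train_set, r, m, k=20):
    # C is the mean (or median) of sigma(x) / sigma_b = exp(r * (avg. k-NN distance - m))
    ratios = [np.exp(r * (avg_knn_distance(x, train_set, k) - m)) for x in test_set]
    return np.mean(ratios), np.median(ratios)

# On CIFAR10 with r = 0.01, m = 5, the text above rounds the resulting C to exp(0.05).
```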
The fact that we do not outperform the increased constant-σ evaluation in certified accuracy would not necessarily be bad news if we could demonstrate that our method suffers from a less pronounced accuracy drop or a less pronounced imbalance in the class-wise accuracies. To find out, we measure the average accuracies of the evaluation strategies over 8 runs each, as well as the average class-wise accuracy standard deviations over 8 runs. The results are provided in Tables 13 and 14 for CIFAR10 and in Tables 15 and 16 for MNIST.

Figure 21. The certified accuracies of our procedure on CIFAR10 for σb = 0.12, 0.25, 0.50, rate r = 0.01 and constant, yet increased Cσb training variance, compared to certified accuracies of the constant σ method for σ = σb = 0.12, 0.25, 0.50 and also σ = Cσb = 0.126, 0.265, 0.53. Evaluated on a single training.

Figure 22. The certified accuracies of our procedure on MNIST for σb = 0.12, 0.25, 0.50, rate r = 0.01 and constant, yet increased Cσb training variance, compared to certified accuracies of the constant σ method for σ = σb = 0.12, 0.25, 0.50 and also σ = Cσb = 0.124, 0.258, 0.517. Evaluated on a single training.

                          σ = 0.12   σ = 0.25   σ = 0.50
r = 0.01, trs increased     0.852      0.780      0.673
r = 0.00 classical          0.851      0.792      0.674
r = 0.00 increased          0.853      0.780      0.673
Table 13. Clean accuracies for both input-dependent and constant σ evaluation strategies on CIFAR10.

                          σ = 0.12   σ = 0.25   σ = 0.50
r = 0.01, trs increased     0.076      0.099      0.120
r = 0.00 classical          0.076      0.097      0.122
r = 0.00 increased          0.076      0.101      0.123
Table 14. Class-wise accuracy standard deviations for both input-dependent and constant σ evaluation strategies on CIFAR10.

                          σ = 0.12   σ = 0.25   σ = 0.50
r = 0.01, trs increased     0.9913     0.9905     0.9885
r = 0.00 classical          0.9914     0.9907     0.9886
r = 0.00 increased          0.9914     0.9904     0.9885
Table 15. Clean accuracies for both input-dependent and constant σ evaluation strategies on MNIST.

                          σ = 0.12   σ = 0.25   σ = 0.50
r = 0.01, trs increased    0.00757    0.00798    0.00929
r = 0.00 classical         0.00751    0.00778    0.00934
r = 0.00 increased         0.00750    0.00798    0.00925
Table 16. Class-wise accuracy standard deviations for both input-dependent and constant σ evaluation strategies on MNIST. Printed are multiples of 100 of the real values.

Figure 23. The certified radiuses on CIFAR10 of the non-constant σ(x) method with rate r = 0.01, but different training strategies. The training strategies used are input-dependent training with the same σ(x) function and constant-σ training with either the σb or the Cσb variance level. Evaluations are done from a single run.

As for CIFAR10, except for σ = σb = 0.25, the differences in accuracies between the different evaluation strategies are so small that we cannot consider them statistically significant. Even though the difference for σ = σb = 0.25 is high, it is still not possible to draw definite conclusions, especially for the difference between the input-dependent σ(x) and the increased constant Cσb evaluations. In general, it is not easy to judge whether our method possesses some advantage (or disadvantage) over the increased Cσb method in terms of clean accuracy. Similar conclusions can be drawn in the context of the shrinking phenomenon. Here the differences are also very small, but unlike in the comparison with models trained as in Cohen et al. (2019), where we evaluated our input-dependent σ(x) method on classifiers trained with an inconsistent data-augmentation variance, here we observe the general trend that our method is able to outperform the increased constant Cσb evaluation.
This is good news, and it confirms our suspicion that the bad results from Subsection E.2 could come from the train-test σ inconsistency. The results on MNIST suggest similar conclusions for the accuracy vs. robustness trade-off. Similarly, σ = σb = 0.12, 0.50 are not very telling, and for σ = σb = 0.25 the differences are still rather small (yet the standard deviation of the results should be around 0.0001, so it is rather on the edge). The conclusions for the shrinking phenomenon are a bit more pessimistic than in the case of CIFAR10. Here we do not see any improvement over the constant σ, not even over the one with the increased σ level.

E.6.4. EFFECT OF INPUT-DEPENDENT TRAINING

In this last ablation study, we compare our input-dependent data augmentation for particular σb, r and a particular training rate trr with the constant Cσb data augmentation, where the training rate trr is set to 0. The strategy for choosing the constant C is exactly the same as in the previous experiment. In particular, we compare evaluations with r = 0.01 and σb = 0.12, 0.25, 0.50 on models trained with the same σb and training rate trr = 0.01 against evaluations with the same test-time r = 0.01 and σb = 0.12, 0.25, 0.50 on models trained with training rate trr = 0.0 and the constant σ = 0.126, 0.263, 0.53 for CIFAR10 and σ = 0.124, 0.258, 0.517 for MNIST. This way, we compensate for the increased levels of σ(x) with respect to σb. We present our comparisons in Figure 23 for CIFAR10 and Figure 24 for MNIST, providing the evaluations with r = 0.01, σb = 0.12, 0.25, 0.50 on models trained with the same σb and trr = 0.0 as a reference.

The certified accuracy results for CIFAR10 and MNIST differ a bit. For CIFAR10, training with rate trr = 0.01 does not outperform the constant Cσb training. For σb = 0.12, the constant Cσb training clearly outperforms the input-dependent training. For σb = 0.25, these two training strategies seem to have almost identical performance. For σb = 0.50, the input-dependent σ(x) strategy outperforms the constant-σ ones, but we now know that this is purely due to the variance in the training. On the other hand, on MNIST we either have very similar performance or even slightly outperform the constant-σ training. Looking at the accuracy and standard deviation Tables 17, 18, 19 and 20, we can deduce the following. In terms of clean accuracy, the input-dependent training strategy performs worst in most of the cases, even though the differences in performance might not be statistically significant. We see that we would need far more evaluations to see a clear pattern.

Figure 24. The certified radiuses on MNIST of the non-constant σ(x) method with rate r = 0.01, but different training strategies. The training strategies used are input-dependent training with the same σ(x) function and constant-σ training with either the σb or the Cσb variance level. Evaluations are done from a single run.

                         σ = 0.12   σ = 0.25   σ = 0.50
trr = 0.01                 0.843      0.780      0.671
trr = 0.00 classical       0.849      0.790      0.670
trr = 0.00 increased       0.852      0.780      0.673
Table 17. Clean accuracies for both input-dependent and constant σ training strategies on CIFAR10.
                         σ = 0.12   σ = 0.25   σ = 0.50
trr = 0.01                 0.080      0.101      0.121
trr = 0.00 classical       0.080      0.105      0.135
trr = 0.00 increased       0.076      0.099      0.120
Table 18. Class-wise accuracy standard deviations for both input-dependent and constant σ training strategies on CIFAR10.

                         σ = 0.12   σ = 0.25   σ = 0.50
trr = 0.01                 0.9912     0.9910     0.9883
trr = 0.00 classical       0.9914     0.9906     0.9884
trr = 0.00 increased       0.9913     0.9905     0.9885
Table 19. Clean accuracies for both input-dependent and constant σ training strategies on MNIST.

                         σ = 0.12   σ = 0.25   σ = 0.50
trr = 0.01                0.00757    0.00800    0.00947
trr = 0.00 classical      0.00743    0.00789    0.00929
trr = 0.00 increased      0.00757    0.00798    0.00929
Table 20. Class-wise accuracy standard deviations for both input-dependent and constant σ training strategies on MNIST.

However, these results are definitely not good news for the use of the input-dependent σ(x) during training. In terms of the class-wise accuracy standard deviation, we again see conflicting results for the CIFAR10 and MNIST datasets. For CIFAR10, the input-dependent σ(x) clearly outperforms the smaller constant σb training method, particularly for σb = 0.50. However, the constant Cσb method seems to outperform even the input-dependent σ(x). For MNIST, the smaller constant σb outperforms both other methods, which are rather similar to each other. Together with the findings from the previous sections, these results suggest that the usage of this particular design of input-dependent σ(x) might not be worthwhile until a more precise evaluation is conducted. However, the combination of input-dependent test-time evaluation with constant, yet increased, train-time augmentation is possibly the strongest combination that can be achieved using an input-dependent sigma at all (especially for CIFAR10).

Lemma F.1 (Neyman-Pearson). Let X, Y be random vectors in R^N with densities x, y. Let h : R^N → {0, 1} be a random or deterministic function. Then, the following two implications hold:

1. If S = {z ∈ R^N : y(z)/x(z) ≤ t} for some t > 0 and P(h(X) = 1) ≥ P(X ∈ S), then P(h(Y) = 1) ≥ P(Y ∈ S).
2. If S = {z ∈ R^N : y(z)/x(z) ≥ t} for some t > 0 and P(h(X) = 1) ≤ P(X ∈ S), then P(h(Y) = 1) ≤ P(Y ∈ S).

Proof. See (Cohen et al., 2019).

Lemma 2.1: Out of all possible classifiers f such that Gf(x0)_B ≤ p̄_B = 1 − p_A, the one for which Gf(x0 + δ)_B is maximized is the classifier f* which predicts class B in a region determined by the likelihood ratio:

B^* = \{ x \in \mathbb{R}^N : q_1(x)/q_0(x) \ge 1/r \},

where r is fixed such that P0(B*) = p̄_B. Note that we use B* to denote both the class and the region of that class.

Proof. Let f be an arbitrary classifier. To invoke the Neyman-Pearson Lemma F.1, define h ≡ f (with the only difference that h maps to {0, 1} instead of {A, B}). Moreover, let S ≡ B* and X ∼ N(x0, σ0²I), Y ∼ N(x0 + δ, σ1²I). Let also f* classify S as B. Then obviously P(X ∈ S) = P0(B*) = p̄_B. Since Gf(x0)_B ≤ p̄_B, we have P(h(X) = 1) ≤ p̄_B. Using directly the second part of the Neyman-Pearson Lemma F.1, this yields P(Y ∈ S) ≥ P(h(Y) = 1). Rewritten in the words of our setup, Gf*(x0 + δ)_B ≥ Gf(x0 + δ)_B.

Theorem 2.2: If σ0 > σ1, then B* is an N-dimensional ball with center S_> and radius R_>:

S_> = x_0 + \frac{\sigma_0^2}{\sigma_0^2 - \sigma_1^2}\,\delta, \qquad R_> = \sqrt{ \frac{\sigma_0^2\sigma_1^2}{(\sigma_0^2 - \sigma_1^2)^2}\,\|\delta\|^2 + \frac{2N\sigma_0^2\sigma_1^2}{\sigma_0^2 - \sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_0^2\sigma_1^2}{\sigma_0^2 - \sigma_1^2}\log r }.

If σ0 < σ1, then B* is the complement of an N-dimensional ball with center S_< and radius R_<:

S_< = x_0 - \frac{\sigma_0^2}{\sigma_1^2 - \sigma_0^2}\,\delta, \qquad R_< = \sqrt{ \frac{2\sigma_0^4 - \sigma_0^2\sigma_1^2}{(\sigma_1^2 - \sigma_0^2)^2}\,\|\delta\|^2 + \frac{2N\sigma_0^2\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\log\frac{\sigma_1}{\sigma_0} - \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\log r }.

Proof.
From the spherical symmetry of the isotropic multivariate normal distribution it follows that, without loss of generality, we can take x0 = 0 and δ = (a, 0, . . . , 0). With a slight abuse of notation, let a refer to the vector (a, 0, . . . , 0) as well as to the scalar a. With this, B* is the set of all x for which q1(x)/q0(x) ≥ 1/r, i.e.

\frac{1}{(2\pi)^{N/2}\sigma_0^N}\exp\Big(-\frac{\sum_{i=1}^N x_i^2}{2\sigma_0^2}\Big) \le \frac{r}{(2\pi)^{N/2}\sigma_1^N}\exp\Big(-\frac{\sum_{i=2}^N x_i^2 + (x_1-a)^2}{2\sigma_1^2}\Big).

Taking logarithms and multiplying by 2\sigma_0^2\sigma_1^2, this is equivalent to

(\sigma_0^2-\sigma_1^2)\sum_{i=2}^N x_i^2 + (\sigma_0^2-\sigma_1^2)x_1^2 - 2\sigma_0^2 a x_1 + \sigma_0^2 a^2 \le 2N\sigma_0^2\sigma_1^2\log\frac{\sigma_0}{\sigma_1} + 2\sigma_0^2\sigma_1^2\log r.

Now assume σ0 > σ1 and continue:

\sum_{i=2}^N x_i^2 + x_1^2 - \frac{2\sigma_0^2}{\sigma_0^2-\sigma_1^2} a x_1 + \frac{\sigma_0^2}{\sigma_0^2-\sigma_1^2} a^2 \le \frac{2N\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log r,

\Big(x_1 - \frac{\sigma_0^2}{\sigma_0^2-\sigma_1^2}a\Big)^2 + \sum_{i=2}^N x_i^2 \le \frac{\sigma_0^2\sigma_1^2}{(\sigma_0^2-\sigma_1^2)^2}a^2 + \frac{2N\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log r.

Such an inequality defines exactly the ball from the statement of the theorem. On the other hand, if σ0 < σ1:

(\sigma_1^2-\sigma_0^2)\sum_{i=2}^N x_i^2 + (\sigma_1^2-\sigma_0^2)x_1^2 + 2\sigma_0^2 a x_1 - \sigma_0^2 a^2 \ge 2N\sigma_0^2\sigma_1^2\log\frac{\sigma_1}{\sigma_0} - 2\sigma_0^2\sigma_1^2\log r,

\sum_{i=2}^N x_i^2 + x_1^2 + \frac{2\sigma_0^2}{\sigma_1^2-\sigma_0^2} a x_1 - \frac{\sigma_0^2}{\sigma_1^2-\sigma_0^2} a^2 \ge \frac{2N\sigma_0^2\sigma_1^2}{\sigma_1^2-\sigma_0^2}\log\frac{\sigma_1}{\sigma_0} - \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2-\sigma_0^2}\log r,

\Big(x_1 + \frac{\sigma_0^2}{\sigma_1^2-\sigma_0^2}a\Big)^2 + \sum_{i=2}^N x_i^2 \ge \frac{2\sigma_0^4-\sigma_0^2\sigma_1^2}{(\sigma_1^2-\sigma_0^2)^2}a^2 + \frac{2N\sigma_0^2\sigma_1^2}{\sigma_1^2-\sigma_0^2}\log\frac{\sigma_1}{\sigma_0} - \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2-\sigma_0^2}\log r.

This is exactly the complement of the ball from the second part of the statement of the theorem.

Theorem 2.3:

P_0(B^*) = \chi^2_N\Big(\frac{\sigma_0^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2,\; \frac{R_{<,>}^2}{\sigma_0^2}\Big), \qquad P_1(B^*) = \chi^2_N\Big(\frac{\sigma_1^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2,\; \frac{R_{<,>}^2}{\sigma_1^2}\Big),

where χ²_N(λ, x) denotes the CDF of the noncentral chi-squared distribution (NCCHSQ) with N degrees of freedom and noncentrality λ, and the sign < or > is chosen according to the inequality between σ0 and σ1 (for σ0 < σ1 the probabilities of B* are the complements of the stated expressions).

Proof. Assume first σ0 > σ1. Let us shift the coordinates such that x0 + (σ0²/(σ0² − σ1²))δ maps to 0. Now x0 has coordinates −(σ0²/(σ0² − σ1²))δ. Assume X ∼ N(−(σ0²/(σ0² − σ1²))δ, σ0²I). To obtain

P_0(B^*) = P(X \in B^*) = P\Big(\|X\|^2 < \frac{\sigma_0^2\sigma_1^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2 + \frac{2N\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_0^2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log r\Big),

we could almost use the NCCHSQ, but we do not have the correct scaling of the variance of X. However, for any regular square matrix Q it holds that P(X ∈ B*) = P(QX ∈ QB*), where QB* is interpreted as the image of the set. Therefore, if we choose Q = (1/σ0)I, we get

P_0(B^*) = P\Big(\|X/\sigma_0\|^2 < \frac{\sigma_1^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2 + \frac{2N\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log r\Big).

Now, since X/σ0 ∼ N(−(σ0/(σ0² − σ1²))δ, I), we can use the definition of the NCCHSQ to obtain the final

P_0(B^*) = \chi^2_N\Big(\frac{\sigma_0^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2,\; \frac{\sigma_1^2}{(\sigma_0^2-\sigma_1^2)^2}\|\delta\|^2 + \frac{2N\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log\frac{\sigma_0}{\sigma_1} + \frac{2\sigma_1^2}{\sigma_0^2-\sigma_1^2}\log r\Big),

whose second argument is exactly R_>²/σ0². To obtain P1(B*), we perform a similar calculation, but we need to compute the offset:

\Big(x_0 + \frac{\sigma_0^2}{\sigma_0^2-\sigma_1^2}\delta\Big) - (x_0 + \delta) = \frac{\sigma_1^2}{\sigma_0^2-\sigma_1^2}\delta.

Thus, after shifting the coordinates in the same way, the alternative X̃ will be distributed as X̃ ∼ N(−(σ1²/(σ0² − σ1²))δ, σ1²I). Now the same idea as before yields the required formula. In the case σ0 < σ1, we do practically the same thing, but we have to keep in mind that B* is not a ball but the complement of a ball, and therefore we obtain 1 − χ²_N(·, ·) in the formulas.

Lemma F.2. The functions ξ_>(a), ξ_<(a) are continuous on the whole of R_+. In particular, they are continuous at 0.

Proof. Assume for simplicity σ0 > σ1, fix x0 (whose position is irrelevant), and fix x1 such that ||x0 − x1|| = a_m, where a_m is the point at which we prove the continuity. Note that χ²_N(λ, x) can be interpreted (as we have seen) as the probability of an offset ball, with the radius and the offset corresponding to x and λ. Assume we have a sequence {a_i}_{i=1}^∞ with a_i → a_m. Define x_i to be a point lying on the line defined by x0, x1 s.t. ||x0 − x_i|| = a_i.
Define B_{a_i} to be the worst-case ball corresponding to a_i, x_i. Without loss of generality, we can assume that all the balls are open. We have already seen from Theorem 2.2 that the centers of the B_i converge to the center of B_m. Define X_i = 1(B_i), X_m = 1(B_m). First we need to prove that the radiuses r_i of the balls converge. Assume for contradiction that the r_i do not converge to r_m. Without loss of generality, let r_s = lim sup_{i→∞} r_i > r_m. If r_s = ∞, it is trivial to see that P0(B_i) ≠ p̄_B for some i for which r_i is too big. If r_s < ∞, consider {a_{i_k}}_{k=0}^∞ to be the subsequence along which r_s is monotonically attained. Define B_s to be the ball with the limiting center and radius r_s, and X_s = 1(B_s). Then X_{i_k} → X_s almost surely for k → ∞ and, from the dominated convergence theorem, P0(B_{i_k}) → P0(B_s). However, P0(B_s) > P0(B_m) = p̄_B, which is a contradiction, since obviously P0(B_{i_k}) = p̄_B for every k.

Since σ1 is fixed, the probability measures P_{1i} corresponding to N(x_i, σ1²I) are actually the same probability measure up to a shift. Therefore, P_{1i}(B_i) can be treated as P_2(B̃_i), where P_2 is simply the measure corresponding to N(0, σ1²I) and B̃_i is B_i shifted accordingly s.t. P_{1i}(B_i) = P_2(B̃_i) (and assume, without loss of generality, that for each i they are shifted such that their centers lie on a fixed line). Now, since the positions of both the centers of the balls and of x1 are continuous w.r.t. a, as can be seen from Theorem 2.2 (and the radiuses are still r_i and converge), we see that also 1(B̃_i) → 1(B̃_m) almost surely. Now we can simply use the dominated convergence theorem with P_2 to obtain P_2(B̃_i) → P_2(B̃_m), and thus P_{1i}(B_i) → P_{1m}(B_m), which is what we wanted to prove. Note that the proof for the case σ0 < σ1 is fully analogous, but instead of class-B probabilities we work with class-A probabilities, so that we still work with balls and not with the less convenient complements.

Lemma F.3. If λ1 > λ2 ≥ 0, then χ²_N(λ1², x²) ≤ χ²_N(λ2², x²).

Proof. Let us fix N(0, I) with the corresponding measure P and density f. By symmetry, the NCCHSQ, defined as the distribution of ||X||² for an offset normal distribution, can equally well be defined as the distribution of ||X − s||² under the centered normal distribution. Define B1 to be the ball with center at (λ1, 0, . . . , 0) and radius x, and B2 the ball with center at (λ2, 0, . . . , 0) and radius x. Denote by C(B) the center of a ball B. From the definition of the NCCHSQ it now follows that P(B_i) = χ²_N(λ_i², x²), i ∈ {1, 2}. Therefore, it suffices to show P(B1) ≤ P(B2). Define D1 = B1 \ B2 and D2 = B2 \ B1. Then we know:

P(B_1) \le P(B_2) \iff \int_{D_1} f(z)\,dz \le \int_{D_2} f(z)\,dz.

Let S = (C(B1) + C(B2))/2. Define the central symmetry M with center S. Let z1 ∈ D1. Then z1 can be decomposed as z1 = C(B1) + d with ||d|| ≤ x, and z2 := M(z1) = C(B2) − d by symmetry. This way we see that D1 = M(D2) and D2 = M(D1) under the bijection M, which does not distort the geometry and distances of the Euclidean space. Therefore, it suffices to show:

∀ z ∈ D2 : f(z) ≥ f(M(z)), where M(z) ∈ D1.

From the monotonicity of f(y) w.r.t. ||y|| it actually suffices to show ||z|| ≤ ||M(z)|| for all z ∈ D2. Fix some z ∈ D2. Since M is a central symmetry, z and M(z) can be written as z = S + p, M(z) = S − p, where p is some vector. Now we can write:

\|z\|^2 = \|S\|^2 + \|p\|^2 + 2\|S\|\,\|p\|\cos(\alpha),

where α is the angle between S and p. On the other hand:

\|M(z)\|^2 = \|S\|^2 + \|p\|^2 + 2\|S\|\,\|p\|\cos(\pi - \alpha).

It is obvious from these equations that ||z|| ≤ ||M(z)|| ⟺ α ≥ π/2 ⟺ pᵀS ≤ 0. Here, the crucial observation is that D1 and D2 are separated by a hyperplane perpendicular to S (as a vector), such that S (as a point) lies in this hyperplane. From this it follows:

y ∈ D2 ⟹ yᵀS ≤ ||S||², and y ∈ D1 ⟹ yᵀS ≥ ||S||².
Now, since z = S + p and z ∈ D2, this implies ||S||² ≥ zᵀS = ||S||² + pᵀS and thus pᵀS ≤ 0.

Lemma F.4. The functions ξ_>(a), ξ_<(a) are non-decreasing in a.

Proof. First assume σ0 > σ1 and analyse ξ_>(a). From Lemma F.2, we can without loss of generality assume that a > 0, since ξ_>(0) is simply the limit for a → 0 and cannot change the monotonicity status. Now fix a > 0 and define x0, x1 s.t. ||x0 − x1|| = a. Denote by B_a the worst-case ball corresponding to x0, x1, and let P1 be as usual. Choose ε < (σ1²/(σ0² − σ1²)) a, which is the distance between x1 and C(B_a), as can be seen from Theorem 2.2. Now assume x2 lies on the line defined by x0, x1 with ||x0 − x2|| = a + ε. Let B_{a+ε} be the corresponding worst-case ball and P2 as usual. First observe that P2(B_a) ≥ P1(B_a), since ||x1 − C(B_a)|| > ||x2 − C(B_a)||. Here we use that χ²_N(λ1, x) ≤ χ²_N(λ2, x) if λ1 > λ2, which is proved in Lemma F.3. Second, note that since B_{a+ε} is the worst-case ball for x2, it follows that P2(B_{a+ε}) ≥ P2(B_a). Thus, P1(B_a) ≤ P2(B_{a+ε}), which is exactly ξ_>(a) ≤ ξ_>(a + ε). To prove ξ_>(a) ≤ ξ_>(a + ε) also for ε ≥ (σ1²/(σ0² − σ1²)) a, it suffices to consider a finite sequence of points a_i starting at a and ending at a + ε that are close enough to each other, such that the respective ε_i encoding the shifts a_i → a_{i+1} satisfy ε_i < (σ1²/(σ0² − σ1²)) a_i.

Now assume σ0 < σ1 and analyse ξ_<(a). The proof is similar, but we have to be a bit careful about some details. Again, fix a > 0, define x0, x1 accordingly and all other objects as before, except that now we denote by A_a = B_a^C and A_{a+ε} = B_{a+ε}^C the class-A balls which are the complements of the anti-balls B. Again, P2(A_a) ≤ P1(A_a), since ||x1 − C(A_a)|| < ||x2 − C(A_a)||; we again used the monotonicity from Lemma F.3. Moreover, P2(A_{a+ε}) ≤ P2(A_a), since B_{a+ε} is the worst-case set for x2. Therefore, P1(A_a) ≥ P2(A_{a+ε}), and after reverting to the B's it follows that ξ_<(a) ≤ ξ_<(a + ε). Here we do not even need to care about the size of ε, since the centers of the A's lie on the opposite half-line from x0 than x1 and x2.

To prove the main theorem, we need a simple bound on the median of the (noncentral) chi-squared distribution, shown in (Robert, 1990) in a more general form.

Lemma F.5. For all c ≥ 0,

N - 1 + c \le \chi^2_{N,qf}(c, 0.5) \le \chi^2_{N,qf}(0, 0.5) + c,

where χ²_{N,qf}(c, ·) denotes the quantile function of the noncentral chi-squared distribution with N degrees of freedom and noncentrality c.

Proof. See (Robert, 1990).

Theorem 2.4 (the curse of dimensionality): Let x0, x1, p_A, σ0, σ1, N be as usual. Then, the following two implications hold:

1. If σ0 > σ1 and

\log\frac{\sigma_1^2}{\sigma_0^2} + 1 - \frac{\sigma_1^2}{\sigma_0^2} < \frac{2\log(1 - p_A)}{N},

then x1 is not certified w.r.t. x0.

2. If σ0 < σ1 and

\log\Big(\frac{\sigma_1^2}{\sigma_0^2}\cdot\frac{N-1}{N}\Big) + 1 - \frac{\sigma_1^2}{\sigma_0^2}\cdot\frac{N-1}{N} < \frac{2\log(1 - p_A)}{N},

then x1 is not certified w.r.t. x0.

Proof. We first prove the first statement, thus assume σ0 > σ1. Then P1(B*) = ξ_>(||x0 − x1||). From the monotonicity of ξ_> shown in Lemma F.4, we know that ξ_>(||x0 − x1||) ≥ ξ_>(0). We will show ξ_>(0) > 0.5; since the certification requires the worst-case probability of class B at x1 to stay below 0.5, this already rules the certificate out. Using the definition of ξ_> and plugging in a = 0, we have:

\xi_>(0) = \chi^2_N\Big(\frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{N,qf}(1 - p_A)\Big).

Note that here we work with the central chi-squared CDF and quantile function. In order to show ξ_>(0) > 0.5, it suffices to show (σ0²/σ1²) χ²_{N,qf}(1 − p_A) ≥ N, because it is well known that the median of the central chi-squared distribution is smaller than its mean, which is N; i.e., from the strict monotonicity of the CDF we then get χ²_N(N) > 0.5 and hence ξ_>(0) > 0.5. To show the above inequality, we use the Chernoff bound for the chi-squared distribution, which states the following: if 0 < z < 1, then χ²_N(zN) ≤ (z exp(1 − z))^{N/2}. Putting z = σ1²/σ0², the Chernoff bound gives

\chi^2_N\Big(\frac{\sigma_1^2}{\sigma_0^2}N\Big) \le \Big(\frac{\sigma_1^2}{\sigma_0^2}\exp\Big(1 - \frac{\sigma_1^2}{\sigma_0^2}\Big)\Big)^{N/2} < 1 - p_A.

The last inequality is required to hold.
If it holds, then necessarily χ²_{N,qf}(1 − p_A) > (σ1²/σ0²) N and thus (σ0²/σ1²) χ²_{N,qf}(1 − p_A) > N. Manipulating the required inequality, we get exactly

\log\frac{\sigma_1^2}{\sigma_0^2} + 1 - \frac{\sigma_1^2}{\sigma_0^2} < \frac{2\log(1 - p_A)}{N},

which is the assumption of statement 1.

Similarly, we also prove statement 2. Assume σ0 < σ1. As in part 1, we just prove ξ_<(0) > 0.5 using the Chernoff bound. This time, however, we have

\xi_<(0) = 1 - \chi^2_N\Big(\frac{\sigma_0^2}{\sigma_1^2}\,\chi^2_{N,qf}(p_A)\Big),

i.e., we need to prove (σ0²/σ1²) χ²_{N,qf}(p_A) < N − 1. The second part of the Chernoff bound states: if z > 1, then χ²_N(zN) ≥ 1 − (z exp(1 − z))^{N/2}. Let us choose z = (σ1²/σ0²)(N − 1)/N. Then the Chernoff bound yields

\chi^2_N\Big(\frac{\sigma_1^2}{\sigma_0^2}(N-1)\Big) \ge 1 - \Big(\frac{\sigma_1^2}{\sigma_0^2}\cdot\frac{N-1}{N}\exp\Big(1 - \frac{\sigma_1^2}{\sigma_0^2}\cdot\frac{N-1}{N}\Big)\Big)^{N/2} > p_A,

where the last inequality is required to hold and is equivalent to the assumption of statement 2. If it holds, then (σ1²/σ0²)(N − 1) > χ²_{N,qf}(p_A), i.e., (σ0²/σ1²) χ²_{N,qf}(p_A) < N − 1. Now, using Lemma F.5 for the easy case of the central chi-squared distribution, we see that χ²_N(N − 1) ≤ 0.5 and thus χ²_N((σ0²/σ1²) χ²_{N,qf}(p_A)) < 0.5, i.e. ξ_<(0) > 0.5, which is what we wanted to prove.

Corollary 2.5 (one-sided simpler bound): Let x0, x1, p_A, σ0, σ1, N be as usual and assume now σ0 > σ1. Then, if

1 - \frac{\sigma_1^2}{\sigma_0^2} > 2\sqrt{\frac{-\log(1 - p_A)}{N}},

then x1 is not certified w.r.t. x0.

Proof. We simply prove that this condition implies the assumption of the first part of Theorem 2.4,

\log\frac{\sigma_1^2}{\sigma_0^2} + 1 - \frac{\sigma_1^2}{\sigma_0^2} < \frac{2\log(1 - p_A)}{N}.

Consider the expression log(1 − y) + y for 0 < y < 1. From the Taylor series it is apparent that log(1 − y) + y < −y²/2. Therefore, if −y²/2 < 2 log(1 − p_A)/N, then also log(1 − y) + y < 2 log(1 − p_A)/N. Solving the first inequality for y, we get the sufficient condition y > 2√(−log(1 − p_A)/N). Plugging y = 1 − σ1²/σ0² in, we get

1 - \frac{\sigma_1^2}{\sigma_0^2} > 2\sqrt{\frac{-\log(1 - p_A)}{N}},

which is exactly the inequality from the corollary statement.

Theorem 3.1: Let x0, x1, p_A, σ0 be as usual and let ||x0 − x1|| = R. Then, the following two statements hold:

1. Let σ1 ≤ σ0. Then, for all σ2 with σ1 ≤ σ2 ≤ σ0: if ξ_>(R, σ2) > 0.5, then ξ_>(R, σ1) > 0.5.
2. Let σ1 ≥ σ0. Then, for all σ2 with σ0 ≤ σ2 ≤ σ1: if ξ_<(R, σ2) > 0.5, then ξ_<(R, σ1) > 0.5.

Proof. We first prove the first statement. Denote, as usual in the proofs, by B_i the worst-case ball for σ_i and by P_i the probability measure associated with N(x1, σ_i²I). Since ξ_>(R, σ2) > 0.5 and since it is essentially P2(B2), we see that the probability of a ball under a normal distribution is bigger than one half. This is obviously possible only if x1 ∈ B2. From the fact that B1 is the worst-case ball for σ1 we see that ξ_>(R, σ1) = P1(B1) ≥ P1(B2). It thus suffices to show P1(B2) ≥ P2(B2). This follows since σ1 ≤ σ2 and x1 ∈ B2. We know that we can rescale the space such that

P_1(B_2) = P_2\Big(\frac{\sigma_2}{\sigma_1}(B_2 - x_1) + x_1\Big),

using the fact that σ just scales the normal distribution. The set (σ2/σ1)(B2 − x1) + x1 is just the image of B2 under the homothety with center x1 and ratio σ2/σ1. So it suffices to prove

\frac{\sigma_2}{\sigma_1}(B_2 - x_1) + x_1 \supseteq B_2.

However, this follows from the convexity of a ball: if, namely, x1 + z ∈ B2, then from convexity also x1 + (σ1/σ2) z ∈ B2, and this point maps back to x1 + z under the homothety; thus x1 + z is in the image. Applying the monotonicity of the measure, we obtain the result.

Now we prove the second statement and, as usual, let A1, A2, P1, P2 be as before (A is now the ball connected to class A). As always, P1(A1) ≤ P1(A2), so it suffices to show that P2(A2) < 0.5 implies P1(A2) < 0.5. Now we need to distinguish two cases. If x1 ∈ A2, the proof is completely analogous to the first part, but now reasoning on the A's rather than the B's; in this case we even get the stronger P1(A2) ≤ P2(A2), just like in the first part. If x1 ∉ A2, then indeed P2(A2) < 0.5, but it is also obvious that P1(A2) < 0.5, since a ball that does not contain the mean of the Gaussian has probability smaller than one half. This finishes the proof.
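As a numerical sanity check of the closed forms above, the worst-case probabilities of Theorem 2.3 can be evaluated with SciPy's noncentral chi-squared distribution; this is also the kind of routine one would need to evaluate the certified radius of Theorem 3.2 below. The sketch assumes σ0 > σ1 and uses the reconstructed R_> expression; the helper names are ours and purely illustrative.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def xi_greater(delta_norm, sigma0, sigma1, pB_bar, N):
    """Worst-case probability of class B at distance ||delta|| (sigma0 > sigma1),
    i.e. P1(B*) from Theorem 2.3 with the ball calibrated so that P0(B*) = pB_bar."""
    M = sigma0**2 - sigma1**2
    c = 2 * sigma0**2 * sigma1**2 / M          # common factor in R_>^2
    base = (sigma0**2 * sigma1**2 / M**2) * delta_norm**2 + c * N * np.log(sigma0 / sigma1)

    def R2(log_r):                             # squared radius of the worst-case ball
        return base + c * log_r

    def P0(log_r):                             # P0(B*) as a function of log(r)
        if R2(log_r) <= 0:
            return 0.0
        nc0 = (sigma0**2 / M**2) * delta_norm**2
        return ncx2.cdf(R2(log_r) / sigma0**2, df=N, nc=nc0)

    log_r = brentq(lambda t: P0(t) - pB_bar, -1e6, 1e6)   # calibrate P0(B*) = pB_bar
    nc1 = (sigma1**2 / M**2) * delta_norm**2
    return ncx2.cdf(R2(log_r) / sigma1**2, df=N, nc=nc1)

# Example with p_A = 0.9 (so pB_bar = 0.1), N = 3072, sigma0 = 0.50, sigma1 = 0.45:
print(xi_greater(0.5, 0.50, 0.45, 0.1, N=3072))
# For delta -> 0 this approaches the xi_>(0) expression from the proof of Theorem 2.4:
print(chi2.cdf((0.50 / 0.45)**2 * chi2.ppf(0.1, 3072), 3072))
```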
Theorem 3.2: Let σ(x) be an r-semi-elastic function and let x0, p_A, N, σ0 be as usual. Then, the certified radius at x0 guaranteed by our method is

CR(x_0) = \max\big\{0,\ \sup\{R \ge 0 :\ \xi_>(R, \sigma_0\exp(-rR)) < 0.5\ \text{and}\ \xi_<(R, \sigma_0\exp(rR)) < 0.5\}\big\}.

Proof. This follows easily from Theorem 3.1.

Lemma F.6. Let f(x) = \sum_{i=1}^M f_i(x)\,1(x \in R_i), where {R_i}_{i=1}^M is a finite set of regions that partition R^N and {f_i}_{i=1}^M, f_i : R_i → R, is a finite set of 1-Lipschitz continuous (1-LC) functions. Moreover, assume that f(x) is continuous. Then f(x) is 1-LC.

Proof. We do not assume any nice behaviour of the decision regions, which can make the situation quite ugly; for instance, the regions might not even be measurable. However, this will not be a problem for us. Fix x1, x2. Consider the line segment S = {x1 + α(x2 − x1), α ∈ [0, 1]}. Instead of points in S, let us work with numbers in [0, 1] via the α-encoding. Consider the following coloring C of [0, 1]: each point a ∈ [0, 1] is assigned one of M colors according to the region to which x1 + a(x2 − x1) belongs. Define d1 = 0 and d2 = sup{z ∈ [0, 1] : C(z) = C(d1)}. Thus, d2 is the supremum of all numbers colored with the same color as 0. Then

|f(x_1 + d_2(x_2 - x_1)) - f(x_1 + d_1(x_2 - x_1))| \le (d_2 - d_1)\,\|x_2 - x_1\|.

Why? Let {z_j}_{j=1}^∞ be a non-decreasing sequence s.t. C(z_j) = C(0) and z_j → d2. Since f is continuous, obviously f(x1 + d2(x2 − x1)) = lim_{j→∞} f(x1 + z_j(x2 − x1)). Now, since the norm and the absolute value are both continuous functions and since, from the 1-LC of f_{C(0)} on R_{C(0)}, we have

\forall j \in \mathbb{N} : |f(x_1 + z_j(x_2 - x_1)) - f(x_1 + d_1(x_2 - x_1))| \le (z_j - d_1)\,\|x_2 - x_1\|,

we also necessarily have |f(x1 + d2(x2 − x1)) − f(x1 + d1(x2 − x1))| ≤ (d2 − d1) ||x2 − x1||. If d2 = 1, we finish the construction. If not, we distinguish two cases. First assume C(d2) = C(d1). In this case, take some color C' for which there exists a non-increasing sequence {z'_j}_{j=1}^∞ with C(z'_j) = C', z'_j → d2 and z'_j > d2 for all j ∈ N, and fix one such sequence. Obviously C' ≠ C(0), since d2 is an upper bound on the points of color C(0). Then define d3 = sup{z ∈ [0, 1] : C(z) = C'} and also a non-decreasing sequence {z''_j}_{j=1}^∞ with C(z''_j) = C' and z''_j → d3. From the continuity of f we again have f(x1 + d2(x2 − x1)) = lim_{j→∞} f(x1 + z'_j(x2 − x1)) and similarly f(x1 + d3(x2 − x1)) = lim_{j→∞} f(x1 + z''_j(x2 − x1)). Again, from the continuity of the absolute value and the norm and from the 1-LC of all the partial functions, we have

\forall j \in \mathbb{N} : |f(x_1 + z''_j(x_2 - x_1)) - f(x_1 + z'_j(x_2 - x_1))| \le (z''_j - z'_j)\,\|x_2 - x_1\|

and therefore |f(x1 + d3(x2 − x1)) − f(x1 + d2(x2 − x1))| ≤ (d3 − d2) ||x2 − x1||.

Now assume C(d2) ≠ C(d1). Then we can take as C' directly C(d2) and proceed in the same way as in the last paragraph (note that this case could have implicitly come up in the previous construction too, but there we would need to avoid taking C' = C(0), and we find this case distinction more elegant). If d3 = 1, we finish the construction. If not, we continue in exactly the same manner as before. Since the number of colors M is finite, we run out of colors in a finite number of steps, and thus eventually there will be an index l s.t. d_l = 1. The final 1-LC is now trivially obtained as follows:

|f(x_2) - f(x_1)| = \Big|\sum_{i=1}^{l-1} \big(f(x_1 + d_{i+1}(x_2 - x_1)) - f(x_1 + d_i(x_2 - x_1))\big)\Big| \le \sum_{i=1}^{l-1} |f(x_1 + d_{i+1}(x_2 - x_1)) - f(x_1 + d_i(x_2 - x_1))| \le \sum_{i=1}^{l-1} (d_{i+1} - d_i)\,\|x_2 - x_1\| = \|x_2 - x_1\|.

Theorem 4.1: The σ(x) defined in Equation 1 is r-semi-elastic.

Proof. Our aim is to prove that

\log(\sigma(x)) = \log(\sigma_b) + r\Big(\frac{1}{k}\sum_{x_i \in N_k(x)} \|x - x_i\| - m\Big)

is r-Lipschitz continuous. Obviously, this depends neither on log(σ_b) nor on rm, so we focus just on (r/k) Σ_{x_i ∈ N_k(x)} ||x − x_i||. Obviously, this function is r-Lipschitz continuous if and only if (1/k) Σ_{x_i ∈ N_k(x)} ||x − x_i|| is 1-LC. Let us fix y ∈ R^N. We first prove that x ↦ ||x − y|| is 1-LC. Let us fix x1, x2.
From the triangle inequality we have | ||x1 − y|| − ||x2 − y|| | ≤ ||x1 − x2||, which is exactly what we wanted to prove. Now fix y1, y2, . . . , yk and x1, x2. Then

\Big|\frac{1}{k}\sum_{i=1}^k \|x_1 - y_i\| - \frac{1}{k}\sum_{i=1}^k \|x_2 - y_i\|\Big| \le \frac{1}{k}\sum_{i=1}^k \big|\|x_1 - y_i\| - \|x_2 - y_i\|\big| \le \frac{1}{k}\sum_{i=1}^k \|x_1 - x_2\| = \|x_1 - x_2\|,

so for a fixed set of neighbors the average distance is 1-LC. Finally, note that using the k nearest neighbors out of a finite training dataset divides R^N into a finite number of regions, where each region is defined by the set of k nearest neighbors of the points x in that region. Note also that the average distance to the k nearest neighbors is obviously continuous. Then, using Lemma F.6, the claim follows.
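For illustration, the following is a minimal numerical probe of the statement of Theorem 4.1: it evaluates an Equation-1-style variance function on random data and checks the r-semi-elasticity property |log σ(x) − log σ(y)| ≤ r ||x − y||. The data are synthetic and the helper names and the choice of k are ours.

```python
import numpy as np

def sigma_of_x(x, train_set, sigma_b, r, m, k):
    # Equation-1-style variance: sigma(x) = sigma_b * exp(r * (mean distance to k NN - m)).
    dists = np.linalg.norm(train_set - x.reshape(1, -1), axis=1)
    avg_knn = np.sort(dists)[:k].mean()
    return sigma_b * np.exp(r * (avg_knn - m))

# Numerical probe of r-semi-elasticity on random data (Theorem 4.1):
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 32))
r, sigma_b, m, k = 0.01, 0.25, 5.0, 20
x, y = rng.normal(size=32), rng.normal(size=32)
lhs = abs(np.log(sigma_of_x(x, train, sigma_b, r, m, k)) - np.log(sigma_of_x(y, train, sigma_b, r, m, k)))
print(lhs <= r * np.linalg.norm(x - y))  # should print True
```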