# Double Sampling Randomized Smoothing

Linyi Li¹, Jiawei Zhang¹, Tao Xie², Bo Li¹

¹University of Illinois Urbana-Champaign, Illinois, USA. ²Peking University, Beijing, China. Correspondence to: Linyi Li.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noise from a certain distribution to certify the robustness of the resulting smoothed classifier. However, as previous work shows, the certified robust radius in randomized smoothing suffers from scaling to large datasets (the "curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the original smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify a Θ(√d) robust radius under the ℓ2 norm, where d is the input dimension, which implies that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computational method based on customized dual optimization that accounts for sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.

1. Introduction

[Figure 1: flowchart. Standard certification: given the model and smoothing distribution P, get PA from sampling, construct the worst-case classifier with PA, and derive the certified radius r_NP. Our DSRS certification: additionally smooth the model with another distribution Q, get QA from sampling, construct the worst-case classifier with both PA and QA, and derive a tighter certified radius r_DSRS > r_NP.]

Figure 1. Upper: Standard certification for randomized smoothing leverages information from only one distribution (PA) to compute robustness certification. Lower: DSRS leverages information from two distributions (PA and QA) to compute certification for the smoothed classifier, yielding a significantly larger certified radius.

Neural networks (NNs) have achieved great advances on a wide range of tasks, but have been shown vulnerable against adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Eykholt et al., 2018; Wang et al., 2021; Qiu et al., 2020; Li et al., 2020a; Zhang et al., 2021). A plethora of empirical defenses have been proposed to improve robustness; however, most of them are again broken by strong adversaries (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020). Recently, there have been great efforts in developing certified defenses for NNs under certain adversarial constraints (Wong & Kolter, 2018; Raghunathan et al., 2018; Li et al., 2020b; Xu et al., 2020; Li et al., 2019b). Randomized smoothing (Cohen et al., 2019; Li et al., 2019a) has emerged as a popular technique to provide certified robustness on large-scale datasets. Concretely, it samples noise from a certain smoothing distribution to construct a smoothed classifier, and then certifies the robust radius of that smoothed classifier.
Compared to other techniques (Wong & Kolter, 2018; Mirman et al., 2018; Gowal et al., 2019; Zhang et al., 2020b), randomized smoothing is efficient and agnostic to the model, and is applicable to a wide range of ML models, including large ResNets (He et al., 2016) on the ImageNet dataset. To improve the certified robust radius, existing studies (Cohen et al., 2019; Lee et al., 2019; Li et al., 2019a; Yang et al., 2020) have explored different smoothing distributions. However, the improvement is limited. For example, the ℓ2 certified robust radius does not increase on large datasets even though the input dimension d increases (Cohen et al., 2019), resulting in a low ℓ∞ certified radius on large datasets, which has been theoretically shown to be an intrinsic barrier of randomized smoothing (the "curse of dimensionality" or "ℓ∞ barrier") (Yang et al., 2020; Blum et al., 2020; Kumar et al., 2020b; Hayes, 2020; Wu et al., 2021). Given these challenges towards tight robustness certification, natural questions arise: Q1: Is it possible to circumvent the barrier of randomized smoothing by certifying with additional information? Q2: What type of information is needed to provide tight robustness certification?

To answer these questions, we propose the Double Sampling Randomized Smoothing (DSRS) framework, which leverages sampled noise from an additional smoothing distribution as additional information to tighten the robustness certification. In theory, we show that (1) ideally, if the decision region of the base classifier is known, DSRS can provide tight robustness certification; (2) more practically, if the inputs that can be correctly classified by the base classifier satisfy a concentration property within an input-centered ball with constant mass under the standard Gaussian measure, the standard Neyman-Pearson-based certification (Li et al., 2019a; Cohen et al., 2019; Salman et al., 2019; Yang et al., 2020) can only certify a dimension-independent ℓ2 radius, whereas DSRS with generalized Gaussian smoothing can certify a radius Ω(√d) under the ℓ2 norm, which increases with the dimension d, leading to tighter certification. Under more general conditions, we provide numerical simulations to verify our theories. Our results provide a positive answer to Q1 and sufficient conditions for Q2, i.e., DSRS may be able to circumvent the barrier of randomized smoothing.

Motivated by the theory, we leverage a type of generalized Gaussian (Zhang et al., 2020a) as the smoothing distribution and a truncated generalized Gaussian as the additional distribution. For this concretization, we propose an efficient and sound computational method to compute the certified robust radius for practical classifiers while accounting for sampling error. Our method formulates the certification problem given additional information as a constrained optimization problem and leverages specific properties of the dual problem to decompose the effects of different dual variables to solve it. DSRS is fully scalable since the computational time is nearly independent of the size of the dataset, model, or sampling. Our extensive experimental evaluation on MNIST, CIFAR-10, and ImageNet shows that (1) under a large sampling size (2×10⁵ to 8×10⁵), the certified radius of DSRS consistently increases as our theory suggests; (2) under a practical sampling size (10⁵), DSRS certifies consistently larger robust radii than existing baselines, including standard Neyman-Pearson-based certification.
As further discussed in Appendix L, we believe that DSRS, as a framework, opens a wide range of future directions for selecting or optimizing different forms of additional information to tighten the certification of randomized smoothing. We summarize the main technical contributions as follows:

- We propose a general robustness certification framework, DSRS, which leverages additional information obtained by sampling from another smoothing distribution.
- We prove that under practical concentration assumptions, DSRS certifies an Ω(√d) radius under the ℓ2 norm, with d the input dimension, which suggests a possible way to circumvent the intrinsic barrier of randomized smoothing.
- We concretize DSRS with generalized Gaussian smoothing mechanisms and propose a method to efficiently compute the certified radius for given classifiers.
- We conduct extensive experiments, showing that DSRS provides consistently tighter robustness certification than existing baselines, including standard Neyman-Pearson-based certification, across different models on MNIST, CIFAR-10, and ImageNet.

Related Work. For the certification method of randomized smoothing, most existing methods leverage only the true-class prediction probability to certify. In this case, the tightest possible robustness certification is based on the Neyman-Pearson lemma (Neyman & Pearson, 1933), as first proposed by Cohen et al. (2019) for certifying the ℓ2 radius under Gaussian smoothing. Several methods extend this certification to accommodate different smoothing distributions and different ℓp norms (Dvijotham et al., 2020; Yang et al., 2020; Zhang et al., 2020a; Levine & Feizi, 2021). In randomized smoothing, the ℓ2 certified robust radius r is similar across datasets of different scales, which translates to a vanishing ℓ∞ certified radius r/√d as the input dimension increases. This limitation of existing certification methods for randomized smoothing has been formally proved (Yang et al., 2020; Blum et al., 2020; Kumar et al., 2020b; Hayes, 2020; Wu et al., 2021) and is named the "ℓ∞ barrier" or "curse of dimensionality". Recent work tries to incorporate additional information besides the true-class prediction probability to tighten the certification and bypass the barrier. For ℓ2 and ℓ∞ certification, to the best of our knowledge, gradient magnitude is the only exploited additional information (Mohapatra et al., 2020; Levine et al., 2020). However, in practice, the improvement is relatively marginal and requires a large number of samples (see Appendix J.5). Some other methods provide tighter certification given specific model structures (Kumar et al., 2020a; Chiang et al., 2020; Lee et al., 2019; Awasthi et al., 2020). DSRS instead focuses on leveraging model-structure-agnostic additional information. More discussion on related work is in Appendix K.

2. Preliminaries and Background

Define [C] := {1, . . . , C}. Let ∆^C be the C-dimensional probability simplex. We consider a multiclass classification model F : ℝ^d → [C] as the base classifier, where d is the input dimension and the model outputs a hard-label class prediction within [C]. The original smoothing distribution P and the additional smoothing distribution Q are both supported on ℝ^d. We let p(·) and q(·) be their density functions, respectively. We assume that both p and q are positive and differentiable almost everywhere, i.e., the set of singular points has zero measure under either P or Q.
These assumptions hold for common smoothing distributions used in the literature, such as the Gaussian distribution (Lécuyer et al., 2019; Li et al., 2019a; Cohen et al., 2019; Yang et al., 2020). Randomized smoothing constructs a smoothed classifier from a given base classifier by adding input noise following the original smoothing distribution P. For input x ∈ ℝ^d, we define the prediction probability under P by the function f^P : ℝ^d → ∆^C:

f^P(x)_c := Pr_{ϵ∼P}[F(x + ϵ) = c], where c ∈ [C].  (1)

The smoothed classifier F̃^P : ℝ^d → [C] (or F̃ when P is clear from the context) predicts the class with the highest confidence after smoothing with P:

F̃^P(x) := argmax_{c∈[C]} f^P(x)_c.  (2)

We focus on robustness certification against ℓp-bounded perturbations for the smoothed classifier F̃, where the standard certification method is the Neyman-Pearson-based certification (Cohen et al., 2019) (details in Appendix A; the certified radius from it is denoted by r_NP). Concretely, certification methods compute the robust radius r defined as below.

Definition 1 (Certified Robust Radius). Under the ℓp norm (p ∈ ℝ+ ∪ {+∞}), for a given smoothed classifier F̃^P and input x0 ∈ ℝ^d with true label y0 ∈ [C], a radius r ≥ 0 is called a certified (robust) radius for F̃^P if F̃^P always predicts y0 for any input within the r-radius ball centered at x0:

∀δ ∈ ℝ^d, ‖δ‖_p < r ⟹ F̃^P(x0 + δ) = y0.  (3)

3. DSRS Overview

We propose Double Sampling Randomized Smoothing (DSRS), which leverages the prediction probability from an additional smoothing distribution Q (formally, QA := f^Q(x0)_{y0} = Pr_{ϵ∼Q}[F(x0 + ϵ) = y0]), along with the prediction probability from the original smoothing distribution P (formally, PA := f^P(x0)_{y0} as in Eqn. (1), which is also used in Neyman-Pearson-based certification), to provide robustness certification for the P-smoothed classifier F̃^P. Note that both PA and QA can be obtained from Monte-Carlo sampling (see Sections 5.1 and 5.2). Formally, we let r_DSRS denote the tightest possible certified radius given the prediction probability from Q; then r_DSRS can be defined as below.

Definition 2 (r_DSRS). Given PA and QA,

r_DSRS := max r s.t. ∀F : ℝ^d → [C] with f^P(x0)_{y0} = PA and f^Q(x0)_{y0} = QA, ∀x, ‖x − x0‖_p < r ⟹ F̃^P(x) = y0.  (4)

Intuitively, r_DSRS is the maximum possible radius such that any smoothed classifier constructed from a base classifier satisfying the PA and QA constraints cannot predict another label when the perturbation magnitude is within the radius. In Section 4, we will analyze the theoretical properties of DSRS, including comparing r_DSRS and r_NP under the concentration assumption. Computing r_DSRS is nontrivial, so in Section 5, we will introduce a practical computational method that exactly solves r_DSRS when P and Q are standard and generalized (truncated) Gaussians. In Appendix G, we will show method variants that deal with other forms of P and Q distributions. In Appendix L, we will further generalize the DSRS framework.
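As noted above, PA and QA are obtained via Monte-Carlo sampling. The following is a minimal sketch of such an estimator; the callables `base_classifier` and `sample_noise` are hypothetical stand-ins for the trained model and a sampler for P or Q.

```python
import numpy as np

def estimate_probability(base_classifier, x0, y0, sample_noise,
                         n_samples=100_000, batch=1_000):
    """Monte-Carlo estimate of Pr_{eps ~ D}[F(x0 + eps) = y0] for a noise sampler D.

    base_classifier(inputs) -> hard-label predictions for a batch of inputs;
    sample_noise(n) -> n noise vectors with the same shape as x0 (a 1-D vector here).
    """
    total = (n_samples // batch) * batch
    hits = 0
    for _ in range(n_samples // batch):
        noise = sample_noise(batch)                   # draws from P or Q
        preds = base_classifier(x0[None, :] + noise)  # hard-label predictions
        hits += int(np.sum(preds == y0))
    return hits / total

# Point estimates as used in Section 3 (Section 5.2 replaces them with
# confidence-interval endpoints):
#   PA = estimate_probability(F0, x0, y0, sample_from_P)
#   QA = estimate_probability(F0, x0, y0, sample_from_Q)
```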
Smoothing Distributions. Now we formally define the smoothing distributions used in DSRS. We mainly consider the standard Gaussian N (Cohen et al., 2019; Yang et al., 2020) and the generalized Gaussian N^g (Zhang et al., 2020a). We let N(σ) denote the standard Gaussian distribution with covariance matrix σ²I_d, whose density function is proportional to exp(−‖ϵ‖²₂/(2σ²)).¹ For k ∈ ℕ, we let N^g(k, σ) denote the generalized Gaussian whose density function is proportional to ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)), where σ' = √(d/(d−2k)) σ. Here we use σ' instead of σ to ensure that the expected noise magnitude √(E‖ϵ‖²₂) of N^g(k, σ) is the same as that of N(σ). The generalized Gaussian as the smoothing distribution overcomes the thin-shell problem of the standard Gaussian and improves certified robustness (Zhang et al., 2020a); we will reveal more of its theoretical advantages in Section 4. As the additional smoothing distribution Q, we mainly consider truncated distributions within a small ℓ2-radius ball. Specifically, the truncated standard Gaussian is denoted by N_trunc(T, σ), with density function proportional to exp(−‖ϵ‖²₂/(2σ²)) · 𝟙[‖ϵ‖₂ ≤ T]; and the truncated generalized Gaussian is denoted by N^g_trunc(k, T, σ), with density function proportional to ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) · 𝟙[‖ϵ‖₂ ≤ T]. In Table 1, we summarize these distribution definitions.

Table 1. Definitions of smoothing distributions in this paper. In the table, k ∈ ℕ and σ' = √(d/(d−2k)) σ.

| Name | Notation | Density Function |
| --- | --- | --- |
| Standard Gaussian | N(σ) | ∝ exp(−‖ϵ‖²₂/(2σ²)) |
| Generalized Gaussian | N^g(k, σ) | ∝ ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) |
| Truncated Standard Gaussian | N_trunc(T, σ) | ∝ exp(−‖ϵ‖²₂/(2σ²)) · 𝟙[‖ϵ‖₂ ≤ T] |
| Truncated Generalized Gaussian | N^g_trunc(k, T, σ) | ∝ ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) · 𝟙[‖ϵ‖₂ ≤ T] |

¹In this paper, N(σ) is shorthand for N(0, σ²I_d).
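The generalized Gaussian in Table 1 can be sampled radially: its squared norm satisfies ‖ϵ‖²₂ = 2σ'²t with t ∼ Γ(d/2 − k, 1) (the same decomposition used later in Theorem 5), and its direction is uniform on the unit sphere. Below is a sketch based on that observation; the function names are ours, and the truncated variant uses simple rejection.

```python
import numpy as np

def sample_generalized_gaussian(n, d, k, sigma, rng):
    """Draw n noise vectors from N^g(k, sigma) in R^d.

    Radial part: ||eps||_2^2 = 2 * sigma'^2 * t with t ~ Gamma(d/2 - k, 1);
    directional part: uniform on the unit sphere.
    """
    sigma_p = np.sqrt(d / (d - 2 * k)) * sigma           # sigma' in Table 1
    t = rng.gamma(shape=d / 2 - k, scale=1.0, size=n)
    radius = sigma_p * np.sqrt(2 * t)
    direction = rng.standard_normal((n, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    return radius[:, None] * direction

def sample_truncated_generalized_gaussian(n, d, k, T, sigma, rng):
    """Draw from N^g_trunc(k, T, sigma) by rejecting samples with ||eps||_2 > T."""
    kept = []
    while sum(len(b) for b in kept) < n:
        batch = sample_generalized_gaussian(n, d, k, sigma, rng)
        kept.append(batch[np.linalg.norm(batch, axis=1) <= T])
    return np.concatenate(kept)[:n]

# Example: ImageNet-like dimension with k = d/2 - 4 (the setting used in Section 6).
# rng = np.random.default_rng(0)
# eps = sample_generalized_gaussian(1000, 3 * 224 * 224, 3 * 224 * 224 // 2 - 4, 0.5, rng)
```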
4. Theoretical Analysis of DSRS

In this section, we theoretically analyze DSRS to answer the following core question: does QA, the prediction probability under the additional smoothing distribution, provide sufficient information for tightening the robustness certification? We first show that if the support of Q is the decision region of the true class, DSRS can certify the smoothed classifier's maximum possible robust radius. Then, under a concentration assumption, we show that the ℓ2 certified radius of DSRS can be Ω(√d), which is asymptotically optimal for bounded inputs. Finally, under more general conditions, we conduct both numerical simulations and real-data experiments to verify that the certified radius of DSRS increases with the data dimension d. These analyses provide a positive answer to the above core question.

DSRS can certify the tightest possible robust radius. Given an original smoothing distribution P and a base classifier F0, at input point x0 ∈ ℝ^d with true label y0, we define the tightest possible certified robust radius r_tight to be the largest ℓp ball that contains no adversarial example for the smoothed classifier F̃^P_0:

r_tight := max r s.t. ∀δ ∈ ℝ^d, ‖δ‖_p < r ⟹ F̃^P_0(x0 + δ) = y0.

Then, for binary classification, if we choose an additional smoothing distribution Q whose support is the decision region or its complement, DSRS can certify the robust radius r_tight.

Theorem 1. Suppose the original smoothing distribution P has non-zero density everywhere, i.e., p(·) > 0. For binary classification with base classifier F0, at point x0 ∈ ℝ^d, let Q be an additional distribution that satisfies: (1) its support is the decision region of an arbitrary class c ∈ [C] shifted by x0: supp(Q) = {x − x0 : F0(x) = c}; and (2) for any x ∈ supp(Q), 0 < q(x)/p(x) < +∞. Then, plugging PA = f^P_0(x0)_c and QA = f^Q_0(x0)_c (see Eqn. (1)) into Definition 2, we have r_DSRS = r_tight under any ℓp (p ≥ 1).

Proof sketch. We defer the proof to Appendix F.1. At a high level, with this type of Q, we have QA = 1 or QA = 0. Then, from the mass of Q's support under P and the value of PA, we can conclude that Q's support is exactly the decision region of label c or its complement. Thus, the DSRS constraints (in Eqn. (4)) are satisfied iff F differs from F0 on a zero-measure set, and thus we exactly compute the smoothed classifier F̃^P_0's maximum certified robust radius in DSRS. An extension to the multiclass setting is in Appendix F.2.

Remark. For any base classifier F0, a Q that satisfies the conditions in Theorem 1 exists, which implies that with DSRS, certifying a strictly tight robust radius is possible. In contrast, Neyman-Pearson-based certification is proved to certify a tight robust radius for linear base classifiers (Cohen et al., 2019, Section 3.1), but for arbitrary base classifiers, its tightness is not guaranteed. This result suggests that, to certify a tight radius, just one additional smoothing distribution Q is sufficient rather than multiple ones. On the other hand, it is challenging to find a Q whose support (or its complement) exactly matches the decision region of an NN classifier. In the following, we analyze the tightness of DSRS under weaker assumptions.

DSRS can certify an Ω(√d) ℓ2 radius under the concentration assumption. We begin by defining the concentration property.

Definition 3 ((σ, Pcon)-Concentration). Given a base classifier F0 and input x0 ∈ ℝ^d with true label y0, we say F0 satisfies the (σ, Pcon)-concentration property if, within the Pcon-percentile of smallest-ℓ2-magnitude noise from the Gaussian N(σ), adversarial examples occupy zero measure. Formally, (σ, Pcon)-concentration means

Pr_{ϵ∼N(σ)}[F0(x0 + ϵ) = y0 | ‖ϵ‖₂ ≤ T] = 1,  (5a)

where T satisfies

Pr_{ϵ∼N(σ)}[‖ϵ‖₂ ≤ T] = Pcon.  (5b)

Intuitively, (σ, Pcon)-concentration implies that the base classifier has few adversarial examples under small-magnitude noise during standard Gaussian smoothing. In Figure 4 (in Appendix B), we empirically verify that a well-trained base classifier on ImageNet may satisfy this property for a significant portion of inputs. Furthermore, Salman et al. (2019) show that promoting this concentration property by adversarially training the smoothed classifier improves the certified robustness. With this concentration property, DSRS certifies a radius Ω(√d) under the ℓ2 norm, as the following theorem shows.

Theorem 2. Let d be the input dimension and F0 be the base classifier. For an input point x0 ∈ ℝ^d with true class y0, suppose F0 satisfies the (σ, Pcon)-concentration property. Then, for any sufficiently large d, for the classifier F̃^P_0 smoothed by the generalized Gaussian P = N^g(k, σ) with d/2 − 15 ≤ k < d/2, DSRS with additional smoothing distribution Q = N^g_trunc(k, T, σ) can certify an ℓ2 radius r_DSRS ≥ 0.02σ√d, where T = σ√(2 ΓCDF⁻¹_{d/2}(Pcon)) and ΓCDF_{d/2} is the CDF of the gamma distribution Γ(d/2, 1).

Proof sketch. We defer the proof to Appendix F.3. At a high level, based on a property of the standard Gaussian distribution (Proposition F.1), we find that QA = 1 under the concentration property (Lemma F.2). With QA = 1, we derive a lower bound on r_DSRS in Lemma F.3. We then use: (1) the concentration of the beta distribution Beta((d−1)/2, (d−1)/2) for large d (see Lemma F.4); (2) the relative concentration of the gamma distribution Γ(d/2, 1) around its mean for large d (see Proposition F.5 and the resulting Fact F.7); and (3) the misalignment of the mean and median of the gamma distribution Γ(d/2 − k, 1) for small (d/2 − k) (see Proposition F.6) to lower bound the quantity in Lemma F.3 and show that it is larger than or equal to 0.5. Then, using the conclusion in Section 5, we conclude that r_DSRS ≥ 0.02σ√d.
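The truncation threshold T in Theorem 2 has a direct numerical form: since ‖ϵ‖²₂ = 2σ²g with g ∼ Γ(d/2, 1) for ϵ ∼ N(σ), T is the σ-scaled gamma quantile. A small sketch (helper name ours):

```python
from scipy.stats import gamma

def concentration_threshold(d, sigma, p_con):
    """T such that Pr_{eps ~ N(sigma)}[||eps||_2 <= T] = p_con (as in Theorem 2).

    Uses ||eps||_2^2 = 2 * sigma^2 * g with g ~ Gamma(d/2, 1).
    """
    return sigma * (2.0 * gamma.ppf(p_con, a=d / 2)) ** 0.5

# e.g., an ImageNet-scale input (d = 3 * 224 * 224) with sigma = 0.5, p_con = 0.5:
# T = concentration_threshold(3 * 224 * 224, 0.5, 0.5)   # roughly sigma * sqrt(d)
```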
Remark. (1) For standard Neyman-Pearson-based certification, r_NP = σΦ⁻¹(f^P(x0)_{y0}). As the input dimension d increases, achieving a growing ℓ2 certified radius requires the prediction probability of the true class under P, f^P(x0)_{y0}, to grow simultaneously, which is challenging. Indeed, across different datasets, f^P(x0)_{y0} is almost a constant, which leads to a constant ℓ2 certified radius and a shrinking ℓ∞ radius for large d. We further empirically illustrate this property in Appendix C. (2) In contrast, as long as the model satisfies the concentration property, which may be almost true on large datasets as reflected by Figure 4, with our specific choices of P and Q, DSRS can achieve an Ω(σ√d) ℓ2 radius on large datasets. This rate translates to a constant Ω(σ) ℓ∞ radius on large datasets and thus breaks the curse of dimensionality of randomized smoothing. We remark that this √d rate is optimal when the dataset input is bounded, as for images (otherwise, the ω(1) ℓ∞ radius would exceed the constant ℓ∞ diameter for large d). Therefore, under the assumption of the concentration property, DSRS provides asymptotically optimal certification for randomized smoothing. (3) Smoothing with the generalized Gaussian distribution and choosing a parameter k that is close to d/2 play an essential role in proving the Ω(σ√d) certified radius. Otherwise, Theorem 6 in Appendix F.4 shows that no certification method can certify an ℓ2 radius of c√d for any c > 0. This adds further theoretical evidence for the superiority of the generalized Gaussian, which is cross-validated by Zhang et al. (2020a).

DSRS certifies a tighter radius under general scenarios. When the concentration property does not hold exactly, a rigorous theoretical analysis becomes challenging, since the impact of a finite dual variable needs to be taken into account. This dual variable appears inside a Lambert W function, where typical approximation bounds are too loose to provide non-trivial convergence rates. Thus, we leverage the numerical computational method introduced in Section 5 to provide numerical simulations and real-data experiments. We generalize the concentration assumption by changing the holding probability in Eqn. (5a) from 1 to α^{1/N}, which corresponds to the (1 − α)-confidence lower bound of QA given N Monte-Carlo samples, where we set α = 0.1% following the convention (Cohen et al., 2019).
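The α^{1/N} value above is the one-sided (1 − α) Clopper-Pearson lower bound for the case where all N sampled noises are classified correctly. A small sketch (helper name ours) of the general bound, with that special case checked:

```python
from scipy.stats import beta

def lower_confidence_bound(successes, n, alpha):
    """One-sided (1 - alpha) Clopper-Pearson lower bound on a Bernoulli parameter."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, n - successes + 1)

# Special case used above: all N samples are classified correctly, so the
# (1 - alpha)-confidence lower bound on QA reduces to alpha ** (1 / N).
N, alpha = 100_000, 0.001
assert abs(lower_confidence_bound(N, N, alpha) - alpha ** (1 / N)) < 1e-6
```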
In this scenario, we compare DSRS certification with Neyman-Pearson certification numerically in Figure 2 (numerical simulations in Figure 2(a) and ImageNet experiments in Figure 2(b)). In Figure 2(a), we assume (σ, Pcon)-concentration with σ = 1, Pcon = 0.5, and different sampling numbers N. We further assume PA = f^P(x0)_{y0} = 0.6 as the true-class prediction probability under P. In Figure 2(b), we take the model weights trained by Salman et al. (2019) on ImageNet and apply generalized Gaussian smoothing with d/2 − k = 4 and σ = 0.50. We uniformly pick 100 samples from the test set and compute the (1 − α)-confidence certified radius for each sample. We report the certified accuracy under different ℓ2 radii r, i.e., the fraction of samples that are certifiably correctly classified by the smoothed classifier.

[Figure 2(a): certified radius (y-axis) versus input dimension d (x-axis, 10² to 10⁶) with PA = 0.6 and (1, 0.5)-concentration; curves compare Neyman-Pearson with standard Gaussian, DSRS with N ∈ {10², 10³, 10⁴, 10⁵, 10⁶, 10⁷} at α = 0.1%, and DSRS with deterministic information (ideal); the MNIST, CIFAR-10, and ImageNet input dimensions are marked.]

(a) When the holding probability in Eqn. (5a) is obtained from sampling N times with confidence level 1 − α = 99.9%: relation between the certified radius (y-axis) and the input data dimension d (x-axis). Different curves correspond to different N.

[Figure 2(b): certified accuracy (y-axis) versus certified radius r (x-axis, 0.0 to 2.0) on ImageNet for the smoothadv model from Salman et al. (2019) with σ = 0.50; curves compare Neyman-Pearson with N ∈ {50000, 100000, 200000} and DSRS with N = 50000 + {50000, 100000, 200000, 400000, 800000}.]

(b) Relation between the certified radius (x-axis) and certified accuracy (y-axis) on ImageNet models. Different curves correspond to Neyman-Pearson and DSRS certification with different N. Sampling error is considered; confidence level = 99.9%.

Figure 2. Tendency of the DSRS certified robust radius considering sampling error. In both (a) and (b), the DSRS certified radius grows with the sampling number N, while the Neyman-Pearson radius is almost fixed.

[Figure 3: flowchart. DSRS certification as a constrained optimization (§5.1) → strong duality → determine PA and QA from confidence intervals (§5.2) → dual problem solving via binary search and numerical integration (§5.3).]

Figure 3. Overview of the DSRS computational method.

Remark. When the sampling error and confidence interval come into play, they quickly suppress the Ω(√d) growth rate of the DSRS certified radius (blue curve), as shown in Figure 2(a). Nonetheless, DSRS still certifies a larger radius than the standard Neyman-Pearson method, and increasing the sampling number further enlarges the gap. We consider another relaxed version of the concentration property in Appendix D, where DSRS still provides significantly tighter robustness certification than Neyman-Pearson.

5. DSRS Computational Method

The theoretical analysis in Section 4 implies that the additional smoothing distribution Q helps to tighten the robustness certification significantly over standard Neyman-Pearson-based certification. In this section, we propose an efficient computational method to compute this tight certified robust radius r_DSRS (see Definition 2) when P is a generalized Gaussian and Q is a truncated version of P, as suggested by Theorems 2 and 6. Compared with the classical certification for randomized smoothing or its variants (cf. (Kumar et al., 2020a)), incorporating additional information raises a major challenge: the Neyman-Pearson lemma (1933) can no longer serve as the foundation of the certification algorithm due to its inability to handle the additional information. Thus, we propose a novel DSRS computational method by formalizing robustness certification as a constrained optimization problem and proving its strong duality (§5.1). Then, we propose an efficient algorithm to solve this specific dual optimization problem considering sampling error. The detailed algorithm can be found in Alg. 2 in Appendix E.1: 1) we first perform a binary search on the certified radius r to determine the maximum radius that we can certify (see the sketch after this paragraph); 2) for the current r, we determine the smoothed prediction confidences PA and QA from the confidence intervals of predicting the true class (§5.2); 3) then, for the current r, we solve the dual problem by a quick binary search for the dual variables λ1 and λ2 (see Eqn. (10)) along with numerical integration (§5.3). To guarantee the soundness of numerical-integration-based certification, we take the maximum possible error into account during the binary search. We will discuss further extensions in §5.4.
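Step 1) above is a monotone binary search over the radius. A schematic sketch follows; `certify_at_radius` is a hypothetical stand-in for steps 2)-3) (interval selection and dual solving), and the search range and precision are arbitrary choices of ours.

```python
def max_certified_radius(certify_at_radius, r_max=4.0, precision=1e-3):
    """Binary search for the largest radius r such that certify_at_radius(r) holds.

    certify_at_radius(r) should return True iff C_delta(PA, QA) > 0.5 for all
    ||delta||_2 = r (steps 2-3 of the DSRS procedure), and is assumed monotone:
    certifiable at r implies certifiable at any r' < r.
    """
    lo, hi = 0.0, r_max
    if not certify_at_radius(lo):
        return 0.0          # clean input is not even certified; abstain
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if certify_at_radius(mid):
            lo = mid
        else:
            hi = mid
    return lo               # a sound under-approximation of the certified radius
```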
5.1. DSRS as Constrained Optimization

We first formulate the robustness certification as a constrained optimization problem and then show several foundational properties of the problem. Following the notation of Definition 2, from the given base classifier F0, we can use Monte-Carlo sampling to obtain

PA = f^P_0(x0)_{y0}, QA = f^Q_0(x0)_{y0}.  (7)

In §5.2 we will discuss how to handle confidence intervals of PA and QA. For now, we assume PA and QA are fixed. Given a perturbation vector δ ∈ ℝ^d, to test whether the smoothed classifier F̃^P_0 still predicts the true label y0, we only need to check whether the prediction probability f^P_0(x0 + δ)_{y0} > 0.5. This can be formulated as a constrained optimization problem (C):

minimize_f E_{ϵ∼P}[f(ϵ + δ)]  (8a)
s.t. E_{ϵ∼P}[f(ϵ)] = PA, E_{ϵ∼Q}[f(ϵ)] = QA,  (8b)
0 ≤ f(ϵ) ≤ 1 ∀ϵ ∈ ℝ^d.  (8c)

Remark. (C) seeks the minimum possible f^P(x0 + δ)_{y0} given the constraints from Eqn. (7). Concretely, we let f represent whether the base classifier predicts label y0: f(·) = 𝟙[F(· + x0) = y0], and accordingly impose f ∈ [0, 1] in Eqn. (8c). Then, Eqn. (8a) and Eqn. (8b) unfold f^P(x0 + δ)_{y0}, f^P(x0)_{y0}, and f^Q(x0)_{y0}, respectively, and impose the constraints from Eqn. (7).

We let Cδ(PA, QA) denote the optimal value of Eqn. (8) when it is feasible. Thus, under norm p, to certify robustness within radius r, we only need to check whether

∀δ, ‖δ‖_p < r: Cδ(PA, QA) > 0.5.  (9)

This formulation yields the tightest robustness certification given information from P and Q under the binary setting. Under the multiclass setting, there are efforts towards tighter certification by comparing against the maximum probability over the other classes instead of 0.5 in Eqn. (9) (Dvijotham et al., 2020). To save sampling cost and to follow the convention (Cohen et al., 2019; Yang et al., 2020; Jeong & Shin, 2020; Zhai et al., 2020), we mainly consider the > 0.5 criterion for the multiclass setting, and the extension to the other form is straightforward. Since our choices of P and Q (standard/generalized (truncated) Gaussian) are isotropic and centered at the origin, when certifying radius r, for ℓ2 certification we only need to test Cδ(PA, QA) > 0.5 with δ = (r, 0, . . . , 0)^T; and for ℓ∞ we only need to divide the ℓ2 radius by √d. This trick can also be extended to the ℓ1 case (Zhang et al., 2020a).

Directly solving (C) is challenging. Thus, we construct the Lagrangian dual problem (D):

maximize_{λ1, λ2 ∈ ℝ} Pr_{ϵ∼P}[p(ϵ) < λ1 p(ϵ + δ) + λ2 q(ϵ + δ)]  (10a)
s.t. Pr_{ϵ∼P}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)] = PA,
Pr_{ϵ∼Q}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)] = QA.  (10b)

In Eqn. (10), p(·) and q(·) are the density functions of the distributions P and Q, respectively. We let Dδ(PA, QA) denote the optimal objective value of Eqn. (10a) when it is feasible.

Theorem 3. For given δ ∈ ℝ^d, PA, and QA, if (C) and (D) are both feasible, then Cδ(PA, QA) = Dδ(PA, QA).

The theorem states the strong duality between (C) and (D). We defer the proof to Appendix G.1, which is based on the min-max inequality and the feasibility condition of (D). Intuitively, we can view (C), a functional optimization over f, as a linear programming (LP) problem over an infinite number of variables {f(x) : x ∈ ℝ^d}, so that strong duality holds, which guarantees the tightness of DSRS in the primal space.
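To make the LP view concrete, here is a toy Monte-Carlo discretization of (C) in two dimensions with a standard Gaussian P and a truncated Gaussian Q. This is only an illustrative sanity check (the constants, sample size, and the "base classifier" region `f_star` are arbitrary choices of ours), not the paper's certification routine, which solves the dual in closed form (§5.3); it illustrates that the optimum of (C) lower-bounds the shifted probability of any classifier consistent with the same PA and QA.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, sigma, T, N = 2, 1.0, 1.0, 4000
delta = np.array([0.5, 0.0])

x = rng.normal(scale=sigma, size=(N, d))                  # samples from P
in_T = (np.linalg.norm(x, axis=1) <= T)                   # support of Q
p_con = in_T.mean()                                       # Pr_P[||x||_2 <= T]
# Importance weight p(x - delta) / p(x), so E_P[f(eps + delta)] ~ mean(w * f).
w = np.exp((2 * x @ delta - delta @ delta) / (2 * sigma ** 2))

f_star = (x[:, 0] <= 0.8).astype(float)                   # some "base classifier"
PA, QA = f_star.mean(), (f_star * in_T).mean() / p_con    # its PA and QA

res = linprog(c=w / N,                                    # objective (8a)
              A_eq=np.vstack([np.ones(N) / N,             # constraint E_P[f] = PA
                              in_T / p_con / N]),         # constraint E_Q[f] = QA
              b_eq=[PA, QA], bounds=(0, 1), method="highs")
print("worst-case shifted probability :", res.fun)
print("shifted probability of f_star  :", float(np.mean(w * f_star)))
assert res.fun <= np.mean(w * f_star) + 1e-6              # (C) is a valid lower bound
```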
5.2. Dealing with Confidence Intervals

It is practically intractable to know the exact PA and QA in Eqn. (7) by querying the model's prediction only a finite number of times. The common practice is to use Monte-Carlo sampling, which gives confidence intervals for PA and QA with a predefined confidence level 1 − α. Suppose we have confidence intervals [P̲A, P̄A] and [Q̲A, Q̄A]. To derive a sound certification, we need to certify that for any PA ∈ [P̲A, P̄A] and any QA ∈ [Q̲A, Q̄A], Cδ(PA, QA) > 0.5. Given the infinite number of possible PA and QA, the brute-force method is intractable. Here, without computing Cδ, we show how to solve

(P*A, Q*A) = argmin_{P'A ∈ [P̲A, P̄A], Q'A ∈ [Q̲A, Q̄A]} Cδ(P'A, Q'A).  (11)

If the solved P*A and Q*A satisfy Cδ(P*A, Q*A) > 0.5, then for any PA and QA within the confidence intervals, we can certify the robustness against perturbation δ. We observe the following two properties of Cδ.

Proposition 1. Cδ(·, ·) is convex in the feasible region.

Proposition 2. With respect to x ∈ [0, 1], the functions x ↦ min_y Cδ(x, y) and x ↦ argmin_y Cδ(x, y) are monotonically non-decreasing. Similarly, with respect to y ∈ [0, 1], the functions y ↦ min_x Cδ(x, y) and y ↦ argmin_x Cδ(x, y) are monotonically non-decreasing.

These two propositions characterize the landscape of Cδ(·, ·): it is convex and monotonically non-decreasing along both the x and y axes. Thus, the desired (P*A, Q*A) (the location of the minimum within the bounded box) lies on the box boundary, and we only need to compute the locations of the boundary-line-sliced minima and compare them with the box constraints to solve Eqn. (11). Formally, we propose an efficient algorithm (Alg. 1, deferred to Appendix E.1) to solve for (P*A, Q*A).

Theorem 4. If Eqn. (11) is feasible, the P*A and Q*A returned by Alg. 1 solve Eqn. (11).

The above results are proved in Appendix G.2. At a high level, we prove Proposition 1 by definition; we prove Proposition 2 via a reduction to classical Neyman-Pearson-based certification and an analysis of this reduced problem; and we prove Theorem 4 based on Propositions 1 and 2 along with an exhaustive and nontrivial analysis of all possible cases.

5.3. Solving the Dual Problem

After the smoothed prediction confidences PA and QA are determined from the confidence intervals, we solve the dual problem Dδ(PA, QA) as defined in Eqn. (10). We solve the problem based on the following theorem.

Theorem 5 (Numerical Integration for DSRS with Generalized Gaussian Smoothing). In Dδ(PA, QA), let r = ‖δ‖₂. When P = N^g(k, σ) and Q = N^g_trunc(k, T, σ), let σ' := √(d/(d−2k)) σ and ν := ΓCDF_{d/2−k}(T²/(2σ'²))⁻¹. Then

R(λ1, λ2) := Pr_{ϵ∼P}[p(ϵ) < λ1 p(ϵ + δ) + λ2 q(ϵ + δ)]
  = E_{t∼Γ(d/2−k,1)} u1(t), if λ1 ≤ 0;
  = E_{t∼Γ(d/2−k,1)} [u1(t) + u2(t)], if λ1 > 0,

where
u1(t) = BetaCDF_{(d−1)/2}( ( min{T², 2kσ'² W((t/k) e^{t/k} (λ1 + νλ2)^{1/k})} − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ),
u2(t) = max{ BetaCDF_{(d−1)/2}( ( 2kσ'² W((t/k) e^{t/k} λ1^{1/k}) − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ) − BetaCDF_{(d−1)/2}( ( T² − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ), 0 };

P(λ1, λ2) := Pr_{ϵ∼P}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)]
  = E_{t∼Γ(d/2−k,1)} [ u3(t, λ1) if t ≥ T²/(2σ'²); u3(t, λ1 + νλ2) if t < T²/(2σ'²) ],

where
u3(t, λ) = BetaCDF_{(d−1)/2}( ( (σ'√(2t) + r)² − 2kσ'² W((t/k) e^{t/k} λ^{−1/k}) ) / (4 r σ'√(2t)) );

Q(λ1, λ2) := Pr_{ϵ∼Q}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)]
  = ν E_{t∼Γ(d/2−k,1)} [ u3(t, λ1 + νλ2) · 𝟙[t ≤ T²/(2σ'²)] ].

In the above equations, Γ(d/2 − k, 1) is the gamma distribution and ΓCDF_{d/2−k} is its CDF, BetaCDF_{(d−1)/2} is the CDF of the distribution Beta((d−1)/2, (d−1)/2), and W is the principal branch of the Lambert W function.

When P is a standard Gaussian and Q is a truncated standard Gaussian, we derive similar expressions, as detailed in Appendix H.1. We prove Theorem 5 in Appendix G.3, which extends the level-set sliced integration and results from (Yang et al., 2020). With the theorem, we can rewrite the dual problem Dδ(PA, QA) as

max_{λ1, λ2 ∈ ℝ} R(λ1, λ2) s.t. P(λ1, λ2) = PA, Q(λ1, λ2) = QA.  (12)

Given concrete λ1 and λ2, from the theorem, the function values P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) can be easily computed with one-dimensional numerical integration using the SciPy package.
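As an illustration of the last point, the sketch below evaluates P(λ1, λ2) and Q(λ1, λ2) with one-dimensional quadrature, following the structure of Theorem 5 as written above (R(λ1, λ2) is analogous via u1 and u2). The helper names and the usage numbers are ours, and the expressions should be cross-checked against Appendix G.3 / H.1 before relying on them.

```python
import numpy as np
from scipy import integrate
from scipy.special import lambertw
from scipy.stats import beta, gamma

def make_PQ(d, k, sigma, T, r):
    """Build P(l1, l2) and Q(l1, l2) from Theorem 5 as 1-D numerical integrals."""
    sigma_p = np.sqrt(d / (d - 2 * k)) * sigma
    tau = T ** 2 / (2 * sigma_p ** 2)                  # truncation point in t-space
    nu = 1.0 / gamma.cdf(tau, a=d / 2 - k)             # ratio q / p on the support of Q
    a = (d - 1) / 2                                    # Beta((d-1)/2, (d-1)/2)

    def u3(t, lam):
        if lam <= 0:                                   # p(eps - delta) < lam * p(eps) impossible
            return 0.0
        w = k * np.real(lambertw((t / k) * np.exp(t / k) * lam ** (-1.0 / k)))
        rho = sigma_p * np.sqrt(2 * t)                 # ||eps||_2 on this level set
        return beta.cdf(((rho + r) ** 2 - 2 * sigma_p ** 2 * w) / (4 * r * rho), a, a)

    dens = lambda t: gamma.pdf(t, a=d / 2 - k)

    def P(l1, l2):
        inner = integrate.quad(lambda t: u3(t, l1 + nu * l2) * dens(t), 0, tau, limit=200)[0]
        outer = integrate.quad(lambda t: u3(t, l1) * dens(t), tau, np.inf, limit=200)[0]
        return inner + outer

    def Q(l1, l2):
        return nu * integrate.quad(lambda t: u3(t, l1 + nu * l2) * dens(t), 0, tau, limit=200)[0]

    return P, Q

# Usage sketch (hypothetical numbers): P, Q = make_PQ(d=3072, k=1530, sigma=0.5, T=25.0, r=0.5)
```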
Now, solving Dδ(PA, QA) reduces to finding dual variables λ1 and λ2 such that P(λ1, λ2) = PA and Q(λ1, λ2) = QA. Generally, we find that there is only one feasible pair (λ1, λ2) for Eqn. (12), so finding such a pair is sufficient. We prove the uniqueness and discuss how we deal with edge cases where multiple feasible pairs exist in Appendix G.4. Normally, such a solving process is expensive. However, we find a particularly efficient method to solve for λ1 and λ2; the algorithm description is in Alg. 3 (in Appendix E.1). At a high level, from Theorem 5, we observe that Q(λ1, λ2) is determined only by the sum (λ1 + νλ2) and is non-decreasing w.r.t. this sum. Therefore, we apply binary search to find the (λ1 + νλ2) that satisfies Q(λ1, λ2) = QA. Then, we observe that

P(λ1, λ2) − Q(λ1, λ2)/ν = E_{t∼Γ(d/2−k,1)}[ u3(t, λ1) · 𝟙[t ≥ T²/(2σ'²)] ] =: h(λ1).

Thus, we need to find λ1 such that h(λ1) = PA − QA/ν. We observe that h(λ1) is non-decreasing w.r.t. λ1, and we use binary search to solve for λ1. Combining this with the value of (λ1 + νλ2), we also obtain λ2. We lastly leverage numerical integration to compute R(λ1, λ2) following Theorem 5 to solve the dual problem Dδ(PA, QA).

Practical Certification Soundness. As a practical certification method, we need to guarantee the certification soundness in the presence of numerical error. In DSRS, there are two sources of numerical error: the numerical integration error when computing P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2), and the finite precision of the binary search on λ1 and λ2. For numerical integration, we notice that typical numerical integration packages such as SciPy support setting an absolute error threshold and raising warnings when such a threshold cannot be reached. We set the absolute threshold ∆ = 1.5 × 10⁻⁸ and abstain when the threshold cannot be reached (which never happens in our experimental evaluation). Then, when computing P, Q, and R, suppose the numerical value is v; we use the lower bound (v − ∆) and upper bound (v + ∆) in the corresponding context to guarantee soundness. For the finite precision of the binary search, we use the left endpoint or the right endpoint of the final binary search interval to guarantee soundness. For example, we use the left endpoint of λ1 in the R computation, and use the left endpoint of (λ1 + νλ2) minus the right endpoint of λ1 to get the lower bound of λ2 to use in the R computation. As a result, we always obtain an under-estimate of R, so the certification is sound. Further discussion is in Appendix E.2. To this point, we have introduced the DSRS computational method. Complexity and efficiency analysis is deferred to Appendix E.3. Implementation details are in Appendix I.1.
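A generic sketch of the two monotone binary searches described above; the helper and the way it is wired to P, Q, and ν (from the previous sketch) are ours, and it returns the final bracketing interval so that the sound endpoint choices discussed above can be applied.

```python
def solve_monotone(func, target, lo=-1e3, hi=1e3, iters=100):
    """Binary search for x with func(x) ~= target, for a non-decreasing func.

    Returns the final bracketing interval [lo, hi]; using the appropriate
    endpoint (rather than the midpoint) keeps the certification sound.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo, hi

# Sketch of the two nested searches (P, Q, nu as in the previous sketch):
#   s_lo, s_hi = solve_monotone(lambda s: Q(0.0, s / nu), QA)          # s = l1 + nu * l2
#   h = lambda l1: P(l1, (s_lo - l1) / nu) - Q(l1, (s_lo - l1) / nu) / nu
#   l1_lo, l1_hi = solve_monotone(h, PA - QA / nu)
#   l2_lo = (s_lo - l1_hi) / nu                                        # sound lower endpoint
```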
5.4. Extensions

We mainly discussed the DSRS computational method for a generalized Gaussian P and a truncated generalized Gaussian Q under the ℓ2 norm. Can we extend it to other settings? Indeed, DSRS is a general framework. In the appendices, we show the following extensions: (1) DSRS for generalized Gaussians with different variances as P and Q (in Appendix H.2); (2) DSRS for other ℓp norms (in Appendix H.3); and (3) DSRS that leverages other forms of additional information, covering gradient magnitude information (Mohapatra et al., 2020; Levine et al., 2020) (in Appendix L).

6. Experimental Evaluation

In this section, we systematically evaluate DSRS and demonstrate that it achieves tighter certification than the classical Neyman-Pearson-based certification against ℓ2 perturbations on MNIST, CIFAR-10, and ImageNet. We focus on ℓ2 certification because additive randomized smoothing is not optimal for other norms (e.g., ℓ1 (Levine & Feizi, 2021)) or the certification can be directly translated from ℓ2 certification (e.g., ℓ∞ (Yang et al., 2020) and semantic transformations (Li et al., 2021)).

6.1. Experimental Setup

Smoothing Distributions. Following Theorem 2, we use the generalized Gaussian N^g(k, σ) as the smoothing distribution P, where d/2 − 15 ≤ k < d/2. Specifically, we set k to be d/2 − 12 on MNIST, d/2 − 6 on CIFAR-10, and d/2 − 4 on ImageNet. We use three different σ's: 0.25, 0.50, and 1.00. In terms of the additional smoothing distribution Q, on MNIST and CIFAR-10 we empirically find that using a generalized Gaussian with the same k but a different variance yields tighter robustness certification, and therefore we choose σg to be 0.2, 0.4, and 0.8 corresponding to P's σ being 0.25, 0.50, and 1.00, respectively. On ImageNet, the concentration property (see Definition 3) is more pronounced (detailed study in Appendix J.1), and thus we use the truncated generalized Gaussian N^g_trunc(k, T, σ) as Q. We apply a simple but effective algorithm, explained in Appendix I, to determine the hyperparameter T in N^g_trunc(k, T, σ).

Models and Training. We consider three commonly used or state-of-the-art training methods: Gaussian augmentation (Cohen et al., 2019), Consistency (Jeong & Shin, 2020), and SmoothMix (Jeong et al., 2021). We follow the default model architecture on each dataset, respectively. We train the models with augmentation noise sampled from the corresponding generalized Gaussian smoothing distribution P. More training details can be found in Appendix I.

Baselines. We consider the Neyman-Pearson-based certification method as the baseline. This certification is widely used and is the tightest given only the prediction probability under P (Cohen et al., 2019; Yang et al., 2020; Jeong & Shin, 2020; Li et al., 2021). We remark that although there are certification methods that leverage more information, to the best of our knowledge, they are not visibly better than the Neyman-Pearson-based method for ℓ2 certification under a practical sampling number (10⁵). More comparisons in Appendix J.5 show that DSRS is also better than these baselines. For both the baseline and DSRS, following the convention, the certification confidence is 1 − α = 99.9%, and we use 10⁵ samples for estimating PA and QA. Neyman-Pearson certification does not use the information from the additional distribution, so all 10⁵ samples are used to estimate the interval of PA. In DSRS, we use 5 × 10⁴ samples to estimate the interval of PA with confidence 1 − α/2 = 99.95% and the remaining 5 × 10⁴ samples for QA with the same confidence. By the union bound, the whole certification confidence is 99.9%.

Metrics. We uniformly draw 1000 samples from the test set and report the certified robust accuracy under each ℓ2 radius r, which is the fraction of samples that are both correctly classified and have certified robust radii larger than or equal to r. Under each radius r, we report the highest certified robust accuracy among the three variances σ ∈ {0.25, 0.50, 1.00}, following (Cohen et al., 2019; Salman et al., 2019). We also report evaluation results with the ACR metric (Zhai et al., 2020) in Appendix J.3.3.
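The interval estimation described under "Baselines" uses standard two-sided Clopper-Pearson bounds with the confidence budget split between PA and QA. A minimal sketch (the helper name and the example success counts are our placeholders):

```python
from scipy.stats import beta

def clopper_pearson(successes, n, alpha):
    """Two-sided (1 - alpha) Clopper-Pearson interval for a Bernoulli parameter."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    hi = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lo, hi

# 10^5 samples split evenly between P and Q; each interval holds with confidence
# 1 - alpha/2 = 99.95%, so both hold jointly with confidence >= 99.9% (union bound).
alpha = 0.001
PA_lo, PA_hi = clopper_pearson(successes=48_000, n=50_000, alpha=alpha / 2)  # placeholder counts
QA_lo, QA_hi = clopper_pearson(successes=49_500, n=50_000, alpha=alpha / 2)
```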
Table 2. Certified robust accuracy under different radii r with different certification approaches.

| Dataset | Training Method | Certification Approach | 0.25 | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 97.8% | 96.9% | 94.6% | 88.4% | 78.7% | 57.6% | 41.0% | 25.5% | 13.6% | 6.2% | 2.1% | 0.9% |
| MNIST | Gaussian Aug. (Cohen et al., 2019) | DSRS | 97.8% | 97.0% | 95.0% | 89.8% | 83.4% | 61.6% | 48.4% | 34.1% | 21.0% | 10.6% | 4.4% | 1.2% |
| MNIST | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 98.4% | 97.5% | 96.0% | 92.3% | 83.8% | 67.5% | 49.1% | 35.6% | 21.7% | 10.4% | 4.1% | 1.9% |
| MNIST | Consistency (Jeong & Shin, 2020) | DSRS | 98.4% | 97.5% | 96.0% | 93.5% | 87.1% | 71.8% | 55.8% | 41.9% | 31.4% | 17.8% | 8.6% | 2.8% |
| MNIST | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 98.6% | 97.6% | 96.5% | 91.9% | 85.1% | 73.0% | 51.4% | 40.2% | 31.5% | 22.2% | 12.2% | 4.9% |
| MNIST | SmoothMix (Jeong et al., 2021) | DSRS | 98.6% | 97.7% | 96.8% | 93.4% | 87.5% | 76.6% | 54.4% | 46.2% | 37.6% | 29.2% | 18.5% | 7.2% |
| CIFAR-10 | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 56.1% | 41.3% | 27.7% | 18.9% | 14.9% | 10.2% | 7.5% | 4.1% | 2.0% | 0.7% | 0.1% | 0.1% |
| CIFAR-10 | Gaussian Aug. (Cohen et al., 2019) | DSRS | 57.4% | 42.7% | 30.6% | 20.6% | 16.1% | 12.5% | 8.4% | 6.4% | 3.5% | 1.8% | 0.7% | 0.1% |
| CIFAR-10 | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 61.8% | 50.9% | 38.0% | 32.3% | 23.8% | 19.0% | 16.4% | 13.8% | 11.2% | 9.0% | 7.1% | 5.1% |
| CIFAR-10 | Consistency (Jeong & Shin, 2020) | DSRS | 62.5% | 52.5% | 38.7% | 35.2% | 28.1% | 20.9% | 17.6% | 15.3% | 13.1% | 10.9% | 8.9% | 6.5% |
| CIFAR-10 | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 63.9% | 53.3% | 40.2% | 34.2% | 26.7% | 20.4% | 17.0% | 13.9% | 10.3% | 7.8% | 4.9% | 2.3% |
| CIFAR-10 | SmoothMix (Jeong et al., 2021) | DSRS | 64.7% | 55.5% | 42.1% | 35.9% | 29.4% | 22.1% | 18.7% | 16.1% | 13.2% | 10.2% | 7.1% | 3.9% |
| ImageNet | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 57.1% | 47.0% | 39.3% | 33.2% | 24.8% | 21.4% | 17.6% | 13.7% | 10.2% | 7.8% | 5.7% | 3.6% |
| ImageNet | Gaussian Aug. (Cohen et al., 2019) | DSRS | 58.4% | 48.4% | 41.4% | 35.3% | 28.8% | 23.3% | 21.3% | 18.7% | 14.2% | 11.0% | 9.0% | 5.7% |
| ImageNet | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 59.8% | 49.8% | 43.3% | 36.8% | 31.4% | 25.6% | 22.1% | 19.1% | 16.1% | 14.0% | 10.6% | 8.5% |
| ImageNet | Consistency (Jeong & Shin, 2020) | DSRS | 60.4% | 52.4% | 44.7% | 39.3% | 34.8% | 28.1% | 25.4% | 22.6% | 19.6% | 17.4% | 14.1% | 10.4% |
| ImageNet | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 46.7% | 38.2% | 28.8% | 24.6% | 18.1% | 14.2% | 11.8% | 10.1% | 8.9% | 7.2% | 6.0% | 4.6% |
| ImageNet | SmoothMix (Jeong et al., 2021) | DSRS | 47.4% | 40.0% | 30.3% | 26.8% | 21.6% | 15.7% | 14.0% | 12.1% | 9.9% | 8.4% | 7.2% | 5.3% |

6.2. Evaluation Results

We show the results in Table 2. The corresponding curves and separate tables for each variance σ are in Appendix J.3. For almost all models and radii r, DSRS yields significantly higher certified accuracy. For example, for Gaussian augmented models, when r = 2.0, the robust accuracy increases from 25.5% to 34.1% (+8.6%) on MNIST, from 4.1% to 6.4% (+2.3%) on CIFAR-10, and from 13.7% to 18.7% (+5.0%) on ImageNet. On average, the improvements are around 6%-9% on MNIST, around 1.5%-3% on CIFAR-10, and around 2%-5% on ImageNet. Thus, DSRS can be used in conjunction with different training approaches and provides consistently tighter robustness certification. The improvements in the robust radius are not as substantial as those in Figure 2(b) (which are around 2×). We investigate the reason in Appendix J.1. In summary, the model in Figure 2(b) is trained with standard Gaussian smoothing augmentation and smoothed with a generalized Gaussian, whereas the models in this section are trained with generalized Gaussian augmentation. Such training gives higher certified robustness but, in the meantime, gives more advantage to Neyman-Pearson-based certification. This finding implies that there may be a large space for exploring training approaches that favor DSRS certification, since all existing training methods are designed for Neyman-Pearson-based certification. Nevertheless, even with these unsuitable training methods, DSRS still achieves significantly tighter robustness certification than the baseline.
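For reference, each entry of Table 2 is the certified accuracy at a radius, i.e., the fraction of test samples that are both correctly classified and certified at that radius or beyond. A minimal helper sketch (names ours):

```python
import numpy as np

def certified_accuracy(radii, correct, r):
    """Fraction of samples that are correctly classified and certified at radius >= r.

    radii: per-sample certified radius (0 for abstentions);
    correct: per-sample flag for whether the smoothed prediction equals the true label.
    """
    radii = np.asarray(radii, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return float(np.mean(correct & (radii >= r)))
```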
On the other hand, all the above results are restricted to generalized Gaussian smoothing. We still observe that standard Gaussian smoothing combined with strong training methods (Salman et al., 2019; Jeong & Shin, 2020) and Neyman-Pearson certification (the SOTA setting) yields similar or slightly higher certified robust accuracy than generalized Gaussian smoothing even with DSRS certification. Though DSRS with its suitable generalized Gaussian smoothing does not achieve SOTA certified robustness yet, given the theoretical advantages, we believe that with future tailored training approaches, DSRS with generalized Gaussian smoothing can bring strong certified robustness. More discussion is in Appendix L.3.

Ablation Studies. We present several ablation studies in the appendix and verify: (1) Effectiveness of our simple heuristic for selecting the hyperparameter of Q: we propose a simple heuristic to select the hyperparameter T in the smoothing distribution Q = N^g_trunc(k, T, σ). In Appendix J.2, we propose a gradient-based optimization method to select such a Q. We find that our simple heuristic has similar performance compared to the more complex optimization method but is more efficient. (2) Comparison of different types of Q: by choosing different types of Q distributions (truncated Gaussian or Gaussian with a different variance), DSRS has different performance, as mentioned in Section 6.1. In Appendix J.4, we investigate the reason. In summary, when the concentration property (see Definition 3) is better satisfied, using a truncated Gaussian as Q is better; otherwise, using a Gaussian with a different variance is better.

7. Conclusion

We propose a general DSRS framework, which exploits information based on an additional smoothing distribution to tighten the robustness certification. We theoretically analyze and compare classical Neyman-Pearson and DSRS certification, showing that DSRS has the potential to break the curse of dimensionality of randomized smoothing.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. This work is partially supported by NSF grant No. 1910100, NSF CNS 20-46726 CAR, C3 AI, and the Sloan Fellowship.

References

Albarghouthi, A. Introduction to neural network verification. arXiv preprint arXiv:2109.10317, 2021.

Alfarra, M., Bibi, A., Torr, P. H., and Ghanem, B. Data dependent randomized smoothing. arXiv preprint arXiv:2012.04351, 2020.

Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 274-283. PMLR, 2018.

Awasthi, P., Jain, H., Rawat, A. S., and Vijayaraghavan, A. Adversarial robustness via robust low rank representations. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 11391-11403. Curran Associates, Inc., 2020.

Blum, A., Dick, T., Manoj, N., and Zhang, H. Random smoothing might be unable to certify ℓ∞ robustness for high-dimensional images. Journal of Machine Learning Research, 21:211:1-211:21, 2020.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39-57.
IEEE Computer Society, 2017. Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J. C., and Liang, P. Unlabeled data improves adversarial robustness. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11190 11201. Curran Associates, Inc., 2019. Chernoff, H. and Scheffe, H. A generalization of the neyman-pearson fundamental lemma. The Annals of Mathematical Statistics, 23(4):213 225, 1952. Chiang, P., Curry, M. J., Abdelkader, A., Kumar, A., Dickerson, J., and Goldstein, T. Detection as regression: Certified object detection with median smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 1275 1286. Curran Associates, Inc., 2020. Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1310 1320. PMLR, 2019. Dvijotham, K. D., Hayes, J., Balle, B., Kolter, J. Z., Qin, C., Gy orgy, A., Xiao, K., Gowal, S., and Kohli, P. A framework for robustness certification of smoothed classifiers using f-divergences. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Eiras, F., Alfarra, M., Kumar, M. P., Torr, P. H. S., Dokania, P. K., Ghanem, B., and Bibi, A. ANCER: anisotropic certification via sample-wise volume maximization. ar Xiv preprint ar Xiv:2107.04570, 2021. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. Robust physical-world attacks on deep learning visual classification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 1625 1634. Computer Vision Foundation / IEEE Computer Society, 2018. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. Scalable verified training for provably robust image classification. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 4841 4850. IEEE, 2019. Hayes, J. Extensions and limitations of randomized smoothing for robustness guarantees. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pp. 3413 3421. Computer Vision Foundation / IEEE, 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Double Sampling Randomized Smoothing Las Vegas, NV, USA, June 27-30, 2016, pp. 770 778. IEEE Computer Society, 2016. Jeong, J. and Shin, J. 
Consistency regularization for certified robustness of smoothed classifiers. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 10558 10570. Curran Associates, Inc., 2020. Jeong, J., Park, S., Kim, M., Lee, H., Kim, D., and Shin, J. Smoothmix: Training confidence-calibrated smoothed classifiers for certified robustness. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 30153 30168. Curran Associates, Inc., 2021. Kumar, A., Levine, A., Feizi, S., and Goldstein, T. Certifying confidence via randomized smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 5165 5177. Curran Associates, Inc., 2020a. Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5458 5467. PMLR, 2020b. L ecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pp. 656 672. IEEE, 2019. Lee, G., Yuan, Y., Chang, S., and Jaakkola, T. S. Tight certificates of adversarial robustness for randomly smoothed classifiers. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 4911 4922. Curran Associates, Inc., 2019. Levine, A. and Feizi, S. Improved, deterministic smoothing for ℓ1 certified robustness. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6254 6264. PMLR, 2021. Levine, A., Kumar, A., Goldstein, T. A., and Feizi, S. Tight second-order certificates for randomized smoothing. ar Xiv preprint ar Xiv:2010.10549, 2020. Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 9459 9469. Curran Associates, Inc., 2019a. Li, H., Xu, X., Zhang, X., Yang, S., and Li, B. QEBA: query-efficient boundary-based blackbox attack. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 1319, 2020, pp. 1218 1227. Computer Vision Foundation / IEEE, 2020a. Li, L., Zhong, Z., Li, B., and Xie, T. Robustra: Training provable robust neural networks over reference adversarial space. 
In Kraus, S. (ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 4711 4717. ijcai.org, 2019b. Li, L., Xie, T., and Li, B. Sok: Certified robustness for deep neural networks. ar Xiv preprint ar Xiv:2009.04131, 2020b. Li, L., Weber, M., Xu, X., Rimanic, L., Kailkhura, B., Xie, T., Zhang, C., and Li, B. TSS: transformation-specific smoothing for robustness certification. In Kim, Y., Kim, J., Vigna, G., and Shi, E. (eds.), 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS 2021, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 535 557. ACM, 2021. Liu, C., Arnon, T., Lazarus, C., Strong, C. A., Barrett, C. W., and Kochenderfer, M. J. Algorithms for verifying deep neural networks. Foundations and Trends in Optimization, 4(3-4):244 404, 2021. Lozier, D. W. NIST digital library of mathematical functions. Annals of Mathematics and Artificial Intelligence, 38(13):105 119, 2003. Mirman, M., Gehr, T., and Vechev, M. T. Differentiable abstract interpretation for provably robust neural networks. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML Double Sampling Randomized Smoothing 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3575 3583. PMLR, 2018. Mohapatra, J., Ko, C., Weng, T., Chen, P., Liu, S., and Daniel, L. Higher-order certification for randomized smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 4501 4511. Curran Associates, Inc., 2020. Neyman, J. and Pearson, E. S. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289 337, 1933. Qiu, H., Xiao, C., Yang, L., Yan, X., Lee, H., and Li, B. Semanticadv: Generating adversarial examples via attributeconditioned image editing. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV, volume 12359 of Lecture Notes in Computer Science, pp. 19 37. Springer, 2020. Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. Salman, H., Li, J., Razenshteyn, I. P., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11289 11300. Curran Associates, Inc., 2019. Schuchardt, J., Wollschl ager, T., Bojchevski, A., and G unnemann, S. Localized randomized smoothing for collective robustness certification, 2022. URL https: //openreview.net/forum?id=m F122Bu Ann W. S uken ık, P., Kuvshinov, A., and G unnemann, S. Intriguing properties of input-dependent randomized smoothing. 
arXiv preprint arXiv:2110.05365, 2021.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In Bengio, Y. and LeCun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

Tramèr, F., Carlini, N., Brendel, W., and Madry, A. On adaptive attacks to adversarial example defenses. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 1633–1645. Curran Associates, Inc., 2020.

Wang, B., Xu, C., Wang, S., Gan, Z., Cheng, Y., Gao, J., Awadallah, A. H., and Li, B. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, volume 1, 2021.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 5283–5292. PMLR, 2018.

Wu, Y., Bojchevski, A., Kuvshinov, A., and Günnemann, S. Completing the picture: Randomized smoothing suffers from the curse of dimensionality for a large family of distributions. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 3763–3771. PMLR, 2021.

Xu, K., Shi, Z., Zhang, H., Wang, Y., Chang, K., Huang, M., Kailkhura, B., Lin, X., and Hsieh, C. Automatic perturbation analysis for scalable certified robustness and beyond. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 1129–1141. Curran Associates, Inc., 2020.

Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I. P., and Li, J. Randomized smoothing of all shapes and sizes. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 10693–10705. PMLR, 2020.

Yang, Z., Li, L., Xu, X., Kailkhura, B., Xie, T., and Li, B. On the certified robustness for ensemble models and beyond. In 10th International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

Zhai, R., Dan, C., He, D., Zhang, H., Gong, B., Ravikumar, P., Hsieh, C., and Wang, L. MACER: Attack-free and scalable robust training via maximizing certified radius. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

Zhang, D., Ye, M., Gong, C., Zhu, Z., and Liu, Q. Black-box certification with randomized smoothing: A functional optimization based framework. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H.
(eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 2316–2326. Curran Associates, Inc., 2020a.

Zhang, H., Chen, H., Xiao, C., Gowal, S., Stanforth, R., Li, B., Boning, D. S., and Hsieh, C. Towards stable and efficient training of verifiably robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020b.

Zhang, J., Li, L., Li, H., Zhang, X., Yang, S., and Li, B. Progressive-scale boundary blackbox attack via projective gradient estimation. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12479–12490. PMLR, 2021.

A. Neyman-Pearson Certification

The Neyman-Pearson-based robustness certification is the tightest certification given only the prediction probability under P (Cohen et al., 2019). This certification and its equivalent variants are widely used for randomized smoothing. We use r_NP to denote the certified radius from the Neyman-Pearson-based method. If the smoothing distribution P is standard Gaussian, the following proposition gives the closed-form certified robust radius derived from the Neyman-Pearson lemma (Neyman & Pearson, 1933).

Proposition 3 ((Cohen et al., 2019)). Under the ℓ2 norm, given an input x_0 ∈ R^d with true label y_0, let P = N(σ) be the smoothing distribution. Then Neyman-Pearson-based certification yields the certified radius r_NP = σ Φ^{-1}(f^P(x_0)_{y_0}), where Φ^{-1} is the inverse CDF of the unit-variance Gaussian.

For other smoothing distributions, the concretization of the Neyman-Pearson certification method can be found in (Yang et al., 2020).

Remark. In practice, the routine is to use Monte-Carlo sampling to obtain a high-confidence interval of f^P(x_0)_{y_0}, which implies a high-confidence certification (r_NP) of the robust radius. A tighter radius can be obtained when the runner-up prediction probability is known: (σ/2) (Φ^{-1}(f^P(x_0)_{y_0}) − max_{y ∈ [C], y ≠ y_0} Φ^{-1}(f^P(x_0)_y)). However, due to efficiency concerns (for C-way classification, the sampling number needs to be more than C times larger if certifying with this tighter radius instead of r_NP), the standard routine is to use only the top-class probability and r_NP (Cohen et al., 2019, Section 3.2.2). DSRS follows this routine.

B. Illustration of Concentration Assumption on ImageNet

[Figure 4 plot: probability of predicting the true class and density of the ℓ2 noise magnitude, both against the ℓ2 length of the perturbation.]
Figure 4. Blue curves: probability of true prediction w.r.t. the ℓ2 length of perturbations for a base classifier from (Salman et al., 2019) on ImageNet. Each line corresponds to one of 100 uniformly drawn samples from the test set (detailed setup in Appendix J.1). Green curve: normalized density of the ℓ2 noise magnitude for ImageNet standard Gaussian N(σ) with σ = 0.5, which concentrates sharply around σ√d. Thus, for a constant Pcon, (σ, Pcon)-concentration can be satisfied for a significant portion of input samples.
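The concentration of the noise magnitude underlying the green curve in Figure 4 is easy to reproduce empirically. The following is a minimal, self-contained sketch (not part of the released DSRS code; the dimensions, σ, and sample count are illustrative only) that checks how tightly ‖ε‖₂ concentrates around σ√d for ε ~ N(0, σ²I_d):

```python
import numpy as np

# Minimal sketch: empirically check that the l2 norm of Gaussian noise
# eps ~ N(0, sigma^2 I_d) concentrates around sigma * sqrt(d), as discussed
# for the green curve in Figure 4. All numbers are illustrative.
def noise_norm_stats(d, sigma, n_samples=10000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma, size=(n_samples, d))
    norms = np.linalg.norm(eps, axis=1)
    return norms.mean(), norms.std()

for d in (28 * 28, 3 * 32 * 32, 3 * 224 * 224):  # MNIST / CIFAR-10 / ImageNet dims
    sigma = 0.5
    mean, std = noise_norm_stats(d, sigma)
    # The relative spread std/mean shrinks as d grows, i.e., the norm
    # concentrates tightly around sigma * sqrt(d).
    print(f"d={d:7d}  sigma*sqrt(d)={sigma * d ** 0.5:8.2f}  "
          f"mean={mean:8.2f}  std={std:.3f}")
```

Since the mean of ‖ε‖₂ grows like σ√d while its standard deviation stays roughly constant (about σ/√2), the relative spread shrinks like 1/√d, which is why a ball of radius close to σ√d can capture a constant probability mass Pcon.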
C. Illustration of Unchanged PA with Increasing d

In the first remark of Theorem 2, we mention that PA = f^P(x_0)_{y_0} does not grow along with the input dimension d. To illustrate this phenomenon, in Figure 5 we plot PA histograms for 1,000 test samples from Tiny ImageNet and ImageNet. Note that Tiny ImageNet images are downscaled ImageNet images, so the data distribution differs only in the input dimension d. As we can observe, although d varies, the PA distribution is highly similar, so PA is roughly unchanged as the input dimension d increases.

[Figure 5 plot: PA histograms for Tiny ImageNet (d = 12288) trained with Gaussian augmentation and ImageNet (d = 150528) trained with Gaussian augmentation.]
Figure 5. PA histograms for models trained on Tiny ImageNet (left) and ImageNet (right) with the same σ = 0.50.

D. DSRS under Relaxed Concentration Assumption

In the main text (Section 4), we generalize the concentration assumption by replacing the holding probability 1 in Eqn. (5a) with a probability that accounts for sampling confidence. In this appendix, we instead replace the holding probability in Eqn. (5a) by exp(−d^α) for α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. With this relaxation, we conduct numerical simulations using the same settings as in the main text, and the corresponding results are shown in Figure 6. Note that some solid curves terminate when d is large, which is due to the limited floating-point precision in the numerical simulations; we use dashed lines of the same color to plot the projected radius for large d.

[Figure 6 plot: certified radius vs. input dimension d with PA = 0.6 and (1, 0.5)-approximate concentration property; curves for Neyman-Pearson certification with standard Gaussian, DSRS with ideal concentration information, and DSRS with exp(−d^α)-concentration; vertical markers indicate the MNIST, CIFAR-10, and ImageNet input dimensions.]
Figure 6. Tendency of the DSRS certified robust radius with different input dimensions d under the relaxed concentration assumption, i.e., when the holding probability in Eqn. (5a) is exp(−d^α) with α from 0.1 to 0.5. Blue line: DSRS when the holding probability in Eqn. (5a) is 1. Dotted line: Neyman-Pearson certification. Other solid lines: DSRS when the holding probability in Eqn. (5a) is exp(−d^α). Other dashed lines: DSRS projected radius r^proj_DSRS = r_NP d^α / 1.18, for α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Both the x- and y-axes are logarithmic.

Remark. When the concentration property holds with probability exp(−d^α) (0 < α ≤ 0.5) rather than 1, we observe from Figure 6 that r_NP d^α / 1.18 predicts the certified radius of DSRS well, where r_NP is the Neyman-Pearson certified radius. Therefore, although the √d growth rate of the ℓ2 certified radius no longer holds, the radius still increases with the dimension d. Interestingly, as the dimension d increases, the vanishing probability exp(−d^α) still implies an increasing volume of adversarial examples, yet the smoothed classifier remains certifiably robust with an increasing radius, as reflected by DSRS, despite the increasing adversarial volume.

E. Omitted Details of DSRS Computational Method

E.1. Algorithm Description

Algorithm 1: Determining PA and QA from confidence intervals (see Section 5.2).
Data: distributions P and Q; δ; the confidence intervals [PA, PA] and [QA, QA]
Result: PA and QA satisfying Eqn. (11)
1: compute q ← argmin_y Cδ(PA, y)
2: if q > QA then
3:   return (PA, min{q, QA})
5: compute p ← argmin_x Cδ(x, QA)
6: return (max{min{p, PA}, PA}, QA)

Alg. 1 is a subroutine (Line 7) of Alg. 2. Note that Lines 1 and 5 of Alg. 1 solve the constrained optimization with only one constraint (either one of Eqn. (8b)), which reduces to the well-studied and solvable Neyman-Pearson-based certification. Note also that we do not need to evaluate any value of Cδ in Alg. 1.
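Alg. 1 consumes high-confidence intervals for PA and QA, which Alg. 2 obtains by Monte-Carlo sampling following (Cohen et al., 2019). As a rough illustration of how such intervals are typically produced, here is a minimal Clopper-Pearson sketch; it is not the authors' exact routine, `count_correct_under` is a hypothetical helper, and the two-sided interval and the constants are illustrative assumptions:

```python
from scipy.stats import beta

def clopper_pearson_interval(k, n, alpha):
    """Two-sided (1 - alpha) Clopper-Pearson interval for a binomial proportion,
    given k top-class predictions out of n Monte-Carlo noise samples."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Hypothetical usage: count_correct_under(dist) would draw n noise samples from
# the given smoothing distribution (P or Q), query the base classifier, and
# return how many samples are predicted as the top class y0.
# n, alpha = 10**5, 0.0005
# pa_lo, pa_hi = clopper_pearson_interval(count_correct_under("P"), n, alpha)
# qa_lo, qa_hi = clopper_pearson_interval(count_correct_under("Q"), n, alpha)
```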
Although q and p in the algorithm are arg minx or y over Cδ, the free choices of x or y leave Cδ s constrained optimization with only one constraint and then q and p can be solved by Neyman-Pearson instead of evaluating Cδ directly. Algorithm 2 DSRS computational method. Data: clean input x0, base classifier F0; distributions P and Q; norm type p; confidence level α; numerical integration error bound Result: Certified radius r 1 Query prediction y0 e F P 0 (x0) 2 Sample and estimate the intervals of smoothed confidence [PA, PA] under P and [QA, QA] under Q with confidence (1 α) following (Cohen et al., 2019) 3 Initialize: rl 0, ru rmax 4 while ru rl > eps do /* Binary search on radius r */ 5 rm (rl + ru)/2 6 δ (rm, 0, . . . , 0)T ; /* for ℓ2 certification with ℓ2 symmetric P and Q; for ℓ or ℓ1, can be adjusted following (Zhang et al., 2020a) */ 7 Determine PA [PA, PA] and QA [QA, QA] ; /* See Section 5.2 and Alg. 1 */ 8 (λ1, λ2) DUALBINARYSEARCH(PA, QA) ; /* See Section 5.3 and Alg. 3 */ 9 v R(λ1, λ2) ; /* Using Theorem 5 */ 10 if v > 0.5 then 14 return rl Alg. 2 is the pseudocode of the whole DSRS computational method as introduced in Section 5. Double Sampling Randomized Smoothing Algorithm 3 DUALBINARYSEARCH for λ1 and λ2. Data: Query access to P( , ) and Q( , ); PA; QA; ν; precision parameter ϵ; numerical integration error bound Result: λ1 and λ2 satisfying constraints P(λ1, λ2) = PA, Q(λ1, λ2) = QA (see Eqn. (12)) 1 a L 0, a U M ; /* search for a = λ1 + νλ2, M is a large positive number */ 2 while a U a L > ϵ do 3 am (a L + a U)/2 4 if Q(am, 0) < QA then /* Following while-loop enlarges a L and a U until [a L, a U] covers a such that Q(a , 0) = QA under numerical integration error */ 9 while (Q(a L, 0) + > QA) or (Q(a U, 0) < QA) do 10 t a U a L 11 a L a L t/2 12 a U a U + t/2 14 λL 1 0, λU 1 M ; /* search for λ1, M is a large positive number */ 15 while λU 1 λL 1 > ϵ do 16 λm 1 (λL 1 + λU 1 )/2 17 if h(λm 1 ) < PA QA/ν then 18 λL 1 λm 1 19 else 20 λU 1 λm 1 /* Following while-loop enlarges λL 1 and λU 1 until [λL 1 , λU 1 ] covers λ 1 such that h(λ 1) = PA QA/ν under numerical integration error */ 22 while (h(λL 1 ) + > PA QA/ν) or (h(λU 1 ) < PA QA/ν) do 23 t λU 1 λL 1 24 λL 1 λL 1 t/2 25 λU 1 λU 1 + t/2 27 return (λL 1 , (a L λU 1 )/ν) ; /* for soundness, choose the left endpoint of λ1 and λ2 range */ Alg. 3 is the dual variable search algorithm described in Section 5.3. From Line 1 to 8, we conduct binary search for quantity λ1 + νλ2; from Line 14 to 21, we conduct binary search for quantity λ1. Notice that our binary search interval is initialized to be the non-negative interval. This is because Q(am, 0) = 0 and h(λm 1 ) = 0 if am and λm 1 are non-positive observed from Theorem 5. E.2. Guaranteeing Numerical Soundness In DSRS computational method, to compute a practically sound robustness guarantee, we take the binary search error and numerical integration error into consideration. Specifically, in Alg. 3, we return a pair (λ 1, λ 2) whose R(λ 1, λ 2) lower bounds R(λ1, λ2) where (λ1, λ2) is the precise feasible pair. We achieve so by returning (λL 1 , (a L λU 1 )/ν). Specifically, as we can see a L is an underestimation of actual (λ1 + νλ2) in the presence of binary search error and numerical integration error. We bound the numerical integration error by setting absolute error bound = 1.5 10 8 in scipy.integrad.quad function. Then, λL 1 and λU 1 are underestimation and overestimation of the actual λ1 in the presence of errors respectively. 
As a result, (a^L − λ^U_1)/ν is an underestimation of λ2. Therefore, both λ1 and λ2 are underestimated, and by the monotonicity of R(·, ·), the actual R(λ1, λ2) is underestimated, which guarantees soundness. Then, since the evaluation of R involves numerical integration, we compare the lower bound of R(λ1, λ2) computed in Alg. 2 (Line 9) with 0.5 to determine whether the current robustness radius can be certified.

E.3. Complexity and Efficiency Analysis

In this appendix, we briefly analyze the computational complexity of the DSRS computational method introduced in Section 5. Suppose the binary-search precision is ϵ and each numerical integration costs C time. First, the binary search over the certified robust radius costs O(log(√d/ϵ)) iterations. For each searched radius, we first determine PA and QA by running Neyman-Pearson-based certification, which costs O(log(1/ϵ) C). Then, solving the dual variables takes two binary-search rounds, which costs O(log(1/ϵ) C). The final one-time integration of R(λ1, λ2) costs O(C). Thus, the overall time complexity is O(log(√d/ϵ) log(1/ϵ) C), which is the same as classical Neyman-Pearson certification and grows slowly (by a logarithmic factor) w.r.t. the input dimension d.

In practice, the certification time is on average 5 s to 10 s per sample across different datasets. For example, with σ = 0.50 as the smoothing variance parameter, the certification time, as an overhead over Neyman-Pearson-based certification, is 10.53 s, 4.53 s, and 3.21 s on MNIST, CIFAR-10, and ImageNet respectively. This overhead is almost negligible compared with the sampling time for estimating PA and QA, which is around 200 s on ImageNet and is a cost shared by all randomized smoothing certification methods. In summary, the running time of DSRS is roughly the same as that of standard Neyman-Pearson-based certification.

F. Extensions and Proofs in Section 4

In this appendix, we provide formal proofs and theoretical extensions for the results in Section 4.

F.1. Proof of Theorem 1

Proof of Theorem 1. We let D_c denote the decision region of F_0 for class c, i.e., D_c := {x : F_0(x) = c}. Since Q is supported on the decision region shifted by x_0, f^Q_0(x_0)_c = 1. Thus, from f^Q_0(x_0)_c = 1, we know (supp(Q) + x_0) \ S ⊆ D_c, where S is some set with zero measure under Q + x_0. Since 0 < q(x)/p(x) < +∞, S also has zero measure under P + x_0. On the other hand, by 0 < q(x)/p(x) < +∞, we can determine the probability mass of supp(Q) under P, i.e., Pr_{ϵ∼P}[ϵ ∈ supp(Q)]. Then, we observe that

f^P_0(x_0)_c = Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c]
= Pr_{ϵ∼P}[ϵ ∈ R^d \ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] + Pr_{ϵ∼P}[ϵ ∈ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ supp(Q)]
= Pr_{ϵ∼P}[ϵ ∈ R^d \ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] + Pr_{ϵ∼P}[ϵ ∈ supp(Q)].

By the definition of supp(Q), we observe that Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] = 0. As a result, f^P_0(x_0)_c = Pr_{ϵ∼P}[ϵ ∈ supp(Q)]. Then the DSRS certification method can know that ((R^d \ supp(Q)) + x_0) ∩ D_c has zero measure under P + x_0. In summary, the certification method can determine that supp(Q) + x_0 differs from D_c only on some set with zero measure under P + x_0. Because P has positive density everywhere, this difference set also has zero measure under P + x_0 + δ for arbitrary δ ∈ R^d. Thus, for arbitrary δ ∈ R^d, the certification method can compute

f^P_0(x_0 + δ)_c = Pr_{ϵ∼P}[F_0(x_0 + δ + ϵ) = c]   (13)
= Pr_{ϵ∼P+δ}[x_0 + ϵ ∈ D_c] = Pr_{ϵ∼P+δ}[ϵ ∈ supp(Q)].
Under the binary classification setting, it implies that for any δ Rd, the f P 0 (x0 +δ) can be uniquely determined by DSRS certification method. Since the smoothed classifier s decision at any x0 + δ is uniquely determiend by f P 0 (x0 + δ) (see Eqn. (2)), the certification method can exactly know e F P 0 (x + δ) for any δ and thus determine tightest possible certified robust radius rtight. F.2. Extending Theorem 1 to Multiclass Setting For the multiclass setting, we define a variant of DSRS as follows. Definition 4 (rmulti DSRS). Given PA [0, 1] and Qmulti A RC 1, rmulti DSRS := max r s.t. F : R [C], f P(x0)y0 = PA, f Qc(x0)c = (Qmulti A )c, c [C 1] x, x x0 p < r, e F P(x) = y0. In the above definition, rmulti DSRS is the tightest possible certified radius with prediction probability Qmulti A , where each component of Qmulti A , namely (Qmulti A )c, corresponds to the prediction probability for label c under additional smoothing distribution Qc. Note that there are (C 1) additional smoothing distributions {Qc}C 1 c=1 in this generalization for multiclass setting. With this DSRS generation, the following corollary extends the tightness analysis in Theorem 1 from binary to the multiclass setting. Corollary 1. Suppose the original smoothing distribution P has positive density everywhere, i.e., p( ) > 0. For multiclass classification with base classifier F0, at point x0 Rd, for each class c [C 1], let Qc be a distribution that satisfies: (1) its support is the decision region of c shifted by x0: supp(Qc) = {x x0 : F0(x) = c}; (2) for any x supp(Q), 0 < qc(x)/p(x) < + . Then, plugging PA = f P 0 (x0)c and Qmulti A where (Qmulti A )c = f Qc 0 (x0)c for c [C 1] into Definition 4, we have rmulti DSRS = rtight under any ℓp (p 1). Proof of Corollary 1. Similar as the proof of Theorem 1, the certification method can observe that for any c [C 1], f Qc 0 (x0)c = 1. Thus, the method knows supp(Qc) + x0 Dc for arbitrary c [C 1]. Here, the means that the difference between the two sets has zero measure under P + x0. Thus, for arbitrary δ Rd the certification method can precisely compute out f P(x0)c for any c [C 1]. Since f P(x0) C, we also know f P(x0)C and the smoothed classifier s prediction on x0 +δ can be uniquely determined. Then, following the same argument in Theorem 1 s proof, we can determine the tightest possible certified robust radius rmulti DSRS. Remark. To achieve the tightest possible certified radius rtight, for binary classification, we only need one extra scalar as the additional information (QA), while for multiclass classification, we need (C 1) extra scalars as the additional information (Qmulti A RC 1). Following the convention as discussed in Appendix A, we are interested in tight certification under the binary classification for sampling efficiency concerns. Therefore, we focus on using only one extra scalar (QA) to additional information in DSRS. In both Theorem 1 and Corollary 1, we only need finite quantities to achieve tight certification for any smoothed classifier. In contrast, other existing work requires infinite quantities to achieve such optimal tightness (Mohapatra et al., 2020, Section 3.1). Double Sampling Randomized Smoothing F.3. Proof of Theorem 2 The proof of Theorem 2 is a bit complicated, which relies on several propositions and lemmas along with theoretical results in Section 5. At high level, based on the standard Gaussian distribution s property (Proposition F.1), we find QA = 1 under concentration property (Lemma F.2). 
With QA = 1, we derive a lower bound of r DSRS in Lemma F.3. We then use: (1) the concentration of beta distribution Beta( d 1 2 ) (see Lemma F.4) for large d; (2) the relative concentration of gamma Γ(d/2, 1) distribution around mean for large d (see Proposition F.5 and resulting Fact F.7); and (3) the misalignment of gamma distribution Γ(d/2 k, 1) s mean and median for small (d/2 k) (see Proposition F.6) to lower bound the quantity in Lemma F.3 and show it is large or equal to 0.5. Then, using the conclusion in Section 5 we conclude that r DSRS 0.02σ Proposition F.1. If random vector ϵ Rd follows standard Gaussian distribution N(σ), then Pr[ ϵ 2 T] = ΓCDFd/2 where ΓCDFd/2 is the CDF of gamma distribution Γ(d/2, 1). Proof of Proposition F.1. According to (Lozier, 2003, Eqn. 5.19.4), he volume of a d-dimensional ball, i.e., d-ball, with radius r is Vd(r) = πd/2 2 +1)rd. Thus, Vol({ϵ : ϵ 2 = r}) = Vd(r) = dπd/2 2 + 1)rd 1. (16) 1 (2πσ2)d/2 exp r2 Vol({ϵ : ϵ 2 = r})dr 1 (2πσ2)d/2 exp r2 2 + 1)rd 1dr 1 (2σ2)d/2Γ( d 0 exp( r)rd/2 1dr = ΓCDFd/2 With Proposition F.1, now we can show that QA = 1 under the condition of Theorem 2 as stated in the following lemma. Lemma F.2. Suppose F0 satisfies (σ, Pcon)-concentration property at input point x0 Rd, with additional smoothing distribution Q = N g trunc(k, T, σ) where T 2 = 2σ2ΓCDF 1 d/2(Pcon) and d/2 15 k < d/2, we have QA = Pr ϵ Q[F0(x0 + ϵ) = y0] = 1. (17) Proof of Lemma F.2. According to Definition 3, for T that satisfies Pr ϵ N(σ)[ ϵ 2 T ] = Pcon (18) Pr ϵ N(σ)[F0(x0 + ϵ) = y0 | ϵ 2 T ] = 1. (19) With Eqn. (18), from Proposition F.1, we have T 2 = 2σ2ΓCDF 1 d/2(Pcon). (20) Thus, Eqn. (19) implies Pr ϵ N (σ)[F0(x0 + ϵ) = y0 | ϵ 2 q 2σ2ΓCDF 1 d/2(Pcon)] = 1. (21) Notice that N(σ) has finite and positive density anywhere within {ϵ : ϵ 2 T }. Thus, F0(x0 + ϵ) = y0 for any ϵ with ϵ 2 T unless a zero-measure set. Now, we consider Q. Q = N g trunc(k, T, σ) where T = T , and Ntrunc has finite and positive density anywhere within {ϵ : ϵ 2 T } \ {0}. Thus, QA = Pr ϵ N g trunc(k,T,σ)[F0(x0 + ϵ) = y0] = Pr ϵ N g(k,σ)[F0(x0 + ϵ) = y0 | ϵ 2 T] ( ) = Pr ϵ N g(k,σ)[F0(x0 + ϵ) = y0 | ϵ 2 T, ϵ = 0] In the above equations, ( ) is because Pr ϵ N g(k,σ)[ϵ = 0 | ϵ 2 T] lim r 0 Pr ϵ N g(k,σ)[ ϵ 2 r | ϵ 2 T] = lim r 0 C Z r 0 x 2k exp x2 0 dxd 2k 1 exp x2 where C is a constant, and the last equality is due to d/2 > k d 2k 1 0. With QA = 1, we can have a lower bound of r DSRS as stated in the following lemma. Double Sampling Randomized Smoothing Lemma F.3. Under the same condition as in Lemma F.2, we let Beta CDF d 1 2 be the CDF of distributon s.t.Et Γ( d 2 k)Beta CDF d 1 (22) and let r DSRS be the tightest possible certified radius in DSRS under ℓ2 when smoothing distribution P = N g(k, σ), then r DSRS r0. (23) Proof of Lemma F.3. The proof shares the same core methodology as DSRS computational method introduced in Section 5. Basically, according to Eqn. (9), for any radius r, let δ = (r, 0, . . . , 0)T, if Cδ(PA, QA) > 0.5, then r DSRS r, where QA = 1 according to Lemma F.2, and by the (σ, Pcon)-concentration property PA Pr ϵ N g(k,σ)[ ϵ 2 T]. (24) Therefore, to prove the lemma, we only need to show that when 2 k)Beta CDF d 1 (25) for any δ = (u, 0, . . . , 0)T, Cδ(PA, QA) > 0.5. By definition (Eqn. (8)), = min f Eϵ P[f(ϵ + δ)] s.t. Eϵ P[f(ϵ)] = PA, Eϵ Q[f(ϵ)] = QA, 0 f(ϵ) 1 ϵ Rd min f Eϵ P[f(ϵ + δ)] s.t. Eϵ Q[f(ϵ)] = 1, 0 f(ϵ) 1 ϵ Rd =Eϵ P[f(ϵ + δ)] where f(ϵ) = 1, ϵ 2 T 0. 
ϵ 2 > T We now compute V : V =Eϵ P[ ϵ + δ 2 T] Rd p(x)I[ x + δ 2 T]dx p(x)=y I[ x+δ 2 T p(x)=y I[ x+δ 2 T dx r p(r 1 p (y)) 0 ydy 2πd/2 2) r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[ x + δ 2 T | p(x) = y] 0 rp(t)dt2πd/2 2) td 1 Pr[ x + δ 2 T | x 2 = t] 1 (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k)t 2k exp t2 2) td 1 Pr[ x + δ 2 T | x 2 = t]dt 0 td/2 k 1 exp( t) Pr[ x + δ 2 T | x 2 = σ 2 k) Pr[ x + δ 2 T | x 2 = σ In above equations, (1) follows from level-set sliced integration extended from (Yang et al., 2020) and p(x) is the density of distribution P = N g(k, σ) at point x; in (2) we define rp( x 2) := p(x) noting that P is ℓ2 symmetric and all x with same ℓ2 length having the same p(x), and we have p(x) 2 = r p(r 1 p (y)) since y = rp( x 2) and rp is monotonically decreasing; (3) uses Vol({x : p(x) = y}) = Vol({x : x 2 = r 1 p (y)}) 2) r 1 p (y)d 1; (26) (4) changes the integration variable from y to t = r 1 p (y); and (5) injects the concrete expression of rp. Now, we inspect Pr[ x+δ 2 T | x 2 = σ 2t]: When x 2 = σ 2t, Pd i=1 x2 i = 2tσ 2. Meanwhile, x+δ 2 T means (x1 + u)2 + Pd i=2 x2 i T 2. Thus, when x 2 = σ x + δ 2 T x1 σ 2t T 2 u2 2tσ 2 According to (Yang et al., 2020, Lemma I.23), for x uniformly sampled from sphere with radius σ 2t, the compo- nent coordinate 1 + x1 σ 2t 2 Beta( d 1 Pr[ x + δ 2 T | x 2 = σ =Beta CDF d 1 1 + T 2 u2 2tσ 2 =Beta CDF d 1 Double Sampling Randomized Smoothing Finally, we get V = Et Γ( d 2 k,1)Beta CDF d 1 (29) In other words, when 2 k,1)Beta CDF d 1 (30) we have V 0.5, and thus Cδ(PA, QA) V 0.5, r DSRS u, which concludes the proof. We require the following property of Beta CDF. Lemma F.4. There exists d0 N+, for any d d0, Beta CDF d 1 2 (0.6) 0.999. (31) Proof of Lemma F.4. We let v Beta( d 1 E[v] = 1/2, (32) Var[v] = ( d 1 Now, applying Chebyshev s inequality, we have Pr[|v 0.5| 0.1] 1 0.04d. (34) Beta CDF d 1 2 (0.6) (35) = Pr[v < 0.6] (36) =1 Pr[v 0.6] (37) 1 Pr[|v 0.5| 0.1] 1 1 0.04d. (38) Thus, when d 25000, Beta CDF d 1 2 (0.6) 0.999. We also require the following properties of the gamma distribution. Proposition F.5. For any Pcon (0, 1), there exists d0 N+, for any d d0, ΓCDF 1 d/2(Pcon) 0.99 d Proof of Proposition F.5. We let v Γ(d/2), then E[v] = d/2, (40) Var[v] = d/2. (41) We now apply Chebyshev s inequality and get Pr[v < 0.99 d/2] Pr[|v d/2| > 0.01 d/2] Thus, for any Pcon (0, 1), when d 20000 d Pcon, (42) i.e., ΓCDF 1 d/2(Pcon) 0.99 d Proposition F.6. When d/2 15 k < d/2, k, d N+, Pr t Γ(d/2 k) 2 k 0.5 0.999. (43) Proof of Proposition F.6. We prove the proposition by enumeration. Notice that d/2 k {0.5, 1.0, , 14.5, 15.0}, we enumerate ΓCDFd/2 k(0.98(d/2 k)) for each (d/2 k) and get the following table. d 2 k ΓCDFd/2 k(0.98(d/2 k)) d 2 k ΓCDFd/2 k(0.98(d/2 k)) 0.5 0.6778 8.0 0.5245 1.0 0.6247 8.5 0.5224 1.5 0.5990 9.0 0.5204 2.0 0.5831 9.5 0.5186 2.5 0.5718 10.0 0.5168 3.0 0.5632 10.5 0.5152 3.5 0.5564 11.0 0.5136 4.0 0.5507 11.5 0.5121 4.5 0.5459 12.0 0.5107 5.0 0.5418 12.5 0.5093 5.5 0.5381 13.0 0.5080 6.0 0.5349 13.5 0.5068 6.5 0.5319 14.0 0.5056 7.0 0.5292 14.5 0.5044 7.5 0.5268 15.0 0.5033 On the other hand, 0.5 0.999 0.5001, which concludes the proof. Now we are ready to prove the main theorem. Proof of Theorem 2. According to Lemma F.3, we only need to show that for u = 0.02σ d, Eqn. (22) holds. For sufficiently large d, indeed, Et Γ(d/2 k)Beta CDF(d 1)/2 0.999Et Γ(d/2 k)I ( ) 0.999Et Γ(d/2 k)I t 0.98 d Proposition F.6 0.999 0.5 0.999 = 0.5. Thus, from Lemma F.3 we have r DSRS u = 0.02 The inequality ( ) follows from Fact F.7. 
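Proposition F.6 above is proved by enumerating ΓCDF_{d/2−k}(0.98 (d/2−k)) over the 30 admissible values of d/2 − k. For readers who wish to reproduce that table, the following is a minimal SciPy sketch; it is purely a sanity check and is not part of the DSRS certification pipeline:

```python
from scipy.stats import gamma

# Reproduce (up to rounding) the enumeration in Proposition F.6:
# GammaCDF_{d/2 - k}(0.98 * (d/2 - k)) for d/2 - k in {0.5, 1.0, ..., 15.0},
# and confirm that every entry exceeds 0.5.
for half_steps in range(1, 31):
    m = half_steps / 2.0                  # m = d/2 - k
    val = gamma.cdf(0.98 * m, a=m)        # Gamma(shape=m, scale=1) CDF
    assert val > 0.5                      # each tabulated value is above 0.5
    print(f"d/2 - k = {m:4.1f}:  GammaCDF = {val:.4f}")
```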
Double Sampling Randomized Smoothing Fact F.7. Under the condition of Theorem 2, for sufficiently large d, 2t 0.6. (44) Proof of Fact F.7. 2t u)2 2.4uσ 2t T 2 (x u)2 2.4ux x2 + 0.4ux + u2 T 2 0. From Proposition F.5, we have =0.0004dσ2 2σ2ΓCDF 1 d/2(Pcon) 0.0004dσ2 0.99dσ2 < 0. x2 + 0.4ux + u2 T 2 0 0.16u2 4(u2 T 2) 2 = 0.2u + p T 2 0.96u2)2 Again, from Proposition F.5, =2σ2ΓCDF 1 d/2(Pcon) 0.96 0.0004dσ2 (0.99 0.96 0.0004)dσ2, and therefore T 2 0.96u2)2 dσ2 0.004 + 0.99 0.96 0.0004 2 0.9816dσ2 0.98dσ2. (47) Then, T 2 0.96u2)2 F.4. Theorem 6 Theorem 6. Let d be the input dimension and F0 be the base classifier. For an input point x0 Rd with true class y0, suppose F0 satisfies (σ, Pcon)-Concentration property and Prϵ N(σ)[F0(x0 + ϵ) = y0] = Pcon where Pcon < 1. The smoothed classifier e F P 0 is constructed from F0 and smoothed by generalized Gaussian P = N g(k0, σ) where k0 is a constant independent of input dimension d. Then, for any constant c > 0, there exists d0, such that when input dimension d d0, any method cannot certify ℓ2 radius c d, where T = σ q 2ΓCDF 1 d/2(Pcon) and ΓCDFd/2 is the CDF of gamma distribution Γ(d/2, 1). We defer the proof to Appendix F.5. This theorem suggests that, if we use generalized Gaussian whose k is a constant with respect to input dimension d or use standard Gaussian (whose k = 0 is a constant) for smoothing, we cannot achieve Ω( d) certified radius rate from DSRS and any other certification method. F.5. Proof of Theorem 6 The proof of Theorem 6 is based on three lemmas listed below. Lemma F.8. Given k0 N, for any ϵ > 0, there exists d0, such that when d > d0, Lemma F.9. Given Pcon 0, for any ϵ > 0, there exists d0, such that when d > d0, 2ΓCDF 1 d/2(Pcon) σ p (1 + ϵ)d. (49) Lemma F.10. For any ϵ > 0, there exists d0, such that when d > d0, Beta CDF d 1 2 (0.5 ϵ) 0.01. (50) Proofs of these lemmas are based on Chebyshev s inequality. Proof of Lemma F.8. For t Γ(d/2 k0, 1), we have E[t] = d/2 k0, Var[t] = d/2 k0. (51) By Chebyshev s inequality, Pr t (1 ϵ) d Pr |t E[t]| ϵ d 2 k0). (52) Picking d0 = 2 0.99 0.48ϵ2 + k0 concludes the proof. Double Sampling Randomized Smoothing Proof of Lemma F.9. We define random variable v Γ(d/2, 1), so E[v] = d/2, Var[v] = d/2. By Chebyshev s inequality, Pr[v (1 + ϵ)d/2] 1 Pr[v (1 + ϵ)d/2] 1 Pr[|v E[v]| ϵd/2] Let d0 = 2 ϵ2(1 Pcon). Thus, when d > d0, Pr[v (1 + ϵ)d/2] 1 2 d0ϵ2 = Pcon, which implies that ΓCDFd/2((1 + ϵ)d/2) Pcon and ΓCDF 1 d/2(Pcon) (1 + ϵ)d/2 and concludes the proof. Proof of Lemma F.10. We define random variable v Beta( d 1 2 ), and we have E[v] = 1/2, Var[v] = 1 4d. By Chebyshev s inequality, Pr[v 0.5 ϵ] Pr[|v E[v]| ϵ] 1 4dϵ2 . (54) Let d0 = 25 dϵ2 , when d > d0, Pr[v 0.5 ϵ] 0.01 and hence Beta CDF d 1 2 (0.5 ϵ) 0.01. Now we are ready to prove the main theorem. Proof of Theorem 6. According to the above three lemmas, we pick d0, such that when d > d0, the followings hold simultaneously. 2ΓCDF 1 d/2(Pcon) σ Beta CDF d 1 We define vector δ = (c d, 0, 0, . . . , 0)T. Since F0 satisfies (σ, Pcon)-concentration property and Prϵ N(σ)[F0(x0 + ϵ) = y0] = Pcon, up to a set of zero measure, the region {ϵ : F0(x0+ϵ) = y0} and region {ϵ : ϵ 2 T} coincide. We now show that Eϵ P =N g(k0,σ)[F0(x0+δ+ϵ) = y0] < 0.5 when c σ p Eϵ P [F0(x0 + δ + ϵ) = y0] = Pr ϵ P [ δ + ϵ 2 T] 2 k0,1)Beta CDF d 1 (from Eqn. (29)) 0.99Et Γ( d + 0.01 (by Eqn. (57) and Beta CDF ( ) 1) =0.99 Pr t Γ( d + 0.01. (58) T 2 2tσ2 d d 2k0 dc2 + c2d Eqn. (56) = 1 + c2 dσ2 2tσ2 d d 2k0 dc2 (60) 2t d 2k0 0. 
We now inject t = 0 and t = 1 c2 8σ2 d 2 k0 to the LHS of Eqn. (60). When t = 0, LHS of Eqn. (60) = 1 + c2 When t = 1 c2 8σ2 d LHS of Eqn. (60) 8 dσ2 + dc2 8 dc2 + dc2 4 dc2 + dc2 Notice that the LHS of Eqn. (60) is a parabola with negative second-order coefficient. Thus, Eqn. (60) = t 0, 1 c2 Double Sampling Randomized Smoothing 0.99. (by Eqn. (55)) (62) Plugging this inequality to Eqn. (58), we get Eϵ P [F0(x0 +δ +ϵ) = y0] 0.99 0.48 0.99 +0.01 = 0.49. (63) As a result, when c σ p 8/7, the smoothed classifier e F P 0 is not robust given the perturbation δ = (c d, 0, 0, , 0)T, since there may exist another y = y0 with Eϵ P [F0(x0 + δ + ϵ) = y ] 0.51 so e F P 0 (x0 + δ) = y = y0. When c > σ p 8/7, within the ℓ2 radius ball c d, there exists perturbation vector δ = (c d, 0, 0, , 0)T fooling smoothed classifier e F P 0 where c = σ p 8/7. Hence, for any c > 0, there exists a perturbation within ℓ2 ball with radius c d, such that smoothed classifier e F P 0 can be fooled, and then any robustness certification method cannot certify ℓ2 radius c d since the smoothed classifier itself is not robust. G. Proofs of DSRS Computational Method G.1. Proof of Strong Duality (Theorem 3) Proof of Theorem 3. We write down the Lagrangian dual function of Eqn. (8a): Λ(f, λ1, λ2) :=Eϵ P[f(δ + ϵ)] λ1 (Eϵ P[f(ϵ)] PA) λ2 (Eϵ Q[f(ϵ)] QA) . (64) Then, from C s expression (Eqn. (8)), we have = min f Eϵ P[f(ϵ + δ)] s.t. 0 f(ϵ) 1 ϵ Rd, Eϵ P[f(ϵ)] = PA, Eϵ Q[f(ϵ)] = QA = min f:Rd [0,1] max λ1,λ2 R Λ(f, λ1, λ2) (i) max λ1,λ2 R min f:Rd [0,1] Λ(f, λ1, λ2) (ii) = max λ1,λ2 R Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] λ1 Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] λ2 Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] + λ1PA + λ2QA. (65) In the above equation, (i) is from the min-max inequality. For completeness, we provide the proof as such: Define g : R2 R such that g(λ1, λ2) := minf:Rd [0,1] Λ(f, λ1, λ2). As a result, for any λ1, λ2 R and any f : Rd [0, 1], g(λ1, λ2) Λ(f, λ1, λ2). So for any f : Rd [0, 1], maxλ1,λ2 R g(λ1, λ2) maxλ1,λ2 R Λ(f, λ1, λ2), which implies max λ1,λ2 R g(λ1, λ2) min f:R [0,1] max λ1,λ2 R Λ(f, λ1, λ2), (66) where LHS is the RHS of (i) and RHS is the LHS of (i). In above equation, (ii) comes from a closed-form solution of f for Λ(f, λ1, λ2) given (λ1, λ2) R2. Notice that we can rewrite Λ(f, λ1, λ2) as an integral over Rd: Λ(f, λ1, λ2) =Eϵ P[f(δ + ϵ)] λ1Eϵ P[f(ϵ)] λ2Eϵ Q[f(ϵ)] + λ1PA + λ2QA Rd f(x) (p(x δ) λ1p(x) λ2q(x)) dx + λ1PA + λ2QA. (67) We would like to minimize over f : Rd [0, 1] in Eqn. (67) and simple greedy solution reveals that we should choose ( 1, p(x δ) λ1p(x) λ2q(x) < 0 0. p(x δ) λ1p(x) λ2q(x) 0 (68) We inject this f into Eqn. (67) and get min f:Rd [0,1] Λ(f, λ1, λ2) = Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] + λ1 PA Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] QA Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] . (69) Hence (ii) holds. On the other hand, we know that Dδ(PA, QA) (defined by Eqn. (10)) is feasible by theorem statement. Denote (λ 1, λ 2) R2 to a feasible solution to Dδ(PA, QA) and d to the objective value, then from the constraints of (D) we know Pr ϵ P[p(ϵ δ) < λ 1p(ϵ) + λ 2q(ϵ)] = PA, Pr ϵ Q[p(ϵ δ) < λ 1p(ϵ) + λ 2q(ϵ)] = QA, Pr ϵ P[p(ϵ) < λ 1p(ϵ + δ) + λ 2q(ϵ + δ)] = d . Plugging in these equalities into Eqn. (65), we have Cδ(PA, QA) d λ1PA λ2QA+λ1PA+λ2QA = d . (71) Double Sampling Randomized Smoothing At the same time, we define function f : Rd [0, 1] such that f (x) = I[p(x δ) λ 1p(x) λ 2q(x) < 0]. (72) From Eqn. (70), f satisfies the constraints of (C) (Eqns. (8b) and (8c)). 
Since (C) minimizes over all possible functions f : Rd [0, 1], we have Cδ(PA, QA) Eϵ P[f (ϵ + δ)] = d . (73) Combining Eqns. (71) and (73), we get Cδ(PA, QA) = d and hence the strong duality holds. G.2. Proofs of Propositions 1 and 2 and Theorem 4 Proof of Proposition 1. Suppose f1 is the optimal solution to Cδ(P 1 A, Q1 A) and f2 is the optimal solution to Cδ(P 2 A, Q2 A). Due to the linearity of expectation, (f1 + f2)/2 satisfies all the constraints of Eqn. (8) for PA = (P 1 A+ P 2 A)/2 and QA = (Q1 A + Q2 A)/2, i.e., (f1 + f2)/2 is feasible for PA = (P 1 A + P 2 A)/2 and QA = (Q1 A + Q2 A)/2 with objective value Cδ(P 1 A, Q1 A) + Cδ(P 2 A, Q2 A) /2. Thus, we have P 1 A + P 2 A 2 , Q1 A + Q2 A 2 1 2 Cδ(P 1 A, Q1 A) + Cδ(P 2 A, Q2 A) (74) since C is a minimization problem. By definition, Cδ(PA, QA) is convex. Remark. Since Cδ(PA, QA) is defined on a compact R2 subspace, the convexity implies continuity. The continuity property is used in the following proof of Theorem 4. Proof of Proposition 2. Here, we only prove the monotonicity for functions x 7 miny Cδ(x, y) and x 7 arg miny Cδ(x, y). The same statement for y 7 minx Cδ(x, y) and y 7 minx Cδ(x, y) is then straightforward due to the symmetry. For simplification, we define C δ : x 7 miny Cδ(x, y) and let Cδ : x 7 arg miny Cδ(x, y). We notice that both functions can be exactly mapped to the constrained optimization problem (C ) which removes the second constraint in Eqn. (8b) in (C): minimize f Eϵ P[f(δ + ϵ)] (75a) s.t. Eϵ P[f(ϵ)] = x, (75b) 0 f(ϵ) 1 ϵ Rd. (75c) C δ(x) is the optimal objective to (C ) and Cδ(x) is Eϵ Q[f (ϵ)] where f is the optimal solution. Either based on Neyman-Pearson lemma [1933] or strong duality, (C ) is equivalent to (D ) defined as such: Pr ϵ P[p(ϵ) < λp(ϵ + δ)] (76a) s.t. Pr ϵ P[p(ϵ δ) < λp(ϵ)] = x. (76b) For a given x, we only need to find λ satisfying Eqn. (76b). Then, C δ(x) = Pr ϵ P[p(ϵ) < λp(ϵ + δ)], (77) Cδ(x) = Pr ϵ Q[p(ϵ δ) < λp(ϵ)]. (78) Now the monotonicity (what we would like to prove) is apparent. For x1 < x2, from Eqn. (76b), we have λ1 < λ2, since the probability density function p is non-negative. Thus, we inject λ1 and λ2 into Eqn. (77) and Eqn. (78), and yield C δ(x1) C δ(x2), Cδ(x1) Cδ(x2), (79) which concludes the proof. Proof of Theorem 4. We discuss the cases according to the branching statement in the algorithm (Alg. 1). if q QA, by definition we have Cδ(PA, q) Cδ(PA, y) for arbitrary y. According to Proposition 2, we also have Cδ(PA, q) Cδ(x, y) for arbitrary x PA and arbitrary y. Given that (PA, q) [PA, PA] [QA, QA], (PA, q) solves Eqn. (11); if q > QA, by convexity, Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]. We further show that Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]: assume that this is not true, by Proposition 1, the function y 7 arg minx Cδ(x, y) has function value larger than PA at y = QA. Since Cδ(0, 0) = 0 is the global minimum of Cδ, the function value at y = 0 is x = 0. By Proposition 2, there exists y0 [0, QA] such that PA = arg minx Cδ(x, y0). Then, we get (i.) Cδ(arg min x Cδ(x, QA), QA) (ii.) Cδ(PA, QA), where (i.) follows from Proposition 2 for y 7 arg minx Cδ(x, y); (ii.) is implied in the meaning of arg minx Cδ(x, QA). Since y0 [0, QA], Eqn. (80) Double Sampling Randomized Smoothing implies that q should be in [0, QA] as well, which violates the branching condition. Thus, Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]. Using Proposition 2 for function x 7 arg miny Cδ(x, y) in interval [PA, PA] together with Proposition 1, we get Cδ(x, QA) Cδ(x, y) for x [PA, PA] and y [QA, QA]. 
Thus, (PA, QA) solves Eqn. (11). if p PA, by definition we have Cδ(max{p, PA}, QA) Cδ(x, QA) for x [PA, PA]. According to Proposition 2 and condition q QA, we further have Cδ(max{p, PA}, QA) Cδ(max{p, PA}, y) Cδ(x, y) for arbitrary x [PA, PA] and y [QA, QA]. Given that (max{p, PA}, QA) [PA, PA] [QA, QA], (p, QA) solves Eqn. (11); if p > PA, according to Proposition 1, Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]. We further show that Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]: assume that this is not true, by Proposition 1, the function x 7 arg miny Cδ(x, y) has function value larger than QA at x = PA. Since Cδ(0, 0) is the global minimum, by Proposition 2 on x 7 arg miny Cδ(x, y), there exists x0 [0, PA] such that QA = arg miny Cδ(x0, y). Then, we get Cδ(PA, arg min y Cδ(PA, y)) following the similar deduction as in Eqn. (80). Since x0 [0, PA], Eqn. (81) implies that p should be in [0, PA] as well which violates the branching condition. Thus, Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]. Using Proposition 2 for function y 7 arg minx Cδ(x, y) in interval [QA, QA] together with Proposition 1, we get Cδ(PA, y) Cδ(x, y) for y [QA, QA] and x [PA, PA]. Thus, (PA, QA) solves Eqn. (11). G.3. Proof of Theorem 5 Proof of Theorem 5. We first define rp( ϵ 2) = p(ϵ) and rq( ϵ 2) = q(ϵ), then easily seen the concrete expressions of rp and rq are: rp(t) = 1 (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k), (82) rq(t) = ν (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k), (83) ν := Γ(d/2 k) γ(d/2 k, T 2 2σ 2 ) > 1 (84) and γ is the lower incomplete Gamma function. Now we use level-set integration similar as the proof in Lemma F.3 to get the expressions of P, Q, and R respectively. Since P and Q are ℓ2-symmetric, without loss of generality, we let δ = (r, 0, . . . , 0)T. Suppose PT = rp(T). = Pr ϵ P=N g(k,σ)[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] Rd I[p(x δ) < λ1p(x) + λ2q(x)]p(x)dx p(x)=y p(x δ)<λ1p(x) dx p(x) 2 + p(x)=y p(x δ)<(λ1+λ2ν)p(x) 0 ydy 2πd/2 Γ(d/2)r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[p(x δ) λ1p(x) | p(x) = y]+ Z PT ydy 2πd/2 Γ(d/2)r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[p(x δ) < (λ1 + λ2ν)p(x) | p(x) = y] y=rp(t) = Z T rp(t)dt 2πd/2 Pr[p(x δ) < λ1p(x) | x 2 = t]+ Z T 0 rp(t)dt 2πd/2 Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = t] = 1 Γ(d/2 k) T 2/(2σ 2) td/2 k 1 exp( t)dt Pr[p(x δ) < λ1p(x) | x 2 = σ Z T 2/(2σ 2) 0 td/2 k 1 exp( t)dt Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = σ Double Sampling Randomized Smoothing =Et Γ(d/2 k,1) u3(t, λ1), t T 2/(2σ 2) u3(t, λ1 + λ2ν). t < T 2/(2σ 2) Here, u3(t, λ) = Pr[p(x δ) < λp(x) | x 2 = σ = Pr ϵ Q=N g trunc(k,T,σ)[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] x 2 T I[p(x δ) < λ1p(x) + λ2q(x)]q(x)dx q(x)=y p(x δ)<(λ1+λ2ν)p(x) y=rq(t) = Z T 0 rq(t)dt 2πd/2 Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = t] =νEt Γ(d/2 k,1)u3(t, λ1 + λ2ν) I t T 2 Now, for R: = Pr ϵ P=N g(k,σ)[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] Rd I[p(x) < λ1p(x + δ) + λ2q(x + δ)]p(x)dx =Et Γ(d/2 k,1)u4(t, λ1, λ2). Here, u4(t, λ1, λ2) = Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ 2t) | x 2 = σ Plugging Lemma G.1 into P(λ1, λ2) and Q(λ1, λ2), and then plugging Lemma G.2 into R(λ1, λ2), we yield the desired expressions in theorem statement. Lemma G.1. Under the condition of Theorem 5, let δ = (r, 0, . . . , 0)T, u3(t, λ) := Pr[p(x δ) < λp(x) | x 2 = σ = Beta CDF d 1 2t 2kσ 2W( t (85) where W is the principal branch of Lambert W function. Proof of Lemma G.1. p(x δ) < λp(x) means that rp( x δ 2) < λrp( x 2) and therefore x δ 2 > r 1 p (λrp( x 2)). Given that x 2 = σ 2t, we have i=2 x2 i = 2tσ 2, i=2 x2 i r 1 p (λrp(σ This is equivalent to x1 2tσ 2 + r2 r 1 p (λrp(σ From the expression of rp (Eqn. 
(82)), we have r 1 p (λrp(σ 2t))2 = 2σ 2k W t k e t k λ 1 Thus, when x 2 = σ 2t and x uniformly sampled from this sphere, p(x δ) < λp(x) x1 2tσ 2 + r2 2σ 2k W( t 2t 2 (r + σ 2t)2 2σ 2k W t ke t k λ 1 (89) According to (Yang et al., 2020, Lemma I.23), for x uniformly sampled from sphere with radius σ 2t, the compo- nent coordinate 1 + x1 σ 2t 2 Beta( d 1 2 ). Combining Eqn. (89) with this result concludes the proof. Lemma G.2. Under the condition of Theorem 5, let δ = (r, 0, . . . , 0)T, u4(t, λ1, λ2) := Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ = u1(t), λ1 0 u1(t) + u2(t), λ1 > 0 (90) where u1(t) =Beta CDF d 1 min{T 2, 2σ 2k W ( t k e t k (λ1 + νλ2) 1 k )} u2(t) = max 0, Beta CDF d 1 2σ 2k W ( t k e t k λ 1 k 1 ) (σ Beta CDF d 1 Proof of Lemma G.2. Under the condition that x 2 = σ 2t, we separate two cases: q(x+δ) > 0 and q(x+δ) = 0, which corresponds to x + δ 2 T and x + δ 2 > T. Double Sampling Randomized Smoothing (1) q(x + δ) > 0: Notice that q(x + δ) > 0 x + δ 2 2 T 2 x1 T 2 2tσ 2 r2 From Eqn. (83), q(x + δ) = νp(x + δ). Thus, λ1p(x + δ) + λ2q(x + δ) > rp(σ (λ1 + νλ2)p(x + δ) rp(σ x + δ 2 2 r 1 p 2t) λ1 + νλ2 x1 r 1 p rp(σ Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ 2t) q(x + δ) > 0 | x 2 = σ x1 min r 1 p 2 , T 2 2tσ 2 r2 (93) By (Yang et al., 2020, Lemma I.23) and Eqn. (88), we thus have Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) > 0 | x 2 = σ =Beta CDF d 1 min{T 2, 2σ 2k W( t ke t k (λ1 + νλ2) 1 k )} (2) q(x + δ) = 0: q(x + δ) = 0 x1 > T 2 2tσ 2 r2 When λ1 0, the condition λ1p(x + δ) + λ2q(x + δ) = λ1p(x + δ) > rp(σ 2t) can never be satisfied. When λ1 > 0, we have λ1p(x + δ) + λ2q(x + δ) > rp(σ x + δ 2 2 r 1 p x1 r 1 p rp(σ Therefore, when λ1 0, Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ 2t] = 0. (97) When λ1 > 0, the condition that lambda1p(x + δ) + λ2q(x + δ) > rp(σ 2t) is equivalent to T 2 2tσ 2 r2 2r , r 1 p rp(σ (98) Again, by (Yang et al., 2020, Lemma I.23) and Eqn. (88), we have Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ 0, Beta CDF d 1 Beta CDF d 1 =u2(t). (99) (3) Combining the two cases: Now we are ready to combine the two cases. Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ = Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) > 0 | x 2 = σ Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ = u1(t), λ1 0 u1(t) + u2(t). λ1 > 0 G.4. Discussion on Uniqueness of Feasible Pair As we sketched in Section 5.3, in general cases, the pair that satisfies constraints in Eqn. (12) is unique. We formally state this finding and prove it in Theorem 7. Theorem 7 (Uniqueness of Feasible Solution in Eqn. (12)). Under the same setting of Theorem 5, if QA (0, 1) and PA (QA/ν, 1 (1 QA)/ν), then there is the pair (λ1, λ2) that satisfies both P(λ1, λ2) = PA and Q(λ1, λ2) = QA is unique. We prove the theorem in the end of this section, which is based on the strict monotonicity of two auxiliary functions: g(λ1 + νλ2) := Q(λ1, λ2) and h(λ1) (defined in Double Sampling Randomized Smoothing Section 5.3). For other types of smoothing distributions P and Q, in Theorem 11 we characterize and prove a sufficient condition that guarantees the uniqueness of feasible pair. We observe that the feasible region of (PA, QA) is R = {(x, y) : y/ν x 1 (1 y)/ν, 0 y 1}. (101) Therefore, the theorem states that when (PA, QA) is an internal point of R, the feasible solution is unique and we can use our proposed method to find out such a feasible solution and thus solve the dual problem (D). Now, the edge cases are that (PA, QA) lies on the boundary of R. 
We discuss all these cases and show that these boundary cases correspond to degenerate problems that are easy to solve respectively: QA = 0: When QA = 0, and PA (0, 1) (otherwise, trivially R(λ1, λ2) = PA {0, 1} solves (D)), we have λ1 + νλ2 0+ and thus R(λ1, λ2) = Et Γ(d/2 k,1)u2(t). Since u2(t) only involves λ1, we only require λ1 to be unique to deploy the method. Since PA = P(λ1, λ2) = Et Γ(d/2 k,1)u3(t, λ1) I[t T 2/(2σ 2)] and PA (0, 1), by similar arguments as in Theorem 7, we know λ1 is unique. Hence, all feasible pairs give the same R(λ1, λ2), i.e., have the same objective value and the proposed method which computes a feasible one is sufficient for solving (D). QA = 1: When QA = 1 and PA (0, 1), we observe that u1(t) Beta CDF d 1 where equality is feasible with the selected (λ1 + νλ2) + and hence the maximum of Et Γ(d/2 k,1)u1(t) among all feasible (λ1, λ2) is a constant. On the other hand, since u2(t) only involves λ1 which is unique as discussed in QA = 0 case, all feasible pairs give the same value of Et Γ(d/2 k,1)u2(t). As a result, the maximum of R(λ1, λ2) can be computed by adding the unique value of Et Γ(d/2 k,1)u2(t) and the constant corresponding to the maximum of Et Γ(d/2 k,1)u1(t) among all feasible (λ1, λ2), which solves the dual problem (D). PA = QA/ν: We assume PA, QA (0, 1) (otherwise covered by former cases). In this case, λ1 satisfies that h(λ1) = PA QA/ν = 0, so λ1 0+. As a result, u2(t) = 0 for all t > 0 and R(λ1, λ2) = Et Γ(d/2 k,1)u1(t). We observe that u1(t) is only related to (λ1 + νλ2) where (λ1+νλ2) satisfying Q(λ1, λ2) = QA is unique since QA (0, 1). Thus, any feasible (λ1, λ2) would have the same (λ1 + νλ2) and hence leads to the same R(λ1, λ2). So the proposed method which finds one feasible (λ1, λ2) suffices for solving (D). PA = 1 1 QA ν : We again assume PA, QA (0, 1) (otherwise covered by former cases). In this case, λ1 satisfies that h(λ1) = 1 1/ν. Since Et Γ(d/2 k,1)I[t < T 2/(2σ 2)] = 1/ν, we know u3(t, λ1) = 1 for t T 2/(2σ 2) except a zero-measure set, and thus λ1 + . As a result, Et Γ(d/2 k,1)u2(t) = 1 Et Γ(d/2 k,1)Beta CDF d 1 constant. Similar as PA = QA/ν case, feasible (λ1+ νλ2) is unique. Therefore, feasible (λ1, λ2) leads to a unique Et Γ(d/2 k,1)u1(t). We compute out these two quantities Et Γ(d/2 k,1)u2(t) and Et Γ(d/2 k,1)u1(t) so as to obtain the unique R(λ1, λ2) which solves (D). We remark that in practice, we never observe any instances that correspond to these edge cases though we implemented these techniques for solving them. Proof of Theorem 7. The high-level proof sketch is implied in the derivation of our feasible (λ1, λ2) finding method introduced in Section 5.3. We first show h(λ1) is monotonically strictly increasing in a neighborhood of λ1 where h(λ1) = PA QA/ν, so the λ1 that satisfies PA QA/ν is unique. We then define g(γ) = νEt Γ(d/2 k,1)u3(t, γ) I[t T 2/(2σ 2)], and show its strict monotonicity around the neighborhood of λ that satisfies g(γ ) = QA, which indicates that (λ1 + νλ2) that satisfies QA is unique. Combining this two arguments, we know feasible (λ1, λ2) is unique. Note that the key in this proof is the local strict monotonicity, and the global (nonstrict) monotonicity for both h( ) and g( ) is apparent from their expressions. We now present the proofs for the local strict monotonicity of these two functions h and g. (1) h(λ1) is monotonically strictly increasing in a neighborhood of λ1 where h(λ1) = PA QA/ν. From the theorem condition, we know that h(λ1) (0, 1 1/ν). 
Thus, there exists t0 T 2/(2σ 2), such that u3(t0, λ1) (0, 1) (otherwise h(λ1) = 0 or 1 1/ν). From the closed-form equation of u3, for any neighboring λ 1 = λ1, we will have W(t0/ket0/kλ 1/k 1 ) = W(t0/ket0/kλ 1/k 1 ) by the monotonicity of Lambert W function. On the other hand, since u3(t0, λ1) (0, 1) and Beta CDF d 1 2 ( ) is also strict monotonic in the neighborhood, we have u3(t0, λ 1) = u3(t0, λ1). Same happens to t0 s neighborhood, i.e., δ > 0, s.t., t (t0 δ, t0 + δ), u3(t, λ 1) = u3(t, λ1) and sgn(u3(t, λ 1) u3(t, λ1)) is consistent for any t (t0 δ, t0 + δ). As a result, h(λ1) = h(λ 1). By definition and the fact that h( ) is monotonically non-decreasing, the argument is proved. (2) g(γ) is monotonically strictly increasing in a neighborhood of γ where g(γ ) = QA. Double Sampling Randomized Smoothing Since QA (0, 1) by the theorem condition, we know that there exists t0 (0, T 2/(2σ 2)), such that u3(t0, γ ) (0, 1) (otherwise g(γ ) = 0 or 1 which contradicts the theorem condition). Following the same reasoning as in (1) s proof, for any γ = γ that lies in a sufficiently small neighborhood of γ , we have u3(t0, γ ) = u3(t0, γ ), and δ > 0, s.t., t (t0 δ, t0 + δ), u3(t, γ ) = u3(t, γ ) and sgn(u3(t, γ ) u3(t, γ )) is consistent for any t (t0 δ, t0 + δ). As a result, g(γ ) = g(γ ). By definition and the fact that g( ) is monotonically non-decreasing, the argument is proved. H. Extensions of DSRS Computational Methods In this appendix, we exemplify a few extensions of DSRS computational method. H.1. Certification with Standard and Truncated Standard Gaussian In main text and Theorem 5, we focus on DSRS certification with generalized Gaussian as P and truncated generalized Gaussian as Q, which has theoretical advantages (Theorem 2). On the other hand, DSRS can also be applied to other distributions. Concretely, to certify robustness with standard Gaussian as P and truncated standard Gaussian as Q, we can directly plug the following theorem s numerical integration expressions into the described DSRS algorithm (Alg. 2). Theorem 8. In Dδ(PA, QA), let r = δ 2, when P = N(σ) and Q = Ntrunc(T, σ), let ν := ΓCDFd/2(T 2/(2σ2)) 1, R(λ1, λ2) := Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(vϵ + δ)] = Et Γ(d/2,1)u1(t), λ1 0 Et Γ(d/2,1)u1(t) + u2(t), λ1 > 0 where u1(t) =Beta CDF d 1 min{T 2, 2tσ2 + 2σ2 ln(λ1 + νλ2)} u2(t) = max 0, Beta CDF d 1 2tσ2 + 2σ2 ln λ1 (σ Beta CDF d 1 P (λ1, λ2) := Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] = Et Γ(d/2,1) u3(t, λ1), t T 2/(2σ2) u3(t, λ1 + νλ2), t < T 2/(2σ2). where u3(t, λ) = Beta CDF d 1 1 2 + r2 + 2σ2 ln λ Q(λ1, λ2) := Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] = νEt Γ(d/2,1)u3(t, λ1 + νλ2) I[t T 2/(2σ2)]. In above equations, Γ(d/2, 1) is gamma distribution and ΓCDFd/2 is its CDF, and Beta CDF d 1 2 is the CDF of dis- tribution Beta( d 1 Proof of Theorem 8. The proof largely follows the same procedure as the proof of Theorem 5. The only difference is that, since P = N(σ), let rp( ϵ 2) = p(ϵ), different from Eqn. (88), now r 1 p (λrp(σ = 2σ2 ln(λrp(σ 2t) (2πσ2)d/2) = 2σ2 ln λ (2πσ2)d/2 exp 2tσ2 = 2σ2(ln λ t) = 2tσ2 2σ2 ln λ. (103) By plugging this equation into the proof of Theorem 5, we prove Theorem 8. In practice, DSRS with standard Gaussian and truncated standard Gaussian as smoothing distributions gives marginal improvements over Neyman-Pearson-based certification. This is because, for standard Gaussian distribution, the noise magnitude is particularly concentrated on a thin shell as reflected by the green curve in Figure 4. 
As a result, the truncated standard Gaussian as Q either has a tiny density overlap with P or provides highly similar information (i.e., QA PA). In either case, Q provides little additional information. Therefore, in practice, we do not use standard Gaussian and truncated standard Gaussian as P and Q which is also justified by Theorem 2, though DSRS can provide certification for this setting. H.2. Certification with Generalized Gaussian with Different Variances We now consider the robustness certification with smoothing distribution P = N g(k, σ) and additional smoothing distribution Q = N g(k, β) where σ and β are different (i.e., different variance). H.2.1. COMPUTATIONAL METHOD DESCRIPTION Hereinafter, for this P and Q we define the radial density function g(r) := p(x) and h(r) := q(x) for any x 2 = r, where p and q are the density functions of P and Q respectively. (λ1g + λ2h) 1(x) := max y s.t. λ1g(y) + λ2h(y) = x and similarly (λ1g + λ2h) 1(x) := min y s.t. λ1g(y) + λ2h(y) = x. In this case, we can still have the numerical expression for P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) as shown in Theorem 9. Theorem 9. When the smoothing distributions P = N g(k, σ) and additional smoothing distribution Q = Double Sampling Randomized Smoothing Table 3. The numerical integration expression for P, Q, and R (see definition in Theorem 5). See Appendix H.2 for notation description. P(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 2x)2 g 1(λ1g(σ 2x) + λ2h(σ Q(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 2x)2 g 1(λ1g(β 2x) + λ2h(β R(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2x))2 r2 2xσ 2 Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2x))2 r2 2xσ 2 Table 4. Simplified terms in numerical integration for P and Q, where W is the real-valued branch of the Lambert W function. Functions P(λ1, λ2) Q(λ1, λ2) Term to Simplify g 1 λ1g(σ 2x) + λ2h(σ 2x) 2 g 1 λ1g(β 2x) + λ2h(β Simplified k = 0 2σ 2 ln λ1 exp( x) + λ2 σ β d exp σ 2 β 2 x 2σ 2 ln λ1 exp β 2 σ 2 x + λ2 σ β d exp( x) Terms k > 0 2kσ 2W λ1 exp( x) + λ2 σ β d 2k exp σ 2 σ 2 + λ2 σ β d 2k exp( x) 1/k! N g(k, β), let P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) be as defined in Theorem 5, then P, Q, and R can be computed by expressions in Table 3. In Table 3, the numerical integration requires the computation of several inverse functions. In this subsection, we simplify the numerical integration expressions for P and Q by deriving the closed forms of these inverse functions, as shown in Table 4. In the actual implementation of numerical integration, for P and Q, we use these simplified expressions to compute; for R, benefited from the unimodality (Lemma H.2), we deploy a simple binary search to compute. Theorem 10. When the smoothing distributions P = N g(k, σ) and Q = N g(k, β), the terms g 1(λ1g(σ 2x))2 and g 1(λ1g(β 2x) + λ2h(β 2x))2 in P(λ1, λ2) and Q(λ1, λ2) s computational expressions (see Table 3) are equivalent to those shown in Table 4. With the method to compute P, Q, and R for given λ1 and λ2, now the challenge is to solve λ1 and λ2 such that P(λ1, λ2) = PA and Q(λ1, λ2) = QA. Luckily, as Theorem 11 shows, for given PA and QA, such (λ1, λ2) pair is unique. Indeed, such uniqueness holds not only for this P and Q but also for a wide range of smoothing distributions. Theorem 11 (Uniqueness). 
Suppose distributions P and Q s are ℓp-spherically symmetric, i.e., there exists radial density functions g and h such that p(x) = g( x p) and q(x) = h( x p), then if g and h are continuous and g h is continuous and strictly monotonic almost everywhere, for any given (PA, QA) R2 +, there is at most one (λ1, λ2) pair satisfying constraint of Eqn. (12). The proof is shown in the next subsection, which is based on Cauchy s mean value theorem of the probability integral. With Theorem 11 and Proposition 2, we can use joint binary search as shown in Alg. 4 to find λ1 and λ2 which can be viewed as the intersection of two curves. At a high level, Each time, we leverage the monotonicity to get a point (λmid 1 , λmid 2 ) on the P s curve, then compute corresponding Q, and update the binary search intervals based on whether Q(λmid 1 , λmid 2 ) > QA. We shrink the intervals for both λ1 and λ2 (Lines 5 and 7) in Alg. 4 to accelerate the search. The algorithm is plugged into Line 8 of Alg. 2. Algorithm 4 DUALBINARYSEARCH for λ1 and λ2. Data: Query access to P( , ) and Q( , ), PA and QA Result: λ1 and λ2 satisfying constraints P(λ1, λ2) = PA, Q(λ1, λ2) = QA 1 λL 1 M, λU 1 +M, λL 2 M, λU 2 +M ; /* M is a large positive number */ 2 while λU 1 λL 1 > eps do 3 λmid 1 (λL 1 + λU 1 )/2 Binary search for λmid 2 [λL 2 , λU 2 ] such that P(λmid 1 , λmid 2 ) = PA ; /* (λmid 1 , λmid 2 ) lies on red curve */ 4 if Q(λmid 1 , λmid 2 ) < QA then 5 λU 1 λmid 1 , λL 2 λmid 2 ; /* (λmid 1 , λmid 2 ) is right to intersection */ 7 λL 1 λmid 1 , λU 2 λmid 2 ; /* (λmid 1 , λmid 2 ) is left to intersection */ 9 return (λL 1 , λL 2 ) ; /* for soundness: R(λL 1 , λL 2 ) lower bounds (D) */ H.2.2. PROOFS Proof of Theorem 9. The ℓ2-radial density functions of p(x) and q(x) have these expressions: g(r) r k exp( r2/(2σ 2)) and h(r) r k exp( r2/(2β 2)). When r increases, r k, exp( r2/(2σ 2)), and exp( r2/(2β 2)) decrease so that g and h are both strictly decreasing. Therefore, they have inverse functions, which are denoted by g 1 and h 1. Now we are ready to Double Sampling Randomized Smoothing derive the expressions. (I.) We start with P: Rd I [p(x δ) < λ1p(x) + λ2q(x)] p(x)dx p(x)=y p(x δ)<λ1p(x)+λ2h(x) p(x)=y p(x δ)<λ1p(x)+λ2h(x) dx g (g 1(y)) 0 ydy πd/2dg 1(y)d 1 dx g (g 1(y)) Pr p(x δ) < λ1p(x) + λ2q(x) p(x) = y 0 g(t)dt πd/2dtd 1 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kπ d 2 Γ(d/2) Γ(d/2 k) t 2k exp t2 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kΓ( d 2 k) td 2k 1 exp t2 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2σ 2 Pr h p(x δ) < λ1g( (7.) = 1 Γ( d 0 t d 2 2k 1 exp( t) Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ (8.) = Et Γ( d 2 k) Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ As a reminder, d is the input dimension. The Γ( ) here refer to either the Gamma distribution (in t Γ( d 2 k)) or Gamma function (in some denominators). In the above integration: (1.) uses level-set sliced integration as first proposed in (Yang et al., 2020); (2.) leverages the fact that p(x) is ℓ2-symmetric and g ( ) < 0; (3.) injects the surface area of ℓ2-sphere with radius g 1(y); (4.) alters the integral variable: t = g 1(y), which yields dt = dy/g (t) = dy/g (g 1(y)) and y = g(t); (5.) injects the expression of g(t); (6.) alters the integral variable from t to t2; (7.) does re-scaling; and (8.) observes that the integral can be expressed by expectation over Gamma distribution. Due to the isotropy, let r = δ 2, we can deem δ = (r, 0, . . . , 0)T by rotating the axis. 
Then we simplify the probability term by observing that ( x 2 = σ p(x δ) < λ1g(σ 2t) + λ2h(σ i=2 x2 i = 2tσ 2 i=2 x2 i g 1 λ1g(σ 2t) + λ2h(σ = x1 2tσ 2 g 1 λ1g(σ 2t) + λ2h(σ Lemma H.1 (Lemma I.23; (Yang et al., 2020)). If (x1, , xd) is sampled uniformly from the unit sphere Sd 1 Rd, then 2 is distributed as Beta d 1 Combining Lemma H.1 and Appendix H.2.2, we get Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ =Beta CDF d 1 2t) + λ2h(σ (105) Injecting Eqn. (105) into (8.) yields the expression shown in Table 3. (II.) The integration for Q is similar: Rd I[p(x δ) < λ1p(x) + λ2q(x)]q(x)dx q(x)=y p(x δ)<λ1p(x)+λ2q(x) q(x)=y p(x δ)<λ1p(x)+λ2q(x) dx h (h 1(y)) 0 h(t)dt πd/2dtd 1 Pr p(x δ) < λ1p(x) + λ2q(x) x 2 = t (2β 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2β 2 Double Sampling Randomized Smoothing Pr h p(x δ) < λ1g( 2 k) Pr h p(x δ) < λ1g(β 2t) + λ2h(β 2t) x 2 = β Pr h p(x δ) < λ1g(β 2t) + λ2h(β 2t) x 2 = β =Beta CDF d 1 2t) + λ2h(β (III.) Finally, we derive the integration for R: Rd I [p(x δ) < λ1p(x) + λ2q(x)] p(x δ)dx Rd I [p(x) < λ1p(x + δ) + λ2q(x + δ)] p(x)dx p(x)=y p(x)<λ1p(x+δ)+λ2q(x+δ) p(x)=y p(x)<λ1p(x+δ)+λ2q(x+δ) dx g (g 1(y)) 0 g(t)dt πd/2dtd 1 Pr λ1p(x + δ) + λ2q(x + δ) > p(x) x 2 = t (2σ 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2σ 2 Pr h λ1p(x + δ) + λ2q(x + δ) > g( (9.) = Et Γ( d k 2 ) Pr h λ1p(x + δ) + λ2q(x + δ) > g(σ 2t) x 2 = σ To simplify the probability term, this time we have ( x 2 = σ λ1p(x + δ) + λ2q(x + δ) > g(σ i=2 x2 i = 2tσ 2 (λ1g + λ2h) v u u t(x1 + r)2 + = (λ1g + λ2h) p 2tσ 2 + r2 + 2x1r > g(σ where r = δ 2 and we deem δ = (r, 0, . . . , 0)T by rotating the axis. Lemma H.2. The function (λ1g + λ2h) is unimodal in its domain (0, + ). Proof of Lemma H.2. We expand (λ1g+λ2h) then consider its derivative. λ1g(r) + λ2h(r) =C1λ1r 2k exp r2 + C2λ2r 2k exp r2 (110) where C1 and C2 is the constant normalization coefficient of g and h respectively. (λ1g + λ2h) (r) (111) = C1λ12kr 2k 1 exp r2 + C1λ1 r 2k+1 + C2λ22kr 2k 1 exp r2 + C2λ2 r 2k+1 Now we show that (λ1g + λ2h) (r) = 0 (112) has at most one solution. When either λ1 = 0 or λ2 = 0, since both g and h are monotonic, λ1g + λ2 is monotonic and there is no solution. Thus, we assume λ1, λ2 = 0. We observe that (λ1g + λ2h) (r) = 0 (113) C2λ2 2k + r2 σ 2 2k + r2 β 2 = exp r2 C2λ2 2k + x σ 2 2k + x β 2 = exp x 2σ 2 x 2β 2 C1λ1 2k + x β 2 2k + x σ 2 = exp x 2β 2 x 2σ 2 Double Sampling Randomized Smoothing We focus on Eqn. (117). Without loss of generality, we assume σ > β , then both function x 7 2k+x/β 2 2k+x/σ 2 and function x 7 exp x/(2β 2) x/(2σ 2) are monotonically increasing. If λ1λ2 > 0, the LHS and RHS of Eqn. (117) are continuous and monotonic in opposite directions. Thus, there is at most one solution to Eqn. (117). If λ1λ2 < 0, the LHS is monotonic increasing but the derivative is decreasing because C1λ1 2k + x β 2 2k + x σ 2 = C2λ2 C1λ1 1 + x 2kβ 2 1 + x 2kσ 2 (118) where the numerator 1 + x 2kβ 2 is linearly increasing and the denominator 1 + x 2kσ 2 is also linearly increasing. On the other hand, the RHS is monotonic increasing and the derivative is also increasing because 1 2β 2 1 2σ 2 > 0. (119) As a result, the difference function between RHS and LHS is monotone and there is at most one solution. Thus, we have shown Eqn. (117) has at most one solution. Given that (λ1g + λ2h) is also continuous, we thus know the function (λ1g + λ2h) is unimodal. Moreover, since g and h approach 0 when r , (λ1g + λ2h) approaches 0 when r . (λ1g + λ2h) 1(y) := max y s.t. λ1g(y ) + λ2h(y ) = x, (120) (λ1g + λ2h) 1(y) := min y s.t. λ1g(y ) + λ2h(y ) = x. 
(121) Then, Lemma H.2 and (λ1g + λ2h) 0 when r imply that, when y > 0, y0 (λ1g + λ2h) 1(y), (λ1g + λ2h) 1(y) λ1g(y0) + λ2h(y0) > y. (122) We simplify Eqn. (109) by observing that g(σ 2tσ 2 + r2 + 2x1r (λ1g + λ2h) 1(g(σ (λ1g + λ2h) 1(g(σ (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 Combining Lemma H.1 with Eqn. (124) we get Pr h λ1p(x + δ) + λ2q(x + δ) > g(σ 2t) x 2 = σ =Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 (125) Combining the above equation with (9.) yields the expression shown in Table 3. Proof of Theorem 10. To prove the theorem, the main work we need to do is deriving the closed-form expression for the inverse function g 1, where (2σ 2) d 2 kπ d 2 Γ( d 2 k)r 2k exp r2 (I.) When k = 0, (2σ 2π) d 2 exp g 1(y)2 g 1(y)2 = 2σ 2 ln (2σ 2π) d 2 y . (128) (II.) When k > 0, we notice that the Lambert W function W is the inverse function of w 7 wew, i.e., W(x) exp(W(x)) = x. We let the normalizing coefficient of g(r) be (2σ 2) d 2 kπ d 2 Γ( d 2 k). (129) y = Cg 1(y) 2k exp g 1(y)2 y = g 1(y)2k exp g 1(y)2 k = g 1(y)2 exp g 1(y)2 k = W 1 g 1(y)2 g 1(y)2 = 2σ 2k W Plugging in Eqns. (128) and (134) to g 1(λ1g(σ 2x) + λ2h(σ 2x))2 and g 1(λ1g(β 2x) + λ2h(β 2x))2 for k = 0 and k > 0 case, then results in Table 4 follow from algebra. Double Sampling Randomized Smoothing Proof of Theorem 11. We prove the theorem by contradiction. Suppose that the (λ1, λ2) that satisfy the constraint of Eqn. (12) are not unique, and we let (λa 1, λa 2) and (λb 1, λb 2) be such two pairs. Without loss of generality, we assume λa 1 = λb 1. If λa 2 = λb 2, we have P(λa 1, λa 2) = P(λb 1, λb 2), i.e., the region {x δ : p(x δ) min{λa 1, λb 1}p(x) + λa 2q(x), max{λa 1, λb 1}p(x) + λa 2q(x) (135) has zero mass under distribution P. Given that P and Q have positive and continuous density functions almost everywhere, the volume of the region is non-zero thus the mass under distribution P is also non-zero. Therefore, we also have λa 2 = λb 2. Because of the partial monotonicity of P and Q functions (shown in Section 5.3), without loss of generality, we assume λa 1 < λb 1, λa 2 > λb 2. (136) Lemma H.3. There exists a point r0 0, either (1) or (2) is satisfied. (1) When r > r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]; when r < r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. (2) When r > r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]; when r < r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. Note that in Pr[ ϵ p = r], the vector ε Rd is uniformly sampled from the ℓp-sphere of radius r. Proof of Lemma H.3. For a given r, since P and Q are both ℓp-spherically symmetric, and the density functions are both positive almost everywhere, as long as λa 1g(r) + λa 2h(r) λb 1g(r) + λb 2h(r), (137) [p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] [p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. (138) It still holds if we change the s to s in both Eqns. (137) and (138). Meanwhile, λa 1g(r)+λa 2h(r) λb 1g(r)+λb 2h(r) g(r) h(r) > λb 2 λa 2 λa 1 λb 1 . (139) Since g(r)/h(r) is strictly monotonic, there exists at most one point r0 0 that divides g(r) h(r) > λb 2 λa 2 λa 1 λb 1 and g(r) h(r) < λb 2 λa 2 λa 1 λb 1 . (140) If that r0 exists, from Eqns. (137) to (139) the lemma statement follows. Now we only need to show that r0 exists. 
Assume that the point r_0 does not exist. Then for all r we have either

$\lambda_1^a g(r) + \lambda_2^a h(r) < \lambda_1^b g(r) + \lambda_2^b h(r) \qquad (141)$

or

$\lambda_1^a g(r) + \lambda_2^a h(r) > \lambda_1^b g(r) + \lambda_2^b h(r), \qquad (142)$

while P(λ_1^a, λ_2^a) = P(λ_1^b, λ_2^b) > 0 and Q(λ_1^a, λ_2^a) = Q(λ_1^b, λ_2^b) > 0. This implies that for almost every r,

$\Pr\big[p(\epsilon-\delta) \in (a, b) \,\big|\, \|\epsilon\|_p = r\big] = 0, \qquad (143)$

where

$a = \min\{\lambda_1^a g(r) + \lambda_2^a h(r),\ \lambda_1^b g(r) + \lambda_2^b h(r)\}, \qquad b = \max\{\lambda_1^a g(r) + \lambda_2^a h(r),\ \lambda_1^b g(r) + \lambda_2^b h(r)\}. \qquad (144)$

This violates the continuity assumption on both P and Q. Therefore, the point r_0 exists.

With Lemma H.3, we define the auxiliary function D: ℝ₊ → ℝ such that

$D(r) = \Pr\big[p(\epsilon-\delta) < \lambda_1^a p(\epsilon) + \lambda_2^a q(\epsilon) \,\big|\, \|\epsilon\|_p = r\big] - \Pr\big[p(\epsilon-\delta) < \lambda_1^b p(\epsilon) + \lambda_2^b q(\epsilon) \,\big|\, \|\epsilon\|_p = r\big]. \qquad (145)$

We let S(r) be the surface area of the ℓ_p-sphere of radius r. Then P and Q can be written in integral form:

$P(\lambda_1, \lambda_2) = \int_0^\infty \Pr\big[p(\varepsilon-\delta) < \lambda_1 p(\varepsilon) + \lambda_2 q(\varepsilon) \,\big|\, \|\varepsilon\|_p = r\big]\, g(r)\, S(r)\, \mathrm{d}r, \qquad (146)$

$Q(\lambda_1, \lambda_2) = \int_0^\infty \Pr\big[p(\varepsilon-\delta) < \lambda_1 p(\varepsilon) + \lambda_2 q(\varepsilon) \,\big|\, \|\varepsilon\|_p = r\big]\, h(r)\, S(r)\, \mathrm{d}r. \qquad (147)$

Since P(λ_1^a, λ_2^a) = P(λ_1^b, λ_2^b) and Q(λ_1^a, λ_2^a) = Q(λ_1^b, λ_2^b) by our assumption, a simple rearrangement yields

$\int_0^{r_0} D(r)\, g(r)\, S(r)\, \mathrm{d}r = \int_{r_0}^{\infty} \big(-D(r)\big)\, g(r)\, S(r)\, \mathrm{d}r \neq 0, \qquad (148)$

$\int_0^{r_0} D(r)\, h(r)\, S(r)\, \mathrm{d}r = \int_{r_0}^{\infty} \big(-D(r)\big)\, h(r)\, S(r)\, \mathrm{d}r \neq 0. \qquad (149)$

As Lemma H.3 shows, D(r) for r ∈ [0, r_0] always has the same sign as −D(r) for r ∈ [r_0, +∞), and the last relation (≠ 0) is again due to the continuity of p and q and the fact that P(λ_1^a, λ_2^a) > 0 and Q(λ_1^a, λ_2^a) > 0. Now we can divide Eqn. (148) by Eqn. (149) and apply Cauchy's mean value theorem, which yields

$\frac{D(\xi_1)\, g(\xi_1)\, S(\xi_1)}{D(\xi_1)\, h(\xi_1)\, S(\xi_1)} = \frac{D(\xi_2)\, g(\xi_2)\, S(\xi_2)}{D(\xi_2)\, h(\xi_2)\, S(\xi_2)}, \qquad (150)$

where ξ_1 ∈ (0, r_0) and ξ_2 ∈ (r_0, +∞). Apparently, this requires

$\frac{g(\xi_1)}{h(\xi_1)} = \frac{g(\xi_2)}{h(\xi_2)}. \qquad (151)$

However, g/h is strictly monotonic. By contradiction, there is no distinct pair (λ_1^a, λ_2^a) and (λ_1^b, λ_2^b) satisfying the constraints of Eqn. (12) simultaneously.

Remark. Suppose P and Q are ℓ_p-radial extended Gaussian/Laplace distributions, i.e., their density functions p(x) and q(x) satisfy

$p(x) \propto \|x\|_p^{-k} \exp\big(-(\|x\|_p/\sigma)^\alpha\big), \qquad q(x) \propto \|x\|_p^{-k} \exp\big(-(\|x\|_p/\beta)^\alpha\big) \qquad (152)$

with α > 0 (α = 2 for Gaussian and α = 1 for Laplace). Note that this is a broader family than the family considered for DSRS in the main text. We have

$\frac{g}{h} \propto \exp\big((r/\beta)^\alpha - (r/\sigma)^\alpha\big), \qquad (153)$

which is strictly monotonic. Thus, Theorem 11 is applicable to this large family of smoothing distributions that are commonly used in the literature (Lécuyer et al., 2019; Cohen et al., 2019; Yang et al., 2020; Zhang et al., 2020a; Zhai et al., 2020; Jeong & Shin, 2020).
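To close out Appendix H.2, the following is a minimal sketch of the joint binary search in Alg. 4 (an illustrative re-implementation, not the released code). It assumes query access to P(·,·) and Q(·,·), e.g., the numerical-integration routines of Appendix H.2.1, and relies on the partial monotonicity of P and Q in (λ_1, λ_2) (Section 5.3) together with the uniqueness from Theorem 11.

```python
def dual_binary_search(P, Q, PA, QA, M=1e6, eps=1e-9, inner_iters=80):
    """Find (lambda1, lambda2) with P(lambda1, lambda2) ~= PA and Q(lambda1, lambda2) ~= QA."""
    lo1, hi1 = -M, M          # search interval for lambda1
    lo2, hi2 = -M, M          # search interval for lambda2
    while hi1 - lo1 > eps:
        mid1 = 0.5 * (lo1 + hi1)
        # Inner binary search: place (mid1, mid2) on the curve P(., .) = PA.
        a, b = lo2, hi2
        for _ in range(inner_iters):
            mid2 = 0.5 * (a + b)
            if P(mid1, mid2) < PA:
                a = mid2       # P is non-decreasing in lambda2 since q > 0 almost everywhere
            else:
                b = mid2
        mid2 = 0.5 * (a + b)
        # Outer update, mirroring Lines 4-7 of Alg. 4.
        if Q(mid1, mid2) < QA:
            hi1, lo2 = mid1, mid2   # (mid1, mid2) lies to the right of the intersection
        else:
            lo1, hi2 = mid1, mid2   # (mid1, mid2) lies to the left of the intersection
    return lo1, lo2                 # lower interval ends, so R(lo1, lo2) stays a sound bound on (D)
```

In the actual tool, the search over λ_1 and λ_2 is additionally carried out in logarithmic scale (see Appendix I.1).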
H.3. Discussion on Certification of Other ℓp Norms

DSRS can also be extended to provide a certified robust radius under ℓp norms other than ℓ2. Different from the ℓ2 case, for other ℓp norms the challenge is to compute P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) as defined in Theorem 5. For ℓ2 certification, as shown by Theorems 5 and 9, there exist closed-form expressions for these quantities that can be efficiently implemented with numerical integration. However, for other ℓp norms, finding such closed-form expressions for P, Q, and R is challenging. Luckily, we notice that P, Q, and R are all probability-based definitions: as long as we can efficiently sample from P and Q and efficiently compute the density functions p(·) and q(·), we can estimate these quantities from the empirical means of Monte-Carlo sampling. Compared to the numerical-integration-based ℓ2 certification, Monte-Carlo sampling raises sampling-uncertainty and efficiency problems. Here we discuss these two problems in detail and how we alleviate them.

Sampling Uncertainty and Mitigations. The empirical means for P and Q are stochastic, which breaks the nice properties of P and Q (shown in Section 5.3) with respect to (λ1, λ2), because different queries to P and Q fluctuate around the actual values. Therefore, the joint binary search (Alg. 4) may fail to return the correct (λ1, λ2) pair. Thus, we propose a stabilization trick: the same set of samples is used when querying P and Q during the joint binary search. With this fixed set of samples, it can easily be verified that the nice properties (Propositions 1 and 2) still hold even when P and Q are replaced by their empirical means. To guarantee certification soundness in the context of probabilistic Monte-Carlo sampling, we introduce a different set of samples to test whether the solved (λ1, λ2) indeed upper bounds the intersection point (if the test fails, we fall back to the classical Neyman-Pearson-based certification, though this seldom happens). Since the test is also probabilistic, we need to accumulate this additional failure probability and use the lower bound of the confidence interval for soundness, which results in certification with confidence 1 − 2α = 99.8% instead of 1 − α = 99.9% as in classical randomized smoothing certification (Cohen et al., 2019; Yang et al., 2020). Note that the additional confidence intervals caused by Monte-Carlo sampling make DSRS based on Monte-Carlo sampling slightly looser than DSRS based on numerical integration.

Efficiency Concerns and Mitigations. Naively, we would need to sample several (denoted by M; in our implementation M = 5 × 10⁴) high-dimensional vectors for each query to P or Q, which raises an efficiency concern. With the stabilization trick above, we now sample only one set of M samples for the whole joint binary search instead of one set per P or Q query. Combined with the testing-phase samples, the whole algorithm needs to sample 2M vectors rather than O(M log₂ M) without the stabilization trick, which largely resolves the efficiency concern. Moreover, we notice that for the samples {ϵ_i}_{i=1}^M we only need their densities {p(ϵ_i)}_{i=1}^M, {q(ϵ_i)}_{i=1}^M, and {p(ϵ_i − δ)}_{i=1}^M. Thus, we store only these three quantities instead of the whole d-dimensional vectors {ϵ_i}_{i=1}^M, reducing the space complexity from O(Md) to O(M). These techniques significantly improve efficiency in practice. Although DSRS based on Monte-Carlo sampling is still slightly looser and slower than DSRS based on numerical integration, it makes certifying robustness under other ℓp norms feasible.

In this work, we focus on the ℓ2 norm, because additive randomized smoothing is not optimal for other norms (e.g., ℓ1 (Levine & Feizi, 2021)) or the state-of-the-art certification can be directly translated from ℓ2 certification (e.g., ℓ∞ (Yang et al., 2020) and semantic transformations (Li et al., 2021)). Moreover, to the best of our knowledge, standard ℓ2 certification is the most challenging setting: it is where additive randomized smoothing achieves the state of the art, and no other work achieves visibly tighter robustness certification than standard Neyman-Pearson certification (Yang et al., 2020; Levine et al., 2020; Mohapatra et al., 2020).
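For concreteness, below is a minimal sketch of Monte-Carlo estimators for P(·,·) and Q(·,·) with the stabilization trick: the noise samples are drawn once and then reused for every query of the joint binary search, so that the monotonicity properties still hold for the empirical means. This is an illustration rather than the released tool; the exact sample-sharing bookkeeping (e.g., the 2M-sample budget) may differ, and `sample_P`, `sample_Q`, `log_p`, `log_q` are assumed callables that draw noise matrices and evaluate vectorized log-densities.

```python
import numpy as np

def make_mc_estimators(delta, sample_P, sample_Q, log_p, log_q, M=50_000, seed=0):
    rng = np.random.default_rng(seed)
    eps_P, eps_Q = sample_P(M, rng), sample_Q(M, rng)     # fixed M x d noise matrices
    # Only log-densities are kept afterwards: O(M) memory instead of O(M d), and density
    # ratios are formed as differences of log-densities for numerical stability.
    aP = log_p(eps_P - delta) - log_p(eps_P)              # log p(e - delta) - log p(e), e ~ P
    bP = log_q(eps_P) - log_p(eps_P)                      # log q(e)         - log p(e), e ~ P
    aQ = log_p(eps_Q - delta) - log_q(eps_Q)              # log p(e - delta) - log q(e), e ~ Q
    bQ = log_p(eps_Q) - log_q(eps_Q)                      # log p(e)         - log q(e), e ~ Q

    # Indicator of p(e - delta) < lambda1 * p(e) + lambda2 * q(e), rewritten relative to the
    # sampling density of each estimator.
    P_hat = lambda lam1, lam2: float(np.mean(np.exp(aP) < lam1 + lam2 * np.exp(bP)))
    Q_hat = lambda lam1, lam2: float(np.mean(np.exp(aQ) < lam1 * np.exp(bQ) + lam2))
    return P_hat, Q_hat
```

The second, independent sample set described above is then used only to test the solved (λ1, λ2) for soundness.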
I. Implementation and Optimizations

In this appendix, we discuss implementation tricks and optimizations, along with our simple heuristic for selecting the hyperparameter T in the additional smoothing distribution Q = N_g^trunc(k, T, σ).

I.1. Implementation Details

We implement DSRS in Python with about two thousand lines of code. The tool uses PyTorch² to query a given base classifier with Monte-Carlo sampling in order to derive the confidence intervals $[\underline{P_A}, \overline{P_A}]$ and $[\underline{Q_A}, \overline{Q_A}]$. The tool then builds the whole DSRS pipeline on SciPy³ and NumPy⁴. Specifically, the numerical integration is implemented with the scipy.integrate.quad() method. We exploit the full independence across the certifications of different input instances and build the tool to be fully parallelizable on CPUs. By default we use 10 processes, and the number of processes can be adjusted dynamically. The tool is built in a flexible way, so that adding new smoothing distributions is not only theoretically straightforward but also easy in practice. In the future, we plan to extend the tool to 1) provide GPU support and 2) reuse existing certification results from previous instances with similar confidence intervals to achieve higher efficiency. We will also support more smoothing distributions.

²http://pytorch.org/  ³https://scipy.org/scipylib/index.html  ⁴https://numpy.org/

In the implementation, we widely use the logarithmic scale, since many quantities in the computation have widely varying scales. For example, since the input dimension d is typically over 500 on a real-world dataset, the density functions p and q decay very quickly as the noise magnitude increases. Another example is the input variables for ln(·) and W(·) in Theorem 5: these variables are exponential with respect to the input dimension d, so they can be very large or very small. To mitigate this, we perform all computations with such varying scales in logarithmic scale to improve precision and floating-point soundness. For example, we implement a method lnlogadd to compute log(λ1 exp(x1) + λ2 exp(x2)), and apply the method wlog from (Yang et al., 2020) to compute W(exp(x)). We remark that in the binary search for the dual variables (see Section 5.3), we also use the logarithmic scale for λ1 and λ2. The code, model weights, and all experiment data are publicly available at https://github.com/llylly/DSRS.

I.2. T Heuristics

As briefly outlined in Section 6.1, we apply a simple yet effective heuristic to determine the hyperparameter T in the additional smoothing distribution Q = N_g^trunc(k, T, σ). Concretely, we first sample the prediction probability under the original smoothing distribution P and obtain the confidence lower bound $\underline{P_A}$ of PA = f^P(x0)_{y0}. Then, we use the following empirical expression to determine T from $\underline{P_A}$:

$T = \sigma\sqrt{2\,\Gamma\mathrm{CDF}^{-1}_{d/2-k}(p)}, \qquad (154)$

$p = \max\{-0.08\ln(1 - \underline{P_A}) + 0.2,\ 0.5\}. \qquad (155)$

This can be viewed as first parameterizing T by p and then using a simple heuristic to determine p from $\underline{P_A}$. The parameterization of T by p is inspired by the probability mass of the ball of radius T under the original smoothing distribution P = N_g(k, σ), supposing the true-decision region is concentrated in an ℓ2-ball centered at x0. Concretely, from the definition of P,

$\Pr_{\epsilon\sim P}[\|\epsilon\|_2 \le T] = p. \qquad (156)$

Then, we use a randomly sampled CIFAR-10 training subset containing 1,000 points, with models trained using Consistency (Jeong & Shin, 2020) under σ = 0.50, to sweep all p ∈ {0.1, 0.2, . . . , 0.9}. We plot the minimum p and maximum p that give the highest certified robust radius as a segment and find that Eqn. (155) fits the general tendency well, as shown in Figure 7. Thus, we use this simple heuristic to determine T. Empirically, this simple heuristic generalizes well and is competitive with more complex methods, as shown in Appendix J.
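As an illustration, Eqns. (154)–(155) as reconstructed above translate directly into a few lines of code (names are ours): scipy.stats.gamma.ppf serves as ΓCDF⁻¹_{d/2−k}, which matches Eqn. (156) because ‖ϵ‖₂²/(2σ²) follows Γ(d/2 − k) under P = N_g(k, σ).

```python
import numpy as np
from scipy.stats import gamma

def heuristic_T(pA_lower, d, k, sigma):
    # Eqn. (155): heuristic p from the confidence lower bound of PA.
    p = max(-0.08 * np.log(1.0 - pA_lower) + 0.2, 0.5)
    p = min(p, 1.0 - 1e-12)                          # guard when pA_lower is extremely close to 1
    # Eqn. (154): T such that Pr_{eps~P}[||eps||_2 <= T] = p, cf. Eqn. (156).
    return sigma * np.sqrt(2.0 * gamma.ppf(p, a=d / 2.0 - k))
```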
Figure 7. Our p heuristic (Eqn. (155)), shown as the orange curve over −ln(1 − $\underline{P_A}$), fits the general tendency of the optimal p ranges (shown as blue segments) well.

Another heuristic that we deploy is the fall-back strategy. When the empirical probability $\tilde{P}_A = 1$, i.e., when F(x0 + ϵ) = y0 for all sampled ϵ ∼ P, we fall back to using P itself instead of another distribution Q for the second round of sampling. This strategy is inspired by the finding that, with a fixed sampling budget, if PA is already very high, it is more efficient to use more samples to further tighten the confidence interval of PA than to query imprecise information under another distribution Q. Note that this strategy does not break the high-confidence soundness of our certification, because $\Pr[\tilde{P}_A = n/N \mid P_A \le t] \ge \Pr[\tilde{P}_A = n/N \wedge \tilde{P}_A^{\mathrm{half}} = 1 \mid P_A \le t]$ for any t < 1 and Bernoulli-distributed sampling (which is our case), and we use a Bernoulli confidence interval, which corresponds to $\Pr[\tilde{P}_A = n/N \mid P_A \le t]$. Due to the tight experiment schedule, we only deployed this strategy in the ImageNet evaluation, not in the MNIST and CIFAR-10 evaluations.

I.3. Training Strategy for Generalized Gaussian Smoothing

We train the base classifiers on each dataset using Gaussian augmentation training (Cohen et al., 2019), Consistency training (Jeong & Shin, 2020), and SmoothMix training (Jeong et al., 2021), which are typical training methods for randomized smoothing. We do not consider other training approaches such as (Zhai et al., 2020; Salman et al., 2019; Li et al., 2019a; Carmon et al., 2019) because: (1) some of them either require additional data (Carmon et al., 2019; Salman et al., 2019), are relatively time-consuming (Salman et al., 2019), or are reported to be less effective than later approaches (Li et al., 2019a); and (2) the selected training approaches are widely used or achieve state-of-the-art results with high training efficiency.

On MNIST, for all training methods, we use a convolutional neural network with 4 convolutional layers and 3 fully connected layers following Cohen et al. (2019) as the base classifier architecture. On CIFAR-10, for all training methods, we use ResNet-110 (He et al., 2016) as the base classifier architecture. These architecture choices follow prior work on smoothed classifier training (Cohen et al., 2019; Salman et al., 2019; Zhai et al., 2020). On both MNIST and CIFAR-10, for all training methods, we train for 150 epochs. The learning rate is 0.01 and is decayed by 0.1 at the 50th and 100th epochs. For Consistency training, the hyperparameter λ = 5 on MNIST and λ = 20 on CIFAR-10, and we use two noise vectors per training instance per batch; these are the best hyperparameters reported in (Jeong & Shin, 2020). The batch size is 256 on both MNIST and CIFAR-10, following (Jeong & Shin, 2020). For SmoothMix training, we directly use the best hyperparameters from their open-source repository: https://github.com/jh-jeong/smoothmix/blob/main/EXPERIMENTS.MD.

On ImageNet, we use ResNet-50 (He et al., 2016) as the base classifier architecture and finetune from the open-source model trained by Cohen et al. (2019) with Gaussian smoothing. We train for 10 epochs due to the expensive training cost on ImageNet, and we remark that better results could be achieved with a larger training budget. The learning rate is 0.001 and is decayed by 0.1 at the end of every epoch.
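All of these training methods augment the training inputs with noise drawn from the training smoothing distribution, a generalized Gaussian N_g(k, σ) (see the warm-up discussion below). Its radial law is the one used throughout Appendix H.2: under N_g(k, σ), ‖ϵ‖₂²/(2σ²) follows Γ(d/2 − k) and the direction is uniform on the unit sphere. A minimal PyTorch sketch of such a sampler (illustrative, not the authors' training code) follows.

```python
import torch

def sample_generalized_gaussian(batch, d, k, sigma, device="cpu"):
    """Draw `batch` noise vectors eps ~ N_g(k, sigma) in dimension d (requires d/2 - k > 0)."""
    # Radius: ||eps||_2 = sigma * sqrt(2 t) with t ~ Gamma(d/2 - k, 1).
    t = torch.distributions.Gamma(torch.tensor(d / 2.0 - k), torch.tensor(1.0)).sample((batch,)).to(device)
    radius = sigma * torch.sqrt(2.0 * t)
    # Direction: uniform on the unit sphere S^{d-1}.
    u = torch.randn(batch, d, device=device)
    u = u / u.norm(dim=1, keepdim=True)
    return u * radius.unsqueeze(1)   # add to the (flattened) training inputs
```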
For Consistency training, the hyperparameter λ = 5 and we use two noise vectors per training instance per batch, following the best hyperparameters reported in (Jeong & Shin, 2020). For SmoothMix training, since the open-source repository does not list the best hyperparameters for ImageNet, we select the hyperparameters suggested in the original paper (Jeong et al., 2021).

During training, the training samples are augmented by adding noise drawn from the training smoothing distribution. In typical training approaches (Cohen et al., 2019; Carmon et al., 2019; Salman et al., 2019; Zhai et al., 2020; Jeong & Shin, 2020), the training smoothing distribution is set to be the same as the original smoothing distribution P used to construct the smoothed classifier. However, for our generalized Gaussian N_g(k, σ) as P with large k, we find that this strategy gives poor empirical performance. To better train the base classifier for our original smoothing distribution P with large k, we introduce a warm-up stage for the training smoothing distribution. Suppose the original smoothing distribution P for constructing the smoothed classifier is N_g(k0, σ). We let e0 = 100 be the number of warm-up epochs on MNIST and CIFAR-10, or e0 = 10000 be the number of warm-up steps on ImageNet. In the first e0 epochs (on MNIST and CIFAR-10) or steps (on ImageNet), we use a training smoothing distribution with smaller k. Formally, in the e-th epoch/step with e ≤ e0, the training smoothing distribution is P = N_g(k, σ) where

$k = k_0 - k_0^{\,1 - e/e_0}. \qquad (157)$

For later epochs/steps, we use the original smoothing distribution P itself as the training smoothing distribution. This strategy gradually increases the k of the training smoothing distribution throughout training, so that the base classifier becomes a better fit for the final desired distribution P. Using this strategy, the smoothed classifier constructed from our trained base classifier has certified robustness similar to that of standard Gaussian augmentation under classical Neyman-Pearson-based certification.

J. Additional Experimental Results

In this appendix, we present additional experimental results and studies.

J.1. Empirical Study Setup of Concentration Property

In Figure 4 in Appendix B, we present our investigation of the decision regions of base classifiers. The investigation follows this protocol: (1) We choose the base classifier from (Salman et al., 2019) on ImageNet trained for σ = 0.5 Gaussian smoothing as the subject, because this base classifier is one of the state-of-the-art certifiably robust classifiers on the large-scale ImageNet dataset and our code uses the same preprocessing parameters, so it is easy to adopt their model. (2) We pick every 500th image from the ImageNet test set to form a subset of 100 samples. (3) We filter out the samples that the base classifier cannot classify correctly even without any added noise, which leaves 89 samples. (4) For each of these 89 samples and each perturbation magnitude r, we uniformly sample 1000 perturbation vectors from the hypersphere of ℓ2-radius r and compute the empirical probability of true prediction, where the step size of r is 10.
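A minimal sketch of step (4) of this protocol (an illustrative helper, not the authors' script; `base_classifier` is assumed to be a PyTorch module, and the batch should be chunked in practice):

```python
import torch

@torch.no_grad()
def true_prediction_prob(base_classifier, x0, y0, r, n=1000):
    """Empirical probability of predicting y0 under uniform l2-sphere perturbations of radius r."""
    d = x0.numel()
    u = torch.randn(n, d, device=x0.device)
    u = r * u / u.norm(dim=1, keepdim=True)              # uniform on the l2-sphere of radius r
    perturbed = (x0.flatten() + u).view(n, *x0.shape)     # x0 + delta, one row per perturbation
    preds = base_classifier(perturbed).argmax(dim=1)
    return (preds == y0).float().mean().item()
```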
Figure 4 implies that, for a vast majority of samples, the correctly predicted points are highly concentrated in an ℓ2-ball around the clean input: there exist apparent ℓ2-magnitude thresholds such that the true-prediction probability is almost 1 within the threshold and almost 0 beyond it. This suggests that the concentration property may be achievable for real-world base classifiers in randomized smoothing. In Figure 9, we follow the same protocol but plot the landscape of base classifiers trained with the generalized Gaussian distribution (instead of the standard Gaussian as in Figure 4). Comparing Figure 9 and Figure 4, we find that although the base classifiers in Figure 4 can achieve better certified robustness using Neyman-Pearson certification and generalized Gaussian smoothing (compare Figure 2(b) and the Neyman-Pearson rows in Table 10), they also sacrifice the concentration property, which can explain why the DSRS improvements are much smaller on the models in Section 6 than on the models in Figure 2(b). Thus, as discussed in Section 6, there may be a large space for exploring training approaches that favor DSRS certification by preserving the concentration property.

J.2. Effectiveness of T-Heuristics and Attempts on Better Optimization Tricks

Better Optimization Tricks. To obtain a tighter certified radius, we should make the support of N_g^trunc(k, T, σ) better aligned with the decision region while keeping QA large enough. So, besides the simple heuristic, we can also turn the search for an appropriate value of T into an optimization problem. To make the optimization more stable, we construct the optimization over Pcon, which is more scale-invariant, and then transform the result to obtain the final $T = \sigma\sqrt{2\,\Gamma\mathrm{CDF}^{-1}_{d/2}(P_{\mathrm{con}})}$. The final optimization objective is built as

$P_{\mathrm{con}} = \arg\min_{P_{\mathrm{con}}}\; -Q_A + \frac{\lambda}{2}\,(P_{\mathrm{con}} - \underline{P_A})^2,$

where λ is a hyperparameter that controls the relative weight of the two loss terms. The QA here is estimated by sampling from the distribution N_g^trunc(k, T, σ) in which T is determined by Pcon; however, this sampling is implemented by rejecting sampled noise whose norm is larger than T. Therefore, no gradient with respect to Pcon can be obtained by backpropagating through the loss, namely the optimization objective. Instead, we approximate the gradient from the first loss term with

$G_{Q_A} = \mathbb{E}_{\epsilon\sim Q}\big[\,\|\epsilon\|_2 \cdot \mathrm{CrossEntropyLoss}(f(x_0+\epsilon), y_0)\,\big].$

Then we only have gradient information, with no explicit form of QA anymore. PA is estimated with the $\underline{P_A}$ that we have already obtained, so the final gradient estimate for Pcon is $G_{Q_A} + \lambda(P_{\mathrm{con}} - \underline{P_A})$.

Experiment Setting. Pcon is optimized for each input test image separately, and the initial Pcon is set to 0.7. For each input, we update Pcon for 20 steps on MNIST and CIFAR-10, and for only 10 steps on ImageNet to reduce the time cost. In each step, we draw 2,000 samples to estimate $G_{Q_A}$, and the learning rate is set to 2,000. Besides, to avoid Pcon being optimized to be too large or too small, we clip the final optimized Pcon to lie within 0.1 and 0.9. Since the optimization is somewhat time-consuming, we only test it on CIFAR-10 with σ = 0.5 and on MNIST and ImageNet with σ = 1.0. Different λ values are also tried for different datasets and different training methods to obtain better performance.
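A minimal sketch of this projected-gradient search for Pcon (illustrative only): `surrogate_grad_QA` stands for a routine that estimates $G_{Q_A}$ from 2,000 noise samples drawn from N_g^trunc(k, T(Pcon), σ) as described above, and all names and defaults are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import gamma

def optimize_p_con(surrogate_grad_QA, pA_lower, d, sigma, lam, steps=20, lr=2000.0, p_init=0.7):
    p_con = p_init
    for _ in range(steps):
        grad = surrogate_grad_QA(p_con) + lam * (p_con - pA_lower)   # G_QA + lambda * (Pcon - PA)
        p_con = p_con - lr * grad                                    # plain gradient step
    p_con = float(np.clip(p_con, 0.1, 0.9))                          # clip the final Pcon to [0.1, 0.9]
    # Transform Pcon back to the truncation radius: T = sigma * sqrt(2 * GammaCDF^{-1}_{d/2}(Pcon)).
    return sigma * np.sqrt(2.0 * gamma.ppf(p_con, a=d / 2.0))
```

The weight λ is dataset- and training-method-dependent (the values used in Table 5 range from about 3e-6 to 7e-5).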
Performance. The final results are shown in Table 5, where the certification approach based on the optimization trick described above is denoted as Opt. As we can see, our simple T-heuristic is competitive with the method based on this more complex optimization, at a much lower cost.

J.3. Full Curves and Separated Tables

J.3.1. SEPARATE TABLES BY SMOOTHING VARIANCE

Due to the page limit, in the main text (Section 6, Table 2) we aggregate the certified robust accuracy across models with different smoothing variances σ ∈ {0.25, 0.50, 1.00}. To show the full landscape, we present the certified robust accuracy for each model trained with each variance. The evaluation protocol is the same as in the main text, and the tables for MNIST, CIFAR-10, and ImageNet models are Table 6, Table 7, and Table 8, respectively. We observe that DSRS outperforms standard Neyman-Pearson certification over a wide range of radii.

J.3.2. CURVES

Following convention (Cohen et al., 2019; Salman et al., 2019), we plot the certified robust accuracy versus radius curves in Figure 8. The curves correspond to the certified robust accuracy data in Table 2 (Section 6), i.e., the certified robust accuracy at each radius r is the maximum certified robust accuracy among models trained with variance σ ∈ {0.25, 0.50, 1.00}. We observe that across all medium to large radii (including those not shown in Table 2), DSRS provides higher certified robust accuracy than Neyman-Pearson certification. The margin of DSRS is relatively small on CIFAR-10 but is apparent on MNIST and ImageNet. In particular, the margins on ImageNet reflect that DSRS is especially effective on large datasets.

J.3.3. ACR RESULTS

In the literature, another common metric of certified robustness is the ACR (average certified radius) (Zhai et al., 2020; Jeong & Shin, 2020; Jeong et al., 2021). In Table 9, we report the ACR comparison between Neyman-Pearson-based certification and our DSRS certification. Across the three smoothing variance choices σ ∈ {0.25, 0.50, 1.00}, we find that σ = 1.00 yields the highest ACR, so we only report the ACR for models smoothed with σ = 1.00. As we can see, in all cases DSRS significantly improves over Neyman-Pearson-based certification in terms of ACR.

J.4. Using Distribution with Different Variance as Q

We take the models trained with Gaussian augmentation (Cohen et al., 2019) and variance σ = 1.00 as examples. We use DSRS-trunc to denote DSRS using the truncated generalized Gaussian as Q and DSRS-var to denote DSRS using a generalized Gaussian with different variance as Q, and compare their robustness certification (i.e., certified robust accuracy) in Table 10. From the table, we find that on MNIST and CIFAR-10, DSRS-var is better than DSRS-trunc, whereas on ImageNet, DSRS-trunc is slightly better than DSRS-var. Both DSRS-trunc and DSRS-var are significantly better than Neyman-Pearson-based certification. To investigate the reason, we follow the protocol for studying the concentration property in Appendix J.1 to plot the landscape of models on MNIST, CIFAR-10, and ImageNet, as shown in Figure 9. From the figure, we find that the curves on ImageNet are generally steeper, which corresponds to the concentration property being better satisfied on ImageNet. Therefore, we conjecture that when the concentration property (see Definition 3) is better satisfied, DSRS with the truncated Gaussian as Q is better than DSRS with a Gaussian of different variance as Q.

J.5.
Comparison with Higher-Order Randomized Smoothing It is difficult to have a direct comparison with higher-order randomized smoothing (Mohapatra et al., 2020; Levine et al., 2020), which is the only work to the best of our knowledge that uses additional information beyond PA to tighten the robustness certification in randomized smoothing. This difficulty comes from the following reasons: (1) Higher-order randomized smoothing only supports standard Gaussian smoothing, while DSRS is particularly useful with generalized Gaussian smoothing. (2) All experiment evaluations in higher-order randomized smoothing are conducted with large sampling numbers (N = 2 105 on CIFAR-10 and N = 1.25 106 on Image Net) which makes the evaluation costly, while DSRS is designed for practical sampling numbers (N = 105). (3) The code is not open-source yet (Mohapatra et al., 2020). Nonetheless, we can directly compare with the curves provided by Mohapatra et al. (2020). We capture the certified robust accuracy vs. ℓ2 radius r curves from (Mohapatra et al., 2020) in Figure 10. As we can see, compared with Neyman-Pearson-based certification, the improvements from higher-order randomized smoothing are small especially on the large Image Net dataset despite the excessive sampling numbers (1.25 106). In contrast, as shown in Figure 8, within only 105 sampling number, DSRS is visibly tighter than Neyman-Pearsonbased certification. In fact, to the best of our knowledge, DSRS is the first model-agnostic approach that is visibly tighter than Neyman-Pearson-based certification under ℓ2 radius. Double Sampling Randomized Smoothing Table 5. ℓ2 certified robust accuracy w.r.t. different radii r for different certification approaches. Dataset Training Certification Certified Accuracy under Radius r Method Approach 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 Gaussian Aug. Neyman-Pearson 95.5 % 93.5 % 90.0 % 86.1 % 80.4 % 72.8 % 61.4 % 50.2 % 36.6 % 25.2 % 14.5 % 8.5 % (σ = 1.00) DSRS(T-heuristic) 95.5 % 93.5 % 90.2 % 86.9 % 81.4 % 74.4 % 64.6 % 55.2 % 42.8 % 30.9 % 20.3 % 11.3 % Opt (λ = 7e 05) 95.5% 93.5% 90.0% 86.9% 81.7% 74.9% 65.6% 55.8% 43.8% 30.5% 19.1% 10.2% Consistency Neyman-Pearson 94.5 % 92.6 % 89.3 % 85.9 % 80.7 % 74.4 % 65.9 % 56.9 % 44.1 % 34.4 % 23.3 % 12.8 % (σ = 1.00) DSRS(T-heuristic) 94.5 % 92.8 % 89.3 % 86.3 % 81.4 % 76.1 % 68.3 % 59.5 % 50.7 % 39.8 % 30.7 % 20.0 % Opt (λ = 6e 06) 94.5% 92.8% 89.3% 86.2% 81.2% 75.8% 68.2% 59.6% 50.5% 39.6% 30.7% 19.8% Certified Accuracy under Radius r 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 Gaussian Aug. Neyman-Pearson 60.4 % 55.2 % 51.3 % 45.9 % 40.8 % 35.6 % 30.1 % 24.3 % 20.0 % 16.7 % 13.0 % 10.1 % (σ = 0.50) DSRS(T-heuristic) 60.6 % 55.5 % 51.5 % 46.8 % 42.1 % 37.3 % 32.5 % 27.4 % 22.8 % 19.3 % 16.0 % 12.7 % Opt (λ = 3e 06) 60.6% 55.5% 51.3% 46.7% 42.0% 37.2% 32.5% 27.4% 23.0% 19.3% 16.0% 12.5% Consistency Neyman-Pearson 53.1 % 50.5 % 48.6 % 45.5 % 43.6 % 41.5 % 38.7 % 36.7 % 35.1 % 32.0 % 29.1 % 25.7 % (σ = 0.50) DSRS(T-heuristic) 53.1 % 50.7 % 48.7 % 45.7 % 44.0 % 41.8 % 39.6 % 37.8 % 36.0 % 34.4 % 31.3 % 28.6 % Opt (λ = 4e 06) 53.1% 50.7% 48.7% 45.7% 44.0% 41.8% 39.5% 37.8% 36.0% 34.4% 31.4% 28.4% Gaussian Aug. 
Neyman-Pearson 57.5 % 55.1 % 52.2 % 49.7 % 47.0 % 43.9 % 40.8 % 38.1 % 35.0 % 33.2 % 29.6 % 25.3 % (σ = 1.00) DSRS(T-heuristic) 57.7 % 55.6 % 52.7 % 51.0 % 48.4 % 45.5 % 43.1 % 40.2 % 37.9 % 35.3 % 32.8 % 30.5 % Opt (λ = 1e 05) 57.7% 55.5% 52.4% 50.5% 48.2% 45.0% 42.9% 40.0% 38.0% 35.0% 32.7% 29.9% Consistency Neyman-Pearson 55.9 % 54.4 % 53.0 % 51.2 % 48.2 % 46.2 % 44.2 % 41.7 % 39.1 % 36.4 % 34.4 % 32.1 % (σ = 1.00) DSRS(T-heuristic) 56.0 % 54.6 % 53.1 % 51.8 % 49.9 % 47.4 % 45.7 % 44.2 % 41.7 % 39.3 % 37.8 % 35.8 % Opt (λ = 1e 05) 56.0% 54.6% 53.1% 51.8% 49.7% 47.4% 45.3% 44.0% 41.6% 39.3% 37.8% 35.9% Table 6. MNIST: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 97.9% 96.4% 92.1% (Cohen et al., 2019) DSRS 97.9% 96.6% 92.7% Consistency Neyman-Pearson 98.4% 97.5% 94.4% (Jeong & Shin, 2020) DSRS 98.4% 97.5% 95.4% Smooth Mix Neyman-Pearson 98.6% 97.6% 96.5% (Jeong et al., 2021) DSRS 98.6% 97.7% 96.8% Gaussian Aug. Neyman-Pearson 97.8% 96.9% 94.6% 88.4% 78.7% 52.6% (Cohen et al., 2019) DSRS 97.8% 97.0% 95.0% 89.8% 83.4% 59.1% Consistency Neyman-Pearson 98.4% 97.3% 96.0% 92.3% 83.8% 67.5% (Jeong & Shin, 2020) DSRS 98.4% 97.3% 96.0% 93.5% 87.1% 71.8% Smooth Mix Neyman-Pearson 98.2% 97.1% 95.4% 91.9% 85.1% 73.0% (Jeong et al., 2021) DSRS 98.1% 97.1% 95.9% 93.4% 87.5% 76.6% Gaussian Aug. Neyman-Pearson 95.2% 91.9% 87.7% 80.6% 71.2% 57.6% 41.0% 25.5% 13.6% 6.2% 2.1% 0.9% (Cohen et al., 2019) DSRS 95.1% 91.8% 88.2% 81.5% 73.6% 61.6% 48.4% 34.1% 21.0% 10.6% 4.4% 1.2% Consistency Neyman-Pearson 93.9% 90.9% 86.4% 80.8% 73.0% 61.1% 49.1% 35.6% 21.7% 10.4% 4.1% 1.9% (Jeong & Shin, 2020) DSRS 93.9% 91.1% 86.9% 81.7% 75.2% 65.6% 55.8% 41.9% 31.4% 17.8% 8.6% 2.8% Smooth Mix Neyman-Pearson 92.0% 88.9% 84.4% 78.6% 69.8% 60.7% 49.9% 40.2% 31.5% 22.2% 12.2% 4.9% (Jeong et al., 2021) DSRS 92.2% 89.0% 84.8% 79.7% 72.0% 63.9% 54.4% 46.2% 37.6% 29.2% 18.5% 7.2% K. Extended Related Work Discussion In this appendix, we discuss related work from two branches: training approaches for randomized smoothing, and datadependent randomized smoothing. Both branches aim to improve the certified robustness of randomized smoothing. For tighter certification approaches leveraging additional information, a detailed discussion can be found in Appendix L.1. For more related work we refer the interested readers to recent surveys and books (Liu et al., 2021; Li et al., 2020b; Albarghouthi, 2021). To improve the certified robustness of randomized smoothing, efforts have been made on both the training (Li et al., 2019a; Zhai et al., 2020; Salman et al., 2019; Jeong & Shin, 2020) and the certification sides (L ecuyer et al., 2019; Cohen et al., 2019; Li et al., 2019a; Yang et al., 2020; Zhang et al., 2020a; Yang et al., 2022). On the training side, data augmentation (Cohen et al., 2019), regularization (Li et al., 2019a; Zhai et al., 2020; Jeong & Shin, 2020), and adversarial training (Salman et al., 2019) help to train stable base models under noise corruptions so that higher certified robustness for a smoothed classifier can be achieved. In this work, we focus on certification, and these training approaches can be used in conjunction with ours to provide higher certified robustness. 
Another potential way to improve the certified robustness of randomized smoothing is to dynamically change the smoothing distribution P based on the input toward maximizing the certified radius (Alfarra et al., 2020; Eiras et al., 2021; Schuchardt et al., 2022). In this scenario, the certification needs to take into account that the attacker may adaptively mislead the pipeline to choose a bad smoothing distri- Double Sampling Randomized Smoothing Table 7. CIFAR-10: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 56.1% 35.7% 13.4% (Cohen et al., 2019) DSRS 57.4% 39.4% 17.3% Consistency Neyman-Pearson 61.8% 50.9% 34.7% (Jeong & Shin, 2020) DSRS 62.5% 52.5% 37.8% Smooth Mix Neyman-Pearson 63.9% 53.3% 38.4% (Jeong et al., 2021) DSRS 64.7% 55.5% 41.1% Gaussian Aug. Neyman-Pearson 53.7% 41.3% 27.7% 17.1% 9.1% 2.8% (Cohen et al., 2019) DSRS 54.1% 42.7% 30.6% 20.3% 12.6% 4.0% Consistency Neyman-Pearson 49.2% 43.9% 38.0% 32.3% 23.8% 18.1% (Jeong & Shin, 2020) DSRS 49.6% 44.1% 38.7% 35.2% 28.1% 19.7% Smooth Mix Neyman-Pearson 53.2% 47.6% 40.2% 34.2% 26.7% 19.6% (Jeong et al., 2021) DSRS 53.3% 48.5% 42.1% 35.9% 29.4% 21.7% Gaussian Aug. Neyman-Pearson 40.2% 32.6% 24.7% 18.9% 14.9% 10.2% 7.5% 4.1% 2.0% 0.7% 0.1% 0.1% (Cohen et al., 2019) DSRS 40.3% 33.1% 25.9% 20.6% 16.1% 12.5% 8.4% 6.4% 3.5% 1.8% 0.7% 0.1% Consistency Neyman-Pearson 37.2% 32.6% 29.6% 25.9% 22.5% 19.0% 16.4% 13.8% 11.2% 9.0% 7.1% 5.1% (Jeong & Shin, 2020) DSRS 37.1% 32.5% 29.8% 27.1% 23.5% 20.9% 17.6% 15.3% 13.1% 10.9% 8.9% 6.5% Smooth Mix Neyman-Pearson 43.2% 39.5% 33.9% 29.1% 24.0% 20.4% 17.0% 13.9% 10.3% 7.8% 4.9% 2.3% (Jeong et al., 2021) DSRS 43.2% 39.7% 34.9% 30.0% 25.8% 22.1% 18.7% 16.1% 13.2% 10.2% 7.1% 3.9% Table 8. Image Net: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 57.1% 41.6% 17.4% (Cohen et al., 2019) DSRS 58.4% 47.9% 24.4% Consistency Neyman-Pearson 59.8% 49.8% 36.9% (Jeong & Shin, 2020) DSRS 60.4% 52.4% 40.4% Smooth Mix Neyman-Pearson 46.7% 38.2% 28.2% (Jeong et al., 2021) DSRS 47.4% 40.0% 29.8% Gaussian Aug. Neyman-Pearson 53.6% 48.3% 43.3% 36.8% 31.4% 24.5% (Cohen et al., 2019) DSRS 53.7% 49.9% 44.7% 39.3% 34.8% 27.4% Consistency Neyman-Pearson 53.6% 48.3% 43.3% 36.8% 31.4% 24.5% (Jeong & Shin, 2020) DSRS 53.7% 49.9% 44.7% 39.3% 34.8% 27.4% Smooth Mix Neyman-Pearson 38.7% 33.5% 28.8% 24.6% 18.1% 13.5% (Jeong et al., 2021) DSRS 39.1% 34.9% 30.3% 26.8% 21.6% 15.6% Gaussian Aug. 
Neyman-Pearson 42.5% 37.2% 33.0% 29.2% 24.8% 21.4% 17.6% 13.7% 10.2% 7.8% 5.7% 3.6% (Cohen et al., 2019) DSRS 42.5% 38.1% 34.4% 30.2% 27.0% 23.3% 21.3% 18.7% 14.2% 11.0% 9.0% 5.7% Consistency Neyman-Pearson 40.0% 38.3% 34.2% 31.8% 28.7% 25.6% 22.1% 19.1% 16.1% 14.0% 10.6% 8.5% (Jeong & Shin, 2020) DSRS 40.2% 38.5% 35.4% 32.6% 30.7% 28.1% 25.4% 22.6% 19.6% 17.4% 14.1% 10.4% Smooth Mix Neyman-Pearson 29.8% 25.6% 21.8% 19.2% 17.0% 14.2% 11.8% 10.1% 8.9% 7.2% 6.0% 4.6% (Jeong et al., 2021) DSRS 29.7% 26.2% 23.0% 20.6% 18.0% 15.7% 14.0% 12.1% 9.9% 8.4% 7.2% 5.3% Double Sampling Randomized Smoothing 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Smooth Mix Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Smooth Mix Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Smooth Mix Neyman-Pearson Certification DSRS Certification Figure 8. Certified robust accuracy - radius curve corresponding to Table 2. Table 9. Average certified radius (ACR) statistics. The smoothing variance hyperparameter σ = 1.00. The evaluation protocol is the same as that in the main text. Training Method Certification MNIST CIFAR-10 Image Net Gaussian Augmentation Neyman-Pearson 1.550 0.447 0.677 DSRS 1.629 0.469 0.750 Relative Improvement +5.10% +4.92% +10.78% Consistency Neyman-Pearson 1.645 0.636 0.796 DSRS 1.730 0.659 0.862 Relative Improvement +5.17% +3.62% +8.29% Smooth Mix Neyman-Pearson 1.716 0.676 0.490 DSRS 1.806 0.712 0.525 Relative Improvement +5.24% +5.33% +7.14% bution. Therefore, additional costs such as memorizing training data need to be paid to defend such adaptive robust- ness vulnerabilities. A recent work (S uken ık et al., 2021) shows that input-dependent randomized smoothing may not bring substantial improvements in certified robustness. In DSRS, we select the additional smoothing distribution Q dynamically based on the input, which may appear like input-dependent randomized smoothing. However, we select such distribution Q only for certification purposes, and the original distribution P which is used to construct the smoothed classifier remains static. Thus, we do not need to consider the existence of adaptive attackers. Indeed, for any smoothing distribution Q, with DSRS, we generate valid robustness certification for the static smoothed classifier e F P. Double Sampling Randomized Smoothing 0 50 100 150 200 ℓ2 Length of Perturbation Prob. of Predicting True Class (a) MNIST model. 0 50 100 150 200 ℓ2 Length of Perturbation Prob. of Predicting True Class (b) CIFAR-10 model. 0 200 400 600 800 1000 1200 ℓ2 Length of Perturbation Prob. of Predicting True Class (c) Image Net model. Figure 9. 
Probability of true-prediction w.r.t. ℓ2 length of perturbations for base classifiers from Gaussian augmentation training studied in Appendix J.4. Figures are plotted following the same protocol as in Appendix J.1. Table 10. Comparison of DSRS certified robust accuracy with different types of additional smoothing distribution Q and Neyman-Pearsonbased certification. Detail experiment settings are in Appendix J.4. Dataset Certification Certified Accuracy under Radius r 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Neyman-Pearson 95.2% 91.9% 87.7% 80.6% 71.2% 57.6% 41.0% 25.5% 13.6% 6.2% 2.1% 0.9% DSRS-trunc 95.1% 91.8% 87.9% 81.3% 72.9% 60.2% 46.1% 30.9% 17.4% 9.4% 3.6% 1.2% DSRS-var 95.1% 91.8% 88.2% 81.5% 73.6% 61.6% 48.4% 34.1% 21.0% 10.6% 4.4% 1.2% Neyman-Pearson 40.2% 32.6% 24.7% 18.9% 14.9% 10.2% 7.5% 4.1% 2.0% 0.7% 0.1% 0.1% DSRS-trunc 40.3% 32.9% 25.5% 20.1% 15.7% 11.5% 8.0% 5.5% 2.7% 1.5% 0.6% 0.1% DSRS-var 40.3% 33.1% 25.9% 20.6% 16.1% 12.5% 8.4% 6.4% 3.5% 1.8% 0.7% 0.1% Neyman-Pearson 42.5% 37.2% 33.0% 29.2% 24.8% 21.4% 17.6% 13.7% 10.2% 7.8% 5.7% 3.6% DSRS-trunc 42.5% 38.1% 34.4% 30.2% 27.0% 23.3% 21.3% 18.7% 14.2% 11.0% 9.0% 5.7% DSRS-var 42.9% 38.5% 35.0% 31.0% 26.5% 23.2% 21.0% 18.3% 14.6% 10.5% 8.2% 5.3% L. Discussions on Generalizing DSRS Framework In this appendix, we first introduce prior work that leverages additional information for certification in randomized smoothing, then generalize our DSRS as a more general framework to theoretically compare with the related work and highlight future directions. L.1. Existing Work on Leveraging Additional Information for Certification We discuss all known work that leverages more information to achieve tighter robustness certification for randomized smoothing prior to this paper to the best of our knowledge. Additional Black-Box Information. Our DSRS leverages additional information to tighten the certification for randomized smoothing. We leverage the information from an additional smoothing distribution. This information can be obtained from the base classifier that we only have query access on the predicted label. We call the information from this limited query access black-box information. The certification approaches that only require black-box information can be applied to any classification model regardless of the model structure. Thus, they are usually more general and more scalable. The classical Neyman-Pearson certification only requires black-box information. Besides our DSRS, the only other form of additional blackbox information that is leveraged is the higher-order information (Levine et al., 2020; Mohapatra et al., 2020). Formally, our additional black-box information has the form Pr ϵ Q[F0(x + ϵ) = y]. (158) In contrast, the higher-order information, especially secondorder information used by Levine et al. (2020); Mohapatra et al. (2020) has the form f P 0 (x0)y0 p = Pr ϵ P[F0(x0 + ϵ) = y0] p (159) which is also shown to be estimable given the black-box query access. However, the higher-order information has several limitations: (1) It is hard to leverage the information beyond the second order. Therefore, only second-order information is used in existing certification approaches yet. However, to achieve optimal tightness, one needs to leverage infinite orders of information, which brings an infinite number of constraints and is thus intractable. In contrast, we show that extra information from only one additional distribution suffices to derive a strongly tight certification. 
(2) In practice, the second-order information shows marginal improvements in the widely used ℓ2 and ℓ certification settings on real-world datasets (Levine et al., 2020; Mohapatra et al., 2020) and even such improvements require a large number of samples (usually in million order instead of ours 105). Double Sampling Randomized Smoothing (a) (Mohapatra et al., 2020, Figure 2(b)): Higher-order randomized smoothing on CIFAR-10. (b) (Mohapatra et al., 2020, Figure 4): Higher-order randomized smoothing on Image Net for models trained with different smoothing variances. Figure 10. Higher-order randomized smoothing certification (solid curves) compared with standard Neyman-Pearson-based certification (dotted curves). Dvijotham et al. (2020) also propose to use additional information to tighten the robustness certification for randomized smoothing ( full-information setting). They formalize the tightest possible certification and compare it with Neyman Pearson-based certifiation ( information-limited setting), but in practice, they do not try to leverage information from distributions other than P. Constraining Model Structure. If we discard the blackbox information constraint, we can obtain tighter robustness certification than classical Neyman-Pearson. For example, knowing the model structure can benefit the certification. Lee et al. (2019) show that when the base classifier is a decision tree, we can use dynamic programming to derive a strongly tight certification against ℓ0-bounded perturbations. Awasthi et al. (2020) show that, if the base classifier first performs a known low-rank projection, then works on the low-rank projected space, for the corresponding smoothed classifier, we can have a tighter certification on both ℓ2 and ℓ settings. However, it is challenging to find a projection such that the model preserves satisfactory performance while the projection rank is low. Indeed, the approach is evaluated on DCT basis to show the improvement on ℓ certification, and there exists a gap between the actual achieved certified robustness and the state of the art. We do not compare with these approaches since they impose additional assumptions on the base classifiers so their applicable scenarios are limited, and currently, the state-of-the-art base classifier does not satisfy their imposed constraints under ℓ2 and ℓ certification settings. Recently, for ℓ1 certification, a deterministic and improved smoothing approach (a type of non-additive smoothing mechanism) is proposed to handle the case where input images are constrained in space {0, 1/255, , 244/255, 1}d (Levine & Feizi, 2021). This could be viewed as constraining the attack space from another aspect and implies that certified robustness can be improved by better smoothing mechanisms, which is orthogonal to our work which focuses on certification for existing smoothing mechanisms. Confidence Smoothing. A group of certification approaches assumes that the base classifier outputs normalized confidence on the given input, and the smoothed classifier predicts the class with the highest expectation of normalized confidence under noised input. This assumption can be viewed as a special type of Constraining Model Structure . Under this assumption, we can query and approximate the quantile of the confidence under noised input: F0(x0 + ϵ) where ϵ P. With this information, we can leverage the Neyman-Pearson lemma in a more delicate way to provide a tighter (higher) lower bound of the expected confidence under perturbation, i.e., Eϵ PF0(x + δ + ϵ). 
These certification approaches provide tighter certification than classical Neyman-Pearson for the smoothed classifier that predicts the class with the highest expected normalized confidence. They are also useful for regression tasks such as object detection in computer vision, as shown in (Chiang et al., 2020). However, for the classification task and for utilizable base classifiers (i.e., with benign accuracy > 50% under noise), if we simply set the predicted class to have 100% confidence, we only increase the certified radius of the classifier, and the certification of this one-hot base classifier only requires classical Neyman-Pearson. Thus, these certification approaches, e.g., (Kumar et al., 2020a), may not achieve higher certified robustness on the classification task than Neyman-Pearson, and therefore we do not compare with them.

L.2. General Framework

Focusing on certification with additional black-box information, we generalize DSRS to allow more general additional information.

Definition 5 (General Additional Black-Box Information). For a given base classifier F0, suppose the true label at x0 is y0. For certifying robustness at x0, the general additional black-box information has the form

$\int_{\mathbb{R}^d} f_1(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_1,\quad \ldots,\quad \int_{\mathbb{R}^d} f_i(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_i,\quad \ldots,\quad \int_{\mathbb{R}^d} f_N(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_N, \qquad (160)$

where the f_i and c_i are pre-determined, each f_i is integrable on ℝ^d, and c_i ∈ ℝ for 1 ≤ i ≤ N.

Remark. Obtaining the information in Eqn. (160) requires only black-box access to whether F0(x) equals y0. We define the general additional black-box information in this way because: (1) information from finitely many points is useless, since the smoothed classifier puts zero probability mass on any finite point set, so useful information must be integral-based; (2) to provide a lower bound of the smoothed confidence $f^{\mathcal{P}}_0(x_0 + \delta)_{y_0}$, we only need to know whether F0(x) equals y0 in the region of interest.

Examples. (1) In our DSRS, the additional information $\mathbb{E}_{\epsilon\sim Q}[f(\epsilon)] = Q_A$ instantiates Definition 5 by setting N = 1, f_1(·) = q(· − x0), and c_1 = QA. (2) In (Mohapatra et al., 2020; Levine et al., 2020), the second-order information $\nabla f^{\mathcal{P}}_0(x_0)$ instantiates Definition 5 by setting N = d, $f_i(x) = \nabla p(x - x_0)_i$, and $c_i = \nabla f^{\mathcal{P}}_0(x_0)_i$, according to Theorem 1 in (Mohapatra et al., 2020). We remark that, due to the sampling difficulty, instead of using the whole vector $\nabla f^{\mathcal{P}}_0(x_0)$ as the information, second-order smoothing (Mohapatra et al., 2020; Levine et al., 2020) uses its p-norm in practice. However, using the full information only gives tighter certification, so we consider this more idealized variant.

Then, we can extend the constrained optimization problem (C) in Section 5.1 to (C_ext) to accommodate the general information:

$\begin{aligned} \min_{f}\quad & \mathbb{E}_{\epsilon\sim P}[f(\delta + \epsilon)] \qquad (161)\\ \text{s.t.}\quad & \mathbb{E}_{\epsilon\sim P}[f(\epsilon)] = P_A,\\ & \int_{\mathbb{R}^d} f_1(\epsilon)\, f(\epsilon)\,\mathrm{d}\epsilon = c_1,\ \ \ldots,\ \ \int_{\mathbb{R}^d} f_N(\epsilon)\, f(\epsilon)\,\mathrm{d}\epsilon = c_N,\\ & 0 \le f(\epsilon) \le 1 \quad \forall \epsilon \in \mathbb{R}^d. \end{aligned}$

Similarly, by strong duality (Theorem 3), to solve the certification problem

$\forall \delta \text{ s.t. } \|\delta\|_p \le r:\quad \mathcal{C}^{\mathrm{ext}}_{\delta}(P_A, c_1, \ldots, c_N) > 0.5, \qquad (162)$

we only need to solve the dual problem (D_ext):

$\begin{aligned} \max_{\eta,\lambda_1,\ldots,\lambda_N \in \mathbb{R}}\quad & \Pr_{\epsilon\sim P}\Big[p(\epsilon) < \eta\, p(\epsilon + \delta) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon + \delta)\Big]\\ \text{s.t.}\quad & \Pr_{\epsilon\sim P}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] = P_A,\\ & \int \mathbb{I}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] f_1(\epsilon)\,\mathrm{d}\epsilon = c_1,\ \ \ldots,\ \ \int \mathbb{I}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] f_N(\epsilon)\,\mathrm{d}\epsilon = c_N. \end{aligned}$

We remark that this generalization shares a similar spirit with one type of generalization of the Neyman-Pearson lemma (Chernoff & Scheffé, 1952; Mohapatra et al., 2020). Following the same motivation, Dvijotham et al.
(2020) try to generalize the certification by adding more constraints in their full-information setting . However, it is unclear whether their constraints in f-divergences form have the same expressive power as ours in practice (i.e., the practicality of theoretically tight Hockey-Stick divergence). A study of these different types of generalization would be our future work. More importantly, we believe that the pipeline of DSRS can be adapted to solve this generalized dual problem. We hope that this generalization and the corresponding DSRS could enable the exploration and exploitation of more types of additional information for tightening the robustness certification of randomized smoothing. Implications. The generalized DSRS framework allows us to explicitly compare different types of additional information. For example, comparing our additional distribution information and higher-order information, we find that (1) for additional distribution information, from Theorem 1 and Corollary 1, the strong tightness can be acheived for N = C 1 where C is the number of classes; (2) for higher-order information, from (Mohapatra et al., 2020, Asymptotic-Optimality Remark), the strong tightness can be achieved when all orders of information are used, i.e., N . This comparison suggests that our additional information from another smoothing distribution should be more efficient. Another implication is that, from Corollary 1, under multiclass setting, with proper choices of the (C 1) additional smoothing distributions, if we base DSRS on solving dual problem (Dext), the DSRS can achieve strong tightness in multiclass setting. Double Sampling Randomized Smoothing L.3. Limitations and Future Directions Despite the promising theoretical and empirical results of DSRS, DSRS still has some limitations which open an avenue for future work. We list the following future directions: (1) tighter certification from a more ideal additional smoothing distribution Q: there may exist better smoothing distribution Q or better methods to optimize hyperparameters in Q than what we have considered in this work in terms of certifying larger certified radius in practice; (2) better training approach for DSRS: there may be a large space for exploring training approaches that favor DSRS certification since all existing training methods are designed for Neyman-Pearsonbased certification. We believe that advances in this aspect can boost the robustness certification with DSRS to achieve state-of-the-art certified robustness. (3) better additional information: more generally, besides the prediction probability from an additional smoothing distribution, there may exist other useful additional information for certification in randomized smoothing. We hope our generalization of the DSRS framework in this appendix can inspire future work in tighter and more efficient certification for randomized smoothing.