# Double Sampling Randomized Smoothing

Linyi Li¹, Jiawei Zhang¹, Tao Xie², Bo Li¹

¹University of Illinois Urbana-Champaign, Illinois, USA. ²Peking University, Beijing, China. Correspondence to: Linyi Li.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noise from a certain distribution to certify the robustness of the resulting smoothed classifier. However, as previous work shows, the certified robust radius in randomized smoothing suffers from scaling to large datasets (the "curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the original smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify a Θ(√d) robust radius under the ℓ2 norm, where d is the input dimension, which implies that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computational method based on customized dual optimization that accounts for sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.

1. Introduction

[Figure 1: flowchart. Standard certification: given the model and smoothing distribution P, get PA from sampling, construct the worst-case classifier with PA, and derive the certified radius r_NP. Our DSRS certification: additionally smooth the model with another distribution Q, get QA from sampling, construct the worst-case classifier with both PA and QA, and derive a tighter certified radius r_DSRS > r_NP.]

Figure 1. Upper: Standard certification for randomized smoothing leverages information from only one distribution (PA) to compute robustness certification. Lower: DSRS leverages information from two distributions (PA and QA) to compute certification for the smoothed classifier, yielding a significantly larger certified radius.

Neural networks (NNs) have achieved great advances on a wide range of tasks, but have been shown vulnerable against adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Eykholt et al., 2018; Wang et al., 2021; Qiu et al., 2020; Li et al., 2020a; Zhang et al., 2021). A plethora of empirical defenses have been proposed to improve robustness; however, most of them are again broken by strong adversaries (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020). Recently, there have been great efforts in developing certified defenses for NNs under certain adversarial constraints (Wong & Kolter, 2018; Raghunathan et al., 2018; Li et al., 2020b; Xu et al., 2020; Li et al., 2019b). Randomized smoothing (Cohen et al., 2019; Li et al., 2019a) has emerged as a popular technique to provide certified robustness on large-scale datasets. Concretely, it samples noise from a certain smoothing distribution to construct a smoothed classifier, and then certifies the robust radius of that smoothed classifier.
Compared to other techniques (Wong & Kolter, 2018; Mirman et al., 2018; Gowal et al., 2019; Zhang et al., 2020b), randomized smoothing is efficient and agnostic to the model, and is applicable to a wide range of ML models, including large ResNets (He et al., 2016) on the ImageNet dataset. To improve the certified robust radius, existing studies (Cohen et al., 2019; Lee et al., 2019; Li et al., 2019a; Yang et al., 2020) have explored different smoothing distributions. However, the improvement is limited. For example, the ℓ2 certified robust radius does not increase on large datasets even though the input dimension d increases (Cohen et al., 2019), resulting in a low ℓ∞ certified radius on large datasets, which has been theoretically shown to be an intrinsic barrier of randomized smoothing (the "curse of dimensionality" or "ℓ∞ barrier") (Yang et al., 2020; Blum et al., 2020; Kumar et al., 2020b; Hayes, 2020; Wu et al., 2021). Given these challenges towards tight robustness certification, natural questions arise: Q1: Is it possible to circumvent the barrier of randomized smoothing by certifying with additional information? Q2: What type of information is needed to provide tight robustness certification?

To answer these questions, we propose the Double Sampling Randomized Smoothing (DSRS) framework, which leverages sampled noise from an additional smoothing distribution as additional information to tighten the robustness certification. In theory, we show that (1) ideally, if the decision region of the base classifier is known, DSRS can provide tight robustness certification; (2) more practically, if the inputs that can be correctly classified by the base classifier satisfy a concentration property within an input-centered ball with constant mass under the standard Gaussian measure, the standard Neyman-Pearson-based certification (Li et al., 2019a; Cohen et al., 2019; Salman et al., 2019; Yang et al., 2020) can only certify a dimension-independent ℓ2 radius, whereas DSRS with generalized Gaussian smoothing can certify a radius Ω(√d) under the ℓ2 norm, which increases with the dimension d, leading to tighter certification. Under more general conditions, we provide numerical simulations to verify our theories. Our results provide a positive answer to Q1 and sufficient conditions for Q2, i.e., DSRS may be able to circumvent the barrier of randomized smoothing.

Motivated by the theory, we leverage a type of generalized Gaussian (Zhang et al., 2020a) as the smoothing distribution and a truncated generalized Gaussian as the additional distribution. For this concretization, we propose an efficient and sound computational method to compute the certified robust radius for practical classifiers while accounting for sampling error. Our method formulates the certification problem given additional information as a constrained optimization problem and leverages specific properties of the dual problem to decompose the effects of different dual variables to solve it. DSRS is fully scalable since the computational time is nearly independent of the size of the dataset, model, or sampling. Our extensive experimental evaluation on MNIST, CIFAR-10, and ImageNet shows that (1) under a large sampling size (2×10⁵ to 8×10⁵), the certified radius of DSRS consistently increases as our theory suggests; (2) under a practical sampling size (10⁵), DSRS certifies consistently larger robust radii than existing baselines, including standard Neyman-Pearson-based certification.
As further discussed in Appendix L, we believe that DSRS, as a framework, opens a wide range of future directions for selecting or optimizing different forms of additional information to tighten the certification of randomized smoothing. We summarize the main technical contributions as follows:

- We propose a general robustness certification framework, DSRS, which leverages additional information obtained by sampling from another smoothing distribution.
- We prove that under practical concentration assumptions, DSRS certifies an Ω(√d) radius under the ℓ2 norm, with d the input dimension, which suggests a possible way to circumvent the intrinsic barrier of randomized smoothing.
- We concretize DSRS with generalized Gaussian smoothing mechanisms and propose a method to efficiently compute the certified radius for given classifiers.
- We conduct extensive experiments, showing that DSRS provides consistently tighter robustness certification than existing baselines, including standard Neyman-Pearson-based certification, across different models on MNIST, CIFAR-10, and ImageNet.

Related Work. For the certification method of randomized smoothing, most existing methods leverage only the true-class prediction probability to certify. In this case, the tightest possible robustness certification is based on the Neyman-Pearson lemma (Neyman & Pearson, 1933), as first proposed by Cohen et al. (2019) for certifying the ℓ2 radius under Gaussian smoothing. Several methods extend this certification to accommodate different smoothing distributions and different ℓp norms (Dvijotham et al., 2020; Yang et al., 2020; Zhang et al., 2020a; Levine & Feizi, 2021). In randomized smoothing, the ℓ2 certified robust radius r is similar across datasets of different scales, which translates to a vanishing ℓ∞ certified radius r/√d as the input dimension increases. This limitation of existing certification methods for randomized smoothing has been formally proved (Yang et al., 2020; Blum et al., 2020; Kumar et al., 2020b; Hayes, 2020; Wu et al., 2021) and is named the "ℓ∞ barrier" or "curse of dimensionality". Recent work tries to incorporate additional information besides the true-class prediction probability to tighten the certification and bypass the barrier. For ℓ2 and ℓ∞ certification, to the best of our knowledge, gradient magnitude is the only exploited additional information (Mohapatra et al., 2020; Levine et al., 2020). However, in practice, the improvement is relatively marginal and requires a large number of samples (see Appendix J.5). Some other methods provide tighter certification given specific model structures (Kumar et al., 2020a; Chiang et al., 2020; Lee et al., 2019; Awasthi et al., 2020). DSRS instead focuses on leveraging model-structure-agnostic additional information. More discussion on related work is in Appendix K.

2. Preliminaries and Background

Define [C] := {1, . . . , C}. Let ∆^C be the C-dimensional probability simplex. We consider a multiclass classification model F : ℝ^d → [C] as the base classifier, where d is the input dimension and the model outputs a hard-label class prediction within [C]. The original smoothing distribution P and the additional smoothing distribution Q are both supported on ℝ^d. We let p(·) and q(·) be their density functions, respectively. We assume that both p and q are positive and differentiable almost everywhere, i.e., the set of singular points has zero measure under either P or Q.
These assumptions hold for common smoothing distributions used in the literature, such as the Gaussian distribution (Lécuyer et al., 2019; Li et al., 2019a; Cohen et al., 2019; Yang et al., 2020). Randomized smoothing constructs a smoothed classifier from a given base classifier by adding input noise following the original smoothing distribution P. For input x ∈ ℝ^d, we define the prediction probability under P by the function f^P : ℝ^d → ∆^C:

f^P(x)_c := Pr_{ϵ∼P}[F(x + ϵ) = c], where c ∈ [C].  (1)

The smoothed classifier F̃^P : ℝ^d → [C] (or F̃ when P is clear from the context) predicts the class with the highest confidence after smoothing with P:

F̃^P(x) := argmax_{c∈[C]} f^P(x)_c.  (2)

We focus on robustness certification against ℓp-bounded perturbations for the smoothed classifier F̃, where the standard certification method is the Neyman-Pearson-based certification (Cohen et al., 2019) (details in Appendix A; the certified radius from it is denoted by r_NP). Concretely, certification methods compute the robust radius r defined as below.

Definition 1 (Certified Robust Radius). Under the ℓp norm (p ∈ ℝ+ ∪ {+∞}), for a given smoothed classifier F̃^P and input x0 ∈ ℝ^d with true label y0 ∈ [C], a radius r ≥ 0 is called a certified (robust) radius for F̃^P if F̃^P always predicts y0 for any input within the r-radius ball centered at x0:

∀δ ∈ ℝ^d, ‖δ‖_p < r ⟹ F̃^P(x0 + δ) = y0.  (3)

3. DSRS Overview

We propose Double Sampling Randomized Smoothing (DSRS), which leverages the prediction probability from an additional smoothing distribution Q (formally, QA := f^Q(x0)_{y0} = Pr_{ϵ∼Q}[F(x0 + ϵ) = y0]), along with the prediction probability from the original smoothing distribution P (formally, PA := f^P(x0)_{y0} as in Eqn. (1), which is also used in Neyman-Pearson-based certification), to provide robustness certification for the P-smoothed classifier F̃^P. Note that both PA and QA can be obtained from Monte-Carlo sampling (see Sections 5.1 and 5.2). Formally, we let r_DSRS denote the tightest possible certified radius given the prediction probability from Q; then r_DSRS can be defined as below.

Definition 2 (r_DSRS). Given PA and QA,

r_DSRS := max r s.t. ∀F : ℝ^d → [C] with f^P(x0)_{y0} = PA and f^Q(x0)_{y0} = QA, ∀x, ‖x − x0‖_p < r ⟹ F̃^P(x) = y0.  (4)

Intuitively, r_DSRS is the maximum possible radius such that any smoothed classifier constructed from a base classifier satisfying the PA and QA constraints cannot predict another label when the perturbation magnitude is within the radius. In Section 4, we will analyze the theoretical properties of DSRS, including comparing r_DSRS and r_NP under the concentration assumption. Computing r_DSRS is nontrivial, so in Section 5, we will introduce a practical computational method that exactly solves r_DSRS when P and Q are standard and generalized (truncated) Gaussians. In Appendix G, we will show method variants that deal with other forms of P and Q distributions. In Appendix L, we will further generalize the DSRS framework.
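As noted above, PA and QA are obtained via Monte-Carlo sampling. The following is a minimal sketch of such an estimator; the callables `base_classifier` and `sample_noise` are hypothetical stand-ins for the trained model and a sampler for P or Q.

```python
import numpy as np

def estimate_probability(base_classifier, x0, y0, sample_noise,
                         n_samples=100_000, batch=1_000):
    """Monte-Carlo estimate of Pr_{eps ~ D}[F(x0 + eps) = y0] for a noise sampler D.

    base_classifier(inputs) -> hard-label predictions for a batch of inputs;
    sample_noise(n) -> n noise vectors with the same shape as x0 (a 1-D vector here).
    """
    total = (n_samples // batch) * batch
    hits = 0
    for _ in range(n_samples // batch):
        noise = sample_noise(batch)                   # draws from P or Q
        preds = base_classifier(x0[None, :] + noise)  # hard-label predictions
        hits += int(np.sum(preds == y0))
    return hits / total

# Point estimates as used in Section 3 (Section 5.2 replaces them with
# confidence-interval endpoints):
#   PA = estimate_probability(F0, x0, y0, sample_from_P)
#   QA = estimate_probability(F0, x0, y0, sample_from_Q)
```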
Smoothing Distributions. Now we formally define the smoothing distributions used in DSRS. We mainly consider the standard Gaussian N (Cohen et al., 2019; Yang et al., 2020) and the generalized Gaussian N^g (Zhang et al., 2020a). We let N(σ) denote the standard Gaussian distribution with covariance matrix σ²I_d, whose density function is proportional to exp(−‖ϵ‖²₂/(2σ²)).¹ For k ∈ ℕ, we let N^g(k, σ) denote the generalized Gaussian whose density function is proportional to ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)), where σ' = √(d/(d−2k)) σ. Here we use σ' instead of σ to ensure that the expected noise magnitude √(E‖ϵ‖²₂) of N^g(k, σ) is the same as that of N(σ). The generalized Gaussian as the smoothing distribution overcomes the thin-shell problem of the standard Gaussian and improves certified robustness (Zhang et al., 2020a); we will reveal more of its theoretical advantages in Section 4. As the additional smoothing distribution Q, we mainly consider truncated distributions within a small ℓ2-radius ball. Specifically, the truncated standard Gaussian is denoted by N_trunc(T, σ), with density function proportional to exp(−‖ϵ‖²₂/(2σ²)) · 𝟙[‖ϵ‖₂ ≤ T]; and the truncated generalized Gaussian is denoted by N^g_trunc(k, T, σ), with density function proportional to ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) · 𝟙[‖ϵ‖₂ ≤ T]. In Table 1, we summarize these distribution definitions.

Table 1. Definitions of smoothing distributions in this paper. In the table, k ∈ ℕ and σ' = √(d/(d−2k)) σ.

| Name | Notation | Density Function |
| --- | --- | --- |
| Standard Gaussian | N(σ) | ∝ exp(−‖ϵ‖²₂/(2σ²)) |
| Generalized Gaussian | N^g(k, σ) | ∝ ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) |
| Truncated Standard Gaussian | N_trunc(T, σ) | ∝ exp(−‖ϵ‖²₂/(2σ²)) · 𝟙[‖ϵ‖₂ ≤ T] |
| Truncated Generalized Gaussian | N^g_trunc(k, T, σ) | ∝ ‖ϵ‖₂^{−2k} exp(−‖ϵ‖²₂/(2σ'²)) · 𝟙[‖ϵ‖₂ ≤ T] |

¹In this paper, N(σ) is shorthand for N(0, σ²I_d).
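The generalized Gaussian in Table 1 can be sampled radially: its squared norm satisfies ‖ϵ‖²₂ = 2σ'²t with t ∼ Γ(d/2 − k, 1) (the same decomposition used later in Theorem 5), and its direction is uniform on the unit sphere. Below is a sketch based on that observation; the function names are ours, and the truncated variant uses simple rejection.

```python
import numpy as np

def sample_generalized_gaussian(n, d, k, sigma, rng):
    """Draw n noise vectors from N^g(k, sigma) in R^d.

    Radial part: ||eps||_2^2 = 2 * sigma'^2 * t with t ~ Gamma(d/2 - k, 1);
    directional part: uniform on the unit sphere.
    """
    sigma_p = np.sqrt(d / (d - 2 * k)) * sigma           # sigma' in Table 1
    t = rng.gamma(shape=d / 2 - k, scale=1.0, size=n)
    radius = sigma_p * np.sqrt(2 * t)
    direction = rng.standard_normal((n, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    return radius[:, None] * direction

def sample_truncated_generalized_gaussian(n, d, k, T, sigma, rng):
    """Draw from N^g_trunc(k, T, sigma) by rejecting samples with ||eps||_2 > T."""
    kept = []
    while sum(len(b) for b in kept) < n:
        batch = sample_generalized_gaussian(n, d, k, sigma, rng)
        kept.append(batch[np.linalg.norm(batch, axis=1) <= T])
    return np.concatenate(kept)[:n]

# Example: ImageNet-like dimension with k = d/2 - 4 (the setting used in Section 6).
# rng = np.random.default_rng(0)
# eps = sample_generalized_gaussian(1000, 3 * 224 * 224, 3 * 224 * 224 // 2 - 4, 0.5, rng)
```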
4. Theoretical Analysis of DSRS

In this section, we theoretically analyze DSRS to answer the following core question: does QA, the prediction probability under the additional smoothing distribution, provide sufficient information for tightening the robustness certification? We first show that if the support of Q is the decision region of the true class, DSRS can certify the smoothed classifier's maximum possible robust radius. Then, under a concentration assumption, we show that the ℓ2 certified radius of DSRS can be Ω(√d), which is asymptotically optimal for bounded inputs. Finally, under more general conditions, we conduct both numerical simulations and real-data experiments to verify that the certified radius of DSRS increases with the data dimension d. These analyses provide a positive answer to the above core question.

DSRS can certify the tightest possible robust radius. Given an original smoothing distribution P and a base classifier F0, at input point x0 ∈ ℝ^d with true label y0, we define the tightest possible certified robust radius r_tight to be the largest ℓp ball that contains no adversarial example for the smoothed classifier F̃^P_0:

r_tight := max r s.t. ∀δ ∈ ℝ^d, ‖δ‖_p < r ⟹ F̃^P_0(x0 + δ) = y0.

Then, for binary classification, if we choose an additional smoothing distribution Q whose support is the decision region or its complement, DSRS can certify the robust radius r_tight.

Theorem 1. Suppose the original smoothing distribution P has non-zero density everywhere, i.e., p(·) > 0. For binary classification with base classifier F0, at point x0 ∈ ℝ^d, let Q be an additional distribution that satisfies: (1) its support is the decision region of an arbitrary class c ∈ [C] shifted by x0: supp(Q) = {x − x0 : F0(x) = c}; and (2) for any x ∈ supp(Q), 0 < q(x)/p(x) < +∞. Then, plugging PA = f^P_0(x0)_c and QA = f^Q_0(x0)_c (see Eqn. (1)) into Definition 2, we have r_DSRS = r_tight under any ℓp (p ≥ 1).

Proof sketch. We defer the proof to Appendix F.1. At a high level, with this type of Q, we have QA = 1 or QA = 0. Then, from the mass of Q's support under P and the value of PA, we can conclude that Q's support is exactly the decision region of label c or its complement. Thus, the DSRS constraints (in Eqn. (4)) are satisfied iff F differs from F0 on a zero-measure set, and thus we exactly compute the smoothed classifier F̃^P_0's maximum certified robust radius in DSRS. An extension to the multiclass setting is in Appendix F.2.

Remark. For any base classifier F0, a Q that satisfies the conditions in Theorem 1 exists, which implies that with DSRS, certifying a strictly tight robust radius is possible. In contrast, Neyman-Pearson-based certification is proved to certify a tight robust radius for linear base classifiers (Cohen et al., 2019, Section 3.1), but for arbitrary base classifiers, its tightness is not guaranteed. This result suggests that, to certify a tight radius, just one additional smoothing distribution Q is sufficient rather than multiple ones. On the other hand, it is challenging to find a Q whose support (or its complement) exactly matches the decision region of an NN classifier. In the following, we analyze the tightness of DSRS under weaker assumptions.

DSRS can certify an Ω(√d) ℓ2 radius under the concentration assumption. We begin by defining the concentration property.

Definition 3 ((σ, Pcon)-Concentration). Given a base classifier F0 and input x0 ∈ ℝ^d with true label y0, we say F0 satisfies the (σ, Pcon)-concentration property if, within the Pcon-percentile of smallest-ℓ2-magnitude noise from the Gaussian N(σ), adversarial examples occupy zero measure. Formally, (σ, Pcon)-concentration means

Pr_{ϵ∼N(σ)}[F0(x0 + ϵ) = y0 | ‖ϵ‖₂ ≤ T] = 1,  (5a)

where T satisfies

Pr_{ϵ∼N(σ)}[‖ϵ‖₂ ≤ T] = Pcon.  (5b)

Intuitively, (σ, Pcon)-concentration implies that the base classifier has few adversarial examples under small-magnitude noise during standard Gaussian smoothing. In Figure 4 (in Appendix B), we empirically verify that a well-trained base classifier on ImageNet may satisfy this property for a significant portion of inputs. Furthermore, Salman et al. (2019) show that promoting this concentration property by adversarially training the smoothed classifier improves the certified robustness. With this concentration property, DSRS certifies a radius Ω(√d) under the ℓ2 norm, as the following theorem shows.

Theorem 2. Let d be the input dimension and F0 be the base classifier. For an input point x0 ∈ ℝ^d with true class y0, suppose F0 satisfies the (σ, Pcon)-concentration property. Then, for any sufficiently large d, for the classifier F̃^P_0 smoothed by the generalized Gaussian P = N^g(k, σ) with d/2 − 15 ≤ k < d/2, DSRS with additional smoothing distribution Q = N^g_trunc(k, T, σ) can certify an ℓ2 radius r_DSRS ≥ 0.02σ√d, where T = σ√(2 ΓCDF⁻¹_{d/2}(Pcon)) and ΓCDF_{d/2} is the CDF of the gamma distribution Γ(d/2, 1).

Proof sketch. We defer the proof to Appendix F.3. At a high level, based on a property of the standard Gaussian distribution (Proposition F.1), we find that QA = 1 under the concentration property (Lemma F.2). With QA = 1, we derive a lower bound on r_DSRS in Lemma F.3. We then use: (1) the concentration of the beta distribution Beta((d−1)/2, (d−1)/2) for large d (see Lemma F.4); (2) the relative concentration of the gamma distribution Γ(d/2, 1) around its mean for large d (see Proposition F.5 and the resulting Fact F.7); and (3) the misalignment of the mean and median of the gamma distribution Γ(d/2 − k, 1) for small (d/2 − k) (see Proposition F.6) to lower bound the quantity in Lemma F.3 and show that it is larger than or equal to 0.5. Then, using the conclusion in Section 5, we conclude that r_DSRS ≥ 0.02σ√d.
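The truncation threshold T in Theorem 2 has a direct numerical form: since ‖ϵ‖²₂ = 2σ²g with g ∼ Γ(d/2, 1) for ϵ ∼ N(σ), T is the σ-scaled gamma quantile. A small sketch (helper name ours):

```python
from scipy.stats import gamma

def concentration_threshold(d, sigma, p_con):
    """T such that Pr_{eps ~ N(sigma)}[||eps||_2 <= T] = p_con (as in Theorem 2).

    Uses ||eps||_2^2 = 2 * sigma^2 * g with g ~ Gamma(d/2, 1).
    """
    return sigma * (2.0 * gamma.ppf(p_con, a=d / 2)) ** 0.5

# e.g., an ImageNet-scale input (d = 3 * 224 * 224) with sigma = 0.5, p_con = 0.5:
# T = concentration_threshold(3 * 224 * 224, 0.5, 0.5)   # roughly sigma * sqrt(d)
```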
Remark. (1) For standard Neyman-Pearson-based certification, r_NP = σΦ⁻¹(f^P(x0)_{y0}). As the input dimension d increases, achieving a growing ℓ2 certified radius requires the prediction probability of the true class under P, f^P(x0)_{y0}, to grow simultaneously, which is challenging. Indeed, across different datasets, f^P(x0)_{y0} is almost a constant, which leads to a constant ℓ2 certified radius and a shrinking ℓ∞ radius for large d. We further empirically illustrate this property in Appendix C. (2) In contrast, as long as the model satisfies the concentration property, which may be almost true on large datasets as reflected by Figure 4, with our specific choices of P and Q, DSRS can achieve an Ω(σ√d) ℓ2 radius on large datasets. This rate translates to a constant Ω(σ) ℓ∞ radius on large datasets and thus breaks the curse of dimensionality of randomized smoothing. We remark that this √d rate is optimal when the dataset input is bounded, as for images (otherwise, the ω(1) ℓ∞ radius would exceed the constant ℓ∞ diameter for large d). Therefore, under the assumption of the concentration property, DSRS provides asymptotically optimal certification for randomized smoothing. (3) Smoothing with the generalized Gaussian distribution and choosing a parameter k that is close to d/2 play an essential role in proving the Ω(σ√d) certified radius. Otherwise, Theorem 6 in Appendix F.4 shows that no certification method can certify an ℓ2 radius of c√d for any c > 0. This adds further theoretical evidence for the superiority of the generalized Gaussian, which is cross-validated by Zhang et al. (2020a).

DSRS certifies a tighter radius under general scenarios. When the concentration property does not hold exactly, a rigorous theoretical analysis becomes challenging, since the impact of a finite dual variable needs to be taken into account. This dual variable appears inside a Lambert W function, where typical approximation bounds are too loose to provide non-trivial convergence rates. Thus, we leverage the numerical computational method introduced in Section 5 to provide numerical simulations and real-data experiments. We generalize the concentration assumption by changing the holding probability in Eqn. (5a) from 1 to α^{1/N}, which corresponds to the (1 − α)-confidence lower bound of QA given N Monte-Carlo samples, where we set α = 0.1% following the convention (Cohen et al., 2019).
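The α^{1/N} value above is the one-sided (1 − α) Clopper-Pearson lower bound for the case where all N sampled noises are classified correctly. A small sketch (helper name ours) of the general bound, with that special case checked:

```python
from scipy.stats import beta

def lower_confidence_bound(successes, n, alpha):
    """One-sided (1 - alpha) Clopper-Pearson lower bound on a Bernoulli parameter."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, n - successes + 1)

# Special case used above: all N samples are classified correctly, so the
# (1 - alpha)-confidence lower bound on QA reduces to alpha ** (1 / N).
N, alpha = 100_000, 0.001
assert abs(lower_confidence_bound(N, N, alpha) - alpha ** (1 / N)) < 1e-6
```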
In this scenario, we compare DSRS certification with Neyman-Pearson certification numerically in Figure 2 (numerical simulations in Figure 2(a) and ImageNet experiments in Figure 2(b)). In Figure 2(a), we assume (σ, Pcon)-concentration with σ = 1, Pcon = 0.5, and different sampling numbers N. We further assume PA = f^P(x0)_{y0} = 0.6 as the true-class prediction probability under P. In Figure 2(b), we take the model weights trained by Salman et al. (2019) on ImageNet and apply generalized Gaussian smoothing with d/2 − k = 4 and σ = 0.50. We uniformly pick 100 samples from the test set and compute the (1 − α)-confidence certified radius for each sample. We report the certified accuracy under different ℓ2 radii r, i.e., the fraction of samples that are certifiably correctly classified by the smoothed classifier.

[Figure 2(a): certified radius (y-axis) versus input dimension d (x-axis, 10² to 10⁶) with PA = 0.6 and (1, 0.5)-concentration; curves compare Neyman-Pearson with standard Gaussian, DSRS with N ∈ {10², 10³, 10⁴, 10⁵, 10⁶, 10⁷} at α = 0.1%, and DSRS with deterministic information (ideal); the MNIST, CIFAR-10, and ImageNet input dimensions are marked.]

(a) When the holding probability in Eqn. (5a) is obtained from sampling N times with confidence level 1 − α = 99.9%: relation between the certified radius (y-axis) and the input data dimension d (x-axis). Different curves correspond to different N.

[Figure 2(b): certified accuracy (y-axis) versus certified radius r (x-axis, 0.0 to 2.0) on ImageNet for the smoothadv model from Salman et al. (2019) with σ = 0.50; curves compare Neyman-Pearson with N ∈ {50000, 100000, 200000} and DSRS with N = 50000 + {50000, 100000, 200000, 400000, 800000}.]

(b) Relation between the certified radius (x-axis) and certified accuracy (y-axis) on ImageNet models. Different curves correspond to Neyman-Pearson and DSRS certification with different N. Sampling error is considered; confidence level = 99.9%.

Figure 2. Tendency of the DSRS certified robust radius considering sampling error. In both (a) and (b), the DSRS certified radius grows with the sampling number N, while the Neyman-Pearson radius is almost fixed.

[Figure 3: flowchart. DSRS certification as a constrained optimization (§5.1) → strong duality → determine PA and QA from confidence intervals (§5.2) → dual problem solving via binary search and numerical integration (§5.3).]

Figure 3. Overview of the DSRS computational method.

Remark. When the sampling error and confidence interval come into play, they quickly suppress the Ω(√d) growth rate of the DSRS certified radius (blue curve), as shown in Figure 2(a). Nonetheless, DSRS still certifies a larger radius than the standard Neyman-Pearson method, and increasing the sampling number further enlarges the gap. We consider another relaxed version of the concentration property in Appendix D, where DSRS still provides significantly tighter robustness certification than Neyman-Pearson.

5. DSRS Computational Method

The theoretical analysis in Section 4 implies that the additional smoothing distribution Q helps to tighten the robustness certification significantly over standard Neyman-Pearson-based certification. In this section, we propose an efficient computational method to compute this tight certified robust radius r_DSRS (see Definition 2) when P is a generalized Gaussian and Q is a truncated version of P, as suggested by Theorems 2 and 6. Compared with the classical certification for randomized smoothing or its variants (cf. (Kumar et al., 2020a)), incorporating additional information raises a major challenge: the Neyman-Pearson lemma (1933) can no longer serve as the foundation of the certification algorithm due to its inability to handle the additional information. Thus, we propose a novel DSRS computational method by formalizing robustness certification as a constrained optimization problem and proving its strong duality (§5.1). Then, we propose an efficient algorithm to solve this specific dual optimization problem considering sampling error. The detailed algorithm can be found in Alg. 2 in Appendix E.1: 1) we first perform a binary search on the certified radius r to determine the maximum radius that we can certify (see the sketch after this paragraph); 2) for the current r, we determine the smoothed prediction confidences PA and QA from the confidence intervals of predicting the true class (§5.2); 3) then, for the current r, we solve the dual problem by a quick binary search for the dual variables λ1 and λ2 (see Eqn. (10)) along with numerical integration (§5.3). To guarantee the soundness of numerical-integration-based certification, we take the maximum possible error into account during the binary search. We will discuss further extensions in §5.4.
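Step 1) above is a monotone binary search over the radius. A schematic sketch follows; `certify_at_radius` is a hypothetical stand-in for steps 2)-3) (interval selection and dual solving), and the search range and precision are arbitrary choices of ours.

```python
def max_certified_radius(certify_at_radius, r_max=4.0, precision=1e-3):
    """Binary search for the largest radius r such that certify_at_radius(r) holds.

    certify_at_radius(r) should return True iff C_delta(PA, QA) > 0.5 for all
    ||delta||_2 = r (steps 2-3 of the DSRS procedure), and is assumed monotone:
    certifiable at r implies certifiable at any r' < r.
    """
    lo, hi = 0.0, r_max
    if not certify_at_radius(lo):
        return 0.0          # clean input is not even certified; abstain
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if certify_at_radius(mid):
            lo = mid
        else:
            hi = mid
    return lo               # a sound under-approximation of the certified radius
```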
5.1. DSRS as Constrained Optimization

We first formulate the robustness certification as a constrained optimization problem and then show several foundational properties of the problem. Following the notation of Definition 2, from the given base classifier F0, we can use Monte-Carlo sampling to obtain

PA = f^P_0(x0)_{y0}, QA = f^Q_0(x0)_{y0}.  (7)

In §5.2 we will discuss how to handle confidence intervals of PA and QA. For now, we assume PA and QA are fixed. Given a perturbation vector δ ∈ ℝ^d, to test whether the smoothed classifier F̃^P_0 still predicts the true label y0, we only need to check whether the prediction probability f^P_0(x0 + δ)_{y0} > 0.5. This can be formulated as a constrained optimization problem (C):

minimize_f E_{ϵ∼P}[f(ϵ + δ)]  (8a)
s.t. E_{ϵ∼P}[f(ϵ)] = PA, E_{ϵ∼Q}[f(ϵ)] = QA,  (8b)
0 ≤ f(ϵ) ≤ 1 ∀ϵ ∈ ℝ^d.  (8c)

Remark. (C) seeks the minimum possible f^P(x0 + δ)_{y0} given the constraints from Eqn. (7). Concretely, we let f represent whether the base classifier predicts label y0: f(·) = 𝟙[F(· + x0) = y0], and accordingly impose f ∈ [0, 1] in Eqn. (8c). Then, Eqn. (8a) and Eqn. (8b) unfold f^P(x0 + δ)_{y0}, f^P(x0)_{y0}, and f^Q(x0)_{y0}, respectively, and impose the constraints from Eqn. (7).

We let Cδ(PA, QA) denote the optimal value of Eqn. (8) when it is feasible. Thus, under norm p, to certify robustness within radius r, we only need to check whether

∀δ, ‖δ‖_p < r: Cδ(PA, QA) > 0.5.  (9)

This formulation yields the tightest robustness certification given information from P and Q under the binary setting. Under the multiclass setting, there are efforts towards tighter certification by comparing against the maximum probability over the other classes instead of 0.5 in Eqn. (9) (Dvijotham et al., 2020). To save sampling cost and to follow the convention (Cohen et al., 2019; Yang et al., 2020; Jeong & Shin, 2020; Zhai et al., 2020), we mainly consider the > 0.5 criterion for the multiclass setting, and the extension to the other form is straightforward. Since our choices of P and Q (standard/generalized (truncated) Gaussian) are isotropic and centered at the origin, when certifying radius r, for ℓ2 certification we only need to test Cδ(PA, QA) > 0.5 with δ = (r, 0, . . . , 0)^T; and for ℓ∞ we only need to divide the ℓ2 radius by √d. This trick can also be extended to the ℓ1 case (Zhang et al., 2020a).

Directly solving (C) is challenging. Thus, we construct the Lagrangian dual problem (D):

maximize_{λ1, λ2 ∈ ℝ} Pr_{ϵ∼P}[p(ϵ) < λ1 p(ϵ + δ) + λ2 q(ϵ + δ)]  (10a)
s.t. Pr_{ϵ∼P}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)] = PA,
Pr_{ϵ∼Q}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)] = QA.  (10b)

In Eqn. (10), p(·) and q(·) are the density functions of the distributions P and Q, respectively. We let Dδ(PA, QA) denote the optimal objective value of Eqn. (10a) when it is feasible.

Theorem 3. For given δ ∈ ℝ^d, PA, and QA, if (C) and (D) are both feasible, then Cδ(PA, QA) = Dδ(PA, QA).

The theorem states the strong duality between (C) and (D). We defer the proof to Appendix G.1, which is based on the min-max inequality and the feasibility condition of (D). Intuitively, we can view (C), a functional optimization over f, as a linear programming (LP) problem over an infinite number of variables {f(x) : x ∈ ℝ^d}, so that strong duality holds, which guarantees the tightness of DSRS in the primal space.
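To make the LP view concrete, here is a toy Monte-Carlo discretization of (C) in two dimensions with a standard Gaussian P and a truncated Gaussian Q. This is only an illustrative sanity check (the constants, sample size, and the "base classifier" region `f_star` are arbitrary choices of ours), not the paper's certification routine, which solves the dual in closed form (§5.3); it illustrates that the optimum of (C) lower-bounds the shifted probability of any classifier consistent with the same PA and QA.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, sigma, T, N = 2, 1.0, 1.0, 4000
delta = np.array([0.5, 0.0])

x = rng.normal(scale=sigma, size=(N, d))                  # samples from P
in_T = (np.linalg.norm(x, axis=1) <= T)                   # support of Q
p_con = in_T.mean()                                       # Pr_P[||x||_2 <= T]
# Importance weight p(x - delta) / p(x), so E_P[f(eps + delta)] ~ mean(w * f).
w = np.exp((2 * x @ delta - delta @ delta) / (2 * sigma ** 2))

f_star = (x[:, 0] <= 0.8).astype(float)                   # some "base classifier"
PA, QA = f_star.mean(), (f_star * in_T).mean() / p_con    # its PA and QA

res = linprog(c=w / N,                                    # objective (8a)
              A_eq=np.vstack([np.ones(N) / N,             # constraint E_P[f] = PA
                              in_T / p_con / N]),         # constraint E_Q[f] = QA
              b_eq=[PA, QA], bounds=(0, 1), method="highs")
print("worst-case shifted probability :", res.fun)
print("shifted probability of f_star  :", float(np.mean(w * f_star)))
assert res.fun <= np.mean(w * f_star) + 1e-6              # (C) is a valid lower bound
```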
5.2. Dealing with Confidence Intervals

It is practically intractable to know the exact PA and QA in Eqn. (7) by querying the model's prediction only a finite number of times. The common practice is to use Monte-Carlo sampling, which gives confidence intervals for PA and QA with a predefined confidence level 1 − α. Suppose we have confidence intervals [P̲A, P̄A] and [Q̲A, Q̄A]. To derive a sound certification, we need to certify that for any PA ∈ [P̲A, P̄A] and any QA ∈ [Q̲A, Q̄A], Cδ(PA, QA) > 0.5. Given the infinite number of possible PA and QA, the brute-force method is intractable. Here, without computing Cδ, we show how to solve

(P*A, Q*A) = argmin_{P'A ∈ [P̲A, P̄A], Q'A ∈ [Q̲A, Q̄A]} Cδ(P'A, Q'A).  (11)

If the solved P*A and Q*A satisfy Cδ(P*A, Q*A) > 0.5, then for any PA and QA within the confidence intervals, we can certify the robustness against perturbation δ. We observe the following two properties of Cδ.

Proposition 1. Cδ(·, ·) is convex in the feasible region.

Proposition 2. With respect to x ∈ [0, 1], the functions x ↦ min_y Cδ(x, y) and x ↦ argmin_y Cδ(x, y) are monotonically non-decreasing. Similarly, with respect to y ∈ [0, 1], the functions y ↦ min_x Cδ(x, y) and y ↦ argmin_x Cδ(x, y) are monotonically non-decreasing.

These two propositions characterize the landscape of Cδ(·, ·): it is convex and monotonically non-decreasing along both the x and y axes. Thus, the desired (P*A, Q*A) (the location of the minimum within the bounded box) lies on the box boundary, and we only need to compute the locations of the boundary-line-sliced minima and compare them with the box constraints to solve Eqn. (11). Formally, we propose an efficient algorithm (Alg. 1, deferred to Appendix E.1) to solve for (P*A, Q*A).

Theorem 4. If Eqn. (11) is feasible, the P*A and Q*A returned by Alg. 1 solve Eqn. (11).

The above results are proved in Appendix G.2. At a high level, we prove Proposition 1 by definition; we prove Proposition 2 via a reduction to classical Neyman-Pearson-based certification and an analysis of this reduced problem; and we prove Theorem 4 based on Propositions 1 and 2 along with an exhaustive and nontrivial analysis of all possible cases.

5.3. Solving the Dual Problem

After the smoothed prediction confidences PA and QA are determined from the confidence intervals, we solve the dual problem Dδ(PA, QA) as defined in Eqn. (10). We solve the problem based on the following theorem.

Theorem 5 (Numerical Integration for DSRS with Generalized Gaussian Smoothing). In Dδ(PA, QA), let r = ‖δ‖₂. When P = N^g(k, σ) and Q = N^g_trunc(k, T, σ), let σ' := √(d/(d−2k)) σ and ν := ΓCDF_{d/2−k}(T²/(2σ'²))⁻¹. Then

R(λ1, λ2) := Pr_{ϵ∼P}[p(ϵ) < λ1 p(ϵ + δ) + λ2 q(ϵ + δ)]
  = E_{t∼Γ(d/2−k,1)} u1(t), if λ1 ≤ 0;
  = E_{t∼Γ(d/2−k,1)} [u1(t) + u2(t)], if λ1 > 0,

where
u1(t) = BetaCDF_{(d−1)/2}( ( min{T², 2kσ'² W((t/k) e^{t/k} (λ1 + νλ2)^{1/k})} − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ),
u2(t) = max{ BetaCDF_{(d−1)/2}( ( 2kσ'² W((t/k) e^{t/k} λ1^{1/k}) − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ) − BetaCDF_{(d−1)/2}( ( T² − (σ'√(2t) − r)² ) / (4 r σ'√(2t)) ), 0 };

P(λ1, λ2) := Pr_{ϵ∼P}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)]
  = E_{t∼Γ(d/2−k,1)} [ u3(t, λ1) if t ≥ T²/(2σ'²); u3(t, λ1 + νλ2) if t < T²/(2σ'²) ],

where
u3(t, λ) = BetaCDF_{(d−1)/2}( ( (σ'√(2t) + r)² − 2kσ'² W((t/k) e^{t/k} λ^{−1/k}) ) / (4 r σ'√(2t)) );

Q(λ1, λ2) := Pr_{ϵ∼Q}[p(ϵ − δ) < λ1 p(ϵ) + λ2 q(ϵ)]
  = ν E_{t∼Γ(d/2−k,1)} [ u3(t, λ1 + νλ2) · 𝟙[t ≤ T²/(2σ'²)] ].

In the above equations, Γ(d/2 − k, 1) is the gamma distribution and ΓCDF_{d/2−k} is its CDF, BetaCDF_{(d−1)/2} is the CDF of the distribution Beta((d−1)/2, (d−1)/2), and W is the principal branch of the Lambert W function.

When P is a standard Gaussian and Q is a truncated standard Gaussian, we derive similar expressions, as detailed in Appendix H.1. We prove Theorem 5 in Appendix G.3, which extends the level-set sliced integration and results from (Yang et al., 2020). With the theorem, we can rewrite the dual problem Dδ(PA, QA) as

max_{λ1, λ2 ∈ ℝ} R(λ1, λ2) s.t. P(λ1, λ2) = PA, Q(λ1, λ2) = QA.  (12)

Given concrete λ1 and λ2, from the theorem, the function values P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) can be easily computed with one-dimensional numerical integration using the SciPy package.
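As an illustration of the last point, the sketch below evaluates P(λ1, λ2) and Q(λ1, λ2) with one-dimensional quadrature, following the structure of Theorem 5 as written above (R(λ1, λ2) is analogous via u1 and u2). The helper names and the usage numbers are ours, and the expressions should be cross-checked against Appendix G.3 / H.1 before relying on them.

```python
import numpy as np
from scipy import integrate
from scipy.special import lambertw
from scipy.stats import beta, gamma

def make_PQ(d, k, sigma, T, r):
    """Build P(l1, l2) and Q(l1, l2) from Theorem 5 as 1-D numerical integrals."""
    sigma_p = np.sqrt(d / (d - 2 * k)) * sigma
    tau = T ** 2 / (2 * sigma_p ** 2)                  # truncation point in t-space
    nu = 1.0 / gamma.cdf(tau, a=d / 2 - k)             # ratio q / p on the support of Q
    a = (d - 1) / 2                                    # Beta((d-1)/2, (d-1)/2)

    def u3(t, lam):
        if lam <= 0:                                   # p(eps - delta) < lam * p(eps) impossible
            return 0.0
        w = k * np.real(lambertw((t / k) * np.exp(t / k) * lam ** (-1.0 / k)))
        rho = sigma_p * np.sqrt(2 * t)                 # ||eps||_2 on this level set
        return beta.cdf(((rho + r) ** 2 - 2 * sigma_p ** 2 * w) / (4 * r * rho), a, a)

    dens = lambda t: gamma.pdf(t, a=d / 2 - k)

    def P(l1, l2):
        inner = integrate.quad(lambda t: u3(t, l1 + nu * l2) * dens(t), 0, tau, limit=200)[0]
        outer = integrate.quad(lambda t: u3(t, l1) * dens(t), tau, np.inf, limit=200)[0]
        return inner + outer

    def Q(l1, l2):
        return nu * integrate.quad(lambda t: u3(t, l1 + nu * l2) * dens(t), 0, tau, limit=200)[0]

    return P, Q

# Usage sketch (hypothetical numbers): P, Q = make_PQ(d=3072, k=1530, sigma=0.5, T=25.0, r=0.5)
```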
Now, solving Dδ(PA, QA) reduces to finding dual variables λ1 and λ2 such that P(λ1, λ2) = PA and Q(λ1, λ2) = QA. Generally, we find that there is only one feasible pair (λ1, λ2) for Eqn. (12), so finding such a pair is sufficient. We prove the uniqueness and discuss how we deal with edge cases where multiple feasible pairs exist in Appendix G.4. Normally, such a solving process is expensive. However, we find a particularly efficient method to solve for λ1 and λ2; the algorithm description is in Alg. 3 (in Appendix E.1). At a high level, from Theorem 5, we observe that Q(λ1, λ2) is determined only by the sum (λ1 + νλ2) and is non-decreasing w.r.t. this sum. Therefore, we apply binary search to find the (λ1 + νλ2) that satisfies Q(λ1, λ2) = QA. Then, we observe that

P(λ1, λ2) − Q(λ1, λ2)/ν = E_{t∼Γ(d/2−k,1)}[ u3(t, λ1) · 𝟙[t ≥ T²/(2σ'²)] ] =: h(λ1).

Thus, we need to find λ1 such that h(λ1) = PA − QA/ν. We observe that h(λ1) is non-decreasing w.r.t. λ1, and we use binary search to solve for λ1. Combining this with the value of (λ1 + νλ2), we also obtain λ2. We lastly leverage numerical integration to compute R(λ1, λ2) following Theorem 5 to solve the dual problem Dδ(PA, QA).

Practical Certification Soundness. As a practical certification method, we need to guarantee the certification soundness in the presence of numerical error. In DSRS, there are two sources of numerical error: the numerical integration error when computing P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2), and the finite precision of the binary search on λ1 and λ2. For numerical integration, we notice that typical numerical integration packages such as SciPy support setting an absolute error threshold and raising warnings when such a threshold cannot be reached. We set the absolute threshold ∆ = 1.5 × 10⁻⁸ and abstain when the threshold cannot be reached (which never happens in our experimental evaluation). Then, when computing P, Q, and R, suppose the numerical value is v; we use the lower bound (v − ∆) and upper bound (v + ∆) in the corresponding context to guarantee soundness. For the finite precision of the binary search, we use the left endpoint or the right endpoint of the final binary search interval to guarantee soundness. For example, we use the left endpoint of λ1 in the R computation, and use the left endpoint of (λ1 + νλ2) minus the right endpoint of λ1 to get the lower bound of λ2 to use in the R computation. As a result, we always obtain an under-estimate of R, so the certification is sound. Further discussion is in Appendix E.2. To this point, we have introduced the DSRS computational method. Complexity and efficiency analysis is deferred to Appendix E.3. Implementation details are in Appendix I.1.
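A generic sketch of the two monotone binary searches described above; the helper and the way it is wired to P, Q, and ν (from the previous sketch) are ours, and it returns the final bracketing interval so that the sound endpoint choices discussed above can be applied.

```python
def solve_monotone(func, target, lo=-1e3, hi=1e3, iters=100):
    """Binary search for x with func(x) ~= target, for a non-decreasing func.

    Returns the final bracketing interval [lo, hi]; using the appropriate
    endpoint (rather than the midpoint) keeps the certification sound.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo, hi

# Sketch of the two nested searches (P, Q, nu as in the previous sketch):
#   s_lo, s_hi = solve_monotone(lambda s: Q(0.0, s / nu), QA)          # s = l1 + nu * l2
#   h = lambda l1: P(l1, (s_lo - l1) / nu) - Q(l1, (s_lo - l1) / nu) / nu
#   l1_lo, l1_hi = solve_monotone(h, PA - QA / nu)
#   l2_lo = (s_lo - l1_hi) / nu                                        # sound lower endpoint
```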
5.4. Extensions

We mainly discussed the DSRS computational method for a generalized Gaussian P and a truncated generalized Gaussian Q under the ℓ2 norm. Can we extend it to other settings? Indeed, DSRS is a general framework. In the appendices, we show the following extensions: (1) DSRS for generalized Gaussians with different variances as P and Q (in Appendix H.2); (2) DSRS for other ℓp norms (in Appendix H.3); and (3) DSRS that leverages other forms of additional information, covering gradient magnitude information (Mohapatra et al., 2020; Levine et al., 2020) (in Appendix L).

6. Experimental Evaluation

In this section, we systematically evaluate DSRS and demonstrate that it achieves tighter certification than the classical Neyman-Pearson-based certification against ℓ2 perturbations on MNIST, CIFAR-10, and ImageNet. We focus on ℓ2 certification because additive randomized smoothing is not optimal for other norms (e.g., ℓ1 (Levine & Feizi, 2021)) or the certification can be directly translated from ℓ2 certification (e.g., ℓ∞ (Yang et al., 2020) and semantic transformations (Li et al., 2021)).

6.1. Experimental Setup

Smoothing Distributions. Following Theorem 2, we use the generalized Gaussian N^g(k, σ) as the smoothing distribution P, where d/2 − 15 ≤ k < d/2. Specifically, we set k to be d/2 − 12 on MNIST, d/2 − 6 on CIFAR-10, and d/2 − 4 on ImageNet. We use three different σ's: 0.25, 0.50, and 1.00. In terms of the additional smoothing distribution Q, on MNIST and CIFAR-10 we empirically find that using a generalized Gaussian with the same k but a different variance yields tighter robustness certification, and therefore we choose σg to be 0.2, 0.4, and 0.8 corresponding to P's σ being 0.25, 0.50, and 1.00, respectively. On ImageNet, the concentration property (see Definition 3) is more pronounced (detailed study in Appendix J.1), and thus we use the truncated generalized Gaussian N^g_trunc(k, T, σ) as Q. We apply a simple but effective algorithm, explained in Appendix I, to determine the hyperparameter T in N^g_trunc(k, T, σ).

Models and Training. We consider three commonly used or state-of-the-art training methods: Gaussian augmentation (Cohen et al., 2019), Consistency (Jeong & Shin, 2020), and SmoothMix (Jeong et al., 2021). We follow the default model architecture on each dataset, respectively. We train the models with augmentation noise sampled from the corresponding generalized Gaussian smoothing distribution P. More training details can be found in Appendix I.

Baselines. We consider the Neyman-Pearson-based certification method as the baseline. This certification is widely used and is the tightest given only the prediction probability under P (Cohen et al., 2019; Yang et al., 2020; Jeong & Shin, 2020; Li et al., 2021). We remark that although there are certification methods that leverage more information, to the best of our knowledge, they are not visibly better than the Neyman-Pearson-based method for ℓ2 certification under a practical sampling number (10⁵). More comparisons in Appendix J.5 show that DSRS is also better than these baselines. For both the baseline and DSRS, following the convention, the certification confidence is 1 − α = 99.9%, and we use 10⁵ samples for estimating PA and QA. Neyman-Pearson certification does not use the information from the additional distribution, so all 10⁵ samples are used to estimate the interval of PA. In DSRS, we use 5 × 10⁴ samples to estimate the interval of PA with confidence 1 − α/2 = 99.95% and the remaining 5 × 10⁴ samples for QA with the same confidence. By the union bound, the whole certification confidence is 99.9%.

Metrics. We uniformly draw 1000 samples from the test set and report the certified robust accuracy under each ℓ2 radius r, which is the fraction of samples that are both correctly classified and have certified robust radii larger than or equal to r. Under each radius r, we report the highest certified robust accuracy among the three variances σ ∈ {0.25, 0.50, 1.00}, following (Cohen et al., 2019; Salman et al., 2019). We also report evaluation results with the ACR metric (Zhai et al., 2020) in Appendix J.3.3.
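The interval estimation described under "Baselines" uses standard two-sided Clopper-Pearson bounds with the confidence budget split between PA and QA. A minimal sketch (the helper name and the example success counts are our placeholders):

```python
from scipy.stats import beta

def clopper_pearson(successes, n, alpha):
    """Two-sided (1 - alpha) Clopper-Pearson interval for a Bernoulli parameter."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    hi = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lo, hi

# 10^5 samples split evenly between P and Q; each interval holds with confidence
# 1 - alpha/2 = 99.95%, so both hold jointly with confidence >= 99.9% (union bound).
alpha = 0.001
PA_lo, PA_hi = clopper_pearson(successes=48_000, n=50_000, alpha=alpha / 2)  # placeholder counts
QA_lo, QA_hi = clopper_pearson(successes=49_500, n=50_000, alpha=alpha / 2)
```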
Table 2. Certified robust accuracy under different radii r with different certification approaches.

| Dataset | Training Method | Certification Approach | 0.25 | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 97.8% | 96.9% | 94.6% | 88.4% | 78.7% | 57.6% | 41.0% | 25.5% | 13.6% | 6.2% | 2.1% | 0.9% |
| MNIST | Gaussian Aug. (Cohen et al., 2019) | DSRS | 97.8% | 97.0% | 95.0% | 89.8% | 83.4% | 61.6% | 48.4% | 34.1% | 21.0% | 10.6% | 4.4% | 1.2% |
| MNIST | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 98.4% | 97.5% | 96.0% | 92.3% | 83.8% | 67.5% | 49.1% | 35.6% | 21.7% | 10.4% | 4.1% | 1.9% |
| MNIST | Consistency (Jeong & Shin, 2020) | DSRS | 98.4% | 97.5% | 96.0% | 93.5% | 87.1% | 71.8% | 55.8% | 41.9% | 31.4% | 17.8% | 8.6% | 2.8% |
| MNIST | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 98.6% | 97.6% | 96.5% | 91.9% | 85.1% | 73.0% | 51.4% | 40.2% | 31.5% | 22.2% | 12.2% | 4.9% |
| MNIST | SmoothMix (Jeong et al., 2021) | DSRS | 98.6% | 97.7% | 96.8% | 93.4% | 87.5% | 76.6% | 54.4% | 46.2% | 37.6% | 29.2% | 18.5% | 7.2% |
| CIFAR-10 | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 56.1% | 41.3% | 27.7% | 18.9% | 14.9% | 10.2% | 7.5% | 4.1% | 2.0% | 0.7% | 0.1% | 0.1% |
| CIFAR-10 | Gaussian Aug. (Cohen et al., 2019) | DSRS | 57.4% | 42.7% | 30.6% | 20.6% | 16.1% | 12.5% | 8.4% | 6.4% | 3.5% | 1.8% | 0.7% | 0.1% |
| CIFAR-10 | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 61.8% | 50.9% | 38.0% | 32.3% | 23.8% | 19.0% | 16.4% | 13.8% | 11.2% | 9.0% | 7.1% | 5.1% |
| CIFAR-10 | Consistency (Jeong & Shin, 2020) | DSRS | 62.5% | 52.5% | 38.7% | 35.2% | 28.1% | 20.9% | 17.6% | 15.3% | 13.1% | 10.9% | 8.9% | 6.5% |
| CIFAR-10 | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 63.9% | 53.3% | 40.2% | 34.2% | 26.7% | 20.4% | 17.0% | 13.9% | 10.3% | 7.8% | 4.9% | 2.3% |
| CIFAR-10 | SmoothMix (Jeong et al., 2021) | DSRS | 64.7% | 55.5% | 42.1% | 35.9% | 29.4% | 22.1% | 18.7% | 16.1% | 13.2% | 10.2% | 7.1% | 3.9% |
| ImageNet | Gaussian Aug. (Cohen et al., 2019) | Neyman-Pearson | 57.1% | 47.0% | 39.3% | 33.2% | 24.8% | 21.4% | 17.6% | 13.7% | 10.2% | 7.8% | 5.7% | 3.6% |
| ImageNet | Gaussian Aug. (Cohen et al., 2019) | DSRS | 58.4% | 48.4% | 41.4% | 35.3% | 28.8% | 23.3% | 21.3% | 18.7% | 14.2% | 11.0% | 9.0% | 5.7% |
| ImageNet | Consistency (Jeong & Shin, 2020) | Neyman-Pearson | 59.8% | 49.8% | 43.3% | 36.8% | 31.4% | 25.6% | 22.1% | 19.1% | 16.1% | 14.0% | 10.6% | 8.5% |
| ImageNet | Consistency (Jeong & Shin, 2020) | DSRS | 60.4% | 52.4% | 44.7% | 39.3% | 34.8% | 28.1% | 25.4% | 22.6% | 19.6% | 17.4% | 14.1% | 10.4% |
| ImageNet | SmoothMix (Jeong et al., 2021) | Neyman-Pearson | 46.7% | 38.2% | 28.8% | 24.6% | 18.1% | 14.2% | 11.8% | 10.1% | 8.9% | 7.2% | 6.0% | 4.6% |
| ImageNet | SmoothMix (Jeong et al., 2021) | DSRS | 47.4% | 40.0% | 30.3% | 26.8% | 21.6% | 15.7% | 14.0% | 12.1% | 9.9% | 8.4% | 7.2% | 5.3% |

6.2. Evaluation Results

We show the results in Table 2. The corresponding curves and separate tables for each variance σ are in Appendix J.3. For almost all models and radii r, DSRS yields significantly higher certified accuracy. For example, for Gaussian augmented models, when r = 2.0, the robust accuracy increases from 25.5% to 34.1% (+8.6%) on MNIST, from 4.1% to 6.4% (+2.3%) on CIFAR-10, and from 13.7% to 18.7% (+5.0%) on ImageNet. On average, the improvements are around 6%-9% on MNIST, around 1.5%-3% on CIFAR-10, and around 2%-5% on ImageNet. Thus, DSRS can be used in conjunction with different training approaches and provides consistently tighter robustness certification. The improvements in the robust radius are not as substantial as those in Figure 2(b) (which are around 2×). We investigate the reason in Appendix J.1. In summary, the model in Figure 2(b) is trained with standard Gaussian smoothing augmentation and smoothed with a generalized Gaussian, whereas the models in this section are trained with generalized Gaussian augmentation. Such training gives higher certified robustness but, in the meantime, gives more advantage to Neyman-Pearson-based certification. This finding implies that there may be a large space for exploring training approaches that favor DSRS certification, since all existing training methods are designed for Neyman-Pearson-based certification. Nevertheless, even with these unsuitable training methods, DSRS still achieves significantly tighter robustness certification than the baseline.
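For reference, each entry of Table 2 is the certified accuracy at a radius, i.e., the fraction of test samples that are both correctly classified and certified at that radius or beyond. A minimal helper sketch (names ours):

```python
import numpy as np

def certified_accuracy(radii, correct, r):
    """Fraction of samples that are correctly classified and certified at radius >= r.

    radii: per-sample certified radius (0 for abstentions);
    correct: per-sample flag for whether the smoothed prediction equals the true label.
    """
    radii = np.asarray(radii, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return float(np.mean(correct & (radii >= r)))
```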
On the other hand, all the above results are restricted to generalized Gaussian smoothing. We still observe that standard Gaussian smoothing combined with strong training methods (Salman et al., 2019; Jeong & Shin, 2020) and Neyman-Pearson certification (the SOTA setting) yields similar or slightly higher certified robust accuracy than generalized Gaussian smoothing even with DSRS certification. Though DSRS with its suitable generalized Gaussian smoothing does not achieve SOTA certified robustness yet, given the theoretical advantages, we believe that with future tailored training approaches, DSRS with generalized Gaussian smoothing can bring strong certified robustness. More discussion is in Appendix L.3.

Ablation Studies. We present several ablation studies in the appendix and verify: (1) Effectiveness of our simple heuristic for selecting the hyperparameter of Q: we propose a simple heuristic to select the hyperparameter T in the smoothing distribution Q = N^g_trunc(k, T, σ). In Appendix J.2, we propose a gradient-based optimization method to select such a Q. We find that our simple heuristic has similar performance compared to the more complex optimization method but is more efficient. (2) Comparison of different types of Q: by choosing different types of Q distributions (truncated Gaussian or Gaussian with a different variance), DSRS has different performance, as mentioned in Section 6.1. In Appendix J.4, we investigate the reason. In summary, when the concentration property (see Definition 3) is better satisfied, using a truncated Gaussian as Q is better; otherwise, using a Gaussian with a different variance is better.

7. Conclusion

We propose a general DSRS framework, which exploits information based on an additional smoothing distribution to tighten the robustness certification. We theoretically analyze and compare classical Neyman-Pearson and DSRS certification, showing that DSRS has the potential to break the curse of dimensionality of randomized smoothing.

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. This work is partially supported by NSF grant No. 1910100, NSF CNS 20-46726 CAR, C3 AI, and the Sloan Fellowship.

References

Albarghouthi, A. Introduction to neural network verification. arXiv preprint arXiv:2109.10317, 2021.

Alfarra, M., Bibi, A., Torr, P. H., and Ghanem, B. Data dependent randomized smoothing. arXiv preprint arXiv:2012.04351, 2020.

Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 274-283. PMLR, 2018.

Awasthi, P., Jain, H., Rawat, A. S., and Vijayaraghavan, A. Adversarial robustness via robust low rank representations. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 11391-11403. Curran Associates, Inc., 2020.

Blum, A., Dick, T., Manoj, N., and Zhang, H. Random smoothing might be unable to certify ℓ∞ robustness for high-dimensional images. Journal of Machine Learning Research, 21:211:1-211:21, 2020.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39-57.
IEEE Computer Society, 2017. Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J. C., and Liang, P. Unlabeled data improves adversarial robustness. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11190 11201. Curran Associates, Inc., 2019. Chernoff, H. and Scheffe, H. A generalization of the neyman-pearson fundamental lemma. The Annals of Mathematical Statistics, 23(4):213 225, 1952. Chiang, P., Curry, M. J., Abdelkader, A., Kumar, A., Dickerson, J., and Goldstein, T. Detection as regression: Certified object detection with median smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 1275 1286. Curran Associates, Inc., 2020. Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1310 1320. PMLR, 2019. Dvijotham, K. D., Hayes, J., Balle, B., Kolter, J. Z., Qin, C., Gy orgy, A., Xiao, K., Gowal, S., and Kohli, P. A framework for robustness certification of smoothed classifiers using f-divergences. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Eiras, F., Alfarra, M., Kumar, M. P., Torr, P. H. S., Dokania, P. K., Ghanem, B., and Bibi, A. ANCER: anisotropic certification via sample-wise volume maximization. ar Xiv preprint ar Xiv:2107.04570, 2021. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. Robust physical-world attacks on deep learning visual classification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 1625 1634. Computer Vision Foundation / IEEE Computer Society, 2018. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. Scalable verified training for provably robust image classification. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 4841 4850. IEEE, 2019. Hayes, J. Extensions and limitations of randomized smoothing for robustness guarantees. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pp. 3413 3421. Computer Vision Foundation / IEEE, 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Double Sampling Randomized Smoothing Las Vegas, NV, USA, June 27-30, 2016, pp. 770 778. IEEE Computer Society, 2016. Jeong, J. and Shin, J. 
Consistency regularization for certified robustness of smoothed classifiers. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 10558 10570. Curran Associates, Inc., 2020. Jeong, J., Park, S., Kim, M., Lee, H., Kim, D., and Shin, J. Smoothmix: Training confidence-calibrated smoothed classifiers for certified robustness. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 30153 30168. Curran Associates, Inc., 2021. Kumar, A., Levine, A., Feizi, S., and Goldstein, T. Certifying confidence via randomized smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 5165 5177. Curran Associates, Inc., 2020a. Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5458 5467. PMLR, 2020b. L ecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pp. 656 672. IEEE, 2019. Lee, G., Yuan, Y., Chang, S., and Jaakkola, T. S. Tight certificates of adversarial robustness for randomly smoothed classifiers. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 4911 4922. Curran Associates, Inc., 2019. Levine, A. and Feizi, S. Improved, deterministic smoothing for ℓ1 certified robustness. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6254 6264. PMLR, 2021. Levine, A., Kumar, A., Goldstein, T. A., and Feizi, S. Tight second-order certificates for randomized smoothing. ar Xiv preprint ar Xiv:2010.10549, 2020. Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 9459 9469. Curran Associates, Inc., 2019a. Li, H., Xu, X., Zhang, X., Yang, S., and Li, B. QEBA: query-efficient boundary-based blackbox attack. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 1319, 2020, pp. 1218 1227. Computer Vision Foundation / IEEE, 2020a. Li, L., Zhong, Z., Li, B., and Xie, T. Robustra: Training provable robust neural networks over reference adversarial space. 
In Kraus, S. (ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 4711 4717. ijcai.org, 2019b. Li, L., Xie, T., and Li, B. Sok: Certified robustness for deep neural networks. ar Xiv preprint ar Xiv:2009.04131, 2020b. Li, L., Weber, M., Xu, X., Rimanic, L., Kailkhura, B., Xie, T., Zhang, C., and Li, B. TSS: transformation-specific smoothing for robustness certification. In Kim, Y., Kim, J., Vigna, G., and Shi, E. (eds.), 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS 2021, Virtual Event, Republic of Korea, November 15 - 19, 2021, pp. 535 557. ACM, 2021. Liu, C., Arnon, T., Lazarus, C., Strong, C. A., Barrett, C. W., and Kochenderfer, M. J. Algorithms for verifying deep neural networks. Foundations and Trends in Optimization, 4(3-4):244 404, 2021. Lozier, D. W. NIST digital library of mathematical functions. Annals of Mathematics and Artificial Intelligence, 38(13):105 119, 2003. Mirman, M., Gehr, T., and Vechev, M. T. Differentiable abstract interpretation for provably robust neural networks. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML Double Sampling Randomized Smoothing 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3575 3583. PMLR, 2018. Mohapatra, J., Ko, C., Weng, T., Chen, P., Liu, S., and Daniel, L. Higher-order certification for randomized smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, pp. 4501 4511. Curran Associates, Inc., 2020. Neyman, J. and Pearson, E. S. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231:289 337, 1933. Qiu, H., Xiao, C., Yang, L., Yan, X., Lee, H., and Li, B. Semanticadv: Generating adversarial examples via attributeconditioned image editing. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV, volume 12359 of Lecture Notes in Computer Science, pp. 19 37. Springer, 2020. Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018. Salman, H., Li, J., Razenshteyn, I. P., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11289 11300. Curran Associates, Inc., 2019. Schuchardt, J., Wollschl ager, T., Bojchevski, A., and G unnemann, S. Localized randomized smoothing for collective robustness certification, 2022. URL https: //openreview.net/forum?id=m F122Bu Ann W. S uken ık, P., Kuvshinov, A., and G unnemann, S. Intriguing properties of input-dependent randomized smoothing. 
arXiv preprint arXiv:2110.05365, 2021.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In Bengio, Y. and LeCun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

Tramèr, F., Carlini, N., Brendel, W., and Madry, A. On adaptive attacks to adversarial example defenses. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 1633–1645. Curran Associates, Inc., 2020.

Wang, B., Xu, C., Wang, S., Gan, Z., Cheng, Y., Gao, J., Awadallah, A. H., and Li, B. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, volume 1, 2021.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 5283–5292. PMLR, 2018.

Wu, Y., Bojchevski, A., Kuvshinov, A., and Günnemann, S. Completing the picture: Randomized smoothing suffers from the curse of dimensionality for a large family of distributions. In Banerjee, A. and Fukumizu, K. (eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pp. 3763–3771. PMLR, 2021.

Xu, K., Shi, Z., Zhang, H., Wang, Y., Chang, K., Huang, M., Kailkhura, B., Lin, X., and Hsieh, C. Automatic perturbation analysis for scalable certified robustness and beyond. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 1129–1141. Curran Associates, Inc., 2020.

Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I. P., and Li, J. Randomized smoothing of all shapes and sizes. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 10693–10705. PMLR, 2020.

Yang, Z., Li, L., Xu, X., Kailkhura, B., Xie, T., and Li, B. On the certified robustness for ensemble models and beyond. In 10th International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

Zhai, R., Dan, C., He, D., Zhang, H., Gong, B., Ravikumar, P., Hsieh, C., and Wang, L. MACER: Attack-free and scalable robust training via maximizing certified radius. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

Zhang, D., Ye, M., Gong, C., Zhu, Z., and Liu, Q. Black-box certification with randomized smoothing: A functional optimization based framework. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H.
(eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, pp. 2316–2326. Curran Associates, Inc., 2020a.

Zhang, H., Chen, H., Xiao, C., Gowal, S., Stanforth, R., Li, B., Boning, D. S., and Hsieh, C. Towards stable and efficient training of verifiably robust neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020b.

Zhang, J., Li, L., Li, H., Zhang, X., Yang, S., and Li, B. Progressive-scale boundary blackbox attack via projective gradient estimation. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12479–12490. PMLR, 2021.

A. Neyman-Pearson Certification

The Neyman-Pearson-based robustness certification is the tightest certification given only the prediction probability under P (Cohen et al., 2019). This certification and its equivalent variants are widely used for randomized smoothing. We use r_NP to denote the certified radius from the Neyman-Pearson-based method. If the smoothing distribution P is standard Gaussian, the following proposition gives the closed-form certified robust radius derived from the Neyman-Pearson lemma (Neyman & Pearson, 1933).

Proposition 3 ((Cohen et al., 2019)). Under the ℓ2 norm, given an input x_0 ∈ R^d with true label y_0, let P = N(σ) be the smoothing distribution. Then Neyman-Pearson-based certification yields the certified radius r_NP = σ Φ^{-1}(f^P(x_0)_{y_0}), where Φ^{-1} is the inverse CDF of the unit-variance Gaussian.

For other smoothing distributions, the concretization of the Neyman-Pearson certification method can be found in (Yang et al., 2020).

Remark. In practice, the routine is to use Monte-Carlo sampling to obtain a high-confidence interval of f^P(x_0)_{y_0}, which implies a high-confidence certification (r_NP) of the robust radius. A tighter radius can be obtained when the runner-up prediction probability is known: (σ/2) (Φ^{-1}(f^P(x_0)_{y_0}) − max_{y ∈ [C], y ≠ y_0} Φ^{-1}(f^P(x_0)_y)). However, due to efficiency concerns (for C-way classification, the sampling number needs to be more than C times larger if certifying with this tighter radius instead of r_NP), the standard routine is to use only the top-class probability and r_NP (Cohen et al., 2019, Section 3.2.2). DSRS follows this routine.

B. Illustration of Concentration Assumption on ImageNet

[Figure 4 plot: probability of predicting the true class and density of the ℓ2 noise magnitude, both against the ℓ2 length of the perturbation.]
Figure 4. Blue curves: probability of true prediction w.r.t. the ℓ2 length of perturbations for a base classifier from (Salman et al., 2019) on ImageNet. Each line corresponds to one of 100 uniformly drawn samples from the test set (detailed setup in Appendix J.1). Green curve: normalized density of the ℓ2 noise magnitude for ImageNet standard Gaussian N(σ) with σ = 0.5, which concentrates sharply around σ√d. Thus, for a constant Pcon, (σ, Pcon)-concentration can be satisfied for a significant portion of input samples.
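The concentration of the noise magnitude underlying the green curve in Figure 4 is easy to reproduce empirically. The following is a minimal, self-contained sketch (not part of the released DSRS code; the dimensions, σ, and sample count are illustrative only) that checks how tightly ‖ε‖₂ concentrates around σ√d for ε ~ N(0, σ²I_d):

```python
import numpy as np

# Minimal sketch: empirically check that the l2 norm of Gaussian noise
# eps ~ N(0, sigma^2 I_d) concentrates around sigma * sqrt(d), as discussed
# for the green curve in Figure 4. All numbers are illustrative.
def noise_norm_stats(d, sigma, n_samples=10000, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma, size=(n_samples, d))
    norms = np.linalg.norm(eps, axis=1)
    return norms.mean(), norms.std()

for d in (28 * 28, 3 * 32 * 32, 3 * 224 * 224):  # MNIST / CIFAR-10 / ImageNet dims
    sigma = 0.5
    mean, std = noise_norm_stats(d, sigma)
    # The relative spread std/mean shrinks as d grows, i.e., the norm
    # concentrates tightly around sigma * sqrt(d).
    print(f"d={d:7d}  sigma*sqrt(d)={sigma * d ** 0.5:8.2f}  "
          f"mean={mean:8.2f}  std={std:.3f}")
```

Since the mean of ‖ε‖₂ grows like σ√d while its standard deviation stays roughly constant (about σ/√2), the relative spread shrinks like 1/√d, which is why a ball of radius close to σ√d can capture a constant probability mass Pcon.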
C. Illustration of Unchanged PA with Increasing d

In the first remark of Theorem 2, we mention that PA = f^P(x_0)_{y_0} does not grow along with the input dimension d. To illustrate this phenomenon, in Figure 5 we plot PA histograms for 1,000 test samples from Tiny ImageNet and ImageNet. Note that Tiny ImageNet images are downscaled ImageNet images, so the data distribution differs only in the input dimension d. As we can observe, although d varies, the PA distribution is highly similar, so PA is roughly unchanged as the input dimension d increases.

[Figure 5 plot: PA histograms for Tiny ImageNet (d = 12288) trained with Gaussian augmentation and ImageNet (d = 150528) trained with Gaussian augmentation.]
Figure 5. PA histograms for models trained on Tiny ImageNet (left) and ImageNet (right) with the same σ = 0.50.

D. DSRS under Relaxed Concentration Assumption

In the main text (Section 4), we generalize the concentration assumption by replacing the holding probability 1 in Eqn. (5a) with a probability that accounts for sampling confidence. In this appendix, we instead replace the holding probability in Eqn. (5a) by exp(−d^α) for α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. With this relaxation, we conduct numerical simulations using the same settings as in the main text, and the corresponding results are shown in Figure 6. Note that some solid curves terminate when d is large, which is due to the limited floating-point precision in the numerical simulations; we use dashed lines of the same color to plot the projected radius for large d.

[Figure 6 plot: certified radius vs. input dimension d with PA = 0.6 and (1, 0.5)-approximate concentration property; curves for Neyman-Pearson certification with standard Gaussian, DSRS with ideal concentration information, and DSRS with exp(−d^α)-concentration; vertical markers indicate the MNIST, CIFAR-10, and ImageNet input dimensions.]
Figure 6. Tendency of the DSRS certified robust radius with different input dimensions d under the relaxed concentration assumption, i.e., when the holding probability in Eqn. (5a) is exp(−d^α) with α from 0.1 to 0.5. Blue line: DSRS when the holding probability in Eqn. (5a) is 1. Dotted line: Neyman-Pearson certification. Other solid lines: DSRS when the holding probability in Eqn. (5a) is exp(−d^α). Other dashed lines: DSRS projected radius r^proj_DSRS = r_NP d^α / 1.18, for α ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Both the x- and y-axes are logarithmic.

Remark. When the concentration property holds with probability exp(−d^α) (0 < α ≤ 0.5) rather than 1, we observe from Figure 6 that r_NP d^α / 1.18 predicts the certified radius of DSRS well, where r_NP is the Neyman-Pearson certified radius. Therefore, although the √d growth rate of the ℓ2 certified radius no longer holds, the radius still increases with the dimension d. Interestingly, as the dimension d increases, the vanishing probability exp(−d^α) still implies an increasing volume of adversarial examples, yet the smoothed classifier remains certifiably robust with an increasing radius, as reflected by DSRS, despite the increasing adversarial volume.

E. Omitted Details of DSRS Computational Method

E.1. Algorithm Description

Algorithm 1: Determining PA and QA from confidence intervals (see Section 5.2).
Data: distributions P and Q; δ; the confidence intervals [PA, PA] and [QA, QA]
Result: PA and QA satisfying Eqn. (11)
1: compute q ← argmin_y Cδ(PA, y)
2: if q > QA then
3:   return (PA, min{q, QA})
5: compute p ← argmin_x Cδ(x, QA)
6: return (max{min{p, PA}, PA}, QA)

Alg. 1 is a subroutine (Line 7) of Alg. 2. Note that Lines 1 and 5 of Alg. 1 solve the constrained optimization with only one constraint (either one of Eqn. (8b)), which reduces to the well-studied and solvable Neyman-Pearson-based certification. Note also that we do not need to evaluate any value of Cδ in Alg. 1.
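Alg. 1 consumes high-confidence intervals for PA and QA, which Alg. 2 obtains by Monte-Carlo sampling following (Cohen et al., 2019). As a rough illustration of how such intervals are typically produced, here is a minimal Clopper-Pearson sketch; it is not the authors' exact routine, `count_correct_under` is a hypothetical helper, and the two-sided interval and the constants are illustrative assumptions:

```python
from scipy.stats import beta

def clopper_pearson_interval(k, n, alpha):
    """Two-sided (1 - alpha) Clopper-Pearson interval for a binomial proportion,
    given k top-class predictions out of n Monte-Carlo noise samples."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Hypothetical usage: count_correct_under(dist) would draw n noise samples from
# the given smoothing distribution (P or Q), query the base classifier, and
# return how many samples are predicted as the top class y0.
# n, alpha = 10**5, 0.0005
# pa_lo, pa_hi = clopper_pearson_interval(count_correct_under("P"), n, alpha)
# qa_lo, qa_hi = clopper_pearson_interval(count_correct_under("Q"), n, alpha)
```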
Although q and p in the algorithm are arg minx or y over Cδ, the free choices of x or y leave Cδ s constrained optimization with only one constraint and then q and p can be solved by Neyman-Pearson instead of evaluating Cδ directly. Algorithm 2 DSRS computational method. Data: clean input x0, base classifier F0; distributions P and Q; norm type p; confidence level α; numerical integration error bound Result: Certified radius r 1 Query prediction y0 e F P 0 (x0) 2 Sample and estimate the intervals of smoothed confidence [PA, PA] under P and [QA, QA] under Q with confidence (1 α) following (Cohen et al., 2019) 3 Initialize: rl 0, ru rmax 4 while ru rl > eps do /* Binary search on radius r */ 5 rm (rl + ru)/2 6 δ (rm, 0, . . . , 0)T ; /* for ℓ2 certification with ℓ2 symmetric P and Q; for ℓ or ℓ1, can be adjusted following (Zhang et al., 2020a) */ 7 Determine PA [PA, PA] and QA [QA, QA] ; /* See Section 5.2 and Alg. 1 */ 8 (λ1, λ2) DUALBINARYSEARCH(PA, QA) ; /* See Section 5.3 and Alg. 3 */ 9 v R(λ1, λ2) ; /* Using Theorem 5 */ 10 if v > 0.5 then 14 return rl Alg. 2 is the pseudocode of the whole DSRS computational method as introduced in Section 5. Double Sampling Randomized Smoothing Algorithm 3 DUALBINARYSEARCH for λ1 and λ2. Data: Query access to P( , ) and Q( , ); PA; QA; ν; precision parameter ϵ; numerical integration error bound Result: λ1 and λ2 satisfying constraints P(λ1, λ2) = PA, Q(λ1, λ2) = QA (see Eqn. (12)) 1 a L 0, a U M ; /* search for a = λ1 + νλ2, M is a large positive number */ 2 while a U a L > ϵ do 3 am (a L + a U)/2 4 if Q(am, 0) < QA then /* Following while-loop enlarges a L and a U until [a L, a U] covers a such that Q(a , 0) = QA under numerical integration error */ 9 while (Q(a L, 0) + > QA) or (Q(a U, 0) < QA) do 10 t a U a L 11 a L a L t/2 12 a U a U + t/2 14 λL 1 0, λU 1 M ; /* search for λ1, M is a large positive number */ 15 while λU 1 λL 1 > ϵ do 16 λm 1 (λL 1 + λU 1 )/2 17 if h(λm 1 ) < PA QA/ν then 18 λL 1 λm 1 19 else 20 λU 1 λm 1 /* Following while-loop enlarges λL 1 and λU 1 until [λL 1 , λU 1 ] covers λ 1 such that h(λ 1) = PA QA/ν under numerical integration error */ 22 while (h(λL 1 ) + > PA QA/ν) or (h(λU 1 ) < PA QA/ν) do 23 t λU 1 λL 1 24 λL 1 λL 1 t/2 25 λU 1 λU 1 + t/2 27 return (λL 1 , (a L λU 1 )/ν) ; /* for soundness, choose the left endpoint of λ1 and λ2 range */ Alg. 3 is the dual variable search algorithm described in Section 5.3. From Line 1 to 8, we conduct binary search for quantity λ1 + νλ2; from Line 14 to 21, we conduct binary search for quantity λ1. Notice that our binary search interval is initialized to be the non-negative interval. This is because Q(am, 0) = 0 and h(λm 1 ) = 0 if am and λm 1 are non-positive observed from Theorem 5. E.2. Guaranteeing Numerical Soundness In DSRS computational method, to compute a practically sound robustness guarantee, we take the binary search error and numerical integration error into consideration. Specifically, in Alg. 3, we return a pair (λ 1, λ 2) whose R(λ 1, λ 2) lower bounds R(λ1, λ2) where (λ1, λ2) is the precise feasible pair. We achieve so by returning (λL 1 , (a L λU 1 )/ν). Specifically, as we can see a L is an underestimation of actual (λ1 + νλ2) in the presence of binary search error and numerical integration error. We bound the numerical integration error by setting absolute error bound = 1.5 10 8 in scipy.integrad.quad function. Then, λL 1 and λU 1 are underestimation and overestimation of the actual λ1 in the presence of errors respectively. 
As a result, (a^L − λ^U_1)/ν is an underestimation of λ2. Therefore, both λ1 and λ2 are underestimated, and by the monotonicity of R(·, ·), the actual R(λ1, λ2) is underestimated, which guarantees soundness. Then, since the evaluation of R involves numerical integration, we compare the lower bound of R(λ1, λ2) computed in Alg. 2 (Line 9) with 0.5 to determine whether the current robustness radius can be certified.

E.3. Complexity and Efficiency Analysis

In this appendix, we briefly analyze the computational complexity of the DSRS computational method introduced in Section 5. Suppose the binary-search precision is ϵ and each numerical integration costs C time. First, the binary search over the certified robust radius costs O(log(√d/ϵ)) iterations. For each searched radius, we first determine PA and QA by running Neyman-Pearson-based certification, which costs O(log(1/ϵ) C). Then, solving the dual variables takes two binary-search rounds, which costs O(log(1/ϵ) C). The final one-time integration of R(λ1, λ2) costs O(C). Thus, the overall time complexity is O(log(√d/ϵ) log(1/ϵ) C), which is the same as classical Neyman-Pearson certification and grows slowly (by a logarithmic factor) w.r.t. the input dimension d.

In practice, the certification time is on average 5 s to 10 s per sample across different datasets. For example, with σ = 0.50 as the smoothing variance parameter, the certification time, as an overhead over Neyman-Pearson-based certification, is 10.53 s, 4.53 s, and 3.21 s on MNIST, CIFAR-10, and ImageNet respectively. This overhead is almost negligible compared with the sampling time for estimating PA and QA, which is around 200 s on ImageNet and is a cost shared by all randomized smoothing certification methods. In summary, the running time of DSRS is roughly the same as that of standard Neyman-Pearson-based certification.

F. Extensions and Proofs in Section 4

In this appendix, we provide formal proofs and theoretical extensions for the results in Section 4.

F.1. Proof of Theorem 1

Proof of Theorem 1. We let D_c denote the decision region of F_0 for class c, i.e., D_c := {x : F_0(x) = c}. Since Q is supported on the decision region shifted by x_0, f^Q_0(x_0)_c = 1. Thus, from f^Q_0(x_0)_c = 1, we know (supp(Q) + x_0) \ S ⊆ D_c, where S is some set with zero measure under Q + x_0. Since 0 < q(x)/p(x) < +∞, S also has zero measure under P + x_0. On the other hand, by 0 < q(x)/p(x) < +∞, we can determine the probability mass of supp(Q) under P, i.e., Pr_{ϵ∼P}[ϵ ∈ supp(Q)]. Then, we observe that

f^P_0(x_0)_c = Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c]
= Pr_{ϵ∼P}[ϵ ∈ R^d \ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] + Pr_{ϵ∼P}[ϵ ∈ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ supp(Q)]
= Pr_{ϵ∼P}[ϵ ∈ R^d \ supp(Q)] · Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] + Pr_{ϵ∼P}[ϵ ∈ supp(Q)].

By the definition of supp(Q), we observe that Pr_{ϵ∼P}[F_0(x_0 + ϵ) = c | ϵ ∈ R^d \ supp(Q)] = 0. As a result, f^P_0(x_0)_c = Pr_{ϵ∼P}[ϵ ∈ supp(Q)]. Then the DSRS certification method can know that ((R^d \ supp(Q)) + x_0) ∩ D_c has zero measure under P + x_0. In summary, the certification method can determine that supp(Q) + x_0 differs from D_c only on some set with zero measure under P + x_0. Because P has positive density everywhere, this difference set also has zero measure under P + x_0 + δ for arbitrary δ ∈ R^d. Thus, for arbitrary δ ∈ R^d, the certification method can compute

f^P_0(x_0 + δ)_c = Pr_{ϵ∼P}[F_0(x_0 + δ + ϵ) = c]   (13)
= Pr_{ϵ∼P+δ}[x_0 + ϵ ∈ D_c] = Pr_{ϵ∼P+δ}[ϵ ∈ supp(Q)].
Under the binary classification setting, it implies that for any δ Rd, the f P 0 (x0 +δ) can be uniquely determined by DSRS certification method. Since the smoothed classifier s decision at any x0 + δ is uniquely determiend by f P 0 (x0 + δ) (see Eqn. (2)), the certification method can exactly know e F P 0 (x + δ) for any δ and thus determine tightest possible certified robust radius rtight. F.2. Extending Theorem 1 to Multiclass Setting For the multiclass setting, we define a variant of DSRS as follows. Definition 4 (rmulti DSRS). Given PA [0, 1] and Qmulti A RC 1, rmulti DSRS := max r s.t. F : R [C], f P(x0)y0 = PA, f Qc(x0)c = (Qmulti A )c, c [C 1] x, x x0 p < r, e F P(x) = y0. In the above definition, rmulti DSRS is the tightest possible certified radius with prediction probability Qmulti A , where each component of Qmulti A , namely (Qmulti A )c, corresponds to the prediction probability for label c under additional smoothing distribution Qc. Note that there are (C 1) additional smoothing distributions {Qc}C 1 c=1 in this generalization for multiclass setting. With this DSRS generation, the following corollary extends the tightness analysis in Theorem 1 from binary to the multiclass setting. Corollary 1. Suppose the original smoothing distribution P has positive density everywhere, i.e., p( ) > 0. For multiclass classification with base classifier F0, at point x0 Rd, for each class c [C 1], let Qc be a distribution that satisfies: (1) its support is the decision region of c shifted by x0: supp(Qc) = {x x0 : F0(x) = c}; (2) for any x supp(Q), 0 < qc(x)/p(x) < + . Then, plugging PA = f P 0 (x0)c and Qmulti A where (Qmulti A )c = f Qc 0 (x0)c for c [C 1] into Definition 4, we have rmulti DSRS = rtight under any ℓp (p 1). Proof of Corollary 1. Similar as the proof of Theorem 1, the certification method can observe that for any c [C 1], f Qc 0 (x0)c = 1. Thus, the method knows supp(Qc) + x0 Dc for arbitrary c [C 1]. Here, the means that the difference between the two sets has zero measure under P + x0. Thus, for arbitrary δ Rd the certification method can precisely compute out f P(x0)c for any c [C 1]. Since f P(x0) C, we also know f P(x0)C and the smoothed classifier s prediction on x0 +δ can be uniquely determined. Then, following the same argument in Theorem 1 s proof, we can determine the tightest possible certified robust radius rmulti DSRS. Remark. To achieve the tightest possible certified radius rtight, for binary classification, we only need one extra scalar as the additional information (QA), while for multiclass classification, we need (C 1) extra scalars as the additional information (Qmulti A RC 1). Following the convention as discussed in Appendix A, we are interested in tight certification under the binary classification for sampling efficiency concerns. Therefore, we focus on using only one extra scalar (QA) to additional information in DSRS. In both Theorem 1 and Corollary 1, we only need finite quantities to achieve tight certification for any smoothed classifier. In contrast, other existing work requires infinite quantities to achieve such optimal tightness (Mohapatra et al., 2020, Section 3.1). Double Sampling Randomized Smoothing F.3. Proof of Theorem 2 The proof of Theorem 2 is a bit complicated, which relies on several propositions and lemmas along with theoretical results in Section 5. At high level, based on the standard Gaussian distribution s property (Proposition F.1), we find QA = 1 under concentration property (Lemma F.2). 
With QA = 1, we derive a lower bound of r DSRS in Lemma F.3. We then use: (1) the concentration of beta distribution Beta( d 1 2 ) (see Lemma F.4) for large d; (2) the relative concentration of gamma Γ(d/2, 1) distribution around mean for large d (see Proposition F.5 and resulting Fact F.7); and (3) the misalignment of gamma distribution Γ(d/2 k, 1) s mean and median for small (d/2 k) (see Proposition F.6) to lower bound the quantity in Lemma F.3 and show it is large or equal to 0.5. Then, using the conclusion in Section 5 we conclude that r DSRS 0.02σ Proposition F.1. If random vector ϵ Rd follows standard Gaussian distribution N(σ), then Pr[ ϵ 2 T] = ΓCDFd/2 where ΓCDFd/2 is the CDF of gamma distribution Γ(d/2, 1). Proof of Proposition F.1. According to (Lozier, 2003, Eqn. 5.19.4), he volume of a d-dimensional ball, i.e., d-ball, with radius r is Vd(r) = πd/2 2 +1)rd. Thus, Vol({ϵ : ϵ 2 = r}) = Vd(r) = dπd/2 2 + 1)rd 1. (16) 1 (2πσ2)d/2 exp r2 Vol({ϵ : ϵ 2 = r})dr 1 (2πσ2)d/2 exp r2 2 + 1)rd 1dr 1 (2σ2)d/2Γ( d 0 exp( r)rd/2 1dr = ΓCDFd/2 With Proposition F.1, now we can show that QA = 1 under the condition of Theorem 2 as stated in the following lemma. Lemma F.2. Suppose F0 satisfies (σ, Pcon)-concentration property at input point x0 Rd, with additional smoothing distribution Q = N g trunc(k, T, σ) where T 2 = 2σ2ΓCDF 1 d/2(Pcon) and d/2 15 k < d/2, we have QA = Pr ϵ Q[F0(x0 + ϵ) = y0] = 1. (17) Proof of Lemma F.2. According to Definition 3, for T that satisfies Pr ϵ N(σ)[ ϵ 2 T ] = Pcon (18) Pr ϵ N(σ)[F0(x0 + ϵ) = y0 | ϵ 2 T ] = 1. (19) With Eqn. (18), from Proposition F.1, we have T 2 = 2σ2ΓCDF 1 d/2(Pcon). (20) Thus, Eqn. (19) implies Pr ϵ N (σ)[F0(x0 + ϵ) = y0 | ϵ 2 q 2σ2ΓCDF 1 d/2(Pcon)] = 1. (21) Notice that N(σ) has finite and positive density anywhere within {ϵ : ϵ 2 T }. Thus, F0(x0 + ϵ) = y0 for any ϵ with ϵ 2 T unless a zero-measure set. Now, we consider Q. Q = N g trunc(k, T, σ) where T = T , and Ntrunc has finite and positive density anywhere within {ϵ : ϵ 2 T } \ {0}. Thus, QA = Pr ϵ N g trunc(k,T,σ)[F0(x0 + ϵ) = y0] = Pr ϵ N g(k,σ)[F0(x0 + ϵ) = y0 | ϵ 2 T] ( ) = Pr ϵ N g(k,σ)[F0(x0 + ϵ) = y0 | ϵ 2 T, ϵ = 0] In the above equations, ( ) is because Pr ϵ N g(k,σ)[ϵ = 0 | ϵ 2 T] lim r 0 Pr ϵ N g(k,σ)[ ϵ 2 r | ϵ 2 T] = lim r 0 C Z r 0 x 2k exp x2 0 dxd 2k 1 exp x2 where C is a constant, and the last equality is due to d/2 > k d 2k 1 0. With QA = 1, we can have a lower bound of r DSRS as stated in the following lemma. Double Sampling Randomized Smoothing Lemma F.3. Under the same condition as in Lemma F.2, we let Beta CDF d 1 2 be the CDF of distributon s.t.Et Γ( d 2 k)Beta CDF d 1 (22) and let r DSRS be the tightest possible certified radius in DSRS under ℓ2 when smoothing distribution P = N g(k, σ), then r DSRS r0. (23) Proof of Lemma F.3. The proof shares the same core methodology as DSRS computational method introduced in Section 5. Basically, according to Eqn. (9), for any radius r, let δ = (r, 0, . . . , 0)T, if Cδ(PA, QA) > 0.5, then r DSRS r, where QA = 1 according to Lemma F.2, and by the (σ, Pcon)-concentration property PA Pr ϵ N g(k,σ)[ ϵ 2 T]. (24) Therefore, to prove the lemma, we only need to show that when 2 k)Beta CDF d 1 (25) for any δ = (u, 0, . . . , 0)T, Cδ(PA, QA) > 0.5. By definition (Eqn. (8)), = min f Eϵ P[f(ϵ + δ)] s.t. Eϵ P[f(ϵ)] = PA, Eϵ Q[f(ϵ)] = QA, 0 f(ϵ) 1 ϵ Rd min f Eϵ P[f(ϵ + δ)] s.t. Eϵ Q[f(ϵ)] = 1, 0 f(ϵ) 1 ϵ Rd =Eϵ P[f(ϵ + δ)] where f(ϵ) = 1, ϵ 2 T 0. 
ϵ 2 > T We now compute V : V =Eϵ P[ ϵ + δ 2 T] Rd p(x)I[ x + δ 2 T]dx p(x)=y I[ x+δ 2 T p(x)=y I[ x+δ 2 T dx r p(r 1 p (y)) 0 ydy 2πd/2 2) r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[ x + δ 2 T | p(x) = y] 0 rp(t)dt2πd/2 2) td 1 Pr[ x + δ 2 T | x 2 = t] 1 (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k)t 2k exp t2 2) td 1 Pr[ x + δ 2 T | x 2 = t]dt 0 td/2 k 1 exp( t) Pr[ x + δ 2 T | x 2 = σ 2 k) Pr[ x + δ 2 T | x 2 = σ In above equations, (1) follows from level-set sliced integration extended from (Yang et al., 2020) and p(x) is the density of distribution P = N g(k, σ) at point x; in (2) we define rp( x 2) := p(x) noting that P is ℓ2 symmetric and all x with same ℓ2 length having the same p(x), and we have p(x) 2 = r p(r 1 p (y)) since y = rp( x 2) and rp is monotonically decreasing; (3) uses Vol({x : p(x) = y}) = Vol({x : x 2 = r 1 p (y)}) 2) r 1 p (y)d 1; (26) (4) changes the integration variable from y to t = r 1 p (y); and (5) injects the concrete expression of rp. Now, we inspect Pr[ x+δ 2 T | x 2 = σ 2t]: When x 2 = σ 2t, Pd i=1 x2 i = 2tσ 2. Meanwhile, x+δ 2 T means (x1 + u)2 + Pd i=2 x2 i T 2. Thus, when x 2 = σ x + δ 2 T x1 σ 2t T 2 u2 2tσ 2 According to (Yang et al., 2020, Lemma I.23), for x uniformly sampled from sphere with radius σ 2t, the compo- nent coordinate 1 + x1 σ 2t 2 Beta( d 1 Pr[ x + δ 2 T | x 2 = σ =Beta CDF d 1 1 + T 2 u2 2tσ 2 =Beta CDF d 1 Double Sampling Randomized Smoothing Finally, we get V = Et Γ( d 2 k,1)Beta CDF d 1 (29) In other words, when 2 k,1)Beta CDF d 1 (30) we have V 0.5, and thus Cδ(PA, QA) V 0.5, r DSRS u, which concludes the proof. We require the following property of Beta CDF. Lemma F.4. There exists d0 N+, for any d d0, Beta CDF d 1 2 (0.6) 0.999. (31) Proof of Lemma F.4. We let v Beta( d 1 E[v] = 1/2, (32) Var[v] = ( d 1 Now, applying Chebyshev s inequality, we have Pr[|v 0.5| 0.1] 1 0.04d. (34) Beta CDF d 1 2 (0.6) (35) = Pr[v < 0.6] (36) =1 Pr[v 0.6] (37) 1 Pr[|v 0.5| 0.1] 1 1 0.04d. (38) Thus, when d 25000, Beta CDF d 1 2 (0.6) 0.999. We also require the following properties of the gamma distribution. Proposition F.5. For any Pcon (0, 1), there exists d0 N+, for any d d0, ΓCDF 1 d/2(Pcon) 0.99 d Proof of Proposition F.5. We let v Γ(d/2), then E[v] = d/2, (40) Var[v] = d/2. (41) We now apply Chebyshev s inequality and get Pr[v < 0.99 d/2] Pr[|v d/2| > 0.01 d/2] Thus, for any Pcon (0, 1), when d 20000 d Pcon, (42) i.e., ΓCDF 1 d/2(Pcon) 0.99 d Proposition F.6. When d/2 15 k < d/2, k, d N+, Pr t Γ(d/2 k) 2 k 0.5 0.999. (43) Proof of Proposition F.6. We prove the proposition by enumeration. Notice that d/2 k {0.5, 1.0, , 14.5, 15.0}, we enumerate ΓCDFd/2 k(0.98(d/2 k)) for each (d/2 k) and get the following table. d 2 k ΓCDFd/2 k(0.98(d/2 k)) d 2 k ΓCDFd/2 k(0.98(d/2 k)) 0.5 0.6778 8.0 0.5245 1.0 0.6247 8.5 0.5224 1.5 0.5990 9.0 0.5204 2.0 0.5831 9.5 0.5186 2.5 0.5718 10.0 0.5168 3.0 0.5632 10.5 0.5152 3.5 0.5564 11.0 0.5136 4.0 0.5507 11.5 0.5121 4.5 0.5459 12.0 0.5107 5.0 0.5418 12.5 0.5093 5.5 0.5381 13.0 0.5080 6.0 0.5349 13.5 0.5068 6.5 0.5319 14.0 0.5056 7.0 0.5292 14.5 0.5044 7.5 0.5268 15.0 0.5033 On the other hand, 0.5 0.999 0.5001, which concludes the proof. Now we are ready to prove the main theorem. Proof of Theorem 2. According to Lemma F.3, we only need to show that for u = 0.02σ d, Eqn. (22) holds. For sufficiently large d, indeed, Et Γ(d/2 k)Beta CDF(d 1)/2 0.999Et Γ(d/2 k)I ( ) 0.999Et Γ(d/2 k)I t 0.98 d Proposition F.6 0.999 0.5 0.999 = 0.5. Thus, from Lemma F.3 we have r DSRS u = 0.02 The inequality ( ) follows from Fact F.7. 
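Proposition F.6 above is proved by enumerating ΓCDF_{d/2−k}(0.98 (d/2−k)) over the 30 admissible values of d/2 − k. For readers who wish to reproduce that table, the following is a minimal SciPy sketch; it is purely a sanity check and is not part of the DSRS certification pipeline:

```python
from scipy.stats import gamma

# Reproduce (up to rounding) the enumeration in Proposition F.6:
# GammaCDF_{d/2 - k}(0.98 * (d/2 - k)) for d/2 - k in {0.5, 1.0, ..., 15.0},
# and confirm that every entry exceeds 0.5.
for half_steps in range(1, 31):
    m = half_steps / 2.0                  # m = d/2 - k
    val = gamma.cdf(0.98 * m, a=m)        # Gamma(shape=m, scale=1) CDF
    assert val > 0.5                      # each tabulated value is above 0.5
    print(f"d/2 - k = {m:4.1f}:  GammaCDF = {val:.4f}")
```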
Double Sampling Randomized Smoothing Fact F.7. Under the condition of Theorem 2, for sufficiently large d, 2t 0.6. (44) Proof of Fact F.7. 2t u)2 2.4uσ 2t T 2 (x u)2 2.4ux x2 + 0.4ux + u2 T 2 0. From Proposition F.5, we have =0.0004dσ2 2σ2ΓCDF 1 d/2(Pcon) 0.0004dσ2 0.99dσ2 < 0. x2 + 0.4ux + u2 T 2 0 0.16u2 4(u2 T 2) 2 = 0.2u + p T 2 0.96u2)2 Again, from Proposition F.5, =2σ2ΓCDF 1 d/2(Pcon) 0.96 0.0004dσ2 (0.99 0.96 0.0004)dσ2, and therefore T 2 0.96u2)2 dσ2 0.004 + 0.99 0.96 0.0004 2 0.9816dσ2 0.98dσ2. (47) Then, T 2 0.96u2)2 F.4. Theorem 6 Theorem 6. Let d be the input dimension and F0 be the base classifier. For an input point x0 Rd with true class y0, suppose F0 satisfies (σ, Pcon)-Concentration property and Prϵ N(σ)[F0(x0 + ϵ) = y0] = Pcon where Pcon < 1. The smoothed classifier e F P 0 is constructed from F0 and smoothed by generalized Gaussian P = N g(k0, σ) where k0 is a constant independent of input dimension d. Then, for any constant c > 0, there exists d0, such that when input dimension d d0, any method cannot certify ℓ2 radius c d, where T = σ q 2ΓCDF 1 d/2(Pcon) and ΓCDFd/2 is the CDF of gamma distribution Γ(d/2, 1). We defer the proof to Appendix F.5. This theorem suggests that, if we use generalized Gaussian whose k is a constant with respect to input dimension d or use standard Gaussian (whose k = 0 is a constant) for smoothing, we cannot achieve Ω( d) certified radius rate from DSRS and any other certification method. F.5. Proof of Theorem 6 The proof of Theorem 6 is based on three lemmas listed below. Lemma F.8. Given k0 N, for any ϵ > 0, there exists d0, such that when d > d0, Lemma F.9. Given Pcon 0, for any ϵ > 0, there exists d0, such that when d > d0, 2ΓCDF 1 d/2(Pcon) σ p (1 + ϵ)d. (49) Lemma F.10. For any ϵ > 0, there exists d0, such that when d > d0, Beta CDF d 1 2 (0.5 ϵ) 0.01. (50) Proofs of these lemmas are based on Chebyshev s inequality. Proof of Lemma F.8. For t Γ(d/2 k0, 1), we have E[t] = d/2 k0, Var[t] = d/2 k0. (51) By Chebyshev s inequality, Pr t (1 ϵ) d Pr |t E[t]| ϵ d 2 k0). (52) Picking d0 = 2 0.99 0.48ϵ2 + k0 concludes the proof. Double Sampling Randomized Smoothing Proof of Lemma F.9. We define random variable v Γ(d/2, 1), so E[v] = d/2, Var[v] = d/2. By Chebyshev s inequality, Pr[v (1 + ϵ)d/2] 1 Pr[v (1 + ϵ)d/2] 1 Pr[|v E[v]| ϵd/2] Let d0 = 2 ϵ2(1 Pcon). Thus, when d > d0, Pr[v (1 + ϵ)d/2] 1 2 d0ϵ2 = Pcon, which implies that ΓCDFd/2((1 + ϵ)d/2) Pcon and ΓCDF 1 d/2(Pcon) (1 + ϵ)d/2 and concludes the proof. Proof of Lemma F.10. We define random variable v Beta( d 1 2 ), and we have E[v] = 1/2, Var[v] = 1 4d. By Chebyshev s inequality, Pr[v 0.5 ϵ] Pr[|v E[v]| ϵ] 1 4dϵ2 . (54) Let d0 = 25 dϵ2 , when d > d0, Pr[v 0.5 ϵ] 0.01 and hence Beta CDF d 1 2 (0.5 ϵ) 0.01. Now we are ready to prove the main theorem. Proof of Theorem 6. According to the above three lemmas, we pick d0, such that when d > d0, the followings hold simultaneously. 2ΓCDF 1 d/2(Pcon) σ Beta CDF d 1 We define vector δ = (c d, 0, 0, . . . , 0)T. Since F0 satisfies (σ, Pcon)-concentration property and Prϵ N(σ)[F0(x0 + ϵ) = y0] = Pcon, up to a set of zero measure, the region {ϵ : F0(x0+ϵ) = y0} and region {ϵ : ϵ 2 T} coincide. We now show that Eϵ P =N g(k0,σ)[F0(x0+δ+ϵ) = y0] < 0.5 when c σ p Eϵ P [F0(x0 + δ + ϵ) = y0] = Pr ϵ P [ δ + ϵ 2 T] 2 k0,1)Beta CDF d 1 (from Eqn. (29)) 0.99Et Γ( d + 0.01 (by Eqn. (57) and Beta CDF ( ) 1) =0.99 Pr t Γ( d + 0.01. (58) T 2 2tσ2 d d 2k0 dc2 + c2d Eqn. (56) = 1 + c2 dσ2 2tσ2 d d 2k0 dc2 (60) 2t d 2k0 0. 
We now inject t = 0 and t = 1 c2 8σ2 d 2 k0 to the LHS of Eqn. (60). When t = 0, LHS of Eqn. (60) = 1 + c2 When t = 1 c2 8σ2 d LHS of Eqn. (60) 8 dσ2 + dc2 8 dc2 + dc2 4 dc2 + dc2 Notice that the LHS of Eqn. (60) is a parabola with negative second-order coefficient. Thus, Eqn. (60) = t 0, 1 c2 Double Sampling Randomized Smoothing 0.99. (by Eqn. (55)) (62) Plugging this inequality to Eqn. (58), we get Eϵ P [F0(x0 +δ +ϵ) = y0] 0.99 0.48 0.99 +0.01 = 0.49. (63) As a result, when c σ p 8/7, the smoothed classifier e F P 0 is not robust given the perturbation δ = (c d, 0, 0, , 0)T, since there may exist another y = y0 with Eϵ P [F0(x0 + δ + ϵ) = y ] 0.51 so e F P 0 (x0 + δ) = y = y0. When c > σ p 8/7, within the ℓ2 radius ball c d, there exists perturbation vector δ = (c d, 0, 0, , 0)T fooling smoothed classifier e F P 0 where c = σ p 8/7. Hence, for any c > 0, there exists a perturbation within ℓ2 ball with radius c d, such that smoothed classifier e F P 0 can be fooled, and then any robustness certification method cannot certify ℓ2 radius c d since the smoothed classifier itself is not robust. G. Proofs of DSRS Computational Method G.1. Proof of Strong Duality (Theorem 3) Proof of Theorem 3. We write down the Lagrangian dual function of Eqn. (8a): Λ(f, λ1, λ2) :=Eϵ P[f(δ + ϵ)] λ1 (Eϵ P[f(ϵ)] PA) λ2 (Eϵ Q[f(ϵ)] QA) . (64) Then, from C s expression (Eqn. (8)), we have = min f Eϵ P[f(ϵ + δ)] s.t. 0 f(ϵ) 1 ϵ Rd, Eϵ P[f(ϵ)] = PA, Eϵ Q[f(ϵ)] = QA = min f:Rd [0,1] max λ1,λ2 R Λ(f, λ1, λ2) (i) max λ1,λ2 R min f:Rd [0,1] Λ(f, λ1, λ2) (ii) = max λ1,λ2 R Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] λ1 Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] λ2 Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] + λ1PA + λ2QA. (65) In the above equation, (i) is from the min-max inequality. For completeness, we provide the proof as such: Define g : R2 R such that g(λ1, λ2) := minf:Rd [0,1] Λ(f, λ1, λ2). As a result, for any λ1, λ2 R and any f : Rd [0, 1], g(λ1, λ2) Λ(f, λ1, λ2). So for any f : Rd [0, 1], maxλ1,λ2 R g(λ1, λ2) maxλ1,λ2 R Λ(f, λ1, λ2), which implies max λ1,λ2 R g(λ1, λ2) min f:R [0,1] max λ1,λ2 R Λ(f, λ1, λ2), (66) where LHS is the RHS of (i) and RHS is the LHS of (i). In above equation, (ii) comes from a closed-form solution of f for Λ(f, λ1, λ2) given (λ1, λ2) R2. Notice that we can rewrite Λ(f, λ1, λ2) as an integral over Rd: Λ(f, λ1, λ2) =Eϵ P[f(δ + ϵ)] λ1Eϵ P[f(ϵ)] λ2Eϵ Q[f(ϵ)] + λ1PA + λ2QA Rd f(x) (p(x δ) λ1p(x) λ2q(x)) dx + λ1PA + λ2QA. (67) We would like to minimize over f : Rd [0, 1] in Eqn. (67) and simple greedy solution reveals that we should choose ( 1, p(x δ) λ1p(x) λ2q(x) < 0 0. p(x δ) λ1p(x) λ2q(x) 0 (68) We inject this f into Eqn. (67) and get min f:Rd [0,1] Λ(f, λ1, λ2) = Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] + λ1 PA Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] QA Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] . (69) Hence (ii) holds. On the other hand, we know that Dδ(PA, QA) (defined by Eqn. (10)) is feasible by theorem statement. Denote (λ 1, λ 2) R2 to a feasible solution to Dδ(PA, QA) and d to the objective value, then from the constraints of (D) we know Pr ϵ P[p(ϵ δ) < λ 1p(ϵ) + λ 2q(ϵ)] = PA, Pr ϵ Q[p(ϵ δ) < λ 1p(ϵ) + λ 2q(ϵ)] = QA, Pr ϵ P[p(ϵ) < λ 1p(ϵ + δ) + λ 2q(ϵ + δ)] = d . Plugging in these equalities into Eqn. (65), we have Cδ(PA, QA) d λ1PA λ2QA+λ1PA+λ2QA = d . (71) Double Sampling Randomized Smoothing At the same time, we define function f : Rd [0, 1] such that f (x) = I[p(x δ) λ 1p(x) λ 2q(x) < 0]. (72) From Eqn. (70), f satisfies the constraints of (C) (Eqns. (8b) and (8c)). 
Since (C) minimizes over all possible functions f : Rd [0, 1], we have Cδ(PA, QA) Eϵ P[f (ϵ + δ)] = d . (73) Combining Eqns. (71) and (73), we get Cδ(PA, QA) = d and hence the strong duality holds. G.2. Proofs of Propositions 1 and 2 and Theorem 4 Proof of Proposition 1. Suppose f1 is the optimal solution to Cδ(P 1 A, Q1 A) and f2 is the optimal solution to Cδ(P 2 A, Q2 A). Due to the linearity of expectation, (f1 + f2)/2 satisfies all the constraints of Eqn. (8) for PA = (P 1 A+ P 2 A)/2 and QA = (Q1 A + Q2 A)/2, i.e., (f1 + f2)/2 is feasible for PA = (P 1 A + P 2 A)/2 and QA = (Q1 A + Q2 A)/2 with objective value Cδ(P 1 A, Q1 A) + Cδ(P 2 A, Q2 A) /2. Thus, we have P 1 A + P 2 A 2 , Q1 A + Q2 A 2 1 2 Cδ(P 1 A, Q1 A) + Cδ(P 2 A, Q2 A) (74) since C is a minimization problem. By definition, Cδ(PA, QA) is convex. Remark. Since Cδ(PA, QA) is defined on a compact R2 subspace, the convexity implies continuity. The continuity property is used in the following proof of Theorem 4. Proof of Proposition 2. Here, we only prove the monotonicity for functions x 7 miny Cδ(x, y) and x 7 arg miny Cδ(x, y). The same statement for y 7 minx Cδ(x, y) and y 7 minx Cδ(x, y) is then straightforward due to the symmetry. For simplification, we define C δ : x 7 miny Cδ(x, y) and let Cδ : x 7 arg miny Cδ(x, y). We notice that both functions can be exactly mapped to the constrained optimization problem (C ) which removes the second constraint in Eqn. (8b) in (C): minimize f Eϵ P[f(δ + ϵ)] (75a) s.t. Eϵ P[f(ϵ)] = x, (75b) 0 f(ϵ) 1 ϵ Rd. (75c) C δ(x) is the optimal objective to (C ) and Cδ(x) is Eϵ Q[f (ϵ)] where f is the optimal solution. Either based on Neyman-Pearson lemma [1933] or strong duality, (C ) is equivalent to (D ) defined as such: Pr ϵ P[p(ϵ) < λp(ϵ + δ)] (76a) s.t. Pr ϵ P[p(ϵ δ) < λp(ϵ)] = x. (76b) For a given x, we only need to find λ satisfying Eqn. (76b). Then, C δ(x) = Pr ϵ P[p(ϵ) < λp(ϵ + δ)], (77) Cδ(x) = Pr ϵ Q[p(ϵ δ) < λp(ϵ)]. (78) Now the monotonicity (what we would like to prove) is apparent. For x1 < x2, from Eqn. (76b), we have λ1 < λ2, since the probability density function p is non-negative. Thus, we inject λ1 and λ2 into Eqn. (77) and Eqn. (78), and yield C δ(x1) C δ(x2), Cδ(x1) Cδ(x2), (79) which concludes the proof. Proof of Theorem 4. We discuss the cases according to the branching statement in the algorithm (Alg. 1). if q QA, by definition we have Cδ(PA, q) Cδ(PA, y) for arbitrary y. According to Proposition 2, we also have Cδ(PA, q) Cδ(x, y) for arbitrary x PA and arbitrary y. Given that (PA, q) [PA, PA] [QA, QA], (PA, q) solves Eqn. (11); if q > QA, by convexity, Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]. We further show that Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]: assume that this is not true, by Proposition 1, the function y 7 arg minx Cδ(x, y) has function value larger than PA at y = QA. Since Cδ(0, 0) = 0 is the global minimum of Cδ, the function value at y = 0 is x = 0. By Proposition 2, there exists y0 [0, QA] such that PA = arg minx Cδ(x, y0). Then, we get (i.) Cδ(arg min x Cδ(x, QA), QA) (ii.) Cδ(PA, QA), where (i.) follows from Proposition 2 for y 7 arg minx Cδ(x, y); (ii.) is implied in the meaning of arg minx Cδ(x, QA). Since y0 [0, QA], Eqn. (80) Double Sampling Randomized Smoothing implies that q should be in [0, QA] as well, which violates the branching condition. Thus, Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]. Using Proposition 2 for function x 7 arg miny Cδ(x, y) in interval [PA, PA] together with Proposition 1, we get Cδ(x, QA) Cδ(x, y) for x [PA, PA] and y [QA, QA]. 
Thus, (PA, QA) solves Eqn. (11). if p PA, by definition we have Cδ(max{p, PA}, QA) Cδ(x, QA) for x [PA, PA]. According to Proposition 2 and condition q QA, we further have Cδ(max{p, PA}, QA) Cδ(max{p, PA}, y) Cδ(x, y) for arbitrary x [PA, PA] and y [QA, QA]. Given that (max{p, PA}, QA) [PA, PA] [QA, QA], (p, QA) solves Eqn. (11); if p > PA, according to Proposition 1, Cδ(PA, QA) Cδ(x, QA) for x [PA, PA]. We further show that Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]: assume that this is not true, by Proposition 1, the function x 7 arg miny Cδ(x, y) has function value larger than QA at x = PA. Since Cδ(0, 0) is the global minimum, by Proposition 2 on x 7 arg miny Cδ(x, y), there exists x0 [0, PA] such that QA = arg miny Cδ(x0, y). Then, we get Cδ(PA, arg min y Cδ(PA, y)) following the similar deduction as in Eqn. (80). Since x0 [0, PA], Eqn. (81) implies that p should be in [0, PA] as well which violates the branching condition. Thus, Cδ(PA, QA) Cδ(PA, y) for y [QA, QA]. Using Proposition 2 for function y 7 arg minx Cδ(x, y) in interval [QA, QA] together with Proposition 1, we get Cδ(PA, y) Cδ(x, y) for y [QA, QA] and x [PA, PA]. Thus, (PA, QA) solves Eqn. (11). G.3. Proof of Theorem 5 Proof of Theorem 5. We first define rp( ϵ 2) = p(ϵ) and rq( ϵ 2) = q(ϵ), then easily seen the concrete expressions of rp and rq are: rp(t) = 1 (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k), (82) rq(t) = ν (2σ 2)d/2 kπd/2 Γ(d/2) Γ(d/2 k), (83) ν := Γ(d/2 k) γ(d/2 k, T 2 2σ 2 ) > 1 (84) and γ is the lower incomplete Gamma function. Now we use level-set integration similar as the proof in Lemma F.3 to get the expressions of P, Q, and R respectively. Since P and Q are ℓ2-symmetric, without loss of generality, we let δ = (r, 0, . . . , 0)T. Suppose PT = rp(T). = Pr ϵ P=N g(k,σ)[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] Rd I[p(x δ) < λ1p(x) + λ2q(x)]p(x)dx p(x)=y p(x δ)<λ1p(x) dx p(x) 2 + p(x)=y p(x δ)<(λ1+λ2ν)p(x) 0 ydy 2πd/2 Γ(d/2)r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[p(x δ) λ1p(x) | p(x) = y]+ Z PT ydy 2πd/2 Γ(d/2)r 1 p (y)d 1 1 r p(r 1 p (y)) Pr[p(x δ) < (λ1 + λ2ν)p(x) | p(x) = y] y=rp(t) = Z T rp(t)dt 2πd/2 Pr[p(x δ) < λ1p(x) | x 2 = t]+ Z T 0 rp(t)dt 2πd/2 Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = t] = 1 Γ(d/2 k) T 2/(2σ 2) td/2 k 1 exp( t)dt Pr[p(x δ) < λ1p(x) | x 2 = σ Z T 2/(2σ 2) 0 td/2 k 1 exp( t)dt Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = σ Double Sampling Randomized Smoothing =Et Γ(d/2 k,1) u3(t, λ1), t T 2/(2σ 2) u3(t, λ1 + λ2ν). t < T 2/(2σ 2) Here, u3(t, λ) = Pr[p(x δ) < λp(x) | x 2 = σ = Pr ϵ Q=N g trunc(k,T,σ)[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] x 2 T I[p(x δ) < λ1p(x) + λ2q(x)]q(x)dx q(x)=y p(x δ)<(λ1+λ2ν)p(x) y=rq(t) = Z T 0 rq(t)dt 2πd/2 Pr[p(x δ) < (λ1 + λ2ν)p(x) | x 2 = t] =νEt Γ(d/2 k,1)u3(t, λ1 + λ2ν) I t T 2 Now, for R: = Pr ϵ P=N g(k,σ)[p(ϵ) < λ1p(ϵ + δ) + λ2q(ϵ + δ)] Rd I[p(x) < λ1p(x + δ) + λ2q(x + δ)]p(x)dx =Et Γ(d/2 k,1)u4(t, λ1, λ2). Here, u4(t, λ1, λ2) = Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ 2t) | x 2 = σ Plugging Lemma G.1 into P(λ1, λ2) and Q(λ1, λ2), and then plugging Lemma G.2 into R(λ1, λ2), we yield the desired expressions in theorem statement. Lemma G.1. Under the condition of Theorem 5, let δ = (r, 0, . . . , 0)T, u3(t, λ) := Pr[p(x δ) < λp(x) | x 2 = σ = Beta CDF d 1 2t 2kσ 2W( t (85) where W is the principal branch of Lambert W function. Proof of Lemma G.1. p(x δ) < λp(x) means that rp( x δ 2) < λrp( x 2) and therefore x δ 2 > r 1 p (λrp( x 2)). Given that x 2 = σ 2t, we have i=2 x2 i = 2tσ 2, i=2 x2 i r 1 p (λrp(σ This is equivalent to x1 2tσ 2 + r2 r 1 p (λrp(σ From the expression of rp (Eqn. 
(82)), we have r 1 p (λrp(σ 2t))2 = 2σ 2k W t k e t k λ 1 Thus, when x 2 = σ 2t and x uniformly sampled from this sphere, p(x δ) < λp(x) x1 2tσ 2 + r2 2σ 2k W( t 2t 2 (r + σ 2t)2 2σ 2k W t ke t k λ 1 (89) According to (Yang et al., 2020, Lemma I.23), for x uniformly sampled from sphere with radius σ 2t, the compo- nent coordinate 1 + x1 σ 2t 2 Beta( d 1 2 ). Combining Eqn. (89) with this result concludes the proof. Lemma G.2. Under the condition of Theorem 5, let δ = (r, 0, . . . , 0)T, u4(t, λ1, λ2) := Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ = u1(t), λ1 0 u1(t) + u2(t), λ1 > 0 (90) where u1(t) =Beta CDF d 1 min{T 2, 2σ 2k W ( t k e t k (λ1 + νλ2) 1 k )} u2(t) = max 0, Beta CDF d 1 2σ 2k W ( t k e t k λ 1 k 1 ) (σ Beta CDF d 1 Proof of Lemma G.2. Under the condition that x 2 = σ 2t, we separate two cases: q(x+δ) > 0 and q(x+δ) = 0, which corresponds to x + δ 2 T and x + δ 2 > T. Double Sampling Randomized Smoothing (1) q(x + δ) > 0: Notice that q(x + δ) > 0 x + δ 2 2 T 2 x1 T 2 2tσ 2 r2 From Eqn. (83), q(x + δ) = νp(x + δ). Thus, λ1p(x + δ) + λ2q(x + δ) > rp(σ (λ1 + νλ2)p(x + δ) rp(σ x + δ 2 2 r 1 p 2t) λ1 + νλ2 x1 r 1 p rp(σ Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ 2t) q(x + δ) > 0 | x 2 = σ x1 min r 1 p 2 , T 2 2tσ 2 r2 (93) By (Yang et al., 2020, Lemma I.23) and Eqn. (88), we thus have Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) > 0 | x 2 = σ =Beta CDF d 1 min{T 2, 2σ 2k W( t ke t k (λ1 + νλ2) 1 k )} (2) q(x + δ) = 0: q(x + δ) = 0 x1 > T 2 2tσ 2 r2 When λ1 0, the condition λ1p(x + δ) + λ2q(x + δ) = λ1p(x + δ) > rp(σ 2t) can never be satisfied. When λ1 > 0, we have λ1p(x + δ) + λ2q(x + δ) > rp(σ x + δ 2 2 r 1 p x1 r 1 p rp(σ Therefore, when λ1 0, Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ 2t] = 0. (97) When λ1 > 0, the condition that lambda1p(x + δ) + λ2q(x + δ) > rp(σ 2t) is equivalent to T 2 2tσ 2 r2 2r , r 1 p rp(σ (98) Again, by (Yang et al., 2020, Lemma I.23) and Eqn. (88), we have Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ 0, Beta CDF d 1 Beta CDF d 1 =u2(t). (99) (3) Combining the two cases: Now we are ready to combine the two cases. Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ = Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) > 0 | x 2 = σ Pr[λ1p(x + δ) + λ2q(x + δ) > rp(σ q(x + δ) = 0 | x 2 = σ = u1(t), λ1 0 u1(t) + u2(t). λ1 > 0 G.4. Discussion on Uniqueness of Feasible Pair As we sketched in Section 5.3, in general cases, the pair that satisfies constraints in Eqn. (12) is unique. We formally state this finding and prove it in Theorem 7. Theorem 7 (Uniqueness of Feasible Solution in Eqn. (12)). Under the same setting of Theorem 5, if QA (0, 1) and PA (QA/ν, 1 (1 QA)/ν), then there is the pair (λ1, λ2) that satisfies both P(λ1, λ2) = PA and Q(λ1, λ2) = QA is unique. We prove the theorem in the end of this section, which is based on the strict monotonicity of two auxiliary functions: g(λ1 + νλ2) := Q(λ1, λ2) and h(λ1) (defined in Double Sampling Randomized Smoothing Section 5.3). For other types of smoothing distributions P and Q, in Theorem 11 we characterize and prove a sufficient condition that guarantees the uniqueness of feasible pair. We observe that the feasible region of (PA, QA) is R = {(x, y) : y/ν x 1 (1 y)/ν, 0 y 1}. (101) Therefore, the theorem states that when (PA, QA) is an internal point of R, the feasible solution is unique and we can use our proposed method to find out such a feasible solution and thus solve the dual problem (D). Now, the edge cases are that (PA, QA) lies on the boundary of R. 
We discuss all these cases and show that these boundary cases correspond to degenerate problems that are easy to solve respectively: QA = 0: When QA = 0, and PA (0, 1) (otherwise, trivially R(λ1, λ2) = PA {0, 1} solves (D)), we have λ1 + νλ2 0+ and thus R(λ1, λ2) = Et Γ(d/2 k,1)u2(t). Since u2(t) only involves λ1, we only require λ1 to be unique to deploy the method. Since PA = P(λ1, λ2) = Et Γ(d/2 k,1)u3(t, λ1) I[t T 2/(2σ 2)] and PA (0, 1), by similar arguments as in Theorem 7, we know λ1 is unique. Hence, all feasible pairs give the same R(λ1, λ2), i.e., have the same objective value and the proposed method which computes a feasible one is sufficient for solving (D). QA = 1: When QA = 1 and PA (0, 1), we observe that u1(t) Beta CDF d 1 where equality is feasible with the selected (λ1 + νλ2) + and hence the maximum of Et Γ(d/2 k,1)u1(t) among all feasible (λ1, λ2) is a constant. On the other hand, since u2(t) only involves λ1 which is unique as discussed in QA = 0 case, all feasible pairs give the same value of Et Γ(d/2 k,1)u2(t). As a result, the maximum of R(λ1, λ2) can be computed by adding the unique value of Et Γ(d/2 k,1)u2(t) and the constant corresponding to the maximum of Et Γ(d/2 k,1)u1(t) among all feasible (λ1, λ2), which solves the dual problem (D). PA = QA/ν: We assume PA, QA (0, 1) (otherwise covered by former cases). In this case, λ1 satisfies that h(λ1) = PA QA/ν = 0, so λ1 0+. As a result, u2(t) = 0 for all t > 0 and R(λ1, λ2) = Et Γ(d/2 k,1)u1(t). We observe that u1(t) is only related to (λ1 + νλ2) where (λ1+νλ2) satisfying Q(λ1, λ2) = QA is unique since QA (0, 1). Thus, any feasible (λ1, λ2) would have the same (λ1 + νλ2) and hence leads to the same R(λ1, λ2). So the proposed method which finds one feasible (λ1, λ2) suffices for solving (D). PA = 1 1 QA ν : We again assume PA, QA (0, 1) (otherwise covered by former cases). In this case, λ1 satisfies that h(λ1) = 1 1/ν. Since Et Γ(d/2 k,1)I[t < T 2/(2σ 2)] = 1/ν, we know u3(t, λ1) = 1 for t T 2/(2σ 2) except a zero-measure set, and thus λ1 + . As a result, Et Γ(d/2 k,1)u2(t) = 1 Et Γ(d/2 k,1)Beta CDF d 1 constant. Similar as PA = QA/ν case, feasible (λ1+ νλ2) is unique. Therefore, feasible (λ1, λ2) leads to a unique Et Γ(d/2 k,1)u1(t). We compute out these two quantities Et Γ(d/2 k,1)u2(t) and Et Γ(d/2 k,1)u1(t) so as to obtain the unique R(λ1, λ2) which solves (D). We remark that in practice, we never observe any instances that correspond to these edge cases though we implemented these techniques for solving them. Proof of Theorem 7. The high-level proof sketch is implied in the derivation of our feasible (λ1, λ2) finding method introduced in Section 5.3. We first show h(λ1) is monotonically strictly increasing in a neighborhood of λ1 where h(λ1) = PA QA/ν, so the λ1 that satisfies PA QA/ν is unique. We then define g(γ) = νEt Γ(d/2 k,1)u3(t, γ) I[t T 2/(2σ 2)], and show its strict monotonicity around the neighborhood of λ that satisfies g(γ ) = QA, which indicates that (λ1 + νλ2) that satisfies QA is unique. Combining this two arguments, we know feasible (λ1, λ2) is unique. Note that the key in this proof is the local strict monotonicity, and the global (nonstrict) monotonicity for both h( ) and g( ) is apparent from their expressions. We now present the proofs for the local strict monotonicity of these two functions h and g. (1) h(λ1) is monotonically strictly increasing in a neighborhood of λ1 where h(λ1) = PA QA/ν. From the theorem condition, we know that h(λ1) (0, 1 1/ν). 
Thus, there exists t0 T 2/(2σ 2), such that u3(t0, λ1) (0, 1) (otherwise h(λ1) = 0 or 1 1/ν). From the closed-form equation of u3, for any neighboring λ 1 = λ1, we will have W(t0/ket0/kλ 1/k 1 ) = W(t0/ket0/kλ 1/k 1 ) by the monotonicity of Lambert W function. On the other hand, since u3(t0, λ1) (0, 1) and Beta CDF d 1 2 ( ) is also strict monotonic in the neighborhood, we have u3(t0, λ 1) = u3(t0, λ1). Same happens to t0 s neighborhood, i.e., δ > 0, s.t., t (t0 δ, t0 + δ), u3(t, λ 1) = u3(t, λ1) and sgn(u3(t, λ 1) u3(t, λ1)) is consistent for any t (t0 δ, t0 + δ). As a result, h(λ1) = h(λ 1). By definition and the fact that h( ) is monotonically non-decreasing, the argument is proved. (2) g(γ) is monotonically strictly increasing in a neighborhood of γ where g(γ ) = QA. Double Sampling Randomized Smoothing Since QA (0, 1) by the theorem condition, we know that there exists t0 (0, T 2/(2σ 2)), such that u3(t0, γ ) (0, 1) (otherwise g(γ ) = 0 or 1 which contradicts the theorem condition). Following the same reasoning as in (1) s proof, for any γ = γ that lies in a sufficiently small neighborhood of γ , we have u3(t0, γ ) = u3(t0, γ ), and δ > 0, s.t., t (t0 δ, t0 + δ), u3(t, γ ) = u3(t, γ ) and sgn(u3(t, γ ) u3(t, γ )) is consistent for any t (t0 δ, t0 + δ). As a result, g(γ ) = g(γ ). By definition and the fact that g( ) is monotonically non-decreasing, the argument is proved. H. Extensions of DSRS Computational Methods In this appendix, we exemplify a few extensions of DSRS computational method. H.1. Certification with Standard and Truncated Standard Gaussian In main text and Theorem 5, we focus on DSRS certification with generalized Gaussian as P and truncated generalized Gaussian as Q, which has theoretical advantages (Theorem 2). On the other hand, DSRS can also be applied to other distributions. Concretely, to certify robustness with standard Gaussian as P and truncated standard Gaussian as Q, we can directly plug the following theorem s numerical integration expressions into the described DSRS algorithm (Alg. 2). Theorem 8. In Dδ(PA, QA), let r = δ 2, when P = N(σ) and Q = Ntrunc(T, σ), let ν := ΓCDFd/2(T 2/(2σ2)) 1, R(λ1, λ2) := Pr ϵ P[p(ϵ) < λ1p(ϵ + δ) + λ2q(vϵ + δ)] = Et Γ(d/2,1)u1(t), λ1 0 Et Γ(d/2,1)u1(t) + u2(t), λ1 > 0 where u1(t) =Beta CDF d 1 min{T 2, 2tσ2 + 2σ2 ln(λ1 + νλ2)} u2(t) = max 0, Beta CDF d 1 2tσ2 + 2σ2 ln λ1 (σ Beta CDF d 1 P (λ1, λ2) := Pr ϵ P[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] = Et Γ(d/2,1) u3(t, λ1), t T 2/(2σ2) u3(t, λ1 + νλ2), t < T 2/(2σ2). where u3(t, λ) = Beta CDF d 1 1 2 + r2 + 2σ2 ln λ Q(λ1, λ2) := Pr ϵ Q[p(ϵ δ) < λ1p(ϵ) + λ2q(ϵ)] = νEt Γ(d/2,1)u3(t, λ1 + νλ2) I[t T 2/(2σ2)]. In above equations, Γ(d/2, 1) is gamma distribution and ΓCDFd/2 is its CDF, and Beta CDF d 1 2 is the CDF of dis- tribution Beta( d 1 Proof of Theorem 8. The proof largely follows the same procedure as the proof of Theorem 5. The only difference is that, since P = N(σ), let rp( ϵ 2) = p(ϵ), different from Eqn. (88), now r 1 p (λrp(σ = 2σ2 ln(λrp(σ 2t) (2πσ2)d/2) = 2σ2 ln λ (2πσ2)d/2 exp 2tσ2 = 2σ2(ln λ t) = 2tσ2 2σ2 ln λ. (103) By plugging this equation into the proof of Theorem 5, we prove Theorem 8. In practice, DSRS with standard Gaussian and truncated standard Gaussian as smoothing distributions gives marginal improvements over Neyman-Pearson-based certification. This is because, for standard Gaussian distribution, the noise magnitude is particularly concentrated on a thin shell as reflected by the green curve in Figure 4. 
As a result, the truncated standard Gaussian as Q either has a tiny density overlap with P or provides highly similar information (i.e., QA PA). In either case, Q provides little additional information. Therefore, in practice, we do not use standard Gaussian and truncated standard Gaussian as P and Q which is also justified by Theorem 2, though DSRS can provide certification for this setting. H.2. Certification with Generalized Gaussian with Different Variances We now consider the robustness certification with smoothing distribution P = N g(k, σ) and additional smoothing distribution Q = N g(k, β) where σ and β are different (i.e., different variance). H.2.1. COMPUTATIONAL METHOD DESCRIPTION Hereinafter, for this P and Q we define the radial density function g(r) := p(x) and h(r) := q(x) for any x 2 = r, where p and q are the density functions of P and Q respectively. (λ1g + λ2h) 1(x) := max y s.t. λ1g(y) + λ2h(y) = x and similarly (λ1g + λ2h) 1(x) := min y s.t. λ1g(y) + λ2h(y) = x. In this case, we can still have the numerical expression for P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) as shown in Theorem 9. Theorem 9. When the smoothing distributions P = N g(k, σ) and additional smoothing distribution Q = Double Sampling Randomized Smoothing Table 3. The numerical integration expression for P, Q, and R (see definition in Theorem 5). See Appendix H.2 for notation description. P(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 2x)2 g 1(λ1g(σ 2x) + λ2h(σ Q(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 2x)2 g 1(λ1g(β 2x) + λ2h(β R(λ1, λ2) = E x Γ( d 2 k)Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2x))2 r2 2xσ 2 Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2x))2 r2 2xσ 2 Table 4. Simplified terms in numerical integration for P and Q, where W is the real-valued branch of the Lambert W function. Functions P(λ1, λ2) Q(λ1, λ2) Term to Simplify g 1 λ1g(σ 2x) + λ2h(σ 2x) 2 g 1 λ1g(β 2x) + λ2h(β Simplified k = 0 2σ 2 ln λ1 exp( x) + λ2 σ β d exp σ 2 β 2 x 2σ 2 ln λ1 exp β 2 σ 2 x + λ2 σ β d exp( x) Terms k > 0 2kσ 2W λ1 exp( x) + λ2 σ β d 2k exp σ 2 σ 2 + λ2 σ β d 2k exp( x) 1/k! N g(k, β), let P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) be as defined in Theorem 5, then P, Q, and R can be computed by expressions in Table 3. In Table 3, the numerical integration requires the computation of several inverse functions. In this subsection, we simplify the numerical integration expressions for P and Q by deriving the closed forms of these inverse functions, as shown in Table 4. In the actual implementation of numerical integration, for P and Q, we use these simplified expressions to compute; for R, benefited from the unimodality (Lemma H.2), we deploy a simple binary search to compute. Theorem 10. When the smoothing distributions P = N g(k, σ) and Q = N g(k, β), the terms g 1(λ1g(σ 2x))2 and g 1(λ1g(β 2x) + λ2h(β 2x))2 in P(λ1, λ2) and Q(λ1, λ2) s computational expressions (see Table 3) are equivalent to those shown in Table 4. With the method to compute P, Q, and R for given λ1 and λ2, now the challenge is to solve λ1 and λ2 such that P(λ1, λ2) = PA and Q(λ1, λ2) = QA. Luckily, as Theorem 11 shows, for given PA and QA, such (λ1, λ2) pair is unique. Indeed, such uniqueness holds not only for this P and Q but also for a wide range of smoothing distributions. Theorem 11 (Uniqueness). 
Suppose distributions P and Q s are ℓp-spherically symmetric, i.e., there exists radial density functions g and h such that p(x) = g( x p) and q(x) = h( x p), then if g and h are continuous and g h is continuous and strictly monotonic almost everywhere, for any given (PA, QA) R2 +, there is at most one (λ1, λ2) pair satisfying constraint of Eqn. (12). The proof is shown in the next subsection, which is based on Cauchy s mean value theorem of the probability integral. With Theorem 11 and Proposition 2, we can use joint binary search as shown in Alg. 4 to find λ1 and λ2 which can be viewed as the intersection of two curves. At a high level, Each time, we leverage the monotonicity to get a point (λmid 1 , λmid 2 ) on the P s curve, then compute corresponding Q, and update the binary search intervals based on whether Q(λmid 1 , λmid 2 ) > QA. We shrink the intervals for both λ1 and λ2 (Lines 5 and 7) in Alg. 4 to accelerate the search. The algorithm is plugged into Line 8 of Alg. 2. Algorithm 4 DUALBINARYSEARCH for λ1 and λ2. Data: Query access to P( , ) and Q( , ), PA and QA Result: λ1 and λ2 satisfying constraints P(λ1, λ2) = PA, Q(λ1, λ2) = QA 1 λL 1 M, λU 1 +M, λL 2 M, λU 2 +M ; /* M is a large positive number */ 2 while λU 1 λL 1 > eps do 3 λmid 1 (λL 1 + λU 1 )/2 Binary search for λmid 2 [λL 2 , λU 2 ] such that P(λmid 1 , λmid 2 ) = PA ; /* (λmid 1 , λmid 2 ) lies on red curve */ 4 if Q(λmid 1 , λmid 2 ) < QA then 5 λU 1 λmid 1 , λL 2 λmid 2 ; /* (λmid 1 , λmid 2 ) is right to intersection */ 7 λL 1 λmid 1 , λU 2 λmid 2 ; /* (λmid 1 , λmid 2 ) is left to intersection */ 9 return (λL 1 , λL 2 ) ; /* for soundness: R(λL 1 , λL 2 ) lower bounds (D) */ H.2.2. PROOFS Proof of Theorem 9. The ℓ2-radial density functions of p(x) and q(x) have these expressions: g(r) r k exp( r2/(2σ 2)) and h(r) r k exp( r2/(2β 2)). When r increases, r k, exp( r2/(2σ 2)), and exp( r2/(2β 2)) decrease so that g and h are both strictly decreasing. Therefore, they have inverse functions, which are denoted by g 1 and h 1. Now we are ready to Double Sampling Randomized Smoothing derive the expressions. (I.) We start with P: Rd I [p(x δ) < λ1p(x) + λ2q(x)] p(x)dx p(x)=y p(x δ)<λ1p(x)+λ2h(x) p(x)=y p(x δ)<λ1p(x)+λ2h(x) dx g (g 1(y)) 0 ydy πd/2dg 1(y)d 1 dx g (g 1(y)) Pr p(x δ) < λ1p(x) + λ2q(x) p(x) = y 0 g(t)dt πd/2dtd 1 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kπ d 2 Γ(d/2) Γ(d/2 k) t 2k exp t2 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kΓ( d 2 k) td 2k 1 exp t2 Pr p(x δ) < λ1g(t) + λ2h(t) x 2 = t (2σ 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2σ 2 Pr h p(x δ) < λ1g( (7.) = 1 Γ( d 0 t d 2 2k 1 exp( t) Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ (8.) = Et Γ( d 2 k) Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ As a reminder, d is the input dimension. The Γ( ) here refer to either the Gamma distribution (in t Γ( d 2 k)) or Gamma function (in some denominators). In the above integration: (1.) uses level-set sliced integration as first proposed in (Yang et al., 2020); (2.) leverages the fact that p(x) is ℓ2-symmetric and g ( ) < 0; (3.) injects the surface area of ℓ2-sphere with radius g 1(y); (4.) alters the integral variable: t = g 1(y), which yields dt = dy/g (t) = dy/g (g 1(y)) and y = g(t); (5.) injects the expression of g(t); (6.) alters the integral variable from t to t2; (7.) does re-scaling; and (8.) observes that the integral can be expressed by expectation over Gamma distribution. Due to the isotropy, let r = δ 2, we can deem δ = (r, 0, . . . , 0)T by rotating the axis. 
Then we simplify the probability term by observing that ( x 2 = σ p(x δ) < λ1g(σ 2t) + λ2h(σ i=2 x2 i = 2tσ 2 i=2 x2 i g 1 λ1g(σ 2t) + λ2h(σ = x1 2tσ 2 g 1 λ1g(σ 2t) + λ2h(σ Lemma H.1 (Lemma I.23; (Yang et al., 2020)). If (x1, , xd) is sampled uniformly from the unit sphere Sd 1 Rd, then 2 is distributed as Beta d 1 Combining Lemma H.1 and Appendix H.2.2, we get Pr h p(x δ) < λ1g(σ 2t) + λ2h(σ 2t) x 2 = σ =Beta CDF d 1 2t) + λ2h(σ (105) Injecting Eqn. (105) into (8.) yields the expression shown in Table 3. (II.) The integration for Q is similar: Rd I[p(x δ) < λ1p(x) + λ2q(x)]q(x)dx q(x)=y p(x δ)<λ1p(x)+λ2q(x) q(x)=y p(x δ)<λ1p(x)+λ2q(x) dx h (h 1(y)) 0 h(t)dt πd/2dtd 1 Pr p(x δ) < λ1p(x) + λ2q(x) x 2 = t (2β 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2β 2 Double Sampling Randomized Smoothing Pr h p(x δ) < λ1g( 2 k) Pr h p(x δ) < λ1g(β 2t) + λ2h(β 2t) x 2 = β Pr h p(x δ) < λ1g(β 2t) + λ2h(β 2t) x 2 = β =Beta CDF d 1 2t) + λ2h(β (III.) Finally, we derive the integration for R: Rd I [p(x δ) < λ1p(x) + λ2q(x)] p(x δ)dx Rd I [p(x) < λ1p(x + δ) + λ2q(x + δ)] p(x)dx p(x)=y p(x)<λ1p(x+δ)+λ2q(x+δ) p(x)=y p(x)<λ1p(x+δ)+λ2q(x+δ) dx g (g 1(y)) 0 g(t)dt πd/2dtd 1 Pr λ1p(x + δ) + λ2q(x + δ) > p(x) x 2 = t (2σ 2) d 2 kΓ( d 0 t d 2 k 1 exp t 2σ 2 Pr h λ1p(x + δ) + λ2q(x + δ) > g( (9.) = Et Γ( d k 2 ) Pr h λ1p(x + δ) + λ2q(x + δ) > g(σ 2t) x 2 = σ To simplify the probability term, this time we have ( x 2 = σ λ1p(x + δ) + λ2q(x + δ) > g(σ i=2 x2 i = 2tσ 2 (λ1g + λ2h) v u u t(x1 + r)2 + = (λ1g + λ2h) p 2tσ 2 + r2 + 2x1r > g(σ where r = δ 2 and we deem δ = (r, 0, . . . , 0)T by rotating the axis. Lemma H.2. The function (λ1g + λ2h) is unimodal in its domain (0, + ). Proof of Lemma H.2. We expand (λ1g+λ2h) then consider its derivative. λ1g(r) + λ2h(r) =C1λ1r 2k exp r2 + C2λ2r 2k exp r2 (110) where C1 and C2 is the constant normalization coefficient of g and h respectively. (λ1g + λ2h) (r) (111) = C1λ12kr 2k 1 exp r2 + C1λ1 r 2k+1 + C2λ22kr 2k 1 exp r2 + C2λ2 r 2k+1 Now we show that (λ1g + λ2h) (r) = 0 (112) has at most one solution. When either λ1 = 0 or λ2 = 0, since both g and h are monotonic, λ1g + λ2 is monotonic and there is no solution. Thus, we assume λ1, λ2 = 0. We observe that (λ1g + λ2h) (r) = 0 (113) C2λ2 2k + r2 σ 2 2k + r2 β 2 = exp r2 C2λ2 2k + x σ 2 2k + x β 2 = exp x 2σ 2 x 2β 2 C1λ1 2k + x β 2 2k + x σ 2 = exp x 2β 2 x 2σ 2 Double Sampling Randomized Smoothing We focus on Eqn. (117). Without loss of generality, we assume σ > β , then both function x 7 2k+x/β 2 2k+x/σ 2 and function x 7 exp x/(2β 2) x/(2σ 2) are monotonically increasing. If λ1λ2 > 0, the LHS and RHS of Eqn. (117) are continuous and monotonic in opposite directions. Thus, there is at most one solution to Eqn. (117). If λ1λ2 < 0, the LHS is monotonic increasing but the derivative is decreasing because C1λ1 2k + x β 2 2k + x σ 2 = C2λ2 C1λ1 1 + x 2kβ 2 1 + x 2kσ 2 (118) where the numerator 1 + x 2kβ 2 is linearly increasing and the denominator 1 + x 2kσ 2 is also linearly increasing. On the other hand, the RHS is monotonic increasing and the derivative is also increasing because 1 2β 2 1 2σ 2 > 0. (119) As a result, the difference function between RHS and LHS is monotone and there is at most one solution. Thus, we have shown Eqn. (117) has at most one solution. Given that (λ1g + λ2h) is also continuous, we thus know the function (λ1g + λ2h) is unimodal. Moreover, since g and h approach 0 when r , (λ1g + λ2h) approaches 0 when r . (λ1g + λ2h) 1(y) := max y s.t. λ1g(y ) + λ2h(y ) = x, (120) (λ1g + λ2h) 1(y) := min y s.t. λ1g(y ) + λ2h(y ) = x. 
(121) Then, Lemma H.2 and (λ1g + λ2h) 0 when r imply that, when y > 0, y0 (λ1g + λ2h) 1(y), (λ1g + λ2h) 1(y) λ1g(y0) + λ2h(y0) > y. (122) We simplify Eqn. (109) by observing that g(σ 2tσ 2 + r2 + 2x1r (λ1g + λ2h) 1(g(σ (λ1g + λ2h) 1(g(σ (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 Combining Lemma H.1 with Eqn. (124) we get Pr h λ1p(x + δ) + λ2q(x + δ) > g(σ 2t) x 2 = σ =Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 Beta CDF d 1 1 2 + (λ1g + λ2h) 1(g(σ 2t))2 2tσ 2 r2 (125) Combining the above equation with (9.) yields the expression shown in Table 3. Proof of Theorem 10. To prove the theorem, the main work we need to do is deriving the closed-form expression for the inverse function g 1, where (2σ 2) d 2 kπ d 2 Γ( d 2 k)r 2k exp r2 (I.) When k = 0, (2σ 2π) d 2 exp g 1(y)2 g 1(y)2 = 2σ 2 ln (2σ 2π) d 2 y . (128) (II.) When k > 0, we notice that the Lambert W function W is the inverse function of w 7 wew, i.e., W(x) exp(W(x)) = x. We let the normalizing coefficient of g(r) be (2σ 2) d 2 kπ d 2 Γ( d 2 k). (129) y = Cg 1(y) 2k exp g 1(y)2 y = g 1(y)2k exp g 1(y)2 k = g 1(y)2 exp g 1(y)2 k = W 1 g 1(y)2 g 1(y)2 = 2σ 2k W Plugging in Eqns. (128) and (134) to g 1(λ1g(σ 2x) + λ2h(σ 2x))2 and g 1(λ1g(β 2x) + λ2h(β 2x))2 for k = 0 and k > 0 case, then results in Table 4 follow from algebra. Double Sampling Randomized Smoothing Proof of Theorem 11. We prove the theorem by contradiction. Suppose that the (λ1, λ2) that satisfy the constraint of Eqn. (12) are not unique, and we let (λa 1, λa 2) and (λb 1, λb 2) be such two pairs. Without loss of generality, we assume λa 1 = λb 1. If λa 2 = λb 2, we have P(λa 1, λa 2) = P(λb 1, λb 2), i.e., the region {x δ : p(x δ) min{λa 1, λb 1}p(x) + λa 2q(x), max{λa 1, λb 1}p(x) + λa 2q(x) (135) has zero mass under distribution P. Given that P and Q have positive and continuous density functions almost everywhere, the volume of the region is non-zero thus the mass under distribution P is also non-zero. Therefore, we also have λa 2 = λb 2. Because of the partial monotonicity of P and Q functions (shown in Section 5.3), without loss of generality, we assume λa 1 < λb 1, λa 2 > λb 2. (136) Lemma H.3. There exists a point r0 0, either (1) or (2) is satisfied. (1) When r > r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]; when r < r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. (2) When r > r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]; when r < r0, Pr[p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] Pr[p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. Note that in Pr[ ϵ p = r], the vector ε Rd is uniformly sampled from the ℓp-sphere of radius r. Proof of Lemma H.3. For a given r, since P and Q are both ℓp-spherically symmetric, and the density functions are both positive almost everywhere, as long as λa 1g(r) + λa 2h(r) λb 1g(r) + λb 2h(r), (137) [p(ϵ δ) < λa 1p(ϵ) + λa 2q(ϵ) ϵ p = r] [p(ϵ δ) < λb 1p(ϵ) + λb 2q(ϵ) ϵ p = r]. (138) It still holds if we change the s to s in both Eqns. (137) and (138). Meanwhile, λa 1g(r)+λa 2h(r) λb 1g(r)+λb 2h(r) g(r) h(r) > λb 2 λa 2 λa 1 λb 1 . (139) Since g(r)/h(r) is strictly monotonic, there exists at most one point r0 0 that divides g(r) h(r) > λb 2 λa 2 λa 1 λb 1 and g(r) h(r) < λb 2 λa 2 λa 1 λb 1 . (140) If that r0 exists, from Eqns. (137) to (139) the lemma statement follows. Now we only need to show that r0 exists. 
Assume that the point r_0 does not exist. Then for all r we have either

$\lambda_1^a g(r) + \lambda_2^a h(r) < \lambda_1^b g(r) + \lambda_2^b h(r) \qquad (141)$

or

$\lambda_1^a g(r) + \lambda_2^a h(r) > \lambda_1^b g(r) + \lambda_2^b h(r), \qquad (142)$

while P(λ_1^a, λ_2^a) = P(λ_1^b, λ_2^b) > 0 and Q(λ_1^a, λ_2^a) = Q(λ_1^b, λ_2^b) > 0. This implies that for almost every r,

$\Pr\big[p(\epsilon-\delta) \in (a, b) \,\big|\, \|\epsilon\|_p = r\big] = 0, \qquad (143)$

where

$a = \min\{\lambda_1^a g(r) + \lambda_2^a h(r),\ \lambda_1^b g(r) + \lambda_2^b h(r)\}, \qquad b = \max\{\lambda_1^a g(r) + \lambda_2^a h(r),\ \lambda_1^b g(r) + \lambda_2^b h(r)\}. \qquad (144)$

This violates the continuity assumption on both P and Q. Therefore, the point r_0 exists.

With Lemma H.3, we define the auxiliary function D: ℝ₊ → ℝ such that

$D(r) = \Pr\big[p(\epsilon-\delta) < \lambda_1^a p(\epsilon) + \lambda_2^a q(\epsilon) \,\big|\, \|\epsilon\|_p = r\big] - \Pr\big[p(\epsilon-\delta) < \lambda_1^b p(\epsilon) + \lambda_2^b q(\epsilon) \,\big|\, \|\epsilon\|_p = r\big]. \qquad (145)$

We let S(r) be the surface area of the ℓ_p-sphere of radius r. Then P and Q can be written in integral form:

$P(\lambda_1, \lambda_2) = \int_0^\infty \Pr\big[p(\varepsilon-\delta) < \lambda_1 p(\varepsilon) + \lambda_2 q(\varepsilon) \,\big|\, \|\varepsilon\|_p = r\big]\, g(r)\, S(r)\, \mathrm{d}r, \qquad (146)$

$Q(\lambda_1, \lambda_2) = \int_0^\infty \Pr\big[p(\varepsilon-\delta) < \lambda_1 p(\varepsilon) + \lambda_2 q(\varepsilon) \,\big|\, \|\varepsilon\|_p = r\big]\, h(r)\, S(r)\, \mathrm{d}r. \qquad (147)$

Since P(λ_1^a, λ_2^a) = P(λ_1^b, λ_2^b) and Q(λ_1^a, λ_2^a) = Q(λ_1^b, λ_2^b) by our assumption, a simple rearrangement yields

$\int_0^{r_0} D(r)\, g(r)\, S(r)\, \mathrm{d}r = \int_{r_0}^{\infty} \big(-D(r)\big)\, g(r)\, S(r)\, \mathrm{d}r \neq 0, \qquad (148)$

$\int_0^{r_0} D(r)\, h(r)\, S(r)\, \mathrm{d}r = \int_{r_0}^{\infty} \big(-D(r)\big)\, h(r)\, S(r)\, \mathrm{d}r \neq 0. \qquad (149)$

As Lemma H.3 shows, D(r) for r ∈ [0, r_0] always has the same sign as −D(r) for r ∈ [r_0, +∞), and the last relation (≠ 0) is again due to the continuity of p and q and the fact that P(λ_1^a, λ_2^a) > 0 and Q(λ_1^a, λ_2^a) > 0. Now we can divide Eqn. (148) by Eqn. (149) and apply Cauchy's mean value theorem, which yields

$\frac{D(\xi_1)\, g(\xi_1)\, S(\xi_1)}{D(\xi_1)\, h(\xi_1)\, S(\xi_1)} = \frac{D(\xi_2)\, g(\xi_2)\, S(\xi_2)}{D(\xi_2)\, h(\xi_2)\, S(\xi_2)}, \qquad (150)$

where ξ_1 ∈ (0, r_0) and ξ_2 ∈ (r_0, +∞). Apparently, this requires

$\frac{g(\xi_1)}{h(\xi_1)} = \frac{g(\xi_2)}{h(\xi_2)}. \qquad (151)$

However, g/h is strictly monotonic. By contradiction, there is no distinct pair (λ_1^a, λ_2^a) and (λ_1^b, λ_2^b) satisfying the constraints of Eqn. (12) simultaneously.

Remark. Suppose P and Q are ℓ_p-radial extended Gaussian/Laplace distributions, i.e., their density functions p(x) and q(x) satisfy

$p(x) \propto \|x\|_p^{-k} \exp\big(-(\|x\|_p/\sigma)^\alpha\big), \qquad q(x) \propto \|x\|_p^{-k} \exp\big(-(\|x\|_p/\beta)^\alpha\big) \qquad (152)$

with α > 0 (α = 2 for Gaussian and α = 1 for Laplace). Note that this is a broader family than the family considered for DSRS in the main text. We have

$\frac{g}{h} \propto \exp\big((r/\beta)^\alpha - (r/\sigma)^\alpha\big), \qquad (153)$

which is strictly monotonic. Thus, Theorem 11 is applicable to this large family of smoothing distributions that are commonly used in the literature (Lécuyer et al., 2019; Cohen et al., 2019; Yang et al., 2020; Zhang et al., 2020a; Zhai et al., 2020; Jeong & Shin, 2020).
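To close out Appendix H.2, the following is a minimal sketch of the joint binary search in Alg. 4 (an illustrative re-implementation, not the released code). It assumes query access to P(·,·) and Q(·,·), e.g., the numerical-integration routines of Appendix H.2.1, and relies on the partial monotonicity of P and Q in (λ_1, λ_2) (Section 5.3) together with the uniqueness from Theorem 11.

```python
def dual_binary_search(P, Q, PA, QA, M=1e6, eps=1e-9, inner_iters=80):
    """Find (lambda1, lambda2) with P(lambda1, lambda2) ~= PA and Q(lambda1, lambda2) ~= QA."""
    lo1, hi1 = -M, M          # search interval for lambda1
    lo2, hi2 = -M, M          # search interval for lambda2
    while hi1 - lo1 > eps:
        mid1 = 0.5 * (lo1 + hi1)
        # Inner binary search: place (mid1, mid2) on the curve P(., .) = PA.
        a, b = lo2, hi2
        for _ in range(inner_iters):
            mid2 = 0.5 * (a + b)
            if P(mid1, mid2) < PA:
                a = mid2       # P is non-decreasing in lambda2 since q > 0 almost everywhere
            else:
                b = mid2
        mid2 = 0.5 * (a + b)
        # Outer update, mirroring Lines 4-7 of Alg. 4.
        if Q(mid1, mid2) < QA:
            hi1, lo2 = mid1, mid2   # (mid1, mid2) lies to the right of the intersection
        else:
            lo1, hi2 = mid1, mid2   # (mid1, mid2) lies to the left of the intersection
    return lo1, lo2                 # lower interval ends, so R(lo1, lo2) stays a sound bound on (D)
```

In the actual tool, the search over λ_1 and λ_2 is additionally carried out in logarithmic scale (see Appendix I.1).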
H.3. Discussion on Certification of Other ℓp Norms

DSRS can also be extended to provide a certified robust radius under ℓp norms other than ℓ2. Different from the ℓ2 case, for other ℓp norms the challenge is to compute P(λ1, λ2), Q(λ1, λ2), and R(λ1, λ2) as defined in Theorem 5. For ℓ2 certification, as shown by Theorems 5 and 9, there exist closed-form expressions for these quantities that can be efficiently implemented with numerical integration. However, for other ℓp norms, finding such closed-form expressions for P, Q, and R is challenging. Luckily, we notice that P, Q, and R are all probability-based definitions: as long as we can efficiently sample from P and Q and efficiently compute the density functions p(·) and q(·), we can estimate these quantities from the empirical means of Monte-Carlo sampling. Compared to the numerical-integration-based ℓ2 certification, Monte-Carlo sampling raises sampling-uncertainty and efficiency problems. Here we discuss these two problems in detail and how we alleviate them.

Sampling Uncertainty and Mitigations. The empirical means for P and Q are stochastic, which breaks the nice properties of P and Q (shown in Section 5.3) with respect to (λ1, λ2), because different queries to P and Q fluctuate around the actual values. Therefore, the joint binary search (Alg. 4) may fail to return the correct (λ1, λ2) pair. Thus, we propose a stabilization trick: the same set of samples is used when querying P and Q during the joint binary search. With this fixed set of samples, it can easily be verified that the nice properties (Propositions 1 and 2) still hold even when P and Q are replaced by their empirical means. To guarantee certification soundness in the context of probabilistic Monte-Carlo sampling, we introduce a different set of samples to test whether the solved (λ1, λ2) indeed upper bounds the intersection point (if the test fails, we fall back to the classical Neyman-Pearson-based certification, though this seldom happens). Since the test is also probabilistic, we need to accumulate this additional failure probability and use the lower bound of the confidence interval for soundness, which results in certification with confidence 1 − 2α = 99.8% instead of 1 − α = 99.9% as in classical randomized smoothing certification (Cohen et al., 2019; Yang et al., 2020). Note that the additional confidence intervals caused by Monte-Carlo sampling make DSRS based on Monte-Carlo sampling slightly looser than DSRS based on numerical integration.

Efficiency Concerns and Mitigations. Naively, we would need to sample several (denoted by M; in our implementation M = 5 × 10⁴) high-dimensional vectors for each query to P or Q, which raises an efficiency concern. With the stabilization trick above, we now sample only one set of M samples for the whole joint binary search instead of one set per P or Q query. Combined with the testing-phase samples, the whole algorithm needs to sample 2M vectors rather than O(M log₂ M) without the stabilization trick, which largely resolves the efficiency concern. Moreover, we notice that for the samples {ϵ_i}_{i=1}^M we only need their densities {p(ϵ_i)}_{i=1}^M, {q(ϵ_i)}_{i=1}^M, and {p(ϵ_i − δ)}_{i=1}^M. Thus, we store only these three quantities instead of the whole d-dimensional vectors {ϵ_i}_{i=1}^M, reducing the space complexity from O(Md) to O(M). These techniques significantly improve efficiency in practice. Although DSRS based on Monte-Carlo sampling is still slightly looser and slower than DSRS based on numerical integration, it makes certifying robustness under other ℓp norms feasible.

In this work, we focus on the ℓ2 norm, because additive randomized smoothing is not optimal for other norms (e.g., ℓ1 (Levine & Feizi, 2021)) or the state-of-the-art certification can be directly translated from ℓ2 certification (e.g., ℓ∞ (Yang et al., 2020) and semantic transformations (Li et al., 2021)). Moreover, to the best of our knowledge, standard ℓ2 certification is the most challenging setting: it is where additive randomized smoothing achieves the state of the art, and no other work achieves visibly tighter robustness certification than standard Neyman-Pearson certification (Yang et al., 2020; Levine et al., 2020; Mohapatra et al., 2020).
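For concreteness, below is a minimal sketch of Monte-Carlo estimators for P(·,·) and Q(·,·) with the stabilization trick: the noise samples are drawn once and then reused for every query of the joint binary search, so that the monotonicity properties still hold for the empirical means. This is an illustration rather than the released tool; the exact sample-sharing bookkeeping (e.g., the 2M-sample budget) may differ, and `sample_P`, `sample_Q`, `log_p`, `log_q` are assumed callables that draw noise matrices and evaluate vectorized log-densities.

```python
import numpy as np

def make_mc_estimators(delta, sample_P, sample_Q, log_p, log_q, M=50_000, seed=0):
    rng = np.random.default_rng(seed)
    eps_P, eps_Q = sample_P(M, rng), sample_Q(M, rng)     # fixed M x d noise matrices
    # Only log-densities are kept afterwards: O(M) memory instead of O(M d), and density
    # ratios are formed as differences of log-densities for numerical stability.
    aP = log_p(eps_P - delta) - log_p(eps_P)              # log p(e - delta) - log p(e), e ~ P
    bP = log_q(eps_P) - log_p(eps_P)                      # log q(e)         - log p(e), e ~ P
    aQ = log_p(eps_Q - delta) - log_q(eps_Q)              # log p(e - delta) - log q(e), e ~ Q
    bQ = log_p(eps_Q) - log_q(eps_Q)                      # log p(e)         - log q(e), e ~ Q

    # Indicator of p(e - delta) < lambda1 * p(e) + lambda2 * q(e), rewritten relative to the
    # sampling density of each estimator.
    P_hat = lambda lam1, lam2: float(np.mean(np.exp(aP) < lam1 + lam2 * np.exp(bP)))
    Q_hat = lambda lam1, lam2: float(np.mean(np.exp(aQ) < lam1 * np.exp(bQ) + lam2))
    return P_hat, Q_hat
```

The second, independent sample set described above is then used only to test the solved (λ1, λ2) for soundness.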
I. Implementation and Optimizations

In this appendix, we discuss implementation tricks and optimizations, along with our simple heuristic for selecting the hyperparameter T in the additional smoothing distribution Q = N_g^trunc(k, T, σ).

I.1. Implementation Details

We implement DSRS in Python with about two thousand lines of code. The tool uses PyTorch² to query a given base classifier with Monte-Carlo sampling in order to derive the confidence intervals $[\underline{P_A}, \overline{P_A}]$ and $[\underline{Q_A}, \overline{Q_A}]$. The tool then builds the whole DSRS pipeline on SciPy³ and NumPy⁴. Specifically, the numerical integration is implemented with the scipy.integrate.quad() method. We exploit the full independence across the certifications of different input instances and build the tool to be fully parallelizable on CPUs. By default we use 10 processes, and the number of processes can be adjusted dynamically. The tool is built in a flexible way, so that adding new smoothing distributions is not only theoretically straightforward but also easy in practice. In the future, we plan to extend the tool to 1) provide GPU support and 2) reuse existing certification results from previous instances with similar confidence intervals to achieve higher efficiency. We will also support more smoothing distributions.

²http://pytorch.org/  ³https://scipy.org/scipylib/index.html  ⁴https://numpy.org/

In the implementation, we widely use the logarithmic scale, since many quantities in the computation have widely varying scales. For example, since the input dimension d is typically over 500 on a real-world dataset, the density functions p and q decay very quickly as the noise magnitude increases. Another example is the input variables for ln(·) and W(·) in Theorem 5: these variables are exponential with respect to the input dimension d, so they can be very large or very small. To mitigate this, we perform all computations with such varying scales in logarithmic scale to improve precision and floating-point soundness. For example, we implement a method lnlogadd to compute log(λ1 exp(x1) + λ2 exp(x2)), and apply the method wlog from (Yang et al., 2020) to compute W(exp(x)). We remark that in the binary search for the dual variables (see Section 5.3), we also use the logarithmic scale for λ1 and λ2. The code, model weights, and all experiment data are publicly available at https://github.com/llylly/DSRS.

I.2. T Heuristics

As briefly outlined in Section 6.1, we apply a simple yet effective heuristic to determine the hyperparameter T in the additional smoothing distribution Q = N_g^trunc(k, T, σ). Concretely, we first sample the prediction probability under the original smoothing distribution P and obtain the confidence lower bound $\underline{P_A}$ of PA = f^P(x0)_{y0}. Then, we use the following empirical expression to determine T from $\underline{P_A}$:

$T = \sigma\sqrt{2\,\Gamma\mathrm{CDF}^{-1}_{d/2-k}(p)}, \qquad (154)$

$p = \max\{-0.08\ln(1 - \underline{P_A}) + 0.2,\ 0.5\}. \qquad (155)$

This can be viewed as first parameterizing T by p and then using a simple heuristic to determine p from $\underline{P_A}$. The parameterization of T by p is inspired by the probability mass of the ball of radius T under the original smoothing distribution P = N_g(k, σ), supposing the true-decision region is concentrated in an ℓ2-ball centered at x0. Concretely, from the definition of P,

$\Pr_{\epsilon\sim P}[\|\epsilon\|_2 \le T] = p. \qquad (156)$

Then, we use a randomly sampled CIFAR-10 training subset containing 1,000 points, with models trained using Consistency (Jeong & Shin, 2020) under σ = 0.50, to sweep all p ∈ {0.1, 0.2, . . . , 0.9}. We plot the minimum p and maximum p that give the highest certified robust radius as a segment and find that Eqn. (155) fits the general tendency well, as shown in Figure 7. Thus, we use this simple heuristic to determine T. Empirically, this simple heuristic generalizes well and is competitive with more complex methods, as shown in Appendix J.
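As an illustration, Eqns. (154)–(155) as reconstructed above translate directly into a few lines of code (names are ours): scipy.stats.gamma.ppf serves as ΓCDF⁻¹_{d/2−k}, which matches Eqn. (156) because ‖ϵ‖₂²/(2σ²) follows Γ(d/2 − k) under P = N_g(k, σ).

```python
import numpy as np
from scipy.stats import gamma

def heuristic_T(pA_lower, d, k, sigma):
    # Eqn. (155): heuristic p from the confidence lower bound of PA.
    p = max(-0.08 * np.log(1.0 - pA_lower) + 0.2, 0.5)
    p = min(p, 1.0 - 1e-12)                          # guard when pA_lower is extremely close to 1
    # Eqn. (154): T such that Pr_{eps~P}[||eps||_2 <= T] = p, cf. Eqn. (156).
    return sigma * np.sqrt(2.0 * gamma.ppf(p, a=d / 2.0 - k))
```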
Figure 7. Our p heuristic (Eqn. (155)), shown as the orange curve over −ln(1 − $\underline{P_A}$), fits the general tendency of the optimal p ranges (shown as blue segments) well.

Another heuristic that we deploy is the fall-back strategy. When the empirical probability $\tilde{P}_A = 1$, i.e., when F(x0 + ϵ) = y0 for all sampled ϵ ∼ P, we fall back to using P itself instead of another distribution Q for the second round of sampling. This strategy is inspired by the finding that, with a fixed sampling budget, if PA is already very high, it is more efficient to use more samples to further tighten the confidence interval of PA than to query imprecise information under another distribution Q. Note that this strategy does not break the high-confidence soundness of our certification, because $\Pr[\tilde{P}_A = n/N \mid P_A \le t] \ge \Pr[\tilde{P}_A = n/N \wedge \tilde{P}_A^{\mathrm{half}} = 1 \mid P_A \le t]$ for any t < 1 and Bernoulli-distributed sampling (which is our case), and we use a Bernoulli confidence interval, which corresponds to $\Pr[\tilde{P}_A = n/N \mid P_A \le t]$. Due to the tight experiment schedule, we only deployed this strategy in the ImageNet evaluation, not in the MNIST and CIFAR-10 evaluations.

I.3. Training Strategy for Generalized Gaussian Smoothing

We train the base classifiers on each dataset using Gaussian augmentation training (Cohen et al., 2019), Consistency training (Jeong & Shin, 2020), and SmoothMix training (Jeong et al., 2021), which are typical training methods for randomized smoothing. We do not consider other training approaches such as (Zhai et al., 2020; Salman et al., 2019; Li et al., 2019a; Carmon et al., 2019) because: (1) some of them either require additional data (Carmon et al., 2019; Salman et al., 2019), are relatively time-consuming (Salman et al., 2019), or are reported to be less effective than later approaches (Li et al., 2019a); and (2) the selected training approaches are widely used or achieve state-of-the-art results with high training efficiency.

On MNIST, for all training methods, we use a convolutional neural network with 4 convolutional layers and 3 fully connected layers following Cohen et al. (2019) as the base classifier architecture. On CIFAR-10, for all training methods, we use ResNet-110 (He et al., 2016) as the base classifier architecture. These architecture choices follow prior work on smoothed classifier training (Cohen et al., 2019; Salman et al., 2019; Zhai et al., 2020). On both MNIST and CIFAR-10, for all training methods, we train for 150 epochs. The learning rate is 0.01 and is decayed by 0.1 at the 50th and 100th epochs. For Consistency training, the hyperparameter λ = 5 on MNIST and λ = 20 on CIFAR-10, and we use two noise vectors per training instance per batch; these are the best hyperparameters reported in (Jeong & Shin, 2020). The batch size is 256 on both MNIST and CIFAR-10, following (Jeong & Shin, 2020). For SmoothMix training, we directly use the best hyperparameters from their open-source repository: https://github.com/jh-jeong/smoothmix/blob/main/EXPERIMENTS.MD.

On ImageNet, we use ResNet-50 (He et al., 2016) as the base classifier architecture and finetune from the open-source model trained by Cohen et al. (2019) with Gaussian smoothing. We train for 10 epochs due to the expensive training cost on ImageNet, and we remark that better results could be achieved with a larger training budget. The learning rate is 0.001 and is decayed by 0.1 at the end of every epoch.
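All of these training methods augment the training inputs with noise drawn from the training smoothing distribution, a generalized Gaussian N_g(k, σ) (see the warm-up discussion below). Its radial law is the one used throughout Appendix H.2: under N_g(k, σ), ‖ϵ‖₂²/(2σ²) follows Γ(d/2 − k) and the direction is uniform on the unit sphere. A minimal PyTorch sketch of such a sampler (illustrative, not the authors' training code) follows.

```python
import torch

def sample_generalized_gaussian(batch, d, k, sigma, device="cpu"):
    """Draw `batch` noise vectors eps ~ N_g(k, sigma) in dimension d (requires d/2 - k > 0)."""
    # Radius: ||eps||_2 = sigma * sqrt(2 t) with t ~ Gamma(d/2 - k, 1).
    t = torch.distributions.Gamma(torch.tensor(d / 2.0 - k), torch.tensor(1.0)).sample((batch,)).to(device)
    radius = sigma * torch.sqrt(2.0 * t)
    # Direction: uniform on the unit sphere S^{d-1}.
    u = torch.randn(batch, d, device=device)
    u = u / u.norm(dim=1, keepdim=True)
    return u * radius.unsqueeze(1)   # add to the (flattened) training inputs
```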
For Consistency training, the hyperparameter λ = 5 and we use two noise vectors per training instance per batch, following the best hyperparameters reported in (Jeong & Shin, 2020). For SmoothMix training, since the open-source repository does not list the best hyperparameters for ImageNet, we select the hyperparameters suggested in the original paper (Jeong et al., 2021).

During training, the training samples are augmented by adding noise drawn from the training smoothing distribution. In typical training approaches (Cohen et al., 2019; Carmon et al., 2019; Salman et al., 2019; Zhai et al., 2020; Jeong & Shin, 2020), the training smoothing distribution is set to be the same as the original smoothing distribution P used to construct the smoothed classifier. However, for our generalized Gaussian N_g(k, σ) as P with large k, we find that this strategy gives poor empirical performance. To better train the base classifier for our original smoothing distribution P with large k, we introduce a warm-up stage for the training smoothing distribution. Suppose the original smoothing distribution P for constructing the smoothed classifier is N_g(k0, σ). We let e0 = 100 be the number of warm-up epochs on MNIST and CIFAR-10, or e0 = 10000 be the number of warm-up steps on ImageNet. In the first e0 epochs (on MNIST and CIFAR-10) or steps (on ImageNet), we use a training smoothing distribution with smaller k. Formally, in the e-th epoch/step with e ≤ e0, the training smoothing distribution is P = N_g(k, σ) where

$k = k_0 - k_0^{\,1 - e/e_0}. \qquad (157)$

For later epochs/steps, we use the original smoothing distribution P itself as the training smoothing distribution. This strategy gradually increases the k of the training smoothing distribution throughout training, so that the base classifier becomes a better fit for the final desired distribution P. Using this strategy, the smoothed classifier constructed from our trained base classifier has certified robustness similar to that of standard Gaussian augmentation under classical Neyman-Pearson-based certification.

J. Additional Experimental Results

In this appendix, we present additional experimental results and studies.

J.1. Empirical Study Setup of Concentration Property

In Figure 4 in Appendix B, we present our investigation of the decision regions of base classifiers. The investigation follows this protocol: (1) We choose the base classifier from (Salman et al., 2019) on ImageNet trained for σ = 0.5 Gaussian smoothing as the subject, because this base classifier is one of the state-of-the-art certifiably robust classifiers on the large-scale ImageNet dataset and our code uses the same preprocessing parameters, so it is easy to adopt their model. (2) We pick every 500th image from the ImageNet test set to form a subset of 100 samples. (3) We filter out the samples that the base classifier cannot classify correctly even without any added noise, which leaves 89 samples. (4) For each of these 89 samples and each perturbation magnitude r, we uniformly sample 1000 perturbation vectors from the hypersphere of ℓ2-radius r and compute the empirical probability of true prediction, where the step size of r is 10.
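A minimal sketch of step (4) of this protocol (an illustrative helper, not the authors' script; `base_classifier` is assumed to be a PyTorch module, and the batch should be chunked in practice):

```python
import torch

@torch.no_grad()
def true_prediction_prob(base_classifier, x0, y0, r, n=1000):
    """Empirical probability of predicting y0 under uniform l2-sphere perturbations of radius r."""
    d = x0.numel()
    u = torch.randn(n, d, device=x0.device)
    u = r * u / u.norm(dim=1, keepdim=True)              # uniform on the l2-sphere of radius r
    perturbed = (x0.flatten() + u).view(n, *x0.shape)     # x0 + delta, one row per perturbation
    preds = base_classifier(perturbed).argmax(dim=1)
    return (preds == y0).float().mean().item()
```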
Figure 4 implies that, for a vast majority of samples, the correctly predicted points are highly concentrated in an ℓ2-ball around the clean input: there exist apparent ℓ2-magnitude thresholds such that the true-prediction probability is almost 1 within the threshold and almost 0 beyond it. This suggests that the concentration property may be achievable for real-world base classifiers in randomized smoothing. In Figure 9, we follow the same protocol but plot the landscape of base classifiers trained with the generalized Gaussian distribution (instead of the standard Gaussian as in Figure 4). Comparing Figure 9 and Figure 4, we find that although the base classifiers in Figure 4 can achieve better certified robustness using Neyman-Pearson certification and generalized Gaussian smoothing (compare Figure 2(b) and the Neyman-Pearson rows in Table 10), they also sacrifice the concentration property, which can explain why the DSRS improvements are much smaller on the models in Section 6 than on the models in Figure 2(b). Thus, as discussed in Section 6, there may be a large space for exploring training approaches that favor DSRS certification by preserving the concentration property.

J.2. Effectiveness of T-Heuristics and Attempts on Better Optimization Tricks

Better Optimization Tricks. To obtain a tighter certified radius, we should make the support of N_g^trunc(k, T, σ) better aligned with the decision region while keeping QA large enough. So, besides the simple heuristic, we can also turn the search for an appropriate value of T into an optimization problem. To make the optimization more stable, we construct the optimization over Pcon, which is more scale-invariant, and then transform the result to obtain the final $T = \sigma\sqrt{2\,\Gamma\mathrm{CDF}^{-1}_{d/2}(P_{\mathrm{con}})}$. The final optimization objective is built as

$P_{\mathrm{con}} = \arg\min_{P_{\mathrm{con}}}\; -Q_A + \frac{\lambda}{2}\,(P_{\mathrm{con}} - \underline{P_A})^2,$

where λ is a hyperparameter that controls the relative weight of the two loss terms. The QA here is estimated by sampling from the distribution N_g^trunc(k, T, σ) in which T is determined by Pcon; however, this sampling is implemented by rejecting sampled noise whose norm is larger than T. Therefore, no gradient with respect to Pcon can be obtained by backpropagating through the loss, namely the optimization objective. Instead, we approximate the gradient from the first loss term with

$G_{Q_A} = \mathbb{E}_{\epsilon\sim Q}\big[\,\|\epsilon\|_2 \cdot \mathrm{CrossEntropyLoss}(f(x_0+\epsilon), y_0)\,\big].$

Then we only have gradient information, with no explicit form of QA anymore. PA is estimated with the $\underline{P_A}$ that we have already obtained, so the final gradient estimate for Pcon is $G_{Q_A} + \lambda(P_{\mathrm{con}} - \underline{P_A})$.

Experiment Setting. Pcon is optimized for each input test image separately, and the initial Pcon is set to 0.7. For each input, we update Pcon for 20 steps on MNIST and CIFAR-10, and for only 10 steps on ImageNet to reduce the time cost. In each step, we draw 2,000 samples to estimate $G_{Q_A}$, and the learning rate is set to 2,000. Besides, to avoid Pcon being optimized to be too large or too small, we clip the final optimized Pcon to lie within 0.1 and 0.9. Since the optimization is somewhat time-consuming, we only test it on CIFAR-10 with σ = 0.5 and on MNIST and ImageNet with σ = 1.0. Different λ values are also tried for different datasets and different training methods to obtain better performance.
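A minimal sketch of this projected-gradient search for Pcon (illustrative only): `surrogate_grad_QA` stands for a routine that estimates $G_{Q_A}$ from 2,000 noise samples drawn from N_g^trunc(k, T(Pcon), σ) as described above, and all names and defaults are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import gamma

def optimize_p_con(surrogate_grad_QA, pA_lower, d, sigma, lam, steps=20, lr=2000.0, p_init=0.7):
    p_con = p_init
    for _ in range(steps):
        grad = surrogate_grad_QA(p_con) + lam * (p_con - pA_lower)   # G_QA + lambda * (Pcon - PA)
        p_con = p_con - lr * grad                                    # plain gradient step
    p_con = float(np.clip(p_con, 0.1, 0.9))                          # clip the final Pcon to [0.1, 0.9]
    # Transform Pcon back to the truncation radius: T = sigma * sqrt(2 * GammaCDF^{-1}_{d/2}(Pcon)).
    return sigma * np.sqrt(2.0 * gamma.ppf(p_con, a=d / 2.0))
```

The weight λ is dataset- and training-method-dependent (the values used in Table 5 range from about 3e-6 to 7e-5).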
Performance. The final results are shown in Table 5, where the certification approach based on the optimization trick described above is denoted as Opt. As we can see, our simple T-heuristic is competitive with the method based on this more complex optimization, at a much lower cost.

J.3. Full Curves and Separated Tables

J.3.1. SEPARATE TABLES BY SMOOTHING VARIANCE

Due to the page limit, in the main text (Section 6, Table 2) we aggregate the certified robust accuracy across models with different smoothing variances σ ∈ {0.25, 0.50, 1.00}. To show the full landscape, we present the certified robust accuracy for each model trained with each variance. The evaluation protocol is the same as in the main text, and the tables for MNIST, CIFAR-10, and ImageNet models are Table 6, Table 7, and Table 8, respectively. We observe that DSRS outperforms standard Neyman-Pearson certification over a wide range of radii.

J.3.2. CURVES

Following convention (Cohen et al., 2019; Salman et al., 2019), we plot the certified robust accuracy versus radius curves in Figure 8. The curves correspond to the certified robust accuracy data in Table 2 (Section 6), i.e., the certified robust accuracy at each radius r is the maximum certified robust accuracy among models trained with variance σ ∈ {0.25, 0.50, 1.00}. We observe that across all medium to large radii (including those not shown in Table 2), DSRS provides higher certified robust accuracy than Neyman-Pearson certification. The margin of DSRS is relatively small on CIFAR-10 but is apparent on MNIST and ImageNet. In particular, the margins on ImageNet reflect that DSRS is especially effective on large datasets.

J.3.3. ACR RESULTS

In the literature, another common metric of certified robustness is the ACR (average certified radius) (Zhai et al., 2020; Jeong & Shin, 2020; Jeong et al., 2021). In Table 9, we report the ACR comparison between Neyman-Pearson-based certification and our DSRS certification. Across the three smoothing variance choices σ ∈ {0.25, 0.50, 1.00}, we find that σ = 1.00 yields the highest ACR, so we only report the ACR for models smoothed with σ = 1.00. As we can see, in all cases DSRS significantly improves over Neyman-Pearson-based certification in terms of ACR.

J.4. Using Distribution with Different Variance as Q

We take the models trained with Gaussian augmentation (Cohen et al., 2019) and variance σ = 1.00 as examples. We use DSRS-trunc to denote DSRS using the truncated generalized Gaussian as Q and DSRS-var to denote DSRS using a generalized Gaussian with different variance as Q, and compare their robustness certification (i.e., certified robust accuracy) in Table 10. From the table, we find that on MNIST and CIFAR-10, DSRS-var is better than DSRS-trunc, whereas on ImageNet, DSRS-trunc is slightly better than DSRS-var. Both DSRS-trunc and DSRS-var are significantly better than Neyman-Pearson-based certification. To investigate the reason, we follow the protocol for studying the concentration property in Appendix J.1 to plot the landscape of models on MNIST, CIFAR-10, and ImageNet, as shown in Figure 9. From the figure, we find that the curves on ImageNet are generally steeper, which corresponds to the concentration property being better satisfied on ImageNet. Therefore, we conjecture that when the concentration property (see Definition 3) is better satisfied, DSRS with the truncated Gaussian as Q is better than DSRS with a Gaussian of different variance as Q.

J.5.
Comparison with Higher-Order Randomized Smoothing It is difficult to have a direct comparison with higher-order randomized smoothing (Mohapatra et al., 2020; Levine et al., 2020), which is the only work to the best of our knowledge that uses additional information beyond PA to tighten the robustness certification in randomized smoothing. This difficulty comes from the following reasons: (1) Higher-order randomized smoothing only supports standard Gaussian smoothing, while DSRS is particularly useful with generalized Gaussian smoothing. (2) All experiment evaluations in higher-order randomized smoothing are conducted with large sampling numbers (N = 2 105 on CIFAR-10 and N = 1.25 106 on Image Net) which makes the evaluation costly, while DSRS is designed for practical sampling numbers (N = 105). (3) The code is not open-source yet (Mohapatra et al., 2020). Nonetheless, we can directly compare with the curves provided by Mohapatra et al. (2020). We capture the certified robust accuracy vs. ℓ2 radius r curves from (Mohapatra et al., 2020) in Figure 10. As we can see, compared with Neyman-Pearson-based certification, the improvements from higher-order randomized smoothing are small especially on the large Image Net dataset despite the excessive sampling numbers (1.25 106). In contrast, as shown in Figure 8, within only 105 sampling number, DSRS is visibly tighter than Neyman-Pearsonbased certification. In fact, to the best of our knowledge, DSRS is the first model-agnostic approach that is visibly tighter than Neyman-Pearson-based certification under ℓ2 radius. Double Sampling Randomized Smoothing Table 5. ℓ2 certified robust accuracy w.r.t. different radii r for different certification approaches. Dataset Training Certification Certified Accuracy under Radius r Method Approach 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 Gaussian Aug. Neyman-Pearson 95.5 % 93.5 % 90.0 % 86.1 % 80.4 % 72.8 % 61.4 % 50.2 % 36.6 % 25.2 % 14.5 % 8.5 % (σ = 1.00) DSRS(T-heuristic) 95.5 % 93.5 % 90.2 % 86.9 % 81.4 % 74.4 % 64.6 % 55.2 % 42.8 % 30.9 % 20.3 % 11.3 % Opt (λ = 7e 05) 95.5% 93.5% 90.0% 86.9% 81.7% 74.9% 65.6% 55.8% 43.8% 30.5% 19.1% 10.2% Consistency Neyman-Pearson 94.5 % 92.6 % 89.3 % 85.9 % 80.7 % 74.4 % 65.9 % 56.9 % 44.1 % 34.4 % 23.3 % 12.8 % (σ = 1.00) DSRS(T-heuristic) 94.5 % 92.8 % 89.3 % 86.3 % 81.4 % 76.1 % 68.3 % 59.5 % 50.7 % 39.8 % 30.7 % 20.0 % Opt (λ = 6e 06) 94.5% 92.8% 89.3% 86.2% 81.2% 75.8% 68.2% 59.6% 50.5% 39.6% 30.7% 19.8% Certified Accuracy under Radius r 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 Gaussian Aug. Neyman-Pearson 60.4 % 55.2 % 51.3 % 45.9 % 40.8 % 35.6 % 30.1 % 24.3 % 20.0 % 16.7 % 13.0 % 10.1 % (σ = 0.50) DSRS(T-heuristic) 60.6 % 55.5 % 51.5 % 46.8 % 42.1 % 37.3 % 32.5 % 27.4 % 22.8 % 19.3 % 16.0 % 12.7 % Opt (λ = 3e 06) 60.6% 55.5% 51.3% 46.7% 42.0% 37.2% 32.5% 27.4% 23.0% 19.3% 16.0% 12.5% Consistency Neyman-Pearson 53.1 % 50.5 % 48.6 % 45.5 % 43.6 % 41.5 % 38.7 % 36.7 % 35.1 % 32.0 % 29.1 % 25.7 % (σ = 0.50) DSRS(T-heuristic) 53.1 % 50.7 % 48.7 % 45.7 % 44.0 % 41.8 % 39.6 % 37.8 % 36.0 % 34.4 % 31.3 % 28.6 % Opt (λ = 4e 06) 53.1% 50.7% 48.7% 45.7% 44.0% 41.8% 39.5% 37.8% 36.0% 34.4% 31.4% 28.4% Gaussian Aug. 
Neyman-Pearson 57.5 % 55.1 % 52.2 % 49.7 % 47.0 % 43.9 % 40.8 % 38.1 % 35.0 % 33.2 % 29.6 % 25.3 % (σ = 1.00) DSRS(T-heuristic) 57.7 % 55.6 % 52.7 % 51.0 % 48.4 % 45.5 % 43.1 % 40.2 % 37.9 % 35.3 % 32.8 % 30.5 % Opt (λ = 1e 05) 57.7% 55.5% 52.4% 50.5% 48.2% 45.0% 42.9% 40.0% 38.0% 35.0% 32.7% 29.9% Consistency Neyman-Pearson 55.9 % 54.4 % 53.0 % 51.2 % 48.2 % 46.2 % 44.2 % 41.7 % 39.1 % 36.4 % 34.4 % 32.1 % (σ = 1.00) DSRS(T-heuristic) 56.0 % 54.6 % 53.1 % 51.8 % 49.9 % 47.4 % 45.7 % 44.2 % 41.7 % 39.3 % 37.8 % 35.8 % Opt (λ = 1e 05) 56.0% 54.6% 53.1% 51.8% 49.7% 47.4% 45.3% 44.0% 41.6% 39.3% 37.8% 35.9% Table 6. MNIST: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 97.9% 96.4% 92.1% (Cohen et al., 2019) DSRS 97.9% 96.6% 92.7% Consistency Neyman-Pearson 98.4% 97.5% 94.4% (Jeong & Shin, 2020) DSRS 98.4% 97.5% 95.4% Smooth Mix Neyman-Pearson 98.6% 97.6% 96.5% (Jeong et al., 2021) DSRS 98.6% 97.7% 96.8% Gaussian Aug. Neyman-Pearson 97.8% 96.9% 94.6% 88.4% 78.7% 52.6% (Cohen et al., 2019) DSRS 97.8% 97.0% 95.0% 89.8% 83.4% 59.1% Consistency Neyman-Pearson 98.4% 97.3% 96.0% 92.3% 83.8% 67.5% (Jeong & Shin, 2020) DSRS 98.4% 97.3% 96.0% 93.5% 87.1% 71.8% Smooth Mix Neyman-Pearson 98.2% 97.1% 95.4% 91.9% 85.1% 73.0% (Jeong et al., 2021) DSRS 98.1% 97.1% 95.9% 93.4% 87.5% 76.6% Gaussian Aug. Neyman-Pearson 95.2% 91.9% 87.7% 80.6% 71.2% 57.6% 41.0% 25.5% 13.6% 6.2% 2.1% 0.9% (Cohen et al., 2019) DSRS 95.1% 91.8% 88.2% 81.5% 73.6% 61.6% 48.4% 34.1% 21.0% 10.6% 4.4% 1.2% Consistency Neyman-Pearson 93.9% 90.9% 86.4% 80.8% 73.0% 61.1% 49.1% 35.6% 21.7% 10.4% 4.1% 1.9% (Jeong & Shin, 2020) DSRS 93.9% 91.1% 86.9% 81.7% 75.2% 65.6% 55.8% 41.9% 31.4% 17.8% 8.6% 2.8% Smooth Mix Neyman-Pearson 92.0% 88.9% 84.4% 78.6% 69.8% 60.7% 49.9% 40.2% 31.5% 22.2% 12.2% 4.9% (Jeong et al., 2021) DSRS 92.2% 89.0% 84.8% 79.7% 72.0% 63.9% 54.4% 46.2% 37.6% 29.2% 18.5% 7.2% K. Extended Related Work Discussion In this appendix, we discuss related work from two branches: training approaches for randomized smoothing, and datadependent randomized smoothing. Both branches aim to improve the certified robustness of randomized smoothing. For tighter certification approaches leveraging additional information, a detailed discussion can be found in Appendix L.1. For more related work we refer the interested readers to recent surveys and books (Liu et al., 2021; Li et al., 2020b; Albarghouthi, 2021). To improve the certified robustness of randomized smoothing, efforts have been made on both the training (Li et al., 2019a; Zhai et al., 2020; Salman et al., 2019; Jeong & Shin, 2020) and the certification sides (L ecuyer et al., 2019; Cohen et al., 2019; Li et al., 2019a; Yang et al., 2020; Zhang et al., 2020a; Yang et al., 2022). On the training side, data augmentation (Cohen et al., 2019), regularization (Li et al., 2019a; Zhai et al., 2020; Jeong & Shin, 2020), and adversarial training (Salman et al., 2019) help to train stable base models under noise corruptions so that higher certified robustness for a smoothed classifier can be achieved. In this work, we focus on certification, and these training approaches can be used in conjunction with ours to provide higher certified robustness. 
Another potential way to improve the certified robustness of randomized smoothing is to dynamically change the smoothing distribution P based on the input toward maximizing the certified radius (Alfarra et al., 2020; Eiras et al., 2021; Schuchardt et al., 2022). In this scenario, the certification needs to take into account that the attacker may adaptively mislead the pipeline to choose a bad smoothing distri- Double Sampling Randomized Smoothing Table 7. CIFAR-10: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 56.1% 35.7% 13.4% (Cohen et al., 2019) DSRS 57.4% 39.4% 17.3% Consistency Neyman-Pearson 61.8% 50.9% 34.7% (Jeong & Shin, 2020) DSRS 62.5% 52.5% 37.8% Smooth Mix Neyman-Pearson 63.9% 53.3% 38.4% (Jeong et al., 2021) DSRS 64.7% 55.5% 41.1% Gaussian Aug. Neyman-Pearson 53.7% 41.3% 27.7% 17.1% 9.1% 2.8% (Cohen et al., 2019) DSRS 54.1% 42.7% 30.6% 20.3% 12.6% 4.0% Consistency Neyman-Pearson 49.2% 43.9% 38.0% 32.3% 23.8% 18.1% (Jeong & Shin, 2020) DSRS 49.6% 44.1% 38.7% 35.2% 28.1% 19.7% Smooth Mix Neyman-Pearson 53.2% 47.6% 40.2% 34.2% 26.7% 19.6% (Jeong et al., 2021) DSRS 53.3% 48.5% 42.1% 35.9% 29.4% 21.7% Gaussian Aug. Neyman-Pearson 40.2% 32.6% 24.7% 18.9% 14.9% 10.2% 7.5% 4.1% 2.0% 0.7% 0.1% 0.1% (Cohen et al., 2019) DSRS 40.3% 33.1% 25.9% 20.6% 16.1% 12.5% 8.4% 6.4% 3.5% 1.8% 0.7% 0.1% Consistency Neyman-Pearson 37.2% 32.6% 29.6% 25.9% 22.5% 19.0% 16.4% 13.8% 11.2% 9.0% 7.1% 5.1% (Jeong & Shin, 2020) DSRS 37.1% 32.5% 29.8% 27.1% 23.5% 20.9% 17.6% 15.3% 13.1% 10.9% 8.9% 6.5% Smooth Mix Neyman-Pearson 43.2% 39.5% 33.9% 29.1% 24.0% 20.4% 17.0% 13.9% 10.3% 7.8% 4.9% 2.3% (Jeong et al., 2021) DSRS 43.2% 39.7% 34.9% 30.0% 25.8% 22.1% 18.7% 16.1% 13.2% 10.2% 7.1% 3.9% Table 8. Image Net: Certified robust accuracy for models smoothed with different variance σ certified with different certification approaches. Variance Training Certification Certified Accuracy under Radius r Method Approach 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Gaussian Aug. Neyman-Pearson 57.1% 41.6% 17.4% (Cohen et al., 2019) DSRS 58.4% 47.9% 24.4% Consistency Neyman-Pearson 59.8% 49.8% 36.9% (Jeong & Shin, 2020) DSRS 60.4% 52.4% 40.4% Smooth Mix Neyman-Pearson 46.7% 38.2% 28.2% (Jeong et al., 2021) DSRS 47.4% 40.0% 29.8% Gaussian Aug. Neyman-Pearson 53.6% 48.3% 43.3% 36.8% 31.4% 24.5% (Cohen et al., 2019) DSRS 53.7% 49.9% 44.7% 39.3% 34.8% 27.4% Consistency Neyman-Pearson 53.6% 48.3% 43.3% 36.8% 31.4% 24.5% (Jeong & Shin, 2020) DSRS 53.7% 49.9% 44.7% 39.3% 34.8% 27.4% Smooth Mix Neyman-Pearson 38.7% 33.5% 28.8% 24.6% 18.1% 13.5% (Jeong et al., 2021) DSRS 39.1% 34.9% 30.3% 26.8% 21.6% 15.6% Gaussian Aug. 
Neyman-Pearson 42.5% 37.2% 33.0% 29.2% 24.8% 21.4% 17.6% 13.7% 10.2% 7.8% 5.7% 3.6% (Cohen et al., 2019) DSRS 42.5% 38.1% 34.4% 30.2% 27.0% 23.3% 21.3% 18.7% 14.2% 11.0% 9.0% 5.7% Consistency Neyman-Pearson 40.0% 38.3% 34.2% 31.8% 28.7% 25.6% 22.1% 19.1% 16.1% 14.0% 10.6% 8.5% (Jeong & Shin, 2020) DSRS 40.2% 38.5% 35.4% 32.6% 30.7% 28.1% 25.4% 22.6% 19.6% 17.4% 14.1% 10.4% Smooth Mix Neyman-Pearson 29.8% 25.6% 21.8% 19.2% 17.0% 14.2% 11.8% 10.1% 8.9% 7.2% 6.0% 4.6% (Jeong et al., 2021) DSRS 29.7% 26.2% 23.0% 20.6% 18.0% 15.7% 14.0% 12.1% 9.9% 8.4% 7.2% 5.3% Double Sampling Randomized Smoothing 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy MNIST, Smooth Mix Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy CIFAR-10, Smooth Mix Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Gaussian Aug. Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Consistency Neyman-Pearson Certification DSRS Certification 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Radius r Certified Accuracy Image Net, Smooth Mix Neyman-Pearson Certification DSRS Certification Figure 8. Certified robust accuracy - radius curve corresponding to Table 2. Table 9. Average certified radius (ACR) statistics. The smoothing variance hyperparameter σ = 1.00. The evaluation protocol is the same as that in the main text. Training Method Certification MNIST CIFAR-10 Image Net Gaussian Augmentation Neyman-Pearson 1.550 0.447 0.677 DSRS 1.629 0.469 0.750 Relative Improvement +5.10% +4.92% +10.78% Consistency Neyman-Pearson 1.645 0.636 0.796 DSRS 1.730 0.659 0.862 Relative Improvement +5.17% +3.62% +8.29% Smooth Mix Neyman-Pearson 1.716 0.676 0.490 DSRS 1.806 0.712 0.525 Relative Improvement +5.24% +5.33% +7.14% bution. Therefore, additional costs such as memorizing training data need to be paid to defend such adaptive robust- ness vulnerabilities. A recent work (S uken ık et al., 2021) shows that input-dependent randomized smoothing may not bring substantial improvements in certified robustness. In DSRS, we select the additional smoothing distribution Q dynamically based on the input, which may appear like input-dependent randomized smoothing. However, we select such distribution Q only for certification purposes, and the original distribution P which is used to construct the smoothed classifier remains static. Thus, we do not need to consider the existence of adaptive attackers. Indeed, for any smoothing distribution Q, with DSRS, we generate valid robustness certification for the static smoothed classifier e F P. Double Sampling Randomized Smoothing 0 50 100 150 200 ℓ2 Length of Perturbation Prob. of Predicting True Class (a) MNIST model. 0 50 100 150 200 ℓ2 Length of Perturbation Prob. of Predicting True Class (b) CIFAR-10 model. 0 200 400 600 800 1000 1200 ℓ2 Length of Perturbation Prob. of Predicting True Class (c) Image Net model. Figure 9. 
Probability of true-prediction w.r.t. ℓ2 length of perturbations for base classifiers from Gaussian augmentation training studied in Appendix J.4. Figures are plotted following the same protocol as in Appendix J.1. Table 10. Comparison of DSRS certified robust accuracy with different types of additional smoothing distribution Q and Neyman-Pearsonbased certification. Detail experiment settings are in Appendix J.4. Dataset Certification Certified Accuracy under Radius r 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 Neyman-Pearson 95.2% 91.9% 87.7% 80.6% 71.2% 57.6% 41.0% 25.5% 13.6% 6.2% 2.1% 0.9% DSRS-trunc 95.1% 91.8% 87.9% 81.3% 72.9% 60.2% 46.1% 30.9% 17.4% 9.4% 3.6% 1.2% DSRS-var 95.1% 91.8% 88.2% 81.5% 73.6% 61.6% 48.4% 34.1% 21.0% 10.6% 4.4% 1.2% Neyman-Pearson 40.2% 32.6% 24.7% 18.9% 14.9% 10.2% 7.5% 4.1% 2.0% 0.7% 0.1% 0.1% DSRS-trunc 40.3% 32.9% 25.5% 20.1% 15.7% 11.5% 8.0% 5.5% 2.7% 1.5% 0.6% 0.1% DSRS-var 40.3% 33.1% 25.9% 20.6% 16.1% 12.5% 8.4% 6.4% 3.5% 1.8% 0.7% 0.1% Neyman-Pearson 42.5% 37.2% 33.0% 29.2% 24.8% 21.4% 17.6% 13.7% 10.2% 7.8% 5.7% 3.6% DSRS-trunc 42.5% 38.1% 34.4% 30.2% 27.0% 23.3% 21.3% 18.7% 14.2% 11.0% 9.0% 5.7% DSRS-var 42.9% 38.5% 35.0% 31.0% 26.5% 23.2% 21.0% 18.3% 14.6% 10.5% 8.2% 5.3% L. Discussions on Generalizing DSRS Framework In this appendix, we first introduce prior work that leverages additional information for certification in randomized smoothing, then generalize our DSRS as a more general framework to theoretically compare with the related work and highlight future directions. L.1. Existing Work on Leveraging Additional Information for Certification We discuss all known work that leverages more information to achieve tighter robustness certification for randomized smoothing prior to this paper to the best of our knowledge. Additional Black-Box Information. Our DSRS leverages additional information to tighten the certification for randomized smoothing. We leverage the information from an additional smoothing distribution. This information can be obtained from the base classifier that we only have query access on the predicted label. We call the information from this limited query access black-box information. The certification approaches that only require black-box information can be applied to any classification model regardless of the model structure. Thus, they are usually more general and more scalable. The classical Neyman-Pearson certification only requires black-box information. Besides our DSRS, the only other form of additional blackbox information that is leveraged is the higher-order information (Levine et al., 2020; Mohapatra et al., 2020). Formally, our additional black-box information has the form Pr ϵ Q[F0(x + ϵ) = y]. (158) In contrast, the higher-order information, especially secondorder information used by Levine et al. (2020); Mohapatra et al. (2020) has the form f P 0 (x0)y0 p = Pr ϵ P[F0(x0 + ϵ) = y0] p (159) which is also shown to be estimable given the black-box query access. However, the higher-order information has several limitations: (1) It is hard to leverage the information beyond the second order. Therefore, only second-order information is used in existing certification approaches yet. However, to achieve optimal tightness, one needs to leverage infinite orders of information, which brings an infinite number of constraints and is thus intractable. In contrast, we show that extra information from only one additional distribution suffices to derive a strongly tight certification. 
(2) In practice, the second-order information shows marginal improvements in the widely used ℓ2 and ℓ certification settings on real-world datasets (Levine et al., 2020; Mohapatra et al., 2020) and even such improvements require a large number of samples (usually in million order instead of ours 105). Double Sampling Randomized Smoothing (a) (Mohapatra et al., 2020, Figure 2(b)): Higher-order randomized smoothing on CIFAR-10. (b) (Mohapatra et al., 2020, Figure 4): Higher-order randomized smoothing on Image Net for models trained with different smoothing variances. Figure 10. Higher-order randomized smoothing certification (solid curves) compared with standard Neyman-Pearson-based certification (dotted curves). Dvijotham et al. (2020) also propose to use additional information to tighten the robustness certification for randomized smoothing ( full-information setting). They formalize the tightest possible certification and compare it with Neyman Pearson-based certifiation ( information-limited setting), but in practice, they do not try to leverage information from distributions other than P. Constraining Model Structure. If we discard the blackbox information constraint, we can obtain tighter robustness certification than classical Neyman-Pearson. For example, knowing the model structure can benefit the certification. Lee et al. (2019) show that when the base classifier is a decision tree, we can use dynamic programming to derive a strongly tight certification against ℓ0-bounded perturbations. Awasthi et al. (2020) show that, if the base classifier first performs a known low-rank projection, then works on the low-rank projected space, for the corresponding smoothed classifier, we can have a tighter certification on both ℓ2 and ℓ settings. However, it is challenging to find a projection such that the model preserves satisfactory performance while the projection rank is low. Indeed, the approach is evaluated on DCT basis to show the improvement on ℓ certification, and there exists a gap between the actual achieved certified robustness and the state of the art. We do not compare with these approaches since they impose additional assumptions on the base classifiers so their applicable scenarios are limited, and currently, the state-of-the-art base classifier does not satisfy their imposed constraints under ℓ2 and ℓ certification settings. Recently, for ℓ1 certification, a deterministic and improved smoothing approach (a type of non-additive smoothing mechanism) is proposed to handle the case where input images are constrained in space {0, 1/255, , 244/255, 1}d (Levine & Feizi, 2021). This could be viewed as constraining the attack space from another aspect and implies that certified robustness can be improved by better smoothing mechanisms, which is orthogonal to our work which focuses on certification for existing smoothing mechanisms. Confidence Smoothing. A group of certification approaches assumes that the base classifier outputs normalized confidence on the given input, and the smoothed classifier predicts the class with the highest expectation of normalized confidence under noised input. This assumption can be viewed as a special type of Constraining Model Structure . Under this assumption, we can query and approximate the quantile of the confidence under noised input: F0(x0 + ϵ) where ϵ P. With this information, we can leverage the Neyman-Pearson lemma in a more delicate way to provide a tighter (higher) lower bound of the expected confidence under perturbation, i.e., Eϵ PF0(x + δ + ϵ). 
These certification approaches provide tighter certification than classical Neyman-Pearson for the smoothed classifier that predicts the class with the highest expected normalized confidence. They are also useful for regression tasks such as object detection in computer vision, as shown in (Chiang et al., 2020). However, for the classification task and for utilizable base classifiers (i.e., with benign accuracy > 50% under noise), if we simply set the predicted class to have 100% confidence, we only increase the certified radius of the classifier, and the certification of this one-hot base classifier only requires classical Neyman-Pearson. Thus, these certification approaches, e.g., (Kumar et al., 2020a), may not achieve higher certified robustness on the classification task than Neyman-Pearson, and therefore we do not compare with them.

L.2. General Framework

Focusing on certification with additional black-box information, we generalize DSRS to allow more general additional information.

Definition 5 (General Additional Black-Box Information). For a given base classifier F0, suppose the true label at x0 is y0. For certifying robustness at x0, the general additional black-box information has the form

$\int_{\mathbb{R}^d} f_1(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_1,\quad \ldots,\quad \int_{\mathbb{R}^d} f_i(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_i,\quad \ldots,\quad \int_{\mathbb{R}^d} f_N(x)\,\mathbb{I}[F_0(x) = y_0]\,\mathrm{d}x = c_N, \qquad (160)$

where the f_i and c_i are pre-determined, each f_i is integrable on ℝ^d, and c_i ∈ ℝ for 1 ≤ i ≤ N.

Remark. Obtaining the information in Eqn. (160) requires only black-box access to whether F0(x) equals y0. We define the general additional black-box information in this way because: (1) information from finitely many points is useless, since the smoothed classifier puts zero probability mass on any finite point set, so useful information must be integral-based; (2) to provide a lower bound of the smoothed confidence $f^{\mathcal{P}}_0(x_0 + \delta)_{y_0}$, we only need to know whether F0(x) equals y0 in the region of interest.

Examples. (1) In our DSRS, the additional information $\mathbb{E}_{\epsilon\sim Q}[f(\epsilon)] = Q_A$ instantiates Definition 5 by setting N = 1, f_1(·) = q(· − x0), and c_1 = QA. (2) In (Mohapatra et al., 2020; Levine et al., 2020), the second-order information $\nabla f^{\mathcal{P}}_0(x_0)$ instantiates Definition 5 by setting N = d, $f_i(x) = \nabla p(x - x_0)_i$, and $c_i = \nabla f^{\mathcal{P}}_0(x_0)_i$, according to Theorem 1 in (Mohapatra et al., 2020). We remark that, due to the sampling difficulty, instead of using the whole vector $\nabla f^{\mathcal{P}}_0(x_0)$ as the information, second-order smoothing (Mohapatra et al., 2020; Levine et al., 2020) uses its p-norm in practice. However, using the full information only gives tighter certification, so we consider this more idealized variant.

Then, we can extend the constrained optimization problem (C) in Section 5.1 to (C_ext) to accommodate the general information:

$\begin{aligned} \min_{f}\quad & \mathbb{E}_{\epsilon\sim P}[f(\delta + \epsilon)] \qquad (161)\\ \text{s.t.}\quad & \mathbb{E}_{\epsilon\sim P}[f(\epsilon)] = P_A,\\ & \int_{\mathbb{R}^d} f_1(\epsilon)\, f(\epsilon)\,\mathrm{d}\epsilon = c_1,\ \ \ldots,\ \ \int_{\mathbb{R}^d} f_N(\epsilon)\, f(\epsilon)\,\mathrm{d}\epsilon = c_N,\\ & 0 \le f(\epsilon) \le 1 \quad \forall \epsilon \in \mathbb{R}^d. \end{aligned}$

Similarly, by strong duality (Theorem 3), to solve the certification problem

$\forall \delta \text{ s.t. } \|\delta\|_p \le r:\quad \mathcal{C}^{\mathrm{ext}}_{\delta}(P_A, c_1, \ldots, c_N) > 0.5, \qquad (162)$

we only need to solve the dual problem (D_ext):

$\begin{aligned} \max_{\eta,\lambda_1,\ldots,\lambda_N \in \mathbb{R}}\quad & \Pr_{\epsilon\sim P}\Big[p(\epsilon) < \eta\, p(\epsilon + \delta) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon + \delta)\Big]\\ \text{s.t.}\quad & \Pr_{\epsilon\sim P}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] = P_A,\\ & \int \mathbb{I}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] f_1(\epsilon)\,\mathrm{d}\epsilon = c_1,\ \ \ldots,\ \ \int \mathbb{I}\Big[p(\epsilon - \delta) < \eta\, p(\epsilon) + \sum_{i=1}^{N} \lambda_i f_i(\epsilon)\Big] f_N(\epsilon)\,\mathrm{d}\epsilon = c_N. \end{aligned}$

We remark that this generalization shares a similar spirit with one type of generalization of the Neyman-Pearson lemma (Chernoff & Scheffé, 1952; Mohapatra et al., 2020). Following the same motivation, Dvijotham et al.
(2020) try to generalize the certification by adding more constraints in their full-information setting . However, it is unclear whether their constraints in f-divergences form have the same expressive power as ours in practice (i.e., the practicality of theoretically tight Hockey-Stick divergence). A study of these different types of generalization would be our future work. More importantly, we believe that the pipeline of DSRS can be adapted to solve this generalized dual problem. We hope that this generalization and the corresponding DSRS could enable the exploration and exploitation of more types of additional information for tightening the robustness certification of randomized smoothing. Implications. The generalized DSRS framework allows us to explicitly compare different types of additional information. For example, comparing our additional distribution information and higher-order information, we find that (1) for additional distribution information, from Theorem 1 and Corollary 1, the strong tightness can be acheived for N = C 1 where C is the number of classes; (2) for higher-order information, from (Mohapatra et al., 2020, Asymptotic-Optimality Remark), the strong tightness can be achieved when all orders of information are used, i.e., N . This comparison suggests that our additional information from another smoothing distribution should be more efficient. Another implication is that, from Corollary 1, under multiclass setting, with proper choices of the (C 1) additional smoothing distributions, if we base DSRS on solving dual problem (Dext), the DSRS can achieve strong tightness in multiclass setting. Double Sampling Randomized Smoothing L.3. Limitations and Future Directions Despite the promising theoretical and empirical results of DSRS, DSRS still has some limitations which open an avenue for future work. We list the following future directions: (1) tighter certification from a more ideal additional smoothing distribution Q: there may exist better smoothing distribution Q or better methods to optimize hyperparameters in Q than what we have considered in this work in terms of certifying larger certified radius in practice; (2) better training approach for DSRS: there may be a large space for exploring training approaches that favor DSRS certification since all existing training methods are designed for Neyman-Pearsonbased certification. We believe that advances in this aspect can boost the robustness certification with DSRS to achieve state-of-the-art certified robustness. (3) better additional information: more generally, besides the prediction probability from an additional smoothing distribution, there may exist other useful additional information for certification in randomized smoothing. We hope our generalization of the DSRS framework in this appendix can inspire future work in tighter and more efficient certification for randomized smoothing.