# ANCER: Anisotropic Certification via Sample-wise Volume Maximization

Published in Transactions on Machine Learning Research (08/2022)

Francisco Eiras (eiras@robots.ox.ac.uk), University of Oxford, Five AI Ltd., UK
Motasem Alfarra (motasem.alfarra@kaust.edu.sa), King Abdullah University of Science and Technology (KAUST)
M. Pawan Kumar (pawan@robots.ox.ac.uk), University of Oxford
Philip H. S. Torr (philip.torr@eng.ox.ac.uk), University of Oxford
Puneet K. Dokania (puneet@robots.ox.ac.uk), University of Oxford, Five AI Ltd., UK
Bernard Ghanem (bernard.ghanem@kaust.edu.sa), King Abdullah University of Science and Technology (KAUST)
Adel Bibi (adel.bibi@eng.ox.ac.uk), University of Oxford

Equal contribution of the first two authors; order decided by 3 coin flips.

Reviewed on OpenReview: https://openreview.net/forum?id=7j0GI6tPYi

Abstract

Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic ℓp certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via the ℓp-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e. it cannot reason about other close, potentially large, constant-prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing ℓ1 and ℓ2 certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates: a certificate is superior to another if it certifies a superset region, and each certificate is quantified through the volume of its certified region. We introduce ANCER, a framework for obtaining anisotropic certificates for a given test set sample via volume maximization. We achieve this by generalizing memory-based certification of data-dependent classifiers. Our empirical results demonstrate that ANCER achieves state-of-the-art ℓ1 and ℓ2 certified accuracy on CIFAR-10 and ImageNet in the data-dependent setting, while certifying larger regions in terms of volume, highlighting the benefits of moving away from isotropic analysis. Our code is available in this repository.

1 Introduction

The well-studied fact that Deep Neural Networks (DNNs) are vulnerable to additive imperceptible noise perturbations has led to a growing interest in developing robust classifiers (Goodfellow et al., 2015; Szegedy et al., 2014).

Figure 1: Illustration of the landscape of f^y (blue corresponds to a higher confidence in y, the true label) for a region around an input in a toy, 2-dimensional radially separable dataset. For two dataset examples, (a) and (b) show the boundaries of the optimal ℓ1 isotropic and anisotropic certificates, while (c) and (d) show the boundaries of the optimal ℓ2 isotropic and anisotropic certificates. A thorough discussion of this figure is presented in Section 3.

A recent promising approach to achieve state-of-the-art provable robustness (i.e. a theoretical bound on the output around every input) at the scale of ImageNet (Deng et al., 2009) is randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019).
Given an input x and a network f, randomized smoothing constructs g(x) = 𝔼_{ϵ∼D}[f(x + ϵ)] such that g(x) = g(x + δ) ∀δ ∈ R, where the certification region R is characterized by x, f, and the smoothing distribution D. For instance, Cohen et al. (2019) showed that if D = N(0, σ²I), then R is an ℓ2-ball whose radius is determined by x, f and σ. Since then, there has been significant progress towards the design of the D leading to the largest R for all inputs x. The interplay between R, characterized by ℓ1, ℓ2 and ℓ∞-balls, and a notion of optimal distribution D has been previously studied (Yang et al., 2020).

Despite this progress, current randomized smoothing approaches provide certification regions that are isotropic in nature, limiting them to small, worst-case regions. We provide an intuitive example of this behavior in Figure 1. The isotropic nature of R in prior art is due to the common assumption that the smoothing distribution D is identically distributed (Yang et al., 2020; Kumar et al., 2020; Levine & Feizi, 2021). Moreover, comparisons between various randomized smoothing approaches were limited to methods that produce the same ℓp certificate, with no clear metrics for comparing against other certificates. In this paper, we address both concerns and present new state-of-the-art certified accuracy results on both the CIFAR-10 and ImageNet datasets. Our contributions are threefold.

(i) We provide a general and simpler analysis compared to prior art (Cohen et al., 2019; Yang et al., 2020) that paves the way for the certification of anisotropic regions characterized by any norm, subsuming prior art as special cases. We then specialize our result to regions that, for a positive definite A, are ellipsoids, i.e. ‖Aδ‖₂ ≤ c, c > 0, and generalized cross-polytopes, i.e. ‖Aδ‖₁ ≤ c, generalizing both ℓ2 (Cohen et al., 2019) and ℓ1 (Lecuyer et al., 2019; Yang et al., 2020) certification (Section 4). (ii) We introduce a new evaluation framework to compare methods that certify general (isotropic or anisotropic) regions. We compare two general certificates by defining that a method certifying R1 is superior to another certifying R2 if R1 is a strict superset of R2. Further, we define a standalone quantitative metric as the volume of the certified region, and specialize it for the cases of ellipsoids and generalized cross-polytopes (Section 5). (iii) We propose ANCER, an anisotropic certification method that performs sample-wise (i.e. per sample in the test set) region volume maximization (Section 6), generalizing the data-dependent, memory-based solution from Alfarra et al. (2022). Through experiments on CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), we show that restricting ANCER's certification region to ℓ1 and ℓ2-balls outperforms state-of-the-art ℓ1 and ℓ2 results from previous works (Yang et al., 2020; Alfarra et al., 2022). Further, we show that the volumes of the certified regions are significantly larger than those of all existing methods, thus setting a new state-of-the-art in certified accuracy. We highlight that while we effectively achieve state-of-the-art performance, it comes at a high cost given the data-dependency requirements. A discussion of the limitations of the solution is presented in Section 6.

Notation. We consider a base classifier f : ℝⁿ → P(K), where P(K) is the probability simplex over K classes, i.e. fⁱ ≥ 0 for i ∈ {1, . . . , K} and 1⊤f = 1.
Further, we use (x, y) to denote a sample input x and its corresponding true label y drawn from a test set Dt, and f^y to denote the output of f at the correct class. We use ℓp to denote the typically defined p-norm (p ≥ 1), and ℓ^A_p or ‖·‖_{A,p} for p ∈ {1, 2} to denote a composite norm defined with respect to a positive definite matrix A as ‖v‖_{A,p} = ‖A^{−1/p}v‖_p.

2 Related Work

Verified Defenses. Since the discovery that DNNs are vulnerable to input perturbations (Goodfellow et al., 2015; Szegedy et al., 2014), a range of methods have been proposed to build classifiers that are verifiably robust (Huang et al., 2017; Gowal et al., 2019; Bunel et al., 2018; Salman et al., 2019b). Despite this progress, these methods do not yet scale to the networks the community is interested in certifying (Tjeng et al., 2019; Weng et al., 2018).

Randomized Smoothing. The first works on randomized smoothing used Laplacian (Lecuyer et al., 2019; Li et al., 2019) and Gaussian (Cohen et al., 2019) distributions to obtain ℓ1 and ℓ2-ball certificates, respectively. Several subsequent works improved the performance of smooth classifiers by training the base classifier using adversarial augmentation (Salman et al., 2019a), regularization (Zhai et al., 2019), or general adjustments to training routines (Jeong & Shin, 2020). Recent work derived ℓp-norm certificates for other isotropic smoothing distributions (Yang et al., 2020; Levine & Feizi, 2020; Zhang et al., 2019). Concurrently, Dvijotham et al. (2020) developed a framework to handle arbitrary smoothing measures in any ℓp-norm; however, the certification process requires significant hyperparameter tuning. Similarly, Mohapatra et al. (2020) introduced larger certificates that require higher-order information, yet do not provide a closed-form solution. This was followed by a complementary data-dependent smoothing approach, where the parameters of the smoothing distribution were optimized per test set sample to maximize the certified radius at an individual input (Alfarra et al., 2022). All prior works considered smoothing with isotropic distributions and hence certified isotropic ℓp-ball regions. In this paper, we extend randomized smoothing to certify anisotropic regions, by pairing it with a generalization of the data-dependent framework (Alfarra et al., 2022) to maximize the certified region at each input point.

3 Motivating Anisotropic Certificates

Certification approaches aim to find the safe region R, where arg maxᵢ fⁱ(x) = arg maxᵢ fⁱ(x + δ) ∀δ ∈ R. Recent randomized smoothing techniques perform this certification by explicitly optimizing the isotropic ℓp certified region around each input (Alfarra et al., 2022), obtaining state-of-the-art performance as a result. Despite this ℓp optimality, we note that any ℓp-norm certificate is worst-case from the perspective of that norm, as it avoids adversary regions by limiting its certificate to the ℓp-closest adversary. This means that it can only enjoy a radius that is at most equal to the distance to the closest decision boundary. However, decision boundaries of general classifiers are complex, non-linear, and non-radially distributed with respect to a generic input sample (Karimi et al., 2019). This is evidenced by the fact that, within a reasonably small ℓp-ball around an input, there are often only a small set of adversary directions (Tramèr et al., 2017; 2018) (e.g. see the decision boundaries in Figure 1).
As such, while ℓp-norm certificates are useful for reasoning about worst-case performance and are simple to obtain given previous works (Cohen et al., 2019; Yang et al., 2020; Lee et al., 2019), they are otherwise uninformative in terms of the shape of decision boundaries, i.e. which regions around the input are safe. To visualize these concepts, we illustrate the decision boundaries of a base classifier f trained on a toy 2-dimensional, radially separable (with respect to the origin) binary classification dataset, and consider two different input test samples (see Figure 1). We compare the optimal isotropic and anisotropic certified regions of different shapes at these points. In Figures 1a and 1b, we compare an isotropic cross-polytope (of the form ‖δ‖₁ ≤ r) with an anisotropic generalized cross-polytope (of the form ‖Aδ‖₁ ≤ r), while in Figures 1c and 1d we compare an isotropic ℓ2-ball (of the form ‖δ‖₂ ≤ r) with an anisotropic ellipsoid (of the form ‖Aδ‖₂ ≤ r). Notice that in Figures 1a and 1c, due to the curvature of the classification boundary (shown in white), the optimal certification region is isotropic in nature, which is evidenced by the similarity of the optimal isotropic and anisotropic certificates. On the other hand, in Figures 1b and 1d, the location of the decision boundary allows the anisotropic certified regions to be considerably larger than their isotropic counterparts, as they are not as constrained by the closest decision boundary, i.e. the worst-case performance. These differences are further accentuated in higher dimensions, and we study them for a single CIFAR-10 test set sample in Appendix A.1.

Figure 2: Visualization of a CIFAR-10 image x and an example x + δ of an imperceptible change that is not inside the optimal isotropic certified region, but is covered by the anisotropic certificate.

As shown, anisotropic certification reasons more closely about the shape of the decision boundaries, allowing for further insights into constant-prediction (safe) directions. In Figure 2, we present a series of test set images x, as well as practically indistinguishable images x + δ which are not inside the optimal certified isotropic ℓ2-balls for each input sample, yet are within the anisotropic certified regions. This showcases the merits of using anisotropic certification for characterizing larger safe regions.

4 Anisotropic Certification

One of the main obstacles to enabling anisotropic certification is the complexity of the analysis required. To alleviate this, we follow a Lipschitz argument first observed by Salman et al. (2019a) and Jordan & Dimakis (2020) and propose a simple and general certification analysis. We start with the following two observations. All proofs are in Appendix B.

Proposition 1. Consider a differentiable function g : ℝⁿ → ℝ. If sup_x ‖∇g(x)‖_* ≤ L, where ‖·‖_* is the dual norm of ‖·‖ (i.e. ‖z‖_* = max_x z⊤x s.t. ‖x‖ ≤ 1), then g is L-Lipschitz under the norm ‖·‖, that is |g(x) − g(y)| ≤ L‖x − y‖.

Given the previous proposition, we formalize certification as follows:

Theorem 1. Let g : ℝⁿ → ℝᴷ, where gⁱ is L-Lipschitz continuous under norm ‖·‖ ∀i ∈ {1, . . . , K}, and let c_A = arg maxᵢ gⁱ(x). Then arg maxᵢ gⁱ(x + δ) = c_A for all δ satisfying:

    ‖δ‖ < (1/(2L)) (g^{c_A}(x) − max_{c≠c_A} g^c(x)).

Theorem 1 provides a ‖·‖-norm robustness certificate for any L-Lipschitz classifier g. The certificate is only informative when one can attain a tight non-trivial estimate of L, ideally sup_x ‖∇g(x)‖_*, which is generally difficult when g is an arbitrary neural network.
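To make Theorem 1 concrete, the following minimal sketch (our own, not from the paper's code base) computes the certified radius implied by the theorem from a classifier's scores and a known Lipschitz constant:

import numpy as np

# Theorem 1: for an L-Lipschitz classifier g, any perturbation delta with
# ||delta|| < (g^{c_A}(x) - max_{c != c_A} g^c(x)) / (2L) cannot flip the top-1 class.
def certified_radius(scores: np.ndarray, lipschitz_L: float) -> float:
    top2 = np.sort(scores)[-2:]       # [runner-up score, top score]
    gap = top2[1] - top2[0]           # g^{c_A}(x) - max_{c != c_A} g^c(x)
    return gap / (2.0 * lipschitz_L)

# Example: a 1-Lipschitz classifier with scores (0.9, 0.05, 0.05) certifies
# every delta with ||delta|| < (0.9 - 0.05) / 2 = 0.425.
print(certified_radius(np.array([0.9, 0.05, 0.05]), lipschitz_L=1.0))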
Framework Recipe. In light of Theorem 1, randomized smoothing can be viewed as an instance of Theorem 1 with the favorable property that the constructed smooth classifier g enjoys an analytical form for L = sup_x ‖∇g(x)‖_* by design. As such, to obtain an informative certificate, one must, for an arbitrary choice of smoothing distribution, compute the analytic Lipschitz constant L of g under ‖·‖. While there can exist a notion of optimal smoothing distribution for a given choice of certificate, as in part addressed earlier for the isotropic ℓ1, ℓ2 and ℓ∞ certificates (Yang et al., 2020), this is not the focus of this paper. The choice of the smoothing distribution in later sections is inspired by previous work for the purpose of granting anisotropic certificates. This recipe complements randomized smoothing works based on the Neyman-Pearson lemma (Cohen et al., 2019) or the Level-Set and Differential Method (Yang et al., 2020). We will deploy this framework recipe to show two specializations for anisotropic certification, namely ellipsoids (Section 4.1) and generalized cross-polytopes (Section 4.2).¹

¹ Our analysis also grants a certificate for a mixture of Gaussians smoothing distribution (see Appendix B.1).

4.1 Certifying Ellipsoids

In this section, we consider certification under the ℓ^Σ_2 norm, i.e. ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ), which has the dual norm ‖δ‖_{Σ⁻¹,2}. Note that both ‖δ‖_{Σ,2} ≤ r and ‖δ‖_{Σ⁻¹,2} ≤ r define ellipsoids. While the following results hold for any positive definite Σ, we assume for efficiency reasons that Σ is diagonal throughout. First, we consider the anisotropic Gaussian smoothing distribution N(0, Σ), with the smooth classifier defined as g_Σ(x) = 𝔼_{ϵ∼N(0,Σ)}[f(x + ϵ)]. Considering the classifier Φ⁻¹(g_Σ(x)), where Φ is the standard Gaussian CDF, and following Theorem 1 to grant an ℓ^Σ_2 certificate for Φ⁻¹(g_Σ(x)), we derive the Lipschitz constant L under ‖·‖_{Σ⁻¹,2} in the following proposition.

Proposition 2. Φ⁻¹(g_Σ(x)) is 1-Lipschitz (i.e. L = 1) under the ‖·‖_{Σ⁻¹,2} norm.

Since Φ⁻¹ is a strictly increasing function, by combining Proposition 2 with Theorem 1, we have:

Corollary 1. Let c_A = arg maxᵢ gⁱ_Σ(x); then arg maxᵢ gⁱ_Σ(x + δ) = c_A for all δ satisfying:

    ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ) < ½ (Φ⁻¹(g^{c_A}_Σ(x)) − Φ⁻¹(max_{c≠c_A} g^c_Σ(x))).

Corollary 1 recovers the ℓ2 certification of Zhai et al. (2019) as a special case when Σ = σ²I.²

² A similar result was derived in the appendices of Fischer et al. (2020) and Li et al. (2020) with a more involved analysis extending the Neyman-Pearson lemma.

4.2 Certifying Generalized Cross-Polytopes

Here we consider certification under the ℓ^Λ_1 norm, which defines a generalized cross-polytope, i.e. the set {δ : ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ ≤ r}, as opposed to the ℓ1-bounded set that defines a cross-polytope, i.e. {δ : ‖δ‖₁ ≤ r}. As in the ellipsoid case, and while the following results hold for any positive definite Λ, for the sake of efficiency we assume Λ to be diagonal throughout. For generalized cross-polytope certification, we consider an anisotropic Uniform smoothing distribution, which defines the smooth classifier g_Λ(x) = 𝔼_{ϵ∼U[−1,1]ⁿ}[f(x + Λϵ)]. Following Theorem 1, and to certify under the ℓ^Λ_1 norm, we compute the Lipschitz constant of g_Λ under the ‖Λ·‖∞ norm, which is the dual norm of ‖·‖_{Λ,1} (see Appendix B), in the next proposition.

Proposition 3. The classifier g_Λ is 1/2-Lipschitz (i.e. L = 1/2) under the ‖Λ·‖∞ norm.

Similar to Corollary 1, by combining Proposition 3 with Theorem 1, we have:

Corollary 2. Let c_A = arg maxᵢ gⁱ_Λ(x); then arg maxᵢ gⁱ_Λ(x + δ) = c_A for all δ satisfying:

    ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ < g^{c_A}_Λ(x) − max_{c≠c_A} g^c_Λ(x).

Corollary 2 recovers the ℓ1 certification of Yang et al. (2020) as a special case when Λ = λI.
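As a worked illustration of Corollaries 1 and 2, the sketch below (our own; the top-two smoothed scores pA and pB are assumed to be already estimated, in practice via the Monte Carlo bounds of Cohen et al. (2019)) computes the right-hand sides of both certificates:

from scipy.stats import norm

def ellipsoid_gap(pA: float, pB: float) -> float:
    # Corollary 1: certifies ||Sigma^{-1/2} delta||_2 < 0.5 * (Phi^{-1}(pA) - Phi^{-1}(pB))
    return 0.5 * (norm.ppf(pA) - norm.ppf(pB))

def cross_polytope_gap(pA: float, pB: float) -> float:
    # Corollary 2: certifies ||Lambda^{-1} delta||_1 < pA - pB
    return pA - pB

print(ellipsoid_gap(0.9, 0.05), cross_polytope_gap(0.9, 0.05))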
5 Evaluating Anisotropic Certificates

With the anisotropic certification framework presented in the previous section, a question arises: given two general (isotropic or anisotropic) certification regions R1 and R2, how can one effectively compare them? We propose the following definition to address this issue.

Definition 1. For a given input point x, consider two certification regions R1 and R2 obtained for two classifiers f1 and f2, i.e. A1 = {δ : arg max_c f^c_1(x) = arg max_c f^c_1(x + δ), δ ∈ R1} and A2 = {δ : arg max_c f^c_2(x) = arg max_c f^c_2(x + δ), δ ∈ R2}, where arg max_c f^c_1(x) = arg max_c f^c_2(x). We say A1 is a "superior certificate" to A2 (i.e. A1 ⊵ A2) if and only if A1 ⊇ A2.

This definition is a natural extension of the radius-based comparison of ℓp-ball certificates, providing a basis for evaluating anisotropic certification. To compare an anisotropic to an isotropic region of certification, it is not immediately clear how to (i) check that the anisotropic region is a superset of the isotropic region, and (ii) if it is a superset, quantify the improvement of the anisotropic region over its isotropic counterpart. In Sections 5.1 and 5.2, we tackle these issues for the particular cases of ellipsoid and generalized cross-polytope certificates.

5.1 Evaluating Ellipsoid Certificates

Comparing ℓ2 Balls to ℓ^Σ_2 Ellipsoids (Specialization of Definition 1). Recall that if Σ = σ²I, our ellipsoid certification in Corollary 1 recovers as a special case the isotropic ℓ2-ball certification of Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019). Consider the certified regions R1 = {δ : ‖δ‖₂ ≤ σr₁} and R2 = {δ : ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ) ≤ r₂} for given r₁, r₂ > 0. Since we take Σ = diag({σᵢ²}ⁿᵢ₌₁), the maximum enclosed ℓ2-ball of the ellipsoid R2 is given by the set R3 = {δ : ‖δ‖₂ ≤ minᵢ σᵢ r₂}, and thus R3 ⊆ R2. Therefore, it suffices that R3 ⊇ R1 (i.e. minᵢ σᵢ r₂ ≥ σr₁) to say that R2 is a superior certificate to the isotropic R1 as per Definition 1.

Quantifying ℓ^Σ_2 Certificates. The aforementioned specialization is only concerned with whether our ellipsoid certified region R2 is superior to the isotropic ℓ2-ball, without quantifying the improvement. A natural solution is to directly compare the volumes of the certified regions. Since the volume of the ellipsoid R2 is V(R2) = r₂ⁿ π^{n/2}/Γ(n/2 + 1) ∏ⁿᵢ₌₁ σᵢ (Kendall, 2004), we compare the proxy radius R̃, defined for R2 as R̃ = r₂ ⁿ√(∏ⁿᵢ σᵢ), since a larger R̃ corresponds to a certified region of larger volume. Note that R̃, which is the nth root of the volume up to a constant factor, can be seen as a generalization of the certified radius that recovers it when σᵢ = σ ∀i.

5.2 Evaluating Generalized Cross-Polytope Certificates

Comparing ℓ1 Balls to ℓ^Λ_1 Generalized Cross-Polytopes (Specialization of Definition 1). Consider the certificates S1 = {δ : ‖δ‖₁ ≤ λr₁}, S2 = {δ : ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ ≤ r₂}, and S3 = {δ : ‖δ‖₁ ≤ minᵢ λᵢ r₂}, where we take Λ = diag({λᵢ}ⁿᵢ₌₁). Since S3 ⊆ S2, then as per Definition 1 it suffices that S3 ⊇ S1 (i.e. minᵢ λᵢ r₂ ≥ λr₁) to say that the anisotropic generalized cross-polytope S2 is superior to the isotropic ℓ1-ball S1.
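The following sketch (our own helper names; diagonal Σ = diag(σ²) as above) makes both steps concrete: the superset test via the maximum enclosed ℓ2-ball, and the proxy radius as the geometric mean of the per-coordinate scales:

import numpy as np

def is_superior(sigma: np.ndarray, r2: float, iso_sigma: float, r1: float) -> bool:
    # the maximum enclosed l2-ball of the ellipsoid has radius min_i sigma_i * r2;
    # the ellipsoid is superior (Definition 1) if that ball contains the isotropic one
    return sigma.min() * r2 >= iso_sigma * r1

def proxy_radius(sigma: np.ndarray, r2: float) -> float:
    # nth root of the ellipsoid volume up to a constant: r2 * (prod_i sigma_i)^(1/n)
    return r2 * np.exp(np.log(sigma).mean())   # geometric mean, numerically stable

sigma = np.array([0.3, 0.5, 0.8])
print(is_superior(sigma, r2=1.2, iso_sigma=0.25, r1=1.0), proxy_radius(sigma, r2=1.2))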
Quantifying ℓ^Λ_1 Certificates. Following the approach proposed in the ℓ^Σ_2 case, we quantitatively compare the generalized cross-polytope certification of Corollary 2 to the ℓ1 certificate through the volumes of the two regions. We first present the volume of the generalized cross-polytope.

Proposition 4. V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! ∏ⁿᵢ₌₁ λᵢ.

Following this result, we define the proxy radius for S2 as R̃ = r₂ ⁿ√(∏ⁿᵢ₌₁ λᵢ). As in the ℓ2 case, a larger R̃ corresponds to a certified region of larger volume, and R̃ can be seen as a generalization of the certified radius that recovers it when λᵢ = λ ∀i.

6 ANCER: Sample-wise Volume Maximization for Anisotropic Certification

Given the results of the previous sections, we are now equipped to certify anisotropic regions, in particular ellipsoids and generalized cross-polytopes. As mentioned in Section 4, these regions are generally defined as R = {δ : ‖δ‖_{Θ,p} ≤ r_p} for a given parameter of the smoothing distribution Θ = diag({θᵢ}ⁿᵢ₌₁), an ℓp-norm (p ∈ {1, 2}), and a gap value r_p ∈ ℝ₊. At this point, one could simply take an anisotropic distribution with arbitrarily chosen parameters Θ and certify a trained network at any input point x, in the style of what was done in the previous randomized smoothing literature with isotropic distributions. However, the choice of Θ is more complex in the anisotropic case. A fixed choice of anisotropic Θ could severely underperform the isotropic case: take, for example, the anisotropic distribution of Figure 1d applied to the input of Figure 1c. Instead of taking a fixed Θ, we generalize the framework introduced by Alfarra et al. (2022), where the parameters of the smoothing distribution are optimized per input test point (i.e. in a sample-wise fashion) so as to maximize the resulting certificate. The goal of the optimization in Alfarra et al. (2022) is, at a point x, to maximize the isotropic ℓ2 region described in Section 4.1, i.e. {δ : ‖δ‖₂ ≤ σˣ r_p(x, σˣ)}, where the gap r_p is a function of x and σˣ ∈ ℝ₊. In the isotropic ℓp case, this generalizes to maximizing the region {δ : ‖δ‖_p ≤ θˣ r_p(x, θˣ)}, which can be achieved by maximizing the radius θˣ r_p(x, θˣ) over θˣ ∈ ℝ₊, obtaining r*_iso (Alfarra et al., 2022).

Algorithm 1: ANCER Optimization
Function ANCER(f_θ, x, α, Θ⁰, n, K, κ):
    Initialize: Θ⁰ₓ ← Θ⁰
    for k = 0 . . . K − 1 do
        sample ε̂₁, . . . , ε̂ₙ ∼ D
        ψ(Θᵏₓ) = (1/n) Σⁿᵢ₌₁ f_θ(x + Θᵏₓ ε̂ᵢ)
        E_A(Θᵏₓ) = max_c ψᶜ;  y_A = arg max_c ψᶜ;  E_B(Θᵏₓ) = max_{c≠y_A} ψᶜ
        r_p(x, Θᵏₓ) = E_A − E_B if p = 1;  ½(Φ⁻¹(E_A) − Φ⁻¹(E_B)) if p = 2
        R̃(Θᵏₓ) = r_p(x, Θᵏₓ) (∏ᵈᵢ Θᵏᵢᵢ)^{1/d} + κ r_p(x, Θᵏₓ) minᵢ Θᵏᵢᵢ
        Θᵏ⁺¹ₓ ← Θᵏₓ + α ∇_{Θᵏₓ} R̃(Θᵏₓ)
        Θᵏ⁺¹ₓ ← max(Θᵏ⁺¹ₓ, Θ⁰)   // element-wise maximum (projection step)
    return Θᴷₓ

For the general anisotropic case, we propose ANCER, whose objective is to maximize the volume of the certified region through the proxy radius while satisfying the superset condition with respect to the maximum isotropic ℓ2 radius r*_iso. In the cases of ellipsoids and generalized cross-polytopes, as presented in Sections 5.1 and 5.2 respectively, ANCER's optimization problem can be written as:

    arg max_{Θₓ}  r_p(x, Θₓ) ⁿ√(∏ᵢ θᵢˣ)   s.t.  minᵢ θᵢˣ r_p(x, Θₓ) ≥ r*_iso   (1)

where r_p(x, Θₓ) is the gap value under the anisotropic smoothing distribution. That is,

    r_p(x, Λₓ) = g^{c_A}_Λ(x) − max_{c≠c_A} g^c_Λ(x),    r_p(x, Σₓ) = ½ (Φ⁻¹(g^{c_A}_Σ(x)) − Φ⁻¹(max_{c≠c_A} g^c_Σ(x)))

for ℓ1 and ℓ2, respectively. This is a nonlinear constrained optimization problem that is challenging to solve.
As such, we relax it and solve instead:

    arg max_{Θₓ}  r_p(x, Θₓ) ⁿ√(∏ᵢ θᵢˣ) + κ minᵢ θᵢˣ r_p(x, Θₓ)   s.t.  θᵢˣ ≥ θˣ

given a hyperparameter κ ∈ ℝ₊. While the constraint θᵢˣ ≥ θˣ is not explicitly required to enforce the superset condition over the isotropic case, it proved beneficial from an empirical perspective. To sample from the distribution parameterized by Θₓ (in our case, either a Gaussian or Uniform), we make use of the reparameterization trick, as in Alfarra et al. (2022). The solution of this optimization problem can be found iteratively by performing projected gradient ascent, as detailed in Algorithm 1. A standalone implementation of the ANCER optimization stage is presented in Listing 1 in Appendix C.

Memory-based Anisotropic Certification. While each of the classifiers induced by the parameter Θₓ, i.e. g_{Θₓ}, is robust by definition as presented in Section 4, the certification of the overall data-dependent classifier is not necessarily sound due to the optimization procedure being run for each x. This is a known issue in certifying data-dependent classifiers, and is addressed by Alfarra et al. (2022) through the use of a memory-based procedure. In Appendix D, we present a version of this algorithm adapted to ANCER. All subsequent results are obtained following this procedure.

Limitations of ANCER. Given that ANCER uses a memorization procedure similar to the one presented in Alfarra et al. (2022), it incurs limitations on memory and runtime complexity. Note that in memory-based data-dependent certification there is a single procedure for both certification and inference, in contrast with the fixed-σ setting of Cohen et al. (2019). The main limitations of memory-based certification are outlined in Appendix E of Alfarra et al. (2022). The anisotropic case adds to the complexity of the isotropic framework through the increased runtime of specific functions presented in Appendix D. Certification runtime comparisons are in Section 7.4. The memory-based procedure incurs the same memory cost as the one presented in Alfarra et al. (2022), i.e. it has a memory complexity of O(N), where N is the total number of inferred samples. This is because the memory-based method requires saving the observed instances along with their smoothing parameters. While the linear runtime dependency on memory size might appear daunting for the deployment of such a system, a few factors could mitigate the cost. Firstly, in practice deployed models get updated regularly, and the memory should be reset in those situations. Secondly, there are possible solutions that might attain sublinear runtime for the post-certification stage, such as the application of k-d trees to reduce the space of comparisons and speed up the process. As such, we believe ANCER to be suited to applications in offline scenarios, where improved robustness is desired and inference time is not a critical issue. A further limitation of the memorization procedure has to do with the impact of the order in which inputs are certified on the overall statistics obtained: within a memory-based framework, certifying x2 with x1 in memory can differ from certifying x1 with x2 in memory if their regions intersect. In practice, given the low number of intersections observed with the original certified regions, this effect was almost negligible in the results presented in Section 7.
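As a minimal illustration of the abstention logic (our own conservative simplification, not the procedure of Appendix D), two ellipsoidal certificates cannot intersect when their enclosing ℓ2-balls are disjoint; a tight implementation would instead use exact ellipsoid-overlap tests such as Gilitschenski & Hanebeck (2012):

import numpy as np

def may_intersect(x1, sigma1, r1, x2, sigma2, r2) -> bool:
    # conservative bounding-ball test: the ellipsoid around x_i is contained in
    # an l2-ball of radius r_i * max_j sigma_ij, so disjoint balls imply
    # disjoint certificates; True here triggers abstention in the worst case
    enclosing = r1 * sigma1.max() + r2 * sigma2.max()
    return np.linalg.norm(x1 - x2) <= enclosing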
For fairness of comparison with non-memory-based methods, we report "worst-case" results for ANCER in which we abstain from deciding whenever an intersection of two certified regions occurs.

7 Experiments

We now study the empirical performance of ANCER in obtaining ℓ^Σ_2, ℓ^Λ_1, ℓ2 and ℓ1 certificates on networks trained using randomized smoothing methods from the literature. In this section, we show that ANCER achieves: (i) improved performance on those networks in terms of ℓ2 and ℓ1 certification when compared to certification baselines that smooth using a fixed isotropic σ (Fixed σ) (Cohen et al., 2019; Yang et al., 2020; Salman et al., 2019a; Zhai et al., 2019) or a data-dependent, memory-based isotropic one (Isotropic DD) (Alfarra et al., 2022); and (ii) a significant improvement in terms of the ℓ^Σ_2 and ℓ^Λ_1-norm certified regions obtained by the same methods, compared by computing the proxy radius of the certified regions, thus generally satisfying the conditions of a superior certificate proposed in Definition 1. Note that both data-dependent approaches (Isotropic DD and ANCER) use memory-based procedures; as such, the gains described in this section constitute a trade-off given the limitations of the method described in Section 6.

We follow an evaluation procedure as similar as possible to the ones described in Cohen et al. (2019); Yang et al. (2020); Salman et al. (2019a); Zhai et al. (2019), using code and pre-trained networks whenever available and performing experiments on CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), certifying the entire CIFAR-10 test set and a subset of 500 examples from the ImageNet test set. For the implementation of ANCER, we solve Equation (1) with Adam for 100 iterations, where the certification gap r_p(x, Θₓ) is estimated at each iteration using 100 noise samples per test point (see Appendix C), and Θₓ in Equation (1) is initialized with the Isotropic DD solution from Alfarra et al. (2022). Further details of the setup can be found in Appendix E.

As in previous works, ℓp certified accuracy at radius R is defined as the portion of the test set Dt which the smooth classifier classifies correctly with an ℓp certification radius of at least R. In a similar fashion, we define the anisotropic ℓ^Σ_2/ℓ^Λ_1 certified accuracy at a proxy radius R to be the portion of Dt which the smooth classifier classifies correctly with an ℓ^Σ_2/ℓ^Λ_1-norm certificate of an nth-root volume (as defined in Section 5) of at least R. We also report the average certified radius (ACR), defined as 𝔼_{(x,y)∼Dt}[Rₓ 1(g(x) = y)] (Alfarra et al., 2022; Zhai et al., 2019), as well as the average certified proxy radius (ACR̃), defined as 𝔼_{(x,y)∼Dt}[R̃ₓ 1(g(x) = y)], where Rₓ and R̃ₓ denote the radius and proxy radius at x with true label y for a smooth classifier g. Recall that in the isotropic case, the proxy radius is by definition the same as the radius for a given ℓp-norm. For each classifier, we ran experiments on the σ values reported in the original work (with the exception of Yang et al. (2020); see Section 7.2). For the sake of brevity, we report in this section the top-1 certified accuracy plots, ACR and ACR̃ per radius across σ, as in Salman et al. (2019a); Zhai et al. (2019); Alfarra et al. (2022). The performance of each method per σ is presented in Appendix G.
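For concreteness, the metrics above reduce to the following sketch (our own helper, assuming per-sample radii and correctness flags have already been collected):

import numpy as np

def certified_accuracy(radii: np.ndarray, correct: np.ndarray, R: float) -> float:
    # fraction of the test set certified correctly with radius at least R
    return float(np.mean(correct & (radii >= R)))

def average_certified_radius(radii: np.ndarray, correct: np.ndarray) -> float:
    # ACR = E[ R_x * 1(g(x) = y) ]; misclassified points contribute 0
    return float(np.mean(radii * correct))

radii = np.array([0.8, 0.0, 1.4]); correct = np.array([True, False, True])
print(certified_accuracy(radii, correct, R=0.5), average_certified_radius(radii, correct))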
Figure 3: Distribution of top-1 certified accuracy as a function of ℓ2 radius (top) and ℓ^Σ_2-norm proxy radius (bottom) obtained by different certification methods (Fixed σ, Isotropic DD, ANCER) on CIFAR-10 (Cohen, SmoothAdv, MACER) and ImageNet (Cohen, SmoothAdv).

7.1 Ellipsoid certification (ℓ2 and ℓ^Σ_2-norm certificates)

We perform the comparison of ℓ2-ball vs. ℓ^Σ_2-ellipsoid certificates via Gaussian smoothing using networks trained following the procedures defined in Cohen et al. (2019), Salman et al. (2019a), and Zhai et al. (2019). For each of these, we report results on ResNet18 trained using σ ∈ {0.12, 0.25, 0.5, 1.0} for CIFAR-10, and ResNet50 using σ ∈ {0.25, 0.5, 1.0} for ImageNet. For details of the training procedures, see Appendix E.1. Figure 3 plots top-1 certified accuracy as a function of the ℓ2 radius (top) and of the ℓ^Σ_2-norm proxy radius (bottom) per trained network and dataset, while Table 1 presents an overview of the certified accuracy at various ℓ2 radii, as well as the ℓ2 ACR and ℓ^Σ_2-norm ACR̃. Recall that, following the considerations in Section 5.1, the ℓ2 certificate obtained through ANCER is the maximum enclosed isotropic ℓ2-ball in the ℓ^Σ_2 ellipsoid.

Figure 4: Distribution of top-1 certified accuracy as a function of ℓ1 radius (top) and ℓ^Λ_1-norm proxy radius (bottom) obtained by different certification methods (Fixed σ, Isotropic DD, ANCER) on CIFAR-10 and ImageNet (RS4A).

First, we note that sample-wise certification (Isotropic DD and ANCER) achieves higher certified accuracy than Fixed σ across the board. This mirrors the findings in Alfarra et al. (2022): certifying with a fixed σ for all samples struggles with the robustness/accuracy trade-off first mentioned in Cohen et al. (2019), whereas the data-dependent solutions explicitly optimize σ per sample to avoid it. More importantly, ANCER achieves new state-of-the-art ℓ2 certified accuracy at most radii in Table 1, e.g. at radius 0.5 ANCER brings certified accuracy to 77% (from 66%) and 70% (from 62%) on CIFAR-10 and ImageNet, respectively, yielding relative percentage improvements in ACR between 13% and 47% when compared to Isotropic DD. While the results are significant, it might not be immediately clear why maximizing the volume of an ellipsoid with ANCER results in a larger maximum enclosed ℓ2-ball certificate in the ℓ^Σ_2 ellipsoid when compared to optimizing the ℓ2-ball directly with Isotropic DD. We explore this phenomenon in Section 7.3.
As expected, ANCER substantially improves the ℓ^Σ_2 ACR̃ compared to Isotropic DD in all cases, with relative improvements in ACR̃ between 38% and 63% over both datasets. The joint results, certification with ℓ2 and ℓ^Σ_2, establish that ANCER certifies the ℓ2-ball region obtained by previous approaches, in addition to a much larger region captured by the ℓ^Σ_2 certified accuracy and ACR̃, and is therefore, according to Definition 1, generally superior to Isotropic DD.

Table 1: Comparison of top-1 certified accuracy at different ℓ2 radii, ℓ2 average certified radius (ACR) and ℓ^Σ_2 average certified proxy radius (ACR̃) obtained by using the isotropic σ used for training the networks (Fixed σ); the isotropic data-dependent (Isotropic DD) optimization scheme from Alfarra et al. (2022); and ANCER's data-dependent anisotropic optimization.

CIFAR-10                          Accuracy @ ℓ2 radius (%)              ℓ2 ACR   ℓ^Σ_2 ACR̃
                                  0.0  0.25  0.5  1.0  1.5  2.0  2.5
Cohen (Cohen et al., 2019)
  Fixed σ                         86   71    51   27   14   6    2     0.722    0.722
  Isotropic DD                    82   76    62   39   24   14   8     1.117    1.117
  ANCER                           86   85    77   53   31   17   10    1.449    1.772
SmoothAdv (Salman et al., 2019a)
  Fixed σ                         82   72    55   32   19   9    5     0.834    0.834
  Isotropic DD                    82   75    63   40   25   15   7     1.011    1.011
  ANCER                           83   81    73   48   30   17   8     1.224    1.573
MACER (Zhai et al., 2019)
  Fixed σ                         87   76    59   37   24   14   9     0.970    0.970
  Isotropic DD                    88   80    66   40   17   9    6     1.007    1.007
  ANCER                           84   80    67   34   15   11   9     1.136    1.481

ImageNet                          Accuracy @ ℓ2 radius (%)              ℓ2 ACR   ℓ^Σ_2 ACR̃
                                  0.0  0.5   1.0  1.5  2.0  2.5  3.0
Cohen (Cohen et al., 2019)
  Fixed σ                         70   56    41   31   19   14   12    1.098    1.098
  Isotropic DD                    71   59    46   36   24   19   15    1.234    1.234
  ANCER                           70   70    62   61   42   36   29    1.810    1.981
SmoothAdv (Salman et al., 2019a)
  Fixed σ                         65   59    44   38   26   20   18    1.287    1.287
  Isotropic DD                    66   62    53   41   32   24   20    1.428    1.428
  ANCER                           66   66    62   58   44   37   32    1.807    1.965

7.2 Generalized Cross-Polytope certification (ℓ1 and ℓ^Λ_1-norm certificates)

To investigate ℓ1-ball vs. ℓ^Λ_1-generalized cross-polytope certification via Uniform smoothing, we compare ANCER to the ℓ1 state-of-the-art results of RS4A (Yang et al., 2020). While the authors of the original work report best certified accuracy based on 15 networks trained at different σ levels between 0.15 and 3.5 on CIFAR-10 (WideResNet40) and ImageNet (ResNet50), due to limited computational resources we perform the analysis on a subset of those networks with σ ∈ {0.25, 0.5, 1.0}. We reproduce the results in Yang et al. (2020) as closely as possible, with details of the training procedure presented in Appendix E.2. Figure 4 shows the top-1 certified accuracy as a function of the ℓ1 radius (top) and of the ℓ^Λ_1-norm proxy radius (bottom) for RS4A, and Table 2 shows an overview of the certified accuracy at various ℓ1 radii, as well as the ℓ1 ACR and ℓ^Λ_1 ACR̃. As in the ellipsoid case, we notice that ANCER outperforms both Fixed σ and Isotropic DD at most ℓ1 radii, establishing new state-of-the-art results on CIFAR-10 at radii 0.5 and 1.0, and on ImageNet at radius 0.5 (compared to previous results reported in Yang et al. (2020)). Once more and as expected, ANCER significantly improves the ℓ^Λ_1 ACR̃ for all radii, pointing to substantially larger certificates than the isotropic case. These results also establish that ANCER certifies the ℓ1-ball region obtained by previous work, in addition to the larger region obtained by the ℓ^Λ_1 certificate, and thus we can consider it superior (with respect to Definition 1) to Isotropic DD.
7.3 Why does ANCER improve upon Isotropic DD's ℓp certificates?

As observed in Sections 7.1 and 7.2, ANCER's ℓ2 and ℓ1 certificates outperform the corresponding certificates obtained by Isotropic DD. To explain this, we compare the ℓ2 certified region obtained by ANCER, defined in Section 6 as {δ : ‖δ‖₂ ≤ minᵢ σᵢˣ r(x, Σₓ)}, to the one by Isotropic DD, defined as {δ : ‖δ‖₂ ≤ σˣ r(x, σˣ)}. We observe that the radius of each of these certificates can be separated into a σ-factor (σˣ vs. σˣ_min = minᵢ σᵢˣ) and a gap-factor (r(x, σˣ) vs. r(x, Σₓ)). We posit that the seemingly surprising result can be attributed to the computation of the gap-factor r using an anisotropic, optimized distribution. However, another potential explanation would be that ANCER benefits from a prematurely stopped initialization provided by Isotropic DD, thus achieving a better σˣ_min than the isotropic σˣ when given further optimization iterations.

Table 2: Comparison of top-1 certified accuracy at different ℓ1 radii, ℓ1 average certified radius (ACR) and ℓ^Λ_1 average certified proxy radius (ACR̃) obtained by using the isotropic σ used for training the networks (Fixed σ); the isotropic data-dependent (Isotropic DD) optimization scheme from Alfarra et al. (2022); and ANCER's data-dependent anisotropic optimization.

CIFAR-10                      Accuracy @ ℓ1 radius (%)                   ℓ1 ACR   ℓ^Λ_1 ACR̃
                              0.0  0.25  0.5  0.75  1.0  1.5  2.0
RS4A (Yang et al., 2020)
  Fixed σ                     92   83    75   71    46   0    0          0.775    0.775
  Isotropic DD                92   89    82   76    58   6    2          0.946    0.946
  ANCER                       92   90    84   80    63   6    2          0.980    1.104

ImageNet
RS4A (Yang et al., 2020)
  Fixed σ                     78   73    67   63    0    0    0          0.683    0.683
  Isotropic DD                79   76    70   65    46   0    0          0.729    0.729
  ANCER                       78   76    70   66    48   0    0          0.730    1.513

Figure 5: Histograms of the values of the gap r (left) and the σ-factor (right) obtained by ANCER initialized with Isotropic DD, and by Isotropic DD when allowed to run for 100 iterations more than the baseline. Vertical lines mark the median of the data.

To investigate this, we take the optimized parameters from the Isotropic DD experiments on SmoothAdv for an initial σ = 0.25 on CIFAR-10, and run the optimization step of Isotropic DD for 100 iterations more than its default number of iterations from Alfarra et al. (2022), so as to match the total number of optimization steps between Isotropic DD and ANCER. The histograms of σˣ or σˣ_min and of the gap-factor r, i.e. the two factors of the ℓ2 certification results, are presented in Figure 5. While σˣ for Isotropic DD is similar in distribution to ANCER's σˣ_min, the distributions of the two gaps, r(x, σˣ) and r(x, Σₓ), are quite different. In particular, the ANCER certification gap is significantly larger when compared to Isotropic DD, and is the main contributor to the improvement in ANCER's ℓ2-ball certificate. That is to say, ANCER generates a Σₓ that is better aligned with the decision boundaries, and hence increases the confidence of the smooth classifier.

7.4 Certification Runtime

Table 3: Average certification time per sample for each architecture used: (a) ResNet18 (ℓ2, ℓ^Σ_2 on CIFAR-10), (b) WideResNet40 (ℓ1, ℓ^Λ_1 on CIFAR-10), and (c) ResNet50 (ImageNet).

                      Fixed σ   Isotropic DD   ANCER
(a) ResNet18          1.6s      1.8s           2.7s
(b) WideResNet40      7.4s      9.5s           11.5s
(c) ResNet50          109.5s    136.0s         147.0s

The certification procedures of Isotropic DD and ANCER trade off improved certified accuracy for runtime, since they require a sample-wise optimization to be run prior to the Certify step described in Cohen et al. (2019), and a memory-based step as per Alfarra et al. (2022).
The runtime of the optimization and certification procedures is roughly equal for ℓ1, ℓ2, ℓ^Σ_2 and ℓ^Λ_1 certification, and mostly depends on the network architecture. As such, we report the average certification runtime for a test set sample on an NVIDIA Quadro RTX 6000 GPU for Fixed σ, Isotropic DD and ANCER (including the isotropic initialization step) in Table 3. We observe that the runtime overhead of ANCER is not significant compared to its certification gains. Finally, due to the memory-based step in our approach, the inference and certification runtimes are the same.

8 Conclusion

We lay the theoretical foundations for anisotropic certification through a simple analysis, propose a metric for comparing general robustness certificates, and introduce ANCER, a certification procedure that estimates the parameters of the anisotropic smoothing distribution so as to maximize the certificate. Our experiments show that ANCER achieves state-of-the-art ℓ1 and ℓ2 certified accuracy in the data-dependent setting.

Acknowledgments

This publication is based upon work supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems [EP/S024050/1] and Five AI Limited; by King Abdullah University of Science and Technology (KAUST) under Award No. ORA-CRG10-2021-4648, as well as the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI); and by the Royal Academy of Engineering under the Research Chair and Senior Research Fellowships scheme, EPSRC/MURI grant EP/N019474/1.

References

Motasem Alfarra, Adel Bibi, Philip H. S. Torr, and Bernard Ghanem. Data dependent randomized smoothing. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pp. 64-74. PMLR, 2022.

Rudy Bunel, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. A unified view of piecewise linear neural network verification. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Krishnamurthy Dj Dvijotham, Jamie Hayes, Borja Balle, Zico Kolter, Chongli Qin, András György, Kai Xiao, Sven Gowal, and Pushmeet Kohli. A framework for robustness certification of smoothed classifiers using f-divergences. In International Conference on Learning Representations (ICLR), 2020.

Marc Fischer, Maximilian Baader, and Martin Vechev. Certified defense to image transformations via randomized smoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Igor Gilitschenski and Uwe D. Hanebeck. A robust computational test for overlap of two arbitrary-dimensional ellipsoids in fault-detection of Kalman filters. In 15th International Conference on Information Fusion, 2012.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verified training for provably robust image classification. In IEEE International Conference on Computer Vision (ICCV), 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification (CAV), 2017.

Jongheon Jeong and Jinwoo Shin. Consistency regularization for certified robustness of smoothed classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Matt Jordan and Alexandros G. Dimakis. Exactly computing the local Lipschitz constant of ReLU networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Hamid Karimi, Tyler Derr, and Jiliang Tang. Characterizing the decision boundary of deep neural networks. arXiv preprint arXiv:1912.11460, 2019.

Maurice G. Kendall. A Course in the Geometry of n Dimensions. Courier Corporation, 2004.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning (ICML), 2020.

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy (SP), 2019.

Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi S. Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. arXiv preprint arXiv:1906.04948, 2019.

Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In Association for the Advancement of Artificial Intelligence (AAAI), 2020.

Alexander Levine and Soheil Feizi. Improved, deterministic smoothing for ℓ1 certified robustness. arXiv preprint arXiv:2103.10834, 2021.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Linyi Li, Maurice Weber, Xiaojun Xu, Luka Rimanic, Tao Xie, Ce Zhang, and Bo Li. Provable robust learning based on transformation-specific smoothing. arXiv preprint arXiv:2002.12398, 2020.

Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Lluís Ros, Assumpta Sabater, and Federico Thomas. An ellipsoidal calculus based on propagation and fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2002.

Hadi Salman, Jerry Li, Ilya P. Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 2019a.

Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robust verification of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019b.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations (ICLR), 2019.

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations (ICLR), 2018.

Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (ICML), 2018.

Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning (ICML), 2020.

Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. MACER: Attack-free and scalable robust training via maximizing certified radius. In International Conference on Learning Representations (ICLR), 2019.

Dinghuai Zhang, Mao Ye, Chengyue Gong, Zhanxing Zhu, and Qiang Liu. Filling the soap bubbles: Efficient black-box adversarial certification with non-Gaussian smoothing. https://openreview.net/forum?id=Skg8gJBFvr, 2019.

A Qualitative Motivation of Anisotropic Certification

A.1 Visualizing CIFAR-10 Optimized Isotropic vs. Anisotropic Certificates

Figure 6: Illustration of the landscape of f^y for points around an input point x, and two projections of an isotropic ℓ2 certified region and an anisotropic ℓ^Σ_2-norm region for a CIFAR-10 dataset example onto a subset of two eigenvectors of the Hessian of f^y (blue regions correspond to a higher confidence in y).

To extend the illustration in Figure 1 to a higher-dimensional input, we now analyze an example of the isotropic ℓ2 certification of randomized smoothing with N(0, σ²I), where σ is optimized per input (Alfarra et al., 2022), against ANCER, which certifies an anisotropic region characterized by a diagonal ℓ^Σ_2-norm. To do so, we consider a CIFAR-10 (Krizhevsky, 2009) dataset point x, where the input is of size 32×32×3. We perform the 2D analysis by considering the regions closest to a decision boundary. Following Moosavi-Dezfooli et al. (2019), we compute the Hessian of f^y(x) with respect to x, where y is the true label for x and f classifies x correctly, i.e. y = arg maxᵢ fⁱ(x). In addition to the Hessian, we also compute its eigenvector decomposition, yielding the eigenvectors {νᵢ}, i ∈ {1, . . . , 3072}, ordered in descending order of the absolute value of the respective eigenvalues. In Figure 6a, we show the projection of the landscape of f^y onto the highest-curvature directions, i.e. ν₁ and ν₂. Note that the isotropic certificate, much as in Figure 1c, is nearly optimal in these 2 dimensions when compared to the anisotropic region. However, if we take the same projection with respect to the eigenvectors with the highest and lowest eigenvalues, i.e. ν₁ and ν₃₀₇₂, the advantages of the anisotropic certification become clear, as shown in Figure 6b.
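A sketch of how the top Hessian eigenvector used in this projection can be obtained without materializing the full 3072×3072 Hessian (our own helper via power iteration on Hessian-vector products; f_y is assumed to map an input tensor to the scalar confidence in the true class, and the largest-magnitude eigenvalue is assumed unique):

import torch

def top_hessian_eigvec(f_y, x: torch.Tensor, iters: int = 50) -> torch.Tensor:
    # power iteration on Hessian-vector products of the scalar map f_y at x
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        x_req = x.clone().requires_grad_(True)
        (grad,) = torch.autograd.grad(f_y(x_req), x_req, create_graph=True)
        (hv,) = torch.autograd.grad((grad * v).sum(), x_req)  # Hessian-vector product
        v = hv / hv.norm()
    return v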
B Anisotropic Certification and Evaluation Proofs

Proposition 1 (restatement). Consider a differentiable function g : ℝⁿ → ℝ. If sup_x ‖∇g(x)‖_* ≤ L, where ‖·‖_* is the dual norm of ‖·‖ (i.e. ‖z‖_* = max_x z⊤x s.t. ‖x‖ ≤ 1), then g is L-Lipschitz under the norm ‖·‖, that is |g(x) − g(y)| ≤ L‖x − y‖.

Proof. Consider some x, y ∈ ℝⁿ and the parameterization in t given by γ(t) = (1 − t)x + ty, t ∈ [0, 1]. Note that γ(0) = x and γ(1) = y. By the Fundamental Theorem of Calculus we have:

    |g(y) − g(x)| = |g(γ(1)) − g(γ(0))| = |∫₀¹ ∇g(γ(t))⊤γ′(t) dt| ≤ ∫₀¹ ‖∇g(γ(t))‖_* ‖γ′(t)‖ dt ≤ L‖y − x‖.

Theorem 1 (restatement). Let g : ℝⁿ → ℝᴷ, where gⁱ is L-Lipschitz continuous under norm ‖·‖ ∀i ∈ {1, . . . , K}, and let c_A = arg maxᵢ gⁱ(x). Then arg maxᵢ gⁱ(x + δ) = c_A for all δ satisfying ‖δ‖ < (1/(2L))(g^{c_A}(x) − max_{c≠c_A} g^c(x)).

Proof. Take c_B = arg max_{c≠c_A} g^c(x). By Proposition 1, we get:

    |g^{c_A}(x + δ) − g^{c_A}(x)| ≤ L‖δ‖  ⟹  g^{c_A}(x + δ) ≥ g^{c_A}(x) − L‖δ‖,
    |g^{c_B}(x + δ) − g^{c_B}(x)| ≤ L‖δ‖  ⟹  g^{c_B}(x + δ) ≤ g^{c_B}(x) + L‖δ‖.

By subtracting the inequalities and re-arranging terms, we have that as long as g^{c_A}(x) − L‖δ‖ > g^{c_B}(x) + L‖δ‖, i.e. the bound in the theorem, then g^{c_A}(x + δ) > g^{c_B}(x + δ), completing the proof.

Proposition 2 (restatement). Consider g_Σ(x) = 𝔼_{ϵ∼N(0,Σ)}[f(x + ϵ)]. Φ⁻¹(g_Σ(x)) is 1-Lipschitz (i.e. L = 1) under the ‖·‖_{Σ⁻¹,2} norm.

Proof. To prove Proposition 2, one needs to show that Φ⁻¹(gⁱ_Σ(x)) is 1-Lipschitz under the ‖·‖_{Σ⁻¹,2} norm ∀i. For ease of notation, we drop the superscript in gⁱ_Σ and write g_Σ. We want to show that ‖∇Φ⁻¹(g_Σ(x))‖_{Σ⁻¹,2} = ‖Σ^{1/2}∇Φ⁻¹(g_Σ(x))‖₂ ≤ 1. Following the argument presented in Salman et al. (2019a), it suffices to show that, for any unit-norm direction u and p = g_Σ(x), we have:

    u⊤Σ^{1/2}∇g_Σ(x) ≤ (1/√(2π)) exp(−½(Φ⁻¹(p))²).   (2)

We start by noticing that:

    u⊤Σ^{1/2}∇g_Σ(x) = (1/√((2π)ⁿ|Σ|)) ∫_{ℝⁿ} f(t) u⊤Σ^{1/2}Σ⁻¹(t − x) exp(−½(x − t)⊤Σ⁻¹(x − t)) dⁿt
                     = 𝔼_{s∼N(0,I)}[f(x + Σ^{1/2}s) u⊤s] = 𝔼_{v∼N(0,Σ)}[f(x + v) u⊤Σ^{−1/2}v].

We now need to find the optimal f* : ℝⁿ → [0, 1] that satisfies g_Σ(x) = 𝔼_{v∼N(0,Σ)}[f(x + v)] = p while maximizing the left-hand side 𝔼_{v∼N(0,Σ)}[f(x + v) u⊤Σ^{−1/2}v]. We argue that the maximizer is the following function:

    f*(x + v) = 1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}.

To prove that f* is indeed the optimal maximizer, we first show feasibility. (i) It is clear that f* : ℝⁿ → [0, 1]. (ii) Note that:

    𝔼_{v∼N(0,Σ)}[1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}] = P_{x∼N(0,1)}(x ≥ −Φ⁻¹(p)) = 1 − Φ(−Φ⁻¹(p)) = p.
To show the optimality of f*, we show that it attains the upper bound:

    𝔼_{v∼N(0,Σ)}[u⊤Σ^{−1/2}v · 1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}] = 𝔼_{x∼N(0,1)}[x · 1{x ≥ −Φ⁻¹(p)}]
    = (1/√(2π)) ∫_{−Φ⁻¹(p)}^{∞} x exp(−x²/2) dx = (1/√(2π)) exp(−½(Φ⁻¹(p))²),

obtaining the bound from Equation (2), and thus completing the proof.

Proposition 3 (restatement). Consider g_Λ(x) = 𝔼_{ϵ∼U[−1,1]ⁿ}[f(x + Λϵ)]. The classifier gⁱ_Λ is 1/2-Lipschitz (i.e. L = 1/2) under the ‖Λ·‖∞ norm ∀i.

Proof. We begin by observing that the dual norm of ‖x‖_{Λ,1} = ‖Λ⁻¹x‖₁ is ‖y‖_* = ‖Λy‖∞, since:

    max_{‖Λ⁻¹x‖₁ ≤ 1} x⊤y = max_{‖z‖₁ ≤ 1} y⊤Λz = ‖Λy‖∞.

Without loss of generality, we analyze ∂gⁱ/∂x₁. Let x̂ = [x₂, . . . , xₙ]⊤ ∈ ℝⁿ⁻¹ and Λ̂ be the corresponding block of Λ; then:

    ∂gⁱ/∂x₁ = ∂/∂x₁ (1/2ⁿ) ∫_{[−1,1]ⁿ} fⁱ(x₁ + λ₁ϵ₁, x̂ + Λ̂ϵ̂) dϵ₁ dⁿ⁻¹ϵ̂
            = (1/(2ⁿλ₁)) ∫_{[−1,1]ⁿ⁻¹} (fⁱ(x₁ + λ₁, x̂ + Λ̂ϵ̂) − fⁱ(x₁ − λ₁, x̂ + Λ̂ϵ̂)) dⁿ⁻¹ϵ̂.

Thus,

    λ₁|∂gⁱ/∂x₁| ≤ (1/2ⁿ) ∫_{[−1,1]ⁿ⁻¹} |fⁱ(x₁ + λ₁, x̂ + Λ̂ϵ̂) − fⁱ(x₁ − λ₁, x̂ + Λ̂ϵ̂)| dⁿ⁻¹ϵ̂ ≤ 1/2.

The second step follows by the change of variable t = x₁ + λ₁ϵ₁ and the Leibniz rule, and the final bound from fⁱ ∈ [0, 1]. Following a symmetric argument, λⱼ|∂gⁱ/∂xⱼ| ≤ 1/2 ∀i, j, resulting in ‖Λ∇gⁱ(x)‖∞ = maxⱼ λⱼ|∂gⁱ/∂xⱼ| ≤ 1/2 ∀i, concluding the proof.

Proposition 4 (restatement). V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! ∏ⁿᵢ₌₁ λᵢ.

Proof. Take A = (rΛ)⁻¹ = diag(1/(rλ₁), . . . , 1/(rλₙ)) = diag(a₁, . . . , aₙ). We can re-write the region as {x : Σᵢ aᵢ|xᵢ| ≤ 1}, from which it is clear that this region is an origin-centered, axis-aligned cross-polytope with the set of vertices V = {±eᵢ/aᵢ}ⁿᵢ₌₁, where eᵢ is the ith standard basis vector. Define the sets of vertices V_t = V \ {−eₙ/aₙ} and V_b = V \ {eₙ/aₙ}. Given the symmetry around the origin, each of these sets defines an n-dimensional hyperpyramid with a shared base B_{n−1}, given by the (n−1)-dimensional region defined by all vertices with xₙ = 0, and an apex at the vertex eₙ/aₙ (or −eₙ/aₙ in the case of V_b). The volume of each of these hyperpyramids is V(B_{n−1})/(n aₙ) (Kendall, 2004), yielding a total volume of Vₙ = (2/(n aₙ)) V(B_{n−1}). The same argument can be applied to compute V(B_{n−1}), which is itself a union of two (n−1)-dimensional hyperpyramids. This forms a recursion that completes the proof.

Proof. (Alternative proof.) We consider the case where Λ⁻¹ is a general positive definite matrix that is not necessarily diagonal. Note that V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = V({δ : ‖(rΛ)⁻¹δ‖₁ ≤ 1}) = rⁿ|Λ| V({δ : ‖δ‖₁ ≤ 1}), where |rΛ| denotes the determinant. The last equality follows from the volume of a set under a linear map, noting that {δ : ‖(rΛ)⁻¹δ‖₁ ≤ 1} = {rΛδ : ‖δ‖₁ ≤ 1}. At last, {δ : ‖δ‖₁ ≤ 1} can be expressed as the disjoint union of 2ⁿ simplexes, each of volume 1/n!. Thus, we have V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! |Λ|, completing the proof.

For completeness, we supplement the previous result with bounds on the volume that may be useful for future readers.

Proposition 5. For any positive definite Λ⁻¹ ∈ ℝⁿˣⁿ, we have the following:

    (2r/n)ⁿ V(Z(Λ)) ≤ V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) ≤ (2r)ⁿ V(Z(Λ)),

where V(Z(Λ)) = √|Λ⊤Λ| is the volume of the zonotope with generator matrix Λ.

Proof. Let S₁ = {δ : ‖Λ⁻¹δ‖₁ ≤ r}, S∞ = {δ : ‖Λ⁻¹δ‖∞ ≤ r} and Sⁿ∞ = {δ : n‖Λ⁻¹δ‖∞ ≤ r}. Since ‖Λ⁻¹δ‖∞ ≤ ‖Λ⁻¹δ‖₁ ≤ n‖Λ⁻¹δ‖∞, we have Sⁿ∞ ⊆ S₁ ⊆ S∞, and therefore V(Sⁿ∞) ≤ V(S₁) ≤ V(S∞). At last, note that Sⁿ∞ = {(r/n)Λδ : ‖δ‖∞ ≤ 1}, and with the change of variables δ = 2u − 1ₙ, where 1ₙ is the vector of all ones, Sⁿ∞ = Z((2r/n)Λ) ⊕ c, where ⊕ is a Minkowski sum and c is a single point in ℝⁿ. Therefore, V(Sⁿ∞) = V(Z((2r/n)Λ)) = (2r/n)ⁿ V(Z(Λ)). The upper bound follows by a similar argument, completing the proof.
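As a quick numerical sanity check of Proposition 4 (our own verification sketch, not part of the paper), the closed-form volume can be compared against a Monte Carlo estimate over the enclosing box Λ·[−r, r]ⁿ:

import math
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 2.0])
r, n = 1.5, lam.size
box = rng.uniform(-r * lam, r * lam, size=(1_000_000, n))  # samples from enclosing box
inside = np.abs(box / lam).sum(axis=1) <= r                # ||Lambda^{-1} x||_1 <= r
box_volume = np.prod(2 * r * lam)
print(inside.mean() * box_volume)                          # Monte Carlo estimate (~4.5)
print((2 * r) ** n / math.factorial(n) * lam.prod())       # Proposition 4: 4.5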
B.1 Certification under a Gaussian Mixture Smoothing Distribution

We consider a general, K-component, zero-mean Gaussian mixture smoothing distribution $\mathcal{G}$ such that:
$$\mathcal{G}\big(\{\alpha_i, \Sigma_i\}_{i=1}^K\big) := \sum_{i=1}^K \alpha_i\, \mathcal{N}(0, \Sigma_i), \quad \text{s.t.} \;\; \sum_i \alpha_i = 1, \;\; 0 < \alpha_i \le 1. \quad (3)$$

Given f, and as per the recipe described in Section 4, we are interested in the Lipschitz constant of the smooth classifier $g_\mathcal{G}(x) = (f * \mathcal{G})(x) = \sum_{i=1}^K \alpha_i (f * \mathcal{N}(0,\Sigma_i))(x) = \sum_i \alpha_i g_{\Sigma_i}(x)$, where $g_{\Sigma_i}$ is defined as in the Gaussian case. Note the weaker bound, when compared to Proposition 2, for each of the Gaussian components in the following proposition.

Proposition 6. $g_\Sigma$ is $\sqrt{2/\pi}$-Lipschitz under the $\|\cdot\|_{\Sigma^{-1},2}$ norm.

Proof. Following a similar argument to the proof of Proposition 2, we get:
$$u^\top \Sigma^{1/2}\, \nabla g_\Sigma(x) \le \frac{1}{\sqrt{(2\pi)^n|\Sigma|}} \int_{\mathbb{R}^n} \big|u^\top \Sigma^{-1/2}(t - x)\big| \exp\!\left(-\frac{1}{2}(x-t)^\top \Sigma^{-1}(x-t)\right) d^n t = \mathbb{E}_{s\sim\mathcal{N}(0,I)}\big[|u^\top s|\big] = \mathbb{E}_{v\sim\mathcal{N}(0,1)}\big[|v|\big] = \sqrt{\frac{2}{\pi}}.$$

With Proposition 6, we obtain a Lipschitz constant for a Gaussian mixture smoothing distribution as follows.

Proposition 7. $g_\mathcal{G}$ is $\sqrt{2/\pi}$-Lipschitz under the $\|\delta\|_{B^{-1},2}$ norm, where $B^{-1} = \sum_{i=1}^K \alpha_i \Sigma_i^{-1}$.

Proof.
$$|g_\mathcal{G}(x+\delta) - g_\mathcal{G}(x)| \le \sum_i \alpha_i\, |g_{\Sigma_i}(x+\delta) - g_{\Sigma_i}(x)| \le \sqrt{\frac{2}{\pi}} \sum_i \alpha_i\, \|\delta\|_{\Sigma_i^{-1},2} \le \sqrt{\frac{2}{\pi}}\, \|\delta\|_{B^{-1},2},$$
obtained by first applying the triangle inequality, then Proposition 6, followed by Jensen's inequality. Combining Proposition 7 and Theorem 1 yields the following certificate.

Corollary 3. Let $c_A = \arg\max_i g^i_\mathcal{G}(x)$. Then $\arg\max_i g^i_\mathcal{G}(x + \delta) = c_A$ for all $\delta$ satisfying:
$$\|\delta\|_{B^{-1},2} < \frac{1}{2}\sqrt{\frac{\pi}{2}}\left(g^{c_A}_\mathcal{G}(x) - \max_{c \ne c_A} g^{c}_\mathcal{G}(x)\right),$$
where $B^{-1} = \sum_{i=1}^K \alpha_i \Sigma_i^{-1}$.
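A small sketch (our illustration, assuming diagonal covariances; the smoothed outputs g_top and g_runner_up are assumed to be available from elsewhere) of evaluating the Corollary 3 certificate:

import numpy as np

def mixture_certified(delta, alphas, sigma_diags, g_top, g_runner_up) -> bool:
    """alphas: mixture weights; sigma_diags: list of arrays holding the
    diagonals of each Sigma_i; g_top/g_runner_up: smoothed classifier scores."""
    # diagonal of B^{-1} = sum_i alpha_i * Sigma_i^{-1}
    B_inv_diag = sum(a / s for a, s in zip(alphas, sigma_diags))
    # Theorem 1 with L = sqrt(2/pi) gives the scaled half-gap bound
    bound = 0.5 * np.sqrt(np.pi / 2.0) * (g_top - g_runner_up)
    return float(np.sqrt(np.sum(B_inv_diag * delta ** 2))) < bound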
C ANCER Optimization

In this section we detail the implementation choices required to solve Equation (1). For ease of presentation, we restate the ANCER optimization problem (with $\Theta^x = \mathrm{diag}(\{\theta^x_i\}_{i=1}^n)$):
$$\arg\max_{\Theta^x} \; r_p(x, \Theta^x)\, \sqrt[n]{\prod_i \theta^x_i} \quad \text{s.t.} \;\; \min_i \theta^x_i \cdot r_p(x, \Theta^x) \ge r^*_{\text{iso}},$$
where $r_p(x, \Theta^x)$ is the gap value under the anisotropic smoothing distribution, and $r^*_{\text{iso}}$ is the optimal isotropic radius, i.e. $\theta^x r_p(x, \theta^x)$ for $\theta^x \in \mathbb{R}^+$. This is a nonlinear constrained optimization problem that is challenging to solve. As such, we relax it, and solve instead:
$$\arg\max_{\Theta^x} \; r_p(x, \Theta^x)\, \sqrt[n]{\prod_i \theta^x_i} + \kappa \left(\min_i \theta^x_i \cdot r_p(x, \Theta^x)\right) \quad \text{s.t.} \;\; \theta^x_i \ge \theta^x,$$
given a hyperparameter $\kappa \in \mathbb{R}^+$. While the constraint $\theta^x_i \ge \theta^x$ is not explicitly required to enforce the superset condition over the isotropic case, it proved beneficial from an empirical perspective. To sample from the distribution parameterized by $\Theta^x$ (in our case, either a Gaussian or a Uniform), we make use of the reparameterization trick, as in Alfarra et al. (2022). The solution of this optimization problem can be found iteratively by performing projected gradient ascent. A standalone implementation of the ANCER optimization stage is presented in Listing 1, whereas the full code integrated in our code base is available as supplementary material. To perform certification, we simply feed the output of this optimization to the certification procedure from Cohen et al. (2019).

import numpy as np
import torch
from scipy.stats import norm
from torch.autograd import Variable
from torch.distributions.normal import Normal


class Certificate:
    def compute_proxy_gap(self, logits: torch.Tensor):
        raise NotImplementedError

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        raise NotImplementedError

    def compute_gap(self, pABar: float):
        raise NotImplementedError


class L2Certificate(Certificate):
    def __init__(self, batch_size: int, device: str = "cuda:0"):
        self.m = Normal(torch.zeros(batch_size).to(device),
                        torch.ones(batch_size).to(device))
        self.device = device
        self.norm = "l2"

    def compute_proxy_gap(self, logits: torch.Tensor):
        return self.m.icdf(logits[:, 0].clamp_(0.001, 0.999)) - \
            self.m.icdf(logits[:, 1].clamp_(0.001, 0.999))

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        return torch.randn_like(batch, device=self.device) * repeated_theta

    def compute_gap(self, pABar: float):
        return norm.ppf(pABar)


class L1Certificate(Certificate):
    def __init__(self, device="cuda:0"):
        self.device = device
        self.norm = "l1"

    def compute_proxy_gap(self, logits: torch.Tensor):
        return logits[:, 0] - logits[:, 1]

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        return 2 * (torch.rand_like(batch, device=self.device) - 0.5) * repeated_theta

    def compute_gap(self, pABar: float):
        return 2 * (pABar - 0.5)


def ancer_optimization(model: torch.nn.Module, batch: torch.Tensor,
                       certificate: Certificate, learning_rate: float,
                       isotropic_theta: torch.Tensor, iterations: int,
                       samples: int, kappa: float, device: str = "cuda:0"):
    """Optimize batch using ANCER, assuming an isotropic initialization point.

    model: trained network
    batch: inputs to certify around
    certificate: instance of the desired certification object
    learning_rate: optimization learning rate for ANCER
    isotropic_theta: initial isotropic value per input in batch
    iterations: number of iterations to run the optimization
    samples: number of samples per input and iteration
    kappa: relaxation hyperparameter
    device: device on which to run the optimization
    """
    batch_size = batch.shape[0]
    img_size = np.prod(batch.shape[1:])

    # define a variable, the optimizer, and the initial sigma values
    theta = Variable(isotropic_theta, requires_grad=True).to(device)
    optimizer = torch.optim.Adam([theta], lr=learning_rate)
    initial_theta = theta.detach().clone()

    # reshape vectors to have `samples` copies of each input in the batch
    new_shape = [batch_size * samples]
    new_shape.extend(batch[0].shape)
    new_batch = batch.repeat((1, samples, 1, 1)).view(new_shape)

    # solve iteratively by projected gradient ascent
    for _ in range(iterations):
        theta_repeated = theta.repeat(1, samples, 1, 1).view(new_shape)

        # reparameterization trick
        noise = certificate.sample_noise(new_batch, theta_repeated)
        out = model(new_batch + noise).reshape(batch_size, samples, -1).mean(dim=1)

        vals, _ = torch.topk(out, 2)
        gap = certificate.compute_proxy_gap(vals)

        prod = torch.prod((theta.reshape(batch_size, -1))**(1 / img_size), dim=1)
        proxy_radius = prod * gap

        radius_maximizer = - (
            proxy_radius.sum() +
            kappa * (torch.min(theta.view(batch_size, -1), dim=1).values * gap).sum()
        )
        optimizer.zero_grad()  # clear gradients accumulated in the previous step
        radius_maximizer.backward()
        optimizer.step()

        # project back to the feasible set theta >= initial (isotropic) theta
        with torch.no_grad():
            torch.max(theta, initial_theta, out=theta)

    return theta

Listing 1: Python implementation of the ANCER optimization routine using PyTorch (Paszke et al., 2019).
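As a hypothetical usage example of Listing 1 (the model, batch, and hyperparameter values below are illustrative placeholders, not the settings used in our experiments; the classes from Listing 1 are assumed to be in scope):

import torch

# placeholder classifier and CIFAR-10-shaped inputs, for illustration only
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
batch = torch.rand(4, 3, 32, 32)
# isotropic DD initialization: one theta per input dimension, all set to 0.5
iso_theta = 0.5 * torch.ones_like(batch)

certificate = L2Certificate(batch_size=4, device="cpu")
theta = ancer_optimization(model, batch, certificate, learning_rate=1e-2,
                           isotropic_theta=iso_theta, iterations=50,
                           samples=16, kappa=2.0, device="cpu")
print(theta.shape)  # per-pixel anisotropic smoothing parameters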
D Memory-based Certification for ANCER

To guarantee the soundness of the ANCER classifier, we use an adapted version of the data-dependent, memory-based solution presented in Alfarra et al. (2022). The modified algorithm involves a post-processing certification step that obtains adjusted certification statistics based on the memory procedure from Alfarra et al. (2022) (see the original paper for more details). We present the version of this post-processing memory-based step adapted to ANCER in Algorithm 2.

Algorithm 2: Memory-Based Certification
Input: input point $x_{N+1}$, certified region $R_{N+1}$, prediction $C_{N+1}$, and memory $M$
Result: prediction for $x_{N+1}$ and a certified region at $x_{N+1}$ that does not intersect with any conflicting certified region in $M$
for $(x_i, C_i, R_i) \in M$ do
    if $C_{N+1} \ne C_i$ then
        if $x_{N+1} \in R_i$ then
            return Abstain, $\emptyset$;
        else if MaxIntersect($R_{N+1}$, $R_i$) and Intersect($R_{N+1}$, $R_i$) then
            $R'_{N+1}$ = LargestOutSubset($R_i$, $R_{N+1}$); $R_{N+1} \leftarrow R'_{N+1}$;
end
add $(x_{N+1}, C_{N+1}, R_{N+1})$ to $M$;
return $C_{N+1}$, $R_{N+1}$;

Note that the proposed certified region $R_{N+1}$ emerges from our certification bounds presented in Sections 4.1 and 4.2. There are a few differences between our proposed Algorithm 2 and the original variant presented in Alfarra et al. (2022). The first is that we remove the computation of the largest certifiable subset of a certified region $R_{N+1}$ when there exists an $i$ such that $x_{N+1} \in R_i$ with a different class prediction (LargestInSubset in Alfarra et al. (2022)), due to the complexity of that operation in the anisotropic case; as an example, it is generally difficult to find the largest-volume ellipsoid contained in another ellipsoid. Due to this complexity, we choose to simply Abstain instead. Given the high dimensionality of the data, we never empirically encountered a certificate in this situation in our experiments. Further, to ease the computational burden of the Intersect function, we introduce the function MaxIntersect, run first, which checks whether the $\ell_p$-ball over-approximation of the region $R_{N+1}$ intersects with an $\ell_p$-ball over-approximation of $R_i$. This is sound since, when the $\ell_p$-ball over-approximations of the anisotropic regions $R_{N+1}$ and $R_i$ do not intersect, then $R_{N+1}$ and $R_i$ do not intersect either. Only in cases where those over-approximations intersect do we run the more expensive Intersect procedure. We now present practical implementations of MaxIntersect, Intersect and LargestOutSubset for the ellipsoids and generalized cross-polytopes considered in this paper.

D.1 Implementing MaxIntersect(RA, RB) in the Ellipsoid and Generalized Cross-Polytope Cases

Given the two regions $R_A$ and $R_B$, consider $\ell_p$-ball over-approximations of those regions, $R'_A = \{x \in \mathbb{R}^n : \|x - a\|_p \le r_a\}$ and $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_p \le r_b\}$, such that $R_A \subseteq R'_A$ and $R_B \subseteq R'_B$.

Lemma 1. If $\|a - b\|_p > r_a + r_b$, then $R_A \cap R_B = \emptyset$.

Proof. For the sake of contradiction, let $\|a - b\|_p > r_a + r_b$ and $x \in R'_A \cap R'_B$. Then we have that $\|x - a\|_p \le r_a$ and $\|x - b\|_p \le r_b$. However:
$$r_a + r_b < \|a - b\|_p \le \|x - a\|_p + \|x - b\|_p \le r_a + r_b,$$
forming a contradiction. Thus $R'_A \cap R'_B = \emptyset$, which in turn implies $R_A \cap R_B = \emptyset$, since $R_A$ and $R_B$ are subsets of $R'_A$ and $R'_B$, respectively.

This forms a fast maximum-intersection check for ellipsoids, i.e. $p = 2$, and generalized cross-polytopes, i.e. $p = 1$. The MaxIntersect function returns False if $\|a - b\|_p > r_a + r_b$, and True otherwise.
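A minimal sketch (ours) of MaxIntersect following Lemma 1. For a diagonal ellipsoid $(x-a)^\top A (x-a) \le 1$, the $\ell_2$ over-approximation radius is $\max_i A_{ii}^{-1/2}$; for a generalized cross-polytope $\|A(x-a)\|_1 \le 1$, the $\ell_1$ over-approximation radius is $\max_i A_{ii}^{-1}$:

import numpy as np

def over_approx_radius(A_diag: np.ndarray, p: int) -> float:
    # lp-ball over-approximation radius of the region parameterized by the
    # diagonal A: p = 2 for ellipsoids, p = 1 for generalized cross-polytopes
    return float((1.0 / A_diag.min()) ** (1.0 / p))

def max_intersect(a, b, A_diag, B_diag, p: int) -> bool:
    ra, rb = over_approx_radius(A_diag, p), over_approx_radius(B_diag, p)
    # Lemma 1: if ||a - b||_p > ra + rb, the regions provably do not intersect
    return bool(np.linalg.norm(a - b, ord=p) <= ra + rb)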
D.2 Implementing Intersect(RA, RB) in the Ellipsoid Case

The problem of efficiently checking whether two ellipsoids intersect is not trivial. We rely on the work of Ros et al. (2002) and Gilitschenski & Hanebeck (2012), supplying the proofs missing from Gilitschenski & Hanebeck (2012) for completeness.

Lemma 2. Let $R_A = \{x \in \mathbb{R}^n : (x - a)^\top A (x - a) \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : (x - b)^\top B (x - b) \le 1\}$ define two ellipsoids centered at $a$ and $b$, respectively. We have that $R = \{x : t(x - a)^\top A (x - a) + (1-t)(x - b)^\top B (x - b) \le 1\}$ for any $t \in [0, 1]$ satisfies $R_A \cap R_B \subseteq R \subseteq R_A \cup R_B$.

Proof. By considering the convex combination of the left-hand sides of the inequalities defining the regions $R_A$ and $R_B$, it becomes obvious that $x \in R_A \cap R_B \Rightarrow x \in R$, concluding the left side of the property. As for the right side, it suffices to show that if $x \notin R_A$ and $x \in R$ then $x \in R_B$ and, similarly, that if $x \notin R_B$ and $x \in R$ then $x \in R_A$. We show the first case, since the second follows by symmetry. Without loss of generality, we assume that $a = b = 0_n$. Now, let $x$ be such that $x^\top A x > 1$ and $t x^\top A x + (1-t) x^\top B x \le 1$, since $x \notin R_A$ and $x \in R$. Then $(1-t) x^\top B x \le 1 - t x^\top A x \le 1 - t$, since $x^\top A x > 1$, which implies $x^\top B x \le 1$, i.e. $x \in R_B$.

Note that the previous result holds without loss of generality for radius 1, as any radius can be absorbed into $A$ and $B$. Since the following lemma was stated by Gilitschenski & Hanebeck (2012) without proof, we complement it below for completeness.

Lemma 3. The set $R$ is equivalent to the following ellipsoid: $R = \{x : (x - m)^\top E_t (x - m) \le K(t)\}$, where $E_t = tA + (1-t)B$, $m = E_t^{-1}(tAa + (1-t)Bb)$, and $K(t) = 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m$.

Proof.
$$t(x - a)^\top A (x - a) + (1-t)(x - b)^\top B (x - b) \le 1$$
$$\iff x^\top \underbrace{(tA + (1-t)B)}_{E_t}\, x - 2 x^\top \underbrace{(tAa + (1-t)Bb)}_{E_t m} \le 1 - t a^\top A a - (1-t) b^\top B b$$
$$\iff (x - m)^\top E_t (x - m) \le 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m.$$
The last equivalence follows by adding and subtracting $m^\top E_t m$, concluding the proof.

Proposition 8. The set of points satisfying $R$ for $t \in (0, 1)$ is either an empty set, a single point, or the ellipsoid $R$.

Proof. We first observe that since $A$ and $B$ are positive definite, $E_t$ is positive definite. Then observe that for a choice of $t \in (0, 1)$ such that $K(t) < 0$, the set $R$ is empty, and since $R \supseteq R_A \cap R_B$, the two ellipsoids do not intersect. If $K(t) = 0$, then the only point satisfying $R$ is the center $m$; following a similar argument, the two ellipsoids then intersect in at most a single point. At last, for a choice of $t$ such that $K(t) > 0$, $R$ defines an ellipsoid.

As per Proposition 8, it suffices to find some $t \in (0, 1)$ under which $K(t) < 0$ to guarantee that the ellipsoids do not intersect. To that end, we solve the following convex optimization problem: $t^* = \arg\min_{t \in [0,1]} K(t)$, and check the condition $K(t^*) < 0$. Moreover, as shown by Ros et al. (2002) and Gilitschenski & Hanebeck (2012), $K(t)$ is convex in the domain $t \in (0, 1)$. With several algebraic manipulations, one can show that $K(t)$ has the following equivalent forms:
$$K(t) = 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m,$$
$$K(t) = 1 - t(1-t)(b - a)^\top B E_t^{-1} A (b - a),$$
$$K(t) = 1 - (b - a)^\top \left(\frac{1}{1-t} B^{-1} + \frac{1}{t} A^{-1}\right)^{-1} (b - a).$$
Observe that for ANCER, both $A$ and $B$ are diagonal with diagonal elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively, resulting in the following simple form for $K(t)$:
$$K(t) = 1 - \sum_{i=1}^n (b_i - a_i)^2\, \frac{t(1-t) A_{ii} B_{ii}}{t A_{ii} + (1-t) B_{ii}}.$$
The Intersect function in the ellipsoid case returns False if there exists a $t \in (0, 1)$ such that $K(t) < 0$, i.e. the ellipsoids do not intersect, and True otherwise.
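A sketch (ours) of the resulting Intersect check for the diagonal case, minimizing the convex $K(t)$ over $(0, 1)$ with a bounded scalar optimizer:

import numpy as np
from scipy.optimize import minimize_scalar

def ellipsoids_intersect(a, b, A_diag, B_diag) -> bool:
    """Returns False only when K(t*) < 0 for the minimizer t* in (0, 1),
    i.e. when the two diagonal ellipsoids provably do not intersect."""
    d2 = (b - a) ** 2
    def K(t):
        return 1.0 - np.sum(d2 * t * (1 - t) * A_diag * B_diag /
                            (t * A_diag + (1 - t) * B_diag))
    res = minimize_scalar(K, bounds=(1e-9, 1 - 1e-9), method="bounded")
    return bool(res.fun >= 0.0)   # conservative: tangency counts as intersecting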
D.3 Implementing Intersect(RA, RB) in the Generalized Cross-Polytope Case

Let $R_A$ and $R_B$ be two generalized cross-polytopes $R_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_1 \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_1 \le 1\}$, where $A$ and $B$ are positive definite diagonal matrices with elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively. We are interested in deciding whether $R_A$ and $R_B$ intersect. However, given the conservative context in which Intersect is used in Algorithm 2, we only need to ensure that the function returns False only when it is guaranteed that $R_A \cap R_B = \emptyset$. As such, we are able to simplify the complex problem of generalized cross-polytope intersection to the much simpler one of ellipsoid over-approximation intersection. We do this by considering the over-approximating (i.e. superset) ellipsoids $R'_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_2 \le 1\}$ and $R'_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_2 \le 1\}$, and performing the ellipsoid intersection check presented in Appendix D.2. If $R'_A \cap R'_B = \emptyset$, then this implies $R_A \cap R_B = \emptyset$ and we can safely return False. Otherwise, we conservatively assume the generalized cross-polytopes intersect and return True, triggering the reduction procedure detailed in Appendix D.5.

D.4 Implementing LargestOutSubset(RA, RB) in the Ellipsoid Case

Given two ellipsoids $R_A = \{x \in \mathbb{R}^n : (x - a)^\top A (x - a) \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : (x - b)^\top B (x - b) \le 1\}$ that do intersect, where $A$ and $B$ are positive definite diagonal matrices, the task is to find the largest possible ellipsoid $R'_B$ centered at $b$ such that $R'_B \subseteq R_B$ and $R_A \cap R'_B = \emptyset$. Finding a maximum ellipsoid that satisfies those conditions is not trivial, so instead we consider a maximum enclosed $\ell_2$-ball of $R_B$, $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_2 \le r\}$, that does not intersect $R_A$. To obtain this ball, we project the center $b$ of $R_B$ onto the ellipsoid $R_A$. In particular, we formulate the problem as the projection of a vector $y = b - a$ onto an ellipsoid with the same shape as $R_A$ centered at $0_n$. This is equivalent to solving the following optimization problem for a symmetric positive definite matrix $A$:
$$\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \;\; x^\top A x \le 1.$$
Note that the objective function is convex, and the constraint forms a convex set. Forming the Lagrangian for this problem, we obtain:
$$\mathcal{L}(x, \lambda) = \frac{1}{2}\|x - y\|_2^2 + \lambda\big(x^\top A x - 1\big), \quad \lambda > 0.$$
Therefore, the global optimal solution must satisfy the KKT conditions below:
$$\nabla_x \mathcal{L} = 0 \;\Rightarrow\; x^* = (2\lambda A + I)^{-1} y, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 0 \;\Rightarrow\; \underbrace{y^\top (2\lambda A + I)^{-1} A (2\lambda A + I)^{-1} y - 1}_{f(\lambda)} = 0.$$
Thus, to project the vector $y$ onto the ellipsoid characterized by $A$, one needs to solve the scalar equation $f(\lambda) = 0$ and then substitute back into the formula for $x^*$. Further, given $A = \mathrm{diag}(A_{11}, \dots, A_{nn})$, we can simplify the problem to:
$$f(\lambda) = \sum_{i=1}^n \frac{y_i^2 A_{ii}}{(1 + 2\lambda A_{ii})^2} - 1 = 0.$$
Once $x^*$ is obtained, we can define the maximum radius of the $\ell_2$-ball centered at $b$ that does not intersect $R_A$ as:
$$r^* = \|(x^* + a) - b\|_2 - \epsilon,$$
for an arbitrarily small $\epsilon$. Finally, we obtain $R'_B$ as the maximum ball contained within $R_B$ that has a radius smaller than $r^*$, that is:
$$R'_B = \left\{x \in \mathbb{R}^n : \|x - b\|_2 \le \min\big\{r^*, \min_i B_{ii}^{-1/2}\big\}\right\}.$$
Note that while choosing the radius of $R'_B$ to be $r^*$ guarantees that $R'_B \cap R_A = \emptyset$, this does not guarantee that $R'_B \subseteq R_B$. To guarantee both properties, we take the minimum of $r^*$ and $\min_i B_{ii}^{-1/2}$, the radius of the largest ball inscribed in $R_B$. This approach finds the solution to the projection of the point onto the ellipsoid $\{x \in \mathbb{R}^n : x^\top A x \le 1\}$; it does not work in the case where $b \in R_A$, since the problem would be trivially solved by setting $x^* = y$. Thus, our classifier must abstain in that situation.
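A sketch (ours) of the projection step for diagonal $A$: $f(\lambda)$ is monotonically decreasing for $\lambda \ge 0$, so its root can be bracketed and bisected, after which $x^* = (2\lambda A + I)^{-1}y$ is recovered; the radius $r^* = \|(x^* + a) - b\|_2 - \epsilon$ then follows as above:

import numpy as np

def project_onto_ellipsoid(y: np.ndarray, A_diag: np.ndarray, iters: int = 100) -> np.ndarray:
    """Project y = b - a onto {x : sum_i A_ii x_i^2 <= 1} (diagonal A),
    assuming y lies strictly outside the ellipsoid."""
    def f(lam):
        return np.sum(y ** 2 * A_diag / (1.0 + 2.0 * lam * A_diag) ** 2) - 1.0
    lo, hi = 0.0, 1.0
    while f(hi) > 0:          # grow the bracket until the sign changes
        hi *= 2.0
    for _ in range(iters):    # bisection on the monotone scalar equation
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return y / (1.0 + 2.0 * lam * A_diag)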
D.5 Implementing LargestOutSubset(RA, RB) in the Generalized Cross-Polytope Case

Let $R_A$ and $R_B$ be two generalized cross-polytopes $R_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_1 \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_1 \le 1\}$, where $A$ and $B$ are positive definite diagonal matrices with elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively. The task is to find the largest possible generalized cross-polytope $R'_B$ centered at $b$ such that $R'_B \subseteq R_B$ and $R_A \cap R'_B = \emptyset$.

As with the ellipsoid case, solving this problem for a generalized cross-polytope is not trivial, so instead we consider a maximum enclosed cross-polytope (i.e. $\ell_1$-ball) $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_1 \le r\}$ that does not intersect $R_A$ and is a subset of $R_B$. To obtain this $\ell_1$-ball, we project the center $b$ of $R_B$ onto the generalized cross-polytope $R_A$, in a similar fashion to the ellipsoid case in Appendix D.4. We formulate the problem as the projection of the vector $y = b - a$ onto the $0_n$-centered generalized cross-polytope $\{x \in \mathbb{R}^n : \|Ax\|_1 \le 1\}$.

Lemma 4. Consider the hyperplane $H = \{x \in \mathbb{R}^n : w^\top x - k = 0\}$ and a point $y \in \mathbb{R}^n$. The $\ell_2$ projection of $y$ onto the hyperplane is the point
$$x^* = y - \frac{(w^\top y - k)\, w}{\|w\|_2^2}.$$

Proof. We define the projection problem in a similar fashion to the ellipsoid case:
$$\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \;\; w^\top x - k = 0,$$
and obtain the Lagrangian $\mathcal{L}(x, \lambda) = \frac{1}{2}\|x - y\|_2^2 + \lambda(w^\top x - k)$, from which we get (using the KKT conditions) $x^* = y - \lambda^* w$ and $\lambda^* = (w^\top y - k)/\|w\|_2^2$, thus obtaining $x^* = y - (w^\top y - k)w/\|w\|_2^2$.

While this formulation does not yield the closest point on a hyperplane when measured with the $\ell_1$ norm, the fact that $\|x - x^*\|_1 \ge \|x - x^*\|_2$ implies that the certified set obtained in the $\ell_1$ norm via this method is a subset of the $\ell_2$-ball of the minimum-projection point. Crucially, this $\ell_2$ projection has the advantage of a closed-form solution, whereas an $\ell_1$ projection would require an iterative linear programming solver. As such, for the sake of computational efficiency, we decided to use this projection despite the sub-optimality of the result from the $\ell_1$ perspective. Empirically, we found this does not affect our results.

Since the set of vertices of the generalized cross-polytope $\{x \in \mathbb{R}^n : \|Ax\|_1 \le 1\}$ is given by $\{e_i/A_{ii}, -e_i/A_{ii}\}_{i=1}^n$, and considering the distance between the projections and the original $y$, the facet hyperplane that minimizes it is defined by the set of vertices $\{\mathrm{sign}(y_i)\, e_i/A_{ii}\}_{i=1}^n$. Writing this as a system of $n$ equations, we obtain the hyperplane defined by $w = [\mathrm{sign}(y_1) A_{11}, \dots, \mathrm{sign}(y_n) A_{nn}]^\top$ and $k = 1$. Finally, after computing $x^*$ as per Lemma 4, we can define the maximum radius of the $\ell_1$-ball centered at $b$ that does not intersect $R_A$ as:
$$r^* = \|(x^* + a) - b\|_1 - \epsilon,$$
for an arbitrarily small $\epsilon$. Finally, and similar to the ellipsoid case, we obtain $R'_B$ as the maximum generalized cross-polytope contained within $R_B$ that has a radius smaller than $r^*$, that is:
$$R'_B = \left\{x \in \mathbb{R}^n : \|x - b\|_1 \le \min\big\{r^*, \min_i B_{ii}^{-1}\big\}\right\}.$$
Similar to before, to guarantee that the $\ell_1$-ball $R'_B$ is still a subset of $R_B$, we take the minimum of $r^*$ and $\min_i B_{ii}^{-1}$ as the radius of $R'_B$. As with the ellipsoid case, this approach does not work in the case where $b \in R_A$, since the assumption of the closest facet to $y$ would not hold. Thus, our classifier must abstain in that situation.
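The closed-form projection of Lemma 4, specialized to the facet selected by the signs of $y$, can be sketched as follows (our illustration):

import numpy as np

def project_onto_cross_polytope_facet(y: np.ndarray, A_diag: np.ndarray) -> np.ndarray:
    """l2-project y = b - a onto the facet hyperplane w^T x = 1 of
    {x : ||A x||_1 <= 1}, with w_i = sign(y_i) * A_ii (Lemma 4)."""
    w = np.sign(y) * A_diag
    return y - (w @ y - 1.0) * w / np.dot(w, w)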
E Experimental Setup

The experiments reported in the paper used the CIFAR-10 (Krizhevsky, 2009)³ and ImageNet (Deng et al., 2009)⁴ datasets, and trained ResNet18, WideResNet40 and ResNet50 networks (He et al., 2016). Experiments used the typical data split for these datasets found in the PyTorch implementation (Paszke et al., 2019). The procedures used to obtain the baseline networks are detailed in Appendices E.1 and E.2 for ellipsoids and generalized cross-polytopes, respectively. Source code to reproduce the ANCER optimization and certification results of this paper is available as supplementary material.

Isotropic DD Optimization. We used the available code of Alfarra et al. (2022)⁵ to obtain the isotropic data-dependent smoothing parameters. To train our models from scratch, we used an adapted version of the code provided in the same repository.

³Available here (url), under an MIT license.
⁴Available here (url), terms of access detailed in the Download page.
⁵Data Dependent Randomized Smoothing source code available here.

Certification. Following Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019); Yang et al. (2020); Alfarra et al. (2022), all results were certified with N₀ = 100 Monte Carlo samples for selection, N = 100,000 samples for estimation, and a failure probability of α = 0.001.
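As an illustration of how these statistics are used (a sketch under the assumption of the standard Cohen-style Clopper-Pearson bound, not our exact certification code), the N estimation samples yield a lower confidence bound p̄A, which is fed through the certificate's gap together with the optimized Θ to produce the proxy radius of Section 5:

import numpy as np
from statsmodels.stats.proportion import proportion_confint

def certified_proxy_radius(n_top: int, n_total: int, alpha: float,
                           theta: np.ndarray, certificate) -> float:
    """n_top: Monte Carlo votes for the top class out of n_total samples;
    theta: optimized anisotropic parameters; certificate: as in Listing 1."""
    # one-sided lower confidence bound on the top-class probability
    p_a_bar = proportion_confint(n_top, n_total, alpha=2 * alpha, method="beta")[0]
    if p_a_bar < 0.5:
        return 0.0  # abstain
    gap = certificate.compute_gap(p_a_bar)
    # proxy radius: gap scaled by the geometric mean of the theta entries
    return float(gap * np.prod(theta ** (1.0 / theta.size)))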
E.1 Ellipsoid certification baseline networks

In terms of ellipsoid certification, the baselines we considered were Cohen (Cohen et al., 2019)⁶, SmoothAdv (Salman et al., 2019a)⁷ and MACER (Zhai et al., 2019)⁸. In the CIFAR-10 experiments, we used a ResNet18 architecture instead of the ResNet110 used in Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019), due to constraints on computational power. As such, we trained each of the networks from scratch following the procedures available in the source code of each baseline. We did so under our own framework, and the training scripts are available in the supplementary material. For the ImageNet experiments, we used the ResNet50 networks provided by each of the baselines in their respective open-source repositories. We trained the ResNet18 networks for 120 epochs, with a batch size of 256 and stochastic gradient descent with a learning rate of 10⁻² and momentum of 0.9.

E.2 Generalized cross-polytope certification baseline networks

For the certification of generalized cross-polytopes we considered RS4A (Yang et al., 2020)⁹. As described in Yang et al. (2020), we take λ = √3·σ and report results as a function of σ for ease of comparison. As with the baseline, we ran CIFAR-10 experiments on a WideResNet40 architecture and ImageNet experiments on a ResNet50 (Yang et al., 2020). However, due to limited computational power, we were not able to run experiments on the wide range of distributional parameters the original work considers, i.e. σ ∈ {0.15, 0.25, 0.5, 0.75, 1.0, 1.125, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5} on CIFAR-10 and σ ∈ {0.25, 0.5, 0.75, 1.0, 1.125, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5} on ImageNet. Instead, and matching the requirements of the ellipsoid section, we chose the subset σ ∈ {0.25, 0.5, 1.0} and performed our analysis at that level. While trained models are available in the source code of RS4A, we ran into several issues when attempting to use them, the most problematic being that the clean accuracy of those models was very low for both the WideResNet40 and ResNet50. To avoid these issues we trained the models from scratch, using the stability training loss as presented in the source code of RS4A; all of these models achieved clean accuracy of over 70%. Following the procedures described in the original work, we trained the WideResNet40 models with the stability loss of Yang et al. (2020) for 120 epochs, with a batch size of 128 and stochastic gradient descent with a learning rate of 10⁻², momentum of 0.9, and a step learning rate scheduler with γ = 0.1. For the ResNet50 networks on ImageNet, we trained them from scratch with the stability loss for 90 epochs, with a learning rate of 0.1 that drops by a factor of 0.1 every 30 epochs and a batch size of 256.

F Superset argument

The results we present in Section 7 support the argument that ANCER achieves, in general, a certificate that is a superset of the Fixed σ and Isotropic DD ones. To confirm this at the level of individual test set samples, we compare the ℓ2, ℓ1, ℓΣ2 and ℓΛ1 certification results across the different methods, and obtain the percentage of the test set on which ANCER performs at least as well as all other methods for each certificate. Results of this analysis are presented in Tables 4 and 5.

⁶Cohen source code available here.
⁷SmoothAdv source code available here.
⁸MACER source code available here.
⁹RS4A source code available here.

For most networks and datasets, we observe that ANCER achieves a larger ℓp certificate than the baselines on a significant portion of the dataset, showcasing the fact that it obtains a superset of the isotropic region per sample. This is further confirmed by the comparison of the anisotropic certificates, in which, for all trained networks except MACER on CIFAR-10, ANCER's certificate is superior on over 90% of the test set samples.

Table 4: Superset in top-1 ℓ2 and ℓΣ2 (rounded to nearest percent)

                        % ANCER ℓ2 is the best    % ANCER ℓΣ2 is the best
CIFAR-10: Cohen                 83                          93
CIFAR-10: SmoothAdv             73                          90
CIFAR-10: MACER                 50                          69
ImageNet: Cohen                 94                          96
ImageNet: SmoothAdv             90                          93

Table 5: Superset in top-1 ℓ1 and ℓΛ1 (rounded to nearest percent)

                        % ANCER ℓ1 is the best    % ANCER ℓΛ1 is the best
CIFAR-10: RS4A                 100                         100
ImageNet: RS4A                  97                          99

G Experimental Results per σ

G.1 Certifying Ellipsoids - ℓ2 and ℓΣ2 certification results per σ

In this section we report certified accuracy at various ℓ2 radii and ℓΣ2 proxy radii, following the metrics defined in Section 7, for each training method (Cohen (Cohen et al., 2019), SmoothAdv (Salman et al., 2019a) and MACER (Zhai et al., 2019)), dataset (CIFAR-10 and ImageNet) and σ ∈ {0.12, 0.25, 0.5, 1.0}. Figures 7 and 8 show certified accuracy at different ℓ2 radii for CIFAR-10 and ImageNet, respectively, whereas Figures 9 and 10 plot certified accuracy at different ℓΣ2 proxy radii for CIFAR-10 and ImageNet, respectively.

G.2 Certifying Generalized Cross-Polytopes - ℓ1 and ℓΛ1 certification results per σ

In this section we report certified accuracy at various ℓ1 radii and ℓΛ1 proxy radii, following the metrics defined in Section 7, for RS4A, each dataset (CIFAR-10 and ImageNet) and σ ∈ {0.25, 0.5, 1.0}. Figures 11 and 12 show certified accuracy at different ℓ1 radii for CIFAR-10 and ImageNet, respectively, whereas Figures 13 and 14 plot certified accuracy at different ℓΛ1 proxy radii for CIFAR-10 and ImageNet, respectively.
H Visual Comparison of Parameters in Ellipsoid Certificates

Anisotropic certification allows for a better characterization of the decision boundaries of the base classifier f. For example, the directions aligned with the major axes of the ellipsoid ‖δ‖Σ,2 = r, i.e. locations where Σ is large, are, by definition, expected to be less sensitive to perturbations than the minor-axis directions. To visualize this concept, Figure 15 shows CIFAR-10 images along with their corresponding optimized ℓ2 isotropic parameters obtained by Isotropic DD, and ℓΣ2 anisotropic parameters obtained by ANCER.

First, we note the richness of information provided by the anisotropic parameters when compared to the worst-case, isotropic ℓ2 one. Interestingly, pixel locations where the intensity of Σ is large (higher intensity in Figure 15) are generally the ones corresponding least to the underlying true class and overlapping more with background pixels.

A particular insight one can get from ANCER certification is that the decision boundaries are not distributed isotropically around each input. To quantify this in higher dimensions, we plot in Figure 16 a histogram of the ratio between the maximum and minimum elements of our optimized smoothing parameters for the experiments on SmoothAdv (with an initial σ = 1.0) on CIFAR-10. We note that this ratio can be as high as 5 for some of the input points, meaning the decision boundaries in those cases could be 5 times closer to a given input in some directions than in others.

Figure 7: CIFAR-10 certified accuracy as a function of ℓ2 radius, per model (Cohen, SmoothAdv, MACER) and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 8: ImageNet certified accuracy as a function of ℓ2 radius, per model (Cohen, SmoothAdv) and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]
Figure 9: CIFAR-10 certified accuracy as a function of ℓΣ2 proxy radius, per model and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 10: ImageNet certified accuracy as a function of ℓΣ2 proxy radius, per model and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

I Non data-dependent Anisotropic Certification

As mentioned briefly in Section 6, it is our intuition that anisotropic certification requires a data-dependent approach, as different points will have fairly different decision boundaries and the certified regions will extend in different directions (as exemplified in Figure 1). To validate this claim, we perform certification of SmoothAdv (Salman et al., 2019a) with an initial σ = 1 on CIFAR-10 using a Σ which is the average of all the optimized Σx. The certified accuracy, ℓ2 ACR and ℓΣ2 proxy ACR results are presented in Table 6, along with the same results for the methods reported in the main paper. As can be observed, moving away from data-dependent certification in the anisotropic scenario leads to a significant performance drop in terms of robustness.
Figure 11: CIFAR-10 certified accuracy as a function of ℓ1 radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 12: ImageNet certified accuracy as a function of ℓ1 radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

J Theoretical and Empirical Comparison with Mohapatra et al. (2020)

In terms of theoretical results, the certified regions of Mohapatra et al. (2020) unfortunately do not exhibit a closed-form solution similar to ours; thus, a direct theoretical volume-bound comparison is not possible. As for the empirical comparison, ANCER's performance on both ℓ2 and ℓ1 certificates far exceeds that of Mohapatra et al. (2020). For example, for ℓ2 certificates at a radius of 0.5, Cohen certified with ANCER achieves 77% certified accuracy (see Table 1), while Mohapatra et al. (2020) achieves under 60% certified accuracy; note that Mohapatra et al. (2020) provides only a marginal improvement over Cohen et al. As for the ℓ1 certificates, Mohapatra et al. (2020) uses the Gaussian distribution of Cohen et al., resulting in worse performance than the existing ℓ1 state of the art (Yang et al., 2020), which uses a uniform distribution; our approach improves further upon the performance of Yang et al. (2020). For example, as per Table 2, RS4A with ANCER certification achieves 84% certified accuracy at an ℓ1 radius of 0.5, while Yang et al. (2020) achieves 75% and Mohapatra et al. (2020) achieves below 60%. However, we believe that the combination of both approaches, ANCER and Mohapatra et al. (2020), could further boost performance, as also hinted at in the abstract of Mohapatra et al. (2020) regarding the use of data-dependent smoothing.

Figure 13: CIFAR-10 certified accuracy as a function of ℓΛ1 proxy radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]
Figure 14: ImageNet certified accuracy as a function of ℓΛ1 proxy radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 15: Visualization of an input CIFAR-10 image x (top), and the optimized parameters σ (middle) and Σ (bottom); higher intensity corresponds to a higher σi in that pixel and channel of the smoothing distribution in the isotropic and anisotropic cases, respectively. [Images omitted.]

Figure 16: Distribution of the maximum over the minimum ANCER σx at each dataset point, for SmoothAdv (Salman et al., 2019a) on CIFAR-10 (with initial σ = 1.0). [Plot omitted; the x-axis is σx_max/σx_min, ranging from 1 to 5.]

Table 6: Comparison of different certification methods on SmoothAdv with an initial σ = 1.0 on CIFAR-10.

CIFAR-10 SmoothAdv    Accuracy @ ℓ2 radius (%)                       ℓ2 ACR    ℓΣ2 proxy ACR
                      0.0   0.25   0.5   1.0   1.5   2.0   2.5
Fixed σ               45    40     35    25    16    9     5         0.565     0.565
Isotropic DD          41    39     36    29    21    14    7         0.694     0.694
ANCER                 44    43     41    35    26    15    8         0.871     0.992
Average Σ             29    25     21    14    9     5     2         0.329     0.379