# ANCER: Anisotropic Certification via Sample-wise Volume Maximization

Published in Transactions on Machine Learning Research (08/2022)

Francisco Eiras (eiras@robots.ox.ac.uk), University of Oxford, Five AI Ltd., UK
Motasem Alfarra (motasem.alfarra@kaust.edu.sa), King Abdullah University of Science and Technology (KAUST)
M. Pawan Kumar (pawan@robots.ox.ac.uk), University of Oxford
Philip H. S. Torr (philip.torr@eng.ox.ac.uk), University of Oxford
Puneet K. Dokania (puneet@robots.ox.ac.uk), University of Oxford, Five AI Ltd., UK
Bernard Ghanem (bernard.ghanem@kaust.edu.sa), King Abdullah University of Science and Technology (KAUST)
Adel Bibi (adel.bibi@eng.ox.ac.uk), University of Oxford

Equal contribution of the first two authors; order decided by 3 coin flips.

Reviewed on OpenReview: https://openreview.net/forum?id=7j0GI6tPYi

Abstract

Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic ℓp certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via the ℓp-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e. it cannot reason about other close, potentially large, constant-prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing ℓ1 and ℓ2 certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates: a certificate is superior to another if it certifies a superset region, and each certificate is quantified through the volume of its certified region. We introduce ANCER, a framework for obtaining anisotropic certificates for a given test set sample via volume maximization. We achieve this by generalizing memory-based certification of data-dependent classifiers. Our empirical results demonstrate that ANCER achieves state-of-the-art ℓ1 and ℓ2 certified accuracy on CIFAR-10 and ImageNet in the data-dependent setting, while certifying larger regions in terms of volume, highlighting the benefits of moving away from isotropic analysis. Our code is available in this repository.

1 Introduction

The well-studied fact that Deep Neural Networks (DNNs) are vulnerable to additive imperceptible noise perturbations has led to a growing interest in developing robust classifiers (Goodfellow et al., 2015; Szegedy et al., 2014).

Figure 1: Illustration of the landscape of f^y (blue corresponds to a higher confidence in y, the true label) for a region around an input in a toy, 2-dimensional radially separable dataset. For two dataset examples, (a) and (b) show the boundaries of the optimal ℓ1 isotropic and anisotropic certificates, while (c) and (d) show the boundaries of the optimal ℓ2 isotropic and anisotropic certificates. A thorough discussion of this figure is presented in Section 3.

A recent promising approach to achieve state-of-the-art provable robustness (i.e. a theoretical bound on the output around every input) at the scale of ImageNet (Deng et al., 2009) is randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019).
Given an input x and a network f, randomized smoothing constructs g(x) = 𝔼_{ϵ∼D}[f(x + ϵ)] such that g(x) = g(x + δ) ∀δ ∈ R, where the certification region R is characterized by x, f, and the smoothing distribution D. For instance, Cohen et al. (2019) showed that if D = N(0, σ²I), then R is an ℓ2-ball whose radius is determined by x, f and σ. Since then, there has been significant progress towards the design of the D leading to the largest R for all inputs x. The interplay between R, characterized by ℓ1, ℓ2 and ℓ∞-balls, and a notion of optimal distribution D has been previously studied (Yang et al., 2020).

Despite this progress, current randomized smoothing approaches provide certification regions that are isotropic in nature, limiting them to small, worst-case regions. We provide an intuitive example of this behavior in Figure 1. The isotropic nature of R in prior art is due to the common assumption that the smoothing distribution D is identically distributed (Yang et al., 2020; Kumar et al., 2020; Levine & Feizi, 2021). Moreover, comparisons between various randomized smoothing approaches were limited to methods that produce the same ℓp certificate, with no clear metrics for comparing against other certificates. In this paper, we address both concerns and present new state-of-the-art certified accuracy results on both the CIFAR-10 and ImageNet datasets. Our contributions are threefold.

(i) We provide a general and simpler analysis compared to prior art (Cohen et al., 2019; Yang et al., 2020) that paves the way for the certification of anisotropic regions characterized by any norm, subsuming prior art as special cases. We then specialize our result to regions that, for a positive definite A, are ellipsoids, i.e. ‖Aδ‖₂ ≤ c, c > 0, and generalized cross-polytopes, i.e. ‖Aδ‖₁ ≤ c, generalizing both ℓ2 (Cohen et al., 2019) and ℓ1 (Lecuyer et al., 2019; Yang et al., 2020) certification (Section 4). (ii) We introduce a new evaluation framework to compare methods that certify general (isotropic or anisotropic) regions. We compare two general certificates by defining that a method certifying R1 is superior to another certifying R2 if R1 is a strict superset of R2. Further, we define a standalone quantitative metric as the volume of the certified region, and specialize it for the cases of ellipsoids and generalized cross-polytopes (Section 5). (iii) We propose ANCER, an anisotropic certification method that performs sample-wise (i.e. per sample in the test set) region volume maximization (Section 6), generalizing the data-dependent, memory-based solution from Alfarra et al. (2022). Through experiments on CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), we show that restricting ANCER's certification region to ℓ1 and ℓ2-balls outperforms state-of-the-art ℓ1 and ℓ2 results from previous works (Yang et al., 2020; Alfarra et al., 2022). Further, we show that the volumes of the certified regions are significantly larger than those of all existing methods, thus setting a new state-of-the-art in certified accuracy. We highlight that while we effectively achieve state-of-the-art performance, it comes at a high cost given the data-dependency requirements. A discussion of the limitations of the solution is presented in Section 6.

Notation. We consider a base classifier f : ℝⁿ → P(K), where P(K) is the probability simplex over K classes, i.e. fⁱ ≥ 0 for i ∈ {1, . . . , K} and 1⊤f = 1.
Further, we use (x, y) to denote a sample input x and its corresponding true label y drawn from a test set Dt, and f^y to denote the output of f at the correct class. We use ℓp to denote the typically defined p-norm (p ≥ 1), and ℓ^A_p or ‖·‖_{A,p} for p ∈ {1, 2} to denote a composite norm defined with respect to a positive definite matrix A as ‖v‖_{A,p} = ‖A^{−1/p}v‖_p.

2 Related Work

Verified Defenses. Since the discovery that DNNs are vulnerable to input perturbations (Goodfellow et al., 2015; Szegedy et al., 2014), a range of methods have been proposed to build classifiers that are verifiably robust (Huang et al., 2017; Gowal et al., 2019; Bunel et al., 2018; Salman et al., 2019b). Despite this progress, these methods do not yet scale to the networks the community is interested in certifying (Tjeng et al., 2019; Weng et al., 2018).

Randomized Smoothing. The first works on randomized smoothing used Laplacian (Lecuyer et al., 2019; Li et al., 2019) and Gaussian (Cohen et al., 2019) distributions to obtain ℓ1 and ℓ2-ball certificates, respectively. Several subsequent works improved the performance of smooth classifiers by training the base classifier using adversarial augmentation (Salman et al., 2019a), regularization (Zhai et al., 2019), or general adjustments to training routines (Jeong & Shin, 2020). Recent work derived ℓp-norm certificates for other isotropic smoothing distributions (Yang et al., 2020; Levine & Feizi, 2020; Zhang et al., 2019). Concurrently, Dvijotham et al. (2020) developed a framework to handle arbitrary smoothing measures in any ℓp-norm; however, the certification process requires significant hyperparameter tuning. Similarly, Mohapatra et al. (2020) introduced larger certificates that require higher-order information, yet do not provide a closed-form solution. This was followed by a complementary data-dependent smoothing approach, where the parameters of the smoothing distribution were optimized per test set sample to maximize the certified radius at an individual input (Alfarra et al., 2022). All prior works considered smoothing with isotropic distributions and hence certified isotropic ℓp-ball regions. In this paper, we extend randomized smoothing to certify anisotropic regions, by pairing it with a generalization of the data-dependent framework (Alfarra et al., 2022) to maximize the certified region at each input point.

3 Motivating Anisotropic Certificates

Certification approaches aim to find the safe region R, where arg maxᵢ fⁱ(x) = arg maxᵢ fⁱ(x + δ) ∀δ ∈ R. Recent randomized smoothing techniques perform this certification by explicitly optimizing the isotropic ℓp certified region around each input (Alfarra et al., 2022), obtaining state-of-the-art performance as a result. Despite this ℓp optimality, we note that any ℓp-norm certificate is worst-case from the perspective of that norm, as it avoids adversary regions by limiting its certificate to the ℓp-closest adversary. This means that it can only enjoy a radius that is at most equal to the distance to the closest decision boundary. However, decision boundaries of general classifiers are complex, non-linear, and non-radially distributed with respect to a generic input sample (Karimi et al., 2019). This is evidenced by the fact that, within a reasonably small ℓp-ball around an input, there are often only a small set of adversary directions (Tramèr et al., 2017; 2018) (e.g. see the decision boundaries in Figure 1).
As such, while ℓp-norm certificates are useful for reasoning about worst-case performance and are simple to obtain given previous works (Cohen et al., 2019; Yang et al., 2020; Lee et al., 2019), they are otherwise uninformative in terms of the shape of decision boundaries, i.e. which regions around the input are safe. To visualize these concepts, we illustrate the decision boundaries of a base classifier f trained on a toy 2-dimensional, radially separable (with respect to the origin) binary classification dataset, and consider two different input test samples (see Figure 1). We compare the optimal isotropic and anisotropic certified regions of different shapes at these points. In Figures 1a and 1b, we compare an isotropic cross-polytope (of the form ‖δ‖₁ ≤ r) with an anisotropic generalized cross-polytope (of the form ‖Aδ‖₁ ≤ r), while in Figures 1c and 1d we compare an isotropic ℓ2-ball (of the form ‖δ‖₂ ≤ r) with an anisotropic ellipsoid (of the form ‖Aδ‖₂ ≤ r). Notice that in Figures 1a and 1c, due to the curvature of the classification boundary (shown in white), the optimal certification region is isotropic in nature, which is evidenced by the similarity of the optimal isotropic and anisotropic certificates. On the other hand, in Figures 1b and 1d, the location of the decision boundary allows the anisotropic certified regions to be considerably larger than their isotropic counterparts, as they are not as constrained by the closest decision boundary, i.e. the worst-case performance. These differences are further accentuated in higher dimensions, and we study them for a single CIFAR-10 test set sample in Appendix A.1.

Figure 2: Visualization of a CIFAR-10 image x and an example x + δ of an imperceptible change that is not inside the optimal isotropic certified region, but is covered by the anisotropic certificate.

As shown, anisotropic certification reasons more closely about the shape of the decision boundaries, allowing for further insights into constant-prediction (safe) directions. In Figure 2, we present a series of test set images x, as well as practically indistinguishable images x + δ which are not inside the optimal certified isotropic ℓ2-balls for each input sample, yet are within the anisotropic certified regions. This showcases the merits of using anisotropic certification for characterizing larger safe regions.

4 Anisotropic Certification

One of the main obstacles to enabling anisotropic certification is the complexity of the analysis required. To alleviate this, we follow a Lipschitz argument first observed by Salman et al. (2019a) and Jordan & Dimakis (2020) and propose a simple and general certification analysis. We start with the following two observations. All proofs are in Appendix B.

Proposition 1. Consider a differentiable function g : ℝⁿ → ℝ. If sup_x ‖∇g(x)‖_* ≤ L, where ‖·‖_* is the dual norm of ‖·‖ (i.e. ‖z‖_* = max_x z⊤x s.t. ‖x‖ ≤ 1), then g is L-Lipschitz under the norm ‖·‖, that is |g(x) − g(y)| ≤ L‖x − y‖.

Given the previous proposition, we formalize certification as follows:

Theorem 1. Let g : ℝⁿ → ℝᴷ, where gⁱ is L-Lipschitz continuous under norm ‖·‖ ∀i ∈ {1, . . . , K}, and let c_A = arg maxᵢ gⁱ(x). Then arg maxᵢ gⁱ(x + δ) = c_A for all δ satisfying:

    ‖δ‖ < (1/(2L)) (g^{c_A}(x) − max_{c≠c_A} g^c(x)).

Theorem 1 provides a ‖·‖-norm robustness certificate for any L-Lipschitz classifier g. The certificate is only informative when one can attain a tight non-trivial estimate of L, ideally sup_x ‖∇g(x)‖_*, which is generally difficult when g is an arbitrary neural network.
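To make Theorem 1 concrete, the following minimal sketch (our own, not from the paper's code base) computes the certified radius implied by the theorem from a classifier's scores and a known Lipschitz constant:

import numpy as np

# Theorem 1: for an L-Lipschitz classifier g, any perturbation delta with
# ||delta|| < (g^{c_A}(x) - max_{c != c_A} g^c(x)) / (2L) cannot flip the top-1 class.
def certified_radius(scores: np.ndarray, lipschitz_L: float) -> float:
    top2 = np.sort(scores)[-2:]       # [runner-up score, top score]
    gap = top2[1] - top2[0]           # g^{c_A}(x) - max_{c != c_A} g^c(x)
    return gap / (2.0 * lipschitz_L)

# Example: a 1-Lipschitz classifier with scores (0.9, 0.05, 0.05) certifies
# every delta with ||delta|| < (0.9 - 0.05) / 2 = 0.425.
print(certified_radius(np.array([0.9, 0.05, 0.05]), lipschitz_L=1.0))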
Framework Recipe. In light of Theorem 1, randomized smoothing can be viewed as an instance of Theorem 1 with the favorable property that the constructed smooth classifier g enjoys an analytical form for L = sup_x ‖∇g(x)‖_* by design. As such, to obtain an informative certificate, one must, for an arbitrary choice of smoothing distribution, compute the analytic Lipschitz constant L of g under ‖·‖. While there can exist a notion of optimal smoothing distribution for a given choice of certificate, as in part addressed earlier for the isotropic ℓ1, ℓ2 and ℓ∞ certificates (Yang et al., 2020), this is not the focus of this paper. The choice of the smoothing distribution in later sections is inspired by previous work for the purpose of granting anisotropic certificates. This recipe complements randomized smoothing works based on the Neyman-Pearson lemma (Cohen et al., 2019) or the Level-Set and Differential Method (Yang et al., 2020). We will deploy this framework recipe to show two specializations for anisotropic certification, namely ellipsoids (Section 4.1) and generalized cross-polytopes (Section 4.2).¹

¹ Our analysis also grants a certificate for a mixture of Gaussians smoothing distribution (see Appendix B.1).

4.1 Certifying Ellipsoids

In this section, we consider certification under the ℓ^Σ_2 norm, i.e. ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ), which has the dual norm ‖δ‖_{Σ⁻¹,2}. Note that both ‖δ‖_{Σ,2} ≤ r and ‖δ‖_{Σ⁻¹,2} ≤ r define ellipsoids. While the following results hold for any positive definite Σ, we assume for efficiency reasons that Σ is diagonal throughout. First, we consider the anisotropic Gaussian smoothing distribution N(0, Σ), with the smooth classifier defined as g_Σ(x) = 𝔼_{ϵ∼N(0,Σ)}[f(x + ϵ)]. Considering the classifier Φ⁻¹(g_Σ(x)), where Φ is the standard Gaussian CDF, and following Theorem 1 to grant an ℓ^Σ_2 certificate for Φ⁻¹(g_Σ(x)), we derive the Lipschitz constant L under ‖·‖_{Σ⁻¹,2} in the following proposition.

Proposition 2. Φ⁻¹(g_Σ(x)) is 1-Lipschitz (i.e. L = 1) under the ‖·‖_{Σ⁻¹,2} norm.

Since Φ⁻¹ is a strictly increasing function, by combining Proposition 2 with Theorem 1, we have:

Corollary 1. Let c_A = arg maxᵢ gⁱ_Σ(x); then arg maxᵢ gⁱ_Σ(x + δ) = c_A for all δ satisfying:

    ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ) < ½ (Φ⁻¹(g^{c_A}_Σ(x)) − Φ⁻¹(max_{c≠c_A} g^c_Σ(x))).

Corollary 1 recovers the ℓ2 certification of Zhai et al. (2019) as a special case when Σ = σ²I.²

² A similar result was derived in the appendices of Fischer et al. (2020) and Li et al. (2020) with a more involved analysis extending the Neyman-Pearson lemma.

4.2 Certifying Generalized Cross-Polytopes

Here we consider certification under the ℓ^Λ_1 norm, which defines a generalized cross-polytope, i.e. the set {δ : ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ ≤ r}, as opposed to the ℓ1-bounded set that defines a cross-polytope, i.e. {δ : ‖δ‖₁ ≤ r}. As in the ellipsoid case, and while the following results hold for any positive definite Λ, for the sake of efficiency we assume Λ to be diagonal throughout. For generalized cross-polytope certification, we consider an anisotropic Uniform smoothing distribution, which defines the smooth classifier g_Λ(x) = 𝔼_{ϵ∼U[−1,1]ⁿ}[f(x + Λϵ)]. Following Theorem 1, and to certify under the ℓ^Λ_1 norm, we compute the Lipschitz constant of g_Λ under the ‖Λ·‖∞ norm, which is the dual norm of ‖·‖_{Λ,1} (see Appendix B), in the next proposition.

Proposition 3. The classifier g_Λ is 1/2-Lipschitz (i.e. L = 1/2) under the ‖Λ·‖∞ norm.

Similar to Corollary 1, by combining Proposition 3 with Theorem 1, we have:

Corollary 2. Let c_A = arg maxᵢ gⁱ_Λ(x); then arg maxᵢ gⁱ_Λ(x + δ) = c_A for all δ satisfying:

    ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ < g^{c_A}_Λ(x) − max_{c≠c_A} g^c_Λ(x).

Corollary 2 recovers the ℓ1 certification of Yang et al. (2020) as a special case when Λ = λI.
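As a worked illustration of Corollaries 1 and 2, the sketch below (our own; the top-two smoothed scores pA and pB are assumed to be already estimated, in practice via the Monte Carlo bounds of Cohen et al. (2019)) computes the right-hand sides of both certificates:

from scipy.stats import norm

def ellipsoid_gap(pA: float, pB: float) -> float:
    # Corollary 1: certifies ||Sigma^{-1/2} delta||_2 < 0.5 * (Phi^{-1}(pA) - Phi^{-1}(pB))
    return 0.5 * (norm.ppf(pA) - norm.ppf(pB))

def cross_polytope_gap(pA: float, pB: float) -> float:
    # Corollary 2: certifies ||Lambda^{-1} delta||_1 < pA - pB
    return pA - pB

print(ellipsoid_gap(0.9, 0.05), cross_polytope_gap(0.9, 0.05))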
5 Evaluating Anisotropic Certificates

With the anisotropic certification framework presented in the previous section, a question arises: given two general (isotropic or anisotropic) certification regions R1 and R2, how can one effectively compare them? We propose the following definition to address this issue.

Definition 1. For a given input point x, consider two certification regions R1 and R2 obtained for two classifiers f1 and f2, i.e. A1 = {δ : arg max_c f^c_1(x) = arg max_c f^c_1(x + δ), δ ∈ R1} and A2 = {δ : arg max_c f^c_2(x) = arg max_c f^c_2(x + δ), δ ∈ R2}, where arg max_c f^c_1(x) = arg max_c f^c_2(x). We say A1 is a "superior certificate" to A2 (i.e. A1 ⊵ A2) if and only if A1 ⊇ A2.

This definition is a natural extension of the radius-based comparison of ℓp-ball certificates, providing a basis for evaluating anisotropic certification. To compare an anisotropic to an isotropic region of certification, it is not immediately clear how to (i) check that the anisotropic region is a superset of the isotropic region, and (ii) if it is a superset, quantify the improvement of the anisotropic region over its isotropic counterpart. In Sections 5.1 and 5.2, we tackle these issues for the particular cases of ellipsoid and generalized cross-polytope certificates.

5.1 Evaluating Ellipsoid Certificates

Comparing ℓ2 Balls to ℓ^Σ_2 Ellipsoids (Specialization of Definition 1). Recall that if Σ = σ²I, our ellipsoid certification in Corollary 1 recovers as a special case the isotropic ℓ2-ball certification of Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019). Consider the certified regions R1 = {δ : ‖δ‖₂ ≤ σr₁} and R2 = {δ : ‖δ‖_{Σ,2} = √(δ⊤Σ⁻¹δ) ≤ r₂} for given r₁, r₂ > 0. Since we take Σ = diag({σᵢ²}ⁿᵢ₌₁), the maximum enclosed ℓ2-ball of the ellipsoid R2 is given by the set R3 = {δ : ‖δ‖₂ ≤ minᵢ σᵢ r₂}, and thus R3 ⊆ R2. Therefore, it suffices that R3 ⊇ R1 (i.e. minᵢ σᵢ r₂ ≥ σr₁) to say that R2 is a superior certificate to the isotropic R1 as per Definition 1.

Quantifying ℓ^Σ_2 Certificates. The aforementioned specialization is only concerned with whether our ellipsoid certified region R2 is superior to the isotropic ℓ2-ball, without quantifying the improvement. A natural solution is to directly compare the volumes of the certified regions. Since the volume of the ellipsoid R2 is V(R2) = r₂ⁿ π^{n/2}/Γ(n/2 + 1) ∏ⁿᵢ₌₁ σᵢ (Kendall, 2004), we compare the proxy radius R̃, defined for R2 as R̃ = r₂ ⁿ√(∏ⁿᵢ σᵢ), since a larger R̃ corresponds to a certified region of larger volume. Note that R̃, which is the nth root of the volume up to a constant factor, can be seen as a generalization of the certified radius that recovers it when σᵢ = σ ∀i.

5.2 Evaluating Generalized Cross-Polytope Certificates

Comparing ℓ1 Balls to ℓ^Λ_1 Generalized Cross-Polytopes (Specialization of Definition 1). Consider the certificates S1 = {δ : ‖δ‖₁ ≤ λr₁}, S2 = {δ : ‖δ‖_{Λ,1} = ‖Λ⁻¹δ‖₁ ≤ r₂}, and S3 = {δ : ‖δ‖₁ ≤ minᵢ λᵢ r₂}, where we take Λ = diag({λᵢ}ⁿᵢ₌₁). Since S3 ⊆ S2, then as per Definition 1 it suffices that S3 ⊇ S1 (i.e. minᵢ λᵢ r₂ ≥ λr₁) to say that the anisotropic generalized cross-polytope S2 is superior to the isotropic ℓ1-ball S1.
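The following sketch (our own helper names; diagonal Σ = diag(σ²) as above) makes both steps concrete: the superset test via the maximum enclosed ℓ2-ball, and the proxy radius as the geometric mean of the per-coordinate scales:

import numpy as np

def is_superior(sigma: np.ndarray, r2: float, iso_sigma: float, r1: float) -> bool:
    # the maximum enclosed l2-ball of the ellipsoid has radius min_i sigma_i * r2;
    # the ellipsoid is superior (Definition 1) if that ball contains the isotropic one
    return sigma.min() * r2 >= iso_sigma * r1

def proxy_radius(sigma: np.ndarray, r2: float) -> float:
    # nth root of the ellipsoid volume up to a constant: r2 * (prod_i sigma_i)^(1/n)
    return r2 * np.exp(np.log(sigma).mean())   # geometric mean, numerically stable

sigma = np.array([0.3, 0.5, 0.8])
print(is_superior(sigma, r2=1.2, iso_sigma=0.25, r1=1.0), proxy_radius(sigma, r2=1.2))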
Quantifying ℓ^Λ_1 Certificates. Following the approach proposed in the ℓ^Σ_2 case, we quantitatively compare the generalized cross-polytope certification of Corollary 2 to the ℓ1 certificate through the volumes of the two regions. We first present the volume of the generalized cross-polytope.

Proposition 4. V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! ∏ⁿᵢ₌₁ λᵢ.

Following this result, we define the proxy radius for S2 as R̃ = r₂ ⁿ√(∏ⁿᵢ₌₁ λᵢ). As in the ℓ2 case, a larger R̃ corresponds to a certified region of larger volume, and R̃ can be seen as a generalization of the certified radius that recovers it when λᵢ = λ ∀i.

6 ANCER: Sample-wise Volume Maximization for Anisotropic Certification

Given the results of the previous sections, we are now equipped to certify anisotropic regions, in particular ellipsoids and generalized cross-polytopes. As mentioned in Section 4, these regions are generally defined as R = {δ : ‖δ‖_{Θ,p} ≤ r_p} for a given parameter of the smoothing distribution Θ = diag({θᵢ}ⁿᵢ₌₁), an ℓp-norm (p ∈ {1, 2}), and a gap value r_p ∈ ℝ₊. At this point, one could simply take an anisotropic distribution with arbitrarily chosen parameters Θ and certify a trained network at any input point x, in the style of what was done in the previous randomized smoothing literature with isotropic distributions. However, the choice of Θ is more complex in the anisotropic case. A fixed choice of anisotropic Θ could severely underperform the isotropic case: take, for example, the anisotropic distribution of Figure 1d applied to the input of Figure 1c. Instead of taking a fixed Θ, we generalize the framework introduced by Alfarra et al. (2022), where the parameters of the smoothing distribution are optimized per input test point (i.e. in a sample-wise fashion) so as to maximize the resulting certificate. The goal of the optimization in Alfarra et al. (2022) is, at a point x, to maximize the isotropic ℓ2 region described in Section 4.1, i.e. {δ : ‖δ‖₂ ≤ σˣ r_p(x, σˣ)}, where the gap r_p is a function of x and σˣ ∈ ℝ₊. In the isotropic ℓp case, this generalizes to maximizing the region {δ : ‖δ‖_p ≤ θˣ r_p(x, θˣ)}, which can be achieved by maximizing the radius θˣ r_p(x, θˣ) over θˣ ∈ ℝ₊, obtaining r*_iso (Alfarra et al., 2022).

Algorithm 1: ANCER Optimization
Function ANCER(f_θ, x, α, Θ⁰, n, K, κ):
    Initialize: Θ⁰ₓ ← Θ⁰
    for k = 0 . . . K − 1 do
        sample ε̂₁, . . . , ε̂ₙ ∼ D
        ψ(Θᵏₓ) = (1/n) Σⁿᵢ₌₁ f_θ(x + Θᵏₓ ε̂ᵢ)
        E_A(Θᵏₓ) = max_c ψᶜ;  y_A = arg max_c ψᶜ;  E_B(Θᵏₓ) = max_{c≠y_A} ψᶜ
        r_p(x, Θᵏₓ) = E_A − E_B if p = 1;  ½(Φ⁻¹(E_A) − Φ⁻¹(E_B)) if p = 2
        R̃(Θᵏₓ) = r_p(x, Θᵏₓ) (∏ᵈᵢ Θᵏᵢᵢ)^{1/d} + κ r_p(x, Θᵏₓ) minᵢ Θᵏᵢᵢ
        Θᵏ⁺¹ₓ ← Θᵏₓ + α ∇_{Θᵏₓ} R̃(Θᵏₓ)
        Θᵏ⁺¹ₓ ← max(Θᵏ⁺¹ₓ, Θ⁰)   // element-wise maximum (projection step)
    return Θᴷₓ

For the general anisotropic case, we propose ANCER, whose objective is to maximize the volume of the certified region through the proxy radius while satisfying the superset condition with respect to the maximum isotropic ℓ2 radius r*_iso. In the cases of ellipsoids and generalized cross-polytopes, as presented in Sections 5.1 and 5.2 respectively, ANCER's optimization problem can be written as:

    arg max_{Θₓ}  r_p(x, Θₓ) ⁿ√(∏ᵢ θᵢˣ)   s.t.  minᵢ θᵢˣ r_p(x, Θₓ) ≥ r*_iso   (1)

where r_p(x, Θₓ) is the gap value under the anisotropic smoothing distribution. That is,

    r_p(x, Λₓ) = g^{c_A}_Λ(x) − max_{c≠c_A} g^c_Λ(x),    r_p(x, Σₓ) = ½ (Φ⁻¹(g^{c_A}_Σ(x)) − Φ⁻¹(max_{c≠c_A} g^c_Σ(x)))

for ℓ1 and ℓ2, respectively. This is a nonlinear constrained optimization problem that is challenging to solve.
As such, we relax it and solve instead:

    arg max_{Θₓ}  r_p(x, Θₓ) ⁿ√(∏ᵢ θᵢˣ) + κ minᵢ θᵢˣ r_p(x, Θₓ)   s.t.  θᵢˣ ≥ θˣ

given a hyperparameter κ ∈ ℝ₊. While the constraint θᵢˣ ≥ θˣ is not explicitly required to enforce the superset condition over the isotropic case, it proved beneficial from an empirical perspective. To sample from the distribution parameterized by Θₓ (in our case, either a Gaussian or Uniform), we make use of the reparameterization trick, as in Alfarra et al. (2022). The solution of this optimization problem can be found iteratively by performing projected gradient ascent, as detailed in Algorithm 1. A standalone implementation of the ANCER optimization stage is presented in Listing 1 in Appendix C.

Memory-based Anisotropic Certification. While each of the classifiers induced by the parameter Θₓ, i.e. g_{Θₓ}, is robust by definition as presented in Section 4, the certification of the overall data-dependent classifier is not necessarily sound due to the optimization procedure being run for each x. This is a known issue in certifying data-dependent classifiers, and is addressed by Alfarra et al. (2022) through the use of a memory-based procedure. In Appendix D, we present a version of this algorithm adapted to ANCER. All subsequent results are obtained following this procedure.

Limitations of ANCER. Given that ANCER uses a memorization procedure similar to the one presented in Alfarra et al. (2022), it incurs limitations on memory and runtime complexity. Note that in memory-based data-dependent certification there is a single procedure for both certification and inference, in contrast with the fixed-σ setting of Cohen et al. (2019). The main limitations of memory-based certification are outlined in Appendix E of Alfarra et al. (2022). The anisotropic case adds to the complexity of the isotropic framework through the increased runtime of specific functions presented in Appendix D. Certification runtime comparisons are in Section 7.4. The memory-based procedure incurs the same memory cost as the one presented in Alfarra et al. (2022), i.e. it has a memory complexity of O(N), where N is the total number of inferred samples. This is because the memory-based method requires saving the observed instances along with their smoothing parameters. While the linear runtime dependency on memory size might appear daunting for the deployment of such a system, a few factors could mitigate the cost. Firstly, in practice deployed models get updated regularly, and the memory should be reset in those situations. Secondly, there are possible solutions that might attain sublinear runtime for the post-certification stage, such as the application of k-d trees to reduce the space of comparisons and speed up the process. As such, we believe ANCER to be suited to applications in offline scenarios, where improved robustness is desired and inference time is not a critical issue. A further limitation of the memorization procedure has to do with the impact of the order in which inputs are certified on the overall statistics obtained: within a memory-based framework, certifying x2 with x1 in memory can differ from certifying x1 with x2 in memory if their regions intersect. In practice, given the low number of intersections observed with the original certified regions, this effect was almost negligible in the results presented in Section 7.
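As a minimal illustration of the abstention logic (our own conservative simplification, not the procedure of Appendix D), two ellipsoidal certificates cannot intersect when their enclosing ℓ2-balls are disjoint; a tight implementation would instead use exact ellipsoid-overlap tests such as Gilitschenski & Hanebeck (2012):

import numpy as np

def may_intersect(x1, sigma1, r1, x2, sigma2, r2) -> bool:
    # conservative bounding-ball test: the ellipsoid around x_i is contained in
    # an l2-ball of radius r_i * max_j sigma_ij, so disjoint balls imply
    # disjoint certificates; True here triggers abstention in the worst case
    enclosing = r1 * sigma1.max() + r2 * sigma2.max()
    return np.linalg.norm(x1 - x2) <= enclosing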
For fairness of comparison with non-memory-based methods, we report "worst-case" results for ANCER in which we abstain from deciding whenever an intersection of two certified regions occurs.

7 Experiments

We now study the empirical performance of ANCER in obtaining ℓ^Σ_2, ℓ^Λ_1, ℓ2 and ℓ1 certificates on networks trained using randomized smoothing methods from the literature. In this section, we show that ANCER achieves: (i) improved performance on those networks in terms of ℓ2 and ℓ1 certification when compared to certification baselines that smooth using a fixed isotropic σ (Fixed σ) (Cohen et al., 2019; Yang et al., 2020; Salman et al., 2019a; Zhai et al., 2019) or a data-dependent, memory-based isotropic one (Isotropic DD) (Alfarra et al., 2022); and (ii) a significant improvement in terms of the ℓ^Σ_2 and ℓ^Λ_1-norm certified regions obtained by the same methods, compared by computing the proxy radius of the certified regions, thus generally satisfying the conditions of a superior certificate proposed in Definition 1. Note that both data-dependent approaches (Isotropic DD and ANCER) use memory-based procedures; as such, the gains described in this section constitute a trade-off given the limitations of the method described in Section 6.

We follow an evaluation procedure as similar as possible to the ones described in Cohen et al. (2019); Yang et al. (2020); Salman et al. (2019a); Zhai et al. (2019), using code and pre-trained networks whenever available and performing experiments on CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), certifying the entire CIFAR-10 test set and a subset of 500 examples from the ImageNet test set. For the implementation of ANCER, we solve Equation (1) with Adam for 100 iterations, where the certification gap r_p(x, Θₓ) is estimated at each iteration using 100 noise samples per test point (see Appendix C), and Θₓ in Equation (1) is initialized with the Isotropic DD solution from Alfarra et al. (2022). Further details of the setup can be found in Appendix E.

As in previous works, ℓp certified accuracy at radius R is defined as the portion of the test set Dt which the smooth classifier classifies correctly with an ℓp certification radius of at least R. In a similar fashion, we define the anisotropic ℓ^Σ_2/ℓ^Λ_1 certified accuracy at a proxy radius R to be the portion of Dt which the smooth classifier classifies correctly with an ℓ^Σ_2/ℓ^Λ_1-norm certificate of an nth-root volume (as defined in Section 5) of at least R. We also report the average certified radius (ACR), defined as 𝔼_{(x,y)∼Dt}[Rₓ 1(g(x) = y)] (Alfarra et al., 2022; Zhai et al., 2019), as well as the average certified proxy radius (ACR̃), defined as 𝔼_{(x,y)∼Dt}[R̃ₓ 1(g(x) = y)], where Rₓ and R̃ₓ denote the radius and proxy radius at x with true label y for a smooth classifier g. Recall that in the isotropic case, the proxy radius is by definition the same as the radius for a given ℓp-norm. For each classifier, we ran experiments on the σ values reported in the original work (with the exception of Yang et al. (2020); see Section 7.2). For the sake of brevity, we report in this section the top-1 certified accuracy plots, ACR and ACR̃ per radius across σ, as in Salman et al. (2019a); Zhai et al. (2019); Alfarra et al. (2022). The performance of each method per σ is presented in Appendix G.
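For concreteness, the metrics above reduce to the following sketch (our own helper, assuming per-sample radii and correctness flags have already been collected):

import numpy as np

def certified_accuracy(radii: np.ndarray, correct: np.ndarray, R: float) -> float:
    # fraction of the test set certified correctly with radius at least R
    return float(np.mean(correct & (radii >= R)))

def average_certified_radius(radii: np.ndarray, correct: np.ndarray) -> float:
    # ACR = E[ R_x * 1(g(x) = y) ]; misclassified points contribute 0
    return float(np.mean(radii * correct))

radii = np.array([0.8, 0.0, 1.4]); correct = np.array([True, False, True])
print(certified_accuracy(radii, correct, R=0.5), average_certified_radius(radii, correct))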
Figure 3: Distribution of top-1 certified accuracy as a function of ℓ2 radius (top) and ℓ^Σ_2-norm proxy radius (bottom) obtained by different certification methods (Fixed σ, Isotropic DD, ANCER) on CIFAR-10 (Cohen, SmoothAdv, MACER) and ImageNet (Cohen, SmoothAdv).

7.1 Ellipsoid certification (ℓ2 and ℓ^Σ_2-norm certificates)

We perform the comparison of ℓ2-ball vs. ℓ^Σ_2-ellipsoid certificates via Gaussian smoothing using networks trained following the procedures defined in Cohen et al. (2019), Salman et al. (2019a), and Zhai et al. (2019). For each of these, we report results on ResNet18 trained using σ ∈ {0.12, 0.25, 0.5, 1.0} for CIFAR-10, and ResNet50 using σ ∈ {0.25, 0.5, 1.0} for ImageNet. For details of the training procedures, see Appendix E.1. Figure 3 plots top-1 certified accuracy as a function of the ℓ2 radius (top) and of the ℓ^Σ_2-norm proxy radius (bottom) per trained network and dataset, while Table 1 presents an overview of the certified accuracy at various ℓ2 radii, as well as the ℓ2 ACR and ℓ^Σ_2-norm ACR̃. Recall that, following the considerations in Section 5.1, the ℓ2 certificate obtained through ANCER is the maximum enclosed isotropic ℓ2-ball in the ℓ^Σ_2 ellipsoid.

Figure 4: Distribution of top-1 certified accuracy as a function of ℓ1 radius (top) and ℓ^Λ_1-norm proxy radius (bottom) obtained by different certification methods (Fixed σ, Isotropic DD, ANCER) on CIFAR-10 and ImageNet (RS4A).

First, we note that sample-wise certification (Isotropic DD and ANCER) achieves higher certified accuracy than Fixed σ across the board. This mirrors the findings in Alfarra et al. (2022): certifying with a fixed σ for all samples struggles with the robustness/accuracy trade-off first mentioned in Cohen et al. (2019), whereas the data-dependent solutions explicitly optimize σ per sample to avoid it. More importantly, ANCER achieves new state-of-the-art ℓ2 certified accuracy at most radii in Table 1, e.g. at radius 0.5 ANCER brings certified accuracy to 77% (from 66%) and 70% (from 62%) on CIFAR-10 and ImageNet, respectively, yielding relative percentage improvements in ACR between 13% and 47% when compared to Isotropic DD. While the results are significant, it might not be immediately clear why maximizing the volume of an ellipsoid with ANCER results in a larger maximum enclosed ℓ2-ball certificate in the ℓ^Σ_2 ellipsoid when compared to optimizing the ℓ2-ball directly with Isotropic DD. We explore this phenomenon in Section 7.3.
As expected, ANCER substantially improves the ℓ^Σ_2 ACR̃ compared to Isotropic DD in all cases, with relative improvements in ACR̃ between 38% and 63% over both datasets. The joint results, certification with ℓ2 and ℓ^Σ_2, establish that ANCER certifies the ℓ2-ball region obtained by previous approaches, in addition to a much larger region captured by the ℓ^Σ_2 certified accuracy and ACR̃, and is therefore, according to Definition 1, generally superior to Isotropic DD.

Table 1: Comparison of top-1 certified accuracy at different ℓ2 radii, ℓ2 average certified radius (ACR) and ℓ^Σ_2 average certified proxy radius (ACR̃) obtained by using the isotropic σ used for training the networks (Fixed σ); the isotropic data-dependent (Isotropic DD) optimization scheme from Alfarra et al. (2022); and ANCER's data-dependent anisotropic optimization.

CIFAR-10                          Accuracy @ ℓ2 radius (%)              ℓ2 ACR   ℓ^Σ_2 ACR̃
                                  0.0  0.25  0.5  1.0  1.5  2.0  2.5
Cohen (Cohen et al., 2019)
  Fixed σ                         86   71    51   27   14   6    2     0.722    0.722
  Isotropic DD                    82   76    62   39   24   14   8     1.117    1.117
  ANCER                           86   85    77   53   31   17   10    1.449    1.772
SmoothAdv (Salman et al., 2019a)
  Fixed σ                         82   72    55   32   19   9    5     0.834    0.834
  Isotropic DD                    82   75    63   40   25   15   7     1.011    1.011
  ANCER                           83   81    73   48   30   17   8     1.224    1.573
MACER (Zhai et al., 2019)
  Fixed σ                         87   76    59   37   24   14   9     0.970    0.970
  Isotropic DD                    88   80    66   40   17   9    6     1.007    1.007
  ANCER                           84   80    67   34   15   11   9     1.136    1.481

ImageNet                          Accuracy @ ℓ2 radius (%)              ℓ2 ACR   ℓ^Σ_2 ACR̃
                                  0.0  0.5   1.0  1.5  2.0  2.5  3.0
Cohen (Cohen et al., 2019)
  Fixed σ                         70   56    41   31   19   14   12    1.098    1.098
  Isotropic DD                    71   59    46   36   24   19   15    1.234    1.234
  ANCER                           70   70    62   61   42   36   29    1.810    1.981
SmoothAdv (Salman et al., 2019a)
  Fixed σ                         65   59    44   38   26   20   18    1.287    1.287
  Isotropic DD                    66   62    53   41   32   24   20    1.428    1.428
  ANCER                           66   66    62   58   44   37   32    1.807    1.965

7.2 Generalized Cross-Polytope certification (ℓ1 and ℓ^Λ_1-norm certificates)

To investigate ℓ1-ball vs. ℓ^Λ_1-generalized cross-polytope certification via Uniform smoothing, we compare ANCER to the ℓ1 state-of-the-art results of RS4A (Yang et al., 2020). While the authors of the original work report best certified accuracy based on 15 networks trained at different σ levels between 0.15 and 3.5 on CIFAR-10 (WideResNet40) and ImageNet (ResNet50), due to limited computational resources we perform the analysis on a subset of those networks with σ ∈ {0.25, 0.5, 1.0}. We reproduce the results in Yang et al. (2020) as closely as possible, with details of the training procedure presented in Appendix E.2. Figure 4 shows the top-1 certified accuracy as a function of the ℓ1 radius (top) and of the ℓ^Λ_1-norm proxy radius (bottom) for RS4A, and Table 2 shows an overview of the certified accuracy at various ℓ1 radii, as well as the ℓ1 ACR and ℓ^Λ_1 ACR̃. As in the ellipsoid case, we notice that ANCER outperforms both Fixed σ and Isotropic DD at most ℓ1 radii, establishing new state-of-the-art results on CIFAR-10 at radii 0.5 and 1.0, and on ImageNet at radius 0.5 (compared to previous results reported in Yang et al. (2020)). Once more and as expected, ANCER significantly improves the ℓ^Λ_1 ACR̃ for all radii, pointing to substantially larger certificates than the isotropic case. These results also establish that ANCER certifies the ℓ1-ball region obtained by previous work, in addition to the larger region obtained by the ℓ^Λ_1 certificate, and thus we can consider it superior (with respect to Definition 1) to Isotropic DD.
7.3 Why does ANCER improve upon Isotropic DD's ℓp certificates?

As observed in Sections 7.1 and 7.2, ANCER's ℓ2 and ℓ1 certificates outperform the corresponding certificates obtained by Isotropic DD. To explain this, we compare the ℓ2 certified region obtained by ANCER, defined in Section 6 as {δ : ‖δ‖₂ ≤ minᵢ σᵢˣ r(x, Σₓ)}, to the one by Isotropic DD, defined as {δ : ‖δ‖₂ ≤ σˣ r(x, σˣ)}. We observe that the radius of each of these certificates can be separated into a σ-factor (σˣ vs. σˣ_min = minᵢ σᵢˣ) and a gap-factor (r(x, σˣ) vs. r(x, Σₓ)). We posit that the seemingly surprising result can be attributed to the computation of the gap-factor r using an anisotropic, optimized distribution. However, another potential explanation would be that ANCER benefits from a prematurely stopped initialization provided by Isotropic DD, thus achieving a better σˣ_min than the isotropic σˣ when given further optimization iterations.

Table 2: Comparison of top-1 certified accuracy at different ℓ1 radii, ℓ1 average certified radius (ACR) and ℓ^Λ_1 average certified proxy radius (ACR̃) obtained by using the isotropic σ used for training the networks (Fixed σ); the isotropic data-dependent (Isotropic DD) optimization scheme from Alfarra et al. (2022); and ANCER's data-dependent anisotropic optimization.

CIFAR-10                      Accuracy @ ℓ1 radius (%)                   ℓ1 ACR   ℓ^Λ_1 ACR̃
                              0.0  0.25  0.5  0.75  1.0  1.5  2.0
RS4A (Yang et al., 2020)
  Fixed σ                     92   83    75   71    46   0    0          0.775    0.775
  Isotropic DD                92   89    82   76    58   6    2          0.946    0.946
  ANCER                       92   90    84   80    63   6    2          0.980    1.104

ImageNet
RS4A (Yang et al., 2020)
  Fixed σ                     78   73    67   63    0    0    0          0.683    0.683
  Isotropic DD                79   76    70   65    46   0    0          0.729    0.729
  ANCER                       78   76    70   66    48   0    0          0.730    1.513

Figure 5: Histograms of the values of the gap r (left) and the σ-factor (right) obtained by ANCER initialized with Isotropic DD, and by Isotropic DD when allowed to run for 100 iterations more than the baseline. Vertical lines mark the median of the data.

To investigate this, we take the optimized parameters from the Isotropic DD experiments on SmoothAdv for an initial σ = 0.25 on CIFAR-10, and run the optimization step of Isotropic DD for 100 iterations more than its default number of iterations from Alfarra et al. (2022), so as to match the total number of optimization steps between Isotropic DD and ANCER. The histograms of σˣ or σˣ_min and of the gap-factor r, i.e. the two factors of the ℓ2 certification results, are presented in Figure 5. While σˣ for Isotropic DD is similar in distribution to ANCER's σˣ_min, the distributions of the two gaps, r(x, σˣ) and r(x, Σₓ), are quite different. In particular, the ANCER certification gap is significantly larger when compared to Isotropic DD, and is the main contributor to the improvement in ANCER's ℓ2-ball certificate. That is to say, ANCER generates a Σₓ that is better aligned with the decision boundaries, and hence increases the confidence of the smooth classifier.

7.4 Certification Runtime

Table 3: Average certification time per sample for each architecture used: (a) ResNet18 (ℓ2, ℓ^Σ_2 on CIFAR-10), (b) WideResNet40 (ℓ1, ℓ^Λ_1 on CIFAR-10), and (c) ResNet50 (ImageNet).

                      Fixed σ   Isotropic DD   ANCER
(a) ResNet18          1.6s      1.8s           2.7s
(b) WideResNet40      7.4s      9.5s           11.5s
(c) ResNet50          109.5s    136.0s         147.0s

The certification procedures of Isotropic DD and ANCER trade off improved certified accuracy for runtime, since they require a sample-wise optimization to be run prior to the Certify step described in Cohen et al. (2019), and a memory-based step as per Alfarra et al. (2022).
The runtime of the optimization and certification procedures is roughly equal for ℓ1, ℓ2, ℓ^Σ_2 and ℓ^Λ_1 certification, and mostly depends on the network architecture. As such, we report the average certification runtime for a test set sample on an NVIDIA Quadro RTX 6000 GPU for Fixed σ, Isotropic DD and ANCER (including the isotropic initialization step) in Table 3. We observe that the runtime overhead of ANCER is not significant compared to its certification gains. Finally, due to the memory-based step in our approach, the inference and certification runtimes are the same.

8 Conclusion

We lay the theoretical foundations for anisotropic certification through a simple analysis, propose a metric for comparing general robustness certificates, and introduce ANCER, a certification procedure that estimates the parameters of the anisotropic smoothing distribution so as to maximize the certificate. Our experiments show that ANCER achieves state-of-the-art ℓ1 and ℓ2 certified accuracy in the data-dependent setting.

Acknowledgments

This publication is based upon work supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems [EP/S024050/1] and Five AI Limited; by King Abdullah University of Science and Technology (KAUST) under Award No. ORA-CRG10-2021-4648, as well as the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI); and by the Royal Academy of Engineering under the Research Chair and Senior Research Fellowships scheme, EPSRC/MURI grant EP/N019474/1.

References

Motasem Alfarra, Adel Bibi, Philip H. S. Torr, and Bernard Ghanem. Data dependent randomized smoothing. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pp. 64-74. PMLR, 2022.

Rudy Bunel, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. A unified view of piecewise linear neural network verification. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Krishnamurthy Dj Dvijotham, Jamie Hayes, Borja Balle, Zico Kolter, Chongli Qin, András György, Kai Xiao, Sven Gowal, and Pushmeet Kohli. A framework for robustness certification of smoothed classifiers using f-divergences. In International Conference on Learning Representations (ICLR), 2020.

Marc Fischer, Maximilian Baader, and Martin Vechev. Certified defense to image transformations via randomized smoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Igor Gilitschenski and Uwe D. Hanebeck. A robust computational test for overlap of two arbitrary-dimensional ellipsoids in fault-detection of Kalman filters. In 15th International Conference on Information Fusion, 2012.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verified training for provably robust image classification. In IEEE International Conference on Computer Vision (ICCV), 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification (CAV), 2017.

Jongheon Jeong and Jinwoo Shin. Consistency regularization for certified robustness of smoothed classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Matt Jordan and Alexandros G. Dimakis. Exactly computing the local Lipschitz constant of ReLU networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Hamid Karimi, Tyler Derr, and Jiliang Tang. Characterizing the decision boundary of deep neural networks. arXiv preprint arXiv:1912.11460, 2019.

Maurice G. Kendall. A Course in the Geometry of n Dimensions. Courier Corporation, 2004.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. In International Conference on Machine Learning (ICML), 2020.

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy (SP), 2019.

Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi S. Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. arXiv preprint arXiv:1906.04948, 2019.

Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In Association for the Advancement of Artificial Intelligence (AAAI), 2020.

Alexander Levine and Soheil Feizi. Improved, deterministic smoothing for ℓ1 certified robustness. arXiv preprint arXiv:2103.10834, 2021.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Linyi Li, Maurice Weber, Xiaojun Xu, Luka Rimanic, Tao Xie, Ce Zhang, and Bo Li. Provable robust learning based on transformation-specific smoothing. arXiv preprint arXiv:2002.12398, 2020.

Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Lluís Ros, Assumpta Sabater, and Federico Thomas. An ellipsoidal calculus based on propagation and fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2002.

Hadi Salman, Jerry Li, Ilya P. Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 2019a.

Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robust verification of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019b.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations (ICLR), 2019.

Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations (ICLR), 2018.

Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (ICML), 2018.

Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning (ICML), 2020.

Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. MACER: Attack-free and scalable robust training via maximizing certified radius. In International Conference on Learning Representations (ICLR), 2019.

Dinghuai Zhang, Mao Ye, Chengyue Gong, Zhanxing Zhu, and Qiang Liu. Filling the soap bubbles: Efficient black-box adversarial certification with non-Gaussian smoothing. https://openreview.net/forum?id=Skg8gJBFvr, 2019.

A Qualitative Motivation of Anisotropic Certification

A.1 Visualizing CIFAR-10 Optimized Isotropic vs. Anisotropic Certificates

Figure 6: Illustration of the landscape of f^y for points around an input point x, and two projections of an isotropic ℓ2 certified region and an anisotropic ℓ^Σ_2-norm region for a CIFAR-10 dataset example onto a subset of two eigenvectors of the Hessian of f^y (blue regions correspond to a higher confidence in y).

To extend the illustration in Figure 1 to a higher-dimensional input, we now analyze an example of the isotropic ℓ2 certification of randomized smoothing with N(0, σ²I), where σ is optimized per input (Alfarra et al., 2022), against ANCER, which certifies an anisotropic region characterized by a diagonal ℓ^Σ_2-norm. To do so, we consider a CIFAR-10 (Krizhevsky, 2009) dataset point x, where the input is of size 32×32×3. We perform the 2D analysis by considering the regions closest to a decision boundary. Following Moosavi-Dezfooli et al. (2019), we compute the Hessian of f^y(x) with respect to x, where y is the true label for x and f classifies x correctly, i.e. y = arg maxᵢ fⁱ(x). In addition to the Hessian, we also compute its eigenvector decomposition, yielding the eigenvectors {νᵢ}, i ∈ {1, . . . , 3072}, ordered in descending order of the absolute value of the respective eigenvalues. In Figure 6a, we show the projection of the landscape of f^y onto the highest-curvature directions, i.e. ν₁ and ν₂. Note that the isotropic certificate, much as in Figure 1c, is nearly optimal in these 2 dimensions when compared to the anisotropic region. However, if we take the same projection with respect to the eigenvectors with the highest and lowest eigenvalues, i.e. ν₁ and ν₃₀₇₂, the advantages of the anisotropic certification become clear, as shown in Figure 6b.
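A sketch of how the top Hessian eigenvector used in this projection can be obtained without materializing the full 3072×3072 Hessian (our own helper via power iteration on Hessian-vector products; f_y is assumed to map an input tensor to the scalar confidence in the true class, and the largest-magnitude eigenvalue is assumed unique):

import torch

def top_hessian_eigvec(f_y, x: torch.Tensor, iters: int = 50) -> torch.Tensor:
    # power iteration on Hessian-vector products of the scalar map f_y at x
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        x_req = x.clone().requires_grad_(True)
        (grad,) = torch.autograd.grad(f_y(x_req), x_req, create_graph=True)
        (hv,) = torch.autograd.grad((grad * v).sum(), x_req)  # Hessian-vector product
        v = hv / hv.norm()
    return v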
B Anisotropic Certification and Evaluation Proofs

Proposition 1 (restatement). Consider a differentiable function g : ℝⁿ → ℝ. If sup_x ‖∇g(x)‖_* ≤ L, where ‖·‖_* is the dual norm of ‖·‖ (i.e. ‖z‖_* = max_x z⊤x s.t. ‖x‖ ≤ 1), then g is L-Lipschitz under the norm ‖·‖, that is |g(x) − g(y)| ≤ L‖x − y‖.

Proof. Consider some x, y ∈ ℝⁿ and the parameterization in t given by γ(t) = (1 − t)x + ty, t ∈ [0, 1]. Note that γ(0) = x and γ(1) = y. By the Fundamental Theorem of Calculus we have:

    |g(y) − g(x)| = |g(γ(1)) − g(γ(0))| = |∫₀¹ ∇g(γ(t))⊤γ′(t) dt| ≤ ∫₀¹ ‖∇g(γ(t))‖_* ‖γ′(t)‖ dt ≤ L‖y − x‖.

Theorem 1 (restatement). Let g : ℝⁿ → ℝᴷ, where gⁱ is L-Lipschitz continuous under norm ‖·‖ ∀i ∈ {1, . . . , K}, and let c_A = arg maxᵢ gⁱ(x). Then arg maxᵢ gⁱ(x + δ) = c_A for all δ satisfying ‖δ‖ < (1/(2L))(g^{c_A}(x) − max_{c≠c_A} g^c(x)).

Proof. Take c_B = arg max_{c≠c_A} g^c(x). By Proposition 1, we get:

    |g^{c_A}(x + δ) − g^{c_A}(x)| ≤ L‖δ‖  ⟹  g^{c_A}(x + δ) ≥ g^{c_A}(x) − L‖δ‖,
    |g^{c_B}(x + δ) − g^{c_B}(x)| ≤ L‖δ‖  ⟹  g^{c_B}(x + δ) ≤ g^{c_B}(x) + L‖δ‖.

By subtracting the inequalities and re-arranging terms, we have that as long as g^{c_A}(x) − L‖δ‖ > g^{c_B}(x) + L‖δ‖, i.e. the bound in the theorem, then g^{c_A}(x + δ) > g^{c_B}(x + δ), completing the proof.

Proposition 2 (restatement). Consider g_Σ(x) = 𝔼_{ϵ∼N(0,Σ)}[f(x + ϵ)]. Φ⁻¹(g_Σ(x)) is 1-Lipschitz (i.e. L = 1) under the ‖·‖_{Σ⁻¹,2} norm.

Proof. To prove Proposition 2, one needs to show that Φ⁻¹(gⁱ_Σ(x)) is 1-Lipschitz under the ‖·‖_{Σ⁻¹,2} norm ∀i. For ease of notation, we drop the superscript in gⁱ_Σ and write g_Σ. We want to show that ‖∇Φ⁻¹(g_Σ(x))‖_{Σ⁻¹,2} = ‖Σ^{1/2}∇Φ⁻¹(g_Σ(x))‖₂ ≤ 1. Following the argument presented in Salman et al. (2019a), it suffices to show that, for any unit-norm direction u and p = g_Σ(x), we have:

    u⊤Σ^{1/2}∇g_Σ(x) ≤ (1/√(2π)) exp(−½(Φ⁻¹(p))²).   (2)

We start by noticing that:

    u⊤Σ^{1/2}∇g_Σ(x) = (1/√((2π)ⁿ|Σ|)) ∫_{ℝⁿ} f(t) u⊤Σ^{1/2}Σ⁻¹(t − x) exp(−½(x − t)⊤Σ⁻¹(x − t)) dⁿt
                     = 𝔼_{s∼N(0,I)}[f(x + Σ^{1/2}s) u⊤s] = 𝔼_{v∼N(0,Σ)}[f(x + v) u⊤Σ^{−1/2}v].

We now need to find the optimal f* : ℝⁿ → [0, 1] that satisfies g_Σ(x) = 𝔼_{v∼N(0,Σ)}[f(x + v)] = p while maximizing the left-hand side 𝔼_{v∼N(0,Σ)}[f(x + v) u⊤Σ^{−1/2}v]. We argue that the maximizer is the following function:

    f*(x + v) = 1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}.

To prove that f* is indeed the optimal maximizer, we first show feasibility. (i) It is clear that f* : ℝⁿ → [0, 1]. (ii) Note that:

    𝔼_{v∼N(0,Σ)}[1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}] = P_{x∼N(0,1)}(x ≥ −Φ⁻¹(p)) = 1 − Φ(−Φ⁻¹(p)) = p.
To show the optimality of f*, we show that it attains the upper bound:

    𝔼_{v∼N(0,Σ)}[u⊤Σ^{−1/2}v · 1{u⊤Σ^{−1/2}v ≥ −Φ⁻¹(p)}] = 𝔼_{x∼N(0,1)}[x · 1{x ≥ −Φ⁻¹(p)}]
    = (1/√(2π)) ∫_{−Φ⁻¹(p)}^{∞} x exp(−x²/2) dx = (1/√(2π)) exp(−½(Φ⁻¹(p))²),

obtaining the bound from Equation (2), and thus completing the proof.

Proposition 3 (restatement). Consider g_Λ(x) = 𝔼_{ϵ∼U[−1,1]ⁿ}[f(x + Λϵ)]. The classifier gⁱ_Λ is 1/2-Lipschitz (i.e. L = 1/2) under the ‖Λ·‖∞ norm ∀i.

Proof. We begin by observing that the dual norm of ‖x‖_{Λ,1} = ‖Λ⁻¹x‖₁ is ‖y‖_* = ‖Λy‖∞, since:

    max_{‖Λ⁻¹x‖₁ ≤ 1} x⊤y = max_{‖z‖₁ ≤ 1} y⊤Λz = ‖Λy‖∞.

Without loss of generality, we analyze ∂gⁱ/∂x₁. Let x̂ = [x₂, . . . , xₙ]⊤ ∈ ℝⁿ⁻¹ and Λ̂ be the corresponding block of Λ; then:

    ∂gⁱ/∂x₁ = ∂/∂x₁ (1/2ⁿ) ∫_{[−1,1]ⁿ} fⁱ(x₁ + λ₁ϵ₁, x̂ + Λ̂ϵ̂) dϵ₁ dⁿ⁻¹ϵ̂
            = (1/(2ⁿλ₁)) ∫_{[−1,1]ⁿ⁻¹} (fⁱ(x₁ + λ₁, x̂ + Λ̂ϵ̂) − fⁱ(x₁ − λ₁, x̂ + Λ̂ϵ̂)) dⁿ⁻¹ϵ̂.

Thus,

    λ₁|∂gⁱ/∂x₁| ≤ (1/2ⁿ) ∫_{[−1,1]ⁿ⁻¹} |fⁱ(x₁ + λ₁, x̂ + Λ̂ϵ̂) − fⁱ(x₁ − λ₁, x̂ + Λ̂ϵ̂)| dⁿ⁻¹ϵ̂ ≤ 1/2.

The second step follows by the change of variable t = x₁ + λ₁ϵ₁ and the Leibniz rule, and the final bound from fⁱ ∈ [0, 1]. Following a symmetric argument, λⱼ|∂gⁱ/∂xⱼ| ≤ 1/2 ∀i, j, resulting in ‖Λ∇gⁱ(x)‖∞ = maxⱼ λⱼ|∂gⁱ/∂xⱼ| ≤ 1/2 ∀i, concluding the proof.

Proposition 4 (restatement). V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! ∏ⁿᵢ₌₁ λᵢ.

Proof. Take A = (rΛ)⁻¹ = diag(1/(rλ₁), . . . , 1/(rλₙ)) = diag(a₁, . . . , aₙ). We can re-write the region as {x : Σᵢ aᵢ|xᵢ| ≤ 1}, from which it is clear that this region is an origin-centered, axis-aligned cross-polytope with the set of vertices V = {±eᵢ/aᵢ}ⁿᵢ₌₁, where eᵢ is the ith standard basis vector. Define the sets of vertices V_t = V \ {−eₙ/aₙ} and V_b = V \ {eₙ/aₙ}. Given the symmetry around the origin, each of these sets defines an n-dimensional hyperpyramid with a shared base B_{n−1}, given by the (n−1)-dimensional region defined by all vertices with xₙ = 0, and an apex at the vertex eₙ/aₙ (or −eₙ/aₙ in the case of V_b). The volume of each of these hyperpyramids is V(B_{n−1})/(n aₙ) (Kendall, 2004), yielding a total volume of Vₙ = (2/(n aₙ)) V(B_{n−1}). The same argument can be applied to compute V(B_{n−1}), which is itself a union of two (n−1)-dimensional hyperpyramids. This forms a recursion that completes the proof.

Proof. (Alternative proof.) We consider the case where Λ⁻¹ is a general positive definite matrix that is not necessarily diagonal. Note that V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = V({δ : ‖(rΛ)⁻¹δ‖₁ ≤ 1}) = rⁿ|Λ| V({δ : ‖δ‖₁ ≤ 1}), where |rΛ| denotes the determinant. The last equality follows from the volume of a set under a linear map, noting that {δ : ‖(rΛ)⁻¹δ‖₁ ≤ 1} = {rΛδ : ‖δ‖₁ ≤ 1}. At last, {δ : ‖δ‖₁ ≤ 1} can be expressed as the disjoint union of 2ⁿ simplexes, each of volume 1/n!. Thus, we have V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) = (2r)ⁿ/n! |Λ|, completing the proof.

For completeness, we supplement the previous result with bounds on the volume that may be useful for future readers.

Proposition 5. For any positive definite Λ⁻¹ ∈ ℝⁿˣⁿ, we have the following:

    (2r/n)ⁿ V(Z(Λ)) ≤ V({δ : ‖Λ⁻¹δ‖₁ ≤ r}) ≤ (2r)ⁿ V(Z(Λ)),

where V(Z(Λ)) = √|Λ⊤Λ| is the volume of the zonotope with generator matrix Λ.

Proof. Let S₁ = {δ : ‖Λ⁻¹δ‖₁ ≤ r}, S∞ = {δ : ‖Λ⁻¹δ‖∞ ≤ r} and Sⁿ∞ = {δ : n‖Λ⁻¹δ‖∞ ≤ r}. Since ‖Λ⁻¹δ‖∞ ≤ ‖Λ⁻¹δ‖₁ ≤ n‖Λ⁻¹δ‖∞, we have Sⁿ∞ ⊆ S₁ ⊆ S∞, and therefore V(Sⁿ∞) ≤ V(S₁) ≤ V(S∞). At last, note that Sⁿ∞ = {(r/n)Λδ : ‖δ‖∞ ≤ 1}, and with the change of variables δ = 2u − 1ₙ, where 1ₙ is the vector of all ones, Sⁿ∞ = Z((2r/n)Λ) ⊕ c, where ⊕ is a Minkowski sum and c is a single point in ℝⁿ. Therefore, V(Sⁿ∞) = V(Z((2r/n)Λ)) = (2r/n)ⁿ V(Z(Λ)). The upper bound follows by a similar argument, completing the proof.
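As a quick numerical sanity check of Proposition 4 (our own verification sketch, not part of the paper), the closed-form volume can be compared against a Monte Carlo estimate over the enclosing box Λ·[−r, r]ⁿ:

import math
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 2.0])
r, n = 1.5, lam.size
box = rng.uniform(-r * lam, r * lam, size=(1_000_000, n))  # samples from enclosing box
inside = np.abs(box / lam).sum(axis=1) <= r                # ||Lambda^{-1} x||_1 <= r
box_volume = np.prod(2 * r * lam)
print(inside.mean() * box_volume)                          # Monte Carlo estimate (~4.5)
print((2 * r) ** n / math.factorial(n) * lam.prod())       # Proposition 4: 4.5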
B.1 Certification under a Gaussian Mixture Smoothing Distribution

We consider a general, K-component, zero-mean Gaussian mixture smoothing distribution $\mathcal{G}$ such that:
$$\mathcal{G}\big(\{\alpha_i, \Sigma_i\}_{i=1}^K\big) := \sum_{i=1}^K \alpha_i\, \mathcal{N}(0, \Sigma_i), \quad \text{s.t.} \;\; \sum_i \alpha_i = 1, \;\; 0 < \alpha_i \le 1. \quad (3)$$

Given f, and as per the recipe described in Section 4, we are interested in the Lipschitz constant of the smooth classifier $g_\mathcal{G}(x) = (f * \mathcal{G})(x) = \sum_{i=1}^K \alpha_i (f * \mathcal{N}(0,\Sigma_i))(x) = \sum_i \alpha_i g_{\Sigma_i}(x)$, where $g_{\Sigma_i}$ is defined as in the Gaussian case. Note the weaker bound, when compared to Proposition 2, for each of the Gaussian components in the following proposition.

Proposition 6. $g_\Sigma$ is $\sqrt{2/\pi}$-Lipschitz under the $\|\cdot\|_{\Sigma^{-1},2}$ norm.

Proof. Following a similar argument to the proof of Proposition 2, we get:
$$u^\top \Sigma^{1/2}\, \nabla g_\Sigma(x) \le \frac{1}{\sqrt{(2\pi)^n|\Sigma|}} \int_{\mathbb{R}^n} \big|u^\top \Sigma^{-1/2}(t - x)\big| \exp\!\left(-\frac{1}{2}(x-t)^\top \Sigma^{-1}(x-t)\right) d^n t = \mathbb{E}_{s\sim\mathcal{N}(0,I)}\big[|u^\top s|\big] = \mathbb{E}_{v\sim\mathcal{N}(0,1)}\big[|v|\big] = \sqrt{\frac{2}{\pi}}.$$

With Proposition 6, we obtain a Lipschitz constant for a Gaussian mixture smoothing distribution as follows.

Proposition 7. $g_\mathcal{G}$ is $\sqrt{2/\pi}$-Lipschitz under the $\|\delta\|_{B^{-1},2}$ norm, where $B^{-1} = \sum_{i=1}^K \alpha_i \Sigma_i^{-1}$.

Proof.
$$|g_\mathcal{G}(x+\delta) - g_\mathcal{G}(x)| \le \sum_i \alpha_i\, |g_{\Sigma_i}(x+\delta) - g_{\Sigma_i}(x)| \le \sqrt{\frac{2}{\pi}} \sum_i \alpha_i\, \|\delta\|_{\Sigma_i^{-1},2} \le \sqrt{\frac{2}{\pi}}\, \|\delta\|_{B^{-1},2},$$
obtained by first applying the triangle inequality, then Proposition 6, followed by Jensen's inequality. Combining Proposition 7 and Theorem 1 yields the following certificate.

Corollary 3. Let $c_A = \arg\max_i g^i_\mathcal{G}(x)$. Then $\arg\max_i g^i_\mathcal{G}(x + \delta) = c_A$ for all $\delta$ satisfying:
$$\|\delta\|_{B^{-1},2} < \frac{1}{2}\sqrt{\frac{\pi}{2}}\left(g^{c_A}_\mathcal{G}(x) - \max_{c \ne c_A} g^{c}_\mathcal{G}(x)\right),$$
where $B^{-1} = \sum_{i=1}^K \alpha_i \Sigma_i^{-1}$.
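A small sketch (our illustration, assuming diagonal covariances; the smoothed outputs g_top and g_runner_up are assumed to be available from elsewhere) of evaluating the Corollary 3 certificate:

import numpy as np

def mixture_certified(delta, alphas, sigma_diags, g_top, g_runner_up) -> bool:
    """alphas: mixture weights; sigma_diags: list of arrays holding the
    diagonals of each Sigma_i; g_top/g_runner_up: smoothed classifier scores."""
    # diagonal of B^{-1} = sum_i alpha_i * Sigma_i^{-1}
    B_inv_diag = sum(a / s for a, s in zip(alphas, sigma_diags))
    # Theorem 1 with L = sqrt(2/pi) gives the scaled half-gap bound
    bound = 0.5 * np.sqrt(np.pi / 2.0) * (g_top - g_runner_up)
    return float(np.sqrt(np.sum(B_inv_diag * delta ** 2))) < bound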
C ANCER Optimization

In this section we detail the implementation choices required to solve Equation (1). For ease of presentation, we restate the ANCER optimization problem (with $\Theta^x = \mathrm{diag}(\{\theta^x_i\}_{i=1}^n)$):
$$\arg\max_{\Theta^x} \; r_p(x, \Theta^x)\, \sqrt[n]{\prod_i \theta^x_i} \quad \text{s.t.} \;\; \min_i \theta^x_i \cdot r_p(x, \Theta^x) \ge r^*_{\text{iso}},$$
where $r_p(x, \Theta^x)$ is the gap value under the anisotropic smoothing distribution, and $r^*_{\text{iso}}$ is the optimal isotropic radius, i.e. $\theta^x r_p(x, \theta^x)$ for $\theta^x \in \mathbb{R}^+$. This is a nonlinear constrained optimization problem that is challenging to solve. As such, we relax it, and solve instead:
$$\arg\max_{\Theta^x} \; r_p(x, \Theta^x)\, \sqrt[n]{\prod_i \theta^x_i} + \kappa \left(\min_i \theta^x_i \cdot r_p(x, \Theta^x)\right) \quad \text{s.t.} \;\; \theta^x_i \ge \theta^x,$$
given a hyperparameter $\kappa \in \mathbb{R}^+$. While the constraint $\theta^x_i \ge \theta^x$ is not explicitly required to enforce the superset condition over the isotropic case, it proved beneficial from an empirical perspective. To sample from the distribution parameterized by $\Theta^x$ (in our case, either a Gaussian or a Uniform), we make use of the reparameterization trick, as in Alfarra et al. (2022). The solution of this optimization problem can be found iteratively by performing projected gradient ascent. A standalone implementation of the ANCER optimization stage is presented in Listing 1, whereas the full code integrated in our code base is available as supplementary material. To perform certification, we simply feed the output of this optimization to the certification procedure from Cohen et al. (2019).

import numpy as np
import torch
from scipy.stats import norm
from torch.autograd import Variable
from torch.distributions.normal import Normal


class Certificate:
    def compute_proxy_gap(self, logits: torch.Tensor):
        raise NotImplementedError

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        raise NotImplementedError

    def compute_gap(self, pABar: float):
        raise NotImplementedError


class L2Certificate(Certificate):
    def __init__(self, batch_size: int, device: str = "cuda:0"):
        self.m = Normal(torch.zeros(batch_size).to(device),
                        torch.ones(batch_size).to(device))
        self.device = device
        self.norm = "l2"

    def compute_proxy_gap(self, logits: torch.Tensor):
        return self.m.icdf(logits[:, 0].clamp_(0.001, 0.999)) - \
            self.m.icdf(logits[:, 1].clamp_(0.001, 0.999))

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        return torch.randn_like(batch, device=self.device) * repeated_theta

    def compute_gap(self, pABar: float):
        return norm.ppf(pABar)


class L1Certificate(Certificate):
    def __init__(self, device="cuda:0"):
        self.device = device
        self.norm = "l1"

    def compute_proxy_gap(self, logits: torch.Tensor):
        return logits[:, 0] - logits[:, 1]

    def sample_noise(self, batch: torch.Tensor, repeated_theta: torch.Tensor):
        return 2 * (torch.rand_like(batch, device=self.device) - 0.5) * repeated_theta

    def compute_gap(self, pABar: float):
        return 2 * (pABar - 0.5)


def ancer_optimization(model: torch.nn.Module, batch: torch.Tensor,
                       certificate: Certificate, learning_rate: float,
                       isotropic_theta: torch.Tensor, iterations: int,
                       samples: int, kappa: float, device: str = "cuda:0"):
    """Optimize batch using ANCER, assuming an isotropic initialization point.

    model: trained network
    batch: inputs to certify around
    certificate: instance of the desired certification object
    learning_rate: optimization learning rate for ANCER
    isotropic_theta: initial isotropic value per input in batch
    iterations: number of iterations to run the optimization
    samples: number of samples per input and iteration
    kappa: relaxation hyperparameter
    device: device on which to run the optimization
    """
    batch_size = batch.shape[0]
    img_size = np.prod(batch.shape[1:])

    # define a variable, the optimizer, and the initial sigma values
    theta = Variable(isotropic_theta, requires_grad=True).to(device)
    optimizer = torch.optim.Adam([theta], lr=learning_rate)
    initial_theta = theta.detach().clone()

    # reshape vectors to have `samples` copies of each input in the batch
    new_shape = [batch_size * samples]
    new_shape.extend(batch[0].shape)
    new_batch = batch.repeat((1, samples, 1, 1)).view(new_shape)

    # solve iteratively by projected gradient ascent
    for _ in range(iterations):
        theta_repeated = theta.repeat(1, samples, 1, 1).view(new_shape)

        # reparameterization trick
        noise = certificate.sample_noise(new_batch, theta_repeated)
        out = model(new_batch + noise).reshape(batch_size, samples, -1).mean(dim=1)

        vals, _ = torch.topk(out, 2)
        gap = certificate.compute_proxy_gap(vals)

        prod = torch.prod((theta.reshape(batch_size, -1))**(1 / img_size), dim=1)
        proxy_radius = prod * gap

        radius_maximizer = - (
            proxy_radius.sum() +
            kappa * (torch.min(theta.view(batch_size, -1), dim=1).values * gap).sum()
        )
        optimizer.zero_grad()  # clear gradients accumulated in the previous step
        radius_maximizer.backward()
        optimizer.step()

        # project back to the feasible set theta >= initial (isotropic) theta
        with torch.no_grad():
            torch.max(theta, initial_theta, out=theta)

    return theta

Listing 1: Python implementation of the ANCER optimization routine using PyTorch (Paszke et al., 2019).
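As a hypothetical usage example of Listing 1 (the model, batch, and hyperparameter values below are illustrative placeholders, not the settings used in our experiments; the classes from Listing 1 are assumed to be in scope):

import torch

# placeholder classifier and CIFAR-10-shaped inputs, for illustration only
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
batch = torch.rand(4, 3, 32, 32)
# isotropic DD initialization: one theta per input dimension, all set to 0.5
iso_theta = 0.5 * torch.ones_like(batch)

certificate = L2Certificate(batch_size=4, device="cpu")
theta = ancer_optimization(model, batch, certificate, learning_rate=1e-2,
                           isotropic_theta=iso_theta, iterations=50,
                           samples=16, kappa=2.0, device="cpu")
print(theta.shape)  # per-pixel anisotropic smoothing parameters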
D Memory-based Certification for ANCER

To guarantee the soundness of the ANCER classifier, we use an adapted version of the data-dependent, memory-based solution presented in Alfarra et al. (2022). The modified algorithm involves a post-processing certification step that obtains adjusted certification statistics based on the memory procedure from Alfarra et al. (2022) (see the original paper for more details). We present the version of this post-processing memory-based step adapted to ANCER in Algorithm 2.

Algorithm 2: Memory-Based Certification
Input: input point $x_{N+1}$, certified region $R_{N+1}$, prediction $C_{N+1}$, and memory $M$
Result: prediction for $x_{N+1}$ and a certified region at $x_{N+1}$ that does not intersect with any conflicting certified region in $M$
for $(x_i, C_i, R_i) \in M$ do
    if $C_{N+1} \ne C_i$ then
        if $x_{N+1} \in R_i$ then
            return Abstain, $\emptyset$;
        else if MaxIntersect($R_{N+1}$, $R_i$) and Intersect($R_{N+1}$, $R_i$) then
            $R'_{N+1}$ = LargestOutSubset($R_i$, $R_{N+1}$); $R_{N+1} \leftarrow R'_{N+1}$;
end
add $(x_{N+1}, C_{N+1}, R_{N+1})$ to $M$;
return $C_{N+1}$, $R_{N+1}$;

Note that the proposed certified region $R_{N+1}$ emerges from our certification bounds presented in Sections 4.1 and 4.2. There are a few differences between our proposed Algorithm 2 and the original variant presented in Alfarra et al. (2022). The first is that we remove the computation of the largest certifiable subset of a certified region $R_{N+1}$ when there exists an $i$ such that $x_{N+1} \in R_i$ with a different class prediction (LargestInSubset in Alfarra et al. (2022)), due to the complexity of that operation in the anisotropic case; as an example, it is generally difficult to find the largest-volume ellipsoid contained in another ellipsoid. Due to this complexity, we choose to simply Abstain instead. Given the high dimensionality of the data, we never empirically encountered a certificate in this situation in our experiments. Further, to ease the computational burden of the Intersect function, we introduce the function MaxIntersect, run first, which checks whether the $\ell_p$-ball over-approximation of the region $R_{N+1}$ intersects with an $\ell_p$-ball over-approximation of $R_i$. This is sound since, when the $\ell_p$-ball over-approximations of the anisotropic regions $R_{N+1}$ and $R_i$ do not intersect, then $R_{N+1}$ and $R_i$ do not intersect either. Only in cases where those over-approximations intersect do we run the more expensive Intersect procedure. We now present practical implementations of MaxIntersect, Intersect and LargestOutSubset for the ellipsoids and generalized cross-polytopes considered in this paper.

D.1 Implementing MaxIntersect(RA, RB) in the Ellipsoid and Generalized Cross-Polytope Cases

Given the two regions $R_A$ and $R_B$, consider $\ell_p$-ball over-approximations of those regions, $R'_A = \{x \in \mathbb{R}^n : \|x - a\|_p \le r_a\}$ and $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_p \le r_b\}$, such that $R_A \subseteq R'_A$ and $R_B \subseteq R'_B$.

Lemma 1. If $\|a - b\|_p > r_a + r_b$, then $R_A \cap R_B = \emptyset$.

Proof. For the sake of contradiction, let $\|a - b\|_p > r_a + r_b$ and $x \in R'_A \cap R'_B$. Then we have that $\|x - a\|_p \le r_a$ and $\|x - b\|_p \le r_b$. However:
$$r_a + r_b < \|a - b\|_p \le \|x - a\|_p + \|x - b\|_p \le r_a + r_b,$$
forming a contradiction. Thus $R'_A \cap R'_B = \emptyset$, which in turn implies $R_A \cap R_B = \emptyset$, since $R_A$ and $R_B$ are subsets of $R'_A$ and $R'_B$, respectively.

This forms a fast maximum-intersection check for ellipsoids, i.e. $p = 2$, and generalized cross-polytopes, i.e. $p = 1$. The MaxIntersect function returns False if $\|a - b\|_p > r_a + r_b$, and True otherwise.
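A minimal sketch (ours) of MaxIntersect following Lemma 1. For a diagonal ellipsoid $(x-a)^\top A (x-a) \le 1$, the $\ell_2$ over-approximation radius is $\max_i A_{ii}^{-1/2}$; for a generalized cross-polytope $\|A(x-a)\|_1 \le 1$, the $\ell_1$ over-approximation radius is $\max_i A_{ii}^{-1}$:

import numpy as np

def over_approx_radius(A_diag: np.ndarray, p: int) -> float:
    # lp-ball over-approximation radius of the region parameterized by the
    # diagonal A: p = 2 for ellipsoids, p = 1 for generalized cross-polytopes
    return float((1.0 / A_diag.min()) ** (1.0 / p))

def max_intersect(a, b, A_diag, B_diag, p: int) -> bool:
    ra, rb = over_approx_radius(A_diag, p), over_approx_radius(B_diag, p)
    # Lemma 1: if ||a - b||_p > ra + rb, the regions provably do not intersect
    return bool(np.linalg.norm(a - b, ord=p) <= ra + rb)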
D.2 Implementing Intersect(RA, RB) in the Ellipsoid Case

The problem of efficiently checking whether two ellipsoids intersect is not trivial. We rely on the work of Ros et al. (2002) and Gilitschenski & Hanebeck (2012), supplying the proofs missing from Gilitschenski & Hanebeck (2012) for completeness.

Lemma 2. Let $R_A = \{x \in \mathbb{R}^n : (x - a)^\top A (x - a) \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : (x - b)^\top B (x - b) \le 1\}$ define two ellipsoids centered at $a$ and $b$, respectively. We have that $R = \{x : t(x - a)^\top A (x - a) + (1-t)(x - b)^\top B (x - b) \le 1\}$ for any $t \in [0, 1]$ satisfies $R_A \cap R_B \subseteq R \subseteq R_A \cup R_B$.

Proof. By considering the convex combination of the left-hand sides of the inequalities defining the regions $R_A$ and $R_B$, it becomes obvious that $x \in R_A \cap R_B \Rightarrow x \in R$, concluding the left side of the property. As for the right side, it suffices to show that if $x \notin R_A$ and $x \in R$ then $x \in R_B$ and, similarly, that if $x \notin R_B$ and $x \in R$ then $x \in R_A$. We show the first case, since the second follows by symmetry. Without loss of generality, we assume that $a = b = 0_n$. Now, let $x$ be such that $x^\top A x > 1$ and $t x^\top A x + (1-t) x^\top B x \le 1$, since $x \notin R_A$ and $x \in R$. Then $(1-t) x^\top B x \le 1 - t x^\top A x \le 1 - t$, since $x^\top A x > 1$, which implies $x^\top B x \le 1$, i.e. $x \in R_B$.

Note that the previous result holds without loss of generality for radius 1, as any radius can be absorbed into $A$ and $B$. Since the following lemma was stated by Gilitschenski & Hanebeck (2012) without proof, we complement it below for completeness.

Lemma 3. The set $R$ is equivalent to the following ellipsoid: $R = \{x : (x - m)^\top E_t (x - m) \le K(t)\}$, where $E_t = tA + (1-t)B$, $m = E_t^{-1}(tAa + (1-t)Bb)$, and $K(t) = 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m$.

Proof.
$$t(x - a)^\top A (x - a) + (1-t)(x - b)^\top B (x - b) \le 1$$
$$\iff x^\top \underbrace{(tA + (1-t)B)}_{E_t}\, x - 2 x^\top \underbrace{(tAa + (1-t)Bb)}_{E_t m} \le 1 - t a^\top A a - (1-t) b^\top B b$$
$$\iff (x - m)^\top E_t (x - m) \le 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m.$$
The last equivalence follows by adding and subtracting $m^\top E_t m$, concluding the proof.

Proposition 8. The set of points satisfying $R$ for $t \in (0, 1)$ is either an empty set, a single point, or the ellipsoid $R$.

Proof. We first observe that since $A$ and $B$ are positive definite, $E_t$ is positive definite. Then observe that for a choice of $t \in (0, 1)$ such that $K(t) < 0$, the set $R$ is empty, and since $R \supseteq R_A \cap R_B$, the two ellipsoids do not intersect. If $K(t) = 0$, then the only point satisfying $R$ is the center $m$; following a similar argument, the two ellipsoids then intersect in at most a single point. At last, for a choice of $t$ such that $K(t) > 0$, $R$ defines an ellipsoid.

As per Proposition 8, it suffices to find some $t \in (0, 1)$ under which $K(t) < 0$ to guarantee that the ellipsoids do not intersect. To that end, we solve the following convex optimization problem: $t^* = \arg\min_{t \in [0,1]} K(t)$, and check the condition $K(t^*) < 0$. Moreover, as shown by Ros et al. (2002) and Gilitschenski & Hanebeck (2012), $K(t)$ is convex in the domain $t \in (0, 1)$. With several algebraic manipulations, one can show that $K(t)$ has the following equivalent forms:
$$K(t) = 1 - t a^\top A a - (1-t) b^\top B b + m^\top E_t m,$$
$$K(t) = 1 - t(1-t)(b - a)^\top B E_t^{-1} A (b - a),$$
$$K(t) = 1 - (b - a)^\top \left(\frac{1}{1-t} B^{-1} + \frac{1}{t} A^{-1}\right)^{-1} (b - a).$$
Observe that for ANCER, both $A$ and $B$ are diagonal with diagonal elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively, resulting in the following simple form for $K(t)$:
$$K(t) = 1 - \sum_{i=1}^n (b_i - a_i)^2\, \frac{t(1-t) A_{ii} B_{ii}}{t A_{ii} + (1-t) B_{ii}}.$$
The Intersect function in the ellipsoid case returns False if there exists a $t \in (0, 1)$ such that $K(t) < 0$, i.e. the ellipsoids do not intersect, and True otherwise.
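A sketch (ours) of the resulting Intersect check for the diagonal case, minimizing the convex $K(t)$ over $(0, 1)$ with a bounded scalar optimizer:

import numpy as np
from scipy.optimize import minimize_scalar

def ellipsoids_intersect(a, b, A_diag, B_diag) -> bool:
    """Returns False only when K(t*) < 0 for the minimizer t* in (0, 1),
    i.e. when the two diagonal ellipsoids provably do not intersect."""
    d2 = (b - a) ** 2
    def K(t):
        return 1.0 - np.sum(d2 * t * (1 - t) * A_diag * B_diag /
                            (t * A_diag + (1 - t) * B_diag))
    res = minimize_scalar(K, bounds=(1e-9, 1 - 1e-9), method="bounded")
    return bool(res.fun >= 0.0)   # conservative: tangency counts as intersecting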
D.3 Implementing Intersect(RA, RB) in the Generalized Cross-Polytope Case

Let $R_A$ and $R_B$ be two generalized cross-polytopes $R_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_1 \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_1 \le 1\}$, where $A$ and $B$ are positive definite diagonal matrices with elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively. We are interested in deciding whether $R_A$ and $R_B$ intersect. However, given the conservative context in which Intersect is used in Algorithm 2, we only need to ensure that the function returns False only when it is guaranteed that $R_A \cap R_B = \emptyset$. As such, we are able to simplify the complex problem of generalized cross-polytope intersection to the much simpler one of ellipsoid over-approximation intersection. We do this by considering the over-approximating (i.e. superset) ellipsoids $R'_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_2 \le 1\}$ and $R'_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_2 \le 1\}$, and performing the ellipsoid intersection check presented in Appendix D.2. If $R'_A \cap R'_B = \emptyset$, then this implies $R_A \cap R_B = \emptyset$ and we can safely return False. Otherwise, we conservatively assume the generalized cross-polytopes intersect and return True, triggering the reduction procedure detailed in Appendix D.5.

D.4 Implementing LargestOutSubset(RA, RB) in the Ellipsoid Case

Given two ellipsoids $R_A = \{x \in \mathbb{R}^n : (x - a)^\top A (x - a) \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : (x - b)^\top B (x - b) \le 1\}$ that do intersect, where $A$ and $B$ are positive definite diagonal matrices, the task is to find the largest possible ellipsoid $R'_B$ centered at $b$ such that $R'_B \subseteq R_B$ and $R_A \cap R'_B = \emptyset$. Finding a maximum ellipsoid that satisfies those conditions is not trivial, so instead we consider a maximum enclosed $\ell_2$-ball of $R_B$, $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_2 \le r\}$, that does not intersect $R_A$. To obtain this ball, we project the center $b$ of $R_B$ onto the ellipsoid $R_A$. In particular, we formulate the problem as the projection of a vector $y = b - a$ onto an ellipsoid with the same shape as $R_A$ centered at $0_n$. This is equivalent to solving the following optimization problem for a symmetric positive definite matrix $A$:
$$\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \;\; x^\top A x \le 1.$$
Note that the objective function is convex, and the constraint forms a convex set. Forming the Lagrangian for this problem, we obtain:
$$\mathcal{L}(x, \lambda) = \frac{1}{2}\|x - y\|_2^2 + \lambda\big(x^\top A x - 1\big), \quad \lambda > 0.$$
Therefore, the global optimal solution must satisfy the KKT conditions below:
$$\nabla_x \mathcal{L} = 0 \;\Rightarrow\; x^* = (2\lambda A + I)^{-1} y, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 0 \;\Rightarrow\; \underbrace{y^\top (2\lambda A + I)^{-1} A (2\lambda A + I)^{-1} y - 1}_{f(\lambda)} = 0.$$
Thus, to project the vector $y$ onto the ellipsoid characterized by $A$, one needs to solve the scalar equation $f(\lambda) = 0$ and then substitute back into the formula for $x^*$. Further, given $A = \mathrm{diag}(A_{11}, \dots, A_{nn})$, we can simplify the problem to:
$$f(\lambda) = \sum_{i=1}^n \frac{y_i^2 A_{ii}}{(1 + 2\lambda A_{ii})^2} - 1 = 0.$$
Once $x^*$ is obtained, we can define the maximum radius of the $\ell_2$-ball centered at $b$ that does not intersect $R_A$ as:
$$r^* = \|(x^* + a) - b\|_2 - \epsilon,$$
for an arbitrarily small $\epsilon$. Finally, we obtain $R'_B$ as the maximum ball contained within $R_B$ that has a radius smaller than $r^*$, that is:
$$R'_B = \left\{x \in \mathbb{R}^n : \|x - b\|_2 \le \min\big\{r^*, \min_i B_{ii}^{-1/2}\big\}\right\}.$$
Note that while choosing the radius of $R'_B$ to be $r^*$ guarantees that $R'_B \cap R_A = \emptyset$, this does not guarantee that $R'_B \subseteq R_B$. To guarantee both properties, we take the minimum of $r^*$ and $\min_i B_{ii}^{-1/2}$, the radius of the largest ball inscribed in $R_B$. This approach finds the solution to the projection of the point onto the ellipsoid $\{x \in \mathbb{R}^n : x^\top A x \le 1\}$; it does not work in the case where $b \in R_A$, since the problem would be trivially solved by setting $x^* = y$. Thus, our classifier must abstain in that situation.
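A sketch (ours) of the projection step for diagonal $A$: $f(\lambda)$ is monotonically decreasing for $\lambda \ge 0$, so its root can be bracketed and bisected, after which $x^* = (2\lambda A + I)^{-1}y$ is recovered; the radius $r^* = \|(x^* + a) - b\|_2 - \epsilon$ then follows as above:

import numpy as np

def project_onto_ellipsoid(y: np.ndarray, A_diag: np.ndarray, iters: int = 100) -> np.ndarray:
    """Project y = b - a onto {x : sum_i A_ii x_i^2 <= 1} (diagonal A),
    assuming y lies strictly outside the ellipsoid."""
    def f(lam):
        return np.sum(y ** 2 * A_diag / (1.0 + 2.0 * lam * A_diag) ** 2) - 1.0
    lo, hi = 0.0, 1.0
    while f(hi) > 0:          # grow the bracket until the sign changes
        hi *= 2.0
    for _ in range(iters):    # bisection on the monotone scalar equation
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return y / (1.0 + 2.0 * lam * A_diag)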
D.5 Implementing LargestOutSubset(RA, RB) in the Generalized Cross-Polytope Case

Let $R_A$ and $R_B$ be two generalized cross-polytopes $R_A = \{x \in \mathbb{R}^n : \|A(x - a)\|_1 \le 1\}$ and $R_B = \{x \in \mathbb{R}^n : \|B(x - b)\|_1 \le 1\}$, where $A$ and $B$ are positive definite diagonal matrices with elements $\{A_{ii}\}_{i=1}^n$ and $\{B_{ii}\}_{i=1}^n$, respectively. The task is to find the largest possible generalized cross-polytope $R'_B$ centered at $b$ such that $R'_B \subseteq R_B$ and $R_A \cap R'_B = \emptyset$.

As with the ellipsoid case, solving this problem for a generalized cross-polytope is not trivial, so instead we consider a maximum enclosed cross-polytope (i.e. $\ell_1$-ball) $R'_B = \{x \in \mathbb{R}^n : \|x - b\|_1 \le r\}$ that does not intersect $R_A$ and is a subset of $R_B$. To obtain this $\ell_1$-ball, we project the center $b$ of $R_B$ onto the generalized cross-polytope $R_A$, in a similar fashion to the ellipsoid case in Appendix D.4. We formulate the problem as the projection of the vector $y = b - a$ onto the $0_n$-centered generalized cross-polytope $\{x \in \mathbb{R}^n : \|Ax\|_1 \le 1\}$.

Lemma 4. Consider the hyperplane $H = \{x \in \mathbb{R}^n : w^\top x - k = 0\}$ and a point $y \in \mathbb{R}^n$. The $\ell_2$ projection of $y$ onto the hyperplane is the point
$$x^* = y - \frac{(w^\top y - k)\, w}{\|w\|_2^2}.$$

Proof. We define the projection problem in a similar fashion to the ellipsoid case:
$$\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \;\; w^\top x - k = 0,$$
and obtain the Lagrangian $\mathcal{L}(x, \lambda) = \frac{1}{2}\|x - y\|_2^2 + \lambda(w^\top x - k)$, from which we get (using the KKT conditions) $x^* = y - \lambda^* w$ and $\lambda^* = (w^\top y - k)/\|w\|_2^2$, thus obtaining $x^* = y - (w^\top y - k)w/\|w\|_2^2$.

While this formulation does not yield the closest point on a hyperplane when measured with the $\ell_1$ norm, the fact that $\|x - x^*\|_1 \ge \|x - x^*\|_2$ implies that the certified set obtained in the $\ell_1$ norm via this method is a subset of the $\ell_2$-ball of the minimum-projection point. Crucially, this $\ell_2$ projection has the advantage of a closed-form solution, whereas an $\ell_1$ projection would require an iterative linear programming solver. As such, for the sake of computational efficiency, we decided to use this projection despite the sub-optimality of the result from the $\ell_1$ perspective. Empirically, we found this does not affect our results.

Since the set of vertices of the generalized cross-polytope $\{x \in \mathbb{R}^n : \|Ax\|_1 \le 1\}$ is given by $\{e_i/A_{ii}, -e_i/A_{ii}\}_{i=1}^n$, and considering the distance between the projections and the original $y$, the facet hyperplane that minimizes it is defined by the set of vertices $\{\mathrm{sign}(y_i)\, e_i/A_{ii}\}_{i=1}^n$. Writing this as a system of $n$ equations, we obtain the hyperplane defined by $w = [\mathrm{sign}(y_1) A_{11}, \dots, \mathrm{sign}(y_n) A_{nn}]^\top$ and $k = 1$. Finally, after computing $x^*$ as per Lemma 4, we can define the maximum radius of the $\ell_1$-ball centered at $b$ that does not intersect $R_A$ as:
$$r^* = \|(x^* + a) - b\|_1 - \epsilon,$$
for an arbitrarily small $\epsilon$. Finally, and similar to the ellipsoid case, we obtain $R'_B$ as the maximum generalized cross-polytope contained within $R_B$ that has a radius smaller than $r^*$, that is:
$$R'_B = \left\{x \in \mathbb{R}^n : \|x - b\|_1 \le \min\big\{r^*, \min_i B_{ii}^{-1}\big\}\right\}.$$
Similar to before, to guarantee that the $\ell_1$-ball $R'_B$ is still a subset of $R_B$, we take the minimum of $r^*$ and $\min_i B_{ii}^{-1}$ as the radius of $R'_B$. As with the ellipsoid case, this approach does not work in the case where $b \in R_A$, since the assumption of the closest facet to $y$ would not hold. Thus, our classifier must abstain in that situation.
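The closed-form projection of Lemma 4, specialized to the facet selected by the signs of $y$, can be sketched as follows (our illustration):

import numpy as np

def project_onto_cross_polytope_facet(y: np.ndarray, A_diag: np.ndarray) -> np.ndarray:
    """l2-project y = b - a onto the facet hyperplane w^T x = 1 of
    {x : ||A x||_1 <= 1}, with w_i = sign(y_i) * A_ii (Lemma 4)."""
    w = np.sign(y) * A_diag
    return y - (w @ y - 1.0) * w / np.dot(w, w)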
E Experimental Setup

The experiments reported in the paper used the CIFAR-10 (Krizhevsky, 2009)³ and ImageNet (Deng et al., 2009)⁴ datasets, and trained ResNet18, WideResNet40 and ResNet50 networks (He et al., 2016). Experiments used the typical data split for these datasets found in the PyTorch implementation (Paszke et al., 2019). The procedures used to obtain the baseline networks are detailed in Appendices E.1 and E.2 for ellipsoids and generalized cross-polytopes, respectively. Source code to reproduce the ANCER optimization and certification results of this paper is available as supplementary material.

Isotropic DD Optimization. We used the available code of Alfarra et al. (2022)⁵ to obtain the isotropic data-dependent smoothing parameters. To train our models from scratch, we used an adapted version of the code provided in the same repository.

³Available here (url), under an MIT license.
⁴Available here (url), terms of access detailed in the Download page.
⁵Data Dependent Randomized Smoothing source code available here.

Certification. Following Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019); Yang et al. (2020); Alfarra et al. (2022), all results were certified with N₀ = 100 Monte Carlo samples for selection, N = 100,000 samples for estimation, and a failure probability of α = 0.001.
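As an illustration of how these statistics are used (a sketch under the assumption of the standard Cohen-style Clopper-Pearson bound, not our exact certification code), the N estimation samples yield a lower confidence bound p̄A, which is fed through the certificate's gap together with the optimized Θ to produce the proxy radius of Section 5:

import numpy as np
from statsmodels.stats.proportion import proportion_confint

def certified_proxy_radius(n_top: int, n_total: int, alpha: float,
                           theta: np.ndarray, certificate) -> float:
    """n_top: Monte Carlo votes for the top class out of n_total samples;
    theta: optimized anisotropic parameters; certificate: as in Listing 1."""
    # one-sided lower confidence bound on the top-class probability
    p_a_bar = proportion_confint(n_top, n_total, alpha=2 * alpha, method="beta")[0]
    if p_a_bar < 0.5:
        return 0.0  # abstain
    gap = certificate.compute_gap(p_a_bar)
    # proxy radius: gap scaled by the geometric mean of the theta entries
    return float(gap * np.prod(theta ** (1.0 / theta.size)))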
E.1 Ellipsoid certification baseline networks

In terms of ellipsoid certification, the baselines we considered were Cohen (Cohen et al., 2019)⁶, SmoothAdv (Salman et al., 2019a)⁷ and MACER (Zhai et al., 2019)⁸. In the CIFAR-10 experiments, we used a ResNet18 architecture instead of the ResNet110 used in Cohen et al. (2019); Salman et al. (2019a); Zhai et al. (2019), due to constraints on computational power. As such, we trained each of the networks from scratch following the procedures available in the source code of each baseline. We did so under our own framework, and the training scripts are available in the supplementary material. For the ImageNet experiments, we used the ResNet50 networks provided by each of the baselines in their respective open-source repositories. We trained the ResNet18 networks for 120 epochs, with a batch size of 256 and stochastic gradient descent with a learning rate of 10⁻² and momentum of 0.9.

E.2 Generalized cross-polytope certification baseline networks

For the certification of generalized cross-polytopes we considered RS4A (Yang et al., 2020)⁹. As described in Yang et al. (2020), we take λ = √3·σ and report results as a function of σ for ease of comparison. As with the baseline, we ran CIFAR-10 experiments on a WideResNet40 architecture and ImageNet experiments on a ResNet50 (Yang et al., 2020). However, due to limited computational power, we were not able to run experiments on the wide range of distributional parameters the original work considers, i.e. σ ∈ {0.15, 0.25, 0.5, 0.75, 1.0, 1.125, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5} on CIFAR-10 and σ ∈ {0.25, 0.5, 0.75, 1.0, 1.125, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5} on ImageNet. Instead, and matching the requirements of the ellipsoid section, we chose the subset σ ∈ {0.25, 0.5, 1.0} and performed our analysis at that level. While trained models are available in the source code of RS4A, we ran into several issues when attempting to use them, the most problematic being that the clean accuracy of those models was very low for both the WideResNet40 and ResNet50. To avoid these issues we trained the models from scratch, using the stability training loss as presented in the source code of RS4A; all of these models achieved clean accuracy of over 70%. Following the procedures described in the original work, we trained the WideResNet40 models with the stability loss of Yang et al. (2020) for 120 epochs, with a batch size of 128 and stochastic gradient descent with a learning rate of 10⁻², momentum of 0.9, and a step learning rate scheduler with γ = 0.1. For the ResNet50 networks on ImageNet, we trained them from scratch with the stability loss for 90 epochs, with a learning rate of 0.1 that drops by a factor of 0.1 every 30 epochs and a batch size of 256.

F Superset argument

The results we present in Section 7 support the argument that ANCER achieves, in general, a certificate that is a superset of the Fixed σ and Isotropic DD ones. To confirm this at the level of individual test set samples, we compare the ℓ2, ℓ1, ℓΣ2 and ℓΛ1 certification results across the different methods, and obtain the percentage of the test set on which ANCER performs at least as well as all other methods for each certificate. Results of this analysis are presented in Tables 4 and 5.

⁶Cohen source code available here.
⁷SmoothAdv source code available here.
⁸MACER source code available here.
⁹RS4A source code available here.

For most networks and datasets, we observe that ANCER achieves a larger ℓp certificate than the baselines on a significant portion of the dataset, showcasing the fact that it obtains a superset of the isotropic region per sample. This is further confirmed by the comparison of the anisotropic certificates, in which, for all trained networks except MACER on CIFAR-10, ANCER's certificate is superior on over 90% of the test set samples.

Table 4: Superset in top-1 ℓ2 and ℓΣ2 (rounded to nearest percent)

                        % ANCER ℓ2 is the best    % ANCER ℓΣ2 is the best
CIFAR-10: Cohen                 83                          93
CIFAR-10: SmoothAdv             73                          90
CIFAR-10: MACER                 50                          69
ImageNet: Cohen                 94                          96
ImageNet: SmoothAdv             90                          93

Table 5: Superset in top-1 ℓ1 and ℓΛ1 (rounded to nearest percent)

                        % ANCER ℓ1 is the best    % ANCER ℓΛ1 is the best
CIFAR-10: RS4A                 100                         100
ImageNet: RS4A                  97                          99

G Experimental Results per σ

G.1 Certifying Ellipsoids - ℓ2 and ℓΣ2 certification results per σ

In this section we report certified accuracy at various ℓ2 radii and ℓΣ2 proxy radii, following the metrics defined in Section 7, for each training method (Cohen (Cohen et al., 2019), SmoothAdv (Salman et al., 2019a) and MACER (Zhai et al., 2019)), dataset (CIFAR-10 and ImageNet) and σ ∈ {0.12, 0.25, 0.5, 1.0}. Figures 7 and 8 show certified accuracy at different ℓ2 radii for CIFAR-10 and ImageNet, respectively, whereas Figures 9 and 10 plot certified accuracy at different ℓΣ2 proxy radii for CIFAR-10 and ImageNet, respectively.

G.2 Certifying Generalized Cross-Polytopes - ℓ1 and ℓΛ1 certification results per σ

In this section we report certified accuracy at various ℓ1 radii and ℓΛ1 proxy radii, following the metrics defined in Section 7, for RS4A, each dataset (CIFAR-10 and ImageNet) and σ ∈ {0.25, 0.5, 1.0}. Figures 11 and 12 show certified accuracy at different ℓ1 radii for CIFAR-10 and ImageNet, respectively, whereas Figures 13 and 14 plot certified accuracy at different ℓΛ1 proxy radii for CIFAR-10 and ImageNet, respectively.
H Visual Comparison of Parameters in Ellipsoid Certificates

Anisotropic certification allows for a better characterization of the decision boundaries of the base classifier f. For example, the directions aligned with the major axes of the ellipsoid ‖δ‖Σ,2 = r, i.e. locations where Σ is large, are, by definition, expected to be less sensitive to perturbations than the minor-axis directions. To visualize this concept, Figure 15 shows CIFAR-10 images along with their corresponding optimized ℓ2 isotropic parameters obtained by Isotropic DD, and ℓΣ2 anisotropic parameters obtained by ANCER.

First, we note the richness of information provided by the anisotropic parameters when compared to the worst-case, isotropic ℓ2 one. Interestingly, pixel locations where the intensity of Σ is large (higher intensity in Figure 15) are generally the ones corresponding least to the underlying true class and overlapping more with background pixels.

A particular insight one can get from ANCER certification is that the decision boundaries are not distributed isotropically around each input. To quantify this in higher dimensions, we plot in Figure 16 a histogram of the ratio between the maximum and minimum elements of our optimized smoothing parameters for the experiments on SmoothAdv (with an initial σ = 1.0) on CIFAR-10. We note that this ratio can be as high as 5 for some of the input points, meaning the decision boundaries in those cases could be 5 times closer to a given input in some directions than in others.

Figure 7: CIFAR-10 certified accuracy as a function of ℓ2 radius, per model (Cohen, SmoothAdv, MACER) and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 8: ImageNet certified accuracy as a function of ℓ2 radius, per model (Cohen, SmoothAdv) and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]
Figure 9: CIFAR-10 certified accuracy as a function of ℓΣ2 proxy radius, per model and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 10: ImageNet certified accuracy as a function of ℓΣ2 proxy radius, per model and σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

I Non data-dependent Anisotropic Certification

As mentioned briefly in Section 6, it is our intuition that anisotropic certification requires a data-dependent approach, as different points will have fairly different decision boundaries and the certified regions will extend in different directions (as exemplified in Figure 1). To validate this claim, we perform certification of SmoothAdv (Salman et al., 2019a) with an initial σ = 1 on CIFAR-10 using a Σ which is the average of all the optimized Σx. The certified accuracy, ℓ2 ACR and ℓΣ2 proxy ACR results are presented in Table 6, along with the same results for the methods reported in the main paper. As can be observed, moving away from data-dependent certification in the anisotropic scenario leads to a significant performance drop in terms of robustness.
Figure 11: CIFAR-10 certified accuracy as a function of ℓ1 radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 12: ImageNet certified accuracy as a function of ℓ1 radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

J Theoretical and Empirical Comparison with Mohapatra et al. (2020)

In terms of theoretical results, the certified regions of Mohapatra et al. (2020) unfortunately do not exhibit a closed-form solution similar to ours; thus, a direct theoretical volume-bound comparison is not possible. As for the empirical comparison, ANCER's performance on both ℓ2 and ℓ1 certificates far exceeds that of Mohapatra et al. (2020). For example, for ℓ2 certificates at a radius of 0.5, Cohen certified with ANCER achieves 77% certified accuracy (see Table 1), while Mohapatra et al. (2020) achieves under 60% certified accuracy; note that Mohapatra et al. (2020) provides only a marginal improvement over Cohen et al. As for the ℓ1 certificates, Mohapatra et al. (2020) uses the Gaussian distribution of Cohen et al., resulting in worse performance than the existing ℓ1 state of the art (Yang et al., 2020), which uses a uniform distribution; our approach improves further upon the performance of Yang et al. (2020). For example, as per Table 2, RS4A with ANCER certification achieves 84% certified accuracy at an ℓ1 radius of 0.5, while Yang et al. (2020) achieves 75% and Mohapatra et al. (2020) achieves below 60%. However, we believe that the combination of both approaches, ANCER and Mohapatra et al. (2020), could further boost performance, as also hinted at in the abstract of Mohapatra et al. (2020) regarding the use of data-dependent smoothing.

Figure 13: CIFAR-10 certified accuracy as a function of ℓΛ1 proxy radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]
Figure 14: ImageNet certified accuracy as a function of ℓΛ1 proxy radius per σ (used as initialization in the isotropic data-dependent case and ANCER). [Plots omitted; each panel's legend reports the ACR of Fixed σ, Isotropic DD and ANCER.]

Figure 15: Visualization of an input CIFAR-10 image x (top), and the optimized parameters σ (middle) and Σ (bottom); higher intensity corresponds to a higher σi in that pixel and channel of the smoothing distribution in the isotropic and anisotropic cases, respectively. [Images omitted.]

Figure 16: Distribution of the maximum over the minimum ANCER σx at each dataset point, for SmoothAdv (Salman et al., 2019a) on CIFAR-10 (with initial σ = 1.0). [Plot omitted; the x-axis is σx_max/σx_min, ranging from 1 to 5.]

Table 6: Comparison of different certification methods on SmoothAdv with an initial σ = 1.0 on CIFAR-10.

CIFAR-10 SmoothAdv    Accuracy @ ℓ2 radius (%)                       ℓ2 ACR    ℓΣ2 proxy ACR
                      0.0   0.25   0.5   1.0   1.5   2.0   2.5
Fixed σ               45    40     35    25    16    9     5         0.565     0.565
Isotropic DD          41    39     36    29    21    14    7         0.694     0.694
ANCER                 44    43     41    35    26    15    8         0.871     0.992
Average Σ             29    25     21    14    9     5     2         0.329     0.379