# Scalable Certified Segmentation via Randomized Smoothing

Marc Fischer 1, Maximilian Baader 1, Martin Vechev 1

Abstract

We present a new certification method for image and point cloud segmentation based on randomized smoothing. The method leverages a novel scalable algorithm for prediction and certification that correctly accounts for multiple testing, necessary for ensuring statistical guarantees. The key to our approach is reliance on established multiple-testing correction mechanisms as well as the ability to abstain from classifying single pixels or points while still robustly segmenting the overall input. Our experimental evaluation on synthetic data and challenging datasets, such as Pascal Context, Cityscapes, and ShapeNet, shows that our algorithm can achieve, for the first time, competitive accuracy and certification guarantees on real-world segmentation tasks. We provide an implementation at https://github.com/eth-sri/segmentation-smoothing.

1. Introduction

Semantic image segmentation and point cloud part segmentation are important problems in many safety-critical domains, including medical imaging (Perone et al., 2018) and autonomous driving (Deng et al., 2017). However, deep learning models used for segmentation are vulnerable to adversarial attacks (Xie et al., 2017; Arnab et al., 2018; Xiang et al., 2019), preventing their application to such tasks. This vulnerability is illustrated in Fig. 1, where the task is to segment the (adversarially attacked) image shown in Fig. 1a. We see that the segmentation of the adversarially attacked image is very different from the ground truth depicted in Fig. 1b, potentially causing unfavorable outcomes.
While provable robustness to such adversarial perturbations is well studied for classification (Katz et al., 2017; Gehr et al., 2018; Wong & Kolter, 2018; Cohen et al., 2019), the investigation of certified segmentation has only recently begun (Lorenz et al., 2021; Tran et al., 2021).

1 Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Marc Fischer.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. (a) Attacked image; (b) ground truth segmentation; (c) attacked segmentation; (d) certified segmentation. In semantic segmentation, a model segments an input (a) by classifying each pixel. While the result should match (b), a non-robust model predicts (c), as the input (a) was perturbed by additive ℓ2 noise (PGD). Our model is certifiably robust to this perturbation (d) by abstaining on ambiguous pixels (white). The model abstains where multiple classes meet, causing ambiguity. We provide technical details and further examples in App. C.

Certifiably robust segmentation is a challenging task, as the classification of each component (e.g., a pixel in an image) needs to be certified simultaneously. Many datasets and models in this domain are beyond the reach of current deterministic verification methods, while probabilistic methods need to account for accumulating uncertainty over the large number of individual certifications. In this work we propose a novel method to certify the robustness of segmentation models via randomized smoothing (Cohen et al., 2019), a probabilistic certification method able to certify ℓ2 robustness around large images. As depicted in Fig. 1d, our method enables the certification of challenging segmentation tasks by certifying each component individually, abstaining on unstable components that cause naive algorithms to fail.
This abstention mechanism also provides strong synergies with the multiple testing correction required for the soundness of our approach, thus enabling high certification rates. While we focus our evaluation on ℓ2 robustness, our method is general and can also be combined with randomized smoothing methods that certify robustness to other ℓp-bounded attacks or parametrized transformations like rotations (Fischer et al., 2020).

Main Contributions Our key contributions are:

- We investigate the obstacles of scaling randomized smoothing from classification to segmentation, identifying two key challenges: the influence of single bad components and the multiple testing trade-off (§4).
- We introduce a scalable algorithm that addresses these issues, allowing, for the first time, the certification of large-scale segmentation models (§5).
- We show that this algorithm can be applied to different generalizations of randomized smoothing, enabling defenses against different attacker models (§5.2).
- We provide an extensive evaluation on semantic image segmentation and point cloud part segmentation, achieving up to 88% and 55% certified pixel accuracy on Cityscapes and Pascal Context respectively, while obtaining mIoU of 0.6 and 0.2 (§6).

2. Related Work

In the following, we survey the most relevant related work.

Adversarial Attacks Biggio et al. (2013); Szegedy et al. (2014) discovered adversarial examples: inputs perturbed in a way that preserves their semantics but fools deep networks. To improve performance on these inputs, training can be extended to include adversarial examples, called adversarial training (Kurakin et al., 2017; Madry et al., 2018). Most important to this work are attacks on semantic segmentation: Xie et al. (2017) introduced a targeted, gradient-based, unconstrained attack that maximizes the summed individual targeted losses over all pixels. Arnab et al.
(2018) applied FGSM-based attacks (Goodfellow et al., 2015) to the semantic segmentation setting. Similarly, the vulnerability of point cloud classifiers was exposed in Xiang et al. (2019); Liu et al. (2019); Sun et al. (2020).

Certified Robustness and Defenses Adversarially trained neural networks, however, do not come with robustness guarantees. To address this, various certification methods have been proposed recently, relying upon SMT solvers (Katz et al., 2017; Ehlers, 2017), semidefinite programming (Raghunathan et al., 2018a), and linear relaxations (Gehr et al., 2018; Zhang et al., 2018; Wang et al., 2018; Weng et al., 2018; Wong & Kolter, 2018; Singh et al., 2019b;a). Specifically, linear relaxations have been used beyond the classical ℓp noise setting to certify against geometric transformations (Singh et al., 2019b; Balunovic et al., 2019; Mohapatra et al., 2020b) and vector field attacks (Ruoss et al., 2021). To further improve certification rates, methods that train networks to be certifiable have been proposed (Raghunathan et al., 2018b; Mirman et al., 2018; Gowal et al., 2018; Balunovic & Vechev, 2020). Notable for our setting are Lorenz et al. (2021); Tran et al. (2021), who extend deterministic certification to point cloud segmentation and semantic segmentation respectively. However, due to the limitations of deterministic certification, these models are small in scale. Further, Bielik & Vechev (2020); Sheikholeslami et al. (2021) improved robust classifiers with the ability to abstain from classification.

Randomized Smoothing Despite all this, deterministic certification performance on complicated datasets remained unsatisfactory. Recently, based on Li et al. (2018) and Lécuyer et al. (2019), Cohen et al. (2019) presented randomized smoothing, the first certification method to successfully certify ℓ2 robustness of large neural networks on large images. Salman et al.
(2019) improved the results by combining the smoothing training procedure with adversarial training. Yang et al. (2020) derive conditions for optimal smoothing distributions for ℓ1, ℓ2, and ℓ∞ adversaries when only label information is available. Mohapatra et al. (2020a) incorporated gradient information to improve certification radii. Zhai et al. (2020); Jeong & Shin (2020) improved the training procedure for base models by introducing regularized losses. Randomized smoothing has been extended in various ways. Bojchevski et al. (2020) proposed a certification scheme suitable for discrete data and applied it successfully to certify graph neural networks. Levine & Feizi (2020b) certified robustness against Wasserstein adversarial examples. Fischer et al. (2020) and Li et al. (2021) used randomized smoothing to certify robustness against geometric perturbations. Salman et al. (2020) showed that, using a denoiser, off-the-shelf classifiers can be turned into certifiable classifiers without retraining. Levine & Feizi (2020a) and Lin et al. (2021) presented methods to certify robustness against adversarially placed patches. Most closely related to this work are Chiang et al. (2020), who introduce median smoothing and apply it to certify object detectors, and Schuchardt et al. (2021), who also extend randomized smoothing to collective robustness certificates over multiple components. They specifically exploit the locality of the classifier with respect to the data, making their defense particularly suitable for graphs, where an attacker can only modify certain subsets. While their approach can in principle be applied to semantic segmentation, modern approaches commonly rely on global information. 3.
Randomized Smoothing for Classification

In this section we briefly review the necessary background and notation on randomized smoothing, before extending it in §4 and §5.

Algorithm 1 (adapted from Cohen et al. (2019)):

    # evaluate f̄ at x
    function PREDICT(f, σ, x, n, α):
        cnts ← SAMPLE(f, x, n, σ)
        ĉ_A, ĉ_B ← top two indices in cnts
        n_A, n_B ← cnts[ĉ_A], cnts[ĉ_B]
        if BINPVALUE(n_A, n_A + n_B, =, 0.5) ≤ α: return ĉ_A
        else: return ∅

    # certify the robustness of f̄ around x
    function CERTIFY(f, σ, x, n0, n, α):
        cnts0 ← SAMPLE(f, x, n0, σ)
        ĉ_A ← top index in cnts0
        cnts ← SAMPLE(f, x, n, σ)
        p_A ← LOWERCONFBND(cnts[ĉ_A], n, 1 − α)
        if p_A > 1/2: return prediction ĉ_A and radius σ·Φ⁻¹(p_A)
        else: return ∅

Randomized smoothing (Cohen et al., 2019) constructs a robust (smoothed) classifier f̄ from a (base) classifier f. The classifier f̄ is then provably robust to ℓ2-perturbations up to a certain radius. Concretely, for a classifier f : ℝ^m → Y and random variable ε ∼ N(0, σ²I), we define the smoothed classifier f̄ as

    f̄(x) := argmax_c P_{ε∼N(0,σ²I)}(f(x + ε) = c).    (1)

This classifier f̄ is then robust to adversarial perturbations:

Theorem 3.1 (from Cohen et al. (2019)). Suppose c_A ∈ Y and p_A, p_B ∈ [0, 1]. If

    P_ε(f(x + ε) = c_A) ≥ p_A ≥ p_B ≥ max_{c ≠ c_A} P_ε(f(x + ε) = c),

then f̄(x + δ) = c_A for all δ satisfying ‖δ‖₂ < R with R := (σ/2)(Φ⁻¹(p_A) − Φ⁻¹(p_B)).

In order to evaluate the model f̄ and calculate its robustness radius R, we need to be able to compute p_A for the input x. However, since the computation of the probability in Eq. (1) is not tractable (for most choices of f), f̄ cannot be evaluated exactly. To still query it, Cohen et al. (2019) suggest PREDICT and CERTIFY (Algorithm 1), which use Monte Carlo sampling to approximate f̄. Both algorithms utilize the procedure SAMPLE, which draws n random realizations of ε ∼ N(0, σ²I), computes f(x + ε) for each, and returns a vector of counts for each class in Y.
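To make the two routines concrete, here is a minimal Python sketch; it assumes the class counts have already been collected by SAMPLE. The helper names are ours, and the Clopper-Pearson lower bound is obtained by bisection on the exact binomial tail rather than via the beta quantile typically used in practice:

```python
import math
from statistics import NormalDist

def binom_sf(k, n, p):
    """Exact binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def two_sided_pvalue(k, n):
    """Two-sided p-value of a fair-coin test (H0: p = 0.5), as in PREDICT."""
    k = max(k, n - k)  # fold onto the larger tail; symmetric under p = 0.5
    return min(1.0, 2 * sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n)

def lower_conf_bound(k, n, alpha):
    """One-sided (1 - alpha) Clopper-Pearson-style lower bound on p,
    found by bisection: the largest p with P(X >= k | n, p) <= alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on [0, 1] to ~2^-60 precision
        mid = (lo + hi) / 2
        if binom_sf(k, n, mid) <= alpha:
            lo = mid
        else:
            hi = mid
    return lo

def predict(counts, alpha):
    """PREDICT: return the top class if it significantly beats the runner-up,
    else None (abstain). Expects a dict with at least two classes."""
    (c_a, n_a), (_, n_b) = sorted(counts.items(), key=lambda kv: -kv[1])[:2]
    return c_a if two_sided_pvalue(n_a, n_a + n_b) <= alpha else None

def certify(counts0, counts, n, sigma, alpha):
    """CERTIFY: guess the class from the first batch of counts, lower-bound
    its probability from the second; returns (class, radius), or
    (None, 0.0) to abstain."""
    c_hat = max(counts0, key=counts0.get)
    p_lower = lower_conf_bound(counts.get(c_hat, 0), n, alpha)
    if p_lower > 0.5:
        return c_hat, sigma * NormalDist().inv_cdf(p_lower)
    return None, 0.0
```

For instance, `certify({'road': 9, 'car': 1}, {'road': 990, 'car': 10}, 1000, 0.25, 0.001)` returns the class `'road'` together with a positive radius, while a near-even count split abstains.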
These samples are then used to estimate the class c_A and radius R with confidence 1 − α, where α ∈ [0, 1]. PREDICT utilizes a two-sided binomial p-value test to determine f̄(x). With probability at most α it will abstain, denoted by ∅ (a predefined value); otherwise it will produce f̄(x). CERTIFY uses the Clopper-Pearson confidence interval (Clopper & Pearson, 1934) to soundly estimate p_A and then invokes Theorem 3.1 with p_B = 1 − p_A to obtain R. If CERTIFY returns a class other than ∅ and a radius R, then with probability 1 − α the guarantee in Theorem 3.1 holds for this R. The certification radius R increases if (i) the value of p_A increases, which happens if the classifier f is more robust to noise, (ii) the number of samples n used to estimate it increases, or (iii) α increases, which means that the confidence decreases.

4. Randomized Smoothing for Segmentation

To show how randomized smoothing can be applied in the segmentation setting, we first discuss the mathematical problem formulation and two direct adaptations of randomized smoothing to the problem. By outlining how these fail in practice, we determine the two key challenges preventing their success. Then, in §5, we address these challenges.

Segmentation Given an input x = {x_i}_{i=1}^N of N components x_i ∈ X (e.g., points or pixels) and a set of possible classes Y, segmentation can be seen as a function f : X^N → Y^N. That is, to each x_i we assign a class f_i(x) = y_i ∈ Y, where f_i denotes the i-th component of the output of f invoked on input x. Here we assume X := ℝ^m. Unless specified otherwise, we will use m = 3, as this allows for RGB color pixels as well as 3d point clouds.

Direct Approaches To apply randomized smoothing, as introduced in §3, to segmentation, we can reduce it to one or multiple classification problems. We can recast segmentation f : X^N → Y^N as a classification problem by considering the cartesian product of the co-domain, V := ∏_{i=1}^N Y, and a new function f′ : X^N → V that performs classification.
Thus we can apply CERTIFY (Algorithm 1) to the base classifier f′. This provides a mathematically sound mapping of segmentation to classification. However, a change in the classification of a single component x_i changes the overall class in V, making it hard to find a majority class ĉ_A with high p_A in practice. We refer to this method as JOINTCLASS, short for "joint classification".

Alternatively, rather than considering all components at the same time, we can also classify each component individually. To this end, we let f_i(x) denote the i-th component of f(x) and apply CERTIFY (Algorithm 1) N times, evaluating each f_i(x) to obtain classes ĉ_{A,1}, …, ĉ_{A,N} and radii R_1, …, R_N. The overall radius is then given as R = min_i R_i, and a single abstention causes an overall abstention. To reduce evaluation cost, we can reuse the same input samples for all components of the output vector, that is, sample f(x) rather than individual f_i(x). We refer to this method as INDIVCLASS. Further, the result of each call to CERTIFY only holds with probability 1 − α. Thus, by the union bound, the overall correctness is bounded by

    1 − P(⋁_i i-th test incorrect) ≥ 1 − min(∑_i P(i-th test incorrect), 1) ≥ 1 − min(Nα, 1),

which for large N quickly becomes problematic. This can be compensated by carrying out calls to CERTIFY with α′ = α/N, which becomes prohibitively expensive, as n needs to be increased for the procedure to not abstain.

Key Challenges These direct applications of randomized smoothing to segmentation suffer from multiple problems, which can be reduced to two challenges:

Bad Components: Both algorithms can be forced to abstain or report a small radius by a single bad component x_i for which the base classifier is unstable.
Multiple Testing Trade-off: Any algorithm that, like INDIVCLASS, reduces segmentation certification to multiple stochastic tests (such as CERTIFY) suffers from the multiple testing problem. As outlined before, if each partial result only holds with probability 1 − α, then the overall confidence decays, by the union bound, linearly in the number of tests. Thus, to remain sound, one is forced to choose between scalability and low confidence.

5. Scalable Certified Segmentation

We now introduce our algorithm for certified segmentation by addressing these challenges. In particular, we sidestep the bad-component issue and allow for a more favorable trade-off in the multiple-testing setting. To limit the impact of bad components on the overall result, we introduce a threshold τ ∈ [1/2, 1) and define a model f̄^τ : X^N → Ŷ^N, with Ŷ = Y ∪ {∅}, that abstains if the probability of the top class for component x_i is below τ:

    f̄^τ_i(x) = c_{A,i}  if P_{ε∼N(0,σ²I)}(f_i(x + ε) = c_{A,i}) > τ,  and ∅ otherwise,

where c_{A,i} = argmax_{c∈Y} P_{ε∼N(0,σ²I)}(f_i(x + ε) = c). This means that on components with fluctuating classes, the model f̄^τ does not need to commit to a class. For the model f̄^τ, we obtain a safety guarantee similar to the original theorem by Cohen et al. (2019):

Theorem 5.1. Let I_x = {i | f̄^τ_i(x) ≠ ∅, i ∈ 1, …, N} denote the set of non-abstain indices for f̄^τ(x). Then, f̄^τ_i(x + δ) = f̄^τ_i(x) for all i ∈ I_x and all δ ∈ ℝ^{N·m} with ‖δ‖₂ < R := σΦ⁻¹(τ).

Proof. We consider f̄^τ_1, …, f̄^τ_N independently. With p_{A,i} := P_{ε∼N(0,σ²I)}(f_i(x + ε) = c_{A,i}) > τ, we invoke Theorem 3.1 with p_A = τ for f̄^τ and obtain robustness radius R := σΦ⁻¹(τ) for f̄^τ_i(x). This holds for all i where the class probability p_{A,i} > τ, denoted by the set I_x.

Algorithm 2 (algorithm for certification and prediction):

    # evaluate f̄^τ at x
    function SEGCERTIFY(f, σ, x, n, n0, τ, α):
        cnts0_1, …, cnts0_N ← SAMPLE(f, x, n0, σ)
        cnts_1, …, cnts_N ← SAMPLE(f, x, n, σ)
        for i ∈ {1, …, N}:
            ĉ_i ← top index in cnts0_i
            n_i ← cnts_i[ĉ_i]
            pv_i ← BINPVALUE(n_i, n, ≤, τ)
        r_1, …, r_N ← FWERCONTROL(α, pv_1, …, pv_N)
        for i ∈ {1, …, N}:
            if ¬r_i: ĉ_i ← ∅
        R ← σΦ⁻¹(τ)
        return ĉ_1, …, ĉ_N, R

Certification Similarly to Cohen et al. (2019), we cannot invoke our theoretically constructed model f̄^τ exactly and must approximate it. The simplest way to do so would be to invoke CERTIFY for each component and replace the check p_A > 1/2 by p_A > τ. However, while this accounts for the bad component issue, it still suffers from the outlined multiple testing problem. To address this issue, we now introduce the SEGCERTIFY procedure in Algorithm 2. We will now briefly outline the steps in Algorithm 2 before describing them in further detail and arguing about its correctness and properties. Algorithm 2 relies on the same primitives as Algorithm 1, as well as the FWERCONTROL function, which performs multiple-testing correction and which we will formally introduce shortly. As before, SAMPLE denotes the evaluation of samples f(x + ε), and cnts_i denotes the class frequencies observed for the i-th component. Similar to CERTIFY, we use two sets of samples, cnts and cnts0, to avoid a model selection bias (which would invalidate our statistical test): we use n0 samples, denoted cnts0_i, to guess the majority class c_{A,i} for the i-th component, denoted ĉ_i, and then determine its appearance count n_i out of n trials from cnts_i. With these counts, we then perform a one-sided binomial test, discussed shortly, to obtain its p-value pv_i. Given these p-values, we employ FWERCONTROL, which determines which tests we can reject in order to obtain overall confidence 1 − α. The rejection of the i-th test is denoted by the boolean variable r_i. If we reject the i-th test (r_i = 1), we assume its alternative hypothesis p_{A,i} > τ and return the class ĉ_i; else we return ∅.
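Before turning to the statistical machinery, the procedure can be sketched compactly in Python. This is a sketch under our own simplifications: the per-component class counts stand in for the SAMPLE primitive and must be supplied by the caller, the Holm step-down (introduced below) plays the role of FWERCONTROL, and the one-sided binomial p-value is computed exactly:

```python
import math
from statistics import NormalDist

def binom_pvalue_onesided(k, n, tau):
    """p-value of H0: p <= tau given k successes in n trials: P(X >= k | n, tau)."""
    return sum(math.comb(n, i) * tau**i * (1 - tau)**(n - i) for i in range(k, n + 1))

def holm_reject(pvalues, alpha):
    """Holm step-down FWER control: returns booleans (True = rejected)."""
    N = len(pvalues)
    order = sorted(range(N), key=lambda i: pvalues[i])
    reject = [False] * N
    for rank, i in enumerate(order):
        if pvalues[i] > alpha / (N - rank):
            break  # once one test is accepted, all larger p-values are too
        reject[i] = True
    return reject

def seg_certify(counts0, counts, n, sigma, tau, alpha):
    """SEGCERTIFY (Algorithm 2): counts0/counts are per-component dicts of
    class -> frequency from n0 and n noisy samples. Returns (classes, radius),
    with None marking an abstention."""
    guesses = [max(c, key=c.get) for c in counts0]
    pvals = [binom_pvalue_onesided(counts[i].get(guesses[i], 0), n, tau)
             for i in range(len(counts0))]
    reject = holm_reject(pvals, alpha)
    classes = [g if r else None for g, r in zip(guesses, reject)]
    return classes, sigma * NormalDist().inv_cdf(tau)
```

For example, with n = 100, τ = 0.75, and α = 0.001, a component whose guessed class appears in 99 or 100 of the samples is certified, while a 55/45 split yields a large p-value and an abstention; the radius σΦ⁻¹(0.75) is shared by all certified components.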
To establish the correctness of the approach and improve on the multiple testing trade-off, we will now briefly review statistical (multiple) hypothesis testing.

Hypothesis Testing Here we consider a fixed but arbitrary single component i: if we guessed the majority class c_{A,i} for the i-th component to be ĉ_i, we assume that f_i(x + ε) returns ĉ_i with probability p_{ĉ_i,i} (denoted p in the following). We now want to infer whether p > τ or not, to determine whether f̄^τ will output ĉ_i or ∅. We phrase this as a statistical test with null hypothesis (H0,i) and alternative (HA,i):

    (H0,i): p ≤ τ        (HA,i): p > τ

A statistical test lets us assume a null hypothesis (H0,i) and check how plausible it is given our observational data. We can calculate the p-value pv, which denotes the probability of the observed data, or an even more extreme event, under the null hypothesis. Thus, in our case, rejecting (H0,i) means returning ĉ_i, while accepting it means returning ∅. Rejecting the null hypothesis when it is actually true (in our case, returning a class when we should abstain) is called a type I error. The opposite, not rejecting the null hypothesis when it is false, is called a type II error. In our setting, type II errors mean additional abstentions on top of those f̄^τ makes by design. Making a sound statement about robustness means controlling the type I error, while reducing type II errors means fewer abstentions due to the testing procedure. Commonly, a test is rejected if its p-value is below some threshold α, as in the penultimate line of PREDICT (Algorithm 1). This bounds the probability of a type I error at probability (or level) α.

Multiple Hypothesis Testing Multiple tests performed at once require additional care. Usually, if we perform N tests, we want to bound the probability of any type I error occurring. This probability is called the family-wise error rate (FWER).
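For intuition on how much evidence the one-sided test demands, a short calculation (illustrative only; the parameter choices mirror the toy experiments later in the paper) finds the smallest observed count for which (H0,i): p ≤ τ can be rejected at a given level:

```python
import math

def binom_sf(k, n, p):
    """Exact binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, tau, level):
    """Smallest count k whose one-sided p-value P(X >= k | n, tau)
    drops to or below `level`; None if even k = n is not significant."""
    for k in range(n + 1):
        if binom_sf(k, n, tau) <= level:
            return k
    return None

# With tau = 0.75 and n = 100 samples, rejecting at level 0.001 requires
# the guessed class to appear in roughly nine out of ten samples; at a
# Bonferroni-corrected level (e.g. 1e-9 for many components) the bar
# rises further, illustrating the cost of naive correction.
```

The helper shows the multiple-testing trade-off concretely: shrinking the per-test level (as Bonferroni-style corrections do) pushes the critical count toward n, so either n grows or abstentions (type II errors) increase.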
The goal of FWER control is, given a set of N tests and α, to reject tests such that the FWER is bounded by α. In Algorithm 2 this is denoted as FWERCONTROL, a procedure that takes the value α and p-values pv_1, …, pv_N and decides on rejections r_1, …, r_N. The simplest such procedure is the Bonferroni method (Bonferroni, 1936), which rejects individual tests with p-value pv_i ≤ α/N. It is based on the same consideration we applied to the failure probability of INDIVCLASS in §4. While this controls the FWER at level α, it also increases the type II error and thus would reduce the number of correctly classified components. To maintain the same number of correctly classified components, we would need to increase n or decrease α. A better alternative to the Bonferroni method is the Holm correction (or Holm-Bonferroni method) (Holm, 1979), which orders the tests by ascending p-value and steps through them at levels α/N, …, α/1 (rejecting the null if the associated p-value is smaller than the level) until the first level at which no additional test can be rejected. This method also controls the FWER at level α but allows for a lower rate of type II errors. Further improving upon these methods usually requires additional information. This can either be specialization to a certain kind of test (Tukey, 1949), additional knowledge (e.g., no negative dependencies between tests, for procedures such as (Šidák, 1967)), or an estimate of the dependence structure computed via bootstrap or permutation methods (Westfall & Young, 1993). While our formulation of SEGCERTIFY admits all of these corrections, bootstrap or permutation procedures are generally infeasible for large N. Thus, in the rest of the paper we consider both the Holm and Bonferroni corrections and compare them in §6.1.
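The two corrections can be contrasted directly; a small self-contained sketch (the example p-values are made up for illustration):

```python
def bonferroni_reject(pvalues, alpha):
    """Bonferroni: reject test i iff pv_i <= alpha / N."""
    N = len(pvalues)
    return [pv <= alpha / N for pv in pvalues]

def holm_reject(pvalues, alpha):
    """Holm step-down: walk through p-values in ascending order at levels
    alpha/N, alpha/(N-1), ..., alpha/1, stopping at the first acceptance."""
    N = len(pvalues)
    order = sorted(range(N), key=lambda i: pvalues[i])
    reject = [False] * N
    for rank, i in enumerate(order):
        if pvalues[i] > alpha / (N - rank):
            break
        reject[i] = True
    return reject

# Holm rejects everything Bonferroni does, and possibly more:
pvs = [0.001, 0.015, 0.020, 0.900]
print(bonferroni_reject(pvs, 0.05))  # [True, False, False, False]
print(holm_reject(pvs, 0.05))        # [True, True, True, False]
```

Both control the FWER at level α, but Holm's increasing per-rank levels let small p-values "pay" for larger ones, which is exactly the behavior exploited by SEGCERTIFY.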
We note that while methods (Westfall, 1985) exist for assessing that N Clopper-Pearson confidence intervals hold jointly at level α, thus allowing better correction for INDIVCLASS, these methods do not scale to realistic segmentation problems, as they require re-sampling operations that are expensive on large problems.

5.1. Properties of SEGCERTIFY

SEGCERTIFY is a conservative algorithm, as it will rather abstain from classifying a component than return a wrong or non-robust result. We formalize this, as well as a statement about the correctness of SEGCERTIFY, in Proposition 1.

Proposition 1. Let ĉ_1, …, ĉ_N be the output of SEGCERTIFY for input x and Î_x := {i | ĉ_i ≠ ∅}. Then, with probability at least 1 − α over the randomness in SEGCERTIFY, Î_x ⊆ I_x, where I_x denotes the previously defined non-abstain indices of f̄^τ(x). Therefore, ĉ_i = f̄^τ_i(x) = f̄^τ_i(x + δ) for i ∈ Î_x and δ ∈ ℝ^{N·m} with ‖δ‖₂ < R.

Proof. Suppose that i ∈ Î_x \ I_x. This i then represents a type I error. However, the overall probability of any such type I error is controlled at level α by the FWERCONTROL step. Thus, with probability at least 1 − α, Î_x ⊆ I_x. The rest follows from Theorem 5.1.

We note that there are fundamentally two reasons for SEGCERTIFY to abstain from the classification of a component: either the true class probability p_{A,i} is less than or equal to τ, in which case f̄^τ_i abstains by definition, or because of a type II error in SEGCERTIFY. This type II error can occur either if we guess the wrong ĉ_i from our n0 samples or because we could not gather sufficient evidence to reject (H0,i) under α-FWER. Proposition 1 only makes a statement about type I errors, but not type II errors; e.g., Î_x could always be the empty set. In general, the control of type II errors (called the power of a testing procedure) can be hard to quantify without a precise understanding of the underlying procedure (in our case the model f).
However, in §6.1 we will investigate this experimentally. In contrast to Cohen et al. (2019), we do not provide separate procedures for prediction and certification, as for a fixed τ both only need to determine p_{A,i} > τ for all i with high probability (w.h.p.). For faster prediction, the algorithm can be invoked with larger α and smaller n. For N = 1, the algorithm exhibits the same time complexity as those proposed by Cohen et al. (2019). For larger N, the time complexity stays the same (plus the time for sorting the p-values in the Holm procedure); however, due to type II errors the number of abstentions increases. Counteracting this leads to the previously discussed multiple-testing trade-off.

Summary Having presented SEGCERTIFY, we now summarize how it addresses the two key challenges. First, the bad component issue is offset by the introduction of a threshold parameter τ. This allows excluding single components, which would not be provable, from the overall certificate. Second, due to this threshold we now aim to show a fixed lower bound p_A = τ, for which we can use a binomial p-value test rather than the Clopper-Pearson interval for a fixed confidence. The key difference here is that the binomial test produces a p-value that is small unless the observed rate n_i/n is close to τ, while the Clopper-Pearson interval aims to show the maximal p_A for a fixed confidence. The former setting generally lends itself better to multiple testing. In particular, for correction algorithms such as Holm, this lets small p-values offset larger ones.

5.2. Generality

While presented in the ℓ2-robustness setting as in Cohen et al. (2019), SEGCERTIFY is not tied to it and is compatible with many extensions and variations of Theorem 3.1 in a plug-and-play manner. For example, Li et al. (2018); Lécuyer et al. (2019); Yang et al. (2020) showed robustness to ℓ1 (and other ℓp) perturbations. Fischer et al.
(2020) introduced an extension that allows computing a robustness radius over the parameter of a parametric transformation. In both of these, it suffices for practical purposes to update SAMPLE to produce perturbations from a suitable distribution (e.g., rotation or Laplace noise) as well as to update the calculation of the radius R to reflect the different underlying theorems. Similarly, in f̄^τ and Theorem 5.1, it suffices to update the class probability according to the sample distribution and radius prescribed by the relevant theorem.

Extensions to k-FWER So far, when we discussed FWER control, we bounded the probability of making any type I error by α. However, in the segmentation setting we have many individual components, and depending on our objective, we may allow a small budget of errors. In this setting, we can employ k-FWER control (Lehmann & Romano, 2012). It controls the probability of observing k or more type I errors at probability α. Thus, with probability 1 − α, we have observed at most k − 1 type I errors (false non-abstentions). Lehmann & Romano (2012) introduce a procedure, similar to the Holm method, that allows for k-FWER control, is optimal (without the use of further information on test dependence), and can simply be plugged into SEGCERTIFY. We provide further explanation and evaluation in App. B.2.

6. Experimental Evaluation

We evaluate our approach in three different settings: (i) investigating the two key challenges on toy data in §6.1, (ii) semantic image segmentation in §6.2, and (iii) point cloud part segmentation in §6.3. All timings are for a single NVIDIA GeForce RTX 2080 Ti. For further details, see App. A.

6.1. Toy Data

We now investigate how well SEGCERTIFY and the two naive algorithms handle the identified challenges. Here we consider a simple setting, where our input is of length N and each component can be from one of two possible classes.
Further, our base model is an oracle (that does not need to look at the data) and is correct for each of N − k components with probability 1 − γ, for some γ ∈ [0, 1] and k ∈ {0, …, N}. On the remaining k components it is correct with probability 1 − 5γ (clamped to [0, 1]). We evaluate JOINTCLASS, INDIVCLASS with Bonferroni correction, and SEGCERTIFY with Holm correction, denoted SEGCERTIFYHOLM, as well as with Bonferroni correction, denoted SEGCERTIFYBON. Note that SEGCERTIFYBON, in contrast to INDIVCLASS, employs thresholding. We investigate the performance of all methods under varying levels of noise by observing the rate of certified components. As the naive algorithms either provide a certificate for all pixels or none, we repeated the experiment 600 times, showing a smoothed average in Fig. 2a (the unsmoothed version can be found in App. A). In this setting, we assume N = 100 components, k = 1, and γ varies between 0 and 0.1 along the x-axis. All algorithms use n0 = n = 100 and α = 0.001. SEGCERTIFY uses τ = 0.75. This means that for γ ≤ 0.05, all abstentions are due to the testing procedure (type II errors) and not prescribed by τ in the definition of f̄^τ. For 0.05 < γ ≤ 0.1, there is one prescribed abstention, as k = 1. Thus the results provide an empirical assessment of the statistical power of the algorithm. As outlined in §4, JOINTCLASS and INDIVCLASS are very susceptible to a single bad component. While JOINTCLASS fails almost immediately for γ > 0, INDIVCLASS works until γ becomes large enough that the estimated confidence interval of 1 − 5γ cannot be guaranteed to be larger than 0.5.

Figure 2. We empirically investigate the power (ability to avoid type II errors, i.e., false abstentions) of multiple algorithms on synthetic data. The y-axis shows the rate of certified (rather than abstained) components; an optimal algorithm would achieve 1.0 or 0.99 in all plots. Panels: (a) N = 100 components, with a classifier that has error rate 5γ on one component and γ on all others; (b) same setting as (a), but the algorithms use more samples n0, n, allowing for fewer abstentions; (c) SEGCERTIFY with different testing corrections for various numbers of components N with error rate γ = 0.05.

Both variants of SEGCERTIFY also deteriorate with increasing noise. The main culprit for this is guessing a wrong class in the n0 samples. We showcase this in Fig. 2b, where we use the same setting as in Fig. 2a but with n0 = n = 1000. Here, both SEGCERTIFY variants have the expected performance (close to 1) even with increasing noise. For INDIVCLASS, these additional samples also delay complete failure, but do not prevent it. The main takeaway from Figs. 2a and 2b is that SEGCERTIFY overcomes the bad component issue and that the number of samples n, n0 is a key driver of its statistical power. Lastly, in Fig. 2c we compare SEGCERTIFYBON and SEGCERTIFYHOLM. Here, γ = 0.05, k = 0, n = n0 = 1000, α = 0.1, and N increases (exponentially) along the x-axis. We see both versions deteriorate in performance as N grows. As before, this is to be expected, as out of N components some of the initial guesses based on n0 samples may be wrong, or n may be too small to correctly reject the test. However, we note that the certification gap between SEGCERTIFYBON and SEGCERTIFYHOLM grows as N increases due to their difference in statistical power. We conclude that SEGCERTIFY addresses both challenges, solving the bad component issue and achieving better trade-offs in multiple testing. Particularly in the limit, the benefit of the Holm correction over Bonferroni becomes apparent. 6.2.
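The toy setting can be reproduced in a few lines; a sketch under our own simplifications (the oracle classifier is simulated directly as Bernoulli draws, the initial class guess is assumed correct, and Holm correction stands in for FWERCONTROL):

```python
import math
import random

def binom_sf(k, n, p):
    """Exact binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def certified_rate(N, k, gamma, n, tau, alpha, seed=0):
    """Simulate SEGCERTIFY on the oracle model: N - k components are correct
    with probability 1 - gamma, the remaining k with probability 1 - 5*gamma.
    Returns the fraction of components certified (not abstained)."""
    rng = random.Random(seed)
    accs = [1 - gamma] * (N - k) + [max(0.0, 1 - 5 * gamma)] * k
    # per component: count of correct-class hits among n noisy samples
    counts = [sum(rng.random() < a for _ in range(n)) for a in accs]
    pvals = sorted(binom_sf(c, n, tau) for c in counts)
    certified = 0  # Holm step-down at level alpha
    for rank, pv in enumerate(pvals):
        if pv > alpha / (N - rank):
            break
        certified += 1
    return certified / N

# At low noise nearly everything is certified; at high noise the single bad
# component drops out while the rest remain certified.
```

With N = 100, k = 1, n = 1000, τ = 0.75, and α = 0.001, increasing γ from 0.01 to 0.1 reproduces the qualitative behavior of Fig. 2b: the certified rate stays near 1 until the bad component's accuracy falls below τ, after which exactly that component abstains.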
6.2. Semantic Image Segmentation

Next, we evaluate SEGCERTIFY on the task of semantic image segmentation, considering two datasets: Cityscapes and Pascal Context. The Cityscapes dataset (Cordts et al., 2016) contains high-resolution (1024 × 2048 pixel) images of traffic scenes annotated with 30 classes, 19 of which are used for evaluation. An example can be seen in Fig. 1 (the hood of the car belongs to an unused class). The Pascal Context dataset (Mottaghi et al., 2014) contains images with 59 foreground classes and 1 background class, which are resized to 480 × 480 before segmentation. There are two common evaluation schemes, using either all 60 classes or just the 59 foreground classes. Here we use the latter setting. As a base model in this setting, we use HRNetV2 (Sun et al., 2019; Wang et al., 2019). Like many segmentation models, it can be invoked at different scales. For example, to achieve faster inference, we can scale an image down to half its length and width, invoke the segmentation model, and scale up the result. Or, to achieve a more robust or more accurate segmentation, we can invoke segmentation at multiple scales, rescale the output probabilities to the original input size, average them, and then predict the class for each pixel. Here we investigate how different scales allow different trade-offs for provably robust segmentation via SEGCERTIFY. We note that, at the end, we always perform certification on the 1024 × 2048-scale result. We trained our base models with Gaussian noise (σ = 0.25), as in Cohen et al. (2019). Details about the training procedure and further results can be found in App. A.1 and B.3. Evaluation results for 100 images on both datasets are given in Table 1. We observe that the certified model has accuracy close to that of the base model, sometimes even outperforming it, even though it abstains on a number of pixels. This means that abstentions generally happen in wrongly predicted areas or on ignored classes. Similar to Cohen et al.
(2019), we observe the best performance at the noise level the model was trained on (σ = 0.25) and degrading performance for higher values. Interestingly, by comparing the results for σ = 0.5 at scale 1.0 and at smaller scales, we observe that the resizing for scales < 1.0 acts as a natural denoising step, allowing better performance with more noise. Further, in Fig. 3 we show how the certified accuracy degrades with different values of τ. We use scale 0.25 and n = 300, and otherwise the same parameters as in Table 1. Up to τ = 0.92 we observe a very gradual drop-off with increasing τ. Then we observe a sudden drop-off, as n becomes insufficient to certify any component (even those that are always correct). All values use Holm correction. We contrast them with Bonferroni correction in App. B.3.

Table 1. Segmentation results for 100 images. acc. shows the mean per-pixel accuracy, mIoU the mean intersection over union, % the rate of abstentions, and t the runtime in seconds. All SEGCERTIFY (n0 = 10, α = 0.001) results are certifiably robust at radius R w.h.p. multiscale uses scales 0.5, 0.75, 1.0, 1.25, 1.5, 1.75 as well as their flipped variants for Cityscapes, and additionally 2.0 for Pascal.

                                                       | Cityscapes                 | Pascal Context
scale  model                       σ      R            | acc.  mIoU  %     t        | acc.  mIoU  %     t
0.25   non-robust model            -      -            | 0.93  0.60  0.00  0.38     | 0.59  0.24  0.00  0.12
       base model                  -      -            | 0.87  0.42  0.00  0.37     | 0.33  0.08  0.00  0.13
       SEGCERTIFY n=100, τ=0.75    0.25   0.17         | 0.84  0.43  0.07  70.00    | 0.33  0.08  0.13  14.16
                                   0.33   0.22         | 0.84  0.44  0.09  70.21    | 0.34  0.09  0.17  14.20
                                   0.50   0.34         | 0.82  0.43  0.13  71.45    | 0.23  0.05  0.27  14.23
       SEGCERTIFY n=500, τ=0.95    0.25   0.41         | 0.83  0.42  0.11  229.37   | 0.29  0.01  0.30  33.64
                                   0.33   0.52         | 0.83  0.42  0.12  230.69   | 0.26  0.01  0.39  33.79
                                   0.50   0.82         | 0.77  0.38  0.20  230.09   | 0.10  0.00  0.61  33.44
0.5    non-robust model            -      -            | 0.96  0.76  0.00  0.39     | 0.74  0.38  0.00  0.16
       base model                  -      -            | 0.89  0.51  0.00  0.39     | 0.47  0.13  0.00  0.14
       SEGCERTIFY n=100, τ=0.75    0.25   0.17         | 0.88  0.54  0.06  75.59    | 0.48  0.16  0.09  16.29
                                   0.33   0.22         | 0.87  0.54  0.08  75.99    | 0.50  0.17  0.11  16.08
                                   0.50   0.34         | 0.86  0.54  0.10  75.72    | 0.36  0.10  0.21  16.14
1.0    non-robust model            -      -            | 0.97  0.81  0.00  0.52     | 0.77  0.42  0.00  0.18
       base model                  -      -            | 0.91  0.57  0.00  0.52     | 0.53  0.18  0.00  0.18
       SEGCERTIFY n=100, τ=0.75    0.25   0.17         | 0.88  0.59  0.11  92.75    | 0.55  0.22  0.22  18.53
                                   0.33   0.22         | 0.78  0.43  0.20  92.85    | 0.46  0.18  0.34  18.57
                                   0.50   0.34         | 0.34  0.06  0.40  92.48    | 0.17  0.03  0.41  18.46
multi  non-robust model            -      -            | 0.97  0.82  0.00  8.98     | 0.78  0.45  0.00  4.21
       base model                  -      -            | 0.92  0.60  0.00  9.04     | 0.56  0.19  0.00  4.22
       SEGCERTIFY n=100, τ=0.75    0.25   0.17         | 0.88  0.57  0.09  1040.55  | 0.52  0.21  0.29  355.00

Figure 3. Radius versus certified mean per-pixel accuracy for semantic segmentation on Cityscapes at scale 0.25 (curves for σ = 0.25, 0.33, 0.50). Numbers next to dots show τ. The y-axis is scaled to the fourth power for clarity.

Lastly, we observe that our algorithm comes with a large increase in runtime, as is common with randomized smoothing.
However, the effect here is amplified, as semantic segmentation commonly involves large models and large images. On average, 30 s (for n = 100) are spent on computing the p-values, 0.1 s on computing the Holm correction, and the rest of the reported time on sampling. We did not optimize our implementation for speed, other than choosing the largest possible batch size for each scale, and we believe that with further engineering the runtime can be reduced.

6.3. Point Cloud Part Segmentation

For point cloud part segmentation we use the ShapeNet dataset (Chang et al., 2015), which contains 3d CAD models of objects from 16 categories, with multiple parts annotated from 50 labels. Given the overall object label and points sampled from the surface of the object, the goal is to segment the points into their correct parts. There exist two variations of the described ShapeNet task: one where every point is a 3d coordinate, and one where each point additionally contains its surface normal vector.

Table 2. Results on 100 point cloud part segmentations. acc. shows the mean per-point accuracy, % the rate of abstentions, and t the runtime in seconds. All SEGCERTIFY (n0 = 100, α = 0.001) results are certifiably robust w.h.p.

model      n      τ     σ      acc.  %     t
fσ=0.25    -      -     -      0.81  0.00  0.57
           1000   0.75  0.250  0.62  0.32  54.47
           1000   0.85  0.250  0.52  0.44  54.41
           10000  0.95  0.250  0.41  0.56  496.57
           10000  0.99  0.250  0.20  0.78  496.88
fⁿσ=0.25   -      -     -      0.86  0.00  0.57
           1000   0.75  0.250  0.78  0.16  54.46
           1000   0.85  0.250  0.71  0.25  54.41
           10000  0.95  0.250  0.62  0.35  496.65
           10000  0.99  0.250  0.39  0.60  496.68
fσ=0.5     -      -     -      0.70  0.00  0.55
           1000   0.75  0.250  0.57  0.23  54.40
           1000   0.75  0.500  0.47  0.44  54.47
           1000   0.85  0.500  0.41  0.54  54.39
           10000  0.95  0.500  0.30  0.67  496.30
           10000  0.99  0.500  0.11  0.87  496.47
fⁿσ=0.5    -      -     -      0.78  0.00  0.56
           1000   0.75  0.250  0.74  0.10  54.78
           1000   0.75  0.500  0.67  0.21  54.58
           1000   0.85  0.500  0.61  0.30  54.44
           10000  0.95  0.500  0.53  0.41  497.40
           10000  0.99  0.500  0.37  0.60  497.87
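Before smoothing, each input is normalized as described next (centered and rescaled to unit radius), and only the coordinates receive Gaussian noise. A minimal sketch of this preprocessing, as our own illustration (the actual pipeline may differ in details):

```python
import numpy as np

def normalize_cloud(points):
    """Center the cloud at the origin and rescale so that the point
    furthest from the origin has distance 1 (standard ShapeNet setup)."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

def smooth_sample(points, normals, sigma, rng):
    """One smoothing sample: Gaussian noise on coordinates only;
    surface normals (if used as features) pass through unperturbed."""
    return points + rng.normal(0.0, sigma, size=points.shape), normals

rng = np.random.default_rng(0)
cloud = normalize_cloud(rng.uniform(-3.0, 3.0, size=(2048, 3)))
noisy, normals = smooth_sample(cloud, None, sigma=0.25, rng=rng)
print(np.linalg.norm(cloud, axis=1).max())  # ~1.0 after normalization
```

Since smoothing is applied after this rescaling, the certified guarantees hold in the normalized coordinate system.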
Each input consists of 2048 points; before it is fed to the neural network, its coordinates are centered at the origin and scaled such that the point furthest from the origin has distance 1. Using the PointNetV2 architecture (Qi et al., 2017a;b; Yan et al., 2020), we trained four base models: fσ=0.25, fⁿσ=0.25, fσ=0.5, and fⁿσ=0.5, where ⁿ denotes the inclusion of normal vectors and σ the noise used in training. We apply all smoothing after the rescaling, so as to work on consistent sizes, and we only perturb the location, not the normal vector. Commonly, the underlying network is invoked 3 times and the results are averaged, as the classification algorithm involves randomness. As Theorems 3.1 and 5.1 do not require the underlying f to be deterministic, we also use the average of 3 runs for each invocation of the base model. We provide the results in Table 2 and Fig. 4. We note that on the same data a non-robust model achieves 0.91 and 0.90 with and without normal vectors, respectively. While the certification results here are good compared with the base model, the ratio is worse than in image segmentation, and more samples (n) are needed. This is because here, in contrast to image segmentation, a perturbed point can have different semantics if it is moved to a different part of the object, causing label fluctuation. Further, as Fig. 4 shows, we see a gradual decline in accuracy when increasing τ, rather than the sudden drop of Fig. 3.

Figure 4. Radius versus certified accuracy at different radii for point cloud part segmentation, for fⁿσ=0.25 (σ = 0.25) and fⁿσ=0.50 (σ = 0.50), each with n = 1000 and n = 10000. Numbers next to dots show τ.

As before, all values are computed using Holm correction, and we provide results for Bonferroni in App. B.4. Using the same base models, we apply the randomized smoothing variant by Fischer et al.
(2020), and achieve an accuracy of 0.69 while showing robustness to 3d rotations. App. B.5 provides further details and the obtained guarantee.

6.4. Discussion & Limitations

By reducing the problem of certified segmentation to only non-fluctuating components, we significantly reduce its difficulty and achieve strong results on challenging datasets. However, a drawback of the method is the newly introduced hyperparameter τ. In practice, a suitable value can be determined from the desired radius or the empirical performance of the base classifier. High values of τ permit a higher certification radius, but also lead to more abstentions and require more samples n for both certification and inference (both done via SEGCERTIFY). Further, as is common with adversarial defenses, there is a loss of performance compared to a non-robust model. However, we expect further improvements in accuracy from the application of specialized training procedures (Salman et al., 2019; Zhai et al., 2020; Jeong & Shin, 2020), which we briefly investigate in App. B.3.

7. Conclusion

In this work we investigated the problem of provably robust segmentation and identified two key challenges: bad components, and the trade-offs introduced by multiple testing, which prevent naive solutions from working well. Based on this, we introduced SEGCERTIFY, a simple yet powerful algorithm that overcomes the bad-component issue and allows for significantly better trade-offs in multiple testing. It enables certified performance on a similar level as an undefended base classifier and permits guarantees for multiple threat models in a plug-and-play manner.

References

Arnab, A., Miksik, O., and Torr, P. H. S. On the robustness of semantic segmentation models to adversarial attacks. In CVPR, 2018.

Balunovic, M. and Vechev, M. T. Adversarial training and provable defenses: Bridging the gap. In ICLR, 2020.

Balunovic, M., Baader, M., Singh, G., Gehr, T., and Vechev, M. T.
Certifying geometric robustness of neural networks. In NeurIPS, 2019.

Bielik, P. and Vechev, M. T. Adversarial robustness for code. In ICML, 2020.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In ECML/PKDD, 2013.

Bojchevski, A., Klicpera, J., and Günnemann, S. Efficient robustness certificates for discrete data: Sparsity-aware randomized smoothing for graphs, images and more. In ICML, 2020.

Bonferroni, C. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.

Chang, A. X., Funkhouser, T. A., Guibas, L. J., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., and Yu, F. ShapeNet: An information-rich 3d model repository. Technical report, 2015.

Chiang, P., Curry, M. J., Abdelkader, A., Kumar, A., Dickerson, J., and Goldstein, T. Detection as regression: Certified object detection with median smoothing. In NeurIPS, 2020.

Clopper, C. J. and Pearson, E. S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.

Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

Deng, L., Yang, M., Qian, Y., Wang, C., and Wang, B. CNN based semantic segmentation for urban traffic scenes using fisheye camera. In Intelligent Vehicles Symposium. IEEE, 2017.

Ehlers, R. Formal verification of piece-wise linear feedforward neural networks. In ATVA, 2017.

Fischer, M., Baader, M., and Vechev, M. T. Certified defense to image transformations via randomized smoothing. In NeurIPS, 2020.
Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., and Vechev, M. T. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy, 2018.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In ICLR, 2015.

Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. On the effectiveness of interval bound propagation for training verifiably robust models. CoRR, abs/1810.12715, 2018.

Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pp. 65–70, 1979.

Jeong, J. and Shin, J. Consistency regularization for certified robustness of smoothed classifiers. In NeurIPS, 2020.

Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV, 2017.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. In ICLR, 2017.

Lécuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy, 2019.

Lehmann, E. L. and Romano, J. P. Generalizations of the familywise error rate. In Selected Works of EL Lehmann, pp. 719–735. Springer, 2012.

Levine, A. and Feizi, S. (De)randomized smoothing for certifiable defense against patch attacks. In NeurIPS, 2020a.

Levine, A. and Feizi, S. Wasserstein smoothing: Certified robustness against Wasserstein adversarial attacks. In AISTATS, 2020b.

Li, B., Chen, C., Wang, W., and Carin, L. Second-order adversarial attack and certifiable robustness. CoRR, abs/1809.03113, 2018.

Li, L., Weber, M., Xu, X., Rimanic, L., Kailkhura, B., Xie, T., Zhang, C., and Li, B. TSS: Transformation-specific smoothing for robustness certification. In CCS, 2021.
Lin, W.-Y., Sheikholeslami, F., Shi, J., Rice, L., and Kolter, J. Z. Certified robustness against physically-realizable patch attack via randomized cropping, 2021.

Liu, D., Yu, R., and Su, H. Extending adversarial attacks and defenses to deep 3d point cloud classifiers. In ICIP, 2019.

Lorenz, T., Ruoss, A., Balunovic, M., Singh, G., and Vechev, M. T. Robustness certification for point cloud models. CoRR, abs/2103.16652, 2021.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

Mirman, M., Gehr, T., and Vechev, M. T. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.

Mohapatra, J., Ko, C., Weng, T., Chen, P., Liu, S., and Daniel, L. Higher-order certification for randomized smoothing. In NeurIPS, 2020a.

Mohapatra, J., Weng, T., Chen, P., Liu, S., and Daniel, L. Towards verifying robustness of neural networks against a family of semantic perturbations. In CVPR, 2020b.

Mottaghi, R., Chen, X., Liu, X., Cho, N., Lee, S., Fidler, S., Urtasun, R., and Yuille, A. L. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.

Perone, C. S., Calabrese, E., and Cohen-Adad, J. Spinal cord gray matter segmentation using deep dilated convolutions. Scientific Reports, 8(1):1–13, 2018.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017a.

Qi, C. R., Yi, L., Su, H., and Guibas, L. J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017b.

Raghunathan, A., Steinhardt, J., and Liang, P. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018a.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In ICLR, 2018b.
Ruoss, A., Baader, M., Balunovic, M., and Vechev, M. T. Efficient certification of spatial robustness, 2021.

Salman, H., Li, J., Razenshteyn, I. P., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, 2019.

Salman, H., Sun, M., Yang, G., Kapoor, A., and Kolter, J. Z. Denoised smoothing: A provable defense for pretrained classifiers. In NeurIPS, 2020.

Schuchardt, J., Bojchevski, A., Klicpera, J., and Günnemann, S. Collective robustness certificates. In ICLR, 2021.

Sheikholeslami, F., Lotfi, A., and Kolter, J. Z. Provably robust classification of adversarial examples with detection. In ICLR, 2021.

Šidák, Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318):626–633, 1967.

Singh, G., Ganvir, R., Püschel, M., and Vechev, M. T. Beyond the single neuron convex barrier for neural network certification. In NeurIPS, 2019a.

Singh, G., Gehr, T., Püschel, M., and Vechev, M. T. An abstract domain for certifying neural networks. Proc. ACM Program. Lang., (POPL), 2019b.

Sun, J., Cao, Y., Chen, Q. A., and Mao, Z. M. Towards robust LiDAR-based perception in autonomous driving: General black-box adversarial sensor attack and countermeasures. In USENIX Security Symposium, 2020.

Sun, K., Xiao, B., Liu, D., and Wang, J. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In ICLR, 2014.

Tran, D., Pal, N., Musau, P., Manzanas Lopez, D., Hamilton, N., Yang, X., Bak, S., and Johnson, T. Robustness verification of semantic segmentation neural networks using relaxed reachability. In CAV, 2021.

Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics, pp. 99–114, 1949.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., and Xiao, B. Deep high-resolution representation learning for visual recognition. TPAMI, 2019.

Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. Efficient formal safety analysis of neural networks. In NeurIPS, 2018.

Weng, T., Zhang, H., Chen, H., Song, Z., Hsieh, C., Daniel, L., Boning, D. S., and Dhillon, I. S. Towards fast computation of certified robustness for ReLU networks. In ICML, 2018.

Westfall, P. Simultaneous small-sample multivariate Bernoulli confidence intervals. Biometrics, pp. 1001–1013, 1985.

Westfall, P. H. and Young, S. S. Resampling-based multiple testing: Examples and methods for p-value adjustment, volume 279. John Wiley & Sons, 1993.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018.

Xiang, C., Qi, C. R., and Li, B. Generating 3d adversarial point clouds. In CVPR, 2019.

Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., and Yuille, A. L. Adversarial examples for semantic segmentation and object detection. In ICCV, 2017.

Yan, X., Zheng, C., Li, Z., Wang, S., and Cui, S. PointASNL: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, 2020.

Yang, G., Duan, T., Hu, J. E., Salman, H., Razenshteyn, I. P., and Li, J. Randomized smoothing of all shapes and sizes. In ICML, 2020.

Zhai, R., Dan, C., He, D., Zhang, H., Gong, B., Ravikumar, P., Hsieh, C., and Wang, L. MACER: Attack-free and scalable robust training via maximizing certified radius. In ICLR, 2020.

Zhang, H., Weng, T., Chen, P., Hsieh, C., and Daniel, L. Efficient neural network robustness certification with general activation functions. In NeurIPS, 2018.