# Hierarchical Randomized Smoothing

Yan Scholten¹, Jan Schuchardt¹, Aleksandar Bojchevski² & Stephan Günnemann¹
{y.scholten, j.schuchardt, s.guennemann}@tum.de, a.bojchevski@uni-koeln.de
¹Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich
²University of Cologne, Germany

**Abstract.** Real-world data is complex and often consists of objects that can be decomposed into multiple entities (e.g. images into pixels, graphs into interconnected nodes). Randomized smoothing is a powerful framework for making models provably robust against small changes to their inputs by guaranteeing robustness of the majority vote when randomly adding noise before classification. Yet, certifying robustness on such complex data via randomized smoothing is challenging when adversaries do not arbitrarily perturb entire objects (e.g. images) but only a subset of their entities (e.g. pixels). As a solution, we introduce hierarchical randomized smoothing: We partially smooth objects by adding random noise only on a randomly selected subset of their entities. By adding noise in a more targeted manner than existing methods we obtain stronger robustness guarantees while maintaining high accuracy. We initialize hierarchical smoothing using different noising distributions, yielding novel robustness certificates for discrete and continuous domains. We experimentally demonstrate the importance of hierarchical smoothing in image and node classification, where it yields superior robustness-accuracy trade-offs. Overall, hierarchical smoothing is an important contribution towards models that are both certifiably robust to perturbations and accurate.¹

## 1 Introduction

Machine learning models are vulnerable to small input changes. Consequently, their performance is typically better during development than in the real world, where noise, incomplete data and adversaries are present. To address this problem, robustness certificates constitute useful tools that provide provable guarantees for the stability of predictions under worst-case perturbations, enabling us to analyze and guarantee robustness of our classifiers in practice.

Randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) is a powerful and highly flexible framework for making arbitrary classifiers certifiably robust by averaging out small discontinuities in the underlying model. This is achieved by guaranteeing robustness of the majority vote when randomly adding noise to a model's input before classification. This typically yields a robustness-accuracy trade-off: more noise increases certifiable robustness but also reduces clean accuracy.

Recent adversarial attacks against image classifiers (Su et al., 2019; Croce and Hein, 2019; Modas et al., 2019) and graph neural networks (Ma et al., 2020) further motivate the need for more flexible robustness certificates against threat models where adversaries are bounded in both the number of perturbed entities and the perturbation strength. For example in social networks, adversaries typically cannot control entire graphs, but the perturbation of the controlled nodes may be (imperceptibly) strong. When adversaries perturb only a subset of all entities, existing smoothing-based approaches sacrifice robustness over accuracy (or vice versa): They either add noise to entire objects (e.g. images/graphs) or ablate entire entities (e.g. pixels/nodes), even if only a small subset is perturbed and requires noising.
¹ Project page: https://www.cs.cit.tum.de/daml/hierarchical-smoothing

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Hierarchical randomized smoothing: We first select a subset of all entities and then add noise to the selected entities only. We achieve stronger robustness guarantees while still maintaining high accuracy, especially when adversaries can only perturb a subset of all entities. For example in social networks, adversaries typically control only a subset of all nodes in the entire graph.

The absence of appropriate tools for decomposable data prevents us from effectively analyzing the robustness of models such as image classifiers or graph neural networks, raising the research question: How can we provably guarantee robustness under adversaries that perturb only a subset of all entities?

We propose hierarchical randomized smoothing: By only adding noise on a randomly selected subset of all entities we achieve stronger robustness guarantees while maintaining high accuracy (Figure 1). Our hierarchical smoothing framework is highly flexible and can integrate the whole suite of smoothing distributions for adding random noise to the input, provided that the corresponding certificates exist. Notably, our certificates apply to a wide range of models, domains and tasks and are at least as efficient as the certificates for the underlying smoothing distribution. We instantiate hierarchical smoothing with well-established smoothing distributions (Cohen et al., 2019; Bojchevski et al., 2020; Levine and Feizi, 2020b; Scholten et al., 2022), yielding novel robustness certificates for discrete and continuous domains. Experimentally, hierarchical smoothing significantly expands the Pareto-front when optimizing for accuracy and robustness, revealing novel insights into the reliability of image classifiers and graph neural networks (GNNs). In short, our main contributions are:

- We propose a novel framework for strong robustness certificates for complex data where objects (e.g. images or graphs) can be decomposed into multiple entities (e.g. pixels or nodes).
- We instantiate our framework with well-established smoothing distributions, yielding novel robustness certificates for discrete and continuous domains.
- We demonstrate the importance of hierarchical smoothing in image and node classification, where it significantly expands the Pareto-front with respect to robustness and accuracy.

## 2 Related work

Despite numerous research efforts, the vulnerability of machine learning models against adversarial examples remains an open research problem (Hendrycks et al., 2021). Various tools have emerged to address adversarial robustness: Adversarial attacks and defenses allow us to evaluate and improve robustness empirically (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini et al., 2019; Croce and Hein, 2019). While attacks provide an upper bound on the adversarial robustness of a model, robustness certificates provide a provable lower bound. The development of strong certificates is challenging and constitutes a research field with many open questions (Li et al., 2020).

**Robustness certification via randomized smoothing.** Randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) represents a framework for making arbitrary classifiers certifiably robust while scaling to deeper architectures.
There are two main approaches discussed in the literature: Additive noise certificates add random noise to continuous (Cohen et al., 2019) or discrete (Lee et al., 2019) input data. Ablation certificates mask out parts of the input to hide the potentially perturbed information (Levine and Feizi, 2020b). Since the input is masked, ablation certificates can be applied to both continuous and discrete data. We are the first to combine and generalize additive noise and ablation certificates into a novel hierarchical framework: We first select entities to ablate, but instead of ablating we add noise to them. By only partially adding noise to the input our framework is more flexible and allows us to better control the robustness-accuracy trade-off under our threat model. Further research directions orthogonal to ours include derandomized (deterministic) smoothing (Levine and Feizi, 2020a, 2021; Horváth et al., 2022b; Scholten et al., 2022), input-dependent smoothing (Súkeník et al., 2022), and gray-box certificates (Mohapatra et al., 2020; Levine et al., 2020; Schuchardt and Günnemann, 2022; Scholten et al., 2022; Schuchardt et al., 2023a).

**Robustness-accuracy trade-off.** The robustness-accuracy trade-off has been extensively discussed in the literature (Tsipras et al., 2019; Raghunathan et al., 2019; Zhang et al., 2019; Yang et al., 2020b; Xie et al., 2020), including works for graph data (Gosch et al., 2023b). In randomized smoothing the parameters of the smoothing distribution allow trading robustness against accuracy (Cohen et al., 2019; Mohapatra et al., 2020; Wu et al., 2021), where more noise means higher robustness but lower accuracy. Our framework combines two smoothing distributions while introducing parameters for both, which allows us to better control the robustness-accuracy trade-off under our threat model since we can control both the number of entities to smooth and the magnitude of the noise. Further advances in the trade-off in randomized smoothing include the work of Levine and Feizi (2020a) against patch-attacks, Schuchardt et al. (2021, 2023b) for collective robustness certification, and Scholten et al. (2022) for graph neural networks. Other techniques include consistency regularization (Jeong and Shin, 2020), denoised smoothing (Salman et al., 2020) and ensemble techniques (Müller et al., 2021; Horváth et al., 2022a,c). Notably, our method is orthogonal to such enhancements since we derive certificates for a novel, flexible smoothing distribution to better control such trade-offs.

**Robustness certificates for GNNs.** Research on GNN certification is still at an early stage (Günnemann, 2022); most works focus on heuristic defenses rather than provable robustness guarantees. But heuristics cannot guarantee robustness and may be broken by stronger attacks in the future (Mujkanovic et al., 2022). Most certificates are limited to specific architectures (Zügner and Günnemann, 2020; Jin et al., 2020; Bojchevski and Günnemann, 2019; Zügner and Günnemann, 2019). Randomized smoothing is a more flexible framework for certifying arbitrary models while only accessing their forward pass. We compare our method to additive noise certificates for sparse data by Bojchevski et al. (2020), and to ablation certificates by Scholten et al. (2022) that ablate all node features to derive certificates against strong adversaries that control entire nodes.
While we implement certificates against node-attribute perturbations, our framework can in principle integrate smoothing-based certificates against graph-structure perturbations as well.

## 3 Preliminaries and background

Our research is focused on data where objects can be decomposed into entities. W.l.o.g. we represent such objects as matrices where rows represent entities. We assume a classification task on a discrete or continuous $(N \times D)$-dimensional space $\mathcal{X}^{N \times D}$ (e.g. $\mathcal{X} = \{0, 1\}$ or $\mathcal{X} = \mathbb{R}$). We propose certificates for classifiers $f : \mathcal{X}^{N \times D} \to \mathcal{Y}$ that assign a matrix $X \in \mathcal{X}^{N \times D}$ a label $y \in \mathcal{Y} = \{1, \ldots, C\}$.² We model evasion settings, i.e. adversaries perturb the input at inference time. We seek to verify the robustness $f(X) = f(\tilde{X})$ for perturbed $\tilde{X}$ from a threat model of admissible perturbations:

$$B^r_{p,\epsilon}(X) \triangleq \left\{ \tilde{X} \in \mathcal{X}^{N \times D} \,\Big|\, \sum_{i=1}^{N} \mathbb{1}[x_i \neq \tilde{x}_i] \leq r, \; \|\mathrm{vec}(X - \tilde{X})\|_p \leq \epsilon \right\}$$

where $\mathbb{1}$ denotes an indicator function, $\|\cdot\|_p$ an $\ell_p$-vector norm and $\mathrm{vec}(\cdot)$ the matrix vectorization. For example, $\|\mathrm{vec}(\cdot)\|_2$ is the Frobenius norm. Intuitively, we model adversaries that control up to $r$ rows of the matrix, and whose perturbations are bounded by $\epsilon$ under a fixed $\ell_p$-norm.

² Multi-output classifiers can be treated as multiple independent single-output classifiers.

### 3.1 Randomized smoothing framework

Our certificates build on the randomized smoothing framework (Lecuyer et al., 2019; Li et al., 2019; Cohen et al., 2019). Instead of certifying a base classifier $f$, one constructs a smoothed classifier $g$ that corresponds to the majority vote prediction of $f$ under a randomized smoothing of the input. We define the smoothed classifier $g : \mathcal{X}^{N \times D} \to \mathcal{Y}$ for a smoothing distribution $\mu_X$ as:

$$g(X) \triangleq \operatorname*{argmax}_{y \in \mathcal{Y}} \; p_{X,y} \qquad\quad p_{X,y} \triangleq \Pr_{W \sim \mu_X}[f(W) = y]$$

where $p_{X,y}$ is the probability of predicting class $y$ for input $X$ under the smoothing distribution $\mu_X$. Let $y^* \triangleq g(X)$ further denote the majority vote of $g$, i.e. the most likely prediction for the clean $X$.

To derive robustness certificates for the smoothed classifier $g$ we have to show that $y^* = g(X) = g(\tilde{X})$ for the clean matrix $X$ and any perturbed matrix $\tilde{X} \in B^r_{p,\epsilon}(X)$. This is the case if the probability to classify any perturbed matrix $\tilde{X}$ as $y^*$ is still larger than for all other classes, that is if $p_{\tilde{X},y^*} > 0.5$. However, randomized smoothing is designed to be flexible and does not assume anything about the underlying base classifier, i.e. all we know about $f$ is the probability $p_{X,y^*}$ for the clean matrix $X$. We therefore derive a lower bound $\underline{p}_{\tilde{X},y^*} \leq p_{\tilde{X},y^*}$ under a worst-case assumption: We consider the least robust classifier among all classifiers with the same probability $p_{X,y^*}$ of classifying $X$ as $y^*$:

$$\underline{p}_{\tilde{X},y^*} \triangleq \min_{h \in \mathbb{H}} \Pr_{W \sim \mu_{\tilde{X}}}[h(W) = y^*] \quad \text{s.t.} \quad \Pr_{W \sim \mu_X}[h(W) = y^*] = p_{X,y^*} \qquad (1)$$

where $\mathbb{H} \subseteq \{h : \mathcal{X}^{N \times D} \to \mathcal{Y}\}$ is the set of possible classifiers with $f \in \mathbb{H}$. To ensure that the smoothed classifier is robust to the entire threat model we have to guarantee $\min_{\tilde{X} \in B^r_{p,\epsilon}(X)} \underline{p}_{\tilde{X},y^*} > 0.5$. One can also derive tighter multi-class certificates, for which we refer to Appendix D for conciseness.

**Probabilistic certificates.** Since computing $p_{X,y}$ exactly is challenging in practice, one estimates it using Monte-Carlo samples from $\mu_X$ and bounds $p_{X,y}$ with confidence intervals (Cohen et al., 2019). The final certificates are probabilistic and hold with an (arbitrarily high) confidence level of $1 - \alpha$.
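For illustration, the following sketch (not taken from the paper) shows one standard way to lower-bound $p_{X,y^*}$ from Monte-Carlo samples via a one-sided Clopper-Pearson interval, as commonly done since Cohen et al. (2019). The function names and the `sample_noise` callable are illustrative placeholders rather than the authors' reference implementation.

```python
from scipy.stats import beta


def lower_confidence_bound(successes: int, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) Clopper-Pearson lower bound on a binomial parameter."""
    if successes == 0:
        return 0.0
    return float(beta.ppf(alpha, successes, n - successes + 1))


def estimate_label_probability(f, X, sample_noise, y_star, n=10_000, alpha=0.01):
    """Monte-Carlo lower bound on p_{X,y*} = Pr_{W ~ mu_X}[f(W) = y*].

    f            -- base classifier mapping one (smoothed) matrix to a label
    sample_noise -- callable drawing one smoothed sample W ~ mu_X
    """
    successes = sum(f(sample_noise(X)) == y_star for _ in range(n))
    return lower_confidence_bound(successes, n, alpha)
```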
### 3.2 Neyman-Pearson Lemma: The foundation of randomized smoothing certificates

So far we introduced the idea of robustness certification using randomized smoothing, but to compute certificates one still has to solve Equation 1. To this end, Cohen et al. (2019) show that the minimum of Equation 1 can be derived using the Neyman-Pearson (NP) Lemma (Neyman and Pearson, 1933). Intuitively, the least robust model classifies those $W$ as $y$ for which the probability of sampling $W$ around the clean $X$ is high, but low around the perturbed $\tilde{X}$:

**Lemma 1** (Neyman-Pearson lower bound). Given $X, \tilde{X} \in \mathcal{X}^{N \times D}$, distributions $\mu_X, \mu_{\tilde{X}}$, class label $y \in \mathcal{Y}$, probability $p_{X,y}$ and the set $S_\kappa \triangleq \{W \in \mathcal{X}^{N \times D} : \mu_{\tilde{X}}(W) \leq \kappa \cdot \mu_X(W)\}$, we have

$$\underline{p}_{\tilde{X},y} = \Pr_{W \sim \mu_{\tilde{X}}}[W \in S_\kappa] \quad \text{with } \kappa \in \mathbb{R}^+ \text{ s.t. } \Pr_{W \sim \mu_X}[W \in S_\kappa] = p_{X,y}$$

*Proof.* See (Neyman and Pearson, 1933) and (Cohen et al., 2019).

### 3.3 Existing smoothing distributions for discrete and continuous data

Our hierarchical smoothing framework builds upon certificates for smoothing distributions that apply noise independently per dimension (Lecuyer et al., 2019; Cohen et al., 2019; Lee et al., 2019). Despite being proposed for vector data $x \in \mathcal{X}^D$, these certificates have been generalized to matrix data by applying noise independently on all matrix entries (Cohen et al., 2019; Bojchevski et al., 2020; Chu et al., 2022; Schuchardt and Günnemann, 2022). Thus, given matrix data we define the density $\mu_X(W) \triangleq \prod_{i=1}^{N} \prod_{j=1}^{D} \mu_{X_{ij}}(W_{ij})$ given the density of an existing smoothing distribution $\mu$.³

**Gaussian randomized smoothing.** For continuous data $\mathcal{X} = \mathbb{R}$, Cohen et al. (2019) derive the first tight certificate for the $\ell_2$-threat model. Given an isotropic Gaussian smoothing distribution $\mu_x(w) = \mathcal{N}(w \mid x, \sigma^2 I)$, Cohen et al. (2019) show that the optimal value of Equation 1 is given by $\underline{p}_{\tilde{x},y} = \Phi\!\left(\Phi^{-1}(p_{x,y}) - \frac{\|\delta\|_2}{\sigma}\right)$, where $\Phi$ denotes the CDF of the standard normal distribution and $\delta \triangleq x - \tilde{x}$.

**Randomized smoothing for discrete data.** Lee et al. (2019) derive tight certificates for the $\ell_0$-threat model. Certificates for discrete domains are in general combinatorial problems and computationally challenging. To overcome this, Lee et al. (2019) use the NP-Lemma for discrete random variables (Tocher, 1950) and show that the minimum in Equation 1 can be computed with a linear program (LP):

**Lemma 2** (Discrete Neyman-Pearson lower bound). Partition the input space $\mathcal{X}^D$ into disjoint regions $\mathcal{R}_1, \ldots, \mathcal{R}_I$ of constant likelihood ratio $\mu_x(w)/\mu_{\tilde{x}}(w) = c_i \in \mathbb{R} \cup \{\infty\}$ for all $w \in \mathcal{R}_i$. Define $\tilde{r}_i \triangleq \Pr_{w \sim \mu_{\tilde{x}}}(w \in \mathcal{R}_i)$ and $r_i \triangleq \Pr_{w \sim \mu_x}(w \in \mathcal{R}_i)$. Then Equation 1 is equivalent to the linear program $\underline{p}_{\tilde{x},y} = \min_{\mathbf{h} \in [0,1]^I} \mathbf{h}^T \tilde{\mathbf{r}}$ s.t. $\mathbf{h}^T \mathbf{r} = p_{x,y}$, where $\mathbf{h}$ represents the worst-case classifier. Proof in (Tocher, 1950).

The LP can be solved efficiently using a greedy algorithm that consumes regions with larger likelihood ratio first (Lee et al., 2019). If the number of regions is small and $r_i, \tilde{r}_i$ can be computed efficiently, robustness certificates for discrete data become computationally feasible.

³ We write $\mu(Z)$ when referring to the density of the distribution $\mu$.
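For intuition, a minimal sketch (ours, under the assumptions of Lemma 2) of this greedy procedure: the worst-case classifier spends the label-probability budget on regions with the largest clean-to-perturbed likelihood ratio first. The inputs are the per-region probabilities under the clean and perturbed smoothing distributions.

```python
import numpy as np


def discrete_np_lower_bound(r_clean, r_pert, p_clean):
    """Greedy solution of the LP in Lemma 2 (Lee et al., 2019).

    r_clean -- probability of each constant-ratio region under mu_x
    r_pert  -- probability of each constant-ratio region under mu_x~
    p_clean -- observed label probability p_{x,y} under mu_x
    """
    r_clean, r_pert = np.asarray(r_clean, float), np.asarray(r_pert, float)
    ratio = np.where(r_pert > 0, r_clean / np.where(r_pert > 0, r_pert, 1.0), np.inf)
    budget, bound = float(p_clean), 0.0
    for i in np.argsort(-ratio):                 # largest likelihood ratio first
        if budget <= 0:
            break
        if r_clean[i] == 0:
            continue                             # zero-ratio regions never help the worst case
        h_i = min(1.0, budget / r_clean[i])      # fraction of region i classified as y
        bound += h_i * r_pert[i]
        budget -= h_i * r_clean[i]
    return bound
```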
Figure 2: We derive flexible and efficient robustness certificates for hierarchical randomized smoothing by certifying classifiers on a higher-dimensional space where the indicator $\tau$ is added to the data. (a) For graphs we extend the feature matrix by an additional column that indicates which nodes in the graph to add noise to. (b) For images we add an additional channel indicating the pixels to smooth.

## 4 Hierarchical smoothing distribution

We aim to guarantee robustness against adversaries that perturb a subset of rows of a given matrix $X$. To achieve this we introduce hierarchical randomized smoothing: We partially smooth matrices by adding random noise on a randomly selected subset of their rows only. By adding noise in a more targeted manner than existing methods we obtain stronger robustness while maintaining high accuracy.

We describe sampling from the hierarchical (mixture) distribution $\Psi_X$ as follows: First, we sample an indicator vector $\tau \in \{0, 1\}^N$ from an upper-level smoothing distribution $\phi(\tau)$ that indicates which rows to smooth. Specifically, we draw each element of $\tau$ independently from the same Bernoulli distribution $\tau_i \sim \mathrm{Ber}(p)$, where $p$ denotes the probability for smoothing rows. Second, we sample a matrix $W$ by using a lower-level smoothing distribution $\mu$ (depending on domain and threat model) to add noise on the elements of the selected rows only. We define the overall density of this hierarchical smoothing distribution as $\Psi_X(W, \tau) \triangleq \mu_X(W \mid \tau)\,\phi(\tau)$ with:

$$\mu_X(W \mid \tau) \triangleq \prod_{i=1}^{N} \mu_{x_i}(w_i \mid \tau_i) \qquad \text{for} \qquad \mu_{x_i}(w_i \mid \tau_i) \triangleq \begin{cases} \mu_{x_i}(w_i) & \text{if } \tau_i = 1 \\ \delta(w_i - x_i) & \text{if } \tau_i = 0 \end{cases}$$

where $N$ is the number of rows and $\delta$ the Dirac delta. In the case of a discrete lower-level smoothing distribution, $\Psi_X$ is a probability mass function and $\delta$ denotes the Kronecker delta $\delta_{w_i, x_i}$.

## 5 Provable robustness certificates for hierarchical randomized smoothing

We develop certificates by deriving a condition that guarantees $g(X) = g(\tilde{X})$ for clean $X$ and any perturbed $\tilde{X} \in B^r_{p,\epsilon}(X)$. Certifying robustness under hierarchical smoothing is challenging, especially if we want to make use of certificates for the lower-level smoothing distribution $\mu$ without deriving the entire certificate from scratch. Moreover, Lemma 1 can be intractable in discrete domains where certification involves combinatorial problems. We are interested in certificates that are (1) flexible enough to make use of existing smoothing-based certificates, and (2) efficient to compute.

Our main idea to achieve both flexible and efficient certification under hierarchical smoothing is to embed the data into a higher-dimensional space $W, Z \in \mathcal{Z}^{N \times (D+1)}$ by appending the indicator $\tau$ as an additional column: $Z \triangleq W \,|\, \tau$ (where $|$ denotes concatenation, see Figure 2). We construct a new base classifier that operates on this higher-dimensional space, e.g. by ignoring $\tau$ and applying the original base classifier to $W$. In our experiments we also train the classifiers directly on the extended data.

Certifying the smoothed classifier on this higher-dimensional space simplifies the certification: By appending $\tau$ to the data the supports of the two distributions $\Psi_X$ and $\Psi_{\tilde{X}}$ differ, and they intersect only for those $Z$ for which all perturbed rows are selected by $\tau$ (see Figure 3). This allows the upper-level certificate to separate clean from perturbed rows and to delegate robustness guarantees for perturbed rows to an existing certificate for the lower-level $\mu$. We further elaborate on why this concatenation is necessary in Appendix H. In the following we show that robustness certification under this hierarchical smoothing mainly involves (1) transforming the observed label probability $p_{X,y}$, (2) computing an existing certificate for the transformed $p_{X,y}$, and (3) transforming the resulting lower bound $\underline{p}_{\tilde{X},y}$. We provide pseudo-code for the entire certification in Appendix C.

Figure 3: Overview of the disjoint regions and the probabilities to sample $Z$ from each region. The absence of arrows into specific regions indicates a probability of zero. Only $Z$ in region $\mathcal{R}_2$ can be sampled from both distributions $\Psi_X$ and $\Psi_{\tilde{X}}$. The likelihood ratios visualize the proof of Theorem 1.
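Before the formal derivation, a minimal sketch (ours, not the authors' implementation) of drawing one sample $Z = W \,|\, \tau$ from $\Psi_X$ with a Gaussian lower-level distribution, following Section 4 and Figure 2; the function name and signature are illustrative.

```python
import numpy as np


def sample_hierarchical(X, p, sigma, rng=None):
    """Draw one sample Z = (W | tau) from the hierarchical distribution Psi_X.

    X     -- clean matrix of shape (N, D), one row per entity
    p     -- row-selection probability of the upper-level Bernoulli
    sigma -- std. deviation of the Gaussian lower-level noise on selected rows
    """
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    tau = rng.binomial(1, p, size=N)                   # which rows to smooth
    noise = rng.normal(0.0, sigma, size=(N, D))        # lower-level Gaussian noise
    W = X + tau[:, None] * noise                       # unselected rows stay clean
    return np.concatenate([W, tau[:, None]], axis=1)   # append indicator column
```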
### 5.1 Point-wise robustness certificates for hierarchical randomized smoothing

Before providing certificates against the entire threat model we first derive point-wise certificates, i.e. we consider an arbitrary but fixed $\tilde{X} \in B^r_{p,\epsilon}(X)$ and show robustness $g(X) = g(\tilde{X})$ by deriving a lower bound $\underline{p}_{\tilde{X},y} \leq p_{\tilde{X},y}$. The smoothed classifier $g$ is certifiably robust w.r.t. $\tilde{X}$ if $\underline{p}_{\tilde{X},y} > 0.5$. We derive a lower bound by using the discrete NP-Lemma on the upper-level $\phi$, which allows us to delegate robustness guarantees to the lower-level $\mu$ on the perturbed rows only.

Let $C \triangleq \{i : x_i \neq \tilde{x}_i\}$ denote the rows in which $X$ and $\tilde{X}$ differ. To apply the discrete NP-Lemma we have to partition the space $\mathcal{Z}^{N \times (D+1)}$ into disjoint regions (Lemma 2). The upper-level distribution allows us to define the following four regions (Figure 3): In region $\mathcal{R}_2$ all perturbed rows are selected ($\forall i \in C : \tau_i = 1$). In regions $\mathcal{R}_1$ and $\mathcal{R}_3$ at least one perturbed row is not selected when sampling from $\Psi_X$ and $\Psi_{\tilde{X}}$, respectively. Region $\mathcal{R}_0$ contains $Z$ that cannot be sampled from either distribution.

**Example for matrices.** Consider the following example for matrix data to see the existence of four different regions. Define $X \triangleq (x_1 | x_2 | x_3)^T$ and $\tilde{X} \triangleq (x_1 | \tilde{x}_2 | x_3)^T$ (the second row of $\tilde{X}$ is perturbed). Now consider a smoothed row $w$ with $w \neq x_2$ and $w \neq \tilde{x}_2$ that we can sample from both lower-level distributions, $\mu_{x_2}(w) > 0$ and $\mu_{\tilde{x}_2}(w) > 0$. Then we can define four examples, one for each region:

$$Z_0 = \begin{pmatrix} x_1 & 0 \\ w & 0 \\ x_3 & 0 \end{pmatrix} \in \mathcal{R}_0 \qquad Z_1 = X|\mathbf{0} = \begin{pmatrix} x_1 & 0 \\ x_2 & 0 \\ x_3 & 0 \end{pmatrix} \in \mathcal{R}_1 \qquad Z_2 = \begin{pmatrix} x_1 & 0 \\ w & 1 \\ x_3 & 0 \end{pmatrix} \in \mathcal{R}_2 \qquad Z_3 = \tilde{X}|\mathbf{0} = \begin{pmatrix} x_1 & 0 \\ \tilde{x}_2 & 0 \\ x_3 & 0 \end{pmatrix} \in \mathcal{R}_3$$

with $\Psi_X(Z_0) = 0, \Psi_{\tilde{X}}(Z_0) = 0$; $\;\Psi_X(Z_1) > 0, \Psi_{\tilde{X}}(Z_1) = 0$; $\;\Psi_X(Z_2) > 0, \Psi_{\tilde{X}}(Z_2) > 0$; and $\Psi_X(Z_3) = 0, \Psi_{\tilde{X}}(Z_3) > 0$.

Note that we can sample the row $w$ by adding noise to the rows $x_2$ or $\tilde{x}_2$ only if the upper-level smoothing distribution $\phi$ first selects the perturbed row ($\tau_2 = 1$). In general, we can only sample the same matrix $Z$ from both distributions $\Psi_X$ and $\Psi_{\tilde{X}}$ if the upper-level smoothing distribution first selects all perturbed rows $i \in C$. We group all those $Z$ in the region $\mathcal{R}_2$.

**Partitioning into regions.** By appending the indicator $\tau$ to the data we obtain four disjoint regions. To formally characterize these regions, we define the reachable set as $A \triangleq A_1 \cup A_2 \subseteq \mathcal{Z}^{N \times (D+1)}$ with $A_1 \triangleq \{Z \mid \forall i : (\tau_i = 0) \Rightarrow (z_i = x_i|0)\}$ and $A_2 \triangleq \{Z \mid \forall i : (\tau_i = 0) \Rightarrow (z_i = \tilde{x}_i|0)\}$, where $x_i|0$ denotes concatenation. Intuitively, $A$ only contains matrices that can be sampled from $\Psi$. Now we can define the regions and derive the required probabilities to sample $Z$ from each region:

**Proposition 1.** The regions $\mathcal{R}_0, \mathcal{R}_1, \mathcal{R}_2$ and $\mathcal{R}_3$ are disjoint and partition the space $\mathcal{Z}^{N \times (D+1)}$:

$$\mathcal{R}_0 \triangleq \mathcal{Z}^{N \times (D+1)} \setminus A \qquad\qquad \mathcal{R}_2 \triangleq \{Z \in A \mid \forall i \in C : \tau_i = 1\}$$
$$\mathcal{R}_1 \triangleq \{Z \in A \mid \exists i \in C : \tau_i = 0, z_i = x_i|0\} \qquad \mathcal{R}_3 \triangleq \{Z \in A \mid \exists i \in C : \tau_i = 0, z_i = \tilde{x}_i|0\}$$

**Proposition 2.** Given $\Psi_X$ and $\Psi_{\tilde{X}}$, the regions of Proposition 1 and $\Delta \triangleq 1 - p^{|C|}$, we have

$$\Pr_{Z \sim \Psi_X}[Z \in \mathcal{R}_1] = \Delta \qquad \Pr_{Z \sim \Psi_X}[Z \in \mathcal{R}_2] = 1 - \Delta \qquad \Pr_{Z \sim \Psi_X}[Z \in \mathcal{R}_3] = 0$$
$$\Pr_{Z \sim \Psi_{\tilde{X}}}[Z \in \mathcal{R}_1] = 0 \qquad \Pr_{Z \sim \Psi_{\tilde{X}}}[Z \in \mathcal{R}_2] = 1 - \Delta \qquad \Pr_{Z \sim \Psi_{\tilde{X}}}[Z \in \mathcal{R}_3] = \Delta$$

Proofs in Appendix C. Intuitively, $1 - \Delta = p^{|C|}$ is the probability that all perturbed rows $i \in C$ are selected ($\tau_i = 1$), and $\Delta$ the probability that at least one row $i \in C$ is not selected by $\phi$. Lastly, the probability to sample $Z \in \mathcal{R}_0$ is just 0. We visualize Proposition 1 and Proposition 2 in Figure 3.
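As a sanity check, the following small simulation (ours, not part of the paper) empirically verifies the key quantity behind Proposition 2: under $\Psi_X$ a sample lies in $\mathcal{R}_2$ exactly when all perturbed rows are selected, which happens with probability $p^{|C|} = 1 - \Delta$. The concrete values of $N$, $C$ and $p$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, p, n_samples = 4, [1, 3], 0.8, 200_000   # toy setting: the rows in C are perturbed

tau = rng.binomial(1, p, size=(n_samples, N))  # upper-level Ber(p) selection per row
in_R2 = tau[:, C].all(axis=1)                  # under Psi_X: Z in R2 iff all rows in C selected

delta = 1 - p ** len(C)
print(f"empirical 1-Delta: {in_R2.mean():.4f}   analytical: {1 - delta:.4f}")
```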
**Hierarchical smoothing certificates.** Having derived a partitioning of $\mathcal{Z}^{N \times (D+1)}$ as above, we can apply the NP-Lemma for discrete random variables (Lemma 2) on the upper-level distribution $\phi$. Notably, we show that the problem of certification under hierarchical smoothing can be reduced to proving robustness for the lower-level distribution $\mu$ on the adversarially perturbed rows $C$ only:

**Theorem 1** (Neyman-Pearson lower bound for hierarchical smoothing). Given fixed $X, \tilde{X} \in \mathcal{X}^{N \times D}$, let $\mu_X$ denote a smoothing distribution that operates independently on matrix elements, and $\Psi_X$ the induced hierarchical distribution over $\mathcal{Z}^{N \times (D+1)}$. Given label $y \in \mathcal{Y}$ and the probability $p_{X,y}$ to classify $X$ as $y$ under $\Psi$. Define $\hat{S}_\kappa \triangleq \{W \in \mathcal{X}^{|C| \times D} : \mu_{\tilde{X}_C}(W) \leq \kappa \cdot \mu_{X_C}(W)\}$. We have:

$$\underline{p}_{\tilde{X},y} = \Pr_{W \sim \mu_{\tilde{X}_C}}[W \in \hat{S}_\kappa] \cdot (1 - \Delta) \quad \text{with } \kappa \in \mathbb{R}^+ \text{ s.t. } \Pr_{W \sim \mu_{X_C}}[W \in \hat{S}_\kappa] = \frac{p_{X,y} - \Delta}{1 - \Delta}$$

where $X_C$ and $\tilde{X}_C$ denote those rows $x_i$ of $X$ and $\tilde{x}_i$ of $\tilde{X}$ with $i \in C$, that is $x_i \neq \tilde{x}_i$.

*Proof sketch (full proof in Appendix C).* In the worst case all matrices in $\mathcal{R}_1$ are correctly classified. Note that this is the worst case since we cannot obtain any matrix of region $\mathcal{R}_1$ by sampling from the distribution $\Psi_{\tilde{X}}$. Therefore the worst-case classifier first uses the budget $p_{X,y}$ on region $\mathcal{R}_1$ and we can subtract the probability $\Pr_{Z \sim \Psi_X}[Z \in \mathcal{R}_1] = \Delta$ from the label probability $p_{X,y}$. Since we never sample matrices of region $\mathcal{R}_3$ from the distribution $\Psi_X$, the remaining correctly classified matrices must be in region $\mathcal{R}_2$, which one reaches with probability $1 - \Delta$ from both distributions $\Psi_X$ and $\Psi_{\tilde{X}}$. In region $\mathcal{R}_2$, however, we can simplify the problem to computing the certificate for $\mu$ on the perturbed rows to distribute the remaining probability. Since one never samples matrices of region $\mathcal{R}_0$ we do not have to consider it. The remaining statement follows from the Neyman-Pearson Lemma (Neyman and Pearson, 1933; Tocher, 1950). Note that we also derive the counterpart for discrete lower-level $\mu$ in Appendix F.

**Implications.** Notably, Theorem 1 implies that we can delegate robustness guarantees to the lower-level smoothing distribution $\mu$: we compute the optimal value of Equation 1 under $\mu$ given the transformed probability $(p_{X,y} - \Delta)/(1 - \Delta)$, and multiply the result with $(1 - \Delta)$. This way we obtain the optimal value of Equation 1 under the hierarchical smoothing distribution $\Psi$ on the extended matrix space and thus robustness guarantees for hierarchical smoothing. This means our certificates are highly flexible and can integrate the whole suite of existing smoothing distributions with independent noise per dimension (Lecuyer et al., 2019; Cohen et al., 2019; Lee et al., 2019; Yang et al., 2020a). This is in contrast to all existing approaches where one has to come up with novel smoothing distributions and derive certificates from scratch once new threat models are introduced.

**Special cases.** We discuss two special cases of the probability $p$ for smoothing rows: First, if the probability to select rows for smoothing is $p = 1$ (intuitively, we always smooth the entire matrix), then we have $\Delta = 1 - p^{|C|} = 0$ for any number of perturbed rows $|C|$, and with Theorem 1 we get $\underline{p}_{\tilde{X},y}(\Psi) = \underline{p}_{\tilde{X},y}(\mu)$. That is, in this special case we obtain the original certificate for the lower-level smoothing distribution $\mu$. Note that in this case the certificate ignores $r$. Second, if the probability to select rows is $p = 0$ (intuitively, we never sample smoothed matrices), we have $\Delta = 1 - p^{|C|} = 1$ for any $|C| \geq 1$, and with Theorem 1 we get $\underline{p}_{\tilde{X},y} = 0$, that is we do not obtain any certificates.
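The three steps of Theorem 1 translate directly into code. The sketch below (ours, not the authors' implementation) instantiates the lower level with the Gaussian bound of Cohen et al. (2019), anticipating Section 6.1; the function names and the example values in the closing comment are illustrative.

```python
from scipy.stats import norm


def gaussian_np_lower_bound(p_clean, l2_dist, sigma):
    """Lower-level bound of Cohen et al. (2019): Phi(Phi^{-1}(p) - d / sigma)."""
    return norm.cdf(norm.ppf(p_clean) - l2_dist / sigma)


def hierarchical_lower_bound(p_label, p_select, num_perturbed, l2_dist, sigma):
    """Point-wise lower bound of Theorem 1 with a Gaussian lower-level distribution.

    p_label       -- probability p_{X,y} of the majority class under Psi_X
    p_select      -- row-selection probability p of the upper level
    num_perturbed -- number of perturbed rows |C|
    l2_dist       -- l2 distance between the perturbed and clean rows X_C, X~_C
    sigma         -- std. deviation of the lower-level Gaussian
    """
    delta = 1 - p_select ** num_perturbed
    if p_label <= delta:
        return 0.0                                         # no budget left for region R2
    p_transformed = (p_label - delta) / (1 - delta)        # step (1): transform p_{X,y}
    lower = gaussian_np_lower_bound(p_transformed, l2_dist, sigma)  # step (2): delegate
    return (1 - delta) * lower                             # step (3): transform back


# e.g. hierarchical_lower_bound(0.98, 0.85, 3, 0.35, 0.5) is roughly 0.54 > 0.5,
# hence the prediction is certifiably robust w.r.t. that fixed perturbation.
```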
### 5.2 Regional robustness certificates for hierarchical randomized smoothing

So far we can only guarantee that the prediction is robust for a specific perturbed input $\tilde{X} \in B^r_{p,\epsilon}(X)$. To ensure that the smoothed classifier is robust to the entire threat model $B^r_{p,\epsilon}(X)$ we have to guarantee $\min_{\tilde{X} \in B^r_{p,\epsilon}(X)} \underline{p}_{\tilde{X},y} > 0.5$. In the following we show that it is sufficient to compute Theorem 1 for $\Delta = 1 - p^r$ with the largest radius $r$. In fact, the final certificate is independent of which rows are perturbed; as long as the two matrices differ in exactly $r$ rows, the certificate is the same.

**Proposition 3.** Given clean $X \in \mathcal{X}^{N \times D}$, consider the set $Q^r_{p,\epsilon}(X)$ of inputs at a fixed distance $r$ (i.e. $|C| = r$ for all $\tilde{X} \in Q^r_{p,\epsilon}(X)$). Then we have $\min_{\tilde{X} \in B^r_{p,\epsilon}(X)} \underline{p}_{\tilde{X},y} = \min_{\tilde{X} \in Q^r_{p,\epsilon}(X)} \underline{p}_{\tilde{X},y}$.

*Proof.* The probability $1 - \Delta = p^{|C|}$ is monotonically decreasing in the number of perturbed rows $|C|$. Thus the lower bound $\underline{p}_{\tilde{X},y}$ is also monotonically decreasing in $|C|$ (see Theorem 1). It follows for fixed $\epsilon$ that $\underline{p}_{\tilde{X},y}$ is minimal at $|C| = r$, i.e. the largest possible radius under our threat model $B^r_{p,\epsilon}(X)$.

## 6 Initializing hierarchical randomized smoothing

In the previous section we derived robustness certificates for hierarchical randomized smoothing for any lower-level smoothing distribution $\mu$ by deriving a general lower bound on the probability $p_{\tilde{X},y}$ in Theorem 1. In this section we instantiate our hierarchical smoothing framework with specific lower-level smoothing distributions for discrete and continuous data.

### 6.1 Hierarchical randomized smoothing using Gaussian isotropic smoothing

Concerning continuous data $\mathcal{X} = \mathbb{R}$, consider the threat model $B^r_{p,\epsilon}(X)$ for $p = 2$. We initialize the lower-level smoothing distribution of hierarchical smoothing with the Gaussian distribution (Cohen et al., 2019) (as introduced in Section 3), since it is specifically designed and tight for the well-studied $\ell_2$-norm threat model. For our purposes we apply the Gaussian noise independently on the matrix entries. In the following we present the binary-class certificates for hierarchical randomized smoothing induced by isotropic Gaussian smoothing:

**Corollary 1.** Given continuous $\mathcal{X} = \mathbb{R}$, consider the threat model $B^r_{2,\epsilon}(X)$ for fixed $X \in \mathcal{X}^{N \times D}$. Initialize the hierarchical smoothing distribution $\Psi$ using the isotropic Gaussian distribution $\mu_X(W) \triangleq \prod_{i=1}^{N} \mathcal{N}(w_i \mid x_i, \sigma^2 I)$ that applies noise independently on each matrix element. Let $y^* \in \mathcal{Y}$ denote the majority class and $p_{X,y^*}$ the probability to classify $X$ as $y^*$ under $\Psi$. Then the smoothed classifier $g$ is certifiably robust $g(X) = g(\tilde{X})$ for any $\tilde{X} \in B^r_{2,\epsilon}(X)$ if

$$\epsilon < \sigma \left( \Phi^{-1}\!\left(\frac{p_{X,y^*} - \Delta}{1 - \Delta}\right) - \Phi^{-1}\!\left(\frac{1}{2(1 - \Delta)}\right) \right)$$

where $\Phi^{-1}$ denotes the inverse CDF of the standard normal distribution and $\Delta \triangleq 1 - p^r$. Proof in Appendix E. We also derive the corresponding multi-class certificates in Appendix E.

### 6.2 Hierarchical randomized smoothing using sparse smoothing

To demonstrate our certificates in discrete domains we consider binary data $\mathcal{X} = \{0, 1\}$ and model adversaries that delete $r_d$ ones (flip $1 \to 0$) and add $r_a$ ones (flip $0 \to 1$), that is we consider the ball $B^r_{r_a,r_d}(X)$ (see Appendix B for a formal introduction). We initialize the lower-level smoothing distribution of hierarchical smoothing with the sparse smoothing distribution proposed by Bojchevski et al. (2020), which introduces two different noise probabilities, flipping $0 \to 1$ with probability $p_+$ and $1 \to 0$ with probability $p_-$, i.e. $\Pr(\mu(x)_i \neq x_i) = p_+^{1-x_i} \, p_-^{x_i}$. The main idea is that the different flipping probabilities allow preserving the sparsity of the underlying data, making this approach particularly useful for graph-structured data. Note that we can consider the discrete certificate of Lee et al. (2019) as a special case (Bojchevski et al., 2020). We derive the corresponding certificates in Appendix F.
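For the discrete instantiation, a small sketch (ours, purely illustrative) of the sparse lower-level noise applied only to the rows selected by the upper level; `p_plus` and `p_minus` play the roles of $p_+$ and $p_-$.

```python
import numpy as np


def sample_sparse_hierarchical(X, p, p_plus, p_minus, rng=None):
    """One sample Z = (W | tau) with the sparse lower-level noise of
    Bojchevski et al. (2020) applied only to rows selected by tau ~ Ber(p).

    X       -- binary matrix of shape (N, D)
    p       -- row-selection probability of the upper level
    p_plus  -- probability of flipping 0 -> 1 in selected rows
    p_minus -- probability of flipping 1 -> 0 in selected rows
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.binomial(1, p, size=X.shape[0])
    flip_prob = np.where(X == 1, p_minus, p_plus)            # value-dependent flip probabilities
    flips = rng.binomial(1, flip_prob)                       # candidate bit flips
    W = np.where(tau[:, None] == 1, np.abs(X - flips), X)    # flip selected rows only
    return np.concatenate([W, tau[:, None]], axis=1)
```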
### 6.3 Hierarchical randomized smoothing using ablation smoothing

Lastly, randomized ablation (Levine and Feizi, 2020b) is a smoothing distribution where the input is not randomly smoothed but masked, e.g. by replacing the input with a special ablation token $t \notin \mathcal{X}^D$ that does not exist in the original space. There are different variations of ablation and we carefully discuss such differences in Appendix G. By choosing a smoothing distribution $\mu$ that ablates all selected rows we can prove the following connection between hierarchical and ablation smoothing:

**Corollary 2.** Initialize the hierarchical smoothing distribution $\Psi$ with the ablation smoothing distribution $\mu_x(t) = 1$ for the ablation token $t \notin \mathcal{X}^D$ and $\mu_x(w) = 0$ for $w \in \mathcal{X}^D$ otherwise (i.e. we always ablate selected rows). Define $\Delta \triangleq 1 - p^r$. Then the smoothed classifier $g$ is certifiably robust $g(X) = g(\tilde{X})$ for any $\tilde{X} \in B^r_{2,\epsilon}(X)$ if $p_{X,y^*} - \Delta > 0.5$.

Proof in Appendix G. Interestingly, Corollary 1 and Corollary 2 show that hierarchical smoothing is a generalization of additive noise and ablation certificates. The additional column indicates which rows will be ablated, but instead of ablating we only add noise to them. In contrast to ablation smoothing (which just checks whether $p_{X,y^*} - \Delta > 0.5$) we further utilize the budget $p_{X,y^*} - \Delta$ by plugging it into the lower-level certificate. Notably, we are the first to generalize beyond ablation and additive noise certificates, and our certificates are in fact orthogonal to all existing robustness certificates that are based on randomized smoothing.

Figure 4: Hierarchical smoothing significantly expands the Pareto-front w.r.t. robustness and accuracy in node and image classification. Left: Discrete hierarchical smoothing for node classification, smoothed GAT on Cora-ML ($r = 1$, $r_d = 40$, $r_a = 0$); baselines are sparse smoothing (Bojchevski et al., 2020) and interception smoothing (Scholten et al., 2022). Right: Continuous hierarchical smoothing for image classification, smoothed ResNet50 on CIFAR10 ($r = 3$, $\epsilon = 0.35$); baselines are Gaussian smoothing (Cohen et al., 2019) and ablation smoothing (Levine and Feizi, 2020b). The non-smoothed GAT achieves 80% ± 2% clean accuracy on Cora-ML, ResNet50 94% on CIFAR10. Large circles and stars are dominating points for each certificate. Dashed lines connect dominating points across methods.

## 7 Experimental evaluation

In this section we highlight the importance of hierarchical smoothing in image and node classification, instantiating our framework with two well-established smoothing distributions for continuous (Cohen et al., 2019) and discrete data (Bojchevski et al., 2020). In the following we present experiments supporting our main findings. We refer to Appendix A for more experimental results and to Appendix B for the full experimental setup and instructions to reproduce our results.

**Threat models.** For images we model adversaries that perturb at most $r$ pixels of an image, where the perturbation over all channels is bounded by $\epsilon$ under the $\ell_2$-norm. For graphs we model adversaries that perturb binary features of at most $r$ nodes by inserting at most $r_a$ and deleting at most $r_d$ ones.

**Datasets and models.** For image classification we train ResNet50 (He et al., 2016) on CIFAR10 (Krizhevsky et al., 2009), consisting of images with three channels of size 32x32 categorized into 10 classes. We certify the models on a random but fixed subset of 1,000 test images.
For node classification we train graph attention networks (GATs) (Velickovic et al., 2018) with two layers on Cora-ML (McCallum et al., 2000; Bojchevski and Günnemann, 2018), consisting of 2,810 nodes with 2,879 binary features, 7,981 edges and 7 classes. We train and certify GNNs in an inductive learning setting (Scholten et al., 2022). We defer results for more models and datasets to Appendix A.

**Experimental setup.** During training, we sample one smoothed matrix each epoch to train our models on smoothed data. Note that for hierarchical smoothing we also train our models on the higher-dimensional matrices. At test time, we use significance level $\alpha = 0.01$, $n_0 = 1,000$ samples for estimating the majority class and $n_1 = 10,000$ samples for certification. We report the classification accuracy of the smoothed classifier on the test set (clean accuracy), and the certified accuracy, that is the number of test samples that are correctly classified and certifiably robust for a given radius. For node classification we report results averaged over five random graph splits and model initializations.

**Baseline certificates.** We compare hierarchical smoothing to additive noise and ablation certificates since our method generalizes both into a novel framework. To this end, we implement the two additive noise certificates of Cohen et al. (2019) and Bojchevski et al. (2020) for continuous and discrete data, respectively. Concerning ablation certificates, we implement the certificate of Levine and Feizi (2020b) for image classification, which ablates entire pixels. For node classification we implement the generalization to GNNs of Scholten et al. (2022) that ablates all node features of entire nodes. We conduct exhaustive experiments to explore the space of smoothing parameters for each method (see details in Appendix B). To compare the three different approaches we fix the threat model and investigate the robustness-accuracy trade-off by comparing certified accuracy against clean accuracy.

**Hierarchical smoothing significantly expands the Pareto-front.** Notably, hierarchical smoothing clearly expands the Pareto-front when optimizing for both certified accuracy and clean accuracy (see Figure 4). Especially when the number of adversarial nodes or pixels is small but the feature perturbation is large, hierarchical smoothing significantly dominates both baselines. Interestingly, both baselines sacrifice either robustness for accuracy or vice versa: While additive noise certificates obtain higher clean accuracy at worse certified accuracy, ablation certificates obtain strong certified accuracy at lower clean accuracy. In contrast, hierarchical smoothing allows us to explore the entire space and significantly expands the Pareto-front of the robustness-accuracy trade-off under our threat model. This demonstrates that hierarchical smoothing is a more flexible framework and represents a useful and novel tool for analyzing the robustness of machine learning models in general.

**Entity-selection probability.** With hierarchical smoothing we introduce the entity-selection probability $p$ that allows us to better control the robustness-accuracy trade-off under our threat model. Specifically, for larger $p$ we add more noise and increase robustness but also decrease accuracy. As usual in randomized smoothing, the optimal $p$ needs to be fine-tuned for a given task, dataset and radius $r$. In our settings we found dominating points e.g. for $p = 0.81$ (Cora-ML) and $p = 0.85$ (CIFAR10).
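For intuition on how $p$ and $r$ interact, the short illustration below (ours, not part of the paper) tabulates $\Delta = 1 - p^r$; under the Gaussian instantiation of Corollary 1, a non-trivial certificate requires the majority-class probability to exceed $0.5 + \Delta$, so small $p$ or large $r$ quickly make certification impossible regardless of the classifier (see also Section 8).

```python
# Illustration: the certification budget Delta = 1 - p^r for different settings.
# Under Corollary 1 a non-trivial radius requires p_{X,y} > 0.5 + Delta.
for p in (0.7, 0.81, 0.85, 0.95):
    for r in (1, 2, 3):
        delta = 1 - p ** r
        print(f"p={p:.2f}  r={r}  Delta={delta:.3f}  need p_Xy > {0.5 + delta:.3f}")
```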
## 8 Discussion

Our hierarchical smoothing approach overcomes limitations of randomized smoothing certificates: Instead of having to derive Lemma 1 from scratch for each smoothing distribution, we can easily integrate the whole suite of existing and future randomized smoothing certificates. Our framework is highly flexible and allows us to certify robustness against a wide range of adversarial perturbations.

Hierarchical smoothing also comes with limitations: First, we inherit the limitations of ablation certificates, in which the certifiable radius $r$ is bounded by the smoothing parameters independently of the classifier (Scholten et al., 2022). The underlying reason is that we do not evaluate the smoothed classifier on the entire space $A$, since we cannot reach certain matrices, as exploited in Section 5. Second, the classifier defined on the original matrix space is invariant with respect to the new dimension introduced by our smoothing distribution, and not incorporating such invariances into the Neyman-Pearson Lemma yields strictly looser guarantees (Schuchardt and Günnemann, 2022). Beyond that, our certificates are highly efficient: The only additional cost for computing hierarchical smoothing certificates is evaluating the algebraic term $\Delta = 1 - p^r$, which takes constant time (see also Appendix H). Since we train our classifiers on the extended matrices, we also allow classifiers to distinguish whether entities have been smoothed by the lower-level distribution (see Appendix A).

**Future work.** Future work can build upon our work towards even stronger certificates in theory and practice, for example by deriving certificates that are tight under the proposed threat model and efficient to compute at the same time. Future work can further (1) implement certificates for other $\ell_p$-norms (Levine and Feizi, 2021; Vorácek and Hein, 2023) and domains, (2) improve and assess adversarial robustness, and (3) introduce novel architectures and training techniques.

**Broader impact.** Our hierarchical smoothing certificates are highly flexible and provide provable guarantees for arbitrary ($\ell_p$-norm) threat models, provided that certificates for the corresponding lower-level smoothing distribution exist. Therefore our contributions impact the certifiable robustness of a large variety of models in discrete and continuous domains, and therefore the field of reliable machine learning in general: Our robustness certificates provide novel ways of assessing, understanding and improving the adversarial robustness of machine learning models.

## 9 Conclusion

With hierarchical smoothing we propose the first hierarchical (mixture) smoothing distribution for complex data where adversaries can perturb only a subset of all entities (e.g. pixels in an image, or nodes in a graph). Our main idea is to add noise in a more targeted manner: We first select a subset of all entities and then we add noise to them. By certifying robustness under this hierarchical smoothing distribution we achieve stronger robustness guarantees while still maintaining high accuracy. Overall, our certificates for hierarchical smoothing represent novel, flexible tools for certifying the adversarial robustness of machine learning models towards more reliable machine learning.
## Acknowledgments and Disclosure of Funding

This work has been funded by the Munich Center for Machine Learning, by the DAAD program Konrad Zuse Schools of Excellence in Artificial Intelligence (sponsored by the Federal Ministry of Education and Research), and by the German Research Foundation, grant GU 1409/4-1. The authors of this work take full responsibility for its content.

## References

Aleksandar Bojchevski and Stephan Günnemann. Certifiable robustness to graph perturbations. In Neural Information Processing Systems, NeurIPS, 2019.
Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. In International Conference on Learning Representations, 2018.
Aleksandar Bojchevski, Johannes Gasteiger, and Stephan Günnemann. Efficient robustness certificates for discrete data: Sparsity-aware randomized smoothing for graphs, images and more. In Proceedings of the 37th International Conference on Machine Learning, pages 1003–1013, 2020.
Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? In International Conference on Learning Representations, 2022.
Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019.
Wenda Chu, Linyi Li, and Bo Li. TPC: Transformation-specific smoothing for point cloud models. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 4035–4056. PMLR, 2022.
Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
Francesco Croce and Matthias Hein. Sparse and imperceivable adversarial attacks. In ICCV, pages 4723–4731. IEEE, 2019.
Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
Zhidong Gao, Rui Hu, and Yanmin Gong. Certified robustness of graph classification against topology attack with randomized smoothing. In GLOBECOM, pages 1–6. IEEE, 2020.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR (Poster), 2015.
Lukas Gosch, Simon Geisler, Daniel Sturm, Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. Adversarial training for graph neural networks: Pitfalls, solutions, and new directions. In Advances in Neural Information Processing Systems, NeurIPS, 2023a.
Lukas Gosch, Daniel Sturm, Simon Geisler, and Stephan Günnemann. Revisiting robustness in graph machine learning. In International Conference on Learning Representations (ICLR), 2023b.
Stephan Günnemann. Graph neural networks: Adversarial robustness. In Lingfei Wu, Peng Cui, Jian Pei, and Liang Zhao, editors, Graph Neural Networks: Foundations, Frontiers, and Applications, chapter 8, pages 149–176. Springer, 2022.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety. CoRR, abs/2109.13916, 2021.
Miklós Z. Horváth, Mark Niklas Müller, Marc Fischer, and Martin T. Vechev. Boosting randomized smoothing with variance reduced classifiers. In ICLR. OpenReview.net, 2022a.
Miklós Z. Horváth, Mark Niklas Müller, Marc Fischer, and Martin T. Vechev. (De-)Randomized smoothing for decision stump ensembles. In Advances in Neural Information Processing Systems, NeurIPS, 2022b.
Miklós Z. Horváth, Mark Niklas Müller, Marc Fischer, and Martin T. Vechev. Robust and accurate - compositional architectures for randomized smoothing. CoRR, abs/2204.00487, 2022c.
Zhuoqun Huang, Neil Marchant, Keane Lucas, Lujo Bauer, Olya Ohrimenko, and Benjamin I. P. Rubinstein. RS-Del: Edit distance robustness certificates for sequence classifiers via randomized deletion. In Advances in Neural Information Processing Systems, NeurIPS, 2023.
Jongheon Jeong and Jinwoo Shin. Consistency regularization for certified robustness of smoothed classifiers. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Intrinsic certified robustness of bagging against data poisoning attacks. In AAAI, pages 7961–7969. AAAI Press, 2021.
Hongwei Jin, Zhan Shi, Venkata Jaya Shankar Ashish Peruri, and Xinhua Zhang. Certified robustness of graph convolution networks for graph classification under topological attacks. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR (Poster). OpenReview.net, 2019.
Marcel Kollovieh, Lukas Gosch, Yan Scholten, Marten Lienen, and Stephan Günnemann. Assessing robustness via score-based adversarial image generation. arXiv preprint arXiv:2310.04285, 2023.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656–672. IEEE, 2019.
Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. In Advances in Neural Information Processing Systems, NeurIPS, 2019.
Alexander Levine and Soheil Feizi. (De)Randomized smoothing for certifiable defense against patch attacks. In Advances in Neural Information Processing Systems, NeurIPS, 2020a.
Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4585–4593, 2020b.
Alexander Levine and Soheil Feizi. Improved, deterministic smoothing for ℓ1 certified robustness. In International Conference on Machine Learning, 2021.
Alexander Levine, Aounon Kumar, Thomas Goldstein, and Soheil Feizi. Tight second-order certificates for randomized smoothing. arXiv preprint arXiv:2010.10549, 2020.
Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems, NeurIPS, pages 9459–9469, 2019.
Linyi Li, Xiangyu Qi, Tao Xie, and Bo Li. SoK: Certified robustness for deep neural networks. CoRR, abs/2009.04131, 2020.
Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. PointGuard: Provably robust 3D point cloud classification. In CVPR, pages 6186–6195. Computer Vision Foundation / IEEE, 2021.
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR (Poster). OpenReview.net, 2017.
Jiaqi Ma, Shuangrui Ding, and Qiaozhu Mei. Towards more practical adversarial attacks on graph neural networks. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Inf. Retr., 3(2):127–163, 2000.
Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: A few pixels make a big difference. In CVPR, pages 9087–9096. Computer Vision Foundation / IEEE, 2019.
Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing. In Advances in Neural Information Processing Systems, NeurIPS, volume 33, pages 4501–4511, 2020.
Felix Mujkanovic, Simon Geisler, Stephan Günnemann, and Aleksandar Bojchevski. Are defenses for graph neural networks robust? In Neural Information Processing Systems, NeurIPS, 2022.
Mark Niklas Müller, Mislav Balunovic, and Martin T. Vechev. Certify or predict: Boosting certified robustness with compositional architectures. In ICLR. OpenReview.net, 2021.
Jerzy Neyman and Egon Sharpe Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, and Percy Liang. Adversarial training can hurt generalization. CoRR, abs/1906.06032, 2019.
Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, and J. Zico Kolter. Denoised smoothing: A provable defense for pretrained classifiers. In Advances in Neural Information Processing Systems, NeurIPS, 2020.
Yan Scholten, Jan Schuchardt, Simon Geisler, Aleksandar Bojchevski, and Stephan Günnemann. Randomized message-interception smoothing: Gray-box certificates for graph neural networks. In Advances in Neural Information Processing Systems, NeurIPS, 2022.
Jan Schuchardt and Stephan Günnemann. Invariance-aware randomized smoothing certificates. In Advances in Neural Information Processing Systems, NeurIPS, 2022.
Jan Schuchardt, Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Collective robustness certificates: Exploiting interdependence in graph neural networks. In International Conference on Learning Representations, 2021.
Jan Schuchardt, Yan Scholten, and Stephan Günnemann. Provable adversarial robustness for group equivariant tasks: Graphs, point clouds, molecules, and more. In Advances in Neural Information Processing Systems, NeurIPS, 2023a.
Jan Schuchardt, Tom Wollschläger, Aleksandar Bojchevski, and Stephan Günnemann. Localized randomized smoothing for collective robustness certification. In ICLR. OpenReview.net, 2023b.
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput., 23(5):828–841, 2019.
Peter Súkeník, Aleksei Kuvshinov, and Stephan Günnemann. Intriguing properties of input-dependent randomized smoothing. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 20697–20743. PMLR, 2022.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR (Poster), 2014.
Keith D. Tocher. Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37(1/2):130–144, 1950.
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR (Poster). OpenReview.net, 2019.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
Václav Vorácek and Matthias Hein. Improving l1-certified robustness via randomized smoothing by leveraging box constraints. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 35198–35222. PMLR, 2023.
Binghui Wang, Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Certified robustness of graph neural networks against adversarial structural perturbation. In KDD, pages 1645–1653. ACM, 2021.
Yihan Wu, Aleksandar Bojchevski, Aleksei Kuvshinov, and Stephan Günnemann. Completing the picture: Randomized smoothing suffers from the curse of dimensionality for a large family of distributions. In AISTATS, volume 130 of Proceedings of Machine Learning Research, pages 3763–3771. PMLR, 2021.
Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. Adversarial examples improve image recognition. In CVPR, pages 816–825. Computer Vision Foundation / IEEE, 2020.
Greg Yang, Tony Duan, J. Edward Hu, Hadi Salman, Ilya P. Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 10693–10705. PMLR, 2020a.
Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. In Advances in Neural Information Processing Systems, NeurIPS, 2020b.
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7472–7482. PMLR, 2019.
Daniel Zügner and Stephan Günnemann. Certifiable robustness and robust training for graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
Daniel Zügner and Stephan Günnemann. Certifiable robustness of graph convolutional networks under structure perturbations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1656–1665, 2020.

## A Additional experiments and results

In the following we provide additional results for our experiments with hierarchical randomized smoothing for the tasks of node and image classification.

Figure 5: Discrete hierarchical smoothing significantly extends the Pareto-front w.r.t. robustness-accuracy (smoothed GAT on Cora-ML). Left: $r = 1$, $r_a = 0$, $r_d = 20$. Right: $r = 1$, $r_a = 0$, $r_d = 40$.
Figure 6: Discrete hierarchical smoothing significantly extends the Pareto-front w.r.t. robustness-accuracy (smoothed GAT on Cora-ML). Left: $r = 1$, $r_a = 5$, $r_d = 0$. Right: $r = 1$, $r_a = 10$, $r_d = 0$.

Figure 7: Discrete hierarchical smoothing significantly extends the Pareto-front w.r.t. robustness-accuracy (smoothed GAT on Cora-ML). Left: $r = 1$, $r_a = 5$, $r_d = 15$. Right: $r = 2$, $r_a = 0$, $r_d = 40$.

Figure 8: Continuous hierarchical smoothing significantly extends the Pareto-front w.r.t. robustness-accuracy (smoothed ResNet50 on CIFAR10). Left: $r = 2$, $\epsilon = 0.4$. Right: $r = 3$, $\epsilon = 0.3$.

Figure 9: Continuous hierarchical smoothing significantly extends the Pareto-front w.r.t. robustness-accuracy (smoothed ResNet50 on CIFAR10). Left: $r = 3$, $\epsilon = 0.35$. Right: $r = 4$, $\epsilon = 0.4$.

Figure 10: Skip-connections for hierarchical and sparse smoothing significantly extend the Pareto-front (smoothed GAT on Cora-ML). Left: $r = 2$, $r_a = 0$, $r_d = 70$. Right: $r = 3$, $r_a = 0$, $r_d = 70$.

Figure 11: Hierarchical smoothing with and without appended column (smoothed GAT on Cora-ML). Left: $r = 1$, $r_a = 0$, $r_d = 40$. Right: $r = 3$, $r_a = 0$, $r_d = 70$.
Figure 12: Upper row: Smoothed GNNs on Cora-ML (p = 0.8, pa = 0.006, pd = 0.88) for r = 1, 2, 3 from left to right; certified accuracy over the radii ra (dotted) and rd (solid), comparing GAT, GATv2, GCN and APPNP. Lower row: Smoothed ResNet50 on CIFAR10 for different radii r ∈ {1, 2, 3, 4} (Left: k = 150, σ = 0.25. Middle: k = 150, σ = 0.35. Right: k = 160, σ = 0.4).

Figure 13: Smoothed GAT models on Cora-ML (r = 1, ra = 0): effect of the number of Monte-Carlo samples and of the significance level α on the certifiable robustness (certified accuracy over radius rd). Left: α = 0.01, varying n1 ∈ {100; 1,000; 10,000; 100,000}. Middle: n1 = 100, varying α ∈ {0.1, 0.01, 0.001, 0.0001}. Right: n1 = 10,000, varying α. Different clean accuracies at rd = 0 are due to abstained predictions.

For node classification we experimentally observe that smoothed GAT models are more robust on average (Figure 12, upper row). We therefore show more results for smoothed GAT models in the following. Recall that r is the number of adversarial nodes, rd the number of adversarial deletions (1→0), and ra the number of adversarial insertions (0→1). In Figure 5 we show results for varying rd with fixed r = 1 and ra = 0. In Figure 6 we further vary ra with fixed r = 1 and rd = 0. In Figure 7 we show results for r = 1, rd > 0 and ra > 0, and separately results for r = 2. For image classification, recall that r is the number of pixels controlled by the adversary and ϵ the perturbation strength in ℓ2-distance across channels. In Figure 8 and Figure 9 we provide additional results for different radii r and ϵ. Clearly, hierarchical randomized smoothing significantly expands the Pareto-front w.r.t. the robustness-accuracy trade-off in all of these settings. Since our certificates generalize ablation and additive noise certificates into a new common framework, our method performs either better than or on par with the existing methods under our threat model. Note that in special threat models where e.g. adversaries have access to all nodes (r = N), our certificates correspond to standard randomized smoothing and consequently perform on par with existing methods.

In Figure 12 (upper row) we compare different GNN architectures for the task of node classification and generally observe that smoothed GAT models are more robust on average. In Figure 12 (lower row) we provide further results for different smoothing parameters when certifying the robustness of smoothed ResNet50 models on CIFAR10. Note the different effects on the certifiable robustness at different radii r when increasing σ and k (increasing k corresponds to decreasing p).

In Figure 10 we study the setting where adversaries do not have access to target nodes in the graph but only to nodes in the first- and second-hop neighborhoods (this setting has been previously studied by Scholten et al. (2022)). Such threat models allow us to introduce skip-connections: We can forward the (unsmoothed) features of target nodes separately through the GNN (without graph structure) and add the final representations to the representations obtained with smoothing (a minimal sketch of this idea is shown below). In Figure 10 we experimentally observe that such skip-connections can significantly expand the Pareto-front w.r.t. robustness and accuracy of sparse and hierarchical smoothing.
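The following is a minimal sketch of such a skip-connection, assuming a generic GNN module that maps node features and an edge index to per-node logits and that can handle an empty edge index (e.g. via self-loops). The class and argument names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class SkipConnectionSmoothedGNN(nn.Module):
    """Hypothetical wrapper that adds an unsmoothed skip branch for a target node."""

    def __init__(self, gnn: nn.Module):
        super().__init__()
        self.gnn = gnn  # assumed signature: gnn(node_features, edge_index) -> per-node logits

    def forward(self, x_smoothed, edge_index, x_clean, target_idx):
        # Branch 1: smoothed features, full graph structure.
        logits_target = self.gnn(x_smoothed, edge_index)[target_idx]
        # Branch 2: clean target features, no edges, so message passing reduces to a
        # per-node transformation and the branch only depends on the target node itself.
        empty_edges = torch.empty((2, 0), dtype=torch.long, device=x_clean.device)
        logits_skip = self.gnn(x_clean[target_idx:target_idx + 1], empty_edges)[0]
        # Final prediction: sum of both branches.
        return logits_target + logits_skip
```

Because the skip branch depends only on the target node's own (unsmoothed) features, it is untouched by adversaries that only control nodes in the neighborhood, which is what makes this construction attractive under the threat model of Figure 10.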
In Figure 11 we perform experiments with and without the additional column for smoothed GAT models on Cora-ML. Notably, training the models on the extended matrices and using the additional column at inference can result in better classifiers: As we experimentally observe for graph-structured data, using the additional column significantly extends the Pareto-front w.r.t. robustness and accuracy under specific threat models.

In Figure 13 we study the effect of different numbers of Monte-Carlo samples on the certificate strength. We observe that for our setting, using α = 0.01 and n1 = 10,000 is sufficient to compare our method to the baselines. Note that in practice one would have to fix α and n0, n1 prior to computing certificates to ensure their statistical validity.

Error bars. For the node classification task we run every experiment 5 times with different dataset splits and model initializations and report averaged results. The average standard deviation is (3%, 3%, 4%, 4%) in clean accuracy and (3%, 3%, 2%, 3%) in certified accuracy (across all node classification experiments, hierarchical smoothing only, sparse smoothing only, and ablation smoothing only, respectively).

B Full experimental setup (Section 7)

We provide detailed information on the experimental setup as described in Section 7 to ensure reproducibility of our results. Note that we also upload our implementation.4

B.1 Experimental setup for the task of node classification

Datasets and models. We implement smoothed classifiers for four architectures with two message-passing layers: Graph convolutional networks (GCN) (Kipf and Welling, 2017), graph attention networks (GAT and GATv2) (Velickovic et al., 2018; Brody et al., 2022), and APPNP (Klicpera et al., 2019). We train models on the citation dataset Cora-ML (Bojchevski and Günnemann, 2018; McCallum et al., 2000) consisting of 2,810 nodes with 2,879 binary features, 7,981 edges and 7 classes. We follow the standard procedure in graph machine learning (Shchur et al., 2018) and preprocess graphs into undirected graphs containing only their largest connected component.

Model details. We implement all GNN models with two message-passing layers. We use 64 hidden channels for GCN (Kipf and Welling, 2017) and APPNP (Klicpera et al., 2019), and 8 attention heads with 8 hidden channels each for GAT and GATv2 (Velickovic et al., 2018; Brody et al., 2022). For APPNP we further set the hyperparameter k_hops to 10 and the teleport probability to 0.15.

Node classification task. For GNNs we follow the setup in (Scholten et al., 2022). We evaluate our certificates for the task of node classification in semi-supervised inductive learning settings: We label 20 randomly drawn nodes per class for training and validation, and 10% of the nodes for testing. We use the labelled training nodes and all remaining unlabeled nodes as training graph, and insert the validation and test nodes into the graph at validation and test time, respectively. We train our models on training nodes, tune them on validation nodes and compute certificates for all test nodes. Note that transductive settings come with shortcomings when evaluating robustness certificates (Scholten et al., 2022), and more generally in adversarial robustness on graph-structured data (Gosch et al., 2023a).

Training details. We train models full-batch using Adam (learning rate 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-08, weight decay 5e-04) for a maximum of 1,000 epochs with cross-entropy loss. Note that we implement early stopping after 50 epochs.
We also use a dropout of 0.5 on the hidden node representations after the first graph convolution for GCN and GAT, and additionally a dropout of 0.5 on the attention coefficients for GAT and GATv2.

Training-time smoothing parameters. During training we sample one smoothed feature matrix (i.e. a matrix with (partially) added noise) in each epoch and then proceed with training on the noised feature matrix as usual (a sketch of this sampling step is given at the end of this subsection). Unless stated differently we also train our models on the higher-dimensional matrices, i.e. we append the additional column to the matrices. Note that we use the same smoothing parameters during training as for certification.

Experimental evaluation. We compare hierarchical smoothing to the additive noise baseline by Bojchevski et al. (2020) and the ablation baseline by Scholten et al. (2022) as follows: We run 1,000 experiments for each method by (1) randomly drawing specific smoothing parameters (p, p+, p−, depending on the method), (2) training 5 smoothed classifiers with different seeds for the data split and model initialization, (3) computing robustness certificates under the corresponding smoothing distribution, and (4) averaging the resulting clean and certified accuracies.

Exploring the Pareto-front. In our exhaustive experiments we randomly explore the entire space of possible parameter combinations. The reason for this is to demonstrate Pareto-optimality, i.e. to find all ranges of smoothing parameters for which we can offer both better robustness and accuracy. Note that in practical settings one would have to conduct significantly fewer experiments to find suitable parameters, and the task of efficiently finding the optimal parameters is a research direction orthogonal to deriving robustness certificates.

Smoothing parameters for node classification. We randomly sample from the following space of possible smoothing parameters (we draw from a uniform distribution over these intervals):

Discrete hierarchical smoothing: p ∈ [0.51, 0.993], p+ ∈ [0, 0.05], p− ∈ [0.5, 1]
Sparse smoothing (additive noise baseline): p+ ∈ [0, 0.05], p− ∈ [0.5, 1]
Interception smoothing (ablation smoothing baseline): p ∈ [0.51, 0.993]

4Project page: https://www.cs.cit.tum.de/daml/hierarchical-smoothing

Note that the search space of hierarchical smoothing is the Cartesian product of the search spaces of the two baselines (sparse smoothing (Bojchevski et al., 2020) and interception smoothing (Scholten et al., 2022)). Notably, we still find that hierarchical smoothing is superior despite running 1,000 experiments equally for all three methods (see Section 7). Note that we run 1,000 experiments equally for all methods to establish an equal chance for random outliers in accuracy. Also note that for clarity of the plots we only include points with certified accuracies larger than 0%.
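The following is a minimal sketch of the discrete hierarchical smoothing step used during training and certification for node classification, assuming a {0,1}-valued float feature matrix. The function name and tensor conventions are illustrative and not taken from the released implementation.

```python
import torch

def sample_hierarchical_discrete(x: torch.Tensor, p: float, p_plus: float, p_minus: float):
    """One hierarchical smoothing sample for binary node features.

    x: (N, D) float tensor with entries in {0, 1}. Rows are selected independently
    with probability p (upper level); on selected rows only, bits are flipped with
    probability p_plus (0 -> 1) and p_minus (1 -> 0) (lower level, sparse smoothing).
    Returns the (N, D+1) matrix with the appended selection-indicator column.
    """
    n, _ = x.shape
    tau = torch.bernoulli(torch.full((n, 1), p))              # upper level: row selection
    flip_prob = torch.where(x > 0,                             # lower level: sparse noise
                            torch.full_like(x, p_minus),
                            torch.full_like(x, p_plus))
    flips = torch.bernoulli(flip_prob)
    x_noised = torch.where(flips > 0, 1.0 - x, x)
    w = torch.where(tau > 0, x_noised, x)                      # noise only on selected rows
    return torch.cat([w, tau], dim=1)                          # append indicator column
```

During training, one such sample would be drawn per epoch and fed to the GNN in place of the clean feature matrix.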
B.2 Experimental setup for the task of image classification

Dataset and model architecture. For the task of image classification we follow the experimental setup in (Levine and Feizi, 2020b) and train smoothed ResNet50 (He et al., 2016) models on CIFAR10 (Krizhevsky et al., 2009). CIFAR10 consists of 50,000 training images and 10,000 test images with 10 classes (6,000 images per class). The colour images are of size 32×32 and have 3 channels. Note that for runtime reasons we only certify a subset of the entire test set: For all experiments we report clean and certified accuracy for the same subset consisting of 1,000 randomly sampled test images.

Training details. We train the ResNet50 model with stochastic gradient descent (learning rate 0.01, momentum 0.9, weight decay 5e-4) for 400 epochs using a batch size of 128. We also deploy a cosine learning rate scheduler (Loshchilov and Hutter, 2017). At inference we use a batch size of 300. We append the additional indicator channel (see Figure 3) during training and certification. We train the smoothed classifiers with the same noise level as during certification.

Image preprocessing and smoothing. We augment the training set with random horizontal flips and random crops (with a padding of 4). We normalize all images using the channel-wise dataset mean [0.4914, 0.4822, 0.4465] and standard deviation [0.2023, 0.1994, 0.2010]. For the ablation smoothing baseline we follow the implementation of Levine and Feizi (2020b) and first normalize the images before appending three more channels to the images. Specifically, for each channel x we append the channel 1 − x (resulting in 6 channels in total). Finally, we set all channels of ablated pixels to 0. We train and certify on the extended images. See Levine and Feizi (2020b) for a detailed setup of their method. For hierarchical smoothing and the Gaussian smoothing baseline (Cohen et al., 2019) we first add the Gaussian noise and then normalize the images. Note that for hierarchical smoothing we extend the images with the additional indicator channel (Figure 3), resulting in 4 channels in total (a sketch of this smoothing step is given at the end of this subsection).

Smoothing parameters for image classification. Note that the ablation baseline (Levine and Feizi, 2020b) draws k pixels to ablate from a uniform distribution U(k, N) over all possible subsets with k pixels of an image with N total pixels. This is in contrast to our approach, which selects pixels to add noise to with a probability of p. For the comparison we choose the pixel selection probability p = 1 − k/N for every k, where N is the number of pixels in an image (i.e. we select k pixels in expectation). This way, increasing k corresponds to decreasing p and vice versa. We run experiments for every 10th k from 0 to 500. Note that for k larger than 512 we cannot obtain certificates with either method since then Δ > 0.5. For Gaussian smoothing we run experiments for σ in steps of 0.01 over the interval [0, 1] (100 values), and for hierarchical smoothing for σ in steps of 0.05 over [0, 1] (20 values). Overall, we run 50 experiments for the ablation baseline (Levine and Feizi, 2020b), 100 experiments for the Gaussian noise baseline (Cohen et al., 2019), and 1,000 experiments for hierarchical smoothing.
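Below is a minimal sketch of the continuous hierarchical smoothing step for images, assuming an unnormalized CIFAR10 tensor: pixels are selected independently with probability p, isotropic Gaussian noise is added to the selected pixels only (across all channels), and the selection mask is appended as an indicator channel. Names are illustrative, not the released implementation.

```python
import torch

def smooth_image_hierarchical(img: torch.Tensor, p: float, sigma: float) -> torch.Tensor:
    """img: (3, H, W) image tensor. Returns a (4, H, W) tensor consisting of the
    noised image plus the pixel-selection indicator channel. Illustrative sketch only."""
    _, h, w = img.shape
    tau = torch.bernoulli(torch.full((1, h, w), p))          # upper level: pixel selection
    noise = sigma * torch.randn_like(img)                     # lower level: isotropic Gaussian
    noised = img + tau * noise                                 # noise only on selected pixels
    return torch.cat([noised, tau], dim=0)                    # append indicator channel
```

Normalization with the dataset statistics would then be applied afterwards, matching the order described above (noise first, then normalization).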
B.3 Further experimental details

Certification parameters. All presented plots for graphs show results for the tighter multi-class certificates; for images we compute the binary-class certificates. Our certificates are probabilistic (Cohen et al., 2019), and we use Monte-Carlo samples to estimate the smoothed classifiers with Clopper-Pearson confidence intervals for α = 0.01 (thus our certificates hold with high probability of 99%). We also apply Bonferroni correction to ensure that the bounds hold simultaneously with confidence level 1 − α. We follow the procedure of Cohen et al. (2019) and, if not stated differently, use n0 = 1,000 Monte-Carlo samples for estimating the majority class at test time and n1 = 10,000 samples for the certification at test time.

Reproducibility. To ensure reproducibility we use random seeds for all random processes, including model training, drawing smoothing parameters, and drawing Monte-Carlo samples. We publish the source code including reproducibility instructions and all random seeds.

Computational resources. All experiments were performed on a Xeon E5-2630 v4 CPU with an NVIDIA GTX 1080Ti GPU. In the following we report experiment runtimes (note that for node classification we compute certificates for 5 different smoothed models in each experiment): Regarding node classification experiments, the average runtime of hierarchical smoothing over the 1,000 experiments is 1.6 hours, for the additive noise baseline 1.5 hours, and for the ablation baseline 0.92 hours. For image classification we only train a single classifier. The overall training and certification process takes (on average) 9.9 hours for hierarchical smoothing, 8.6 hours for Gaussian smoothing, and 13.3 hours for ablation smoothing.

Third-party assets. For the sparse smoothing baseline we use the implementation of Bojchevski et al. (2020).5 For the ablation smoothing baseline for GNNs we use the implementation of Scholten et al. (2022).6 We implement all GNNs using PyTorch Geometric (Fey and Lenssen, 2019).7 The graph datasets are publicly available and can be downloaded e.g. using PyTorch Geometric. The CIFAR10 dataset is also publicly available, for example via the torchvision library.8

5https://github.com/abojchevski/sparse_smoothing
6https://github.com/yascho/interception_smoothing
7https://pytorch-geometric.readthedocs.io
8https://pytorch.org/vision/stable/index.html

C Robustness certificates for hierarchical smoothing (Proofs of Section 5)

In the following we show the statements of Section 5 about hierarchical smoothing certificates and the correctness of the following certification procedure:

Algorithm 1: Binary-class Robustness Certificates for Hierarchical Randomized Smoothing
Input: f, X, p, lower-level smoothing distribution µ, radius (r, ϵ), n0, n1, 1 − α
counts0 ← SampleUnderNoise(f, X, p, µ, n0)
counts1 ← SampleUnderNoise(f, X, p, µ, n1)
yA ← top index in counts0
pA ← ConfidenceBound(counts1[yA], n1, 1 − α)
Δ ← 1 − p^r
p̲A ← LowerLevelWorstCaseLowerBound(µ, ϵ, (pA − Δ)/(1 − Δ))
if p̲A · (1 − Δ) > 1/2 then
    Return: yA certified
end
Return: ABSTAIN

Here, LowerLevelWorstCaseLowerBound(µ, ϵ, pA) computes the existing certificate for the lower-level smoothing distribution, specifically the smallest p̲_{X̃,yA} over the entire threat model, e.g. Φ(Φ⁻¹(pA) − ϵ/σ) for Gaussian smoothing. The remaining certification procedure is analogous to the certification in (Cohen et al., 2019). A Python sketch of this procedure is given below.
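As a concrete illustration, the following is a compact Python sketch of this binary-class certification procedure for a Gaussian lower-level distribution. The Monte-Carlo sampling helper is assumed to have already produced the class-count vectors; all names, and the use of scipy for the Clopper-Pearson bound, are illustrative rather than the released implementation.

```python
from scipy.stats import beta, norm

def clopper_pearson_lower(successes: int, trials: int, alpha: float) -> float:
    """One-sided lower Clopper-Pearson bound on a Binomial proportion."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)

def certify_binary(counts0, counts1, p, r, eps, sigma, alpha=0.01):
    """Sketch of Algorithm 1 with a Gaussian lower-level smoothing distribution.

    counts0 / counts1: class-count vectors from n0 / n1 Monte-Carlo samples of the
    smoothed classifier (produced by a hypothetical sample_under_noise helper).
    """
    n1 = sum(counts1)
    y_a = max(range(len(counts0)), key=counts0.__getitem__)   # majority-class guess
    p_a = clopper_pearson_lower(counts1[y_a], n1, alpha)      # lower bound on p_{X,yA}
    delta = 1.0 - p ** r                                       # Δ = 1 − p^r
    if p_a <= delta:
        return "ABSTAIN"
    adjusted = (p_a - delta) / (1.0 - delta)                   # budget inside region R2
    lower = norm.cdf(norm.ppf(adjusted) - eps / sigma)         # Gaussian lower-level bound
    return y_a if lower * (1.0 - delta) > 0.5 else "ABSTAIN"
```

The multi-class variant (Algorithm 2 in Appendix D) additionally bounds the runner-up probability from above and compares p̲A·(1 − Δ) against p̄B·(1 − Δ) + Δ instead of 1/2.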
C.1 Proofs of Section 5

First, define the space Z^{N×(D+1)} ≜ X^{N×D} × {0, 1}^N and recall the definition of the reachable set A ≜ A1 ∪ A2 with

A1 ≜ {Z ∈ Z^{N×(D+1)} | ∀i : (τi = 0) ⇒ (zi = xi|0)},    A2 ≜ {Z ∈ Z^{N×(D+1)} | ∀i : (τi = 0) ⇒ (zi = x̃i|0)}.

Intuitively, A contains only those Z that can be sampled from either Ψ_X or Ψ_X̃ (given the lower-level smoothing distribution has infinite support). Note that these two sets are not necessarily disjoint, so we cannot apply the Neyman-Pearson Lemma for discrete random variables (Tocher, 1950) on the upper-level smoothing distribution ϕ yet. The following Proposition therefore provides a partitioning of the space Z^{N×(D+1)}:

Proposition 1. The regions R0, R1, R2 and R3 are disjoint and partition the space Z^{N×(D+1)}:

R0 ≜ Z^{N×(D+1)} \ A        R2 ≜ {Z ∈ A | ∀i ∈ C : τi = 1}
R1 ≜ {Z ∈ A | ∃i ∈ C : τi = 0, zi = xi|0}        R3 ≜ {Z ∈ A | ∃i ∈ C : τi = 0, zi = x̃i|0}

Proof. We want to show that the four sets above (1) are pairwise disjoint, and (2) partition Z^{N×(D+1)}, that is R0 ∪ R1 ∪ R2 ∪ R3 = Z^{N×(D+1)}. Clearly we have Z^{N×(D+1)} = R0 ∪ A due to the definition of the region R0. Thus we only have to show R1 ∪ R2 ∪ R3 = A:

(1) We show R1, R2, R3 are pairwise disjoint: We have R2 ∩ R1 = ∅ and R2 ∩ R3 = ∅ since in region R2 we have τi = 1 for all i ∈ C, which does not hold for R1 or R3. Moreover, R1 ∩ R3 = ∅ because if Z ∈ R1 then Z ∉ R3 (and vice versa) due to the definition of A and R1 ⊆ A1, R3 ⊆ A2.

(2) We show R1 ∪ R2 ∪ R3 = A: ⊆: Clearly R1 ⊆ A, R2 ⊆ A, and R3 ⊆ A by definition of the regions. ⊇: Consider any Z ∈ A = A1 ∪ A2. Either (1) all perturbed rows are selected and Z ∈ R2, or (2) at least one perturbed row is not selected (∃i ∈ C : τi = 0), in which case we either have Z ∈ R1 or Z ∈ R3, depending on whether Z ∈ A1 or Z ∈ A2, respectively.

We further have to derive the probability to sample Z of each region from both Ψ_X and Ψ_X̃ for applying the Neyman-Pearson Lemma (Tocher, 1950) for discrete random variables on the upper-level smoothing distribution ϕ that selects rows before adding noise.

Proposition 2. Given Ψ_X and Ψ_X̃, the regions of Proposition 1 and Δ ≜ 1 − p^{|C|}, we have

Pr_{Z∼Ψ_X}[Z ∈ R1] = Δ    Pr_{Z∼Ψ_X}[Z ∈ R2] = 1 − Δ    Pr_{Z∼Ψ_X}[Z ∈ R3] = 0
Pr_{Z∼Ψ_X̃}[Z ∈ R1] = 0    Pr_{Z∼Ψ_X̃}[Z ∈ R2] = 1 − Δ    Pr_{Z∼Ψ_X̃}[Z ∈ R3] = Δ

Proof. The probability to select all |C| perturbed rows is p^{|C|} since we sample each τi independently from a Bernoulli: τi ∼ Ber(p) i.i.d. Consequently, the probability to not select at least one perturbed row is 1 − p^{|C|} = Δ. The remainder follows from the fact that matrices from region R1 cannot be sampled from Ψ_X̃ and matrices from R3 cannot be sampled from Ψ_X.

Note that we need region R0 just for a complete partitioning of Z^{N×(D+1)}. This region essentially contains all matrices that we can sample neither from Ψ_X nor from Ψ_X̃. For example, matrices with a row that differs from the same row in X and X̃ without a previous selection of that row cannot be sampled from the two distributions, simply because we first select rows and then only add noise to the rows that we selected. Thus we have Pr_{Z∼Ψ_X}[Z ∈ R0] = 0 and Pr_{Z∼Ψ_X̃}[Z ∈ R0] = 0.

Before we derive the lower bound on p̲_{X̃,y} we first restate the Neyman-Pearson Lemma as presented in Section 3:

Lemma 1 (Neyman-Pearson lower bound). Given X, X̃ ∈ X^{N×D}, distributions µ_X, µ_X̃, class label y ∈ Y, probability p_{X,y} and the set S_κ ≜ {W ∈ X^{N×D} : µ_X̃(W) ≤ κ · µ_X(W)}, we have

p̲_{X̃,y} = Pr_{W∼µ_X̃}[W ∈ S_κ]    with κ ∈ R+ s.t. Pr_{W∼µ_X}[W ∈ S_κ] = p_{X,y}.

Proof. See (Neyman and Pearson, 1933) and (Cohen et al., 2019).

There are different views on Lemma 1. In the context of statistical hypothesis testing, Lemma 1 states that the likelihood ratio test is the uniformly most powerful test when testing the simple hypothesis µ_X(W) against µ_X̃(W). For such views we refer to Cohen et al. (2019). Now we prove our main theorem:

Theorem 1 (Neyman-Pearson lower bound for hierarchical smoothing). Given fixed X, X̃ ∈ X^{N×D}, let µ_X denote a smoothing distribution that operates independently on matrix elements, and Ψ_X the induced hierarchical distribution over Z^{N×(D+1)}. Given label y ∈ Y and the probability p_{X,y} to classify X as y under Ψ, define Ŝ_κ ≜ {W ∈ X^{|C|×D} : µ_{X̃_C}(W) ≤ κ · µ_{X_C}(W)}. We have:

p̲_{X̃,y} = Pr_{W∼µ_{X̃_C}}[W ∈ Ŝ_κ] · (1 − Δ)    with κ ∈ R+ s.t. Pr_{W∼µ_{X_C}}[W ∈ Ŝ_κ] = (p_{X,y} − Δ)/(1 − Δ)

where X_C and X̃_C denote those rows xi of X and x̃i of X̃ with i ∈ C, that is xi ≠ x̃i.
Proof. We are solving the following optimization problem:

p̲_{X̃,y} ≜ min_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y]    s.t. Pr_{Z∼Ψ_X}[h(Z) = y] = p_{X,y}.

We can use the partitioning of Proposition 1 and apply the law of total probability:

p̲_{X̃,y} = min_{h∈H} Σ_{i=0}^{3} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ Ri] · Pr_{Z∼Ψ_X̃}[Z ∈ Ri]
    s.t. Σ_{i=0}^{3} Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ Ri] · Pr_{Z∼Ψ_X}[Z ∈ Ri] = p_{X,y}.

We apply Proposition 2 and cancel out all zero probabilities:

p̲_{X̃,y} = min_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R2] · (1 − Δ) + Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R3] · Δ
    s.t. Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R2] · (1 − Δ) + Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R1] · Δ = p_{X,y}.

This is a minimization problem and to minimize we choose Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R3] = 0 and Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R1] = 1. In other words, in the worst case all matrices in R1 that we never observe around X̃ are correctly classified and the worst-case classifier first uses the budget p_{X,y} on region R1. We obtain:

p̲_{X̃,y} = min_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R2] · (1 − Δ)
    s.t. Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R2] · (1 − Δ) = p_{X,y} − Δ.

Here we can see that for p = 0 (i.e. Δ = 1) we just have p̲_{X̃,y} = 0. We can assume p > 0 in the following. We can further rearrange the terms:

p̲_{X̃,y} = (1 − Δ) · min_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R2]
    s.t. Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R2] = (p_{X,y} − Δ)/(1 − Δ).

Now we can apply Lemma 1:

p̲_{X̃,y} = Pr_{Z∼Ψ_X̃}[Z ∈ S_κ | Z ∈ R2] · (1 − Δ)    with κ ∈ R+ s.t. Pr_{Z∼Ψ_X}[Z ∈ S_κ | Z ∈ R2] = (p_{X,y} − Δ)/(1 − Δ)

for S_κ ≜ {Z ∈ Z^{N×(D+1)} : Ψ_X̃(Z) ≤ κ · Ψ_X(Z)}. Note the difference between S_κ and Ŝ_κ. We need to realize that due to independence we have:

{Z ∈ R2 : Ψ_X̃(Z) ≤ κ Ψ_X(Z)}
  = {Z ∈ R2 : Ψ_X̃(W, τ) ≤ κ Ψ_X(W, τ)}
  = {Z ∈ R2 : µ_X̃(W|τ) ϕ(τ) ≤ κ µ_X(W|τ) ϕ(τ)}
  = {Z ∈ R2 : ∏_{i=1}^{N} µ_{x̃i}(wi|τi) ≤ κ ∏_{i=1}^{N} µ_{xi}(wi|τi)}
  =(*) {Z ∈ R2 : ∏_{i∈C} µ_{x̃i}(wi) ≤ κ ∏_{i∈C} µ_{xi}(wi)}
  = {Z ∈ R2 : µ_{X̃_C}(W_C) ≤ κ µ_{X_C}(W_C)}

where (*) is due to xi = x̃i for all i ∉ C and τi = 1 for i ∈ C. In other words, if Z ∈ S_κ ∩ R2 then all Z′ ∈ R2 that differ from Z only in rows i ∉ C are also in S_κ ∩ R2 and integrate out (the likelihood ratio is only determined by the rows i ∈ C). It follows that for fixed κ ∈ R+ we have

Pr_{Z∼Ψ_X̃}[Z ∈ S_κ | Z ∈ R2] = Pr_{W∼µ_{X̃_C}}[W ∈ Ŝ_κ]    and    Pr_{Z∼Ψ_X}[Z ∈ S_κ | Z ∈ R2] = Pr_{W∼µ_{X_C}}[W ∈ Ŝ_κ]

for Ŝ_κ ≜ {W ∈ X^{|C|×D} : µ_{X̃_C}(W) ≤ κ · µ_{X_C}(W)}. Altogether we obtain:

p̲_{X̃,y} = Pr_{W∼µ_{X̃_C}}[W ∈ Ŝ_κ] · (1 − Δ)    with κ ∈ R+ s.t. Pr_{W∼µ_{X_C}}[W ∈ Ŝ_κ] = (p_{X,y} − Δ)/(1 − Δ).

D Multi-class certificates for hierarchical randomized smoothing

In the main part of our paper we only derive so-called binary-class certificates: If p̲_{X̃,y} > 0.5, then the smoothed classifier g is certifiably robust. We can derive a tighter certificate by guaranteeing p̲_{X̃,y} > max_{y′≠y} p̄_{X̃,y′}. To this end, we derive an upper bound p̄_{X̃,y} ≥ p_{X̃,y} in the following:

p̄_{X̃,y} ≜ max_{h∈H} Pr_{W∼µ_X̃}[h(W) = y]    s.t. Pr_{W∼µ_X}[h(W) = y] = p_{X,y}    (4)

Cohen et al. (2019) show that the maximum of Equation 4 can be obtained by using the following lemma:

Lemma 3 (Neyman-Pearson upper bound). Given X, X̃ ∈ X^{N×D}, distributions µ_X, µ_X̃, class label y ∈ Y, probability p_{X,y} and the set S_κ ≜ {W ∈ X^{N×D} : µ_X̃(W) ≥ κ · µ_X(W)}, we have

p̄_{X̃,y} = Pr_{W∼µ_X̃}[W ∈ S_κ]    with κ ∈ R+ s.t. Pr_{W∼µ_X}[W ∈ S_κ] = p_{X,y}.

Proof. See (Neyman and Pearson, 1933; Cohen et al., 2019).

We can now use Lemma 3 to derive an upper bound under hierarchical smoothing:

Theorem 2 (Neyman-Pearson upper bound for hierarchical smoothing). Given fixed X, X̃ ∈ X^{N×D}, let µ_X denote a smoothing distribution that operates independently on matrix elements, and Ψ_X the induced hierarchical distribution over Z^{N×(D+1)}. Given label y ∈ Y and the probability p_{X,y} to classify X as y under Ψ, define Ŝ_κ ≜ {W ∈ X^{|C|×D} : µ_{X̃_C}(W) ≥ κ · µ_{X_C}(W)}. We have:

p̄_{X̃,y} = Pr_{W∼µ_{X̃_C}}[W ∈ Ŝ_κ] · (1 − Δ) + Δ    with κ ∈ R+ s.t. Pr_{W∼µ_{X_C}}[W ∈ Ŝ_κ] = p_{X,y} / (1 − Δ)

where X_C and X̃_C denote those rows xi of X and x̃i of X̃ with i ∈ C, that is xi ≠ x̃i.
Proof. We are solving the following maximization problem:

p̄_{X̃,y} ≜ max_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y]    s.t. Pr_{Z∼Ψ_X}[h(Z) = y] = p_{X,y}.

We can use the partitioning of Proposition 1 and apply the law of total probability:

p̄_{X̃,y} = max_{h∈H} Σ_{i=0}^{3} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ Ri] · Pr_{Z∼Ψ_X̃}[Z ∈ Ri]
    s.t. Σ_{i=0}^{3} Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ Ri] · Pr_{Z∼Ψ_X}[Z ∈ Ri] = p_{X,y}.

We apply Proposition 2 and cancel out all zero probabilities:

p̄_{X̃,y} = max_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R2] · (1 − Δ) + Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R3] · Δ
    s.t. Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R2] · (1 − Δ) + Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R1] · Δ = p_{X,y}.

This is a maximization problem and to maximize we choose Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R3] = 1 and Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R1] = 0. In other words, all matrices in R3 that we never observe around X are classified as y, and the worst-case classifier does not use any budget p_{X,y} on this region R3. We obtain:

p̄_{X̃,y} = max_{h∈H} Pr_{Z∼Ψ_X̃}[h(Z) = y | Z ∈ R2] · (1 − Δ) + Δ
    s.t. Pr_{Z∼Ψ_X}[h(Z) = y | Z ∈ R2] · (1 − Δ) = p_{X,y}.

The remaining statement follows analogously to the proof in Appendix C.

In this section we derived multi-class robustness certificates for hierarchical randomized smoothing and showed the correctness of the following certification procedure:

Algorithm 2: Multi-class Robustness Certificates for Hierarchical Randomized Smoothing
Input: f, X, p, lower-level smoothing distribution µ, radius (r, ϵ), n0, n1, 1 − α
counts0 ← SampleUnderNoise(f, X, p, µ, n0)
counts1 ← SampleUnderNoise(f, X, p, µ, n1)
yA, yB ← top two indices in counts0
pA, pB ← ConfidenceBounds(counts1[yA], counts1[yB], n1, 1 − α)
Δ ← 1 − p^r
p̄B ← LowerLevelWorstCaseUpperBound(µ, ϵ, pB / (1 − Δ))
p̲A ← LowerLevelWorstCaseLowerBound(µ, ϵ, (pA − Δ)/(1 − Δ))
if p̲A · (1 − Δ) > p̄B · (1 − Δ) + Δ then
    Return: yA certified
end
Return: ABSTAIN

Here, LowerLevelWorstCaseUpperBound(µ, ϵ, pB) computes the existing certificate for the lower-level smoothing distribution, specifically the largest p̄_{X̃,yB} over the entire threat model, e.g. Φ(Φ⁻¹(pB) + ϵ/σ) for Gaussian smoothing. The remaining certification procedure is analogous to the certification in (Cohen et al., 2019).

E Hierarchical randomized smoothing using Gaussian isotropic smoothing (Proofs of Section 6)

Having derived our general certification framework for hierarchical randomized smoothing, we can instantiate it with specific smoothing distributions to certify specific threat models. In the following we show the robustness guarantees for initializing hierarchical randomized smoothing with Gaussian isotropic smoothing, as already discussed in Section 6. We will also derive the corresponding multi-class robustness certificates.

Corollary 3. Given a continuous domain X ⊆ ℝ, consider the threat model B^r_{2,ϵ}(X) for fixed X ∈ X^{N×D}. Initialize hierarchical smoothing with the isotropic Gaussian smoothing distribution defined as µ_X(W) ≜ ∏_{i=1}^{N} N(wi | xi, σ²I). Let y ∈ Y denote a label and p_{X,y} the probability to classify X as y under Ψ. Define Δ ≜ 1 − p^r. Then we have:

min_{X̃ ∈ B^r_{2,ϵ}(X)} p̲_{X̃,y} = Φ( Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − ϵ/σ ) · (1 − Δ).

Proof. As outlined in Subsection 3.3, Cohen et al. (2019) show that the optimal value of Equation 1 under Gaussian smoothing is Φ(Φ⁻¹(p_{x,y}) − ||δ||₂/σ), where Φ denotes the CDF of the standard normal distribution and δ ≜ x̃ − x. Plugging this optimal value into Theorem 1 yields for any X̃ ∈ B^r_{2,ϵ}(X):

p̲_{X̃,y} = Φ( Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − ||δ||₂/σ ) · (1 − Δ)

where Δ = 1 − p^{|C|} and δ ≜ vec(X̃ − X). The remaining statement about the minimum over the entire threat model follows from the fact that p̲_{X̃,y} is monotonically decreasing in ||δ||₂ and monotonically decreasing in r (see also Proposition 3 in Section 5). Note that for fixed X̃ we write Δ = 1 − p^{|C|}, but for the entire threat model we have to use the largest Δ = 1 − p^r.
Corollary 1. Given a continuous domain X ⊆ ℝ, consider the threat model B^r_{2,ϵ}(X) for fixed X ∈ X^{N×D}. Initialize the hierarchical smoothing distribution Ψ using the isotropic Gaussian distribution µ_X(W) ≜ ∏_{i=1}^{N} N(wi | xi, σ²I) that applies noise independently to each matrix element. Let y ∈ Y denote the majority class and p_{X,y} the probability to classify X as y under Ψ. Then the smoothed classifier g is certifiably robust, g(X) = g(X̃) for any X̃ ∈ B^r_{2,ϵ}(X), if

ϵ < σ [ Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − Φ⁻¹( 1 / (2(1 − Δ)) ) ]

where Φ⁻¹ denotes the inverse CDF of the standard normal distribution and Δ ≜ 1 − p^r.

Proof. For the binary-class certificate, the smoothed classifier g is certifiably robust for fixed input X if min_{X̃ ∈ B^r_{2,ϵ}(X)} p̲_{X̃,y} > 1/2. By using Corollary 3 we have

min_{X̃ ∈ B^r_{2,ϵ}(X)} p̲_{X̃,y} > 1/2
⟺ Φ( Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − ϵ/σ ) · (1 − Δ) > 1/2
⟺ Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − ϵ/σ > Φ⁻¹( 1 / (2(1 − Δ)) )
⟺ ϵ < σ [ Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − Φ⁻¹( 1 / (2(1 − Δ)) ) ].

(A small numerical helper for this radius is given at the end of this section.) In the following we additionally derive the corresponding multi-class robustness certificates:

Corollary 4. Given a continuous domain X ⊆ ℝ, consider the threat model B^r_{2,ϵ}(X) for fixed X ∈ X^{N×D}. Initialize hierarchical smoothing with the isotropic Gaussian smoothing distribution defined as µ_X(W) ≜ ∏_{i=1}^{N} N(wi | xi, σ²I). Let y ∈ Y denote a label and p_{X,y} the probability to classify X as y under Ψ. Define Δ ≜ 1 − p^r. Then we have:

max_{X̃ ∈ B^r_{2,ϵ}(X)} p̄_{X̃,y} = Φ( Φ⁻¹( p_{X,y} / (1 − Δ) ) + ϵ/σ ) · (1 − Δ) + Δ.

Proof. Cohen et al. (2019) show that the optimal value of Equation 4 under Gaussian smoothing is Φ(Φ⁻¹(p_{x,y}) + ϵ/σ), where Φ denotes the CDF of the standard normal distribution. The statement follows analogously to Corollary 3 by directly plugging this optimal value into Theorem 2.

Corollary 5 (Multi-class certificates). Given a continuous domain X ⊆ ℝ, consider the threat model B^r_{2,ϵ}(X) for fixed X ∈ X^{N×D}. Initialize the hierarchical smoothing distribution Ψ using the isotropic Gaussian distribution µ_X(W) ≜ ∏_{i=1}^{N} N(wi | xi, σ²I) that applies noise independently to each matrix element. Given majority class y ∈ Y and the probability p_{X,y} to classify X as y under Ψ, the smoothed classifier g is certifiably robust, g(X) = g(X̃) for any X̃ ∈ B^r_{2,ϵ}(X), if

Φ( Φ⁻¹( (p_{X,y} − Δ)/(1 − Δ) ) − ϵ/σ ) · (1 − Δ) > max_{y′≠y} Φ( Φ⁻¹( p_{X,y′} / (1 − Δ) ) + ϵ/σ ) · (1 − Δ) + Δ

where Φ⁻¹ denotes the inverse CDF of the standard normal distribution and Δ ≜ 1 − p^r.

Proof. For the tighter multi-class certificate we have to guarantee p̲_{X̃,y} > max_{y′≠y} p̄_{X̃,y′}. The statement follows directly from Corollary 3 and Corollary 4.
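The closed-form condition of Corollary 1 is easy to evaluate numerically. The following small helper computes the largest certifiable ℓ2 radius for given smoothing parameters; it is an illustrative sketch (using scipy for the normal CDF/quantile), not part of the released code.

```python
from scipy.stats import norm

def certified_l2_radius(p_lower: float, p: float, r: int, sigma: float) -> float:
    """Largest certifiable eps from Corollary 1 (binary-class certificate).

    p_lower: lower confidence bound on the majority-class probability p_{X,y}.
    p: entity-selection probability, r: number of adversarial entities,
    sigma: standard deviation of the lower-level Gaussian.
    """
    delta = 1.0 - p ** r
    if p_lower <= delta or delta >= 0.5:
        return 0.0  # no certificate possible (e.g. 1/(2(1-delta)) >= 1)
    adjusted = (p_lower - delta) / (1.0 - delta)
    radius = sigma * (norm.ppf(adjusted) - norm.ppf(0.5 / (1.0 - delta)))
    return max(radius, 0.0)

# e.g. certified_l2_radius(0.9, 0.9, 1, 0.25) > 0,
# while certified_l2_radius(0.6, 0.7, 2, 0.25) == 0.0 since Delta = 0.51 > 0.5
```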
F Hierarchical smoothing for discrete lower-level smoothing distributions

Threat model details for discrete hierarchical smoothing experiments. In our discrete experiments in Section 7 we model adversaries that perturb multiple nodes in the graph by deleting rd ones (flips 1→0) and adding ra ones (flips 0→1) in the feature matrix X, that is we consider the ball B^r_{ra,rd}(X). Here we define this threat model more formally:

B^r_{ra,rd}(X) ≜ { X̃ ∈ X^{N×D} | Σ_{i=1}^{N} 1[xi ≠ x̃i] ≤ r,  Σ_{i,j} 1(X̃_{ij} = X_{ij} + 1) ≤ ra,  Σ_{i,j} 1(X̃_{ij} = X_{ij} − 1) ≤ rd }

where 1 is an indicator function, r the number of controlled nodes, ra the number of adversarial insertions (0→1), and rd the number of adversarial deletions (1→0).

We expand the result of Theorem 1 to certificates for discrete lower-level smoothing distributions µ:

Theorem 3 (Hierarchical lower bound for discrete lower-level smoothing distributions). Given discrete X and a smoothing distribution µ_x for discrete data, assume that the corresponding discrete certificate partitions the input space X^{r×D} into disjoint regions H1, . . . , HI of constant likelihood ratio: µ_x(w)/µ_x̃(w) = ci for all w ∈ Hi. Then we can compute a lower bound p̲_{X̃,y} ≤ p_{X̃,y} by solving the following Linear Program (LP):

p̲_{X̃,y} = min_{h∈[0,1]^I} h^T ν̃ · (1 − Δ)    s.t. h^T ν = (p_{X,y} − Δ)/(1 − Δ)

where ν̃q ≜ Pr_{W∼µ_{X̃_C}}[vec(W) ∈ Hq] and νq ≜ Pr_{W∼µ_{X_C}}[vec(W) ∈ Hq] for all q ∈ {1, . . . , I}.

Proof. This follows from applying the Neyman-Pearson Lemma for discrete random variables twice (see Lemma 2). While R1, R2, R3 represent super-regions for the upper-level smoothing distribution ϕ, the regions H1, . . . , HI further subdivide R2 into smaller regions of constant likelihood ratio, which is required for the lower-level discrete smoothing distribution µ.

Notably, Theorem 3 implies that we can compute the LP of the original discrete certificate for the adjusted probability (p_{X,y} − Δ)/(1 − Δ), and then multiply the resulting optimal value by the constant 1 − Δ. In particular, the feasibility of the Linear Program and of computing ν, ν̃ can be delegated to the certificate for the underlying µ.

Analogously, for the upper bound for discrete lower-level µ we need:

Lemma 4 (Discrete Neyman-Pearson upper bound). Partition the input space X^D into disjoint regions R1, . . . , RI of constant likelihood ratio µ_x(w)/µ_x̃(w) = ci ∈ R ∪ {∞} for all w ∈ Ri. Define r̃i ≜ Pr_{w∼µ_x̃}(w ∈ Ri) and ri ≜ Pr_{w∼µ_x}(w ∈ Ri). Then Equation 4 is equivalent to the linear program p̄_{x̃,y} = max_{h∈[0,1]^I} h^T r̃ s.t. h^T r = p_{x,y}, where h represents the worst-case classifier.

Then we can show:

Theorem 4 (Hierarchical upper bound for discrete lower-level smoothing distributions). Given discrete X and a smoothing distribution µ_x for discrete data, assume that the corresponding discrete certificate partitions the input space X^{r×D} into disjoint regions H1, . . . , HI of constant likelihood ratio: µ_x(w)/µ_x̃(w) = ci for all w ∈ Hi. Then we can compute an upper bound p̄_{X̃,y} ≥ p_{X̃,y} by solving the following Linear Program (LP):

p̄_{X̃,y} = max_{h∈[0,1]^I} h^T ν̃ · (1 − Δ) + Δ    s.t. h^T ν = p_{X,y} / (1 − Δ)

where ν̃q ≜ Pr_{W∼µ_{X̃_C}}[vec(W) ∈ Hq] and νq ≜ Pr_{W∼µ_{X_C}}[vec(W) ∈ Hq] for all q ∈ {1, . . . , I}.

Proof. This follows analogously to Theorem 3, specifically by applying the Neyman-Pearson Lemma upper bound for discrete random variables (Lemma 4) twice (first using the super-regions and then the lower-level regions).

G Hierarchical randomized smoothing using ablation smoothing

There are different types of randomized ablation certificates: In image classification, Levine and Feizi (2020b) randomly ablate pixels by masking them. In point cloud classification, Liu et al. (2021) ablate points by deleting them to certify robustness against adversarial point insertion, deletion and modification. In sequence classification, Huang et al. (2023) randomly delete sequence elements to certify robustness against edit distance threat models. In node/graph classification, Bojchevski et al. (2020) experiment with ablation smoothing for individual attributes but show that their additive method is superior. Scholten et al. (2022) directly mask out entire nodes, yielding strong certificates against adversaries that arbitrarily perturb node attributes.

Ablation smoothing differences. The ablation baseline (Levine and Feizi, 2020b) draws k pixels to ablate from a uniform distribution U(k, N) over all possible subsets with k out of N pixels. Levine and Feizi (2020b) derive ΔL = 1 − (N−r choose k) / (N choose k). This is in contrast to our approach, which selects pixels to add noise to with a probability of p, for which we have ΔS = 1 − p^r. In other words, we choose the pixel selection probability p = 1 − k/N (i.e. we select k pixels in expectation). This way, increasing k corresponds to decreasing p and vice versa. Interestingly, this type of ablation smoothing is tighter for small N since ΔS ≤ ΔL for k > 0 (see Scholten et al. (2022) (Appendix B, Proposition 5) for a detailed discussion; later also verified by Huang et al. (2023)).
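To make the ΔS versus ΔL comparison concrete, the following snippet evaluates both quantities for CIFAR10-sized images (N = 1024 pixels). It is only an illustrative check of the inequality discussed above, not code from the paper.

```python
from math import comb

def delta_ablation(N: int, k: int, r: int) -> float:
    """Delta_L for uniform ablation of k out of N pixels (Levine & Feizi, 2020b)."""
    return 1.0 - comb(N - r, k) / comb(N, k)

def delta_selection(N: int, k: int, r: int) -> float:
    """Delta_S for independent selection with p = 1 - k/N (k pixels in expectation)."""
    p = 1.0 - k / N
    return 1.0 - p ** r

if __name__ == "__main__":
    N = 1024  # 32x32 CIFAR10 image
    for k, r in [(150, 1), (150, 4), (500, 4)]:
        print(k, r, delta_selection(N, k, r), delta_ablation(N, k, r))
    # Delta_S <= Delta_L in every case (with equality at r = 1);
    # for N = 1024 the gap is small and it grows with r and k.
```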
Ablation and additive noise hybrid. Notably, our method generalizes ablation and additive noise smoothing into a common framework: The additional indicator marks which entity would be ablated, but instead of ablating entities we only add noise to them. If we ablate the selected rows instead of adding noise, we can recover certificates for ablation smoothing as a special case of our framework:

Corollary 2. Initialize the hierarchical smoothing distribution Ψ with the ablation smoothing distribution µ_x(t) = 1 for an ablation token t ∉ X^D and µ_x(w) = 0 for all w ∈ X^D (i.e. we always ablate selected rows). Define Δ ≜ 1 − p^r. Then the smoothed classifier g is certifiably robust, g(X) = g(X̃) for any X̃ ∈ B^r_{2,ϵ}(X), if p_{X,y} − Δ > 0.5.

Proof. We can first apply Theorem 3 and obtain:

p̲_{X̃,y} = min_h h^T ν̃ · (1 − Δ)    s.t. h^T ν = (p_{X,y} − Δ)/(1 − Δ)

for ν̃ = 1 and ν = 1 (since we always ablate selected rows there is only a single lower-level region H1). Solving this LP yields

p̲_{X̃,y} = (p_{X,y} − Δ)/(1 − Δ) · 1 · (1 − Δ) = p_{X,y} − Δ.

Note that one can analogously compute the upper bound p̄_{X̃,y} by applying the corresponding Theorem 4:

p̄_{X̃,y} = p_{X,y}/(1 − Δ) · 1 · (1 − Δ) + Δ = p_{X,y} + Δ.

Altogether, in this special case we obtain for the multi-class certificate p̲_{X̃,y} > max_{y′≠y} p̄_{X̃,y′} the condition

p_{X,y} − Δ > p_{X,ỹ} + Δ

where y is the most likely class and ỹ the runner-up class. That is, in this special case with ablation smoothing we again obtain the multi-class guarantees derived by Levine and Feizi (2020b) (see the numerical example further below).

Related work. In the context of randomized ablation (a special case of our framework), Jia et al. (2021) also define three regions and build an ablation certificate using the Neyman-Pearson Lemma. Their regions are, however, different from ours, and their certificates are designed for poisoning threat models (whereas we consider evasion settings). Moreover, Liu et al. (2021) define three regions similar to ours, but their certificate is only for point cloud classifiers and suffers from the same limitations of ablation certificates that we overcome: Our key idea is to further process the region R2 and to utilize the budget p_{X,y} by plugging it into the certificate for the lower-level smoothing distribution, which yields stronger robustness-accuracy trade-offs under our threat model.

Discussion. In general, the difference between ablation certificates and additive noise certificates is that ablation smoothing does not evaluate the base classifier on the entire input space. Specifically, the supported sample spaces around X and X̃ are overlapping but not identical. In this context, additive noise certificates are technically tighter for threat models where adversarial perturbations are bounded. Since our certificates generalize ablation smoothing and additive noise smoothing, we inherit advantages and disadvantages of both methods: While the certificate for the lower-level smoothing distribution may be tight, the upper-level smoothing distribution prevents us from evaluating the base classifier on the entire matrix space. We leave the development of tight and efficient certificates for hierarchical randomized smoothing to future work.
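As a quick numerical illustration of the ablation special case above (Corollary 2; the numbers are chosen purely for illustration): with selection probability p = 0.9 and r = 2 adversarial nodes we have Δ = 1 − 0.9² = 0.19, so the binary-class certificate requires p_{X,y} − Δ > 0.5, i.e. p_{X,y} > 0.69, and the multi-class certificate requires p_{X,y} > p_{X,ỹ} + 2Δ = p_{X,ỹ} + 0.38, which has exactly the p_A > p_B + 2Δ form of the guarantees of Levine and Feizi (2020b).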
Gray-box certificates. Scholten et al. (2022) derive gray-box certificates by exploiting the underlying message-passing principles of graph neural networks. In detail, they derive a tighter gray-box bound that accounts for the fact that nodes may be disconnected from a target node when randomly deleting edges in a graph (and disconnected nodes do not affect the prediction for the target). Interestingly, we can directly plug their gray-box bound into our framework to obtain gray-box robustness certificates for hierarchical randomized smoothing.

H Extended discussion

Motivation for the indicator variable as additional column. We certify robustness of classifiers operating on the higher-dimensional matrix space, which allows our certificates to be efficient and highly flexible. In the following we motivate this by considering the scenario of certifying robustness on the original data space X^{N×D}. The main problem is that it is technically challenging to obtain an efficient certificate that depends only on the radius (and not on the fixed X and X̃): Consider a discrete lower-level smoothing distribution, fixed X, X̃, a smoothed matrix W and a row i ∈ C such that xi ≠ x̃i but x̃i = wi. To certify robustness on the original space we have to consider the likelihood ratio for the marginal distribution Ψ_X(W) ≜ Σ_τ Ψ_X(W, τ):

Ψ_X(W) / Ψ_X̃(W) = Σ_τ Ψ_X(W, τ) / Σ_τ Ψ_X̃(W, τ).

Let C describe all perturbed and C̄ all clean rows. Due to independence we have:

Ψ_X(W) / Ψ_X̃(W)  =(1)  ∏_{i=1}^{N} Ψ_{xi}(wi) / ∏_{i=1}^{N} Ψ_{x̃i}(wi)
  = [ ∏_{i∈C} Ψ_{xi}(wi) · ∏_{i∈C̄} Ψ_{xi}(wi) ] / [ ∏_{i∈C} Ψ_{x̃i}(wi) · ∏_{i∈C̄} Ψ_{x̃i}(wi) ]
  =(2)  ∏_{i∈C} Ψ_{xi}(wi) / ∏_{i∈C} Ψ_{x̃i}(wi)

where (1) is due to independence and (2) due to xi = x̃i for clean rows i ∈ C̄. Now we can consider the perturbed row i ∈ C that we chose above and consider the likelihood ratio:

Ψ_{xi}(wi) / Ψ_{x̃i}(wi) = Σ_{τi} µ_{xi}(wi|τi) ϕ(τi) / Σ_{τi} µ_{x̃i}(wi|τi) ϕ(τi)  =(3)  µ_{xi}(wi) · p / ( µ_{x̃i}(wi) · p + (1 − p) )

where (3) is due to wi = x̃i but wi ≠ xi. Due to the mixture, there are two ways of sampling wi from Ψ_{x̃i}: First, we may not select row i (and then we do not smooth the row). Second, we may select row i but not add any noise on the selected row. Due to the mixture, we cannot simply cancel out dimensions in which the rows xi and x̃i agree as in (Bojchevski et al., 2020), leading to the problem that the likelihood ratio and subsequently the point-wise certificate depend on X̃. Consequently, certifying robustness against an entire ball of perturbations is technically challenging since the likelihood ratio will change for every X̃.

We can resolve this problem by realizing that for reasonable smoothing parameters the probability µ_{xi}(wi) = µ_{xi}(x̃i) is close to zero (the probability of flipping exactly and only the dimensions in which xi disagrees with x̃i is close to zero for small radii and high-dimensional data). Specifically, the likelihood ratio for such elements approaches 0 (e.g. for sparse smoothing with fixed radius ra, rd we have µ_{xi}(wi) → 0 for D → ∞). Note that if we instead choose a W and row i such that xi = wi but x̃i ≠ wi, we can derive the same result the other way around, with the likelihood ratio now approaching ∞ instead of 0. In other words, we cannot avoid regions of likelihood ratio 0 and ∞ for such smoothing distributions where we can flip entire rows from xi to x̃i. Certifying robustness on the higher-dimensional matrix space instead is efficient, highly flexible, and tight for higher-dimensional rows as for example in graphs where rows represent node features.

Efficiency of hierarchical randomized smoothing. We discuss (1) the efficiency of the smoothing process, and (2) the efficiency of the certification. First, regarding the efficiency of the smoothing process, note that hierarchical smoothing only adds noise on a subset of all entities. Technically this can be more efficient, although drawing from a Bernoulli before adding noise may require more time in practice than directly drawing (e.g. Gaussian) noise for all matrix elements simultaneously.
Second, regarding certification efficiency, our method only requires the additional computation of the term Δ, which takes constant time. Yet, closed-form solutions for the largest certifiable radius may not be available, depending on the lower-level smoothing distribution. For example, we may not be able to derive a closed-form radius for the multi-class certificate under hierarchical Gaussian smoothing (see Appendix E). Still, the certification process can be implemented efficiently even without closed-form solutions, e.g. via bisection or just by incrementally increasing ϵ to find the largest certifiable radius (a small bisection sketch is given further below).

Non-uniform entity-selection probability p. In general we can also use a non-uniform p: Let pi denote a node-specific (non-uniform) selection probability for node i that the adversary does not control. To get robustness certificates we have to make the worst-case assumption, i.e. the nodes with the lowest selection probability will be attacked first by the adversary. For the certificate this means we have to find the largest possible Δ (see also Theorem 1). Specifically, we have to choose Δ ≜ 1 − ∏_{i∈I} pi, where I contains the indices of the r smallest probabilities pi. Then we can proceed as in our paper to compute certificates for the smoothing distribution with non-uniform p. If we additionally know that certain nodes will be attacked before other nodes, we do not have to consider the r nodes with the smallest pi overall, but rather the first r nodes with the smallest pi under the attack order. This can be useful, for example, if one cluster of nodes is attacked before another: then we can increase the selection probability for nodes in the first cluster and obtain stronger certificates.

Node-specific smoothing distributions. If some nodes are exposed to different levels of attacks than others, we can obtain stronger certificates by making the lower-level smoothing distribution node-specific. This would allow us to add less noise to nodes that are less exposed to attacks, potentially leading to even better robustness-accuracy trade-offs. Note, however, that this increases threat model complexity since we have to treat all nodes separately.

Input-dependent randomized smoothing. We derive certificates for a smoothing distribution that adds random noise independently on the matrix elements. This is the most studied class of smoothing distributions and allows us to integrate a whole suite of existing distributions. Although there are first approaches adding non-independent noise, the corresponding certificates are still in their infancy (Súkeník et al., 2022). Data-dependent smoothing is a research direction orthogonal to ours, and we consider the integration of such smoothing distributions as future work.

Finding optimal smoothing parameters. The randomized smoothing framework introduces the parameters of the smoothing distribution as hyperparameters (see e.g. (Cohen et al., 2019)). In general, increasing the probabilities to add random noise increases the robustness guarantees but also decreases the accuracy. We introduce an additional parameter p allowing to better control the robustness-accuracy trade-off. Specifically, a larger p adds more noise and increases the robustness but also decreases accuracy. In our exhaustive experiments we randomly explore the entire space of possible parameter combinations. The reason for this is to demonstrate Pareto-optimality, i.e. to find all ranges for which we can offer both better robustness and accuracy. In practice one would have to conduct significantly fewer experiments to find suitable parameters. Note that the task of efficiently finding the optimal parameters is a research direction orthogonal to deriving robustness certificates.
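The following is a minimal sketch of such a bisection, assuming a hypothetical certified(eps) predicate that evaluates the (possibly closed-form-free) certification condition for a given radius; names and the bracket choice are illustrative.

```python
def largest_certifiable_radius(certified, eps_max: float = 10.0, tol: float = 1e-4) -> float:
    """Bisection over eps, assuming certified(eps) is monotone:
    True for all radii below the largest certifiable radius, False above."""
    if not certified(0.0):
        return 0.0                  # prediction abstains / not certifiable at all
    lo, hi = 0.0, eps_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if certified(mid):
            lo = mid                # mid is certifiable, search above
        else:
            hi = mid                # mid is not certifiable, search below
    return lo
```

For the multi-class Gaussian instantiation, certified(eps) would evaluate the inequality of Corollary 5 for the given ϵ.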
Limitation of fixed threat models. Recent adversarial attacks (see e.g. (Kollovieh et al., 2023)) go beyond the notion of ℓp-norm balls and assess adversarial robustness using generative methods. Our approach is limited in that we can only certify robustness against bounded adversaries that perturb at most r objects with a perturbation strength bounded by ϵ under some ℓp-norm. Since current robustness certificates can only provide provable guarantees under fixed threat models, certifying robustness against adversaries that are not bounded by ℓp-norms is out of the scope of this paper.

Graph-structure perturbations. While we implement certificates against node-attribute perturbations, our framework can in principle integrate certificates against graph-structure perturbations (for example by bounding the number of nodes whose outgoing edges can be perturbed). This can be implemented by first selecting rows of the adjacency matrix A and then applying sparse smoothing (Bojchevski et al., 2020) to the selected rows only. Note that the works of Gao et al. (2020) and Wang et al. (2021) also study randomized smoothing against perturbations of the graph structure, but they consider a special case of sparse smoothing (Bojchevski et al., 2020) with equal flipping probabilities.