# relational_verification_leaps_forward_with_rabbit__81916367.pdf

Relational Verification Leaps Forward with RABBit

Tarun Suresh1 Debangshu Banerjee1 Gagandeep Singh1,2

1University of Illinois Urbana-Champaign, 2VMware Research {tsuresh3, db21, ggnds}@illinois.edu

We propose RABBit, a Branch-and-Bound-based verifier for verifying relational properties defined over Deep Neural Networks, such as robustness against universal adversarial perturbations (UAP). Existing SOTA complete L -robustness verifiers can not reason about dependencies between multiple executions and, as a result, are imprecise for relational verification. In contrast, existing SOTA relational verifiers only apply a single bounding step and do not utilize any branching strategies to refine the obtained bounds, thus producing imprecise results. We develop the first scalable Branch-and-Bound-based relational verifier, RABBit, which efficiently combines branching over multiple executions with cross-executional bound refinement to utilize relational constraints, gaining substantial precision over SOTA baselines on a wide range of datasets and networks. Code is at this URL.

1 Introduction

Deep neural networks (DNNs) are now widely used in safety-critical fields like autonomous driving and medical diagnosis [Amato et al., 2013], where their decisions can have serious consequences. However, understanding and ensuring their reliability is difficult due to their complex and opaque nature. Despite efforts to find and address vulnerabilities, such as adversarial attacks [Goodfellow et al., 2014, Madry et al., 2018, Moosavi-Dezfooli et al., 2017] and adversarial training techniques [Madry et al., 2018], ensuring safety remains a challenge. As a result, extensive research is focused on formally verifying the safety of DNNs. However, most of the existing L robustness verification techniques can not handle relational properties common in practical situations. While significant efforts have been invested in verifying the absence of input-specific adversarial examples within the local neighborhood of test inputs, recent studies [Li et al., 2019a] emphasize that input-specific attacks are impractical regardless. Conversely, practical attack scenarios [Liu et al., 2023, Li et al., 2019b,a] involve the creation of universal adversarial perturbations (UAPs) [Moosavi-Dezfooli et al., 2017], which are crafted to impact a substantial portion of inputs from the training distribution. Ra Ve N [Banerjee et al., 2024b] and subsequently RACoon [Banerjee and Singh, 2024] showed that since the same adversarial perturbation is applied to multiple inputs, the executions on different perturbed inputs are related, exploiting the relationship between different executions significantly improves the precision of the verifier. Despite Ra Ve N and RACoon s ability to leverage cross-executional dependencies, both of them remain imprecise as they only apply a single bounding step and lack refinement using branching strategies used in SOTA complete non-relational verifiers.

Key challenges: For precise relational verification, we need efficient algorithms that can effectively combine branching strategies over multiple executions with bounding techniques that can leverage cross-executional dependencies. Theoretically, MILP (Mixed Integer Linear Programming) can exactly encode DNN executions with piecewise linear activation functions like Re LU over any input regions specified by linear inequalities. However, the associated MILP optimization problem is computationally expensive. For instance, encoding k executions of a DNN with nr Re LU activations

Equal contribution. Primary Correspondence: tsuresh3@illinois.edu

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

introduces O(nr k) integer variables in the worst case. As the cost of MILP optimization grows exponentially with the number of integer variables, even SOTA off-the-shelf solvers like Gurobi [Gurobi Optimization, LLC, 2018] struggle to verify small DNNs for a relational property over k executions within a reasonable time limit. For scalability, SOTA non-relational verifiers like α, βCROWN [Wang et al., 2021] design custom "Branch and Bound" (Ba B) solvers using more scalable differentiable optimization techniques such as gradient descent. However, these verifiers ignore dependencies between multiple executions, resulting in imprecise relational verification. Conversely, the SOTA relational verifier RACoon uses parametric linear relaxation for each activation to avoid integer variables and employs gradient descent to learn parameters that leverage cross-executional dependencies for verification. This method, however, introduces imprecision due to the replacement of non-linear activations with parametric linear approximations. Therefore, precise relational verification requires scalable algorithms that can: a) scale to the large DNNs used in this paper, b) effectively reduce imprecision from parametric linear relaxations, and c) utilize cross-executional dependencies.

Our contributions: We advance the state-of-the-art in relational DNN verification by:

Efficiently combining branching strategies over multiple DNN executions with a cross-executional bounding method that utilizes dependencies between DNN s outputs from different executions while reducing imprecision resulting from parametric linear relaxations. Developing two "branch and bound" algorithms, each with its own advantages - a) strong bounding: applies cross-execution bounding at each step, branching over all executions. This method provides tighter bounds than RACoon (cross-executional bound refinement without branching) and α, βCROWN (branching without cross-executional bound refinement), b) strong branching: applies cross-execution bounding only at the start to derive fixed linear approximations for each execution. These approximations are then used to branch independently over each execution, exploring more branches per execution. Combining strong bounding and branching results into an efficiently optimizable MILP instance that leverages the benefits of both techniques, outperforming each individually. Performing extensive experiments on popular datasets and various DNNs (standard and robustly trained) to showcase the precision improvement over the current SOTA baselines.

2 Related Works

Non-relational DNN verifiers: Given a logical input specification ϕ and an output specification ψ, DNN verifiers formally prove that for all inputs x satisfying ϕ, the output N(x) of the DNN satisfies ψ. If the verification process fails, the verifier generates a counter-example where the output specification ψ does not hold. DNN verifiers are broadly divided into three main types based on their ability to prove properties: - (i) sound but incomplete verifiers which may not always prove property even if it holds [Gehr et al., 2018, Singh et al., 2018, 2019b,a, Zhang et al., 2018, Xu et al., 2020, 2021], (ii) complete verifiers that can always prove the property if it holds [Wang et al., 2018, Gehr et al., 2018, Bunel et al., 2020a,c, Bak et al., 2020, Ehlers, 2017, Ferrari et al., 2022, Fromherz et al., 2021, Wang et al., 2021, Palma et al., 2021, Anderson et al., 2020, Zhang et al., 2022a] and (iii) verifiers with probabilistic guarantees [Cohen et al., 2019, Li et al., 2022]. Beyond the commonly studied L robustness verification problem, several works adapt DNN verification techniques for specific applications, such as robustness against image rotation [Singh et al., 2019b, Balunovic et al., 2019], incremental verification [Ugare et al., 2023, 2024], interpretability [Banerjee et al., 2024a], and certifiable training [Mueller et al., 2023, Palma et al., 2024, Jiang and Singh, 2024].

Relational DNN verifier: Existing relational verifiers fall into two main categories based on the type of relational properties they can handle: (i) verifiers for properties such as UAP and fairness, which are defined across multiple executions of the same DNN [Zeng et al., 2023, Khedr and Shoukry, 2023, Meyer et al., 2024, Banerjee et al., 2024b, Banerjee and Singh, 2024, Banerjee et al., 2024c], and (ii) verifiers for properties like local DNN equivalence, defined over multiple executions of different DNNs on the same input [Paulsen et al., 2020, 2021]. For relational properties defined across multiple executions of the same DNN, existing verifiers [Khedr and Shoukry, 2023] reduce the verification problem to an L robustness problem by constructing a "product DNN" that includes multiple copies of the same DNN. However, the relational verifier in [Khedr and Shoukry, 2023] treats all k executions of the DNN as independent, which results in a loss of precision. On the other hand, [Zeng et al., 2023] (referred to as the I/O formulation) tracks the relationships between inputs used in multiple executions at the input layer, but it does not maintain the relationships between the

outputs fed into the subsequent hidden layers. As a result, it achieves only limited improvement over baseline verifiers that treat all executions independently. Ra Ve N [Banerjee et al., 2024b] uses Diff Poly a abstract intepretation based framework to track linear relationships between the outputs at all layers resulting from multiple executions of the same DNN. While Ra Ve N is significantly more precise than the I/O formulation, tracking linear constraints at each layer across all DNN executions can be computationally expensive. The SOTA relational verifier RACoon [Banerjee and Singh, 2024] improves the scalability of Ra Ve N while maintaining Ra Ve N s precision by introducing a new gradient-descent based bounding strategy called cross-executional bound refinement, as detailed in Section 3. There exist, probabilistic verifiers, [Xie et al., 2021, Zhang et al., 2022b] based on randomized smoothing [Cohen et al., 2019] for verifying relational properties. However, these works can only give probabilistic guarantees on smoothed models which have high inference costs. Similar to [Banerjee et al., 2024b, Banerjee and Singh, 2024], in this work, we focus on deterministic relational verifiers for DNNs with Re LU activation. However, RABBit can be extended to activations like Sigmoid with branching methods [Shi et al., 2024] and parametric bounds [Wu et al., 2023].

3 Preliminaries

We provide background on "branch and bound" (Ba B) based non-relational DNN verification, as well as DNN safety properties that can be encoded as relational properties.

Non-relational DNN verification: For a single execution, non-relational DNN verification focuses on proving that, for all perturbations x+δ of a given input x specified by ϕ, the network s output y = N(x +δ) meets a specified logical condition ψ. Commonly, safety properties such as L robustness encode the output condition (ψ) as a linear inequality or a conjunction of linear inequalities over the DNN output y Rnl. For instance, an output property could be expressed as ψ(y) = (c T y 0), where c Rnl. Generally, even for DNNs with piecewise-linear activation functions and input constraints defined by linear inequalities, complete verification i.e., always proving the property or finding a counterexample is an NP-complete problem. Given a DNN N : Rn0 Rnl and a property defined by (ϕ, ψ), scalable yet sound (but incomplete) verifiers approximate the network s behavior by computing a linear approximation specified by L Rn0 and b R. For any input x satisfying ϕ, this linear approximation ensures that LT x + b c T N(x). The verifier then aims to show that LT x + b 0 for all x that satisfy ϕ, which implies c T N(x) 0. While LT x + b provides a valid lower bound for c T N(x), it may lack precision. To enhance this precision for piecewise-linear activations, state-of-the-art non-relational verifiers use a branch-and-bound (Ba B) method. In each branching step, the problem is divided into smaller subproblems, while the bounding method computes a valid lower bound for each subproblem.

Branching for piecewise linear activation: The non-relational verifier computes L by replacing non-linear activations with linear relaxations, which introduces imprecision. However, for piecewise linear activations like Re LU, it is possible to consider each linear piece separately as different subproblems, avoiding the need for imprecise linear relaxations. For instance, for y = Re LU(x), branching on x and considering the cases x 0 and x 0 allows decomposing Re LU(x) into two distinct linear pieces. Still in the worst case decomposing all Re LU nodes in a DNN results in exponential blowup making it practically infeasible. Therefore, SOTA non-relation verifiers like α, β-CROWN [Wang et al., 2021] greedily pick a small subset of Re LU nodes for branching while using linear relaxations for the rest. We explain the bounding step used for each subproblem below.

Bounding with parameter refinement: Obtaining sound linear relaxations of activations σ like Re LU, which are not used for branching, involves computing linear lower bounds σl(x) and upper bounds σu(x) that contain all possible outputs of σ w.r.t all inputs x satisfying ϕ. That is, for all possible input values x of σ, σl(x) σ(x) σu(x) holds. SOTA non-relational verifiers, such as α, β-CROWN, improve precision by using parametric linear relaxations instead of static linear bounds and refine the parameters to facilitate verification of the property (ϕ, ψ). For example, for Re LU(x), the parametric lower bound is Re LU(x) α x with α [0, 1]. Since α x remains a valid lower for any α [0, 1], this allows optimizing α while ensuring the bound remains mathematically correct. Each branched Re LU say y = Re LU(x), introduces two subproblems each with one additional constraint x 0 (or, x 0) where Re LU behaves as a linear function i.e. y = 0 (or, y = x) respectively. To obtain the lower bound of LT x + b over inputs satisfying ϕ with the additional branching constraints α, β-CROWN convert the constrained optimization problem into an unconstrained one by looking at the Lagrangian dual. The dual replaces each branching constraint by augmenting the minimization objective LT x + b with additional terms i.e. LT x + b + β+x for x 0

or LT x + b + β x for x 0 where β+ 0 and β 0. Overall, at high level, α, β-CROWN computes parametric linear approximations L(α,β)T x + b(α,β) and refine the parameters α, β to facilitate verification of (ϕ, ψ).

DNN relational properties: Relational properties defined for a DNN N : Rn0 Rnl defined over k executions of N are specified by the tuple (Φ, Ψ). Here, Φ : Rn0 k {true, false} (the input specification) encodes the input region Φt Rn0 k encompassing all potential inputs corresponding to each of the k executions of N. Furthermore, the safety property we expect the outputs of all k executions of N to satisfy is specified by Ψ : Rnl k {true, false} (the output specification). Given N, an input specification Φ and an output specification Ψ, DNN relational verification seeks to formally prove whether x 1, . . . , x k Rn0.Φ(x 1, . . . , x k) = Ψ(N(x 1), . . . N(x k)) or otherwise provide a counterexample. The inputs to the k executions of N are denoted by x 1, . . . , x k and the corresponding outputs are denoted by N(x 1), . . . , N(x k). For the i-th execution, commonly, the input region ϕi t is a L region around a fixed point xi Rn0 defined as ϕi t = {x i Rn0 | x i xi ϵ} and the corresponding output specification ψi(N(x i )) = Vm j=1(ci,j T N(x i ) 0). Consequently,

Φ(x 1, . . . , x k) = Vk i=1(x i ϕi t) V Φδ(x 1, . . . , x k) where Φδ(x 1, . . . , x k) encodes the relationship between the inputs used in different execution and Ψ(N(x 1), . . . , N(x k)) = Vk i=1 ψi(N(x i )). Following this, we describe relational properties encoding important DNN safety configurations.

UAP verification: In a UAP attack, given a DNN N, the adversary aims to find an adversarial perturbation with a bounded L norm that maximizes the rate at which N misclassifies when the same adversarial perturbation is applied to all inputs from the distribution. The UAP verification problem aims to find the worst-case accuracy of N against the UAP adversary. We refer to this worst-case accuracy as UAP accuracy in the rest of the paper. As shown by Theorem 2 in [Zeng et al., 2023], it is possible to statistically estimate the UAP accuracy of N with respect to the input distribution if one can determine the UAP accuracy of N on k randomly selected images. We focus on the k-UAP verification problem for the rest of the paper as improving the precision of k-UAP verification directly improves the UAP accuracy on the input distribution [Banerjee and Singh, 2024]. The k-UAP verification problem is fundementally different from local L robustness verification since the same adversarial perturbation is applied across the set of inputs. Thus, improving precision for the UAP verification problem requires a relational verifier that can exploit dependencies between the perturbed inputs. We provide the Φ and Ψ of the UAP verification problem in Appendix A.1. 4 Cross-executional Ba B

The key distinction between relational and non-relational DNN verification is the dependency between different DNN executions, which necessitates that any precise relational verifier utilizes these crossexecution dependencies. For instance, for k-UAP problem with two images x1, x2 consider the scenario where both x1 and x2 have valid adversarial perturbations δ1 δ1 δ1 and δ2 δ2 δ2 but no common perturbation say δ that works for both x1 and x2. In this case, any non-relational verification that does not account for cross-execution dependencies can never prove the absence of a common perturbation given that both x1, x2 have valid adversarial perturbations. This highlights the importance of utilizing cross-executional dependencies. The SOTA relational verifier RACoon [Banerjee and Singh, 2024] leverages cross-execution dependencies to jointly optimize the α parameters from different executions, significantly improving the precision of relational verification. However, RACoon only uses parametric linear relaxations for non-linear activations and lacks a branching step, resulting in reduced precision, as confirmed by our experimental results in Section 6. To address this, we propose two separate Ba B algorithms, each with its benefits, described in Sections 4.1 and 4.2. Finally, we combine the results to formulate an efficiently optimizable MILP instance in Section 5

4.1 Strong Bounding

Before going into the details, we briefly review the cross-executional bound refinement proposed in RACoon. For k-UAP, given any subset S of the k executions, RACoon can verify the absence of any common perturbation that works for all executions in S with cross-executional bound refinement. For all i S, let (Li(αi), bi(αi)) denote the parametric linear approximations corresponding to the i-th execution. Then the optimal value t = maxαi,λi ϵ P

i S λi Li(αi) 1+P

i S λi ai(αi) 0 proves the absence of a common perturbation δ for S. Here, ϵ is the perturbation bound i.e. δ ϵ, ai(αi) = bi(αi)+Li(αi)T xi and λi [0, 1] with P

i S λi = 1 are the cross-executional parameters that relate linear approximations from different execution enabling joint optimization over αis. Next,

we detail the first Ba B method - strong bounding that combines cross-executional bounding with branching methods to verify the absence of common perturbation for any subset of n = |S| executions.

Branching and bounding: For n k executions, we construct a "product DNN" by duplicating the DNN n times, one for each execution. Formally, a product DNN is a function N n : Rn0 n Rnl n with N n(x1, . . . , xn) = [N(x1), . . . , N(xn)]T . At each branching step, we greedily select a subset of unbranched Re LU activations from the product DNN and branch on them, while using parametric linear relaxations for the rest. We adapt existing greedy branching heuristics, such as Ba BSR [Bunel et al., 2020b], for selecting the candidate Re LU activations. The heuristic computes a score for each unbranched Re LU activation in the product DNN, and we branch on the activations with the highest scores. Next, we detail the bounding method applied to each subproblem resulting from branching. Since the number of subproblems can be large, the bounding method needs to be fast yet capable of leveraging both branching constraints and cross-executional dependencies. However, the crossexecutional bound refinement from RACoon can not handle branching constraints, while the bounding step from α, β-CROWN does not utilize dependencies across executions. Hence, we develop a threestep algorithm for obtaining the optimal value t with fast gradient descent-based methods. First, we replace these branching constraints by introducing dual variables β, resulting in new parametric linear approximations (Li(αi,βi), bi(αi,βi)) for each subproblem for all i S. Then for each subproblem, we introduce additional variables λi for each execution with constraints λi [0, 1] and P

i S λi = 1. These λis relate linear approximations from different executions capturing cross-executional dependencies. This reduces finding t for each subproblem to the following optimization problem t = maxαi,βi,λi ϵ P

i S λi Li(αi,βi) 1 + P

i S λi ai(αi,βi). Here, ai(αi,βi) = bi(αi,βi) + Li(αi,βi)T xi. Finally, we apply projected gradient ascent to refine parameters (αi,βi, λi). The detailed derivation of the bounding step and the proof of correctness is in Appendix B. Precision gains of strong bounding over the baselines are in Section 6.2. Suppose, F(S) denotes the set of subproblems then Theorem 4.1 proves the absence of common perturbation for the subset of executions S.

Theorem 4.1. If min F(S) maxαi,βi,λi ϵ P

i S λi Li(αi,βi) 1 + P

i S λi ai(αi,βi) 0 then executions in S do not have a common perturbation δ Rn0 with δ ϵ.

Proof: The detailed proof is in the Appendix B.

While strong bounding effectively combines cross-executional refinement with branching, it has the following drawbacks that led to the development of the 2nd Ba B method. First, strong bounding branches over all executions simultaneously, which limits the number of branches explored per execution within a fixed timeout compared to branching on individual executions. For instance, if strong bounding solves m subproblems for n executions, then assuming each execution branched uniformly, each execution gets only m 1 n subproblems. In contrast, given the same timeout, branching individually allows exploring m

n subproblems per execution. Second, strong bounding only proves the absence of common perturbation, a relaxation of the k-UAP problem. To mitigate this, RACoon uses parameter refinement to obtain linear approximations and formulate a MILP, providing a more precise bound on k-UAP accuracy. However, for strong bounding, as the number of subproblems increases and each subproblem has a different linear approximation, formulating a MILP with each linear approximation is practically infeasible. Restricting the number of linear approximations can help accommodate MILP formulation by compromising on the strong cross-executional bounding.

4.2 Strong Branching

Unlike strong bounding, strong branching explores more branches by branching on each execution independently. Additionally, for each execution, we aim to keep the number of linear approximations small post-branching, ensuring the MILP instance using these approximations remains easy to optimize. To limit the number of linear approximations for each execution i, we fix a set of linear coefficients {L1, . . . , Lm} called "target coefficients" and for each j [m], Lj Rn0 compute valid lower bound b j of the following optimization problem minδ c T N(xi + δ) LT j (xi + δ) with δ ϵ using Ba B. In this case, for all δ Rn0 with δ ϵ the refined bias b j and Lj remain a valid lower bound of c T N(xi +δ) i.e. LT j (xi +δ) + b j c T N(xi +δ). Moreover, since we only refine the bias, the number of linear approximations remains the same as at the start of Ba B, even after branching. Next, we describe how we utilize cross-execution dependencies while branching on each execution independently.

Selecting targets: To select target coefficients, we greedily pick subsets of executions and run cross-executional refinement from RACoon without branching on each subset of executions. We describe the greedy selection strategy in Section 5. For each set of executions, we add the linear approximations obtained by cross-executional refinement to the corresponding executions target sets. Cross-executional refinement ensures for each execution set the parameters corresponding to the linear approximations are tailored for the relational verification.

Bounding and branching: Given a target coefficient Lt Rn0, since finding the exact solution of minδ c T N(xi + δ) LT t (xi + δ) is computationally expensive, strong branching aims to obtain a tight mathematically correct lower bound on the difference c T N(xi + δ) LT t (xi + δ). For any subproblem, let (L(α,β), b(α,β)) denote the parametric linear approximation. Then for this particular subproblem, for all α,β, L(α,β)T (xi + δ) + b(α,β) c T N(xi + δ) and subsequently:

max α,β min δ ϵ(L(α,β) Lt)T (xi + δ) + b(α,β) min δ ϵ c T N(xi + δ) LT t (xi + δ) (1)

The optimal solution of the max-min problem in Eq. 1 provides a mathematically correct lower bound of minδ c T N(xi + δ) LT t (xi + δ) for each subproblem. However, it is hard to solve a max-min problem with scalable differentiable optimization techniques like gradient descent typically used for large DNNs considered in this paper. Instead, we compute a closed form of the inner minimization problem reducing the optimization instance to a more tractable maximization problem (Theorem 4.2).

Theorem 4.2. For any α,β, if L(α,β) Rn0 and b(α,β) R then min δ ϵ(L(α,β) Lt)T (x + δ) + b(α,β) = ϵ L(α,β) Lt 1 + (L(α,β) Lt)T x + b(α,β).

Proof: The proof is in Appendix C.

We apply a projected gradient ascent to optimize the maximization with the closed form obtained above (Appendix C.1). The proof of the correctness of the bounding method is in Appendix C. Note the proof of correctness does not necessitate the optimizer to find the global optimum. This is important since gradient ascent may not always converge to the global optimum. Since strong branching branch on each execution independently we reuse the branching strategy of α, β-CROWN.

In this section, we detail the algorithm (Algo. 1) that combines the results from strong bounding and strong branching to formulate the MILP. Running strong bounding on all 2k 1 non-empty subsets of k executions is impractical. Therefore, we use a greedy approach to select subsets of executions for strong bounding. Similarly, for strong branching, we greedily select the target linear coefficients. First, we describe both greedy strategies before moving on to the MILP formulation.

Elimination of individually verified executions: RABBit maintains a list of unverified indices and eliminates any executions that can be verified individually and does not consider them for subsequent steps (lines 3, 8, and 13 in Algo. 1). For instance, for k-UAP verification, we do not need to consider those executions that are proved to have no adversarial perturbation δ such that δ ϵ. Pruning individually verified executions improves the runtime without any compromise on the precision of the relational verifier (see Theorem B.1 [Banerjee and Singh, 2024]).

Greedy target coefficient selection: RABBit first runs RACoon which in turn executes an incomplete non-relational verifier α-CROWN [Xu et al., 2021] eliminating the verified executions (line 8 in Algo. 1). Subsequently, for target selection, RABBit greedily picks the first kt (hyperparameter) executions based on si the lower bound on ψi(N(xi + δ)) as computed by α-CROWN, prioritizing executions with higher si (line 9). Intuitively, for unverified executions, si measures the maximum violation of the output specification ψi(N(xi + δ)) and thus leads to the natural choice of picking executions with smaller violations. For each selected execution i, we choose up to m target coefficients by iterating over all subsets i S considered by RACoon, and selecting linear approximations corresponding to the top m subsets.The cross-executional lower bound t from RACoon decides the priority of each subset S. Subsets S with higher t indicate smaller violations and are more likely to be verified for the absence of a common perturbation, making them suitable for target selection.

Selection of subsets of executions for strong bounding: Thereafter, until timeout ζ, we run strong bounding on subsets of executions from individually unverified executions I. For each subset S I,

Algorithm 1 RABBit

1: Input: N, (Φ, Ψ), k, kt, timeout ζ 2: Output: M. 3: I {} Unverified indices 4: L {} Linear approximations 5: C {} Cross-verified executions 6: s {} Lower bounds from α-Crown 7: M 0 Initialize verified UAP accuracy 8: (I, L, C, s) RACoon(N, (Φ, Ψ), k) 9: I1 top-kt indices from I based on s 10: for i I1 do 11: b i Strong Branching(ϕi, ψi, L[i]) 12: if Verified(ϕi, ψi, L[i], b i ) then 13: I I \ {i} 14: end if 15: Update Bias(L[i], b i )

16: M MILP(L, Φ, Ψ, k, I, C) 17: M max (M(Φ, Ψ), Opt(M)) 18: end for 19: I2 top-kt indices from I based on s 20: while time() < ζ do 21: S Greedily select subset of I2 22: t S Strong Bounding(S, Φ, Ψ) 23: if t S 0 then 24: C Append(C, S) 25: M MILP(L, Φ, Ψ, k, I, C) 26: M max (M, Opt(M)) 27: end if 28: end while 29: return M

the cross-executional bound obtained by RACoon on S decides its priority. However, considering all non-empty subsets of I can be expensive. Instead, similar to strong branching, we first pick top-kt executions (I2) from I (Algo 1 line 19). We sort all non-empty subsets S I2 based on their priority and, in each iteration, run strong bounding on the highest-priority subset that has not been scheduled yet (Algo 1 line 22). Given a large timeout, RABBit would eventually select all subsets from I2.

MILP Formulation: The MILP formulation uses both the refined biases from strong branching (line 11) and the subsets S of executions verified for the absence of common perturbation from strong bounding (line 22) to compute final verified UAP accuracy. RABBit MILP formulation involves three steps. First, we deduce linear constraints between the input and output of N for each unverified execution using linear approximations of N with refined bias obtained by strong branching. Secondly, we add constraints for each subset S verified for the absence of common perturbation with strong bounding. Then, similar to the current SOTA baseline [Banerjee and Singh, 2024], we encode the output specification Ψ as a MILP objective, introducing only O(k) integer variables. Finally, we use an off-the-shelf MILP solver [Gurobi Optimization, LLC, 2018] to optimize the MILP. Ψ encoding: First, we show the MILP objective M that encodes Ψ. We introduce binary variables zi {0, 1} for each individually unverified execution in I where for any perturbation δ Rn0 and δ ϵ, zi = 1 implies ψi(N(xi +δ)) = True. Then the finding the worst case UAP accuracy is equivalent to the following M = 1

k (k |I|) + min δ ϵ P

Constraints encoding: We add constraints from strong bounding, strong branching, and from the linear approximation obtained from the call to RACoon (Algo. 1 line 8). Suppose for any subset S I, strong bounding verifies the absence of common perturbation. Then for all δ Rn0 and δ ϵ at least one of the executions from S will always satisfy the corresponding output specification. Hence, for every such S we add the constraint: P

i S zi 1. Now, let for any i I, {(L1 i , b1 i ), . . . , (Lm i , bm i )} denote set of linear approximation with bm i either coming from RACoon or from strong branching. Then we add the following constraints zi z i, z i = (oi 0), oi Lj T i (xi + δ) + bj i where j [m], and oi R, z i are new real and integer variables respectively.

Limitations: Although RABBit outperforms SOTA verifiers in relational verification, like all deterministic verifiers, whether relational or non-relational (including ours), do not scale to deep neural networks (DNNs) trained on very large datasets such as Image Net. RABBit is sound but incomplete, meaning it may not be able to prove certain relational properties even if they are true. Note that all complete non-relational verifiers are also incomplete for relational properties since they do not track any dependencies between executions.

6 Experimental Evaluation

We evaluate the effectiveness of RABBit on multiple relational properties such UAP accuracy ( Table 1) and top-k accuracy (Appendix Table 5), DNNs, and datasets. In our evaluation, we compare

Table 1: RABBit Efficacy Analysis for Worst-Case UAP Accuracy

Dataset Network Training Perturbation CROWN α CROWN α, β CROWN MN-Ba B GCP-CROWN I/O RACoon Strong Strong RABBit Structure Method Bound (ϵ) Bounding Branching

Conv Small Standard 1/255 44.8 45.4 62.2 55.0 61.8 45.4 45.4 63.8 (+1.6) 63.2 (+1.0) 65.4 (+3.2) Conv Small Diff AI 5/255 44.4 49.6 53.8 55.0 53.8 50.4 51.6 59.0 (+4.0) 59.0 (+4.0) 59.8 (+4.8) CIFAR10 Conv Small SABR 2/255 75.2 75.8 79.4 80.0 80.0 76.8 78.2 83.0 (+3.0) 83.8 (+3.8) 84.0 (+4.0) Conv Small CITRUS 2/255 74.8 76.0 79.2 79.6 79.6 77.0 78.8 82.8 (+3.2) 83.2 (+3.6) 83.6 (+4.0) Conv Big Diff AI 2/255 46.6 51.8 61.2 61.6 61.2 53.2 54.8 62.8 (+1.2) 62.6 (+1.0) 63.0 (+1.6)

Conv Small Standard 0.07 53.0 59.4 83.6 77.4 84.2 60.0 60.6 84.2 (+0.0) 84.2 (+0.0) 84.8(+0.6) Conv Small Diff AI 0.13 51.8 57.0 76.6 77.0 77.0 57.2 58.4 79.0 (+2.0) 78.6 (+1.6) 80.0 (+3.0) MNIST Conv Small SABR 0.15 27.0 38.0 50.4 51.2 60.2 42.2 45.8 62.6 (+2.4) 62.2 (+2.0) 63.4 (+3.2) Conv Small CITRUS 0.15 28.8 41.6 73.0 69.2 73.0 41.6 44.6 74.0 (+1.0) 73.4 (+0.4) 74.6 (+1.6)

RABBit against SOTA baselines, including non-relational verifiers CROWN [Zhang et al., 2018], α-CROWN [Xu et al., 2021], α, β-CROWN [Wang et al., 2021], MN-Ba B [Ferrari et al., 2022], GCP-CROWN [Zhang et al., 2022a], as well as relational verifiers I/O Formulation [Zeng et al., 2023] and RACoon. As previously noted, Ra Ve N adds linear constraints for each layer, which restricts its scalability as the number of executions k increases. Therefore, we compare RABBit with Ra Ve N for a smaller execution count of k = 5, as shown in Appendix Table 4. Additionally, we show that: a) given the same time, RABBit always outperforms the SOTA Ba B-based non-relational verifier α, β-CROWN; b) strong bounding computes a tighter bound on t than α, β-CROWN; and c) we provide an ablation study on ϵ and k used by RABBit.

6.1 Experiment Setup

Networks. We use standard convolutional architectures, such as Conv Small and Conv Big, which are used to evaluate both SOTA relational [Banerjee and Singh, 2024] and non-relational verifiers [Wang et al., 2021] (see Table 1). We provide the details of the DNN architectures in the Appendix D.1. We use networks trained using both standard training methods and robust training strategies, such as Diff AI [Mirman et al., 2018], SABR [Mueller et al., 2023], and CITRUS [Xu and Singh, 2024]. Our experiments utilize publicly available pre-trained DNNs sourced from the CROWN repository [Zhang et al., 2020], α, β-CROWN repository [Wang et al., 2021], and ERAN repository [Singh et al., 2019b]. The clean accuracies of these networks are reported in Appendix D.2.

Implementation details and hyperparameters. We implemented our method in Python with Pytorch V1.11 on top of SOTA complete non-relational verifier α, β-CROWN [Wang et al., 2021]. We used Gurobi V11.0 as the off-the-shelf MILP solver. For both strong bounding and strong branching, we use Adam [Kingma and Ba, 2014] for parameter learning and run it for 20 iterations on each subproblem. We set the value of kt = 10 for CIFAR-10 and kt = 24 for MNIST networks respectively. We use a single NVIDIA A100-PCI GPU with 40 GB RAM for bound refinement and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz with 64 GB RAM for MILP optimization. For any relational property with k executions, we give an overall timeout of k minutes (averaging 1 minute/execution) to RABBit and all baselines. Each MILP instance gets a timeout of 10 minutes. We issue the MILP optimization call on line 25 of Algo. 1 in a separate thread for runtime optimization, ensuring that the MILP optimization process does not unnecessarily block the subsequent iterations of the while loop (line 20 of Algo. 1).

6.2 Experimental Results

Effectiveness of RABBit: Table 1 compares the results of RABBit to all baselines across different datasets (column 1) and DNN architectures (column 2) trained with various methods (column 3), with ϵ values defining the L bound of δ in column 4. For each DNN and ϵ, we run RABBit and all the baselines on 10 relational properties each defined with k = 50 randomly selected inputs, and report the worst-case UAP accuracy averaged over the 10 properties. Note that for each DNN, we exclude inputs misclassified by the DNN. We compare the performance of RABBit against SOTA relational and complete non-relational verifiers as well as against strong bounding and strong branching. The results in Table 1 demonstrate that strong bounding, strong branching, and RABBit all outperform the existing SOTA verifiers on all DNNs and ϵ. Notably, RABBit gains up to +4.8% and up to +3.2% improvement in the worst-case UAP accuracy (averaged over 10 runs) for CIFAR10 and MNIST DNNs, respectively. RABBit also efficiently scales to the largest verifiable DNN architectures such as Conv Big, conferring up to +1.6% improvement in worst-case UAP accuracy. In some cases, strong bounding outperforms strong branching, while in others, strong branching outperforms strong

bounding, highlighting the importance of both methods. RABBit combines the strengths of both strong branching and strong bounding, producing the best results overall.

(a) Diff AI (CIFAR10)

(b) SABR (CIFAR10)

(c) CITRUS (CIFAR10) Figure 1: Average Worst Case k-UAP accuracy vs Time for Conv Small CIFAR10 DNNs.

Time vs UAP Accuracy Analysis: Fig. 1 shows timewise the worst-case UAP accuracy (averaged over 10 runs) for different Conv Small CIFAR10 networks with k = 50 on ϵ values from Table 1. Note that RABBit invokes RACoon, which in turn calls α-CROWN and eliminates verified executions (Line 7 in Algorithm 1). Hence, for a fair comparison, we also run α-CROWN first for α, β-CROWN and then run α, β-CROWN only on the unverified indices. For all DNNs, RABBit consistently outperforms the SOTA Ba B-based non-relational verifier α, β-CROWN at all timestamps. This confirms that the improved precision shown in Table 1 is not dependent on the specific timeout value.

(a) Diff AI (CIFAR10)

(b) CITRUS (CIFAR10)

Figure 2: Timewise Analysis of Average % Improvement in t with Strong Bounding (CIFAR10)

(a) Diff AI (MNIST)

(b) CITRUS (MNIST)

Figure 3: Timewise Analysis of Average % Improvement in t with Strong Bounding (MNIST)

Evaluating Bound Improvement: In Figs 2 and 3, we present a timewise analysis of the improvement in t with strong bounding over α, β-CROWN and RACoon. For this experiment, we use Diff AI and CITRUS Conv Small networks with epsilon values from Table 1. For each network and ϵ, we select 30 executions at random and compute the percentage improvement in t with strong bounding over RACoon and α, β-CROWN.We also report the average improvement and 95% confidence intervals for all cases in Table 6 in Appendix G. The results demonstrate that the t with strong bounding is significantly tighter compared to the bounds from the SOTA verifiers α, β-CROWN and

RACoon at all timestamps. Furthermore, strong bounding improves t on average by up to 108.7% for CIFAR10 networks and 57.7% for MNIST networks. These results highlight the importance of leveraging dependencies across executions during both branching and bounding to improve precision.

Different ϵ and k values: Fig. 4 shows the results of RACoon, α, β-CROWN, and RABBit for k-UAP verification of CIFAR10 Conv Small DNNs for 5 different ϵ values and k = 50. RABBit outperforms RACoon and α, β-CROWN for all evaluated ϵ values, notably improving the worst case k-UAP accuracy by up to 4.8%. Similarly, we analyze the performance of RACoon, α, β-CROWN, and RABBit for k-UAP verification of CIFAR10 Conv Small DNNs with different k values. As presented in Fig. 5, for all k values, RABBit is more precise than both baselines. Expectedly, the worst-case k-UAP accuracy for relational verifiers is higher with larger k values as it is easier to prove the absence of a common perturbation with larger k.

(a) Diff AI (CIFAR10)

(b) CITRUS (CIFAR10)

Figure 4: Average Worst Case k-UAP accuracy vs ϵ for CIFAR10 DNNs.

(a) Diff AI (CIFAR10)

(b) CITRUS (CIFAR10)

Figure 5: Average Worst Case k-UAP accuracy for different k values for CIFAR10 Conv Small DNNs. 7 Conclusion We present RABBit, a general framework for improving the precision of relational verification of DNNs through Ba B methods specifically designed to utilize dependencies across executions. Our experiments, on various DNN architectures, and training methods demonstrate that RABBit significantly outperforms both SOTA relational and non-relational verifiers for relational properties. Although we focus on the worst-case UAP accuracy and top-k accuracy RABBit can be extended to properties involving different DNNs, such as local equivalence of DNN pairs Paulsen et al. [2020] or properties defined over an ensemble of DNNs.

Acknowledgement

We thank the anonymous reviewers for their insightful comments. This work was supported in part by NSF Grants No. CCF-2238079, CCF-2316233, CNS-2148583.

Filippo Amato, Alberto López, Eladia María Peña-Méndez, Petr Vaˇnhara, Aleš Hampl, and Josef Havel. Artificial neural networks in medical diagnosis. Journal of Applied Biomedicine, 11(2), 2013.

Ross Anderson, Joey Huchette, Will Ma, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, 2020.

Stanley Bak, Hoang-Dung Tran, Kerianne Hobbs, and Taylor T. Johnson. Improved geometric path enumeration for verifying relu neural networks. In Shuvendu K. Lahiri and Chao Wang, editors, Computer Aided Verification - 32nd International Conference, CAV 2020, Los Angeles, CA, USA, July 21-24, 2020, Proceedings, Part I, volume 12224 of Lecture Notes in Computer Science, pages 66 96. Springer, 2020. doi: 10.1007/978-3-030-53288-8\_4. URL https://doi.org/10. 1007/978-3-030-53288-8_4.

Mislav Balunovic, Maximilian Baader, Gagandeep Singh, Timon Gehr, and Martin Vechev. Certifying geometric robustness of neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/ paper_files/paper/2019/file/f7fa6aca028e7ff4ef62d75ed025fe76-Paper.pdf.

Debangshu Banerjee and Gagandeep Singh. Relational DNN verification with cross executional bound refinement. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=HOG80Yk4Gw.

Debangshu Banerjee, Avaljot Singh, and Gagandeep Singh. Interpreting robustness proofs of deep neural networks. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=Ev10F9TWML.

Debangshu Banerjee, Changming Xu, and Gagandeep Singh. Input-relational verification of deep neural networks. Proc. ACM Program. Lang., 8(PLDI), June 2024b. doi: 10.1145/3656377. URL https://doi.org/10.1145/3656377.

Debangshu Banerjee, Changming Xu, and Gagandeep Singh. Scalable relational verification and training for deep neural networks. 2024c.

Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Pushmeet Kohli, P Torr, and P Mudigonda. Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research, 21 (2020), 2020a.

Rudy Bunel, Ilker Turkaslan, Philip H. S. Torr, M. Pawan Kumar, Jingyue Lu, and Pushmeet Kohli. Branch and bound for piecewise linear neural network verification. J. Mach. Learn. Res., 21(1), jan 2020b. ISSN 1532-4435.

Rudy R Bunel, Oliver Hinder, Srinadh Bhojanapalli, and Krishnamurthy Dvijotham. An efficient nonconvex reformulation of stagewise convex optimization problems. Advances in Neural Information Processing Systems, 33, 2020c.

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1310 1320. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr. press/v97/cohen19c.html.

Ruediger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis, 2017.

Claudio Ferrari, Mark Niklas Mueller, Nikola Jovanovi c, and Martin Vechev. Complete verification via multi-neuron relaxation guided branch-and-bound. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=l_am Hf1oa K.

Aymeric Fromherz, Klas Leino, Matt Fredrikson, Bryan Parno, and Corina Pasareanu. Fast geometric projections for local robustness certification. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=z Wy1uxj Dd ZJ.

Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), 2018.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014.

Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018.

Enyi Jiang and Gagandeep Singh. Towards universal certified robustness with multi-norm training, 2024. URL https://arxiv.org/abs/2410.03000.

Haitham Khedr and Yasser Shoukry. Certifair: A framework for certified global fairness of neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8237 8245, Jun. 2023.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Juncheng Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, and Florian Metze. Adversarial music: Real world audio adversary against wake-word detection system. In Proc. Neural Information Processing Systems (Neur IPS), pages 11908 11918, 2019a.

Juncheng Li, Frank R. Schmidt, and J. Zico Kolter. Adversarial camera stickers: A physical camerabased attack on deep learning systems. In Proc. International Conference on Machine Learning, ICML, volume 97, pages 3896 3904, 2019b.

Linyi Li, Jiawei Zhang, Tao Xie, and Bo Li. Double sampling randomized smoothing. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13163 13208. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/li22aa.html.

Zikun Liu, Changming Xu, Emerson Sie, Gagandeep Singh, and Deepak Vasisht. Exploring practical vulnerabilities of machine learning-based wireless systems. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17-19, 2023, pages 1801 1817. USENIX Association, 2023.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r Jz IBf ZAb.

Anna P. Meyer, Yuhao Zhang, Aws Albarghouthi, and Loris D Antoni. Verified training for counterfactual explanation robustness under data shift, 2024. URL https://arxiv.org/abs/2403. 03773.

Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3578 3586. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/ v80/mirman18b.html.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1765 1773, 2017.

Mark Niklas Mueller, Franziska Eckert, Marc Fischer, and Martin Vechev. Certified training: Small boxes are all you need. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=7o Fuxt Jt UMH.

Alessandro De Palma, Harkirat S. Behl, Rudy R. Bunel, Philip H. S. Torr, and M. Pawan Kumar. Scaling the convex barrier with active sets. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.

Alessandro De Palma, Rudy R Bunel, Krishnamurthy Dj Dvijotham, M. Pawan Kumar, Robert Stanforth, and Alessio Lomuscio. Expressive losses for verified robustness via convex combinations. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=mzy Z4wz Kl M.

Brandon Paulsen, Jingbo Wang, and Chao Wang. Reludiff: Differential verification of deep neural networks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE 20, page 714 726, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371216. doi: 10.1145/3377811.3380337. URL https: //doi.org/10.1145/3377811.3380337.

Brandon Paulsen, Jingbo Wang, Jiawei Wang, and Chao Wang. Neurodiff: Scalable differential verification of neural networks using fine-grained approximation. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 20, page 784 796, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450367684. doi: 10.1145/3324884.3416560. URL https://doi.org/10.1145/3324884.3416560.

Zhouxing Shi, Qirui Jin, J Zico Kolter, Suman Jana, Cho-Jui Hsieh, and Huan Zhang. Formal verification for neural networks with general nonlinearities via branch-and-bound, 2024. URL https://openreview.net/forum?id=ivokw VKY4o.

Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fast and effective robustness certification. Advances in Neural Information Processing Systems, 31, 2018.

Gagandeep Singh, Rupanshu Ganvir, Markus Püschel, and Martin Vechev. Beyond the single neuron convex barrier for neural network certification. In Advances in Neural Information Processing Systems, 2019a.

Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages, 3(POPL), 2019b.

Shubham Ugare, Debangshu Banerjee, Sasa Misailovic, and Gagandeep Singh. Incremental verification of neural networks. Proc. ACM Program. Lang., 7(PLDI), June 2023. doi: 10.1145/3591299. URL https://doi.org/10.1145/3591299.

Shubham Ugare, Tarun Suresh, Debangshu Banerjee, Gagandeep Singh, and Sasa Misailovic. Incremental randomized smoothing certification. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Sde APV1irk.

Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, 2018.

Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J Zico Kolter. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/ forum?id=ah YIl RBe CFw.

Haoze Wu, Teruhiro Tagomori, Alexander Robey, Fengjun Yang, Nikolai Matni, George Pappas, Hamed Hassani, Corina Pasareanu, and Clark Barrett. Toward certified robustness against realworld distribution shifts. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (Sa TML), pages 537 553. IEEE, 2023.

Chulin Xie, Minghao Chen, Pin-Yu Chen, and Bo Li. Crfl: Certifiably robust federated learning against backdoor attacks. In International Conference on Machine Learning, pages 11372 11382. PMLR, 2021.

Changming Xu and Gagandeep Singh. Cross-input certified training for universal perturbations, 2024.

Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. Automatic perturbation analysis for scalable certified robustness and beyond. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

Kaidi Xu, Huan Zhang, Shiqi Wang, Yihan Wang, Suman Jana, Xue Lin, and Cho-Jui Hsieh. Fast and complete: Enabling complete neural network verification with rapid and massively parallel incomplete verifiers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=n VZt XBI6LNn.

Yi Zeng, Zhouxing Shi, Ming Jin, Feiyang Kang, Lingjuan Lyu, Cho-Jui Hsieh, and Ruoxi Jia. Towards robustness certification against universal perturbations. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= 7GEv PKxjtt.

Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. Advances in neural information processing systems, 31, 2018.

Huan Zhang, Hongge Chen, Chaowei Xiao, Sven Gowal, Robert Stanforth, Bo Li, Duane Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. In Proc. International Conference on Learning Representations (ICLR), 2020.

Huan Zhang, Shiqi Wang, Kaidi Xu, Linyi Li, Bo Li, Suman Jana, Cho-Jui Hsieh, and J Zico Kolter. General cutting planes for bound-propagation-based neural network verification. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022a. URL https://openreview.net/forum?id=5ha AJAcofjc.

Yuhao Zhang, Aws Albarghouthi, and Loris D Antoni. Bagflip: A certified defense against data poisoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/ forum?id=Zidk M5b92G.

A Formal encoding of relational properties

A.1 k-UAP verification

Given a set of k points X = {x1, ..., xk} where for all i [k], xi Rn0 and ϵ R we can first define individual input constraints used to define L input region for each execution i [k].ϕi in(x i ) = x i xi ϵ. We define Φδ(x 1, . . . , x k) as follows:

Φδ(x 1, . . . , x k) =

(i,j [k]) (i<j) (x i x j = xi xj) (2)

Then, we have the input specification as Φ(x 1, . . . , x k) = Vk i=1 ϕi in(x i ) Φδ(x 1, . . . , x k).

Next, we define Ψ(x 1, . . . , x k) as conjunction of k clauses each defined by ψi(yi) where yi = N(x i ). Now we define ψi(yi) = Vnl j=1(ci,j T yi 0) where ci,j Rnl is defined as follows

a [nl].ci,j,a =

1 if a = j and a is the correct label for yi 1 if a = j and a is not the correct label for yi 0 otherwise (3)

In this case, the tuple of inputs (x 1, . . . , x k) satisfies the input specification Φ(x 1, . . . , x k) iff for all i [k], x i = xi + δ where δ Rn0 and δ ϵ. Hence, the relational property (Φ, Ψ) defined above verifies whether there is an adversarial perturbation δ Rn0 with δ ϵ that can misclassify all k inputs. Next, we show the formulation for the worst-case UAP accuracy of the k-UAP verification problem as described in section 3. Let, for any δ Rn0 and δ ϵ, µ(δ) denotes the number of clauses (ψi) in Ψ that are satisfied. Then µ(δ) is defined as follows

zi(δ) = 1 ψi(N(xi + δ)) is True 0 otherwise (4)

i=1 zi(δ) (5)

Since ψi(N(xi + δ)) is True iff the perturbed input xi + δ is correctly classified by N, for any δ Rn0 and δ ϵ, µ(δ) captures the number of correct classifications over the set of perturbed inputs {x1 + δ, . . . , xk + δ}. The worst-case k-UAP accuracy M0(Φ, Ψ) for (Φ, Ψ) is as follows M0(Φ, Ψ) = min δ Rn0, δ ϵ µ(δ) (6)

B Details of strong bounding

For fixed linear approximations {(L1, b1), . . . (Ln, bn)} corresponding to n executions of N if the optimal value t of the following linear program 0 then the n executions do not have a common peturbation (from Theorem B.3. Banerjee and Singh [2024]). min t s.t. δ ϵ

Li T (xi + δ) + bi t i [n] (7) Now in the first step, we compute the Lagrangian dual of the linear program from Eq. 7. The Lagrangian Dual is as follows where for all i [n], λi 0 are Lagrange multipliers.

max 0 λi min t R, δ ϵ(1

i=1 λi) t +

i=1 λi LT i (xi + δ) + bi

We set the coefficient of the unbounded variable t to 0 to avoid cases where min t R, δ ϵ(1 Pn i=1 λi) t + Pn i=1 λi LT i (xi + δ) + bi = . This leads to the following Lagrangian Dual form

max 0 λi min δ ϵ

i=1 λi LT i (xi + δ) + bi where

Now for every subproblem, replacing the branching constraints by introducing dual variables β results in the parametric linear approximations of N specified by (Li(αi,βi), bi(αi,βi)) for each execution i [n]. Then the Lagrangian Dual with the parametric linear approximations {(L1(α1,β1), b1(α1,β1)), . . . , (Ln(αn,βn), bn(αn,βn))} is as follows

max 0 λi min δ ϵ

i=1 λi Li(αi,βi)T (xi + δ) + bi(αi,βi) where

Theorem 4.1. If min F(S) maxαi,βi,λi ϵ P

i S λi Li(αi,βi) 1 + P

i S λi ai(αi,βi) 0 then executions in S do not have a common perturbation δ Rn0 with δ ϵ.

Proof. First, we show that min δ ϵ Pn i=1 λi Li(αi,βi)T (xi + δ) + bi(αi,βi) = ϵ

Pn i=1 λi Li(αi,βi) 1 + Pn i=1 λi ai(αi,βi).

i=1 λi Li(αi,βi)T (xi + δ) + bi(αi,βi)

i=1 λi Li(αi,βi)T (δ) +

i=1 λi bi(αi,βi) + Li(αi,βi)T xi

i=1 λi ai(αi,βi) + min δ ϵ

i=1 λi Li(αi,βi)T (δ)

i=1 λi ai(αi,βi) ϵ

i=1 λi Li(αi,βi) 1 Using Hölder s Inequality (8)

For fixed αi,βi, the optimal solution of the LP in Eq. 7 and subsequently of the Lagrangian gives us

max 0 λi min δ ϵ

i=1 λi Li(αi,βi)T (xi + δ) + bi(αi,βi)

= min δ ϵ max 1 i n Li(αi,βi)T (xi + δ) + bi(αi,βi) provided

i=1 λi = 1 (9)

For each subproblem, for all αi,βi

min δ ϵ max 1 i n ci T N(xi + δ) min δ ϵ max 1 i n Li(αi,βi)T (xi + δ) + bi(αi,βi)

min δ ϵ max 1 i n ci T N(xi + δ)

max αi,βi min δ ϵ max 1 i n Li(αi,βi)T (xi + δ) + bi(αi,βi)

max αi,βi max 0 λi min δ ϵ

i=1 λi Li(αi,βi)T (xi + δ) + bi(αi,βi) where

i=1 λi = 1 from Eq. 9

max αi,βi,0 λi

i=1 λi ai(αi,βi) ϵ

i=1 λi Li(αi,βi) 1 From Eq. 8 (10)

Finally, if min F(S) maxαi,βi,λi ϵ P i S λi Li(αi,βi) 1 + P i S λi ai(αi,βi) 0 then,

min δ ϵ max 1 i n ci T N(xi + δ) 0 for all subproblems in F(S) using Eq. 10

Since, for all subproblems min δ ϵ max1 i n ci T N(xi + δ) 0, Wn i=1 ψi(N(xi + δ)) holds for all

δ Rn0 and δ ϵ i.e. there does not exist any common perturbation.

C Details of strong branching

Theorem 4.2. For any α,β, if L(α,β) Rn0 and b(α,β) R then min δ ϵ(L(α,β) Lt)T (x + δ) + b(α,β) = ϵ L(α,β) Lt 1 + (L(α,β) Lt)T x + b(α,β).

min δ ϵ(L(α,β) Lt)T (x + δ) + b(α,β)

= min δ ϵ(L(α,β) Lt)Tδ + b(α,β) + (L(α,β) Lt)T x

= b(α,β) + (L(α,β) Lt)T x + min δ ϵ(L(α,β) Lt)Tδ

= b(α,β) + (L(α,β) Lt)T x ϵ (L(α,β) Lt) 1 Using Hölder s Inequality

C.1 Projected gradient ascent

For each αi αi αi,βi, after each step of gradient ascent (for maximization problem), we clip αi αi αi,βi values to the corresponding ranges [lα ilα ilα i ,uα i uα i uα i ] [lβ ilβ ilβ i ,uβ i uβ i uβ i ] respectively. This is similar to the approach used in the SOTA non-relational bound refinement α, β-CROWN Wang et al. [2021]. Since λi [0, 1] and Pk i=1 λi = 1 we replace λi = sigmoid(xi) Pk i=1 sigmoid(xi) where xi R. For any values of (x1, . . . , xk) Rk

the corresponding (λ1, . . . , λk) satisfy λi [0, 1] and Pk i=1 λi = 1. We then apply gradient ascent (for maximization problem) on (x1, . . . , xk) without any constraints.

D DNN Architectures

D.1 DNN Architectures:

Table 2: DNN Architecture Details

Dataset Model Type Train # Layers # Params

Conv Small Conv Standard 4 80k Conv Small Conv Diff AI 4 80k MNIST Conv Small Conv SABR 4 80k Conv Small Conv CITRUS 4 80k Conv Big Conv Diff AI 7 1.8M

Conv Small Conv Standard 4 80k CIFAR10 Conv Small Conv Diff AI 4 80k Conv Small Conv SABR 4 80k Conv Small Conv CITRUS 4 80k Conv Big Conv Diff AI 7 2.5M

D.2 Accuracies for Evaluated DNNs:

Table 3: Standard top-1 accuracy for evaluated DNNs

Dataset Model Train Accuracy (%)

Conv Small Standard 62.9 Conv Small Diff AI 45.9 CIFAR10 Conv Small SABR 63.3 Conv Small CITRUS 63.9 Conv Big Diff AI 53.8

Conv Small Standard 97.7 Conv Small Diff AI 96.8 MNIST Conv Small SABR 97.9 Conv Small CITRUS 98.5

E Comparison of RABBit with Ra Ve N

Table 4: Comparison of RABBit vs Ra Ve N and RACoon

Dataset Network Training Method ϵ Ra Ve N RACoon RABBit MNIST Conv Small Diff AI 0.13 66.0 63.0 82.0 (+19.0) MNIST Conv Small SABR 0.15 51.0 50.0 66.0 (+15.0) MNIST Conv Small CITRUS 0.14 82.0 80.0 89.0 (+7.0) CIFAR10 Conv Small SABR 4.0/255 48.0 48.0 55.0 (+7.0) CIFAR10 Conv Small CITRUS 2.0/255 79.0 78.0 82.0 (+3.0)

F RABBit Efficacy Analysis for top-k accuracy

Table 5: Verified top-2 accuracy for RABBit vs baselines

Dataset Network Training Method ϵ α-CROWN RACoon α, β-CROWN RABBit Avg. Acc. (%) Avg. Time (sec.) Avg. Acc. (%) Avg. Time (sec.) Avg. Acc. (%) Avg. Time (sec.) Avg. Acc. (%) Avg. Time (sec.) CIFAR10 Conv Small Diff AI 5/255 74.0 4.52 75.0 4.87 75.0 20.47 78.0 (+3.0) 24.27 MNIST Conv Small Diff AI 0.13 84.0 1.20 84.0 1.42 89.0 11.03 91.0 (+2.0) 13.43

G Average Improvement in t with Strong Branching

Table 6: Average Improvement in t with Strong Bounding

Dataset Network Training Perturbation RACoon α, β-CROWN Structure Method Bound (ϵ) Avg. Improvement (%) 95% CI Avg. Improvement (%) 95% CI

Conv Small Diff AI 5/255 108.7 [93.9, 126.1] 102.5 [92.7, 115.4] CIFAR Conv Small CITRUS 2/255 77.9 [75.3, 81.6] 86.9 [86.2, 88.1]

Conv Small Diff AI 5/255 57.7 [55.5, 60.2] 54.4 [53.0, 56.0] MNIST Conv Small CITRUS 2/255 40.8 [39.8, 41.9] 37.1 [36.4, 37.8]

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: See Section 1 for main claims and contributions. The main claims made in this section and the abstract reflect the paper s scope and contributions. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: See end of Section 5 for the limitations. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]

Justification: Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: See experimental setup in Section 6. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes] Justification: We provide the code to replicate the main results of this paper. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please See Section the Neur IPS code and data submission guidelines (https://nips. cc/public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See Section the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: See experimental setup in Section 6. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: RABBit is a deterministic verifier. The experiment "Evaluating Bound Improvement" (Section 6) is the only randomized experiment in the paper. We report the mean and 95% confidence intervals of the experiment in Appendix G. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors).

It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: See experimental setup in Section 6.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: Submission meets all ethical guidelines after authors reviewed the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: See Section 1 and Section 7 for societal impacts of the work.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: No data or models with high risk for misuse were used.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes] Justification: We have only utilized publically available code, models, and datasets and properly cited all relevant works.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: We provide the code to replicate the main results of this paper. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: No crowdsourcing nor human research with subject participants. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: No crowdsourcing nor human research with subject participants. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.