# SCALING THE CONVEX BARRIER WITH ACTIVE SETS

Published as a conference paper at ICLR 2021

Alessandro De Palma*, Harkirat Singh Behl*, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar
University of Oxford
{adepalma,harkirat,phst,pawan}@robots.ox.ac.uk, bunel.rudy@gmail.com
*Equal contribution.

ABSTRACT

Tight and efficient neural network bounding is of critical importance for the scaling of neural network verification systems. A number of efficient specialised dual solvers for neural network bounds have been presented recently, but they are often too loose to verify more challenging properties. This lack of tightness is linked to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise linear activations exists, it comes at the cost of exponentially many constraints and thus currently lacks an efficient customised solver. We alleviate this deficiency via a novel dual algorithm that realises the full potential of the new relaxation by operating on a small active set of dual variables. Our method recovers the strengths of the new relaxation in the dual space: tightness and a linear separation oracle. At the same time, it shares the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we obtain better bounds than off-the-shelf solvers in only a fraction of their running time and recover the speed-accuracy trade-offs of looser dual solvers if the computational budget is small. We demonstrate that this results in significant formal verification speed-ups.

1 INTRODUCTION

Verification requires formally proving or disproving that a given property of a neural network holds over all inputs in a specified domain. We consider properties in their canonical form (Bunel et al., 2018), which requires us to either: (i) prove that no input results in a negative output (property is true); or (ii) identify a counter-example (property is false). The search for counter-examples is typically performed by efficient methods such as random sampling of the input domain (Webb et al., 2019), or projected gradient descent (Carlini & Wagner, 2017). In contrast, establishing the veracity of a property requires solving a suitable convex relaxation to obtain a lower bound on the minimum output. If the lower bound is positive, the given property is true. If the bound is negative and no counter-example is found, either: (i) we make no conclusions regarding the property (incomplete verification); or (ii) we further refine the counter-example search and lower bound computation within a branch-and-bound framework until we reach a concrete conclusion (complete verification).

The main bottleneck of branch and bound is the computation of the lower bound for each node of the enumeration tree via convex optimization. While earlier works relied on off-the-shelf solvers (Ehlers, 2017; Bunel et al., 2018), it was quickly established that such an approach does not scale up elegantly with the size of the neural network. This has motivated researchers to design specialized dual solvers (Dvijotham et al., 2019; Bunel et al., 2020a), thereby providing initial evidence that verification can be realised in practice. However, the convex relaxation considered in the dual solvers is itself very weak (Ehlers, 2017), hitting what is now commonly referred to as the convex barrier (Salman et al., 2019).
In practice, this implies that several properties either remain undecided in incomplete verification or take several hours to be verified exactly.

Multiple works have tried to overcome the convex barrier for piecewise linear activations (Raghunathan et al., 2018; Singh et al., 2019). Here, we focus on the single-neuron Linear Programming (LP) relaxation by Anderson et al. (2020). Unfortunately, its tightness comes at the price of exponentially many (in the number of variables) constraints. Therefore, existing dual solvers (Dvijotham et al., 2018; Bunel et al., 2020a) are not easily applicable, limiting the scaling of the new relaxation.

We address this problem by presenting a specialised dual solver for the relaxation by Anderson et al. (2020), which realises its full potential by meeting the following desiderata:

- By keeping an active set of dual variables, we obtain a sparse dual solver that recovers the strengths of the original primal problem (Anderson et al., 2020) in the dual domain.
- In line with previous dual solvers, our approach yields valid bounds at any time, leverages convolutional network structure and enjoys massive parallelism within a GPU implementation, resulting in better bounds in an order of magnitude less time than off-the-shelf solvers (Gurobi Optimization, 2020).
- We present a unified dual treatment that includes both a linearly sized LP relaxation (Ehlers, 2017) and the tighter formulation. As a consequence, our solver provides a wide range of speed-accuracy trade-offs: (i) it is competitive with dual approaches on the looser relaxation (Dvijotham et al., 2018; Bunel et al., 2020a); and (ii) it yields much tighter bounds if a larger computational budget is available.
- Owing to this flexibility, we show that our dual algorithm yields large complete verification gains compared to primal approaches (Anderson et al., 2020) and previous dual algorithms.

2 PRELIMINARIES: NEURAL NETWORK RELAXATIONS

We denote vectors by bold lower case letters (for example, x) and matrices by upper case letters (for example, W). We use $\odot$ for the Hadamard product, $[\![\cdot, \cdot]\!]$ for integer ranges, $\mathbb{1}_a$ for the indicator vector on condition $a$ and brackets for intervals ($[l_k, u_k]$) and vector or matrix entries ($x[i]$ or $W[i,j]$). In addition, given $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^m$, we will employ $W \odot x$ and $W \mathbin{@} x$ as shorthands for $\sum_i \mathrm{col}_i(W) \odot x$ and $\sum_i \mathrm{col}_i(W)^T x$ respectively, where $\mathrm{col}_i(W)$ denotes the $i$-th column of matrix $W$.

Let $\mathcal{C}$ be the network input domain. Similar to Dvijotham et al. (2018); Bunel et al. (2020a), we assume that linear minimisation over $\mathcal{C}$ can be performed in closed-form. Our goal is to compute bounds on the scalar output of a piecewise-linear feedforward neural network. The tightest possible lower bound can be obtained by solving the following optimization problem:

$$
\begin{aligned}
\min_{x, \hat{x}}\; \hat{x}_n \quad \text{s.t.}\quad & x_0 \in \mathcal{C}, && \text{(1a)}\\
& \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \quad k \in [\![0, n-1]\!], && \text{(1b)}\\
& x_k = \sigma(\hat{x}_k) \quad k \in [\![1, n-1]\!], && \text{(1c)}
\end{aligned}
$$

where the activation function $\sigma(\hat{x}_k)$ is piecewise-linear, $\hat{x}_k, x_k \in \mathbb{R}^{n_k}$ denote the outputs of the $k$-th linear layer (fully-connected or convolutional) and activation function respectively, $W_k$ and $b_k$ denote its weight matrix and bias, and $n_k$ is the number of activations at layer $k$. We will focus on the ReLU case ($\sigma(x) = \max(x, 0)$), as common piecewise-linear functions can be expressed as a composition of ReLUs (Bunel et al., 2020b). Problem (1) is non-convex due to the non-linearity of the activation function (1c).
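To make the bounding task concrete, the following minimal sketch evaluates the objective of problem (1) on a toy fully-connected ReLU network; the architecture, weights and perturbation radius below are illustrative placeholders, not the networks used later in the paper.

```python
# Minimal sketch of the objective in problem (1): a feedforward ReLU network
# whose scalar output we would like to lower bound over an l_inf input domain C.
# All weights and the domain are illustrative placeholders, not the paper's.
import torch

torch.manual_seed(0)
weights = [torch.randn(10, 784), torch.randn(10, 10), torch.randn(1, 10)]
biases = [torch.randn(10), torch.randn(10), torch.randn(1)]

def network_output(x0):
    """x̂_n of problem (1): alternate linear layers (1b) and ReLUs (1c)."""
    x = x0
    for W, b in zip(weights[:-1], biases[:-1]):
        x = torch.relu(x @ W.T + b)
    return x @ weights[-1].T + biases[-1]

# Canonical form: the property holds iff min_{x0 in C} network_output(x0) >= 0.
# Sampling C only searches for counter-examples; proving the property requires
# a lower bound on the minimum, obtained from a convex relaxation.
center, eps = torch.zeros(784), 0.1
sample = (center + eps * (2 * torch.rand(784) - 1)).unsqueeze(0)
print(network_output(sample))
```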
As solving it is NP-hard (Katz et al., 2017), it is commonly approximated by a convex relaxation (see §4). The quality of the corresponding bounds, which is fundamental in verification, depends on the tightness of the relaxation. Unfortunately, tight relaxations usually correspond to slower bounding procedures. We first review a popular ReLU relaxation in §2.1. We then consider a tighter one in §2.2.

2.1 PLANET RELAXATION

The so-called Planet relaxation (Ehlers, 2017) has enjoyed widespread use due to its amenability to efficient customised solvers (Dvijotham et al., 2018; Bunel et al., 2020a) and is the relaxation of choice for many works in the area (Bunel et al., 2020b; Lu & Kumar, 2020). Here, we describe it in its non-projected form $\mathcal{M}_k$, the LP relaxation of the Big-M Mixed Integer Programming (MIP) formulation (Tjeng et al., 2019). Applying $\mathcal{M}_k$ to problem (1) results in:

$$
\begin{aligned}
\min_{x, \hat{x}, z}\; \hat{x}_n \quad \text{s.t.}\quad & x_0 \in \mathcal{C},\\
& \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \quad k \in [\![0, n-1]\!],\\
& \left.\begin{aligned}
& x_k \ge \hat{x}_k, \quad x_k \le \hat{u}_k \odot z_k, \quad x_k \le \hat{x}_k - \hat{l}_k \odot (1 - z_k),\\
& (x_k, \hat{x}_k, z_k) \in [l_k, u_k] \times [\hat{l}_k, \hat{u}_k] \times [0, 1]
\end{aligned}\right\} := \mathcal{M}_k \quad k \in [\![1, n-1]\!],
\end{aligned} \tag{2}
$$

where $\hat{l}_k, \hat{u}_k$ and $l_k, u_k$ are intermediate bounds respectively on pre-activation variables $\hat{x}_k$ and post-activation variables $x_k$. These constants play an important role in the structure of $\mathcal{M}_k$ and, together with the relaxed binary constraints on $z$, define box constraints on the variables. We detail how to compute intermediate bounds in appendix E.

Projecting out auxiliary variables $z$ results in the Planet relaxation (cf. appendix B.1 for details), which replaces (1c) by its convex hull. Problem (2), which is linearly-sized, can be easily solved via commercial black-box LP solvers (Bunel et al., 2018). This does not scale up well with the size of the neural network, motivating the need for specialised solvers. Customised dual solvers have been designed by relaxing constraints (1b), (1c) (Dvijotham et al., 2018) or replacing (1c) by the Planet relaxation and employing Lagrangian Decomposition (Bunel et al., 2020a). Both approaches result in bounds very close to optimality for problem (2) in only a fraction of the runtime of off-the-shelf solvers.

2.2 A TIGHTER RELAXATION

A much tighter approximation of problem (1) than the Planet relaxation (§2.1) can be obtained by representing the convex hull of the composition of (1b) and (1c), rather than the convex hull of (1c) alone. A formulation of this type was recently introduced by Anderson et al. (2020). Let us define $\check{L}_{k-1}, \check{U}_{k-1} \in \mathbb{R}^{n_k \times n_{k-1}}$ as:

$$
\check{L}_{k-1}[i,j] = l_{k-1}[j]\, \mathbb{1}_{W_k[i,j] \ge 0} + u_{k-1}[j]\, \mathbb{1}_{W_k[i,j] < 0}, \qquad
\check{U}_{k-1}[i,j] = u_{k-1}[j]\, \mathbb{1}_{W_k[i,j] \ge 0} + l_{k-1}[j]\, \mathbb{1}_{W_k[i,j] < 0}.
$$

Additionally, let us introduce $2^{W_k} = \{0,1\}^{n_k \times n_{k-1}}$, the set of all possible binary masks of weight matrix $W_k$, and $E_k := 2^{W_k} \setminus \{0, 1\}$, which excludes the all-zero and all-one masks. The new representation results in the following primal problem:

$$
\begin{aligned}
\min_{x, \hat{x}, z}\; \hat{x}_n \quad \text{s.t.}\quad & x_0 \in \mathcal{C},\\
& \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \quad k \in [\![0, n-1]\!],\\
& \left.\begin{aligned}
& (x_k, \hat{x}_k, z_k) \in \mathcal{M}_k,\\
& x_k \le (W_k \odot I_k)\, x_{k-1} + z_k \odot b_k - \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \odot (1 - z_k)\\
& \qquad\quad + \big(W_k \odot (1 - I_k) \odot \check{U}_{k-1}\big) \odot z_k \quad \forall\, I_k \in E_k
\end{aligned}\right\} := \mathcal{A}_k \quad k \in [\![1, n-1]\!].
\end{aligned} \tag{3}
$$

Both $\mathcal{M}_k$ and $\mathcal{A}_k$ yield valid MIP formulations for problem (1) when imposing integrality constraints on $z$. However, the LP relaxation of $\mathcal{A}_k$ will yield tighter bounds. In the worst case, this tightness comes at the cost of exponentially many constraints: one for each $I_k \in E_k$. On the other hand, given a set of primal assignments $(x, z)$ that are not necessarily feasible for problem (3), one can efficiently compute the most violated constraint (if any) at that point.
The mask associated with such a constraint can be computed in linear time (Anderson et al., 2020) as:

$$
I_k[i,j] = \mathbb{1}_{\left((1 - z_k[i])\, \check{L}_{k-1}[i,j] + z_k[i]\, \check{U}_{k-1}[i,j] - x_{k-1}[j]\right) W_k[i,j] \,\ge\, 0}. \tag{4}
$$

We point out that $\mathcal{A}_k$ slightly differs from the original formulation of Anderson et al. (2020), which does not explicitly include pre-activation bounds $\hat{l}_k, \hat{u}_k$ (which we treat via $\mathcal{M}_k$). While this was implicitly addressed in practical applications (Botoeva et al., 2020), not doing so has a strong negative effect on bound tightness, possibly to the point of yielding looser bounds than problem (2). In appendix F, we provide an example in which this is the case and extend the original derivation by Anderson et al. (2020) to recover $\mathcal{A}_k$ as in problem (3).

Owing to the exponential number of constraints, problem (3) cannot be solved as it is. As outlined by Anderson et al. (2020), the availability of a linear-time separation oracle (4) offers a natural primal cutting plane algorithm, which can then be implemented in off-the-shelf solvers: solve the Big-M LP (2), then iteratively add the most violated constraints from $\mathcal{A}_k$ at the optimal solution. When applied to the verification of small neural networks via off-the-shelf MIP solvers, this leads to substantial gains with respect to the looser Big-M relaxation (Anderson et al., 2020).

3 AN EFFICIENT DUAL SOLVER FOR THE TIGHTER RELAXATION

Inspired by the success of dual approaches on looser relaxations (Bunel et al., 2020a; Dvijotham et al., 2019), we show that the formal verification gains by Anderson et al. (2020) (see §2.2) scale to larger networks if we solve the tighter relaxation in the dual space. Due to the particular structure of the relaxation, a customised solver for problem (3) needs to meet a number of requirements.

Fact 1. In order to replicate the success of previous dual algorithms on looser relaxations, we need a solver for problem (3) with the following properties: (i) sparsity: a memory cost linear in the number of network activations in spite of the exponentially many constraints, (ii) tightness: the bounds should reflect the quality of those obtained in the primal space, (iii) anytime: low cost per iteration and valid bounds at each step.

The anytime requirement motivates dual solutions: any dual assignment yields a valid bound due to weak duality. Unfortunately, as shown in appendix A, neither of the two dual derivations by Bunel et al. (2020a); Dvijotham et al. (2018) readily satisfies all desiderata at once. Therefore, we need a completely different approach. Let us introduce dual variables $\alpha, \beta$ and functions thereof:

$$
\begin{aligned}
f_k(\alpha, \beta) &= \alpha_k - W_{k+1}^T \alpha_{k+1} - \sum_{I_k} \beta_{k, I_k} + \sum_{I_{k+1}} (W_{k+1} \odot I_{k+1})^T \beta_{k+1, I_{k+1}},\\
g_k(\beta) &= \sum_{I_k \in E_k} \big(W_k \odot (1 - I_k) \odot \check{U}_{k-1}\big) \odot \beta_{k, I_k} + \beta_{k,0} \odot \hat{u}_k + \beta_{k,1} \odot \hat{l}_k\\
&\quad + \sum_{I_k \in E_k} \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \odot \beta_{k, I_k} + \sum_{I_k \in E_k} \beta_{k, I_k} \odot b_k,
\end{aligned} \tag{5}
$$

where $\sum_{I_k}$ is a shorthand for $\sum_{I_k \in 2^{W_k}}$. Starting from primal (3), we relax all constraints in $\mathcal{A}_k$ except the box constraints (see §2.1). We obtain the following dual problem (derivation in appendix C), where functions $f_k, g_k$ appear in inner products with primal variables $x_k, z_k$:

$$
\begin{aligned}
&\max_{(\alpha, \beta) \ge 0} d(\alpha, \beta) \quad \text{where:} \quad d(\alpha, \beta) := \min_{x, z} \mathcal{L}(x, z, \alpha, \beta),\\
&\mathcal{L}(x, z, \alpha, \beta) = \sum_{k=1}^{n-1} b_k^T \alpha_k - \sum_{k=0}^{n-1} f_k(\alpha, \beta)^T x_k - \sum_{k=1}^{n-1} g_k(\beta)^T z_k + \sum_{k=1}^{n-1} \Big[ \sum_{I_k \in E_k} \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \mathbin{@} \beta_{k, I_k} + \beta_{k,1}^T (\hat{l}_k - b_k) \Big]\\
&\qquad \text{s.t.} \quad x_0 \in \mathcal{C}, \quad (x_k, z_k) \in [l_k, u_k] \times [0, 1] \quad k \in [\![1, n-1]\!].
\end{aligned} \tag{6}
$$

This is again a challenging problem: the exponentially many constraints in the primal (3) are now associated with an exponential number of variables.
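As a concrete reference for the separation oracle (4), which also drives the active-set selection of §3.2, here is a minimal PyTorch sketch; the tensor names, shapes and toy data are ours, not the paper's implementation.

```python
# Minimal sketch of the separation oracle in equation (4): given a primal point
# (x_{k-1}, z_k), pick the mask I_k whose constraint from A_k is most violated.
# Variable names and shapes are ours; W has shape (n_k, n_{k-1}).
import torch

def check_matrices(W, l_prev, u_prev):
    """Ľ_{k-1} and Ǔ_{k-1} from section 2.2: pick l or u per sign of W[i, j]."""
    L_check = torch.where(W >= 0, l_prev.unsqueeze(0), u_prev.unsqueeze(0))
    U_check = torch.where(W >= 0, u_prev.unsqueeze(0), l_prev.unsqueeze(0))
    return L_check, U_check

def choose_mask(W, x_prev, z, L_check, U_check):
    # Entry [i, j] of the score; the oracle sets I_k[i, j] = 1 when it is >= 0.
    score = ((1 - z).unsqueeze(1) * L_check
             + z.unsqueeze(1) * U_check
             - x_prev.unsqueeze(0)) * W
    return (score >= 0).float()

# Toy usage with random post-activation bounds l_prev <= x_prev <= u_prev.
torch.manual_seed(0)
W = torch.randn(4, 6)
l_prev, u_prev = -torch.rand(6), torch.rand(6)
x_prev, z = (l_prev + u_prev) / 2, torch.rand(4)
L_check, U_check = check_matrices(W, l_prev, u_prev)
print(choose_mask(W, x_prev, z, L_check, U_check))
```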
Nevertheless, we show that the requirements of Fact 1 can be met by operating on a restricted version of dual (6). To this end, we present Active Set, a specialised solver for the relaxation by Anderson et al. (2020) that is sparse, anytime and yields bounds reflecting the tightness of the new relaxation. Starting from the dual of problem (2), our solver iteratively adds variables to a small active set of dual variables $\beta_{\mathcal{B}}$ and solves the resulting reduced version of problem (6). We first describe our solver on a fixed $\beta_{\mathcal{B}}$ and then outline how to iteratively modify the active set (§3.2). Pseudo-code can be found in appendix D.

3.1 ACTIVE SET SOLVER

We want to solve a version of problem (6) for which the sums over the $I_k$ masks of each layer $k$ are restricted to $B_k \subseteq E_k$¹, with $\mathcal{B} = \cup_k B_k$. By keeping $\mathcal{B} = \emptyset$, we recover a novel dual solver for the Big-M relaxation (2) (explicitly described in appendix B), which is employed as initialisation. Setting $\beta_{k, I_k} = 0\ \forall\, I_k \in E_k \setminus B_k$ in (5), (6) and removing these variables from the formulation, we obtain:

$$
\begin{aligned}
f_{\mathcal{B},k}(\alpha, \beta_{\mathcal{B}}) &= \alpha_k - W_{k+1}^T \alpha_{k+1} - \sum_{I_k \in B_k \cup \{0,1\}} \beta_{k, I_k} + \sum_{I_{k+1} \in B_{k+1} \cup \{0,1\}} (W_{k+1} \odot I_{k+1})^T \beta_{k+1, I_{k+1}},\\
g_{\mathcal{B},k}(\beta_{\mathcal{B}}) &= \sum_{I_k \in B_k} \big(W_k \odot (1 - I_k) \odot \check{U}_{k-1}\big) \odot \beta_{k, I_k} + \beta_{k,0} \odot \hat{u}_k + \beta_{k,1} \odot \hat{l}_k\\
&\quad + \sum_{I_k \in B_k} \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \odot \beta_{k, I_k} + \sum_{I_k \in B_k} \beta_{k, I_k} \odot b_k,
\end{aligned} \tag{7}
$$

along with the reduced dual problem:

$$
\begin{aligned}
&\max_{(\alpha, \beta_{\mathcal{B}}) \ge 0} d_{\mathcal{B}}(\alpha, \beta_{\mathcal{B}}) \quad \text{where:} \quad d_{\mathcal{B}}(\alpha, \beta_{\mathcal{B}}) := \min_{x, z} \mathcal{L}_{\mathcal{B}}(x, z, \alpha, \beta_{\mathcal{B}}),\\
&\mathcal{L}_{\mathcal{B}}(x, z, \alpha, \beta_{\mathcal{B}}) = \sum_{k=1}^{n-1} b_k^T \alpha_k - \sum_{k=0}^{n-1} f_{\mathcal{B},k}(\alpha, \beta_{\mathcal{B}})^T x_k - \sum_{k=1}^{n-1} g_{\mathcal{B},k}(\beta_{\mathcal{B}})^T z_k + \sum_{k=1}^{n-1} \Big[ \sum_{I_k \in B_k} \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \mathbin{@} \beta_{k, I_k} + \beta_{k,1}^T (\hat{l}_k - b_k) \Big]\\
&\qquad \text{s.t.} \quad x_0 \in \mathcal{C}, \quad (x_k, z_k) \in [l_k, u_k] \times [0, 1] \quad k \in [\![1, n-1]\!].
\end{aligned} \tag{8}
$$

We can maximise $d_{\mathcal{B}}(\alpha, \beta_{\mathcal{B}})$, which is concave and non-smooth, via projected supergradient ascent or variants thereof, such as Adam (Kingma & Ba, 2015). In order to obtain a valid supergradient, we need to perform the inner minimisation over the primals. Thanks to the structure of problem (8), the optimisation decomposes over the layers. For $k \in [\![1, n-1]\!]$, we can perform the minimisation in closed-form by driving the primals to their upper or lower bounds depending on the sign of their coefficients:

$$
x_k^* = \mathbb{1}_{f_{\mathcal{B},k}(\alpha, \beta_{\mathcal{B}}) \ge 0} \odot u_k + \mathbb{1}_{f_{\mathcal{B},k}(\alpha, \beta_{\mathcal{B}}) < 0} \odot l_k, \qquad z_k^* = \mathbb{1}_{g_{\mathcal{B},k}(\beta_{\mathcal{B}}) \ge 0}. \tag{9}
$$

The subproblem corresponding to $x_0$ is different, as it involves a linear minimisation over $x_0 \in \mathcal{C}$:

$$
x_0^* \in \operatorname{argmin}_{x_0}\; -f_{\mathcal{B},0}(\alpha, \beta_{\mathcal{B}})^T x_0 \quad \text{s.t.} \quad x_0 \in \mathcal{C}. \tag{10}
$$

We assumed in §2 that (10) can be performed efficiently. We refer the reader to Bunel et al. (2020a) for descriptions of the minimisation when $\mathcal{C}$ is an $\ell_\infty$ or $\ell_2$ ball, as common for adversarial examples. Given $(x^*, z^*)$ as above, the supergradient of $d_{\mathcal{B}}(\alpha, \beta_{\mathcal{B}})$ is a subset of the one for $d(\alpha, \beta)$, given by:

$$
\begin{aligned}
\nabla_{\alpha_k} d(\alpha, \beta) &= W_k x_{k-1}^* + b_k - x_k^*, \qquad
\nabla_{\beta_{k,0}} d(\alpha, \beta) = x_k^* - z_k^* \odot \hat{u}_k,\\
\nabla_{\beta_{k,1}} d(\alpha, \beta) &= x_k^* - \big(W_k x_{k-1}^* + b_k\big) + (1 - z_k^*) \odot \hat{l}_k,\\
\nabla_{\beta_{k,I_k}} d(\alpha, \beta) &= x_k^* - (W_k \odot I_k)\, x_{k-1}^* + \big(W_k \odot I_k \odot \check{L}_{k-1}\big) \odot (1 - z_k^*) - z_k^* \odot b_k - \big(W_k \odot (1 - I_k) \odot \check{U}_{k-1}\big) \odot z_k^*,
\end{aligned} \tag{11}
$$

for each $k \in [\![1, n-1]\!]$. At each iteration, after taking a step in the supergradient direction, the dual variables are projected to the non-negative orthant by clipping negative values.

¹As dual variables $\beta_{k, I_k}$ are indexed by $I_k$, $\mathcal{B} = \cup_k B_k$ implicitly defines an active set of variables $\beta_{\mathcal{B}}$.

3.2 EXTENDING THE ACTIVE SET

We initialise the dual (6) with a tight bound on the Big-M relaxation by solving for $d_{\emptyset}(\alpha, \beta_{\emptyset})$ in (8). To satisfy the tightness requirement in Fact 1, we then need to include constraints (via their Lagrangian multipliers) from the exponential family of $\mathcal{A}_k$ into $B_k$. Our goal is to tighten the bounds as much as possible while keeping the active set small to save memory and compute.
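The closed-form inner minimisation of §3.1 is the core of each supergradient step; the following is a minimal sketch of equations (9) and (10), as we read them, for the common case where $\mathcal{C}$ is an $\ell_\infty$ ball. Function and variable names are ours.

```python
# Minimal sketch of the closed-form inner minimisation (9)-(10): each hidden
# primal is pushed to a box corner by the sign of its coefficient, and x_0 is
# obtained by linear minimisation over an l_inf ball C.
import torch

def inner_minimisation_hidden(f_k, g_k, l_k, u_k):
    """Equation (9): minimise -f_k^T x_k - g_k^T z_k over the box constraints."""
    x_k = torch.where(f_k >= 0, u_k, l_k)
    z_k = (g_k >= 0).float()
    return x_k, z_k

def inner_minimisation_input(f_0, center, eps):
    """Equation (10) when C is an l_inf ball: minimise -f_0^T x_0 over C."""
    return center + eps * torch.sign(f_0)

# Toy usage on a single layer and an l_inf input domain of radius 0.1.
torch.manual_seed(0)
f_k, g_k = torch.randn(5), torch.randn(5)
l_k, u_k = torch.zeros(5), torch.rand(5)
print(inner_minimisation_hidden(f_k, g_k, l_k, u_k))
print(inner_minimisation_input(torch.randn(3), torch.zeros(3), 0.1))
```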
The active set strategy is defined by a selection criterion for the I k to be added2 to Bk, and the frequency of addition. In practice, we add the variables maximising the entries of supergradient βk,Ik d(α, β) after a fixed number of dual iterations. We now provide motivation for both choices. Selection criterion The selection criterion needs to be computationally efficient. Thus, we proceed greedily and focus only on the immediate effect at the current iteration. Let us map a restricted set of dual variables βB to a set of dual variables β for the full dual (6). We do so by setting variables not in the active set to 0: β B = 0, and β = βB β B. Then, for each layer k, we add the set of variables βk,I k maximising the corresponding entries of the supergradient of the full dual problem (6): βk,I k argmaxβk,Ik { βk,Ik d(α, β)T 1}. Therefore, we use the subderivatives as a proxy for short-term improvement on the full dual objective d(α, β). Under a primal interpretation, our selection criterion involves a call to the separation oracle (4) by Anderson et al. (2020). Proposition 1. βk,I k (as defined above) represents the Lagrangian multipliers associated to the most violated constraints from Ak at (x , z ) argminx,z LB(x, z, α, βB), the primal minimiser of the current restricted Lagrangian. Proof. See appendix D.1. Frequency Finally, we need to decide the frequency at which to add variables to the active set. Fact 2. Assume we obtained a dual solution (α , β B) argmax d B(α, βB) using Active Set on the current B. Then (x , z ) argminx,z LB(x, z, α , β B) is not necessarily an optimal primal solution for the primal of the current restricted dual problem (Sherali & Choi, 1996). The primal of d B(α, βB) (restricted primal) is the problem obtained by setting Ek Bk in problem (3). While the primal cutting plane algorithm by Anderson et al. (2020) calls the separation oracle (4) at the optimal solution of the current restricted primal, Fact 2 shows that our selection criterion leads to a different behaviour even at dual optimality for d B(α, βB). Therefore, as we have no theoretical incentive to reach (approximate) subproblem convergence, we add variables after a fixed tunable number of supergradient iterations. Furthermore, we can add more than one variable at once by running the oracle (4) repeatedly for a number of iterations. We conclude this section by pointing out that, while recovering primal optima is possible in principle (Sherali & Choi, 1996), doing so would require dual convergence on each restricted dual problem (8). As the main advantage of dual approaches (Dvijotham et al., 2018; Bunel et al., 2020a) is their ability to quickly achieve tight bounds (rather than formal optimality), adapting the selection criterion to mirror the primal cutting plane algorithm would defeat the purpose of Active Set. 2adding a single I k mask to Bk extends βB by nk variables: one for each neuron at layer k. Published as a conference paper at ICLR 2021 3.3 IMPLEMENTATION DETAILS, TECHNICAL CHALLENGES Analogously to previous dual algorithms (Dvijotham et al., 2018; Bunel et al., 2020a), our approach can leverage the massive parallelism offered by modern GPU architectures in three different ways. First, we execute in parallel the computations of lower and upper bounds relative to all the neurons of a given layer. Second, in complete verification, we can batch over the different Branch and Bound (Ba B) subproblems. 
Third, as most of our solver relies on standard linear algebra operations employed during the forward and backward passes of neural networks, we can exploit the highly optimised implementations commonly found in modern deep learning frameworks. An exception are what we call masked forward/backward passes: operations of the form (Wk Ik) xk or (Wk Ik)T xk+1, which are needed whenever dealing with constraints from Ak. In our solver, they appear if Bk = (see equations (8), (11)). Masked passes require a customised lower-level implementation for a proper treatment of convolutional layers, detailed in appendix G. 4 RELATED WORK In addition to those described in 2, many other relaxations have been proposed in the literature. In fact, all bounding methods are equivalent to solving some convex relaxation of a neural network. This holds for conceptually different ideas such as bound propagation (Gowal et al., 2018), specific dual assignments (Wong & Kolter, 2018), dual formulations based on Lagrangian Relaxation (Dvijotham et al., 2018) or Decomposition (Bunel et al., 2020a). The degree of tightness varies greatly: from looser relaxations associated to closed-form methods (Gowal et al., 2018; Weng et al., 2018; Wong & Kolter, 2018) to tighter formulations based on Semi-Definite Programming (SDP) (Raghunathan et al., 2018). The speed of closed-form approaches results from simplifying the triangle-shaped feasible region of the Planet relaxation ( 2.1) (Singh et al., 2018; Wang et al., 2018). On the other hand, tighter relaxations are more expressive than the linearly-sized LP by (Ehlers, 2017). The SDP formulation by Raghunathan et al. (2018) can represent interactions between activations in the same layer. Similarly, Singh et al. (2019) tighten the Planet relaxation by considering the convex hull of the union of polyhedra relative to k Re LUs of a given layer at once. Alternatively, tighter LPs can be obtained by considering the Re LU together with the affine operator before it: standard MIP techniques (Jeroslow, 1987) lead to a formulation that is quadratic in the number of variables (see appendix F.2). The relaxation by Anderson et al. (2020) detailed in 2.2 is a more convenient representation of the same set. By projecting out the auxiliary z variables, (Tjandraatmadja et al., 2020) recently introduced another formulation equivalent to the one by Anderson et al. (2020), with half as many variables and a linear factor more constraints compared to what described in 2.2. Therefore, the relationship between the two formulations mirrors the one between the Planet and Big-M relaxations (see appendix B.1). Our dual derivation and the Active Set algorithm can be adapted to operate on the projected relaxations. Specialised dual solvers significantly improve in bounding efficiency with respect to off-the-shelf solvers for both LP (Bunel et al., 2020a) and SDP formulations (Dvijotham et al., 2019). Therefore, the design of similar solvers for other tight relaxations is an interesting line of future research. We contribute with a specialised dual solver for the relaxation by Anderson et al. (2020) ( 3). In what follows, we demonstrate empirically that by seamlessly transitioning from the Planet relaxation to the tighter formulation, we can obtain large incomplete and complete verification improvements. 5 EXPERIMENTS We empirically demonstrate the effectiveness of our method under two settings. On incomplete verification ( 5.1), we assess the speed and quality of bounds compared to other bounding algorithms. 
On complete verification (§5.2), we examine whether our speed-accuracy trade-offs correspond to faster exact verification. Our implementation is based on PyTorch (Paszke et al., 2017) and is available at https://github.com/oval-group/scaling-the-convex-barrier.

5.1 INCOMPLETE VERIFICATION

We evaluate incomplete verification performance by upper bounding the robustness margin (the difference between the ground truth logit and the other logits) to adversarial perturbations (Szegedy et al., 2014) on the CIFAR-10 test set (Krizhevsky & Hinton, 2009). If the upper bound is negative, we can certify the network's vulnerability to adversarial perturbations. We replicate the experimental setting from Bunel et al. (2020a). The networks correspond to the small network architecture from Wong & Kolter (2018). Here, we present results for a network trained via standard SGD and cross-entropy loss, with no modification to the objective for robustness. Perturbations for this network lie in an $\ell_\infty$ norm ball with radius $\epsilon_{\text{ver}} = 5/255$ (which is hence lower than commonly employed radii for robustly trained networks). In appendix I, we provide additional CIFAR-10 results on an adversarially trained network using the method by Madry et al. (2018), and on MNIST (LeCun et al., 1998), for a network adversarially trained with the algorithm by Wong & Kolter (2018).

Figure 1: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from $\mathcal{A}_k$ per neuron; higher is better. Results for the SGD-trained network from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result. Comparing Active Set with 1650 steps to Gurobi 1 Cut, tighter bounds are achieved with a smaller runtime. (Methods shown: Gurobi Planet, BDD+ 400 steps, Big-M 850 steps, Active Set 600/1050/1650 steps, Active Set CPU.)

We compare both against previous dual iterative methods and Gurobi (Gurobi Optimization, 2020), the commercial black-box solver employed by Anderson et al. (2020). For the Gurobi-based baselines, Planet means solving the Planet (Ehlers, 2017) relaxation of the network, while Gurobi cut starts from the Big-M relaxation and adds constraints from $\mathcal{A}_k$ in a cutting-plane fashion, as in the original primal algorithm by Anderson et al. (2020). We run both on 4 CPU threads. Amongst dual iterative methods, run on an Nvidia Titan Xp GPU, we compare with BDD+, the recent proximal-based solver by Bunel et al. (2020a), operating on a Lagrangian Decomposition dual of the Planet relaxation. As we operate on (a subset of) the data by Bunel et al. (2020a), we omit both their supergradient-based approach and the one by Dvijotham et al. (2018), as they both perform worse than BDD+ (Bunel et al., 2020a). For the same reason, we omit cheaper (and looser) methods, like interval propagation (Gowal et al., 2018) and the one by Wong & Kolter (2018). Active Set denotes our solver for problem (3), described in §3.1. By keeping $\mathcal{B} = \emptyset$, Active Set reduces to Big-M, a solver for the non-projected Planet relaxation (appendix B), which can be seen as Active Set's initialiser. In line with previous bounding algorithms (Bunel et al., 2020a), we employ Adam updates (Kingma & Ba, 2015) for supergradient-type methods due to their faster empirical convergence. Finally, we complement the comparison with Gurobi-based methods by running Active Set on 4 CPU threads (Active Set CPU).
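For reference, the robustness margin bounded in this experiment can be cast as a scalar network output by folding the logit difference into the last linear layer; this is a standard construction, sketched below with our own naming, and not necessarily how the released code organises it.

```python
# Minimal sketch: the robustness margin for a wrong class j is
# (logit_true - logit_j), which can be folded into the last linear layer so
# that each margin becomes a scalar output to be bounded over the input domain.
import torch

def margin_layer(W_last, b_last, true_label, other_label):
    """Return (w, b) such that w^T x + b = logit_true - logit_other."""
    w = W_last[true_label] - W_last[other_label]
    b = b_last[true_label] - b_last[other_label]
    return w, b

# Toy usage: a 10-class last layer acting on a 32-dimensional penultimate vector.
torch.manual_seed(0)
W_last, b_last = torch.randn(10, 32), torch.randn(10)
w, b = margin_layer(W_last, b_last, true_label=3, other_label=5)
# A negative upper bound on this margin over the perturbation ball certifies
# the network's vulnerability (section 5.1).
print(w.shape, b)
```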
Further details, including hyper-parameters, can be found in appendix I.

Figure 1 shows the distribution of runtime and the bound improvement with respect to Gurobi cut for the SGD-trained network. For Gurobi cut, we only add the single most violated cut from $\mathcal{A}_k$ per neuron, due to the cost of repeatedly solving the LP. We tuned BDD+ and Big-M, the dual methods operating on the weaker relaxation (2), to have the same average runtime. They obtain bounds comparable to Gurobi Planet in one order of magnitude less time. Initialised from 500 Big-M iterations, at 600 iterations Active Set already achieves better bounds on average than Gurobi cut in around 1/20th of the time. With a computational budget twice as large (1050 iterations) or four times as large (1650 iterations), the bounds significantly improve over Gurobi cut in still a fraction of the time. As we empirically demonstrate in appendix I, the tightness of the Active Set bounds is strongly linked to our active set strategy (§3.2). Remarkably, even if our method is specifically designed to take advantage of GPU acceleration, executing it on CPU proves to be strongly competitive with Gurobi cut, producing better bounds in less time for the benchmark of Figure 1.

Figure 2: Pointwise comparison for a subset of the methods on the data presented in Figure 1. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality. (a) Comparison of runtime (left) and gap to Gurobi Planet bounds (right) for BDD+ 400 steps and Big-M 850 steps; for the latter, lower is better. (b) Comparison of runtime (left) and difference with the Gurobi Planet bounds (right) for Big-M 850 steps and Active Set 80 steps; for the latter, higher is better.

Figure 2 shows pointwise comparisons for a subset of the methods of Figure 1, on the same data. Figure 2a shows the gap to the (Gurobi) Planet bound for BDD+ and our Big-M solver. Surprisingly, our Big-M solver is competitive with BDD+, achieving on average better bounds than BDD+ in the same time. Figure 2b shows the improvement over Planet bounds for Big-M and Active Set. The latter achieves markedly better bounds than Big-M in the same time, demonstrating the benefit of operating (at least partly) on the tighter dual (6).

5.2 COMPLETE VERIFICATION

We next evaluate performance on complete verification, verifying the adversarial robustness of a network to perturbations in $\ell_\infty$ norm on a subset of the dataset by Lu & Kumar (2020), replicating the experimental setting from Bunel et al. (2020a). The dataset associates a different perturbation radius $\epsilon_{\text{verif}}$ to each CIFAR-10 image, so as to create challenging verification properties. Its difficulty makes the dataset an appropriate testing ground for tighter relaxations like the one by Anderson et al. (2020) (§2.2). Further details, including network architectures, can be found in appendix I.

Here, we aim to solve the non-convex problem (1) directly, rather than an approximation as in §5.1. In order to do so, we use BaBSR, a branch and bound algorithm from Bunel et al. (2020b). Branch and bound works by dividing the problem domain into subproblems (branching) and bounding the local minimum over those domains. Any domain which cannot contain the global lower bound is pruned away, whereas the others are kept and branched over.
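A schematic of the branch-and-bound loop just described, with the bounding and branching routines passed in as black boxes; this is a didactic sketch rather than the BaBSR implementation evaluated below.

```python
# Schematic of branch and bound: bound each domain, prune domains that cannot
# contain the minimiser, and branch the remaining ones.
import heapq

def branch_and_bound(root, lower_bound, upper_bound, branch, eps=1e-4):
    """Return (global_lb, global_ub) bracketing the minimum over `root`."""
    global_ub = upper_bound(root)              # value at any feasible point
    heap = [(lower_bound(root), 0, root)]      # domains keyed by lower bound
    counter = 1
    while heap and heap[0][0] < global_ub - eps:
        _, _, domain = heapq.heappop(heap)
        for sub in branch(domain):             # e.g. split an ambiguous ReLU
            sub_lb = lower_bound(sub)
            global_ub = min(global_ub, upper_bound(sub))
            if sub_lb < global_ub - eps:       # otherwise the domain is pruned
                heapq.heappush(heap, (sub_lb, counter, sub))
                counter += 1
    global_lb = heap[0][0] if heap else global_ub
    # For verification one can stop early once global_lb >= 0 (property true)
    # or global_ub < 0 (a counter-example exists).
    return global_lb, global_ub

# Toy usage: bound the minimum of (x - 0.3)^2 over [0, 1] with interval bounds.
f = lambda x: (x - 0.3) ** 2
lb = lambda d: 0.0 if d[0] <= 0.3 <= d[1] else min(f(d[0]), f(d[1]))
ub = lambda d: f((d[0] + d[1]) / 2)
br = lambda d: [(d[0], (d[0] + d[1]) / 2), ((d[0] + d[1]) / 2, d[1])]
print(branch_and_bound((0.0, 1.0), lb, ub, br))
```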
In BaBSR, branching is carried out by splitting an unfixed ReLU into its passing and blocking phases. The ReLU which induces the maximum change in the domain's lower bound, when made unambiguous, is selected for splitting. A fundamental component of a BaB method is the bounding algorithm, which is, in general, the computational bottleneck (Lu & Kumar, 2020). Therefore, we compare the effect on final verification time of using the different bounding methods of §5.1 within BaBSR. In addition, we evaluate MIP $\mathcal{A}_k$, which encodes problem (1) as a Big-M MIP (Tjeng et al., 2019) and solves it in Gurobi by adding cutting planes from $\mathcal{A}_k$, analogously to the original experiments by Anderson et al. (2020). Finally, we also compare against ERAN (Singh et al., 2020), a state-of-the-art complete verification toolbox: results on the dataset by Lu & Kumar (2020) are taken from the recent VNN-COMP competition (VNN-COMP, 2020).

We use 100 iterations for Active Set, 100 iterations for BDD+ and 180 iterations for Big-M. For dual iterative algorithms, we solve 300 subproblems at once for the base network and 200 for the deep and wide networks (see §3.3). Additionally, dual variables are initialised from their parent node's bounding computation. As in Bunel et al. (2020a), the time-limit is kept at one hour. Due to the difference in computational cost between algorithms operating on the tighter relaxation by Anderson et al. (2020) and the other bounding algorithms³, we also experiment with a stratified version of the bounding within BaBSR. We devise a set of heuristics to determine whether a given subproblem is easy (in which case looser bounds are sufficient) or whether we need to operate on the tighter relaxation. Instances of this approach are Big-M + Active Set and Gurobi Planet + Gurobi 1 cut. Further details are provided in appendix H.

³For Active Set, this is partly due to the masked forward/backward pass described in appendix G.

Figure 3: Cactus plots on properties from Lu & Kumar (2020), displaying the percentage of solved properties as a function of runtime (one panel per network: base, wide and deep large models). Baselines are represented by dotted lines. Methods shown: BDD+ BaBSR, Big-M BaBSR, Active Set BaBSR, Big-M + Active Set BaBSR, MIP $\mathcal{A}_k$, G. Planet + G. 1 cut BaBSR.

Table 1: We compare average solving time, average number of solved sub-problems and the percentage of timed-out properties on data from Lu & Kumar (2020). The best dual iterative method is highlighted in bold.

| Method | Base time(s) | Base sub-problems | Base %Timeout | Wide time(s) | Wide sub-problems | Wide %Timeout | Deep time(s) | Deep sub-problems | Deep %Timeout |
|---|---|---|---|---|---|---|---|---|---|
| BDD+ BaBSR | 883.55 | 82,699.40 | 22.00 | 568.25 | 43,751.88 | 13.00 | 281.47 | 10,763.48 | 5.00 |
| Big-M BaBSR | 826.60 | 68,582.00 | 19.00 | 533.79 | 35,877.24 | 12.00 | 253.37 | 9,346.78 | 4.00 |
| Active Set BaBSR | 422.32 | 9,471.90 | 7.00 | 169.73 | 1,873.36 | 3.00 | 227.26 | 2,302.16 | 2.00 |
| Big-M + Active Set BaBSR | 402.88 | 11,408.84 | 7.00 | 179.73 | 3,712.62 | 3.00 | 197.99 | 3,086.62 | 2.00 |
| G. Planet + G. 1 cut BaBSR | 1191.48 | 2,044.28 | 14.00 | 1272.99 | 1,352.42 | 10.00 | 704.59 | 677.74 | 3.00 |
| MIP $\mathcal{A}_k$ | 3227.50 | 226.24 | 82.00 | 2500.70 | 100.93 | 64.00 | 3339.37 | 434.57 | 91.00 |
| ERAN | 805.89 | - | 5.00 | 632.12 | - | 9.00 | 545.72 | - | 0.00 |
Figure 3 and Table 1 show that Big-M performs competitively with BDD+. Active Set verifies a larger share of properties than the methods operating on the looser formulation (2), demonstrating the benefit of tighter bounds (§5.1) in complete verification. On the other hand, the poor performance of MIP $\mathcal{A}_k$ and of Gurobi Planet + Gurobi 1 cut, tied to the scaling limitations of off-the-shelf solvers, shows that tighter bounds are effective only if they can be computed efficiently. Nevertheless, the difference in performance between the two Gurobi-based methods confirms that customised branch-and-bound solvers (BaBSR) are preferable to generic MIP solvers, as observed by Bunel et al. (2020b) on the looser Planet relaxation. Moreover, the stratified bounding system allows us to retain the speed of Big-M on easier properties, without excessively sacrificing Active Set's gains on the harder ones. Finally, while ERAN verifies 2% more properties than Active Set on two networks, BaBSR (with any dual bounding algorithm) is faster on most of the properties. BaBSR-based results could be further improved by employing the learned branching strategy presented by Lu & Kumar (2020): in this work, we focused on the bounding component of branch and bound.

6 DISCUSSION

The vast majority of neural network bounding algorithms focuses on (solving or loosening) a popular triangle-shaped relaxation, referred to as the convex barrier for verification. Relaxations that are tighter than this convex barrier have been recently introduced, but their complexity hinders applicability. We have presented Active Set, a sparse dual solver for one such relaxation, and empirically demonstrated that it yields significant formal verification speed-ups. Our results show that scalable tightness is key to the efficiency of neural network verification and instrumental in the definition of a more appropriate "convex barrier". We believe that new customised solvers for similarly tight relaxations are a crucial avenue for future research in the area, possibly beyond piecewise-linear networks. Finally, as it is inevitable that tighter bounds will come at a larger computational cost, future verification systems will be required to recognise a priori whether tight bounds are needed for a given property. A possible solution to this problem could rely on learning algorithms.

ACKNOWLEDGMENTS

ADP was supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems, grant EP/L015987/1, and an IBM PhD fellowship. HSB was supported by a Tencent studentship through the University of Oxford.

REFERENCES

Ross Anderson, Joey Huchette, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. Conference on Integer Programming and Combinatorial Optimization, 2019.

Ross Anderson, Joey Huchette, Will Ma, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, 2020.

Elena Botoeva, Panagiotis Kouvaros, Jan Kronqvist, Alessio Lomuscio, and Ruth Misener. Efficient verification of ReLU-based neural networks via dependency analysis. AAAI Conference on Artificial Intelligence, 2020.

Rudy Bunel, Ilker Turkaslan, Philip HS Torr, Pushmeet Kohli, and M Pawan Kumar. A unified view of piecewise linear neural network verification. Neural Information Processing Systems, 2018.
Rudy Bunel, Alessandro De Palma, Alban Desmaison, Krishnamurthy Dvijotham, Pushmeet Kohli, Philip HS Torr, and M Pawan Kumar. Lagrangian decomposition for neural network verification. Conference on Uncertainty in Artificial Intelligence, 2020a. Rudy Bunel, Jingyue Lu, Ilker Turkaslan, P Kohli, P Torr, and M Pawan Kumar. Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research, 21(2020), 2020b. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy, 2017. Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. Uncertainty in Artificial Intelligence, 2018. Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Chongli Qin, Soham De, and Pushmeet Kohli. Efficient neural network verification with exactness characterization. Uncertainty in Artificial Intelligence, 2019. Ruediger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. Automated Technology for Verification and Analysis, 2017. Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. SECML Neur IPS, 2018. LLC Gurobi Optimization. Gurobi optimizer reference manual, 2020. URL http://www. gurobi.com. R. G. Jeroslow. Representability in mixed integer programming, i: Characterization results. Discrete Applied Mathematics, 1987. Guy Katz, Clark Barrett, David Dill, Kyle Julian, and Mykel Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. International Conference on Computer-Aided Verification, 2017. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015. A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master s thesis, Department of Computer Science, University of Toronto, 2009. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. IEEE, 1998. Claude Lemar echal. Lagrangian relaxation. In Computational combinatorial optimization, pp. 112 156. Springer, 2001. Jingyue Lu and M Pawan Kumar. Neural network branching for neural network verification. In International Conference on Learning Representations, 2020. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2018. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. NIPS Autodiff Workshop, 2017. Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang. Semidefinite relaxations for certifying robustness to adversarial examples. Neural Information Processing Systems, 2018. Published as a conference paper at ICLR 2021 Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks. Neural Information Processing Systems, 2019. Hanif D. Sherali and Gyunghyun Choi. Recovery of primal solutions when using subgradient optimization methods to solve lagrangian duals of linear programs. Operations Research Letters, 1996. 
Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus P uschel, and Martin Vechev. Fast and effective robustness certification. Neural Information Processing Systems, 2018. Gagandeep Singh, Rupanshu Ganvir, Markus P uschel, and Martin Vechev. Beyond the single neuron convex barrier for neural network certification. Neural Information Processing Systems, 2019. Gagandeep Singh, Jonathan Maurer, Christoph M uller, Matthew Mirman, Timon Gehr, Adrian Hoffmann, Petar Tsankov, Dana Drachsler Cohen, Markus P uschel, and Martin Vechev. ETH robustness analyzer for neural networks (ERAN), 2020. URL https://github.com/eth-sri/ eran. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations, 2014. Christian Tjandraatmadja, Ross Anderson, Joey Huchette, Will Ma, Krunal Patel, and Juan Pablo Vielma. The convex relaxation barrier, revisited: Tightened single-neuron relaxations for neural network verification. ar Xiv preprint ar Xiv:2006.14076, 2020. Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. International Conference on Learning Representations, 2019. VNN-COMP. International verification of neural networks competition (VNN-COMP). Verification of Neural Networks workshop at the International Conference on Computer-Aided Verification, 2020. URL https://sites.google.com/view/vnn20/vnncomp. Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Efficient formal safety analysis of neural networks. Neural Information Processing Systems, 2018. Stefan Webb, Tom Rainforth, Yee Whye Teh, and M Pawan Kumar. A statistical approach to assessing neural network robustness. International Conference on Learning Representations, 2019. Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for relu networks. International Conference on Machine Learning, 2018. Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning, 2018. Published as a conference paper at ICLR 2021 A LIMITATIONS OF PREVIOUS DUAL APPROACHES In this section, we show that previous dual derivations (Bunel et al., 2020a; Dvijotham et al., 2018) violate Fact 1. Therefore, they are not efficiently applicable to problem (3), motivating our own derivation in section 3. We start from the approach by Dvijotham et al. (2018), which relies on relaxing equality constraints (1b), (1c) from the original non-convex problem (1). Dvijotham et al. (2018) prove that this relaxation corresponds to solving convex problem (2), which is equivalent to the Planet relaxation (Ehlers, 2017), to which the original proof refers. As we would like to solve tighter problem (3), the derivation is not directly applicable. Relying on intuition from convex analysis applied to duality gaps (Lemar echal, 2001), we conjecture that relaxing the composition (1c) (1b) might tighten the primal problem equivalent to the relaxation, obtaining the following dual: max µ min x Wnxn 1 + bn + k=1 µT k (xk max {Wkxk 1 + bk, 0}) s.t. lk xk uk k J1, n 1K , x0 C. Unfortunately dual (12) requires an LP (the inner minimisation over x, which in this case does not decompose over layers) to be solved exactly to obtain a supergradient and any time a valid bound is needed. 
This is markedly different from the original dual by Dvijotham et al. (2018), which had an efficient closed-form for the inner problems. The derivation by Bunel et al. (2020a), instead, operates by substituting (1c) with its convex hull and solving its Lagrangian Decomposition dual. The Decomposition dual for the convex hull of (1c) (1b) (i.e., Ak) takes the following form: max ρ min x,z Wnx A,n 1 + bn + k=1 ρT k (x B,k x A,k) s.t. x0 C, (x B,k, Wnx A,n 1 + bn, zk) Adec,k k J1, n 1K , where Adec,k corresponds to Ak with the following substitutions: xk x B,k, and ˆxk Wnx A,n 1 + bn. It can be easily seen that the inner problems (the inner minimisation over x A,k, x B,k, for each layer k > 0) are an exponentially sized LP. Again, this differs from the original dual on the Planet relaxation (Bunel et al., 2020a), which had an efficient closed-form for the inner problems. B DUAL INITIALISATION Algorithm 1 Big-M solver 1: function BIGM COMPUTE BOUNDS({Wk, bk, lk, uk,ˆlk, ˆuk}k=1..n) 2: Initialise duals α0, β0 M using interval propagation bounds (Gowal et al., 2018) 3: for t J1, T 1K do 4: x , z argminx,z LM(x, z, αt, βt M) using (16)-(17) 5: Compute supergradient using (18) 6: αt+1, βt+1 M Adam s update rule (Kingma & Ba, 2015) 7: αt+1, βt+1 M max(αt+1, 0), max(βt+1 M , 0) (dual projection) 8: end for 9: return minx,z LM(x, z, αT , βT M) 10: end function As shown in section 3, our Active Set solver yields a dual solver for the Big-M relaxation (2) if the active set B is kept empty throughout execution. As indeed B = for the first Active Set iterations (see algorithm 2 in section D), the Big-M solver can be thought of as dual initialisation. Furthermore, Published as a conference paper at ICLR 2021 we demonstrate experimentally in 5 that, when used as a stand-alone solver, our Big-M solver is competitive with previous dual algorithms for problem (2). The goal of this section is to explicitly describe the Big-M solver, which is summarised in algorithm 1. We point out that, in the notation of restricted variable sets from section 3.1, βM := β . We now describe the equivalence between the Big-M and Planet relaxations, before presenting the solver in section B.3 and the dual it operates on in section B.2. B.1 EQUIVALENCE TO PLANET As previously shown (Bunel et al., 2018), the Big-M relaxation (Mk, when considering the k-th layer only) in problem (2) is equivalent to the Planet relaxation by Ehlers (2017). Then, due to strong duality, our Big-M solver (section B.2) and the solvers by Bunel et al. (2020a); Dvijotham et al. (2018) will all converge to the bounds from the solution of problem (2). In fact, the Decompositionbased method (Bunel et al., 2020a) directly operates on the Planet relaxation, while Dvijotham et al. (2018) prove that their dual is equivalent to doing so. On the k-th layer, the Planet relaxation takes the following form: if ˆlk 0 and ˆuk 0 : xk 0, xk ˆxk, xk ˆuk (ˆxk ˆlk) 1/(ˆuk ˆlk) . if ˆuk 0 : xk = 0. if ˆlk 0 : xk = ˆxk. It can be seen that Pk = Projx,ˆx (Mk), where Projx,ˆx denotes projection on the x, ˆx hyperplane. In fact, as zk does not appear in the objective of the primal formulation (2), but only in the constraints, this means assigning it the value that allows the largest possible feasible region. This is trivial for passing or blocking Re LUs. For the ambiguous case, instead, Figure 4 (on a single Re LU) shows that zk = ˆxk ˆlk ˆuk ˆlk is the correct assignment. B.2 BIG-M DUAL As evident from problem (3), Ak Mk. 
If we relax all constraints in Mk (except, again, the box constraints), we are going to obtain a dual with a strict subset of the variables in problem (6). The Big-M dual is a specific instance of the Active Set dual (8) where B = , and it takes the following xk = ˆxk ˆlk (1 zk) xk = ˆuk zk 1 0 zk = ˆxk ˆlk Cσ(ˆxk, zk) Figure 4: Mk plotted on the (zk, xk) plane, under the assumption that ˆlk 0 and ˆuk 0. Published as a conference paper at ICLR 2021 form: max (α,β) 0d M(α, βM) where: d M(α, βM) := min x,z LM(x, z, α, βM), LM(x, z, α, βM) = Pn 1 k=0 αk W T k+1αk+1 (βk,0 + βk,1 W T k+1βk+1,1) T xk + Pn 1 k=1 b T k αk Pn 1 k=1 βk,0 ˆuk + βk,1 ˆlk T zk + Pn 1 k=1(ˆlk bk)T βk,1 s.t. x0 C, (xk, zk) [lk, uk] [0, 1] k J1, n 1K. (15) B.3 BIG-M SOLVER We initialise dual variables from interval propagation bounds (Gowal et al., 2018): this can be easily done by setting all dual variables except αn to 0. Then, we can maximize d M(α, β) via projected supergradient ascent, exactly as described in section 3.1 on a generic active set B. All the computations in the solver follow from keeping B = in 3.1. We explicitly report them here for the reader s convenience. Let us define the following shorthand for the primal coefficients: f M,k(α, βM) = αk W T k+1αk+1 (βk,0 + βk,1 W T k+1βk,1) g M,k(βM) = βk,0 ˆuk + βk,1 ˆlk. The minimisation of the Lagrangian LM(x, z, α, β) over the primals for k J1, n 1K is as follows: x k = 1f M,k(α,βM) 0 ˆuk + 1f M,k(α,βM)<0 ˆlk z k = 1g M,k(βM) 0 1 (16) For k = 0, instead (assuming, as 3.1 that this can be done efficiently): x 0 argmin x0 f M,k(α, βM)T x0 s.t. x0 C. (17) The supergradient over the Big-M dual variables α, βk,0, βk,1 is computed exactly as in 3.1 and is again a subset of the supergradient of the full dual problem (6). We report it for completeness. For each k J0, n 1K: αkd(α, β) = Wkx k 1 + bk x k, βk,0d(α, β) = xk zk ˆuk, βk,1d(α, β) = xk (Wkxk 1 + bk) + (1 zk) ˆlk. (18) C DUAL DERIVATIONS We now derive problem (6), the dual of the full relaxation by Anderson et al. (2020) described in equation (3). The Active Set (equation (8)) and Big-M duals (equation (15)) can be obtained by removing βk,Ik Ik Ek \ Bk and βk,Ik Ik Ek, respectively. We employ the following Lagrangian multipliers: xk ˆxk αk, xk ˆuk zk βk,0, xk ˆxk ˆlk (1 zk) βk,1, (Wk Ik) xk 1 + zk bk Wk Ik ˇLk 1 (1 zk) + Wk (1 Ik) ˇUk 1 zk Published as a conference paper at ICLR 2021 and obtain, as a Lagrangian (using ˆxk = Wkxk 1 + bk): L(x, z, α, β) = Pn 1 k=1 αT k (Wkxk 1 + bk xk) + Pn 1 k=1 βT k,0 (xk zk ˆuk) + Pn 1 k=1 P Ik Ek βT k,Ik Wk Ik ˇLk 1 (1 zk) (Wk Ik) xk 1 bk zk Wk (1 Ik) ˇUk 1 zk + xk + Pn 1 k=1 βT k,1 xk (Wkxk 1 + bk) + (1 zk) ˆlk + Wnxn 1 + bn Let us use P Ik as shorthand for P Ik Ek {0,1}. If we collect the terms with respect to the primal variables and employ dummy variables α0 = 0, β0 = 0, αn = I, βn = 0, we obtain: L(x, z, α, β) = αk W T k+1αk+1 P Ik βk,Ik +P Ik+1(Wk+1 Ik+1)T βk+1,Ik+1 Ik Ek βk,Ik bk + βk,1 ˆlk + βk,0 ˆuk Ik Ek Wk Ik ˇLk 1 βk,Ik + P Ik Ek Wk (1 Ik) ˇUk 1 βk,Ik + Pn 1 k=1 b T k αk + Pn 1 k=1 P Ik Ek(Wk Ik ˇLk 1) @ βk,Ik + βT k,1(ˆlk bk) which corresponds to the form shown in problem (6). D IMPLEMENTATION DETAILS FOR ACTIVE SET METHOD From the high-level perspective, our Active Set solver proceeds by repeatedly solving modified instances of problem (6), where the exponential set Ek is replaced by a fixed (small) set of active variables B. The full solver procedure is summarised in algorithm 2. 
Algorithm 2 Active Set solver 1: function ACTIVESET COMPUTE BOUNDS({Wk, bk, lk, uk,ˆlk, ˆuk}k=1..n) 2: Initialise duals α0, β0 0, β0 1 using Algorithm (1) 3: Set βk,Ik = 0, Ik Ek 4: B = 5: for nb additions do 6: for t J1, T 1K do 7: x , z argminx,z LB(x, z, αt, βt B) using (9),(10) 8: if t nb vars to add then 9: For each layer k, add output of (4) called at (x , z ) to Bk 10: end if 11: Compute supergradient using (11) 12: αt+1, βt+1 B Adam s update rule (Kingma & Ba, 2015) 13: αt+1, βt+1 B max(αt+1, 0), max(βt+1 B , 0) (dual projection) 14: end for 15: end for 16: return minx,z LB(x, z, αT , βT B) 17: end function We conclude this section by proving the primal interpretation of the selection criterion for adding a new set of variables to B. D.1 ACTIVE SET SELECTION CRITERION We map a restricted set of dual variables βB to a set of dual variables β for the full dual (6) by setting variables not in the active set to 0: β B = 0, and β = βB β B. Proposition. Let βk,I k be the set of dual variables maximising the corresponding entries of the supergradient of the full dual problem (6): βk,I k argmaxβk,Ik { βk,Ik d(α, β)T 1}. βk,I k represents the Lagrangian multipliers associated to the most violated constraints from Ak at (x , z ) argminx,z LB(x, z, α, βB), the primal minimiser of the current restricted Lagrangian. Published as a conference paper at ICLR 2021 Proof. In the following, (x , z ) denotes the points introduced in the statement. Recall the definition of βk,Ik d(α, β) in equation (9), which applies beyond the current active set: βk,Ik d(α, β) = x k (Wk Ik) x k 1 + Wk Ik ˇLk 1 (1 z k) z k bk + Wk (1 Ik) ˇUk 1 z k We want to compute I k argmax Ik{ βk,Ik d(α, β)T 1}, that is: I k argmax Ik Ek x k (Wk Ik) x k 1 + Wk Ik ˇLk 1 (1 z k) z k bk + Wk (1 Ik) ˇUk 1 z k By removing the terms that do not depend on Ik, we obtain: (Wk Ik) x k 1 + Wk Ik ˇLk 1 (1 z k) + Wk Ik ˇUk 1 z k Let us denote the i-th row of Wk and Ik by wi,k and ii,k, respectively, and define Ek[i] = 2wi,k \ {0, 1}. The optimisation decomposes over each such row: we thus focus on the optimisation problem for the supergradient s i-th entry. Collecting the mask, we get: max ii,k Ek[i] (1 z k[i]) ˇLk 1[i, j] + z k[i] ˇUk 1[i, j] x k 1[i] Wk[i, j] As the solution to the problem above is obtained by setting I k[i, j] = 1 if its coefficient is positive and I k[i, j] = 0 otherwise, we can see that the optimal Ik corresponds to calling oracle (4) by Anderson et al. (2020) on (x , z ). Hence, in addition to being the mask associated to βk,I k, the variable set maximising the supergradient, I k corresponds to the most violated constraint from Ak at (x , z ). E INTERMEDIATE BOUNDS A crucial quantity in both Re LU relaxations (Mk and Ak) are intermediate pre-activation bounds ˆlk, ˆuk. In practice, they are computed by solving a relaxation Ck (which might be Mk, Ak, or something looser) of (1) over subsets of the network (Bunel et al., 2020a). Forˆli, this means solving the following problem (separately, for each entry ˆli[j]): min x,ˆx,z ˆxi[j] s.t. x0 C ˆxk+1 = Wk+1xk + bk+1, k J0, i 1K , (xk, ˆxk, zk) Ck k J1, i 1K . As (19) needs to be solved twice for each neuron (lower and upper bounds, changing the sign of the last layer s weights) rather than once as in (3), depending on the computational budget, Ck might be looser than the relaxation employed for the last layer bounds (in our case, Ak). 
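Appendix E above mentions interval propagation (Gowal et al., 2018) as one of the two methods combined for intermediate bounds, and the quantities M−k, M+k used in appendix F below are exactly its single-layer bounds. The following is a minimal sketch of that procedure; the layer lists and shapes are ours.

```python
# Minimal sketch of interval propagation (Gowal et al., 2018), one of the two
# methods appendix E combines for intermediate bounds.
import torch

def interval_propagation(weights, biases, l0, u0):
    """Return pre-activation bounds (l̂_k, û_k) for every layer."""
    l, u, bounds = l0, u0, []
    for W, b in zip(weights, biases):
        W_pos, W_neg = W.clamp(min=0), W.clamp(max=0)
        l_hat = W_pos @ l + W_neg @ u + b   # worst case per sign of the weight
        u_hat = W_pos @ u + W_neg @ l + b
        bounds.append((l_hat, u_hat))
        l, u = l_hat.clamp(min=0), u_hat.clamp(min=0)  # post-activation bounds
    return bounds

# Toy usage on a 2-layer network over an l_inf ball of radius 0.1.
torch.manual_seed(0)
weights = [torch.randn(8, 4), torch.randn(1, 8)]
biases = [torch.randn(8), torch.randn(1)]
center = torch.zeros(4)
print(interval_propagation(weights, biases, center - 0.1, center + 0.1))
```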
In all our experiments, we compute intermediate bounds as the tightest bounds between the method by Wong & Kolter (2018) and Interval Propagation (Gowal et al., 2018). Once pre-activation bounds are available, post-activation bounds can be simply computed as lk = max(ˆlk, 0), uk = max(ˆuk, 0). F PRE-ACTIVATION BOUNDS IN Ak We now highlight the importance of an explicit treatment of pre-activation bounds in the context of the relaxation by Anderson et al. (2020). In F.1 we will show through an example that, without a separate pre-activation bounds treatment, Ak could be looser than the less computationally expensive Mk relaxation. We then ( F.2) justify our specific pre-activation bounds treatment by extending the original proof by Anderson et al. (2020). Published as a conference paper at ICLR 2021 The original formulation by Anderson et al. (2020) is the following: xk Wkxk 1 + bk (Wk Ik) xk 1 + zk bk Wk Ik ˇLk 1 (1 zk) + Wk (1 Ik) ˇUk 1 zk (xk, ˆxk, zk) [lk, uk] [ˆlk, ˆuk] [0, 1] = A k. (20) The difference with respect to Ak as defined in equation (3) exclusively lies in the treatment of pre-activation bounds. While Ak explicitly employs genericˆlk, ˆuk in the constraint set via Mk, A k implicitly sets ˆlk, ˆuk to the value dictated by interval propagation bounds (Gowal et al., 2018) via the constraints in Ik = 0 and Ik = 1 from the exponential family. In fact, setting Ik = 0 and Ik = 1, we obtain the following two constraints: xk ˆxk M k (1 zk) xk M + k zk where: M k := min xk 1 [lk 1,uk 1] W T k xk 1 + bk = Wk ˇLk 1 + bk M + k := max xk 1 [lk 1,uk 1] W T k xk 1 + bk = Wk ˇUk 1 + bk which correspond to the upper bounding Re LU constraints in Mk if we set ˆlk M k , ˆuk M + k . While ˆlk, ˆuk are (potentially) computed solving an optimisation problem over the entire network (problem 19), the optimisation for M k , M + k involves only the layer before the current. Therefore, the constraints in (21) might be much looser than those in Mk. In practice, the effect of ˆlk[i], ˆuk[i] on the resulting set is so significant that Mk might yield better bounds than A k, even on very small networks. We now provide a simple example. F.1 MOTIVATING EXAMPLE Input - 0 Hidden - 1 Hidden - 2 Output - 3 x0[0] x1[0] x2[0] x0[1] x1[1] x2[1] Figure 5: Example network architecture in which Mk A k, with pre-activation bounds computed with Ck = Mk. For the bold nodes (the two hidden layers) a Re LU activation follows the linear function. The numbers between parentheses indicate multiplicative weights, the others additive biases (if any). Figure 5 illustrates the network architecture. The size of the network is the minimal required to reproduce the phenomenon. Mk and Ak coincide for single-neuron layers (Anderson et al., 2020), and ˆlk = M k , ˆuk = M + k on the first hidden layer (hence, a second layer is needed). Let us write the example network as a (not yet relaxed, as in problem (1)) optimization problem for the lower bound on the output node x3. l3 = arg min x,ˆx [2 1] x2 (22a) s.t. x0 [ 1, 1]2 (22b) Published as a conference paper at ICLR 2021 ˆx1 = 1 1 1 1 x1 = max(0, ˆx1) (22c) ˆx2 = 1 2 2 1 x2 = max(0, ˆx2) (22d) x3 = [2 1] x2 (22e) Let us compute pre-activation bounds with Ck = Mk (see problem (19)). For this network, the final output lower bound is tighter if the employed relaxation is Mk rather than Ak (hence, in this case, Mk A k). Specifically: ˆl3,A k = 1.2857, ˆl3,Mk = 1.2273. 
In fact:

• In order to compute l1 and u1, the post-activation bounds of the first layer, it suffices to solve a box-constrained linear program for ˆl1 and ˆu1 (which at this layer coincide with interval propagation bounds) and to clip them to be non-negative. This yields l1 = [0 0]⊤, u1 = [1 3]⊤.

• When computing M+2[1] = max_{x1 ∈ [l1, u1]} [−2 1] x1 = 3, we are assuming that x1[0] = l1[0] and x1[1] = u1[1]. These two assignments are in practice conflicting, as they imply different values of x0. Specifically, x1[1] = u1[1] requires x0 = [u0[0] l0[1]] = [1 −1], but this would also imply x1[0] = u1[0], yielding ˆx2[1] = 1 ≠ 3. Therefore, explicitly solving an LP relaxation of the network for the value of ˆu2[1] will tighten the bound. Using Mk, the LP for this intermediate pre-activation bound is:

ˆu2[1] = max_{x,ˆx,z} [−2 1] x1   (23a)
s.t. x0 ∈ [−1, 1]^2, z1 ∈ [0, 1]^2, x1 ≥ 0,   (23b)
ˆx1 = W1 x0 + b1,   (23c)
x1 ≥ ˆx1,   (23d)
x1 ≤ ˆu1 ⊙ z1 = [1 3]⊤ ⊙ z1,   (23e)
x1 ≤ ˆx1 − ˆl1 ⊙ (1 − z1) = ˆx1 + [3 1]⊤ ⊙ (1 − z1),   (23f)

yielding ˆu2[1] = 2.25 < 3 = M+2[1]. An analogous reasoning holds for M−2[1] and ˆl2[1].

In Mk, we therefore added the following two constraints:

x2[1] ≤ ˆx2[1] − ˆl2[1](1 − z2[1]),   x2[1] ≤ ˆu2[1] z2[1],   (24)

which in A′k correspond to the weaker:

x2[1] ≤ ˆx2[1] − M−2[1](1 − z2[1]),   x2[1] ≤ M+2[1] z2[1].   (25)

As the last-layer weight corresponding to x2[1] is negative (W3[0, 1] = −1), these constraints influence the computation of ˆl3. In fact, the constraints in (24) are both active when optimizing for ˆl3,Mk, whereas their counterparts for ˆl3,A′k in (25) are not. The only active upper-bounding constraint at neuron x2[1] for the Anderson relaxation is x2[1] ≤ x1[1], corresponding to the constraint from A′2 with I2[1, :] = [0 1]. Evidently, its effect is not sufficient to counter-balance the effect of the tighter constraints (24), the Mk counterparts of the masks I2[1, :] = [1 1] and I2[1, :] = [0 0], yielding a weaker lower bound for the network output.

F.2 DERIVATION OF Ak

Having motivated an explicit treatment of pre-activation bounds for the relaxation by Anderson et al. (2020), we now extend the original proof for A′k (equation (20)) to obtain our formulation Ak (as defined in equation (3)). For simplicity, we operate on a single neuron xk[i]. A self-contained way to derive A′k is to apply Fourier-Motzkin elimination to a standard MIP formulation referred to as the multiple choice formulation (Anderson et al., 2019), which is defined as follows:

(xk−1, xk[i]) = (x0k−1, x0k[i]) + (x1k−1, x1k[i]),
x0k[i] = 0 ≥ wi,k⊤ x0k−1 + bk[i](1 − zk[i]),
x1k[i] = wi,k⊤ x1k−1 + bk[i] zk[i] ≥ 0,
lk−1 (1 − zk[i]) ≤ x0k−1 ≤ uk−1 (1 − zk[i]),
lk−1 zk[i] ≤ x1k−1 ≤ uk−1 zk[i],
zk[i] ∈ [0, 1],
} = S′k,i   (26)

where wi,k denotes the i-th row of Wk, and x1k−1 and x0k−1 are copies of the previous layer's variables. Applying (26) to the entire neural network results in a number of variables that is quadratic in the number of neurons. The formulation can be obtained through well-known techniques from the MIP literature (Jeroslow, 1987): it is the union of the two polyhedra associated to a passing and a blocking ReLU, operating in the space of xk−1. Anderson et al. (2019) show that A′k = Proj_{xk−1, xk, zk}(S′k).
If pre-activation bounds ˆlk, ˆuk (computed as described in section E) are available, we can naturally add them to (26) as follows:

(xk−1, xk[i], zk[i]) ∈ S′k,i,
ˆlk[i](1 − zk[i]) ≤ wi,k⊤ x0k−1 + bk[i](1 − zk[i]) ≤ ˆuk[i](1 − zk[i]),
ˆlk[i] zk[i] ≤ wi,k⊤ x1k−1 + bk[i] zk[i] ≤ ˆuk[i] zk[i],
} = Sk,i   (27)

We now prove that this formulation yields Ak when projecting out the copies of the activations.

Proposition. Sets Sk from equation (27) and Ak from problem (3) are equivalent, in the sense that Ak = Proj_{xk−1, xk, zk}(Sk).

Proof. In order to prove the equivalence, we rely on Fourier-Motzkin elimination, as in the original proof of the Anderson relaxation (Anderson et al., 2019). Following the lines of the original proof, we start from (27) and eliminate x1k−1, x0k[i] and x1k[i] by exploiting the equalities. We then re-write all the inequalities as upper or lower bounds on x0k−1[0] in order to eliminate this variable. As in Anderson et al. (2019), we assume wi,k[0] > 0: the proof generalizes by swapping the roles of Ľ and Ǔ for wi,k[0] < 0, whereas if the coefficient is 0 the variable is trivially eliminated. We obtain the following system:

x0k−1[0] = (1/wi,k[0]) ( wi,k⊤ xk−1 − Σ_{j≠0} wi,k[j] x0k−1[j] + bk[i] zk[i] − xk[i] ),   (28a)
x0k−1[0] ≤ −(1/wi,k[0]) ( Σ_{j≠0} wi,k[j] x0k−1[j] + bk[i](1 − zk[i]) ),   (28b)
x0k−1[0] ≤ (1/wi,k[0]) ( wi,k⊤ xk−1 − Σ_{j≠0} wi,k[j] x0k−1[j] + bk[i] zk[i] ),   (28c)
lk−1[0](1 − zk[i]) ≤ x0k−1[0] ≤ uk−1[0](1 − zk[i]),   (28d)
x0k−1[0] ≤ xk−1[0] − lk−1[0] zk[i],   (28e)
x0k−1[0] ≥ xk−1[0] − uk−1[0] zk[i],   (28f)
x0k−1[0] ≤ (1/wi,k[0]) ( wi,k⊤ xk−1 − Σ_{j≠0} wi,k[j] x0k−1[j] + (bk[i] − ˆlk[i]) zk[i] ),   (28g)
x0k−1[0] ≥ (1/wi,k[0]) ( wi,k⊤ xk−1 − Σ_{j≠0} wi,k[j] x0k−1[j] + (bk[i] − ˆuk[i]) zk[i] ),   (28h)
x0k−1[0] ≥ (1/wi,k[0]) ( (ˆlk[i] − bk[i])(1 − zk[i]) − Σ_{j≠0} wi,k[j] x0k−1[j] ),   (28i)
x0k−1[0] ≤ (1/wi,k[0]) ( (ˆuk[i] − bk[i])(1 − zk[i]) − Σ_{j≠0} wi,k[j] x0k−1[j] ),   (28j)

where only inequalities (28g) to (28j) are not present in the original proof. We therefore focus on the part of the Fourier-Motzkin elimination that deals with them, and invite the reader to refer to Anderson et al. (2019) for the others. The combinations among these new inequalities yield trivial constraints. For instance:

(28i) + (28g)  ⟹  ˆlk[i] ≤ wi,k⊤ xk−1 + bk[i] = ˆxk[i],   (29)

which holds by the definition of pre-activation bounds. Let us recall that xk[i] ≥ 0 and xk[i] ≥ ˆxk[i], the latter constraint resulting from (28a) + (28b). It can then be easily verified that the only combinations of interest (i.e., those that do not result in constraints that are obvious by definition or implied by other constraints) are those containing the equality (28a). In particular, combining inequalities (28g) to (28j) with inequalities (28d) to (28f) generates constraints that are (after algebraic manipulations) superfluous with respect to those in (30). We are now ready to show the system resulting from the elimination of x0k−1[0]:

xk[i] ≥ 0,   (30a)
xk[i] ≥ ˆxk[i],   (30b)
xk[i] ≤ wi,k[0] xk−1[0] − wi,k[0] lk−1[0](1 − zk[i]) + Σ_{j≠0} wi,k[j](xk−1[j] − x0k−1[j]) + bk[i] zk[i],   (30c)
xk[i] ≤ wi,k[0] uk−1[0] zk[i] + Σ_{j≠0} wi,k[j](xk−1[j] − x0k−1[j]) + bk[i] zk[i],   (30d)
xk[i] ≥ wi,k[0] xk−1[0] − wi,k[0] uk−1[0](1 − zk[i]) + Σ_{j≠0} wi,k[j](xk−1[j] − x0k−1[j]) + bk[i] zk[i],   (30e)
xk[i] ≥ wi,k[0] lk−1[0] zk[i] + Σ_{j≠0} wi,k[j](xk−1[j] − x0k−1[j]) + bk[i] zk[i],   (30f)
lk−1[0] ≤ xk−1[0] ≤ uk−1[0],   (30g)
xk[i] ≥ ˆlk[i] zk[i],   (30h)
xk[i] ≤ ˆuk[i] zk[i],   (30i)
xk[i] ≤ ˆxk[i] − ˆlk[i](1 − zk[i]),   (30j)
xk[i] ≥ ˆxk[i] − ˆuk[i](1 − zk[i]).   (30k)

Constraints (30a) to (30g) are those resulting from the original derivation of A′k (see Anderson et al. (2019)). The others result from the inclusion of the pre-activation bounds in (27).
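To make the role of the new inequalities concrete, the following display (our own worked step, using the notation above) spells out how one of the combinations with the equality (28a) produces a constraint of Mk: combining (28a) with the lower bound (28i) eliminates x0k−1[0] and yields (30j).

```latex
% Worked step (ours): combining (28a) with (28i) to obtain (30j).
\begin{aligned}
w_{i,k}[0]\, x^0_{k-1}[0]
  &= w_{i,k}^\top x_{k-1} - \sum_{j \neq 0} w_{i,k}[j]\, x^0_{k-1}[j]
     + b_k[i]\, z_k[i] - x_k[i]                                  &&\text{(28a)}\\
w_{i,k}[0]\, x^0_{k-1}[0]
  &\geq \big(\hat{l}_k[i] - b_k[i]\big)(1 - z_k[i])
     - \sum_{j \neq 0} w_{i,k}[j]\, x^0_{k-1}[j]                 &&\text{(28i)}\\
\Rightarrow\;\; x_k[i]
  &\leq w_{i,k}^\top x_{k-1} + b_k[i]\, z_k[i]
     - \big(\hat{l}_k[i] - b_k[i]\big)(1 - z_k[i])
   = \hat{x}_k[i] - \hat{l}_k[i]\,(1 - z_k[i]),                  &&\text{i.e.\ (30j).}
\end{aligned}
```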
Of these, (30h) is implied by (30a) if ˆlk[i] ≤ 0, and by the definition of pre-activation bounds (together with (30b)) if ˆlk[i] > 0. Analogously, (30k) is implied by (30b) if ˆuk[i] ≥ 0, and by (30a) otherwise. By noting that no auxiliary variable is left in (30i) and in (30j), we can conclude that these will not be affected by the remaining part of the elimination procedure. Therefore, the rest of the proof (the elimination of x0k−1[1], x0k−1[2], ...) proceeds as in Anderson et al. (2019), leading to Ak,i. Repeating the proof for each neuron i at layer k, we get Ak = Proj_{xk−1, xk, zk}(Sk).

G MASKED FORWARD AND BACKWARD PASSES

Crucial to the practical efficiency of our solvers is the ability to represent the various operations as standard forward/backward passes over a neural network. This way, we can leverage the engineering effort behind popular deep learning frameworks such as PyTorch (Paszke et al., 2017). While this can be done trivially for the Big-M solver (appendix B), the Active Set method (§3.1) requires a specialised operator that we call the masked forward/backward pass. Here, we provide the details of our implementation. The masked forward and backward passes respectively take the following forms:

(Wk ⊙ Ik) xk,    (Wk ⊙ Ik)⊤ xk+1,

and they are needed when dealing with the exponential family of constraints from the relaxation by Anderson et al. (2020). A crucial property of the operator is that the Ik mask may take a different value for each input/output combination. While this is straightforward to implement for fully connected layers, we need to be more careful when handling convolutional operators, which rely on re-applying the same weights (the kernel) to many different parts of the image. A naive solution is to convert convolutions into equivalent linear operators, but this has a high cost in terms of performance, as it involves much redundancy.

A convolutional operator can be represented via a matrix-matrix multiplication if the input is unfolded and the filter is appropriately reshaped; the multiplication output can then be reshaped to the correct convolutional output shape. Given a filter w ∈ R^{c×k1×k2}, an input x ∈ R^{i1×i2×i3} and the convolutional output convw(x) = y ∈ R^{c×o2×o3}, we need the following definitions:

[·]_{I,O} : I → O,
{·}_j : R^{d1×···×dn} → R^{d1×···×dj−1×dj+1×···×dn},
unfoldw(·) : R^{i1×i2×i3} → R^{k1k2×o2o3},
foldw(·) : R^{k1k2×o2o3} → R^{i1×i2×i3},

where the brackets simply reshape a tensor from shape I to shape O, while the braces sum over the j-th dimension. unfold decomposes the input image into the (possibly overlapping) o2o3 blocks the sliding kernel operates on, taking padding and striding into account; fold brings the output of unfold back to the original input space. Let us define the following reshaped versions of the filter and of the convolutional output:

WR = [w]_{R^{c×k1×k2}, R^{c×k1k2}},    yR = [y]_{R^{c×o2×o3}, R^{c×o2o3}}.

The standard forward/backward convolution (neglecting the convolutional bias, which can be added at the end of the forward pass) can then be written as:

convw(x) = [WR unfoldw(x)]_{R^{c×o2o3}, R^{c×o2×o3}},    back convw(y) = foldw(WR⊤ yR).   (32)

We need to mask the convolution with a different mask for each input-output pair. This means employing a mask I ∈ R^{c×k1k2×o2o3}. Therefore, assuming tensors are broadcast to the correct output shape⁴, we can write the masked forward and backward passes as:

convw,I(x) = [{WR ⊙ I ⊙ unfoldw(x)}2]_{R^{c×o2o3}, R^{c×o2×o3}},    back convw,I(y) = foldw({WR ⊙ I ⊙ yR}1).   (33)

⁴ If we want to perform an element-wise product a ⊙ b between a ∈ R^{d1×d2×d3} and b ∈ R^{d1×d3}, the operation is implicitly performed as a ⊙ b′, where b′ ∈ R^{d1×d2×d3} is an extended version of b obtained by copying along the missing dimension.
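As an illustration, the following PyTorch sketch realises the masked passes in (33) with the library's unfold/fold primitives. It is our own minimal rendering rather than the paper's implementation: a batch dimension is added, biases are omitted (as in (32)), and the mask layout (N, c_out, c_in·k1·k2, o2·o3) is an assumption.

```python
import torch
import torch.nn.functional as F

def masked_conv_forward(w, x, mask, stride=1, padding=0):
    """Masked forward pass, cf. (33): a different binary mask per
    (output channel, output location) is applied to the kernel weights.
    w:    (c_out, c_in, k1, k2) filter
    x:    (N, c_in, i2, i3) input
    mask: (N, c_out, c_in*k1*k2, o2*o3), 1 keeps a weight, 0 drops it
    """
    c_out = w.shape[0]
    x_unf = F.unfold(x, w.shape[-2:], stride=stride, padding=padding)  # (N, c_in*k1*k2, L)
    W_R = w.reshape(c_out, -1)                                         # (c_out, c_in*k1*k2)
    # Broadcast to (N, c_out, c_in*k1*k2, L), mask, then sum over the kernel dimension.
    prod = W_R.unsqueeze(0).unsqueeze(-1) * mask * x_unf.unsqueeze(1)
    out = prod.sum(dim=2)                                              # (N, c_out, L)
    o2 = (x.shape[-2] + 2 * padding - w.shape[-2]) // stride + 1
    o3 = (x.shape[-1] + 2 * padding - w.shape[-1]) // stride + 1
    return out.view(x.shape[0], c_out, o2, o3)

def masked_conv_backward(w, y, mask, input_size, stride=1, padding=0):
    """Masked backward pass, cf. (33): maps y (N, c_out, o2, o3) back to the
    input space of spatial size input_size = (i2, i3), with the same mask."""
    c_out = w.shape[0]
    y_R = y.reshape(y.shape[0], c_out, 1, -1)                          # (N, c_out, 1, L)
    W_R = w.reshape(c_out, -1).unsqueeze(0).unsqueeze(-1)              # (1, c_out, c_in*k1*k2, 1)
    prod = (W_R * mask * y_R).sum(dim=1)                               # (N, c_in*k1*k2, L)
    return F.fold(prod, input_size, w.shape[-2:], stride=stride, padding=padding)
```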
Owing to the avoided redundancy with respect to the equivalent linear operation (e.g., the copying of the kernel matrix and the zero-padding in the linear weight matrix), this implementation of the masked forward/backward pass reduces both the memory footprint and the number of floating-point operations (FLOPs) associated to the pass computations by a factor of (i1 i2 i3)/(k1 k2). In practice, this ratio can be significant: on the incomplete verification networks (§5.1) it ranges from 16 to 64.

H STRATIFIED BOUNDING FOR BRANCH AND BOUND

As seen in the complete verification results in Figure 3 (§5.2), the use of a tighter bounding algorithm results in the verification of a larger number of properties. In general, tighter bounds come at a larger computational cost, which might negatively affect performance on easy verification properties, for which a small number of domain splits with loose bounds suffices to verify the property (hence, the tightness is not needed). As a general complete verification system needs to be efficient a priori, on both easy and hard properties, we employ a stratified bounding system for use within Branch and Bound-like complete verification algorithms.

We design a set of heuristics to determine whether the children of a given subproblem (i.e., the subproblems arising after splitting on it) will require tight bounds or not. The graph describing parent-child relations between subproblems is referred to as the BaB search tree. Given a problem p, we individually mark it as difficult (along with all its future offspring) if it meets all the conditions in the following set (a sketch of the resulting decision rule is given after this list):

• The lower bound on the minimum over the subproblem, pLB, should be relatively close to the decision threshold (0, in the standard form by Bunel et al. (2018)); that is, pLB ≥ −cLB. We argue that tighter bounds are worth computing only if they are likely to result in a crossing of the decision threshold, thus cutting away a part of the BaB search tree.

• The depth pdepth of the subproblem in the BaB search tree should not be below a certain threshold cdepth. This is a way to ensure that a certain number of splits is performed before adopting tighter bounds.

• The running average of the lower bound improvement from parent to child, pimpr, should fall below a threshold cimpr. This reflects the idea that, if splitting seems to be effective on a given section of the BaB tree, it is perhaps wiser to invest the computational budget in more splits (with cheaper bounding computations, more splits can be performed in a given time) than in tighter bounds. Empirically, this is the criterion with the largest effect on performance.

Additionally, we do not employ tighter bounds unless the verification problem itself (in addition to the individual subproblems) is marked as hard. We do so when the number of un-pruned domains reaches a certain threshold chard.

The set of criteria above needs to address the following trade-off: one should postpone the use of tighter bounds until it is clear that the difficulty of the verification task requires them; at the same time, if the switch to tighter bounds is excessively delayed, one might end up computing tight bounds on a large number of subproblems (as sections of the BaB tree were not pruned in time), hurting performance on the harder tasks.
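The decision rule can be summarised as the following sketch; the variable names, the comparison directions, and the function interface are our reading of the criteria above, while the default thresholds are those reported in the next paragraph.

```python
def needs_tight_bounds(p_lb, p_depth, p_impr, n_unpruned,
                       c_lb=0.5, c_depth=0, c_impr=1e-1, c_hard=200):
    """Decide whether the children of a BaB subproblem should use the tighter
    (Active Set) bounding algorithm rather than the cheaper one.

    p_lb:       lower bound on the subproblem minimum (property verified if > 0)
    p_depth:    depth of the subproblem in the BaB search tree
    p_impr:     running average of the parent-to-child lower bound improvement
    n_unpruned: number of currently un-pruned domains for the whole property
    """
    problem_is_hard = n_unpruned >= c_hard      # the property itself looks hard
    close_to_threshold = p_lb >= -c_lb          # tighter bounds may prune this branch
    deep_enough = p_depth >= c_depth            # enough splits already performed
    splits_stalling = p_impr <= c_impr          # cheap splits no longer improving much
    return problem_is_hard and close_to_threshold and deep_enough and splits_stalling
```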
In practice, we set the criteria as follows for all our experiments: cdepth = 0, cLB = 0.5, cimpr = 10^−1, chard = 200. As seen in Figure 3, this resulted in a reasonable trade-off between losing efficiency on the harder properties (wide model) and gaining it on the easier ones (base and deep models).

I EXPERIMENTAL APPENDIX

We conclude the appendix by presenting supplementary experiments and additional details with respect to the presentation in the main paper.

I.1 EXPERIMENTAL SETTING, HYPER-PARAMETERS

All the experiments and bounding computations (including intermediate bounds) were run on a single Nvidia Titan Xp GPU, except for the Gurobi-based methods and Active Set CPU, which were run on i7-6850K CPUs, utilising 4 cores for the incomplete verification experiments and 6 cores for the more demanding complete verification experiments. The experiments were run under Ubuntu 16.04.2 LTS. Complete verification results for ERAN, the method by Singh et al. (2020), are taken from the recent VNN-COMP competition (VNN-COMP, 2020); these were executed by Singh et al. (2020) on a 2.6 GHz Intel Xeon CPU E5-2690 with 512 GB of main memory, utilising 14 cores.

Gurobi-based methods make use of LP incrementalism (warm-starting) when possible. In the experiments of §5.1, where each image involves the computation of 9 different output upper bounds, we warm-start each LP from the LP of the previous neuron. For Gurobi 1 cut, which involves two LPs per neuron, we first solve all Big-M LPs, and then proceed with the LPs containing a single cut.

Hyper-parameter tuning for incomplete verification was done on a small subset of the CIFAR-10 test set, on the SGD-trained network from Figures 1 and 2. BDD+ is run with the hyper-parameters found by Bunel et al. (2020a) on the same datasets, for both incomplete and complete verification. For all supergradient-based methods (Big-M, Active Set), we employed the Adam update rule (Kingma & Ba, 2015), which showed stronger empirical convergence. For Big-M, replicating the findings of Bunel et al. (2020a) for their supergradient method, we linearly decrease the step size from 10^−2 to 10^−4. Active Set is initialized with 500 Big-M iterations, after which the step size is reset and linearly decreased from 10^−3 to 10^−6. We found the addition of variables to the active set to be effective before convergence: we add variables every 450 iterations, without re-scaling the step size again. Every addition consists of 2 new variables (see the pseudo-code in appendix D), which was found to be a good compromise between fast bound improvement and computational cost. For complete verification, we re-employed the same hyper-parameters for both Big-M and Active Set, except for the number of iterations; for Big-M, this was tuned to employ roughly the same time per bounding computation as BDD+.

As the complete verification problem is formulated as a minimisation (as in problem (1)), in Branch and Bound we need a lower and an upper bound on the minimum of each sub-problem. The lower bound is computed by running the bounding algorithm, while the upper bound on the minimum is obtained by evaluating the network at the last primal point output by the bounding algorithm during the lower bound computation (running the bounding algorithm to obtain an upper bound would instead result in a much looser value, as it would imply computing an upper bound on a version of problem (1) with maximisation instead of minimisation).
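As a minimal illustration of this last point (the interface below is an assumption, not our actual code), the upper bound of a BaB subproblem comes essentially for free from the lower bounding pass:

```python
def subproblem_bounds(bounding_algorithm, network, subdomain):
    """Sketch: lower bound from the dual bounding algorithm; upper bound by
    evaluating the network objective of problem (1) at the last primal point
    the solver visited. `bounding_algorithm` is assumed to return that pair."""
    lower_bound, primal_point = bounding_algorithm(network, subdomain)
    upper_bound = network(primal_point)  # feasible value: a valid upper bound on the minimum
    return lower_bound, upper_bound
```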
I.2 DATASET DETAILS

We now provide further details on the employed datasets.

For incomplete verification, the architecture was introduced by Wong & Kolter (2018) and re-employed by Bunel et al. (2020a); it corresponds to the Wide complete verification architecture found in Table 2. Due to the additional computational cost of bounds obtained via the tighter relaxation (3), we restricted the experiments to the first 3450 CIFAR-10 test set images for the SGD-trained network (Figures 1, 2), and to the first 4513 images for the network trained via the method by Madry et al. (2018) (Figures 6, 7).

For complete verification, we employed a subset of the adversarial robustness dataset presented by Lu & Kumar (2020) and used by Bunel et al. (2020a), where the set of properties per network has been restricted to 100. The dataset provides, for a subset of the CIFAR-10 test set, a verification radius ϵver defining the small region over which to look for adversarial examples (input points for which the output of the network is not the correct class) and a (randomly sampled) non-correct class to verify against. The verification problem is formulated as the search for an adversarial example, carried out by minimizing the difference between the ground-truth logit and the target logit: if the minimum is positive, we have not succeeded in finding a counter-example, and the network is robust. The ϵver radius was tuned to meet a certain problem difficulty via binary search, employing a Gurobi-based bounding algorithm (Lu & Kumar, 2020). The networks are robust on all the properties we employed. Three network architectures of different sizes are used: a Base network with 3172 ReLU activations, and two networks with roughly twice as many activations, a Deep network and a Wide network. Details can be found in Table 2.

Table 2: For each complete verification experiment, the network architecture used and the number of verification properties tested, a subset of the dataset by Lu & Kumar (2020). Each layer but the last is followed by a ReLU activation function.

Network Name | No. of Properties | Network Architecture
BASE Model | 100 | Conv2d(3, 8, 4, stride=2, padding=1); Conv2d(8, 16, 4, stride=2, padding=1); linear layer of 100 hidden units; linear layer of 10 hidden units (Total ReLU activation units: 3172)
WIDE Model | 100 | Conv2d(3, 16, 4, stride=2, padding=1); Conv2d(16, 32, 4, stride=2, padding=1); linear layer of 100 hidden units; linear layer of 10 hidden units (Total ReLU activation units: 6244)
DEEP Model | 100 | Conv2d(3, 8, 4, stride=2, padding=1); Conv2d(8, 8, 3, stride=1, padding=1); Conv2d(8, 8, 3, stride=1, padding=1); Conv2d(8, 8, 4, stride=2, padding=1); linear layer of 100 hidden units; linear layer of 10 hidden units (Total ReLU activation units: 6756)
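For illustration, the Base model of Table 2 can be written as the following PyTorch module (the class name is ours); the Wide and Deep models follow the same pattern with the layer sizes listed in the table.

```python
import torch.nn as nn

class BaseModel(nn.Sequential):
    """CIFAR-10 Base model from Table 2: 2048 + 1024 + 100 = 3172 ReLU units."""
    def __init__(self):
        super().__init__(
            nn.Conv2d(3, 8, 4, stride=2, padding=1),   # 8 x 16 x 16 activations
            nn.ReLU(),
            nn.Conv2d(8, 16, 4, stride=2, padding=1),  # 16 x 8 x 8 activations
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 100),
            nn.ReLU(),
            nn.Linear(100, 10),                        # last layer: no ReLU
        )
```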
I.3 ADVERSARIALLY-TRAINED INCOMPLETE VERIFICATION

In addition to the SGD-trained network of §5.1, we now present results for the same architecture, trained with the adversarial training method by Madry et al. (2018) for robustness to perturbations of ϵtrain = 8/255. Each adversarial sample used during training was obtained with 50 steps of projected gradient descent. For this network, we measure the robustness margin to perturbations with ϵver = 12/255.

Figure 6: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from Ak per neuron; higher is better. Results for the network adversarially trained with the method by Madry et al. (2018), from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result. (Methods compared: Gurobi Planet, BDD+ 400 steps, Big-M 850 steps, Active Set 600/1050/1650 steps, Active Set CPU.)

Figure 7: Pointwise comparison for a subset of the methods on the data presented in Figure 6. (a) BDD+ 400 steps against Big-M 850 steps: runtime (left) and gap to the Gurobi Planet bounds (right); for the latter, lower is better. (b) Big-M 850 steps against Active Set: runtime (left) and improvement over the Gurobi Planet bounds (right); for the latter, higher is better. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality.

Figures 6 and 7 confirm most of the observations made for the SGD-trained network in §5.1, with less variability around the bounds returned by Gurobi 1 cut. Big-M is competitive with BDD+, and switching to Active Set after 500 iterations results in much better bounds in the same time. Increasing the computational budget for Active Set still results in better bounds than Gurobi 1 cut in a fraction of its running time, even though the performance gap is on average smaller than on the SGD-trained network.

I.4 SENSITIVITY TO SELECTION CRITERION AND FREQUENCY

In section 3.2, we describe how to iteratively modify B, the active set of dual variables on which our Active Set solver operates. In short, Active Set adds the variables corresponding to the output of oracle (4) invoked at the primal minimiser of LB(x, z, α, βB), at a fixed frequency ω. We now investigate the empirical sensitivity of Active Set to both the selection criterion and the frequency of addition.

We test against Ran. Selection, a version of Active Set for which the variables to add are selected at random, by uniformly sampling from the binary Ik masks. As expected, Figure 8 shows that a good selection criterion is key to the efficiency of Active Set: random variable selection only marginally improves upon the Planet relaxation bounds, whereas the improvement becomes significant with our selection criterion from §3.2.

Figure 8: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from Ak per neuron; higher is better. Results for the SGD-trained network from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result. Sensitivity of Active Set to the selection criterion (see §3.2). (Methods compared: Gurobi Planet, Active Set 600/1050/1650 steps, each against Ran. Selection with the same number of steps.)

Figure 9: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from Ak per neuron; higher is better. Results for the SGD-trained network from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result. Sensitivity of Active Set to the variable addition frequency ω, with the selection criterion presented in §3.2. (Methods compared, matched on time budget: Active Set (ω = 450) with 600/1050/1650 steps, AS ω = 300 with 550/900/1500 steps, AS ω = 600 with 650/1150/1850 steps.)
In addition, we investigate the sensitivity of Active Set (AS) to the variable addition frequency ω. In order to do so, we cap the maximum number of cuts at 7 for all runs, and vary ω while keeping the time budget fixed (we test three different time budgets). Figure 9 compares the results for ω = 450 (Active Set), which were presented in §5.1, with the bounds obtained by setting ω = 300 and ω = 600 (respectively AS ω = 300 and AS ω = 600). Our solver proves to be relatively robust to ω across all three budgets, with the difference in the obtained bounds decreasing with the number of iterations. Moreover, early cut addition tends to yield better bounds in the same time, suggesting that our selection criterion is effective before subproblem convergence.

I.5 MNIST INCOMPLETE VERIFICATION

We conclude the experimental appendix by presenting incomplete verification results on the MNIST dataset (LeCun et al., 1998); the experimental setup mirrors the one employed in section 5.1 and appendix I.3. We report results on the wide MNIST network from Lu & Kumar (2020), whose architecture is identical to the wide network in Table 2 except for the first layer, which has a single input channel to reflect the MNIST specification (the total number of ReLU activation units is 4804). As with the networks employed for the complete verification experiments (§5.2), and differently from the incomplete verification experiments in section 5.1 and appendix I.3, the network was adversarially trained with the method by Wong & Kolter (2018). We compute the robustness margin to ϵver = 0.15 on the first 2960 images of the MNIST test set. All hyper-parameters are kept at the values employed for the CIFAR-10 networks, except the Big-M step size, which was linearly decreased from 10^−1 to 10^−3, and the weight of the proximal terms for BDD+, which was linearly increased from 1 to 50.

Figure 10: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from Ak per neuron; higher is better. MNIST results for a network adversarially trained with the method by Wong & Kolter (2018), from Lu & Kumar (2020). The width at a given value represents the proportion of problems for which this is the result. (Methods compared: Gurobi Planet, BDD+ 400 steps, Big-M 850 steps, Active Set 600/1050/1650/2500 steps.)

Figure 11: Pointwise comparison for a subset of the methods on the data presented in Figure 10: Gurobi 1 cut against Active Set 1650 steps and against Active Set 2500 steps. Comparison of runtime (left) and improvement from the Gurobi Planet bounds (right); for the latter, higher is better. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality.

As seen on the CIFAR-10 networks, Figures 10 and 11 show that Active Set yields comparable or better bounds than Gurobi 1 cut in less average runtime. However, more iterations are required to reach the same relative bound improvement over Gurobi 1 cut (2500 as opposed to 600 in Figures 1 and 6). Furthermore, the smaller average gap between the bounds of Gurobi Planet and Gurobi 1 cut (especially with respect to Figure 1) suggests that the relaxation by Anderson et al.
(2020) is less effective on this MNIST benchmark.