Published as a conference paper at ICLR 2022

SAFE NEUROSYMBOLIC LEARNING WITH DIFFERENTIABLE SYMBOLIC EXECUTION

Chenxi Yang, Swarat Chaudhuri
The University of Texas at Austin

We study the problem of learning worst-case-safe parameters for programs that use neural networks as well as symbolic, human-written code. Such neurosymbolic programs arise in many safety-critical domains. However, because they can use nondifferentiable operations, it is hard to learn their parameters using existing gradient-based approaches to safe learning. Our approach to this problem, Differentiable Symbolic Execution (DSE), samples control flow paths in a program, symbolically constructs worst-case safety losses along these paths, and backpropagates the gradients of these losses through program operations using a generalization of the REINFORCE estimator. We evaluate the method on a mix of synthetic tasks and real-world benchmarks. Our experiments show that DSE significantly outperforms the state-of-the-art DIFFAI method on these tasks. (Our implementation of DSE is available at https://github.com/cxyang1997/DSE.)

1 INTRODUCTION

Safety on worst-case inputs has recently emerged as a key challenge in deep learning research. Formal verification of neural networks (Albarghouthi, 2021) is an established response to this challenge. In particular, a body of recent work (Mirman et al., 2018; Madry et al., 2017; Cohen et al., 2019; Singh et al., 2018) seeks to incorporate formal verification into the training of neural networks. DIFFAI, among the most prominent of such approaches, uses a neural network verifier to construct a differentiable, worst-case safety loss for the learner. This loss is used to regularize a standard data-driven loss, biasing the learner towards parameters that are both performant and safe.

A weakness of these methods is that they only consider functional properties (such as adversarial robustness) of isolated neural networks. By contrast, in real-world applications, neural networks are often embedded within human-written symbolic code, and correctness requirements apply to the neurosymbolic programs (Chaudhuri et al., 2021) obtained by composing the two. For example, consider a car directed by a neural controller (Qin & Badgwell, 2000). Safety properties for the car are functions of its trajectories, and these trajectories depend not just on the controller but also on the symbolic equations that define the environment. While recent work (Christakis et al., 2021) has studied the verification of such neurosymbolic programs, there is no prior work on integrating verification and learning for such systems.

In this paper, we present the first steps towards such an integration. The fundamental technical challenge here is that while a neural network is differentiable in its parameters, the code surrounding it may be non-differentiable or even discontinuous. This makes our problem a poor fit for methods like DIFFAI, which were designed for fully differentiable learning objectives. We solve the problem using a new method, Differentiable Symbolic Execution (DSE), that can estimate gradients of worst-case safety losses of nondifferentiable neurosymbolic programs. Our method approximates the program's worst-case safety loss by an integral over the safety losses along individual control flow paths in the program. We compute the gradients of this integral using a generalization of the classic REINFORCE estimator (Williams, 1992).
The gradient updates resulting from this process balance two goals: (i) choosing the parameters of individual paths so that program trajectories that follow them are safer, and (ii) learning the parameters of conditional branches so that the program is discouraged from entering unsafe paths. The procedure requires a method for sampling paths and computing the safety loss for individual paths. We use a version of symbolic execution (Baldoni et al., 2018) to this end.

1 Example(x):      // x ∈ [−5, 5]
2   y := NNθ(x)
3   if y ≤ 1.0:
4     z := x + 10.0
5   else:
6     z := x − 5.0
7   assert(z ≤ 1)

Loc = {ℓ2, ℓ3, ℓ7}, X = {x, y, z}, l0 = ℓ2
Init = (−5 ≤ x ≤ 5)
Safe(ℓ7) = (z ≤ 1), Safe(ℓ2) = True, Safe(ℓ3) = True
Transθ = {(ℓ2, True, ⟨y := NNθ(x)⟩, ℓ3), (ℓ3, (y ≤ 1.0), ⟨z := x + 10.0⟩, ℓ7), (ℓ3, (y > 1.0), ⟨z := x − 5.0⟩, ℓ7)}

Figure 1: (Left) An example program. NNθ is a neural network with parameters θ. DIFFAI fails to learn safe parameters for this program. (Right) The program as an STS. Each location ℓi corresponds to line i in the program. For brevity, the updates only depict the variable that changes value.

We evaluate DSE through several problems from the embedded control and navigation domains, as well as several synthetic tasks. Our baselines include an extended version of DIFFAI, the current state of the art, and an ablation that does not use an explicit safety loss. Our experiments show that DSE significantly outperforms the baselines in finding safe and high-performance model parameters.

To summarize, our main contributions are:

- We present the first approach to worst-case-safe parameter learning for neural networks embedded within nondifferentiable, symbolic programs.
- As part of our learning algorithm, we give a new way to bring together symbolic execution and stochastic gradient estimators that might have applications outside our immediate task.
- We present experiments that show that DSE can handle programs with complex structure and also outperforms DIFFAI on our problem.

2 PROBLEM FORMULATION

Programs. We define programs with embedded neural networks as symbolic transition systems (STS) (Manna & Pnueli, 2012). Formally, a program Fθ is a tuple (Loc, X, l0, Init, Safe, Transθ). Here, Loc is a finite set of (control) locations, X = {x1, . . . , xm} is a set of real-valued variables, and l0 ∈ Loc is the initial location. Init, a constraint over X, is an initial condition for the program. Safe is a map from locations to constraints over X; intuitively, Safe(l) is a safety requirement asserted at location l. Finally, Transθ is a transition relation consisting of transitions (l, G, Uθ, l′) such that: (i) l is the source location and l′ is the destination location; (ii) the guard G is a constraint over X; and (iii) the update Uθ is a vector ⟨U1,θ, . . . , Um,θ⟩, where each Ui,θ is a real-valued expression over X constructed using standard symbolic operators and neural networks with parameters θ. Intuitively, Ui,θ represents the update to the i-th variable.

We assume that each Ui,θ is differentiable in θ. We also assume that each guard G is of the form xi ▷◁ 0, where ▷◁ ∈ {<, >, ≤, ≥}; note that this is not a significant restriction, as xi can be made to store the value of a complex expression using an update. Finally, we assume that programs are deterministic. That is, if G and G′ are guards for two distinct transitions from the same source location, then G ∧ G′ is unsatisfiable.
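As a concrete, purely illustrative reading of this formalism, the following Python sketch encodes the Figure 1 program as an STS. It is not the paper's implementation; the `Transition` type, the stand-in network `nn_theta`, and the dictionary-based state representation are our own assumptions.

```python
# Minimal sketch (not the authors' code) of the Figure 1 program encoded as a
# symbolic transition system (Loc, X, l0, Init, Safe, Trans_theta). Guards and
# updates are plain Python callables over a variable dictionary.

from dataclasses import dataclass
from typing import Callable, Dict
import torch
import torch.nn as nn

State = Dict[str, torch.Tensor]            # an assignment v to X = {x, y, z}

@dataclass
class Transition:
    src: str                               # source location l
    guard: Callable[[State], bool]         # constraint G over X
    update: Callable[[State], State]       # update U_theta (differentiable in theta)
    dst: str                               # destination location l'

nn_theta = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

trans_theta = [
    Transition("l2", lambda v: True,
               lambda v: {**v, "y": nn_theta(v["x"].view(1))[0]}, "l3"),
    Transition("l3", lambda v: v["y"] <= 1.0,
               lambda v: {**v, "z": v["x"] + 10.0}, "l7"),
    Transition("l3", lambda v: v["y"] > 1.0,
               lambda v: {**v, "z": v["x"] - 5.0}, "l7"),
]

init = lambda v: -5.0 <= v["x"] <= 5.0     # Init
safe = {"l7": lambda v: v["z"] <= 1.0}     # Safe(l7); Safe is True elsewhere

# Example: take the unique enabled transition at l3 for a concrete state.
v = {"x": torch.tensor(3.0), "y": torch.tensor(0.2), "z": torch.tensor(0.0)}
next_v = next(t for t in trans_theta if t.src == "l3" and t.guard(v)).update(v)
```

Because guards and updates are ordinary callables, the determinism requirement is directly visible in the two mutually exclusive transitions out of location l3.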
Programs in higher-level languages can be translated to the STS notation in a standard way. For example, Figure 1 (left) shows a simple high-level program. The STS for this program appears in Figure 1 (right). While the program is simple, the state-of-the-art DIFFAI approach to verified learning fails to learn safe parameters for it.

Safety Semantics. In classical formal methods, a program is considered safe if all of its executions satisfy a safety constraint. However, in learning settings, it helps to know not only whether a program is unsafe, but also the extent to which it is unsafe. Consequently, we define the safety semantics of programs in terms of a (worst-case) safety loss that quantifies a program's safeness.

Formally, let a state of Fθ be a pair s = (l, v), where l is a location and v ∈ R^m is an assignment of values to the variables (i.e., v(i) is the value of xi). Such a state is said to be at location l. For boolean constraints B and assignments v to the variables in B, let us write B(v) if v satisfies B. A state (l0, v), where Init(v) holds, is an initial state. A state (l, v) is safe if (Safe(l))(v).

Let v ∈ R^m be an assignment to the variables. For a real-valued expression E over X, let E(v) be the value of E when each xi is substituted by v(i). For an update U = ⟨U1, . . . , Um⟩, we define U(v) as the assignment ⟨U1(v), . . . , Um(v)⟩. A length-n trajectory of Fθ is a sequence τ = ⟨s0, . . . , sn⟩, with si = (li, vi), such that: (i) s0 is an initial state; and (ii) for each i, there is a transition (li, G, U, li+1) such that G(vi) and vi+1 = U(vi). Let us fix a size bound N for trajectories. A trajectory τ = ⟨s0, . . . , sn⟩ is maximal if it has length N, or if there is no trajectory τ′ = ⟨s0, . . . , sn, sn+1⟩ of length at most N. Because our programs are deterministic, there is a unique maximal trajectory from each s0. We denote this trajectory by τ(s0).

Let us assume a real-valued loss Unsafe(s) that quantifies the unsafeness of each state s. We require Unsafe(s) = 0 if s is safe and Unsafe(s) > 0 otherwise. We lift this measure to trajectories τ by letting Unsafe(τ) = Σ_{s ∈ τ} Unsafe(s). The safety loss C(θ) for Fθ is now defined as:

C(\theta) \;=\; \max_{s \,\text{an initial state}} \mathit{Unsafe}(\tau(s)). \qquad (1)

Thus, C(θ) = 0 if and only if all program trajectories are safe.

Problem Statement. Our learning problem formalizes a setting in which we have training data for neural networks inside a program Fθ. While training the networks with respect to this data, we must ensure that the overall program satisfies its safety requirements. To ensure that the parameters of the different neural networks in Fθ are not incorrectly entangled, we assume in this exposition that only one of these networks, NNθ, has trainable parameters. (Some of our experimental benchmarks have k > 1 trainable neural modules. In these tasks, each neural module comes with its own training data; thus, in effect, we have k instances of the learning problem. Our implementation learns the modules simultaneously by interleaving their gradient updates. However, the gradient of Q for each module is only influenced by its own data.) We expect as input a training set of i.i.d. samples from an unknown distribution over the inputs and outputs of NNθ, and a differentiable data loss Q(θ) that quantifies the network's fidelity to this training set. Our learning goal is to solve the following constrained optimization problem:

\min_\theta Q(\theta) \;\; \text{s.t.} \;\; C(\theta) \le 0. \qquad (2)
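To make the safety semantics concrete, here is a small Python sketch (ours, not the authors' code) that estimates the worst-case loss C(θ) of Equation (1) for the running example by brute-force sampling of initial states; `run_program`, the safe interval `SAFE_Z`, and the sample count are illustrative assumptions. Note that such sampling only lower-bounds the maximum, which is one reason DSE instead reasons about sets of initial states symbolically.

```python
# Minimal sketch (not the authors' code): estimating the worst-case safety
# loss C(theta) of Equation (1) by sampling initial states of the example.

import random

SAFE_Z = (-float("inf"), 1.0)   # Safe(l7): z <= 1 in the running example

def unsafe_state(z):
    """Unsafe(s): 0 inside the safe set, distance to it otherwise."""
    lo, hi = SAFE_Z
    if lo <= z <= hi:
        return 0.0
    return min(abs(z - lo), abs(z - hi))

def unsafe_trajectory(zs):
    """Unsafe(tau): sum of per-state losses along a trajectory."""
    return sum(unsafe_state(z) for z in zs)

def estimate_worst_case_loss(run_program, n_samples=10_000):
    """Approximates C(theta) = max over initial states of Unsafe(tau(s)).
    `run_program` maps an initial x to the sequence of z-values it produces."""
    worst = 0.0
    for _ in range(n_samples):
        x0 = random.uniform(-5.0, 5.0)          # Init: x in [-5, 5]
        worst = max(worst, unsafe_trajectory(run_program(x0)))
    return worst
```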
3 LEARNING ALGORITHM

The learning approach in DSE is based on two ideas. First, we directly apply a recently developed equivalence between constrained and regularized learning (Agarwal et al., 2018; Le et al., 2019) to reduce Equation (2) to a series of unconstrained optimization problems. Second, we use the novel technique of differentiating through a symbolic executor to solve these unconstrained problems.

At the highest level, we convexify the class of programs Fθ by allowing stochastic combinations of programs. That is, we now allow programs of the form F_θ̂ = Σ_{t=1}^{T} α_t F_{θ_t}, where Σ_t α_t = 1. To execute the program F_θ̂ from a given initial state, we sample a specific program F_{θ_t} from the distribution (α_1, . . . , α_T) and then execute it from that state. Equation (2) can now be rewritten as the problem max_{λ ∈ R+} min_θ̂ Q(θ̂) + λ C(θ̂), which in turn can be solved using a classic algorithm (Freund & Schapire, 1999) for computing equilibria in a two-player game. See Appendix A.1 for more details on this algorithm.

A key feature of our high-level algorithm is that it repeatedly solves the optimization problem min_θ Q(θ) + λ C(θ) for fixed values of λ. This problem is challenging because while Q(θ) is differentiable in θ, C(θ) depends on the entirety of Fθ and may not even be continuous. DSE, our main contribution, addresses this challenge by constructing a differentiable approximation C#(θ) of C(θ). We present the details of this method in the next section.

4 DIFFERENTIATING THROUGH A SYMBOLIC EXECUTOR

Background on Symbolic Execution. DSE uses a refinement of symbolic execution (Baldoni et al., 2018), a classic technique for systematic formal analysis of programs. Symbolic execution has recently been used to analyze the safety of neural networks (Gopinath et al., 2018). However, we are not aware of any prior attempt to incorporate symbolic execution into neural network training.

A symbolic executor systematically searches the set of symbolic trajectories of programs, which we now define. Consider a program Fθ = (Loc, X = {x1, . . . , xm}, l0, Init, Safe, Transθ) as in Section 2. First, we fix a syntactically restricted class 𝒱θ of boolean constraints over X and θ. 𝒱θ is required to be closed under conjunction and must include all expressions that can appear as a guard in Fθ. In particular, in our implementation, 𝒱θ consists of formulas that specify closed or open intervals, with bounds given by expressions over parameters, in which each variable lies.

A symbolic state of Fθ is a pair σθ = (l, Vθ), where l is a location and Vθ ∈ 𝒱θ. Intuitively, σθ represents the set of states of Fθ that are at location l and satisfy the property Vθ. We designate a finite set of initial symbolic states (l0, V_0^1), . . . , (l0, V_0^p). It is required that Init ⟹ ⋁_i V_0^i. Intuitively, the initial symbolic states cover all the possible initial states of the program Fθ.

We also construct an overapproximate update U_θ^# : 𝒱θ → 𝒱θ for each update Uθ in Fθ. For all V ∈ 𝒱θ, we have Post(V) ⟹ U_θ^#(V), where Post(V) is the formula ∃v′. V(v′) ∧ (Uθ(v′) = v). Intuitively, U_θ^#(V) is an overapproximate representation of the set of states reached by applying Uθ to a state that satisfies V.

Let us call a transition t = (l, G, U, l′) enabled at a symbolic state (l, V) if (G ∧ V) is satisfiable. A symbolic trajectory of Fθ is a sequence τ_θ^# = ⟨σ0, . . . , σn⟩ with the following properties: σ0 is an initial symbolic state. Let σi = (li, Vi) and σi+1 = (li+1, Vi+1).
Then there is a transition t = (li, Gi, Ui, li+1) such that: (i) t is enabled at σi, and (ii) Vi+1 = U_θ^#(Gi ∧ Vi). In this case, we write σi+1 = t(σi). Intuitively, τ_θ^# represents an overapproximation of the set of concrete trajectories of Fθ that pass through the control path ⟨l0, . . . , ln⟩. The above symbolic trajectory is safe if for all i, Vi ⟹ Safe(li). Intuitively, in this case, all concrete trajectories that the symbolic trajectory represents follow the program's safety requirements.

Example (Cont.) Consider the example program in Figure 1. Let us assume a single initial symbolic state where x ∈ [−5, 5], and suppose NNθ(x) ∈ [−2, 2] when x ∈ [−5, 5]. The program has two symbolic trajectories from the initial symbolic state:

τ_1^# = ⟨(ℓ2, x ∈ [−5, 5]), (ℓ3, x ∈ [−5, 5] ∧ y ∈ [−2, 2]), (ℓ7, x ∈ [−5, 5] ∧ y ∈ [−2, 1] ∧ z ∈ [5, 15])⟩
τ_2^# = ⟨(ℓ2, x ∈ [−5, 5]), (ℓ3, x ∈ [−5, 5] ∧ y ∈ [−2, 2]), (ℓ7, x ∈ [−5, 5] ∧ y ∈ (1, 2] ∧ z ∈ [−10, 0])⟩.

We note that only τ_2^# is safe.

Probabilistic Symbolic Execution. A key difficulty with using symbolic execution to estimate the safety loss C(θ) is that our programs need not be differentiable, or even continuous. To understand the issues here, consider our running example once again. Here, the parameter θ is used to calculate the variable y, which is used in a discontinuous conditional statement that assigns very different values to the variable z in its two branches. Additionally, a slight change in θ can sometimes cause the guard G of the conditional to go from being True on some values of y to being False on all values of y. In the second case, the updated θ would have one, rather than two, symbolic trajectories. If the symbolic trajectory τ# in which G is True serves as an edge case to judge whether a program is worst-case-safe, learning safe parameters would be difficult, as there would be no gradient guiding the optimizer back to the case in which G is True.

DSE overcomes these difficulties using a probabilistic approach to symbolic execution. Specifically, at a symbolic state σi = (li, Vi), the symbolic executor in DSE samples its next action following a probability distribution pθ(t | σi), where t ranges over program transitions or a special action Stop. For a boolean expression Vθ over the program variables x1, . . . , xm and parameters θ, let Vol(Vθ) denote the volume of the set of assignments to X that satisfy Vθ. We assume a procedure to compute Vol(Vθ) for any Vθ. (Note that such a procedure is easy to implement when every Vθ is an interval, as is the case in our implementation.) We define:

- If there is a unique program transition t that is enabled at σi, then pθ(t | σi) = 1.
- If there is no transition t that is enabled at σi, then pθ(Stop | σi) = 1.
- Otherwise, let t1, . . . , tk be the transitions that are enabled at σi, with tj = (li, Gj, Uj, l′j). Then

p_\theta(t_j \mid \sigma_i) \;=\; \frac{\mathrm{Vol}(G_j \wedge V_i)}{\sum_{k'=1}^{k} \mathrm{Vol}(G_{k'} \wedge V_i)}.

We also define a volume-weighted probability distribution p(σ0) over the (finite set of) initial symbolic states σ_0^i. Specifically, for each σ_0^i = (l0, V_0^i), we have

p(\sigma_0^i) \;=\; \frac{\mathrm{Vol}(V_0^i)}{\sum_j \mathrm{Vol}(V_0^j)}. \qquad (3)

DSE uses the distributions pθ(t | σi) and p(σ0) to sample symbolic trajectories. Let τ_θ^# = ⟨σ0, . . . , σn⟩, with σi+1 = ti(σi). The probability of sampling τ_θ^# is

p_\theta(\tau_\theta^\#) \;=\; p(\sigma_0) \prod_i p_\theta(t_i \mid \sigma_{i-1}). \qquad (4)

We note that pθ is differentiable in θ.
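The snippet below is a minimal PyTorch sketch (not taken from the DSE implementation) of this volume-weighted branch distribution for a single guard of the form y ≤ c; `branch_probs` and the encoding of one-sided guards by a large constant are our own simplifications. On the running example with y ∈ [−2, 2] and c = 1 it reproduces pθ(t2 | σ2) = 0.75, and, because the volumes are differentiable functions of the (θ-dependent) interval bounds, so are the resulting probabilities.

```python
import torch

def interval_meet(lo, hi, g_lo, g_hi):
    """Intersect the interval [lo, hi] with the guard interval [g_lo, g_hi]."""
    return torch.maximum(lo, g_lo), torch.minimum(hi, g_hi)

def volume(lo, hi):
    """Length of a 1-D interval; zero if empty. Differentiable in lo and hi."""
    return torch.clamp(hi - lo, min=0.0)

def branch_probs(y_lo, y_hi, c=1.0, big=1e9):
    """p_theta of the branches guarded by (y <= c) and (y > c) at a symbolic
    state whose y-component is the interval [y_lo, y_hi]."""
    c, big = torch.tensor(c), torch.tensor(big)
    vol_then = volume(*interval_meet(y_lo, y_hi, -big, c))
    vol_else = volume(*interval_meet(y_lo, y_hi, c, big))
    total = vol_then + vol_else
    return vol_then / total, vol_else / total

# Running example: y in [-2, 2] gives p(then) = 0.75, p(else) = 0.25.
p_then, p_else = branch_probs(torch.tensor(-2.0), torch.tensor(2.0))
```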
Approximate Safety Loss. A key decision in DSE is to approximate the overall worst-case loss C(θ) by the expectation of a differentiable safety loss computed per sampled symbolic trajectory. Recall the safety loss Unsafe(s) for individual states. For σ = (l, V), we define Unsafeθ(σ) as a differentiable approximation of the worst-case loss max_{s satisfies V} Unsafe(s). We lift this loss to symbolic trajectories τ_θ^# = ⟨σ0, . . . , σn⟩ by defining Unsafeθ(τ#) = Σ_i Unsafeθ(σi). We observe that Unsafeθ is differentiable in θ. Our approximation to the safety loss is now given by:

C^\#(\theta) \;=\; \mathbb{E}_{\tau^\# \sim p_\theta(\tau^\#)}\big[\mathit{Unsafe}_\theta(\tau^\#)\big]. \qquad (5)

Intuitively, C#(θ) guides the learner towards either making the unsafe trajectories safer or lowering the probability of the program falling into unsafe trajectories. If C#(θ) equals zero, the program is provably safe. That said, in practice, we must estimate the expectation in Equation (5) using sampling. As this sampling process may miss trajectories, a low empirical estimate of C#(θ) need not guarantee worst-case safety. As an extreme example, suppose a program has one unsafe symbolic trajectory, but the probability of this symbolic trajectory is near-zero. This trajectory is likely to be missed by the sampling process, leading to an artificially low value of C#(θ).

Example (Cont.) Consider the trajectories τ_1^# and τ_2^# in our running example. We have p(τ_1^#) = p(t2 | σ2), as the other transitions in τ_1^# have probability 1. We compute p(t2 | σ2) = Vol([−2, 1]) / Vol([−2, 2]) = 0.75. Thus, p(τ_1^#) = 0.75 and, similarly, p(τ_2^#) = 0.25.

Gradient Estimation. Our ultimate goal is to compute the gradient ∇θ C#(θ) of the approximate safety loss. At first sight, the classic REINFORCE estimator (Williams, 1992) or one of its relatives (Schulman et al., 2015) seems suitable for this task, given that they can differentiate an integral over sampled "paths". However, a subtlety in our setting is that the loss for an individual path (symbolic trajectory) is itself a function of θ. This calls for a generalization of traditional REINFORCE-like estimators. More precisely, we adapt the derivation of REINFORCE to our setting as follows:

\nabla_\theta C^\#(\theta) = \nabla_\theta \, \mathbb{E}_{\tau^\# \sim p_\theta(\tau^\#)}\big[\mathit{Unsafe}_\theta(\tau^\#)\big]
\;=\; \nabla_\theta \int_{\tau^\#} p_\theta(\tau^\#)\, \mathit{Unsafe}_\theta(\tau^\#)\, d\tau^\#
\;=\; \int_{\tau^\#} \Big( p_\theta(\tau^\#)\, \nabla_\theta \mathit{Unsafe}_\theta(\tau^\#) + \mathit{Unsafe}_\theta(\tau^\#)\, \nabla_\theta p_\theta(\tau^\#) \Big)\, d\tau^\#
\;=\; \mathbb{E}_{\tau^\# \sim p_\theta(\tau^\#)}\big[\nabla_\theta \mathit{Unsafe}_\theta(\tau^\#)\big] \;+\; \mathbb{E}_{\tau^\# \sim p_\theta(\tau^\#)}\big[\mathit{Unsafe}_\theta(\tau^\#)\, \nabla_\theta \log p_\theta(\tau^\#)\big].

The above derivation can be further refined to incorporate the variance reduction techniques for REINFORCE that are commonly employed in generative modeling and reinforcement learning. The use of such techniques is orthogonal to our main contribution, and we ignore them in this paper.
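This estimator can be implemented with automatic differentiation by backpropagating through a surrogate objective. The PyTorch sketch below is our own illustration, not the released DSE code; `sample_symbolic_trajectory` is a hypothetical routine that returns, for one sampled symbolic trajectory, the differentiable loss Unsafeθ(τ#) and log pθ(τ#). Differentiating the surrogate reproduces the two expectation terms: the first summand yields E[∇θ Unsafeθ(τ#)], and detaching the loss in the second summand leaves exactly E[Unsafeθ(τ#) ∇θ log pθ(τ#)].

```python
# Minimal PyTorch sketch (not the authors' implementation) of the generalized
# REINFORCE estimator derived above.

import torch

def safety_loss_surrogate(sample_symbolic_trajectory, n_samples=50):
    """Returns a scalar whose gradient w.r.t. theta matches
    E[grad Unsafe] + E[Unsafe * grad log p]."""
    terms = []
    for _ in range(n_samples):
        unsafe, logp = sample_symbolic_trajectory()   # both depend on theta
        # First term: pathwise gradient of the trajectory loss itself.
        # Second term: score-function term; the loss is detached so that only
        # grad(log p) flows through it.
        terms.append(unsafe + unsafe.detach() * logp)
    return torch.stack(terms).mean()

# Usage (illustrative): loss = Q_theta + lam * safety_loss_surrogate(sampler)
# loss.backward(); optimizer.step()
```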
5 EVALUATION

Our experimental evaluation seeks to answer the following research questions:

(RQ1): Is DSE effective in learning safe and performant parameters?
(RQ2): How does data size influence DSE and the baselines?
(RQ3): How well does DSE scale as the program being learned gets more complex?

5.1 EXPERIMENTAL SETUP

System Setup. Our framework is built on top of PyTorch (Paszke et al., 2019). We use the Adam optimizer (Kingma & Ba, 2014) for all the experiments with default parameters and a weight decay of 0.000001. We ran all the experiments using a single-thread implementation on a Linux system with Intel Xeon Gold 5218 2.30GHz CPUs and GeForce RTX 2080 Ti GPUs. (Please refer to Appendix A.7 for more training details.)

Figure 2: Results (provably safe portion) of DSE and DIFFAI+ on the synthetic microbenchmarks.

Template   Model     DIFFAI+   DSE
Pattern1   NNSmall   0.0       1.0
Pattern1   NNMed     0.0       1.0
Pattern1   NNBig     0.0       1.0
Pattern2   NNSmall   0.0       0.70
Pattern2   NNMed     0.0       0.78
Pattern2   NNBig     0.0       0.73
Pattern3   NNSmall   1.0       1.0
Pattern3   NNMed     1.0       1.0
Pattern3   NNBig     1.0       1.0
Pattern4   NNSmall   0.0       0.76
Pattern4   NNMed     0.0       0.80
Pattern4   NNBig     0.58      0.97
Pattern5   NNSmall   0.0       1.0
Pattern5   NNMed     0.0       1.0
Pattern5   NNBig     0.0       1.0

Baselines. We use two types of baselines: (i) Ablation, which ignores the safety constraint and learns only from data; (ii) DIFFAI+, an extended version of the original DIFFAI method (Mirman et al., 2018). DIFFAI does not directly fit our setting of neurosymbolic programs, as it does not handle general conditional statements. DIFFAI+ extends DIFFAI by adding the meet and join operations following the abstract interpretation framework (Cousot & Cousot, 1977). Also, we allow DIFFAI+ to split the input set into several (100 if not specified) initial symbolic states to increase the precision of the symbolic execution of each trajectory. (The original DIFFAI does not need to perform such splits because, being focused on adversarial robustness, it only propagates small regions around each input point through the model.) However, the parts of DIFFAI that propagate losses and their gradients through neural networks remain the same in DIFFAI+. We use the box (interval) domain for both DSE and DIFFAI+ in all the experiments. DSE samples 50 symbolic trajectories for each starting component.

Benchmarks. Our evaluation uses 5 synthetic microbenchmarks and 4 case studies. The microbenchmarks (Appendix A.5) are small programs consisting of basic neural network modules, arithmetic assignments, and conditional branches. These programs are designed to highlight the relationship between DIFFAI+ and DSE. The case studies, described below, model real-world control and navigation tasks (see Appendix A.4 for more details):

- In Thermostat, we want to learn safe parameters for two neural networks that control the temperature of a room.
- Racetrack is a navigation benchmark (Barto et al., 1995; Christakis et al., 2021). Two vehicles are trained by path planners separately. The racetrack system is expected to learn two safe controllers so that the vehicles do not crash into walls or into each other.
- In Aircraft-Collision (AC), the goal is to learn safe parameters for an airplane that performs maneuvers to avoid a collision with a second plane.
- Cartpole aims to train a cart controller that imitates demonstrations while keeping the cart within a safe position range.

Figure 3: The provably safe portion (panels a-d: Thermostat, AC, Racetrack, Cartpole) and the test data loss (panels e-h: Thermostat, AC, Racetrack, Cartpole) of Ablation, DIFFAI+, and DSE when varying the size of the training data.

For each benchmark, we first write a ground-truth program that does not include neural networks and provably satisfies a safety requirement. We identify certain modules of these programs as candidates for learning. Then we replace the modules with neural networks with trainable parameters. For all benchmarks with the exception of Racetrack, we execute the ground-truth programs on uniformly sampled inputs to collect input-output datasets for each module that is replaced by a neural network.
As for Racetrack, this benchmark involves two simulated vehicles with neural controllers; learned parameters for these controllers are safe if the vehicles do not crash into each other or a wall. To model real-world navigation, we generate the training data for the controllers from a ground-truth trajectory constructed using a path planner. This path planner only tries to avoid the walls in the map and does not have a representation of distances from other vehicles. Thus, in this problem, even a very large data set does not contain all the information needed for learning safe parameters.

Evaluation of Safety and Data Loss. Once training is over, we evaluate the learned program's test data loss by running it on 10000 initial states that were not seen during training. We evaluate the safety loss using an abstract interpreter (Cousot & Cousot, 1977) that splits the initial condition into a certain number of boxes (204 for Cartpole, 10000 for the other benchmarks), then constructs symbolic trajectories from these boxes. This analysis is sound (unlike the approximate loss employed during DSE training), meaning that the program is provably worst-case safe if the safety loss evaluates to 0. Our safety metric is the provably safe portion, which is the fraction of the program's symbolic trajectories that are worst-case safe.
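As an illustration of this metric (our own sketch, not the paper's evaluator), the following Python code computes the provably safe portion for the Figure 1 example by splitting Init into boxes and propagating each through a hand-rolled interval analysis of the program; `nn_output_bounds`, which is assumed to return sound interval bounds on NNθ over an input box, stands in for a general abstract interpreter.

```python
# Minimal sketch (not the paper's evaluator) of the "provably safe portion"
# metric for the Figure 1 example. `nn_output_bounds` is a hypothetical
# routine returning sound lower/upper bounds on NN_theta over an input box.

def provably_safe_portion(nn_output_bounds, n_boxes=100):
    safe_boxes = 0
    lo, hi = -5.0, 5.0                       # Init: x in [-5, 5]
    width = (hi - lo) / n_boxes
    for k in range(n_boxes):
        x_lo, x_hi = lo + k * width, lo + (k + 1) * width
        y_lo, y_hi = nn_output_bounds(x_lo, x_hi)
        safe = True
        if y_lo <= 1.0:                      # 'then' branch (y <= 1) reachable
            safe = safe and (x_hi + 10.0) <= 1.0   # assert z = x + 10 <= 1
        if y_hi > 1.0:                       # 'else' branch (y > 1) reachable
            safe = safe and (x_hi - 5.0) <= 1.0    # assert z = x - 5 <= 1
        safe_boxes += int(safe)
    return safe_boxes / n_boxes

# Example: if NN_theta were the identity, only boxes that definitely take the
# 'else' branch (and so end with z = x - 5 <= 1) count as provably safe.
print(provably_safe_portion(lambda a, b: (a, b)))
```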
5.2 RESULTS

RQ1: Overall Quality of Learned Parameters. Figure 2 shows the provably safe portion achieved by DIFFAI+ and DSE on our microbenchmarks. Here, the first two patterns only use neural modules to evaluate the guards of conditional branches, and the last three patterns use neural modules in the guards as well as in updates (assignments) that directly affect the program's safety. We see that DIFFAI+ cannot learn safe parameters for highly non-differentiable modules (pattern1, pattern2), while DSE can. For the cases where optimization across branches is not required to obtain safe programs, DIFFAI+ can handle them (pattern3). For the patterns with neural assignments, DIFFAI+ finds it hard to learn to jump from one conditional branch to another to increase safety (pattern4, pattern5). We refer the reader to Appendix A.9 for a more detailed pattern analysis.

We show the training-time and test-time safety and data losses for our more complex benchmarks in Figure 3. Figures 3a-3d show that DSE can learn programs with a 0.8 provably safe portion even with 200 data points, and these results hold as the number of training data points is increased. Meanwhile, both DIFFAI+ and Ablation fail to provide safe programs for AC, Racetrack, and Cartpole. For Thermostat, Ablation and DIFFAI+ can achieve a provably safe portion of 0.8 only when using 5000 and 10000 data points to train.

Figure 4: Trajectories during training. Each row exhibits the concrete and symbolic trajectories of one case under the different methods (columns: Ablation final, DIFFAI+ final, DSE middle, DSE final). From top to bottom, the cases are Thermostat (Figures 4a-4d), AC (Figures 4e-4h), Racetrack with the position property (Figures 4i-4l) and the distance property (Figures 4m-4p), and Cartpole (Figures 4q-4t). Each panel shows the trajectories (concrete: red; symbolic: green rhombus) of programs learned by the method at different training stages, denoted "Middle" and "Final". We split the input state set evenly into 100 components (81 components for Cartpole) to plot the symbolic trajectories clearly. During evaluation, we split the input set evenly into 10000 components (204 components for Cartpole) to obtain a more accurate symbolic-trajectory measurement.

Our test data loss is sometimes larger than, yet comparable with, that of Ablation and DIFFAI+ (Figures 3e, 3g). Figure 3h exhibits the tension between achieving a good data loss and safety: the test data loss of DSE increases as the provably safe portion becomes larger. For AC specifically, the safety constraint can help the learner escape locally optimal yet unsafe regions and reach a safer result with more accurate behavior.

Overall, we find that DSE outperforms DIFFAI+ when neural modules are used in conditional statements and the interaction between neural modules is important for the safety property. This is the case in our larger benchmarks. For example, in Thermostat and AC, the neural modules' outputs decide the actions that the modules take in subsequent steps. In Racetrack, the distance between the two neurally controlled vehicles constrains each step of the vehicles' controllers. In Cartpole, the neural module outputs the force applied to the cart.

We display representative symbolic trajectories from the programs learned by Ablation, DIFFAI+, and DSE in Figure 4. A larger portion of the symbolic trajectories from DSE is provably safe than from the baselines, as indicated by the smaller overlap between the green trajectories and the gray area. Because the symbolic trajectories are overapproximate, we also depict 100 randomly sampled representative concrete trajectories for each approach. We note that more concrete trajectories from DSE fall into the safe area for Thermostat, AC, the distance property of Racetrack, and Cartpole, compared with DIFFAI+ and Ablation.

RQ2: Impact of Data Size. Figure 3 compares the different methods' performance on the test metrics as the number of training data points is changed. Ablation and DIFFAI+ can reach a 0.95 provably safe portion with 10000 data points for Thermostat. However, the variance of the results is large: the minimum provably safe portion across training can be as low as 0.74 and 0.0 for Ablation and DIFFAI+, respectively. In the Thermostat benchmark, DIFFAI+'s performance mainly comes from the guidance of the data loss rather than the safety loss.

For the other three cases, a larger training data set does not help Ablation and DIFFAI+ produce much safer programs. For AC, as shown in Figure 4e, the program learned by Ablation is very close to the safe area, but it is still not accurate enough to give a safe program at the third step. In Racetrack, increasing the data size used to train the vehicles' controllers can only satisfy the requirement of not crashing into walls. The networks cannot learn to satisfy the property of not crashing into other vehicles, as this information is not available in the training data. In Cartpole, the baselines mainly follow the guidance of the data loss and cannot give safer programs.

RQ3: Scalability. As our treatment of the safety loss does not rely on the number of data points, the approach is scalable in terms of both data size and neural network size.
In Figure 2, we use three types of neural networks: (i) NNSmall (about 33,000 parameters); (ii) NNMed (over 0.5 million parameters); (iii) NNBig (over 2 million parameters). The training time per epoch is 0.09 s, 0.10 s, and 0.11 s for NNSmall, NNMed, and NNBig, respectively. For the synthetic benchmarks, training converges within about 200 epochs, which takes 18 s, 20 s, and 22 s, respectively.

Our Thermostat, AC, and Racetrack benchmarks consist of neural networks with 4933, 8836, and 4547 parameters respectively, and loops with 20, 15, and 20 iterations respectively. There are 2, 1, and 2 neural modules (respectively) in each benchmark. The training times for DIFFAI+ and DSE are comparable on these benchmarks. Specifically, for Thermostat, AC, and Racetrack, the total training time of DSE is less than 1 hour, less than 90 minutes, and around 13 hours, respectively. The corresponding numbers are over 90 minutes, more than 2 hours, and more than 17 hours for DIFFAI+. A more detailed scalability analysis is available in Appendix A.11.

6 RELATED WORK

Verification of Neural and Neurosymbolic Models. There are many recent papers on the verification of worst-case properties of neural networks (Anderson et al., 2019; Gehr et al., 2018; Katz et al., 2017; Elboher et al., 2020; Wang et al., 2018). There are also several recent papers (Ivanov et al., 2019; Tran et al., 2020; Sun et al., 2019; Christakis et al., 2021) on the verification of compositions of neural networks and symbolic systems (for example, plant models). To our knowledge, the present effort is the first to integrate a method of this sort (the propagation of worst-case intervals through a neurosymbolic program) with the gradient-based training of neurosymbolic programs.

Verified Deep Learning. There is a growing literature on methods that incorporate worst-case objectives (such as safety and robustness) into neural network training (Zhang et al., 2019; Mirman et al., 2018; Madry et al., 2017; Cohen et al., 2019; Singh et al., 2018). Most work on this topic focuses on the training of single neural networks. There are a few domain-specific efforts that consider the environment of the neural networks being trained. For example, Shi et al. (2019) use spectral normalization to constrain the neural network module of a neurosymbolic controller and ensure that it respects certain stability properties. Unlike DSE, these methods treat neural network modules in an isolated way and do not consider the interactions between these modules and the surrounding code (Stone et al., 2016; Qin & Badgwell, 2000; Nassi et al., 2020; Katz et al., 2021).

Verified Parameter Synthesis for Symbolic Code. There is a large body of work on parameter synthesis for traditional symbolic code (Alur et al., 2013; Chaudhuri et al., 2014) with respect to worst-case correctness constraints. Especially relevant in this literature is Chaudhuri et al. (2014), which introduced a method to smoothly approximate abstract interpreters of programs that is closely related to DSE and DIFFAI. However, these methods do not use contemporary gradient-based learning, and as a result, scaling them to programs with neural modules is impractical.

7 CONCLUSION

We presented DSE, the first approach to worst-case-safe parameter learning for programs that embed neural networks inside potentially discontinuous symbolic code. The method is based on a new mechanism for differentiating through a symbolic executor.
We demonstrate that DSE outperforms DIFFAI, a state-of-the-art approach to worst-case-safe deep learning, in a range of tasks. Our current implementation of DSE uses an interval representation of symbolic states. Future work should explore more precise representations such as zonotopes. Incorporating modern variance reduction techniques in our gradient estimator is another natural next step. Finally, one challenge in DSE is that learning here can get harder as the symbolic state representation gets more precise. In particular, if we increase the number of initial symbolic states beyond a point, each state would only lead to a unique symbolic trajectory, and there would be no gradient signal to adjust the relative weights of the different symbolic trajectories. Future work should seek to identify good, possibly adaptive, tradeoffs between the precision of symbolic states and the ease of learning. Published as a conference paper at ICLR 2022 Acknowledgments: We thank our anonymous reviewers for their insightful comments and Radoslav Ivanov (Rensselaer Polytechnic Institute) for his help with benchmark design. Work on this paper was supported by the United States Air Force and DARPA under Contract No. FA8750-20-C-0002, by ONR under Award No. N00014-20-1-2115, and by NSF under grant #1901376. Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60 69. PMLR, 2018. Aws Albarghouthi. Introduction to neural network verification. ar Xiv preprint ar Xiv:2109.10317, 2021. Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntaxguided synthesis. IEEE, 2013. Greg Anderson, Shankara Pailoor, Isil Dillig, and Swarat Chaudhuri. Optimization and abstraction: a synergistic approach for analyzing neural network robustness. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 731 744, 2019. Roberto Baldoni, Emilio Coppa, Daniele Cono D elia, Camil Demetrescu, and Irene Finocchi. A survey of symbolic execution techniques. ACM Computing Surveys (CSUR), 51(3):1 39, 2018. Andrew G Barto, Steven J Bradtke, and Satinder P Singh. Learning to act using real-time dynamic programming. Artificial intelligence, 72(1-2):81 138, 1995. Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. ar Xiv preprint ar Xiv:1805.08328, 2018. Swarat Chaudhuri, Martin Clochard, and Armando Solar-Lezama. Bridging boolean and quantitative synthesis using smoothed proof search. In Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 207 220, 2014. Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, Yisong Yue, et al. Neurosymbolic Programming. Now Publishers, 2021. Maria Christakis, Hasan Ferit Eniser, Holger Hermanns, Jörg Hoffmann, Yugesh Kothari, Jianlin Li, Jorge A Navas, and Valentin Wüstholz. Automated safety verification of programs invoking neural networks. In International Conference on Computer Aided Verification, pp. 201 224. Springer, 2021. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310 1320. PMLR, 2019. Patrick Cousot and Radhia Cousot. 
Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 238 252, 1977. Yizhak Yisrael Elboher, Justin Gottschlich, and Guy Katz. An abstraction-based framework for neural network verification. In International Conference on Computer Aided Verification, pp. 43 65. Springer, 2020. Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the ninth annual conference on Computational learning theory, pp. 325 332, 1996. Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79 103, 1999. Published as a conference paper at ICLR 2022 Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 3 18. IEEE, 2018. Divya Gopinath, Kaiyuan Wang, Mengshi Zhang, Corina S Pasareanu, and Sarfraz Khurshid. Symbolic execution for deep neural networks. ar Xiv preprint ar Xiv:1807.10439, 2018. Radoslav Ivanov, James Weimer, Rajeev Alur, George J Pappas, and Insup Lee. Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 169 178, 2019. Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97 117. Springer, 2017. Michael Katz, Kavitha Srinivas, Shirin Sohrabi, Mark Feblowitz, Octavian Udrea, and Oktie Hassanzadeh. Scenario planning in the wild: A neuro-symbolic approach. Fin Plan 2021, pp. 15, 2021. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703 3712. PMLR, 2019. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Zohar Manna and Amir Pnueli. Temporal verification of reactive systems: safety. Springer Science & Business Media, 2012. Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3578 3586. PMLR, 2018. Ben Nassi, Yisroel Mirsky, Dudi Nassi, Raz Ben-Netanel, Oleg Drokin, and Yuval Elovici. Phantom of the adas: Securing advanced driver-assistance systems from split-second phantom attacks. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 293 308, 2020. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. 
Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019. S Joe Qin and Thomas A Badgwell. An overview of nonlinear model predictive control applications. Nonlinear model predictive control, pp. 369 392, 2000. John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems, 28, 2015. Guanya Shi, Xichen Shi, Michael O Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9784 9790. IEEE, 2019. Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin T Vechev. Fast and effective robustness certification. Neur IPS, 1(4):6, 2018. Published as a conference paper at ICLR 2022 Peter Stone, Rodney Brooks, Erik Brynjolfsson, Ryan Calo, Oren Etzioni, Greg Hager, Julia Hirschberg, Shivaram Kalyanakrishnan, Ece Kamar, Sarit Kraus, et al. Artificial intelligence and life in 2030: the one hundred year study on artificial intelligence. 2016. Xiaowu Sun, Haitham Khedr, and Yasser Shoukry. Formal verification of neural network controlled autonomous systems. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 147 156, 2019. Hoang-Dung Tran, Xiaodong Yang, Diego Manzanas Lopez, Patrick Musau, Luan Viet Nguyen, Weiming Xiang, Stanley Bak, and Taylor T Johnson. Nnv: The neural network verification tool for deep neural networks and learning-enabled cyber-physical systems. In International Conference on Computer Aided Verification, pp. 3 17. Springer, 2020. Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal security analysis of neural networks using symbolic intervals. In 27th {USENIX} Security Symposium ({USENIX} Security 18), pp. 1599 1614, 2018. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229 256, 1992. Huan Zhang, Hongge Chen, Chaowei Xiao, Sven Gowal, Robert Stanforth, Bo Li, Duane Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. ar Xiv preprint ar Xiv:1906.06316, 2019. Published as a conference paper at ICLR 2022 Algorithm 1: Learning Safe, Optimal Parameter Mixtures (Agarwal et al., 2018; Le et al., 2019) for t = 1, ..., N do θt Bestθ(λt) ˆθt Uniform(θ1, . . . , θt), ˆλt 1 t P λt Lmax L(ˆθ, Bestλ(ˆθ)) Lmin L(Bestθ(ˆλt), ˆλt) if Lmax Lmin < ν then return ( ˆθt, ˆλt); λt+1 λ-Update(θ1, . . . , θt) A.1 LEARNING FRAMEWORK We use an equivalence between constrained and regularized learning that (Agarwal et al., 2018; Le et al., 2019), among others, have recently developed in other learning settings, we reduce our problem to a series of unconstrained optimization tasks. We convexify the program set {Fθ : θ Rk} by considering stochastic mixtures (Le et al., 2019) and represent the convexified set as a probabilistic function Fˆθ. Following DSE, we rewrite Equation 2 in terms of these mixtures: ˆθ = arg min ˆθ Q(ˆθ) s.t. C#(ˆθ) 0 (6) We convert Equation 6 to a Lagrangian function L(ˆθ, λ) = Q(ˆθ) + λC#(ˆθ) (7) (8) , where λ R+ is a Lagrange multiplier. Following the equilibrium computation technique by Freund & Schapire (1996), Equation 6 can be rewritten as max λ R+ min ˆθ L(ˆθ, λ). 
(9) Solutions to this problem can be interpreted as equilibria of a game between a λ-player and a ˆθplayer in which the ˆθ-player minimizes L(ˆθ, λ) given the current λ, and the λ-player maximizes L(ˆθ, λ) given the current ˆθ. Our overall algorithm is shown in Algorithm 1. In this pseudocode, ν is a predefined positive real. Uniform(θ1, . . . , θt) is the mixture that selects a θi out of {θ1, . . . , θt} uniformly at random. Bestθ(λ) refers to the ˆθ player s best response for a given value of λ (it can be shown that this best response is a single parameter value, rather than a mixture of parameters). Computing this best response amounts to solving the unconstrained optimization problem minθ L(θ, λ), i.e., Bestθ(λ) = min θ Q(θ) + λC#(θ). (10) Bestλ(ˆθ) is the λ-player s best response to a particular parameter mixture ˆθ. We define this function as Bestλ(ˆθ) = 0 if C#(ˆθ) 0 S otherwise (11) where S is the upper bound on λ. Intuitively, when C#(ˆθ) is non-positive, the current parameter mixture is safe. As L is a linear function of λ, the minimum value of L is achieved in this case when λ is zero. For other cases, the minimum L is reached by setting λ to the maximum value S. Finally, λ-Update(θt) computes, in constant time, a new value of λ based on the most recent best response by ˆθ player. We do not describe this function in detail; please see (Agarwal et al., 2018) for more details. Following prior work, we can show that Algorithm 1 converges to a value (ˆθ, ˆλ) that is within additive distance ν from a saddle point for Equation 9. To solve our original problem (Equation 2), Published as a conference paper at ICLR 2022 we take the returned ˆθ and then return the real-valued parameter θi to which this mixture assigns the highest probability. It is easy to see that this value is an approximate solution to Equation 2. Published as a conference paper at ICLR 2022 A.2 ABSTRACT UPDATE FOR NEURAL NETWORK MODULES We consider the box domain in the implementation. For a program with m variables, each component in the domain represents a m-dimensional box. Each component of the domain is a pair b = bc, be , where bc Rm is the center of the box and be Rm 0 represents the non-negative deviations. The interval concretization of the i-th dimension variable of b is given by [(bc)i (be)i, (bc)i + (be)i]. Now we give the abstract update for the box domain following Mirman et al. (2018). Add. For a concrete function f that replaces the i-th element in the input vector x Rm by the sum of the j-th and k-th element: f(x) = (x1, . . . , xi 1, xj + xk, xi+1, . . . xm)T . The abstraction function of f is given by: f #(b) = M bc, M be , where M Rm m can replace the i-th element of x by the sum of the j-th and k-th element by M bc. Multiplication. For a concrete function f that multiplies the i-th element in the input vector x Rm by a constant w: f(x) = (x1, . . . , xi 1, w xi, xi+1, . . . , xm)T . The abstraction function of f is given by: f #(b) = Mw bc, M|w| be , where Mw bc multiplies the i-th element of bc by w and M|w| be multiplies the i-th element of be with |w|. Matrix Multiplication. For a concrete function f that multiplies the input x Rm by a fixed matrix M Rm m: f(x) = M x. The abstraction function of f is given by: f #(b) = M bc, |M| be , where M is an element-wise absolute value operation. Convolutions follow the same approach, as they are also linear operations. Re LU. For a concrete element-wise Re LU operation over x Rm: Re LU(x) = (max(x1, 0), . . . 
, max(xm, 0))T , the abstraction function of Re LU is given by: Re LU#(b) = Re LU(bc + be) + Re LU(bc be) 2 , Re LU(bc + be) Re LU(bc be) where bc + be and bc be denotes the element-wise sum and element-wise subtraction between bc and be. Sigmoid. As Sigmoid and Re LU are both monotonic functions, the abstraction functions follow the same approach. For a concrete element-wise Sigmoid operation over x Rm: Sigmoid(x) = ( 1 1 + exp( x1), . . . , 1 1 + exp( xm))T , the abstraction function of Sigmoid is given by: Sigmoid#(b) = Sigmoid(bc + be) + Sigmoid(bc be) 2 , Sigmoid(bc + be) Sigmoid(bc be) where bc + be and bc be denotes the element-wise sum and element-wise subtraction between bc and be. All the above abstract updates can be easily differentiable and parallelized on the GPU. Published as a conference paper at ICLR 2022 A.3 INSTANTIATION OF THE Unsafe FUNCTION In general, the Unsafe(s) function over individual states can be any differentiable distance function between a point and a set which satisfies the property that Unsafe(s) = 0 if s is in the safe set, A, and Unsafe(s) > 0 if s is not in A. We give the following instantiation as the unsafeness score over individual states: Unsafe(s) = minx A DIST(s, x) if s / A 0 if s A where DIST denotes the euclidean distance between two points. Similarly, the Unsafe(σ) function over symbolic states can be any differentiable distance function between two sets which satisfies the property that Unsafe(σ) = 0 if V is in A, and Unsafe(σ) > 0 if V is not in A. We give the following instantiation as a differentiable approximation of the worstcase loss maxs satisfies V Unsafe(s) in our implementation: Unsafe(σ) = mins satisfies V Unsafe(s) + 1 if V A = 1 Vol(V A) Vol(V ) if V A = Published as a conference paper at ICLR 2022 A.4 BENCHMARKS FOR CASE STUDIES The detailed programs describing Aircraft Collision, Thermostat and Racetrack are in Figure 5, 6 and 7. The πθ of Aircraft Collision and πcool θ , πheat θ in Thermostat are a 3 layer feed forward net with 64 nodes in each layer and a Re LU after each layer except the last one. A sigmoid layer serves as the last layer. Both the πagent1 θ and πagent2 θ in Racetrack are a 3 layer feed forward net with 64 nodes in each and a Re LU after each layer. The Cartpole benchmark is a linear approximation following Bastani et al. (2018) with a 3 layer feed forward net with 64 nodes in each layer as the controller. 
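As a reading of the controller architecture described above (a sketch under our assumptions, not the released code), the benchmark controllers can be built in PyTorch as follows; the input and output dimensions shown for the AC planner are inferred from Figure 5 and are assumptions.

```python
# A sketch of the controller architecture described in Appendix A.4: a 3-layer
# feed-forward network with 64 hidden units per layer, ReLU activations, and
# an optional final sigmoid (as used by the Thermostat and AC controllers).

import torch.nn as nn

def make_controller(in_dim, out_dim, final_sigmoid=True):
    layers = [
        nn.Linear(in_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, out_dim),
    ]
    if final_sigmoid:
        layers.append(nn.Sigmoid())   # e.g., pi_cool / pi_heat in Thermostat
    return nn.Sequential(*layers)

# Example: the AC planner pi_theta takes (x1, y1, x2, y2, step, stage) and
# outputs (p0, p1, p2, p3, step), so in_dim=6 and out_dim=5 in this reading.
pi_theta = make_controller(in_dim=6, out_dim=5)
```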
1 aircraftcollision(x): 2 x1, y1 := x, -15.0 3 x2, y2 := 0.0, 0.0 4 N := 15 5 i = 0 6 while i < N: 7 p0, p1, p2, p3, step := πθ(x1, y1, x2, y2, step, stage) 8 stage := argmax(p0, p1, p2, p3) 9 if stage == CRUISE: 10 x1, y1 := MOVE_CRUISE(x1, y1) 11 else if stage == LEFT: 12 x1, y1 := MOVE_LEFT(x1, y1) 13 else if stage == STRAIGHT: 14 x1, y1 := MOVE_STRAIGHT(x1, y1) 15 else if stage == RIGHT: 16 x1, y1 := MOVE_RIGHT(x1, y1) 17 18 x2, y2 := MOVE_2(x2, y2) 19 i := i + 1 20 assert (!IS_CRASH(x1, y1, x2, y2)) 21 return Figure 5: Aircraft Collision 1 thermostat(x): 2 N = 20 3 is On = 0.0 4 i = 0 5 while i < N: 6 if is On 0.5: 7 is On := πcool θ (x) 8 x := COOLING(x) 9 else: 10 is On, heat := πheat θ (x) 11 x := WARMING(x, heat) 12 i := i + 1 13 assert(!EXTREME_TEMPERATURE(x)) 14 15 return Figure 6: Thermostat Published as a conference paper at ICLR 2022 1 racetrack(x): 2 x1, y1, x2, y2 := x, 0.0, x, 0.0 3 N := 20 4 while i < N: 5 p10, p11, p12 := πagent1 θ (x1, y1) 6 p20, p21, p22 := πagent2 θ (x2, y2) 7 action1 := argmax(p10, p11, p12) 8 action2 := argmax(p20, p21, p22) 9 x1, y1 := MOVE(x1, y1, action1) 10 x2, y2 := MOVE(x2, y2, action2) 11 i := i + 1 12 assert(!CRASH_WALL(x1, y1) && !CRASH_WALL(x2, y2) 13 && !CRASH(x1, y1, x2, y2)) 14 15 return Figure 7: Racetrack Published as a conference paper at ICLR 2022 A.5 SYNTHETIC MICROBENCHMARKS Figure 8 exhibits the 5 synthetic microbenchmarks in our evaluation. Specifically, the neural network modules are used as conditions in pattern 1 and pattern 2. The neural network modules serve as both conditions and the assignment variable in pattern 3, pattern 4 and pattern 5. To highlight the impact from the safety loss, we only train with safety loss for synthetic microbenchmarks. 1 pattern1(x): 2 # x [-5, 5] 3 y := πθ(x) 4 if y 1.0: 5 z := 10 6 else: 7 z: = 1 8 assert(z 1) 9 return z (a) Pattern1 1 pattern2(x): 2 # x [-5, 5] 3 y := πθ(x) 4 if y 1.0: 5 z := x + 10 6 else: 7 z: = x - 5 8 assert(z 0) 9 return z (b) Pattern2 1 pattern3(x): 2 # x [-5, 5] 3 y := πθ(x) 4 if y 1.0: 5 z := 10 - y 6 else: 7 z: = 1 8 assert(z 1) 9 return z (c) Pattern3 1 pattern4(x): 2 # x [-5, 5] 3 y := πθ(x) 4 if y -1.0: 5 z := 1 6 else: 7 z: = 2 + y * y 8 assert(z 1) 9 return z (d) Pattern4 1 pattern5(x): 2 # x [-1, 1] 3 y := πθ(x) 4 if y 1.0: 5 z := y 6 else: 7 z: = -10 8 assert(z 0 && z -5) 9 return z (e) Pattern5 Figure 8: Programs for Patterns Published as a conference paper at ICLR 2022 A.6 DATA GENERATION In this section, we provide more details about our data generation process. We give our ground-truth programs and the description of each benchmark: Thermostat: Figure 6 exhibits the program with initial temperature x [60.0, 64.0] in training, where a 20-length loop is involved. The COOLING and WARMING are two differentiable functions updating the temperature, x. The πcool θ and πheat θ are two linear feed-forward neural networks. We set the safe temperature to the area [55.0, 83.0]. EXTREME_TEMPERATURE measures that whether x is not within the safe temperature scope. The ground-truth program replaces πcool θ and πheat θ with two different functional mechanisms (including branches and assignment), which decides the value of is On and heat. To mimic the real-world signal sensing process of thermostat, we carefully add noise to the output of these functional mechanisms and restricts the noise does not influence the safety of the ground-truth program. 
AC: Figure 5 illustrates the program in training with initial input x-axis of aircraft1 x [12.0, 16.0], where a 15-length loop is involved. The MOVE_* functions update the position of aircraft in a differentiable way. The πθ is one linear feed-forward neural networks. IS_CRASH measures the distances between two aircraft and the safe distance area is set to be larger than 40.0. In the ground-truth program, the πθ is replaced by one functional mechanism mimicing the real-world aircraft planner. If the stage is CRUISE, the groundtruth planner detects the distance between aircraft and decides the next step s stage by assignment values 1 or 0 to p0, p1, p2, p3. The step is set to 0. When the stage is in LEFT or RIGHT, stage keeps and the step increases if the number of steps in this stage is within a threshold, and the stage changes to STRAIGHT or CRUISE and the step is set to 0 otherwise. Specifically, we convert the argmax to a nested If-Then-Else(ITE) block to select the index of the p with the maximum value. Racetrack: Figure 7 illustrates the program in training with the input x-axis of each agent x [5.0, 6.0] (The two agents start from the same location.) This program has a 20-length loop with a dynamic length of 223 lines (The unfold ITE of argmax consists of 3 lines code.) In training, the πagent1 θ and πagent2 θ are two linear feed-forward neural network. The MOVE is differentiable functions that update agents positions. CRASH_WALL measures whether the agents position collides with the wall. CRASH measures whether the two agents crash into each other. In ground-truth program, πagent1 θ and πagent2 θ are replaced by functional path planners guiding the agent s direction. The path planner only tries to avoid wall-collision in the map. That is to say, in the trajectories data generated, an arbitrary pair of trajectories of agent1 and agent2 is not guaranteed to be nocollision with each other. To model real-world navigation, we also add noise in the next step selection for one position to generate independent trajectories of each agent. In the ground-truth program, the noise is added by uniformly selecting from the safe next step area. Cartpole: In the Cartpole experiment, we use the trajectories data generated from the expert model from Imitation Package. Starting from S0 = [ 0.05, 0.05]4, we train a cart of which the position is within a range x [ 0.1, 0.1]. The state space is 4 and the action space is 2. We use a 3 layer fully connected neural network (each layer followed by a Re LU and the last layer followed by a Sigmoid) to train. We follow the two points in Bastani et al. (2018) below to setup the experiment: We approximate the system using a finite time horizon Tmax = 10; we use a linear approximation f(s, a) As + Ba. When generating the trajectory dataset, we uniformly sample the input from the input space and run the ground-truth program to get the trajectory. Specifically, the trajectories in the dataset is represented by a sequence of neural network input-output pairs and the index of the neural networks. Take Thermostat as an example, we get a sequence of ([x0], [is On0], cool ), . . . , ([xk], [is Onk, heatk], heat ), . . . for one starting point. The k represents the data collected in the k-th step of the program. Published as a conference paper at ICLR 2022 A.7 TRAINING PROCEDURE Our framework is built on top of Py Torch (Paszke et al., 2019). We use the Adam Optimizer (Kingma & Ba, 2014) for all the experiments with a learning rate of 0.001 and a weight decay of 0.000001. 
We ran all experiments using a single-threaded implementation on a Linux system with Intel Xeon Gold 5218 2.30GHz CPUs and GeForce RTX 2080 Ti GPUs.

We define the data loss (Q) following imitation learning. The trajectory dataset is converted into several sets of input-output example pairs, one set per neural network. Q is the average, over the neural networks, of the mean squared error of each network on its corresponding input-output pair set. The C# of DSE follows the algorithm given in Section 4, with one symbolic state covering the input space. In DIFFAI+, we use a sound join operation, as in classic abstract interpretation (Cousot & Cousot, 1977), whenever a branch is encountered. A sound join means that the symbolic state after the branch block is the join of the symbolic states computed from each branch; the concretization of the joined symbolic state covers the union of the concretizations of the symbolic states from the individual branches. Intuitively, starting from one input symbolic state, DIFFAI+ obtains a single symbolic trajectory that covers all potential trajectories starting from concrete inputs satisfying the input symbolic state. For the safety loss of DIFFAI+, we first split the input space evenly into 100 subregions to give DIFFAI+ more accuracy, which yields 100 symbolic trajectories; we take the average of the safety losses of these 100 symbolic trajectories as the training safety loss C# of DIFFAI+.

We set the convergence criterion as a combination of epoch count and loss. In each training iteration, the final loss is Q(θ) + C#(θ). We set the maximum number of epochs for Thermostat, AC, and Racetrack to 1500, 1200, and 6000, respectively, and we stop early if the final loss decreases by less than 1% over 200 epochs.

A.8 TRAINING PERFORMANCE

Figures 9 and 10 illustrate the training performance in terms of safety loss and data loss when we vary the number of data points (2%, 10%, 20%, 50%, 100% of the training dataset) used by DIFFAI+ and DSE. We can see that the safety loss of DSE converges, while the safety loss of DIFFAI+ has a large variance in these cases and cannot converge to a small value. While we found that starting from random initialization can yield safe programs for Thermostat, we achieved much quicker convergence by first training the Thermostat program purely with data and then continuing to train with the safety loss. We therefore show the results and training performance for Thermostat with this data-loss-initialized program. The other two benchmarks do not benefit much from data-loss initialization, so we keep training them from random initialization.

A.8.1 SAFETY LOSS

[Figure 9; panels (a)-(t): Thermostat (200, 1k, 2k, 5k, 10k), AC (150, 750, 1.5k, 3.75k, 7.5k), Racetrack (200, 1k, 2k, 5k, 10k), Cartpole (200, 1k, 2k, 5k, 10k).]

Figure 9: Training performance of the safety loss on different benchmarks with varying numbers of data points. The y-axis represents the safety loss (C#(θ)) and the x-axis gives the number of training epochs. Overall, DSE converges within 1500, 1000, 6000, and 2000 epochs for Thermostat, AC, Racetrack, and Cartpole, while DIFFAI+ easily gets stuck or fluctuates on these benchmarks.
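As a concrete reading of the data loss Q described in Appendix A.7, the following sketch averages each network's mean squared error over its input-output pair set; the dataset layout and names are our own illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def data_loss_Q(models, pair_sets):
    """`models` maps a network index (e.g., 'cool', 'heat') to its module;
    `pair_sets` maps the same index to (inputs, targets) tensors extracted
    from the trajectory dataset. Hypothetical layout, for illustration only."""
    losses = []
    for name, model in models.items():
        inputs, targets = pair_sets[name]
        losses.append(F.mse_loss(model(inputs), targets))
    # Q is the average of the per-network mean squared errors.
    return torch.stack(losses).mean()
```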
A.8.2 DATA LOSS

[Figure 10; panels (a)-(t): Thermostat (200, 1k, 2k, 5k, 10k), AC (150, 750, 1.5k, 3.75k, 7.5k), Racetrack (200, 1k, 2k, 5k, 10k), Cartpole (200, 1k, 2k, 5k, 10k).]

Figure 10: Training performance of the data loss on different benchmarks with varying numbers of data points. The y-axis represents the data loss (Q(θ)) and the x-axis gives the number of training epochs. Since Thermostat is initialized with a data-trained program to obtain quicker convergence, its data loss does not change much during training. For the other benchmarks, the data losses of DSE and DIFFAI+ both converge. Overall, DSE sacrifices some data loss to obtain safer programs.

A.8.3 DETAILS OF TRAINING AND TEST DATA

We also report the detailed training data loss (Q), training safety loss (C#), test data loss, and provably safe portion for Thermostat (Figure 11), AC (Figure 12), and Racetrack (Figure 13). The training safety loss measures the expectation of the quantitative safety of the symbolic trajectories, and the provably safe portion measures the fraction of symbolic trajectories that are provably safe. Intuitively, during training, C# is larger when there are more unsafe symbolic trajectories and when those trajectories are further away from the safe region. At test time, the provably safe portion is larger when more symbolic trajectories are provably safe.

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | DSE | 0.13 | 0.02 | 0.24 | 0.99
200 | DIFFAI+ | 0.07 | 25.37 | 0.21 | 0.28
1000 | DSE | 0.09 | 0.0 | 0.22 | 0.99
1000 | DIFFAI+ | 0.05 | 1.96 | 0.14 | 0.55
2000 | DSE | 0.08 | 0.0 | 0.21 | 0.99
2000 | DIFFAI+ | 0.05 | 2.70 | 0.15 | 0.46
5000 | DSE | 0.07 | 0.0 | 0.19 | 0.99
5000 | DIFFAI+ | 0.04 | 4.58 | 0.18 | 0.40
10000 | DSE | 0.04 | 0.0 | 0.19 | 0.99
10000 | DIFFAI+ | 0.02 | 1.23 | 0.19 | 0.80

Figure 11: Training and Test Results of Thermostat

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
150 | DSE | 0.60 | 0.0 | 0.85 | 0.99
150 | DIFFAI+ | 0.53 | 21.46 | 0.87 | 0.22
750 | DSE | 0.64 | 0.0 | 0.86 | 1.0
750 | DIFFAI+ | 0.57 | 20.66 | 0.87 | 0.26
1500 | DSE | 0.58 | 0.0 | 0.84 | 1.0
1500 | DIFFAI+ | 0.53 | 28.28 | 0.87 | 0.17
3750 | DSE | 0.62 | 0.0 | 0.85 | 1.0
3750 | DIFFAI+ | 0.53 | 26.31 | 0.87 | 0.21
7500 | DSE | 0.61 | 0.0 | 0.86 | 1.0
7500 | DIFFAI+ | 0.52 | 27.23 | 0.88 | 0.27

Figure 12: Training and Test Results of AC

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | DSE | 0.22 | 1.12 | 0.28 | 0.66
200 | DIFFAI+ | 0.11 | 2.04 | 0.31 | 0.0
1000 | DSE | 0.22 | 0.0 | 0.25 | 0.99
1000 | DIFFAI+ | 0.17 | 2.03 | 0.26 | 0.0
2000 | DSE | 0.21 | 0.0 | 0.25 | 0.99
2000 | DIFFAI+ | 0.17 | 2.03 | 0.25 | 0.0
5000 | DSE | 0.22 | 0.0 | 0.27 | 1.0
5000 | DIFFAI+ | 0.17 | 2.03 | 0.24 | 0.0
10000 | DSE | 0.22 | 0.0 | 0.24 | 0.99
10000 | DIFFAI+ | 0.13 | 2.26 | 0.24 | 0.0

Figure 13: Training and Test Results of Racetrack

A.9 ADDITIONAL PATTERN ANALYSIS

Figure 2 exhibits the safety performance of the learnt patterns; the details of the patterns are in Figure 8. We highlight the patterns that cover the scope differences between DIFFAI+ and DSE, including parameterized branch conditions (Patterns 1-4) and bad joins (Pattern 5). In addition, we describe the characteristics of each pattern that influence DSE's performance, including deep local minima in branches and limitations of sampling. In the implementation, we randomly initialize the neural network's parameters (DIFFAI+ and DSE start from the same initialization).
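Because the "bad join" behaviour in Pattern 5 below hinges on DIFFAI+'s sound join, we recall it here with a minimal box-domain sketch; the Box class and field names are ours, not the released implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class Box:
    """Axis-aligned box abstraction of a symbolic state (illustrative only)."""
    lower: torch.Tensor  # per-variable lower bounds
    upper: torch.Tensor  # per-variable upper bounds

def sound_join(a: Box, b: Box) -> Box:
    # The joined box is the smallest box whose concretization contains the
    # concretizations of both branch states (the interval hull).
    return Box(torch.minimum(a.lower, b.lower), torch.maximum(a.upper, b.upper))
```

In Pattern 5, the branch state whose bounds depend on θ is always joined with the constant state z := 10, so the unsafe side of the joined box does not depend on θ and contributes no gradient.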
We observe that all the neural networks are initialized such that their output satisfies y ≤ 1 when x ∈ [−5, 5], which gives a 0.0 provably unsafe portion for Patterns 1, 2, and 5 at initialization. We give a detailed description of each pattern's characteristics (together with an analysis of the results); in all patterns, the safety loss is over the single variable z in the assertion.

Pattern 1: The output y of the neural network module is used only to evaluate the branch condition, and in each branch z is updated only with constants. When gradient back-propagation runs in DIFFAI+, the gradient with respect to θ is constantly zero, because the branch-splitting computation is not involved in the safety loss. DSE, in contrast, accounts for y through the volume-based probability in the safety-loss calculation; it can therefore guide the program to fall entirely into the second branch by reducing the probability of y falling into the first branch.

Pattern 2: As in Pattern 1, the output of the neural network is used in the branch condition; unlike Pattern 1, the input x of the neural network participates in the assignment to z. DIFFAI+ has the same "zero gradient" issue as in Pattern 1. DSE sometimes reaches a region during training where only a very small portion of y falls into the first branch, and the sampling operation never picks that portion because its probability is very small. DSE then still converges steadily with C# = 0.0. However, the resulting program is safer but not worst-case safe, because C# = 0.0 only means that all trajectories sampled by DSE during training are safe.

Pattern 3: Pattern 3 also uses y in the branch condition. Of the two branches, z in the first branch is assigned directly by a function of y, and the second branch is trivially safe. The goal in learning a safe program is therefore to drive the assignment z := 10 − y into the safe region. Both DIFFAI+ and DSE succeed on this pattern: optimizing over one branch is enough to learn a safe program.

Pattern 4: Pattern 4 is a typical example of a local minimum. Since the branch condition is y ≤ 1.0, the neural network may be initialized so that inputs fall into both branches. The first branch is trivially safe, and the second branch comes with a local minimum at z = 2. DIFFAI+ is not able to learn a safe program guided by its safety loss, while DSE can. DSE does not achieve a 1.0 provably safe portion because part of the training gets stuck in the local minimum of the second branch. There is one exception for DIFFAI+: it reaches 0.58 on NNBig. This 0.58 cannot be attributed to the safety-loss calculation; it benefits specifically from the random initialization. Since this is not a complicated task, DIFFAI+ benefits from the weight initialization when using NNBig: some random initializations of NNBig pick parameters that make all outputs y fall into the (safe) first branch, which leads to a reasonable result for DIFFAI+ on Pattern 4 with NNBig. If there were other random initializations that helped an approach, DIFFAI+ and DSE would benefit in the same way. DSE does not specifically benefit from random initialization on Pattern 4, since DIFFAI+ fails on NNSmall and NNMed while DSE does not, and DSE produces a safer program on NNBig.

Pattern 5: In this pattern, the first branch assigns z a function of the neural network's output, and the second branch is unsafe. Bad joins arise for DIFFAI+: a symbolic representation of y that falls into both branches is always joined with the unsafe result (z := 10).
No gradient from z := 10.0 can direct the neural network towards the safe region. DSE works on this pattern because the volume-based probability used during learning guides y to fall into the first branch.

A.10 ADDITIONAL DETAILS OF THE CORE ALGORITHM

The key ingredient of volume-based sampling is the volume-based probability, which is measured by the fraction of the volume satisfying a given condition. This volume fraction must therefore be differentiable with respect to θ. In addition, the volume is not required to be non-zero. In detail:
- We represent Vθ by the box domain (the detailed definition of Vθ is in Appendix A.2).
- When a branch condition would split the box domain into two polyhedra (direct volume computation for polyhedra is #P-hard [1]), we add one new dimension (with a new variable) so that the split is performed along this new dimension. Formally, starting from a state s = (l, V), for a condition "if f(x1, . . . , xk) ≤ M: . . .", DSE first transforms the program by rewriting the condition as "x_{k+1} := f(x1, . . . , xk); if x_{k+1} ≤ M: . . .". The volume fraction is then measured as the fraction of the concrete interval of x_{k+1} that intersects the branch condition. In particular, if the concrete interval of the additional variable has length 0, the variable represents a single point and falls entirely into one branch (say branch 1); the probability of selecting branch 1 is then 1.0, and the probability of selecting any other branch is 0.0.

A.11 ADDITIONAL SCALABILITY EVALUATION

For most of the additional scalability experiments, we use Thermostat. We use AC as the base benchmark for different neural network architectures when adding convolutional layers, because the NN in Thermostat has a single input variable and a Conv1d with kernel size 1 is then the same as a fully connected layer. The original Thermostat has a loop of length 20 with a dynamic program length of 20 × 6 = 120 lines. Each iteration has 2 branches, and branches stacked across iterations increase the number of paths exponentially; thus the original Thermostat has 2^20 paths in total. We evaluate scalability on Thermostat by increasing the number of branches in each iteration, doubling the loop length, and using a super refined input region.

A.11.1 DIFFERENT NUMBER OF BRANCHES

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | Ablation | 0.03 | - | 0.2 | 0.66
200 | DIFFAI+ | 0.07 | 1.08 | 0.19 | 0.66
200 | DSE | 0.07 | 0 | 0.25 | 0.99
10000 | Ablation | 0.01 | - | 0.2 | 0.68
10000 | DIFFAI+ | 0.05 | 1.53 | 0.18 | 0.67
10000 | DSE | 0.06 | 0 | 0.22 | 0.99

Figure 14: Results from Thermostat with Three Branches in Each Iteration.

With Thermostat, we increase the number of branches in each iteration from 2 to 3 by allowing isOn to take three branches, converting the branch set from {cool, heat} to {cool, lowHeat, highHeat}. The total number of paths in the program thus increases from 2^20 to 3^20. DSE still performs well, whereas DIFFAI+ cannot give very good results even though the ablation achieves a reasonable number.
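Since the failure modes discussed in the next two subsections hinge on the volume-based branch probability of Appendix A.10, we make it concrete with a minimal interval-based sketch; the representation and names are our own illustration, not the released implementation.

```python
import torch

def branch_probability(lower, upper, threshold):
    """Probability of taking the branch `x_{k+1} <= threshold`, measured as
    the fraction of the concrete interval [lower, upper] of the fresh variable
    x_{k+1} that satisfies the condition. `lower` and `upper` are 0-d tensors
    computed from the symbolic state, so the result is differentiable in theta."""
    length = upper - lower
    if length.item() == 0.0:
        # Degenerate interval: the state is a single point and falls entirely
        # into one branch; the probability is exactly 0 or 1, with no gradient.
        return (lower <= threshold).float()
    inside = torch.clamp(threshold - lower, min=0.0)
    return torch.clamp(inside / length, max=1.0)
```

When the whole interval lies on one side of the threshold, the probability saturates at 0 or 1 and its gradient with respect to θ vanishes; this is exactly the stuck behaviour discussed below for the 40-length loop and the super refined input region.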
A.11.2 DIFFERENT DYNAMIC LENGTH

With the same structure as the original Thermostat, we set the loop length to 40, which doubles the dynamic length; the number of paths accordingly increases from 2^20 to 2^40. With 200 data points, DSE gives a result better than the baselines but clearly less than perfect. In the runs where DSE did not produce safe programs, the symbolic state (split by too many branches) became too refined: it gets stuck in a region that is not safe but falls fully into a single trajectory. In that case, the volume-based probability is always 1.0 for the later transitions and the gradient with respect to θ is 0.0. The highly unsafe stuck state yields a very high state safety loss, which is reflected in the large training safety loss for DSE. When DSE is not stuck, it can learn safe programs, which is why its average provably safe portion is still larger than that of the baselines.

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | Ablation | 0.03 | - | 0.19 | 0
200 | DIFFAI+ | 0.05 | 2.05 | 0.21 | 0
200 | DSE | 0.17 | 396 | 0.26 | 0.36
10000 | Ablation | 0.01 | - | 0.19 | 0.67
10000 | DIFFAI+ | 0.05 | 25.42 | 0.21 | 0
10000 | DSE | 0.08 | 0 | 0.24 | 0.99

Figure 15: Results from Thermostat with 40-Length Loop.

We also show that, with more data, DSE can produce a very good safe program for this challenging task while the baselines cannot. In particular, DIFFAI+ (with 100 input splits) easily gets stuck when guided by its highly non-differentiable representation of the safety loss, even though the ablation achieves a reasonable result.

A.11.3 SUPER REFINED INPUT SIZE

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | Ablation | 0.02 | - | 0.19 | 0.33
200 | DIFFAI+ | 0.04 | 4.5 | 0.18 | 0.33
200 | DSE | 0.07 | 129.5 | 0.25 | 0.33
10000 | Ablation | 0.01 | - | 0.19 | 1
10000 | DIFFAI+ | 0.01 | 0 | 0.19 | 1
10000 | DSE | 0.01 | 0 | 0.19 | 1

Figure 16: Results from Thermostat with Super Refined Input Size.

In this task, Thermostat starts from a super refined input region, [60.0, 60.1], whereas the input region of the original Thermostat is [60.0, 64.0]. DSE does not give good results with 200 data points and gives the same result as the ablation when using the full dataset. This benchmark exhibits the trade-off between refinement and ease of learning in DSE: when the input region is small enough that it falls fully into a single trajectory under the updated θ during learning, the volume-based probability for that trajectory is 1.0 and there is no gradient guiding the state to jump from this trajectory to another. When this happens during training, DSE gets stuck.

A.11.4 DIFFERENT NEURAL NETWORK ARCHITECTURE

Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
200 | Ablation | 0.02 | - | 0.19 | 0.33
200 | DIFFAI+ | 0.04 | 4.5 | 0.18 | 0.33
200 | DSE | 0.07 | 129.5 | 0.25 | 0.33
10000 | Ablation | 0.01 | - | 0.19 | 1
10000 | DIFFAI+ | 0.01 | 0 | 0.19 | 1
10000 | DSE | 0.01 | 0 | 0.19 | 1

Figure 17: Results from AC with Convolutional Layers

With AC, we use an NN of the form Conv1d(1, 1, 2)-ReLU()-Conv1d(1, 1, 2)-ReLU()-Linear(4, 32)-ReLU()-Linear(32, 6)-Sigmoid(), where Conv1d(X, Y, Z) denotes X input channels, Y output channels, and a kernel size of Z, and Linear(X, Y) denotes X input features and Y output features. The table above shows that DSE works for neural networks with convolutional layers. For AC specifically, all the training data comes from safe trajectories, and the ablation lacks the generality to achieve a good test data loss. Adding the safety constraint (as in DSE) helps overcome some local minima and yields a better test data loss, even though the training data loss is worse than the ablation's.
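A PyTorch rendering of this architecture could look as follows. This is a sketch for illustration only; the length-6, single-channel input and the explicit Flatten between the convolutional and linear layers are our assumptions, inferred from the Linear(4, 32) layer that follows two kernel-size-2 convolutions.

```python
import torch
import torch.nn as nn

# Sketch of the convolutional policy used for AC in A.11.4 (assumptions noted above).
ac_policy = nn.Sequential(
    nn.Conv1d(1, 1, 2), nn.ReLU(),
    nn.Conv1d(1, 1, 2), nn.ReLU(),
    nn.Flatten(),              # (batch, 1, 4) -> (batch, 4)
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 6), nn.Sigmoid(),
)

x = torch.randn(1, 1, 6)       # (batch, channels, length)
print(ac_policy(x).shape)      # torch.Size([1, 6])
```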
A.11.5 CART-POLE EXPERIMENTS

In the Cartpole experiment, we use trajectory data generated from the expert model of the Imitation package. Starting from S0 = [−0.05, 0.05]^4, we follow Bastani et al. (2018) to train a cart whose pole stays upright (θ ∈ [−12 · 2π/360, 12 · 2π/360]; the definition of upright is the standard one from OpenAI Gym). The state space has dimension 4 and the action space has size 2. We train a 3-layer fully connected neural network (each layer followed by a ReLU, and the last layer followed by a Sigmoid). We follow two points from Bastani et al. (2018) to set up the experiment: we approximate the system using a finite time horizon Tmax = 10, and we use a linear approximation f(s, a) ≈ As + Ba. We perform verification by splitting the input space into 20^4 boxes, and we extract the percentage of safe concrete trajectories from 100 uniformly sampled starting points. The results for restricting the pole angle are shown in Figure 18.

Approach | Q | C# | Test Data Loss | Provably Safe Portion | Percentage of Safe Concrete Trajectories
Ablation | 0.13 | - | 0.14 | 0.67 | 100%
DIFFAI+ | 0.13 | 0.66 | 0.14 | 0.62 | 100%
DSE | 0.16 | 0 | 0.18 | 0.78 | 100%

Figure 18: Results of Cart-Pole (angle)

As the figure indicates, pure imitation learning (Ablation) can already learn to keep the pole upright. To show DSE's ability to train a safe program without much help from the data loss, we add another experiment that follows the same setting except for the constraint; this experiment is shown in the main paper.

A.11.6 TRAIN DSE FROM HIGHLY UNSAFE DATA

Benchmark | Data Size | Approach | Q | C# | Test Data Loss | Provably Safe Portion
Thermostat (45% unsafe) | 200 | Ablation | 0.04 | - | 0.19 | 0.20
Thermostat (45% unsafe) | 200 | DIFFAI+ | 0.06 | 4.40 | 0.21 | 0.20
Thermostat (45% unsafe) | 200 | DSE | 0.16 | 0.00 | 0.27 | 0.99
Thermostat (45% unsafe) | 10000 | Ablation | 0.04 | - | 0.18 | 0.59
Thermostat (45% unsafe) | 10000 | DIFFAI+ | 0.05 | 3.72 | 0.17 | 0.46
Thermostat (45% unsafe) | 10000 | DSE | 0.06 | 0.00 | 0.18 | 0.99
AC (49% unsafe) | 150 | Ablation | 0.89 | - | 1.04 | 0.32
AC (49% unsafe) | 150 | DIFFAI+ | 0.90 | 25.20 | 1.04 | 0.24
AC (49% unsafe) | 150 | DSE | 0.96 | 0.00 | 1.03 | 1.00
AC (49% unsafe) | 7500 | Ablation | 0.79 | - | 1.03 | 0.00
AC (49% unsafe) | 7500 | DIFFAI+ | 0.81 | 15.00 | 1.03 | 0.00
AC (49% unsafe) | 7500 | DSE | 0.88 | 0.00 | 1.03 | 1.00

Figure 19: Results from Highly Unsafe Data

In summary, DSE scales to programs of reasonable length, with branching and with different neural network architectures. One remaining challenge is that learning becomes harder as the symbolic state representation becomes more refined (whether through a super refined input space, the safety constraint, or branch splitting). We leave it to future work to identify better trade-offs between the precision of symbolic states and the ease of learning.
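To make the Cartpole evaluation protocol of Appendix A.11.5 concrete, the following is a minimal, purely illustrative sketch of splitting the 4-dimensional input space into 20^4 boxes and estimating the percentage of safe concrete trajectories from 100 sampled starting points; `run_trajectory` and `is_safe` are hypothetical callables, not part of the released code.

```python
import itertools
import random

def verification_boxes(low, high, splits=20):
    """Split the 4-dimensional Cartpole input space into splits**4 boxes,
    mirroring the 20^4 split used for verification (names are ours)."""
    widths = [(h - l) / splits for l, h in zip(low, high)]
    for idx in itertools.product(range(splits), repeat=len(low)):
        lower = [l + i * w for l, w, i in zip(low, widths, idx)]
        upper = [l + w for l, w in zip(lower, widths)]
        yield lower, upper

def safe_concrete_fraction(run_trajectory, is_safe, low, high, n=100):
    """Fraction of safe concrete trajectories from n uniformly sampled starts."""
    starts = [[random.uniform(l, h) for l, h in zip(low, high)] for _ in range(n)]
    return sum(is_safe(run_trajectory(s)) for s in starts) / n
```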