UNIVERSAL APPROXIMATION WITH CERTIFIED NETWORKS

Published as a conference paper at ICLR 2020

Maximilian Baader, Matthew Mirman, Martin Vechev
Department of Computer Science, ETH Zurich, Switzerland
{mbaader,matthew.mirman,martin.vechev}@inf.ethz.ch

ABSTRACT

Training neural networks to be certifiably robust is critical to ensure their safety against adversarial attacks. However, it is currently very difficult to train a neural network that is both accurate and certifiably robust. In this work we take a step towards addressing this challenge. We prove that for every continuous function $f$, there exists a network $n$ such that: (i) $n$ approximates $f$ arbitrarily closely, and (ii) simple interval bound propagation of a region $B$ through $n$ yields a result that is arbitrarily close to the optimal output of $f$ on $B$. Our result can be seen as a Universal Approximation Theorem for interval-certified ReLU networks. To the best of our knowledge, this is the first work to prove the existence of accurate, interval-certified networks.

1 INTRODUCTION

Much recent work has shown that neural networks can be fooled into misclassifying adversarial examples (Szegedy et al., 2014), inputs which are imperceptibly different from those that the neural network classifies correctly. Initial work on defending against adversarial examples revolved around training networks to be empirically robust, usually by including adversarial examples found with various attacks into the training dataset (Gu and Rigazio, 2015; Papernot et al., 2016; Zheng et al., 2016; Athalye et al., 2018; Eykholt et al., 2018; Moosavi-Dezfooli et al., 2017; Xiao et al., 2018). However, while empirical robustness can be practically useful, it does not provide safety guarantees.

As a result, much recent research has focused on verifying that a network is certifiably robust, typically by employing methods based on mixed integer linear programming (Tjeng et al., 2019), SMT solvers (Katz et al., 2017), semidefinite programming (Raghunathan et al., 2018a), duality (Wong and Kolter, 2018; Dvijotham et al., 2018b), and linear relaxations (Gehr et al., 2018; Weng et al., 2018; Wang et al., 2018b; Zhang et al., 2018; Singh et al., 2018; Salman et al., 2019). Because the certification rates were far from satisfactory, specific training methods were recently developed which produce networks that are certifiably robust: Mirman et al. (2018); Raghunathan et al. (2018b); Wang et al. (2018a); Wong and Kolter (2018); Wong et al. (2018); Gowal et al. (2018) train the network with standard optimization applied to an over-approximation of the network behavior on a given input region (the region is created around the concrete input point). These techniques aim to discover specific weights which facilitate verification. There is a tradeoff between the degree of the over-approximation used and the speed of training and certification. Recently, Cohen et al. (2019b) proposed a statistical approach to certification which, unlike the non-probabilistic methods discussed above, creates a probabilistic classifier that comes with probabilistic guarantees. So far, some of the best non-probabilistic results achieved on the popular MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky, 2009) datasets have been obtained with the simple interval relaxation (Gowal et al., 2018; Mirman et al., 2019), which scales well at both training and verification time.
Despite this progress, there are still substantial gaps between known standard accuracy, experimental robustness, and certified robustness. For example, for CIFAR10, the best reported certified robustness is 32.04% with an accuracy of 49.49% when using a fairly modest $\ell_\infty$ region with radius 8/255 (Gowal et al., 2018). The state-of-the-art non-robust accuracy for this dataset is > 95%, with experimental robustness > 50%. Given the size of this gap, a key question then is: can certified training ever succeed, or is there a fundamental limit?

Figure 2: The ReLU networks $n_1$ (Figure 2a) and $n_2$ (Figure 2c) encode the same function $f$ (Figure 2b). Interval analysis fails to certify that $n_1$ does not exceed $[0, 1]$ on $[0, 1]$, while certification succeeds for $n_2$.

Figure 1: Illustration of Theorem 1.1.

In this paper we take a step towards answering this question by proving a result parallel to the Universal Approximation Theorem (Cybenko, 1989; Hornik et al., 1989). We prove that for any continuous function $f$ defined on a compact domain $\Gamma \subset \mathbb{R}^m$ and for any desired level of accuracy $\delta$, there exists a ReLU neural network $n$ which can certifiably approximate $f$ up to $\delta$ using interval bound propagation. As the interval is a fairly imprecise relaxation, our result directly applies to more precise convex relaxations (e.g., Zhang et al. (2018); Singh et al. (2019)).

Theorem 1.1 (Universal Interval-Certified Approximation, Figure 1). Let $\Gamma \subset \mathbb{R}^m$ be a compact set and let $f \colon \Gamma \to \mathbb{R}$ be a continuous function. For all $\delta > 0$, there exists a ReLU network $n$ such that for all boxes $[a, b]$ in $\Gamma$ defined by points $a, b \in \Gamma$ where $a_k \le b_k$ for all $k$, the propagation of the box $[a, b]$ using interval analysis through the network $n$, denoted $n^\#([a, b])$, approximates the set $[l, u] = [\min f([a, b]), \max f([a, b])] \subseteq \mathbb{R}$ up to $\delta$:

$$[l + \delta, u - \delta] \subseteq n^\#([a, b]) \subseteq [l - \delta, u + \delta]. \tag{1}$$

We recover the classical universal approximation theorem ($|f(x) - n(x)| \le \delta$ for all $x \in \Gamma$) by considering boxes $[a, b]$ describing points ($x = a = b$). Note that here the lower bound is not $[l, u]$, as the network $n$ is an approximation of $f$. Because interval analysis propagates boxes, the theorem naturally handles $\ell_\infty$-norm-bounded perturbations of the input. Other $\ell_p$ norms can be handled by covering the $\ell_p$ ball with boxes. The theorem can be extended easily to functions $f \colon \Gamma \to \mathbb{R}^k$ by applying it component-wise.

Practical meaning of the theorem. The practical meaning of this theorem is as follows: if we train a neural network $n$ on a given training data set (e.g., CIFAR10) and we are satisfied with the properties of $n$ (e.g., high accuracy), then because $n$ is a continuous function, the theorem tells us that there exists a network $n'$ which is as accurate as $n$ and as certifiable with interval analysis as $n$ is with a complete verifier. This means that if we fail to find such an $n'$, then either the network did not possess the required capacity or the optimizer was unsuccessful.

Focus on the existence of a network. We note that we do not provide a method for training a certified ReLU network; even though our proof is constructive, we aim to answer an existential question and thus focus on proving that such a network exists. Interesting future work items would be to study the requirements on the size of this network and the inherent hardness of finding it with standard optimization methods.
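To make the two containments in (1) concrete, the following minimal sketch propagates a box through two small ReLU networks with interval arithmetic. The networks n1 and n2 below are illustrative stand-ins of our own choosing (the paper's Figure 2 networks are not reproduced here): both compute the same piecewise-linear function exactly, yet interval analysis is loose for one and exact for the other.

```python
# Minimal interval bound propagation (IBP) sketch.
# n1 and n2 are hypothetical ReLU representations of the function
# f(x) = 1 for x <= 0, 1 - x for 0 < x < 1, 0 for x >= 1;
# they stand in for the networks of Figure 2, which are not shown here.

def relu_box(lo, hi):
    # R#([lo, hi]) = [R(lo), R(hi)]
    return max(lo, 0.0), max(hi, 0.0)

def n1_box(lo, hi):
    # n1(x) = R(1 - x) - R(-x); the subtraction makes IBP lose precision.
    a_lo, a_hi = relu_box(1 - hi, 1 - lo)   # R#(1 - [lo, hi])
    b_lo, b_hi = relu_box(-hi, -lo)         # R#(-[lo, hi])
    return a_lo - b_hi, a_hi - b_lo         # interval subtraction

def n2_box(lo, hi):
    # n2(x) = R(1 - R(x)); a composition that IBP happens to track exactly.
    a_lo, a_hi = relu_box(lo, hi)
    return relu_box(1 - a_hi, 1 - a_lo)

def f(x):
    return 1.0 if x <= 0 else (0.0 if x >= 1 else 1.0 - x)

lo, hi, delta = -1.0, 1.0, 0.25
xs = [lo + (hi - lo) * i / 1000 for i in range(1001)]
l, u = min(map(f, xs)), max(map(f, xs))     # true range of f on the box

for name, net in [("n1", n1_box), ("n2", n2_box)]:
    out_lo, out_hi = net(lo, hi)
    # Soundness containment from (1): n#(B) must lie inside [l - d, u + d].
    certified = l - delta <= out_lo and out_hi <= u + delta
    print(name, (out_lo, out_hi), "certified:", certified)
# On [-1, 1]: n1 yields [-1, 2] (certification fails), n2 yields [0, 1].
```

Both networks compute the exact range $[0, 1]$ on $[-1, 1]$, but IBP through n1 returns $[-1, 2]$: subtracting two ReLU terms discards the correlation between them. This representation-dependence is precisely what Theorem 1.1 shows can always be avoided.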
Universal approximation is insufficient. We now discuss why classical universal approximation is insufficient for establishing our result. While classical universal approximation theorems state that neural networks can approximate a large class of functions $f$, unlike our result, they do not state that the robustness of the approximation $n$ of $f$ can actually be certified with a scalable proof method (e.g., interval bound propagation). If one instead uses a non-scalable complete verifier, then the standard universal approximation theorem is sufficient.

To demonstrate this point, consider the function $f \colon \mathbb{R} \to \mathbb{R}$ (Figure 2b) mapping all $x \le 0$ to 1, all $x \ge 1$ to 0, and all $0 < x < 1$ to $1 - x$, and two ReLU networks $n_1$ (Figure 2a) and $n_2$ (Figure 2c) perfectly approximating $f$, that is, $n_1(x) = f(x) = n_2(x)$ for all $x$. For $\delta = \frac{1}{4}$, the interval certification that $n_1$ maps all $x \in [0, 1]$ to $[0, 1]$ fails because $n_1^\#([0, 1]) \not\subseteq [0, 1]$. However, interval certification succeeds for $n_2$, because $n_2^\#([0, 1]) = [0, 1]$. To the best of our knowledge, this is the first work to prove the existence of accurate, interval-certified networks.

2 RELATED WORK

After adversarial examples were discovered by Szegedy et al. (2014), many attacks and defenses were introduced (for a survey, see Akhtar and Mian (2018)). Initial work on verifying neural network robustness used exact methods (Katz et al., 2017; Tjeng et al., 2019) on small networks, while later research introduced methods based on over-approximation (Gehr et al., 2018; Raghunathan et al., 2018a; Singh et al., 2018; Salman et al., 2019) aiming to scale to larger networks. A fundamentally different approach is randomized smoothing (Li et al., 2019; Lécuyer et al., 2019; Cohen et al., 2019b), in which probabilistic classification and certification with high confidence is performed.

As neural networks that are experimentally robust need not be certifiably robust, there has been significant recent research on training certifiably robust neural networks (Raghunathan et al., 2018b; Mirman et al., 2018; 2019; Wong and Kolter, 2018; Wong et al., 2018; Wang et al., 2018a; Gowal et al., 2018; Dvijotham et al., 2018a; Xiao et al., 2019; Cohen et al., 2019b). As these methods appear to have reached a performance wall, several works have started investigating the fundamental barriers in the datasets and methods that preclude the learning of a robust network, let alone a certifiably robust one (Khoury and Hadfield-Menell, 2018; Schmidt et al., 2018; Tsipras et al., 2019). In our work, we focus on the question of whether neural networks are capable of approximating functions whose robustness can be established with the efficient interval relaxation.

Feasibility results with neural networks. Early versions of the Universal Approximation Theorem were stated by Cybenko (1989) and Hornik et al. (1989). Cybenko (1989) showed that networks using sigmoidal activations could approximate continuous functions in the unit hypercube, while Hornik et al. (1989) showed that even networks with only one hidden layer are capable of approximating Borel measurable functions. More recent work has investigated the capabilities of ReLU networks. Here, Arora et al. (2018), based on Tarela and Martínez (1999), proved that every continuous piecewise linear function on $\mathbb{R}^m$ can be represented by a ReLU network. Later, He et al. (2018) reduced the number of neurons needed using ideas from finite element methods.
Relevant to our work, Arora et al. (2018) introduced a ReLU network representation of the min function. Further, we use a construction method similar to the construction of nodal basis functions given in He et al. (2018). Universal approximation for Lipschitz-constrained networks has been considered by Anil et al. (2019) and later by Cohen et al. (2019a). A bound on the Lipschitz constant of a network immediately yields a certified region depending on the classification margin. Anil et al. (2019) proved that the set of Lipschitz networks with the GroupSort activation is dense in the space of Lipschitz continuous functions with Lipschitz constant 1, while Cohen et al. (2019a) provide an explicit construction to obtain the network. We note that both of these works focus on Lipschitz continuous functions, a more restricted class than the continuous functions we consider in our work.

3 BACKGROUND

In this section we provide the concepts necessary to describe our main result.

Adversarial examples and robustness verification. Let $n \colon \mathbb{R}^m \to \mathbb{R}^k$ be a neural network, which classifies an input $x$ to a label $t$ if $n(x)_t > n(x)_j$ for all $j \neq t$. For a correctly classified input $x$, an adversarial example is an input $y$ such that $y$ is imperceptibly different from $x$ to a human, but is classified to a different label by $n$. Frequently, two images are assumed to be imperceptibly different if their $\ell_p$ distance is at most $\epsilon$. The $\ell_p$ ball around an image is called the adversarial ball, and a network is said to be $\epsilon$-robust around $x$ if every point in the adversarial ball around $x$ is classified the same.

Figure 3: Approximating $f$ (Figure 3a) using a ReLU network $n = \xi_0 + \sum_k n_k$. The ReLU networks $n_k$ (Figure 3c) approximate the $N$-slicing of $f$ (Figure 3b), $f_0, \dots, f_4$, as a sum of local bumps (Figure 6).

In this paper, we limit our discussion to $\ell_\infty$ adversarial balls, which can be used to cover all $\ell_p$ balls. The goal of robustness verification is to show that for a neural network $n$, an input point $x$ and a label $t$, every possible input in an $\ell_\infty$ ball of radius $\epsilon$ around $x$ (written $B_\epsilon^\infty(x)$) is also classified to $t$.

Verifying neural networks with interval analysis. The verification technique we investigate in this work is interval analysis. We denote by $\mathcal{B}$ the set of boxes $[a, b] \subseteq \mathbb{R}^m$ for all $m$, where $a_i \le b_i$ for all $i$. Furthermore, for $\Gamma \subseteq \mathbb{R}^m$ we define $\mathcal{B}(\Gamma)$ to be the set of all boxes contained in $\Gamma$. The standard interval transformers for the basic operations we consider, namely $+$, $-$, $\cdot$ and the ReLU function $R$ (Gehr et al. (2018); Gowal et al. (2018)), are

$$[a, b] +^\# [c, d] = [a + c, b + d], \qquad -^\#[a, b] = [-b, -a], \qquad \lambda \cdot^\# [a, b] = [\lambda a, \lambda b], \qquad R^\#([a, b]) = [R(a), R(b)],$$

where $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$ and $\lambda \in \mathbb{R}_{\ge 0}$. We use $\#$ to distinguish a function $f$ from its interval transformer $f^\#$. To illustrate the difference between $f$ and $f^\#$, consider $f(x) := x - x$ evaluated on $x \in [0, 1]$. We have $f([0, 1]) = \{0\}$, but $f^\#([0, 1]) = [0, 1] -^\# [0, 1] = [0, 1] +^\# [-1, 0] = [-1, 1]$, illustrating the loss of precision that interval analysis suffers from. Interval analysis provides a sound over-approximation in the sense that for every function $f$, the values that $f$ can attain on $[a, b]$, namely $f([a, b]) := \{f(x) \mid x \in [a, b]\}$, are a subset of $f^\#([a, b])$. If $f$ is a composition of functions, $f = f_1 \circ \cdots \circ f_k$, then $f_1^\# \circ \cdots \circ f_k^\#$ is a sound interval transformer for $f$. Furthermore, all combinations $f$ of $+$, $-$, $\cdot$ and $R$ are monotone, that is, for $[a, b], [c, d] \in \mathcal{B}(\mathbb{R}^m)$ such that $[a, b] \subseteq [c, d]$ we have $f^\#([a, b]) \subseteq f^\#([c, d])$ (Appendix A).
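A minimal sketch of these interval transformers in Python, reproducing the $x - x$ example above; the Box class and method names are ours, introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """A one-dimensional interval [lo, hi] with lo <= hi."""
    lo: float
    hi: float

    def __add__(self, other):   # [a,b] +# [c,d] = [a+c, b+d]
        return Box(self.lo + other.lo, self.hi + other.hi)

    def __neg__(self):          # -#[a,b] = [-b, -a]
        return Box(-self.hi, -self.lo)

    def __sub__(self, other):   # [a,b] -# [c,d] = [a,b] +# (-#[c,d])
        return self + (-other)

    def scale(self, lam):       # lam *# [a,b] = [lam*a, lam*b] for lam >= 0
        assert lam >= 0
        return Box(lam * self.lo, lam * self.hi)

def relu(box):                  # R#([a,b]) = [R(a), R(b)]
    return Box(max(box.lo, 0.0), max(box.hi, 0.0))

# Loss of precision for f(x) = x - x: the exact range on [0,1] is {0},
# but the transformer forgets that both operands are the same x.
x = Box(0.0, 1.0)
print(x - x)        # Box(lo=-1.0, hi=1.0), not [0, 0]
print(relu(-x))     # Box(lo=0.0, hi=0.0)
```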
For boxes $[x, x]$ representing points, $f^\#$ coincides with $f$, that is, $f^\#([x, x]) = f(x)$. This will be needed later.

4 PROVING UNIVERSAL INTERVAL-PROVABLE APPROXIMATION

In this section, we provide an explanation of the proof of our main result, Theorem 4.6, and illustrate the main points of the proof. The first step in the construction is to deconstruct the function $f$ into slices $\{f_k \colon \Gamma \to [0, \frac{\delta}{2}]\}_{0 \le k \le N-1}$ (Figure 3b), which sum back to $f$, namely $f = \xi_0 + \sum_k f_k$ (Definition A.2, Lemma A.3). Lemma A.18 then shows that for all $\delta > 0$ there exists a set of ReLU networks $\{n_k\}_{0 \le k \le N-1}$ (Figure 3c) whose interval behavior closely tracks these slices; combining them yields the certified approximation.

Theorem 4.5. Let $\Gamma \subset \mathbb{R}^m$ be compact and let $f \colon \Gamma \to \mathbb{R}$ be continuous. For all $\delta > 0$, there exists a ReLU network $n$ such that for all $B \in \mathcal{B}(\Gamma)$

$$[l + \delta, u - \delta] \subseteq n^\#(B) \subseteq [l - \delta, u + \delta],$$

where $l := \min f(B)$ and $u := \max f(B)$.

Proof. Pick $N$ such that the height of each slice is exactly $\frac{\delta}{2}$; if this is impossible, choose a slightly smaller $\delta$. Let $\{n_k\}_{0 \le k \le N-1}$ be the networks from Lemma A.18 approximating the $N$-slicing of $f$, and set $n := \xi_0 + \frac{\delta}{2} \sum_{k=0}^{N-1} n_k$ (Figure 3). Let $p$ and $q$ be such that $\xi_p \le l \le \xi_{p+1}$ and $\xi_q \le u \le \xi_{q+1}$. If $[l + \delta, u - \delta]$ is empty, the lower bound we want to prove becomes vacuous and only the upper one needs to be proven. Applying Lemma A.18 to each slice, together with the monotonicity of interval analysis, we thus have

$$[l + \delta, u - \delta] \subseteq [\xi_{p+2}, \xi_{q-1}] \subseteq n^\#(B) \subseteq [\xi_{p-1}, \xi_{q+2}] \subseteq [l - \delta, u + \delta].$$

Theorem 4.6 (Universal Interval-Provable Approximation). Let $\Gamma \subset \mathbb{R}^m$ be compact and $f \colon \Gamma \to \mathbb{R}^d$ be continuous. For all $\delta \in \mathbb{R}^d_{>0}$ there exists a ReLU network $n$ such that for all $B \in \mathcal{B}(\Gamma)$

$$[l + \delta, u - \delta] \subseteq n^\#(B) \subseteq [l - \delta, u + \delta],$$

where $l, u \in \mathbb{R}^d$ are such that $l_k := \min f(B)_k$ and $u_k := \max f(B)_k$ for all $k$.

Proof. This is a direct consequence of using Theorem 4.5 and the Tietze extension theorem to produce a neural network for each dimension $d$ of the codomain of $f$.

Note that Theorem 1.1 is the special case of Theorem 4.6 with $d = 1$, stated so to simplify presentation.

5 CONCLUSION

We proved that for all real-valued continuous functions $f$ on compact sets, there exists a ReLU network $n$ approximating $f$ arbitrarily well under the interval abstraction. This means that for arbitrary input sets, analysis using the interval relaxation yields an over-approximation arbitrarily close to the smallest interval containing all possible outputs. Our theorem affirmatively answers the open question of whether the Universal Approximation Theorem generalizes to interval analysis. Our results address the question of whether the interval abstraction is expressive enough to analyse networks approximating interesting functions $f$. This is of practical importance because interval analysis is the most scalable non-trivial analysis.

REFERENCES

Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.

Cem Anil, James Lucas, and Roger B. Grosse. Sorting out Lipschitz function approximation. In International Conference on Machine Learning (ICML), 2019.

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations (ICLR), 2018.

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning (ICML), 2018.

Jeremy E. J. Cohen, Todd Huster, and Ra Cohen. Universal Lipschitz approximation in bounded depth neural networks. arXiv preprint arXiv:1904.04861, 2019a.

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), 2019b.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems (MCSS), 1989.

Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O'Donoghue, Jonathan Uesato, and Pushmeet Kohli. Training verified learners with learned verifiers.
arXiv preprint arXiv:1805.10265, 2018a.

Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A. Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. In Uncertainty in Artificial Intelligence (UAI), 2018b.

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (SP), 2018.

Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy A. Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.

Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. In International Conference on Learning Representations (ICLR), Workshop, 2015.

Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973, 2018.

Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.

Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification (CAV), 2017.

Marc Khoury and Dylan Hadfield-Menell. On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525, 2018.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.

Mathias Lécuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy (SP), 2019.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Matthew Mirman, Timon Gehr, and Martin T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML), 2018.

Matthew Mirman, Gagandeep Singh, and Martin T. Vechev. A provable defense for deep residual networks. arXiv preprint arXiv:1903.12519, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), 2016.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems (NeurIPS), 2018a.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang.
Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR), 2018b.

Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin T. Vechev. Fast and effective robustness certification. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin T. Vechev. An abstract domain for certifying neural networks. PACMPL (POPL), 2019.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

J. M. Tarela and M. V. Martínez. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 1999.

Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations (ICLR), 2019.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), 2019.

Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. MixTrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018a.

Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018b.

Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane S. Boning, and Inderjit S. Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (ICML), 2018.

Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML), 2018.

Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Kai Y. Xiao, Vincent Tjeng, Nur Muhammad (Mahi) Shafiullah, and Aleksander Madry. Training for faster adversarial robustness verification via inducing ReLU stability. In International Conference on Learning Representations (ICLR), 2019.

Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Stephan Zheng, Yang Song, Thomas Leung, and Ian J. Goodfellow. Improving the robustness of deep neural networks via stability training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
A PROOFS FOR THE UNIVERSAL INTERVAL-CERTIFIED APPROXIMATION

Lemma A.1 (Monotonicity). The operations $+^\#$ and $-^\#$ are monotone, that is, for all $[a_1, b_1], [a_2, b_2], [c_1, d_1], [c_2, d_2] \in \mathcal{B}(\mathbb{R})$ such that $[a_1, b_1] \subseteq [a_2, b_2]$ and $[c_1, d_1] \subseteq [c_2, d_2]$, it holds that

$$[a_1, b_1] +^\# [c_1, d_1] \subseteq [a_2, b_2] +^\# [c_2, d_2], \qquad [a_1, b_1] -^\# [c_1, d_1] \subseteq [a_2, b_2] -^\# [c_2, d_2].$$

Further, the operations $\lambda \cdot^\#$ and $R^\#$ are monotone, that is, for all $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$ and all $\lambda \in \mathbb{R}_{\ge 0}$ such that $[a, b] \subseteq [c, d]$, it holds that $\lambda \cdot^\# [a, b] \subseteq \lambda \cdot^\# [c, d]$ and $R^\#([a, b]) \subseteq R^\#([c, d])$.

Proof.
$$[a_1, b_1] +^\# [c_1, d_1] = [a_1 + c_1, b_1 + d_1] \subseteq [a_2 + c_2, b_2 + d_2] = [a_2, b_2] +^\# [c_2, d_2],$$
$$[a_1, b_1] -^\# [c_1, d_1] = [a_1 - d_1, b_1 - c_1] \subseteq [a_2 - d_2, b_2 - c_2] = [a_2, b_2] -^\# [c_2, d_2],$$
$$\lambda \cdot^\# [a, b] = [\lambda a, \lambda b] \subseteq [\lambda c, \lambda d] = \lambda \cdot^\# [c, d],$$
$$R^\#([a, b]) = [R(a), R(b)] \subseteq [R(c), R(d)] = R^\#([c, d]).$$

Definition A.2 (N-slicing). Let $\Gamma \subset \mathbb{R}^m$ be a compact $m$-dimensional box and let $f \colon \Gamma \to \mathbb{R}$ be continuous. The $N$-slicing of $f$ is a set of functions $\{f_k\}_{0 \le k \le N-1}$ defined by

$$f_k \colon \Gamma \to \mathbb{R}, \quad x \mapsto \begin{cases} 0 & \text{if } f(x) \le \xi_k, \\ f(x) - \xi_k & \text{if } \xi_k < f(x) < \xi_{k+1}, \\ \xi_{k+1} - \xi_k & \text{otherwise}, \end{cases} \qquad k \in \{0, \dots, N-1\},$$

where $\xi_k := \xi_{\min} + \frac{k}{N}(\xi_{\max} - \xi_{\min})$ for $k \in \{0, \dots, N\}$, $\xi_{\min} := \min f(\Gamma)$ and $\xi_{\max} := \max f(\Gamma)$.

Lemma A.3 (N-slicing). Let $\{f_k\}_{0 \le k \le N-1}$ be the $N$-slicing of $f$. Then for all $x \in \Gamma$ we have $f(x) = \xi_0 + \sum_{k=0}^{N-1} f_k(x)$.

Proof. Pick $x \in \Gamma$ and let $l \in \{0, \dots, N-1\}$ be such that $\xi_l \le f(x) \le \xi_{l+1}$. Then

$$\xi_0 + \sum_{k=0}^{N-1} f_k(x) = \xi_0 + \sum_{k=0}^{l-1} f_k(x) + f_l(x) + \sum_{k=l+1}^{N-1} f_k(x) = \xi_0 + \sum_{k=0}^{l-1} (\xi_{k+1} - \xi_k) + f_l(x) = \xi_l + f_l(x) = f(x).$$

Definition A.4 (clipping). Let $b \in \mathbb{R}$. We define the clipping function $R_{[-\infty, b]} \colon \mathbb{R} \to \mathbb{R}$ by $R_{[-\infty, b]}(x) := b - R(b - x)$.

Lemma A.5 (clipping). The function $R_{[-\infty, b]}$ sends all $x \le b$ to $x$ and all $x > b$ to $b$. Further, $R_{[-\infty, b]}^\#([a', b']) = [R_{[-\infty, b]}(a'), R_{[-\infty, b]}(b')]$.

Proof. For $x \le b$ we have $R_{[-\infty, b]}(x) = b - R(b - x) = b - b + x = x$, and for $x > b$ we have $R_{[-\infty, b]}(x) = b - R(b - x) = b - 0 = b$. On boxes,

$$R_{[-\infty, b]}^\#([a', b']) = b -^\# R^\#(b -^\# [a', b']) = b -^\# R^\#([b - b', b - a']) = b -^\# [R(b - b'), R(b - a')] = [b - R(b - a'), b - R(b - b')] = [R_{[-\infty, b]}(a'), R_{[-\infty, b]}(b')].$$

Definition A.6 (nmin). We define the ReLU network $n_{\min} \colon \mathbb{R}^2 \to \mathbb{R}$ by

$$n_{\min}(x, y) := \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} R\!\left( \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \right).$$

Lemma A.7 (nmin). Let $x, y \in \mathbb{R}$. Then $n_{\min}(x, y) = \min(x, y)$.

Proof. Because $n_{\min}$ is symmetric in its arguments, we assume w.l.o.g. $x \ge y$, so that $R(x - y) = x - y$ and $R(y - x) = 0$:

$$n_{\min}(x, y) = \tfrac{1}{2}\big(R(x + y) - R(-x - y) - R(x - y) - R(y - x)\big) = \tfrac{1}{2}\big(R(x + y) - R(-x - y) - (x - y)\big).$$

If $x + y \ge 0$, then $n_{\min}(x, y) = \frac{1}{2}(x + y - x + y) = y$. If $x + y < 0$, then $n_{\min}(x, y) = \frac{1}{2}(x + y - x + y) = y$. In both cases $n_{\min}(x, y) = y = \min(x, y)$.
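The exact sign pattern in Definition A.6 was garbled in extraction; the sketch below assumes the reconstruction used above, i.e. $\min(x, y) = \frac{1}{2}(x + y - |x - y|)$ with weights $(1, -1, -1, -1)$ over ReLUs of $x+y$, $-x-y$, $x-y$, $y-x$, and checks it pointwise together with one interval case of Lemma A.9. All function names are ours.

```python
import random

def relu(z):
    return max(z, 0.0)

def nmin(x, y):
    # n_min(x, y) = 1/2 (R(x+y) - R(-x-y) - R(x-y) - R(y-x))
    #             = 1/2 ((x + y) - |x - y|) = min(x, y)
    return 0.5 * (relu(x + y) - relu(-x - y) - relu(x - y) - relu(y - x))

def nmin_box(a, b, c, d):
    # Interval transformer of n_min on [a,b] x [c,d], propagating each
    # ReLU term with R#([l,u]) = [R(l), R(u)] and combining soundly.
    s = (relu(a + c), relu(b + d))    # R#([a,b] +# [c,d])
    t = (relu(-b - d), relu(-a - c))  # R#(-#([a,b] +# [c,d]))
    p = (relu(a - d), relu(b - c))    # R#([a,b] -# [c,d])
    q = (relu(c - b), relu(d - a))    # R#([c,d] -# [a,b])
    lo = 0.5 * (s[0] - t[1] - p[1] - q[1])
    hi = 0.5 * (s[1] - t[0] - p[0] - q[0])
    return lo, hi

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(nmin(x, y) - min(x, y)) < 1e-9   # Lemma A.7

# Lemma A.9, overlapping case (a < d and b > c):
# nmin#([a,b],[c,d]) = [a + c - (b+d)/2, (b+d)/2]
a, b, c, d = 0.0, 2.0, 1.0, 3.0
lo, hi = nmin_box(a, b, c, d)
assert (lo, hi) == (a + c - (b + d) / 2, (b + d) / 2)
print("nmin checks passed:", (lo, hi))
```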
Definition A.8 (nmin_N). For all $N \in \mathbb{N}_{\ge 1}$, we define a ReLU network $n_{\min_N}$ by

$$n_{\min_1}(x) := x, \qquad n_{\min_N}(x_1, \dots, x_N) := n_{\min}\big(n_{\min_{\lceil N/2 \rceil}}(x_1, \dots, x_{\lceil N/2 \rceil}),\; n_{\min_{\lfloor N/2 \rfloor}}(x_{\lceil N/2 \rceil + 1}, \dots, x_N)\big).$$

Lemma A.9. Let $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$. Then $n_{\min}^\#([a, b], [c, d]) = n_{\min}^\#([c, d], [a, b])$ and

$$n_{\min}^\#([a, b], [c, d]) = \begin{cases} [c + \frac{a - b}{2},\; d + \frac{b - a}{2}] & \text{if } d \le a, \\ [a + \frac{c - d}{2},\; b + \frac{d - c}{2}] & \text{if } b \le c, \\ [a + c - \frac{b + d}{2},\; \frac{b + d}{2}] & \text{if } a < d \text{ and } b > c. \end{cases}$$

Proof. The symmetry on abstract elements is immediate. In the following, we omit some of the $\#$ superscripts to improve readability.

$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} R^\#\!\begin{pmatrix} [a + c, b + d] \\ [-b - d, -a - c] \\ [a - d, b - c] \\ [c - b, d - a] \end{pmatrix} = \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} \begin{pmatrix} [R(a + c), R(b + d)] \\ [R(-b - d), R(-a - c)] \\ [R(a - d), R(b - c)] \\ [R(c - b), R(d - a)] \end{pmatrix}$$
$$= \tfrac{1}{2}\big([R(a + c) - R(-a - c),\; R(b + d) - R(-b - d)] + [-R(b - c) - R(d - a),\; -R(a - d) - R(c - b)]\big).$$

Claim: $R(a + c) - R(-a - c) = a + c$. Indeed, if $a + c > 0$ then $-a - c < 0$ and the claim follows; if $a + c \le 0$ then $-a - c \ge 0$ and $R(a + c) - R(-a - c) = -(-a - c) = a + c$. Similarly $R(b + d) - R(-b - d) = b + d$. So the expression simplifies to

$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [-R(b - c) - R(d - a),\; -R(a - d) - R(c - b)]\big).$$

We proceed by case distinction.

Case 1: $b \le c$. Then $a \le b \le c \le d$ and
$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [a - d, b - c]\big) = \tfrac{1}{2}\big([2a + c - d,\; 2b + d - c]\big) = [a + \tfrac{c - d}{2},\; b + \tfrac{d - c}{2}].$$

Case 2: $d \le a$. Then $c \le d \le a \le b$; by the symmetry of $n_{\min}^\#$ this is equivalent to Case 1, hence $n_{\min}^\#([a, b], [c, d]) = [c + \frac{a - b}{2},\; d + \frac{b - a}{2}]$.

Case 3: $a - d < 0$ and $b - c > 0$:
$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [c - b + a - d,\; 0]\big) = \tfrac{1}{2}\big([2a + 2c - b - d,\; b + d]\big) = [a + c - \tfrac{b + d}{2},\; \tfrac{b + d}{2}].$$

Combining the three cases yields the claimed formula.

Definition A.10 (neighboring grid points). Let $G := (\frac{\mathbb{Z}}{M})^m$ for some $M \in \mathbb{N}$. We define the set of neighboring grid points of $x \in \Gamma$ by $N(x) := \{g \in G \mid \|x - g\|_\infty \le \frac{1}{M}\} \setminus \{x\}$. For $U \subseteq \mathbb{R}^m$, we define $N(U) := \bigcup \{N(x) \mid x \in U\} \setminus U$.

Definition A.11 (local bump). Let $M \in \mathbb{N}$, $G := (\frac{\mathbb{Z}}{M})^m$, $\ell := 2^{\lceil \log_2 2m \rceil + 1}$, and let $c = \{\frac{i^l_1}{M}, \frac{i^u_1}{M}\} \times \cdots \times \{\frac{i^l_m}{M}, \frac{i^u_m}{M}\} \subseteq G$. We define a ReLU network $\varphi_c \colon \mathbb{R}^m \to [0, 1]$ with respect to the grid $G$ by

$$\varphi_c(x) := R\Big(n_{\min_{2m}}\Big(\Big\{ R_{[-\infty, 1]}\big(M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big),\; R_{[-\infty, 1]}\big(M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big) \Big\}_{1 \le k \le m}\Big)\Big).$$

Lemma A.12. It holds that

$$\varphi_c(x) = \begin{cases} 0 & \text{if } x \notin \operatorname{conv}(N(c)), \\ 1 & \text{if } x \in \operatorname{conv}(c), \\ R\Big(\min \bigcup_{k=1}^m \big\{M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big\} \cup \big\{M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big\}\Big) & \text{otherwise}. \end{cases}$$

Proof. By case distinction.

Case $x \notin \operatorname{conv}(N(c))$: Then there exists a $k$ such that either $x_k < \frac{i^l_k - 1}{M}$ or $x_k > \frac{i^u_k + 1}{M}$, so $M\ell(x_k - \frac{i^l_k}{M}) + 1$ or $M\ell(\frac{i^u_k}{M} - x_k) + 1$ is less than or equal to $1 - \ell \le 0$. Hence the minimum is non-positive and $\varphi_c(x) = 0$.

Case $x \in \operatorname{conv}(c)$: Then for all $k$ it holds that $\frac{i^l_k}{M} \le x_k \le \frac{i^u_k}{M}$. Thus $M\ell(x_k - \frac{i^l_k}{M}) + 1 \ge 1$ and $M\ell(\frac{i^u_k}{M} - x_k) + 1 \ge 1$ for all $k$, so every clipped argument equals 1 and $\varphi_c(x) = R(1) = 1$.

Case otherwise: For every such $x$ there exists a $k$ such that $M\ell(x_k - \frac{i^l_k}{M}) + 1$ or $M\ell(\frac{i^u_k}{M} - x_k) + 1$ is less than or equal to all other arguments of the minimum and less than or equal to 1. If this smallest element is smaller than 0, then $\varphi_c(x)$ evaluates to 0; otherwise it evaluates to that element. Thus we can drop the clipping $R_{[-\infty, 1]}$ from the arguments and apply the outer $R$ to the plain minimum:

$$\varphi_c(x) = R\Big(\min \bigcup_{k=1}^m \big\{M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big\} \cup \big\{M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big\}\Big).$$

Lemma A.13. Let $[u_1, 1], \dots, [u_N, 1]$ be abstract elements of the interval domain $\mathcal{B}$. Then

$$n_{\min_N}^\#([u_1, 1], \dots, [u_N, 1]) = [u_1 + \dots + u_N + 1 - N,\; 1].$$

Proof. By induction. Base case: for $N = 1$, $n_{\min_1}^\#([u_1, 1]) = [u_1, 1]$; for $N = 2$, $n_{\min_2}^\#([u_1, 1], [u_2, 1]) = [u_1 + u_2 - 1, 1]$ by Lemma A.9. Induction hypothesis: the property holds for all $N'$ with $0 < N' \le N - 1$. Induction step: then it also holds for $N$:
$$n_{\min_N}^\#([u_1, 1], \dots, [u_N, 1]) = n_{\min}^\#\big(n_{\min_{\lceil N/2 \rceil}}^\#([u_1, 1], \dots, [u_{\lceil N/2 \rceil}, 1]),\; n_{\min_{\lfloor N/2 \rfloor}}^\#([u_{\lceil N/2 \rceil + 1}, 1], \dots, [u_N, 1])\big)$$
$$= n_{\min}^\#\big([u_1 + \dots + u_{\lceil N/2 \rceil} + 1 - \lceil N/2 \rceil,\; 1],\; [u_{\lceil N/2 \rceil + 1} + \dots + u_N + 1 - N + \lceil N/2 \rceil,\; 1]\big)$$
$$\overset{\text{Lemma A.9}}{=} [u_1 + \dots + u_N + 2 - \lceil N/2 \rceil - N + \lceil N/2 \rceil - 1,\; 1] = [u_1 + \dots + u_N + 1 - N,\; 1].$$

Lemma A.14. Let $[a, b], [u, 1] \in \mathcal{B}(\mathbb{R}_{\le 1})$. Then $n_{\min}^\#([a, b], [u, 1]) \subseteq [a + u - 1,\; \frac{b + 1}{2}]$.

Proof. By Lemma A.9,

$$n_{\min}^\#([a, b], [u, 1]) = \begin{cases} [a + \frac{u - 1}{2},\; b + \frac{1 - u}{2}] & \text{if } b \le u, \\ [a + u - \frac{b + 1}{2},\; \frac{b + 1}{2}] & \text{if } a < 1 \text{ and } b > u. \end{cases}$$

If $b \le u$, then $a + \frac{u - 1}{2} \ge a + u - 1$ (since $u \le 1$) and $b + \frac{1 - u}{2} \le \frac{b + 1}{2}$ (since $b \le u$). If $u \le b$, then $a + u - \frac{b + 1}{2} \ge a + u - 1$ (since $b \le 1$). In both cases $n_{\min}^\#([a, b], [u, 1]) \subseteq [a + u - 1,\; \frac{b + 1}{2}]$.

Lemma A.15. Let $N \in \mathbb{N}_{\ge 2}$ and let $[u_1, 1], \dots, [u_{N-1}, 1], [u_N, d] \in \mathcal{B}(\mathbb{R}_{\le 1})$ be abstract elements of the interval domain $\mathcal{B}$. Furthermore, let $H(x) := \frac{1 + x}{2}$. Then, for any ordering of the arguments, there exists a $u \in \mathbb{R}$ such that

$$n_{\min_N}^\#([u_1, 1], \dots, [u_{N-1}, 1], [u_N, d]) \subseteq [u,\; H^{\lceil \log_2 N \rceil + 1}(d)].$$

Proof. By induction. For $N = 2$, Lemma A.14 gives $n_{\min_2}^\#([u_1, 1], [u_2, d]) \subseteq [u_2 + u_1 - 1,\; H(d)]$. For $N = 3$, either

$$n_{\min_3}^\#([u_1, 1], [u_2, 1], [u_3, d]) = n_{\min}^\#\big(n_{\min}^\#([u_1, 1], [u_2, 1]),\; [u_3, d]\big) = n_{\min}^\#([u_1 + u_2 - 1, 1], [u_3, d]) \subseteq [u_3 + u_1 + u_2 - 2,\; H(d)],$$

or the box $[u_3, d]$ appears in the left subtree:

$$n_{\min_3}^\#([u_3, d], [u_1, 1], [u_2, 1]) = n_{\min}^\#\big(n_{\min}^\#([u_3, d], [u_1, 1]),\; [u_2, 1]\big) \subseteq n_{\min}^\#\big([u_3 + u_1 - 1,\; H(d)],\; [u_2, 1]\big) \subseteq [u_3 + u_1 + u_2 - 2,\; H^2(d)].$$

So $n_{\min_3}^\#$ is always included in $[u_1 + u_2 + u_3 - 2,\; H^2(d)]$, and $H^2(d) \le H^{\lceil \log_2 3 \rceil + 1}(d)$. Induction hypothesis: the statement holds for all $2 \le N' \le N - 1$. Induction step: then the property also holds for $N$; if, say, the box $[u_N, d]$ lies in the left subtree,

$$n_{\min_N}^\#([u_N, d], [u_1, 1], \dots, [u_{N-1}, 1]) = n_{\min}^\#\big(n_{\min_{\lceil N/2 \rceil}}^\#([u_N, d], [u_1, 1], \dots),\; n_{\min_{\lfloor N/2 \rfloor}}^\#([u_{\lceil N/2 \rceil}, 1], \dots, [u_{N-1}, 1])\big)$$
$$\subseteq n_{\min}^\#\big([u',\; H^{\lceil \log_2 \lceil N/2 \rceil \rceil + 1}(d)],\; [u'', 1]\big) = n_{\min}^\#\big([u',\; H^{\lceil \log_2 N \rceil - 1 + 1}(d)],\; [u'', 1]\big) \subseteq [u' + u'' - 1,\; H^{\lceil \log_2 N \rceil + 1}(d)],$$

using Lemma A.14 and $\lceil \log_2 \lceil N/2 \rceil \rceil = \lceil \log_2 N \rceil - 1$; the other orderings of the arguments are handled similarly.

Lemma A.16. Let $H(x) := \frac{1 + x}{2}$. For all $N \in \mathbb{N}_{>0}$, $d \le 1 - 2^N$ implies $H^N(d) \le 0$.

Proof. By induction. For $N = 1$: $H(1 - 2) = \frac{1 + 1 - 2}{2} = 0$, and $H$ is monotone. Induction hypothesis: the statement holds for all $N'$ with $0 < N' \le N$. Induction step: for $d \le 1 - 2^{N+1}$,

$$H^{N+1}(d) \le H^{N+1}(1 - 2^{N+1}) = H^N\big(H(1 - 2^{N+1})\big) = H^N\Big(\tfrac{1 + 1 - 2^{N+1}}{2}\Big) = H^N(1 - 2^N) \le 0.$$

Lemma A.17. For all boxes $B \in \mathcal{B}(\Gamma)$, we have

$$\varphi_c^\#(B) = \begin{cases} [1, 1] & \text{if } B \subseteq \operatorname{conv}(c), \\ [0, 0] & \text{if } B \subseteq \Gamma \setminus \operatorname{conv}(N(c)). \end{cases}$$

Furthermore, $\varphi_c^\#(B) \subseteq [0, 1]$.

Proof. Let $\varphi_c$ be a local bump and let $B = [a, b] \in \mathcal{B}(\mathbb{R}^m)$. Let $[r^1_k, s^1_k], [r^2_k, s^2_k] \in \mathcal{B}(\mathbb{R})$ be such that $M\ell([a_k, b_k] - \frac{i^l_k}{M}) + 1 = [r^1_k, s^1_k]$ and $M\ell(\frac{i^u_k}{M} - [a_k, b_k]) + 1 = [r^2_k, s^2_k]$.

If $[a, b] \subseteq \operatorname{conv}(c)$: then $1 \le r^1_k$ and $1 \le r^2_k$ for all $k \in \{1, \dots, m\}$. Thus

$$\varphi_c^\#([a, b]) = R^\#\big(n_{\min_{2m}}^\#\{R_{[-\infty, 1]}^\#([r^p_k, s^p_k])\}_{(p, k) \in \{1, 2\} \times \{1, \dots, m\}}\big) = R^\#\big(n_{\min_{2m}}^\#\{[1, 1]\}_{(p, k)}\big) = [1, 1].$$

If $[a, b] \subseteq \Gamma \setminus \operatorname{conv}(N(c))$: then there exists a $(p', k') \in \{1, 2\} \times \{1, \dots, m\}$ such that $s^{p'}_{k'} \le 1 - 2^{\lceil \log_2 2m \rceil + 1}$. By monotonicity (Lemma A.1) we may enlarge all other arguments to $[R_{[-\infty, 1]}(r^p_k), 1]$; then, using Lemma A.16 and Lemma A.15, we know that there exists a $u \in \mathbb{R}$ such that

$$\varphi_c^\#([a, b]) = R^\#\big(n_{\min_{2m}}^\#\{[R_{[-\infty, 1]}(r^p_k), R_{[-\infty, 1]}(s^p_k)]\}_{(p, k)}\big) \subseteq R^\#([u, 0]) = [0, 0].$$

For any $[a, b] \in \mathcal{B}(\Gamma)$ we have $\varphi_c^\#([a, b]) \subseteq [0, 1]$ by construction.
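A small numerical sketch of the local bump of Definition A.11, evaluated pointwise via the closed form of Lemma A.12; the function names and grid parameters are ours, chosen for the example.

```python
import math

def local_bump(x, i_lo, i_hi, M):
    """Pointwise value of the local bump phi_c of Definition A.11 via the
    closed form of Lemma A.12:
      phi_c(x) = R(min_k min(M*l*(x_k - i_lo_k/M) + 1, M*l*(i_hi_k/M - x_k) + 1)),
    clipped to at most 1 (the value is 1 on conv(c), 0 outside conv(N(c)))."""
    m = len(x)
    l = 2 ** (math.ceil(math.log2(2 * m)) + 1)   # slope factor ell
    args = []
    for k in range(m):
        args.append(M * l * (x[k] - i_lo[k] / M) + 1)
        args.append(M * l * (i_hi[k] / M - x[k]) + 1)
    return max(0.0, min(1.0, min(args)))

M, i_lo, i_hi = 4, (1, 1), (2, 2)   # conv(c) is the grid cell [1/4, 1/2]^2
assert local_bump((0.3, 0.4), i_lo, i_hi, M) == 1.0   # inside conv(c)
assert local_bump((0.9, 0.9), i_lo, i_hi, M) == 0.0   # outside conv(N(c))
mid = local_bump((0.25 - 1/64, 0.3), i_lo, i_hi, M)   # on the ramp of width 1/(M*l)
assert 0.0 < mid < 1.0
print("local bump checks passed, ramp value:", mid)
```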
Lemma A.18. Let $\Gamma \subset \mathbb{R}^m$ be a closed box and let $f \colon \Gamma \to \mathbb{R}$ be continuous. For all $\delta > 0$ there exists a set of ReLU networks $\{n_k\}_{0 \le k \le N-1}$ of size $N \in \mathbb{N}$ approximating the $N$-slicing $\{f_k\}_{0 \le k \le N-1}$ of $f$ ($\xi_k$ as in Definition A.2) such that for all boxes $B \in \mathcal{B}(\Gamma)$

$$n_k^\#(B) = \begin{cases} [0, 0] & \text{if } \max f(B) \le \xi_k - \frac{\delta}{2}, \\ [1, 1] & \text{if } \min f(B) \ge \xi_{k+1} + \frac{\delta}{2}, \end{cases}$$

and $n_k^\#(B) \subseteq [0, 1]$.

Proof. Let $N \in \mathbb{N}$ be such that $N \ge 2\,\frac{\xi_{\max} - \xi_{\min}}{\delta}$, where $\xi_{\min} := \min f(\Gamma)$ and $\xi_{\max} := \max f(\Gamma)$. For simplicity we assume $\Gamma = [0, 1]^m$. Using the Heine-Cantor theorem, we get that $f$ is uniformly continuous, thus there exists a $\delta' > 0$ such that for all $x, y \in \Gamma$, $\|y - x\|_\infty < \delta'$ implies $|f(y) - f(x)| < \frac{\delta}{2}$. Further, let $M \in \mathbb{N}$ be such that $\frac{1}{M} \le \delta'$, and let $G$ be the grid defined by $G := (\frac{\mathbb{Z}}{M})^m \cap \Gamma$.

Let $C(B)$ be the set of corner points of the smallest hyperrectangle with corners in $G$ containing $B \in \mathcal{B}(\Gamma)$. We construct the set

$$\mathcal{C}_k := \Big\{ C(B) \;\Big|\; B \in \mathcal{B}(\Gamma),\; \min f(B) \ge \xi_{k+1} + \tfrac{\delta}{2} \Big\}.$$

We claim that $\{n_k\}_{0 \le k \le N-1}$ defined by

$$n_k(x) := R_{[-\infty, 1]}\Big(\sum_{c \in \mathcal{C}_k} \varphi_c(x)\Big)$$

satisfies the condition.

Case 1: Let $B \in \mathcal{B}(\Gamma)$ be such that $\min f(B) \ge \xi_{k+1} + \frac{\delta}{2}$. Then for all $g \in N(B)$ it holds that $f_k(g) = \frac{\delta}{2}$. By construction there exists a $c' \in \mathcal{C}_k$ such that $B \subseteq \operatorname{conv}(c')$. Using Lemma A.17 we get

$$n_k^\#(B) = R_{[-\infty, 1]}^\#\Big(\varphi_{c'}^\#(B) +^\# \sum_{c \in \mathcal{C}_k \setminus \{c'\}} \varphi_c^\#(B)\Big) = R_{[-\infty, 1]}^\#\big([1, 1] + [p_1, p_2]\big) = [1, 1],$$

where $[p_1, p_2] \in \mathcal{B}(\mathbb{R}_{\ge 0})$, since each $\varphi_c^\#(B) \subseteq [0, 1]$ by Lemma A.17.

Case 2: Let $B \in \mathcal{B}(\Gamma)$ be such that $\max f(B) \le \xi_k - \frac{\delta}{2}$. Then for all $g \in N(B)$ it holds that $f_k(g) = 0$. Further, $B \cap \operatorname{conv}(N(c)) = \emptyset$ for all $c \in \mathcal{C}_k$ because $G$ is fine enough. Using Lemma A.17 we obtain

$$n_k^\#(B) = R_{[-\infty, 1]}^\#\Big(\sum_{c \in \mathcal{C}_k} \varphi_c^\#(B)\Big) = R_{[-\infty, 1]}^\#([0, 0]) = [0, 0].$$

By construction we have $n_k^\#(B) \subseteq [0, 1]$.
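To close, a small sketch checking two of the interval facts used above: the clipping transformer of Lemma A.5 and the $n_{\min_N}$ propagation of Lemma A.13. The function interval_min implements the closed form proved in Lemma A.9 rather than propagating through the four ReLUs (the two are equal by that lemma); all names are ours.

```python
def clip_box(lo, hi, b):
    # R#_[-inf,b]([lo, hi]) = [b - R(b - lo), b - R(b - hi)]  (Lemma A.5)
    return b - max(b - lo, 0.0), b - max(b - hi, 0.0)

def interval_min(x, y):
    # nmin# on boxes, using the closed form of Lemma A.9.
    (a, b), (c, d) = x, y
    if d <= a:
        return c + (a - b) / 2, d + (b - a) / 2
    if b <= c:
        return a + (c - d) / 2, b + (d - c) / 2
    return a + c - (b + d) / 2, (b + d) / 2

def interval_min_n(boxes):
    # nmin_N via the balanced recursion of Definition A.8.
    if len(boxes) == 1:
        return boxes[0]
    half = (len(boxes) + 1) // 2          # ceil(N/2)
    return interval_min(interval_min_n(boxes[:half]),
                        interval_min_n(boxes[half:]))

# Lemma A.5: clipping at b = 1 sends [0.25, 3] to [0.25, 1].
assert clip_box(0.25, 3.0, 1.0) == (0.25, 1.0)

# Lemma A.13: nmin_N#([u1,1],...,[uN,1]) = [u1+...+uN + 1 - N, 1].
# Dyadic values keep the floating-point arithmetic exact.
us = [0.75, 0.5, 0.875, 0.25, 0.625]
lo, hi = interval_min_n([(u, 1.0) for u in us])
assert (lo, hi) == (sum(us) + 1 - len(us), 1.0)
print("clipping and nmin_N checks passed:", (lo, hi))
```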