UNIVERSAL APPROXIMATION WITH CERTIFIED NETWORKS

Published as a conference paper at ICLR 2020

Maximilian Baader, Matthew Mirman, Martin Vechev
Department of Computer Science, ETH Zurich, Switzerland
{mbaader,matthew.mirman,martin.vechev}@inf.ethz.ch

ABSTRACT

Training neural networks to be certifiably robust is critical to ensure their safety against adversarial attacks. However, it is currently very difficult to train a neural network that is both accurate and certifiably robust. In this work we take a step towards addressing this challenge. We prove that for every continuous function $f$, there exists a network $n$ such that: (i) $n$ approximates $f$ arbitrarily closely, and (ii) simple interval bound propagation of a region $B$ through $n$ yields a result that is arbitrarily close to the optimal output of $f$ on $B$. Our result can be seen as a Universal Approximation Theorem for interval-certified ReLU networks. To the best of our knowledge, this is the first work to prove the existence of accurate, interval-certified networks.

1 INTRODUCTION

Much recent work has shown that neural networks can be fooled into misclassifying adversarial examples (Szegedy et al., 2014), inputs which are imperceptibly different from those that the neural network classifies correctly. Initial work on defending against adversarial examples revolved around training networks to be empirically robust, usually by including adversarial examples found with various attacks into the training dataset (Gu and Rigazio, 2015; Papernot et al., 2016; Zheng et al., 2016; Athalye et al., 2018; Eykholt et al., 2018; Moosavi-Dezfooli et al., 2017; Xiao et al., 2018). However, while empirical robustness can be practically useful, it does not provide safety guarantees.

As a result, much recent research has focused on verifying that a network is certifiably robust, typically by employing methods based on mixed integer linear programming (Tjeng et al., 2019), SMT solvers (Katz et al., 2017), semidefinite programming (Raghunathan et al., 2018a), duality (Wong and Kolter, 2018; Dvijotham et al., 2018b), and linear relaxations (Gehr et al., 2018; Weng et al., 2018; Wang et al., 2018b; Zhang et al., 2018; Singh et al., 2018; Salman et al., 2019). Because the certification rates were far from satisfactory, specific training methods were recently developed which produce networks that are certifiably robust: Mirman et al. (2018); Raghunathan et al. (2018b); Wang et al. (2018a); Wong and Kolter (2018); Wong et al. (2018); Gowal et al. (2018) train the network with standard optimization applied to an over-approximation of the network behavior on a given input region (the region is created around the concrete input point). These techniques aim to discover specific weights which facilitate verification. There is a tradeoff between the degree of the over-approximation used and the speed of training and certification. Recently, Cohen et al. (2019b) proposed a statistical approach to certification which, unlike the non-probabilistic methods discussed above, creates a probabilistic classifier that comes with probabilistic guarantees. So far, some of the best non-probabilistic results achieved on the popular MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky, 2009) datasets have been obtained with the simple interval relaxation (Gowal et al., 2018; Mirman et al., 2019), which scales well at both training and verification time.
Despite this progress, there are still substantial gaps between known standard accuracy, experimental robustness, and certified robustness. For example, for CIFAR10, the best reported certified robustness is 32.04% with an accuracy of 49.49% when using a fairly modest $\ell_\infty$ region with radius 8/255 (Gowal et al., 2018). The state-of-the-art non-robust accuracy for this dataset is > 95%, with experimental robustness > 50%. Given the size of this gap, a key question then is: can certified training ever succeed, or is there a fundamental limit?

Figure 2: The ReLU networks $n_1$ (Figure 2a) and $n_2$ (Figure 2c) encode the same function $f$ (Figure 2b). Interval analysis fails to certify that $n_1$ does not exceed $[0, 1]$ on $[0, 1]$, while certification succeeds for $n_2$.

Figure 1: Illustration of Theorem 1.1.

In this paper we take a step towards answering this question by proving a result parallel to the Universal Approximation Theorem (Cybenko, 1989; Hornik et al., 1989). We prove that for any continuous function $f$ defined on a compact domain $\Gamma \subset \mathbb{R}^m$ and for any desired level of accuracy $\delta$, there exists a ReLU neural network $n$ which can certifiably approximate $f$ up to $\delta$ using interval bound propagation. As the interval is a fairly imprecise relaxation, our result directly applies to more precise convex relaxations (e.g., Zhang et al. (2018); Singh et al. (2019)).

Theorem 1.1 (Universal Interval-Certified Approximation, Figure 1). Let $\Gamma \subset \mathbb{R}^m$ be a compact set and let $f \colon \Gamma \to \mathbb{R}$ be a continuous function. For all $\delta > 0$, there exists a ReLU network $n$ such that for all boxes $[a, b]$ in $\Gamma$ defined by points $a, b \in \Gamma$ where $a_k \le b_k$ for all $k$, the propagation of the box $[a, b]$ using interval analysis through the network $n$, denoted $n^\#([a, b])$, approximates the set $[l, u] = [\min f([a, b]), \max f([a, b])] \subseteq \mathbb{R}$ up to $\delta$:

$$[l + \delta, u - \delta] \subseteq n^\#([a, b]) \subseteq [l - \delta, u + \delta]. \tag{1}$$

We recover the classical universal approximation theorem ($|f(x) - n(x)| \le \delta$ for all $x \in \Gamma$) by considering boxes $[a, b]$ describing points ($x = a = b$). Note that here the lower bound is not $[l, u]$, as the network $n$ is an approximation of $f$. Because interval analysis propagates boxes, the theorem naturally handles $\ell_\infty$-norm-bounded perturbations of the input. Other $\ell_p$ norms can be handled by covering the $\ell_p$ ball with boxes. The theorem can be extended easily to functions $f \colon \Gamma \to \mathbb{R}^k$ by applying it component-wise.

Practical meaning of the theorem. The practical meaning of this theorem is as follows: if we train a neural network $n$ on a given training data set (e.g., CIFAR10) and we are satisfied with the properties of $n$ (e.g., high accuracy), then because $n$ is a continuous function, the theorem tells us that there exists a network $n'$ which is as accurate as $n$ and as certifiable with interval analysis as $n$ is with a complete verifier. This means that if we fail to find such an $n'$, then either the network did not possess the required capacity or the optimizer was unsuccessful.

Focus on the existence of a network. We note that we do not provide a method for training a certified ReLU network; even though our proof is constructive, we aim to answer an existential question and thus focus on proving that such a network exists. Interesting future work items would be to study the requirements on the size of this network and the inherent hardness of finding it with standard optimization methods.
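To make the two containments in (1) concrete, the following minimal sketch propagates a box through two small ReLU networks with interval arithmetic. The networks n1 and n2 below are illustrative stand-ins of our own choosing (the paper's Figure 2 networks are not reproduced here): both compute the same piecewise-linear function exactly, yet interval analysis is loose for one and exact for the other.

```python
# Minimal interval bound propagation (IBP) sketch.
# n1 and n2 are hypothetical ReLU representations of the function
# f(x) = 1 for x <= 0, 1 - x for 0 < x < 1, 0 for x >= 1;
# they stand in for the networks of Figure 2, which are not shown here.

def relu_box(lo, hi):
    # R#([lo, hi]) = [R(lo), R(hi)]
    return max(lo, 0.0), max(hi, 0.0)

def n1_box(lo, hi):
    # n1(x) = R(1 - x) - R(-x); the subtraction makes IBP lose precision.
    a_lo, a_hi = relu_box(1 - hi, 1 - lo)   # R#(1 - [lo, hi])
    b_lo, b_hi = relu_box(-hi, -lo)         # R#(-[lo, hi])
    return a_lo - b_hi, a_hi - b_lo         # interval subtraction

def n2_box(lo, hi):
    # n2(x) = R(1 - R(x)); a composition that IBP happens to track exactly.
    a_lo, a_hi = relu_box(lo, hi)
    return relu_box(1 - a_hi, 1 - a_lo)

def f(x):
    return 1.0 if x <= 0 else (0.0 if x >= 1 else 1.0 - x)

lo, hi, delta = -1.0, 1.0, 0.25
xs = [lo + (hi - lo) * i / 1000 for i in range(1001)]
l, u = min(map(f, xs)), max(map(f, xs))     # true range of f on the box

for name, net in [("n1", n1_box), ("n2", n2_box)]:
    out_lo, out_hi = net(lo, hi)
    # Soundness containment from (1): n#(B) must lie inside [l - d, u + d].
    certified = l - delta <= out_lo and out_hi <= u + delta
    print(name, (out_lo, out_hi), "certified:", certified)
# On [-1, 1]: n1 yields [-1, 2] (certification fails), n2 yields [0, 1].
```

Both networks compute the exact range $[0, 1]$ on $[-1, 1]$, but IBP through n1 returns $[-1, 2]$: subtracting two ReLU terms discards the correlation between them. This representation-dependence is precisely what Theorem 1.1 shows can always be avoided.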
Universal approximation is insufficient. We now discuss why classical universal approximation is insufficient for establishing our result. While classical universal approximation theorems state that neural networks can approximate a large class of functions $f$, unlike our result, they do not state that the robustness of the approximation $n$ of $f$ can actually be certified with a scalable proof method (e.g., interval bound propagation). If one instead uses a non-scalable complete verifier, then the standard universal approximation theorem is sufficient.

To demonstrate this point, consider the function $f \colon \mathbb{R} \to \mathbb{R}$ (Figure 2b) mapping all $x \le 0$ to 1, all $x \ge 1$ to 0, and all $0 < x < 1$ to $1 - x$, and two ReLU networks $n_1$ (Figure 2a) and $n_2$ (Figure 2c) perfectly approximating $f$, that is, $n_1(x) = f(x) = n_2(x)$ for all $x$. For $\delta = \frac{1}{4}$, the interval certification that $n_1$ maps all $x \in [0, 1]$ to $[0, 1]$ fails because $n_1^\#([0, 1]) \not\subseteq [0, 1]$. However, interval certification succeeds for $n_2$, because $n_2^\#([0, 1]) = [0, 1]$. To the best of our knowledge, this is the first work to prove the existence of accurate, interval-certified networks.

2 RELATED WORK

After adversarial examples were discovered by Szegedy et al. (2014), many attacks and defenses were introduced (for a survey, see Akhtar and Mian (2018)). Initial work on verifying neural network robustness used exact methods (Katz et al., 2017; Tjeng et al., 2019) on small networks, while later research introduced methods based on over-approximation (Gehr et al., 2018; Raghunathan et al., 2018a; Singh et al., 2018; Salman et al., 2019) aiming to scale to larger networks. A fundamentally different approach is randomized smoothing (Li et al., 2019; Lécuyer et al., 2019; Cohen et al., 2019b), in which probabilistic classification and certification with high confidence is performed.

As neural networks that are experimentally robust need not be certifiably robust, there has been significant recent research on training certifiably robust neural networks (Raghunathan et al., 2018b; Mirman et al., 2018; 2019; Wong and Kolter, 2018; Wong et al., 2018; Wang et al., 2018a; Gowal et al., 2018; Dvijotham et al., 2018a; Xiao et al., 2019; Cohen et al., 2019b). As these methods appear to have reached a performance wall, several works have started investigating the fundamental barriers in the datasets and methods that preclude the learning of a robust network, let alone a certifiably robust one (Khoury and Hadfield-Menell, 2018; Schmidt et al., 2018; Tsipras et al., 2019). In our work, we focus on the question of whether neural networks are capable of approximating functions whose robustness can be established with the efficient interval relaxation.

Feasibility results with neural networks. Early versions of the Universal Approximation Theorem were stated by Cybenko (1989) and Hornik et al. (1989). Cybenko (1989) showed that networks using sigmoidal activations could approximate continuous functions in the unit hypercube, while Hornik et al. (1989) showed that even networks with only one hidden layer are capable of approximating Borel measurable functions. More recent work has investigated the capabilities of ReLU networks. Here, Arora et al. (2018), based on Tarela and Martínez (1999), proved that every continuous piecewise linear function on $\mathbb{R}^m$ can be represented by a ReLU network. Later, He et al. (2018) reduced the number of neurons needed using ideas from finite element methods.
Relevant to our work, Arora et al. (2018) introduced a ReLU network representation of the min function. Further, we use a construction method similar to the construction of nodal basis functions given in He et al. (2018). Universal approximation for Lipschitz-constrained networks has been considered by Anil et al. (2019) and later by Cohen et al. (2019a). A bound on the Lipschitz constant of a network immediately yields a certified region depending on the classification margin. Anil et al. (2019) proved that the set of Lipschitz networks with the GroupSort activation is dense in the space of Lipschitz continuous functions with Lipschitz constant 1, while Cohen et al. (2019a) provide an explicit construction to obtain the network. We note that both of these works focus on Lipschitz continuous functions, a more restricted class than the continuous functions we consider in our work.

3 BACKGROUND

In this section we provide the concepts necessary to describe our main result.

Adversarial examples and robustness verification. Let $n \colon \mathbb{R}^m \to \mathbb{R}^k$ be a neural network, which classifies an input $x$ to a label $t$ if $n(x)_t > n(x)_j$ for all $j \neq t$. For a correctly classified input $x$, an adversarial example is an input $y$ such that $y$ is imperceptibly different from $x$ to a human, but is classified to a different label by $n$. Frequently, two images are assumed to be imperceptibly different if their $\ell_p$ distance is at most $\epsilon$. The $\ell_p$ ball around an image is called the adversarial ball, and a network is said to be $\epsilon$-robust around $x$ if every point in the adversarial ball around $x$ is classified the same.

Figure 3: Approximating $f$ (Figure 3a) using a ReLU network $n = \xi_0 + \sum_k n_k$. The ReLU networks $n_k$ (Figure 3c) approximate the $N$-slicing of $f$ (Figure 3b), $f_0, \dots, f_4$, as a sum of local bumps (Figure 6).

In this paper, we limit our discussion to $\ell_\infty$ adversarial balls, which can be used to cover all $\ell_p$ balls. The goal of robustness verification is to show that for a neural network $n$, an input point $x$ and a label $t$, every possible input in an $\ell_\infty$ ball of radius $\epsilon$ around $x$ (written $B_\epsilon^\infty(x)$) is also classified to $t$.

Verifying neural networks with interval analysis. The verification technique we investigate in this work is interval analysis. We denote by $\mathcal{B}$ the set of boxes $[a, b] \subseteq \mathbb{R}^m$ for all $m$, where $a_i \le b_i$ for all $i$. Furthermore, for $\Gamma \subseteq \mathbb{R}^m$ we define $\mathcal{B}(\Gamma)$ to be the set of all boxes contained in $\Gamma$. The standard interval transformers for the basic operations we consider, namely $+$, $-$, $\cdot$ and the ReLU function $R$ (Gehr et al. (2018); Gowal et al. (2018)), are

$$[a, b] +^\# [c, d] = [a + c, b + d], \qquad -^\#[a, b] = [-b, -a], \qquad \lambda \cdot^\# [a, b] = [\lambda a, \lambda b], \qquad R^\#([a, b]) = [R(a), R(b)],$$

where $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$ and $\lambda \in \mathbb{R}_{\ge 0}$. We use $\#$ to distinguish a function $f$ from its interval transformer $f^\#$. To illustrate the difference between $f$ and $f^\#$, consider $f(x) := x - x$ evaluated on $x \in [0, 1]$. We have $f([0, 1]) = \{0\}$, but $f^\#([0, 1]) = [0, 1] -^\# [0, 1] = [0, 1] +^\# [-1, 0] = [-1, 1]$, illustrating the loss of precision that interval analysis suffers from. Interval analysis provides a sound over-approximation in the sense that for every function $f$, the values that $f$ can attain on $[a, b]$, namely $f([a, b]) := \{f(x) \mid x \in [a, b]\}$, are a subset of $f^\#([a, b])$. If $f$ is a composition of functions, $f = f_1 \circ \cdots \circ f_k$, then $f_1^\# \circ \cdots \circ f_k^\#$ is a sound interval transformer for $f$. Furthermore, all combinations $f$ of $+$, $-$, $\cdot$ and $R$ are monotone, that is, for $[a, b], [c, d] \in \mathcal{B}(\mathbb{R}^m)$ such that $[a, b] \subseteq [c, d]$ we have $f^\#([a, b]) \subseteq f^\#([c, d])$ (Appendix A).
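A minimal sketch of these interval transformers in Python, reproducing the $x - x$ example above; the Box class and method names are ours, introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """A one-dimensional interval [lo, hi] with lo <= hi."""
    lo: float
    hi: float

    def __add__(self, other):   # [a,b] +# [c,d] = [a+c, b+d]
        return Box(self.lo + other.lo, self.hi + other.hi)

    def __neg__(self):          # -#[a,b] = [-b, -a]
        return Box(-self.hi, -self.lo)

    def __sub__(self, other):   # [a,b] -# [c,d] = [a,b] +# (-#[c,d])
        return self + (-other)

    def scale(self, lam):       # lam *# [a,b] = [lam*a, lam*b] for lam >= 0
        assert lam >= 0
        return Box(lam * self.lo, lam * self.hi)

def relu(box):                  # R#([a,b]) = [R(a), R(b)]
    return Box(max(box.lo, 0.0), max(box.hi, 0.0))

# Loss of precision for f(x) = x - x: the exact range on [0,1] is {0},
# but the transformer forgets that both operands are the same x.
x = Box(0.0, 1.0)
print(x - x)        # Box(lo=-1.0, hi=1.0), not [0, 0]
print(relu(-x))     # Box(lo=0.0, hi=0.0)
```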
For boxes $[x, x]$ representing points, $f^\#$ coincides with $f$, that is, $f^\#([x, x]) = f(x)$. This will be needed later.

4 PROVING UNIVERSAL INTERVAL-PROVABLE APPROXIMATION

In this section, we provide an explanation of the proof of our main result, Theorem 4.6, and illustrate the main points of the proof. The first step in the construction is to deconstruct the function $f$ into slices $\{f_k \colon \Gamma \to [0, \frac{\delta}{2}]\}_{0 \le k \le N-1}$ (Figure 3b), which sum back to $f$, namely $f = \xi_0 + \sum_k f_k$ (Definition A.2, Lemma A.3). Lemma A.18 then shows that for all $\delta > 0$ there exists a set of ReLU networks $\{n_k\}_{0 \le k \le N-1}$ (Figure 3c) whose interval behavior closely tracks these slices; combining them yields the certified approximation.

Theorem 4.5. Let $\Gamma \subset \mathbb{R}^m$ be compact and let $f \colon \Gamma \to \mathbb{R}$ be continuous. For all $\delta > 0$, there exists a ReLU network $n$ such that for all $B \in \mathcal{B}(\Gamma)$

$$[l + \delta, u - \delta] \subseteq n^\#(B) \subseteq [l - \delta, u + \delta],$$

where $l := \min f(B)$ and $u := \max f(B)$.

Proof. Pick $N$ such that the height of each slice is exactly $\frac{\delta}{2}$; if this is impossible, choose a slightly smaller $\delta$. Let $\{n_k\}_{0 \le k \le N-1}$ be the networks from Lemma A.18 approximating the $N$-slicing of $f$, and set $n := \xi_0 + \frac{\delta}{2} \sum_{k=0}^{N-1} n_k$ (Figure 3). Let $p$ and $q$ be such that $\xi_p \le l \le \xi_{p+1}$ and $\xi_q \le u \le \xi_{q+1}$. If $[l + \delta, u - \delta]$ is empty, the lower bound we want to prove becomes vacuous and only the upper one needs to be proven. Applying Lemma A.18 to each slice, together with the monotonicity of interval analysis, we thus have

$$[l + \delta, u - \delta] \subseteq [\xi_{p+2}, \xi_{q-1}] \subseteq n^\#(B) \subseteq [\xi_{p-1}, \xi_{q+2}] \subseteq [l - \delta, u + \delta].$$

Theorem 4.6 (Universal Interval-Provable Approximation). Let $\Gamma \subset \mathbb{R}^m$ be compact and $f \colon \Gamma \to \mathbb{R}^d$ be continuous. For all $\delta \in \mathbb{R}^d_{>0}$ there exists a ReLU network $n$ such that for all $B \in \mathcal{B}(\Gamma)$

$$[l + \delta, u - \delta] \subseteq n^\#(B) \subseteq [l - \delta, u + \delta],$$

where $l, u \in \mathbb{R}^d$ are such that $l_k := \min f(B)_k$ and $u_k := \max f(B)_k$ for all $k$.

Proof. This is a direct consequence of using Theorem 4.5 and the Tietze extension theorem to produce a neural network for each dimension $d$ of the codomain of $f$.

Note that Theorem 1.1 is the special case of Theorem 4.6 with $d = 1$, stated so to simplify presentation.

5 CONCLUSION

We proved that for all real-valued continuous functions $f$ on compact sets, there exists a ReLU network $n$ approximating $f$ arbitrarily well under the interval abstraction. This means that for arbitrary input sets, analysis using the interval relaxation yields an over-approximation arbitrarily close to the smallest interval containing all possible outputs. Our theorem affirmatively answers the open question of whether the Universal Approximation Theorem generalizes to interval analysis. Our results address the question of whether the interval abstraction is expressive enough to analyse networks approximating interesting functions $f$. This is of practical importance because interval analysis is the most scalable non-trivial analysis.

REFERENCES

Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.

Cem Anil, James Lucas, and Roger B. Grosse. Sorting out Lipschitz function approximation. In International Conference on Machine Learning (ICML), 2019.

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations (ICLR), 2018.

Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International Conference on Machine Learning (ICML), 2018.

Jeremy E. J. Cohen, Todd Huster, and Ra Cohen. Universal Lipschitz approximation in bounded depth neural networks. arXiv preprint arXiv:1904.04861, 2019a.

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), 2019b.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems (MCSS), 1989.

Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O'Donoghue, Jonathan Uesato, and Pushmeet Kohli. Training verified learners with learned verifiers.
arXiv preprint arXiv:1805.10265, 2018a.

Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A. Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. In Uncertainty in Artificial Intelligence (UAI), 2018b.

Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (SP), 2018.

Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy A. Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.

Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. In International Conference on Learning Representations (ICLR), Workshop, 2015.

Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973, 2018.

Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 1989.

Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification (CAV), 2017.

Marc Khoury and Dylan Hadfield-Menell. On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525, 2018.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.

Mathias Lécuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy (SP), 2019.

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Matthew Mirman, Timon Gehr, and Martin T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML), 2018.

Matthew Mirman, Gagandeep Singh, and Martin T. Vechev. A provable defense for deep residual networks. arXiv preprint arXiv:1903.12519, 2019.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), 2016.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems (NeurIPS), 2018a.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang.
Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR), 2018b.

Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin T. Vechev. Fast and effective robustness certification. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin T. Vechev. An abstract domain for certifying neural networks. PACMPL (POPL), 2019.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

J. M. Tarela and M. V. Martínez. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 1999.

Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations (ICLR), 2019.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations (ICLR), 2019.

Shiqi Wang, Yizheng Chen, Ahmed Abdou, and Suman Jana. MixTrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018a.

Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018b.

Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane S. Boning, and Inderjit S. Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (ICML), 2018.

Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML), 2018.

Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Kai Y. Xiao, Vincent Tjeng, Nur Muhammad (Mahi) Shafiullah, and Aleksander Madry. Training for faster adversarial robustness verification via inducing ReLU stability. In International Conference on Learning Representations (ICLR), 2019.

Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Stephan Zheng, Yang Song, Thomas Leung, and Ian J. Goodfellow. Improving the robustness of deep neural networks via stability training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
A PROOFS FOR THE UNIVERSAL INTERVAL-CERTIFIED APPROXIMATION

Lemma A.1 (Monotonicity). The operations $+^\#$ and $-^\#$ are monotone, that is, for all $[a_1, b_1], [a_2, b_2], [c_1, d_1], [c_2, d_2] \in \mathcal{B}(\mathbb{R})$ such that $[a_1, b_1] \subseteq [a_2, b_2]$ and $[c_1, d_1] \subseteq [c_2, d_2]$, it holds that

$$[a_1, b_1] +^\# [c_1, d_1] \subseteq [a_2, b_2] +^\# [c_2, d_2], \qquad [a_1, b_1] -^\# [c_1, d_1] \subseteq [a_2, b_2] -^\# [c_2, d_2].$$

Further, the operations $\lambda \cdot^\#$ and $R^\#$ are monotone, that is, for all $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$ and all $\lambda \in \mathbb{R}_{\ge 0}$ such that $[a, b] \subseteq [c, d]$, it holds that $\lambda \cdot^\# [a, b] \subseteq \lambda \cdot^\# [c, d]$ and $R^\#([a, b]) \subseteq R^\#([c, d])$.

Proof.
$$[a_1, b_1] +^\# [c_1, d_1] = [a_1 + c_1, b_1 + d_1] \subseteq [a_2 + c_2, b_2 + d_2] = [a_2, b_2] +^\# [c_2, d_2],$$
$$[a_1, b_1] -^\# [c_1, d_1] = [a_1 - d_1, b_1 - c_1] \subseteq [a_2 - d_2, b_2 - c_2] = [a_2, b_2] -^\# [c_2, d_2],$$
$$\lambda \cdot^\# [a, b] = [\lambda a, \lambda b] \subseteq [\lambda c, \lambda d] = \lambda \cdot^\# [c, d],$$
$$R^\#([a, b]) = [R(a), R(b)] \subseteq [R(c), R(d)] = R^\#([c, d]).$$

Definition A.2 (N-slicing). Let $\Gamma \subset \mathbb{R}^m$ be a compact $m$-dimensional box and let $f \colon \Gamma \to \mathbb{R}$ be continuous. The $N$-slicing of $f$ is a set of functions $\{f_k\}_{0 \le k \le N-1}$ defined by

$$f_k \colon \Gamma \to \mathbb{R}, \quad x \mapsto \begin{cases} 0 & \text{if } f(x) \le \xi_k, \\ f(x) - \xi_k & \text{if } \xi_k < f(x) < \xi_{k+1}, \\ \xi_{k+1} - \xi_k & \text{otherwise}, \end{cases} \qquad k \in \{0, \dots, N-1\},$$

where $\xi_k := \xi_{\min} + \frac{k}{N}(\xi_{\max} - \xi_{\min})$ for $k \in \{0, \dots, N\}$, $\xi_{\min} := \min f(\Gamma)$ and $\xi_{\max} := \max f(\Gamma)$.

Lemma A.3 (N-slicing). Let $\{f_k\}_{0 \le k \le N-1}$ be the $N$-slicing of $f$. Then for all $x \in \Gamma$ we have $f(x) = \xi_0 + \sum_{k=0}^{N-1} f_k(x)$.

Proof. Pick $x \in \Gamma$ and let $l \in \{0, \dots, N-1\}$ be such that $\xi_l \le f(x) \le \xi_{l+1}$. Then

$$\xi_0 + \sum_{k=0}^{N-1} f_k(x) = \xi_0 + \sum_{k=0}^{l-1} f_k(x) + f_l(x) + \sum_{k=l+1}^{N-1} f_k(x) = \xi_0 + \sum_{k=0}^{l-1} (\xi_{k+1} - \xi_k) + f_l(x) = \xi_l + f_l(x) = f(x).$$

Definition A.4 (clipping). Let $b \in \mathbb{R}$. We define the clipping function $R_{[-\infty, b]} \colon \mathbb{R} \to \mathbb{R}$ by $R_{[-\infty, b]}(x) := b - R(b - x)$.

Lemma A.5 (clipping). The function $R_{[-\infty, b]}$ sends all $x \le b$ to $x$ and all $x > b$ to $b$. Further, $R_{[-\infty, b]}^\#([a', b']) = [R_{[-\infty, b]}(a'), R_{[-\infty, b]}(b')]$.

Proof. For $x \le b$ we have $R_{[-\infty, b]}(x) = b - R(b - x) = b - b + x = x$, and for $x > b$ we have $R_{[-\infty, b]}(x) = b - R(b - x) = b - 0 = b$. On boxes,

$$R_{[-\infty, b]}^\#([a', b']) = b -^\# R^\#(b -^\# [a', b']) = b -^\# R^\#([b - b', b - a']) = b -^\# [R(b - b'), R(b - a')] = [b - R(b - a'), b - R(b - b')] = [R_{[-\infty, b]}(a'), R_{[-\infty, b]}(b')].$$

Definition A.6 (nmin). We define the ReLU network $n_{\min} \colon \mathbb{R}^2 \to \mathbb{R}$ by

$$n_{\min}(x, y) := \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} R\!\left( \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \right).$$

Lemma A.7 (nmin). Let $x, y \in \mathbb{R}$. Then $n_{\min}(x, y) = \min(x, y)$.

Proof. Because $n_{\min}$ is symmetric in its arguments, we assume w.l.o.g. $x \ge y$, so that $R(x - y) = x - y$ and $R(y - x) = 0$:

$$n_{\min}(x, y) = \tfrac{1}{2}\big(R(x + y) - R(-x - y) - R(x - y) - R(y - x)\big) = \tfrac{1}{2}\big(R(x + y) - R(-x - y) - (x - y)\big).$$

If $x + y \ge 0$, then $n_{\min}(x, y) = \frac{1}{2}(x + y - x + y) = y$. If $x + y < 0$, then $n_{\min}(x, y) = \frac{1}{2}(x + y - x + y) = y$. In both cases $n_{\min}(x, y) = y = \min(x, y)$.
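The exact sign pattern in Definition A.6 was garbled in extraction; the sketch below assumes the reconstruction used above, i.e. $\min(x, y) = \frac{1}{2}(x + y - |x - y|)$ with weights $(1, -1, -1, -1)$ over ReLUs of $x+y$, $-x-y$, $x-y$, $y-x$, and checks it pointwise together with one interval case of Lemma A.9. All function names are ours.

```python
import random

def relu(z):
    return max(z, 0.0)

def nmin(x, y):
    # n_min(x, y) = 1/2 (R(x+y) - R(-x-y) - R(x-y) - R(y-x))
    #             = 1/2 ((x + y) - |x - y|) = min(x, y)
    return 0.5 * (relu(x + y) - relu(-x - y) - relu(x - y) - relu(y - x))

def nmin_box(a, b, c, d):
    # Interval transformer of n_min on [a,b] x [c,d], propagating each
    # ReLU term with R#([l,u]) = [R(l), R(u)] and combining soundly.
    s = (relu(a + c), relu(b + d))    # R#([a,b] +# [c,d])
    t = (relu(-b - d), relu(-a - c))  # R#(-#([a,b] +# [c,d]))
    p = (relu(a - d), relu(b - c))    # R#([a,b] -# [c,d])
    q = (relu(c - b), relu(d - a))    # R#([c,d] -# [a,b])
    lo = 0.5 * (s[0] - t[1] - p[1] - q[1])
    hi = 0.5 * (s[1] - t[0] - p[0] - q[0])
    return lo, hi

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(nmin(x, y) - min(x, y)) < 1e-9   # Lemma A.7

# Lemma A.9, overlapping case (a < d and b > c):
# nmin#([a,b],[c,d]) = [a + c - (b+d)/2, (b+d)/2]
a, b, c, d = 0.0, 2.0, 1.0, 3.0
lo, hi = nmin_box(a, b, c, d)
assert (lo, hi) == (a + c - (b + d) / 2, (b + d) / 2)
print("nmin checks passed:", (lo, hi))
```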
Definition A.8 (nmin_N). For all $N \in \mathbb{N}_{\ge 1}$, we define a ReLU network $n_{\min_N}$ by

$$n_{\min_1}(x) := x, \qquad n_{\min_N}(x_1, \dots, x_N) := n_{\min}\big(n_{\min_{\lceil N/2 \rceil}}(x_1, \dots, x_{\lceil N/2 \rceil}),\; n_{\min_{\lfloor N/2 \rfloor}}(x_{\lceil N/2 \rceil + 1}, \dots, x_N)\big).$$

Lemma A.9. Let $[a, b], [c, d] \in \mathcal{B}(\mathbb{R})$. Then $n_{\min}^\#([a, b], [c, d]) = n_{\min}^\#([c, d], [a, b])$ and

$$n_{\min}^\#([a, b], [c, d]) = \begin{cases} [c + \frac{a - b}{2},\; d + \frac{b - a}{2}] & \text{if } d \le a, \\ [a + \frac{c - d}{2},\; b + \frac{d - c}{2}] & \text{if } b \le c, \\ [a + c - \frac{b + d}{2},\; \frac{b + d}{2}] & \text{if } a < d \text{ and } b > c. \end{cases}$$

Proof. The symmetry on abstract elements is immediate. In the following, we omit some of the $\#$ superscripts to improve readability.

$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} R^\#\!\begin{pmatrix} [a + c, b + d] \\ [-b - d, -a - c] \\ [a - d, b - c] \\ [c - b, d - a] \end{pmatrix} = \tfrac{1}{2} \begin{pmatrix} 1 & -1 & -1 & -1 \end{pmatrix} \begin{pmatrix} [R(a + c), R(b + d)] \\ [R(-b - d), R(-a - c)] \\ [R(a - d), R(b - c)] \\ [R(c - b), R(d - a)] \end{pmatrix}$$
$$= \tfrac{1}{2}\big([R(a + c) - R(-a - c),\; R(b + d) - R(-b - d)] + [-R(b - c) - R(d - a),\; -R(a - d) - R(c - b)]\big).$$

Claim: $R(a + c) - R(-a - c) = a + c$. Indeed, if $a + c > 0$ then $-a - c < 0$ and the claim follows; if $a + c \le 0$ then $-a - c \ge 0$ and $R(a + c) - R(-a - c) = -(-a - c) = a + c$. Similarly $R(b + d) - R(-b - d) = b + d$. So the expression simplifies to

$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [-R(b - c) - R(d - a),\; -R(a - d) - R(c - b)]\big).$$

We proceed by case distinction.

Case 1: $b \le c$. Then $a \le b \le c \le d$ and
$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [a - d, b - c]\big) = \tfrac{1}{2}\big([2a + c - d,\; 2b + d - c]\big) = [a + \tfrac{c - d}{2},\; b + \tfrac{d - c}{2}].$$

Case 2: $d \le a$. Then $c \le d \le a \le b$; by the symmetry of $n_{\min}^\#$ this is equivalent to Case 1, hence $n_{\min}^\#([a, b], [c, d]) = [c + \frac{a - b}{2},\; d + \frac{b - a}{2}]$.

Case 3: $a - d < 0$ and $b - c > 0$:
$$n_{\min}^\#([a, b], [c, d]) = \tfrac{1}{2}\big([a + c, b + d] + [c - b + a - d,\; 0]\big) = \tfrac{1}{2}\big([2a + 2c - b - d,\; b + d]\big) = [a + c - \tfrac{b + d}{2},\; \tfrac{b + d}{2}].$$

Combining the three cases yields the claimed formula.

Definition A.10 (neighboring grid points). Let $G := (\frac{\mathbb{Z}}{M})^m$ for some $M \in \mathbb{N}$. We define the set of neighboring grid points of $x \in \Gamma$ by $N(x) := \{g \in G \mid \|x - g\|_\infty \le \frac{1}{M}\} \setminus \{x\}$. For $U \subseteq \mathbb{R}^m$, we define $N(U) := \bigcup \{N(x) \mid x \in U\} \setminus U$.

Definition A.11 (local bump). Let $M \in \mathbb{N}$, $G := (\frac{\mathbb{Z}}{M})^m$, $\ell := 2^{\lceil \log_2 2m \rceil + 1}$, and let $c = \{\frac{i^l_1}{M}, \frac{i^u_1}{M}\} \times \cdots \times \{\frac{i^l_m}{M}, \frac{i^u_m}{M}\} \subseteq G$. We define a ReLU network $\varphi_c \colon \mathbb{R}^m \to [0, 1]$ with respect to the grid $G$ by

$$\varphi_c(x) := R\Big(n_{\min_{2m}}\Big(\Big\{ R_{[-\infty, 1]}\big(M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big),\; R_{[-\infty, 1]}\big(M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big) \Big\}_{1 \le k \le m}\Big)\Big).$$

Lemma A.12. It holds that

$$\varphi_c(x) = \begin{cases} 0 & \text{if } x \notin \operatorname{conv}(N(c)), \\ 1 & \text{if } x \in \operatorname{conv}(c), \\ R\Big(\min \bigcup_{k=1}^m \big\{M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big\} \cup \big\{M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big\}\Big) & \text{otherwise}. \end{cases}$$

Proof. By case distinction.

Case $x \notin \operatorname{conv}(N(c))$: Then there exists a $k$ such that either $x_k < \frac{i^l_k - 1}{M}$ or $x_k > \frac{i^u_k + 1}{M}$, so $M\ell(x_k - \frac{i^l_k}{M}) + 1$ or $M\ell(\frac{i^u_k}{M} - x_k) + 1$ is less than or equal to $1 - \ell \le 0$. Hence the minimum is non-positive and $\varphi_c(x) = 0$.

Case $x \in \operatorname{conv}(c)$: Then for all $k$ it holds that $\frac{i^l_k}{M} \le x_k \le \frac{i^u_k}{M}$. Thus $M\ell(x_k - \frac{i^l_k}{M}) + 1 \ge 1$ and $M\ell(\frac{i^u_k}{M} - x_k) + 1 \ge 1$ for all $k$, so every clipped argument equals 1 and $\varphi_c(x) = R(1) = 1$.

Case otherwise: For every such $x$ there exists a $k$ such that $M\ell(x_k - \frac{i^l_k}{M}) + 1$ or $M\ell(\frac{i^u_k}{M} - x_k) + 1$ is less than or equal to all other arguments of the minimum and less than or equal to 1. If this smallest element is smaller than 0, then $\varphi_c(x)$ evaluates to 0; otherwise it evaluates to that element. Thus we can drop the clipping $R_{[-\infty, 1]}$ from the arguments and apply the outer $R$ to the plain minimum:

$$\varphi_c(x) = R\Big(\min \bigcup_{k=1}^m \big\{M\ell(x_k - \tfrac{i^l_k}{M}) + 1\big\} \cup \big\{M\ell(\tfrac{i^u_k}{M} - x_k) + 1\big\}\Big).$$

Lemma A.13. Let $[u_1, 1], \dots, [u_N, 1]$ be abstract elements of the interval domain $\mathcal{B}$. Then

$$n_{\min_N}^\#([u_1, 1], \dots, [u_N, 1]) = [u_1 + \dots + u_N + 1 - N,\; 1].$$

Proof. By induction. Base case: for $N = 1$, $n_{\min_1}^\#([u_1, 1]) = [u_1, 1]$; for $N = 2$, $n_{\min_2}^\#([u_1, 1], [u_2, 1]) = [u_1 + u_2 - 1, 1]$ by Lemma A.9. Induction hypothesis: the property holds for all $N'$ with $0 < N' \le N - 1$. Induction step: then it also holds for $N$:
$$n_{\min_N}^\#([u_1, 1], \dots, [u_N, 1]) = n_{\min}^\#\big(n_{\min_{\lceil N/2 \rceil}}^\#([u_1, 1], \dots, [u_{\lceil N/2 \rceil}, 1]),\; n_{\min_{\lfloor N/2 \rfloor}}^\#([u_{\lceil N/2 \rceil + 1}, 1], \dots, [u_N, 1])\big)$$
$$= n_{\min}^\#\big([u_1 + \dots + u_{\lceil N/2 \rceil} + 1 - \lceil N/2 \rceil,\; 1],\; [u_{\lceil N/2 \rceil + 1} + \dots + u_N + 1 - N + \lceil N/2 \rceil,\; 1]\big)$$
$$\overset{\text{Lemma A.9}}{=} [u_1 + \dots + u_N + 2 - \lceil N/2 \rceil - N + \lceil N/2 \rceil - 1,\; 1] = [u_1 + \dots + u_N + 1 - N,\; 1].$$

Lemma A.14. Let $[a, b], [u, 1] \in \mathcal{B}(\mathbb{R}_{\le 1})$. Then $n_{\min}^\#([a, b], [u, 1]) \subseteq [a + u - 1,\; \frac{b + 1}{2}]$.

Proof. By Lemma A.9,

$$n_{\min}^\#([a, b], [u, 1]) = \begin{cases} [a + \frac{u - 1}{2},\; b + \frac{1 - u}{2}] & \text{if } b \le u, \\ [a + u - \frac{b + 1}{2},\; \frac{b + 1}{2}] & \text{if } a < 1 \text{ and } b > u. \end{cases}$$

If $b \le u$, then $a + \frac{u - 1}{2} \ge a + u - 1$ (since $u \le 1$) and $b + \frac{1 - u}{2} \le \frac{b + 1}{2}$ (since $b \le u$). If $u \le b$, then $a + u - \frac{b + 1}{2} \ge a + u - 1$ (since $b \le 1$). In both cases $n_{\min}^\#([a, b], [u, 1]) \subseteq [a + u - 1,\; \frac{b + 1}{2}]$.

Lemma A.15. Let $N \in \mathbb{N}_{\ge 2}$ and let $[u_1, 1], \dots, [u_{N-1}, 1], [u_N, d] \in \mathcal{B}(\mathbb{R}_{\le 1})$ be abstract elements of the interval domain $\mathcal{B}$. Furthermore, let $H(x) := \frac{1 + x}{2}$. Then, for any ordering of the arguments, there exists a $u \in \mathbb{R}$ such that

$$n_{\min_N}^\#([u_1, 1], \dots, [u_{N-1}, 1], [u_N, d]) \subseteq [u,\; H^{\lceil \log_2 N \rceil + 1}(d)].$$

Proof. By induction. For $N = 2$, Lemma A.14 gives $n_{\min_2}^\#([u_1, 1], [u_2, d]) \subseteq [u_2 + u_1 - 1,\; H(d)]$. For $N = 3$, either

$$n_{\min_3}^\#([u_1, 1], [u_2, 1], [u_3, d]) = n_{\min}^\#\big(n_{\min}^\#([u_1, 1], [u_2, 1]),\; [u_3, d]\big) = n_{\min}^\#([u_1 + u_2 - 1, 1], [u_3, d]) \subseteq [u_3 + u_1 + u_2 - 2,\; H(d)],$$

or the box $[u_3, d]$ appears in the left subtree:

$$n_{\min_3}^\#([u_3, d], [u_1, 1], [u_2, 1]) = n_{\min}^\#\big(n_{\min}^\#([u_3, d], [u_1, 1]),\; [u_2, 1]\big) \subseteq n_{\min}^\#\big([u_3 + u_1 - 1,\; H(d)],\; [u_2, 1]\big) \subseteq [u_3 + u_1 + u_2 - 2,\; H^2(d)].$$

So $n_{\min_3}^\#$ is always included in $[u_1 + u_2 + u_3 - 2,\; H^2(d)]$, and $H^2(d) \le H^{\lceil \log_2 3 \rceil + 1}(d)$. Induction hypothesis: the statement holds for all $2 \le N' \le N - 1$. Induction step: then the property also holds for $N$; if, say, the box $[u_N, d]$ lies in the left subtree,

$$n_{\min_N}^\#([u_N, d], [u_1, 1], \dots, [u_{N-1}, 1]) = n_{\min}^\#\big(n_{\min_{\lceil N/2 \rceil}}^\#([u_N, d], [u_1, 1], \dots),\; n_{\min_{\lfloor N/2 \rfloor}}^\#([u_{\lceil N/2 \rceil}, 1], \dots, [u_{N-1}, 1])\big)$$
$$\subseteq n_{\min}^\#\big([u',\; H^{\lceil \log_2 \lceil N/2 \rceil \rceil + 1}(d)],\; [u'', 1]\big) = n_{\min}^\#\big([u',\; H^{\lceil \log_2 N \rceil - 1 + 1}(d)],\; [u'', 1]\big) \subseteq [u' + u'' - 1,\; H^{\lceil \log_2 N \rceil + 1}(d)],$$

using Lemma A.14 and $\lceil \log_2 \lceil N/2 \rceil \rceil = \lceil \log_2 N \rceil - 1$; the other orderings of the arguments are handled similarly.

Lemma A.16. Let $H(x) := \frac{1 + x}{2}$. For all $N \in \mathbb{N}_{>0}$, $d \le 1 - 2^N$ implies $H^N(d) \le 0$.

Proof. By induction. For $N = 1$: $H(1 - 2) = \frac{1 + 1 - 2}{2} = 0$, and $H$ is monotone. Induction hypothesis: the statement holds for all $N'$ with $0 < N' \le N$. Induction step: for $d \le 1 - 2^{N+1}$,

$$H^{N+1}(d) \le H^{N+1}(1 - 2^{N+1}) = H^N\big(H(1 - 2^{N+1})\big) = H^N\Big(\tfrac{1 + 1 - 2^{N+1}}{2}\Big) = H^N(1 - 2^N) \le 0.$$

Lemma A.17. For all boxes $B \in \mathcal{B}(\Gamma)$, we have

$$\varphi_c^\#(B) = \begin{cases} [1, 1] & \text{if } B \subseteq \operatorname{conv}(c), \\ [0, 0] & \text{if } B \subseteq \Gamma \setminus \operatorname{conv}(N(c)). \end{cases}$$

Furthermore, $\varphi_c^\#(B) \subseteq [0, 1]$.

Proof. Let $\varphi_c$ be a local bump and let $B = [a, b] \in \mathcal{B}(\mathbb{R}^m)$. Let $[r^1_k, s^1_k], [r^2_k, s^2_k] \in \mathcal{B}(\mathbb{R})$ be such that $M\ell([a_k, b_k] - \frac{i^l_k}{M}) + 1 = [r^1_k, s^1_k]$ and $M\ell(\frac{i^u_k}{M} - [a_k, b_k]) + 1 = [r^2_k, s^2_k]$.

If $[a, b] \subseteq \operatorname{conv}(c)$: then $1 \le r^1_k$ and $1 \le r^2_k$ for all $k \in \{1, \dots, m\}$. Thus

$$\varphi_c^\#([a, b]) = R^\#\big(n_{\min_{2m}}^\#\{R_{[-\infty, 1]}^\#([r^p_k, s^p_k])\}_{(p, k) \in \{1, 2\} \times \{1, \dots, m\}}\big) = R^\#\big(n_{\min_{2m}}^\#\{[1, 1]\}_{(p, k)}\big) = [1, 1].$$

If $[a, b] \subseteq \Gamma \setminus \operatorname{conv}(N(c))$: then there exists a $(p', k') \in \{1, 2\} \times \{1, \dots, m\}$ such that $s^{p'}_{k'} \le 1 - 2^{\lceil \log_2 2m \rceil + 1}$. By monotonicity (Lemma A.1) we may enlarge all other arguments to $[R_{[-\infty, 1]}(r^p_k), 1]$; then, using Lemma A.16 and Lemma A.15, we know that there exists a $u \in \mathbb{R}$ such that

$$\varphi_c^\#([a, b]) = R^\#\big(n_{\min_{2m}}^\#\{[R_{[-\infty, 1]}(r^p_k), R_{[-\infty, 1]}(s^p_k)]\}_{(p, k)}\big) \subseteq R^\#([u, 0]) = [0, 0].$$

For any $[a, b] \in \mathcal{B}(\Gamma)$ we have $\varphi_c^\#([a, b]) \subseteq [0, 1]$ by construction.
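A small numerical sketch of the local bump of Definition A.11, evaluated pointwise via the closed form of Lemma A.12; the function names and grid parameters are ours, chosen for the example.

```python
import math

def local_bump(x, i_lo, i_hi, M):
    """Pointwise value of the local bump phi_c of Definition A.11 via the
    closed form of Lemma A.12:
      phi_c(x) = R(min_k min(M*l*(x_k - i_lo_k/M) + 1, M*l*(i_hi_k/M - x_k) + 1)),
    clipped to at most 1 (the value is 1 on conv(c), 0 outside conv(N(c)))."""
    m = len(x)
    l = 2 ** (math.ceil(math.log2(2 * m)) + 1)   # slope factor ell
    args = []
    for k in range(m):
        args.append(M * l * (x[k] - i_lo[k] / M) + 1)
        args.append(M * l * (i_hi[k] / M - x[k]) + 1)
    return max(0.0, min(1.0, min(args)))

M, i_lo, i_hi = 4, (1, 1), (2, 2)   # conv(c) is the grid cell [1/4, 1/2]^2
assert local_bump((0.3, 0.4), i_lo, i_hi, M) == 1.0   # inside conv(c)
assert local_bump((0.9, 0.9), i_lo, i_hi, M) == 0.0   # outside conv(N(c))
mid = local_bump((0.25 - 1/64, 0.3), i_lo, i_hi, M)   # on the ramp of width 1/(M*l)
assert 0.0 < mid < 1.0
print("local bump checks passed, ramp value:", mid)
```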
Lemma A.18. Let $\Gamma \subset \mathbb{R}^m$ be a closed box and let $f \colon \Gamma \to \mathbb{R}$ be continuous. For all $\delta > 0$ there exists a set of ReLU networks $\{n_k\}_{0 \le k \le N-1}$ of size $N \in \mathbb{N}$ approximating the $N$-slicing $\{f_k\}_{0 \le k \le N-1}$ of $f$ ($\xi_k$ as in Definition A.2) such that for all boxes $B \in \mathcal{B}(\Gamma)$

$$n_k^\#(B) = \begin{cases} [0, 0] & \text{if } \max f(B) \le \xi_k - \frac{\delta}{2}, \\ [1, 1] & \text{if } \min f(B) \ge \xi_{k+1} + \frac{\delta}{2}, \end{cases}$$

and $n_k^\#(B) \subseteq [0, 1]$.

Proof. Let $N \in \mathbb{N}$ be such that $N \ge 2\,\frac{\xi_{\max} - \xi_{\min}}{\delta}$, where $\xi_{\min} := \min f(\Gamma)$ and $\xi_{\max} := \max f(\Gamma)$. For simplicity we assume $\Gamma = [0, 1]^m$. Using the Heine-Cantor theorem, we get that $f$ is uniformly continuous, thus there exists a $\delta' > 0$ such that for all $x, y \in \Gamma$, $\|y - x\|_\infty < \delta'$ implies $|f(y) - f(x)| < \frac{\delta}{2}$. Further, let $M \in \mathbb{N}$ be such that $\frac{1}{M} \le \delta'$, and let $G$ be the grid defined by $G := (\frac{\mathbb{Z}}{M})^m \cap \Gamma$.

Let $C(B)$ be the set of corner points of the smallest hyperrectangle with corners in $G$ containing $B \in \mathcal{B}(\Gamma)$. We construct the set

$$\mathcal{C}_k := \Big\{ C(B) \;\Big|\; B \in \mathcal{B}(\Gamma),\; \min f(B) \ge \xi_{k+1} + \tfrac{\delta}{2} \Big\}.$$

We claim that $\{n_k\}_{0 \le k \le N-1}$ defined by

$$n_k(x) := R_{[-\infty, 1]}\Big(\sum_{c \in \mathcal{C}_k} \varphi_c(x)\Big)$$

satisfies the condition.

Case 1: Let $B \in \mathcal{B}(\Gamma)$ be such that $\min f(B) \ge \xi_{k+1} + \frac{\delta}{2}$. Then for all $g \in N(B)$ it holds that $f_k(g) = \frac{\delta}{2}$. By construction there exists a $c' \in \mathcal{C}_k$ such that $B \subseteq \operatorname{conv}(c')$. Using Lemma A.17 we get

$$n_k^\#(B) = R_{[-\infty, 1]}^\#\Big(\varphi_{c'}^\#(B) +^\# \sum_{c \in \mathcal{C}_k \setminus \{c'\}} \varphi_c^\#(B)\Big) = R_{[-\infty, 1]}^\#\big([1, 1] + [p_1, p_2]\big) = [1, 1],$$

where $[p_1, p_2] \in \mathcal{B}(\mathbb{R}_{\ge 0})$, since each $\varphi_c^\#(B) \subseteq [0, 1]$ by Lemma A.17.

Case 2: Let $B \in \mathcal{B}(\Gamma)$ be such that $\max f(B) \le \xi_k - \frac{\delta}{2}$. Then for all $g \in N(B)$ it holds that $f_k(g) = 0$. Further, $B \cap \operatorname{conv}(N(c)) = \emptyset$ for all $c \in \mathcal{C}_k$ because $G$ is fine enough. Using Lemma A.17 we obtain

$$n_k^\#(B) = R_{[-\infty, 1]}^\#\Big(\sum_{c \in \mathcal{C}_k} \varphi_c^\#(B)\Big) = R_{[-\infty, 1]}^\#([0, 0]) = [0, 0].$$

By construction we have $n_k^\#(B) \subseteq [0, 1]$.
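To close, a small sketch checking two of the interval facts used above: the clipping transformer of Lemma A.5 and the $n_{\min_N}$ propagation of Lemma A.13. The function interval_min implements the closed form proved in Lemma A.9 rather than propagating through the four ReLUs (the two are equal by that lemma); all names are ours.

```python
def clip_box(lo, hi, b):
    # R#_[-inf,b]([lo, hi]) = [b - R(b - lo), b - R(b - hi)]  (Lemma A.5)
    return b - max(b - lo, 0.0), b - max(b - hi, 0.0)

def interval_min(x, y):
    # nmin# on boxes, using the closed form of Lemma A.9.
    (a, b), (c, d) = x, y
    if d <= a:
        return c + (a - b) / 2, d + (b - a) / 2
    if b <= c:
        return a + (c - d) / 2, b + (d - c) / 2
    return a + c - (b + d) / 2, (b + d) / 2

def interval_min_n(boxes):
    # nmin_N via the balanced recursion of Definition A.8.
    if len(boxes) == 1:
        return boxes[0]
    half = (len(boxes) + 1) // 2          # ceil(N/2)
    return interval_min(interval_min_n(boxes[:half]),
                        interval_min_n(boxes[half:]))

# Lemma A.5: clipping at b = 1 sends [0.25, 3] to [0.25, 1].
assert clip_box(0.25, 3.0, 1.0) == (0.25, 1.0)

# Lemma A.13: nmin_N#([u1,1],...,[uN,1]) = [u1+...+uN + 1 - N, 1].
# Dyadic values keep the floating-point arithmetic exact.
us = [0.75, 0.5, 0.875, 0.25, 0.625]
lo, hi = interval_min_n([(u, 1.0) for u in us])
assert (lo, hi) == (sum(us) + 1 - len(us), 1.0)
print("clipping and nmin_N checks passed:", (lo, hi))
```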