Published as a conference paper at ICLR 2024

# EXPRESSIVITY OF RELU-NETWORKS UNDER CONVEX RELAXATIONS

Maximilian Baader*, Mark Niklas Müller*, Yuhao Mao & Martin Vechev
Department of Computer Science, ETH Zurich, Switzerland
{mbaader,mark.mueller,yuhao.mao,martin.vechev}@inf.ethz.ch
*Equal contribution.

ABSTRACT

Convex relaxations are a key component of training and certifying provably safe neural networks. However, despite substantial progress, a wide and poorly understood accuracy gap to standard networks remains, raising the question of whether this is due to fundamental limitations of convex relaxations. Initial work focused on the simple and widely used IBP relaxation. It revealed that some univariate, convex, continuous piecewise linear (CPWL) functions cannot be encoded by any ReLU network such that its IBP-analysis is precise. To explore whether this limitation is shared by more advanced convex relaxations, we conduct the first in-depth study on the expressive power of ReLU networks across all commonly used convex relaxations. We show that: (i) more advanced relaxations allow a larger class of univariate functions to be expressed as precisely analyzable ReLU networks, (ii) more precise relaxations can allow exponentially larger solution spaces of ReLU networks encoding the same functions, and (iii) even using the most precise single-neuron relaxations, it is impossible to construct precisely analyzable ReLU networks that express multivariate, convex, monotone CPWL functions.

1 INTRODUCTION

Table 1: Expressivity of different relaxations for univariate CPWL functions (✓: precisely expressible, ✗: not precisely expressible, ?: open). M: monotone, C: convex, MC: monotone and convex.

Relaxation   | CPWL | M-CPWL | C-CPWL | MC-CPWL
IBP          |  ✗   |   ✓    |   ✗    |   ✓
DEEPPOLY-0   |  ?   |   ✓    |   ✓    |   ✓
DEEPPOLY-1   |  ?   |   ✓    |   ✓    |   ✓
Δ            |  ?   |   ✓    |   ✓    |   ✓
Multi-Neuron |  ✓   |   ✓    |   ✓    |   ✓

With the increased deployment of neural networks in mission-critical applications, formal robustness guarantees against adversarial examples (Biggio et al., 2013; Szegedy et al., 2014) have become an important and active field of research. Many popular certification methods (Zhang et al., 2018; Singh et al., 2018; 2019a;b) provide such safety guarantees by using convex relaxations to compute over-approximations of a network's reachable set w.r.t. an adversary specification. However, despite significant progress, a wide and poorly understood accuracy gap between robust and conventional networks remains. This raises the fundamental question:

Is the expressivity of ReLU-networks under convex relaxations fundamentally limited?

Investigating this question, Mirman et al. (2022) prove that the class of convex, continuous piecewise-linear (CPWL) functions cannot be encoded as ReLU-networks such that their analysis with the simple IBP-relaxation (Gehr et al., 2018; Gowal et al., 2018) is precise.

This Work: Expressivity of Common Relaxations  To investigate whether this limitation of IBP is fundamental to all single-neuron convex relaxations, we conduct the first in-depth study on the expressive power of ReLU networks under all commonly used relaxations. To this end, we consider CPWL functions, naturally represented by ReLU networks, and two common restrictions, convexity and monotonicity. We illustrate our key findings in Table 1.
Key Results on Univariate Functions  In this work, we prove the following key results:

- The most precise single-neuron relaxation, the Δ-relaxation (Wong & Kolter, 2018), and the popular DEEPPOLY-relaxation (Singh et al., 2019b; Zhang et al., 2018) do not share IBP's limitation and can express univariate, convex, CPWL functions precisely.
- All considered relaxations, including IBP, can express univariate, monotone, CPWL functions precisely.
- The Δ-relaxation permits an exponentially larger network solution space for convex CPWL functions compared to the less precise DEEPPOLY-relaxation.
- Multi-neuron relaxations (Singh et al., 2019a; Müller et al., 2022) can express all univariate CPWL functions precisely using a single layer.

Having thus shown that, for univariate functions, the expressivity of ReLU networks under convex relaxations is not fundamentally limited, we turn our analysis to multivariate functions.

Key Results on Multivariate Functions  In this setting, we prove the following result:

- No single-neuron convex relaxation can precisely express even the heavily restricted class of multivariate, convex, monotone, CPWL functions.

Interestingly, the exact analysis of such monotone functions on box input regions is trivial, making the failure of convex relaxations even more surprising. In fact, CPWL functions as simple as f(x, y) = max(x, y) = y + ReLU(x − y) cannot be encoded by any finite ReLU network such that its Δ-analysis is precise. We thus conclude that, for multivariate functions, the expressivity of ReLU networks under single-neuron convex relaxations is fundamentally limited.

Implications of our Results for Certified Training  While we believe our results to be of general interest, they have particularly interesting implications for certified training. In this area, all state-of-the-art methods (Müller et al., 2023; Mao et al., 2023; Palma et al., 2023) are based on the simple IBP-relaxation even though it induces strong regularisation which severely reduces accuracy. While Jovanović et al. (2022) show that more precise relaxations induce significantly harder optimization problems, it remains an open question whether solving these would actually yield networks with better performance. Our results represent a major step towards answering this question. Specifically, in the univariate setting, we show that more precise relaxations increase expressivity (see Table 1) and lead to larger network solution spaces (compare Theorems 11 and 15). Thus, we hypothesize that using them during training yields a larger effective hypothesis space for the same network architecture. Importantly, this implies that networks with higher performance could indeed be obtained if we can overcome the optimization issues described by Jovanović et al. (2022). However, in the multivariate setting, perhaps surprisingly, we show that even the most precise single-neuron relaxations severely limit expressivity (see Corollary 21). This highlights the need for further study of more precise analysis methods such as multi-neuron or non-convex relaxations.

2 BACKGROUND ON CONVEX RELAXATIONS

Notation  We denote vectors with bold lower-case letters a ∈ R^n, matrices with bold upper-case letters A ∈ R^{n×d}, and sets with upper-case calligraphic letters. Inequalities a ≤ b between vectors are elementwise. We refer to a hyperrectangle B ⊂ R^n as a box. Further, we consider general (finite) ReLU networks h with arbitrary skip connections, including CNNs and ResNets.
2.1 CONVEX RELAXATIONS IN NEURAL NETWORK CERTIFICATION

We call a classifier H: X → Y locally robust around an input x ∈ X if it predicts the same, correct class y ∈ Y for all similar inputs B^ε_p(x) := {x′ ∈ X | ‖x′ − x‖_p ≤ ε}. To prove that a classifier is locally robust, we thus have to show that H(x′) = H(x) = y for all x′ ∈ B. For a neural network predicting H(x) := argmax_i h(x)_i, this is equivalent to showing that the logit of the target class is always greater than that of all other classes, i.e., 0 < min_{x′∈B, i≠y} h(x′)_y − h(x′)_i. As solving this non-convex optimization problem exactly is generally NP-complete (Katz et al., 2017), state-of-the-art neural network verifiers (Brix et al., 2023) relax it to an efficiently solvable convex optimization problem. To this end, we replace the non-linear activation functions with convex relaxations in their input-output space, allowing us to compute linear bounds on the output h(x):

{A_{l,i} x + b_{l,i}}_{i∈L} ≤ h(x) ≤ {A_{u,j} x + b_{u,j}}_{j∈U},

for some input region B^ε_p(x), with index sets L and U. These bounds can in turn be bounded by l_y = min_{x∈B} max_{i∈L}(A_{l,i} x + b_{l,i}) ∈ R and u_y analogously. Hence, we have l_y ≤ h(x) ≤ u_y.

Figure 1: IBP-relaxation of a ReLU with bounded inputs v ∈ [l, u].

IBP  Interval bound propagation (Mirman et al., 2018; Gehr et al., 2018; Gowal et al., 2018) only considers elementwise, constant bounds of the form l ≤ v ≤ u. Affine layers y = Wv + b are thus also relaxed as

(W(l + u) − |W|(u − l))/2 + b ≤ Wv + b ≤ (W(l + u) + |W|(u − l))/2 + b,

where |·| is the elementwise absolute value. ReLU functions are relaxed by their concrete lower and upper bounds ReLU(l) ≤ ReLU(v) ≤ ReLU(u), illustrated in Fig. 1.

Figure 2: DEEPPOLY-1 (blue) and DEEPPOLY-0 (green) ReLU abstraction with bounded inputs v ∈ [l, u].

DeepPoly (DP)  DeepPoly, introduced by Singh et al. (2019b), is mathematically identical to CROWN (Zhang et al., 2018) and based on recursively deriving linear bounds of the form A_l x + a_l ≤ v ≤ A_u x + a_u on the outputs of every layer. While this allows affine layers to be handled exactly, ReLU layers y = ReLU(v) are relaxed neuron-wise, using one of the two relaxations illustrated in Fig. 2:

λv ≤ ReLU(v) ≤ u (v − l)/(u − l),

where product and division are elementwise. Typically, the lower-bound slope λ ∈ {0, 1} is chosen depending on the input bounds l and u. In this work, however, we analyze the relaxations obtained by always choosing the same lower bound, which we denote with DEEPPOLY-0 (DP-0, λ = 0, green in Fig. 2) and DEEPPOLY-1 (DP-1, λ = 1, blue).

Figure 3: Δ-relaxation of a ReLU with bounded inputs v ∈ [l, u].

Triangle Relaxation (Δ)  In contrast to the above convex relaxations, the Δ-relaxation (Wong & Kolter, 2018; Dvijotham et al., 2018; Salman et al., 2019; Qin et al., 2019) maintains multiple linear upper- and lower-bounds on every network activation v. We write

A_{l,1} x + a_{l,1}, ..., A_{l,n} x + a_{l,n} ≤ v ≤ A_{u,1} x + a_{u,1}, ..., A_{u,n} x + a_{u,n}.

Unstable ReLU activations y = ReLU(v) with l < 0 < u are relaxed with their convex hull, as illustrated in Fig. 3:

0, v ≤ ReLU(v) ≤ u (v − l)/(u − l),

where again, product and division are elementwise. Note that this can lead to an exponential growth (with the depth of the network) in the number of constraints for any given activation.
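To make these relaxations concrete, the following minimal NumPy sketch (our own illustration, not code accompanying the paper) propagates a box through one affine and one ReLU layer with IBP, exactly as in the formulas above, and computes the neuron-wise upper bound u(v − l)/(u − l) shared by DeepPoly and the Δ-relaxation for an unstable ReLU. All concrete weights and inputs are arbitrary example values.

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """IBP bounds of y = W v + b given elementwise bounds l <= v <= u."""
    center, radius = (l + u) / 2, (u - l) / 2
    return W @ center + b - np.abs(W) @ radius, W @ center + b + np.abs(W) @ radius

def ibp_relu(l, u):
    """ReLU is monotone, so IBP simply returns [ReLU(l), ReLU(u)]."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)

def relu_upper_bound(l, u):
    """Slope and intercept of the upper bound u * (v - l) / (u - l) used by
    DeepPoly and the triangle relaxation for an unstable ReLU (l < 0 < u)."""
    slope = u / (u - l)
    return slope, -slope * l

# Example: propagate the box [-1, 1]^2 through one affine + ReLU layer with IBP.
l, u = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
W, b = np.array([[1.0, -1.0], [0.5, 0.5]]), np.zeros(2)
hl, hu = ibp_affine(l, u, W, b)
print(ibp_relu(hl, hu))                 # elementwise output bounds after the ReLU
print(relu_upper_bound(hl[0], hu[0]))   # linear upper bound for the first (unstable) neuron
```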
Figure 4: Comparison of a 2-neuron (green) and single-neuron (blue) relaxation projected into the y2-v1-v2 space for ReLU activations y_i = ReLU(v_i).

Multi-Neuron Relaxation (MN)  All methods introduced so far relax activation functions neuron-wise and are thus limited in precision by the (single-neuron) convex relaxation barrier (Salman et al., 2019), i.e., the activation function's convex hull in its input-output space. Multi-neuron relaxations, in contrast, compute the convex hull in the joint input-output space of multiple neurons in the same layer (Singh et al., 2019a; Müller et al., 2022), or consider multiple inputs jointly (Tjandraatmadja et al., 2020). We illustrate the increase in tightness in Fig. 4 for a group of just k = 2 neurons.

2.2 DEFINITIONS

We now define the most important concepts for this work.

Definition 1 (CPWL). We denote the set of continuous piecewise linear functions f: X → Y by CPWL(X, Y). Further, if X is some interval I ⊆ R, then we enumerate the points where f changes slope and call them xi, where 0 ≤ i ≤ n, i < j implies xi < xj, and X = [x0, xn]. All CPWL functions f: I → R satisfy f(x) = f(xi) + (x − xi) (f(xi+1) − f(xi))/(xi+1 − xi) for x ∈ [xi, xi+1]. We denote by M-CPWL, C-CPWL, and MC-CPWL the class of monotone (M), convex (C), and monotone & convex (MC) CPWL functions, respectively.

We say a network h encodes a function f if and only if they are equal on all inputs x ∈ X:

Definition 2 (Encoding). A neural network h: X → Y encodes a function f: X → Y if and only if for all x ∈ X we have h(x) = f(x).

In the following, D denotes a convex relaxation and can be IBP, DEEPPOLY-0 (DP-0), DEEPPOLY-1 (DP-1), Δ, or Multi-Neuron (MN). We now call the over-approximation of a network's graph (set of input-output tuples) obtained with domain D its D-analysis:

Definition 3 (Analysis). Let h: X → Y be a network, D a convex relaxation, and B ⊆ X an input box. We denote by h_D(B) the polytope in h's input-output space containing the graph {(x, h(x)) | x ∈ B} ⊆ h_D(B) ⊆ X × Y of h on B, as obtained with D, and refer to it as the D-analysis of h on B. For Y ⊆ R, we denote the interval bounds of f on B by [f̲(B), f̄(B)] := [min_{x∈B} f(x), max_{x∈B} f(x)] and, similarly, the interval bounds implied by h_D(B) as [h̲_D(B), h̄_D(B)] := [min_{(x,y)∈h_D(B)} y, max_{(x,y)∈h_D(B)} y].

As any D-analysis of h captures the set of all possible outputs h(x) for x ∈ B, it is of key interest to us to investigate when the analysis does not lose precision, specifically, whether the output bounds h̲_D(B) and h̄_D(B) do not exceed the interval bounds of f on B anywhere on B:

Definition 4 (Precise). Let h be a network encoding f: X → Y and D a convex relaxation. We say that the D-analysis is precise for h if it yields precise lower and upper bounds, that is, for all boxes B ⊆ X we have [h̲_D(B), h̄_D(B)] = [f̲(B), f̄(B)].

In this work, we investigate the expressivity of ReLU-networks, that is, which function class they can encode such that their D-analysis is precise. Specifically:

Definition 5 (Expressivity). Let D be a convex relaxation, F a set of functions, and N a set of networks. We say that N can D-express F precisely, if and only if, for all f ∈ F, there exists a network h ∈ N such that h encodes f and its D-analysis is precise.

We can often replace (sub-)networks to encode the same function but yield a (strictly) more precise analysis in terms of the obtained input-output polytope:

Definition 6 (Replacement). Let h and h′ be ReLU networks, B a box, and D some convex relaxation. We say h′ can replace h with respect to D, if h′_D(B) ⊆ h_D(B) for all B, and we write h ⇝_D h′.
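As a small illustration of the interpolation formula in Definition 1 (our own sketch, not code from the paper), the following function evaluates a univariate CPWL function given its breakpoints (xi, f(xi)); the example breakpoints are hypothetical.

```python
import numpy as np

def eval_cpwl(x, xs, ys):
    """Evaluate a univariate CPWL function defined by breakpoints (xs[i], ys[i]).

    xs must be strictly increasing and x must lie in [xs[0], xs[-1]].
    Implements f(x) = f(x_i) + (x - x_i) * (f(x_{i+1}) - f(x_i)) / (x_{i+1} - x_i).
    """
    i = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + (x - xs[i]) * slope

# Example: f = |x| on [-1, 1], given by breakpoints (-1, 1), (0, 0), (1, 1).
xs, ys = np.array([-1.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])
print(eval_cpwl(0.5, xs, ys))  # 0.5
```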
3 RELATED WORK

Below, we give a brief overview of the most relevant related work.

Expressing CPWL Functions  He et al. (2020) show that ReLU networks require at least 2 layers to encode CPWL functions in R^d (for d ≥ 2), with ⌈log2(d + 1)⌉ layers always being sufficient.

Expressivity with IBP  Baader et al. (2020) show that for any continuous function f: Γ ⊂ R^n → R over a compact domain Γ and ε > 0, there exists a finite ReLU network h such that its IBP-analysis for any input box B ⊆ Γ, denoted by h_IBP(B), is precise up to an ε-error:

[f̲(B) + ε, f̄(B) − ε] ⊆ h_IBP(B) ⊆ [f̲(B) − ε, f̄(B) + ε].

An equivalent result immediately follows for all strictly more precise domains such as DP-0, Δ, and MN. Wang et al. (2022) propose a more efficient construction, generalize this result to squashable activation functions, and provide first results on the hardness of constructing such networks. However, as these results require network widths going to ∞ for approximation errors ε → 0 and IBP-based methods fail empirically for realistic networks, the study of exact encodings is crucial. Investigating what class of functions allows for an exact IBP-analysis, Mirman et al. (2022) show that for any function with points of non-invertibility, e.g., x = 0 for f(x) = |x|, there does not exist a ReLU network IBP-expressing this function.

Certified Training  Certified training methods typically optimize an upper bound on the worst-case loss over some adversary specification, computed via convex relaxations. Surprisingly, using the imprecise IBP-relaxation (Mirman et al., 2018; Gowal et al., 2018) consistently yields better performance than tighter relaxations (Wong et al., 2018; Zhang et al., 2020; Balunović & Vechev, 2020). Jovanović et al. (2022) investigate this paradox and identify two key properties of the worst-case loss approximation, continuity and sensitivity, required for effective optimization, with only IBP possessing both. However, the heavy regularization that makes IBP-trained networks amenable to certification also severely reduces their standard accuracy (Mao et al., 2024).

Neural Network Certification  We distinguish complete certification methods, which, given sufficient time, can decide any property, i.e., always compute precise bounds, and incomplete methods, which sacrifice precision for speed. Salman et al. (2019) unify a range of incomplete certification methods including IBP, DEEPPOLY, and Δ, and show that their precision is limited by that of the Δ-relaxation. They observe that for a wide range of networks, and even when using the Δ-relaxation, a substantial certification gap between the upper- and lower-bounds on robust accuracy remains. Semidefinite-programming-based methods (Dathathri et al., 2020; Raghunathan et al., 2018) increase tightness at the cost of computational efficiency. Early complete certification methods directly leveraged off-the-shelf SMT (Katz et al., 2017; Ehlers, 2017) or MILP solvers (Dutta et al., 2018; Tjeng et al., 2019), limiting their applicability to small networks. To improve scalability, Bunel et al. (2020) formulate a branch-and-bound (BaB) framework that recursively splits the certification problem into easier subproblems until they can be decided by cheap incomplete methods. This concept has been widely adopted and improved using more efficient solvers (Xu et al., 2021; Wang et al., 2021) and tighter constraints (Palma et al., 2021; Ferrari et al., 2022; Zhang et al., 2022).
4 CONVEX RELAXATIONS FOR UNIVARIATE FUNCTIONS

In this section, we differentiate all convex relaxations that are commonly used for neural network certification (IBP, DP-0, DP-1, Δ, and MN) in terms of their expressivity, i.e., with respect to the function classes they can analyze precisely when encoded by a ReLU network. We first show that finite-depth ReLU networks can IBP-express M-CPWL functions precisely (Theorem 9). This construction can be applied directly to the strictly more precise DP-0 and Δ relaxations, and with slight modification also to DP-1. We then show that while finite ReLU networks can both DP-0- and Δ-express M-CPWL and C-CPWL functions, the solution space is exponentially larger when using the more precise Δ-analysis. Finally, we show that single-layer ReLU networks can MN-express arbitrary CPWL functions. We defer all proofs and supplementary lemmata to App. B.

4.1 IBP

Figure 5: IBP-analysis of the step function β − ReLU(β − (β/(x1 − x0)) ReLU(x − x0)).

To show that M-CPWL functions can be IBP-expressed, we begin by constructing a step function, illustrated in Fig. 5, as a two-layer ReLU network that can be IBP-expressed:

Lemma 7 (Step Function). Let β ∈ R≥0 and f ∈ CPWL(I, R) s.t. f(x) = 0 for x < x0, f(x) = β for x > x1, and f linear in between. Then, ϕ_{x0,x1,β}(x) = β − ReLU(β − (β/(x1 − x0)) ReLU(x − x0)) encodes f.

Lemma 8 (Precise Step). The IBP-analysis of ϕ_{x0,x1,β} is precise.

Intuitively, the key to this construction is to leverage that, while the IBP-relaxation of the ReLU function does not capture any relational information, it recovers the exact output interval. By using two sequential ReLUs, we allow the inner one to cut away the output half-space f(x) < 0 and the outer one to cut away the half-space f(x) > β, thus obtaining a precise analysis. We can now construct arbitrary M-CPWL functions from these step functions, allowing us to show that they, too, can be IBP-expressed:

Theorem 9 (Precise Monotone). Finite ReLU networks can IBP-express the set of monotone CPWL(I, R) functions precisely.
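The following sketch (our own illustration under the assumptions of Lemma 7, not code from the paper) implements the two-ReLU step function ϕ_{x0,x1,β} and propagates an input box through it layer by layer with IBP; the printed IBP bounds coincide with the exact range [ϕ(l), ϕ(u)], as claimed by Lemma 8. The concrete values of x0, x1, β, l, u are arbitrary examples.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

def step(x, x0, x1, beta):
    """Two-ReLU step function phi_{x0,x1,beta} from Lemma 7."""
    return beta - relu(beta - beta / (x1 - x0) * relu(x - x0))

def step_ibp(l, u, x0, x1, beta):
    """IBP bounds of phi_{x0,x1,beta} on [l, u], propagated layer by layer."""
    # inner ReLU(x - x0) is monotone, so its bounds are [relu(l - x0), relu(u - x0)]
    il, iu = relu(l - x0), relu(u - x0)
    # the affine map beta - beta/(x1 - x0) * (.) has negative slope and flips the interval
    al, au = beta - beta / (x1 - x0) * iu, beta - beta / (x1 - x0) * il
    # outer ReLU, then the final affine beta - (.) flips the interval again
    return beta - relu(au), beta - relu(al)

# Monotone CPWL functions are sums of such steps (Theorem 9); IBP stays precise.
x0, x1, beta = 0.0, 1.0, 2.0
l, u = -0.5, 0.75
print(step_ibp(l, u, x0, x1, beta))                   # IBP bounds on [l, u]
print(step(l, x0, x1, beta), step(u, x0, x1, beta))   # exact [f(l), f(u)]
```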
4.2 DEEPPOLY-0

Figure 6: Illustration of two different ReLU network encodings of the same function under DP-0-analysis (top and middle) and Δ-analysis (bottom).

We show constructively that finite ReLU networks can DP-0-express C-CPWL functions, by first encoding any such function as a single-layer ReLU network. We note that the below results apply equivalently to concave functions:

Lemma 10 (Convex encoding). Let f ∈ CPWL(I, R) be convex. Then f is encoded by

h(x) = b + cx + Σ_{i=1}^{n−1} γi ReLU(εi (x − xi)),    (1)

for any choice εi ∈ {−1, 1}, if b and c are set appropriately, where αi = (f(xi+1) − f(xi))/(xi+1 − xi) is the slope between points xi and xi+1, and γi = αi − αi−1 > 0 the slope change at xi.

Intuitively, we encode the C-CPWL function f by starting with a linear function h0 = b + cx, coinciding with one of the linear segments of f. We then pick one of the points xi where f changes slope that are adjacent to this segment and add ReLU(εi(x − xi)), changing its activation state at this point. Regardless of εi, we now scale this ReLU with γi = αi − αi−1 to introduce the local change of slope, and update the linear term c ← c − γi if the newly added ReLU affects the segment that the linear function matched originally. We repeat this process until h encodes f. We illustrate this in Fig. 6 (top), where we start our construction with the left-most linear segment. We continue by adding a ReLU, first at the green and then the red point, and show the DP-0-relaxation of the added ReLUs as a shaded area of the same color. We illustrate the resulting overall DP-0-relaxation, obtained as their point-wise sum, striped grey. Observe that this always recovers the original linear term as the lower bound (see Fig. 6, top). This leads to an imprecise output range unless its slope c is 0. If f includes such a constant section with zero slope, we can directly apply the above construction, always choosing εi such that the ReLUs open "outward", i.e., in a direction that does not affect the constant segment. If f does not include such a constant section but a unique minimum, as in our example, we place two ReLUs at this point, treating it as a constant section of zero width and recovering a precise lower bound (see Fig. 6, middle). Thus, finite ReLU networks can DP-0-express C-CPWL functions, but do not allow εi to be chosen freely. Note that the upper bound is still precise regardless of the choice of εi.

Theorem 11 (DP-0 Convex). For any convex CPWL function f: I → R, there exists exactly one network of the form h(x) = b + Σ_{i∈I} γi ReLU(εi(x − xi)) encoding f, with |I| = n − 1 if f has slope zero on some segment and otherwise |I| = n, such that its DP-0-analysis is precise, where γi > 0 for all i.
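To illustrate the construction of Lemma 10, the following sketch (our own, with hypothetical example breakpoints; not code from the paper) builds the encoding of Eq. (1) for the canonical choice εi = +1, i.e., the linear part matches the left-most segment and every interior breakpoint contributes its slope change γi. Note that, by Theorem 11, this particular choice of εi is in general not the one whose DP-0-analysis is precise; the sketch only demonstrates that the network encodes f.

```python
import numpy as np

def encode_convex_cpwl(xs, ys):
    """Encode a convex CPWL function with breakpoints (xs[i], ys[i]) as
    h(x) = b + c*x + sum_i gamma_i * ReLU(x - x_i)   (all eps_i = +1),
    where gamma_i = alpha_i - alpha_{i-1} > 0 is the slope change at x_i."""
    alphas = np.diff(ys) / np.diff(xs)   # segment slopes alpha_0, ..., alpha_{n-1}
    gammas = np.diff(alphas)             # slope changes at x_1, ..., x_{n-1}
    c = alphas[0]                        # linear part matches the left-most segment
    b = ys[0] - c * xs[0]
    kinks = xs[1:-1]
    def h(x):
        return b + c * x + np.sum(gammas * np.maximum(x - kinks, 0.0))
    return h

# Example: f(x) = |x| on [-1, 1] with breakpoints (-1, 1), (0, 0), (1, 1).
xs, ys = np.array([-1.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])
h = encode_convex_cpwl(xs, ys)
print([h(x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)])  # matches |x| at these points
```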
4.3 DEEPPOLY-1

To show that DP-1 has the same expressivity as DP-0, we encode a ReLU function as h(x) = x + ReLU(−x), which under DP-1-analysis yields the same linear bounds as h′(x) = ReLU(x) under DP-0-analysis. The reverse also holds. Thus, the expressivity of DP-0 and DP-1 is equivalent.

Corollary 12 (DP-1 ReLU). The ReLU network h(x) = x + ReLU(−x) encodes the function f(x) = ReLU(x), and the DP-1-analysis of h(x) is identical to the DP-0-analysis of ReLU. Further, the DP-0-analysis of h(x) is identical to the DP-1-analysis of ReLU.

It follows directly that any function that can be DP-0-expressed by a finite ReLU network can be DP-1-expressed by the same ReLU network after substituting every ReLU(x) with x + ReLU(−x):

Corollary 13 (DP-1 Approximation). Finite ReLU networks can DP-1- and DP-0-express the same function class precisely. In particular, they can DP-1-express the set of convex functions f ∈ CPWL(I, R) and monotone functions f ∈ CPWL(I, R) precisely.

4.4 TRIANGLE

To show that finite ReLU networks can Δ-express C-CPWL functions, we reuse the construction from Lemma 10. However, as the Δ-relaxation yields the exact convex hull for ReLU functions, we first show that the convex hull of a sum of convex functions (such as Eq. (1)) is recovered by the pointwise sum of their convex hulls:

Lemma 14 (Convex Hull Sum). Given two convex functions f, g: R → R and the box [l, u], the pointwise sum of the convex hulls Hf + Hg is identical to the convex hull of the sum of the two functions: H_{f+g} = Hf + Hg.

This follows directly from the definition and implies that the Δ-analysis is precise for arbitrary choices of εi, illustrated in the bottom of Fig. 6:

Theorem 15 (Δ Convex). Let f ∈ CPWL(I, R) be convex. Then, for any network h encoding f as in Lemma 10, we have that its Δ-analysis is precise. In particular, εi can be chosen freely.

4.5 MULTI-NEURON-RELAXATIONS

As multi-neuron relaxations yield the exact convex hull of the considered group of neurons (all within the same layer), it is sufficient to show that we can express arbitrary CPWL functions with a single-layer network to see that they MN-express CPWL functions. To this end, we use a similar construction as in Lemma 10, where the lack of convexity removes the positivity constraint on γi.

Theorem 16 (Multi-Neuron Precision). For every f ∈ CPWL(I, R), there exists a single-layer ReLU network h encoding f, such that its MN-analysis (considering all ReLUs jointly) is precise.

5 CONVEX RELAXATIONS FOR MULTIVARIATE FUNCTIONS

Figure 7: Illustration of the pre-image of activation pattern changes with and without functional change of a ReLU network h encoding the function g = max(x, y), as well as the ε-neighborhood U, in which the only activation change occurs at x = y.

After having shown in the previous section that, for univariate functions, the expressivity of ReLU networks under convex relaxations is not fundamentally limited, we now turn our attention to multivariate functions. There, we prove that no finite ReLU network can Δ-express the maximum function max: R^2 → R precisely (Theorem 20). This directly implies that no single-neuron relaxation can express the class of multivariate, monotone, and convex CPWL functions precisely. Note that this negative result generalizes to all relaxations less precise than Δ, including IBP and DEEPPOLY.

Intuitively, we will argue along the following lines. We first observe that for any finite ReLU network h that encodes the maximum function, we can find a point (x, y = x) ∈ R^2 with a neighborhood U, such that on U, all ReLUs in h either switch their activation state for x = y or not at all (see Fig. 7). Then, we show that for such a neighborhood, we can Δ-replace the finite ReLU network h with a single layer consisting of just 2 neurons and a linear term (Theorems 17 and 18). Finally, we show that no such single-layer network can Δ-express max precisely (Theorem 19), before putting everything together in Theorem 20. All proofs and supporting lemmata are again deferred to App. A.

Figure 8: Illustration of z = max(x, y) (left) with its lower (middle) and upper (right) bounds as obtained in Theorem 19.

Let us begin by showing that we can express any finite ReLU network using the functional form of Eq. (2). That is, every i-layer network h^i can be written as the sum of an (i−1)-layer network h^{i−1}_L and a linear function of a ReLU applied to another (i−1)-layer network, W_i ReLU(h^{i−1}_R). Further, if, for a given input region U, all ReLUs in the original network switch activation state on the hyperplane w⊤x = 0 or not at all, then we can ensure that every ReLU in both (i−1)-layer networks changes activation state exactly for z := w⊤x = 0.

Theorem 17 (Network Form Coverage). Given a neighborhood U and a finite k-layer ReLU network h such that on U and under Δ-analysis all its ReLUs are either stably active (ReLU(v) = v), stably inactive (ReLU(v) = 0), or switch activation state for z := w⊤x = 0 with w ∈ R^d, then h can be represented using the functional form

h^i_{R,L} = h^{i−1}_L + W_i ReLU(h^{i−1}_R),    h^0_{R,L} = b + W_0 x,    (2)

for i = k and such that all ReLUs switch their activation state at {x ∈ X | w⊤x = 0}. Here, L and R are labels used to distinguish the possibly different (i−1)-layer networks h^{i−1} from each other.

We can now leverage the fact that all ReLUs change activation state at the same point to simplify a sum of ReLUs to a linear term plus a single ReLU: Σ_i a_i ReLU(w_i z) ⇝ γz + α ReLU(z) for some γ, α ∈ R (Lemma 22). This allows us to further simplify a ReLU applied to such a sum of ReLUs: ReLU(γz + α ReLU(z)) ⇝ γ′z + α′ ReLU(z) (Lemma 23).
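The following sketch (our own numeric check with arbitrary example coefficients, not code from the paper) verifies the pointwise equalities underlying Lemmas 22 and 23 for ReLUs that all change activation state at z = 0, using the coefficient formulas stated in those lemmas.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

# Lemma 22: a sum of ReLUs that all kink at z = 0 collapses to a linear term
# plus a single ReLU, with gamma and alpha as stated in the lemma.
A = np.array([2.0, -1.0, 0.5])   # output weights (hypothetical example values)
w = np.array([3.0, -2.0, 1.0])   # input weights, all kinks sit at z = 0

gamma = np.sum(A[w < 0] * w[w < 0])
alpha = np.sum(A[w > 0] * w[w > 0]) - gamma

h  = lambda z: np.sum(A * relu(w * z))
h2 = lambda z: gamma * z + alpha * relu(z)
for z in (-1.5, -0.3, 0.0, 0.7, 2.0):
    assert np.isclose(h(z), h2(z))

# Lemma 23: a ReLU applied to such a term collapses again.
g_prime = relu(-gamma)
a_prime = relu(alpha + gamma) - g_prime
for z in (-1.5, -0.3, 0.0, 0.7, 2.0):
    assert np.isclose(relu(gamma * z + alpha * relu(z)), g_prime * z + a_prime * relu(z))
print("pointwise equalities hold")
```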
These two replacements allow us to recursively reduce the depth of networks in the form of Eq. (2) until just a single layer is left:

Theorem 18 (Network Simplification). Let h^k be a network as in Theorem 17 such that all ReLUs change activation state at z := w⊤x = 0 with w ∈ R^d. We have h^k = h^{k−1}_L + W ReLU(h^{k−1}_R) ⇝ h′(x) = b + Wx + α ReLU(z), where h^0(x) = b_0 + W_0 x and all ReLUs change state exactly at {x ∈ X | w⊤x = 0}. Note that h^k, h^{k−1}_L, and h′ map to R, while h^{k−1}_R maps to the space of some hidden layer R^n.

Next, we show directly via contradiction that single-layer ReLU networks of this form cannot Δ-express the maximum function, illustrating the resulting (imprecise) bounds in Fig. 8:

Theorem 19 (Triangle max). ReLU-networks of the form h(x, y) = b + w_x x + w_y y + α ReLU(x − y) cannot Δ-express the function max: R^2 → R.

Proof. We consider the input region B = [0, 1]^2 and constrain our parameters by considering the following: For x = y = 0, we have f(0, 0) = 0 = b = h(0, 0). For x < y, we have f(x, y) = y = w_x x + w_y y = h(x, y) and thus w_x = 0, w_y = 1. Finally, for x > y, we have f(x, y) = x = y + α(x − y) = h(x, y) and thus α = 1. Hence, we have h(x, y) = y + ReLU(x − y). On B, the Δ-relaxation yields ReLU(x − y) ≤ ½(x − y + 1), and thus h(x, y) ≤ y + ½(x − y + 1) = ½(x + y + 1). The maximum of this upper bound is attained at x = y = 1, where we get h̄_Δ(B) = 3/2, which is larger than max(B) = 1 (see Fig. 8).

To show that no ReLU network can Δ-express max, it remains to argue how we can find a U such that all ReLUs switch their activation state at {x ∈ X | w⊤x = 0} or not at all:

Theorem 20 (Δ Impossibility max). Finite ReLU networks cannot Δ-express the function max.

Proof. We will prove this theorem via contradiction in four steps. Assume there exists a finite ReLU network h that Δ-expresses max precisely.

First (Locality). We argue this point in three steps:
1. There exists a point (x, y = x) with an ε-neighborhood U such that one of the following holds for any ReLU(v) with input v = h_v(x, y) of the network h: (a) the ReLU is always active, i.e., for all (x, y) ∈ U, ReLU(v) = v; (b) the ReLU is never active, i.e., for all (x, y) ∈ U, ReLU(v) = 0; or (c) the ReLU changes activation state for x = y, i.e., there exists v′ ∈ R s.t. ReLU(v) = ReLU(v′ (x − y)). This follows directly from the fact that finite ReLU networks divide their input space into finitely many linear regions (see the illustration in Fig. 7).
2. Further, there exists an ε-neighborhood U of (x, y) such that the above holds under IBP-analysis, as it depends continuously on the input bounds and becomes exact for any network when the input bounds describe a point.
3. Via rescaling and translation, we can assume that the point (x, y) is at 0 and that the neighborhood U covers [−1, 1]^2.

Second (Network Form). On the neighborhood U, any finite ReLU-network h can, w.r.t. Δ, be represented by h^k = h^{k−1} + W ReLU(h^{k−1}), h^0(v) = b + Wv, with biases b ∈ R^{d_k} and weight matrices W ∈ R^{d_k × d_{k−1}}, where all ReLUs change activation state exactly for x = y (Theorem 17).

Third (Network Replacements). We can replace h^k w.r.t. Δ with the single-layer network h′(x, y) = b + W(x, y) + α ReLU(x − y) (Theorem 18).

Fourth (Conclusion). There exists no network of this form encoding the max-function such that its Δ-analysis is precise on the interval [0, 1]^2 (Theorem 19). This concludes the proof.
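The bound derived in Theorem 19 is easy to reproduce numerically. The following sketch (our own illustration, not code from the paper) evaluates the Δ upper bound of h(x, y) = y + ReLU(x − y) over B = [0, 1]^2 on a grid and compares it to the true maximum of max(x, y).

```python
import numpy as np

# Triangle (Delta) relaxation of ReLU(v) for v in [l, u] with l < 0 < u:
#   lower bounds: 0 and v;   upper bound: u * (v - l) / (u - l).
def triangle_upper(v, l, u):
    return u * (v - l) / (u - l)

# h(x, y) = y + ReLU(x - y) encodes max(x, y).  On B = [0, 1]^2 the pre-activation
# v = x - y ranges over [-1, 1], so the Delta upper bound on h is
#   y + (x - y + 1) / 2 = (x + y + 1) / 2.
xs = np.linspace(0.0, 1.0, 101)
X, Y = np.meshgrid(xs, xs)
upper = Y + triangle_upper(X - Y, -1.0, 1.0)

print(upper.max())             # 1.5 -- the analysis' upper bound over B
print(np.maximum(X, Y).max())  # 1.0 -- the true maximum of max(x, y) over B
```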
As max belongs to the class of multivariate, convex, monotone, CPWL functions, it follows directly from Theorem 20 that no finite ReLU network can Δ-express this class precisely:

Corollary 21 (Δ Impossibility). Finite ReLU networks cannot Δ-express the set of convex, monotone, CPWL functions mapping from some box I ⊆ R^2 to R.

6 CONCLUSION

We conduct the first in-depth study on the expressivity of ReLU networks under all commonly used convex relaxations and find that: (i) more precise relaxations (Δ, DP-0, or DP-1) allow a larger class of univariate functions (C-CPWL and M-CPWL) to be expressed precisely than the simple IBP-relaxation (M-CPWL), (ii) for the same function class (C-CPWL), a more precise relaxation (Δ vs. DP-0 or DP-1) can allow an exponentially larger solution space of ReLU networks, (iii) MN-relaxations allow single-layer networks to express all univariate CPWL functions, and (iv) even the most precise single-neuron relaxation (Δ) is too imprecise to express multivariate, convex, monotone CPWL functions precisely with finite ReLU networks, despite their exact analysis being trivial.

While more precise domains improve expressivity for univariate functions, all single-neuron convex relaxations are fundamentally limited in the multivariate setting. Surprisingly, even simple functions that can be encoded with a single neuron, h = y + ReLU(x − y) = max(x, y), cannot be Δ-expressed precisely using any finite ReLU network. This highlights not only the importance of recent, more precise multi-neuron- and BaB-based neural network certification methods but also suggests that more precise methods might be needed for training.

ACKNOWLEDGEMENTS

We would like to thank our anonymous reviewers for their constructive comments and insightful questions. This work has been done as part of the EU grant ELSA (European Lighthouse on Secure and Safe AI, grant agreement no. 101070617). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. The work has received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI).

REFERENCES

Maximilian Baader, Matthew Mirman, and Martin T. Vechev. Universal approximation with certified networks. In Proc. of ICLR, 2020.

Mislav Balunović and Martin T. Vechev. Adversarial training and provable defenses: Bridging the gap. In Proc. of ICLR, 2020.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Proc. of ECML PKDD, volume 8190, 2013. doi: 10.1007/978-3-642-40994-3_25.

Christopher Brix, Mark Niklas Müller, Stanley Bak, Taylor T. Johnson, and Changliu Liu. First three years of the international verification of neural networks competition (VNN-COMP). CoRR, abs/2301.05815, 2023. doi: 10.48550/ARXIV.2301.05815. URL https://doi.org/10.48550/arXiv.2301.05815.

Rudy Bunel, Jingyue Lu, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. Branch and bound for piecewise linear neural network verification. J. Mach. Learn. Res., 21, 2020.

Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian J. Goodfellow, Percy Liang, and Pushmeet Kohli. Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. In Proc. of NeurIPS, 2020.

Souradeep Dutta, Susmit Jha, Sriram Sankaranarayanan, and Ashish Tiwari.
Output range analysis for deep feedforward neural networks. In NASA Formal Methods - 10th International Symposium, NFM 2018, Newport News, VA, USA, April 17-19, 2018, Proceedings, volume 10811, 2018. doi: 10.1007/978-3-319-77935-5\_9. Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A. Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, 2018. Rüdiger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In Automated Technology for Verification and Analysis - 15th International Symposium, ATVA 2017, Pune, India, October 3-6, 2017, Proceedings, volume 10482, 2017. doi: 10.1007/ 978-3-319-68167-2\_19. Claudio Ferrari, Mark Niklas Müller, Nikola Jovanovi c, and Martin T. Vechev. Complete verification via multi-neuron relaxation guided branch-and-bound. In Proc. of ICLR, 2022. Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. AI2: safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA, 2018. doi: 10.1109/SP.2018.00058. Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy A. Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. Ar Xiv preprint, abs/1810.12715, 2018. Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. Relu deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3), 2020. ISSN 1991-7139. doi: https: //doi.org/10.4208/jcm.1901-m2018-0160. Nikola Jovanovi c, Mislav Balunovi c, Maximilian Baader, and Martin T. Vechev. On the paradox of certified training. Trans. Mach. Learn. Res., 2022, 2022. Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. Ar Xiv preprint, abs/1702.01135, 2017. Published as a conference paper at ICLR 2024 Yuhao Mao, Mark Niklas Müller, Marc Fischer, and Martin T. Vechev. TAPS: connecting certified and adversarial training. Co RR, abs/2305.04574, 2023. doi: 10.48550/ar Xiv.2305.04574. Yuhao Mao, Mark Niklas Müller, Marc Fischer, and Martin T. Vechev. Understanding certified training with interval bound propagation. In Proc. of. ICLR, 2024. Matthew Mirman, Timon Gehr, and Martin T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In Proc. of ICML, volume 80, 2018. Matthew Mirman, Maximilian Baader, and Martin T. Vechev. The fundamental limits of neural networks for interval certified robustness. Trans. Mach. Learn. Res., 2022, 2022. Mark Niklas Müller, Gleb Makarchuk, Gagandeep Singh, Markus Püschel, and Martin T. Vechev. PRIMA: general and precise neural network certification via scalable convex hull approximations. Proc. ACM Program. Lang., 6(POPL), 2022. doi: 10.1145/3498704. Mark Niklas Müller, Franziska Eckert, Marc Fischer, and Martin T. Vechev. Certified training: Small boxes are all you need. In Proc. of ICLR, 2023. Alessandro De Palma, Harkirat S. Behl, Rudy Bunel, Philip H. S. Torr, and M. Pawan Kumar. Scaling the convex barrier with active sets. In Proc. of ICLR, 2021. Alessandro De Palma, Rudy Bunel, Krishnamurthy Dvijotham, M. 
Pawan Kumar, Robert Stanforth, and Alessio Lomuscio. Expressive losses for verified robustness via convex combinations. Co RR, abs/2305.13991, 2023. doi: 10.48550/ar Xiv.2305.13991. Chongli Qin, Krishnamurthy (Dj) Dvijotham, Brendan O Donoghue, Rudy Bunel, Robert Stanforth, Sven Gowal, Jonathan Uesato, Grzegorz Swirszcz, and Pushmeet Kohli. Verification of non-linear specifications for neural networks. In Proc. of ICLR, 2019. Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Proc. of Neur IPS, 2018. Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks. In Proc. of Neur IPS, 2019. Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin T. Vechev. Fast and effective robustness certification. In Proc. of Neur IPS, 2018. Gagandeep Singh, Rupanshu Ganvir, Markus Püschel, and Martin T. Vechev. Beyond the single neuron convex barrier for neural network certification. In Proc. of Neur IPS, 2019a. Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin T. Vechev. An abstract domain for certifying neural networks. Proc. ACM Program. Lang., 3(POPL), 2019b. doi: 10.1145/3290354. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proc. of ICLR, 2014. Christian Tjandraatmadja, Ross Anderson, Joey Huchette, Will Ma, Krunal Patel, and Juan Pablo Vielma. The convex relaxation barrier, revisited: Tightened single-neuron relaxations for neural network verification. In Proc. of Neur IPS, 2020. Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In Proc. of ICLR, 2019. Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. Beta-crown: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In Proc. of Neur IPS, 2021. Zi Wang, Aws Albarghouthi, Gautam Prakriya, and Somesh Jha. Interval universal approximation for neural networks. Proc. ACM Program. Lang., 6(POPL), 2022. doi: 10.1145/3498675. Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proc. of ICML, volume 80, 2018. Published as a conference paper at ICLR 2024 Eric Wong, Frank R. Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Proc. of Neur IPS, 2018. Kaidi Xu, Huan Zhang, Shiqi Wang, Yihan Wang, Suman Jana, Xue Lin, and Cho-Jui Hsieh. Fast and complete: Enabling complete neural network verification with rapid and massively parallel incomplete verifiers. In Proc. of ICLR, 2021. Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Proc. of Neur IPS, 2018. Huan Zhang, Hongge Chen, Chaowei Xiao, Sven Gowal, Robert Stanforth, Bo Li, Duane S. Boning, and Cho-Jui Hsieh. Towards stable and efficient training of verifiably robust neural networks. In Proc. of ICLR, 2020. Huan Zhang, Shiqi Wang, Kaidi Xu, Linyi Li, Bo Li, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. General cutting planes for bound-propagation-based neural network verification. In Neur IPS, 2022. Published as a conference paper at ICLR 2024 A DEFERRED PROOFS ON MULTIVARIATE FUNCTIONS Theorem 17 (Network Form Coverage). 
Given a neighborhood U and a finite k-layer Re LU network h such that on U and under -analysis all its Re LUs are either stably active (Re LU(v) = v), stably inactive (Re LU(v) = 0), or switch activation state for z := w x = 0 with w Rd, then h can be represented using the functional form hi {R,L} = hi 1 L + Wi Re LU(hi 1 R ), h0 {R,L} = b + W0x, (2) for i = k and such that all Re LUs switch their activation state at {x X | w x = 0}. Here, L and R are labels, used to distinguish the possibly different (i-1)-layer networks hi 1 from each other. Proof. Given Wi Re LU(hi 1 R ), we partition the columns of the weight matrix into Wi = (W + i |W i |W i ), depending on whether the associated Re LU is stably active, stably inactive, or unstable, respectively. We thus obtain (W + i |W i |W i ) Re LU(hk 1 R ) = W + i hk 1 R + W i Re LU(hk 1 R ). We update hi 1 L,new = hi 1 L + W + i hi 1 R , by showing that W + i hi 1 R is still an (i 1)-layer network as follows. We recursively update weight matrices Wk,new = W + i Wk to obtain W + i hk = W + i hk 1 + Wk,new Re LU(hk 1 R ) until we have reached k = 1, where we have W + i h1 = W + i b L + W + i W0,Lv + W1,new Re LU(b R + W0,Rx). Lemma 22 (Simplification of Re LU Sums w.r.t. ). Let A Rn 1 and w Rn. Then, we have h(z) = A Re LU(wz) h (z) = γz + α Re LU(z), where γ = P i,wi<0 Aiwi and α = P i,wi>0 Aiwi γ. Proof. Both h and h are CPWL functions with slope change only at z = 0. Thus they are fully defined by their value at the points zi { 1, 0, 1}. Hence, we can show that h and h encode the same function by showing their equivalence on these points: h(0) = 0 = h (0), h( 1) = P i,wi<0 Aiwi = γ = h ( 1), and h(1) = P i,wi>0 Aiwi = α + γ. As γz and α Re LU(z) are convex/concave, and their -analysis yields their convex hulls, the pointwise sum of their convex hulls, i.e. the -analysis of h , recovers the convex hull of h by Lemma 14 and is thus at least as precise as any convex-relaxation of h. Lemma 23 (Simplification of Composed Re LUs w.r.t. ). We have h(z) = Re LU(γz + α Re LU(z)) h (z) = γ z + α Re LU(z), where γ = Re LU( γ) and α = Re LU(α + γ) γ . Proof. We observe that h(z) is convex and piecewise-linear for any z [l, u] R with l < 0 < u and a slope change only at z = 0. Its convex hull is thus spanned by h(l) = Re LU(γl) = h (l), h(0) = h (0) = 0, and h(u) = Re LU((γ +α)u) = h (u). We further observe that the -relaxation of Re LU(z) and z is their convex hull. Finally, the convex hull of the positive sum of convex functions is equal to the pointwise sum of their individual convex hulls (Lemma 14). Thus, the triangle-relaxation of h (z) recovers the convex hull of h(z) and thus the tightest possible convex relaxation. For convex functions f, g: R R, we define the convex hull Hf([l, u]) = {(x, y) | x [l, u], f(x) y f(l) + f(u) f(l) u l (x l)} over [l, u] R. Further, we define the convex hull sum of f and g on [l, u] to be Hf + Hg := {(x, y + y ) | (x, y ) Hf, (x, y ) Hg}. Lemma 14 (Convex Hull Sum). Given two convex functions f, g: R R and the box [l, u]. Then, the pointwise sum of the convex hulls Hf + Hg is identical to the convex hull of the sum of the two functions Hf+g = Hf + Hg. Proof. We first show that every point in Hf+g can be obtained from Hf + Hg. Let (x, y) Hf+g([l, u]). Then we have (f + g)(x) y (f + g)(l) + (f+g)(u) (f+g)(l) f(x) + g(x) y f(l) + f(u) f(l) u l (x l) + g(l) + f(u) f(l) Published as a conference paper at ICLR 2024 Then we can find a partition of y = y + y . We know for sure that there exists t [0, 1] s.t. 
y = (1 t)(f + g)(x) + t((f + g)(l) + (f+g)(u) (f+g)(l) u l (x l)). Hence if we pick for example y = (1 t)f(x) + t(f(l) + f(u) f(l) u l (x l)) Hf y = (1 t)g(x) + t(g(l) + g(u) g(l) u l (x l)) Hg, we get immediately that (x, y) Hf + Hg. The other direction is immediate. Using Lemma 23, we can show that these networks mapping R2 to R can be simplified further: Theorem 18 (Network Simplification). Let hk be a network as in Theorem 17 such that all Re LUs change activation state at z := w x = 0 with w Rd. We have hk = hk 1 L + W Re LU(hk 1 R ) h(x) = b + W x + α Re LU(z), where h0(x) = b0 + W0x and all Re LU change state exactly at {x X | w x = 0}. Note that, hk, hk 1 L and h map to R, while hk 1 R maps to some Rn. Proof. We show a more general result on hk with possibly many output dimensions by induction: Induction Hypothesis: hi bi + Wix + αi Re LU(z). Base Case: h0(x) = b0 + W0x satisfies the form h0(x) = b0 + W0x + α0 Re LU(z) for α0 = 0, thus we can replace h0(x) by itself. Induction Step: Using the induction hypothesis, we have Wi Re LU(hi 1 R ) = Wi Re LU(bi 1 + Wi 1x + αi 1 Re LU(z)), which by Theorem 17 only changes its activation state at z = 0. Since Re LU(0) = 0, we must have bi 1 + Wi 1x = wz for some w (recall that z is the projection of x on a hyperplane in the input space). Further, applying Lemma 23, we obtain Wi Re LU(hi 1 R ) =Wi Re LU(wz + αi 1 Re LU(z)) γ iz + α i Re LU(z) = b i + W i x + α i Re LU(z). Using the induction hypothesis, we can thus rewrite: hi = hi 1 + Wi Re LU(hi 1 R ) b + W x + αi Re LU(z). Theorem 19 (Triangle max). Re LU-networks of the form h(x, y) = b+wxx+wyy+α Re LU(x y) can not -express the function max: R2 R. Proof. We first constrain our parameters by considering the following: For x = y = 0, we have f(0, 0) = 0, leading to b = 0 = h(0, 0). For x < y, we have f(x, y) = y = wxx + wyy = h(x, y) and thus wx = 0, wy = 1. For x > y, we have f(x, y) = x = y + α(x y) = h(x, y) and thus α = 1. Hence we have h(x, y) = y + Re LU(x y). Re LU(x y) 1 Adding y results in the following: 2(x + y + 1). The maximum of the upper bound is attained at x = y = 1, where we get 3 2 which is larger than max(x, y) = 1 for x, y [0, 1]. Published as a conference paper at ICLR 2024 Theorem 24 (Triangle max ℓp). Re LU-networks of the form h(x, y) = b + wxx + wyy + α Re LU(x y) can not -express the function max: R2 R for any ℓp-norm bounded perturbation with p 1, i.e., input regions Bϵ p(x) := {x X | x x p ϵ}. Proof. We first consider the case of p > 1 and f = max(x, y): We again constrain our parameters as in the proof of Theorem 19 to obtain h(x, y) = y + Re LU(x y). We consider the input region . We can now use Hölder s inequality to compute bounds on the Re LU q | {z } cq:= q ϵ x y x0 y0 + q | {z } cq:= q = 1. And thus obtain the following bounds on the Re LU output: Re LU(x y) 1 Adding y results in the following: y x 2(x + y + cq We can again use Hölder s inequality to bound the upper bound: 1 2(x + y + cq 2(x0 + y0 + cq q | {z } =cq 2(1 + 2 1 q ) > 1, Where the last inequality holds due to p > 1 = q < . This upper bound is strictly greater than max B0.5 p max(x, y) = 1. We now consider the case of p = 1 and the rotated max function f = max( x y 2 ). We choose the input region B ϵ= 1 and observe that we recover the setting as discussed in the proof of Theorem 19 for p = rotated by 45 around the origin and thus obtain the same imprecise bounds. Theorem 20 ( Impossibility max). Finite Re LU networks can not -express the function max. Proof. 
We will prove this theorem in four steps. First Locality 1. There exists a point (x, y = x) with an ϵ-neighborhood U such that one of the following holds for any Re LU(v) with input v = hv(x, y) of the network h: the Re LU is always active, i.e., (x, y) U, Re LU(v) = v, the Re LU is never active, i.e., (x, y) U, Re LU(v) = 0, or the Re LU changes activation state for x = y, i.e., v R, s.t., Re LU(v) = Re LU(v (x y)). 2. Further, there exists a neighborhood U of (x, y) such that the above holds under -analysis, as it depends continuously on the input bounds and becomes exact for any network when the input bounds describe a point. 3. Via rescaling and translation, we can assume that the point (x, y) is at 0 and that the neighborhood U covers [ 1, 1]2. Second Network Form On the neighborhood U, any finite Re LU-network h can, w.r.t. , be replaced by hk = hk 1 + W Re LU(hk 1) with biases b Rdk, weight matrices W Rdk dk 1, and h0(v) = b + W v, where all Re LUs change activation state exactly for x = y (Theorem 17). Published as a conference paper at ICLR 2024 Third Network Simplifications We can replace hk w.r.t. triangle with b+Wk(x, y) +αk R(x y) (Theorem 18). Fourth Conclusion Every finite Re LU network can be replaced w.r.t. with a single layer network of the form h1(x, y) = b+W (x, y) +αR(x y). However, there exists no such network encoding the max-function such that its -analysis is precise on the interval [0, 1]2 (Theorem 19). Corollary 21 ( Impossibility). Finite Re LU networks can not -express the set of convex, monotone, CPWL functions mapping from some box I R2 to R. B DEFERRED PROOFS ON UNIVARIATE FUNCTIONS Lemma 7 (Step Function). Let β R 0 and f CPWL(I, R) s.t. f(x) = 0 for x < x0, f(x) = β for x > x1 and linear in between. Then, ϕx0,x1,β(x) = β Re LU(β β x1 x0 Re LU(x x0)) encodes f. Proof. We prove the theorem by considering the three cases separately: 1. For x x0 we have ϕx0,x1,β(x) = Re LU( β x1 x0 Re LU(x x0) + β) + β = Re LU( β x1 x0 0 + β) + β = Re LU(β) + β = β + β = 0. 2. For x0 x x1 we have ϕx0,x1,β(x) = Re LU( β x1 x0 Re LU(x x0) + β) + β = Re LU( β x1 x0 (x x0) + β) + β = β x1 x0 (x x0) β + β = β x1 x0 (x x0). 3. For x x1 we have ϕx0,x1,β(x) = Re LU( β x1 x0 Re LU(x x0) + β) + β = Re LU( β x1 x0 (x x0) + β) + β = 0 + β = β. Lemma 8 (Precise Step). The IBP-analysis of ϕx0,x1,β is precise. Published as a conference paper at ICLR 2024 Proof. Consider the box [l, u] R. ϕx0,x1,β([l, u]) = Re LU( β x1 x0 Re LU([l, u] x0) + β) + β = Re LU( β x1 x0 Re LU([l x0, u x0]) + β) + β = Re LU( β x1 x0 [Re LU(l x0), Re LU(u x0)] + β) + β = Re LU([ β x1 x0 Re LU(u x0), β x1 x0 Re LU(l x0)] + β) + β = Re LU([ β x1 x0 Re LU(u x0) + β, β x1 x0 Re LU(l x0) + β]) + β = Re LU([ β x1 x0 Re LU(u x0) + β, β x1 x0 Re LU(l x0) + β]) + β = [Re LU( β x1 x0 Re LU(u x0) + β), Re LU( β x1 x0 Re LU(l x0) + β)] + β = [Re LU( β x1 x0 Re LU(l x0) + β), Re LU( β x1 x0 Re LU(u x0) + β)] + β = [Re LU( β x1 x0 Re LU(l x0) + β) + β, Re LU( β x1 x0 Re LU(u x0) + β) + β] = [ϕx0,x1,β(l), ϕx0,x1,β(u)]. Theorem 9 (Precise Monotone). Finite Re LU networks can IBP-express the set of monotone CPWL(I, R) functions precisely. Proof. W.l.o.g. assume f is monotonously increasing. Otherwise, consider f. Let xi for i {0, . . . , n} be the set of boundary points of the linear regions of f with x0 < < xn. We claim that h(x) = f(x0) + i=0 ϕxi,xi+1,f(xi+1) f(xi)(x) is equal to f on I and that the IBP-analysis of h is precise. We note that f(xi+1) f(xi) > 0. We first show f = h on I. For cach x I pick i {1, . . . 
, n} such that xj 1 x < xj. Then h(x) = f(x0) + i=0 ϕxi,xi+1,f(xi+1) f(xi)(x) i=0 ϕxi,xi+1,f(xi+1) f(xi)(x) i=0 ϕxi,xi+1,f(xi+1) f(xi)(x) + ϕxj,xj+1,f(xj+1) f(xj)(x) i=0 [f(xi+1) f(xi)] + f(xj+1) f(xj) = f(xj) + f(xj+1) f(xj) xj+1 xj (x xj) where we used the piecewise linearity of f in the last step. Published as a conference paper at ICLR 2024 Now we show that the analysis of IBP of h is precise. Consider the box [l, u] I. We have h([l, u]) = f(x0) + i=1 ϕxi,xi+1,f(xi+1) f(xi)([l, u]) i=1 [ϕxi,xi+1,f(xi+1) f(xi)(l), ϕxi,xi+1,f(xi+1) f(xi)(u)] i=1 ϕxi,xi+1,f(xi+1) f(xi)(l), f(x0) + i=1 ϕxi,xi+1,f(xi+1) f(xi)(u)] = [h(l), h(u)] = [f(l), f(u)]. B.2 DEEPPOLY-0 Lemma 10 (Convex encoding). Let f CPWL(I, R) be convex. Then f is encoded by h(x) = b + cx + i=1 γi Re LU( i(x xi)), (1) for any choice i { 1, 1}, if b and c are set appropriately, where αi = f(xi+1) f(xi) xi+1 xi is the slope between points xi and xi+1, and γi = αi αi 1 > 0 the slope change at xi+1. Proof. First, we show that γj = αj αj 1. Evaluating h(x) for xj x xj+1 yields h(x) = b + cx + i=1 γi Re LU( i(x xi)) i=1 γi i (x xi)[ i = +, xi < x] + i=1 γi i (x xi)[ i = , xi > x] i=1 γi(x xi)[ i = +, xi < x] i=1 γi(x xi)[ i = , xi > x] i=1 γi(x xi)[ i = +] i=j+1 γi(x xi)[ i = ]. The derivative of h evaluated at x for xj x xj+1 is αj: h x(x) = c + i=1 γi[ i = +] i=j+1 γi[ i = ] = αj. By choosing ϵ small enough we can ensure that xj +ϵ [xj, xj+1] and xj ϵ [xj 1, xj] and thus αj αj 1 = h x(xj + ϵ) h i=1 γi[ i = +] i=j+1 γi[ i = ] i=1 γi[ i = +] + i=j γi[ i = ] = γj[ j = +] + γj[ j = ] = γj Published as a conference paper at ICLR 2024 Next, we show that one can pick i arbitrarily as long as b and c are set appropriately. Pick any choice of i { 1, 1} and set b and c to b := f(x0) x0 f(x1) f(x0) i=1 γixi[ i = ] = f(x0) x0α0 i=1 γixi[ i = ] i=1 γi[ i = ]. We have h(x) = f(x). Indeed: For any x [x0, xn] pick j s.t. x [xj, xj+1]. Then h(x) = b + cx + i=1 γi(x xi)[ i = +] i=j+1 γi(x xi)[ i = ] i=1 γixi[ i = +] + i=j+1 γixi[ i = ] | {z } offset i=1 γi[ i = +] i=j+1 γi[ i = ] | {z } linear The offset evaluates to i=1 γixi[ i = +] + i=j+1 γixi[ i = ] = f(x0) x0α0 i=1 γixi[ i = ] i=1 γixi[ i = +] + i=j+1 γixi[ i = ] = f(x0) x0α0 i=1 γixi[ i = ] i=1 γixi[ i = +] = f(x0) x0α0 = f(x0) x0α0 γ1x1 γ2x2 γjxj = f(x0) x0α0 (α1 α0)x1 (α2 α1)x2 (αj αj 1)xj = f(x0) x0α0 + α0x1 α1x1 + α1x2 α2x2 + αj 1xj αjxj = f(x0) + (x1 x0)α0 + (x2 x1)α1 + (xj xj 1)αj 1 αjxj = f(x0) + (f(x1) f(x0)) + (f(x2) f(x1)) + (f(xj) f(xj 1)) αjxj = f(xj) αjxj, where we used that γi = αi αi 1 and αi = f(xi+1) f(xi) xi+1 xi . The linear part evaluates to i=1 γi[ i = +] i=j+1 γi[ i = ] i=1 γi[ i = ] + i=1 γi[ i = +] i=j+1 γi[ i = ] i=1 γi[ i = ] + i=1 γi[ i = +] i=1 αi αi 1 Published as a conference paper at ICLR 2024 Combining the results, we get h(x) = f(xj) αjxj + xαj = f(xj) + αj(x xj) = f(x), by the piecewise linearity of f. Lemma 25 (DP-0 Monotone Re LU). The DP-0-analysis of the 1-layer Re LU network h(x) = Pn i=1 γi Re LU(x xi) yields h(x), h(xj 1) + αj 1(x xj 1) ( h(x), if all Re LUs are stable, h(u) h(l) u l (x l) + h(l) otherwise. where xi R s.t. 1 i n and i < p xi < xp, and γi are either all > 0 or all < 0. j is the smallest i such that xi l and k is the largest i such that xi < u. Thus, DP-0 analysis for h(x) is precise. Proof. W.o.l.g. assume h is monotonously increasing. Otherwise, consider h. The cases u < x1 and xn < l are immediate. Choose j as the smallest i such that xi l and k as the largest i such that xi < u. 
DP-0 yields for Re LU(x xi) on l x u x xi if i j, 0 if j < i < k, 0 if k i, Re LU(x xi) x xi if i < j, u xi u l (x l) if j i k, 0 if k < i. Thus we have i=1 γi(x xi) h(x) i=1 γi(x xi) + i=j γi u xi The term Pj 1 i=1 γi(x xi) can be simplified as follows i=1 γi(x xi) = i=1 (αi αi 1) i=1 (αi αi 1)xi = x(αj 1 α0) = x(αj 1 α0) i=1 αixi+1 + α0x1 = x(αj 1 α0) αj 1xj 1 i=1 αixi+1 + α0x1 = x(αj 1 α0) αj 1xj 1 + i=1 αi(xi+1 xi) + α0x1 = αj 1(x xj 1) + α0(x1 x) + i=1 (h(xi+1) h(xi)) = αj 1(x xj 1) + α0(x1 x) + h(xj 1) h(x1) = h(xj 1) + αj 1(x xj 1), hence we have proven the lower bound. Published as a conference paper at ICLR 2024 Now we consider the upper bound. We evaluate the upper bound at l and u. If the two linear upper bounds coincide there, they coincide everywhere: i=1 γi(l xi) + i=j γi u xi i=1 γi(l xi) = h(l) = h(u) h(l) u l (l l) + h(l), i=1 γi(u xi) + i=j γi u xi i=1 γi(u xi) + i=j γi(u xi) i=1 γi(u xi) = h(u) = h(u) h(l) u l (u l) + h(l), hence we have proven the upper bound. Theorem 11 (DP-0 Convex). For any convex CPWL function f : I R, there exists exactly one network of the form h(x) = b + P i I γi Re LU( i(x xi)) encoding f, with |I| = n 1 if f has slope zero on some segment and otherwise |I| = n, such that its DP-0-analysis is precise, where γi > 0 for all i. Proof. The proof works as follows: If the function is monotonously increasing, we show that using a local argument j = + and if the function is monotonously decreasing, we argue j = . If f has somewhere zero slope, it will be on a piecewise linear region at the boundary of [x0, xn], in which case we need 1 neuron less. If the function is non-monotone and has slope zero somewhere, then there are two minima x and x that are also switching points. Hence f is for all xj > x +x 2 increasing and for all xj < x +x 2 decreasing, so we can reuse the argument from before and need n 1 neurons for that. Then we argue separately for the Re LU at x and x . If the function is non-monotone and has nowhere slope zero, then there exists a unique minimum xj. We then show that there is exactly one splitting of γj to Re LU(x xj) and Re LU( x + xj). Finally, we prove that network is precise. We know that there are finitely many switching points xi, 1 i n. Case 1: f is monotone. W.o.l.g. assume f is monotonously increasing; the proof is similar for monotonously decreasing f. Assume that j = for some j. Then there exists ϵ > 0 such that for all x [xj ϵ, xj + ϵ], all the Re LUs except Re LU( x + xj) are stable, i.e., either always 0 or always active. Further, for such inputs, DP-0 yields 0 γi Re LU(x xj) γj 2 ( x + xj + ϵ). As f has no minimum, at least one other Re LU need to active at xj: If there would be no other active Re LU, then f would have a minimum. The active neuron(s) contributes a linear term of the form βx, β = 0, hence we get for xj ϵ x xj + ϵ and some b R, DP-0 bounds are b + βx h(x) b + βx + 1 2( x + xj + ϵ). Published as a conference paper at ICLR 2024 We know that for x = xj, we get h(xj) and thus h(xj) = b+βxj. The slope of h at xj ϵ is β γj. Hence, we have h(xj ϵ) = h(xj) (β γj)ϵ > b + β(xj ϵ) = h(xj) βϵ since γjϵ > 0. This contradicts the assumption that we are precise since the lower bound from DP-0 analysis h(xj) β is unequal to the actual lower bound h(xj ϵ). Hence, we have to have j = + for all j. Case 2: f is not monotone and has slope 0 somewhere. Then, there are two minima which are also switching points, x and x . 
Theorem 11 (DP-0 Convex). For any convex CPWL function $f : I \to \mathbb{R}$, there exists exactly one network of the form $h(x) = b + \sum_{i \in \mathcal{I}} \gamma_i \,\mathrm{ReLU}(\sigma_i(x - x_i))$ encoding $f$, with $|\mathcal{I}| = n - 1$ if $f$ has slope zero on some segment and $|\mathcal{I}| = n$ otherwise, such that its DP-0-analysis is precise, where $\gamma_i > 0$ for all $i$.

Proof. The proof works as follows: If the function is monotonically increasing, we show with a local argument that $\sigma_j = +$, and if the function is monotonically decreasing, we analogously argue that $\sigma_j = -$. If $f$ has slope zero somewhere, this happens on a linear region at the boundary of $[x_0, x_n]$, in which case we need one neuron less. If the function is non-monotone and has slope zero somewhere, then there are two minimal switching points $\underline{x}$ and $\overline{x}$. Hence, $f$ is increasing for all $x_j > \frac{\underline{x} + \overline{x}}{2}$ and decreasing for all $x_j < \frac{\underline{x} + \overline{x}}{2}$, so we can reuse the argument from before and need $n - 1$ neurons for these. We then argue separately for the ReLUs at $\underline{x}$ and $\overline{x}$. If the function is non-monotone and has slope zero nowhere, then there exists a unique minimum $x_j$. We then show that there is exactly one splitting of $\gamma_j$ between $\mathrm{ReLU}(x - x_j)$ and $\mathrm{ReLU}(-(x - x_j))$. Finally, we prove that the resulting network is precise. We know that there are finitely many switching points $x_i$, $1 \leq i \leq n$.

Case 1: $f$ is monotone. W.l.o.g. assume $f$ is monotonically increasing; the proof is analogous for monotonically decreasing $f$. Assume that $\sigma_j = -$ for some $j$. Then there exists $\epsilon > 0$ such that for all $x \in [x_j - \epsilon, x_j + \epsilon]$, all ReLUs except $\mathrm{ReLU}(-(x - x_j))$ are stable, i.e., either always $0$ or always active. Further, for such inputs, DP-0 yields
$$
0 \;\leq\; \gamma_j \,\mathrm{ReLU}(-(x - x_j)) \;\leq\; \frac{\gamma_j}{2}(-x + x_j + \epsilon).
$$
As $f$ has no minimum, at least one other ReLU needs to be active at $x_j$: if there were no other active ReLU, then $f$ would have a minimum there. The active neuron(s) contribute a linear term of the form $\beta x$ with $\beta \neq 0$; hence, for $x_j - \epsilon \leq x \leq x_j + \epsilon$ and some $\tilde{b} \in \mathbb{R}$, the DP-0 bounds are
$$
\tilde{b} + \beta x \;\leq\; h(x) \;\leq\; \tilde{b} + \beta x + \frac{\gamma_j}{2}(-x + x_j + \epsilon).
$$
At $x = x_j$ the unstable ReLU contributes $0$, and thus $h(x_j) = \tilde{b} + \beta x_j$. The slope of $h$ at $x_j - \epsilon$ is $\beta - \gamma_j$. Hence, we have
$$
h(x_j - \epsilon) = h(x_j) - (\beta - \gamma_j)\epsilon \;>\; \tilde{b} + \beta(x_j - \epsilon) = h(x_j) - \beta\epsilon,
$$
since $\gamma_j \epsilon > 0$. This contradicts precision: the lower bound from the DP-0-analysis, $h(x_j) - \beta\epsilon$, is strictly smaller than the actual minimum $h(x_j - \epsilon)$. Hence, we must have $\sigma_j = +$ for all $j$.

Case 2: $f$ is not monotone and has slope $0$ somewhere. Then there are two minimal switching points $\underline{x}$ and $\overline{x}$, i.e., $f$ attains its minimum exactly on the segment $[\underline{x}, \overline{x}]$. The argument in Case 1 is a local one; hence, we can use it for all $i$ such that $x_i \notin \{\underline{x}, \overline{x}\}$, so we only need to consider the indices $j$ and $j+1$ with $x_j = \underline{x}$ and $x_{j+1} = \overline{x}$. Since $f$ is convex, the only possible case is that $f$ is monotonically decreasing for $x < x_j$ and monotonically increasing for $x > x_{j+1}$; thus, we have $\sigma_k = -$ for $k < j$ and $\sigma_k = +$ for $k > j + 1$. Now we claim that $\sigma_j = -$ and $\sigma_{j+1} = +$. We must have either $(\sigma_j, \sigma_{j+1}) = (-, +)$ or $(\sigma_j, \sigma_{j+1}) = (+, -)$: otherwise, $h$ would not attain its minimum on the whole segment $[x_j, x_{j+1}]$. The second case also leads directly to a contradiction: not only would the pre-factors of the two ReLUs need to coincide, i.e., $\gamma_j = \gamma_{j+1}$ (otherwise $h$ would not be constant between them), but the DP-0-analysis of $h$ on $x_j - \epsilon \leq x \leq x_j + \epsilon$ yields (as only the neurons at $x_j$ and $x_{j+1}$ are active there)
$$
0 \;\leq\; \mathrm{ReLU}(x - x_j) \;\leq\; \tfrac{1}{2}(x - x_j + \epsilon), \qquad \mathrm{ReLU}(-(x - x_{j+1})) = -x + x_{j+1},
$$
and hence
$$
\tilde{b} + \gamma_{j+1}(-x + x_{j+1}) \;\leq\; h(x) \;\leq\; \tilde{b} + \gamma_{j+1}(-x + x_{j+1}) + \frac{\gamma_j}{2}(x - x_j + \epsilon).
$$
The minimum of this lower bound over the input region is $\gamma_{j+1}\epsilon$ below the true minimum of $h$, and the analysis is thus imprecise. Hence, $(\sigma_j, \sigma_{j+1}) = (-, +)$.

Case 3: $f$ is not monotone and has slope $0$ nowhere. Then there is a unique minimum at $x = x_j$. For all $x_i \neq x_j$, we can argue as before, so we only need to argue about $x_j$. Assume we have only one ReLU involving $x_j$. This ReLU is unstable and its DP-0 lower bound is $0$, while all other neurons together contribute a linear term $ax + b$ with $a \neq 0$; thus, the overall lower bound from DP-0 over $[x_j - \epsilon, x_j + \epsilon]$ is $h(x_j) - |a|\epsilon < h(x_j)$, the true minimum. Therefore, such a network cannot be precise under DP-0-analysis. As one ReLU is not enough, we use two at $x_j$, namely $\gamma_j^+ \,\mathrm{ReLU}(x - x_j)$ and $\gamma_j^- \,\mathrm{ReLU}(-(x - x_j))$. As no other ReLU is active around $x_j$, it immediately follows that $\gamma_j^+ = \alpha_j$ and $\gamma_j^- = -\alpha_{j-1}$.

Now we finally prove that the network constructed above is precise. Consider the input region $l \leq x \leq u$. In the case where $f$ is monotone on $[l, u]$, the result follows immediately from Lemma 25, as all ReLUs with the opposite orientation are inactive. In the case where $f$ is not monotone on $[l, u]$, the above construction yields
$$
h(x) = b + \sum_{i \,:\, \sigma_i = -} \gamma_i \,\mathrm{ReLU}(-(x - x_i)) + \sum_{i \,:\, \sigma_i = +} \gamma_i \,\mathrm{ReLU}(x - x_i).
$$
We can apply Lemma 25 to $\sum_{i: \sigma_i = -} \gamma_i \,\mathrm{ReLU}(-(x - x_i))$ and $\sum_{i: \sigma_i = +} \gamma_i \,\mathrm{ReLU}(x - x_i)$ individually to get
$$
0 \;\leq\; \sum_{i \,:\, \sigma_i = -} \gamma_i \,\mathrm{ReLU}(-(x - x_i)) \;\leq\; \frac{h(l) - b}{u - l}(-x + u), \qquad
0 \;\leq\; \sum_{i \,:\, \sigma_i = +} \gamma_i \,\mathrm{ReLU}(x - x_i) \;\leq\; \frac{h(u) - b}{u - l}(x - l).
$$
Hence, the combined bounds are
$$
b \;\leq\; h(x) \;\leq\; b + \frac{h(l) - b}{u - l}(-x + u) + \frac{h(u) - b}{u - l}(x - l).
$$
Evaluating the upper bound at $x = l$ and $x = u$ yields $h(l)$ and $h(u)$, respectively, and the lower bound $b$ is exactly the minimum of $h$ on $[l, u]$; hence, the bounds are precise.

B.3 DEEPPOLY-1

Corollary 12 (DP-1 ReLU). The ReLU network $h(x) = x + \mathrm{ReLU}(-x)$ encodes the function $f(x) = \mathrm{ReLU}(x)$, and the DP-1-analysis of $h(x)$ is identical to the DP-0-analysis of $\mathrm{ReLU}$. Further, the DP-0-analysis of $h(x)$ is identical to the DP-1-analysis of $\mathrm{ReLU}$.

Proof. We first prove that $x + \mathrm{ReLU}(-x) = \mathrm{ReLU}(x)$. Using $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$, we get
$$
x + \mathrm{ReLU}(-x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x) + \mathrm{ReLU}(-x) = \mathrm{ReLU}(x).
$$
Next, we show that the DP-0-analysis of $\mathrm{ReLU}(x)$ coincides with the DP-1-analysis of $x + \mathrm{ReLU}(-x)$.

Case $l \leq 0 \leq u$, $x \in [l, u]$: For DP-0 and $\mathrm{ReLU}(x)$ we have $0 \leq \mathrm{ReLU}(x) \leq \frac{u}{u - l}(x - l)$. For DP-1 and $x + \mathrm{ReLU}(-x)$ we have $-x \leq \mathrm{ReLU}(-x) \leq \frac{l}{u - l}(x - u)$, hence
$$
0 \;\leq\; x + \mathrm{ReLU}(-x) \;\leq\; \frac{l}{u - l}(x - u) + x = \frac{u}{u - l}(x - l).
$$
Case $0 \leq l \leq u$, $x \in [l, u]$: For DP-0 and $\mathrm{ReLU}(x)$ we have $\mathrm{ReLU}(x) = x$. For DP-1 and $x + \mathrm{ReLU}(-x)$ we have $\mathrm{ReLU}(-x) = 0$, hence $x + \mathrm{ReLU}(-x) = x$.

Case $l \leq u \leq 0$, $x \in [l, u]$: For DP-0 and $\mathrm{ReLU}(x)$ we have $\mathrm{ReLU}(x) = 0$. For DP-1 and $x + \mathrm{ReLU}(-x)$ we have $\mathrm{ReLU}(-x) = -x$, hence $x + (-x) = 0$.

To show the converse, we only need to consider the case $l \leq 0 \leq u$, since DP-0 and DP-1 are identical when there is no unstable ReLU. For DP-0 and $x + \mathrm{ReLU}(-x)$ we have $0 \leq \mathrm{ReLU}(-x) \leq \frac{l}{u - l}(x - u)$, hence
$$
x \;\leq\; x + \mathrm{ReLU}(-x) \;\leq\; \frac{l}{u - l}(x - u) + x = \frac{u}{u - l}(x - l),
$$
which are exactly the DP-1 bounds of $\mathrm{ReLU}(x)$.
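The equivalence in Corollary 12 is easy to verify numerically. The sketch below (helper names are ours) encodes the DP-0/DP-1 relaxations of a single ReLU as linear bounds and checks that, on an unstable box $l < 0 < u$, the DP-1-analysis of $x + \mathrm{ReLU}(-x)$ reproduces the DP-0 bounds of $\mathrm{ReLU}(x)$, and vice versa.

```python
import numpy as np

def relu_relaxation(l, u, mode):
    """((la, lb), (ua, ub)) with la*z+lb <= ReLU(z) <= ua*z+ub on [l, u].
    mode=0 -> DP-0 (lower line 0), mode=1 -> DP-1 (lower line z); upper is the chord."""
    if l >= 0:
        return (1.0, 0.0), (1.0, 0.0)
    if u <= 0:
        return (0.0, 0.0), (0.0, 0.0)
    s = u / (u - l)
    return ((0.0, 0.0) if mode == 0 else (1.0, 0.0)), (s, -s * l)

def x_plus_relu_neg_x(l, u, mode):
    """Bounds in x for h(x) = x + ReLU(-x) on [l, u]: relax ReLU on z = -x in [-u, -l],
    substitute z = -x (which negates the slopes), then add the identity term x."""
    (la, lb), (ua, ub) = relu_relaxation(-u, -l, mode)
    return (1.0 - la, lb), (1.0 - ua, ub)

l, u = -1.5, 2.0
# DP-1 of x + ReLU(-x) reproduces the DP-0 bounds of ReLU(x), and vice versa.
assert np.allclose(x_plus_relu_neg_x(l, u, mode=1), relu_relaxation(l, u, mode=0))
assert np.allclose(x_plus_relu_neg_x(l, u, mode=0), relu_relaxation(l, u, mode=1))
```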
Corollary 13 (DP-1 Approximation). Finite ReLU networks can DP-1- and DP-0-express the same function class precisely. In particular, they can DP-1-express the set of convex functions $f \in \mathrm{CPWL}(I, \mathbb{R})$ and the set of monotone functions $f \in \mathrm{CPWL}(I, \mathbb{R})$ precisely.

Proof. This follows immediately from Corollary 12 and the technique presented in Theorem 11.

B.4 TRIANGLE

Theorem 15 ($\triangle$ Convex). Let $f \in \mathrm{CPWL}(I, \mathbb{R})$ be convex. Then, for any network $h$ encoding $f$ as in Lemma 10, its $\triangle$-analysis is precise. In particular, the $\sigma_i$ can be chosen freely.

Proof. The $\triangle$-analysis of a ReLU over some input range $l \leq x \leq u$ results in the convex hull of the ReLU on that range. With that, we can apply Lemma 14 to the network $h$ from Lemma 10 encoding $f$:
$$
h(x) = b + cx + \sum_{i=1}^{n-1} \gamma_i \,\mathrm{ReLU}(\sigma_i(x - x_i)).
$$
Regardless of the choice of the $\sigma_i$, this is a sum of convex functions ($\gamma_i > 0$). With Lemma 14, we get that the $\triangle$-analysis of $h$ results in the convex hull of $h$ and is thus precise.

B.5 MULTI-NEURON-RELAXATIONS

Theorem 26 (Multi-Neuron Precision). For every $f \in \mathrm{CPWL}(I, \mathbb{R})$, there exists a single-layer ReLU network $h$ encoding $f$ such that its multi-neuron analysis (considering all ReLUs jointly) is precise.

Proof. As the multi-neuron relaxation yields the exact convex hull of all considered neurons in their joint input-output space, it remains to show that every $f \in \mathrm{CPWL}(I, \mathbb{R})$ can be represented by a single-layer ReLU network. Recall that every $f \in \mathrm{CPWL}(I, \mathbb{R})$ can be defined by the points $\{(x_i, y_i)\}_i$ with $x_i > x_{i-1}$ (Definition 1). We set $h_1(x) = (x - x_0)\frac{y_1 - y_0}{x_1 - x_0} + y_0$ and now update it as follows:
$$
h_{i+1}(x) = h_i(x) + \left(\frac{y_{i+1} - y_i}{x_{i+1} - x_i} - \frac{y_i - y_{i-1}}{x_i - x_{i-1}}\right) \mathrm{ReLU}(x - x_i).
$$
We observe that $\mathrm{ReLU}(x_j - x_i) = 0$ for all $j$ such that $j \leq i$. As $h$ is CPWL, it is now sufficient to show that $h(x_j) = y_j$ for all $j$:
$$
h(x_j) = (x_j - x_0)\frac{y_1 - y_0}{x_1 - x_0} + y_0 + \sum_{i=1}^{j-1} \left(\frac{y_{i+1} - y_i}{x_{i+1} - x_i} - \frac{y_i - y_{i-1}}{x_i - x_{i-1}}\right)(x_j - x_i)
$$
$$
= y_0 + \frac{y_1 - y_0}{x_1 - x_0}(x_1 - x_0) + \sum_{i=1}^{j-2} \frac{y_{i+1} - y_i}{x_{i+1} - x_i}(x_{i+1} - x_i) + \frac{y_j - y_{j-1}}{x_j - x_{j-1}}(x_j - x_{j-1})
$$
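Assuming the recursion above, the single-layer construction can be checked numerically. The following Python sketch (the helper name `single_layer_cpwl` is ours) adds one ReLU per interior breakpoint, weighted by the slope change, and verifies that the resulting network interpolates all points $(x_i, y_i)$ and agrees with the CPWL interpolant on the whole interval.

```python
import numpy as np

def single_layer_cpwl(xs, ys):
    """Single-layer ReLU network for the CPWL function through the points (xs, ys):
    start from the first affine piece and add ReLU(x - x_i) weighted by the slope change."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slopes = np.diff(ys) / np.diff(xs)
    def h(x):
        x = np.atleast_1d(np.asarray(x, float))
        out = (x - xs[0]) * slopes[0] + ys[0]                 # h_1: first affine piece
        for i in range(1, len(xs) - 1):                       # one ReLU per interior point x_i
            out += (slopes[i] - slopes[i - 1]) * np.maximum(x - xs[i], 0.0)
        return out
    return h

xs = [0.0, 1.0, 2.5, 3.0, 4.0]
ys = [1.0, -0.5, 0.0, 2.0, 1.0]                # an arbitrary (non-convex) CPWL function
h = single_layer_cpwl(xs, ys)
assert np.allclose(h(xs), ys)                  # h interpolates every (x_i, y_i)
grid = np.linspace(xs[0], xs[-1], 401)
assert np.allclose(h(grid), np.interp(grid, xs, ys))   # and agrees with f on all of I
```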