# minimum_width_for_universal_approximation__68ee4d1d.pdf Published as a conference paper at ICLR 2021 MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION Sejun Park Chulhee Yun Jaeho Lee Jinwoo Shin KAIST AI MIT EECS KAIST EE The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension dx and the output dimension dy. In this work, we provide the first definitive result in this direction for networks using the RELU activation functions: The minimum width required for the universal approximation of the Lp functions is exactly max{dx + 1, dy}. We also prove that the same conclusion does not hold for the uniform approximation with RELU, but does hold with an additional threshold activation function. Our proof technique can be also used to derive a tighter upper bound on the minimum width required for the universal approximation using networks with general activation functions. 1 INTRODUCTION The study of the expressive power of neural networks investigates what class of functions neural networks can/cannot represent or approximate. Classical results in this field are mostly focused on shallow neural networks. An example of such results is the universal approximation theorem (Cybenko, 1989; Hornik et al., 1989; Pinkus, 1999), which shows that a neural network with fixed depth and arbitrary width can approximate any continuous function on a compact set, up to arbitrary accuracy, if the activation function is continuous and nonpolynomial. Another line of research studies the memory capacity of neural networks (Baum, 1988; Huang and Babri, 1998; Huang, 2003), trying to characterize the maximum number of data points that a given neural network can memorize. After the advent of deep learning, researchers started to investigate the benefit of depth in the expressive power of neural networks, in an attempt to understand the success of deep neural networks. This has led to interesting results showing the existence of functions that require the network to be extremely wide for shallow networks to approximate, while being easily approximated by deep and narrow networks (Telgarsky, 2016; Eldan and Shamir, 2016; Lin et al., 2017; Poggio et al., 2017). A similar trade-off between depth and width in expressive power is also observed in the study of the memory capacity of neural networks (Yun et al., 2019; Vershynin, 2020). In search of a deeper understanding of the depth in neural networks, a dual scenario of the classical universal approximation theorem has also been studied (Lu et al., 2017; Hanin and Sellke, 2017; Johnson, 2019; Kidger and Lyons, 2020). Instead of bounded depth and arbitrary width studied in classical results, the dual problem studies whether universal approximation is possible with a network of bounded width and arbitrary depth. A very interesting characteristic of this setting is that there exists a critical threshold on the width that allows a neural network to be a universal approximator. For example, one of the first results (Lu et al., 2017) in the literature shows that universal approximation of L1 functions from Rdx to R is possible for a width-(dx + 4) RELU network, but impossible for a width-dx RELU network. This implies that the minimum width required for universal approximation lies between dx + 1 and dx + 4. 
Subsequent results have shown upper/lower bounds on the minimum width, but none of the results has succeeded in a tight characterization of the minimum width. 1.1 WHAT IS KNOWN SO FAR? Before summarizing existing results, we first define function classes studied in the literature. For a domain X Rdx and a codomain Y Rdy, we define C(X, Y) to be the class of continuous emails: sejun.park@kaist.ac.kr, chulheey@mit.edu, jaeho-lee@kaist.ac.kr, jinwoos@kaist.ac.kr Published as a conference paper at ICLR 2021 Table 1: A summary of known upper/lower bounds on minimum width for universal approximation. In the table, K Rdx denotes a compact domain, and p [1, ). Conti. is short for continuous. Reference Function class Activation ρ Upper / lower bounds Lu et al. (2017) L1(Rdx, R) RELU dx + 1 wmin dx + 4 L1(K, R) RELU wmin dx Hanin and Sellke (2017) C(K, Rdy) RELU dx + 1 wmin dx + dy Johnson (2019) C(K, R) uniformly conti. wmin dx + 1 Kidger and Lyons (2020) C(K, Rdy) conti. nonpoly wmin dx + dy + 1 C(K, Rdy) nonaffine poly wmin dx + dy + 2 Lp(Rdx, Rdy) RELU wmin dx + dy + 1 Ours (Theorem 1) Lp(Rdx, Rdy) RELU wmin = max{dx + 1, dy} Ours (Theorem 2) C([0, 1], R2) RELU wmin = 3 > max{dx + 1, dy} Ours (Theorem 3) C(K, Rdy) RELU+STEP wmin = max{dx + 1, dy} Ours (Theorem 4) Lp(K, Rdy) conti. nonpoly wmin max{dx + 2, dy + 1} requires that ρ is uniformly approximated by a sequence of one-to-one functions. requires that ρ is continuously differentiable at some z with ρ (z) = 0. functions from X to Y, endowed with the uniform norm: f := supx X f(x) . For p [1, ), we also define Lp(X, Y) to be the class of Lp functions from X to Y, endowed with the Lp-norm: f p := ( R X f(x) p p dx)1/p. The summary of known upper and lower bounds in the literature, as well as our own results, is presented in Table 1. We use wmin to denote the minimum width for universal approximation. First progress. As aforementioned, Lu et al. (2017) show that universal approximation of L1(Rdx, R) is possible for a width-(dx + 4) RELU network, but impossible for a width-dx RELU network. These results translate into bounds on the minimum width: dx +1 wmin dx +4. Hanin and Sellke (2017) consider approximation of C(K, Rdy), where K Rdx is compact. They prove that RELU networks of width dx + dy are dense in C(K, Rdy), while width-dx RELU networks are not. Although this result fully characterizes wmin in case of dy = 1, it fails to do so for dy > 1. General activations. Later, extensions to activation functions other than RELU have appeared in the literature. Johnson (2019) shows that if the activation function ρ is uniformly continuous and can be uniformly approximated by a sequence of one-to-one functions, a width-dx network cannot universally approximate C(K, R). Kidger and Lyons (2020) show that if ρ is continuous, nonpolynomial, and continuously differentiable at some z with ρ (z) = 0, then networks of width dx + dy + 1 with activation ρ are dense in C(K, Rdy). Furthermore, Kidger and Lyons (2020) prove that RELU networks of width dx + dy + 1 are dense in Lp(Rdx, Rdy). Limitations of prior arts. Note that none of the existing works succeeds in closing the gap between the upper bound (at least dx + dy) and the lower bound (at most dx + 1). 
This gap is significant especially for applications with high-dimensional codomains (i.e., large dy), arising for many practical applications of neural networks, e.g., image generation (Kingma and Welling, 2013; Goodfellow et al., 2014), language modeling (Devlin et al., 2019; Liu et al., 2019), and molecule generation (Gómez-Bombarelli et al., 2018; Jin et al., 2018). In the prior arts, the main bottleneck for proving an upper bound below dx + dy is that they maintain all dx neurons to store the input and all dy neurons to construct the function output; this means every layer already requires at least dx + dy neurons. In addition, the proof techniques for the lower bounds only consider the input dimension dx regardless of the output dimension dy. 1.2 SUMMARY OF RESULTS We mainly focus on characterizing the minimum width of RELU networks for universal approximation. Nevertheless, our results are not restricted to RELU networks; they can be generalized to networks with general activation functions. Our contributions can be summarized as follows. Published as a conference paper at ICLR 2021 Theorem 1 states that the minimum width for RELU networks to be dense in Lp(Rdx, Rdy) is exactly max{dx + 1, dy}. This is the first result fully characterizing the minimum width of RELU networks for universal approximation. In particular, the upper bound on the minimum width is significantly smaller than the best known result dx + dy + 1 (Kidger and Lyons, 2020). Given the full characterization of wmin of RELU networks for approximating Lp(Rdx, Rdy), a natural question arises: Is wmin also the same for C(K, Rdy)? We prove that it is not the case; Theorem 2 shows that the minimum width for RELU networks to be dense in C([0, 1], R2) is 3. Namely, RELU networks of width max{dx + 1, dy} are not dense in C(K, Rdy) in general. In light of Theorem 2, is it impossible to approximate C(K, Rdy) in general while maintaining width max{dx + 1, dy}? Theorem 3 shows that an additional activation comes to rescue. We show that if networks use both RELU and threshold activation functions (which we refer to as STEP)1, they can universally approximate C(K, Rdy) with the minimum width max{dx +1, dy}. Our proof techniques for tight upper bounds are not restricted to RELU networks. In Theorem 4, we extend our results to general activation functions covered in Kidger and Lyons (2020). 1.3 ORGANIZATION We first define necessary notation in Section 2. In Section 3, we formally state our main results and discuss their implications. In Section 4, we present our coding scheme for proving upper bounds on the minimum width in Theorems 1, 3 and 4. In Section 5, we prove the lower bound in Theorem 2 by explicitly constructing a counterexample. Finally, we conclude the paper in Section 6. We note that all formal proofs of Theorems 1 4 are presented in Appendix. 2 PROBLEM SETUP AND NOTATION Throughout this paper, we consider fully-connected neural networks that can be described as an alternating composition of affine transformations and activation functions. Formally, we consider the following setup: Given a set of activation functions Σ, an L-layer neural network f of input dimension dx, output dimension dy, and hidden layer dimensions d1, . . . , d L 12 is represented as f := t L σL 1 t2 σ1 t1, (1) where tℓ: Rdℓ 1 Rdℓis an affine transformation and σℓis a vector of activation functions: σℓ(x1, . . . , xdℓ) = ρ1(x1), . . . , ρdℓ(xdℓ) , where ρi Σ. 
While we mostly consider the cases where Σ is a singleton (e.g., Σ = {RELU}), we also consider the case where Σ contains both RELU and STEP activation functions as in Theorem 3. We denote a neural network with Σ = {ρ} by a ρ network and a neural network with Σ = {ρ1, ρ2} by a ρ1+ρ2 network. We define the width w of f as the maximum over d1, . . . , d L 1. We use ρ networks (or ρ1 + ρ2 networks) of width w for denoting the collection of all ρ networks (or ρ1 + ρ2 networks) of width w having a finite number of layers. For describing the universal approximation of neural networks, we say ρ networks (or ρ1+ρ2 networks) of width w are dense in C(X, Y) if for any f C(X, Y) and ε > 0, there exists a ρ network (or a ρ1+ρ2 network) f of width w such that f f ε. Likewise, we say ρ networks (or ρ1+ρ2 networks) are dense in Lp(X, Y) if for any f Lp(X, Y) and ε > 0, there exists a ρ network (or a ρ1+ρ2 network) f such that f f p ε. 3 MINIMUM WIDTH FOR UNIVERSAL APPROXIMATION Lp approximation with RELU. We present our main theorems in this section. First, for universal approximation of Lp(Rdx, Rdy) using RELU networks, we give the following theorem. Theorem 1. For any p [1, ), RELU networks of width w are dense in Lp(Rdx, Rdy) if and only if w max{dx + 1, dy}. 1The threshold activation function (i.e., STEP) denotes x 7 1[x 0]. 2For simplicity of notation, we let d0 := dx and d L := dy. Published as a conference paper at ICLR 2021 This theorem shows that the minimum width wmin for universal approximation is exactly equal to max{dx + 1, dy}. In order to provide a tight characterization of wmin, we show three new upper and lower bounds: wmin max{dx + 1, dy} through a construction utilizing a coding approach, wmin dy through a volumetric argument, and wmin dx + 1 through an extension of the same lower bound for L1(Rdx, Rdy) (Lu et al., 2017). Combining these bounds gives the tight minimum width wmin = max{dx + 1, dy}. Notably, using our new proof technique, we overcome the limitation of existing upper bounds that require width at least dx + dy. Our construction first encodes the dx dimensional input vectors into one-dimensional codewords, and maps the codewords to target codewords using memorization, and decodes the target codewords to dy dimensional output vectors. Since we construct the map from input to target using scalar codewords, we bypass the need to use dx + dy hidden nodes. More details are found in Section 4. Proofs of the lower bounds are deferred to Appendix B. Uniform approximation with RELU. In Theorem 1, we have seen the tight characterization wmin = max{dx + 1, dy} for Lp(Rdx, Rdy) functions. Does the same hold for C(K, Rdy), for a compact K Rdx? Surprisingly, we show that the same conclusion does not hold in general. Indeed, we show the following result, proving that width max{dx + 1, dy} is provably insufficient for dx = 1, dy = 2. Theorem 2. RELU networks of width w are dense in C([0, 1], R2) if and only if w 3. Theorem 2 translates to wmin = 3, and the upper bound wmin 3 = dx + dy is given by Hanin and Sellke (2017). The key is to prove a lower bound wmin 3, i.e., width 2 is not sufficient. Recall from Section 1.1 that all the known lower bounds are limited to showing that width dx is insufficient for universal approximation. A closer look at their proof techniques reveals that they heavily rely on the fact that the hidden layers have the same dimensions as the input space. 
As long as the width w > dx, their arguments break because such a network maps the input space into a higher-dimensional space. Although only for dx = 1 and dy = 2, we overcome this limitation of the prior arts and show that width w = 2 > dx is insufficient for universal approximation, by providing a counterexample. We use a novel topological argument which comes from a careful observation on the image created by RELU operations. In particular, we utilize the property of RELU that it projects all negative inputs to zero, without modifying any positive inputs. We believe that our proof will be of interest to readers and inspire follow-up works. Please see Section 5 for more details. Theorem 1 and Theorem 2 together imply that for RELU networks, approximating C(K, Rdy) requires more width than approximating Lp(Rdx, Rdy). Interestingly, this is in stark contrast with existing results, where the minimum depth of RELU networks for approximating C(K, Rdy) is two (Leshno et al., 1993) but it is greater than two for approximating Lp(Rdx, Rdy) (Wang and Qu, 2019). Uniform approximation with RELU+STEP. While width max{dx + 1, dy} is insufficient for RELU networks to be dense in C(K, Rdy), an additional STEP activation function helps achieve the minimum width max{dx + 1, dy}, as stated in the theorem below. Theorem 3. RELU+STEP networks of width w are dense in C(K, Rdy) if and only if w max{dx + 1, dy}. Theorem 2 and Theorem 3 indicate that the minimum width for universal approximation is indeed dependent on the choice of activation functions. This is also in contrast to the classical results where RELU networks of depth 2 are universal approximators (Leshno et al., 1993), i.e., the minimum depths for universal approximation are identical for both RELU networks and RELU+STEP networks. Theorem 3 comes from a similar proof technique as Theorem 1. Due to its discontinuous nature, the STEP activation can be used in our encoder to quantize the input without introducing uniform norm errors. Lower bounds on wmin can be proved in a similar way as Theorem 1. General activations. Our proof technique for upper bounds in Theorems 1 and 3 can be easily extended to networks using general activations. Indeed, we prove the following theorem, which shows that adding a width of 1 is enough to cover the networks with general activations. Theorem 4. Let ρ : R R be any continuous nonpolynomial function which is continuously differentiable at some z with ρ (z) = 0. Then, ρ networks of width w are dense in Lp(K, Rdy) for all p [1, ) if w max{dx + 2, dy + 1}. Published as a conference paper at ICLR 2021 Figure 1: Illustration of the coding scheme Please notice that unlike other theorems, Theorem 4 only proves an upper bound wmin max{dx + 2, dy + 1}. We note that Theorem 4 significantly improves over the previous upper bound of width dx + dy + 1 by (Kidger and Lyons, 2020, Remark 4.10). 4 TIGHT UPPER BOUND ON MINIMUM WIDTH In this section, we present the main idea for constructing networks achieving the minimum width for universal approximation, and then sketch the proofs of upper bounds in Theorems 1, 3, and 4. 4.1 CODING SCHEME FOR UNIVERSAL APPROXIMATION We now illustrate the main idea underlying the construction of neural networks that achieve the minimum width. To this end, we consider an approximation of a target continuous function f C([0, 1]dx, [0, 1]dy); however, our main idea can be easily generalized to other domain, codomain, and Lp functions. 
Our construction can be viewed as a coding scheme in essence, consisting of three parts: encoder, memorizer, and decoder. First, the encoder encodes an input vector x to a one-dimensional codeword. Then, the memorizer maps the codeword to a one-dimensional target codeword that is encoded with respect to the corresponding target f (x). Finally, the decoder maps the target codeword to a target vector which is sufficiently close to f (x). Note that one can view the encoder, memorizer, and decoder as functions mapping from dx-dimension to 1-dimension, then to 1-dimension, and finally to dy-dimension. The spirit of the coding scheme is that the three functions can be constructed using the idea of the prior results such as (Hanin and Sellke, 2017). Recall that Hanin and Sellke (2017) approximate any continuous function mapping n-dimensional inputs to m-dimensional outputs using RELU networks of width n + m. Under this intuition, we construct the encoder, the memorizer, and the decoder by RELU+STEP networks (or RELU networks) of width dx + 1, 2, dy, respectively; these constructions result in the tight upper bound max{dx + 1, dy}. Here, the decoder requires width dy instead of dy + 1, as we only construct the first dy 1 coordinates of the output, and recover the last output coordinate from a linear combination of the target codeword and the first dy 1 coordinates. Note that we construct the exact encoder, decoder, and memorizer as prior results only approximate them. Next, we describe the operation of each part. We explain their neural network constructions in subsequent subsections. Encoder. Before introducing the encoder, we first define a quantization function qn : [0, 1] Cn for n N and Cn := {0, 2 n, 2 2 n, . . . , 1 2 n} as qn(x) := max{c Cn : c x}. In other words, given any x [0, 1), qn(x) preserves the first n bits in the binary representation of x and discards the rest; x = 1 is mapped to 1 2 n. Note that the error from the quantization is always less than or equal to 2 n. The encoder encodes each input x [0, 1]dx to some scalar value via the function encode K : Rdx Cdx K for some K N defined as encode K(x) := Xdx i=1 q K(xi) 2 (i 1)K. In other words, encode K(x) quantizes each coordinate of x by a K-bit binary representation and concatenates the quantized coordinates into a single scalar value having a (dx K)-bit binary Published as a conference paper at ICLR 2021 representation. Note that if one decodes a codeword encode K(x) back to a vector ˆx as3 {ˆx} := encode 1 K encode K(x) Cdx K , then x ˆx 2 K. Namely, the information loss incurred by the encoding can be made arbitrarily small by choosing large K. Memorizer. The memorizer maps each codeword encode K(x) Cdx K to its target codeword via the function memorize K,M : Cdx K Cdy M for some M N, defined as memorize K,M encode K(x) := encode M f q K(x) where q K is applied coordinate-wise for a vector. We note that memorize K,M is well-defined as each encode K(x) Cdx K corresponds to a unique q K(x) Cdx K . Here, one can observe that the target of the memorizer contains the information of the target value since encode M f q K(x) contains information of f at a quantized version of x, and the information loss due to quantization can be made arbitrarily small by choosing large enough K and M. Decoder. The decoder decodes each codeword generated by the memorizer using the function decode M : Cdy M Cdy M defined as decode M(c) := ˆx where {ˆx} := encode 1 M (c) Cdy M . 
Combining encode, memorize, and decode completes our coding scheme for approximating f . One can observe that our coding scheme is equivalent to q M f q K which can approximate the target function f within any ε > 0 error, i.e., supx [0,1]dx f (x) decode M memorize K,M encode K(x) ε by choosing large enough K, M N so that ωf (2 K) + 2 M ε.4 In the remainder of this section, we discuss how each part of the coding scheme can be implemented with a neural network using RELU+STEP activations (Section 4.2), RELU activation (Section 4.3), and other general activations (Section 4.4). 4.2 TIGHT UPPER BOUND ON MINIMUM WIDTH OF RELU+STEP NETWORKS (THEOREM 3) In this section, we discuss how we explicitly construct our coding scheme to approximate functions in C(K, Rdy) using a width-(max{dx + 1, dy}) RELU+STEP network. This results in the tight upper bound in Theorem 3. First, the encoder consists of quantization functions q K and a linear transformation. However, as q K is discontinuous and cannot be uniformly approximated by any continuous function, we utilize the discontinuous STEP activation to exactly construct the encoder via a RELU+STEP network of width dx + 1. On the other hand, the memorizer and the decoder maps a finite number of scalar values (i.e., Cdx K and Cdy M, respectively) to their target values/vectors. Such maps can be easily implemented by piecewise linear functions; hence, they can be exactly constructed by RELU networks of width 2 and dy, respectively, as discussed in Section 4.1. Note that STEP is used only for constructing the encoder. In summary, all parts of our coding scheme can be exactly constructed by RELU+STEP networks of width dx + 1, 2, and dy. Thus, the overall RELU+STEP network has width max{dx + 1, dy}. Furthermore, it can approximate the target continuous function f within arbitrary uniform error by choosing sufficiently large K and M. 4.3 TIGHT UPPER BOUND ON MINIMUM WIDTH OF RELU NETWORKS (THEOREM 1) The construction of width-(max{dx + 1, dy}) RELU network for approximating Lp(Rdx, Rdy) (i.e., the tight upper bound in Theorem 1) is almost identical to the RELU+STEP network construction 3Here, encode 1 K denotes the preimage of encode K and Cdx K is the Cartesian product of dx copies of CK. 4ωf denotes the modulus of continuity of f : f (x) f (x ) ωf ( x x ) x, x [0, 1]dx. Published as a conference paper at ICLR 2021 in Section 4.2. Since any Lp function can be approximated by a continuous function with compact support, we aim to approximate continuous f : [0, 1]dx [0, 1]dy here as in our coding scheme. Since the memorizer and the decoder can be exactly constructed by RELU networks, we only discuss the encoder here. As we discussed in the last section, the encoder cannot be uniformly approximated by continuous functions (i.e., RELU networks). Nevertheless, it can be implemented by continuous functions except for a subset of the domain around the discontinuities, and this subset can be made arbitrarily small in terms of the Lebesgue measure. That is, we construct the encoder using a RELU network of width dx + 1 for [0, 1]dx except for a small subset, which enables us to approximate the encoder in the Lp-norm. Combining with the memorizer and the decoder, we obtain a RELU network of width max{dx + 1, dy} that approximates the target function f in the Lp-norm. 
4.4 TIGHTENING UPPER BOUND ON MINIMUM WIDTH OF GENERAL NETWORKS (THEOREM 4) Our network construction can be generalized to general activation functions using existing results on approximation of C(K, Rdy) functions. For example, Kidger and Lyons (2020) show that if the activation ρ is continuous, nonpolynomial, and continuously differentiable at some z with ρ (z) = 0, then ρ networks of width dx + dy + 1 are dense in C(K, Rdy). Applying this result to our encoder, memorizer, and decoder constructions of RELU networks, it follows that if ρ satisfies the conditions above, then ρ networks of width max{dx +2, dy +1} are dense in Lp(K, Rdy), i.e., Theorem 4 holds. We note that any universal approximation result for C(K, Rdy) by networks using other activation functions, other than Kidger and Lyons (2020), can also be combined with our construction. 4.5 NUMBER OF LAYERS FOR MINIMUM WIDTH UNIVERSAL APPROXIMATORS Lastly, we discuss the required number of layers for our network constructions achieving the tight minimum width in Theorem 1 and 3. First, our RELU + STEP network construction of width max{dx + 1, dy} for approximating f C([0, 1]dx, Rdy) requires O(2dx K + 2M) layers. Here, O(2K) layers are for the encoder, O(2dx K) layers are for the memorizer, and O(2M) layers for the decoder.5 Since K, M must satisfy ωf (2 K)+2 M ε for approximating f in ε error (see Section 4.1), we choose K = log2(ω 1 f (ε/2)) , M = log2(ε/2) so that ωf (2 K), 2 M ε/2. Namely, the overall required number of layers (i.e., parameters) for approximating f in ε error is O (ω 1 f (ε/2)) dx + 1/ε . This number is closely related to the number of parameters for approximating arbitrary f C([0, 1]dx, Rdy) in ε error: For RELU networks of a constantly bounded number of layers, Θ (ω 1 f (Ω(ε))) dx parameters are necessary and sufficient (Yarotsky, 2018). On the other hand, for RELU networks whose number of layers can increase with the number of parameters, O (ω 1 f (Ω(ε))) dx/2 parameters are sufficient (Yarotsky, 2018). However, while the number of layers grows with the number of parameters in our construction, it requires O (ω 1 f (Ω(ε))) dx + 1/ε parameters. Namely, the number of parameters in constructions achieving the tight minimum width can be improved. Likewise, for proving the tight upper bound in Theorem 1, for approximating f Lp(Rdx, Rdy) in ε error, our RELU network construction also requires O (ω 1 f (Ω(ε))) dx + 1/ε layers where f is some continuous function satisfying f f p = ε/2. 5 TIGHT LOWER BOUND ON MINIMUM WIDTH The purpose of this section is to prove the tight lower bound in Theorem 2, i.e., there exist f C([0, 1], R2) and ε > 0 satisfying the following property: For any width-2 RELU network f, we have f f > ε. Our construction of f is based on topological properties of RELU networks, which we study in Section 5.1. Then, we introduce a counterexample f and prove that f cannot be approximated by width-2 RELU networks in Section 5.2. 5dx, dy are considered as constants. Published as a conference paper at ICLR 2021 5.1 TOPOLOGICAL PROPERTIES OF RELU NETWORKS We first interpret a width-2 RELU network f : R R2 as below, following (1): f := t L σ σ t2 σ t1 where L N denotes the number of layers, t1 : R R2 and tℓ: R2 R2 for ℓ> 1 are affine transformations, and σ is the coordinate-wise RELU. Without loss of generality, we assume that tℓis invertible for all ℓ> 1, as invertible affine transformations are dense in the space of affine transformations on bounded support, endowed with the uniform norm. 
To illustrate the topological properties of f better, we reformulate f as follows: f = (φ 1 L 1 σ φL 1) (φ 1 2 σ φ2) (φ 1 1 σ φ1) t (2) where φℓand t are defined as t := t L t1 and φℓ:= (t L tℓ+1) 1, i.e., tℓ= φℓ φ 1 ℓ 1 for ℓ 2 and t1 = φ1 t . Under the reformulation (2), f first maps inputs through an affine transformation t , then it sequentially applies φ 1 ℓ σ φℓ. Here, φ 1 ℓ σ φℓcan be viewed as changing the coordinate system using φℓ, applying RELU in the modified coordinate system, and then returning back to the original coordinate system via φ 1 ℓ. Under this reformulation, we present the following lemmas, whose proofs are presented in Appendices B.4 B.5. Lemma 5. Let φ : R2 R2 be an invertible affine transformation. Then, there exist a1, a2 R2 and b1, b2 R determined by φ such that the following statements hold for S := {x : a1, x + b1 0, a2, x + b2 0} and x := φ 1 σ φ(x). If x S, then x = x. If x R2 \ S, then x = x and x S.6 Lemma 6. Let φ : R2 R2 be an invertible affine transformation. Suppose that x is in a bounded path-component7 of R2 \ T for some T R2. Then, the following statements hold for x := φ 1 σ φ(x) and T := φ 1 σ φ(T ). If x = x and x / T , then x is in a bounded path-component of R2 \ T . If x = x, then x T . Lemma 5 follows from the fact that output of RELU is identity to nonnegative coordinates, and is zero to negative coordinates. In particular, a1, b1 and a2, b2 in Lemma 5 correspond to the axes of the modified coordinate system before applying σ. Under the same property of RELU, Lemma 6 states that if a point x is surrounded by a set T , after applying φ 1 σ φ, either the point stays at the same position and surrounded by the image of T or intersects with the image of T . Based on these observations, we are now ready to introduce our counterexample. 5.2 COUNTEREXAMPLE Our counterexample f : [0, 1] R2 is illustrated in Figure 2(a) where f ([0, p1]) is drawn in red from (4, 3) to (0, 0), f ((p1, p2)) is drawn in black from (0, 0) to ( 1, 0), and f ([p2, 1]) is drawn in blue from ( 1, 0) to (1, 0), for some 0 < p1 < p2 < 1, e.g., p1 = 1 3. In this section, we suppose for contradiction that there exists a RELU network f of width 2 such that f f 1 100. To this end, consider the mapping by the first ℓlayers of f: gℓ:= (φ 1 ℓ σ φℓ) (φ 1 1 σ φ1) t . Our proof is based on the fact: If gℓ(x) = gℓ(x ), then f(x) = f(x ). Thus, the following must hold: if f f 1 100, then gℓ([0, p1]) gℓ([p2, 1]) = for all ℓ 1. (3) Let B := ( 2, 2) ( 1, 1) (the gray box in Figure 2(b)) and ℓ N be the largest ℓsuch that φ 1 ℓ σ φℓ(B) = B. This means that after the ℓ -th layer, everything inside the box B never gets 6 S denotes the boundary set of S. 7S T is a path-component of T if S is a maximal set satisfying the following condition: For any x1, x2 S, there exists a continuous function f : [0, 1] S such that f(0) = x1 and f(1) = x2. Published as a conference paper at ICLR 2021 Figure 2: (a) Illustration of the image of f : [0, 1] R2 (b, c) Examples of gℓ ([0, 1]). affected by RELU operations. In other words, f([0, 1]) B must have been constructed in the first ℓ layers. Under this observation, the boundary of S in Lemma 5 determined by φℓ must intersect with B: If B S, then φ 1 ℓ σ φℓ(B) = B which contradicts the definition of ℓ and if B S = , then f([0, 1]) B cannot be constructed in the first ℓ layers. 
Therefore, there must exist a line (e.g., a line containing one boundary of S, see the arrow in Figure 2(b)) intersecting with B, such that the image gℓ ([0, 1]) lies in one side of the line. Since the image of the entire network f([0, p1]) is on both sides of the line, we have gℓ ([0, p1]) = f([0, p1]), which implies that the remaining layers ℓ + 1, . . . , L 1 must have moved the image gℓ ([0, p1]) \ B to f([0, p1]) \ B; this also implies gℓ ([0, p1]) \ B = . A similar argument gives gℓ ([p2, 1]) \ B = . Since f([0, 1]) B must have been constructed in the first ℓ layers, as illustrated in Figures 2(b) and 2(c), the boundary B intersects with gℓ ([p2, 1]) (the blue line) near points ( 1, 1) and (1, 1). Hence, T := gℓ ([p2, 1]) B forms a closed loop. Also, B intersects with gℓ ([0, p1]) near the point (0, 1), so there must exist a point in gℓ ([0, p1]) \ B that is surrounded by T . Given these observations, we have the following lemma. The proof of Lemma 7 is presented in Appendix B.6. Lemma 7. The image gℓ ([0, p1]) \ B is contained in a bounded path-component of R2 \ T unless gℓ ([0, p1]) gℓ ([p2, 1]) = . Figures 2(b) and 2(c) illustrates the two possible cases of Lemma 7. If gℓ ([0, p1]) gℓ ([p2, 1]) = (Figure 2(c)), this contradicts (3). Then, gℓ ([0, p1]) \ B must be contained in a bounded pathcomponent of R2 \ T . Recall that gℓ ([0, p1]) \ B has to move to f([0, p1]) \ B by layers ℓ + 1, . . . , L 1. However, by Lemma 6, if any point in gℓ ([0, p1]) \ B moves, then it must intersect with the image of T . If it intersects with the image of gℓ ([p2, 1]), then (3) is violated, hence a contradiction. If it intersects with B at the ℓ -th layer for some ℓ > ℓ , then B S for S in Lemma 5 determined by ℓ . Hence, it violates the definition of ℓ as φ 1 ℓ σ φℓ (B) = B by Lemma 5. Namely, the approximation by f is impossible in any cases. This completes the proof of Theorem 2. 6 CONCLUSION The universal approximation property of width-bounded networks is one of the fundamental problems in the expressive power theory of deep learning. Prior arts attempt to characterize the minimum width sufficient for universal approximation; however, they only provide upper and lower bounds with large gaps. In this work, we provide the first exact characterization of the minimum width of RELU networks and RELU+STEP networks. In addition, we observe interesting dependence of the minimum width on the target function classes and activation functions, in contrast to the minimum depth of classical results. We believe that our results and analyses would contribute to a better understanding of the performance of modern deep and narrow network architectures. ACKNOWLEDGEMENTS This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST) and No.2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence). Chulhee Yun acknowledges financial supports from Korea Foundation for Advanced Studies, NSF CAREER grant 1846088, and ONR grant N00014-20-1-2394. Published as a conference paper at ICLR 2021 Eric B. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193 215, 1988. ISSN 0885-064X. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303 314, 1989. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 
BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, 2016. Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268 276, 2018. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014. Richard Hacker. Certification of algorithm 112: position of point relative to polygon. Communications of the ACM, 5(12):606, 1962. Boris Hanin and Mark Sellke. Approximating continuous functions by Re LU nets of minimal width. ar Xiv preprint ar Xiv:1710.11278, 2017. Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989. Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274 281, 2003. Guang-Bin Huang and Haroon A Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224 229, 1998. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2018. Jesse Johnson. Deep, skinny neural networks are not universal approximators. In International Conference on Learning Representations, 2019. Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Conference on Learning Theory, 2020. Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2013. Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861 867, 1993. Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223 1247, 2017. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Ro BERTa: A robustly optimized BERT pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. Published as a conference paper at ICLR 2021 Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, 2017. Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8: 143 195, 1999. Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503 519, 2017. Moshe Shimrat. Algorithm 112: position of point relative to polygon. Communications of the ACM, 5(8):434, 1962. 
Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, 2016. Carsten Thomassen. The Jordan-Schönflies theorem and the classification of surfaces. The American Mathematical Monthly, 99(2):116 130, 1992. Helge Tverberg. A proof of the Jordan curve theorem. Bulletin of the London Mathematical Society, 12(1):34 38, 1980. Roman Vershynin. Memory capacity of neural networks with threshold and Re LU activations. ar Xiv preprint 2001.06938, 2020. Ming-Xi Wang and Yang Qu. Approximation capabilities of neural networks on unbounded domains. ar Xiv preprint ar Xiv:1910.09293, 2019. Dmitry Yarotsky. Optimal approximation of continuous functions by very deep relu networks. In Conference on Learning Theory, 2018. Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small Re LU networks are powerful memorizers: a tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, 2019. Published as a conference paper at ICLR 2021 We first provide proofs of upper bounds in Theorems 1, 3, 4 in Appendix A. In Appendix B, we provide proofs of lower bounds in Theorem 1, 3 and proofs of Lemmas 5, 6, 7 used for proving the lower bound in Theorem 2. Throughout Appendix, we denote the coordinate-wise RELU by σ and we denote the i-th coordinate of an output of a function f(x) by (f(x))i. A PROOFS OF UPPER BOUNDS A.1 PROOF OF TIGHT UPPER BOUND IN THEOREM 3 In this section, we prove the tight upper bound on the minimum width in Theorem 3, i.e., width- (max{dx + 1, dy}) RELU+STEP networks are dense in C([0, 1]dx, Rdy). In particular, we prove that for any f C([0, 1]dx, [0, 1]dy), for any ε > 0, there exists a RELU+STEP network f of width max{dx + 1, dy} such that supx [0,1]dx f (x) f(x) ε. Here, we note that the domain and the codomain can be easily generalized to arbitrary compact support and arbitrary codomain, respectively. Our construction is based on the three-part coding scheme introduced in Section 4.1. First, consider constructing a RELU+STEP network for the encoder. From the definition of q K, one can observe that the mapping is discontinuous and piecewise constant. Hence, the exact construction (or even the uniform approximation) of the encoder requires the use of discontinuous activation functions such as STEP (recall its definition x 7 1[x 0]). We introduce the following lemma for the exact construction of q K. The proof of Lemma 8 is presented in Appendix A.4. Lemma 8. For any K N, there exists a RELU+STEP network f : R R of width 2 and O(2K) layers such that f(x) = q K(x) for all x [0, 1]. For constructing the encoder via a RELU+STEP network of width dx + 1, we sequentially apply q K to each input coordinate, by utilizing the extra width 1 and Lemma 8. Here, when we apply q K to some coordinate (requiring width 2), we preserve other coordinates by applying identity mapping (requiring width dx 1) for constructing the encoder within width dx + 1. Once we apply q K for all input coordinates, we apply the linear transformation Pdx i=1 q K(xi) 2 (i 1)K to obtain the output of the encoder. On the other hand, the memorizer only maps a finite number of scalar inputs to the corresponding scalar targets, which can be easily implemented by piecewise linear continuous functions. We show that the memorizer can be exactly constructed by a RELU network of width 2 using the following lemma. The proof of Lemma 9 is presented in Appendix A.5. Lemma 9. 
For any function f : R R, any finite set X R, and any compact interval I R containing X, there exists a RELU network f : R R of width 2 and O(|X|) layers such that f(x) = f (x) for all x X and f(I) min f (X), max f (X) . Likewise, the decoder maps a finite number of scalar inputs in Cdy M to corresponding target vectors in Cdy M . Here, each coordinate of a target vector corresponds to some consequent bits of the binary representation of the input. Under the similar idea used for our implementation of the memorizer, we show that the decoder can be exactly constructed by a RELU network of width dy using the following lemma. The proof of Lemma 10 is presented in Appendix A.6. Lemma 10. For any dy, M N, for any δ > 0, there exists a RELU network f : R R2 of width dy and O(2M) layers such that for all c Cdy M f(c) = decode M(c). Furthermore, it holds that f(R) [0, 1]dy. Finally, as the encoder, the memorizer, and the decoder can be constructed by RELU+STEP networks of width dx + 1, width 2, and width dy, respectively, the width of the overall RELU+STEP network f is max{dx + 1, dy}. In addition, as mentioned in Section 4.1, choosing K, M N large enough so that ωf (2 K) + 2 M ε ensures f f ε. This completes the proof of the tight upper bound in Theorem 3. Published as a conference paper at ICLR 2021 A.2 PROOF OF TIGHT UPPER BOUND IN THEOREM 1 In this section, we derive the upper bound in Theorem 1. In particular, we prove that for any p [1, ), for any f Lp(Rdx, Rdy), for any ε > 0, there exists a RELU network f of width max{dx + 1, dy} such that f f p ε. To this end, we first note that since f Lp(Rdx, Rdy), there exists a continuous function f on a compact support such that Namely, if we construct a RELU network f such that f f p ε 2, then it completes the proof. Throughout this proof, we assume that the support of f is a subset of [0, 1]dx and its codomain to be [0, 1]dy which can be easily generalized to arbitrary compact support and arbitrary codomain, respectively. We approximate f by a RELU network using the three-part coding scheme introduced in Section 4.1. We will refer to our implementations of the three parts as encode K(x), memorize K,M, and decode M. That is, we will approximate f by a RELU network f := decode M memorize K,M encode K. However, unlike our construction of RELU+STEP networks in Appendix A.1, STEP is not available, i.e., uniform approximation of q K is impossible. Nevertheless, one can approximate q K with some continuous piecewise linear function by approximating regions around discontinuities with some linear functions. Under this idea, we introduce the following lemma. The proof of Lemma 11 is presented in Appendix A.7. Lemma 11. For any dx, K N, for any γ > 0, there exist a RELU network f : Rdx R of width dx + 1 and O(2K) layers and Dγ [0, 1]dx such that for all x [0, 1]dx \ Dγ, f(x) = encode K(x), µ(Dγ) < γ, f(Dγ) [0, 1], and f(Rdx \ [0, 1]dx) = {1 2dx K} where µ denotes the Lebesgue measure. By Lemma 11, there exist a RELU network encode K of width dx + 1 and Dγ [0, 1]dx such that µ(Dγ) < γ and encode K(x) = encode K(x) for all x [0, 1]dx \ Dγ, encode K(Rdx \ [0, 1]dx) = {1 2dx K}. (4) We approximate the encoder by encode K. Here, we note that inputs from Dγ would be mapped to arbitrary values by encode K. Nevertheless, it is not critical to the error f f p as µ(Dγ) < γ can be made arbitrarily small by choosing a sufficiently small γ. The implementation of the memorizer utilizes Lemma 9 as in Appendix A.1. 
However, as f (x) = 0 for all x Rdx \ [0, 1]dx, we construct a RELU network memorize K,M of width 2 so that memorize K,M encode K,L(Rdx \ [0, 1]dx) = {0}. To achieve this, we design the memorizer for c Cdx K using Lemma 9 and based on (4) as memorize K,M(c) = 0 if c = 1 2dx K memorize K,M(c) otherwise . We note that such a design incurs an undesired error that a subset of EK := [1 2 K, 1]dx might be mapped to zero after applying memorize K,M. Nevertheless, mapping EK to zero is not critical to the error f f p as µ(EK) < 2 dx K can be made arbitrarily small by choosing a sufficiently large K. We implement the decoder by a RELU network decode M of width dy using Lemma 10 as in Appendix A.1. Then, by Lemma 10, it holds that decode M(R) [0, 1]dy, and hence, f(Rdx) [0, 1]dy. Published as a conference paper at ICLR 2021 Finally, we bound the error f f p utilizing the following inequality: Rdx f (x) f(x) p pdx 1 [0,1]dx\(EK Dγ) f (x) f(x) p pdx + Z EK Dγ f (x) f(x) p pdx dy(ωf (2 K) + 2 M)p + (µ(EK) + µ(Dγ)) sup x EK Dγ f (x) f(x) p p dy(ωf (2 K) + 2 M)p + (2 dx K + γ) sup x [0,1]dx f (x) p + sup x [0,1]dx f(x) p p ! 1 dy(ωf (2 K) + 2 M)p + (2 dx K + γ) sup x [0,1]dx f (x) p + (dy) 1 p p ! 1 By choosing sufficiently large K, M and sufficiently small γ, one can make the RHS smaller than ε/2 as supx [0,1]dx f (x) p < . This completes the proof of the tight upper bound in Theorem 1. A.3 PROOF OF THEOREM 4 In this section, we prove Theorem 4 by proving the following statement: For any p [1, ), for any f Lp(K, Rdy), for any ε > 0, there exists a ρ network f of width max{dx + 2, dy + 1} such that f f p ε. Here, there exists a continuous function f C(K, Rdy) such that 2 since f Lp(K, Rdy). Namely, if we construct a ρ network f such that f f p ε 2, it completes the proof. Throughout the proof, we assume that the support of f is a subset of [0, 1]dx and its codomain is [0, 1]dy which can be easily generalized to arbitrary compact support and arbitrary codomain, respectively. Before describing our construction, we first introduce the following lemma. Lemma 12 [Kidger and Lyons (2020, Proposition 4.9)]. Let ρ : R R be any continuous nonpolynomial function which is continuously differentiable at some z with ρ (z) = 0. Then, for any f C(K, Rdy), for any ε > 0, there exists a ρ network f : K Rdx Rdy of width dx + dy + 1 such that for all x K, f(x) := (y1(x), y2(x)), where y1(x) x ε and y2(x) f (x) ε. We note that Proposition 4.9 by Kidger and Lyons (2020) only ensures y2(x) f (x) ε; however, its proof provides y1(x) x ε as well. The proof of Theorem 4 also utilizes our coding scheme; here, we approximate RELU network constructions encode K, memorize K,M, and decode M in Appendix A.2 by ρ networks. Using Lemma 12, for any ε1 > 0, we approximate encode K by a ρ network encode K of width dx + 2 so that encode K(x) encode K(x) ε1 for all x [0, 1]dx \ Dγ and encode K([0, 1]dx) [ ε1, 1+ε1]. We note that encode K([0, 1]dx) [ ε1, 1+ε1] is possible as encode K(Rdx) [0, 1] by Lemma 11. Approximating the memorizer can be done in a similar manner. Using Lemma 12, for any compact interval I2 R containing Cdx K, for any ε2 > 0, we approximate memorize K,M by a ρ network memorize K,M of width 3 so that memorize K,M(c) memorize K,M(c) ε2 for all c Cdx K Published as a conference paper at ICLR 2021 and memorize K,M(I2) [ ε2, 1 + ε2]. We note that memorize K,M(I2) [ ε2, 1 + ε2] is possible as there exists memorize K,M (i.e., a RELU network) such that memorize K,M(I2) [0, 1] by Lemma 9. 
For approximating the decoder, we introduce the following lemma. The proof of Lemma 13 is presented in Appendix A.8. Lemma 13. For any dy, M N, for any ε > 0, for any compact interval I R containing [0, 1], there exists a ρ network f : R Rdy of width dy + 1 such that for all c I, f(c) decode M(c) ε. Namely, f(I) [ ε, 1 + ε]dy. By Lemma 13, for any compact interval I3 R containing [0, 1], for any ε3 > 0, there exists a ρ network decode M of width dy + 1 such that decode M(c) decode M(c) ε3 for all c Cdy M and decode M(I3) [ ε3, 1 + ε3]dy. We approximate f by a ρ network f of width max{dx + 2, dy + 1} defined as f := decode M memorize K,M encode K. Here, for any η > 0, by choosing sufficiently large K, M, sufficiently large I2, I3, and sufficiently small ε1, ε2, ε3 so that ωf (2 K) + 2 M η 2 and ωdecode M ωmemorize K,M (ε1) + ε2 + ε3 η sup x [0,1]dx\Dγ f (x) f(x) η and f([0, 1]dx) [ η where ωmemorize K,M and ωdecode M are defined on I2 and I3, respectively. Finally, we bound the error f f p utilizing the following inequality: [0,1]dx f (x) f(x) p pdx [0,1]dx\Dγ f (x) f(x) p pdx + Z Dγ f (x) f(x) p pdx sup x [0,1]dx\Dγ f (x) f(x) p p + µ(Dγ) sup x Dγ f (x) f(x) p p sup x [0,1]dx\Dγ f (x) f(x) p p + γ sup x [0,1]dx f (x) p + sup x [0,1]dx f(x) p p ! 1 By choosing sufficiently small ε1, ε2, ε3, γ, sufficiently large K, M, and sufficiently large I2, I3, one can make the RHS smaller than ε/2 due to (5) and the fact that supx [0,1]dx f (x) p < . This completes the proof of Theorem 4. A.4 PROOF OF LEMMA 8 We construct f : R R as f(x) := f2K f1(x) where each fℓ: R R is defined for x [0, 1] as fℓ(x) := (ℓ 1) 2 K if x [(ℓ 1) 2 K, ℓ 2 K) x if x / [(ℓ 1) 2 K, ℓ 2 K) = gℓ3 gℓ2 gℓ1(x) Published as a conference paper at ICLR 2021 where gℓ1 : R R2, gℓ2 : R2 R2, and gℓ3 : R2 R are defined as gℓ1(x) := σ(x), σ(x ℓ) gℓ2(x, z) := σ(x + z), σ(x ℓ+ 1) gℓ3(x, z) := σ(x + z) + 1[x ℓ]. This directly implies that f(x) = q K(x) for all x [0, 1] and completes the proof of Lemma 8. A.5 PROOF OF LEMMA 9 Let |X| = N and x(1), . . . , x(N) be distinct elements of X in an increasing order, i.e., x(i) < x(j) if i < j. Let x(0) := min I and x(N+1) := max I. Here, x(0) x(1) and x(N) x(N+1) as X I. Without loss of generality, we assume that x(0) = 0. Consider a continuous piecewise linear function f : [x(0), x(N+1)] R of N + 1 linear pieces defined as f (x(1)) if x [x(0), x(1)) f (x(i)) + f (x(i+1)) f (x(i)) x(i+1) x(i) (x x(i)) if x [x(i), x(i+1)) for 1 i N 1 f (x(N)) if x [x(N), x(N+1)] . Now, we introduce the following lemma. Lemma 14. For any compact interval I R, for any continuous piecewise linear function f : I R with P linear pieces, there exists a RELU network f of width 2 and O(P) layers such that f (x) = f(x) for all x I. From Lemma 14, there exists a RELU network f of width 2 and O(|X|) layers such that f (x) = f(x) for all x X. Since X [x(0), x(N+1)] = I and f (I) min f (X), max f (X) , this completes the proof of Lemma 9. Proof of Lemma 14. Suppose that f is linear on intervals [min I, x1), [x1, x2), . . . , [x P 1, max I] and parametrized as a1 x + b1 if x [min I, x1) a2 x + b2 if x [x1, x2) ... a P x + b P if x [x P 1, max I] for some ai, bi R satisfying ai xi + bi = ai+1 xi + bi+1. Without loss of generality, we assume that min I = 0. Now, we prove that for any P 1, there exists a RELU network f : I R2 of width 2 and O(P) layers such that (f(x))1 = σ(x x P 1) and (f(x))2 = f (x). Then, (f(x))2 is the desired RELU network and completes the proof. 
We use the mathematical induction on P for proving the existence of such f. If P = 1, choosing (f(x))1 = σ(x) and (f(x))2 = a1 σ(x) + b1 completes the construction of f. Now, consider P > 1. From the induction hypothesis, there exists a RELU network g of width 2 such that (g(x))1 = σ(x x P 2) a1 x + b1 if x [min I, x1) a2 x + b2 if x [x1, x2) ... a P 1 x + b P 1 if x [x P 2, max I] Then, the following construction of f completes the proof of the mathematical induction: f(x) = h2 h1 g(x) h1(x, z) = σ(x x P 1 + x P 2), σ(z K) + K h2(x, z) = x, z + (a P 1 a P 2) x where K := mini minx I{ai x + bi}. This completes the proof of Lemma 14. Published as a conference paper at ICLR 2021 A.6 PROOF OF LEMMA 10 Before describing our proof, we first introduce the following lemma. The proof of Lemma 15 is presented in Appendix A.9. Lemma 15. For any M N, for any δ > 0, there exists a RELU network f : R R2 of width 2 and O(2M) layers such that for all x [0, 1] \ DM,δ, f(x) := (y1(x), y2(x)), where y1(x) = q M(x), y2(x) = 2M (x q M(x)), (6) and DM,δ := S2M 1 i=1 (i 2 M δ, i 2 M). Furthermore, it holds that f(R) [0, 1 2 M] [0, 1]. (7) In Lemma 15, one can observe that Cdy M [0, 1] \ DM,δ for δ < 2 dy M , i.e., there exists a RELU network g of width 2 satisfying (6) on Cdy M and (7). g enables us to extract the first M bits of the binary representation of c Cdy M. Consider outputs of g(c): (g(c))1 for c Cdy M is the first coordinate of decode M(c) while (g(c))2 C(dy 1)M contains information on other coordinates of decode M(c). Now, consider further applying g to (g(c))2 and passing though the output (g(c))1 via the identity function (RELU is identity for positive inputs). Then, g (g(c))2 1 is the second coordinate of decode M(c) while g (g(c))2 2 contains information on coordinates other than the first and the second ones of decode M(c). Under this observation, if we iteratively apply g to the second output of the prior g and pass through all first outputs of previous g s, then we recover all coordinates of decode M(c) within dy 1 applications of g. Note that both the first and the second outputs of the (dy 1)-th g correspond to the second last and the last coordinate of decode M(c), respectively. Our construction of f is such an iterative dy 1 applications of g which can be implemented by a RELU network of width dy. Here, (7) in Lemma 15 enables us to achieve f(R) [0, 1]dy. This completes the proof of Lemma 10. A.7 PROOF OF LEMMA 11 To begin with, we introduce the following Lemma. The proof of Lemma 16 is presented in Appendix A.10. Lemma 16. For any dx, for any α (0, 0.5), there exists a RELU network f : Rdx Rdx of width dx + 1 and O(1) layers such that f(x) = (1, . . . , 1) for all x Rdx \ [0, 1]dx, f(x) = x for all x [α, 1 α]dx, and f(Rdx) [0, 1]dx. By Lemma 16, there exists a RELU network h1 of width dx + 1 and O(1) layers such that h1(x) = (1, . . . , 1) for all x Rdx \ [0, 1]dx, h1(x) = x for all x [α, 1 α]dx, and h1(Rdx) [0, 1]dx. Furthermore, by Lemma 15, for any δ > 0, there exists a RELU network g : R R of width 2 and O(2K) layers such that g(c) = q K(c) for all c [0, 1] \ DK,δ (see Lemma 15 for the definition of DK,δ). We construct a network h2 : Rdx Rdx of width dx + 1 by sequentially applying g for each coordinate of an input x Rdx, utilizing the extra width 1. Then, h2(x) = q K(x) for all x [0, 1]dx \ DK,δ,dx where DK,δ,dx := {x Rdx : xi DK,δ for some i}. Note that we use q K(x) for denoting the coordinate-wise q K for a vector x. Now, we define Dγ := ([0, 1]dx \ [α, 1 α]dx) DK,δ,dx [0, 1]dx. 
Then, from constructions of h1 and h2, we have h2 h1(x) = q K(x) for all x [0, 1]dx \ Dγ h2 h1(x) = (1 2 K, . . . , 1 2 K) for all x Rdx \ [0, 1]dx h2 h1(x) [0, 1 2 K]dx for all x Dγ where we use the fact that (1, . . . , 1) / DK,δ,dx and q K((1, . . . , 1)) = (1 2 K, . . . , 1 2 K). Published as a conference paper at ICLR 2021 Finally, we construct a RELU network f of width dx + 1 as i=1 (h2 h1(x))i 2 (i 1)K. Then, it holds that f(x) = encode K(x) for all x [0, 1]dx \ Dγ f(x) = 1 2dx K for all x Rdx \ [0, 1]dx f(x) [0, 1] for all x Dγ. In addition, if we choose sufficiently small α and δ so that µ(Dγ) < γ, then f satisfies all conditions in Lemma 11. This completes the proof of Lemma 11. A.8 PROOF OF LEMMA 13 The proof of Lemma 13 is almost identical to that of Lemma 10. In particular, we approximate the RELU network construction of iterative dy 1 applications of g (see Appendix A.6 for the definition of g) by a ρ network of width dy + 1. To this end, we consider a ρ network h of width 3 approximating g on some interval J within α error using Lemma 12. Then, one can observe that iterative dy 1 applications of h (as in Appendix A.6) results in a ρ network f of width dy + 1. Here, passing through the identity function can be approximated using a ρ network of width 1, i.e., same width to RELU networks (see Lemma 4.1 by Kidger and Lyons (2020) for details). Furthermore, since h is uniformly continuous on J , it holds that f(c) decode M(c) ε for all c Cdy M and f(I) [ ε, 1 + ε]dy by choosing sufficiently large J and sufficiently small α so that ωh( ωh(ωh(α) + α) ) + α ε.8 This completes the proof of Lemma 13. A.9 PROOF OF LEMMA 15 We first clip the input to be in [0, 1] using the following RELU network of width 1. min max{x, 0}, 1 = 1 σ(1 σ(x)) After that, we apply gℓ: [0, 1] [0, 1]2 defined as (gℓ(x))1 := x (gℓ(x))2 := 0 if x [0, 2 M δ] δ 12 M (x 2 M + δ) if x (2 M δ, 2 M) 2 M if x [2 M, 2 2 M δ] δ 12 M (x 2 2 M + δ) + 2 M if x (2 2 M δ, 2 2 M) ... (ℓ 1) 2 M if x [(ℓ 1) 2 M, 1] From the above definition of gℓ, one can observe that (g2M (x))2 = q M(x) for x [0, 1] \ DK,δ. Once we implement a RELU network g of width 2 such that g(x) = g2M (x), then, constructing f as f(x) := (g(z))2, 2M (g(z))1 (g(z))2 z := min max{x, 0}, 1 completes the proof. Note that as (g(x))2 x = (g(x))1 for all x [0, 1], f(x) [0, 1 2 M] [0, 1]. Now, we describe how to construct g2M by a RELU network. One can observe that (g1(x))2 = 0 and (gℓ+1(x))2 = min n ℓ 2 M, max δ 12 M (x ℓ 2 M + δ) + (ℓ 1) 2 M, gℓ(x) o for all x, i.e., alternating applications of min{ , } and max{ , }. Finally, we introduce the following definition and lemma. 8We consider ωh on J . Published as a conference paper at ICLR 2021 Definition 1 [Hanin and Sellke (2017)]. f : Rdx Rdy is a max-min string of length L if there exist affine transformations h1, . . . , h L such that h(x) = τL 1(h L(x), τL 2(h L 1(x), , τ2(h3(x), τ1(h2(x), h1(x))), ) where each τℓis either a coordinate-wise max{ , } or min{ , }. Lemma 17 [Hanin and Sellke (2017, Proposition 2)]. For any max-min string f : Rdx Rdy of length L, for any compact K Rdx, there exists a RELU network f : Rdx Rdx Rdy of L layers and width dx + dy such that for all x K, f(x) = (y1(x), y2(x)), where y1(x) = x and y2(x) = f (x). We note that Proposition 2 by Hanin and Sellke (2017) itself only ensures y2 = f (x); however, its proof provides y1 = x as well. From the definition of the max-min string, one can observe that (g2M (x))2 is a max-min string. 
By Lemma 17, it follows that there exists a RELU network g of width 2 such that g(x) = g_{2^M}(x) for all x ∈ [0, 1]; in particular, (g(x))_2 = q_M(x) for all x ∈ [0, 1] ∖ D_{M,δ}. This completes the proof of Lemma 15.

A.10 PROOF OF LEMMA 16

Consider the following two functions from ℝ to ℝ:

h_1(x) := 0 if x ≤ 1 − α;  α^{−1}(x − 1 + α) if x ∈ (1 − α, 1);  1 if x ≥ 1;  so that h_1(x) = σ(1 − σ(1 − x)/α),
h_2(x) := 1 if x ≤ 0;  α^{−1}(α − x) if x ∈ (0, α);  0 if x ≥ α;  so that h_2(x) = 1 − σ(1 − σ(α − x)/α).  (9)

Using h_1 and h_2, we first map all x ∈ ℝ^{dx} ∖ [0, 1]^{dx} to some vector whose coordinates are all greater than one via g : ℝ^{dx} → ℝ^{dx}, defined as

g := r_{dx} ∘ s_{dx} ∘ ⋯ ∘ r_1 ∘ s_1,  r_ℓ(x) := σ(x + 1) − 1 + 10·h_1(x_ℓ),  s_ℓ(x) := σ(x + 1) − 1 + 10·h_2(x_ℓ).

Here, we use addition between a vector and a scalar to denote adding the scalar to each coordinate of the vector. Then, one can observe that if x ∈ [α, 1 − α]^{dx}, then g(x) = x, and if x ∈ ℝ^{dx} ∖ [0, 1]^{dx}, then each coordinate of g(x) is greater than one. Furthermore, each r_ℓ (or s_ℓ) can be implemented by a RELU network of width dx + 1 (width dx for computing σ(x + 1) − 1 and width one for computing h_1(x_ℓ)), due to (9). Hence, g can be implemented by a RELU network of width dx + 1. Finally, we construct a RELU network f : ℝ^{dx} → ℝ^{dx} of width dx + 1 as

f(x) := min{max{g(x), 0}, 1},

applied coordinate-wise, using the width-1 identity min{max{x, 0}, 1} = 1 − σ(1 − σ(x)). Then, one can observe that if x ∈ [α, 1 − α]^{dx}, then f(x) = x; if x ∈ ℝ^{dx} ∖ [0, 1]^{dx}, then f(x) = (1, …, 1); and f(ℝ^{dx}) ⊆ [0, 1]^{dx}. This completes the proof of Lemma 16.

B PROOFS OF LOWER BOUNDS

B.1 PROOF OF GENERAL LOWER BOUND

In this section, we prove that neural networks of width dy − 1 are dense in neither Lp(K, ℝ^{dy}) nor C(K, ℝ^{dy}), regardless of the activation functions.

Lemma 18. For any set of activation functions, networks of width dy − 1 are dense in neither Lp(K, ℝ^{dy}) nor C(K, ℝ^{dy}).

Proof. In this proof, we show that networks of width dy − 1 are not dense in Lp([0, 1]^{dx}, ℝ^{dy}); the argument easily generalizes to Lp(K, ℝ^{dy}) and C(K, ℝ^{dy}). In particular, we prove that there exist a continuous function f* ∈ Lp([0, 1]^{dx}, ℝ^{dy}) and ε > 0 such that for any network f of width dy − 1, it holds that ‖f* − f‖_p > ε.

Let ∆ be a dy-dimensional regular simplex with side length √2 that is isometrically embedded into ℝ^{dy}; the volume of this simplex is Vol_{dy}(∆) = √(dy + 1)/dy! (here, Vol_d(S) denotes the volume of S in the d-dimensional Euclidean space). We denote the vertices of this simplex by {v_0, …, v_{dy}}. Then, we can construct the counterexample as follows:

f*(x) := v_i if x_1 ∈ [2i/(2dy + 1), (2i + 1)/(2dy + 1)] for some i;  (2dy + 1)·(v_{i+1} − v_i)·x_1 + (2i + 2)·v_i − (2i + 1)·v_{i+1} if x_1 ∈ [(2i + 1)/(2dy + 1), (2i + 2)/(2dy + 1)] for some i.

In other words, f*(x) travels over the vertices of ∆ sequentially as x_1 moves from 0 to 1, staying at each vertex over an interval of length 1/(2dy + 1) and traveling between consecutive vertices at a constant speed otherwise; in particular, f* is continuous and in Lp([0, 1]^{dx}, ℝ^{dy}). Recalling (1), one can notice that any neural network f of width less than dy with L ≥ 2 layers can be decomposed as t_L ∘ g, where t_L : ℝ^k → ℝ^{dy} is the last affine transformation and g denotes all the preceding layers, i.e., g = σ_{L−1} ∘ t_{L−1} ∘ ⋯ ∘ σ_1 ∘ t_1. Here, we consider k = dy − 1, as this suffices to cover the cases k < dy − 1. Now, we proceed as

∫_{[0,1]^{dx}} ‖f*(x) − f(x)‖_p^p dx = ∫_{[0,1]^{dx}} ‖f*(x) − t_L(g(x))‖_p^p dx
 ≥ ∫_{[0,1]^{dx}} inf_{u ∈ ℝ^{dy−1}} ‖f*(x) − t_L(u)‖_p^p dx
 ≥ (2dy + 1)^{−1} · inf_{t : affine} max_{i ∈ [dy+1]} inf_{u_i ∈ ℝ^{dy−1}} ‖v_i − t(u_i)‖_p^p
 ≥ (2dy + 1)^{−1} · inf_{H ∈ 𝓗} max_{i ∈ [dy+1]} inf_{a_i ∈ H} ‖v_i − a_i‖_p^p,

where 𝓗 denotes the set of all (dy − 1)-dimensional hyperplanes in ℝ^{dy}, [dy+1] := {0, 1, …, dy}, and the second inequality uses the fact that f* equals v_i on an interval of x_1 of length 1/(2dy + 1) for each i. As the vertices of ∆ are dy + 1 distinct points in general position, inf_{H ∈ 𝓗} max_{i ∈ [dy+1]} inf_{a_i ∈ H} ‖v_i − a_i‖_p > 0.
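As a numerical sanity check of this positivity claim for dy = 2 (the volumetric argument below makes it quantitative in general), the following sketch grid-searches over all lines in ℝ², i.e., 1-dimensional hyperplanes, and confirms that no line comes close to all three vertices of a regular triangle with side length √2. The vertex coordinates, the grid resolution, and the use of the Euclidean (p = 2) distance are our own choices.

```python
import math
from itertools import product

# Regular 2-simplex (equilateral triangle) with side length sqrt(2) in R^2,
# playing the role of Delta in the counterexample of Lemma 18 for dy = 2.
s = math.sqrt(2.0)
V = [(0.0, 0.0), (s, 0.0), (s / 2, s * math.sqrt(3.0) / 2)]

def max_dist_to_line(theta, c):
    """Largest Euclidean distance from the three vertices to the line
    {x in R^2 : <n, x> = c} with unit normal n = (cos theta, sin theta)."""
    n = (math.cos(theta), math.sin(theta))
    return max(abs(v[0] * n[0] + v[1] * n[1] - c) for v in V)

# Coarse grid over all lines (1-dimensional hyperplanes) in R^2.
best = min(max_dist_to_line(t, c)
           for t, c in product((i * math.pi / 360 for i in range(361)),
                               (-2.0 + 0.01 * i for i in range(401))))
print(best)  # about 0.61 (= sqrt(6)/4, half the triangle's height): bounded away from 0
```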
To make this argument more concrete and quantitative, we take a volumetric approach: for any (dy − 1)-dimensional hyperplane H ⊆ ℝ^{dy}, we have

Vol_{dy}(∆) ≤ 2 · Vol_{dy−1}(π_H(∆)) · max_{i ∈ [dy+1]} inf_{a_i ∈ H} ‖v_i − a_i‖_2,

where π_H denotes the projection onto H. As the projection is a contraction and the distance between any two points of ∆ is at most 2, it holds that for any H,

max_{i ∈ [dy+1]} inf_{a_i ∈ H} ‖v_i − a_i‖_2 ≥ Vol_{dy}(∆) / (2 · Vol_{dy−1}({x ∈ ℝ^{dy−1} : ‖x‖_2 ≤ 1})) = √(dy + 1) · Γ((dy + 1)/2) / (2 · dy! · π^{(dy−1)/2}),

where Γ denotes the gamma function, and we use the fact that Vol_{dy−1}({x ∈ ℝ^{dy−1} : ‖x‖_2 ≤ 1}) ≥ Vol_{dy−1}(π_H(∆)), since ∆ can be contained in a dy-dimensional unit ball and hence π_H(∆) can be contained in a (dy − 1)-dimensional unit ball. Combining this with the chain of inequalities above, we have ‖f* − f‖_p > ε for any ε with

0 < ε < (2dy + 1)^{−1/p} · min{1, dy^{1/p − 1/2}} · √(dy + 1) · Γ((dy + 1)/2) / (2 · dy! · π^{(dy−1)/2}),

where the factor min{1, dy^{1/p − 1/2}} (equal to dy^{1/p − 1/2} for p ≥ 2 and to 1 for p < 2) accounts for passing from the ℓ_2 distance bound above to the ℓ_p distance in the chain of inequalities. This completes the proof of Lemma 18.

B.2 PROOF OF TIGHT LOWER BOUND IN THEOREM 3

In this section, we prove the tight lower bound in Theorem 3. Since we already have the width-dy lower bound by Lemma 18, and it is already proven that RELU networks of width dx are not dense in C(K, ℝ^{dy}) (Hanin and Sellke, 2017), we prove the tight lower bound in Theorem 3 by showing the following statement: there exist f* ∈ C([0, 1]^{dx}, ℝ) and ε > 0 such that for any RELU+STEP network f of width dx containing at least one STEP, it holds that ‖f* − f‖_∞ > ε.

Without loss of generality, we assume that f has dx hidden neurons at each layer except for the output layer and that all affine transformations in f are invertible (see Section 5.1). Our main idea is to utilize properties of the level sets of width-dx RELU+STEP networks (Hanin and Sellke, 2017), defined as follows: given a network f of width dx, we call a connected component of f^{−1}(y) for some y a level set. Level sets of RELU+STEP networks have the property described by the following lemma. We note that the statement and the proof of Lemma 19 are motivated by Lemma 6 of Hanin and Sellke (2017).

Lemma 19. Let f be a RELU+STEP network of width dx containing at least one STEP. Then, any level set S of f is unbounded unless it is empty.

Proof of Lemma 19. Let ℓ* be the smallest index such that STEP appears at the ℓ*-th layer. In this proof, we show that all level sets of the first ℓ* layers of f are either unbounded or empty; the claim of Lemma 19 then directly follows. We prove this by induction on ℓ*. Recalling (1), we denote by f_ℓ the mapping of the first ℓ layers of f: f_ℓ := σ_ℓ ∘ t_ℓ ∘ ⋯ ∘ σ_1 ∘ t_1.

First, consider the base case ℓ* = 1. Assume without loss of generality that the activation function of the first hidden node in σ_1 is STEP. Then for any x, the STEP activation maps the first component of t_1(x) to 1 if (t_1(x))_1 ≥ 0, and to 0 otherwise. This means that there exists a ray R starting from x such that f_1(R) = {f_1(x)}. Hence, any level set of f_1 is either unbounded or empty.

Now, consider the case ℓ* > 1. Then, up to the (ℓ* − 1)-th layer, the network only utilizes RELU. Here, the level sets of f_{ℓ*−1} can be characterized using the following lemma.

Lemma 20 [Hanin and Sellke (2017, Lemma 6)]. Given a RELU network g of width dx, let S ⊆ ℝ^{dx} be the set such that x ∈ S if and only if the inputs to all RELU activations in g are strictly positive when computing g(x). Then, S is open and convex, g is affine on S, and any bounded level set of g is contained in S.

Consider the set S of f_{ℓ*−1} as in Lemma 20, and consider a level set T of f_{ℓ*} containing some x, i.e., T ≠ ∅. If x ∉ S, then T is unbounded by Lemma 20.
If x ∈ S, we argue as in the base case: the preimage of f_{ℓ*}(x) under the ℓ*-th layer (i.e., σ_{ℓ*} ∘ t_{ℓ*}) contains a ray. If this ray is contained in f_{ℓ*−1}(S), then T is unbounded, as f_{ℓ*−1} is invertible and affine on S. Otherwise, T ∖ S ≠ ∅, and T must be unbounded, as any level set of f_{ℓ*−1} not contained in S is unbounded by Lemma 20. This completes the proof of Lemma 19.

Now, we continue the proof of the tight lower bound in Theorem 3 based on Lemma 19. We note that our argument is also adapted from the proof of the lower bound in Theorem 1 of Hanin and Sellke (2017). Consider f* : [0, 1]^{dx} → ℝ defined as

f*(x_1, …, x_{dx}) := Σ_{i=1}^{dx} (x_i − 1/2)².

Then, for a = 1/4 and b = 0, one can observe that (f*)^{−1}(a) is the sphere of radius 1/2 centered at (f*)^{−1}(b) = {(1/2, …, 1/2)}. Namely, any path from (f*)^{−1}(b) to infinity must intersect (f*)^{−1}(a). Now, suppose that a RELU+STEP network f of width dx satisfies ‖f* − f‖_∞ ≤ 1/16. Then, the level set of f containing (1/2, …, 1/2) must be unbounded by Lemma 19 and hence must intersect (f*)^{−1}(a). However, as f* = 1/4 on (f*)^{−1}(a) and f* = 0 on (f*)^{−1}(b), this contradicts ‖f* − f‖_∞ ≤ 1/16. This completes the proof of the tight lower bound max{dx + 1, dy} in Theorem 3.

B.3 PROOF OF TIGHT LOWER BOUND IN THEOREM 1

In this section, we prove the tight lower bound in Theorem 1. Since we already have the dy lower bound by Lemma 18, we prove the tight lower bound in Theorem 1 by showing the following statement: there exist f* ∈ Lp(ℝ^{dx}, ℝ) and ε > 0 such that for any continuous function f represented by a RELU network of width dx, it holds that ‖f* − f‖_p > ε. Note that this statement can be easily generalized to an arbitrary codomain. To derive the statement, we prove a stronger statement:

for any RELU network f of width dx, either f ∉ Lp(ℝ^{dx}, ℝ) or f = 0,  (10)

where f = 0 denotes that f is the constant function mapping any input to zero. This leads directly to the desired result. Without loss of generality, we assume that f has dx hidden neurons at each layer except for the output layer and that all affine transformations in f are invertible (see Section 5.1).

As in the proof of the tight lower bound in Theorem 3, we utilize the properties of level sets of f given by Lemma 20. Let S be the set defined in Lemma 20 for f. By the definition of S, one can observe that ℝ^{dx} ∖ S ≠ ∅. Then, a level set T containing some x ∈ ℝ^{dx} ∖ S must be unbounded by Lemma 20. Here, if y := f(x) > 0, then for δ := ω_f^{−1}(y/2), we have f(x′) ≥ y/2 for all x′ ∈ T′ := {x′ ∈ ℝ^{dx} : ‖x′ − x″‖ ≤ δ for some x″ ∈ T}. Since T′ contains the unbounded set T, one can easily observe that µ(T′) = ∞ (µ denotes the Lebesgue measure) and hence ∫_{T′} |f(x)|^p dx = ∞, i.e., f ∉ Lp(ℝ^{dx}, ℝ). One can derive the same conclusion for f(x) < 0.

Now, suppose that f(x) = 0 for all x ∈ ℝ^{dx} ∖ S. Then, f(x) = 0 for all x ∈ ∂S, as S is open (see Lemma 20). Furthermore, we claim that f(S) = {0} or f ∉ Lp(ℝ^{dx}, ℝ). For any s ∈ S, consider any two rays of opposite directions starting from s. If one of the rays is contained in S and f ∈ Lp(ℝ^{dx}, ℝ), then the image of this ray under f must be {0}: if the image were not {0}, an argument similar to the case f(x) > 0 above would give f ∉ Lp(ℝ^{dx}, ℝ). Then, one can conclude that f(s) = 0. If neither ray is contained in S, they must both intersect ∂S; then, since f is affine on S, f(s) must be zero, as f(∂S) = {0}. Hence, the claim holds. This completes the proof of the tight lower bound in Theorem 1.

B.4 PROOF OF LEMMA 5

Let φ(x) = Ax + b for some invertible matrix A ∈ ℝ^{2×2} with rows a_1, a_2 ∈ ℝ² and some vector b = (b_1, b_2) ∈ ℝ².
Then, it is easy to see that if ⟨a_1, x⟩ + b_1 ≥ 0 and ⟨a_2, x⟩ + b_2 ≥ 0, i.e., x ∈ S, then φ^{−1} ∘ σ ∘ φ(x) = x. Hence, the first statement of Lemma 5 holds. Now, consider the second statement of Lemma 5. Suppose that ⟨a_1, x⟩ + b_1 ≥ 0 but ⟨a_2, x⟩ + b_2 < 0. Then, one can easily observe that φ^{−1} ∘ σ ∘ φ maps the ray {x′ ∈ ℝ² : ⟨a_1, x′⟩ = ⟨a_1, x⟩, ⟨a_2, x′⟩ + b_2 < 0} containing x to the single point φ^{−1}((⟨a_1, x⟩ + b_1, 0)), which lies on ∂S. Similar arguments hold for the cases ⟨a_1, x⟩ + b_1 < 0, ⟨a_2, x⟩ + b_2 ≥ 0 and ⟨a_1, x⟩ + b_1 < 0, ⟨a_2, x⟩ + b_2 < 0. This completes the proof of Lemma 5.

B.5 PROOF OF LEMMA 6

We first prove the first statement of Lemma 6 by contradiction. Suppose that x′ = x and x′ ∉ T′, but x′ is not in a bounded path-component of ℝ² ∖ T′. Here, note that x′ = x ∈ S for S defined in Lemma 5. Then, there exists a path P from x′ to infinity such that P ∩ T′ = ∅. If P ⊆ int(S) (int(S) denotes the interior of S), then the preimages of P and T′ ∩ int(S) under φ^{−1} ∘ σ ∘ φ stay identical to their corresponding images, i.e., P and T′ ∩ int(S) (by Lemma 5). This contradicts the assumption that x is in a bounded path-component of ℝ² ∖ T. Hence, it must hold that P ⊄ int(S). Let x″ ∉ T′ be the first point in P ∩ ∂S along the trajectory of P starting from x′. Then, the preimage of x″ contains a ray R starting from x″ (see the proof of Lemma 5 for the details), and this ray must not intersect T: had the ray R intersected T, then R ∩ T would have been mapped to x″, which contradicts x″ ∉ T′ and the definition of P. Furthermore, by the definition of x″, the subpath P′ of P from x′ to x″, excluding x″, satisfies P′ ⊆ int(S). Hence, the preimages of P′ and T′ ∩ int(S) under φ^{−1} ∘ σ ∘ φ stay identical by Lemma 5. This implies that there exist a path P′ from x to x″ and then a ray R from x″ to infinity, neither of which intersects T. This contradicts the assumption of Lemma 6 and completes the proof of the first statement of Lemma 6.

Now, consider the second statement of Lemma 6. By Lemma 5, x′ ≠ x implies that x ∉ S and x′ ∈ ∂S. Here, as the preimage of x′ contains a ray from x′ containing x, this ray must intersect T by the assumption of Lemma 6. Hence, x′ ∈ T′, and this completes the proof of the second statement of Lemma 6. Combining the proofs of the first and the second statements completes the proof of Lemma 6.

B.6 PROOF OF LEMMA 7

Before starting our proof, we first introduce the following definitions and lemma. The proof of Lemma 21 is presented in Appendix B.7.

Definition 2. Definitions related to curves, loops, and polygons are listed as follows. For U ⊆ ℝ² and F(U) := {f ∈ C([0, 1], ℝ²) : f([0, 1]) = U}:
- U is a curve if there exists f ∈ F(U).
- U is a simple curve if there exists an injective f ∈ F(U).
- U is a loop if there exists f ∈ F(U) such that f(1) = f(0).
- U is a simple loop if there exists f ∈ F(U) such that f(1) = f(0) and f is injective on [0, 1).
- U is a polygon if there exists a piecewise linear f ∈ F(U) such that f(1) = f(0).
- U is a simple polygon if there exists a piecewise linear f ∈ F(U) such that f(1) = f(0) and f is injective on [0, 1).

Figure 3: (a) Illustration of U and g_{ℓ*}(q). (b) Illustration of V, g_{ℓ*}(q), and v. (c) Illustration of U, g_{ℓ*}(q), and v.

Lemma 21. Suppose that g_{ℓ*}([0, p_1]) ∩ g_{ℓ*}([p_2, 1]) = ∅ and g_{ℓ*}([0, p_1]) ∖ B is contained in a bounded path-component of ℝ² ∖ U for some U ⊆ g_{ℓ*}([p_2, 1]) ∪ B. Then, g_{ℓ*}([0, p_1]) ∖ B is contained in a bounded path-component of ℝ² ∖ (g_{ℓ*}([p_2, 1]) ∪ B).
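Before turning to the proof of Lemma 7, here is a small numerical aside on the folding map φ^{−1} ∘ σ ∘ φ from Lemma 5, which underlies Lemmas 6, 21, and 23: it fixes S pointwise and collapses each ray outside S to a single boundary point. The matrix, offset, and helper names below are our own illustrative choices.

```python
import numpy as np

# A hypothetical invertible affine map phi(x) = A x + b, with rows a1, a2 of A.
A = np.array([[1.0, 2.0], [-1.0, 1.0]])
b = np.array([0.5, -0.3])

relu = lambda z: np.maximum(z, 0.0)
phi = lambda x: A @ x + b
phi_inv = lambda y: np.linalg.solve(A, y - b)
fold = lambda x: phi_inv(relu(phi(x)))          # the map phi^{-1} o sigma o phi

# A point in S = {x : <a1,x>+b1 >= 0, <a2,x>+b2 >= 0} is left unchanged.
x_in = phi_inv(np.array([1.0, 2.0]))            # phi(x_in) = (1, 2), both >= 0
print(np.allclose(fold(x_in), x_in))            # True

# Points on the ray {x : <a1,x>+b1 = 1, <a2,x>+b2 < 0} all collapse to the
# single point phi^{-1}((1, 0)) on the boundary of S, as in Lemma 5.
boundary = phi_inv(np.array([1.0, 0.0]))
for t in [0.5, 1.0, 7.0]:
    x_ray = phi_inv(np.array([1.0, -t]))
    print(np.allclose(fold(x_ray), boundary))   # True, True, True
```

This collapsing behavior is exactly what the arguments in Appendices B.5 to B.7 track layer by layer.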
We now prove that if g_{ℓ*}([0, p_1]) ∩ g_{ℓ*}([p_2, 1]) = ∅, then g_{ℓ*}([0, p_1]) ∖ B is contained in a bounded path-component of ℝ² ∖ U for some simple polygon U ⊆ g_{ℓ*}([p_2, 1]) ∪ B. The statement of Lemma 7 then directly follows by Lemma 21.

To begin with, consider the loop g_{ℓ*}([p_2, 1]) ∪ L(g_{ℓ*}(p_2), g_{ℓ*}(1)), where L(x, x′) denotes the line segment from x to x′, i.e., L(x, x′) := {λx + (1 − λ)x′ : λ ∈ [0, 1]}. This loop consists of a finite number of line segments, since the image of an interval under a RELU network is piecewise linear, as is L(g_{ℓ*}(p_2), g_{ℓ*}(1)); i.e., the loop is a polygon. Since g_{ℓ*}([p_2, 1]) ∪ L(g_{ℓ*}(p_2), g_{ℓ*}(1)) consists of line segments, under the assumption ‖f* − f‖_∞ ≤ 1/100, one can easily construct a simple loop U in g_{ℓ*}([p_2, 1]) ∪ L(g_{ℓ*}(p_2), g_{ℓ*}(1)) so that U contains simple curves from the midpoint of L(g_{ℓ*}(p_2), g_{ℓ*}(1)) to a point near the point (−1, 1), then to a point near the point (1, 1), and finally back to the midpoint of L(g_{ℓ*}(p_2), g_{ℓ*}(1)). We note that U also consists of line segments, i.e., U is a simple polygon. Figure 3(a) illustrates U, where the line segments from g_{ℓ*}([p_2, 1]) are drawn in blue and the line segments from L(g_{ℓ*}(p_2), g_{ℓ*}(1)) are indicated by the dotted black line.

Now, choose q ∈ (0, p_1) such that f*(q) = (0, 1/2). Since ‖f* − f‖_∞ ≤ 1/100 and by the definition of ℓ*, f(q) = g_{ℓ*}(q) ∈ {x ∈ ℝ² : ‖x − (0, 1/2)‖ ≤ 1/100}, which is illustrated by the red dot in Figure 3. Then, we claim the following statement:

g_{ℓ*}(q) is contained in a bounded path-component of ℝ² ∖ U.  (11)

From the definition of q and the path-connectedness of g_{ℓ*}([0, q]), one can observe that proving the claim (11) implies that g_{ℓ*}([0, q]) is contained in a bounded path-component of ℝ² ∖ U unless U ∩ g_{ℓ*}([0, q]) ≠ ∅. Since g_{ℓ*}([0, p_1]) ∖ B ⊆ g_{ℓ*}([0, q]) by the definitions of q and ℓ* and the assumption that ‖f* − f‖_∞ ≤ 1/100, this implies that if g_{ℓ*}([0, p_1]) ∩ g_{ℓ*}([p_2, 1]) = ∅, then g_{ℓ*}([0, p_1]) ∖ B is contained in a bounded path-component of ℝ² ∖ U. Hence, (11) implies the statement of Lemma 7.

Proof of claim (11). To prove the claim (11), we first introduce the following lemma.

Lemma 22 [Jordan curve theorem (Tverberg, 1980)]. For any simple loop O ⊆ ℝ², ℝ² ∖ O consists of exactly two path-components, one of which is bounded and the other unbounded.

Lemma 22 ensures the existence of a bounded path-component of ℝ² ∖ U. Furthermore, to prove the claim (11), we introduce the parity function π_U : ℝ² ∖ U → {0, 1}: for x ∈ ℝ² ∖ U and a ray starting from x, π_U(x) counts the number of times that the ray properly intersects U (reduced modulo 2), where a proper intersection is an intersection at which U enters and leaves on different sides of the ray. It is well known that π_U(x) does not depend on the choice of the ray, i.e., π_U is well-defined; we refer to the proof of Lemma 2.3 by Thomassen (1992) and the proof of Lemma 1 by Tverberg (1980) for more details. Here, π_U characterizes the position of x in the sense that π_U(x) = 0 if and only if x is in the unbounded path-component of ℝ² ∖ U, which is known as the even-odd rule (Shimrat, 1962; Hacker, 1962). Hence, proving that π_U(g_{ℓ*}(q)) = 1 would complete the proof of the claim (11). Recall that there exists a line (e.g., the black arrow in Figure 3) that intersects B such that the image of g_{ℓ*} can lie on only one side of the line (see Section 5.2 for details).
Since B is open, there exists a vertex v ∈ ∂B (e.g., the green dot in Figures 3(b) and 3(c)) such that v lies on the other side of the line; here, a vertex denotes one of the points (2, 1), (2, −1), (−2, 1), (−2, −1). We prove π_U(g_{ℓ*}(q)) = 1 by counting the number of proper intersections between U and the ray R from g_{ℓ*}(q) passing through v (the red arrow in Figures 3(b) and 3(c) illustrates R).

To simplify showing π_U(g_{ℓ*}(q)) = 1, we consider two points z_1, z_2 ∈ U ∩ ∂B near the points (−1, 1) and (1, 1), respectively, such that the simple curve P in U from z_1 to z_2 is contained in B except for z_1, z_2. Then, one can observe that P and L(z_1, z_2) together form a simple loop, which we call V. Figure 3(b) illustrates V, where the black dotted line indicates the line segment from L(g_{ℓ*}(p_2), g_{ℓ*}(1)), the blue line indicates the line segments from g_{ℓ*}([p_2, 1]), and the green dotted line indicates L(z_1, z_2); by the definition of P, the blue and green lines together correspond to P.

Then, π_V(g_{ℓ*}(q)) = 1, as a ray from g_{ℓ*}(q) in the downward direction (the blue arrow in Figure 3(b)) properly intersects V only once, at some point in L(g_{ℓ*}(p_2), g_{ℓ*}(1)), under the assumption that ‖f* − f‖_∞ ≤ 1/100. By the property of π_V, this implies that the ray R starting from g_{ℓ*}(q) and passing through v (e.g., the red arrow in Figures 3(b) and 3(c)) must properly intersect V an odd number of times. Furthermore, from the construction of U and V, the definition of ℓ*, and the assumption that ‖f* − f‖_∞ ≤ 1/100, one can observe that the simple curve in U from z_1 to z_2 that is not contained in B (i.e., U ∖ P) can intersect B only within the ℓ_∞ balls of radius 2/100 centered at the points (−1, 1) and (1, 1). This is because if U ∖ P intersected B outside these ℓ_∞ balls, then, by the definition of ℓ*, the network could not make further modifications in B, contradicting the approximation assumption ‖f* − f‖_∞ ≤ 1/100. In other words, all proper intersections between U and R are identical to those between V and R. This implies that π_U(g_{ℓ*}(q)) = 1, and hence g_{ℓ*}(q) is in the bounded path-component of ℝ² ∖ U. This completes the proof of the claim (11) and therefore the proof of Lemma 7.

B.7 PROOF OF LEMMA 21

Suppose that g_{ℓ*}([0, p_1]) ∩ g_{ℓ*}([p_2, 1]) = ∅ and g_{ℓ*}([0, p_1]) ∖ B is contained in a bounded path-component of ℝ² ∖ U for some U ⊆ g_{ℓ*}([p_2, 1]) ∪ B. If g_{ℓ*}([0, p_1]) ∖ B is path-connected, then the statement of Lemma 21 directly follows. Hence, suppose that g_{ℓ*}([0, p_1]) ∖ B has more than one path-component. To aid the proof, we introduce the following lemma.

Lemma 23. If g_{ℓ*}(p) ∈ B for some p ∈ [0, 1], then f(p) = g_{ℓ*}(p).

Proof of Lemma 23. Suppose that f(p) ≠ g_{ℓ*}(p). Then, φ_ℓ^{−1} ∘ σ ∘ φ_ℓ(g_{ℓ*}(p)) ≠ g_{ℓ*}(p) for some ℓ > ℓ*. By Lemma 5, there exist a_1, a_2 ∈ ℝ² and b_1, b_2 ∈ ℝ such that φ_ℓ^{−1} ∘ σ ∘ φ_ℓ(x) = x if and only if ⟨a_1, x⟩ + b_1 ≥ 0 and ⟨a_2, x⟩ + b_2 ≥ 0. Without loss of generality, we assume that ⟨a_1, g_{ℓ*}(p)⟩ + b_1 < 0. Since g_{ℓ*}(p) ∈ B, there exists z ∈ B such that ⟨a_1, z⟩ + b_1 < 0, i.e., φ_ℓ^{−1} ∘ σ ∘ φ_ℓ(z) ≠ z, which contradicts the definition of ℓ* by Lemma 5. This completes the proof of Lemma 23.

By Lemma 23 and the assumption that ‖f* − f‖_∞ ≤ 1/100, g_{ℓ*}([0, p_1]) ∖ B can intersect ∂B only within the ℓ_∞ ball O of radius 2/100 centered at the point (0, 1). Hence, all path-components of g_{ℓ*}([0, p_1]) ∖ B intersect the line segment ∂B ∩ O. In other words, g_{ℓ*}([0, p_1]) ∖ B lies in a single path-component of ℝ² ∖ (g_{ℓ*}([p_2, 1]) ∪ B) unless g_{ℓ*}([p_2, 1]) intersects ∂B ∩ O. However, by Lemma 23 and the assumption that ‖f* − f‖_∞ ≤ 1/100, g_{ℓ*}([p_2, 1]) must not intersect ∂B ∩ O. This completes the proof of Lemma 21.
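For reference, the even-odd rule invoked in the proof of the claim (11) can be stated algorithmically. The following generic ray-casting sketch (our own minimal implementation, not tied to the specific polygon U above) decides membership in the bounded component of the complement of a simple polygon by counting crossings of a horizontal ray; for a ray avoiding the polygon's vertices, each counted crossing is a proper intersection in the sense defining π_U.

```python
def even_odd_inside(point, polygon):
    """Return True iff `point` lies in the bounded region enclosed by the
    simple polygon `polygon` (a list of (x, y) vertices), via the even-odd
    rule: cast a horizontal ray to the right and count edge crossings."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through `point`,
        # strictly to the right of `point`?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

# A rectangle through the four vertices (+-2, +-1) mentioned above.
rect = [(-2.0, -1.0), (2.0, -1.0), (2.0, 1.0), (-2.0, 1.0)]
print(even_odd_inside((0.0, 0.5), rect))   # True: odd number of crossings
print(even_odd_inside((3.0, 0.5), rect))   # False: even number of crossings
```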