Characterizing ResNet's Universal Approximation Capability

Chenghao Liu 1, Enming Liang 1, Minghua Chen 1

1School of Data Science, City University of Hong Kong. Correspondence to: Minghua Chen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Since its debut in 2016, ResNet has become arguably the most favorable architecture in deep neural network (DNN) design. It effectively addresses the gradient vanishing/exploding issue in DNN training, allowing engineers to fully unleash DNNs' potential in tackling challenging problems in various domains. Despite its practical success, an essential theoretical question remains largely open: how well/best can ResNet approximate functions? In this paper, we answer this question for several important function classes, including polynomials and smooth functions. In particular, we show that ResNet with constant width can approximate any Lipschitz continuous function with Lipschitz constant $\mu$ using $O\!\left(c(d)(\varepsilon/\mu)^{-d/2}\right)$ tunable weights, where $c(d)$ is a constant depending on the input dimension $d$ and $\varepsilon > 0$ is the target approximation error. Further, we extend this result to Lebesgue-integrable functions, with the upper bound characterized by the modulus of continuity. These results indicate a factor-of-$d$ reduction in the number of tunable weights compared with the classical results for ReLU networks. Our results are also order-optimal in $\varepsilon$, thus achieving the optimal approximation rate, as they match a generalized lower bound derived in this paper. This work adds to the theoretical justifications for ResNet's stellar practical performance.

1. Introduction

One trend in deep learning in the past decade is the use of larger and deeper neural networks to process higher-dimensional data. However, for deep neural networks, training is notoriously difficult due to gradient vanishing (Glorot & Bengio, 2010). In 2016, the emergence of ResNet (He et al., 2016) addressed the issue of gradient vanishing or exploding encountered during neural network training, and ResNet has achieved outstanding performance in applications. The practical success of ResNet naturally leads to an essential theoretical question: how well can ResNet approximate functions? Along this line, a milestone result in (Lin & Jegelka, 2018) shows the universal approximation capability of ResNet (even with one neuron per layer): it can approximate any Lebesgue-integrable function arbitrarily well as the number of tunable weights goes to infinity. This result gives theoretical justification to using ResNet to approximate general functions and spurs a number of follow-up studies. For example, the authors in (Oono & Suzuki, 2019) explore the approximation capabilities of ResNet-type convolutional neural networks, establishing that they can approximate smooth functions with a comparable quantity of tunable weights as their feed-forward ReLU network counterparts. However, the question of whether this represents the optimal use of ResNet's resources remains unanswered. We delve deeper into this and other related research in Section 1.1.

Despite these exciting results, it remains largely open today to fully characterize the universal approximation capability of ResNet. That is, how many tunable weights are needed for ResNet with optimized structures to approximate a function up to an error $\varepsilon$?
In this paper, we seek answers to the above question by developing upper/lower bounds on the number of tunable weights needed for ResNet with constant width to approximate popular classes of functions. We summarize our contributions as follows.

In Sec. 3, we explicitly establish the relationship between ResNet and feedforward networks (FNNs) (see Proposition 1). We show that a ResNet can be viewed as an FNN, and we thus derive lower bounds on the number of tunable parameters needed for ResNet to approximate various classes of functions.

In Sec. 4, we show that ResNet with constant width, by leveraging specific tunable weights, can approximate functions in various function classes. These include monomials of degree $p$, requiring $O(p\log(p/\varepsilon))$ weights; polynomials of degree $p$, needing $O(p\cdot\#\text{terms}\cdot\log(p/\varepsilon))$ weights¹; and smooth functions differentiable up to degree $r$, needing $O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$ weights². In addition, we show that ResNet with one neuron per hidden layer can generate any continuous piecewise linear function. Last but not least, we derive a tight upper bound on the number of tunable parameters of ResNet for approximating Lebesgue-integrable functions over $[0,1]^d$, namely $O_d\!\left(\omega_f^{-1}(\varepsilon)^{-d/2}\right)$, where $\omega_f(t) := \sup\{|f(x)-f(y)| : \|x-y\|_2 \le t\}$ is the modulus of continuity and $\omega_f^{-1}(r) := \sup\{t : \omega_f(t) \le r\}$. Moreover, if $f$ is Lipschitz continuous with Lipschitz constant $\mu$, the upper bound becomes $O\!\left(c(d)(\varepsilon/\mu)^{-d/2}\right)$. Besides, our bounds are explicit in all related parameters, including the input dimension, the target function space, and the desired accuracy.

¹The notation # is an abbreviation of "number", and #terms refers to the number of terms in the polynomial.

Table 1. A summary of existing and our results on the approximation rate of the ResNet structure. The upper bound refers to the number of parameters needed by the network architecture to approximate any function in the target function space to an error $\varepsilon$.

| Paper | Function Space | Network Architecture | Upper Bound | Optimality |
| --- | --- | --- | --- | --- |
| (Lin & Jegelka, 2018) | Lipschitz continuous functions over $[0,1]^d$ | ResNet with one neuron per layer | $O_d(\varepsilon^{-d})$ | Suboptimal |
| (Oono & Suzuki, 2019) | Lipschitz continuous functions over $[0,1]^d$ | ResNet CNN with $O(1)$ channels | $O_d(\varepsilon^{-d})$ | Suboptimal |
| Ours (Theorem 7) | Lipschitz continuous functions over $[0,1]^d$ | ResNet with constant (= 4) width | $O_d(\varepsilon^{-d/2})$ | Optimal |

Note that (Lin & Jegelka, 2018) and our Theorem 7 consider approximating Lebesgue-integrable functions and (Oono & Suzuki, 2019) considers the Hölder class; all of them include the Lipschitz functions used for the comparison. The bound is optimal in terms of the entropy limitation; one could refer to Theorem 3 of (Yarotsky, 2017) or (Yarotsky & Zhevnerchuk, 2020). However, the upper bound is not optimal in terms of the Vapnik-Chervonenkis (VC) dimension (Shen et al., 2022b; Siegel, 2023). Our derived rate is explicit in all input-related parameters such as $d$.

We highlight our results in Theorem 7, which achieves the optimal upper bound on the number of parameters of ResNet for approximating Lebesgue-integrable functions, even by ResNet with constant width. This is a non-trivial extension of the work (Lin & Jegelka, 2018), which shows that for a Lebesgue-integrable function $f$ over $[0,1]^d$, there exists a ResNet $R$ with one neuron per layer and not more than $O\!\left(\omega_f^{-1}(\varepsilon)^{-d}\right)$ parameters such that $\|f - R\| \le \varepsilon$.
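As a quick illustration of the quantities in the bounds above (ours, not from the paper), the following NumPy snippet estimates the modulus of continuity $\omega_f$ of a sample Lipschitz function by Monte-Carlo sampling and compares it with the Lipschitz bound $\mu t$; this is exactly the step that turns the general bound $O_d(\omega_f^{-1}(\varepsilon)^{-d/2})$ into the Lipschitz bound $O(c(d)(\varepsilon/\mu)^{-d/2})$. The sample function and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu = 2, 3.0

def f(x):
    # A sample function with Lipschitz constant at most mu on [0,1]^d.
    return mu * np.sin(x[..., 0]) * np.cos(x[..., 1])

def omega(t, n=200_000):
    # Monte-Carlo estimate of omega_f(t) = sup{|f(x)-f(y)| : ||x-y||_2 <= t}.
    x = rng.random((n, d))
    u = rng.normal(size=(n, d))
    u *= (t * rng.random((n, 1))) / np.linalg.norm(u, axis=1, keepdims=True)
    y = np.clip(x + u, 0.0, 1.0)
    return np.abs(f(x) - f(y)).max()

for t in (0.1, 0.2, 0.4):
    print(f"omega_f({t}) ~= {omega(t):.3f}   (Lipschitz bound mu*t = {mu*t:.3f})")
# For a mu-Lipschitz f we have omega_f(t) <= mu*t, hence omega_f^{-1}(eps) >= eps/mu,
# and the O_d(omega_f^{-1}(eps)^{-d/2}) weight count is at most O(c(d)(eps/mu)^{-d/2}).
```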
Our results further establish that Res Net with constant width can approximate any Lebesgue-integrable function over [0, 1]d to an error ε with an ε-order optimal upper bound O(ω 1 f (ε) d/2) of the number of parameters needed. We summarize the comparison in Table 1. These findings add to the theoretical justifications for Res Net s outstanding practical performance, and shed light on analysis for further research on NN design optimization. 2The notation a(ε) = O(g(ε)) means a(ε) Cg(ε) for sufficiently small ε where C is a constant independent of ε. Importantly, throughout this paper, we employ the notation Od( ) to underscore the hidden constant C depending on d. 1.1. Related Work In recent years, the expressive capabilities of various neural network architectures have garnered increased attention, spurred by their remarkable and noteworthy successes. In this subsection, we will discuss previous research on the topic through the lens of approximation theory. Universality. The universality, i.e., universal approximation property, of a function family implies that this family is dense in the space of continuous functions, meaning it can approximate any continuous function to an arbitrary precision. In the earlier years, (Cybenko, 1989; Hornik, 1991; Leshno et al., 1993; Pinkus, 1999) made a groundbreaking argument by demonstrating that shallow neural networks equipped with suitable activation functions such as sigmoid, non-polynomial functions, possess universal approximation properties. In recent years, the universality of narrow deep networks has also attracted considerable attention. (Hanin & Sellke, 2017) determined that a deep Re LU neural network must have a minimum width of d + 1 to ensure universality, where d is the input dimension. (Kidger & Lyons, 2020; Park et al., 2020; Cai, 2022) later showed that deep narrow networks with reasonable activation functions can achieve universality, providing the minimum width of neural networks for achieving the universal approximation property. Over the past decades, a variety of network architectures have been developed to cater to diverse tasks and objectives, extending beyond feedforward Re LU networks. The universal approximation theorem has been shown for multiple network architectures, including standard Re LU convolutional neural networks (CNNs) (Zhou, 2018; 2020), 2D Re LU CNNs with classical structures (He et al., 2022), continuous-time recurrent neural network (RNN) (Li et al., 2020; 2022b), Res Net (Lin & Jegelka, 2018; Li et al., 2022a), and Transformers (Yun et al., 2019). Characterizing Res Net s Universal Approximation Capability Approximation Capabilities. There has been substantial progress in enhancing our theoretical understanding of neural networks. Some studies have focused on comparing the expressive power of both shallow and deep neural networks, examining their respective capabilities (e.g., (Arora et al., 2016; Eldan & Shamir, 2016; Liang & Srikant, 2016; Telgarsky, 2016; Yarotsky, 2017; Poggio et al., 2017)). Some others have quantified the number of linear regions within deep neural networks, casting light on their complexity (e.g., (Montufar et al., 2014; Serra et al., 2018; Arora et al., 2016)). Besides, constructive methods have been utilized to probe the approximation capabilities across different function classes. 
Notably, researchers have delved into the optimal approximation of continuous functions (e.g., (Shen et al., 2022b; Yarotsky, 2018)), the optimal approximation of smooth functions (e.g., (Yarotsky, 2017; Lu et al., 2021; Montanelli & Du, 2019)), and the approximation of analytic functions (e.g., (Wang et al., 2018; Schwab & Zech, 2021)). Recently, there is some interesting work on the special network architecture like parameters sharing (Zhang et al., 2023) and nested network (Zhang et al., 2022). These diverse investigations collectively deepen our understanding of the potential and constraints of neural networks in approximating various functions. Perspectives on the Curse of Dimensionality. The curse of dimensionality coined by (Bellman, 1957) refers to a phenomenon that a model class will suffer an exponential increase in its complexity as the input dimension increases. Importantly, the curse of dimensionality, not limited to FNNs, is also a challenge for almost all classes of function approximators due to the entropy limitation (Kolmogorov & Tikhomirov, 1959). More specifically, any continuous function approximator (refer to Ssec. 2.2 for detailed explanations) will suffer the curse of dimension in the H older space Cr([0, 1]d) (De Vore et al., 1989) because the metric entropy in Cr([0, 1]d) with respect to the uniform topology is Θ(ε d/r). The property is applied to Re LU neural networks in Thm. 3 (Yarotsky, 2017). In an attempt to mitigate the curse of dimensionality, initial strategies involved the consideration of smaller function classes whose metric entropy is expected to reduce such as analytical functions (Wang et al., 2018), bandlimited functions (Montanelli et al., 2019), Korobove space (Montanelli & Du, 2019), a space derived by Kolmogorov Superposition Theorem(KST) (Lai & Shen, 2021; He, 2023). More recently, researchers have shifted their focus toward the structure of neural networks, suggesting a potential solution to circumvent the curse of dimensionality. Simultaneously, a more recent trend aims to serve neural networks as discontinuous function approximators, thereby examining neural networks with novel activation functions (e.g. (Shen et al., 2020; Jiao et al., 2023; Shen et al., 2021; 2022a)). However, the failure of these model classes in practice is due to the discontinuity of the function approximators, wherein even minor perturbations in the training data can lead to chaotic changes in the input-output relationship. Consequently, to circumvent the curse of dimensionality, it is imperative to make appropriate choices within the unstable model class and the restricted objective function space. Res Net Architecture. Since Res Net s first appearance, it has made a big success in practice and has become a core component of popular structures like Transformer (Vaswani et al., 2017), inspiring much research for theoretical understanding. (Hardt & Ma, 2016) show that Res Net can represent any classifier on any finite sample perfectly, i.e., it can represent any discrete function. Later, (Lin & Jegelka, 2018) extended this discrete setting to continuous, which shows that Res Net with one neuron per layer can approximate any Lebesgue-integrable function. The authors in (Oono & Suzuki, 2019) derive approximation and estimation error rates for Res Net-type CNN using O(1) channels. They show a block-sparse FNN can be realized by a Res Net-type CNN and then study the approximation capabilities of blocksparse FNNs. Our paper differs in this regard. 
We show that ResNet can be implemented by ReLU FNNs (Prop. 1), and we utilize this relationship to establish lower bounds on the approximation capability of ResNets. We establish the upper bounds by directly constructing ResNets to approximate different function classes. Since CNNs and FNNs are different structures, the results in (Oono & Suzuki, 2019) cannot be translated to ours. Notably, while they show that ResNet-type CNNs with not more than $O(\varepsilon^{-d})$ parameters can approximate Lipschitz functions to an error $\varepsilon$, our Theorem 7 shows that $O(\varepsilon^{-d/2})$ parameters are adequate using ResNet with constant width. We summarize the comparison of the results in Table 1. Recently, (He et al., 2022) studied the approximation properties of CNNs and showed the universal approximation property of shallow ResNet-type CNNs with a large number of channels.

2. Mathematical Modeling of ResNet and Proof Ideas

2.1. Mathematical modeling of ResNet

ResNet's original proposal (He et al., 2016) includes complex structures such as convolutional layers. In this paper, we focus on ResNet-type FNNs, as this is a fundamental structure and is enough to achieve strong approximation capability; unless otherwise specified, we refer to ResNet-type FNNs simply as ResNet. A ResNet consists of residual blocks and identity shortcut connections. Specifically, with the basic notations of addition $(f+g)(x) = f(x)+g(x)$ and composition $f\circ g(x) = f(g(x))$ of mappings $f, g$, a ResNet is a function $R(x)$ from $\mathbb{R}^d$ to $\mathbb{R}$ given by

$$R(x) = \mathcal{L} \circ (T^{[L]} + \mathrm{Id}) \circ \cdots \circ (T^{[1]} + \mathrm{Id}) \circ A_Q(x), \qquad (1)$$

where $A_Q : \mathbb{R}^d \to \mathbb{R}^Q$ and $\mathcal{L} : \mathbb{R}^Q \to \mathbb{R}$ are affine transformations, $\mathrm{Id} : z \mapsto z$ is the identity mapping, and $T^{[i]}$ ($i = 1, 2, \dots, L$) are basic residual blocks. Each block $T^{[i]} : \mathbb{R}^Q \to \mathbb{R}^Q$ further consists of two layers: an activation layer $\mathbb{R}^Q \to \mathbb{R}^{n_i} : z \mapsto \sigma(W_i z + b_i)$ with $n_i$ neurons, and a linear layer $\mathbb{R}^{n_i} \to \mathbb{R}^Q : \sigma(W_i z + b_i) \mapsto V_i\sigma(W_i z + b_i)$ with $Q$ neurons, where the parameters are $W_i \in \mathbb{R}^{n_i \times Q}$, $V_i \in \mathbb{R}^{Q \times n_i}$, $b_i \in \mathbb{R}^{n_i}$ ($i = 1, 2, \dots, L$), and $\sigma(\cdot)$ is the ReLU activation function applied elementwise, i.e., $\sigma(x_1, x_2, \dots, x_n) = (\max\{0, x_1\}, \dots, \max\{0, x_n\})$ for $x_1, x_2, \dots, x_n \in \mathbb{R}$. To this end, we write $T^{[i]}(z) = V_i\sigma(W_i z + b_i)$.

Figure 1. An example of a ResNet from $\mathbb{R}^2$ to $\mathbb{R}$ belonging to $\mathcal{RN}(Q = 3, N = 2, L = 3)$. Every residual block is composed of an activation layer followed by a linear layer. The activation layer neurons are colored yellow, while the linear layer neurons are grey. The maximum number of neurons in each activation layer is 2, each linear layer has 3 neurons, and there are 3 blocks.

We use $L$ to denote ResNet's depth, defined as the number of residual blocks³, and its width $N$ is defined as the maximum number of activation layer neurons, i.e., $\max\{n_1, n_2, \dots, n_L\}$. For conciseness of notation, we denote by $\mathcal{RN}(Q, N, L)$ the set of ResNet functions from $\mathbb{R}^d$ to $\mathbb{R}$ with width $N$, depth $L$, and $Q$ neurons in each linear layer. To keep our study interesting, we always assume $Q \ge d$, where $d$ is the input dimension; otherwise, the universality of the ResNet does not hold⁴. For our later discussion, we call a ResNet a ResNet with bottleneck blocks (b-ResNet) if its width is an absolute constant, i.e., it belongs to $\mathcal{RN}(Q, N = C, L)$ where $C$ is an absolute constant (independent of the input dimension $d$). We have an example of a ResNet structure in Fig. 1.

³Note that we define the depth as the number of residual blocks because the activation and linear layers are always coupled together within each block. This definition differs from the one commonly used in applications, where the depth of a ResNet refers to the number of layers excluding the shortcuts; however, this difference does not affect the presentation of our results.

⁴If $Q < d$, a ResNet with one neuron per activation layer belongs to the set of narrow networks with width smaller than or equal to $d$ (Prop. 1). Therefore, the ResNet may lose universality, since narrow ReLU networks (with width smaller than $d+1$) cannot approximate all continuous functions on $\mathbb{R}^d$ (Hanin & Sellke, 2017).
2.2. Problem statement

In this paper, we care about the number of parameters of a ResNet needed for approximating a given function, which characterizes ResNet's approximation capabilities. Specifically, we give the following problem statement. Given $f : [0,1]^d \to \mathbb{R}$ belonging to some function space $\mathcal{F}$, and fixing $d \in \mathbb{N}^+$ and the ResNet model $\mathcal{H} = \mathcal{RN}(Q, N, L)$, we are interested in the following two questions.

Lower bound. What is the minimum number of weights of a ResNet required to approximate any $f \in \mathcal{F}$ to an error $\varepsilon$?

Upper bound. What number of weights of a ResNet architecture is sufficient to approximate any $f \in \mathcal{F}$ to an error $\varepsilon$? In other words, do there exist $Q, N, L$ such that $\inf_{\tilde f \in \mathcal{H}} \|f - \tilde f\| \le \varepsilon$ holds with a relatively small number of tunable parameters?

Here, tunable parameters refer to the non-zero parameters in a ResNet. Note that the answer to the two questions usually depends on the desired error $\varepsilon$, the input dimension $d$, and the function space $\mathcal{F}$. We usually consider the continuous function space $C([0,1]^d)$ under the uniform norm $\|f\|_\infty = \max_{x\in[0,1]^d} |f(x)|$, and the $L^p$-integrable function space $L^p([0,1]^d)$ under the norm $\|f\|_{L^p([0,1]^d)} = \big(\int_{[0,1]^d} |f(x)|^p\,dx\big)^{1/p}$, where we always assume $p \in [1,\infty)$ unless otherwise specified.

Our main results show that b-ResNet, i.e., ResNet in $\mathcal{RN}(Q, N, L)$ with $Q = d + c_1$, $N = c_2$ ($c_1, c_2$ absolute constants) and proper $L$, achieves powerful approximation capability in various function classes. It can approximate specific functions with fewer tunable weights than FNNs. Besides, b-ResNet can achieve optimal approximation for Lebesgue-integrable functions.
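To make the model in Equation (1) concrete, here is a minimal NumPy sketch (our own illustration, with random placeholder weights rather than the tuned weights used in the proofs) of a forward pass for a function in $\mathcal{RN}(Q, N, L)$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class ResNetRN:
    """Forward pass of a function in RN(Q, N, L) as in Equation (1):
    R(x) = L o (T[L] + Id) o ... o (T[1] + Id) o A_Q(x),  T[i](z) = V_i relu(W_i z + b_i)."""
    def __init__(self, d, Q, N, L, seed=None):
        rng = np.random.default_rng(seed)
        # Initial affine map A_Q : R^d -> R^Q.
        self.W0, self.b0 = rng.standard_normal((Q, d)), rng.standard_normal(Q)
        # Residual blocks: activation layer (W_i, b_i) with N neurons, linear layer V_i back to R^Q.
        self.blocks = [(rng.standard_normal((N, Q)), rng.standard_normal(N),
                        rng.standard_normal((Q, N))) for _ in range(L)]
        # Final affine output layer L : R^Q -> R.
        self.B, self.c = rng.standard_normal((1, Q)), rng.standard_normal(1)

    def __call__(self, x):
        z = self.W0 @ x + self.b0                 # z[0] = A_Q(x)
        for W, b, V in self.blocks:
            z = z + V @ relu(W @ z + b)           # z[i] = T[i](z[i-1]) + z[i-1]
        return (self.B @ z + self.c).item()       # R(x) = L(z[L])

# A b-ResNet instance with Q = d + 1 linear-layer neurons and constant width N = 4.
d = 8
R = ResNetRN(d, Q=d + 1, N=4, L=10, seed=0)
print(R(np.random.default_rng(1).random(d)))
```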
Remarks on continuous/discontinuous approximators. Since the difference between continuous and discontinuous approximators will play an important role in the lower bounds, we provide some explanation here to make it easy to follow. In approximation theory, we aim to approximate all functions in a space $\mathcal{F}$ using a model class as an approximator (e.g., neural networks). We achieve this by choosing different parameters for different functions, meaning the parameters $\theta \in \Theta$ can be seen as a mapping of the target functions, i.e., $\theta = h(f)$ where $h : \mathcal{F} \to \Theta$. If this mapping $h$ is continuous, we refer to the approximator as a continuous approximator. Conversely, discontinuous deep networks are characterized by a discontinuous mapping $h$. In this paper, the term arbitrary/unconstrained deep networks refers to discontinuous approximators, and a construction in a continuous (discontinuous) phase means the constructed neural networks are continuous (discontinuous) approximators. Significantly, the approximation power of continuous approximators is limited by metric entropy, while that of discontinuous approximators is limited by the Vapnik-Chervonenkis (VC) dimension (Goldberg & Jerrum, 1993).

2.3. Proof Ideas and Novelty

Lower Bound. We establish that ResNet can be conceptualized as a sparse FNN (Prop. 1), implying that the lower bound for ResNet must exceed that for FNN. We then derive generalized lower bounds for ResNet on the approximation of various function classes from the lower bounds for FNNs.

Upper Bounds for Approximation of Polynomials and Smooth Functions. Drawing inspiration from the work of (Yarotsky, 2017), where DNNs are constructed to approximate the functions $x^2$ and $xy$, we construct ResNets to approximate these fundamental functions. Unlike (Yarotsky, 2017), which computes the product function via $xy = ((x+y)^2 - x^2 - y^2)/2$, we select $xy = ((x+y)/2)^2 - ((x-y)/2)^2$, as utilized in some previous work, e.g., (Suh et al., 2022). This selection is more efficient for the construction and leads to fewer parameters. Next, we note that any polynomial can be expressed as a composition of the product function $xy$, and polynomials can approximate smooth functions due to the local Taylor expansion property. By constructing ResNets to approximate these functions, we derive upper bounds on the approximation of polynomials and smooth functions. These constructions are in a continuous phase, which leads to Proposition 3 and Theorems 4 and 5. Our results show that b-ResNet is enough to approximate these smooth functions and has fewer tunable weights than FNN.

Optimal Approximation for Lebesgue-Integrable Functions. Building upon the constructive techniques from (Shen et al., 2019; 2022b) on ReLU FNNs, we show how b-ResNet can achieve optimal approximation for Lebesgue-integrable functions (refer to Table 2 for the high-level steps). The basic idea is to construct a ResNet with depth $O(L)$ that generates a step function with many more steps, achieving a higher approximation rate. In our construction, we utilize linear layers for value storage and activation layers for intermediate computation. In each block, we designate a constant number of neurons in the activation layer for the computation related to one of the corresponding neurons in the linear layer, with the outcomes refreshed via identity mappings in the next block. In distinct blocks, activation layer neurons perform computations associated with different neurons from the linear layer. This procedure is replicated for each residual block, which is then sequentially combined to form a b-ResNet exhibiting large expressive power with fewer parameters than FNNs. More details can be found in Appendix F.

Table 2. High-level steps of constructing a b-ResNet to achieve optimal approximation, where the framework of constructive methods is from (Shen et al., 2019; 2022b). Details are in Appendix F. Here $f$ is the target (Lipschitz) function from $[0,1]^d$ to $\mathbb{R}$, each $Q_\beta$ is a small hypercube with side length $O(L^{-2/d})$, $x_\beta$ is a representative point of the cube $Q_\beta$, and $\Omega$ is a small enough region.

Step 1: Space partition. $[0,1]^d \setminus \Omega = \bigcup_{\beta \in \{0,1,\dots,L^{2/d}-1\}^d} Q_\beta$.
Step 2: Construct a b-ResNet $\Phi$ of depth $O(L)$ such that $\Phi(x) = \beta$ for $x \in Q_\beta$.
Step 3: Construct a b-ResNet $\phi$ of depth $O(L)$ such that $\phi(\beta) \approx f(x_\beta)$.
Step 4: Error estimation. The constructed b-ResNet $R(x) = \phi\circ\Phi(x) \approx f(x_\beta) \approx f(x)$, with $|f(x_\beta) - f(x)| = O(L^{-2/d})$ and $|R(x) - f(x_\beta)| = O(L^{-2/d})$.

It is important to note that the FNN results in (Shen et al., 2022b) cannot be directly extended to ResNet. Our novelty lies in the construction and analysis of b-ResNet, as well as in the approximation power analysis of ResNet in general. Our proof utilizes a new construction of ResNet blocks for function approximation.
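The following NumPy sketch (ours, and purely illustrative; it uses an explicit step function rather than the paper's ResNet realization of it) numerically mimics Steps 1-4 of Table 2: cubes of side roughly $L^{-2/d}$ give an $O(L^{-2/d})$ error for a Lipschitz target. The choice of $d$, $L$, and target function is arbitrary.

```python
import numpy as np

d, L = 3, 64
K = int(np.ceil(L ** (2.0 / d)))          # cubes per dimension, K ~ L^{2/d}

def f(x):                                  # a 1-Lipschitz target (illustrative)
    return np.sin(x.sum(axis=-1)) / np.sqrt(d)

def Phi(x):                                # Step 2: x -> multi-index beta of its cube
    return np.minimum((x * K).astype(int), K - 1)

def phi(beta):                             # Step 3: beta -> f at the cube's representative
    x_beta = (beta + 0.5) / K              # (here: the cube center)
    return f(x_beta)

rng = np.random.default_rng(0)
x = rng.random((100_000, d))
# Step 4: error of the step function R(x) = phi(Phi(x)); the paper additionally
# excludes a small region Omega, which is not needed for this direct evaluation.
err = np.abs(phi(Phi(x)) - f(x)).max()
print(f"K = {K}, max error = {err:.4f}, Lipschitz bound ~ sqrt(d)/K = {np.sqrt(d)/K:.4f}")
```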
Our work is the first to show that ResNet, even with constant width, can achieve optimal approximation for Lebesgue-integrable functions. Furthermore, the role of identity mappings is greatly leveraged in the construction. We uncover the extensive expressive power of b-ResNet, providing theoretical guarantees and insights into the successful performance of ResNet.

3. Lower Bounds on ResNet's Function Approximation Capability

In this section, we build the explicit relation between ResNet and FNNs to establish lower bounds on the complexity of ResNet for approximating polynomials and other function classes such as smooth functions. The lower bounds help us show analytically that our upper bounds in Sec. 4 are optimal in terms of $\varepsilon$. The proofs of this section are postponed to Appendix B.

First, we propose a key argument: a ResNet can be regarded as a special sparse ReLU network.

Proposition 1. For any ResNet $R(x) \in \mathcal{RN}(Q, N, L)$ with input $x \in [0,1]^d$ and $W$ tunable parameters, there exists an equivalent ReLU FNN $\Phi(x) : [0,1]^d \to \mathbb{R}$ with width $N + Q$, depth $2L$, and $W + 2QL$ tunable parameters, such that $R(x) = \Phi(x)$ for all $x \in [0,1]^d$.

This proposition implies that if a ResNet can approximate a function up to an error $\varepsilon$, then an FNN of larger size can also approximate the same function up to $\varepsilon$. Thus one can bound ResNet's universal approximation capability by studying that of a larger-size FNN (which might have an increase in tunable parameters by a factor of $d$). That said, the lower bound on the complexity of ResNet must be larger than or equal to that of FNN in terms of $\varepsilon$ when approximating the same function. Building on the above discussion, lower bounds for FNNs in the existing literature can therefore be applied to ResNets in terms of $\varepsilon$, as we discuss in the following.

Regarding the smooth space $C^r([0,1]^d)$, (Yarotsky, 2017) established lower bounds of $\Theta_r(\varepsilon^{-d/r})$⁵ for continuous ReLU network approximators and $\Theta_{r,d}(\varepsilon^{-d/(2r)})$ for unconstrained deep ReLU networks. Later, (Yarotsky, 2018) demonstrated optimal approximation error rates of $O_d(\omega_f(W^{-1/d}))$ for continuous ReLU network approximators and $O_d(\omega_f(W^{-2/d}))$ for unconstrained deep ReLU networks, where $W$ represents the number of tunable weights. If $f$ is Lipschitz continuous, the optimal upper bound for the ReLU FNN is $O_d(\varepsilon^{-d})$ for continuous weight selection and $O_d(\varepsilon^{-d/2})$ for unconstrained deep networks.⁶ We will return to these discussions in the next section.

⁵Here $f(\varepsilon) = \Theta_d(g(\varepsilon))$ means $f(\varepsilon) \ge Cg(\varepsilon)$ for sufficiently small $\varepsilon$, where $C$ is a constant that does not depend on $\varepsilon$ but can depend on $d$. The subscript $d$ on $\Theta$ emphasizes that the constant $C$ may depend on $d$.

⁶As pointed out in Sec. 2.2, the approximation power of continuous approximators is limited by metric entropy, while for discontinuous approximators it is limited by the VC dimension.

At the end of this section, we focus on the lower bound for the polynomial function space (a smaller space compared with continuous and smooth functions). The aforementioned lower bounds are not applicable here, and such a bound is not available in the existing literature. Below we give the lower bound on the complexity of ResNet for the approximation of polynomials; the result will be used in the next section to show that the upper bound in Thm. 4 is $\varepsilon$-order optimal. We begin with some notation. Let $x \in \mathbb{R}^d$ and $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_d)$ where $\alpha_i \in \mathbb{N}$. Define $x^\alpha = x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$; this is called a monomial.
The degree of the monomial is $|\alpha| := \alpha_1 + \alpha_2 + \cdots + \alpha_d$. A multivariate polynomial is then a sum of several monomials, and its degree is the highest degree among these monomials. We give the theorem below.

Theorem 2. Let $x = [x_1, x_2, \dots, x_d] \in [0,1]^d$ and let $\mathcal{P}(d, p)$ ($p \ge d$) be the set of $d$-dimensional polynomial functions of degree $p$. If a ReLU FNN $\Psi(x) : [0,1]^d \to \mathbb{R}$ with width $N$, depth $L$, and $T = NL$ neurons can approximate any $f \in \mathcal{P}(d, p)$ to an error $\varepsilon$, i.e., $|\Psi(x) - f(x)| < \varepsilon$ for all $x\in[0,1]^d$, then we have $T \ge \Theta_d(\log(1/\varepsilon))$.

Note that this lower bound can also be applied to ResNet. When we compare Theorem 2 with prior findings, we observe that the lower bound for the polynomial function space, in terms of $\varepsilon$, is notably smaller than the lower bound $\Theta_r(\varepsilon^{-d/r})$ established for smooth functions $C^r([0,1]^d)$. This is because polynomial functions constitute a substantially smaller subset of the smooth function space; consequently, they exhibit a reduced complexity in approximation, reflecting the inherent simplicity of their structure.

4. Upper Bounds on ResNet's Function Approximation Capability

In this section, we characterize ResNet's approximation capability by establishing upper bounds for important function classes. Moreover, our results show that b-ResNet can approximate polynomials and smooth functions with fewer tunable parameters than FNNs. This section is organized as follows. Subsection 4.1 presents upper bounds on the complexity of ResNet for approximating monomials. Subsection 4.2 then extends the results to polynomials and to smooth functions in Sobolev space, following the work of (Yarotsky, 2017). Subsection 4.3 discusses the properties of ResNet for approximating continuous piecewise linear functions. Last but not least, Subsection 4.4 shows that b-ResNet can achieve optimal approximation for Lebesgue-integrable functions.

4.1. Approximating Monomials

Our first key result shows that b-ResNet $\mathcal{RN}(Q = d + O(1), N = C, L)$ can approximate any monomial, where $L$ depends on $d$ and the desired error $\varepsilon$, and $C$ is an absolute constant independent of $d$. Importantly, this result explains why b-ResNet can approximate polynomials and smooth functions with fewer tunable parameters. The proofs of this section can be found in Appendix C.

Proposition 3. Let $x = [x_1, x_2, \dots, x_d] \in [-M, M]^d$ and $\alpha \in \mathbb{N}^d$, where $M \ge 1$, and let $x^\alpha$ be any given monomial of degree $p$, i.e., $|\alpha| = p$. Then there exists a b-ResNet $R \in \mathcal{RN}\!\left(d+3,\, 4,\, O(p\log(p/\varepsilon) + p^2\log M)\right)$ such that $\|R - x^\alpha\|_{C([-M,M]^d)} < \varepsilon$, while having $O\!\left(p\log(p/\varepsilon) + p^2\log M\right)$ tunable weights.

Note that the total number of weights of $R$ is $O(dp\log(p/\varepsilon))$ when $M = 1$. However, our constructive proof shows that each residual block only contains a constant number of non-zero weights; therefore, it suffices to adjust $O(p\log(p/\varepsilon))$ weights. This analysis also holds when $M > 1$. Furthermore, it is important to note that the upper bound in the above proposition is independent of $d$. In fact, the dimension $d$ is incorporated into $p$, because any monomial of degree $p$ can always be interpreted as a product function of dimension $p$. For instance, $x_1^2 x_2 x_3^2 = \pi(x_1, x_1, x_2, x_3, x_3)$, where $\pi(x_1, \dots, x_d) = x_1 x_2 \cdots x_d$ is the product function. Next, we highlight several key observations and discussions from the analysis and results.
ResNet vs FNNs. We show that ResNet is capable of approximating any monomial of degree $p$ on $[0,1]^d$ with $O(p\log(p/\varepsilon))$ tunable weights, a reduction by a factor of $d$ compared to ReLU FNNs. According to (DeVore et al., 2021), a ReLU network with width $O(d)$ and depth $O(p\log(p/\varepsilon))$ can approximate any monomial of degree $p$, resulting in a total weight count of $O(d^2 p\log(p/\varepsilon))$; according to their construction, there are $O(dp\log(p/\varepsilon))$ tunable weights. Thus, our result is a reduction by a factor of $d$. As a closing remark, while we compare the upper bounds here, it is an open and interesting direction to obtain lower bounds on the size of neural networks with respect to the input dimension $d$.

Root of reduction. Note that each identity mapping can be realized by $2d$ ReLU units (via $x = [x]_+ - [-x]_+$), but this alone may only lead to an additive reduction of $d$ tunable weights compared with an FNN. In our study, however, this additive reduction, as seen in our b-ResNet model, does translate into a multiplicative reduction by a factor of $d$. We establish this result by constructively proving that a b-ResNet with a constant number of tunable weights per residual block can approximate functions with the same accuracy as a ReLU FNN requiring $O(d)$ tunable weights in each layer. The high-level idea is the following. One can construct an FNN with width $d + O(1)$ that approximates a given function and, in each layer, uses $d$ neurons to store the value of the input. Hence, if we have an identity mapping, we can remove these $d$ neurons from each layer, with the identity mapping taking over the role of forwarding the input value (or input-related values). This yields a reduction by a factor of $d$.

Deep vs Shallow. The author of (Shapira, 2023) provides a lower bound on the complexity of shallow FNNs for approximating any non-normalized monomial over $[-M, M]^d$, which scales exponentially with $d$ (refer to Thm. 3 of (Shapira, 2023)). By Proposition 1, this lower bound also applies to shallow ResNets. Conversely, Proposition 3 gives a mild upper bound for deep ResNets which scales polynomially with $d$. This underscores the benefits of deep networks.

4.2. Approximating Polynomials and Smooth Functions

Polynomials are sums of monomials, and a smooth function can be approximated by a polynomial via the local Taylor expansion. In this subsection, we present the upper bounds for the approximation of polynomials (Thm. 4) and smooth functions (Thm. 5). The proofs of the two theorems can be found in Appendices C and D.

Theorem 4. Let $x = [x_1, x_2, \dots, x_d] \in [0,1]^d$. For a multivariate polynomial $P(x)$ of degree $p$, i.e., $P(x) = \sum_{\alpha\in E} c_\alpha x^\alpha$ where $E = \{\alpha\in\mathbb{N}^d : |\alpha| \le p\}$, there exists a ResNet $R \in \mathcal{RN}\!\left(d+4,\, 4,\, O(p|E|\log(p/\varepsilon))\right)$ such that $|R(x) - P(x)| < \varepsilon \sum_{\alpha\in E} |c_\alpha|$ for all $x\in[0,1]^d$. Additionally, this ResNet has $O(p|E|\log(p/\varepsilon))$ tunable weights.

Note that $|E| \le \binom{p+d}{p}$, which can grow exponentially in $d$ when $p$ or $d$ is very large. Nonetheless, this upper bound is optimal in terms of $\varepsilon$ according to Thm. 2.
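To see where the $p$-dependence in Proposition 3 and Theorem 4 comes from, the following small Python illustration (ours, only a counting sketch rather than the paper's construction) views a degree-$p$ monomial as the product function applied to $p$ repeated coordinates and pairs the factors in a balanced binary tree: this uses $p-1$ pairwise products over $\lceil\log_2 p\rceil$ levels, and each pairwise product is, in the paper's approach, realized by the approximate multiplier built from the squaring network of Section 2.3.

```python
from math import ceil, comb, log2

def product_tree(factors):
    """Pair up factors in a balanced binary tree; return (#pairwise products, depth)."""
    n, mults, depth = len(factors), 0, 0
    while n > 1:
        mults += n // 2
        n = (n + 1) // 2
        depth += 1
    return mults, depth

# x1^2 * x2 * x3^2 = pi(x1, x1, x2, x3, x3): a degree-5 monomial is the product
# function applied to p = 5 (repeated) coordinates.
alpha = (2, 1, 2)
factors = [f"x{i+1}" for i, a in enumerate(alpha) for _ in range(a)]
p = len(factors)
print(product_tree(factors), (p - 1, ceil(log2(p))))   # both print (4, 3)

# Theorem 4 sums such monomial sub-networks over the |E| terms of a degree-p
# polynomial in d variables, with |E| <= C(p+d, p):
print(comb(5 + 3, 5))   # |E| bound for d = 3, p = 5  ->  56
```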
The Sobolev space $W^{r,\infty}([0,1]^d)$ is the set of functions belonging to $C^{r-1}([0,1]^d)$ whose $(r-1)$-th order derivatives are Lipschitz continuous. Further definitions can be found in Appendix D. We give the upper bound on ResNet's complexity in the following theorem.

Theorem 5. Fix $r, d \in \mathbb{N}^+$. There is a ResNet $R(x) \in \mathcal{RN}(d+4, N = 4, L)$ that can approximate any function from the unit ball of $W^{r,\infty}([0,1]^d)$ to an error $\varepsilon$, where $L = O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$.

The number of tunable parameters is still less than that of an FNN by a factor of $d$, based on the polynomial approximation methods. However, the hidden constant $c(d,r)$ in $O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$ is very large in $d$: the estimate given in the proof grows exponentially in $d$, with the lower and upper estimates of $c(d,r)$ separated by an additional factor $d^{r+2}(d+r)^r$. As discussed before, (Yarotsky, 2017) established the lower bound $\Theta_r(\varepsilon^{-d/r})$ on the number of tunable weights for continuous ReLU network approximators on the approximation of the Sobolev space $W^{r,\infty}([0,1]^d)$. Based on Prop. 1, this lower bound also applies to ResNet. Thus, the upper bound in Theorem 5 is nearly tight, up to a log factor, for ResNet. Besides, it is noteworthy that recent results (Yang & Zhou, 2024) show that shallow ReLU neural networks can also reach this upper bound when $r < (d+3)/2$.

4.3. Representing Continuous Piecewise Linear Functions

Piecewise linear interpolation holds a significant position in approximation theory, as it is a basic method of approximating functions. Therefore, studying the expressive power of neural networks for piecewise linear functions is particularly important. In this subsection, we show that ResNet, even with one neuron per activation layer, can generate any continuous piecewise linear (CPwL) function, as shown in the theorem below; the proof is in Appendix E.

Theorem 6. For any CPwL function $f : \mathbb{R}^d \to \mathbb{R}$, there exists a ResNet $R(x) \in \mathcal{RN}(d+1, 1, L)$ with $L = O(Md)$ that can exactly represent $f$, where $M$ is an $f$-dependent number.

Note that $M$ is implicit: it is an $f$-dependent number determined by properties of the input CPwL function $f$, including its number of pieces and linear components. More details about the representation of CPwL functions can be found in (Tarela & Martinez, 1999; Wang & Sun, 2005). Moreover, a recent work (Chen et al., 2022) derived a dimension-independent bound for ReLU networks when the number of pieces and linear components of the target CPwL function is known; their constructions are also possible for ResNet. It should be noted that Theorem 6 provides another approach to demonstrating the universal approximation property, as CPwL functions are dense in the space of continuous functions under the uniform norm. While (Lin & Jegelka, 2018) shows that ResNet with one neuron per activation layer can approximate any step function, the construction of Theorem 6 is much simpler.

4.4. Optimal Approximation for Lebesgue-Integrable Functions

In this subsection, we show that even b-ResNet can achieve optimal approximation for $L^p$-integrable functions. It is well known that $C([0,1]^d)$ is dense in $L^p([0,1]^d)$ for $p\in[1,\infty)$ under the $L^p$ norm. Thus, we just need to consider the approximation of $C([0,1]^d)$, as shown in the following theorem. The proof can be found in Appendix F.

Theorem 7. Let $d \in \mathbb{N}^+$ with $d \ge 5$ and $p \in [1,\infty)$. For any given continuous function $f \in C([0,1]^d)$ and any $L \in \mathbb{N}^+$, there exists a ResNet $R \in \mathcal{RN}(d+1,\, 4,\, 24L + 9d + 4)$ such that $\|f - R\|_{L^p([0,1]^d)} \le 7\sqrt{d}\,\omega_f(L^{-2/d})$.

As a direct corollary, the number of parameters needed for a ResNet to approximate the function $f$ to an error $\varepsilon$ is $O_d\!\left(\omega_f^{-1}(\varepsilon)^{-d/2}\right)$.
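A back-of-the-envelope use of Theorem 7 (ours; it assumes $f$ is $\mu$-Lipschitz so that $\omega_f(t) \le \mu t$, and it reads off only the architecture shape stated in the theorem) is sketched below: solving $7\sqrt{d}\,\mu L^{-2/d} \le \varepsilon$ for $L$ gives the required number of residual-block groups, and the network shape is $\mathcal{RN}(d+1, 4, 24L + 9d + 4)$, whose parameter count grows like $O_d(L)$ because each block has constant width.

```python
import math

def resnet_shape_for_lipschitz(d, mu, eps):
    # Theorem 7 with omega_f(t) <= mu*t:  7*sqrt(d)*mu*L^{-2/d} <= eps
    # <=> L >= (7*sqrt(d)*mu/eps)^{d/2}.  Theorem 7 requires d >= 5.
    L = math.ceil((7 * math.sqrt(d) * mu / eps) ** (d / 2))
    return {"Q": d + 1, "width": 4, "depth": 24 * L + 9 * d + 4, "L": L}

print(resnet_shape_for_lipschitz(d=5, mu=1.0, eps=0.5))
```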
To discuss the theorem, we have the following concluding remarks.

Optimality. As discussed in Sec. 3, the lower bound on the size of deep ReLU networks for approximating continuous functions over $[0,1]^d$ under the uniform norm is of order $\varepsilon^{-d/2}$. Moreover, this lower bound also holds for approximation under the $L^p$ norm, which follows from a recent result (Siegel, 2023). Hence, it follows from Proposition 1 that this lower bound also applies to ResNet, which shows that our upper bound in Theorem 7 is $\varepsilon$-order optimal.

Extension to Uniform Approximation. The optimal bound in Theorem 7 is extendable to the case of the uniform norm, as facilitated by Lemma 3.4 of (Lu et al., 2021). This extension, however, necessitates an increased width of the ResNet architecture. As such, it is non-trivial to demonstrate that bottleneck ResNets (b-ResNets) can maintain this optimal rate when approximating continuous functions under the uniform norm.

Extension to the Entire Domain $\mathbb{R}^d$. Note that for any function $f$ belonging to $L^p(\mathbb{R}^d)$ and an arbitrary error margin $\epsilon > 0$, there exists a continuous function $h$ that approximates $f$ to within $\epsilon$ and is supported on a compact set $H$, outside of which it is zero (i.e., $h(x) = 0$ for $x \in \mathbb{R}^d\setminus H$) (Walter, 1987). If we take $H$ to be contained within the hypercube $[-H, H]^d$, we can define a new function $g$ on the unit cube $[0,1]^d$ via the change of variables $g\!\left(\tfrac{x+H}{2H}\right) = h(x)$. The function $g$ is continuous by construction and adheres to the assumptions of Theorem 7.

Non-trivial Bound Extension. In this part, we clarify that our results are non-trivial and cannot be derived from the work (Yarotsky, 2018). It is important to note that his construction allows ReLU FNNs with a width of $2d+10$ to attain the approximation rate in our Theorem 7. However, within Yarotsky's framework, $d$ neurons per layer can be adapted to form an identity mapping. If one were to extend these bounds to ResNets in a straightforward manner, the conclusion would be that a ResNet with width $d + O(1)$ could achieve the mentioned rate. Our Theorem 7, on the other hand, demonstrates that a ResNet with a constant width, completely independent of $d$, is capable of achieving the same rate of approximation. This finding is interesting as it implies a reduction in the number of parameters by a factor of $d$ when compared to FNNs, underscoring the non-trivial nature of our extension.

Trade-off between Depth and Width. As a closing remark, this paper concentrates on b-ResNet, that is, ResNet with a constant width. For the set of ResNets denoted by $\mathcal{RN}(Q, N, L)$, it is an interesting future direction to explore the trade-off between $Q$, $N$, and $L$.

Table 3. A summary of upper bounds on the size of FNN and ResNet for the approximation of two representative types of functions: polynomials and Lipschitz functions. The relevant papers primarily focus on approximating functions such as Hölder functions, Sobolev functions, continuous functions, and Lebesgue-integrable functions; however, they all encompass Lipschitz functions, so we use Lipschitz functions for the comparison.

| Network | Polynomial of degree $p$ with $|E|$ terms | Lipschitz functions with Lip. constant 1 |
| --- | --- | --- |
| Shallow ReLU FNNs | $O\!\left(|E|p^{3/2}\varepsilon^{-1}\log(1/\varepsilon)\right)$ [1] | $O\!\left(c_1(d)\,\varepsilon^{-d}\right)$ [3] |
| Deep ReLU FNNs | $O\!\left(|E|pd\log(p/\varepsilon)\right)$ [2] | $O\!\left(c_2(d)\,\varepsilon^{-d/2}\right)$ [4,5] |
| ResNet with constant width | $O\!\left(|E|p\log(p/\varepsilon)\right)$ | $O\!\left(c_3(d)\,\varepsilon^{-d/2}\right)$ |

Here $c_1(d)$ implicitly depends on $d$, $c_2(d)$ grows at least exponentially in $d$, and $c_3(d) = (7\sqrt{d})^{d/2}$. The related references are: [1] (Blanchard & Bennouna, 2021), [2] (DeVore et al., 2021), [3] (Yang & Zhou, 2024), [4] (Yarotsky & Zhevnerchuk, 2020), and [5] (Yarotsky, 2018).
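To make the Lipschitz column of Table 3 concrete, a short calculation (ours) evaluates the ResNet bound $c_3(d)\,\varepsilon^{-d/2}$ with $c_3(d) = (7\sqrt d)^{d/2}$, which is Theorem 7 specialized to $\omega_f(t) \le t$, against the generic $\varepsilon^{-d}$ scaling of the shallow-FNN row; the FNN-row constants are left out because the paper gives them only implicitly.

```python
import math

def resnet_bound(d, eps):
    # c3(d) * eps^{-d/2} with c3(d) = (7*sqrt(d))^{d/2}, i.e. Theorem 7 for
    # a 1-Lipschitz target (omega_f(t) <= t).
    return (7 * math.sqrt(d)) ** (d / 2) * eps ** (-d / 2)

d = 8
for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}:  eps^-d = {eps ** -d:.2e}   ResNet bound = {resnet_bound(d, eps):.2e}")
```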
5. Experiments

In this section, we provide function approximation results to numerically validate the theoretical results presented in Sec. 4. To emphasize the approximation error, we use a sufficiently complex target function. Specifically, we utilize the set of functions given in Equation (2) (where $a_i, b_i$ are parameters), which combine the input coordinates $x_j$ with sinusoidal terms of the form $b_i\sin(\prod \cdot)$, to test the universal approximation capability of b-ResNet. The parameter settings of Equation (2) are included in Appendix G. We then compare b-ResNet with a fully connected NN for approximating the function in Equation (2) with different dimensions. The results are shown in Figure 2 and Table 4.

Table 4. Comparison of MSE loss.

| NN structure | d = 100 | d = 200 | d = 300 |
| --- | --- | --- | --- |
| NN(d+1, d/10) | 0.0139 | 0.0216 | 0.0472 |
| RN(d+1, 10, d/10) | 0.0102 | 0.0131 | 0.0225 |
| RN(d+1, 20, d/10) | 0.0093 | 0.0127 | 0.0228 |
| RN(d+1, 40, d/10) | 0.0091 | 0.0131 | 0.0230 |

Figure 2. Comparison of testing MSE loss for FNN and b-ResNet approximating the high-dimensional functions defined in Equation (2). More experiment results can be found in Appendix G.

From the experiments, it is evident that under the same dimensionality, the b-ResNet has a reduced number of parameters compared to a classical FNN, yet it achieves a lower MSE. This, in some sense, demonstrates the remarkable function approximation capability of the b-ResNet. Moreover, the function approximation task is of high dimensionality, which reflects the capacity of ResNets to learn high-dimensional functions. In conclusion, the experimental results demonstrate (i) the exceptional approximation capability of b-ResNet for learning complex functions, (ii) an efficient structure design which greatly reduces the number of training parameters, and (iii) strong scalability for approximating high-dimensional functions.

6. Conclusion

In this study, we provide a substantial theoretical contribution to the understanding of ResNet's capabilities in function approximation. We show that ResNet with constant width possesses the remarkable ability to approximate various important functions, including polynomials, smooth functions, and piecewise linear functions. Importantly, we show that even ResNet with constant width can achieve optimal approximation for Lebesgue-integrable functions, which are frequently encountered in practical applications. Moreover, we obtain improvements compared with FNN; Table 3 summarizes the explicit comparison between the approximation rates of ResNet and FNN. While ResNet has significant optimization advantages over FNNs during training, our results indicate that ResNets can achieve strong approximation capabilities with even fewer parameters than FNNs. These findings add to the theoretical justifications for ResNet's stellar practical performance.

Acknowledgements

This work is supported in part by a General Research Fund from the Research Grants Council, Hong Kong (Project No. 11200223), an InnoHK initiative, The Government of the HKSAR, Laboratory for AI-Powered Financial Technologies, and a Shenzhen-Hong Kong-Macau Science & Technology Project (Category C, Project No. SGDX20220530111203026). The authors would also like to thank the anonymous reviewers for their helpful comments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Blanchard, M.
and Bennouna, M. A. Shallow and deep networks are near-optimal approximators of korobov functions. In International Conference on Learning Representations, 2021. Cai, Y. Achieve the minimum width of neural networks for universal approximation. In The Eleventh International Conference on Learning Representations, 2022. Chen, K.-L., Garudadri, H., and Rao, B. D. Improved bounds on neural complexity for representing piecewise linear functions. Advances in Neural Information Processing Systems, 35:7167 7180, 2022. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303 314, 1989. De Vore, R., Hanin, B., and Petrova, G. Neural network approximation. Acta Numerica, 30:327 444, 2021. De Vore, R. A., Howard, R., and Micchelli, C. Optimal nonlinear approximation. Manuscripta mathematica, 63: 469 478, 1989. Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp. 907 940. PMLR, 2016. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256. JMLR Workshop and Conference Proceedings, 2010. Goldberg, P. and Jerrum, M. Bounding the vapnikchervonenkis dimension of concept classes parameterized by real numbers. In Proceedings of the sixth annual conference on Computational learning theory, pp. 361 369, 1993. Hanin, B. and Sellke, M. Approximating continuous functions by relu nets of minimal width. ar Xiv preprint ar Xiv:1710.11278, 2017. Hardt, M. and Ma, T. Identity matters in deep learning. ar Xiv preprint ar Xiv:1611.04231, 2016. He, J. On the optimal expressive power of relu dnns and its application in approximation with kolmogorov superposition theorem. ar Xiv preprint ar Xiv:2308.05509, 2023. He, J., Li, L., and Xu, J. Approximation properties of deep relu cnns. Research in the Mathematical Sciences, 9(3): 38, 2022. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251 257, 1991. Jiao, Y., Lai, Y., Lu, X., Wang, F., Yang, J. Z., and Yang, Y. Deep neural networks with relu-sine-exponential activations break curse of dimensionality in approximation on h older class. SIAM Journal on Mathematical Analysis, 55(4):3635 3649, 2023. Kidger, P. and Lyons, T. Universal approximation with deep narrow networks. In Conference on learning theory, pp. 2306 2327. PMLR, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kolmogorov, A. N. and Tikhomirov, V. M. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3 86, 1959. Lai, M.-J. and Shen, Z. The kolmogorov superposition theorem can break the curse of dimensionality when approximating high dimensional functions. ar Xiv preprint ar Xiv:2112.09963, 2021. Characterizing Res Net s Universal Approximation Capability Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861 867, 1993. Li, Q., Lin, T., and Shen, Z. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 2022a. 
Li, Z., Han, J., Li, Q., et al. On the curse of memory in recurrent neural networks: Approximation and optimization analysis. ar Xiv preprint ar Xiv:2009.07799, 2020. Li, Z., Han, J., Weinan, E., and Li, Q. Approximation and optimization theory for linear continuous-time recurrent neural networks. J. Mach. Learn. Res., 23:42 1, 2022b. Liang, S. and Srikant, R. Why deep neural networks for function approximation? ar Xiv preprint ar Xiv:1610.04161, 2016. Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. Advances in neural information processing systems, 31, 2018. Lu, J., Shen, Z., Yang, H., and Zhang, S. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465 5506, 2021. Montanelli, H. and Du, Q. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78 92, 2019. Montanelli, H., Yang, H., and Du, Q. Deep relu networks overcome the curse of dimensionality for bandlimited functions. ar Xiv preprint ar Xiv:1903.00735, 2019. Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014. Oono, K. and Suzuki, T. Approximation and non-parametric estimation of resnet-type convolutional neural networks. In International conference on machine learning, pp. 4922 4931. PMLR, 2019. Park, S., Yun, C., Lee, J., and Shin, J. Minimum width for universal approximation. In International Conference on Learning Representations, 2020. Pinkus, A. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143 195, 1999. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503 519, 2017. Schwab, C. and Zech, J. Deep learning in high dimension: Neural network approximation of analytic functions in L2(Rd, γd). ar Xiv preprint ar Xiv:2111.07080, 2021. Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4558 4566. PMLR, 2018. Shapira, I. Expressivity of shallow and deep neural networks for polynomial approximation. ar Xiv preprint ar Xiv:2303.03544, 2023. Shen, Z., Yang, H., and Zhang, S. Deep network approximation characterized by number of neurons. ar Xiv preprint ar Xiv:1906.05497, 2019. Shen, Z., Yang, H., and Zhang, S. Deep network approximation with discrepancy being reciprocal of width to power of depth. ar Xiv preprint ar Xiv:2006.12231, 2020. Shen, Z., Yang, H., and Zhang, S. Neural network approximation: Three hidden layers are enough. Neural Networks, 141:160 173, 2021. Shen, Z., Yang, H., and Zhang, S. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. The Journal of Machine Learning Research, 23(1):12653 12712, 2022a. Shen, Z., Yang, H., and Zhang, S. Optimal approximation rate of relu networks in terms of width and depth. Journal de Math ematiques Pures et Appliqu ees, 157:101 135, 2022b. Siegel, J. W. Optimal approximation rates for deep relu neural networks on sobolev and besov spaces. Journal of Machine Learning Research, 24(357):1 52, 2023. Suh, N., Zhou, T.-Y., and Huo, X. Approximation and nonparametric estimation of functions over high-dimensional spheres via deep relu networks. 
In The Eleventh International Conference on Learning Representations, 2022. Tarela, J. and Martinez, M. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 30(11-12):17 27, 1999. Telgarsky, M. Representation benefits of deep feedforward networks. ar Xiv preprint ar Xiv:1509.08101, 2015. Telgarsky, M. Benefits of depth in neural networks. In Conference on learning theory, pp. 1517 1539. PMLR, 2016. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Characterizing Res Net s Universal Approximation Capability Walter, R. Real and complex analysis. 1987. Wang, Q. et al. Exponential convergence of the deep neural network approximation for analytic functions. ar Xiv preprint ar Xiv:1807.00297, 2018. Wang, S. and Sun, X. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425 4431, 2005. Yang, Y. and Zhou, D.-X. Optimal rates of approximation by shallow relu k neural networks and applications to nonparametric regression. Constructive Approximation, pp. 1 32, 2024. Yarotsky, D. Error bounds for approximations with deep relu networks. Neural Networks, 94:103 114, 2017. Yarotsky, D. Optimal approximation of continuous functions by very deep relu networks. In Conference on learning theory, pp. 639 649. PMLR, 2018. Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. Advances in neural information processing systems, 33:13005 13015, 2020. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? ar Xiv preprint ar Xiv:1912.10077, 2019. Zhang, S., Shen, Z., and Yang, H. Neural network architecture beyond width and depth. Advances in Neural Information Processing Systems, 35:5669 5681, 2022. Zhang, S., Lu, J., and Zhao, H. On enhancing expressive power via compositions of single fixed-size relu network. ar Xiv preprint ar Xiv:2301.12353, 2023. Zhou, D.-X. Deep distributed convolutional neural networks: Universality. Analysis and Applications, 16(06):895 919, 2018. Zhou, D.-X. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48 (2):787 794, 2020. Characterizing Res Net s Universal Approximation Capability A. Preliminaries In this section, we introduce some basic notations for use in subsequent proofs. A.1. Feedforward neural networks (FNNs) As is known to all, FNN is a function Φ : Rd : R which is formed as the alternating compositions of Re LU function σ, and affine transformations A[i](y) = Uiy + vi with Ui Rdi di 1, vi Rdi, d0 = d for i = 1, 2, , L. Specifically, Φ (x) = L σ A[L] σ A[L 1] σ A[1] (x) where L is a final affine transformation. Here L denotes the number of layers of the FNN, and the width of the FNN is conventionally defined by max{d1, d2, , d L} := K. The Re LU activation function is defined by: σ(x) := Re LU(x) = max (x, 0) = (x)+ , x R and for x Rd, σ(x) := (σ(x1), , σ(xd)). Typically, it is presumed that the number of neurons in each layer of an FNN is the same, which is equal to the width K, as any neuron deficits in a layer can be dealt with by adding K dj neurons whose biases are zero in layer j. The weights between these extra neurons are consequently assigned to zero. A.2. Res Net Let d, Q N+. 
Res Net R(x) : Rd R is a combination of an initial affine layer, multiple basic residual blocks with identity mapping, and a final affine output layer: R(x) = L (T [L] + Id) (T [L 1] + Id) (T [1] + Id) AQ(x), (3) where AQ : Rd RQ and L : RQ R are affine transformations. Besides, T [i](i = 0, 1, ..., L) are basic residual blocks, i.e., T [i](z) = Viσ (Wiz + bi) where Wi Rni Q, Vi RQ ni, bi Rni. Concretely, we denote the output of the i-th block by z[i]. Then the outputs of each block can be formulated as follows: z[0] = AQ (x) = W0x + b0, T [i](z) = Viσ (Wiz + bi) , z[i] = T [i] z[i 1] + z[i 1], i = 1, 2, , L, R(x) = L z[L] = Bz[L], where W0 RQ d, b0 RQ, Wi Rni Q, Vi RQ ni, bi Rni, B R1 Q and x Rd. The Res Net s depth, denoted by L, is defined as the number of residual blocks. The Res Net s width is the maximum number of neurons in the activation layer, that is max{n1, n2, ..., n L}. The subscript Q of AQ refers to the number of neurons in the linear layer. We denote by RN (Q, N, L) the set of Res Net functions width N, depth L and Q neurons in each linear layer. Additionally, we define ζ[i] = T [i] 1 (z[i 1]) = σ(Wiz + bi), γ[i] = T [i] z[i 1] = T [i] 2 ζ[i] = Viζ[i] = Vi T [i] 1 z[i 1] and z[i] = z[i 1] + γ[i] for i = 1, 2, , L. See Fig. 3 for an illustration. A.3. Notations We summarize the notations we will use in this paper in the following. Characterizing Res Net s Universal Approximation Capability +Id +Id +Id = γ[1] + z[0] = γ[2] + z[1] = γ[3] + z[2] Figure 3. An illustration figure for the Res Net notation. The yellow neurons are in the activation layer and the grey neurons are in the linear layer. Let column vectors ai Rmi, where i = 1, 2, , n and mi N+ := {1, 2, }. To represent these vectors concisely, we use (a1, a2, , an) to denote a T 1 , a T 2 , , a T n T Rm1+m2+ +mn. Here, a T denotes the transpose of a. Let a Rm. If we write a vector as (Rl, a) Rm+l, this implies that the value of the vector in the position represented by Rl does not matter. If l = 1, we always use - to substitute R , i.e., ( , a) Rm+1 implies the value of the position represented by - does not matter. For a vector v in Rm, vi is the i-th entry of v for i = 1, 2, , m. Denote by µ(T ) the Lebesgue measure of a measurable set T . Let 1S be the characteristic function on a set S, i.e., 1S = 1 on set S and 0 otherwise. For two sets A, B, A\B := {x : x A, x / B}. For any ξ R, let ξ := max{i : i ξ, i Z} and ξ := min{i : i ξ, i Z}. B. Proof of Proposition 1 and Theorem 2 In this appendix, we provide the proofs of conclusions in Sec. 3. B.1. Proof of Proposition 1 For a Res Net in RN (Q, N, L) from [0, 1]d to R defined by the formula 3, 4 and 5, we now construct a special network with depth L and width N + Q having the same output. We first suppose the input of the network is y[0] = AQ(x) = z[0] and denote the output of the i-th layer by y[i]. What s more, we assume in each layer the bottom Q neurons of each layer are Re LU-free, i.e., the activation function of them is identity mapping σ(x) = x. The activation of the rest of neurons are Re LU. Then by assigning some weights to the first layer, we can have y[1] = (ζ[1], z[0]). In the next layer, we can easily compute z[1] = V1ζ[1] + z[0]. Then we have y[2] = (RN, z[1]). Now assume y[2i] = (RN, z[i]). Then in the first N neurons of the next layer, we compute T [i+1] 1 (z[i]) = Re LU(Wi+1z[i] + bi). In the bottom Q Re LU-free neurons, we copy z[i]. Then y[i+1] = ζ[i+1], z[i] . 
B. Proof of Proposition 1 and Theorem 2

In this appendix, we provide the proofs of the conclusions in Sec. 3.

B.1. Proof of Proposition 1

For a ResNet in $\mathcal{RN}(Q, N, L)$ from $[0,1]^d$ to $\mathbb{R}$ defined by (3)-(5), we first construct a special network with depth $2L$ and width $N + Q$ having the same output. Suppose the input of the network is $y^{[0]} = A_Q(x) = z^{[0]}$ and denote the output of the $i$-th layer by $y^{[i]}$. Moreover, we assume that in each layer the bottom $Q$ neurons are ReLU-free, i.e., their activation function is the identity mapping $\sigma(x) = x$; the remaining neurons use ReLU. By assigning suitable weights to the first layer, we obtain $y^{[1]} = (\zeta^{[1]}, z^{[0]})$. In the next layer, we can easily compute $z^{[1]} = V_1 \zeta^{[1]} + z^{[0]}$, so that $y^{[2]} = (\mathbb{R}^N, z^{[1]})$. Now assume $y^{[2i]} = (\mathbb{R}^N, z^{[i]})$. In the first $N$ neurons of the next layer we compute $T_1^{[i+1]}(z^{[i]}) = \mathrm{ReLU}(W_{i+1} z^{[i]} + b_{i+1})$, and in the bottom $Q$ ReLU-free neurons we copy $z^{[i]}$, so that $y^{[2i+1]} = (\zeta^{[i+1]}, z^{[i]})$. Then in the next layer we compute $z^{[i+1]} = V_{i+1}\zeta^{[i+1]} + z^{[i]}$, i.e., $y^{[2(i+1)]} = (\mathbb{R}^N, z^{[i+1]})$. By induction, we have found a special network of depth $2L$, with $N$ top ReLU neurons and $Q$ bottom ReLU-free neurons per layer, having the same output as the ResNet. This process is illustrated in Figure 4.

Figure 4. A ResNet (top) can be generated by a special FNN (bottom). All grey neurons are ReLU-free and all yellow neurons have ReLU activation. Moreover, for the ResNet, the yellow neurons are in the activation layer and the grey neurons are in the linear layer.

Next, we construct a genuine ReLU network $\Phi$ that has the same size and output as the special network. Because the domain $[0,1]^d$ is compact, there exist $C_i \in \mathbb{R}^Q$ such that $z^{[i]} + C_i > 0$ for all $i = 0, 1, 2, \dots, L$. Let the first layer of $\Phi$ be $u^{[0]} = \mathrm{ReLU}(A_Q(x) + C_0) = \mathrm{ReLU}(z^{[0]} + C_0)$, and denote the output of the $j$-th layer of $\Phi$ by $u^{[j-1]}$. Each subsequent layer of $\Phi$ has width $N + Q$. We write $u^{[i]} = (u^{[i]}_{(N)}, u^{[i]}_{(Q)})$, where $u^{[i]}_{(N)}$ is the value of the top $N$ neurons and $u^{[i]}_{(Q)}$ is the value of the bottom $Q$ neurons in layer $i + 1$. Then, noting $T_1^{[1]}(z^{[0]}) = \mathrm{ReLU}(W_1 z^{[0]} + b_1)$, we can compute
$$u^{[1]}_{(N)} = \mathrm{ReLU}\big(W_1(u^{[0]} - C_0) + b_1\big) = \mathrm{ReLU}(W_1 z^{[0]} + b_1) = \zeta^{[1]}$$
and $u^{[1]}_{(Q)} = \mathrm{ReLU}(u^{[0]}) = u^{[0]} = \mathrm{ReLU}(z^{[0]} + C_0)$. Noting $z^{[i]} = \mathrm{ReLU}(z^{[i]} + C_i) - C_i$, we can compute
$$u^{[2]}_{(Q)} = \mathrm{ReLU}\big(V_1 u^{[1]}_{(N)} + u^{[1]}_{(Q)} - C_0 + C_1\big) = \mathrm{ReLU}\big(V_1 \zeta^{[1]} + z^{[0]} + C_1\big) = \mathrm{ReLU}(z^{[1]} + C_1).$$
Now suppose $u^{[2j]} = (\mathbb{R}^N, \mathrm{ReLU}(z^{[j]} + C_j))$. Then we can compute
$$u^{[2j+1]}_{(N)} = \mathrm{ReLU}\big(W_{j+1}(u^{[2j]}_{(Q)} - C_j) + b_{j+1}\big) = \mathrm{ReLU}(W_{j+1} z^{[j]} + b_{j+1}) = T_1^{[j+1]}(z^{[j]}) = \zeta^{[j+1]}$$
and $u^{[2j+1]}_{(Q)} = \mathrm{ReLU}\big(u^{[2j]}_{(Q)}\big) = \mathrm{ReLU}(z^{[j]} + C_j)$. Then, in the next layer,
$$u^{[2j+2]}_{(Q)} = \mathrm{ReLU}\big(V_{j+1} u^{[2j+1]}_{(N)} + u^{[2j+1]}_{(Q)} - C_j + C_{j+1}\big) = \mathrm{ReLU}\big(V_{j+1} T_1^{[j+1]}(z^{[j]}) + z^{[j]} + C_{j+1}\big) = \mathrm{ReLU}(z^{[j+1]} + C_{j+1}).$$
By induction, we can output $z^{[L]}$ in the last layer, i.e., $u^{[2L]} = (\mathbb{R}^N, \mathrm{ReLU}(z^{[L]} + C_L))$, and then recover $z^{[L]}$ by an affine transformation $A(u^{[2L]}) = \mathrm{ReLU}(z^{[L]} + C_L) - C_L = z^{[L]}$. From this construction, we can see that the FNN has $W + 2kL$ non-zero trainable weights.
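The conversion in the proof above can be mimicked numerically. The sketch below (our own illustration under the assumption that crude norm bounds suffice for the shifts $C_i$; it is not the paper's code) emulates each residual block by two ReLU layers of width $N + Q$, where the bottom $Q$ coordinates carry $z^{[i]} + C_i$, shifted so that ReLU acts as the identity on them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def resnet_forward(x, W0, b0, blocks, B):
    z = W0 @ x + b0
    for W, b, V in blocks:
        z = V @ relu(W @ z + b) + z
    return B @ z

def shift_constants(W0, b0, blocks):
    # crude sup-norm bounds with |z^[i]| <= C_i over [0, 1]^d, so that z^[i] + C_i >= 0
    bound = np.abs(W0).sum(axis=1).max() + np.abs(b0).max()
    C = [np.full(b0.shape, bound)]
    for W, b, V in blocks:
        bound += np.abs(V).sum(axis=1).max() * (np.abs(W).sum(axis=1).max() * bound + np.abs(b).max())
        C.append(np.full(b0.shape, bound))
    return C

def resnet_as_fnn(x, W0, b0, blocks, B, C):
    """Plain ReLU net of width N + Q and depth 2L with the same output as the ResNet."""
    u = relu(W0 @ x + b0 + C[0])                      # u^[0] = relu(z^[0] + C_0) = z^[0] + C_0
    for i, (W, b, V) in enumerate(blocks):
        zeta = relu(W @ (u - C[i]) + b)               # top N neurons: zeta^[i+1]
        keep = relu(u)                                # bottom Q neurons: z^[i] + C_i, unchanged
        u = relu(V @ zeta + keep - C[i] + C[i + 1])   # next layer holds z^[i+1] + C_{i+1}
    return B @ (u - C[-1])                            # undo the final shift

rng = np.random.default_rng(0)
d, Q, N, L = 3, 2, 4, 3
W0, b0 = rng.standard_normal((Q, d)), rng.standard_normal(Q)
blocks = [(rng.standard_normal((N, Q)), rng.standard_normal(N), rng.standard_normal((Q, N)))
          for _ in range(L)]
B = rng.standard_normal((1, Q))
C = shift_constants(W0, b0, blocks)
x = rng.uniform(0, 1, d)
print(resnet_forward(x, W0, b0, blocks, B), resnet_as_fnn(x, W0, b0, blocks, B, C))
```

On any input in $[0,1]^d$, the two functions print the same value, which is exactly the claim of Proposition 1.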
B.2. Proof of Theorem 2

Because the product function $f(x) = x_1 x_2 \cdots x_d$ belongs to $\mathcal{P}(d, p)$, it suffices to show that the lower bound on the complexity of an FNN approximating $f$ is $\Theta_d(\log 1/\varepsilon)$. Let $x \in [0, 1]$. Define $\tilde\Psi(x) = \Psi(x, x, \dots, x)$ and $\tilde f(x) = f(x, x, \dots, x) = x^d$. Then, by assumption,
$$|\tilde\Psi(x) - \tilde f(x)| < \varepsilon, \quad x \in [1/2, 1].$$
On the interval $[1/2, 1]$, $\tilde f$ is strictly convex because $\tilde f''(x) = d(d-1)x^{d-2} \ge d(d-1)\left(\tfrac{1}{2}\right)^{d-2} =: c_1 > 0$. By Lemma 2.1 in (Telgarsky, 2015), $\tilde\Psi$ is a CPwL function over $[0, 1]$ with at most $(2N)^L$ linear pieces, i.e., $[1/2, 1]$ is partitioned into at most $(2N)^L$ intervals on each of which $\tilde\Psi$ is linear. Thus, there exists an interval $[a, b] \subset [1/2, 1]$ with $b - a \ge \frac{1}{2(2N)^L}$ over which $\tilde\Psi$ is linear. Define $G(x) = \tilde f(x) - \tilde\Psi(x)$ for $x \in [a, b]$. Then $|G(x)| < \varepsilon$ and $G''(x) \ge c_1 > 0$ for any $x \in [a, b]$, due to the linearity of $\tilde\Psi$ on $[a, b]$. Consider the local Taylor expansion at $(a+b)/2$: for $x \in [a, b]$,
$$G(x) = G\!\left(\tfrac{a+b}{2}\right) + G'\!\left(\tfrac{a+b}{2}\right)\!\left(x - \tfrac{a+b}{2}\right) + \tfrac{1}{2} G''(\xi)\!\left(x - \tfrac{a+b}{2}\right)^2,$$
where $\xi$ lies between $x$ and $(a+b)/2$. Taking $x = a$ and $x = b$, we have
$$G(a) = G\!\left(\tfrac{a+b}{2}\right) - G'\!\left(\tfrac{a+b}{2}\right)\tfrac{b-a}{2} + \tfrac{1}{2} G''(\xi)\!\left(\tfrac{b-a}{2}\right)^2, \quad \xi \in \left[a, \tfrac{a+b}{2}\right],$$
$$G(b) = G\!\left(\tfrac{a+b}{2}\right) + G'\!\left(\tfrac{a+b}{2}\right)\tfrac{b-a}{2} + \tfrac{1}{2} G''(\eta)\!\left(\tfrac{b-a}{2}\right)^2, \quad \eta \in \left[\tfrac{a+b}{2}, b\right].$$
It follows that
$$\max\{G(a), G(b)\} \ge G\!\left(\tfrac{a+b}{2}\right) + \tfrac{c_1}{2}\!\left(\tfrac{b-a}{2}\right)^2.$$
Thus, noting $b - a \ge \frac{1}{2(2N)^L}$, we have
$$2\varepsilon > \max\{G(a), G(b)\} - G\!\left(\tfrac{a+b}{2}\right) \ge \tfrac{c_1}{2}\!\left(\tfrac{b-a}{2}\right)^2 \ge \tfrac{c_1}{2}\!\left(\tfrac{1}{4(2N)^L}\right)^2.$$
Then $(2N)^{2L} \ge \frac{c}{\varepsilon}$, where $c$ is a constant depending on $d$. Since the number of neurons is $T = NL$, we have $(2N)^{2L} \le (2T)^{2T} = u^u$ with $u = 2T$, so that $u \log u \ge \log \frac{c}{\varepsilon}$. Therefore, the number of neurons must be at least of order $\log 1/\varepsilon$.

C. Proof of Proposition 3 and Theorem 4

Our ideas in this appendix originate from (DeVore et al., 2021). It should be noted that while (DeVore et al., 2021) inspired our approach to polynomial approximation, the construction details differ substantially; our main contribution is the explicit constructive proof for ResNet. The discussion in Subsection C.1 starts with the basic functions, ranging from $x^2$ to $xy$, that we use to build our ResNet approximations. In Subsection C.2, we first establish Proposition 3 for the case $[0, 1]^d$; this is then extended to $[-M, M]^d$ for $M > 1$ in Subsection C.3.

C.1. Preliminaries

Recall that the so-called hat function $h$ is defined by
$$h(x) = 2(x)_+ - 4\left(x - \tfrac{1}{2}\right)_+ + 2(x - 1)_+. \qquad (6)$$
Let $h_m(x)$ be the $m$-fold composition of $h$, i.e., $h_m = \underbrace{h \circ h \circ \cdots \circ h}_{m \text{ times}}$, the so-called sawtooth function. Then
$$x^2 = x - \sum_{m \ge 1} 4^{-m} h_m(x), \quad x \in [0, 1].$$
Next, we define $S(x) := x^2$ and
$$S_n(x) := x - \sum_{m=1}^{n} 4^{-m} h_m(x), \quad n \ge 1,\ x \in [0, 1].$$
We then have
$$|S(x) - S_n(x)| \le \sum_{i=n+1}^{\infty} 4^{-i} \le \tfrac{1}{3}\, 4^{-n}, \quad x \in [0, 1]. \qquad (7)$$
$S_n(x)$ is the piecewise linear interpolation of $S$ on $[0, 1]$ with $2^n + 1$ uniformly distributed breakpoints, as indicated in (Yarotsky, 2017) (see Proposition 1 therein). Using equation (7), we can generate $S_n$ and thereby approximate $x^2$.

Proposition 8. There exists a ResNet $R(x) \in \mathcal{RN}(2, 4, L)$ with $L = O(\log 1/\varepsilon)$ such that $|R(x) - x^2| < \varepsilon$ for all $x \in [0, 1]$, while having $O(\log 1/\varepsilon)$ neurons. In particular, a ResNet $R(x) \in \mathcal{RN}(2, 4, n)$ generates $S_n$ exactly.

Proof. It suffices to construct the required ResNet representing $S_n$. Setting the right-hand side of (7) equal to $\varepsilon$, i.e., $\frac{1}{3}4^{-n} = \varepsilon$, gives $n = O(\log 1/\varepsilon)$. We now construct a ResNet $R(x) \in \mathcal{RN}(2, 4, n)$ generating $S_n(x)$ exactly.

Figure 5. Illustration of the constructive first block (left), which outputs $z^{[1]} = (-h(x)/4,\ S_1(x))$, and the $m$-th block (right), which maps $z^{[m]} = (-4^{-m}h_m(x),\ S_m(x))$ to $z^{[m+1]} = (-4^{-(m+1)}h_{m+1}(x),\ S_{m+1}(x))$. Grey represents the linear layer and yellow represents the activation layer.

Let $z^{[0]} = A_2(x) = (x, 0)$. By equation (6), we can use three ReLU units to store $(x)_+$, $(x - 1/2)_+$, and $(x-1)_+$ and hence compute $h(x)$; in the same layer, we use one ReLU unit to copy $x$. Then, in the first residual block, we can compute $\gamma^{[1]} = T^{[1]}(z^{[0]}) = (-x - h(x)/4,\ x - h(x)/4)$ so that
$$z^{[1]} = \gamma^{[1]} + z^{[0]} = (-x - h(x)/4,\ x - h(x)/4) + (x, 0) = (-h(x)/4,\ S_1(x)).$$
See Figure 5 (left) for an illustration. Now assume the output of the $m$-th block is $z^{[m]} = (-4^{-m} h_m(x),\ S_m(x))$. Again by (6) and by assigning appropriate weights, we can use three units to compute $h(h_m(x)) = h_{m+1}(x)$ and one unit to copy $-4^{-m}h_m(x)$ via $a = -(-a)_+$ for $a \le 0$. Thus, the next block can output
$$\gamma^{[m+1]} = T^{[m+1]}(z^{[m]}) = \big(-4^{-(m+1)} h_{m+1}(x) + 4^{-m} h_m(x),\ -4^{-(m+1)} h_{m+1}(x)\big),$$
so that
$$z^{[m+1]} = \gamma^{[m+1]} + z^{[m]} = \big(-4^{-(m+1)} h_{m+1}(x),\ S_{m+1}(x)\big).$$
See Figure 5 (right) for an illustration. By induction, we complete our proof: concretely, $z^{[n]} = (-4^{-n} h_n(x),\ S_n(x))$ and $R(x) = \mathcal{L}(z^{[n]}) = S_n(x)$ by letting $\mathcal{L}(x_1, x_2) = x_2$.
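The identity behind $S_n$ is easy to check numerically. The following sketch (our own illustration; it evaluates $h$, $h_m$, and $S_n$ directly rather than through the ResNet blocks of Proposition 8) compares $S_n$ with $x^2$ against the bound in (7).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    # hat function (6): h(x) = 2(x)_+ - 4(x - 1/2)_+ + 2(x - 1)_+
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    # S_n(x) = x - sum_{m=1}^n 4^{-m} h_m(x), with h_m the m-fold composition of h
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

x = np.linspace(0.0, 1.0, 10001)
for n in (2, 4, 8):
    print(n, np.max(np.abs(S(n, x) - x ** 2)), (1 / 3) * 4.0 ** (-n))  # observed error vs. bound (7)
```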
Let $x, y \in [0, 1]$. We can approximate the product $xy$ by using the identity $xy = \left(\frac{|x+y|}{2}\right)^2 - \left(\frac{|x-y|}{2}\right)^2$. Define
$$\pi_n(x, y) = S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right).$$
It then follows from equation (7) that
$$|\pi_n(x, y) - xy| \le 4^{-n}, \quad x, y \in [0, 1]. \qquad (9)$$
For the later rigorous derivation, we also need to prove the following lemma.

Lemma 9.
$$\pi_n(x, y) \in [0, 1], \quad x, y \in [0, 1]. \qquad (10)$$

Proof. According to the definition of $S_n$, we have $x^2 \le S_n(x) \le x$ for $x \in [0, 1]$. Hence
$$\pi_n(x, y) = S_n\!\left(\frac{x+y}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) \le \frac{x+y}{2} - \left(\frac{x-y}{2}\right)^2 = \frac{1}{4}\big(x(2-x) + y(2-y) + 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
Next, we show $\pi_n(x, y) \ge 0$. We start from
$$\pi_n(x, y) = S_n\!\left(\frac{x+y}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) = \frac{x+y}{2} - \sum_{i=1}^{n} 4^{-i} h_i\!\left(\frac{x+y}{2}\right) - \frac{|x-y|}{2} + \sum_{i=1}^{n} 4^{-i} h_i\!\left(\frac{|x-y|}{2}\right) = \min\{x, y\} + \sum_{i=1}^{n} 4^{-i}\left[h_i\!\left(\frac{|x-y|}{2}\right) - h_i\!\left(\frac{x+y}{2}\right)\right]. \qquad (11)$$
Now we introduce the function $\zeta(x) := 2 \min\{|x - s| : s \in \mathbb{Z}\}$, $x \in \mathbb{R}$. Then for $x \in [0, 1]$ we have $h(x) = \zeta(x)$ and $h_m(x) = \zeta(2^{m-1}x)$ for $m \ge 2$. Since $\zeta$ is subadditive, i.e., $\zeta(t + t') \le \zeta(t) + \zeta(t')$, we have
$$h_i\!\left(\frac{x+y}{2}\right) = h_i\!\left(\frac{|x-y|}{2} + \min\{x, y\}\right) = \zeta\!\left(2^{i-1}\Big(\frac{|x-y|}{2} + \min\{x, y\}\Big)\right) \le \zeta\!\left(2^{i-1}\frac{|x-y|}{2}\right) + \zeta\big(2^{i-1}\min\{x, y\}\big) = h_i\!\left(\frac{|x-y|}{2}\right) + h_i(\min\{x, y\}),$$
i.e.,
$$h_i\!\left(\frac{|x-y|}{2}\right) - h_i\!\left(\frac{x+y}{2}\right) \ge -h_i(\min\{x, y\}).$$
From (11), we then have
$$\pi_n(x, y) \ge \min\{x, y\} - \sum_{i=1}^{n} 4^{-i} h_i(\min\{x, y\}) = S_n(\min\{x, y\}) \ge \min\{x, y\}^2 \ge 0.$$

For $x, y \in [-M, M]$, we can approximate $xy$ by the following remark.

Remark 10. If $x, y \in [-M, M]$, we can approximate $xy$ by using the identity $xy = M^2\left[\left(\frac{|x+y|}{2M}\right)^2 - \left(\frac{|x-y|}{2M}\right)^2\right]$. Because the domain of $S_n$ is $[0, 1]$, we define
$$\hat\pi_n(x, y) = M^2 \pi_n\!\left(\frac{x}{M}, \frac{y}{M}\right) = M^2\left[S_n\!\left(\frac{|x+y|}{2M}\right) - S_n\!\left(\frac{|x-y|}{2M}\right)\right]. \qquad (13)$$
We have
$$|\hat\pi_n(x, y) - xy| \le M^2\, 4^{-n}, \quad x, y \in [-M, M]. \qquad (14)$$

Now we show that a ResNet can approximate the product function $xy$.

Proposition 11. Let $x, y \in [-M, M]$. There exists a ResNet $R \in \mathcal{RN}(3, 4, L)$ from $[-M, M]^2$ to $\mathbb{R}$ with $L = O(\log M/\varepsilon)$ such that $|R(x, y) - xy| < \varepsilon$ for all $x, y \in [-M, M]$, while having $O(\log M/\varepsilon)$ neurons and tunable weights. In particular, a ResNet $R(x, y)$ with width $4$ and depth $2n$ can generate $\hat\pi_n(x, y)$ exactly.

Proof. It suffices to construct the required ResNet outputting $\hat\pi_n(x, y)$. Setting the right-hand side of (14) equal to $\varepsilon$ gives $n = O(\log M/\varepsilon)$. We now construct a ResNet with width $4$ and depth $2n$ representing $\hat\pi_n(x, y)$. Let $A_3(x, y) = \left(\frac{x+y}{2M}, \frac{x-y}{2M}, 0\right)$. We can use the first block to output $z^{[1]} = \left(\frac{|x+y|}{2M}, \frac{|x-y|}{2M}, 0\right)$ by the simple observation $|a| = (a)_+ + (-a)_+$; see Figure 6 for an illustration. Then it follows from the proof of Proposition 8 that we can use the next $n$ blocks, with $4$ units in each, to output $z^{[n+1]} = \left(-, \frac{|x-y|}{2M}, S_n\!\left(\frac{|x+y|}{2M}\right)\right)$ while keeping the value of the second neuron in each linear layer unchanged. Next, by the same operation, we use the following $n$ blocks to output
$$z^{[2n+1]} = \left(-, -, S_n\!\left(\frac{|x+y|}{2M}\right) - S_n\!\left(\frac{|x-y|}{2M}\right)\right) = \left(-, -, \hat\pi_n(x, y)/M^2\right).$$
Thus, $R(x, y) = \mathcal{L}\left(-, -, \hat\pi_n(x, y)/M^2\right) = \hat\pi_n(x, y)$ by letting $\mathcal{L}(x_1, x_2, x_3) = M^2 x_3$. See Figure 7 for an illustration.

Figure 6. Illustration of computing $|x|$ by a residual block via $|x| = x + 2(-x)_+$.

Figure 7. Illustration of generating $\hat\pi_n(x, y)$ by the constructive ResNet, using $O(n)$ blocks for each $S_n$ term. Grey represents the linear layer and yellow represents the activation layer.
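A small numerical sketch of $\pi_n$ (our own illustration; the helpers repeat the $S_n$ code above so the snippet stays self-contained) makes the bounds (9) and (10) visible:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    # pi_n(x, y) = S_n(|x + y| / 2) - S_n(|x - y| / 2), an approximation of xy
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

g = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(g, g)
for n in (3, 6):
    P = pi(n, X, Y)
    print(n, np.max(np.abs(P - X * Y)), 4.0 ** (-n))   # observed error vs. the bound in (9)
    print("  range of pi_n:", P.min(), P.max())        # stays inside [0, 1], as in (10)
```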
Moreover, we can approximate the multiple product $x_1 x_2 \cdots x_d$ for $x_1, x_2, \dots, x_d \in [0, 1]$. By (10), we can well-define
$$\pi_n^m(x_1, x_2, \dots, x_m) = \pi_n\big(\pi_n^{m-1}(x_1, x_2, \dots, x_{m-1}),\ x_m\big), \quad m = 3, 4, \dots, \quad \text{for } x_1, \dots, x_d \in [0, 1], \qquad (15)$$
and $\pi_n^2(x_1, x_2) = \pi_n(x_1, x_2)$. Then we have the following proposition.

Proposition 12.
$$|\pi_n^m(x_1, x_2, \dots, x_m) - x_1 x_2 \cdots x_m| \le e\, m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [0, 1], \qquad (16)$$
as long as $n \ge 1 + \log_2 m$.

Proof. First, it follows from the definition of $S_n$ and $S(x) = x^2$ that
$$S(x) - S_n(x) = \sum_{m=n+1}^{\infty} 4^{-m} h_m(x).$$
Note that $h_m$ has Lipschitz norm $2^m$. We have
$$\|S' - S_n'\|_{L^\infty([0,1])} \le \sum_{m=n+1}^{\infty} 4^{-m} \|h_m'\|_{L^\infty([0,1])} \le \sum_{m=n+1}^{\infty} 4^{-m}\, 2^m \le 2^{-n}.$$
Since $S'(x) = 2x$, we have
$$S_n'(x) = 2x + \delta, \quad \text{where } |\delta| \le 2^{-n}. \qquad (17)$$
Define $\pi(x, y) := xy$. We have
$$\partial_1 \pi_n(x, y) = \tfrac{1}{2} S_n'\!\left(\tfrac{x+y}{2}\right) - \tfrac{1}{2}\,\mathbb{1}\{x \ge y\}\, S_n'\!\left(\tfrac{x-y}{2}\right) + \tfrac{1}{2}\,\mathbb{1}\{x < y\}\, S_n'\!\left(\tfrac{y-x}{2}\right) = \tfrac{1}{2}(x + y + \delta) - \tfrac{1}{2}\,\mathbb{1}\{x \ge y\}(x - y + \delta_1) + \tfrac{1}{2}\,\mathbb{1}\{x < y\}(y - x + \delta_2)$$
$$= y + \tfrac{1}{2}\big(\delta - \mathbb{1}\{x \ge y\}\delta_1 + \mathbb{1}\{x < y\}\delta_2\big) = \partial_1\pi(x, y) + \tfrac{1}{2}\big(\delta - \mathbb{1}\{x \ge y\}\delta_1 + \mathbb{1}\{x < y\}\delta_2\big), \qquad (18)$$
where $|\delta|, |\delta_1|, |\delta_2| \le 2^{-n}$. The same argument applies to $\partial_2 \pi_n(x, y)$. Thus,
$$\|\partial_i \pi - \partial_i \pi_n\|_{L^\infty([0,1]^2)} \le 2^{-n+1}, \quad \text{where } \partial_i := \frac{\partial}{\partial x_i},\ i = 1, 2. \qquad (19)$$
Define $\pi^m(x_1, x_2, \dots, x_m) = x_1 x_2 \cdots x_m$. Now assume
$$\|\pi^j - \pi_n^j\|_{C([0,1]^j)} \le c_j\, 4^{-n} \qquad (20)$$
holds for all $j \le m - 1$; note that the inequality holds for $j = 2$ with $c_2 = 1$. Then
$$\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le \big\|\pi\big(x_m, \pi^{m-1}\big) - \pi_n\big(x_m, \pi^{m-1}\big)\big\|_{C([0,1]^m)} + \big\|\pi_n\big(x_m, \pi^{m-1}\big) - \pi_n\big(x_m, \pi_n^{m-1}\big)\big\|_{C([0,1]^m)}$$
$$\le 4^{-n} + \|\partial_2 \pi_n\|_{L^\infty([0,1]^2)}\, \|\pi^{m-1} - \pi_n^{m-1}\|_{C([0,1]^{m-1})} \le 4^{-n} + \big(1 + 2^{-n+1}\big)\, c_{m-1}\, 4^{-n} = (1 + \alpha_n c_{m-1})\, 4^{-n},$$
where $\alpha_n := 1 + 2^{-n+1}$. By induction, we have $\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le c_m 4^{-n}$, where $c_m$ satisfies the recurrence $c_m = 1 + \alpha_n c_{m-1}$, $m \ge 3$, with initial value $c_2 = 1$. The solution is
$$c_m = \sum_{j=0}^{m-2} \alpha_n^j \le (m-1)\,\alpha_n^{m-2}.$$
If $m \le 2^{n-1}$, we have
$$c_m \le (m-1)\left(1 + \frac{1}{2^{n-1}}\right)^{m-2} \le (m-1)\left(1 + \frac{1}{m}\right)^{m} \le e\,(m-1) < e\,m.$$
This completes the proof: $\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le e\, m\, 4^{-n}$ for $m = 3, 4, 5, \dots$, as long as $n \ge 1 + \log_2 m$.

C.2. Proof of Proposition 3 over $[0, 1]^d$

Now we are ready to prove Proposition 3 over $[0, 1]^d$. For completeness, we show the following theorem.

Proposition 13 (Proposition 3). Let $x = (x_1, x_2, \dots, x_d) \in [0, 1]^d$, $\alpha \in \mathbb{N}^d$, and let $x^\alpha$ be any given monomial of degree $p$. Then:
(1) There is a ResNet $R_1 \in \mathcal{RN}(d+1, 4, O(d \log(d/\varepsilon)))$ such that $\|R_1 - x_1 x_2 \cdots x_d\|_{C([0,1]^d)} < \varepsilon$, while having $O(d \log(d/\varepsilon))$ tunable weights. Moreover, there is a ResNet in $\mathcal{RN}(d+1, 4, O(nd))$ that generates $\pi_n^d(x_1, x_2, \dots, x_d)$ exactly.
(2) There is a ResNet $R_2 \in \mathcal{RN}(d+3, 4, O(p \log(p/\varepsilon)))$ such that $\|R_2 - x^\alpha\|_{C([0,1]^d)} < \varepsilon$, while having $O(p \log(p/\varepsilon))$ tunable weights.

Proof. We prove (1) first. Set the right-hand side of inequality (16) equal to $\varepsilon$ and note that $m = d$ under the conditions of (1); then $n$ is of order $\log(d/\varepsilon)$. We now construct the required ResNet with width $4$ and depth $O(nd)$ generating $\pi_n^d(x_1, x_2, \dots, x_d)$. Let $A_{d+1}(x) = (x_1, x_2, \dots, x_d, 0)$. It follows from the proof of Proposition 11 that we can assign weights to the first $2n$ blocks so as to output
$$z^{[2n]} = (-, -, x_3, x_4, \dots, x_d, \pi_n(x_1, x_2))$$
while only changing the values of the first, second, and last neurons in each linear layer. In the next block, we set the values of the first and second neurons to zero, obtaining $z^{[2n+1]} = (0, 0, x_3, x_4, \dots, x_d, \pi_n(x_1, x_2))$; the zero-valued neurons are ready to store the results of the next phase. Then, in the next $2n$ blocks, we can compute $\pi_n^3(x_1, x_2, x_3)$; concretely, by the proof of Proposition 11,
$$z^{[4n+1]} = (0, \pi_n^3(x_1, x_2, x_3), -, x_4, \dots, x_d, -).$$
By repeating the operation above, we can use about $(2n+1)(d-1)$ blocks in total, with width $4$, to generate $\pi_n^d(x_1, x_2, \dots, x_d)$. Moreover, there are about $4d + 4$ weights in each building block; however, from the operations above, only a constant number of weights are non-zero in each block. Therefore, this network has at most $c\,dn$ tunable weights, where $c$ is an absolute constant.
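The iterated construction can be tested directly. The sketch below (our own illustration; helpers repeated for self-containment, and the sample-based check is only a sanity test, not a proof) chains the pairwise `pi` into $\pi_n^d$ as in (15) and compares the observed error with the bound $e\,d\,4^{-n}$ of Proposition 12.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

def pi_multi(n, xs):
    # pi^m_n(x_1, ..., x_m) = pi_n(pi^{m-1}_n(x_1, ..., x_{m-1}), x_m), as in (15)
    out = pi(n, xs[0], xs[1])
    for xk in xs[2:]:
        out = pi(n, out, xk)
    return out

rng = np.random.default_rng(0)
d, n = 6, 5                                  # the bound of Proposition 12 needs n >= 1 + log2(d)
xs = rng.uniform(0.0, 1.0, size=(d, 100_000))
err = np.max(np.abs(pi_multi(n, xs) - np.prod(xs, axis=0)))
print(err, np.e * d * 4.0 ** (-n))           # observed error vs. the bound e*d*4^{-n} in (16)
```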
For (2), if $p \le d$, the argument is the same as in (1), so assume $p > d$. We just need to note that $x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$ can be approximated by
$$\pi_n^d\big(\pi_n^{\alpha_1}(x_1, \dots, x_1), \dots, \pi_n^{\alpha_d}(x_d, \dots, x_d)\big) = \pi_n^p(\underbrace{x_1, \dots, x_1}_{\alpha_1 \text{ times}}, \underbrace{x_2, \dots, x_2}_{\alpha_2 \text{ times}}, \dots, \underbrace{x_d, \dots, x_d}_{\alpha_d \text{ times}}). \qquad (21)$$
Thus, we must store the values of $x_1, \dots, x_d$ in each building block. However, in the proof of (1), once we finish computing $\pi_n(x_1, x_2)$ we lose the values of $x_1$ and $x_2$ in the neurons. That is why we need $d + 3$ neurons in the linear layer in this case: the two extra neurons in the linear layer help us preserve $x_1, \dots, x_d$ in each linear layer. Here we briefly describe a constructive ResNet generating (21) exactly. Let $A_{d+3}(x) = (x, x_1, x_1, 0)$. By the proof of Proposition 11, we can compute $\pi_n(x_1, x_1)$ using $2n$ blocks and obtain $z^{[2n]} = (x, -, -, \pi_n(x_1, x_1))$ at the $2n$-th block. In the next block, we compute $z^{[2n+1]} = (x, 0, x_1, \pi_n(x_1, x_1))$. Then, in the next $2n$ blocks, we compute $\pi_n^3(x_1, x_1, x_1) = \pi_n^2(x_1, \pi_n(x_1, x_1))$ and hence obtain $z^{[4n+1]} = (x, \pi_n^3(x_1, x_1, x_1), -, -)$. Repeating this procedure, we can use $O(pn)$ blocks to generate (21). Similarly, by inequality (16) with $m = p$, the depth is $O(p \log \frac{p}{\varepsilon})$ if the desired accuracy is $\varepsilon$. Moreover, only a constant number of weights are non-zero in each block, so the total number of non-zero (tunable) weights is $O(p \log \frac{p}{\varepsilon})$. An illustration is given in Figure 8.

Figure 8. Illustration of generating $\pi_n^d(x_1, \dots, x_d)$ by the constructive ResNet, using $O(n)$ blocks per pairwise product and $O(nd)$ blocks in total. The topmost $d$ neurons are used for storing values. Grey represents the linear layer and yellow represents the activation layer.

Following this theorem, we add some discussion on why monomials are important. Monomials are the essential constituents of polynomials, which play an integral role in both theory and applications. Additionally, monomials possess a straightforward mathematical structure, which aids in analyzing and comparing the approximation capabilities of neural networks. Numerous intriguing studies have been conducted in this area. Both deep (Yarotsky, 2017) and shallow (Blanchard & Bennouna, 2021) ReLU networks can efficiently approximate monomials over $[0, 1]^d$ with $\mathrm{poly}(d)$ neurons. However, for monomials over $[0, M]^d$ ($M > 1$), shallow ReLU networks require at least $\exp(d)$ neurons to reach accuracy $\varepsilon$ (Shapira, 2023). Further, the cost of shallow networks is reduced to $\mathrm{poly}(d)$ if the monomial is normalized over $[0, M]^d$ (multiplied by a normalization constant $M^{-p}$). A more comprehensive discussion can be found in (Shapira, 2023).

C.3. Proof of Proposition 3

In this subsection, we extend Proposition 13 to $[-M, M]^d$, where $M > 1$. First, we extend Lemma 9 to $[-1, 1]$.

Lemma 14. Let $x, y \in [-1, 1]$. Then $\pi_n(x, y) \in [-1, 1]$.

Proof. Since $x, y \in [-1, 1]$, it suffices to show
$$|\pi_n(x, y)| = \left|S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right)\right| \le 1. \qquad (22)$$
First, we show that $S_n$ is monotonically increasing over $[0, 1]$. For any $0 \le x \le y \le 1$, subadditivity of $\zeta$ gives
$$h_i(y) = h_i\big(x + (y - x)\big) = \zeta\big(2^{i-1}(x + (y - x))\big) \le \zeta(2^{i-1} x) + \zeta\big(2^{i-1}(y - x)\big) = h_i(x) + h_i(y - x),$$
so that
$$S_n(x) - S_n(y) = x - y + \sum_{i=1}^{n} 4^{-i}\big(h_i(y) - h_i(x)\big) \le x - y + \sum_{i=1}^{n} 4^{-i} h_i(y - x) = -S_n(y - x) \le -(y - x)^2 \le 0.$$
Note that $|x + y| \ge |x - y|$ is equivalent to $xy \ge 0$, so by the monotonicity of $S_n$ it suffices to consider the following two cases to show (22). If $xy \ge 0$, then
$$\pi_n(x, y) = S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) \le \frac{|x+y|}{2} - \left(\frac{|x-y|}{2}\right)^2 = \frac{1}{4}\big(|x|(2 - |x|) + |y|(2 - |y|) + 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
If $xy < 0$, then
$$-\pi_n(x, y) = S_n\!\left(\frac{|x-y|}{2}\right) - S_n\!\left(\frac{|x+y|}{2}\right) \le \frac{|x-y|}{2} - \left(\frac{|x+y|}{2}\right)^2 = \frac{1}{4}\big(|x|(2 - |x|) + |y|(2 - |y|) - 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
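A quick numerical sanity check of Lemma 14 (our own illustration, helpers repeated for self-containment; Proposition 15 below gives the corresponding error bound) evaluates $\pi_n$ on a grid of $[-1, 1]^2$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

g = np.linspace(-1.0, 1.0, 401)
X, Y = np.meshgrid(g, g)
for n in (3, 6):
    P = pi(n, X, Y)
    print(n, "range:", P.min(), P.max())                    # stays inside [-1, 1] (Lemma 14)
    print("  max |pi_n - xy|:", np.max(np.abs(P - X * Y)))  # still approximates xy on [-1, 1]^2
```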
Thus, $\pi_n^m$ is well-defined over $[-1, 1]$ (equation (15)). Next, we show that Proposition 12 holds for $x_1, x_2, \dots, x_m \in [-1, 1]$.

Proposition 15.
$$|\pi_n^m(x_1, x_2, \dots, x_m) - x_1 x_2 \cdots x_m| \le e\, m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [-1, 1], \qquad (23)$$
as long as $n \ge 1 + \log_2 m$.

Proof. For $x, y \in [-1, 1]$, the only change in the proof is equation (18). We note
$$\frac{\partial}{\partial x} S_n\!\left(\frac{|x+y|}{2}\right) = \mathbb{1}\{x + y \ge 0\}\,\tfrac{1}{2}(x + y + \varepsilon_1) - \mathbb{1}\{x + y < 0\}\,\tfrac{1}{2}(-x - y + \varepsilon_2) = \tfrac{1}{2}(x + y + \varepsilon),$$
where $|\varepsilon_1|, |\varepsilon_2|, |\varepsilon| \le 2^{-n}$. Then we can still obtain
$$\|\partial_i \pi - \partial_i \pi_n\|_{L^\infty([-1,1]^2)} \le 2^{-n+1}, \quad \text{where } \partial_i := \frac{\partial}{\partial x_i},\ i = 1, 2, \qquad (24)$$
and the result follows from the proof in Subsection C.1.

Then Proposition 3 follows easily from the following remark.

Remark 16 (Proposition 3). Let $x_1, x_2, \dots, x_m \in [-M, M]$. To approximate $x_1 x_2 \cdots x_m$, we consider the function
$$\hat\pi_n^m(x_1, \dots, x_m) := M^m\, \pi_n^m\!\left(\frac{x_1}{M}, \dots, \frac{x_m}{M}\right)$$
with the approximation accuracy
$$|x_1 x_2 \cdots x_m - \hat\pi_n^m(x_1, \dots, x_m)| \le e\, m\, M^m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [-M, M], \qquad (25)$$
as long as $n \ge 1 + \log_2 m$. Moreover, $\hat\pi_n^m$ can be generated by a ResNet $R(x) \in \mathcal{RN}(m+1, 4, O(mn))$ with at most $O(mn)$ tunable weights.

Proof. The remark is a direct corollary of the proof of Proposition 13.

Setting the right-hand side of equation (25) equal to $\varepsilon$ with $m = p$, where $p$ is the degree of the monomial, completes the proof of Proposition 3.

C.4. Proof of Theorem 4

Proof. By Proposition 3, for each $x^\alpha$, $\alpha \in E$, we can use a ResNet with width $4$, depth $O(p \log p/\varepsilon)$, and $d + 3$ neurons in each linear layer to output $R_\alpha(x)$ such that $|R_\alpha(x) - x^\alpha| < \varepsilon$ for $x \in [0, 1]^d$. Hence
$$\Big|\sum_{\alpha \in E} c_\alpha R_\alpha(x) - \sum_{\alpha \in E} c_\alpha x^\alpha\Big| < \varepsilon \sum_{\alpha \in E} |c_\alpha|, \quad x \in [0, 1]^d.$$
Now let $A_{d+4}(x) = (x, x_1, x_1, 0, 0)$. To generate each $R_\alpha(x)$, we need $4$ more computational units in each linear layer and depth $O(p \log p/\varepsilon)$, while having $O(p \log p/\varepsilon)$ non-zero weights; we then store the value of $R_\alpha(x)$ in the last neuron of each linear layer. In this way we can finally output $(x, -, -, -, \sum_{\alpha \in E} c_\alpha R_\alpha(x))$ with depth $O(p|E| \log(p/\varepsilon))$, while having $O(p|E| \log(p/\varepsilon))$ non-zero weights in total.

D. Proof of Theorem 5

In this section, we prove Theorem 5. Before that, we supplement the definition of the Sobolev spaces used in Subsection 4.2.

D.1. Definition supplement of Sobolev spaces

For $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$ and $x = (x_1, \dots, x_d) \in [0, 1]^d$, define
$$D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}, \quad \text{where } |\alpha| = \alpha_1 + \cdots + \alpha_d.$$
Let $r \in \mathbb{N}_+$. The Sobolev space $\mathcal{W}^{r, \infty}([0, 1]^d)$ is the set of functions belonging to $C^{r-1}([0, 1]^d)$ whose $(r-1)$-th order derivatives are Lipschitz continuous, with the norm
$$\|f\|_{\mathcal{W}^r} := \max_{\alpha : |\alpha| \le r}\ \operatorname{ess\,sup}_{x \in [0, 1]^d} |D^\alpha f(x)| < \infty.$$
We denote by $U^r([0, 1]^d)$ the unit ball of $\mathcal{W}^{r, \infty}([0, 1]^d)$, i.e., $U^r([0, 1]^d) = \{f \in \mathcal{W}^{r, \infty}([0, 1]^d) : \|f\|_{\mathcal{W}^r} \le 1\}$. Here $\operatorname{ess\,sup} f = \inf\{a \in \mathbb{R} : \mu(\{x : f(x) > a\}) = 0\}$, where $\mu$ is the Lebesgue measure.

D.2. Proof of Theorem 5

We follow the proof of Theorem 1 in (Yarotsky, 2017). In our proof, we skip some details and focus on the ResNet constructions; the omitted details can be found in Theorem 1 of (Yarotsky, 2017). Now let $f \in U^r([0, 1]^d)$ and $\alpha \in \mathbb{N}^d$. Let $N$ be a positive integer to be determined and $m = (m_1, \dots, m_d) \in \{0, 1, \dots, N\}^d$. The function $\phi_m$ is defined as the product
$$\phi_m(x) = \prod_{k=1}^{d} \psi\!\left(3N\Big(x_k - \frac{m_k}{N}\Big)\right), \qquad \psi(x) = \begin{cases} 1, & |x| < 1, \\ 0, & 2 < |x|, \\ 2 - |x|, & 1 \le |x| \le 2. \end{cases}$$
Let f1(x) = X m {0,...,N}d α:|α|