Characterizing ResNet's Universal Approximation Capability

Chenghao Liu 1, Enming Liang 1, Minghua Chen 1

1School of Data Science, City University of Hong Kong. Correspondence to: Minghua Chen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Since its debut in 2016, ResNet has become arguably the most favorable architecture in deep neural network (DNN) design. It effectively addresses the gradient vanishing/exploding issue in DNN training, allowing engineers to fully unleash DNNs' potential in tackling challenging problems in various domains. Despite its practical success, an essential theoretical question remains largely open: how well/best can ResNet approximate functions? In this paper, we answer this question for several important function classes, including polynomials and smooth functions. In particular, we show that ResNet with constant width can approximate any Lipschitz continuous function with Lipschitz constant $\mu$ using $O\!\left(c(d)(\varepsilon/\mu)^{-d/2}\right)$ tunable weights, where $c(d)$ is a constant depending on the input dimension $d$ and $\varepsilon > 0$ is the target approximation error. Further, we extend this result to Lebesgue-integrable functions, with the upper bound characterized by the modulus of continuity. These results indicate a factor-of-$d$ reduction in the number of tunable weights compared with the classical results for ReLU networks. Our results are also order-optimal in $\varepsilon$, thus achieving the optimal approximation rate, as they match a generalized lower bound derived in this paper. This work adds to the theoretical justifications for ResNet's stellar practical performance.

1. Introduction

One trend in deep learning in the past decade is the use of larger and deeper neural networks to process higher-dimensional data. However, for deep neural networks, training is notoriously difficult due to gradient vanishing (Glorot & Bengio, 2010). In 2016, the emergence of ResNet (He et al., 2016) addressed the issue of gradient vanishing or exploding encountered during neural network training, and ResNet has achieved outstanding performance in applications. The practical success of ResNet naturally leads to an essential theoretical question: how well can ResNet approximate functions? Along this line, a milestone result in (Lin & Jegelka, 2018) shows the universal approximation capability of ResNet (even with one neuron per layer): it can approximate any Lebesgue-integrable function arbitrarily well as the number of tunable weights goes to infinity. This result gives theoretical justification to using ResNet to approximate general functions and spurs a number of follow-up studies. For example, the authors in (Oono & Suzuki, 2019) explore the approximation capabilities of ResNet-type convolutional neural networks, establishing that they can approximate smooth functions with a comparable quantity of tunable weights as their feed-forward ReLU network counterparts. However, the question of whether this represents the optimal use of ResNet's resources remains unanswered. We delve deeper into this and other related research in Section 1.1.

Despite these exciting results, it remains largely open today to fully characterize the universal approximation capability of ResNet. That is, how many tunable weights are needed for ResNet with optimized structures to approximate a function up to an error $\varepsilon$?
In this paper, we seek answers to the above question by developing upper/lower bounds on the number of tunable weights needed for ResNet with constant width to approximate popular classes of functions. We summarize our contributions as follows.

In Sec. 3, we explicitly establish the relationship between ResNet and feedforward networks (FNNs) (see Proposition 1). We show that a ResNet can be viewed as an FNN, and we thus derive lower bounds on the number of tunable parameters needed for ResNet to approximate various classes of functions.

In Sec. 4, we show that ResNet with constant width, by leveraging specific tunable weights, can approximate functions in various function classes. These include monomials of degree $p$, requiring $O(p\log(p/\varepsilon))$ weights; polynomials of degree $p$, needing $O(p\cdot\#\text{terms}\cdot\log(p/\varepsilon))$ weights¹; and smooth functions differentiable up to degree $r$, needing $O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$ weights². In addition, we show that ResNet with one neuron per hidden layer can generate any continuous piecewise linear function. Last but not least, we derive a tight upper bound on the number of tunable parameters of ResNet for approximating Lebesgue-integrable functions over $[0,1]^d$, namely $O_d\!\left(\omega_f^{-1}(\varepsilon)^{-d/2}\right)$, where $\omega_f(t) := \sup\{|f(x)-f(y)| : \|x-y\|_2 \le t\}$ is the modulus of continuity and $\omega_f^{-1}(r) := \sup\{t : \omega_f(t) \le r\}$. Moreover, if $f$ is Lipschitz continuous with Lipschitz constant $\mu$, the upper bound becomes $O\!\left(c(d)(\varepsilon/\mu)^{-d/2}\right)$. Besides, our bounds are explicit in all related parameters, including the input dimension, the target function space, and the desired accuracy.

¹The notation # is an abbreviation of "number", and #terms refers to the number of terms in the polynomial.

Table 1. A summary of existing and our results on the approximation rate of the ResNet structure. The upper bound refers to the number of parameters needed by the network architecture to approximate any function in the target function space to an error $\varepsilon$.

| Paper | Function Space | Network Architecture | Upper Bound | Optimality |
| --- | --- | --- | --- | --- |
| (Lin & Jegelka, 2018) | Lipschitz continuous functions over $[0,1]^d$ | ResNet with one neuron per layer | $O_d(\varepsilon^{-d})$ | Suboptimal |
| (Oono & Suzuki, 2019) | Lipschitz continuous functions over $[0,1]^d$ | ResNet CNN with $O(1)$ channels | $O_d(\varepsilon^{-d})$ | Suboptimal |
| Ours (Theorem 7) | Lipschitz continuous functions over $[0,1]^d$ | ResNet with constant (= 4) width | $O_d(\varepsilon^{-d/2})$ | Optimal |

Note that (Lin & Jegelka, 2018) and our Theorem 7 consider approximating Lebesgue-integrable functions and (Oono & Suzuki, 2019) considers the Hölder class; all of them include the Lipschitz functions used for the comparison. The bound is optimal in terms of the entropy limitation; one could refer to Theorem 3 of (Yarotsky, 2017) or (Yarotsky & Zhevnerchuk, 2020). However, the upper bound is not optimal in terms of the Vapnik-Chervonenkis (VC) dimension (Shen et al., 2022b; Siegel, 2023). Our derived rate is explicit in all input-related parameters such as $d$.

We highlight our results in Theorem 7, which achieves the optimal upper bound on the number of parameters of ResNet for approximating Lebesgue-integrable functions, even by ResNet with constant width. This is a non-trivial extension of the work (Lin & Jegelka, 2018), which shows that for a Lebesgue-integrable function $f$ over $[0,1]^d$, there exists a ResNet $R$ with one neuron per layer and not more than $O\!\left(\omega_f^{-1}(\varepsilon)^{-d}\right)$ parameters such that $\|f - R\| \le \varepsilon$.
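As a quick illustration of the quantities in the bounds above (ours, not from the paper), the following NumPy snippet estimates the modulus of continuity $\omega_f$ of a sample Lipschitz function by Monte-Carlo sampling and compares it with the Lipschitz bound $\mu t$; this is exactly the step that turns the general bound $O_d(\omega_f^{-1}(\varepsilon)^{-d/2})$ into the Lipschitz bound $O(c(d)(\varepsilon/\mu)^{-d/2})$. The sample function and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu = 2, 3.0

def f(x):
    # A sample function with Lipschitz constant at most mu on [0,1]^d.
    return mu * np.sin(x[..., 0]) * np.cos(x[..., 1])

def omega(t, n=200_000):
    # Monte-Carlo estimate of omega_f(t) = sup{|f(x)-f(y)| : ||x-y||_2 <= t}.
    x = rng.random((n, d))
    u = rng.normal(size=(n, d))
    u *= (t * rng.random((n, 1))) / np.linalg.norm(u, axis=1, keepdims=True)
    y = np.clip(x + u, 0.0, 1.0)
    return np.abs(f(x) - f(y)).max()

for t in (0.1, 0.2, 0.4):
    print(f"omega_f({t}) ~= {omega(t):.3f}   (Lipschitz bound mu*t = {mu*t:.3f})")
# For a mu-Lipschitz f we have omega_f(t) <= mu*t, hence omega_f^{-1}(eps) >= eps/mu,
# and the O_d(omega_f^{-1}(eps)^{-d/2}) weight count is at most O(c(d)(eps/mu)^{-d/2}).
```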
Our results further establish that Res Net with constant width can approximate any Lebesgue-integrable function over [0, 1]d to an error ε with an ε-order optimal upper bound O(ω 1 f (ε) d/2) of the number of parameters needed. We summarize the comparison in Table 1. These findings add to the theoretical justifications for Res Net s outstanding practical performance, and shed light on analysis for further research on NN design optimization. 2The notation a(ε) = O(g(ε)) means a(ε) Cg(ε) for sufficiently small ε where C is a constant independent of ε. Importantly, throughout this paper, we employ the notation Od( ) to underscore the hidden constant C depending on d. 1.1. Related Work In recent years, the expressive capabilities of various neural network architectures have garnered increased attention, spurred by their remarkable and noteworthy successes. In this subsection, we will discuss previous research on the topic through the lens of approximation theory. Universality. The universality, i.e., universal approximation property, of a function family implies that this family is dense in the space of continuous functions, meaning it can approximate any continuous function to an arbitrary precision. In the earlier years, (Cybenko, 1989; Hornik, 1991; Leshno et al., 1993; Pinkus, 1999) made a groundbreaking argument by demonstrating that shallow neural networks equipped with suitable activation functions such as sigmoid, non-polynomial functions, possess universal approximation properties. In recent years, the universality of narrow deep networks has also attracted considerable attention. (Hanin & Sellke, 2017) determined that a deep Re LU neural network must have a minimum width of d + 1 to ensure universality, where d is the input dimension. (Kidger & Lyons, 2020; Park et al., 2020; Cai, 2022) later showed that deep narrow networks with reasonable activation functions can achieve universality, providing the minimum width of neural networks for achieving the universal approximation property. Over the past decades, a variety of network architectures have been developed to cater to diverse tasks and objectives, extending beyond feedforward Re LU networks. The universal approximation theorem has been shown for multiple network architectures, including standard Re LU convolutional neural networks (CNNs) (Zhou, 2018; 2020), 2D Re LU CNNs with classical structures (He et al., 2022), continuous-time recurrent neural network (RNN) (Li et al., 2020; 2022b), Res Net (Lin & Jegelka, 2018; Li et al., 2022a), and Transformers (Yun et al., 2019). Characterizing Res Net s Universal Approximation Capability Approximation Capabilities. There has been substantial progress in enhancing our theoretical understanding of neural networks. Some studies have focused on comparing the expressive power of both shallow and deep neural networks, examining their respective capabilities (e.g., (Arora et al., 2016; Eldan & Shamir, 2016; Liang & Srikant, 2016; Telgarsky, 2016; Yarotsky, 2017; Poggio et al., 2017)). Some others have quantified the number of linear regions within deep neural networks, casting light on their complexity (e.g., (Montufar et al., 2014; Serra et al., 2018; Arora et al., 2016)). Besides, constructive methods have been utilized to probe the approximation capabilities across different function classes. 
Notably, researchers have delved into the optimal approximation of continuous functions (e.g., (Shen et al., 2022b; Yarotsky, 2018)), the optimal approximation of smooth functions (e.g., (Yarotsky, 2017; Lu et al., 2021; Montanelli & Du, 2019)), and the approximation of analytic functions (e.g., (Wang et al., 2018; Schwab & Zech, 2021)). Recently, there is some interesting work on the special network architecture like parameters sharing (Zhang et al., 2023) and nested network (Zhang et al., 2022). These diverse investigations collectively deepen our understanding of the potential and constraints of neural networks in approximating various functions. Perspectives on the Curse of Dimensionality. The curse of dimensionality coined by (Bellman, 1957) refers to a phenomenon that a model class will suffer an exponential increase in its complexity as the input dimension increases. Importantly, the curse of dimensionality, not limited to FNNs, is also a challenge for almost all classes of function approximators due to the entropy limitation (Kolmogorov & Tikhomirov, 1959). More specifically, any continuous function approximator (refer to Ssec. 2.2 for detailed explanations) will suffer the curse of dimension in the H older space Cr([0, 1]d) (De Vore et al., 1989) because the metric entropy in Cr([0, 1]d) with respect to the uniform topology is Θ(ε d/r). The property is applied to Re LU neural networks in Thm. 3 (Yarotsky, 2017). In an attempt to mitigate the curse of dimensionality, initial strategies involved the consideration of smaller function classes whose metric entropy is expected to reduce such as analytical functions (Wang et al., 2018), bandlimited functions (Montanelli et al., 2019), Korobove space (Montanelli & Du, 2019), a space derived by Kolmogorov Superposition Theorem(KST) (Lai & Shen, 2021; He, 2023). More recently, researchers have shifted their focus toward the structure of neural networks, suggesting a potential solution to circumvent the curse of dimensionality. Simultaneously, a more recent trend aims to serve neural networks as discontinuous function approximators, thereby examining neural networks with novel activation functions (e.g. (Shen et al., 2020; Jiao et al., 2023; Shen et al., 2021; 2022a)). However, the failure of these model classes in practice is due to the discontinuity of the function approximators, wherein even minor perturbations in the training data can lead to chaotic changes in the input-output relationship. Consequently, to circumvent the curse of dimensionality, it is imperative to make appropriate choices within the unstable model class and the restricted objective function space. Res Net Architecture. Since Res Net s first appearance, it has made a big success in practice and has become a core component of popular structures like Transformer (Vaswani et al., 2017), inspiring much research for theoretical understanding. (Hardt & Ma, 2016) show that Res Net can represent any classifier on any finite sample perfectly, i.e., it can represent any discrete function. Later, (Lin & Jegelka, 2018) extended this discrete setting to continuous, which shows that Res Net with one neuron per layer can approximate any Lebesgue-integrable function. The authors in (Oono & Suzuki, 2019) derive approximation and estimation error rates for Res Net-type CNN using O(1) channels. They show a block-sparse FNN can be realized by a Res Net-type CNN and then study the approximation capabilities of blocksparse FNNs. Our paper differs in this regard. 
We show that ResNet can be implemented by ReLU FNNs (Prop. 1), and we utilize this relationship to establish lower bounds on the approximation capability of ResNets. We establish the upper bounds by directly constructing ResNets to approximate different function classes. Since CNNs and FNNs are different structures, the results in (Oono & Suzuki, 2019) cannot be translated to ours. Notably, while they show that ResNet-type CNNs with not more than $O(\varepsilon^{-d})$ parameters can approximate Lipschitz functions to an error $\varepsilon$, our Theorem 7 shows that $O(\varepsilon^{-d/2})$ parameters are adequate using ResNet with constant width. We summarize the comparison of the results in Table 1. Recently, (He et al., 2022) studied the approximation properties of CNNs and showed the universal approximation property of shallow ResNet-type CNNs with a large number of channels.

2. Mathematical Modeling of ResNet and Proof Ideas

2.1. Mathematical modeling of ResNet

ResNet's original proposal (He et al., 2016) includes complex structures such as convolutional layers. In this paper, we focus on ResNet-type FNNs, as this is a fundamental structure and is enough to achieve strong approximation capability; unless otherwise specified, we refer to ResNet-type FNNs simply as ResNet. A ResNet consists of residual blocks and identity shortcut connections. Specifically, with the basic notations of addition $(f+g)(x) = f(x)+g(x)$ and composition $f\circ g(x) = f(g(x))$ of mappings $f, g$, a ResNet is a function $R(x)$ from $\mathbb{R}^d$ to $\mathbb{R}$ given by

$$R(x) = \mathcal{L} \circ (T^{[L]} + \mathrm{Id}) \circ \cdots \circ (T^{[1]} + \mathrm{Id}) \circ A_Q(x), \qquad (1)$$

where $A_Q : \mathbb{R}^d \to \mathbb{R}^Q$ and $\mathcal{L} : \mathbb{R}^Q \to \mathbb{R}$ are affine transformations, $\mathrm{Id} : z \mapsto z$ is the identity mapping, and $T^{[i]}$ ($i = 1, 2, \dots, L$) are basic residual blocks. Each block $T^{[i]} : \mathbb{R}^Q \to \mathbb{R}^Q$ further consists of two layers: an activation layer $\mathbb{R}^Q \to \mathbb{R}^{n_i} : z \mapsto \sigma(W_i z + b_i)$ with $n_i$ neurons, and a linear layer $\mathbb{R}^{n_i} \to \mathbb{R}^Q : \sigma(W_i z + b_i) \mapsto V_i\sigma(W_i z + b_i)$ with $Q$ neurons, where the parameters are $W_i \in \mathbb{R}^{n_i \times Q}$, $V_i \in \mathbb{R}^{Q \times n_i}$, $b_i \in \mathbb{R}^{n_i}$ ($i = 1, 2, \dots, L$), and $\sigma(\cdot)$ is the ReLU activation function applied elementwise, i.e., $\sigma(x_1, x_2, \dots, x_n) = (\max\{0, x_1\}, \dots, \max\{0, x_n\})$ for $x_1, x_2, \dots, x_n \in \mathbb{R}$. To this end, we write $T^{[i]}(z) = V_i\sigma(W_i z + b_i)$.

Figure 1. An example of a ResNet from $\mathbb{R}^2$ to $\mathbb{R}$ belonging to $\mathcal{RN}(Q = 3, N = 2, L = 3)$. Every residual block is composed of an activation layer followed by a linear layer. The activation layer neurons are colored yellow, while the linear layer neurons are grey. The maximum number of neurons in each activation layer is 2, each linear layer has 3 neurons, and there are 3 blocks.

We use $L$ to denote ResNet's depth, defined as the number of residual blocks³, and its width $N$ is defined as the maximum number of activation layer neurons, i.e., $\max\{n_1, n_2, \dots, n_L\}$. For conciseness of notation, we denote by $\mathcal{RN}(Q, N, L)$ the set of ResNet functions from $\mathbb{R}^d$ to $\mathbb{R}$ with width $N$, depth $L$, and $Q$ neurons in each linear layer. To keep our study interesting, we always assume $Q \ge d$, where $d$ is the input dimension; otherwise, the universality of the ResNet does not hold⁴. For our later discussion, we call a ResNet a ResNet with bottleneck blocks (b-ResNet) if its width is an absolute constant, i.e., it belongs to $\mathcal{RN}(Q, N = C, L)$ where $C$ is an absolute constant (independent of the input dimension $d$). We have an example of a ResNet structure in Fig. 1.

³Note that we define the depth as the number of residual blocks because the activation and linear layers are always coupled together within each block. This definition differs from the one commonly used in applications, where the depth of a ResNet refers to the number of layers excluding the shortcuts; however, this difference does not affect the presentation of our results.

⁴If $Q < d$, a ResNet with one neuron per activation layer belongs to the set of narrow networks with width smaller than or equal to $d$ (Prop. 1). Therefore, the ResNet may lose universality, since narrow ReLU networks (with width smaller than $d+1$) cannot approximate all continuous functions on $\mathbb{R}^d$ (Hanin & Sellke, 2017).
2.2. Problem statement

In this paper, we care about the number of parameters of a ResNet needed for approximating a given function, which characterizes ResNet's approximation capabilities. Specifically, we give the following problem statement. Given $f : [0,1]^d \to \mathbb{R}$ belonging to some function space $\mathcal{F}$, and fixing $d \in \mathbb{N}^+$ and the ResNet model $\mathcal{H} = \mathcal{RN}(Q, N, L)$, we are interested in the following two questions.

Lower bound. What is the minimum number of weights of a ResNet required to approximate any $f \in \mathcal{F}$ to an error $\varepsilon$?

Upper bound. What number of weights of a ResNet architecture is sufficient to approximate any $f \in \mathcal{F}$ to an error $\varepsilon$? In other words, do there exist $Q, N, L$ such that $\inf_{\tilde f \in \mathcal{H}} \|f - \tilde f\| \le \varepsilon$ holds with a relatively small number of tunable parameters?

Here, tunable parameters refer to the non-zero parameters in a ResNet. Note that the answer to the two questions usually depends on the desired error $\varepsilon$, the input dimension $d$, and the function space $\mathcal{F}$. We usually consider the continuous function space $C([0,1]^d)$ under the uniform norm $\|f\|_\infty = \max_{x\in[0,1]^d} |f(x)|$, and the $L^p$-integrable function space $L^p([0,1]^d)$ under the norm $\|f\|_{L^p([0,1]^d)} = \big(\int_{[0,1]^d} |f(x)|^p\,dx\big)^{1/p}$, where we always assume $p \in [1,\infty)$ unless otherwise specified.

Our main results show that b-ResNet, i.e., ResNet in $\mathcal{RN}(Q, N, L)$ with $Q = d + c_1$, $N = c_2$ ($c_1, c_2$ absolute constants) and proper $L$, achieves powerful approximation capability in various function classes. It can approximate specific functions with fewer tunable weights than FNNs. Besides, b-ResNet can achieve optimal approximation for Lebesgue-integrable functions.
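To make the model in Equation (1) concrete, here is a minimal NumPy sketch (our own illustration, with random placeholder weights rather than the tuned weights used in the proofs) of a forward pass for a function in $\mathcal{RN}(Q, N, L)$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class ResNetRN:
    """Forward pass of a function in RN(Q, N, L) as in Equation (1):
    R(x) = L o (T[L] + Id) o ... o (T[1] + Id) o A_Q(x),  T[i](z) = V_i relu(W_i z + b_i)."""
    def __init__(self, d, Q, N, L, seed=None):
        rng = np.random.default_rng(seed)
        # Initial affine map A_Q : R^d -> R^Q.
        self.W0, self.b0 = rng.standard_normal((Q, d)), rng.standard_normal(Q)
        # Residual blocks: activation layer (W_i, b_i) with N neurons, linear layer V_i back to R^Q.
        self.blocks = [(rng.standard_normal((N, Q)), rng.standard_normal(N),
                        rng.standard_normal((Q, N))) for _ in range(L)]
        # Final affine output layer L : R^Q -> R.
        self.B, self.c = rng.standard_normal((1, Q)), rng.standard_normal(1)

    def __call__(self, x):
        z = self.W0 @ x + self.b0                 # z[0] = A_Q(x)
        for W, b, V in self.blocks:
            z = z + V @ relu(W @ z + b)           # z[i] = T[i](z[i-1]) + z[i-1]
        return (self.B @ z + self.c).item()       # R(x) = L(z[L])

# A b-ResNet instance with Q = d + 1 linear-layer neurons and constant width N = 4.
d = 8
R = ResNetRN(d, Q=d + 1, N=4, L=10, seed=0)
print(R(np.random.default_rng(1).random(d)))
```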
Remarks on continuous/discontinuous approximators. Since the difference between continuous and discontinuous approximators will play an important role in the lower bounds, we provide some explanation here to make it easy to follow. In approximation theory, we aim to approximate all functions in a space $\mathcal{F}$ using a model class as an approximator (e.g., neural networks). We achieve this by choosing different parameters for different functions, meaning the parameters $\theta \in \Theta$ can be seen as a mapping of the target functions, i.e., $\theta = h(f)$ where $h : \mathcal{F} \to \Theta$. If this mapping $h$ is continuous, we refer to the approximator as a continuous approximator. Conversely, discontinuous deep networks are characterized by a discontinuous mapping $h$. In this paper, the term arbitrary/unconstrained deep networks refers to discontinuous approximators, and a construction in a continuous (discontinuous) phase means the constructed neural networks are continuous (discontinuous) approximators. Significantly, the approximation power of continuous approximators is limited by metric entropy, while that of discontinuous approximators is limited by the Vapnik-Chervonenkis (VC) dimension (Goldberg & Jerrum, 1993).

2.3. Proof Ideas and Novelty

Lower Bound. We establish that ResNet can be conceptualized as a sparse FNN (Prop. 1), implying that the lower bound for ResNet must exceed that for FNN. We then derive generalized lower bounds for ResNet on the approximation of various function classes from the lower bounds for FNNs.

Upper Bounds for Approximation of Polynomials and Smooth Functions. Drawing inspiration from the work of (Yarotsky, 2017), where DNNs are constructed to approximate the functions $x^2$ and $xy$, we construct ResNets to approximate these fundamental functions. Unlike (Yarotsky, 2017), which computes the product function via $xy = ((x+y)^2 - x^2 - y^2)/2$, we select $xy = ((x+y)/2)^2 - ((x-y)/2)^2$, as utilized in some previous work, e.g., (Suh et al., 2022). This selection is more efficient for the construction and leads to fewer parameters. Next, we note that any polynomial can be expressed as a composition of the product function $xy$, and polynomials can approximate smooth functions due to the local Taylor expansion property. By constructing ResNets to approximate these functions, we derive upper bounds on the approximation of polynomials and smooth functions. These constructions are in a continuous phase, which leads to Proposition 3 and Theorems 4 and 5. Our results show that b-ResNet is enough to approximate these smooth functions and has fewer tunable weights than FNN.

Optimal Approximation for Lebesgue-Integrable Functions. Building upon the constructive techniques from (Shen et al., 2019; 2022b) on ReLU FNNs, we show how b-ResNet can achieve optimal approximation for Lebesgue-integrable functions (refer to Table 2 for the high-level steps). The basic idea is to construct a ResNet with depth $O(L)$ that generates a step function with many more steps, achieving a higher approximation rate. In our construction, we utilize linear layers for value storage and activation layers for intermediate computation. In each block, we designate a constant number of neurons in the activation layer for the computation related to one of the corresponding neurons in the linear layer, with the outcomes refreshed via identity mappings in the next block. In distinct blocks, activation layer neurons perform computations associated with different neurons from the linear layer. This procedure is replicated for each residual block, which is then sequentially combined to form a b-ResNet exhibiting large expressive power with fewer parameters than FNNs. More details can be found in Appendix F.

Table 2. High-level steps of constructing a b-ResNet to achieve optimal approximation, where the framework of constructive methods is from (Shen et al., 2019; 2022b). Details are in Appendix F. Here $f$ is the target (Lipschitz) function from $[0,1]^d$ to $\mathbb{R}$, each $Q_\beta$ is a small hypercube with side length $O(L^{-2/d})$, $x_\beta$ is a representative point of the cube $Q_\beta$, and $\Omega$ is a small enough region.

Step 1: Space partition. $[0,1]^d \setminus \Omega = \bigcup_{\beta \in \{0,1,\dots,L^{2/d}-1\}^d} Q_\beta$.
Step 2: Construct a b-ResNet $\Phi$ of depth $O(L)$ such that $\Phi(x) = \beta$ for $x \in Q_\beta$.
Step 3: Construct a b-ResNet $\phi$ of depth $O(L)$ such that $\phi(\beta) \approx f(x_\beta)$.
Step 4: Error estimation. The constructed b-ResNet $R(x) = \phi\circ\Phi(x) \approx f(x_\beta) \approx f(x)$, with $|f(x_\beta) - f(x)| = O(L^{-2/d})$ and $|R(x) - f(x_\beta)| = O(L^{-2/d})$.

It is important to note that the FNN results in (Shen et al., 2022b) cannot be directly extended to ResNet. Our novelty lies in the construction and analysis of b-ResNet, as well as in the approximation power analysis of ResNet in general. Our proof utilizes a new construction of ResNet blocks for function approximation.
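The following NumPy sketch (ours, and purely illustrative; it uses an explicit step function rather than the paper's ResNet realization of it) numerically mimics Steps 1-4 of Table 2: cubes of side roughly $L^{-2/d}$ give an $O(L^{-2/d})$ error for a Lipschitz target. The choice of $d$, $L$, and target function is arbitrary.

```python
import numpy as np

d, L = 3, 64
K = int(np.ceil(L ** (2.0 / d)))          # cubes per dimension, K ~ L^{2/d}

def f(x):                                  # a 1-Lipschitz target (illustrative)
    return np.sin(x.sum(axis=-1)) / np.sqrt(d)

def Phi(x):                                # Step 2: x -> multi-index beta of its cube
    return np.minimum((x * K).astype(int), K - 1)

def phi(beta):                             # Step 3: beta -> f at the cube's representative
    x_beta = (beta + 0.5) / K              # (here: the cube center)
    return f(x_beta)

rng = np.random.default_rng(0)
x = rng.random((100_000, d))
# Step 4: error of the step function R(x) = phi(Phi(x)); the paper additionally
# excludes a small region Omega, which is not needed for this direct evaluation.
err = np.abs(phi(Phi(x)) - f(x)).max()
print(f"K = {K}, max error = {err:.4f}, Lipschitz bound ~ sqrt(d)/K = {np.sqrt(d)/K:.4f}")
```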
Our work is the first to show that ResNet, even with constant width, can achieve optimal approximation for Lebesgue-integrable functions. Furthermore, the role of identity mappings is greatly leveraged in the construction. We uncover the extensive expressive power of b-ResNet, providing theoretical guarantees and insights into the successful performance of ResNet.

3. Lower Bounds on ResNet's Function Approximation Capability

In this section, we build the explicit relation between ResNet and FNNs to establish lower bounds on the complexity of ResNet for approximating polynomials and other function classes such as smooth functions. The lower bounds help us show analytically that our upper bounds in Sec. 4 are optimal in terms of $\varepsilon$. The proofs of this section are postponed to Appendix B.

First, we propose a key argument: a ResNet can be regarded as a special sparse ReLU network.

Proposition 1. For any ResNet $R(x) \in \mathcal{RN}(Q, N, L)$ with input $x \in [0,1]^d$ and $W$ tunable parameters, there exists an equivalent ReLU FNN $\Phi(x) : [0,1]^d \to \mathbb{R}$ with width $N + Q$, depth $2L$, and $W + 2QL$ tunable parameters, such that $R(x) = \Phi(x)$ for all $x \in [0,1]^d$.

This proposition implies that if a ResNet can approximate a function up to an error $\varepsilon$, then an FNN of larger size can also approximate the same function up to $\varepsilon$. Thus one can bound ResNet's universal approximation capability by studying that of a larger-size FNN (which might have an increase in tunable parameters by a factor of $d$). That said, the lower bound on the complexity of ResNet must be larger than or equal to that of FNN in terms of $\varepsilon$ when approximating the same function. Building on the above discussion, lower bounds for FNNs in the existing literature can therefore be applied to ResNets in terms of $\varepsilon$, as we discuss in the following.

Regarding the smooth space $C^r([0,1]^d)$, (Yarotsky, 2017) established lower bounds of $\Theta_r(\varepsilon^{-d/r})$⁵ for continuous ReLU network approximators and $\Theta_{r,d}(\varepsilon^{-d/(2r)})$ for unconstrained deep ReLU networks. Later, (Yarotsky, 2018) demonstrated optimal approximation error rates of $O_d(\omega_f(W^{-1/d}))$ for continuous ReLU network approximators and $O_d(\omega_f(W^{-2/d}))$ for unconstrained deep ReLU networks, where $W$ represents the number of tunable weights. If $f$ is Lipschitz continuous, the optimal upper bound for the ReLU FNN is $O_d(\varepsilon^{-d})$ for continuous weight selection and $O_d(\varepsilon^{-d/2})$ for unconstrained deep networks.⁶ We will return to these discussions in the next section.

⁵Here $f(\varepsilon) = \Theta_d(g(\varepsilon))$ means $f(\varepsilon) \ge Cg(\varepsilon)$ for sufficiently small $\varepsilon$, where $C$ is a constant that does not depend on $\varepsilon$ but can depend on $d$. The subscript $d$ on $\Theta$ emphasizes that the constant $C$ may depend on $d$.

⁶As pointed out in Sec. 2.2, the approximation power of continuous approximators is limited by metric entropy, while for discontinuous approximators it is limited by the VC dimension.

At the end of this section, we focus on the lower bound for the polynomial function space (a smaller space compared with continuous and smooth functions). The aforementioned lower bounds are not applicable here, and such a bound is not available in the existing literature. Below we give the lower bound on the complexity of ResNet for the approximation of polynomials; the result will be used in the next section to show that the upper bound in Thm. 4 is $\varepsilon$-order optimal. We begin with some notation. Let $x \in \mathbb{R}^d$ and $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_d)$ where $\alpha_i \in \mathbb{N}$. Define $x^\alpha = x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$; this is called a monomial.
The degree of the monomial is $|\alpha| := \alpha_1 + \alpha_2 + \cdots + \alpha_d$. A multivariate polynomial is then a sum of several monomials, and its degree is the highest degree among these monomials. We give the theorem below.

Theorem 2. Let $x = [x_1, x_2, \dots, x_d] \in [0,1]^d$ and let $\mathcal{P}(d, p)$ ($p \ge d$) be the set of $d$-dimensional polynomial functions of degree $p$. If a ReLU FNN $\Psi(x) : [0,1]^d \to \mathbb{R}$ with width $N$, depth $L$, and $T = NL$ neurons can approximate any $f \in \mathcal{P}(d, p)$ to an error $\varepsilon$, i.e., $|\Psi(x) - f(x)| < \varepsilon$ for all $x\in[0,1]^d$, then we have $T \ge \Theta_d(\log(1/\varepsilon))$.

Note that this lower bound can also be applied to ResNet. When we compare Theorem 2 with prior findings, we observe that the lower bound for the polynomial function space, in terms of $\varepsilon$, is notably smaller than the lower bound $\Theta_r(\varepsilon^{-d/r})$ established for smooth functions $C^r([0,1]^d)$. This is because polynomial functions constitute a substantially smaller subset of the smooth function space; consequently, they exhibit a reduced complexity in approximation, reflecting the inherent simplicity of their structure.

4. Upper Bounds on ResNet's Function Approximation Capability

In this section, we characterize ResNet's approximation capability by establishing upper bounds for important function classes. Moreover, our results show that b-ResNet can approximate polynomials and smooth functions with fewer tunable parameters than FNNs. This section is organized as follows. Subsection 4.1 presents upper bounds on the complexity of ResNet for approximating monomials. Subsection 4.2 then extends the results to polynomials and to smooth functions in Sobolev space, following the work of (Yarotsky, 2017). Subsection 4.3 discusses the properties of ResNet for approximating continuous piecewise linear functions. Last but not least, Subsection 4.4 shows that b-ResNet can achieve optimal approximation for Lebesgue-integrable functions.

4.1. Approximating Monomials

Our first key result shows that b-ResNet $\mathcal{RN}(Q = d + O(1), N = C, L)$ can approximate any monomial, where $L$ depends on $d$ and the desired error $\varepsilon$, and $C$ is an absolute constant independent of $d$. Importantly, this result explains why b-ResNet can approximate polynomials and smooth functions with fewer tunable parameters. The proofs of this section can be found in Appendix C.

Proposition 3. Let $x = [x_1, x_2, \dots, x_d] \in [-M, M]^d$ and $\alpha \in \mathbb{N}^d$, where $M \ge 1$, and let $x^\alpha$ be any given monomial of degree $p$, i.e., $|\alpha| = p$. Then there exists a b-ResNet $R \in \mathcal{RN}\!\left(d+3,\, 4,\, O(p\log(p/\varepsilon) + p^2\log M)\right)$ such that $\|R - x^\alpha\|_{C([-M,M]^d)} < \varepsilon$, while having $O\!\left(p\log(p/\varepsilon) + p^2\log M\right)$ tunable weights.

Note that the total number of weights of $R$ is $O(dp\log(p/\varepsilon))$ when $M = 1$. However, our constructive proof shows that each residual block only contains a constant number of non-zero weights; therefore, it suffices to adjust $O(p\log(p/\varepsilon))$ weights. This analysis also holds when $M > 1$. Furthermore, it is important to note that the upper bound in the above proposition is independent of $d$. In fact, the dimension $d$ is incorporated into $p$, because any monomial of degree $p$ can always be interpreted as a product function of dimension $p$. For instance, $x_1^2 x_2 x_3^2 = \pi(x_1, x_1, x_2, x_3, x_3)$, where $\pi(x_1, \dots, x_d) = x_1 x_2 \cdots x_d$ is the product function. Next, we highlight several key observations and discussions from the analysis and results.
ResNet vs FNNs. We show that ResNet is capable of approximating any monomial of degree $p$ on $[0,1]^d$ with $O(p\log(p/\varepsilon))$ tunable weights, a reduction by a factor of $d$ compared to ReLU FNNs. According to (DeVore et al., 2021), a ReLU network with width $O(d)$ and depth $O(p\log(p/\varepsilon))$ can approximate any monomial of degree $p$, resulting in a total weight count of $O(d^2 p\log(p/\varepsilon))$; according to their construction, there are $O(dp\log(p/\varepsilon))$ tunable weights. Thus, our result is a reduction by a factor of $d$. As a closing remark, while we compare the upper bounds here, it is an open and interesting direction to obtain lower bounds on the size of neural networks with respect to the input dimension $d$.

Root of reduction. Note that each identity mapping can be realized by $2d$ ReLU units (via $x = [x]_+ - [-x]_+$), but this alone may only lead to an additive reduction of $d$ tunable weights compared with an FNN. In our study, however, this additive reduction, as seen in our b-ResNet model, does translate into a multiplicative reduction by a factor of $d$. We establish this result by constructively proving that a b-ResNet with a constant number of tunable weights per residual block can approximate functions with the same accuracy as a ReLU FNN requiring $O(d)$ tunable weights in each layer. The high-level idea is the following. One can construct an FNN with width $d + O(1)$ that approximates a given function and, in each layer, uses $d$ neurons to store the value of the input. Hence, if we have an identity mapping, we can remove these $d$ neurons from each layer, with the identity mapping taking over the role of forwarding the input value (or input-related values). This yields a reduction by a factor of $d$.

Deep vs Shallow. The author of (Shapira, 2023) provides a lower bound on the complexity of shallow FNNs for approximating any non-normalized monomial over $[-M, M]^d$, which scales exponentially with $d$ (refer to Thm. 3 of (Shapira, 2023)). By Proposition 1, this lower bound also applies to shallow ResNets. Conversely, Proposition 3 gives a mild upper bound for deep ResNets which scales polynomially with $d$. This underscores the benefits of deep networks.

4.2. Approximating Polynomials and Smooth Functions

Polynomials are sums of monomials, and a smooth function can be approximated by a polynomial via the local Taylor expansion. In this subsection, we present the upper bounds for the approximation of polynomials (Thm. 4) and smooth functions (Thm. 5). The proofs of the two theorems can be found in Appendices C and D.

Theorem 4. Let $x = [x_1, x_2, \dots, x_d] \in [0,1]^d$. For a multivariate polynomial $P(x)$ of degree $p$, i.e., $P(x) = \sum_{\alpha\in E} c_\alpha x^\alpha$ where $E = \{\alpha\in\mathbb{N}^d : |\alpha| \le p\}$, there exists a ResNet $R \in \mathcal{RN}\!\left(d+4,\, 4,\, O(p|E|\log(p/\varepsilon))\right)$ such that $|R(x) - P(x)| < \varepsilon \sum_{\alpha\in E} |c_\alpha|$ for all $x\in[0,1]^d$. Additionally, this ResNet has $O(p|E|\log(p/\varepsilon))$ tunable weights.

Note that $|E| \le \binom{p+d}{p}$, which can grow exponentially in $d$ when $p$ or $d$ is very large. Nonetheless, this upper bound is optimal in terms of $\varepsilon$ according to Thm. 2.
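To see where the $p$-dependence in Proposition 3 and Theorem 4 comes from, the following small Python illustration (ours, only a counting sketch rather than the paper's construction) views a degree-$p$ monomial as the product function applied to $p$ repeated coordinates and pairs the factors in a balanced binary tree: this uses $p-1$ pairwise products over $\lceil\log_2 p\rceil$ levels, and each pairwise product is, in the paper's approach, realized by the approximate multiplier built from the squaring network of Section 2.3.

```python
from math import ceil, comb, log2

def product_tree(factors):
    """Pair up factors in a balanced binary tree; return (#pairwise products, depth)."""
    n, mults, depth = len(factors), 0, 0
    while n > 1:
        mults += n // 2
        n = (n + 1) // 2
        depth += 1
    return mults, depth

# x1^2 * x2 * x3^2 = pi(x1, x1, x2, x3, x3): a degree-5 monomial is the product
# function applied to p = 5 (repeated) coordinates.
alpha = (2, 1, 2)
factors = [f"x{i+1}" for i, a in enumerate(alpha) for _ in range(a)]
p = len(factors)
print(product_tree(factors), (p - 1, ceil(log2(p))))   # both print (4, 3)

# Theorem 4 sums such monomial sub-networks over the |E| terms of a degree-p
# polynomial in d variables, with |E| <= C(p+d, p):
print(comb(5 + 3, 5))   # |E| bound for d = 3, p = 5  ->  56
```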
The Sobolev space $W^{r,\infty}([0,1]^d)$ is the set of functions belonging to $C^{r-1}([0,1]^d)$ whose $(r-1)$-th order derivatives are Lipschitz continuous. Further definitions can be found in Appendix D. We give the upper bound on ResNet's complexity in the following theorem.

Theorem 5. Fix $r, d \in \mathbb{N}^+$. There is a ResNet $R(x) \in \mathcal{RN}(d+4, N = 4, L)$ that can approximate any function from the unit ball of $W^{r,\infty}([0,1]^d)$ to an error $\varepsilon$, where $L = O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$.

The number of tunable parameters is still less than that of an FNN by a factor of $d$, based on the polynomial approximation methods. However, the hidden constant $c(d,r)$ in $O_{d,r}(\varepsilon^{-d/r}\log(1/\varepsilon))$ is very large in $d$: the estimate given in the proof grows exponentially in $d$, with the lower and upper estimates of $c(d,r)$ separated by an additional factor $d^{r+2}(d+r)^r$. As discussed before, (Yarotsky, 2017) established the lower bound $\Theta_r(\varepsilon^{-d/r})$ on the number of tunable weights for continuous ReLU network approximators on the approximation of the Sobolev space $W^{r,\infty}([0,1]^d)$. Based on Prop. 1, this lower bound also applies to ResNet. Thus, the upper bound in Theorem 5 is nearly tight, up to a log factor, for ResNet. Besides, it is noteworthy that recent results (Yang & Zhou, 2024) show that shallow ReLU neural networks can also reach this upper bound when $r < (d+3)/2$.

4.3. Representing Continuous Piecewise Linear Functions

Piecewise linear interpolation holds a significant position in approximation theory, as it is a basic method of approximating functions. Therefore, studying the expressive power of neural networks for piecewise linear functions is particularly important. In this subsection, we show that ResNet, even with one neuron per activation layer, can generate any continuous piecewise linear (CPwL) function, as shown in the theorem below; the proof is in Appendix E.

Theorem 6. For any CPwL function $f : \mathbb{R}^d \to \mathbb{R}$, there exists a ResNet $R(x) \in \mathcal{RN}(d+1, 1, L)$ with $L = O(Md)$ that can exactly represent $f$, where $M$ is an $f$-dependent number.

Note that $M$ is implicit: it is an $f$-dependent number determined by properties of the input CPwL function $f$, including its number of pieces and linear components. More details about the representation of CPwL functions can be found in (Tarela & Martinez, 1999; Wang & Sun, 2005). Moreover, a recent work (Chen et al., 2022) derived a dimension-independent bound for ReLU networks when the number of pieces and linear components of the target CPwL function is known; their constructions are also possible for ResNet. It should be noted that Theorem 6 provides another approach to demonstrating the universal approximation property, as CPwL functions are dense in the space of continuous functions under the uniform norm. While (Lin & Jegelka, 2018) shows that ResNet with one neuron per activation layer can approximate any step function, the construction of Theorem 6 is much simpler.

4.4. Optimal Approximation for Lebesgue-Integrable Functions

In this subsection, we show that even b-ResNet can achieve optimal approximation for $L^p$-integrable functions. It is well known that $C([0,1]^d)$ is dense in $L^p([0,1]^d)$ for $p\in[1,\infty)$ under the $L^p$ norm. Thus, we just need to consider the approximation of $C([0,1]^d)$, as shown in the following theorem. The proof can be found in Appendix F.

Theorem 7. Let $d \in \mathbb{N}^+$ with $d \ge 5$ and $p \in [1,\infty)$. For any given continuous function $f \in C([0,1]^d)$ and any $L \in \mathbb{N}^+$, there exists a ResNet $R \in \mathcal{RN}(d+1,\, 4,\, 24L + 9d + 4)$ such that $\|f - R\|_{L^p([0,1]^d)} \le 7\sqrt{d}\,\omega_f(L^{-2/d})$.

As a direct corollary, the number of parameters needed for a ResNet to approximate the function $f$ to an error $\varepsilon$ is $O_d\!\left(\omega_f^{-1}(\varepsilon)^{-d/2}\right)$.
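A back-of-the-envelope use of Theorem 7 (ours; it assumes $f$ is $\mu$-Lipschitz so that $\omega_f(t) \le \mu t$, and it reads off only the architecture shape stated in the theorem) is sketched below: solving $7\sqrt{d}\,\mu L^{-2/d} \le \varepsilon$ for $L$ gives the required number of residual-block groups, and the network shape is $\mathcal{RN}(d+1, 4, 24L + 9d + 4)$, whose parameter count grows like $O_d(L)$ because each block has constant width.

```python
import math

def resnet_shape_for_lipschitz(d, mu, eps):
    # Theorem 7 with omega_f(t) <= mu*t:  7*sqrt(d)*mu*L^{-2/d} <= eps
    # <=> L >= (7*sqrt(d)*mu/eps)^{d/2}.  Theorem 7 requires d >= 5.
    L = math.ceil((7 * math.sqrt(d) * mu / eps) ** (d / 2))
    return {"Q": d + 1, "width": 4, "depth": 24 * L + 9 * d + 4, "L": L}

print(resnet_shape_for_lipschitz(d=5, mu=1.0, eps=0.5))
```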
To discuss the theorem, we have the following concluding remarks.

Optimality. As discussed in Sec. 3, the lower bound on the size of deep ReLU networks for approximating continuous functions over $[0,1]^d$ under the uniform norm is of order $\varepsilon^{-d/2}$. Moreover, this lower bound also holds for approximation under the $L^p$ norm, which follows from a recent result (Siegel, 2023). Hence, it follows from Proposition 1 that this lower bound also applies to ResNet, which shows that our upper bound in Theorem 7 is $\varepsilon$-order optimal.

Extension to Uniform Approximation. The optimal bound in Theorem 7 is extendable to the case of the uniform norm, as facilitated by Lemma 3.4 of (Lu et al., 2021). This extension, however, necessitates an increased width of the ResNet architecture. As such, it is non-trivial to demonstrate that bottleneck ResNets (b-ResNets) can maintain this optimal rate when approximating continuous functions under the uniform norm.

Extension to the Entire Domain $\mathbb{R}^d$. Note that for any function $f$ belonging to $L^p(\mathbb{R}^d)$ and an arbitrary error margin $\epsilon > 0$, there exists a continuous function $h$ that approximates $f$ to within $\epsilon$ and is supported on a compact set $H$, outside of which it is zero (i.e., $h(x) = 0$ for $x \in \mathbb{R}^d\setminus H$) (Walter, 1987). If we take $H$ to be contained within the hypercube $[-H, H]^d$, we can define a new function $g$ on the unit cube $[0,1]^d$ via the change of variables $g\!\left(\tfrac{x+H}{2H}\right) = h(x)$. The function $g$ is continuous by construction and adheres to the assumptions of Theorem 7.

Non-trivial Bound Extension. In this part, we clarify that our results are non-trivial and cannot be derived from the work (Yarotsky, 2018). It is important to note that his construction allows ReLU FNNs with a width of $2d+10$ to attain the approximation rate in our Theorem 7. However, within Yarotsky's framework, $d$ neurons per layer can be adapted to form an identity mapping. If one were to extend these bounds to ResNets in a straightforward manner, the conclusion would be that a ResNet with width $d + O(1)$ could achieve the mentioned rate. Our Theorem 7, on the other hand, demonstrates that a ResNet with a constant width, completely independent of $d$, is capable of achieving the same rate of approximation. This finding is interesting as it implies a reduction in the number of parameters by a factor of $d$ when compared to FNNs, underscoring the non-trivial nature of our extension.

Trade-off between Depth and Width. As a closing remark, this paper concentrates on b-ResNet, that is, ResNet with a constant width. For the set of ResNets denoted by $\mathcal{RN}(Q, N, L)$, it is an interesting future direction to explore the trade-off between $Q$, $N$, and $L$.

Table 3. A summary of upper bounds on the size of FNN and ResNet for the approximation of two representative types of functions: polynomials and Lipschitz functions. The relevant papers primarily focus on approximating functions such as Hölder functions, Sobolev functions, continuous functions, and Lebesgue-integrable functions; however, they all encompass Lipschitz functions, so we use Lipschitz functions for the comparison.

| Network | Polynomial of degree $p$ with $|E|$ terms | Lipschitz functions with Lip. constant 1 |
| --- | --- | --- |
| Shallow ReLU FNNs | $O\!\left(|E|p^{3/2}\varepsilon^{-1}\log(1/\varepsilon)\right)$ [1] | $O\!\left(c_1(d)\,\varepsilon^{-d}\right)$ [3] |
| Deep ReLU FNNs | $O\!\left(|E|pd\log(p/\varepsilon)\right)$ [2] | $O\!\left(c_2(d)\,\varepsilon^{-d/2}\right)$ [4,5] |
| ResNet with constant width | $O\!\left(|E|p\log(p/\varepsilon)\right)$ | $O\!\left(c_3(d)\,\varepsilon^{-d/2}\right)$ |

Here $c_1(d)$ implicitly depends on $d$, $c_2(d)$ grows at least exponentially in $d$, and $c_3(d) = (7\sqrt{d})^{d/2}$. The related references are: [1] (Blanchard & Bennouna, 2021), [2] (DeVore et al., 2021), [3] (Yang & Zhou, 2024), [4] (Yarotsky & Zhevnerchuk, 2020), and [5] (Yarotsky, 2018).
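To make the Lipschitz column of Table 3 concrete, a short calculation (ours) evaluates the ResNet bound $c_3(d)\,\varepsilon^{-d/2}$ with $c_3(d) = (7\sqrt d)^{d/2}$, which is Theorem 7 specialized to $\omega_f(t) \le t$, against the generic $\varepsilon^{-d}$ scaling of the shallow-FNN row; the FNN-row constants are left out because the paper gives them only implicitly.

```python
import math

def resnet_bound(d, eps):
    # c3(d) * eps^{-d/2} with c3(d) = (7*sqrt(d))^{d/2}, i.e. Theorem 7 for
    # a 1-Lipschitz target (omega_f(t) <= t).
    return (7 * math.sqrt(d)) ** (d / 2) * eps ** (-d / 2)

d = 8
for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}:  eps^-d = {eps ** -d:.2e}   ResNet bound = {resnet_bound(d, eps):.2e}")
```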
5. Experiments

In this section, we provide function approximation results to numerically validate the theoretical results presented in Sec. 4. To emphasize the approximation error, we use a sufficiently complex target function. Specifically, we utilize the set of functions given in Equation (2) (where $a_i, b_i$ are parameters), which combine the input coordinates $x_j$ with sinusoidal terms of the form $b_i\sin(\prod \cdot)$, to test the universal approximation capability of b-ResNet. The parameter settings of Equation (2) are included in Appendix G. We then compare b-ResNet with a fully connected NN for approximating the function in Equation (2) with different dimensions. The results are shown in Figure 2 and Table 4.

Table 4. Comparison of MSE loss.

| NN structure | d = 100 | d = 200 | d = 300 |
| --- | --- | --- | --- |
| NN(d+1, d/10) | 0.0139 | 0.0216 | 0.0472 |
| RN(d+1, 10, d/10) | 0.0102 | 0.0131 | 0.0225 |
| RN(d+1, 20, d/10) | 0.0093 | 0.0127 | 0.0228 |
| RN(d+1, 40, d/10) | 0.0091 | 0.0131 | 0.0230 |

Figure 2. Comparison of testing MSE loss for FNN and b-ResNet approximating the high-dimensional functions defined in Equation (2). More experiment results can be found in Appendix G.

From the experiments, it is evident that under the same dimensionality, the b-ResNet has a reduced number of parameters compared to a classical FNN, yet it achieves a lower MSE. This, in some sense, demonstrates the remarkable function approximation capability of the b-ResNet. Moreover, the function approximation task is of high dimensionality, which reflects the capacity of ResNets to learn high-dimensional functions. In conclusion, the experimental results demonstrate (i) the exceptional approximation capability of b-ResNet for learning complex functions, (ii) an efficient structure design which greatly reduces the number of training parameters, and (iii) strong scalability for approximating high-dimensional functions.

6. Conclusion

In this study, we provide a substantial theoretical contribution to the understanding of ResNet's capabilities in function approximation. We show that ResNet with constant width possesses the remarkable ability to approximate various important functions, including polynomials, smooth functions, and piecewise linear functions. Importantly, we show that even ResNet with constant width can achieve optimal approximation for Lebesgue-integrable functions, which are frequently encountered in practical applications. Moreover, we obtain improvements compared with FNN; Table 3 summarizes the explicit comparison between the approximation rates of ResNet and FNN. While ResNet has significant optimization advantages over FNNs during training, our results indicate that ResNets can achieve strong approximation capabilities with even fewer parameters than FNNs. These findings add to the theoretical justifications for ResNet's stellar practical performance.

Acknowledgements

This work is supported in part by a General Research Fund from the Research Grants Council, Hong Kong (Project No. 11200223), an InnoHK initiative, The Government of the HKSAR, Laboratory for AI-Powered Financial Technologies, and a Shenzhen-Hong Kong-Macau Science & Technology Project (Category C, Project No. SGDX20220530111203026). The authors would also like to thank the anonymous reviewers for their helpful comments.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Blanchard, M.
and Bennouna, M. A. Shallow and deep networks are near-optimal approximators of korobov functions. In International Conference on Learning Representations, 2021. Cai, Y. Achieve the minimum width of neural networks for universal approximation. In The Eleventh International Conference on Learning Representations, 2022. Chen, K.-L., Garudadri, H., and Rao, B. D. Improved bounds on neural complexity for representing piecewise linear functions. Advances in Neural Information Processing Systems, 35:7167 7180, 2022. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303 314, 1989. De Vore, R., Hanin, B., and Petrova, G. Neural network approximation. Acta Numerica, 30:327 444, 2021. De Vore, R. A., Howard, R., and Micchelli, C. Optimal nonlinear approximation. Manuscripta mathematica, 63: 469 478, 1989. Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on learning theory, pp. 907 940. PMLR, 2016. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256. JMLR Workshop and Conference Proceedings, 2010. Goldberg, P. and Jerrum, M. Bounding the vapnikchervonenkis dimension of concept classes parameterized by real numbers. In Proceedings of the sixth annual conference on Computational learning theory, pp. 361 369, 1993. Hanin, B. and Sellke, M. Approximating continuous functions by relu nets of minimal width. ar Xiv preprint ar Xiv:1710.11278, 2017. Hardt, M. and Ma, T. Identity matters in deep learning. ar Xiv preprint ar Xiv:1611.04231, 2016. He, J. On the optimal expressive power of relu dnns and its application in approximation with kolmogorov superposition theorem. ar Xiv preprint ar Xiv:2308.05509, 2023. He, J., Li, L., and Xu, J. Approximation properties of deep relu cnns. Research in the Mathematical Sciences, 9(3): 38, 2022. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251 257, 1991. Jiao, Y., Lai, Y., Lu, X., Wang, F., Yang, J. Z., and Yang, Y. Deep neural networks with relu-sine-exponential activations break curse of dimensionality in approximation on h older class. SIAM Journal on Mathematical Analysis, 55(4):3635 3649, 2023. Kidger, P. and Lyons, T. Universal approximation with deep narrow networks. In Conference on learning theory, pp. 2306 2327. PMLR, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kolmogorov, A. N. and Tikhomirov, V. M. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3 86, 1959. Lai, M.-J. and Shen, Z. The kolmogorov superposition theorem can break the curse of dimensionality when approximating high dimensional functions. ar Xiv preprint ar Xiv:2112.09963, 2021. Characterizing Res Net s Universal Approximation Capability Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861 867, 1993. Li, Q., Lin, T., and Shen, Z. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 2022a. 
Li, Z., Han, J., Li, Q., et al. On the curse of memory in recurrent neural networks: Approximation and optimization analysis. ar Xiv preprint ar Xiv:2009.07799, 2020. Li, Z., Han, J., Weinan, E., and Li, Q. Approximation and optimization theory for linear continuous-time recurrent neural networks. J. Mach. Learn. Res., 23:42 1, 2022b. Liang, S. and Srikant, R. Why deep neural networks for function approximation? ar Xiv preprint ar Xiv:1610.04161, 2016. Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. Advances in neural information processing systems, 31, 2018. Lu, J., Shen, Z., Yang, H., and Zhang, S. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465 5506, 2021. Montanelli, H. and Du, Q. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78 92, 2019. Montanelli, H., Yang, H., and Du, Q. Deep relu networks overcome the curse of dimensionality for bandlimited functions. ar Xiv preprint ar Xiv:1903.00735, 2019. Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. Advances in neural information processing systems, 27, 2014. Oono, K. and Suzuki, T. Approximation and non-parametric estimation of resnet-type convolutional neural networks. In International conference on machine learning, pp. 4922 4931. PMLR, 2019. Park, S., Yun, C., Lee, J., and Shin, J. Minimum width for universal approximation. In International Conference on Learning Representations, 2020. Pinkus, A. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143 195, 1999. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503 519, 2017. Schwab, C. and Zech, J. Deep learning in high dimension: Neural network approximation of analytic functions in L2(Rd, γd). ar Xiv preprint ar Xiv:2111.07080, 2021. Serra, T., Tjandraatmadja, C., and Ramalingam, S. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4558 4566. PMLR, 2018. Shapira, I. Expressivity of shallow and deep neural networks for polynomial approximation. ar Xiv preprint ar Xiv:2303.03544, 2023. Shen, Z., Yang, H., and Zhang, S. Deep network approximation characterized by number of neurons. ar Xiv preprint ar Xiv:1906.05497, 2019. Shen, Z., Yang, H., and Zhang, S. Deep network approximation with discrepancy being reciprocal of width to power of depth. ar Xiv preprint ar Xiv:2006.12231, 2020. Shen, Z., Yang, H., and Zhang, S. Neural network approximation: Three hidden layers are enough. Neural Networks, 141:160 173, 2021. Shen, Z., Yang, H., and Zhang, S. Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. The Journal of Machine Learning Research, 23(1):12653 12712, 2022a. Shen, Z., Yang, H., and Zhang, S. Optimal approximation rate of relu networks in terms of width and depth. Journal de Math ematiques Pures et Appliqu ees, 157:101 135, 2022b. Siegel, J. W. Optimal approximation rates for deep relu neural networks on sobolev and besov spaces. Journal of Machine Learning Research, 24(357):1 52, 2023. Suh, N., Zhou, T.-Y., and Huo, X. Approximation and nonparametric estimation of functions over high-dimensional spheres via deep relu networks. 
In The Eleventh International Conference on Learning Representations, 2022. Tarela, J. and Martinez, M. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 30(11-12):17 27, 1999. Telgarsky, M. Representation benefits of deep feedforward networks. ar Xiv preprint ar Xiv:1509.08101, 2015. Telgarsky, M. Benefits of depth in neural networks. In Conference on learning theory, pp. 1517 1539. PMLR, 2016. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Characterizing Res Net s Universal Approximation Capability Walter, R. Real and complex analysis. 1987. Wang, Q. et al. Exponential convergence of the deep neural network approximation for analytic functions. ar Xiv preprint ar Xiv:1807.00297, 2018. Wang, S. and Sun, X. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425 4431, 2005. Yang, Y. and Zhou, D.-X. Optimal rates of approximation by shallow relu k neural networks and applications to nonparametric regression. Constructive Approximation, pp. 1 32, 2024. Yarotsky, D. Error bounds for approximations with deep relu networks. Neural Networks, 94:103 114, 2017. Yarotsky, D. Optimal approximation of continuous functions by very deep relu networks. In Conference on learning theory, pp. 639 649. PMLR, 2018. Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. Advances in neural information processing systems, 33:13005 13015, 2020. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? ar Xiv preprint ar Xiv:1912.10077, 2019. Zhang, S., Shen, Z., and Yang, H. Neural network architecture beyond width and depth. Advances in Neural Information Processing Systems, 35:5669 5681, 2022. Zhang, S., Lu, J., and Zhao, H. On enhancing expressive power via compositions of single fixed-size relu network. ar Xiv preprint ar Xiv:2301.12353, 2023. Zhou, D.-X. Deep distributed convolutional neural networks: Universality. Analysis and Applications, 16(06):895 919, 2018. Zhou, D.-X. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48 (2):787 794, 2020. Characterizing Res Net s Universal Approximation Capability A. Preliminaries In this section, we introduce some basic notations for use in subsequent proofs. A.1. Feedforward neural networks (FNNs) As is known to all, FNN is a function Φ : Rd : R which is formed as the alternating compositions of Re LU function σ, and affine transformations A[i](y) = Uiy + vi with Ui Rdi di 1, vi Rdi, d0 = d for i = 1, 2, , L. Specifically, Φ (x) = L σ A[L] σ A[L 1] σ A[1] (x) where L is a final affine transformation. Here L denotes the number of layers of the FNN, and the width of the FNN is conventionally defined by max{d1, d2, , d L} := K. The Re LU activation function is defined by: σ(x) := Re LU(x) = max (x, 0) = (x)+ , x R and for x Rd, σ(x) := (σ(x1), , σ(xd)). Typically, it is presumed that the number of neurons in each layer of an FNN is the same, which is equal to the width K, as any neuron deficits in a layer can be dealt with by adding K dj neurons whose biases are zero in layer j. The weights between these extra neurons are consequently assigned to zero. A.2. Res Net Let d, Q N+. 
Res Net R(x) : Rd R is a combination of an initial affine layer, multiple basic residual blocks with identity mapping, and a final affine output layer: R(x) = L (T [L] + Id) (T [L 1] + Id) (T [1] + Id) AQ(x), (3) where AQ : Rd RQ and L : RQ R are affine transformations. Besides, T [i](i = 0, 1, ..., L) are basic residual blocks, i.e., T [i](z) = Viσ (Wiz + bi) where Wi Rni Q, Vi RQ ni, bi Rni. Concretely, we denote the output of the i-th block by z[i]. Then the outputs of each block can be formulated as follows: z[0] = AQ (x) = W0x + b0, T [i](z) = Viσ (Wiz + bi) , z[i] = T [i] z[i 1] + z[i 1], i = 1, 2, , L, R(x) = L z[L] = Bz[L], where W0 RQ d, b0 RQ, Wi Rni Q, Vi RQ ni, bi Rni, B R1 Q and x Rd. The Res Net s depth, denoted by L, is defined as the number of residual blocks. The Res Net s width is the maximum number of neurons in the activation layer, that is max{n1, n2, ..., n L}. The subscript Q of AQ refers to the number of neurons in the linear layer. We denote by RN (Q, N, L) the set of Res Net functions width N, depth L and Q neurons in each linear layer. Additionally, we define ζ[i] = T [i] 1 (z[i 1]) = σ(Wiz + bi), γ[i] = T [i] z[i 1] = T [i] 2 ζ[i] = Viζ[i] = Vi T [i] 1 z[i 1] and z[i] = z[i 1] + γ[i] for i = 1, 2, , L. See Fig. 3 for an illustration. A.3. Notations We summarize the notations we will use in this paper in the following. Characterizing Res Net s Universal Approximation Capability +Id +Id +Id = γ[1] + z[0] = γ[2] + z[1] = γ[3] + z[2] Figure 3. An illustration figure for the Res Net notation. The yellow neurons are in the activation layer and the grey neurons are in the linear layer. Let column vectors ai Rmi, where i = 1, 2, , n and mi N+ := {1, 2, }. To represent these vectors concisely, we use (a1, a2, , an) to denote a T 1 , a T 2 , , a T n T Rm1+m2+ +mn. Here, a T denotes the transpose of a. Let a Rm. If we write a vector as (Rl, a) Rm+l, this implies that the value of the vector in the position represented by Rl does not matter. If l = 1, we always use - to substitute R , i.e., ( , a) Rm+1 implies the value of the position represented by - does not matter. For a vector v in Rm, vi is the i-th entry of v for i = 1, 2, , m. Denote by µ(T ) the Lebesgue measure of a measurable set T . Let 1S be the characteristic function on a set S, i.e., 1S = 1 on set S and 0 otherwise. For two sets A, B, A\B := {x : x A, x / B}. For any ξ R, let ξ := max{i : i ξ, i Z} and ξ := min{i : i ξ, i Z}. B. Proof of Proposition 1 and Theorem 2 In this appendix, we provide the proofs of conclusions in Sec. 3. B.1. Proof of Proposition 1 For a Res Net in RN (Q, N, L) from [0, 1]d to R defined by the formula 3, 4 and 5, we now construct a special network with depth L and width N + Q having the same output. We first suppose the input of the network is y[0] = AQ(x) = z[0] and denote the output of the i-th layer by y[i]. What s more, we assume in each layer the bottom Q neurons of each layer are Re LU-free, i.e., the activation function of them is identity mapping σ(x) = x. The activation of the rest of neurons are Re LU. Then by assigning some weights to the first layer, we can have y[1] = (ζ[1], z[0]). In the next layer, we can easily compute z[1] = V1ζ[1] + z[0]. Then we have y[2] = (RN, z[1]). Now assume y[2i] = (RN, z[i]). Then in the first N neurons of the next layer, we compute T [i+1] 1 (z[i]) = Re LU(Wi+1z[i] + bi). In the bottom Q Re LU-free neurons, we copy z[i]. Then y[i+1] = ζ[i+1], z[i] . 
B. Proof of Proposition 1 and Theorem 2

In this appendix, we provide the proofs of the conclusions in Sec. 3.

B.1. Proof of Proposition 1

For a ResNet in $\mathcal{RN}(Q, N, L)$ from $[0,1]^d$ to $\mathbb{R}$ defined by (3)-(5), we first construct a special network with depth $2L$ and width $N + Q$ having the same output. Suppose the input of the network is $y^{[0]} = A_Q(x) = z^{[0]}$ and denote the output of the $i$-th layer by $y^{[i]}$. Moreover, we assume that in each layer the bottom $Q$ neurons are ReLU-free, i.e., their activation function is the identity mapping $\sigma(x) = x$; the remaining neurons use ReLU. By assigning suitable weights to the first layer, we obtain $y^{[1]} = (\zeta^{[1]}, z^{[0]})$. In the next layer, we can easily compute $z^{[1]} = V_1 \zeta^{[1]} + z^{[0]}$, so that $y^{[2]} = (\mathbb{R}^N, z^{[1]})$. Now assume $y^{[2i]} = (\mathbb{R}^N, z^{[i]})$. In the first $N$ neurons of the next layer we compute $T_1^{[i+1]}(z^{[i]}) = \mathrm{ReLU}(W_{i+1} z^{[i]} + b_{i+1})$, and in the bottom $Q$ ReLU-free neurons we copy $z^{[i]}$, so that $y^{[2i+1]} = (\zeta^{[i+1]}, z^{[i]})$. Then in the next layer we compute $z^{[i+1]} = V_{i+1}\zeta^{[i+1]} + z^{[i]}$, i.e., $y^{[2(i+1)]} = (\mathbb{R}^N, z^{[i+1]})$. By induction, we have found a special network of depth $2L$, with $N$ top ReLU neurons and $Q$ bottom ReLU-free neurons per layer, having the same output as the ResNet. This process is illustrated in Figure 4.

Figure 4. A ResNet (top) can be generated by a special FNN (bottom). All grey neurons are ReLU-free and all yellow neurons have ReLU activation. Moreover, for the ResNet, the yellow neurons are in the activation layer and the grey neurons are in the linear layer.

Next, we construct a genuine ReLU network $\Phi$ that has the same size and output as the special network. Because the domain $[0,1]^d$ is compact, there exist $C_i \in \mathbb{R}^Q$ such that $z^{[i]} + C_i > 0$ for all $i = 0, 1, 2, \dots, L$. Let the first layer of $\Phi$ be $u^{[0]} = \mathrm{ReLU}(A_Q(x) + C_0) = \mathrm{ReLU}(z^{[0]} + C_0)$, and denote the output of the $j$-th layer of $\Phi$ by $u^{[j-1]}$. Each subsequent layer of $\Phi$ has width $N + Q$. We write $u^{[i]} = (u^{[i]}_{(N)}, u^{[i]}_{(Q)})$, where $u^{[i]}_{(N)}$ is the value of the top $N$ neurons and $u^{[i]}_{(Q)}$ is the value of the bottom $Q$ neurons in layer $i + 1$. Then, noting $T_1^{[1]}(z^{[0]}) = \mathrm{ReLU}(W_1 z^{[0]} + b_1)$, we can compute
$$u^{[1]}_{(N)} = \mathrm{ReLU}\big(W_1(u^{[0]} - C_0) + b_1\big) = \mathrm{ReLU}(W_1 z^{[0]} + b_1) = \zeta^{[1]}$$
and $u^{[1]}_{(Q)} = \mathrm{ReLU}(u^{[0]}) = u^{[0]} = \mathrm{ReLU}(z^{[0]} + C_0)$. Noting $z^{[i]} = \mathrm{ReLU}(z^{[i]} + C_i) - C_i$, we can compute
$$u^{[2]}_{(Q)} = \mathrm{ReLU}\big(V_1 u^{[1]}_{(N)} + u^{[1]}_{(Q)} - C_0 + C_1\big) = \mathrm{ReLU}\big(V_1 \zeta^{[1]} + z^{[0]} + C_1\big) = \mathrm{ReLU}(z^{[1]} + C_1).$$
Now suppose $u^{[2j]} = (\mathbb{R}^N, \mathrm{ReLU}(z^{[j]} + C_j))$. Then we can compute
$$u^{[2j+1]}_{(N)} = \mathrm{ReLU}\big(W_{j+1}(u^{[2j]}_{(Q)} - C_j) + b_{j+1}\big) = \mathrm{ReLU}(W_{j+1} z^{[j]} + b_{j+1}) = T_1^{[j+1]}(z^{[j]}) = \zeta^{[j+1]}$$
and $u^{[2j+1]}_{(Q)} = \mathrm{ReLU}\big(u^{[2j]}_{(Q)}\big) = \mathrm{ReLU}(z^{[j]} + C_j)$. Then, in the next layer,
$$u^{[2j+2]}_{(Q)} = \mathrm{ReLU}\big(V_{j+1} u^{[2j+1]}_{(N)} + u^{[2j+1]}_{(Q)} - C_j + C_{j+1}\big) = \mathrm{ReLU}\big(V_{j+1} T_1^{[j+1]}(z^{[j]}) + z^{[j]} + C_{j+1}\big) = \mathrm{ReLU}(z^{[j+1]} + C_{j+1}).$$
By induction, we can output $z^{[L]}$ in the last layer, i.e., $u^{[2L]} = (\mathbb{R}^N, \mathrm{ReLU}(z^{[L]} + C_L))$, and then recover $z^{[L]}$ by an affine transformation $A(u^{[2L]}) = \mathrm{ReLU}(z^{[L]} + C_L) - C_L = z^{[L]}$. From this construction, we can see that the FNN has $W + 2kL$ non-zero trainable weights.
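The conversion in the proof above can be mimicked numerically. The sketch below (our own illustration under the assumption that crude norm bounds suffice for the shifts $C_i$; it is not the paper's code) emulates each residual block by two ReLU layers of width $N + Q$, where the bottom $Q$ coordinates carry $z^{[i]} + C_i$, shifted so that ReLU acts as the identity on them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def resnet_forward(x, W0, b0, blocks, B):
    z = W0 @ x + b0
    for W, b, V in blocks:
        z = V @ relu(W @ z + b) + z
    return B @ z

def shift_constants(W0, b0, blocks):
    # crude sup-norm bounds with |z^[i]| <= C_i over [0, 1]^d, so that z^[i] + C_i >= 0
    bound = np.abs(W0).sum(axis=1).max() + np.abs(b0).max()
    C = [np.full(b0.shape, bound)]
    for W, b, V in blocks:
        bound += np.abs(V).sum(axis=1).max() * (np.abs(W).sum(axis=1).max() * bound + np.abs(b).max())
        C.append(np.full(b0.shape, bound))
    return C

def resnet_as_fnn(x, W0, b0, blocks, B, C):
    """Plain ReLU net of width N + Q and depth 2L with the same output as the ResNet."""
    u = relu(W0 @ x + b0 + C[0])                      # u^[0] = relu(z^[0] + C_0) = z^[0] + C_0
    for i, (W, b, V) in enumerate(blocks):
        zeta = relu(W @ (u - C[i]) + b)               # top N neurons: zeta^[i+1]
        keep = relu(u)                                # bottom Q neurons: z^[i] + C_i, unchanged
        u = relu(V @ zeta + keep - C[i] + C[i + 1])   # next layer holds z^[i+1] + C_{i+1}
    return B @ (u - C[-1])                            # undo the final shift

rng = np.random.default_rng(0)
d, Q, N, L = 3, 2, 4, 3
W0, b0 = rng.standard_normal((Q, d)), rng.standard_normal(Q)
blocks = [(rng.standard_normal((N, Q)), rng.standard_normal(N), rng.standard_normal((Q, N)))
          for _ in range(L)]
B = rng.standard_normal((1, Q))
C = shift_constants(W0, b0, blocks)
x = rng.uniform(0, 1, d)
print(resnet_forward(x, W0, b0, blocks, B), resnet_as_fnn(x, W0, b0, blocks, B, C))
```

On any input in $[0,1]^d$, the two functions print the same value, which is exactly the claim of Proposition 1.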
B.2. Proof of Theorem 2

Because the product function $f(x) = x_1 x_2 \cdots x_d$ belongs to $\mathcal{P}(d, p)$, it suffices to show that the lower bound on the complexity of an FNN approximating $f$ is $\Theta_d(\log 1/\varepsilon)$. Let $x \in [0, 1]$. Define $\tilde\Psi(x) = \Psi(x, x, \dots, x)$ and $\tilde f(x) = f(x, x, \dots, x) = x^d$. Then, by assumption,
$$|\tilde\Psi(x) - \tilde f(x)| < \varepsilon, \quad x \in [1/2, 1].$$
On the interval $[1/2, 1]$, $\tilde f$ is strictly convex because $\tilde f''(x) = d(d-1)x^{d-2} \ge d(d-1)\left(\tfrac{1}{2}\right)^{d-2} =: c_1 > 0$. By Lemma 2.1 in (Telgarsky, 2015), $\tilde\Psi$ is a CPwL function over $[0, 1]$ with at most $(2N)^L$ linear pieces, i.e., $[1/2, 1]$ is partitioned into at most $(2N)^L$ intervals on each of which $\tilde\Psi$ is linear. Thus, there exists an interval $[a, b] \subset [1/2, 1]$ with $b - a \ge \frac{1}{2(2N)^L}$ over which $\tilde\Psi$ is linear. Define $G(x) = \tilde f(x) - \tilde\Psi(x)$ for $x \in [a, b]$. Then $|G(x)| < \varepsilon$ and $G''(x) \ge c_1 > 0$ for any $x \in [a, b]$, due to the linearity of $\tilde\Psi$ on $[a, b]$. Consider the local Taylor expansion at $(a+b)/2$: for $x \in [a, b]$,
$$G(x) = G\!\left(\tfrac{a+b}{2}\right) + G'\!\left(\tfrac{a+b}{2}\right)\!\left(x - \tfrac{a+b}{2}\right) + \tfrac{1}{2} G''(\xi)\!\left(x - \tfrac{a+b}{2}\right)^2,$$
where $\xi$ lies between $x$ and $(a+b)/2$. Taking $x = a$ and $x = b$, we have
$$G(a) = G\!\left(\tfrac{a+b}{2}\right) - G'\!\left(\tfrac{a+b}{2}\right)\tfrac{b-a}{2} + \tfrac{1}{2} G''(\xi)\!\left(\tfrac{b-a}{2}\right)^2, \quad \xi \in \left[a, \tfrac{a+b}{2}\right],$$
$$G(b) = G\!\left(\tfrac{a+b}{2}\right) + G'\!\left(\tfrac{a+b}{2}\right)\tfrac{b-a}{2} + \tfrac{1}{2} G''(\eta)\!\left(\tfrac{b-a}{2}\right)^2, \quad \eta \in \left[\tfrac{a+b}{2}, b\right].$$
It follows that
$$\max\{G(a), G(b)\} \ge G\!\left(\tfrac{a+b}{2}\right) + \tfrac{c_1}{2}\!\left(\tfrac{b-a}{2}\right)^2.$$
Thus, noting $b - a \ge \frac{1}{2(2N)^L}$, we have
$$2\varepsilon > \max\{G(a), G(b)\} - G\!\left(\tfrac{a+b}{2}\right) \ge \tfrac{c_1}{2}\!\left(\tfrac{b-a}{2}\right)^2 \ge \tfrac{c_1}{2}\!\left(\tfrac{1}{4(2N)^L}\right)^2.$$
Then $(2N)^{2L} \ge \frac{c}{\varepsilon}$, where $c$ is a constant depending on $d$. Since the number of neurons is $T = NL$, we have $(2N)^{2L} \le (2T)^{2T} = u^u$ with $u = 2T$, so that $u \log u \ge \log \frac{c}{\varepsilon}$. Therefore, the number of neurons must be at least of order $\log 1/\varepsilon$.

C. Proof of Proposition 3 and Theorem 4

Our ideas in this appendix originate from (DeVore et al., 2021). It should be noted that while (DeVore et al., 2021) inspired our approach to polynomial approximation, the construction details differ substantially; our main contribution is the explicit constructive proof for ResNet. The discussion in Subsection C.1 starts with the basic functions, ranging from $x^2$ to $xy$, that we use to build our ResNet approximations. In Subsection C.2, we first establish Proposition 3 for the case $[0, 1]^d$; this is then extended to $[-M, M]^d$ for $M > 1$ in Subsection C.3.

C.1. Preliminaries

Recall that the so-called hat function $h$ is defined by
$$h(x) = 2(x)_+ - 4\left(x - \tfrac{1}{2}\right)_+ + 2(x - 1)_+. \qquad (6)$$
Let $h_m(x)$ be the $m$-fold composition of $h$, i.e., $h_m = \underbrace{h \circ h \circ \cdots \circ h}_{m \text{ times}}$, the so-called sawtooth function. Then
$$x^2 = x - \sum_{m \ge 1} 4^{-m} h_m(x), \quad x \in [0, 1].$$
Next, we define $S(x) := x^2$ and
$$S_n(x) := x - \sum_{m=1}^{n} 4^{-m} h_m(x), \quad n \ge 1,\ x \in [0, 1].$$
We then have
$$|S(x) - S_n(x)| \le \sum_{i=n+1}^{\infty} 4^{-i} \le \tfrac{1}{3}\, 4^{-n}, \quad x \in [0, 1]. \qquad (7)$$
$S_n(x)$ is the piecewise linear interpolation of $S$ on $[0, 1]$ with $2^n + 1$ uniformly distributed breakpoints, as indicated in (Yarotsky, 2017) (see Proposition 1 therein). Using equation (7), we can generate $S_n$ and thereby approximate $x^2$.

Proposition 8. There exists a ResNet $R(x) \in \mathcal{RN}(2, 4, L)$ with $L = O(\log 1/\varepsilon)$ such that $|R(x) - x^2| < \varepsilon$ for all $x \in [0, 1]$, while having $O(\log 1/\varepsilon)$ neurons. In particular, a ResNet $R(x) \in \mathcal{RN}(2, 4, n)$ generates $S_n$ exactly.

Proof. It suffices to construct the required ResNet representing $S_n$. Setting the right-hand side of (7) equal to $\varepsilon$, i.e., $\frac{1}{3}4^{-n} = \varepsilon$, gives $n = O(\log 1/\varepsilon)$. We now construct a ResNet $R(x) \in \mathcal{RN}(2, 4, n)$ generating $S_n(x)$ exactly.

Figure 5. Illustration of the constructive first block (left), which outputs $z^{[1]} = (-h(x)/4,\ S_1(x))$, and the $m$-th block (right), which maps $z^{[m]} = (-4^{-m}h_m(x),\ S_m(x))$ to $z^{[m+1]} = (-4^{-(m+1)}h_{m+1}(x),\ S_{m+1}(x))$. Grey represents the linear layer and yellow represents the activation layer.

Let $z^{[0]} = A_2(x) = (x, 0)$. By equation (6), we can use three ReLU units to store $(x)_+$, $(x - 1/2)_+$, and $(x-1)_+$ and hence compute $h(x)$; in the same layer, we use one ReLU unit to copy $x$. Then, in the first residual block, we can compute $\gamma^{[1]} = T^{[1]}(z^{[0]}) = (-x - h(x)/4,\ x - h(x)/4)$ so that
$$z^{[1]} = \gamma^{[1]} + z^{[0]} = (-x - h(x)/4,\ x - h(x)/4) + (x, 0) = (-h(x)/4,\ S_1(x)).$$
See Figure 5 (left) for an illustration. Now assume the output of the $m$-th block is $z^{[m]} = (-4^{-m} h_m(x),\ S_m(x))$. Again by (6) and by assigning appropriate weights, we can use three units to compute $h(h_m(x)) = h_{m+1}(x)$ and one unit to copy $-4^{-m}h_m(x)$ via $a = -(-a)_+$ for $a \le 0$. Thus, the next block can output
$$\gamma^{[m+1]} = T^{[m+1]}(z^{[m]}) = \big(-4^{-(m+1)} h_{m+1}(x) + 4^{-m} h_m(x),\ -4^{-(m+1)} h_{m+1}(x)\big),$$
so that
$$z^{[m+1]} = \gamma^{[m+1]} + z^{[m]} = \big(-4^{-(m+1)} h_{m+1}(x),\ S_{m+1}(x)\big).$$
See Figure 5 (right) for an illustration. By induction, we complete our proof: concretely, $z^{[n]} = (-4^{-n} h_n(x),\ S_n(x))$ and $R(x) = \mathcal{L}(z^{[n]}) = S_n(x)$ by letting $\mathcal{L}(x_1, x_2) = x_2$.
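The identity behind $S_n$ is easy to check numerically. The following sketch (our own illustration; it evaluates $h$, $h_m$, and $S_n$ directly rather than through the ResNet blocks of Proposition 8) compares $S_n$ with $x^2$ against the bound in (7).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    # hat function (6): h(x) = 2(x)_+ - 4(x - 1/2)_+ + 2(x - 1)_+
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    # S_n(x) = x - sum_{m=1}^n 4^{-m} h_m(x), with h_m the m-fold composition of h
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

x = np.linspace(0.0, 1.0, 10001)
for n in (2, 4, 8):
    print(n, np.max(np.abs(S(n, x) - x ** 2)), (1 / 3) * 4.0 ** (-n))  # observed error vs. bound (7)
```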
Let $x, y \in [0, 1]$. We can approximate the product $xy$ by using the identity $xy = \left(\frac{|x+y|}{2}\right)^2 - \left(\frac{|x-y|}{2}\right)^2$. Define
$$\pi_n(x, y) = S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right).$$
It then follows from equation (7) that
$$|\pi_n(x, y) - xy| \le 4^{-n}, \quad x, y \in [0, 1]. \qquad (9)$$
For the later rigorous derivation, we also need to prove the following lemma.

Lemma 9.
$$\pi_n(x, y) \in [0, 1], \quad x, y \in [0, 1]. \qquad (10)$$

Proof. According to the definition of $S_n$, we have $x^2 \le S_n(x) \le x$ for $x \in [0, 1]$. Hence
$$\pi_n(x, y) = S_n\!\left(\frac{x+y}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) \le \frac{x+y}{2} - \left(\frac{x-y}{2}\right)^2 = \frac{1}{4}\big(x(2-x) + y(2-y) + 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
Next, we show $\pi_n(x, y) \ge 0$. We start from
$$\pi_n(x, y) = S_n\!\left(\frac{x+y}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) = \frac{x+y}{2} - \sum_{i=1}^{n} 4^{-i} h_i\!\left(\frac{x+y}{2}\right) - \frac{|x-y|}{2} + \sum_{i=1}^{n} 4^{-i} h_i\!\left(\frac{|x-y|}{2}\right) = \min\{x, y\} + \sum_{i=1}^{n} 4^{-i}\left[h_i\!\left(\frac{|x-y|}{2}\right) - h_i\!\left(\frac{x+y}{2}\right)\right]. \qquad (11)$$
Now we introduce the function $\zeta(x) := 2 \min\{|x - s| : s \in \mathbb{Z}\}$, $x \in \mathbb{R}$. Then for $x \in [0, 1]$ we have $h(x) = \zeta(x)$ and $h_m(x) = \zeta(2^{m-1}x)$ for $m \ge 2$. Since $\zeta$ is subadditive, i.e., $\zeta(t + t') \le \zeta(t) + \zeta(t')$, we have
$$h_i\!\left(\frac{x+y}{2}\right) = h_i\!\left(\frac{|x-y|}{2} + \min\{x, y\}\right) = \zeta\!\left(2^{i-1}\Big(\frac{|x-y|}{2} + \min\{x, y\}\Big)\right) \le \zeta\!\left(2^{i-1}\frac{|x-y|}{2}\right) + \zeta\big(2^{i-1}\min\{x, y\}\big) = h_i\!\left(\frac{|x-y|}{2}\right) + h_i(\min\{x, y\}),$$
i.e.,
$$h_i\!\left(\frac{|x-y|}{2}\right) - h_i\!\left(\frac{x+y}{2}\right) \ge -h_i(\min\{x, y\}).$$
From (11), we then have
$$\pi_n(x, y) \ge \min\{x, y\} - \sum_{i=1}^{n} 4^{-i} h_i(\min\{x, y\}) = S_n(\min\{x, y\}) \ge \min\{x, y\}^2 \ge 0.$$

For $x, y \in [-M, M]$, we can approximate $xy$ by the following remark.

Remark 10. If $x, y \in [-M, M]$, we can approximate $xy$ by using the identity $xy = M^2\left[\left(\frac{|x+y|}{2M}\right)^2 - \left(\frac{|x-y|}{2M}\right)^2\right]$. Because the domain of $S_n$ is $[0, 1]$, we define
$$\hat\pi_n(x, y) = M^2 \pi_n\!\left(\frac{x}{M}, \frac{y}{M}\right) = M^2\left[S_n\!\left(\frac{|x+y|}{2M}\right) - S_n\!\left(\frac{|x-y|}{2M}\right)\right]. \qquad (13)$$
We have
$$|\hat\pi_n(x, y) - xy| \le M^2\, 4^{-n}, \quad x, y \in [-M, M]. \qquad (14)$$

Now we show that a ResNet can approximate the product function $xy$.

Proposition 11. Let $x, y \in [-M, M]$. There exists a ResNet $R \in \mathcal{RN}(3, 4, L)$ from $[-M, M]^2$ to $\mathbb{R}$ with $L = O(\log M/\varepsilon)$ such that $|R(x, y) - xy| < \varepsilon$ for all $x, y \in [-M, M]$, while having $O(\log M/\varepsilon)$ neurons and tunable weights. In particular, a ResNet $R(x, y)$ with width $4$ and depth $2n$ can generate $\hat\pi_n(x, y)$ exactly.

Proof. It suffices to construct the required ResNet outputting $\hat\pi_n(x, y)$. Setting the right-hand side of (14) equal to $\varepsilon$ gives $n = O(\log M/\varepsilon)$. We now construct a ResNet with width $4$ and depth $2n$ representing $\hat\pi_n(x, y)$. Let $A_3(x, y) = \left(\frac{x+y}{2M}, \frac{x-y}{2M}, 0\right)$. We can use the first block to output $z^{[1]} = \left(\frac{|x+y|}{2M}, \frac{|x-y|}{2M}, 0\right)$ by the simple observation $|a| = (a)_+ + (-a)_+$; see Figure 6 for an illustration. Then it follows from the proof of Proposition 8 that we can use the next $n$ blocks, with $4$ units in each, to output $z^{[n+1]} = \left(-, \frac{|x-y|}{2M}, S_n\!\left(\frac{|x+y|}{2M}\right)\right)$ while keeping the value of the second neuron in each linear layer unchanged. Next, by the same operation, we use the following $n$ blocks to output
$$z^{[2n+1]} = \left(-, -, S_n\!\left(\frac{|x+y|}{2M}\right) - S_n\!\left(\frac{|x-y|}{2M}\right)\right) = \left(-, -, \hat\pi_n(x, y)/M^2\right).$$
Thus, $R(x, y) = \mathcal{L}\left(-, -, \hat\pi_n(x, y)/M^2\right) = \hat\pi_n(x, y)$ by letting $\mathcal{L}(x_1, x_2, x_3) = M^2 x_3$. See Figure 7 for an illustration.

Figure 6. Illustration of computing $|x|$ by a residual block via $|x| = x + 2(-x)_+$.

Figure 7. Illustration of generating $\hat\pi_n(x, y)$ by the constructive ResNet, using $O(n)$ blocks for each $S_n$ term. Grey represents the linear layer and yellow represents the activation layer.
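A small numerical sketch of $\pi_n$ (our own illustration; the helpers repeat the $S_n$ code above so the snippet stays self-contained) makes the bounds (9) and (10) visible:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    # pi_n(x, y) = S_n(|x + y| / 2) - S_n(|x - y| / 2), an approximation of xy
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

g = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(g, g)
for n in (3, 6):
    P = pi(n, X, Y)
    print(n, np.max(np.abs(P - X * Y)), 4.0 ** (-n))   # observed error vs. the bound in (9)
    print("  range of pi_n:", P.min(), P.max())        # stays inside [0, 1], as in (10)
```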
Moreover, we can approximate the multiple product $x_1 x_2 \cdots x_d$ for $x_1, x_2, \dots, x_d \in [0, 1]$. By (10), we can well-define
$$\pi_n^m(x_1, x_2, \dots, x_m) = \pi_n\big(\pi_n^{m-1}(x_1, x_2, \dots, x_{m-1}),\ x_m\big), \quad m = 3, 4, \dots, \quad \text{for } x_1, \dots, x_d \in [0, 1], \qquad (15)$$
and $\pi_n^2(x_1, x_2) = \pi_n(x_1, x_2)$. Then we have the following proposition.

Proposition 12.
$$|\pi_n^m(x_1, x_2, \dots, x_m) - x_1 x_2 \cdots x_m| \le e\, m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [0, 1], \qquad (16)$$
as long as $n \ge 1 + \log_2 m$.

Proof. First, it follows from the definition of $S_n$ and $S(x) = x^2$ that
$$S(x) - S_n(x) = \sum_{m=n+1}^{\infty} 4^{-m} h_m(x).$$
Note that $h_m$ has Lipschitz norm $2^m$. We have
$$\|S' - S_n'\|_{L^\infty([0,1])} \le \sum_{m=n+1}^{\infty} 4^{-m} \|h_m'\|_{L^\infty([0,1])} \le \sum_{m=n+1}^{\infty} 4^{-m}\, 2^m \le 2^{-n}.$$
Since $S'(x) = 2x$, we have
$$S_n'(x) = 2x + \delta, \quad \text{where } |\delta| \le 2^{-n}. \qquad (17)$$
Define $\pi(x, y) := xy$. We have
$$\partial_1 \pi_n(x, y) = \tfrac{1}{2} S_n'\!\left(\tfrac{x+y}{2}\right) - \tfrac{1}{2}\,\mathbb{1}\{x \ge y\}\, S_n'\!\left(\tfrac{x-y}{2}\right) + \tfrac{1}{2}\,\mathbb{1}\{x < y\}\, S_n'\!\left(\tfrac{y-x}{2}\right) = \tfrac{1}{2}(x + y + \delta) - \tfrac{1}{2}\,\mathbb{1}\{x \ge y\}(x - y + \delta_1) + \tfrac{1}{2}\,\mathbb{1}\{x < y\}(y - x + \delta_2)$$
$$= y + \tfrac{1}{2}\big(\delta - \mathbb{1}\{x \ge y\}\delta_1 + \mathbb{1}\{x < y\}\delta_2\big) = \partial_1\pi(x, y) + \tfrac{1}{2}\big(\delta - \mathbb{1}\{x \ge y\}\delta_1 + \mathbb{1}\{x < y\}\delta_2\big), \qquad (18)$$
where $|\delta|, |\delta_1|, |\delta_2| \le 2^{-n}$. The same argument applies to $\partial_2 \pi_n(x, y)$. Thus,
$$\|\partial_i \pi - \partial_i \pi_n\|_{L^\infty([0,1]^2)} \le 2^{-n+1}, \quad \text{where } \partial_i := \frac{\partial}{\partial x_i},\ i = 1, 2. \qquad (19)$$
Define $\pi^m(x_1, x_2, \dots, x_m) = x_1 x_2 \cdots x_m$. Now assume
$$\|\pi^j - \pi_n^j\|_{C([0,1]^j)} \le c_j\, 4^{-n} \qquad (20)$$
holds for all $j \le m - 1$; note that the inequality holds for $j = 2$ with $c_2 = 1$. Then
$$\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le \big\|\pi\big(x_m, \pi^{m-1}\big) - \pi_n\big(x_m, \pi^{m-1}\big)\big\|_{C([0,1]^m)} + \big\|\pi_n\big(x_m, \pi^{m-1}\big) - \pi_n\big(x_m, \pi_n^{m-1}\big)\big\|_{C([0,1]^m)}$$
$$\le 4^{-n} + \|\partial_2 \pi_n\|_{L^\infty([0,1]^2)}\, \|\pi^{m-1} - \pi_n^{m-1}\|_{C([0,1]^{m-1})} \le 4^{-n} + \big(1 + 2^{-n+1}\big)\, c_{m-1}\, 4^{-n} = (1 + \alpha_n c_{m-1})\, 4^{-n},$$
where $\alpha_n := 1 + 2^{-n+1}$. By induction, we have $\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le c_m 4^{-n}$, where $c_m$ satisfies the recurrence $c_m = 1 + \alpha_n c_{m-1}$, $m \ge 3$, with initial value $c_2 = 1$. The solution is
$$c_m = \sum_{j=0}^{m-2} \alpha_n^j \le (m-1)\,\alpha_n^{m-2}.$$
If $m \le 2^{n-1}$, we have
$$c_m \le (m-1)\left(1 + \frac{1}{2^{n-1}}\right)^{m-2} \le (m-1)\left(1 + \frac{1}{m}\right)^{m} \le e\,(m-1) < e\,m.$$
This completes the proof: $\|\pi^m - \pi_n^m\|_{C([0,1]^m)} \le e\, m\, 4^{-n}$ for $m = 3, 4, 5, \dots$, as long as $n \ge 1 + \log_2 m$.

C.2. Proof of Proposition 3 over $[0, 1]^d$

Now we are ready to prove Proposition 3 over $[0, 1]^d$. For completeness, we show the following theorem.

Proposition 13 (Proposition 3). Let $x = (x_1, x_2, \dots, x_d) \in [0, 1]^d$, $\alpha \in \mathbb{N}^d$, and let $x^\alpha$ be any given monomial of degree $p$. Then:
(1) There is a ResNet $R_1 \in \mathcal{RN}(d+1, 4, O(d \log(d/\varepsilon)))$ such that $\|R_1 - x_1 x_2 \cdots x_d\|_{C([0,1]^d)} < \varepsilon$, while having $O(d \log(d/\varepsilon))$ tunable weights. Moreover, there is a ResNet in $\mathcal{RN}(d+1, 4, O(nd))$ that generates $\pi_n^d(x_1, x_2, \dots, x_d)$ exactly.
(2) There is a ResNet $R_2 \in \mathcal{RN}(d+3, 4, O(p \log(p/\varepsilon)))$ such that $\|R_2 - x^\alpha\|_{C([0,1]^d)} < \varepsilon$, while having $O(p \log(p/\varepsilon))$ tunable weights.

Proof. We prove (1) first. Set the right-hand side of inequality (16) equal to $\varepsilon$ and note that $m = d$ under the conditions of (1); then $n$ is of order $\log(d/\varepsilon)$. We now construct the required ResNet with width $4$ and depth $O(nd)$ generating $\pi_n^d(x_1, x_2, \dots, x_d)$. Let $A_{d+1}(x) = (x_1, x_2, \dots, x_d, 0)$. It follows from the proof of Proposition 11 that we can assign weights to the first $2n$ blocks so as to output
$$z^{[2n]} = (-, -, x_3, x_4, \dots, x_d, \pi_n(x_1, x_2))$$
while only changing the values of the first, second, and last neurons in each linear layer. In the next block, we set the values of the first and second neurons to zero, obtaining $z^{[2n+1]} = (0, 0, x_3, x_4, \dots, x_d, \pi_n(x_1, x_2))$; the zero-valued neurons are ready to store the results of the next phase. Then, in the next $2n$ blocks, we can compute $\pi_n^3(x_1, x_2, x_3)$; concretely, by the proof of Proposition 11,
$$z^{[4n+1]} = (0, \pi_n^3(x_1, x_2, x_3), -, x_4, \dots, x_d, -).$$
By repeating the operation above, we can use about $(2n+1)(d-1)$ blocks in total, with width $4$, to generate $\pi_n^d(x_1, x_2, \dots, x_d)$. Moreover, there are about $4d + 4$ weights in each building block; however, from the operations above, only a constant number of weights are non-zero in each block. Therefore, this network has at most $c\,dn$ tunable weights, where $c$ is an absolute constant.
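The iterated construction can be tested directly. The sketch below (our own illustration; helpers repeated for self-containment, and the sample-based check is only a sanity test, not a proof) chains the pairwise `pi` into $\pi_n^d$ as in (15) and compares the observed error with the bound $e\,d\,4^{-n}$ of Proposition 12.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

def pi_multi(n, xs):
    # pi^m_n(x_1, ..., x_m) = pi_n(pi^{m-1}_n(x_1, ..., x_{m-1}), x_m), as in (15)
    out = pi(n, xs[0], xs[1])
    for xk in xs[2:]:
        out = pi(n, out, xk)
    return out

rng = np.random.default_rng(0)
d, n = 6, 5                                  # the bound of Proposition 12 needs n >= 1 + log2(d)
xs = rng.uniform(0.0, 1.0, size=(d, 100_000))
err = np.max(np.abs(pi_multi(n, xs) - np.prod(xs, axis=0)))
print(err, np.e * d * 4.0 ** (-n))           # observed error vs. the bound e*d*4^{-n} in (16)
```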
For (2), if $p \le d$, the argument is the same as in (1), so assume $p > d$. We just need to note that $x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$ can be approximated by
$$\pi_n^d\big(\pi_n^{\alpha_1}(x_1, \dots, x_1), \dots, \pi_n^{\alpha_d}(x_d, \dots, x_d)\big) = \pi_n^p(\underbrace{x_1, \dots, x_1}_{\alpha_1 \text{ times}}, \underbrace{x_2, \dots, x_2}_{\alpha_2 \text{ times}}, \dots, \underbrace{x_d, \dots, x_d}_{\alpha_d \text{ times}}). \qquad (21)$$
Thus, we must store the values of $x_1, \dots, x_d$ in each building block. However, in the proof of (1), once we finish computing $\pi_n(x_1, x_2)$ we lose the values of $x_1$ and $x_2$ in the neurons. That is why we need $d + 3$ neurons in the linear layer in this case: the two extra neurons in the linear layer help us preserve $x_1, \dots, x_d$ in each linear layer. Here we briefly describe a constructive ResNet generating (21) exactly. Let $A_{d+3}(x) = (x, x_1, x_1, 0)$. By the proof of Proposition 11, we can compute $\pi_n(x_1, x_1)$ using $2n$ blocks and obtain $z^{[2n]} = (x, -, -, \pi_n(x_1, x_1))$ at the $2n$-th block. In the next block, we compute $z^{[2n+1]} = (x, 0, x_1, \pi_n(x_1, x_1))$. Then, in the next $2n$ blocks, we compute $\pi_n^3(x_1, x_1, x_1) = \pi_n^2(x_1, \pi_n(x_1, x_1))$ and hence obtain $z^{[4n+1]} = (x, \pi_n^3(x_1, x_1, x_1), -, -)$. Repeating this procedure, we can use $O(pn)$ blocks to generate (21). Similarly, by inequality (16) with $m = p$, the depth is $O(p \log \frac{p}{\varepsilon})$ if the desired accuracy is $\varepsilon$. Moreover, only a constant number of weights are non-zero in each block, so the total number of non-zero (tunable) weights is $O(p \log \frac{p}{\varepsilon})$. An illustration is given in Figure 8.

Figure 8. Illustration of generating $\pi_n^d(x_1, \dots, x_d)$ by the constructive ResNet, using $O(n)$ blocks per pairwise product and $O(nd)$ blocks in total. The topmost $d$ neurons are used for storing values. Grey represents the linear layer and yellow represents the activation layer.

Following this theorem, we add some discussion on why monomials are important. Monomials are the essential constituents of polynomials, which play an integral role in both theory and applications. Additionally, monomials possess a straightforward mathematical structure, which aids in analyzing and comparing the approximation capabilities of neural networks. Numerous intriguing studies have been conducted in this area. Both deep (Yarotsky, 2017) and shallow (Blanchard & Bennouna, 2021) ReLU networks can efficiently approximate monomials over $[0, 1]^d$ with $\mathrm{poly}(d)$ neurons. However, for monomials over $[0, M]^d$ ($M > 1$), shallow ReLU networks require at least $\exp(d)$ neurons to reach accuracy $\varepsilon$ (Shapira, 2023). Further, the cost of shallow networks is reduced to $\mathrm{poly}(d)$ if the monomial is normalized over $[0, M]^d$ (multiplied by a normalization constant $M^{-p}$). A more comprehensive discussion can be found in (Shapira, 2023).

C.3. Proof of Proposition 3

In this subsection, we extend Proposition 13 to $[-M, M]^d$, where $M > 1$. First, we extend Lemma 9 to $[-1, 1]$.

Lemma 14. Let $x, y \in [-1, 1]$. Then $\pi_n(x, y) \in [-1, 1]$.

Proof. Since $x, y \in [-1, 1]$, it suffices to show
$$|\pi_n(x, y)| = \left|S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right)\right| \le 1. \qquad (22)$$
First, we show that $S_n$ is monotonically increasing over $[0, 1]$. For any $0 \le x \le y \le 1$, subadditivity of $\zeta$ gives
$$h_i(y) = h_i\big(x + (y - x)\big) = \zeta\big(2^{i-1}(x + (y - x))\big) \le \zeta(2^{i-1} x) + \zeta\big(2^{i-1}(y - x)\big) = h_i(x) + h_i(y - x),$$
so that
$$S_n(x) - S_n(y) = x - y + \sum_{i=1}^{n} 4^{-i}\big(h_i(y) - h_i(x)\big) \le x - y + \sum_{i=1}^{n} 4^{-i} h_i(y - x) = -S_n(y - x) \le -(y - x)^2 \le 0.$$
Note that $|x + y| \ge |x - y|$ is equivalent to $xy \ge 0$, so by the monotonicity of $S_n$ it suffices to consider the following two cases to show (22). If $xy \ge 0$, then
$$\pi_n(x, y) = S_n\!\left(\frac{|x+y|}{2}\right) - S_n\!\left(\frac{|x-y|}{2}\right) \le \frac{|x+y|}{2} - \left(\frac{|x-y|}{2}\right)^2 = \frac{1}{4}\big(|x|(2 - |x|) + |y|(2 - |y|) + 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
If $xy < 0$, then
$$-\pi_n(x, y) = S_n\!\left(\frac{|x-y|}{2}\right) - S_n\!\left(\frac{|x+y|}{2}\right) \le \frac{|x-y|}{2} - \left(\frac{|x+y|}{2}\right)^2 = \frac{1}{4}\big(|x|(2 - |x|) + |y|(2 - |y|) - 2xy\big) \le \frac{1}{4}(1 + 1 + 2) = 1.$$
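A quick numerical sanity check of Lemma 14 (our own illustration, helpers repeated for self-containment; Proposition 15 below gives the corresponding error bound) evaluates $\pi_n$ on a grid of $[-1, 1]^2$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def S(n, x):
    out, hm = np.array(x, dtype=float), np.array(x, dtype=float)
    for m in range(1, n + 1):
        hm = h(hm)
        out = out - 4.0 ** (-m) * hm
    return out

def pi(n, x, y):
    return S(n, np.abs(x + y) / 2) - S(n, np.abs(x - y) / 2)

g = np.linspace(-1.0, 1.0, 401)
X, Y = np.meshgrid(g, g)
for n in (3, 6):
    P = pi(n, X, Y)
    print(n, "range:", P.min(), P.max())                    # stays inside [-1, 1] (Lemma 14)
    print("  max |pi_n - xy|:", np.max(np.abs(P - X * Y)))  # still approximates xy on [-1, 1]^2
```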
Thus, $\pi_n^m$ is well-defined over $[-1, 1]$ (equation (15)). Next, we show that Proposition 12 holds for $x_1, x_2, \dots, x_m \in [-1, 1]$.

Proposition 15.
$$|\pi_n^m(x_1, x_2, \dots, x_m) - x_1 x_2 \cdots x_m| \le e\, m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [-1, 1], \qquad (23)$$
as long as $n \ge 1 + \log_2 m$.

Proof. For $x, y \in [-1, 1]$, the only change in the proof is equation (18). We note
$$\frac{\partial}{\partial x} S_n\!\left(\frac{|x+y|}{2}\right) = \mathbb{1}\{x + y \ge 0\}\,\tfrac{1}{2}(x + y + \varepsilon_1) - \mathbb{1}\{x + y < 0\}\,\tfrac{1}{2}(-x - y + \varepsilon_2) = \tfrac{1}{2}(x + y + \varepsilon),$$
where $|\varepsilon_1|, |\varepsilon_2|, |\varepsilon| \le 2^{-n}$. Then we can still obtain
$$\|\partial_i \pi - \partial_i \pi_n\|_{L^\infty([-1,1]^2)} \le 2^{-n+1}, \quad \text{where } \partial_i := \frac{\partial}{\partial x_i},\ i = 1, 2, \qquad (24)$$
and the result follows from the proof in Subsection C.1.

Then Proposition 3 follows easily from the following remark.

Remark 16 (Proposition 3). Let $x_1, x_2, \dots, x_m \in [-M, M]$. To approximate $x_1 x_2 \cdots x_m$, we consider the function
$$\hat\pi_n^m(x_1, \dots, x_m) := M^m\, \pi_n^m\!\left(\frac{x_1}{M}, \dots, \frac{x_m}{M}\right)$$
with the approximation accuracy
$$|x_1 x_2 \cdots x_m - \hat\pi_n^m(x_1, \dots, x_m)| \le e\, m\, M^m\, 4^{-n}, \quad x_1, x_2, \dots, x_m \in [-M, M], \qquad (25)$$
as long as $n \ge 1 + \log_2 m$. Moreover, $\hat\pi_n^m$ can be generated by a ResNet $R(x) \in \mathcal{RN}(m+1, 4, O(mn))$ with at most $O(mn)$ tunable weights.

Proof. The remark is a direct corollary of the proof of Proposition 13.

Setting the right-hand side of equation (25) equal to $\varepsilon$ with $m = p$, where $p$ is the degree of the monomial, completes the proof of Proposition 3.

C.4. Proof of Theorem 4

Proof. By Proposition 3, for each $x^\alpha$, $\alpha \in E$, we can use a ResNet with width $4$, depth $O(p \log p/\varepsilon)$, and $d + 3$ neurons in each linear layer to output $R_\alpha(x)$ such that $|R_\alpha(x) - x^\alpha| < \varepsilon$ for $x \in [0, 1]^d$. Hence
$$\Big|\sum_{\alpha \in E} c_\alpha R_\alpha(x) - \sum_{\alpha \in E} c_\alpha x^\alpha\Big| < \varepsilon \sum_{\alpha \in E} |c_\alpha|, \quad x \in [0, 1]^d.$$
Now let $A_{d+4}(x) = (x, x_1, x_1, 0, 0)$. To generate each $R_\alpha(x)$, we need $4$ more computational units in each linear layer and depth $O(p \log p/\varepsilon)$, while having $O(p \log p/\varepsilon)$ non-zero weights; we then store the value of $R_\alpha(x)$ in the last neuron of each linear layer. In this way we can finally output $(x, -, -, -, \sum_{\alpha \in E} c_\alpha R_\alpha(x))$ with depth $O(p|E| \log(p/\varepsilon))$, while having $O(p|E| \log(p/\varepsilon))$ non-zero weights in total.

D. Proof of Theorem 5

In this section, we prove Theorem 5. Before that, we supplement the definition of the Sobolev spaces used in Subsection 4.2.

D.1. Definition supplement of Sobolev spaces

For $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$ and $x = (x_1, \dots, x_d) \in [0, 1]^d$, define
$$D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}, \quad \text{where } |\alpha| = \alpha_1 + \cdots + \alpha_d.$$
Let $r \in \mathbb{N}_+$. The Sobolev space $\mathcal{W}^{r, \infty}([0, 1]^d)$ is the set of functions belonging to $C^{r-1}([0, 1]^d)$ whose $(r-1)$-th order derivatives are Lipschitz continuous, with the norm
$$\|f\|_{\mathcal{W}^r} := \max_{\alpha : |\alpha| \le r}\ \operatorname{ess\,sup}_{x \in [0, 1]^d} |D^\alpha f(x)| < \infty.$$
We denote by $U^r([0, 1]^d)$ the unit ball of $\mathcal{W}^{r, \infty}([0, 1]^d)$, i.e., $U^r([0, 1]^d) = \{f \in \mathcal{W}^{r, \infty}([0, 1]^d) : \|f\|_{\mathcal{W}^r} \le 1\}$. Here $\operatorname{ess\,sup} f = \inf\{a \in \mathbb{R} : \mu(\{x : f(x) > a\}) = 0\}$, where $\mu$ is the Lebesgue measure.

D.2. Proof of Theorem 5

We follow the proof of Theorem 1 in (Yarotsky, 2017). In our proof, we skip some details and focus on the ResNet constructions; the omitted details can be found in Theorem 1 of (Yarotsky, 2017). Now let $f \in U^r([0, 1]^d)$ and $\alpha \in \mathbb{N}^d$. Let $N$ be a positive integer to be determined and $m = (m_1, \dots, m_d) \in \{0, 1, \dots, N\}^d$. The function $\phi_m$ is defined as the product
$$\phi_m(x) = \prod_{k=1}^{d} \psi\!\left(3N\Big(x_k - \frac{m_k}{N}\Big)\right), \qquad \psi(x) = \begin{cases} 1, & |x| < 1, \\ 0, & 2 < |x|, \\ 2 - |x|, & 1 \le |x| \le 2. \end{cases}$$
Let f1(x) = X m {0,...,N}d α:|α|