# drm_revisited_a_complete_error_analysis__54b091b2.pdf

Journal of Machine Learning Research 26 (2025) 1-76 Submitted 8/24; Revised 1/25; Published 4/25

DRM Revisited: A Complete Error Analysis

Yuling Jiao yulingjiaomath@whu.edu.cn School of Artiﬁcial Intelligence, Wuhan University, Wuhan, China National Center for Applied Mathematics in Hubei, Wuhan University, Wuhan, China Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China

Ruoxuan Li ruoxuanli.math@whu.edu.cn School of Mathematics and Statistics, Wuhan University, Wuhan, China

Peiying Wu peiyingwu@whu.edu.cn School of Mathematics and Statistics, Wuhan University, Wuhan, China

Jerry Zhijian Yang zjyang.math@whu.edu.cn School of Mathematics and Statistics, Wuhan University, Wuhan, China Wuhan Institute for Math & AI, Wuhan University, Wuhan, China National Center for Applied Mathematics in Hubei, Wuhan University, Wuhan, China Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China

Pingwen Zhang pwzhang@whu.edu.cn Wuhan Institute for Math & AI, Wuhan University, Wuhan, China School of Mathematical Sciences, Peking University, Beijing, China

Editor: Jianfeng Lu

It is widely known that the error analysis for deep learning involves approximation, statistical, and optimization errors. However, it is challenging to combine them together due to overparameterization. In this paper, we address this gap by providing a comprehensive error analysis of the Deep Ritz Method (DRM). Speciﬁcally, we investigate a foundational question in the theoretical analysis of DRM under the overparameterized regime: given a target precision level, how can one determine the appropriate number of training samples, the key architectural parameters of the neural networks, the step size for the projected gradient descent optimization procedure, and the requisite number of iterations, such that the output of the gradient descent process closely approximates the true solution of the underlying partial diﬀerential equation to the speciﬁed precision? Keywords: deep Ritz method, projected gradient descent, over-parameterization, complete error analysis, new optimization error analysis

1. Introduction

Classical numerical methods, such as ﬁnite element methods (Brenner and Scott, 2007; Ciarlet, 2002), face diﬃculties when solving high-dimensional partial diﬀerential equations (PDEs). The success of deep learning methods in high-dimensional data analysis has led to the development of promising approaches for solving high-dimensional PDEs using deep neural networks, which have attracted much attention (Anitescu et al., 2019; Sirignano

. Corresponding author.

c 2025 Yuling Jiao, Ruoxuan Li, Peiying Wu, Jerry Zhijian Yang and Pingwen Zhang.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1258.html.

Jiao, Li, Wu, Yang and Zhang

and Spiliopoulos, 2018; Lu et al., 2021b; Raissi et al., 2019; Weinan and Yu, 2017; Zang et al., 2020; Berner et al., 2020; Han et al., 2018; Liao and Ming, 2021). Due to the excellent approximation power of deep neural networks, several numerical schemes have been proposed for solving PDEs, including physics-informed neural networks (PINNs) (Raissi et al., 2019), weak adversarial networks (WAN) (Zang et al., 2020) and the deep Ritz method (Weinan and Yu, 2017). PINNs is based on residual minimization, while WAN is inspired by Galerkin method. Based on classical Ritz method, the deep Ritz method is proposed to solve variational problems corresponding to a class of PDEs, which has become one of the most renowned approaches in the ﬁeld of elliptic equations.

The success of deep learning methods in solving high dimensional PDEs has propelled the advancement of its theoretical research. It is now widely recognized that, as a nonparametric estimation method, error analysis in deep learning for PDEs includes approximation error, statistical error (also called generalization error), and optimization error (Grohs and Kutyniok, 2022; Telgarsky, 2021; Weinan, 2020; Bach, 2023). To date, the existing convergence analysis for these deep solvers has predominantly focused on characterizing the trade-oﬀs between the approximation error and the statistical error (Weinan et al., 2019; Hong et al., 2021; Lu et al., 2021d; Hutzenthaler et al., 2020; Shin, 2020; Lanthaler et al., 2022; M uller and Zeinhofer, 2021; Mishra and Rusch, 2021; Kutyniok et al., 2022; Son et al., 2021; Wang et al., 2022; Weinan et al., 2020; Jiao et al., 2022; Duan et al., 2022; Lu et al., 2021c; Mishra and Molinaro, 2022; Ji et al., 2024; Yang and He, 2024; Hu et al., 2023, 2024; Dai et al., 2023; Yu et al., 2024). Meanwhile, these results are conducted in scenarios where the number of neural network parameters is smaller than the number of training samples. However, in practical applications, over-parameterized networks, where the number of parameters far exceeds the number of samples, are more commonly used since empirical evidence suggests that over-parameterization makes the training computationally more eﬃcient. Moreover, recent theoretical studies have indicated that the training loss will converges to zero linearly if one properly initialized the (stochastic) gradient descent specialized in over-parameterized regimes, even though the optimization problem is highly non-convex (Jacot et al., 2018; Allen-Zhu et al., 2019; Du et al., 2019; Zou and Gu, 2019; Liu et al., 2022; Chizat et al., 2019).

The fundamental drivers behind the empirical success of over-parameterized deep learning models continue to elude full understanding, especially when simultaneously accounting for the complex interplay between approximation, generalization, and optimization (Belkin, 2021; Bartlett et al., 2021; Berner et al., 2022). Extensive research eﬀorts have been dedicated to elucidating the role of over-parameterization in linear and kernel models, particularly from the perspective of the double descent phenomenon (Belkin et al., 2018, 2019a; Hastie et al., 2022; Belkin et al., 2019b; Liang and Rakhlin, 2020; Nakkiran et al., 2020; Bartlett et al., 2020; Tsigler and Bartlett, 2023; Belkin, 2021; Bartlett et al., 2021; Tsigler and Bartlett, 2023). However, a crucial gap remains in providing a comprehensive error analysis that jointly accounts for all three key error components. This challenge persists even for the empirical risk minimization estimator in over-parameterized deep learning settings, which has been shown to potentially yield inconsistent results (Kohler and Krzyzak, 2021).

DRM Revisited: A Complete Error Analysis

1.1 Contributions

In this work, we establish the ﬁrst comprehensive error analysis for the deep Ritz method in the over-parameterized setting. This analysis jointly accounts for all three key error components: approximation error, statistical error, and optimization error.

Technically, we derive a novel error decomposition, where the optimization error term we employ is distinct from and tighter than those used in the prior literature. This error decomposition is of independent theoretical interest and holds value for the analysis of other deep learning tasks.

Unlike previous analyses of optimization error, a key feature of our main results is that they do not require the entire training dynamics to remain conﬁned within an inﬁnitesimally small neighborhood of the initial parameter values. This reduces the gap between theory and practical training.

1.2 Organizations

The paper is organized as follows. In Section 2, we ﬁrst introduce the notation, the parallel neural network architecture PNN, and the projected gradient descent (PGD) algorithm used for optimization. Then, following these preliminaries, we present the main theorem of this paper. In Section 3, we present the proof of the main theorem, divided into ﬁve subsections covering a novel error decomposition method, approximation error bounds, statistical error estimates, optimization error control, and the combination of all the separate analysis. In Section 4, we discuss related work in detail and highlight our contributions. Finally, in Section 5, we provide a summary of the paper and outline our planned future work. All proof details are provided in the appendix.

2. Main Result

In this section, we present the main theoretical result of this paper, which establishes the ﬁrst comprehensive error analysis for the deep Ritz method in the over-parameterized setting. To achieve this, we ﬁrst need some groundwork: In Sections 2.1 and 2.2, we introduce necessary notation and the parallel neural network class PNN used in this paper. In Section 2.3, we review the deep Ritz method. In Section 2.4, we provide a detailed exposition of the employed optimization algorithm: the projected gradient descent algorithm. Finally, in Section 2.5, we formally propose our main result.

2.1 Notation

In this section, we provide all the notation needed in this paper. We use bold-faced letters to denote vectors and capital letters to denote matrices or ﬁxed parameters. Unless otherwise speciﬁed, C represents a constant, and C(a, b) or Ci(a, b) represents functions that depend only on a and b. If f : X Y and g : Y Z are two functions, then g f : X Z represents their composition. For two positive functions f(x) and g(x), the asymptotic notation f(x) = O(g(x)) denotes f(x) Cg(x) for some constant C > 0. The notation e O( ) is used to ignore logarithmic terms.

Jiao, Li, Wu, Yang and Zhang

Let N denote the set of natural numbers. We deﬁne N+ := {x N | x > 0}. If x R, x := max{k N : k x} denotes the largest integer strictly smaller than x and x := min{k N : k x} denotes the smallest integer strictly larger than x. If N N+, we deﬁne [N] := {1, 2, . . . , N} to be set of all positive integers less than or equal to N. For a vector x = (x1, . . . , xd)T Rd, the ℓ2-norm and ℓ -norm of x are, respectively, denoted

by x 2 := q Pd i=1 x2 i and x := max1 i d |xi|. We use the usual multi-index notation, that is, for α Nd we write α 1 := α1 + . . . + αd and α! := α1! . . . αd!. Let Ω Rd be an open set. For a function f : Ω R, the Lp(Ω) norm of f is deﬁned as

f Lp(Ω) := Z |f(x)|p dx 1/p , p (0, ) ; f L (Ω) := sup x Ω |f(x)| .

We denote the (weak or classical) derivative of order α of f by

Dαf := α 1f x1α1 x2α2 xdαd .

For s N { }, we denote by Cs(Ω) the set of s-times continuously diﬀerentiable functions on Ω. Additionally, if Ωis compact, we set, for f Cs(Ω),

f Cs(Ω) := max 0 α 1 s sup x Ω |Dαf(x)| .

For any s N and 1 p < , we deﬁne the Sobolev space W s,p(Ω) by

W s,p(Ω) := {f Lp(Ω) : Dαf Lp(Ω), α Nd with α 1 s} .

In particular, when p = 2, we deﬁne Hs(Ω) := W s,2(Ω) for any s N. Moreover, for any f W s,p(Ω) with 1 p < , we deﬁne the Sobolev norm by

f W s,p(Ω) := X

0 α 1 s Dαf p Lp(Ω)

When p = , we set f W s, (Ω) := max 0 α 1 s Dαf L (Ω) .

We also introduce the Sobolev semi-norm by focusing on derivatives of exact order s. For f W s,p(Ω) with 1 p < , we deﬁne

|f|W s,p(Ω) := X

α 1=s Dαf p Lp(Ω)

and when p = , we set

|f|W s, (Ω) := max α 1=s Dαf L (Ω) .

The notation speciﬁc to this paper are listed in Table 1.

DRM Revisited: A Complete Error Analysis

Table 1: Table of Notation

Notation Description

m Total number of sub-networks in a parallel neural network.

M Upper bound of the sum P |ck|, where ck are the linear coeﬃcients combining the sub-networks.

W, L The maximum width (number of neurons in the widest layer) and depth (number of layers) of each sub-network.

Bθ Upper bound on the absolute values of sub-network weights and biases.

nℓ Total number of nonzero weights in the ﬁrst ℓlayers of a sub-network.

D(W, L, d) Total number of trainable weights in a sub-network with width W, depth L, and input dimension d.

θk (a (0) k,1,1, . . . , a (L 1) k,NL,NL 1, b (0) k,1, . . . , b (L 1) k,NL ), weights of the k-th sub-network.

θm in (θ1, . . . , θm), weights of all the m sub-networks in the parallel neural network.

θm out (c1, . . . , cm), linear coeﬃcients used to combine the m sub-networks.

θm total (θm in, θm out), set of all weights in the parallel neural network.

B Range of the uniform distribution [ B, B] used to initialize the innerlayer weights and biases of each sub-network.

η Projection radius for the ℓ2-norm of the vector formed by the inner-layer parameters during projected gradient descent.

ζ Projection radius for the ℓ1-norm of the vector formed by the outer-layer parameters during projected gradient descent.

T Number of iterations of the PGD algorithm.

λ Learning rate of the PGD algorithm.

Ns Number of training samples required to achieve the speciﬁed accuracy.

n Regularity of the target solution u0.

β A scaling exponent determining how the projection radius η grows.

2.2 Topology of the Deep Networks

Let L, d, N0, . . . , NL N. We consider the function φθ : Rd R that can be parameterized by a ρ-activated neural network of the form

φ0(x) = x ,

φℓ(x) = ρ(Aℓ 1φℓ 1(x) + bℓ 1), ℓ= 1, . . . , L 1 ,

φL(x) = AL 1φL 1(x) + b L 1 .

where Aℓ= (a (ℓ) i,j) RNℓ+1 Nℓ, bℓ= (b (ℓ) i ) RNℓ+1 with N0 = d and NL = 1. The number W := max{N1, . . . , NL} is called the width of the network, and L the depth. For

Jiao, Li, Wu, Yang and Zhang

convenience, we denote nℓ, ℓ= 1, . . . , L, as the number of nonzero weights in the ﬁrst ℓ layers of the network, with n L D(W, L, d). Here, D(W, L, d) is deﬁned as

D(W, L, d) := (W + 1)[(L 2)W + d + 1] , (2)

Meanwhile, W is generally greater than d in the following context. Therefore, we will also use the following estimate n L D(W, L, d) W(W + 1)L without loss of generality. Let θ = (a (0) 1,1, . . . , a (L 1) NL,NL 1, b (0) 1 , . . . , b (L 1) NL ) denote the weight vector of a neural network, and let Θ be the space of all such vectors. For a given activation function ρ, we deﬁne the function class NN(W, L, Bθ) as the set of functions φθ realized by ρ-activated networks with width W, depth L, and parameters θ Θ constrained by θ Bθ. Throughout this paper, we take the activation function to be ρ = tanh. Note that we can elevate any weight vector θ to a D(W, L, d)-dimensional vector

θ = (a(0) 1,1, . . . , a (0) W,d, a(1) 1,1, . . . , a (L 2) W,W , a(L 1) 1,1 , . . . , a (L 1) 1,W , b(0) 1 , . . . , b (0) W , . . . , b (L 2) W , b(L 1) 1 )

by padding zeros, with φθ and φθ representing the same element in NN(W, L, Bθ). Therefore, for any φθ1, φθ2 NN(W, L, Bθ), we can align θ1 and θ2 to dimension D(W, L, d), and perform vector operations on them. It also implies that we can formalize the parameter space Θ of NN(W, L, Bθ) as

Θ = [ Bθ, Bθ]D(W,L,d) .

Further, we introduce a Parallel Neural Network class, PNN(m, M, W, L, Bθ), which represents a linear combination of m sub-networks. The structure is shown in Figure 1. Speciﬁcally, for any Φm,θ(x) PNN(m, M, {W, L, Bθ}), it can be expressed as

k=1 ckφk θ(x) , ck R ,

where φk θ(x) NN(W, L, Bθ). Deﬁne θm in := (θ1, . . . , θm) where θk = (a (0) k,1,1, . . . , a (L 1) k,NL,NL 1, b (0) k,1, . . . , b (L 1) k,NL ). Deﬁne θm out := (c1, . . . , cm). Deﬁne Θm as the set of all weight vectors θm total := (θm in, θm out) that parameterize Φm,θ. Where it does not cause ambiguity, the symbol PNN will be used both as an abbreviation for some speciﬁc PNN(m, M, {W, L, Bθ}), and to refer to a general parallel neural network class composed of multiple sub-networks.

2.3 Deep Ritz Method

Recall the deep Ritz method proposed in Weinan and Yu (2017). Let [0, 1]d be the unit hypercube on Rd, Ωbe a bounded open subset strictly contained in [0, 1]d and Ωbe the boundary of Ω. Consider the elliptic equation on Ωequipped with Neumann boundary condition: u + wu = h , x Ω; u n = g , x Ω. (3)

With the following assumptions on the known terms:

Ω C2+α , h L2(Ω) , g H1/2( Ω) , w Cα( Ω) , w(x) c0 > 0 ,

DRM Revisited: A Complete Error Analysis

Figure 1: This ﬁgure illustrates the structure of the Parallel Neural Network. The structure within the red box represents the sub-neural networks, which are fully connected networks, where the dark-colored nodes signify activation functions.

where 0 < α < 1, Equation 3 has a unique weak solution u0 H2(Ω) (Agmon et al., 1959). We further assume that u0 H1(Ω) 1. Let B0 = max{ h L (Ω), g L ( Ω), w L (Ω)}. Deﬁne the energy functional L as follows:

u 2 2 + w|u|2

Ω g Tu ds , (4)

where T is the trace operator. Proposition 1 demonstrates that minimizing L(u) in Equation 4 is equivalent to reducing the distance between u and u0 in the H1 norm.

Proposition 1 For any u H1(Ω), it holds that

2 u u0 2 H1(Ω) L(u) L(u0) B0 1

2 u u0 2 H1(Ω) .

Proof For any u H1(Ω), set v = u u0. Then, it holds that

L(u) = L(u0 + v)

(u0 + v) 2 2 + w|u0 + v|2

Ω h(u0 + v)dx Z

Ω g(Tu0 + Tv)ds

u0 2 2 + w|u0|2

Ω g Tu0 ds + Z

v 2 2 + w|v|2

Ω u0, v dx + Z

Ω wu0v dx Z

= L(u0) + Z

v 2 2 + w|v|2

where the last equality is due to the fact that u0 is the weak solution of Equation 3. Hence

2 v 2 H1(Ω) L(u) L(u0) = Z

v 2 2 + w|v|2

2 dx w L (Ω) 1

2 v 2 H1(Ω) ,

Jiao, Li, Wu, Yang and Zhang

that is, c0 1

2 u u0 2 H1(Ω) L(u) L(u0) B0 1

2 u u0 2 H1(Ω) .

This concludes the proof.

To facilitate implementation of deep learning algorithms, we use Monte Carlo method to discretize the integral type functional L. First, Equation 4 is rewritten as

L(u) = |Ω| E X U(Ω)

u(X) 2 2 2 + w(X)u2(X)

2 u(X)h(X) | Ω| E Y U( Ω)[Tu(Y )g(Y )] , (5)

where U(Ω), U( Ω) are the uniform distribution on Ωand Ω. Based on Equation 5, we introduce the discrete version b L(u):

b L(u) = |Ω|

u(Xp) 2 2 2 + w(Xp)u2(Xp)

2 u(Xp)h(Xp) | Ω|

q=1 u(Yq)g(Yq) , (6)

where {Xp}Nin p=1 i.i.d. U(Ω), {Yq}Nb q=1 i.i.d. U( Ω). Then, we select a deep neural network class Fθ, within which we will minimize b L(uθ) for uθ Fθ. In this paper, our choice is the PNN parallel neural network um,θ = Pm k=1 ckφk θ introduced in Section 2.2, which comprises m sub-networks. Now, Equation 6 turns into:

b L(um,θ) = |Ω|

um,θ(Xp) 2 2 2 + w(Xp)u2 m,θ(Xp)

2 um,θ(Xp)h(Xp)

q=1 um,θ(Yq)g(Yq) . (7)

2.4 Projected Gradient Descent

We use the PGD algorithm to minimize b L(um,θ) in Equation 7, which is an iterative optimization method suitable for constrained optimization problems. Since the Monte Carlo samples {Xp}Nin p=1, {Yq}Nb q=1 are ﬁxed during the optimization process, b L(um,θ) becomes a function solely dependent on the weights θm total, and we denote it as b F(θm total) = b F(θm in, θm out). The PGD algorithm consists of the following three steps:

Initialization. We start with an initial guess (θm total)[0] = (θm in, θm out)[0] as follows:

(i) For the linear coeﬃcients θm out, set

(θm out)[0] = 0 , i.e., (ck)[0] = 0 (k = 1, . . . , m) . (8)

(ii) For the sub-network parameters θm in, initialize each element in (θm in)[0] to follow the same uniform distribution U[ B, B] independently, that is, a (ℓ) k,i,j [0] i.i.d. U[ B, B] , b (ℓ) k,i [0] i.i.d. U[ B, B] . (9)

DRM Revisited: A Complete Error Analysis

Constraint Set. Then, we choose η, ζ > 0, and determine the constraint set as follows:

(i) Let Aη be the (random) set of all weight vectors θm in which satisfy

θm in (θm in)[0] 2 η . (10)

(ii) Let Bζ be the set of all weight vectors θm out which satisfy

k=1 |ck| ζ . (11)

Iterative Update. Finally, let T N+, λ > 0. For each iteration t = 0, . . . , T 1, do

(i) Compute the gradient of the objective function at the current point:

g[t] = θm total b F θm,[t] in , θm,[t] out .

(ii) Update the weight vector by ﬁrst performing a gradient descent step with step size λ and then projecting the result onto the feasible set:

θm in, θm out [t+1] = Proj Aη Bζ (θm in, θm out)[t] λ g[t] , (12)

where Proj C denotes the projection operator in the Euclidean space.

Remark 1 The projection onto the ℓ2 ball Aη can be computed in closed form, while the projection onto the ℓ1 ball Bζ can be implemented with computational complexity that scales linearly with the dimension (Duchi et al., 2008). When Nin = Nb = Ns, a single iteration of our PGD algorithm requires O(m W 2L2Ns) operations, as demonstrated in Appendix D.

Remark 2 The projection step plays a crucial role in the subsequent error analysis: it constrains the range of variation of the neural network parameters and potentially limits the complexity of the neural network class. This makes it possible to derive size-independent statistical error by bounding the Rademacher complexity of the potentially highly overparameterized parallel neural network class.

In the following, we will use A to represent the PGD algorithm, and use u A to denote the output of A which serves as the ﬁnal solution. It is evident that u A is exactly um,θ parameterized with (θm in, θm out)[T ].

2.5 Main Result

We now present the main theorem of this work, which provides a comprehensive end-toend error analysis for solving elliptic equations via the deep Ritz method in the overparameterized setting. The complete proof is given in Section 3.5.

Jiao, Li, Wu, Yang and Zhang

Theorem 1 Suppose that u0 Hn(Ω) for n 2 is the solution of Equation 3. We consider the application of the deep Ritz method to solve Equation 3 using a parallel neural network (PNN) architecture comprising m parallel sub-networks, each with width W and depth L. The network parameters are initialized according to Equations 8 and 9, with linear coeﬃcients connecting the sub-networks set to 0, and each sub-network weight drawn independently from the uniform distribution U[ B, B]. Let η and ζ be the projection radius described in Equations 10 and 11, respectively. Let Nin = Nb = Ns denote the Monte Carlo sample size in Equation 7, and let u A be the output of the PGD algorithm in Equation 12 with T iterations and step size λ. For any 0 < ϵ 1, set

m = Cϵ C1(µ,d,β0,n) , W = 2 log2(d+n 1) +1 , L = log2(d + n 1) + 2 ,

B = Cϵ 2d+2n

n µ 1 , η = ϵ β , ζ = Cϵ 3d 2(n µ 1) ,

T = Cϵ C2(µ,d,β0,n) , λ = Cϵ C2(µ,d,β0,n) , Ns = Cϵ C3(µ,d,β0,n) .

Then, with probability at least 1 2ϵ C3(µ,d,β0,n), the total error satisﬁes

u A u0 2 H1(Ω) Cϵ log1/2(Cϵ 1) = O(ϵ) ,

β > 0 , β0 = max{β, (2d + 2n)(n µ 1) 1} ,

C1(µ, d, β0, n) = C (d + n)3 log2 2(d + n 1) n µ 1 + 11β0 log2(d + n 1) + 36β0 ,

C2(µ, d, β0, n) = C (d + n)3 log2 2(d + n 1) n µ 1 + 15β0 log2(d + n 1) + 48β0 ,

C3(µ, d, β0, n) = 4β0 log2(d + n 1) + 15β0 .

Here, C denotes a place-by-place deﬁned constant depending only on Ω, W, L, d, B0 and n; C and C are positive constants; and µ is an arbitrarily small positive constant.

Remark 3 The assumption u0 Hn(Ω) for n 2 can be achieved by increasing the regularity of the coeﬃcient w and the right-hand side functions h, g in Equation 3. For instance, with Ω C n, such assumption would be realized if we have h Hn 2(Ω), g Hn 3/2( Ω) and w C n 2( Ω). See Agmon et al. (1959) for proof.

Remark 4 Note that in Theorem 1, when d and n are ﬁxed, the architecture of each subnetwork (width W and depth L) remains constant. Consequently, the number of parallel sub-networks m serves as the primary measure of over-parameterization in our analysis.

Remark 5 A notable feature of our analysis is that it does not require the neural network parameters to remain in close proximity to their initialization values during training a restrictive requirement commonly imposed in prior optimization error analyses (Jacot et al., 2018; Allen-Zhu et al., 2019; Du et al., 2019; Zou and Gu, 2019; Liu et al., 2022; Chizat et al., 2019; Nguyen, 2021; Lu et al., 2020; Mahankali et al., 2024). However, when the projection radius η grows faster than the initialization range B (β0 = β), achieving the same precision potentially demands greater computational resources: increased Monte Carlo samples, higher levels of over-parameterization, more iterations, and smaller step sizes.

DRM Revisited: A Complete Error Analysis

Remark 6 From Theorem 1, achieving the error bound O(ϵ) requires a sample size of

Ns = Cϵ C3(µ,d,β0,n) ,

where C3(µ, d, β0, n) = 4β0 log2(d+n 1)+15β0 with β0 = max{β, (2d+2n)(n µ 1) 1}. This yields the convergence rate

u A u0 2 H1(Ω) O N n µ 1 [8 log2(d+n 1)+30](d+n) s

if we let β0 = (2d + 2n)(n µ 1) 1. For comparison, Lu et al. (2021c) established a rate

of N 2n 2 d+2n 4 s , which achieves the minimax lower bound. While our current result does not attain this optimal rate, improving the convergence analysis remains an active direction for our future work.

In this section, we outline the ﬁve key steps required to prove Theorem 1. The full details of the proof are provided in the Appendix. Step 1: Decomposing the total error. The total error between u A and u0 can be decomposed into three main components: approximation error, statistical error, and a novel and tighter optimization error; details can be found in Section 3.1. Step 2: Constructing a parallel neural network with explicit weight bound to approximate in the Sobolev space. Building on methods from Yarotsky (2017), Jiao et al. (2023e), and G uhring and Raslan (2021), we derive an approximation error bound for neural networks in the Sobolev space. This bound is achieved by explicitly constructing tanh-activated neural networks that approximate local Taylor polynomials. Notably, the constructed network has a parallel architecture, meaning the ﬁnal neural network is a linear combination of many structurally identical fully connected sub-networks, see Section 2.2 for detail. By construction, we explicitly control the weight bound of the deep network, which is critical for the statistical error analysis in the over-parameterized setting, as demonstrated in Step 4. We also use this constructed over-parameterized network to deﬁne and analyze the novel optimization error term in Step 3. Step 3: Using the property of over-parameterization to analyze the new optimization error. As demonstrated in Equation 22, the optimization error is bounded by the sum of initialization error and iteration error . Through over-parameterization, the neural network s initialization captures suﬃcient information about the optimal approximation network constructed in Step 2, enabling eﬀective control of the initialization error. Concurrently, the iteration error of the PGD algorithm is controlled by selecting suﬃciently large iteration counts and appropriately small step sizes. A distinguishing feature of our optimization error analysis is that the projection radius for the inner parameters of parallel sub-networks can diverge at a controlled rate, thereby circumventing the training stagnation phenomenon encountered in previous optimization analyses. Step 4: Analyzing the statistical error for over-parameterized deep neural network class. The PGD optimization in Step 3 produces an output u A within an overparameterized neural network class. Standard tools from empirical process theory (Van

Jiao, Li, Wu, Yang and Zhang

Der Vaart and Wellner, 1996; van de Geer, 2000; Gin e and Nickl, 2021) cannot be directly applied to bound the statistical error in this context, as they would yield upper bounds that grow uncontrollably in the over-parameterized regime. Instead, we leverage the explicit weight constraints established in Step 2 and the projection operations in Step 3 to derive size-independent statistical error bounds, which fully exploit the parallel architecture of our neural network framework. By appropriately calibrating the Monte Carlo sample sizes for interior and boundary points, we ensure the statistical error remains bounded within acceptable limits. See Section 3.4 for details. Step 5: Synthesizing the error analysis from each component. By synthesizing the analysis of approximation, optimization, and statistical error from Step 2 to Step 4, we could control the total error between u A and u0 within the desired precision under appropriate parameter settings, thus proving our main result Theorem 1.

3.1 New Error Decomposition

To perform an end-to-end error analysis between u A and u0, we establish the following error decomposition theorem. A key innovation enabling this analysis is the introduction of a novel optimization error , deﬁned as

E opt := b L(u A) b L( u) ,

where u is the best approximation to u0 in a parallel neural network class PNN that may diﬀer from the algorithm s PNN class, formally deﬁned as

u argmin u PNN u u0 2 H1(Ω) . (13)

Theorem 2 Let u A be the PGD algorithm output when using DRM to solve Equation 3 and u PNN deﬁned in Equation 13, the H1 distance between u A and the true solution u0 can be decomposed into

u A u0 2 H1(Ω)

2 u u0 2 H1(Ω) | {z } Eapp

+ h b L(u A) b L( u) i

+ 2 sup u PNN

L(u) b L(u)

| {z } Esta

Proof By Proposition 1, it holds that

u A u0 2 H1(Ω)

n L(u A) b L(u A) + b L(u A) b L( u) + b L( u) L( u) + L( u) L(u0) o

n L( u) L(u0) + b L(u A) b L( u) + 2 sup u PNN

L(u) b L(u) o

2 u u0 2 H1(Ω) | {z } Eapp

+ h b L(u A) b L( u) i

+ 2 sup u PNN

L(u) b L(u)

| {z } Esta

DRM Revisited: A Complete Error Analysis

Remark 7 The optimization error is traditionally deﬁned as Eopt = b L(u A) b L(ˆu), where ˆu denotes the Empirical Risk Minimization (ERM) estimator of Equation 7:

ˆu = argmin um,θ PNN b L(um,θ) .

Since we have

E opt = b L(u A) b L(ˆu) + b L(ˆu) b L( u) b L(u A) b L(ˆu) + 0 = Eopt ,

the newly deﬁned E opt is clearly tighter than Eopt. Additionally, due to the highly non-convex training objective of deep neural networks, it is challenging to obtain detailed information about ˆu, making analysis of Eopt diﬃcult. In contrast, the best approximation element u is explicitly constructed according to the target solution, as shown in Equation 13, so its information can be fully grasped, greatly facilitating the analysis of E opt. In Section 3.3, we will prove that when the over-parameterization level is suﬃciently high, the initialization parameters of neural networks will contain suﬃcient information about u with high probability, and meanwhile, the iteration error of the PGD algorithm can be well controlled. Combining these two points, we can control E opt with arbitrary precision, thus achieving a complete end-to-end error analysis between u A and u0.

3.2 Approximation Error

In this section, we derive an upper bound on the approximation error, which reﬂects the ability of the constructed neural network u to approximate the true solution u0. Recall that

Eapp := u u0 2 H1(Ω) , (14)

where u is deﬁned in Equation 13. Leveraging insights from G uhring and Raslan (2021), we establish that for any prescribed accuracy ϵ > 0, each function f Fn,d,p admits an ϵ-approximation in Sobolev norm W 1,p

(where n, p N, n 2, and 1 p ) within a tanh-activated parallel neural network class PNN . The function class Fn,d,p is deﬁned as

Fn,d,p := n f W n,p(Ω) : f W n,p(Ω) 1 o .

A complete proof is provided in Appendix A. By setting p = 2 in Theorem 7 (Appendix A), we obtain the following approximation result, which precisely aligns with our requirements.

Theorem 3 Given any u0 Fn,d,2, for some suﬃciently small ϵ > 0 and any 0 < ϵ < ϵ , there exists a neural network u = u m, θ PNN( m, M, { W, L, B θ}) with

m = C1ϵ d n µ 1 , M = C2ϵ 3d 2(n µ 1) , W = 2 log2(d+n 1) +1 ,

L = log2(d + n 1) + 2 , B θ = C3ϵ 2d+2n

such that u0 u m, θ H1(Ω) ϵ ,

where n 2 and µ > 0 is an arbitrarily small positive number; C1, C2 and C3 are universal constants which only depend on d and n.

Jiao, Li, Wu, Yang and Zhang

3.3 Optimization Error

Now, we analyze the optimization error E opt, deﬁned for the PGD algorithm output u A PNN (Section 2.4) and the best approximation element u PNN (Equation 13) as

E opt := b L(u A) b L( u) .

We choose u = u m, θ PNN( m, M, { W, L, B θ}) in Theorem 3:

E opt = b L(u A) b L(u m, θ) . (15)

Intuitively, the weights of u m, θ serve as target parameters during the optimization process, speciﬁcally the sub-network parameters ( θ1, . . . , θ m) and the linear coeﬃcients ( c1, . . . , c m). Accordingly, we conﬁgure the implemented network um,θ PNN with subnetwork width W = W, depth L = L, and uniform distribution range B = B θ in Equation 9, which aligns with the parameter speciﬁcations of u m, θ. Given random sub-network initialization (θm in)[0] = (θ1, . . . , θm)[0], we seek to characterize the set of suﬃciently good initializations relative to ( θ1, . . . , θ m). We formalize this notion in the following deﬁnition.

Deﬁnition 1 Let Gm, m,R,δ denote the event where for each target parameter θk, k = 1, . . . , m, there exist at least R distinct sub-network weight vectors (θik,v)[0], v = 1, . . . , R, in the random initialization (θm in)[0] such that

(θik,v)[0] θk δ , k = 1, . . . , m , v = 1, . . . , R .

Furthermore, we require that ik,v = ik ,v whenever k = k or v = v .

Remark 8 Simply put, Gm, m,R,δ ensures that for each target θk, at least R sub-networks have already suﬃciently approximated it during parameter initialization phase.

In the subsequent analysis, we set the number of sub-networks m = m R Q, where R, Q N. To formalize the random indices ik,v from Deﬁnition 1, we introduce integervalued random variables:

sk,v(ω) , k = 1, . . . , m , v = 1, . . . , R . (16)

Assuming the m sub-networks are arranged in a predetermined order, we deﬁne these variables as follows: For ω Gm, m,R,δ, when k = 1, we let s1,v [m] denote the index of the v-th sub-network satisfying (θ )[0] θ1 δ; when k > 1, we let sk,v [m] \ {sl,v : l < k, v R} denote the index of the v-th sub-network among the remaining m (k 1)R sub-networks satisfying (θ )[0] θk δ. For ω / Gm, m,R,δ, we set sk,v = (k 1)R + v. Intuitively, sk,v picks out the good initialization weight vectors. Based on sk,v and ( c1, . . . , c m), we deﬁne a set of transition parameters to bridge u A and the target u m, θ:

θm, total := (θm, in , θm, out) , where θm, in := (θm in)[0] . (17)

For the linear coeﬃcients θm, out := (c 1, . . . , c m), we deﬁne:

c sk,v := ck

R , k = 1, . . . , m , v = 1, . . . , R , (18)

DRM Revisited: A Complete Error Analysis

c q := 0 , q / {sk,v : k = 1, . . . , m , v = 1, . . . , R} . (19)

When um,θ is parameterized with θm, total, we denote it as u m and expand it as:

s=1 c s φs θ [0](x) =

ck R φsk,v θ [0](x) , (20)

where (φs θ)[0] represents the s-th sub-network parameterized by (θs)[0]. To clarify the relationship between u m and u m, θ, we express the latter as:

u m, θ(x) =

k=1 ck φk θ(x) =

ck R φk θ(x) , (21)

where φk θ is the k-th sub-network in u m, θ with weights θk. When θk and (θsk,v)[0] are suﬃciently close a condition ensured by event Gm, m,R,δ we can expect u m and u m, θ to be similarly proximate. Using u m as an intermediate term, we decompose the optimization error E opt into:

E opt = b L(u A) b L(u m, θ) = b L(u A) b L(u m) | {z } iteration error

+ b L(u m) b L(u m, θ) | {z } initialization error

Iteration Error: Recall that u A is exactly um,θ parameterized with (θm total)[T ], the ﬁnal output of iterative Equation 12, and b L(u A) can be denoted as b F(θm,[T ] in , θm,[T ] out ). Similarly, b L(u m) can be expressed as b F(θm,[0] in , θm, out). Hence, the iteration error measures how close the PGD algorithm output (θm in, θm out)[T ] is to the transition parameters θm, total = (θm,[0] in , θm, out). We now characterize the speciﬁc PNN class containing u A. By Equation 11, the linear coeﬃcients satisfy (θm out)[t] 1 ζ during iteration. For sub-network parameters, we have

(θm in)[t] (θm in)[0] + (θm in)[t] (θm in)[0] (θm in)[0] + (θm in)[t] (θm in)[0] 2 ,

which, by Equation 10, implies (θm in)[t] B θ + η. Consequently,

u A PNN(m, ζ, { W, L, B θ + η}) .

Proposition 2 establishes that by setting ζ = M in Equation 11 and appropriately choosing the parameter R, iteration count T, and step size λ, we can bound the iteration error to any desired precision. This is achieved by using the strong convexity of the loss function b F(θm in, θm out) := b L(um,θ) with respect to the linear coeﬃcients θm out and its Lipschitz continuity with respect to the sub-network parameters θm in, as detailed in Appendix C.1.

Proposition 2 Let ζ = M in Equation 11. Then, the output u A of the PGD algorithm belongs to PNN(m, M, { W, L, B θ + η}). When we run the algorithm with step size

λ = min{T 1, 2C 1 4 m 1 M 2(B θ + η) 4 L} ,

Jiao, Li, Wu, Yang and Zhang

where T is the total number of iterations and η is the projection radius in Equation 10, the iteration error in Equation 22 satisﬁes

b L(u A) b L(u m) M2

2R + C5 M2(B θ + η)3 Lη

R + C6m M2(B θ + η)4 L

Here, u m PNN(m, M, { W, L, B θ}) is deﬁned in Equation 20, and C4, C5, C6 are universal constants depending only on Ω, W, L, d and B0.

Initialization Error: Next, we analyze the initialization error. By Equations 20 and 21, this term can be eﬀectively controlled when the target weights θk and their corresponding random initializations (θsk,v)[0] are suﬃciently close. To bound this error precisely, we require P(Gm, m,R,δ) from Deﬁnition 1 to approach 1 as δ 0, which inherently constrains the minimum value of m. Proposition 3 formalizes this relationship, demonstrating that by appropriately scaling the number of sub-networks in um,θ, we can bound the initialization error with arbitrarily high probability and precision. The proof is presented in Appendix C.2.

Proposition 3 Consider u m, θ PNN( m, M, { W, L, B θ}) from Theorem 3. For any δ > 0 and R, Q N with Q suﬃciently large, if we set m = m R Q, then with probability at least

1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q ,

the initialization error in Equation 22 satisﬁes

b L(u m) b L(u m, θ) C7 M2B 3 L θ δ ,

where u m is deﬁned in Equation 20 and C7 is a universal constant depending only on Ω, W, L, d and B0.

Combining Propositions 2 and 3, we obtain the following estimate of E opt.

Theorem 4 Consider u m, θ PNN( m, M, { W, L, B θ}) from Theorem 3. For any δ > 0 and R, Q N with Q suﬃciently large, set m = m R Q. Let u A PNN(m, M, { W, L, B θ+ η}) denote the output of the PGD algorithm in Equation 12 with ζ = M in Equation 11 and step size

λ = min{T 1, 2C 1 4 m 1R 1Q 1 M 2(B θ + η) 4 L} ,

where T is the total number of iterations. Then with probability at least

1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q ,

the optimization error E opt in Equation 15 satisﬁes

2R + C5 M2(B θ + η)3 Lη

R + C6m M2(B θ + η)4 L

T + C7 M2B 3 L θ δ ,

where η is the projection radius deﬁned in Equation 10.

DRM Revisited: A Complete Error Analysis

3.4 Statistical Error

In this section, we present the upper bound of the statistical error, with detailed proofs provided in Appendix B. Note that

Esta := sup u PNN

L(u) b L(u)

is a random variable, since it is a function of the Monte Carlo sample points {Xp}Nin p=1, {Yq}Nb q=1. Our task is to control Esta with high probability.

Theorem 5 Let PNN = PNN(m, M, {W, L, Bθ}). Let Nin = Nb = Ns in the Monte Carlo sampling. Let 0 < ξ < 1. Then, with probability at least 1 ξ, it holds that

Esta = sup um,θ PNN

L(um,θ) b L(um,θ)

C8M2B2L θ N 1/2 s p

log(BθWLNs) + p

where C8 is a universal constant which only depends on Ω, W, L, d and B0.

Theorem 5 provides a general analysis for any PNN class. From Theorem 4, we know that the PGD output u A belongs to the speciﬁc class PNN(m, M, { W, L, B θ + η}) due to the projection steps that explicitly limit parameter variation. This classiﬁcation, when combined with Theorem 5, yields precise upper bounds for the statistical error Esta. Critically, the bound established in Theorem 5 remains independent of the number of sub-networks m, enabling eﬀective control of statistical error even in highly over-parameterized settings where m becomes arbitrarily large.

3.5 Proof of Main Result

We now present the proof of Theorem 1, which provides the ﬁrst comprehensive error analysis integrating approximation, statistical, and optimization errors for PDE solving using over-parameterized deep networks to the best of our knowledge.

Step 1: By Theorem 3, we know that for any 0 < ϵ 1, there exists a neural network u m, θ PNN( m, M, { W, L, B θ}), with

m = C1ϵ d n µ 1 , M = C2ϵ 3d 2(n µ 1) , W = 2 log2(d+n 1) +1 ,

L = log2(d + n 1) + 2 , B θ = C3ϵ 2d+2n

such that the approximation error Eapp ϵ2 < ϵ.

Step 2: By Theorem 5, with probability at least 1 ξ, the statistical error satisﬁes

Esta C8 M2(B θ + η)2 LN 1/2 s log1/2 (B θ + η) W LNs + log1/2(ξ 1) .

Setting Ns = Cϵ C3(µ,d,β0,n) , ξ = Cϵ C3(µ,d,β0,n)

Jiao, Li, Wu, Yang and Zhang

β0 = max{β, (2d + 2n)(n µ 1) 1} , C3 = 4β0 log(d + n 1) + 6d n µ 1 + 12β0 + 2 ,

it follows that, with probability at least 1 ξ, Esta Cϵ log1/2(Cϵ 1) = O(ϵ).

Step 3: Recall that η = ϵ β, β > 0. In order to bound the optimization error E opt Cϵ with probability at least 1 ξ, we need to determine parameters Q, R, δ, T, η such that

1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q 1 ξ , (23)

2R + C5 M2(B θ + η)3 Lη

R + C6m M2(B θ + η)4 L

T + C7 M2B 3 L θ δ Cϵ . (24)

To ensure the ﬁrst, second and the last terms in Equation 24 bounded by Cϵ, we require

6(d+n) log(d+n 1)+21d+18n

n µ 1 +1 , R Cϵ 6 log(d+n 1)β0 18β0 6d n µ 1 2β 2 .

Since 1 x x 2 for x 0, we could solve for Q in Equation 23 as follows:

W( W+1) L Q ϵ C(d+n)3 log2 2(d+n 1) n µ 1 5β0 log2(d+n 1) 15β0 β .

Then, m = m R Q C ϵ C1(µ,d,β,β0,n) . Finally, we turn to the third term of Equation 24. To ensure that C6m M2(B θ + η)4 L

we require that

T Cϵ C2(µ,d,β,β0,n) .

Thus, we complete the proof of our main theorem.

4. Related Work

This section surveys recent advances in error analysis for deep neural networks. We begin by discussing some seminal contributions of approximation and statistical error analysis, then explore recent developments in optimization error analysis and the theoretical attempts to combine all three types errors, providing the context necessary to understand the motivations and innovations presented in this paper.

4.1 Approximation Error

Approximation error in the context of deep neural networks quantiﬁes the discrepancy between the target function and its neural network representation. Theoretical analysis of approximation capabilities began with shallow sigmoidal networks in the 1980s (Cybenko, 1989; Hornik et al., 1989; Hornik, 1991), as reviewed in Pinkus (1999). Recent years

DRM Revisited: A Complete Error Analysis

have witnessed a shift toward Re LU networks due to their superior empirical performance in contemporary learning tasks. Yarotsky (Yarotsky, 2017) pioneered the construction of Re LU networks that achieve arbitrary approximation accuracy using Taylor expansion. This breakthrough inspired modern approximation techniques that control errors through architectural parameters such as depth, width, and network size (Yarotsky, 2017, 2018; Petersen and Voigtlaender, 2018; Zhou, 2020; Shen et al., 2020; Siegel and Xu, 2020; Shen et al., 2022; Lu et al., 2021a). For comprehensive reviews, see Petersen (2020); De Vore et al. (2021).

Researchers have also developed neural networks with exceptional expressivity that mitigate the curse of dimensionality (Yarotsky and Zhevnerchuk, 2020; Yarotsky, 2021; Shen et al., 2021b,a; Jiao et al., 2023b). These approximation theories have recently extended to Sobolev spaces, particularly for deep learning solutions to PDEs. G uhring et al. adapted Yarotsky s approach to Sobolev spaces using approximate partition of unity and averaged Taylor expansion (G uhring et al., 2020; G uhring and Raslan, 2021). Additionally, researchers have derived Sobolev norm approximations for networks with Re LUk activations by examining connections between deep neural networks and B-splines (Duan et al., 2022; Jiao et al., 2023c). Recent work has further established links between deep neural network approximation capabilities and Barron spaces, yielding results that overcome the curse of dimensionality (Ma et al., 2022; Liao and Ming, 2023).

4.2 Statistical Error

Statistical (generalization) error in learning theory is characterized through the uniform law of large numbers over the network class. Classical approaches in empirical process theory use tools such as symmetrization and Lipschitz contraction to transform generalization error analysis into bounding the complexity of neural network classes, measured by Rademacher complexity, covering number, or VC-dimension. For comprehensive treatments, see Van Der Vaart and Wellner (1996); van de Geer (2000); Gin e and Nickl (2021); Cucker and Smale (2002). However, generalization analysis based solely on the uniform law of large numbers may yield suboptimal error bounds (Bartlett et al., 2017). Localized techniques that leverage the local structure of hypothesis function classes can achieve sharper error bounds when the Bernstein condition or oﬀset condition is satisﬁed, as demonstrated in Bartlett et al. (2005); Koltchinskii (2006); Mendelson (2018); Xu and Zeevi (2021); Kanade et al. (2024) and related literature.

It is important to note that the statistical error in the deep Ritz method (DRM) cannot be directly addressed using the contraction principle, as the diﬀerential operators in the loss function lack Lipschitz continuity. One eﬀective approach to overcome this challenge involves using the chain rule to represent the gradient of the neural network as another neural network class, then bounding the complexity of this derived class (Duan et al., 2022). By recasting the gradient representation in this manner, researchers can exploit properties of neural network classes such as Lipschitz continuity and covering numbers to establish bounds on statistical error. This methodology provides a rigorous framework for analyzing the generalization performance of deep PDE solvers (Jiao et al., 2022; Duan et al., 2022; Lu et al., 2021c; Jiao et al., 2023f; Ji et al., 2024; Yang and He, 2024; Jiao et al., 2023c).

Jiao, Li, Wu, Yang and Zhang

4.3 Theory on ERM with Deep Neural Networks

The convergence rate of Empirical Risk Minimization (ERM) with deep neural networks in regression, classiﬁcation, and PDE solutions can be established within nonparametric estimation frameworks (Bauer and Kohler, 2019; Kohler and Langer, 2021; Schmidt-Hieber et al., 2020; Nakada and Imaizumi, 2020; Farrell et al., 2021; Jiao et al., 2023d; Suzuki, 2018; Suzuki and Nitanda, 2021; Weinan et al., 2019; Hong et al., 2021; Lu et al., 2021d; Hutzenthaler et al., 2020; Shin, 2020; Lanthaler et al., 2022; M uller and Zeinhofer, 2021; Mishra and Rusch, 2021; Kutyniok et al., 2022; Son et al., 2021; Wang et al., 2022; Weinan et al., 2020; Jiao et al., 2022; Duan et al., 2022; Lu et al., 2021c; Mishra and Molinaro, 2022; Ji et al., 2024; Yang and He, 2024; Hu et al., 2023, 2024; Dai et al., 2023; Yu et al., 2024). This approach carefully balances the trade-oﬀbetween approximation and statistical errors, providing theoretical guarantees on deep learning model performance. However, these convergence rate results for ERM are only applicable when the neural network class size is smaller than the training sample count. For deep Re LU networks, their VC-dimension or covering number is bounded by their size (Bartlett et al., 2019), meaning theoretical convergence guarantees for ERM can only be established in under-parameterized regimes, where depth, width, and network size are selected as functions of sample size to balance approximation and statistical errors.

Recent work by Jiao et al. (2023e); Yang and Zhou (2024); Chen et al. (2025); Jiao et al. (2024); Ding et al. (2025) proposes using weight norm instead of conventional measures like network width, depth, or size to characterize both approximation and statistical errors. This approach has enabled researchers to derive convergence rates for ERM in over-parameterized regimes, representing a signiﬁcant advancement toward theoretically understanding modern, over-parameterized neural networks without addressing optimization error.

4.4 Optimization Error

Current analytical frameworks for studying optimization error in deep neural networks primarily rely on neural tangent kernel and mean ﬁeld theory in over-parameterized or inﬁnite width settings (Jacot et al., 2018; Allen-Zhu et al., 2019; Du et al., 2019; Zou and Gu, 2019; Liu et al., 2022; Chizat et al., 2019; Nguyen, 2021; Lu et al., 2020; Mahankali et al., 2024). A signiﬁcant limitation of these approaches is their requirement that training dynamics remain proximate to initial parameter values, a condition rarely satisﬁed in practical applications.

In over-parameterized regimes, the optimal training loss approaches zero as the network perfectly interpolates the data. This phenomenon presents a dual challenge: on one hand, when networks memorize data, they may develop substantially large weight upper bounds (Vershynin, 2020; Vardi et al., 2021; Shen et al., 2020; Lu et al., 2021a), potentially rendering size-independent generalization error uncontrollable (Golowich et al., 2018; Jiao et al., 2023e). On the other hand, the norm constraints employed in recent ERM analyses (Jiao et al., 2023e; Yang and Zhou, 2024; Chen et al., 2025; Jiao et al., 2024) may be insuﬃciently large, and even randomized initializations in NTK and mean ﬁeld frameworks may violate these constraints. These challenges represent the primary obstacles preventing researchers from simultaneously addressing the three fundamental errors approximation, generalization, and optimization in modern over-parameterized deep learning.

DRM Revisited: A Complete Error Analysis

4.5 Works on Complete Error Analysis

Beck et al. (2022); Jentzen and Welti (2023) conducted a comprehensive error analysis of deep regression in the under-parameterization setting. Building upon recent research on the estimation error of gradient descent in regression (Kohler and Langer, 2021; Kohler and Krzyzak, 2023b; Drews and Kohler, 2023), Jiao et al. (2025) derived consistency results for DRM, encompassing all three types of errors. However, it should be noted that the results in Jiao et al. (2025) are speciﬁcally applicable to three-layer networks only. One major drawback of Jiao et al. (2025); Kohler and Langer (2021); Kohler and Krzyzak (2023b); Drews and Kohler (2023) is their reliance on the iteration being very close to the initialization, which is an unrealistic requirement. This restrictive condition limits the practicality of their proposed theories, as real-world training of deep neural networks often involves substantial parameter updates that move far from the initial conﬁguration.

In contrast, based on the techniques proposed in Kohler and Krzyzak (2023a), we obtain consistency results for DRM that simultaneously addresses all three types of error in the over-parameterized deep neural network setting while allowing parameter iterates to diverge from their initialization values. This relaxation substantially enhances the practical relevance of our theoretical framework.

5. Conclusion

In this paper, we provide the ﬁrst complete error analysis for DRM that includes the approximation, statistical, and optimization error in the scenario of over-parameterization. Our analysis is based on the projected gradient descent algorithm and does not require constraining the neural network weights near their initial values during the optimization process, thereby completely moving away from the lazy training framework. This marks a milestone in the ﬁeld of theoretical understanding of solving PDEs via deep learning.

Several questions deserve further investigation. Firstly, our analytical techniques rely on the random initialization of over-parameterized neural networks. In the current analysis, we do not use any prior knowledge to design the parameter initialization method; instead, we use a general uniform distribution. This results in a theoretically excessive number of training samples and iteration steps to achieve the desired accuracy. Therefore, exploring the eﬀective use of prior information to improve these results is an intriguing subject. Secondly, the gradient descent algorithm used in our theoretical analysis is full gradient descent, which has certain gaps compared to the stochastic gradient descent (SGD) algorithm commonly used in practice. Lastly, recent second-order methods, such as M uller and Zeinhofer (2023); M uller and Mont ufar (2024), have demonstrated high accuracy in deep PDE solving, and analyzing these methods within our framework presents a promising and challenging direction for future research. Overall, the analytical framework presented in this paper is highly versatile and can be applied to other areas in deep PDE solving, such as PINNs and various inverse problems. We aim to explore these topics in greater depth.

Acknowledgments

Jiao, Li, Wu, Yang and Zhang

This work has been funded by the National Key Research and Development Program of China (No. 2023YFA1000103), by the National Natural Science Foundation of China (No. 123B2019, No. 12125103, No. U24A2002, No. 12371441), and by the Fundamental Research Funds for the Central Universities.

The appendix is divided into four parts. In Appendix A, we provide a detailed explanation of how to construct the optimal approximation function in the Sobolev space using the PNN structure as discussed in Section 3.2. In Appendix B, we present the complete proof of the statistical error upper bound estimation for the over-parameterized PNN network class as given in Section 3.4. Note that some of the lemmas used in Appendices A and B are derived from previous work or are classical results. For the sake of completeness, we have still provided detailed proofs of these lemmas. In Appendix C, we present the proofs of the theorems and lemmas involved in the optimization error analysis in Section 3.3, which are newly proposed in this paper. In Appendix D, we evaluate the complexity of a single step of the projected gradient descent algorithm.

Appendix A. Detailed Approximation Error Analysis

This section presents a detailed supplement to Section 3.2 of the main text. Building on the techniques developed in Yarotsky (2017); Jiao et al. (2023e); G uhring and Raslan (2021), we establish an upper bound for the approximation error of neural networks in the Sobolev norm W 1,p, assuming the target function belongs to W n,p with n 2. This bound is obtained by explicitly constructing tanh-activated neural networks that approximate local Taylor polynomials. A key feature of our construction is its parallel architecture: the ﬁnal network is formed as a linear combination of multiple structurally similar, fully connected sub-networks, as described in Section 2.2.

A.1 Exponential Partition of Unity with tanh Activation

We ﬁrst introduce the notion of exponential partition of unity , following the construction in G uhring and Raslan (2021). This concept plays a crucial role in localizing function approximations and is particularly compatible with the tanh activation function.

Deﬁnition 2 Let d, j, τ, N N and s R. A set of function families (Λ(j,τ,N,s))N N,s R 1, where each Λ(j,τ,N,s) := {Ψs

m : m {0, . . . , N}d}

consists of (N + 1)d functions Ψs m : Rd R, is called an exponential partition of unity of order τ and smoothness j, if there exist constants D > 0, S > 0, and C = C(k, d) > 0, such that for all N N, s S, and k {0, . . . , j}, the following properties are satisﬁed:

i. Uniform sobolev bounds:

Ψs m W k, (Rd) CNksmax{0,k τ} for every Ψs m Λ(j,τ,N,s) .

ii. Exponential decay outside support: Let

m = {x Rd : x N 1m N 1} .

DRM Revisited: A Complete Error Analysis

Then for every Ψs

m Λ(j,τ,N,s),

Ψs m W k, (Ωcm) CNksmax{0,k τ}e Ds .

iii. Approximate partition of unity:

m {0,...,N}d Ψs

W k, ((0,1)d) CNksmax{0,k τ}e Ds ,

for every Ψs

m Λ(j,τ,N,s).

iv. Neural network realizability: There exists an activation function ρ : R R such that for each Ψs m Λ, there exists a neural network ψθ with d-dimensional input and output, two hidden layers, and at most C nonzero weights, satisfying

l=1 [ψθ(x)]l = Ψs

m, and ψθ(x) W k, ((0,1)d) CNk smax{0,k τ} .

Moreover, the network weights satisfy θ Cs N.

Let ρ = tanh. We now deﬁne a class of smooth bump functions, denoted by Λ(j,0,N,s)(ρ).

Deﬁnition 3 Let j N, τ = 0, and let ρ = tanh, deﬁned by ρ(x) = ex e x

ex+e x . For a scaling factor s 1, we deﬁne the one-dimensional bump function ψs : R R as

ψs(x) := ρ(s(x + 3/2)) ρ(s(x 3/2))

Given N, d N and m {0, . . . , N}d, we construct the d-dimensional bump function Ψs

m : Rd R as a tensor product of scaled and shifted versions of ψs:

l=1 ψs 3N xl ml

Finally, for s 1, we denote the collection of such bump functions by Λ(j,0,N,s)(ρ) := {Ψs

m : m {0, . . . , N}d}.

The following result, as shown in G uhring and Raslan (2021) (Lemma 4.5), conﬁrms that Λ(j,0,N,s)(ρ) satisﬁes all the conditions to form an exponential partition of unity of order 0 and smoothness j.

Lemma 1 The collection of families of functions (Λ(j,0,N,s)(ρ))N N,s R 1 deﬁned in Deﬁnition 3 is an exponential partition of unity of order 0 and smoothness j.

Jiao, Li, Wu, Yang and Zhang

A.2 Approximate the Target with Polynomials

To construct a local approximation of Sobolev functions using a smooth partition of unity and polynomials, we present the following result, which serves as the foundation for the neural network approximation developed in subsequent sections. This proposition is adapted from G uhring and Raslan (2021) (Lemma D.1), where a detailed proof is provided.

Proposition 4 Let d N, j N, k {0, . . . , j}, k n 1, and 1 p . Let µ > 0 be arbitrarily small, and let ρ(x) = tanh(x). For α Nd, deﬁne xα := xα1 1 xαd d . For N N, set s = Nµ. Let (Ψs m)m {0,...,N}d be a family of partition functions in Λ(j,0,N,s)(ρ).

Then there exist constants C = C(d, n, p, k) > 0 and e N = e N(d, p, µ, k) N such that the following holds: For every f W n,p(Ω), there exists a function f N deﬁned as

m {0,...,N}d

α 1 n 1 cf,m,αΨs

satisﬁes the approximation estimate

f f N W k,p(Ω) C f W n,p(Ω) N (n k µk)

for all N e N. In addition, the polynomial coeﬃcients satisfy

|cf,m,α| C f W n,p(Ωm,N)Nd/p ,

where Ωm,N Rd is the open cube in terms of , centered at N 1m with radius N 1, and f W n,p(Rd) is an extension of f.

Remark 9 There exists some C = C(n, d, p) > 0 such that f W n,p(Rd) C f W n,p(Ω) (see Stein (1970), Theorem VI.3.1.5).

A.3 Approximate Polynomials with Neural Networks

This subsection aims to illustrate how neural networks can be used to approximate sums of localized polynomials. We will speciﬁcally analyze this approximation capability of neural networks under the W 1,p norm (i.e., the case k = 1), based on the deﬁnition of approximation error in Equation 14. It is worth noting that for k 2, corresponding approximation results can also be obtained using similar techniques. The key to achieving this approximation lies in the adopted activation function ρ(x) = tanh(x) and its properties. A core characteristic of this function is its smoothness and the existence of non-zero derivatives: there exists a point x0 R and a neighborhood U such that ρ Ci+1(U) and ρ(r)(x0) = 0 for all r = 1, . . . , i. As shown in G uhring and Raslan (2021) (Proposition 4.7), the monomials xr can be eﬀectively approximated by linear combinations of shifted and scaled copies of ρ. The following lemma formalizes this result in the speciﬁc cases of x and x2.

Lemma 2 Let B > 0, and deﬁne ρ(x) = tanh(x). Suppose x0 R is a point such that ρ(r)(x0) = 0 for r = 1, 2. Let C = C(B) > 0 be a constant, where C(B) is monotonically increasing in B, and let ϵ (0, 1) denote the desired approximation accuracy. Then,

DRM Revisited: A Complete Error Analysis

1. (Approximation of x) There exists a neural network φθ1 NN(1, 2, C ϵ 1) with parameters θ1 = ((A0, b0), (A1, b1)) deﬁned as

, b0 = (x0) , A1 = C ϵρ(1)(x0)

, b1 = Cρ(x0)

ϵρ(1)(x0) ,

where C = C (B) > 0, such that

x φθ1(x) W 1, ([ B,B]) ϵ . (27)

2. (Approximation of x2) There exists a neural network φθ2 NN(2, 2, C ϵ 2) with parameters θ2 = ((A0, b0), (A1, b1)) deﬁned as

2 1 , b1 = C2ρ(x0)

ϵ2ρ(2)(x0) ,

where C = C (B) > 0, such that

x2 φθ2(x) W 1, ([ B,B]) ϵ . (28)

To extend the approximation from individual monomials to product functions, we employ the polarization identity xy = [(x + y)2 (x y)2]/4, which transforms the problem into the approximation of squared terms. Further error analysis requires handling composite and product functions in W 1,

spaces. Relevant technical arguments are shown in Lemma 3 below:

Lemma 3 Let d1, d2 N, and let Ω1 Rd1, Ω2 Rd2 be open, bounded, and convex domains. Then there exists C1 = C(d1, d2) > 0 and C2 = C(d1) > 0 such that

1. (Chain rule) Let f W 1, (Ω1; Rd2) and g W 1, (Ω2) be Lipschitz functions with range(f) Ω2. Then g f W 1, (Ω1), and

g f W 1, (Ω1) C1 max n g L (Ω2), |g|W 1, (Ω2) |f|W 1, (Ω1;Rd2) o . (29)

2. (Product rule) Let u, v W 1, (Ω1). Then uv W 1, (Ω1), and

uv W 1, (Ω1) C2 u W 1, (Ω1) v W 1, (Ω1) . (30)

Moreover, if u W 1, (Ω1), v W 1, (Ω2), and h(x, y) := u(x)v(y) for (x, y) Ω1 Ω2, then h W 1, (Ω1 Ω2) and

h W 1, (Ω1 Ω2) max u W 1, (Ω1) v L (Ω2), u L (Ω1) v W 1, (Ω2) . (31)

Proof The ﬁrst part is a direct consequence of G uhring et al. (2020) (Corollary B.5). For the general case of the second part, the product rule estimate follows from G uhring and Raslan (2021) (Lemma B.5). For the case of separated variables, it is straightforward

Jiao, Li, Wu, Yang and Zhang

to verify that h L (Ω1 Ω2) = u L (Ω1) v L (Ω2). Moreover, for all 1 i d1 and 1 j d2, we have xih(x, y) = xiu(x) v(y) and yjh(x, y) = u(x) yjv(y). Therefore,

h W 1, (Ω1 Ω2) = max h L (Ω1 Ω2), max 1 i d1 xih L , max 1 j d2 yjh L

max u W 1, (Ω1) v L (Ω2), u L (Ω1) v W 1, (Ω2) .

This concludes the proof of the second part.

With the above preparations, we now construct neural networks to approximate products of variables. We begin with the basic case of approximating the binary product xy, as described in the following lemma.

Lemma 4 (Approximation of xy) Let B > 0, C = C(B) > 0, C = C (B) > 0 and ϵ (0, 1) be the desired approximation accuracy. Then, there exists a neural network φθ NN(4, 2, C ϵ 2) with parameters

R4 2 , b0 = (x0, x0, x0, x0)T R4 ,

4ϵ2ρ(2)(x0)( 2, 1, 2, 1) R1 4 , b1 = 0 R , (32)

such that xy φθ(x, y) W 1, ([ B,B]2) ϵ .

Proof To approximate xy, we use the polarization identity. Let f : [ 2B, 2B] R be the square function, f(z) = z2. Deﬁne auxiliary linear functions:

u : [ B, B]2 [ 2B, 2B] , (x, y) 7 x + y ,

v : [ B, B]2 [ 2B, 2B] , (x, y) 7 x y .

Then, we have xy = (f u(x, y) f v(x, y))/4 for all (x, y) [ B, B]2. By Lemma 2, let φˆθ(z) be the neural network satisfying

f φˆθ W 1, ([ 2B,2B]) ϵ .

The explicit form of φˆθ(z) is given by

φˆθ(z) = C2

h ρ(x0) 2ρ x0 ϵ

C z + ρ x0 2 ϵ

where C = CLem 2(2B) and x0 is a suitable point where ρ(2)(x0) = 0. We then construct the neural network φθ(x, y) to approximate xy using this φˆθ and the polarization identity:

φθ(x, y) = 1

4 φˆθ(x + y) φˆθ(x y) = 1

4 φˆθ u(x, y) φˆθ v(x, y) .

DRM Revisited: A Complete Error Analysis

The approximation error for xy on [ B, B]2 is bounded as follows:

φθ(x, y) xy W 1, ([ B,B]2) = 1

φˆθ u φˆθ v (f u f v) W 1, ([ B,B]2)

φˆθ u f u W 1, ([ B,B]2) + 1

φˆθ v f v W 1, ([ B,B]2) .

Note that |u|W 1, ([ B,B]2) = |v|W 1, ([ B,B]2) = 1. Applying the chain rule estimate Equation 29 from Lemma 3 to each term, we have

φθ(x, y) xy W 1, ([ B,B]2) Cchain ϵ .

Setting ϵ = ϵ/Cchain then yields φθ(x, y) xy W 1, ([ B,B]2) ϵ. Letting C := Cchain C, we obtain the explicit expression

φθ(x, y) = C2

4ϵ2ρ(2)(x0)

C (x + y) + ρ x0 2ϵ

C (x y) ρ x0 2ϵ

C (x y) i .

The parameters deﬁned in Equation 32 directly implement this construction.

Building on the approximation of the bivariate monomial xy, we next construct approximations for the multivariate product x1 xd.

Lemma 5 (Approximation of x1 xd) Let d 2, 0 < ϵ < 1, and set κ = log2 d . Let C = C(d) > 0. Then there exists a neural network φθ NN(2κ+1, κ + 1, Cϵ 2) such that

x1 xd φθ(x) W 1, ([0,1]d) 4dϵ , x = (x1, , xd) [0, 1]d .

Proof The proof proceeds by building approximations for products of increasing dimensions, beginning with the bivariate case and extending through recursive composition. We ﬁrst establish the case where d = 2κ for some κ N. Let Esϵ denote the upper bound of the approximation error at recursion level s for 1 s κ, with E0 = 0. In our recursive construction, the outputs of one approximation stage φs θ become the inputs for the next stage. As will be detailed in Step 2, the L norm of φs θ(x1, . . . , x2s) for xi [0, 1] is bounded by 1+Esϵ. To guarantee that these intermediate outputs remain within a range [ B, B], we need to set B max0 s κ 1(1 + Esϵ). Step 1: Base case (d = 2). By Lemma 4, there exists C = C(B) > 0, C = C (B) > 0, and φ1 θ NN(4, 2, C ϵ 2) such that

xy φ1 θ(x, y) W 1, ([ B,B]2) ϵ . (33)

Using Equation 33, since B 1, we directly have

x1x2 φ1 θ(x1, x2) W 1, ([0,1]2) E1ϵ, E1 = 1 (34)

Step 2: Recursive step for d = 4. For s = 2, denote by

φ2 θ(x1, x2, x3, x4) := φ1 θ φ1 θ(x1, x2), φ1 θ(x3, x4) .

Jiao, Li, Wu, Yang and Zhang

The approximation error decomposes as

x1x2 x3 x4 φ2 θ(x1, x2, x3, x4) W 1, ([0,1]4) x1x2 x3 x4 x1x2φ1 θ(x3, x4) W 1, ([0,1]4) | {z } H1 + x1x2φ1 θ(x3, x4) φ1 θ(x1, x2)φ1 θ(x3, x4) W 1, ([0,1]4) | {z } H2 + φ1 θ(x1, x2)φ1 θ(x3, x4) φ1 θ φ1 θ(x1, x2), φ1 θ(x3, x4) W 1, ([0,1]4) | {z } H3

First, by Equations 33 and 34, we have the following bounds

u vφ1 θ(u, v) L ([ B,B]2) E1ϵ , v uφ1 θ(u, v) L ([ B,B]2) E1ϵ ,

φ1 θ L ([0,1]2) 1 + E1ϵ , uφ1 θ L ([0,1]2) 1 + E1ϵ , vφ1 θ L ([0,1]2) 1 + E1ϵ .

Then, the three error terms are bounded as follows

H1: Let u1 = x1x2, v1 = x3x4 φ1 θ(x3, x4). Then, we have u1 W 1, ([0,1]2) 1 and v1 W 1, ([0,1]2) E1ϵ. By Equation 31, it holds that

H1 max u1 W 1, ([0,1]2) v1 L ([0,1]2), u1 L ([0,1]2) v1 W 1, ([0,1]2) E1ϵ .

H2: Let u2 = x1x2 φ1 θ(x1, x2), v2 = φ1 θ(x3, x4). Then, u2 W 1, ([0,1]2) E1ϵ, v2 W 1, ([0,1]2) 1 + E1ϵ. By Equation 31, we obtain

H2 E1ϵ(1 + E1ϵ) .

H3: Let u3 = φ1 θ(x1, x2) and v3 = φ1 θ(x3, x4). Since u3 L ([0,1]2) 1 + E1ϵ B, it holds that u3v3 φ1 θ(u3, v3) L ([0,1]4) ϵ . For the ﬁrst-order partial derivatives, take x1 as an example, we have 1φ1 θ φ1 θ(x1, x2), φ1 θ(x3, x4) x1φ1 θ(x1, x2) x1φ1 θ(x1, x2)φ1 θ(x3, x4) L ([0,1]4) 1φ1 θ(u3, v3) v3 L ([0,1]4) (1 + E1ϵ) ϵ(1 + E1ϵ) ,

where 1 denotes taking derivative of the ﬁrst position. Therefore, we get

H3 max{ϵ, ϵ(1 + E1ϵ)} ϵ(1 + E1ϵ) .

Finally, noting that E1 = 1 and 0 < ϵ2 < ϵ < 1, we have

x1x2 x3 x4 φ2 θ(x1, x2, x3, x4) W 1, ([0,1]4) E1ϵ + E1ϵ(1 + E1ϵ) + ϵ(1 + E1ϵ) [(E1 + 1)2 + E1]ϵ = E2ϵ , E2 = 5 .

Step 3: Network realization for d = 4. According to Lemma 4, the product x1x2 (or x3x4) can be approximated using an activation layer and a linear layer. To approximate (x1x2)(x3x4), one would typically repeat this process on the outputs of the two sub-products.

DRM Revisited: A Complete Error Analysis

However, since the intermediate results x1x2 and x3x4 are not needed explicitly, we can merge their ﬁnal linear layers with the initial linear transformation used to approximate their product. This reduces the number of layers: the full product x1x2x3x4 can be approximated using two activation layers, and a ﬁnal output layer. Combining this with Equation 32, we construct the weights ((A0, b0), (A1, b1), (A2, b2)) as follows

C ϵ C 0 0 2ϵ

R8 4 , b0 =

x0 x0 x0 x0 x0 x0 x0 x0

A1 = C 4ϵρ(2)(x0)

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2

R4 8 , b1 =

x0 x0 x0 x0

ϵ2ρ(2)(x0)( 2, 1, 2, 1) R1 4 , b2 = 0 ,

which exactly expresses φ2 θ(x1, x2, x3, x4). It holds that W = 23, L = 3 and Bθ = C ϵ 2. Step 4: General recurrence (d = 2s). The recursive construction extends naturally to arbitrary dimensions d = 2s. Suppose the approximation of x1x2 x2s 1 admits an error bound of the form

x1x2 x2s 1 φs 1 θ (x1, . . . , x2s 1) W 1, ([0,1]2s 1) Es 1ϵ ,

φs θ(x1, . . . , x2s) := φ1 θ φs 1 θ (x1, . . . , x2s 1), φs 1 θ (x2s 1+1, . . . , x2s) .

Then, similar to the analysis in Step 2, we could obtain

x1x2 x2s φs θ(x1, . . . , x2s) W 1, ([0,1]2s) (Es 1 + 1)2 + Es 1 ϵ

This yields a recursive relation for the approximation constants:

Es = (Es 1 + 1)2 + Es 1, for 2 s κ .

To establish an explicit bound for Eκ, we set Js := Es + 1, yielding Js = Js 1(Js 1 + 1) 2J2 s 1. Letting Ds := log2 Js, the recurrence becomes Ds 1 + 2Ds 1 with D1 = 1. This

Jiao, Li, Wu, Yang and Zhang

solves to Ds 2s 1, giving Es 22s 1 1. Since 2κ = d, Eκ 2d 1 1, which means setting approximate range B = 2d is enough, and

x1 xd φκ θ(x) W 1, ([0,1]d) (2d 1 1)ϵ , x = (x1, , xd) [0, 1]d .

By recursively applying the layer-merging strategy introduced in the approximation of x1x2x3x4, we can construct a neural network φκ θ to approximate the product x1 xd which integrates intermediate linear layers. The resulting network satisﬁes

W(φκ θ) = 2κ+1, L(φκ θ) = κ + 1, and Bθ(φκ θ) = C ϵ 2 ,

and is parameterized by ((A0, b0), . . . , (Aℓ, bℓ), . . . , (Aκ, bκ)):

Layer ℓ= 0.

C 0 0 ... ... ... ... ...

C ϵ C 0 0 2ϵ

R2κ+1 2κ , b0 =

x0 x0 x0 x0 ...

x0 x0 x0 x0

Layers 1 ℓ κ 1.

Aℓ= C 4ϵρ(2)(x0)

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2 ...

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2

2 1 2 1 2 1 2 1

4 2 4 2 4 2 4 2

Aℓ R(2κ ℓ+1) (2κ ℓ+2) , bℓ= (x0, , x0)T R2κ ℓ+1 1 .

Final layer ℓ= κ.

ϵ2ρ(2)(x0)( 2, 1, 2, 1) R1 4 , bκ = 0 R . (36)

DRM Revisited: A Complete Error Analysis

Step 5: Extension to arbitrary d. For arbitrary d 2, we set κ = log2 d so that 2κ 1 < d 2κ < 2d. The target function is deﬁned through dimension padding

φθ(x) := φκ θ

Id 0(2κ d) d

0d 1 1(2κ d) 1

where Id is the d d identity matrix, 0p q is the p q zero matrix, and 1(2κ d) 1 is the all-ones vector. This construction preserves the approximation quality:

x1 xd φθ(x) W k, ([0,1]d) = x1 x2κ φθ(x) W 1, ([0,1]d) 4dϵ ,

while maintaining W = 2κ+1 and L = κ + 1. With C(d) = 4d C , it follows that setting Bθ = C(d)ϵ 2 completes the proof of the lemma.

Throughout the following proof, approximations are carried out on the domain Ω, a bounded open subset strictly contained in [0, 1]d where the target PDE solution is deﬁned. The function to be approximated is given by

m {0,...,N}d

α 1 n 1 cf,m,α Ψs m(x) xα.

To approximate f N, we ﬁrst construct neural networks φm,α θ for each term Ψs m(x) xα.

Lemma 6 Let d, N, s 1, n 2. Let C = C(d) > 0. For any 0 < ϵ < ϵ , where ϵ > 0 is suﬃciently small, there exists φm,α

θ NN(W, L, Bθ) where

W = 2 log2(d+ α 1) +1 , L = log2(d + α 1) + 2 , Bθ = max{3Ns, (3d + 3/2)s, Cϵ 2} ,

such that for some constant C(n, d) > 0,

θ (x) W 1, (Ω) C(n, d)s Nϵ , x Ω,

for all m {0, , N}d and α Nd with α 1 n 1.

Proof We recall from Equation 25 that Ψs

m(x) := Qd l=1 ψs(3N(xl N 1ml)) with ψs(x) = 2 1[ρ(s(x + 3/2)) ρ(s(x 3/2))]. Our goal is to approximate the function

l=1 ψs l xα , where ψs l (xl) = ψs(3N(xl N 1ml)) , x = (x1, , xd) Ω.

Let κ = log2(d+ α 1) . Our construction strategy treats the target function as a product of 2κ components. We achieve this by representing the function through an extended vector and then applying a product approximation network. Step 1: Approximation of product factors via extended vector. We start with deﬁning an extended vector as

x = (ψs 1, , ψs d , xi1, , xi α 1 | {z } α 1

, 1, , 1 | {z } 2κ (d+ α 1)

) [0, 1]2κ ,

Jiao, Li, Wu, Yang and Zhang

where the xij terms correspond to the variables appearing in the monomial xα. We construct a neural network φθ (x) to approximate x through three distinct components. First, for each basis function ψs l (xl) with l = 1, . . . , d, we achieve exact representation using two neurons with weight 3Ns and biases 3mls 3s/2. Second, for the α 1 components in the monomial term, we apply the approximation network from Lemma 2 (Equation 27). Each xij is approximated with error ϵ using weights of magnitude Cϵ 1 with C = C(d) > 0 from Lemma 5, and bias x0. We denote these approximated values as ˆxi and deﬁne the approximated monomial as ˆxα. Since Ωis strictly contained in [0, 1]d, choosing ϵ suﬃciently small ensures that ˆx [0, 1]d. Finally, the remaining 2κ (d+ α 1) components are implemented as constant functions with value 1, realized by neurons with zero input weights and unit bias. The resulting network output is

φθ (x) = (ψs 1, , ψs d , ˆxi1, , ˆxi α 1 | {z } α 1

, 1, , 1 | {z } 2κ (d+ α 1)

) [0, 1]2κ .

For notational convenience, denote by P(x) := Qd i=1 xi for a vector x = (x1, x2, . . . , xd). It is easy to check that ψs l W 1, (Ω) = O(s N). Thus, repeatedly applying the product rule (Equation 30) from Lemma 3 yields:

Ψs m(x)xα P φθ (x) W 1, (Ω)

l=1 ψs l (xl)xα

l=1 ψs l (xl)ˆxα W 1, (Ω) C1(n, d)s Nϵ .

Step 2: Product network composition and parameterization. Letting C = C(d) > 0 from Lemma 5, we construct a neural network φκ θ that approximates the product function P on [0, 1]2κ with parameters W(φκ θ) = 2κ+1, L(φκ θ) = κ + 1, Bθ(φκ θ) = Cϵ 2. Applying the chain rule from Lemma 3, the approximation error satisﬁes

P φθ (x) φκ θ φθ (x) W 1, (Ω) = (P φκ θ) φθ (x) W 1, (Ω) C max P φκ θ L ([0,1]κ), |P φκ θ|W 1, ([0,1]κ) |φθ (x)|W 1, (Ω; [0,1]κ)

C2(n, d)s Nϵ .

We deﬁne our ﬁnal network as the composition φm,α θ := φκ θ φθ with parameter

((A , b ), (A0, b0), (A1, b1), , (Aℓ, bℓ), , (Aκ, bκ)) .

Below, we provide an explicit construction of the parameter matrix. To simplify the presentation, we assume here that d and α 1 are both even with α 1 = 2 and xα = x1x2; the construction for the odd case is slightly diﬀerent but conceptually equivalent.

For the ﬁrst layer implementing φθ , denote by τN := (3Ns, 3Ns)T R2, bi := ( 3mis+3/2s, 3mis 3/2s)T R2, 1n Rn the all-ones vector, and 0m n Rm n

DRM Revisited: A Complete Error Analysis

the zero matrix. Then, we have

0(2κ α 1) d

R(d+2κ) d , b =

bd x01 α 1 12κ α 1

Next, denote by

ϵ 2C ϵ 2C ϵ 2C ϵ 2C ϵ

C ϵ C ϵ 2C ϵ 2C ϵ 2C ϵ 2C ϵ

C ϵ C ϵ C ϵ

1 ρ(1)(x0) 1 ρ(1)(x0) 2 ρ(1)(x0) 2 ρ(1)(x0) 1 ρ(1)(x0) 1 ρ(1)(x0) 2 ρ(1)(x0) 2 ρ(1)(x0)

ϵ Cρ(1) ϵ Cρ(1) 2ϵ Cρ(1) 2ϵ Cρ(1) ϵ Cρ(1) ϵ Cρ(1) 2ϵ Cρ(1) 2ϵ Cρ(1)

and π := (x0 2ρ(x0)/ρ(1)(x0), x0 4ρ(x0)/ρ(1)(x0), x0, x0) R1 4. Using the merging technique, the weight matrix for ℓ= 0 is as follows

A0 = diag Π1, . . . , Π1 | {z } d/2

, Π2, . . . , Π2 | {z } α 1/2

, Π3, . . . , Π3 | {z } (2κ d α 1)/2

R(2κ+1) (d+2κ) ,

b0 = x01T 2d , π, . . . , π | {z } α 1/2

, x01T 2κ+1 2d 2 α 1 T R2κ+1 .

For layers ℓ= 1, . . . , κ, the construction mirrors Lemma 5, using the parameters from Equations (35) and (36) to form the product approximation network φκ θ.

Step 3: Final architecture and error bound. The composed network has the following parameters: W = 2κ+1 = 2 log2(d+ α 1) +1, L = κ + 2 = log2(d + α 1) + 1 and Bθ = max{3Ns, (3d + 3/2)s, Cϵ 2}. The ﬁnal approximation error is bounded by the triangle inequality as

θ W 1, (Ω) = Ψs m(x)xα φκ θ φθ (x) W 1, (Ω) Ψs m(x)xα P φθ (x) W 1, (Ω) + P φθ (x) φκ θ φθ (x) W 1, (Ω) C3(n, d)s Nϵ ,

which completes the proof.

Now, we can construct the parallel neural network Φ m, θ PNN( m, M, { W, L, B θ}) to approximate f N.

Jiao, Li, Wu, Yang and Zhang

Theorem 6 For some suﬃciently small ϵ > 0 and any 0 < ϵ < ϵ , there exists a neural network Φ m, θ PNN( m, M, { W, L, B θ}) with

m = C1(n, d)(N + 1)d , M = C2(n, d)Nd/p(N + 1) , W = 2 log2(d+n 1) +1 , L = log2(d + n 1) + 2 , B θ = max{3Ns, (3d + 3/2)s, C(d)ϵ 2} ,

f N(x) Φ m, θ W 1, (Ω) C3(n, d)s N(N + 1)dϵ , x = (x1, , xd) Ω.

Proof By Lemma 6, for any m {0, , N}d and α 1 n 1, Ψs

m xα can be approximated by φm,α

θ NN( W, L, B θ) with W = 2 log2(d+n 1) +1, L = log2(d + n 1) + 2 and B θ = max{3Ns, (3d + 3/2)s, C(d)ϵ 2}. By Proposition 4, it holds that

|cf,m,α| CProp 4 f W n,p(Ωm,N)Nd/p CProp 4 f W n,p(Ω)Nd/p CProp 4Nd/p .

Also, observe that X

α 1 n 1 1 =

j=0 dj ndn 1.

Let Φ m, θ PNN( m, M, { W, L, B θ}), that is,

Φ m, θ(x) =

k=1 ckφk θ(x) := X

m {0,...,N}d

α 1 n 1 cf,m,αφm,α

where m = (N + 1)dndn 1 and M = CProp 4Nd/p(N + 1)dndn 1. Note that here, any surplus ck coeﬃcients and φk θ sub-networks are set to zero. Then, it holds that

f N(x) Φ m, θ(x) W 1,p(Ω) = f N(x) X

m {0,...,N}d

α 1 n 1 cf,m,α φm,α

θ (x) W 1,p(Ω)

m {0,...,N}d

α 1 n 1 |cf,m,α| Ψs

θ (x) W 1,p(Ω)

CProp 4CLem 6(n, d)Nd/p 2d ndn 1(N + 1)ds Nϵ .

When p = , we have f N(x) Φ m, θ(x) W 1, (Ω) C(n, d)s N(N + 1)dϵ.

A.4 Approximation Error Bound for Parallel Neural Networks

Finally, by combining Proposition 4 with Theorem 6, we can derive an upper bound for the approximation error.

Theorem 7 Let n, d, m N and 1 p . Let the target function f Fn,d,p. For some suﬃciently small ϵ > 0 and any 0 < ϵ < ϵ , there exists Φ m, θ PNN( m, M, { W, L, B θ})

DRM Revisited: A Complete Error Analysis

with m = C1(n, d)ϵ d n µ 1 , M = C2(n, d)ϵ d(p+1) (n µ 1)p , W = 2 log2(d+n 1) +1, L = log2(d +

n 1) + 2, and B θ = C3(n, d)ϵ 2d+2n

n µ 1 such that f Φ m, θ W 1,p(Ω) ϵ ,

where µ is an arbitrarily small positive number.

Proof First, by Proposition 4, we choose

N = (ϵ/2CProp 4) 1/(n µ 1) and s = Nµ .

Then, for each m {0, , N}d, there exists a polynomial pm(x) = P α 1 n 1 cf,m,αxα

such that f X

m {0, ,N}d Ψs

W 1,p(Ω) CProp 4

n µ 1 CProp 4 ϵ 2CProp 4 = ϵ

Secondly, since u W 1,p(Ω) C(Ω) u W 1, (Ω), we apply Theorem 6 with N and s chosen

above, and set ϵ Thm 6 = C (n, d) ϵ1+ d+µ+1

n µ 1 , then it holds that f N(x) Φ m, θ(x) W 1, (Ω) C(n, d)(N + 1)ds Nϵ Thm 6 (2C(Ω)) 1ϵ .

Hence, the total approximation error satisﬁes f Φ m, θ W 1,p(Ω) f f N W 1,p(Ω) + f N Φ m, θ W 1,p(Ω) ϵ ,

with m = C1(n, d)ϵ d n µ 1 , M = C2(n, d)ϵ d(p+1) (n µ 1)p , W = 2 log2(d+n 1) +1, L = log2(d +

n 1) + 2, and B θ = C3(n, d)ϵ 2d+2n

Appendix B. Detailed Statistical Error Analysis

In this section, we provide a detailed expansion of Section 3.4 in the main text, which is mainly based on the approach in Kohler and Krzyzak (2023a) and Jiao et al. (2023a). Recall the deﬁnition of statistical error in Section 3.4:

Esta := sup u PNN

L(u) b L(u) .

The task is to control Esta with high probability. To achieve this goal, we organize the analysis into following four parts.

B.1 Some Neural Network Function Classes

We begin with some auxiliary function classes that are essential for subsequent analysis. First, we deﬁne the squared classes with respect to PNN(m, M, W, L, Bθ):

F 1 := f : Ω R | um,θ PNN(m, M, W, L, Bθ) s.t. f(x; θ) = x1um,θ(x) 2 ,

Jiao, Li, Wu, Yang and Zhang

F 2 := f : Ω R | um,θ PNN(m, M, W, L, Bθ) s.t. f(x; θ) = u2 m,θ(x) .

Further, denote by

F1 := {f : Ω R | um,θ PNN(m, M, W, L, Bθ) s.t. f(x; θ) = x1um,θ(x)} ,

F2 := {f : Ω R | um,θ PNN(m, M, W, L, Bθ) s.t. f(x; θ) = um,θ(x)} ,

F3 := {f : Ω R | um,θ PNN(m, M, W, L, Bθ) s.t. f(x; θ) = um,θ(x)| Ω} .

Finally, we deﬁne the sub-network function classes:

F1,sub := {f : Ω R | φθ(x) NN(W, L, Bθ) s.t. f(x; θ) = x1φθ(x)} ,

F2,sub := {f : Ω R | φθ(x) NN(W, L, Bθ) s.t. f(x; θ) = φθ(x)} ,

F3,sub := {f : Ω R | φθ(x) NN(W, L, Bθ) s.t. f(x; θ) = φθ(x)| Ω} .

Lemmas 7 9 establish several regularity properties for NN(W, L, Bθ) and the newly deﬁned Fi,sub. Proofs of Lemmas 7 and 8 can be found in Appendices B.5 and B.6. Below, we assume that θ and θ are two arbitrary parameter vectors satisfying

θ, θ Θ = [ Bθ, Bθ]D(W,L,d) , Bθ 1 .

Lemma 7 For any φθ NN(W, L, Bθ) and x Ω, we have |φθ(x)| (W + 1)Bθ. Moreover, for any φθ, φ θ NN(W, L, Bθ) and x Ω, it holds that

|φθ(x) φ θ(x)| 2W L

LBL 1 θ θ θ 2 .

Lemma 8 For any φθ NN(W, L, Bθ) and x Ω, we have | xmφθ(x)| W L 1BL θ . Moreover, for any φθ, φ θ NN(W, L, Bθ), x Ωand m [d], it holds that

| xmφθ(x) xmφ θ(x)| 2W 2L 1

L(L + 1)B2L θ θ θ 2 , x Ω.

Lemma 9 is a direct result of Lemma 7 and Lemma 8.

Lemma 9 Let fi( ; θ), fi( ; θ) Fi,sub, i = 1, 2, 3. Then for any x Ω, they satisfy both boundedness and Lipschitz conditions with respect to parameters:

fi(x; θ) Bi , fi(x; θ) fi(x; θ) Li θ θ 2 ,

where the constants are given by:

B1 = W L 1BL θ , B2 = B3 = (W + 1)Bθ ,

L1 = 2W 2L 1

L(L + 1)B2L θ , L2 = L3 = 2W L

DRM Revisited: A Complete Error Analysis

B.2 Controlling Statistical Error Through Rademacher Complexity

Then, we focus on the expectation of Esta with respect to the Monte Carlo samples, E[Esta]. The following lemma is direct.

Lemma 10 Let D := {Xp}Nin p=1 {Yq}Nb q=1. The expected statistical error decomposes as:

E[Esta] = ED

sup um,θ PNN

L(um,θ) b L(um,θ)

sup um,θ PNN

Li(um,θ) b Li(um,θ) =:

i=1 E[Ei sta] .

L1(u) = |Ω|

2 EX U(Ω) u(X) 2 , L2(u) = |Ω|

2 EX U(Ω) w(X)u2(X) ,

L3(u) = |Ω|EX U(Ω) h(X)u(X) , L4(u) = | Ω|EY U( Ω) g(Y )Tu(Y ) .

and b Li(u) is the discrete version of Li(u), for example,

b L1(u) = |Ω|

p=1 u(Xp) 2 .

To control E[Ei sta] using symmetrization techniques, we introduce Rademacher complexity as our analytical foundation.

Deﬁnition 4 Two types of Rademacher complexity of function class F associate with random sample {Xk}N k=1 are deﬁned as

RN(F) = E{Xk,σk}N k=1

k=1 σku(Xk) ,

ˆRN(F) = E{Xk,σk}N k=1

k=1 σku(Xk) ,

where, {σk}N k=1 are N i.i.d Rademacher variables with P(σk = 1) = P(σk = 1) = 1

For Rademacher complexity RN(F), we have following two structural results. Proofs of Lemmas 11 and 12 can be found in Appendices B.7 and B.8.

Lemma 11 Let F be a class of functions mapping from Ωto R. If a : Ω R is a function such that |a(x)| B for all x Ω, then

RN(a F) BRN(F) ,

where a F := { f : Ω R | f(x) = a(x)f(x) for some f F}.

Jiao, Li, Wu, Yang and Zhang

Lemma 12 Let F be a class of functions mapping from Ωto R, and let Φ R be a λ-Lipschitz function. Then, it holds that

RN(Φ F) λRN(F) .

The following lemma bounds E[Ei sta] in terms of ˆRN(Fi,sub).

Lemma 13 Let um,θ PNN(m, M, {W, L, Bθ}). It holds that

E[E1 sta] 4d|Ω|W L 1BL θ M2 ˆRNin(F1,sub) , E[E3 sta] 2B0|Ω|M ˆRNin(F2,sub) ,

E[E2 sta] 4B0|Ω|(W + 1)BθM2 ˆRNin(F2,sub) , E[E4 sta] 2B0| Ω|M ˆRNb(F3,sub) .

Proof We will complete the proof in the following three steps.

Step 1. Take { Xp}Nin p=1 as an independent copy of {Xp}Nin p=1. Denote by DX := {Xp}Nin p=1, D X := { Xp}Nin p=1 and DX, X := {Xp, Xp}Nin p=1. Then, we have

L1(um,θ) b L1(um,θ) = |Ω|

EX U(Ω) um,θ(X) 2 1 Nin

p=1 um,θ(Xp) 2

h xmum,θ( Xp) 2 xmum,θ(Xp) 2i .

Denote by Dσ,X := DX {σp}Nin p=1 and Dσ,X, X := DX, X {σp}Nin p=1. It holds that

E[E1 sta] = EDX h sup u PNN

L1(u) b L1(u) i

h xmu(Xp) 2 xmu(Xp) 2i

2Nin EDX, X

h xmu(Xp) 2 xmu(Xp) 2i

2Nin EDσ,X, X

p=1 σp h xmu(Xp) 2 xmu(Xp) 2i

2Nin EDσ,X, X

p=1 σpf( Xp)

2Nin EDσ,X, X

p=1 σpf(Xp)

p=1 σpf(Xp)

Nin EDσ,X h sup f F 1

p=1 σpf(Xp) i = d|Ω|RNin(F 1) ,

where the ﬁfth step is due to the symmetric structure of the PNN class with respect to the d variable components (x1, . . . , xd) and the seventh step uses that F 1 contains both f and f for every function f in the class. Similarly, it holds that

E[E2 sta] |Ω|

Nin EDσ,X h sup f F 2

p=1 σp w(Xp)f(Xp) i = |Ω|RN(w F 2) B0|Ω|RNin(F 2) ,

DRM Revisited: A Complete Error Analysis

where we use Lemma 11 in the last step. Directly, we have

E[E3 sta] 2B0|Ω|RNin(F2) , E[E4 sta] 2B0| Ω|RNb(F3) .

Step 2. Since um,θ = Pm k=1 ckφk θ and P k |ck| M, by Lemmas 7 and 8, we have

|um,θ(x)| M(W + 1)Bθ , | xmum,θ(x)| MW L 1BL θ .

Deﬁne F 1 and F 2 as the non-symmetrized versions of F 1 and F 2 (thus, F 1 = {f2 | f F1} and F 2 = {f2 | f F2}). Letting Φ(x) = x2, then by Lemma 12, it holds that

RNin(F 1) 2RNin(F 1 ) 4MW L 1BL θ RNin(F1) ,

RNin(F 2) 2RNin(F 2 ) 4M(W + 1)Bθ RNin(F2) .

Step 3. Finally, we will relate RN(Fi) to ˆRN(Fi,sub). Taking F1 as an example, we have

RNin(F1) = EDσ,X

sup um,θ PNN

p=1 σp x1um,θ(Xp)

sup θm total Θm 1 Nin

k=1 ck x1φk θ(Xp)

sup θm total Θm

p=1 σp x1φk θ(Xp)

sup θm total Θm

p=1 σp x1φk θ(Xp)

sup θm total Θm

k=1 |ck| sup k {1,...,m}

p=1 σp x1φk θ(Xp)

sup θm total Θm,k

p=1 σp x1φk θ(Xp)

p=1 σp x1φ1 θ(Xp) = M ˆRNin(F1,sub) .

Similarly, we have

RNin(F2) M ˆRNin(F2,sub) , RNb(F3) M ˆRNb(F3,sub) .

Combining above steps then concludes the proof.

Remark 10 The above conclusion reveals an important fact: the Rademacher complexity of Fi is controlled by M and the complexity of Fi,sub. This suggests that the network s overall complexity may not be aﬀected by the number of sub-networks m, aiding us in managing the statistical error within the over-parameterized setting, where m can grow arbitrarily large.

Jiao, Li, Wu, Yang and Zhang

B.3 Bounding the Rademacher Complexity Through Covering Number

We ﬁrst provide the deﬁnition of the covering number , and the related Lemma 14, which is an eﬀective tool that can further help us control ˆRN(Fi,sub). The proof of Lemma 14 can be found in Appendix B.9.

Deﬁnition 5 An ϵ-cover of a set T in a metric space (S, τ) is a subset Tc S such that for each t T, there exists tc Tc such that τ(t, tc) ϵ. The ϵ-covering number of T, denoted as C(ϵ, T, τ) is deﬁned to be the minimum cardinality among all ϵ-cover of T with respect to the metric τ.

Lemma 14 Let F be a class of functions mapping from Ωto R with 0 F, while for any f F, we have f L (Ω) B. Then, it holds that

ˆRN(F) inf 0<δ<B

log 2C(ϵ, F, L ) dϵ .

A Lipschitz parameterization allows us to translate a cover of the function space into a cover of the parameter space. The following result is direct.

Lemma 15 Let F be a parameterized function class, F = {f( ; θ) : θ Θ}. Let Θ be a norm on Θ and let F be a norm on F. Suppose that the mapping θ 7 f( ; θ) is λ-Lipschitz, that is, f( ; θ) f ; θ F λ θ θ Θ ,

then for any ϵ > 0, it holds that

C(ϵ, F, F) C(ϵλ 1, Θ, Θ) .

In Euclidean space, we can establish an upper bound of covering number for a bounded set easily. The following result is also direct.

Lemma 16 Suppose that T Rd and t 2 B for t T, it holds that

C(ϵ, T, 2) 2B

Based on above results, we present the following lemma, providing an upper bound for ˆRN(Fi,sub) in terms of the covering number of Fi,sub.

Lemma 17 Let Nin = Nb = Ns. For i = 1, 2, 3, it holds

ˆRNs(Fi,sub) C(W, L)BL θ N 1/2 s p

log(BθWLNs) ,

Proof By Lemma 14, for i = 1, 2, 3, it holds that

ˆRN(Fi,sub) inf 0<δ<Bi

log 2C(ϵ, Fi,sub, L ) dϵ .

Combining Lemmas 9 and 15, we have

C(ϵ, Fi,sub, L ) C(ϵL 1 i , Θ, 2) ,

DRM Revisited: A Complete Error Analysis

and each of them can be bounded through Lemma 16:

C(ϵ, Fi,sub, L ) 2BθLi p

W(W + 1)Lϵ 1 W(W+1)L.

The above Bi and Li denote the boundedness and smoothness indices of Fi,sub with respect to model parameters, as shown in Lemma 9. For instance, to upper bound ˆRNin(F1,sub), we substitute B1 and L1 from Lemma 9 and obtain

ˆRNin(F1,sub) inf 0<δ<B1

4δ + 12 Nin

log 2C(ϵ, F1,sub, L ) dϵ

4δ + 12 Nin

W(W + 1)L log1/2 4L(L + 1)W 2LB2L+1 θ ϵ 1 dϵ

LB1N 1/2 in log1/2 4L(L + 1)W 2LB2L+1 θ δ 1 o .

Choosing δ = N 1/2 in < B1/2, we have

ˆRNin(F1,sub) C(W, L)BL θ N 1/2 in p

log(BθWLNin) .

ˆRNin(F2,sub) and ˆRNb(F3,sub) can also be bounded in a similar way.

B.4 Complete Statistical Error Bound

Combining previous analysis and with the help of the following inequality, we could ﬁnally obtain high probability control over Esta. Proof of Lemma 18 is referred to Mc Diarmid et al. (1989).

Lemma 18 Let g be a function from Ω1 Ω2 Ωn to R. Suppose that function g satisﬁes the bounded diﬀerences property, that is, there exists constants c1, . . . , cn > 0 such that for any x1 Ω1, . . . , xn Ωn

sup xi Ωi |g(x1, , xi, , xn) g(x1, , xi, , xn)| ci , i = 1, . . . , n .

Let {Xi}n i=1 be independent variables, where Xi Ωi, then for any τ > 0, we have

|g(X1, , Xn) E[g(X1, , Xn)]| τ

with probability at least 1 2 exp[ 2τ 2(Pn i=1 c2 i ) 1].

Theorem 8 (Theorem 5 in the main text) Let PNN = PNN(m, M, {W, L, Bθ}). Let Nin = Nb = Ns in the Monte Carlo sampling. Let 0 < ξ < 1. Then, with probability at least 1 ξ, it holds that

Esta = sup um,θ PNN

L(um,θ) b L(um,θ)

C(Ω, B0, d, W, L) M2B2L θ N 1/2 s p

log(BθWLNs) + p

where C(Ω, B0, d, W, L) is a universal constant which only depends on Ω, B0, d, W and L.

Jiao, Li, Wu, Yang and Zhang

Proof By Lemmas 10 and 13, we have

E[Esta] 4d|Ω|W L 1 M2 BL θ ˆRNin(F1,sub)

+ 4B0 |Ω| M(W + 1)Bθ + 1 M ˆRNin(F2,sub) + 2B0| Ω|M ˆRNb(F3,sub) .

By Lemma 17, for i = 1, 2, 3, it holds that

ˆRNs(Fi,sub) C(W, L)BL θ N 1/2 s p

log(BθWLNs) .

Then, we get

E[Esta] C1(Ω, B0, d, W, L)M2B2L θ N 1/2 s p

log(BθWLNs) . (37)

Now, we denote by

γ(X1, . . . , XNs, Y1, . . . , YNs) := sup um,θ PNN |L(um,θ) b L(um,θ)| = Esta .

Examining the diﬀerence of γ(X1, . . . , XNs, Y1, . . . , YNs), we have

|γ(X1, . . . , Xi, . . . , YNs) γ(X1, . . . , X i, . . . , YNs)|

Ns sup u PNN

u(Xi) 2 2 u(X i) 2 2 2 + w(Xi)u2(Xi) w(X i)u2(X i) 2

+ u(X i)h(X i) u(Xi)h(Xi)

4|Ω|N 1 s d M2 (B0 + 1)W 2L 2B2L θ ,

where we have used the boundedness properties outlined in Lemma 9. We also have

|γ(X1, . . . , Yj, . . . , YNs) γ(X1, . . . , Y j , . . . , YNs)| 2| Ω|N 1 s B0M(W + 1)Bθ .

Then by Lemma 18 and Equation 37, it holds that

Esta E[Esta] + τ C1(Ω, B0, d, W, L)M2B2L θ N 1/2 s p

log(BθWLNs) + τ

with probability as least

1 2 exp Nsτ 2

8d2(| Ω|2 + |Ω|2)(B0 + 1)2W 4LB4L θ M4

This implies that with probability at least 1 ξ, we have

sup um,θ PNN

L(um,θ) b L(um,θ)

C2(Ω, B0, d, W, L)M2B2L θ N 1/2 s p

log(BθWLNs) + p

which concludes the proof.

DRM Revisited: A Complete Error Analysis

B.5 Proof of Lemma 7

By deﬁnition in Section 2.2, for any φθ(x) NN(W, L, Bθ), we have φθ(x) NL 1 + 1 Bθ W + 1 Bθ .

Use φ (ℓ) i to denote the i-th output of the ℓ-th layer. Note that ρ = tanh is 1 Lipshcitz. For ℓ= 2, . . . , L, it holds that

φ (ℓ) i φ (ℓ) i = ρ Nℓ 1 X

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

j=1 a (ℓ 1) ij φ (ℓ 1) j

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i b (ℓ 1) i

a (ℓ 1) ij φ (ℓ 1) j φ (ℓ 1) j +

a (ℓ 1) ij a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i b (ℓ 1) i

φ (ℓ 1) j φ (ℓ 1) j +

a (ℓ 1) ij a (ℓ 1) ij + b (ℓ 1) i b (ℓ 1) i .

For ℓ= 1, we have

φ (1) i φ (1) i

a (0) ij a (0) ij + b (0) i b (0) i =

For ℓ= 2, we have

φ (2) i φ (2) i Bθ

φ (1) j φ (1) j +

a (1) ij a (1) ij + b (1) i b (1) i

a (1) ij a (1) ij + b (1) i b (1) i N1Bθ

Assuming that for ℓ 2, it holds that

φ (ℓ) i φ (ℓ) i ℓ 1 Y

Then, we have

φ (ℓ+1) i φ (ℓ+1) i Bθ

φ (ℓ) j φ (ℓ) j +

a (ℓ) ij a (ℓ) ij + b (ℓ) i b (ℓ) i

a (ℓ) ij a (ℓ) ij + b (ℓ) i b (ℓ) i

Jiao, Li, Wu, Yang and Zhang

Hence, by induction, we could conclude that

|φθ(x) φ θ(x)| L 1 Y

LBL 1 θ θ θ 2 .

B.6 Proof of Lemma 8

Use φ (ℓ) i to denote the i-th output of the ℓ-th layer. For ρ = tanh, note that ρ and ρ are both 1 Lipshcitz. For ℓ= 1, 2, , L 1, it holds that

xmφ (ℓ) i =

j=1 a (ℓ 1) ij xmφ (ℓ 1) j ρ Nℓ 1 X

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

xmφ (ℓ 1) j

(Bθ)2 Nℓ 1 X

xmφ (ℓ 2) j = Nℓ 1(Bθ)2 Nℓ 2 X

xmφ (ℓ 2) j

(Bθ)ℓ 1 N1 X

(Bθ)ℓ 1 N1 X

j=1 Bθ = ℓ 1 Y

(Bθ)ℓ W ℓ 1Bℓ θ .

The bound for | xmφθ(x)| can be derived similarly. For ℓ= 1, we have

xmφ (1) i xm φ (1) i = a (0) imρ N0 X

j=1 a (0) ij xj + b (0) i

a (0) imρ N0 X

j=1 a (0) ij xj + b (0) i

a (0) im a (0) im ρ N0 X

j=1 a (0) ij xj + b (0) i

+ a (0) im ρ N0 X

j=1 a (0) ij xj + b (0) i

j=1 a (0) ij xj + b (0) i

a (0) im a (0) im + Bθ

a (0) ij a (0) ij + Bθ b (0) i b (0) i 2Bθ

For ℓ 2, we establish the recurrence relation: xmφ (ℓ) i xm φ (ℓ) i

DRM Revisited: A Complete Error Analysis

a (ℓ 1) ij xmφ (ℓ 1) j ρ Nℓ 1 X

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

a (ℓ 1) ij xmφ (ℓ 1) j a (ℓ 1) ij xm φ (ℓ 1) j ρ Nℓ 1 X

j=1 a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i

xmφ (ℓ 1) j Nℓ 1 X

a (ℓ 1) ij φ (ℓ 1) j a (ℓ 1) ij φ (ℓ 1) j + b (ℓ 1) i b (ℓ 1) i

a (ℓ 1) ij xmφ (ℓ 1) j a (ℓ 1) ij xm φ (ℓ 1) j

xmφ (ℓ 1) j Nℓ 1 X

a (ℓ 1) ij a (ℓ 1) ij + Bθ

φ (ℓ 1) j φ (ℓ 1) j + b (ℓ 1) i b (ℓ 1) i

xmφ (ℓ 1) j xm φ (ℓ 1) j +

a (ℓ 1) ij a (ℓ 1) ij xm φ (ℓ 1) j

xmφ (ℓ 1) j Nℓ 1 X

a (ℓ 1) ij a (ℓ 1) ij + Bθ

φ (ℓ 1) j φ (ℓ 1) j + b (ℓ 1) i b (ℓ 1) i

xmφ (ℓ 1) j xm φ (ℓ 1) j +

a (ℓ 1) ij a (ℓ 1) ij xm φ (ℓ 1) j

a (ℓ 1) ij a (ℓ 1) ij +Bθ

+ b (ℓ 1) i b (ℓ 1) i +Bθ

xm φ (ℓ 1) j φ (ℓ 1) j +

a (ℓ 1) ij a (ℓ 1) ij ℓ 2 Y

xmφ (ℓ 1) j xm φ (ℓ 1) j + B2ℓ θ

When ℓ= 2, we have

xmφ (2) i xm φ (2) i Bθ

xmφ (1) j xm φ(1) j + B4 θN2 1

θk θk + B4 θN2 1

θk θk 3B4 θN2 1

Assuming that for ℓ 2, it holds that

xmφ (ℓ) i xm φ (ℓ) i (ℓ+ 1)B2ℓ θ

Jiao, Li, Wu, Yang and Zhang

Then, we have xmφ (ℓ+1) i xm φ (ℓ+1) i

xmφ (ℓ) j xm φ (ℓ) j + B2ℓ+2 θ

j=1 (ℓ+ 1)B2ℓ θ

θk θk + B2ℓ+2 θ

(ℓ+ 2)B2ℓ+2 θ

Hence, by induction, we could conclude that

xmφθ xmφ θ = xmφ xm φ (L + 1)B2L θ

n L(L + 1)B2L θ

2 θ θ 2 2W 2L 1

L(L + 1)B2L θ θ θ 2 .

B.7 Proof of Lemma 11

Denote by EX,σ := E{Xk,σk}N k=1 and EX, σ := E{Xk}N k=1,{σk}N k=2. Recursively, we have

RN(a F) = 1

k=1 σka(Xk)f(Xk)

= 1 2N EX, σ

a(X1)f(X1) +

k=2 σka(Xk)f(Xk)

a(X1)f(X1) +

k=2 σka(Xk)f(Xk)

= 1 2N EX, σ

a(X1)(f f )(Xk) +

k=2 σka(Xk)(f + f )(Xk)

B|(f f )(Xk)| +

k=2 σka(Xk)(f + f )(Xk)

= 1 2N EX, σ

B(f f )(Xk) +

k=2 σka(Xk)(f + f )(Xk)

k=2 σka(Xk)f(Xk)

k=1 σkf(Xk) = BRN(F) .

DRM Revisited: A Complete Error Analysis

B.8 Proof of Lemma 12

For ﬁxed (x1, . . . , x N), denote by h N 1(f) := PN 1 i=1 σi(Φ f)(xi). We have

i=1 σi(Φ f)(xi) i = E{σk}N 1 k=1 EσN h sup f F

h N 1(f) + σN(Φ f)(x N) i ,

For any ϵ > 0, there exist f1, f2 F such that

h N 1(f1) + (Φ f1)(x N) (1 ϵ) sup f F

h N 1(f) + (Φ f)(x N) ,

h N 1(f2) (Φ f2)(x N) (1 ϵ) sup f F

h N 1(f) (Φ f)(x N) .

Thus, it holds that

(1 ϵ)EσN h sup f F

h N 1(f) + σN(Φ f)(x N) i

h N 1(f) + (Φ f)(x N) + 1 ϵ

h N 1(f) (Φ f)(x N)

2 h N 1(f1) + (Φ f1)(x N) + 1

2 h N 1(f2) (Φ f2)(x N) .

Let s = sgn(f1(x N) f2(x N)). Since Φ is λ-Lipschitz, we have

(1 ϵ)EσN h sup f F h N 1(f) + σN(Φ f)(x N) i

2 h N 1(f1) + h N 1(f2) + sλ f1(x N) f2(x N)

2 h N 1(f1) + sλf1(x N) + 1

2 h N 1(f2) sλf2(x N)

h N 1(f) + sλf(x N) + 1

h N 1(f) sλf(x N)

= EσN h sup f F

h N 1(f) + σNλf(x N) i ,

Note that ϵ > 0 is arbitrary. It holds that

i=1 σi(Φ f)(xi) i = E{σk}N k=1

h N 1(f) + σNλf(x N) i .

This technique is applied iteratively for all other σi (i = N). Then, the lemma is proved by generalizing from ﬁxed (x1, . . . , x N) to random (X1, . . . , XN), and subsequently taking expectation over {Xi}N i=1.

B.9 Proof of Lemma 14

Set ϵk = 2 k+1B. Denote by Fk such that Fk is an ϵk-cover of F and |Fk| = C(ϵk, F, L ). For any f F, there exists fk Fk, such that f fk L ϵk. Let K N+. We have

ˆRN(F) = E sup f F

i=1 σif(Xi)

Jiao, Li, Wu, Yang and Zhang

= E sup f F

i=1 σi h (f f K)(Xi) +

j=1 (fj+1 fj)(Xi) + f1(Xi) i

i=1 σi (f f K)(Xi) + E sup f1 F1

i=1 σif1(Xi)

j=1 E sup f F

i=1 σi (fj+1 fj)(Xi) .

We choose F1 = {0} to eliminate the second term above. For the ﬁrst term, it holds that

i=1 σi f(Xi) f K(Xi) E sup f F

i=1 |σi| f f K L ϵK .

To handle the second term, for any ﬁxed samples {Xi}N i=1 and 1 j < K, we deﬁne

Vj := (fj+1(X1) fj(X1), . . . , fj+1(XN) fj(XN)) RN : f F .

Then, for any vj = (vj 1, . . . , vj N) Vj, it holds that

i=1 |fj+1(Xi) fj(Xi)|2 1/2

N fj+1 fj L

N fj+1 f L +

Further, we present the following lemma, with proof provided in Appendix B.10.

Lemma 19 Let A Rm be ﬁnite, x = (x1, . . . , xm) and r = maxx A x 2. Then, we have

i=1 σixi r p

2 log(2|A|)

where {σi}m i=1 are independent Rademacher variables.

Applying Lemma 19 and take partial expectation with respect to {σi}N i=1, we have

j=1 E{σi}N i=1

i=1 σi fj+1(Xi) fj(Xi)

j=1 E{σi}N i=1

2 log(2|Vj|) .

By the deﬁnition of Vj, we know that |Vj| |Fj||Fj+1| |Fj+1|2. Hence, it holds that

j=1 E sup f F

i=1 σi fj+1(Xi) fj(Xi)

log(2|Fj+1|) .

DRM Revisited: A Complete Error Analysis

Finally, we obtain

ˆRN(F) ϵK +

log(2|Fj+1|)

j=1 (ϵj ϵj+1) q

log 2C(ϵj, F, L )

log 2C(ϵ, F, L ) dϵ

log 2C(ϵ, F, L ) dϵ .

where the last inequality holds because for 0 δ B, we can choose K as the largest integer such that ϵK+1 > δ. This choice implies that ϵK 4ϵK+2 4δ.

B.10 Proof of Lemma 19

For any t > 0, using Jensen s inequality, rearranging terms, we obtain

exp t E sup x A

i=1 σixi E exp t sup x A

= E sup x A exp t

x A E exp t

Since e|x| ex + e x and ex + e x 2ex2/2, it holds that

x A E exp t

x A E exp t

x A E exp t

exp(txi) + exp( txi)

i=1 exp t2x2 i 2

x A exp t2r2

= 2|A|e t2r2

Taking the log of both sides and dividing by t, we have

Choosing t =

2 log(2|A|)

r concludes the proof.

Jiao, Li, Wu, Yang and Zhang

Appendix C. Detailed Optimization Error Analysis

This section provides a detailed expansion of Section 3.3 in the main text. In this section, Building upon and further developing the technique proposed in Kohler and Krzyzak (2023a), we conduct an in-depth analysis of the optimization error bounds, addressing both iteration error and initialization error separately. Proofs for auxiliary lemmas are also included. Notation speciﬁc to this section is shown in Table 2.

Table 2: Notation speciﬁc to this section

u = u m, θ the best approximation element deﬁned in Equation 13

( θ1, . . . , θ m) the parameters of the m sub-networks in u m, θ ( c1, . . . , c m) the linear coeﬃcients of u m, θ (θm in)[0] the random sub-network initialization (θm in)[0] = (θ1, . . . , θm)[0]

θm, out (c 1, . . . , c m), where c i is deﬁned in Equations 17-19

θm, total the transition parameters θm, total = ((θm in)[0], θm, out)

u m the PNN parameterized by θm, total R a parameter controlling the iteration error

Q a parameter controlling the initialization error with high probability

C.1 Analysis of the Iteration Error

We ﬁrst introduce the following results for the PGD algorithm, with the complete proof provided in Appendix C.3.

Lemma 20 Let d1, d2 N, let U, V, K 0, let X Rd1 and Y Rd2 be closed and convex, and let F : Rd1 Rd2 R+ be a function such that F(x, y) is diﬀerentiable while y 7 F(x, y) is convex for all x Rd1. Assume that

y F(x, y) 2 V , (38)

F(x1, y1) F(x2, y2) 2 K (x1, y1) (x2, y2) 2 (39)

for all (x, y), (x1, y1) and (x2, y2) X Y. Choose (x0, y0) X Y and set

(xt+1, yt+1) = Proj X Y (xt, yt) λ F(xt, yt) (40)

for t = 0, 1, . . . , T, where λ = min{T 1, 2K 1} .

Let y Y and assume

|F(xt, y ) F(x0, y )| U y 2 xt x0 2 (41)

DRM Revisited: A Complete Error Analysis

for all t = 1, . . . , T. Then it holds:

F(x T, y T) F(x0, y ) U y 2 diam(X) + y y0 2 2 2 + V 2

To apply Lemma 20 to our setting, we identify F with b F, (xt, yt) with (θm in, θm out)[t], and y with θm, out, which directly corresponds to the iteration error in Equation 22. This approach requires satisfying conditions 38, 39, and 41. The following lemmas characterize the relevant properties of b F, with proofs provided in Appendices C.4-C.7. We begin by establishing an upper bound for θm out b F(θm in, θm out) to satisfy condition 38.

Lemma 21 For um,θ PNN(m, M, {W, L, Bθ}), denote the empirical risk b L(um,θ) from Equation 7 as b F(θm total) = b F(θm in, θm out), omitting the dependence on sample points. Then

θm out b F θm in, θm out 2 2 C1m M2B4L θ , (43)

where C1 is a universal constant depending only on Ω, W, L, d and B0.

To satisfy Equation 39, we estimate the Lipschitz constant of b F(θm total). We ﬁrst associate the Lipschitz property of the gradient f with the norm of Hessian matrix 2f.

Lemma 22 For f(x) convex and twice diﬀerentiable, it holds that

2f(x) 2,2 2f(x) F K = f(x) f(y) 2 K x y 2 ,

where 2,2 is the spectral norm of the matrix, F is the Frobenius norm of the matrix.

Thus, it suﬃces to estimate the Frobenius norm of the Hessian matrix of b F(θm total).

Lemma 23 With notation consistent with Lemma 21, we have 2 θm total b F θm total F C2m M2B4L θ ,

where C2 is a universal constant which only depends on Ω, W, L, d and B0. Using Lemma 22, b F(θm total) is then equipped with Lipschitz constant C2m M2B4L θ .

Finally, we assures that b F satisﬁes Equation 41.

Lemma 24 With notation consistent with Lemma 21, we have b F θm,1 in , θm out b F θm,2 in , θm out C3MB3L θ θm out 2 θm,1 in θm,2 in 2 ,

where θm,1 in and θm,2 in denote two diﬀerent sub-network weight vectors. C3 is a universal constant which only depends on Ω, W, L, d and B0.

Let ζ = M in Equation 11 to ensure θm, out Bζ = B M, while we naturally have (θm in)[0] Aη. Combining Lemmas 20-24, we achieve the following corollary to bound the iteration error.

Jiao, Li, Wu, Yang and Zhang

Corollary 1 (Proposition 2 in the main text) Let ζ = M in Equation 11. Then, the output u A of the PGD algorithm in Equation 12 belongs to PNN(m, M, { W, L, B θ + η}). When we run the algorithm with step size

λ = min{T 1, 2C 1 2 m 1 M 2(B θ + η) 4 L} ,

where T is the total number of iterations and η is the projection radius in Equation 10, the iteration error in Equation 22 satisﬁes

b L(u A) b L(u m) C3 Mη(B θ + η)3 L θm, out 2 + 1

2 θm, out 2 2 + C1m M2(B θ + η)4 L

C3 M2(B θ + η)3 Lη

2R + C1m M2(B θ + η)4 L

Here, u m PNN(m, M, { W, L, B θ}) is deﬁned in Equation 20, C1 is from Lemma 21, C2 is from Lemma 23, and C3 is from Lemma 24.

Remark 11 In Corollary 1, we use the following property of θm, out 2:

θm, out 2 =

s=1 |c s|2 = 1

k=1 | ck|2 1

k=1 | ck| = 1

R θm, out 1 M

That is, as R, which controls the over-parameterization degree of um,θ, increases, the upper bound of θm, out 2 decays polynomially. As we have seen, this property allows us to control the iteration error to any given precision by letting R with M ﬁxed, which underscores the importance of over-parameterization in our analysis.

C.2 Analysis of the Initialization Error

We ﬁrst establish the following lemma to estimate the probability of Gm, m,R,δ, with proof provided in Appendix C.8.

Lemma 25 Consider u m, θ PNN( m, M, { W, L, B θ}) from Theorem 3 with sub-network parameters ( θ1, . . . , θ m). If

m = m R Q , R, Q N and Q is suﬃciently large ,

then, it holds that

P Gm, m,R,δ 1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q .

Intuitively, the initialization error can be eﬀectively controlled when the target weights θk and random initialization (θsk,v)[0] exhibit only minimal diﬀerences. The following lemma formalizes this relationship, with proof provided in Appendix C.9.

DRM Revisited: A Complete Error Analysis

Lemma 26 For u m, θ PNN( m, M, { W, L, B θ}) in Theorem 3, we have

b L(u m) b L(u m, θ) C4 M2B3 L θ max k=1,..., m max v=1,...,R

(θsk,v)[0] θk

where u m PNN(m, M, { W, L, B θ +η}) is deﬁned in Equation 20, the random indices sk,v are from Equation 16, and C4 is a universal constant depending only on Ω, W, L, d and B0.

Combining Lemmas 25 and 26, we can bound the initialization error with arbitrarily high probability and precision through proper selection of the sub-network size m in um,θ.

Proposition 5 (Proposition 3 in the main text) For any δ > 0 and R, Q N with Q suﬃciently large, if we set m = m R Q, then with probability at least

1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q ,

the initialization error in Equation 22 satisﬁes

b L(u m) b L(u m, θ) C4 M2B3 L θ δ ,

where u m PNN(m, M, { W, L, B θ + η}) is deﬁned in Equation 20. C4 is from Lemma 26.

C.3 Proof of Lemma 20

Note that y Y. By convexity of y 7 F(xt, y) and Equation 38, we have

F(xt, yt) F(xt, y ) y F(xt, yt), yt y

2λ 2 λ y F(xt, yt), yt y

h yt y λ y F(xt, yt) 2 2 + yt y 2 2 + λ y F(xt, yt) 2 2

h yt y 2 yt+1 y 2 2 + λ2V 2i .

Since λ T 1, this implies

F(xt, yt) F(xt, y ) y0 y 2

Using above results and by Equation 41, we have

min t=0,...,T F(xt, yt) 1

t=0 F(xt, y ) + y y0 2

F(x0, y ) + 1

t=0 |F(xt, y ) F(x0, y )| + y y0 2 2 2 + V 2

F(x0, y ) + U y 2 diam(X) + y y0 2 2 2 + V 2

Jiao, Li, Wu, Yang and Zhang

Denote by z = (x, y). Since X Y is a convex set, we have

F(zt+1) = F(zt) + Z 1

F zt + w(zt+1 zt)

= F(zt) + Z 1

0 F zt + w(zt+1 zt) T(zt+1 zt) dw .

Since F(x, y) satisﬁes Equation 39, it holds that Z 1

F zt + w(zt+1 zt) F(zt) T(zt+1 zt) dw

0 K w(zt+1 zt) 2 zt+1 zt 2 dw = K

2 zt+1 zt 2 .

By Equation 40, zt+1 = Proj X Y zt λ F(zt) , which implies zt λ F(zt) zt+1 , u zt+1 0 , u X Y .

Let u = zt, then F(zt)T(zt+1 zt) 1

λ zt+1 zt 2. Thus, we get

F(zt+1) F(zt) + F(zt)T(zt+1 zt) + K

2 zt+1 zt 2

zt+1 zt 2 .

Therefore, when λ = T 1 2K 1, it holds that

F(x T, y T) = min t=0,...,T F(xt, yt) F(x0, y ) + U y diam(X) + y y0 2

C.4 Proof of Lemma 21

By Equation 7, um,θ(x) = Pm k=1 ckφk θ(x), B0 = max{ h L (Ω), g L ( Ω), w L (Ω)} and P k |ck| M, it holds that

θm out b F θm in, θm out 2 2 = (ck)m k=1 b L m X

k=1 ck xφk θ(Xp)T xφj θ(Xp) + w(Xp)

k=1 ck φk θ(Xp)φj θ(Xp)

φj θ(Xp)h(Xp) | Ω|

φj θ(Yq)g(Yq) 2

k=1 ck φk θ(Xp)T φj θ(Xp) 2

k=1 ckφk θ(Xp)φj θ(Xp) 2

DRM Revisited: A Complete Error Analysis

φj θ(Xp)h(Xp) 2 + | Ω|

q=1 [φj θ(Yq)g(Yq)] 2

j=1 4 |Ω|2 1

k=1 ck φk θ(Xp) 2

φj θ(Xp) 2 2

k=1 ckφk θ(Xp) 2 1 Nin

+ (|Ω|2 + | Ω|2)B2 0 max x |φj θ(x)|2

4m n |Ω|2M2 max k,x

φk θ(x) 2 2 max j,x φj θ(x) 2 2 + (|Ω|2 + | Ω|2)B2 0 max j,x |φj θ(x)|2

+ |Ω|2B2 0M2 max k,x

φk θ(x) 2 max j,x φj θ(x) 2o ,

where the Cauchy-Schwarz inequality is used in the third step. By Lemma 9, we have

|φθ(x)| (W + 1)Bθ , | xmφθ(x)| W L 1BL θ .

Then, we get maxk,x φk θ(x) 2 2 d W L 1BL θ 2, which concludes the proof.

C.5 Proof of Lemma 22

Write x = Pn j=1 ajej. Suppose x 2 = 1, that is, P j |aj|2 = 1. Then, it holds that

j=1 aj Aej 2

j |aj| Aej 2 2 n X

j=1 |aj|2 n X

j=1 Aej 2 2 = A 2 F .

Since x is arbitrary, we get A 2 A F. Thus, if we have 2f(x) 2,2 2f(x) F K, it further holds that f(x) f(y) 2 2f(ξ) 2,2 x y 2 K x y 2.

C.6 Proof of Lemma 23

The goal is to estimate the Frobenius norm of the Hessian matrix of b F(θm total), which is equivalent to estimate the second derivative of b L(um,θ) w.r.t. the weight parameter. For simplicity, we will demonstrate the proof process using the derivative w.r.t. a (0) 1,1,1, the innermost weight of the ﬁrst sub-network in um,θ. Denote by L := L ( Ω). We have a(0)

1, 1, 1 a(0)

1, 1, 1 a(0)

i=1 | xium,θ(Xp)|2 + 1

2w(Xp)u2 m,θ(Xp) um,θ(Xp)h(Xp)

1, 1, 1 a(0)

um,θ(Yq)g(Yq)

Jiao, Li, Wu, Yang and Zhang

1, 1, 1 xium,θ(x) a(0)

1, 1, 1 xium,θ(x) + a(0)

1, 1, 1 a(0)

1, 1, 1 xium,θ(x) xium,θ(x) L

1, 1, 1um,θ a(0)

1, 1, 1um,θ + um,θ a(0)

1, 1, 1 a(0)

1, 1, 1um,θ L + a(0)

1, 1, 1 a(0)

1, 1, 1um,θ L ,

Therefore, the task reduces to estimating the following partial derivatives:

First Order Derivatives: x1um,θ(x) , a(0)

1, 1, 1um,θ(x) .

Second Order Derivatives: a(0)

1, 1, 1 x1um,θ(x) , a(0)

1, 1, 1 a(0)

1, 1, 1um,θ(x) .

Third Order Derivative: a(0)

1, 1, 1 a(0)

1, 1, 1 x1um,θ(x) .

In the rest of the proof, we will use φ (ℓ) k,j(x) to denote the j-th output of the k-th sub-network at layer ℓ. Also, we assume all weights are non-negative when estimating the upper bounds of the partial derivatives (if not, the usual triangle inequality yields the same conclusion). We ﬁrst focus on the ﬁrst order derivatives. It holds that

x1um,θ(x) =

k=1 ck x1φk θ(x) .

Also, we have

x1φk θ(x) = x1φ(L)

s L 1=1 a(L 1)

k,1,s L 1 x1φ(L 1)

s L 1=1 a(L 1)

k,1,s L 1ρ NL 2 X

s L 2=1 a(L 2)

k,1,s L 2 φ(L 2)

k,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

k,s L 1,s L 2 x1φ(L 2)

s L 1=1 a(L 1)

k,1,s L 1 1 NL 2 X

s L 2=1 a(L 2)

k,s L 1,s L 2 1 NL 3 X

s L 3=1 a(L 3)

k,s L 2,s L 3

(1) k,s2,s1 a

s1=1 a(L 1)

k,1,s L 1a(L 2)

k,s L 1,s L 2 a

(1) k,s2,s1 a

(0) k,s2,1 L 1 Y

Therefore, it holds that

x1um,θ(x) M L 1 Y

As for the partial derivative w.r.t. a (0) 1,1,1, we have

1, 1, 1um,θ(x) = c1 a(0)

1, 1, 1φ1 θ(x) . (45)

DRM Revisited: A Complete Error Analysis

Also, it holds that

1, 1, 1φ1 θ(x) = a(0)

1, 1, 1φ(L)

s L 1=1 a(L 1)

1,1,s L 1 a(0)

1, 1, 1φ(L 1)

s L 1=1 a(L 1)

1,1,s L 1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 a(0)

1, 1, 1φ(L 2)

s L 1=1 a(L 1)

1,1,s L 1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2

(1) 1,s2,s1 x1ρ d X

(0) 1,s1,s0 xs0 + b

s L 1=1 a(L 1)

1,1,s L 1 NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1 x1

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

(Bθ)L 1 . (46)

Therefore, we have

1, 1, 1um,θ(x) M L 1 Y

Next, we estimate the second order partial derivatives. By Equation 45, we have

1, 1, 1 x1um,θ(x) = c1 x1 a(0)

1, 1, 1φ1 θ(x) . (47)

Meanwhile, by Equation 46, it holds that

1, 1, 1φ1 θ(x)

s L 1=1 a(L 1)

1,1,s L 1 h NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

Jiao, Li, Wu, Yang and Zhang

s L 1=1 a(L 1)

1,1,s L 1 NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1 x1

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

(Bθ)L ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

+ x1 x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

+ + x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

Since we have

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 L 3 Y

(Bθ)L 2 L 2 Y

Then, above estimations lead to

1, 1, 1 x1φ1 θ(x) L 1 Y

(Bθ)L L L 1 Y

(Bθ)L = L L 1 Y

Therefore, it holds that

1, 1, 1 x1um,θ(x) M L L 1 Y

DRM Revisited: A Complete Error Analysis

Besides, w.r.t a (0) 1,1,1, we have

1, 1, 1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 1)

s L 2=1 a(L 2)

1,s L 1,s L 2 a(0)

1, 1, 1(φ(L 2)

1,s L 2(x))

s L 2=1 a(L 2)

1,s L 1,s L 2 L 3 Y

(Bθ)L 3 L 2 Y

Then, it holds that

1, 1, 1 a(0)

1, 1, 1um,θ(x) M L L 1 Y

2 (Bθ)2L 1 .

Finally, we turn to the third order derivative. Initially, by Equation 47, we have

1, 1, 1 a(0)

1, 1, 1 x1um,θ(x) = c1 a(0)

1, 1, 1 x1 a(0)

1, 1, 1φ1 θ(x) .

Then, by fully expanding Equation 48, it holds that

1, 1, 1φ1 θ(x)

s L 1=1 a(L 1)

1,1,s L 1 NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

+ x1 x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

+ + x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

s L 1=1 a(L 1)

1,1,s L 1 NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1

Jiao, Li, Wu, Yang and Zhang

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

+ x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 3=1 a(L 3)

1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

+ + x1ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

Further taking derivative w.r.t. a (0) 1,1,1, we have

1, 1, 1 x1 a(0)

1, 1, 1φ1 θ(x)

s L 1=1 a(L 1)

1,1,s L 1 NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 NL 3 X

s L 3=1 a(L 3)

1,s L 2,s L 3

(1) 1,s2,s1

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 3=1 a(L 3)

1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

+ + x1 a(0)

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

(1) 1,s2,s1 φ

(1) 1,s1(x) + b

(0) 1,s1,s0 xs0 + b

DRM Revisited: A Complete Error Analysis

The calculation of the typical item tells us

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 3=1 a (L 3) 1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 a(0)

1, 1, 1φ(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 3=1 a(L 3)

1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 a(0)

1, 1, 1 x1φ(L 2)

s L 3=1 a(L 3)

1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

+ + ρ NL 2 X

s L 2=1 a(L 2)

1,s L 1,s L 2 φ(L 2)

1,s L 2(x) + b(L 2)

s L 2=1 a(L 2)

1,s L 1,s L 2 x1φ(L 2)

s L 3=1 a(L 3)

1,s L 2,s L 3 φ(L 3)

1,s L 3(x) + b(L 3)

(0) 1,s1,s0 xs0 + b

(2L + 2) L 1 Y

Therefore, we have

1, 1, 1 a(0)

1, 1, 1 x1um,θ(x)

s1=1 a(L 1)

k,1,s L 1a(L 2)

k,s L 1,s L 2 a

(1) k,s2,s1 L (2L + 2) L 1 Y

M L (2L + 2) L 1 Y

3 (Bθ)3L 1 .

Combining above estimations together, we obtain the total upper bound

1, 1, 1 a(0)

b L(um,θ) C(d, B0, L) M2 L 1 Y

Jiao, Li, Wu, Yang and Zhang

It can be readily observed that the upper bound obtained by diﬀerentiating the neural network output w.r.t. a (0) 1,1,1 also controls the upper bounds obtained by diﬀerentiating the output w.r.t. general a (ℓ) k,i,j, which concludes the proof.

C.7 Proof of Lemma 24

For i = 1, 2, when um,θ is parameterized with (θm,i in , θm out), we denote it as um,i = Pm k=1 ck φk θ,i. Then, it holds that

b F θm,1 in , θm out b F θm,2 in , θm out

( um,1 2 2 um,2 2 2)(Xp) 2

w(Xp)(u2 m,1 u2 m,2)(Xp) 2

p=1 h(Xp)(um,1 um,2)(Xp) + C(Ω)

q=1 g(Yq)(um,1 um,2)(Yq) ,

where C(Ω) = max{|Ω|, | Ω|}. By Lemmas 7 and 8, we have

|φθ(x)| (W + 1)Bθ , | xmφθ(x)| W L 1BL θ .

Also, it holds that φθ(x) φ θ(x) 2W L

LBL 1 θ θ θ 2 , x Ω, xmφθ(x) xmφ θ(x) 2W 2L 1

L(L + 1)B2L θ θ θ 2 , x Ω.

Then, we have the following estimations. First, it holds that

um,1(Xp) 2 2 um,2(Xp) 2 2 2

k=1 ck φk θ,1(Xp) 2

k=1 ck φk θ,2(Xp) 2

k=1 |ck| xmφk θ,1(Xp) + xmφk θ,2(Xp) i

k=1 |ck| xmφk θ,1(Xp) xmφk θ,2(Xp) i

d MW L 1BL θ

k=1 |ck| max m,x xmφk θ,1(x) xmφk θ,2(x)

d MW L 1BL θ

k=1 max m,x xmφk θ,1(x) xmφk θ,2(x) 2

L(L + 1)MB3L θ θm out 2 θm,1 in θm,2 in 2 ,

DRM Revisited: A Complete Error Analysis

where the fourth step uses Cauchy-Schwarz inequality. Similarly, we have

p=1 w(Xp) u2 m,1(Xp) u2 m,2(Xp)

B0M(W + 1)Bθ

k=1 max x φk θ,1(x) φk θ,2(x) 2

2B0W L(W + 1)

LMBL θ θm out 2 θm,1 in θm,2 in 2 .

Finally, it holds that

h um,1(Xp) um,2(Xp) h(Xp) i + 1

h um,1(Yq) um,2(Yq) g(Yq) i

LBL 1 θ θm out 2 θm,1 in θm,2 in 2 .

Combining above estimations, we complete the proof of this lemma.

C.8 Proof of Lemma 25

In the following, we will focus on the probability of Gc m, m,R,δ. The proof will be divided into two steps: (i) the case where m = 1 and R = 1, that is, Gc m,1,1,δ; (ii) the general case Gc m, m,R,δ where m, R N. Step 1. For m = R = 1, when m = Q, Gc m,1,1,δ denotes the event where, each subnetwork weight vector among (θ1)[0] through (θQ)[0] satisﬁes (θ )[0] θ1 > δ. Since θ1 can be treated as a ﬁxed vector in [ B θ, B θ]D( W, L,d), where D(W, L, d) is deﬁned in Equation 2, and all sub-network parameters are i.i.d. from U[ B θ, B θ], it then holds that

P h (θi)[0] θ1 δ i δ

D( W, L,d) δ

for any i {1, . . . , Q}. Thus, we have

P Gc m,1,1,δ = P h i {1, . . . , Q} : (θi)[0] θ1 > δ i h 1 δ W( W+1) L(2B θ) W( W+1) Li Q .

Step 2. For general m, R N, when m = m R Q, Gm, m,R,δ denotes the event where, for each target θk in ( θ1, . . . , θ m), at least R weight vectors among (θk)[0] through (θ m R Q)[0]

satisfy (θ )[0] θk δ. It can be readily observed that

n i [(j 1) m + k 1]Q, . . . , [(j 1) m + k]Q : (θi)[0] θk δ o .

Thus, it holds that

Gc m, m,R,δ

n i [(j 1) m + k 1]Q, . . . , [(j 1) m + k]Q : (θi)[0] θk >δ o .

Jiao, Li, Wu, Yang and Zhang

This implies

P Gc m, m,R,δ m RP Gc m,1,1,δ m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q .

Therefore, we have

P Gm, m,R,δ = 1 P Gc m, m,R,δ 1 m R h 1 δ W( W+1) L(2B θ) W( W+1) Li Q .

C.9 Proof of Lemma 26

First, we have b L u m b L(u m, θ)

( u m 2 2 u m, θ 2 2)(Xp) 2

w(Xp) (u m)2 u2 m, θ (Xp)

p=1 h(Xp)(u m u m, θ)(Xp)

q=1 g(Yq)(u m u m, θ)(Yq) ,

where C(Ω) = max{|Ω|, | Ω|}. By Lemmas 7 and 8, we have

u m(Xp) 2 2 u m, θ(Xp) 2 2 2

ck R φsk,v θ [0](Xp) 2

ck R φk θ(Xp) 2

xm(φsk,v θ )[0](Xp) + xmφk θ(Xp) i

xm(φsk,v θ )[0](Xp) xmφk θ(Xp) i

d M2 W L 1B L θ max x,m xm(φsk,v θ )[0](x) xmφk θ(x)

L( L + 1) M2B 3 L θ max k,v

(θsk,v)[0] θk 2

2d W 3 L 2p

L( L + 1) M2B 3 L θ

W( W + 1) L max k,v

(θsk,v)[0] θk .

Similarly, it holds that

w(Xp) u m(Xp) 2 w(Xp)u2 m, θ(Xp)

B0 M2( W + 1)B θ max x (φsk,v θ )[0](x) φk θ(x)

DRM Revisited: A Complete Error Analysis

2B0 W L( W + 1) p

W( W + 1) L max k,v

(θsk,v)[0] θk .

Finally, we have

p=1 [u m(Xp)h(Xp) u m, θ(Xp)h(Xp)] + 1

q=1 [u m(Yq)g(Yq) u m, θ(Yq)g(Yq)]

LB L 1 θ M q

W( W + 1) L max k,v

(θsk,v)[0] θk .

Combining above estimations completes the proof.

Appendix D. Complexity of the Projected Gradient Descent Method

In this section, we will compute the algorithmic complexity of one step of the PGD algorithm in Section 2.4. By Duchi et al. (2008) and the explicit ℓ2-projection formula, the projection step has a complexity of O(dim(θm total)). Thus, only the gradient computation complexity requires attention. First, we assume that Nin = Nb = Ns in Equation 7, which is consistent with the result in Theorem 1. Thus, Equation 7 can be seen as sum of losses over Ns equally weighted sample pairs {Xi, Yi}Ns i=1. When evaluating the complexity of gradient computation, we only need to analyze the loss for a representative pair {x, y} and then multiply the result by Ns. Then, observe that the gradient update for um,θ(x) 2 2 involves computing the secondorder derivatives of the neural network. As this step dominates the overall complexity, we will focus on analyzing this term alone. Speciﬁcally, we will analyze the complexity of calculating

um,θ(x) 2 2 θm total = um,θ(x) 2 2 θm in , um,θ(x) 2 2 θm out

=: in um,θ(x) 2 2, out um,θ(x) 2 2 .

Recall that um,θ(x) = Pm k=1 ck φk θ(x), θm out = (c1, c2, . . . , cm) and

θm in = A1 0, b1 0, . . . , A1 L 1, b1 L 1 | {z } φ1 θ(x)

, . . . . . . , Am 0 , bm 0 , . . . , Am L 1, bm L 1 | {z } φm θ (x)

For simplicity, we will abbreviate um,θ(x) as u, φk θ(x) as φk. Additionally, we assume that the input dimension d = W and all Ak ℓ RW W , 0 ℓ< L 1. In this way, all subnetworks share a consistent parameter structure. Direct calculation yields

out u 2 2 = φ 1 (2 u), . . . , φ m(2 u) , u = c1 φ1 + + cm φm . (49)

Therefore, once the complexity of φk is determined, the complexity of out u 2 2 will be clear. Before diving into the computations, we need to outline the matrix diﬀerentiation rules relevant to this section.

Jiao, Li, Wu, Yang and Zhang

In summary, our focus is the derivative of a scalar-valued matrix function f(X) R with respect to X Rm n, deﬁned as f/ X := ( f/ Xij). Its computation relies on the following key relationship:

f Xij d Xij = tr f

d X , d X := (d Xij) .

That is to say, the computation of f/ X involves two steps: (1) deriving the relationship between the diﬀerentials df and d X based on the speciﬁc form of f(X); and (2) leveraging the properties of the trace operator to express df as tr(A d X), where A represents the sought-after matrix derivative. For the former step, we need the following rules:

d(X Y ) = d X d Y ; d(XY ) = Xd Y + (d X)Y ; d(X ) = (d X) ;

dρ(X) = ρ (X) d X ; d(X Y ) = X d Y + d X Y ,

where ρ denotes an element-wise function, and represents the element-wise Hadamard product. For the latter step, the following rules are necessary:

tr(A B) = tr(A) + tr(B) ; tr A B = tr BA

tr(A ) = tr(A) ; tr A (B C)) = tr (A B) C ,

where A, B and C have the same shape. With these, we present the computation of φk, which is known as the backpropagation algorithm in deep learning. The process starts with the forward pass that ultimately reaches φk, during which the following intermediate quantities will be stored:

z0 = A0x + b0 ; x1 = ρ(z0) ; z1 = A1x1 + b1 ; x2 = ρ(z1) ;

z L 2 = AL 2 x L 2 + b L 2 ; x L 1 = ρ(z L 2) ; φk = AL 1x L 1 + b L 1 ,

where the parameter matrices and biases omit the subscript k for simplicity. Note that for 0 i < L 2, the ﬂoating-point operations for zi are 2W 2, and the operations for xi+1 are W. Therefore, the ﬂoating-point operations for φk are W(2W + 1)(L 1) + 2W. Next, the backward computation of φk begins, storing the following intermediate quantities, which are required for evaluating the second-order parameter derivatives in in u 2 2.

s0 = ρ (z L 2) ; q0 = A L 1 s0 ; p1 = A L 2 q0 ;

s1 = ρ (z L 3) ; q1 = p 1 s1 ; p2 = A L 3 q0 ;

s L 2 = ρ (z0) ; q L 2 = p L 2 s L 2 ; φk = A 0 q L 2 .

The additional ﬂoating-point operations are W(2W +1)(L 1), leveraging the intermediate quantities stored during the forward pass. Thus, the total ﬂoating-point operations for computing a single φk are 2W(2W + 1)(L 1) + 2W, and ﬁnally, by Equation 49, the complexity of out u 2 2 is 2m W(2W + 1)(L 1) + 6m W W, during which 2 u is stored. Now, let s analyze in u 2 2. First, note that

d u 2 2 = (2 u) (d u) = (2 u) c1(d φ1) + + (d φm)

DRM Revisited: A Complete Error Analysis

= (2c1 u) (d φ1) + + (2cm u) (d φm) .

It is evident that the partial derivatives with respect to the parameters of any given φk are solely associated with (2ck u) (d φk). Given the identical subnetworks, it suﬃces to determine the complexity of the partial derivatives in a single (2ck u) (d φk), which, multiplied by m, gives the total complexity of in u 2 2. We ﬁrst deﬁne some auxiliary quantities. Let π0 = 2ck u. For 0 i L 2, let

χi = q L 2 i π i ; ιi = Ai πi ; πi+1 = ιi s L 2 i ;

r0 i = ιi p L 2 i ρ (zi) ; rj i = A i j+1rj 1 i ρ (zi j) , 0 < j i ,

where the subscript k is also omitted for simplicity. Using all previously stored intermediate quantities, the incremental ﬂoating-point operation counts for the above ﬁve quantities are as follows: W 2, W(2W 1), W, 3W and 2W 2. Denoting by Υi := Ai u 2 2 and υj := bj u 2 2, we then have

(2ck u) (d φk) =

i=0 tr(Υ i d Ai) +

j=0 υ j dbj ,

where ΥL 1 = π L 1, and for 0 k L 2,

l=k rl k l , Υk = χk +

l=k rl k l x k .

Thus, the complexity of the φk component in in u 2 2 is (L 1)(L 2)(2W 2 + W) + (L 1)(5W 2 + 3W). Based on all analyses, the complexity of the gradient update is O(m W 2L2Ns). Since dim(θm total) = O(m W 2L), the complexity of projection is O(m W 2L). Finally, we deduce that the complexity of one step of projected gradient descent in Section 2.4 is O(m W 2L2Ns).

Shmuel Agmon, Avron Douglis, and Louis Nirenberg. Estimates near the boundary for solutions of elliptic partial diﬀerential equations satisfying general boundary conditions. i. Communications on Pure and Applied Mathematics, 12(4):623 727, 1959.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242 252. PMLR, 2019.

Cosmin Anitescu, Elena Atroshchenko, Naif Alajlan, and Timon Rabczuk. Artiﬁcial neural network methods for the solution of second order boundary value problems. Computers, Materials and Continua, 59(1):345 359, 2019.

Francis Bach. Learning theory from ﬁrst principles, 2023.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497 1537, 2005.

Jiao, Li, Wu, Yang and Zhang

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017.

Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VCdimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(1):2285 2301, 2019.

Peter L Bartlett, Philip M Long, G abor Lugosi, and Alexander Tsigler. Benign overﬁtting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063 30070, 2020.

Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87 201, 2021.

Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4):2261 2285, 2019.

Christian Beck, Arnulf Jentzen, and Benno Kuckuck. Full error analysis for the training of deep neural networks. Inﬁnite Dimensional Analysis, Quantum Probability and Related Topics, 25(02):2150020, 2022.

Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30:203 248, 2021.

Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541 549. PMLR, 2018.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machinelearning practice and the classical bias variance trade-oﬀ. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019a.

Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? In International Conference on Artiﬁcial Intelligence and Statistics, pages 1611 1619. PMLR, 2019b.

Julius Berner, Markus Dablander, and Philipp Grohs. Numerically solving parametric families of high-dimensional Kolmogorov partial diﬀerential equations via deep learning. In Advances in Neural Information Processing Systems, volume 33, pages 16615 16627. Curran Associates, Inc., 2020.

Julius Berner, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. The modern mathematics of deep learning. Mathematical Aspects of Deep Learning, page 1, 2022.

Susanne Brenner and Ridgway Scott. The mathematical theory of ﬁnite element methods, volume 15. Springer Science & Business Media, 2007.

Mo Chen, Zhao Ding, Yuling Jiao, Xiliang Lu, Peiying Wu, and Jerry Zhijian Yang. Convergence analysis of PINNs with over-parameterization. Communications in Computational Physics, 37(4):942 974, 2025.

DRM Revisited: A Complete Error Analysis

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in diﬀerentiable programming. In Advances in Neural Information Processing Systems, volume 32, 2019.

Philippe G Ciarlet. The ﬁnite element method for elliptic problems. SIAM, 2002.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1 49, 2002.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303 314, 1989.

Yongcheng Dai, Bangti Jin, Ramesh Sau, and Zhi Zhou. Solving elliptic optimal control problems via neural networks and optimality system. ar Xiv preprint ar Xiv:2308.11925, 2023.

Ronald De Vore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30:327 444, 2021.

Zhao Ding, Yuling Jiao, Xiliang Lu, Peiying Wu, and Jerry Zhijian Yang. Convergence analysis of deep Ritz method with over-parameterization. Neural Networks, 184:107110, 2025.

Selina Drews and Michael Kohler. Analysis of the expected L2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization. ar Xiv preprint ar Xiv:2311.14609, 2023.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent ﬁnds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675 1685. PMLR, 2019.

Chenguang Duan, Yuling Jiao, Yanming Lai, Dingwei Li, Jerry Zhijian Yang, et al. Convergence rate analysis for deep Ritz method. Communications in Computational Physics, 31(4):1020 1048, 2022.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Eﬃcient projections onto the L1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272 279, 2008.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181 213, 2021.

Evarist Gin e and Richard Nickl. Mathematical foundations of inﬁnite-dimensional statistical models. Cambridge University Press, 2021.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory, pages 297 299. PMLR, 2018.

Philipp Grohs and Gitta Kutyniok. Mathematical aspects of deep learning. Cambridge University Press, 2022.

Jiao, Li, Wu, Yang and Zhang

Ingo G uhring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107 130, 2021.

Ingo G uhring, Gitta Kutyniok, and Philipp Petersen. Error bounds for approximations with deep Re LU neural networks in Ws,p norms. Analysis and Applications, 18(05):803 859, 2020.

Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial diﬀerential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34): 8505 8510, 2018.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2): 949 986, 2022.

Qingguo Hong, Jonathan W Siegel, and Jinchao Xu. Rademacher complexity and numerical quadrature analysis of stable neural networks with applications to numerical PDEs. ar Xiv preprint ar Xiv:2104.02903, 2021.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251 257, 1991.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989.

Tianhao Hu, Bangti Jin, and Zhi Zhou. Solving elliptic problems with singular sources using singularity splitting deep Ritz method. SIAM Journal on Scientiﬁc Computing, 45 (4):A2043 A2074, 2023.

Tianhao Hu, Bangti Jin, and Zhi Zhou. Solving Poisson problems in polygonal domains with singularity enriched physics informed neural networks. SIAM Journal on Scientiﬁc Computing, 46(4):C369 C398, 2024.

Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial diﬀerential equations. Proceedings of the Royal Society A, 476(2244):20190630, 2020.

Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.

Arnulf Jentzen and Timo Welti. Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation. Applied Mathematics and Computation, 455:127907, 2023.

Xia Ji, Yuling Jiao, Xiliang Lu, Pengcheng Song, and Fengru Wang. Deep Ritz method for elliptical multiple eigenvalue problems. Journal of Scientiﬁc Computing, 98(2):48, 2024.

DRM Revisited: A Complete Error Analysis

Yuling Jiao, Yanming Lai, Dingwei Li, Xiliang Lu, Fengru Wang, Jerry Zhijian Yang, et al. A rate of convergence of physics informed neural networks for the linear second order elliptic PDEs. Communications in Computational Physics, 31(4):1272 1295, 2022.

Yuling Jiao, Yanming Lai, Yisu Lo, Yang Wang, and Yunfei Yang. Error analysis of deep Ritz methods for elliptic equations. Analysis and Applications, 2023a.

Yuling Jiao, Yanming Lai, Xiliang Lu, Fengru Wang, Jerry Zhijian Yang, and Yuanyuan Yang. Deep neural networks with Re LU-sine-exponential activations break curse of dimensionality in approximation on H older class. SIAM Journal on Mathematical Analysis, 55(4):3635 3649, 2023b.

Yuling Jiao, Xiliang Lu, Jerry Zhijian Yang, Cheng Yuan, and Pingwen Zhang. Improved analysis of PINNs: Alleviate the Co D for compositional solutions. Annals of Applied Mathematics, 39:239 263, 2023c.

Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691 716, 2023d.

Yuling Jiao, Yang Wang, and Yunfei Yang. Approximation bounds for norm constrained neural networks with applications to regression and GANs. Applied and Computational Harmonic Analysis, 65:249 278, 2023e.

Yuling Jiao, Jerry Zhijian Yang, Junyu Zhou, et al. A rate of convergence of weak adversarial neural networks for the second order parabolic PDEs. Communications in Computational Physics, 34(3):813 836, 2023f.

Yuling Jiao, Xiliang Lu, Peiying Wu, and Jerry Zhijian Yang. Convergence analysis for overparameterized deep learning. Communications in Computational Physics, 36(1):71 103, 2024.

Yuling Jiao, Yanming Lai, and Yang Wang. Error analysis of three-layer neural network trained with PGD for deep Ritz method. IEEE Transactions on Information Theory, 2025. In press.

Varun Kanade, Patrick Rebeschini, and Tomas Vaskevicius. Exponential tail local rademacher complexity risk bounds without the bernstein condition. Journal of Machine Learning Research, 25(388):1 43, 2024.

Michael Kohler and Adam Krzyzak. Over-parametrized deep neural networks minimizing the empirical risk do not generalize well. Bernoulli, 27(4):2564 2597, 2021.

Michael Kohler and Adam Krzyzak. On the rate of convergence of an over-parametrized deep neural network regression estimate with Re LU activation function learned by gradient descent. preprint, 2023a. URL https://www2.mathematik.tu-darmstadt.de/~kohler/ preprint23_01.pdf.

Jiao, Li, Wu, Yang and Zhang

Michael Kohler and Adam Krzyzak. On the rate of convergence of an over-parametrized transformer classiﬁer learned by gradient descent. ar Xiv preprint ar Xiv:2312.17007, 2023b.

Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231 2249, 2021.

Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593 2656, 2006.

Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 55 (1):73 125, 2022.

Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for Deep Onets: A deep learning framework in inﬁnite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.

Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel ridgeless regression can generalize. The Annals of Statistics, 48(3):1329 1347, 2020.

Yulei Liao and Pingbing Ming. Deep Nitsche method: Deep Ritz method with essential boundary conditions. Communications in Computational Physics, 29(5):1365 1384, 2021.

Yulei Liao and Pingbing Ming. Spectral Barron space and deep neural network approximation. ar Xiv preprint ar Xiv:2309.00788, 2023.

Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in overparameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465 5506, 2021a.

Lu Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. Deep XDE: A deep learning library for solving diﬀerential equations. SIAM Review, 63(1):208 228, 2021b.

Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean ﬁeld analysis of deep Res Net and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pages 6426 6436. PMLR, 2020.

Yiping Lu, Haoxuan Chen, Jianfeng Lu, Lexing Ying, and Jose Blanchet. Machine learning for elliptic PDEs: Fast rate generalization bound, neural scaling law and minimax optimality. In International Conference on Learning Representations, 2021c.

Yulong Lu, Jianfeng Lu, and Min Wang. A priori generalization analysis of the deep Ritz method for solving high dimensional elliptic partial diﬀerential equations. In Conference on Learning Theory, pages 3196 3241. PMLR, 2021d.

Chao Ma, Lei Wu, et al. The Barron space and the ﬂow-induced function spaces for neural network models. Constructive Approximation, 55(1):369 406, 2022.

DRM Revisited: A Complete Error Analysis

Arvind Mahankali, Haochen Zhang, Kefan Dong, Margalit Glasgow, and Tengyu Ma. Beyond NTK with vanilla gradient descent: A mean-ﬁeld analysis of neural networks with polynomial width, samples, and time. In Advances in Neural Information Processing Systems, volume 36, 2024.

Colin Mc Diarmid et al. On the method of bounded diﬀerences. Surveys in combinatorics, 141(1):148 188, 1989.

Shahar Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1):459 502, 2018.

Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physicsinformed neural networks for approximating a class of inverse problems for PDEs. IMA Journal of Numerical Analysis, 42(2):981 1022, 2022.

Siddhartha Mishra and T Konstantin Rusch. Enhancing accuracy of deep learning algorithms by training with low-discrepancy sequences. SIAM Journal on Numerical Analysis, 59(3):1811 1834, 2021.

Johannes M uller and Guido Mont ufar. Geometry and convergence of natural policy gradient methods. Information Geometry, 7(Suppl 1):485 523, 2024.

Johannes M uller and Marius Zeinhofer. Error estimates for the variational training of neural networks with boundary penalty. ar Xiv preprint ar Xiv:2103.01007, 2021.

Johannes M uller and Marius Zeinhofer. Achieving high accuracy with PINNs via energy natural gradient descent. In International Conference on Machine Learning, volume 202, pages 25471 25485. PMLR, 23 29 Jul 2023.

Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21: 174 1, 2020.

Preetum Nakkiran, Prayaag Venkat, Sham M Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. In International Conference on Learning Representations, 2020.

Quynh Nguyen. On the proof of global convergence of gradient descent for deep Re LU networks with linear widths. In International Conference on Machine Learning, pages 8056 8062. PMLR, 2021.

Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep Re LU neural networks. Neural Networks, 108:296 330, 2018.

Philipp Christian Petersen. Neural network theory. University of Vienna, 2020.

Allan Pinkus. Approximation theory of the MLP model. Acta Numerica, 8:143 195, 1999.

Jiao, Li, Wu, Yang and Zhang

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial diﬀerential equations. Journal of Computational Physics, 378:686 707, 2019.

Johannes Schmidt-Hieber et al. Nonparametric regression using deep neural networks with Re LU activation function. The Annals of Statistics, 48(4):1875 1897, 2020.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768 1811, 2020.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network with approximation error being reciprocal of width to power of square root of depth. Neural Computation, 33(4): 1005 1036, 2021a.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: Three hidden layers are enough. Neural Networks, 141:160 173, 2021b.

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Optimal approximation rate of Re LU networks in terms of width and depth. Journal de Math ematiques Pures et Appliqu ees, 157:101 135, 2022.

Yeonjong Shin. On the convergence of physics informed neural networks for linear secondorder elliptic and parabolic type PDEs. Communications in Computational Physics, 28 (5):2042 2074, 2020.

Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313 321, 2020.

Justin A. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial diﬀerential equations. Journal of Computational Physics, 375:1339 1364, 2018.

Hwijae Son, Jin Woo Jang, Woo Jin Han, and Hyung Ju Hwang. Sobolev training for the neural network solutions of PDEs. ar Xiv preprint ar Xiv:2101.08932, 2021.

Elias M Stein. Singular integrals and diﬀerentiability properties of functions. Princeton university press, 1970.

Taiji Suzuki. Adaptivity of deep Re LU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2018.

Taiji Suzuki and Atsushi Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. In Advances in Neural Information Processing Systems, volume 34, 2021.

Matus Telgarsky. Deep learning theory lecture notes, 2021.

Alexander Tsigler and Peter L Bartlett. Benign overﬁtting in ridge regression. Journal of Machine Learning Research, 24(123):1 76, 2023.

DRM Revisited: A Complete Error Analysis

Sara A. van de Geer. Empirical processes in M-estimation, volume 6. Cambridge University Press, 2000.

Aad Van Der Vaart and Jon Wellner. Weak convergence. Springer, 1996.

Gal Vardi, Gilad Yehudai, and Ohad Shamir. On the optimal memorization power of Re LU neural networks. In International Conference on Learning Representations, 2021.

Roman Vershynin. Memory capacity of neural networks with threshold and rectiﬁed linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004 1033, 2020.

Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.

E Weinan. Machine learning and computational mathematics. Communications in Computational Physics, 28(5):1639 1670, 2020.

E. Weinan and Ting Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1): 1 12, 2017.

E Weinan, Chao Ma, and Lei Wu. A priori estimates of the population risk for two-layer neural networks. Communications in Mathematical Sciences, 17(5):1407 1425, 2019.

E Weinan, Chao Ma, and Lei Wu. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Science China Mathematics, pages 1 24, 2020.

Yunbei Xu and Assaf Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. In Advances in Neural Information Processing Systems, 2021.

Yahong Yang and Juncai He. Deeper or wider: A perspective from optimal generalization error with sobolev loss. In International Conference on Machine Learning, pages 56109 56138. PMLR, 2024.

Yunfei Yang and Ding-Xuan Zhou. Optimal rates of approximation by shallow Re LUk neural networks and applications to nonparametric regression. Constructive Approximation, pages 1 32, 2024.

Dmitry Yarotsky. Error bounds for approximations with deep Re LU networks. Neural Networks, 94:103 114, 2017.

Dmitry Yarotsky. Optimal approximation of continuous functions by very deep Re LU networks. In Conference on Learning Theory, pages 639 649. PMLR, 2018.

Dmitry Yarotsky. Elementary superexpressive activations. In International Conference on Machine Learning, pages 11932 11940. PMLR, 2021.

Jiao, Li, Wu, Yang and Zhang

Dmitry Yarotsky and Anton Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 13005 13015, 2020.

Hao Yu, Yixiao Guo, and Pingbing Ming. Generalization error estimates of machine learning methods for solving high dimensional Schr odinger eigenvalue problems. ar Xiv preprint ar Xiv:2408.13511, 2024.

Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial diﬀerential equations. Journal of Computational Physics, 411: 109409, 2020.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis. Time-Frequency and Time-Scale Analysis, Wavelets, Numerical Algorithms, and Applications, 48(2):787 794, 2020.

Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.