# derivativeenhanced_deep_operator_network__bf47a8f6.pdf Derivative-enhanced Deep Operator Network Yuan Qiu, Nolan Bridges, Peng Chen Georgia Institute of Technology, Atlanta, GA 30332 {yuan.qiu, bridges, pchen402}@gatech.edu The deep operator networks (Deep ONet), a class of neural operators that learn mappings between function spaces, have recently been developed as surrogate models for parametric partial differential equations (PDEs). In this work we propose a derivative-enhanced deep operator network (DE-Deep ONet), which leverages derivative information to enhance the solution prediction accuracy and provides a more accurate approximation of solution-to-parameter derivatives, especially when training data are limited. DE-Deep ONet explicitly incorporates linear dimension reduction of high dimensional parameter input into Deep ONet to reduce training cost and adds derivative loss in the loss function to reduce the number of required parameter-solution pairs. We further demonstrate that the use of derivative loss can be extended to enhance other neural operators, such as the Fourier neural operator (FNO). Numerical experiments validate the effectiveness of our approach. 1 Introduction Using neural networks to approximate the maps between functions spaces governed by parametric PDEs can be very beneficial in solving many-query problems, typically arising from Bayesian inference, optimization under uncertainty, and Bayesian optimal experimental design. Indeed, once pre-trained on a dataset, neural networks are extremely fast to evaluate given unseen inputs, compared to traditional numerical methods like the finite element method. Recently various neural operators are proposed to enhance the learning capacity, with two prominent examples deep operator network (Deep ONet) [1] and Fourier neural operator (FNO) [2], which are shown to be inclusive of each other in their more general settings [3, 4], see also their variants and other related neural operator architectures in [5, 6, 7, 8, 9, 10]. Though these work demonstrate to be successful in approximating the output function, they do not necessarily provide accurate approximation of the derivative of the output with respect to the input, which are often needed for many downstream tasks such as PDEconstrained optimization problems for control, inference, and experimental design [11, 12, 13, 14]. In this paper, we propose to enhance the performance of Deep ONet through derivative-based dimension reduction for the function input inspired by [15, 16, 17, 18] and the incorporation of derivative information in the training to learn both the output and its derivative with respect to the input inspired by [19, 20]. These two derivative-enhanced approaches can significantly improve Deep ONet s approximation accuracy for the output function and its directional derivative with respect to the input function, especially when the training samples are limited. We provide details on the computation of derivative labels of the solution of PDEs in a general form as well as the derivative-based dimension reduction to largely reduce the computational cost. We demonstrate the effectiveness of our proposed method (DE-Deep ONet) compared to three other neural operators, including Deep ONet, FNO, and derivative-informed neural operator (DINO) [20], in terms of both test errors and computational cost. In addition, we apply derivative learning to train the FNO and also compare its performance with other methods. 
The code for data generation, model training and inference, as well as configurations to reproduce the results in this paper can be found at https://github.com/qy849/DE-Deep ONet. 38th Conference on Neural Information Processing Systems (Neur IPS 2024). 2 Preliminaries This section presents the problem setup, high-fidelity approximation using finite element for finite dimensional discretization, and the Deep ONet architecture in learning the solution operator. 2.1 Problem setup Let Ω Rd denote an open and bounded domain with boundary Ω Rd 1, where the dimension d = 1, 2, 3. We consider a PDE of the general form defined in Ωas R(m, u) = 0, (1) prescribed with proper boundary conditions. Here m V in is an input parameter function defined in a separable Banach space V in with probability measure ν and u V out is the output as the solution of the PDE defined in a separable Banach space V out. Our goal is to construct a parametric model ˆu(m; θ) to approximate the solution operator that maps the parameter m to the solution u. Once constructed, the parametric model ˆu(m; θ) should be much more computationally efficient to evaluate compared to solving the PDE with high fidelity approximation. 2.2 High fidelity approximation For the high fidelity approximation of the solution, we consider using a finite element method [21] in this work. We partition the domain Ωinto a finite set of subregions, called cells or elements. Collectively, these cells form a mesh of the domain Ω. Let h represent the diameter of the largest cell. We denote V in h indexed by h as the finite element space for the approximation of the input space V in with Lagrange basis {ϕin 1 , , ϕin Nin h } of dimension N in h such that ϕin i (x(j)) = δij at the finite element node x(j), with δij being the Kronecker delta function. Similarly, we denote V out h as the finite element space for the approximation of the solution space V out with basis {ϕout 1 , , ϕout Nout h } of dimension N out h . Note that for the approximation to be high fidelity, N in h and N out h are often very large. To this end, we can write the high fidelity approximation of the input and output functions as i=1 miϕin i (x) and uh(x) = i=1 uiϕout i (x), with coefficient vectors m = (m1, , m N in h )T RNin h and u = (u1, , u Nout h )T RNout h , whose entries are the nodal values of m and u at the corresponding nodes. 2.3 Deep ONet We briefly review the Deep ONet architecture [1] with a focus on learning the solution operator of the PDE in Equation (1). To predict the evaluation of solution function u at any point x Ω Ω for any given input function m, [1] design a network architecture that comprises two separate neural networks: a trunk net t( ; θt), which takes the coordinate values of the point x at which we want to evaluate the solution function as inputs, and a branch net b( ; θb), which takes the vector m encoding the parameter function m as inputs. In [1], the vector m is the function evaluations at a finite set of fixed points {x(j)}N in h j=1 Ω Ω, that is, m = (m(x(1)), , m(x(Nin h)))T , which corresponds to coefficient vector in the finite element approximation with Lagrange basis at the same nodes. If the solution function is scalar-valued, then both neural networks output vectors of the same dimensions. The prediction is obtained by taking the standard inner product between these two vectors and (optionally) adding a real-valued bias parameter, i.e, ˆu(m; θ)(x) = b(m; θb), t(x; θt) + θbias, with θ = (θb, θt, θbias). 
If the solution u is vector-valued of Nu components, i.e., u = (u(1), . . . , u(Nu)), as in our experiments, we can use one of the four approaches in [3] to construct the Deep ONet. Specifically, for each solution component, we use the same outputs of branch net with dimension Nb and different corresponding groups of outputs of trunk net to compute the inner product. More precisely, the solution u(i) of component i = 1, . . . , Nu, is approximated by ˆu(i)(m; θ)(x) = b(m; θb), t(i)(x; θt) + θ(i) bias, (2) where t(i)(x; θt) = t(x; θt)[(i 1)Nb + 1 : i Nb], the vector slice of t(x; θt) with indices ranging from (i 1)Nb + 1 to i Nb, i = 1, . . . , Nu. For this construction, the outputs of the branch net can be interpreted as the coefficients of the basis learned through the trunk net. By partitioning the outputs of the trunk net into different groups, we essentially partition the basis functions used for predicting different components of the solution. The Deep ONet is trained using dataset D = {(m(i), u(i))}ND i=1 with ND samples, where m(i) are random samples independently drawn from ν and u(i) are the solution of the PDE (with slight abuse of notation from the vector-valued solution) at m(i), i = 1, . . . , ND. 3 DE-Deep ONet The Deep ONet uses the input parameter and output solution pairs as the labels for the model training. The approximation for the derivative of the solution with respect to the parameter (and the coordinate) are not necessarily accurate. However, in many downstream tasks such as Bayesian inference and experimental design, the derivatives are required. We consider incorporating the the Fréchet derivative du(m; ) for the supervised training of the Deep ONet, which we call derivative-enhanced Deep ONet (DE-Deep ONet). By doing this we hope the optimization process can improve the neural network s ability to predict the derivative of the output function with respect to the input function. Let θ denote the trainable parameters in the DE-Deep ONet. We propose to minimize the loss function L(θ) = λ1Em ν u(m) ˆu(m) 2 L2(Ω) + λ2Em ν du(m; ) dˆu(m; ) 2 HS, (3) where the L2(Ω) norm of a square integrable function f is defined as f 2 L2(Ω) = ( R Ω f(x) 2 2 dx)1/2, the Hilbert Schmidt norm of an operator T : H H that acts on a Hilbert space H is defined as T 2 HS = P i I Tei 2 H, where {ei}i I is an orthonormal basis of H. Here, λ1, λ2 > 0 are hyperparameters that balance different loss terms. The main challenge of minimizing the loss function (3) in practice is that with high fidelity approximation of the functions m and u using high dimensional vectors m and u, the term du(m; ) dˆu(m; ) 2 HS (approximately) becomes mu mˆu 2 F , where the Frobenius norm of a matrix M Rm n is defined as M 2 F = Pm i=1 Pn j=1 M 2 ij. It is a critical challenge to both compute and store the Jacobian matrix at each sample as well as to load and use it for training since it is often very large with size N out h N in h , with N in h , N out h 1. To tackle this challenge, we employ dimension reduction for the high dimensional input vector m. The reduced representation of m is given by projecting m into a low dimensional linear space spanned by a basis ψin 1 , . . . , ψin r RNin h , with r N in h , that is, f m = ( ψin 1 , m , . . . , ψin r , m )T Rr. To better leverage the information of both the input probability measure ν and the map u : m u(m), we consider the basis generated by active subspace method (ASM) using derivatives [22]. 
ASM identifies directions in the input space that significantly affect the output variance, or in which the output is most sensitive. See Section 3.3 for the detailed construction. In this case, the term Em ν du(m; ) dˆu(m; ) 2 HS can be approximated by 1 N1N2 PN1 i=1 PN2 j=1 f mu(m(i))(x(j)) f mˆu(m(i))(x(j)) 2 2 with a small amount of functions m(i) sampled from input probability measure ν and points x(j) in the domain Ω. Note that e mˆu(m(i))(x(j)) is vector of size r, which is computationally feasible. For comparison, we also conduct experiments if the basis is the most commonly used Karhunen Loève Expansion (KLE) basis. We next formally introduces our DE-Deep ONet, which uses dimension reduction for the input of the branch net and incorporates its output-input directional derivative labels as additional soft constraints into the loss function. 3.1 Model architecture We incorporate dimension reduction into Deep ONet to construct a parametric model for approximating the solution operator. Specifically, if the solution is scalar-valued (the vector-valued case can be constructed similar to (2)), the prediction is given by ˆu(m; θ)(x) = b(f m; θb), t(x; θt) + θbias, (4) where θ = (θb, θt, θbias) are the parameters to be learned. The branch net b( ; θb) and trunk net t( ; θt) can be chosen as an MLP, Res Net, etc. Note that the branch net takes a small vector of the projected parameter as input. We also apply the Fourier feature embeddings [5], defined as γ(x) = [cos(Bx), sin(Bx)], to the trunk net, where each entry in B Rm d is sampled from a Gaussian distribution N(0, σ2) and m N+, σ R+ are hyper-parameters. 3.2 Loss function In practical training of the DE-Deep ONet, we formulate the loss function as follows i=1 err({(ˆu(m(i); θ)(x(j)), u(m(i))(x(j)))}Nx j=1) i=1 err({( mˆu(m(i); θ)(x(j))Ψin, Φout(x(j))( mu(m(i))Ψin)}Nx j=1), where {x(j)}Nx j=1 are the nodes of the mesh, Ψin = (ψin 1 | |ψin r ) is the matrix collecting the nodal values of the reduced basis of the input function space, and Φout(x) = (ϕout 1 (x), . . . , ϕout N out h (x)) is the vector-valued function consisting of the finite element basis functions of the output function space. The err({(a(i), b(i))}n i=1) denotes the relative error between any two groups of vectors a(i) and b(i), computed as err({(a(i), b(i))}n i=1) = (Pn i=1 a(i) b(i) 2 2)1/2 ε + (Pn i=1 b(i) 2 2)1/2 , where ε > 0 is some small positive number to prevent the fraction dividing by zero. In the following, we explain how to compute different loss terms in Equation (5) The first term is for matching the prediction of the parametric model ˆu(m(i); θ) evaluated at any set of points {x(j)}Nx j=1 Ω Ωwith the high fidelity solution u(m(i)) evaluated at the same points. The prediction ˆu(m(i); θ)(x(j)) is straightforward to compute using Equation (4). This involves passing the reduced branch inputs em(i) and the coordinates of point x(j) into the branch net and trunk net, respectively. The label u(m(i))(x(j)) is obtained using finite element method solvers. The second term is for learning the directional derivative of the evaluation u(x(j)) with respect to the input function m, in the direction of the reduced basis ψin 1 , . . . , ψin r . It can be shown that mˆu(m(i); θ)(x(j))Ψin = f mˆu(m(i); θ)(x(j)). Thus, the derivative of the outputs of the parametric model can be computed as the partial derivatives of the output with respect to the input of the branch net via automatic differentiation. 
On the other hand, the derivative labels Φout(x(j))( mu(m(i))Ψin) = (du(m(i); ψin 1 )(x(j)), . . . , du(m(i); ψin r )(x(j))) are obtained by first computing the Gateaux derivatives du(m(i); ψin 1 ), . . . , du(m(i); ψin r ) and then evaluating them at x(j). See Appendix B.3 for details about the computation of the Gâteaux derivative du(m; ψ). We initialize the loss weights λ1 = λ2 = 1 and choose a loss balancing algorithm called the self-adaptive learning rate annealing algorithm [23] to update them at a certain frequency. This ensures that the gradient of each loss term have similar magnitudes, thereby enabling the neural network to learn all these labels simultaneously. Remark. For the training, the computational cost of the second term of the loss function (and its gradients) largely depends on the number of points used in each iteration. To reduce the computational and especially memory cost, we can use a subset of points, N batch x = αNx, where α is small number between 0 and 1 (e.g., α = 0.1 in our experiments), in a batch of functions, though their locations could vary among different batches in one epoch. We find that this approach has little impact on the prediction accuracy of the model when Nx is large enough. 3.3 Dimension reduction Throughout the paper, we assume that the parameter functions m are independently drawn from some Gaussian random field [24]. In particular, we consider the case where the covariance function is the Whittle-Matérn covariance function, that is, m i.i.d. N( m, C), where m is the (deterministic) mean function and C = (δI γ ) 2 (I is the identity and is the Laplacian) is an operator such that the square root of its inverse, C 1 2 , maps random function (m m) to Gaussian white noise with unit variance [25]. The parameters δ, γ R+ jointly control the marginal variance and correlation length. We consider two linear projection bases for dimension reduction of the parameter function. Karhunen Loève Expansion (KLE) basis. The KLE basis is optimal in the sense that the mean-square error resulting from a finite representation of the random field m is minimized [26]. It consists of eigenfunctions determined by the covariance function of the random field. Specifically, an eigenfunction ψ of the covariance operator C = (δI γ ) 2 satisfies the differential equation Cψ = λψ. (6) When solved using the finite element method, Equation (6) is equivalent to the following linear system (See Appendix A.2 for the derivation) M in(Ain) 1M in(Ain) 1M inψ = λM inψ, (7) where the (i, j)-entries of matices Ain and M in are given by Ain ij = δ ϕin j , ϕin i + γ ϕin j , ϕin i , M in ij = ϕin j , ϕin i . Here, we recall that V in h is the finite element function space of input function, ϕin i , i = 1, . . . , N in h for the finite element basis of V in h , and , for the L2(Ω)-inner product. We select the first r (typically r N in h ) eigenfunctions ψin 1 , . . . , ψin r corresponding to the r largest eigenvalues for the dimension reduction. Let Ψin = (ψin 1 | |ψin r ) denote the corresponding nodal values of these eigenfunctions. Since the eigenvectors ψin i are M in-orthogonal (or equivalently, eigenfunctions ψin i are L2(Ω)-orthogonal), the reduced representation of m can be computed as f m = (Ψin)T M inm, that is, the coefficients of the low rank approximation of m in the linear subspace spanned by the eigenfunctions ψin 1 , . . . , ψin r . Active Subspace Method (ASM) basis. 
The active subspace method is a gradient-based dimension reduction method that looks for directions in the input space contributing most significantly to the output variability [27]. In contrast to the KLE basis, the ASM basis is more computationally expensive. However, since the ASM basis captures sensitivity information in the input-output map rather than solely the variability of the input space, it typically achieves higher accuracy in predicting the output than KLE basis. We consider the case where the output is a multidimensional vector [22], representing the nodal values of the output function u. The ASM basis ψi, i = 1, . . . , r, are the eigenfunctions corresponding to the r largest eigenvalues of the generalized eigenvalue problem Hψ = λC 1ψ, (8) where the action of operator H on function ψ is given by Hψ = Em ν(m)[d u(m; du(m; ψ))]. (9) Here, du(m; ψ) is the Gâteaux derivative of u at m V in h in the direction of ψ V in h , defined as limε 0(u(m + εψ) u(m))/ε, and d u(m; ) is the adjoint of the operator du(m; ). When solved using finite element method, Equation (8) is equivalent to the following linear system (See Appendix A.3 for the derivation) Hψ = λC 1ψ, (10) where the action of matrix H on vector ψ is given by Hψ = Em ν(m)[( mu)T M out( mu)ψ], (11) and the action of matrix C 1 on vector ψ is given by C 1ψ = Ain(M in) 1Ainψ. (12) Here, M out denotes the mass matrix of the output function space, i.e., M out ij = ϕout j , ϕout i . In practice, when solving Equation (10), we obtain its left hand side through computing ( Hψ, ϕin 1 , . . . , Hψ, ϕin N in h )T and its right hand side through the matrix-vector multiplication in Equation (12) (See Appendix A.3 for details). Similar to the KLE case, let Ψin denote the nodal values of r dominant eigenfunctions. Since the eigenvectors ψin i are C 1-orthogonal, the reduced representation of m can be computed as f m = (Ψin)T C 1m. We use a scalable double pass randomized algorithm [28] implemented in h IPPYlib to solve the generalized eigenproblems Equation (7) and Equation (10). To this end, we present the computation of the derivative label and the action of H as follows. Theorem 1. Suppose the PDE in the general form of (1) is well-posed with a unique solution map from the input function m V in to the output function u V out with dual (V out) . Suppose the PDE operator R : V in V out (V out) is differentiable with derivatives m R : V in (V out) and u R : V out (V out) , and in addition u R is invertible with invertible adjoint ( u R) . Then the directional derivative p = du(m; ψ) for any function ψ V in, and an auxillary variable q such that d u(m; p) = ( m R) q can be obtained as the solution of the linearized PDEs ( u R)p + ( m R)ψ = 0, (13) ( u R) q = p. (14) Proof sketch. We perturb m with εψ for any small ε > 0 and obtain R(m + εψ, u(m + εψ)) = 0. Using Taylor expansion to expand it to the first order and letting ε approach 0, we obtain Equation (13), where p = du(m; ψ). Next we compute d u(m; p). By Equation (13), we have du(m; ψ) = ( u R) 1( m R)ψ. Thus, d u(m; ψ) = ( m R) ( u R) p. Then we can first solve for q = ( u R) p and then compute ( m R) q. See Appendix A.1 for a full proof. 4 Experiments In this section, we present experiment results to demonstrate the derivative-enhanced accuracy and cost of our method on two test problems of nonlinear vector-valued PDEs in comparison with Deep ONet, FNO, and DINO. Details about the data generation and training can be found in Appendix B. 
4.1 Input probability measure In all test cases, we assume that the input functions m(i) are i.i.d. samples drawn from a Gaussian random field with mean function m and covariance operator (δI γ ) α. We take α = 2 in two space dimensions so the covariance operator is of trace class [29]. It is worth noting that the parameters γ and δ jointly control the marginal variance and correlation length, for which we take values to have large variation of the input samples that lead to large variation of the output PDE solutions. In such cases, especially when the mapping from the input to output is highly nonlinear as in our test examples, a vanilla neural network approximations tend to result in relatively large errors, particularly with a limited number of training data. To generate samples from this Gaussian random field, we use a scalable (in mesh resolution) sampling algorithm implemented in h IPPYlib [28]. 4.2 Governing equations We consider two nonlinear PDE examples, including a nonlinear vector-valued hyperelasticity equation with one state (displacement), and a nonlinear vector-valued Navier Stokes equations with multiple states (velocity and pressure). For the hyperelasticity equation, we consider an experimental scenario where a square of hyperelasticity material is secured along its left edge while a fixed upwardright force is applied to its right edge [30]. Our goal is to learn the map between (the logarithm of) the material s Young s modulus and its displacement. We also consider the Navier Stokes equations that describes viscous, incompressible creeping flow. In particular, we consider the lid-driven cavity case where a square cavity consisting of three rigid walls with no-slip conditions and a lid moving with a tangential unit velocity. We consider that the uncertainty comes from the viscosity term and aim to predict the velocity field. See Appendix B.1 for details. 4.3 Evaluation metrics We use the following three metrics to evaluate the performance of different methods. For the approximations of the solution u(m), we compute the relative error in the L2(Ω) norm and H1(Ω) norm on the test dataset Dtest = {(m(i), u(i))}Ntest i=1, that is, ˆu(m(i); θ) u(m(i)) L2(Ω) u(m(i)) L2(Ω) , 1 Ntest ˆu(m(i); θ) u(m(i)) H1(Ω) u(m(i)) H1(Ω) , u L2(Ω) = Z Ω u(x) 2 2 dx = u T M outu, u H1(Ω) = ( u 2 L2(Ω) + u 2 L2(Ω))1/2. For the approximation of the Jacobian du(m; ), we compute the relative error (for the discrete Jacobian) in the Frobenius norm on Dtest along random directions ω = {ωi}Ndir i=1, that is, dˆu(m(i); ω) du(m(i); ω) F du(m(i); ω) F , where ωi are samples drawn from the same distribution as m. 4.4 Main results We compare the prediction errors measured in the above three evaluation metrics and computational cost in data generation and neural network training for four neural operator architectures, including Deep ONet [1], FNO [2], DINO [20], and our DE-Deep ONet. We also add experiments to demonstrate the performance of the FNO trained with the derivative loss (DE-FNO) and the Deep ONet trained with input dimension reduction but without the derivative loss (Deep ONet (ASM) or Deep ONet (KLE)). For FNO, we use additional position embedding [4] that improves its approximation accuracy in our test cases. For DINO, we use ASM basis (v.s. KLE basis) as the input reduced basis and POD basis (v.s. output ASM basis [20]) as the output reduced basis, which gives the best approximation accuracy. 
For the input reduced basis of DE-Deep ONet, we also test and present the results for both KLE basis and ASM basis. 16 64 256 1024 # Training samples Relative error in L2( ) norm (N_iterations = 32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 16 64 256 1024 # Training samples Relative error in H1( ) norm (N_iterations = 32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 16 64 256 1024 # Training samples Relative error in Fro norm (dm) (N_iterations = 32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 16 64 256 1024 # Training samples Relative error in L2( ) norm (N_iterations = 32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 16 64 256 1024 # Training samples Relative error in H1( ) norm (N_iterations = 32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 16 64 256 1024 # Training samples Relative error in Fro norm (dm) (N_iterations=32768) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO Figure 1: Mean relative errors ( standard deviation) over 5 random seeds of neural network training for a varying number of training samples for the [top: hyperelasticity; bottom: Navier Stokes] equation using different methods. Relative errors in the L2(Ω) norm (left) and H1(Ω) norm (middle) for the prediction of u = (u1, u2). Right: Relative error in the Frobenius (Fro) norm for the prediction of du(m; ω) = (du1(m; ω), du2(m; ω)). 100 101 102 103 104 # Training time (seconds) Relative error in L2( ) norm (N_train = 16) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 100 101 102 103 104 # Training time (seconds) Relative error in H1( ) norm (N_train = 16) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 100 101 102 103 104 # Training time (seconds) Relative error in Fro norm (dm) (N_train = 16) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 100 101 102 103 104 # Training time (seconds) Relative error in L2( ) norm (N_train = 256) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 100 101 102 103 104 # Training time (seconds) Relative error in H1( ) norm (N_train = 256) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO 100 101 102 103 104 # Training time (seconds) Relative error in Fro norm (dm) (N_train = 256) FNO DE-FNO (ASM) DE-FNO (KLE) DE-FNO (Random) Deep ONet Deep ONet (ASM) Deep ONet (KLE) DE-Deep ONet (ASM) DE-Deep ONet (KLE) DINO Figure 2: Mean relative errors ( standard deviation) over 5 random seeds versus model training time for the Navier Stokes equations when the number of training samples is [top: 16; bottom: 256]. Test errors. 
In Figure 1, we show a comparison of different methods for the hyperelasticity and Navier Stokes problem in predicting the solution (in L2 norm, left, and H1 norm, middle) and its derivative (in Frobenius norm, right) with respect to parameter in the direction of Ndir = 128 test samples {ωi}Ndir i=1. First, we can observe significant improvement of the approximation accuracy of DE-Deep ONet compared to the vanilla Deep ONet for all three metrics in all cases of training samples. Second, we can see that FNO leads to larger approximation errors than DINO and DE-Deep ONet for all metrics except when number of training samples is 1024. We believe that the reason why FNO performs better than DE-Deep ONet and DINO when training samples are large enough is mainly due to the use of input dimensionality reduction in DE-Deep ONet and DINO (where the linear reduction error cannot be eliminated by increasing training samples) whereas in FNO we use full inputs. We also see that DE-FNO performs the best among all models when the training samples are sufficient (256 or 1024), although in the compensation of much longer training time shown in Figure 2. Third, we can see that the approximation accuracy of DINO is similar to DE-Deep ONet (ASM) but requires much longer inference time as shown in Table 5 (see Section 5 and Appendix B.3.2 for reasons). In DINO, the output reduced basis dimension is set to be smaller than or equal to the number of training samples as the output POD basis are computed from these samples, i.e., 16 for 16 samples and 64 for 64 samples. Increasing the output dimension beyond 64 does not lead to smaller errors in our test. Finally, we can observe that DE-Deep ONet using ASM basis leads to smaller errors than using KLE basis, especially for the Navier-Stokes problem. Table 1: Output reconstruction error with 16 input reduced bases Relative L2 error Hyperelasticity 3.8% 2.7% Navier Stokes 17.4% 5.8% Moreover, we present the output reconstruction error due to the input dimension reduction using KLE or ASM basis in Table 1. The errors provide the lower bound of the relative errors in L2(Ω) norm of DINO and DE-Deep ONet. We can see that using the ASM basis results in a lower output reconstruction error than the KLE basis (more significant difference observed in the more nonlinear Navier Stokes equations). See Appendix B.6 for the decay of the reconstruction error with increasing number of reduced basis. In addition, we provide full visualizations of the ground truth and prediction of both solution and derivatives in Appendix B.7. Visually, DE-Deep ONet (ASM) consistently provides the best estimate (in terms of pattern and smoothness similarity with the ground truth) for both the solution and its derivative when the training data is limited (16 or 64 samples). Data generation computational cost. We use MPI and the finite element library FEni CS [31] to distribute the computational load of offline data generation to 64 processes for the PDE models considered in this work. See Table 2 for the wall clock time in generating the samples of Gaussian random fields (GRFs), solving the PDEs, computing the r = 16 reduced basis functions (KLE or ASM) corresponding to the 16 largest eigenvalues, generating the derivative labels, respectively. In Table 3, We also provide the total wall clock time of data generation of DE-Deep ONet (ASM) (we only includes the major parts computing high fidelity solution, ASM basis and dm labels [16 directions]) when Ntrain = 16, 64, 256, 1024 using 16 CPU processors. 
Table 2: Wall clock time (in seconds) for data generation on 2 AMD EPYC 7543 32-Core Processors GRFs PDEs KLE ASM dm labels (Nall = 2000) (Nall = 2000) (r = 16) (r = 16) (Nall = 2000, r = 16) (64 procs) (64 procs) (1 procs) (16 procs) (64 procs) Hyperelasticity 1.1 9.7 0.4 1.4 19.5 Navier Stokes 1.9 99.1 1.3 9.7 125.5 Table 3: Wall clock time (in seconds) for data generation with different number of training samples using 16 CPU processors Dataset\Ntrain PDEs + ASM basis + dm labels 16 64 256 1024 Hyperelasticity 2 5 16 61 Navier Stokes 17 38 124 470 Table 4: Wall clock time (seconds/iteration with batch size 8) for training on a single NVIDIA RTX A6000 GPU Dataset Model Deep ONet FNO DINO DE-FNO DE-Deep ONet Hyperelasticity 0.007 0.015 0.007 0.215 0.022 Navier Stokes 0.007 0.015 0.007 0.216 0.033 Table 5: Total wall clock time (in seconds) for each model inferring on 500 test samples of both the solution and dm in 128 random directions, using a single GPU and a single CPU (except where specified) Deep ONet FNO/DE-FNO DINO1 DE-Deep ONet Numerical solver 1 GPU + 1 CPU/16 CPUs 0 GPU + 16 CPUs Hyperelasticity 3 33 69/7 10 166 Navier Stokes 3 33 2152/151 18 1103 1 The inference time of DINO is dominated by the time required to compute evaluations of all finite element basis functions at the grid points using FEni CS (which may not be the most efficient, see Appendix B.3.2). Even though these grid points overlap with parts of the finite element nodes allowing us to skip evaluations by extracting the relevant nodes for a fairer comparison with DE-Deep ONet (in terms of its ability to evaluate at any arbitrary point), we assume they are arbitrary points requiring explicit evaluation. Model training computational cost. We present comparisons of the wall clock time of each optimization iteration (with batch size 8) of different methods in Table 4 and convergence plot (error versus training time) in Figure 2 and the figures in Appendix B.5. We find that incorporating derivative loss leads to longer training time as expected. However, when the training data are limited, the increased computation cost is compensated for a significant reduction of errors. We note that there are potential ways to further reduce the training cost, e.g., by training the model with additional derivative loss only during the later stage of training, or by using fewer points for computing the derivative losses in each iteration. Additionally, thanks to the dimension reduction of the input, we can define a relatively small neural network and thus are able to efficiently compute the derivatives using automatic differentiation. 5 Related work Our work is related to Sobolev training for neural networks [32], which was found to be effective in their application to model distillation/compression and meta-optimization. In the domain of surrogate models for parametric partial differential equations, our work is more closely related to derivativeinformed neural operator (DINO) [20] which is based on a derivative-informed projected neural network (DIPNet) [17], and presents an extension to enhance the performance of the Deep ONet. Compared to DINO, although the Deep ONet architecture (and its formulation of dm loss) requires longer training time, it offers the following advantages: (1) Potentially shorter inference time. The additional trunk net (which receives spatial coordinates) allows us to quickly query the sensitivity of output function at any point when input function is perturbed in any direction. 
While DINO can only provide the derivative of the output coefficients respect to the input coefficients (we call reduced dm), in order to compute the sensitivity at a batch of points, we need to post process the reduced dm by querying the finite element basis on these points and computing large matrix multiplications; (2) Greater flexibility and potential for improvements. Although both Deep ONet and DINO approximate solution by a linear combination of a small set of functions, these functions together in Deep ONet is essentially the trunk net, which is "optimized" via model training, whereas in DINO, they are POD or derivative-informed basis precomputed on training samples. When using DINO, if we encounter a case where the training samples not enough to accurately compute the output basis, the large approximation error between the linear subspace and solution manifold will greatly restrict the model prediction accuracy (see Figure 2 when Ntrain = 16). And the reduced dm labels only supports linear reduction of output. However, it is possible that we can further improve Deep ONet by, e.g., adding physical losses (to enhance generalization performance) and Fourier feature embeddings (to learn high-frequency components more effectively) on the trunk net [5] and replacing the inner product of the outputs of two networks by more flexible operations [33, 9] (to enhance expressive power). The dm loss formulation of our work is broadly suitable any network architecture that has multiple subnetworks, where at least one of them receives high-dimensional inputs. 6 Discussion In this work, we proposed a new neural operator Derivative-enhanced Deep Operator Network (DE-Deep ONet) to address the limited accuracy of Deep ONet in both function and derivative approximations. Specifically, DE-Deep ONet employs a derivative-informed reduced representation of input function and incorporates additional loss into the loss function for the supervised learning of the derivative of the output with respect to the inputs of the branch net. Our experiments for nonlinear PDE problems with high variations in both input and output functions demonstrate that adding this loss term to the loss function greatly enhances the accuracy of both function and derivative approximations, especially when the training data are limited. We also demonstrate that the use of derivative loss can be extended to enhance other neural operators, such as the Fourier neural operator. We presented matrix-free computation of the derivative label and the derivative-informed dimension reduction for a general form of PDE problems by using randomized algorithms and linearized PDE solves. Thanks to this scalable approach, the computational cost in generating the derivative label data is shown to be only marginally higher than generating the input-output function pairs for the test problems, especially for the more complex Navier Stokes equations which require more iterations in the nonlinear solves than the hyperelasticity equation. Limitations: We require the derivative information in the training and dimension reduction using ASM, which may not be available if the explicit form of the PDE is unknown or if the simulation only provides input-output pairs from some legacy code. Another limitation is that dimension reduction of the input function plays a key role in scalable data generation and training, which may not be feasible or accurate for intrinsically very high-dimensional problems such as high frequency wave equations. 
Such problems are also very challenging and remain unsolved by other methods to our knowledge. Acknowledgements We would like to thank Jinwoo Go, Dingcheng Luo and Lianghao Cao for insightful discussions and helpful feedback. [1] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, Learning nonlinear operators via Deeponet based on the universal approximation theorem of operators, Nature machine intelligence, vol. 3, no. 3, pp. 218 229, 2021. [2] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar, Fourier neural operator for parametric partial differential equations, ar Xiv preprint ar Xiv:2010.08895, 2020. [3] L. Lu, X. Meng, S. Cai, Z. Mao, S. Goswami, Z. Zhang, and G. E. Karniadakis, A comprehensive and fair comparison of two neural operators (with practical extensions) based on fair data, Computer Methods in Applied Mechanics and Engineering, vol. 393, p. 114778, 2022. [4] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, Neural operator: Learning maps between function spaces with applications to PDEs, Journal of Machine Learning Research, vol. 24, no. 89, pp. 1 97, 2023. [5] S. Wang, H. Wang, and P. Perdikaris, Learning the solution operator of parametric partial differential equations with physics-informed deeponets, Science advances, vol. 7, no. 40, p. eabi8605, 2021. [6] S. Goswami, M. Yin, Y. Yu, and G. E. Karniadakis, A physics-informed variational deeponet for predicting crack path in quasi-brittle materials, Computer Methods in Applied Mechanics and Engineering, vol. 391, p. 114587, 2022. [7] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar, Multipole graph neural operator for parametric partial differential equations, Advances in Neural Information Processing Systems, vol. 33, pp. 6755 6766, 2020. [8] B. Raoni c, R. Molinaro, T. Rohner, S. Mishra, and E. de Bezenac, Convolutional Neural Operators, ar Xiv preprint ar Xiv:2302.01178, 2023. [9] Z. Hao, Z. Wang, H. Su, C. Ying, Y. Dong, S. Liu, Z. Cheng, J. Song, and J. Zhu, Gnot: A general neural operator transformer for operator learning, in International Conference on Machine Learning, pp. 12556 12569, PMLR, 2023. [10] A. Tran, A. Mathews, L. Xie, and C. S. Ong, Factorized fourier neural operators, ar Xiv preprint ar Xiv:2111.13802, 2021. [11] D. Luo, T. O Leary-Roseberry, P. Chen, and O. Ghattas, Efficient PDE-constrained optimization under high-dimensional uncertainty using derivative-informed neural operators, ar Xiv preprint ar Xiv:2305.20053, 2023. [12] L. Cao, T. O Leary-Roseberry, and O. Ghattas, Efficient geometric Markov chain Monte Carlo for nonlinear Bayesian inversion enabled by derivative-informed neural operators, ar Xiv preprint ar Xiv:2403.08220, 2024. [13] J. Go and P. Chen, Accelerating Bayesian optimal experimental design with derivative-informed neural operators, ar Xiv preprint ar Xiv:2312.14810, 2023. [14] J. Go and P. Chen, Sequential infinite-dimensional Bayesian optimal experimental design with derivative-informed latent attention neural operator, ar Xiv preprint ar Xiv:2409.09141, 2024. [15] P. G. Constantine, E. Dow, and Q. Wang, Active subspace methods in theory and practice: applications to kriging surfaces, SIAM Journal on Scientific Computing, vol. 36, no. 4, pp. A1500 A1524, 2014. [16] P. Chen and O. Ghattas, Projected Stein variational gradient descent, Advances in Neural Information Processing Systems, vol. 33, pp. 1947 1958, 2020. [17] T. 
O Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas, Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs, Computer Methods in Applied Mechanics and Engineering, vol. 388, p. 114199, 2022. [18] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk, Certified dimension reduction in nonlinear Bayesian inverse problems, Mathematics of Computation, vol. 91, no. 336, pp. 1789 1835, 2022. [19] H. Son, J. W. Jang, W. J. Han, and H. J. Hwang, Sobolev training for the neural network solutions of PDEs, 2020. [20] T. O Leary-Roseberry, P. Chen, U. Villa, and O. Ghattas, Derivative-informed neural operator: an efficient framework for high-dimensional parametric derivative learning, Journal of Computational Physics, vol. 496, p. 112555, 2024. [21] J. T. Oden and J. N. Reddy, An introduction to the mathematical theory of finite elements. Courier Corporation, 2012. [22] O. Zahm, P. G. Constantine, C. Prieur, and Y. M. Marzouk, Gradient-based dimension reduction of multivariate vector-valued functions, SIAM Journal on Scientific Computing, vol. 42, no. 1, pp. A534 A558, 2020. [23] S. Wang, S. Sankaran, H. Wang, and P. Perdikaris, An expert s guide to training physicsinformed neural networks, ar Xiv preprint ar Xiv:2308.08468, 2023. [24] R. J. Adler and J. E. Taylor, Random fields and geometry. Springer Science & Business Media, 2009. [25] F. Lindgren, H. Rue, and J. Lindström, An explicit link between gaussian fields and Gaussian markov random fields: the stochastic partial differential equation approach, Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 73, no. 4, pp. 423 498, 2011. [26] R. G. Ghanem and P. D. Spanos, Stochastic finite elements: a spectral approach. Courier Corporation, 2003. [27] P. G. Constantine, Active subspaces: Emerging ideas for dimension reduction in parameter studies. SIAM, 2015. [28] U. Villa, N. Petra, and O. Ghattas, HIPPYlib: an extensible software framework for large-scale inverse problems governed by PDEs: part I: deterministic inversion and linearized Bayesian inference, ACM Transactions on Mathematical Software (TOMS), vol. 47, no. 2, pp. 1 34, 2021. [29] A. M. Stuart, Inverse problems: a Bayesian perspective, Acta numerica, vol. 19, pp. 451 559, 2010. [30] L. Cao, T. O Leary-Roseberry, P. K. Jha, J. T. Oden, and O. Ghattas, Residual-based error correction for neural operator accelerated infinite-dimensional bayesian inverse problems, Journal of Computational Physics, vol. 486, p. 112104, 2023. [31] A. Logg, K.-A. Mardal, and G. Wells, Automated solution of differential equations by the finite element method: The FEni CS book, vol. 84. Springer Science & Business Media, 2012. [32] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu, Sobolev training for neural networks, Advances in neural information processing systems, vol. 30, 2017. [33] S. Pan, S. L. Brunton, and J. N. Kutz, Neural implicit flow: a mesh-agnostic dimensionality reduction paradigm of spatio-temporal data, Journal of Machine Learning Research, vol. 24, no. 41, pp. 1 60, 2023. [34] I. Loshchilov and F. Hutter, Decoupled weight decay regularization, ar Xiv preprint ar Xiv:1711.05101, 2017. A.1 Proof of Theorem 1 We first show how to compute p given the PDE residual R(m, u) = 0. Since u is uniquely determined by m, we can write u = u(m) so that R(m, u(m)) 0 holds for any m V in. Thus, for any ε > 0 and ψ V in, we have R(m + εψ, u(m + εψ)) = 0. 
Using the Taylor expansion we obtain ( m R(m, u(m)))εψ + ( u R(m, u(m)))δu 0, (15) where δu = u(m + εψ) u(m). Dividing both sides of Equation (15) by ε and letting ε approach 0 yields ( m R(m, u(m)))ψ + ( u R(m, u(m)))du(m; ψ) = 0. For ease of notation, we write m R = m R(m, u(m)), u R = u R(m, u(m)). Then p is the solution to the linear PDE ( m R)ψ + ( u R)p = 0. (16) We solve Equation (16) via its weak form ( m R)ψ, v + ( u R)p, v = 0, (17) where v is a test function in V out. Next we show how to compute d u(m; p). By Equation (16), we have du(m; ψ) = ( u R) 1( m R)ψ. Thus, w := d u(m; p) = ( ( u R) 1( m R)) p = ( m R) ( u R) p. (18) Let q := ( u R) p. We can solve for q via the weak form ( u R) q, v = p, v , or equivalently, q, ( u R)v = p, v , (19) where v is a test function in V out. By Equation (18), we have w = ( m R) q. For any test function v V in, it holds that w, v = ( m R) q, v = q, ( m R)v . (20) Note that we do not need to solve for w explicitly; we only compute w, v with v as the finite element basis functions ϕin 1 , . . . , ϕin Nin h. The cost of computing the right hand side of Equation (20) arises from evaluating the directional derivative and its inner product with the finite element basis functions. Remark. By Equation (9), we use Ngrad samples {(m(i), u(i))}Ngrad i=1 to compute the Monte Carlo estimate of the action of operator H on any function ψ V in, that is, i=1 d u(m(i); du(m(i); ψ)). (21) Remark. When using the double pass randomized algorithm to obtain the first r eigenpairs in Equation (10), we need to compute the action of H on 2(r + s) random functions in V in (e.g., their nodal values are sampled from the standard Gaussian distribution), where s N+ is typically a small oversampling parameter, often chosen between 5 and 20. To speed up the computation, we first compute the LU factorization of the matrices resulting from the discretization of the linear PDEs in Equation (17) and Equation (19). Then the action of d u(m(i); du(m(i); )) on these random functions can be efficiently computed via the forward and backward substitution. Furthermore, the computational time can be significantly reduced by parallelizing the computation of the average value in Equation (21) across multiple processors. A.2 Proof of the equivalence of Equation (6) and Equation (7) By the definition of C = (δI γ ) 2, Equation (6) is equivalent to ψ = λ(δI γ )2ψ. (22) We first compute (δI γ )ψ. To do this, let p = (δI γ )ψ and multiply both sides of this equation by a test function v and integrate Z Ω p(x)v(x) dx = Z Ω (δI γ )ψ(x)v(x) dx Ω ψ(x)v(x) dx + γ Z Ω ψ(x), v(x) dx, (23) where, in the second equality, we use integration by parts and the assumption that the test function vanishes on the boundary. Then we substitute v with all of the finite element basis functions ϕin 1 , . . . , ϕin Nin h and collect the corresponding linear equations for the nodal values of p Ainψ = M inp, (24) where the (i, j)-entries of Ain and M in are given by Ain ij = δ ϕin j , ϕin i + γ ϕin j , ϕin i , M in ij = ϕin j , ϕin i . By Equation (22), function p satisfies ψ = λ(δI γ )p. In the same manner we can see that λAinp = M inψ. (25) Note that both matrices M in and Ain are symmetric positive definite and thus nonsingular. Combining Equation (24) with Equation (25) yields λAin(M in) 1Ainψ = M inψ, (26) or equivalently, M in(Ain) 1M in(Ain) 1M inψ = λM inψ. (27) A.3 Proof of the equivalence of Equation (8) and Equation (10) By Equation (8), for any test function v V in h , it holds that Hψ, v = λC 1ψ, v . 
(28) In particular, for any m ν(m), as we let v go through all of the finite element basis functions ϕin i V in h , we can show that d u(m; du(m; ψ)), ϕin 1 ... d u(m; du(m; ψ)), ϕin Nin h = ( mu)T M out( mu)ψ. (29) Indeed, by the definition of Gateaux derivative, we have du(m; ψ) = lim ε 0 u(m + εψ) u(m) ui(m + εψ) ui(m) ε ϕout i (x) ui(m1 + εψ1, , m Nin h + εψNin h ) ui(m1, , m Nin h ) ε ϕout i (x) m1 ψ1 + + ui m N in h ψNin h )ϕout i (x) = ϕout(x)( mu)ψ, where ϕout(x) = (ϕout 1 (x), . . . , ϕNout h (x)) are the finite element basis functions of output function space V out h . Then for any test function v V in h , it holds that d u(m; p), v = p, du(m; v) (by the definition of adjoint operator) = ϕout(x)( mu)ψ, ϕout(x)( mu)v (by Equation (30)) = v T ( mu)T M out( mu)ψ, where M out is the mass matrix of output function space V out h , i.e., M out ij = ϕout j , ϕout i for 1 i, j N out h . Note that if we replace v by the i-th finite element basis functions ϕin i , then v becomes the standard unit vector ei RNin h (with the k-th entry one and all others zero). Thus, d u(m; p), ϕin i = e T i ( mu)T M out( mu)ψ, 1 i N in h . Concatenating all the above equations yields d u(m; p), ϕin 1 ... d u(m; p), ϕin Nin h = ( mu)T M out( mu)ψ. Next, we prove that λC 1ψ, ϕin 1 ... λC 1ψ, ϕin Nin h = λAin(M in) 1Ainψ. (31) Indeed, if we let w = λC 1ψ, then similar to the argument in Appendix A.2, we have λAin(M in) 1Ainψ = M inw, w, ϕin i = e T i M inw, 1 i N in h . w, ϕin 1 ... w, ϕin N in h = IM inw = λAin(M in) 1Ainψ. Combining (28), (29) and (31) yields Em ν(m)[( mu)T M out( mu)ψ] = λAin(M in) 1Ainψ. B Experimental details B.1 Governing equations Hyperelasticity equation. We follow the problem setup in [30]. Write X (instead of x) for a material point in the domain Ωand u = u(X) : R2 R2 for the displacement of the material point. Under the influence of internal and/or external forces, the material point is mapped to a spatial point x = x(X) = X + u(X) : R2 R2. Let F = Xx = I + Xu : R2 R2 2 denote the deformation gradient. For a hyperelasticity material, the internal forces can be derived from a strain energy density W(X, C) = µ(X) 2 (tr(C) 3) + λ(X) 2 (ln(J))2 µ(X) ln(J). Here, C = F T F is the right Cauchy-Green strain tensor, tr(C) is the trace of matrix C, J is the determinant of matrix F, and µ(X), λ(X) : R2 R are the Lamé parameters which we assume to be related to Young s modulus of elasticity E(X) : R2 R and Poisson ratio ν R µ(X) = E(X) 2(1 + ν), λ(X) = νE(X) (1 + ν)(1 2ν). We assume the randomness comes from the Young s modulus E(X) = em(X) + 1. Let S = 2 W C denote the second Piola-Kirchhoff stress tensor. We consider the case where the left boundary of the material is fixed, and the right boundary is subjected to stretching by an external force t = t(X) : R2 R2. The strong form of the steady state PDE can be written as X (FS) = 0, X Ω, u = 0, X Γleft, FS n = 0, X Γtop Γbottom, FS n = t, X Γright, where Γleft, Γright, Γtop and Γbottom denote the left, right, top, and bottom boundary of the material domain Ω, respectively, and n is the unit outward normal vector on the boundary. Our goal is to learn the operator that maps the parameter m to the displacement u. For demonstration, we choose m = 0, δ = 0.4, γ = 0.04, Poisson ratio ν = 0.4, and the external force t(X) = 0.06 exp( 0.25|X2 0.5|2), 0.03(1 + 0.1X2) T . 
In practice, we solve the PDE by first formulating the energy f W in the weak form Γright t, u ds and then solving for u that satisfies the stationary condition, that is, the equation df W(u; v) = 0, holds for any test function v in the state space. See Figure 3 for the visualization for one parametersolution pair of the hyperelasticity equation. 0.0 0.2 0.4 0.6 0.8 1.0 Parameter m 0.0 0.2 0.4 0.6 0.8 1.0 X1 Spatial point x = X + u(X) Figure 3: Visualization of one parameter-solution pair of hyperelasticity equation. The color of the output indicates the magnitude of the displacement u (which maps from domain Ωto R2) instead of its componentwise function u1 or u2. The skewed square shows locations of any domain point after deformation X x. See Figures 13 and 14 for u1 and u2. Navier Stokes equations. Let u = u(x) R2 and p = p(x) R denote the velocity and pressure at point x Ω= (0, 1)2. The strong form of the PDE can be written as em u + (u )u + p = 0, x Ω, u = 0, x Ω, u = (1, 0)T , x Γtop, u = (0, 0)T , x Γ \ Γtop, where Γtop and Γ denote the left and whole boundary of the cavity domain Ω, respectively. Here, we assume that the randomness arises from the viscosity term em. Our goal is learn the operator that maps the parameter m to the velocity u. For demonstration, we choose m = 6.908 (e m 103, thus the viscosity term dominates), δ = 0.6, and γ = 0.06. See Figure 4 for the visualization for one parameter-solution pair of the Navier Stokes equations. 0.0 0.2 0.4 0.6 0.8 1.0 Parameter m 0.0 0.2 0.4 0.6 0.8 1.0 x Velocity x-component 0.0 0.2 0.4 0.6 0.8 1.0 x Velocity y-component Figure 4: Visualization of one parameter-solution pair of Navier Stokes equations. B.2 Data generation For all PDEs in this work, we use the class dolfin.Unit Square Mesh to create a triangular mesh of the 2D unit square with 64 cells in horizontal direction and 64 cells in vertical direction. For the Darcy flow equation and hyperelasticity equation, we set the direction of the diagonals as right , while for the Navier Stokes equation, we set the direction of the diagonals as crossed . See Figure 5 for a visualization of the unit square mesh with a 10 by 10 resolution. Figure 5: Visualization of the 10 by 10 unit square mesh. Left: diagonal= right ; Right: diagonal= crossed We use the class dolfin.Function Space or dolfin.Vector Function Space to create the finite element function space of input function and output function. For the finite element basis functions, we consider the Continuous Galerkin (CG) family (or the standard Lagrange family) with degree 1 or 2. See Table 6 for details. Table 6: Configurations of data generation for different datasets Dataset Configurations Mesh ϕin i ϕout i (Ntrain, Ntest) r Ngrad s Hyperelasticity 64 64, right CG (1D, deg=1) CG (2D, deg=1) (1500, 500) 16 16 10 Navier Stokes 64 64, crossed CG (1D, deg=1) u: CG (2D, deg=2), p: CG (1D, deg=1) (1500, 500) 16 16 10 We generate Ntrain = 1500 and Ntest = 500 input-output pairs (m(i), u(i)) for training and testing, respectively. We compute first r = 16 KLE basis and ASM basis using double pass randomized algorithm with an oversampling parameter s of 10. In the computation of ASM basis, we use Ngrad = 16 samples for the Monte Carlo estimate of the action of operator H in Equation (21). 
In our method, we formulate the different labels into arrays with the shape as follows Evaluation labels: (Ntrain, Nx, Nu) Derivative m labels: (Ntrain, Nx, r, Nu) Recall that Ntrain is the number of functions used for training, Nx is the number of nodes of the mesh, Nu is the dimension of output function, r is the number of reduced basis, and d is the dimension of the domain. B.3 Computation of derivative labels and outputs B.3.1 Derivative labels p := du(m; ψ) as ground truth Since the PDE residual R(m, u(m)) 0 holds for any m V in h , we have R(m + εψ, u(m + εψ)) = 0, ε > 0, ψ V in h . By the Taylor expansion and R(m, u(m)) = 0, we obtain ( m R(m, u(m)))εψ + ( u R(m, u(m)))δu 0, (32) where δu = u(m + εψ) u(m). Dividing both sides of Eq. (32) by ε and letting ε approach 0 yields ( m R)ψ + ( u R)p = 0, (33) where p := du(m; ψ). We solve Eq. (33) for p via its weak form ( m R)ψ, v + ( u R)p, v = 0, where v is a test function in V out h . Example. Consider the (nonlinear) diffusion-reaction equation (em u) + u3 = 1, x Ω, u(x) = 0, x Ω. R = (em u) + u3 1 ( m R)ψ = (emψ u) ( u R)p = (em p) + 3u2p Thus, p satisfies the linear PDE emψ u, v + em p, v + 3u2p, v = 0. Using FEni CS, we can easily derive Gâteaux derivative via automatic symbolic differentiation instead of by hand. In this case, the Python code for representing the weak form of the residual R, v and Gâteaux derivatives ( m R)ψ, v and ( u R)p, v can be written as import dolfin as dl R=(dl.inner(dl.exp(m)*dl.grad(u), dl.grad(v))*dl.dx +(u**3-dl.Constant(1.0))*v*dl.dx) dl.derivative(R,m,psi) dl.derivative(R,u,p) B.3.2 Derivative outputs of neural networks DE-Deep ONet. For notation simplicity, we illustrate the case where the input reduced basis is ASM basis and the output function is real-valued. The output of the model is given by ˆu(m; θ)(x) = b((Ψin)T C 1m; θb), t(x; θt) + θbias, where Ψin = (ψin 1 | |ψin rin) RN in h rin are the nodal values of input (ASM) reduced basis functions, C 1 is the inverse of the covariance matrix of the Gaussian random field ν where parameter m is sampled from (Recall that the ASM basis are orthonormal in the inner product with weight matrix C 1.) Thus, by the chain rule of derivative, for any test direction ψtest, one has mˆu(m; θ)(x)ψtest = e m b( em), t(x; θt) (Ψin)T C 1ψtest, where em = (Ψin)T C 1m. The Jacobian matrix e m b( em), t(x; θt) can be efficiently computed using, e.g., torch.func.jacrev, and further parallelized with torch.vmap. If ψtest is in Ψin during model training, we can see that (Ψin)T C 1ψtest becomes a unit vector which frees us the need to compute it, otherwise in the model inference stage when ψtest is (the nodal values of) a random function sampled from ν, we compute T = (Ψin)T C 1 and then Tψtest. DINO. The output of the model is given by ˆu(m; θ)(x) = Φout(x)Ψoutfθ((Ψin)T C 1m), where Φout(x) = (ϕout 1 (x), . . . , ϕout N out h (x)) R1 Nout h denotes the output finite element basis functions evaluated on point x Ω, Ψout = (ψout 1 | |ψout rout) RNout h rout are the nodal values of output (POD) reduced basis functions, similarly for Ψin denoting the input (ASM) reduced basis, and C 1 is the inverse of the covariance matrix of the Gaussian random field where parameter m is sampled from. Thus, by the chain rule of derivative, for any test direction ψtest, one has mˆu(m; θ)(x)ψtest = Φout(x)Ψout e mfθ( em)(Ψin)T C 1ψtest, where em = (Ψin)T C 1m. 
DeepONet & FNO. Both models receive the full high-dimensional parameter m, so we compute the directional derivative $\nabla_m \hat{u}(m; \theta)\, \psi_{\mathrm{test}}$ using a Jacobian-vector product (e.g., torch.func.jvp) instead of the full Jacobian used for DE-DeepONet and DINO; a short sketch is given after the training tables below. For the coordinates x given as additional inputs, we pad the tensor ψtest with zeros to match the dimension of the input tensor $(m, \{x^{(j)}\}_{j=1}^{N_x})$.

B.4 Training details

For the DeepONet, we parameterize the branch net using a CNN and the trunk net using a ResNet. For the FNO, we use the package neuraloperator. For the DINO, we use 16 ASM basis functions for the input dimension reduction and 64 POD basis functions for the output dimension reduction, and parameterize the neural network using a ResNet. For the DE-DeepONet, both the branch net and the trunk net are parameterized using ResNets. We train each model for 32768 iterations (with the same batch size of 8) using the AdamW optimizer [34] and a StepLR learning rate scheduler (the learning rate scheduler is disabled for DE-DeepONet). See Tables 7 to 10 for details. When the loss function comprises two terms, we apply the self-adaptive learning rate annealing algorithm from [23], with an update frequency of 100 and a moving average parameter of 0.9, to automatically adjust the loss weights {λi}, i = 1, 2, in Equation (5). Additionally, we standardize the inputs and labels before training.

Table 7: Training details for DeepONet.
- Branch net (both datasets): CNN, 6 hidden layers, 256 output dim, ReLU.
- Trunk net (both datasets): ResNet, 3 hidden layers, 512 output dim, 512 hidden dim, ReLU.
- Initialization (both datasets): Kaiming Uniform.
- AdamW (lr, weight decay): ($10^{-3}$, $10^{-6}$) for Hyperelasticity; ($10^{-3}$, $10^{-5}$) for Navier-Stokes.
- StepLR (gamma, step size): (0.6, 10) for Hyperelasticity; (0.7, 10) for Navier-Stokes.
- Number of Fourier features (both datasets): 64.
- Fourier feature scale σ: 0.5 for Hyperelasticity; 1.0 for Navier-Stokes.

Table 8: Training details for FNO & DE-FNO.
- Architecture (both datasets): number of modes (32, 32); in channels 3; out channels 2; hidden channels 64; number of layers 4; lifting channel ratio 2; projection channel ratio 2; activation function GELU.
- AdamW (lr, weight decay) (both datasets): ($5 \times 10^{-3}$, $10^{-4}$).
- StepLR (gamma, step size): (0.9, 25) for Hyperelasticity; (0.9, 50) for Navier-Stokes.

Table 9: Training details for DINO.
- Neural network: ResNet, 3 hidden layers, 64 output dim, ELU; 128 hidden dim for Hyperelasticity and 256 hidden dim for Navier-Stokes.
- Initialization (both datasets): Kaiming Normal.
- AdamW (lr, weight decay): ($5 \times 10^{-3}$, 0.0) for Hyperelasticity; ($5 \times 10^{-3}$, $10^{-12}$) for Navier-Stokes.
- StepLR (gamma, step size) (both datasets): (0.5, 50).

Table 10: Training details for DE-DeepONet.
- Branch net: ResNet, 3 hidden layers, ELU; 128 output dim and 128 hidden dim for Hyperelasticity; 256 output dim and 256 hidden dim for Navier-Stokes.
- Trunk net: ResNet, 3 hidden layers, ReLU; 256 output dim and 256 hidden dim for Hyperelasticity; 512 output dim and 512 hidden dim for Navier-Stokes.
- Initialization (both datasets): Kaiming Uniform.
- AdamW (lr, weight decay) (both datasets): ($10^{-3}$, $10^{-11}$).
- Learning rate scheduler (both datasets): disabled.
- Number of Fourier features (both datasets): 64; Fourier feature scale σ: 0.5 for Hyperelasticity and 1.0 for Navier-Stokes.
- $N^{\mathrm{batch}}_x$ (= αNx) (both datasets): 422 (≈ 0.1 × 65²).
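As referenced in the DeepONet & FNO paragraph of Appendix B.3.2 above, the directional derivative for models that take the full parameter vector as input can be formed with a single Jacobian-vector product. The following is a minimal sketch under assumed shapes and a placeholder model; it is not the exact training code.

```python
import torch
from torch.func import jvp

def full_input_directional_derivative(model, m, coords, psi_test):
    """Sketch of grad_m u_hat(m; theta) psi_test for a model that takes the
    full nodal parameter vector m together with the coordinates as inputs.

    model    : placeholder callable (m, coords) -> predicted solution values
    m        : full parameter vector, shape (N_in,)
    coords   : query coordinates, shape (N_x, d)
    psi_test : nodal values of the test direction, shape (N_in,)
    """
    primals = (m, coords)
    # Zero tangent for the coordinates: only m is perturbed, which mirrors the
    # zero-padding of psi_test described in Appendix B.3.2.
    tangents = (psi_test, torch.zeros_like(coords))
    _, du = jvp(model, primals, tangents)
    return du  # same shape as the model output
```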
Table 11: Number of trainable parameters in each model.
- Hyperelasticity: DeepONet 4.21 M; FNO & DE-FNO 17.88 M; DINO 0.04 M; DE-DeepONet 0.27 M.
- Navier-Stokes: DeepONet 4.21 M; FNO & DE-FNO 17.88 M; DINO 0.15 M; DE-DeepONet 1.02 M.

B.5 Convergence plots

Based on the training time (seconds/iteration) of each model in Table 4, we obtain the convergence plots when the number of training samples is either limited (Ntrain = 16, 64) or sufficient (Ntrain = 256, 1024), shown in Figures 6 to 9 for the hyperelasticity equation and in Figures 2, 10, and 11 for the Navier-Stokes equations. Each figure shows, from left to right, the relative error in the L2(Ω) norm, the relative error in the H1(Ω) norm, and the relative error of the directional derivatives in the Frobenius norm ("Fro norm (dm)"), plotted against training time (seconds) for FNO, DE-FNO (ASM), DE-FNO (KLE), DE-FNO (Random), DeepONet, DeepONet (ASM), DeepONet (KLE), DE-DeepONet (ASM), DE-DeepONet (KLE), and DINO.

Figure 6: Mean relative errors (± standard deviation) over 5 trials versus model training time for the hyperelasticity equation when the number of training samples is 16.

Figure 7: Mean relative errors (± standard deviation) over 5 trials versus model training time for the hyperelasticity equation when the number of training samples is 64.

Figure 8: Mean relative errors (± standard deviation) over 5 trials versus model training time for the hyperelasticity equation when the number of training samples is 256.

Figure 9: Mean relative errors (± standard deviation) over 5 trials versus model training time for the hyperelasticity equation when the number of training samples is 1024.

Figure 10: Mean relative errors (± standard deviation) over 5 trials versus model training time for the Navier-Stokes equations when the number of training samples is 64.

Figure 11: Mean relative errors (± standard deviation) over 5 trials versus model training time for the Navier-Stokes equations when the number of training samples is 1024.
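The relative errors reported in these plots can be computed with standard dolfin norm utilities; the following is a minimal sketch (not the evaluation code used in the paper), assuming the prediction has been interpolated into the same finite element space as the high-fidelity solution. The relative error of the directional derivatives in the Frobenius norm is computed analogously from the derivative labels.

```python
import dolfin as dl

def relative_errors(u_true, u_pred):
    """Relative L2(Omega) and H1(Omega) errors between two dolfin Functions
    defined on the same finite element space (illustrative helper)."""
    rel_L2 = dl.errornorm(u_true, u_pred, "L2") / dl.norm(u_true, "L2")
    rel_H1 = dl.errornorm(u_true, u_pred, "H1") / dl.norm(u_true, "H1")
    return rel_L2, rel_H1
```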
B.6 Output reconstruction error

To measure the error induced by the projection, we define the output reconstruction error as
$$\frac{\| u(P_r m^{(i)}) - u(m^{(i)}) \|_{L^2}}{\| u(m^{(i)}) \|_{L^2}},$$
where $P_r$ is the rank-r linear projector. We plot the output reconstruction error (relative L2 error) against the number of reduced basis vectors r, using the KLE basis and the ASM basis, in Figure 12. We can see that using the ASM basis results in a lower output reconstruction error than using the KLE basis.

Figure 12: Output reconstruction error versus the number of reduced basis vectors using the KLE and ASM bases. Left: Hyperelasticity; Right: Navier-Stokes.
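As an illustration (not the actual data-generation code), the rank-r projection and the reconstruction error above could be evaluated as follows, assuming the reduced basis Ψ is orthonormal in the $C^{-1}$-weighted inner product (as stated for the ASM basis in Appendix B.3.2) so that $P_r = \Psi \Psi^T C^{-1}$; solve_pde and l2_norm are placeholder callables.

```python
import numpy as np

def output_reconstruction_error(m, Psi, C_inv, solve_pde, l2_norm):
    """Relative L2 error ||u(P_r m) - u(m)|| / ||u(m)|| for one parameter sample.

    m         : nodal values of the parameter, shape (N_in,)
    Psi       : reduced basis (columns), shape (N_in, r), assumed C^{-1}-orthonormal
    C_inv     : inverse covariance matrix (dense here for simplicity), shape (N_in, N_in)
    solve_pde : placeholder for the high-fidelity PDE solve, mapping nodal m to nodal u
    l2_norm   : placeholder for the (finite element) L2 norm of nodal values
    """
    Pr_m = Psi @ (Psi.T @ (C_inv @ m))   # rank-r projection of the parameter
    u_full = solve_pde(m)
    u_proj = solve_pde(Pr_m)
    return l2_norm(u_proj - u_full) / l2_norm(u_full)
```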
B.7 Visualization of the ground truth and prediction

We present comparisons of the ground truth solution u, the model prediction û obtained with different methods, and the absolute value of their difference |u(x) - û(x)| in Figures 13 and 14 for the hyperelasticity equation, and in Figures 17 and 18 for the Navier-Stokes equations. In addition, we present comparisons of the ground truth of the derivative of u with respect to m in the direction ω1, denoted du(m; ω1), the model prediction dû(m; ω1) obtained with different methods, and the absolute value of their difference |du(m; ω1) - dû(m; ω1)| in Figures 15 and 16 for the hyperelasticity equation, and in Figures 19 and 20 for the Navier-Stokes equations.

Figure 13: Hyperelasticity. Comparison of the predictions of u1 with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 14: Hyperelasticity. Comparison of the predictions of u2 with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 15: Hyperelasticity. Comparison of the predictions of the directional derivative du1(m; ω1) with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 16: Hyperelasticity. Comparison of the predictions of the directional derivative du2(m; ω1) with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 17: Navier-Stokes. Comparison of the predictions of velocity-x with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 18: Navier-Stokes. Comparison of the predictions of velocity-y with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 19: Navier-Stokes. Comparison of the predictions of the directional derivative du1(m; ω1) with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

Figure 20: Navier-Stokes. Comparison of the predictions of the directional derivative du2(m; ω1) with (16, 64, 256, 1024) training samples using different methods. Top to bottom: DeepONet, FNO, DE-FNO (Random), DINO, DE-DeepONet (KLE), DE-DeepONet (ASM).

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We state our paper's contributions in the abstract and the second paragraph of the introduction, and its scope in the first and third paragraphs of the introduction.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations in the last section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We provide one theorem with clearly stated assumptions and proof.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We discuss all details (including data generation and model training) that help reproduce the experiments in the appendix.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide the code necessary to generate the data and to train and test the different models considered in the paper.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We discuss the experimental details in Appendix B.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We run five trials with different random seeds for each method and report the mean value and standard deviation of the corresponding relative errors.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We provide the machine information and time of execution in Tables 4 and 6.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have reviewed the Code of Ethics and checked that the research follows it.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work lies in the field of scientific machine learning. We do not foresee any direct societal impact at this time.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We think our research poses no such risk.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We cite the corresponding papers and respect the licenses when using the packages FEniCS, hIPPYlib, and neuraloperator to conduct the experiments.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: We do not release new assets at the moment.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our research is in the field of scientific machine learning and does not involve crowdsourcing or human subjects.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our research is in the field of scientific machine learning and does not involve crowdsourcing or human subjects.