# separable_operator_networks__271c4c69.pdf

Published in Transactions on Machine Learning Research (12/2024)

Separable Operator Networks

Xinling Yu xyu644@ucsb.edu, University of California, Santa Barbara
Sean Hooten sean.hooten@hpe.com, Hewlett Packard Labs, Hewlett Packard Enterprise
Ziyue Liu ziyueliu@ucsb.edu, University of California, Santa Barbara
Yequan Zhao yequan_zhao@ucsb.edu, University of California, Santa Barbara
Marco Fiorentino marco.fiorentino@hpe.com, Hewlett Packard Labs, Hewlett Packard Enterprise
Thomas Van Vaerenbergh thomas.van-vaerenbergh@hpe.com, Hewlett Packard Labs, HPE Belgium
Zheng Zhang zhengzhang@ece.ucsb.edu, University of California, Santa Barbara

Contributed equally. Reviewed on OpenReview: https://openreview.net/forum?id=RYtJmFDAxv

Abstract

Operator learning has become a powerful tool in machine learning for modeling complex physical systems governed by partial differential equations (PDEs). Although Deep Operator Networks (DeepONet) show promise, they require extensive data acquisition. Physics-informed DeepONets (PI-DeepONet) mitigate data scarcity but suffer from inefficient training processes. We introduce Separable Operator Networks (SepONet), a novel framework that significantly enhances the efficiency of physics-informed operator learning. SepONet uses independent trunk networks to learn basis functions separately for different coordinate axes, enabling faster and more memory-efficient training via forward-mode automatic differentiation. We provide a universal approximation theorem for SepONet proving the existence of a separable approximation to any nonlinear continuous operator. Then, we comprehensively benchmark its representational capacity and computational performance against PI-DeepONet. Our results demonstrate SepONet's superior performance across various nonlinear and inseparable PDEs, with SepONet's advantages increasing with problem complexity, dimension, and scale. For 1D time-dependent PDEs, SepONet achieves up to 112× faster training and 82% reduction in GPU memory usage compared to PI-DeepONet, while maintaining comparable accuracy. For the 2D time-dependent nonlinear diffusion equation, SepONet efficiently handles the complexity, achieving a 6.44% mean relative ℓ2 test error, while PI-DeepONet fails due to memory constraints. This work paves the way for extreme-scale learning of continuous mappings between infinite-dimensional function spaces. Open source code is available at https://github.com/HewlettPackard/separable-operator-networks.

Figure 1: Separable operator network (SepONet) architecture for a 2D problem instance. A coordinate grid of collocation points (x(i), y(j)) can be evaluated efficiently by separating the coordinate axes, feeding them through independent trunk networks, and combining the outputs by outer product to obtain multiple basis function maps. Meanwhile, the branch network processes input functions and outputs coefficients, which are then used to scale and combine the trunk network basis functions by product and sum. Spatiotemporal derivatives of the output predictions are obtained efficiently by forward-mode automatic differentiation due to the independence of trunk networks along each coordinate axis.

1 Introduction

Operator learning, which aims to learn mappings between infinite-dimensional function spaces, has gained significant attention in scientific machine learning thanks to its ability to model complex dynamics in physical systems.
This approach has been successfully applied to a wide range of applications, including climate modeling (Kashinath et al., 2021; Pathak et al., 2022), multiphysics simulation (Liu et al., 2023; Cai et al., 2021; Mao et al., 2021; Lin et al., 2021; Kontolati et al., 2024), inverse design (Lu et al., 2022b; Gu et al., 2022) and more (Shukla et al., 2023; Gupta & Brandstetter, 2022). Various operator learning algorithms (Lu et al., 2021; Li et al., 2020b;a; Ovadia et al., 2023; Wen et al., 2022; Ashiqur Rahman et al., 2022) have been developed to address these applications, with Deep Operator Networks (Deep ONets) (Lu et al., 2021) being particularly notable due to their universal approximation guarantee for operators (Chen & Chen, 1995; Lanthaler et al., 2022; Gopalani et al.) and robustness (Lu et al., 2022a). To approximate the function operator G : U S, Deep ONets are trained in a supervised manner using a dataset of Nf pair functions u(i), s(i) Nf i=1, where in the context of parametric partial differential equations (PDEs), each u(i) represents a PDE configuration function and each s(i) represents the corresponding solution. Unlike traditional numerical methods which require repeated simulations for each different PDE configuration, once well trained, a Deep ONet allows for efficient parallel inference in abstract infinite-dimensional function spaces. Given new PDE configurations, it can immediately provide the corresponding solutions. However, this advantage comes with a significant challenge: to achieve satisfactory generalization error, the number of required input-output function pairs grows quadratically (Lu et al., 2021; Lanthaler et al., 2022; Liu et al., 2022a; Gopalani et al.). Generating enough function pairs can be computationally expensive or even impractical in some applications, creating a bottleneck in the effective deployment of Deep ONets. Physics-informed deep operator networks (PI-Deep ONet) (Wang et al., 2021b) have been introduced as a solution to the costly data acquisition problem. Inspired by physics-informed neural networks (PINNs) (Raissi et al., 2019), PI-Deep ONet learns operators by constraining the Deep ONet outputs to approximately satisfy the underlying governing PDE system parameterized by the input function u. This is achieved by penalizing the physics loss (including PDE residual loss, initial loss and boundary loss), thus eliminating Published in Transactions on Machine Learning Research (12/2024) the need for ground-truth output functions s. However, PI-Deep ONet shares the same disadvantage as PINNs in that the training process is both memory-intensive and time-consuming (He et al., 2023; Cho et al., 2024). This inefficiency, common to both PI-Deep ONet and PINNs, arises from the need to compute high-order derivatives of the PDE predictions with respect to numerous collocation points during physics loss optimization. This computation typically relies on reverse-mode automatic differentiation (Baydin et al., 2018), involving the backpropagation of physics loss through the computational graph to update model parameters. The inefficiency is even more pronounced for PI-Deep ONet, as it requires evaluating the physics loss across multiple PDE configurations. While numerous studies (He et al., 2023; Zhao et al., 2023; Hu et al., 2024; Liu et al., 2022b; Cho et al., 2024) have proposed methods to enhance PINN training efficiency, there has been limited research focused on improving the training efficiency of PI-Deep ONet specifically. 
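For illustration, the sketch below shows the cost pattern described above: evaluating a PINN-style residual requires nested reverse-mode differentiation of the network output with respect to each collocation point, so the work and stored computational graph grow with the number of points. The toy solution `s` and the heat-equation-like residual are illustrative stand-ins, not the actual PI-DeepONet implementation.

```python
import jax
import jax.numpy as jnp

# Toy scalar function standing in for a PINN / PI-DeepONet output s(x, t).
def s(x, t):
    return jnp.sin(jnp.pi * x) * jnp.exp(-t) + 0.1 * x * t**2

# Second spatial derivative at a single point via nested reverse-mode AD.
s_x = jax.grad(s, argnums=0)
s_xx = jax.grad(s_x, argnums=0)

# Residual of a heat-equation-like operator N(s) = s_t - s_xx at one point.
def residual(x, t):
    return jax.grad(s, argnums=1)(x, t) - s_xx(x, t)

# The physics loss needs this residual at every collocation point, so the
# differentiation work grows with the number of points M (and, for
# PI-DeepONet, additionally with the number of input functions Nf).
xs = jnp.linspace(0.0, 1.0, 128)
ts = jnp.linspace(0.0, 1.0, 128)
X, T = jnp.meshgrid(xs, ts)               # M = 128 * 128 collocation points
res = jax.vmap(jax.vmap(residual))(X, T)  # derivatives taken point by point
loss = jnp.mean(res**2)
```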
To address the inefficiency in training PI-Deep ONet, we propose Separable Operator Networks (Sep ONet), inspired by the separation of variables technique in solving PDEs and recent work on separable PINN (Cho et al., 2024). Suppose we want to approximate a nonlinear continuous operator of d variables (ddimensional collocation points). When optimizing the physics loss at M collocation points, PI-Deep ONet always requires M inputs to evaluate PDE predictions and their derivatives, regardless of whether the points are on a regular grid or randomly sampled. This approach becomes inefficient when M is large. By contrast, Sep ONet uses d independent trunk networks to learn univariate basis functions separately for each variable, which can then be combined to obtain predictions in d dimensions. In particular, if M can be decomposed as N1 N2 Nd, where Nn is the number of points sampled along the n-th coordinate axis (e.g., 1024 = 16 8 8 for d = 3), Sep ONet requires only N1 + N2 + + Nd inputs to evaluate all M collocation points on a regular grid, resulting in a more efficient solution. The Sep ONet architecture for a d = 2 problem instance is shown in Figure 1. This simple yet effective modification enables fast training and memory-efficient implementation of Sep ONet by leveraging forward-mode automatic differentiation (Khan & Barton, 2015) to compute high-order derivatives of all N1 N2 Nd collocation points. It s important to note that factorizing functions in high-dimensional domains into multiple sub-functions defined over onedimensional domains has been explored in supervised operator learning methods (Tran et al., 2021; Li et al., 2024; Kossaifi et al., 2023). Working independently and concurrently, our work and that of Mandl et al. (2025) extended these ideas to physics-informed operator learning frameworks. While both approaches leverage separable structures to enhance computational efficiency, our method uniquely draws from classical separation of variables techniques in PDEs, leading to a distinct low-rank functional decomposition of the trunk network. Our key contributions are: 1. We introduce Sep ONet, a physics-informed operator learning framework that significantly enhances training time and GPU memory usage relative to PI-Deep ONet, enabling extreme-scale learning of continuous mappings between infinite-dimensional function spaces. 2. We provide a theoretical foundation for Sep ONet through the universal approximation theorem, proving its capability to approximate any nonlinear continuous operator with arbitrary accuracy. 3. We provide extensive benchmarks validating Sep ONet s representational capacity and computational performance relative to PI-Deep ONet on a range of 1D and 2D time-dependent nonlinear PDEs. Our findings reveal that scaling up Sep ONet in both number of functions and collocation points consistently improves its accuracy, while typically outperforming PI-Deep ONet. On 1D time-dependent PDEs, we achieve up to 112 training speed-up with minimal memory increase. Notably, at moderately large scales where training PI-Deep ONet exhausts 80GB of GPU memory, Sep ONet trains and operates efficiently with less than 1GB. Furthermore, we observe efficient scaling of Sep ONet with problem dimension, enabling accurate prediction of 2D time-dependent PDEs at scales where PI-Deep ONet fails. 2 Preliminaries 2.1 Operator Learning for Solving Parametric Partial Differential Equations Let X and Y be Banach spaces, with K X and K1 Y being compact sets. 
Consider a nonlinear continuous operator G : U → S, mapping functions from one infinite-dimensional space to another, where U ⊂ C(K) and S ⊂ C(K1). The goal of operator learning is to approximate the operator G using a model parameterized by θ, denoted as Gθ. Here, U and S represent spaces of functions where the input and output functions have dimensions du and ds, respectively. We focus on the scalar case where du = ds = 1 throughout most of this paper; however, the results apply to arbitrary du and ds. In the context of solving parametric partial differential equations (PDEs), consider PDEs of the form:

N(u, s) = 0, I(u, s) = 0, B(u, s) = 0, (1)

where N is a nonlinear differential operator, I and B represent the initial and boundary conditions, u ∈ U denotes the PDE configuration (source terms, coefficients, initial conditions, etc.), and s ∈ S denotes the corresponding PDE solution. Assuming that for any u ∈ U there exists a unique solution s ∈ S, we can define the solution operator G : U → S as s = G(u). A widely used framework for approximating such an operator G involves constructing Gθ through three maps (Lanthaler et al., 2022):

Gθ ≈ G := D ∘ A ∘ E. (2)

First, the encoder E : U → R^m maps an input function u ∈ U to a finite-dimensional feature representation. Next, the approximator A : R^m → R^r transforms this encoded data within the finite-dimensional space R^m to another finite-dimensional space R^r. Finally, the decoder D : R^r → S produces the output function s(y) = G(u)(y) for y ∈ K1.

2.2 Deep Operator Networks (DeepONet)

The original DeepONet formulation (Lu et al., 2021) can be analyzed through the 3-step approximation framework (2). The encoder E : U → R^m maps the input function u to its point-wise evaluations at m fixed sensors x1, x2, . . . , xm ∈ K, i.e., (u(x1), ..., u(xm)) = E(u). Two separate neural networks (usually multilayer perceptrons), the branch net and the trunk net, serve as the approximator and decoder, respectively. The branch net bψ : R^m → R^r, parameterized by ψ, processes (u(x1), . . . , u(xm)) to produce a feature embedding (β1, β2, . . . , βr). The trunk net tϕ : R^d → R^r, with parameters ϕ, takes a continuous coordinate y = (y1, ..., yd) as input and outputs a feature embedding (τ1, τ2, . . . , τr). The final DeepONet prediction for an input function u at a query point y is:

Gθ(u)(y) = ∑_{k=1}^{r} βk τk = bψ(E(u)) · tϕ(y), (3)

where · is the vector dot product and θ = (ψ, ϕ) represents all the trainable parameters in the branch and trunk nets. Despite DeepONet's remarkable success across a range of applications in multiphysics simulation (Cai et al., 2021; Mao et al., 2021; Lin et al., 2021), inverse design (Lu et al., 2022b), and carbon storage (Jiang et al., 2023), its supervised training process is highly dependent on the availability of training data, which can be costly. Indeed, the generalization error of DeepONets scales quadratically with the number of training input-output function pairs (Lu et al., 2021; Lanthaler et al., 2022; Liu et al., 2022a; Gopalani et al.). Generating a large amount of high-quality training data is expensive or even impractical in some applications. For example, in simulating high Reynolds number (Re) turbulent flow (Pope, 2001), accurate numerical simulations require a fine mesh, leading to a computational cost scaling with Re³ (Kochkov et al., 2021), making the generation of sufficiently large and diverse training datasets prohibitively expensive.
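As a concrete reference point for (3), here is a minimal JAX sketch of a DeepONet forward pass with MLP branch and trunk nets. The layer sizes, initialization scheme, and helper names (`mlp`, `init_mlp`, `deeponet_predict`) are illustrative assumptions, not the reference implementation.

```python
import jax
import jax.numpy as jnp

def mlp(params, x):
    """Simple MLP; params is a list of (W, b) pairs."""
    for W, b in params[:-1]:
        x = jnp.tanh(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def init_mlp(key, sizes):
    """Random (W, b) pairs for layer sizes [in, hidden, ..., out]."""
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (o, i)) / jnp.sqrt(i), jnp.zeros(o))
            for k, i, o in zip(keys, sizes[:-1], sizes[1:])]

def deeponet_predict(branch_params, trunk_params, u_sensors, y):
    """DeepONet prediction (3): dot product of branch and trunk embeddings.

    u_sensors: (m,) point-wise evaluations (u(x_1), ..., u(x_m)), i.e. E(u).
    y:         (d,) query coordinate.
    """
    beta = mlp(branch_params, u_sensors)   # (r,) coefficients
    tau = mlp(trunk_params, y)             # (r,) basis-function values
    return jnp.dot(beta, tau)              # scalar G_theta(u)(y)

key = jax.random.PRNGKey(0)
kb, kt = jax.random.split(key)
branch_params = init_mlp(kb, [100, 64, 64, 50])  # m = 100 sensors -> r = 50
trunk_params = init_mlp(kt, [2, 64, 64, 50])     # y = (x, t) in 2D -> r = 50

# Evaluating one input function on M collocation points needs M trunk passes:
# predict_many = jax.vmap(deeponet_predict, in_axes=(None, None, None, 0))
```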
To address the need for costly data acquisition, physics-informed deep operator networks (PI-DeepONet) (Wang et al., 2021b), inspired by physics-informed neural networks (PINNs) (Raissi et al., 2019), have been proposed to learn operators without relying on observed input-output function pairs. Given a dataset of Nf input training functions, Nr residual points, NI initial points, and Nb boundary points:

D = { {u^(i)}_{i=1}^{Nf}, {y_r^(j)}_{j=1}^{Nr}, {y_I^(j)}_{j=1}^{NI}, {y_b^(j)}_{j=1}^{Nb} },

PI-DeepONets are trained by minimizing an unsupervised physics loss:

L_physics(θ|D) = L_residual(θ|D) + λ_I L_initial(θ|D) + λ_b L_boundary(θ|D), (4)

L_residual(θ|D) = 1/(Nf Nr) ∑_{i=1}^{Nf} ∑_{j=1}^{Nr} | N(u^(i), Gθ(u^(i))(y_r^(j))) |²,
L_initial(θ|D) = 1/(Nf NI) ∑_{i=1}^{Nf} ∑_{j=1}^{NI} | I(u^(i), Gθ(u^(i))(y_I^(j))) |²,        (5)
L_boundary(θ|D) = 1/(Nf Nb) ∑_{i=1}^{Nf} ∑_{j=1}^{Nb} | B(u^(i), Gθ(u^(i))(y_b^(j))) |².

Here, λ_I and λ_b denote the weight coefficients for the different loss terms. However, as noted in the original PI-DeepONet paper (Wang et al., 2021b), the training process can be both memory-intensive and time-consuming. Similar to PINNs (Raissi et al., 2019), this inefficiency arises because optimizing the physics loss requires calculating high-order derivatives of the PDE solution with respect to numerous collocation points, typically achieved via reverse-mode automatic differentiation (Baydin et al., 2018). This process involves backpropagating the physics loss through the unrolled computational graph to update the model parameters. For PI-DeepONet, the inefficiency is even more pronounced, as the physics loss terms (equation (5)) must be evaluated across multiple PDE configurations. Although various works (Chiu et al., 2022; He et al., 2023; Cho et al., 2024) have proposed different methods to improve the training efficiency of PINNs, little research has focused on enhancing the training efficiency of PI-DeepONet. We propose to address this inefficiency through a separation of input variables.

2.3 Separation of Variables

The method of separation of variables seeks solutions to PDEs of the form s(y) = T(t) Y1(y1) ⋯ Yd(yd) for an input point y = (t, y1, . . . , yd) and univariate functions T, Y1, . . . , Yd. Suppose we have a linear PDE system

M[t] s(y) = L1[y1] s(y) + ⋯ + Ld[yd] s(y), (6)

where M[t] = d/dt + h(t) is a first-order differential operator in t, and L1[y1], ..., Ld[yd] are linear second-order ordinary differential operators in their respective variables y1, ..., yd only. Furthermore, assume we are provided Robin boundary conditions in each variable and a separable initial condition s(t = 0, y1, . . . , yd) = ∏_{n=1}^{d} ϕn(yn) for functions ϕn(yn) that satisfy the boundary conditions. Then, leveraging Sturm-Liouville theory and some massaging, the solution to this problem can be written as

s(y) = s(t, y1, . . . , yd) = ∑_k A_k T^k(t) ∏_{n=1}^{d} Y_n^k(yn), (7)

where k is a lumped index that counts over the infinitely many eigenfunctions of each Ln operator (potentially with repeats). For example, given n ∈ {1, ..., d}, Ln Y_n^k(yn) = λ_n^k Y_n^k(yn) for eigenvalue λ_n^k ∈ R. T^k(t) depends on all the eigenvalues λ_n^k corresponding to index k, and A_k ∈ R is a coefficient determined by the initial condition. More details can be found in Appendix C. The method of separation of variables applied to a linear heat equation example can be found in Appendix D.2. One may notice the resemblance between the form of the DeepONet prediction (3) and (7), provided βk = A_k and τk = T^k(t) ∏_{i=1}^{d} Y_i^k(yi), with appropriately ordered k.
We leverage this similarity explicitly in the construction of SepONet below.

3 Separable Operator Networks (SepONet)

Inspired by the method of separation of variables (7) and recent work on separable PINN (Cho et al., 2024), we propose using separable operator networks (SepONet) to learn basis functions separately for different coordinate axes. SepONet approximates the solution operator of a PDE system by, for a given point y = (y1, . . . , yd),

Gθ(u)(y1, . . . , yd) = ∑_{k=1}^{r} βk ∏_{n=1}^{d} τ_{n,k} = bψ(E(u)) · ( t¹_{ϕ1}(y1) ⊙ t²_{ϕ2}(y2) ⊙ ⋯ ⊙ t^d_{ϕd}(yd) ), (8)

where ⊙ is the Hadamard (element-wise) vector product and · is the vector dot product. Here, βk = bψ(E(u))_k is the k-th output of the branch net, as in DeepONet. However, unlike DeepONet, which employs a single trunk net that processes each collocation point y individually, SepONet uses d independent trunk nets, t^n_{ϕn} : R → R^r for n = 1, . . . , d. In particular, τ_{n,k} = t^n_{ϕn}(yn)_k denotes the k-th output of the n-th trunk net. Importantly, the parameters ϕn of the n-th trunk net are independent of all other trunk net parameter sets. Viewed through the 3-step approximation framework (2), SepONet and DeepONet have identical encoder and approximator but different decoders. Equation (8) can be understood as a low-rank approximation of the solution operator obtained by truncating the basis function series (represented by the output shape of the trunk nets) at a maximal number of ranks r.

SepONet not only enjoys the advantage that basis functions for different variables with potentially different scales can be learned more easily and efficiently, but also allows for fast and efficient training by leveraging forward-mode automatic differentiation, which we discuss in Section 3.1 and Section 3.2. Moreover, despite the resemblance between (8) and the separation of variables method for linear PDEs (7) (discussed below in Section 3.1.3), we find that SepONet can effectively approximate the solutions to nonlinear parametric PDEs. Indeed, we provide a universal approximation theorem for separable operator networks in Section 3.3 and extensive accuracy and performance scaling numerical experiments for nonlinear and inseparable PDEs in Section 4. Finally, it is worth noting that if one is only interested in solving deterministic PDEs under a certain configuration (i.e., u is fixed), then the coefficients βk are also fixed and can be absorbed into the basis functions. In this case, SepONet reduces to separable PINN (Cho et al., 2024), which has been proven to be efficient and accurate in solving single-scenario PDE systems (Es'kin et al., 2024; Oh et al., 2024).

3.1 SepONet Architecture and Implementation Details

Suppose we are provided a computational domain K1 = [0, 1]^d of dimension d and an input function u. To evaluate the residual loss term in equation (5) on Nr random collocation points, PI-DeepONet samples all Nr points directly from K1. However, if Nr can be approximately factorized as N1 × N2 × ⋯ × Nd (e.g., 1024 = 16 × 8 × 8 for d = 3), and we relax the Monte Carlo sampling requirement for d-dimensional collocation points, SepONet only needs to randomly sample Nn points along the n-th coordinate axis, resulting in a total of N1 + N2 + ⋯ + Nd samples. It is important to note that SepONet's mapping from N1 + N2 + ⋯ + Nd inputs to N1 × N2 × ⋯ × Nd outputs is most efficient when using regular grid sampling; a minimal sketch of this grid evaluation is given below.
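The following sketch (reusing the `mlp` and `init_mlp` helpers from the DeepONet example above) illustrates the grid evaluation of (8) for d = 2: each trunk net is applied only to its own axis samples, and the full N1 × N2 prediction array is assembled by an outer product and a sum over ranks, as in Figure 1. Function names and shapes are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def seponet_predict_grid(branch_params, trunk_x_params, trunk_t_params,
                         u_sensors, xs, ts):
    """SepONet prediction (8) evaluated on the full (N1 x N2) coordinate grid.

    Each trunk net maps one scalar coordinate to r basis values, e.g.
    trunk_x_params = init_mlp(key, [1, 64, 64, r]); the grid of predictions
    is built from outer products, so only N1 + N2 trunk evaluations are
    needed for N1 * N2 output points.
    """
    beta = mlp(branch_params, u_sensors)                              # (r,)
    tau_x = jax.vmap(lambda x: mlp(trunk_x_params, x))(xs[:, None])   # (N1, r)
    tau_t = jax.vmap(lambda t: mlp(trunk_t_params, t))(ts[:, None])   # (N2, r)
    # Sum over ranks k of beta_k * outer(tau_x[:, k], tau_t[:, k]).
    return jnp.einsum('k,ik,jk->ij', beta, tau_x, tau_t)              # (N1, N2)
```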
However, we have empirically demonstrated (in Section 4) that this per-axis grid-based sampling strategy does not degrade SepONet's accuracy compared to PI-DeepONet's Monte Carlo random sampling over the entire domain K1. Irregular domains may be sampled by (a) dividing the irregular domain into subdomains, each approximated by a regular grid, or (b) applying a coordinate transformation to map the irregular domain onto a regular one (e.g., converting Cartesian to polar coordinates for a circular domain). For shorthand and generality, we will denote the dataset of input points for SepONet as D = {y^(:)_1, . . . , y^(:)_d}. Each y^(:)_n = {y^(i)_n}_{i=1}^{Nn} represents an array of Nn samples along the n-th coordinate axis, for a total of N1 + N2 + ⋯ + Nd samples. The initial and boundary points may be separately sampled from K1; the number of samples (NI and Nb) and the sampling strategy are equivalent for SepONet and PI-DeepONet.

3.1.1 Forward Pass

The forward pass of SepONet, illustrated for d = 2 in Figure 1, follows the formulation (8), except generalized to the computationally advantageous setting where predictions along a grid of collocation points are processed in parallel. The formula can be expressed as:

Gθ(u)(y^(:)_1, . . . , y^(:)_d) = ∑_{k=1}^{r} βk ⊗_{n=1}^{d} τ^(:)_{n,k} = ∑_{k=1}^{r} bψ(E(u))_k · t¹_{ϕ1}(y^(:)_1)_k ⊗ t²_{ϕ2}(y^(:)_2)_k ⊗ ⋯ ⊗ t^d_{ϕd}(y^(:)_d)_k, (9)

where ⊗ is the (outer) tensor product, which produces an output predictive array along a meshgrid of N1 × N2 × ⋯ × Nd collocation points. Notably, τ^(:)_{n,k} = t^n_{ϕn}(y^(:)_n)_k represents a vector of Nn values produced by the n-th trunk net along the k-th output mode after feeding in all y^(:)_n points. After taking the outer product along each of the n = 1, . . . , d dimensions for all r modes, the modes are sum-reduced with the predictions of the branch net βk = bψ(E(u))_k. While not shown here, our implementation also batches over input functions {u^(i)}_{i=1}^{Nf} for Nf functions. Thus, for only Nf + N1 + ⋯ + Nd inputs, SepONet produces a predictive array with shape Nf × N1 × ⋯ × Nd.

3.1.2 Model Update

In evaluating the physics loss (4), SepONet enables more efficient computation of high-order derivatives, in terms of both time and memory use, compared to PI-DeepONet by leveraging forward-mode automatic differentiation (AD) (Khan & Barton, 2015). This is fairly evident from the form of (9). For example, to compute derivatives of all SepONet outputs with respect to the m-th variable ym:

∂Gθ(u)(y^(:)_1, ..., y^(:)_m, ..., y^(:)_d)/∂ym = ∑_{k=1}^{r} βk · τ^(:)_{1,k} ⊗ ⋯ ⊗ ∂τ^(:)_{m,k}/∂ym ⊗ ⋯ ⊗ τ^(:)_{d,k}, (10)

where ∂τ^(:)_{m,k}/∂ym is a vector of derivatives of the m-th trunk net's k-th basis function evaluated along all inputs to the m-th coordinate axis. One may notice that ∂τ^(:)_{m,k}/∂ym can be written as a Jacobian-vector product (JVP) of the Jacobian of the m-th trunk net's r × Nm outputs with respect to all Nm inputs, and an Nm × 1 tangent vector of ones:

∂τ^(:)_{m,k}/∂ym := e^(k) J[t^m_{ϕm}(y^(:)_m)] 1, (11)

where e^(k) selects the k-th output mode from the resulting r × Nm JVP output. This is equivalent to forward-mode AD. Consequently, the derivatives along the m-th coordinate axis across the entire grid of predictions can be obtained by pushing forward derivatives of the m-th trunk net and then reusing the outputs of all other n ≠ m trunk nets via outer product, as in the sketch below.
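A minimal sketch of this forward-mode computation in JAX follows, again reusing the `mlp` helper defined earlier. A single `jax.jvp` call plays the role of (11), and the `einsum` reassembles (10) on the grid for d = 2; the function names and shapes are illustrative, and second derivatives would be obtained by nesting the same `jvp` construction.

```python
import jax
import jax.numpy as jnp

def trunk_and_derivative(trunk_params, coords):
    """Basis values and first derivatives of one trunk net along its axis.

    Pushing a tangent of ones through the vmapped trunk net is the
    Jacobian-vector product in (11): one forward-mode pass returns both
    tau (N, r) and d tau / d coord (N, r).
    """
    f = jax.vmap(lambda c: mlp(trunk_params, c))          # (N, 1) -> (N, r)
    tau, dtau = jax.jvp(f,
                        (coords[:, None],),
                        (jnp.ones_like(coords)[:, None],))
    return tau, dtau

def seponet_dt_grid(branch_params, trunk_x_params, trunk_t_params,
                    u_sensors, xs, ts):
    """d/dt of the SepONet prediction, as in (10), over the (N1 x N2) grid."""
    beta = mlp(branch_params, u_sensors)                   # (r,)
    tau_x, _ = trunk_and_derivative(trunk_x_params, xs)    # reuse x-basis
    _, dtau_t = trunk_and_derivative(trunk_t_params, ts)   # push tangent in t
    return jnp.einsum('k,ik,jk->ij', beta, tau_x, dtau_t)  # (N1, N2)
```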
By contrast, PI-Deep Onet must compute derivatives Gθ(y(i) 1 , ..., y(i) d )/ ym for each input-output pair individually, y(i) = (y(i) 1 , ..., y(i) d ) for i = 1, ..., M, where there is no such computational advantage and it is more prudent to use reverse-mode AD. Fundamentally, the advantage of Sep ONet for using forward-mode AD can be attributed to the significantly smaller input-output relationship when evaluating along coordinate grids RN1+ +Nd RN1 Nd compared to PI-Deep ONet RM d RM 1 when we choose M = N1N2 Nd. The time and space complexity analysis below in Section 3.2 provides a more descriptive breakdown of computational scaling behavior. For a more detailed explanation of forwardand reverse-mode AD, we refer readers to Cho et al. (2024); Margossian (2019). Once the physics loss is computed, often involving multiple evaluations of (10), reverse-mode AD is employed to update the model parameters θ = (ψ, ϕ1, . . . , ϕd). 3.1.3 Inference Once trained, Sep ONet can efficiently map input functions u to accurate output function predictions ˆs = Gθ(u) along large spatiotemporal grids. This is achieved by combining the learned coefficients from the Published in Transactions on Machine Learning Research (12/2024) Table 1: Time and space complexities of first-order derivatives of Sep ONet and PI-Deep ONet with respect to N d collocation points using forward-mode and reverse-mode AD, respectively. Method Time Complexity Space Complexity Sep ONet (Forward AD) O(N d Lr2 + N d rd) O(N rd + N d) PI-Deep ONet (Reverse AD) O(N d Lr2) O(N d Lr) branch net with the basis functions learned by the trunk nets. Indeed, intriguing comparisons can be made between the results of the method of separation of variables (7) and trained Sep ONet predictions (8). For an initial value problem with linear, separable PDE operator treated in Section 2.3, we might expect Sep ONet to learn initial value function dependent coefficients bψ(E(u))k = Ak and spatiotemporal basis functions tn ϕn(yn)k = Y k n (yn) (provided we supply one additional trunk net t0 ϕ0(t)k = T k(t) for the temporal dimension) for appropriately ordered modes k and sufficiently large r. Examples of the learned basis functions for a separable 1D time-dependent heat equation initial value problem example are provided in Appendix D.2 as a function of the number of the trunk net output shape r. For a small number of modes r = 1 or r = 2, Sep ONet learns nearly the exact spatiotemporal basis functions obtained by separation of variables. For larger r, the Sep ONet basis functions do not converge to the analytically expected basis functions. Nevertheless, approximation error is observed to improve with r, and near perfect accuracy is obtained at large r by comparison to numerical estimates of the analytic solution. While the form of Sep ONet predictions (8) resemble separation of variables, we note that the method of separation of variables typically only applies to linear PDEs with restricted properties. In spite of this, Sep ONet is capable of accurately approximating arbitrary operator learning problems (including nonlinear PDEs) as guaranteed by a universal approximation property, provided below in Section 3.3. 3.2 Complexity Analysis Suppose we are provided a computational domain K1 = [0, 1]d of dimension d. For PI-Deep ONet, collocation points are sampled randomly from the entire d-dimensional domain, with a total of M points. 
For SepONet, as described previously in Section 3.1, we randomly sample N1 + N2 + ⋯ + Nd inputs, with Nn points along the n-th axis, and construct a Cartesian product grid in K1 via (9). The resulting output of PI-DeepONet has shape M × 1, and the output of SepONet has shape N1 × N2 × ⋯ × Nd. For simplicity, we assume all trunk nets are L-layer fully connected networks with hidden and output dimensions r, and that N1 = N2 = ⋯ = Nd = N and M = N^d. The resulting time and space complexity to compute first-order derivatives of all SepONet and PI-DeepONet outputs is provided in Table 1.

The first term in SepONet's time and space complexity is due to the forward-mode AD computation of each of the d trunk networks' derivatives with respect to N inputs per axis. The second term, containing N^d, is from computing and storing the tensor product. On the other hand, PI-DeepONet individually backpropagates all M = N^d outputs, resulting in complexity scaling with N^d. From this analysis, in the limiting case of N^{d−1} ≫ Lr, we observe that both SepONet and PI-DeepONet have time and space complexities that include N^d due to evaluations over all points in a d-dimensional space. However, since typically d ≪ Lr, SepONet is more efficient in practice due to smaller coefficients in the N^d term. Moreover, the tensor product in SepONet can be greatly accelerated by GPU parallelization, which is not taken into account in this analysis. In the limiting case Lr ≫ N^{d−1}, the first term in SepONet's time complexity dominates, giving O(NdLr²); in other words, it scales linearly with dimension and sub-linearly with the total number of collocation points N^d. This situation is not uncommon in many practical 2D and 3D operator learning problems. Note that Table 1 only considers first-order derivatives. Higher-order derivatives are typically needed to evaluate the physics loss functions (5). Fortunately, higher-order derivatives for SepONet are computed with similar complexity to Table 1, since they amount to sequentially repeating the Jacobian-vector products (JVP) from (10) and (11). Lastly, we did not consider the parameter update for training with a physics loss in Table 1, since it requires further assumptions about the composition of the branch network, and it is not typically the limiting computation for either PI-DeepONet or SepONet.

3.3 Universal Approximation Property of SepONet

The universal approximation property of DeepONet has been discussed in Chen & Chen (1995); Lu et al. (2021). Here we present a universal approximation theorem to show that the proposed separable operator networks can also approximate any nonlinear continuous operator that maps one infinite-dimensional function space to another.

Theorem 1 (Universal Approximation Theorem for Separable Operator Networks). Suppose that σ is a Tauber-Wiener function, g is a sinusoidal function, X is a Banach space, K ⊂ X, K1 ⊂ R^{d1} and K2 ⊂ R^{d2} are three compact sets in X, R^{d1} and R^{d2}, respectively, U is a compact set in C(K), and G is a nonlinear continuous operator which maps U into a compact set S ⊂ C(K1 × K2). Then for any ϵ > 0, there are positive integers n, r, m, constants c_i^k, ζ_k^1, ζ_k^2, ξ_{ij}^k, θ_i^k ∈ R, points ω_k^1 ∈ R^{d1}, ω_k^2 ∈ R^{d2}, x_j ∈ K, i = 1, . . . , n, k = 1, . . . , r, j = 1, . . . , m, such that

| G(u)(y1, y2) − ∑_{k=1}^{r} ∑_{i=1}^{n} c_i^k σ( ∑_{j=1}^{m} ξ_{ij}^k u(x_j) + θ_i^k ) [branch] · g( ω_k^1 · y1 + ζ_k^1 ) [trunk 1] · g( ω_k^2 · y2 + ζ_k^2 ) [trunk 2] | < ϵ (12)

holds for all u ∈ U, y = (y1, y2) ∈ K1 × K2.

Proof. The proof can be found in Appendix A.2.

Remark 1.
The definition of the Tauber-Wiener function is given in Appendix A.1. It is worth noting that many common-used activations, such as Re LU, GELU and Tanh, are Tauber-Wiener functions. Remark 2. Here we show the approximation property of a separable operator network with two trunk nets, by repeatedly applying trigonometric angle addition formula, it is trivial to separate y as (y1, y2, . . . , yd) K1 K2 . . . Kd and extend (12) to d trunk nets. Remark 3. In our assumptions, we restrict the activation function for the trunk nets to be sinusoidal. This choice is motivated by the natural suitability of sinusoidal functions for constructing basis functions (Stein & Shakarchi, 2011) and their empirical effectiveness in solving PDEs (Sitzmann et al., 2020). However, it would be interesting to explore whether Theorem 1 still holds when g is a more general activation function, such as a Tauber-Wiener function. We will leave this investigation for future work. Remark 4. Here we assume a two-layer branch network and d one-layer trunk networks with sinusoidal activations. For practical implementation, our theoretical results can be extended to multi-layer trunk networks by leveraging the Universal Approximation Theorem (UAT) to approximate sinusoidal functions. Remark 5. Theorem 1 suggests the existence of a separable operator network approximation for any nonlinear continuous operator. This does not imply error bounds nor tractable scaling laws with respect to any specific error metric, nor does it provide a prescription for how to define and update model parameters. Below we will provide experimental evidence that error scaling is comparable to PI-Deep ONet when using physics-informed operator learning. Please note that error bounds for the supervised training of Deep ONet have been previously derived by Lanthaler et al. (2022); similar error bounds for Sep ONet are not provided in this work. 4 Numerical Results This section presents comprehensive numerical studies demonstrating the expressive power and effectiveness of Sep ONet compared to PI-Deep ONet on various time-dependent PDEs: diffusion-reaction, advection, Burgers , and (2+1)-dimensional nonlinear diffusion equations. Both models were trained by optimizing the Published in Transactions on Machine Learning Research (12/2024) Table 2: Summary of PDE test problems. Governing Law Domain Equation Form Initial Condition Boundary Condition Diffusion-reaction x [0, 1], t [0, 1] s t = D 2s x2 + ks2 + u(x) s(x, 0) = 0 s(0, t) = 0, s(1, t) = 0 Advection x [0, 1], t [0, 1] s t + u(x) s x = 0 s(x, 0) = sin(πx) s(0, t) = sin π Burgers x [0, 1], t [0, 1] s t + s s x2 = 0 s(x, 0) = u(x) s(0, t) = s(1, t), s x(0, t) = s 2D Nonlinear diffusion x Ω= [0, 1]2, t [0, 1] s t = α (s s) s(x, 0) = u(x) s(x, t) = 0 on Ω physics loss (equation (4)) on a dataset D consisting of input functions, residual points, initial points, and boundary points. PDE definitions are summarized in Table 2. We set the number of residual points to Nr = N d = Nc, where d is the problem dimension and N is an integer. Here, Nc refers to the total number of training points. For Sep ONet, the residual points are generated by randomly sampling N points along each axis and constructing a Cartesian product grid. In contrast, for PI-Deep ONet, the N d residual points are randomly sampled from the entire d-dimensional domain. The number of initial and boundary points per axis is set to NI = Nb = N = d Nc, and these points are also randomly sampled from the solution domain. 
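To make the residual-point setup concrete, the sketch below contrasts the two sampling strategies for a unit domain, with N^d = Nc as described above. The helper name and the specific Nc, N, d values are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def sample_collocation_points(key, Nc, d):
    """Residual-point sampling strategies for the unit domain (illustrative).

    PI-DeepONet: Nc points drawn uniformly from the whole d-dimensional domain.
    SepONet:     N = Nc**(1/d) points per axis, combined into a Cartesian grid,
                 so only N * d scalar samples are drawn for Nc grid points.
    """
    k1, k2 = jax.random.split(key)

    # PI-DeepONet style: (Nc, d) random points in [0, 1]^d.
    pts_pideeponet = jax.random.uniform(k1, (Nc, d))

    # SepONet style: d independent 1D samples of length N each.
    N = round(Nc ** (1.0 / d))
    axes = jax.random.uniform(k2, (d, N))
    grid = jnp.stack(jnp.meshgrid(*axes, indexing='ij'), axis=-1)
    return pts_pideeponet, axes, grid

key = jax.random.PRNGKey(0)
pts, axes, grid = sample_collocation_points(key, Nc=128**2, d=2)
# pts:  (16384, 2) individual points; axes: (2, 128) per-axis samples;
# grid: (128, 128, 2) Cartesian-product grid assembled from the axes.
```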
For each PDE, the model size remains fixed as Nc or Nf varies. Specifically, both PI-Deep ONet and Sep ONet have branch and trunk networks of the same size; the main difference is that Sep ONet uses d independent trunk networks, one for each axis. We evaluate both models by varying the number of input functions (Nf) and training points (Nc) across four key perspectives: test accuracy, GPU memory usage, training time, and extreme-scale learning capabilities. The main results are illustrated in Figure 2 and Figure 3, with complete test results reported in Appendix B.3. Loss functions, training details, and problem-specific parameters are provided in Appendix B.1 and Appendix B.2. We provide additional ablation studies for our test results using trunk networks with hyperbolic tangent activation functions in Appendix B.4.1, and varied number of input sensors for the branch net in Appendix B.4.2. 4.1 Test accuracy Both PI-Deep ONet and Sep ONet demonstrate improved accuracy when increasing either the number of training points (Nc) or the number of input functions (Nf), while fixing the other parameter. This trend is consistent across all four equations tested. For instance, in the case of the diffusion-reaction equation, when fixing Nf = 100 and varying Nc from 82 to 1282, the relative ℓ2 error of PI-Deep ONet decreases from 1.39% to 0.73%, while Sep ONet s error reduces from 1.49% to 0.62%. Conversely, when fixing Nc = 1282 and varying Nf from 5 to 100, PI-Deep ONet s error drops from 34.54% to 0.73%, and Sep ONet s reduces from 22.40% to 0.62%. 4.2 GPU memory usage While both models show improved accuracy with increasing Nc or Nf, their memory usage patterns differ significantly. This divergence is particularly evident in the case of the advection equation. When fixing Nf = 100 and varying Nc, PI-Deep ONet exhibits a steep increase in GPU memory consumption during training, rising from 0.967 GB at Nc = 82 to 59.806 GB at Nc = 1282. In contrast, Sep ONet maintains a relatively constant and low memory footprint during training, ranging between 0.713 GB and 0.719 GB across the same range of Nc. Similarly, when fixing Nc = 1282 and varying Nf from 5 to 100, PI-Deep ONet s memory usage escalates from 3.021 GB to 59.806 GB. Sep ONet, however, maintains a stable memory usage throughout this range. Published in Transactions on Machine Learning Research (12/2024) Figure 2: Performance comparison of PI-Deep ONet and Sep ONet with varying number of training points (Nc) and fixed number of input functions (Nf = 100). Results show test accuracy, GPU memory usage, and training time for four PDEs. As Nc increases, both models demonstrate improved accuracy, but PIDeep ONet exhibits significant increases in training time and memory usage, while Sep ONet maintains better computational efficiency. 4.3 Training time The training time scaling exhibits a pattern similar to memory usage, as demonstrated by the advection equation example. As Nc increases from 82 to 1282 with fixed Nf = 100, PI-Deep ONet s training time increases significantly from 0.0787 to 8.231 hours (2.361 to 246.93 ms per iteration). In contrast, Sep ONet maintains relatively stable training times, ranging from 0.0730 to 0.0843 hours (2.19 to 2.529 ms per iteration) over the same Nc range. Similarly, when varying Nf from 5 to 100 with fixed Nc = 1282, PI-Deep ONet s training time increases from 0.3997 to 8.231 hours (11.991 to 246.93 ms per iteration). Sep ONet, however, keeps training times between 0.0730 and 0.0754 hours. 
These results demonstrate Sep ONet s superior scalability in terms of training time. The ability to maintain near-constant training times across a wide range of problem sizes is a significant advantage, particularly for large-scale applications where computational efficiency is crucial. 4.4 Extreme-scale learning The Burgers and nonlinear diffusion equations highlight Sep ONet s capabilities in extreme-scale learning scenarios. For the Burgers equation, PI-Deep ONet encounters memory limitations at larger scales. As seen in Figure 2(c), PI-Deep ONet can only compute up to Nc = 642, achieving a relative ℓ2 error of 13.72%. In contrast, Sep ONet continues to improve, reaching a 7.51% error at Nc = 1282. The nonlinear diffusion equation further emphasizes this difference. In Figure 3(d), PI-Deep ONet results are entirely unavailable due to outof-memory issues. Sep ONet, however, efficiently handles this complex problem, achieving a relative ℓ2 error of 6.44% with Nc = 1283 and Nf = 100. Table 3 demonstrates Sep ONet s ability to tackle even larger scales for the Burgers equation. It achieves a relative ℓ2 error as low as 4.12% with Nc = 5122 and Nf = 800, while maintaining reasonable memory usage (10.485 GB) and training time (0.478 hours). These results underscore Sep ONet s capability to handle extreme-scale learning problems beyond the reach of PI-Deep ONet due to computational constraints. 5 Discussion The field of operator learning faces a critical dilemma. Deep Operator Networks (Deep ONets) offer fast training but require extensive data generation, which can be prohibitively expensive or impractical for Published in Transactions on Machine Learning Research (12/2024) Figure 3: Performance comparison of PI-Deep ONet and Sep ONet with increasing number of input functions (Nf) and fixed number of training points (Nc = 128d, where d is the problem dimension). Both models show improved accuracy with increasing Nf, but PI-Deep ONet s computational resources scale poorly compared to Sep ONet s more efficient scaling. Note: PI-Deep ONet results for the (2+1)-dimensional diffusion equation are unavailable due to memory constraints. Table 3: Additional Sep ONet results for Burgers equation, demonstrating that larger Nc and Nf can be used to enhance accuracy with minimal cost increase. Metrics \ Nc & Nf 1282 & 400 1282 & 800 2562 & 400 2562 & 800 5122 & 400 5122 & 800 Relative ℓ2 error (%) 6.60 6.21 5.68 4.46 5.38 4.12 Memory (GB) 0.966 1.466 2.466 4.466 5.593 10.485 Training time (hours) 0.0771 0.0957 0.1238 0.1717 0.2751 0.478 complex systems. Physics-informed Deep ONets (PI-Deep ONets) relax the data requirement but at the cost of resource-intensive training processes. This creates a challenging trade-off: balancing resource allocation either before training (data generation) or during training (computational resources). Our proposed Separable Operator Networks (Sep ONet) effectively address both of these concerns. Inspired by the separation of variables technique typically used for linear PDEs, Sep ONet constructs its own basis functions to approximate a nonlinear operator. Sep ONet s expressive power is guaranteed by the universal approximation property (Theorem 1), ensuring it can approximate any nonlinear continuous operator with arbitrary accuracy. Our numerical results corroborate this theoretical guarantee, demonstrating Sep ONet s ability to handle complex, nonlinear systems efficiently. 
By leveraging independent trunk networks for different variables, Sep ONet enables an efficient implementation via forward-mode automatic differentiation (AD). This approach achieves remarkable efficiency in both data utilization and computational resources. Sep ONet is trained solely by optimizing the physics loss, eliminating the need for expensive simulations to generate ground truth PDE solutions. In terms of computational resources, Sep ONet maintains stable GPU memory usage and training time, even with increasing training data and network size, in contrast to PI-Deep ONet s dramatic resource consumption increases under similar scaling. We anticipate that Sep ONet s advantages will allow it to tackle more challenging physics-informed operator learning problems, such as the Navier-Stokes equations (Jin et al., 2021), where both input and output functions are vector-valued. These are problems that PI-Deep ONet may struggle to train on due to resource constraints. As an example, we have considered a (2+1)-dimensional Navier-Stokes equation, previously investigated in the context of PINN (Cho et al., 2024; Wang et al., 2024). Some early, preliminary results can be found in Appendix D.1. Published in Transactions on Machine Learning Research (12/2024) However, Sep ONet has certain limitations that warrant further research and development. The mesh grid structure of Sep ONet s solution, while enabling efficient training through forward-mode AD, may limit its flexibility in handling PDEs with irregular geometries. Addressing this limitation could involve developing adaptations or hybrid approaches that accommodate more complex spatial domains (Li et al., 2023; Serrano et al., 2024; Fang et al., 2024), potentially expanding Sep ONet s applicability to a broader range of physical problems. Additionally, while the linear decoder allows for an efficient Sep ONet implementation, a very large number of basis functions may be needed for accurate linear representation in some problems (Seidman et al., 2022). Developing a nonlinear decoder version of Sep ONet will be useful to balance accuracy and efficient training. Moreover, implementing an adaptive weighting strategy (Wang et al., 2021a; 2022a;b) for different loss terms in the physics loss function, instead of using predefined fixed weights, could lead to improved accuracy and faster convergence. Finally, empirical observations suggest that training accuracy and robustness improve with an increase in input training functions. However, the neural scaling laws in physics-informed operator learning remain unexplored, presenting an intriguing theoretical challenge for future investigation. Acknowledgments We would like to thank Wolfger Peelaers for the valuable discussions and insightful comments. Z. Zhang was supported by NSF Awards #2328281 and #1846476 Md Ashiqur Rahman, Zachary E Ross, and Kamyar Azizzadenesheli. U-no: U-shaped neural operators. ar Xiv e-prints, pp. ar Xiv 2204, 2022. Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18(153):1 43, 2018. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax. Shengze Cai, Zhicheng Wang, Lu Lu, Tamer A Zaki, and George Em Karniadakis. 
Deepm&mnet: Inferring the electroconvection multiphysics fields based on operator approximation by neural networks. Journal of Computational Physics, 436:110296, 2021. Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE transactions on neural networks, 6(4):911 917, 1995. Pao-Hsiung Chiu, Jian Cheng Wong, Chinchun Ooi, My Ha Dao, and Yew-Soon Ong. Can-pinn: A fast physics-informed neural network based on coupled-automatic numerical differentiation method. Computer Methods in Applied Mechanics and Engineering, 395:114909, 2022. Junwoo Cho, Seungtae Nam, Hyunmo Yang, Seok-Bae Yun, Youngjoon Hong, and Eunbyung Park. Separable physics-informed neural networks. Advances in Neural Information Processing Systems, 36, 2024. Tobin A Driscoll, Nicholas Hale, and Lloyd N Trefethen. Chebfun guide, 2014. Vasiliy A Es kin, Danil V Davydov, Julia V Gur eva, Alexey O Malkhanov, and Mikhail E Smorkalov. Separable physics-informed neural networks for the solution of elasticity problems. ar Xiv preprint ar Xiv:2401.13486, 2024. Zhiwei Fang, Sifan Wang, and Paris Perdikaris. Learning only on boundaries: a physics-informed neural operator for solving parametric partial differential equations in complex geometries. Neural Computation, 36(3):475 498, 2024. Published in Transactions on Machine Learning Research (12/2024) Pulkit Gopalani, Sayar Karmakar, DIBYAKANTI KUMAR, and Anirbit Mukherjee. Towards sizeindependent generalization bounds for deep operator nets. Available at SSRN 4763746. Jiaqi Gu, Zhengqi Gao, Chenghao Feng, Hanqing Zhu, Ray Chen, Duane Boning, and David Pan. Neurolight: A physics-agnostic neural operator enabling parametric photonic device simulation. Advances in Neural Information Processing Systems, 35:14623 14636, 2022. Jayesh K Gupta and Johannes Brandstetter. Towards multi-spatiotemporal-scale generalized pde modeling. ar Xiv preprint ar Xiv:2209.15616, 2022. Di He, Shanda Li, Wenlei Shi, Xiaotian Gao, Jia Zhang, Jiang Bian, Liwei Wang, and Tie-Yan Liu. Learning physics-informed neural networks without stacked back-propagation. In International Conference on Artificial Intelligence and Statistics, pp. 3034 3047. PMLR, 2023. Zheyuan Hu, Khemraj Shukla, George Em Karniadakis, and Kenji Kawaguchi. Tackling the curse of dimensionality with physics-informed neural networks. Neural Networks, 176:106369, 2024. Arieh Iserles. A first course in the numerical analysis of differential equations. Number 44. Cambridge university press, 2009. Zhongyi Jiang, Min Zhu, Dongzhuo Li, Qiuzi Li, Yanhua O Yuan, and Lu Lu. Fourier-mionet: Fourierenhanced multiple-input neural operators for multiphase modeling of geological carbon sequestration. ar Xiv preprint ar Xiv:2303.04778, 2023. Xiaowei Jin, Shengze Cai, Hui Li, and George Em Karniadakis. Nsfnets (navier-stokes flow nets): Physicsinformed neural networks for the incompressible navier-stokes equations. Journal of Computational Physics, 426:109951, 2021. Karthik Kashinath, M Mustafa, Adrian Albert, JL Wu, C Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, R Wang, Ashesh Chattopadhyay, A Singh, et al. Physics-informed machine learning: case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A, 379(2194):20200093, 2021. Kamil A Khan and Paul I Barton. A vector forward mode of automatic differentiation for generalized derivative evaluation. 
Optimization Methods and Software, 30(6):1185 1212, 2015. Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable Py Trees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021, 2021. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Dmitrii Kochkov, Jamie A Smith, Ayya Alieva, Qing Wang, Michael P Brenner, and Stephan Hoyer. Machine learning accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118 (21):e2101784118, 2021. Katiana Kontolati, Somdatta Goswami, George Em Karniadakis, and Michael D Shields. Learning nonlinear operators in latent spaces for real-time predictions of complex dynamics in physical systems. Nature Communications, 15(1):5101, 2024. Jean Kossaifi, Nikola Kovachki, Kamyar Azizzadenesheli, and Anima Anandkumar. Multi-grid tensorized fourier neural operator for high-resolution pdes. ar Xiv preprint ar Xiv:2310.00120, 2023. Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for deeponets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022. Zijie Li, Dule Shu, and Amir Barati Farimani. Scalable transformer for pde surrogate modeling. Advances in Neural Information Processing Systems, 36, 2024. Published in Transactions on Machine Learning Research (12/2024) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. ar Xiv preprint ar Xiv:2010.08895, 2020a. Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. ar Xiv preprint ar Xiv:2003.03485, 2020b. Zongyi Li, Daniel Zhengyu Huang, Burigede Liu, and Anima Anandkumar. Fourier neural operator with learned deformations for pdes on general geometries. Journal of Machine Learning Research, 24(388): 1 26, 2023. Chensen Lin, Zhen Li, Lu Lu, Shengze Cai, Martin Maxey, and George Em Karniadakis. Operator learning for predicting multiscale bubble growth dynamics. The Journal of Chemical Physics, 154(10), 2021. Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, and Wenjing Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. ar Xiv preprint ar Xiv:2201.00217, 2022a. Ziyue Liu, Xinling Yu, and Zheng Zhang. Tt-pinn: a tensor-compressed neural pde solver for edge computing. ar Xiv preprint ar Xiv:2207.01751, 2022b. Ziyue Liu, Yixing Li, Jing Hu, Xinling Yu, Shinyu Shiau, Xin Ai, Zhiyu Zeng, and Zheng Zhang. Deepoheat: operator learning-based ultra-fast thermal simulation in 3d-ic design. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1 6. IEEE, 2023. Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3):218 229, 2021. Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on fair data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022a. Lu Lu, Raphaël Pestourie, Steven G Johnson, and Giuseppe Romano. 
Multifidelity deep neural operators for efficient learning of partial differential equations with application to fast inverse design of nanoscale heat transport. Physical Review Research, 4(2):023210, 2022b. Luis Mandl, Somdatta Goswami, Lena Lambers, and Tim Ricken. Separable physics-informed deeponet: Breaking the curse of dimensionality in physics-informed machine learning. Computer Methods in Applied Mechanics and Engineering, 434:117586, 2025. Zhiping Mao, Lu Lu, Olaf Marxen, Tamer A Zaki, and George Em Karniadakis. Deepm&mnet for hypersonics: Predicting the coupled flow and finite-rate chemistry behind a normal shock using neural-network approximation of operators. Journal of computational physics, 447:110698, 2021. Charles C Margossian. A review of automatic differentiation and its efficient implementation. Wiley interdisciplinary reviews: data mining and knowledge discovery, 9(4):e1305, 2019. Jaemin Oh, Seung Yeon Cho, Seok-Bae Yun, Eunbyung Park, and Youngjoon Hong. Separable physicsinformed neural networks for solving the bgk model of the boltzmann equation. ar Xiv preprint ar Xiv:2403.06342, 2024. Oded Ovadia, Adar Kahana, Panos Stinis, Eli Turkel, and George Em Karniadakis. Vito: Vision transformeroperator. ar Xiv preprint ar Xiv:2303.08891, 2023. Ravi Shankar Palani. spinop.m (spin operator). https://www.mathworks.com/matlabcentral/ fileexchange/71536-spinop-m-spin-operator, 2024. MATLAB Central File Exchange. Retrieved June 28, 2024. Published in Transactions on Machine Learning Research (12/2024) Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. ar Xiv preprint ar Xiv:2202.11214, 2022. Stephen B Pope. Turbulent flows. Measurement Science and Technology, 12(11):2020 2021, 2001. Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686 707, 2019. Matthias Seeger. Gaussian processes for machine learning. International journal of neural systems, 14(02): 69 106, 2004. Jacob Seidman, Georgios Kissas, Paris Perdikaris, and George J Pappas. Nomad: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601 5613, 2022. Louis Serrano, Lise Le Boudec, Armand Kassaï Koupaï, Thomas X Wang, Yuan Yin, Jean-Noël Vittaut, and Patrick Gallinari. Operator learning with neural fields: Tackling pdes on general geometries. Advances in Neural Information Processing Systems, 36, 2024. Khemraj Shukla, Vivek Oommen, Ahmad Peyvan, Michael Penwarden, Luis Bravo, Anindya Ghoshal, Robert M Kirby, and George Em Karniadakis. Deep neural operators can serve as accurate surrogates for shape optimization: a case study for airfoils. ar Xiv preprint ar Xiv:2302.00807, 2023. Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33: 7462 7473, 2020. Elias M Stein and Rami Shakarchi. Fourier analysis: an introduction, volume 1. Princeton University Press, 2011. Alasdair Tran, Alexander Mathews, Lexing Xie, and Cheng Soon Ong. Factorized fourier neural operators. 
ar Xiv preprint ar Xiv:2111.13802, 2021. Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5):A3055 A3081, 2021a. Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed deeponets. Science advances, 7(40):eabi8605, 2021b. Sifan Wang, Hanwen Wang, and Paris Perdikaris. Improved architectures and training algorithms for deep operator networks. Journal of Scientific Computing, 92(2):35, 2022a. Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why pinns fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022b. Sifan Wang, Shyam Sankaran, and Paris Perdikaris. Respecting causality for training physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 421:116813, 2024. Karl Weierstrass. Über die analytische darstellbarkeit sogenannter willkürlicher functionen einer reellen veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin, 2 (633-639):364, 1885. Gege Wen, Zongyi Li, Kamyar Azizzadenesheli, Anima Anandkumar, and Sally M Benson. U-fno an enhanced fourier neural operator-based deep-learning model for multiphase flow. Advances in Water Resources, 163:104180, 2022. Yequan Zhao, Xinling Yu, Zhixiong Chen, Ziyue Liu, Sijia Liu, and Zheng Zhang. Tensor-compressed backpropagation-free training for (physics-informed) neural networks. ar Xiv preprint ar Xiv:2308.09858, 2023. Published in Transactions on Machine Learning Research (12/2024) A Universal Approximation Theorem for Separable Operator Networks Here we present the universal approximation theorem for the proposed separable operator networks, originally written in Theorem 1 and repeated below in Theorem 5. We begin by reviewing established theoretical results on approximating continuous functions and functionals. Following this review, we introduce the preliminary lemmas and proofs necessary for understanding Theorem 5. We refer our readers to Chen & Chen (1995); Weierstrass (1885) for detailed proofs of Theorems 2, 3, 4. Main notations are listed in Table 4. A.1 Preliminaries and Auxiliary Results Definition 1 (Tauber-Wiener (TW)). If a function g : R R (continuous or discontinuous) satisfies that all the linear combinations PN i=1 cig (λix + θi), λi R, θi R, ci R, i = 1, 2, . . . , N, are dense in every C[a, b], then g is called a Tauber-Wiener (TW) function. Remark 1 (Density in C[a, b]). A set of functions is said to be dense in C[a, b] if every function in the space of continuous functions on the interval [a, b] can be approximated arbitrarily closely by functions from the set. Definition 2 (Compact Set). Suppose that X is a Banach space, V X is called a compact set in X, if for every sequence {xn} n=1 with all xn V , there is a subsequence {xnk}, which converges to some element x V . Theorem 2 (Chen & Chen (1995)). Suppose that K is a compact set in Rn, S is a compact set in C(K), g (TW), then for any ϵ > 0, there exist a positive integer N, real numbers θi, vectors ωi Rn, i = 1, . . . , N, which are independent of f C(K) and constants ci(f), i = 1, . . . , N depending on f, such that f(x) i=1 ci(f)g (ωi x + θi) holds for all x K and f S. Moreover, each ci(f) is a linear continuous functional defined on S. Theorem 3 (Chen & Chen (1995)). 
Suppose that σ (TW), X is a Banach Space, K X is a compact set, U is a compact set in C(K), f is a continuous functional defined on U, then for any ϵ > 0, there are positive integers N, m points x1, . . . , xm K, and real constants ci, θi, ξij, i = 1, . . . , N, j = 1, . . . , m, such that f(u) j=1 ξiju (xj) + θi holds for all u U. Theorem 4 (Weierstrass Approximation Theorem Weierstrass (1885)). Suppose f C[a, b], then for every ϵ > 0, there exists a polynomial p such that for all x in [a, b], we have |f(x) p(x)| < ϵ. Corollary 1. Trigonometric polynomials are dense in the space of continuous and periodic functions C[0, 2π] := {f C[0, 2π] | f(0) = f(2π)}. Proof. For any f C[0, 2π], extend it to a 2π-periodic and continuous function f defined on R. It suffices to show that there exists a trigonometric polynomial that approximates f within any ϵ > 0. We construct the continuous even functions of 2π period g and h as: g(θ) = f(θ) + f( θ) 2 and h(θ) = f(θ) f( θ) 2 sin(θ). (15) Let ϕ(x) = g(arccos x) and ψ(x) = h(arccos x). Since ϕ, ψ are continuous functions on [ 1, 1], by the Weierstrass Approximation Theorem 4, for any ϵ > 0, there exist polynomials p and q such that |ϕ(x) p(x)| < ϵ 4 and |ψ(x) q(x)| < ϵ Published in Transactions on Machine Learning Research (12/2024) Table 4: Notations and Symbols X some Banach space with norm X Rd Euclidean space of dimension d K some compact set in a Banach space C(K) Banach space of all continuous functions defined on K, with norm f C(K) = maxx K |f(x)| C[a, b] the space of functions in C[a, b] satisfying f(a) = f(b) V some compact set in C(K) u(x) some input function U the space of input functions G some continuous operator G(u)(y) or s(y) some output function that is mapped from the corresponding input function u by the operator G S the space of output functions (TW) all the Tauber-Wiener functions σ and g activation function for branch net and trunk nets in Theorem 5 {x1, x2, . . . , xm} m sensor points for identifying input function u r rank of some deep operator network or separable operator network n, m operator network size hyperparameters in Theorem 5 holds for all x [ 1, 1]. Let x = cos θ, it follows that |g(θ) p(cos θ)| < ϵ 4 and |h(θ) q(cos θ)| < ϵ for θ [0, π]. Because g, h and cosine are even and 2π-periodic, (17) holds for all θ R. From the definitions of g and h, and the fact |sin θ| 1, sin2 θ 1, we have f(θ) + f( θ) 2 sin2 θ p(cos θ) sin2 θ < ϵ 4 and f(θ) f( θ) 2 sin2 θ q(cos θ) sin θ < ϵ Using the triangle inequality, we obtain f(θ) sin2 θ p(cos θ) sin2 θ + q(cos θ) sin θ < ϵ Applying the same analysis to g(θ) = f(θ + π 2 ) + f( θ + π 2 ) 2 and h(θ) = f(θ + π 2 ) f( θ + π 2 ) 2 sin(θ), (20) we can find polynomials r and s such that f θ + π sin2 θ r(cos θ) sin2 θ + s(cos θ) sin θ < ϵ holds for all θ. Substituting θ with θ π f(θ) cos2 θ r(sin θ) cos2 θ s(sin θ) cos θ < ϵ By the triangle inequality, combining (22) and (19) gives f(θ) r(sin θ) cos2 θ s(sin θ) cos θ + p(cos θ) sin2 θ + q(cos θ) sin θ < ϵ (23) holds for all θ. Thus, the trigonometric polynomial r(sin θ) cos2 θ s(sin θ) cos θ + p(cos θ) sin2 θ + q(cos θ) sin θ (24) is an ϵ-approximation to f. Published in Transactions on Machine Learning Research (12/2024) Remark 2. If p(x) is a polynomial, it is easy to verify that p(cos θ) is a trigonometric polynomial due to the fact cosn θ = Pn k=0 ( n k) 2n cos ((n 2k)θ). Prior to proving Theorem 5, we need to establish the following lemmas. Lemma 1. Sine is a Tauber-Wiener function. Proof. Assuming the interval to be [0, π] first. 
For every continuous function f on [0, π] and any ϵ > 0, we can extend f to a continuous function F on [0, 2π] so that F(x) = f(x) on [0, π] and F(2π) = F(0). By Lemma 1, there exists a trigonometric polynomial p(x) = a0 + n=1 an cos(nx) + bn sin(nx) (25) such that sup x [0,π] |f(x) p(x)| sup x [0,2π] |F(x) p(x)| < ϵ. (26) Let c0 = a0, λ0 = 0, θ0 = π 2 , c2n 1 = bn, λ2n 1 = n, θ2n 1 = 0, c2n = an, λ2n = n, θ2n = π 2 , for n = 1, 2, . . . , N, p(x) is redefined as i=0 ci sin (λix + θi) . (27) Thus we have f(x) i=0 ci sin (λix + θi) for x [0, π]. Now consider a continuous function g on [a, b], define f(x) C[0, π] := g b a π x + a , then by (28), we have g(x) i=0 ci sin πλi holds for all x [a, b]. Therefore, it follows that for any continuous function g on [a, b] and any ϵ > 0, we can approximate g within ϵ by choosing N sufficiently large and adjusting ci, λi, θi accordingly. Hence, the set of all such linear combinations of sin(x) is dense in C[a, b], confirming that sin(x) is a Tauber-Wiener function. Remark 3. It is straightforward to conclude that all sinusoidal functions are Tauber-Wiener functions. Lemma 2. Suppose that V1 X1, V2 X2 are two compact sets in Banach spaces X1 and X2, respectively, then their Cartesian product V1 V2 is also compact. Proof. For every sequence x1 n, x2 n in V1 V2, since V1 is compact, x1 n has a subsequence x1 nk that converges to some element x1 V1. As well, since V2 is compact, there exists a subsequence x2 nk that converges to x2 V2. It follows that x1 n, x2 n converges to x1, x2 V1 V2, thus V1 V2 is compact. Lemma 3. Suppose that X is a Banach space, K1 X1, K2 X2 are two compact sets in X1 and X2, respectively. U is a compact set in C(K1), then the range G(U) of the continuous operator G from U to C(K2) is compact in C(K2). Proof. For every sequence {fn} in U, since U is compact, there exists a subsequence {fnk} that converges to some function f U. Since G is continuous, the convergence fnk f in C(K1) implies G(fnk) G(f) in C(K2). (30) Thus, for every sequence {G(fn)} in G(U), there exists a subsequence {G(fnk)} that converges to G(f) G(U). Therefore, the range G(U) of the continuous operator G is compact in C(K2). Published in Transactions on Machine Learning Research (12/2024) A.2 Universal Approximation Theorem for Sep ONet Theorem 5 (Universal Approximation Theorem for Separable Operator Networks). Suppose that σ (TW), g is a sinusoidal function, X is a Banach Space, K X, K1 Rd1 and K2 Rd2 are three compact sets in X, Rd1 and Rd2, respectively, U is a compact set in C (K), G is a nonlinear continuous operator, which maps U into a compact set S C (K1 K2), then for any ϵ > 0, there are positive integers n, r, m, constants ck i , ζ1 k, ζ2 k, ξk ij, θk i R, points ω1 k Rd1, ω2 k Rd2, xj K1, i = 1, . . . , n, k = 1, . . . , r, j = 1, . . . , m, such that j=1 ξk iju (xj) + θk i g w1 k y1 + ζ1 k g w2 k y2 + ζ2 k < ϵ (31) holds for all u U, y = (y1, y2) K1 K2. Proof. Without loss of generality, we can assume that g is sine function, by Lemma 1, we have g (TW); From the assumption that K1 and K2 are compact, by Lemma 2, K1 K2 is compact; Since G is a continuous operator that maps U into C(K1 K2), it follows that the range G(U) = {G(u) : u U} is compact in C(K1 K2) due to Lemma 3; Thus by Theorem 2, for any ϵ > 0, there exists a positive integer N, real numbers ck(G(u)) and ζk, vectors ωk Rd1+d2, k = 1, . . . , N, such that k=1 ck(G(u))g (ωk y + ζk) holds for all y K1 K2 and u C(K). Let (ω1 k, ω2 k) = ωk, where ω1 k Rd1 and ω2 k Rd2. 
Utilizing the trigonometric angle addition formula, we have g (ωk y + ζk) = g ω1 k y1 + ζk g ω2 k y2 + π + g ω1 k y1 + ζk + π g ω2 k y2 . (33) Let r = 2N, c N+k(G(u)) = ck(G(u)), ω1 N+k = ω1 k, ω2 N+k = ω2 k, ζ1 k = ζk, ζ2 k = π 2 for k = 1, . . . , N, and let ζ1 k = ζk + π 2 , ζ2 k = 0 for k = N + 1, . . . , r, equation 32 can be expressed as: k=1 ck(G(u))g ω1 k y1 + ζ1 k g ω2 k y2 + ζ2 k < ϵ Since G is a continuous operator, according to the last proposition of Theorem 2, we conclude that for each k = 1, . . . , 2N, ck(G(u)) is a continuous functional defined on U. Repeatedly applying Theorem 3, for each k = 1, . . . , 2N, ck(G(u)), we can find positive integers nk,mk, constants ck i , ξk ij, θk i R and xj K1, i = 1, . . . , nk, j = 1, . . . , mk, such that ck(G(u)) j=1 ξk iju(xj) + θk i holds for all k = 1, . . . , r and u U, where k=1 sup y1 K2,y2 K3 g ω1 k y1 + ζ1 k g ω2 k y2 + ζ2 k . (36) Substituting (35) into (34), we obtain that G(u)(y) j=1 ξk iju (xj) + θk i g w1 k y1 + ζ1 k g w2 k y2 + ζ2 k < ϵ (37) Published in Transactions on Machine Learning Research (12/2024) holds for all u U, y1 K1 and y2 K2. Let n = maxk nk, m = maxk mk. For all nk < i n, let ck i = 0. For all mk < j m, let ξk ij = 0. Then (37) can be rewritten as: j=1 ξk iju (xj) + θk i g w1 k y1 + ζ1 k g w2 k y2 + ζ2 k < ϵ, (38) which holds for all u U, y1 K1 and y2 K2. This completes the proof of Theorem 5. B PDE Problem Definitions, Training details, and Complete Test Results B.1 PDE Problem Definitions All PDE test problems exhibited in Section 4 are described in the subsections below. B.1.1 Diffusion-Reaction Systems We set the diffusion coefficient D = 0.01 and the reaction rate k = 0.01. The input training source terms are sampled from a mean-zero Gaussian random field (GRF) (Seeger, 2004) with a length scale 0.2. To generate the test dataset, we sample 100 different source terms from the same GRF and apply a second-order implicit finite difference method (Iserles, 2009) to obtain the reference solutions on a uniform 128 128 grid. The specific physics loss terms in equation (5) are defined as follows: Lresidual = 1 Nf Nr Gθ(u(i))(x(j) r , t(j) r ) t(j) r D 2Gθ(u(i))(x(j) r , t(j) r ) k Gθ(u(i))(x(j) r , t(j) r ) 2 u(i)(x(j) r ) Linitial = 1 Nf NI Gθ(u(i))(x(j) I , 0) 2 , Lboundary = 1 Nf Nb Gθ(u(i))(0, t(j) b ) 2 + Gθ(u(i))(1, t(j) b ) 2 . B.1.2 Advection Equation The input training variable coefficients are strictly positive by defining u(x) = v(x) minx v(x) + 1, where v is sampled from a GRF with length scale 0.2. To create the test dataset, we generate 100 new coefficients in the same manner that are not used in training and apply the Lax Wendroff scheme (Iserles, 2009) to solve the advection equation on a uniform 128 128 grid. The specific physics loss terms in equation (5) are defined as follows: Lresidual = 1 Nf Nr Gθ(u(i))(x(j) r , t(j) r ) t(j) r + u(i)(x(j) r ) Gθ(u(i))(x(j) r , t(j) r ) Linitial = 1 Nf NI Gθ(u(i))(x(j) I , 0) sin(πx(j) I ) 2 , Lboundary = 1 Nf Nb Gθ(u(i))(0, t(j) b ) sin π 2 t(j) b 2 . Published in Transactions on Machine Learning Research (12/2024) B.1.3 Burgers Equation The input training initial conditions are sampled from a GRF N 0, 252 + 52I 4 using the Chebfun package (Driscoll et al., 2014), satisfying the periodic boundary conditions. 
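For readers who wish to reproduce input-function sampling without Chebfun, the JAX sketch below draws mean-zero GRF samples assuming a squared-exponential (RBF) covariance with length scale 0.2, as described for the diffusion-reaction and advection inputs. It is a sketch only: the function name, jitter value, and Cholesky-based sampler are our own choices, and the Burgers initial conditions in this study use the different, periodic covariance above.

```python
import jax
import jax.numpy as jnp

def sample_grf_rbf(key, x, length_scale=0.2, n_samples=1, jitter=1e-4):
    """Draw mean-zero Gaussian random field samples on the 1D grid x, assuming
    a squared-exponential (RBF) covariance kernel with the given length scale.
    The jitter stabilizes the Cholesky factorization and is illustrative."""
    d2 = (x[:, None] - x[None, :]) ** 2            # pairwise squared distances
    cov = jnp.exp(-0.5 * d2 / length_scale ** 2)   # RBF covariance matrix
    L = jnp.linalg.cholesky(cov + jitter * jnp.eye(x.shape[0]))
    eps = jax.random.normal(key, (n_samples, x.shape[0]))
    return eps @ L.T                               # (n_samples, len(x)) field values

# Example: 100 input functions evaluated at 128 equispaced sensors in [0, 1].
key = jax.random.PRNGKey(0)
xs = jnp.linspace(0.0, 1.0, 128)
u_train = sample_grf_rbf(key, xs, length_scale=0.2, n_samples=100)
```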
Synthetic test dataset consists of 100 unseen initial functions and their corresponding solutions, which are generated from the same GRF and are solved by spectral method on a 101 101 uniform grid using the spin Op library (Palani, 2024), respectively. The corresponding physics loss terms in equation (5) are defined as: Lresidual = 1 Nf Nr Gθ(u(i))(x(j) r , t(j) r ) t(j) r + Gθ(u(i))(x(j) r , t(j) r ) Gθ(u(i))(x(j) r , t(j) r ) ν 2Gθ(u(i))(x(j) r , t(j) r ) Linitial = 1 Nf NI Gθ(u(i))(x(j) I , 0) u(j)(x(j) I ) 2 , Lboundary = 1 Nf Nb Gθ(u(i))(0, t(j) b ) Gθ(u(i))(1, t(j) b ) 2 + Gθ(u(i))(x, t(j) b ) x|x=0 Gθ(u(i))(x, t(j) b ) x|x=1 B.1.4 2D Nonlinear Diffusion Equation The input training initial conditions are generated as a sum of Gaussian functions, parameterized as: i=1 Ai exp[ wi{(x xi)2 + (y yi)2}], (42) where Ai U(0.2, 0.5) are amplitudes, wi U(10, 20) are width parameters, and (xi, yi) U( 0.5, 0.5)2 are center coordinates. We also generate 100 unseen test initial conditions using this method. The nonlinear diffusion equation is then solved using explicit Adams method to obtain reference solutions on a uniform 101 101 spatial grid with 101 time points. Physics loss terms in equation (5) for this problem are: Lresidual = 1 Nf Nr Gθ(u(i))(x(j) r , t(j) r ) t(j) r α Gθ(u(i))(x(j) r , t(j) r ) Gθ(u(i))(x(j) r , t(j) r ) Linitial = 1 Nf NI Gθ(u(i))(x(j) I , 0) u(i)(x(j) I ) 2 , Lboundary = 1 Nf Nb Gθ(u(i))(x(j) b , t(j) b ) 2 . B.2 Training Details and Hyperparameters Both PI-Deep ONet and Sep ONet were trained by minimizing the physics loss (equation(4)) using gradient descent with the Adam optimizer (Kingma & Ba, 2014). The initial learning rate is 1 10 3 and decays by a factor of 0.9 every 1,000 iterations. Additionally, we resample input training functions and training points (including residual, initial, and boundary points) every 100 iterations. Across all benchmarks and on both models (Sep ONet and PI-Deep ONet), we apply Tanh activation for the branch net and Sine activation for the trunk net. We note that no extensive hyperparameter tuning was performed for either PI-Deep ONet or Sep ONet. The code in this study is implemented using JAX and Equinox libraries (Bradbury et al., 2018; Kidger & Garcia, 2021), and all training was performed on a single NVIDIA A100 GPU with 80 GB of memory. Training hyperparameters are provided in Table 5. Published in Transactions on Machine Learning Research (12/2024) Table 5: Training hyperparameters for different PDE benchmarks Hyperparameters \ PDEs Diffusion-reaction Advection Burgers 2D Nonlinear diffusion # of sensors 128 128 101 10201 (101 101) Network depth 5 6 7 7 Network width 50 100 100 128 # of training iterations 50k 120k 80k 80k Weight coefficients (λI / λb) 1 / 1 100 / 100 20 / 1 20 / 1 B.3 Complete Test Results We report the relative ℓ2 error, root mean squared error (RMSE), GPU memory usage and total training time as metrics to assess the performance of PI-Deep ONet and Sep ONet. Specifically, the mean and standard deviation of the relative ℓ2 error and RMSE are calculated over all functions in the test dataset. The complete test results are shown in Table 6 and Table 7. Table 6: Performance comparison of PI-Deep ONet and Sep ONet with varying number of training points (Nc) and fixed number of input training functions (Nf = 100). The - symbol indicates that results are not available due to out-of-memory issues. 
Equations Metrics Models 8d 16d 32d 64d 128d Diffusion-Reaction d = 2 Relative ℓ2 error (%) PI-Deep ONet 1.39 0.71 1.11 0.59 0.87 0.41 0.83 0.35 0.73 0.34 Sep ONet 1.49 0.82 0.79 0.35 0.70 0.33 0.62 0.28 0.62 0.26 RMSE ( 10 2) PI-Deep ONet 0.58 0.29 0.46 0.22 0.37 0.20 0.36 0.20 0.32 0.18 Sep ONet 0.62 0.28 0.35 0.22 0.32 0.23 0.28 0.20 0.29 0.21 Memory (GB) PI-Deep ONet 0.729 1.227 3.023 9.175 35.371 Sep ONet 0.715 0.717 0.715 0.717 0.719 Training time (hours) PI-Deep ONet 0.0433 0.0641 0.1497 0.7252 2.8025 Sep ONet 0.0403 0.0418 0.0430 0.0427 0.0326 Advection d = 2 Relative ℓ2 error (%) PI-Deep ONet 9.27 1.94 7.55 1.86 6.79 1.84 6.69 1.95 5.72 1.57 Sep ONet 14.29 2.65 11.96 2.17 6.14 1.58 5.80 1.57 4.99 1.40 RMSE ( 10 2) PI-Deep ONet 5.88 1.34 4.79 1.27 4.31 1.23 4.24 1.29 3.63 1.05 Sep ONet 9.06 1.88 7.58 1.55 3.90 1.09 3.69 1.07 3.17 0.95 Memory (GB) PI-Deep ONet 0.967 1.741 5.103 17.995 59.806 Sep ONet 0.713 0.715 0.715 0.715 0.719 Training time (hours) PI-Deep ONet 0.0787 0.1411 0.4836 2.3987 8.231 Sep ONet 0.0843 0.0815 0.0844 0.0726 0.0730 Burgers d = 2 Relative ℓ2 error (%) PI-Deep ONet 29.33 3.85 20.31 4.31 14.17 5.25 13.72 5.59 - Sep ONet 29.42 3.79 31.53 3.44 28.74 4.11 11.85 4.06 7.51 4.04 RMSE ( 10 2) PI-Deep ONet 4.19 2.79 2.82 1.86 2.23 2.10 2.20 2.13 - Sep ONet 4.18 2.74 4.44 2.81 4.11 2.76 1.80 1.60 1.23 1.44 Memory (GB) PI-Deep ONet 1.253 2.781 5.087 18.001 - Sep ONet 0.603 0.605 0.603 0.605 0.716 Training time (hours) PI-Deep ONet 0.1497 0.2375 0.6431 3.2162 - Sep ONet 0.0706 0.0719 0.0716 0.0718 0.0605 Nonlinear diffusion d = 3 Relative ℓ2 error (%) PI-Deep ONet 17.38 5.56 9.90 2.91 - - - Sep ONet 16.10 4.46 12.11 3.89 6.81 1.98 6.73 1.96 6.44 1.69 RMSE ( 10 2) PI-Deep ONet 1.86 0.62 1.04 0.23 - - - Sep ONet 1.72 0.49 1.29 0.37 0.72 0.19 0.71 0.17 0.68 0.15 Memory (GB) PI-Deep ONet 6.993 37.715 - - - Sep ONet 3.471 2.897 2.899 2.897 13.139 Training time (hours) PI-Deep ONet 0.5836 6.6399 - - - Sep ONet 0.1044 0.1069 0.1056 0.1456 0.5575 Published in Transactions on Machine Learning Research (12/2024) Table 7: Performance comparison of PI-Deep ONet and Sep ONet with varying number of input training functions (Nf) and fixed number of training points (Nc = 128d, d indicated by problem instance). The - symbol indicates that results are not available due to out-of-memory issues. 
Equations Metrics Models 5 10 20 50 100 Diffusion-Reaction d = 2 Relative ℓ2 error (%) PI-Deep ONet 34.54 27.83 4.23 2.52 1.72 1.00 0.91 0.46 0.73 0.34 Sep ONet 22.40 12.30 3.11 1.89 1.19 0.74 0.73 0.32 0.62 0.26 RMSE ( 10 2) PI-Deep ONet 14.50 9.04 1.75 0.90 0.71 0.37 0.40 0.28 0.32 0.18 Sep ONet 9.34 5.37 1.36 1.07 0.50 0.29 0.34 0.25 0.29 0.21 Memory (GB) PI-Deep ONet 2.767 5.105 9.239 17.951 35.371 Sep ONet 0.719 0.719 0.717 0.717 0.719 Training time (hours) PI-Deep ONet 0.1268 0.2218 0.5864 1.4018 2.8025 Sep ONet 0.0375 0.0390 0.0370 0.0317 0.0326 Advection d = 2 Relative ℓ2 error (%) PI-Deep ONet 9.64 2.91 8.77 2.23 7.57 1.98 6.69 1.93 5.72 1.57 Sep ONet 7.62 2.06 6.59 1.71 5.47 1.57 5.18 1.51 4.99 1.40 RMSE ( 10 2) PI-Deep ONet 6.11 1.90 5.55 1.51 4.80 1.33 4.24 1.28 3.63 1.05 Sep ONet 4.83 1.38 4.18 1.17 3.47 1.06 3.29 1.02 3.17 0.95 Memory (GB) PI-Deep ONet 3.021 5.611 9.707 34.511 59.806 Sep ONet 0.713 0.715 0.719 0.719 0.719 Training time (hours) PI-Deep ONet 0.3997 1.0766 1.9765 4.411 8.231 Sep ONet 0.0754 0.0715 0.0736 0.0720 0.0730 Burgers d = 2 Relative ℓ2 error (%) PI-Deep ONet 28.48 4.17 28.63 4.10 28.26 4.38 12.33 5.14 - Sep ONet 27.79 4.40 28.16 4.24 22.78 6.47 10.25 4.44 7.51 4.04 RMSE ( 10 2) PI-Deep ONet 4.09 2.77 4.11 2.78 4.07 2.78 1.96 1.92 - Sep ONet 4.01 2.75 4.05 2.76 3.30 2.55 1.65 1.64 1.23 1.44 Memory (GB) PI-Deep ONet 5.085 9.695 17.913 35.433 - Sep ONet 0.605 0.607 0.607 0.609 0.716 Training time (hours) PI-Deep ONet 0.5135 1.3896 2.6904 5.923 - Sep ONet 0.0725 0.0707 0.0703 0.0612 0.0605 Nonlinear diffusion d = 3 Relative ℓ2 error (%) PI-Deep ONet - - - - - Sep ONet 31.94 9.18 25.48 8.95 21.16 7.82 10.21 3.31 6.44 1.69 RMSE ( 10 2) PI-Deep ONet - - - - - Sep ONet 3.44 1.13 2.73 0.99 2.27 0.91 1.09 0.32 0.68 0.15 Memory (GB) PI-Deep ONet - - - - - Sep ONet 2.923 3.139 4.947 6.995 13.139 Training time (hours) PI-Deep ONet - - - - - Sep ONet 0.1175 0.1408 0.1849 0.3262 0.5575 Published in Transactions on Machine Learning Research (12/2024) Figure 4: Performance comparison of PI-Deep ONet and Sep ONet with Tan H trunk network activation functions, varying number of training points (Nc) and fixed number of input functions (Nf = 100). Results show test accuracy, GPU memory usage, and training time for four PDEs. Figure 5: Performance comparison of PI-Deep ONet and Sep ONet with Tan H trunk network activation functions, increasing number of input functions (Nf) and fixed number of training points (Nc = 128d, where d is the problem dimension). Note: PI-Deep ONet results for the (2+1)-dimensional diffusion equation are unavailable due to memory constraints. B.4 Ablation Studies B.4.1 Trunk Networks with Hyperbolic Tangent Activations In Figure 4 and Figure 5, we provide complete testing results repeating our experiments from Figure 2 and Figure 3 for all PDE examples, varying Nc and Nf, except we use hyperbolic tangenet (Tan H) activation functions for all hidden and output layers of the trunk networks in both PI-Deep ONet and Sep ONet. The results are very similar, indicating that alternative activation functions may be chosen for multi-layer trunk networks to maintain the universal approximation property in accordance with Theorem 1. Published in Transactions on Machine Learning Research (12/2024) Figure 6: Performance comparison of PI-Deep ONet and Sep ONet with varied number of input function sensors (branch input dimension). Note that we fix Nf = 20 for all experiments, Nc = 322 for (a)-(c), and Nc = 163 for (d). 
B.4.2 Varying the Input Function Discretization to the Branch Network

Given an input PDE configuration function $u$, recall that the branch network predicts coefficients $\beta_k = b_\psi(E(u))_k$ for $k = 1, \ldots, r$. Our studies in Section 4 use a simple encoder that measures the input function at points in its domain, $E(u) = (u(x_1), \ldots, u(x_m))$, with $x_1, x_2, \ldots, x_m \in K$. High-dimensional or highly oscillatory input functions may lead to unwieldy discretizations with large branch input dimension that affect training performance. Here, in Figure 6, we study the sensitivity of the PDE examples from Section 4 with respect to the number of input function sensors (branch input dimension). Note that we fix the number of input functions $N_f = 20$ for all experiments, while $N_c = 32^2$ for the diffusion-reaction, advection, and Burgers equations, and $N_c = 16^3$ for nonlinear diffusion. We find that the error curves converge to their minimum value using only a fraction of the number of input sensors used in the main text in Figure 2 and Figure 3. This indicates that the input functions we considered may be identified with a small number of points. Nevertheless, we find that training performance in terms of both memory consumption and training time is essentially constant with the number of sensors. This is because training complexity is mainly data-dominated by the need to compute high-order derivatives with respect to a large number of collocation points for evaluation of the physics loss, as discussed in Section 3.2.

C Complete Solution to Separation of Variables Example (7)

Recall the linear PDE system treated in Section 2.3:
$$\mathcal{M}[t]\, s(y) = \mathcal{L}_1[y_1]\, s(y) + \cdots + \mathcal{L}_d[y_d]\, s(y), \qquad (44)$$
where $\mathcal{M}[t] = \frac{d}{dt} + h(t)$ is a first-order differential operator in $t$, and $\mathcal{L}_1[y_1], \ldots, \mathcal{L}_d[y_d]$ are linear second-order ordinary differential operators of their respective variables $y_1, \ldots, y_d$ only. Furthermore, assume we are provided Robin boundary conditions in each variable and a separable initial condition $s(t = 0, y_1, \ldots, y_d) = \prod_{n=1}^{d} \phi_n(y_n)$ for functions $\phi_n(y_n)$ that satisfy the boundary conditions. Assuming a separable solution exists, $s(y) = T(t)\, Y_1(y_1) \cdots Y_d(y_d)$, the PDE can be decomposed in the following form:
$$\frac{\mathcal{M} T(t)}{T(t)} = \frac{\mathcal{L}_1 Y_1(y_1)}{Y_1(y_1)} + \cdots + \frac{\mathcal{L}_d Y_d(y_d)}{Y_d(y_d)}, \qquad (45)$$
where it is apparent that each term in the sequence is a constant, since each is a function of a single variable only. Consequently, we may solve each of the $\mathcal{L}_n$ terms independently using Sturm-Liouville theory. After we have found the associated eigenfunctions $(Y_n^{k_n})$ and eigenvalues $(\lambda_n^{k_n})$, we may manually integrate the left-hand side. Finally, we may decompose the separable initial condition into a product of sums of the orthonormal basis functions (eigenfunctions) of each variable. The resulting solution is given by
$$s(y) = s(t, y_1, \ldots, y_d) = \sum_{k = (k_1, \ldots, k_d)} B_k\, T^k(t) \prod_{n=1}^{d} Y_n^{k_n}(y_n),$$
$$T^k(t) := T^{(k_1, \ldots, k_d)}(t) = \exp\!\left( -\int_0^t h(\tau)\, d\tau + t \sum_{n=1}^{d} \lambda_n^{k_n} \right),$$
$$B_k := B^{(k_1, \ldots, k_d)} = \prod_{n=1}^{d} \frac{\left\langle Y_n^{k_n}(y_n),\, \phi_n(y_n) \right\rangle_n}{\left\langle Y_n^{k_n}(y_n),\, Y_n^{k_n}(y_n) \right\rangle_n}, \qquad \lambda_n^{k_n} = \frac{\mathcal{L}_n Y_n^{k_n}(y_n)}{Y_n^{k_n}(y_n)}, \quad n = 1, \ldots, d, \ k_n = 1, 2, \ldots
$$
Here, $k = (k_1, \ldots, k_d)$, with $k_n \in \{1, 2, \ldots\}$ and $n \in \{1, \ldots, d\}$, is a lumped index that counts over all possible products of eigenfunctions $Y_n^{k_n}$ with associated eigenvalues $\lambda_n^{k_n}$, and $\langle \cdot, \cdot \rangle_n$ is an appropriate inner product associated with the separated Hilbert space of the $\mathcal{L}_n$-th operator. To obtain equation (7) in the main manuscript, one only needs to break up the sum over all $k_n$ indices into a single ordered index.
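The separable structure above is exactly what SepONet exploits at evaluation time: a rank-$r$ expansion $s(t, x) \approx \sum_{k} \beta_k\, \tau_k(t)\, \zeta_k(x)$ is evaluated on a full tensor-product grid with one outer-product contraction, and temporal derivatives require differentiating only the one-dimensional temporal factors. The JAX sketch below is our own toy illustration, not the paper's implementation: the closed-form factors `tau` and `zeta` stand in for learned trunk networks, and all names and coefficients are illustrative.

```python
import jax
import jax.numpy as jnp

# Closed-form univariate factors standing in for trunk-network outputs.
def tau(t, k):
    # Temporal factors, shape (Nt, r).
    return jnp.exp(-(k ** 2) * t[:, None])

def zeta(x, k):
    # Spatial factors, shape (Nx, r).
    return jnp.sin(k * jnp.pi * x[:, None])

def separable_eval(beta, t, x, k):
    """Evaluate s(t, x) = sum_k beta_k * tau_k(t) * zeta_k(x) on the full
    tensor-product grid via a single outer-product contraction."""
    return jnp.einsum("k,tk,xk->tx", beta, tau(t, k), zeta(x, k))

r = 8
k = jnp.arange(1, r + 1, dtype=jnp.float32)
beta = 1.0 / k                       # illustrative coefficients
t = jnp.linspace(0.0, 1.0, 64)       # Nt = 64 temporal collocation points
x = jnp.linspace(0.0, 1.0, 128)      # Nx = 128 spatial collocation points

s = separable_eval(beta, t, x, k)    # solution values on the (64, 128) grid

# d/dt only touches the 1D temporal factors: forward-mode AD over the 64
# scalar time coordinates, then reuse the spatial factors by outer product.
dtau_dt = jax.vmap(jax.jacfwd(lambda ti: tau(ti[None], k)[0]))(t)   # (64, r)
ds_dt = jnp.einsum("k,tk,xk->tx", beta, dtau_dt, zeta(x, k))        # (64, 128)
```

Because forward-mode differentiation here acts on $N_t$ scalar inputs rather than on $N_t \times N_x$ collocation points, this sketch mirrors the mechanism behind SepONet's memory and runtime advantage when assembling physics losses.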
D Additional Experiments D.1 (2+1)-dimensional Navier-Stokes Equation Sep ONet s memory-efficient and fast-training advantages allow it to tackle more challenging physics-informed operator learning problems, which PI-Deep ONet may struggle to train on due to resource constraints. As an example, we consider a (2+1)-dimensional Navier-Stokes equation, previously investigated in the context of PINN (Cho et al., 2024; Wang et al., 2024): tω + s ω = 0.01 ω, x [0, 2π]2, t Γ, s = 0, x [0, 2π]2, t Γ, ω(x, 0) = ω0(x), s(x, 0) = s0(x), x [0, 2π]2, where s = (sx, sy) R2 is the velocity field, x = (x, y) denotes 2D spatial variables, Γ = [0, T] is the time window, and ω = s = xsy ysx is the vorticity. We aim to learn the solution operator that maps the initial velocity and vorticity field u(x) = (s0(x), ω0(x)) R3 to the solution s(x, t) R2 using Sep ONet, parameterized by θ = (ψ, ϕ1, ϕ2, ϕ3), which represents all the trainable parameters in the branch and trunk nets. The vector-valued velocity is approximated as: sx(u)(x, t) = sy(u)(x, t) = where r denotes the rank, βk = bψ(E(u))k is the k-th output of the branch net, and τ1,k = t1 ϕ1(t)k, τ2,k = t2 ϕ2(x)k, τ3,k = t3 ϕ3(y)k denote the k-th outputs of the three trunk nets. Loss function The physics loss for this problem are defined as: Lphysics = Lresidual + 5000Ldiv + 10000Linitial. (49) Published in Transactions on Machine Learning Research (12/2024) The specific loss terms are: Lresidual = 1 Nf Nr ( Gθ(u(i))(x(j) r , t(j) r )) t(j) r + s(i) ( Gθ(u(i))(x(j) r , t(j) r )) 0.01 ( Gθ(u(i))(x(j) r , t(j) r )) 2 , Ldiv = 1 Nf Nr Gθ(u(i))(x(j) r , t(j) r ) 2 , Linitial = 1 Nf NI Gθ(u(i))(x(j) I , 0) ω(i) 0 (x(j) I ) 2 + Gθ(u(i))(x(j) I , 0) s(i) 0 (x(j) I ) 2 . Note that periodic boundary conditions are enforced by applying the following positional encoding to the spatial variables: γ(x) = [1, sin(x), sin(2x), sin(3x), sin(4x), sin(5x), cos(x), cos(2x), cos(3x), cos(4x), cos(5x)] . (51) Sep ONet architecture The encoder E maps the initial condition u(x) to its point-wise evaluations at 128 128 3 sensors on a uniform 128 128 grid over [0, 2π]2. The branch network bψ is a CNN, starting with a 1 1 convolution that increases the channels from 3 to 16, followed by four residual blocks. Each residual block consists of two 3 3 convolutions with Ge LU activations; the first convolution in each block uses a stride of 2 to halve the spatial dimensions while doubling the number of channels. Skip connections employ 1 1 convolutions with a stride of 2 whenever there is a change in dimension. After flattening, a fully connected layer produces a vector of dimension 2r. The trunk networks tϕn are 7-layer modified MLPs (Wang et al., 2021a), each with 128 neurons per hidden layer, an input size of 11, and a rank/output size of 256/512, using Tan H activations. The initial velocities are sampled from a Gaussian random field with a maximum velocity of 5. Training settings We consider learning the solution operator within two time windows: T = 0.1 and T = 1. Two separate Sep ONet models were trained, each for 100,000 iterations using the Adam optimizer, to minimize the physics loss (49) within their respective time windows. The number of residual points was set to Nr = 256 256 32 (Nx Ny Nt), obtained by randomly sampling 256, 256, and 32 points along x, y, and t axes, respectively, and constructing a mesh grid via the tensor product. The number of initial points, NI = 128 128, corresponds to a uniform mesh grid. 
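The residual coordinates above enter the spatial trunk networks through the periodic encoding $\gamma(x)$ of equation (51), which matches the trunk input dimension of 11 noted earlier. A minimal sketch of that encoding is shown below; the function name and the coordinate-wise application are our own assumptions.

```python
import jax.numpy as jnp

def periodic_encoding(x, n_freq=5):
    """gamma(x) = [1, sin(x), ..., sin(n_freq * x), cos(x), ..., cos(n_freq * x)].

    Applied to each spatial coordinate so that the learned basis functions
    are 2*pi-periodic by construction (illustrative sketch)."""
    ks = jnp.arange(1, n_freq + 1)
    return jnp.concatenate(
        [jnp.ones_like(x)[..., None],
         jnp.sin(ks * x[..., None]),
         jnp.cos(ks * x[..., None])],
        axis=-1,
    )

# Example: encode 256 x-axis collocation points into 11-dimensional trunk inputs.
x = jnp.linspace(0.0, 2.0 * jnp.pi, 256)
gx = periodic_encoding(x)   # shape (256, 11)
```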
Initial velocities were sampled from a Gaussian random field with a maximum velocity of 5. Both the initial conditions and collocation points were resampled every 100 iterations. We varied the number of input functions $N_f$ to examine how performance scales as more training data is used.

Evaluation The model was evaluated on 100 unseen initial conditions sampled from the same Gaussian random field. Reference solutions for both time windows were obtained using the JAX-CFD solver (Kochkov et al., 2021) on a uniform 128 × 128 × 10 ($N_x \times N_y \times N_t$) grid.

Results The results for the two time windows are shown in Figure 7. We find that for $T = 0.1$ we achieve a very low relative $\ell_2$ error of 2.5%. For $T = 1$, the error increases to 40%. We think that improving the error at the longer time scale represents an interesting direction for future work; our architecture and implementation choices were not optimized for this example.

Figure 7: Navier-Stokes equation results with SepONet. Here, we vary the number of input functions $N_f$ and keep the number of collocation points fixed. We consider two cases, $T = 0.1$ and $T = 1$, corresponding to the length of the time window.

D.2 Linear Heat Equation Example

SepONet is motivated by the classical method of separation of variables, which is often employed to solve linear partial differential equations (PDEs). To illustrate the connection between these approaches, consider the linear heat equation:
$$\frac{\partial s}{\partial t} = \frac{1}{\pi^2} \frac{\partial^2 s}{\partial x^2}, \quad (x, t) \in (0, 1) \times (0, 1], \qquad s(x, 0) = u(x), \quad x \in (0, 1), \qquad s(0, t) = s(1, t) = 0, \quad t \in (0, 1). \qquad (52)$$
The goal is to solve this equation for various initial conditions $u(x)$ using both the separation of variables technique and the SepONet method, allowing for an intuitive comparison between the two.

D.2.1 Separation of Variables Technique

We seek a solution in the form
$$s(x, t) = X(x) T(t) \qquad (53)$$
for functions $X, T$ to be determined. Substituting (53) into (52) yields
$$\frac{X''(x)}{X(x)} = \frac{\pi^2 T'(t)}{T(t)} = -\lambda \qquad (54)$$
for some constant $\lambda$. To satisfy the boundary conditions, $X$ must solve the eigenvalue problem
$$X''(x) + \lambda X(x) = 0, \quad x \in (0, 1), \qquad X(0) = X(1) = 0, \qquad (55)$$
and $T$ must solve the ODE problem
$$T'(t) = -\frac{\lambda}{\pi^2} T(t). \qquad (56)$$
The eigenvalue problem (55) has a sequence of solutions
$$\lambda_k = (k\pi)^2, \quad X_k(x) = \sin(k\pi x), \quad \text{for } k = 1, 2, \ldots \qquad (57)$$
For any $\lambda$, the ODE solution for $T$ is $T(t) = A e^{-\lambda t / \pi^2}$ for some constant $A$. Thus, for each eigenfunction $X_k$ with corresponding eigenvalue $\lambda_k$, we obtain a solution $T_k$ such that the function
$$s_k(x, t) = X_k(x) T_k(t) \qquad (58)$$
satisfies the differential equation and boundary conditions in (52). In fact, an infinite series of the form
$$\sum_{k=1}^{\infty} X_k(x) T_k(t) = \sum_{k=1}^{\infty} A_k e^{-k^2 t} \sin(k\pi x) \qquad (59)$$
will also be a solution satisfying the differential operator and boundary conditions of the heat equation (52), subject to appropriate convergence assumptions on this series. Now, imposing $s(x, 0) = u(x)$, we can choose the coefficients
$$A_k = 2 \int_0^1 \sin(k\pi x)\, u(x)\, dx \qquad (60)$$
so that (59) is the exact solution of the heat equation (52).

D.2.2 SepONet Method

In this section, we apply the SepONet framework to solve the linear heat equation (52) and compare the basis functions it learns with those derived from the classical separation of variables method. Recall that a SepONet, parameterized by $\theta$, approximates the solution operator of (52) as follows:
$$G_\theta(u)(x, t) = \sum_{k=1}^{r} \beta_k\big(u(x_1), u(x_2), \ldots, u(x_{128})\big)\, \tau_k(t)\, \zeta_k(x), \qquad (61)$$
where x1, x2, . . .
, x128 are 128 equi-spaced sensors in [0, 1], βk is the k-th output of the branch net, and the basis functions τk(t) and ζk(x) are the k-th outputs of two independent trunk nets. Training settings The branch and trunk networks each have a width of 5 and a depth of 50. To determine the parameters θ, we trained Sep ONet for 80,000 iterations, minimizing the physics loss. Specifically, we set λI = 20, λb = 1, Nf = 100, and Nc = 1282 in the physics loss. The training functions (initial conditions) u(i) Nf i=1 were generated from a Gaussian random field (GRF) N 0, 252 + 52I 4 using the Chebfun package (Driscoll et al., 2014), ensuring zero Dirichlet boundary conditions. Additional training settings are detailed in Appendix B.2 of the main text. Evaluation We evaluated the model on 100 unseen initial conditions sampled from the same GRF, using the forward Euler method to obtain reference solutions on a 128 128 uniform spatio-temporal grid. Impact of the rank r Since τk(t) and ζk(x) are independent of the initial condition u, learning an expressive and rich set of basis functions is crucial for Sep ONet to generalize to unseen initial conditions. To investigate the impact of the rank r on the generalization error, we trained Sep ONet with ranks ranging from 1 to 50. The mean RMSE between Sep ONet s predictions and the reference solutions over 100 unseen test initial conditions was reported. For comparison, we also computed the mean RMSE of the truncated analytical solution at rank r for r from 1 to 15. The results are presented in Figure 8. As r increases, the truncated analytical solution quickly converges to the reference solution. The nonzero RMSE arises due to numerical errors in computing the coefficients Ak and the inherent inaccuracies of the forward Euler method used to generate the reference solution. For Sep ONet, we observed that when r = 1, 2, the mean RMSE aligns closely with that of the truncated solution. However, as r increases beyond that point, the error decreases more gradually, stabilizing around r = 50. This indicates that Sep ONet may not necessarily learn the exact same basis functions as those from the truncated analytical solution. Instead, a higher rank r allows Sep ONet to develop its own set of basis functions, achieving similar accuracy to the truncated solution. Sep ONet basis functions The learned basis functions for different ranks r are visualized in Figure 9 to Figure 13. At r = 1, Sep ONet learns basis functions that closely resemble the first term of the truncated solution. For r = 2, the learned functions are quite similar to the first two terms of the truncated series. However, when r = 5, the basis functions diverge from the truncated solution series, although some spatial components still resemble sinusoidal functions and the temporal components remain monotonic. As r increases further, Sep ONet continues to improve in accuracy, though the learned basis functions increasingly differ from the truncated series, confirming Sep ONet s ability to accurately approximate the solution using its own learned basis functions. Published in Transactions on Machine Learning Research (12/2024) Figure 8: Comparison of RMSE between the truncated analytical solution and Sep ONet predictions for varying rank r. The truncated analytical solution quickly converges, while Sep ONet shows a slower decay in error, converging around r = 50. Figure 9: Learned basis functions τk(t) and ζk(x) for r = 1. 
Sep ONet learns the same basis functions as the first term of the truncated solution. E Visualization of Sep ONet Predictions In this section, we showcase the performance of trained Sep ONets in predicting solutions for PDEs under previously unseen configurations. The Sep ONets were trained using Nf = 100 and Nc = 128d, where d denotes the dimensionality of the PDE problem. The prediction results are presented in Figure 14 to Figure 17. Published in Transactions on Machine Learning Research (12/2024) Figure 10: Learned basis functions τk(t) and ζk(x) for r = 2. Sep ONet learns very similar basis functions as the first two terms of the truncated solution. Figure 11: Learned basis functions τk(t) and ζk(x) for r = 5. Figure 12: Learned basis functions τk(t) and ζk(x) for r = 10. Published in Transactions on Machine Learning Research (12/2024) Figure 13: Learned basis functions τk(t) and ζk(x) for r = 50. Published in Transactions on Machine Learning Research (12/2024) Figure 14: (1+1)-d Diffusion-reaction equation. Figure 15: (1+1)-d Advection equation. Published in Transactions on Machine Learning Research (12/2024) Figure 16: (1+1)-d Burgers equation. Published in Transactions on Machine Learning Research (12/2024) Figure 17: (2+1)-d Nonlinear diffusion equation. Two snapshots at t = 0.5 and t = 1 are presented.