# U-NO: U-shaped Neural Operators

Published in Transactions on Machine Learning Research (04/2023)

Md Ashiqur Rahman (rahman79@purdue.edu), Department of Computer Science, Purdue University
Zachary E. Ross (zross@caltech.edu), Seismological Laboratory, California Institute of Technology
Kamyar Azizzadenesheli (kamyara@nvidia.com), NVIDIA Corporation

Reviewed on OpenReview: https://openreview.net/forum?id=j3oQF9coJd

Neural operators generalize classical neural networks to maps between infinite-dimensional spaces, e.g., function spaces. Prior works on neural operators proposed a series of novel methods to learn such maps and demonstrated unprecedented success in learning solution operators of partial differential equations. Due to their close proximity to fully connected architectures, these models mainly suffer from high memory usage and are generally limited to shallow deep learning models. In this paper, we propose the U-shaped Neural Operator (U-NO), a U-shaped memory-enhanced architecture that allows for deeper neural operators. U-NOs exploit the problem structures in function predictions and demonstrate fast training, data efficiency, and robustness with respect to hyperparameter choices. We study the performance of U-NO on PDE benchmarks, namely, Darcy's flow law and the Navier-Stokes equations. We show that U-NO results in an average of 26% and 44% prediction improvement on Darcy's flow and turbulent Navier-Stokes equations, respectively, over the state of the art. On the Navier-Stokes 3D spatio-temporal operator learning task, we show U-NO provides a 37% improvement over state-of-the-art methods.

Keywords: Neural Operators, Partial Differential Equations

1 Introduction

Conventional deep learning research is mainly dominated by neural networks that allow for learning maps between finite-dimensional spaces.
Such developments have resulted in significant advancements in many practical settings, from vision and natural language processing to robotics and recommendation systems (Levine et al., 2016; Brown et al., 2020; Simonyan & Zisserman, 2014). Recently, in the field of scientific computing, deep neural networks have been used to great success, particularly in simulating physical systems governed by differential equations (Mishra & Molinaro, 2020; Brigham, 1988; Chandler & Kerswell, 2013; Long et al., 2015; Chen et al., 2019; Raissi & Karniadakis, 2018). Many real-world problems, however, require learning maps between infinite-dimensional function spaces (Evans, 2010). Neural operators are generalizations of neural networks that allow learning maps between infinite-dimensional spaces, including function spaces (Li et al., 2020b). Neural operators are universal approximators of operators and have shown numerous applications in tackling problems in partial differential equations (Kovachki et al., 2021b). Neural operators, in an abstract form, consist of a sequence of linear integral operators, each followed by a nonlinear point-wise operator. Various neural operator architectures have been proposed, most of which focus on Nyström-type approximations of the inner linear integral operator (Li et al., 2020a;c; Gupta et al., 2021; Tripura & Chakraborty, 2023). For instance, Li et al. (2020a) suggests using convolution kernel integration and proposes the Fourier neural operator (FNO), a model that computes the linear integral operation in the Fourier domain.

Figure 1: U-NO architecture. a is an input function, u is the output. Orange circles are point-wise operators, rectangles denote general neural operator layers of the form $\int \kappa(x,y)v(y)\,d\mu(y) + b(x)$, and smaller blue circles denote concatenations in function spaces.
The linear integral operation can also be approximated using the discrete wavelet transform (Gupta et al., 2021; Tripura & Chakraborty, 2023). FNO, however, approximates the integration using the fast Fourier transform (Brigham, 1988) and enjoys desirable approximation-theoretic guarantees for continuous operators (Kovachki et al., 2021a). The efficient use of the built-in fast Fourier transform makes the FNO approach among the fastest neural operator architectures. Tran et al. (2021) proposes a slight modification to the integral operator of FNO by factorizing the multidimensional Fourier transform along each dimension. These developments have shown success in tackling problems in seismology and geophysics, modeling the so-called Digital Twin Earth, and modeling fluid flow in carbon capture and storage experiments (Pathak et al., 2022; Yang et al., 2021; Wen et al., 2021; Shui et al., 2020; Li et al., 2022). These are crucial components in dealing with climate change and natural hazards. Many of the initial neural operator architectures are inspired by fully-connected neural networks and result in models with high memory demand that, despite the successes of deep neural networks, prohibit very deep neural operators. Though there are considerable efforts in finding more effective integral operators, works investigating suitable architectures for neural operators are scarce. To alleviate this problem, we propose the U-shaped Neural Operator (U-NO) by devising integral operators that map between functions over different domains. We provide a rigorous mathematical formulation to adapt the U-Net architecture, developed for neural networks on finite-dimensional spaces, to neural operators mapping between function spaces (Ronneberger et al., 2015).
Following the U-shaped architecture, U-NO first progressively maps the input function to functions defined on smaller domains (encoding) and later reverses this operation to generate an appropriate output function (decoding), with skip connections from the encoder part (see Fig. 1). This efficient contraction and subsequent expansion of domains allows for designing over-parameterized yet memory-efficient models, which regular architectures cannot achieve. Over the first half (encoding) of the U-NO layers, we map each input function space to vector-valued function spaces with steadily shrinking domains, all the while increasing the dimension of the co-domains, i.e., the output space of each function. Throughout the second half (decoding) of the U-NO layers, we gradually expand the domain of each intermediate function space while reducing the dimension of the output function co-domains, and employ skip connections from the first half to construct vector-valued functions with larger co-dimension. The gradual contraction of the domain of the input function space in each layer allows for the encoding of function spaces with smaller domains. The skip connections allow for the direct passage of information between domain sizes, bypassing the bottleneck of the first half, i.e., the functions with the smallest domain (Long et al., 2015). The U-NO architecture can be adapted to the existing operator learning techniques (Gupta et al., 2021; Tripura & Chakraborty, 2023; Li et al., 2020a; Tran et al., 2021) to achieve a more effective model. In this work, for the implementation of the inner integral operator of U-NO, we adopt the Fourier transform-based integration method developed in FNO. As U-NO contracts the domain of each input function at the encoder layers, we progressively define functions over smaller domains.
At a fixed sampling rate (resolution of the discretization of the domain), this requires progressively fewer data points to represent the functions. This makes U-NO a memory-efficient architecture, allowing for deeper and more highly parameterized neural operators compared with prior works. It is known that over-parameterization is one of the essential components of architecture design in deep learning (He et al., 2016), crucial for performance (Belkin et al., 2019) and optimization (Du et al., 2019). Recent works also show that model over-parameterization improves generalization performance (Neyshabur et al., 2017; Zhang et al., 2021; Dar et al., 2021). The U-NO architecture allows efficient training of over-parameterized models with a smaller memory footprint and opens neural operator learning to deeper and more highly parameterized methods. We establish our empirical study on Darcy's flow and the Navier-Stokes equations, two PDEs which have served as benchmarks for the study of neural operator models. We compare the performance of U-NO against the state-of-the-art FNO. We empirically show that the advanced structure of U-NO allows for much deeper neural operator models with smaller memory usage. We demonstrate that U-NO achieves average performance improvements of 26% on high-resolution simulations of the Darcy flow equation and 44% on the Navier-Stokes equations, with a best improvement of 51%. On the Navier-Stokes 3D spatio-temporal operator learning task, for which the input functions are defined on the 3D spatio-temporal domain, we show U-NO provides a 37% improvement over the state-of-the-art FNO model. U-NO also outperforms the baseline FNO on the zero-shot super-resolution experiment. It is important to note that U-NO is the first neural operator trained for mapping from function spaces with 3D domains.
Prior studies on 3D domains either transform the input function space with the 3D spatio-temporal domain to another auxiliary function space with a 2D domain for training purposes, resulting in an operator that is time-discretization dependent, or modify the input function by extending the domain and co-domain (Li et al., 2020a). We further show that U-NO allows for much deeper models (3×) with far more parameters (25×), while still providing some performance improvement (Appendix A.2). Such a model can be trained with only a few thousand data points, emphasizing the data efficiency of problem-specific neural operator architecture design. Moreover, we empirically study the sensitivity of U-NO to hyperparameters (Appendix A.2), where we demonstrate the superiority of this model over FNO. We observe that U-NO is much faster to train (Appendix A.6) and also much easier to tune. Furthermore, since U-NO is a model that gradually contracts (expands) the domain (co-domain) of function spaces, we study a variant which applies these transformations more aggressively. A detailed empirical analysis of such a variant of U-NO shows that the architecture is quite robust with respect to domain/co-domain transformation hyperparameters.

2 Neural Operator Learning

Let $\mathcal{A}$ and $\mathcal{U}$ denote input and output function spaces such that, for any vector-valued function $a \in \mathcal{A}$, $a : D_A \to \mathbb{R}^{d_A}$ with $D_A \subset \mathbb{R}^d$, and for any vector-valued function $u \in \mathcal{U}$, $u : D_U \to \mathbb{R}^{d_U}$ with $D_U \subset \mathbb{R}^d$. Given a set of $N$ data points $\{(a_j, u_j)\}_{j=1}^N$, we aim to train a neural operator $G_\theta : \mathcal{A} \to \mathcal{U}$, parameterized by $\theta$, to learn the underlying map from $a$ to $u$.
Neural operators, in an abstract sense, are mainly constructed using a sequence of non-linear operators $G_i$ that are linear integral operators followed by point-wise non-linearities, i.e., for an input function $v_i$ at the $i$th layer, $G_i : v_i \mapsto v_{i+1}$ is computed as follows,

$$G_i v_i(x) := \sigma\left(\int_{D_i} \kappa_i(x, y)\, v_i(y)\, d\mu_i(y) + W_i v_i(x)\right) \quad (1)$$

Here, $\kappa_i$ is a kernel function that acts, together with the integral, as a global linear operator with measure $\mu_i$, $W_i$ is a matrix that acts as a point-wise operator, and $\sigma$ is the point-wise non-linearity. One can explicitly add a bias function $b_i$ separately. Together, these operations constitute $G_i$, i.e., a nonlinear operator between a function space of $d_i$-dimensional vector-valued functions with domain $D_i \subset \mathbb{R}^d$ and a function space of $d_{i+1}$-dimensional vector-valued functions with domain $D_{i+1} \subset \mathbb{R}^d$. Neural operators are often accompanied by a point-wise operator $P$ that constitutes the first block of the neural operator and mainly serves as a lifting operator applied directly to the input function, and a point-wise operator $Q$, for the last block, that mainly serves as a projection operator to the output function space. Previous deployments of neural operators have studied settings in which the domains and co-domains of function spaces are preserved throughout the layers, i.e., $D = D_i = D_{i+1}$ and $d = d_i = d_{i+1}$ for all $i$. These models, e.g., FNO, impose that each layer is a map between function spaces with identical domain and co-domain spaces. Therefore, at any layer $i$, the input function is $v_i : D \to \mathbb{R}^d$ and the output function is $v_{i+1} : D \to \mathbb{R}^d$, defined on the same domain $D$ and co-domain $\mathbb{R}^d$. This is one of the main reasons for the high memory demands of previous implementations of neural operators, which arise primarily from the global integration step.
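For intuition, Eq. (1) can be discretized on a sampled grid: the integral becomes a quadrature sum over grid points, and $W_i$ acts point-wise. The following is a minimal NumPy sketch, with an illustrative Gaussian kernel standing in for the learned $\kappa_i$ (all names and the kernel choice here are hypothetical, not the paper's implementation):

```python
import numpy as np

def neural_operator_layer(v, y, x, kappa, W, sigma=np.tanh):
    """One layer G_i v(x) = sigma( sum_j kappa(x, y_j) v(y_j) mu_j + W v(x) ).

    v     : (n, d_in)  function samples at input grid points y
    y     : (n,)       input grid points (1D domain for simplicity)
    x     : (n,)       output grid points (taken equal to y here)
    kappa : callable (x, y) -> (d_out, d_in) kernel matrix
    W     : (d_out, d_in) point-wise linear map
    """
    mu = 1.0 / len(y)                          # uniform quadrature weights
    out = np.zeros((len(x), W.shape[0]))
    for a, xa in enumerate(x):
        integral = sum(kappa(xa, yb) @ v[b] for b, yb in enumerate(y)) * mu
        out[a] = integral + W @ v[a]           # residual term assumes x == y
    return sigma(out)

# toy example: a scalar-valued function lifted to 2 channels
rng = np.random.default_rng(0)
y = np.linspace(0, 1, 32)
v = np.sin(2 * np.pi * y)[:, None]                        # (32, 1)
kappa = lambda x_, y_: np.exp(-(x_ - y_) ** 2) * np.ones((2, 1))
W = rng.standard_normal((2, 1))
v_next = neural_operator_layer(v, y, y, kappa, W)
print(v_next.shape)  # (32, 2)
```

The direct quadrature above costs $O(n^2)$ kernel evaluations per layer, which is exactly the expense that Fourier-domain and Nyström-type approximations are designed to avoid.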
In this work, we avoid this problem: as we move to the innermost layers of the neural operator, we contract the domain of each layer while increasing the dimension of the co-domains. As a result, we sequentially encode the input functions into functions with smaller domains, learning compact representations and performing integral operations over much smaller domains.

3 A U-shaped Neural Operator (U-NO)

In this section, we introduce the U-NO architecture, which is composed of several standard elements from prior works on neural operators, along with U-shaped additions tailored to the structure of function spaces (and maps between them). Given an input function $a : D_A \to \mathbb{R}^{d_A}$, we first apply a point-wise operator $P$ to $a$ and compute $v_0 : D_A \to \mathbb{R}^{d_0}$. The point-wise operator $P$ is parameterized by a function $P_\theta : \mathbb{R}^{d_A} \to \mathbb{R}^{d_0}$ and acts as $v_0(x) = P_\theta(a(x))$, $x \in D_0$, where $D_0 = D_A$. The function $P_\theta$ can be a matrix or, more generally, a deep neural network. For the purpose of this paper, we take $d_0 \geq d_A$, making $P$ a lifting operator. Then a sequence of $L_1$ non-linear integral operators is applied to $v_0$, $G_i : \{v_i : D_i \to \mathbb{R}^{d_{v_i}}\} \to \{v_{i+1} : D_{i+1} \to \mathbb{R}^{d_{v_{i+1}}}\}$ for $i \in \{0, 1, 2, \ldots, L_1\}$, where each $D_i \subset \mathbb{R}^d$ is a measurable set accompanied by a measure $\mu_i$. Applying this sequence of operators results in a sequence of intermediate functions $\{v_i : D_i \to \mathbb{R}^{d_{v_i}}\}_{i=1}^{L_1+1}$ in the encoding part of U-NO. In this study, without loss of generality, we choose the Lebesgue measure $\mu$ for the $\mu_i$'s. Over these first $L_1$ layers, i.e., the encoding half of U-NO, we map the input function to a set of vector-valued functions with increasingly contracted domains and higher-dimensional co-domains, i.e., for each $i$ we have $\mu(D_i) \geq \mu(D_{i+1})$ and $d_{v_{i+1}} \geq d_{v_i}$. Then, given $v_{L_1+1}$, another sequence of $L_2$ non-linear integral operator layers (i.e., the decoder) is applied.
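The memory effect of the encoding half can be illustrated with a quick calculation: at a fixed sampling rate, halving the domain side halves the grid points per dimension, even as the co-domain (channel) dimension grows. The factors below are chosen for illustration and are not the paper's exact configuration:

```python
# Activation sizes through a hypothetical 3-layer U-NO encoder at a fixed
# sampling rate: each contraction halves the grid points per dimension while
# the co-domain (channel) dimension doubles.
res, channels = 64, 32            # input discretization and lifted width
for layer in range(3):
    res = res // 2                # domain contraction at fixed sampling rate
    channels = channels * 2       # co-domain expansion
    print(layer, res, channels, res * res * channels)

# The activation count shrinks by 2x per layer: 64*64*32 = 131072 samples at
# the input vs 8*8*256 = 16384 after three contractions.
```

Because the per-point channel growth (2x) is outpaced by the grid shrinkage (4x per 2D contraction), the innermost layers are the cheapest, which is what permits stacking many of them.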
In this stage, the operators gradually expand the domain size and decrease the co-domain dimension to ultimately match the domain of the target function, i.e., for each $i$ in these layers, we have $\mu(D_{i+1}) \geq \mu(D_i)$ and $d_{v_i} \geq d_{v_{i+1}}$. To include the skip connections from the encoder to the decoder (see Fig. 1), the operator $G_{L_1+i}$ in the decoder takes a vector-wise concatenation of both $v_{L_1+i}$ and $v_{L_1-i}$ as its input. The vector-wise concatenation of $v_{L_1+i}$ and $v_{L_1-i}$ is a function $\tilde{v}_{L_1+i} : D_{L_1+i} \to \mathbb{R}^{d_{v_{L_1-i}} + d_{v_{L_1+i}}}$ with $\tilde{v}_{L_1+i}(x) = \left(v_{L_1-i}(x), v_{L_1+i}(x)\right)$, $x \in D_{L_1+i}$. $\tilde{v}_{L_1+i}$ constitutes the input to the operator $G_{L_1+i}$, and we compute the output function of the next layer as $v_{L_1+i+1} = G_{L_1+i}\tilde{v}_{L_1+i}$. Here, for simplicity, we assumed $D_{L_1+i} = D_{L_1-i}$.¹

¹More generally, the concatenation can be defined with a map between the domains $m : D_{L_1+i} \to D_{L_1-i}$ as $\tilde{v}_{L_1+i}(x) = \left(v_{L_1-i}(m(x)), v_{L_1+i}(x)\right)$.

Computing $v_{L+1}$ for $L = L_1 + L_2$, we conclude the U-NO architecture with another point-wise operator $Q$ parameterized by $Q_\theta : \mathbb{R}^{d_{L+1}} \to \mathbb{R}^{d_U}$ such that for $u = Qv_{L+1}$ we have $u(x) = Q_\theta(v_{L+1}(x))$, $x \in D_{L+1}$, where $D_{L+1} = D_U$. In this paper we choose $d_{L+1} \geq d_U$ such that $Q$ is a projection operator. In the next section, we instantiate the U-NO architecture for two benchmark operator learning tasks.

4 Empirical Studies

In this section, we describe the problem settings, implementations, and empirical results.

4.1 Darcy Flow Equation

Darcy's law is the simplest model for fluid flow in a porous medium. The 2-d Darcy flow can be described as a second-order linear elliptic equation with a Dirichlet boundary condition in the following form,

$$-\nabla \cdot (a(x) \nabla u(x)) = f(x), \quad x \in D$$
$$u(x) = 0, \quad x \in \partial D$$

where $a \in \mathcal{A} \subset L^\infty(D; \mathbb{R}_+)$ represents the diffusion coefficient, $u \in \mathcal{U} \subset H^1_0(D; \mathbb{R})$ is the velocity function, and $f \in \mathcal{F} = H^{-1}(D; \mathbb{R})$ is the forcing function. $H$ represents the Sobolev space with $p = 2$.
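The diffusion coefficients in this benchmark (defined precisely in the data-preparation paragraph that follows) are piecewise-constant thresholdings of a Gaussian random field with covariance $(-\Delta + 9I)^{-2}$. Such a field can be sampled spectrally; the sketch below uses a periodic FFT approximation for simplicity (the benchmark imposes zero Neumann boundary conditions, so this is an illustrative stand-in, not the exact data pipeline):

```python
import numpy as np

def sample_darcy_coefficient(n=64, tau2=9.0, alpha=2.0, seed=0):
    """Sample a ~ psi_# N(0, (-Lap + tau2*I)^(-alpha)) on an n x n grid.

    Periodic spectral approximation: the eigenvalues of -Lap on the unit torus
    are (2*pi*|k|)^2, so the Gaussian field has Fourier coefficients with
    standard deviation ((2*pi*|k|)^2 + tau2)^(-alpha/ ... ); here we apply the
    square root of the covariance spectrum to white noise.
    """
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n, d=1.0 / n)                  # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    cov_spectrum = ((2 * np.pi) ** 2 * (kx ** 2 + ky ** 2) + tau2) ** (-alpha)
    noise = rng.standard_normal((n, n))
    field = np.real(np.fft.ifft2(np.sqrt(cov_spectrum) * np.fft.fft2(noise)))
    return np.where(field >= 0, 12.0, 3.0)            # psi: 12 on [0, inf), 3 on (-inf, 0)

a = sample_darcy_coefficient()
print(a.shape)  # (64, 64)
```

The thresholding map $\psi$ makes the coefficient field binary-valued, which is what gives the Darcy benchmark its characteristic two-phase media.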
We take $D = (0,1)^2$ and aim to learn the solution operator $G^\dagger$, which maps a diffusion coefficient function $a$ to the solution $u$, i.e., $u = G^\dagger(a)$. Please note that due to the Dirichlet boundary condition, we have the value of the solution $u$ along the boundary $\partial D$. For dataset preparation, we define $\mu$ as the pushforward of a Gaussian measure under $\psi$, i.e., $\mu = \psi_\# N(0, (-\Delta + 9I)^{-2})$, where the Laplacian is equipped with zero Neumann boundary conditions, i.e., its normal derivative is 0 along the boundary. $N(0, (-\Delta + 9I)^{-2})$ denotes a Gaussian measure with covariance operator $(-\Delta + 9I)^{-2}$. The function $\psi$ is defined as

$$\psi(x) = \begin{cases} 3 & \text{if } x < 0 \\ 12 & \text{if } x \geq 0 \end{cases}$$

The diffusion coefficients $a(x)$ are generated according to $a \sim \mu$, and we fix $f(x) = 1$. The solutions are obtained using a second-order finite difference method (Larsson & Thomée, 2003) on a uniform 421×421 grid over $(0,1)^2$, and solutions at any other resolution are down-sampled from the high-resolution data. This setup is the same as the benchmark setup in prior works.

4.2 Navier-Stokes Equation

The Navier-Stokes equations describe the motion of fluids, taking into account the effects of viscosity and external forces. Among the different formulations, we consider the vorticity-streamfunction formulation of the 2-d Navier-Stokes equations for a viscous, incompressible fluid on the unit torus ($\mathbb{T}^2$), which can be described as,

$$\partial_t w(x,t) + \nabla^\perp\psi \cdot \nabla w(x,t) = \nu \Delta w(x,t) + g(x), \quad x \in \mathbb{T}^2,\ t \in (0, \infty),$$
$$-\Delta\psi = w, \quad x \in \mathbb{T}^2,\ t \in (0, \infty),$$
$$w(x, 0) = w_0(x), \quad x \in \mathbb{T}^2.$$

Here $u : \mathbb{R}_+ \times \mathbb{T}^2 \to \mathbb{R}^2$ is the velocity field, and $w$ is the out-of-plane component of the vorticity field $\nabla \times u$ (curl of $u$). Since we are considering 2-D flow, the velocity $u$ can be written in components as $\left(u_1(x_1, x_2), u_2(x_1, x_2), 0\right)$, $(x_1, x_2) \in \mathbb{T}^2$, and it follows that $\nabla \times u = (0, 0, w)$.
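The streamfunction relations used here can be checked numerically with spectral derivatives on the torus: with the convention $u = (\partial_{x_2}\psi, -\partial_{x_1}\psi)$, solving $-\Delta\psi = w$ in Fourier space and taking the scalar curl $\partial_{x_1} u_2 - \partial_{x_2} u_1$ recovers $w$ exactly. A quick numerical check (a verification sketch, not the paper's solver):

```python
import numpy as np

n = 64
k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # wavenumbers on the unit torus
kx, ky = np.meshgrid(k, k, indexing="ij")
ksq = kx ** 2 + ky ** 2
ksq[0, 0] = 1.0                                 # avoid division by zero at the mean mode

# a smooth, mean-zero vorticity field
x = np.linspace(0, 1, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
w = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)

w_hat = np.fft.fft2(w)
psi_hat = w_hat / ksq                           # solve -Lap psi = w spectrally
psi_hat[0, 0] = 0.0
u1 = np.real(np.fft.ifft2(1j * ky * psi_hat))   # u1 =  d_y psi
u2 = np.real(np.fft.ifft2(-1j * kx * psi_hat))  # u2 = -d_x psi
curl = np.real(np.fft.ifft2(1j * kx * np.fft.fft2(u2) - 1j * ky * np.fft.fft2(u1)))
print(np.allclose(curl, w, atol=1e-10))
```

Since the test field contains only a handful of Fourier modes, the round trip is exact to machine precision, confirming that $-\Delta\psi = w$ and $u = \nabla^\perp\psi$ are mutually consistent under this sign convention.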
The stream function $\psi$ is related to the velocity by $u = \nabla^\perp \psi$, where $\nabla^\perp := (\partial_{x_2}, -\partial_{x_1})$ is the skew gradient operator. The function $g$ is defined as the out-of-plane component of the curl of the forcing function $f$, i.e., $(0, 0, g) := \nabla \times f$ with $g : \mathbb{T}^2 \to \mathbb{R}$, and $\nu \in \mathbb{R}_+$ is the viscosity coefficient. We generate the initial vorticity $w_0$ from a Gaussian measure, $w_0 \sim N(0, 7^{1.5}(-\Delta + 49I)^{-2.5})$, with periodic boundary conditions, constant mean function 0, and covariance operator $7^{1.5}(-\Delta + 49I)^{-2.5}$. We fix $g$ as

$$g(x_1, x_2) = 0.1\left(\sin(2\pi(x_1 + x_2)) + \cos(2\pi(x_1 + x_2))\right), \quad (x_1, x_2) \in \mathbb{T}^2.$$

Following Kovachki et al. (2021b), the equation is solved using the pseudo-spectral split-step method. The forcing terms are advanced using Heun's method (an improved Euler method), and the viscous terms are advanced using a Crank-Nicolson update with a time step of $10^{-4}$. The Crank-Nicolson update is a second-order implicit method in time (Larsson & Thomée, 2003). We record the solution every $t = 1$ time unit. The data is generated on a uniform 256×256 grid and downsampled to 64×64 for low-resolution training. For this time-dependent problem, we aim to learn an operator that maps the vorticity field over a time interval $[0, T_{in}]$ to the vorticity field over a later time interval $(T_{in}, T]$,

$$G : C([0, T_{in}]; H^r_{per}(\mathbb{T}^2; \mathbb{R})) \to C((T_{in}, T]; H^r_{per}(\mathbb{T}^2; \mathbb{R})),$$

where $H^r_{per}$ is the periodic Sobolev space $H^r$ with constant $r \geq 0$. In terms of vorticity, the operator is defined by $w|_{\mathbb{T}^2 \times [0, T_{in}]} \mapsto w|_{\mathbb{T}^2 \times (T_{in}, T]}$.

4.3 Model Implementation

We present the specific implementation of the models used for the experiments discussed in the remainder of this section. For the internal integral operators, we adopt the approach of Li et al. (2020a), in which, invoking the convolution theorem, the integral kernel (convolution) operators are computed by multiplying the Fourier transform of the convolution kernel with the Fourier transform of the input function in the Fourier domain.
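This Fourier-domain parameterization can be sketched in a few lines of NumPy. The block below is a simplified, untrained stand-in for an FNO-style layer: random weights $R$, a 1D periodic domain, and a real FFT with mode truncation (the actual models operate on 2D/3D domains with learned weights):

```python
import numpy as np

def spectral_conv_1d(v, R):
    """Fourier integral operator on a 1D periodic grid.

    v : (n, d_in)            samples of the input function
    R : (k_max, d_out, d_in) Fourier-domain weights (random here, learned in FNO)
    Returns (n, d_out) samples of F^{-1}(R . F(v)).
    """
    n, d_in = v.shape
    k_max, d_out, _ = R.shape
    v_hat = np.fft.rfft(v, axis=0)                      # (n//2 + 1, d_in)
    out_hat = np.zeros((n // 2 + 1, d_out), dtype=complex)
    # multiply the k_max lowest modes by R; higher modes are truncated to zero
    out_hat[:k_max] = np.einsum("koi,ki->ko", R, v_hat[:k_max])
    return np.fft.irfft(out_hat, n=n, axis=0)

rng = np.random.default_rng(0)
v = rng.standard_normal((64, 3))
R = rng.standard_normal((12, 5, 3)) + 1j * rng.standard_normal((12, 5, 3))
out = spectral_conv_1d(v, R)
print(out.shape)  # (64, 5)
```

Because the operator is defined mode-by-mode rather than point-by-point, the same weights $R$ apply at any discretization $n$, which is the source of the discretization-invariance (and zero-shot super-resolution) discussed later.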
Such an operation is approximately computed using the fast Fourier transform, providing computational benefits compared to the other Nyström integral approximation methods developed in the neural operator literature; additional performance benefits result from computing the kernels directly in the Fourier domain, invoking spectral convergence (Canuto et al., 2007). Therefore, for a given function $v_i$, i.e., the input to $G_i$, we have

$$G_i v_i(x) = \sigma\left(F^{-1}\left(R_i \cdot F(v_i)\right)(x) + W_i v_i(x)\right)$$

Here $F$ and $F^{-1}$ are the Fourier and inverse Fourier transforms, respectively. $R_i : \mathbb{Z}^d \to \mathbb{C}^{d_{i+1} \times d_i}$ is, for each Fourier mode $k \in \mathbb{Z}^d$, the Fourier transform of the periodic convolution function in the integral convolution operator. We directly parameterize the matrix-valued function $R_i$ on $\mathbb{Z}^d$ in the Fourier domain and learn it from data. For each layer $i$, we also truncate the Fourier series at a preset number of modes, $k^i_{max} = |Z^i_{k_{max}}| = |\{k \in \mathbb{Z}^d : |k_j| \leq k^i_{max,j} \text{ for } j \in \{1, \ldots, d\}\}|$. Thus, $R_i$ is implemented as a complex-valued $(k^i_{max} \times d_{v_{i+1}} \times d_{v_i})$ tensor. Assuming the discretization of the domain is regular, this non-linear operator is implemented efficiently with the fast Fourier transform. Finally, we define the point-wise residual operation $W_i v_i$ by $W_i v_i(x) = W_i v_i(s(x))$ with $x \in D_{i+1}$, where $s : D_{i+1} \to D_i$ is a fixed homeomorphism between $D_{i+1}$ and $D_i$.

In the empirical study, we need two classes of operators. In the first setting, the input and output functions are defined on a 2D spatial domain; in the second, they are defined on a 3D spatio-temporal domain. In the following, we describe the architecture of U-NO for both of these settings.

Operator learning on functions defined on 2D spatial domains. For mapping between functions defined on a 2D spatial domain, the solution operator is of the form $G : \{a : (0,1)^2 \to \mathbb{R}^{d_A}\} \to \{u : (0,1)^2 \to \mathbb{R}^{d_U}\}$.
The U-NO architecture consists of a series of seven non-linear integral operators, $\{G_i\}_{i=0}^{6}$, placed between the lifting and projection operators. Each layer performs a 2D integral operation in the spatial domain and contracts (or expands) the spatial domain (see Fig. 1). The first non-linear operator, $G_0$, contracts the domain by a factor of 3/4 uniformly along each dimension, while increasing the dimension of the co-domain by a factor of 3/2:

$$G_0 : \{v_0 : (0,1)^2 \to \mathbb{R}^{d_{v_0}}\} \to \{v_1 : (0, 3/4)^2 \to \mathbb{R}^{\frac{3}{2}d_{v_0}}\}$$

Similarly, $G_1$ and $G_2$ sequentially contract the domain further while doubling the dimension of the co-domain. $G_3$ is a regular Fourier operator layer, so the domain and co-domain of its input and output function spaces stay the same. The last three Fourier operators, $\{G_i\}_{i=4}^{6}$, expand the domain and decrease the co-domain dimension, restoring the domain and co-domain to those of the input to $G_0$.

Figure 2: (A) An illustration of aggressive contraction and re-expansion of the domain (and vice versa for the co-domains) by the U-NO† variant with a factor of 1/2 on an instance of the Darcy flow equation; e.g., $v_0 : (0, 0.5)^2 \to \mathbb{R}^{64}$ and $v_1 : (0, 0.25)^2 \to \mathbb{R}^{128}$ in the encoder, mirrored by $v_5 : (0, 0.25)^2 \to \mathbb{R}^{128}$ and $v_6 : (0, 0.5)^2 \to \mathbb{R}^{64}$ in the decoder. (B) Vorticity field generated by the U-NO variant as a solution to an instance of the two-dimensional Navier-Stokes equation with viscosity $10^{-6}$, shown against the ground truth and initial vorticity.

We further propose U-NO†, which follows a more aggressive factor of 1/2 (shown in Fig. 2A) when contracting (or expanding) the domain in each of the integral operators. One of the reasons for this choice is that such a selection of scaling factors is more memory efficient than the initial U-NO architecture. The projection operator $Q$ and lifting operator $P$ are implemented as fully-connected neural networks with lifting dimension $d_0 = 32$.
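To make the domain/co-domain schedule concrete, the following walks the shapes of the intermediate functions through a seven-layer U-NO-style schedule at a 64×64 input discretization, using the aggressive 1/2 factor for every contracting layer (an illustrative sketch; the plain U-NO's per-layer factors differ, and skip concatenation, which widens decoder inputs, is omitted):

```python
# Shapes (grid side, channels) of v_0 .. v_7 through a 7-layer schedule with
# contraction factor 1/2 in the encoder, mirrored in the decoder. Illustrative.
res, ch = 64, 32                      # discretization of v0 and lifted width
shapes = [(res, ch)]
for s_factor, c_factor in [(0.5, 2), (0.5, 2), (0.5, 2),        # encoder G0..G2
                           (1.0, 1),                            # bottleneck G3
                           (2.0, 0.5), (2.0, 0.5), (2.0, 0.5)]: # decoder G4..G6
    res = int(res * s_factor)
    ch = int(ch * c_factor)
    shapes.append((res, ch))

print(shapes[0], shapes[3], shapes[-1])
# input (64, 32); bottleneck input (8, 256); output restored to (64, 32)
```

Note how the output shape exactly restores the input shape, so the projection operator $Q$ always sees a function on the original domain regardless of how deep the U goes.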
We set skip connections from the first three integral operators ($G_0$, $G_1$, $G_2$) to the last three ($G_6$, $G_5$, $G_4$), respectively. For time-dependent problems, where we need to map input functions $a$ over an initial time interval $[0, T_{in}]$ to solution functions $u$ over a later time interval $(T_{in}, T]$, i.e., to learn an operator of the form

$$G : C\left([0, T_{in}]; \{a : (0,1)^2 \to \mathbb{R}^{d_A}\}\right) \to C\left((T_{in}, T]; \{u : (0,1)^2 \to \mathbb{R}^{d_U}\}\right),$$

we use an auto-regressive model for U-NO and U-NO†. Here we recurrently compose the neural operator in time to produce solution functions for the time interval $(T_{in}, T]$.

Operator learning on functions defined on 3D spatio-temporal domains. Separately, we also use a U-NO model with non-linear operators that perform the integral operation in space and time (3D), which directly maps the input functions $a$ over the interval $[0, T_{in}]$ to the later time interval $(T_{in}, T]$, without any recurrent composition in time. As this model does not require recurrent composition in time, it is fast during both training and inference. We redefine both $a$ and $u$ on the 3D spatio-temporal domain and learn the operator

$$G : \{a : (0,1)^2 \times [0, T_{in}] \to \mathbb{R}^{d_A}\} \to \{u : (0,1)^2 \times (T_{in}, T] \to \mathbb{R}^{d_U}\}$$

The non-linear operators constructing U-NO, $\{G_i\}_{i=0}^{L}$, are defined as

$$G_i : \{v_i : (0, \alpha_i)^2 \times T_i \to \mathbb{R}^{d_{v_i}}\} \to \{v_{i+1} : (0, c^s_i \alpha_i)^2 \times T_{i+1} \to \mathbb{R}^{c^c_i d_{v_i}}\}, \quad |T_{i+1}| = c^t_i |T_i|.$$

Here, $(0, \alpha_i)^2 \times T_i$ is the domain of the function $v_i$, and $c^s_i$, $c^t_i$, and $c^c_i$ are respectively the expansion (or contraction) factors for the spatial domain, temporal domain, and co-domain of the $i$-th operator. Note that $T_0 = [0, T_{in}]$, $\alpha_0 = \alpha_{L+1} = 1$, and $T_{L+1} = (T_{in}, T]$. Here we also first contract and then expand the spatial domain (i.e., $c^s_i \leq 1$ in the encoding part and $c^s_i \geq 1$ in the decoding part), following the U-NO performing 2D integral operations. We set $c^t_i \geq 1$ for all operators $G_i$ if the output time interval is larger than the input time interval, i.e., $T - T_{in} > T_{in}$; that is, we gradually increase the time domain of the input function to match the output.
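The auto-regressive use of the 2D model can be sketched generically: an operator that maps a window of the last $T_{in}$ vorticity snapshots to the next snapshot is composed with itself to roll the solution forward. The `step` operator below is a placeholder standing in for the trained model:

```python
import numpy as np

def rollout(step, w_init, n_future):
    """Auto-regressively compose a one-step operator in time.

    step     : maps a (T_in, nx, ny) window of snapshots to the next (nx, ny) one
    w_init   : (T_in, nx, ny) vorticity over the initial interval [0, T_in]
    n_future : number of future snapshots to generate, covering (T_in, T]
    """
    window = list(w_init)
    preds = []
    for _ in range(n_future):
        w_next = step(np.stack(window))
        preds.append(w_next)
        window = window[1:] + [w_next]   # slide the input window forward
    return np.stack(preds)

# placeholder operator: mean of the window (stands in for the trained U-NO)
step = lambda w: w.mean(axis=0)
w0 = np.random.default_rng(0).standard_normal((10, 64, 64))  # T_in = 10 snapshots
future = rollout(step, w0, n_future=40)                      # predict 40 more steps
print(future.shape)  # (40, 64, 64)
```

The recurrent composition is why the per-time-step memory figures are the relevant ones for the 2D models in Table 3, and why the 3D model, which produces the whole output interval in one shot, avoids this loop entirely.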
Here the projection operator $Q$ and lifting operator $P$ are also implemented as fully connected neural networks, with $d_0 = 8$ for the lifting operator $P$. For the experiments, we use the Adam optimizer (Kingma & Ba, 2014), and the initial learning rate is scaled down by a factor of 0.5 every 100 epochs. As the non-linearity, we use the GELU (Hendrycks & Gimpel, 2016) activation function. We use the architecture and implementation of FNO provided in the original work (Li et al., 2020a). All computations are carried out on a single Nvidia GPU with 24GB of memory. For each experiment, the data is divided into train, test, and validation sets, and the weights of the best-performing model on the validation set are used for evaluation. Each experiment is repeated 3 times and the mean is reported. In the following, we empirically study the performance of these models on the Darcy flow and Navier-Stokes equations.

4.4 Results on Darcy Flow Equation

From the dataset of N = 2000 simulations of the Darcy flow equation, we set aside 250 simulations for testing and use the rest for training and validation. We use a U-NO with 9 layers, including the integral, lifting, and projection operators. We train for 700 epochs and save the best-performing model on the validation set for evaluation. The performances of U-NO and FNO at several different grid resolutions are shown in Table 1. The number of modes used in each integral operator stays unchanged across resolutions. We demonstrate that U-NO achieves lower relative error at every resolution, with a 27% improvement on high-resolution (421×421) simulations. The U-shaped structure exploits the problem structure and gradually encodes the input function into functions with smaller domains, which enables U-NO to have efficient deep architectures, modeling complex non-linear mappings between the input and solution function spaces.
We also notice that U-NO requires 23% less training memory while having 7.5 times more parameters. With vanilla FNO, designing such an over-parameterized deep operator to learn complex maps is impracticable due to high memory requirements.

Table 1: Benchmarks on Darcy flow. The average relative error in percentage is reported, with error bars indicated after the ± sign. The average back-propagation memory requirement for a single training instance is reported over different resolutions. U-NO performs better than FNO at every resolution, while requiring 23% less training memory, which allows efficient training of 7.5 times more parameters.

| Model | Mem. Req. (MB) | # Parameters (×10⁶) | s = 421 | s = 211 | s = 141 | s = 85 |
|---|---|---|---|---|---|---|
| U-NO | 166 | 8.2 | 0.57 ± 1.4e−2 | 0.58 ± 0.4e−2 | 0.60 ± 0.5e−2 | 0.73 ± 0.1e−2 |
| FNO | 214 | 1.1 | 0.78 ± 4.0e−2 | 0.84 ± 5.5e−2 | 0.84 ± 1.1e−2 | 0.87 ± 3.0e−2 |

Table 2: Zero-shot super-resolution results on Darcy flow. Neural operators trained on lower spatial resolution data are directly tested at higher resolution with no further training. We observe that U-NO achieves a lower percentage relative error.

| Model | Train | Test s = 141 | Test s = 211 | Test s = 421 |
|---|---|---|---|---|
| U-NO | s = 85 | 4.7 | 6.2 | 8.3 |
| | s = 141 | - | 2.6 | 6.3 |
| | s = 211 | - | - | 4.5 |
| FNO | s = 85 | 7.5 | 14.1 | 23.9 |
| | s = 141 | - | 4.3 | 13.1 |
| | s = 211 | - | - | 9.6 |

4.5 Results on Navier-Stokes Equation

For the auto-regressive model, we follow the architectures for U-NO and U-NO† described in Sec. 4.3. For the U-NO performing spatio-temporal (3D) integral operations, we use 7 stacked non-linear integral operators following the same spatial contraction and expansion as the 2D spatial U-NO. Additionally, as the positional embedding for the domain (the unit torus $\mathbb{T}^2$) we use the Clifford torus (Appendix A.1), which embeds the domain into Euclidean space. It is important to note that, in the spatio-temporal FNO setting, the domain of the input function must be extended to match the solution function, since FNO maps between function spaces with the same domain.
Since U-NO is capable of mapping between function spaces with different domains, no such modification is required. For each experiment, 10% of the total number of simulations N are set aside for testing, and the rest is used for training and validation. We experiment with viscosities $\nu \in \{1e{-}3,\ 1e{-}4,\ 1e{-}5,\ 1e{-}6\}$, adjusting the final time T as the flow becomes more chaotic at smaller viscosities. Table 3 presents the results of the Navier-Stokes experiments. We observe that U-NO achieves the best performance, with nearly a 50% reduction in relative error over FNO in some experimental settings; it even achieves a 1.76% relative error on a problem with a Reynolds number² of $2 \times 10^4$ (Fig. 2B). In each case, U-NO†, with its aggressive contraction (and expansion) factor, performs significantly better than FNO while requiring less memory. Also, for neural operators performing 3D integral operations, U-NO achieves 37% lower relative error while requiring 50% less memory on average, with 19 times more parameters. It is important to note that designing an FNO architecture matching the parameter count of U-NO is impractical due to high computational requirements (see Fig. 3). An FNO architecture with a similar parameter count demands enormous computational resources with multiple GPUs and would require a long training time due to the smaller batch size. Also, unlike FNO, U-NO is able to perform zero-shot super-resolution in both the temporal and spatial domains (Appendix A.5).

²The Reynolds number is estimated as $Re = \sqrt{0.1}/(\nu (2\pi)^{3/2})$ (Chandler & Kerswell, 2013).

Table 3: Benchmarks on Navier-Stokes (relative error (%)). For the models performing a 2D integral operation with a recurrent structure in time, the back-propagation memory requirement per time step is reported. U-NO yields superior performance compared with U-NO† and FNO, while U-NO† is the most memory-efficient model. In the 3D integration setting, U-NO provides a significant performance improvement with almost three times less memory requirement.

| | Model | Avg. Mem. Req. (MB) | # Parameters (×10⁶) | ν = 1e−3 (T = 50s, Tin = 10s, N = 5000) | ν = 1e−4 (T = 30s, Tin = 10s, N = 11000) | ν = 1e−5 (T = 20s, Tin = 10s, N = 11000) | ν = 1e−6 (T = 15s, Tin = 6s, N = 11000) |
|---|---|---|---|---|---|---|---|
| 2D | U-NO | 16 | 15.3 | 0.28 ± 2.1e−2 | 3.44 ± 6.3e−2 | 2.94 ± 4.9e−2 | 1.76 ± 1.4e−2 |
| | U-NO† | 11 | 6.7 | 0.35 ± 2.0e−2 | 3.56 ± 4.3e−2 | 3.60 ± 1.2e−2 | 2.2 ± 1.5e−2 |
| | FNO | 13 | 1.3 | 0.58 ± 1.4e−2 | 5.57 ± 7.7e−2 | 5.12 ± 0.7e−2 | 3.33 ± 0.7e−2 |
| 3D | U-NO | 108 | 24.0 | 0.31 ± 3.9e−2 | 5.59 ± 2.9e−1 | 7.03 ± 4.6e−2 | 5.10 ± 1.0e−1 |
| | FNO | 216 | 1.3 | 0.68 ± 0.8e−2 | 9.60 ± 5.4e−2 | 8.67 ± 1.2e−1 | 7.35 ± 6.3e−1 |

Table 4: Relative error (%) achieved by 2D U-NO and 3D U-NO trained and evaluated on high-resolution (256×256) simulations of Navier-Stokes. Even with a highly constrained neural operator architecture and smaller training data, the U-NO architecture achieves comparable performance.

| Model | Memory Requirement (MB) | ν = 1e−3 (N = 2400) | ν = 1e−4 (N = 6000) | ν = 1e−5 (N = 6000) | ν = 1e−6 (N = 6000) |
|---|---|---|---|---|---|
| 2D U-NO | 86 | 0.51 | 5.9 | 7.0 | 4.4 |
| 3D U-NO | 850 | 0.83 | 8.3 | 11.2 | 8.2 |

The U-shaped encoding-decoding structure allows us to train U-NO on very high-resolution (256×256) data on a single GPU with 24GB of memory. To accommodate training with high-resolution simulations of the Navier-Stokes equation, we adjusted the contraction (or expansion) factors. The operators in the encoder part of U-NO follow a contraction ratio of 1/2 for the spatial domain, but the co-domain dimension is increased only by a factor of 2, and the operators in the decoder part follow the reciprocals of these ratios. We also used a smaller lifting dimension for the lifting operator P. Even with such a compact architecture and less training data, U-NO achieves a lower error rate on high-resolution data (Table 4).
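The relative errors reported throughout Tables 1-4 are, following standard operator-learning benchmarks, relative $L^2$ errors over the discretized output function; the exact normalization used in the paper's code is an assumption here. A minimal sketch of the metric:

```python
import numpy as np

def relative_l2_error(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2, as a percentage."""
    return 100.0 * np.linalg.norm(pred - true) / np.linalg.norm(true)

true = np.ones((64, 64))
pred = true + 0.01            # uniform 1% offset
err = relative_l2_error(pred, true)
print(err)  # approximately 1.0
```

Because the error is normalized by the ground-truth norm, the figures are comparable across viscosities and resolutions even though the solution magnitudes differ.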
These gains are attributed to the deeper architecture, skip connections, and encoding/decoding components of U-NO.

4.6 Remarks on Memory Usage

Compared to FNO, the proposed U-NO architectures consume less memory during training and testing. Due to the gradual encoding of the input function by contracting its domain, U-NO allows for deeper architectures with a greater number of stacked non-linear operators and higher-resolution training data (Appendix A.4). For U-NO, an increase in depth does not significantly raise the memory requirement. The training memory requirements for 3D spatio-temporal U-NO and FNO are reported in Fig. 3A. We observe a linear increase in memory requirement for FNO as depth increases. This makes FNO unsuitable for designing and training very deep architectures, as the memory requirement keeps increasing rapidly with depth. On the other hand, due to the repeated scaling of the domain, we do not observe any significant increase in the memory requirement for U-NO. This makes U-NO a suitable architecture for designing deep neural operators. The improvement in performance with increasing depth is shown in Fig. 3B.

Figure 3: (A) Training memory requirements (in MB) for the 3D spatio-temporal problem of the Navier-Stokes equation (ν = 1e−3). Across depths, only the number of stacked non-linear operators is varied. For deeper models, the additional memory required by U-NO is negligible compared to FNO: adding 7 more integral layers increases the memory requirement by only 80 MB (vs. 400 MB for FNO). (B) Relative error (in %) on the 3D spatio-temporal Navier-Stokes equation (ν = 1e−3) at different depths (averages over three repeated experiments are reported). We observe a gradual decrease in relative error as depth increases.
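The qualitative behavior in Fig. 3A can be illustrated with a back-of-the-envelope activation-memory model. This is our own sketch under assumed factors (base resolution 64 × 64, base width 32, contraction/expansion factor 1/2), not the paper's exact accounting: FNO keeps every layer at full resolution, while U-NO's encoder halves each spatial dimension per layer and doubles the channel width, with the decoder mirroring those ratios.

```python
def fno_activations(depth, h=64, w=64, c=32):
    # Every FNO layer stores a full-resolution h x w x c activation
    # for back-propagation, so memory grows linearly with depth.
    return depth * h * w * c

def uno_activations(depth, h=64, w=64, c=32):
    total = 0
    for _ in range(depth // 2):          # encoder: contract domain, expand co-domain
        total += h * w * c
        h, w, c = h // 2, w // 2, c * 2
    for _ in range(depth - depth // 2):  # decoder: reciprocal factors
        total += h * w * c
        h, w, c = h * 2, w * 2, c // 2
    return total

for d in (8, 12):
    print(d, fno_activations(d), uno_activations(d))
```

Because each contraction multiplies the per-layer activation size by 1/2 · 1/2 · 2 = 1/2, the totals form a near-geometric series: extra layers add progressively less memory for U-NO, while FNO's cost grows linearly, matching the trend in Fig. 3A.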
This is expected, as highly parameterized deep architectures allow for effective approximation of highly complex non-linear operators. U-NO-style architectures are therefore favorable for learning complex maps between function spaces.

5 Conclusion

In this paper, we propose U-NO, a new neural operator architecture with a multitude of advantages compared with prior works. Our approach is inspired by U-Net, a neural network architecture that is highly successful at learning maps between finite-dimensional spaces with similar structures, e.g., image-to-image translation. U-NO possesses a similar set of benefits to U-Net, including data efficiency, robustness to hyperparameter tuning (Appendix A.2), flexible training, memory efficiency, fast convergence (Appendix A.6), and non-vanishing gradients. While this approach advances neural operator research and significantly improves performance and the learning paradigm, it still carries the fundamental memory usage limitation inherent to integral operators. Nevertheless, U-NO allows for much deeper models that incorporate the problem structure and provides highly parameterized neural operators for learning maps between function spaces. U-NOs are easy to tune, converge faster to desired accuracies, and achieve superior performance with minimal tuning.

References

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.

E. Oran Brigham. The fast Fourier transform and its applications. Prentice-Hall, Inc., 1988.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
Claudio Canuto, M. Yousuff Hussaini, Alfio Quarteroni, and Thomas A. Zang. Spectral methods: Fundamentals in single domains. Springer Science & Business Media, 2007.

Gary J. Chandler and Rich R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow. Journal of Fluid Mechanics, 722:554-595, 2013.

Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334, 2019.

Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. A farewell to the bias-variance tradeoff? An overview of the theory of overparameterized machine learning. arXiv preprint arXiv:2109.02355, 2021.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675-1685. PMLR, 2019.

Lawrence C. Evans. Partial differential equations, volume 19. American Mathematical Society, 2010.

Gaurav Gupta, Xiongye Xiao, and Paul Bogdan. Multiwavelet-based operator learning for differential equations. Advances in Neural Information Processing Systems, 34, 2021.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22, 2021a.

Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021b.
Stig Larsson and Vidar Thomée. Partial differential equations with numerical methods, volume 45. Springer, 2003.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Bian Li, Hanchen Wang, Xiu Yang, and Youzuo Lin. Solving seismic wave equations on variable velocity models with Fourier neural operator. arXiv preprint arXiv:2209.12340, 2022.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020a.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020b.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart, Kaushik Bhattacharya, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. Advances in Neural Information Processing Systems, 33, 2020c.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.

Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural networks (PINNs) for approximating a class of inverse problems for PDEs. arXiv preprint arXiv:2007.01138, 2020.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30, 2017.
Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022.

Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125-141, 2018.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241. Springer, 2015.

Changjian Shui, Qi Chen, Jun Wen, Fan Zhou, Christian Gagné, and Boyu Wang. Beyond H-divergence: Domain adaptation theory with Jensen-Shannon divergence. arXiv preprint arXiv:2007.15567, 2020.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Alasdair Tran, Alexander Mathews, Lexing Xie, and Cheng Soon Ong. Factorized Fourier neural operators. arXiv preprint arXiv:2111.13802, 2021.

Tapas Tripura and Souvik Chakraborty. Wavelet neural operator for solving parametric partial differential equations in computational mechanics problems. Computer Methods in Applied Mechanics and Engineering, 404:115783, 2023.

Gege Wen, Zongyi Li, Kamyar Azizzadenesheli, Anima Anandkumar, and Sally M. Benson. U-FNO: An enhanced Fourier neural operator-based deep learning model for multiphase flow. arXiv preprint arXiv:2109.03697, 2021.

Yan Yang, Angela F. Gao, Jorge C. Castellanos, Zachary E. Ross, Kamyar Azizzadenesheli, and Robert W. Clayton. Seismic wave propagation and inversion with neural operators. The Seismic Record, 1(3):126-134, 2021.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.

A.1 Positional Embedding for the Unit Torus

A unit torus T² is homeomorphic to the Cartesian product of two unit circles: S¹ × S¹. We can also consider the Cartesian product of the embeddings of the two circles, which produces the Clifford torus. The Clifford torus can be defined as

{ (1/√2)(sin θ, cos θ, sin ϕ, cos ϕ) | 0 ≤ θ < 2π, 0 ≤ ϕ < 2π }.

A.2 Sensitivity to Hyper-parameter Selection

Figure 4: Sensitivity of U-NO to the learning rate and the number of stacked non-linear operators (depth). All models are trained on the Darcy flow dataset at resolution 211 × 211, following the training protocol described in Section 4.4. Models for each configuration are trained three times and the average error rate (in %) is reported. Except at very high (≥ 0.01) or very low (≤ 0.0001) learning rates, U-NO achieves a low error rate at every configuration (the error rate achieved by FNO is 0.85%). We also note that at all high-performing hyper-parameter configurations, U-NO has a low generalization gap.

Table 5: Number of parameters in U-NO models at different depths. We observe that the U-NO architecture can efficiently train a large number of parameters (25 times that of FNO) from very limited training data (only 1500 simulations) and still perform reasonably (see Table 4).

Depth                  7    9    11   13    15    17
#Parameters (×10⁶)     5.3  8.3  9.6  18.5  22.8  27.8

A.3 FNO with Skip Connection

Table 6: Performance of vanilla 2D FNO equipped with skip connections on the Navier-Stokes equations described in Section 4.5, along with the memory requirement during training. The result of 2D FNO is reported again for comparison. We observe that skip connections alone do not improve the performance of FNO.

Model             Mem. Req. (MB)  ν = 1e−3   ν = 1e−4   ν = 1e−5   ν = 1e−6
                                  N = 2400   N = 6000   N = 6000   N = 6000
FNO w. skip con.  13.14           0.011      0.074      0.075      0.049
FNO               13.03           0.009      0.072      0.074      0.052

A.4 Spatial Memory

The memory requirements during back-propagation to learn the operator w|_{(0,1)² × [0,10]} → w|_{(0,1)² × (10,11]} for the Navier-Stokes problem are shown in Table 7. On average, over the tested resolutions, U-NO† with a contraction (and expansion) factor of 1/2 requires 40% less memory than FNO during training. This makes U-NO and its variants more suitable for problems where high-resolution data are crucial, e.g., weather forecasting models (Pathak et al., 2022).

Table 7: Memory requirements (in MB) for a single training instance of the Navier-Stokes equations at different grid resolutions. All models have seven non-linear integral operator layers.

Spatial Resolution   FNO     U-NO    U-NO†
64 × 64              13.0    16.8    11.3
128 × 128            76.1    67.3    45.4
256 × 256            304.5   269.0   181.5
512 × 512            1218.0  1076.0  726.0
1024 × 1024          4872.0  4304.0  2990.9

A.5 Zero-Shot Super-Resolution on 3D Spatio-Temporal Data

Table 8: Zero-shot super-resolution results on the 3D spatio-temporal Navier-Stokes equation. The 3D FNO is not resolution invariant in the time domain and, due to its specific construction, cannot process data at a different temporal resolution; U-NO is resolution invariant in both the time and spatial domains.

                      Temporal Res.
       Spatial Res.   fps = 1   fps = 1.5   fps = 2   fps = 3
U-NO   s = 64         5.10      17.43       19.39     20.32
       s = 128        6.34      17.61       19.56     20.62
       s = 256        8.16      17.86       19.94     20.83

A.6 Superior Convergence Rate

Figure 5: Training and test set error rates (in %, on a log scale) of U-NO and FNO for the Navier-Stokes equation with viscosity 1e−3: (A) models performing 2D spatial convolution; (B) models performing 3D spatio-temporal convolution. U-NO converges much faster than FNO: the final test set error rate reached by FNO after 500 epochs is achieved by U-NO after only around 200 epochs, and U-NO continues to improve beyond it.
A.7 Domain Contraction and Expansion

In this work, domain contraction (or expansion) is performed by fixing a homeomorphism between the domains of the input and output functions of an integral operator. Following the notation of Section 4.3, an integral operator can be defined as

[G_i v_i](x) = v_{i+1}(x) = σ( (F⁻¹(R_i · F(v_i)))(s(x)) + W_i v_i(s(x)) ),

where x ∈ D_{i+1} and s : D_{i+1} → D_i is a fixed homeomorphism between D_{i+1} and D_i. For the problems discussed in this work (the Darcy flow and Navier-Stokes equations), the domains of the input and output functions are bounded and connected. As a result, the mapping can be established trivially by a scaling operation. Let the domains of the input and output functions be (a, b)² and (c, d)², respectively, and let s : [c, d]² → [a, b]² be a homeomorphism between them. The function s has the linear form

s(x) = a + m ⊙ (x − c), with m = (b − a) ⊘ (d − c),

where ⊙ and ⊘ denote element-wise multiplication and division. If c = a = 0, s becomes a scaling operation on the domain, and the resulting output function can be computed efficiently through interpolation with the appropriate factor along each dimension.

A.8 Training Memory Requirement for 2D U-NO at Different Depths

Figure 6: Memory requirements (in MB) for a single training instance of the Navier-Stokes equation during back-propagation, with inputs and outputs at resolution 64 × 64, as the number of stacked non-linear operators (depth, 6 to 13) is varied. The additional memory required by deeper U-NO architectures is negligible compared to the FNO model: for U-NO, adding 7 more stacked non-linear integral operators increases the memory requirement by only 2.5 MB.

A.9 Comparison of Inference Time

Across the different problem settings, U-NO has 8 to 25 times more parameters than FNO, and it therefore performs more floating point operations (FLOPs).
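The scaling homeomorphism of A.7 is straightforward to realize in code. The sketch below is our own illustration (pure Python, 2D case, hypothetical helper name): it builds s(x) = a + m ⊙ (x − c) with m = (b − a) ⊘ (d − c) applied element-wise, and maps the corners of [c, d]² to those of [a, b]²:

```python
def make_homeomorphism(a, b, c, d):
    """Linear homeomorphism s : [c, d]^2 -> [a, b]^2 used for domain
    contraction/expansion: s(x) = a + m * (x - c), element-wise."""
    m = tuple((bi - ai) / (di - ci) for ai, bi, ci, di in zip(a, b, c, d))
    def s(x):
        return tuple(ai + mi * (xi - ci) for ai, mi, ci, xi in zip(a, m, c, x))
    return s

# Map the output domain (0, 2)^2 onto the input domain (0, 1)^2; with
# c = a = 0 this reduces to a pure scaling of the domain.
s = make_homeomorphism(a=(0.0, 0.0), b=(1.0, 1.0), c=(0.0, 0.0), d=(2.0, 2.0))
print(s((2.0, 2.0)))  # corner d maps to corner b
```

In practice, applying s to a function on a discrete grid amounts to resampling: evaluating the function at the pulled-back grid points, which is exactly the interpolation-with-a-scale-factor described above.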
For this reason, the inference time of U-NO is higher than that of FNO (see Table 9). It is important to note, however, that even though U-NO has 8 to 25 times more parameters, its inference time is only 1 to 4 times higher. Because U-NO gradually contracts the domain of the functions in its encoder, it reduces the time spent in the forward and inverse discrete Fourier transforms.

Table 9: Running time of U-NO and FNO during inference on a single sample. For the Darcy flow problem, we test at resolution 421 × 421; for Navier-Stokes, we use the problem setting with viscosity 1e−3 at resolution 64 × 64. For Navier-Stokes 2D, the time to infer a single time step is reported; for Navier-Stokes 3D, the time to infer the whole trajectory.

Problem              U-NO (sec)   FNO (sec)
Darcy Flow           0.13         0.12
Navier-Stokes (2D)   0.08         0.02
Navier-Stokes (3D)   0.33         0.09
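The claim that domain contraction reduces the Fourier transform cost can be seen from a simple operation-count model. This is our own illustration (an asymptotic estimate, not a measured timing; real wall-clock ratios depend on the implementation):

```python
import math

def fft2_cost(n: int) -> float:
    # Rough operation count for a 2D FFT on an n x n grid: O(n^2 log n).
    return n * n * math.log2(n)

# Halving the spatial resolution inside U-NO's encoder cuts the
# per-layer transform cost by more than a factor of 4:
print(fft2_cost(64) / fft2_cost(32))
```

Since most of U-NO's layers operate on contracted domains, the extra FLOPs from its larger parameter count are partly offset by cheaper spectral transforms, consistent with the modest inference-time gap in Table 9.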