Journal of Machine Learning Research 23 (2022) 1-63. Submitted 12/21; Revised 5/22; Published 7/22.

Learning Operators with Coupled Attention

Georgios Kissas, gkissas@seas.upenn.edu, Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104

Jacob H. Seidman, seidj@sas.upenn.edu, Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104

Leonardo Ferreira Guilhoto, guilhoto@sas.upenn.edu, Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia, PA 19104

Victor M. Preciado, preciado@seas.upenn.edu, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104

George J. Pappas, pappasg@seas.upenn.edu, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104

Paris Perdikaris, pgp@seas.upenn.edu, Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104

Editor: Animashree Anandkumar

Abstract

Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated by the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function measurements in the training set is very small.
Our formulation is accompanied by rigorous approximation theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state-of-the-art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks.

Keywords: deep learning, reproducing kernel Hilbert spaces, wavelet scattering networks, functional data analysis, universal approximation

*These authors contributed equally. ©2022 Georgios Kissas, Jacob H. Seidman, Leonardo Ferreira Guilhoto, Victor M. Preciado, George J. Pappas, and Paris Perdikaris. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1521.html.

1. Introduction

The great success of modern deep learning lies in its ability to approximate maps between finite-dimensional vector spaces, as in computer vision (Santhanam et al., 2017), natural language processing (Vaswani et al., 2017), precision medicine (Rajkomar et al., 2019), bio-engineering (Kissas et al., 2020), and other data-driven applications. A particularly successful class of such models are those built with the attention mechanism (Bahdanau et al., 2015). For example, the Transformer is an attention-based architecture that has recently produced state-of-the-art performance in natural language processing (Vaswani et al., 2017), computer vision (Dosovitskiy et al., 2020; Parmar et al., 2018), and audio signal analysis (Gong et al., 2021; Huang et al., 2018).
Another active area of research is applying machine learning techniques to approximate operators between spaces of functions. These methods are particularly attractive for many problems in computational physics and engineering where the goal is to learn the functional response of a system from a functional input, such as an initial/boundary condition or forcing term. In the context of learning the response of systems governed by differential equations, these learned models can function as fast surrogates of traditional numerical solvers. For example, in climate modelling one might wish to predict the pressure field over the earth from measurements of the surface air temperature field. The goal is then to learn an operator, $F$, from the space of temperature functions to the space of pressure functions (see Figure 1). An initial attempt at solving this problem might be to take a regular grid of measurements over the earth for the input and output fields and formulate the problem as a (finite-dimensional) image-to-image regression task. While architectures such as convolutional neural networks may perform well under this setting, this approach can be somewhat limited. For instance, if we desired the value of the output at a query location outside of the training grid, an entirely new model would need to be built and tuned from scratch. This is a consequence of choosing to discretize the regression problem before building a model to solve it. If instead we formulate the problem and model at the level of the (infinite-dimensional) input and output function spaces, and then make a choice of discretization, we can obtain methods that are more flexible with respect to the locations of the point-wise measurements. Formulating models with functional data is the topic of Functional Data Analysis (FDA) (Ramsay, 1982; Ramsay and Dalzell, 1991), where parametric, semi-parametric or nonparametric methods operate on functions in infinite-dimensional vector spaces.
A useful class of non-parametric approaches are operator-valued kernel methods. These methods generalize the use of scalar-valued kernels for learning functions in a Reproducing Kernel Hilbert Space (RKHS) (Hastie et al., 2009) to RKHSs of operators. Kernel methods were thoroughly studied in the past (Hofmann et al., 2008; Shawe-Taylor et al., 2004) and have been successfully applied to nonlinear and high-dimensional problem settings (Takeda et al., 2007; Dou and Liang, 2020). Previous work has successfully extended this framework to learning operators between more general vector spaces as well (Micchelli and Pontil, 2005; Caponnetto et al., 2008; Kadri et al., 2010, 2016; Owhadi, 2020). This framework is particularly powerful as the inputs can be continuous or discrete, and the underlying vector spaces are typically only required to be normed and separable.

Figure 1: An example sketch of operator learning for climate modeling: By solving an operator learning problem, we can approximate an infinite-dimensional map between two functions of interest, and then predict one function using the other. For example, by providing the model with an input function, e.g. a surface air temperature field, we can predict an output function, e.g. the corresponding surface air pressure field.

A parametric-based approach to operator learning was introduced in Chen and Chen (1995), where the authors proposed a method for learning non-linear operators based on a one-layer feed-forward neural network architecture. Moreover, the authors presented a universal approximation theorem which ensures that their architecture can approximate any continuous operator with arbitrary accuracy. Lu et al.
(2019) gave an extension of this architecture, called DeepONet, built with multiple-layer feed-forward neural networks, and demonstrated its effectiveness in approximating the solution operators of various differential equations. In follow-up work, error estimates were derived for some specific problem scenarios (Lanthaler et al., 2022), and several applications have been pursued (Cai et al., 2020; Di Leoni et al., 2021; Lin et al., 2021). An extension of the DeepONet was proposed by Wang et al. (Wang et al., 2021c; Wang and Perdikaris, 2021; Wang et al., 2021b), where a regularization term is added to the loss function to enforce known physical constraints, enabling one to predict solutions of parametric differential equations, even in the absence of paired input-output training data. Another parametric approach to operator learning is the Graph Neural Operator proposed by Li et al. (2020c), motivated by the solution form of linear partial differential equations (PDEs) and their Green's functions. As an extension of this work, the authors also proposed a Graph Neural Operator architecture where a multi-pole method is used to sample the spatial grid (Li et al., 2020b), allowing the kernel to learn in a non-local manner. In later published work, this framework has been extended to the case where the integral kernel is stationary, enabling one to efficiently compute the integral operator in the Fourier domain (Li et al., 2020a), resulting in Fourier Neural Operators. Universality results and error estimates for Fourier Neural Operators were explored in Kovachki et al. (2021a).
Other parametric-based models include a deep learning approach for directly approximating the Green's function of differential equations (Gin et al., 2021), using a neural network to learn a transformation between finite-dimensional PCA representations of functional data (Bhattacharya et al., 2021), a multi-wavelet approach for learning projections of an integral kernel operator to approximate the true operator (Gupta et al., 2021), and a random feature approach for learning the solution map of PDEs (Nelsen and Stuart, 2020). While some of the previously described operator learning methods can be seen as generalizations of deep learning architectures such as feed-forward and convolutional neural networks, in this paper we are motivated by the success of the attention mechanism to propose a new operator learning framework. Specifically, we draw inspiration from the Bahdanau attention mechanism (Bahdanau et al., 2015), which first constructs a feature representation of the input and then averages these features with a distribution that depends on the argument of the output function to obtain its value. We will additionally couple these distributions together in what we call a Kernel-Coupled Attention mechanism. This will allow our framework to explicitly model correlations within the output functions of the operator and train better with fewer output function measurements. Moreover, we prove that under certain assumptions the model satisfies a universal approximation property. The main contributions of this work can be summarized in the following points:

Novel Architecture: We propose an operator learning framework inspired by the attention mechanism, operator approximation theory, and the Reproducing Kernel Hilbert Space (RKHS) literature. A novel Kernel-Coupled Attention mechanism is introduced to explicitly model correlations between the output functions' query locations.
Theoretical Guarantees: We prove that the proposed framework satisfies a universal approximation property, that is, it can approximate any continuous operator with arbitrary accuracy.

Data Efficiency: By modelling correlations between output queries, our model can achieve high performance when trained with only a small fraction (6-12%) of the total available labeled data compared to competing methods.

Robustness: Compared to existing methods, our model demonstrates superior robustness with respect to noise corruption in the training and testing inputs, as well as randomness in the model initialization. Our model's performance is stable in that the errors on the test data set are consistently concentrated around the median, with significantly fewer outliers compared to other methods.

Generalization: On a real data set of Earth surface air temperature and pressure measurements, our model is able to learn the functional relation between the two fields with high accuracy and extrapolate beyond the training data. On synthetic data we demonstrate that our model is able to generalize better than competing methods over increasingly out-of-distribution examples.

These contributions are accompanied by systematic experiments examining the effect of the architecture components of LOCA, both in isolation and together. The rest of the paper is structured as follows. In Section 2 we introduce the supervised operator learning problem. In Section 3 we introduce the general form of the model and, in the following subsections, present the construction of its different components. In Section 4 we prove theoretical results on the approximation power of this class of models. In Section 5 we present the specific architecture choices made for implementing our method in practice. Section 6 discusses the similarities and differences of our model with related operator learning approaches and known operator learning error estimates.
In Section 7, we demonstrate the performance of the proposed methodology across different benchmarks in comparison to other state-of-the-art methods. In addition, we provide a new metric for assessing how closely trained models align with an optimal representation of output functions, and evaluate this metric on a synthetic example. In Section 8, we discuss our main findings, outline potential drawbacks of the proposed method, and highlight future directions emerging from this study.

2. Problem Formulation

We now provide a formal definition of the operator learning problem. Given $X \subset \mathbb{R}^{d_x}$ and $Y \subset \mathbb{R}^{d_y}$, we will refer to a point $x \in X$ as an input location and a point $y \in Y$ as a query location. Denote by $C(X, \mathbb{R}^{d_u})$ and $C(Y, \mathbb{R}^{d_s})$ the spaces of continuous functions from $X$ to $\mathbb{R}^{d_u}$ and from $Y$ to $\mathbb{R}^{d_s}$, respectively. We will refer to $C(X, \mathbb{R}^{d_u})$ as the space of input functions and $C(Y, \mathbb{R}^{d_s})$ as the space of output functions. For example, in Figure 1, if we aim to learn the correspondence between a temperature field over the earth and the corresponding pressure field, $u \in C(X, \mathbb{R})$ would represent the temperature field and $s \in C(Y, \mathbb{R})$ would be a pressure field, where $X = Y$ represents the surface of the earth. With a data set of input/output function pairs, we formulate the supervised operator learning problem as follows.

Problem 1. Given $N$ pairs of input and output functions $\{u^\ell(x), s^\ell(y)\}_{\ell=1}^{N}$ generated by some possibly unknown ground truth operator $G : C(X, \mathbb{R}^{d_u}) \to C(Y, \mathbb{R}^{d_s})$ with $u^\ell \in C(X, \mathbb{R}^{d_u})$ and $s^\ell \in C(Y, \mathbb{R}^{d_s})$, learn an operator $F : C(X, \mathbb{R}^{d_u}) \to C(Y, \mathbb{R}^{d_s})$ such that $F(u^\ell) = s^\ell$ for $\ell = 1, \dots, N$.

This problem also encompasses scenarios where more structure is known about the input/output functional relation. For example, $u$ could represent the initial condition to a partial differential equation and $s$ the corresponding solution. In this case, $G$ would correspond to the true solution operator and $F$ would be an approximate surrogate model.
Similarly, $u$ could represent a forcing term in a dynamical system described by an ordinary differential equation, and $s$ the resulting integrated trajectory. In these two scenarios there do exist a suite of alternate methods to obtain the solution function $s$ from the input $u$, but with an appropriate choice of architecture for $F$ the approximate model can result in significant computational speedups. Sufficiently differentiable surrogates additionally permit the fast computation of sensitivities with respect to the inputs using tools like automatic differentiation. Note that while the domains $X$ and $Y$ need not be discrete sets, in practice we may only have access to the functions $u^\ell$ and $s^\ell$ evaluated at finitely many locations. However, we take the perspective that it is beneficial to formulate the model with continuously sampled input data, and consider the consequences of discretization at implementation time. As we shall see, this approach will allow us to construct a model that is able to learn operators over multiple output resolutions simultaneously.

3. Proposed Model: Learning Operators with Coupled Attention (LOCA)

We will construct our model through the following two steps. Inspired by the attention mechanism (Bahdanau et al., 2015), we will first define a class of models where the input functions $u$ are lifted to a feature vector $v(u) \in \mathbb{R}^{n \times d_s}$. Each output location $y \in Y$ will define $d_s$ probability distributions $\phi(y) \in \prod_{i=1}^{d_s} \Delta_n$, where $\Delta_n$ is the $n$-simplex. The forward pass of the model is then computed by averaging the rows of $v(u)$ over the probability distributions $\phi(y)$. Next, we augment this model by coupling the probability distributions $\phi(y)$ across different query points $y \in Y$. This is done by acting on a proposal score function $g : Y \to \mathbb{R}^{n \times d_s}$ with a kernel integral operator. The form of the kernel determines the similarities between the resulting distributions.
In Section 7.1 we will empirically demonstrate that the coupled version of our model is more accurate compared to the uncoupled version when the number of output function evaluations per example is small.

3.1 The Attention Mechanism

The attention mechanism was first formulated in Bahdanau et al. (2015) for use in language translation. The goal of that work was to translate an input sentence in a given language, $\{u_1, \dots, u_{T_u}\}$, to a sentence in another language, $\{s_1, \dots, s_{T_s}\}$. A context vector $c_i$ was associated to each index of the output sentence, $i \in \{1, \dots, T_s\}$, and used to construct a probability distribution of the $i$-th word in the translated sentence, $s_i$. The attention mechanism is a way to construct these context vectors by averaging over features associated with the input in a way that depends on the output index $i$. More concretely, the input sentence is first mapped to a collection of features $\{v_1, \dots, v_{T_u}\}$. Next, depending on the input sentence and the location/index $i$ in the output (translated) sentence, a discrete probability distribution $\{\phi_{i1}, \dots, \phi_{iT_u}\}$ is formed over the input indices, such that $\sum_{j=1}^{T_u} \phi_{ij} = 1$. The context vector at index $i$ is then computed as

$$c_i = \sum_{j=1}^{T_u} \phi_{ij} v_j.$$

If the words in the input sentence are represented by vectors in $\mathbb{R}^d$, and the associated features and context vectors are in $\mathbb{R}^l$, the attention mechanism can be represented as a mapping $[T_s] \times \mathbb{R}^{T_u \times d} \to \mathbb{R}^l$. We will apply this attention mechanism to learn operators between function spaces by mapping an input function $u$ to a finite set of features $v(u) \in \mathbb{R}^{n \times d_s}$, and taking an average over these features with respect to $d_s$ distributions $\phi(y) \in \prod_{k=1}^{d_s} \Delta_n$ that depend on the query location $y \in Y$ for the output function.
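The attention average described above can be sketched in a few lines of NumPy. The score matrix and feature vectors below are random toy data standing in for the learned alignment network and encoder of Bahdanau et al. (2015); the sketch only illustrates the normalize-then-average computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_vectors(scores, values):
    """Compute context vectors c_i = sum_j phi_ij v_j.

    scores: (Ts, Tu) unnormalized alignment scores
    values: (Tu, l)  input feature vectors v_1, ..., v_Tu
    returns: (Ts, l) one context vector per output index
    """
    phi = softmax(scores, axis=1)  # each row is a distribution over input indices
    return phi @ values

# Toy example: Ts = 2 output positions attending over Tu = 3 input features.
scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.0, 0.5]])
values = np.arange(12.0).reshape(3, 4)
c = context_vectors(scores, values)
```

Each row of `phi` sums to one, so each context vector is a convex combination of the input features, exactly the averaging the text describes.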
That is,

$$F(u)(y) := \mathbb{E}_{\phi(y)}[v(u)],$$

where $v(u) \in \mathbb{R}^{n \times d_s}$, $\phi$ is a function from $Y$ to $d_s$ copies of the $n$-dimensional simplex $\Delta_n$, and $\mathbb{E} : \prod_{k=1}^{d_s} \Delta_n \times \mathbb{R}^{n \times d_s} \to \mathbb{R}^{d_s}$ is an expectation operator that takes $(\phi, v) \mapsto \sum_i \phi_i \odot v_i$, where $\odot$ denotes an element-wise product. This can be represented as a mapping $Y \times C(X, \mathbb{R}^{d_u}) \to \prod_{k=1}^{d_s} \Delta_n \times \mathbb{R}^{n \times d_s} \to \mathbb{R}^{d_s}$. In summary, we propose an operator learning architecture where the value of the output function at a location $y$ is determined by attending to an input vector $v(u)$ with a probability distribution $\phi(y)$, dependent on $y$. It remains to choose the functional form for the mappings $v : C(X, \mathbb{R}^{d_u}) \to \mathbb{R}^{n \times d_s}$ and $\phi : Y \to \prod_{i=1}^{d_s} \Delta_n$, which we describe in the following sections.

3.2 Kernel-Coupled Attention Weights

In this section we provide a definition for the form of the attention weight function $\phi : Y \to \prod_{i=1}^{d_s} \Delta_n$. We first consider a $\phi$ defined by the softmax normalization of a score network $g : Y \to \mathbb{R}^{n \times d_s}$. By augmenting this definition with a kernel convolution of the score function $g$, we arrive at the Kernel-Coupled Attention weights. To define a trainable function mapping query locations $y \in Y$ to attention weights $\phi(y) \in \Delta_n$, we must ensure that $\phi(y)$ forms a discrete probability distribution. A simple way of achieving this is to take a trainable un-normalized score function $g : Y \to \mathbb{R}^n$ and act on it with the softmax function,

$$g \mapsto \left( \frac{\exp(g_1)}{\sum_{i=1}^n \exp(g_i)}, \dots, \frac{\exp(g_n)}{\sum_{i=1}^n \exp(g_i)} \right).$$

We can extend this definition to create a function $\phi : Y \to \prod_{i=1}^{d_s} \Delta_n$ by using a score function $g : Y \to \mathbb{R}^{n \times d_s}$ and acting on the rows of $g(y)$ with the softmax function, i.e. $\phi(y) = \sigma(g(y))$. The disadvantage of this formulation is that it relies solely on the form of the function $g$ to capture relations between the distributions $\phi(y)$ across different $y \in Y$. Instead, we introduce the Kernel-Coupled Attention (KCA) mechanism to model these relations by integrating the function $g$ against a coupling kernel $\kappa : Y \times Y \to \mathbb{R}$.
This results in the score function

$$\int_Y \kappa(y, y')\, g(y')\, dy', \qquad (1)$$

which can be normalized across its rows to form the probability distributions

$$\phi(y) = \sigma\left( \int_Y \kappa(y, y')\, g(y')\, dy' \right). \qquad (2)$$

The form of the kernel $\kappa$ will determine how these distributions are coupled across $y \in Y$. For example, given a fixed $y$, the locations $y'$ where $\kappa(y, y')$ is large will enforce similarity between the corresponding score functions $g(y)$ and $g(y')$. If $\kappa$ is a local kernel with a small bandwidth, then points $y$ and $y'$ will only be forced to have similar score functions if they are very close together. Thus, the kernel transformation forces pairs of attention weights $\phi(y), \phi(y')$ for the features $v(u)$ to be similar when $y$ and $y'$ are deemed similar by the kernel $\kappa$. We define a kernel for the transformation in (2) by first lifting the points $y \in Y$ via a nonlinear parameterized mapping $q_\theta : Y \to \mathbb{R}^l$. We then apply a universal kernel $k : \mathbb{R}^l \times \mathbb{R}^l \to \mathbb{R}$ (Micchelli and Pontil, 2004) over the lifted space, such as the Gaussian RBF kernel

$$k(z, z') = \gamma \exp(-\beta \|z - z'\|^2), \qquad \gamma, \beta > 0. \qquad (3)$$

Finally, we apply a normalization to the output of this kernel on the lifted points to create a similarity measure. The effect of the normalization is to maintain the relative scale of the proposal score function $g$. Overall, our kernel is defined as

$$\kappa(y, y') := \frac{k(q_\theta(y), q_\theta(y'))}{\left( \int_Y k(q_\theta(y), q_\theta(z))\, dz \right)^{1/2} \left( \int_Y k(q_\theta(y'), q_\theta(z))\, dz \right)^{1/2}}. \qquad (4)$$

By tuning the parameters $\theta$, $\beta$ and $\gamma$ in the functions $q_\theta$ and $k$, the kernel $\kappa$ is able to learn the appropriate measures of similarity between the points in the output function domain $Y$. We show in Section 7.4 that the inclusion of the kernel transformation in (2) can have a strong effect on the performance of the model in the small data regime.
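The normalized coupling kernel of equations (3)-(4) can be sketched numerically. In this sketch the learned lifting $q_\theta$ is replaced by a random matrix of lifted points, and the integrals over $Y$ are replaced by Monte Carlo averages (the $\mathrm{vol}(Y)$ factors cancel in the ratio, as noted later in the paper); this is an illustration of the normalization, not the paper's trained implementation.

```python
import numpy as np

def rbf(z1, z2, gamma=1.0, beta=1.0):
    # Gaussian RBF kernel k(z, z') = gamma * exp(-beta * ||z - z'||^2), eq. (3).
    d2 = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    return gamma * np.exp(-beta * d2)

def coupling_kernel(q_y, q_z):
    """Normalized kernel kappa of eq. (4), with the integrals over Y
    approximated by averages over the lifted sample points q_z.

    q_y: (P, l) lifted query points, standing in for q_theta(y)
    q_z: (M, l) lifted integration points, standing in for q_theta(z)
    """
    K = rbf(q_y, q_y)                     # k(q(y), q(y'))
    norms = rbf(q_y, q_z).mean(axis=1)    # ~ (1/vol Y) int k(q(y), q(z)) dz
    return K / np.sqrt(norms[:, None] * norms[None, :])

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 3))               # random stand-in for a learned lifting
kappa = coupling_kernel(q, q)
```

By construction the result is symmetric, consistent with Proposition 4's claim that $\kappa$ is symmetric and positive definite.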
The first is a function

$$D : C(X, \mathbb{R}^{d_u}) \to \mathbb{R}^d, \qquad (5)$$

that maps an input function $u$ to a finite-dimensional vector $D(u) \in \mathbb{R}^d$. After creating the $d$-dimensional representation of the input function, $D(u)$, we pass this vector through a function $f$ from a class of universal function approximators, such as fully connected neural networks. The composition of these two operations forms our feature representation of the input function,

$$v(u) = f \circ D(u). \qquad (6)$$

Figure 2: Schematic illustration of LOCA for $n = 4$ and $d_s = 1$: The LOCA method builds a feature representation, $v(u)$, of the input function and averages it with respect to $\phi(y)$. The transform $D$ is first applied to the input function to produce a list of features, illustrated by disks in this case, and then a fully-connected network is applied to construct $v(u)$. The score function $g$ is applied to the output query locations $y_i$ together with the softmax function to produce the score vector $\phi_i$. The $v_i$ and $\phi_i$ vectors are combined to evaluate the solution at each query location by computing $\mathbb{E}_{\phi(y)}[v(u)]$ at the last step.

One example for the operator $D$ is the image of the input function under $d$ linear functionals on $C(X; \mathbb{R}^{d_u})$. For example, $D$ could return the point-wise evaluation of the input function at $d$ fixed points. This would correspond to the action of $d$ translated $\delta$-functionals. The drawback of such an approach is that the model would not be able to accept measurements of the input function at any other locations. As a consequence, the input resolution could never vary across forward passes of the model. Alternatively, if we consider an orthonormal basis for $L^2(X; \mathbb{R}^{d_u})$, we could also have $D$ be the projection onto the first $d$ basis vectors.
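A projection-based encoder $D$ of this kind can be sketched with the FFT. The code below is a minimal one-dimensional illustration (with an assumed test function, not data from the paper): it keeps the first $d$ normalized Fourier coefficients and stacks their real and imaginary parts, matching the doubling of the range dimension mentioned in the text, and shows that the same encoder applies across grid resolutions.

```python
import numpy as np

def spectral_encoder(u, d):
    """Project samples of a 1D input function onto its first d Fourier modes.

    u: (m,) samples of the input function on a uniform grid over [0, 1)
    d: number of complex Fourier coefficients to keep
    Returns a length-2d real vector (real and imaginary parts).
    """
    coeffs = np.fft.rfft(u) / len(u)  # normalization makes coefficients grid-independent
    kept = coeffs[:d]
    return np.concatenate([kept.real, kept.imag])

# The same function sampled at two different resolutions yields
# (essentially) the same finite-dimensional representation D(u).
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x)
x_coarse = np.linspace(0, 1, 64, endpoint=False)
x_fine = np.linspace(0, 1, 256, endpoint=False)
D_coarse = spectral_encoder(f(x_coarse), d=8)
D_fine = spectral_encoder(f(x_fine), d=8)
```

This resolution independence is exactly what the point-wise-evaluation encoder lacks.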
For example, if we use the basis of trigonometric polynomials, the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965) allows for efficient computation of these values and can be performed across varying grid resolutions. We could also consider the projection onto an orthogonal wavelet basis (Daubechies, 1988). In the case of complex-valued coefficients for these basis functions, the range space dimension of $D$ would be doubled to account for the real and imaginary parts of these measurements.

3.4 Model Summary

Overall, the forward pass of the proposed model is written as follows (see Figure 2 for a visual representation):

$$F(u)(y) = \mathbb{E}_{\phi(y)}[v(u)] = \sum_i \sigma\left( \int_Y \kappa(y, y')\, g(y')\, dy' \right)_i \odot v_i(u), \qquad (7)$$

where $\kappa : Y \times Y \to \mathbb{R}$ is the kernel of equation (1), $\sigma$ is the softmax function, $v$ the input feature encoder, and $g$ is the proposal score function. Some theoretical results on the expressiveness of this model are presented in the next section. Practical aspects related to the parametrization of $\kappa$, $v$ and $g$, as well as the model evaluation and training, will be discussed in Section 5.

4. Theoretical Guarantees of Universality

In this section we give conditions under which the LOCA model is universal. There exist multiple definitions of universality in the literature, for example see Sriperumbudur et al. (2011). To be clear, we formally state the definition we use below.

Definition 1 (Universality) Given compact sets $X \subset \mathbb{R}^{d_x}$, $Y \subset \mathbb{R}^{d_y}$ and a compact set $U \subset C(X, \mathbb{R}^{d_u})$, we say a class of operators $\mathcal{A} \subseteq \{F : C(X, \mathbb{R}^{d_u}) \to C(Y, \mathbb{R}^{d_s})\}$ is universal if it is dense in the space of operators equipped with the supremum norm. In other words, for any continuous operator $G : C(X, \mathbb{R}^{d_u}) \to C(Y, \mathbb{R}^{d_s})$ and any $\epsilon > 0$, there exists $F \in \mathcal{A}$ such that

$$\sup_{u \in U} \sup_{y \in Y} \|G(u)(y) - F(u)(y)\|_{\mathbb{R}^{d_s}} < \epsilon.$$

To explore the universality properties of our model, we note that if we remove the softmax normalization and the kernel coupling, the evaluation of the model can be written as

$$F(u)(y) = \sum_{i=1}^{n} g_i(y) \odot v_i(u).$$
The universality of this class of models has been proven in Chen and Chen (1995) (when $d_s = 1$) and extended to deep architectures in Lu et al. (2019). We will show that our model with the softmax normalization and kernel coupling is universal by adding these components back one at a time. First, the following theorem shows that the normalization constraint $\phi(y) \in \prod_{k=1}^{d_s} \Delta_n$ does not reduce the approximation power of this class of operators.

Theorem 2 (Normalization Preserves Universality) If $U \subset C(X, \mathbb{R}^{d_u})$ is a compact set of functions and $G : U \to C(Y, \mathbb{R}^{d_s})$ is a continuous operator with $X$ and $Y$ compact, then for every $\epsilon > 0$ there exist $n \in \mathbb{N}$, functionals $v_{j,k} : U \to \mathbb{R}$ for $j \in [n]$, $k \in [d_s]$, and functions $\phi_j : Y \to \mathbb{R}^{d_s}$ with $\phi_j(y) \in [0, 1]^{d_s}$ and $\sum_{j=1}^{n} \phi_j(y) = 1_{d_s}$ for all $y \in Y$, such that

$$\sup_{u \in U} \sup_{y \in Y} \|G(u)(y) - \mathbb{E}_{\phi(y)}[v(u)]\|_{\mathbb{R}^{d_s}} < \epsilon.$$

Proof The proof is given in Appendix B.

It remains to show that the addition of the kernel coupling step for the functions $\phi$ also does not reduce the approximation power of this class of operators. By drawing a connection to the theory of Reproducing Kernel Hilbert Spaces (RKHS), we are able to state sufficient conditions for this to be the case. The key insight is that, under appropriate conditions on the kernel $\kappa$, the image of the integral operator in (1) is dense in an RKHS $\mathcal{H}_\kappa$ which itself is dense in $C(Y, \mathbb{R}^n)$. This allows (2) to approximate any continuous function $\phi : Y \to \prod_{i=1}^{d_s} \Delta_n$ and thus maintains the universality guarantee of Theorem 2.

Proposition 3 (Kernel Coupling Preserves Universality) Let $\kappa : Y \times Y \to \mathbb{R}$ be a positive definite and symmetric universal kernel with associated RKHS $\mathcal{H}_\kappa$, and define the integral operator

$$T_\kappa : C(Y, \mathbb{R}^n) \to C(Y, \mathbb{R}^n), \qquad (T_\kappa f)(y) = \int_Y \kappa(y, z) f(z)\, dz.$$

If $A \subseteq C(Y, \mathbb{R}^n)$ is dense, then $T_\kappa(A) \subseteq C(Y, \mathbb{R}^n)$ is also dense.

The statement of Proposition 3 requires that the kernel $\kappa$ be symmetric, positive definite, and universal.
We next show that, by construction, it will always be symmetric and positive definite, and that under an assumption on the feature map $q$ it will additionally be universal.

Proposition 4 (Universality of the Kernel $\kappa$) The kernel defined in (4) is positive definite and symmetric. Further, if $q$ is injective, it defines a universal RKHS.

Proof The proof is provided in Appendix D.

Lastly, we present a result showing that a particular architecture choice for the input feature encoder $v$ also preserves universality. We show that if there is uniform convergence of spectral representations of the input, projections onto these representations can be used to construct a universal class of functionals on $C(X, \mathbb{R}^{d_u})$.

Proposition 5 (Spectral Encoding Preserves Universality) Let $A_d \subseteq C(\mathbb{R}^d, \mathbb{R}^n)$ be a set of functions dense in $C(\mathbb{R}^d, \mathbb{R}^n)$, and $\{e_i\}_{i=1}^{\infty}$ a set of basis functions such that for some compact set $U \subset C(X, \mathbb{R}^{d_u})$, $\sum_{i=1}^{\infty} \langle u, e_i \rangle_{L^2}\, e_i$ converges to $u$ uniformly over $U$. Let $D_d : U \to \mathbb{R}^d$ denote the projection onto $\{e_1, \dots, e_d\}$. Then for any continuous functional $h : U \to \mathbb{R}^n$ and any $\epsilon > 0$, there exist $d$ and $f \in A_d$ such that $\|h(u) - f \circ D_d(u)\| < \epsilon$.

Proof The proof is provided in Appendix E.

For example, if our compact space of input functions $U$ is contained in $C^1(X, \mathbb{R}^{d_u})$, and $D$ is a projection onto a finite number of Fourier modes, the architecture proposed in equation (6) is expressive enough to approximate any functional from $U \to \mathbb{R}$, including those produced by the universality result stated in Theorem 2.

5. Implementation Aspects

To implement our method, it remains to make a choice of discretization for computing the integrals required for updating the KCA weights $\phi(y)$, as well as a choice for the input function feature encoding $v(u)$. Here we address these architecture choices and provide an overview of the proposed model's forward evaluation.
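Putting the pieces together, a discretized forward pass of equation (7) for $d_s = 1$ might look like the following sketch. All inputs are random stand-ins for the learned networks $v$, $g$, and the coupling kernel $\kappa$, and the integral over $Y$ is replaced by a weighted sum; this illustrates the data flow, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def loca_forward(v_u, g, kappa, w):
    """Discretized forward pass of eq. (7) with d_s = 1.

    v_u:   (n,)   feature vector v(u) of the input function
    g:     (P, n) proposal scores g(y_i) at the P query points
    kappa: (P, P) coupling kernel values kappa(y_i, y_j)
    w:     (P,)   quadrature or Monte Carlo weights
    Returns the (P,) model outputs F(u)(y_i).
    """
    g_coupled = (kappa * w[None, :]) @ g  # ~ int_Y kappa(y, y') g(y') dy'
    phi = softmax(g_coupled, axis=1)      # attention weights phi(y_i)
    return phi @ v_u                      # E_{phi(y)}[v(u)]

rng = np.random.default_rng(1)
P, n = 16, 8
out = loca_forward(rng.normal(size=n),
                   rng.normal(size=(P, n)),
                   np.exp(-rng.random((P, P))),   # positive toy kernel values
                   np.full(P, 1.0 / P))           # uniform Monte Carlo weights
```

Because each row of `phi` is a probability distribution, every output value is a convex combination of the entries of `v_u`.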
5.1 Computation of the Kernel Integrals

To compute the kernel-coupled attention weights $\phi(y)$, we are required to evaluate integrals over the domain $Y$ in (1) and (4). Adopting an unbiased Monte Carlo estimator using $P$ points $y_1, \dots, y_P \in Y$, we can use the approximations

$$\int_Y \kappa(y, y')\, g(y')\, dy' \approx \frac{\mathrm{vol}(Y)}{P} \sum_{i=1}^{P} \kappa(y, y_i)\, g(y_i)$$

for equation (1), and

$$\int_Y k(q(y), q(z))\, dz \approx \frac{\mathrm{vol}(Y)}{P} \sum_{i=1}^{P} k(q(y), q(y_i))$$

for use in equation (4). Note that due to the normalization in $\kappa$, the $\mathrm{vol}(Y)$ term cancels out. In practice, we allow the query point $y$ to be one of the points $y_1, \dots, y_P$ used for the Monte Carlo approximation. When the domain $Y$ is low-dimensional, as in many physical problems, a Gauss-Legendre quadrature rule with weights $w_i$ can provide an accurate and efficient alternative to Monte Carlo approximation. Using $Q$ Gauss-Legendre nodes and weights, we can approximate the required integrals as

$$\int_Y \kappa(y, y')\, g(y')\, dy' \approx \sum_{i=1}^{Q} w_i\, \kappa(y, y_i)\, g(y_i)$$

for equation (1), and

$$\int_Y k(q(y), q(z))\, dz \approx \sum_{i=1}^{Q} w_i\, k(q(y), q(z_i))$$

for use in equation (4). If we restrict the kernel $\kappa$ to be translation-invariant, there is another option for computing these integrals. As in Li et al. (2020a), we could take the Fourier transform of both $\kappa$ and $g$, perform a point-wise multiplication in the frequency domain, followed by an inverse Fourier transform. However, while in theory the discrete Fourier transform could be performed on arbitrarily spaced grids, the most available and computationally efficient implementations rely on equally spaced grids. We prefer to retain the flexibility of arbitrary sets of query points $y$ and will therefore not pursue this alternate approach. In Section 7, we will switch between the Monte Carlo and quadrature strategies depending on the problem at hand.

5.2 Positional Encoding of Output Query Locations

We additionally adopt the use of positional encodings, as they have been shown to improve the performance of attention mechanisms.
For encoding the output query locations, we are motivated by the positional encoding in Vaswani et al. (2017), the harmonic feature expansion in Di Leoni et al. (2021), and the work of Wang and Liu (2019) for extending the encoding to more than one dimension. The positional encoding for a one-dimensional query space is given by
$$e(y_i, 2j + (i-1)H) = \cos(2^j \pi y_i), \qquad e(y_i, 2j + 1 + (i-1)H) = \sin(2^j \pi y_i), \tag{8}$$
where H is the number of encoding coefficients, $j = 1, \dots, H/2$, $y_i$ are the query coordinates in the different spatial dimensions, and $i = 1, \dots, d_y$. In contrast to Vaswani et al. (2017), we encode the physical position of the elements of the set y instead of their index position in a given list, as the index position in general does not have a physically meaningful interpretation.

5.3 Wavelet Scattering Networks as a Spectral Encoder

While projections onto an orthogonal basis allow us to derive a universality guarantee for the architecture, there can be computational drawbacks. For example, it is known that the Fourier transform is not always robust to small deformations of the input (Mallat, 2012). More worrisome is the lack of robustness to noise corrupting the input function. In real-world applications our inputs will often be noisy; hence, in practice we are motivated to find an operator D with stronger continuity with respect to these small perturbations.

To address the aforementioned issues, we make use of the scattering transform (Bruna and Mallat, 2013) as an alternate form of the operator D. The scattering transform maps an input function to a sequence of values by alternating wavelet convolutions and complex modulus operations (Bruna and Mallat, 2013). To be precise, given a mother wavelet ψ and a finite discrete rotation group G, we denote the wavelet filter with parameter $\lambda = (r, j) \in G \times \mathbb{Z}$ as $\psi_\lambda(x) = 2^{d_x j}\, \psi(2^j r^{-1} x)$. Given a path of parameters $p = (\lambda_1,$
$\dots, \lambda_m)$, the scattering transform is defined by the operator
$$S[p]u = \Big|\cdots\big||u * \psi_{\lambda_1}| * \psi_{\lambda_2}\big| \cdots * \psi_{\lambda_m}\Big| * \phi(x), \tag{9}$$
where φ(x) is a low-pass filter. We allow the empty path as a valid argument of S, with $S[\emptyset]u = u * \phi$. As shown in Bruna and Mallat (2013), this transform is Lipschitz continuous with respect to small deformations, while the modulus of the Fourier transform is not. This transform can be interpreted as a deep convolutional network with fixed filters and has been successfully applied in multiple machine learning contexts (Oyallon et al., 2017; Chang and Chen, 2015).

Computationally, the transform returns functions of the form (9) sampled at points in their domain, which we denote by $\hat{S}[p](u)$. By choosing d paths $p_1, \dots, p_d$, we may define the operator D as $D(u) = \big(\hat{S}[p_1](u), \dots, \hat{S}[p_d](u)\big)$. In practice, the number of paths used is determined by three parameters: J, the maximum scale over which we take a wavelet transform; L, the number of elements of the finite rotation group G; and $m_0$, the maximum length of the paths p. While Proposition 5 does not necessarily apply to this form of D, we find that empirically this input encoding gives the best performance in combination with the KCA and the normalization of the attention weight functions ϕ. We perform an ablation study in Appendix F.3 and show that, while the scattering transform can help the performance of other models as well, it is not the only component responsible for the competitive performance of LOCA.

5.4 Loss Function and Training

The proposed model is trained by minimizing the empirical risk over the available training data pairs,
$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \sum_{\ell=1}^{P} \big( s_i(y_\ell^i) - \mathcal{F}_\theta(u_i)(y_\ell^i) \big)^2, \tag{10}$$
where $\theta = (\theta_q, \theta_f, \theta_g)$ denotes all trainable model parameters. This is the simplest choice that can be made for training the model.
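A minimal sketch of evaluating the empirical risk (10), with a trivial stand-in for $\mathcal{F}_\theta$ (all names here are hypothetical, and the toy model has no relation to the actual architecture):

```python
import numpy as np

def empirical_risk(model, inputs, queries, targets):
    """Empirical risk (10): squared error of model predictions summed over
    all N training pairs and the P query locations of each pair."""
    total = 0.0
    for u_i, y_i, s_i in zip(inputs, queries, targets):
        pred = model(u_i, y_i)               # F_theta(u_i)(y_i), shape (P,)
        total += np.sum((s_i - pred) ** 2)
    return total

# toy stand-in model: predicts the mean of the input features everywhere
model = lambda u, y: np.mean(u) * np.ones_like(y)
rng = np.random.default_rng(0)
N, m, P = 4, 32, 8
inputs = [rng.normal(size=m) for _ in range(N)]
queries = [rng.uniform(size=P) for _ in range(N)]
targets = [np.sin(2 * np.pi * y) for y in queries]
loss = empirical_risk(model, inputs, queries, targets)
```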
Other choices may include weighting the mean square error loss using the L1 norm of the ground truth output (Di Leoni et al., 2021; Wang et al., 2021b), or employing a relative L2 error loss (Li et al., 2020a). The minimization is performed via stochastic gradient descent updates, where the required gradients of the loss with respect to all trainable model parameters can be conveniently computed via reverse-mode automatic differentiation.

5.5 Implementation Overview

Algorithm 1 provides an overview of the steps required for implementing the LOCA method. The training data set is first processed by passing the input functions through a wavelet scattering network (Bruna and Mallat, 2013), followed by applying a positional encoding to the query locations and the quadrature/Monte-Carlo integration points. The forward pass of the model is evaluated and gradients are computed for use with a stochastic gradient descent optimizer. After training, we make one-shot predictions on super-resolution grids and compute the relative L2 error between the ground truth output and the prediction.

Algorithm 1: Implementation summary of the LOCA method
Require: Input/output function pairs $\{u_i, s_i\}_{i=1}^N$; query locations $y_i$ for evaluating $s_i$; quadrature points $z_i$.
Pre-processing:
  Apply transformation (6) to the input functions to obtain the input features $\hat{u}$.
  Apply the positional encoding (8) to the query coordinates y, z to obtain $\hat{y}, \hat{z}$.
  Choose the network architectures for the functions $q_{\theta_q}$, $f_{\theta_f}$, and $g_{\theta_g}$.
  Initialize the trainable parameters $\theta = (\theta_q, \theta_f, \theta_g)$ and choose a learning rate η.
Training:
for i = 0 to I do
  Randomly select a mini-batch of $(\hat{u}, \hat{y}, \hat{z}, s)$.
  Evaluate $g_{\theta_g}(q_{\theta_q}(\hat{z}))$.
  Compute the coupling kernel $\kappa(q_{\theta_q}(\hat{y}), q_{\theta_q}(\hat{z}))$, equation (4).
  Numerically approximate the KCA (1) and compute ϕ(y).
  Evaluate $f_{\theta_f}(\hat{u})$, as in equation (6).
  Evaluate the expectation (7) to obtain the model prediction.
  Evaluate the training loss (10) and compute its gradients $\nabla_\theta \mathcal{L}(\theta_i)$.
  Update the trainable parameters via stochastic gradient descent: $\theta_{i+1} \leftarrow \theta_i - \eta\, \nabla_\theta \mathcal{L}(\theta_i)$.
end for

6. Connections to Existing Operator Learning Methods

In this section, we provide some insight into the connections between our method and related operator learning methods.

6.1 DeepONets

Note that if we identify our input feature map, v(u), with the DeepONet's branch network, and the location-dependent probability distribution, ϕ(y), with the DeepONet's trunk network, then the last step of both models is computed the same way. We can recover the DeepONet architecture from our model under three changes to the forward evaluation. First, we would remove the normalization step in the construction of ϕ. Next, we remove the KCA mechanism that is applied to the candidate score function g (equivalently, we may fix the kernel κ to be δ-distributions along the diagonal). Finally, in the construction of the input feature map v(u), instead of the scattering transform we would act on the input with a collection of δ-distributions at the fixed sensor locations. These differences between DeepONets and LOCA result in the increased performance of our model, as we will see in Section 7.

6.2 Neural Operators

The connection between Neural Operators and DeepONets has been presented in Kovachki et al. (2021b) and Kovachki et al. (2021a), where it is shown that a particular choice of Neural Operator architecture produces a DeepONet with an arbitrary trunk network and a branch network of a certain form. In particular, a Neural Operator layer has the form
$$v^{(\ell+1)}(z) = \sigma\left( W^{(\ell)} v^{(\ell)}(z) + \int_{\mathcal{Z}} k^{(\ell)}(s, z)\, v^{(\ell)}(s)\, ds \right), \tag{11}$$
where σ is a point-wise nonlinearity. It is shown in Kovachki et al. (2021b) that this architecture can be made to resemble a DeepONet under the following choices. First, set $W^{(\ell)} = 0$.
Next, lift the input data to n tiled copies of itself and choose a kernel k that is separable in s and z. If the output of the layer is then projected back to the original dimension by summing the coordinates, the architecture resembles a DeepONet. The correspondence between our model and DeepONets described above allows us to transitively connect our model to Neural Operators as well.

We additionally note that the scattering transform component of our architecture can be viewed as a collection of multi-layer Neural Operators with fixed weights. Returning to (11), when $W^{(\ell)} = 0$ for all ℓ, the forward pass of the architecture is a sequence of integral transforms interleaved with point-wise nonlinearities. Setting σ to be the complex modulus function and $k^{(\ell)}$ to be a wavelet filter $\psi_{\lambda_\ell}$, we may write $v^{(\ell+1)} = |v^{(\ell)} * \psi_{\lambda_\ell}|$. When we compose L of these layers, we recover (9) up to the application of the final low-pass filter (again a linear convolution),
$$v^{(L)} = \Big|\cdots\big||u * \psi_{\lambda_1}| * \psi_{\lambda_2}\big| \cdots * \psi_{\lambda_L}\Big|.$$
Thus, we may interpret the scattering transform as samples from a collection of Neural Operators with fixed weights. This connection between the scattering transform and convolutional neural architectures with fixed weights was noticed during the original formulation of the wavelet scattering transform by Bruna and Mallat (2013), and thus also extends to Neural Operators via the correspondence between Neural Operators and (finite-dimensional) convolutional neural networks (Kovachki et al., 2021b).

We would like to emphasize that our application of a kernel transform is different from how it is used in a Neural Operator. While Neural Operators apply the kernel transformation to functions of the input, we apply it to an auxiliary score function of the output function domain that does not depend on the input at all.
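The fixed-filter cascade $v^{(L)} = |\cdots||u * \psi_{\lambda_1}| * \psi_{\lambda_2}|\cdots * \psi_{\lambda_L}|$ described above can be sketched in one dimension as follows. Gaussian-windowed cosines stand in for a proper wavelet family, and periodic boundaries are assumed; this is an illustration of the alternating convolution/modulus structure, not the paper's scattering implementation:

```python
import numpy as np

def conv_circ(u, h):
    # circular convolution via FFT (periodic boundary assumption)
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(h)))

def bandpass(n, freq, width):
    # crude Gaussian-windowed cosine standing in for a wavelet psi_lambda
    t = np.arange(n) - n // 2
    w = np.exp(-t ** 2 / (2 * width ** 2)) * np.cos(2 * np.pi * freq * t / n)
    return np.fft.ifftshift(w / np.abs(w).sum())

def scattering_path(u, filters, lowpass):
    """S[p]u = |...||u * psi_1| * psi_2| ... * psi_m| * phi  (equation (9))."""
    v = u
    for psi in filters:
        v = np.abs(conv_circ(v, psi))   # convolution followed by modulus
    return conv_circ(v, lowpass)        # final low-pass filter phi

n = 256
x = np.arange(n) / n
u = np.sin(2 * np.pi * 8 * x)
psi1 = bandpass(n, freq=8, width=12.0)
psi2 = bandpass(n, freq=4, width=24.0)
phi = bandpass(n, freq=0, width=48.0)   # freq=0 yields a pure low-pass
out = scattering_path(u, [psi1, psi2], phi)
```

Each layer here is exactly a fixed-weight instance of (11) with $W^{(\ell)} = 0$ and σ the modulus, which is the correspondence drawn in the text.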
6.3 Other Attention-Based Architectures

Here we compare our method with two other recently proposed attention-based operator learning architectures. The first is the Galerkin/Fourier Transformer (Cao, 2021). This method operates on a fixed input and output grid, and most closely resembles the original sequence-to-sequence Transformer architecture (Vaswani et al., 2017) with different choices of normalization. As in the original sequence-to-sequence architecture, the attention weights are applied across the indices (sensor locations) of the input sequence. By contrast, in our model the attention mechanism is applied to a finite-dimensional feature representation of the input that is not indexed by the input function domain. Additionally, our attention weights are themselves coupled over the domain $\mathcal{Y}$ via the KCA mechanism (2), as opposed to being defined over the input function domain in an uncoupled manner.

A continuous attention mechanism for operator learning was also proposed as a special case of Neural Operators in Kovachki et al. (2021b). There, it was noted that if the kernel in the Neural Operator was (up to a linear transformation) of the form
$$k(v(x), v(y)) = \left( \int_{\mathcal{Z}} \exp\big( \langle A v(s), B v(y) \rangle_{\mathbb{R}^m} \big)\, ds \right)^{-1} \exp\big( \langle A v(x), B v(y) \rangle_{\mathbb{R}^m} \big),$$
with $A, B \in \mathbb{R}^{m \times n}$, then the corresponding Neural Operator layer can be interpreted as the continuous generalization of a Transformer block. Further, upon discretization of the integral this recovers exactly the sequence-to-sequence discrete Transformer model. The main difference between this kind of continuous Transformer and our approach is again how the attention mechanism is applied to the inputs. The Neural Operator Transformer is similar to the Galerkin/Fourier Transformer in the sense that the attention mechanism is applied over the points of the input function itself, whereas our model first creates a different finite-dimensional feature representation of the input function to which the attention is applied.
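To make the contrast concrete, here is a heavily simplified sketch of a LOCA-style forward pass: the coupling kernel and the attention weights live on the output domain $\mathcal{Y}$ and act on a score function g, independently of the input, while the input enters only through a finite feature vector v(u). The RBF kernel and the tanh stand-ins for the trainable maps q, g, f are our own illustrative choices, not the paper's networks:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def loca_forward(u_feats, y_query, z_nodes, w, q, g, f):
    """Sketch of a LOCA-style forward pass.
    u_feats : (d,) feature vector v(u) of the input function
    y_query : (P, 1) output query locations in Y
    z_nodes, w : quadrature nodes (Q, 1) and weights (Q,) on Y
    q, g, f : illustrative stand-ins for the trainable maps"""
    qy, qz = q(y_query), q(z_nodes)
    # coupling kernel on q-features (RBF stand-in), normalized over Y
    K = np.exp(-np.sum((qy[:, None, :] - qz[None, :, :]) ** 2, axis=-1))
    K = K / (K @ w)[:, None]
    # KCA: integrate the score function g against the kernel, then normalize
    phi = softmax((K * w[None, :]) @ g(z_nodes), axis=-1)   # (P, n)
    # attention-weighted average of the input features
    return phi @ f(u_feats)                                  # (P,)

rng = np.random.default_rng(0)
n, d, P, Q = 8, 16, 10, 32
z, w = np.polynomial.legendre.leggauss(Q)
z, w = 0.5 * (z[:, None] + 1.0), 0.5 * w        # map nodes/weights to Y = [0, 1]
A, B, C = rng.normal(size=(1, 4)), rng.normal(size=(1, n)), rng.normal(size=(d, n))
pred = loca_forward(rng.normal(size=d), rng.uniform(size=(P, 1)), z, w,
                    q=lambda y: np.tanh(y @ A),
                    g=lambda y: np.tanh(y @ B),
                    f=lambda u: u @ C)
```

Note that the matrices K and phi are indexed only by output-domain points, never by input sensor locations, which is precisely the difference from the Transformer-style mechanisms above.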
We note that our model does make use of attention weights defined over a continuous domain, but it is the domain of the output functions $\mathcal{Y}$ as opposed to $\mathcal{X}$.

6.4 Existing Error Estimate Frameworks

Recent work in Lanthaler et al. (2022) and Kovachki et al. (2021a) has explored error estimates for DeepONets and pseudo-spectral Fourier Neural Operators (Ψ-FNO), respectively. Note that the Ψ-FNO is a modification of the original FNO architecture introduced to facilitate the analysis in Kovachki et al. (2021a). In both cases a fundamental lower bound on the error of the architecture is derived in terms of a subspace of functions containing the outputs of the operator network. If $\mathcal{F}$ is an operator approximating $\mathcal{G}$ and $\mathcal{F}$ always produces functions that lie in some subspace $V \subset \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space of functions from $\mathcal{Y} \to \mathbb{R}^{d_s}$, then there is a lower bound for the worst-case error (see Kovachki et al. (2021a)),
$$\sup_{u \in \mathcal{U}} \|\mathcal{F}(u) - \mathcal{G}(u)\|_{\mathcal{H}} \;\geq\; \sup_{u \in \mathcal{U}} \|P_V \mathcal{G}(u) - \mathcal{G}(u)\|_{\mathcal{H}}, \tag{12}$$
where $P_V : \mathcal{H} \to \mathcal{H}$ is the projection operator onto the subspace V. If the input functions are sampled from a probability measure µ on $C(\mathcal{X}, \mathbb{R}^{d_u})$, then an analogous result is shown in an $L^2(\mu)$ sense (see Lanthaler et al. (2022)),
$$\|\mathcal{F} - \mathcal{G}\|_{L^2(\mu)} \;\geq\; \|P_V \mathcal{G} - \mathcal{G}\|_{L^2(\mu)}. \tag{13}$$
In the DeepONet case, the subspace V is spanned by the trunk functions. For the (pseudo-spectral) Fourier Neural Operator, this subspace is spanned by the first N trigonometric polynomials. These results are intuitive in the sense that the approximation error for a particular output is lower bounded by the minimum distance from the image of the operator network to the target function. This lower bound applies to LOCA as well, where the subspace V is given by the span of the attention weight functions $\{\phi_i\}_{i=1}^n$. In the $L^2$ case (13), it was shown in Lanthaler et al.
(2022) that if the distribution of output functions has zero mean and finite second moment, the n-dimensional subspace V which minimizes the lower bound is given by the leading eigenspace of the covariance operator of this distribution. Moreover, if this covariance operator has eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$, the overall error has the lower bound
$$\|\mathcal{F} - \mathcal{G}\|_{L^2(\mu)} \;\geq\; \sqrt{\sum_{k > n} \lambda_k}, \tag{14}$$
which quantifies the error incurred from the eigenspaces of the covariance that cannot be accessed by the output of the model. This gives a notion of an optimal subspace of output functions for the image of the operator learning architecture. We find that the inclusion of the KCA mechanism and the normalization in the $\phi_i$ is able to create subspaces V that are closer in Grassmann distance to this optimal subspace. More details on this notion of optimality and supporting experiments on a synthetic example will be presented in Section 7.4.

The results in Lanthaler et al. (2022) were derived within the following interpretation of operator learning models, which applies directly to both LOCA and the DeepONet. These models are constructed from three components. First, an encoding map $\mathcal{E} : L^2(\mathcal{X}) \to \mathbb{R}^m$ maps the input function to a finite set of features. Then, an approximation map $\mathcal{A} : \mathbb{R}^m \to \mathbb{R}^n$ transforms these features. Finally, an affine reconstruction map $\mathcal{R} : \mathbb{R}^n \to L^2(\mathcal{Y})$ takes the image of the approximation map and embeds it into $L^2(\mathcal{Y})$ using functions $\tau_i \in L^2(\mathcal{Y})$. For the DeepONet, the map $\mathcal{E}$ is a pointwise evaluation at m sensors, while the maps $\tau_i$ are given by the trunk network. For LOCA, the map $\mathcal{E}$ is given by a wavelet scattering transform, and the maps $\tau_i$ correspond to the kernel-coupled attention weights,
$$\tau_i = \phi_i = \sigma\left( \int_{\mathcal{Y}} k(z, y)\, g(z)\, dz \right)_i.$$
Another fundamental limitation of operator learning architectures is how the number of parameters scales with respect to the desired error. In general, this scaling is exponential and is referred to as the "curse of dimensionality".
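Returning to the lower bound (14), a small numerical illustration: for a synthetic zero-mean distribution of output functions with geometrically decaying modal amplitudes (an assumption made purely for illustration, unrelated to the paper's benchmarks), the bound can be computed from the empirical covariance eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_samples, n_modes = 128, 2000, 8
y = np.linspace(0.0, 1.0, M)
# synthetic zero-mean output distribution: random sinusoid combinations
# with geometrically decaying amplitudes (illustrative choice)
amps = 0.5 ** np.arange(1, n_modes + 1)
basis = np.stack([np.sin(k * np.pi * y) for k in range(1, n_modes + 1)])
S = (rng.normal(size=(N_samples, n_modes)) * amps) @ basis   # samples of s

# eigenvalues of the (discretized) empirical covariance operator
cov = S.T @ S / N_samples / M
lam = np.linalg.eigvalsh(cov)[::-1]          # sorted descending

# lower bound (14) for an n-dimensional output subspace; clip tiny
# negative eigenvalues produced by floating-point round-off
n = 4
bound = np.sqrt(np.clip(lam[n:], 0.0, None).sum())
```

As expected from (14), the bound shrinks as n grows, since each additional dimension of V can absorb one more leading eigenspace of the covariance.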
This has been shown for DeepONets (Lanthaler et al., 2022) and Fourier Neural Operators (Kovachki et al., 2021a), though in both cases there are particular PDE-based operator learning problems where this unfavorable scaling can be improved (by showing that the architectures can emulate known solvers using relatively small networks). These results can likely be replicated for LOCA as well, though this is outside the scope of the current paper and is a topic for future research.

7. Experimental Results

In this section we provide a comprehensive collection of experimental comparisons designed to assess the performance of the proposed LOCA model against two state-of-the-art operator learning methods, the Fourier Neural Operator (FNO) (Li et al., 2020a) and the DeepONet (DON) (Lu et al., 2019). We will show that our method requires less labeled data than competing methods, is robust against noisy data and randomness in the model initialization, has a smaller spread of errors over testing data sets, and is able to successfully generalize in out-of-distribution testing scenarios. Evidence is provided for the following numerical experiments; see Figure 3 for a visual description.

Antiderivative: Learning the antiderivative operator given multi-scale source terms.

Darcy Flow: Learning the solution operator of the Darcy partial differential equation, which models the pressure of a fluid flowing through a porous medium with random permeability.

Mechanical MNIST: Learning the mapping between the initial and final displacement of heterogeneous block materials undergoing equibiaxial extension.

Shallow Water Equations: Learning the solution operator for a partial differential equation describing the flow below a pressure surface in a fluid with reflecting boundary conditions.

Climate modeling: Learning the mapping from the air temperature field over the Earth's surface to the surface air pressure field, given sparse measurements.
Finally, we show that, in an experiment where the input function probability measure µ and the associated true pushforward measure $\mathcal{G}_\#\mu$ are known analytically, the KCA mechanism and the scattering transform allow LOCA to better learn the principal eigenspaces of the covariance of the output function distribution $\mathcal{G}_\#\mu$, as measured by a normalized Grassmann distance.

For all experiments the training data sets take the following form. For each of the N input/output function pairs, $(u_i, s_i)$, we consider m discrete measurements of each input function at fixed locations, $(u_i(x_1^i), \dots, u_i(x_m^i))$, and M available discrete measurements of each output function, $(s_i(y_1^i), \dots, s_i(y_M^i))$, with the query locations $\{y_\ell^i\}_{\ell=1}^M$ potentially varying over the data set. Out of the M available measurement points $\{y_\ell^i\}_{\ell=1}^M$ for each output function $s_i$, we consider the effect of taking only P of these points for each input/output pair. For example, if we use 10% of the labeled data, we set P = M/10 and build a training data set where each example is of the form $(\{u_i(x_j^i)\}_{j=1}^m, \{s_i(y_\ell^i)\}_{\ell=1}^P)$. We round the percentages to the nearest integer or half-integer for clarity. We present details on the input and output data construction, as well as on the different problem formulations, in Section F.7 of the Appendix. In each scenario the errors are computed between the model outputs and the ground truth at full resolution.

Throughout all benchmarks, we employ Gaussian Error Linear Unit (GELU) activation functions (Hendrycks and Gimpel, 2016), and initialize all networks using the Glorot normal scheme (Glorot and Bengio, 2010). All networks are trained via mini-batch stochastic gradient descent using the Adam optimizer with default settings (Kingma and Ba, 2014).
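The P-of-M subsampling of output measurements described above can be sketched as follows (the function name and the rounding convention are our own illustrative choices):

```python
import numpy as np

def subsample_queries(y, s, percent, rng):
    """Keep `percent`% of the M available output measurements
    (query locations and values) for one training example."""
    M = y.shape[0]
    P = max(1, round(M * percent / 100))
    idx = rng.choice(M, size=P, replace=False)   # sample without replacement
    return y[idx], s[idx]

rng = np.random.default_rng(0)
M = 1024
y = np.linspace(0.0, 1.0, M)
s = np.sin(2 * np.pi * y)
y_p, s_p = subsample_queries(y, s, percent=10, rng=rng)
```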
The detailed hyper-parameter settings, the associated number of parameters for all examples, the computational cost, and other training details are provided in Appendix F.2. All code and data accompanying this manuscript will be made publicly available at https://github.com/PredictiveIntelligenceLab/LOCA.

7.1 Data Efficiency

In this section we investigate the performance of our model when the number of labeled output function points is small. In many applications labeled output function data can be scarce or costly to obtain. Therefore, it is desirable that an operator learning model can be trained successfully even without a large number of output function measurements. We investigate this property in the Darcy flow experiment by gradually increasing the percentage of labeled output function measurements used per input function example. Next, we compare the performance of all models on the Shallow Water benchmark in the small data regime. Lastly, we demonstrate that the proposed KCA weights provide additional training stability specifically in the small data regime. One important aspect of learning in the small data regime is the presence of outliers in the error statistics, which quantify the worst-case-scenario predictions. In each benchmark we present the following error statistics across the testing data set: the error spread around the median, and outliers beyond the third quartile.

Figure 4 shows the effect of varying the percentage of labeled output points used per training example in the Darcy flow prediction example. The box plot shows the distribution of errors over the test data set for each model. We see that the proposed LOCA model is able to achieve low prediction errors even with 1.5% of the available output function measurements per example. It also has a consistently smaller spread of errors with fewer outliers across the test data set in all scenarios.
Moreover, when our model has access to 6% of the available output function measurements, it achieves lower errors than both the DON and the FNO trained with any percentage (up to 100%) of the total available labeled data. Figure 6 shows the spread of errors across the test data set for the Shallow Water benchmark when the LOCA model is trained on 2.5% of the available labeled data per input-output function pair. We observe that our model outperforms DON and FNO in predicting the wave height, ρ, and provides similar errors to the FNO for the two velocity components, v1 and v2. Despite the fact that the two methods perform similarly with respect to the median error, LOCA consistently provides a much smaller standard deviation of errors across the test data set, as well as far fewer outliers. We hypothesize that the KCA mechanism is responsible for the good performance of LOCA observed in the small labeled data regime.

Figure 3: A schematic visualization of the operator learning benchmarks considered in this work. Shown are the input/output functions and a description of their physical meaning, as well as the operator that we learn for each example (Antiderivative: smooth univariate function to antiderivative solution; Darcy Flow: random permeability to pressure field; Mechanical MNIST: initial displacement vector field to later displacement vector field; Shallow Water: random initial condition to solution at specified time instances; Climate Modeling: surface air temperature to surface air pressure). In the Mechanical MNIST example, for visual clarity we do not present the map that the model is actually learning, which is the displacement in the vertical and horizontal directions, but the position of each pixel under a specified displacement. See Appendix Section F.7 for more details on the data set generation.
An interpretation for this phenomenon is that the kernel coupling allows evaluations of the score network g over the entire query domain to be used in estimating gradients with respect to its parameters, even when the number of query points is small. When a small number of output function measurements are available, the empirical loss function approximates (up to a constant) the $L^2$ error between an output function and a target function with a small sum over the observed query locations $\{y_1, \dots, y_P\}$,
$$\|\mathcal{F}(u) - s\|_{L^2}^2 \approx \sum_{k=1}^{P} \big( \mathcal{F}(u)(y_k) - s(y_k) \big)^2 = \sum_{k=1}^{P} \Big( \sum_{i=1}^{n} v(u)_i\, \phi(y_k)_i - s(y_k) \Big)^2. \tag{15}$$
Ignoring the softmax normalization for now, if $\phi_i = g_i$ without the kernel coupling, the gradient of the loss with respect to the parameters of $g_\theta$, denoted θ, will only use information from the network $g_i$ at the few query locations $\{y_1, \dots, y_P\}$,
$$\nabla_\theta\, \phi(y_k)_i = \nabla_\theta\, g_\theta(y_k)_i. \tag{16}$$
However, if $\phi(y_k)_i$ is given as a kernel convolution of $g(y_k)_i$, then the gradient with respect to the network parameters θ uses information about the functional output of $g_\theta$ across the query domain $\mathcal{Y}$,
$$\nabla_\theta \int_{\mathcal{Y}} k(y_k, z)\, g_\theta(z)_i\, dz = \int_{\mathcal{Y}} k(y_k, z)\, \nabla_\theta\, g_\theta(z)_i\, dz. \tag{17}$$
This additionally allows the gradients $\nabla_\theta \phi_i$ to inherit the regularity properties of the kernel, resulting in smoother gradients over the varying queries. When there are few available queries for training, this smoother variation of $\nabla_\theta \phi_i(y)$ over $y \in \mathcal{Y}$ can reduce the noise in the gradients of (15).

To empirically test the effect of the KCA mechanism, we train LOCA on the Darcy flow problem with varying numbers of input/output function samples, N = [100, 200, 500], and output measurements, P = [32, 64, 128, 256, 512, 1024], both with and without the KCA mechanism, while considering the same architecture as above. We present the results in the form of checkerboard plots for the case where the KCA is considered and the case where it is not considered inside the softmax function, in Figure 5.
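The interchange of gradient and integral in (17) can be checked numerically on a toy score function $g_\theta(z) = \sin(\theta z)$ with an illustrative RBF coupling kernel (all choices here are our own, made only to verify the identity):

```python
import numpy as np

# grid discretization of Y = [0, 1]
z = np.linspace(0.0, 1.0, 200)
dz = z[1] - z[0]
k = lambda y, zz: np.exp(-(y - zz) ** 2 / 0.02)   # illustrative coupling kernel

def phi(theta, y):
    # kernel-coupled weight phi(y) = ∫ k(y, z) g_theta(z) dz, g_theta(z) = sin(theta z)
    return np.sum(k(y, z) * np.sin(theta * z)) * dz

theta, y_k, eps = 2.0, 0.3, 1e-6
# left-hand side of (17): finite-difference gradient with respect to theta
fd_grad = (phi(theta + eps, y_k) - phi(theta - eps, y_k)) / (2 * eps)
# right-hand side of (17): differentiate the score function under the integral
analytic_grad = np.sum(k(y_k, z) * z * np.cos(theta * z)) * dz
```

The two quantities agree to finite-difference accuracy, and the analytic form makes explicit that the gradient at a single query $y_k$ aggregates evaluations of $\nabla_\theta g_\theta$ over the whole domain, weighted by the kernel.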
We observe in Figure 5 that the model without the KCA mechanism presents a higher maximum error in all cases, with larger differences (around 4%) for small N and smaller differences (around 2%) for larger N. There are two differences in this experiment compared to the Darcy case above: first, we train the model using full-batch gradient descent to avoid possible noise in the gradients induced by the choice of mini-batch; second, we train for 5,000 iterations in order to avoid possible over-fitting of the models in the small data regime.

Figure 4: (Data Efficiency) Relative L2 error boxplots for the solution of Darcy flow: We present the error statistics for the Darcy flow in the form of boxplots for the case where we train on [1.5, ..., 100]% of the available output function measurements per example. We observe that our model converges quickly to a smaller median error than the other methods, and its error spread is more concentrated around the median with fewer outliers.

Figure 5: (Data Efficiency) Maximum relative testing L2 error for the solution of Darcy flow for different numbers of input/output function samples and output measurements, with and without the KCA mechanism: We present both the training (right) and testing (left) mean errors for the Darcy flow for N = [100, 200, 500] and P = [32, 64, 128, 256, 512, 1024].

7.2 Robustness

Operator learning can be a powerful tool for cases where we have access to clean simulation data for training, but wish to deploy the model on noisy experimental data. Alternatively, we may have access to noisy data for training and want to make predictions on noisy data as well.
We quantify the ability of our model to handle noise in the data by measuring the percentage increase in mean error from clean to noisy data scenarios. For all experiments in this section, we consider 7% of the available labeled data. We use the Mechanical MNIST benchmark to investigate the robustness of our model with respect to noise in the training and testing data. We consider three scenarios: one where the training and the testing data sets are clean; one where the training data set is clean, but the output data set is corrupted by Gaussian noise sampled from $\mathcal{N}(0, 0.15\, I)$; and one where both the input and the output data sets are corrupted by Gaussian noise sampled from $\mathcal{N}(0, 0.15\, I)$.

In Figure 7 we present the distribution of errors across the test data set for each noise scenario. We observe that for the case where both the training and the testing data are clean, the FNO achieves the best performance. In the scenario where the training data set is clean but the testing data set is noisy, we observe a percentage increase in the approximation error of all methods. For the Clean to Noisy scenario, the approximation error of the FNO method is increased by 1,930% and 2,238% for the displacement in the horizontal and vertical directions, respectively. For the DON method, the percentage increase is 112% and 96% for the displacement in the horizontal and vertical directions (labeled as v1 and v2), respectively.

Figure 6: (Data Efficiency) Relative L2 error boxplots for the solution of the Shallow Water equations: We present the errors for each predicted quantity of the Shallow Water equations: ρ, the height of the water, and v1 and v2, the two components of the fluid velocity vector.
We observe that LOCA achieves higher accuracy, and presents fewer outliers and a more concentrated error spread compared to its competitors.

For the LOCA method, the percentage increase is 80% and 85% for the displacement in the horizontal and vertical directions, respectively. For the Noisy to Noisy scenario, the approximation error of the FNO method is increased by 280% and 347% for the displacement in the horizontal and vertical directions, respectively. For the DON method the percentage increase is 128% and 120%, and for LOCA it is only 26% and 25% for each displacement component, respectively. We present the mean prediction error for each scenario and the corresponding percentage error increase in Table 1. We observe that even though the FNO is very accurate when both training and test data sets are clean, a random perturbation of the test data set can cause a large decrease in accuracy. On the other hand, even though the DON method presents accuracy similar to our model in the clean-to-clean case, the standard deviation of its error is greater and its robustness to noise is inferior. LOCA is clearly superior in the case where the testing data are corrupted with Gaussian noise. We again emphasize that the metric by which we assess performance is not which method has the lowest relative prediction error, but which method presents the smallest percentage increase in error when noise exists in the testing (and, in the Noisy to Noisy case, training) data compared to the case where no noise exists.

Next, we examine the variability of the models' performance with respect to the random initialization of the network parameters. We consider the Mechanical MNIST benchmark where the input data is clean but the output data contain noise.
We train each model 10 times with different random seeds for initialization and record the maximum error in each case. In Figure 8 we present the distribution of maximum prediction errors under different random seeds for the displacement in the horizontal and vertical directions, respectively. We observe that LOCA displays a smaller spread of error for the displacement in the horizontal direction, v1, and similar performance to the FNO for the displacement in the vertical direction, v2.

Figure 7: (Robustness) Relative L2 error boxplots for the Mechanical MNIST benchmark with noisy data: The left figure gives the distribution of errors for the displacement in the horizontal axis, v1, and the right figure gives the displacement in the vertical axis, v2. For all cases we consider 7% of the whole training data set as labeled data used during training.

7.3 Generalization

The ultimate goal of data-driven methods is to perform well outside of the data set they are trained on. This ability to generalize is essential for these models to be practically useful. In this section we investigate the ability of our model to generalize in three scenarios. We first consider an extrapolation problem where we predict the daily Earth surface air pressure from the daily surface air temperature. Our training data set consists of temperature and pressure measurements from 2000 to 2005, and our testing data set consists of measurements from 2005 to 2010. In Figure 9 we present the results for the extrapolation problem when considering 4% of the available pressure measurements each day for training. We observe that our method achieves the lowest error rates while also maintaining a small spread of these errors across the testing data set.
While the DON method achieves competitive performance with respect to the median error, its error spread is larger than both LOCA and FNO, with many outliers.

Table 1: (Robustness) Mechanical MNIST prediction error with noisy data: The first three rows present the mean relative L2 errors of the horizontal and vertical displacements, [Error(v1), Error(v2)]. The last two rows show the percentage increase in mean error from the noiseless case to the scenario where the testing input data is corrupted by noise, and to the scenario where both the training and testing input data sets are corrupted by noise. For each case we consider 7% of the total data set as labeled data for training. We observe that our method shows the smallest percentage increase in each noise scenario.

Noise scenario and error             FNO                 DON               LOCA
Clean to Clean (CC)                  [0.004, 0.003]      [0.028, 0.029]    [0.026, 0.026]
Clean to Noisy (CN)                  [0.088, 0.087]      [0.061, 0.057]    [0.047, 0.049]
Noisy to Noisy (NN)                  [0.016, 0.016]      [0.065, 0.064]    [0.032, 0.033]
Percentage increase, CC to CN        [1,930%, 2,238%]    [112%, 96%]       [80%, 85%]
Percentage increase, CC to NN        [280%, 347%]        [128%, 120%]      [26%, 25%]

Figure 8: (Robustness) Maximum relative L2 error boxplots for Mechanical MNIST over random model initializations: The left and right subplots show the distribution of maximum errors over the testing data set for the horizontal and vertical displacements, respectively. We consider 7% of the available output function measurements for training and run the model for 10 different random initializations. We observe that our method shows better performance than the other methods for both components v1 and v2.

Next, we examine the performance of our model under a distribution shift of the testing data.
The goal of this experiment is to learn the antiderivative operator where the training and testing data sets are sampled from a Gaussian process. We fix the length-scale of the testing distribution at 0.1 and examine the effect of training over 9 different data sets with length-scales ranging from 0.1 to 0.9. In Figure 10 we present the error on the testing data set after training on each of these training data sets. The error for each testing input is averaged over 10 random network initializations. We observe that while the LOCA and FNO methods present similar errors for the first two cases, the FNO error then increases rapidly. On the other hand, the DON method, while presenting a larger error at first, eventually performs better than the FNO as the training length-scale increases. We find that LOCA outperforms its competitors in all cases. Lastly, we examine the performance of the three models when the training and testing data sets both contain a wide range of scale and frequency behaviors. We consider this set-up as a simple model of a multi-task learning scenario, and we wish to explore the generalization capabilities of our model in this case. We construct training and testing data sets by sampling inputs from a Gaussian process whose length-scale and amplitude are chosen over ranges of 2 and 4 orders of magnitude, respectively. In Figure 11 we present samples from the input distribution, the corresponding output functions, and the distribution of errors on the testing data set. We observe that our method is more accurate and its error spread is smaller than DON and FNO. While the FNO method shows a median close to that of the LOCA model, there exist many outliers that reach very high error values.

7.4 Optimal Subspace Alignment

In this section we investigate the magnitude of the optimal error lower bounds presented in Section 6.4.
Figure 9: (Generalization) Relative L2 error boxplots for the climate modeling experiment: We present the errors for the pressure prediction task on the testing data set. We consider 4% of the whole data set as labeled data used for training. We observe that our method performs better than the other methods both with respect to the median error and the error spread.

Figure 10: (Generalization) Antiderivative relative L2 error boxplots for out-of-distribution testing sets: We show the performance of all models when trained on data sets increasingly far from the distribution of the testing set. We use all available output function measurements for training.

In particular, we consider a scenario where the input functions are drawn from a probability distribution $\mu$ on $L^2(X, \mathbb{R}^{d_u})$ and an operator $G : L^2(X, \mathbb{R}^{d_u}) \to L^2(Y, \mathbb{R}^{d_s})$. As explained in Section 6.4, given an operator learning architecture of the form $F(u)(y) = \sum_{i=1}^n \phi_i(y)\, v_i(u)$, if $V = \mathrm{span}\{\phi_1, \dots, \phi_n\}$ and $P_V : L^2(Y, \mathbb{R}^{d_s}) \to L^2(Y, \mathbb{R}^{d_s})$ is the projection onto the subspace $V$, then the $L^2(\mu)$ error satisfies the lower bound

$\|P_V \circ G - G\|_{L^2(\mu)} \le \|F - G\|_{L^2(\mu)}.$ (18)

This lower bound motivates the question: for fixed $n$, what choice of $V = \mathrm{span}\{\phi_1, \dots, \phi_n\}$ minimizes the lower bound in (18)? It was shown in Lanthaler et al. (2022) that, if the pushforward measure $G_\#\mu$ has a finite second moment and covariance operator $\Gamma : L^2(Y, \mathbb{R}^{d_s}) \to L^2(Y, \mathbb{R}^{d_s})$, the optimal choice of $V$ is given by the leading $n$-dimensional eigenspace of $\Gamma$. That is, if $\{(\lambda_i, \phi_i)\}_{i=1}^{\infty}$ with $\lambda_1 \ge \lambda_2 \ge \dots$ gives an orthonormal
eigendecomposition for the covariance operator $\Gamma = \sum_{i=1}^{\infty} \lambda_i\, \phi_i \otimes \phi_i$, then the choice of $V$ which minimizes the lower bound in (18) is given by $\hat{V} := \mathrm{span}\{\phi_1, \dots, \phi_n\}$.

Figure 11: (Generalization) Antiderivative relative L2 error boxplots given input functions with multiple length-scales and amplitudes: We present samples of the input and output functions from the testing data set in the top left and right figures, respectively, as well as the test error boxplots for each method in the bottom figure.

In the case where we can identify $\hat{V}$, we may examine how well the span of the learned functions $\{\phi_i\}_{i=1}^n$ aligns with the optimal subspace $\hat{V}$, as a quantitative measure of how well the operator learning architecture has identified an optimal finite-dimensional subspace in which to represent the output functions from $G_\#\mu$. To measure the alignment of two subspaces $V$ and $W$ of dimension $n$, we use the geodesic Grassmann distance. To define it, the first principal angle $\theta_1$ between the two subspaces is given by

$\cos \theta_1 := \max_{v \in V,\, w \in W,\, \|v\| = \|w\| = 1} v^\top w,$ (19)

and the remaining principal angles $\theta_{i+1}$ are defined inductively as the solutions of the above problem over the subspaces $V_i = V \cap \mathrm{span}\{v_1, \dots, v_i\}^{\perp}$ and $W_i = W \cap \mathrm{span}\{w_1, \dots, w_i\}^{\perp}$, where $v_i$ and $w_i$ are the arguments achieving $\cos \theta_i$. The geodesic Grassmann distance is then defined as

$d(V, W) = \sqrt{\theta_1^2 + \dots + \theta_n^2}.$ (20)

Note that the largest distance two $n$-dimensional subspaces can have from each other occurs when they are completely orthogonal, in which case $d(V, W) = \sqrt{n}\,\pi/2$, and two subspaces are identical if and only if all principal angles are zero and $d(V, W) = 0$.
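As an aside, the principal-angle construction above lends itself to a simple numerical recipe: with orthonormal bases Qa and Qb for the two subspaces, the cosines of the principal angles are the singular values of Qa^T Qb. A minimal NumPy sketch (the function name and test subspaces are our own, not from the paper):

```python
import numpy as np

def grassmann_distance(A, B):
    """Geodesic Grassmann distance between the column spans of A and B.

    A, B: (m, n) arrays whose columns span two n-dimensional subspaces.
    The principal angles theta_i satisfy cos(theta_i) = sigma_i, the
    singular values of Qa^T Qb for orthonormal bases Qa, Qb.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))
    return np.sqrt(np.sum(theta ** 2))

# Identical subspaces are at distance 0; completely orthogonal
# n-dimensional subspaces are at the maximum distance sqrt(n) * pi / 2.
V = np.eye(4)[:, :2]        # span{e1, e2} in R^4
W = np.eye(4)[:, 2:]        # span{e3, e4} in R^4
assert grassmann_distance(V, V[:, ::-1]) < 1e-8
assert np.isclose(grassmann_distance(V, W), np.sqrt(2) * np.pi / 2)
```

Dividing the returned value by sqrt(n) * pi / 2 gives the normalized distance reported in Tables 2 and 3.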
In our experiments we normalize this distance by dividing by its maximum value, $\sqrt{n}\,\pi/2$. For more details on the geodesic Grassmann distance see Edelman et al. (1998). Details on how we numerically compute this distance for two subspaces are provided in Section F.4.

7.4.1 A Synthetic Experiment

To be able to measure the geodesic Grassmann distance to the optimal subspace $\hat{V}$, we must know the leading eigenfunctions of the pushforward covariance of $G_\#\mu$. In general we will not have an analytic form for these, but in this section we devise an experiment where the ordered eigenfunctions of $G_\#\mu$ are known exactly. To do so we construct a linear operator $G : L^2(X, \mathbb{R}) \to L^2(X, \mathbb{R})$ as follows. Given any orthonormal basis $\{\phi_i\}_{i=1}^{\infty}$ of $L^2(X, \mathbb{R})$ and a summable sequence of positive scalars $\lambda_i$, we define the probability distribution $\mu$ on $L^2(X, \mathbb{R})$ as the mean-zero Gaussian measure $N(0, \Gamma_0)$ with covariance operator

$\Gamma_0 = \sum_{i=1}^{\infty} \lambda_i\, \phi_i \otimes \phi_i.$ (21)

By the Karhunen-Loève theorem (Adler, 1990), given i.i.d. standard normal random variables $\xi_i \sim N(0, 1)$, the random function

$u = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \xi_i\, \phi_i$ (22)

is distributed according to $\mu$. For a summable sequence of positive scalars $\gamma_i$ we define a linear operator

$G = \sum_{i=1}^{\infty} \gamma_i\, \phi_i \otimes \phi_i.$ (23)

The pushforward measure $G_\#\mu$ then has covariance

$\Gamma_{G_\#\mu} = \sum_{i=1}^{\infty} \lambda_i \gamma_i\, \phi_i \otimes \phi_i.$ (24)

After a potential reindexing of the sum in (24) so that the $\lambda_i \gamma_i$ are in decreasing order, by construction the optimal $n$-dimensional subspace of output functions for an operator learning architecture to learn is given by $\hat{V} = \mathrm{span}\{\phi_1, \dots, \phi_n\}$. Since we had the freedom to choose any orthonormal collection $\{\phi_i\}$, we are able to compute the geodesic Grassmann distance of any other $n$-dimensional subspace $V$ to $\hat{V}$. We will use this setup to train LOCA and its variants on a data set $\{u_\ell, G(u_\ell)\}$ where $u_\ell$ is sampled as in (22) and $G$ is as defined in (23).
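As a sanity check on this construction, the truncated Karhunen-Loève sampling of (22) and the diagonal action of the operator can be sketched numerically. The sine basis, eigenvalue decay, grid, and sample counts below are illustrative choices of ours, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes, n_grid, n_samples = 8, 256, 4000
y = (np.arange(n_grid) + 0.5) / n_grid              # midpoint grid on X = [0, 1]

# Orthonormal sine basis (an illustrative choice) with decaying eigenvalues.
i = np.arange(1, n_modes + 1)
phi = np.sqrt(2.0) * np.sin(2 * np.pi * i[:, None] * y)   # (n_modes, n_grid)
lam = 1.0 / i ** 2

# Karhunen-Loeve sampling: u = sum_i sqrt(lambda_i) xi_i phi_i with xi_i ~ N(0, 1).
xi = rng.standard_normal((n_samples, n_modes))
u = (xi * np.sqrt(lam)) @ phi                        # (n_samples, n_grid)

# A diagonal operator rescales each KL coefficient by gamma_i.
gam = 1.0 / i
Gu = (xi * np.sqrt(lam) * gam) @ phi

# Sanity check: variances of the basis coefficients of u recover lambda_i.
coeffs = u @ phi.T / n_grid                          # Riemann-sum projections <u, phi_i>
assert np.allclose(coeffs.var(axis=0), lam, rtol=0.15)
```

Training pairs (u_ℓ, G(u_ℓ)) for the synthetic experiment are then rows of `u` and `Gu`.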
We choose $\lambda_i = ((2\pi i)^2 + \tau^2)^{-\alpha}$ and $\phi_i(y) = \sqrt{2} \sin(2\pi i y)$, which are the eigenvalues and eigenfunctions of the operator $(-\Delta + \tau^2 I)^{-\alpha}$ on $X = [0, 1]$ with zero boundary conditions. To form the operator $G$ we sample the $\gamma_i$ i.i.d. from a truncated normal distribution $TN(0, 1)$. After training, we take the trained $\phi_i$ and compute the geodesic Grassmann distance between $\mathrm{span}\{\phi_1, \dots, \phi_n\}$ and $\hat{V}$ (see Section F.4 for implementation details). This allows us to see how the different architecture components of LOCA affect its ability to learn a subspace aligned with the optimal $\hat{V}$. In Table 2 we report the normalized Grassmann distance between the subspace learned by an operator learning architecture and the optimal subspace $\hat{V}$. We see that LOCA with the KCA, the kernel feature map $q$, and the scattering transform gives the best alignment with $\hat{V}$, though LOCA with only the KCA and no map $q$ or scattering transform also gives better alignment than the DeepONet. This suggests that while LOCA does not avoid the fundamental lower bound on the error from Lanthaler et al. (2022), its novel architecture components allow us to get closer to it. In Table 3 we report the normalized Grassmann distance to $\hat{V}$ after training LOCA with different kernels in the KCA mechanism. In particular, we see that for a problem whose functions have periodic structure, such as this one, periodic kernels give the best alignment with the dominant eigenspace of $\Gamma_{G_\#\mu}$. This shows that not only does the KCA itself lead to better representations of the output functions, but also that a proper choice of kernel can provide a meaningful inductive bias for the attention weight functions $\phi(y)$.

8. Discussion

This work proposes a novel operator learning framework with approximation theoretic guarantees.
Drawing inspiration from the Bahdanau attention mechanism (Bahdanau et al., 2015), the model is constructed by averaging a feature embedding of an input function with a probability distribution that depends on the location at which we evaluate the output function. A key novelty of our approach is the coupling of these probability distributions through a variation of the classic attention mechanism called Kernel-Coupled Attention (KCA). In KCA, the probability distributions for the input features are coupled to each other with a kernel integral transform over the query domain Y.

Table 2: Grassmann distance between the subspace spanned by the true eigenfunctions and the subspace obtained by orthonormalizing the learned ϕ functions. We consider the LOCA method with a periodic kernel and remove parts of it until we recover the DeepONet method. We observe that the Grassmann distance increases as we go from LOCA to DON.

Method                                                    Grassmann distance
LOCA                                                      0.263
LOCA removing the q function and the scattering transform 0.393
DON                                                       0.454

Table 3: Effect of the kernel choice in KCA on the Grassmann distance from the subspace spanned by the true eigenfunctions.

Method                      Grassmann distance
No KCA                      0.386
RBF kernel                  0.461
Periodic kernel             0.263
Local periodic kernel       0.367
Matérn 5/2 kernel           0.433
Matérn 3/2 kernel           0.430
Rational quadratic kernel   0.377

This coupling provides two main benefits. The first is that correlations between points of the output function can be learned and enforced by the action of the kernel integral transform. Additionally, when there are few available measurements of the output functions in the training data set, this kernel integral transform can provide more reliable gradients with respect to the score network parameters. We hypothesize, and support with experiments, that this property allows the model to learn very efficiently using a small fraction of labeled data.
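For concreteness, the coupling step described above can be sketched in a few lines. This is our own simplified illustration: scalar outputs, an identity feature map q, uniform integration weights, a softmax normalization, and random stand-ins for the learned networks g and v are all assumptions of the sketch, not the exact LOCA implementation:

```python
import numpy as np

def rbf(a, b, gamma=10.0):
    """Base similarity kernel k(y, y') = exp(-gamma * |y - y'|^2)."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Query locations and integration nodes on Y = [0, 1], with uniform weights.
y = np.linspace(0.0, 1.0, 50)
z = np.linspace(0.0, 1.0, 64)
w = np.full(z.size, 1.0 / z.size)

# Normalized coupling kernel kappa(y, z): the base kernel divided by the
# square roots of its integrals in each argument.
Kyz, Kzz = rbf(y, z), rbf(z, z)
kappa = (Kyz @ w)[:, None] ** -0.5 * Kyz * (Kzz @ w)[None, :] ** -0.5

# Proposal scores g(z) are smoothed by the kernel integral transform,
# then normalized into attention weights phi(y) on the simplex.
g = np.random.default_rng(1).standard_normal((z.size, 8))   # n = 8 features
g_hat = (kappa * w[None, :]) @ g        # g_hat(y) = integral of kappa(y, z) g(z) dz
phi = softmax(g_hat, axis=-1)           # attention weights, rows sum to 1

# The output averages a stand-in feature vector v(u) with these weights.
v = np.random.default_rng(2).standard_normal(8)
s = phi @ v                             # prediction s(y) at each query
assert np.allclose(phi.sum(axis=-1), 1.0) and s.shape == (50,)
```

Because every phi(y) is built from the same smoothed scores, nearby queries receive correlated attention weights, which is the coupling effect discussed above.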
In order to have a feature encoder that is robust to small deformations and noise in the input, we employ a multi-resolution feature extraction method based on the wavelet scattering transform of Bruna and Mallat (2013). We empirically show that this robustness is indeed a property of our model. Our experiments additionally show that the model is able to generalize across varying distributions of functional inputs, and is able to extrapolate on a functional regression task with global climate data. Lastly, we show in a synthetic experiment that each component of LOCA, together and in isolation, increases the alignment of the learned subspace of output functions with the subspace of output functions that optimizes a known lower bound on the error from Lanthaler et al. (2022). As discussed in Section 6.4, the error of the methods used in the experiments has a lower bound that is controlled by the decay of the spectrum of the output function covariance operator. When this decay is slow (for example, in transport-dominated problems) these models have difficulty learning an approximation of the true operator; see Lanthaler et al. (2021). While we cannot avoid this fundamental lower bound, we show in Section 7.4 that the KCA component of our architecture is able to bring us closer to the optimal subspace of output functions for this lower bound. However, we also show that not all choices of the coupling kernel perform equally well for the synthetic problem presented in Section 7.4. Thus, a potential drawback of the method is that a poorly chosen kernel might reduce the performance of the base model. Prior knowledge of the structure or regularity of the output functions can help to avoid choosing kernels that would have this effect. Another potential drawback of the proposed method is the computational cost needed for numerically approximating the integrals in the KCA mechanism. When using Monte Carlo integration with P query points, the complexity is O(P^2).
If instead a quadrature approach is taken with P queries and Q nodes, the complexity is O(PQ + Q^2). The relative efficiency of these two approaches will in general depend on the number of quadrature points necessary for a good approximation, and thus on the dimension of the query domain. In general, integrals over high-dimensional domains become increasingly costly to compute. Therefore, an immediate future research direction is to use further approximations that allow the kernel integral computations to scale to larger numbers of points and dimensions. A first approach is to parallelize the integral computation by partitioning the domain into pieces and summing the integral contributions from each piece. To lower the computational complexity of the kernel computations between the query and integration points, we can also use approximations of the kernel matrix. For example, in the seminal paper of Rahimi and Recht (Rahimi et al., 2007) the authors propose an approximation of the kernel using a random feature strategy. More recently, in the context of transformer architectures, a number of approximations have been proposed to reduce the complexity of such computations to be linear in the number of kernel points, O(N). A non-exhaustive list of references includes Linformers (Wang et al., 2020), Performers (Choromanski et al., 2020), Nyströmformers (Xiong et al., 2021) and Fast Transformers (Katharopoulos et al., 2020). Another potential extension of our framework is to take the output of our model as the input function of another LOCA module and thus build a layered version of the architecture. While in our experiments we did not see this modification significantly increase the performance of the model, it is possible that other variants of this modular architecture could give performance improvements. Lastly, recall that the output of our model corresponds to the context vector generated in the Bahdanau attention.
In the align-and-translate model of Bahdanau et al. (2015) this context vector is used to construct a distribution over possible values at the output location. By using the output of our model as a context vector in a similar architecture, we could create a probabilistic model for the potential values of the output function, thereby providing a way to quantify the uncertainty associated with the predictions of our model. A main application of operator learning methods is to PDEs, where they are used as surrogates for traditional numerical solvers. Since the forward pass of an operator learning model is significantly faster than classical numerical methods, the solution of a PDE under many different initial conditions can be obtained expediently. This can be a key enabler in design and optimal control problems, where many inputs must be tested in pursuit of an optimal system configuration. A key advantage of operator learning techniques in this context is that they also allow the quick evaluation of sensitivities with respect to inputs (via automatic differentiation), thus enabling the use of gradient-based optimization. Conventional methods for computing sensitivities typically rely on solving an associated adjoint system with a numerical solver. In contrast, a well-trained operator learning architecture can compute these sensitivities in a fraction of the time. Therefore, we expect that the successful application of operator learning methods to predict the output of physical systems from control inputs can have a significant impact on the design of optimal inputs and controls. Some preliminary work in this direction has been explored in Wang et al. (2021a).

Acknowledgments

G.K. and P.P.
would like to acknowledge support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116), the US Air Force (grant AFOSR FA9550-20-1-0060), and the US Department of Energy/Advanced Research Projects Agency (grant DE-AR0001201). J.S. and G.P. would like to acknowledge support from the AFOSR under grant FA9550-19-1-0265 (Assured Autonomy in Contested Environments) and the NSF-Simons Mathematical and Scientific Foundations of Deep Learning program (grant 2031985). We also thank the developers of the software that enabled our research, including JAX (Bradbury et al., 2018), Kymatio (Andreux et al., 2020), Matplotlib (Hunter, 2007), PyTorch (Paszke et al., 2019) and NumPy (Harris et al., 2020). We would also like to thank Andreas Kalogeropoulos and Alp Aydinoglu for their useful feedback on the manuscript.

Appendix A. Nomenclature

Table 4 summarizes the main symbols and notation used in this work.

Appendix B. Proof of Theorem 2

Proof The starting point of the proof is the following lemma, which justifies approximating operators on compact sets with finite-dimensional subspaces, as in Chen and Chen (1995); Lanthaler et al. (2022); Kovachki et al. (2021b). The lemma follows immediately from the fact that for any compact subset U of a Banach space E and any ϵ > 0, there exists a finite-dimensional subspace E_n ⊆ E such that d(U, E_n) < ϵ.

Lemma 6 Let U ⊆ E be a compact subset of a Banach space E. Then for any ϵ > 0, there exist n ∈ N, functions φ_1, . . . , φ_n ∈ E, and functionals c_1, . . . , c_n with c_i : E → R such that

$\sup_{u \in U} \Big\| u - \sum_{i=1}^n c_i(u)\, \varphi_i \Big\|_E < \epsilon.$

Returning to the problem of learning a continuous operator G : U → C(Y, R^{d_s}), since U is assumed to be compact and G is continuous, the image G(U) is compact in the co-domain. Thus, we may apply Lemma 6 to the set G(U). This shows that for any ϵ > 0, there exist c_1, . . . , c_n with each c_i : U → R linear and continuous, and functions φ_1, . . .
, φ_n with each φ_i : Y → R^{d_s} such that

$\sup_{u \in U} \sup_{y \in Y} \Big\| G(u)(y) - \sum_{i=1}^n c_i(u)\, \varphi_i(y) \Big\| < \epsilon.$ (25)

Table 4: (Nomenclature) A summary of the main symbols and notation used in this work.

[n]        The set {1, . . . , n} ⊂ N.
u ⊙ v      Hadamard (element-wise) product of the vectors u and v.
C(A, B)    Space of continuous functions from a space A to a space B.
C^1        Space of continuous functions with continuous derivative.
L^2        Hilbert space of square-integrable functions.
H_k        Reproducing kernel Hilbert space with kernel k.
∆_n        The n-dimensional simplex.
X          Domain of the input functions.
Y          Domain of the output functions.
x          Input function arguments.
y          Output function arguments (queries).
u          Input function in C(X, R^{d_u}).
s          Output function in C(Y, R^{d_s}).
F          Operator mapping input functions u to output functions s.
g(y)       Proposal score function.
ĝ(y)       Kernel-coupled score function.
ϕ(y)       Attention weights at query y.
v(u)       Feature encoder.
κ(y, y')   Coupling kernel.
k(y, y')   Base similarity kernel.

Next, we show that the approximation of G(u)(y) given in (25) can be expressed equivalently as a vector of averages of a modified collection of functionals c_i. These functionals will form the coordinates of our input feature vector v(u). First, for each φ_i : Y → R^{d_s} we may form the positive and negative parts, whose coordinates are defined by

$(\varphi_i^+)_q = \max\{(\varphi_i)_q, 0\}, \qquad (\varphi_i^-)_q = -\min\{(\varphi_i)_q, 0\}.$

Note that φ_i^+ and φ_i^- are continuous, non-negative, and that φ_i = φ_i^+ − φ_i^-. For j = 1, . . . , 2n define a new collection of functions ϕ_j : Y → R^{d_s} by

$\phi_j := \frac{1}{2n} \frac{\varphi_i^+}{\|\varphi_i^+\|_\infty} \;\text{ if } j = 2i, \qquad \phi_j := \frac{1}{2n} \frac{\varphi_i^-}{\|\varphi_i^-\|_\infty} \;\text{ if } j = 2i - 1, \qquad \phi_{2n+1} := \mathbf{1}_{d_s} - \sum_{j=1}^{2n} \phi_j.$

By construction, for all y we have that ϕ_j(y) ∈ [0, 1]^{d_s}, span{φ_i}_{i=1}^{n} ⊆ span{ϕ_j}_{j=1}^{2n+1}, and

$\sum_{j=1}^{2n+1} \phi_j(y) = \mathbf{1}_{d_s}.$

In order to allow each output dimension of each ϕ_j to have its own coordinate function (and thus have v(u) ∈ R^{n d_s}), for each ϕ_j we create d_s new functions

$\phi_{j,k}(y) := e_k \odot \phi_j(y),$

where e_k ∈ R^{d_s} is the k-th standard basis vector of R^{d_s}.
Thus, we have constructed a collection of functions ϕ_{j,k} such that

$\langle \phi_{j,k}(y), e_m \rangle \neq 0 \;\text{ only if } k = m, \qquad \sum_{j=1}^{2n+1} \sum_{k=1}^{d_s} \phi_{j,k}(y) = \mathbf{1}_{d_s} \;\; \forall y \in Y, \qquad \mathrm{span}\{\varphi_i\}_{i=1}^{n} \subseteq \mathrm{span}\{\phi_{j,k}\}_{j \in [2n+1],\, k \in [d_s]}.$

Since from Lemma 6 we know d(span{φ_i}_{i=1}^{n}, G(U)) < ϵ, we conclude that d(span{ϕ_{j,k}}_{j ∈ [2n+1], k ∈ [d_s]}, G(U)) < ϵ, and the statement of the theorem follows.

Appendix C. Proof of Proposition 3

Proof Note that im(T_κ^{1/2}) = H_κ (Paulsen and Raghupathi, 2016). Since κ is universal, im(T_κ^{1/2}) = H_κ ⊆ C(Y, R^n) is dense. Thus, it suffices to show that T_κ(A) is dense in im(T_κ^{1/2}). We will make use of the following fact, which we state as a lemma.

Lemma 7 If f : X → Y is a continuous map and A ⊆ X is dense, then f(A) is dense in im(f).

By the above lemma, we have that T_κ(A) is dense in im(T_κ). Now we must show that im(T_κ) is dense in im(T_κ^{1/2}) as well. This again follows from the above lemma by noting that im(T_κ) = T_κ^{1/2}(im(T_κ^{1/2})), and im(T_κ^{1/2}) is dense in the domain of T_κ^{1/2}.

Appendix D. Proof of Proposition 4

In this section, we show that the coupling kernel is symmetric and positive semi-definite. These two conditions are necessary to obtain theoretical guarantees of universality. The symmetry of the kernel κ follows immediately from the symmetry of the base kernel k in (3) and the form of κ in (4). To prove that κ is positive semi-definite we must show that for any v_1, . . . , v_n ∈ R and y_1, . . . , y_n ∈ Y,

$\sum_{i,j=1}^{n} v_i v_j\, \kappa(y_i, y_j) \ge 0.$

For ease of notation define

$Z_i := \Big( \int_Y k(q(y_i), q(z))\, dz \Big)^{1/2}.$

Using the definition of κ from (4),

$\sum_{i,j=1}^{n} v_i v_j\, \kappa(y_i, y_j) = \sum_{i,j=1}^{n} \frac{v_i}{Z_i} \frac{v_j}{Z_j}\, k(q(y_i), q(y_j)) \ge 0,$

where in the last step we have used the positive semi-definiteness of k. Finally, the injectivity of the map q would imply that the overall feature map of κ is injective, which gives that the kernel is universal (Christmann and Steinwart, 2010).

Appendix E. Proof of Proposition 5

Proof Since U is compact, h is uniformly continuous.
Hence, there exists δ > 0 such that ‖u − v‖ < δ implies ‖h(u) − h(v)‖ < ϵ/2. Define $u_d := \sum_{i=1}^{d} \langle u, e_i \rangle\, e_i$. By the uniform convergence of u_d → u over u ∈ U, there exists d such that for all u ∈ U, ‖u − u_d‖ < δ. Thus, for all u ∈ U,

$\| h(u) - h(u_d) \| < \epsilon / 2.$

If we define r : R^d → C(X, R^{d_u}) by $r(a) := \sum_{i=1}^{d} a_i e_i$, we may write h(u_d) = (h ◦ r)(D_d(u)). Now, note that h ◦ r ∈ C(R^d, R^n), and recall that, by assumption, the function class A_d is dense in C(R^d, R^n). This means there exists f ∈ A_d such that ‖f − h ◦ r‖ < ϵ/2. Putting everything together, we see that

$\| h(u) - f(D_d(u)) \| \le \| h(u) - h(u_d) \| + \| (h \circ r)(D_d(u)) - f(D_d(u)) \| < \epsilon.$

Appendix F. Supplementary Information for Experiments

In this section, we present supplementary information on the experiments presented in Section 7.

F.1 Computational Complexity

In LOCA the most expensive operations are the integral computations in the KCA mechanism. Let z_1, . . . , z_Q be the integration nodes with z = [z_1, . . . , z_Q]^⊤, and let the associated weights be w_1, . . . , w_Q, with w = [w_1, . . . , w_Q]^⊤. Evaluating the KCA mechanism at a single query location y_0 with Q integration nodes requires computing the matrices k(y_0, z) ∈ R^{1×Q}, with [k(y_0, z)]_j = k(y_0, z_j), and k(z, z) ∈ R^{Q×Q}, with [k(z, z)]_{ij} = k(z_i, z_j). These are combined to compute the kernel κ as

$(k(y_0, z)\, w)^{-1/2}\; k(y_0, z) \odot (k(z, z)\, w)^{-1/2},$

where the first factor is a scalar and the exponent of −1/2 in the last factor is applied coordinate-wise. Carrying out this computation requires Q steps to compute k(y_0, z) and Q^2 steps to compute k(z, z), giving an overall complexity of Q + Q^2. When considering P > 1 query locations, the complexity for computing k(y, z) becomes PQ and the overall complexity becomes PQ + Q^2, because k(z, z) only needs to be computed once. In the Monte Carlo case, Q = P and y = z, so we only need a single computation of k(y, y). Therefore in this case we have a complexity of Q^2.
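As a rough illustration of this trade-off (our own back-of-the-envelope sketch, not a wall-clock benchmark), one can simply count kernel evaluations for the two strategies; the values of P and Q below are illustrative choices:

```python
# Kernel-evaluation counts for the two integration strategies in the KCA
# mechanism: quadrature with Q nodes vs. Monte Carlo with nodes placed at
# the P query locations.
def evals_quadrature(P, Q):
    return P * Q + Q * Q      # k(y, z) is (P x Q); k(z, z) is computed once

def evals_monte_carlo(P):
    return P * P              # nodes coincide with queries: one (P x P) matrix

P = 1024
for Q in (32, 128, 512):
    print(f"Q={Q}: quadrature {evals_quadrature(P, Q)}, Monte Carlo {evals_monte_carlo(P)}")
# Q=32: quadrature 33792, Monte Carlo 1048576
```

With few quadrature nodes (Q much smaller than P), quadrature needs far fewer kernel evaluations, which is the regime discussed next.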
Both methods have their benefits and disadvantages: in the Monte Carlo case, we need to perform P^2 computations once, but the cost scales quadratically with P. On the other hand, Gauss-Legendre quadrature requires PQ + Q^2 evaluations, but if Q is small the overall computational cost is lower than that of Monte Carlo integration.

F.2 Architecture Choices and Hyper-parameter Settings

In this section we present the neural network architecture choices, the training details, the training wall-clock time, and the number of trainable parameters for each model compared in the experiments. Specifically, for the DON and FNO models, we have performed an extensive number of simulations to identify settings for which these competing methods achieve their best performance. For LOCA and DON we set the batch size to 100, the initial learning rate to lr = 0.001, and use an exponential learning rate decay with a decay rate of 0.95 every 100 training iterations. For the FNO training, we set the batch size to 100 and consider a learning rate lr = 0.001, which we then reduce by a factor of 0.5 every 100 epochs, with a weight decay of 0.0001. Moreover, for the FNO method we use the ReLU activation function. For the LOCA model, we present the structure of the functions g, f, and q in Table 5. In Table 6 we present the number of samples in the training and testing data sets, the number of points where the input and output functions are evaluated, the dimensionality of the positional encoding, the dimensionality of the latent space where we evaluate the expectation E(u)(y), the batch size used for training, and the number of training iterations. We present the parameters of the wavelet scattering network in Table 7. The method used for computing the kernel integral in each example is presented in Table 8. For the DON model, we present the structure of b and t, the branch and trunk networks, in Table 9. In Table 10 we present the corresponding data set sizes, evaluation points, positional encoding and latent space dimensionalities, batch size, and number of training iterations. In order to achieve competitive performance, we also adopted some of the improvements proposed in Lu et al. (2021), including the application of harmonic feature expansions to both inputs and outputs, as well as normalization of the output functions. For the FNO model, we present the architecture choices in Table 11. In Table 12 we present the number of samples in the training and testing data sets, the number of points where the input and output functions are evaluated, the batch size used for training, and the number of training epochs.

Table 5: LOCA architecture choices for each benchmark considered in this work: We present the chosen architectures for g and q, the functions that construct ϕ(y), and for f, the function that builds the feature encoder v(u), which together make up the LOCA model.

Example            g depth  g width  f depth  f width  q depth  q width
Antiderivative     2        100      1        500      2        100
Darcy Flow         2        100      2        100      2        100
Mechanical MNIST   2        256      2        256      2        256
Shallow Water Eq.  1        1024     1        1024     1        1024
Climate Modeling   2        100      2        100      2        100

Table 6: LOCA model parameters for each benchmark considered in this work: We present the numbers of training and testing samples Ntrain and Ntest, respectively, the number of input coordinate points m where the input function is evaluated, the number of coordinates P where the output function is evaluated, the dimension of the latent space n over which we evaluate the expectation, the number of positional encoding features H, the dimensionality of the encoder l, the batch size B, and the number of iterations for which we train the model.

Example            Ntrain  Ntest  m     P    n    H   l    Batch  Train iterations
Antiderivative     1000    1000   100   100  100  10  100  100    50000
Darcy Flow         1000    1000   1024  -    100  6   100  100    20000
Mechanical MNIST   60000   10000  784   56   500  10  100  100    100000
Shallow Water Eq.  1000    1000   1024  128  480  2   100  100    80000
Climate Modeling   1825    1825   5184  144  100  10  100  100    100000

Table 7: Chosen parameters for the wavelet scattering network: J represents the log-2 scattering scales, L the number of angles used for the wavelet transform, and mo the maximum order of scattering coefficients to compute. The wavelet scattering network is implemented using the Kymatio library (Andreux et al., 2020).

Example            J  L   mo
Antiderivative     4  8   2
Darcy Flow         1  2   2
Mechanical MNIST   1  16  2
Shallow Water Eq.  1  3   2
Climate Modeling   1  8   2

Table 8: Integral computation method for each benchmark considered in this work: We present the method used to compute the required kernel integrals for the LOCA method.

Example            Integration method
Antiderivative     Quadrature
Darcy Flow         Quadrature
Mechanical MNIST   Quadrature
Shallow Water Eq.  Monte Carlo
Climate Modeling   Monte Carlo

Table 9: DON architecture choices for each benchmark considered in this work.

Example            b depth  b width  t depth  t width
Antiderivative     2        512      2        512
Darcy Flow         6        100      6        100
Mechanical MNIST   4        100      4        100
Shallow Water Eq.  11       100      11       100
Climate Modeling   4        100      4        100

Table 10: DON model parameters for each benchmark considered in this work: We present the numbers of training and testing samples Ntrain and Ntest, respectively, the number of input coordinate points m where the input function is evaluated, the number of coordinates P where the output function is evaluated, the dimension of the latent space n over which we evaluate the inner product of the branch and trunk networks, the number of positional encoding features H, the batch size B, and the number of iterations for which we train the model.

Example            Ntrain  Ntest  m     P    n    H   Batch  Train iterations
Antiderivative     1000    1000   1000  100  100  2   100    50000
Darcy Flow         1000    1000   1024  -    100  6   100    20000
Mechanical MNIST   60000   10000  784   56   500  10  100    100000
Shallow Water Eq.  1000    1000   1024  128  480  2   100    80000
Climate Modeling   1825    1825   5184  144  100  10  100    100000

Table 11: FNO architecture choices for each benchmark considered in this work.

Example            # of modes  width  # of FNO layers
Antiderivative     32          100    4
Darcy Flow         8           32     4
Mechanical MNIST   12          32     4
Shallow Water Eq.  8           25     4
Climate Modeling   12          32     4

Table 12: FNO model parameters for each benchmark considered in this work: We present the numbers of training and testing samples Ntrain and Ntest, respectively, the number of input coordinate points m where the input function is evaluated, the number of coordinates P where the output function is evaluated, the batch size B, and the number of epochs for which we train the model.

Example            Ntrain  Ntest  m     P    Batch  Train epochs
Antiderivative     1000    1000   1000  100  100    500
Darcy Flow         1000    1000   1024  -    100    500
Mechanical MNIST   60000   10000  784   56   100    200
Shallow Water Eq.  1000    1000   1024  128  100    400
Climate Modeling   1825    1825   5184  144  73     250

Table 13: Total number of trainable parameters for each model and each benchmark considered in this work.

Example            LOCA       DON        FNO
Antiderivative     1,677,300  2,186,672  1,333,757
Darcy Flow         381,000    449,400    532,993
Mechanical MNIST   2,475,060  3,050,300  1,188,514
Shallow Water Eq.  5,528,484  5,565,660  5,126,690
Climate Modeling   1,239,500  5,805,800  1,188,353
Method    Testing Error (mean ± std)   Training Error (mean ± std)
LOCA-S    5.34×10⁻³ ± 1.82×10⁻³        3.28×10⁻³ ± 5.20×10⁻⁴
LOCA-WS   1.29×10⁻² ± 9.90×10⁻³        3.59×10⁻³ ± 6.73×10⁻⁴
DON-S     1.24×10⁻² ± 2.70×10⁻³        1.06×10⁻² ± 1.21×10⁻³
DON-WS    2.01×10⁻² ± 6.66×10⁻³        1.92×10⁻² ± 3.12×10⁻³
FNO-S     2.87×10⁻² ± 7.34×10⁻³        1.82×10⁻² ± 3.67×10⁻³
FNO-WS    9.72×10⁻³ ± 1.73×10⁻³        7.86×10⁻³ ± 1.29×10⁻³

Table 14: Training and testing relative L2 errors for the Darcy flow case, for all methods considered in this study, with the scattering transform as an input feature extractor (denoted -S) and without it (denoted -WS).

F.3 Effect of the wavelet scattering transform as a spectral encoder

Here we perform an ablation study to determine the effect of the wavelet scattering transform on the performance of the methods employed in this manuscript. To this end, we examine its effect on the relative L2 error on the training and testing data sets, as measures of approximation and generalization capability, respectively. We train the LOCA, DON, and FNO models with and without the scattering transform on the Darcy flow example, using 12.5% of the data set for training. The Darcy flow problem set-up is explained in detail in Section F.7.2. The results are presented in Table 14. We observe that for LOCA the scattering transform barely affects the training error, but it does improve the testing error. Given that the scattering transform extracts meaningful features from the training data set that are also present in the testing data set, an improved testing error is to be expected. For DON, the scattering transform does not affect the mean error, but the standard deviation of the error vectors is slightly decreased for both training and testing. As a result, the scattering transform feature encoding can improve the worst-case scenario for DON in this example.
For FNO, the scattering transform makes both the mean and the spread of the training and testing error vectors worse. This is likely a consequence of performing the Fourier transform on the scattering coefficients, which are already in the frequency domain. Overall, we argue that while the scattering transform can be beneficial as an effective feature extractor for pre-processing data, it is not the sole reason for the differences in experimental performance between LOCA and the other methods.

F.4 Computing the Grassmann Distance for Subspace Alignment

To compute the Grassmann distance between V = span{ϕ1, . . . , ϕn} and V̂ = span{φ1, . . . , φn}, we first use Gram-Schmidt to compute an orthonormal basis {ψ1, . . . , ψn} for V. Then we form the matrix of pairwise inner products [M]ij = ⟨ψi, φj⟩. We perform a singular value decomposition of M and extract the singular values σ1, . . . , σn. The principal angles are θi = cos⁻¹(σi), i = 1, . . . , n, which are then used to compute the Grassmann distance as G(V, V̂) = (Σ_{i=1}^n θi²)^{1/2}.

Example                LOCA   DON    FNO
Antiderivative         2.23   2.08   2.06
Darcy Flow (P = 1024)  5.51   3.50   1.50
Mechanical MNIST       21.70  16.61  22.87
Shallow Water Eq.      12.10  15.39  13.95
Climate Modeling       4.52   7.51   10.49

Table 15: Computational cost for training each model across all benchmarks considered in this work: We present the wall clock time in minutes needed to train each model on a single NVIDIA RTX A6000 GPU.

F.5 Computational Cost

We present the wall clock time, in minutes, needed to train each model for each example in the manuscript in Table 15. For the Darcy flow case, the computational time is reported for P = 1024, meaning we use all available labeled output function measurements per training example.
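The subspace-alignment procedure of Section F.4 can be sketched with NumPy; the function name and the use of QR factorization in place of explicit Gram-Schmidt are our own choices for illustration, not taken from the paper's code:

```python
import numpy as np

def grassmann_distance(Phi, Psi):
    """Grassmann distance between the column spans of Phi and Psi (both d x n)."""
    # Orthonormalize each basis (QR plays the role of Gram-Schmidt).
    Q1, _ = np.linalg.qr(Phi)
    Q2, _ = np.linalg.qr(Psi)
    # Matrix of pairwise inner products between the orthonormal bases.
    M = Q1.T @ Q2
    # Singular values of M are the cosines of the principal angles.
    sigma = np.clip(np.linalg.svd(M, compute_uv=False), -1.0, 1.0)
    theta = np.arccos(sigma)           # principal angles
    return np.sqrt(np.sum(theta**2))   # Grassmann distance

# A subspace has distance zero to itself, regardless of the basis chosen.
A = np.eye(4)[:, :2]
assert abs(grassmann_distance(A, A)) < 1e-9
```

Because the distance depends only on the spans, any invertible recombination of the basis vectors leaves it unchanged, which the QR step guarantees.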
We choose this number of query points to show that, even when the number of labeled data is large, the computational cost remains reasonable despite the KCA computation bottleneck. We observe that the wall clock times of all methods lie within the same order of magnitude. All models are trained on a single NVIDIA RTX A6000 GPU.

F.6 Comparison Metrics

Throughout this work, we employ the relative L2 error as a metric to assess the test accuracy of each model, namely

Test error metric = ‖sⁱ(y) − ŝⁱ(y)‖₂² / ‖sⁱ(y)‖₂²,

where ŝ(y) is the model's predicted solution, s(y) the ground truth solution, and i the realization index. The relative L2 error is computed across all examples in the testing data set, and different statistics of the resulting error vector are calculated: the median, quantiles, and outliers. For all examples, the errors are computed between the full resolution reconstruction and the full resolution ground truth solution.

F.7 Experiments

In this section, we present additional details about the experimental scenarios discussed in Section 7.

F.7.1 Antiderivative

We approximate the antiderivative operator to demonstrate the generalization capabilities of LOCA in two inference scenarios. The antiderivative operator is defined by

ds(x)/dx = u(x),  s(x) = s₀ + ∫₀ˣ u(t) dt,

where we consider x ∈ X = [0, 1] and the initial condition s(0) = 0. For a given forcing term u, the solution operator G of system (F.7.1) returns the antiderivative s(x). In the notation of our model, the input and output function domains coincide, X = Y with dx = dy = 1. Since the solution operator is a map between scalar functions, we also have du = ds = 1. Under this setup, our goal is to learn the solution operator G : C(X, R) → C(X, R). To construct the data sets, we sample the forcing function u(x) from a Gaussian process prior and measure these functions at 500 points.
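This sampling step, together with the numerical integration described next, can be sketched as follows; the squared-exponential covariance, the jitter level, and the helper names are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def sample_gp_forcing(x, length_scale=0.2, output_scale=1.0, seed=0):
    """Draw one forcing function u(x) from a zero-mean GP prior."""
    rng = np.random.default_rng(seed)
    diff = x[:, None] - x[None, :]
    cov = output_scale * np.exp(-0.5 * diff**2 / length_scale**2)
    # Small jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(x)))
    return L @ rng.standard_normal(len(x))

def antiderivative(x, u):
    """s(x) = s0 + int_0^x u(t) dt with s0 = 0, via the trapezoidal rule."""
    dx = np.diff(x)
    return np.concatenate([[0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * dx)])

x = np.linspace(0.0, 1.0, 500)   # 500 input measurement points
u = sample_gp_forcing(x)
s = antiderivative(x, u)         # sub-sample s for the output measurements
```

Sub-sampling `s` at 100 of the 500 grid points then yields the output function measurements used for training.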
We numerically integrate them to obtain 100 measurements of each output function, to use for training the different operator learning models. To investigate the performance of LOCA on out-of-distribution prediction tasks, we create training data sets by choosing ltrain ∈ [0.1, 0.9], and consider 9 cases of increasing ltrain spaced by 0.1 each. The output scale σo of the Gaussian process prior is equal to one for all length scales. The training and testing data sets each contain N = 1,000 solutions of equation (F.7.1), and we use 100% of the available output function evaluation points, both for training and testing. For the case where we train and test on multiple length and output scales, we construct each example in the data set as follows. To construct each input sample, we first sample a uniform random variable δ ∼ U(−2, 1) and set the corresponding input sample length scale to l = 10^δ. Similarly, we construct a random amplitude scale by sampling ζ ∼ U(−2, 2) and setting σo = 10^ζ. Then we sample u(x) from a Gaussian process prior, u(x) ∼ GP(0, Cov(x, x′)), where Cov(x, x′) = σo exp(−|x − x′|/l). The length and output scales are different for each realization; therefore we have 1,000 different length and output scales in the problem.

F.7.2 Darcy Flow

Fluid flow through porous media is governed by Darcy's Law (Bear, 2013), which can be mathematically expressed by the following partial differential equation system,

∇ · (u(x)∇s(x)) = f(x),  x ∈ X,  (26)

subject to appropriate boundary conditions

s = 0 on ΓD,  (u(x)∇s(x)) · n = g on ΓN,

where u is the permeability of the porous medium and s is the corresponding fluid pressure. Here we consider the domain X = [0, 1] × [0, 1] with a Dirichlet boundary ΓD = {(0, x2) ∪ (1, x2) | x2 ∈ [0, 1]} ⊂ ∂X and a Neumann boundary ΓN = {(x1, 0) ∪ (x1, 1) | x1 ∈ [0, 1]} ⊂ ∂X. For a given forcing term f and set of boundary conditions, the solution operator G of system (26) maps the permeability function u(x) to the fluid pressure function s(x).
In the notation of our model, the input and output function domains coincide, X = Y with dx = dy = 2. Since in this case the solution operator is a map between scalar functions, we also have du = ds = 1. Under this setup, our goal is to learn the solution operator G : C(X, R) → C(X, R). We set the Neumann boundary condition to g(x) = sin(5x), the forcing term to f(x) = 5 exp(−((x1 − 0.5)² + (x2 − 0.5)²)), and sample the permeability function u(x) as the pushforward of a Gaussian measure, u(x) = exp(u0(x)) with u0 ∼ N(0, 7^{3/2}(−Δ + 49I)^{−1.5}). The training and testing data sets are constructed by sampling the input functions on a 32 × 32 grid and solving the forward problem with the finite element library FEniCS (Alnæs et al., 2015). This gives us access to 32 × 32 solution values to use for training the different operator learning models.

Figure 12: Comparison between the full resolution prediction and ground truth for the flow through porous medium testing data set: We present the input sample, the prediction, the ground truth, and the absolute error for three realizations of the Darcy flow system. The first row corresponds to the example with minimum test error, the second row to the example with maximum test error, and the third row to a randomly chosen example from the testing data set.
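One standard way to draw such a random field is to filter white noise through the square root of the spectrum of (−Δ + τ²I)^{−α} on a periodic grid; the sketch below takes this route with τ = 7 and α = 1.5, but normalization constants and the exact discretization are simplified relative to the paper's data generation:

```python
import numpy as np

def sample_permeability(n=32, tau=7.0, alpha=1.5, seed=0):
    """Sketch: shape white noise so u0 has spectrum ~ (4 pi^2 |k|^2 + tau^2)^(-alpha),
    i.e. u0 ~ N(0, C) with C = (-Laplacian + tau^2 I)^(-alpha) up to normalization,
    then take u = exp(u0) so the permeability is positive."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n) * n                       # integer wavenumbers
    k1, k2 = np.meshgrid(k, k, indexing="ij")
    sqrt_spec = (4 * np.pi**2 * (k1**2 + k2**2) + tau**2) ** (-alpha / 2)
    w_hat = np.fft.fft2(rng.standard_normal((n, n)))  # white noise in Fourier space
    u0 = np.real(np.fft.ifft2(sqrt_spec * w_hat))
    return np.exp(u0)
```

The exponential at the end is what makes the sampled field a valid (strictly positive) permeability.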
Sub-sampling these solution values in the manner described in Section 7 allows us to create training data sets that examine the effect of using only a certain percentage of the available data. Figure 12 gives a visual comparison of the outputs of our trained model against the ground truth, along with a plot of the point-wise error. We see that our model performs well across random input samples that were not present in the training data set.

F.7.3 Mechanical MNIST

For this example, our goal is to learn the operator that maps initial deformations to later-time deformations in the equi-biaxial extension benchmark from the Mechanical MNIST database (Lejeune, 2020). The data set is constructed from the results of 70,000 finite element simulations of a heterogeneous material subject to large deformations. MNIST images are used to define a heterogeneous block of material described by a compressible Neo-Hookean model (Lejeune, 2020). In our case, we are interested in learning displacement fields at later times, given some initial displacement. The material constitutive law is described by Lejeune (2020) as

ψ = (µ/2)[F : F − 3 − 2 ln(det F)] + (λ/2)[(1/2)((det F)² − 1) − ln(det F)],  (27)

where ψ is the strain energy, F is the deformation gradient, and µ and λ are the Lamé constants, related to the Young's modulus and Poisson ratio by

E = µ(3λ + 2µ)/(λ + µ),  ν = λ/(2(λ + µ)).

The Young's modulus is chosen based on the bitmap values, converting the image to a material as E = (b/255)(100 − 1) + 1, where b is the bitmap value. The Poisson ratio is fixed to ν = 0.3 for all block materials. This means that the pixels inside the digits correspond to block material that is much stiffer than the material outside of the digits.
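The bitmap-to-material mapping above, together with the standard inversion of the E and ν formulas to recover the Lamé constants, can be sketched as follows; the helper name is ours:

```python
import numpy as np

def lame_from_bitmap(b, nu=0.3):
    """Map an MNIST bitmap value b in [0, 255] to the Young's modulus and
    the Lame constants (standard inversion of the E, nu relations)."""
    E = b / 255.0 * (100.0 - 1.0) + 1.0      # stiff inside digits, soft outside
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    mu = E / (2.0 * (1.0 + nu))
    return E, lam, mu

# Background pixels (b = 0) get E = 1; fully-on digit pixels (b = 255) get E = 100.
E, lam, mu = lame_from_bitmap(np.array([0.0, 255.0]))
```

Substituting λ and µ back into the formulas in the text recovers E and ν exactly, which makes the inversion easy to verify.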
For the equi-biaxial extension experiments, Dirichlet boundary conditions are applied by considering the displacement values d = [0.0, 0.001, 0.01, 0.1, 0.5, 1, 2, 4, 6, 8, 10, 12, 14] for the right and top of the domain, and −d for the left and bottom of the domain. In this benchmark the input and output function domains coincide, X = Y with dx = dy = 2, while the solution operator G is a map between vector fields with du = ds = 2. Consequently, our goal here is to learn the solution operator G : C(X, R²) → C(X, R²). Even though we learn a map between displacement vector fields, we present the magnitude of the displacement, √(v1² + v2²), for visual clarity of our plots. The data set is constructed by sampling MNIST digits on a 28 × 28 grid and solving equation (27) using the finite element library FEniCS (Alnæs et al., 2015). Out of the 70,000 realizations that the MNIST data set contains, 60,000 are used for training and 10,000 for testing, so Ntrain = 60,000 and Ntest = 10,000. We randomly sub-sample the number of measurement points per output function, as explained in Section 7, to create a training data set that demonstrates that our model only needs a small amount of labeled data to provide accurate predictions. We present a visual comparison of the outputs of the trained model against the ground truth solution for three randomly chosen initial conditions from the test data set in Figure 13. Figure 14 presents the same comparison for the minimum-error prediction, to show the change in pixel position due to the applied displacement, which is not visible when multiple solutions are presented at the same time. The error reported in Figure 14 illustrates the discrepancy (shown in magenta) between the ground truth and the predicted pixel positions.
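The visualization convention used in these figures (displacements added to an assumed regular undeformed pixel grid, with magnitudes plotted for clarity) can be sketched as follows; the helper name is ours:

```python
import numpy as np

def deformed_positions_and_magnitude(v1, v2):
    """Add displacement fields to the undeformed pixel grid (assumed regular,
    as in the plots) and compute the displacement magnitude sqrt(v1^2 + v2^2)."""
    n1, n2 = v1.shape
    X1, X2 = np.meshgrid(np.arange(n1), np.arange(n2), indexing="ij")
    pos1, pos2 = X1 + v1, X2 + v2            # deformed pixel positions
    magnitude = np.sqrt(v1**2 + v2**2)       # scalar field shown in the figures
    return pos1, pos2, magnitude
```

With zero displacement the deformed positions coincide with the undeformed grid, which is the sanity check behind the "initial position" panel of Figure 14.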
Figure 13: Comparison between the predicted and ground truth displacement magnitudes for three random testing data set initial conditions (100, 1,000, 10,000) of the Mechanical MNIST case: We present the results of our model for 3 different MNIST digits under final displacement d = 14. Despite our model having displacement vector fields as inputs and outputs, we present our inputs and results in the form of positions. For this purpose, we add the displacements in the horizontal and vertical directions to the undeformed positions of the MNIST digit pixels, which we assume lie on a regular grid. The normalized absolute error is computed with respect to the position, not the displacement, in each direction.

Figure 14: Schematic comparison between the predicted and ground truth final positions for the best testing data set prediction of the Mechanical MNIST benchmark: We present the initial position, the ground truth and predicted final positions of the block material, and the point-wise error between them (shown in magenta). We present this result schematically, that is, without an indication of the error magnitude, in order to provide a sense of the deformation of the MNIST pixels under the final displacement.

F.7.4 Shallow Water Equations

Currents in Earth science are often modeled with the shallow water equations, which describe the flow below a pressure surface when the horizontal length scales are much larger than the vertical ones.
The system of equations is defined as

∂ρ/∂t + ∂(ρv1)/∂x1 + ∂(ρv2)/∂x2 = 0,
∂(ρv1)/∂t + ∂(ρv1² + (1/2)gρ²)/∂x1 + ∂(ρv1v2)/∂x2 = 0,
∂(ρv2)/∂t + ∂(ρv1v2)/∂x1 + ∂(ρv2² + (1/2)gρ²)/∂x2 = 0,   t ∈ (0, 1], x ∈ (0, 1)²,

where ρ is the total fluid column height, v1 and v2 are the velocities in the x1- and x2-directions, averaged across the vertical column, and g is the acceleration due to gravity. The above equations can also be written in conservation form,

∂U/∂t + ∂F1(U)/∂x1 + ∂F2(U)/∂x2 = 0,  U = (ρ, ρv1, ρv2)ᵀ,
F1(U) = (ρv1, ρv1² + (1/2)gρ², ρv1v2)ᵀ,  F2(U) = (ρv2, ρv1v2, ρv2² + (1/2)gρ²)ᵀ.

For a given set of initial conditions, the solution operator G of system (F.7.4) maps the initial fluid column height and velocity fields to the fluid column height and velocity fields at later times. Again, in this problem the input and output function domains coincide, so X = Y with dx = dy = 3 and du = ds = 3. The goal is to learn the operator G : C(X, R³) → C(X, R³). We set the boundary conditions by considering a solid, impermeable wall with reflective boundaries, v1 n_{x1} + v2 n_{x2} = 0, where n̂ = n_{x1} î + n_{x2} ĵ is the unit outward normal of the boundary. We sample the initial conditions by considering a falling droplet of random width, falling from a random height at a random spatial location, with zero initial velocities:

ρ = 1 + h exp(−((x1 − ξ)² + (x2 − ζ)²)/w),  v1 = v2 = 0,

where h corresponds to the altitude the droplet falls from, w to the width of the droplet, and ξ and ζ to the coordinates where the droplet falls at time t = 0 s. Because the velocities v1, v2 are zero at the initial time t0 = 0 s for all realizations, we choose the time t0 = dt = 0.002 s as the initial time to make the problem more interesting. Therefore, the input functions become

ρ = 1 + h exp(−((x1 − ξ)² + (x2 − ζ)²)/w),  v1 = v1(dt, y1, y2),  v2 = v2(dt, y1, y2).

We set the random variables h, w, ξ, and ζ to be distributed according to the uniform distributions h ∼ U(1.5, 2.5), w ∼ U(0.002, 0.008), ξ ∼ U(0.4, 0.6), ζ ∼ U(0.4, 0.6).
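Sampling one droplet initial condition as described above can be sketched as follows (the grid size and helper name are our own choices for illustration):

```python
import numpy as np

def sample_initial_condition(n=32, seed=0):
    """Draw one random falling-droplet initial condition for the shallow water
    benchmark: rho = 1 + h*exp(-((x1 - xi)^2 + (x2 - zeta)^2)/w), v1 = v2 = 0."""
    rng = np.random.default_rng(seed)
    h = rng.uniform(1.5, 2.5)        # droplet height
    w = rng.uniform(0.002, 0.008)    # droplet width
    xi = rng.uniform(0.4, 0.6)       # droplet center, x1 coordinate
    zeta = rng.uniform(0.4, 0.6)     # droplet center, x2 coordinate
    x = np.linspace(0.0, 1.0, n)
    X1, X2 = np.meshgrid(x, x, indexing="ij")
    rho = 1.0 + h * np.exp(-((X1 - xi) ** 2 + (X2 - zeta) ** 2) / w)
    v1 = np.zeros_like(rho)
    v2 = np.zeros_like(rho)
    return rho, v1, v2
```

Each call produces a fluid column height field with a single localized bump of random height, width, and position over a flat background of height one, with zero initial velocities.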
The data set is constructed by sampling the initial conditions on a 32 × 32 grid and solving the forward problem using a Lax-Friedrichs scheme. This provides us with solutions on a 32 × 32 grid which we can use for the different operator learning models. Sub-sampling the solution to create the training data set allows us to predict the solution using only a percentage of the available spatial data. In Figures 15, 16, 17, 18, and 19 we provide a visual comparison of the outputs of the trained model at 5 time steps, t = [0.11, 0.16, 0.21, 0.26, 0.31] s, for the worst-case prediction in the testing data set, along with the point-wise absolute error plot. We see that our model provides accurate solutions at all time steps for an initial condition that is not in the training data set.

F.7.5 Climate Modeling

For this example, our aim is to approximate the map between the surface air temperature and the surface air pressure. In contrast to the previous examples, here we do not assume a known relation between these two fields, such as a partial differential equation or a constitutive law. Instead, we aim to learn a black-box operator, which we then use for making predictions of the pressure using the temperature as an input.
Therefore, we consider the map T(x) ↦ P(y),

Figure 15: Comparison between the predicted and ground truth solution for the worst case prediction of the Shallow Water Equations benchmark: We present the inputs to the model (initial conditions), the ground truth and the predicted parameters, as well as the absolute error for time instance t = 0.11 s.

Figure 16: Comparison between the predicted and ground truth solution for the worst case prediction of the Shallow Water Equations benchmark: We present the inputs to the model (initial conditions), the ground truth and the predicted parameters, as well as the absolute error for time instance t = 0.16 s.
Figure 17: Comparison between the predicted and ground truth solution for the worst case prediction of the Shallow Water Equations benchmark: We present the inputs to the model (initial conditions), the ground truth and the predicted parameters, as well as the absolute error for time instance t = 0.21 s.

Figure 18: Comparison between the predicted and ground truth solution for the worst case prediction of the Shallow Water Equations benchmark: We present the inputs to the model (initial conditions), the ground truth and the predicted parameters, as well as the absolute error for time instance t = 0.26 s.
Figure 19: Comparison between the predicted and ground truth solution for the worst case prediction of the Shallow Water Equations benchmark: We present the inputs to the model (initial conditions), the ground truth and the predicted parameters, as well as the absolute error for time instance t = 0.31 s.

where x, y ∈ [−90, 90] × [0, 360] are the latitude and longitude coordinates. For a given day of the year, the solution operator maps the surface air temperature to the surface air pressure. For this set-up, the input and output function domains coincide, X = Y with dx = dy = 2, and du = ds = 1 because the input and output functions are scalar fields. We can write the map as G : C(X, R) → C(X, R). To construct the training data set, we consider the Physical Sciences Laboratory meteorological data (Kalnay et al., 1996) (https://psl.noaa.gov/data/gridded/data.ncep.reanalysis.surface.html) from the year 2000 to 2005. We consider the different model realizations to be the values of the daily temperature and pressure for these 5 years, meaning Ntrain = 1825 (excluding the days of leap years). We sub-sample the spatial coverage from the 2.5 degree latitude × 2.5 degree longitude global grid (144 × 73) to 72 × 72 to create a regular grid for both quantities.
We consider a test data set consisting of the daily surface air temperature and pressure data from the years 2005 to 2010, meaning Ntest = 1825 (excluding leap years), also on a 72 × 72 grid.

Figure 20: Comparison between the full resolution prediction and baseline for the best prediction in the testing data set for the climate modeling benchmark: We present the input temperature field, the predicted and ground truth outputs, as well as the absolute error between our model's prediction and the ground truth solution.

We present the prediction and the ground truth, together with the respective input and error, in Figure 20. The solution shown corresponds to the index at which the error vector attains its minimum value. The prediction, the ground truth solution, and the absolute error are all presented on a 72 × 72 grid.

References

Robert J. Adler. An introduction to continuity, extrema, and related topics for general Gaussian processes. Lecture Notes-Monograph Series, 12:i–155, 1990.

Martin Alnæs, Jan Blechta, Johan Hake, August Johansson, Benjamin Kehlet, Anders Logg, Chris Richardson, Johannes Ring, Marie E. Rognes, and Garth N. Wells. The FEniCS project version 1.5. Archive of Numerical Software, 3(100), 2015.

Mathieu Andreux, Tomás Angles, Georgios Exarchakis, Roberto Leonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka, Stéphane Mallat, Joakim Andén, Eugene Belilovsky, et al. Kymatio: Scattering transforms in Python. Journal of Machine Learning Research, 21(60):1–6, 2020.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Jacob Bear. Dynamics of Fluids in Porous Media. Courier Corporation, 2013.
Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Shengze Cai, Zhicheng Wang, Lu Lu, Tamer A. Zaki, and George Em Karniadakis. DeepM&Mnet: Inferring the electroconvection multiphysics fields based on operator approximation by neural networks. arXiv preprint arXiv:2009.12935, 2020.

Shuhao Cao. Choose a Transformer: Fourier or Galerkin. arXiv preprint arXiv:2105.14995, 2021.

Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multi-task kernels. The Journal of Machine Learning Research, 9:1615–1646, 2008.

Kuang-Yu Chang and Chu-Song Chen. A learning framework for age rank estimation based on face images with scattering transform. IEEE Transactions on Image Processing, 24(3):785–798, 2015.

Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.

Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces.
In Advances in Neural Information Processing Systems, pages 406–414. Citeseer, 2010.

James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41(7):909–996, 1988.

P. Clark Di Leoni, Lu Lu, Charles Meneveau, George Karniadakis, and Tamer A. Zaki. DeepONet prediction of linear instability waves in high-speed boundary layers. arXiv preprint arXiv:2105.08697, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Xialiang Dou and Tengyuan Liang. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. Journal of the American Statistical Association, pages 1–14, 2020.

Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Craig R. Gin, Daniel E. Shea, Steven L. Brunton, and J. Nathan Kutz. DeepGreen: Deep learning of Green's functions for nonlinear boundary value problems. Scientific Reports, 11(1):1–14, 2021.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.

Yuan Gong, Yu-An Chung, and James Glass. AST: Audio Spectrogram Transformer. arXiv preprint arXiv:2104.01778, 2021.

Gaurav Gupta, Xiongye Xiao, and Paul Bogdan. Multiwavelet-based operator learning for differential equations. Advances in Neural Information Processing Systems, 34, 2021.
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer. arXiv preprint arXiv:1809.04281, 2018.

John D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, and Manuel Davy. Nonlinear functional regression: a functional RKHS approach. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 374–380. JMLR Workshop and Conference Proceedings, 2010.

Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operator-valued kernels for learning from functional response data. The Journal of Machine Learning Research, 17(1):613–666, 2016.

Eugenia Kalnay, Masao Kanamitsu, Robert Kistler, William Collins, Dennis Deaven, Lev Gandin, Mark Iredell, Suranjana Saha, Glenn White, John Woollen, et al. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society, 77(3):437–472, 1996.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret.
Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156-5165. PMLR, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Georgios Kissas, Yibo Yang, Eileen Hwuang, Walter R Witschey, John A Detre, and Paris Perdikaris. Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 358:112623, 2020.

Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(290):1-76, 2021a.

Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021b.

Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.

Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.

Emma Lejeune. Mechanical MNIST: A benchmark dataset for mechanical metamodels. Extreme Mechanics Letters, 36:100659, 2020.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020a.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar.
Multipole graph neural operator for parametric partial differential equations. arXiv preprint arXiv:2006.09535, 2020b.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020c.

Chensen Lin, Zhen Li, Lu Lu, Shengze Cai, Martin Maxey, and George Em Karniadakis. Operator learning for predicting multiscale bubble growth dynamics. The Journal of Chemical Physics, 154(10):104118, 2021.

Lu Lu, Pengzhan Jin, and George Em Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.

Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data. arXiv preprint arXiv:2111.05512, 2021.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

Charles A Micchelli and Massimiliano Pontil. Kernels for multi-task learning. In NIPS, volume 86, page 89. Citeseer, 2004.

Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177-204, 2005.

Nicholas H Nelsen and Andrew M Stuart. The random feature model for input-output maps between Banach spaces. arXiv preprint arXiv:2005.10224, 2020.

Houman Owhadi. Do ideas have shape? Plato's theory of forms as the continuous limit of artificial neural networks. arXiv preprint arXiv:2008.03920, 2020.

Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5618-5627, 2017.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055-4064. PMLR, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026-8037, 2019.

Vern I Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, volume 152. Cambridge University Press, 2016.

Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In NIPS, volume 3, page 5. Citeseer, 2007.

Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in medicine. New England Journal of Medicine, 380(14):1347-1358, 2019.

James O Ramsay. When the data are functions. Psychometrika, 47(4):379-396, 1982.

James O Ramsay and CJ Dalzell. Some tools for functional data analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539-561, 1991.

Venkataraman Santhanam, Vlad I Morariu, and Larry S Davis. Generalized deep image to image regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5609-5619, 2017.

John Shawe-Taylor, Nello Cristianini, et al. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R.G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389-2410, 2011. URL http://jmlr.org/papers/v12/sriperumbudur11a.html.

Hiroyuki Takeda, Sina Farsiu, and Peyman Milanfar. Kernel regression for image processing and reconstruction. IEEE Transactions on Image Processing, 16(2):349-366, 2007.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

Sifan Wang and Paris Perdikaris. Long-time integration of parametric evolution equations with physics-informed DeepONets. arXiv preprint arXiv:2106.05384, 2021.

Sifan Wang, Mohamed Aziz Bhouri, and Paris Perdikaris. Fast PDE-constrained optimization via self-supervised operator learning. arXiv preprint arXiv:2110.13297, 2021a.

Sifan Wang, Hanwen Wang, and Paris Perdikaris. Improved architectures and training algorithms for deep operator networks. arXiv preprint arXiv:2110.01654, 2021b.

Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Science Advances, 7(40):eabi8605, 2021c. doi: 10.1126/sciadv.abi8605.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Zelun Wang and Jyh-Charn Liu. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training, 2019.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021.