Journal of Machine Learning Research 26 (2025) 1-44 Submitted 12/23; Revised 7/25; Published 9/25

Operator Learning for Hyperbolic Partial Differential Equations

Christopher Wang cyw33@cornell.edu
Department of Mathematics, Cornell University, Ithaca, NY 14853, USA

Alex Townsend townsend@cornell.edu
Department of Mathematics, Cornell University, Ithaca, NY 14853, USA

Editor: Fei Sha

Abstract

We construct the first rigorously justified probabilistic algorithm for recovering the solution operator of a hyperbolic partial differential equation (PDE) in two variables from input-output training pairs. The primary challenge of recovering the solution operator of hyperbolic PDEs is the presence of characteristics, along which the associated Green's function is discontinuous. Therefore, a central component of our algorithm is a rank detection scheme that identifies the approximate location of the characteristics. By combining the randomized singular value decomposition with an adaptive hierarchical partition of the domain, we construct an approximant to the solution operator using O(Ψ_ϵ^{-1} ϵ^{-7} log(Ξ_ϵ^{-1} ϵ^{-1})) input-output pairs with relative error O(Ξ_ϵ^{-1} ϵ) in the operator norm as ϵ → 0, with high probability. Here, Ψ_ϵ represents the existence of degenerate singular values of the solution operator, and Ξ_ϵ measures the quality of the training data. Our assumptions on the regularity of the coefficients of the hyperbolic PDE are relatively weak given that hyperbolic PDEs do not have the instantaneous smoothing effect of elliptic and parabolic PDEs, and our recovery rate improves as the regularity of the coefficients increases. We also include numerical experiments which corroborate our theoretical findings.

Keywords: Data-driven PDE learning, hyperbolic PDE, operator learning, low-rank approximation, randomized SVD
©2025 Christopher Wang and Alex Townsend. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1724.html.

1. Introduction

In this paper, we consider the recovery of the solution operator associated with hyperbolic partial differential equations (PDEs) from data. Generally, the task of PDE learning is to capture information about an unknown, inhomogeneous PDE given data corresponding to the observed effect of the PDE on input functions (Fan et al., 2019; Fan and Ying, 2020; Gin et al., 2021; Karniadakis et al., 2021; Kovachki et al., 2023; Li et al., 2020a,b, 2021; Lu et al., 2021a,b; Wang et al., 2021). These PDEs typically represent real-world dynamical systems whose governing principles are poorly understood, even though one can accurately observe or predict their evolutions, either through experimentation or through simulation. Data-driven PDE learning has applications in climate science (Bi et al., 2023; Lam et al., 2023), biology (Raissi et al., 2020), and physics (Chen and Gu, 2021; Kochkov et al., 2021; Kutz, 2017; Qian et al., 2020), and is a significant area of research in scientific machine learning and reduced order modeling (Berman and Peherstorfer, 2023, 2024; Brunton et al., 2016; Chen et al., 2024; de Hoop et al., 2023; Krishnapriyan et al., 2021; Rudy et al., 2017; Subramanian et al., 2023; Zhang and Lin, 2018). In particular, many effective practical schemes based on neural networks have been developed to recover the solution operator associated with an unknown PDE (Boullé et al., 2022a; Feliu-Faba et al., 2020; Fan and Ying, 2020; Gin et al., 2021; Li et al., 2020a,b, 2021; Wan et al., 2023; Wang et al., 2021), although the inscrutability of these neural networks as black boxes often prevents one from understanding the underlying explanation for their success.
Moreover, theoretical research in PDE learning typically centers on error estimates for neural network-based schemes (Kovachki et al., 2023; Lanthaler et al., 2022; Lu et al., 2021a). There is a growing body of research for operator learning in the context of elliptic and parabolic PDEs (Boullé et al., 2022b; Boullé and Townsend, 2023; Schäfer and Owhadi, 2024; Schäfer et al., 2021), but there is a lack of theoretical work on solution operators of hyperbolic PDEs. The existing work focuses primarily on deep learning models for solving hyperbolic PDEs (Arora, 2023; Berman and Peherstorfer, 2024; Bruna et al., 2024; Guo et al., 2020; Huang and Agarwal, 2023; Rodriguez-Torrado et al., 2022; Thodi et al., 2022) or related operators associated with inverse problems (Khoo and Ying, 2019; Li et al., 2022; Molinaro et al., 2023), rather than recovering their solution operators. Therefore, we aim to shed light on the theoretical aspects of solution operator learning for hyperbolic PDEs and to open the field for further research.

We consider an unknown second-order hyperbolic linear partial differential operator (PDO) in two variables of the form:

Lu := u_tt − a(x, t)u_xx + b(x, t)u_x + c(x, t)u, (x, t) ∈ D_T := [0, 1] × [0, 1], (1a)

together with homogeneous initial-boundary conditions

u(x, 0) = u_t(x, 0) = 0, 0 ≤ x ≤ 1,
u(0, t) = u_x(0, t) = 0, 0 ≤ t ≤ 1,
u(1, t) = u_x(1, t) = 0, 0 ≤ t ≤ 1.

We assume that the coefficients are somewhat regular, namely, a, b, c ∈ C¹(D_T), and that L is strictly hyperbolic, i.e., a > 0, and self-adjoint, i.e., a_x + b = 0.¹ For equations of the form Lu = f, the function f is called the forcing term of the PDE, while u is the corresponding system's response or solution. The hyperbolic equation Lu = f describes wave-like phenomena, especially in heterogeneous media, such as water waves, acoustic and electromagnetic signals, or earthquakes (Evans, 2010; Lax, 2006).
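Although the recovery algorithm treats L as a black box, it may help to see how input-output pairs (f_j, u_j) could be produced in practice. The following is a minimal finite-difference sketch, not the paper's method: we assume, for illustration only, a leapfrog time-stepping scheme, plain Dirichlet walls u(0, t) = u(1, t) = 0, and the hypothetical helper name solve_cauchy.

```python
import numpy as np

def solve_cauchy(a, b, c, f, nx=201, nt=241):
    """Leapfrog sketch of the Cauchy problem (1): u_tt = a u_xx - b u_x - c u + f
    with zero initial data and zero Dirichlet walls (an illustrative assumption).
    a, b, c, f are callables of (x_array, t_scalar)."""
    x, t = np.linspace(0.0, 1.0, nx), np.linspace(0.0, 1.0, nt)
    h, tau = x[1] - x[0], t[1] - t[0]
    # CFL condition: the time step must resolve the fastest wave speed sqrt(a)
    assert tau * np.sqrt(max(a(x, s).max() for s in t)) <= h, "CFL violated"
    u = np.zeros((nt, nx))              # rows 0 and 1 stay zero: u = u_t = 0 at t = 0
    for n in range(1, nt - 1):
        uxx = np.zeros(nx)
        ux = np.zeros(nx)
        uxx[1:-1] = (u[n, 2:] - 2 * u[n, 1:-1] + u[n, :-2]) / h**2
        ux[1:-1] = (u[n, 2:] - u[n, :-2]) / (2 * h)
        rhs = a(x, t[n]) * uxx - b(x, t[n]) * ux - c(x, t[n]) * u[n] + f(x, t[n])
        u[n + 1] = 2 * u[n] - u[n - 1] + tau**2 * rhs
        u[n + 1, 0] = u[n + 1, -1] = 0.0   # homogeneous boundary conditions
    return x, t, u
```

Two sanity checks on such a solver: zero forcing returns the zero response, and a forcing supported only near t = 1 produces a response that vanishes at earlier times, reflecting the finite speed of propagation along characteristics.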
Our learning task is to approximate the solution operator that maps forcing terms to responses, given training data in the form of input-output pairs {(f_j, u_j)}_{j=1}^N that satisfy either the Cauchy problem (1) or the corresponding adjoint Cauchy problem (see Section 4 for details). Associated with the Cauchy problem (1) is a unique Green's function G : D_T × D_T → R, which is a kernel for the solution operator (Courant and Hilbert, 1962; Mackie, 1965). That is, solutions to the problem (1) are given by the integral operator

u(x, t) = ∫_{D_T} G(x, t; y, s) f(y, s) dy ds, (x, t) ∈ D_T, (2)

for f ∈ L²(D_T). We call G the homogeneous Green's function. Our goal is to recover the action of the integral operator in (2) as accurately as possible, measured by the operator norm. Notably, we are not solving an inverse problem, in that we are not interested in learning the coefficients of (1a).² In fact, due to the compactness of the solution operator, its recovery is well-posed, and our proposed algorithm is stable (see Section A.1). Rather, our task is to construct a direct solver for the Cauchy problem (1) without explicit knowledge of the equation's coefficients.

1. We discuss relaxations of these constraints, as well as inhomogeneous initial conditions, in Section 6.

1.1 Challenges and contributions

This paper describes the first theoretically justified scheme for recovering the solution operator associated with hyperbolic PDEs using only input-output data. We also provide a rigorous rate of recovery for our algorithm. While we adopt some of the strategies used by Boullé et al.
(2022b) and Boullé and Townsend (2023), such as the randomized singular value decomposition (rSVD), we face three unique circumstances when recovering the Green's functions of hyperbolic PDEs that make our situation challenging:

Characteristic curves. The primary difficulty of recovering the Green's function of a hyperbolic PDE is the presence of characteristic curves (or simply characteristics), along which the Green's function is highly irregular. Characteristics describe the trajectory of waves in spacetime and are determined by the coefficients of (1a), so their location is also unknown to us in advance. Thus, our recovery algorithm needs to detect if a region of the Green's function's domain intersects a characteristic curve, using only input-output data.

Adaptive partitioning. For elliptic and parabolic PDEs, the Green's functions are numerically low-rank off the diagonal, so they are well-suited to an approximation strategy via hierarchical partitioning of the domain (Bebendorf and Hackbusch, 2003; Boullé et al., 2022b). Since the Green's functions of hyperbolic PDEs have numerically high rank not only on the diagonal but also along the characteristics, we cannot apply a naive hierarchical partition of the domain. Instead, we use an adaptive partitioning strategy, which at each level uses information from the rSVD to decide which regions to partition further.

Regularity of coefficients. Underlying the difficulty posed by characteristic curves is a more fundamental issue with hyperbolic PDEs: irregularities are not instantaneously smoothed and propagate. This feature of Green's functions makes them challenging to recover, as the variable coefficients can create additional discontinuities.

2. Learning the coefficients of an equation can be done with techniques similar to sparse identification of nonlinear dynamics (SINDy) (Brunton et al., 2016; Kaiser et al., 2018).
Accordingly, the recovery rate we derive for the hyperbolic case depends on the regularity of the coefficients; the more regular the coefficients, the faster the recovery.

To overcome these challenges, we rely on the following three properties of the Green's function G of the hyperbolic PDO of (1a) (see Section 2):

1. The function G is square-integrable, with jump discontinuities along the characteristics and on the diagonal of D_T × D_T.
2. Away from the diagonal and the characteristics, G is regular and, consequently, numerically low-rank.
3. The characteristics form a piecewise regular hypersurface in D_T × D_T and never accumulate in finite time. In other words, the volume of a tube of radius δ around the characteristic surface shrinks to zero as δ → 0.

Our main technical contribution is the derivation of a rigorous probabilistic algorithm that constructs an approximant to the solution operator associated with (1a) using randomly generated input-output data {(f_j, u_j)}_{j=1}^N, with a small error in the operator norm. In particular, we show in Theorem 6 that with high probability, the solution operator can be recovered within a tolerance of Ξ_ϵ^{-1} ϵ using O(Ψ_ϵ^{-1} ϵ^{-7} log(Ξ_ϵ^{-1} ϵ^{-1})) input-output pairs, where Ξ_ϵ and Ψ_ϵ are defined in (31) and (32) and describe features of the operator L and the input-output data, namely, the size of the singular value gaps of the solution operator and the quality of the covariance kernel used to generate the training data, which cannot be controlled without additional assumptions. Our construction relies on an adaptive hierarchical partition of the spatio-temporal domain, which roughly identifies the location of the characteristic curves, combined with the rSVD for HS operators.
We remark that the learning rate is comparable to the one derived by Boullé and Townsend (2023) for elliptic PDEs, where it was later shown that an ϵ^{-6} factor can be improved to polylog(1/ϵ) using the peeling algorithm; see Section 6.2 (Boullé et al., 2023; Levitt and Martinsson, 2024; Lin et al., 2011). We also emphasize that our scheme succeeds for equations with both space- and time-dependent coefficients, which informs the use of the time domain rather than the frequency domain in our analysis. While we are interested in as general a setting as possible, other schemes that exploit the Helmholtz equation through the frequency domain may be preferable when the coefficients are time-independent (Anderson et al., 2020; Liu et al., 2023; Zepeda-Núñez and Demanet, 2016).

A key component of our probabilistic algorithm is a scheme that detects the numerical rank of an operator using input-output data, which allows us to tell whether or not a domain intersects a characteristic. We do so by showing, in Theorem 5, that the rSVD can efficiently recover an operator's dominant singular subspaces. We then use the singular subspaces to approximate the dominant singular values of the operator, whose decay rate corresponds to the operator's numerical rank. While our scheme assumes the existence of a gap between adjacent singular values, such an assumption is reasonable in practice since the singular values of the solution operator typically exhibit decay, which is fast enough for efficient numerical rank detection (Meier and Nakatsukasa, 2024). Still, we believe that the assumption of a singular value gap can be discarded, although this will not be discussed in the present work. Our rank detection scheme facilitates the adaptive feature of the partitioning strategy, since at each hierarchical level we partition only the subdomains flagged as numerically high-rank.
While such adaptive strategies are not new and have been previously applied to wave-like settings (Liu et al., 2021; Massei et al., 2022; Zepeda-Núñez and Demanet, 2016), our work supplies, to our knowledge, the first theoretical guarantee that the rSVD can be used in an adaptive scheme in a stable manner and with high probability of success. The use of the rSVD is particularly important in the realm of operator learning, where one often only has access to input-output data and thus cannot compute an SVD directly.

Finally, in Theorem 1, we improve the error estimates for the rSVD for HS operators, as derived by Boullé and Townsend (2023), by a factor of k, where k is the target rank of the constructed approximant. Assuming one always oversamples using an additional k training pairs, the error factor in Boullé and Townsend (2023, Thm. 1) grows to infinity roughly like O(k) as k → ∞, whereas, practically speaking, our error factor remains bounded. This improvement shows that the rSVD for HS operators behaves similarly to the rSVD for matrices; our bounds are asymptotically comparable to the bounds proved in Halko et al. (2011).

We remark that because our algorithm is mainly of theoretical value, we choose to work in the continuous setting, that is, without considering discretization, both to simplify the analysis and to maintain discretization-invariance of our theoretical guarantees. It may be desirable to learn the Green's function as an operator between infinite-dimensional function spaces rather than to learn a space- and time-discretized version of the Green's function. Indeed, Huang et al. (2025) show that learning the solution operator as a function-to-function map and only discretizing its inputs and outputs when necessary may be more data-efficient than learning its discretization as a vector-to-vector map.
In practice, learning a solution operator without discretizing in space or time can be achieved by working in a finite-dimensional subspace of L² using, for instance, the Legendre basis to represent functions, as done in the MATLAB package chebfun (Driscoll et al., 2014). Nevertheless, we implement a space- and time-discretized version of our algorithm in Section 5, demonstrating its robustness to discretization.

1.2 Organization

The paper proceeds as follows. In Section 2, we review the characteristics of hyperbolic PDEs and their relationship with the Green's function, while Section 3 develops the necessary tools from randomized linear algebra, namely, the rSVD for Hilbert–Schmidt (HS) operators. These tools are employed in Section 4 to construct our probabilistic algorithm for recovering the solution operator of a hyperbolic PDO using input-output training pairs; we also analyze the recovery rate and probability of success. A numerical implementation and example of our algorithm is presented in Section 5. Finally, we summarize our results and discuss further directions of research in Section 6. Background material on HS operators, quasimatrices, orthogonal projectors, Gaussian processes (GPs), and the Legendre basis can be found in Appendix A. Proofs related to the rSVD appear in Appendix B.

1.3 Notation

Throughout the paper, we use the following notation. Norms are denoted by ‖·‖, and the type of norm is given by a subscript. If the argument is an operator, then ‖·‖ without a subscript denotes the operator norm. For a matrix or quasimatrix A (see Section A.2), we denote its Moore–Penrose pseudoinverse by A†. We write I_m to denote an m × m identity matrix. For a random object X, we write X ∼ D to mean that X is drawn from the distribution D; we write X ∼ Y to mean that X has the same distribution as a different random object Y. We use N(µ, σ²) to denote the univariate Gaussian distribution with mean µ and variance σ².
For a vector µ ∈ R^n and a symmetric positive definite n × n matrix C, we write N(µ, C) for the multivariate Gaussian distribution with mean µ and covariance matrix C. For an integer r ≥ 0, C^r(D) denotes the space of r-times continuously differentiable functions on a domain D; if D is closed, then the functions in C^r(D) are also required to extend continuously to the boundary of D. The other function space we use is the Hilbert space L²(D) of square-integrable functions on D.

2. Green's functions of hyperbolic PDOs

Our recovery scheme relies crucially on understanding where the singularities of the Green's function lie. In this section, we investigate the geometry of characteristic curves and discuss the dependence of the regularity of the Green's function on the regularity of the coefficients of L as in (1a). We assume that the coefficients of L have regularity a, b, c ∈ C^r(D_T), for an integer r ≥ 1.

2.1 Green's functions in the domain H

We first consider the homogeneous Green's function G^H of L, as defined in (2), in the unbounded domain H = R × [0, ∞).³ Since we assume the coefficients a, b, c are at least continuously differentiable, G^H exists and is unique (Courant and Hilbert, 1962, Ch. V.5–6). Its discontinuities lie on the characteristics, which we now describe. The characteristic curves passing through some point (x0, t0) ∈ H can be interpreted as the paths in spacetime traversed by waves propagating from an instantaneous unit force at (x0, t0). We obtain a family of characteristic curves by iterating over all (x0, t0) ∈ H. For hyperbolic equations in two variables, the characteristic curves are the graphs of solutions to the following ordinary differential equations:

x′(t) + √(a(x(t), t)) = 0,  x′(t) − √(a(x(t), t)) = 0, (3)

over all choices of initial conditions (Courant and Hilbert, 1962, Ch. III.1).
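As a concrete numerical illustration (our own sketch, not part of the recovery algorithm), the two ODEs in (3) can be integrated with a standard RK4 step; the hypothetical helper characteristic below traces x′(t) = ±√(a(x, t)). For a constant speed a ≡ c², the rays reduce to the straight lines x(t) = x0 ± c(t − t0).

```python
import numpy as np

def characteristic(a, x0, t0, t1, sign=+1, n=400):
    """Trace the characteristic ray through (x0, t0) by RK4 integration of
    x'(t) = sign * sqrt(a(x, t)), i.e. one of the two ODEs in (3)."""
    ts = np.linspace(t0, t1, n + 1)
    dt = ts[1] - ts[0]
    xs = np.empty(n + 1)
    xs[0] = x0
    g = lambda x, t: sign * np.sqrt(a(x, t))
    for i in range(n):
        xx, tt = xs[i], ts[i]
        k1 = g(xx, tt)
        k2 = g(xx + 0.5 * dt * k1, tt + 0.5 * dt)
        k3 = g(xx + 0.5 * dt * k2, tt + 0.5 * dt)
        k4 = g(xx + dt * k3, tt + dt)
        xs[i + 1] = xx + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return ts, xs
```

For instance, with a(x, t) = 4 the positive ray through (0, 0) reaches x = 2 at t = 1, while the time-dependent speed a(x, t) = (1 + t)² gives x(t) = x0 + t + t²/2 exactly.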
Thus, for any (x0, t0) ∈ H, there exist exactly two characteristic curves passing through (x0, t0), given by the equations in (3), both of which are of class C^{r+1} and satisfy the initial condition x(t0) = x0. Viewed as functions of t, one is strictly increasing, and the other is strictly decreasing, so they intersect transversally at (x0, t0) and partition H into four connected components lying respectively to the north, south, east, and west of (x0, t0). We refer to the component to the north, that is, forward in time, as the future light cone Λ_future(x0, t0). Additionally, we call the characteristic ray emanating toward the northwest of (x0, t0) the negative characteristic ray, and likewise we call the ray emanating toward the northeast the positive characteristic ray (see Figure 1).

3. Much of the classical literature discusses what is known as the Riemann function, rather than the Green's function, of hyperbolic equations. The Riemann function (also referred to as the Riemann–Green function or radiation solution) is a fundamental solution of the PDE and allows one to write an integral solution to the homogeneous equation given Cauchy initial data. Essentially, every result for the Riemann function also holds for the homogeneous Green's function, as they are closely related (Mackie, 1965).

Figure 1: Characteristics associated with the coefficient a(x, t) = ((x+1)² + 1)/(t + 1) in the domain H, with initial point (x0, t0) = (1, 1). They represent wave trajectories in spacetime produced by a unit force at (1, 1): solid curves are future trajectories, while dashed curves are past trajectories. Positive and negative characteristic rays emanating from (x0, t0) are solid and labeled by ζ_±(x0, t0). The future light cone is the region labeled Λ_future.
We are interested in the bundle of positive and negative characteristic rays indexed over all initial points (x0, t0) ∈ H, since they determine where the Green's function is irregular. Let ζ_±(x0, t0) denote the positive and negative characteristic rays, respectively, emanating from (x0, t0). Then we define

Z_H^± := {(x, t, x0, t0) ∈ H × H : (x, t) ∈ ζ_±(x0, t0)}, (4)

as well as

Z_H := Z_H^+ ∪ Z_H^−. (5)

Observe that Z_H^± are 3-dimensional C^r-manifolds with boundary in H × H.⁴ For fixed (x0, t0) ∈ D_T, the slices G^H_{x0,t0} given by (x, t) ↦ G^H(x, t; x0, t0) are r-times continuously differentiable as long as (x, t) does not lie on either of the characteristic rays emanating from (x0, t0) (Lerner, 1991, Thm. 1). Moreover, due to symmetries of fundamental solutions of hyperbolic equations (Courant and Hilbert, 1962, Ch. V.5), we conclude that G^H has the same regularity in every variable. For hyperbolic equations in two variables, the singularity of the Green's function on the characteristics is a jump discontinuity. This is because G^H_{x0,t0} restricted to Λ_future(x0, t0) satisfies continuous boundary conditions on the characteristic rays (Courant and Hilbert, 1962, Ch. V.5). Outside the future light cone, the Green's function is identically zero (Mackie, 1965). In summary, we have

G^H ∈ C^r((H × H) \ Z_H), (6)

where Z_H is defined by (5).

4. Here, a C^r-manifold is simply a manifold whose transition maps are of class C^r. The claim here follows from the observation that Z_H^± can be defined as the graph of the flow generated by the time-dependent vector field (x, t) ↦ ±√(a(x, t)). The details are irrelevant here, and we omit them.

Figure 2: In a bounded domain, the characteristics reflect off the boundary, producing a series of reflecting characteristic segments, depicted by solid curves labeled ζ_±^{(j)}(x0, t0), for j = 0, 1, 2, .... Collision points are labeled t_±^{(j)}.
2.2 Green's functions in the domain D_T

When we consider L on the bounded domain D_T = [0, 1] × [0, 1], we must also establish homogeneous boundary conditions on {0, 1} × [0, 1], in addition to homogeneous initial conditions. Homogeneous boundary conditions produce wave reflections, so the characteristic curves reflect off the boundaries. When a positive characteristic ray collides with the right boundary, its reflection is given by the negative characteristic ray emanating from the collision point. Likewise, when a negative characteristic ray collides with the left boundary, its reflection is given by the positive characteristic ray emanating from the collision point. Wave trajectories can thus be described as a series ζ_±^{(0)}(x0, t0), ζ_±^{(1)}(x0, t0), ... of reflecting characteristic segments emanating from (x0, t0), which terminates at some index N_±(x0, t0) once the segments cross the time horizon t = 1 (see Figure 2). Letting N = max_{(x0,t0)∈D_T} N_±(x0, t0), we define, analogously with (4) and (5), the objects

Z_±^{(j)} := {(x, t, x0, t0) ∈ D_T × D_T : (x, t) ∈ ζ_±^{(j)}(x0, t0)}, 0 ≤ j ≤ N, (7)

Z := ⋃_{j=0}^{N} (Z_+^{(j)} ∪ Z_−^{(j)}). (8)

Again, each component Z_±^{(j)}, j = 1, ..., N, is a 3-dimensional C^r-manifold with boundary in D_T × D_T, so that Z is a piecewise C^r-manifold. Notice that the number of reflections N is bounded by max_{(x,t)∈D_T} √(a(x, t)).

Besides the reflecting characteristics, the homogeneous Green's function G in the domain D_T has the same regularity properties as the homogeneous Green's function G^H in the domain H.⁵ The reflections produced by homogeneous boundary conditions manifest as additional jump discontinuities for G, lying on the reflecting characteristic segments described above. In other words, on a bounded domain, we have

G ∈ C^r((D_T × D_T) \ Z), (9)

where Z is defined by (8).

Example 1 Consider the constant-coefficient wave equation u_tt − a²u_xx = f.
The Green's function is quite simple to understand and can be written explicitly, but it would be very notationally complicated due to the boundary conditions. It is piecewise constant, and its value in each component is either 0 or 1/(2a) (see Figure 3). Its components are partitioned by characteristic curves, which are lines of slope ±1/a emanating from the initial points (x0, t0) and reflecting off the boundary. For general hyperbolic PDEs of the form (1), the Green's function is usually not piecewise constant; nevertheless, it has the same qualitative properties, in particular jump discontinuities on the characteristics.

5. The claims made in this section can be seen by applying the method of reflections to solve the homogeneous initial-boundary problem which defines the homogeneous Green's function in a bounded domain (see, e.g., Laurent et al., 2021). Again, the details are not relevant to us.

Figure 3: A slice of the Green's function associated with the wave operator Lu = u_tt − 9u_xx with initial point (x0, t0) = (1 6). In this case, the Green's function is piecewise constant with jump discontinuities on the characteristics. The values of the Green's function are labeled in their respective regions.

3. Randomized SVD for Hilbert–Schmidt operators

This section introduces our primary tool, which we have adapted from randomized numerical linear algebra. The landmark result of Halko et al. (2011) proved that one could recover the column space of an unknown matrix with high accuracy and a high probability of success by multiplying it with standard Gaussian vectors. Boullé and Townsend (2023) extend this result to HS operators and functions drawn from a non-standard Gaussian process. This algorithm is called the rSVD.

3.1 Randomized SVD

Given a HS operator F : L²(D1) → L²(D2) with SVE as in (33), we define two quasimatrices U and V containing the left and right singular functions of F, so that the jth columns of U and V are e_j and v_j, respectively.
We also denote by Σ the infinite diagonal matrix with the singular values σ1 ≥ σ2 ≥ ⋯ of F on the diagonal. For a fixed integer k ≥ 0, we define V1 as the D1 × k quasimatrix whose columns are the first k right singular functions v1, ..., vk, and V2 as the D1 × ∞ quasimatrix whose columns are v_{k+1}, v_{k+2}, .... We analogously define U1 and U2 with the left singular functions. Similarly, we define Σ1 as the k × k diagonal matrix with the first k singular values σ1, ..., σk on the diagonal, and Σ2 as the diagonal matrix with σ_{k+1}, σ_{k+2}, ... on the diagonal. In summary, we write

F = [U1 U2] [Σ1 0; 0 Σ2] [V1 V2]*.

Additionally, for some fixed continuous symmetric positive definite kernel K : D1 × D1 → R with eigenvalues λ1 ≥ λ2 ≥ ⋯ > 0, we define the infinite matrix C by

C_ij := ∫_{D1×D1} v_i(x) K(x, y) v_j(y) dx dy, i, j ≥ 1. (10)

Observe that C is symmetric and positive definite, and that Tr(C) = Tr(K) < ∞, so it is a compact operator (Boullé and Townsend, 2023, Lem. 1 and Eq. (11)). We denote by C^{-1} the inverse operator of C on the domain for which it is well-defined. Furthermore, for a fixed integer k ≥ 1, we partition C into

C = [C11 C12; C21 C22],

where C11 is the leading k × k block. Since C11 is also positive definite and thus invertible, we define

γ_k := k / (λ1 Tr(C11^{-1})), ξ_k := 1 / (λ1 ‖C11^{-1}‖), (11)

which quantify the quality of the covariance kernel K with respect to F. Indeed, the Courant–Fischer minimax principle implies that the jth largest eigenvalue of C is bounded by λ_j, since C is a principal submatrix of V*KV. It follows that 0 < γ_k, ξ_k ≤ 1, and that the best case scenario occurs when the eigenfunctions of K are the right singular functions of F. In that case, C is an infinite diagonal matrix with entries λ1 ≥ λ2 ≥ ⋯ > 0, and γ_k = k/(Σ_{j=1}^k λ1/λ_j) and ξ_k = λ_k/λ1 attain their maximal values. One can view ξ_k as the operator norm analogue of γ_k, discussed in more detail in Boullé and Townsend (2023, §3.4). Finally, for some X ⊆ D_T, let R_X : L²(D_T) → L²(X) denote the restriction to X.
Its adjoint R_X^* : L²(X) → L²(D_T) is the zero extension operator, i.e., R_X^* f is the function which is equal to f on X and equal to zero everywhere else. We denote by F_{X×Y} := R_X F R_Y^* the restriction of F to X × Y. When considering the restricted operator, we define the analogous quantities

γ_{k,X×Y} := k / (λ1 Tr(C_{11,X×Y}^{-1})), ξ_{k,X×Y} := 1 / (λ1 ‖C_{11,X×Y}^{-1}‖), (12)

where C_{11,X×Y} is the k × k matrix

[C_{11,X×Y}]_{ij} := ∫_{D1×D1} R_Y^* v_{i,X×Y}(x) K(x, y) R_Y^* v_{j,X×Y}(y) dx dy, 1 ≤ i, j ≤ k,

and v_{1,X×Y}, ..., v_{k,X×Y} are the dominant k right singular functions of F_{X×Y}. We also define σ_{1,X×Y} ≥ σ_{2,X×Y} ≥ ⋯ as the singular values of F_{X×Y}. In this notation, we state an analogue of the rSVD for HS operators (Boullé and Townsend, 2023, Thm. 1) with respect to the operator norm. We improve the error bound by a factor of k compared to Boullé and Townsend (2023, Thm. 1).

Theorem 1 Let D1, D2 ⊆ R^d be domains with d ≥ 1, and let F : L²(D1) → L²(D2) be a HS operator with singular values σ1 ≥ σ2 ≥ ⋯ ≥ 0. Select a target rank k ≥ 2, an oversampling parameter p ≥ 2, and a D1 × (k + p) quasimatrix Ω such that each column is independently drawn from GP(0, K), where K : D1 × D1 → R is a continuous symmetric positive definite kernel with eigenvalues λ1 ≥ λ2 ≥ ⋯ > 0. Set Y = FΩ. Then

E ‖F − P_Y F‖ ≤ (1 + √(1/ξ_k) + e √((k + p)/(γ_k (p + 1)))) σ_{k+1}, (13)

where γ_k, ξ_k are defined in (11). Moreover, if p ≥ 4, then for any s, t ≥ 1, we have

‖F − P_Y F‖ ≤ (1 + s √(1/ξ_k) + t e √((k + p)/(γ_k (p + 1)))) σ_{k+1} (14)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}.

Proof See Appendix B.

We also present a simplified version of the probability bound in (14).

Corollary 2 Under the assumptions of Theorem 1 with k = p ≥ 4, we have

‖F − P_Y F‖ ≤ A_k σ_{k+1} (15)

with probability ≥ 1 − 3e^{−k}.

Proof We evaluate (14) by selecting s = √(2k) and t = e. We also use the inequalities 1 ≤ 1/γ_k ≤ 1/ξ_k, which follow from the fact that 0 < γ_k, ξ_k ≤ 1 and Tr(C11^{-1}) ≤ k ‖C11^{-1}‖.

For convenience, we abbreviate the error factors in (14) and (15) by

A_{k,p}(s, t) := 1 + s √(1/ξ_k) + t e √((k + p)/(γ_k (p + 1))), (16)

A_k := 1 + (√(2k) + √2 e²) √(1/ξ_k), (17)

as well as the analogue

A_{k,X×Y} := 1 + (√(2k) + √2 e²) √(1/ξ_{k,X×Y}) (18)

when considering a restricted domain X × Y ⊆ D_T × D_T.
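In the discrete setting, the mechanism behind these bounds can be sketched in a few lines. The following matrix analogue (our own illustrative code, with standard Gaussian probe vectors standing in for samples of GP(0, K)) shows that the error of the projected approximant P_Y F tracks the first neglected singular value σ_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(0)

def rsvd(F, k, p):
    """Matrix analogue of the randomized range finder: sketch Y = F @ Omega
    with k + p Gaussian probe vectors, then return the projection P_Y F."""
    Omega = rng.standard_normal((F.shape[1], k + p))   # random "input functions"
    Y = F @ Omega                                      # input-output products
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis of col(Y)
    return Q @ (Q.T @ F)                               # P_Y F

# Test operator with known singular values 2^0, 2^-1, 2^-2, ...
n, k = 200, 10
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 0.5 ** np.arange(n)
F = (U * sigma) @ V.T
err = np.linalg.norm(F - rsvd(F, k, p=k), 2)           # operator-norm error, O(sigma[k])
```

With k = p = 10 probes, the operator-norm error falls between σ_{k+p+1} (the best any rank-(k+p) approximant can do) and a modest multiple of σ_{k+1}, mirroring the role of σ_{k+1} in (14).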
A power scheme version of the rSVD, as described in Halko et al. (2011) and Rokhlin et al. (2009), allows one to drive down the multiplicative factor in the probabilistic estimate (14) at the expense of a logarithmic factor increase in the number of operator-function products. The idea of the power scheme is that repeated projections of Y = FΩ onto the column space of F improve the approximation.

Theorem 3 Under the same assumptions as Theorem 1, select an integer q ≥ 0. Let H = (FF*)^q F and set Z = HΩ. If p ≥ 4, then

‖F − P_Z F‖ ≤ A_{k,p}(s, t)^{1/(2q+1)} σ_{k+1} (19)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}, where A_{k,p}(s, t) is defined in (16).

Proof See Appendix B.

We summarize how the rSVD generates an approximant for the operator F in Algorithm 1, following Halko et al. (2011).

Algorithm 1 Approximating F via the rSVD
Input: HS operator F, GP covariance kernel K, target rank k ≥ 4, oversampling parameter p ≥ 2, exponent q ≥ 0, additional parameters s, t ≥ 1
Output: Approximation F̃ of F within relative error given by (19)
1: Draw a D_T × (k + p) random quasimatrix Ω with independent columns from GP(0, K)  ▷ See Section A.4 on how to draw such a quasimatrix
2: Construct Z = (FF*)^q FΩ by multiplying with F and F*
3: Compute the projector P_Z via a QR factorization of Z  ▷ See Trefethen (2010) for Householder triangularization on quasimatrices
4: Form F̃ = P_Z F

Remark 4 (Noisy training data) Here, we quantify the quality of the training data via the terms γ_k and ξ_k, defined in (11), which measure the deviation of the eigenspaces of the chosen covariance kernel K from the dominant right singular subspaces of F. One can also consider training data quality in the form of noise, namely, the presence of random additive perturbation errors arising from the computation of operator-function products, or from the collection of input-output data.
In fact, it was proven in the finite-dimensional setting that the rSVD is stable with respect to noise in the input-output data (Boullé et al., 2023, Supp. Info., §2.B), so we assume noiseless data for simplicity.

3.2 Approximating singular values via singular subspaces

Using the power scheme, we can not only improve the approximation of the operator F itself but also approximate its singular subspaces. Consequently, we can obtain excellent estimates for the singular values, which in conjunction with Theorem 1 tells us about the numerical rank of F. Notice that while Weyl's theorem (Stewart and Sun, 1990, Cor. 4.10) gives a straightforward bound on an operator's singular values, applying it meaningfully requires those singular values to decay quickly. Conversely, Theorem 5 says that we can approximate the singular values regardless of their decay rate at the cost of additional operator-function products. We emphasize that our argument assumes the existence of a gap between adjacent singular values, although we believe this assumption is not necessary in principle.

Theorem 5 Assume the hypotheses of Theorem 1, but with σ_k > σ_{k+1}. Select an integer q ≥ 0 and set δ_q as in (44). Let H := (FF*)^q F, Z := HΩ, and H̃ := P_Z H. Let Ũ_k be the D2 × k quasimatrix whose columns are the dominant k left singular vectors of H̃. If δ_q A_{k,p}(s, t) < 1, then

max_{1≤j≤k} |σ_j − σ_j(Ũ_k^* F)| ≤ (2 δ_q A_{k,p}(s, t)) / (1 − δ_q A_{k,p}(s, t)) ‖F‖ (20)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}.

Proof See Appendix B.

4. Recovering the Green's function

In this section, we construct a global approximant of the homogeneous Green's function of a 2-variable hyperbolic PDO L of the form (1a) in the domain D_T = [0, 1] × [0, 1]. Recall that L is assumed to be linear, strictly hyperbolic, and self-adjoint, with coefficients satisfying a, b, c ∈ C¹(D_T).
We suppose that one can generate $N$ forcing terms $\{f_j\}_{j=1}^N$ drawn from a Gaussian process with continuous symmetric positive definite covariance kernel $K : D_T \times D_T \to \mathbb{R}$ and use them to query the associated solution operator $F$ of $L$, as well as its adjoint $F^*$, to generate solutions $u_j = Ff_j$ or $u_j^* = F^*f_j$.6 We derive a bound on the number of input-output pairs $\{(f_j, u_j)\}_{j=1}^N$ needed to approximate the Green's function within a given error tolerance measured in the operator norm, with a high probability of success. Our result is summarized in the following theorem.

Theorem 6 Let $L$ be a hyperbolic PDO given in (1a). Let $G : D_T \times D_T \to \mathbb{R}$ be the homogeneous Green's function of $L$ in the domain $D_T$, and let $F : L^2(D_T) \to L^2(D_T)$ be the solution operator with kernel $G$, as in (2). Additionally, define $\Xi_\epsilon$ and $\Psi_\epsilon$ as in (31) and (32). For any sufficiently small $\epsilon > 0$ such that $\Psi_\epsilon > 0$, there exists a randomized algorithm that can construct an approximation $\tilde F$ of $F$ using $O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ input-output training pairs of $L$, such that $\|F - \tilde F\| = O(\Xi_\epsilon^{-1}\epsilon)\,\|F\|$ with probability $\ge 1 - O(e^{-1/\epsilon})$.

Remark 7 (Increased regularity of coefficients) The number of input-output training pairs can be reduced by assuming greater regularity of the coefficients $a, b, c$ of $L$. In particular, if $a, b, c \in C^r(D_T)$ for some $r \ge 1$, then Theorem 6 easily generalizes, via (40) and (9), so that only $O(\Psi_\epsilon^{-1}\epsilon^{-(6+1/r)}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ training pairs are needed to approximate $F$ with relative error $O(\Xi_\epsilon^{-1}\epsilon)$. If we assume $a, b, c$ are analytic, the $\epsilon^{-1/r}$ factor can be reduced even further to $\log(\epsilon^{-1})$.

6. We note that $F^*$ is the solution operator of the adjoint Cauchy problem of (1), meaning that $u^* = F^*f$ satisfies both the adjoint PDE $L^*u^* = f$ as well as homogeneous conditions at the boundary and at the terminal time $t = 1$. In other words, the adjoint problem is the backward-in-time version of (1). In particular, $F$ is self-adjoint if and only if $L$ is self-adjoint and the coefficients $a, b, c$ are time-independent.
A randomized algorithm that achieves Theorem 6 is summarized in Algorithm 2, and we dedicate the remainder of this section to constructing it. We fix a sufficiently small $0 < \epsilon < 1/2$, so that by (40) and (9), there exists $k_\epsilon := \lceil C\epsilon^{-1} \rceil \ge 4$, where $C$ is a constant depending only on $\max_{(D_T \times D_T)\setminus Z} |\nabla G|$ and $Z$ is the bundle of reflecting characteristic segments defined in (8). This ensures that $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$ holds whenever $(X \times Y) \cap Z = \emptyset$. We also set the parameters $s = \sqrt{2k_\epsilon}$, $t = e$, and $p = k_\epsilon$, so that the probabilities of failure for Theorems 1, 3, and 5 are bounded by $3e^{-k_\epsilon}$ (see Corollary 2).

Algorithm 2 Learning the solution operator via input-output data
Input: Black-box solver for $L$, GP covariance kernel $K$, tolerance $0 < \epsilon < 1/2$
Output: Approximation $\tilde F$ of the solution operator $F$ within relative error $\Xi_\epsilon^{-1}\epsilon$
1: Set target rank $k_\epsilon \ge 4$
2: while $\mathrm{vol}(D_{\mathrm{red}}(L)) > \epsilon^2$ do
3:   for $D \in \mathcal{D}_{\mathrm{red}}(L)$ do
4:     Partition $D$ into 16 subdomains $D_1, \ldots, D_{16}$   (Section 4.2)
5:     for $i = 1:16$ do
6:       Determine the numerical rank of $F$ on $D_i$ (Algorithm 3)   (Section 4.1)
7:       if $F$ is numerically low-rank on $D_i$ then
8:         Color $D_i$ green
9:         Approximate $F$ on $D_i$ using the rSVD (Algorithm 1)   (Section 4.3)
10:      else
11:        Color $D_i$ red and add $D_i$ to $\mathcal{D}_{\mathrm{red}}(L+1)$
12: Pad the approximant $\tilde F$ with zeros on the remaining red subdomains   (Section 4.4.1)

4.1 Rank detection scheme

Since the locations of the characteristics are unknown, we adaptively partition $D_T \times D_T$ in a hierarchical manner. The adaptive part relies on a rank detection scheme that detects the numerical rank of $F$ in a given subdomain $X \times Y \subseteq D_T \times D_T$ (see Algorithm 3). Similar to Boullé and Townsend (2023, 4.1.2), we generate a $Y \times 2k_\epsilon$ quasimatrix $\Omega$ such that each column is independently drawn from a Gaussian process defined on $Y$, given by $\mathrm{GP}(0, R_{Y\times Y}K)$, where $R_{Y\times Y}$ is the operator that restricts functions to the domain $Y \times Y$. Then we extend by zero each column of $\Omega$ from $L^2(Y)$ to $L^2(D_T)$ in the form $R_Y^*\Omega$.
Next, we select
$$q_{\epsilon,X\times Y} := \max\left\{0,\ \left\lceil \frac{\log\!\big(1 + A_{k_\epsilon,X\times Y} + 2A_{k_\epsilon,X\times Y}/\epsilon\big)}{\log\!\big(\sigma_{k_\epsilon,X\times Y}/\sigma_{k_\epsilon+1,X\times Y}\big)} \right\rceil - 1\right\}, \quad (21)$$
where $A_{k_\epsilon,X\times Y}$ is defined in (18), and set $H = (FF^*)^{q_{\epsilon,X\times Y}}F$. We approximate the range of $H$ via $Z = HR_Y^*\Omega$ and then compute the rank-$2k_\epsilon$ approximant $H'_{X\times Y} = P_{R_X Z}R_X H R_Y^*$. Since $H'_{X\times Y}$ has finite rank, we can compute its SVD and extract the quasimatrix $\tilde U_{k_\epsilon,X\times Y}$ whose columns are the dominant left $k_\epsilon$-singular vectors of $H'_{X\times Y}$, and finally compute the dominant singular values $\hat\sigma_{1,X\times Y} \ge \cdots \ge \hat\sigma_{k_\epsilon,X\times Y}$ of the operator $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$. In total, these computations require $k_\epsilon(8q_{\epsilon,X\times Y} + 5)$ input-output training pairs, and rely on querying the adjoint of $F$.

Algorithm 3 Detecting the numerical rank of $F$ in a subdomain
Input: Black-box solver for $L$, GP covariance kernel $K$, subdomain $X \times Y$, tolerance $0 < \epsilon < 1$
Output: Classification of the numerical rank of $F_{X\times Y}$
1: Set target rank $k_\epsilon \ge 4$
2: Set exponent $q_{\epsilon,X\times Y}$ as in (21)   ($q_{\epsilon,X\times Y} = O(\log(\epsilon^{-1}))$)
3: Construct $H'_{X\times Y}$ using the rSVD (Algorithm 1) with exponent parameter 1 applied to $H = (FF^*)^{q_{\epsilon,X\times Y}}F$
4: Compute an SVD of $H'_{X\times Y}$ to obtain the dominant left $k_\epsilon$-singular functions $\tilde U_{k_\epsilon,X\times Y}$
5: Construct $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$ using the black-box solver for $L$
6: Compute the singular values $\hat\sigma_{1,X\times Y} \ge \cdots \ge \hat\sigma_{k_\epsilon,X\times Y}$ of $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$
7: if $\hat\sigma_{k_\epsilon,X\times Y} < 4\epsilon\,\hat\sigma_{1,X\times Y}$ then report $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$
8: else report $\mathrm{rank}_\epsilon(F_{X\times Y}) \ge k_\epsilon$

Our choice of $q_{\epsilon,X\times Y}$ in (21) is motivated as follows. Given $\delta_{q_{\epsilon,X\times Y}}$ as defined in (44), we have
$$\frac{2\delta_{q_{\epsilon,X\times Y}}A_{k_\epsilon,X\times Y}}{1 - \delta_{q_{\epsilon,X\times Y}}A_{k_\epsilon,X\times Y}} \le \epsilon,$$
which in conjunction with Theorem 5 implies that the inequalities $\hat\sigma_{k_\epsilon,X\times Y} \le \sigma_{k_\epsilon,X\times Y} + \epsilon\sigma_{1,X\times Y}$ and $(1-\epsilon)\sigma_{1,X\times Y} \le \hat\sigma_{1,X\times Y}$ hold with probability $\ge 1 - 3e^{-k_\epsilon}$. If $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$, then $\sigma_{k_\epsilon,X\times Y} \le \epsilon\sigma_{1,X\times Y}$, hence
$$\hat\sigma_{k_\epsilon,X\times Y} \le 2\epsilon\sigma_{1,X\times Y} \le \frac{2\epsilon}{1-\epsilon}\,\hat\sigma_{1,X\times Y}.$$
Since $\epsilon < 1/2$, the right-hand side is bounded by $4\epsilon\hat\sigma_{1,X\times Y}$. On the other hand, if $\hat\sigma_{k_\epsilon,X\times Y} \le 4\epsilon\hat\sigma_{1,X\times Y}$, then Theorem 5 again yields $\sigma_{k_\epsilon,X\times Y} \le \hat\sigma_{k_\epsilon,X\times Y} + \epsilon\sigma_{1,X\times Y} \le 5\epsilon\sigma_{1,X\times Y}$, hence $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$. In summary, we have shown that
$$\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon \implies \hat\sigma_{k_\epsilon,X\times Y} \le 4\epsilon\hat\sigma_{1,X\times Y} \implies \mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon \quad (22)$$
holds with probability $\ge 1 - 3e^{-k_\epsilon}$.
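The test in lines 3-8 of Algorithm 3 can be prototyped in a few lines on a discretized operator. In this sketch a Gaussian test matrix replaces the quasimatrix $\Omega$, a fixed exponent $q$ replaces the choice (21), and the restriction operators are omitted; only the threshold $\hat\sigma_{k} < 4\epsilon\,\hat\sigma_{1}$ from line 7 is taken from the paper, and all other names are illustrative.

```python
import numpy as np

def detect_low_rank(F, k, eps, q=2, rng=None):
    """Return True if sigma_hat_k < 4*eps*sigma_hat_1, the test that
    indicates rank_{5 eps}(F) < k (finite-dimensional sketch of
    Algorithm 3 with a fixed power-scheme exponent q)."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((F.shape[1], 2 * k))
    Z = np.linalg.matrix_power(F @ F.T, q) @ F @ Omega
    Q, _ = np.linalg.qr(Z)
    H = Q @ Q.T @ np.linalg.matrix_power(F @ F.T, q) @ F   # H' = P_Z H
    Uk = np.linalg.svd(H)[0][:, :k]          # dominant left k-singular vectors
    s = np.linalg.svd(Uk.T @ F, compute_uv=False)
    return s[k - 1] < 4 * eps * s[0]

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
full_rank = np.eye(30)
is_low = detect_low_rank(low_rank, k=3, eps=1e-3, rng=1)    # True
is_full_flagged = detect_low_rank(full_rank, k=3, eps=1e-3, rng=1)  # False
```

The rank-2 matrix passes the test for $k = 3$, while the identity, whose singular values do not decay at all, fails it for any small $\epsilon$.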
The preceding argument implies that we need only check the validity of the inequality
$$\hat\sigma_{k_\epsilon,X\times Y} < 4\epsilon\,\hat\sigma_{1,X\times Y} \quad (23)$$
to determine with high probability whether or not $F_{X\times Y}$ has numerical rank bounded by $k_\epsilon$, with low error tolerance. In particular, every subdomain that does not intersect $Z$ has $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$, and the test passes on all such subdomains. Conversely, even if (23) happens to be satisfied for a subdomain that does intersect $Z$, then that subdomain still has low numerical rank with comparatively small error tolerance, namely, $5\epsilon$.

4.2 Hierarchical partition of the domain

We now describe the hierarchical partitioning of the domain $D_T \times D_T = [0,1]^4$, so that $F$ is numerically low-rank on many subdomains, while the subdomains on which $F$ is not low-rank have a small total volume. Since the probability of failure is very low, we describe the partition deterministically and relegate the discussion of the probabilistic aspects to Section 4.4.4. In particular, we assume throughout this section that (22) holds deterministically.

Our partitioning strategy proceeds as follows. At level $L = 0$, we consider only one subdomain, which is the entirety of $D_T \times D_T$. On each subdomain at the current partition level, we perform the rank detection procedure described in Section 4.1 to check if $F$ is numerically low-rank on the subdomain. If it is, we color the subdomain green and approximate $F$ restricted to the subdomain using the rSVD. Otherwise, we color the subdomain red. The next level of the partition then consists of partitioning each red subdomain into smaller subdomains and repeating the coloring process. Since the characteristics lie only in red subdomains, at each level of the partition we learn the location of the characteristics in finer detail (see Figure 4). The hierarchical partition of $D_T \times D_T = [0,1]^4$ for $n$ levels is defined recursively as follows.
At level $L = 0$, the domain $I_{1,1,1,1} := I_1 \times I_1 \times I_1 \times I_1 = [0,1]^4$ is the root of the partitioning tree. At a given level $0 \le L \le n-1$, a node $I_{j_1,j_2,j_3,j_4} := I_{j_1} \times I_{j_2} \times I_{j_3} \times I_{j_4}$ is colored red if (23) fails to hold for $X \times Y = I_{j_1,j_2,j_3,j_4}$. Otherwise, the node $I_{j_1,j_2,j_3,j_4}$ is colored green. Green nodes have no children, whereas red nodes have $2^4$ children defined as
$$\{I_{2j_1+k_1} \times I_{2j_2+k_2} \times I_{2j_3+k_3} \times I_{2j_4+k_4} : k_i \in \{0,1\},\ i = 1,2,3,4\}.$$
Here, if $I_j = [a, b]$ with $0 \le a < b \le 1$, then we define $I_{2j} = [a, \tfrac{a+b}{2}]$ and $I_{2j+1} = [\tfrac{a+b}{2}, b]$. Every green subdomain $X \times Y$ in levels $0 \le L \le n-1$ satisfies (23), hence $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$, by (22). At the end, we perform the test (23) for all remaining subdomains at level $n$. Those for which (23) holds are colored green, while the rest are colored red.

4.3 Local approximation of the Green's function

The hierarchical partition of $D_T \times D_T$ described above tells us where $F$ is numerically low-rank and where it is not. If a subdomain $X \times Y$ is colored green, that is, a subdomain on which (23) holds, then we can approximate $F$ on it using the rSVD. Since we have already generated $Z = HR_Y^*\Omega$ in the process of rank detection, where $H = (FF^*)^{q_{\epsilon,X\times Y}}F$ and $\Omega$ is a quasimatrix with columns drawn from $\mathrm{GP}(0, R_{Y\times Y}K)$, we only require an additional $2k_\epsilon$ input-output training pairs to construct $\tilde F_{X\times Y} := P_{R_X Z}R_X F R_Y^*$.7 By Theorem 3 and (22), we have
$$\|F_{X\times Y} - \tilde F_{X\times Y}\| \le A_{k_\epsilon,X\times Y}^{1/(2q_{\epsilon,X\times Y}+1)}\,\sigma_{k_\epsilon+1,X\times Y} \le 5\epsilon\,A_{k_\epsilon,X\times Y}^{1/(2q_{\epsilon,X\times Y}+1)}\,\|F_{X\times Y}\|. \quad (24)$$

7. In other words, we can skip lines 2-3 in Algorithm 1.

Moreover, this error estimate holds with probability 1, conditioned on the event that (22) is valid (see Section 4.4.4). On the other hand, if (23) does not hold on $X \times Y$, then we approximate $F$ by zero. Since the Green's function $G$ is continuous on the compact domain $D_T \times D_T$, except for jumps along the characteristics and the diagonal, there exists a constant $C$ such that
$$|G(x, t; x_0, t_0)| \le C\|F\|, \quad (x, t; x_0, t_0) \in D_T \times D_T.$$
Hence, for any $X, Y \subseteq D_T$, we have
$$\|G\|^2_{L^2(X\times Y)} \le C^2\,\mathrm{vol}(X\times Y)\,\|F\|^2 \quad (25)$$
and thus
$$\|F_{X\times Y}\| \le C\sqrt{\mathrm{vol}(X\times Y)}\,\|F\|, \quad (26)$$
where $C$ does not depend on $X \times Y$. Therefore, for sufficiently small domains $X \times Y$, we may approximate $F_{X\times Y}$ by zero. At the end of partitioning, the total volume of the subdomains on which (23) fails to hold is negligible (see Section 4.4.1).

4.4 Recovering the Green's function on the entire domain

We now show that we can recover $F$ on the entire domain $D_T \times D_T$.

4.4.1 Global approximation near characteristics

First, we ensure that the volume of all subdomains where $F$ is not low-rank is small enough that we can safely ignore that part of $F$ by approximating it by zero. As one increases the level of hierarchical partitioning, the volume of such subdomains shrinks to zero. Let $\mathcal{D}_{\mathrm{red}}(L)$ denote the collection of subdomains $X \times Y$ at level $L$ that are colored red, i.e., such that $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) \ge k_\epsilon$. Let $D_{\mathrm{red}}(L) := \bigcup_{D \in \mathcal{D}_{\mathrm{red}}(L)} D$. By (9), (22), and the definition of $k_\epsilon$, we have $X \times Y \in \mathcal{D}_{\mathrm{red}}(L)$ only if $(X \times Y) \cap Z \ne \emptyset$. Subdomains at level $L$ are 4-dimensional cubes with side length $2^{-L}$, so every $X \times Y \in \mathcal{D}_{\mathrm{red}}(L)$ is contained in the closed tubular neighborhood of $Z$ in $D_T \times D_T$ with radius $2^{-L+1}$, defined as
$$T(Z, 2^{-L+1}) := \{p \in D_T \times D_T : \mathrm{dist}(p, Z) \le 2^{-L+1}\}. \quad (27)$$
The volume of $T(Z, 2^{-L+1})$ depends only on the coefficient $a$ of (1a) and can be computed explicitly using a Weyl-type tube formula.8 In particular, we have
$$\mathrm{vol}(D_{\mathrm{red}}(L)) \le \mathrm{vol}(T(Z, 2^{-L+1})) = O(2^{-L}), \quad (28)$$
so that $\big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{red}}(L)} F_{X\times Y}\big\| \le O(2^{-L/2})\|F\|$ by (26). Therefore, by selecting $n_\epsilon := \lceil \log_2(1/\epsilon^2) \rceil$, we guarantee that
$$\Big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{red}}(n_\epsilon)} F_{X\times Y}\Big\| = O(\epsilon)\|F\| \quad (29)$$
and can safely approximate $F$ by zero on $D_{\mathrm{red}}(n_\epsilon)$.

8. For $a \in C^2$, the classical Weyl tube formula suffices, whereas for $a$ with less regularity one requires more complicated expressions; see, e.g., Federer (1959, Thm. 5.6), Gray (2003, Thm. 4.8), and Hug et al. (2004, Eq. (4.4)). We omit the details.

Figure 4: Slices through $D_T \times D_T$ at two levels of hierarchical partitioning.
A subdomain intersecting a characteristic is colored red and is further partitioned at the next level. Otherwise, it is colored green. The boundary of the tube of (27) is depicted by the thick dashed lines in (b); notice that every red subdomain is contained within the tube's interior.

4.4.2 Recovery rate

Input-output training pairs are required both for rank detection on subdomains and for approximation on the subdomains where $F$ is numerically low-rank. We count the training pairs needed for each task separately. Since rank detection occurs on every subdomain in the partitioning tree, let $\mathcal{D}_{\mathrm{tree}}(n_\epsilon)$ denote the entire collection of subdomains in the tree after $n_\epsilon$ levels of partitioning. For each subdomain colored red at a given level $L$, we split it into 16 smaller subdomains and perform the rank detection scheme on each, hence
$$\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = 16\sum_{L=0}^{n_\epsilon-1} \#\mathcal{D}_{\mathrm{red}}(L).$$
Subdomains at the $L$th level of partitioning have volume $16^{-L}$, so it follows from (28) that $\#\mathcal{D}_{\mathrm{red}}(L) \le 16^L\,\mathrm{vol}(D_{\mathrm{red}}(L)) = O(8^L)$. Therefore,
$$\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = O(8^{n_\epsilon}) = O(\epsilon^{-6}). \quad (30)$$
Finally, recall from Section 4.1 that performing the rank detection scheme on a subdomain $X \times Y$ requires $k_\epsilon(8q_{\epsilon,X\times Y} + 5)$ training pairs. We remove the dependence on $X \times Y$ by introducing
$$\Xi_\epsilon := \min\{\xi_{k_\epsilon,X\times Y} : X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\} \in (0, 1], \quad (31)$$
which represents the quality of the covariance kernel $K$, as well as
$$\Psi_\epsilon := \min\left\{\log\frac{\sigma_{k_\epsilon,X\times Y}}{\sigma_{k_\epsilon+1,X\times Y}} : X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\right\}, \quad (32)$$
which represents the influence of the singular value gaps in the rank detection scheme. It follows from (18) that $A_{k_\epsilon,X\times Y} = O(\Xi_\epsilon^{-1})$ and thus $q_{\epsilon,X\times Y} = O(\Psi_\epsilon^{-1}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ by (21). Therefore, the total number of training pairs needed for rank detection on every domain in $\mathcal{D}_{\mathrm{tree}}(n_\epsilon)$ is given by
$$N^{(1)}_\epsilon := \sum_{X\times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)} k_\epsilon(8q_{\epsilon,X\times Y} + 5) = O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1})).$$
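The geometry behind (28)-(30) can be checked in a small 2-D analogue, where the 16-way split of $[0,1]^4$ becomes the 4-way split of a quadtree and the characteristic surface $Z$ becomes the diagonal $x = y$. The rank test (23) is replaced here by a purely geometric predicate, so this sketch only tracks the red/green bookkeeping, not the operator learning; all names are illustrative.

```python
def partition(block, level, max_level, is_green, red, green):
    """Red/green refinement: green blocks stop; red blocks split into 4
    children (the 2-D analogue of the 16-way split of [0,1]^4)."""
    if is_green(block):
        green.append(block)
    elif level == max_level:
        red.append(block)
    else:
        x0, x1, y0, y1 = block
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        for child in [(x0, xm, y0, ym), (xm, x1, y0, ym),
                      (x0, xm, ym, y1), (xm, x1, ym, y1)]:
            partition(child, level + 1, max_level, is_green, red, green)

# Geometric stand-in for test (23): a block is "low rank" iff it misses
# the toy characteristic x = y.
def misses_diagonal(block):
    x0, x1, y0, y1 = block
    return x1 < y0 or y1 < x0

red, green = [], []
partition((0.0, 1.0, 0.0, 1.0), 0, 5, misses_diagonal, red, green)
red_area = sum((x1 - x0) * (y1 - y0) for x0, x1, y0, y1 in red)
# Red blocks hug the diagonal: their count grows like 2^L while their
# total area shrinks like 2^-L, mirroring (28).
```

After five levels, only the thin band of side-$2^{-5}$ blocks along the diagonal remains red, so the zero-padding of Section 4.4.1 discards a region of vanishing area.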
To count the training pairs needed for approximation on domains where $F$ is numerically low-rank, we first observe that
$$\#\mathcal{D}_{\mathrm{green}}(n_\epsilon) = \sum_{L=1}^{n_\epsilon}\big(16\,\#\mathcal{D}_{\mathrm{red}}(L-1) - \#\mathcal{D}_{\mathrm{red}}(L)\big) = O(8^{n_\epsilon}),$$
where $\mathcal{D}_{\mathrm{green}}(n_\epsilon)$ is the collection of subdomains $X \times Y$ in the hierarchical partition after $n_\epsilon$ levels colored green, i.e., those satisfying $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$. Recall from Section 4.3 that only $2k_\epsilon$ additional training pairs are required to generate an approximant on a subdomain via the rSVD after performing rank detection, so the number of training pairs needed for approximation is given by
$$N^{(2)}_\epsilon := \sum_{X\times Y \in \mathcal{D}_{\mathrm{green}}(n_\epsilon)} 2k_\epsilon = O(\epsilon^{-7}).$$
We conclude that the total number of training pairs required is
$$N_\epsilon := N^{(1)}_\epsilon + N^{(2)}_\epsilon = O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1})).$$

4.4.3 Global approximation error

Let $\tilde F$ be the approximant to $F$ given by stitching together, at the end of the hierarchical partition, the approximants $\tilde F_{X\times Y}$ generated on green subdomains $X \times Y$, and the zero operator on red subdomains. By (29), we have
$$\|F - \tilde F\| \le \Big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{green}}(n_\epsilon)}\big(F_{X\times Y} - \tilde F_{X\times Y}\big)\Big\| + O(\epsilon)\|F\|,$$
where every term in the summation satisfies (24). Recall that $A_{k_\epsilon,X\times Y} = O(\Xi_\epsilon^{-1})$, hence
$$\|F - \tilde F\| = O(\Xi_\epsilon^{-1}\epsilon)\|F\|$$
is our final error bound.

4.4.4 Recovery probability

Thus far, we have assumed that our rank detection is guaranteed to succeed on each domain, i.e., that (22) holds deterministically. Circumventing this assumption is simple: since the forcing terms in the input-output training data are independent, and since we perform rank detection on $\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = O(\epsilon^{-6})$ domains, we have
$$\mathbb{P}\big\{\text{(22) holds for every } X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\big\} \ge \big(1 - 3e^{-k_\epsilon}\big)^{\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon)} = 1 - O(e^{-1/\epsilon}).$$
Notice that for each subdomain, the approximant is generated using the same test quasimatrices as used in the rank detection scheme, hence all error bounds for approximants, e.g., (24), are guaranteed to hold when conditioned on the validity of (22). Therefore, our global probability of failure is $O(e^{-1/\epsilon})$, and we have completed the proof of Theorem 6.

5.
Numerical Example

We implement Algorithms 1, 2, and 3 in MATLAB for the constant-coefficient wave operator $Lu = u_{tt} - 4u_{xx}$ with homogeneous initial and boundary conditions.9,10 Our implementation discretizes the domain blockwise at Gauss-Legendre quadrature nodes. Input-output data was generated using the known analytical expression for the true Green's function. Figure 5 depicts a slice of the approximate Green's function constructed by Algorithm 2 after eight levels of partitioning, as well as the pointwise error in comparison with the actual Green's function. We observe that the rank detection scheme successfully distinguishes between those subdomains that intersect characteristics and those that do not, hence the partition is fine in areas near characteristics and coarse in areas away from characteristics. Errors are concentrated near the characteristics, as expected, while low-rank subdomains are approximated very well.

Figure 6 shows the experimental rate of convergence as the amount of input-output training data increases. As predicted by Theorem 6, the error decays like $O(N^{-1/7})$, where $N$ is the number of training data pairs. The slow convergence rate, theoretically predicted and empirically observed, shows that large amounts of data may be needed to accurately approximate Green's functions associated with hyperbolic PDEs. We emphasize that our algorithm guards against the worst possible solution operators associated with a hyperbolic PDE; if one assumes in advance that the PDE has constant coefficients, then one can do dramatically better. Nevertheless, the numerical example demonstrates that our method is stable and robust to discretization errors.

6. Conclusions and Discussion

Our main result proves that one can rigorously recover the Green's function of a hyperbolic PDE in two variables using input-output training pairs with high probability.

9. MATLAB code is available at https://github.com/chriswang030/OperatorLearningforHPDEs.
10.
While our code has no difficulty handling more complicated hyperbolic PDEs, the Green's functions for equations with variable coefficients are generally unknown analytically, hence training data must be generated from, say, a numerical solver. Experiments with various popular techniques showed that such solvers were either (a) too slow to be practical, given the large amount of data needed, or (b) prone to numerical dissipation and/or dispersion that dominated the approximation error and resulted in a poor Green's function approximation. We discuss these issues in more detail in Section 6.4.

Figure 5: Approximate Green's function at the slice $(y, s) = (0.8, 0.1)$ for the wave operator $Lu = u_{tt} - 4u_{xx}$, overlaid with partition blocks (left), and the pointwise error at the slice compared with the exact Green's function (right).

Figure 6: Empirical rate of convergence of Algorithm 2.

We do so via three theoretical contributions: (a) we prove new error bounds for the rSVD for HS operators in the operator norm, such that the suboptimality factor $A_k$ of (17) tends to a constant as the target rank $k$ tends to $\infty$; (b) we develop a scheme to detect the numerical rank of an operator using input-output products, with high probability; and (c) we introduce an adaptive component to the hierarchical partition of the Green's function's domain to effectively capture the location of its discontinuities.

In the remainder of this section, we discuss possible relaxations of our assumptions, a potential improvement to our scheme via the peeling algorithm, the difficulties of extending to higher-dimensional hyperbolic PDOs, and some challenges arising from the availability of input-output data.

6.1 Relaxations of assumptions

The main assumptions required of our unknown PDO $L$ from (1a) are (a) homogeneous initial conditions, (b) self-adjointness, and (c) sufficient regularity of coefficients.
Strict hyperbolicity is also required, but we do not discuss it because it enforces realistic physics by forbidding waves from traveling infinitely fast or backward in time.

Inhomogeneous initial conditions. For use as a direct solver, the homogeneous Green's function can be applied to solve Cauchy problems with certain inhomogeneous initial conditions, as a consequence of Duhamel's principle (Courant and Hilbert, 1962, Ch. VI.15). For initial conditions $u(x, 0) = 0$ and $u_t(x, 0) = \psi(x)$, one simply needs to integrate $\psi$ against the Green's function on the line $\{t = 0\}$. If one also wants to solve initial value problems where $u(x, 0)$ is not identically zero, then the values and derivatives of the coefficients of $L$ must be given on $\{t = 0\}$ (Courant and Hilbert, 1962, Ch. V.5).

We demonstrate the practical implementation of Duhamel's principle in Figure 7 by replacing the given inhomogeneous initial condition with an appropriate approximation of the Dirac delta on a short time strip, e.g., $f_\delta(x, t) := \delta^{-1}\chi_{[0,\delta]}(t)\,u_t(x, 0)$, and setting our approximate solution to be
$$u(x, t) = \int_{D_T} \tilde G(x, t; y, s)\,f_\delta(y, s)\,dy\,ds, \quad (x, t) \in D_T.$$
The inaccuracies seen in Figure 7 are due primarily to the use of an approximate Green's function. Because of the large amount of training data needed here, we are unable to construct a Green's function with sufficiently high accuracy with our available software and hardware.

One may be tempted to extrapolate the learned solution operator beyond the given time horizon by taking the value of the computed solution at the end of the first time interval to be the initial condition for the next time interval. However, without additional information about the coefficients of $L$, one has no information whatsoever about the behavior of the Green's function beyond the time horizon on which the input-output training data is given.
If it is known that the wave speed, for instance, is independent of time, then such extrapolations beyond the initial domain may be feasible.

Non-self-adjoint PDOs. While self-adjointness is not strictly necessary, more training data may be needed to learn the solution operator of a non-self-adjoint PDO to the same accuracy. This is because the coefficients of the adjoint PDO may have less regularity than those of the original operator. For instance, the wave operator with variable damping
$$Lu = u_{tt} - u_{xx} + k(x, t)u_t$$
has adjoint
$$L^*v = v_{tt} - v_{xx} - k(x, t)v_t - k_t(x, t)v.$$
If $k(x, t)$ is not smooth, then one of the coefficients of $L^*$ has one less degree of regularity. Consequently, the corresponding homogeneous Green's function also has one less degree of regularity in the region of its support and thus requires more training data to recover (see Section A.5). Whether or not our hyperbolic PDO is self-adjoint, in general one must have access to input-output data from both the Cauchy problem (1) as well as the corresponding adjoint problem (see footnote 6). Indeed, one cannot expect to recover an operator without the action of the adjoint (Boullé et al., 2024; Halikias and Townsend, 2024; Otto et al., 2023). In the special case where the hyperbolic PDO is self-adjoint and has time-independent coefficients, the homogeneous Cauchy problem is also self-adjoint.

Figure 7: Approximate solution (left) and exact solution (right) to the wave operator $Lu = u_{tt} - 4u_{xx}$ with homogeneous boundary conditions but inhomogeneous, discontinuous initial conditions given by $u(x, 0) = 0$ and $u_t(x, 0) = \chi_{[0.2,0.3]}(x)$. The approximation was computed by multiplying the approximate Green's function $\tilde G$ derived from Algorithm 2 against an approximation of the Dirac delta on $\{t = 0\}$.

Less regular coefficients. We assume throughout the paper that the coefficients of $L$ are at least of class $C^1$.
It is certainly possible that our method extends to coefficients of lower regularity, but the Green's function theory for hyperbolic PDEs with low-regularity coefficients is more subtle and may require one to work instead with parametrices. For example, Smith (1998) describes a parametrix for hyperbolic PDEs with Lipschitz coefficients whose spatial derivatives are also Lipschitz. For equations with even lower regularity, it is not always the case that the Cauchy problem is well-posed, and the regularity in the time and space variables must be considered separately (see, e.g., Cicognani, 1999; Cicognani and Lorenz, 2018; Hurd and Sattinger, 1968; Reissig, 2003). It is unclear in such cases whether or not a Green's function approach, such as the one taken in this paper, is possible.

6.2 Peeling algorithm

Our algorithm requires performing rank detection and the rSVD on $O(\epsilon^{-6})$ subdomains. These operations require $O(\Psi_\epsilon^{-1}\epsilon^{-1}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ input-output training pairs for each subdomain; considering the subdomains one by one leads to the number of training pairs described in Theorem 6. However, it was recently shown in Boullé et al. (2023) that one can dramatically reduce the required number of input-output pairs by considering multiple subdomains simultaneously via the peeling algorithm (Lin et al., 2011). The argument relies in part on bounding the chromatic number of a graph associated with the hierarchical partition (Levitt and Martinsson, 2024). In our setting, the chromatic number at a given level $L$ of the partition, denoted by $\chi(L)$, is approximately the number of subdomains of minimal volume, that is, cubes with side length $2^{-L}$, seen in a 2-dimensional slice of the domain $D_T \times D_T$. For instance, the chromatic number associated with the partition at level $L = 3$ seen in Figure 4b is 52.
Heuristically, the chromatic number is maximized when the characteristics are straight lines of some minimal slope $0 < m \le 1$, in which case $\chi(L) \approx 2^L m^{-1}$. If the arguments in Boullé et al. (2023) can be repeated for hyperbolic PDOs, then the required number of input-output training pairs in Theorem 6 improves from $O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ to $O(\Psi_\epsilon^{-1}\epsilon^{-3}\,\mathrm{polylog}(\Xi_\epsilon^{-1}\epsilon^{-1}))$. Moreover, by Remark 7, this term continues to improve as one assumes greater regularity of the coefficients $a, b, c$ of (1a). The challenge, which is addressed in Boullé et al. (2023) for elliptic PDEs, is that the peeling algorithm of Lin et al. (2011) is not proven to be stable. Hence, careful analysis is needed to show that errors from low-rank approximation and rank detection do not accumulate.

Notice that, unlike in the elliptic case, the use of peeling cannot entirely reduce the $\epsilon^{-6}$ factor to $\mathrm{polylog}(\epsilon^{-1})$. This is because the singularities of the Green's functions of hyperbolic PDEs in $d$ variables occupy a $(2d-1)$-dimensional hypersurface in the $2d$-dimensional domain, due to the characteristics. In contrast, the singularities of the Green's functions of elliptic PDEs lie only on the diagonal, which is $d$-dimensional. In other words, the Green's functions of hyperbolic PDEs have an additional $d-1$ dimensions' worth of singularities, and the chromatic number $\chi(L)$ thus grows exponentially with $L$. It is therefore doubtful that one can recover the Green's functions of hyperbolic PDEs at a sublinear rate without a significant development in the approximation theory of hierarchical matrices.

6.3 Extending to higher dimensions

We learn the Green's function of a hyperbolic PDE in two variables by detecting the location of its characteristics, which feature as jump discontinuities of the Green's function. A similar scheme should also work for detecting the singularities of Green's functions for hyperbolic PDEs in higher dimensions.
The main difficulty with higher-dimensional Green's functions is that they are not necessarily square-integrable. For instance, the Green's function of the wave equation in three spatial dimensions is given by
$$G(x, t; y, s) = \frac{\delta(t - s - \|x - y\|)}{4\pi\|x - y\|}, \quad (x, t), (y, s) \in \mathbb{R}^3 \times [0, \infty),$$
where $\delta$ is the Dirac delta. Here, $G$ has singularities of a distributional nature on the characteristics, which form a 7-dimensional hypersurface in the 8-dimensional domain of the Green's function. In this case, the theory of the rSVD for HS operators no longer applies, since HS integral operators must have a square-integrable kernel. This difficulty has been circumvented to a degree in the case of parabolic PDEs, whose Green's functions are integrable but not square-integrable on the diagonal (Boullé et al., 2022b). Nevertheless, it is not immediately clear that the analysis carries over when the singularities are distributional.

Another factor that may play a role in higher dimensions is the effect of Huygens' principle (Courant and Hilbert, 1962; Evans, 2010). For the standard wave equation in $d$ spatial dimensions, Huygens' principle implies that the support of the Green's function lies exclusively on the boundary of the future light cone when $d$ is odd. In contrast, the support is the entire future light cone when $d$ is even. For general hyperbolic PDOs, this effect does not correspond so nicely with the parity of the spatial dimension (see, e.g., Günther, 1991). Determining the impact of Huygens' principle may be an essential aspect of hyperbolic PDE learning in higher dimensions.

6.4 Availability of data

In practice, one often does not have access to exact solutions of the PDE in question. Instead, input-output training data may be generated from simulations driven by a numerical PDE solver. In such cases, the training data inherits whichever discretization errors the solver itself exhibits.
Figure 8 displays the output of our algorithm for training data coming from two different numerical solvers. The first is a first-order accurate upwind method (Banks and Henshaw, 2012), while the second is the simple second-order centered-in-time, centered-in-space finite difference method; both advance by explicit time-steps on a fixed-resolution uniform grid. Although much more advanced solvers exist, we chose these methods for their speed and simplicity, given the large number of numerical solves required by our algorithm. Due to the numerical dissipation of the first-order method, we observe that the approximate Green's function is unable to locate the characteristics precisely, since they get smoothed out. On the other hand, the numerical dispersion exhibited by the second-order method fabricates artificial oscillations both inside and outside the light cones; these oscillations mislead our rank detection scheme into believing the Green's function is high-rank in regions where it is not, resulting in excessive partitioning within the light cones.

As such, the availability and accuracy of input-output data pose a major challenge for operator learning for hyperbolic PDEs that we encourage future research, both theoretical and practical, to address. While elliptic and parabolic PDEs significantly smooth out errors from the training data, those errors are structurally propagated throughout the domain in the hyperbolic case and must be dealt with in a careful manner. One possible approach is to use our algorithm to quickly construct a low-accuracy approximation of the Green's function to get a rough sense of where the characteristics lie. Then one can use this information to update the numerical solver generating the input-output data to better track the propagation of singularities and minimize dispersion and dissipation errors.
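The dissipation of a first-order scheme is easy to reproduce in isolation. The sketch below advects a narrow Gaussian with a textbook first-order upwind step for $u_t + u_x = 0$ (this is not the solver of Banks and Henshaw (2012); the grid and CFL number are arbitrary choices) and measures how much the peak is smeared, which is precisely the smoothing that blurs the characteristics in the training data.

```python
import numpy as np

# 1-D advection u_t + u_x = 0 on a periodic grid; the exact solution is a
# pure shift, so any loss of peak height is numerical dissipation.
n, steps, cfl = 200, 100, 0.5
x = np.arange(n) / n
u0 = np.exp(-200.0 * (x - 0.3) ** 2)       # narrow Gaussian pulse
u = u0.copy()
for _ in range(steps):
    u = u - cfl * (u - np.roll(u, 1))      # first-order upwind step
peak_ratio = u.max() / u0.max()            # < 1: the pulse has been smeared
```

On this grid the peak loses roughly 10% of its height after only 100 steps, even though the scheme exactly conserves the total mass; a Green's function reconstructed from such data inherits this smoothing along the characteristics.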
Alternating between using the Green's function approximation to improve the solver and vice versa may be an iterative solution to the problems presented above.

Figure 8: Approximate Green's function at the slice $(y, s) = (0.8, 0.1)$ using input-output data generated by time-stepping numerical solvers: the first-order upwind method of Banks and Henshaw (2012) (left) and the second-order centered difference discretization (right).

Acknowledgments

The work of C. W. was supported by the NSF GRFP under grant DGE-2139899, and by the NSF under grant DMS-1929284 while the author was in residence at the Institute for Computational and Experimental Research in Mathematics in Providence, RI, during the Numerical PDEs: Analysis, Algorithms, and Data Challenges semester program. The work of A. T. was supported by the Office of Naval Research (ONR) under grant N00014-23-1-2729, the NSF under grant DMS-2045646, and a Weiss Junior Fellowship Award. Many thanks to Diana Halikias, Sungwoo Jeong, and Seick Kim for helpful discussions and suggestions.

Appendix A. Background Material

We review the relevant background material, comprising HS operators, quasimatrices, orthogonal projectors, Gaussian processes, and approximation using Legendre series (Halko et al., 2011; Hsing and Eubank, 2015; Townsend and Trefethen, 2015; Trefethen, 2019; Vershynin, 2018).

A.1 Hilbert-Schmidt operators

Hilbert-Schmidt (HS) operators acting on $L^2$ functions are infinite-dimensional analogues of matrices acting on vectors. For $d \ge 1$, let $D_1, D_2 \subseteq \mathbb{R}^d$ be domains and let $L^2(D_i)$ be the space of square-integrable functions defined on $D_i$, for $i = 1, 2$, equipped with the inner product $\langle f, g \rangle := \int_{D_i} f(x)g(x)\,dx$.
Since $L^2(D_i)$ are separable Hilbert spaces, for any compact linear operator $F : L^2(D_1) \to L^2(D_2)$ the spectral theorem provides complete orthonormal bases $\{v_j\}_{j=1}^\infty$ and $\{e_j\}_{j=1}^\infty$ for $L^2(D_1)$ and $L^2(D_2)$, respectively, as well as a sequence of positive numbers $\sigma_1 \geq \sigma_2 \geq \cdots > 0$ such that
$$Ff = \sum_{j=1}^\infty \sigma_j \langle v_j, f \rangle e_j, \qquad f \in L^2(D_1), \qquad (33)$$
where equality and convergence hold in the $L^2$ sense. This representation is called the singular value expansion (SVE) of $F$; we call $\{e_j\}$ and $\{v_j\}$ the left and right singular functions of $F$, respectively, and we call $\{\sigma_j\}$ the singular values of $F$. The subspaces spanned by the first $k$ left or right singular functions, for some positive integer $k$, are called dominant left and right $k$-singular subspaces of $F$, respectively. We say that $F$ is an HS operator if its HS norm is finite, i.e.,
$$\|F\|_{HS}^2 := \sum_{j=1}^\infty \| F v_j \|_{L^2(D_2)}^2 < \infty.$$
Since $\|F\|_{HS}^2 = \sum_{j=1}^\infty \sigma_j^2$, the HS norm is an infinite-dimensional analogue of the Frobenius norm for matrices. We also have $\|F\| \leq \|F\|_{HS}$. A special type of HS operator is an HS integral operator, which requires that $F$ have an integral representation given by
$$(Ff)(x) = \int_{D_1} G(x, y) f(y)\, dy, \qquad f \in L^2(D_1), \quad x \in D_2, \qquad (34)$$
for some kernel $G \in L^2(D_2 \times D_1)$. In this case, $\|F\|_{HS} = \|G\|_{L^2(D_2 \times D_1)}$. The adjoint of $F$ is the unique compact linear operator $F^* : L^2(D_2) \to L^2(D_1)$ satisfying $(F^*)^* = F$, $\|F\| = \|F^*\|$, and $\langle F f, h \rangle_{L^2(D_2)} = \langle f, F^* h \rangle_{L^2(D_1)}$ for every $f \in L^2(D_1)$ and $h \in L^2(D_2)$. The adjoint $F^*$ has an SVE given by
$$F^* g = \sum_{j=1}^\infty \sigma_j \langle e_j, g \rangle v_j, \qquad g \in L^2(D_2),$$
and for any $f \in L^2(D_1)$ and $h \in L^2(D_2)$ satisfies
$$(F^* F) f = \sum_{j=1}^\infty \sigma_j^2 \langle v_j, f \rangle v_j, \qquad (F F^*) h = \sum_{j=1}^\infty \sigma_j^2 \langle e_j, h \rangle e_j. \qquad (35)$$
By the Eckart–Young–Mirsky theorem (Stewart and Sun, 1990, Thm. 4.18), truncating the SVE of an HS operator after $k$ terms yields a best rank-$k$ approximation in the HS and operator norms. If $F_k : L^2(D_1) \to L^2(D_2)$ is defined by the truncation $F_k f := \sum_{j=1}^k \sigma_j \langle v_j, f \rangle e_j$, then
$$\min_{\mathrm{rank}(\widetilde F) = k} \| F - \widetilde F \|_{HS} = \| F - F_k \|_{HS} = \Big( \sum_{j=k+1}^\infty \sigma_j^2 \Big)^{1/2}, \qquad (36)$$
as well as
$$\min_{\mathrm{rank}(\widetilde F) = k} \| F - \widetilde F \| = \| F - F_k \| = \sigma_{k+1}.$$
(37)

Finally, we define the numerical rank of $F$ with error tolerance $\epsilon > 0$, denoted by $\mathrm{rank}_\epsilon(F)$, to be the smallest integer $k$ such that $\|F - F_k\| < \epsilon \|F\|$. In particular, (37) implies that $\mathrm{rank}_\epsilon(F) \leq k$ holds if $\sigma_{k+1} \leq \epsilon \sigma_1$.

A.2 Quasimatrices

Quasimatrices are infinite-dimensional analogues of tall-skinny matrices, and we use them principally to simplify notation and emphasize the analogy with matrices. A $D_1 \times k$ quasimatrix $\Omega$ is represented by $k$ columns, where each column is a function in $L^2(D_1)$, via
$$\Omega = \begin{bmatrix} \omega_1 & \cdots & \omega_k \end{bmatrix}, \qquad \omega_j \in L^2(D_1).$$
One can similarly define $\infty \times k$ matrices, in which each of the $k$ columns is a sequence in $\ell^2$, and likewise $k \times D_1$ quasimatrices. Many operations for rectangular matrices generalize to quasimatrices. If $F : L^2(D_1) \to L^2(D_2)$ is an HS operator, then $F\Omega$ denotes the quasimatrix given by applying $F$ to each column of $\Omega$, i.e., $F\Omega = \begin{bmatrix} F\omega_1 & \cdots & F\omega_k \end{bmatrix}$. Quasimatrices have a transpose, denoted $\Omega^*$, in the sense that
$$\Omega^* \Omega = \begin{bmatrix} \langle \omega_1, \omega_1 \rangle & \cdots & \langle \omega_1, \omega_k \rangle \\ \vdots & \ddots & \vdots \\ \langle \omega_k, \omega_1 \rangle & \cdots & \langle \omega_k, \omega_k \rangle \end{bmatrix}, \qquad (\Omega \Omega^*)(x, y) = \sum_{j=1}^k \omega_j(x) \omega_j(y).$$
Here, $\Omega^* \Omega$ is a $k \times k$ matrix of real numbers, while $\Omega \Omega^*$ is a function in $L^2(D_1 \times D_1)$ and can be thought of as an integral operator by taking $(\Omega \Omega^*)(x, y)$ to be its kernel. Quasimatrices can be thought of as HS operators on the relevant Hilbert spaces if, for instance, their singular values are square summable.

A.3 Orthogonal projectors

In analogy with the finite-dimensional setting, an orthogonal projector is a bounded self-adjoint linear operator $P : L^2(D_2) \to L^2(D_2)$ that satisfies $P^2 = P$. Orthogonal projectors are completely determined by their range: for any closed subspace $M \subseteq L^2(D_2)$, there is a unique orthogonal projector $P_M$ whose range is $M$. This operator is compact if and only if $M$ is finite-dimensional. For any $D_2 \times k$ quasimatrix $\Omega$, the unique orthogonal projector associated with the column space of $\Omega$ is denoted by
$$P_\Omega := \Omega (\Omega^* \Omega)^\dagger \Omega^* : L^2(D_2) \to L^2(D_2),$$
as $P_\Omega F : L^2(D_1) \to L^2(D_2)$ captures the orthogonal projection of the range of $F$ onto the finite-dimensional column space of $\Omega$.
If $\Omega$ has full column rank, then $\Omega^* \Omega$ is an invertible $k \times k$ matrix, and $(\Omega^* \Omega)^\dagger = (\Omega^* \Omega)^{-1}$. We rely on the following inequality, analogous to that of Halko et al. (2011, Prop. 8.6), for the operator norm of orthogonal projectors.

Proposition 8 Let $D_1, D_2 \subseteq \mathbb{R}^d$ be domains. Let $P : L^2(D_2) \to L^2(D_2)$ be an orthogonal projector, and let $F : L^2(D_1) \to L^2(D_2)$ be an HS operator. Then
$$\| P F \| \leq \| P (F F^*)^q F \|^{1/(2q+1)} \qquad (38)$$
holds for every $q > 0$.

Proof The proof closely follows that of Halko et al. (2011, Prop. 8.6), making use of the Courant–Fischer minimax principle (Hsing and Eubank, 2015, Thm. 4.2.7).

A.4 Gaussian processes

A Gaussian process (GP) is an infinite-dimensional analogue of a multivariate Gaussian distribution, and a function drawn from a GP is analogous to a random vector drawn from a Gaussian distribution. For a domain $D \subseteq \mathbb{R}^d$ and a continuous symmetric positive semidefinite kernel $K : D \times D \to \mathbb{R}$, we define a GP with mean $\mu : D \to \mathbb{R}$ and covariance $K$ to be a stochastic process $\{X_t : t \in D\}$ such that the random vector $(X_{t_1}, \ldots, X_{t_n})$, for every finite set of indices $t_1, \ldots, t_n \in D$, has a multivariate Gaussian distribution with mean $(\mu(t_1), \ldots, \mu(t_n))$ and covariance matrix $K_{ij} = K(t_i, t_j)$, $1 \leq i, j \leq n$. We denote such a GP by $\mathcal{GP}(\mu, K)$. Functions drawn from a GP with continuous kernel are almost surely in $L^2$. Additionally, since $K$ is positive semidefinite, it has non-negative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ with corresponding eigenfunctions $\{r_j\}_{j=1}^\infty$ that form an orthonormal basis for $L^2(D)$ (Hsing and Eubank, 2015, Thm. 4.6.5). The Karhunen–Loève expansion provides a method for sampling functions from a GP, as $f \sim \mathcal{GP}(0, K)$ has the expansion $f(t) = \sum_{j=1}^\infty \sqrt{\lambda_j}\, c_j r_j(t)$, converging in $L^2$ uniformly in $t \in D$, where $\{c_j\}_{j=1}^\infty$ are independent standard Gaussian random variables (Hsing and Eubank, 2015, Thm. 7.3.5). Additionally, we define the trace of $K$ by $\mathrm{Tr}(K) := \sum_{j=1}^\infty \lambda_j < \infty$.
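A discrete analogue of the Karhunen–Loève expansion is straightforward to implement. The following sketch (the squared-exponential kernel, grid, and sample count are illustrative choices, not ones used in the paper) samples from $\mathcal{GP}(0, K)$ on a grid and checks that the empirical covariance of many draws recovers $K$:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
# An example covariance kernel (squared exponential); any continuous
# symmetric positive semidefinite K would do.
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.1)

# Discrete Karhunen-Loeve: eigenpairs of K play the role of (lambda_j, r_j).
lam, R = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)  # remove tiny negative eigenvalues from rounding

def sample_gp(num):
    # f = sum_j sqrt(lambda_j) c_j r_j with i.i.d. standard normal c_j
    c = rng.standard_normal((len(t), num))
    return R @ (np.sqrt(lam)[:, None] * c)

samples = sample_gp(20000)                      # each column is one draw from GP(0, K) on the grid
K_emp = samples @ samples.T / samples.shape[1]  # empirical covariance approaches K
```

On this grid $K(t, t) = 1$, so the discrete trace equals the number of grid points, and the empirical covariance converges to $K$ at the usual Monte Carlo rate.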
A.5 Low-rank approximation in the $L^2$ norm

In recovering a low-rank approximation of the kernel of an HS integral operator, we need explicit estimates on the decay of the kernel's singular values. We can obtain them by analyzing the kernel expressed in a Legendre series. Let $\widetilde P_n(x)$ denote the shifted Legendre polynomial of degree $n$ on the domain $[0, 1]$ (Legendre polynomials are defined in NIST DLMF, Table 18.3.1; the shifted Legendre polynomials are simply their composition with the transformation $x \mapsto 2x - 1$; see NIST DLMF, Ch. 18). The shifted Legendre polynomials form an orthogonal basis of $L^2([0,1])$, and if a function $f \in L^2([0,1])$ is uniformly Hölder continuous with parameter greater than $1/2$, then it has a uniformly convergent Legendre expansion $f(x) = \sum_{n=0}^\infty a_n \widetilde P_n(x)$. The truncated Legendre series $f_k(x) := \sum_{n=0}^k a_n \widetilde P_n(x)$, for some $k \geq 0$, is the degree-$k$ $L^2$ projection of $f$ onto the space of polynomials of degree $k$. If $f$ is $r$-times differentiable and $f^{(r)}$ has bounded total variation $V$, then for any $k > r + 1$, we have
$$\| f - f_{k-1} \|_{L^2([0,1])} \leq \frac{V}{\pi (r + 1/2) (k - r - 1)^{r + 1/2}} = O(k^{-(r+1/2)}), \qquad (39)$$
where $f_{k-1}$ is the degree-$(k-1)$ $L^2$ projection of $f$ (Wang, 2023, Thm. 3.5). Similar results hold for a multivariate function $f(x_1, \ldots, x_n)$ if the conditions above hold uniformly over every 1-dimensional slice of $f$, which holds if $f$ is $r$-times differentiable and $D^r f$ has total variation bounded by $V < \infty$ uniformly over all order-$r$ partial derivatives $D^r$ (Shi and Townsend, 2021; Townsend, 2014). In particular, for any positive integers $m, n$, we may consider $f : [0,1]^{m+n} \to \mathbb{R}$ and its degree-$(k-1)$ $L^2$ projection $f_{k-1}$ as the kernels of HS integral operators $F, \widetilde F : L^2([0,1]^m) \to L^2([0,1]^n)$, respectively. Since $f_{k-1}$ is a polynomial of degree $k - 1$, $\widetilde F$ is a rank-$k$ operator. By the Eckart–Young–Mirsky theorem,
$$\Big( \sum_{j=k+1}^\infty \sigma_j^2 \Big)^{1/2} \leq \| F - \widetilde F \|_{HS} = \| f - f_{k-1} \|_{L^2} \leq \frac{V}{\pi (r + 1/2) (k - r - 1)^{r + 1/2}},$$
where $\sigma_j$ denotes the $j$th singular value of $F$.
Combining this with the observation that $k \sigma_{2k}^2 \leq \sum_{j=k+1}^{2k} \sigma_j^2 \leq \sum_{j=k+1}^\infty \sigma_j^2$, we obtain an explicit decay rate for the singular values of $F$, i.e.,
$$\sigma_k \leq \frac{2^{r+3/2}\, V}{\sqrt{k}\, (k - 2r - 2)^{r + 1/2}} = O(k^{-(r+1)}) \qquad (40)$$
for $k > 2(r + 1)$.

Appendix B. Randomized Singular Value Decomposition

This appendix contains proofs of results relating to the rSVD in Section 3.

B.1 Proofs for Section 3.1

In this section, we prove Theorems 1 and 3, which bound the error of an approximant generated by the rSVD with respect to the operator norm. The proof is similar to that of Halko et al. (2011, Thms. 10.6 and 10.8), generalized to the infinite-dimensional setting with non-standard Gaussian vectors. We first introduce notation. For separable Hilbert spaces $H, K$, let $\mathrm{HS}(H, K)$ be the Hilbert space of HS operators $H \to K$. Given a bounded, self-adjoint linear operator $C : K \to K$, we define a quadratic form on $K$ by
$$\langle x, y \rangle_C := \langle x, C y \rangle_K, \qquad x, y \in K,$$
as well as a quadratic form on $\mathrm{HS}(H, K)$ by $\langle A, B \rangle_C := \langle A, C B \rangle_{HS}$. If $C$ is positive definite, then both quadratic forms are inner products on $K$ and $\mathrm{HS}(H, K)$, respectively. We then let $\|x\|_C^2 := \langle x, x \rangle_C$ and $\|A\|_C^2 := \langle A, A \rangle_C$ denote the induced norms. We remark that $\|x\|_C = \|C^{1/2} x\|_K$ and $\|A\|_{I \to C} = \|C^{1/2} A\|$, where $\|A\|_{I \to C}$ denotes the operator norm of $A$ when viewed as an operator $(H, \langle \cdot, \cdot \rangle_H) \to (K, \langle \cdot, \cdot \rangle_C)$. Finally, we let $H \otimes K$ denote the tensor product of Hilbert spaces, which embeds into $\mathrm{HS}(H, K)$ by associating with each $x \otimes y \in H \otimes K$ the operator $u \mapsto \langle x, u \rangle_H\, y$.

Lemma 9 Let $H, K$ be Hilbert spaces. For any $u, w \in H$ and $v, z \in K$, we have
$$\| u \otimes v - w \otimes z \|_{HS}^2 \leq \max(\|u\|^2, \|w\|^2)\, \|v - z\|^2 + \max(\|v\|^2, \|z\|^2)\, \|u - w\|^2.$$

Proof Observe that $\langle u \otimes v, w \otimes z \rangle_{HS} = \langle u, w \rangle \langle v, z \rangle$. Then
$$\| u \otimes v - w \otimes z \|^2 = \| u \otimes v \|^2 + \| w \otimes z \|^2 - 2 \langle u \otimes v, w \otimes z \rangle = \|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 - 2 \langle u, w \rangle \langle v, z \rangle.$$
Let $A = \max(\|v\|, \|z\|)$ and $B = \max(\|u\|, \|w\|)$. Then
$$A^2 \|u - w\|^2 + B^2 \|v - z\|^2 = A^2 (\|u\|^2 + \|w\|^2) + B^2 (\|v\|^2 + \|z\|^2) - 2 (A^2 \langle u, w \rangle + B^2 \langle v, z \rangle).$$
Observe that $(B^2 - \langle u, w \rangle)(A^2 - \langle v, z \rangle) \geq 0$, hence
$$\| u \otimes v - w \otimes z \|_{HS}^2 \leq \|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 + 2 A^2 B^2 - 2 (A^2 \langle u, w \rangle + B^2 \langle v, z \rangle).$$
It is straightforward to verify that
$$\|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 + 2 A^2 B^2 \leq A^2 (\|u\|^2 + \|w\|^2) + B^2 (\|v\|^2 + \|z\|^2),$$
which completes the proof.

We extend a result of Halko et al. (2011, Prop. 10.1) to the infinite-dimensional setting, following Gordon (1985, 1988). To simplify notation, we use $\|\cdot\|$ to denote either the operator, $\ell^2$, or $L^2(D)$ norm depending on context. We also use $N(0, I \otimes C)$ to denote the distribution of a mean zero random matrix with independent columns drawn from $N(0, C)$.

Proposition 10 Let $G$ be an $\infty \times n$ random matrix with distribution $N(0, I_n \otimes C)$, where $C$ is a symmetric positive definite covariance operator with $\mathrm{Tr}(C) < \infty$. Let $S$ be a $D \times \infty$ quasimatrix and let $T$ be an $n \times \infty$ matrix, both deterministic with $\|S\|_{HS}, \|T\|_{HS} < \infty$. Then
$$\mathbb{E}\, \| S G T \| \leq \big( \| C^{1/2} \|_{HS} \| T \| + \| C^{1/2} \| \| T \|_{HS} \big) \| S \|.$$

Proof Let $g \sim N(0, \|T\|^2 C)$ be Gaussian in $\ell^2$ and let $h \sim N(0, \| C^{1/2} S^* \|^2 I_n)$ be a Gaussian random vector in $\mathbb{R}^n$, such that $g$, $h$, and $G$ are all independent. Define
$$X(u, v) := \langle S G T u, v \rangle, \qquad Y(u, v) := \langle S g, v \rangle + \langle T^* h, u \rangle,$$
indexed by $u \in \ell^2$ and $v \in L^2(D)$ of unit length. One can show that $X(u, v)$ and $Y(u, v)$ are well-defined, and that $X(\cdot, \cdot)$ and $Y(\cdot, \cdot)$ are mean zero Gaussian processes. We aim to apply the Sudakov–Fernique inequality (Vershynin, 2018, Thm. 7.2.11). For any unit-length $u, w \in \ell^2$ and $v, z \in L^2(D)$, set $A = (T u) \otimes (S^* v) - (T w) \otimes (S^* z)$. We compute
$$\mathbb{E} (X(u, v) - X(w, z))^2 = \mathrm{Tr}(A^* C A) \leq \|T\|^2 \| S^* (v - z) \|_C^2 + \| S^* \|_{I \to C}^2 \| T (u - w) \|^2 = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2 + \| C^{1/2} S^* \|^2 \| T (u - w) \|^2. \qquad (41)$$
The first equality follows from fully expanding $(X(u, v) - X(w, z))^2$ and using the definition of $G$; the inequality is obtained by applying Lemma 9, taking $H$ to be $\mathbb{R}^n$ with the usual inner product and $K$ to be $\ell^2$ with the inner product induced by $C$. On the other hand, observe that
$$\mathbb{E} \langle S g, v - z \rangle^2 = \|T\|^2 (v - z)^* S C S^* (v - z) = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2,$$
and similarly $\mathbb{E} \langle T^* h, u - w \rangle^2 = \| C^{1/2} S^* \|^2 \| T (u - w) \|^2$. Therefore, by independence of $g$ and $h$, we have
$$\mathbb{E} (Y(u, v) - Y(w, z))^2 = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2 + \| C^{1/2} S^* \|^2 \| T (u - w) \|^2.$$
(42)

Combining (41) and (42) yields $\mathbb{E}(X(u, v) - X(w, z))^2 \leq \mathbb{E}(Y(u, v) - Y(w, z))^2$ for every unit-length $u, v, w, z$, so by the Sudakov–Fernique inequality, we have
$$\mathbb{E} \sup_{\substack{u \in \ell^2 \\ \|u\| = 1}}\; \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \langle S G T u, v \rangle \leq \mathbb{E} \sup_{\substack{u \in \ell^2 \\ \|u\| = 1}}\; \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \big( \langle S g, v \rangle + \langle T^* h, u \rangle \big). \qquad (43)$$
Since $S G T$ is almost surely a bounded operator, the left-hand side of (43) is $\mathbb{E} \| S G T \|$. For the right-hand side, two applications of the Cauchy–Schwarz inequality yield
$$\mathbb{E} \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \langle S g, v \rangle \leq \big( \mathbb{E} \| S g \|^2 \big)^{1/2} = \mathrm{Tr}\big( \|T\|^2 S C S^* \big)^{1/2} \leq \| C^{1/2} \|_{HS} \| S \| \| T \|.$$
An analogous argument for $\langle T^* h, u \rangle$ combined with (43) completes the proof.

Finally, we derive probability estimates for the operator norm of the pseudoinverse of non-standard Gaussian matrices, following Chen and Dongarra (2005) and Edelman (1988).

Proposition 11 Let $G$ be an $m \times n$ random matrix with distribution $N(0, I_n \otimes C)$, where $C$ is a symmetric positive definite $m \times m$ covariance matrix with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_m > 0$. If $n \geq m \geq 2$, then for any $t \geq 0$, we have
$$\mathbb{P}\{ \| G^\dagger \| \geq t \} \leq \frac{1}{\sqrt{2\pi(n - m + 1)}} \left( \frac{e \sqrt{n}}{\sqrt{\lambda_m}\, (n - m + 1)} \right)^{n - m + 1} t^{-(n - m + 1)}.$$

Proof Let $x_1 \geq \cdots \geq x_m > 0$ be the eigenvalues of the Wishart matrix $G G^* \sim W_m(n, C)$. Since $\| G^\dagger \| = x_m^{-1/2}$, we focus on the distribution of $x_m$. It is well known (James, 1964, 1968) that the joint density of the eigenvalues is given by
$$f(x_1, \ldots, x_m) = |C|^{-n/2} K_{m,n}\, {}_0F_0\big( C^{-1}, \cdots \big) \cdots$$
$$\cdots \leq \sqrt{\mathrm{Tr}(K)}\, \| \Omega_1^\dagger \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \| \| F \| + \lambda_1 \| C_{11}^{-1} \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \|\, s\, \| \Sigma_2 \| \quad \text{on } E_t,$$
where we make use of the fact that
$$\mathbb{E}[ h(\Omega_2) \mid \Omega_1 ] \leq \sqrt{\mathrm{Tr}(K)}\, \| \Omega_1^\dagger \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \| \| F \| + \lambda_1 \| C_{11}^{-1} \| \| \Sigma_2 \|$$
by Proposition 10. By definition of $E_t$, it follows that
$$\cdots \leq \sqrt{\mathrm{Tr}(K)} + s \sqrt{\| C_{11}^{-1} \|}\, \frac{\sqrt{(k + p)\, \lambda_1\, \mathrm{Tr}(C_{11}^{-1})}}{p + 1}\, t + \lambda_1 \| C_{11}^{-1} \|.$$
Since $\mathbb{P}(E_t^c) \leq 2 t^{-p}$, we then obtain
$$\cdots \leq \sqrt{\mathrm{Tr}(K)} + s \sqrt{\| C_{11}^{-1} \|}\, \frac{\sqrt{(k + p)\, \lambda_1\, \mathrm{Tr}(C_{11}^{-1})}}{p + 1}\, t + \lambda_1 \| C_{11}^{-1} \| \quad \text{with probability at least } 1 - 2 t^{-p} - e^{-s^2/2}.$$
We combine this estimate with Boullé and Townsend (2023, Lem. 1) to conclude.

The proof of Theorem 3 follows easily.

Proof (Theorem 3) We first claim that $\| F - P_Z F \| \leq \| H - P_Z H \|^{1/(2q+1)}$. Indeed, let $I$ be the identity operator on $L^2(D_2)$ and observe that $I - P_Z$ is an orthogonal projector.
Then Proposition 8 yields
$$\| (I - P_Z) F \| \leq \| (I - P_Z) (F F^*)^q F \|^{1/(2q+1)} = \| (I - P_Z) H \|^{1/(2q+1)}.$$
Now noting that the singular values of $H$ are $\sigma_1^{2q+1} \geq \sigma_2^{2q+1} \geq \cdots$, the result immediately follows from Theorem 1.

B.2 Proofs for Section 3.2

Here, we prove Theorem 5. In the following, let $\Theta(\mathcal{X}, \mathcal{Y})$ denote the diagonal matrix of canonical angles between two subspaces $\mathcal{X}, \mathcal{Y}$ of $L^2(D_2)$ with equal finite dimension (Stewart and Sun, 1990, Def. I.5.3). Additionally, for any (quasi)matrix $X$, we denote its column space by the calligraphic version of the same symbol, namely, $\mathcal{X}$.

Lemma 13 Let $F : L^2(D_1) \to L^2(D_2)$ be an HS operator with SVE given by (33). Select an integer $k \geq 1$ such that $\sigma_k > \sigma_{k+1}$. Select an integer $q \geq 0$ and set
$$\delta_q := \frac{\sigma_{k+1}^{2q+1}}{\sigma_k^{2q+1} - \sigma_{k+1}^{2q+1}} > 0. \qquad (44)$$
Let $H := (F F^*)^q F$ and let $\widetilde H : L^2(D_1) \to L^2(D_2)$ be an HS operator with finite rank $k$ such that $\| H - \widetilde H \| \leq C \sigma_{k+1}^{2q+1}$ for some $C > 0$. Let $\mathcal{U}_k$ be the dominant left $k$-singular subspace of $H$, and let $\widetilde{\mathcal{U}}_k$ be a dominant left $k$-singular subspace of $\widetilde H$. If $C \delta_q < 1$, then
$$\| \sin \Theta(\mathcal{U}_k, \widetilde{\mathcal{U}}_k) \| \leq \frac{C \delta_q}{1 - C \delta_q}. \qquad (45)$$

Proof Let $\widetilde\sigma_1 \geq \widetilde\sigma_2 \geq \cdots$ be the singular values of $\widetilde H$; the singular values of $H$ are given by $\sigma_1^{2q+1} \geq \sigma_2^{2q+1} \geq \cdots$. By Weyl's theorem, we have $| \widetilde\sigma_k - \sigma_k^{2q+1} | \leq C \sigma_{k+1}^{2q+1}$, whereas the assumption (44) implies $\sigma_k^{2q+1} - \sigma_{k+1}^{2q+1} = \delta_q^{-1} \sigma_{k+1}^{2q+1}$. It follows that
$$\widetilde\sigma_k - \sigma_{k+1}^{2q+1} \geq \sigma_k^{2q+1} - | \widetilde\sigma_k - \sigma_k^{2q+1} | - \sigma_{k+1}^{2q+1} \geq (\delta_q^{-1} - C)\, \sigma_{k+1}^{2q+1},$$
which in conjunction with Wedin's theorem (Stewart and Sun, 1990, Thm. V.4.4) yields the desired bound.

Combining Lemma 13 with the rSVD means we can approximate an operator's singular values at the extra cost of computing $\widetilde U_k^* F$, which takes an additional $k$ operator-function products. This provides the desired proof.

Proof (Theorem 5) Let $\widetilde U_k$ be the $D_2 \times k$ quasimatrix whose columns are dominant left $k$-singular functions of $\widetilde H$. A corollary of the CS decomposition (Stewart and Sun, 1990, Exc. I.5.6) along with Theorem 1 and Lemma 13 yields, with high probability,
$$\| U_k - \widetilde U_k \| \leq 2 \| \sin \Theta(\mathcal{U}_k, \widetilde{\mathcal{U}}_k) \| \leq \frac{2 \delta_q A_{k,p}(s, t)}{1 - \delta_q A_{k,p}(s, t)}.$$
Hence, we find that
$$\| U_k^* F - \widetilde U_k^* F \| \leq \frac{2 \delta_q A_{k,p}(s, t)}{1 - \delta_q A_{k,p}(s, t)} \| F \|.$$
Notice that $U_k^* F$ has the same dominant $k$ singular values as $F$, so Weyl's theorem completes the proof.

References

T. G. Anderson, O. P. Bruno, and M. Lyon. High-order, dispersionless fast-hybrid wave equation solver. Part I: O(1) sampling cost via incident-field windowing and recentering. SIAM J. Sci. Comput., 42(2):A1348–A1379, 2020.

R. Arora. A deep learning framework for solving hyperbolic partial differential equations: Part I. Preprint arXiv:2307.04121, 2023.

J. W. Banks and W. D. Henshaw. Upwind schemes for the wave equation in second-order form. J. Comput. Phys., 231(17):5854–5889, 2012.

M. Bebendorf and W. Hackbusch. Existence of H-matrix approximants to the inverse FE-matrix of elliptic operators with L∞-coefficients. Numer. Math., 95:1–28, 2003.

J. Berman and B. Peherstorfer. Randomized sparse Neural Galerkin schemes for solving evolution equations with deep networks. In Advances in Neural Information Processing Systems, volume 36, pages 4097–4114, 2023.

J. Berman and B. Peherstorfer. CoLoRA: Continuous low-rank adaptation for reduced implicit neural modeling of parameterized partial differential equations. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 3565–3583, 2024.

K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.

V. I. Bogachev. Gaussian Measures, volume 62 of Mathematical Surveys and Monographs. American Mathematical Society, 1998.

N. Boullé and A. Townsend. Learning elliptic partial differential equations with randomized linear algebra. Found. Comput. Math., 23:709–739, 2023.

N. Boullé, C. J. Earls, and A. Townsend. Data-driven discovery of Green's functions with human-understandable deep learning. Sci. Rep., 12:1–9, 2022a.

N. Boullé, S. Kim, T. Shi, and A. Townsend. Learning Green's functions associated with time-dependent partial differential equations. J. Mach.
Learn. Res., 23(1):1–34, 2022b.

N. Boullé, D. Halikias, and A. Townsend. Elliptic PDE learning is provably data-efficient. Proc. Natl. Acad. Sci. USA, 120(39):e2303904120, 2023.

N. Boullé, D. Halikias, S. E. Otto, and A. Townsend. Operator learning without the adjoint. J. Mach. Learn. Res., 25(364):1–54, 2024.

J. Bruna, B. Peherstorfer, and E. Vanden-Eijnden. Neural Galerkin schemes with active learning for high-dimensional evolution equations. J. Comput. Phys., 496:112588, 2024.

S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA, 113(15):3932–3937, 2016.

C.-T. Chen and G. X. Gu. Learning hidden elasticity with deep neural networks. Proc. Natl. Acad. Sci. USA, 118(31):e2102721118, 2021.

K. Chen, C. Wang, and H. Yang. Deep operator learning lessens the curse of dimensionality for PDEs. In Transactions on Machine Learning Research, 2024.

Z. Chen and J. Dongarra. Condition numbers of Gaussian random matrices. SIAM J. Matrix Anal. Appl., 26(3):1389–1404, 2005.

M. Cicognani. Strictly hyperbolic equations with non regular coefficients with respect to time. Ann. Univ. Ferrara, 45(1):45–58, 1999.

M. Cicognani and D. Lorenz. Strictly hyperbolic equations with coefficients low-regular in time and smooth in space. J. Pseudo-Differ. Oper. Appl., 9(3):643–675, 2018.

R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 2. Interscience Publishers, 1st English edition, 1962.

M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA J. Uncertain. Quantif., 11(2):480–513, 2023.

T. A. Driscoll, N. Hale, and L. N. Trefethen, editors. Chebfun Guide. Pafnuty Publications, 2014.

A. Edelman. Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl., 9(4):543–560, 1988.

L. C. Evans.
Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, 2nd edition, 2010.

Y. Fan and L. Ying. Solving electrical impedance tomography with deep learning. J. Comput. Phys., 404:109119, 2020.

Y. Fan, L. Lin, L. Ying, and L. Zepeda-Núñez. A multiscale neural network based on hierarchical matrices. Multiscale Model. Simul., 17(4):1189–1213, 2019.

H. Federer. Curvature measures. Trans. Am. Math. Soc., 93(3):418–491, 1959.

J. Feliu-Fabà, Y. Fan, and L. Ying. Meta-learning pseudo-differential operators with deep neural networks. J. Comput. Phys., 408:109309, 2020.

C. R. Gin, D. E. Shea, S. L. Brunton, and J. N. Kutz. DeepGreen: Deep learning of Green's functions for nonlinear boundary value problems. Sci. Rep., 11:1–14, 2021.

Y. Gordon. Some inequalities for Gaussian processes and applications. Isr. J. Math., 50:265–289, 1985.

Y. Gordon. Gaussian processes and almost spherical sections of convex bodies. Ann. Probab., 16(1):180–188, 1988.

A. Gray. Tubes, volume 221 of Progress in Mathematics. Birkhäuser, 2nd edition, 2003.

P. Günther. Huygens' principle and Hadamard's conjecture. Math. Intell., 13(2):56–63, 1991.

Y. Guo, X. Cao, B. Liu, and M. Gao. Solving partial differential equations using deep learning and physical constraints. Appl. Sci., 10(17):5917, 2020.

D. Halikias and A. Townsend. Structured matrix recovery from matrix-vector products. Numer. Linear Algebra Appl., 31(1):e2531, 2024.

N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.

T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, 2015.

A. J. Huang and S. Agarwal. On the limitations of physics-informed deep learning: Illustrations using first-order hyperbolic conservation law-based traffic flow models. IEEE Open J. Intell. Transp.
Syst., 4:279–293, 2023.

D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. Found. Data Sci., 7(1):163–225, 2025.

D. Hug, G. Last, and W. Weil. A local Steiner-type formula for general closed sets and applications. Math. Z., 246:237–272, 2004.

A. E. Hurd and D. H. Sattinger. Questions of existence and uniqueness for hyperbolic equations with discontinuous coefficients. Trans. Am. Math. Soc., 132(1):159–174, 1968.

A. T. James. Distributions of matrix variates and latent roots derived from normal samples. Ann. Math. Stat., 35(2):475–501, 1964.

A. T. James. Calculation of zonal polynomial coefficients by use of the Laplace–Beltrami operator. Ann. Math. Stat., 39(5):1711–1718, 1968.

E. Kaiser, J. N. Kutz, and S. L. Brunton. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc. R. Soc. Lond. A, 474(2219):20180335, 2018.

G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang. Physics-informed machine learning. Nat. Rev. Phys., 3(6):422–440, 2021.

L. K. Kates. Zonal polynomials. PhD thesis, Princeton University, 1981.

Y. Khoo and L. Ying. SwitchNet: a neural network model for forward and inverse scattering problems. SIAM J. Sci. Comput., 41(5):A3182–A3201, 2019.

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer. Machine learning-accelerated computational fluid dynamics. Proc. Natl. Acad. Sci. USA, 118(21):e2101784118, 2021.

N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. J. Mach. Learn. Res., 24:1–97, 2023.

A. S. Krishnapriyan, A. Gholami, S. Zhe, R. M. Kirby, and M. Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems, volume 35, pages 26548–26560, 2021.

J. N. Kutz.
Deep learning in fluid dynamics. J. Fluid Mech., 814:1–4, 2017.

R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.

S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Trans. Math. Appl., 6(1):tnac001, 2022.

P. Laurent, G. Legendre, and J. Salomon. On the method of reflections. Numer. Math., 148(2):449–493, 2021.

P. D. Lax. Hyperbolic Partial Differential Equations, volume 14 of Courant Lecture Notes in Mathematics. American Mathematical Society, 2006.

M. E. Lerner. Qualitative properties of the Riemann function [in Russian]. Differ. Uravn., 27(12):2106–2120, 1991.

J. Levitt and P.-G. Martinsson. Randomized compression of rank-structured matrices accelerated with graph coloring. J. Comput. Appl. Math., 451:116044, 2024.

M. Li, L. Demanet, and L. Zepeda-Núñez. Wide-band butterfly network: stable and efficient inversion via multi-frequency neural networks. Multiscale Model. Simul., 20(4):1191–1227, 2022.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Graph kernel network for partial differential equations. In International Conference on Learning Representations Workshop on Integration of Deep Neural Models and Differential Equations, 2020a.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar. Multipole graph neural operator for parametric partial differential equations. In Advances in Neural Information Processing Systems, volume 33, pages 6755–6766, 2020b.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations.
In International Conference on Learning Representations, 2021.

L. Lin, J. Lu, and L. Ying. Fast construction of hierarchical matrix representation from matrix-vector multiplication. J. Comput. Phys., 230(10):4071–4087, 2011.

Y. Liu, X. Xing, H. Guo, E. Michielssen, P. Ghysels, and X. S. Li. Butterfly factorization via randomized matrix-vector multiplications. SIAM J. Sci. Comput., 43(2):A883–A907, 2021.

Y. Liu, J. Song, R. Burridge, and J. Qian. A fast butterfly-compressed Hadamard–Babich integrator for high-frequency Helmholtz equations in inhomogeneous media with arbitrary sources. Multiscale Model. Simul., 21(1):269–308, 2023.

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell., 3:218–229, 2021a.

L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis. DeepXDE: A deep learning library for solving differential equations. SIAM Rev., 63(1):208–228, 2021b.

A. G. Mackie. Green's functions and Riemann's method. Proc. Edinburgh Math. Soc., 14(4):293–302, 1965.

A. Mandelbaum. Linear estimators and measurable linear transformations on a Hilbert space. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 65(3):385–397, 1984.

S. Massei, L. Robol, and D. Kressner. Hierarchical adaptive low-rank format with applications to discretized partial differential equations. Numer. Linear Algebra Appl., 29(6):e2448, 2022.

M. Meier and Y. Nakatsukasa. Fast randomized numerical rank estimation for numerically low-rank matrices. Linear Algebra Appl., 686:1–32, 2024.

R. Molinaro, Y. Yang, B. Engquist, and S. Mishra. Neural inverse operators for solving PDE inverse problems. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 25105–25139, 2023.

R. J. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, 1982.

NIST DLMF. NIST Digital Library of Mathematical Functions.
Release 1.1.12 of 2023-12-15. URL https://dlmf.nist.gov/. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds.

S. E. Otto, A. Padovan, and C. W. Rowley. Model reduction for nonlinear systems by balanced truncation of state and gradient covariance. SIAM J. Sci. Comput., 45(5):A2325–A2355, 2023.

E. Qian, B. Kramer, B. Peherstorfer, and K. Willcox. Lift & learn: Physics-informed machine learning for large-scale nonlinear dynamical systems. Physica D, 406:132401, 2020.

M. Raissi, A. Yazdani, and G. E. Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.

M. Reissig. Hyperbolic equations with non-Lipschitz coefficients. Rend. Semin. Mat. Univ. Politec. Torino, 61(2):135–181, 2003.

R. Rodriguez-Torrado, P. Ruiz, L. Cueto-Felgueroso, M. C. Green, T. Friesen, S. Matringe, and J. Togelius. Physics-informed attention-based neural network for hyperbolic partial differential equations: Application to the Buckley–Leverett problem. Sci. Rep., 12:7557, 2022.

V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, 2009.

S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz. Data-driven discovery of partial differential equations. Sci. Adv., 3(4):e1602614, 2017.

F. Schäfer and H. Owhadi. Sparse recovery of elliptic solvers from matrix-vector products. SIAM J. Sci. Comput., 46(2):A998–A1025, 2024.

F. Schäfer, T. J. Sullivan, and H. Owhadi. Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. Multiscale Model. Simul., 19:688–730, 2021.

T. Shi and A. Townsend. On the compressibility of tensors. SIAM J. Matrix Anal. Appl., 42(1):275–298, 2021.

H. F. Smith. A parametrix construction for wave equations with C^{1,1} coefficients. Ann. Inst.
Fourier, 48(3):797–835, 1998.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

S. Subramanian, P. Harrington, K. Keutzer, W. Bhimji, D. Morozov, M. Mahoney, and A. Gholami. Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. In Advances in Neural Information Processing Systems, volume 37, pages 71242–71262, 2023.

B. T. Thodi, S. V. R. Ambadipudi, and S. E. Jabari. Learning-based solutions to nonlinear hyperbolic PDEs: Empirical insights on generalization errors. In Advances in Neural Information Processing Systems Workshop on Machine Learning and the Physical Sciences, 2022.

A. Townsend. Computing with functions in two dimensions. PhD thesis, University of Oxford, 2014.

A. Townsend and L. N. Trefethen. Continuous analogues of matrix factorizations. Proc. R. Soc. Lond. A, 471(2173):20140585, 2015.

L. N. Trefethen. Householder triangularization of a quasimatrix. IMA J. Numer. Anal., 30(4):887–897, 2010.

L. N. Trefethen. Approximation Theory and Approximation Practice. SIAM, extended edition, 2019.

R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

Z. Y. Wan, L. Zepeda-Núñez, A. Boral, and F. Sha. Evolve smoothly, fit consistently: Learning smooth latent dynamics for advection-dominated systems. In International Conference on Learning Representations, 2023.

H. Wang. New error bounds for Legendre approximations of differentiable functions. J. Fourier Anal. Appl., 29(42), 2023.

S. Wang, H. Wang, and P. Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Sci. Adv., 7:eabi8605, 2021.

L. Zepeda-Núñez and L. Demanet. The method of polarized traces for the 2D Helmholtz equation. J. Comput. Phys., 308:347–388, 2016.

S. Zhang and G. Lin. Robust data-driven recovery of governing physical laws with error bars. Proc. R. Soc.
Lond. A, 474(2217):20180305, 2018.