Journal of Machine Learning Research 26 (2025) 1-44 Submitted 12/23; Revised 7/25; Published 9/25

Operator Learning for Hyperbolic Partial Differential Equations

Christopher Wang cyw33@cornell.edu
Department of Mathematics, Cornell University, Ithaca, NY 14853, USA

Alex Townsend townsend@cornell.edu
Department of Mathematics, Cornell University, Ithaca, NY 14853, USA

Editor: Fei Sha

Abstract

We construct the first rigorously justified probabilistic algorithm for recovering the solution operator of a hyperbolic partial differential equation (PDE) in two variables from input-output training pairs. The primary challenge of recovering the solution operator of hyperbolic PDEs is the presence of characteristics, along which the associated Green's function is discontinuous. Therefore, a central component of our algorithm is a rank detection scheme that identifies the approximate location of the characteristics. By combining the randomized singular value decomposition with an adaptive hierarchical partition of the domain, we construct an approximant to the solution operator using O(Ψ_ϵ^{-1} ϵ^{-7} log(Ξ_ϵ^{-1} ϵ^{-1})) input-output pairs with relative error O(Ξ_ϵ^{-1} ϵ) in the operator norm as ϵ → 0, with high probability. Here, Ψ_ϵ represents the existence of degenerate singular values of the solution operator, and Ξ_ϵ measures the quality of the training data. Our assumptions on the regularity of the coefficients of the hyperbolic PDE are relatively weak given that hyperbolic PDEs do not have the instantaneous smoothing effect of elliptic and parabolic PDEs, and our recovery rate improves as the regularity of the coefficients increases. We also include numerical experiments which corroborate our theoretical findings.

Keywords: Data-driven PDE learning, hyperbolic PDE, operator learning, low-rank approximation, randomized SVD
©2025 Christopher Wang and Alex Townsend. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1724.html.

1. Introduction

In this paper, we consider the recovery of the solution operator associated with hyperbolic partial differential equations (PDEs) from data. Generally, the task of PDE learning is to capture information about an unknown, inhomogeneous PDE given data corresponding to the observed effect of the PDE on input functions (Fan et al., 2019; Fan and Ying, 2020; Gin et al., 2021; Karniadakis et al., 2021; Kovachki et al., 2023; Li et al., 2020a,b, 2021; Lu et al., 2021a,b; Wang et al., 2021). These PDEs typically represent real-world dynamical systems whose governing principles are poorly understood, even though one can accurately observe or predict their evolutions, either through experimentation or through simulation. Data-driven PDE learning has applications in climate science (Bi et al., 2023; Lam et al., 2023), biology (Raissi et al., 2020), and physics (Chen and Gu, 2021; Kochkov et al., 2021; Kutz, 2017; Qian et al., 2020), and is a significant area of research in scientific machine learning and reduced order modeling (Berman and Peherstorfer, 2023, 2024; Brunton et al., 2016; Chen et al., 2024; de Hoop et al., 2023; Krishnapriyan et al., 2021; Rudy et al., 2017; Subramanian et al., 2023; Zhang and Lin, 2018). In particular, many effective practical schemes based on neural networks have been developed to recover the solution operator associated with an unknown PDE (Boullé et al., 2022a; Feliu-Faba et al., 2020; Fan and Ying, 2020; Gin et al., 2021; Li et al., 2020a,b, 2021; Wan et al., 2023; Wang et al., 2021), although the inscrutability of these neural networks as black boxes often prevents one from understanding the underlying explanation for their success.
Moreover, theoretical research in PDE learning typically centers on error estimates for neural network-based schemes (Kovachki et al., 2023; Lanthaler et al., 2022; Lu et al., 2021a). There is a growing body of research for operator learning in the context of elliptic and parabolic PDEs (Boullé et al., 2022b; Boullé and Townsend, 2023; Schäfer and Owhadi, 2024; Schäfer et al., 2021), but there is a lack of theoretical work on solution operators of hyperbolic PDEs. The existing work focuses primarily on deep learning models for solving hyperbolic PDEs (Arora, 2023; Berman and Peherstorfer, 2024; Bruna et al., 2024; Guo et al., 2020; Huang and Agarwal, 2023; Rodriguez-Torrado et al., 2022; Thodi et al., 2022) or related operators associated with inverse problems (Khoo and Ying, 2019; Li et al., 2022; Molinaro et al., 2023), rather than recovering their solution operators. Therefore, we aim to shed light on the theoretical aspects of solution operator learning for hyperbolic PDEs and to open the field for further research.

We consider an unknown second-order hyperbolic linear partial differential operator (PDO) in two variables of the form:

Lu := u_tt − a(x, t)u_xx + b(x, t)u_x + c(x, t)u, (x, t) ∈ D_T := [0, 1] × [0, 1], (1a)

together with homogeneous initial-boundary conditions

u(x, 0) = u_t(x, 0) = 0, 0 ≤ x ≤ 1,
u(0, t) = u_x(0, t) = 0, 0 ≤ t ≤ 1,
u(1, t) = u_x(1, t) = 0, 0 ≤ t ≤ 1.

We assume that the coefficients are somewhat regular, namely, a, b, c ∈ C¹(D_T), and that L is strictly hyperbolic, i.e., a > 0, and self-adjoint, i.e., a_x + b = 0.¹ For equations of the form Lu = f, the function f is called the forcing term of the PDE, while u is the corresponding system's response or solution. The hyperbolic equation Lu = f describes wave-like phenomena, especially in heterogeneous media, such as water waves, acoustic and electromagnetic signals, or earthquakes (Evans, 2010; Lax, 2006).
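Although the recovery algorithm treats L as a black box, it may help to see how input-output pairs (f_j, u_j) could be produced in practice. The following is a minimal finite-difference sketch, not the paper's method: we assume, for illustration only, a leapfrog time-stepping scheme, plain Dirichlet walls u(0, t) = u(1, t) = 0, and the hypothetical helper name solve_cauchy.

```python
import numpy as np

def solve_cauchy(a, b, c, f, nx=201, nt=241):
    """Leapfrog sketch of the Cauchy problem (1): u_tt = a u_xx - b u_x - c u + f
    with zero initial data and zero Dirichlet walls (an illustrative assumption).
    a, b, c, f are callables of (x_array, t_scalar)."""
    x, t = np.linspace(0.0, 1.0, nx), np.linspace(0.0, 1.0, nt)
    h, tau = x[1] - x[0], t[1] - t[0]
    # CFL condition: the time step must resolve the fastest wave speed sqrt(a)
    assert tau * np.sqrt(max(a(x, s).max() for s in t)) <= h, "CFL violated"
    u = np.zeros((nt, nx))              # rows 0 and 1 stay zero: u = u_t = 0 at t = 0
    for n in range(1, nt - 1):
        uxx = np.zeros(nx)
        ux = np.zeros(nx)
        uxx[1:-1] = (u[n, 2:] - 2 * u[n, 1:-1] + u[n, :-2]) / h**2
        ux[1:-1] = (u[n, 2:] - u[n, :-2]) / (2 * h)
        rhs = a(x, t[n]) * uxx - b(x, t[n]) * ux - c(x, t[n]) * u[n] + f(x, t[n])
        u[n + 1] = 2 * u[n] - u[n - 1] + tau**2 * rhs
        u[n + 1, 0] = u[n + 1, -1] = 0.0   # homogeneous boundary conditions
    return x, t, u
```

Two sanity checks on such a solver: zero forcing returns the zero response, and a forcing supported only near t = 1 produces a response that vanishes at earlier times, reflecting the finite speed of propagation along characteristics.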
Our learning task is to approximate the solution operator that maps forcing terms to responses, given training data in the form of input-output pairs {(f_j, u_j)}_{j=1}^N that satisfy either the Cauchy problem (1) or the corresponding adjoint Cauchy problem (see Section 4 for details). Associated with the Cauchy problem (1) is a unique Green's function G : D_T × D_T → R, which is a kernel for the solution operator (Courant and Hilbert, 1962; Mackie, 1965). That is, solutions to the problem (1) are given by the integral operator

u(x, t) = ∫_{D_T} G(x, t; y, s) f(y, s) dy ds, (x, t) ∈ D_T, (2)

for f ∈ L²(D_T). We call G the homogeneous Green's function. Our goal is to recover the action of the integral operator in (2) as accurately as possible, measured by the operator norm. Notably, we are not solving an inverse problem, in that we are not interested in learning the coefficients of (1a).² In fact, due to the compactness of the solution operator, its recovery is well-posed, and our proposed algorithm is stable (see Section A.1). Rather, our task is to construct a direct solver for the Cauchy problem (1) without explicit knowledge of the equation's coefficients.

1. We discuss relaxations of these constraints, as well as inhomogeneous initial conditions, in Section 6.

1.1 Challenges and contributions

This paper describes the first theoretically justified scheme for recovering the solution operator associated with hyperbolic PDEs using only input-output data. We also provide a rigorous rate of recovery for our algorithm. While we adopt some of the strategies used by Boullé et al.
(2022b) and Boullé and Townsend (2023), such as the randomized singular value decomposition (rSVD), we face three unique circumstances when recovering the Green's functions of hyperbolic PDEs that make our situation challenging:

Characteristic curves. The primary difficulty of recovering the Green's function of a hyperbolic PDE is the presence of characteristic curves (or simply characteristics), along which the Green's function is highly irregular. Characteristics describe the trajectory of waves in spacetime and are determined by the coefficients of (1a), so their location is also unknown to us in advance. Thus, our recovery algorithm needs to detect if a region of the Green's function's domain intersects a characteristic curve, using only input-output data.

Adaptive partitioning. For elliptic and parabolic PDEs, the Green's functions are numerically low-rank off the diagonal, so they are well-suited to an approximation strategy via hierarchical partitioning of the domain (Bebendorf and Hackbusch, 2003; Boullé et al., 2022b). Since the Green's functions of hyperbolic PDEs have numerically high rank not only on the diagonal but also along the characteristics, we cannot apply a naive hierarchical partition of the domain. Instead, we use an adaptive partitioning strategy, which at each level uses information from the rSVD to decide which regions to partition further.

Regularity of coefficients. Underlying the difficulty posed by characteristic curves is a more fundamental issue with hyperbolic PDEs: irregularities are not instantaneously smoothed and propagate. This feature of Green's functions makes them challenging to recover, as the variable coefficients can create additional discontinuities.

2. Learning the coefficients of an equation can be done with techniques similar to sparse identification of nonlinear dynamics (SINDy) (Brunton et al., 2016; Kaiser et al., 2018).
Accordingly, the recovery rate we derive for the hyperbolic case depends on the regularity of the coefficients; the more regular the coefficients, the faster the recovery.

To overcome these challenges, we rely on the following three properties of the Green's function G of the hyperbolic PDO of (1a) (see Section 2):

1. The function G is square-integrable, with jump discontinuities along the characteristics and on the diagonal of D_T × D_T.
2. Away from the diagonal and the characteristics, G is regular and, consequently, numerically low-rank.
3. The characteristics form a piecewise regular hypersurface in D_T × D_T and never accumulate in finite time. In other words, the volume of a tube of radius δ around the characteristic surface shrinks to zero as δ → 0.

Our main technical contribution is the derivation of a rigorous probabilistic algorithm that constructs an approximant to the solution operator associated with (1a) using randomly generated input-output data {(f_j, u_j)}_{j=1}^N, with a small error in the operator norm. In particular, we show in Theorem 6 that with high probability, the solution operator can be recovered within a tolerance of Ξ_ϵ^{-1} ϵ using O(Ψ_ϵ^{-1} ϵ^{-7} log(Ξ_ϵ^{-1} ϵ^{-1})) input-output pairs, where Ξ_ϵ and Ψ_ϵ are defined in (31) and (32) and describe features of the operator L and the input-output data, namely, the size of the singular value gaps of the solution operator and the quality of the covariance kernel used to generate the training data, which cannot be controlled without additional assumptions. Our construction relies on an adaptive hierarchical partition of the spatio-temporal domain, which roughly identifies the location of the characteristic curves, combined with the rSVD for HS operators.
We remark that the learning rate is comparable to the one derived by Boullé and Townsend (2023) for elliptic PDEs, where it was later shown that an ϵ^{-6} factor can be improved to polylog(1/ϵ) using the peeling algorithm; see Section 6.2 (Boullé et al., 2023; Levitt and Martinsson, 2024; Lin et al., 2011). We also emphasize that our scheme succeeds for equations with both space- and time-dependent coefficients, which informs the use of the time domain rather than the frequency domain in our analysis. While we are interested in as general a setting as possible, other schemes that exploit the Helmholtz equation through the frequency domain may be preferable when the coefficients are time-independent (Anderson et al., 2020; Liu et al., 2023; Zepeda-Núñez and Demanet, 2016).

A key component of our probabilistic algorithm is a scheme that detects the numerical rank of an operator using input-output data, which allows us to tell whether or not a domain intersects a characteristic. We do so by showing, in Theorem 5, that the rSVD can efficiently recover an operator's dominant singular subspaces. We then use the singular subspaces to approximate the dominant singular values of the operator, whose decay rate corresponds to the operator's numerical rank. While our scheme assumes the existence of a gap between adjacent singular values, such an assumption is reasonable in practice since the singular values of the solution operator typically exhibit decay, which is fast enough for efficient numerical rank detection (Meier and Nakatsukasa, 2024). Still, we believe that the assumption of a singular value gap can be discarded, although this will not be discussed in the present work. Our rank detection scheme facilitates the adaptive feature of the partitioning strategy, since at each hierarchical level we partition only the subdomains flagged as numerically high-rank.
While such adaptive strategies are not new and have been previously applied to wave-like settings (Liu et al., 2021; Massei et al., 2022; Zepeda-Núñez and Demanet, 2016), our work supplies, to our knowledge, the first theoretical guarantee that the rSVD can be used in an adaptive scheme in a stable manner and with high probability of success. The use of the rSVD is particularly important in the realm of operator learning, where one often only has access to input-output data and thus cannot compute an SVD directly.

Finally, in Theorem 1, we improve the error estimates for the rSVD for HS operators, as derived by Boullé and Townsend (2023), by a factor of k, where k is the target rank of the constructed approximant. Assuming one always oversamples using an additional k training pairs, the error factor in Boullé and Townsend (2023, Thm. 1) grows to infinity roughly like O(k) as k → ∞, whereas, practically speaking, our error factor remains bounded. This improvement shows that the rSVD for HS operators behaves similarly to the rSVD for matrices; our bounds are asymptotically comparable to the bounds proved in Halko et al. (2011).

We remark that because our algorithm is mainly of theoretical value, we choose to work in the continuous setting, that is, without considering discretization, both to simplify the analysis and to maintain discretization-invariance of our theoretical guarantees. It may be desirable to learn the Green's function as an operator between infinite-dimensional function spaces rather than to learn a space- and time-discretized version of the Green's function. Indeed, Huang et al. (2025) show that learning the solution operator as a function-to-function map and only discretizing its inputs and outputs when necessary may be more data-efficient than learning its discretization as a vector-to-vector map.
In practice, learning a solution operator without discretizing in space or time can be achieved by working in a finite-dimensional subspace of L² using, for instance, the Legendre basis to represent functions, as done in the MATLAB package chebfun (Driscoll et al., 2014). Nevertheless, we implement a space- and time-discretized version of our algorithm in Section 5, demonstrating its robustness to discretization.

1.2 Organization

The paper proceeds as follows. In Section 2, we review the characteristics of hyperbolic PDEs and their relationship with the Green's function, while Section 3 develops the necessary tools from randomized linear algebra, namely, the rSVD for Hilbert–Schmidt (HS) operators. These tools are employed in Section 4 to construct our probabilistic algorithm for recovering the solution operator of a hyperbolic PDO using input-output training pairs; we also analyze the recovery rate and probability of success. A numerical implementation and example of our algorithm is presented in Section 5. Finally, we summarize our results and discuss further directions of research in Section 6. Background material on HS operators, quasimatrices, orthogonal projectors, Gaussian processes (GPs), and the Legendre basis can be found in Appendix A. Proofs related to the rSVD appear in Appendix B.

1.3 Notation

Throughout the paper, we use the following notation. Norms are denoted by ‖·‖, and the type of norm is given by a subscript. If the argument is an operator, then ‖·‖ without a subscript denotes the operator norm. For a matrix or quasimatrix A (see Section A.2), we denote its Moore–Penrose pseudoinverse by A†. We write I_m to denote an m × m identity matrix. For a random object X, we write X ∼ D to mean that X is drawn from the distribution D; we write X ∼ Y to mean that X has the same distribution as a different random object Y. We use N(µ, σ²) to denote the univariate Gaussian distribution with mean µ and variance σ².
For a vector µ ∈ R^n and a symmetric positive definite n × n matrix C, we write N(µ, C) for the multivariate Gaussian distribution with mean µ and covariance matrix C. For an integer r ≥ 0, C^r(D) denotes the space of r-times continuously differentiable functions on a domain D; if D is closed, then the functions in C^r(D) are also required to extend continuously to the boundary of D. The other function space we use is the Hilbert space L²(D) of square-integrable functions on D.

2. Green's functions of hyperbolic PDOs

Our recovery scheme relies crucially on understanding where the singularities of the Green's function lie. In this section, we investigate the geometry of characteristic curves and discuss the dependence of the regularity of the Green's function on the regularity of the coefficients of L as in (1a). We assume that the coefficients of L have regularity a, b, c ∈ C^r(D_T), for an integer r ≥ 1.

2.1 Green's functions in the domain H

We first consider the homogeneous Green's function G^H of L, as defined in (2), in the unbounded domain H = R × [0, ∞).³ Since we assume the coefficients a, b, c are at least continuously differentiable, G^H exists and is unique (Courant and Hilbert, 1962, Ch. V.5–6). Its discontinuities lie on the characteristics, which we now describe. The characteristic curves passing through some point (x0, t0) ∈ H can be interpreted as the paths in spacetime traversed by waves propagating from an instantaneous unit force at (x0, t0). We obtain a family of characteristic curves by iterating over all (x0, t0) ∈ H. For hyperbolic equations in two variables, the characteristic curves are the graphs of solutions to the following ordinary differential equations:

x′(t) + √(a(x(t), t)) = 0,  x′(t) − √(a(x(t), t)) = 0, (3)

over all choices of initial conditions (Courant and Hilbert, 1962, Ch. III.1).
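As a concrete numerical illustration (our own sketch, not part of the recovery algorithm), the two ODEs in (3) can be integrated with a standard RK4 step; the hypothetical helper characteristic below traces x′(t) = ±√(a(x, t)). For a constant speed a ≡ c², the rays reduce to the straight lines x(t) = x0 ± c(t − t0).

```python
import numpy as np

def characteristic(a, x0, t0, t1, sign=+1, n=400):
    """Trace the characteristic ray through (x0, t0) by RK4 integration of
    x'(t) = sign * sqrt(a(x, t)), i.e. one of the two ODEs in (3)."""
    ts = np.linspace(t0, t1, n + 1)
    dt = ts[1] - ts[0]
    xs = np.empty(n + 1)
    xs[0] = x0
    g = lambda x, t: sign * np.sqrt(a(x, t))
    for i in range(n):
        xx, tt = xs[i], ts[i]
        k1 = g(xx, tt)
        k2 = g(xx + 0.5 * dt * k1, tt + 0.5 * dt)
        k3 = g(xx + 0.5 * dt * k2, tt + 0.5 * dt)
        k4 = g(xx + dt * k3, tt + dt)
        xs[i + 1] = xx + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return ts, xs
```

For instance, with a(x, t) = 4 the positive ray through (0, 0) reaches x = 2 at t = 1, while the time-dependent speed a(x, t) = (1 + t)² gives x(t) = x0 + t + t²/2 exactly.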
Thus, for any (x0, t0) ∈ H, there exist exactly two characteristic curves passing through (x0, t0), given by the equations in (3), both of which are of class C^{r+1} and satisfy the initial condition x(t0) = x0. Viewed as functions of t, one is strictly increasing, and the other is strictly decreasing, so they intersect transversally at (x0, t0) and partition H into four connected components lying respectively to the north, south, east, and west of (x0, t0). We refer to the component to the north, that is, forward in time, as the future light cone Λ_future(x0, t0). Additionally, we call the characteristic ray emanating toward the northwest of (x0, t0) the negative characteristic ray, and likewise we call the ray emanating toward the northeast the positive characteristic ray (see Figure 1).

3. Much of the classical literature discusses what is known as the Riemann function, rather than the Green's function, of hyperbolic equations. The Riemann function (also referred to as the Riemann–Green function or radiation solution) is a fundamental solution of the PDE and allows one to write an integral solution to the homogeneous equation given Cauchy initial data. Essentially, every result for the Riemann function also holds for the homogeneous Green's function, as they are closely related (Mackie, 1965).

Figure 1: Characteristics associated with the coefficient a(x, t) = ((x+1)² + 1)/(t + 1) in the domain H, with initial point (x0, t0) = (1, 1). They represent wave trajectories in spacetime produced by a unit force at (1, 1): solid curves are future trajectories, while dashed curves are past trajectories. Positive and negative characteristic rays emanating from (x0, t0) are solid and labeled by ζ_±(x0, t0). The future light cone is the region labeled Λ_future.
We are interested in the bundle of positive and negative characteristic rays indexed over all initial points (x0, t0) ∈ H, since they determine where the Green's function is irregular. Let ζ_±(x0, t0) denote the positive and negative characteristic rays, respectively, emanating from (x0, t0). Then we define

Z_H^± := {(x, t, x0, t0) ∈ H × H : (x, t) ∈ ζ_±(x0, t0)}, (4)

as well as

Z_H := Z_H^+ ∪ Z_H^−. (5)

Observe that Z_H^± are 3-dimensional C^r-manifolds with boundary in H × H.⁴ For fixed (x0, t0) ∈ D_T, the slices G^H_{x0,t0} given by (x, t) ↦ G^H(x, t; x0, t0) are r-times continuously differentiable as long as (x, t) does not lie on either of the characteristic rays emanating from (x0, t0) (Lerner, 1991, Thm. 1). Moreover, due to symmetries of fundamental solutions of hyperbolic equations (Courant and Hilbert, 1962, Ch. V.5), we conclude that G^H has the same regularity in every variable. For hyperbolic equations in two variables, the singularity of the Green's function on the characteristics is a jump discontinuity. This is because G^H_{x0,t0} restricted to Λ_future(x0, t0) satisfies continuous boundary conditions on the characteristic rays (Courant and Hilbert, 1962, Ch. V.5). Outside the future light cone, the Green's function is identically zero (Mackie, 1965). In summary, we have

G^H ∈ C^r((H × H) \ Z_H), (6)

where Z_H is defined by (5).

4. Here, a C^r-manifold is simply a manifold whose transition maps are of class C^r. The claim here follows from the observation that Z_H^± can be defined as the graph of the flow generated by the time-dependent vector field (x, t) ↦ ±√(a(x, t)). The details are irrelevant here, and we omit them.

Figure 2: In a bounded domain, the characteristics reflect off the boundary, producing a series of reflecting characteristic segments, depicted by solid curves labeled ζ_±^{(j)}(x0, t0), for j = 0, 1, 2, .... Collision points are labeled t_±^{(j)}.
2.2 Green's functions in the domain D_T

When we consider L on the bounded domain D_T = [0, 1] × [0, 1], we must also establish homogeneous boundary conditions on {0, 1} × [0, 1], in addition to homogeneous initial conditions. Homogeneous boundary conditions produce wave reflections, so the characteristic curves reflect off the boundaries. When a positive characteristic ray collides with the right boundary, its reflection is given by the negative characteristic ray emanating from the collision point. Likewise, when a negative characteristic ray collides with the left boundary, its reflection is given by the positive characteristic ray emanating from the collision point. Wave trajectories can thus be described as a series ζ_±^{(0)}(x0, t0), ζ_±^{(1)}(x0, t0), ... of reflecting characteristic segments emanating from (x0, t0), which terminates at some index N_±(x0, t0) once the segments cross the time horizon t = 1 (see Figure 2). Letting N = max_{(x0,t0)∈D_T} N_±(x0, t0), we define, analogously with (4) and (5), the objects

Z_±^{(j)} := {(x, t, x0, t0) ∈ D_T × D_T : (x, t) ∈ ζ_±^{(j)}(x0, t0)}, 0 ≤ j ≤ N, (7)

Z := ⋃_{j=0}^{N} (Z_+^{(j)} ∪ Z_−^{(j)}). (8)

Again, each component Z_±^{(j)}, j = 1, ..., N, is a 3-dimensional C^r-manifold with boundary in D_T × D_T, so that Z is a piecewise C^r-manifold. Notice that the number of reflections N is bounded by max_{(x,t)∈D_T} √(a(x, t)).

Besides the reflecting characteristics, the homogeneous Green's function G in the domain D_T has the same regularity properties as the homogeneous Green's function G^H in the domain H.⁵ The reflections produced by homogeneous boundary conditions manifest as additional jump discontinuities for G, lying on the reflecting characteristic segments described above. In other words, on a bounded domain, we have

G ∈ C^r((D_T × D_T) \ Z), (9)

where Z is defined by (8).

Example 1 Consider the constant-coefficient wave equation u_tt − a²u_xx = f.
The Green's function is quite simple to understand and can be written explicitly, but it would be very notationally complicated due to the boundary conditions. It is piecewise constant, and its value in each component is either 0 or 1/(2a) (see Figure 3). Its components are partitioned by characteristic curves, which are lines of slope ±1/a emanating from the initial points (x0, t0) and reflecting off the boundary. For general hyperbolic PDEs of the form (1), the Green's function is usually not piecewise constant; nevertheless, it has the same qualitative properties, in particular jump discontinuities on the characteristics.

5. The claims made in this section can be seen by applying the method of reflections to solve the homogeneous initial-boundary problem which defines the homogeneous Green's function in a bounded domain (see, e.g., Laurent et al., 2021). Again, the details are not relevant to us.

Figure 3: A slice of the Green's function associated with the wave operator Lu = u_tt − 9u_xx with initial point (x0, t0) = (1 6). In this case, the Green's function is piecewise constant with jump discontinuities on the characteristics. The values of the Green's function are labeled in their respective regions.

3. Randomized SVD for Hilbert–Schmidt operators

This section introduces our primary tool, which we have adapted from randomized numerical linear algebra. The landmark result of Halko et al. (2011) proved that one could recover the column space of an unknown matrix with high accuracy and a high probability of success by multiplying it with standard Gaussian vectors. Boullé and Townsend (2023) extend this result to HS operators and functions drawn from a non-standard Gaussian process. This algorithm is called the rSVD.

3.1 Randomized SVD

Given a HS operator F : L²(D1) → L²(D2) with SVE as in (33), we define two quasimatrices U and V containing the left and right singular functions of F, so that the jth columns of U and V are e_j and v_j, respectively.
We also denote by Σ the infinite diagonal matrix with the singular values σ1 ≥ σ2 ≥ ⋯ of F on the diagonal. For a fixed integer k ≥ 0, we define V1 as the D1 × k quasimatrix whose columns are the first k right singular functions v1, ..., vk, and V2 as the D1 × ∞ quasimatrix whose columns are v_{k+1}, v_{k+2}, .... We analogously define U1 and U2 with the left singular functions. Similarly, we define Σ1 as the k × k diagonal matrix with the first k singular values σ1, ..., σk on the diagonal, and Σ2 as the diagonal matrix with σ_{k+1}, σ_{k+2}, ... on the diagonal. In summary, we write

F = [U1 U2] [Σ1 0; 0 Σ2] [V1 V2]*.

Additionally, for some fixed continuous symmetric positive definite kernel K : D1 × D1 → R with eigenvalues λ1 ≥ λ2 ≥ ⋯ > 0, we define the infinite matrix C by

C_ij := ∫_{D1×D1} v_i(x) K(x, y) v_j(y) dx dy, i, j ≥ 1. (10)

Observe that C is symmetric and positive definite, and that Tr(C) = Tr(K) < ∞, so it is a compact operator (Boullé and Townsend, 2023, Lem. 1 and Eq. (11)). We denote by C^{-1} the inverse operator of C on the domain for which it is well-defined. Furthermore, for a fixed integer k ≥ 1, we partition C into

C = [C11 C12; C21 C22],

where C11 is the leading k × k block. Since C11 is also positive definite and thus invertible, we define

γ_k := k / (λ1 Tr(C11^{-1})), ξ_k := 1 / (λ1 ‖C11^{-1}‖), (11)

which quantify the quality of the covariance kernel K with respect to F. Indeed, the Courant–Fischer minimax principle implies that the jth largest eigenvalue of C is bounded by λ_j, since C is a principal submatrix of V*KV. It follows that 0 < γ_k, ξ_k ≤ 1, and that the best case scenario occurs when the eigenfunctions of K are the right singular functions of F. In that case, C is an infinite diagonal matrix with entries λ1 ≥ λ2 ≥ ⋯ > 0, and γ_k = k/(Σ_{j=1}^k λ1/λ_j) and ξ_k = λ_k/λ1 attain their maximal values. One can view ξ_k as the operator norm analogue of γ_k, discussed in more detail in Boullé and Townsend (2023, §3.4). Finally, for some X ⊆ D_T, let R_X : L²(D_T) → L²(X) denote the restriction to X.
Its adjoint R_X^* : L²(X) → L²(D_T) is the zero extension operator, i.e., R_X^* f is the function which is equal to f on X and equal to zero everywhere else. We denote by F_{X×Y} := R_X F R_Y^* the restriction of F to X × Y. When considering the restricted operator, we define the analogous quantities

γ_{k,X×Y} := k / (λ1 Tr(C_{11,X×Y}^{-1})), ξ_{k,X×Y} := 1 / (λ1 ‖C_{11,X×Y}^{-1}‖), (12)

where C_{11,X×Y} is the k × k matrix

[C_{11,X×Y}]_{ij} := ∫_{D1×D1} R_Y^* v_{i,X×Y}(x) K(x, y) R_Y^* v_{j,X×Y}(y) dx dy, 1 ≤ i, j ≤ k,

and v_{1,X×Y}, ..., v_{k,X×Y} are the dominant k right singular functions of F_{X×Y}. We also define σ_{1,X×Y} ≥ σ_{2,X×Y} ≥ ⋯ as the singular values of F_{X×Y}. In this notation, we state an analogue of the rSVD for HS operators (Boullé and Townsend, 2023, Thm. 1) with respect to the operator norm. We improve the error bound by a factor of k compared to Boullé and Townsend (2023, Thm. 1).

Theorem 1 Let D1, D2 ⊆ R^d be domains with d ≥ 1, and let F : L²(D1) → L²(D2) be a HS operator with singular values σ1 ≥ σ2 ≥ ⋯ ≥ 0. Select a target rank k ≥ 2, an oversampling parameter p ≥ 2, and a D1 × (k + p) quasimatrix Ω such that each column is independently drawn from GP(0, K), where K : D1 × D1 → R is a continuous symmetric positive definite kernel with eigenvalues λ1 ≥ λ2 ≥ ⋯ > 0. Set Y = FΩ. Then

E ‖F − P_Y F‖ ≤ (1 + √(1/ξ_k) + e √((k + p)/(γ_k (p + 1)))) σ_{k+1}, (13)

where γ_k, ξ_k are defined in (11). Moreover, if p ≥ 4, then for any s, t ≥ 1, we have

‖F − P_Y F‖ ≤ (1 + s √(1/ξ_k) + t e √((k + p)/(γ_k (p + 1)))) σ_{k+1} (14)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}.

Proof See Appendix B.

We also present a simplified version of the probability bound in (14).

Corollary 2 Under the assumptions of Theorem 1 with k = p ≥ 4, we have

‖F − P_Y F‖ ≤ A_k σ_{k+1} (15)

with probability ≥ 1 − 3e^{−k}.

Proof We evaluate (14) by selecting s = √(2k) and t = e. We also use the inequalities 1 ≤ 1/γ_k ≤ 1/ξ_k, which follow from the fact that 0 < γ_k, ξ_k ≤ 1 and Tr(C11^{-1}) ≤ k ‖C11^{-1}‖.

For convenience, we abbreviate the error factors in (14) and (15) by

A_{k,p}(s, t) := 1 + s √(1/ξ_k) + t e √((k + p)/(γ_k (p + 1))), (16)

A_k := 1 + (√(2k) + √2 e²) √(1/ξ_k), (17)

as well as the analogue

A_{k,X×Y} := 1 + (√(2k) + √2 e²) √(1/ξ_{k,X×Y}) (18)

when considering a restricted domain X × Y ⊆ D_T × D_T.
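In the discrete setting, the mechanism behind these bounds can be sketched in a few lines. The following matrix analogue (our own illustrative code, with standard Gaussian probe vectors standing in for samples of GP(0, K)) shows that the error of the projected approximant P_Y F tracks the first neglected singular value σ_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(0)

def rsvd(F, k, p):
    """Matrix analogue of the randomized range finder: sketch Y = F @ Omega
    with k + p Gaussian probe vectors, then return the projection P_Y F."""
    Omega = rng.standard_normal((F.shape[1], k + p))   # random "input functions"
    Y = F @ Omega                                      # input-output products
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis of col(Y)
    return Q @ (Q.T @ F)                               # P_Y F

# Test operator with known singular values 2^0, 2^-1, 2^-2, ...
n, k = 200, 10
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 0.5 ** np.arange(n)
F = (U * sigma) @ V.T
err = np.linalg.norm(F - rsvd(F, k, p=k), 2)           # operator-norm error, O(sigma[k])
```

With k = p = 10 probes, the operator-norm error falls between σ_{k+p+1} (the best any rank-(k+p) approximant can do) and a modest multiple of σ_{k+1}, mirroring the role of σ_{k+1} in (14).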
A power scheme version of the rSVD, as described in Halko et al. (2011) and Rokhlin et al. (2009), allows one to drive down the multiplicative factor in the probabilistic estimate (14) at the expense of a logarithmic factor increase in the number of operator-function products. The idea of the power scheme is that repeated projections of Y = FΩ onto the column space of F improve the approximation.

Theorem 3 Under the same assumptions as Theorem 1, select an integer q ≥ 0. Let H = (FF*)^q F and set Z = HΩ. If p ≥ 4, then

‖F − P_Z F‖ ≤ A_{k,p}(s, t)^{1/(2q+1)} σ_{k+1} (19)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}, where A_{k,p}(s, t) is defined in (16).

Proof See Appendix B.

We summarize how the rSVD generates an approximant for the operator F in Algorithm 1, following Halko et al. (2011).

Algorithm 1 Approximating F via the rSVD
Input: HS operator F, GP covariance kernel K, target rank k ≥ 4, oversampling parameter p ≥ 2, exponent q ≥ 0, additional parameters s, t ≥ 1
Output: Approximation F̃ of F within relative error given by (19)
1: Draw a D_T × (k + p) random quasimatrix Ω with independent columns from GP(0, K)  ▷ See Section A.4 on how to draw such a quasimatrix
2: Construct Z = (FF*)^q FΩ by multiplying with F and F*
3: Compute the projector P_Z via a QR factorization of Z  ▷ See Trefethen (2010) for Householder triangularization on quasimatrices
4: Form F̃ = P_Z F

Remark 4 (Noisy training data) Here, we quantify the quality of the training data via the terms γ_k and ξ_k, defined in (11), which measure the deviation of the eigenspaces of the chosen covariance kernel K from the dominant right singular subspaces of F. One can also consider training data quality in the form of noise, namely, the presence of random additive perturbation errors arising from the computation of operator-function products, or from the collection of input-output data.
In fact, it was proven in the finite-dimensional setting that the rSVD is stable with respect to noise in the input-output data (Boullé et al., 2023, Supp. Info., §2.B), so we assume noiseless data for simplicity.

3.2 Approximating singular values via singular subspaces

Using the power scheme, we can not only improve the approximation of the operator F itself but also approximate its singular subspaces. Consequently, we can obtain excellent estimates for the singular values, which in conjunction with Theorem 1 tells us about the numerical rank of F. Notice that while Weyl's theorem (Stewart and Sun, 1990, Cor. 4.10) gives a straightforward bound on an operator's singular values, applying it meaningfully requires those singular values to decay quickly. Conversely, Theorem 5 says that we can approximate the singular values regardless of their decay rate at the cost of additional operator-function products. We emphasize that our argument assumes the existence of a gap between adjacent singular values, although we believe this assumption is not necessary in principle.

Theorem 5 Assume the hypotheses of Theorem 1, but with σ_k > σ_{k+1}. Select an integer q ≥ 0 and set δ_q as in (44). Let H := (FF*)^q F, Z := HΩ, and H̃ := P_Z H. Let Ũ_k be the D2 × k quasimatrix whose columns are the dominant k left singular vectors of H̃. If δ_q A_{k,p}(s, t) < 1, then

max_{1≤j≤k} |σ_j − σ_j(Ũ_k^* F)| ≤ (2 δ_q A_{k,p}(s, t)) / (1 − δ_q A_{k,p}(s, t)) ‖F‖ (20)

with probability ≥ 1 − 2t^{−p} − e^{−s²/2}.

Proof See Appendix B.

4. Recovering the Green's function

In this section, we construct a global approximant of the homogeneous Green's function of a 2-variable hyperbolic PDO L of the form (1a) in the domain D_T = [0, 1] × [0, 1]. Recall that L is assumed to be linear, strictly hyperbolic, and self-adjoint, with coefficients satisfying a, b, c ∈ C¹(D_T).
We suppose that one can generate $N$ forcing terms $\{f_j\}_{j=1}^N$ drawn from a Gaussian process with continuous symmetric positive definite covariance kernel $K : D_T \times D_T \to \mathbb{R}$ and use them to query the associated solution operator $F$ of $L$, as well as its adjoint $F^*$, to generate solutions $u_j = Ff_j$ or $u_j^* = F^*f_j$.6 We derive a bound on the number of input-output pairs $\{(f_j, u_j)\}_{j=1}^N$ needed to approximate the Green's function within a given error tolerance measured in the operator norm, with a high probability of success. Our result is summarized in the following theorem.

Theorem 6 Let $L$ be a hyperbolic PDO given in (1a). Let $G : D_T \times D_T \to \mathbb{R}$ be the homogeneous Green's function of $L$ in the domain $D_T$, and let $F : L^2(D_T) \to L^2(D_T)$ be the solution operator with kernel $G$, as in (2). Additionally, define $\Xi_\epsilon$ and $\Psi_\epsilon$ as in (31) and (32). For any sufficiently small $\epsilon > 0$ such that $\Psi_\epsilon > 0$, there exists a randomized algorithm that can construct an approximation $\tilde F$ of $F$ using $O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ input-output training pairs of $L$, such that $\|F - \tilde F\| = O(\Xi_\epsilon^{-1}\epsilon)\,\|F\|$ with probability $\ge 1 - O(e^{-1/\epsilon})$.

Remark 7 (Increased regularity of coefficients) The number of input-output training pairs can be reduced by assuming greater regularity of the coefficients $a, b, c$ of $L$. In particular, if $a, b, c \in C^r(D_T)$ for some $r \ge 1$, then Theorem 6 easily generalizes, via (40) and (9), so that only $O(\Psi_\epsilon^{-1}\epsilon^{-(6+1/r)}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ training pairs are needed to approximate $F$ with relative error $O(\Xi_\epsilon^{-1}\epsilon)$. If we assume $a, b, c$ are analytic, the $\epsilon^{-1/r}$ factor can be reduced even further to $\log(\epsilon^{-1})$.

6. We note that $F^*$ is the solution operator of the adjoint Cauchy problem of (1), meaning that $u^* = F^*f$ satisfies both the adjoint PDE $L^*u^* = f$ as well as homogeneous conditions at the boundary and at the terminal time $t = 1$. In other words, the adjoint problem is the backward-in-time version of (1). In particular, $F$ is self-adjoint if and only if $L$ is self-adjoint and the coefficients $a, b, c$ are time-independent.
A randomized algorithm that achieves Theorem 6 is summarized in Algorithm 2, and we dedicate the remainder of this section to constructing it. We fix a sufficiently small $0 < \epsilon < 1/2$, so that by (40) and (9), there exists $k_\epsilon := \lceil C\epsilon^{-1} \rceil \ge 4$, where $C$ is a constant depending only on $\max_{(D_T \times D_T)\setminus Z} |\nabla G|$ and $Z$ is the bundle of reflecting characteristic segments defined in (8). This ensures that $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$ holds whenever $(X \times Y) \cap Z = \emptyset$. We also set the parameters $s = \sqrt{2k_\epsilon}$, $t = e$, and $p = k_\epsilon$, so that the probabilities of failure for Theorems 1, 3, and 5 are bounded by $3e^{-k_\epsilon}$ (see Corollary 2).

Algorithm 2 Learning the solution operator via input-output data
Input: Black-box solver for $L$, GP covariance kernel $K$, tolerance $0 < \epsilon < 1/2$
Output: Approximation $\tilde F$ of the solution operator $F$ within relative error $\Xi_\epsilon^{-1}\epsilon$
1: Set target rank $k_\epsilon \ge 4$
2: while $\mathrm{vol}(D_{\mathrm{red}}(L)) > \epsilon^2$ do
3:   for $D \in \mathcal{D}_{\mathrm{red}}(L)$ do
4:     Partition $D$ into 16 subdomains $D_1, \ldots, D_{16}$   (Section 4.2)
5:     for $i = 1:16$ do
6:       Determine the numerical rank of $F$ on $D_i$ (Algorithm 3)   (Section 4.1)
7:       if $F$ is numerically low-rank on $D_i$ then
8:         Color $D_i$ green
9:         Approximate $F$ on $D_i$ using the rSVD (Algorithm 1)   (Section 4.3)
10:      else
11:        Color $D_i$ red and add $D_i$ to $\mathcal{D}_{\mathrm{red}}(L+1)$
12: Pad the approximant $\tilde F$ with zeros on the remaining red subdomains   (Section 4.4.1)

4.1 Rank detection scheme

Since the locations of the characteristics are unknown, we adaptively partition $D_T \times D_T$ in a hierarchical manner. The adaptive part relies on a rank detection scheme that detects the numerical rank of $F$ in a given subdomain $X \times Y \subseteq D_T \times D_T$ (see Algorithm 3). Similar to Boullé and Townsend (2023, 4.1.2), we generate a $Y \times 2k_\epsilon$ quasimatrix $\Omega$ such that each column is independently drawn from a Gaussian process defined on $Y$, given by $\mathrm{GP}(0, R_{Y\times Y}K)$, where $R_{Y\times Y}$ is the operator that restricts functions to the domain $Y \times Y$. Then we extend by zero each column of $\Omega$ from $L^2(Y)$ to $L^2(D_T)$ in the form $R_Y^*\Omega$.
Next, we select
$$q_{\epsilon,X\times Y} := \max\left\{0,\ \left\lceil \frac{\log\!\big(1 + A_{k_\epsilon,X\times Y} + 2A_{k_\epsilon,X\times Y}/\epsilon\big)}{\log\!\big(\sigma_{k_\epsilon,X\times Y}/\sigma_{k_\epsilon+1,X\times Y}\big)} \right\rceil - 1\right\}, \quad (21)$$
where $A_{k_\epsilon,X\times Y}$ is defined in (18), and set $H = (FF^*)^{q_{\epsilon,X\times Y}}F$. We approximate the range of $H$ via $Z = HR_Y^*\Omega$ and then compute the rank-$2k_\epsilon$ approximant $H'_{X\times Y} = P_{R_X Z}R_X H R_Y^*$. Since $H'_{X\times Y}$ has finite rank, we can compute its SVD and extract the quasimatrix $\tilde U_{k_\epsilon,X\times Y}$ whose columns are the dominant left $k_\epsilon$-singular vectors of $H'_{X\times Y}$, and finally compute the dominant singular values $\hat\sigma_{1,X\times Y} \ge \cdots \ge \hat\sigma_{k_\epsilon,X\times Y}$ of the operator $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$. In total, these computations require $k_\epsilon(8q_{\epsilon,X\times Y} + 5)$ input-output training pairs, and rely on querying the adjoint of $F$.

Algorithm 3 Detecting the numerical rank of $F$ in a subdomain
Input: Black-box solver for $L$, GP covariance kernel $K$, subdomain $X \times Y$, tolerance $0 < \epsilon < 1$
Output: Classification of the numerical rank of $F_{X\times Y}$
1: Set target rank $k_\epsilon \ge 4$
2: Set exponent $q_{\epsilon,X\times Y}$ as in (21)   ($q_{\epsilon,X\times Y} = O(\log(\epsilon^{-1}))$)
3: Construct $H'_{X\times Y}$ using the rSVD (Algorithm 1) with exponent parameter 1 applied to $H = (FF^*)^{q_{\epsilon,X\times Y}}F$
4: Compute an SVD of $H'_{X\times Y}$ to obtain the dominant left $k_\epsilon$-singular functions $\tilde U_{k_\epsilon,X\times Y}$
5: Construct $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$ using the black-box solver for $L$
6: Compute the singular values $\hat\sigma_{1,X\times Y} \ge \cdots \ge \hat\sigma_{k_\epsilon,X\times Y}$ of $\tilde U_{k_\epsilon,X\times Y}^* F_{X\times Y}$
7: if $\hat\sigma_{k_\epsilon,X\times Y} < 4\epsilon\,\hat\sigma_{1,X\times Y}$ then report $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$
8: else report $\mathrm{rank}_\epsilon(F_{X\times Y}) \ge k_\epsilon$

Our choice of $q_{\epsilon,X\times Y}$ in (21) is motivated as follows. Given $\delta_{q_{\epsilon,X\times Y}}$ as defined in (44), we have
$$\frac{2\delta_{q_{\epsilon,X\times Y}}A_{k_\epsilon,X\times Y}}{1 - \delta_{q_{\epsilon,X\times Y}}A_{k_\epsilon,X\times Y}} \le \epsilon,$$
which in conjunction with Theorem 5 implies that the inequalities $\hat\sigma_{k_\epsilon,X\times Y} \le \sigma_{k_\epsilon,X\times Y} + \epsilon\sigma_{1,X\times Y}$ and $(1-\epsilon)\sigma_{1,X\times Y} \le \hat\sigma_{1,X\times Y}$ hold with probability $\ge 1 - 3e^{-k_\epsilon}$. If $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$, then $\sigma_{k_\epsilon,X\times Y} \le \epsilon\sigma_{1,X\times Y}$, hence
$$\hat\sigma_{k_\epsilon,X\times Y} \le 2\epsilon\sigma_{1,X\times Y} \le \frac{2\epsilon}{1-\epsilon}\,\hat\sigma_{1,X\times Y}.$$
Since $\epsilon < 1/2$, the right-hand side is bounded by $4\epsilon\hat\sigma_{1,X\times Y}$. On the other hand, if $\hat\sigma_{k_\epsilon,X\times Y} \le 4\epsilon\hat\sigma_{1,X\times Y}$, then Theorem 5 again yields $\sigma_{k_\epsilon,X\times Y} \le \hat\sigma_{k_\epsilon,X\times Y} + \epsilon\sigma_{1,X\times Y} \le 5\epsilon\sigma_{1,X\times Y}$, hence $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$. In summary, we have shown that
$$\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon \implies \hat\sigma_{k_\epsilon,X\times Y} \le 4\epsilon\hat\sigma_{1,X\times Y} \implies \mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon \quad (22)$$
holds with probability $\ge 1 - 3e^{-k_\epsilon}$.
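The test in lines 3-8 of Algorithm 3 can be prototyped in a few lines on a discretized operator. In this sketch a Gaussian test matrix replaces the quasimatrix $\Omega$, a fixed exponent $q$ replaces the choice (21), and the restriction operators are omitted; only the threshold $\hat\sigma_{k} < 4\epsilon\,\hat\sigma_{1}$ from line 7 is taken from the paper, and all other names are illustrative.

```python
import numpy as np

def detect_low_rank(F, k, eps, q=2, rng=None):
    """Return True if sigma_hat_k < 4*eps*sigma_hat_1, the test that
    indicates rank_{5 eps}(F) < k (finite-dimensional sketch of
    Algorithm 3 with a fixed power-scheme exponent q)."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((F.shape[1], 2 * k))
    Z = np.linalg.matrix_power(F @ F.T, q) @ F @ Omega
    Q, _ = np.linalg.qr(Z)
    H = Q @ Q.T @ np.linalg.matrix_power(F @ F.T, q) @ F   # H' = P_Z H
    Uk = np.linalg.svd(H)[0][:, :k]          # dominant left k-singular vectors
    s = np.linalg.svd(Uk.T @ F, compute_uv=False)
    return s[k - 1] < 4 * eps * s[0]

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
full_rank = np.eye(30)
is_low = detect_low_rank(low_rank, k=3, eps=1e-3, rng=1)    # True
is_full_flagged = detect_low_rank(full_rank, k=3, eps=1e-3, rng=1)  # False
```

The rank-2 matrix passes the test for $k = 3$, while the identity, whose singular values do not decay at all, fails it for any small $\epsilon$.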
The preceding argument implies that we need only check the validity of the inequality
$$\hat\sigma_{k_\epsilon,X\times Y} < 4\epsilon\,\hat\sigma_{1,X\times Y} \quad (23)$$
to determine with high probability whether or not $F_{X\times Y}$ has numerical rank bounded by $k_\epsilon$, with low error tolerance. In particular, every subdomain that does not intersect $Z$ has $\mathrm{rank}_\epsilon(F_{X\times Y}) < k_\epsilon$, and the test passes on all such subdomains. Conversely, even if (23) happens to be satisfied for a subdomain that does intersect $Z$, then that subdomain still has low numerical rank with comparatively small error tolerance, namely, $5\epsilon$.

4.2 Hierarchical partition of the domain

We now describe the hierarchical partitioning of the domain $D_T \times D_T = [0,1]^4$, so that $F$ is numerically low-rank on many subdomains, while the subdomains on which $F$ is not low-rank have a small total volume. Since the probability of failure is very low, we describe the partition deterministically and relegate the discussion of the probabilistic aspects to Section 4.4.4. In particular, we assume throughout this section that (22) holds deterministically.

Our partitioning strategy proceeds as follows. At level $L = 0$, we consider only one subdomain, which is the entirety of $D_T \times D_T$. On each subdomain at the current partition level, we perform the rank detection procedure described in Section 4.1 to check if $F$ is numerically low-rank on the subdomain. If it is, we color the subdomain green and approximate $F$ restricted to the subdomain using the rSVD. Otherwise, we color the subdomain red. The next level of the partition then consists of partitioning each red subdomain into smaller subdomains and repeating the coloring process. Since the characteristics lie only in red subdomains, at each level of the partition we learn the location of the characteristics in finer detail (see Figure 4). The hierarchical partition of $D_T \times D_T = [0,1]^4$ for $n$ levels is defined recursively as follows.
At level $L = 0$, the domain $I_{1,1,1,1} := I_1 \times I_1 \times I_1 \times I_1 = [0,1]^4$ is the root of the partitioning tree. At a given level $0 \le L \le n-1$, a node $I_{j_1,j_2,j_3,j_4} := I_{j_1} \times I_{j_2} \times I_{j_3} \times I_{j_4}$ is colored red if (23) fails to hold for $X \times Y = I_{j_1,j_2,j_3,j_4}$. Otherwise, the node $I_{j_1,j_2,j_3,j_4}$ is colored green. Green nodes have no children, whereas red nodes have $2^4$ children defined as
$$\{I_{2j_1+k_1} \times I_{2j_2+k_2} \times I_{2j_3+k_3} \times I_{2j_4+k_4} : k_i \in \{0,1\},\ i = 1,2,3,4\}.$$
Here, if $I_j = [a, b]$ with $0 \le a < b \le 1$, then we define $I_{2j} = [a, \tfrac{a+b}{2}]$ and $I_{2j+1} = [\tfrac{a+b}{2}, b]$. Every green subdomain $X \times Y$ in levels $0 \le L \le n-1$ satisfies (23), hence $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$, by (22). At the end, we perform the test (23) for all remaining subdomains at level $n$. Those for which (23) holds are colored green, while the rest are colored red.

4.3 Local approximation of the Green's function

The hierarchical partition of $D_T \times D_T$ described above tells us where $F$ is numerically low-rank and where it is not. If a subdomain $X \times Y$ is colored green, that is, a subdomain on which (23) holds, then we can approximate $F$ on it using the rSVD. Since we have already generated $Z = HR_Y^*\Omega$ in the process of rank detection, where $H = (FF^*)^{q_{\epsilon,X\times Y}}F$ and $\Omega$ is a quasimatrix with columns drawn from $\mathrm{GP}(0, R_{Y\times Y}K)$, we only require an additional $2k_\epsilon$ input-output training pairs to construct $\tilde F_{X\times Y} := P_{R_X Z}R_X F R_Y^*$.7 By Theorem 3 and (22), we have
$$\|F_{X\times Y} - \tilde F_{X\times Y}\| \le A_{k_\epsilon,X\times Y}^{1/(2q_{\epsilon,X\times Y}+1)}\,\sigma_{k_\epsilon+1,X\times Y} \le 5\epsilon\,A_{k_\epsilon,X\times Y}^{1/(2q_{\epsilon,X\times Y}+1)}\,\|F_{X\times Y}\|. \quad (24)$$

7. In other words, we can skip lines 2-3 in Algorithm 1.

Moreover, this error estimate holds with probability 1, conditioned on the event that (22) is valid (see Section 4.4.4). On the other hand, if (23) does not hold on $X \times Y$, then we approximate $F$ by zero. Since the Green's function $G$ is continuous on the compact domain $D_T \times D_T$, except for jumps along the characteristics and the diagonal, there exists a constant $C$ such that
$$|G(x, t; x_0, t_0)| \le C\|F\|, \quad (x, t; x_0, t_0) \in D_T \times D_T.$$
Hence, for any $X, Y \subseteq D_T$, we have
$$\|G\|^2_{L^2(X\times Y)} \le C^2\,\mathrm{vol}(X\times Y)\,\|F\|^2 \quad (25)$$
and thus
$$\|F_{X\times Y}\| \le C\sqrt{\mathrm{vol}(X\times Y)}\,\|F\|, \quad (26)$$
where $C$ does not depend on $X \times Y$. Therefore, for sufficiently small domains $X \times Y$, we may approximate $F_{X\times Y}$ by zero. At the end of partitioning, the total volume of the subdomains on which (23) fails to hold is negligible (see Section 4.4.1).

4.4 Recovering the Green's function on the entire domain

We now show that we can recover $F$ on the entire domain $D_T \times D_T$.

4.4.1 Global approximation near characteristics

First, we ensure that the volume of all subdomains where $F$ is not low-rank is small enough that we can safely ignore that part of $F$ by approximating it by zero. As one increases the level of hierarchical partitioning, the volume of such subdomains shrinks to zero. Let $\mathcal{D}_{\mathrm{red}}(L)$ denote the collection of subdomains $X \times Y$ at level $L$ that are colored red, i.e., such that $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) \ge k_\epsilon$. Let $D_{\mathrm{red}}(L) := \bigcup_{D \in \mathcal{D}_{\mathrm{red}}(L)} D$. By (9), (22), and the definition of $k_\epsilon$, we have $X \times Y \in \mathcal{D}_{\mathrm{red}}(L)$ only if $(X \times Y) \cap Z \ne \emptyset$. Subdomains at level $L$ are 4-dimensional cubes with side length $2^{-L}$, so every $X \times Y \in \mathcal{D}_{\mathrm{red}}(L)$ is contained in the closed tubular neighborhood of $Z$ in $D_T \times D_T$ with radius $2^{-L+1}$, defined as
$$T(Z, 2^{-L+1}) := \{p \in D_T \times D_T : \mathrm{dist}(p, Z) \le 2^{-L+1}\}. \quad (27)$$
The volume of $T(Z, 2^{-L+1})$ depends only on the coefficient $a$ of (1a) and can be computed explicitly using a Weyl-type tube formula.8 In particular, we have
$$\mathrm{vol}(D_{\mathrm{red}}(L)) \le \mathrm{vol}(T(Z, 2^{-L+1})) = O(2^{-L}), \quad (28)$$
so that $\big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{red}}(L)} F_{X\times Y}\big\| \le O(2^{-L/2})\|F\|$ by (26). Therefore, by selecting $n_\epsilon := \lceil \log_2(1/\epsilon^2) \rceil$, we guarantee that
$$\Big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{red}}(n_\epsilon)} F_{X\times Y}\Big\| = O(\epsilon)\|F\| \quad (29)$$
and can safely approximate $F$ by zero on $D_{\mathrm{red}}(n_\epsilon)$.

8. For $a \in C^2$, the classical Weyl tube formula suffices, whereas for $a$ with less regularity one requires more complicated expressions; see, e.g., Federer (1959, Thm. 5.6), Gray (2003, Thm. 4.8), and Hug et al. (2004, Eq. (4.4)). We omit the details.

Figure 4: Slices through $D_T \times D_T$ at two levels of hierarchical partitioning.
A subdomain intersecting a characteristic is colored red and is further partitioned at the next level. Otherwise, it is colored green. The boundary of the tube of (27) is depicted by the thick dashed lines in (b); notice that every red subdomain is contained within the tube's interior.

4.4.2 Recovery rate

Input-output training pairs are required both for rank detection on subdomains and for approximation on the subdomains where $F$ is numerically low-rank. We count the training pairs needed for each task separately. Since rank detection occurs on every subdomain in the partitioning tree, let $\mathcal{D}_{\mathrm{tree}}(n_\epsilon)$ denote the entire collection of subdomains in the tree after $n_\epsilon$ levels of partitioning. For each subdomain colored red at a given level $L$, we split it into 16 smaller subdomains and perform the rank detection scheme on each, hence
$$\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = 16\sum_{L=0}^{n_\epsilon-1} \#\mathcal{D}_{\mathrm{red}}(L).$$
Subdomains at the $L$th level of partitioning have volume $16^{-L}$, so it follows from (28) that $\#\mathcal{D}_{\mathrm{red}}(L) \le 16^L\,\mathrm{vol}(D_{\mathrm{red}}(L)) = O(8^L)$. Therefore,
$$\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = O(8^{n_\epsilon}) = O(\epsilon^{-6}). \quad (30)$$
Finally, recall from Section 4.1 that performing the rank detection scheme on a subdomain $X \times Y$ requires $k_\epsilon(8q_{\epsilon,X\times Y} + 5)$ training pairs. We remove the dependence on $X \times Y$ by introducing
$$\Xi_\epsilon := \min\{\xi_{k_\epsilon,X\times Y} : X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\} \in (0, 1], \quad (31)$$
which represents the quality of the covariance kernel $K$, as well as
$$\Psi_\epsilon := \min\left\{\log\frac{\sigma_{k_\epsilon,X\times Y}}{\sigma_{k_\epsilon+1,X\times Y}} : X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\right\}, \quad (32)$$
which represents the influence of the singular value gaps in the rank detection scheme. It follows from (18) that $A_{k_\epsilon,X\times Y} = O(\Xi_\epsilon^{-1})$ and thus $q_{\epsilon,X\times Y} = O(\Psi_\epsilon^{-1}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ by (21). Therefore, the total number of training pairs needed for rank detection on every domain in $\mathcal{D}_{\mathrm{tree}}(n_\epsilon)$ is given by
$$N^{(1)}_\epsilon := \sum_{X\times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)} k_\epsilon(8q_{\epsilon,X\times Y} + 5) = O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1})).$$
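The geometry behind (28)-(30) can be checked in a small 2-D analogue, where the 16-way split of $[0,1]^4$ becomes the 4-way split of a quadtree and the characteristic surface $Z$ becomes the diagonal $x = y$. The rank test (23) is replaced here by a purely geometric predicate, so this sketch only tracks the red/green bookkeeping, not the operator learning; all names are illustrative.

```python
def partition(block, level, max_level, is_green, red, green):
    """Red/green refinement: green blocks stop; red blocks split into 4
    children (the 2-D analogue of the 16-way split of [0,1]^4)."""
    if is_green(block):
        green.append(block)
    elif level == max_level:
        red.append(block)
    else:
        x0, x1, y0, y1 = block
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        for child in [(x0, xm, y0, ym), (xm, x1, y0, ym),
                      (x0, xm, ym, y1), (xm, x1, ym, y1)]:
            partition(child, level + 1, max_level, is_green, red, green)

# Geometric stand-in for test (23): a block is "low rank" iff it misses
# the toy characteristic x = y.
def misses_diagonal(block):
    x0, x1, y0, y1 = block
    return x1 < y0 or y1 < x0

red, green = [], []
partition((0.0, 1.0, 0.0, 1.0), 0, 5, misses_diagonal, red, green)
red_area = sum((x1 - x0) * (y1 - y0) for x0, x1, y0, y1 in red)
# Red blocks hug the diagonal: their count grows like 2^L while their
# total area shrinks like 2^-L, mirroring (28).
```

After five levels, only the thin band of side-$2^{-5}$ blocks along the diagonal remains red, so the zero-padding of Section 4.4.1 discards a region of vanishing area.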
To count the training pairs needed for approximation on domains where $F$ is numerically low-rank, we first observe that
$$\#\mathcal{D}_{\mathrm{green}}(n_\epsilon) = \sum_{L=1}^{n_\epsilon}\big(16\,\#\mathcal{D}_{\mathrm{red}}(L-1) - \#\mathcal{D}_{\mathrm{red}}(L)\big) = O(8^{n_\epsilon}),$$
where $\mathcal{D}_{\mathrm{green}}(n_\epsilon)$ is the collection of subdomains $X \times Y$ in the hierarchical partition after $n_\epsilon$ levels colored green, i.e., those satisfying $\mathrm{rank}_{5\epsilon}(F_{X\times Y}) < k_\epsilon$. Recall from Section 4.3 that only $2k_\epsilon$ additional training pairs are required to generate an approximant on a subdomain via the rSVD after performing rank detection, so the number of training pairs needed for approximation is given by
$$N^{(2)}_\epsilon := \sum_{X\times Y \in \mathcal{D}_{\mathrm{green}}(n_\epsilon)} 2k_\epsilon = O(\epsilon^{-7}).$$
We conclude that the total number of training pairs required is
$$N_\epsilon := N^{(1)}_\epsilon + N^{(2)}_\epsilon = O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1})).$$

4.4.3 Global approximation error

Let $\tilde F$ be the approximant to $F$ given by stitching together, at the end of the hierarchical partition, the approximants $\tilde F_{X\times Y}$ generated on green subdomains $X \times Y$, and the zero operator on red subdomains. By (29), we have
$$\|F - \tilde F\| \le \Big\|\sum_{X\times Y \in \mathcal{D}_{\mathrm{green}}(n_\epsilon)}\big(F_{X\times Y} - \tilde F_{X\times Y}\big)\Big\| + O(\epsilon)\|F\|,$$
where every term in the summation satisfies (24). Recall that $A_{k_\epsilon,X\times Y} = O(\Xi_\epsilon^{-1})$, hence
$$\|F - \tilde F\| = O(\Xi_\epsilon^{-1}\epsilon)\|F\|$$
is our final error bound.

4.4.4 Recovery probability

Thus far, we have assumed that our rank detection is guaranteed to succeed on each domain, i.e., that (22) holds deterministically. Circumventing this assumption is simple: since the forcing terms in the input-output training data are independent, and since we perform rank detection on $\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon) = O(\epsilon^{-6})$ domains, we have
$$\mathbb{P}\big\{\text{(22) holds for every } X \times Y \in \mathcal{D}_{\mathrm{tree}}(n_\epsilon)\big\} \ge \big(1 - 3e^{-k_\epsilon}\big)^{\#\mathcal{D}_{\mathrm{tree}}(n_\epsilon)} = 1 - O(e^{-1/\epsilon}).$$
Notice that for each subdomain, the approximant is generated using the same test quasimatrices as used in the rank detection scheme, hence all error bounds for approximants, e.g., (24), are guaranteed to hold when conditioned on the validity of (22). Therefore, our global probability of failure is $O(e^{-1/\epsilon})$, and we have completed the proof of Theorem 6.

5.
Numerical Example

We implement Algorithms 1, 2, and 3 in MATLAB for the constant-coefficient wave operator $Lu = u_{tt} - 4u_{xx}$ with homogeneous initial and boundary conditions.9,10 Our implementation discretizes the domain blockwise at Gauss-Legendre quadrature nodes. Input-output data was generated using the known analytical expression for the true Green's function. Figure 5 depicts a slice of the approximate Green's function constructed by Algorithm 2 after eight levels of partitioning, as well as the pointwise error in comparison with the actual Green's function. We observe that the rank detection scheme successfully distinguishes between those subdomains that intersect characteristics and those that do not, hence the partition is fine in areas near characteristics and coarse in areas away from characteristics. Errors are concentrated near the characteristics, as expected, while low-rank subdomains are approximated very well.

Figure 6 shows the experimental rate of convergence as the amount of input-output training data increases. As predicted by Theorem 6, the error decays like $O(N^{-1/7})$, where $N$ is the number of training data pairs. The slow convergence rate, theoretically predicted and empirically observed, shows that large amounts of data may be needed to accurately approximate Green's functions associated with hyperbolic PDEs. We emphasize that our algorithm guards against the worst possible solution operators associated with a hyperbolic PDE; if one assumes in advance that the PDE has constant coefficients, then one can do dramatically better. Nevertheless, the numerical example demonstrates that our method is stable and robust to discretization errors.

6. Conclusions and Discussion

Our main result proves that one can rigorously recover the Green's function of a hyperbolic PDE in two variables using input-output training pairs with high probability.

9. MATLAB code is available at https://github.com/chriswang030/OperatorLearningforHPDEs.
10.
While our code has no difficulty handling more complicated hyperbolic PDEs, the Green's functions for equations with variable coefficients are generally unknown analytically, hence training data must be generated from, say, a numerical solver. Experiments with various popular techniques showed that such solvers were either (a) too slow to be practical, given the large amount of data needed, or (b) prone to numerical dissipation and/or dispersion that dominated the approximation error and resulted in a poor Green's function approximation. We discuss these issues in more detail in Section 6.4.

Figure 5: Approximate Green's function at the slice $(y, s) = (0.8, 0.1)$ for the wave operator $Lu = u_{tt} - 4u_{xx}$, overlaid with partition blocks (left), and the pointwise error at the slice compared with the exact Green's function (right).

Figure 6: Empirical rate of convergence of Algorithm 2.

We do so via three theoretical contributions: (a) we prove new error bounds for the rSVD for HS operators in the operator norm, such that the suboptimality factor $A_k$ of (17) tends to a constant as the target rank $k$ tends to $\infty$; (b) we develop a scheme to detect the numerical rank of an operator using input-output products, with high probability; and (c) we introduce an adaptive component to the hierarchical partition of the Green's function's domain to effectively capture the location of its discontinuities.

In the remainder of this section, we discuss possible relaxations of our assumptions, a potential improvement to our scheme via the peeling algorithm, the difficulties of extending to higher-dimensional hyperbolic PDOs, and some challenges arising from the availability of input-output data.

6.1 Relaxations of assumptions

The main assumptions required of our unknown PDO $L$ from (1a) are (a) homogeneous initial conditions, (b) self-adjointness, and (c) sufficient regularity of coefficients.
Strict hyperbolicity is also required, but we do not discuss it because it enforces realistic physics by forbidding waves from traveling infinitely fast or backward in time.

Inhomogeneous initial conditions. For use as a direct solver, the homogeneous Green's function can be applied to solve Cauchy problems with certain inhomogeneous initial conditions, as a consequence of Duhamel's principle (Courant and Hilbert, 1962, Ch. VI.15). For initial conditions $u(x, 0) = 0$ and $u_t(x, 0) = \psi(x)$, one simply needs to integrate $\psi$ against the Green's function on the line $\{t = 0\}$. If one also wants to solve initial value problems where $u(x, 0)$ is not identically zero, then the values and derivatives of the coefficients of $L$ must be given on $\{t = 0\}$ (Courant and Hilbert, 1962, Ch. V.5).

We demonstrate the practical implementation of Duhamel's principle in Figure 7 by replacing the given inhomogeneous initial condition with an appropriate approximation of the Dirac delta on a short time strip, e.g., $f_\delta(x, t) := \delta^{-1}\chi_{[0,\delta]}(t)\,u_t(x, 0)$, and setting our approximate solution to be
$$u(x, t) = \int_{D_T} \tilde G(x, t; y, s)\,f_\delta(y, s)\,dy\,ds, \quad (x, t) \in D_T.$$
The inaccuracies seen in Figure 7 are due primarily to the use of an approximate Green's function. Because of the large amount of training data needed here, we are unable to construct a Green's function with sufficiently high accuracy with our available software and hardware.

One may be tempted to extrapolate the learned solution operator beyond the given time horizon by taking the value of the computed solution at the end of the first time interval to be the initial condition for the next time interval. However, without additional information about the coefficients of $L$, one has no information whatsoever about the behavior of the Green's function beyond the time horizon on which the input-output training data is given.
If it is known that the wave speed, for instance, is independent of time, then such extrapolations beyond the initial domain may be feasible.

Non-self-adjoint PDOs. While self-adjointness is not strictly necessary, more training data may be needed to learn the solution operator of a non-self-adjoint PDO to the same accuracy. This is because the coefficients of the adjoint PDO may have less regularity than those of the original operator. For instance, the wave operator with variable damping
$$Lu = u_{tt} - u_{xx} + k(x, t)u_t$$
has adjoint
$$L^*v = v_{tt} - v_{xx} - k(x, t)v_t - k_t(x, t)v.$$
If $k(x, t)$ is not smooth, then one of the coefficients of $L^*$ has one less degree of regularity. Consequently, the corresponding homogeneous Green's function also has one less degree of regularity in the region of its support and thus requires more training data to recover (see Section A.5). Whether or not our hyperbolic PDO is self-adjoint, in general one must have access to input-output data from both the Cauchy problem (1) as well as the corresponding adjoint problem (see footnote 6). Indeed, one cannot expect to recover an operator without the action of the adjoint (Boullé et al., 2024; Halikias and Townsend, 2024; Otto et al., 2023). In the special case where the hyperbolic PDO is self-adjoint and has time-independent coefficients, the homogeneous Cauchy problem is also self-adjoint.

Figure 7: Approximate solution (left) and exact solution (right) to the wave operator $Lu = u_{tt} - 4u_{xx}$ with homogeneous boundary conditions but inhomogeneous, discontinuous initial conditions given by $u(x, 0) = 0$ and $u_t(x, 0) = \chi_{[0.2,0.3]}(x)$. The approximation was computed by multiplying the approximate Green's function $\tilde G$ derived from Algorithm 2 against an approximation of the Dirac delta on $\{t = 0\}$.

Less regular coefficients. We assume throughout the paper that the coefficients of $L$ are at least of class $C^1$.
It is certainly possible that our method extends to coefficients of lower regularity, but the Green's function theory for hyperbolic PDEs with low-regularity coefficients is more subtle and may require one to work instead with parametrices. For example, Smith (1998) describes a parametrix for hyperbolic PDEs with Lipschitz coefficients whose spatial derivatives are also Lipschitz. For equations with even lower regularity, it is not always the case that the Cauchy problem is well-posed, and the regularity in the time and space variables must be considered separately (see, e.g., Cicognani, 1999; Cicognani and Lorenz, 2018; Hurd and Sattinger, 1968; Reissig, 2003). It is unclear in such cases whether or not a Green's function approach, such as the one taken in this paper, is possible.

6.2 Peeling algorithm

Our algorithm requires performing rank detection and the rSVD on $O(\epsilon^{-6})$ subdomains. These operations require $O(\Psi_\epsilon^{-1}\epsilon^{-1}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ input-output training pairs for each subdomain; considering the subdomains one by one leads to the number of training pairs described in Theorem 6. However, it was recently shown in Boullé et al. (2023) that one can dramatically reduce the required number of input-output pairs by considering multiple subdomains simultaneously via the peeling algorithm (Lin et al., 2011). The argument relies in part on bounding the chromatic number of a graph associated with the hierarchical partition (Levitt and Martinsson, 2024). In our setting, the chromatic number at a given level $L$ of the partition, denoted by $\chi(L)$, is approximately the number of subdomains of minimal volume, that is, cubes with side length $2^{-L}$, seen in a 2-dimensional slice of the domain $D_T \times D_T$. For instance, the chromatic number associated with the partition at level $L = 3$ seen in Figure 4b is 52.
Heuristically, the chromatic number is maximized when the characteristics are straight lines of some minimal slope $0 < m \le 1$, in which case $\chi(L) \approx 2^L m^{-1}$. If the arguments in Boullé et al. (2023) can be repeated for hyperbolic PDOs, then the required number of input-output training pairs in Theorem 6 improves from $O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$ to $O(\Psi_\epsilon^{-1}\epsilon^{-3}\,\mathrm{polylog}(\Xi_\epsilon^{-1}\epsilon^{-1}))$. Moreover, by Remark 7, this term continues to improve as one assumes greater regularity of the coefficients $a, b, c$ of (1a). The challenge, which is addressed in Boullé et al. (2023) for elliptic PDEs, is that the peeling algorithm of Lin et al. (2011) is not proven to be stable. Hence, careful analysis is needed to show that errors from low-rank approximation and rank detection do not accumulate.

Notice that, unlike in the elliptic case, the use of peeling cannot entirely reduce the $\epsilon^{-6}$ factor to $\mathrm{polylog}(\epsilon^{-1})$. This is because the singularities of the Green's functions of hyperbolic PDEs in $d$ variables occupy a $(2d-1)$-dimensional hypersurface in the $2d$-dimensional domain, due to the characteristics. In contrast, the singularities of the Green's functions of elliptic PDEs lie only on the diagonal, which is $d$-dimensional. In other words, the Green's functions of hyperbolic PDEs have an additional $d-1$ dimensions' worth of singularities, and the chromatic number $\chi(L)$ thus grows exponentially with $L$. It is therefore doubtful that one can recover the Green's functions of hyperbolic PDEs at a sublinear rate without a significant development in the approximation theory of hierarchical matrices.

6.3 Extending to higher dimensions

We learn the Green's function of a hyperbolic PDE in two variables by detecting the location of its characteristics, which feature as jump discontinuities of the Green's function. A similar scheme should also work for detecting the singularities of Green's functions for hyperbolic PDEs in higher dimensions.
The main difficulty with higher-dimensional Green's functions is that they are not necessarily square-integrable. For instance, the Green's function of the wave equation in three spatial dimensions is given by
$$G(x, t; y, s) = \frac{\delta(t - s - \|x - y\|)}{4\pi\|x - y\|}, \quad (x, t), (y, s) \in \mathbb{R}^3 \times [0, \infty),$$
where $\delta$ is the Dirac delta. Here, $G$ has singularities of a distributional nature on the characteristics, which form a 7-dimensional hypersurface in the 8-dimensional domain of the Green's function. In this case, the theory of the rSVD for HS operators no longer applies, since HS integral operators must have a square-integrable kernel. This difficulty has been circumvented to a degree in the case of parabolic PDEs, whose Green's functions are integrable but not square-integrable on the diagonal (Boullé et al., 2022b). Nevertheless, it is not immediately clear that the analysis carries over when the singularities are distributional.

Another factor that may play a role in higher dimensions is the effect of Huygens' principle (Courant and Hilbert, 1962; Evans, 2010). For the standard wave equation in $d$ spatial dimensions, Huygens' principle implies that the support of the Green's function lies exclusively on the boundary of the future light cone when $d$ is odd. In contrast, the support is the entire future light cone when $d$ is even. For general hyperbolic PDOs, this effect does not correspond so nicely with the parity of the spatial dimension (see, e.g., Günther, 1991). Determining the impact of Huygens' principle may be an essential aspect of hyperbolic PDE learning in higher dimensions.

6.4 Availability of data

In practice, one often does not have access to exact solutions of the PDE in question. Instead, input-output training data may be generated from simulations driven by a numerical PDE solver. In such cases, the training data inherits whichever discretization errors the solver itself exhibits.
Figure 8 displays the output of our algorithm for training data coming from two different numerical solvers. The first is a first-order accurate upwind method (Banks and Henshaw, 2012), while the second is the simple second-order centered-in-time, centered-in-space finite difference method; both advance by explicit time-steps on a fixed-resolution uniform grid. Although much more advanced solvers exist, we chose these methods for their speed and simplicity, given the large number of numerical solves required by our algorithm. Due to the numerical dissipation of the first-order method, we observe that the approximate Green's function is unable to locate the characteristics precisely, since they get smoothed out. On the other hand, the numerical dispersion exhibited by the second-order method fabricates artificial oscillations both inside and outside the light cones; these oscillations mislead our rank detection scheme into believing the Green's function is high-rank in regions where it is not, resulting in excessive partitioning within the light cones.

As such, the availability and accuracy of input-output data pose a major challenge for operator learning for hyperbolic PDEs that we encourage future research, both theoretical and practical, to address. While elliptic and parabolic PDEs significantly smooth out errors from the training data, those errors are structurally propagated throughout the domain in the hyperbolic case and must be dealt with in a careful manner. One possible approach is to use our algorithm to quickly construct a low-accuracy approximation of the Green's function to get a rough sense of where the characteristics lie. Then one can use this information to update the numerical solver generating the input-output data to better track the propagation of singularities and minimize dispersion and dissipation errors.
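The dissipation of a first-order scheme is easy to reproduce in isolation. The sketch below advects a narrow Gaussian with a textbook first-order upwind step for $u_t + u_x = 0$ (this is not the solver of Banks and Henshaw (2012); the grid and CFL number are arbitrary choices) and measures how much the peak is smeared, which is precisely the smoothing that blurs the characteristics in the training data.

```python
import numpy as np

# 1-D advection u_t + u_x = 0 on a periodic grid; the exact solution is a
# pure shift, so any loss of peak height is numerical dissipation.
n, steps, cfl = 200, 100, 0.5
x = np.arange(n) / n
u0 = np.exp(-200.0 * (x - 0.3) ** 2)       # narrow Gaussian pulse
u = u0.copy()
for _ in range(steps):
    u = u - cfl * (u - np.roll(u, 1))      # first-order upwind step
peak_ratio = u.max() / u0.max()            # < 1: the pulse has been smeared
```

On this grid the peak loses roughly 10% of its height after only 100 steps, even though the scheme exactly conserves the total mass; a Green's function reconstructed from such data inherits this smoothing along the characteristics.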
Alternating between using the Green's function approximation to improve the solver and vice versa may be an iterative solution to the problems presented above.

Figure 8: Approximate Green's function at the slice $(y, s) = (0.8, 0.1)$ using input-output data generated by time-stepping numerical solvers: the first-order upwind method of Banks and Henshaw (2012) (left) and the second-order centered difference discretization (right).

Acknowledgments

The work of C. W. was supported by the NSF GRFP under grant DGE-2139899, and by the NSF under grant DMS-1929284 while the author was in residence at the Institute for Computational and Experimental Research in Mathematics in Providence, RI, during the Numerical PDEs: Analysis, Algorithms, and Data Challenges semester program. The work of A. T. was supported by the Office of Naval Research (ONR) under grant N00014-23-1-2729, the NSF under grant DMS-2045646, and a Weiss Junior Fellowship Award. Many thanks to Diana Halikias, Sungwoo Jeong, and Seick Kim for helpful discussions and suggestions.

Appendix A. Background Material

We review the relevant background material, comprising HS operators, quasimatrices, orthogonal projectors, Gaussian processes, and approximation using Legendre series (Halko et al., 2011; Hsing and Eubank, 2015; Townsend and Trefethen, 2015; Trefethen, 2019; Vershynin, 2018).

A.1 Hilbert-Schmidt operators

Hilbert-Schmidt (HS) operators acting on $L^2$ functions are infinite-dimensional analogues of matrices acting on vectors. For $d \ge 1$, let $D_1, D_2 \subseteq \mathbb{R}^d$ be domains and let $L^2(D_i)$ be the space of square-integrable functions defined on $D_i$, for $i = 1, 2$, equipped with the inner product $\langle f, g \rangle := \int_{D_i} f(x)g(x)\,dx$.
Since $L^2(D_i)$ are separable Hilbert spaces, for any compact linear operator $F : L^2(D_1) \to L^2(D_2)$ the spectral theorem provides complete orthonormal bases $\{v_j\}_{j=1}^\infty$ and $\{e_j\}_{j=1}^\infty$ for $L^2(D_1)$ and $L^2(D_2)$, respectively, as well as a sequence of positive numbers $\sigma_1 \geq \sigma_2 \geq \cdots > 0$ such that
$$Ff = \sum_{j=1}^\infty \sigma_j \langle v_j, f \rangle e_j, \qquad f \in L^2(D_1), \qquad (33)$$
where equality and convergence hold in the $L^2$ sense. This representation is called the singular value expansion (SVE) of $F$; we call $\{e_j\}$ and $\{v_j\}$ the left and right singular functions of $F$, respectively, and we call $\{\sigma_j\}$ the singular values of $F$. The subspaces spanned by the first $k$ left or right singular functions, for some positive integer $k$, are called dominant left and right $k$-singular subspaces of $F$, respectively. We say that $F$ is an HS operator if its HS norm is finite, i.e.,
$$\|F\|_{HS}^2 := \sum_{j=1}^\infty \| F v_j \|_{L^2(D_2)}^2 < \infty.$$
Since $\|F\|_{HS}^2 = \sum_{j=1}^\infty \sigma_j^2$, the HS norm is an infinite-dimensional analogue of the Frobenius norm for matrices. We also have $\|F\| \leq \|F\|_{HS}$. A special type of HS operator is an HS integral operator, which requires that $F$ have an integral representation given by
$$(Ff)(x) = \int_{D_1} G(x, y) f(y)\, dy, \qquad f \in L^2(D_1), \quad x \in D_2, \qquad (34)$$
for some kernel $G \in L^2(D_2 \times D_1)$. In this case, $\|F\|_{HS} = \|G\|_{L^2(D_2 \times D_1)}$. The adjoint of $F$ is the unique compact linear operator $F^* : L^2(D_2) \to L^2(D_1)$ satisfying $(F^*)^* = F$, $\|F\| = \|F^*\|$, and $\langle F f, h \rangle_{L^2(D_2)} = \langle f, F^* h \rangle_{L^2(D_1)}$ for every $f \in L^2(D_1)$ and $h \in L^2(D_2)$. The adjoint $F^*$ has an SVE given by
$$F^* g = \sum_{j=1}^\infty \sigma_j \langle e_j, g \rangle v_j, \qquad g \in L^2(D_2),$$
and for any $f \in L^2(D_1)$ and $h \in L^2(D_2)$ satisfies
$$(F^* F) f = \sum_{j=1}^\infty \sigma_j^2 \langle v_j, f \rangle v_j, \qquad (F F^*) h = \sum_{j=1}^\infty \sigma_j^2 \langle e_j, h \rangle e_j. \qquad (35)$$
By the Eckart–Young–Mirsky theorem (Stewart and Sun, 1990, Thm. 4.18), truncating the SVE of an HS operator after $k$ terms yields a best rank-$k$ approximation in the HS and operator norms. If $F_k : L^2(D_1) \to L^2(D_2)$ is defined by the truncation $F_k f := \sum_{j=1}^k \sigma_j \langle v_j, f \rangle e_j$, then
$$\min_{\mathrm{rank}(\widetilde F) = k} \| F - \widetilde F \|_{HS} = \| F - F_k \|_{HS} = \Big( \sum_{j=k+1}^\infty \sigma_j^2 \Big)^{1/2}, \qquad (36)$$
as well as
$$\min_{\mathrm{rank}(\widetilde F) = k} \| F - \widetilde F \| = \| F - F_k \| = \sigma_{k+1}.$$
(37)

Finally, we define the numerical rank of $F$ with error tolerance $\epsilon > 0$, denoted by $\mathrm{rank}_\epsilon(F)$, to be the smallest integer $k$ such that $\|F - F_k\| < \epsilon \|F\|$. In particular, (37) implies that $\mathrm{rank}_\epsilon(F) \leq k$ holds if $\sigma_{k+1} \leq \epsilon \sigma_1$.

A.2 Quasimatrices

Quasimatrices are infinite-dimensional analogues of tall-skinny matrices, and we use them principally to simplify notation and emphasize the analogy with matrices. A $D_1 \times k$ quasimatrix $\Omega$ is represented by $k$ columns, where each column is a function in $L^2(D_1)$, via
$$\Omega = \begin{bmatrix} \omega_1 & \cdots & \omega_k \end{bmatrix}, \qquad \omega_j \in L^2(D_1).$$
One can similarly define $\infty \times k$ matrices, in which each of the $k$ columns is a sequence in $\ell^2$, and likewise $k \times D_1$ quasimatrices. Many operations for rectangular matrices generalize to quasimatrices. If $F : L^2(D_1) \to L^2(D_2)$ is an HS operator, then $F\Omega$ denotes the quasimatrix given by applying $F$ to each column of $\Omega$, i.e., $F\Omega = \begin{bmatrix} F\omega_1 & \cdots & F\omega_k \end{bmatrix}$. Quasimatrices have a transpose, denoted $\Omega^*$, in the sense that
$$\Omega^* \Omega = \begin{bmatrix} \langle \omega_1, \omega_1 \rangle & \cdots & \langle \omega_1, \omega_k \rangle \\ \vdots & \ddots & \vdots \\ \langle \omega_k, \omega_1 \rangle & \cdots & \langle \omega_k, \omega_k \rangle \end{bmatrix}, \qquad (\Omega \Omega^*)(x, y) = \sum_{j=1}^k \omega_j(x) \omega_j(y).$$
Here, $\Omega^* \Omega$ is a $k \times k$ matrix of real numbers, while $\Omega \Omega^*$ is a function in $L^2(D_1 \times D_1)$ and can be thought of as an integral operator by taking $(\Omega \Omega^*)(x, y)$ to be its kernel. Quasimatrices can be thought of as HS operators on the relevant Hilbert spaces if, for instance, their singular values are square summable.

A.3 Orthogonal projectors

In analogy with the finite-dimensional setting, an orthogonal projector is a bounded self-adjoint linear operator $P : L^2(D_2) \to L^2(D_2)$ that satisfies $P^2 = P$. Orthogonal projectors are completely determined by their range: for any closed subspace $M \subseteq L^2(D_2)$, there is a unique orthogonal projector $P_M$ whose range is $M$. This operator is compact if and only if $M$ is finite-dimensional. For any $D_2 \times k$ quasimatrix $\Omega$, the unique orthogonal projector associated with the column space of $\Omega$ is denoted by
$$P_\Omega := \Omega (\Omega^* \Omega)^\dagger \Omega^* : L^2(D_2) \to L^2(D_2),$$
as $P_\Omega F : L^2(D_1) \to L^2(D_2)$ captures the orthogonal projection of the range of $F$ onto the finite-dimensional column space of $\Omega$.
If $\Omega$ has full column rank, then $\Omega^* \Omega$ is an invertible $k \times k$ matrix, and $(\Omega^* \Omega)^\dagger = (\Omega^* \Omega)^{-1}$. We rely on the following inequality, analogous to that of Halko et al. (2011, Prop. 8.6), for the operator norm of orthogonal projectors.

Proposition 8 Let $D_1, D_2 \subseteq \mathbb{R}^d$ be domains. Let $P : L^2(D_2) \to L^2(D_2)$ be an orthogonal projector, and let $F : L^2(D_1) \to L^2(D_2)$ be an HS operator. Then
$$\| P F \| \leq \| P (F F^*)^q F \|^{1/(2q+1)} \qquad (38)$$
holds for every $q > 0$.

Proof The proof closely follows that of Halko et al. (2011, Prop. 8.6), making use of the Courant–Fischer minimax principle (Hsing and Eubank, 2015, Thm. 4.2.7).

A.4 Gaussian processes

A Gaussian process (GP) is an infinite-dimensional analogue of a multivariate Gaussian distribution, and a function drawn from a GP is analogous to a random vector drawn from a Gaussian distribution. For a domain $D \subseteq \mathbb{R}^d$ and a continuous symmetric positive semidefinite kernel $K : D \times D \to \mathbb{R}$, we define a GP with mean $\mu : D \to \mathbb{R}$ and covariance $K$ to be a stochastic process $\{X_t : t \in D\}$ such that the random vector $(X_{t_1}, \ldots, X_{t_n})$, for every finite set of indices $t_1, \ldots, t_n \in D$, has a multivariate Gaussian distribution with mean $(\mu(t_1), \ldots, \mu(t_n))$ and covariance matrix $K_{ij} = K(t_i, t_j)$, $1 \leq i, j \leq n$. We denote such a GP by $\mathcal{GP}(\mu, K)$. Functions drawn from a GP with continuous kernel are almost surely in $L^2$. Additionally, since $K$ is positive semidefinite, it has non-negative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ with corresponding eigenfunctions $\{r_j\}_{j=1}^\infty$ that form an orthonormal basis for $L^2(D)$ (Hsing and Eubank, 2015, Thm. 4.6.5). The Karhunen–Loève expansion provides a method for sampling functions from a GP, as $f \sim \mathcal{GP}(0, K)$ has the expansion $f(t) = \sum_{j=1}^\infty \sqrt{\lambda_j}\, c_j r_j(t)$, converging in $L^2$ uniformly in $t \in D$, where $\{c_j\}_{j=1}^\infty$ are independent standard Gaussian random variables (Hsing and Eubank, 2015, Thm. 7.3.5). Additionally, we define the trace of $K$ by $\mathrm{Tr}(K) := \sum_{j=1}^\infty \lambda_j < \infty$.
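A discrete analogue of the Karhunen–Loève expansion is straightforward to implement. The following sketch (the squared-exponential kernel, grid, and sample count are illustrative choices, not ones used in the paper) samples from $\mathcal{GP}(0, K)$ on a grid and checks that the empirical covariance of many draws recovers $K$:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
# An example covariance kernel (squared exponential); any continuous
# symmetric positive semidefinite K would do.
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.1)

# Discrete Karhunen-Loeve: eigenpairs of K play the role of (lambda_j, r_j).
lam, R = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)  # remove tiny negative eigenvalues from rounding

def sample_gp(num):
    # f = sum_j sqrt(lambda_j) c_j r_j with i.i.d. standard normal c_j
    c = rng.standard_normal((len(t), num))
    return R @ (np.sqrt(lam)[:, None] * c)

samples = sample_gp(20000)                      # each column is one draw from GP(0, K) on the grid
K_emp = samples @ samples.T / samples.shape[1]  # empirical covariance approaches K
```

On this grid $K(t, t) = 1$, so the discrete trace equals the number of grid points, and the empirical covariance converges to $K$ at the usual Monte Carlo rate.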
A.5 Low-rank approximation in the $L^2$ norm

In recovering a low-rank approximation of the kernel of an HS integral operator, we need explicit estimates on the decay of the kernel's singular values. We can obtain them by analyzing the kernel expressed in a Legendre series. Let $\widetilde P_n(x)$ denote the shifted Legendre polynomial of degree $n$ on the domain $[0, 1]$ (Legendre polynomials are defined in NIST DLMF, Table 18.3.1; the shifted Legendre polynomials are simply their composition with the transformation $x \mapsto 2x - 1$; see NIST DLMF, Ch. 18). The shifted Legendre polynomials form an orthogonal basis of $L^2([0,1])$, and if a function $f \in L^2([0,1])$ is uniformly Hölder continuous with parameter greater than $1/2$, then it has a uniformly convergent Legendre expansion $f(x) = \sum_{n=0}^\infty a_n \widetilde P_n(x)$. The truncated Legendre series $f_k(x) := \sum_{n=0}^k a_n \widetilde P_n(x)$, for some $k \geq 0$, is the degree-$k$ $L^2$ projection of $f$ onto the space of polynomials of degree $k$. If $f$ is $r$-times differentiable and $f^{(r)}$ has bounded total variation $V$, then for any $k > r + 1$, we have
$$\| f - f_{k-1} \|_{L^2([0,1])} \leq \frac{V}{\pi (r + 1/2) (k - r - 1)^{r + 1/2}} = O(k^{-(r+1/2)}), \qquad (39)$$
where $f_{k-1}$ is the degree-$(k-1)$ $L^2$ projection of $f$ (Wang, 2023, Thm. 3.5). Similar results hold for a multivariate function $f(x_1, \ldots, x_n)$ if the conditions above hold uniformly over every 1-dimensional slice of $f$, which holds if $f$ is $r$-times differentiable and $D^r f$ has total variation bounded by $V < \infty$ uniformly over all order-$r$ partial derivatives $D^r$ (Shi and Townsend, 2021; Townsend, 2014). In particular, for any positive integers $m, n$, we may consider $f : [0,1]^{m+n} \to \mathbb{R}$ and its degree-$(k-1)$ $L^2$ projection $f_{k-1}$ as the kernels of HS integral operators $F, \widetilde F : L^2([0,1]^m) \to L^2([0,1]^n)$, respectively. Since $f_{k-1}$ is a polynomial of degree $k - 1$, $\widetilde F$ is a rank-$k$ operator. By the Eckart–Young–Mirsky theorem,
$$\Big( \sum_{j=k+1}^\infty \sigma_j^2 \Big)^{1/2} \leq \| F - \widetilde F \|_{HS} = \| f - f_{k-1} \|_{L^2} \leq \frac{V}{\pi (r + 1/2) (k - r - 1)^{r + 1/2}},$$
where $\sigma_j$ denotes the $j$th singular value of $F$.
Combining this with the observation that $k \sigma_{2k}^2 \leq \sum_{j=k+1}^{2k} \sigma_j^2 \leq \sum_{j=k+1}^\infty \sigma_j^2$, we obtain an explicit decay rate for the singular values of $F$, i.e.,
$$\sigma_k \leq \frac{2^{r+3/2}\, V}{\sqrt{k}\, (k - 2r - 2)^{r + 1/2}} = O(k^{-(r+1)}) \qquad (40)$$
for $k > 2(r + 1)$.

Appendix B. Randomized Singular Value Decomposition

This appendix contains proofs of results relating to the rSVD in Section 3.

B.1 Proofs for Section 3.1

In this section, we prove Theorems 1 and 3, which bound the error of an approximant generated by the rSVD with respect to the operator norm. The proof is similar to that of Halko et al. (2011, Thms. 10.6 and 10.8), generalized to the infinite-dimensional setting with non-standard Gaussian vectors. We first introduce notation. For separable Hilbert spaces $H, K$, let $\mathrm{HS}(H, K)$ be the Hilbert space of HS operators $H \to K$. Given a bounded, self-adjoint linear operator $C : K \to K$, we define a quadratic form on $K$ by
$$\langle x, y \rangle_C := \langle x, C y \rangle_K, \qquad x, y \in K,$$
as well as a quadratic form on $\mathrm{HS}(H, K)$ by $\langle A, B \rangle_C := \langle A, C B \rangle_{HS}$. If $C$ is positive definite, then both quadratic forms are inner products on $K$ and $\mathrm{HS}(H, K)$, respectively. We then let $\|x\|_C^2 := \langle x, x \rangle_C$ and $\|A\|_C^2 := \langle A, A \rangle_C$ denote the induced norms. We remark that $\|x\|_C = \|C^{1/2} x\|_K$ and $\|A\|_{I \to C} = \|C^{1/2} A\|$, where $\|A\|_{I \to C}$ denotes the operator norm of $A$ when viewed as an operator $(H, \langle \cdot, \cdot \rangle_H) \to (K, \langle \cdot, \cdot \rangle_C)$. Finally, we let $H \otimes K$ denote the tensor product of Hilbert spaces, which embeds into $\mathrm{HS}(H, K)$ by associating with each $x \otimes y \in H \otimes K$ the operator $u \mapsto \langle x, u \rangle_H\, y$.

Lemma 9 Let $H, K$ be Hilbert spaces. For any $u, w \in H$ and $v, z \in K$, we have
$$\| u \otimes v - w \otimes z \|_{HS}^2 \leq \max(\|u\|^2, \|w\|^2)\, \|v - z\|^2 + \max(\|v\|^2, \|z\|^2)\, \|u - w\|^2.$$

Proof Observe that $\langle u \otimes v, w \otimes z \rangle_{HS} = \langle u, w \rangle \langle v, z \rangle$. Then
$$\| u \otimes v - w \otimes z \|^2 = \| u \otimes v \|^2 + \| w \otimes z \|^2 - 2 \langle u \otimes v, w \otimes z \rangle = \|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 - 2 \langle u, w \rangle \langle v, z \rangle.$$
Let $A = \max(\|v\|, \|z\|)$ and $B = \max(\|u\|, \|w\|)$. Then
$$A^2 \|u - w\|^2 + B^2 \|v - z\|^2 = A^2 (\|u\|^2 + \|w\|^2) + B^2 (\|v\|^2 + \|z\|^2) - 2 (A^2 \langle u, w \rangle + B^2 \langle v, z \rangle).$$
Observe that $(B^2 - \langle u, w \rangle)(A^2 - \langle v, z \rangle) \geq 0$, hence
$$\| u \otimes v - w \otimes z \|_{HS}^2 \leq \|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 + 2 A^2 B^2 - 2 (A^2 \langle u, w \rangle + B^2 \langle v, z \rangle).$$
It is straightforward to verify that
$$\|u\|^2 \|v\|^2 + \|w\|^2 \|z\|^2 + 2 A^2 B^2 \leq A^2 (\|u\|^2 + \|w\|^2) + B^2 (\|v\|^2 + \|z\|^2),$$
which completes the proof.

We extend a result of Halko et al. (2011, Prop. 10.1) to the infinite-dimensional setting, following Gordon (1985, 1988). To simplify notation, we use $\|\cdot\|$ to denote either the operator, $\ell^2$, or $L^2(D)$ norm depending on context. We also use $N(0, I \otimes C)$ to denote the distribution of a mean zero random matrix with independent columns drawn from $N(0, C)$.

Proposition 10 Let $G$ be an $\infty \times n$ random matrix with distribution $N(0, I_n \otimes C)$, where $C$ is a symmetric positive definite covariance operator with $\mathrm{Tr}(C) < \infty$. Let $S$ be a $D \times \infty$ quasimatrix and let $T$ be an $n \times \infty$ matrix, both deterministic with $\|S\|_{HS}, \|T\|_{HS} < \infty$. Then
$$\mathbb{E}\, \| S G T \| \leq \big( \| C^{1/2} \|_{HS} \| T \| + \| C^{1/2} \| \| T \|_{HS} \big) \| S \|.$$

Proof Let $g \sim N(0, \|T\|^2 C)$ be Gaussian in $\ell^2$ and let $h \sim N(0, \| C^{1/2} S^* \|^2 I_n)$ be a Gaussian random vector in $\mathbb{R}^n$, such that $g$, $h$, and $G$ are all independent. Define
$$X(u, v) := \langle S G T u, v \rangle, \qquad Y(u, v) := \langle S g, v \rangle + \langle T^* h, u \rangle,$$
indexed by $u \in \ell^2$ and $v \in L^2(D)$ of unit length. One can show that $X(u, v)$ and $Y(u, v)$ are well-defined, and that $X(\cdot, \cdot)$ and $Y(\cdot, \cdot)$ are mean zero Gaussian processes. We aim to apply the Sudakov–Fernique inequality (Vershynin, 2018, Thm. 7.2.11). For any unit-length $u, w \in \ell^2$ and $v, z \in L^2(D)$, set $A = (T u) \otimes (S^* v) - (T w) \otimes (S^* z)$. We compute
$$\mathbb{E} (X(u, v) - X(w, z))^2 = \mathrm{Tr}(A^* C A) \leq \|T\|^2 \| S^* (v - z) \|_C^2 + \| S^* \|_{I \to C}^2 \| T (u - w) \|^2 = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2 + \| C^{1/2} S^* \|^2 \| T (u - w) \|^2. \qquad (41)$$
The first equality follows from fully expanding $(X(u, v) - X(w, z))^2$ and using the definition of $G$; the inequality is obtained by applying Lemma 9, taking $H$ to be $\mathbb{R}^n$ with the usual inner product and $K$ to be $\ell^2$ with the inner product induced by $C$. On the other hand, observe that
$$\mathbb{E} \langle S g, v - z \rangle^2 = \|T\|^2 (v - z)^* S C S^* (v - z) = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2,$$
and similarly $\mathbb{E} \langle T^* h, u - w \rangle^2 = \| C^{1/2} S^* \|^2 \| T (u - w) \|^2$. Therefore, by independence of $g$ and $h$, we have
$$\mathbb{E} (Y(u, v) - Y(w, z))^2 = \|T\|^2 \| C^{1/2} S^* (v - z) \|^2 + \| C^{1/2} S^* \|^2 \| T (u - w) \|^2.$$
(42)

Combining (41) and (42) yields $\mathbb{E}(X(u, v) - X(w, z))^2 \leq \mathbb{E}(Y(u, v) - Y(w, z))^2$ for every unit-length $u, v, w, z$, so by the Sudakov–Fernique inequality, we have
$$\mathbb{E} \sup_{\substack{u \in \ell^2 \\ \|u\| = 1}}\; \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \langle S G T u, v \rangle \leq \mathbb{E} \sup_{\substack{u \in \ell^2 \\ \|u\| = 1}}\; \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \big( \langle S g, v \rangle + \langle T^* h, u \rangle \big). \qquad (43)$$
Since $S G T$ is almost surely a bounded operator, the left-hand side of (43) is $\mathbb{E} \| S G T \|$. For the right-hand side, two applications of the Cauchy–Schwarz inequality yield
$$\mathbb{E} \sup_{\substack{v \in L^2(D) \\ \|v\| = 1}} \langle S g, v \rangle \leq \big( \mathbb{E} \| S g \|^2 \big)^{1/2} = \mathrm{Tr}\big( \|T\|^2 S C S^* \big)^{1/2} \leq \| C^{1/2} \|_{HS} \| S \| \| T \|.$$
An analogous argument for $\langle T^* h, u \rangle$ combined with (43) completes the proof.

Finally, we derive probability estimates for the operator norm of the pseudoinverse of non-standard Gaussian matrices, following Chen and Dongarra (2005) and Edelman (1988).

Proposition 11 Let $G$ be an $m \times n$ random matrix with distribution $N(0, I_n \otimes C)$, where $C$ is a symmetric positive definite $m \times m$ covariance matrix with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_m > 0$. If $n \geq m \geq 2$, then for any $t \geq 0$, we have
$$\mathbb{P}\{ \| G^\dagger \| \geq t \} \leq \frac{1}{\sqrt{2\pi(n - m + 1)}} \left( \frac{e \sqrt{n}}{\sqrt{\lambda_m}\, (n - m + 1)} \right)^{n - m + 1} t^{-(n - m + 1)}.$$

Proof Let $x_1 \geq \cdots \geq x_m > 0$ be the eigenvalues of the Wishart matrix $G G^* \sim W_m(n, C)$. Since $\| G^\dagger \| = x_m^{-1/2}$, we focus on the distribution of $x_m$. It is well known (James, 1964, 1968) that the joint density of the eigenvalues is given by
$$f(x_1, \ldots, x_m) = |C|^{-n/2} K_{m,n}\, {}_0F_0\big( C^{-1}, \cdots \big) \cdots$$
$$\cdots \leq \sqrt{\mathrm{Tr}(K)}\, \| \Omega_1^\dagger \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \| \| F \| + \lambda_1 \| C_{11}^{-1} \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \|\, s\, \| \Sigma_2 \| \quad \text{on } E_t,$$
where we make use of the fact that
$$\mathbb{E}[ h(\Omega_2) \mid \Omega_1 ] \leq \sqrt{\mathrm{Tr}(K)}\, \| \Omega_1^\dagger \| + \sqrt{\lambda_1}\, \| \Omega_1^\dagger \| \| F \| + \lambda_1 \| C_{11}^{-1} \| \| \Sigma_2 \|$$
by Proposition 10. By definition of $E_t$, it follows that
$$\cdots \leq \sqrt{\mathrm{Tr}(K)} + s \sqrt{\| C_{11}^{-1} \|}\, \frac{\sqrt{(k + p)\, \lambda_1\, \mathrm{Tr}(C_{11}^{-1})}}{p + 1}\, t + \lambda_1 \| C_{11}^{-1} \|.$$
Since $\mathbb{P}(E_t^c) \leq 2 t^{-p}$, we then obtain
$$\cdots \leq \sqrt{\mathrm{Tr}(K)} + s \sqrt{\| C_{11}^{-1} \|}\, \frac{\sqrt{(k + p)\, \lambda_1\, \mathrm{Tr}(C_{11}^{-1})}}{p + 1}\, t + \lambda_1 \| C_{11}^{-1} \| \quad \text{with probability at least } 1 - 2 t^{-p} - e^{-s^2/2}.$$
We combine this estimate with Boullé and Townsend (2023, Lem. 1) to conclude.

The proof of Theorem 3 follows easily.

Proof (Theorem 3) We first claim that $\| F - P_Z F \| \leq \| H - P_Z H \|^{1/(2q+1)}$. Indeed, let $I$ be the identity operator on $L^2(D_2)$ and observe that $I - P_Z$ is an orthogonal projector.
Then Proposition 8 yields
$$\| (I - P_Z) F \| \leq \| (I - P_Z) (F F^*)^q F \|^{1/(2q+1)} = \| (I - P_Z) H \|^{1/(2q+1)}.$$
Now noting that the singular values of $H$ are $\sigma_1^{2q+1} \geq \sigma_2^{2q+1} \geq \cdots$, the result immediately follows from Theorem 1.

B.2 Proofs for Section 3.2

Here, we prove Theorem 5. In the following, let $\Theta(\mathcal{X}, \mathcal{Y})$ denote the diagonal matrix of canonical angles between two subspaces $\mathcal{X}, \mathcal{Y}$ of $L^2(D_2)$ with equal finite dimension (Stewart and Sun, 1990, Def. I.5.3). Additionally, for any (quasi)matrix $X$, we denote its column space by the calligraphic version of the same symbol, namely, $\mathcal{X}$.

Lemma 13 Let $F : L^2(D_1) \to L^2(D_2)$ be an HS operator with SVE given by (33). Select an integer $k \geq 1$ such that $\sigma_k > \sigma_{k+1}$. Select an integer $q \geq 0$ and set
$$\delta_q := \frac{\sigma_{k+1}^{2q+1}}{\sigma_k^{2q+1} - \sigma_{k+1}^{2q+1}} > 0. \qquad (44)$$
Let $H := (F F^*)^q F$ and let $\widetilde H : L^2(D_1) \to L^2(D_2)$ be an HS operator with finite rank $k$ such that $\| H - \widetilde H \| \leq C \sigma_{k+1}^{2q+1}$ for some $C > 0$. Let $\mathcal{U}_k$ be the dominant left $k$-singular subspace of $H$, and let $\widetilde{\mathcal{U}}_k$ be a dominant left $k$-singular subspace of $\widetilde H$. If $C \delta_q < 1$, then
$$\| \sin \Theta(\mathcal{U}_k, \widetilde{\mathcal{U}}_k) \| \leq \frac{C \delta_q}{1 - C \delta_q}. \qquad (45)$$

Proof Let $\widetilde\sigma_1 \geq \widetilde\sigma_2 \geq \cdots$ be the singular values of $\widetilde H$; the singular values of $H$ are given by $\sigma_1^{2q+1} \geq \sigma_2^{2q+1} \geq \cdots$. By Weyl's theorem, we have $| \widetilde\sigma_k - \sigma_k^{2q+1} | \leq C \sigma_{k+1}^{2q+1}$, whereas the assumption (44) implies $\sigma_k^{2q+1} - \sigma_{k+1}^{2q+1} = \delta_q^{-1} \sigma_{k+1}^{2q+1}$. It follows that
$$\widetilde\sigma_k - \sigma_{k+1}^{2q+1} \geq \sigma_k^{2q+1} - | \widetilde\sigma_k - \sigma_k^{2q+1} | - \sigma_{k+1}^{2q+1} \geq (\delta_q^{-1} - C)\, \sigma_{k+1}^{2q+1},$$
which in conjunction with Wedin's theorem (Stewart and Sun, 1990, Thm. V.4.4) yields the desired bound.

Combining Lemma 13 with the rSVD means we can approximate an operator's singular values at the extra cost of computing $\widetilde U_k^* F$, which takes an additional $k$ operator-function products. This provides the desired proof.

Proof (Theorem 5) Let $\widetilde U_k$ be the $D_2 \times k$ quasimatrix whose columns are dominant left $k$-singular functions of $\widetilde H$. A corollary of the CS decomposition (Stewart and Sun, 1990, Exc. I.5.6) along with Theorem 1 and Lemma 13 yields, with high probability,
$$\| U_k - \widetilde U_k \| \leq 2 \| \sin \Theta(\mathcal{U}_k, \widetilde{\mathcal{U}}_k) \| \leq \frac{2 \delta_q A_{k,p}(s, t)}{1 - \delta_q A_{k,p}(s, t)}.$$
Hence, we find that
$$\| U_k^* F - \widetilde U_k^* F \| \leq \frac{2 \delta_q A_{k,p}(s, t)}{1 - \delta_q A_{k,p}(s, t)} \| F \|.$$
Notice that $U_k^* F$ has the same dominant $k$ singular values as $F$, so Weyl's theorem completes the proof.

References

T. G. Anderson, O. P. Bruno, and M. Lyon. High-order, dispersionless fast-hybrid wave equation solver. Part I: O(1) sampling cost via incident-field windowing and recentering. SIAM J. Sci. Comput., 42(2):A1348–A1379, 2020.

R. Arora. A deep learning framework for solving hyperbolic partial differential equations: Part I. Preprint arXiv:2307.04121, 2023.

J. W. Banks and W. D. Henshaw. Upwind schemes for the wave equation in second-order form. J. Comput. Phys., 231(17):5854–5889, 2012.

M. Bebendorf and W. Hackbusch. Existence of H-matrix approximants to the inverse FE-matrix of elliptic operators with L∞-coefficients. Numer. Math., 95:1–28, 2003.

J. Berman and B. Peherstorfer. Randomized sparse Neural Galerkin schemes for solving evolution equations with deep networks. In Advances in Neural Information Processing Systems, volume 36, pages 4097–4114, 2023.

J. Berman and B. Peherstorfer. CoLoRA: Continuous low-rank adaptation for reduced implicit neural modeling of parameterized partial differential equations. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 3565–3583, 2024.

K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.

V. I. Bogachev. Gaussian Measures, volume 62 of Mathematical Surveys and Monographs. American Mathematical Society, 1998.

N. Boullé and A. Townsend. Learning elliptic partial differential equations with randomized linear algebra. Found. Comput. Math., 23:709–739, 2023.

N. Boullé, C. J. Earls, and A. Townsend. Data-driven discovery of Green's functions with human-understandable deep learning. Sci. Rep., 12:1–9, 2022a.

N. Boullé, S. Kim, T. Shi, and A. Townsend. Learning Green's functions associated with time-dependent partial differential equations. J. Mach.
Learn. Res., 23(1):1–34, 2022b.

N. Boullé, D. Halikias, and A. Townsend. Elliptic PDE learning is provably data-efficient. Proc. Natl. Acad. Sci. USA, 120(39):e2303904120, 2023.

N. Boullé, D. Halikias, S. E. Otto, and A. Townsend. Operator learning without the adjoint. J. Mach. Learn. Res., 25(364):1–54, 2024.

J. Bruna, B. Peherstorfer, and E. Vanden-Eijnden. Neural Galerkin schemes with active learning for high-dimensional evolution equations. J. Comput. Phys., 496:112588, 2024.

S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA, 113(15):3932–3937, 2016.

C.-T. Chen and G. X. Gu. Learning hidden elasticity with deep neural networks. Proc. Natl. Acad. Sci. USA, 118(31):e2102721118, 2021.

K. Chen, C. Wang, and H. Yang. Deep operator learning lessens the curse of dimensionality for PDEs. In Transactions on Machine Learning Research, 2024.

Z. Chen and J. Dongarra. Condition numbers of Gaussian random matrices. SIAM J. Matrix Anal. Appl., 26(3):1389–1404, 2005.

M. Cicognani. Strictly hyperbolic equations with non regular coefficients with respect to time. Ann. Univ. Ferrara, 45(1):45–58, 1999.

M. Cicognani and D. Lorenz. Strictly hyperbolic equations with coefficients low-regular in time and smooth in space. J. Pseudo-Differ. Oper. Appl., 9(3):643–675, 2018.

R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 2. Interscience Publishers, 1st English edition, 1962.

M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA J. Uncertain. Quantif., 11(2):480–513, 2023.

T. A. Driscoll, N. Hale, and L. N. Trefethen, editors. Chebfun Guide. Pafnuty Publications, 2014.

A. Edelman. Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl., 9(4):543–560, 1988.

L. C. Evans.
Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, 2nd edition, 2010.

Y. Fan and L. Ying. Solving electrical impedance tomography with deep learning. J. Comput. Phys., 404:109119, 2020.

Y. Fan, L. Lin, L. Ying, and L. Zepeda-Núñez. A multiscale neural network based on hierarchical matrices. Multiscale Model. Simul., 17(4):1189–1213, 2019.

H. Federer. Curvature measures. Trans. Am. Math. Soc., 93(3):418–491, 1959.

J. Feliu-Fabà, Y. Fan, and L. Ying. Meta-learning pseudo-differential operators with deep neural networks. J. Comput. Phys., 408:109309, 2020.

C. R. Gin, D. E. Shea, S. L. Brunton, and J. N. Kutz. DeepGreen: Deep learning of Green's functions for nonlinear boundary value problems. Sci. Rep., 11:1–14, 2021.

Y. Gordon. Some inequalities for Gaussian processes and applications. Isr. J. Math., 50:265–289, 1985.

Y. Gordon. Gaussian processes and almost spherical sections of convex bodies. Ann. Probab., 16(1):180–188, 1988.

A. Gray. Tubes, volume 221 of Progress in Mathematics. Birkhäuser, 2nd edition, 2003.

P. Günther. Huygens' principle and Hadamard's conjecture. Math. Intell., 13(2):56–63, 1991.

Y. Guo, X. Cao, B. Liu, and M. Gao. Solving partial differential equations using deep learning and physical constraints. Appl. Sci., 10(17):5917, 2020.

D. Halikias and A. Townsend. Structured matrix recovery from matrix-vector products. Numer. Linear Algebra Appl., 31(1):e2531, 2024.

N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.

T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, 2015.

A. J. Huang and S. Agarwal. On the limitations of physics-informed deep learning: Illustrations using first-order hyperbolic conservation law-based traffic flow models. IEEE Open J. Intell. Transp.
Syst., 4:279–293, 2023.

D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. Found. Data Sci., 7(1):163–225, 2025.

D. Hug, G. Last, and W. Weil. A local Steiner-type formula for general closed sets and applications. Math. Z., 246:237–272, 2004.

A. E. Hurd and D. H. Sattinger. Questions of existence and uniqueness for hyperbolic equations with discontinuous coefficients. Trans. Am. Math. Soc., 132(1):159–174, 1968.

A. T. James. Distributions of matrix variates and latent roots derived from normal samples. Ann. Math. Stat., 35(2):475–501, 1964.

A. T. James. Calculation of zonal polynomial coefficients by use of the Laplace–Beltrami operator. Ann. Math. Stat., 39(5):1711–1718, 1968.

E. Kaiser, J. N. Kutz, and S. L. Brunton. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc. R. Soc. Lond. A, 474(2219):20180335, 2018.

G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang. Physics-informed machine learning. Nat. Rev. Phys., 3(6):422–440, 2021.

L. K. Kates. Zonal polynomials. PhD thesis, Princeton University, 1981.

Y. Khoo and L. Ying. SwitchNet: a neural network model for forward and inverse scattering problems. SIAM J. Sci. Comput., 41(5):A3182–A3201, 2019.

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer. Machine learning-accelerated computational fluid dynamics. Proc. Natl. Acad. Sci. USA, 118(21):e2101784118, 2021.

N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. J. Mach. Learn. Res., 24:1–97, 2023.

A. S. Krishnapriyan, A. Gholami, S. Zhe, R. M. Kirby, and M. Mahoney. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems, volume 35, pages 26548–26560, 2021.

J. N. Kutz.
Deep learning in fluid dynamics. J. Fluid Mech., 814:1–4, 2017.

R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.

S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Trans. Math. Appl., 6(1):tnac001, 2022.

P. Laurent, G. Legendre, and J. Salomon. On the method of reflections. Numer. Math., 148(2):449–493, 2021.

P. D. Lax. Hyperbolic Partial Differential Equations, volume 14 of Courant Lecture Notes in Mathematics. American Mathematical Society, 2006.

M. E. Lerner. Qualitative properties of the Riemann function [in Russian]. Differ. Uravn., 27(12):2106–2120, 1991.

J. Levitt and P.-G. Martinsson. Randomized compression of rank-structured matrices accelerated with graph coloring. J. Comput. Appl. Math., 451:116044, 2024.

M. Li, L. Demanet, and L. Zepeda-Núñez. Wide-band butterfly network: stable and efficient inversion via multi-frequency neural networks. Multiscale Model. Simul., 20(4):1191–1227, 2022.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Graph kernel network for partial differential equations. In International Conference on Learning Representations Workshop on Integration of Deep Neural Models and Differential Equations, 2020a.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar. Multipole graph neural operator for parametric partial differential equations. In Advances in Neural Information Processing Systems, volume 33, pages 6755–6766, 2020b.

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations.
In International Conference on Learning Representations, 2021.

L. Lin, J. Lu, and L. Ying. Fast construction of hierarchical matrix representation from matrix-vector multiplication. J. Comput. Phys., 230(10):4071–4087, 2011.

Y. Liu, X. Xing, H. Guo, E. Michielssen, P. Ghysels, and X. S. Li. Butterfly factorization via randomized matrix-vector multiplications. SIAM J. Sci. Comput., 43(2):A883–A907, 2021.

Y. Liu, J. Song, R. Burridge, and J. Qian. A fast butterfly-compressed Hadamard–Babich integrator for high-frequency Helmholtz equations in inhomogeneous media with arbitrary sources. Multiscale Model. Simul., 21(1):269–308, 2023.

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell., 3:218–229, 2021a.

L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis. DeepXDE: A deep learning library for solving differential equations. SIAM Rev., 63(1):208–228, 2021b.

A. G. Mackie. Green's functions and Riemann's method. Proc. Edinburgh Math. Soc., 14(4):293–302, 1965.

A. Mandelbaum. Linear estimators and measurable linear transformations on a Hilbert space. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 65(3):385–397, 1984.

S. Massei, L. Robol, and D. Kressner. Hierarchical adaptive low-rank format with applications to discretized partial differential equations. Numer. Linear Algebra Appl., 29(6):e2448, 2022.

M. Meier and Y. Nakatsukasa. Fast randomized numerical rank estimation for numerically low-rank matrices. Linear Algebra Appl., 686:1–32, 2024.

R. Molinaro, Y. Yang, B. Engquist, and S. Mishra. Neural inverse operators for solving PDE inverse problems. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 25105–25139, 2023.

R. J. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, 1982.

NIST DLMF. NIST Digital Library of Mathematical Functions.
Release 1.1.12 of 2023-12-15. URL https://dlmf.nist.gov/. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds.

S. E. Otto, A. Padovan, and C. W. Rowley. Model reduction for nonlinear systems by balanced truncation of state and gradient covariance. SIAM J. Sci. Comput., 45(5):A2325–A2355, 2023.

E. Qian, B. Kramer, B. Peherstorfer, and K. Willcox. Lift & learn: Physics-informed machine learning for large-scale nonlinear dynamical systems. Physica D, 406:132401, 2020.

M. Raissi, A. Yazdani, and G. E. Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.

M. Reissig. Hyperbolic equations with non-Lipschitz coefficients. Rend. Semin. Mat. Univ. Politec. Torino, 61(2):135–181, 2003.

R. Rodriguez-Torrado, P. Ruiz, L. Cueto-Felgueroso, M. C. Green, T. Friesen, S. Matringe, and J. Togelius. Physics-informed attention-based neural network for hyperbolic partial differential equations: Application to the Buckley–Leverett problem. Sci. Rep., 12:7557, 2022.

V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, 2009.

S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz. Data-driven discovery of partial differential equations. Sci. Adv., 3(4):e1602614, 2017.

F. Schäfer and H. Owhadi. Sparse recovery of elliptic solvers from matrix-vector products. SIAM J. Sci. Comput., 46(2):A998–A1025, 2024.

F. Schäfer, T. J. Sullivan, and H. Owhadi. Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity. Multiscale Model. Simul., 19:688–730, 2021.

T. Shi and A. Townsend. On the compressibility of tensors. SIAM J. Matrix Anal. Appl., 42(1):275–298, 2021.

H. F. Smith. A parametrix construction for wave equations with C^{1,1} coefficients. Ann. Inst.
Fourier, 48(3):797–835, 1998.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

S. Subramanian, P. Harrington, K. Keutzer, W. Bhimji, D. Morozov, M. Mahoney, and A. Gholami. Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. In Advances in Neural Information Processing Systems, volume 37, pages 71242–71262, 2023.

B. T. Thodi, S. V. R. Ambadipudi, and S. E. Jabari. Learning-based solutions to nonlinear hyperbolic PDEs: Empirical insights on generalization errors. In Advances in Neural Information Processing Systems Workshop on Machine Learning and the Physical Sciences, 2022.

A. Townsend. Computing with functions in two dimensions. PhD thesis, University of Oxford, 2014.

A. Townsend and L. N. Trefethen. Continuous analogues of matrix factorizations. Proc. R. Soc. Lond. A, 471(2173):20140585, 2015.

L. N. Trefethen. Householder triangularization of a quasimatrix. IMA J. Numer. Anal., 30(4):887–897, 2010.

L. N. Trefethen. Approximation Theory and Approximation Practice. SIAM, extended edition, 2019.

R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

Z. Y. Wan, L. Zepeda-Núñez, A. Boral, and F. Sha. Evolve smoothly, fit consistently: Learning smooth latent dynamics for advection-dominated systems. In International Conference on Learning Representations, 2023.

H. Wang. New error bounds for Legendre approximations of differentiable functions. J. Fourier Anal. Appl., 29(42), 2023.

S. Wang, H. Wang, and P. Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Sci. Adv., 7:eabi8605, 2021.

L. Zepeda-Núñez and L. Demanet. The method of polarized traces for the 2D Helmholtz equation. J. Comput. Phys., 308:347–388, 2016.

S. Zhang and G. Lin. Robust data-driven recovery of governing physical laws with error bars. Proc. R. Soc.
Lond. A, 474(2217):20180305, 2018.