Neural Q-learning for solving PDEs

Journal of Machine Learning Research 24 (2023) 1-49. Submitted 9/22; Revised 6/23; Published 6/23.

Samuel N. Cohen (cohens@maths.ox.ac.uk), Mathematical Institute, University of Oxford, Oxford, OX2 6GG, UK
Deqing Jiang (Deqing.Jiang@maths.ox.ac.uk), Mathematical Institute, University of Oxford, Oxford, OX2 6GG, UK
Justin Sirignano (sirignano@maths.ox.ac.uk), Mathematical Institute, University of Oxford, Oxford, OX2 6GG, UK

Editor: Michael Mahoney

Abstract. Solving high-dimensional partial differential equations (PDEs) is a major challenge in scientific computing. We develop a new numerical method for solving elliptic-type PDEs by adapting the Q-learning algorithm in reinforcement learning. To solve PDEs with Dirichlet boundary conditions, our Q-PDE algorithm is mesh-free and therefore has the potential to overcome the curse of dimensionality. Using a neural tangent kernel (NTK) approach, we prove that the neural network approximator for the PDE solution, trained with the Q-PDE algorithm, converges to the trajectory of an infinite-dimensional ordinary differential equation (ODE) as the number of hidden units $N \to \infty$. For monotone PDEs (i.e. those given by monotone operators, which may be nonlinear), despite the lack of a spectral gap in the NTK, we then prove that the limit neural network, which satisfies the infinite-dimensional ODE, converges strongly in $L^2$ to the PDE solution as the training time $t \to \infty$. More generally, we prove that any fixed point of the wide-network limit for the Q-PDE algorithm is a solution of the PDE (not necessarily under the monotone condition). The numerical performance of the Q-PDE algorithm is studied for several elliptic PDEs.

Keywords: deep learning, neural networks, high-dimensional PDEs, high-dimensional learning, Q-learning.

1. Introduction

High-dimensional partial differential equations (PDEs) are widely used in many applications in physics, engineering, and finance.
It is challenging to numerically solve high-dimensional PDEs, as traditional finite difference methods become computationally intractable due to the curse of dimensionality. In the past decade, deep learning has become a revolutionary tool for a number of different areas including image recognition, natural language processing, and scientific computing. The idea of solving PDEs with deep neural networks has been rapidly developed in recent years and has achieved promising performance on real-world problems (e.g. Hu et al. (2022), Li et al. (2022), Cai et al. (2021), and Misyris et al. (2020)).

© 2023 Samuel N. Cohen, Deqing Jiang, Justin Sirignano. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1075.html.

A number of deep-learning-based PDE solving algorithms have been proposed, for instance the deep Galerkin method (DGM; Sirignano and Spiliopoulos (2018)) and physics-informed neural networks (PINNs; Raissi et al. (2019)). Inspired by the Q-learning method in reinforcement learning, we propose a new Q-PDE algorithm which approximates the solution of PDEs with an artificial neural network. In this paper, we prove that for monotone PDEs the Q-PDE training process gives an approximation which converges to the solution of a certain limit equation as the number of hidden units in a single-layer neural network goes to infinity. Furthermore, we prove that the limit neural network converges strongly to the solution of the PDE that the algorithm is trying to solve.

Q-learning (Watkins and Dayan (1992)) is a well-known algorithm for computing the value function of the optimal policy in reinforcement learning. Deep Q-learning (DQN; Mnih et al. (2013)), a neural-network-based Q-learning algorithm, has successfully learned to play Atari games (Mnih et al. (2015)) and has subsequently become widely used for many applications (Zhao et al. (2019)).
The Q-learning algorithm updates its value function approximator following a biased gradient estimate computed from input data samples. We propose an algorithm which is similar to Q-learning in the sense that we update the parameters of a neural network via a biased gradient flow in continuous time. As far as the authors are aware, this is the first time that a Q-learning algorithm has been developed to solve PDEs directly.

The universal approximation theorem (Cybenko (1989)) indicates that the family of single-layer neural networks are powerful function approximators. However, the universal approximation theorem only states that there exists a neural network which can approximate continuous functions arbitrarily well; it does not suggest how to identify the parameters for such a neural network. When the number of units in our neural network becomes large, it is, however, possible to obtain asymptotic limits, for example the neural tangent kernel limit (Jacot et al. (2018)), which gives a variant of the law of large numbers for infinitely wide neural networks. We will use this approach to study the performance of the biased gradient flow as a training algorithm, and show that the wide-network limit satisfies an infinite-dimensional ordinary differential equation (ODE).

We apply our Q-PDE approach to second-order nonlinear PDEs with Dirichlet boundary conditions. In particular, we are able to give strong convergence results when the differential operator satisfies a strong monotonicity condition. Monotone PDEs arise in various applications; in particular, in PDEs arising from stochastic modeling, the generators of ergodic stochastic processes are monotone (when evaluated with their stationary distributions), which suggests a variety of possible applications of our approach. Further, the subdifferentials of convex functionals (i.e.
of maps from functions to the real line) are monotone; this suggests that monotone PDEs may be a particularly well-suited class of equations for gradient methods, as they correspond to (a generalization of) minimizations of convex functionals. We will see that, given this monotonicity assumption, we can prove that the limit Q-PDE algorithm converges (strongly in $L^2$) to the solution of the monotone PDE. More generally, we can prove that any fixed point of the wide-network limit for the Q-PDE algorithm is a solution of the PDE (not necessarily under the monotone condition).

In the remainder of the introduction, we provide a brief survey of relevant literature and some related approaches. The necessary properties of the PDEs which we study, the architecture of the neural networks we use, and the Q-PDE training algorithm we propose are presented in Section 2. Section 3 derives useful properties of the limiting system, while Section 4 contains a rigorous proof that the training process converges to this limit. Analysis of the limit neural network, in particular the proof of convergence to the solution of the PDE, is presented in Section 5. Numerical results are presented in Section 6.

1.1 Summary

In summary, this paper will provide the following:

- A new neural network training algorithm for second-order PDEs (of the form given by Assumption 13), based on Q-learning (Algorithm 20), and designed to ensure Dirichlet boundary conditions are enforced.
- Proofs of various functional-analytic properties of the neural tangent kernel limit of this algorithm (Section 3).
- A rigorous convergence proof of the neural tangent kernel limit for a wide single-layer network (Section 4). This limit gives a simple deterministic dynamical system characterizing the neural network approximation during training (Definition 43).
- Proof that any fixed point of the limit neural network training dynamics is a solution of the PDE (Theorem 46).
- A strong convergence proof for the limit neural network as the training time $t \to \infty$, when considering a PDE given by a monotone operator (Section 5).
- Numerical examples to demonstrate the effectiveness of the training algorithm (Section 6).

A more precise summary of our mathematical results is given in Section 2.3.1.

1.2 Relevant literature

The idea of solving PDEs with artificial neural networks has been rapidly developed in recent years. Various approaches have been proposed: Lagaris et al. (1998), Lagaris et al. (2000), Lee and Kang (1990), and Malek and Beidokhti (2006) applied neural networks to solve differential equations on an a priori fixed mesh. However, the curse of dimensionality means that these approaches cannot be extended to high-dimensional cases. In contrast, as a mesh-free method, DGM (Sirignano and Spiliopoulos (2018)) randomly samples data points in each training step. Based on this random grid, it derives an unbiased estimator of the error of the approximation using the differential operator and the boundary condition of the PDE, and then iteratively updates the neural network parameters with stochastic gradient descent. This approach has been successful at solving certain high-dimensional PDEs including free-boundary PDEs, Hamilton-Jacobi-Bellman equations, and Burgers' equation. Unlike DGM, our algorithm is inspired by Q-learning, in the sense that we compute a biased gradient estimator as the search direction for parameter training. The biased gradient estimator does not require taking a derivative of the differential operator applied to the neural network; only the gradient of the neural network model itself is required. The simplicity of this gradient estimator, together with the monotonicity of the PDEs we consider, allows a proof of the convergence of the Q-PDE training algorithm to the solution of the PDE. Another related area of research is PINNs (Raissi et al.
(2019)), which train neural networks to merge observed data with PDEs. This enables its users to make use of a priori knowledge of the governing equations in physics. Analysis of solving second-order elliptic/parabolic PDEs with PINNs can be found in Shin et al. (2020). DeepONet (Lu et al. (2021)) proposes solving a family of PDEs in parallel. Its architecture is split into two parallel networks: a branch net, which approximates functions related to input PDEs, and a trunk net, which maps spatial coordinates to possible functions. Combining these two networks, it is possible to learn the solution of a PDE given enough training data from traditional solvers. A detailed mathematical analysis of DeepONet can be found in Lanthaler et al. (2021). Deep Sets (Germain et al. (2021a)) are designed to solve a class of PDEs which are invariant to permutations; the method computes simultaneously an approximation of the solution and its gradient. For solving fully nonlinear PDEs, Pham et al. (2021) estimate the solution and its gradient simultaneously by backward time induction. For linear PDEs, MOD-Net (Zhang et al. (2021)) uses a DNN to parameterize the Green's function and approximate the solution.

Further approaches, which go beyond looking for an approximation of the strong solution of the PDE directly, have also been studied. Zang et al. (2020) propose using generative adversarial networks (GANs) to solve a min-max optimization problem to approximate the weak solution of the PDE. The Fourier neural operator (FNO) (Li et al. (2020)) approach makes use of a Fourier transform and learns to solve the PDE via convolution. The solution of certain PDEs can be expressed in terms of stochastic processes using the Feynman-Kac formula. Grohs and Herrmann (2020) apply neural networks to learn the solution of a PDE via its probabilistic representation and numerically demonstrate the approach for Poisson equations. Based on an analogy between BSDEs and reinforcement learning, E et al.
(2017) propose an algorithm to solve parabolic semilinear PDEs and the corresponding backward stochastic differential equations (BSDEs), where the loss function gives the error between the prescribed terminal condition and the solution of the BSDE. PDE-Net and its variants (Long et al. (2018), Long et al. (2019)) attempt to solve inverse PDE problems: a neural network is trained to infer the governing PDE given physical data from sensors. In a recent paper, Sirignano et al. (2021) introduce neural network terms to optimize PDE-constrained models. The parameters in the PDE are trained using gradient descent, where the gradient is evaluated by an adjoint PDE. In Ito et al. (2021), a class of numerical schemes for solving semilinear Hamilton-Jacobi-Bellman-Isaacs (HJBI) boundary value problems is proposed. By policy iteration, the semilinear problem is reduced to a sequence of linear Dirichlet problems, which are subsequently approximated by a feedforward neural network ansatz. For a detailed overview of deep learning for solving PDEs, we refer the reader to E et al. (2022), Beck et al. (2020) and Germain et al. (2021b), and references therein.

Many of these works do not directly study the process of training the neural network. Gradient descent-based training of neural networks has been mathematically analyzed using the neural tangent kernel (NTK) approach (Jacot et al. (2018)). The NTK analysis characterizes the evolution of the neural network during training in the regime where the number of hidden units is large. For instance, Jacot et al. (2018), Lee et al. (2017), and Lee et al. (2019) study the NTK limit for classical regression problems where a neural network is trained to predict target data given an input. Using an NTK approach, Wang et al. (2020) give a possible explanation as to why PINNs sometimes fail to solve certain PDEs.
Their analysis highlights the difficulty in approximating highly oscillatory functions using neural networks, due to the lack of a spectral gap in the NTK. The NTK approach is not without criticism, particularly when applied to deeper neural networks, where it has been seen to be unable to accurately describe observed performance (see, for example, Ghorbani et al. (2020) or Chizat et al. (2019)). Alternative approaches include mean-field analysis (e.g. Mei et al. (2019)). In this article, we train a neural network to solve a PDE, using a different training algorithm, for which we give a rigorous proof of the convergence to an NTK limiting regime (as the width of the neural network increases). This NTK approach yields particularly simple dynamics under our training algorithm, allowing us to prove convergence to the solution of the PDE (as training time increases), despite the poorly behaved spectral properties of the NTK, for those PDEs given by monotone operators. Our Q-PDE algorithm is substantially different from the gradient descent algorithm studied in Jacot et al. (2018), Lee et al. (2017), and Lee et al. (2019), since a PDE operator is directly applied to the neural network to evaluate the parameter updates.

2. The Q-PDE Algorithm

In this section, we first state our assumptions on the PDE and its domain. Then we describe the neural network approximator and our Q-PDE algorithm.

2.1 Assumptions on the PDE

We consider a class of second-order nonlinear PDEs on a bounded open domain $\Omega \subset \mathbb{R}^n$ with Dirichlet boundary conditions:

$$ L u = 0 \ \text{in } \Omega, \qquad u = f \ \text{on } \partial\Omega. \qquad (1) $$

We assume a measure¹ $\mu$ is given on $\Omega$, with continuous Radon-Nikodym density $d\mu/d\mathrm{Leb}$ bounded away from $0$ and $\infty$ (where Leb denotes Lebesgue measure). It follows that $\mu(\Omega) < \infty$ and $\mu(A) > 0$ for all open sets $A \subset \Omega$; all integrals on $\Omega$ will be taken with respect to this measure (unless otherwise indicated).
We will study strong Sobolev solutions to the PDE (1); that is, we are interested in solutions $u \in H^2$, where

$$ H^p = \Big\{ f \in L^2(\Omega, \mu) : \|f\|_{H^p} := \sum_{|\alpha| \le p} \|D^\alpha f\|_{L^2} < \infty \Big\}, \qquad (2) $$

1. For example, $\mu$ can be taken to be the Lebesgue measure on $\Omega \subset \mathbb{R}^n$, and the analytic theory we consider is essentially the same as that when using the Lebesgue measure (i.e. the Sobolev norms using $\mu$ are equivalent to those using Leb). We allow for more general $\mu$, as this will roughly correspond to a choice of sampling scheme in our numerical algorithm, and gives us a degree of flexibility when considering the monotonicity of $L$.

where $D^\alpha u$ is the weak derivative of $u$ and where the boundary condition $u|_{\partial\Omega} = f$ is understood in the usual trace-operator sense in $H^1$. For notational simplicity, we write $H^2_{(0)} = H^2 \cap H^1_0$ for the subspace of functions $u \in H^2$ such that $u|_{\partial\Omega} = 0$, that is, with boundary trace zero. This remains a Hilbert space under the $H^2$ inner product. We write $\|\cdot\|$ (without a subscript) for the $L^2$ norm and similarly $\langle\cdot,\cdot\rangle$ for the inner product. We refer to Evans (2010) or Adams and Fournier (2003) for further details and general theory of these spaces and concepts.

The class of PDEs we will consider are those for which $L$, $f$ and $\Omega$ satisfy some (strong) regularity conditions, which we now detail. We first present the assumptions and lemmas for the domain and the boundary condition.

Assumption 1: The boundary $\partial\Omega$ is $C^{3,\alpha}$, for some $\alpha \in (0, 1)$, that is, three times continuously differentiable, with $\alpha$-Hölder continuous third derivative.

Assumption 2 (Auxiliary function $\eta$): There exists a (known) function $\eta \in C^3_b(\mathbb{R}^n)$ which satisfies $0 < \eta < 1$ in $\Omega$, and $\eta = 0$ on $\partial\Omega$. Furthermore, its first-order derivative does not vanish at the boundary (that is, for $x \in \partial\Omega$ and $n_x$ an outward unit normal vector at $x$, we have $\langle \nabla\eta(x), n_x\rangle \neq 0$).

Using Assumptions 1 and 2, we have the following useful result, which allows us to approximate functions in $H^2_{(0)}$. Its proof is given in Appendix A.1.

Lemma 3
1. The set of functions $C^3(\Omega) \cap C_0(\Omega)$ is dense in $H^2_{(0)} = H^2 \cap H^1_0$ (under the $H^2$ topology).
2. For any function $u \in C^3(\Omega) \cap C_0(\Omega)$, the function $\tilde{u} = u/\eta$ is in $C^2_b(\Omega) \cap H^2$.

In addition to these assumptions, we make the following assumption on the boundary value.

Assumption 4 (Interpolation of the boundary condition function): There exists a (known) function $\bar{f} \in H^2$ such that $\bar{f}|_{\partial\Omega} = f$ (or, more precisely, if $T : H^1(\Omega) \to L^2(\partial\Omega)$ is the boundary trace operator, we have $T(\bar{f}) = f$). In the rest of this paper, for notational simplicity, we identify $f$ with its extension $\bar{f}$ defined on $\Omega$.

Remark 5: The existence of some $\bar{f}$ satisfying Assumption 4 is guaranteed, given a solution $u \in H^2$ exists (as we could take $\bar{f} = u$). Practically, we assume not only that $\bar{f}$ exists, but that it is possible for us to use it as part of our numerical method to find $u$. In many cases, this is a mild and natural assumption, for instance when we have $f = 0$, where we can simply take $\bar{f} = 0$.

Now we present the assumptions on the PDE itself.

Assumption 6 (Lipschitz continuity of $L$): There exists a constant $C$ such that for any $f_1, f_2 \in H^2 \cap C^2$ and any $x \in \Omega$, the (nonlinear differential) operator $L$ satisfies

$$ |Lf_1(x) - Lf_2(x)| \le C \Big( |f_1(x) - f_2(x)| + \sum_{i=1}^n |\partial_{x_i} f_1(x) - \partial_{x_i} f_2(x)| + \sum_{i,j=1}^n |\partial^2_{x_i x_j} f_1(x) - \partial^2_{x_i x_j} f_2(x)| \Big). \qquad (3) $$

Assumption 7 (Integrability of $L$ at zero): Taking $f_0 \equiv 0$, we have $Lf_0 \in L^2$.

Remark 8: Under Assumptions 6 and 7, $L$ is a Lipschitz operator $L : H^2 \to L^2$.

In order to prove convergence of our scheme for large training time, we will particularly focus on PDEs given by strongly monotone operators (cf. Browder (1967)), which for convenience we take with the following sign convention.
Assumption 9 (Strong $L^2$-monotonicity of $L$): There exists a constant $\gamma > 0$ such that for any $f_1, f_2 \in H^2_{(0)}$ the operator $L$ satisfies

$$ \langle f_1 - f_2, \, Lf_1 - Lf_2 \rangle \le -\gamma \|f_1 - f_2\|^2_{L^2}, \qquad (4) $$

where $\langle\cdot,\cdot\rangle$ is the $L^2(\mu)$ inner product.

While these assumptions are somewhat restrictive, they are general enough to allow for the case $Lu = Au - \gamma u + r$, for $r$ any $L^2$ function, where $A$ is the generator of a sufficiently nice Feller process $X$ and $\gamma > 0$, for an appropriate choice of sampling distribution $\mu$; see Appendix A.7 for further discussion. In this case, the solution $u$ can be expressed (using the Feynman-Kac theorem) in terms of the stochastic process,

$$ u(x) = \mathbb{E}\Big[ e^{-\gamma\tau} f(X_\tau) + \int_0^\tau e^{-\gamma t} r(X_t)\,dt \,\Big|\, X_0 = x \Big], \qquad (5) $$

where $\tau = \inf\{t \ge 0 : X_t \notin \Omega\}$. Further, these assumptions are also sufficiently general to allow some nonlinear PDEs of interest, for example Hamilton-Jacobi-Bellman equations under a Cordes condition, as discussed by Smears and Süli (2014). The assumption of monotonicity is also connected to Lyapunov stability analysis, as it is easy to check that strong monotonicity is equivalent to stating that $v \mapsto \frac{1}{2}\|v\|^2_{L^2}$ is a Lyapunov function for the infinite-dimensional dynamical system $dv/dt = Lv + \gamma v$. For more general discussion of monotone operators, and their connection to analysis of the traditional Galerkin method, we refer to Zeidler (2013).

Our final assumption on the PDE is that a solution exists.

Assumption 10 (Existence of solutions): There exists a (unique) solution $u \in H^2$ to (1).

Remark 11: We note that, assuming strong monotonicity (Assumption 9), if a solution exists then it is guaranteed to be unique: if $u, u'$ are two solutions, then $Lu = Lu'$, hence $0 = \langle u - u', Lu - Lu' \rangle \le -\gamma \|u - u'\|^2$, and so $u = u'$ almost everywhere. While existence of solutions in $H^2$ is a strong assumption, it is often satisfied for weak solutions to elliptic equations, given the well-known elliptic regularity results which ensure solutions are sufficiently smooth.
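The probabilistic representation (5) can be checked numerically by Monte Carlo simulation. The sketch below is illustrative, not part of the paper's method: we assume $A = \frac{1}{2}\Delta$ (so $X$ is a standard Brownian motion), $\Omega = (-1, 1)$, $r = 0$ and boundary data $f = 1$, so that $u(x) = \mathbb{E}[e^{-\gamma\tau} \mid X_0 = x]$, which in this special case has the closed form $\cosh(x\sqrt{2\gamma})/\cosh(\sqrt{2\gamma})$.

```python
import numpy as np

def feynman_kac_estimate(x0, gamma, f, r, n_paths=5000, dt=1e-3, seed=0):
    """Monte Carlo estimate of u(x0) in (5), for X a standard Brownian motion
    on Omega = (-1, 1), killed at the exit time tau (illustrative example)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0, dtype=float)
    t = 0.0
    alive = np.ones(n_paths, dtype=bool)   # paths still inside Omega
    total = np.zeros(n_paths)              # accumulated discounted payoff
    while alive.any() and t < 50.0:        # time cap: discount makes the tail negligible
        # running reward e^{-gamma t} r(X_t) dt for surviving paths
        total[alive] += np.exp(-gamma * t) * r(x[alive]) * dt
        # Euler-Maruyama step for dX = dW
        x[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
        t += dt
        # paths exiting Omega collect the discounted boundary payoff
        exited = alive & (np.abs(x) >= 1.0)
        total[exited] += np.exp(-gamma * t) * f(x[exited])
        alive &= ~exited
    return total.mean()

# Example: r = 0, f = 1 on the boundary, so u(0) = E[e^{-gamma tau}]
u_hat = feynman_kac_estimate(0.0, gamma=1.0,
                             f=lambda x: np.ones_like(x),
                             r=lambda x: np.zeros_like(x))
```

Note that the exit time is only detected on the discrete time grid, so the estimate carries a small $O(\sqrt{dt})$ bias toward longer survival.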
Remark 12: The algorithm we present below can easily be extended to higher-order PDEs. To do this, further smoothness assumptions (on the activation function $\sigma$, continuation value $\bar{f}$ and auxiliary function $\eta$) and higher moments of the initial weights $w^i$ in the neural network are needed, and one argument (Lemma 3) needs to be extended using alternative approaches. In this paper, for the sake of concreteness (and to avoid unduly long derivations), we present our results for PDEs up to second order; the details of the extension to the general case are left as a tedious exercise for the reader.

Assumption 13 (Summary of assumptions): For ease of reference, we now summarize the assumptions we have made on our PDE:
1. $L$ is a nonlinear second-order differential operator, Lipschitz (as a map $H^2(\Omega) \to L^2(\Omega)$), square integrable at zero (Assumptions 6, 7).
2. The boundary value $f$ has a known continuation to a function in $H^2(\Omega)$ (Assumption 4).
3. $\Omega \subset \mathbb{R}^n$ is bounded and open, and has a $C^{3,\alpha}$ boundary $\partial\Omega$, which is the zero level set of a known auxiliary function $\eta \in C^3_b(\mathbb{R}^n)$, and $\eta$ has nonzero derivative on the boundary (Assumptions 1, 2).
4. $L$ is strongly monotone, and there exists an $H^2(\Omega)$ solution to the PDE (Assumptions 9, 10).

We emphasise that only parts 1-3 of this assumption will be used, except in Section 5, where the strong monotonicity and existence of solutions will also be needed.

2.2 Neural Network Configuration

We will approximate the solution to the PDE (1) with a neural network, which will be trained with a deep Q-learning inspired algorithm. In particular, we will study a single-layer neural network with $N$ hidden units:

$$ S^N(x; \theta) = \frac{1}{N^\beta} \sum_{i=1}^N c^i \, \sigma(w^i \cdot x + b^i), \qquad (6) $$

where the scaling factor $\beta \in (\tfrac{1}{2}, 1)$ and $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-linear scalar function. The parameters $\theta$ are defined as $\theta := \{c^i, w^i, b^i\}_{i=1,\dots,N}$, where $c^i, b^i \in \mathbb{R}$ and $w^i \in \mathbb{R}^n$.
The function $\eta$ can then be used to design a neural network model which automatically satisfies the boundary condition, as introduced by McFall and Mahan (2009):

$$ Q^N(x; \theta) := S^N(x; \theta)\,\eta(x) + \bar{f}(x). \qquad (7) $$

Assumption 14 (Activation function): The activation function $\sigma \in C^4_b(\mathbb{R})$ is non-constant, where $C^k_b$ is the space of functions with bounded continuous derivatives up to $k$-th order.

Remark 15: This assumption coincides with the condition of Theorem 4 of Hornik (1991), and guarantees that the functions $\sigma$ generate neural nets which are dense in the Sobolev space $H^2$. (Hornik shows, under this condition, the stronger result of density in $H^4$; we focus on $H^2$, but require additional bounded derivatives as part of our proof.) We will particularly make use of this to establish Lemma 30.

Before training begins (i.e. at $t = 0$), the neural network parameters are randomly initialized. The random initialization satisfies the assumptions described below.

Assumption 16 (Neural network initialization): The initialization of the parameters $\theta_0$, for all $i \in \{1, 2, \dots, N\}$, satisfies:
- The parameters $c^i_0, w^i_0, b^i_0$ are independent random variables.
- The random variables $c^i_0$ are bounded, $|c^i_0| < K_0$, and $\mathbb{E}[c^i_0] = 0$.
- The distribution of the random variables $(w^i_0, b^i_0)$ has full support. That is, for any open set $D \subset \mathbb{R}^{n+1}$, we have $P((w^i_0, b^i_0) \in D) > 0$.
- For any indices $i \in \{1, \dots, N\}$ and $k \in \{1, \dots, n\}$, we have $\mathbb{E}[|(w^i_0)_k|^3] < \infty$ and $\mathbb{E}[|b^i_0|] < \infty$.

2.3 Training Algorithm

We now present our algorithm for training the neural network $Q^N(x; \theta)$ to solve the PDE (1). At training time $t$, the parameters of the neural network are denoted $\theta_t = \{c^i_t, w^i_t, b^i_t\}_{i=1,\dots,N}$. For simplicity, we denote $S^N(\cdot\,; \theta_t)$ as $S^N_t$, and $Q^N(\cdot\,; \theta_t)$ as $Q^N_t$. Then,

$$ Q^N_t = S^N_t \eta + \bar{f}. \qquad (8) $$

Our goal is to design an algorithm to train the approximator $Q^N$ to find the solution of the PDE. One approach, see for instance Sirignano and Spiliopoulos (2018), is to train the model $Q^N$ to minimize the average PDE residual:

$$ \int_\Omega [L Q^N(x)]^2 \, d\mu(x). \qquad (9) $$
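The architecture (6)-(7) is straightforward to write down explicitly. In the sketch below, the choices $\sigma = \tanh$, $\beta = 0.75$, the initialization distributions, and the example domain are ours, not prescribed by the paper:

```python
import numpy as np

def init_params(N, n, K0=1.0, seed=0):
    """Random initialization in the spirit of Assumption 16: bounded,
    mean-zero c; Gaussian w, b (full support, finite moments).
    The specific distributions are illustrative choices."""
    rng = np.random.default_rng(seed)
    c = rng.uniform(-K0, K0, size=N)      # bounded, mean zero
    w = rng.standard_normal((N, n))       # full support on R^n
    b = rng.standard_normal(N)
    return c, w, b

def S_N(x, c, w, b, beta=0.75):
    """Single-layer network (6): S^N(x) = N^{-beta} sum_i c_i sigma(w_i.x + b_i),
    with sigma = tanh (a non-constant C^4_b activation)."""
    N = len(c)
    return N ** (-beta) * np.sum(c * np.tanh(w @ x + b))

def Q_N(x, c, w, b, eta, f_bar, beta=0.75):
    """PDE approximator (7): Q^N = S^N * eta + f_bar.
    Since eta vanishes on the boundary, Q^N = f_bar there."""
    return S_N(x, c, w, b, beta) * eta(x) + f_bar(x)
```

Because $\eta$ vanishes on $\partial\Omega$, $Q^N$ matches $\bar{f}$ on the boundary exactly, whatever the parameter values; for example, on the unit disc one may take $\eta(x) = (1 - \|x\|^2)/2$.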
To improve integrability, we will smoothly truncate this objective function, and use the result to motivate our training algorithm.

Definition 17 (Smooth truncation function): The functions $\{\psi^N\}_{N \in \mathbb{N}}$ are a family of smooth truncation functions if, for some $\delta \in (0, \frac{1-\beta}{2})$:
- $\psi^N \in C^2_b(\mathbb{R})$ is increasing on $\mathbb{R}$.
- $|\psi^N|$ is bounded by $2N^\delta$.
- $\psi^N(x) = x$ for $x \in [-N^\delta, N^\delta]$.
- $|(\psi^N)'| \le 1$ on $\mathbb{R}$.
- $F^N := \psi^N \cdot (\psi^N)'$ is uniformly Lipschitz continuous on $\mathbb{R}$ for all $N \in \mathbb{N}$.

For simplicity, we will usually describe the family $\psi^N$ as a (smooth) truncation function. A simple example of a truncation function is as follows: take $\delta = (1-\beta)/4 > 0$ and

$$ \psi^N(x) = \int_0^x g^N(y)\,dy, \quad \text{where} \quad g^N(x) = \begin{cases} e^{-(|x| - N^\delta)^2} & \text{if } |x| \ge N^\delta, \\ 1 & \text{if } |x| < N^\delta. \end{cases} \qquad (10) $$

Remark 18: It is easy to see that, for any smooth truncation function, the related function $F^N = \psi^N \cdot (\psi^N)'$ satisfies

$$ |F^N(x) - x| \le 2|x| \, \mathbb{1}\{|x| \ge N^\delta\} \quad \text{for all } x \in \mathbb{R}. \qquad (11) $$

We will use a truncation function $\psi^N$ to modify (9), and will consider minimizing the truncated objective:

$$ \int_\Omega \big[\psi^N(L Q^N(x))\big]^2 \, d\mu(x). \qquad (12) $$

To minimise a loss function like (12), one can apply gradient-descent based methods. However, in practice, computing the gradient of $L Q^N_t(\cdot\,; \theta)$ with respect to $\theta$ does not lead to an algorithm permitting simple analysis, as natural properties of the differential operator $L$ are not directly preserved. Instead, we introduce a biased gradient estimator as follows:

Definition 19: The sequence of functions

$$ G^N(\theta_t) = \int_\Omega \psi^N\big(L Q^N_t(x)\big) \, (\psi^N)'\big(L Q^N_t(x)\big) \, \nabla_\theta Q^N_t(x) \, d\mu(x) \qquad (13) $$

are called the biased gradient estimators for the truncated loss. These can be approximated using Monte Carlo sampling, as

$$ \hat{G}^N_M(\theta_t) = \frac{1}{M} \sum_{i=1}^M \psi^N\big(L Q^N_t(x_i)\big) \, (\psi^N)'\big(L Q^N_t(x_i)\big) \, \nabla_\theta Q^N_t(x_i), \qquad (14) $$

where $x_i$ are independent samples from the distribution $\mu$. Our analysis will focus on the biased gradient estimator $G^N$; however, in our numerical implementation (in order to avoid the curse of dimensionality) we will use the Monte Carlo estimates $\hat{G}^N_M$ for large $M$.
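The example truncation (10) can be implemented directly: for $|x| \le N^\delta$ the integral is available in closed form ($\psi^N(x) = x$), and only the Gaussian tail needs numerical integration. A sketch, in which the grid-based trapezoidal quadrature is an illustrative choice:

```python
import numpy as np

def make_truncation(N, beta=0.75):
    """The example truncation function of equation (10):
    psi^N(x) = int_0^x g^N(y) dy with g^N = 1 on [-N^d, N^d] and a
    Gaussian decay outside, where d = (1 - beta)/4."""
    d = (1.0 - beta) / 4.0
    T = N ** d                        # truncation threshold N^delta

    def g(y):
        y = np.abs(y)
        return np.where(y < T, 1.0, np.exp(-(y - T) ** 2))

    def psi(x):
        if abs(x) <= T:
            return x                  # psi is the identity inside [-T, T]
        # trapezoidal integration of g over [T, |x|], plus the linear part
        ys = np.linspace(T, abs(x), 1001)
        gs = g(ys)
        tail = np.sum(gs[:-1] + gs[1:]) * (ys[1] - ys[0]) / 2.0
        return np.sign(x) * (T + tail)

    return psi, g, T
```

Since the tail integral of $g^N$ is at most $\sqrt{\pi}/2 \approx 0.886$, the bound $|\psi^N| \le 2N^\delta$ holds once $N^\delta \ge 0.886$.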
We shall see that this biased gradient is a continuous-space, PDE analog of the classic Q-learning algorithm. In summary, our Q-PDE algorithm for solving PDEs is:

Algorithm 20 (Q-PDE Algorithm): We fix a family of smooth truncation functions $\{\psi^N\}_{N \in \mathbb{N}}$ and, for each value of $N \in \mathbb{N}$, we proceed as follows:
1. Randomly initialize the parameters $\theta_0$, as specified in Assumption 16.
2. Train the neural network via the biased gradient flow

$$ \frac{d\theta_t}{dt} = \alpha^N_t \, G^N(\theta_t), \qquad (15) $$

with $G^N$ as in (13) and learning rate

$$ \alpha^N_t = \alpha^N = \alpha N^{2\beta - 1}, \qquad (16) $$

where $\alpha$ is a positive constant.

The approximate solution to the PDE at training time $t$ is $Q^N_t$, as defined by (7). In the remainder of this paper, we will study this Q-PDE algorithm for solving PDEs both mathematically and numerically. First, we prove that the neural network model $Q^N_t$ converges to the solution of an infinite-dimensional ODE as the number of hidden units $N \to \infty$. Then, we study the limit ODE, and prove that it converges to the true solution of the PDE as the training time $t \to \infty$. Finally, in Section 6, we demonstrate numerically that the algorithm performs well in practice with a finite number of hidden units.

For high-dimensional PDEs, it is impossible to form mesh-grids to apply conventional finite difference methods. However, our method is mesh-free. The crucial feature for implementing our algorithm in practice is to evaluate the integral term $G^N$. As discussed, we can do this using the Monte Carlo estimate $\hat{G}^N_M$; therefore our algorithm has the potential to solve high-dimensional PDEs.

Remark 21: It is worth observing that, given this choice of biased gradient flow, there is generally no guarantee that our algorithm corresponds to stochastic gradient descent applied to any potential function. The key advantage of the use of this biased gradient is that it separates the derivatives in $\theta$ from the differential operator $L$, which allows the underlying properties of the PDE to be preserved more simply.
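A minimal one-dimensional instance of Algorithm 20 can be sketched as follows. Everything problem-specific here is an illustrative choice, not taken from the paper: the operator $Lu = u'' - u + 1$ on $\Omega = (-1,1)$ with zero boundary data (strongly monotone, with exact solution $u(x) = 1 - \cosh(x)/\cosh(1)$), uniform sampling measure $\mu$, $\sigma = \tanh$, and all hyperparameters. The truncation $\psi^N$ is omitted, as it is the identity for moderate residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
u_true = lambda x: 1.0 - np.cosh(x) / np.cosh(1.0)
eta    = lambda x: 0.5 * (1.0 - x**2)   # auxiliary function of Assumption 2
N, beta = 200, 0.75
scale = N ** (-beta)
c = rng.uniform(-1.0, 1.0, N)           # Assumption 16-style initialization
w = rng.standard_normal(N)
b = rng.standard_normal(N)

def net(x):
    """S^N and its first two spatial derivatives at sample points x."""
    z = np.outer(x, w) + b              # shape (M, N)
    t, s2 = np.tanh(z), 1.0 / np.cosh(z) ** 2
    return (scale * (t @ c), scale * ((s2 * w) @ c),
            scale * ((-2.0 * t * s2 * w**2) @ c), t, s2)

def l2_error():
    xs = np.linspace(-1.0, 1.0, 201)
    return np.sqrt(np.mean((net(xs)[0] * eta(xs) - u_true(xs)) ** 2))

err0 = l2_error()
alpha, M, dt = 1.0, 128, 0.1
lr = alpha * N ** (2.0 * beta - 1.0)    # learning rate (16)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, M)       # samples from mu
    S, Sx, Sxx, t, s2 = net(x)
    e = eta(x)
    Q   = S * e                         # f_bar = 0 for zero boundary data
    Qxx = Sxx * e + 2.0 * Sx * (-x) - S # eta' = -x, eta'' = -1
    LQ  = Qxx - Q + 1.0                 # residual L Q^N(x)
    gc  = scale * t * e[:, None]        # dQ/dc_i = eta * dS/dc_i
    gb  = scale * (c * s2) * e[:, None]
    gw  = gb * x[:, None]
    # Euler step of the biased gradient flow (15), G^N estimated as in (14)
    c += lr * dt * (LQ @ gc) / M
    b += lr * dt * (LQ @ gb) / M
    w += lr * dt * (LQ @ gw) / M
err1 = l2_error()
```

Note that, as in Definition 19, the update direction only involves $\nabla_\theta Q^N$ (first derivatives of the network in $\theta$), even though the residual $LQ^N$ involves second spatial derivatives.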
We will see that this results in particularly simple dynamics in the wide-network limit.

2.3.1 Main results

Our main mathematical results are the following:

Theorem 22 (cf. Theorem 45): Define the kernel

$$ B(x, y) = \eta(x)\eta(y)\,\mathbb{E}_{c,w,b}\Big[ \sigma(w \cdot x + b)\sigma(w \cdot y + b) + c^2 \sigma'(w \cdot x + b)\sigma'(w \cdot y + b)(x \cdot y + 1) \Big], $$

where the expectation is taken with respect to the distribution of the random initialization of $c$, $w$, and $b$ satisfying Assumption 16, and define the corresponding integral operator $(Bv)(y) = \int_\Omega B(x, y)v(x)\,d\mu(x)$. For a single-hidden-layer neural network with $N$ units, as $N \to \infty$ the PDE approximator $Q^N$ trained using the Q-PDE algorithm converges (in $H^2$, for each time $t$) to the deterministic $H^2$-valued dynamical system $dQ_t/dt = \alpha B L(Q_t)$, with $Q_0 = \bar{f}$, and $Q_t|_{\partial\Omega} = f$ for all $t$.

Theorem 23: The limiting dynamical system $Q_t$ has the following convergence properties:
- If $Q_t$ converges in $H^1$ to a fixed point $Q$, then $Q$ is the solution to the PDE (1). (Theorem 46)
- If $L$ is a monotone operator, and $u$ is the solution to the PDE (1), then for almost every sequence $t_k \to \infty$, we have $Q_{t_k} \to u$ in $L^2$. (Theorem 47)
- If $L$ is a Gateaux differentiable monotone operator then, for almost every sequence $t_k \to \infty$, we have $B L Q_{t_k} \to 0$. (Theorem 48)

2.3.2 Similarity to Q-learning

Our training algorithm's biased gradient flow is inspired by the classic Q-learning algorithm in a reinforcement learning setting. The Q-learning algorithm approximates the value function of the optimal policy for Markov decision problems (MDPs). It seeks to minimize the error of a parametric approximator, such as a neural network, $R(\cdot, \cdot\,; \theta)$:

$$ \sum_{(s,a) \in S \times X} [Y(s, a; \theta) - R(s, a; \theta)]^2 \, \pi(s, a). \qquad (18) $$

Here $S$ and $X$ are the finite state and action spaces of the MDP and $\pi$ is a probability mass function which is strictly positive for every $(s, a) \in S \times X$. $Y$ is the target Bellman function

$$ Y(s, a; \theta) = r(s, a) + \gamma \sum_{s' \in S} \max_{a' \in X} R(s', a'; \theta)\, p(s'|s, a). \qquad (19) $$
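For comparison, the tabular special case of this biased update reduces to the familiar temporal-difference rule: the target $Y$ is treated as a constant and only the last term is differentiated. The toy two-state MDP below (states, actions, rewards, transition probabilities, and the uniform exploration policy are all invented for illustration) shows this update; the learner only ever sees sampled transitions, never $p$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (illustrative): 2 states, 2 actions.
n_s, n_a, gamma = 2, 2, 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # rewards r(s, a)
p = np.array([[[0.9, 0.1], [0.1, 0.9]],       # transition kernel p(s' | s, a)
              [[0.5, 0.5], [0.2, 0.8]]])

Q = np.zeros((n_s, n_a))                      # tabular approximator R(s, a)
alpha = 0.1                                   # learning rate
s = 0
for k in range(20000):
    a = rng.integers(n_a)                     # explorative (uniform) policy
    s_next = rng.choice(n_s, p=p[s, a])       # sampled transition
    # biased update: Y = r + gamma * max_a' Q(s', a') is held constant,
    # and the derivative is taken only through -Q(s, a)
    td = r[s, a] + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td
    s = s_next
```

This mirrors the Q-PDE construction: the omitted term $\nabla_\theta Y$ here plays the role of the omitted term $\nabla_\theta L Q^N$ there.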
Q-learning updates the parametric approximator $R$ via $\theta_{k+1} = \theta_k + \alpha_k U_k$, where $U_k$ is a biased estimator of the gradient of $(Y - R)^2$. Since the transition density $p(s'|s, a)$ is unknown, $Y - R$ is estimated using random samples from the Markov chain $(s_k, a_k)$:

$$ Y(s_k, a_k) - R(s_k, a_k; \theta_k) := r(s_k, a_k) + \gamma \max_{a' \in X} R(s_{k+1}, a'; \theta_k) - R(s_k, a_k; \theta_k). \qquad (20) $$

The Q-learning algorithm treats $Y$ as a constant and takes a derivative only with respect to the last term in (20). The Q-learning update direction $U_k$ is:

$$ U_k := \big[ Y(s_k, a_k) - R(s_k, a_k; \theta_k) \big] \, \nabla_{\theta_k} R(s_k, a_k; \theta_k). \qquad (21) $$

We emphasize that the term $\nabla_\theta Y$ is not included, which means $U_k$ is a biased estimator for the direction of steepest descent. Returning to the objective function for the Q-PDE algorithm, the gradient of the loss function (12) is

$$ \int_\Omega \psi^N\big(L Q^N_t(x)\big) \, (\psi^N)'\big(L Q^N_t(x)\big) \, \nabla_\theta L Q^N_t(x)\,d\mu(x). \qquad (22) $$

The update direction $G^N$ in (13) differs from (22), as $\nabla_\theta L Q^N_t(x)$ has been replaced with $\nabla_\theta Q^N_t(x)$. This is, conceptually, the same update direction as used in the Q-learning algorithm.

3. Preliminary analysis of the training algorithm

We begin our analysis of the training algorithm (15) by proving several useful bounds; in particular, that the neural network parameters $c_t$ and $w_t$ are bounded in expectation. The proof is provided in Appendix A.2. We recall that $n$ is the dimension of our underlying space, that is, $\Omega \subset \mathbb{R}^n$.

Lemma 24 (Boundedness of parameters): For all $T > 0$, there exists a deterministic constant $C > 0$ such that, for all $N \in \mathbb{N}$, all $0 \le t \le T$, all $i \in \{1, 2, \dots, N\}$, and all $k \in \{1, 2, \dots, n\}$,

$$ |c^i_t| < C, \quad \mathbb{E}|(w^i_t)_k| < C, \quad \text{and} \quad \mathbb{E}|b^i_t| < C. \qquad (23) $$

3.0.1 Dynamics of $Q^N_t$

We next consider the evolution of the output of the neural network $Q^N_t$ as a function of the training time $t$. Define $B^N_t$ as the symmetric, positive semi-definite kernel function

$$ B^N_t(x, y) = \eta(x)\eta(y) A^N_t(x, y), \qquad (24) $$

where $A^N_t$ is the neural tangent kernel (Jacot et al.
(2018)) $\nabla_\theta S^N_t(x)\cdot\nabla_\theta S^N_t(y)$:
$$A^N_t(x,y) = \frac{1}{N}\sum_{i=1}^N\Big[\sigma(w^i_t\cdot x+b^i_t)\,\sigma(w^i_t\cdot y+b^i_t) + (c^i_t)^2\,\sigma'(w^i_t\cdot x+b^i_t)\,\sigma'(w^i_t\cdot y+b^i_t)\,(x\cdot y+1)\Big].$$
For any $y\in\Omega$, we can derive the evolution of the value $Q^N_t(y)$ using the chain rule:
$$\frac{dQ^N_t(y)}{dt} = \nabla_\theta Q^N_t(y)\cdot\frac{d\theta_t}{dt}. \qquad (26)$$
The RHS of (26) can be evaluated using (7), (15), and Fubini's theorem, which yields
$$\frac{dQ^N_t(y)}{dt} = \alpha\int_\Omega F^N\big(\mathcal{L}Q^N_t(x)\big)\,B^N_t(x,y)\,d\mu(x), \qquad (27)$$
where $F^N$ is as in Definition 17.

3.0.2 Limit kernel

Equation (27) shows that the kernel $B^N$ has a key role in the dynamics of $Q^N$. We now characterize the limit of $B^N$ as $N\to\infty$. In Section 4.1, we will study this convergence in detail. At time $t=0$, the parameters $c^i_0, w^i_0, b^i_0$ are independently sampled. Therefore, for each $(x,y)\in\Omega$, by the strong law of large numbers $A^N_0(x,y)$ converges almost surely to $A(x,y)$, where
$$A(x,y) = \mathbb{E}_{c,w,b}\Big[\sigma(w\cdot x+b)\,\sigma(w\cdot y+b) + c^2\,\sigma'(w\cdot x+b)\,\sigma'(w\cdot y+b)\,(x\cdot y+1)\Big]. \qquad (28)$$
It follows that $B^N_0(x,y)$ converges almost surely to $B(x,y)$:
$$\lim_{N\to\infty} B^N_0(x,y) = B(x,y) := \eta(x)\eta(y)A(x,y). \qquad (29)$$
We can write
$$B(x,y) = \eta(x)\eta(y)\,\mathbb{E}_{c,w,b}\big[\sigma(w\cdot x+b)\,\sigma(w\cdot y+b) + c^2\,U(x)\cdot U(y)\big], \qquad (30)$$
where the ($w,b$-dependent) vector function $U$ is given by $U(x) = \sigma'(w\cdot x+b)\,(x^\top, 1)^\top$. As $\sigma\in C^4_b(\mathbb{R})$, we know that $A$, $A^N$, $B$ and $B^N$ are all uniformly continuous in $(x,y)$, so the above a.s. convergences hold for all $(x,y)\in\bar\Omega$ simultaneously.

Lemma 25 There exists a constant $M > 0$ such that, for any $(x,y)\in\bar\Omega\times\bar\Omega$,
$$|B(x,y)| + \sum_k \big|\partial_{y_k} B(x,y)\big| + \sum_{k,l}\big|\partial^2_{y_k y_l} B(x,y)\big| \le M. \qquad (31)$$
Proof Taking $B$ as defined in (29), as $\eta\in C^3_b$, $\sigma\in C^4_b$ and each term of $\{c^i_t\}$ is uniformly bounded, we have $|B(x,y)| < C$. The partial derivative of $B$ is
$$\partial_{y_k} B(x,y) = \eta(x)\,\partial_{y_k}\eta(y)\,A(x,y) + \eta(x)\eta(y)\,\partial_{y_k} A(x,y) = \eta(x)\,\partial_{y_k}\eta(y)\,A(x,y) + \eta(x)\eta(y)\,\mathbb{E}_{c,w,b}\Big[\sigma(w\cdot x+b)\,\sigma'(w\cdot y+b)\,w_k + c^2\,\sigma'(w\cdot x+b)\big[w_k\,\sigma''(w\cdot y+b)\,(x\cdot y+1) + \sigma'(w\cdot y+b)\,x_k\big]\Big].$$
As before, our assumptions on $\eta$, $\sigma$ and $\{c^i_t\}$ guarantee that the terms in $\partial_{y_k} B(x,y)$ are uniformly bounded for all $x,y\in\bar\Omega$, so $|\partial_{y_k} B(x,y)| < C$; similarly for $\partial^2_{y_ky_l} B(x,y)$.
The result follows.

Given a kernel $B$, it is natural to define the corresponding integral operator, $v\mapsto\int_\Omega B(x,y)v(x)\,d\mu(x)$. It will be convenient to define this on a slightly larger space than $L^2$, to account for the effect of the function $\eta$.

Definition 26 We define the function space $L^2_\eta := \{f:\Omega\to\mathbb{R}\,|\,\eta f\in L^2\}$, and observe (as $\eta$ is bounded) that $L^2\subset L^2_\eta$.

Definition 27 The operator $\mathcal{B}: L^2_\eta\to L^2$ is defined by $(\mathcal{B}v)(y) = \int_\Omega B(x,y)v(x)\,d\mu(x)$ with $B$ as in (29).

Lemma 28 For any $v\in L^2_\eta$, we have $\mathcal{B}v\in H^2_{(0)}$, or more specifically, $\mathcal{B}v\in C^2_b(\Omega)$ and $(\mathcal{B}v)(y) = 0$ for all $y\in\partial\Omega$.

Proof From the definition of $\mathcal{B}v$, we have $(\mathcal{B}v)(y) = \eta(y)\int_\Omega A(x,y)\eta(x)v(x)\,d\mu(x)$. Lemma 25 and the smoothness of $\sigma$ and $\eta$ clearly imply that $\mathcal{B}v\in C^2_b(\Omega)\subset H^2$. We know that $\eta(x) = 0$ for $x\in\partial\Omega$, so it is clear that $(\mathcal{B}v)(y) = 0$ for all $y\in\partial\Omega$. As $H^1_0$ is the kernel (in $H^1$) of the boundary trace operator, and for smooth functions the trace operator is simply evaluation of the function on the boundary, it follows that $\mathcal{B}v\in H^1_0$. We conclude that $\mathcal{B}v\in H^2_{(0)} = H^2\cap H^1_0$.

Lemma 29 The linear map $\mathcal{B}: L^2_\eta\to H^2$ is Lipschitz; in particular, there exists $C>0$ such that, for any $v\in L^2_\eta$, we have $\|\mathcal{B}v\|_{H^2}\le C\|\eta v\|_{L^2}$.

Proof By definition, with $D^\alpha$ denoting the differential operator with respect to the $y$ argument, $\|\mathcal{B}v\|^2_{H^2} = \sum_{|\alpha|\le2}\|D^\alpha\mathcal{B}v\|^2_{L^2}$. By the boundedness of $\eta$, $A$ and their derivatives, there exists a constant $K$ such that, for any multi-index $\alpha$, $|D^\alpha[\eta(y)A(x,y)]| < K$ for any $x,y\in\bar\Omega$. Hence, exchanging the order of integration (with respect to $x$) and differentiation (with respect to $y$) by the dominated convergence theorem,
$$\big|D^\alpha\mathcal{B}v(y)\big| = \Big|D^\alpha\int_\Omega\eta(y)A(x,y)\eta(x)v(x)\,d\mu(x)\Big| = \Big|\int_\Omega D^\alpha[\eta(y)A(x,y)]\,\eta(x)v(x)\,d\mu(x)\Big| \le \int_\Omega\big|D^\alpha[\eta(y)A(x,y)]\,\eta(x)v(x)\big|\,d\mu(x) \le K\int_\Omega|\eta(x)v(x)|\,d\mu(x).$$
By Jensen's inequality, there is a constant $C>0$ such that
$$\big|D^\alpha\mathcal{B}v(y)\big|^2 \le K^2\Big(\int_\Omega|\eta(x)v(x)|\,d\mu(x)\Big)^2 \le C\int_\Omega|\eta(x)v(x)|^2\,d\mu(x) = C\|\eta v\|^2_{L^2}. \qquad (34)$$
Integrating over $\Omega$, we see that $\|D^\alpha\mathcal{B}v\|^2_{L^2}\le C\mu(\Omega)\|\eta v\|^2_{L^2}$. Summing over indices $\alpha$ yields the result.
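Since $B$ is defined through an expectation over the initialization, it can be estimated by plain Monte Carlo. The sketch below (sigmoid activation, Gaussian $w,b$, Rademacher $c$, and all sizes are illustrative assumptions) checks two properties used throughout: symmetry, and positive semi-definiteness of Gram matrices built from the kernel.

```python
import numpy as np

# Monte Carlo sketch of the limit kernel A(x, y) in (28) and
# B(x, y) = eta(x) * eta(y) * A(x, y), on the unit ball with eta(x) = 1 - |x|^2.
# The initialization law used here is an assumption for illustration only.
rng = np.random.default_rng(1)
n, N = 2, 20_000                                 # dimension, number of units

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # sigma'
eta = lambda x: 1.0 - np.dot(x, x)               # vanishes on the unit sphere

w = rng.standard_normal((N, n))
b = rng.standard_normal(N)
c = rng.choice([-1.0, 1.0], N)

def B_hat(x, y):
    """Empirical average over units; by the LLN this tends to B(x, y)."""
    zx, zy = w @ x + b, w @ y + b
    A = np.mean(sigma(zx) * sigma(zy)
                + c**2 * dsigma(zx) * dsigma(zy) * (x @ y + 1.0))
    return eta(x) * eta(y) * A

# A symmetric kernel whose Gram matrix on any finite point set is PSD.
pts = rng.uniform(-0.5, 0.5, (6, n))
G = np.array([[B_hat(p, q) for q in pts] for p in pts])
```

The positive semi-definiteness of each empirical Gram matrix mirrors the positive definiteness of $\mathcal{B}$ established in Lemma 35 below.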
Our next challenge is to prove that the image of B is dense in H2 (0). This will allow us to represent the solutions to our PDE in terms of B, and is closely related to the universal approximation theorems of Cybenko (1989), Hornik (1991), and others. Lemma 30 For a function g H2(Ω), if the equality Z |α| 2 Dασ(w x + b)Dαg(x)dµ(x) = 0 (35) holds for all w Rn, b R, then g 0. Proof Since µ is a finite measure and σ C2 b (cf. Assumption 14) and non-constant, from Theorem 4 in Hornik (1991) we have that the linear span of {σ(w x + b)}w,b R is dense in H2. Hence, there is no nontrivial element of H2 orthogonal to σ(w x + b) for all w, b. In other words, for g H2, if the inner product σw,b, g H2 = 0 for all w, b, then g = 0, which is the stated result. Lemma 31 Define S : H2 H2 by (Sh)(y) := R Ω P |α| 2 Dα xh(x)Dα x A(x, y)dµ(x). Then Sh = 0 if and only if h = 0. Proof We first observe that, as σ C4 b (R), Dα x A(x, y) is clearly a C2 b function of y for all x. Consider the inner product between Sh and h on H2: Sh, h H2 = X |α| 2 DαSh, Dαh L2 |α1| 2,|α2| 2 Dα1 x h(x)Dα2 y h(y) Dα1 x Dα2 y A(x, y) dµ(x)dµ(y) |α1| 2,|α2| 2 Dα1 x h(x)Dα2 y h(y) Dα1 x Dα2 y h σ(w x + b)σ(w y + b) + c2U(x) U(y) i dµ(x)dµ(y) . Cohen, Jiang, Sirignano Notice that |α| 2 Dα xh(x)Dα x U(x)dµ(x) 2 |α1| 2,|α2| 2 Dα1 x h(x)Dα2 y h(y) Dα1 x Dα2 y (c2U(x) U(y)) dµ(x)dµ(y) (37) Combining (36) with (37) we derive Sh, h H2 Ec,w,b |α| 2 Dα xh(x)Dα xσ(w x + b)dµ(x) 2 0. (38) By Lemma 30, for h = 0, there exists ϵ > 0, w Rn, b R such that 0 < ϵ = R Ω P |α| 2 Dα xh(x)Dα xσ(w x + b )dµ(x) . As this integral is continuous with respect to the parameters w, b, there exists δ > 0 such that for any (w, b) Rδ := {(w, b) : w w + |b b | δ}, we know R Ω P |α| 2 Dα xh(x)Dα xσ(w x + b)dµ(x) ϵ 2. Therefore, from (38) Sh, h H2 Ec,w,b |α| 2 Dα xh(x)Dα xσ(w x + b)dµ(x) 2 Ec,w,b hϵ2 4 1Rδ i > 0. (39) Lemma 32 Define the operator A : L2 H2 by (Af)(y) = R ΩA(x, y)f(x)dµ(x). 
The image of A is dense in H2, that is, for any u H2, there exists a sequence {vk}k N in L2 such that limk Avk u H2 = 0. Proof By the smoothness of the kernel A, we have im(A) := {Av : v L2} H2. By definition, for any h, g H2, the inner product between Ah and g in H2 space is Ag, h H2 = X |α| 2 DαAg, Dαh L2 = Z |α| 2 Dα xh(x) Z Ω Dα x A(x, y)g(y)dµ(y)dµ(x) |α| 2 Dα xh(x)Dα x A(x, y)dµ(x)dµ(y) = g, Sh We introduce the adjoint operator of A on H2, denoted A , and write Ag, h H2 = g, A h H2 = g, Sh L2. (41) By Lemma 31, Sh = 0 if and only if h = 0. Therefore, by setting g = Sh in (41), we have g, A h H2 = Sh, Sh L2 > 0 for any non-zero h. Thus, ker(A ) := {h H2 : A h = 0} = {0}. Write im(A) for the closure in H2 of im(A). We recall that the ortho-complement in H2 of im(A) is ker(A ), so we can decompose the space H2 as H2 = im(A) ker(A ). We know that ker(A ) = {0} in H2, and hence conclude that im(A) is dense in H2. Neural Q-learning for solving PDEs Theorem 33 (Density in H2 (0)) The image of B is dense in H2 (0), that is, for any u H2 (0), there exists a sequence {vk}k N in L2 η such that limk Bvk u H2 = 0. Proof For a function u C3(Ω) C0(Ω), we define the function u = u/η. By Lemma 3(ii), u C2 b (Ω) H2. Since the image of A is dense in H2, there exists a sequence wk H2 such that Awk u H2 0. Consequently, the boundedness of η and its derivatives shows η(Awk u) H2 = B wk/η u H2 0. Since C3(Ω) C0(Ω) is dense in H2 (0) (under the H2 topology, Lemma 3(i)), we conclude that the image of B is dense in H2 (0). Remark 34 The above result shows that it is, in principle, possible to approximate the PDE solution u using an infinite neural network, with the boundary condition enforced by η. It also justifies the boundary-matching method proposed by Mc Fall and Mahan (2009). In particular, as u (1 η)f H2 (0), there exists a sequence vk L2 η such that Bvk+(1 η)f u in H2. Of course, whether a particular training algorithm yields such a sequence is a separate question. 
We collect the remaining key properties of $\mathcal{B}$ into the following lemma.

Lemma 35 The linear operator $\mathcal{B}: L^2_\eta\to H^2$ has the following properties:
1. $\mathcal{B}$ is strictly positive definite, and induces a norm $\|\cdot\|_{\mathcal{B}}$ on $L^2_\eta$, given by $\|v\|_{\mathcal{B}} := \sqrt{\langle\mathcal{B}v,v\rangle_{L^2}}$. In particular, $\|v\|_{\mathcal{B}} < \infty$ for all $v\in L^2_\eta$.
2. There exists a constant $\lambda > 0$ such that $\|\mathcal{B}v\|^2_{L^2}\le\lambda\|v\|^2_{\mathcal{B}}$.

Proof For $v\in L^2_\eta$, let $g := \eta v\in L^2$. By definition,
$$\langle v,\mathcal{B}v\rangle = \langle v,\eta\mathcal{A}g\rangle = \langle g,\mathcal{A}g\rangle = \int_{\Omega^2} g(x)g(y)A(x,y)\,d\mu(x)d\mu(y) = \int_{\Omega^2} g(x)g(y)\,\mathbb{E}_{c,w,b}\big[\sigma(w\cdot x+b)\sigma(w\cdot y+b)\big]\,d\mu(x)d\mu(y) + \int_{\Omega^2} g(x)g(y)\,\mathbb{E}_{c,w,b}\big[c^2U(x)\cdot U(y)\big]\,d\mu(x)d\mu(y) = \mathbb{E}_{c,w,b}\bigg[\Big(\int_\Omega\sigma(w\cdot x+b)g(x)\,d\mu(x)\Big)^2 + \Big\|c\int_\Omega U(x)g(x)\,d\mu(x)\Big\|^2\bigg] \ge \mathbb{E}_{c,w,b}\bigg[\Big(\int_\Omega\sigma(w\cdot x+b)g(x)\,d\mu(x)\Big)^2\bigg] \ge 0.$$
For $v\neq0$, we know $g\neq0$. With Assumption 14, by Theorem 5 in Hornik (1991), there exist $w^*\in\mathbb{R}^n$, $b^*\in\mathbb{R}$ such that $\int_\Omega\sigma(w^*\cdot x+b^*)g(x)\,d\mu(x) = \epsilon > 0$. As the integral is continuous with respect to the parameters $w,b$, there exists $\delta > 0$ such that for any $(w,b)$ in the set $R_\delta := \{(w,b): \|w-w^*\|+|b-b^*| < \delta\}$, we have $\big|\int_\Omega\sigma(w\cdot x+b)g(x)\,d\mu(x)\big| > \epsilon/2$. Therefore
$$\langle v,\mathcal{B}v\rangle \ge \mathbb{E}_{c,w,b}\bigg[\Big(\int_\Omega\sigma(w\cdot x+b)g(x)\,d\mu(x)\Big)^2\bigg] \ge \mathbb{E}_{c,w,b}\Big[\mathbf{1}_{R_\delta}\,\frac{\epsilon^2}{4}\Big] > 0;$$
so $\mathcal{B}$ defines a norm as stated.

From the above calculations, we also see that $\mathcal{A}$ is a positive-definite Hilbert-Schmidt integral operator on $L^2$. Therefore, its eigenfunctions span the entire $L^2$ space and the spectral theorem applies; in particular, $\mathcal{A}$ has nonnegative eigenvalues bounded above. Letting $\lambda$ be the supremum of the eigenvalues of $\mathcal{A}$, we have $\|\mathcal{A}g\|^2_{L^2}\le\lambda\langle g,\mathcal{A}g\rangle$. Thus, as $|\eta(x)|\le1$, we know
$$\|\mathcal{B}v\|^2_{L^2} = \|\eta\mathcal{A}(\eta v)\|^2_{L^2} \le \|\mathcal{A}(\eta v)\|^2_{L^2} \le \lambda\langle\eta v,\mathcal{A}(\eta v)\rangle = \lambda\langle v,\mathcal{B}v\rangle = \lambda\|v\|^2_{\mathcal{B}}.$$

4. Convergence to the limit ODE

In this section, we seek to understand the behavior of our approximator $Q^N_t$ when $N\to\infty$. We prove that the pre-limit process $\{Q^N_t(\cdot)\}_{0\le t\le T}$ converges to a limiting process $\{Q_t(\cdot)\}_{0\le t\le T}$ in an appropriate space of functions. In this section, we will only use Assumptions 13(i-iii); in particular, we will not need the assumption that $\mathcal{L}$ is monotone or that a solution to the PDE exists in $H^2$.
The challenge here is that our operator $\mathcal{L}$ is not generally the gradient of any potential function, and is an unbounded operator (in the $L^2$ or supremum norms), so some care is needed. In the following subsections, we first bound the difference between the kernels $B^N_t(x,y)$ and $B(x,y)$ in a convenient sense. We then show that, as $N$ becomes large, the neural network approximator converges to the solution of an infinite-dimensional ODE. Our main convergence result is Theorem 45.

4.1 Characterizing the difference between kernels

We characterize the (second-order) difference between two kernels at $(x,y)$ by
$$H(A_1,A_2)(x,y) := |A_1(x,y)-A_2(x,y)| + \sum_k\big|\partial_{y_k}A_1(x,y)-\partial_{y_k}A_2(x,y)\big| + \sum_{k,l}\big|\partial^2_{y_ky_l}A_1(x,y)-\partial^2_{y_ky_l}A_2(x,y)\big|. \qquad (43)$$
Note that partial derivatives are only taken with respect to the $y$ components. From Lemma 25, there exists a constant $M$ such that, for all $x,y\in\bar\Omega$,
$$H(B,0)(x,y)\le M. \qquad (44)$$
Similarly, the (second-order) difference between two smooth functions is characterized by
$$G(f_1,f_2)(x) := |f_1(x)-f_2(x)| + \sum_k\big|\partial_{x_k}f_1(x)-\partial_{x_k}f_2(x)\big| + \sum_{k,l}\big|\partial^2_{x_kx_l}f_1(x)-\partial^2_{x_kx_l}f_2(x)\big|. \qquad (45)$$

4.1.1 Difference between the kernels $B^N_t(x,y)$ and $B^N_0(x,y)$

In this subsection we characterize the difference between the kernels $B^N_t$ and $B^N_0$. We denote the expectation taken with respect to the randomized initialization by $\mathbb{E}_{c_0,w_0,b_0}[\cdot]$. The proofs of the following lemmas are included in the appendix.

Lemma 36 There exists $C > 0$ such that, for all $i,j\in\{1,2,\dots,n\}$, all $0\le t\le T$, and all $N > 0$,
$$\big|A^N_t(x,y)-A^N_0(x,y)\big| \le CN^{\delta+\beta-1}, \qquad (46)$$
$$\big|\partial_{y_i}A^N_t(x,y)-\partial_{y_i}A^N_0(x,y)\big| \le CN^{\delta+\beta-1}\Big(1+\frac{1}{N}\sum_{k=1}^N|(w^k_0)_i|\Big), \qquad (47)$$
$$\big|\partial^2_{y_iy_j}A^N_t(x,y)-\partial^2_{y_iy_j}A^N_0(x,y)\big| \le CN^{\delta+\beta-2}\sum_{k=1}^N\big(1+|(w^k_0)_i|\big)\big(1+|(w^k_0)_j|\big). \qquad (48)$$

Lemma 37 There exists $C > 0$ such that, for all $0\le t\le T$ and all $N > 0$, the expected difference between $A^N_t$ and $A^N_0$ satisfies
$$\mathbb{E}_{c_0,w_0,b_0}\Big[\big(H(A^N_t,A^N_0)(x,y)\big)^2\Big] \le CN^{2(\delta+\beta-1)}. \qquad (49)$$

After characterizing the difference between $A^N_t$ and $A^N_0$, we estimate the difference between the kernels $B^N_t$ and $B^N_0$.
Lemma 38 There exists C > 0 such that, for all 0 t T and all N > 0 H(BN t , BN 0 )(x, y) 2dµ(x)dµ(y) CN2(2δ+β 1). (50) Proof Since BN t (x, y) = η(x)η(y)AN t (x, y), by the Cauchy Schwarz inequality and the assumption η C3 b , for some constant C > 0, H(BN t , BN 0 )(x, y) 2 C |BN t (x, y) BN 0 (x, y)|2 + X k | yk BN t (x, y) yk BN 0 (x, y)|2 k,l | ykyl BN t (x, y) ykyl BN 0 (x, y)|2 . (51) By the product rule, we have yi BN t (x, y) = η(x)AN t (x, y) yiη(y) + η(x)η(y) yi AN t (x, y), (52) 2 yiyj BN t (x, y) = η(x)AN t (x, y) 2 yiyjη(y) + η(x) yj AN t (x, y) yiη(y) + yi AN t (x, y) yjη(y) + η(x)η(y) 2 yiyj AN t (x, y). (53) As η C3 b , there exists a constant C > 0 independent of training time t, such that |BN t (x, y) BN 0 (x, y)| C|AN t (x, y) AN 0 (x, y)|, (54) | yi BN t (x, y) yi BN 0 (x, y)| C |AN t (x, y) AN 0 (x, y)| + | yi AN t (x, y) yi AN 0 (x, y)| , Cohen, Jiang, Sirignano | 2 yiyj BN t (x, y) 2 yiyj BN 0 (x, y)| C h |AN t (x, y) AN 0 (x, y)| + | yi AN t (x, y) yi AN 0 (x, y)| + | yj AN t (x, y) yj AN 0 (x, y)| + | 2 yiyj AN t (x, y) 2 yiyj AN 0 (x, y)| i . Substituting (54), (55) and (56) in (51), H(BN t , BN 0 )(x, y) 2 C |AN t (x, y) AN 0 (x, y)|2 + X k | yk AN t (x, y) yk AN 0 (x, y)|2 k,l | 2 ykyl AN t (x, y) 2 ykyl AN 0 (x, y)|2 . (57) By Lemma 37, there exists a constant C > 0 such that for all N N and any time 0 t T Ec0,w0,b0 H(BN t , BN 0 ) 2 CN2(δ+β 1). Thus, by Tonelli s theorem we have H(BN t , BN 0 )(x, y) 2dµ(x)dµ(y) CN2(δ+β 1). (58) Multiplying by N2δ on both sides concludes the result. 4.1.2 Difference between BN 0 (x, y) and B(x, y) We now characterize the difference between the kernels BN 0 and B. The law of large numbers implies BN 0 is converging to B almost surely; here we obtain a bound on its speed of convergence. Again, the proof of the following lemma is given in the appendix. 
Lemma 39 There exists C > 0 such that, for any j, k, l {1, 2, ..., n}, N > 0, Ec0,w0,b0 h AN 0 (x, y) A(x, y) 2i C Ec0,w0,b0 h yi AN 0 (x, y) yi A(x, y) 2i C Ec0,w0,b0 h 2 yiyj AN 0 (x, y) 2 yiyj A(x, y) 2i C Lemma 40 There exists C > 0 such that, for all N > 0, H(BN 0 , B)(x, y) 2dµ(x)dµ(y) C N1 2δ . (60) Proof Similar to (51), we derive H(BN 0 , B)(x, y) 2 C |AN 0 (x, y) A(x, y)|2 + X k | yk AN 0 (x, y) yk A(x, y)|2 k,l | 2 ykyl AN 0 (x, y) 2 ykyl A(x, y)|2 . (61) Neural Q-learning for solving PDEs Therefore, Lemma 39 guarantees Ec0,w0,b0 H(BN 0 , B)(x, y) 2 C/N. By Tonelli s theorem, we have H(BN 0 , B)(x, y) 2dµ(x)dµ(y) Ω2 N2δEc0,w0,b0 h H(BN 0 , B)(x, y) 2i dµ(x)dµ(y) C N1 2δ . (62) 4.1.3 Difference between kernel BN t (x, y) and B(x, y) Combining the results from the above two subsections, we provide one of our key lemmas. Lemma 41 The kernels BN t and B satisfy lim N Ec0,w0,b0 H(BN u , B)(x, y) 2dµ(x)dµ(y)du = 0. (63) Proof By the triangle inequality, H(BN u , B)(x, y) H(BN u , BN 0 )(x, y) + H(BN 0 , B)(x, y). (64) Therefore, H(BN u , B)(x, y) 2 2 H(BN u , BN 0 )(x, y) 2 + 2 H(BN 0 , B)(x, y) 2. (65) Combining Lemmas 38 and 40, we have, for any 0 t T, N2δEc0,w0,b0 H(BN u , B)(x, y) 2dµ(x)dµ(y) CN2(2δ+β 1) + CN2δ 1. (66) Integrating with respect to time, H(BN u , B)(x, y) 2dµ(x)dµ(y)du CT(N2(2δ+β 1) + N2δ 1), (67) which converges to zero as N . 4.2 Convergence of initial approximator QN 0 to Q0 In this subsection, we show that the randomly initialized approximator QN 0 converges to its limit, given by Q0 := f as the number of hidden units N goes to infinity. Lemma 42 The initial approximator QN 0 satisfies lim N Ec0,w0,b0 QN 0 Q0 2 H2 = 0. 
Proof By definition, for any indices $k,l$,
$$Q^N_0(x) - Q_0(x) = \frac{\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma(w^i_0\cdot x+b^i_0),$$
$$\partial_{x_k}(Q^N_0-Q_0)(x) = \frac{\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma'(w^i_0\cdot x+b^i_0)(w^i_0)_k + \frac{\partial_{x_k}\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma(w^i_0\cdot x+b^i_0),$$
$$\partial^2_{x_kx_l}(Q^N_0-Q_0)(x) = \frac{\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma''(w^i_0\cdot x+b^i_0)(w^i_0)_k(w^i_0)_l + \frac{\partial_{x_l}\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma'(w^i_0\cdot x+b^i_0)(w^i_0)_k + \frac{\partial_{x_k}\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma'(w^i_0\cdot x+b^i_0)(w^i_0)_l + \frac{\partial^2_{x_kx_l}\eta(x)}{N^\beta}\sum_{i=1}^N c^i_0\,\sigma(w^i_0\cdot x+b^i_0).$$
As $\{c^i_0, w^i_0, b^i_0\}$ are independent for different $i\in\{1,2,\dots,N\}$, we have, for a constant $C > 0$ which may vary from line to line,
$$\mathbb{E}_{c_0,w_0,b_0}\Big[\big|Q^N_0(x)-Q_0(x)\big|^2\Big] = \frac{\eta(x)^2}{N^{2\beta-1}}\,\mathbb{E}_{c,w,b}\big[c^2\sigma(w\cdot x+b)^2\big] \le \frac{C}{N^{2\beta-1}}, \qquad (70)$$
$$\mathbb{E}_{c_0,w_0,b_0}\Big[\big|\partial_{x_k}Q^N_0(x)-\partial_{x_k}Q_0(x)\big|^2\Big] \le \frac{2}{N^{2\beta-1}}\,\mathbb{E}_{c,w,b}\Big[\eta(x)^2c^2\sigma'(w\cdot x+b)^2w_k^2 + (\partial_{x_k}\eta(x))^2c^2\sigma(w\cdot x+b)^2\Big] \le \frac{C}{N^{2\beta-1}}, \qquad (71)$$
$$\mathbb{E}_{c_0,w_0,b_0}\Big[\big|\partial^2_{x_kx_l}Q^N_0(x)-\partial^2_{x_kx_l}Q_0(x)\big|^2\Big] \le \frac{C}{N^{2\beta-1}}. \qquad (72)$$
Summing over all indices, then integrating over $\Omega$, gives $\mathbb{E}_{c_0,w_0,b_0}\big[\|Q^N_0-Q_0\|^2_{H^2}\big]\le CN^{1-2\beta}$. As $\beta > 1/2$, we have the desired convergence.

4.3 Convergence of the approximator $Q^N_t$ to $Q_t$

Definition 43 (Wide network limit) For $t\ge0$, we define the wide network limit $Q_t$ by the infinite-dimensional ODE
$$\frac{dQ_t}{dt}(y) = \alpha\int_\Omega\mathcal{L}(Q_t)(x)\,B(x,y)\,d\mu(x) \quad\text{for all } y\in\Omega, \qquad (73)$$
with initial value $Q_0(y) = f(y)$ for all $y\in\Omega$. Equivalently, we can write
$$\frac{dQ_t}{dt} = \alpha\,\mathcal{B}\mathcal{L}(Q_t), \qquad Q_0 = f. \qquad (74)$$

Theorem 44 The limit ODE (74) admits a unique solution in $H^2$.

Proof From Lemma 29, $\mathcal{B}$ is uniformly Lipschitz as a map $L^2\to H^2$. By Assumption 6, $\mathcal{L}$ is uniformly Lipschitz as a map $H^2\to L^2$. Hence, the operator $\mathcal{B}\mathcal{L}: H^2\to H^2$ is uniformly Lipschitz. From the Picard-Lindelöf theorem (see, for example, Theorem 2.2.1 in Kolokoltsov (2019)), we know that (74) admits a unique solution with $Q_t\in H^2$ for all $t\ge0$.

We next prove that $Q^N_t$ converges to $Q_t$, justifying the name "wide network limit".

Theorem 45 (Convergence to the limit process $Q_t$) For any fixed time $0\le t\le T$,
$$\lim_{N\to\infty}\mathbb{E}_{c_0,w_0,b_0}\big[\|Q^N_t-Q_t\|_{H^2}\big] = 0. \qquad (75)$$
In particular, the boundary value of $Q_t$ is given by $Q_t|_{\partial\Omega} = f$.
Proof From the dynamics of QN and Q ((27) and (73)) we have QN t (y) QN 0 (y) = α Z t Ω F N(LQN u (x))BN u (x, y)dµ(x)du, (76) Qt(y) Q0(y) = α Z t Ω LQu(x)B(x, y)dµ(x)du. (77) Differentiate QN t (y) and Qt(y) with respect to yk to obtain yk QN t (y) yk QN 0 (y) = α Z t Ω F N(LQN u (x)) yk BN u (x, y)dµ(x)du, (78) yk Qt(y) yk Q0(y) = α Z t Ω LQu(x) yk B(x, y)dµ(x)du. (79) Twice-differentiate QN t and Qt with respect to yk and yl to obtain: 2 ykyl QN t (y) 2 ykyl QN 0 (y) = α Z t Ω F N(LQN u (x)) 2 ykyl BN u (x, y)dµ(x)du, (80) 2 ykyl Qt(y) 2 ykyl Q0(y) = α Z t Ω LQu(x) 2 ykyl B(x, y)dµ(x)du. (81) Subtracting (76) from (77), we have |QN t (y) Qt(y)| |QN 0 (y) Q0(y)| Ω F(QN u )(x)BN u (x, y) LQu(x)B(x, y)dµ(x) du Ω F N(LQN u (x)) BN u (x, y) B(x, y) dµ(x) du F N(LQN u (x)) F N(LQu(x) B(x, y)dµ(x) du F N(LQu(x)) LQu(x) B(x, y)dµ(x) du. Cohen, Jiang, Sirignano Subtracting (78) from (79), we have | yk QN t (y) yk Qt(y)| | yk QN 0 (y) yk Q0(y)| Ω F N(LQN u (x)) yk BN u (x, y) yk B(x, y)(x, y) dµ(x) du F N(LQN u (x)) F N(LQu(x) yk B(x, y)dµ(x) du F N(LQu(x)) LQu(x) yk B(x, y)dµ(x) du. Similarly, subtracting (80) from (81), we have | 2 ykyl QN t (y) 2 ykyl Qt(y)| | 2 ykyl QN 0 (y) 2 ykyl Q0(y)| Ω F N(LQN u (x)) 2 ykyl BN u (x, y) 2 ykyl B(x, y)(x, y) dµ(x) du F N(LQN u (x)) F N(LQu(x) 2 ykyl B(x, y)dµ(x) du F N(LQu(x)) LQu(x) 2 ykyl B(x, y)dµ(x) du. As stated in Definition 17, the function F N = (ψN) (ψN) satisfies a global Lipschitz condition. Therefore, for some C > 0, F N(LQN t )(x) F N(LQt)(x)) < C LQN t (x) LQt(x) C |QN t (x) Qt(x)| + X k | xk QN t (x) xk Qt(x)| + X k,l | 2 xkxl QN t (x) 2 xkxl Qt(x)| = CG(QN t , Qt)(x). (85) Summing (82), (83) and (84) over all indices k and l, we have G(QN t , Qt)(y) = |QN t (y) Qt(y)| + X k | xk QN t (y) xk Qt(y)| + X k,l | 2 xkxl QN t (y) 2 xkxl Qt(y)| G(QN t , Q0)(y) + CNδ Z t Ω H(BN u , B)(x, y)dµ(x)du Ω G(QN u , Qu)(x)H(B, 0)(x, y)dµ(x)du Ω |F N(LQu(x)) LQu(x)|H(B, 0)(x, y)dµ(x)du. 
Here, the coefficient Nδ comes from the term F N, as by Assumption 17, |F N| = |(ψN) (ψN) | 2Nδ. As mentioned in Remark 18, we also know |F N(x) x| 2|x|1{|x| Nδ}. As Neural Q-learning for solving PDEs H(B, 0) is uniformly bounded (Lemma 25), from inequality (86), for some C > 0 we have G(QN t , Qt)(y) G(QN t , Q0)(y) + CNδ Z t Ω H(BN u , B)(x, y)dµ(x)du Ω G(QN u , Qu)(x)dµ(x)du Ω |LQu(x)|1{|LQu(x)| Nδ}dµ(x)du. By multiplying G(QN t , Qt)(y) in inequality (87), integrating on both sides, and applying the Cauchy Schwarz inequality, from (87) we derive G(QN t , Qt)(y) 2dµ(y) C Z G(QN t , Qt)(y) 2dµ(y) 1 G(QN 0 , Q0)(y) 2dµ(y) 1 H(BN u , B)(x, y) 2dµ(x)dµ(y) 1 G(QN u , Qu)(x) 2dµ(x)du Ω |LQu(x)|1{|LQu(x)| Nδ}dµ(x)du (QN 0 , Q0)(y) 2dµ(y) 1 H(BN u , B)(x, y) 2dµ(x)dµ(y)du Ω |LQu(x)|1{|LQu(x)| Nδ}dµ(x)du. Then the inequality (87) can be formulated as G(QN t , Qt)(y) 2dµ(y) 1 2 JN + C Z t G(QN u , Qu)(x) 2dµ(x) 1 G(QN t , Qt)(y) = |QN t (y) Qt(y)| + X k | yk(QN t Qt)(y)| + X k,l | 2 yk,yl(QN t Qt)(y)|, by a simple quadratic-mean inequality we have QN t Qt 2 H2 Z G(QN t , Qt)(y) 2dµ(y) (n2 + n + 1) QN t Qt 2 H2. (92) Cohen, Jiang, Sirignano Consequently, from (90) we derive QN t Qt H2 JN + C Z t 0 QN u Qu H2du. (93) By applying Gr onwall s inequality, (93) yields the exponential bound R t 0 QN u Qu H2 du JN C (e Ct 1). It remains to bound JN in (89), as N goes to infinity. By Lemma 42 and (92), Ec0,w0,b0 R Ω G(QN 0 , Q0)(y) 2dµ(y) 0. By the dominated convergence theorem, we obtain R T 0 R Ω|LQu(x)|1{|LQu(x)| Nδ}dµ(x)du 0. Finally, by Lemma 41, we show that Ec0,w0,b0 Nδ R T 0 R Ω2 H(BN u , B)(x, y) 2dµ(x)dµ(y)du 0. Therefore, Ec0,w0,b0[JN] 0 as N and consequently, from the above exponential bound, lim N Ec0,w0,b0 h Z T 0 QN u Qu H2 du i = 0. (94) Substituting this result back into inequality (93) and taking the expectation with respect to the random initialization leads to lim N Ec0,w0,b0 QN t Qt H2 = 0 for any time t T. 
Finally, observe that $Q^N_t|_{\partial\Omega} = f$ by construction, so almost sure convergence in $H^2$ (for a subsequence of $N$) and continuity of the boundary trace operator imply $Q_t|_{\partial\Omega} = f$.

5. Convergence for large training time

In the previous section, we showed that, as the neural network is made wider, its training process converges to a process that satisfies the infinite-dimensional ODE (74). In particular, the wide limit of the approximate solution has the dynamics
$$\frac{dQ_t}{dt} = \alpha\,\mathcal{B}\mathcal{L}(Q_t).$$
This limit process can therefore be regarded as an approximation of the setting where we use a large (single-layer) neural network, and our problem (in the limiting setting) is transformed into the study of this infinite-dimensional dynamical system. The following theorem is an easy consequence.

Theorem 46 If the wide-network limit converges in $H^1$ to a fixed point $Q_\infty$, then $Q_\infty$ is a solution to the PDE (1).

Proof For a fixed point $Q_\infty$, we know $\mathcal{B}\mathcal{L}(Q_\infty) = 0$. From Lemma 35, this implies that $\mathcal{L}(Q_\infty) = 0$, so the PDE dynamics are satisfied. As $Q^N$ satisfies the boundary condition $Q^N_t|_{\partial\Omega} = f$, and $Q^N_t\to Q_t$ in $H^2$ for each $t$, and $Q_t\to Q_\infty$ in $H^1$, we see that $Q_\infty$ must satisfy the boundary condition also (by $H^1$-continuity of the boundary trace operator).

In this section, we will consider the simple case where $\mathcal{L}$ is a monotone operator. This implies that the dynamical system $dv/dt = \mathcal{L}v$ converges in $L^2$ exponentially quickly, which suggests the dynamics of $Q$ will be well behaved. However, the presence of the operator $\mathcal{B}$ leads to some difficulties in our analysis. Nevertheless, we will show that, in this setting, our approximation $Q_t$ converges to the true solution $u$ of the PDE, at least for a generic subsequence of times.

We assume $\alpha = 1$ in (16) for notational simplicity in this section, without loss of generality (as this is a simple rescaling of time).
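The exponential $L^2$ decay available under strong monotonicity can be seen on a finite-dimensional surrogate; the finite-difference discretization of $\mathcal{L}u = \tfrac12 u'' - \gamma u$ below, together with all grid and step-size choices, is an illustrative assumption rather than the paper's infinite-dimensional setting.

```python
import numpy as np

# Explicit Euler flow  dv/dt = L v  for a discretized L = 0.5*Laplacian - gamma*I
# on (0, 1) with zero boundary values.  L is strongly monotone here:
# <v, Lv> = -0.5*|v'|^2 - gamma*|v|^2 <= -gamma*|v|^2, so the L2 norm decays
# exponentially, which is the mechanism used in this section.
m, gamma, dt, steps = 50, 0.1, 1e-4, 2000
h = 1.0 / (m + 1)
lap = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
       + np.diag(np.ones(m - 1), -1)) / h**2
L = 0.5 * lap - gamma * np.eye(m)

v = np.random.default_rng(2).standard_normal(m)
norms = [np.linalg.norm(v)]
for _ in range(steps):
    v = v + dt * (L @ v)                 # one Euler step of dv/dt = L v
    norms.append(np.linalg.norm(v))
```

The step size is chosen so that $I + \Delta t\,L$ is a contraction; the recorded norms then decrease monotonically and shrink at least geometrically.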
Theorem 47 Let $u$ be the solution to the PDE (1) under our assumptions (Assumption 13, now including part (iv)), and assume the neural network configuration satisfies Assumptions 14 and 16. Then the wide network limit $\{Q_t\}_{t\ge0}$ of the Q-PDE algorithm (Definitions 20, 43) satisfies
$$\frac{1}{t}\int_0^t\|Q_s-u\|^2_{L^2}\,ds \to 0 \quad\text{as } t\to\infty. \qquad (95)$$
In particular, there exists a set $A\subset[0,\infty)$ with $\lim_{t\to\infty}t^{-1}\mathrm{Leb}(A\cap[0,t]) = 0$ such that, for all sequences $t_k\to\infty$ which do not take values in $A$, we have the (strong) $L^2$ convergence $\|Q_{t_k}-u\|_{L^2}\to0$.

Proof We begin by supposing we have a decomposition
$$Q_0 = f = u + \mathcal{B}v_0 + g \qquad (96)$$
for some $v_0\in L^2_\eta$ and $g\in H^2$. As $u\in H^2$, we can choose $v_0$ arbitrarily and then find $g$ using (96). Consider the process $v$ solving the ODE
$$\frac{dv_t}{dt} = \mathcal{L}(\mathcal{B}v_t+u+g) \qquad (97)$$
with initial value $v_0$ as in (96). As detailed in Appendix A.6, as $\mathcal{B}$ and $\mathcal{L}$ are both Lipschitz continuous in appropriate spaces, a standard Picard-Lindelöf argument shows that $v$ is uniquely defined in $L^2_\eta$, for all $t\ge0$. By differentiating $\mathcal{B}v_t$ (and using dominated convergence to move the derivative through $\mathcal{B}$), we obtain
$$\frac{d(\mathcal{B}v_t+u+g)}{dt} = \frac{d(\mathcal{B}v_t)}{dt} = \mathcal{B}\mathcal{L}(\mathcal{B}v_t+u+g); \qquad \mathcal{B}v_0+u+g = Q_0. \qquad (98)$$
In particular, as the ODE defining $Q$ has a unique solution in $H^2$ (Theorem 44), we have the identity $Q_t = \mathcal{B}v_t+u+g$.

Using the Young inequality $2\langle x,y\rangle\le\gamma\|x\|^2+(4/\gamma)\|y\|^2$ for $\gamma>0$, together with the fact $\mathcal{L}(u) = 0$ and our assumption that $\mathcal{L}$ is strongly monotone and Lipschitz, for some constant $k>0$ we have
$$\frac{d}{dt}\|v_t\|^2_{\mathcal{B}} = \frac{d}{dt}\langle\mathcal{B}v_t,v_t\rangle = 2\langle\mathcal{B}v_t,\mathcal{L}(\mathcal{B}v_t+u+g)\rangle = 2\langle\mathcal{B}v_t,\mathcal{L}(\mathcal{B}v_t+u+g)-\mathcal{L}(u+g)\rangle + 2\langle\mathcal{B}v_t,\mathcal{L}(u+g)-\mathcal{L}(u)\rangle \le -2\gamma\|\mathcal{B}v_t\|^2_{L^2} + \gamma\|\mathcal{B}v_t\|^2_{L^2} + \frac{4}{\gamma}\|\mathcal{L}(u+g)-\mathcal{L}(u)\|^2_{L^2} \le -\gamma\|\mathcal{B}v_t\|^2_{L^2} + \gamma k\|g\|^2_{H^2}. \qquad (99)$$
From Lemma 35 there exists $\lambda>0$ such that, for all $v\in L^2_\eta$, $\|\mathcal{B}v\|^2_{L^2}\le\lambda\|v\|^2_{\mathcal{B}}$. Using this value of $\lambda$, as $\|g\|^2_{H^2}$ is a constant,
$$\frac{d}{dt}\Big(\lambda\|v_t\|^2_{\mathcal{B}} - k\|g\|^2_{H^2}\Big) \le -\gamma\lambda\Big(\|\mathcal{B}v_t\|^2_{L^2} - k\|g\|^2_{H^2}\Big).$$
Using the definition of $\lambda$ and integrating,
$$\|\mathcal{B}v_t\|^2_{L^2} - k\|g\|^2_{H^2} \le \lambda\|v_t\|^2_{\mathcal{B}} - k\|g\|^2_{H^2} \le \lambda\|v_0\|^2_{\mathcal{B}} - k\|g\|^2_{H^2} - \gamma\lambda\int_0^t\Big(\|\mathcal{B}v_s\|^2_{L^2} - k\|g\|^2_{H^2}\Big)\,ds. \qquad (100)$$
Writing $h_t = \int_0^t\big(\|\mathcal{B}v_s\|^2_{L^2}-k\|g\|^2_{H^2}\big)\,ds$, we see $dh_t/dt\le(\lambda\|v_0\|^2_{\mathcal{B}}-k\|g\|^2_{H^2})-\gamma\lambda h_t$, and hence Grönwall's inequality yields
$$\int_0^t\big(\|\mathcal{B}v_s\|^2_{L^2}-k\|g\|^2_{H^2}\big)\,ds \le \big(\lambda\|v_0\|^2_{\mathcal{B}}-k\|g\|^2_{H^2}\big)\int_0^t e^{-\gamma\lambda(t-s)}\,ds. \qquad (101)$$
Recalling that $Q_t-u = \mathcal{B}v_t+g$ and
$$\|\mathcal{B}v+g\|^2_{L^2} \le 2\|\mathcal{B}v\|^2_{L^2}+2\|g\|^2_{L^2} \le 2\big(\|\mathcal{B}v\|^2_{L^2}-k\|g\|^2_{H^2}\big)+(2k+2)\|g\|^2_{H^2}, \qquad (102)$$
we conclude that, for some constant $C>0$,
$$\int_0^t\|Q_s-u\|^2_{L^2}\,ds = \int_0^t\|\mathcal{B}v_s+g\|^2_{L^2}\,ds \le Ct\|g\|^2_{H^2}+C\|v_0\|^2_{\mathcal{B}}. \qquad (103)$$
This inequality must hold for all choices of $g$ and $v_0$ satisfying (96), and we can choose them to optimize (103). We observe that $Q_0-u\in H^2_{(0)}$, as $Q_0,u\in H^2$ and they have the same boundary value. For every $\epsilon>0$, by Theorem 33, there exists a choice of $v^{(\epsilon)}_0\in L^2_\eta$ such that $\|Q_0-u-\mathcal{B}v^{(\epsilon)}_0\|_{H^2}\le\epsilon$. We further recall that $\|v^{(\epsilon)}_0\|^2_{\mathcal{B}}<\infty$ for $v^{(\epsilon)}_0\in L^2_\eta$. Defining $g^{(\epsilon)} = Q_0-u-\mathcal{B}v^{(\epsilon)}_0$, (103) yields the limit
$$\limsup_{t\to\infty}\frac{1}{t}\int_0^t\|Q_s-u\|^2_{L^2}\,ds \le C\|g^{(\epsilon)}\|^2_{H^2}+\lim_{t\to\infty}\frac{C}{t}\|v^{(\epsilon)}_0\|^2_{\mathcal{B}} \le C\epsilon^2. \qquad (104)$$
As $\epsilon>0$ was arbitrary, we conclude that $\lim_{t\to\infty}\frac{1}{t}\int_0^t\|Q_s-u\|^2_{L^2}\,ds = 0$.

To obtain the final statement, set $\beta(t) = \sup_{T\ge t}\big\{\frac{1}{T}\int_0^T\|Q_s-u\|^2_{L^2}\,ds\big\}$. Observe that $\beta$ is a nonincreasing positive function with $\lim_{t\to\infty}\beta(t) = 0$. Defining the set $A = \{t:\|Q_t-u\|^2_{L^2}>\sqrt{\beta(t)}\}$, Markov's inequality yields
$$\frac{1}{t}\mathrm{Leb}(A\cap[0,t]) = \frac{1}{t}\mathrm{Leb}\Big\{s\le t:\frac{\|Q_s-u\|^2_{L^2}}{\sqrt{\beta(s)}}>1\Big\} \le \frac{1}{t}\int_0^t\frac{\|Q_s-u\|^2_{L^2}}{\sqrt{\beta(s)}}\,ds \le \frac{1}{\sqrt{\beta(t)}}\cdot\frac{1}{t}\int_0^t\|Q_s-u\|^2_{L^2}\,ds \le \sqrt{\beta(t)}\to0. \qquad (105)$$

The above result only considers the convergence of $Q$. In the case where $\mathcal{L}$ is Gateaux differentiable (which is certainly the case when $\mathcal{L}$ is linear), we can give a similar result for the convergence of $\mathcal{L}Q_t$, involving the positive-definite kernel $B$.

Theorem 48 Suppose $\mathcal{L}: H^2\to L^2$ is Gateaux differentiable and the conditions of Theorem 47 hold. Then there exists $K>0$ such that $\int_0^t\|\mathcal{B}\mathcal{L}Q_s\|^2_{L^2}\,ds\le K$ for all $t\ge0$. In particular, there exists a set $A\subset[0,\infty)$ with $\mathrm{Leb}(A)<\infty$ such that, for all sequences $t_k\to\infty$ which do not take values in $A$, we have $\|\mathcal{B}\mathcal{L}Q_{t_k}\|_{L^2}\to0$.

Proof For any $w\in H^2$, $h\in H^2$ we write $\mathcal{L}'(w;h) = \lim_{\epsilon\to0}\{(\mathcal{L}(w+\epsilon h)-\mathcal{L}(w))/\epsilon\}$ for the Gateaux derivative in direction $h$ evaluated at $w$, the limit being taken in $L^2$.
We first observe that, as $\mathcal{L}$ is strongly monotone, for any $h\in H^2$,
$$\langle h,\mathcal{L}'(w;h)\rangle = \lim_{\epsilon\to0}\frac{1}{\epsilon}\langle h,\mathcal{L}(w+\epsilon h)-\mathcal{L}(w)\rangle \le -\gamma\lim_{\epsilon\to0}\frac{1}{\epsilon^2}\|\epsilon h\|^2_{L^2} = -\gamma\|h\|^2_{L^2}. \qquad (106)$$
Now write $v_t = \mathcal{L}Q_t$. As $Q_t\in H^2$, we know that $\mathcal{B}v_t = \mathcal{B}\mathcal{L}Q_t\in H^2$ (as $\mathcal{L}: H^2\to L^2$ and $\mathcal{B}: L^2_\eta\to H^2$). By the chain rule, we have $dv_t/dt = \mathcal{L}'(Q_t; dQ_t/dt) = \mathcal{L}'(Q_t;\mathcal{B}\mathcal{L}Q_t) = \mathcal{L}'(Q_t;\mathcal{B}v_t)$. Therefore,
$$\frac{d}{dt}\|v_t\|^2_{\mathcal{B}} = 2\Big\langle\mathcal{B}v_t,\frac{dv_t}{dt}\Big\rangle = 2\big\langle\mathcal{B}v_t,\mathcal{L}'(Q_t;\mathcal{B}v_t)\big\rangle \le -2\gamma\|\mathcal{B}v_t\|^2_{L^2}. \qquad (107)$$
Lemma 35 gives the bound $\|\mathcal{B}v_t\|^2_{L^2}\le\lambda\|v_t\|^2_{\mathcal{B}}$, from which we obtain
$$\|\mathcal{B}v_t\|^2_{L^2} \le \lambda\|v_t\|^2_{\mathcal{B}} \le \lambda\|v_0\|^2_{\mathcal{B}} - 2\lambda\gamma\int_0^t\|\mathcal{B}v_s\|^2_{L^2}\,ds. \qquad (108)$$
By Grönwall's inequality, we conclude
$$\int_0^t\|\mathcal{B}v_s\|^2_{L^2}\,ds \le \lambda\|v_0\|^2_{\mathcal{B}}\int_0^t e^{-2\lambda\gamma(t-s)}\,ds \le \frac{\|v_0\|^2_{\mathcal{B}}}{2\gamma} =: K < \infty. \qquad (109)$$
Taking $A = \{t:\|\mathcal{B}v_t\|^2_{L^2}>1/t\}$, the final stated property is obtained from Markov's inequality, similarly to in Theorem 47.

Corollary 49 Under the conditions and notation of Theorem 48, the sequence $\mathcal{L}Q_{t_k}$ converges weakly to zero in $L^2$ if and only if it remains bounded in $L^2$.

Proof Recall that a sequence $x_n$ converges $L^2$-weakly to zero if and only if it is $L^2$-bounded and there exists an $L^2$-dense set $Y$ such that $\langle y,x_n\rangle\to0$ for all $y\in Y$. We know from Theorem 48 that $\mathcal{B}\mathcal{L}Q_{t_k}\to0$ strongly in $L^2$ so, for any $v\in L^2$, $\langle\mathcal{B}v,\mathcal{L}Q_{t_k}\rangle = \langle v,\mathcal{B}\mathcal{L}Q_{t_k}\rangle\to0$. In particular, we see that $\langle y,\mathcal{L}Q_{t_k}\rangle\to0$ for any $y\in\mathrm{im}(\mathcal{B})$. We know from Theorem 33 that $\mathrm{im}(\mathcal{B})$ is dense in $H^2$, and hence in $L^2$. The result follows.

6. Numerical experiments

In this section, we present numerical results where we apply our algorithm to solve a family of partial differential equations. The approximator matches the solution of the differential equation closely in a relatively short period of training time in these test cases.

6.1 Discretization of the continuous-time algorithm

As a reminder, in a continuous-time setting, the algorithm follows a biased gradient flow $d\theta_t/dt = \alpha^N_t G^N_t(\theta_t)$, based on the biased gradient estimator
$$G^N_t(\theta_t) = \int_\Omega 2\,\psi(\mathcal{L}Q^N_t(x))\,\psi'(\mathcal{L}Q^N_t(x))\,\nabla_\theta\big(-\gamma Q^N_t(x)\big)\,d\mu(x). \qquad (110)$$
A natural discretization of our algorithm, with which we train our approximator, is
$$\theta_{t+1} - \theta_t = \alpha^N_t\,\hat G^N_{M,t}, \qquad (111)$$
where
$$\hat G^N_{M,t} = \frac{1}{M}\sum_{i=1}^M 2\,\psi(\mathcal{L}Q^N_t(x^i_t))\,\psi'(\mathcal{L}Q^N_t(x^i_t))\,\nabla_\theta\big(-\gamma Q^N_t(x^i_t)\big) \qquad (112)$$
is an unbiased estimator of $G^N_t$, given $\{x^i_t\}_{i=1,2,\dots,M}$ the random grid independently sampled from the distribution $\mu$ at time step $t$, and $\alpha^N_t$ is the learning rate. More advanced discretization methods, for example the ADAM adaptive gradient descent rule (Kingma and Ba (2014)), can also be used, and (as is common in gradient-based optimization problems) give a noticeable improvement in performance. The Q-learning algorithm for solving PDEs is as follows:

Algorithm 1: Q-PDE Algorithm
Parameters: Hyper-parameters of the single-layer neural network; domain $\Omega$; PDE operator $\mathcal{L}$; boundary condition at $\partial\Omega$; sampling measure $\mu$; number of Monte Carlo points $M$; upper bound of training time $T$;
Initialise: Neural net $S^N$; auxiliary function $\eta$; approximator $Q^N$ based on $S^N$ and $\eta$; smoothing function $\psi^N$; learning-rate scheduler $\{\alpha^N_t\}_{t\ge0}$; stopping criterion $\epsilon$; current time $t=0$.
while err $\ge\epsilon$ and $t\le T$ do
Sample $M$ points $\{x^i\}$ in $\Omega$ using $\mu$;
Compute the biased gradient estimator $\hat G^N_{M,t}$ using (112);
Update the neural network parameters via (111);
Compute err $= \frac{1}{M}\sum_{i=1}^M\psi^N(\mathcal{L}Q^N_t(x^i))^2$;
Update time $t$;
end
Return approximator $Q^N_t$.

We implement this algorithm using PyTorch (with backpropagation to compute gradients in both the parameters and the state variable $x$) and the standard ADAM optimization scheduler for $\alpha$.²

2. The implementation is available at https://github.com/DeqingJ/QPDE.

6.2 Test equation: survival time of a Brownian motion

Suppose $\Omega$ is the $n$-dimensional unit ball. For an $n$-dimensional Brownian motion $X_t$ starting from the point $x\in\mathbb{R}^n$, we wish to compute
$$u(x) := \mathbb{E}\Big[\int_0^{\tau_\Omega} e^{-\gamma t}\,dt\,\Big|\,X_0 = x\Big], \qquad (113)$$
where $\tau_\Omega$ is the first exit time of $X$ from the domain $\Omega$.
By the Feynman-Kac theorem (applied to the process $X$ stopped at $\tau_\Omega$), this expectation is given by the solution of the following PDE:
$$\begin{cases} 1 - \gamma u + \tfrac{1}{2}\Delta u = 0 & \text{in } \Omega,\\ u = 0 & \text{on } \partial\Omega,\end{cases} \qquad (114)$$
where $\Delta u$ is the Laplacian of $u$. We consider this problem a good test case, as its explicit solution is known to us, allowing us to check whether our algorithm works for a high-dimensional version of this PDE. We initialize a single-layer neural network $S^N$ equipped with the sigmoid activation function, and introduce the auxiliary function $\eta(x) = 1-\|x\|^2$. The approximator $Q = S\cdot\eta$ is trained to fit the solution. As shown in the following subsections, through our algorithm the approximator $Q$ learns the solution of the PDE for different discount factors in different dimensions. For the detailed configuration of our approximator, see Appendix A.8.

6.2.1 1-dimensional case

For dimension $n=1$, the domain is $\Omega = (-1,1)$. The exact solution of this differential equation is
$$u(x) = \frac{1}{\gamma} + c_1\big(e^{\sqrt{2\gamma}\,x} + e^{-\sqrt{2\gamma}\,x}\big), \qquad\text{where } c_1 = -\frac{1}{\gamma\big(e^{\sqrt{2\gamma}} + e^{-\sqrt{2\gamma}}\big)}.$$
For this subsection, set $\gamma = 0.1$. To keep track of the training progress, we monitor the average loss level at time $t$, that is, $e_t = \int_\Omega[\mathcal{L}Q_t(x)]^2\,d\mu(x)$ where $\mu$ is the Lebesgue measure on $\Omega$, which can be estimated using our sample of evaluation points at each time.

Figure 1: Loss level during and after training

The left side of Figure 1 shows how the average operator loss smoothly decays during training. The right side plots the operator error at terminal time, $\mathcal{L}Q^N_T(x)$. For a perfect fit, the operator loss $\mathcal{L}Q_T(x)$ should be zero across the entire domain. We observe that, for our approximator, the norm of this loss remains small and fluctuates around 0. In Figure 2, we compare our approximator $Q^N_\tau$ with the exact solution. Here $\tau$ indicates the training time at which we observed the minimal loss $e_\tau$, which occurs for $\tau$ close to the terminal time $T$.
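As a concrete (and drastically simplified) instance of Algorithm 1 for this 1-d equation, the sketch below trains only the outer weights of a random-feature network, takes $\psi(z) = z$, and uses plain gradient steps instead of ADAM; these simplifications, and all sizes and learning rates, are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# Simplified 1-d Q-PDE sketch for  1 - gamma*u + 0.5*u'' = 0  on (-1, 1),
# u(+-1) = 0.  Only the outer weights c are trained (an NTK-style
# simplification); the inner weights stay at their random initialization.
rng = np.random.default_rng(3)
N, M, gamma, lr = 64, 256, 0.1, 0.01

w = rng.standard_normal(N)              # frozen inner weights
b = rng.standard_normal(N)              # frozen biases
c = np.zeros(N)                         # trained outer weights
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def features(x):
    """Per-unit values of Q = eta * sigma and of Q'', with eta(x) = 1 - x^2."""
    z = np.outer(x, w) + b
    s = sig(z)
    ds = s * (1.0 - s)                  # sigma'
    d2s = ds * (1.0 - 2.0 * s)          # sigma''
    eta, deta, d2eta = 1.0 - x**2, -2.0 * x, -2.0
    phi = eta[:, None] * s
    d2phi = d2eta * s + 2.0 * deta[:, None] * ds * w + eta[:, None] * d2s * w**2
    return phi, d2phi

for _ in range(5000):
    x = rng.uniform(-1.0, 1.0, M)       # mesh-free Monte Carlo sample from mu
    phi, d2phi = features(x)
    resid = 1.0 - gamma * (phi @ c) + 0.5 * (d2phi @ c)   # L Q at the samples
    # Biased update: the residual is frozen and the gradient is taken only
    # through the network output, mirroring (111)-(112).
    c += lr * (resid[:, None] * phi).mean(axis=0)

x = np.linspace(-1.0, 1.0, 201)
phi, d2phi = features(x)
Q = phi @ c
final_mse = np.mean((1.0 - gamma * Q + 0.5 * (d2phi @ c)) ** 2)
```

Despite the bias, the Monte Carlo residual decays, the boundary condition holds exactly by construction of $\eta$, and the resulting $Q$ can be compared against the closed-form solution above.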
(Given the fluctuations observed in the loss, choosing this minimal point can give a noticeable qualitative improvement in performance.) The relative error of our approximation at $x$, given by $|(Q^N_\tau(x) - u(x))/u(x)|$, remains lower than 0.02%.

Figure 2: Approximate and exact solutions in the 1-d case

6.2.2 20-dimensional case
Now we consider the 20-dimensional case. High-dimensional PDEs are typically hard to solve, as mesh grids are computationally unaffordable in these cases. However, as a mesh-free method, our algorithm can be applied. We numerically solve the PDE (114) with discount factor $\gamma = 0.2$ via our algorithm and compare the result with the exact solution, given by
$$u(x) = \frac{1}{\gamma}\left(1 - \frac{J_9\big(i\sqrt{2\gamma}\,\|x\|\big)}{\|x\|^9\, J_9\big(i\sqrt{2\gamma}\big)}\right),$$
where $J_9$ is the ninth-order Bessel function of the first kind. We also compare the training results of our method to the results from the Deep Galerkin Method (which does stochastic gradient descent with an unbiased gradient estimate) as a benchmark. Our approximator trained with the Q-PDE algorithm is able to accurately approximate the solution of the 20-dimensional PDE. The left side of Figure 3 shows that the operator loss is less than $10^{-5}$ after training for 40,000 epochs, for both Q-PDE and DGM. For both methods, we observe cyclic behavior in the loss during training, which appears similar to the slingshot effect discussed in Thilak et al. (2022), and is potentially a consequence of momentum accumulation in the ADAM optimizer. A full understanding of this effect (which we have not observed when using a simple gradient descent optimizer) is left for future work.

Figure 3: Average loss level during training

The solution $u$ of this PDE is radially symmetric; that is, it can be represented by a function which depends on $\|x\|$ rather than $x$. For the right side of Figure 3, we verify that our approximate solution is close to this function for different $\|x\| \in [0, 1]$.
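For comparison, the exact radial profile above can be evaluated without a special-function library by summing the power series of the modified Bessel function $I_9$ directly, using the identity $J_9(iz) = i^9 I_9(z)$ to avoid complex arithmetic. This is our own sketch (function names are assumptions), valid for $0 < r \le 1$:

```python
import math


def i9(z, terms=40):
    """Modified Bessel function I_9(z) via its power series:
    I_9(z) = sum_{m>=0} (z/2)^(2m+9) / (m! (m+9)!)."""
    return sum((z / 2.0) ** (2 * m + 9) / (math.factorial(m) * math.factorial(m + 9))
               for m in range(terms))


def u_exact(r, gamma=0.2):
    """Radial exact solution of 1 - gamma*u + (1/2)*Laplacian(u) = 0 on the
    20-dimensional unit ball with u = 0 on the boundary.  The Bessel ratio
    J_9(i z) / J_9(i k) is rewritten via J_9(iz) = i^9 I_9(z) so everything
    stays real.  Valid for 0 < r <= 1 (r = 0 would divide by zero)."""
    k = math.sqrt(2.0 * gamma)
    return (1.0 - i9(k * r) / (r ** 9 * i9(k))) / gamma
```

At the boundary, `u_exact(1.0)` returns 0; near the centre the value is close to $\mathbb{E}[\tau_\Omega] = 1/20$, since $\gamma\tau_\Omega$ is small there.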
For $0 \le r \le 1$, we randomly generate 100 sample points $\{x_{i,r}\}_{i=1,\dots,100}$ on the sphere $\|x\| = r$ and compute the mean-squared error of the set of values $\{Q^N_\tau(x_{i,r})\}$ as compared to the exact solution. We see that for different $r$, the mean-squared error of the neural network approximator is consistently small. The Q-PDE solution attains a smaller error than our DGM solution, though both are of the same order of magnitude. For $0 \le r \le 1$, we also consider the maximum relative error, defined as
$$R_r = \max_{x_{i,r}} \left|\frac{Q^N_\tau(x_{i,r}) - u(x_{i,r})}{u(x_{i,r})}\right|.$$

Figure 4: Relative and absolute errors of the approximate solution, in terms of distance from the origin.

The left plot of Figure 4 shows that this error remains small for all $0 \le r \le 1$. Again, the DGM solution has a slightly larger relative error. Meanwhile, as the DGM solution is trained with exact gradients, which are computationally more expensive, it takes 5 times longer to train. Both the DGM and Q-PDE solutions reach high accuracy. We give a scatter plot of $\{Q^N_\tau(x_{i,r})\}$ on the right side of Figure 4 and compare it with the exact solution $u$. The approximate solutions closely match the exact solution.

7. Conclusion
In this paper, we propose a novel algorithm which numerically solves elliptic PDEs. The neural network approximator is designed to match the Dirichlet boundary condition of the PDE and is trained to satisfy the PDE operator. We prove that for single-layer neural network approximators, as the number of hidden units $N \to \infty$, the evolution of the approximator during training is characterized by a limit ODE. We further prove that as the training time $t \to \infty$, the solution of the limit ODE converges to the classical solution of the PDE. In addition, we provide a numerical test case to show that our algorithm can numerically solve (high-dimensional) PDEs.

Acknowledgments
Thanks to Endre Süli, Lukasz Szpruch, Christoph Reisinger and Rama Cont, and to the anonymous referees, for useful conversations and suggestions.
This publication is based on work supported by the EPSRC Centre for Doctoral Training in Industrially Focused Mathematical Modelling (InFoMM) (EP/L015803/1) in collaboration with the Alan Turing Institute. Samuel Cohen also acknowledges the support of the Oxford-Man Institute for Quantitative Finance, and the UKRI Prosperity Partnership Scheme (FAIR) under the EPSRC Grant EP/V056883/1. The order of authorship in this paper was determined alphabetically.

Appendix A. Appendix
In our proofs, the constants $K$, $C$ may vary from line to line.

A.1 Proof of Lemma 3
Proof We first prove³ statement i. Suppose that $u \in H^2(\Omega) \cap H^1_0(\Omega)$. Consider the Poisson problem solved by $u$:
$$\Delta u = f \ \text{in } \Omega, \qquad u = 0 \ \text{on } \partial\Omega, \quad (117)$$
where $f := \Delta u \in L^2(\Omega)$. Let $f_n \in C^\infty_0(\Omega)$ be a sequence with $f_n \to f$ in $L^2(\Omega)$. Then consider the Poisson problem
$$\Delta u_n = f_n \ \text{in } \Omega, \qquad u_n = 0 \ \text{on } \partial\Omega. \quad (118)$$
Since (by Assumption 1) $\partial\Omega$ is $C^{3,\alpha}$ for a Hölder exponent $\alpha \in (0,1)$ and $f_n \in C^\infty_0(\Omega) \subset C^{1,\alpha}(\bar\Omega)$, by the Schauder theory for elliptic boundary-value problems (see, for example, Theorems 6.14 and 6.19 in Gilbarg and Trudinger (1998)), there exists a unique solution $u_n \in C^{3,\alpha}(\Omega) \cap C^0(\bar\Omega)$ to this boundary-value problem. As
$$\Delta(u - u_n) = f - f_n \ \text{in } \Omega, \qquad u - u_n = 0 \ \text{on } \partial\Omega, \quad (119)$$
and $\partial\Omega$ is $C^{3,\alpha}$ (and therefore also $C^2$), and $u - u_n \in H^2(\Omega)$, it follows from Theorem 3.1.2.1 and Remark 3.1.2.2 in Grisvard (2011) that, for some $C > 0$,
$$\|u - u_n\|_{H^2} \le C\|\Delta(u - u_n)\|_{L^2} = C\|f - f_n\|_{L^2} \to 0 \quad (120)$$
as $n \to \infty$. Thus, we have constructed a sequence of functions $u_n \in C^{3,\alpha}(\Omega) \cap H^1_0$ which converges to $u \in H^2(\Omega)$ in the $H^2$ norm. Therefore, $C^{3,\alpha}(\Omega) \cap C^0(\bar\Omega)$ is dense in $H^2(\Omega) \cap H^1_0(\Omega)$. As
$$C^{3,\alpha}(\Omega) \cap C^0(\bar\Omega) \subset C^3(\Omega) \cap C^0(\bar\Omega), \quad (121)$$
we conclude that $C^3(\Omega) \cap C^0(\bar\Omega)$ is also dense in $H^2(\Omega) \cap H^1_0(\Omega)$.

We now prove statement ii. Take an arbitrary $u \in C^3(\Omega) \cap C^0(\bar\Omega)$, and define $\tilde u = u/\eta$. It is clear that $\tilde u|_\Omega$ is in $C^3(\Omega)$, from classical rules of calculus, as $\eta|_\Omega > 0$. In particular, $\tilde u$ and its derivatives are bounded on any closed subset of $\Omega$, so we only need to consider the behavior of $\tilde u$ in a neighbourhood of $\partial\Omega$.
We know that $\eta$ has a nonvanishing derivative on $\partial\Omega$, and equals zero on $\partial\Omega$, which implies that $|\langle\nabla\eta(x), n_x\rangle| > 0$ for each $x \in \partial\Omega$, where $n_x$ is the outward pointing normal at $x$. As $\partial\Omega$ is compact, this implies that
$$\inf_{x\in\partial\Omega} |\langle\nabla\eta(x), n_x\rangle| > \epsilon, \quad \text{for some } \epsilon > 0. \quad (122)$$

3. Thanks to Endre Süli for suggesting the proof of statement i given here.

As $u \in C^3(\Omega)\cap C^0(\bar\Omega)$, for any $x \in \partial\Omega$, we also have $\limsup_{x'\to x} |u(x')| = 0$. Consider a smooth path $\rho: [0,1) \to \Omega$ with $\lim_{t\to 1}\rho(t) = x \in \partial\Omega$. For a function $g$ with domain $\Omega$ (e.g. $g = \eta$, $u$ or $\tilde u$), we write $g_\rho$ for the composition $g\circ\rho$ with domain $[0,1)$. Suppose $\rho$ does not approach $\partial\Omega$ tangentially; in particular,
$$\lim_{t\to 1}\langle d\rho/dt, n_x\rangle \ge \epsilon \quad\text{and}\quad \|d\rho/dt\| = 1. \quad (123)$$
As $\langle\nabla\eta(x), y\rangle = 0$ for all $y \perp n_x$, we have
$$\lim_{t\to 1} |\eta'_\rho(t)| = |\langle\nabla\eta(x), n_x\rangle|\,\langle d\rho/dt, n_x\rangle > \epsilon^2. \quad (124)$$
Using L'Hôpital's rule, we can determine the behaviour of $\tilde u_\rho$ (and its derivatives) as $t \to 1$. In particular, by repeated application of the rule, as $t \to 1$,
$$\tilde u_\rho = \frac{u_\rho}{\eta_\rho} \to \frac{u'_\rho}{\eta'_\rho}, \quad (125)$$
$$\tilde u'_\rho = \frac{\eta_\rho u'_\rho - \eta'_\rho u_\rho}{\eta_\rho^2} \to \frac{u''_\rho}{2\eta'_\rho} - \frac{\eta''_\rho}{2\eta'_\rho}\,\tilde u_\rho, \quad (126)$$
$$\tilde u''_\rho = \frac{\eta_\rho^2 u''_\rho - \eta_\rho\eta''_\rho u_\rho - 2\eta_\rho\eta'_\rho u'_\rho + 2(\eta'_\rho)^2 u_\rho}{\eta_\rho^3} \to \frac{u'''_\rho}{3\eta'_\rho} - \frac{\eta'''_\rho}{3\eta'_\rho}\,\tilde u_\rho - \frac{\eta''_\rho}{\eta'_\rho}\,\tilde u'_\rho. \quad (127)$$
Considering $t \to 1$, as $|\eta'_\rho| > \epsilon^2$, the right-hand side of each of these terms is bounded, with a uniform bound for all paths under consideration. Writing $D_h(\tilde u)$ for the directional derivative of $\tilde u$ in a direction $h$, we know $\tilde u'_\rho = D_{d\rho/dt}(\tilde u)$ and $\tilde u''_\rho = D^2_{d\rho/dt}(\tilde u)$. We can therefore find a single value $K > 0$, and take a ball $B_x$ around each $x \in \partial\Omega$, such that, within each $B_x$, for all unit vectors $h$ satisfying $\langle h, n_x\rangle > \epsilon$,
$$|\tilde u| + |D_h(\tilde u)| + |D^2_h(\tilde u)| \le K. \quad (128)$$
As $\bar\Omega \setminus (\cup_{x\in\partial\Omega} B_x)$ is compact, we know that, for $K$ sufficiently large, (128) also holds on $\bar\Omega \setminus (\cup_{x\in\partial\Omega} B_x)$, for all unit vectors $h$. We conclude that (128) holds on all of $\Omega$; that is, $\tilde u$, and its directional first and second derivatives (except possibly in tangent directions at the boundary), are uniformly bounded.
However, as $\partial\Omega$ is a $C^{3,\alpha}$ boundary (in particular, for all $\epsilon$ sufficiently small and any $x \in \partial\Omega$, the set $\{y \in \mathbb{R}^n : \langle y, n_x\rangle \ge \epsilon\}$ is a convex cone containing a linear basis for $\mathbb{R}^n$), this implies that the first and second derivatives of $\tilde u$ are also uniformly bounded on $\Omega$. We conclude that $\tilde u \in C^2_b(\Omega)$.

A.2 Proof of Lemma 24
Proof (1) From the training algorithm, calculating the entries in the vector $\nabla_\theta Q^N_t(x)$, we have
$$\frac{\partial Q^N_t}{\partial c^i}(x) = \frac{\eta(x)\,\sigma(w^i_t\cdot x + b^i_t)}{N^\beta}. \quad (129)$$
Therefore, by (15), applying the chain rule we derive
$$\frac{dc^i_t}{dt} = \alpha N^{\beta-1}\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,\sigma(w^i_t\cdot x + b^i_t)\,d\mu(x). \quad (130)$$
By the boundedness of the functions $F^N$, $\eta$ and $\sigma$, the RHS of (130) is bounded:
$$\left|\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,\sigma(w^i_t\cdot x + b^i_t)\,d\mu(x)\right| < KN^\delta, \quad (131)$$
where $K$ is a constant. Therefore, we conclude that
$$|c^i_t| < |c^i_0| + |c^i_t - c^i_0| < K_0 + \alpha N^{\delta+\beta-1}tK < C. \quad (132)$$
(2) Similarly to above, considering the $w^i$ entries in the vector $\nabla_\theta Q^N_t(x)$, we have
$$\frac{\partial Q^N_t}{\partial (w^i_t)_k}(x) = \frac{\eta(x)\,\sigma'(w^i_t\cdot x + b^i_t)\,c^i_t\,x_k}{N^\beta}. \quad (133)$$
Therefore $(w^i_t)_k$ satisfies
$$\frac{d(w^i_t)_k}{dt} = \alpha N^{\beta-1}\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,c^i_t\,\sigma'(w^i_t\cdot x + b^i_t)\,x_k\,d\mu(x). \quad (134)$$
By the boundedness of $F^N$, $\eta$, $\sigma'$ and $c^i_t$, the RHS of (134) is bounded:
$$\left|\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,c^i_t\,\sigma'(w^i_t\cdot x + b^i_t)\,x_k\,d\mu(x)\right| < KN^\delta, \quad (135)$$
where $K$ is a constant. Therefore,
$$|(w^i_t)_k| < |(w^i_0)_k| + |(w^i_t)_k - (w^i_0)_k| < |(w^i_0)_k| + \alpha N^{\delta+\beta-1}tK < |(w^i_0)_k| + C. \quad (136)$$
Taking an expectation on both sides, we obtain
$$\mathbb{E}|(w^i_t)_k| < C. \quad (137)$$
(3) Considering the $b^i$ entries in the vector $\nabla_\theta Q^N_t(x)$, we have
$$\frac{\partial Q^N_t}{\partial b^i_t}(x) = \frac{\eta(x)\,\sigma'(w^i_t\cdot x + b^i_t)\,c^i_t}{N^\beta}. \quad (138)$$
Therefore, $b^i_t$ satisfies
$$\frac{db^i_t}{dt} = \alpha N^{\beta-1}\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,c^i_t\,\sigma'(w^i_t\cdot x + b^i_t)\,d\mu(x). \quad (139)$$
By the boundedness of $F^N$, $\eta$, $\sigma'$ and $c^i_t$, the RHS of (139) is bounded:
$$\left|\int_\Omega F^N(\mathcal{L}Q^N_t(x))\,\eta(x)\,c^i_t\,\sigma'(w^i_t\cdot x + b^i_t)\,d\mu(x)\right| < KN^\delta, \quad (140)$$
where $K$ is a constant. Therefore,
$$|b^i_t| < |b^i_0| + |b^i_t - b^i_0| < |b^i_0| + \alpha N^{\delta+\beta-1}tK < |b^i_0| + C. \quad (141)$$
Taking an expectation on both sides, we obtain $\mathbb{E}|b^i_t| < C$. \quad (142)
A.3 Proof of Lemma 36
Proof (1) Notice that
$$|A^N_t(x,y) - A^N_0(x,y)| = \frac{1}{N}\Big|\sum_{i=1}^N \sigma(w^i_t\cdot x+b^i_t)\sigma(w^i_t\cdot y+b^i_t) - \sigma(w^i_0\cdot x+b^i_0)\sigma(w^i_0\cdot y+b^i_0) + x\cdot y\,(c^i_t)^2\sigma'(w^i_t\cdot x+b^i_t)\sigma'(w^i_t\cdot y+b^i_t) - x\cdot y\,(c^i_0)^2\sigma'(w^i_0\cdot x+b^i_0)\sigma'(w^i_0\cdot y+b^i_0)\Big|$$
$$\le \frac{1}{N}\sum_{i=1}^N \big|\sigma(w^i_t\cdot x+b^i_t)\sigma(w^i_t\cdot y+b^i_t) - \sigma(w^i_0\cdot x+b^i_0)\sigma(w^i_0\cdot y+b^i_0)\big| + |x\cdot y|\,\big|(c^i_t)^2 - (c^i_0)^2\big|\,\big|\sigma'(w^i_t\cdot x+b^i_t)\sigma'(w^i_t\cdot y+b^i_t)\big| + (c^i_0)^2\,|x\cdot y|\,\big|\sigma'(w^i_t\cdot x+b^i_t)\sigma'(w^i_t\cdot y+b^i_t) - \sigma'(w^i_0\cdot x+b^i_0)\sigma'(w^i_0\cdot y+b^i_0)\big|. \quad (143)$$
For a smooth function $f: \mathbb{R}\to\mathbb{R}$, applying the mean value theorem with respect to $w^i_t$, there exists $\bar w \in \mathbb{R}^n$ such that
$$f(w^i_t\cdot x+b)f(w^i_t\cdot y+b) - f(w^i_0\cdot x+b)f(w^i_0\cdot y+b) = (w^i_t - w^i_0)\cdot\big[f'(\bar w\cdot x+b)f(\bar w\cdot y+b)\,x + f(\bar w\cdot x+b)f'(\bar w\cdot y+b)\,y\big]. \quad (144)$$
By the fact that $\sigma \in C^4_b(\mathbb{R})$ and $|c^i_u| < C$ for any $u \in [0,T]$, we have
$$|A^N_t(x,y) - A^N_0(x,y)| \le C\Big[(\|x\|_1 + \|y\|_1)\|w^i_t - w^i_0\|_1 + |x\cdot y|\,|c^i_t - c^i_0| + \|w^i_t - w^i_0\|_1(\|x\|_1+\|y\|_1)|x\cdot y| + (1+|x\cdot y|)|b^i_t - b^i_0|\Big]. \quad (145)$$
Therefore, by (131), (140) and (135), the increments of $c_t$, $b_t$ and $w_t$ are bounded by $CN^{\delta+\beta-1}$. Therefore, we have
$$|A^N_t(x,y) - A^N_0(x,y)| \le CN^{\delta+\beta-1}\big[(\|x\|_1+\|y\|_1)(1+|x\cdot y|) + |x\cdot y|\big] \le CN^{\delta+\beta-1}. \quad (146)$$
(2) By definition,
$$\partial_{y_i}A^N_t(x,y) = \frac{1}{N}\sum_{k=1}^N \sigma(w^k_t\cdot x+b^k_t)\sigma'(w^k_t\cdot y+b^k_t)(w^k_t)_i + (c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma'(w^k_t\cdot y+b^k_t)\,x_i + (c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x\cdot y\,(w^k_t)_i. \quad (147)$$
Similarly to (143), we split the difference between $\partial_{y_i}A^N_t(x,y)$ and $\partial_{y_i}A^N_0(x,y)$ into terms:
$$|\partial_{y_i}A^N_t(x,y) - \partial_{y_i}A^N_0(x,y)| \le \frac{1}{N}\sum_{k=1}^N \big|\sigma(w^k_t\cdot x+b^k_t)\sigma'(w^k_t\cdot y+b^k_t)(w^k_t)_i - \sigma(w^k_0\cdot x+b^k_0)\sigma'(w^k_0\cdot y+b^k_0)(w^k_0)_i\big| + \big|(c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma'(w^k_t\cdot y+b^k_t)\,x_i - (c^k_0)^2\sigma'(w^k_0\cdot x+b^k_0)\sigma'(w^k_0\cdot y+b^k_0)\,x_i\big| + \big|(c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x\cdot y\,(w^k_t)_i - (c^k_0)^2\sigma'(w^k_0\cdot x+b^k_0)\sigma''(w^k_0\cdot y+b^k_0)\,x\cdot y\,(w^k_0)_i\big|. \quad (148)$$
Applying the mean value theorem with respect to $w^k_t$ to each term on the RHS of (148), using the fact that $\sigma \in C^4_b(\mathbb{R})$ and that $|c^k_t| < C$ for any $t \in [0,T]$, we have
$$|\partial_{y_i}A^N_t(x,y) - \partial_{y_i}A^N_0(x,y)| \le \frac{C}{N}\sum_{k=1}^N\Big[|(w^k_t)_i - (w^k_0)_i| + |(w^k_0)_i|(\|x\|_1+\|y\|_1)\|w^k_t-w^k_0\|_1 + |c^k_t - c^k_0|\,|x_i| + |x_i|(\|x\|_1+\|y\|_1)\|w^k_t-w^k_0\|_1 + |(w^k_t)_i-(w^k_0)_i|\,|x\cdot y| + |c^k_t-c^k_0|\,|(w^k_0)_i|\,|x\cdot y| + |(w^k_0)_i|\,|x\cdot y|(\|x\|_1+\|y\|_1)\|w^k_t-w^k_0\|_1 + (1+|x\cdot y|)|b^k_t-b^k_0|\Big].$$
Therefore, by the boundedness of $|c^k_t - c^k_0|$, $|b^k_t-b^k_0|$, $|(w^k_t)_i - (w^k_0)_i|$ and $\|w^k_t - w^k_0\|_1$, we derive
$$|\partial_{y_i}A^N_t(x,y) - \partial_{y_i}A^N_0(x,y)| \le CN^{\delta+\beta-1}\bigg[\Big(1 + \frac{\sum_{k=1}^N|(w^k_0)_i|}{N}\Big)(|x\cdot y|+1)(1+\|x\|_1+\|y\|_1) + |x_i|(1+\|x\|_1+\|y\|_1) + |x\cdot y|\bigg] \le CN^{\delta+\beta-1}.$$
By definition,
$$\partial^2_{y_iy_j}A^N_t(x,y) = \frac{1}{N}\sum_{k=1}^N \sigma(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)(w^k_t)_i(w^k_t)_j + (c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x_i(w^k_t)_j + (c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x_j(w^k_t)_i + (c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma'''(w^k_t\cdot y+b^k_t)\,x\cdot y\,(w^k_t)_i(w^k_t)_j.$$
The difference between $\partial^2_{y_iy_j}A^N_t$ and $\partial^2_{y_iy_j}A^N_0$ can be split into terms:
$$|\partial^2_{y_iy_j}A^N_t(x,y) - \partial^2_{y_iy_j}A^N_0(x,y)| \le \frac{1}{N}\sum_{k=1}^N \big|\sigma(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)(w^k_t)_i(w^k_t)_j - \sigma(w^k_0\cdot x+b^k_0)\sigma''(w^k_0\cdot y+b^k_0)(w^k_0)_i(w^k_0)_j\big| + \big|(c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x_i(w^k_t)_j - (c^k_0)^2\sigma'(w^k_0\cdot x+b^k_0)\sigma''(w^k_0\cdot y+b^k_0)\,x_i(w^k_0)_j\big| + \big|(c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma''(w^k_t\cdot y+b^k_t)\,x_j(w^k_t)_i - (c^k_0)^2\sigma'(w^k_0\cdot x+b^k_0)\sigma''(w^k_0\cdot y+b^k_0)\,x_j(w^k_0)_i\big| + \big|(c^k_t)^2\sigma'(w^k_t\cdot x+b^k_t)\sigma'''(w^k_t\cdot y+b^k_t)\,x\cdot y\,(w^k_t)_i(w^k_t)_j - (c^k_0)^2\sigma'(w^k_0\cdot x+b^k_0)\sigma'''(w^k_0\cdot y+b^k_0)\,x\cdot y\,(w^k_0)_i(w^k_0)_j\big|.$$
We apply the mean value theorem to each term above, leading to
$$|\partial^2_{y_iy_j}A^N_t(x,y) - \partial^2_{y_iy_j}A^N_0(x,y)| \le CN^{\delta+\beta-1}\bigg[\frac{\sum_{k=1}^N|(w^k_0)_i||(w^k_0)_j|}{N}(|x\cdot y|+1)(\|x\|_1+\|y\|_1+1) + \frac{\sum_{k=1}^N|(w^k_0)_i|}{N}\big[(1+|x_j|)(1+\|x\|_1+\|y\|_1) + |x\cdot y|\big] + \frac{\sum_{k=1}^N|(w^k_0)_j|}{N}\big[(1+|x_i|)(1+\|x\|_1+\|y\|_1) + |x\cdot y|\big] + 1 + |x_i| + |x_j| + |x\cdot y|\bigg] \le CN^{\delta+\beta-1}.$$
A.4 Proof of Lemma 37
Proof Taking squares and then expectations on both sides of inequalities (46), (47), and (48) gives
$$\mathbb{E}_{c_0,w_0,b_0}\big[|A^N_t(x,y) - A^N_0(x,y)|^2\big] \le CN^{2(\delta+\beta-1)},$$
$$\mathbb{E}_{c_0,w_0,b_0}\big[|\partial_{y_k}A^N_t(x,y) - \partial_{y_k}A^N_0(x,y)|^2\big] \le CN^{2(\delta+\beta-1)},$$
$$\mathbb{E}_{c_0,w_0,b_0}\big[|\partial^2_{y_ky_l}A^N_t(x,y) - \partial^2_{y_ky_l}A^N_0(x,y)|^2\big] \le CN^{2(\delta+\beta-1)}.$$
Summing these inequalities over all $k, l \in \{1,2,\dots,n\}$ and applying the Cauchy-Schwarz inequality gives the result.

A.5 Proof of Lemma 39
Proof First note that
$$\mathbb{E}_{c_0,w_0,b_0}[A^N_0(x,y)] = \mathbb{E}_{c_0,w_0,b_0}\Big[\frac{1}{N}\sum_{k=1}^N \sigma(w^k_0\cdot x+b^k_0)\sigma(w^k_0\cdot y+b^k_0) + (c^k_0)^2\,x\cdot y\,\sigma'(w^k_0\cdot x+b^k_0)\sigma'(w^k_0\cdot y+b^k_0)\Big] = A(x,y).$$
We can also calculate
$$\mathbb{E}_{c_0,w_0,b_0}|A^N_0(x,y) - A(x,y)|^2 = \mathrm{Var}[A^N_0(x,y)] = \frac{1}{N}\mathrm{Var}[U(x;c,w,b)\,U(y;c,w,b)] \le \frac{1}{N}\mathbb{E}_{c,w,b}|U(x;c,w,b)\,U(y;c,w,b)|^2 \le \frac{C}{N}.$$
For the partial derivatives, for any fixed $x, y \in \Omega$,
$$\big|\partial_{y_i}\big(\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big)\big| = \big|\sigma(w\cdot x+b)\sigma'(w\cdot y+b)\,w_i + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x_i + c^2\,x\cdot y\,\sigma'(w\cdot x+b)\sigma''(w\cdot y+b)\,w_i\big|,$$
$$\big|\partial^2_{y_iy_j}\big(\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big)\big| = \big|\sigma(w\cdot x+b)\sigma''(w\cdot y+b)\,w_iw_j + c^2\sigma'(w\cdot x+b)\sigma''(w\cdot y+b)(x_iw_j + x_jw_i) + c^2\,x\cdot y\,\sigma'(w\cdot x+b)\sigma'''(w\cdot y+b)\,w_iw_j\big|.$$
These are locally bounded around $(x,y) \in \Omega^2$, so by Leibniz's integral rule, we swap the expectation and the differential operator:
$$\partial_{y_i}A(x,y) = \partial_{y_i}\mathbb{E}_{c,w,b}\big[\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big] = \mathbb{E}_{c,w,b}\big[\partial_{y_i}\big(\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big)\big] = \mathbb{E}_{c_0,w_0,b_0}[\partial_{y_i}A^N_0(x,y)],$$
$$\partial^2_{y_iy_j}A(x,y) = \partial^2_{y_iy_j}\mathbb{E}_{c,w,b}\big[\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big] = \mathbb{E}_{c,w,b}\big[\partial^2_{y_iy_j}\big(\sigma(w\cdot x+b)\sigma(w\cdot y+b) + c^2\sigma'(w\cdot x+b)\sigma'(w\cdot y+b)\,x\cdot y\big)\big] = \mathbb{E}_{c_0,w_0,b_0}[\partial^2_{y_iy_j}A^N_0(x,y)].$$
Therefore, we can bound the first-derivative variance $\mathrm{Var}[\partial_{y_i}A^N_0(x,y)]$:
$$\mathbb{E}_{c_0,w_0,b_0}|\partial_{y_i}A^N_0(x,y) - \partial_{y_i}A(x,y)|^2 = \frac{1}{N}\mathrm{Var}\big[\partial_{y_i}\big(U(x;c,w,b)\,U(y;c,w,b)\big)\big] \le \frac{1}{N}\mathbb{E}_{c,w,b}\big|\partial_{y_i}\big(U(x;c,w,b)\,U(y;c,w,b)\big)\big|^2 \le \frac{1}{N}\mathbb{E}_{c,w,b}\big[(C|w_i| + C|x_i| + C|x\cdot y||w_i|)^2\big] \le \frac{C}{N},$$
and similarly the second-derivative variance $\mathrm{Var}[\partial^2_{y_iy_j}A^N_0(x,y)]$:
$$\mathbb{E}_{c_0,w_0,b_0}|\partial^2_{y_iy_j}A^N_0(x,y) - \partial^2_{y_iy_j}A(x,y)|^2 = \frac{1}{N}\mathrm{Var}\big[\partial^2_{y_iy_j}\big(U(x;c,w,b)\,U(y;c,w,b)\big)\big] \le \frac{1}{N}\mathbb{E}_{c,w,b}\big|\partial^2_{y_iy_j}\big(U(x;c,w,b)\,U(y;c,w,b)\big)\big|^2 \le \frac{1}{N}\mathbb{E}_{c,w,b}\big[(C|w_iw_j| + C|x_iw_j| + C|x_jw_i| + C|x\cdot y||w_iw_j|)^2\big] \le \frac{C}{N}.$$

A.6 Existence of ODE solutions
Lemma 50 Under the assumptions of Theorem 47, for any $q \in H^2$, and any initial value $v_0 \in L^2_\eta$, the ODE
$$\frac{dv_t}{dt} = \mathcal{L}(\mathcal{B}v_t + q) \quad (157)$$
admits a unique solution $v$ taking values in $L^2_\eta$.

Proof Consider the process $w_t = \eta v_t$. This has initial value $w_0 = \eta v_0 \in L^2$, and satisfies the dynamics
$$\frac{dw_t}{dt} = \eta\,\mathcal{L}(\mathcal{B}(w_t/\eta) + q) =: G(w_t). \quad (158)$$
From Lemma 29, the map $w \mapsto \mathcal{B}(w/\eta)$ is uniformly Lipschitz as a map $L^2 \to H^2$. By Assumption 6, $\mathcal{L}$ is uniformly Lipschitz as a map $H^2 \to L^2$. By Assumption 2, $\eta$ is bounded on $\Omega$. Therefore, $G: L^2 \to L^2$ is uniformly Lipschitz. From the Picard-Lindelöf theorem (see, for example, (Kolokoltsov, 2019, Theorem 2.2.1)), we know that (158) admits a unique solution with $w_t \in L^2$ for all $t \ge 0$. Using this, together with the fact that $\eta > 0$ on $\Omega$, we see that there is a unique $v_t = w_t/\eta$ satisfying the original equation, with values in $L^2_\eta$.

A.7 Monotonicity for linear elliptic PDE
In this subsection, we give a brief proof of monotonicity for a class of linear elliptic differential operators, with particular choices of sampling measure $\mu$. Our results in this appendix are not exhaustive, but demonstrate some of the flexibility of the equations we have considered.

Lemma 51 Consider an autonomous Markov process $X$ satisfying the SDE
$$dX_t = \nu(X_t)\,dt + \sigma(X_t)\,dW_t \quad (159)$$
where $\nu: \Omega \to \mathbb{R}^n$ and $\sigma: \Omega \to \mathbb{R}^{n\times n}$, and $W$ is an $\mathbb{R}^n$-valued standard Brownian motion. Suppose further that $\nu$ and $\sigma$ both vanish outside $\Omega$.
For an absolutely continuous measure $\mu$ on $\Omega$, with density $f_\mu$, let $\nu$, $\sigma$ and $f_\mu$ be sufficiently smooth that
$$\gamma^* = \sup_{x\in\Omega}\Bigg\{\frac{1}{f_\mu(x)}\Big(-\sum_i \partial_{x_i}\big[\nu_i(x)f_\mu(x)\big] + \frac{1}{2}\sum_{i,j}\partial^2_{x_ix_j}\big[(\sigma\sigma^\top)_{ij}(x)f_\mu(x)\big]\Big)\Bigg\} < \infty. \quad (160)$$
Then the generator of $X$ satisfies, for all $u \in H^2$,
$$\Big\langle u,\ \sum_i \nu_i\frac{\partial u}{\partial x_i} + \frac{1}{2}\sum_{ij}(\sigma\sigma^\top)_{ij}\frac{\partial^2 u}{\partial x_i\partial x_j}\Big\rangle \le \frac{\gamma^*}{2}\|u\|^2, \quad (161)$$
where the inner product and norm are both taken in $L^2(\Omega,\mu)$.

Proof Consider a copy of $X$ initialized with $X_0 \sim \mu$. The Fokker-Planck equation states that the density of $X_t$ satisfies $f(x,0) = f_\mu$ and
$$\partial_t f(x,t) = -\sum_i \partial_{x_i}\big[\nu_i(x)f(x,t)\big] + \frac{1}{2}\sum_{i,j}\partial^2_{x_ix_j}\big[(\sigma\sigma^\top)_{ij}(x)f(x,t)\big] \quad (162)$$
and hence, using the definition of $\gamma^*$,
$$\partial_t\big(e^{-\gamma^* t}f(x,t)\big) \le 0. \quad (163)$$
As $f(x,t)$ is positive, we can define an operator
$$T_tu(x) := \mathbb{E}\big[u(X^x_t)\,e^{-\frac{\gamma^*}{2}t}\,\big|\,X_0 = x\big] \quad (164)$$
which satisfies, by Jensen's inequality,
$$\|T_tu\|^2_{L^2(\mu)} \le \int_\Omega u^2(x)\,e^{-\gamma^* t}f(x,t)\,dx \le \int_\Omega u^2(x)\,f_\mu(x)\,dx = \|u\|^2_{L^2(\mu)}, \quad (165)$$
so the operator $T_t$ is a contraction. It is easy to verify that $T_t$ is also a strongly continuous semigroup, and its generator is given by $u \mapsto -\frac{\gamma^*}{2}u + \mathcal{A}u$, where $\mathcal{A}$ is the generator of $X$. By the Lumer-Phillips theorem (Lumer and Phillips (1961)), we know that the generator of $T_t$ is dissipative on $L^2(\Omega,\mu)$, that is,
$$0 \ge \Big\langle u,\ -\frac{\gamma^*}{2}u + \mathcal{A}u\Big\rangle = \Big\langle u,\ -\frac{\gamma^*}{2}u + \sum_i \nu_i\frac{\partial u}{\partial x_i} + \frac{1}{2}\sum_{ij}(\sigma\sigma^\top)_{ij}\frac{\partial^2 u}{\partial x_i\partial x_j}\Big\rangle. \quad (166)$$
Rearrangement yields the result.

Remark 52 This result immediately shows that if $\mu$ is a stationary distribution for the Markov process, then $\gamma^* = 0$ and the generator of $X$ is automatically $L^2$-dissipative. (This result is known; see, for example, (Kallenberg, 2002, Chapter 20).) This suggests that this will often be a wise choice for $\mu$ in linear problems, if it is known explicitly. By manipulating the choice of measure $\mu$, or equivalently setting $\mu$ equal to Lebesgue measure and considering the equivalent PDE
$$\tilde{\mathcal{L}}u := f(x)\,\mathcal{L}u = 0, \quad (167)$$
this result can be leveraged usefully, provided the drift of $X$ does not grow quickly. Similar results for other bounds on $\nu$ can also be obtained.

Lemma 53 Consider a process $X$ as in Lemma 51.
Suppose that $\sigma$ is constant and there exist bounded functions $g_i: \mathbb{R}\to\mathbb{R}$ for $i \le n$ such that, for all $x \in \Omega$,
$$-\sum_i \partial_{x_i}\nu_i(x) + \sum_i g_i(x_i)\nu_i(x) + \frac{1}{2}\sum_{ij}g_i(x_i)g_j(x_j)(\sigma\sigma^\top)_{ij} \le \gamma. \quad (168)$$
Then the generator of $X$ satisfies $\langle u, \mathcal{A}u\rangle \le \frac{\gamma}{2}\|u\|^2$, where $\mu$ is the measure on $\Omega$ with Lebesgue density
$$f_\mu = C\exp\Big(-\sum_i \int_0^{x_i} g_i(\zeta)\,d\zeta\Big),$$
for $C$ a normalizing constant. In particular, this implies that the operator
$$\mathcal{L}u = r - \gamma u + \mathcal{A}u \quad (169)$$
is strongly monotone, in the sense of Assumption 9.

Proof We calculate
$$-\sum_i \partial_{x_i}\big[\nu_i(x)f_\mu(x)\big] + \frac{1}{2}\sum_{i,j}\partial^2_{x_ix_j}\big[(\sigma\sigma^\top)_{ij}f_\mu(x)\big] \le \Big(-\sum_i \partial_{x_i}\nu_i(x) + \sum_i g_i(x_i)\nu_i(x) + \frac{1}{2}\sum_{i,j}g_i(x_i)g_j(x_j)(\sigma\sigma^\top)_{ij}\Big)f_\mu(x) \le \gamma f_\mu(x).$$
Therefore, in (160) we have $\gamma^* \le \gamma$, and the result follows from (161).

Remark 54 In one dimension, (168) is clearly satisfied whenever we know that $4\sigma^2(\gamma + \nu') \ge \nu^2$, by taking $g(x) = -\nu(x)/(2\sigma^2)$.

A.8 Configuration of the numerical test cases
A.8.1 Table of hyper-parameters

Dimension  Method  Layers  Units  Activation  Optimizer  Number of MC samples
1          Q-PDE   1       64     Sigmoid     ADAM       l_MC = 1k, u_MC = 2k
20         Q-PDE   1       256    Sigmoid     ADAM       l_MC = 2k, u_MC = 10k
20         DGM     1       256    Sigmoid     ADAM       l_MC = 2k, u_MC = 10k

A.8.2 Initialization of neural networks
Parameters of the single-layer net $S_0$ are randomly sampled: the $c^i_0$ are sampled i.i.d. from the uniform distribution $U[-1,1]$; the $w^i_0$ and $b^i_0$ are sampled i.i.d. from the Gaussian distributions $N(0, I_d)$ and $N(0,1)$, where $I_d$ is the identity matrix of dimension $d$.

A.8.3 Learning process
We use $Q_t := S_t\,\eta$ as the approximator, where $\eta(x) := 1 - \|x\|^2$. We apply the built-in ADAM optimizer in our tests with initial learning rate $l_0 = 0.5$, and the learning rate decays as $l_t = l_0/(1 + t/200)$. In each step, we sample Monte Carlo points for the gradient estimate. As a larger number of MC sample points reduces the random error, we linearly increase the number $M_t$ of MC points sampled at each step as $M_t = \mathrm{round}(l_{MC} + (u_{MC} - l_{MC})t/T)$, where $T$ is the terminal number of training steps.

References
Robert A. Adams and John J.F. Fournier. Sobolev Spaces. Elsevier, 2003.
Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, and Benno Kuckuck. An overview on deep learning-based approximation methods for partial differential equations. arXiv preprint arXiv:2012.12348, 2020.
Felix E. Browder. Existence and perturbation theorems for nonlinear maximal monotone operators in Banach spaces. Bulletin of the American Mathematical Society, 73(3):322-327, 1967.
Shengze Cai, Zhiping Mao, Zhicheng Wang, Minglang Yin, and George Em Karniadakis. Physics-informed neural networks (PINNs) for fluid mechanics: A review. Acta Mechanica Sinica, 37(12):1727-1738, 2021.
Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Edward A. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 2933-2943, 2019. URL http://papers.nips.cc/paper/8559-on-lazy-training-in-differentiable-programming.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349-380, 2017.
Weinan E, Jiequn Han, and Arnulf Jentzen. Algorithms for solving high dimensional PDEs: from nonlinear Monte Carlo to machine learning. Nonlinearity, 35(1):278-310, 2022.
Lawrence C. Evans. Partial Differential Equations. American Mathematical Society, 2010.
Maximilien Germain, Mathieu Laurière, Huyên Pham, and Xavier Warin. DeepSets and their derivative networks for solving symmetric PDEs. arXiv preprint arXiv:2103.00838, 2021a.
Maximilien Germain, Huyên Pham, and Xavier Warin.
Neural networks-based algorithms for stochastic control and PDEs in finance. arXiv preprint arXiv:2101.08068, 2021b.
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14820-14830. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/a9df2255ad642b923d95503b9a7958d8-Paper.pdf.
David Gilbarg and Neil S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, 1998.
Pierre Grisvard. Elliptic Problems in Nonsmooth Domains. SIAM, 2011.
Philipp Grohs and Lukas Herrmann. Deep neural network approximation for high-dimensional elliptic PDEs with boundary conditions. arXiv preprint arXiv:2007.05384, 2020.
Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
Ruimeng Hu, Quyuan Lin, Alan Raydan, and Sui Tang. Higher-order error estimates for physics-informed neural networks approximating the primitive equations. arXiv preprint arXiv:2209.11929, 2022.
Kazufumi Ito, Christoph Reisinger, and Yufei Zhang. A neural network-based policy iteration algorithm with global H2-superlinear convergence for stochastic games on domains. Foundations of Computational Mathematics, 21(2):331-374, 2021.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
Olav Kallenberg. Foundations of Modern Probability. Springer, 2nd edition, 2002.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Vassili Kolokoltsov. Differential Equations on Measures and Functional Spaces. Springer, 2019.
Isaac E. Lagaris, Aristidis Likas, and Dimitrios I. Fotiadis.
Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987-1000, 1998.
Isaac E. Lagaris, Aristidis C. Likas, and Dimitris G. Papageorgiou. Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5):1041-1049, 2000.
Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.
Hyuk Lee and In Seok Kang. Neural algorithm for solving differential equations. Journal of Computational Physics, 91(1):110-131, 1990.
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32:8572-8583, 2019.
Jian Li, Jing Yue, Wen Zhang, and Wansuo Duan. The deep learning Galerkin method for the general Stokes equations. Journal of Scientific Computing, 93(1):5, 2022.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. In International Conference on Machine Learning, pages 3208-3216. PMLR, 2018.
Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019.
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis.
Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218-229, 2021.
Günter Lumer and R.S. Phillips. Dissipative operators in a Banach space. Pacific Journal of Mathematics, 11:679-698, 1961.
Alaeddin Malek and R. Shekari Beidokhti. Numerical solution for high order differential equations using a hybrid neural network optimization method. Applied Mathematics and Computation, 183(1):260-271, 2006.
Kevin Stanley McFall and James Robert Mahan. Artificial neural network method for solution of boundary value problems with exact satisfaction of arbitrary boundary conditions. IEEE Transactions on Neural Networks, 20(8):1221-1233, 2009.
Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388-2464. PMLR, 25-28 Jun 2019.
George S. Misyris, Andreas Venzke, and Spyros Chatzivasileiadis. Physics-informed neural networks for power systems. In 2020 IEEE Power & Energy Society General Meeting (PESGM), pages 1-5. IEEE, 2020.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Huyên Pham, Xavier Warin, and Maximilien Germain. Neural networks-based backward scheme for fully nonlinear PDEs. SN Partial Differential Equations and Applications, 2(1):1-24, 2021.
Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686-707, 2019.
Yeonjong Shin, Jerome Darbon, and George Em Karniadakis. On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806, 2020.
Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339-1364, 2018.
Justin Sirignano, Jonathan MacArt, and Konstantinos Spiliopoulos. PDE-constrained models with neural network terms: Optimization and global convergence. arXiv preprint arXiv:2105.08633, 2021.
Iain Smears and Endre Süli. Discontinuous Galerkin finite element approximation of Hamilton-Jacobi-Bellman equations with Cordes coefficients. SIAM Journal on Numerical Analysis, 52(2):993-1016, 2014.
Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817, 2022.
Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. arXiv preprint arXiv:2007.14527, 2020.
Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.
Eberhard Zeidler. Nonlinear Functional Analysis and Its Applications: II/B: Nonlinear Monotone Operators. Springer, 2013.
Lulu Zhang, Tao Luo, Yaoyu Zhang, Zhi-Qin John Xu, and Zheng Ma. MOD-Net: A machine learning approach via model-operator-data network for solving PDEs. arXiv preprint arXiv:2107.03673, 2021.
Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. Deep reinforcement learning for search, recommendation, and online advertising: a survey. ACM SIGWEB Newsletter, (Spring):1-15, 2019.