Globally injective and bijective neural operators

Takashi Furuya¹, Michael Puthawala², Matti Lassas³, Maarten V. de Hoop⁴
¹ Shimane University, takashi.furuya0101@gmail.com
² South Dakota State University, Michael.Puthawala@sdstate.edu
³ University of Helsinki, matti.lassas@helsinki.fi
⁴ Rice University, mdehoop@rice.edu

Recently there has been great interest in operator learning, where networks learn operators between function spaces from an essentially infinite-dimensional perspective. In this work we present results for when the operators learned by these networks are injective and surjective. As a warmup, we combine prior work in both the finite-dimensional ReLU and operator-learning settings by giving sharp conditions under which ReLU layers with linear neural operators are injective. We then consider the case when the activation function is pointwise bijective and obtain sufficient conditions for the layer to be injective. We remark that this question, while trivial in the finite-rank setting, is subtler in the infinite-rank setting and is proven using tools from Fredholm theory. Next, we prove that our supplied injective neural operators are universal approximators and that their implementation, with finite-rank neural networks, is still injective. This ensures that injectivity is not lost in the transcription from analytical operators to their finite-rank implementation with networks. Finally, we conclude with an increase in abstraction and consider general conditions under which subnetworks, which may have many layers, are injective and surjective, and provide an exact inversion from a linearization. This section uses general arguments from Fredholm theory and Leray-Schauder degree theory for non-linear integral equations to analyze the mapping properties of neural operators in function spaces. These results apply to subnetworks formed from the layers considered in this work, under natural conditions. We believe that our work has applications in Bayesian uncertainty quantification, where injectivity enables likelihood estimation, and in inverse problems, where surjectivity and injectivity correspond to existence and uniqueness of the solutions, respectively.

1 Introduction

In this work, we produce results at the intersection of two fields: neural operators (NO), and injective and bijective networks. Neural operators [Kovachki et al., 2021a,b] are neural networks that take an infinite-dimensional perspective on approximation by directly learning an operator between Sobolev spaces. Injectivity and bijectivity, on the other hand, are fundamental properties of networks that enable likelihood estimation by the change of variables formula, are critical in applications to inverse problems, and are useful properties for downstream applications. The key contribution of our work is the translation of fundamental notions from the finite-rank setting to the infinite-rank setting. By the infinite-dimensional setting we refer to the case when the object of approximation is a mapping between Sobolev spaces. This task, although straightforward on first inspection, often requires dramatically different arguments and proofs, as the topology, analysis and notion of noise are much simpler in the finite-rank case than in the infinite-rank case. We see our work as laying the groundwork for the application of neural operators to generative models in function spaces.
In the context of operator extensions of traditional VAEs [Kingma and Welling, 2013], injectivity of a decoder forces distinct latent codes to correspond to distinct outputs. Our work draws parallels between neural operators and pseudodifferential operators Taylor [1981], a class that contains many inverses of linear partial differential operators and integral operators. The connection to pseudodifferential operators provided an algebraic perspective on linear PDE Kohn and Nirenberg [1965]. An important fact in the analysis of pseudodifferential operators is that the inverses of certain operators, e.g. elliptic pseudodifferential operators, are themselves pseudodifferential operators. By proving an analogous result in Section 4.2, namely that the inverses of invertible NOs are themselves given by NOs, we draw an important and profound connection between (non)linear partial differential equations and NOs.

We also believe that our methods have applications to the solution of inverse problems with neural networks. The desire to use injective neural networks is one of the primary motivations for this work. These infinite-dimensional models can then be approximated by a finite-dimensional model without losing discretization invariance, see Stuart [2010]. Crucially, discretization must be done at the last possible moment, or else performance degrades as the discretization becomes finer, see Lassas and Siltanen [2004] and also Saksman et al. [2009]. By formulating machine learning problems in infinite-dimensional function spaces and then approximating these methods using finite-dimensional subspaces, we avoid bespoke ad-hoc methods and instead obtain methods that apply to any discretization. More details on our motivations for and applications of injectivity & bijectivity of neural operators are given in Appendix A.

1.1 Our Contribution

In this paper, we give a rigorous framework for the analysis of the injectivity and bijectivity of neural operators. Our contributions are as follows:

(A) We show an equivalent condition for the layerwise injectivity and bijectivity of linear neural operators in the case of ReLU and bijective activation functions (Section 2). In the particular ReLU case, the equivalent condition is characterized by a directed spanning set (Definition 2).

(B) We prove that injective linear neural operators are universal approximators, and that their implementation by finite-rank approximation is still injective (Section 3). We note that the universal approximation theorem (Theorem 1) in the infinite-dimensional case does not require an increase in dimension, which deviates from the finite-dimensional case Puthawala et al. [2022a, Thm. 15].

(C) We zoom out and perform a more abstract global analysis in the case when the input and output dimensions are the same. In this section we coarsen the notion of layer, and provide a sufficient condition for the surjectivity and bijectivity of nonlinear integral neural operators with nonlinear kernels. This application arises naturally in the context of subnetworks and transformers. We construct their inverses in the bijective case (Section 4).

1.2 Related Works

In the finite-rank setting, injective networks have been well-studied, and shown to be of theoretical and practical interest. See Gomez et al. [2017], Kratsios and Bilokopytov [2020], Teshima et al. [2020], Ishikawa et al. [2022], Puthawala et al.
[2022a] for general references establishing the usefulness of injectivity, or any of the works on flow networks for the utility of injectivity and bijectivity for downstream applications [Dinh et al., 2016, Siahkoohi et al., 2020, Chen et al., 2019, Dinh et al., 2014, Kingma et al., 2016], but their study in the infinite-rank setting is comparatively underdeveloped. These works, and others, establish injectivity in the finite-rank setting as a property of theoretical and practical interest. Our work extends Puthawala et al. [2022a] to the infinite-dimensional setting as applied to neural operators, which themselves are a generalization of multilayer perceptrons (MLPs) to function spaces. Moreover, our work includes not only injectivity, but also surjectivity in the non-ReLU activation case, which Puthawala et al. [2022a] did not focus on. Examples of works in this setting include neural operators Kovachki et al. [2021b,a], DeepONet Lu et al. [2019], Lanthaler et al. [2022], and PCA-Net Bhattacharya et al. [2021], De Hoop et al. [2022]. The authors of Alberti et al. [2022] recently proposed continuous generative neural networks (CGNNs), which are convolution-type architectures for generating $L^2(\mathbb{R})$-functions, and provided a sufficient condition for the global injectivity of their network. Their approach is based on a wavelet basis expansion, whereas our work relies on an independent choice of basis expansion.

1.3 Networks considered and notation

Let $D \subset \mathbb{R}^d$ be an open and connected domain, and let $L^2(D; \mathbb{R}^h)$ be the $L^2$ space of $\mathbb{R}^h$-valued functions on $D$, given by
$$L^2(D; \mathbb{R}^h) := \underbrace{L^2(D; \mathbb{R}) \times \cdots \times L^2(D; \mathbb{R})}_{h}.$$

Definition 1 (Integral and pointwise neural operators). We define an integral neural operator $G : L^2(D)^{d_{in}} \to L^2(D)^{d_{out}}$ and layers $\mathcal{L}_\ell : L^2(D)^{d_\ell} \to L^2(D)^{d_{\ell+1}}$ by
$$G := T_{L+1} \circ \mathcal{L}_L \circ \cdots \circ \mathcal{L}_1 \circ T_0, \qquad (\mathcal{L}_\ell v)(x) := \sigma\big(T_\ell(v)(x) + b_\ell(x)\big),$$
$$T_\ell(v)(x) = W_\ell(x) v(x) + \int_D k_\ell(x, y) v(y)\, dy, \qquad x \in D,$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-linear activation operating element-wise, $k_\ell \in L^2(D \times D; \mathbb{R}^{d_{\ell+1} \times d_\ell})$ are integral kernels, $W_\ell \in C(D; \mathbb{R}^{d_{\ell+1} \times d_\ell})$ are pointwise multiplications with matrices, and $b_\ell \in L^2(D)^{d_{\ell+1}}$ are bias functions ($\ell = 1, \dots, L$). Here, $T_0 : L^2(D)^{d_{in}} \to L^2(D)^{d_1}$ and $T_{L+1} : L^2(D)^{d_{L+1}} \to L^2(D)^{d_{out}}$ are mappings (the lifting operator) from the input space to the feature space, and (the projection operator) from the feature space to the output space, respectively.

The layers $T_0$ and $T_{L+1}$ play a special role in neural operators. They are local linear operators and serve to lift and project the input data from and to finite-dimensional space, respectively. These layers may be absorbed into the layers $\mathcal{L}_1$ and $\mathcal{L}_L$ without loss of generality (under some technical conditions), but are not in this text, to maintain consistency with prior work. Prior work assumes that $d_{in} < d_1$; we only assume that $d_{in} \le d_1$ for the lifting operator $T_0 : L^2(D)^{d_{in}} \to L^2(D)^{d_1}$. This would seemingly play an important role in the context of injectivity or universality, but we find that our analysis does not require that $d_{in} < d_1$ at all. In fact, as elaborated in Section 3.2, we may take $d_{in} = d_\ell = d_{out}$ for $\ell = 1, \dots, L$ and our analysis is the same.
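To fix ideas, the following is a minimal NumPy sketch of how one layer from Definition 1 can be discretized on a grid with a quadrature rule. The grid, the trapezoidal quadrature, the Gaussian kernel, the weight, the bias and the leaky-ReLU activation are all illustrative assumptions of ours, not part of the paper's construction.

```python
import numpy as np

def integral_no_layer(v, x, W, kernel, bias, sigma):
    """One layer (L_l v)(x) = sigma( W(x) v(x) + \int_D k(x, y) v(y) dy + b(x) ),
    discretized on a uniform grid over D = [0, 1] with trapezoidal quadrature.

    v      : (n, d_in)          input function sampled at the grid points
    x      : (n,)               grid points
    W      : (n, d_out, d_in)   pointwise matrix multiplier W(x)
    kernel : k(x_i, y_j) -> (d_out, d_in) array
    bias   : (n, d_out)         bias function b(x)
    sigma  : elementwise activation
    """
    n = len(x)
    w = np.full(n, x[1] - x[0])          # trapezoidal quadrature weights
    w[0] *= 0.5
    w[-1] *= 0.5
    out = np.empty_like(bias)
    for i in range(n):
        local = W[i] @ v[i]                                   # W(x_i) v(x_i)
        integ = sum(w[j] * (kernel(x[i], x[j]) @ v[j])        # \int_D k(x_i, y) v(y) dy
                    for j in range(n))
        out[i] = local + integ + bias[i]
    return sigma(out)

# toy usage with d_in = d_out = 1, a smooth kernel and a leaky-ReLU activation
x = np.linspace(0.0, 1.0, 64)
v = np.sin(2 * np.pi * x)[:, None]
W = np.ones((64, 1, 1))
kernel = lambda xi, yj: np.array([[np.exp(-(xi - yj) ** 2)]])
bias = np.zeros((64, 1))
leaky_relu = lambda t: np.where(t > 0, t, 0.1 * t)
u = integral_no_layer(v, x, W, kernel, bias, leaky_relu)      # shape (64, 1)
```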
2 Injective linear neural operator layers with ReLU and bijective activations

In this section we present sharp conditions under which a layer of a neural operator with ReLU activation is injective. The Directed Spanning Set (DSS) condition, described by Def. 2, is a generalization of the finite-dimensional DSS [Puthawala et al., 2022a], which guarantees layerwise injectivity of ReLU layers. Extending this condition from finite to infinite dimensions is not automatic. The finite-dimensional DSS will hold with high probability for random weight matrices if they are expansive enough ([Puthawala et al., 2022a, Theorem 7]). However, the infinite-dimensional DSS is much more restrictive than in the finite-dimensional setting. We then present a less restrictive condition that is met when the activation function is bijective, e.g. when a leaky-ReLU activation is used.

Although it may appear that the end-to-end result is strictly stronger than the layerwise result, this is not the case. The layerwise result is an exact characterization, whereas the end-to-end result is sufficient for injectivity, but not necessary. The layerwise analysis is also constructive, and so gives a rough guide for the construction of injective networks, whereas the global analysis is less so. Finally, the layerwise condition has different applications, such as networks of stochastic depth, see e.g. Huang et al. [2016], Benitez et al. [2023]. End-to-end injectivity by enforcing layerwise injectivity is straightforward, whereas deriving a sufficient condition for networks of any depth is more daunting.

We denote by
$$\sigma(Tv + b)(x) := \begin{pmatrix} \sigma(T_1 v(x) + b_1(x)) \\ \vdots \\ \sigma(T_m v(x) + b_m(x)) \end{pmatrix}, \qquad x \in D,\ v \in L^2(D)^n,$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is a non-linear activation function, $T \in \mathcal{L}(L^2(D)^n, L^2(D)^m)$, and $b \in L^2(D)^m$, where $\mathcal{L}(L^2(D)^n, L^2(D)^m)$ is the space of bounded linear operators from $L^2(D)^n$ to $L^2(D)^m$. The aim of this section is to characterize the injectivity condition for the operator $v \mapsto \sigma(Tv + b)$ mapping from $L^2(D)^n$ to $L^2(D)^m$, which corresponds to the layer operators $\mathcal{L}_\ell$. Here, $T : L^2(D)^n \to L^2(D)^m$ is linear.

2.1 ReLU activation

Let $\mathrm{ReLU} : \mathbb{R} \to \mathbb{R}$ be the ReLU activation, defined by $\mathrm{ReLU}(s) = \max\{0, s\}$. With this activation function, we introduce a definition which we will find sharply characterizes layerwise injectivity.

Definition 2 (Directed Spanning Set). We say that the operator $T + b$ has a directed spanning set (DSS) with respect to $v \in L^2(D)^n$ if
$$\operatorname{Ker}\big(T|_{S(v,T+b)}\big) \cap X(v, T+b) = \{0\}, \qquad (2.1)$$
where $T|_{S(v,T+b)}(v) = (T_i v)_{i \in S(v,T+b)}$ and
$$S(v, T+b) := \big\{\, i \in [m] \;\big|\; T_i v + b_i > 0 \text{ in } D \,\big\}, \qquad (2.2)$$
$$X(v, T+b) := \Big\{\, u \in L^2(D)^n \;\Big|\; \text{for } i \notin S(v,T+b) \text{ and } x \in D:\ \text{(i) } T_i v(x) + b_i(x) \le T_i u(x) \text{ if } T_i v(x) + b_i(x) \le 0,\ \text{(ii) } T_i u(x) = 0 \text{ if } T_i v(x) + b_i(x) > 0 \,\Big\}.$$

The name directed spanning set arises from the $\operatorname{Ker}(T|_{S(v,T+b)})$ term of (2.1). The indices of $S(v, T+b)$ are those that are directed (positive) in the direction of $v$. If $T$ restricted to these indices together spans $L^2(D)^n$, then $\operatorname{Ker}(T|_{S(v,T+b)})$ is $\{0\}$, and the condition is automatically satisfied. Hence, the DSS condition measures the extent to which the set of indices which are directed w.r.t. $v$ forms a span of the input space.

Proposition 1. Let $T \in \mathcal{L}(L^2(D)^n, L^2(D)^m)$ and $b \in L^2(D)^m$. Then, the operator $\mathrm{ReLU} \circ (T+b) : L^2(D)^n \to L^2(D)^m$ is injective if and only if $T + b$ has a DSS with respect to every $v \in L^2(D)^n$ in the sense of Definition 2.

See Section B for the proof. Puthawala et al. [2022a] provided the equivalent condition for the injectivity of the ReLU operator in the case of Euclidean space. However, proving analogous results for operators in function spaces requires different techniques. Note that because Def. 2 is a sharp characterization of injectivity, it cannot be simplified in any significant way. The restrictive condition of Def. 2 is, therefore, difficult to relax while maintaining generality.
This is because, for each function $v$, multiple components of the function $Tv + b$ must be strictly positive in the entire domain $D$, and the cardinality $|S(v, T+b)|$ of $S(v, T+b)$ must be larger than $n$. This observation prompts us to consider the use of bijective activation functions instead of ReLU, such as the leaky ReLU function, defined by $\sigma_a(s) := \mathrm{ReLU}(s) - a\,\mathrm{ReLU}(-s)$, where $a > 0$.

2.2 Bijective activation

If $\sigma$ is injective, then injectivity of $\sigma \circ (T+b) : L^2(D)^n \to L^2(D)^m$ is equivalent to the injectivity of $T$. Therefore, we consider bijectivity in the case of $n = m$. As mentioned in Section 1.3, a significant example is $T = W + K$, where $W \in \mathbb{R}^{n \times n}$ is injective and $K : L^2(D)^n \to L^2(D)^n$ is a linear integral operator with a smooth kernel. This can be generalized to Fredholm operators (see e.g. Jeribi [2015, Section 2.1.4]), which encompass operators of the form identity plus a compact operator. It is well known that a Fredholm operator is bijective if and only if it is injective and its Fredholm index is zero. We summarize the above observation as follows:

Proposition 2. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be bijective, and let $T : L^2(D)^n \to L^2(D)^m$ and $b \in L^2(D)^m$. Then, $\sigma \circ (T+b) : L^2(D)^n \to L^2(D)^m$ is injective if and only if $T : L^2(D)^n \to L^2(D)^m$ is injective. Furthermore, if $n = m$ and $T \in \mathcal{L}(L^2(D)^n, L^2(D)^n)$ is a linear Fredholm operator, then $\sigma \circ (T+b) : L^2(D)^n \to L^2(D)^n$ is bijective if and only if $T : L^2(D)^n \to L^2(D)^n$ is injective with index zero.

We believe that this characterization of layerwise injectivity is considerably less restrictive than Def. 2, and that the characterization of bijectivity in terms of Fredholm theory will be particularly useful in establishing operator generalizations of flow networks.

3 Global analysis of injectivity and finite-rank implementation

In this section we consider global properties of the injective and bijective networks constructed in Section 2. First we construct end-to-end injective networks that are not layerwise injective. By doing this, we may avoid the dimension-increasing requirement that would be necessary from a layerwise analysis. Next we show that injective neural operators are universal approximators of continuous functions. Although the punchline resembles that of [Puthawala et al., 2022a, Theorem 15], which relied on Whitney's embedding theorem, the arguments are quite different. The finite-rank case has dimensionality restrictions, as required by degree theory, whereas our infinite-rank result does not. Finally, because all implementations of neural operators are ultimately finite-dimensional, we present a theorem that gives conditions under which finite-rank approximations to injective neural operators are also injective.

3.1 Global analysis

By using the characterization of layerwise injectivity discussed in Section 2, we can compose injective layers to form $\mathcal{L}_L \circ \cdots \circ \mathcal{L}_1 \circ T_0$, an injective network. Layerwise injectivity, however, prevents us from getting injectivity of $T_{L+1} \circ \mathcal{L}_L \circ \cdots \circ \mathcal{L}_1 \circ T_0$ by a layerwise analysis if $d_{L+1} > d_{out}$, as is common in applications [Kovachki et al., 2021b, Pg. 9]. In this section, we consider a global analysis and show that $T_{L+1} \circ \mathcal{L}_L \circ \cdots \circ \mathcal{L}_1 \circ T_0$ nevertheless remains injective. This is summarized in the following lemma.

Lemma 1. Let $\ell \in \mathbb{N}$ with $\ell < m$, and let the operator $T : L^2(D)^n \to L^2(D)^m$ be injective. Assume that there exists an orthogonal sequence $\{\xi_k\}_{k \in \mathbb{N}}$ in $L^2(D)$ and a subspace $S$ of $L^2(D)$ such that
$$\operatorname{Ran}(\pi_1 T) \subset S \quad \text{and} \quad \operatorname{span}\{\xi_k\}_{k \in \mathbb{N}} \cap S = \{0\}, \qquad (3.1)$$
where $\pi_1 : L^2(D)^m \to L^2(D)$ is the restriction operator defined in (C.1).
Then, there exists a bounded linear operator $B \in \mathcal{L}(L^2(D)^m, L^2(D)^\ell)$ such that $B \circ T : L^2(D)^n \to L^2(D)^\ell$ is injective.

See Section C.1 in Appendix C for the proof. $T$ and $B$ correspond to $\mathcal{L}_L \circ \cdots \circ \mathcal{L}_1 \circ T_0$ (from the lifting to the $L$-th layer) and $T_{L+1}$ (the projection), respectively. The assumption (3.1) on the span of $\xi_k$ encodes a subspace distinct from the range of $T$. In Remark 3 of Appendix C, we provide an example that satisfies the assumption (3.1). Moreover, in Remark 4 of Appendix C, we show the exact construction of the operator $B$ by employing projections onto closed subspaces, using the orthogonal sequence $\{\xi_k\}_{k \in \mathbb{N}}$. This construction is given by the combination of "pairs of projections", discussed in Kato [2013, Section I.4.6], with the idea presented in [Puthawala et al., 2022b, Lemma 29].

3.2 Universal approximation

We now show that the injective networks that we consider in this work are universal approximators. We define the set of integral neural operators with $L^2$-integral kernels by
$$\mathrm{NO}_L(\sigma; D, d_{in}, d_{out}) := \Big\{\, G : L^2(D)^{d_{in}} \to L^2(D)^{d_{out}} \;:\; G = K_{L+1} \circ (K_L + b_L) \circ \sigma \circ \cdots \circ (K_2 + b_2) \circ \sigma \circ (K_1 + b_1) \circ (K_0 + b_0),$$
$$K_\ell \in \mathcal{L}(L^2(D)^{d_\ell}, L^2(D)^{d_{\ell+1}}),\quad K_\ell : f \mapsto \int_D k_\ell(\cdot, y) f(y)\, dy,\quad k_\ell \in L^2(D \times D; \mathbb{R}^{d_{\ell+1} \times d_\ell}),\quad b_\ell \in L^2(D; \mathbb{R}^{d_{\ell+1}}),$$
$$d_\ell \in \mathbb{N},\quad d_0 = d_{in},\quad d_{L+2} = d_{out},\quad \ell = 0, \dots, L+2 \,\Big\}, \qquad (3.2)$$
and $\mathrm{NO}^{\mathrm{inj}}_L(\sigma; D, d_{in}, d_{out}) := \{G \in \mathrm{NO}_L(\sigma; D, d_{in}, d_{out}) : G \text{ is injective}\}$. The following theorem shows that $L^2$-injective neural operators are universal approximators of continuous operators.

Theorem 1. Let $D \subset \mathbb{R}^d$ be a bounded Lipschitz domain, and let $G^+ : L^2(D)^{d_{in}} \to L^2(D)^{d_{out}}$ be continuous such that for all $R > 0$ there is $M > 0$ so that
$$\|G^+(a)\|_{L^2(D)^{d_{out}}} \le M, \qquad a \in L^2(D)^{d_{in}},\ \|a\|_{L^2(D)^{d_{in}}} \le R. \qquad (3.3)$$
We assume that either (i) $\sigma \in A^L_0 \cap BA$ is injective, or (ii) $\sigma = \mathrm{ReLU}$. Then, for any compact set $K \subset L^2(D)^{d_{in}}$ and $\epsilon \in (0,1)$, there exist $L \in \mathbb{N}$ and $G \in \mathrm{NO}^{\mathrm{inj}}_L(\sigma; D, d_{in}, d_{out})$ such that
$$\|G^+(a) - G(a)\|_{L^2(D)^{d_{out}}} \le \epsilon, \qquad a \in K.$$

See Section C.3 in Appendix C for the proof. For the definitions of $A^L_0$ and $BA$, see Definition 3 in Appendix C. For example, the ReLU and Leaky ReLU functions belong to $A^L_0 \cap BA$ (see Remark 5 (i)).

We briefly remark on the proof of Theorem 1, emphasizing how its proof differs from a straightforward extension of the finite-rank case. In the proof we first employ the standard universal approximation theorem for neural operators ([Kovachki et al., 2021b, Theorem 11]). We denote the approximation of $G^+$ by $\widetilde G$, and define the graph of $\widetilde G$ as $H : L^2(D)^{d_{in}} \to L^2(D)^{d_{in}} \times L^2(D)^{d_{out}}$, that is, $H(u) = (u, \widetilde G(u))$. Next, utilizing Lemma 1, we construct the projection $Q$ such that $Q \circ H$ becomes an injective approximator of $G^+$ and belongs to $\mathrm{NO}_L(\sigma; D, d_{in}, d_{out})$.

The proof of the universal approximation theorem is constructive. If, in the future, efficient approximation bounds for neural operators are given, such bounds can likely be used directly in our universality proof to generate corresponding efficient approximation bounds for injective neural operators. This approach resembles the approach in the finite-rank setting Puthawala et al. [2022a, Theorem 15], but unlike that theorem we don't have any dimensionality restrictions. More specifically, in the case of Euclidean spaces $\mathbb{R}^d$, Puthawala et al. [2022a, Theorem 15] requires that $2d_{in} + 1 \le d_{out}$ before all continuous functions $G^+ : \mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}$ can be uniformly approximated on compact sets by injective neural networks. When $d_{in} = d_{out} = 1$ this result is not true, as is shown in Remark 5 (iii) in Appendix C using topological degree theory [Cho and Chen, 2006]. In contrast, Theorem 1 does not assume any conditions on $d_{in}$ and $d_{out}$.
Therefore, we can conclude that the infinite-dimensional case yields better approximation results than the finite-dimensional case. This surprising improvement in restrictions in infinite dimensions can be elucidated by an analogy to Hilbert's hotel paradox, see [Burger and Starbird, 2004, Sec 3.2]. In this analogy, the orthonormal bases $\{\varphi_k\}_{k \in \mathbb{N}}$ and $\Psi_{j,k}(x) = (\delta_{ij}\varphi_k(x))_{i=1}^d$ play the part of guests in a hotel with $\mathbb{N}$ floors, each of which has $d$ rooms. A key step in the proof of Theorem 1 is that there is a linear isomorphism $S : L^2(D)^d \to L^2(D)$ (i.e., a rearrangement of guests) which maps $\Psi_{j,k}$ to $\varphi_{b(j,k)}$, where $b : [d] \times \mathbb{N} \to \mathbb{N}$ is a bijection.

3.3 Injectivity-preserving transfer to Euclidean spaces via finite-rank approximation

In the previous sections, we have discussed injective integral neural operators. The conditions are given in terms of integral kernels, but such kernels may not actually be implementable with finite width and depth networks, which have a finite representational power. A natural question to ask, therefore, is how these formal integral kernels should be implemented on actual finite-rank networks, the so-called implementable neural operators. In this section we discuss this question.

We consider linear integral operators $K_\ell$ with $L^2$-integral kernels $k_\ell(x,y)$. Let $\{\varphi_k\}_{k \in \mathbb{N}}$ be an orthonormal basis in $L^2(D)$. Since $\{\varphi_k(y)\varphi_p(x)\}_{k,p \in \mathbb{N}}$ is an orthonormal basis of $L^2(D \times D)$, the integral kernels $k_\ell \in L^2(D \times D; \mathbb{R}^{d_{\ell+1} \times d_\ell})$ of the integral operators $K_\ell \in \mathcal{L}(L^2(D)^{d_\ell}, L^2(D)^{d_{\ell+1}})$ have the expansion
$$k_\ell(x,y) = \sum_{k,p \in \mathbb{N}} C^{(\ell)}_{k,p}\,\varphi_k(y)\varphi_p(x),$$
where $C^{(\ell)}_{k,p} \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$, whose $(i,j)$-th component $c^{(\ell)}_{k,p,ij}$ is given by $c^{(\ell)}_{k,p,ij} = (k_{\ell,ij}, \varphi_k\varphi_p)_{L^2(D \times D)}$. Here, we denote $(u, \varphi_k) \in \mathbb{R}^{d_\ell}$ by $(u, \varphi_k) = \big((u_1, \varphi_k)_{L^2(D)}, \dots, (u_{d_\ell}, \varphi_k)_{L^2(D)}\big)$. By truncating to $N$ finite sums, we approximate the $L^2$-integral operators $K_\ell \in \mathcal{L}(L^2(D)^{d_\ell}, L^2(D)^{d_{\ell+1}})$ by finite-rank operators $K_{\ell,N}$ of rank $N$, having the form
$$K_{\ell,N} u(x) = \sum_{k,p \in [N]} C^{(\ell)}_{k,p}\,(u, \varphi_k)\,\varphi_p(x), \qquad u \in L^2(D)^{d_\ell}.$$
The choice of orthonormal basis $\{\varphi_k\}_{k \in \mathbb{N}}$ is a hyperparameter. If we choose $\{\varphi_k\}_k$ as a Fourier basis or a wavelet basis, then the network architectures correspond to Fourier Neural Operators (FNOs) [Li et al., 2020b] and Wavelet Neural Operators (WNOs) [Tripura and Chakraborty, 2023], respectively.

We show that Propositions 1 and 2 (characterization of layerwise injectivity) and Lemma 1 (global injectivity) all have natural analogues for the finite-rank operators $K_{\ell,N}$, in Proposition 5 and Lemma 2 in Appendix D. These conditions apply out-of-the-box to both FNOs and WNOs. Lemma 2 and Remark 3 in Appendix D give a recipe to construct the projection $B$ such that the composition $B \circ T$ (interpreted as augmenting a finite-rank neural operator $T$ with one layer $B$) is injective. The projection $B$ is constructed by using an orthogonal sequence $\{\xi_k\}_{k \in \mathbb{N}}$ subject to the condition (3.1), which does not overlap the range of $T$. This condition is automatically satisfied for any orthogonal basis $\{\varphi_k\}_{k \in \mathbb{N}}$. This could yield practical implications in guiding the choice of the orthogonal basis $\{\varphi_k\}_{k \in \mathbb{N}}$ for the neural operator's design.

We also show universal approximation in the case of finite-rank approximation. We denote by $\mathrm{NO}_{L,N}(\sigma; D, d_{in}, d_{out})$ the set of integral neural operators of rank $N$, that is, the set (3.2) with the $L^2$-integral kernel operators $K_\ell$ replaced by finite-rank operators $K_{\ell,N}$ of rank $N$ (see Definition 4). We define $\mathrm{NO}^{\mathrm{inj}}_{L,N}(\sigma; D, d_{in}, d_{out}) := \{G_N \in \mathrm{NO}_{L,N}(\sigma; D, d_{in}, d_{out}) : G_N \text{ is injective}\}$.
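The following is a minimal numerical sketch of the rank-$N$ truncation $K_{\ell,N}$ described above, using a sine basis on $D = (0,1)$ as the orthonormal family $\{\varphi_k\}$. The kernel, grid, basis choice and all numerical values are our own illustrative assumptions, not the paper's.

```python
import numpy as np

# Rank-N truncation K_N u = sum_{k,p in [N]} C[k,p] (u, phi_k) phi_p of an integral
# operator with L^2 kernel k(x, y), using phi_k(x) = sqrt(2) sin(k pi x) on D = (0, 1).
# The Gaussian kernel below is an arbitrary smooth example, not from the paper.

n_grid, N = 256, 8
x = np.linspace(0.0, 1.0, n_grid)
w = np.full(n_grid, x[1] - x[0]); w[0] *= 0.5; w[-1] *= 0.5     # trapezoid weights

phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.arange(1, N + 1), x))   # (N, n_grid)
k_xy = np.exp(-5.0 * (x[:, None] - x[None, :]) ** 2)                    # k(x_i, y_j)

# coefficients c_{k,p} = (k, phi_k(y) phi_p(x))_{L^2(D x D)}, by quadrature
C = np.einsum('pi,ij,kj,i,j->kp', phi, k_xy, phi, w, w)                 # C[k, p]

def K_full(u):
    """Quadrature approximation of (K u)(x) = int_D k(x, y) u(y) dy."""
    return k_xy @ (w * u)

def K_truncated(u):
    """Rank-N operator K_N u = sum_{k,p} C[k, p] (u, phi_k) phi_p."""
    coeff = phi @ (w * u)                   # (u, phi_k) for k = 1..N
    return (coeff @ C) @ phi                # sum_p [sum_k coeff_k C[k,p]] phi_p(x)

u = np.exp(-10 * (x - 0.5) ** 2)
err = np.sqrt(np.sum(w * (K_full(u) - K_truncated(u)) ** 2))   # L^2 error, shrinks as N grows
```

Replacing the sine family with a Fourier or wavelet basis gives, respectively, the FNO- and WNO-style parameterizations mentioned above.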
Theorem 2. Let $D \subset \mathbb{R}^d$ be a bounded Lipschitz domain, let $N \in \mathbb{N}$, and let $G^+ : L^2(D)^{d_{in}} \to L^2(D)^{d_{out}}$ be continuous with boundedness as in (3.3). Assume that the non-linear activation function $\sigma$ is either ReLU or Leaky ReLU. Then, for any compact set $K \subset L^2(D)^{d_{in}} \cap (\operatorname{span}\{\varphi_k\}_{k \in [N]})^{d_{in}}$ and $\epsilon \in (0,1)$, there exist $L \in \mathbb{N}$ and $N' \in \mathbb{N}$ with
$$N' d_{out} \ge 2N d_{in} + 1, \qquad (3.4)$$
and $G_N \in \mathrm{NO}^{\mathrm{inj}}_{L,N'}(\sigma; D, d_{in}, d_{out})$ such that $\|G^+(a) - G_N(a)\|_{L^2(D)^{d_{out}}} \le \epsilon$ for $a \in K$.

See Section D.4 in Appendix D for the proof. In the proof, we make use of Puthawala et al. [2022a, Lemma 29], which gives rise to the assumption (3.4). As in Theorem 1, we do not require any condition on $d_{in}$ and $d_{out}$.

Remark 1. Observe that in our finite-rank approximation result, we only require that the target function $G^+$ is continuous and bounded, but not smooth. This differs from prior work that requires smoothness of the function to be approximated.

3.4 Limitations

We assume square matrices for the bijectivity and construction of inversions. Weight matrices in actual neural operators are not necessarily square. Lifting and projection are local operators which map the inputs into a higher-dimensional feature space and project the feature output back to the output space. We also haven't addressed possible aliasing effects of our injective operators. We will relax the square assumption and investigate the aliasing of injective operators in future work.

4 Subnetworks & nonlinear integral operators: bijectivity and inversion

So far our analysis of injectivity has been restricted to the case where the only source of nonlinearity is the activation functions. In this section we consider a weaker and more abstract problem where nonlinearities can also arise from the integral kernel, with a surjective activation function such as leaky ReLU. Specifically, we consider layers of the form
$$F_1(u) = Wu + K(u), \qquad (4.1)$$
where $W \in \mathcal{L}(L^2(D)^n, L^2(D)^n)$ is a bounded linear bijective operator, and $K : L^2(D)^n \to L^2(D)^n$ is a non-linear operator. This arises in the non-linear neural operator construction by Kovachki et al. [2021b] or in Ong et al. [2022] to improve the performance of integral autoencoders. In this construction, each layer $\mathcal{L}_\ell$ is written as
$$(\mathcal{L}_\ell v)(x) = \sigma\big(W_\ell v(x) + K_\ell(v)(x)\big), \qquad K_\ell(u)(x) = \int_D k_\ell\big(x, y, u(x), u(y)\big)\,u(y)\,dy, \qquad x \in D,$$
where $W_\ell \in \mathbb{R}^{d_{\ell+1} \times d_\ell}$ is independent of $x$, and $K_\ell : L^2(D)^{d_\ell} \to L^2(D)^{d_{\ell+1}}$ is the non-linear integral operator.

This relaxing of assumptions is motivated by a desire to obtain theoretical results for both subnetworks and operator transformers. By subnetworks, we mean compositions of layers within a network. This includes, for example, the encoder or decoder block of a traditional VAE. By operator transformers we mean operator generalizations of finite-rank transformers, which can be modeled by letting $K$ be an integral transformation with a nonlinear kernel $k$ of the form
$$k\big(x, y, v(x), v(y)\big) = \operatorname{softmax}\big(\langle Av(x), Bv(y)\rangle\big),$$
where $A$ and $B$ are matrices of free parameters, and the (integral) softmax is taken over $x$. This specific choice of $k$ can be understood as a natural generalization of the attention mechanism in transformers; see [Kovachki et al., 2021b, Sec. 5.2] for further details.

4.1 Surjectivity and bijectivity

Critical to our analysis is the notion of coercivity. Apart from being a useful theoretical tool, layerwise coercivity of neural networks is a useful property in imaging applications, see e.g. Li et al. [2020a]. Recall from Showalter [2010, Sec 2, Chap VII] that a non-linear operator $K : L^2(D)^n \to L^2(D)^n$ is coercive if
$$\lim_{\|u\|_{L^2(D)^n} \to \infty} \frac{\big(K(u), u\big)_{L^2(D)^n}}{\|u\|_{L^2(D)^n}} = \infty. \qquad (4.2)$$
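Before stating the surjectivity result, here is a minimal discretized sketch of a nonlinear layer of the form (4.1) with an attention-style kernel in the spirit of the operator-transformer example above. The exact kernel, the choice to normalize the integral softmax over $y$, the parameter values and the activation are our own illustrative assumptions, not the paper's construction.

```python
import numpy as np

def nonlinear_integral_layer(u, x, W, A, B, sigma):
    """Discretized (L u)(x) = sigma( W u(x) + \int_D k(x, y, u(x), u(y)) u(y) dy ),
    with an attention-style kernel k proportional to exp(<A u(x), B u(y)>),
    normalized (integral softmax) over y.

    u : (n, d)   input function on the grid       x : (n,)  grid points in D = (0, 1)
    W : (d, d)   pointwise weight (x-independent)  A, B : (d, d) free kernel parameters
    """
    n = len(x)
    w = np.full(n, x[1] - x[0]); w[0] *= 0.5; w[-1] *= 0.5     # quadrature weights
    scores = (u @ A.T) @ (u @ B.T).T                            # <A u(x_i), B u(y_j)>
    scores -= scores.max(axis=1, keepdims=True)                 # numerical stability
    att = np.exp(scores) * w[None, :]
    att /= att.sum(axis=1, keepdims=True)                       # integral softmax over y
    Ku = att @ u                                                # \int k(...) u(y) dy
    return sigma(u @ W.T + Ku)

# toy usage with d = 2 and a leaky-ReLU activation
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 128)
u = np.stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)], axis=1)
W = np.eye(2) + 0.1 * rng.standard_normal((2, 2))
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
leaky_relu = lambda t: np.where(t > 0, t, 0.1 * t)
out = nonlinear_integral_layer(u, x, W, A, B, leaky_relu)       # shape (128, 2)
```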
Proposition 3. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be surjective and $W : L^2(D)^n \to L^2(D)^n$ be linear, bounded and bijective (so that the inverse $W^{-1}$ is bounded and linear), and let $K : L^2(D)^n \to L^2(D)^n$ be a continuous and compact mapping. Moreover, assume that the map $u \mapsto \alpha u + W^{-1}K(u)$ is coercive for some $0 < \alpha < 1$. Then, the operator $\sigma \circ F_1$ is surjective.

See Section E.1 in Appendix E for the proof. An example of $K$ satisfying the coercivity condition (4.2) is given here.

Example 1. We simply consider the case of $n = 1$, where $D \subset \mathbb{R}^d$ is a bounded interval. We consider the non-linear integral operator
$$K(u)(x) := \int_D k\big(x, y, u(x)\big)\,u(y)\,dy, \qquad x \in D.$$
The operator $u \mapsto \alpha u + W^{-1}K(u)$, with some $0 < \alpha < 1$, is coercive when the non-linear integral kernel $k(x,y,t)$ satisfies certain boundedness conditions. In Examples 2 and 3 in Appendix E, we show that these conditions are met by kernels $k(x,y,t)$ of the form
$$k(x,y,t) = \sum_{j} c_j(x,y)\,\sigma\big(a_j(x,y)\,t + b_j(x,y)\big),$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is the sigmoid function $\sigma_s : \mathbb{R} \to \mathbb{R}$ and $a_j, b_j, c_j \in C(D \times D)$, or $\sigma$ is a wavelet activation function $\sigma_{\mathrm{wire}} : \mathbb{R} \to \mathbb{R}$, see Saragadam et al. [2023], where $\sigma_{\mathrm{wire}}(t) = \operatorname{Im}\big(e^{i\omega t}e^{-t^2}\big)$, $a_j, b_j, c_j \in C(D \times D)$, and $a_j(x,y) \ne 0$.

In the proof of Proposition 3, we utilize the Leray-Schauder fixed point theorem. By employing the Banach fixed point theorem under a contraction mapping condition (4.3), we can obtain bijectivity as follows:

Proposition 4. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be bijective. Let $W : L^2(D)^n \to L^2(D)^n$ be bounded, linear and bijective, and let $K : L^2(D)^n \to L^2(D)^n$. If $W^{-1}K : L^2(D)^n \to L^2(D)^n$ is a contraction mapping, that is, there exists $\rho \in (0,1)$ such that
$$\|W^{-1}K(u) - W^{-1}K(v)\| \le \rho\,\|u - v\|, \qquad u, v \in L^2(D)^n, \qquad (4.3)$$
then the operator $\sigma \circ F_1$ is bijective.

See Section E.3 in Appendix E for the proof. We note that if $K$ is compact and linear, then assumption (4.3) implies that $W + K$ is an injective Fredholm operator with index zero, which is equivalent to $\sigma \circ (W+K)$ being bijective, as observed in Proposition 2. That is, Proposition 4 requires stronger assumptions when applied to the linear case. Assumption (4.3) implies the injectivity of $F_1 = W + K$.

An interesting example of injective operators arises when $K$ is a Volterra operator. When $D \subset \mathbb{R}^d$ is bounded and $K(u) = \int_D k(x, y, u(y))\,u(y)\,dy$, where we denote $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$, we recall that $K$ is a Volterra operator if $k(x,y,t) \ne 0$ implies $y_j \le x_j$ for all $j = 1, 2, \dots, d$. A well-known fact, as discussed in Example 4 in Appendix E, is that if $K(u)$ is a Volterra operator whose kernel $k(x,y,t) \in C(D \times D \times \mathbb{R}^n)$ is bounded and uniformly Lipschitz in the $t$-variable, then $F : u \mapsto u + K(u)$ is injective.

Remark 2. A similar analysis (illustrating coercivity) shows that operators of the following two forms are bijective.

1. Operators of the form $F(u) := \alpha u + K(u)$, where $K(u) := \int_D a\big(x, y, u(x), u(y)\big)\,dy$ and $a(x,y,s_1,s_2)$ is continuous and such that there exist $R > 0$ and $c_1 < \alpha$ so that for all $|(s_1,s_2)| > R$, $\operatorname{sign}(s_1)\operatorname{sign}(s_2)\,a(x,y,s_1,s_2) \ge -c_1$.

2. Layers of the form $(\mathcal{L}_\ell u)(x) := \sigma_1\big(Wu(x) + \sigma_2(K(u))(x)\big)$, where $\sigma_1$ is bijective and $\sigma_2$ is bounded.

Finally, we remark that coercivity is preserved by perturbations that are bounded on a bounded domain. This makes it possible to study non-linear and non-positive perturbations of physical models. For example, in quantum mechanics, when a non-negative energy potential $|\phi|^4$ is replaced by a Mexican hat potential $-C|\phi|^2 + |\phi|^4$, as occurs in the study of magnetization, superconductors and the Higgs field.

4.2 Construction of the inverse of a non-linear integral neural operator

The preceding sections clarified sufficient conditions for surjectivity and bijectivity of the non-linear operator $F_1$.
We now consider how to construct the inverse of $F_1$ on a compact set $Y$. We find that constructing inverses is possible in a wide variety of settings and, moreover, that the inverses themselves can be given by neural operators. The proof that neural operators may be inverted with other neural operators provides a theoretical justification for Integral Auto-Encoder networks [Ong et al., 2022], where an infinite-dimensional encoder/decoder pair plays a role parallel to that of encoder/decoders in finite-dimensional VAEs [Kingma and Welling, 2013]. This section proves that the decoder half of an IAE-Net is provably able to invert the encoder half. Our analysis also shows that injective differential operators (as arise in PDEs) and integral operator encoders form a formal algebra under operator composition.

We prove this in the rest of this section, but first we summarize the three main steps of the proof. First, by using the Banach fixed point theorem and invertibility of derivatives of $F_1$, we show that, locally, $F_1$ may be inverted by an iteration of a contractive operator near $g_j = F_1(v_j)$. This makes local inversion simple in balls which cover the set $Y$. Second, we construct partition-of-unity functions $\Phi_j$ that mask the support of each covering ball and allow us to construct one global inverse that simply passes through to the local inverse on the appropriate ball. Third and finally, we show that each function used in both of the above steps is representable using neural operators with distributional kernels.

As a simple case, let us first consider the case when $n = 1$, $D \subset \mathbb{R}$ is a bounded interval, and the operator $F_1$ has the form
$$F_1(u)(x) = W(x)u(x) + \int_D k\big(x, y, u(y)\big)\,u(y)\,dy,$$
where $W \in C^1(D)$ satisfies $0 < c_1 \le W(x) \le c_2$, the function $(x,y,s) \mapsto k(x,y,s)$ is in $C^3(D \times D \times \mathbb{R})$, and in $D \times D \times \mathbb{R}$ its derivatives up to order three and the derivatives of $W$ are uniformly bounded by $c_0$, that is,
$$\|k\|_{C^3(D \times D \times \mathbb{R})} \le c_0, \qquad \|W\|_{C^1(D)} \le c_0. \qquad (4.4)$$
The condition (4.4) implies that $F_1$ maps $H^1(D)$ to $H^1(D)$ and is locally Lipschitz. Furthermore, $F_1 : H^1(D) \to H^1(D)$ is Fréchet differentiable at $u_0 \in C(D)$, and we denote the Fréchet derivative of $F_1$ at $u_0$ by $A_{u_0}$, which can be written as the integral operator (F.2). We will assume that for all $u_0 \in C(D)$,
$$\text{the integral operator } A_{u_0} : H^1(D) \to H^1(D) \text{ is an injective operator.} \qquad (4.5)$$
This happens for example when $K(u)$ is a Volterra operator, see Examples 4 and 5. As the integral operators $A_{u_0}$ are Fredholm operators having index zero, this implies that the operators (4.5) are bijective. The inverse operator $A_{u_0}^{-1} : H^1(D) \to H^1(D)$ can be written as the integral operator (F.5).

We will consider the inverse function of the map $F_1$ on the set
$$Y \subset \sigma_a\big(B_{C^{1,\alpha}(D)}(0,R)\big) = \{\sigma_a \circ g \in C(D) : \|g\|_{C^{1,\alpha}(D)} \le R\},$$
which is a set given by the image of Hölder spaces $C^{n,\alpha}(D)$ under (leaky) ReLU-type functions $\sigma_a(s) = \mathrm{ReLU}(s) - a\,\mathrm{ReLU}(-s)$ with $a \ge 0$. We note that $Y$ is a compact subset of the Sobolev space $H^1(D)$, and we use the notations $B_X(g,R)$ and $\overline B_X(g,R)$ for the open and closed balls with radius $R > 0$ centered at $g \in X$ in a Banach space $X$.

To this end, we will cover the set $Y$ with small balls $B_{H^1(D)}(g_j, \varepsilon_0)$, $j = 1, 2, \dots, J$, of $H^1(D)$, centered at $g_j = F_1(v_j)$, where $v_j \in H^1(D)$. As considered in detail in Appendix F, when $g$ is sufficiently near to the function $g_j$ in $H^1(D)$, the inverse map of $F_1$ can be written as a limit $(F_1^{-1}(g), g) = \lim_{m \to \infty} H_j^{m}(v_j, g)$ in $H^1(D)^2$, where
$$H_j(u, g) := \Big(u - A_{v_j}^{-1}\big(F_1(u) - F_1(v_j)\big) + A_{v_j}^{-1}(g - g_j),\; g\Big),$$
that is, near $g_j$ we can approximate $F_1^{-1}$ as a composition $H_j^{m}$ of $2m$ layers of neural operators.
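Before describing how the local inverses are glued together, the following is a minimal numerical sketch in the spirit of this fixed-point construction: it inverts a discretized layer $F(u) = Wu + K(u)$ by the iteration $u \leftarrow W^{-1}(g - K(u))$, which converges when $W^{-1}K$ is a contraction as in (4.3). The specific kernel is our own choice, scaled small so that the contraction condition plausibly holds; this is an illustration, not the paper's exact construction of $H_j$.

```python
import numpy as np

# Invert F(u) = W u + K(u) on a grid by the fixed-point iteration
#   u_{m+1} = W^{-1}( g - K(u_m) ),
# which converges when W^{-1} K is a contraction (cf. assumption (4.3)).

n = 200
x = np.linspace(0.0, 1.0, n)
w = np.full(n, x[1] - x[0]); w[0] *= 0.5; w[-1] *= 0.5          # quadrature weights
W = np.eye(n)                                                    # pointwise weight (identity here)

def K(u):
    """Nonlinear integral operator K(u)(x) = \int_D k(x, y, u(y)) u(y) dy with
    k(x, y, t) = 0.2 * exp(-(x - y)^2) * tanh(t): bounded, Lipschitz in t, and
    scaled small so that W^{-1} K is plausibly a contraction on this grid."""
    kern = 0.2 * np.exp(-(x[:, None] - x[None, :]) ** 2) * np.tanh(u)[None, :]
    return kern @ (w * u)

def F(u):
    return W @ u + K(u)

def invert(g, iters=200):
    """Fixed-point iteration u <- W^{-1}(g - K(u)) for solving F(u) = g."""
    u = np.zeros_like(g)
    Winv = np.linalg.inv(W)
    for _ in range(iters):
        u = Winv @ (g - K(u))
    return u

u_true = np.sin(3 * np.pi * x)
g = F(u_true)
u_rec = invert(g)
err = np.max(np.abs(u_rec - u_true))    # small when the contraction condition holds
```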
To glue the local inverse maps together, we use a partition of unity $\Phi_{\vec i}$, $\vec i \in I$, in the function space $Y$, where $I \subset \mathbb{Z}^{\ell_0}$ is a finite index set. The functions $\Phi_{\vec i}$ are given by neural operators
$$\Phi_{\vec i}(v, w) = \pi_1 \circ \phi_{\vec i,1} \circ \phi_{\vec i,2} \circ \cdots \circ \phi_{\vec i,\ell_0}(v, w), \qquad \text{where } \phi_{\vec i,\ell}(v, w) = \big(F_{y_\ell,\, s(\vec i,\ell),\, \epsilon_1}(v, w),\; w\big),$$
where $\epsilon_1 > 0$, the $s(\vec i,\ell) \in \mathbb{R}$ are suitable values near $g_{j(\vec i)}(y_\ell)$, $y_\ell \in D$ ($\ell = 1, \dots, \ell_0$), and $\pi_1$ is the map $\pi_1(v,w) = v$ that maps a pair $(v,w)$ to the first function $v$. Here, $F_{z,s,h}(v,w)$ are integral neural operators with distributional kernels
$$F_{z,s,h}(v, w)(x) = \int_D k_{z,s,h}\big(x, y, v(x), w(y)\big)\,dy, \qquad k_{z,s,h}\big(x, y, v(x), w(y)\big) = v(x)\,\mathbf{1}_{[s - \frac{1}{2}h,\, s + \frac{1}{2}h)}\big(w(y)\big)\,\delta(y - z),$$
where $\mathbf{1}_A$ is the indicator function of a set $A$ and $y \mapsto \delta(y - z)$ is the Dirac delta distribution at the point $z \in D$. Using these, we can write the inverse of $F_1$ at $g \in Y$ as
$$F_1^{-1}(g) = \lim_{m \to \infty} \sum_{\vec i \in I} \Phi_{\vec i}\, H_{j(\vec i)}^{m} \qquad \text{in } H^1(D), \qquad (4.6)$$
where $j(\vec i) \in \{1, 2, \dots, J\}$. This result is summarized in the following theorem, which is proven in Appendix F.

Theorem 3. Assume that $F_1$ satisfies the above assumptions (4.4) and (4.5) and that $F_1 : H^1(D) \to H^1(D)$ is a bijection. Let $Y \subset \sigma_a\big(B_{C^{1,\alpha}(D)}(0,R)\big)$ be a compact subset of the Sobolev space $H^1(D)$, where $\alpha > 0$ and $a \ge 0$. Then the inverse of $F_1 : H^1(D) \to H^1(D)$ on $Y$ can be written as a limit (4.6), that is, as a limit of integral neural operators.

5 Discussion and Conclusion

In this paper, we provided a theoretical analysis of injectivity and bijectivity for neural operators. In the future, we will further develop applications of our theory, particularly in the areas of generative models, inverse problems and integral autoencoders. We gave a rigorous framework for the analysis of the injectivity and bijectivity of neural operators, including when either the ReLU activation or bijective activation functions are used. We further proved that injective neural operators are universal approximators and that their finite-rank implementations are still injective. Finally, we ended by considering the coarser problem of non-linear integral operators, as they arise in subnetworks, operator transformers and integral autoencoders.

Acknowledgments

ML was partially supported by Academy of Finland, grants 273979, 284715, 312110. M.V. de H. was supported by the Simons Foundation under the MATH + X program, the National Science Foundation under grant DMS-2108175, and the corporate members of the Geo-Mathematical Imaging Group at Rice University.

References

Giovanni S. Alberti, Matteo Santacesaria, and Silvia Sciutto. Continuous generative neural networks. arXiv:2205.14627, 2022.

Jose Antonio Lara Benitez, Takashi Furuya, Florian Faucher, Xavier Tricoche, and Maarten V de Hoop. Fine-tuning neural-operator architectures for training and generalization. arXiv preprint arXiv:2301.11509, 2023.

Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.

Edward B Burger and Michael Starbird. The Heart of Mathematics: An Invitation to Effective Thinking. Springer Science & Business Media, 2004.

Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019.

Yeol Je Cho and Yu-Qing Chen. Topological Degree Theory and Applications. CRC Press, 2006.

Maarten De Hoop, Daniel Zhengyu Huang, Elizabeth Qian, and Andrew M Stuart. The cost-accuracy trade-off in operator learning with neural networks. arXiv preprint arXiv:2203.13181, 2022.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

David Gilbarg and Neil S. Trudinger. Elliptic Partial Differential Equations of Second Order. Classics in Mathematics. Springer-Verlag, Berlin, 2001. ISBN 3-540-41160-7. Reprint of the 1998 edition.

Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. Advances in Neural Information Processing Systems, 30, 2017.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 646–661. Springer, 2016.

Isao Ishikawa, Takeshi Teshima, Koichi Tojo, Kenta Oono, Masahiro Ikeda, and Masashi Sugiyama. Universal approximation property of invertible neural networks. arXiv preprint arXiv:2204.07415, 2022.

Aref Jeribi. Spectral Theory and Applications of Linear Operators and Block Operator Matrices, volume 9. Springer, 2015.

Tosio Kato. Perturbation Theory for Linear Operators, volume 132. Springer Science & Business Media, 2013.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Joseph J Kohn and Louis Nirenberg. An algebra of pseudo-differential operators. Communications on Pure and Applied Mathematics, 18(1-2):269–305, 1965.

Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22, 2021a.

Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021b.

Anastasis Kratsios and Ievgen Bilokopytov. Non-Euclidean universal approximation. Advances in Neural Information Processing Systems, 33:10635–10646, 2020.

Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.

Matti Lassas and Samuli Siltanen. Can one use total variation prior for edge-preserving Bayesian inversion? Inverse Problems, 20(5):1537, 2004.

Housen Li, Johannes Schwab, Stephan Antholzer, and Markus Haltmeier. NETT: solving inverse problems with deep neural networks. Inverse Problems, 36(6):065005, 23, 2020a. ISSN 0266-5611. doi: 10.1088/1361-6420/ab6d57. URL https://doi.org/10.1088/1361-6420/ab6d57.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020b.

Lu Lu, Pengzhan Jin, and George Em Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.

Yong Zheng Ong, Zuowei Shen, and Haizhao Yang.
IAE-Net: Integral autoencoders for discretization-invariant learning. arXiv preprint arXiv:2203.05142, 2022.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

Michael Puthawala, Konik Kothari, Matti Lassas, Ivan Dokmanić, and Maarten de Hoop. Globally injective ReLU networks. Journal of Machine Learning Research, 23(105):1–55, 2022a.

Michael Puthawala, Matti Lassas, Ivan Dokmanić, and Maarten De Hoop. Universal joint approximation of manifolds and densities by simple injective flows. In International Conference on Machine Learning, pages 17959–17983. PMLR, 2022b.

Matti Lassas, Eero Saksman, and Samuli Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. arXiv preprint arXiv:0901.4220, 2009.

Vishwanath Saragadam, Daniel LeJeune, Jasper Tan, Guha Balakrishnan, Ashok Veeraraghavan, and Richard G Baraniuk. WIRE: Wavelet implicit neural representations. arXiv preprint arXiv:2301.05187, 2023.

Ralph E Showalter. Hilbert Space Methods in Partial Differential Equations. Courier Corporation, 2010.

Ali Siahkoohi, Gabrio Rizzuti, Philipp A Witte, and Felix J Herrmann. Faster uncertainty quantification for inverse problems with conditional normalizing flows. arXiv preprint arXiv:2007.07985, 2020.

Andrew M Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.

M.E. Taylor. Pseudodifferential Operators. Princeton Mathematical Series. Princeton University Press, 1981.

Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda, and Masashi Sugiyama. Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373, 2020.

Tapas Tripura and Souvik Chakraborty. Wavelet neural operator for solving parametric partial differential equations in computational mechanics problems. Computer Methods in Applied Mechanics and Engineering, 404:115783, 2023.

A Motivation behind our injectivity & bijectivity

PDE-based inverse problems: We consider the following partial differential equations (PDEs) of the form
$$L_a u(x) = f(x), \quad x \in D, \qquad Bu(x) = g(x), \quad x \in \partial D,$$
where $D \subset \mathbb{R}^d$ is a bounded domain, $L_a$ is a partial differential operator with a coefficient $a \in \mathcal{A}(D; \mathbb{R}^{d_a}) := \{a : D \to \mathbb{R}^{d_a}\}$, $B$ is some boundary operator, e.g. Dirichlet or Neumann, and the functions $f$ and $g$ are fixed. When a solution $u \in \mathcal{U}(D; \mathbb{R}^{d_u}) := \{u : D \to \mathbb{R}^{d_u}\}$ is uniquely determined, we may define the solution operator $G : \mathcal{A}(D; \mathbb{R}^{d_a}) \to \mathcal{U}(D; \mathbb{R}^{d_u})$ by putting $G(a) := u$. Note that the operator $G$ is, in general, non-linear even if the partial differential operator $L_a$ is linear (e.g., $L_a = -\Delta + a$). The aim of the inverse problem is to evaluate $G^{-1}$, the inverse of the solution operator. When $G$ is non-injective, the problem is termed ill-posed. The key link between PDEs and our injective neural operators is as follows. Can $G$ be approximated by injective neural operators $N_{\mathrm{inj}}$? If so, then $N_{\mathrm{inj}}$ can be a surrogate model to solve an ill-posed inverse problem. In general, this is possible, even if $G$ is non-injective, per the results in Section 3. Moreover, by the results of Section 4, we may construct the inverse $N_{\mathrm{inj}}^{-1}$ of the surrogate model $N_{\mathrm{inj}}$ with another neural operator.

Approximation of push-forward measures: Let $M$ be the submanifold of $X = L^2([0,1]^2)$ or $X = L^2([0,1]^3)$ corresponding to natural images or 3D medical models. Let $K \subset \mathbb{R}^D$ be a manifold with the same topology as $M$, let $\iota : \mathbb{R}^D \to X$ be an embedding, and let $K_1 = \iota(K) \subset M$.
Given µ, a measure supported on M, the task is to find a neural operator fθ : X X that maps (pushes forward) the uniform distribution on the model space K1 to µ and so thus maps K1 to M. If fθ : X X is bijective, computing likelihood functions in statistical analysis is made easier via the change of variables formula. Further, we may interpret f 1 θ as an encoder and fθ as the corresponding decoder, which parameterized elements of M. As everything is formulated in infinite dimension function space X, we obtain discretization invariant methods. B Proof of Proposition 1 in Section 2 Proof. We use the notation T|S(v,T +b)(v) = (Tiv)i S(v,T +b). Assume that T + b has a DSS with respect to every v L2(D)n in the sense of Definition 2, and that Re LU(Tv(1) + b) = Re LU(Tv(2) + b) in D, (B.1) where v(1), v(2) L2(D)n. Since T +b has a DSS with respect to v(1), we have for i S(v(1), T +b) 0 < Re LU(Tiv(1) + bi) = Re LU(Tiv(2) + bi) in D, which implies that Tiv(1) + bi = Tiv(2) + bi in D. Thus, v(1) v(2) Ker T S(v(1),T +b) By assuming (B.1), we have for i / S(v(1), T), {x D | Tiv(1)(x) + bi(x) 0} = {x D | Tiv(2)(x) + bi(x) 0}. Then, we have Ti(v(1) (v(1) v(2)))(x) + bi(x) = Tiv(2)(x) + bi(x) 0 if Tiv(1)(x) + bi(x) 0, that is, Tiv(1)(x) + bi(x) Ti v(1) v(2) (x) if Tiv(1)(x) + bi(x) 0. In addition, Ti(v(1) v(2))(x) = Tiv(1)(x) + bi(x) Tv(2)(x) + bi(x) = 0 if Tiv(1)(x) + bi(x) > 0. Thus, v(1) v(2) X(v, T + b). (B.3) Combining (B.2) and (B.3), and (2.1) as v = v(1), we conclude that v(1) v(2) = 0. Conversely, assume that there exists a v L2(D)n such that Ker T S(v,T +b) X(v, T + b) = {0}. Then there is u = 0 such that u Ker T S(v,T +b) X(v, T + b). For i S(v, T + b), we have by u Ker(Ti), Re LU (Ti(v u) + bi(x)) = Re LU (Tiv + bi(x)) . For i / S(v, T + b), we have by u X(v, T + b), Re LU (Ti(v u)(x) + bi(x)) = 0 if Tiv(x) + bi(x) 0 Tiv(x) + bi(x) if Tiv(x) + bi(x) > 0 = Re LU (Tiv(x) + bi(x)) . Therefore, we conclude that Re LU (T(v u) + b) = Re LU (Tv + b) , where u = 0, that is, Re LU (T + b) is not injective. C Details in Sections 3.1 and 3.2 C.1 Proof of Lemma 1 Proof. The restriction operator, πℓ: L2(D)m L2(D)ℓ(ℓ< m), acts as follows, πℓ(a, b) := b, (a, b) L2(D)m ℓ L2(D)ℓ. (C.1) Since L2(D) is a separable Hilbert space, there exists an orthonormal basis {φk}k N in L2(D). We denote by 0, ..., 0, φk |{z} j th , 0, ..., 0 for k N and j [m ℓ]. Then, {φ0 k,j}k N,j [m ℓ] is an orthonormal sequence in L2(D)m, and V0 := L2(D)m ℓ {0}ℓ = span φ0 k,j k N, j [m ℓ] . We define, for α (0, 1), 0, ..., 0, p (1 α)φk | {z } j th , 0, ..., 0, αξ(k 1)(m ℓ)+j L2(D)m, (C.2) with k N and j [m ℓ]. We note that {φα k,j}k N,j [m ℓ] is an orthonormal sequence in L2(D)m. We set Vα := span n φα k,j k N, j [m ℓ] o . (C.3) It holds for 0 < α < 1/2 that PV α PV 0 Indeed, for u L2(D)m and 0 < α < 1/2, PV α u PV 0 u 2 L2(D)m = PVαu PV0u 2 L2(D)m k N,j [m ℓ] (u, φα k,j)φα k,j (u, φ0 k,j)φ0 k,j k N,j [m ℓ] (1 α)(uj, φk)φk (uj, φk)φk k N,j [m ℓ] α(um, ξ(k 1)(m ℓ)+j)ξ(k 1)(m ℓ)+j k N |(uj, φk)|2 + α2 X k N |(um, ξk)|2 4α2 u 2 L2(D)m , which implies that PV α PV 0 We will show that the operator PV α T : L2(D)n L2(D)m, is injective. Assuming that for a, b L2(D)n, PV α T(a) = PV α T(b), is equivalent to T(a) T(b) = PVα(T(a) T(b)). Denoting by PVα(T(a) T(b)) = P k N,j [m ℓ] ck,jφα k,j, π1(T(a) T(b)) = X k N,j [m ℓ] ck,jξ(k 1)(m ℓ)+j. From (3.1), we obtain that ckj = 0 for all k, j. By injectivity of T, we finally get a = b. We define Qα : L2(D)m L2(D)m by Qα := PV 0 PV α + (I PV 0 )(I PV α ) I (PV 0 PV α )2 1/2 . 
By the same argument as in Section I.4.6 Kato [2013], we can show that Qα is injective and QαPV α = PV 0 Qα, that is, Qα maps from Ran(PV α ) to Ran(PV 0 ) {0}m ℓ L2(D)ℓ. It follows that πℓ Qα PV α T : L2(D)n L2(D)ℓ is injective. C.2 Remarks following Lemma 1 Remark 3. Lemma 1 and Eqn. (3.1) may be interpreted as saying that if some orthonormal sequence {ξk}k N exists that doesn t overlap the range of T, then T may be embedded in a small space without losing injectivity. An example that satisfies (3.1) is the neural operator whose L-th layer operator LL consists of the integral operator KL with continuous kernel function k L, and with continuous activation function σ. Indeed, in this case, we may choose the orthogonal sequence {ξk}k N in L2(D) as a discontinuous functions sequence 1 so that span{ξk}k N C(D) = {0}. Then, by Ran(LL) C(D)d L, the assumption (3.1) holds. Remark 4. In the proof of Lemma 1, an operator B L(L2(D)m, L2(D)ℓ), B = πℓ Qα PV α , appears, where PV α is the orthogonal projection onto orthogonal complement V α of Vα with Vα := span n φα k,j k N, j [m ℓ] o L2(D)m, in which φα k,j is defined for α (0, 1), k N and j [ℓ], 0, ..., 0, p (1 α)φk | {z } j th , 0, ..., 0, αξ(k 1)(m ℓ)+j Here, {φk}k N is an orthonormal basis in L2(D). Futhermore, Qα : L2(D)m L2(D)m is defined by Qα := PV 0 PV α + (I PV 0 )(I PV α ) I (PV 0 PV α )2 1/2 , where PV 0 is the orthogonal projection onto orthogonal complement V 0 of V0 with V0 := L2(D)m ℓ {0}ℓ. The operator Qα is well-defined for 0 < α < 1/2 because it holds that PV α PV 0 This construction is given by the combination of "Pairs of projections" discussed in Kato [2013, Section I.4.6] with the idea presented in [Puthawala et al., 2022b, Lemma 29]. C.3 Proof of Theorem 1 We begin with Definition 3. The set of L-layer neural networks mapping from Rd to Rd is NL(σ; Rd, Rd ) := n f : Rd Rd f(x) = WLσ( W1σ(W0x + b0) + b1 ) + b L, Wℓ Rdℓ+1 dℓ, bℓ Rdℓ+1, dℓ N0(d0 = d, d L+1 = d ), ℓ= 0, ..., L o , where σ : R R is an element-wise nonlinear activation function. For the class of nonlinear activation functions, A0 := n σ C(R) n N0 s.t. Nn(σ; Rd, R) is dense in C(K) for K Rd compact o AL 0 := n σ A0 σ is Borel measurable s.t. sup x R |σ(x)| 1 + |x| < o 1e.g., step functions whose supports are disjoint for each sequence. BA := n σ A0 K Rd compact , ϵ > 0, and C diam(K), n N0, f Nn(σ; Rd, Rd) s.t. |f(x) x| ϵ, x K, and, |f(x)| C, x Rdo . The set of integral neural operators with L2-integral kernels is NOL(σ; D, din, dout) := n G : L2(D)din L2(D)dout G = KL+1 (KL + b L) σ (K2 + b2) σ (K1 + b1) (K0 + b0), Kℓ L(L2(D)dℓ, L2(D)dℓ+1), Kℓ: f 7 Z D kℓ( , y)f(y)dy D , kℓ L2(D D; Rdℓ+1 dℓ), bℓ L2(D; Rdℓ+1), dℓ N, d0 = din, d L+2 = dout, ℓ= 0, ..., L + 2 o . Proof. Let R > 0 such that K BR(0), where BR(0) := {u L2(D)din | u L2(D)din R}. By Theorem 11 of Kovachki et al. [2021b], there exists L N and e G NOL(σ; D, din, dout) such that G+(a) e G(a) L2(D)dout ϵ and e G(a) L2(D)dout 4M, for a L2(D)din, a L2(D)din R. We write operator e G by e G = e KL+1 ( e KL + eb L) σ ( e K2 + eb2) σ ( e K1 + eb1) ( e K0 + eb0), e Kℓ L(L2(D)dℓ, L2(D)dℓ+1), e Kℓ: f 7 Z D ekℓ( , y)f(y)dy, ekℓ C(D D; Rdℓ+1 dℓ), ebℓ L2(D; Rdℓ+1), dℓ N, d0 = din, d L+2 = dout, ℓ= 0, ..., L + 2. We remark that kernel functions ekℓare continuous because neural operators defined in Kovachki et al. [2021b] parameterize the integral kernel function by neural networks, thus, Ran( e G) C(D)dout. 
(C.6) We define the neural operator H : L2(D)din L2(D)din+dout by H = KL+1 (KL + b L) σ (K2 + b2) σ (K1 + b1) (K0 + b0), where Kℓand bℓare defined as follows. First, we choose Kinj L(L2(D)din, L2(D)din) as a linear injective integral operator 2. (i) When σ1 AL 0 BA is injective, K0 = Kinj e K0 L(L2(D)din, L2(D)din+d1), b0 = O eb0 L2(D)din+d1, 2For example, if we choose the integral kernel kinj as kinj(x, y) = P k=1 φk(x) φk(y), then the integral operator Kinj with the kernel kinj is injective where { φ}k is the orthonormal basis in L2(D)din. Kℓ= Kinj O O e Kℓ L(L2(D)din+dℓ, L2(D)din+dℓ+1), bℓ= O ebℓ L2(D)din+dℓ+1, (1 ℓ L), ... KL+1 = Kinj O O e KL+1 L(L2(D)din+d L+1, L2(D)din+dout), bℓ= O O L2(D)din+dout. (ii) When σ1 = Re LU, K0 = Kinj e K0 L(L2(D)din, L2(D)din+d1), b0 = O eb0 L2(D)din+d1, Kinj O Kinj O O e K1 L(L2(D)din+d1, L2(D)2din+d2), b0 = L2(D)2din+d1, Kinj Kinj Kinj Kinj O L(L2(D)2din+dℓ, L2(D)2din+dℓ+1), L2(D)2din+dℓ+1, (2 ℓ L), KL = Kinj Kinj O O e KL L(L2(D)2din+d L, L2(D)din+d L+1), b L = O eb L L2(D)din+d L+1, KL+1 = Kinj O O e KL+1 L(L2(D)din+d L+1, L2(D)din+dout), b L+1 = O O L2(D)din+dout. Then, the operator H : L2(D)din L2(D)din+dout has the form of Kinj Kinj σ Kinj σ Kinj Kinj e G in the case of (i). Kinj Kinj e G in the case of (ii). For the case of (ii), we have used the fact ( I I ) Re LU I I Thus, in both cases, H is injective. In the case of (i), as σ AL 0 , we obtain the estimate σ(f) L2(D)din p 2|D|din C0 + f L2(D)din , f L2(D)din, C0 := sup x R |σ(x)| 1 + |x| < . Then we evaluate for a K( BR(0)), H(a) L2(D)din+dout e G(a) L2(D)dout + Kinj Kinj σ Kinj σ Kinj Kinj(a) L2(D)din ℓ=1 Kinj ℓ+1 op + Kinj L+2 op R =: CH. In the case of (ii), we find the estimate, for a K, H(a) L2(D)din+dout 4M + Kinj L+2 op R < CH. (C.8) From (C.6) (especially, Ran(π1H) C(D)) and Remark 3, we can choose an orthogonal sequence {ξk}k N in L2(D) such that (3.1) holds. By applying Lemma 1, as T = H, n = din, m = din+dout, ℓ= dout, we find that G := πdout Qα PV α | {z } =:B H : L2(D)din L2(D)dout, is injective. Here, PV α and Qα are defined as in Remark 4; we choose 0 < α << 1 such that op < min ϵ 10CH , 1 =: ϵ0, where PV 0 is the orthogonal projection onto V 0 := {0}din L2(D)dout. By the same argument as in the proof of Theorem 15 in Puthawala et al. [2022a], we can show that I Qα op 4ϵ0. Furthermore, since B is a linear operator, B KL+1 is also a linear operator with integral kernel (Bk L+1( , y)) (x), where k L+1(x, y) is the kernel of KL+1. This implies that G NOL(σ; D, din, dout). We get, for a K, G+(a) G(a) L2(D)dout G+(a) e G(a) L2(D)dout | {z } (C.5) ϵ + e G(a) G(a) L2(D)dout . (C.9) Using (C.7) and (C.8), we then obtain e G(a) G(a) L2(D)dout = πdout H(a) πdout Qα PV α H(a) L2(D)dout πdout (PV 0 PV α + PV α ) H(a) πdout Qα PV α H(a) L2(D)dout πdout (PV 0 PV α ) H(a) L2(D)dout + πdout (I Qα) PV α H(a) L2(D)dout 5ϵ0 H(a) L2(D)din+dout ϵ Combining (C.9) and (C.10), we conclude that G+(a) G(a) L2(D)dout ϵ C.4 Remark following Theorem 1 Remark 5. We make the following observations using Theorem 1: (i) Re LU and Leaky Re LU functions belong to AL 0 BA due to the fact that {σ C(R) | σ is not a polynomial} A0 (see Pinkus [1999]), and both the Re LU and Leaky Re LU functions belong to BA (see Lemma C.2 in Lanthaler et al. [2022]). We note that Lemma C.2 in Lanthaler et al. [2022] solely established the case for Re LU. 
However, it holds true for Leaky Re LU as well since the proof relies on the fact that the function x 7 min(max(x, R), R) can be exactly represented by a two-layer Re LU neural network, and a two-layer Leaky Re LU neural network can also represent this function. Consequently, Leaky Re LU is one of example that satisfies (ii) in Theorem 1. (ii) We emphasize that our infinite-dimensional result, Theorem 1, deviates from the finitedimensional result. Puthawala et al. [2022a, Theorem 15] assumes that 2din + 1 dout due to the use of Whitney s theorem. In contrast, Theorem 1 does not assume any conditions on din and dout, that is, we are able to avoid invoking Whitney s theorem by employing Lemma 1. (iii) We provide examples that injective universality does not hold when L2(D)din and L2(D)dout are replaced by Rdin and Rdout: Consider the case where din = dout = 1 and G+ : R R is defined as G+(x) = sin(x). We can not approximate G+ : R R by an injective function G : R R in the set K = [0, 2π] in the L -norm. According to the topological degree theory (see Cho and Chen [2006, Theorem 1.2.6(iii)]), any continuous function G : R R which satisfies G G+ C([0,2π]) < ε satisfies the equation on both intervals I1 = [0, π], I2 = [π, 2π] deg(G, Ij, s) =deg(G+, Ij, s) = 1 for all s [ 1 + ε, 1 ε], j = 1, 2. This implies that G : Ij R obtains the value s [ 1 + ε, 1 ε] at least once. Hence, G obtains the values s [ 1 + ε, 1 ε] at least two times on the interval [0, 2π] and is it thus not injective. It is worth noting that the degree theory exhibits significant differences between the infinite-dimensional and finite-dimensional cases [Cho and Chen, 2006]). D Details in Section 3.3 D.1 Finite rank approximation We consider linear integral operators Kℓwith L2 kernels kℓ(x, y). Let {φk}k N be an orthonormal basis in L2(D). Since {φk(y)φp(x)}k,p N is an orthonormal basis of L2(D D), integral kernels kℓ L2(D D; Rdℓ+1 dℓ) in integral operators Kℓ L(L2(D)dℓ, L2(D)dℓ+1) has the expansion kℓ(x, y) = X k,p N C(ℓ) k,pφk(y)φp(x), then integral operators Kℓ L(L2(D)dℓ, L2(D)dℓ+1) take the form k,p N C(ℓ) k,p(u, φk)φp(x), u L2(D)dℓ, where C(ℓ) k,p Rdℓ+1 dℓwhose (i, j)-th component c(ℓ) k,p,ij is given by c(ℓ) k,p,ij = (kℓ,ij, φkφp)L2(D D). Here, we write (u, φk) Rdℓ, (u, φk) = (u1, φk)L2(D), ..., (udℓ, φk)L2(D) . We define Kℓ,N L(L2(D)dℓ, L2(D)dℓ+1) as the truncated expansion of Kℓby N finite sum, that is, Kℓ,Nu(x) := X k,p N C(ℓ) k,p(u, φk)φp(x). Then Kℓ,N L(L2(D)dℓ, L2(D)dℓ+1) is a finite rank operator with rank N. Furthermore, we have Kℓ Kℓ,N op Kℓ Kℓ,N HS = i,j |c(ℓ) k,p,ij|2 which implies that as N , Kℓ Kℓ,N op 0. D.2 Layerwise injectivity We first revisit layerwise injectivity and bijectivity in the case of the finite rank approximation. Let KN : L2(D)n L2(D)m be a finite rank operator defined by KNu(x) := X k,p N Ck,p(u, φk)φp(x), u L2(D)n, where Ck,p Rm n and (u, φp) Rn is given by (u, φp) = (u1, φp)L2(D), ..., (un, φp)L2(D) . Let b N L2(D)n be defined by b N(x) := X p N bpφp(x), in which bp Rm. As analogues of Propositions 1 and 2, we obtain the following characterization. Proposition 5. (i) The operator Re LU (KN + b N) : (span{φk}k N)n L2(D)m, is injective if and only if for every v (span{φk}k N)n, u L2(D)n u N Ker(CS,N) X(v, KN + b N) (span{φk}k N)n = {0}. where S(v, KN + b N) [m] and X(v, KN + b N) are defined in Definition 2, and u N := ((u, φp))p N RNn, CS,N := Ck,q S(v,KN+b N) k,q [N] RN|S(v,KN+b N)| Nn. (D.2) (ii) Let σ be injective. 
Then the operator σ (KN + b N) : (span{φk}k N)n L2(D)m, is injective if and only if CN is injective, where CN := (Ck,q)k,q [N] RNm Nn. (D.3) Proof. The above statements follow from Propositions 1 and 2 by observing that u Ker (KN) is equivalent to (cf. (D.2) and (D.3)) X k,p N Ck,p(u, φk)φp = 0, CN u N = 0. D.3 Global injectivity We revisit global injectivity in the case of finite rank approximation. As an analogue of Lemma 1, we have the following Lemma 2. Let N, N N and n, m, ℓ N with N m > N ℓ 2Nn + 1, and let T : L2(D)n L2(D)m be a finite rank operator with N rank, that is, Ran(T) (span{φk}k N )m, (D.4) and Lipschitz continuous, and T : (span{φk}k N)n L2(D)m, is injective. Then, there exists a finite rank operator B L(L2(D)m, L2(D)ℓ) with rank N such that B T : (span{φk}k N)n (span{φk}k N )ℓ, is injective. Proof. From (D.4), T : L2(D)n L2(D)m has the form of k N (T(a), φk)φk, where (T(a), φk) Rm. We define T : RNn RN m by T(a) := ((T(a), φk))k [N ] RN m, a RNn, where T(a) L2(D)m is defined by in which ak Rn, a = (a1, ..., a N) RNn. Since T : L2(D)n L2(D)m is Lipschitz continuous, T : RNn RN m is also Lipschitz continuous. As N m > N ℓ 2Nn + 1, we can apply Lemma 29 from Puthawala et al. [2022a] with D = N m, m = N ℓ, n = Nn. According to this lemma, there exists a N ℓ-dimensional linear subspace V in RN m such that PV PV 0 and PV T : RNn RN m, is injective, where V 0 = {0}N (m ℓ) RN ℓ. Furthermore, in the proof of Theorem 15 of Puthawala et al. [2022a], denoting B := πN ℓ Q PV RN ℓ N m, we are able to show that B T : RNn RN ℓ is injective. Here, πN ℓ: RN m RN ℓ πN ℓ(a, b) := b, (a, b) RN (m ℓ) RN ℓ, where Q : RN m RN m is defined by Q := PV 0 PV + (I PV 0 )(I PV ) I (PV 0 PV )2 1/2 . We define B : L2(D)m L2(D)ℓby k,p N Bk,p(u, φk)φp, where Bk,p Rℓ m, B = (Bk,p)k,p [N ]. Then B : L2(D)m L2(D)ℓis a linear finite rank operator with N rank, and B T : L2(D)n L2(D)ℓ is injective because, by the construction, it is equivalent to B T : RNn RN ℓ, is injective. D.4 Proof of Theorem 2 Definition 4. We define the set of integral neural operators with N rank by NOL,N(σ; D, din, dout) := n GN : L2(D)din L2(D)dout : GN = KL+1,N (KL,N + b L,N) σ (K2,N + b2,N) σ (K1,N + b1,N) (K0,N + b0,N), Kℓ,N L(L2(D)dℓ, L2(D)dℓ+1), Kℓ,N : f 7 X k,p N C(ℓ) k,p(f, φk)φp, bℓ,N L2(D; Rdℓ+1), bℓ,N = X p N b(ℓ) p φm, C(ℓ) k,p Rdℓ+1 dℓ, b(ℓ) p Rdℓ+1, k, p N, dℓ N, d0 = din, d L+2 = dout, ℓ= 0, ..., L + 2 o . Proof. Let R > 0 such that K BR(0), where BR(0) := {u L2(D)din | u L2(D)din R}. As Re LU and Leaky Re LU function belongs to AL 0 BA, by Theorem 11 of Kovachki et al. [2021b], there exists L N and e G NOL(σ; D, din, dout) such that G+(a) e G(a) L2(D)dout ϵ and e G(a) L2(D)dout 4M, for a L2(D)din, a L2(D)din R. We write operator e G by e G = e KL+1 ( e KL + eb L) σ ( e K2 + eb2) σ ( e K1 + eb1) ( e K0 + eb0), e Kℓ L(L2(D)dℓ, L2(D)dℓ+1), e Kℓ: f 7 Z D ekℓ( , y)f(y)dy, ekℓ L2(D D; Rdℓ+1 dℓ), ebℓ L2(D; Rdℓ+1), dℓ N, d0 = din, d L+2 = dout, ℓ= 0, ..., L + 2. We set e GN NOL,N (σ; D, din, dout) such that e GN = e KL+1,N ( e KL,N +eb L,N ) σ ( e K2,N +eb2,N ) σ ( e K1,N +eb1,N ) ( e K0,N +eb0,N ), where e Kℓ,N : L2(D)dℓ L2(D)dℓ+1 is defined by e Kℓ,N u(x) = X k,p N C(ℓ) k,p(u, φk)φp(x), where C(ℓ) k,p Rdℓ+1 dℓwhose (i, j)-th component c(ℓ) k,p,ij is given by c(ℓ) k,p,ij = (ekℓ,ij, φkφp)L2(D D). 
Since e Kℓ e Kℓ,N 2 op e Kℓ e Kℓ,N 2 i,j |c(ℓ) k,p,ij|2 0 as N , there is a large N N such that e G(a) e GN (a) L2(D)dout ϵ Then, we have e GN (a) L2(D)dout sup a K e GN (a) e G(a) L2(D)dout + sup a K e G(a) L2(D)dout We define the operator HN : L2(D)din L2(D)din+dout by HN (a) = HN (a)1 HN (a)2 := Kinj,N Kinj,N(a) e GN (a) where Kinj,N : L2(D)din L2(D)din is defined by Kinj,Nu = X k N (u, φk)φk. As Kinj,N : (span{φk}k N)din L2(D)din is injective, HN : (span{φk}k N)din (span{φk}k N)din (span{φk}k N )dout , is injective. Furthermore, by the same argument (ii) (construction of H) in the proof of Theorem 1, HN NOL,N (σ; D, din, dout), because both of two-layer Re LU and Leaky Re LU neural networks can represent the identity map. Note that above Kinj,N is an orthogonal projection, so that Kinj,N Kinj,N = Kinj,N. However, we write above HN (a)1 as Kinj,N Kinj,N(a) so that it can be considered as combination of (L + 2) layers of neural networks. We estimate that for a L2(D)din, a L2(D)din R, HN (a) L2(D)din+dout 1 + 4M + Kinj L+2 op R =: CH. Here, we repeat an argument similar to the one in the proof of Lemma 2: HN : L2(D)din L2(D)din+dout has the form of k N (HN (a)1, φk)φk, X k N (HN (a)2, φk)φk where (HN (a)1, φk) Rdin, (HN (a)2, φk) Rdout. We define HN : RNdin RNdin+N dout by HN (a) := h ((HN (a)1, φk))k [N] , ((HN (a)2, φk))k [N ] i RNdin+N dout, a RNdin, where HN (a) = (HN (a)1, HN (a)2) L2(D)din+dout is defined by HN (a)1 := HN HN (a)2 := HN where ak Rdin, a = (a1, ..., a N) RNdin. Since HN : L2(D)din L2(D)din+dout is Lipschitz continuous, HN : RNdin RN dout is also Lipschitz continuous. As Ndin + N dout > N dout 2Ndin + 1, we can apply Lemma 29 of Puthawala et al. [2022a] with D = Ndin + N dout, m = N dout, n = Ndin. According to this lemma, there exists a N dout-dimensional linear subspace V in RNdin+N dout such that PV PV 0 op < min ϵ 15CHN , 1 =: ϵ0 and PV HN : RNdin RNdin+N dout, is injective, where V 0 = {0}Ndin RN dout. Furthermore, in the proof of Theorem 15 of Puthawala et al. [2022a], denoting by B := πN dout Q PV , we can show that B HN : RNdin RN dout, is injective, where πN dout : RNdin+N dout RN dout πN dout(a, b) := b, (a, b) RNdin RN dout, and Q : RNdin+N dout RNdin+N dout is defined by Q := PV 0 PV + (I PV 0 )(I PV ) I (PV 0 PV )2 1/2 . By the same argument in proof of Theorem 15 in Puthawala et al. [2022a], we can show that I Q op 4ϵ0. We define B : L2(D)din+dout L2(D)dout k,p N Bk,p(u, φk)φp, Bk,p Rdout (din+dout), B = (Bk,p)k,p [N ], then B : L2(D)din+dout L2(D)dout is a linear finite rank operator with N rank. Then, GN := B HN : L2(D)din L2(D)dout, is injective because by the construction, it is equivalent to B HN : RNdin RN dout, is injective. Furthermore, we have GN NOL,N (σ; D, din, dout). Indeed, HN NOL,N (σ; D, din, dout), B is the linear finite rank operator with N rank, and multiplication of two linear finite rank operators with N rank is also a linear finite rank operator with N rank. Finally, we estimate for a K, G+(a) GN (a) L2(D)dout = G+(a) e G(a) L2(D)dout | {z } (D.6) ϵ + e G(a) e GN (a) L2(D)dout | {z } (D.7) ϵ + e GN (a) GN (a) L2(D)dout . 
Using notation (a, φk) Rdin, and a = ((a, φk))k [N] RNdin, we further estimate for a K, e GN (a) GN (a) L2(Q)dout = πdout HN (a) B HN (a) L2(Q)dout = πN dout HN (a) B HN (a) 2 = πN dout HN (a) πN dout Q PV HN (a) 2 πN dout (PV 0 PV + PV ) HN (a) πN dout Q PV HN (a) 2 πN dout (PV 0 PV ) HN (a) 2 + πN dout (I Q) PV HN (a) 2 5ϵ0 HN (a) 2 | {z } = HN (a) L2(D)dout 0 such that for u L2(D)n > r, αu + W 1K(u), u L2(D)n u L2(D)n W 1z L2(D)n. Thus, we have that for u L2(D)n > r W 1K(u) W 1z, u L2(D)n u 2 L2(D)n αu + W 1K(u), u L2(D)n αu + W 1z, u L2(D)n u 2 L2(D)n W 1z L2(D)n L2(D)n u 2 L2(D)n α α > 1, and, hence, for all u L2(D) > r0 and λ (0, 1] we have u Vλ. Thus [ λ (0,1] Vλ B(0, r0). Again, by the Leray-Schauder theorem (see Gilbarg and Trudinger [2001, Theorem 11.3]), Hz has a fixed point. E.2 Examples for Proposition 3 Example 2. We consider the case where n = 1 and D Rd is a bounded interval. We consider the non-linear integral operator, K(u)(x) := Z D k(x, y, u(x))u(y)dy, x D, and k(x, y, t) is bounded, that is, there is CK > 0 such that |k(x, y, t)| CK, x, y D, t R. If W 1 op is small enough such that 1 > W 1 op CK|D|, then, for α W 1 op CK|D|, 1 , u 7 αu + W 1K(u) is coercive. Indeed, we have for u L2(D), αu + W 1K(u), u L2(D) u L2(D) α u L2(D) W 1 op K(u) L2(D) α W 1 op CK|D| For example, we can consider a kernel k(x, y, t) = j=1 cj(x, y)σs(aj(x, y)t + bj(x, y)), where σs : R R is the sigmoid function defined by σs(t) = 1 1 + e t . T are functions a, b, c C(D D) such that j=1 cj L (D D) < W 1 1 op |D| 1. Example 3. Again, we consider the case where n = 1 and D Rd a is a bounded set. We assume that W C1(D) satisfies 0 < c1 W(x) c2. For simplicity, we assume that |D| = 1. We consider the non-linear integral operator K(u)(x) := Z D k(x, y, u(x))u(y)dy, x D, (E.1) k(x, y, t) = j=1 cj(x, y)σwire(aj(x, y)t + bj(x, y)), (E.2) in which σwire : R R is the wavelet function defined by σwire(t) = Im (eiωte t2), and aj, bj, cj C(D D) are such that the aj(x, y) are nowhere vanishing functions, that is, aj(x, y) = 0 for all x, y D D. the and its generalizations (e.g. activation functions which do decay only exponentially). The next lemma holds for any activation function with exponential decay, including the activation function σwire and settles the key condition for Proposition 3 to hold. Lemma 3. Assume that |D| = 1 and the activation function σ : R R be continuous. Assume that there exists M1, m0 > 0 such that |σ(t)| M1e m0|t|, t R. Let aj, bj, cj C(D D) be such that aj(x, y) are nowhere vanishing functions, that is, aj(x, y) = 0 for all x, y D D. Moreover, let K : L2(D) L2(D) be a non-linear integral operator given in (E.1) with a kernel satisfying (E.2). Let α > 0 and 0 < c0 W(x) c1 . Then function F : L2(D) L2(D), F(u) = αu + W 1K(u) is coercive. Proof. As D is compact, there is a0 > 0 such that for all j = 1, 2, . . . , J we have |aj(x, y)| a0 a.e. and |bj(x, y)| b0 a.e. We point out that |σ(t)| M1. Next, j=1 W 1cj L (D D))M1ε < α we consider λ > 0 and u L2(D) and the sets D1(λ) = {x D : |u(x)| ελ}, D2(λ) = {x D : |u(x)| < ελ}. Let ε > 0 be such that j=1 W 1cj L (D D)) M1ε < α Then, for x D2(λ), j=1 W 1cj L (D D)|σ(aj(x, y)u(x) + bj(x, y))u(x)| j=1 W 1cj L (D D)M1ϵλ (E.3) After ε is chosen as in the above, we choose λ0 max(1, b0/(a0ε)) to be sufficiently large so that for all |t| ελ0 it holds j=1 W 1cj L (D D))M1exp( m0|a0t b0|)t < α Here, we observe that, as λ0 b0/(a0ε), we have that for all |t| ελ0, a0|t| b0 > 0. 
Then, when λ λ0, we have for x D1(λ), j=1 W 1cj L (D D) σ aj(x, y)u(x) + bj(x, y) u(x) α When u L2(D) has the norm u L2(D) = λ λ0 1, we have D W(x) 1k(x, y, u(x))u(x)u(y)dxdy j=1 W 1cj L (D D))M1exp m0|a0|u(x)| b0| |u(x)|dx |u(y)|dy j=1 W 1cj L (D D)|σ(aj(x, y)u(x) + bj(x, y))||u(x)|dx |u(y)|dy 4 u L2(D) + α 4 λ u L2(D) 2 u 2 L2(D). Hence, αu + W 1K(u), u L2(D) u L2(D) α and the function u αu + W 1K(u) is coercive. E.3 Proof of Proposition 4 Proof. (Injectivity) Assume that σ(Wu1 + K(u1) + b) = σ(Wu2 + K(u2) + b). where u1, u2 L2(D)n. Since σ is injective and W : L2(D)n L2(D)n is bounded linear bijective, we have u1 + W 1K(u1) = u2 + W 1K(u2) =: z. Since the mapping u 7 z W 1K(u) is contraction (because W 1K is contraction), by the Banach fixed-point theorem, the mapping u 7 z W 1K(u) admit a unique fixed-point in L2(D)n, which implies that u1 = u2. (Surjectivity) Since σ is surjective, it is enough to show that u 7 Wu + K(u) + b is surjective. Let z L2(D)n. Since the mapping u 7 W 1z W 1b W 1K(u) is contraction, by Banach fixed-point theorem, there is u L2(D)n such that u = W 1z W 1b W 1K(u ) Wu + K(u ) + b = z. E.4 Examples for Proposition 4 Example 4. We consider the case of n = 1, and D [0, ℓ]d. We consider Volterra operators K(u)(x) = Z D k(x, y, u(x), u(y))u(y)dy, where x = (x1, . . . , xd) and y = (y1, . . . , yd). We recall that K is a Volterra operator if k(x, y, t, s) = 0 = yj xj for all j = 1, 2, . . . , d. (E.4) In particular, when D = (a, b) R is an interval, the Volterra operators are of the form K(u)(x) = Z x a k(x, y, u(x), u(y))u(y)dy, and if x is considered as a time variable, the Volterra operators are causal in the sense that the value of K(u)(x) at the time x depends only on u(y) at the times y x. Assume that k(x, y, t, s) C(D D R R) is bounded and uniformly Lipschitz smooth in the t and s variables, that is, k C(D D; C0,1(R R)). Next, we consider the non-linear operator F : L2(D) L2(D), F(u) = u + K(u). (E.5) Assume that u, w L2(D) are such that u + K(u) = w + K(w), so that w u = K(u) K(w). Next, we will show that then u = w. We denote and D(z1) = D ([0, z1] [0, ℓ]d 1),. Then for x D(z1) the Volterra property of the kernel implies that |u(x) w(x)| Z D |k(x, y, u(x), u(y))u(y) k(x, y, w(x), w(y))w(y)|dy D(z1) |k(x, y, u(x), u(y))u(y) k(x, y, w(x), u(y))u(y)|dy D(z1) |k(x, y, w(x), u(y))u(y) k(x, y, w(x), w(y))u(y)|dy D(z1) |k(x, y, w(x), w(y))u(y) k(x, y, w(x), w(y))w(y)|dy 2 k C(D D;C0,1(R R)) u w L2(D(z1)) u L2(D(z1)) + k L (D D R R) u w L2(D(z1)) p so that for all 0 < z1 < ℓ, u w 2 L2(D(z1)) 0 1D(x)|u(x) w(x)|2dxddxd 1 . . . dx2 z1ℓd 1 2 k C(D D;C0,1(R R)) u w L2(D(z1)) u L2(D(z1)) + k L (D D R R) u w L2(D(z1)) p z1ℓd 1 k C(D D;C0,1(R R)) u L2(D) + k L (D D R R) p |D| 2 u w 2 L2(D(z1)). Thus, when z1 is so small that z1ℓd 1 k C(D D;C0,1(Rn)) u L2(D) + k L (D D R R) p we find that u w L2(D(z1)) = 0, that is, u(x) w(x) = 0 for x D(z1). Using the same arguments as above, we see for all k N that that if u = w in D(kz1) then u = w in D((k + 1)z1). Using induction, we see that u = w in D. Hence, the operator u 7 F(u) is injective in L2(D). Example 5. We consider derivatives of Volterra operators in the domain D [0, ℓ]d. Let K : L2(D) L2(D) be a non-linear operator D k(x, y, u(y))u(y)dy, (E.6) where k(x, y, t) satisfies (E.4), is bounded, and k C(D D; C0,1(R R)). Let F1 : L2(D) L2(D) be F1(u) = u + K(u). 
(E.7) Then the Fréchet derivative of K at u0 L2(D) to the direction w L2(D) is DF1|u0(w) = w(x) + Z D k1(x, y, u0(y))w(y)dy, (E.8) k1(x, y, u0(y)) = u0(y) tk(x, y, t) t=u0(x) + k(x, y, u0(y)) (E.9) is a Volterra operator satisfying k1(x, y, t) = 0 = yj xj for all j = 1, 2, . . . , d. (E.10) As seen in Example 4, the operator DF1|u0 : L2(D) L2(D) is injective. F Details in Section 4.2 In this appendix, we prove Theorem 3.We recall that in the theorem, weconsider the case when n = 1, and D R is a bounded interval, and the operator F1 is of the form F1(u)(x) = W(x)u(x) + Z D k(x, y, u(y))u(y)dy, where W C1(D) satisfies 0 < c1 W(x) c2 and the function (x, y, s) 7 k(x, y, s) is in C3(D D R) and in D D R its three derivatives and the derivatives of W are uniformly bounded by c0, that is, k C3(D D R) c0, W C1(D) c0. (F.1) We recall that the identical embedding H1(D) L (D) is bounded and compact by Sobolev s embedding theorem. As we will consider kernels k(x, y, u0(y)), we will consider the non-linear operator F1 mainly as an operator in a Sobolev space H1(D). The Frechet derivative of F1 at u0 to direction w, denoted Au0w = DF1|u0(w) is given by Au0w = W(x)w(x) + Z D k(x, y, u0(y))w(y)dy + Z u(x, y, u0(y))w(y)dy. (F.2) The condition (F.1) implies that F1 : H1(D) H1(D), (F.3) is a locally Lipsichitz smooth function and the operator Au0 : H1(D) H1(D), given in (F.2), is defined for all u0 C(D) as a bounded linear operator. When X is a Banach space, let BX (0, R) = {v X : v X < R} and BX (0, R) = {v X : v X R} be the open and closed balls in X, respectively. We consider the Hölder spaces Cn,α(D) and their image in (leaky) Re LU-type functions. Let a 0 and σa(s) = Re LU(s) a Re LU( s). We will consider the image of the closed ball of C1,α(D) in the map σa, that is σa(BC1,α(D)(0, R)) = {σa g C(D) : g C1,α(D) R}. We will below assume that for all u0 C(D) the integral operator Au0 : H1(D) H1(D) is an injective operator. (F.4) This condition is valid when K(u) is a Volterra operator, see Examples 4 and 5. As the integral operators Au0 are Fredholm operators having index zero, this implies that the operators (F.4) are bijective. The inverse operator A 1 u0 : H1(D) H1(D) can be written as A 1 u0 v(x) = f W(x)v(x) Z D eku0(x, y)v(y)dy, (F.5) where eku0, xeku0 C(D D) and f W C1(D). We will consider the inverse function of the map F1 in a set Y σa(BC1,α(D)(0, R)) that is a compact subset of the Sobolev space H1(D). To this end, we will cover the set Y with small balls BH1(D)(gj, ε0), j = 1, 2, . . . , J of H1(D), centred at gj = F1(vj), where vj H1(D). We will show below that when g BH1(D)(gj, 2ε0), that is, g is 2ε1-close to the function gj in H1(D), the inverse map of F1 can be written as a limit (F 1 1 (g), g) = limm H m j (vj, g) in H1(D)2, where = u A 1 vj (F1(u) F1(vj)) + A 1 vj (g gj) g that is, near gj we can approximate F 1 1 as a composition H m j of 2m layers of neural operators. 
To glue the local inverse maps together, we use a partition of unity in the function space Y that is given by integral neural operators Φ i(v, w) = π1 ϕ i,1 ϕ i,2 ϕ i,ℓ0(v, w), where ϕ i,ℓ(v, w) = (Fyℓ,s( i,ℓ),ϵ1(v, w), w), where i belongs in some finite index set I Zℓ0, some ϵ1 > 0, some yℓ D (ℓ= 1, ..., ℓ0), s( i, ℓ) = iℓϵ1.π1(v, w) = v that maps a pair (v, w) to the first function v Here, Fz,s,h(v, w) are integral neural operators with distributional kernels Fz,s,h(v, w)(x) = Z D kz,s,h(x, y, v(x), w(y))dy, where kz,s,h(x, y, v(x), w(y)) = v(x)1[s 1 2 h)(w(y))δ(y z), and 1A is the indicator function of a set A and y 7 δ(y z) is the Dirac delta distribution at the point z D. Using these, we can write the inverse of F1 at g Y as F 1 1 (g) = lim m i I Φ i H m j( i) where j( i) {1, 2, . . . , J} are suitably chosen and the limit is taken in the norm topology of H1(D). This result is summarized in the following theorem, that is a modified version of Theorem 3 for the inverse operator F 1 1 in (F.6) where we have refined the partition of unity Φ i so that we use indexes i I Zℓ0 instead of j {1, . . . , J}. This result is summarized in following theorem: Theorem 4. Assume that F1 satisfies the above assumptions (F.1) and (F.4) and that F1 : H1(D) H1(D) is a bijection. Let Y σa(BC1,α(D)(0, R)) be a compact subset of the Sobolev space H1(D), where α > 0 and a 0. Then the inverse of F1 : H1(D) H1(D) in Y can written as a limit (F.6) that is, as a limit of integral neural operators. Observe that Theorem 4 includes the case where a = 1, in which case σa = Id and Y σa(BC1,α(D)(0, R)) = BC1,α(D)(0, R)). We note that when σa is a leaky Re LU-function with parameter a > 0, Theorem 4 can be applied to compute the inverse of σa F1 that is given by F 1 1 σ 1 a , where σ 1 a = σ1/a. Note that the assumption that Y σa(BC1,α(D)(0, R)) makes it possible to apply Theorem 4 in the case when one trains deep neural networks having layers σa F1 and the parameter a of the leaky Re LU-function is a free parameter which is also trained. Proof. As the operator F1 can be multiplied by function W(x) 1, it is sufficient to consider the case when W(x) = 1. Below, we use the fact that as D R, Sobolev s embedding theorem yields that the embedding H1(D) C(D) is bounded and there is CS > 0 such that u C(D) CS u H1(D). (F.7) For clarity, we denote below the norm of C(D) by u L (D). Next we considre the Frechet derivatives of F1. We recall that the 1st Frechet derivative of F1 at u0 is the operator Au0. The 2nd Frechet derivative of F1 at u0 to directions w1 and w2 is D2F1|u0(w1, w2) = Z u(x, y, u0(y))w1(y)w2(y)dy + Z u2 (x, y, u0(y))w1(y)w2(y)dy D p(x, y)w1(y)w2(y)dy, p(x, y) = 2 k u(x, y, u0(y)) + u0(y) k2 u2 (x, y, u0(y)), (F.8) xp(x, y) = 2 2k u x(x, y, u0(y)) + u0(y) k3 u2 x(x, y, u0(y)). (F.9) D2F1|u0(w1, w2) H1(D) 3|D|1/2 k C3(D D R)(1 + u0 L (D)) w1 L (D) w2 L (D). When we freeze the function u in kernel k to be u0, we denote Ku0v(x) = Z D k(x, y, u0(y))v(y)dy, Lemma 4. For u0, u1 C(D) we have Ku1 Ku0 L2(D) H1(D) k C2(D D R)|D| u1 u0 L (D). Au1 Au0 L2(D) H1(D) 2 k C2(D D R)|D|(1 + u0 L (D)) u1 u0 L (D). (F.11) Proof. Denote Mu0v(x) = Z u(x, y, u0(y))v(y)dy, Nu1,u2v(x) = Z u(x, y, u2(y))v(y)dy. Mu2v Mu1v = (Nu2,u2v Nu2,u1v) + (Nu2,u1v Nu1,u1v). 
By Schur s test for continuity of integral operators, Ku0 L2(D) L2(D) sup x D D |k(x, y, u0(y))|dy 1/2 sup y D D |k(x, y, u0(y))|dx 1/2 k C0(D D R), Mu0 L2(D) L2(D) u(x, y, u0(y))|dy 1/2 sup y D u(x, y, u0(y))|dx 1/2 k C1(D D R) u C(D), Ku2 Ku1 L2(D) L2(D) D |k(x, y, u2(y)) k(x, y, u1(y))|dy 1/2 D |k(x, y, u2(y)) k(x, y, u1(y))|dx 1/2 k C1(D D R) D |u2(y) u1(y))|dy 1/2 k C1(D D R) sup y D D |u2(y) u1(y))|dx 1/2 k C1(D D R) D |u2(y) u1(y))|dy 1/2 sup y D D |u2(y) u1(y))|dx 1/2 k C1(D D R) |D|1/2 u2 u1 L2(D) 1/2 |D| sup y D |u2(y) u1(y))| 1/2 k C1(D D R)|D|3/4 u2 u1 1/2 L2(D) u2 u1 1/2 L (D) k C1(D D R)|D| u2 u1 L (D), Nu2,u2 Nu2,u1 L2(D) L2(D) D |u2(y)k(x, y, u2(y)) u2(y)k(x, y, u1(y))|dy 1/2 D |u2(y)k(x, y, u2(y)) u2(y)k(x, y, u1(y))|dx 1/2 k C1(D D R)|D|3/4 u2 C0(D) u2 u1 1/2 L2(D) u2 u1 1/2 L (D) k C1(D D R)|D| u2 C0(D) u2 u1 L (D), Nu2,u1 Nu1,u1 L2(D) L2(D) D |(u2(y) u1(y))k(x, y, u1(y))|dy 1/2 D |(u2(y) u1(y))k(x, y, u1(y))|dx 1/2 k C0(D D R)|D| u2 u1 L (D), Mu2 Mu1 L2(D) L2(D) k C1(D D R)|D|(1 + u2 C0(D)) u2 u1 L (D). Also, when Dxv = dv Dx Ku0 L2(D) L2(D) D |Dxk(x, y, u0(y))|dy 1/2 sup y D D |Dxk(x, y, u0(y))|dx 1/2 k C1(D D R), Dx Ku1 Dx Ku0 L2(D) L2(D) D |Dxk(x, y, u1(y)) Dxk(x, y, u0(y))|dy 1/2 D |Dxk(x, y, u1(y)) Dxk(x, y, u0(y))|dx 1/2 k C2(D D R) D |u1(y) u0(y))|dy 1/2 k C2(D D R) sup y D D |u1(y) u0(y))|dx 1/2 k C2(D D R) D |u1(y) u0(y))|dy 1/2 sup y D D |u1(y) u0(y))|dx 1/2 k C2(D D R) |D|1/2 u1 u0 L2(D) 1/2 |D| sup y D |u1(y) u0(y))| 1/2 k C2(D D R)|D|3/4 u1 u0 1/2 L2(D) u1 u0 1/2 L (D) k C2(D D R)|D| u1 u0 L (D). Ku0 L2(D) H1(D) k C1(D D R), Mu0 L2(D) H1(D) u0 C0(D) k C1(D D R), Ku1 Ku0 L2(D) H1(D) k C2(D D R)|D| u1 u0 L (D). Mu1 Mu0 L2(D) H1(D) k C2(D D R)|D|(1 + u2 C0(D)) u1 u0 L (D). As Au1 = Ku1 + Mu1, the claim follows. As the embedding H1(D) C(D) is bounded and has norm CS, Lemma 4 implies that for all R > 0 there is CL(R) = 2 k C2(D D R)|D|(1 + CSR), such that the map, u0 7 DF1|u0, u0 BH1(0, R), (F.12) is Lipschitz map BH1(0, R) L(H1(D), H1(D)) with a Lipschitz constant CL(R), that is, DF1|u1 DF1|u2 H1(D) H1(D) CL(R) u1 u2 H1(D). (F.13) As u0 7 Au0 = DF1|u0 is continuous, the inverse A 1 u0 : H1(D) H1(D) exists for all u0 C(D), and the embedding H1(D) C(D) is compact, we have that for all R > 0 there is CB(R) > 0 such that A 1 u0 H1(D) H1(D) CB(R), for all u0 BH1(0, R). (F.14) Let R1, R2 > 0 be such that Y BH1(0, R1) and X = F 1 1 (Y) BH1(0, R2). Below, we denote CL = CL(2R2) and CB = CB(R2). Next we consider inverse of F1 in Y. To this end, let us consider ε0 > 0, which we choose later to be small enough. As Y BH1(0, R) is compact there are finite number of elements gj = F1(vj) Y, where vj X, j = 1, 2, . . . , J such that j=1 BH1(D)(gj, ε0). We observe that for u0, u1 X, A 1 u1 A 1 u0 = A 1 u1 (Au1 Au0)A 1 u0 , and hence the Lipschitz constant of A 1 : u 7 A 1 u , X L(H1(D), H1(D)) satisfies Lip(A 1 ) CA = C2 BCL, (F.15) see (F.11). Let us consider a fixed j and gj Y. When g satisfies g gj H1(D) < 2ε0, (F.16) the equation F1(u) = g, u X, is equivalent to the fixed point equation u = u A 1 vj (F1(u) F1(vj)) + A 1 vj (g gj), that is equivalent to the fixed point equation for the function Hj : H1(D) H1(D), Hj(u) = u A 1 vj (F1(u) F1(vj)) + A 1 vj (g gj). Note that Hj depends on g, and thus we later denote Hj = Hg j . We observe that Hj(vj) = vj + A 1 vj (g gj). (F.17) Let u, v BH1(0, 2R2). 
We have F1(u) = F1(v) + Av(u v) + Bv(u v), Bv(u v) C0 u v 2, where, see (F.10), C0 = 3|D|1/2 k C3(D D R)(1 + 2CSR2)C2 S, so that for u1, u2 BH1(0, 2R2), u1 u2 A 1 vj (F1(u1) F1(u2)) = u1 u2 A 1 u2 (F1(u1) F1(u2)) (A 1 u2 A 1 vj )(F1(u1) F1(u2)), u1 u2 A 1 u2 (F1(u1) F1(u2)) H1(D) = A 1 u2 (Bu2(u1 u2)) H1(D) A 1 u2 H1(D) H1(D) Bu2(u1 u2) H1(D) A 1 u2 H1(D) H1(D)C0 u1 u2 2 H1(D), CBC0 u1 u2 2 H1(D), (A 1 u2 A 1 vj )(F1(u1) F1(u2)) H1(D) A 1 u2 A 1 vj H1(D) H1(D) F1(u1) F1(u2) H1(D) Lip BH1(0,2R2) H1(D)(A 1 ) u2 vj Lip BH1(0,2R2) H1(D)(F1) u2 u1 H1(D) CA u2 vj (CB + 4C0R2) u2 u1 H1(D), see (F.2), and hence, when u vj r R2, Hj(u1) Hj(u2) H1(D) u1 u2 A 1 vj (F1(u1) F1(u2)) H1(D) u1 u2 A 1 u2 (F1(u1) F1(u2)) H1(D) + (A 1 u2 A 1 vj )(F1(u1) F1(u2)) H1(D) CBC0( u1 vj H1(D) + u2 vj H1(D)) + CA(CB + 4C0R2) u2 vj u2 u1 H1(D) CHr u2 u1 H1(D), CH = 2CBC0 + CA(CB + 4C0R2). We now choose r = min( 1 2CH , R2). We consider Then, we have r 2CBε0/(1 CHr). Then, we have that Lip BH1(0,2R2) H1(D)(Hj) a = CHr < 1 r A 1 vj H1(D) H1(D) g gj H1(D)/(1 a), and for all u BH1(0, R2) such that u vj r, we have A 1 vj (g gj) H1(D) (1 a)r. Then, Hj(u) vj H1(D) Hj(u) Hj(vj) H1(D) + Hj(vj) vj H1(D) a u vj H1(D) + vj + A 1 vj (g gj) vj H1(D) ar + A 1 vj (g gj) H1(D) r, that is, Hj maps BH1(D)(vj, r) to itself. By Banach fixed point theorem, Hj : BH1(D)(vj, r) BH1(D)(vj, r) has a fixed point. Let us denote = Hg j (u) g = u A 1 vj (F1(u) F1(vj)) + A 1 vj (g gj) g By the above, when we choose ε0 to have a value the map F1 has a right inverse map Rj in BH1(gj, 2ε0), that is, F1(Rj(g)) = g, for g BH1(gj, 2ε0), (F.18) it holds that Rj : BH1(gj, 2ε0) BH1(D)(vj, r), and by Banach fixed point theorem it is given by the limit Rj(g) = lim m wj,m, g BH1(gj, 2ε0), (F.19) in H1(D), where wj,0 = vj, (F.20) wj,m+1 = Hg j (wj,m). (F.21) We can write for g BH1(gj, 2ε0), Rj(g) g = lim m H m j where the limit takes space in H1(D)2 and H m j = Hj Hj Hj, (F.22) is the composition of m operators Hj. This implies that Rj can be written as a limit of finite iterations of neural operators Hj (we will consider how the operator A 1 vj can be written as a neural operator below). As Y σa(BC1,α(D)(0, R)), there are finite number of points yℓ D, ℓ= 1, 2, . . . , ℓ0 and ε1 > 0 such that the sets Z(i1, i2, . . . , iℓ0) = {g Y : (iℓ 1 2)ε1 g(yℓ) < (iℓ+ 1 2)ε1, for all ℓ}, where i1, i2, . . . , iℓ0 Z, satisfy the condition (F.23) If (Z(i1, i2, . . . , iℓ0) Y) BH1(D)(gj, ε0) = then Z(i1, i2, . . . , iℓ0) Y BH1(D)(gj, 2ε0). To show (F.23), we will below use the mean value theorem for function g = σa v Y, where v C1,α(D). First, let us consider the case when the parameter a of the leaky Re LU function σa is strictly positive. Without loss of generality, we can assume that D = [0, 1] and yℓ= hℓ, where h = 1/ℓ0 and ℓ= 0, 1, . . . , ℓ0. We consider g Y Z(i1, i2, . . . , iℓ0) σa(BC1,α(D)(0, R)) of the form g = σa v. As a is non-zero, the inequality (iℓ 1 2)ε g(yℓ) < (iℓ+ 1 2)ε is equivalent to σ1/a((iℓ 1 2)ε) v(yℓ) < σ1/a((iℓ+ 1 2)ε), and thus σ1/a(iℓε) Aε v(yℓ) < σ1/a(iℓε) + Aε, (F.24) where A = 1 2 max(1, a, 1/a), that is, for g = σa(v) Z(i1, i2, . . . , iℓ0) the values v(yℓ) are known within small errors. By applying mean value theorem on the interval [(ℓ 1)h, ℓh] for function v we see that there is x [(ℓ 1)h, ℓh] such that dv dx(x ) = v(ℓh) v((ℓ 1)h) and thus by (F.24), dx(x ) dℓ, i| 2Aε1 h(σ1/a(iℓε1) σ1/a((iℓ 1)ε1)), (F.26) Observe that these estimates are useful when ε1 is much smaller that h. 
As g = σa v Y σa(BC1,α(D)(0, R)), we have v BC1,α(D)(0, R), so that dv dx BC0,α(D)(0, R) satisfies (F.25) implies that dx(x) dℓ, i| 2Aε1 h + Rhα, for all x [(ℓ 1)h, ℓh]. (F.27) Moreover, (F.24) and v BC1,α(D)(0, R) imply |v(x) σ1/a(iℓε1)| < Aε1 + Rh, (F.28) for all x [(ℓ 1)h, ℓh]. Let ε2 = ε0/A. When we first choose ℓ0 to be large enough (so that h = 1/ℓ0 is small) and then ε1 to be small enough, we may assume that h + Rhα, Aε1 + Rh) < 1 8ε2. (F.29) Then for any two functions g, g Y Z(i1, i2, . . . , iℓ0) σa(BC1,α(D)(0, R)) of the form g = σa v, g = σa v the inequalities (F.27) and (F.28) imply dx (x)| < 1 4ε2, (F.30) |v(x) v (x)| < 1 for all x D. As v, v BC1,α(D)(0, R), this implies v v C1(D) < 1 As the embedding C1(D) H1(D) is continuous and has norm less than 2 on the interval D = [0, 1], we see that v v H1(D) < ε2, and thus g g H1(D) < Aε2 = ε0. (F.31) To prove (F.23), we assume that (Z(i1, i2, . . . , iℓ0) Y) BH1(D)(gj, ε0) = , and g Z(i1, i2, . . . , iℓ0) Y. By the assumption, there exists g (Z(i1, i2, . . . , iℓ0) Y) BH1(D)(gj, ε0). Using (F.31), we have g gj H1(D) g g H1(D) + g gj H1(D) 2ϵ0. Thus, g BH1(D)(gj, 2ε0), which implies that the property (F.23) follows. We next consider the case when the parameter a of the leaky relu function σa is zero. Again, we assume that D = [0, 1] and yℓ= hℓ, where h = 1/ℓ0 and ℓ= 0, 1, . . . , ℓ0. We consider g Y Z(i1, i2, . . . , iℓ0) σa(BC1,α(D)(0, R)) of the form g = σa(v) and an interval [ℓ1h, (ℓ1+1)h] D, where 1 ℓ1 ℓ0 2. We will consider four cases. First, if g does not obtain the value zero on the interval [ℓ1h, (ℓ1 + 1)h] the mean value theorem implies that there is x [ℓ1h, (ℓ1 + 1)h] such that dg dx(x ) = dv dx(x ) is equal to d = (g(ℓ1h) g([(ℓ1 1)h))/h. Second, if g does not obtain the value zero on either of the intervals [(ℓ1 1)h, ℓ1h] or [(ℓ1 + 1)h, (ℓ1 + 2)h], we can use the mean value theorem to estimate the derivatives of g and v at some point of these intervals similarly to the first case. Third, if g does not vanish identically on the interval [ℓ1h, (ℓ1+1)h] but it obtains the value zero on the both intervals [(ℓ1 1)h, ℓ1h] and [(ℓ1 + 1)h, (ℓ1 + 2)h], the function v has two zeros on the interval [(ℓ1 1)h, (ℓ1 + 2)h] and the mean value theorem implies that there is x [(ℓ1 1)h, (ℓ1 + 2)h] such that dv dx(x ) = 0. Fourth, if none of the above cases are valid, g vanishes identically on the interval [ℓ1h, (ℓ1 + 1)h]. In all these cases the fact that v C1,α(D) R implies that the derivative of g can be estimated on the whole interval [ℓ1h, (ℓ1 + 1)h] within a small error. Using these observations we see for any ε2, ε3 > 0 that if yℓ D = [d1, d2] R, ℓ= 1, 2, . . . , ℓ0 are a sufficiently dense grid in D and ε1 to be small enough, then the derivatives of any two functions g, g Y Z(i1, i2, . . . , iℓ0) σa(BC1,α(D)(0, R)) of the form g = σa(v), g = σa(v ) satisfly g g H1([d1+ε3,d2 ε3]) < ε2. As the embedding C1([d1 +ε3, d2 ε3]) H1([d1 +ε3, d2 ε3]) is continuous, σa(v) H1([d1,d1+ε3]) ca v C1,α(D) ε3, σa(v) H1([d2 ε3,d2]) ca v C1,α(D) ε3, and ε2 and ε3 can be chosen to be arbitrarily small, we see that the property (F.23) follows. Thus the property (F.23) is shown in all cases. By our assumptions Y σa(BC1,α(D)(0, R)) and hence g Y implies that g C(D) AR. This implies that Y Z(i1, i2, . . . , iℓ0) is empty if there is ℓsuch that |iℓ| > 2AR/ε1 + 1. Thus, there is a finite set I Zℓ0 such that i I Z( i), (F.32) Z( i) Y = , for all i I, (F.33) where we use notation i = (i1, i2, . . . , iℓ0) Zℓ0. 
On the other hand, we have chosen gj Y such that BH1(D)(gj, ε0), j = 1, . . . , J cover Y. This implies that for all i I there is j = j( i) {1, 2, . . . , j} such that there exists g Z( i) BH1(D)(gj, ε0). By (F.23), this implies that Z( i) BH1(D)(gj( i), 2ε0). (F.34) Thus, we see that Z( i), i I is a disjoint covering of Y, and by (F.34), in each set Z( i) Y, i I the map g Rj(g) we have constructed a right inverse of the map F1. Below, we denote s( i, ℓ) = iℓε1. Next we construct a partition of unity in Y using maps Fz,s,h(v, w)(x) = Z D kz,s,h(x, y, v(x), w(y))dy, kz,s,h(x, y, v(x), w(y)) = v(x)1[s 1 2 h)(w(y))δ(y z). Fz,s,h(v, w)(x) = v(x), if 1 2h w(z) s < 1 2h, 0, otherwise. Next, for all i I we define the operator Φ i : H1(D) Y H1(D), Φ i(v, w) = π1 ϕ i,1 ϕ i,2 ϕ i,ℓ0(v, w), where ϕ i,ℓ: H1(D) Y H1(D) Y are the maps ϕ i,ℓ(v, w) = (Fyℓ,s( i,ℓ),ε1(v, w), w), and π1(v, w) = v maps a pair (v, w) to the first function v. It satisfies Φ i(v, w) = v, if 1 2ε1 w(yℓ) s( i, ℓ) < 1 2ε1 for all ℓ, 0, otherwise. Observe that here s( i, ℓ) = iℓε1 is close to the value gj( i)(yℓ). Now we can write for g Y F 1 1 (g) = X i I Φ i(Rj( i)(g), g), with suitably chosen j( i) {1, 2, . . . , J}. Let us finally consider A 1 u0 where u0 C(D). Let us denote u(x, y, u0(y))w(y)dy, and Ju0 = Ku0 + e Ku0 be the integral operator with kernel ju0(x, y) = k(x, y, u0(y)) + u0(y) k u(x, y, u0(y)). (I + Ju0) 1 = I Ju0 + Ju0(I + Ju0) 1Ju0, so that when we write the linear bounded operator A 1 u0 = (I + Ju0) 1 : H1(D) H1(D), as an integral operator (I + Ju0) 1v(x) = v + Z D mu0(x, y)v(y)dy, (I + Ju0) 1v(x) = v(x) Ju0v(x) ju0(x, y )ju0(y, y ) + Z D ju0(x, y )mu0(y , x )ju0(x , y)dx dy v(y)dy D eju0(x, y)v(y)dy, eju0(x, y) = ju0(x, y) + Z D (ju0(x, y )ju0(y, y )dy + Z D ju0(x, y )mu0(y , x )ju0(x , y)dx dy . This implies that the operator A 1 u0 = (I + Ju0) 1 is a neural operator, too. Observe that eju0(x, y), xeju0(x, y) C(D D). This proves Theorem 3.