Non-Euclidean Universal Approximation

Anastasis Kratsios
Department of Mathematics, Eidgenössische Technische Hochschule Zürich, HG G 32.3, Rämistrasse 101, 8092 Zürich, Switzerland. Email: anastasis.kratsios@math.ethz.ch

Eugene Bilokopytov
Department of Mathematics and Statistical Sciences, University of Alberta, 11324 89 Ave NW, Edmonton, AB T6G 2J5, Canada. Email: bilokopy@ualberta.ca

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Modifications to a neural network's input and output layers are often required to accommodate the specificities of most practical learning tasks. However, the impact of such changes on an architecture's approximation capabilities is largely not understood. We present general conditions describing feature and readout maps that preserve an architecture's ability to approximate any continuous function uniformly on compacts. As an application, we show that if an architecture is capable of universal approximation, then modifying its final layer to produce binary values creates a new architecture capable of deterministically approximating any classifier. In particular, we obtain guarantees for deep CNNs and deep feed-forward networks. Our results also have consequences within the scope of geometric deep learning. Specifically, when the input and output spaces are Cartan-Hadamard manifolds, we obtain geometrically meaningful feature and readout maps satisfying our criteria. Consequently, commonly used non-Euclidean regression models between spaces of symmetric positive-definite matrices are extended to universal DNNs. The same result allows us to show that the hyperbolic feed-forward networks used for hierarchical learning are universal. Our result is also used to show that the common practice of randomizing all but the last two layers of a DNN produces a universal family of functions with probability one. We also provide conditions on a DNN's first (resp. last) few layers' connections and activation function which guarantee that these layers can have a width equal to the input (resp. output) space's dimension without negatively affecting the architecture's approximation capabilities.

1 Introduction

Modifications made to a neural network's input and output maps, in order to extract features from a data set or to better suit a learning task, are prevalent throughout learning theory. Typically, such changes are made by pre- (resp. post-)composing an architecture with a fixed and untrainable feature (resp. readout) map. Examples include classification by neural networks, random feature maps obtained by randomizing all but the last few layers of a feed-forward network, and numerous illustrations throughout geometric deep learning theory, which we detail below. This motivates the central question of this paper:

"Which modifications to the input and output layers of a neural network architecture preserve its universal approximation capabilities?"

Specifically, in this paper we obtain a simple sufficient condition on a pair consisting of a feature map φ : X → ℝ^m and a readout map ρ : ℝ^n → Y, where X and Y are topological spaces, guaranteeing that if F is dense in C(ℝ^m, ℝ^n) for the uniform convergence on compacts (ucc) topology, then

$$\{g \in C(X, Y) : g = \rho \circ f \circ \varphi,\ f \in F\} \tag{1}$$

is dense in C(X, Y) in the uniform convergence on compacts topology when Y is metric and, more generally, in the compact-open topology when Y is non-metrizable (such as in the hard classification problem); a schematic of this construction is sketched below.
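The following minimal Python sketch records the pattern behind (1): a trainable architecture is wrapped between two fixed, untrainable maps. It is not the authors' code; `feature_map`, `readout_map`, and `modified_model` are illustrative placeholder names.

```python
import numpy as np

def feature_map(x):
    # phi : X -> R^m, fixed and untrainable; must be continuous and injective.
    # Placeholder: embed an angle into R^2 (the unit circle).
    return np.stack([np.cos(x), np.sin(x)], axis=-1)

def readout_map(z):
    # rho : R^n -> Y, fixed and untrainable. Placeholder: component-wise logistic,
    # mapping R^n homeomorphically onto (0,1)^n.
    return 1.0 / (1.0 + np.exp(-z))

def modified_model(f, x):
    # An element rho o f o phi of the modified architecture in (1),
    # where f is drawn from a dense family F (e.g. a trained feed-forward network).
    return readout_map(f(feature_map(x)))
```

Only f carries trainable parameters; φ and ρ are held fixed throughout training.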
Simplified conditions are obtained when Y is a metrizable manifold, and a characterization of ρ and φ is obtained when both X and Y are smooth manifolds.

The set F represents any expressive neural network architecture. For example, by [35], F can be taken to be the set of feed-forward networks with one hidden layer and a continuous, locally-bounded, and non-polynomial activation function. Or, by [56], F can be taken to be the set of deep convolutional networks with specific sparsity structures and ReLU activation function. Throughout, F is often referred to as an architecture. The results are not limited to neural networks and remain valid when, for example, F is taken to be the set of posterior means generated by a Gaussian process with a universal kernel, as in [41].

The central results are motivated by the following consequences.

Implication: Method for Constructing Non-Euclidean Universal Approximators

A natural hub for our results is geometric deep learning, an emerging field of machine learning which acknowledges and makes use of the latent non-Euclidean structures present in many types of data. Applications of geometric deep learning are prevalent throughout neuroimaging [16], computer vision [48], covariance learning [40], and learning from hierarchical structures such as complex social networks [34], undirected graphs [43], and trees [50]. For instance, in [44] it is shown that low-dimensional representations of complex hierarchical structures in hyperbolic space outperform the current state-of-the-art high-dimensional Euclidean embedding methods, due to the tree-like geometry of the former. Using the theory of gyro-vector spaces, introduced in [55], [17] proposed a hyperbolic-space variant of the feed-forward architecture and demonstrated its superior performance in learning hierarchical structure from these hyperbolic representations. A direct application of our main result confirms that this non-Euclidean architecture can indeed approximate any continuous function between hyperbolic spaces. More generally, we obtain an explicit construction of feed-forward networks between any Cartan-Hadamard manifolds and a guarantee that our construction is universal. Cartan-Hadamard manifolds appear throughout applied mathematics: from the symmetric positive-definite matrix-valued regression problems of [16, 40], which we extend to universal approximators, to applications in mathematical finance [23], Gaussian processes [39], information geometry [4], and the geometry of the Wasserstein space [36] commonly used in generative adversarial networks, as in [3].

Implication: Universal Approximation Implies Universal Classification

Perhaps the most commonly used readout maps are those used when modifying neural networks to perform classification tasks. The currently available theoretical results, found in [15], guarantee that for a random vector in ℝ^m with random labels, the set of feed-forward networks with one hidden layer, step activation function σ = I_{[0,∞)} − I_{(−∞,0]}, and readout map ρ(x)_i = I_{[1/2,∞)}(x_i) can approximate the Bayes classifier in probability. As an application of this paper's main results, we obtain deterministic guarantees of generic hard (n-ary) and soft (fuzzy) classification on ℝ^m for any given universal approximator in C(ℝ^m, ℝ^n), once its outputs are modified by a continuous surjection ρ to take values in {0,1}^n or (0,1)^n, respectively.
For example, our result applies to feed-forward networks with at least one hidden layer when ρ is the component-wise logistic readout map ρ(x)_i = e^{x_i}/(1+e^{x_i}), composed with the threshold I_{[1/2,1]} in the hard-classification case.

Implication: DNNs with Randomly Generated First Layers are Universal

We show that the commonly employed practice of training only the final layers of a deep feed-forward network and randomizing the rest preserves its universal approximation capabilities with probability 1. Though widely used, this practice has only recently begun to be studied in [19, 37]. The link with our results arises from an observation made in [37], stating that the first portion of such a random architecture can be seen as a randomly generated feature map.

Implication: DNNs Can be Narrowed

In [29, 45] the authors provide lower bounds on a DNN layer's width, below which it is no longer a universal approximator. However, there is a wealth of literature showing that arranging a network's neurons to create depth rather than width yields empirically superior performance. As a final application of our theory, we provide explicit conditions on a DNN's connections and activation functions under which additional initial and final layers, which do not respect the minimum width requirements of [29, 45], may be added to a DNN without negatively impacting its approximation capabilities. Numerical implementations show that additional depth built using our main results improves predictive performance, whereas additional deep layers failing our assumptions reduced the network's predictive performance.

This paper is organized as follows. Section 2 discusses the necessary topological and geometric background needed to formulate the paper's central results. Section 3 contains the paper's main results, discussed above. The conclusion follows in Section 4. The proofs of the main results are contained within this paper's supplementary material.

2 Background

2.1 General Topology

Before moving on to the main results of the paper, we will require some additional topological terminology. The interior of a subset A ⊆ X of a topological space is the largest open subset contained in A. For example, in the Euclidean space ℝ, the interior of [0,1) is (0,1). The closure of A is the smallest closed set containing A. Therefore, the closure of [0,1) in ℝ is [0,1]. The difference between the closure of A and its interior is called the boundary of A and is denoted by ∂A. For example, the boundary of [0,1) in the Euclidean space ℝ is {0,1}.

A subset A ⊆ ℝ^m is called a retract of ℝ^m if there is a continuous map r : ℝ^m → A such that r ∘ ι_A = 1_A, where ι_A : A → ℝ^m is the inclusion map and 1_A is the identity map on A. Dually, a continuous right-inverse of a continuous surjective map is called a section. A covering projection ρ : ℝ^n → Y is a surjective continuous function such that for every x ∈ Y there is an open set U_x containing x with ρ^{-1}[U_x] = ⋃_{α∈A} U_x^α, where {U_x^α}_{α∈A} is a disjoint collection of open sets for which the restriction ρ|_{U_x^α} : U_x^α → U_x is a homeomorphism. For example, the map x ↦ e^{iπx} is a covering projection of ℝ onto the circle.

Lastly, since continuous maps transfer topological information, a continuous bijection with continuous inverse preserves all topological information. Such a map is called a homeomorphism, and two topological spaces related by a homeomorphism are said to be homeomorphic. For example, the sigmoid (or logistic) function x ↦ e^x/(1+e^x) continuously puts ℝ in bijection with (0,1), and its inverse is the logit function y ↦ ln(y/(1−y)), which is continuous.
Therefore, ℝ and (0,1) are homeomorphic. A continuous injective map φ : X → ℝ^m such that X is homeomorphic to φ(X) is called an embedding.

A topological space X is said to be locally compact if every x ∈ X is contained in an open subset U_x of X which is in turn contained in a compact subset K_U of X. For example, every point x ∈ ℝ is contained in (x−1, x+1), which is in turn contained in the closed and bounded (and therefore, by the Heine-Borel theorem, compact) set [x−1, x+1]. A topological space is said to be simply connected if every two paths between the same pair of points can be continuously deformed into one another.

Similarly to [8, 11], we say that a subset A of a topological space X is collared if there exists an open subset U ⊆ X containing A and a homeomorphism ψ : U → A × [0,1) mapping each a ∈ A to (a, 0). In this way the open set U is, in a sense, topologically similar to A itself, since one can imagine shrinking any point (a, t) ∈ A × [0,1) down to (a, 0) and then identifying it back with a via ψ. We denote the set of positive integers by ℕ₊.

2.2 Topology of Function Spaces

Let C(ℝ^m, ℝ^n) denote the set of all continuous functions from the Euclidean space ℝ^m to the Euclidean space ℝ^n. Closeness in C(ℝ^m, ℝ^n) can be described in a number of ways, but in the context of the universal approximation theorems of [13, 27, 35], two functions f and g are thought of as being close if they are uniformly close on compacts; that is, for a given ε > 0, the following holds:

$$\sum_{k=1}^{\infty} \frac{\sup_{\|x\|\leq k}\sqrt{\sum_{i=1}^{n}\left(f(x)_i - g(x)_i\right)^2}}{2^{k}\left(1+\sup_{\|x\|\leq k}\sqrt{\sum_{i=1}^{n}\left(f(x)_i - g(x)_i\right)^2}\right)} < \varepsilon. \tag{2}$$

The topology described by (2) is called the topology of uniform convergence on compacts, henceforth the ucc topology. If ℝ^n is replaced by any other topological space Y whose notion of closeness is defined by a distance function, and ℝ^m is replaced by nearly any other topological space X, then closeness in the collection of continuous functions from X to Y, denoted by C(X, Y), can be described analogously to (2) by replacing the Euclidean distance on ℝ^n by a distance function d on Y, replacing the compact subsets K ⊆ ℝ^m with compact subsets of X, and taking f, g ∈ C(X, Y). The topology on C(X, Y) defined in this way is still called the ucc topology.

If a distance function cannot describe the topology on Y (for example, we will see that this is the case for reasonable topologies on C(X, {0,1}^n)), then one cannot define the ucc topology. Instead, consider the smallest topology on C(X, Y) containing the sets

$$\left\{V_{K,O} : \emptyset \neq K \subseteq X \text{ compact and } \emptyset \neq O \subseteq Y \text{ open}\right\}, \qquad V_{K,O} \triangleq \{f \in C(X, Y) : f(K) \subseteq O\}. \tag{3}$$

When the topology on Y is defined by a distance function and X is a locally-compact Hausdorff space, the smallest topology containing (3) coincides with the ucc topology. However, unlike the ucc topology, the smallest topology containing (3) is well-defined on C(X, Y) for any topological spaces X and Y. This generalized ucc topology is called the compact-open topology (co-topology).

2.3 Manifolds

A (topological) manifold is a topological space which, viewed "up close", resembles Euclidean space, whereas a manifold with boundary locally looks like a part of Euclidean space but possibly with a hard edge.

Definition 2.1 (Metrizable Manifold with Boundary; [8]). A topological space Y is said to be a metrizable manifold with boundary if:

(i) For every y ∈ Y, there is an open U_y ⊆ Y containing y which is homeomorphic to

$$\left\{(z_1,\dots,z_n) \in \mathbb{R}^n : \sum_{i=1}^{n} z_i^2 < 1 \text{ and } z_n \geq 0\right\}. \tag{4}$$

(ii) There exists a distance function (metric) d : Y × Y → [0,∞) such that the topology on Y coincides with the smallest topology on Y containing the open balls {B_ε(y)}_{ε>0, y∈Y}, where B_ε(y) ≜ {z ∈ Y : d(z, y) < ε}.
We say that d is a metric for Y. The subset of Y consisting of all points y contained in some open set U_y which is homeomorphic to the interior of (4) is denoted by Int(Y).

A smooth manifold without boundary is a manifold for which there is a well-defined differential calculus admitting arbitrarily many derivatives and which can locally be deformed into Euclidean space via infinitely differentiable maps with infinitely differentiable inverses. An m-dimensional Riemannian manifold M is a manifold without boundary which can be locally smoothly deformed into Euclidean space in such a way that curvature and length can be meaningfully compared, locally, between ℝ^m and M. Amongst other things, this allows the definition of minimal-length curves connecting points on M, called geodesics. If any two points on M can be connected by such a minimal-length curve, then M is said to be complete. Moreover, when M is complete and connected, the function mapping any two points p, q ∈ M to the length of a geodesic connecting them defines a metric d_M. Thus, M has a geometrically meaningful metric structure in which distance represents the length of maximally efficient trajectories, and the existence of d_M implies that C(X, M) can be equipped with the ucc topology.

Further, when M is complete and connected, the Hopf-Rinow Theorem of [26] affirms that for any given p ∈ M, the map sending any v ∈ ℝ^m lying tangent to p to the point on M arrived at, at time t = 1, by traveling along the geodesic with initial velocity v defines a surjection from ℝ^m onto M. This map is called the Riemannian exponential map on M at p and is denoted by Exp_p^M. In [28], it is shown that, in this case, Exp_p^M has a smooth inverse outside a low-dimensional subset C_p. This inverse is denoted by Log_p^M and is known to locally preserve length between ℝ^m and M along geodesics emanating from p. This means that Log_p^M and Exp_p^M are geometrically meaningful feature and readout maps, respectively. However, the set C_p can be pathological or difficult to deal with. This issue is overcome by turning to the sub-class of Cartan-Hadamard manifolds. A Riemannian manifold M is Cartan-Hadamard if it is simply connected and has non-positive curvature. Non-positive curvature means that all triangles drawn on M by geodesics have internal angles adding up to at most 180°.

3 Main Results

Let φ : X → ℝ^m and ρ : ℝ^n → Y. Subsets of ℝ^n (resp. ℝ^m) will be equipped with the (relative) Euclidean topology unless explicitly stated otherwise. Equip C(X, Y) with the co-topology and C(ℝ^m, ℝ^n) with the ucc topology, let F be a dense subset of C(ℝ^m, ℝ^n), such as the architectures studied in [35, 38, 56] or the posterior means of a Gaussian process with a universal kernel as in [41], and define the subset F_{ρ,φ} ⊆ C(X, Y) by

$$F_{\rho,\varphi} \triangleq \{g \in C(X, Y) : g = \rho \circ f \circ \varphi \text{ where } f \in F\}. \tag{5}$$

The set F_{ρ,φ} is dense in C(X, Y) under the following assumptions on φ and ρ.

Assumption 3.1 (Feature Map Regularity). The map φ is continuous and injective.

Assumption 3.2 (Readout Map Regularity). Suppose that the readout map ρ satisfies the following:

(i) Either of the following holds:
  (a) ρ is continuous and has a section on Im(ρ),
  (b) ρ is a covering projection of ℝ^n onto Im(ρ) and X is connected and simply connected,
(ii) Im(ρ) is dense in Y,
(iii) the boundary ∂Im(ρ) is collared.

Theorem 3.3 (General Version). Suppose that F is dense in C(ℝ^m, ℝ^n). If Assumptions 3.1 and 3.2 hold, then F_{ρ,φ} is dense in C(X, Y).
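To illustrate how Theorem 3.3 is typically used, the following Python sketch approximates a placeholder target g ∈ C(X, Y), with X = Y = (0,1), by fitting an element f of a dense family F to s ∘ g through the feature map, where s is a section of ρ as in Assumption 3.2 (i)(a). The target, the random-feature fit, and all variable names are illustrative assumptions, not the paper's construction or proof.

```python
import numpy as np

# Illustrative target g in C(X, Y) with X = (0, 1) and Y = (0, 1).
g = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)

phi = lambda x: np.log(x / (1 - x))        # feature map (logit): (0,1) -> R, continuous, injective
rho = lambda z: 1 / (1 + np.exp(-z))       # readout (logistic): R -> (0,1)
section = lambda y: np.log(y / (1 - y))    # a section of rho: rho(section(y)) = y

# Fit some f in F (here, a random-feature one-hidden-layer network) so that
# f(phi(x)) is close to section(g(x)) on a compact subset of X.
x = np.linspace(0.01, 0.99, 500)
target = section(g(x))
rng = np.random.default_rng(0)
W, b = rng.normal(size=64), rng.normal(size=64)
H = np.tanh(np.outer(phi(x), W) + b)       # random hidden features evaluated at phi(x)
coef, *_ = np.linalg.lstsq(H, target, rcond=None)
f = lambda z: np.tanh(np.outer(z, W) + b) @ coef

approx = rho(f(phi(x)))                    # an element of F_{rho, phi}
print("sup error on the compact set:", np.abs(approx - g(x)).max())
```

Composing back through ρ then gives an element of F_{ρ,φ} that is uniformly close to g on the chosen compact set, which is exactly the conclusion of Theorem 3.3.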
Just as in the filtering literature of [10], one would hope that the outputs of any learning model depend continuously on its inputs. Therefore, we only consider feature maps φ which are continuous functions. In this case, Assumption 3.1 is sharp. We denote the identity map x ↦ x on ℝ^n by 1_{ℝ^n}.

Theorem 3.4 (Assumption 3.1 is Sharp). Let X be a metrizable manifold with boundary, let φ be continuous, and let F be a dense subset of C(ℝ^m, ℝ^n). Then F_{1_{ℝ^n},φ} is dense in C(X, ℝ^n) if and only if φ is injective.

Remark 3.5 (Sharpening Assumption 3.2). Assumption 3.2 is almost sharp, and a characterization can be obtained using the Z-sets studied in [54, 20]. However, it is unlikely that a non-pathological example can be generated which falls outside the scope of Assumption 3.2.

Theorem 3.4 shows that it is easy to verify whether a feature map preserves the universal approximation property. However, it can be much more challenging to verify if and when the readout map ρ does so. The following results present readily applicable cases of Theorem 3.3. They highlight the convenient fact that if ρ is surjective, then only Assumptions 3.1 and 3.2 (i) need to be verified.

Corollary 3.6. If φ is a continuous injective map, ρ is a surjective covering projection, and F is dense in C(ℝ^m, ℝ^n), then F_{ρ,φ} is dense in C(X, Y).

In particular, φ and ρ may be homeomorphisms. When both φ and ρ fully preserve topological structure, Corollary 3.6 sharpens.

Proposition 3.7 (Homeomorphic Case is Sharp). Let φ and ρ be homeomorphisms. Then F is dense in C(ℝ^m, ℝ^n) if and only if F_{ρ,φ} is dense in C(X, Y).

Corollary 3.8. If φ is a continuous injective map, ρ is a continuous surjection with a section, X is connected and simply connected, and F is dense in C(ℝ^m, ℝ^n), then F_{ρ,φ} is dense in C(X, Y).

When additional structure is assumed of Y, as is common in most applications, Assumptions 3.2 (ii) and (iii) can be omitted and the other assumptions can be simplified. Specifically, we consider the case where Y is a manifold with boundary. In the case where X and Int(Y) are smooth, Theorem 3.3 can be further streamlined as follows.

Assumption 3.9 (Readout Map Regularity: Geometric Version). Suppose that ρ satisfies:

(i) ρ satisfies Assumption 3.2 (i) and Im(ρ) ⊆ Int(Y),
(ii) Int(Y) ∖ Im(ρ) is a (possibly empty) smooth submanifold of Int(Y) of dimension strictly less than dim(Int(Y)) = n.

Theorem 3.10 (Geometric Version). Let Y be a metrizable manifold with boundary for which Int(Y) is a smooth manifold, let X be locally compact, and let F be dense in C(ℝ^m, ℝ^n). If φ satisfies Assumption 3.1 and ρ satisfies Assumption 3.9, then F_{ρ,φ} is dense in C(X, Y).

Consequences of these results in various areas of machine learning are now considered.

3.1 Dense Families in C(ℝ^m, ℝ^n) Induce Universal Classifiers

Let X be a set, let φ : X → ℝ^m be a bijection, and let {L_i}_{i=1}^n be a collection of labels describing elements of X. Let X_i ≜ {x ∈ X : x has label L_i}. If, for example, the {X_i}_{i=1}^n are disjoint and cover X, then we obtain the n-ary classification problem; in general, any x ∈ X may simultaneously have multiple distinct labels. Without loss of generality, we may assume that X is a topological space which is homeomorphic to ℝ^m, since we may equip it with the topology {φ^{-1}[U] : U open in ℝ^m}. Assume that the sets {X_i}_{i=1}^n are open subsets of X. In the stochastic case, the Bayes classifier is the gold standard for classification.
In the deterministic case, the standard is clearly the ideal classifier ĥ : X → {0,1}^n, introduced here and defined by

$$\hat{h}(x)_i \triangleq I_{X_i}(x), \tag{6}$$

where I_{X_i} is the indicator function of X_i, taking the value 1 if x ∈ X_i and 0 otherwise. Since the usual Euclidean topology on {0,1}^n coincides with the discrete topology on {0,1}^n, and since continuous functions from a connected space to a discrete topological space are constant (see [49]), ĥ only belongs to C(X, {0,1}^n) if it is trivial, i.e.: either X_i = X or X_i = ∅ for each i. Moreover, a direct computation shows that there are exactly 2^n functions in C(X, {0,1}^n). Thus, other topologies must be considered on {0,1}^n in order to have a meaningful deterministic classification theory.

When n = 1, there are two other choices of topologies on {0,1}, up to homeomorphism. These are the trivial topology {∅, {0,1}} and the Sierpiński topology {∅, {1}, {0,1}}. The trivial topology is uninteresting, since a direct computation shows that with it every function in C(X, {0,1}) becomes indistinguishable, i.e.: the co-topology on C(X, {0,1}) becomes trivial and therefore density in C(X, {0,1}) holds trivially for any non-empty subset. In the case of the Sierpiński topology, it is shown in [53, Chapter 7] that the indicator function of any open subset of any sufficiently regular topological space, such as X, is a continuous function to {0,1} with the Sierpiński topology. This latter property has led to the widespread use of this space in semantics.

The next result shows that ĥ can be approximated on two fronts simultaneously. First, it shows that ĥ has a natural decomposition as I_{(1/2,1]} applied component-wise to a continuous soft (fuzzy) classifier ŝ, i.e. ŝ ∈ C(X, [0,1]^n), satisfying

$$\hat{s}_i^{-1}\big[(1/2,1]\big] = X_i, \qquad (i = 1,\dots,n). \tag{7}$$

Subsequently, the architecture F_{ρ,φ} is shown to simultaneously approximate ŝ uniformly on compacts in C(X, [0,1]^n) and ĥ in the compact-open topology on C(X, {0,1}^n). Intuitively, (7) represents the philosophy of logistic regression, where one approximates on the interval and then thresholds the soft classifier to obtain a strict decision rule, and thus a hard classifier.

Theorem 3.11 (Universal Classification: General Case). Let {0,1}^n be equipped with the n-fold product of the Sierpiński topology on {0,1}, let φ be continuous and injective, let ρ : ℝ^n → (0,1)^n be a homeomorphism, let α ∈ (0,1), and let F ⊆ C(ℝ^m, ℝ^n) be dense. Let {X_i}_{i=1}^n be a set of open subsets of a metric space X and let ĥ be its associated ideal classifier defined by (6). Then the following hold:

(i) (Hard-Soft Decomposition) There exist continuous functions ŝ_i ∈ C(X, [0,1]) such that

$$\hat{h} = I_{(\alpha,1]} \circ (\hat{s}_1,\dots,\hat{s}_n), \qquad \hat{s}_i^{-1}\big[(\alpha,1]\big] = X_i, \quad (i = 1,\dots,n).$$

(ii) (Universal Classification) There exists a sequence {f_k}_{k∈ℕ} in F such that:

(a) (Soft Classification) For each non-empty compact subset κ ⊆ X and every ε > 0, there is some K ∈ ℕ₊ such that

$$\sup_{x \in \kappa}\; \max_{i=1,\dots,n} \big|\rho \circ f_k \circ \varphi(x)_i - \hat{s}_i(x)\big| < \varepsilon, \qquad (\forall k \geq K).$$

(b) (Hard Classification) I_{(α,1]} ∘ ρ ∘ f_k ∘ φ converges to ĥ in C(X, {0,1}^n) for the co-topology.

Furthermore, F_{ρ,φ} is dense in C(X, [0,1]^n).

As an application, we now show that most feed-forward DNNs and deep CNNs used in practice for classification are indeed universal classifiers in the sense of Theorem 3.11. Let σ : ℝ → ℝ be a continuous activation function, and let NN^σ denote the set of feed-forward networks from ℝ^m to ℝ^n, i.e.: continuous functions with representation

$$f(x) = W \circ f^{(J)}(x), \qquad f^{(j)}(x) = \sigma \bullet \big(W^{(j)} f^{(j-1)}(x)\big), \qquad f^{(0)}(x) = x, \qquad j = 1,\dots,J, \tag{8}$$

where W and the W^{(j)} are affine maps and • denotes component-wise composition.
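The following is a minimal sketch of the representation (8), together with the component-wise logistic readout and the threshold I_{(α,1]} from Theorem 3.11. The weights below are random placeholders, and σ is taken, for illustration only, to be tanh; this is not the authors' implementation.

```python
import numpy as np

def feedforward(x, hidden_weights, hidden_biases, W_out, b_out, sigma=np.tanh):
    # Representation (8): f = W o f^(J) with f^(j) = sigma • (W^(j) f^(j-1)),
    # sigma applied component-wise, and a final affine layer W left untransformed.
    h = x
    for A, b in zip(hidden_weights, hidden_biases):
        h = sigma(A @ h + b)
    return W_out @ h + b_out

def classify(x, net, alpha=0.5):
    # rho o f o phi with phi = identity and rho the component-wise logistic map,
    # followed by the threshold I_(alpha,1] as in Theorem 3.11 (ii)(b).
    logits = feedforward(x, *net)
    soft = 1.0 / (1.0 + np.exp(-logits))   # soft (fuzzy) classifier, values in (0,1)^n
    hard = (soft > alpha).astype(int)      # hard classifier, values in {0,1}^n
    return soft, hard

# Example with random (untrained) placeholder weights: m = 4 inputs, n = 3 labels.
rng = np.random.default_rng(1)
net = ([rng.normal(size=(8, 4)), rng.normal(size=(8, 8))],   # hidden weights
       [rng.normal(size=8), rng.normal(size=8)],             # hidden biases
       rng.normal(size=(3, 8)), rng.normal(size=3))          # final affine layer
soft, hard = classify(rng.normal(size=4), net)
```

In practice only the inner network is trained; the logistic readout and the threshold are fixed, mirroring the hard-soft decomposition of Theorem 3.11.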
The following results follow directly from Theorem 3.11 and the central result of [35], and validate the principal way neural networks are used for classification.

Corollary 3.12 (Universal Classification: Deep Feed-Forward Networks). Let {X_i}_{i=1}^n be open subsets of X, and let ĥ be their associated ideal classifier. Let φ : X → ℝ^m be a continuous injective feature map. Let σ be a continuous, locally-bounded, and non-constant activation function, and let ρ be the component-wise logistic function. Then there exists a sequence {f_k}_{k∈ℕ₊} of DNNs satisfying the conclusions of Theorem 3.11.

Define the set of deep CNNs with ReLU activation and sparsity 2 ≤ s ≤ n, denoted by Conv^s, to be the collection of all functions from ℝ^n to ℝ represented by

$$f(x) = W \circ f^{(J)}(x), \qquad f^{(j)}(x) = \sigma\big(w^{(j)} \ast f^{(j-1)}(x) - b^{(j)}\big), \qquad f^{(0)}(x) = x, \qquad j = 1,\dots,J,$$

where W is an affine map from ℝ^{n+Js} to ℝ, b^{(j)} ∈ ℝ^{n+js}, w^{(j)} = {w_k^{(j)}}_{k=-∞}^{∞} is a convolutional filter mask with w_k ∈ ℝ and w_k ≠ 0 only if 0 ≤ k ≤ s, the convolution of w^{(j)} with a vector v is the sequence defined by (w ∗ v)_i = ∑_j w_{i−j} v_j, and σ(x) = max{0, x}.

Corollary 3.13 (Universal Classification: Deep CNNs). Let 2 ≤ s ≤ n, let {X_i}_{i=1}^n be open subsets of X, and let ĥ be their associated ideal classifier. Let φ : X → ℝ^n be a continuous injective feature map and let ρ : ℝ → (0,1) be the logistic function. Then there is a sequence of deep CNNs {f_k}_{k∈ℕ₊} in Conv^s_{ρ,φ} satisfying the conclusion of Theorem 3.11.

3.2 Applications in Geometric Deep Learning

This subsection illustrates the applicability of the main results to geometric deep learning. Our examples focus on two illustrative points: first, that many commonly used non-Euclidean regression models can be extended to non-Euclidean architectures capable of universal approximation, and second, that our results can be used to validate the approximation capabilities of certain geometric deep learning architectures.

For Cartan-Hadamard manifolds, the Cartan-Hadamard Theorem [30, Corollary 6.9.1] guarantees that C_p = ∅ and, in particular, that Log_p^M is a globally-defined homeomorphism between M and ℝ^m. Thus, the following result follows from Corollary 3.8.

Corollary 3.14 (Cartan-Hadamard Version). Let F be dense in C(ℝ^m, ℝ^n), and let M and N be Cartan-Hadamard manifolds of dimension m and n, respectively. Then, for any p ∈ M and q ∈ N, F_{Log_p^M, Exp_q^N} is dense in C(M, N).

We consider here two consequences of this result.

3.2.1 Symmetric Positive-Definite Matrix Learning

Symmetric positive-definite matrices play a prominent role in many applied sciences, largely due to their relationship to covariance matrices, in areas ranging from computational anatomy [47] and computer vision [46] to finance [5]. The space P_d^+ of d × d symmetric positive-definite matrices is a non-Euclidean subspace of ℝ^{d²}. In [1], P_d^+ is shown to be a Cartan-Hadamard manifold whose Riemannian exponential and logarithm maps at A ∈ P_d^+ are, respectively, given by

$$\operatorname{Exp}_A(B) = A^{\frac{1}{2}} \exp\!\big(A^{-\frac{1}{2}} B A^{-\frac{1}{2}}\big) A^{\frac{1}{2}}, \qquad \operatorname{Log}_A(B) = A^{\frac{1}{2}} \log\!\big(A^{-\frac{1}{2}} B A^{-\frac{1}{2}}\big) A^{\frac{1}{2}},$$

where exp and log denote the matrix exponential and logarithm, respectively. Moreover, the distance function on this space is given by

$$d(A, B) = \big\|\log\!\big(A^{-\frac{1}{2}} B A^{-\frac{1}{2}}\big)\big\|_F,$$

where ‖·‖_F denotes the Frobenius norm and A^{1/2} is well-defined for any matrix in P_d^+. Using this distance, [40] developed non-Euclidean least-squares regression on P_d^+. The parameters involved in these models are typically optimized either using the non-Euclidean line-search algorithms of [40] or the non-Euclidean stochastic gradient approach on P_d^+ of [6]. The aforementioned regression models can be extended to form a ucc-dense architecture in C(P_d^+, P_D^+); a sketch of the corresponding feature and readout maps in code is given below.
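The feature and readout maps appearing in Corollary 3.15 below can be written with SciPy's matrix functions. The formulas in this sketch are the standard affine-invariant ones recalled above (cf. [1]); the function names are illustrative assumptions and this is not the authors' released implementation.

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm, inv

def spd_exp(A, V):
    # Riemannian exponential on P_d^+ at A (affine-invariant geometry, cf. [1]):
    # Exp_A(V) = A^{1/2} expm(A^{-1/2} V A^{-1/2}) A^{1/2}.
    A_half = sqrtm(A); A_ihalf = inv(A_half)
    return A_half @ expm(A_ihalf @ V @ A_ihalf) @ A_half

def spd_log(A, B):
    # Riemannian logarithm at A: Log_A(B) = A^{1/2} logm(A^{-1/2} B A^{-1/2}) A^{1/2}.
    A_half = sqrtm(A); A_ihalf = inv(A_half)
    return A_half @ logm(A_ihalf @ B @ A_ihalf) @ A_half

def spd_dist(A, B):
    # Geodesic distance: d(A, B) = || logm(A^{-1/2} B A^{-1/2}) ||_F.
    A_ihalf = inv(sqrtm(A))
    return np.linalg.norm(logm(A_ihalf @ B @ A_ihalf), 'fro')

# Quick check on two random SPD matrices: Exp_A(Log_A(B)) recovers B.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3)); A = M @ M.T + 3 * np.eye(3)
N = rng.normal(size=(3, 3)); B = N @ N.T + 3 * np.eye(3)
print(spd_dist(A, B), np.allclose(spd_exp(A, spd_log(A, B)), B))
```

Composing a Euclidean architecture f with spd_log at A (followed by vectorization of the symmetric output into ℝ^{d(d+1)/2}) as the feature map and spd_exp at B as the readout map yields exactly the architecture of Corollary 3.15.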
Corollary 3.15 (Universal Approximation for Symmetric Positive-Definite Matrices). Let d, D ∈ ℕ₊ and let F ⊆ C(ℝ^{d(d+1)/2}, ℝ^{D(D+1)/2}) be ucc-dense. Then, for any A ∈ P_d^+ and B ∈ P_D^+, F_{Log_A, Exp_B} is ucc-dense in C(P_d^+, P_D^+). In particular, if σ is a continuous, locally-bounded, and non-polynomial activation function, then NN^σ_{Log_A, Exp_B} is ucc-dense in C(P_d^+, P_D^+).

3.2.2 Hyperbolic Feed-Forward Networks

For c > 0, the generalized hyperbolic spaces D_c^n of [17] have underlying set {x ∈ ℝ^n : c‖x‖² < 1}, and their topology is induced by the following non-Euclidean metric:

$$d_c(x,y) \triangleq \frac{2}{\sqrt{c}}\,\tanh^{-1}\!\left(\sqrt{c}\,\left\|\frac{(1-c\|x\|^2)\,y-(1-2c\langle x,y\rangle+c\|y\|^2)\,x}{1-2c\langle x,y\rangle+c^2\|x\|^2\|y\|^2}\right\|\right).$$

Though a direct description of hyperbolic feed-forward neural networks would be lengthy, it is shown on [17, page 6] that any hyperbolic feed-forward network from D_c^m to D_c^n can be represented as an element of

$$\Big\{\operatorname{Exp}_0^{D_c^n} \circ f \circ \operatorname{Log}_0^{D_c^m} : f \in NN^{\sigma}\Big\}, \tag{10}$$

where Exp_0^{D_c^k} is the Riemannian exponential map on D_c^k about 0, as in Corollary 3.14. Closed-form expressions are obtained in [17, Lemma 2] for these feature and readout maps. Since, as discussed in [17], D_c^k is a complete, connected Riemannian manifold of non-positive curvature, the Cartan-Hadamard Theorem implies that C_0 = ∅. Whence, Corollary 3.14 yields the following.

Corollary 3.16 (Hyperbolic Neural Networks are Universal). Let σ be a continuous, non-polynomial, locally-bounded activation function and let c > 0. Then for every g ∈ C(D_c^m, D_c^n), every ε > 0, and every compact subset K ⊆ D_c^m, there exists a hyperbolic neural network g_{ε,K,c} in (10) satisfying

$$\sup_{x \in K} d_c\big(g(x), g_{\varepsilon,K,c}(x)\big) < \varepsilon.$$

Next, applications of Theorems 3.3 and 3.10 with Euclidean input and output spaces are considered.

3.3 Universality of Deep Networks with First Layers Randomized

Fix ℝ-valued random variables {X_i}_{i=1}^k and {Z_i}_{i=1}^k defined on a common probability space (Ω, Σ, ℙ). Fix an activation function σ : ℝ → ℝ and positive integers {d_i}_{i=1}^k. Using this data, for each i = 1,…,k define random affine maps W_i : ℝ^{d_i} × Ω → ℝ^{d_{i+1}} by

$$(x, \omega) \mapsto A_i(\omega)\,x + b_i(\omega), \tag{11}$$

where the entries of A_i are i.i.d. copies of X_i and the entries of b_i are i.i.d. copies of Z_i. The random affine maps (11) define the (random) set of deep feed-forward neural networks with first k layers randomized and last 2 layers trainable as the (random) subset of C(ℝ^m, ℝ^n) given by

$$NN^{\sigma}_{2,k}(\omega) \triangleq \big\{f \in C(\mathbb{R}^m, \mathbb{R}^n) : (\exists g \in NN^{\sigma}_2)\ f(x) = g\big(\sigma \bullet W_k(\cdot,\omega) \circ \cdots \circ \sigma \bullet W_1(\cdot,\omega)(x)\big)\big\},$$

where NN^σ_2 is the collection of feed-forward neural networks of the form W_2 ∘ σ • W_1, where W_1 : ℝ^m → ℝ^d and W_2 : ℝ^d → ℝ^n are affine maps and d is a positive integer. Under the following mild assumptions, the random set NN^σ_{2,k} is dense in C(ℝ^m, ℝ^n) with probability 1.

Assumption 3.17. For each i = 1,…,k:

(i) d_i ≤ d_{i+1},
(ii) σ is strictly increasing and continuous,
(iii) E[X_i] = E[Z_i] = 0 and E[X_i²] = E[Z_i²] = 1,
(iv) E[|X_i|^C], E[|Z_i|^C] < ∞ for every C > 1.

Theorem 3.18. If Assumption 3.17 holds, then there exists a measurable subset

$$\Omega^{\star} \subseteq \Big\{\omega \in \Omega : \overline{NN^{\sigma}_{2,k}(\omega)} = C(\mathbb{R}^m, \mathbb{R}^n)\Big\}$$

satisfying ℙ(Ω^⋆) = 1.

Corollary 3.19 (Sub-Gaussian Case with Sigmoid Activation). Let X_i = Z_i, for each i = 1,…,k, be independent standardized sub-Gaussian random variables, let σ(x) = 1/(1+e^{−x}), and let d_i = d for each i = 1,…,k. Then the conclusion of Theorem 3.18 holds.

Corollary 3.20 (Bernoulli Case with PReLU Activation). Suppose that for every i, j = 1,…,k, X_i and Z_j are i.i.d. copies of a random variable taking values in {−1, 1}, each with probability 1/2. Let d_i = d for each i = 1,…,k and let σ be the PReLU activation function of [22].
Then Assumption 3.17 is met; thus the conclusion of Theorem 3.18 holds.

3.4 Feed-Forward Layers of Sub-Minimal Width

In this section, we use Theorem 3.10 to describe how additional layers can be incorporated into a DNN which violate the minimum width requirement of m+1 on its hidden layers (see [29, 45]) but do not negatively impact the architecture's approximation capabilities. We say that such layers have sub-minimal width. We derive specific conditions on the activation functions and on the structure of the connections between sub-minimal width layers ensuring that Assumptions 3.1 and 3.2 are met.

Proposition 3.21 (Input Layers of Sub-Minimal Width: Continuous Monotone Activations and Invertible Connections). Let σ be a continuous and strictly increasing activation function, let J ∈ ℕ₊, let A_1,…,A_J be m × m matrices, and let b_1,…,b_J ∈ ℝ^m. Let φ(x) ≜ φ_J(x), where

$$\varphi_j(x) \triangleq \sigma \bullet \big(\exp(A_j)\,\varphi_{j-1}(x) + b_j\big), \qquad \varphi_0(x) \triangleq x, \qquad j = 1,\dots,J, \tag{12}$$

and exp is the matrix exponential. Then φ satisfies Assumption 3.1.

Proposition 3.22 (Output Layers of Sub-Minimal Width: Invertible Feed-Forward Layers). In the setting of Proposition 3.21, if σ is also surjective, then φ is a homeomorphism; in particular, it satisfies Assumption 3.2.

Example 3.23 (Sub-Minimal Width via Generalized PReLU Activation). Fix α, β ∈ (0, ∞) with α ≠ β. Then the activation function σ(x) = βx I_{x≥0} + αx I_{x<0} is monotone increasing and surjective.

Example 3.24 (ReLU Feed-Forward Layers Cannot Achieve Sub-Minimal Width). Let φ be as in (12) and let σ(x) ≜ max{0, x}. By Theorem 3.4, F_{1_{ℝ^n},φ} is not dense in C(ℝ^m, ℝ^n).

3.4.1 Numerical Illustration

The following numerical illustration will also make use of the following method for constructing feature maps satisfying Assumption 3.1. The method can be interpreted as passing the inputs through an arbitrary function and feeding them into the model's input via a skip connection.

Proposition 3.25 (Skip Connections Using Arbitrary Continuous Functions are Good Feature Maps). Let d ∈ ℕ₊ and g ∈ C(ℝ^m, ℝ^d). Then φ_g(x) ≜ (x, g(x)) satisfies Assumption 3.1.

Example 3.26 (Pre-Trained DNNs with a Skip Connection are Good Feature Maps). Let g ∈ NN^σ. Then φ_g(x) ≜ (x, g(x)) satisfies Assumption 3.1.

To illustrate the effect of (in)correctly choosing a network's input and output layers, we implement different DNNs whose initial or final layers are built using the above examples or intentionally fail Assumptions 3.1 or 3.2. Our implementations are on the California housing dataset [31], with the objective of predicting the median house value given a set of economic and geo-spatial factors, as described in [18]. The test set consists of 30% of the total 20k instances. The implemented networks are of the form ρ ∘ f ∘ φ, where f = W_2 ∘ σ • W_1 is a shallow feed-forward network with ReLU activation and ρ, φ are built using the above examples. Our reference model (Vanilla) is simply the shallow feed-forward network f. For the first DNN, which we denote (Bad), ρ and φ are given as in Example 3.24 and therefore violate Assumption 3.1. For the second DNN, denoted (Good), ρ and φ are as in Example 3.23, and Assumptions 3.1 and 3.2 are met. For the final DNN, denoted (Rand), ρ is as in Example 3.23 and φ is as in Example 3.26, where the pre-trained network is generated randomly following Corollary 3.20. A sketch of the (Good) and (Rand) layer constructions is given below; Table 1 then reports the results.
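The following Python sketch, under the stated assumptions, illustrates how the (Good) input layers of Proposition 3.21 with the generalized PReLU of Example 3.23, and the skip-connection feature map of Proposition 3.25, can be built. All weights and names below are illustrative placeholders rather than the released code of [33].

```python
import numpy as np
from scipy.linalg import expm

def gen_prelu(x, alpha=0.2, beta=1.0):
    # Generalized PReLU of Example 3.23: for alpha, beta > 0 with alpha != beta it is
    # continuous, strictly increasing, and surjective, hence invertible.
    return np.where(x >= 0, beta * x, alpha * x)

def good_input_layers(x, As, bs):
    # Input layers of sub-minimal width (Proposition 3.21):
    # phi_j(x) = sigma • (expm(A_j) phi_{j-1}(x) + b_j), phi_0(x) = x.
    # expm(A_j) is always invertible, so each layer is a homeomorphism of R^m.
    h = x
    for A, b in zip(As, bs):
        h = gen_prelu(expm(A) @ h + b)
    return h

def skip_feature_map(x, g):
    # Proposition 3.25: phi_g(x) = (x, g(x)) is injective for any continuous g,
    # e.g. g a randomly generated network as in Example 3.26 / Corollary 3.20.
    return np.concatenate([x, np.atleast_1d(g(x))])

# Example: two invertible width-m layers, followed by a skip-connection feature map.
rng = np.random.default_rng(2)
m = 8
As = [0.1 * rng.normal(size=(m, m)) for _ in range(2)]
bs = [0.1 * rng.normal(size=m) for _ in range(2)]
x0 = rng.normal(size=m)
z = good_input_layers(x0, As, bs)                    # feature map used by (Good)
z_skip = skip_feature_map(x0, lambda v: np.tanh(v))  # feature map of (Rand)-type models
```

Replacing gen_prelu by the ReLU max{0, x} destroys injectivity, which is exactly the failure mode exhibited by the (Bad) model.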
              Train                                  Test
Model         Good     Rand     Bad      Vanilla     Good     Rand     Bad      Vanilla
MAE           0.318    0.320    0.876    0.320       0.252    0.296    0.887    0.284
MSE           0.247    0.259    1.355    0.257       0.174    0.234    1.409    0.209
MAPE (%)      16.714   17.626   48.051   17.427      12.921   15.668   48.698   14.878

Table 1: Training (left block) and test (right block) predictive performance.

As anticipated, Table 1 shows that selecting the network's initial and final layers according to our method either improves performance (Good) when all involved parameters are trainable, or does not significantly affect it even when nearly every parameter is random (Rand). However, disregarding Assumptions 3.1 and 3.2 when adding additional deep layers dramatically degrades predictive performance, as is the case for (Bad). Table 1 also shows that if a DNN's first and final layers are structured according to Theorem 3.10, then expressibility can be improved even if these layers violate the minimum width bounds of [29, 45]. Python code for these implementations is available at [33].

4 Conclusion

Modifications to the input and output layers of a neural network, using carefully chosen feature (φ : X → ℝ^m) and readout (ρ : ℝ^n → Y) maps, are common in practice. Theorems 3.3 and 3.10 and Corollary 3.14 provided general conditions on these maps guaranteeing that the new, modified architecture can approximate any function in the uniform convergence on compacts (or, more generally, the compact-open) topology between its new input and output spaces.

As a consequence of our main results, we showed that universal approximation implies universal classification once a component-wise logistic map is applied. This is a deterministic strengthening of the probabilistic results of [15]. We derived a method for constructing universal approximators between a wide class of manifolds. In particular, we extended the symmetric positive-definite matrix-valued regressors of [40] to universal approximators, and we showed that the hyperbolic feed-forward networks of [17] are universal approximators between hyperbolic spaces. Our main results also described how to structure the first and final layers of a DNN between Euclidean spaces so as to preserve the approximation capabilities of the network's middle layers. In particular, we provided conditions on a network's activation function and connections so that these layers can be made narrower than the specifications of [29, 45] while maintaining the architecture's expressibility. Lastly, we showed that randomly generated DNNs are good feature maps with probability 1.

Broader Impact

A large portion of available data is non-Euclidean, ranging from social network data to the imaging data relevant in health applications of deep learning. The tools in this paper open up a generic means of translating the currently available deep learning technology to those milieus. The automation of tools in the medical sciences is important for helping to reduce waiting times in hospitals and for making healthcare more accessible to all; in that way, any automation of health-science tools helps move society in that direction. Therefore, we hope that the methods presented in this paper form a small step towards a greater positive advancement of the social and natural sciences.

Acknowledgments and Disclosure of Funding

The authors would like to thank Gaël Meigniez for his insightful comments surrounding the differential topology tools used in the proof of Theorem 3.3. The authors would also like to thank Josef Teichmann, Behnoosh Zamanlooy, and Philippe Casgrain for their feedback and suggestions. This research was funded by the ETH Zürich foundation.
The authors are grateful for its financial support of the project.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008. ISBN 978-0-691-13298-3. doi: 10.1515/9781400830244. URL https://doi.org/10.1515/9781400830244. With a foreword by Paul Van Dooren.

[2] C. D. Aliprantis and K. C. Border. Infinite dimensional analysis. Springer, Berlin, third edition, 2006. ISBN 978-3-540-32696-0; 3-540-32696-0. A hitchhiker's guide.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 6–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

[4] N. Ay, J. Jost, H. V. Lê, and L. Schwachhöfer. Information geometry, volume 64 of Series of Modern Surveys in Mathematics. Springer, Cham, 2017. ISBN 978-3-319-56477-7; 978-3-319-56478-4. doi: 10.1007/978-3-319-56478-4. URL https://doi.org/10.1007/978-3-319-56478-4.

[5] M. Baes, C. Herrera, A. Neufeld, and P. Ruyssen. Low-rank plus sparse decomposition of covariance matrices using neural network parametrization. arXiv preprint arXiv:1908.00461, 2019.

[6] S. Bonnabel and R. Sepulchre. Riemannian metric and geometric mean for positive semidefinite matrices of fixed rank. SIAM J. Matrix Anal. Appl., 31(3):1055–1070, 2009. ISSN 0895-4798. doi: 10.1137/080731347. URL https://doi.org/10.1137/080731347.

[7] S. Bonnabel, A. Collard, and R. Sepulchre. Rank-preserving geometric means of positive semi-definite matrices. Linear Algebra Appl., 438(8):3202–3216, 2013. ISSN 0024-3795. doi: 10.1016/j.laa.2012.12.009. URL https://doi.org/10.1016/j.laa.2012.12.009.

[8] M. Brown. Locally flat imbeddings of topological manifolds. Ann. of Math. (2), 75:331–341, 1962. ISSN 0003-486X. doi: 10.2307/1970177. URL https://doi.org/10.2307/1970177.

[9] V. V. Buldygin and J. V. Kozačenko. Sub-Gaussian random variables. Ukrain. Mat. Zh., 32(6):723–730, 1980. ISSN 0041-6053.

[10] J. Clark and D. Crisan. On a robust version of the integral representation formula of nonlinear filtering. Probability Theory and Related Fields, 133(1):43–56, 2005.

[11] R. Connelly. A new proof of Brown's collaring theorem. Proc. Amer. Math. Soc., 27:180–182, 1971. ISSN 0002-9939. doi: 10.2307/2037284. URL https://doi.org/10.2307/2037284.

[12] J. B. Conway. A course in functional analysis, volume 96 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1990. ISBN 0-387-97245-5.

[13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2(4):303–314, 1989.

[14] R. Engelking. General topology. 6:viii+529, 1989. Translated from the Polish by the author.

[15] A. Faragó and G. Lugosi. Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151, 1993.

[16] P. T. Fletcher. Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis., 105(2):171–185, 2013.

[17] O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345–5355, 2018.

[18] A. Geron. https://github.com/ageron/handson-ml/tree/master/datasets/housing. Accessed: 2020-05-15.

[19] L. Gonon, L. Grigoryeva, and J. Ortega.
Approximation bounds for random neural networks and reservoir systems. arXiv, abs/2002.05933, 2020.

[20] C. R. Guilbault and C. J. Tirel. On the dimension of Z-sets. Topology Appl., 160(13):1849–1852, 2013. ISSN 0166-8641. doi: 10.1016/j.topol.2013.07.017. URL https://doi.org/10.1016/j.topol.2013.07.017.

[21] V. Guillemin and A. Pollack. Differential topology. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1974.

[22] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[23] P. Henry-Labordere. Analysis, geometry, and modeling in finance: Advanced methods in option pricing. CRC Press, 2008.

[24] M. W. Hirsch. Differential topology, volume 33 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1994. ISBN 0-387-90148-5. Corrected reprint of the 1976 original.

[25] H. Hoffmann. On the continuity of the inverses of strictly monotonic functions. Irish Math. Soc. Bull., (75):45–57, 2015.

[26] H. Hopf and W. Rinow. Ueber den Begriff der vollständigen differentialgeometrischen Fläche. Comment. Math. Helv., 3(1):209–225, 1931. ISSN 0010-2571. doi: 10.1007/BF01601813. URL https://doi.org/10.1007/BF01601813.

[27] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[28] J.-i. Itoh and M. Tanaka. The dimension of a cut locus on a smooth Riemannian manifold. Tohoku Mathematical Journal, Second Series, 50(4):571–575, 1998.

[29] J. Johnson. Deep, skinny neural networks are not universal approximators. 2018.

[30] J. Jost. Riemannian geometry and geometric analysis. Universitext. Springer, Cham, seventh edition, 2017. ISBN 978-3-319-61859-3; 978-3-319-61860-9. doi: 10.1007/978-3-319-61860-9. URL https://doi.org/10.1007/978-3-319-61860-9.

[31] Kaggle. California housing prices. https://www.kaggle.com/camnugent/california-housing-prices. Accessed: 2020-05-15.

[32] A. W. Knapp. Lie groups beyond an introduction, volume 140. Springer, 2002.

[33] A. Kratsios. NeurIPS 2020 non-Euclidean universal approximation example: DNN layer comparisons. https://github.com/AnastasisKratsios/NeurIPS2020_Non_Euclidean_Universal_Approximation_Example_DNN_Layer_Comparisons. Accessed: 2020-10-19.

[34] D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguná. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.

[35] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

[36] J. Lott. Some geometric calculations on Wasserstein space. Comm. Math. Phys., 277(2):423–437, 2008. ISSN 0010-3616. doi: 10.1007/s00220-007-0367-3. URL https://doi.org/10.1007/s00220-007-0367-3.

[37] C. Louart, Z. Liao, R. Couillet, et al. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.

[38] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.

[39] A. Mallasto and A. Feragen. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In Advances in Neural Information Processing Systems, pages 5660–5670, 2017.

[40] G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on fixed-rank positive semidefinite matrices: a Riemannian approach.
Journal of Machine Learning Research, 12(Feb):593–625, 2011.

[41] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.

[42] J. Munkres. Topology. Pearson Education, 2014.

[43] T. Munzner. H3: Laying out large directed graphs in 3D hyperbolic space. In Proceedings of VIZ '97: Visualization Conference, Information Visualization Symposium and Parallel Rendering Symposium, pages 2–10. IEEE, 1997.

[44] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7213-poincare-embeddings-for-learning-hierarchical-representations.pdf.

[45] S. Park, C. Yun, J. Lee, and J. Shin. Minimum width for universal approximation. arXiv preprint arXiv:2006.08859, 2020.

[46] X. Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, 2006.

[47] X. Pennec. Statistical computing on manifolds: from Riemannian geometry to computational anatomy. In LIX Fall Colloquium on Emerging Trends in Visual Computing, pages 347–386. Springer, 2008.

[48] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41–66, 2006.

[49] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 1987.

[50] F. Sala, C. De Sa, A. Gu, and C. Re. Representation tradeoffs for hyperbolic embeddings. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4460–4469, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[51] E. H. Spanier. Algebraic topology. Springer-Verlag, New York, 1995. ISBN 0-387-94426-5. Corrected reprint of the 1966 original.

[52] T. Tao and V. Vu. Random matrices: the distribution of the smallest singular values. Geom. Funct. Anal., 20(1):260–297, 2010. ISSN 1016-443X.

[53] P. Taylor. Foundations for computable topology. In Foundational Theories of Classical and Constructive Mathematics, pages 265–310. Springer, 2011.

[54] H. Toruńczyk. A correction of two papers concerning Hilbert manifolds: "Concerning locally homotopy negligible sets and characterization of ℓ²-manifolds". Fund. Math., 125(1):89–93, 1985. ISSN 0016-2736. doi: 10.4064/fm-125-1-89-93. URL https://doi.org/10.4064/fm-125-1-89-93.

[55] A. A. Ungar. Analytic hyperbolic geometry. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. ISBN 981-256-457-8. doi: 10.1142/9789812703279. URL https://doi.org/10.1142/9789812703279. Mathematical foundations and applications.

[56] D.-X. Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020.