# small_transformers_compute_universal_metric_embeddings__03f2ef23.pdf Journal of Machine Learning Research 24 (2023) 1-48 Submitted 10/22; Revised 3/23; Published 5/23 Small Transformers Compute Universal Metric Embeddings Anastasis Kratsios KRATSIOA@MCMASTER.CA Mc Master University Department of Mathematics 1280 Main Street West, Hamilton, Ontario, L8S 4K1, Canada Valentin Debarnot VALENTIN.DEBARNOT@UNIBAS.CH Universit at Basel Department of Computer Science Basel, 4051, Switzerland Ivan Dokmani c IVAN.DOKMANIC@UNIBAS.CH Universit at Basel Department of Computer Science Basel, 4051, Switzerland Editor: Sayan Mukherjee We study representations of data from an arbitrary metric space X in the space of univariate Gaussian mixtures equipped with a transport metric (Delon and Desolneux 2020). We prove embedding guarantees for feature maps implemented by small neural networks called probabilistic transformers. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about n log(n) and width about n2 can bi-H older embed any n-point dataset from X with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If the geometry of X is sufficiently regular, we obtain stronger bi-Lipschitz guarantees for all points. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers compute bi-H older embeddings with arbitrarily small distortion. Our results show that any finite metric dataset, from vertices on a graph to functions a function space, can be faithfully represented in a single representation space, and that the representation can be implemented by a simple transformer architecture. Thus one may only need a modular set of machine learning tools compatible with this one representation space, many of which already exist, for downstream supervised and unsupervised learning from a great variety of data types. Keywords: Metric Embeddings, Geometric Deep Learning, Optimal Transport, Representation Learning, Transformers, Gaussian Mixtures. 1. Introduction In machine learning practice we face a daunting variety of data sources, but all of them come equipped with some domain-specific notion of (dis)similarity. The successes of modern machine learning models may be in part attributed to the fact that they explicitly or implicitly learn representations of data that are compatible with these domain-specific notions. A key task in a successful machine learning pipeline is thus to identify a representation space R and a feature map φ : X R which faithfully encodes X. The data space X and the dissimilarity d X often form a metric space (X, d X ); faithfully then means with minimal distortion to d X . Once data is represented in R, we can learn and predict from φ(X) using downstream machine learning tools for classification, regression models, clustering, dimension reduction models, etc. c 2023 Anastasis Kratsios, Valentin Debarnot, Ivan Dokmani c. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1246.html. KRATSIOS, DEBARNOT, DOKMANI C For a long time, the default choice for R has been a Euclidean space or a (reproducing kernel) Hilbert space. Yet there is now a body of work both in machine learning (Samal et al., 2018; Eidi and Jost, 2020; Muscoloni et al., 2017; Ganea et al., 2018; Giovanni et al., 2022) and in metric embedding theory (Bourgain, 1985; Matouˇsek, 1996; Naor et al., 2006) showing that Euclidean and Hilbert spaces are suboptimal for many practically-relevant notions of dissimilarity. This is because Euclidean spaces are too small and Hilbert spaces too flat to represent the great variety of geometries in modern datasets. An archetypal situation where low-dimensional non-Euclidean representations outperform high-dimensional Euclidean is when working with hierarchical data, organized on trees or graphs with power-law degree distributions (Ravasz and Barab asi, 2003), and equipped with the usual combinatorial (geodesic) metric. Any finite tree X can be faithfully embedded into the two-dimensional hyperbolic space (Sarkar, 2011) while embeddings in d-dimensional Euclidean space require much greater distortion (Gupta, 2000). Since hierarchical data is ubiquitous, this favorable property of hyperbolic spaces has inspired a booming development of hyperbolic machine learning models. Examples include clustering (Chami et al., 2020; Tabaghi and Dokmani c, 2021, 2020), PCA-type methods (Chami et al., 2021), classification (L opez and Strube, 2020), hyperbolic analogues of feedforward networks (Ganea et al., 2018; Shimizu et al., 2021), and several Python packages supporting this Riemannian geometry (Miolane et al., 2020; Kochurov et al., 2020). These advances contributed to state-of-the-art performance when learning from natural language (Zipf, 1949; Dhingra et al., 2018; Le et al., 2019), knowledge graphs (Kolyvakis et al., 2020), social networks (Krioukov et al., 2010; Muscoloni et al., 2017), directed graphs (Munzner, 1997), scenario generation for stochastic phenomena (Pflug and Pichler, 2015), and combinatorial trees (Nickel and Kiela, 2017). Hyperbolic representations perform well when X possesses a tree-like geometric prior, but many datasets lack such structure or exhibit a more complex one. This has motivated the search for representations of combinatorial graphs in Riemannian manifolds beyond constant-curvature space forms, including products of space forms (mixed-curvature spaces) (Tabaghi et al., 2021; Samal et al., 2018; Eidi and Jost, 2020) and representations that match discrete curvature in heterogeneous rotationally-symmetric manifolds (Giovanni et al., 2022). While these proposals show great promise in addressing the emergent shortcomings of hyperbolic spaces, each of them introduces additional parameters (such as the choice of space forms or their dimensions and curvatures) without obvious nominal values on real-world datasets, and each requires a different set of downstream tools. We are thus motivated to look for a universal representation space. Beyond the geometric considerations in the previous paragraphs, there may exist no machine learning models on X with favourable approximation, generalization, and optimization properties even for data from a nice X. And when good models do exist, we often work with datasets from some unknown low-dimensional subset of X and this latent structure is not leveraged by the off-the-shelf models. We look for a representation that allows us to relay X s structure to downstream tasks optimally Furthermore, we seek a representation agnostic to the details of X s geometry in the sense that it should work for any dataset equipped with a metric. The implications of such a universal representation space R for machine learning is that machine learning tools could be standardized to process inputs from R. Upon learning a faithful representation of any dataset X in R, these standardized tools could be used for downstream tasks like classification, regression, and dimensionality reduction. 1.1 Good Representation Spaces and Learnable Feature Maps, with Guarantees We consider the problem of representing a general metric space (X, d X ) in a representation space R, using a feature map φ : X R. We are interested in good representation spaces and computationally tractable feature maps. Let us take a moment to reflect on what a good R and φ should entail. Empirically successful non Euclidean embeddings (Lopez et al., 2021; Giovanni et al., 2022) and Bourgain s impossibility result for metric embeddings in Hilbert spaces (Bourgain, 1985) suggest that R must not be flat. On the other hand, we show in Section 4.3 that for any complete Riemannian manifold R (including all manifolds of positive curvature) and for any positive integer n, there exists an n-point metric space which cannot be embedded in SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS R with metric distortion smaller than O(log n). Finite-dimensional manifolds are thus too small to embed datasets from arbitrary metric spaces, so we require that R be infinite-dimensional. Given a finite dataset from X, its representation in R is often computed by solving an optimization problem. One example is Laplacian eigenmaps when X is a Riemannian manifold and d X the geodesic distance (Belkin and Niyogi, 2003, 2006). The feature map φ is thus defined implicitly via an optimization problem. A drawback of this strategy is that if a new sample is added to the dataset, computing its representation may require recomputing representations of all points. To avoid this, we are interested in efficient direct parameterizations of the feature map by deep neural networks with controlled complexity. In machine learning, this amortization of static optimization-based procedures is referred to as (out-of-sample) generalization (Bartlett et al., 2017; Agarwal and Niyogi, 2009; Bartlett and Long, 2021; Mei et al., 2022; Mei and Montanari, 2022). To make it possible, R should be a familiar representation space with well-understood geometry, finitely parameterized points, and admitting efficient numerical algorithms. Since only finite-dimensional representations can be practically implemented, we ask that the feature map φ : X R implemented by our deep learning model take values in a finite-dimensional subspace of R whenever X has a suitable geometric prior (which we make precise shortly). We informally refer to the dimension of the image of φ as the effective dimension of the representation φ of X. We show that the requirements for R are met by a certain space of probability measures equipped with an optimal transport metric, and that the corresponding feature map φ can be exactly implemented (in the sense of memorization capacity) by a deep neural network with controlled complexity. Concretely, we prove that a deep neural network with about n log(n) layers of width n2 can embed any dataset of n points in X with low metric distortion. Thus, unlike neural networks studied by most universal approximation theorems (Burger and Neubauer, 2001; Yarotsky, 2017; Kratsios and Papon, 2022; Acciaio et al., 2023; Kidger and Lyons, 2020; Puthawala et al., 2022; Shen et al., 2021), the PT does not face the curse of dimensionality. Its number of parameters is a (small-degree) polynomial in the embedded number of points which is independent of X s dimension. In contrast, the networks constructed in universal approximation theorems usually require a number of parameters exponential in X s dimension. Representation Learning Metric Embedding Theory Neural Representations + n-Point Guarantees Deep Learning Theory The particular n-point dataset (a subset Xn of X) encodes information about general points in X. Its points act as reference points or landmarks for the remainder of X. If X is a polytope (for example, [0, 1]d) then Xn could be the set of extremal points in X (for example, Xn = {0, 1}d). If X is a connected compact Riemannian manifold then Xn could be any maximal δ-dense subset; for a sufficiently small1 δ > 0 this set of n points encodes most of the metric information in (X, d X ). In machine learning, Xn is the dataset to be analyzed or the training set which is randomly drawn from X. In metric embedding theory suitable finite subsets play an analogous role to Schauder bases in the theory of Banach spaces (see (Naor, 2018)). This places our work at the junction between metric embedding theory and representation learning. The former aims at proving the existence of low-dimensional and low-distortion embeddings of finite metric spaces (e.g. Xn), while the latter seeks computationally tractable representations of non-linear data (e.g. X) which are performant in downstream learning tasks. Our theoretical guarantees concern n-point datasets from X because one cannot expect to embed general infinite metric spaces into any single reasonable 2 space. Finally, we complement the theory by stylized computer experiments which illustrate low metric distortion of our proposed representations (compared to Euclidean and hyperbolic) as well as out-of-sample generalization. 1. Let dg be the geodesic distance on X. The main result of (Katz and Katz, 2011) shows that if δ is less than 10 times the length of the smallest non-contractible loop in X (i.e. X s systole) then, map x 7 (dg(x, xl))n l=1 is a high-dimensional Euclidean bi-Lipschitz embedding of the Riemannian manifold X; where Xn := {xl}n l=1. A key point here is the trade-off between the embedding dimension n, which grows exponentially in δ 1 (via an elementary covering argument) and the Lipschitz constant of the map x 7 (dg(x, xl))n l=1 which is large whenever δ is small. 2. By reasonable we mean a separable complete geodesic metric space. KRATSIOS, DEBARNOT, DOKMANI C 1.2 The Representation Space Our search for a good representation space begins by examining the Wasserstein space, denoted by (P2(Rd), W2), (d 1), which is both infinite-dimensional and exhibits positive curvatures on all scales (Kloeckner and Bertrand, 2016, Section 4.1.1). Furthermore, even when d = 1, (P2(Rd), W2) contains an isometric copy of any finite-dimensional bounded Euclidean ball. Thus, it has plenty of room and a complex-enough geometry to make it a viable candidate for a universal representation space. A fortiori, this intuition has been confirmed by Andoni, Naor, and Neiman (Andoni et al., 2018) who show that (P2(Rd), W2) embeds any finite dataset with little or no distortion to X s metric geometry when d > 2, with d = 2 still an open problem in geometric analysis. A computational drawback of (P2(Rd), W2) is that its elements cannot be exactly implemented by a computer (capable of processing real values) but rather only approximated by an expressive class of probability measures such as Gaussian mixtures or empirical distributions (atomic probability measures) (Chevallier, 2018). This makes them ill-suited for sharp embedding guarantees since the optimal approximation rate using, for example, atomic measures with M atoms in P2(Rd) with respect to the Wasserstein distance is O(M 1/d) (Graf and Luschgy, 2000; Kloeckner, 2012; Liu and Pag es, 2020). Moreover, restricting the Wasserstein-2 distance to either the set of Gaussian mixtures or empirical (atomic) probability measures no longer yields a complete geodesic space, resulting in an ill-behaved geometry. We sidestep these issues by instead working on the smaller optimal transport-theoretic space (GM2(R), MW2), introduced by (Delon and Desolneux, 2020), whose elements are (finitely parameterized) univariate Gaussian mixtures and whose distance is a strengthening of the Wasserstein distance obtained by considering only those couplings which are themselves Gaussian mixtures. This space, metrized by restricting the optimal couplings, is part of a broader line of recent work which encodes additional structure into the transport distance by constraining the couplings; we highlight adapted optimal transport (Backhoff-Veraguas et al., 2020b; Acciaio et al., 2020; Backhoff et al., 2022), martingale optimal transport (Dolinsky and Soner, 2014; Beiglb ock et al., 2017b; Guo and Obł oj, 2019; Backhoff-Veraguas and Pammer, 2022b), and semi-martingale optimal transport (Liu and Neufeld, 2019). The marked advantage of (GM2(R), MW2) over (P2(Rd), W2) is that its elements can be exactly implemented (up to numerical precision) which eliminates the quantization error of O(M 1/d). Geometrically, (GM2(R), MW2) shares many of the appealing properties of the classical Wasserstein space (P2(Rd), W2) (e.g., it is a complete separable geodesic space); thus, there are no geometric drawbacks of working with (GM2(R), MW2) instead of with (P2(Rd), W2). As byproduct, since the distance MW2 is strictly stronger than the classical Wasserstein distance W2, our deep neural embedding results into (GM2(R), MW2) automatically imply embeddings in (P2(R), W2). To summarize, we move to the space of univariate Gaussian mixtures with transport distances of (Delon and Desolneux, 2020) for which there is a well-developed machine learning machinery (including non-linear dimension reduction (Bigot et al., 2017), clustering (Mi et al., 2018), and regression (Chen et al., 2021)) and which is neither too small nor too flat . Furthermore, as we show, the feature maps are computationally tractable and can be parameterized by deep neural networks. Finally, several maintained Python libraries streamline the implementation of deep learning algorithms in these spaces (Flamary et al., 2021; Delon and Desolneux, 2021). 1.3 Universal Feature Maps We implement the feature maps using an extension of the probabilistic transformer (PT) model (Kratsios et al., 2022; Kratsios, 2023). PT extends the classical transformer network used in natural language processing3 (Vaswani et al., 2017); in particular, the PT implements the classical transformer on average . It can be shown to universally approximate any continuous function while simultaneously exactly implementing arbitrary compact (possibly non-convex) constraints (Kratsios et al., 2022). We adapt the probabilistic transformer by composing it with a non-parameterized input layer which translates metric information into vectorial data by mapping a point in X to a vector of its distances to a set of 3. And now in many other places in deep learning. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Figure 1: Schematic view of a probabilistic transformer (gray) representing a metric space X (orange) while bi-Lipschitz embedding the distinguished points (landmarks, data) Xn (black). Depth & Width Number of Mixtures Figure 2: Embedding landscape of probabilistic transformers. Data from structured spaces such as trees or manifolds is easiest to embed in terms of feature map complexity. Data from unstructured spaces (general metric graphs, expanders) requires more complex (but still uncursed) neural feature maps. landmarks {xl}n l=1 X (cf. Figure 6a). The resulting process of representing a metric space X with the PT with embedding guarantees for any distinguished n-point dataset Xn X is illustrated in Figure 1. 1.4 Summary of Contributions Our results can be summarized with reference to the PT s embedding landscape in Figure 2. Our first two universal representation theorems relate to the first row, where no geometric prior is assumed on the metric space X. The second and third rows relate to the sharper embedding guarantees we obtain when X does have a regular geometry. In particular, this regularity allows us to explicitly bound the embedding s effective dimension which becomes independent of n, and depends only on X s latent geometry. We begin by showing that, for any finite dataset Xn in X and any geometric perturbation parameter α (1/2, 1) (H older coefficient), there is a PT which bi-α-H older embeds Xn into R with metric distortion equal to that of the quantitative version of Assouad s embedding theorem (Assouad, 1979) as derived by Naor and Neiman (Naor and Neiman, 2012). We note that the result of Naor and Neiman is an existence theorem, while we prove that the embedding can be implemented by a concrete deep neural network (the PT) with explicitly controlled width and depth. Of course, it is desirable to transition from α < 1 to a bi-Lipschitz embedding with α = 1. One reason is that Lipschitz feature maps are easiest to learn from (Kratsios and Papon, 2022). Our second main result shows that for any D > 1, still without geometric assumptions on X, there exists a PT which represents Xn in R and for which a pair of uniformly randomly sampled points are bi-Lipschitz embedded with metric distortion at most D with probability at least O(n 4e/(1+D)). Thus the PT exhibits a natural trade-off between embedding quality and satisfiability of the embedding. KRATSIOS, DEBARNOT, DOKMANI C It is natural to expect that if X has a regular geometric prior, it should be possible to obtain a deterministic small-distortion bi-Lipschitz guarantee for all points in Xn. This is indeed the case: we show that if X is a compact d-dimensional Riemannian manifold with controlled Ricci curvature, then there is a PT which bi-Lipschitz embeds Xn into R with effective dimension 3d + 6. A similar fact holds for combinatorial trees, with PT s metric distortion coinciding with that of Gupta s embeddings of trees in Euclidean space (Gupta, 2000). In these cases, we also obtain explicit depth, width, and effective dimension estimates. It is worth emphasizing that the depth and the width only depend on the number of points n for which we seek a memorization guarantee (the number of landmarks) and the effective dimension only depends on the regularity and geometry of X (cf. Figure 2). Lastly, we show that if one embeds into Gaussian mixtures on Euclidean space of dimension at least three, then there are probabilistic transformers which implement bi-H older embeddings of arbitrarily low distortion. The significance of these results for machine learning practice is that (GM2(R), MW2) indeed acts as a universal embedding space with the feature map implemented by a probabilistic transformer. This motivates the development of downstream tools to work with data points that are probability measures, and in particular Gaussian mixtures. In fact, some of those tools are already available4, e.g. (Bigot et al., 2017; Mi et al., 2018; Ye et al., 2017; Ho et al., 2017; Chen et al., 2021). Furthermore, deep neural networks implicitly or explicitly compute representations of data, but, to the best of our knowledge, there are no prior theoretic results for such deep neural representations of embedding maps which guarantee low metric distortion. We thus provide the first theoretical results on embeddings of datasets from general metric spaces using deep neural networks. Independently of how the embeddings are implemented (by probabilistic transformers or otherwise), these are also the first guarantees that the discrete geometries covered by our results can be embedded into the space of Gaussian mixtures on R with an optimal transport-type distance. Finally, we develop new proof techniques that opportunely combine metric embedding theory (Heinonen, 2003; Krauthgamer et al., 2005; Naor and Neiman, 2012; Ostrovskii, 2013; Eriksson-Bique, 2018; Andoni et al., 2018), memorization theory of deep feedforward networks (Sontag, 1997b; Park et al., 2020; Feldman and Zhang, 2020; Daniely, 2020; Vershynin, 2020; Bubeck et al., 2020; Vardi et al., 2022), and computational optimal transport (Delon and Desolneux, 2020). Finally, we note that our results immediately imply embeddings into the usual Wasserstein-2 space over R. 2. Geometric Background We briefly overview some of the relevant terminologies from metric embedding theory. For further details we recommend the book (Ostrovskii, 2013) or the lecture notes (Matouˇsek, 2013; Naor, 2015). A metric space is a pair (X, d X ) of a set X and a distance function d X : X X [0, ) which is symmetric in its arguments, is 0 if and only if x = x (thus can uniquely identify points), and satisfies the triangle inequality: for every three points x1, x2, x3 in X d X (x1, x3) d X (x1, x2) + d X (x2, x3). The prototypical metric space is the Euclidean space; i.e. Rn with Euclidean distance d2(x, x) def.= Pn i=1(xi xi)2. We now review three other examples which are central to our analysis. 2.1 Combinatorial Graphs A (combinatorial) graph is a pair G def.= (V, E) of a set of vertices (or vertices) V and a set of unordered pairs5 E V 2 , called edges, such that {u, v} E if there is an edge between u, v V . We will always assume G to be connected, which means that for any u, v V there is a sequence of edges {u, v1}, . . . , {v N, v} in E linking u to v. Every graph G = (V, E) induces a metric space XG def.= (V, d G) whose graph geodesic distance 4. All our results regarding embeddings into (GM2(R), MW2) yield embeddings into (GM2(R), W2) with the same distortion using the same probabilistic transformer network. Therefore, one can deploy any such available machine learning tool on (W2(R), W2). 5. For a set V , V 2 denotes the quotient of the Cartesian product V V under the equivalence relation defined on any (v1, v2), (v3, v4) V V by (v1, v2) (v3, v4) if and only if (v1, v2) = (v3, v4) or (v1, v2) = (v4, v3). SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS d G is defined for any u, v V by d G(v, u) def.= d G(u, v) def.= inf N : {v, v1}, . . . {v N 1, u} E . We only consider simple graphs, meaning that any two vertices can be connected by at most one edge. 2.2 Riemannian Manifolds with Lower-Bounded Ricci Curvature We overview the smooth geometric prior considered in this paper; namely, we review the notion of a Riemannian manifold. For a more detailed exposition on Riemannian geometry, we refer the reader to (Jost, 2017, Chapter 1) on the topic. A d-dimensional Riemannian manifold (M, g) is a pair of a d-dimensional smooth manifold M and smoothly varying family of positive-definite inner-products g def.= (gx)x M, with each gx mapping pairs of vectors in the tangent space Tx(M) above the point x M to R. Equivalently, by (Nash, 1954, Theorem 2) we may describe any d-dimensional Riemannian manifold extrinsically by viewing it as a Riemannian submanifold of the Euclidean space R2d+1. We say that (M, g) is complete if every two points in M can be reached by a (continuous) curve and if the geodesic distance, defined for x, x M by dg(x, x) def.= inf γ 0 γ (t), γ (t) γ(t)dt, (1) makes (M, dg) into a complete metric space; where the infimum in (1) is taken over all piece-wise continuously differentiable curves γ : [0, 1] R starting at x and ending at x; i.e. γ(0) = x and γ(1) = x. By the main result of (Hopf and Rinow, 1931), the completeness of (M, dg) is equivalent to the existence of unique smooth curves minimizing the objective function (1); these curves are called geodesics. In this paper, we focus on Riemannian manifolds satisfying a lower-bounded Ricci curvature condition, which we express using the curvature-dimension inequality of (Baudoin et al., 2014): for every infinitely differentiable map f : M R the following holds 1 2 f 2 g f, ( f) g 1 d f 2 g + r f g, (2) where is the Laplacian on (M, g) (which describes heat flow on (M, g)) and g def.= , g. To make matters concrete, one can show that a d-dimensional sphere of radius ρ > 0 satisfies (2) with r = ρ 2d(d 1) > 0, the hyperbolic plane satisfies (2) with r = 1 < 0, and the Euclidean space satisfies (2) with r = 0. Since dictates the heatflow on (M, g), then r in (2) can be interpreted as describing the rate at which heat dissipates on (M, g) with small r being fast (e.g. the hyperbolic plane) and large r being slow (e.g. the sphere). 2.3 Mixed-Gaussian Optimal Transport We rely on one more example of a non-Euclidean metric space, which is a variant of the space of probability measures introduced by (Waserstein, 1969), in the context of optimal transport theory. This is the optimal transport-theoretic space of Gaussian mixtures introduced in (Delon and Desolneux, 2020), which specializes in the classical optimal transport constructions to the class of Gaussian mixtures. This metric space of probability measures is part of a broader area of contemporary research encoding structure into transport plans by restricting the classes of admissible transport plans; see (Beiglb ock et al., 2017a; Bion-Nadal and Talay, 2019; Acciaio et al., 2020; Backhoff-Veraguas et al., 2020a; Backhoff-Veraguas and Pammer, 2022a)). In what follows, we use Pp(Rd) to denote the set of probability measures on the Euclidean space Rd admitting a pth-moment; i.e. P Pp(Rd) if EX P[ X p] < . We refer to the metric space W2(Rd) def.= (P2(Rd), W2) as the Wasserstein space, where W2 is defined for any two P and Q in P2(Rd) by W2 2(P, Q) def.= inf π Cpl(P,Q) E(X,Y ) π X Y 2 , (3) KRATSIOS, DEBARNOT, DOKMANI C where the set Cpl(P, Q) consists of all probability distributions on R2d with marginals P and Q. A key point, highlighting our interest in the case in embedding of metric spaces into univariate distributions, is that when d = 1 then W2(P, Q) = Z 1 QP(t) QQ(t) 2 dt, where QP and QQ are the respective quantile functions of P and of Q. Thus, for empirical distributions, unlike in the multivariate case (d > 1) where computing W2(P, Q) has super-cubic complexity (Orlin, 1988), in the univariate case (d = 1) W2 can be easily computed via a simple sorting procedure; i.e. with near linear complexity. In the case where P and Q are Gaussian mixtures (Delon and Desolneux, 2020, Section 4.1) propose a stronger distance function than the classical Wasserstein-2 distance W2(P, Q) on the set of subset of P2(R) consisting of univariate Gaussian mixtures; we denote this set by GM2(R). This distance is defined by restricting the couplings in optimization problem (3) to the smaller class couplings π which are themselves are Gaussian mixtures (on R2d). Denoted by MW2(P, Q), the mixed-Gaussian Wasserstein distance between two Gaussian mixtures P = PI i=1 w1,i N(µ1,i, Σ1,i) and Q = PJ j=1 w2,j N(µ2,j, Σ2,j) is equivalently expressed in the following computationally tractable formulation. Denote the Wasserstein-2 distance between N(µ1,i, Σ1,i) and N(µ2,j, Σ2,j) by ωij and let Ω= (ω2 ij)I,J i,j=1. Then MW2 2(P, Q) is given by MW2 2(P, Q) def.= min V [0, )I J tr(V Ω) subject to V 1J = w1 where 1L = (1, . . . , 1) denotes the L-dimensional vector of ones. Central to the tractability of MW2 2 is that, by (Cuesta and Matran, 1989), the Wasserstein-2 distance between two Gaussians N(µ1, Σ1) and N(µ2, Σ2) is given in closed form by W2 2 N(µ1, Σ1), N(µ2, Σ2) = µ1 µ2 2 + tr Σ1 + Σ2 2 qp A denotes the square-root of a symmetric positive-definite matrix A. There are several relationships between the two optimal transport distances. First, clearly by definition one has W2 MW2; moreover, the inequality is often strict. Conversely, it holds that MW2(P, Q) W2(P, Q) + i=1 w1,i tr (Σ1,i) 1/2 + J X j=1 wj,i tr (Σ2,j) 1/2 ; from which we see that MW2 equals to W2 for finitely supported measures. We use (GM(Rd), MW2) to denote the space of Gaussian mixtures with mixed-Gaussian Wasserstein metric. 2.4 Bi-H older Embeddings A map φ : X , R from a metric space (X, d) and to a (metric) representation space (R, d R) is called a bi-H older embedding if: there is an 0 < α 1, a scale s > 0, and a distortion D 1 such that for every x, x X sdα(x, x) d R(φ(x1), φ(x2)) s D dα(x, x). (4) The smallest value of D for which (4) holds quantifies the worst-case dilation between any two points induced by the feature map φ, upon re-scaling6 by s. As an example, if R = Rn, X Rm, and if φ is smooth then 6. If R is a normed space and φ is invariant under linear maps then we can always absorb s into the model. However, for general representation spaces resealing is not meaningful without the scale s. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS D is roughly maxx φ(x) ; however, for general metric spaces these quantities are meaningless. Thus, a large distortion D only globally distorts X s geometry by stretching all of it. As illustrated in Figure 3, a small value of α relatively distorts the space s geometry by placing more importance (illustrated by bright colours) on nearby pairs of points and less importance on distant pairs of points (relative to the color redred vertex in the center of Figure 3). This has the effect that large scales become more minor and small scales become larger when α 1, with this effect disappearing as α, approaches 0. Effectively α plays a comparable role to Figure 3: Effect of varying α on the geometry of the embedding metric space. a) α = 1 (Lipschitz): all distances are equally distorted. b) α 1 (fractional): small distances are prolonged, large distances are shortened. tuning the neighbourhood size in classical iterations of graph attention (Veliˇckovi c et al., 2018) or the role of neighborhood size when defining the mean average precision performance metric of (Cruceru et al., 2020; Giovanni et al., 2022). The parameter α (0, 1] can also be interpreted through its role in the loss function, which we use to train our model. If Xn is a set of points which we want to embed into MW(R) then in our numerical experiments, we will train a probabilistic transformer to learn a low distortion bi-H older embedding by numerically optimizing the following loss function which is a proxy for condition (4): X dα X (x, x) MW2(φ(x), φ( x)) . (5) Here, the minimization is taken over the set of a set of models. When α 1 then, (5) over emphasizes embedding nearby pairs of points over embedding distant pairs of points. In contrast, when α = 1, all pairs of points are given equal importance in the embedding. Thus, through the loss function (5), α implicitly plays the role of the neighborhood size defining the graph attention mechanism of (Veliˇckovi c et al., 2018) or the average (distance) distortion loss function (used in e.g. (Samal et al., 2018; Giovanni et al., 2022)). (Krauthgamer et al., 2005) introduce the notion of an aspect ratio of a measure space as the ratio of total mass over the minimum mass at any point. Similarly, we define the aspect ratio of a finite metric space (Xn, dn) as the ratio of the maximum distance between any two points therein (its diameter) over the minimum separation between any two distinct points. We define aspect(Xn, dn) def.= maxx, x Xn dn(x, x) minx, x Xn; x = x dn(x, x). We note that closely related concepts to the aspect ratio can be found in the (approximate) memorization results for deep feedforward networks (Park et al., 2021). Furthermore, requiring that the aspect ratio is positive and finite circumvents otherwise memorization by a feedforward neural network may be impossible (Sontag, 1997a). A related quantity is the diameter of the n-point metric space (Xn, dn), defined by diam(Xn, dn) def.= maxx, x Xn dn(x, x). Thus, the aspect ratio can be interpreted as (Xn, dn) s diameter re-scaled by the reciprocal smallest distance in Xn. KRATSIOS, DEBARNOT, DOKMANI C Figure 4: X s capacity is the number of small balls of radius r/5 that can fit in any ball of radius r. Lastly, we will make use of the notion of metric capacity cap(X, d X ) of a metric space (X, d X ). This dimensional metric invariant is defined similarly to box or Hausdorff dimension (in fractal geometry (Falconer, 1986)), is a formalization of the notion of dimension of a general metric space X. When X is the m-dimensional Euclidean space or when m-dimensional compact Riemannian manifold then, the logarithm of its capacity can be shown to be proportional to m. Illustrated in Figure 4, the capacity cap(X, d X ) of a metric space (X, d X ) is equal to sup d N+ : (xi)d i=0 X d+1, r > 0 s.t. d i=1 BX (xi, r/5) BX (x0, r) , where BX (x, r) def.= {u X : d X (u, z) < r} and denotes the union of disjoint sets. We refer the reader to (Heinonen, 2001, Chapter 10) and (Bru e et al., 2021, Proposition 1.7) for further details on metric capacity and to (Le Donne and Rajala, 2015) for an exposition of metric-theoretic notions of dimension. 3. The Probabilistic Transformer Model Since any good representation space should admit nice feature maps which can be approximated by standard neural networks. Our main result shows that this holds for MW2(R), with the neural network being the probabilistic transformer. The PT model, introduced below and illustrated in Figure 5, processes a dataset (X, dn) from a metric space (X, d X ) in three phases before producing a probabilistic output in GM2(Rd). Linear Linear Linear 𝕕 Linear Linear + abs Figure 5: The architecture of a probabilistic transformer. First, data in (Xn, dn) is encoded into an L-dimensional vector of distances to a set of landmarks x1, . . . , x L Xn. The encoding, illustrated by the Φ{xl}L l=1 layer in Figure 5, is reminiscent of the graph attention mechanism of (Veliˇckovi c et al., 2018). The un-normalized graph attention mechanism is defined as Φ{xl}L l=1 : Xn x 7 (dn(x, xl))L l=1 RL. (6) The landmarks act as an intrinsic reference frame, they identify points in Xn by their relative positions to {xl}L l=1. This can be seen as a non-linear counterpart to a truncated orthonormal basis of a Hilbert space. When all points in X are chosen as landmarks, this un-normalized graph attention mechanism is a finite-dimensional version of the embedding of (Xn, dn) into (Rn, ℓ ) (Fr echet, 1906). Remark 1 The points in x1, . . . , x L in the definition of Φ{xl}L l=1 are in principle trainable and could be discovered algorithmically instead of being specified by the user. Next, this vectorial data is processed by a set of independent, parallel Re LU-feedforward networks NN w : RL RK, and K Re LU-feedforward networks NN 1 : RL Rd Rd d, . . . , NN K : RL Rd Rd d. The role of the networks {NN k}K k=1 is to process inputs to a d-dimensional Gaussian measure-valued outputs SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Nd(µ, Σ) by outputting a mean vector µ and a square-root of a covariance matrix Σ Σ. The role of the network NN w is to generate input-dependent weights in the K-simplex {w [0, 1]K : PK k=1 wk = 1} used to mix the Gaussians into a Gaussian mixture. Remark 2 It is theoretically sufficient for the Re LU network to be independent. In practice, this can be relaxed by merging them into a single network as we do in Section 5. The predictions of all networks are then combined using the probabilistic attention mechanism (Kratsios et al., 2022) which generalizes the construction of (Bishop, 1994). The (Gaussian) probabilistic attention mechanism is defined for any K N+, any weight w RK, and any set of keys Y def.= ((µ1, Σ1), . . . , (µK, ΣK)) RK (d+d2) by P-Attn(w, Y ) def.= k=1 σK(w)k N(µk, Σ k Σk) P2(Rd); where we have identified Rd+d2 with Rd Rd d (where RN M is the set of N M matrices) and σK(u) def.= (euk/ PK i=1 eui)K k=1 is the softmax function. The probabilistic attention mechanism s place in our model is illustrated by the last layer in Figure 6a, which synthesizes the mixtures from the (µk)K k=1, (Σk)K k=1, and the (wk)K k=1 parameters produced by the feedforward networks. The parameters defining the Gaussian mixtures are implemented by deep feedforward networks with Re LU activation function Re LU(x) def.= max{0, x}. Let D N+. Fix a depth7J N+ and a multi-index [d] def.= (d0, . . . , d J) of integers with d0 = d and d J = D. A function NN : Rd RD is said to be a deep feedforward network if for every j = 1, . . . , J there are dj dj 1-dimensional matrices A(j) called weights and b(j) Rdj called biases, such that NN admits the iterative representation NN(x) def.=A(J)x(J) + b(J) x(j) def.= Re LU (A(j)x(j 1) + b(j)) for j = 1, . . . , J 1, x(0) def.= x, where denotes component-wise composition. We now formalize the probabilistic transformer in Figure 6a. Definition 3 (Probabilistic Transformer (PT)) Fix a metric space (X, d) and a D N+. A probabilistic transformer is a map T : X P2(RD) with representation T(x) = P-Attn NN w(u), (NN k(u))K k=1 , u def.= Φ{xl}L l=1(x), (8) where L N+, x1, . . . , x L X, and NN w : RL RK and NN 1 : RL 7 RD RD D, . . . , NN K : RL 7 RD RD D are Re LU feedforward networks. The depth of a probabilistic transformer T (with representation (8)) is defined as the maximum of the depth of each of the feedforward networks NN w, NN 1, . . . , NN K. Similarly, the width of a transformer T is defined as the maximum of the widths of each feedforward network NN w, NN 1, . . . , NN K. These definitions are natural since each of the networks NN w, NN 1, . . . , NN K defining T are independent, in the sense that their layers do not pass inputs to or from one another and their parameters have no explicit relations. The effective dimension of a probabilistic transformer T, is the maximum number of mixtures output by T for any input scaled by the number of parameters defining each Gaussian probability measure in effdim(T) := K (D + D2), where we use the notation of Definition 3. Thus, d T def.= K. We define the number of trainable parameters in T, denoted by par(T), as the aggregate number of trainable parameters amongst the networks 7. The depth is the total number of affine layers defining the network and not the number of hidden layers ; that is, the depth is one more than the number of Re LU activation functions applied component-wise in (7). KRATSIOS, DEBARNOT, DOKMANI C NN w, NN 1, . . . , NN K; i.e. par(T) def.= par(NN w) + PK k=1 par(NN k). As in (Bartlett et al., 2019), the number of trainable parameters encodes network topology by only counting the number of non-zero entries, thus only counting the number of weighted and biases between any two neurons with a direct connection. In the notation of (7), the number of trainable parameters of a neural network NN is defined as par(NN) def.= c0 0 + A(j) 0 + b(j) 0 , where 0 counts the number of non-zero entries in a matrix or vector of a given size. For fully connected feedforward networks of constant width par(NN) is roughly equal to the network s depth multiplied by its width squared; but in general par(NN) will be much smaller since our networks will tend to have a sparsely connected architecture analogously to (Suzuki, 2019) or in the convolutional neural network theory (Zhou, 2020). If one replaces 0 with the Fr obenius and Euclidean norms (for the weight and biases respectively) then, par(NN) becomes a path norm (Neyshabur et al., 2015; Zheng et al., 2019; E. and Wojtowytsch, 2022) typically used in neural network regularization. 4. Main Results Our quantitative results concerning the embedding capabilities of probabilistic transformers in mixed-Gaussian Wasserstein spaces fall into two classes. The first set of results concerns the possibility of embedding and the complexity of the model performing an embedding when no latent geometric structures are present in the dataset, and the latter class of results concerns the embedding of datasets with a latent geometry. Table 1 summarizes the rates in our main results. We emphasize that explicit constants are also available and we refer the reader interested in detailed constants, and precise O rates instead of O to Tables 3 and 4 in Appendix A. Note that we use the following asymptotic notation. Given g, f : N [0, ) will say that g is O(f) if g is O( f) where f = f logarithmic terms. Table 1: Complexity of the probabilistic transformers performing n-point embeddings. Latent Geometry Effective Dimension d T Depth e O( ) Width Ω( ) Result General Ω log(cap(Xn,dn)) α n 1 + log(n5/2 aspect(Xn, dn)) max{d T , n2} Theorem 4 General O (D 2) 2θD log2(n) nΘD 1 + θD + log(aspect(Xn, dn)) max{d T , n2θD} Theorem 5 General - Multivariate Mixtures O n6 (D 1) aspect(Xn, d)2 n2 + log n5/2 aspect(Xn, dn) O n5(n 1) (D 1) aspect(Xn, d)2 Theorem 7 Discrete: Trees M n 1 + log(n5/2 diam(Xn, dn)) max{M, n2} Proposition 11 Discrete: 2-Hop Graphs Ω log(1+ρ(AG)) α n 1 + log(n5/2 diam(Xn, dn)) max{d T , n2} Corollary 12 Manifold: Riemannian, Bounded Curvature Ω m1+α α(1 α)1+α n 1 + log(n5/2 aspect(Xn, dn)) max{d T , n2} Corollary 9 Manifold: Riemannian and Compact Ω(d) n 1 + log(n5/2 aspect(Xn, dn)) max{d, n2} Proposition 8 4.1 Embedding Guarantees for General Datasets Our first main result shows that any metric space X can be represented in (GM2(R), MW2) with a small PT guaranteeing that any fixed finite set of points can be embedded with little distortion, if we allow for minor perturbations to X s geometry. By a small PT , we mean that the representation implemented by the PT of Theorem 5 does not face the the curse of dimensionality. Theorem 4 (Deterministic Fractional Embeddings of Large Data by Small Transformers) There is an absolute constants C > 0 such that, for every metric space (X, d X ), every finite subset Xn X with n 2 points, and every geometric perturbation parameter 1 2 < α < 1, there exists a probabilistic transformer T for which: for every x, x X dα X (x, x) W2(T(x), T( x)) MW2(T(x), T( x)) CCXn dα X (x, x), SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS def.= 12 log(cap(Xn)) (1 α) 1+α . Furthermore, T s effective dimension, depth, and width are recorded in Table 1 (with explicit constant given in Tables 3 and 4). Independently of any machine learning model, results such as (Bartal et al., 2005) confirm there are n-point metric spaces Xn with the property that any finite subset Xn that bi-Lipschitz embeds into ℓ2 with distortion D cannot contain more than n1 ϑD points, for some ϑD > 0 depending only on D. Equivalently, if we uniformly and independently pick a pair of points in Xn at random then they belong to Xn with probability at-most 1 n2ϑD . This places a fundamental limitation on the number of pairs of points which can be bi-Lipschitz embedded with distortion D using any model. In particular, this type of limitation must translate to the PT model. In contrast, this is not the case, when working with almost bi-Lipschitz (α-H older with α 1) embeddings. Nevertheless, the next result provides a probabilistic guarantee that a PT embeds pairs of points in the set of landmarks Xn with a given distortion. As we may expect, the likelihood that any two given data points are embedded with a prescribed (maximal) distortion is proportional to the size of the distortion. In other words if we do not allow any perturbation to the landmark set s geometry then, with high probability, most pairs of points can be bi-Lipschitz embedded into (GM2(R), MW2) with large distortion and with small probability most pairs of points in the landmark set can be embedded with small distortion. Theorem 5 (PAC-Type Embeddings of Large Data by Small Transformers) Let (X, d X ) be a metric space, fix a finite subset Xn of X with n 2 points, and let P be the uniform probability measure on Xn. There is a 1 e2 < δn < 1 depending only on n such that for every δn < δ < 1, there exists a scale s > 0 and a probabilistic transformer T : (Xn, dn) , (GM2(R), MW2) satisfying P sd X (x, x) W2(T(x), T( x)) and MW2(T(x), T( x)) s D(D 1)d X (x, x) δ, Abbreviate λ def.= logn( δ). The distortion parameter D > 2 is given explicitly by Furthermore, T s effective dimension, depth, and width are recorded in Table 1 (explicit constants are in Tables 3 and 4). Remark 6 (The Minimum Valid Probability δn in Theorem 5) The quantity δn in Theorem 5 is defined by δn def.= max 1 e2 , δn where δn is the unique 1 e2 δ < 1 solving logn( δ) = 1 + logn( Even though the multivariate transport distances are more expensive to compute, it is natural to ask whether there are any advantages in representing metric spaces as mixtures of multivariate instead of univariate Gaussians. The answer is yes, as shown by the following deep neural analogue of the main finding of (Andoni et al., 2018). The result shows that if one considers multivariate Gaussian mixtures, there are probabilistic transformers that implement metric embeddings of arbitrarily low distortion on any finite subset of a given metric space; however, the effective dimension of these embeddings does become large. We note that it still depends only polynomially on the number of points being embedded. Theorem 7 (Distortionless Fractional Embeddings into Multivariate Gaussian Mixtures) Let (X, d) be a metric space and Xn X a finite subset with at least two points. For every distortion D > 1 there exists a probabilistic transformer T : X GM2(R3) for which there is a scale s > 0 such that for every x, x Xn s d1/2(x, x) MW2 T(x), T( x) D s d1/2(x, x). (9) Moreover, the complexity of T s given in Table 1. KRATSIOS, DEBARNOT, DOKMANI C Theorem 7 shows that one can achieve arbitrarily low distortion with embeddings into the space of multivariate Gaussian mixtures. In the more computationally efficient setting, where we consider embeddings into univariate Gaussian mixtures, Theorems 4 and 5 provide efficient embedding guarantees for arbitrary finite subsets of general metric spaces with small but non-vanishing distortion. It is natural to ask whether we can obtain stronger guarantees if Xn is a subset of a well-behaved X with a geometric prior . Indeed, we show below that in this case the guarantees in Theorem 1 can be improved to deterministic bi-Lipschitz guarantees. 4.2 Consequences: Improved Guarantees When (Xn, dn) has a Geometric Priors Theorems 5 and 4 can be straightforwardly applied to any finite metric space, in the sense that there is no (Xn, dn) quantities to plug-in , and this is true regardless of if dn encoded some additional latent geometric priors . Nevertheless, a fully explicit use of Theorem 4 requires an explicit estimate on (Xn, dn) s metric capacity. Conveniently, such estimates can be derived when (Xn, dn) has some latent geometry. We consider two different types of geometries from which the n-point set of landmarks is drawn. For instance, one may alternatively interpret Xn as a training dataset. The first is the smooth geometries arising from suitable Riemannian manifolds and the second is discrete geometries arising from well-behaved types of combinatorial graphs. 4.2.1 SMOOTH GEOMETRIC PRIORS In the case of a smooth geometric prior , meaning that points are drawn from a Riemannian manifold with suitable curvature, the required number of mixtures and the embedding distortion can be made independent of the number of datapoints n. A fortiori, the necessary number of mixtures is proportional to the latent Riemannian manifold s dimension. Proposition 8 (Representation of Riemannian Manifolds with bi-Lipschitz Guarantees) Fix d N+, let (M, g) be a d-dimensional Riemannian manifold with geodesic distance dg, and let K M be compact. For any finite subset Xn K with n 2 points, there is a constant CK > 0 depending only on K and a PT as in Theorem 4 satisfying: for every x, x X the following holds dg(x, x) W2(ϕ(x), ϕ( x)) MW2(ϕ(x), ϕ( x)) CKdg(x, x). Furthermore, T s effective dimension, depth, and width are recorded8 in Table 1. Proposition 8 provides bi-Lipschitz embedding guarantees for finite subsets of a general Riemannian manifold via a small PT. If the latent Riemannian manifold has bounded Ricci curvature, then we find that the constant CK can be made explicit and independent of any fixed compact subset of the Riemannian manifold. For this, we apply Theorem 4, and we turn to representations by our PT model with n-point (non-Lipschitz) bi-H older embedding guarantees. Corollary 9 (Representation of Riemannian Manifolds with Bounded Ricci Curvature) Fix d N+, let (M, g) be a complete m-dimensional Riemannian manifold lower-bounded Ricci curvature and with geodesic distance dg. There is an absolute constant C > 0 such that, for every finite subset Xn M with n 2 points, there is a PT as in Theorem 4 satisfying dα g (x, x) W2(T(x), T( x)) MW2(T(x), T( x)) C m 1 α 1+α dα g (x, x). Furthermore, T s effective dimension, depth, and width are recorded in Table 1 (with explicit constant given in Tables 3 and 4). 8. Explicit constant are recorded in Tables 3 and 4. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS As a concrete example of Corollary 9 we obtain the following embedding guarantee for spherical data. It is worth noting that the sphere is both a simple yet interesting example since even in a Euclidean representation space, the result of (Robinson, 2006a) implies that the sphere cannot be embedded into any Euclidean space with no distortion9. Example 1 (Embedding Spherical Data) Consider the m-dimensional sphere Sm def.= {x Rm+1 : x = 1} with Riemannian geometry inherited from its inclusion into the Euclidean space Rm+1. In this geometry, the distance between any two x1, x2 Sm is dg(x1, x2) = arccos(x 1 x2) and Sm has lower-bounded Ricci curvature (see (Jost, 2017, page 276)). Therefore, for every n-point subset Xn Sm (n N+) there exists a PT T with width, depth, and number of mixture components as in Corollary 9 satisfying: for every x, x Sm it holds that d3/4 g (x, x) MW2(T(x), T( x)) C 4m 1+3/4 d3/4 g (x, x), and C > 0 is independent of n and of m. 4.2.2 COMBINATORIAL PRIORS A particularly useful class of combinatorial graphs are trees; i.e. graphs in which no path originating at one vertex can ever come back to itself without passing the same edge at least twice. The simplicity of the combinatorial geometry of trees begs the question: can the transition between the deterministic embeddings of Theorem 4 (with geometric perturbation parameter α < 1) and the probabilistic embeddings of Theorem 5 (with geometric perturbation parameter α = 1) be avoided, and can the simple geometry of combinatorial tree allows us to obtain purely deterministic embedding results? The following result gives an affirmative answer. Furthermore, an embedding with O(n1/(M 1)) distortion is possible by using a probabilistic transformer with at most O(M) mixture components. Remark 10 (Simplified Result Formulation for Graphs) For simplicity, we consider the case where the graph in question is finite and where each point in X is a landmark. However, one can easily see how the statement extends to the general case considered in Theorem 4. Proposition 11 (Bi-Lipschitz Representations of Combinatorial Trees) Let G = (V, E) be an n-vertex tree and let (V, d G) be its associated metric space (as in Example 2.1). There is an absolute constant C > 0 such that, for any positive integer M where M, n 2, there exists a PT T with width and depth as in Theorem 4 and such that d G(x, x) W2(T(x), T( x)) MW2(T(x), T( x)) Cn1/(M 1)d G(x, x). Furthermore, T s effective dimension, depth, and width are recorded in Table 1 (with explicit constant given in Tables 3 and 4). Therefore, for combinatorial graphs with simple geometric priors, the deterministic to probabilistic transition between Theorem 4 and 5 can be avoided. Next, we consider where we wish to represent data with manifold priors using the probabilistic transformer architecture. Proposition 11 was derived using the same method of proof as for Theorem 4, but specialized to the geometry of combinatorial trees. Instead, the next result is a consequence of Theorem 4. Many well-studied graphs are such that each vertex is reachable by hopping over at most one other. We refer to these types of graphs as 2-hop graphs . Examples of 2-hop combinatorial graphs include, friendship graphs (Erd os et al., 1966), complete bipartite graphs as in Mantel s Theorem (Erd os, 1964/65), cocktail-party graphs (Jungerman and Ringel, 1978), wheel graphs (Buckley and Harary, 1988), and various others. To state our guarantees for 2-hop graphs we require some additional terminology. The adjacency matrix AG of a graph G = (V, E), whose vertices we enumerate by V = {vi}n i=1, is the n n-matrix with binary 9. We note this, to clarify the misconception that the embeddings of (Nash, 1954) are not in the metric sense (as considered in this article) but are in a different (Riemannian sense). KRATSIOS, DEBARNOT, DOKMANI C with (AG)i,j = 1 if and only if there is an edge between the vertices vi and vj. The spectral radius of AG, denoted by ρ(AG), is the largest absolute eigenvalue of AG. Corollary 12 (Representation for 2-Hop Combinatorial Graphs) Let G = (V, E) be an n-vertex 2-hop graph and let (V, d G) be its associated metric space (as in Example 2.1). Fix 1 2 < α < 1, and let T be the PT of Theorem 4. Each T(x) is comprised of no more than 12 Cα 1 log(1 + ρ(AG)) mixture components and satisfies: for each x, x Xn the following holds dα G(x, x) W2(T(x), T( x)) MW2(T(x), T( x)) C 12 log(1 + ρ(AG) dα G(x, x). Furthermore, T s effective dimension, depth, and width are recorded in Table 1 (with explicit constant given in Tables 3 and 4). The embedding guarantee in Corollary 12 can be further approximated using additional information about G s connectivity. Specifically, define G s maximum degree to be the maximum number of edges connected to any of its vertices, denoted by deg+(G), and its minimum degree to be the smallest number of edges connected to any of its vertices, denoted by deg (G). We use the following example to showcase the explicit constants and PT complexity estimates which are derived through our analysis (also recorded Appendix A). Example 2 (Embedding of 2-Hop Graphs in Terms of Connectivity Information) Consider the setting of Corollary 12, with α = 3 4, n 12, and let T be the probabilistic transformer described by that result. The upper-bound on the spectral radius ρ(AG) given in (Das and Kumar, 2004, Theorem 2.7) implies the following embedding guarantee d3/4 G (x, x) MW2(T(x), T( x)) CCG d3/4 G (x, x), where CG = 48 log 1 + q 2(#E)(n 1) deg+(G) + (deg+(G) 1) deg (G) 7/4 . Furthermore, the complexity of T can be estimated in terms of G s degree, its number of edges, and its diameter; namely, effdim(T) = O log 1 + q 2(#E)(n 1) deg+(G) + (deg+(G) 1) deg (G) depth(T) = O c + log n5/2 diam(Xn, dn) We conclude our theoretical analysis by illustrating the fundamental limits of what can be achieved when embedding arbitrary finite metric spaces. Our results highlight the limitations of our model when globally embedding an (Xn, dn) with no geometric prior into (GM2(R), MW2), as well as the mathematical limitations of any model embedding such a space into any smooth and finite-dimensional representation space. 4.3 Limitations to Global Embedding Capacity for Datasets with no Geometric Priors It is natural to ask if the transition between the probabilistic bi-Lipschitz embedding of Theorem 5 and the deterministic bi-H older embedding of Theorem 4 is just an artifact of our analysis. The answer is no. The next result confirms as the phenomenon is generic and persists for any regular finite-dimensional embedding. We begin by considering the case of complete Riemannian manifold R of bounded negative curvature for which the frequently used hyperbolic spaces in non-Euclidean representation learning are prototypical. Our next result combines the results of (Naor et al., 2006) and (Biggs and Hoare, 1983) to construct an explicit sequence of bipartite combinatorial graphs (namely the sextet graphs introduced by (Biggs and Hoare, 1983)) which cannot be embedded into R with non-diverging distortion. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Proposition 13 (Impossibility of Manifold Embedding of Fixed Finite Dimension) Let R be a complete Riemannian manifold with negative sectional curvature bounded in [ C, c] where 0 < c C. Then there exists a constant c R > 0 depending only on R such that, for every prime integer n 2 satisfying n mod 16 { 3, 5, 7} there is a combinatorial graph (X, d X ) with n-vertices and degree exactly 3 such that every bi-Lipschitz embedding f : X R must have distortion D bounded below by 4 3 log2(n) 2. (10) Remark 14 (Improved Lower Bound Via Expander Graphs) One can explicitly obtain larger lower bounds than in 10 by using the result of (Margulis, 1982, 1988) (both discovered independently and improving a result of (Erd os and Sachs, 1963)) stating that: for every δ 3 there is a diverging sequence {nk}k N and graphs {Gnk}n N each with nk-vertices such that each Gnk has girth at-least 3 4 logδ 1 nk . Using this sequence of expander graphs, instead of the sequence of Sextet graphs used in the proof of Proposition 13, implies that there is no bi-Lipschitz embedding of Gnk into R with distortion less that 4 3 logδ 1(n) + 2 . The advantage of the construction in Proposition 13 is a simple explicit class of graphs. We complement Proposition 13 by considering the case of compact Riemannian manifolds or Euclidean spaces. We again find that there is a sequence of pathological finite metric spaces which cannot be bi-Lipschitz embedded into such representation spaces with non-diverging distortion. Proposition 15 (Impossibility of Manifold Embedding of Fixed Finite Dimension) Let R be a Riemannian manifold which is bi-Lipschitz embedded in the Hilbert space ℓ2. There is a constant c R > 0 (depending only on R) such that for every positive integer n, there is an n-point metric space (X, d X ) for which every bi-Lipschitz embedded f : X R has distortion D satisfying D c R log(n) log log(n) . Example 3 (Compact Manifolds Satisfy Proposition 15) If R is a compact Riemannian manifold then Whitney s embedding theorem implies that there is a smooth diffeomorphism (onto its image) ϕ : R R2d. Therefore, Rademacher s theorem and the compactness of R implies that ϕ is a bi-Lipschitz embedding. Thus, R satisfies the conditions of Proposition 15 by the Whitney embedding theorem. The PAC-type bound of Theorem 5 can be reformulated as a function of the distortion D (instead of the probability δ). Doing so allows us to compare our bi-Lipschitz embedding results with the above two impossibility results. We now record this inverted formulation. Remark 16 (Inverted Form of Theorem 5 where Distortion is Given Instead of δ) The relationship between the distortion D and the probabilistic level 0 < δ < 1 in Theorem 5 can be inverted. In that case, our proof implies that for any prespecified distortion level D > 2 there is a PT T as in Theorem 5 satisfying: P sd X (x, x) W2(T(x), T( x)) and MW2(T(x), T( x)) s D(D 1)d X (x, x) n 2+2θD, where θD is the unique solution to 2 D = (1 θ)θ/(1 θ) over [1 2e/D, 1). Juxtaposing either of the impossibility results in Proposition 15 or Proposition 13 with Theorem 4 (as formulated in Remark 16), we find that, even if there are n-point metric spaces which simply cannot be well-represented in most Riemannian representation spaces, we can always identify a subset of any such KRATSIOS, DEBARNOT, DOKMANI C space which can be arbitrarily well-represented in MW2(R). Furthermore, the embeddings are explicitly implemented by PT models whose required number of parameters we now know. The distortion D > 2 controls the size of this subspace we are willing to accept. For instance when comparing against (13) we can select any distortion D 2, c R q 4 3 log2(n) 2 and always deduce the existence of an embedding on a subset of those sextet graphs (this is not necessarily the case for a complete Riemannian representation space with pinched negative curvature). 5. Experiments In this section we complement the theoretical embedding guarantees from Section 4 by preliminary computer experiments on synthetic data. We show that the proposed feature maps can indeed be trained in a standard deep learning framework, that the theoretical advantages of PT mixture-Wasserstein embeddings over Euclidean and hyperbolic carry over to practice, and that the PT-based feature maps generalize beyond Xn. We compare our proposed representation maps to learned representation maps in Euclidean and hyperbolic space when representing trees, random graphs and graphs sampled from Riemannian manifolds.10,11 Our first experiment compares the capacity of deep neural embedding in the space (GM2(R), WG2(R)) to embed combinatorial trees against the state of the art; namely embeddings into the hyperbolic plane. Our second experiment examines the efficiency with which PTs utilize dimension. We compare how well the PT embeds points on a simple high-dimensional compact Riemannian manifold, namely the sphere, to how well a standard feed-forward network can embed those same points into a Euclidean space of equal dimension. The probabilistic transformer we implement is illustrated in Figure 6a. The weights of the network are computed following (Delon and Desolneux, 2020), by minimizing MW2 2(Tθ(x), Tθ(y)) d2α X (x, y) 2 . (Loss) In practice we set α = 1 as it does not have a strong influence on empirical performance. We conjecture that this is due to the influence of sampling and optimization which result in an error floor . Numerical parameters All networks are trained by the Adam optimizer in pytorch, with weight decay parameter 10 6, initial learning rate 10 4 and final learning rate 10 6. 5.1 Hyperbolic vs. Gaussian Mixtures Geometry: Embedding Trees Hyperbolic embeddings are the state-of-the-art when embedding metric trees. Indeed, results such as (Sarkar, 2011) show that two dimensions is enough to embed any metric tree with low distortion whereas finite trees can only be embedded into d-dimensional Euclidean space with low distortion if d is large (Gupta, 2000). We compare the ability of (GM2(R), MW2) s geometry to accommodate deep neural embeddings of metric trees against that of the hyperbolic plane. We implement the embedding maps using (universal (Kratsios and Papon, 2022; Acciaio et al., 2023)) overparamterized neural network models, in an effort to minimize the chance that differences in expressivity obscure the geometric effects. As we show below, the additional flexibility of (GM2(R), MW2) over the hyperbolic plane is manifested even when embedding regular binary trees. EMBEDDING INTO THE HYPERBOLIC PLANE We first consider the hyperbolic plane (Ganea et al., 2018; Cruceru et al., 2021) which theoretically embeds trees with arbitrarily small distortion (Sarkar, 2011, Theorem 6). In the next section we look at a higher-dimensional 10. More precisely, on neighborhood graphs constructed over points sampled from manifolds. 11. The Python codes used to produce the results of this section are available at https://github.com/swing-research/ Universal-Embeddings. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS hyperbolic space of the same dimension as the number of the degrees of freedom of our mixture-Wasserstein embedding. The hyperbolic space has several isometric (up to a scalar) representations. In this section, we first rely on its canonical (in the sense of (Dowty, 2018)) information geometric representation12. The set of non-degenerate Gaussian probability measures on R is defined by def.= N(µ, σ) P1(R) : N(µ, σ) dx exp (x µ)2 , µ R, σ (0, ) . This set can be made into a Riemannian manifold with the Riemannian metric defined by the Fisher information matrix (Amari, 2016). As shown in (Costa et al., 2015, Equation (7)), the geodesic distance (called the Fisher Rao distance) between two Gaussian probability measures N(µ1, σ1), N(µ2, σ2) G+ 1 in this Riemannian geometry is d F (N(µ1, σ1), N(µ2, σ2)) = 2, σ1) ( µ2 2, σ2) + ( µ1 2, σ1) ( µ2 2, σ1) ( µ2 2, σ2) ( µ1 2, σ1) ( µ2 Moreover, importantly for our comparison, d F is (up to a constant factor of 2) equal to the hyperbolic distance on H2 def.= R (0, ) which is precisely the parameter space of G+ 1 as well as the upper half-plane model of the hyperbolic space. There are two reasons to first look at the 2D hyperbolic embeddings. First, they are easy to visualize; second, it allows us to some extent to dissociate the effects of geometry and expressivity and ease of optimization of neural networks. We benchmark our representations against the metric transformer model of (Acciaio et al., 2023, Example 11) (which for clarity we call the G+ 1 -transformer) since that model is universal (Acciaio et al., 2023, Theorem 3.6). It is a variant of hyperbolic neural networks (Cruceru et al., 2021), which are known to be universal (Bilokopytov and Kratsios, 2020, Corollary 3.16). The weights of the metric transformer Tθ are obtained by minimizing LG+ 1 (θ) = X d2 F (Tθ(x), Tθ(y)) d2 X (x, y) 2 . (11) The main difference with our model is that the metric transformer makes convex combinations in the parameter space of Gaussian measures, not in the space of probability measures itself. EMBEDDING INTO THE d-DIMENSIONAL HYPERBOLIC SPACE Even though the hyperbolic plane is theoretically sufficient to embed trees with small distortion (Sarkar, 2011), experimental evidence indicates that neural networks representations into higher dimensional hyperbolic space perform better (Giovanni et al., 2022). We therefore also compare our representations with representations in d-dimensional hyperbolic space represented by the Poincar e upper-half space model, whose underlying set it Hd = {x Rd+1 such that xd+1 > 0}, and whose geodesic distance is given by d Hd(x, y) = arccosh 1 + x y 2 2 2xd+1yd+1 We adapt the output of the probabilistic transformer to return vectors in Hd, by swapping its last layer for a trainable linear readout map (see Figure 6c) and modifying the embedding objective to minimize the distortion with respect to the hyperbolic distance. The d-dimensional hyperbolic model is then trained using the following loss function: d2 Hd(Tθ(x), Tθ(y)) d2 X (x, y) 2 . (12) 12. Because it shares many similarities with our embedding space (GM2(R), MW2) KRATSIOS, DEBARNOT, DOKMANI C Linear Linear Linear 𝕕 Linear Linear + abs Linear + abs Figure 6: Two different networks implementing feature (embedding) maps. The first block (violet dashed line) is common to all networks; the output blocks change to accommodate different target spaces. a) Probabilistic transformer of (8). b) G+ 1 -transformer. c) Hd-transformer. EXPERIMENTAL RESULTS We consider a regular binary tree X = (V, E) (Figure 7a) of depth six with a total of |V | = 127 vertices. The distance d X is the shortest path distance. We partition the vertices V into training and testing sets, Vtrain Vtest = V , with |Vtrain| = 111 and |Vtest| = 16. The test vertices (colored white in Figure 7a) are used to evaluate the quality of out-of-sample representations (that is to say, the generalization) computed by the different representation maps. We designate L = 20 training vertices as landmarks (colored black in Figure 7a). The two probabilistic transformers take as input the distances between a vertex and the L landmarks. The neural networks contains more than 7, 000 parameters with only the last layer being adapted to produce an output in the desired space. We use K = 5 mixture components and d = 15 for the dimension of the hyperbolic space to ensure a fair comparison with the probabilistic transformer s effective dimension. GM2(R) H2 H15 Training points Mean error 0.03 0.15 0.04 Max. error 0.25 0.44 0.26 Testing points Mean error 0.02 0.19 0.05 Max. error 0.25 0.44 0.26 Table 2: Average and maximum relative embedding errors for binary tree representations. Table 2 shows that our neural representations in the space of Gaussian mixtures perform better than H2 even when embedding simple trees, and that they perform on par or slightly better than H15. The difference in performance between H2 and H15 epitomizes the difficulty in verifying theoretical facts with computer experiments: even though theoretically both spaces can achieve low distortion, it is easier to approximate a good feature map for H15. We report the average relative absolute difference between the pairwise distances in SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS the binary tree as compared to embeddings in the respective representation spaces. The training set evaluates the embedding quality of the given set of landmarks. The role of the test set is to check whether the learned representation still provides a good embedding using the set of landmarks in the binary graph13. We visually assess the quality of the computed representations in Figure 7: Figure 7b displays the probability density functions of the GM(R) representations and Figure 7c displays the information-theoretic hyperbolic representations using Gaussian measures. In both cases the networks learn representations that change continuously with the shortest path distance in X. 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 Figure 7: Two embeddings of colored vertices in the binary tree X shown in (a); white vertices are not seen during training; landmarks are colored black. We apply the transformation σ 7 log10(σ + 2.1) to aid visualization. (b) Mixture Gaussian Wasserstein embeddings represented by their density. (c) hyperbolic (univariate Gaussian) embeddings represented by their density. Figure 7 visually examines the stability of the embeddings produced by the PT against those learned by its hyperbolic counterpart. Since both output univariate probability distributions, a visual comparison can easily be drawn. When comparing Figure 7c to Figure 7b we see that the embedding of the path in Figure 7 is only a bit more contorted than the embedding produced by the PT s hyperbolic counterpart. This can explain why both embeddings generalize similarly well, but that the gap in embedding quality is explained by the richness of the geometry of (GM2(R), MW2) as compared to the rigidity of the hyperbolic plane s geometry. 13. Our theorems give memorization guarantees, or, phrased differently, approximation guarantees for embeddings of finite metric spaces. In computer experiments we thus deliberately explore the generalization aspect. The related theoretical questions remain open. KRATSIOS, DEBARNOT, DOKMANI C 5.2 PTs Leverage Dimension Efficiently: Manifold Embeddings We now compare the capacity of the Euclidean space with (GM2(R), MW2) to represent smooth geometries using few effective dimensions. We represent datasets from N-dimensional spheres equipped with the geodesic distance. We remark that, as shown in (Robinson, 2006b), there exist no bi-Lipschitz embeddings of the sphere into any Euclidean space with arbitrarily low distortion.14 In this case our findings corroborate known facts and also show that the PT leverages the rich geometry of (GM2(R), MW2) to produce better low-dimensional embeddings standard feedforward neural networks taking values in high-dimensional Euclidean spaces. EMBEDDING INTO EUCLIDEAN SPACE Despite negative theoretical results, Euclidean representations are attractive in practice thanks to the wealth of available machine learning and numerical tools. It is thus desirable to compare them with our proposed representations in the space of measures. To this end we modify the output of our probabilistic transformer to output vectors in Rd, by swapping its last layer for a trainable linear readout map (see Figure 8) and modifying the embedding objective to minimize the distortion with respect to the Euclidean distance ℓ2 d. Naively, one may expect a well-performing Euclidean representation for sufficiently high embedding dimension d. However, as shown below, even for very large d there is a hard limit to representation quality, reflecting the introductory theoretical notes. We let SN def.= {z RN+1 : z 1} denote the N-dimensional sphere and equip it with the geodesic distance d SN (x, x) = cos 1 x x , x, x SN (Dai and M uller, 2018; Fletcher et al., 2009). We randomly sample data points {xi}n i=1 from the uniform probability measure on SN. The resulting Euclidean model is a feedforward network with a suitable Euclidean loss function, LEuclid(θ) = X Tθ(x) Tθ(y) 2 ℓ2 d d2 X (x, y) 2 . (13) 5.2.1 2-DIMENSIONAL SPHERE S2 We begin by visualizing the GM2(R) embeddings of points on the 2-sphere to develop intuition. A good representation should achieve comparably low distortion across the entire manifold and vary continuously with respect to the geodesic distance on S2. The learned feature map should generalize well from a set of landmarks to arbitrary points on the sphere. We train a PT for 160 iterations with the Adam optimizer (Kingma and Ba, 2015). In each iteration, we use random batch of 32 points among the 10,000 fixed training points chosen from a uniform measure on the sphere. The 13 landmarks are not updated during training. The network contains about 200,000 parameters and returns a Gaussian mixture with K = 5 components. 14. We note that this is different from positive results on Riemannian embeddings of the sphere in Euclidean spaces as exemplified by Nash s theorems (Nash, 1954). These embeddings view the sphere as a submanifold of the Euclidean space but make no guarantees that the embedding preserves the geodesic distance. Figure 8: Replacing the PT s final layer by a linear readout layer yields our feedforward network benchmark. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS The results are shown in Figure 9. For each point xk, colors in Figures 9a and 9b represent the quantities d2 M(xk, xj) MW2 2( ˆTθ(xk), ˆTθ(xj)) and sup 1 i,j K d2 M(xi, xj) MW2 2( ˆTθ(xi), ˆTθ(xj)) . We can see from the figure that the distances in the representation space faithfully represent those on the sphere. This is further detailed in Figure 9c where color of a point encodes the embedding error between the point and the red cross. Notice that even if the maximum error between points may seem large (about 0.35), it is reached only for a very small number of point pairs; most pairs have a relative error bellow 10 2. Figure 9: Embedding quality of 2-dimensional sphere S2 in GM2(R). a) Average error of a point and all the other elements. b) Maximum error of a point and all the other elements. c) Error between the point marked by a red cross and all other points. We then assess the continuity of the embedding in Figure 10. We plot the probability density function of a sequence of points spiralling up from the bottom of the sphere to the top, which we colour-code as progressing from red to blue. As expected, the Gaussian mixtures representing points on the sphere progressively change. We also see that only a few significant mixture components are required to perform the embedding. To aid visualization, we normalize the support of the densities and transform the standard deviation of the Gaussian components using the mapping σ 7 log10(σ + 2.1). Figure 10: Probability density functions of the embeddings of points on S2 into GM2(R). KRATSIOS, DEBARNOT, DOKMANI C 5.2.2 EMBEDDING THE N-DIMENSIONAL SPHERE The motivation to use Gaussian mixtures instead of Euclidean embeddings is that they easily capture complex geometries; achieving comparable distortion would either require a very high-dimensional Euclidean embedding or is simply impossible due to curvature. To illustrate this, we set up an experiment on N-spheres where we vary N. We fix the number of mixture components at the output of the PT to K = 5. The Euclidean network outputs vectors in R15, thus having the same number of degrees of freedom. We also compare these two networks with embedding in the hyperbolic space H15. All networks contain about 200,000 parameters. We set the number of landmarks to L = N + 10 (note that we need at least N landmarks to uniquely localize a point on SN). We distribute the landmarks quasi-uniformly on the N-sphere adapting a procedure from (Debarnot and Weiss, 2022). 15 25 35 45 Manifold dimension Rel. abs. error Euclidean - Feedforward Hyperbolic - Feedforward Gaussian Mixture - PT Figure 11: Impact of dimension. Figure 11 shows the results. As the manifold dimension N increases, the quality of the Euclidean and Hyperbolic embeddings implemented by the deep feedforward networks rapidly deteriorate. In contrast, the distortion of the embedding implemented by the probabilistic transformer into the space (GM2(R), MW2) of (Delon and Desolneux, 2020) remains stable. 6. Outline of the Proofs of Main Results This section overviews the main steps toward proving Theorem 4. Similar techniques are used in deriving the probabilistic Theorem 5. The latter proof is more technical as it relies on the non-linear Dvoretzky theorem (Naor and Tao, 2012) which is a metric version of the classical result of Dvoretzky (Dvoretzky, 1961); we defer it to Appendix B. Our proofs use a novel two-step approach. In step one, we use a key lemma (Lemma 17 below) to show that certain probability-measure-valued functions can be implemented by metric transformers with few parameters. In step two, given any of the three aforementioned geometric priors , we demonstrate the existence of optimal embeddings of the required form. Each of the proofs then concludes by applying the key lemma to show that the embedding can be exactly implemented (memorized) by the probabilistic transformer. The proof is completed by remarking that the PT we have built is defined on the entire metric space X. We note that the d-dimensional Gaussian measure with mean µ Rd and d d covariance matrix Σ is be denoted by Nd(µ, Σ). Lemma 17 (Key Lemma: Probabilistic Transformers Memorize Empirical Markov Kernels) Let d, K be a positive integer, let 1 p < , #Xn = n > 2, and let σ def.= Re LU. Let (Xn, dn) be an n-point metric SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS space and for k = 1, . . . , K, let µk : Xn Rd. Define the map ϕ : Xn GM2(Rd) by ϕ(x) def.= 1 k=1 Nd(µk(x), 0). Then, there exists a probabilistic transformer with unnormalized graph attention T : Xn GM2(Rd), which exactly implements ϕ; i.e.: ϕ(x) = T(x) for every x Xn such that width(T) = max{n, K, d2, n(n 1) + max{d, 12}}, (14) depth(T) = O Cn + log n5/2 aspect(Xn, dn) effdim(T) K (d + d2) (16) 4 max{n, d} n2 1 ( d + max{d, 12}2p n log(n) (17) Cn + log n5/2 aspect(Xn, dn) Moreover, the dimensional constant Cn > 0 is defined by Cn def.= 2 log(5 2 log(n+1) 2 log(2) . Before stating the next lemma, let us recall the notion of a doubling metric space. Briefly, these are metric spaces for which a maximal ratio determines the number of open balls of finite radius required to cover an open ball of twice their radius. More precisely, a metric space (X, d) is said to be doubling if and only if, there is a constant Cd > 0 for every point x X and every radius r > 0 the open metric ball about x of radius r, i.e. the set B(x, r) def.= {u X : d(u, x) < r}, there are at most Cd points x1, . . . , xn X (i.e. n Cd) such that the collection of open metric balls {B(xi, r/2)}n i=1 cover B(x, r). The smallest constant Cd > 0 for which the above relation holds is called the doubling constant of (X, d). Examples of doubling metric spaces are any finite metric space, Carnot Groups such as the Heisenberg group (Pansu, 1989), and more generally, any metric space that can be bi-H older embedded into some Euclidean space (see (Heinonen, 2001, Theorem 12.2) and (Naor and Neiman, 2012)). The doubling constant Cd of a doubling metric space (X, d) plays the role of dimension; for example, for a compact n-dimensional Riemannian manifold log2(Cd) is of the order O(n) and the same is true of the n-dimensional Euclidean space. However, in general it can be difficult to compute the doubling constant of (X, d). Thus, we turn to the notion of metric capacity as defined in (Bru e et al., 2021). The reason is that, as shown in (Bru e et al., 2021, Proposition 1.7) the doubling constant and the metric capacity are proportional (we refer the reader to that paper or the proof of Lemma 21 for details). Our second main embedding lemma concerns the case where (Xn, dn) has its points drawn from a doubling metric space. Since the logarithm of the doubling constant of such spaces plays the role of topological dimension for manifolds (see (Heinonen, 2001, Chapter 12)), it is intuitive that the number of (parameterized) point masses required to embed (Xn, dn) into the Wasserstein space depends on the logarithm of the doubling constant. Lemma 18 (Datasets in Doubling Metric Spaces bi-H older Embed into (GM2(R), MW2)) Let (X, d) be a doubling metric space with doubling constant Cd > 0, Xn X, and dn def.= d Xn. There are constants c, C > 0, such that for any 1 2 < α < 1, there exists an injective map ϕ : (Xn, dn) , (GM2(R), MW2) of the form ϕ(x) = 1 cα 1 log(Cd) cα 1 log(Cd) X k=1 N1(φ(x)k, 0), KRATSIOS, DEBARNOT, DOKMANI C satisfying: for every x, x X the following holds dα X (x, x) W2(ϕ(x), ϕ( x)) MW2(ϕ(x), ϕ( x)) dα X (x, x), where φ : (Xn, dn) (R c log(Cd) , ℓ2) is a α-H older with α-H older constant C log(Cd)2 (1 α) 1+α . Since every finite metric space is a doubling space, then Lemma 18 applies to any finite subset Xn of any general metric space X. Consequentially, Theorem 4 is obtained by applying Lemma 17 to the embedding in Lemma 18. 7. Limitations and Possible Improvements As mentioned in the introduction, our results are the first theoretical guarantees for deep neural representations of finite datasets from arbitrary metric spaces. We prove several embedding guarantees which pleasingly reflect the impact of the ambient geometry of space on the complexity of the representations and the number of parameters of the feature map. These results can be seen as deep neural approximation theorems for embeddings of finite metric spaces which are usually studied in metric geometry non-constructively. To obtain these results we developed a new proof technique which combines elements of metric embedding theory and memorization theory for deep neural networks; this technique may by a tool of independent interest. At the same time, the preludial nature of our work makes it likely that a number of improvements are possible. One natural question is whether we can extend the n-point guarantees to the whole X when X has a regular geometry (for example, when it is a compact Riemannian manifold). While this may require rephrasing some of our questions and using different mathematical tools, it does seem plausible. Further, there are several aspects of the proofs, for embeddings into the space of Univariate Gaussian mixtures, which suggest improvements are possible: 1) a key step passes through a high-dimensional Euclidean space, but we know experimentally that Euclidean embeddings under-perform and theoretically (Naor, Andoni, Neiman (Andoni et al., 2018)) that this is not necessary; 2) the Gaussian mixtures are often degenerated into empirical measures with equal atomic weights, but numerical experiments suggest a clear advantage of using proper Gaussian mixtures. While at the moment we do not know how to leverage these observations, they remain a subject of ongoing work. This research was supported by the European Research Council (ERC) Starting Grant 852821-SWING. Anastasis was also supported by the NSERC Discovery grant (RGPIN-2023-04482) and by their Mc Master University Startup Grant. ACKNOWLEDGMENTS The authors would like to thank Giulia Livieri for her very helpful feedback. A. Explicit Network Complexities: Theorems and Examples The next table summarizes the precise complexities of the probabilistic transformer representing the metric spaces with generic n-point embedding guarantees, which are covered in any one of the theoretical results in this paper. This table is an explicit analogue of Table 1. Remark 19 (Approximate Complexity of T For Low-Distortion Embeddings) The proof of Theorem 5 and a remark on (Naor and Tao, 2012, page 492) shows if D 2 (but D > 2) then, the probabilistic SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Table 3: Width of the probabilistic transformers performing n-point embeddings, with explicit constants. Latent Geometry Effective Dimension d T Width Result General 2 12Cα 1 (log(cap(Xn, dn))) max{d T , n(n 1) + 12} Theorem 4 General O (D 2) 2θD log2(n) O max{d T , n2θD + 12} Theorem 5 General - Multivariate Mixtures 12 n(n 1)/2 (5 5n4 2(D 1) aspect(Xn, d)2 + 2) max n, 25n5(n 1) 4(D 1) (aspect(Xn, d)2 + 2), 9, n(n 1) + 12 Theorem 7 Discrete: Trees 2M max{d T , n(n 1) + 12} Proposition 11 Discrete: 2-Hop Graphs 2 12 Cα 1 log(1 + ρ(AG)) max{d T , n(n 1) + 12} Corollary 12 Manifold: Riemannian, Bounded Curvature 2 C m1+α α(1 α)1+α max d T , n(n 1) + 12 Corollary 9 Manifold: Riemannian & Compact 4d max{d T , n(n 1) + 12} Proposition 8 Table 4: Depth of the probabilistic transformers performing n-point embeddings, with explicit constants. Here c def.= 2 log(5 2 log(n+1) 2 log(2) . Latent Geometry Depth Result n log(n) 1 + log(2) c + log n5/2 aspect(Xn,dn) θD log(n) 1 + log(2) θD log(n) c + 3θD log(n)+log aspect(Xn,dn) General - Multivariate Mixtures O Cn + log n5/2 aspect(Xn,dn) Discrete: Trees O n log(n) 1 + log(2) c + log n5/2 diam(Xn,dn) Proposition 11 Discrete: 2-Hop Graphs O n log(n) 1 + log(2) c + log n5/2 diam(Xn,dn) Corollary 12 Manifold: Riemannian, Bounded Curvature O n log(n) 1 + log(2) c + log n5/2 aspect(Xn,dn) Corollary 9 Manifold: Riemannian & Compact O n log(n) 1 + log(2) c + log n5/2 aspect(Xn,dn) Proposition 8 transformer T in Theorem 5 has depth about the order15 O n 11(D 2) log(1/(D 2)4) +O 11(D 2) log log(1/(D 2)) 2 log(1/(D 2))2 . When n is also large then it has width of the order O n D 2 log(D 2) +O (2D 4) log log(1/(D 2)) log(1/(D 2))2 . B. Proof Details This section contains the paper s main proofs. We begin with the following memorization lemma, on which the proof of Lemma 17 is developed. This result extends the quantitative class memorization result of (Vardi et al., 2022) to a quantitative vector-valued interpolation result. The next Lemma can be contrasted with the VC-bounds of (Bartlett et al., 2019) which imply that a Re LU feedforward network of depth D memorizing how to match N inputs in Rn with C classes must have at-least Ω D ln(D) parameters. We recall that the following convenient notation is used in the next lemma. For each u R we denote [u]+ def.= max{u, 1} = Re LU(u 1). Lemma 20 (Memory Capacity of Deep Re LU Regressors) Let n, d, N N+, let f : Rn Rd be some function, and consider distinct x1, . . . , x N Rn. There exists a deep Re LU network NN : Rn Rd satisfying NN(xi) = f(xi), 15. We use the notation O to hide logarithmic factors. KRATSIOS, DEBARNOT, DOKMANI C for every i = 1, . . . , N. Furthermore, we have the following quantitative model complexity estimates: width(NN) = n(N 1) + max{d, 12}, depth(NN) = O Cn + log N 2 aspect(XN, 2) par(NN) = O N 11 4 max{n, d} N 2 1 d + p N log(N) 1 + log(2) Cn + log N 2 aspect(XN, 2) max{d, 12} h 1 + max{d, 12} i)! The dimensional constant is defined by Cn def.= 2 log(5 2 log(n+1) 2 log(2) > 0 . Proof [Proof of Lemma 20] Step 1: Centering Data Let XN def.= {xi}N i=1 and for any x Rn and any r > 0 denote B2( x, r) def.= u Rn : u x 2 r . Since XN is finite it is compact and therefore by Jung s Theorem (see (Jung, 1901)) there exists some x Rn such that XN def.= B2 x, r and r def.= diam(XN, 2) n1/2 (2(n + 1))1/2 , where diam(XN, 2) def.= max1 i,j N xi xj 2. Thus, the (affine) isometry W : Rn Rn sending any x Rn to x x, bijectively maps XN to the subset XN def.= {u Rn : ( xi XN) u = xi x} of B2(0, r). In particular, since W is an isometry then aspect(XN, 2) = aspect( XN, 2). Since W is an affine map then, the pre-composition of any deep feedforward network by W yields a deep feedforward network of the same depth and width. Step 2: Independently Memorizing Classes By (Vardi et al., 2022, Theorem 3.1), for every x XN there exists a feedforward network NN x : Rn Rd with Re LU activation function satisfying the class memorization property ( 1 : if u = x x 0 : if u = x x (18) for every u XN. Furthermore, NN x has width 12 and depth O N log(N) 1/2 + N log(N) 1/2 max log(R), log(C) , (19) where C = 2 and R = 10π1/2n1/2 N 2 rδ 1 and where δ def.= min1 i,j N; i =j (xi x) (xj x) 2 = min1 i,j N; i =j xi xj 2. We may therefore rewrite (19) as N log(N) 1 + log(2) Cn + log N 2 aspect(XN, 2) where the dimensional constant Cn > 0 is defined by Cn def.= 2 log(5 2 log(n+1) 2 log(2) and where, for any u R we define [u]+ def.= max{u, 1}. Step 3: Parallel Interpolating Re LU Network for each x in XN For every x XN, we view the vector f(x) Rd as a d 1 matrix in Rd 1. Now, since the composition SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS of affine maps is again affine and since the first and last layers of each NN x are affine maps then, for every x XN, the map NN x : Rn Rd defined by NN x(u) def.= f(x) [ NN x W 1(u)], (21) is a Re LU feedforward network satisfying 1. depth(NN x) = depth( NN x) where each depth( NN x) is as in (20), 2. width(NN x) = max{d, width( NN x)} = max{d, 12}, 3. par(NN x) d + par( NN x). Moreover, by construction, for each x XN it holds that ( f(x) : if u = x 0 : if u = x, (22) for every u XN. Furthermore, we can bound the number of parameters par(NN x) defining each of the Re LU networks NN x above by par(NN x) d + par( NN x) depth( NN x1)width( NN x1)(width(NN x1) + 1) + d N log(N) 1 + log(2) Cn + log N 2 aspect(XN, 2) max{d, 12}(max{d, 12} + 1) ! Step 4: Assembling the Parallel Interpolating Re LU Networks Next, we assemble each of the Re LU networks {NN x}x XN into a single Re LU network interpolating f on XN. Since Re LU(x) def.= max{0, x} has the 2-identity property (as defined in (Cheridito et al., 2021, Definition 4) and discussed on page 3 of that manuscript) then, (Cheridito et al., 2021, Proposition 5) implies that there exists a deep diagonalized parallelization ; that is, a map NN : Rn Rd satisfying the following for every u Rn NN x1(u) ... NN x N (u) Furthermore, (Cheridito et al., 2021, Proposition 5) shows that NN has width, depth, and number of non-zero parameters are estimated by 1. Width: NN(u) has width n(N 1) + max{d, 12} 2. Depth: NN(u) has depth N log(N) 1 + log(2) Cn + log N 2 aspect(XN, 2) 3. Number of non-zero parameters: The number of non-zero parameters determining NN is 11 4 max{n, d} N 2 1 PN i=1 par(NN x). When expanded, this can be rewritten as 4 max{n, d} N 2 1 Cn + log N 2 aspect(XN, 2) ! max{d, 12}(max{d, 12} + 1) !! KRATSIOS, DEBARNOT, DOKMANI C Consider the d (Nd) block-matrix A def.= (Id, . . . , Id). Multiplying the outputs of NN(u) on the left by the matrix A defines a map NN : Rn Rd; i.e. for every u Rn the map NN is defined by NN (u) def.= A NN(u). (26) Furthermore, since the post-composition of Re LU feedforward networks is again a Re LU feedforward network then NN is a Re LU feedforward network. Moreover, by construction, the width, depth, and number of parameters defining NN satisfy the same estimates as NN, respectively. Step 5: Verifying that NN Implements f on XN It remains only to verify that NN equals to f on XN. Upon combining the identities in (22), (26), and (24) we conclude that: for each x X the following holds i=1 NN xi = i=1 f(xi)Ixi=x = f(x). (27) Relabeling NN def.= NN yields the lemma s statement. We now move to the proof of the main memorization lemma, namely Lemma 17, which guarantees that probabilistic transformer networks can memorize conditionally Gaussian Markov kernels for any finite number of inputs. In the following, we make use of the finite-dimensional Fr echet embedding ι : (Xn, dn) x 7 (d(x, x)) x Xn. (28) Proof [Proof of Lemma 17] Set L def.= n (where L is as in (6)) and identify the space of d d-matrices with Fr obenius norm Rd d with Rd2 with Euclidean norm (NB, these are isometric metric spaces). Then, ι = Φ{xl}N l=1, where ι is the Fr echet isometric embedding defined in (28). For each k = 1, . . . , K define the maps µk def.= µk ι 1 : ι(Xn) Rd. By Lemma 20, for every k = 1, . . . , K there exists a Re LU network ˆµk : Rn Rd satisfying the following ˆµk(x) = µk(x) = µk ι 1(x), (29) for every x Xn. The following equivalence of the norms 2 n and the fact that ι is an isometric embedding of (Xn, dn) into (Rn, ) implies aspect(Xn, dn) def.= max1 i,j n dn(xi, xj) min1 k,s n; k =s dn(xk, xs) = max1 i,j n ι(xi) ι(xj) min1 k,s n; k =s ι(xk) ι(xs) n max1 i,j n ι(xi) ι(xj) 2 min1 k,s n; k =s ι(xk) ι(xs) 2 = n maxu, u ι(XN) u u 2 minv, v ι(XN); v = v v v 2 def.= n aspect(ι(XN), 2). Thus, Lemma 20 (i)-(iii) and (30) imply that for each k = 1, . . . , K the Re LU network ˆµk satisfies (i) Width: NN has width n(n 1) + max{d, 12}, (ii) Depth: NN has depth of the order of n log(n) 1 + log(2) Cn + log n5/2 aspect(Xn, dn) SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS (iii) Number of non-zero parameters: The number of non-zero parameters in NN is at most 4 max{n, d} n2 1 Cn + log n5/2 aspect(Xn,dn) ! max{d, 12}(max{d, 12} + 1) !! and where the dimensional constant Cd > 0 is defined by Cd def.= 2 log(5 2π)+3 log(d) log(d+1) Next, for every k = 1, . . . , K, define the Re LU network ˆΣk : Rn Rd2 by ˆΣk(x) def.= 0d2 Re LU 0d2 nx + 0d2 + 0d2, (31) where 0d2 n is the d2 n-matrix with all entries 0, 0d2 Rd2 is the vector with all entries 0. NB, by construction max k=1,...,K par(ˆΣk) = 0. (32) Similarly, define the Re LU network ˆw : Rn RK by ˆwk(x) def.= 0K Re LU 0K nx + 0K + 0K. (33) Combining (29), (31), and (33) with the fact that Φ{xl}N l=1 = ι = ι 1 ι(Xn) we have that: for every x Xn the following holds k=1 [σK ˆw Φ{xl}N l=1(x)]k Nd ˆµk Φ{xl}N l=1(x), ˆΣk Φ{xl}N l=1(x) k=1 σK(0)k Nd µk ι 1 ι(x), 0 e0 PK j=1 e0 Nd µk ι 1 ι(x), 0 1 K Nd µk ι 1 ι(x), 0 It remains only to compute the width, depth, and number of non-zero parameters determining T. Since T has width max width( ˆw), max k=1,...,K width(Σk), max k=1,...,K width(µk)} . Therefore, we deduce that width(T) = max{max{n, K}, max{n, d2}, n(n 1) + max{d, 12}} = max{n, K, d2, n(n 1) + max{d, 12}}, and T has depth max {depth( ˆw), maxk=1,...,K depth(Σk), maxk=1,...,K depth(µk)}. Since depth( ˆw) = maxk=1,...,K depth(Σk) = 1 then, T has depth of the following order n log(n) 1 + log(2) Cn + log n5/2 aspect(Xn, dn) KRATSIOS, DEBARNOT, DOKMANI C Furthermore, the number of parameters defining T equals par T def.=par ˆw + h par ˆµk + par ˆΣk i =0 + K h 0 + par ˆΣ1 i . Thus, the number of non-zero parameters defining T is 4 max{n, d} n2 1 Cn + log n5/2 aspect(Xn, dn) max{d, 12}2 !! We now turn our attention to the proof of Theorem 5. B.1 Proof of PAC-Type Representation Theorem for Unstructured Case The following argument uses the notation of an ultrametric space; briefly, an ultrametric space (Z, d Z) is a metric space which satisfies the strong triangle inequality: for all z1, z2, z3 Z one has d Z(z1, z2) max{d Z(z1, z2), d Z(z2, z3)}. Note that, max{d Z(z1, z2), d Z(z2, z3)} d Z(z1, z2) + d Z(z2, z3) so the strong triangle inequality implies the familiar triangle inequality. Proof [Proof of Theorem 5] Let n 2 and fix an n-point metric subset Xn of X. Denote (Xn, dn) def.= (Xn, d X ). Step 1 - Embedding of a Dvoretzky-Type Subspace into Low-Dimensional Euclidean Space Fix ϵ (0, ] and let (Xn, dn) be an n-point metric space. By (Naor and Tao, 2012, Theorem 1.2) there exists a θϵ (0, 1) (depending only on ϵ) and subset Xn Xn of cardinality # Xn nθϵ n1 2e with the following property: there exists a separable ultrametric space (Z, d Z) and a bi-Lipschitz map ϕ1 : ( Xn, dn) , (Z, d Z) such that for every x, x Xn it holds that dn(x, x) d Z(ϕ1(x), ϕ1( x)) (2 + ϵ) dn(x, x). (35) Since (Z, d Z) is a separable ultrametric space then (Vestfrid and Timan, 1979) (or (Shkarin, 2004)) implies that there exists an isometric embedding ϕ2 : (Z, d Z) , (ℓ 2 , ℓ2), where the norm ℓ2 is defined for any x RN by x 2 ℓ2 def.= P k=1 x2 k, ℓ 2 def.= {x RN : x ℓ2 < }, and thus ℓ 2 , ℓ2 the infinite dimensional separable Hilbert space. Define ϕ3 def.= ϕ2 ϕ1. It follows from (35) that: for all x, x Xn dn(x, x) ϕ3(x) ϕ3( x) ℓ2 (2 + ϵ)dn(x, x). (36) By the Johnson-Lindenstrauss Flattening Theorem (Matouˇsek, 2002, Theorem 15.2.1) there exist an s > 0 and a map ϕ4 : (ℓ 2 , ℓ2) (ℓnϵ 2 , ℓ2), where nϵ O(ϵ 2 log2(# Xn)) = O(ϵ 2θϵ log2(n))) (thus nϵ is at-most O(ϵ 2 log2(n) since θϵ < 1) satisfying s z z ℓ2 ϕ4(z) ϕ4( z) ℓ2 s(1 + ϵ) z z ℓ2 , (37) for all z, z ϕ3( Xn). Set ϕ5 def.= ϕ4 ϕ3 : ( Xn, dn) , (Rnϵ, ℓ2). Together (36) and (37) imply that: for every x, x Xn it holds that sdn(x, x) ϕ5(x) ϕ5( x) ℓ2 s(1 + ϵ)(2 + ϵ)dn(x, x). (38) Step 2 - Constructing The Embedding into (GM(R)2, MW2) SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Since Xn is finite, then ( Xn, dn) is compact. Since ϕ5 is a bi-Lipschitz map then it is continuous. Therefore, by (Munkres, 2000, Theorem 26.5) ϕ5( Xn) is a compact subset of (Rnϵ, ℓ2). By the Heine-Borel theorem (Munkres, 2000, Theorem 27.3), there exists an r > 0 such that ϕ5( Xn) Ball Rnϵ,ℓ2(0, r); where, Ball Rnϵ,ℓ2(0, r) def.= {u Rnϵ : x ℓ2 r}. Since ϕ5( Xn) Ball(Rnϵ,ℓ2)(0, r) then (Kloeckner, 2010, Proposition 7.4) applies. More specifically, as formulated in (Bertrand and Kloeckner, 2012, Example 5.5)) there exists some b Rnϵ (which can be computed explicitly using Algorithm 1) such that the map ι : (Ball(Rnϵ,ℓ2)(0, r), ℓ2) x 7 1 k=1 δ nϵ(x+b)k (P2(R), W2), is an isometry. Since the composition of isometries is itself an isometry then, the map ϕ def.= ι ϕ5 : ( Xn, dn) , (P2(R), W2) is an isometry and is of the required form. Lastly, since ϕ is has range in the set of degenerate Gaussian mixtures Z def.= {P GM2(R) : N N+ w N µ1, . . . , µN R s.t. P = PN n=1 wnδµn} then, we obtain the conclusion by applying the comparison result in (Delon and Desolneux, 2020, Proposition 6) which states that the identity map on (W2(R), W2) P 7 P (GM2(R), MW2) is an isometry on the collection of finitely supported probability measures (i.e. degenerate Gaussian mixtures ). Applying Lemma 17 to the map φ : (Xn, dn) (GM2(R), MW2), defined for each x Xn by φ(x) def.= ϕ [x 7 (d(x, x) x Xn)], yields the desired tranformer T satisfying sdn(x, x) MW2(T(x), T( x)) s(2 + ϵ)2dn(x, x), for all x, x Xn. Comment: From here, it now only remains to estimate the probability that a uniformly and independently chosen pair of random points in Xn lie in Xn. Step 3 - Estimating Probability of Independently and Uniformly Sampling two Points in the Dvoretzkytype Subspace Let P be the uniform probability distribution on Xn and let X, X P be independent Xn-valued random elements. By independence and the fact that X and X have the same law we can compute the following P X Xn and X Xn = P X Xn P X Xn = P X Xn 2 . (39) Together, the lower-bound in (34), the definition of P, and (39) imply that P X Xn and X Xn = P X Xn 2 n1 (2e/2+ϵ) = n 4e/(2+ϵ). Moreover, by (Naor and Tao, 2012, Theorem 1.2) the precise value of θϵ is given by the unique solution θ (0, 1) to 2 2+ϵ = (1 θ)θθ/(1 θ); nb, θϵ [1 2e/(2 + ϵ), 1). Labeling D def.= 2 + ϵ, θD def.= θϵ, observing that T is defined on all of X, and noting that by construction d T = k yields the theorem s statement and concludes our proof. KRATSIOS, DEBARNOT, DOKMANI C Step 4 - Inverting the relation between D and θD: Fix δ (0, 1) such that n 2+2θD = δ. Solving for θD yields θD = 1 + logn( Since θD must belong to (0, 1) then (41) only gives a valid θD if δ 1/e2, 1 ; thus, we now impose that constraint. Since θD solves 2 D = (1 θ) θθ/(1 θ) where θ [1 2e D , 1) then (41) implies that In order for D in (42) to be a valid distortion (compatible with our argument) we need that D > 2. Therefore, we require that δ > δn; where δn [ 1 e2 , 1) is the unique solution to δ) = 1 + logn( Set δn def.= max{ 1 e2 , δn}. Restricting δ (δn, 1) yields a δ for which θD and D both satisfy the required assumptions of our argument. This complete the proof. B.2 Proof of Distortionless Multivariate Embeddings Proof [Proof of Theorem 7] Fix a positive integer n and let Xn be a n-point subset of a metric space (X, d) which we view as an n-point metric space (Xn, d). Let ι : (Xn, d) (Rn, ) be the Fr echet embedding defined by ι(x) def.= (d(x, xi))n i=1 and denote Xn def.= ι(Xn). Clearly, (Xn, d) and ( Xn, ) are isometric with isometry given by ι. It is therefore sufficient to bi-H older embed (Xn, ) into (P2(R3), W2); we do this now. In the proof of (Andoni et al., 2018, Theorem 1) from (Andoni et al., 2018, Equation (12) to Equation(22)), the authors define a map φ : Xn (P2(R3), W2) with the property that: there is some positive integer N such that for every x Xn n=1 δun(x) (43) for some set of points {un(x) : n = 1, . . . , N, x Xn} in R3. Moreover, a careful reading of the construction show that the positive integer N is such bounded above by16 N n(n 1)/2 (5K + 2) (44) for some hyperparameter K N+ to be decided upon shortly. The quantitative version of the result in question, namely (Andoni et al., 2018, Lemma 15), shows that for every distortion D > 1, taking 5n4 2(D 1) aspect( Xn, d)2 K implies that: there is some scale s > 0 such that for every x, x Xn it holds that s x x 1/2 W2 φ(x), φ( x) D s x x 1/2 . (45) Since ι is an isometry and every pointmass on R3 is a Gaussian measure with mean at that point and zero variance then the map φ def.= φ ι 1 sends (Xn, d) to (GM2(R3), MW2) and it has representation 1 N N3(un(x), 0), 16. As a brief reference, we note that, in the notation of (Andoni et al., 2018), N #C, C i,j=1,...,n; i 0 such that for every x, x Xn it holds that s d(x, x)1/2 MW2 φ(x), φ( x) D s d(x, x)1/2. (46) That is, φ is a 1 2-bi-H older embedding of (Xn, d) into (GM2(R3), MW2). Applying the Key Lemma, Lemma 17, and to the map φ yields the conclusion; upon noting that the probabilistic transformer T is defined on all of X. B.3 Proof of Embedding Lemmata Proof [Proof of Lemma 18] Since (X, d) is a doubling metric space then, by (Naor and Neiman, 2012, Theorem 1.2) there exist constant c, C > 0 such that for any 0 < ϵ < 1 2 there exists an injective function φ : (X, d) , (RD, ℓ2) satisfying d1 ϵ X (x, x) φ(x) φ( x) ℓ2 C log(Cd)2 d1 ϵ X (x, x); (47) where D c log(Cd). Since Xn X is finite, then Xn is compact. By (47), φ is ( 1 2, 1) (1 ϵ)-H older continuous and therefore, by (Munkres, 2000, Theorem 26.5) φ(Xn) must be a compact subset of (RD, ℓ2). As before, by appealing to the Heine-Borel theorem ((Munkres, 2000, Theorem 27.3)) we conclude that there must exist some r > 0 such that φ(Xn) Ball RD,ℓ2(0, r); where Ball RD,ℓ2(0, r) def.= {u RD : x ℓ2 r}. Thus, (Kloeckner, 2010, Proposition 7.4) implies that there exists some b RD (which can be constructed using Algorithm 1) such that the map ι : (Ball(RD,ℓ2)(0, r), ℓ2) x 7 1 D(x+b)k (P2(R), W2), is an isometry. As in the proof of Lemma 24, since ι φ is has range in the set of degenerate Gaussian mixtures ; i.e. belonging to Z def.= {P GM2(Rd) : N N+ w N µ1, . . . , µN Rd s.t. P = PN n=1 wnδµn} then, we obtain the conclusion by applying the comparison result in (Delon and Desolneux, 2020, Proposition 6) which states that the identity map on (GM2(Rd), W2) P 7 P (GM2(Rd), MW2) is an isometry on the collection of degenerate mixtures of in Z. Setting φ def.= ι φ and α def.= 1 ϵ concludes the proof. B.4 Proof of Main Results The proof of Theorem 4 relies on the following lemma relating the rate at which the volume of a ball doubles as a function of its radius to Xn s metric capacity. To formulate the required lemma, we must first revisit the notion of a doubling measure (Heinonen, 2001, Page 3 and Chapter 13). Let us begin by recalling that, given any x Xn and any r > 0, the metric ball about x of radius r is defined by B(x, r) def.= {u Xn : d X (u, z) < r}. Briefly, we say that a (finite Borel) measure µ on (Xn, dn) is a doubling measure with doubling constant cµ if for every x Xn and every r > 0 it holds that µ(B(x, 2r)) cµµ(B(x, r)). (48) In (Vol berg and Konyagin, 1987) it was shown that the existence of a doubling measure is equivalent to dn being doubling as a metric space; this means that there is a (metric) doubling constant cdn > 0 such that: for each r > 0 every metric ball in Xn of radius 2r can be covered by at most cdn metric balls of radius r. In (Bru e et al., 2021, Proposition 1.7), it is shown that the existence of a bound on cap(Xn) is equivalent KRATSIOS, DEBARNOT, DOKMANI C to the existence and a bound for cdn. Thus, a bound on the constant of a doubling measure cµ should, in principle, imply a bound on the cap(Xn). The next explicitly spells out the relationship between these metric quantities . Lemma 21 (Bounds of Metric Capacity via (Metric and Measure) Doubling Constants) Let µ be a doubling measure on (Xn, dn) with doubling constant cµ, as in (48). Then, the metric capacity of (Xn, dn) is bounded above by log(cdn) log(cap(Xn)) 12 log(cµ). Proof [Proof of Lemma 21] Suppose that N is a positive integer for which there does not exist points x1, . . . , x N B(x, r) satisfying Ballg(x, r) i=1 Ballg(xi, r/2), then, N is an upper-bound for cdn. Moreover, there must exist N points x1, . . . , xn in B(x, r) for which 0 < min i,j=1,...,N; i =j min u Xn, dn(u,xi) 2 2r min u Xn, dn( u,xj) 2 2r dn(u, u). Pick N argmin i=1,...,N µ(B(xi, r)). Thus, we have that µ(B(x N , 4r)) i=1 µ(B(xi, 2 2r)) n min i=1,...,N µ(B(xi, r)) = nµ(B(x N , r)). Therefore, cd c4 µ. Now, by (Bru e et al., 2021, Proposition 1.7 (i)), we have that cap(Xn) C3 d. (49) Combining the estimate (49) with the estimate cd c4 µ yields the desired upper-bound on cap(Xn). The lower-bound on cap(Xn) is given in (Bru e et al., 2021, Proposition 1.7 (i)). Proof [Proof of Theorem 4] Fix an n 2-point metric subspace (Xn, dn) of (X, d X ). Restricting to (Xn, dn), the result for X = Xn follows from Lemma 17 applied to Lemma 18 and then using Lemma 21 to upper-bound log2(cdn) by 12 log2(cap(Xn, dn)). The general statement, follows by simply noting that T constructed this way is defined on all of X. Remark 22 The proof of Theorem 4 implies that the result holds with cap(Xn, dn) replaced by Xn s doubling constant. Nevertheless, one can still derive the result with C3 d in place of cap(Xn) as a consequence of the present formulation of Theorem 4 and of (Bru e et al., 2021, Proposition 1.7 (ii)). Thus, we only make the remark to state that the result can be sharpened if one prefers to consider doubling constants rather than metric capacity. B.5 Proofs of Secondary Results B.5.1 PROOFS OF DETERMINISTIC EMBEDDINGS Let us consider the case where (Xn, dn) is a tree. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Lemma 23 (Bi-Lipschitz Embedding of (Finite) Combinatorial Trees into (GM2(R), MW2)) Let G = (V, E) be an n-vertex tree and let (V, d G) be its associated metric space (as in Example 2.1). There is an absolute constant C > 0 such that, for every M N+ there exists a map of the form ϕ(x) def.= 1 k=1 Nd(φ(x)k, 0), such that the following holds: for every x, x Xn we have that d G(x, x) W2(ϕ(x), ϕ( x)) MW2(ϕ(x), ϕ( x)) Cn1/(d 1)d G(x, x), where φ : V RM and C > 0 is an absolute constant independent of n, d, and of (Xn, dn). Proof [Proof of Lemma 23] Fix ϵ > 0. Since dn is a tree metric on Xn then (Gupta, 2000, Theorem 1) implies there is an absolute constant C > 0 (independent of n and of (Xn, dn)) and bi-Lipschitz embedding φ : V Rd satisfying: for every x, x V dn(x, x) φ(x) φ( x) Cn1/(d 1)dn(x, x). (50) Since Xn is finite, then (Xn, dn) is compact. Since φ is a bi-Lipschitz map then, it is continuous and therefore by (Munkres, 2000, Theorem 26.5), φ(Xn) is a compact subset of (Rd, ℓ2). By the Heine Borel theorem ((Munkres, 2000, Theorem 27.3)), there exists an r > 0 such that φ(Xn) Ball Rd,ℓ2(0, r); where Ball Rd,ℓ2(0, r) def.= {u Rd : x ℓ2 r}. Since φ(Xn) Ball(Rd,ℓ2)(0, r) then (Kloeckner, 2010, Proposition 7.4) applies. More specifically, the proof of (Kloeckner, 2010, Proposition 7.4) (see the comment in (Bertrand and Kloeckner, 2012, Example 5.5)) guarantee that there exists some b Rd (which can be constructed using Algorithm 1 in Section C) such that the map ι : (Ball(R5,ℓ2)(0, r), ℓ2) x 7 1 d(x+b)k (P2(R), W2), is an isometry. Since the composition of isometries is itself an isometry, then the map ϕ def.= ι φ : (Xn, dn) , (P2(R), W2) is an isometry and is of the required form. Lastly, since ι φ is has range in the set of degenerate Gaussian mixtures Z def.= {P GM2(Rd) : N N+ w N µ1, . . . , µN Rd s.t. P = PN n=1 wnδµn} then, we obtain the conclusion by applying the comparison result in (Delon and Desolneux, 2020, Proposition 6) which states that the identity map on (W2(R), W2) P 7 P (GM2(R), MW2) is an isometry on the collection of finitely supported probability measures (i.e. degenerate Gaussian mixtures ). Proof [Proof of Proposition 11] Since (Xn, dn) is a combinatorial (metric) graph then aspect(Xn, dn) diam(Xn, dn). The result now follows directly from Lemma 17 applied to Lemma 23. Our next case concerns the structure of a low-distortion embedding of a graph whose points are drawn from a Riemannian manifold. In this case, we see that the number of pointmasses required scales quadratically in the dimension of the latent Riemannian manifold. Lemma 24 (Bi-Lipschitz Embeddings of Riemannian Datasets - For Compact Manifolds) Let n, d N+. Let (M, g) be a d-dimensional Riemannian manifold with geodesic distance dg and let K M be compact. There exists a constant CK (depending only on K) such that for any Xn M if dn def.= dg|Xn, then exists a map ϕ : (Xn, dn) (GM2(R), MW2) of the form k=1 N1(φ(x)k, 0); KRATSIOS, DEBARNOT, DOKMANI C satisfying: for every x, x X the following holds d X (x, x) W2(ϕ(x), ϕ( x)) MW2(ϕ(x), ϕ( x)) C CKd X (x, x); where φ : (Xn, dn) (R2d, ℓ2) is a class-C1-(Riemannian) isometric embedding. Proof [Proof of Lemma 24] Since (M, g) is a Riemannian manifold then, (Nash, 1954, Theorem 2) implies that there exists a class-C1 (Riemannian isometric embedding) φ : (M, dg) , (R2d, ℓ2). Since φ is a class C1 Riemannian isometric embedding, it is a diffeomorphism onto its image and therefore, φ has a class C1 inverse on φ(M) which we denote by φ 1. Since K is compact and φ is continuous then φ(K) is compact and therefore the following is finite CK def.= max{max x K φ(x) , max x φ(K) φ 1(x) }. Therefore, by the Rademacher-Stepanov Theorem (Federer, 1978, Theorem 3.1.6) we have that: for every x, x K dg(x, x) φ(x) φ( x) CKdg(x, x). (51) Since φ is class-C1, it is continuous and since Xn is a compact subset of M (since it is finite) then by (Munkres, 2000, Theorem 26.5) φ(Xn) is a non-empty compact subset of (R2d, ℓ2). By the Heine-Borel Theorem (Munkres, 2000, Theorem 27.3), there exists some r > 0 such that φ(Xn) Ball(R2d,ℓ2)(0, r) where Ball(R2d,ℓ2)(0, r) def.= {z R2d : z ℓ2 r}. Since φ(Xn) Ball(R2d,ℓ2)(0, r) then (Kloeckner, 2010, Proposition 7.4) applies. More specifically, the proof of (Kloeckner, 2010, Proposition 7.4) (see the comment in (Bertrand and Kloeckner, 2012, Example 5.5)) guarantee that there exists some b R2d (that can be built explicitly via Algorithm 1 in Section C) such that the map ι : (Ball(R2d,ℓ2)(0, r), ℓ2) x 7 1 2d(x+b)k (P2(R), W2), is an isometry. Combining the two maps, we find that the composite map ϕ : (Xn, dg) x 7 1 2d(φ(x)+b)k (P2(R), W2), is an isometry. As in the proof of Lemma 23, since ι φ is has range in the set of degenerate Gaussian mixtures Z def.= {P GM2(Rd) : N N+ w N µ1, . . . , µN Rd s.t. P = PN n=1 wnδµn} then, we obtain the conclusion by applying the comparison result in (Delon and Desolneux, 2020, Proposition 6) which states that the identity map on (W2(R), W2) P 7 P (GM2(R), MW2) is an isometry on the collection of finitely supported probability measures in Z. Relabeling φ as φ( )+b yields the conclusion. Proof [Proof of Proposition 8] The result is directly follows from Lemma 17 applied to the map ϕ of Lemma 24 and relabeling C CK as CK. B.5.2 PROOF OF COROLLARIES TO MAIN RESULTS Proof [Proof of Corollary 12] We only upper-bound cap(Xn) in order to apply Theorem 4. By Lemma 21 log(cap(Xn, dn)) 12 log(cµ) for any doubling measure µ on (Xn, dn). Since (Xn, dn) is the metric space associated to a (finite) combinatorial graph in which every two vertices are connected by at most 2 edges then, (Durand-Cartagena et al., 2021, Theorem 5, Proposition 18, and Proposition 19) we know that cµ = 1+ρ(AV ). SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Note also that aspect(Xn, dn) = diam(Xn, dn) since the shortest distance between any two points in a combinatorial (metric) graph is 1. Thus, log(cap(Xn, dn)) 12 log(1 + ρ(AV )) and the result now follows from Theorem 4. Proof [Proof of Corollary 9] As argued in (Baudoin and Garofalo, 2011, pages 1121-1122), the complete d-dimensional Riemannian manifold (M, g) satisfies the curvature-dimension inequality of (2) for some r R if and only if (M, g) s Ricci curvature tensor ((Jost, 2017, Definition 4.3.3)) is uniformly lower-bounded by r. As demonstrated in (Baudoin and Garofalo, 2011, Equation (1.1)) the lower Ricci-cuvature bound on (M, g) together with the Bishop-Gromov Volume Comparison Theorem (e.g. see (Chavel, 1993, Theorem 3.10)) implies that cdn d. Therefore, (Bru e et al., 2021, Proposition 1.7 (ii)) implies that log(cap(Xn, dn)) 3 log(cdn). The result therefore follows from Theorem 4. B.5.3 PROOF OF IMPOSSIBILITY RESULTS Proof [Proof of Proposition 13] Step 1 - A lower Bound when Embedding into Spaces of Markov type p 2 On (Naor et al., 2006, page 173) the authors discuss that the proof of a result in (Linial et al., 2002) implies the following for any metric space R of Markov type 2 (see (Naor et al., 2006, Definition 1.1)). If Γg,δ def.= (Xg,δ, dg,δ) is a graph (equipped with the usual graph geodesic distance) in which the shortest loop has length at least g > 0 (called Γg,δ s girth) and its average degree is δ > 2 then, any bi-Lipschitz embedding f : Γg,δ (R, d R) (here s = 1) has distortion Dist(f) at-least Dist(f) δ 2 2 Mp(R)g(p 1)/p, (52) where p is the Markov-type of (R, d R) as defined by in (Ball, 1992, Definition 1.1) and Mp(R) > 0 is a constant depending only on R s geometry. Step 2 - Markov Type of (M, g) Let (R, d R) def.= (M, dg) where (M, g) is a complete Riemannian manifold with sectional curvature pinched between [ C, c] for some 0 < c C < . Therefore, (Naor et al., 2006, Corollary 6.5) applies implying that (M, dg) has Markov type p = 2. Step 3 - Realizing the Diverging Lower-Bound via Sextet Graphs The main result of (Weiss, 1984) shows that, for every prime integer n > 2 satisfying n mod 16 { 3, 5, 7}, the sextet graph with n vertices is a cubic graph (i.e. each vertex has exactly 3 edges connected to it and therefore δ = 3 > 2) and its girth g is at-least 4 3 log2(n) 2. Therefore, setting Γg,δ to be the sextet graph with n-vertices we deduce from (52) and step 2 that any bi-Lipschitz embedding f : Γg,δ R satisfies Dist(f) 1 2 Mp(R) 4 3 log2(n) 2. Setting c R def.= 1 2 Mp(R) yields the conclusion. Proof [Proof of Proposition 15] We argue by contradiction. Assume that R is a Riemannian manifold which is bi-Lipschitz embedded in the Hilbert space ℓ2 with distortion D1; denote this bi-Lipschitz embedding by ψ : (R, g) , (R2d, ℓ2) and with the property that: for every n N+ there is no n-point metric space (Xn, dn) which can be bi-Lipschitz embedded in (R, dg) with distortion D2 log(n) log(log(n)) (here D2 is a constant depending only on R). Therefore, for every n-point metric space (Xn, dn) there exists a bi-Lipschitz embedding φ : (Xn, dn) , (R, dg) whose distortion D is less than D2 log(n) log(log(n)). Thus ψ φ : Xn ℓ2, is a bi-Lipschitz embedding with KRATSIOS, DEBARNOT, DOKMANI C at-least distortion D1D2. Thus, for n large enough there is an n-point metric space which bi-Lipschitz embeds into (ℓ2, ℓ2) with distortion less than O log(n) log(log(n)) . This is a contradiction of a main result of (Bourgain, 1985). Whence, (R, g) does not exist. C. Algorithmic Construction of the Bias Term in our Embeddings Proofs This short appendix contains an explicit algorithmic construction of the shift (or bias term ), denoted by b, used when embedding a bounded Euclidean ball into the Wasserstein space. The term b Rd maps the Euclidean ball into the space Rd < def.= {x Rd : x1 < < xd} used in (Bertrand and Kloeckner, 2012, Example 5.5) and initially proposed in (Kloeckner, 2010, Proposition 7.4) . Algorithm 1 Initialize Bias Require: Set of n-vectors X def.= {x(1), . . . , x(n)} in RK, b0 def.= 0 Initialize first shift for n = 1, . . . , N do xn 1 def.= x(n) 1 + b1 Dummy vectors end for for k = 1, . . . , K do Iteratively build bias components bk def.= maxn N Re LU(x(n) k 1 x(n) k 1) for n N do xn k def.= x(n) k + bk Dummy vectors end for end for return b def.= (b1, . . . , b K) Return Bias) B. Acciaio, J. Backhoff-Veraguas, and A. Zalashko. Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. Stochastic Process. Appl., 130(5):2918 2953, 2020. URL https://doi.org/10.1016/j.spa.2019.08.009. Beatrice Acciaio, Anastasis Kratsios, and Gudmund Pammer. Designing universal causal deep learning models: The geometric (hyper)transformer. Mathematical Finance (Forthcoming), page TBA, 2023. Special Issue: Machine Learning in Finance. Shivani Agarwal and Partha Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. J. Mach. Learn. Res., 10:441 474, 2009. ISSN 1532-4435. Shun-ichi Amari. Information geometry and its applications, volume 194 of Applied Mathematical Sciences. Springer, 2016. Alexandr Andoni, Assaf Naor, and Ofer Neiman. Snowflake universality of Wasserstein spaces. Ann. Sci. Ec. Norm. Sup er. (4), 51(3):657 700, 2018. Patrice Assouad. Etude d une dimension m etrique li ee a la possibilit e de plongements dans Rn. C. R. Acad. Sci. Paris S er. A-B, 288(15):A731 A734, 1979. Julio Backhoff, Daniel Bartl, Mathias Beiglb ock, and Johannes Wiesel. Estimating processes in adapted Wasserstein distance. Ann. Appl. Probab., 32(1):529 550, 2022. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS J. Backhoff-Veraguas and G. Pammer. Stability of martingale optimal transport and weak optimal transport. Ann. Appl. Probab., 32(1):721 752, 2022a. URL https://doi.org/10.1214/21-aap1694. J. Backhoff-Veraguas and G. Pammer. Stability of martingale optimal transport and weak optimal transport. Ann. Appl. Probab., 32(1):721 752, 2022b. Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglb ock, and Manu Eder. All adapted topologies are equal. Probab. Theory Related Fields, 178(3-4):1125 1172, 2020a. ISSN 0178-8051. URL https: //doi.org/10.1007/s00440-020-00993-8. Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglb ock, and Manu Eder. All adapted topologies are equal. Probab. Theory Related Fields, 178(3-4):1125 1172, 2020b. Keith Ball. Markov chains, riesz transforms and lipschitz maps. Geometric & Functional Analysis GAFA, 2 (2):137 172, 1992. Yair Bartal, Nathan Linial, Manor Mendel, and Assaf Naor. On metric Ramsey-type phenomena. Ann. of Math. (2), 162(2):643 709, 2005. URL https://doi.org/10.4007/annals.2005.162.643. Peter L. Bartlett and Philip M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. J. Mach. Learn. Res., 22:Paper No. 204, 15, 2021. ISSN 1532-4435. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017. Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20 (63):1 17, 2019. URL http://jmlr.org/papers/v20/17-612.html. Fabrice Baudoin and Nicola Garofalo. Perelman s entropy and doubling property on Riemannian manifolds. J. Geom. Anal., 21(4):1119 1131, 2011. URL https://doi.org/10.1007/s12220-010-9180-x. Fabrice Baudoin, Michel Bonnefont, and Nicola Garofalo. A sub-Riemannian curvature-dimension inequality, volume doubling property and the Poincar e inequality. Math. Ann., 358(3-4):833 860, 2014. URL https://doi.org/10.1007/s00208-013-0961-y. Mathias Beiglb ock, Pierre Henry-Labord ere, and Nizar Touzi. Monotone martingale transport plans and Skorokhod embedding. Stochastic Process. Appl., 127(9):3005 3013, 2017a. URL https://doi.org/ 10.1016/j.spa.2017.01.004. Mathias Beiglb ock, Marcel Nutz, and Nizar Touzi. Complete duality for martingale optimal transport on the line. Ann. Probab., 45(5):3038 3074, 2017b. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003. Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. Advances in neural information processing systems, 19, 2006. J erˆome Bertrand and Benoˆıt Kloeckner. A geometric study of Wasserstein spaces: Hadamard spaces. J. Topol. Anal., 4(4):515 542, 2012. URL https://doi.org/10.1142/S1793525312500227. Norman L Biggs and MJ Hoare. The sextet construction for cubic graphs. Combinatorica, 3(2):153 165, 1983. J er emie Bigot, Ra ul Gouet, Thierry Klein, and Alfredo L opez. Geodesic PCA in the Wasserstein space by convex PCA. Ann. Inst. Henri Poincar e Probab. Stat., 53(1):1 26, 2017. URL https://doi.org/10. 1214/15-AIHP706. KRATSIOS, DEBARNOT, DOKMANI C Eugene Bilokopytov and Anastasis Kratsios. Non-Euclidean Universal Approximation. Neur IPS, 33, 2020. Jocelyne Bion-Nadal and Denis Talay. On a Wasserstein-type distance between solutions to stochastic differential equations. Ann. Appl. Probab., 29(3):1609 1639, 2019. URL https://doi.org/10. 1214/18-AAP1423. Christopher M Bishop. Mixture density networks. Unpublished, 1994. J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52(1-2):46 52, 1985. Elia Bru e, Simone Di Marino, and Federico Stra. Linear Lipschitz and C1 extension operators through random projection. J. Funct. Anal., 280(4):Paper No. 108868, 21, 2021. URL https://doi.org/10.1016/ j.jfa.2020.108868. Sebastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and size of the weights in memorization with two-layers neural networks. Advances in Neural Information Processing Systems, 33: 4977 4986, 2020. Fred Buckley and Frank Harary. On the Euclidean dimension of a wheel. Graphs Combin., 4(1):23 30, 1988. URL https://doi.org/10.1007/BF01864150. Martin Burger and Andreas Neubauer. Error bounds for approximation with neural networks. J. Approx. Theory, 112(2):235 250, 2001. Ines Chami, Albert Gu, Vaggos Chatziafratis, and Christopher R e. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. Advances in Neural Information Processing Systems, 33: 15065 15076, 2020. URL https://dl.acm.org/doi/abs/10.5555/3495724.3496987. Ines Chami, Albert Gu, Dat P Nguyen, and Christopher R e. Horopca: Hyperbolic dimensionality reduction via horospherical projections. In International Conference on Machine Learning, pages 1419 1429. PMLR, 2021. Isaac Chavel. Riemannian geometry-a modern introduction, volume 108 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1993. Yaqing Chen, Zhenhua Lin, and Hans-Georg M uller. Wasserstein regression. Journal of the American Statistical Association, pages 1 14, 2021. Patrick Cheridito, Arnulf Jentzen, and Florian Rossmannek. Efficient approximation of high-dimensional functions with neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1 15, 2021. Julien Chevallier. Uniform decomposition of probability measures: quantization, clustering and rate of convergence. Journal of Applied Probability, 55(4):1037 1045, 2018. Sueli I.R. Costa, Sandra A. Santos, and Jo ao E. Strapasson. Fisher information distance: A geometrical reading. Discrete Applied Mathematics, 197:59 69, 2015. Distance Geometry and Applications. Calin Cruceru, Gary Becigneul, and Octavian-Eugen Ganea. Computationally tractable riemannian manifolds for graph embeddings. ar Xiv preprint ar Xiv:2002.08665, 2020. Calin Cruceru, Gary Becigneul, and Octavian-Eugen Ganea. Computationally tractable riemannian manifolds for graph embeddings. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):7133 7141, May 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16877. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Juan Antonio Cuesta and Carlos Matran. Notes on the wasserstein metric in hilbert spaces. The Annals of Probability, 17(3):1264 1276, 1989. ISSN 00911798. URL http://www.jstor.org/stable/ 2244406. Xiongtao Dai and Hans-Georg M uller. Principal component analysis for functional data on Riemannian manifolds and spheres. Ann. Statist., 46(6B):3334 3361, 2018. Amit Daniely. Neural networks learning and memorization with (almost) no over-parameterization. Advances in Neural Information Processing Systems, 33:9007 9016, 2020. Kinkar Ch. Das and Pawan Kumar. Some new bounds on the spectral radius of graphs. Discrete Math., 281 (1-3):149 161, 2004. ISSN 0012-365X. Valentin Debarnot and Pierre Weiss. Deep-blur : Blind identification and deblurring with convolutional neural networks. June 2022. Julie Delon and Agn es Desolneux. A Wasserstein-type distance in the space of Gaussian mixture models. SIAM Journal on Imaging Sciences, 13(2):936 970, 2020. Julie Delon and Agn es Desolneux. Optimal transport between gaussian mixture models. https://github. com/judelo/gmmot, 2021. Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, and George E. Dahl. Embedding text in hyperbolic spaces. Text Graphs@NAACL-HLT, pages 59 69, 2018. Yan Dolinsky and H. Mete Soner. Martingale optimal transport and robust hedging in continuous time. Probab. Theory Related Fields, 160(1-2):391 427, 2014. ISSN 0178-8051. James G. Dowty. Chentsov s theorem for exponential families. Inf. Geom., 1(1):117 135, 2018. Estibalitz Durand-Cartagena, Javier Soria, and Pedro Tradacete. Doubling constants and spectral theory on graphs. ar Xiv preprint ar Xiv:2111.09199, 2021. Aryeh Dvoretzky. Some results on convex bodies and Banach spaces. Proc. Internat. Sympos. Linear Spaces (Jerusalem, 1960), pages 123 160, 1961. Weinan E. and Stephan Wojtowytsch. Representation formulas and pointwise properties for Barron functions. Calc. Var. Partial Differential Equations, 61(2):Paper No. 46, 37, 2022. Marzieh Eidi and J urgen Jost. Ollivier Ricci curvature of directed hypergraphs. Scientific Reports, 10(1): 12466, 2020. P. Erd os. On an extremal problem in graph theory. Colloq. Math., 13:251 254, 1964/65. ISSN 0010-1354. P. Erd os, A. R enyi, and V. T. S os. On a problem of graph theory. Studia Sci. Math. Hungar., 1:215 235, 1966. ISSN 0081-6906. Paul Erd os and Horst Sachs. Regukre graphen gegebener taillenweite mit minimaler knotenzahl. Wissenschaftliche Zeitschrift der Martin-Luther-Universit at Halle, pages 251 258, 1963. Sylvester Eriksson-Bique. Quantitative bi-Lipschitz embeddings of bounded-curvature manifolds and orbifolds. Geometry & Topology, 22(4):1961 2026, 2018. K. J. Falconer. The geometry of fractal sets, volume 85 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 1986. Herbert Federer. Colloquium lectures on geometric measure theory. Bull. Amer. Math. Soc., 84(3):291 338, 1978. KRATSIOS, DEBARNOT, DOKMANI C Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881 2891, 2020. R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, L eo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http://jmlr.org/ papers/v22/20-451.html. P. Thomas Fletcher, Suresh Venkatasubramanian, and Sarang Joshi. The geometric median on riemannian manifolds with application to robust atlas estimation. Neuro Image, 45(1):S143 S152, 2009. Mathematics in Brain Imaging. M Maurice Fr echet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884-1940), 22(1):1 72, 1906. Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic neural networks. Advances in Neural Information Processing Systems, 31:5345 5355, 2018. URL https://proceedings.neurips.cc/ paper/2018/file/dbab2adc8f9d078009ee3fa810bea142-Paper.pdf. Francesco Di Giovanni, Giulia Luise, and Michael M. Bronstein. Heterogeneous manifolds for curvature-aware graph embedding. In ICLR 2022 Workshop on Geometrical and Topological Representation Learning, 2022. URL https://openreview.net/forum?id=rt Uxs N-kaxc. Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions, volume 1730 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2000. Gaoyue Guo and Jan Obł oj. Computational methods for martingale optimal transport problems. Ann. Appl. Probab., 29(6):3311 3347, 2019. A. Gupta. Embedding tree metrics into low-dimensional Euclidean spaces. Discrete Comput. Geom., 24(1): 105 116, 2000. Juha Heinonen. Lectures on analysis on metric spaces. Universitext. Springer-Verlag, New York, 2001. Juha Heinonen. Geometric embeddings of metric spaces, volume 90 of Report. University of Jyv askyl a Department of Mathematics and Statistics. University of Jyv askyl a, Jyv askyl a, 2003. Nhat Ho, Xuan Long Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via wasserstein means. In International Conference on Machine Learning, pages 1501 1509. PMLR, 2017. H. Hopf and W. Rinow. Ueber den Begriff der vollst andigen differentialgeometrischen Fl ache. Comment. Math. Helv., 3(1):209 225, 1931. J urgen Jost. Riemannian geometry and geometric analysis. Universitext. Springer, Heidelberg, seventh edition, 2017. Heinrich Jung. Ueber die kleinste kugel, die eine r aumliche figur einschliesst. Journal f ur die reine und angewandte Mathematik, 123:241 257, 1901. URL http://eudml.org/doc/149122. Mark Jungerman and Gerhard Ringel. The genus of the n-octahedron: regular cases. J. Graph Theory, 2(1): 69 75, 1978. ISSN 0364-9024. Karin Usadi Katz and Mikhail G. Katz. Bi-Lipschitz approximation by finite-dimensional imbeddings. Geom. Dedicata, 150:131 136, 2011. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. Proceedings of Thirty Third Conference on Learning Theory, 125:2306 2327, 09 12 Jul 2020. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Benoˆıt Kloeckner. A geometric study of Wasserstein spaces: Euclidean spaces. Ann. Sc. Norm. Super. Pisa Cl. Sci. (5), 9(2):297 323, 2010. URL https://doi.org/10.2422/2036-2145.2010.2.03. Benoˆıt Kloeckner. Approximation by finitely supported measures. ESAIM Control Optim. Calc. Var., 18(2): 343 359, 2012. Benoˆıt R Kloeckner and J erome Bertrand. A geometric study of wasserstein spaces: isometric rigidity in negative curvature. International Mathematics Research Notices, 2016(5):1368 1386, 2016. Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in pytorch. In Hal Daum e III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, Virtual, 13 18 Jul 2020. PMLR. Prodromos Kolyvakis, Alexandros Kalousis, and Dimitris Kiritsis. Hyperbolic knowledge graph embeddings for knowledge base completion. In European Semantic Web Conference, pages 199 214. Springer, 2020. Anastasis Kratsios. Universal Regular Conditional Distributions. Constructive Approximation (Forthcoming), 2023. Anastasis Kratsios and L eonie Papon. Universal approximation theorems for differentiable geometric deep learning. Journal of Machine Learning Research, 23:1 73, 2022. Anastasis Kratsios, Behnoosh Zamanlooy, Tianlin Liu, and Ivan Dokmani c. Universal approximation under constraints is possible with transformers. In International Conference on Learning Representations, 2022. R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: a new embedding method for finite metrics. Geom. Funct. Anal., 15(4):839 858, 2005. Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Mari an Bogun a. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010. Matt Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. Ar Xi V, 2019. Enrico Le Donne and Tapio Rajala. Assouad dimension, nagata dimension, and uniformly close metric tangents. Indiana University Mathematics Journal, pages 21 54, 2015. N. Linial, A. Magen, and A. Naor. Girth and Euclidean distortion. Geom. Funct. Anal., 12(2):380 394, 2002. Chong Liu and Ariel Neufeld. Compactness criterion for semimartingale laws and semimartingale optimal transport. Trans. Amer. Math. Soc., 372(1):187 231, 2019. Yating Liu and Gilles Pag es. Convergence rate of optimal quantization and application to the clustering performance of the empirical measure. Journal of Machine Learning Research, 21(86):1 36, 2020. URL http://jmlr.org/papers/v21/18-804.html. Federico L opez and Michael Strube. A fully hyperbolic neural model for hierarchical multi-class classification. Ar Xi V, 2020. KRATSIOS, DEBARNOT, DOKMANI C Federico Lopez, Beatrice Pozzetti, Steve Trettel, Michael Strube, and Anna Wienhard. Symmetric spaces for graph embeddings: A finsler-riemannian approach. Proceedings of the 38th International Conference on Machine Learning, 139:7090 7101, 18 24 Jul 2021. Grigorii A Margulis. Explicit constructions of graphs without short cycles and low density codes. Combinatorica, 2(1):71 78, 1982. Grigorii Aleksandrovich Margulis. Explicit group-theoretical constructions of combinatorial schemes and their application to the design of expanders and concentrators. Problemy peredachi informatsii, 24(1):51 60, 1988. Jirı Matouˇsek. Lecture notes on metric embeddings. Technical report, Technical report, ETH Z urich, 2013. Jiˇr ı Matouˇsek. On the distortion required for embedding finite metric spaces into normed spaces. Israel J. Math., 93:333 344, 1996. ISSN 0021-2172. Jiˇr ı Matouˇsek. Lectures on discrete geometry, volume 212 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2002. Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Comm. Pure Appl. Math., 75(4):667 766, 2022. Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: Hypercontractivity and kernel matrix concentration. Appl. Comput. Harmon. Anal., 59:3 84, 2022. Liang Mi, Wen Zhang, Xianfeng Gu, and Yalin Wang. Variational wasserstein clustering. In Proceedings of the European Conference on Computer Vision (ECCV), pages 322 337, 2018. Nina Miolane, Nicolas Guigui, Alice Le Brigant, Johan Mathe, Benjamin Hou, Yann Thanwerdas, Stefan Heyder, Olivier Peltre, Niklas Koep, Hadi Zaatiti, Hatem Hajri, Yann Cabanes, Thomas Gerald, Paul Chauchat, Christian Shewmake, Daniel Brooks, Bernhard Kainz, Claire Donnat, Susan Holmes, and Xavier Pennec. Geomstats: A python package for Riemannian geometry in machine learning. Journal of Machine Learning Research, 21(223):1 9, 2020. URL http://jmlr.org/papers/v21/19-027.html. James R. Munkres. Topology. Prentice Hall, Inc., Upper Saddle River, NJ, 2000. Second edition. Tamara Munzner. H3: Laying out large directed graphs in 3d hyperbolic space. In Proceedings of VIZ 97: Visualization Conference, Information Visualization Symposium and Parallel Rendering Symposium, pages 2 10. IEEE, 1997. Alessandro Muscoloni, Josephine Maria Thomas, Sara Ciucci, Ginestra Bianconi, and Carlo Vittorio Cannistraci. Machine learning meets complex networks via coalescent embedding in the hyperbolic space. Nature communications, 8(1):1 19, 2017. Assaf Naor. Metric embeddings and lipschitz extensions. Princeton University, Department of Mathematics, 2015. Assaf Naor. Metric dimension reduction: a snapshot of the Ribe program. Proceedings of the International Congress of Mathematicians Rio de Janeiro 2018. Vol. I. Plenary lectures, pages 759 837, 2018. Assaf Naor and Ofer Neiman. Assouad s theorem with dimension independent of the snowflaking. Revista Matematica Iberoamericana, 28(4):1123 1142, 2012. Assaf Naor and Terence Tao. Scale-oblivious metric fragmentation and the nonlinear Dvoretzky theorem. Israel J. Math., 192(1):489 504, 2012. ISSN 0021-2172. SMALL TRANSFORMERS COMPUTE UNIVERSAL METRIC EMBEDDINGS Assaf Naor, Yuval Peres, Oded Schramm, and Scott Sheffield. Markov chains in smooth banach spaces and gromov-hyperbolic metric spaces. Duke Mathematical Journal, 134(1):165 197, 2006. John Nash. C1 isometric imbeddings. Ann. of Math. (2), 60:383 396, 1954. Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. Advances in neural information processing systems, 28, 2015. Maximillian Nickel and Douwe Kiela. Poincar e embeddings for learning hierarchical representations. Advances in neural information processing systems, 30, 2017. James Orlin. A faster strongly polynomial minimum cost flow algorithm. In Proceedings of the Twentieth annual ACM symposium on Theory of Computing, pages 377 387, 1988. Mikhail I. Ostrovskii. Metric embeddings, volume 49 of De Gruyter Studies in Mathematics. De Gruyter, Berlin, 2013. Bilipschitz and coarse embeddings into Banach spaces. Pierre Pansu. M etriques de Carnot-Carath eodory et quasiisom etries des espaces sym etriques de rang un. Ann. of Math. (2), 129(1):1 60, 1989. ISSN 0003-486X. S. Park, C. Yun, J. Lee, and J. Shin. Minimum width for universal approximation, 2020. Sejun Park, Jaeho Lee, Chulhee Yun, and Jinwoo Shin. Provable memorization via deep neural networks using sub-linear parameters. Proceedings of Thirty Fourth Conference on Learning Theory, 134:3627 3661, 15 19 Aug 2021. Georg Ch. Pflug and Alois Pichler. Dynamic generation of scenario trees. Comput. Optim. Appl., 62(3): 641 668, 2015. Michael Puthawala, Konik Kothari, Matti Lassas, Ivan Dokmani c, and Maarten de Hoop. Globally injective Re LU networks. Journal of Machine Learning Research, 23:1 55, 2022. Erzs ebet Ravasz and Albert-L aszl o Barab asi. Hierarchical organization in complex networks. Physical review E, 67(2):026112, 2003. P. L. Robinson. The sphere is not flat. The American Mathematical Monthly, 113(2):171 173, 2006a. URL http://www.jstor.org/stable/27641870. PL Robinson. The sphere is not flat. The American Mathematical Monthly, 113(2):171 173, 2006b. Areejit Samal, R. P. Sreejith, Jiao Gu, Shiping Liu, Emil Saucan, and J urgen Jost. Comparative analysis of two discretizations of Ricci curvature for complex networks. Scientific Reports, 8(1):8650, 2018. Rik Sarkar. Low distortion delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pages 355 366. Springer, 2011. Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network with approximation error being reciprocal of width to power of square root of depth. Neural Comput., 33(4):1005 1036, 2021. ISSN 0899-7667. Ryohei Shimizu, YUSUKE Mukuta, and Tatsuya Harada. Hyperbolic neural networks++. In International Conference on Learning Representations, 2021. SA Shkarin. Isometric embedding of finite ultrametric spaces in banach spaces. Topology and its Applications, 142(1-3):13 17, 2004. Eduardo D. Sontag. Shattering all sets of k points in general position requires (k|1)/2 parameters. Neural Computation, 9(2):337 348, 1997a. KRATSIOS, DEBARNOT, DOKMANI C Eduardo D Sontag. Shattering all sets of k points in general position requires (k 1)/2 parameters. Neural Computation, 9(2):337 348, 1997b. Taiji Suzuki. Adaptivity of deep Re LU network for learning in besov and mixed smooth besov spaces: optimal rate and curse of dimensionality. International Conference on Learning Representations, 2019. Puoya Tabaghi and Ivan Dokmani c. Hyperbolic distance matrices. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1728 1738, 2020. Puoya Tabaghi and Ivan Dokmani c. On procrustes analysis in hyperbolic space. IEEE Signal Processing Letters, 28:1120 1124, 2021. Puoya Tabaghi, Eli Chien, Chao Pan, Jianhao Peng, and Olgica Milenkovi c. Linear classifiers in product space forms. ar Xiv preprint ar Xiv:2102.10204, 2021. Gal Vardi, Gilad Yehudai, and Ohad Shamir. On the optimal memorization power of Re LU neural networks. International Conference on Learning Representations, 2022. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li o, and Yoshua Bengio. Graph attention networks. International Conference on Learning Representations, 2018. Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004 1033, 2020. I. A. Vestfrid and A. F. Timan. A universality property of Hilbert spaces. Dokl. Akad. Nauk SSSR, 246(3): 528 530, 1979. ISSN 0002-3264. A. L. Vol berg and S. V. Konyagin. On measures with the doubling condition. Izv. Akad. Nauk SSSR Ser. Mat., 51(3):666 675, 1987. Leonid Nisonovich Waserstein. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64 72, 1969. Alfred Weiss. Girths of bipartite sextet graphs. Combinatorica, 4(2):241 245, 1984. Dmitry Yarotsky. Error bounds for approximations with deep Re LU networks. Neural Networks, 94:103 114, 2017. ISSN 0893-6080. Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317 2332, 2017. Shuxin Zheng, Qi Meng, Huishuai Zhang, Wei Chen, Nenghai Yu, and Tie-Yan Liu. Capacity control of Re LU neural networks by basis-path norm. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 5925 5932, Jul. 2019. Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48(2):787 794, 2020. George Kingsley Zipf. Human behavior and the principle of least effort. pages 1 334, 1949.