# intrinsic_dimension_for_largescale_geometric_learning__d95c7346.pdf Published in Transactions on Machine Learning Research (01/2023) Intrinsic Dimension for Large-Scale Geometric Learning Maximilian Stubbemann stubbemann@cs.uni-kassel.de Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany Tom Hanika tom.hanika@cs.uni-kassel.de Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany Friedrich Martin Schneider martin.schneider@math.tu-freiberg.de Institute of Discrete Mathematics and Algebra, TU Bergakademie Freiberg, Freiberg, Germany Reviewed on Open Review: https: // openreview. net/ forum? id= 85Bf Dd YMBY The concept of dimension is essential to grasp the complexity of data. A naive approach to determine the dimension of a dataset is based on the number of attributes. More sophisticated methods derive a notion of intrinsic dimension (ID) that employs more complex feature functions, e.g., distances between data points. Yet, many of these approaches are based on empirical observations, cannot cope with the geometric character of contemporary datasets, and do lack an axiomatic foundation. A different approach was proposed by V. Pestov, who links the intrinsic dimension axiomatically to the mathematical concentration of measure phenomenon. First methods to compute this and related notions for ID were computationally intractable for large-scale real-world datasets. In the present work, we derive a computationally feasible method for determining said axiomatic ID functions. Moreover, we demonstrate how the geometric properties of complex data are accounted for in our modeling. In particular, we propose a principle way to incorporate neighborhood information, as in graph data, into the ID. This allows for new insights into common graph learning procedures, which we illustrate by experiments on the Open Graph Benchmark. 1 Introduction Contemporary real-world datasets employed in artificial intelligence are often large in size and comprised of complex structures, which distinguishes them from Euclidean data. To consider these properties appropriately is a challenging task for procedures that analyze or learn from said data. Moreover, with increasing complexity of real-world data, the necessity arises to quantify to which extent this data suffer from the curse of dimensionality. The common approach for estimating the dimension curse of a particular dataset is through the notion of intrinsic dimension (ID) (Bac & Zinovyev, 2020; Granata & Carnevale, 2016; Pestov, 2007). There exists a variety of work on how to estimate the ID of datasets (Facco et al., 2017; Levina & Bickel, 2004; Costa et al., 2005; Gomtsyan et al., 2019; Bac & Zinovyev, 2020). Most approaches to quantify the ID are based on distances between data points, assuming the data to be Euclidean. A multitude of works base their modeling on the manifold hypothesis (Cloninger & Klock, 2021; Gomtsyan et al., 2019), which assumes that the observed data is embedded in a manifold of low dimension (compared to the number of data attributes). The ID then is an approximation of the dimension of this manifold. Pestov (2000) proposed a different concept of intrinsic dimension by linking it to the mathematical concentration of measure phenomenon. His modeling is based on a thorough axiomatic approach (Pestov, 2007; 2008; 2010) which resulted in a novel class of intrinsic dimension functions. In contrast to the manifold hypothesis, Pestov s ID functions measure to which extent a dataset is affected by the curse of dimensionality, i.e., to which extent the complexity of the dataset hinders the discrimination of data points. Yet, to compute said ID functions is an intractable computational endeavor. This limitation was overcome in principle by an adaptation to geometric datasets (Hanika et al., 2022). However, two limitations persisted: First, the computational effort was Published in Transactions on Machine Learning Research (01/2023) found to remain quadratic in the number of data points, which is insufficient for datasets of contemporary size; second, it is unclear how to account for complex structure, such as in graph data. With this in mind, we propose in the present work a default approach for computing the intrinsic dimension of geometric data, such as graph data, as used in graph neural networks. To do this, we revisit the computation of the ID based on distance functions (Hanika et al., 2022) and overcome, in particular, the inherent computational limitations in the works by Pestov (2007) and Hanika et al. (2022). In detail, we derive a novel approximation formula and present an algorithm for its computation. This allows us to compute ID bounds for datasets that are magnitudes larger than in earlier works. That equipped with, we establish a natural approach to compute the ID of graph data. We subsequently apply our method to seven real-world datasets and relate the obtained results to the observed performances of classification procedures. Thus, we demonstrate the practical computability of our approach. In addition, we study the extent to which the intrinsic dimension reveals insights into the performance of particularly classes of Graph Neural Networks. Our code is publicly available on Git Hub.1 2 Related Work In numerous works, the intrinsic dimension is estimated using the pairwise distances between data points (Chávez et al., 2001; Grassberger & Procaccia, 2004). More sophisticated approaches use distances to nearest neighbors (Facco et al., 2017; Levina & Bickel, 2004; Costa et al., 2005; Gomtsyan et al., 2019). All these works have in common, that they assume the data to be Euclidean and that they favor local properties. Recent work has drawn different connections between intrinsic dimension (ID) and modern learning theory. For example, Cloninger & Klock (2021) show that functions of the form f(x) = g(φ(x)), where φ maps into a manifold of lower dimension, can be approximated by neural networks. On the other hand, Wojtowytsch & E (2020) prove that modern artificial neural networks suffer from the curse of dimensionality in the sense that gradient training on high dimensional data may converge insufficiently. Additional to these theoretical results, there is an increasing interest of empirically estimating the ID of contemporary learning architectures. Li et al. (2018) study the ID of neural networks by replacing high dimensional parameter vectors with lower dimensional ones. Their approach results in a non-deterministic ID. More recent works studied ID in the realm of geometric data and their standard architectures. Ansuini et al. (2019) investigate the ID for convolutional neural networks (CNN). In detail, they are interested to which extent the ID changes at different hidden layers and how this is related to the overall classification performance. Another work (Pope et al., 2020) associates an ID to popular benchmark image datasets. These two works on ID estimators do solely rely on the metric information of the data and do not consider any geometric structure of image data. Our approach allows to incorporate such underlying geometric structures while incorporating the mathematical phenomenon of measure concentration (Gromov & Milman, 1983; Milman, 1988; 2010). Linking this phenomenon to the occurrence of the dimension curse was done by Pestov (Pestov, 2000; 2007; 2008; 2010). He based his considerations on a thorough axiomatic approach using techniques from metric-measure spaces. The resulting ID functions unfortunately turn out to be practically incomputable. In contrast, Bac & Zinovyev (2020) investigate computationally feasible ID estimators that are related to the concentration phenomenon. Yet, their results elude a comparable axiomatic foundation. Our modeling for the ID of large and geometric data is based on Hanika et al. (2022). We build on their axiomatization and derive a computationally feasible method for the intrinsic dimension of large-scale geometric datasets. 3 Intrinsic Dimension Since our work is based on the formalization from Hanika et al. (2022), we shortly revisit their modeling and recapitulate the most important structures. Based on this, we derive and prove an explicit formula to compute the ID for the special case of finite geometric datasets. This first result is essential for Section 4. 1https://github.com/mstubbemann/ID4Geo L Published in Transactions on Machine Learning Research (01/2023) Let D = (X, F, µ), where X is a set and F RX is a set of functions from X to R, in the following called feature functions. We require that supx,y X d F (x, y) < , where d F (x, y) := supf F |f(x) f(y)|. If (X, d F ) constitutes a complete and separable metric space such that µ is a Borel probability measure on (X, d F ), we call D a geometric dataset (GD). The aforementioned properties are satisfied when it holds that 0 < |X|, |F| < and that F can discriminate all data points, i.e., d F (x, y) > 0 for all x, y X with x = y. Two geometric datasets D1 := (X1, F1, µ1), D2 := (X2, F2, µ2) are isomorphic if there exists a bijection φ : X1 X2 such that F2 φ = F1 and φ (µ1) = µ2, where φ (µ1)(B) := µ1(φ 1(B)) is the push-forward measure and the closures are taken with respect to point-wise convergence. From this point on we identify a geometric dataset with its isomorphism class. The triple ({ }, R, ν{ }) represents the trivial geometric dataset. Pestov (2007) defines the curse of dimensionality as [...] a name given to the situation where all or some of the important features of a dataset sharply concentrate near their median (or mean) values and thus become non-discriminating. In such cases, X is perceived as intrinsically high-dimensional. Thus, the ID estimator aims to compute to which extent the features allow to discriminate different data points. For a specific feature f F, Hanika et al. (2022) therefore defines the partial diameter of f with regard to a specific α (0, 1) such that it displays to which extent f can discriminate subsets with minimal measure 1 α, i.e., via Part Diam(f (ν), 1 α) = inf{diam(B) | B R Borel, ν(f 1(B)) 1 α}, where diam(B) := supa,b B |a b|. The observable diameter with respect to α then defines to which extent F can discriminate points with minimum measure 1-α by being defined as the supremum of the partial diameter of all f F, i.e, Obs Diam(D, α) := supf F Part Diam(f (µ), 1 α). To observe the discriminability for different minimal measures α, the discriminability (D) of D and the intrinsic dimension (D) are defined via (D) := 1 (D)2 , where (D) := Z 1 0 Obs Diam(D, α) dα. (1) In other words, lower values of intrinsic dimensionality correspond to geometric datasets with points that can be better discriminated by the given set of feature functions. This intrinsic dimension function is, in principle, applicable to a broad variety of geometric data, such as metric data, graphs or images. This applicability arises from the possibility to choose suitable feature functions which reflect the underlying data structure. The appropriate choice of feature functions is part of Section 5. Furthermore, the ID (D) = 1 (D)2 respects the formal axiomatization (Hanika et al., 2022) for ID functions, informally: Axiom of concentration: A sequence of geometric datasets converges against the constant dataset (meaning having no chance to separate data points!), if and only if their IDs diverge against infinity. Axiom of feature antitonicity: If dataset D1 has more feature functions then D0 (i.e. having potentially more information to separate data points), it should have a lower intrinsic dimension. Axiom of continuity: If a sequence of geometric datasets converge against a specific geometric dataset, the sequence of the IDs should converge against the ID of the limit geometric dataset. Axiom of geometric order of divergence: If a sequence of geometric datasets converges against the constant dataset, its IDs should diverge against infinity with the same order as 1 ((D))2 does. 3.1 Intrinsic Dimension of Finite Data We want to apply Equation (1) to real-world data. In the following, let D = (X, F, ν) such that 0 < |X| < and 0 < |F| < and let ν be the normalized counting measure on X, i.e., ν(M) := |M| |X| for M X. In this case, it is possible to compute the partial diameter and Equation (1), as we show in the following. Let α (0, 1) and let cα := |X|(1 α) . The following arguments were already hinted in previous work (Hanika et al., 2022), yet not formally discussed or proven. Published in Transactions on Machine Learning Research (01/2023) Lemma 3.1. For f F it holds that Part Diam(f (ν), 1 α) = min |M|=cα max x,y M |f(x) f(y)|. Proof. It holds that Part Diam(f (ν), 1 α) = inf{diam(B) | B R Borel, ν(f 1(B)) 1 α} = inf{diam(B) | B R Borel,|{x X | f(x) B}| cα}. We have to show that inf{diam(B) | B R Borel, |{x X | f(x) B}| cα} = min{ max x,y M |f(x) f(y)| | M X, |M| cα}. (2) : We show that {diam(B) | B R Borel, |{x X | f(x) B}| cα} {maxx,y M |f(x) f(y)| | M X, |M| cα}. Let z := maxx,y M |f(x) f(y)| such that M X with |M| cα. Without loss of generality we assume that x X : ( m1, m2 M : f(m1) f(x) f(m2) = x M). Let b := maxx M f(x), a := minx M f(x), then M = {x X | f(x) [a, b]}. Hence, z = b a {diam(B)|B R Borel,|{x X|f(x) B}| cα}. : Let B R be Borel with |{x X | f(x) B}| cα. Furthermore, let M := {x X | f(x) B}. It holds that diam(B) = supx,y B |x y| maxx,y M |f(x) f(y)| because of the choice of M. As B was chosen arbitrarily, it follows . Finally, we need that min{ max x,y M |f(x) f(y)| | |M| cα} = min{ max x,y M |f(x) f(y)| | |M| = cα}. follows directly from the fact that {maxx,y M |f(x) f(y)| | |M| cα} {maxx,y M |f(x) f(y)| | |M| = cα}. follows from the fact that for every |M| cα and for every N M with |N| = cα the following equation holds: supx,y M |f(x) f(y)| supx,y N |f(x) f(y)|. This lemma allows for a more tractable formula for the computation of the partial diameter of a finite GD. That in turn enables the following theorem. Theorem 3.2. It holds that (D) = 1 |X| k=2 max f F min M X |M|=k max x,y M |f(x) f(y)|. (3) Proof. Let g : (0, 1) R, α 7 maxf F min M X,|M|=cα maxx,y M |f(x) f(y)|. Because of Lemma 3.1 we know that (D) = R 1 0 g(α) dα. The function g is a step function which can be expressed for each α (0, 1) via k=1 1 |X| k |X| , |X|+1 k |X| (α) max f F min M X |M|=k max x,y M |f(x) f(y)| almost everywhere. Hence, Equation (3) follows from the definition of the Lebesgue-Integral with the fact that min M X,|M|=1 maxx,y M |f(x) f(y)| = 0. In general, the addition of features should lower the ID since we have additional information that helps to discriminate the data. However, there are certain features that are not helping to further discriminate data points. These are for example: Published in Transactions on Machine Learning Research (01/2023) 1. Constant features. This is due to the fact that for a constant feature f it always holds for all M X that maxx,y M |f(x) f(y)| = 0. 2. Permutations of already existing features. Let f : X R have the form f = f π with π : X X being a permutation on X. Then there exist for all M X with |M| = k a set N X with |N| = k and maxx,y M |f(x) f(y)| = maxx,y N | f(x) f(y)| and vice versa. Thus, we have the following Lemma. Lemma 3.3. Let D = (X, F, µ) be a finite geometric dataset. Furthermore, let ˆF be a set of constant functions X R and let F be a set of functions X R such that there exist for each f F a f F and a permutation π : X X with f = f π. Let E := (X, F ˆF F, µ). Then it holds that (D) = (E). 3.2 Computing the Intrinsic Dimension of Finite Data In this section we will propose an algorithm for computing the ID based on Equation (3). For this, given a finite geometric dataset D, we use the shortened notations φk,f(D) := min M X,|M|=k maxx,y M |f(x) f(y)| and φk(D) := maxf F φk,f(D). Then, Equation (3) can be written as (D) = 1 |X| k=2 φk(D) = 1 |X| k=2 max f F φk,f(D). (4) The straightforward computation of Equation (4) is hindered by the task to iterate through all subsets M X of size k. This yields an exponential complexity with respect to |X| for computing (D). We can overcome this towards a quadratic computational complexity in |X| using the following concept. Definition 3.4. (Feature Sequence) For a feature f F let lf,D R|X| be the increasing sequence of all values f(x) for x X. We call lf,D = (lf,D 1 , . . . , lf,D |X| ) the feature sequence of f. Using these sequences, the following lemma allows us to efficiently compute φk,f(D). Lemma 3.5. For k {2, . . . , |X|}, f F and lf,D, it holds that φk,f(D) = min{lf,D k+j lf,D 1+j | j {0, . . . , |X| k}}. Proof. For all j {0, . . . , |X| k} there exist M X with |M| = k and lf,D k+j lf,D 1+j = maxx,y M |f(x) f(y)|. Thus, it is sufficient to show φk,f(D) {lf,D k+j lf,D 1+j | j {0, . . . , |X| k}}. Choose M X with |M| = k such that φk,f(D) = maxx,y M |f(x) f(y)| holds. Furthermore, let l M := (l M 1 , . . . , l M k ) be the increasing sequence of values f(m) for m M and let j {0, . . . , |X| k} such that l M 1 = lf,D 1+j. Since l M is an ordered sequence of which each element is also an element of the ordered sequence lf,D, it holds that l M k lf,D k+j and thus l M k l M 1 lf,D j+k lf,D j+1. There is an N X with size k such that maxx,y N |f(x) f(y)| = lf,D k+j lf,D k+1. Since M X is of size k such that maxx,y M |f(x) f(y)| is minimal, it follows l M k l M 1 = maxx,y M |f(x) f(y)| maxx,y N |f(x) f(y)| = lf,D k+j lf,D k+1, hence φk,f(D) = maxx,y M |f(x) f(y)| = l M k l M 1 = lf,D k+j lf,D k+1. To sum up, Lemma 3.5 enables the efficient computation of φk,f(D) via a sliding window, i.e., by using only pairs of elements (lf,D 1+j, lf,D k+j). The algorithm based on this is shown in Algorithm 1. We want to provide a brief description of the most relevant steps. In Line 4 we iterate through the sizes of X by setting k {2, . . . , |X|} in order to compute φk(D) in Lines 6 and 7. For this we also need to iterate over all f F (Line 5) to compute the necessary values of φk,f(D) in Line 6. For a given f F, k {1, . . . , |X|}, Line 6 consumes |X| k + 1 subtraction operations. Assuming that computing feature values can be done in constant time, the runtime for computing (D) from the feature sequences is O(|F| P|X| k=2 |X| k + 1) = O(|F| P|X| 1 k=1 k) = O(|F| |X|2). The creation of all feature sequences requires O(|F||X| log(|X|)) computations , which is negligible compared to the aforementioned complexity. Thus, Algorithm 1 has quadratic complexity with respect to |X|. Therefore, Algorithm 1 is a straightforward and easy to implement Published in Transactions on Machine Learning Research (01/2023) Algorithm 1: The pseudocode to compute (D) for a finite geometric dataset D = (X, µ, F). Input : Finite geometric dataset D = (X, µ, F). Output: (D) 1 forall f in F do 2 Compute feature sequence lf,D. 4 forall k in {2, . . . , |X|} do 5 forall f in F do 6 φk,f(D) = minj {0,...,|X| k} lf,D k+j lf,D 1+j. 7 (D)+ = maxf F φk,f(D) 8 (D) = 1 |X| (D) 9 return (D) solution for the computation of the ID. However, its quadratic runtime is obstructive for the application in large-scale data problems, which raises the necessity for a modification. We will present such a modification in the following section. 4 Intrinsic Dimension for Large-Scale Data In order to speed up the computation of the ID we modify Algorithm 1 with regard to the accuracy of the result. Hence, we settle for an efficiently computable approximation of the ID. To give an overview over the necessary steps, we will 1. approximate the ID by replacing {2, . . . , |X|} in Line 4 of Algorithm 1 with a smaller subset S {2, . . . , |X|}, which we represent by S := {s1, . . . , sl}. For all k S, we will use {φs1(D), . . . , φsl(D)} to estimate φk(D). This will eventually lead to two approximations of the ID, an underestimation and an overestimation. 2. compare the upper and lower approximation to provide an error bound of these approximations with respect to the exact ID. This error bound can be computed without knowing the exact ID. 3. argue how, the computation of the exact ID can be sped up with the help of knowledge about φsi(D) for all si S. For this, we will in particular show that we can replace for all k {2, . . . , |X|} \ M the set F with a subset ˆF, see Line 5-6 of Algorithm 1. 4. derive a formula which estimates the amount of computation cost which is saved by using only subsets of F for the computation of the ID. This information can be used to estimate and decide whether the exact computation of the ID is computational feasible for a specific dataset. The ensuing algorithm is shown in Algorithm 2. The underlying theory that justifies it is presented in the following. This theory will be based on the monotonicity of i 7 φi,f(D). Theorem 4.1. For m > n 2 and f F the following statements hold. 1. φm,f(D) φn,f(D), 2. φm(D) φn(D). Proof. The second inequality follows directly from the first one. Since per definition φm,f(D) = min M X,|M|=m maxx,y M |f(x) f(y)| and also φn,f(D) = min N X,|N|=n maxx,y N |f(x) f(y)|, we need to show that for each M X with |M| = m there exist N X with |N| = n and maxx,y M |f(x) f(y)| maxx,y N |f(x) f(y)|. It is sufficient to show that for n = m 1. Choose x1, x2 M such that Published in Transactions on Machine Learning Research (01/2023) Algorithm 2: The pseudocode to compute s, (D), s,+(D), (D) for a finite GD D = (X, µ, F). Input : Finite GD D = (X, µ, F), support sequence s = (2 = s1, . . . , sl = |X|), exact (Boolean) Output: s, (D), s,+(D), (D) 1 forall f in F do 2 Compute feature sequence lf,D. 5 φs0(D) = 0 6 forall i in {1, . . . , l} do // Iterate over support sequence indices 7 forall f in F do 8 φsi,f(D) = minj {0,...,|X| si} lf,D si+j lf,D 1+j // Compute φsi,f(D) with Lemma 3.5 9 φsi(D) = maxf F φsi,f(D) 10 Fsi 1 = {f F | φsi,f(D) > φsi 1(D)} // Compute Fsi 1 for Lemma 4.7 11 (D)+ = φsi(D) 12 s, (D) = 1 |X|( (D) + Pl 1 i=1 P si 2 we find x3 M \ {x1, x2}. Let N := M \ {x3}. It holds that maxx,y M |f(x) f(y)| = |f(x1) f(x2)| = maxx,y N |f(x) f(y)|. 4.1 Computing Intrinsic Dimension via Support Sequences Equipped with Theorem 4.1, we can bound (D) and thus the intrinsic dimension through computing φsi for a few 2 = s1 < s2 < sl = |X|. Definition 4.2. (Support Sequences and Upper / Lower ID) Let s = (2 = s1, . . . , sl 1, sl = |X|) be a strictly increasing and finite sequence of natural numbers. We call s a support sequence of D. We additionally define s, (D) := 1 |X| i=1 φsi(D) + si φsi(D)}. Then it holds that φj(D) = max({φj,f(D) | f Fsi} {φsi(D)}). Proof. follows from Theorem 4.1 and the definition of φj(D). holds because for f F \Fsi it holds that φj,f(D) φsi+1,f(D), due to Theorem 4.1, and φsi+1,f(D) φsi(D), due to the construction of Fsi. Hence, given a specific j, it is possible to compute φj(D) using a subset of F. Based on the particular GD D, this fact can considerably speed up the computation of the ID of D, as we will see in Section 5. An algorithm to approximate and compute the ID through support sequences is depicted in Algorithm 2. This algorithm takes as input a GD D and a chosen support sequence s. A reasonable choice for support sequences is discussed in Section 5.1. The output is s, (D), s,+(D), and (D), if desired (Line 14). In Line 2, all feature sequences are computed. In Line 6 to Line 11, φsi(D) and Fsi 1, as defined in Lemma 4.7, are computed. From Line 1 to Line 13, the feature sequences and the lower and upper ID are computed. If desired, the exact computation is done in Line 15 to Line 21. Here, we iterate for all support elements (Line 15) through all gaps between them (Line 16) and compute φj(D) using Lemma 4.7 (Line 17 to Line 19). Published in Transactions on Machine Learning Research (01/2023) Table 1: Statistics of all datasets used in this work. Nodes Edges Attributes Pub Med 19, 717 88, 648 500 Cora 2, 708 10, 556 1, 433 Cite Seer 3, 327 9, 104 3, 703 ogbn-arxiv 169, 343 1, 166, 243 128 ogbn-products 2, 449, 029 61, 859, 140 100 ogbn-mag 1, 939, 743 21, 111, 007 128 ogbn-papers100M 111, 059, 956 1, 615, 685, 872 128 4.2 Estimating Computational Costs Let s = (s1, . . . , sl) be a support sequence. After the computation of Fs1, . . . , Fsl 1, we can estimate how much computation steps we can avoid in order to compute (D) with Algorithm 2 compared to Algorithm 1. Together with the error function E(s), this estimation can help us to decide if it is desirable to compute the exact value (D) or leave it at s, (D) and s,+(D). This is done in the following manner. For a specific f F, Lemma 3.5 shows that the computation of φk,f(D) requires |X| k + 1 different subtractions and to keep the minimum value. Hence, the cost for computing (D) and therefore (D) via Algorithm 1 can be estimated via O(|F| P|X| k=2(|X| k + 1)) = O(|F| P|X| 1 k=1 k) = O(|F|( |X|2 |X| 2 )). However, if we use Algorithm 2, we solely have the cost to compute φsi. For all values j with si < j < si+1, our cost estimation is |Fsi|(|X| j + 1). Hence, for a given support sequence s = (s1, . . . , sl), we can estimate how many computations are saved using the following notions. We address the naive computation costs for computing the ID of a GD with C(D) := |F|(|X|2 |X| In contrast, for a support sequence s = (s1, . . . , sl) of D, the computation costs are Cs(D) := (|F| k=1 |X| sk + 1) + k=1 |Fsk| X sk s,+(Di+1) for i {0, . . . , 4}. It stands out, that even for such a short support sequence (compared to the size of the dataset), the observed error is remarkably low. In detail, we can approximate the ID with an accuracy of over 99.95%. It is further remarkable, that the error does not change significantly for different k. We observed this effect also for the other datasets. Our results on ogbn-papers100M indicate, that with short support sequences, we can sufficiently approximate the ID of large-scale graph data. 6 Errors of Random Data To further understand how our approximation procedure behaves we conducted experiments on random data. We considered different data sizes and different amount of attributes. For this, we experimented with real-valued datasets, i.e. datasets represented by an attribute matrix X Rn d. Here, the feature functions are given by the data columns. To be more detailed, the considered geometric dataset is D = ({Xi | i {1, . . . , n}}, {Xi 7 Xi,j | j {1, . . . , d}}, ν). Here, ν is again the normalized counting measure and Xi is he i th row vector of X. We iterate n through {106, 107, 108} and d through {10, 50, 250}. We repeat all experiments 3 times. For all datasets, we build a support sequence as described in Section 5.1 with l = 100, 000. The results can be found in Table 4. For all datasets, the errors are small and the accuracy is over 99.9% for all considered data sizes. The difference in the error for different values of d is negligible. Furthermore, we have small standard deviations. All this indicates that l = 100, 000 is a reasonable default choice that leads to sufficient approximations in a large range of data and attribute sizes. Published in Transactions on Machine Learning Research (01/2023) 7 Conclusion and Future Work We presented a principle way to efficiently compute the intrinsic dimension (ID) of geometric datasets. Our approach is based on an axiomatic foundation and accounts for underlying structures and is therefore especially tailored to the field of geometric learning. We proposed a novel speed up technique for an algorithm which has quadratic complexity with respect to the amount of data points. This enabled us to compute the ID of several real-world graphs with up to millions of nodes. Equipped with this ability, we shed light on connections of classification performances of graph neural networks and the observed intrinsic dimension for common benchmark datasets. Finally, using a novel approximation technique, we were able to show that our method scales to graphs with over 100 million nodes and billions of edges. We illustrated this by using the well-known ogbn-papers100M dataset. Future work includes the identification of suitable feature functions for other domains, such as learning on text or image data. Incorporating the structure of such datasets into the computation of intrinsic dimensionality is an open research problem. Another promising research direction is to investigate how the ID of datasets could be manipulated. Since our investigations suggest connections between a low ID and high classification performances, this has the potential to enhance learning procedures. Acknowledgement The authors thank the State of Hesse, Germany for funding this work as part of the LOEWE Exploration project Dimension Curse Detector" under grant LOEWE/5/A002/519/06.00.003(0007)/E19. Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Neur IPS, pp. 6109 6119, 2019. Jonathan Bac and Andrei Zinovyev. Local intrinsic dimensionality estimators based on concentration of measure. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2020. doi: 10.1109/IJCNN48605.2020.9207096. Edgar Chávez, Gonzalo Navarro, Ricardo A. Baeza-Yates, and José L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273 321, 2001. doi: 10.1145/502807.502808. Alexander Cloninger and Timo Klock. A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404 419, 2021. ISSN 0893-6080. doi: https://doi.org/10.1016/ j.neunet.2021.06.004. J.A. Costa, A. Girotra, and Alfred Hero. Estimating local intrinsic dimension with k-nearest neighbor graphs. volume 30, pp. 417 422, 08 2005. ISBN 0-7803-9403-8. doi: 10.1109/SSP.2005.1628631. Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 135 144. ACM, 2017. doi: 10.1145/3097983.3098036. Elena Facco, Maria d Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1):1 8, 2017. Matthias Fey and Jan E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Marina Gomtsyan, Nikita Mokrov, Maxim Panov, and Yury Yanovich. Geometry-aware maximum likelihood estimation of intrinsic dimension. In Wee Sun Lee and Taiji Suzuki (eds.), Proceedings of The Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pp. 1126 1141. PMLR, 17 19 Nov 2019. Published in Transactions on Machine Learning Research (01/2023) Daniele Granata and Vincenzo Carnevale. Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets. Scientific reports, 6(1):1 12, 2016. Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. In The theory of chaotic attractors, pp. 170 189. Springer, 2004. M. Gromov and V. D. Milman. A topological application of the isoperimetric inequality. American Journal of Mathematics, 105(4):843 854, 1983. ISSN 00029327, 10806377. William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pp. 1024 1034, 2017. Tom Hanika, Friedrich Martin Schneider, and Gerd Stumme. Intrinsic dimension of geometric data sets. Tohoku Mathematical Journal, 74(1):23 52, 2022. doi: 10.2748/tmj.20201015a. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ar Xiv preprint ar Xiv:2005.00687, 2020. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A large-scale challenge for machine learning on graphs. ar Xiv preprint: 2103.09430, 2021. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th Int. Conf. on Learning Representations, 2017. Elizaveta Levina and Peter J. Bickel. Maximum likelihood estimation of intrinsic dimension. In NIPS, pp. 777 784, 2004. Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In ICLR (Poster). Open Review.net, 2018. Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018a. URL https://openreview.net/forum?id=B1g J1L2a W. Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi N. R. Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3361 3370. PMLR, 2018b. URL http://proceedings.mlr.press/v80/ma18d. html. David JC Mac Kay and Zoubin Ghahramani. Comments on maximum likelihood estimation of intrinsic dimension by e. levina and p. bickel (2005). The Inference Group Website, Cavendish Laboratory, Cambridge University, 2005. V. Milman. Topics in Asymptotic Geometric Analysis, pp. 792 815. Birkhäuser Basel, Basel, 2010. ISBN 978-3-0346-0425-3. doi: 10.1007/978-3-0346-0425-3_8. V. D. Milman. The heritage of P. Lévy in geometrical functional analysis. In Colloque Paul Lévy sur les processus stochastiques, number 157-158 in Astérisque. Société mathématique de France, 1988. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Vladimir Pestov. On the geometry of similarity search: Dimensionality curse and concentration of measure. Inf. Process. Lett., 73(1-2):47 51, 2000. doi: 10.1016/S0020-0190(99)00156-8. Published in Transactions on Machine Learning Research (01/2023) Vladimir Pestov. Intrinsic dimension of a dataset: what properties does one expect? In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2007, Celebrating 20 years of neural networks, Orlando, Florida, USA, August 12-17, 2007, pp. 2959 2964. IEEE, 2007. doi: 10.1109/IJCNN. 2007.4371431. Vladimir Pestov. An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2-3): 204 213, 2008. doi: 10.1016/j.neunet.2007.12.030. Vladimir Pestov. Intrinsic dimensionality. ACM SIGSPATIAL Special, 2(2):8 11, 2010. doi: 10.1145/ 1862413.1862416. Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2020. Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael M. Bronstein, and Federico Monti. SIGN: scalable inception graph neural networks. Co RR, abs/2004.11198, 2020. Chuxiong Sun and Guoshi Wu. Scalable and adaptive graph neural networks with self-label-enhanced training. Co RR, abs/2104.09376, 2021. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. Stephan Wojtowytsch and Weinan E. Can shallow neural networks beat the curse of dimensionality? a mean field training perspective. IEEE Transactions on Artificial Intelligence, 1(2):121 129, 2020. doi: 10.1109/TAI.2021.3051357. Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 40 48. JMLR.org, 2016. Wentao Zhang, Ziqi Yin, Zeang Sheng, Wen Ouyang, Xiaosen Li, Yangyu Tao, Zhi Yang, and Bin Cui. Graph attention multi-layer perceptron. Co RR, abs/2108.10097, 2021. Published in Transactions on Machine Learning Research (01/2023) A.1 Setup of SIGN classifiers For Pub Med, Cora and Cite Seer, we train on the classification task provided by Pytorch Geometric (Fey & Lenssen, 2019) which was earlier studied by Yang et al. (2016). All Open Graph Benchmark datasets are trained and tested on the official node property prediction task.4 Our goal is not to find optimal classifiers but to discover connections between the choice of k, the ID and classifier performance. Thus, we omit excessive parameter tuning and stick to reasonable standard parameters. For all tasks, we use a simple SIGN model Rossi et al. (2020) with one hidden inception layer and one classification layer. For Pub Med, Cite Seer and Cora, we use batch sizes of 256, hidden layer size of 64 and dropout at the input and hidden layer with 0.5. The learning rate is set to 0.01. All these parameters were taken from Kipf & Welling (2017). For ogbn-arxiv, ogbn-mag and ogbn-products, we stick to the parameters from the SIGN implementations on the OGB leaderbord. For ogbn-arxiv, we use a hidden dimension of 512, dropout at the input with 0.1 and with 0.5 at the hidden layer. For ogbn-mag, we use a hidden dimension of 512, do not dropout at the input and use dropout with 0.5 at the hidden layer. For ogbn-products, we use a hidden dimension of 512, input dropout of 0.3 and hidden layer dropout of 0.4. For all ogbn tasks, the learning rate is 0.001 and the batch-size 50000. For all experiments, we train for a maximum of 1000 epochs with early stopping on the validation accuracy. Here, we use a patience of 15. These are the standard parameters of Pytorch Lightning.5 For all models, we use an Adam optimizer with weight decay of 0.0001. We report mean test accuracies over 10 runs. The intrinsic dimensions and the test accuracy are shown in Table 2. A.2 Details on Baseline ID Estimator To use the MLE ID, we have to convert the k-hop geometric dataset (V, FD,k, ν) of graph data D = (X, G), where X Rn d into a real-valued feature matrix ˆX. This done by concatenating the rows of X with the rows of ˆAX, . . . ˆAk X, i.e., ˆX Rn (k+1)d with ( Xi,j j {1, . . . , d}, (An X)i,ˆj j = nd + ˆj for n {1, . . . , k}, ˆj {1, . . . d}. The MLE is given via MLE( ˆX) := 1 n(k 1) j=1 log( d( ˆXi, Nl( ˆXi)) d( ˆXi, Nj( ˆXi)) ), (8) where d is the euclidean metric and Nj( ˆXi) is the j-th nearest neighbor of ˆXi with respect to the Euclidean metric. Thus, the MLE depends on a parameter l, which we set to 5. We implement the MLE by using the Nearest Neighbors class of scikit-learn (Pedregosa et al., 2011) and then building the mean of all log( d( ˆ Xi,N5( ˆ Xi)) d( ˆ Xi,Nj( ˆ Xi))) with i {1, . . . , n} and j {1, . . . , 5}. Here, we skip all elements where d( ˆXi, Nj( ˆXi)) = 0. This can happen, when ˆX has duplicated rows, representing data points with equal attribute vectors. For ogbn-mag and ogbn-products, computing Equation (8) is not possible due to performance reasons. Here, we sample 169, 343 indices6 I {1, . . . , n} and only compute MLE( ˆX) := 1 n(k 1) j=1 log( d( ˆXi, Nl( ˆXi)) d( ˆXi, Nj( ˆXi)) ). 4https://ogb.stanford.edu/docs/nodeprop/ 5https://www.pytorchlightning.ai/ 6This is the amount of nodes of ogbn-arxiv, the largest network for which the full computation was feasible.