# intrinsic_dimension_for_largescale_geometric_learning__d95c7346.pdf

Published in Transactions on Machine Learning Research (01/2023)

Intrinsic Dimension for Large-Scale Geometric Learning

Maximilian Stubbemann stubbemann@cs.uni-kassel.de Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany

Tom Hanika tom.hanika@cs.uni-kassel.de Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany

Friedrich Martin Schneider martin.schneider@math.tu-freiberg.de Institute of Discrete Mathematics and Algebra, TU Bergakademie Freiberg, Freiberg, Germany

Reviewed on Open Review: https: // openreview. net/ forum? id= 85Bf Dd YMBY

The concept of dimension is essential to grasp the complexity of data. A naive approach to determine the dimension of a dataset is based on the number of attributes. More sophisticated methods derive a notion of intrinsic dimension (ID) that employs more complex feature functions, e.g., distances between data points. Yet, many of these approaches are based on empirical observations, cannot cope with the geometric character of contemporary datasets, and do lack an axiomatic foundation. A diﬀerent approach was proposed by V. Pestov, who links the intrinsic dimension axiomatically to the mathematical concentration of measure phenomenon. First methods to compute this and related notions for ID were computationally intractable for large-scale real-world datasets. In the present work, we derive a computationally feasible method for determining said axiomatic ID functions. Moreover, we demonstrate how the geometric properties of complex data are accounted for in our modeling. In particular, we propose a principle way to incorporate neighborhood information, as in graph data, into the ID. This allows for new insights into common graph learning procedures, which we illustrate by experiments on the Open Graph Benchmark.

1 Introduction

Contemporary real-world datasets employed in artiﬁcial intelligence are often large in size and comprised of complex structures, which distinguishes them from Euclidean data. To consider these properties appropriately is a challenging task for procedures that analyze or learn from said data. Moreover, with increasing complexity of real-world data, the necessity arises to quantify to which extent this data suﬀer from the curse of dimensionality. The common approach for estimating the dimension curse of a particular dataset is through the notion of intrinsic dimension (ID) (Bac & Zinovyev, 2020; Granata & Carnevale, 2016; Pestov, 2007). There exists a variety of work on how to estimate the ID of datasets (Facco et al., 2017; Levina & Bickel, 2004; Costa et al., 2005; Gomtsyan et al., 2019; Bac & Zinovyev, 2020). Most approaches to quantify the ID are based on distances between data points, assuming the data to be Euclidean. A multitude of works base their modeling on the manifold hypothesis (Cloninger & Klock, 2021; Gomtsyan et al., 2019), which assumes that the observed data is embedded in a manifold of low dimension (compared to the number of data attributes). The ID then is an approximation of the dimension of this manifold. Pestov (2000) proposed a diﬀerent concept of intrinsic dimension by linking it to the mathematical concentration of measure phenomenon. His modeling is based on a thorough axiomatic approach (Pestov, 2007; 2008; 2010) which resulted in a novel class of intrinsic dimension functions. In contrast to the manifold hypothesis, Pestov s ID functions measure to which extent a dataset is aﬀected by the curse of dimensionality, i.e., to which extent the complexity of the dataset hinders the discrimination of data points. Yet, to compute said ID functions is an intractable computational endeavor. This limitation was overcome in principle by an adaptation to geometric datasets (Hanika et al., 2022). However, two limitations persisted: First, the computational eﬀort was

Published in Transactions on Machine Learning Research (01/2023)

found to remain quadratic in the number of data points, which is insuﬃcient for datasets of contemporary size; second, it is unclear how to account for complex structure, such as in graph data.

With this in mind, we propose in the present work a default approach for computing the intrinsic dimension of geometric data, such as graph data, as used in graph neural networks. To do this, we revisit the computation of the ID based on distance functions (Hanika et al., 2022) and overcome, in particular, the inherent computational limitations in the works by Pestov (2007) and Hanika et al. (2022). In detail, we derive a novel approximation formula and present an algorithm for its computation. This allows us to compute ID bounds for datasets that are magnitudes larger than in earlier works. That equipped with, we establish a natural approach to compute the ID of graph data.

We subsequently apply our method to seven real-world datasets and relate the obtained results to the observed performances of classiﬁcation procedures. Thus, we demonstrate the practical computability of our approach. In addition, we study the extent to which the intrinsic dimension reveals insights into the performance of particularly classes of Graph Neural Networks. Our code is publicly available on Git Hub.1

2 Related Work

In numerous works, the intrinsic dimension is estimated using the pairwise distances between data points (Chávez et al., 2001; Grassberger & Procaccia, 2004). More sophisticated approaches use distances to nearest neighbors (Facco et al., 2017; Levina & Bickel, 2004; Costa et al., 2005; Gomtsyan et al., 2019). All these works have in common, that they assume the data to be Euclidean and that they favor local properties.

Recent work has drawn diﬀerent connections between intrinsic dimension (ID) and modern learning theory. For example, Cloninger & Klock (2021) show that functions of the form f(x) = g(φ(x)), where φ maps into a manifold of lower dimension, can be approximated by neural networks. On the other hand, Wojtowytsch & E (2020) prove that modern artiﬁcial neural networks suﬀer from the curse of dimensionality in the sense that gradient training on high dimensional data may converge insuﬃciently. Additional to these theoretical results, there is an increasing interest of empirically estimating the ID of contemporary learning architectures. Li et al. (2018) study the ID of neural networks by replacing high dimensional parameter vectors with lower dimensional ones. Their approach results in a non-deterministic ID. More recent works studied ID in the realm of geometric data and their standard architectures. Ansuini et al. (2019) investigate the ID for convolutional neural networks (CNN). In detail, they are interested to which extent the ID changes at diﬀerent hidden layers and how this is related to the overall classiﬁcation performance. Another work (Pope et al., 2020) associates an ID to popular benchmark image datasets. These two works on ID estimators do solely rely on the metric information of the data and do not consider any geometric structure of image data.

Our approach allows to incorporate such underlying geometric structures while incorporating the mathematical phenomenon of measure concentration (Gromov & Milman, 1983; Milman, 1988; 2010). Linking this phenomenon to the occurrence of the dimension curse was done by Pestov (Pestov, 2000; 2007; 2008; 2010). He based his considerations on a thorough axiomatic approach using techniques from metric-measure spaces. The resulting ID functions unfortunately turn out to be practically incomputable. In contrast, Bac & Zinovyev (2020) investigate computationally feasible ID estimators that are related to the concentration phenomenon. Yet, their results elude a comparable axiomatic foundation. Our modeling for the ID of large and geometric data is based on Hanika et al. (2022). We build on their axiomatization and derive a computationally feasible method for the intrinsic dimension of large-scale geometric datasets.

3 Intrinsic Dimension

Since our work is based on the formalization from Hanika et al. (2022), we shortly revisit their modeling and recapitulate the most important structures. Based on this, we derive and prove an explicit formula to compute the ID for the special case of ﬁnite geometric datasets. This ﬁrst result is essential for Section 4.

1https://github.com/mstubbemann/ID4Geo L

Published in Transactions on Machine Learning Research (01/2023)

Let D = (X, F, µ), where X is a set and F RX is a set of functions from X to R, in the following called feature functions. We require that supx,y X d F (x, y) < , where d F (x, y) := supf F |f(x) f(y)|. If (X, d F ) constitutes a complete and separable metric space such that µ is a Borel probability measure on (X, d F ), we call D a geometric dataset (GD). The aforementioned properties are satisﬁed when it holds that 0 < |X|, |F| < and that F can discriminate all data points, i.e., d F (x, y) > 0 for all x, y X with x = y.

Two geometric datasets D1 := (X1, F1, µ1), D2 := (X2, F2, µ2) are isomorphic if there exists a bijection φ : X1 X2 such that F2 φ = F1 and φ (µ1) = µ2, where φ (µ1)(B) := µ1(φ 1(B)) is the push-forward measure and the closures are taken with respect to point-wise convergence. From this point on we identify a geometric dataset with its isomorphism class. The triple ({ }, R, ν{ }) represents the trivial geometric dataset.

Pestov (2007) deﬁnes the curse of dimensionality as [...] a name given to the situation where all or some of the important features of a dataset sharply concentrate near their median (or mean) values and thus become non-discriminating. In such cases, X is perceived as intrinsically high-dimensional. Thus, the ID estimator aims to compute to which extent the features allow to discriminate diﬀerent data points. For a speciﬁc feature f F, Hanika et al. (2022) therefore deﬁnes the partial diameter of f with regard to a speciﬁc α (0, 1) such that it displays to which extent f can discriminate subsets with minimal measure 1 α, i.e., via Part Diam(f (ν), 1 α) = inf{diam(B) | B R Borel, ν(f 1(B)) 1 α},

where diam(B) := supa,b B |a b|. The observable diameter with respect to α then deﬁnes to which extent F can discriminate points with minimum measure 1-α by being deﬁned as the supremum of the partial diameter of all f F, i.e, Obs Diam(D, α) := supf F Part Diam(f (µ), 1 α). To observe the discriminability for diﬀerent minimal measures α, the discriminability (D) of D and the intrinsic dimension (D) are deﬁned via

(D) := 1 (D)2 , where (D) := Z 1

0 Obs Diam(D, α) dα. (1)

In other words, lower values of intrinsic dimensionality correspond to geometric datasets with points that can be better discriminated by the given set of feature functions. This intrinsic dimension function is, in principle, applicable to a broad variety of geometric data, such as metric data, graphs or images. This applicability arises from the possibility to choose suitable feature functions which reﬂect the underlying data structure. The appropriate choice of feature functions is part of Section 5. Furthermore, the ID (D) = 1 (D)2 respects the formal axiomatization (Hanika et al., 2022) for ID functions, informally:

Axiom of concentration: A sequence of geometric datasets converges against the constant dataset (meaning having no chance to separate data points!), if and only if their IDs diverge against inﬁnity.

Axiom of feature antitonicity: If dataset D1 has more feature functions then D0 (i.e. having potentially more information to separate data points), it should have a lower intrinsic dimension.

Axiom of continuity: If a sequence of geometric datasets converge against a speciﬁc geometric dataset, the sequence of the IDs should converge against the ID of the limit geometric dataset.

Axiom of geometric order of divergence: If a sequence of geometric datasets converges against the constant dataset, its IDs should diverge against inﬁnity with the same order as 1 ((D))2 does.

3.1 Intrinsic Dimension of Finite Data

We want to apply Equation (1) to real-world data. In the following, let D = (X, F, ν) such that 0 < |X| < and 0 < |F| < and let ν be the normalized counting measure on X, i.e., ν(M) := |M|

|X| for M X. In this case, it is possible to compute the partial diameter and Equation (1), as we show in the following. Let α (0, 1) and let cα := |X|(1 α) . The following arguments were already hinted in previous work (Hanika et al., 2022), yet not formally discussed or proven.

Published in Transactions on Machine Learning Research (01/2023)

Lemma 3.1. For f F it holds that

Part Diam(f (ν), 1 α) = min |M|=cα max x,y M |f(x) f(y)|.

Proof. It holds that

Part Diam(f (ν), 1 α) = inf{diam(B) | B R Borel, ν(f 1(B)) 1 α}

= inf{diam(B) | B R Borel,|{x X | f(x) B}| cα}.

We have to show that

inf{diam(B) | B R Borel, |{x X | f(x) B}| cα} =

min{ max x,y M |f(x) f(y)| | M X, |M| cα}. (2)

: We show that {diam(B) | B R Borel, |{x X | f(x) B}| cα} {maxx,y M |f(x) f(y)| | M X, |M| cα}. Let z := maxx,y M |f(x) f(y)| such that M X with |M| cα. Without loss of generality we assume that x X : ( m1, m2 M : f(m1) f(x) f(m2) = x M).

Let b := maxx M f(x), a := minx M f(x), then M = {x X | f(x) [a, b]}. Hence, z = b a {diam(B)|B R Borel,|{x X|f(x) B}| cα}.

: Let B R be Borel with |{x X | f(x) B}| cα. Furthermore, let M := {x X | f(x) B}. It holds that diam(B) = supx,y B |x y| maxx,y M |f(x) f(y)| because of the choice of M. As B was chosen arbitrarily, it follows .

Finally, we need that

min{ max x,y M |f(x) f(y)| | |M| cα} = min{ max x,y M |f(x) f(y)| | |M| = cα}.

follows directly from the fact that {maxx,y M |f(x) f(y)| | |M| cα} {maxx,y M |f(x) f(y)| | |M| = cα}. follows from the fact that for every |M| cα and for every N M with |N| = cα the following equation holds: supx,y M |f(x) f(y)| supx,y N |f(x) f(y)|.

This lemma allows for a more tractable formula for the computation of the partial diameter of a ﬁnite GD. That in turn enables the following theorem.

Theorem 3.2. It holds that

(D) = 1 |X|

k=2 max f F min M X |M|=k

max x,y M |f(x) f(y)|. (3)

Proof. Let g : (0, 1) R, α 7 maxf F min M X,|M|=cα maxx,y M |f(x) f(y)|. Because of Lemma 3.1 we know that (D) = R 1 0 g(α) dα. The function g is a step function which can be expressed for each α (0, 1) via

k=1 1 |X| k

|X| , |X|+1 k

|X| (α) max f F min M X |M|=k

max x,y M |f(x) f(y)|

almost everywhere. Hence, Equation (3) follows from the deﬁnition of the Lebesgue-Integral with the fact that min M X,|M|=1 maxx,y M |f(x) f(y)| = 0.

In general, the addition of features should lower the ID since we have additional information that helps to discriminate the data. However, there are certain features that are not helping to further discriminate data points. These are for example:

Published in Transactions on Machine Learning Research (01/2023)

1. Constant features. This is due to the fact that for a constant feature f it always holds for all M X that maxx,y M |f(x) f(y)| = 0.

2. Permutations of already existing features. Let f : X R have the form f = f π with π : X X being a permutation on X. Then there exist for all M X with |M| = k a set N X with |N| = k and maxx,y M |f(x) f(y)| = maxx,y N | f(x) f(y)| and vice versa.

Thus, we have the following Lemma.

Lemma 3.3. Let D = (X, F, µ) be a ﬁnite geometric dataset. Furthermore, let ˆF be a set of constant functions X R and let F be a set of functions X R such that there exist for each f F a f F and a permutation π : X X with f = f π. Let E := (X, F ˆF F, µ). Then it holds that (D) = (E).

3.2 Computing the Intrinsic Dimension of Finite Data

In this section we will propose an algorithm for computing the ID based on Equation (3). For this, given a ﬁnite geometric dataset D, we use the shortened notations φk,f(D) := min M X,|M|=k maxx,y M |f(x) f(y)| and φk(D) := maxf F φk,f(D). Then, Equation (3) can be written as

(D) = 1 |X|

k=2 φk(D) = 1 |X|

k=2 max f F φk,f(D). (4)

The straightforward computation of Equation (4) is hindered by the task to iterate through all subsets M X of size k. This yields an exponential complexity with respect to |X| for computing (D). We can overcome this towards a quadratic computational complexity in |X| using the following concept.

Deﬁnition 3.4. (Feature Sequence) For a feature f F let lf,D R|X| be the increasing sequence of all values f(x) for x X. We call lf,D = (lf,D 1 , . . . , lf,D |X| ) the feature sequence of f.

Using these sequences, the following lemma allows us to eﬃciently compute φk,f(D).

Lemma 3.5. For k {2, . . . , |X|}, f F and lf,D, it holds that

φk,f(D) = min{lf,D k+j lf,D 1+j | j {0, . . . , |X| k}}.

Proof. For all j {0, . . . , |X| k} there exist M X with |M| = k and lf,D k+j lf,D 1+j = maxx,y M |f(x) f(y)|.

Thus, it is suﬃcient to show φk,f(D) {lf,D k+j lf,D 1+j | j {0, . . . , |X| k}}. Choose M X with |M| = k such that φk,f(D) = maxx,y M |f(x) f(y)| holds. Furthermore, let l M := (l M 1 , . . . , l M k ) be the increasing sequence of values f(m) for m M and let j {0, . . . , |X| k} such that l M 1 = lf,D 1+j. Since l M is an ordered sequence of which each element is also an element of the ordered sequence lf,D, it holds that l M k lf,D k+j and thus

l M k l M 1 lf,D j+k lf,D j+1. There is an N X with size k such that maxx,y N |f(x) f(y)| = lf,D k+j lf,D k+1. Since M X is of size k such that maxx,y M |f(x) f(y)| is minimal, it follows l M k l M 1 = maxx,y M |f(x) f(y)| maxx,y N |f(x) f(y)| = lf,D k+j lf,D k+1, hence φk,f(D) = maxx,y M |f(x) f(y)| = l M k l M 1 = lf,D k+j lf,D k+1.

To sum up, Lemma 3.5 enables the eﬃcient computation of φk,f(D) via a sliding window, i.e., by using only pairs of elements (lf,D 1+j, lf,D k+j). The algorithm based on this is shown in Algorithm 1. We want to provide a brief description of the most relevant steps. In Line 4 we iterate through the sizes of X by setting k {2, . . . , |X|} in order to compute φk(D) in Lines 6 and 7. For this we also need to iterate over all f F (Line 5) to compute the necessary values of φk,f(D) in Line 6. For a given f F, k {1, . . . , |X|}, Line 6 consumes |X| k + 1 subtraction operations. Assuming that computing feature values can be done in constant time, the runtime for computing (D) from the feature sequences is O(|F| P|X| k=2 |X| k + 1) = O(|F| P|X| 1 k=1 k) = O(|F| |X|2). The creation of all feature sequences requires O(|F||X| log(|X|)) computations , which is negligible compared to the aforementioned complexity. Thus, Algorithm 1 has quadratic complexity with respect to |X|. Therefore, Algorithm 1 is a straightforward and easy to implement

Published in Transactions on Machine Learning Research (01/2023)

Algorithm 1: The pseudocode to compute (D) for a ﬁnite geometric dataset D = (X, µ, F). Input : Finite geometric dataset D = (X, µ, F). Output: (D)

1 forall f in F do

2 Compute feature sequence lf,D.

4 forall k in {2, . . . , |X|} do

5 forall f in F do

6 φk,f(D) = minj {0,...,|X| k} lf,D k+j lf,D 1+j.

7 (D)+ = maxf F φk,f(D)

8 (D) = 1 |X| (D)

9 return (D)

solution for the computation of the ID. However, its quadratic runtime is obstructive for the application in large-scale data problems, which raises the necessity for a modiﬁcation. We will present such a modiﬁcation in the following section.

4 Intrinsic Dimension for Large-Scale Data

In order to speed up the computation of the ID we modify Algorithm 1 with regard to the accuracy of the result. Hence, we settle for an eﬃciently computable approximation of the ID. To give an overview over the necessary steps, we will

1. approximate the ID by replacing {2, . . . , |X|} in Line 4 of Algorithm 1 with a smaller subset S {2, . . . , |X|}, which we represent by S := {s1, . . . , sl}. For all k S, we will use {φs1(D), . . . , φsl(D)} to estimate φk(D). This will eventually lead to two approximations of the ID, an underestimation and an overestimation.

2. compare the upper and lower approximation to provide an error bound of these approximations with respect to the exact ID. This error bound can be computed without knowing the exact ID.

3. argue how, the computation of the exact ID can be sped up with the help of knowledge about φsi(D) for all si S. For this, we will in particular show that we can replace for all k {2, . . . , |X|} \ M the set F with a subset ˆF, see Line 5-6 of Algorithm 1.

4. derive a formula which estimates the amount of computation cost which is saved by using only subsets of F for the computation of the ID. This information can be used to estimate and decide whether the exact computation of the ID is computational feasible for a speciﬁc dataset.

The ensuing algorithm is shown in Algorithm 2. The underlying theory that justiﬁes it is presented in the following. This theory will be based on the monotonicity of i 7 φi,f(D).

Theorem 4.1. For m > n 2 and f F the following statements hold.

1. φm,f(D) φn,f(D),

2. φm(D) φn(D).

Proof. The second inequality follows directly from the ﬁrst one. Since per deﬁnition φm,f(D) = min M X,|M|=m maxx,y M |f(x) f(y)| and also φn,f(D) = min N X,|N|=n maxx,y N |f(x) f(y)|, we need to show that for each M X with |M| = m there exist N X with |N| = n and maxx,y M |f(x) f(y)| maxx,y N |f(x) f(y)|. It is suﬃcient to show that for n = m 1. Choose x1, x2 M such that

Published in Transactions on Machine Learning Research (01/2023)

Algorithm 2: The pseudocode to compute s, (D), s,+(D), (D) for a ﬁnite GD D = (X, µ, F).

Input : Finite GD D = (X, µ, F), support sequence s = (2 = s1, . . . , sl = |X|), exact (Boolean) Output: s, (D), s,+(D), (D)

1 forall f in F do

2 Compute feature sequence lf,D.

5 φs0(D) = 0

6 forall i in {1, . . . , l} do // Iterate over support sequence indices

7 forall f in F do

8 φsi,f(D) = minj {0,...,|X| si} lf,D si+j lf,D 1+j // Compute φsi,f(D) with Lemma 3.5

9 φsi(D) = maxf F φsi,f(D)

10 Fsi 1 = {f F | φsi,f(D) > φsi 1(D)} // Compute Fsi 1 for Lemma 4.7

11 (D)+ = φsi(D)

12 s, (D) = 1 |X|( (D) + Pl 1 i=1 P si<j<si+1 φsi(D)) // Compute lower ID from Definition 4.2

13 s,+(D) = 1 |X|( (D) + Pl 1 i=1 P

si<j<si+1 φsi+1(D)) // Compute upper ID from Definition 4.2

// Approximation finished. Continue with exact computation, if desired.

14 if exact then

15 forall i in {1, . . . l} do

16 forall si < j < si+1 do // Iterate through all indices between two support elements

17 forall f in Fsi do // Only iterate through Fsi because of Lemma 4.7

18 φsi,f(D) = minj {0,...,|X| si} lf,D si+j lf,D 1+j.

19 φj(D) = max({φj,f(D) | f Fi} {φsi}) // Use Lemma 4.7

20 (D)+ = φj(D)

21 (D) = 1 |X| (D)

22 return s, (D), s,+(D), (D)

23 return s, (D), s,+(D)

maxx,y M |f(x) f(y)| = |f(x1) f(x2)|. As |M| > 2 we ﬁnd x3 M \ {x1, x2}. Let N := M \ {x3}. It holds that maxx,y M |f(x) f(y)| = |f(x1) f(x2)| = maxx,y N |f(x) f(y)|.

4.1 Computing Intrinsic Dimension via Support Sequences

Equipped with Theorem 4.1, we can bound (D) and thus the intrinsic dimension through computing φsi for a few 2 = s1 < s2 < sl = |X|.

Deﬁnition 4.2. (Support Sequences and Upper / Lower ID) Let s = (2 = s1, . . . , sl 1, sl = |X|) be a strictly increasing and ﬁnite sequence of natural numbers. We call s a support sequence of D. We additionally deﬁne

s, (D) := 1 |X|

i=1 φsi(D) +

si<j<si+1 φsi(D)

s,+(D) := 1 |X|

i=1 φsi(D) +

si<j<si+1 φsi+1(D)

Published in Transactions on Machine Learning Research (01/2023)

and call accordingly s, (D) := 1 s,+(D)2 the lower intrinsic dimension of D and s,+(D) := 1 s, (D)2 the upper intrinsic dimension of D.

The governing idea is for i {1, . . . , l} and j with si < j < si+1 to substitute φj(D) with φsi(D) or φsi+1(D). With Theorem 4.1 this results in lower and upper bounds for (D) and thus for the intrinsic ID. By comparing upper and lower bounds, we can approximate the ID and estimate the approximation error.

Corollary 4.3. For support sequences s holds s, (D) (D) s,+(D) and s, (D) (D) s,+(D).

Deﬁnition 4.4 (Approximation Error). For a support sequence s the (relative) approximation error of (D) with respect to s is given by

E(s, D) := s,+(D) s, (D)

With the computation of the upper and lower ID it is possible to bound the error with respect to the ID (D). The following corollary can be deduced from Corollary 4.3 and Deﬁnition 4.4.

Corollary 4.5. For a support sequence s the following statements hold.

1. max{ s,+(D) (D)

(D) , (D) s, (D)

s, (D) }, E(s, D),

2. max{| s,+(D) (D)|, | (D) s, (D)|} | s,+(D) s, (D)|.

If the error of the approximation of a speciﬁc support sequence is not suﬃcient, further elements can be added to the support sequence. Directly from Equation (5) follows the following corollary.

Corollary 4.6. Let s = (2 = s1, . . . , sl = |X|) be a support sequence and let ˆs = (s1, . . . , si, p, si+1, . . . , sl) with a support sequence with an additional element p. Then it holds that

(D)ˆs, = (D)s, +

p j<si+1 ((φp(D)) φsi(D))

(D)ˆs,+ = (D)s,+ +

si<j p (φp(D)) φsi+1(D)

For a given support sequence s, Corollary 4.5 gives us an upper bound for the error when s,+(D) or s, (D) are used to approximate (D) without knowing (D). Hence, we can compute (a lower bound) for the accuracy when approximating the ID with Deﬁnition 4.4. As we can see in Section 5.1, Section 5.4 and Section 6, comparable small support sequences lead to suﬃcient approximations. Support sequences can also be used to shorten the computation of the exact intrinsic dimension as the following lemma shows.

Lemma 4.7. Let s = (2 = s1, . . . , sl = |X|) be a support sequence. Furthermore, let i {1, . . . , l 1} and let j N with si < j < si+1. Let Fsi := {f F | φsi+1,f(D) > φsi(D)}. Then it holds that

φj(D) = max({φj,f(D) | f Fsi} {φsi(D)}).

Proof. follows from Theorem 4.1 and the deﬁnition of φj(D). holds because for f F \Fsi it holds that φj,f(D) φsi+1,f(D), due to Theorem 4.1, and φsi+1,f(D) φsi(D), due to the construction of Fsi.

Hence, given a speciﬁc j, it is possible to compute φj(D) using a subset of F. Based on the particular GD D, this fact can considerably speed up the computation of the ID of D, as we will see in Section 5.

An algorithm to approximate and compute the ID through support sequences is depicted in Algorithm 2. This algorithm takes as input a GD D and a chosen support sequence s. A reasonable choice for support sequences is discussed in Section 5.1. The output is s, (D), s,+(D), and (D), if desired (Line 14). In Line 2, all feature sequences are computed. In Line 6 to Line 11, φsi(D) and Fsi 1, as deﬁned in Lemma 4.7, are computed. From Line 1 to Line 13, the feature sequences and the lower and upper ID are computed. If desired, the exact computation is done in Line 15 to Line 21. Here, we iterate for all support elements (Line 15) through all gaps between them (Line 16) and compute φj(D) using Lemma 4.7 (Line 17 to Line 19).

Published in Transactions on Machine Learning Research (01/2023)

Table 1: Statistics of all datasets used in this work.

Nodes Edges Attributes

Pub Med 19, 717 88, 648 500 Cora 2, 708 10, 556 1, 433 Cite Seer 3, 327 9, 104 3, 703 ogbn-arxiv 169, 343 1, 166, 243 128 ogbn-products 2, 449, 029 61, 859, 140 100 ogbn-mag 1, 939, 743 21, 111, 007 128 ogbn-papers100M 111, 059, 956 1, 615, 685, 872 128

4.2 Estimating Computational Costs

Let s = (s1, . . . , sl) be a support sequence. After the computation of Fs1, . . . , Fsl 1, we can estimate how much computation steps we can avoid in order to compute (D) with Algorithm 2 compared to Algorithm 1. Together with the error function E(s), this estimation can help us to decide if it is desirable to compute the exact value (D) or leave it at s, (D) and s,+(D). This is done in the following manner. For a speciﬁc f F, Lemma 3.5 shows that the computation of φk,f(D) requires |X| k + 1 diﬀerent subtractions and to keep the minimum value. Hence, the cost for computing (D) and therefore (D) via Algorithm 1 can be estimated via O(|F| P|X| k=2(|X| k + 1)) = O(|F| P|X| 1 k=1 k) = O(|F|( |X|2 |X|

2 )). However, if we use Algorithm 2, we solely have the cost to compute φsi. For all values j with si < j < si+1, our cost estimation is |Fsi|(|X| j + 1). Hence, for a given support sequence s = (s1, . . . , sl), we can estimate how many computations are saved using the following notions.

We address the naive computation costs for computing the ID of a GD with

C(D) := |F|(|X|2 |X|

In contrast, for a support sequence s = (s1, . . . , sl) of D, the computation costs are

Cs(D) := (|F|

k=1 |X| sk + 1) +

k=1 |Fsk| X

sk<j<sk+1 |X| j + 1. (6)

Hence, the saved costs of s are

SCs(D) := 1 Cs(D)

Once we have computed φsi(D) and Fi, depending on the saved costs, we can decide to discard the support sequence or to continue further computations with it. Furthermore, using the error estimation, we can decide to compute the exact ID or to settle with the approximation.

5 Intrinsic Dimension of Graph Data

Graph data is of major interest in the realm of geometric learning and beyond. In the following, a graph dataset D = (X, G) consists of an undirected, unweighted graph G = (V, E), where V = {v1, . . . , vn} is a ﬁnite set of vertices, E V 2 and X Rn d is a d-dimensional attribute matrix. The row-vector Xi = (Xi,1, . . . , Xi,d) is called the attribute vector of vi.

Learning from such data is often done via graph neural networks. The idea is to extend common multi-layer perceptrons by a so called neighborhood aggregation, where internal representations of graph neighbors are combined at speciﬁc layers. In earlier works, neighborhood aggregation is done at multiple layers (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018). Due to scalability, recent approaches perform multiple iterations of neighborhood aggregation as a preprocessing step and then use the aggregated features

Published in Transactions on Machine Learning Research (01/2023)

Figure 1: Example of an k-hop geometric dataset with k = 1. Given are a graph and an attribute matrix X. Then, the normalized adjacency matrix ˆA and then ˆAX are computed. The feature set F consists of the coordinate projections of X and ˆAX. In the ﬁgure, every column after F = represents one f F. Note, that in this example the normalization factor 1 dmax is 1.

as combined input (Rossi et al., 2020; Sun & Wu, 2021; Zhang et al., 2021). To be more speciﬁc, the employed networks use inputs of the form (X, ˆAX, ˆA2X . . . , ˆAk X), (7)

where ˆA is a transition matrix that is derived from the graph structure. The most common choice for such a matrix is the normalized adjacency matrix, i.e., ˆAi,j = ( p

deg(vj) deg(vi)) 1 if vj N(vi) and ˆAi,j = 0 else. Here, N(vi) := {vj V | {vi, vj} E} {vi} is the set of neighbors of vi and deg(vi) := |N(vi)| is the node degree of vi. The feature set of the following geometric dataset corresponds to the input in Equation (7).

Deﬁnition 5.1. Let k N and ˆA be the normalized adjacency matrix of a graph dataset D. Furthermore, let dmax := maxj {1,...,d} maxi,k {1,...,n} |Xi,j Xk,j|. We call the set

FD,k := {vi 7 1 dmax ( ˆAm X)i,j | m {0, . . . , k}, j {1, . . . , d}}

the k-hop feature functions of D. Let ν be the normalized counting measure on V . If there exist for each vi, vk V with vi = vj elements m {0, . . . , k}, j {1, . . . , d} such that 1 dmax ( ˆAm X)i,j = 1 dmax ( ˆAm X)k,j, then Dk = (V, FD,k, ν) is a GD. We call it the k-hop geometric dataset of D.

Basic statistics of all seven graph datasets considered in the following sections are depicted in Table 1. The statistics for Cora, Pub Med and Cite Seer were taken from Py Torch Geometric 2. The statistics of the OGB datasets were taken from the Open Graph Benchmark. 3 An example of a k-hop geometric dataset is depicted in Figure 1. It is well-known that the normalized adjacency matrix ˆA Rn n of a graph has a spectral radius of 1. As ˆA is symmetric, this yields ˆAx ˆAy x y for x, y Rn. The signiﬁcance of this property is for the respective computations, however, limited, since it primarily leads to insights of the behavior of the columns of X under multiplication with powers of ˆA. In contrast, the attribute vectors of the vertices are represented via the rows. Moreover, we may point out that we are not considering the Euclidean distances between attribute vectors, but diﬀerences between coordinate values. Thus, the spectral radius of ˆA does not provide direct insights into FD,k.

2https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets. Planetoid 3https://ogb.stanford.edu/docs/nodeprop

Published in Transactions on Machine Learning Research (01/2023)

Figure 2: Errors and saved costs for approximating and computing the intrinsic dimensionality for 2 hop geometric datasets with diﬀerent lengths of the support sequence.

5.1 Choosing Support Sequences

Algorithm 2 relies on a proper choice for a support sequence s. To choose s, two properties have to be considered. Namely, the length of the support sequence and the spacing of the elements. Regarding the second point, we decided to use log-scale spacing. To get such a support sequence for a geometric dataset D = (X, F, µ), we ﬁrst choose a geometric sequence ˆs = (s1, . . . sl) of length l from |X| to 2. We derive the ﬁnal support sequence s from s = ( |X| + 2 s1 , . . . , |X| + 2 sl ) by removing duplicated elements.

In the following, we study the error and the saved costs for diﬀerent lengths l of the support sequence. Here, for a geometric dataset, we investigate how E(s, D) and SC(s) vary for s chosen with l { 0.001 |X| , 0.002 |X| , . . . , 0.05 |X| }. Here, if l = r |X| , we call r R the relative support size of the resulting support sequence s. We experiment with common benchmark datasets, namely ogbn-arxiv, ogbn-mag and ogbnproducts from the Open Graph Benchmark (Hu et al., 2020; 2021). Since for ogbn-mag only a subset of vertices is equipped with attribute vectors, we generate the missing vectors via metapath2vec (Dong et al., 2017). For all datasets, we consider the 2-hop geometric dataset. The results are depicted in Figure 2.

5.1.1 Results

For all datasets, low errors and high saved costs can be reached with a remarkably short support sequence. With relative support sizes of under 0.015 all datasets are approximated with an accuracy of over 99%. Furthermore, the saved costs for sequences with comparable relative support sizes is over 0.98. It stands out, that for the larger datasets ogbn-mag and ogbn-products, shorter sequences (relative to the size of the dataset) lead to lower errors and higher saved costs then for ogbn-arxiv. Our results further indicate, that a relative support size between 0.01 and 0.02 is a reasonable range for maximizing the saved costs. For longer support sequences, the saved cost decrease while the error does not change dramatically, at least for the 2-hop geometric datasets of ogbn-mag and ogbn-products. Note, that longer support sequence do not always lead to a higher amount of saved costs. For longer support sequences s the costs of computing φk(D) for k s decreases. However, the costs of computing φsi(D) for all elements si s increase.

Published in Transactions on Machine Learning Research (01/2023)

Table 2: Intrinsic dimension and performances on classiﬁcation tasks. In the upper table, we display IDs for all k-hop geometric datasets for k {0, . . . , 5}. In the middle table, we display the ID estimated by the MLE baseline. In the lower table we display mean and standard derivations for test accuracy of a standard SIGN model on the classiﬁcation tasks which belongs to the dataset.

Dataset k-hop 0 1 2 3 4 5

Pub Med 2542.3425 2336.6611 2077.5821 2077.0953 2077.0886 2077.0848 Cora 6.2523 3.8324 3.6689 3.6627 3.6624 3.6623 Cite Seer 22.3337 11.3166 10.2347 9.8134 9.5491 9.3795 ogbn-arxiv 83.9160 31.4731 31.4731 31.4730 30.7370 30.3767 ogbn-products 1, 169, 323.2496 1, 169, 044.4736 1, 169, 044.2216 1, 169, 044.2216 1, 169, 044.2216 1, 169, 044.2216 ogbn-mag 2, 311.3509 2, 284.0290 2, 284.0290 2, 284.0290 2, 284.0290 2, 284.0290

Dataset k-hop 0 1 2 3 4 5

Pub Med 24.4623 24.7303 23.3924 22.2779 21.3495 20.5642 Cora 30.6049 28.1785 19.9316 10.8186 9.2970 8.6155 Cite Seer 58.9593 26.5031 16.5556 12.0495 9.3171 7.9572 ogbn-arxiv 16.2948 19.8571 18.9068 18.2265 17.4905 16.9325 ogbn-products 2.8694 4.7542 4.7950 4.7659 4.6943 4.6687 ogbn-mag 30.7024 33.2848 31.5140 30.4844 29.9080 29.5956

Dataset k-hop 0 1 2 3 4 5

Pub Med .6850 .0145 .7191 .0123 .7378 .0362 .7565 .0165 .7615 .0160 .7571 .0234 Cora .5329 .0120 .7223 .0117 .7766 .0045 .7870 .0076 .7917 .0084 .7951 .0047 Cite Seer .4975 .0075 .6165 .0160 .6530 .0101 .6677 .0074 .6695 .0085 .6734 .0080 ogbn-arxiv .5341 .0090 .6572 .0052 .6903 .0056 .6917 .0074 .6901 .0083 .6890 .0051 ogbn-products .5969 .0016 .7204 .0017 .7590 .0017 .7660 .0014 .7678 .0022 .7687 .0019 ogbn-mag .2712 .0020 .3635 .0029 .3879 .0030 .3959 .0029 .3983 .0050 .4012 .0040

5.2 Neighborhood Aggregation and Intrinsic Dimension

We study how the choice of k aﬀects the intrinsic dimension value of the k-hop geometric dataset. For this, we compute the intrinsic dimension for k {0, 1, . . . , 5} for six datasets: the three datasets mentioned above and Pub Med, Cora and Cite Seer (Yang et al., 2016), which we retrieved from Py Torch Geometric (Fey & Lenssen, 2019). Furthermore, we train GNNs which use the feature functions of k-hop geometric datasets as information for training and inference. This allows us to discover connections between the ID of speciﬁc datasets with respect to the considered feature functions and the performance of classiﬁers, which rely on these feature functions. For this, we train SIGN models (Rossi et al., 2020) for k {0, . . . , 5}. Implementation details and parameter choices can be found in Appendix A.1.

5.2.1 Baseline Estimator

To investigate to which extent our ID function surpasses established ID estimators with respect to estimating the discriminability of a dataset, we also compute all ID values with the Maximum Likelihood Estimator (MLE) (Levina & Bickel, 2004). This estimator is commonly used in the realm of deep learning (Pope et al., 2020; Ma et al., 2018a;b). For our experiments, we use the corrected version proposed by Mac Kay & Ghahramani (2005). Note, that the MLE is only applicable to datasets X Rn d and is thus not able to respect the neighborhood aggregated feature functions. Hence, we incorporate the neighborhood information of a k-hop dataset by concatenating feature vectors with the neighborhood aggregated feature vectors. Due to performance reasons, only subsets of the data points are considered for ogbn-mag and ogbn-products. More details to our usage of the MLE are discussed in Appendix A.2.

Published in Transactions on Machine Learning Research (01/2023)

Figure 3: Intrinsic dimension of one-hot encoded features. The right plot is a zoom of the left one which hides the 0 hop geometric dataset.

5.2.2 Results

We ﬁnd that one iteration of neighborhood aggregation always leads to a huge drop of the ID when using our ID function. However, consecutive iterations only lead to a small decrease. For the datasets from OGB, some iterations lead to no drop of the ID dimension at all. For ogbn-mag, only the ﬁrst iteration signiﬁcantly decreases the ID, for ogbn-products, only the ﬁrst two iterations are relevant for decreasing the ID. It stands out, that for ogbn-arxiv, the second and third iteration lead to no signiﬁcant decrease, but the fourth and ﬁfth do. The results for Pub Med stand out. Here, the second iteration of neighborhood aggregation leads to a comparable decrease as the ﬁrst one.

Considering the classiﬁcation performances, the ﬁrst iteration is again the key factor, leading to a signiﬁcant increase in accuracy. As for the ID, the Pub Med dataset behaves diﬀerently than the other datasets: the second iteration of neighborhood aggregation leads to a comparable increase in accuracy as the ﬁrst.

The MLE ID behaves diﬀerent. Here, no pattern of the ﬁrst iteration of aggregation being the key for decreasing the data complexity is observed. For some datasets, the ﬁrst rounds of feature aggregation may even increase the intrinsic dimension. To sum up, our results indicate that our ID is a better indicator for classiﬁcation performance then the MLE ID.

5.3 Synthetic Data

To get further insights into the behavior of our ID notion, we now consider k hop geometric dataset for one-hot encoded graph data, i.e., D = (X, (V, E)), with X = I|V | where I|V | is the |V |-dimensional identity matrix. We consider the case of V = {1, . . . , 100} and determine the ID for k {0, . . . , 5} for increasing edge sizes. To do so, we place the 4950 possible edges in a random order and add them step by step and compute the ID notions for the k hop geometric dataset. The results can be found in Figure 3.

5.3.1 Results

The 0 hop geometric dataset has an ID which does not depend on the amount of edges. This is not surprising since it does not incorporate any graph information. For all other k values, the ID ﬁrst sharply decreases and then increases. This indicates that the addition of neighborhood aggregation is particularly useful for

Published in Transactions on Machine Learning Research (01/2023)

Table 3: Approximation of intrinsic dimension for ogbn-papers100M.

k 0 1 2 3 4 5

s, (D) 282.2380 171.7385 148.3323 137.7662 128.2751 125.3418 s,+(D) 282.3387 171.7997 148.3852 137.8153 128.3208 125.3864 E(s, D) 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004

Table 4: Error on randomly generated data.

n d E(s, D)

106 10 2.55 10 4 1.13 10 8

106 50 2.55 10 4 5.36 10 9

106 250 2.55 10 4 2.31 10 9

107 10 3.08 10 4 4.78 10 10

107 50 3.08 10 4 6.57 10 10

107 250 3.08 10 4 6.16 10 10

108 10 3.55 10 4 3.67 10 11

108 50 3.55 10 4 1.34 10 11

108 250 3.55 10 4 2.69 10 11

graphs of moderate density. Here, the addition of additional rounds of aggregation beyond the ﬁrst one can further lower the ID. For higher edge sizes, the ID diﬀerence between diﬀerent k values vanishes.

5.4 Approximation of Intrinsic Dimension on Large-Scale Data

To demonstrate the feasibility of our approach, we use it to approximate the ID of the well known, largescale ogbn-mag-papers100M dataset. For this, we construct the support sequence as in Section 5.1 with l = 100.000. The results are depicted in Table 3. On our Xeon Gold System with 16 cores, approximating the ID of a k-hop geometric dataset build from ogbn-mag-papers100M is possible within a few hours. While the ID drops for every iteration of neighborhood aggregation, the decrease becomes smaller. The ID of the diﬀerent k-hops can be diﬀerentiated by the approximation, i.e., s, (Di) > s,+(Di+1) for i {0, . . . , 4}. It stands out, that even for such a short support sequence (compared to the size of the dataset), the observed error is remarkably low. In detail, we can approximate the ID with an accuracy of over 99.95%. It is further remarkable, that the error does not change signiﬁcantly for diﬀerent k. We observed this eﬀect also for the other datasets. Our results on ogbn-papers100M indicate, that with short support sequences, we can suﬃciently approximate the ID of large-scale graph data.

6 Errors of Random Data

To further understand how our approximation procedure behaves we conducted experiments on random data. We considered diﬀerent data sizes and diﬀerent amount of attributes. For this, we experimented with real-valued datasets, i.e. datasets represented by an attribute matrix X Rn d. Here, the feature functions are given by the data columns. To be more detailed, the considered geometric dataset is D = ({Xi | i {1, . . . , n}}, {Xi 7 Xi,j | j {1, . . . , d}}, ν). Here, ν is again the normalized counting measure and Xi is he i th row vector of X. We iterate n through {106, 107, 108} and d through {10, 50, 250}. We repeat all experiments 3 times. For all datasets, we build a support sequence as described in Section 5.1 with l = 100, 000. The results can be found in Table 4.

For all datasets, the errors are small and the accuracy is over 99.9% for all considered data sizes. The diﬀerence in the error for diﬀerent values of d is negligible. Furthermore, we have small standard deviations. All this indicates that l = 100, 000 is a reasonable default choice that leads to suﬃcient approximations in a large range of data and attribute sizes.

Published in Transactions on Machine Learning Research (01/2023)

7 Conclusion and Future Work

We presented a principle way to eﬃciently compute the intrinsic dimension (ID) of geometric datasets. Our approach is based on an axiomatic foundation and accounts for underlying structures and is therefore especially tailored to the ﬁeld of geometric learning. We proposed a novel speed up technique for an algorithm which has quadratic complexity with respect to the amount of data points. This enabled us to compute the ID of several real-world graphs with up to millions of nodes. Equipped with this ability, we shed light on connections of classiﬁcation performances of graph neural networks and the observed intrinsic dimension for common benchmark datasets. Finally, using a novel approximation technique, we were able to show that our method scales to graphs with over 100 million nodes and billions of edges. We illustrated this by using the well-known ogbn-papers100M dataset.

Future work includes the identiﬁcation of suitable feature functions for other domains, such as learning on text or image data. Incorporating the structure of such datasets into the computation of intrinsic dimensionality is an open research problem. Another promising research direction is to investigate how the ID of datasets could be manipulated. Since our investigations suggest connections between a low ID and high classiﬁcation performances, this has the potential to enhance learning procedures.

Acknowledgement

The authors thank the State of Hesse, Germany for funding this work as part of the LOEWE Exploration project Dimension Curse Detector" under grant LOEWE/5/A002/519/06.00.003(0007)/E19.

Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché Buc, Emily B. Fox, and Roman Garnett (eds.), Neur IPS, pp. 6109 6119, 2019.

Jonathan Bac and Andrei Zinovyev. Local intrinsic dimensionality estimators based on concentration of measure. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 8. IEEE, 2020. doi: 10.1109/IJCNN48605.2020.9207096.

Edgar Chávez, Gonzalo Navarro, Ricardo A. Baeza-Yates, and José L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273 321, 2001. doi: 10.1145/502807.502808.

Alexander Cloninger and Timo Klock. A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404 419, 2021. ISSN 0893-6080. doi: https://doi.org/10.1016/ j.neunet.2021.06.004.

J.A. Costa, A. Girotra, and Alfred Hero. Estimating local intrinsic dimension with k-nearest neighbor graphs. volume 30, pp. 417 422, 08 2005. ISBN 0-7803-9403-8. doi: 10.1109/SSP.2005.1628631.

Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 135 144. ACM, 2017. doi: 10.1145/3097983.3098036.

Elena Facco, Maria d Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientiﬁc reports, 7(1):1 8, 2017.

Matthias Fey and Jan E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Marina Gomtsyan, Nikita Mokrov, Maxim Panov, and Yury Yanovich. Geometry-aware maximum likelihood estimation of intrinsic dimension. In Wee Sun Lee and Taiji Suzuki (eds.), Proceedings of The Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pp. 1126 1141. PMLR, 17 19 Nov 2019.

Published in Transactions on Machine Learning Research (01/2023)

Daniele Granata and Vincenzo Carnevale. Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets. Scientiﬁc reports, 6(1):1 12, 2016.

Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. In The theory of chaotic attractors, pp. 170 189. Springer, 2004.

M. Gromov and V. D. Milman. A topological application of the isoperimetric inequality. American Journal of Mathematics, 105(4):843 854, 1983. ISSN 00029327, 10806377.

William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pp. 1024 1034, 2017.

Tom Hanika, Friedrich Martin Schneider, and Gerd Stumme. Intrinsic dimension of geometric data sets. Tohoku Mathematical Journal, 74(1):23 52, 2022. doi: 10.2748/tmj.20201015a.

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. ar Xiv preprint ar Xiv:2005.00687, 2020.

Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A large-scale challenge for machine learning on graphs. ar Xiv preprint: 2103.09430, 2021.

Thomas N. Kipf and Max Welling. Semi-supervised classiﬁcation with graph convolutional networks. In 5th Int. Conf. on Learning Representations, 2017.

Elizaveta Levina and Peter J. Bickel. Maximum likelihood estimation of intrinsic dimension. In NIPS, pp. 777 784, 2004.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In ICLR (Poster). Open Review.net, 2018.

Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. Open Review.net, 2018a. URL https://openreview.net/forum?id=B1g J1L2a W.

Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi N. R. Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3361 3370. PMLR, 2018b. URL http://proceedings.mlr.press/v80/ma18d. html.

David JC Mac Kay and Zoubin Ghahramani. Comments on maximum likelihood estimation of intrinsic dimension by e. levina and p. bickel (2005). The Inference Group Website, Cavendish Laboratory, Cambridge University, 2005.

V. Milman. Topics in Asymptotic Geometric Analysis, pp. 792 815. Birkhäuser Basel, Basel, 2010. ISBN 978-3-0346-0425-3. doi: 10.1007/978-3-0346-0425-3_8.

V. D. Milman. The heritage of P. Lévy in geometrical functional analysis. In Colloque Paul Lévy sur les processus stochastiques, number 157-158 in Astérisque. Société mathématique de France, 1988.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

Vladimir Pestov. On the geometry of similarity search: Dimensionality curse and concentration of measure. Inf. Process. Lett., 73(1-2):47 51, 2000. doi: 10.1016/S0020-0190(99)00156-8.

Published in Transactions on Machine Learning Research (01/2023)

Vladimir Pestov. Intrinsic dimension of a dataset: what properties does one expect? In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2007, Celebrating 20 years of neural networks, Orlando, Florida, USA, August 12-17, 2007, pp. 2959 2964. IEEE, 2007. doi: 10.1109/IJCNN. 2007.4371431.

Vladimir Pestov. An axiomatic approach to intrinsic dimension of a dataset. Neural Networks, 21(2-3): 204 213, 2008. doi: 10.1016/j.neunet.2007.12.030.

Vladimir Pestov. Intrinsic dimensionality. ACM SIGSPATIAL Special, 2(2):8 11, 2010. doi: 10.1145/ 1862413.1862416.

Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2020.

Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael M. Bronstein, and Federico Monti. SIGN: scalable inception graph neural networks. Co RR, abs/2004.11198, 2020.

Chuxiong Sun and Guoshi Wu. Scalable and adaptive graph neural networks with self-label-enhanced training. Co RR, abs/2104.09376, 2021.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

Stephan Wojtowytsch and Weinan E. Can shallow neural networks beat the curse of dimensionality? a mean ﬁeld training perspective. IEEE Transactions on Artiﬁcial Intelligence, 1(2):121 129, 2020. doi: 10.1109/TAI.2021.3051357.

Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 40 48. JMLR.org, 2016.

Wentao Zhang, Ziqi Yin, Zeang Sheng, Wen Ouyang, Xiaosen Li, Yangyu Tao, Zhi Yang, and Bin Cui. Graph attention multi-layer perceptron. Co RR, abs/2108.10097, 2021.

Published in Transactions on Machine Learning Research (01/2023)

A.1 Setup of SIGN classiﬁers

For Pub Med, Cora and Cite Seer, we train on the classiﬁcation task provided by Pytorch Geometric (Fey & Lenssen, 2019) which was earlier studied by Yang et al. (2016). All Open Graph Benchmark datasets are trained and tested on the oﬃcial node property prediction task.4 Our goal is not to ﬁnd optimal classiﬁers but to discover connections between the choice of k, the ID and classiﬁer performance. Thus, we omit excessive parameter tuning and stick to reasonable standard parameters. For all tasks, we use a simple SIGN model Rossi et al. (2020) with one hidden inception layer and one classiﬁcation layer. For Pub Med, Cite Seer and Cora, we use batch sizes of 256, hidden layer size of 64 and dropout at the input and hidden layer with 0.5. The learning rate is set to 0.01. All these parameters were taken from Kipf & Welling (2017). For ogbn-arxiv, ogbn-mag and ogbn-products, we stick to the parameters from the SIGN implementations on the OGB leaderbord. For ogbn-arxiv, we use a hidden dimension of 512, dropout at the input with 0.1 and with 0.5 at the hidden layer. For ogbn-mag, we use a hidden dimension of 512, do not dropout at the input and use dropout with 0.5 at the hidden layer. For ogbn-products, we use a hidden dimension of 512, input dropout of 0.3 and hidden layer dropout of 0.4. For all ogbn tasks, the learning rate is 0.001 and the batch-size 50000. For all experiments, we train for a maximum of 1000 epochs with early stopping on the validation accuracy. Here, we use a patience of 15. These are the standard parameters of Pytorch Lightning.5 For all models, we use an Adam optimizer with weight decay of 0.0001. We report mean test accuracies over 10 runs. The intrinsic dimensions and the test accuracy are shown in Table 2.

A.2 Details on Baseline ID Estimator

To use the MLE ID, we have to convert the k-hop geometric dataset (V, FD,k, ν) of graph data D = (X, G), where X Rn d into a real-valued feature matrix ˆX. This done by concatenating the rows of X with the rows of ˆAX, . . . ˆAk X, i.e., ˆX Rn (k+1)d with

( Xi,j j {1, . . . , d}, (An X)i,ˆj j = nd + ˆj for n {1, . . . , k}, ˆj {1, . . . d}.

The MLE is given via

MLE( ˆX) := 1 n(k 1)

j=1 log( d( ˆXi, Nl( ˆXi))

d( ˆXi, Nj( ˆXi)) ), (8)

where d is the euclidean metric and Nj( ˆXi) is the j-th nearest neighbor of ˆXi with respect to the Euclidean metric. Thus, the MLE depends on a parameter l, which we set to 5.

We implement the MLE by using the Nearest Neighbors class of scikit-learn (Pedregosa et al., 2011) and then building the mean of all log( d( ˆ Xi,N5( ˆ Xi)) d( ˆ Xi,Nj( ˆ Xi))) with i {1, . . . , n} and j {1, . . . , 5}. Here, we skip all elements

where d( ˆXi, Nj( ˆXi)) = 0. This can happen, when ˆX has duplicated rows, representing data points with equal attribute vectors.

For ogbn-mag and ogbn-products, computing Equation (8) is not possible due to performance reasons. Here, we sample 169, 343 indices6 I {1, . . . , n} and only compute

MLE( ˆX) := 1 n(k 1)

j=1 log( d( ˆXi, Nl( ˆXi))

d( ˆXi, Nj( ˆXi)) ).

4https://ogb.stanford.edu/docs/nodeprop/ 5https://www.pytorchlightning.ai/ 6This is the amount of nodes of ogbn-arxiv, the largest network for which the full computation was feasible.