# Generalization Bounds using Data-Dependent Fractal Dimensions

Benjamin Dupuis 1 2 3, George Deligiannidis 4 5, Umut Şimşekli 1 2 3 6

Abstract

Providing generalization guarantees for modern neural networks has been a crucial task in statistical learning. Recently, several studies have attempted to analyze the generalization error in such settings by using tools from fractal geometry. While these works have successfully introduced new mathematical tools to apprehend generalization, they heavily rely on a Lipschitz continuity assumption, which in general does not hold for neural networks and might make the bounds vacuous. In this work, we address this issue and prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption. To achieve this goal, we build upon a classical covering argument in learning theory and introduce a data-dependent fractal dimension. Despite introducing a significant amount of technical complications, this new notion lets us control the generalization error (over either fixed or random hypothesis spaces) along with certain mutual information (MI) terms. To provide a clearer interpretation of the newly introduced MI terms, as a next step, we introduce a notion of geometric stability and link our bounds to the prior art. Finally, we make a rigorous connection between the proposed data-dependent dimension and topological data analysis tools, which then enables us to compute the dimension in a numerically efficient way. We support our theory with experiments conducted in various settings.

*Equal contribution. 1 Inria. 2 École Normale Supérieure, Paris, France. 3 PSL Research University, Paris, France. 4 Department of Statistics, University of Oxford, Oxford, UK. 5 The Alan Turing Institute, London, UK. 6 CNRS. Correspondence to: Benjamin Dupuis, Umut Şimşekli.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

Understanding the generalization properties of modern neural networks has been one of the major challenges in statistical learning theory over the last decade. In a classical supervised learning setting, this task boils down to understanding the so-called generalization error, which arises from the population risk minimization problem, given as follows:
$$\min_{w \in \mathbb{R}^d}\ \Big\{\, R(w) := \mathbb{E}_{z \sim \mu_z}[\ell(w, z)] := \mathbb{E}_{(x,y) \sim \mu_z}[L(h_w(x), y)] \,\Big\},$$
where $x \in X$ denotes the features, $y \in Y$ denotes the labels, $Z = X \times Y$ denotes the data space endowed with an unknown probability measure $\mu_z$, referred to as the data distribution, $h_w : X \to Y$ denotes a parametric predictor with $w \in \mathbb{R}^d$ being its parameter vector, $L : Y \times Y \to \mathbb{R}$ denotes the loss function, and $\ell$ is the composition of the loss and the predictor, i.e., $\ell(w, z) = \ell(w, (x, y)) = L(h_w(x), y)$, which will also be referred to as the "loss", with a slight abuse of notation. As $\mu_z$ is unknown, in practice one resorts to the minimization of the empirical risk, given as follows:
$$\hat{R}_S(w) := \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i), \qquad (1)$$
where $S := (z_i)_{1 \le i \le n} \sim \mu_z^{\otimes n}$ is a set of independent and identically distributed (i.i.d.) data points. Then, our goal is to bound the worst-case generalization error, defined as the gap between the population and empirical risks over a (potentially random) hypothesis set $W \subset \mathbb{R}^d$:
$$G(S) := \sup_{w \in W}\ \big( R(w) - \hat{R}_S(w) \big). \qquad (2)$$
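As a concrete (and purely illustrative) instance of these definitions, the following sketch estimates the population risk of a few linear hypotheses by Monte Carlo and evaluates the worst-case gap G(S) over a finite hypothesis set; the quadratic loss, the Gaussian data distribution and all variable names here are our own assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, z):
    """Per-sample loss ell(w, z) for a linear predictor with squared error."""
    x, y = z
    return (x @ w - y) ** 2

def empirical_risk(w, S):
    """R_hat_S(w) = (1/n) sum_i ell(w, z_i)."""
    return np.mean([loss(w, z) for z in S])

def population_risk(w, sample_z, n_mc=20_000):
    """Monte Carlo estimate of R(w) = E_z[ell(w, z)] (mu_z is unknown in practice)."""
    return np.mean([loss(w, sample_z(rng)) for _ in range(n_mc)])

# Synthetic data distribution mu_z (illustrative): y = <w*, x> + noise.
d, n = 5, 200
w_star = rng.normal(size=d)
def sample_z(rng):
    x = rng.normal(size=d)
    return x, x @ w_star + 0.1 * rng.normal()

S = [sample_z(rng) for _ in range(n)]                        # i.i.d. dataset
W = [w_star + 0.5 * rng.normal(size=d) for _ in range(20)]   # finite hypothesis set

# Worst-case generalization error G(S) = sup_{w in W} (R(w) - R_hat_S(w)).
G = max(population_risk(w, sample_z) - empirical_risk(w, S) for w in W)
print(f"G(S) over |W| = {len(W)} hypotheses: {G:.4f}")
```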
In the context of neural networks, one peculiar observation has been that, even when a network contains millions of parameters (i.e., $d \gg 1$), it might still generalize well (Zhang et al., 2017), despite accepted wisdom suggesting that typically $G \lesssim \sqrt{d/n}$ (Anthony & Barlett, 1999). To provide a theoretical understanding of this behavior, several directions have been explored, such as compression-based approaches (Arora et al., 2018; Suzuki et al., 2020; Barsbey et al., 2021) and approaches focusing on the double-descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2019). Recently, there has been an increasing interest in examining the role of the algorithm dynamics in this phenomenon. In particular, it has been illustrated that, in the case where a stochastic optimization algorithm is used for minimizing (1), the optimization trajectories can exhibit a fractal structure (Şimşekli et al., 2021; Camuto et al., 2021; Birdal et al., 2021; Hodgkinson et al., 2022). Under the assumption that $\ell$ is uniformly bounded by some $B$ and uniformly $L$-Lipschitz with respect to $w$, their results informally imply the following: with probability $1 - \zeta$, we have that
$$G(S) \lesssim \sqrt{\frac{d(W) + I_\infty(W, S) + \log(1/\zeta)}{n}}, \qquad (3)$$
where $W$ is a data-dependent hypothesis set, which is provided by the learning algorithm, $d(W)$ is a notion of fractal dimension of $W$, and $I_\infty(W, S)$ denotes the total mutual information between the data $S$ and the hypothesis set $W$. These notions will be formally defined in Section 2.¹

In the case where the intrinsic dimension $d(W)$ is significantly smaller than the ambient dimension $d$ (which has been empirically illustrated in (Şimşekli et al., 2021; Birdal et al., 2021)), the bound in (3) provides an explanation of why overparameterized networks might not overfit in practice.

While these bounds have brought a new perspective on understanding generalization, they also possess an important drawback: they all rely on a uniform Lipschitz continuity assumption on $\ell$ (with respect to the parameters), which is too strict to hold for deep learning models. While it is clear that we cannot expect Lipschitz continuity of a neural network when the parameter space is unbounded, Herrera et al. (2020) showed that, even on bounded domains, the Lipschitz constants of fully connected networks are typically polynomial in the width and exponential in the depth, which may be excessively large in practical settings and hence might make the bounds vacuous.

The Lipschitz assumption is required in (Şimşekli et al., 2021; Birdal et al., 2021; Camuto et al., 2021) because it enables the use of a fractal dimension defined through the Euclidean distance on the hypothesis set $W$ (which is independent of the data). Hence, another downside of the Lipschitz assumption is that the Euclidean distance-based dimension unfortunately ignores certain important components of the learning problem, such as how the loss $\ell$ behaves over $W$. As shown in (Jiang et al., 2019) in the case of sharpness measures (Keskar et al., 2017), which measure the sensitivity of the empirical risk around local minima and correlate well with generalization, data-dependence may improve the ability of a complexity measure to explain generalization.

¹ In (Şimşekli et al., 2021; Camuto et al., 2021) the bound is logarithmic in $L$. (Şimşekli et al., 2021) only requires sub-Gaussian losses, while (Camuto et al., 2021) requires sub-exponential losses. Their common point is that both require a Lipschitz assumption.
### 1.1. Contributions

In this study, our main goal is to address the aforementioned issues by proving fractal-geometric generalization bounds without requiring any Lipschitz assumption. Inspired by a classical approach for bounding the Rademacher complexity (defined formally in Appendix A.2), we achieve this goal by making use of a data-dependent pseudo-metric on the hypothesis set $W$. Our contributions are as follows:

- We prove bounds (Theorems 3.4 and 3.5) on the worst-case generalization error of the following form:
$$G \lesssim \sqrt{\frac{d_S(W) + I_\infty + \log(1/\zeta)}{n}}, \qquad (4)$$
where $d_S$ denotes a notion of data-dependent fractal dimension and $I_\infty$ is a (total) mutual information term (see Section 2.2). As opposed to prior work, this bound does not require any Lipschitz assumption and therefore applies to more general settings. However, this improvement comes at the expense of a more complicated mutual information term compared to the one in (3).
- To provide more understanding of the newly introduced mutual information term $I_\infty$ and to highlight its links to prior work, we introduce a notion of geometric stability and, without requiring Lipschitz continuity, we prove an almost identical bound to the one in Equation (3) (with a potentially slightly worse rate in $n$).
- In order to be able to compute the data-dependent fractal dimension, we build on (Birdal et al., 2021) and prove that our dimension can also be computed by using numerically efficient topological data analysis tools (Carlsson, 2014; Pérez-Fernández et al., 2021).
- Finally, we illustrate our bounds in experiments using various neural networks. In addition to not requiring Lipschitz continuity, we show that our data-dependent dimension provides improved correlations with the actual generalization error.

All the proofs are provided in the Appendix. Python code for the numerical experiments is available at https://github.com/benjiDupuis/data_dependent_dimensions.

## 2. Technical Background

### 2.1. Learning framework

We formalize the learning algorithm as follows. The data (probability) space is denoted by $(Z, \mathcal{F}, \mu_z)$.² A learning algorithm $\mathcal{A}$ is a map generating a random closed set $W_{S,U}$ (see (Molchanov, 2017, Definition 1.1.1)) from the data $S$ and an external random variable $U$ accounting for the randomness of the learning algorithm. The external randomness $U$ takes values in some probability space $(\Omega_U, \mathcal{F}_U, \mu_u)$, which means that $U$ is $\mathcal{F}_U$-measurable and has distribution $\mu_u$. Moreover, we assume that $U$ is independent of $S$. Therefore, if we write $\mathrm{CL}(\mathbb{R}^d)$ for the class of closed sets of $\mathbb{R}^d$ endowed with the Effros $\sigma$-algebra, as in (Molchanov, 2017), the algorithm will be thought of as a measurable map:
$$\mathcal{A} : \bigcup_{n=0}^{\infty} Z^n \times \Omega_U \to \mathrm{CL}(\mathbb{R}^d), \qquad (S, U) \mapsto W_{S,U}. \qquad (5)$$

² For technical measure-theoretic reasons (see Section A.6), it is best to assume $Z \subset \mathbb{R}^N$ for some $N$.

This formulation encompasses several settings, such as the following two examples.

Example 2.1. Given a continuous-time process of the form $dW_t = f(W_t)\,dt + \Sigma(W_t)\,dX_t$, where $X_t$ is typically a Brownian motion or a Lévy process, as considered in various studies (Mandt et al., 2016; Chaudhari & Soatto, 2018; Hu et al., 2018; Jastrzebski et al., 2018; Şimşekli et al., 2021), we can view $W_{S,U}$ as the set of points of the trajectory $\{W_t,\ t \in [0, T]\}$, where $U$ accounts for randomness coming from quantities defining the model, such as $X_t$.
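To make Example 2.1 concrete, here is a minimal sketch (ours, under illustrative assumptions: a quadratic drift, a constant diffusion coefficient and an Euler-Maruyama discretization) that builds the hypothesis set W_{S,U} as the set of points visited by a discretized diffusion, with U playing the role of the Brownian path.

```python
import numpy as np

def trajectory_hypothesis_set(w0, drift, sigma, T=1.0, n_steps=2000, seed=0):
    """Euler-Maruyama discretization of dW_t = f(W_t) dt + Sigma(W_t) dX_t.

    Returns the visited points {W_t}, which play the role of the
    (random, data- and U-dependent) hypothesis set W_{S,U}.
    """
    rng = np.random.default_rng(seed)      # U: external randomness (Brownian path)
    dt = T / n_steps
    w = np.array(w0, dtype=float)
    points = [w.copy()]
    for _ in range(n_steps):
        dx = rng.normal(scale=np.sqrt(dt), size=w.shape)   # Brownian increment
        w = w + drift(w) * dt + sigma(w) * dx
        points.append(w.copy())
    return np.stack(points)                # shape (n_steps + 1, d)

# Illustrative drift: gradient flow of a quadratic empirical risk (assumption).
A = np.diag([1.0, 4.0])
W_SU = trajectory_hypothesis_set(
    w0=[2.0, -1.0],
    drift=lambda w: -A @ w,    # f(w): minus the gradient of 0.5 * w^T A w
    sigma=lambda w: 0.1,       # constant diffusion coefficient
)
print(W_SU.shape)              # the point cloud whose fractal dimension is studied
```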
Example 2.2. Consider a neural network $h_w(\cdot)$ and denote the output of the stochastic gradient descent (SGD) iterates by $\mathcal{A}(x_0, S, U)$, where $U$ accounts for the random batch indices and $x_0$ is the initialization. This induces a learning algorithm $W_{S,U} = \bigcup_{x_0 \in X_0} \{\mathcal{A}(x_0, S, U)\}$, which is closed if $X_0$ is compact, under a continuity assumption on $\mathcal{A}$.

### 2.2. Information theoretic quantities

Recently, one popular approach to prove generalization bounds has been based on information theory. In this context, Xu & Raginsky (2017); Russo & Zou (2019) proved particularly interesting generalization bounds in terms of the mutual information between the input and the output of the model. Other authors refined this argument in various settings (Pensia et al., 2018; Negrea et al., 2019; Steinke & Zakynthinou, 2020; Harutyunyan et al., 2021), while Asadi et al. (2019) combined mutual information and chaining to tighten the bounds. In our work we will use the total mutual information to specify the dependence between the data and the fractal properties of the hypothesis set.

The classic mutual information between two random elements $X$ and $Y$ is defined in terms of the Kullback-Leibler (KL) divergence: $I(X, Y) := \mathrm{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y)$. It is well known that mutual information can be used as a decoupling tool (Xu & Raginsky, 2017); yet, in our setup, we will need to consider the total mutual information, which is defined as follows, $P_X$ denoting the law of the variable $X$:
$$I_\infty(X, Y) := \log \sup_{B} \frac{P_{X,Y}(B)}{P_X \otimes P_Y(B)}. \qquad (6)$$
Hodgkinson et al. (2022) used the total mutual information to decouple the data and the optimization trajectory; they defined it as a limit of $\alpha$-mutual informations, which is equivalent, see (van Erven & Harremoës, 2014, Theorem 6).

### 2.3. The upper box-counting dimension

Fractal geometry (Falconer, 2014) and dimension theory have been successful tools in the study of dynamical systems and stochastic processes (Pesin, 1997; Xiao, 2004). In our setting, we will be interested in the upper box-counting dimension, defined as follows. Given a (pseudo-)metric space $(X, \rho)$ and $\delta > 0$, we first define the closed $\delta$-ball centered at $x \in X$ by $B^\rho_\delta(x) = \{y \in X,\ \rho(x, y) \le \delta\}$ and a minimal covering $N^\rho_\delta(X)$ as a minimal set of points of $X$ such that $X \subset \bigcup_{y \in N^\rho_\delta(X)} B^\rho_\delta(y)$. We can then define the upper box-counting dimension as follows:
$$\overline{\dim}{}^{\rho}_B(X) := \limsup_{\delta \to 0} \frac{\log |N^\rho_\delta(X)|}{\log(1/\delta)}, \qquad (7)$$
where $|A|$ denotes the cardinality of a set $A$.

Under the Lipschitz loss assumption, Şimşekli et al. (2021); Birdal et al. (2021); Camuto et al. (2021); Hodgkinson et al. (2022) related different kinds of fractal dimensions, computed with the Euclidean distance $\rho(w, w') = \mathrm{Eucl}(w, w') := \|w - w'\|_2$, to the generalization error. Our approach in this study is based on a data-dependent pseudo-metric $\rho$, which enables us to remove the Lipschitz assumption.

## 3. Main Results

In this section we present our main theoretical results; our aim is to relate the worst-case generalization error associated with (5) to the upper box-counting dimension computed based on the following random pseudo-metric:
$$\rho_S(w, w') := \frac{1}{n}\sum_{i=1}^{n} |\ell(w, z_i) - \ell(w', z_i)|. \qquad (8)$$
We insist on the fact that $\rho_S$ is only a pseudo-metric, because in practice we can have $\rho_S(w, w') = 0$ while $w \neq w'$, for example due to the internal symmetries of a neural network.
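The pseudo-metric (8) and the dimension (7) can be estimated directly from a matrix of per-sample losses. The sketch below is our own illustration: the greedy covering only upper-bounds the minimal covering number, and the log-log fit is a finite-resolution proxy for the limsup in (7); the toy trajectory with a two-dimensional latent structure is an assumption made only for this example.

```python
import numpy as np

def rho_S_matrix(losses):
    """Pairwise rho_S(w, w') = (1/n) sum_i |ell(w, z_i) - ell(w', z_i)|.

    `losses` has shape (m, n): losses[k, i] = ell(w_k, z_i) for m hypotheses
    and n data points.
    """
    diff = losses[:, None, :] - losses[None, :, :]     # (m, m, n)
    return np.abs(diff).mean(axis=-1)                  # (m, m)

def greedy_covering_number(dist, delta):
    """Size of a greedy delta-cover; upper-bounds the minimal N_delta."""
    uncovered = np.ones(dist.shape[0], dtype=bool)
    n_balls = 0
    while uncovered.any():
        center = np.argmax(uncovered)            # first still-uncovered point
        uncovered &= dist[center] > delta        # remove its delta-ball
        n_balls += 1
    return n_balls

def box_dimension_estimate(dist, deltas):
    """Fit log N_delta against log(1/delta); the slope mimics (7)."""
    counts = np.array([greedy_covering_number(dist, d) for d in deltas])
    slope, _ = np.polyfit(np.log(1.0 / deltas), np.log(counts), 1)
    return slope

# Toy usage: a point cloud with intrinsic dimension 2 under rho_S.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(500, 2))              # 2-D latent trajectory
data = rng.normal(size=(64, 2))                  # n = 64 data points
losses = 0.5 * (1.0 + np.tanh(latent @ data.T))  # bounded losses, shape (500, 64)
dist = rho_S_matrix(losses)
deltas = np.geomspace(0.01, 0.1, num=8)
print("box-counting slope under rho_S:", box_dimension_estimate(dist, deltas))
```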
### 3.1. Main assumptions

A key component of our work is that we do not use any Lipschitz assumption on $\ell$, as is done for example in (Şimşekli et al., 2021; Hodgkinson et al., 2022). The only regularity assumption we impose is the following:

Assumption 3.1. The loss $\ell : \mathbb{R}^d \times Z \to \mathbb{R}$ is continuous in both variables and uniformly bounded by some $B > 0$.

We note that the box-counting dimension with respect to the pseudo-metric (8) involves minimal coverings, which we denote $N^{\rho_S}_\delta(A)$ for a set $A$. The boundedness assumption is essential to ensure that minimal coverings are finite and that $\overline{\dim}{}^{\rho_S}_B$ is also finite. Therefore, our boundedness assumption cannot be replaced with a sub-Gaussian assumption, as opposed to (Şimşekli et al., 2021).

We also assume that we can construct minimal coverings which are random closed (finite) sets in the sense of (Molchanov, 2017, Definition 1.1.1); this is made precise by the following assumption:

Assumption 3.2. Let $C \subset \mathbb{R}^d$ be any closed set, $\delta > 0$, $S \in Z^n$ and $S' \in Z^m$. We can construct minimal $\delta$-coverings $N^{\rho_S}_\delta(C \cap W_{S,U})$ which are random finite sets with respect to the product $\sigma$-algebra $\mathcal{F}^{\otimes n} \otimes \mathcal{F}^{\otimes m} \otimes \mathcal{F}_U$ (measurability with respect to $S$, $S'$, $U$). We denote by $\mathcal{N}_\delta(C \cap W_{S,U})$ the family of all those random minimal coverings.

Remark 3.3. Assumption 3.2 essentially enables us to avoid technical measurability complications. The main message is that we assume that we are able to construct "measurable coverings". This assumption can be cast as a selection property; indeed, for each realization of $(S, S', U)$ there may be a wide range of possible minimal coverings: what we assume is that we can select one of them for each $(S, S', U)$ so that the obtained random set is measurable. Assumption 3.2 is actually much stronger than what is needed to make our results valid. Indeed, we are able to show that, under the assumption that the mapping $\mathcal{A}$ of (5) is measurable with respect to the Effros $\sigma$-algebra, we can construct coverings $(S, U) \mapsto N^{\rho_S}_\delta(W_{S,U})$ which are measurable and which, if not minimal, yield the same upper box-counting dimension as minimal coverings when computing the limit (7). To avoid excessive technical considerations, this discussion is presented in Appendix A.6 as an additional technical contribution.

As the upper box-counting dimension (7) may be written as a countable limit, the measurability Assumption 3.2 also implies that $\overline{\dim}{}^{\rho_S}_B(W_{S,U})$ is a random variable. Continuity of the loss in Assumption 3.1 is there for technical purposes, e.g., to make quantities of the form $\sup_{w \in W_{S,U}} (R(w) - \hat{R}_S(w))$ well-defined random variables (see (Molchanov, 2017, Theorem 1.3.28) and Section A.6 in the Appendix).

### 3.2. Warm-up: fixed hypothesis spaces

In this subsection we fix a deterministic closed set $W \subset \mathbb{R}^d$ and consider its upper box-counting dimension with respect to the data-dependent pseudo-metric (8), which we denote by $d(S) := \overline{\dim}{}^{\rho_S}_B(W)$. Our goal is to bound the worst-case generalization error as defined in (2). The next theorem is an extension of the classical covering bounds on the Rademacher complexity (Barlett & Mendelson, 2002; Rebeschini, 2020).

Theorem 3.4. For all $\epsilon, \gamma, \eta > 0$ and $n \in \mathbb{N}_+$ there exists $\delta_{n,\gamma,\epsilon} > 0$ such that, with probability at least $1 - 2\eta - \gamma$ under $\mu_z^{\otimes n}$, for all $\delta < \delta_{n,\gamma,\epsilon}$ we have:
$$G(S) \le 2\delta + B\sqrt{\frac{4(d(S) + \epsilon)\log(1/\delta) + 9\log(1/\eta)}{n}}.$$

Theorem 3.4 is therefore similar to (Şimşekli et al., 2021, Theorem 1), which used a fractal dimension based on the Euclidean distance $\|w - w'\|_2$ on $\mathbb{R}^d$ and a fixed hypothesis space. The improvement here is the absence of a Lipschitz assumption.
Moreover, as detailed in Appendix B.6, in the case of a Lipschitz $\ell$, we recover from our proofs bounds in terms of the upper box-counting dimension based on the Euclidean distance on the hypothesis set, which is the one used in prior work (Şimşekli et al., 2021). Therefore, our methods based on a data-dependent fractal dimension are more general than previous studies. However, Theorem 3.4 might not be sufficiently satisfying: the proof involves techniques that do not carry over to the case of random hypothesis spaces, an issue which we address in the next subsection.

### 3.3. Random hypothesis spaces

Theorem 3.4 is interesting because it gives a bound similar to (Şimşekli et al., 2021) in the case of a fixed hypothesis set, but with a new notion of data-dependent intrinsic dimension. We now come to the case where the hypothesis set $W_{S,U}$ generated by the learning algorithm (5) is a random set. For notational purposes, let us denote the upper box-counting dimension of $W_{S,U}$ induced by the pseudo-metric (8) by $d(S, U) := \overline{\dim}{}^{\rho_S}_B(W_{S,U})$, and denote the worst-case generalization error by
$$G(S, U) := \sup_{w \in W_{S,U}} \big( R(w) - \hat{R}_S(w) \big). \qquad (9)$$
Here again, note that $d(S, U)$ can be written as a countable limit of random variables and therefore defines a random variable, thanks to Assumption 3.2.

The main difficulty here is that classical arguments based on the Rademacher complexity cannot be applied, as $W_{S,U}$ depends on the data sample $S$. Hence, to be able to develop a covering argument, we first cover the set $W_{S,U}$ by using the pseudo-metric $\rho_S$ (cf. Section 2.3) and rely on the following decomposition: for any $\delta > 0$, any $w \in W_{S,U}$ and any center $w' \in N^{\rho_S}_\delta(W_{S,U})$ of a $\delta$-ball containing $w$, we have that
$$R(w) - \hat{R}_S(w) \le \big( R(w') - \hat{R}_S(w') \big) + \big| \hat{R}_S(w) - \hat{R}_S(w') \big| + \big| R(w) - R(w') \big|.$$
In the above inequality, the first term can be controlled by standard techniques, as $w'$ lives in the finite set $N^{\rho_S}_\delta(W_{S,U})$, and the second term is trivially less than $\delta$ by the definition of the coverings. However, the last term cannot be bounded in an obvious way. To overcome this issue we introduce approximate level sets of the population risk, defined as follows³ for some $K \in \mathbb{N}_+$:
$$R^j_S := W_{S,U} \cap R^{-1}\Big(\Big[\frac{jB}{K}, \frac{(j+1)B}{K}\Big]\Big),$$
where $j = 0, \ldots, K - 1$ and $R^{-1}$ denotes the inverse image of $R$. Let $N_{\delta,j}$ collect the centers of a minimal $\delta$-cover of $R^j_S$ relative to $\rho_S$.⁴ The next theorem provides a generalization bound for random hypothesis sets.

³ As $U$ is independent of $S$, we drop the dependence on it to ease the notation.
⁴ Assumption 3.2 extends to the randomness of those sets $N_{\delta,j}$.

Theorem 3.5. Let us set $K = \sqrt{n}$ and define $I_{n,\delta} := \max_{0 \le j \le K-1} I_\infty(S, N_{\delta,j})$. Then, for all $\epsilon, \gamma, \eta > 0$, there exists $\delta_{n,\gamma,\epsilon} > 0$ such that, with probability at least $1 - \eta - \gamma$ under $\mu_z^{\otimes n} \otimes \mu_u$, for all $\delta < \delta_{n,\gamma,\epsilon}$ we have:
$$G(S, U) \le 2\delta + \frac{B}{\sqrt{n}} + \sqrt{\frac{2B^2\big[(d(S, U) + \epsilon)\log(2/\delta) + \log(\sqrt{n}/\eta) + I_{n,\delta}\big]}{n}}.$$

This theorem gives us a bound in the general case, similar to (Şimşekli et al., 2021, Theorem 2), yet without requiring Lipschitz continuity. Moreover, also similarly to (Şimşekli et al., 2021; Hodgkinson et al., 2022), Theorem 3.5 introduces a mutual information term $I_{n,\delta}$, which intuitively measures the local mutual dependence between the data and the coverings. This can be seen as quantifying how the data influences the local fractal behavior of the hypothesis set. On the other hand, despite the similarity to prior work, $I_{n,\delta}$ might be more complex, because the dependence of $N_{\delta,j}$ on $S$ comes both from the pseudo-metric $\rho_S$ and from the hypothesis set $W_{S,U}$. In the next subsection, we show that we can modify our theory so that it involves the simpler mutual information term proposed in (Hodgkinson et al., 2022).
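To illustrate the construction used in Theorem 3.5, the following sketch (ours, and only illustrative: the unknown population risk R is replaced by a held-out average, which the actual proof does not do) splits a finite point cloud into the K approximate level sets R^j_S and records the size of a greedy δ-cover of each of them under ρ_S.

```python
import numpy as np

def greedy_cover_size(dist, delta):
    """Size of a greedy delta-cover under a pseudo-metric given as a matrix."""
    uncovered = np.ones(dist.shape[0], dtype=bool)
    count = 0
    while uncovered.any():
        c = np.argmax(uncovered)
        uncovered &= dist[c] > delta
        count += 1
    return count

def level_set_cover_sizes(losses_S, losses_holdout, B, K, delta):
    """|N_{delta,j}| for the approximate level sets R^j_S, j = 0..K-1.

    losses_S[k, i] = ell(w_k, z_i) on S (defines rho_S); the population risk
    R(w_k) is replaced here by a held-out average (illustrative shortcut).
    """
    dist = np.abs(losses_S[:, None, :] - losses_S[None, :, :]).mean(-1)  # rho_S
    risk = losses_holdout.mean(axis=1)                                   # proxy for R
    sizes = []
    for j in range(K):
        idx = np.where((risk >= j * B / K) & (risk < (j + 1) * B / K))[0]
        sizes.append(0 if idx.size == 0 else
                     greedy_cover_size(dist[np.ix_(idx, idx)], delta))
    return sizes

rng = np.random.default_rng(1)
losses_S = rng.uniform(0.0, 1.0, size=(400, 64))       # bounded by B = 1
losses_hold = rng.uniform(0.0, 1.0, size=(400, 64))
print(level_set_cover_sizes(losses_S, losses_hold, B=1.0, K=5, delta=0.1))
```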
### 3.4. Geometric stability and mutual information

The intricate dependence between $N_{\delta,j}$ and $S$ makes it hard to express the term $I_{n,\delta}$ in Theorem 3.5, or to bound it with standard methods (e.g., the data-processing inequality). In this subsection, we introduce a notion of geometric stability to obtain a more interpretable bound.

Algorithmic stability is a key notion in learning theory and has been shown to imply good generalization properties (Bousquet, 2002; Bousquet et al., 2020; Chandramoorthy et al., 2022). Recently, Foster et al. (2020) extended this notion to the stability of hypothesis sets, and proposed a notion of stability as a bound on the Hausdorff distance between the hypothesis sets generated by neighboring datasets. In our setting this would mean that there exists some $\beta > 0$ such that for all $S, S' \in Z^n$ differing only by one element and for all $u \in \Omega_U$, we have:
$$\forall w \in W_{S,U},\ \exists w' \in W_{S',U},\ \forall z \in Z, \quad |\ell(w, z) - \ell(w', z)| \le \beta. \qquad (11)$$
Foster et al. (2020) argue that in many situations $\beta = O(1/n)$. Inspired by (Foster et al., 2020), we introduce a stability notion, coined geometric stability, on the minimal coverings, which will allow us to reduce the statistical dependence between the dataset $S \sim \mu_z^{\otimes n}$ and those coverings.

To state our stability notion, we need to refine our definition of coverings. Let $A \subset \mathbb{R}^d$ be some closed set, potentially random. For any $\delta > 0$ we define $N_\delta(A, S)$ to be the random minimal coverings of $A$ by closed $\delta$-balls under the pseudo-metric $\rho_S$ of (8), with centers in $A$. Note that the dependence on $S$ in $N_\delta(A, S)$ only refers to the pseudo-metric used. In addition to Assumption 3.2, which states that we can make such a selection of $N_\delta(A, S)$ (making it a well-defined random set), we add the requirement that this selection can be made regular enough, in the following sense.

Definition 3.6. We say that a set $A$ is geometrically stable if there exist some $\beta > 0$ and $\alpha > 0$ such that, for $\delta$ small enough, we can find a random covering $S \mapsto N_\delta(A, S)$ such that for all $S \in Z^n$ and $S' \in Z^{n-1}$ with $S' = S \setminus \{z_i\}$ for some $i$, the coverings $N_\delta(A, S)$ and $N_\delta(A, S')$ are within $\beta/n^\alpha$ data-dependent Hausdorff distance, by which we mean:
$$\forall w \in N_\delta(A, S),\ \exists w' \in N_\delta(A, S'), \quad \sup_{z \in Z} |\ell(w, z) - \ell(w', z)| \le \frac{\beta}{n^\alpha}.$$

Based on this definition, we assume the following condition.

Assumption 3.7. Let $K \in \mathbb{N}_+$. There exist $\alpha \in (0, 3/2)$ and $\beta > 0$ (potentially depending on $K$) such that all sets of the form $W_{S,U} \cap R^{-1}\big(\big[\frac{jB}{K}, \frac{(j+1)B}{K}\big]\big)$ are geometrically stable with parameters $(\alpha, \beta)$.

Assumption 3.7 essentially imposes a local regularity condition on the fractal behavior of $W_{S,U}$ with respect to the pseudo-metric $\rho_S$. Intuitively, it means that we can select a regular enough covering among all possible coverings. Note that geometric stability is a condition on how the coverings vary with respect to the pseudo-metric, which is fundamentally different from (Foster et al., 2020).
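Definition 3.6 can be checked numerically for a given pair of covering centers. The sketch below (ours; the supremum over z in Z is approximated by a maximum over the available data points, and the center indices and threshold values are hypothetical placeholders) computes the data-dependent Hausdorff-type distance between coverings obtained from S and from S with one point removed.

```python
import numpy as np

def sup_loss_gap(losses, i, j):
    """Approximation of sup_z |ell(w_i, z) - ell(w_j, z)| over the available data."""
    return float(np.max(np.abs(losses[i] - losses[j])))

def data_dependent_hausdorff(losses, centers_a, centers_b):
    """Hausdorff-type distance of Definition 3.6 between two families of covering
    centers, given as row indices into losses[k, i] = ell(w_k, z_i)."""
    def directed(A, B):
        return max(min(sup_loss_gap(losses, a, b) for b in B) for a in A)
    return max(directed(centers_a, centers_b), directed(centers_b, centers_a))

# Toy check of the stability condition: a geometrically stable construction keeps
# the two coverings within beta / n**alpha; the centers below are hypothetical.
rng = np.random.default_rng(2)
losses = rng.uniform(0.0, 1.0, size=(200, 64))
centers_S = [0, 10, 25, 40]        # centers selected using the full dataset S
centers_S_minus = [1, 10, 25, 41]  # centers selected after removing one z_i
n, alpha, beta = 64, 1.0, 5.0
dist = data_dependent_hausdorff(losses, centers_S, centers_S_minus)
print(f"data-dependent Hausdorff distance: {dist:.3f} "
      f"(threshold beta/n^alpha = {beta / n**alpha:.3f})")
```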
The next theorem provides a generalization bound under the geometric stability condition.

Theorem 3.8. Let $d(S, U)$ and $G(S, U)$ be as in Theorem 3.5 and further define $I_\infty := I_\infty(S, W_{S,U})$. Suppose that Assumption 3.7 holds. Then there exist a constant $n_\alpha$ and $\delta_{\gamma,\epsilon,n} > 0$ such that for all $n \ge n_\alpha$, with probability at least $1 - \gamma - \eta$, and for all $\delta \le \delta_{\gamma,\epsilon,n}$, the following inequality holds:
$$G(S, U) \le \frac{3B + 2\beta}{n^{\alpha/3}}\sqrt{(\epsilon + d(S, U))\log(4/\delta) + \log(1/\eta) + \log(n) + I_\infty + 1}.$$
Moreover, we have that $n_\alpha = \max\{2^{\frac{3}{3-2\alpha}}, 2^{1+\frac{3}{3-2\alpha}}\}$.

While Assumption 3.7 might be restrictive, our goal here is to highlight how such geometric regularity can help us deal with the statistical dependence between the data and the hypothesis set. Note that the mutual information term appearing in Theorem 3.8 is much more interpretable than the corresponding terms in Theorem 3.5, and has the exact same form as the term presented in (Hodgkinson et al., 2022). We also note that this way of controlling the dependence between the data and the hypothesis set comes at the expense of potentially losing in the convergence rate of our bound. More precisely, for a stability index of $\alpha$, we get a convergence rate of $n^{-\alpha/3}$. By examining the value of the constant $n_\alpha$ in Theorem 3.8, we observe that getting closer to an optimal rate ($\alpha \to 3/2$) implies a larger $n_\alpha$, rendering our bound asymptotic.

### 3.5. One step toward lower bounds

As an additional theoretical result, we present an attempt to prove a lower bound involving the introduced data-dependent fractal dimension. For this purpose, let us consider again the case of a fixed (non-random) closed hypothesis set $W \subset \mathbb{R}^d$. As in Section 3.2, we make Assumption 3.1. We also introduce the data-dependent lower box-counting dimension as:
$$\underline{d}(S) = \underline{\dim}{}^{\rho_S}_B(W) := \liminf_{\delta \to 0} \frac{\log |N^{\rho_S}_\delta(W)|}{\log(1/\delta)}. \qquad (13)$$
As proving lower bounds is not the main goal of this paper, we restrict ourselves to this specific setting. The next theorem is based on very classical arguments involving the Gaussian complexity and Sudakov's theorem (Vershynin, 2020, Theorem 7.4.1). This lower bound requires a slightly different definition of the worst-case generalization error:
$$G(S) := \sup_{w \in W} \big| R(w) - \hat{R}_S(w) \big|. \qquad (14)$$

Theorem 3.9. We further assume that $\underline{d}(S) > 0$ almost surely. Then, for all $\epsilon, \gamma, \zeta > 0$, there exist an absolute constant $c > 0$ and some $\delta_{n,\gamma,\zeta} > 0$ such that, with probability at least $1 - \zeta - \gamma$, for all $\delta \le \delta_{n,\gamma,\zeta}$ we have:
$$G(S) \ge c\sqrt{\frac{\delta^2 \log(1/\delta)\, \underline{d}(S)}{2n\log(n)}} - B\sqrt{\frac{\log(2) + 9\log(1/\zeta)}{n}}.$$

Theorem 3.9 gives a lower bound that is probably less tight than the upper bounds we presented in this work; one could even note that the right-hand side of the bound may be negative in some settings. However, we believe that the techniques used to derive this bound are classical and may be useful for future research.

## 4. Computational Aspects

In this section, we illustrate how the proposed data-dependent dimension can be numerically computed, by making a rigorous connection to topological data analysis (TDA) tools (Boissonat et al., 2018).

### 4.1. Persistent homology

Persistent homology (PH) is a well-known notion in TDA, typically used for point cloud analysis (Edelsbrunner & Harer, 2010; Carlsson, 2014). Previous works have linked neural networks and algebraic topology (Rieck et al., 2019; Pérez-Fernández et al., 2021); in particular, Corneanu et al. (2020) established experimental evidence of a link between homology and generalization. Important progress was made in (Birdal et al., 2021), who used PH tools to estimate the upper box-counting dimension induced by the Euclidean distance on $W_{S,U}$. Here we extend their approach to the case of data-dependent pseudo-metrics, which lays the ground for our experimental analysis.

The formal definition of PH is rather technical and not essential to our problem; hence, we only provide a high-level description here and give a more detailed account in Section A.4 (for a formal introduction, see (Boissonat et al., 2018; Memoli & Singhal, 2019)).
In essence, given a point cloud $W \subset \mathbb{R}^d$, PH of degree 0, denoted by $PH_0$, keeps track of the connected components of $W$ as we examine $W$ at a gradually decreasing resolution. Given a bounded (pseudo-)metric space $(X, \rho)$, one can use $PH_0$ to introduce another notion of fractal dimension, called the persistent homology dimension, which we denote by $\dim^{\rho}_{PH_0}(X)$ (see Section A.4 and (Schweinhart, 2019, Definition 4)). Our particular interest in $\dim^{\rho}_{PH_0}(X)$, in the case where $\rho$ is a proper metric, comes from an important result (Kozma et al., 2005; Schweinhart, 2020) stating that for any bounded metric space $(X, \rho)$ we have the following identity:
$$\overline{\dim}{}^{\rho}_B(X) = \dim^{\rho}_{PH_0}(X). \qquad (15)$$
Several studies used this property to numerically evaluate the upper box-counting dimension (Adams et al., 2020; Birdal et al., 2021). In particular, Birdal et al. (2021) combined it with the results from (Şimşekli et al., 2021) and showed that $\dim^{\mathrm{Eucl}}_{PH_0}(X)$, associated with the Euclidean metric on the parameter space, can be linked to the generalization error under the Lipschitz loss condition.

### 4.2. PH dimension in pseudo-metric spaces

In order to extend the aforementioned analysis to our data-dependent dimension, we must first prove that the equality (15) extends to pseudo-metric spaces, which is established in the following theorem:

Theorem 4.1. Let $(X, \rho)$ be a bounded pseudo-metric space; then we have:
$$\overline{\dim}{}^{\rho}_B(X) = \dim^{\rho}_{PH_0}(X).$$

This theorem shows that, similarly to $\dim^{\mathrm{Eucl}}_{PH_0}(X)$, our proposed dimension $\dim^{\rho_S}_{PH_0}(X)$ can also be computed by using numerically efficient TDA tools. Moreover, Theorem 3.4 now (informally) implies that with probability $1 - \zeta$:
$$G(S) \lesssim \sqrt{\frac{\dim^{\rho_S}_{PH_0}(W)\log(1/\delta) + \log(1/\zeta)}{n}} + \delta. \qquad (16)$$
Theorems 3.5 and 3.8 can be adapted similarly.

## 5. Experiments

Experimental setup. In our experiments, we closely follow the setting used in (Birdal et al., 2021). In particular, we consider learning a neural network by using SGD, and choose the hypothesis set $W_{S,U}$ as the optimization trajectory near the local minimum found by SGD.⁵ Then, we numerically estimate $\dim^{\rho_S}_{PH_0}(W_{S,U})$ by using the PH software provided in (Pérez et al., 2021). The main difference between our approach and (Birdal et al., 2021) is that we replace the Euclidean metric with the pseudo-metric $\rho_S$ to compute the PH dimension. Here is a brief description of the method: given a neural network, its loss $\ell(w, z)$, and a dataset $S = (z_1, \ldots, z_n)$, we compute $K$ iterations of SGD, $(w_k)_{k=0}^{K}$, such that $w_K$ reaches a neighborhood of a local minimum. We then run SGD for 5000 more iterations and set $W_{S,U}$ to $\{w_{K+1}, \ldots, w_{K+5000}\}$. We then approximate $\dim^{\rho_S}_{PH_0}(W_{S,U})$ by using the algorithm proposed in (Birdal et al., 2021), with the Euclidean distance replaced by $\rho_S$ (a sketch of this estimation procedure is given below).

We experimentally evaluate $\dim^{\rho_S}_{PH_0}(W_{S,U})$ in different settings: (i) a regression experiment with fully connected networks of 5 (FCN-5) and 7 (FCN-7) layers trained on the California Housing Dataset (CHD) (Kelley Pace & Barry, 1997); (ii) training FCN-5 and FCN-7 networks on the MNIST dataset (Lecun et al., 1998); and (iii) training AlexNet (Krizhevsky et al., 2017) on the CIFAR-10 dataset (Krizhevsky et al., 2014).

⁵ Note that, as the trajectory collected by SGD only contains finitely many points, its dimension is trivially 0. However, as in (Birdal et al., 2021), we treat this finite set as an approximation to the full trajectory. This is justified since even for an infinite $X$, $\dim^{\rho_S}_{PH_0}(X)$ is computed based on finite subsets of $X$.
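The following sketch is our re-implementation of the estimation recipe described above, not the authors' code. It relies on the standard fact that, for degree-0 persistent homology of a Vietoris-Rips filtration, the finite death times coincide with the edge lengths of a minimum spanning tree of the distance matrix; E_alpha (defined in Section A.4) is then regressed against the sample size on a log-log scale, here with alpha = 1 so that the fitted slope a gives dim approximately 1/(1 - a). The toy loss matrix and the sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def rho_S(losses):
    """Pairwise rho_S distances from a per-sample loss matrix of shape (m, n)."""
    return np.abs(losses[:, None, :] - losses[None, :, :]).mean(axis=-1)

def E_alpha(dist, alpha=1.0):
    """E_alpha = sum over PH0 of (death - birth)^alpha; for a Vietoris-Rips
    filtration in degree 0 the finite death times are exactly the edge lengths
    of a minimum spanning tree of the distance matrix."""
    mst = minimum_spanning_tree(dist)
    return float(np.sum(mst.data ** alpha))

def ph_dimension(losses, alpha=1.0, sizes=(100, 200, 400, 800), seed=0):
    """Estimate dim_{PH0} under rho_S via a log-log regression of E_alpha
    against the number of sampled trajectory points (slope a -> alpha/(1-a))."""
    rng = np.random.default_rng(seed)
    dist = rho_S(losses)
    log_n, log_E = [], []
    for n_pts in sizes:
        idx = rng.choice(dist.shape[0], size=n_pts, replace=False)
        log_n.append(np.log(n_pts))
        log_E.append(np.log(E_alpha(dist[np.ix_(idx, idx)], alpha)))
    slope, _ = np.polyfit(log_n, log_E, 1)
    return alpha / (1.0 - slope)

# Toy usage: losses[k, i] = ell(w_k, z_i) along the tail of the SGD trajectory,
# here generated from a 2-D latent structure so the estimate is close to 2.
rng = np.random.default_rng(0)
latent, data = rng.uniform(size=(800, 2)), rng.normal(size=(32, 2))
losses = 0.5 * (1.0 + np.tanh(latent @ data.T))
print("estimated dim_PH0 under rho_S:", ph_dimension(losses))
```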
Figure 1. $\dim^{\rho_S}_{PH_0}$ (denoted $\dim^S_{PH_0}$ in the figure) versus accuracy gap for FCN-5 (top) and FCN-7 (middle) on MNIST, and AlexNet (bottom) on CIFAR-10. Different colors indicate different learning rates and different markers indicate different batch sizes.

More experiments are shown in the appendix, Section D. All the experiments use standard ReLU activations and vanilla SGD with a constant step-size. We vary both the learning rate and the batch size across a 6 × 6 grid. For the experiments on CHD and MNIST we also used 10 different random seeds. All hyperparameter configurations are available in Section C.

Note that in the case of a classification experiment, one could not compute $\dim^{\rho_S}_{PH_0}$ using a zero-one loss in (8). Indeed, it would be equivalent to computing PH on the finite set $\{0, 1\}^n \subset \mathbb{R}^n$, which trivially gives an upper box-counting dimension of 0. To overcome this issue, we compute $\dim^{\rho_S}_{PH_0}$ using the surrogate loss (cross entropy in our case) and illustrate that it is still a good predictor of the gap between the training and testing accuracies. For the sake of completeness, we show how $\dim^{\rho_S}_{PH_0}$ behaves with respect to the actual loss gap in Section D.

Figure 2. $\dim^{\rho_S}_{PH_0}$ (denoted $\dim^S_{PH_0}$ in the figure) versus generalization gap for FCN-5 (top) and FCN-7 (bottom) trained on CHD. Different colors indicate different learning rates and different markers indicate different batch sizes.

Results. In order to compare our data-dependent intrinsic dimension with the one introduced in (Birdal et al., 2021), which is the PH dimension induced by the Euclidean distance on the trajectory and denoted $\dim^{\mathrm{Eucl}}_{PH_0}$, we compute various correlation statistics, namely Spearman's rank correlation coefficient ρ (Kendall & Stuart, 1973) and Kendall's coefficient τ (Kendall, 1938). We also use the mean granulated Kendall coefficient Ψ introduced in (Jiang et al., 2019), which aims at isolating the influence of each hyperparameter and, according to the authors, could better capture the causal relationships between generalization and the proposed complexity metric (the intrinsic dimension in our case). For more details on the exact computation of these coefficients, please refer to Section C.1. Therefore, (ρ, Ψ, τ) are our main indicators of performance. The values of each granulated Kendall coefficient are reported in Section D.⁶

⁶ All these coefficients take values between −1 and 1, a value of 1 indicating a perfect positive correlation.

Figures 1 and 2 depict the data-dependent dimension versus the generalization gap, as computed in different settings. We observe that, in all cases, there is a strong correlation between $\dim^{\rho_S}_{PH_0}(W_{S,U})$ and the generalization gap, for a wide range of hyperparameters.

Table 1. Correlation coefficients on CHD

| MODEL | DIM. | ρ | Ψ | τ |
|---|---|---|---|---|
| FCN-5 | $\dim^{\mathrm{Eucl}}_{PH_0}$ | 0.77 ± 0.08 | 0.54 ± 0.11 | 0.59 ± 0.07 |
| FCN-5 | $\dim^{\rho_S}_{PH_0}$ | 0.87 ± 0.05 | 0.68 ± 0.10 | 0.71 ± 0.09 |
| FCN-7 | $\dim^{\mathrm{Eucl}}_{PH_0}$ | 0.40 ± 0.09 | 0.16 ± 0.08 | 0.28 ± 0.07 |
| FCN-7 | $\dim^{\rho_S}_{PH_0}$ | 0.77 ± 0.08 | 0.62 ± 0.06 | 0.77 ± 0.08 |

Table 2. Correlation coefficients on MNIST

| MODEL | DIM. | ρ | Ψ | τ |
|---|---|---|---|---|
| FCN-5 | $\dim^{\mathrm{Eucl}}_{PH_0}$ | 0.62 ± 0.10 | 0.78 ± 0.08 | 0.47 ± 0.07 |
| FCN-5 | $\dim^{\rho_S}_{PH_0}$ | 0.73 ± 0.07 | 0.81 ± 0.07 | 0.56 ± 0.06 |
| FCN-7 | $\dim^{\mathrm{Eucl}}_{PH_0}$ | 0.80 ± 0.04 | 0.88 ± 0.04 | 0.62 ± 0.04 |
| FCN-7 | $\dim^{\rho_S}_{PH_0}$ | 0.89 ± 0.02 | 0.90 ± 0.04 | 0.73 ± 0.03 |

Table 3. Correlation coefficients with AlexNet on CIFAR-10

| MODEL | DIM. | ρ | Ψ | τ |
|---|---|---|---|---|
| ALEXNET | $\dim^{\mathrm{Eucl}}_{PH_0}$ | 0.86 | 0.81 | 0.68 |
| ALEXNET | $\dim^{\rho_S}_{PH_0}$ | 0.93 | 0.84 | 0.78 |
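For reference, the correlation statistics can be computed as follows; this is our own sketch, with the granulated coefficient Ψ implemented as we understand the definition of Jiang et al. (2019): Kendall's τ is averaged over groups in which all hyperparameters but one are held fixed, and then averaged over the hyperparameter axes. The hyperparameter grid and the fake dimensions and gaps are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def granulated_kendall(dims, gaps, hparams):
    """Mean granulated Kendall coefficient Psi (our reading of Jiang et al., 2019).

    dims, gaps: 1-D arrays (one entry per trained model);
    hparams: dict mapping hyperparameter name -> 1-D array of its values.
    For each hyperparameter, tau is computed within groups where all the other
    hyperparameters are fixed, then averaged; Psi averages over hyperparameters.
    """
    names = list(hparams)
    per_hparam = []
    for name in names:
        others = [hparams[o] for o in names if o != name]
        keys = list(zip(*others)) if others else [()] * len(dims)
        taus = []
        for key in set(keys):
            idx = [i for i, k in enumerate(keys) if k == key]
            if len(idx) > 1:
                tau, _ = kendalltau(dims[idx], gaps[idx])
                if not np.isnan(tau):
                    taus.append(tau)
        per_hparam.append(np.mean(taus))
    return float(np.mean(per_hparam))

# Toy usage on a hypothetical 6 x 6 grid of (learning rate, batch size).
rng = np.random.default_rng(0)
lrs, bss = np.meshgrid(np.geomspace(1e-3, 1e-1, 6), [32, 64, 128, 256, 512, 1024])
lrs, bss = lrs.ravel(), bss.ravel()
gaps = 0.1 * lrs / np.sqrt(bss) + 0.01 * rng.normal(size=lrs.size)  # fake gaps
dims = gaps * 50 + rng.normal(scale=0.05, size=gaps.size)           # fake dims
rho, _ = spearmanr(dims, gaps)
tau, _ = kendalltau(dims, gaps)
psi = granulated_kendall(dims, gaps, {"lr": lrs, "batch": bss})
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}, granulated Psi = {psi:.2f}")
```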
We also observe that the highest learning rates and lowest batch sizes seem to give less correlation, which is similar to what was observed in (Birdal et al., 2021). This might be caused by increased noise: we suspect that the point clouds in those settings exhibit more complex fractal structures and hence require more points for a precise computation of the PH dimension.

Next, we report the correlation coefficients for the same experiments in Tables 1, 2 and 3. The results show that, on average, our proposed dimension always yields improved metrics compared to the dimension introduced in (Birdal et al., 2021). The improvement is particularly pronounced in the regression experiment we performed (the classification task yields larger variations in the metrics, see Table 2). This may indicate that the proposed dimension is particularly pertinent in specific settings. Moreover, increasing the size of the model seems, in all experiments, to have a positive impact on the correlation. We suspect that this might be due to the increasing local Lipschitz constant of the network. We provide more experimental results in Section D.

Robustness analysis. The computation of $\rho_S(w, w')$ requires the exact evaluation of the loss function on every data point $\{z_1, \ldots, z_n\}$ for every $w, w' \in W_{S,U}$. This introduces a computational bottleneck when $n$ is excessively large. To address this issue, we now explore an approximate way of computing $\dim^{\rho_S}_{PH_0}$. Similarly to the computation of a stochastic gradient, instead of computing the distance on every data point, we first draw a random subset of data points $T \subset S$, with $|T| \ll n$, and use the following approximation:
$$\rho_S(w, w') \approx \rho_T(w, w') := \frac{1}{|T|}\sum_{z \in T} |\ell(w, z) - \ell(w', z)|.$$

Figure 3. Robustness experiment using a FCNN trained on MNIST (left) and CHD (right). The x-axis represents the proportion of the data T used to compute the metric; the y-axis is the relative error with respect to the dimension computed on the full dataset.

We now conduct experiments to analyze the robustness of the computation of $\dim^{\rho_S}_{PH_0}$ with respect to the varying size of the random subset $T$. More precisely, we randomly select a subset $T \subset S$ whose size varies between 2% and 99% of the size of the dataset $S$, and compute the PH dimension using the approximate pseudo-metric. Note that the whole dataset $S$ is of course still used to produce the SGD iterates. Figure 3 presents results on the MNIST and CHD datasets in terms of the relative error, i.e., $|\dim^{\rho_T}_{PH_0} - \dim^{\rho_S}_{PH_0}| / \dim^{\rho_S}_{PH_0}$. The results show that the proposed dimension is significantly robust to the approximation of the pseudo-metric: even with 40% of the data, we achieve almost identical results as with the full dataset.
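A minimal sketch of the subsampled pseudo-metric ρ_T (ours): only the columns of the per-sample loss matrix corresponding to T are needed, so the approximation error can be probed directly (here on the pseudo-metric itself rather than on the resulting PH dimension, for brevity); the loss matrix below is a random placeholder.

```python
import numpy as np

def rho_matrix(losses):
    """Pairwise (1/n_cols) * sum_i |ell(w, z_i) - ell(w', z_i)| over the columns."""
    return np.abs(losses[:, None, :] - losses[None, :, :]).mean(axis=-1)

def rho_T(losses, frac, seed=0):
    """rho_T: the same computation restricted to a random fraction of the data points."""
    rng = np.random.default_rng(seed)
    n = losses.shape[1]
    cols = rng.choice(n, size=max(1, int(frac * n)), replace=False)
    return rho_matrix(losses[:, cols])

# Probe the approximation quality for growing subset sizes.
rng = np.random.default_rng(3)
losses = rng.uniform(0.0, 1.0, size=(200, 500))   # (trajectory points, data points)
full = rho_matrix(losses)
for frac in (0.02, 0.1, 0.4, 0.99):
    rel = np.abs(rho_T(losses, frac) - full).mean() / full.mean()
    print(f"|T|/|S| = {frac:.2f}: mean relative error of rho_T = {rel:.3f}")
```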
## 6. Conclusion

In this paper, we proved generalization bounds that do not require the Lipschitz continuity of the loss, which can be crucial in modern neural network settings. We linked the generalization error to a data-dependent fractal dimension of the random hypothesis set. We first extended some classical covering arguments to state a bound in the case of a fixed hypothesis set and then proved a result in a general learning setting. While some intricate mutual information terms between the geometry and the data appeared in this bound, we presented a possible workaround through the introduction of a stability property for the coverings of the hypothesis set. Finally, we made a connection to persistent homology, which allowed us to numerically approximate the intrinsic dimension and thus support our theory with experiments.

Certain points remain to be studied concerning our results. First, the existence of differentiable persistent homology libraries (Hofer et al., 2018; 2019) opens the door to using our intrinsic dimension as a regularization term, as in (Birdal et al., 2021). Refining our proof techniques, for example using the chaining method (Ledoux & Talagrand, 1991; Clerico et al., 2022), could help us improve our theoretical results or weaken the assumptions.

## Acknowledgments

U.Ş. is partially supported by the French government under the management of Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). B.D. and U.Ş. are partially supported by the European Research Council Starting Grant DYNASTY (101039676).

## References

Adams, H., Aminian, M., Farnell, E., Kirby, M., Peterson, C., Mirth, J., Neville, R., Shipman, P., and Shonkwiler, C. A fractal dimension for measures via persistent homology. Topological Data Analysis, Abel Symposia, vol. 15, pp. 1-31, 2020. doi: 10.1007/978-3-030-43408-3_1.

Anthony, M. and Barlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. Proceedings of the 35th International Conference on Machine Learning, November 2018.

Asadi, A. R., Abbe, E., and Verdú, S. Chaining Mutual Information and Tightening Generalization Bounds. Advances in Neural Information Processing Systems 31 (NeurIPS 2018), July 2019.

Barlett, P. L. and Mendelson, S. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 2002.

Barsbey, M., Sefidgaran, M., Erdogdu, M. A., Richard, G., and Şimşekli, U. Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), June 2021.

Bauer, U. Ripser: Efficient computation of Vietoris-Rips persistence barcodes. Journal of Applied and Computational Topology, 5(3):391-423, September 2021. ISSN 2367-1726, 2367-1734. doi: 10.1007/s41468-021-00071-5.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning practice and the bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, August 2019. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1903070116.

Birdal, T., Lou, A., Guibas, L., and Şimşekli, U. Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), November 2021.

Bogachev, V. I. Measure Theory, Volume 1. Springer, 2007.

Boissonat, J.-D., Chazal, F., and Yvinec, M. Geometrical and Topological Inference. Cambridge Texts in Applied Mathematics. Cambridge University Press, 2018.

Bousquet, O. Stability and generalization. Journal of Machine Learning Research, 2002.

Bousquet, O., Klochkov, Y., and Zhivotovskiy, N. Sharper bounds for uniformly stable algorithms. Proceedings of the Thirty Third Conference on Learning Theory, May 2020.

Camuto, A., Deligiannidis, G., Erdogdu, M. A., Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms.
Advances in Neural Information Processing Systems 34 (Neur IPS 2021), June 2021. Carlsson, G. Topological pattern recognition for point cloud data*. Acta Numerica, 23:289 368, May 2014. ISSN 0962-4929, 1474-0508. doi: 10.1017/ S0962492914000051. Chandramoorthy, N., Loukas, A., Gatmiry, K., and Jegelka, S. On the generalization of learning algorithms that do not converge. Thirty-Sixth Conference on Neural Information Processing Systems (Neurips 2022), August 2022. Chaudhari, P. and Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. 2018 Information Theory and Applications Workshop (ITA), January 2018. Clerico, E., Shidani, A., Deligiannidis, G., and Doucet, A. Chained Generalisation Bounds. Proceedings of Thirty Fifth Conference on Learning Theory, June 2022. Corneanu, C., Madadi, M., Escalera, S., and Martinez, A. Computing the Testing Error without a Testing Set. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), May 2020. Edelsbrunner, H. and Harer, J. Computational Topology - an Introduction Semantic Scholar. American Mathematical Society, 2010. Falconer, K. Fractal Geometry - Mathematical Foundations and Applications - Third Edition. Wiley, 2014. Foster, D. J., Greenberg, S., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Hypothesis Set Stability and Generalization. Advances in Neural Information Processing Systems 32 (Neur IPS 2019), October 2020. Harutyunyan, H., Raginsky, M., Steeg, G. V., and Galstyan, A. Information-theoretic generalization bounds for blackbox learning algorithms. Advances in Neural Information Processing Systems 34 (Neur IPS 2021), October 2021. Herrera, C., Krach, F., and Teichmann, J. Estimating Full Lipschitz Constants of Deep Neural Networks. Estimating Full Lipschitz Constants of Deep Neural Networks, June 2020. Hodgkinson, L., S ims ekli, U., Khanna, R., and Mahoney, M. W. Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers. Proceedings of the 39th International Conference on Machine Learning, July 2022. Hofer, C., Kwitt, R., Niethammer, M., and Uhl, A. Deep Learning with Topological Signatures. Advances in Neural Information Processing Systems 30 (NIPS 2017), February 2018. Hofer, C., Kwitt, R., Dixit, M., and Niethammer, M. Connectivity-Optimized Representation Learning via Persistent Homology. Proceedings of the 36th International Conference on Machine Learning, June 2019. Hu, W., Li, C. J., Li, L., and Liu, J.-G. On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications, March 2018. Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three Factors Influencing Minima in SGD, September 2018. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic Generalization Measures and Where to Find Them. ICLR 2020, December 2019. Kechris, A. S. Classical Descriptive Set Theory. Graduate Texts in Mathematics. Springer, 1995. Kelley Pace, R. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291 297, May 1997. ISSN 0167-7152. doi: 10.1016/S0167-7152(96) 00140-X. Kendall, M. G. A new reasure of rank correlation. Biometrika, 1938. Kendall, M. G. and Stuart, A. The Advanced Theory of Statistics. Griffin, 1973. ISBN 978-0-85264-069-2. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. 
ICLR 2017, February 2017. Generalization Bounds using Data-Dependent Fractal Dimensions Kozma, G., Lotker, Z., and Stupp, G. The minimal spanning tree and the upper box dimension. Proceedings of the American Mathematical Society, 134(4):1183 1187, September 2005. ISSN 0002-9939, 1088-6826. doi: 10.1090/S0002-9939-05-08061-5. Krizhevsky, A., Nair, V., and Hinton, G. E. The cifar-10 dataset, 2014. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Image Net classification with deep convolutional neural networks. Communications of the ACM, 60(6):84 90, May 2017. ISSN 0001-0782, 1557-7317. doi: 10.1145/3065386. Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, November 1998. ISSN 1558-2256. doi: 10.1109/5.726791. Ledoux, M. and Talagrand, M. Probability in Banach Spaces - Isoperimetry and Processes. Classics in Mathematics. Springer, 1991. Mandt, S., Hoffman, M. D., and Blei, D. M. A Variational Analysis of Stochastic Gradient Algorithms. Proceedings of The 33rd International Conference on Machine Learning, February 2016. Memoli, F. and Singhal, K. A Primer on Persistent Homology of Finite Metric Spaces. Bulletin of Mathematical Biology, 81(7):2074 2116, July 2019. ISSN 0092-8240, 1522-9602. doi: 10.1007/s11538-019-00614-z. Molchanov, I. Theory of Random Sets. Number 87 in Probability Theory and Stochastic Modeling. Springer, second edition edition, 2017. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep Double Descent: Where Bigger Models and More Data Hurt. ICLR 2020, December 2019. Negrea, J., Haghifam, M., Dziugaite, G. K., Khisti, A., and Roy, D. M. Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. Advances in Neural Information Processing Systems 32 (Neur IPS 2019), November 2019. Pensia, A., Jog, V., and Loh, P.-L. Generalization Error Bounds for Noisy, Iterative Algorithms. 2018 IEEE International Symposium on Information Theory (ISIT), January 2018. P erez, J. B., Hauke, S., Lupo, U., Caorsi, M., and Dassatti, A. Giotto-ph: A Python Library for High-Performance Computation of Persistent Homology of Vietoris-Rips Filtrations, August 2021. P erez-Fern andez, D., Guti errez-Fandi no, A., Armengol Estap e, J., and Villegas, M. Characterizing and Measuring the Similarity of Neural Networks with Persistent Homology. Co RR, May 2021. Pesin, Y. B. Dimension Theory in Dynamical Systems - Contemporary Views and Applications. Chicago Lectures in Mathematics. The University of Chicago Press, 1997. Rebeschini, P. Algorithmic fundations of learning, 2020. Rieck, B., Togninalli, M., Bock, C., Moor, M., Horn, M., Gumbsch, T., and Borgwardt, K. Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology. ICLR, pp. 25 p., February 2019. doi: 10.3929/ethz-b-000327207. Russo, D. and Zou, J. How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory, October 2019. Schweinhart, B. Persistent Homology and the Upper Box Dimension. Discrete & Computational Geometry volume 65, pages 331 364, July 2019. Schweinhart, B. Fractal Dimension and the Persistent Homology of Random Geometric Complexes. Advances in Mathematics, June 2020. S ims ekli, U., Sener, O., Deligiannidis, G., and Erdogdu, M. A. Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124014, December 2021. 
ISSN 1742-5468. doi: 10.1088/1742-5468/ac3ae7. Steinke, T. and Zakynthinou, L. Reasoning About Generalization via Conditional Mutual Information. Proceedings of Thirty Third Conference on Learning Theory, June 2020. Suzuki, T., Abe, H., and Nishimura, T. Compression based bound for non-compressed network: Unified generalization error analysis of large compressible deep neural network. ICLR 2020, June 2020. van Erven, T. and Harremo es, P. Renyi Divergence and Kullback-Leibler Divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, July 2014. ISSN 0018-9448, 1557-9654. doi: 10.1109/TIT.2014.2320500. Vershynin, R. High-Dimensional Probability - An Introduction with Application in Data Science. University of California - Irvine, 2020. Xiao, Y. Random fractals and Markov processes. Fractal Geometry and Applications: A jubilee of Benoˆıt Mandelbrot - American Mathematical Society, 72.2:261 338, 2004. doi: 10.1090/pspum/072.2/2112126. Generalization Bounds using Data-Dependent Fractal Dimensions Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems 30 (NIPS 2017), November 2017. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. ICLR 2017, February 2017. Generalization Bounds using Data-Dependent Fractal Dimensions The outline of the appendix is as follows: Section A: Additional technical background related to information theory, Rademacher complexity, Egoroff s Theorem and persistent homology. Section B: Postponed proofs of the theoretical results. Section C: Additional experimental details. Section D: Additional experimental results, including full statistic of experiments presented in the main part of the paper, as well as additional experiments on different datasets. A. Additional technical background A.1. Information theoretic quantities We recall there some basics concepts of information theory that we use throughout the paper. The absolute continuity of a probability measure with respect to another one will be denoted with symbol . Definition A.1. Let us consider a probability space (Ω, F) and two probability distributions π and ρ, with π ρ. We define the Kullback-Leibler divergence of those distributions as: KL(π||ρ) = Z log dπ For α > 1, we define their α-Renyi divergence as: Dα(π||ρ) = 1 α 1 log Z dπ We set those two quantities to + if the absolute continuity condition is not verified. Note that by convention we often consider that D1 = KL and that Renyi divergences may also be defined for orders α < 1 (van Erven & Harremo es, 2014), but we won t need it here. It is easy to prove that Dα is increasing in α and it is therefore natural to define: D (π||ρ) = lim α Dα(π||ρ). The following property will be useful to perform decoupling of two random variables, a proof can be found in (van Erven & Harremo es, 2014). Theorem A.2. With the same notations as above, we have: D (π||ρ) = log sup B F We can then define the following notions of mutual information: Definition A.3. Let X, Y be two random variables on Ω, we define for α [1, ]: Iα(X, Y ) := Dα(PX,Y ||PX PY ), with in particular: I(X, Y ) := I1(X, Y ) = KL(PX,Y ||PX PY ). I will be called the total mutual information. Note that thanks to Theorem A.2, we recover the definition of total mutual information that we wrote in Equation (6). Those quantities satisfy the data processing inequality, given in the following proposition. 
Generalization Bounds using Data-Dependent Fractal Dimensions Proposition A.4 (Data-processing inequality). If X Y Z is a Markov chain and α [1, + ], then: Iα(X, Z) Iα(X, Y ). We are interested in those quantities because of their decoupling properties, summarized up in the following lemmas.: Lemma A.5 (Lemma 1 in (Xu & Raginsky, 2017)). Let X, Y be two random variables and f(., .) a measurable function. We consider X and Y two copies of X and Y which are independent. Then if f( X, Y ) E[f( X, Y )] is σ2-subgaussian, we have: |E[f(X, Y )] E[f( X, Y )]| p 2σ2I(X, Y ). We end this subsection by stating the decoupling in probability result that we will use several times in the proofs: Combining the definition of total mutual information with Theorem A.2, we immediately obtain: Lemma A.6 (Lemma 1 in (Hodgkinson et al., 2022)). For every measurable set B we have: PX,Y (B) e I (X,Y )PX PY (B). A.2. Rademacher complexity We call Rademacher random variables a tuple (σ1 . . . , σn) of mutually independent Bernoulli distributions with values in the set { 1, 1}. Definition A.7. Let us consider a fixed set A Rn and σ := (σ1 . . . , σn) some Rademacher random variables, the Rademacher complexity of A is defined as: Rad(A) := 1 Let us consider a fixed hypothesis space W and some dataset S = (z1, . . . , zn) µ n z , we will use the following notation: ℓ(W, S) = {(ℓ(w, zi)1 i n Rn, w W}. (17) Remark A.8. One could legitimately inquire about the measurability of Rad(ℓ(W, S)) with respect to F n (recall that the data space is denoted (Z, F, µz)). Thanks to the closedness of W Rd we can introduce a dense countable subset C of W and write that, thanks to the continuity of ℓ, R(σ, S) := 1 i=1 σiℓ(w, zi) = 1 i=1 σiℓ(w, zi), which is measurable as a countable supremum of random variables. As ℓis bounded, so is R(σ, S); it is therefore integrable with respect to (σ, S). Thus Rad(ℓ(W, S)) is integrable (and measurable) thanks to the first part of Fubini s theorem. Rademacher complexity is linked to the worst case generalization error via the following proposition (see for example (Rebeschini, 2020)): Proposition A.9. Assume that the loss is uniformly bounded by B. Then, for all η > 0, we have with probability 1 2η that: R(w) ˆRS(w) 2Rad(ℓ(W, S)) + 3 n log(1/η). We state the proof of this result for the sake of completeness. It is based on two classical arguments: symmetrization and Mc-Diarmid inequality. Proof. Let us write: G(S) := sup w W R(w) ˆRS(w) . Generalization Bounds using Data-Dependent Fractal Dimensions We introduce S = { z1, . . . , zn} µ n z an independent copy of S and some Rademacher random variables (σ1, . . . , σn), using properties of conditional expectation and Fubini s theorem we have: E[G(S)] = E sup w W i=1 R(w) ℓ(w, zi) = E sup w W i=1 E[ℓ(w, zi) ℓ(w, zi)| S] E E sup w W i=1 (ℓ(w, zi) ℓ(w, zi)) S = E sup w W i=1 (ℓ(w, zi) ℓ(w, zi)) = E sup w W i=1 σi(ℓ(w, zi) ℓ(w, zi)) i=1 σiℓ(w, zi) = 2E[Rad(ℓ(W, S))]. On the other hand if we denote Si = (z1, . . . , zi 1, zi, zi+1, . . . zn) we have that: |G(S) G(Si)| 2B And therefore by Mc-Diarmid inequality for any ϵ > 0: P G(S) E[G(S)] ϵ exp nϵ2 By taking any η (0, 1) we can make a clever choice for ϵ and deduce that with probability at least 1 η we have: G(S) E[G(S)] + n log(1/η). (19) Moreover we can also write: |Rad(ℓ(W, S)) Rad(ℓ(W, Si))| Eσ σi(ℓ(w, zi) ℓ(w, zi)) 2B so that by Mc-Diarmid and the exact same reasoning than above we have that with probability at least 1 η: E[Rad(ℓ(W, S))] Rad(ℓ(W, S)) + n log(1/η). 
(20) Therefore combining equations 18, 19 and 20 gives us that with probability at least 1 2η: G(S) 2Rad(ℓ(W, S)) + 3 n log(1/η). Another important result for us is the well-known Massart s lemma, presented here in a slightly simplified version which is enough for our work: Generalization Bounds using Data-Dependent Fractal Dimensions Lemma A.10 (Massart s lemma). Let T Rn be a finite set, then: Rad(T) max t T ( t 2) Where |T| denotes the cardinal of T as usual. Example A.11. Consider the setting where we have a fixed finite hypothesis set W. In that case we have that maxw W( (ℓ(w, zi))i 2) B n, thanks to the boundedness assumption. Thus Massart s lemma A.10 gives us Rad(ℓ(W, S)) B A.3. Egoroff s Theorem Egoroff s Theorem is an essential result in our theory which states that pointwise convergence in a probability space can be made uniform on measurable sets of arbitrary high probability. It was already used in (S ims ekli et al., 2021; Camuto et al., 2021) to make the convergence of the limit defining some fractal dimension uniform up certain probability. Theorem A.12 (Egoroff s Theorem (Bogachev, 2007)). Let (Ω, F, µ) be a measurable space with µ a positive finite measure. Let fn, f : Ω (X, d) be functions with values in a separable metric space X and such that µ-almost everywhere fn(x) f(x). Then for all γ > 0 there exists Ωγ F such that µ(Ω\Ωγ) γ and on Ωγ the convergence of (fn) to f is uniform. A.4. Persistent Homology Persistent homology (PH) is a well known notion in TDA typically used for point cloud analysis (Edelsbrunner & Harer, 2010; Carlsson, 2014). Previous works have linked neural networks and algebraic topology (Rieck et al., 2019; P erez Fern andez et al., 2021), especially in (Corneanu et al., 2020) who established experimental evidence of a link between homology and generalization. Important progress was made in (Birdal et al., 2021), who used PH tools to estimate the upper-box counting dimension induced by the Euclidean distance on WS,U. In this subsection, we introduce a few necessary PH tools to understand this approach. Throughout this subsection we consider a finite set of point W Rm. We will denote by K the unique two elements field Z/2Z. Definition A.13 (Abstract simplicial complex and filtrations). Given a finite set V , an abstract simplicial complex (which we will often refer simply as complex) K is a subset of P(V ), the subsets of V , such that: s K, P(s) K The elements of K are called the simplices. For any non-empty simplex s, we call the number |s| 1 its dimension, denoted dim(s). Given a simplicial complex K, a filtration of K is a sequence of sub-complexes increasing for the inclusion K0 KN = K such that every complex is obtained by adding one simplex to to the previous one: Ki+1 = Ki {σi+1}. Thus a filtration of a complex induces an ordering on the simplices, which will be denoted (si)i by convention. The filtration will be denoted by K0 KN = K, and the corresponding simplices, in the order in which they are added to the filtration, will typically be denoted by (s0, . . . , s N). Example A.14. The most important filtration that we shall encounter is the Vietoris-Rips filtration (VR filtration) Rips(W). For any δ > 0 we first construct the Vietoris-Rips simplicial complex Rips(W, δ) by the following condition: k{w1, . . . , wk} Rips(W, δ) i, j, d(pi, pj) δ. (22) Then Rips(W) is formed by adding the complexes in the increasing order of δ from 0 to + . 
Complexes with the same value of δ are ordered based on their dimension and ordered arbitrarily if they have the same dimension. Generalization Bounds using Data-Dependent Fractal Dimensions Intuitively, Persistent homology of degree i keeps track of lifetimes of holes of dimension i , it is built over the concept of chains, which are a sort of linearized version of sets of simplices. More precisely, the space of k-chains Ck(K) over complex K is defined as the set of formal linear combinations of the k-dimensional simplices of k: Ck(K) := span X i ϵisi, i, dim(si) = k . (23) We will denote a simplex by its points s = [w0, . . . , wk] and use the notation s\i := [w0, . . . , wi 1, wi+1, . . . , wk]. The boundary operator : Ck(K) Ck 1(K) is the linear map induced by the relations on the simplices: i=0 s\i. (24) It is easy to verify that 2 = 0 and therefore we have an exact sequence, where N = |W|, {0} CN(K) CN 1(K) . . . C0(K) {0}, from which it is natural to define: Definition A.15 (Cycles and homology groups). With the same notations, we define: The k cycles of K: Zk(K) := ker( : Ck(K) Ck 1(K)). The k-th boundary of K: Bk(K) := ℑ( : Ck+1(K) Ck(K)). k-th homology group (it is actually a quotient vector space): Hk(K) := Zk/Bk. The k-th Betti number of K is defined as the dimension of the homology group: βk(K) = dim(Hk(K)). Those Betti numbers, βk, correspond, in our analogy, to the number of holes of dimension k, i.e. the numbers of cycles whose interior is not in the complex, and therefore corresponds to a hole. Remark A.16. In particular, β0 corresponds to the number of connected components in the complex. Now that we defined the notion of homology, we go on with the definition of persistent homology (PH). The intuition is the following: when we build the Vietoris-Rips filtration of the point cloud W, by increasing parameter δ in Example A.14, we collect the birth and death of each hole, the multiset7 of those pairs (birth, death) will be the definition of persistent homology. Remark A.17. In all the following, the parameter δ used in the definition of the Vietoris-Rips filtration will be seen as a time parameter. While the concept of persistent homology can be extended to arbitrary orders (see (Boissonat et al., 2018)), here, for the sake of simplicity, we only define Persistent homology of degree 0, which is much simpler and is the only one we need in our work. Persistent homology of degree 0: The persistent homology of degree 0, denoted PH0 is the multiset of the distances δ used to build the Vietoris-Rips filtration of W for which a connected component is lost. More formally, let us introduce a Vietoris-Rips filtration of P denoted by: Kδ0,1 Kδ0,α0 Kδ1,1 Kδc,αC = K, where 0 δ1 < < δC are the time/distance indices of the filtration and for the same value of δ the simplices are ordered by their dimension and arbitrarily if they also have the same dimension. Obviously δ0 = 0. With those notations, PH0 is the multiset of all the δi corresponding to a complex Kδi,j which has one less connected component than the preceding complex in the above filtration. To stick with the usual notations, we actually define PH0 as the multiset of the (0, δi), where the 0 correspond to the birth of a connected component, while the δi, as described above, corresponds to the death of this connected component. 7By multiset, we mean that it can contain several time the same element, in our case the same persistence pair. 
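To make the degree-0 construction concrete, the following sketch (with names of our own choosing, relying only on NumPy) computes the multiset of PH0 death times of a finite point cloud. It uses the standard observation that, in a Vietoris-Rips filtration, a connected component dies exactly when an edge of a minimum spanning tree is added, so that a Kruskal-style union-find pass over the sorted pairwise distances recovers PH0.

```python
import numpy as np
from itertools import combinations

def ph0_death_times(points):
    """Death times of degree-0 persistent homology of the Vietoris-Rips
    filtration of a finite point cloud (Euclidean metric): one death is
    recorded each time an edge merges two connected components."""
    n = len(points)
    parent = list(range(n))

    def find(i):                          # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise distances in increasing order: the order in which the edges
    # [w_i, w_j] enter Rips(W, delta) as delta grows.
    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # two components merge: one of them dies
            parent[ri] = rj
            deaths.append(dist)           # persistence pair (0, dist)
        if len(deaths) == n - 1:
            break
    return deaths

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((50, 3))      # toy point cloud in R^3
    print(len(ph0_death_times(W)))        # 49 finite persistence pairs
```

For n points the procedure returns n - 1 finite death times; consistently with Remark A.16, each recorded death decreases β0 by one.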
Generalization Bounds using Data-Dependent Fractal Dimensions Definition A.18 (Persistent homology dimension). For any α 0 we define: (b,d) PH0(Rips(W )) (d b)α. (25) The persistent homology dimension of degree 0 (PH dimension) of any set bounded metric space W is then defined as: dim PH0(W) := inf{α > 0, C > 0, W W finite, Eα(W) < C}. Where the definition of VR filtration in finite subsets of metric spaces is naturally defined. The importance of this dimension for our work relies on the following result (see (Schweinhart, 2019), (Kozma et al., 2005)): Proposition A.19. For any bounded metric space X, we have dim B(X) = dim PH0(X). Proposition A.19 opens the door to the numerical estimation of the upper-box dimension. Indeed, PH can be evaluated via several libraries (Bauer, 2021; P erez et al., 2021), moreover, Birdal et al. (2021) noted that, while Definition A.18 is impossible to evaluate in practice, it can be approximated from PH0(Rips(W)) computed on a finite number of finite subsets of the point cloud W. A.5. Numerical estimation of the PH dimension In this section we briefly discuss how we numerically estimate the persistent homology dimension, which is essentially the algorithm presented in (Birdal et al., 2021) where we changed the distance, which implies that we must evaluate on all data points for the last iterates. See also (Adams et al., 2020; Schweinhart, 2020) for similar ideas. All persistent homology computation presented here have been made with the package presented in (P erez et al., 2021), which allows us to use more points in our persistent homology computation, e.g. Birdal et al. (2021) was only using between 1000 points prior to convergence for Alex Net and 200 for the other experiments. In our work we use up to 8000 points, which may allow us to better capture the fractal behavior. The algorithm is based on the following result, proved by proposition 2 of (Birdal et al., 2021) and proposition 21 of (Schweinhart, 2020): If X is a bounded metric space with = dimd PH0(X), then for all ϵ > 0 and α (0, + ϵ) there exists Dα,ϵ > 0 such that for all finite subset Xn = {x1, . . . , xn} of X we have: log Eα(Xn) log Dα,ϵ + 1 α + ϵ log(n). (26) Then we can perform an affine regression of log Eα(Xn) with respect to log n and get a slope a. Moreover it is argued in (Birdal et al., 2021) that the slope has good chance to be approximately the one appearing in Equation (26), which gives us α 1 a. Remark A.20. The aforementioned algorithm works in pseudo metric spaces. Indeed as we tried to explain formally in the proof of proposition B.9, PH0 in a pseudo-metric space only add some zeros to the quantities Eα computed in its metric identifications. Therefore the above algorithm is approximating dimρS PH0(X/ ) which is proven in lemma B.9 to be equal to dimρS PH0(X). See those notations in the next subsection. A.6. Measurable coverings and additional technical lemmas In this section we briefly discuss a few technical measure theoretic points that are worth mentioning. Essentially, we argue that our measurability assumptions ensure that the manipulations we make in our proofs on complicated random variables are valid and meaningful. We then show that it is possible to construct the measurable coverings that we need in our proofs. A.6.1. SOME NICE CONSEQUENCES OF OUR MEASURABILITY ASSUMPTIONS The worst-case generalization error takes the general form: G(S, U) := sup WS,U R(w) ˆRS(w) . 
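As an illustrative aside, the slope-based estimator of Section A.5 can be summarized in a few lines. The sketch below is only a simplified stand-in for the actual pipeline (the experiments rely on the library of (P erez et al., 2021)); it assumes a precomputed matrix D of pairwise ρS distances between training iterates, evaluates Eα on random sub-samples of increasing size with the same union-find computation of PH0 as in the previous sketch, and, in line with Equation (26), reads the dimension off the fitted slope a as α/(1 - a).

```python
import numpy as np

def e_alpha_from_dists(D, alpha):
    """E_alpha^0: sum of death^alpha over PH0 of the Rips filtration built
    from a (pseudo-)distance matrix D, via a Kruskal-style union-find pass
    (in degree 0 all births are 0)."""
    n = D.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(D[iu, ju])
    total = 0.0
    for k in order:
        i, j = iu[k], ju[k]
        ri, rj = find(i), find(j)
        if ri != rj:                      # an edge of the minimum spanning tree
            parent[ri] = rj
            total += D[i, j] ** alpha
    return total

def estimate_ph_dim(D, alpha=1.0, sizes=(250, 500, 1000, 2000), seed=0):
    """Slope-based estimate of dim_PH0: fit log E_alpha against log n on
    random sub-samples of the point cloud and return alpha / (1 - slope)."""
    rng = np.random.default_rng(seed)
    logs_n, logs_e = [], []
    for n in sizes:
        idx = rng.choice(D.shape[0], size=n, replace=False)
        sub = D[np.ix_(idx, idx)]
        logs_n.append(np.log(n))
        logs_e.append(np.log(e_alpha_from_dists(sub, alpha)))
    slope, _ = np.polyfit(logs_n, logs_e, deg=1)
    return alpha / (1.0 - slope)
```

In practice the estimate is typically averaged over several random draws at each sub-sample size, a common choice being α = 1.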
(27) Generalization Bounds using Data-Dependent Fractal Dimensions Here we require WS,U Rd to be a random closed set. In this subsection, we will make this precise by describing basic notions of random set theory and prove a few technical results which will lay the ground of a rigorous theoretical basis for our main results. The interested reader can consult (Kechris, 1995; Molchanov, 2017). Other works mentioned similar formulation of the problem (Hodgkinson et al., 2022), though with not much technical details. Let us fix a probability space (Ω, T , P) and denote E = Rd. Remark A.21. As highlighted by (Molchanov, 2017), we can develop the following theory in the more general case where E is a locally compact Hausdorff second countable space, but we avoid those technical considerations. The definition of a random closed set is the following: Definition A.22 (Random closed set). Consider a map W : Ω CL(E), W is said to be a random closed set if for every compact set K E we have: {ω, W(ω) K = } T . A natural question is to know whether we can cast it as a random variable defined in the usual way, the answer is yes and is formalized by the following definition. Definition A.23 (Effr os σ-algebra and Fell topology). The Effr os σ-algebra is the one generated by the sets {W CL(E), W K = } for K going over all compact sets in Rd. The Fell topology on CL(E) is the one generated by open sets {W CL(E), W K = } for K going over all compact sets and {W CL(E), W O = } for O going over all open sets of Rd. One can show that the Effr os σ-algebra on CL(E) corresponds to the Borel σ-algebra induced by the Fell topology (Molchanov, 2017, Chapter 1.1). The Effr os σ-algebra will be denoted by E(E). It can be shown that Definition A.22 is equivalent to asking the measurability of W with respect to E(E). The assumption that we made on our learning algorithm is the following: Assumption A.24. We assume that WS,U is a random closed set in the sense of the above definition. It means that the mapping defining the learning algorithm: n=0 Zn ΩU CL(Rd), is measurable with respect to the Effr os σ-algebra. Thanks to this definition, we can already state one particularly useful result: Proposition A.25 (Theorem 1.3.28 in (Molchanov, 2017)). Consider (Gw)w E a R-valued, almost surely continuous, stochastic process on E = Rd and W a random closed set in E. Then the mapping Ω ω 7 sup w W (ω) Gw(ω) is a random variable. Example A.26. If we define R(w) ˆRS(w) and W = WS,U, then thanks to the continuity of the loss (Assumption 3.1) we have that the worst case generalization error defined by Equation (27) is a well-defined random variable. While Example A.26 gives us useful information, it is actually not enough for some arguments of our proofs to hold. In particular, to deal with the statistical dependence between the data and the random hypothesis set, we want to be able to perform the following operation: Given a random closed set W and S Zn we want to apply the decoupling results and write: sup w W R(w) ˆRS(w) ϵ e I (W,S)PW PS sup w W R(w) ˆRS(w) ϵ . (28) In order for the decoupling lemmas to hold, we actually need the measurability of the mapping CL(Rd) Zn (W, S) 7 sup w W |R(w) ˆRS(w)|, Generalization Bounds using Data-Dependent Fractal Dimensions with respect to E(Rd) F n. We show two results in this direction, the first one assuming that the data space Z is countable8. Lemma A.27. 
As before, let (CL(Rd), E(Rd)) denotes the closed sets of Rd endowed with the Effr os σ-algebra, (Ω, T ) be a countable measurable space (with T = P(Ω)) and ζ(x, ω) be an almost surely continuous stochastic process on Rd. Then the function f : CL(Rd) Ω (W, ω) 7 sup x W ζ(x, ω) R is measurable with respect to E(Rd) T . Proof. It is enough to show that f 1(]t, + [) E(Rd) T for any t Q as those sets ]t, + [ generate the Borel σ-algebra in R. Let us fix some t Q. Let us denote ζω := ζ( , ω), we have: f 1(]t, + [) = [ {F CL(Rd), F ζ 1 ω (]t, + [) = } {ω} . By (Molchanov, 2017, Proposition 1.1.2), we have that the sets of the form {F CL(Rd), F O = } generate E(Rd), with O running through open sets of Rd. Therefore the continuity of ζ and the countability of Z give us: f 1(]t, + [) E(Rd) T . If we want to get rid of the countability assumption on Ω, we have to introduce some metric structure on it. This approach justifies the assumptions made on Z (that it is a sub-metric space of some RN). Lemma A.28. Assume that Ωis a Polish space with a dense countable subset D and that ζ is continuous in both variables. Then the function: f : CL(Rd) Ω (W, ω) 7 sup x W ζ(x, ω) R, is measurable with respect to E(Rd) BΩ, where BΩis the Borel σ-algebra on Ω. Proof. As before, let t Q, for X, ω CL(Rd) Ωwe have that (X, ω) f 1(]t, + [) x X, ϵ Q>0, d D, d B( d, ϵ) D, ζ(d, x) > t, and therefore f 1(]t, + [) = [ d B( d,ϵ) {F CL(Rd), F ζ 1 ω (]t, + [) = } B( d, ϵ) . The results follows from the same arguments as in the proof of the previous lemma. A.6.2. CONSTRUCTION OF MEASURABLE COVERINGS To end this technical discussion about random closed set we try to answer the following questions: are the covering numbers with respect to pseudo-metric ρS measurable? Moreover, can we construct coverings that are well-defined random close sets themselves? Recall that we defined a δ-covering of some set X as a minimal set of points Nδ of X such that: 8This countability assumption on the dataset is found in some other works, especially in (S ims ekli et al., 2021) who used it to leverage the local stability of Hausdorff dimension Generalization Bounds using Data-Dependent Fractal Dimensions The fact that we ask the coverings to be in X is for technical reasons and does not change the values of the dimensions. We first need a technical lemma to ensure that it is equivalent to cover dense countable subsets: Lemma A.29 (Closure property of coverings). Let W be a closed set and C be a countable dense subset of W. Under Assumption 3.1 we have that any covering of C is a covering of W for pseudo-metric ρS. Moreover we have, for all δ > 0: |N ρS 2δ (C)| |N ρS δ (W)| |N ρS δ (C)|. (29) Proof. Let us consider some δ > 0, a minimal δ-cover {c1, . . . , c K} of C and w W. By density, there exists a sequence (ξn)n in C such that ξn w. As {c1, . . . , c K} is a finite cover of C, we can assume without loss of generality that, for all n, ξn BρS δ (ci) for some i. Therefore, by continuity we have ρS(w, ci) = limn ρS(w, ξn) δ. Thus: |N ρS δ (W)| |N ρS δ (C)|. Now, by the triangle inequality we have: |N ρS 2δ (C)| |N ρS δ (W)| Let us first prove that we can construct measurable coverings in the case of fixed hypothesis sets. Indeed, this is essential to ensure the fact that the upper box-counting dimension dim ρS B induced by ρS is a well-defined random variable, which is required for the high probability bounds in our results to make sense. 
This kind of measurability condition is often assumed by authors dealing with potentially random covering numbers (S ims ekli et al., 2021; Camuto et al., 2021). In our case, we can prove this measurability under some condition. Recall that we required that Z has a metric space structure, typically inherited by an inclusion in an Euclidean space RN and that its σ-algebra F is the corresponding Borel σ-algebra. With that in mind we prove the following theorem: Theorem A.30 (Measurability of covering numbers in the case of fixed hypothesis set). Let W be a closed set, C be a dense countable subset9 of W and δ > 0. Under Assumption 3.1, we have that the mapping between probability spaces (Zn, F n) S 7 |N ρS δ (C)| (N+, P(N+)), is a random variable, where P(A) denotes the subsets of a set A. Proof. For any set X let us denote by F k(X) the set of finite subsets of X with at most k elements. We start by noting that thanks to the continuous loss assumption, we have that S 7 ρS(w, w ) is continuous for any w, w Rd. Moreover, let us denote C := {wk, k N}. Thus, to show the measurability condition, it suffices to show that for any M N+ we have: {S Zn, |N ρS δ (C)| M} F n. we can write |N ρS δ (C)| M F F M(C), k N, C [ c F BρS δ (c). {S Zn, |N ρS δ (C)| M} = [ c F {S, ρS(c, wk) δ}. (30) By continuity, it is clear that {S, ρS(c, wk) δ} F n, hence we have the result by countable unions and intersections. 9It always exists for any closed set in Rd. Generalization Bounds using Data-Dependent Fractal Dimensions Remark A.31. Given any positive sequence δk, decreasing and converging to 0, thanks to Lemma A.29 the upper boxcounting dimension can be written as dim ρS B (W) = lim sup k + log |N ρS δk (C)| log(1/δk) , (31) which, by Theorem A.30, implies that dim ρS B (W) is a random variable as countable upper limit of random variables. We now come to the case of random hypothesis sets, we begin by introducing Castaing s representations, which are a fundamental tool to deal with random closed sets (Molchanov, 2017, Theorem 1.3.3 and Definition 1.3.6). Proposition A.32 (Castaing s representations). Let W be a random closed set in Rd, then there exists a countable family (ξn)n 1 of Rd-valued random variables whose closure is almost surely equal to W, namely: {ξn, n 1} = W, almost surely. Equipped with this result, we can easily extend Theorem A.30 to the measurability of the covering numbers associated to a Castaing s representation of the hypothesis set: Theorem A.33 (Measurability of covering numbers in the case of random hypothesis set). Let W Rd be a random closed set over a probability space (Ω, T ) and δ > 0. Let us introduce a Castaing s representation (ξn)n 1 of W. Then, under Assumption 3.1, we have that the mapping between probability spaces (Zn, F n) (Ω, T ) (S, ω) 7 |N ρS δ ({ξn(ω), n 1})| (N+, P(N+)), is a random variable, where P(A) denotes the subsets of a set A. In particular, the upper-box counting dimension dim ρS B is a random variable. Proof. The proof follows exactly that of Theorem A.30 except that now have a Castaing s representation (ξn)n 1 of W. By the same proof than Equation (30), we have: {(S, ω), |N ρS δ ({ξn(ω), n 1})| M} = [ i I {(S, ω), ρS(ξi(ω), ξk(ω)) δ}. By continuity and composition of random variables, it is clear that {(S, ω), ρS(ξi(ω), ξk(ω)) δ} F n T , hence we have the result by countable unions and intersections. Therefore dim ρS B (W(ω)) is a random variable is a random variable as a direct consequence of Lemma A.29. 
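The covering numbers manipulated in Theorems A.30 and A.33 are easy to evaluate on a finite sample of hypotheses, for instance finitely many iterates of the training trajectory. The sketch below (hypothetical names, toy data, NumPy only) builds the pseudo-metric ρS from a matrix of per-sample losses and returns the size of a greedy δ-cover whose centers belong to the set itself.

```python
import numpy as np

def rho_S(losses):
    """Pairwise data-dependent pseudo-metric
    rho_S(w, w') = (1/n) * sum_i |l(w, z_i) - l(w', z_i)|
    computed from a (num_hypotheses x n) matrix of per-sample losses."""
    return np.abs(losses[:, None, :] - losses[None, :, :]).mean(axis=2)

def greedy_cover_size(D, delta):
    """Size of a greedy delta-cover with centers taken from the set itself:
    an upper bound on the minimal covering number under the metric D."""
    uncovered = np.ones(D.shape[0], dtype=bool)
    count = 0
    while uncovered.any():
        c = np.flatnonzero(uncovered)[0]      # next uncovered point becomes a center
        uncovered &= D[c] > delta             # points within delta are now covered
        count += 1
    return count

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical per-sample losses for 200 hypotheses on n = 100 data points.
    L = rng.uniform(0.0, 1.0, size=(200, 100))
    D = rho_S(L)
    for delta in (0.4, 0.2, 0.1, 0.05):
        print(delta, greedy_cover_size(D, delta))
```

The greedy centers form a δ-separated set, so the returned size is sandwiched between the minimal δ-covering number and the δ/2-covering number; exactly as in Lemma A.29, this surrogate therefore recovers the same upper box-counting dimension.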
Thanks to Theorem A.33, we are actually able to prove the much stronger result that we can build measurable coverings. Theorem A.34 (Measurable coverings). Let W Rd be a random closed set over a probability space (Ω, T , P) and δ > 0. Let F(N+) denote the set of finite subsets of N+. Then, under Assumption 3.1, we can build a map: Nδ : Zn Ω F(Rd) CL(Rd), which is measurable (with respect to the Effr os σ-algebra on the right hand-side) and such that for almost all (S, ω) Zn Ω, Nδ(S, ω) is a finite set which is (almost surely) a covering of W(ω) with respect to pseudo-metric ρS and such that we have almost surely over µ n z P: dim ρS B (W(ω)) = lim sup δ 0 Generalization Bounds using Data-Dependent Fractal Dimensions Proof. Let us introduce a Castaing s representation (ξk)k 1 of W and denote by FN(N+) the set of finite subsets of N+ with exactly N elements. Again, as in Theorem A.30, the proof is based on the idea that, thanks to the continuity of the loss ℓdefining the pseudo-metric ρS, a cover of {ξk, k N+} covers W. Let us denote C(ω) = {ξk(ω), k N+}. As FN(N+) is countable, for each N N+, we introduce (F N i )i 1 an ordering of FN(N+). Now for each (S, ω) Zn Ω, we define: i N+, Fi(S, ω) := F |N ρS δ (C(ω))| i . Let us now introduce the minimal index of a set of indices that can cover W: i0(S, ω) := argmin i N+, k 1, j Fi(S, ω), ρ(ξj(ω), ξi(ω)) δ . Note that i0 is finite because {(ℓ(w, zi)1 i n), win W} is compactly contained, thanks to the boundedness assumption on ℓ, i.e. the covering numbers are finite. We can therefore build the following covering indices function: Iδ : Zn Ω F(N+), defined by Iδ(S, ω) = Fi0(S,ω)(S, ω). Now we want to introduce an evaluation functional , i.e. a mapping: Ξ : Ω F(N+) F(Rd) CL(Rd), defined by Ξ(ω, I) = {ξi(ω), i I}. It is easy to see that Ξ is measurable, indeed for any compact set K Rd we have: {(ω, I), Ξ(ω, I) K = } = [ i F {ξi K} {I}, implying the measurability by countable unions and Definition A.23. The key point of the proof is that we construct the coverings as Nδ(S, ω) = Ξ(ω, Iδ(S, ω)), so that the measurability of Nδ reduces to that of Iδ. This is achieved by noting that for any non-empty I F(N+), such that I = F N i1 for some N, i1 1, we have, by leveraging the countable ordering of FN(N+): I 1 δ ({I}) ={|N ρS δ (C(ω))| = N} + \ m I {(S, ω), ρS(ξk(ω), ξm(ω)) δ} {(S, ω), ρS(ξk(ω), ξm(ω)) > δ} . By Theorem A.33, we have the measurability of (S, ω) 7 |N ρS δ (W(ω))|, hence the measurability result follows by continuity of ℓ(and therefore ρ ( , )) and countable unions and intersections. Now, using Lemma A.29, Nδ(S, ω) also defines a covering of W and we have: dim ρS B (W(ω)) = lim sup δ 0 log(1/δ) . (32) Let us make the following important remark , which summarizes most of this subsection. Generalization Bounds using Data-Dependent Fractal Dimensions Remark A.35. Theorem A.34 shows that we can construct measurable coverings of the random closed hypothesis set under pseudo-metric ρS. While those coverings may not be strictly speaking minimal, they yields the same upper-box counting dimensions, which is enough for all proofs in this work to hold. Note that this technical complication of not being minimal comes from the fact that we asked the minimal coverings of a set F to be included in F, however this also removes further technical complications. If we do not impose this condition, our proof would imply that we can construct measurable minimal coverings. 
From now on, we will always implicitly assume that the coverings we consider are measurable and induce correct upper box-counting dimension, the present subsection being a theoretical basis for this assumption.This is formalized by Assumption 3.2 in the main part of the paper. B. Postponed proofs B.1. Proof of Theorem 3.4 This proof essentially uses classical arguments related to Rademacher complexity. Proof. Step 0: First of all, as W is closed, we can consider a dense countable subset C10. Thanks to the boundedness assumption, we can find finite coverings Nr for each value of r > 0. The notation Nr refers in this proof to the set of the centers of a covering of C by closed r-balls under the pseudo-metric ρS. Invoking results from Lemma A.29, those set Nr are also δ-coverings of W and induce the upper box-counting dimension of W under ρS, so that considering them does not change the dimension. Step 1: Let us set: G(S) := sup w W R(w) ˆRS(w) . Invoking proposition A.9 we have: G(S) 2Rad(ℓ(W, S)) + 3 n log(1/η). (33) Therefore we have everywhere for S Zn: dim ρS B (W) := lim sup r 0 log(|Nr|) log(1/r) . (34) Thanks to Theorem A.30 we have that log(|Nr|) is a random variable. Let us consider an arbitrary positive sequence rk decreasing and converging to 0. We have: dim ρS B (W) := lim sup k log(1/rk) . (35) Let γ > 0, by Egoroff s Theorem A.12 there exist a set Ωγ such that µ n z (Ωγ) 1 γ, on which the above convergence is uniform. Therefore, if we fix ϵ > 0, we have that there exists K N such that S Ωγ, k K, sup 0<δ 0 and w N ρS δ (WS,U) we have that R(w) ˆRS(w) R (w ) ˆRS (w ) + | ˆRS(w) ˆRS (w ) | + |R(w) R (w )| . In the above inequality, the first term can be controlled by standard techniques, namely concentration inequalities and decoupling theorems presented in Section A.1 as w lives in a finite set N ρS δ (WS,U) and the second term is trivially less than δ by the definition of coverings. However, the last term cannot be bounded in an obvious way. To overcome this issue we introduce approximate level-sets of the population risk, defined as follows11 for some K N+: Rj S := WS,U R 1 j B K , (j + 1)B where j = 0, . . . , K 1 and R 1 denotes the inverse image of R. The interval j B K will be denoted Ij. Note that thanks to the Let Nδ,j collect the centers of a minimal δ-cover of Rj S relatively to ρS, the measurability condition on the coverings extend to the randomness of those sets Nδ,j. 11As U is independent of S, we drop the dependence on it to ease the notation. Generalization Bounds using Data-Dependent Fractal Dimensions Remark B.2. Without loss of generality we can always assume that those sets Rj S are non-empty. Indeed we can always add one deterministic point of R 1(Ij) in each of the coverings Nδ,j one deterministic (always the same) element of R 1(Ij). It won t make the mutual information term appearing in our result bigger (by the data-processing inequality) and it won t change the upper box-counting dimension because of its finite stability. Moreover if some of the sets R 1(Ij) are empty then we just need to restrict ourselves to a deterministic subset of [0, B]. If we don t want to do this, another way, maybe cleaner, of handling the potential empty sets would be to use the convention max( ) = 0 everywhere in the proof, then we should also adapt the definition of ϵ(N, I) below to replace log(KN) by max(0, log(KN)), where log(0) is set to . All those manipulations would essentially lead to the same results. 
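To illustrate the construction used in the proof below, the following sketch (with hypothetical names) partitions a finite sample of hypotheses into K approximate level sets according to a stand-in for the population risk (the proof uses the true risk R; numerically one would substitute a hold-out estimate) and covers each level set separately under ρS. The maximum of the returned sizes is the quantity max_j |N_{δ,j}| that enters the union bound.

```python
import numpy as np

def levelset_cover_sizes(losses, risk, B, K, delta):
    """Cover each approximate level set R_j = {w : risk(w) in I_j} separately
    under the data-dependent pseudo-metric rho_S and return the cover sizes.

    losses : (M x n) matrix of per-sample losses l(w, z_i) for M hypotheses.
    risk   : length-M vector standing in for the population risk R(w)
             (the proof uses the true risk; numerically a hold-out estimate)."""
    D = np.abs(losses[:, None, :] - losses[None, :, :]).mean(axis=2)   # rho_S
    bins = np.minimum((np.asarray(risk) * K / B).astype(int), K - 1)   # index j of I_j

    def greedy_cover(sub):              # greedy delta-cover, centers in the set
        uncovered = np.ones(sub.shape[0], dtype=bool)
        count = 0
        while uncovered.any():
            c = np.flatnonzero(uncovered)[0]
            uncovered &= sub[c] > delta
            count += 1
        return count

    sizes = []
    for j in range(K):
        idx = np.flatnonzero(bins == j)
        if idx.size:                    # empty level sets are simply skipped
            sizes.append(greedy_cover(D[np.ix_(idx, idx)]))
    return sizes                        # max(sizes) is the quantity in the union bound
```

In the proof, K is eventually taken of order √n, so that the discretization error B/K is of the same order as the n^{-1/2} concentration term.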
Measurability of the coverings: We proved in Section A.6 that we can construct measurable coverings (as random sets), which are actually coverings of a dense countable subset (or a Castaing s representation) of WS,U. Therefore, without loss of generality and thanks to the continuity of the loss ℓ, we can assume in all the remaining of this work that all the considered coverings are random sets, because either they can be constructed by Theorem A.34 or we can restrict ourselves to Castaing s representations of WS,U. As can already be noted in Remark B.2, our approximate level set technique introduces quite a lot of technical difficulties and intricate terms. We believe that this proof technique is interesting but may not be a definitive answer to the problem at hand, improving it is a direction for future research. Proof. We assume without loss of generality that that the loss takes values in [0, B]. Let us fix some integer K N+ and define Ij = [ j B K ], such that: Then, given WS we define the set Rj S := WS R 1(Ij). We then introduce the random closed (finite) sets 12 Nδ,j corresponding to the centers of a minimal covering13 of Rj S, such that Nδ,j Rj S, for the pseudo-metric: ρS(w, w ) := 1 i=1 |ℓ(w, zi) ℓ(w , zi)|. The first step is to write that almost surely: R(w) ˆRS(w) = max 0 j K 1 sup w Rj S R(w) ˆRS(w) . Then, given w, w Rj S such that ρS(w, w ) δ we have by the triangle inequality: R(w) ˆRS(w) R(w ) ˆRS(w ) + ρS(w, w ) + |R(w) R(w )| R(w ) ˆRS(w ) + δ + B So that we get: R(w) ˆRS(w) δ + B K + max 0 j K 1 max w Nδ,j R(w) ˆRS(w) . (41) 12Note that, as mentioned earlier, in this paper we always assume that minimal coverings are random sets. 13Without loss of generality we can always assume that those sets are non-empty. Indeed we can always add one deterministic point of R 1(Ij) in each of the coverings Nδ,j one deterministic (always the same) element of R 1(Ij). It won t change the mutual information term in the final results (by the data-processing inequality) and it won t change the upper box-counting dimension because of its finite stability. Moreover if some of the sets R 1(Ij) are empty then we just need to restrict ourselves to a deterministic subset of [0, B]. If we don t want to do this, another way, maybe cleaner, of handling the potential empty sets would be to use the convention max( ) = 0 everywhere in the proof, then we should also adapt the definition of ϵ(N, I) below to replace log(KN) by max(0, log(KN)), where log(0) is set to . Generalization Bounds using Data-Dependent Fractal Dimensions Now we fix some η > 0 and just introduce the random variable ϵ as a function of two variables N and I: log(1/η) + log KN + I . We have by the decoupling lemma A.6 along with Fubini s Theorem, Hoeffding inequality and a union bound: P max 0 j K 1 max w Nδ,j R(w) ˆRS(w) ϵ(max j |Nδ,j|, max j I (S, Nδ,j)) j=0 P max w Nδ,j R(w) ˆRS(w) ϵ(|Nδ,j|, I (S, Nδ,j)) j=0 e I (S,Nδ,j)PNδ,j PS R(w) ˆRS(w) ϵ(|Nδ,j|, I (S, N j δ )) j=0 e I (S,Nδ,j)ENδ,j R(w) ˆRS(w) ϵ(|Nδ,j|, I (S, N j δ )) j=0 e I (S,Nδ,j)ENδ,j R(w) ˆRS(w) ϵ(|Nδ,j|, I (S, N j δ )) j=0 e I (S,Nδ,j)ENδ,j |Nδ,j| exp nϵ(|Nδ,j|, I (S, N j δ ))2 j=0 e I (S,Nδ,j)ENδ,j K e I (S,Nδ,j) Now let us consider a random minimal δ-cover of the whole (random) hypothesis set WS. Given j {0, . . . , K 1}, we have in particular almost surely that: WS Rj [ w Nδ BρS δ (w). Where BρS δ (w) denotes the closed δ-ball for metric ρS centered in w. Therefore there exists a non-empty subset Nδ Nδ such that for all w Nδ we have BρS δ (w) Rj = . 
Therefore we can collect in some set Nδ,j one element in each BρS δ (w) Rj for w Nδ and the triangular inequality gives us: Rj S [ w Nδ,j BρS 2δ (w). This proves that almost surely j, |Nδ,j| |Nδ/2|, and thus: max 0 j K 1 |Nδ,j| |Nδ/2|. (43) We know that we have almost surely that: lim sup δ 0 log(|Nδ/2|) log(2/δ) = dim ρS B (WS). Therefore let us fix γ, ϵ > 0. Using Egoroff s Theorem we can say that there exists δn,γ,ϵ > 0 such that, with probability at least 1 γ, for all δ δn,γ,ϵ we have: log |Nδ/2| (ϵ + dim ρS B (WS)) log(2/δ). Generalization Bounds using Data-Dependent Fractal Dimensions Therefore combining equations 40, 42, 43, we get that with probability at least 1 γ η, for all δ δn,γ,ϵ: R(w) ˆRS(w) δ + B log(K/η) + log max j |Nδ,j| + max j I (S, Nδ,j) log(K/η) + log |Nδ/2| + max j I (S, Nδ,j) log(K/η) + log(2/δ)(ϵ + dim ρS B (WS)) + max j I (S, Nδ,j) . The choice of K has not been done yet, considering the above equation the best choice is clearly: K = Kn := n . Let us introduce the notation: In,δ := max j I (S, Nδ,j). This way we get that with probability at least 1 γ η, for all δ δn,γ,ϵ: R(w) ˆRS(w) δ + B n 1 + log( n/η) + log(2/δ)(ϵ + dim ρS B (WS)) + In,δ n . (44) Note that it is possible to set this value of K, which depends on n, at the end of the proof, because the previous limits do not depend on K. B.3. Proof of Theorem 3.8 Here we present the proof of Theorem 3.8. The proof proceeds in two steps and is based on what we will call a grouping technique. The main idea is to divide the dataset S Zn into H groups J1, . . . , JH of size J with J, H N+ and JH = n. In the end of the proof a particular choice is made. A minor technical difficulty appears when it is not actually possible two write JH = n for a pertinent choice of (J, H). Therefore we first present a result when the latter is possible and then derive two corollaries to deal with this technical issue, mostly based on the boundedness assumption. Theorem 3.8 will be the second corollary. Remark B.3. For the sake of the proof we need to assume α 3 2, which is just asking for a potentially weaker assumption, which is not a problem. Note that the value α 3 2 will lead in Theorem 3.8 to a convergence rate in n 1/2 which is optimal anyway. Let us start with the main result of this section: Proposition B.4. We make assumptions 3.1, 3.2 and 3.7 with the same notations than in Theorem 3.8. We also take arbitrary J, H N+ such that JH = n Then for all n 2 3 3 2α , with probability 1 γ η, for all δ smaller than some δγ,ϵ,n > 0 we have: sup w WS,U |R(w) ˆRS(w)| δ + B n 1 + 2Jβ ϵ + d(S, U) log(4/δ) + log(H n/η) + I . Proof. Let us first refine our notations for the coverings to make the proof clearer. Throughout this section, for any S, S we will denote Nδ(S, S , U) the centers of a covering of WS,U by closed δ-balls under pseudo-metric d S . As in the proof of Theorem 3.5, we introduce some approximate level sets Rj S for j {0, . . . , K 1}. We then denote by Nδ,j(S, S , U) the centers of a covering of Rj S by closed δ-balls under pseudo-metric d S . (note that the Rj S still depends on U but the dependence has been dropped to ease the notations). The remark we made in the proof of theorem 3.5 about the assumptions that Rj S are non-empty without loss of generality still holds in the setting described hereafter. Generalization Bounds using Data-Dependent Fractal Dimensions The proof starts by introducing the level-sets of the population risk as in proof of Theorem 3.5. We define Rj S exactly in the same way. 
The proof starts with the same statement: sup w Ws,U |R(w) ˆRS(w)| max 0 j K 1 sup w Rj S |R(w) ˆRS(w)|. For all j, we (minimally) cover Rj S with δ-covers for pseudo-metric d S, such that the centers are in Rj S. We collect those centers in Nδ,j(S, S, U). This leads us to: sup w Ws,U |R(w) ˆRS(w)| δ + B K + max 0 j K 1 max w Nδ,j(S,S,U) |R(w) ˆRS(w)| | {z } :=Ej Thanks to our stability assumption 3.7, we can say that for δ small enough there exists a random minimal covering such that for all j {0, . . . , K 1} and all k {1, . . . , H} the covering Nδ,j(S, S\Jk, U) satisfies: w Nδ,j(S, S, U), w Nδ,j(S, S\Jk, U), sup z Z |ℓ(w, z) ℓ(w , z)| βJ where the J factor on the right hand side comes from the fact that our stability assumption can be seen as a Lipschitz assumption in term of the Hausdorff distance of the coverings with respect to the Hamming distance on the datasets. Recall that we assume that all Nδ,j have coverings-metrics stability with common parameters β, α. As in the previous proposition, we split the index set {1, . . . , n} into H groups of size J, with HJ = n, which allows us to write (with a similar proof): Ej = max w Nδ,j(S,S,U) |R(w) ˆRS(w)| max w Nδ,j(S,S,U) ℓ(w, zi) R(w) k=1 max w Nδ,j(S,S,U) 1 n ℓ(w, zi) R(w) n max w Nδ,j(S,S\Jk ,U) ℓ(w, zi) R(w) k=1 max w Nδ,j(S,S\Jk ,U) ℓ(w, zi) R(w) . Putting this back into equation (45) we get: sup w Ws,U |R(w) ˆRS(w)| δ + B nα + max 0 j K 1 k=1 max w Nδ,j(S,S\Jk ,U) 1 n ℓ(w, zi) R(w) nα + H max 0 j K 1 max 1 k H max w Nδ,j(S,S\Jk ,U) 1 n ℓ(w, zi) R(w) | {z } :=Mj,k(S,U) Let ϵ be a random variable depending on Nδ,j(S, S\Jk, U) only. We use a decoupling lemma (lemma 1 in (Hodgkinson et al., 2022)) along with Hoeffding s inequality to write: Generalization Bounds using Data-Dependent Fractal Dimensions P Mj,k(S, U) ϵ e I (Nδ,j(S,S\Jk ,U),SJk )PNδ,j(S,S\Jk ,U) PSJk Mj,k(S, U) ϵ e I (Nδ,j(S,S\Jk ,U),SJk )ENδ,j(S,S\Jk ,U) PSJk Mj,k(S, U) ϵ e I (Nδ,j(S,S\Jk ,U),SJk ) ENδ,j(S,S\Jk ,U) w Nδ,j(S,S\Jk ,U) ℓ(w, zi) R(w) ϵ e I (Nδ,j(S,S\Jk ,U),SJk )E |Nδ,j(S, S\Jk, U)|e 2ϵ2n2 The key point of the proof, and the reason for which we have introduced this strong stability assumption on the coverings is that we can now use the following Markov chain: SJk WS,U Nδ,j(S, S\Jk, U). (48) Therefore, by the data processing inequality: I (Nδ,j(S, S\Jk, U), JJk) I (WS,U, SJk). Now using the easier Markov chain: WS,U S SJk, We have: I (Nδ,j(S, S\Jk, U), SJk) I (S, WS,U). (49) Note that the mutual information term appearing in equation (49) is the same than the one appearing in (Hodgkinson et al., 2022). P Mj,k(S, U) ϵ e I (S,WS,U)E |Nδ,j(S, S\Jk, U)|e 2ϵ2n2 Equipped with this result we can make an informed choice for the random variable ϵ, for a fixed η > 0: ϵ = ϵj,k := log |Nδ,j(S, S\Jk, U)| + log(HK/η) + I (S, WS,U) , Now we can apply an union bound to get: P max 0 j K 1 max 1 k H Mj,k(S, U) max 0 j K 1 max 1 k H ϵj,k k=1 P Mj,k(S, U) max 0 j K 1 max 1 k H ϵj,k k=1 P Mj,k(S, U) ϵj,k Now let us have a closer look at those covering numbers |Nδ,j(S, S\Jk, U)|. Note that we have: w, w Rd, d S\Jk (w, w ) n n J d S(w, w ), Generalization Bounds using Data-Dependent Fractal Dimensions And therefore |Nδ,j(S, S\Jk, U)| |N δ(n J) n ,j(S, S, U)|. Moreover, using the same reasoning than in the proof of Theorem 3.5, we know that we have |Nδ,j(S, S\Jk, U)| |Nδ/2(S, S\Jk, U)|. Thus: |Nδ,j(S, S\Jk, U)| |N δ(n J) 2n (S, S, U)|. 
As before, we will want to solve the trade-off in the values of H and J by setting J = nλ for some λ (0, 1) (this time we do not allow the value λ = 1, which will be justified later when we find the actual value of λ). A very simple calculation gives us: δ(n J) Therefore we can say that if n 2 1 1 λ , then δ(n J) 2n δ/4 and therefore: |Nδ,j(S, S\Jk, U)| |N δ 4 (S, S, U)|. (50) We know that: dim d S B (WS,U) = lim sup δ 0 4 (S, S, U)| If we fix ϵ, γ > 0, we can apply Egoroff s Theorem to write that with probability 1 γ, we have for δ small enough: 4 (S, S, U)| ϵ + dim d S B (WS,U) log(4/δ). Therefore, we can say that with probability 1 η γ, we have for δ small enough: sup w Ws,U |R(w) ˆRS(w)| δ + B ϵ + dim d S B (WS,U) log(4/δ) + log(HK/η) + I (S, WS,U) . Setting K = n and noting that 1 α/3 1 in the above equation gives us the result. Corollary B.5. With the exact same setting than in proposition B.4, if we assume in addition that nα/3 N+, then for all n 2 3 3 2α , with probability 1 γ η, for all δ smaller than some δγ,ϵ,n > 0 we have: sup w Ws,U |R(w) ˆRS(w)| δ + B + 2β log(1/η) + 1 α 3 log(n) + I + ϵ + d(S, U) log(4/δ) Proof. We want to write J in the form J = nλ with some λ > 0. We see that there is a trade-off to be solved in the values of (J, H) if we want both all terms in equation (51) to have the same order of magnitude in n, which leads to H J/n = J/nα. Therefore we want to have 1/ J = J/nα and λ/2 = α λ, which implies the following important formula: Finally, we are left again with choosing the value of K, an obvious choice is K = nα/3 N+ to get the same order of magnitude. Thus we get the final result: for n 2 3 3 2α , with probability 1 γ η, for all δ smaller than some δγ,ϵ,n > 0 we have: Generalization Bounds using Data-Dependent Fractal Dimensions sup w Ws,U |R(w) ˆRS(w)| δ + B + 2β nα/3 + B log(1/η) + 1 α 3 log(n) + I + ϵ + d(S, U) log(4/δ) Remark B.6. The asymptoticity in δ defined by δγ,ϵ,n above accounts for the asymptoticity coming both from the stability assumption (definition 3.6) and the convergence of the limit defining the upper box-counting dimension. Now we prove Theorem 3.8 which is based on the same idea than the previous corollary, but when nα/3 / N. Theorem B.7. under the same assumptions and notations than proposition B.4. We have that for n C(α) := max{2 3 2α , 21+ 3 3 2α }, with probability 1 γ η, for all δ smaller than some δγ,ϵ,n > 0 we have: sup w Ws,U |R(w) ˆRS(w)| δ + 3B + 2β log(1/η) + 1 α 3 log(n) + I + ϵ + d(S, U) log(4/δ) Proof. We define J := n2α/3 , J := n1 2α/3 and n := JH. We obviously have n n. Using the boundedness assumption we have: | ˆRS(w) R(w)| n n i=1 ℓ(w, zi) R(w) . (55) For the first term we write: n B n n2α/3 1 n1 2α/3 1 n = n2α/3 + nα/3 1 The idea is to apply the proof of Theorem B.4 to the last term of equation (55), replacing d Sn with d S n. For clarity we still denote S = (z1, . . . , zn) and S n = (z1, . . . , z n) There are several terms we need to consider: The mutual information term: The two data processing inequality we apply to prove equation (49) still apply so we can still write I (S, WS,U) in the bound. Dimension term: Let us denote by d(S, S , U) the upper-box dimension of WS,U for pseudo-metric d S . Using the same reasoning than equation (50), we have: |Nδ(S, S n, U)| |Nδ n n (S, S, U)|. n2α/3 1 n1 2α/3 1 n δ 1 1 n2α/3 And therefore, once we have n 2 3 2α we have: |Nδ(S, S n, U)| |N δ 2 (S, S, U)|, which implies: d(S, S n,U d(S, S, U). 
Terms in n: Now we look at equation (51), where we have 4 types of term in n which are of the form: Generalization Bounds using Data-Dependent Fractal Dimensions We do not forget that we also have to multiply those terms by the factor n/n coming from equation (55). Setting K := 1 + J we get successively: n n 1 K 1 nα/3 , n n H J/n 1 nα/3 , n n J/nα 1 nα/3 . For the logarithmic term we have: log(HK) log(2 Jn1 2α/3) log(2n1 α/3). Moreover, if n 2 3 2α we have: n n2α/3 1 n1 2α/3 n/2. Therefore the condition n 2 3 3 2α is implied by n/2 2 3 3 2α . So now the condition on n becomes: n C(α) := max{2 3 2α , 21+ 3 3 2α }. (56) Putting all of this together, we get that for n C(α) (defined in equation (56)), with probability 1 γ η, for all δ smaller than some δγ,ϵ,n > 0 we have: sup w Ws,U |R(w) ˆRS(w)| δ + 3B + 2β log(1/η) + 1 α 3 log(n) + I + ϵ + d(S, U) log(4/δ) B.4. Proof of Theorem 4.1 Let (X, ρ) be a pseudo-metric space, we introduce the equivalence relation: x y ρ(x, y) = 0. We call metric identification of X the quotient of X by this equivalence relation. The canonical projection on the quotient will be denoted as: π : X X/ . ρ induces a metric on X/ that we will denote ρ = π ρ. We prove that upper box-counting dimension and persistent homology dimension are invariant by this identification operation. Let us recall that we always consider the covers are made from closed δ-balls, even though equivalent definitions exist. Lemma B.8 (Upper-box dimension with pseudo metric). dim B(X) = dim B(X/ ). (58) Let N d δ (F) denote the minimum number of closed δ-balls coverings of F for the (pseudo)-metric d. Proof. Let F X, bounded. Let {x1, . . . , xn} be the centers of a closed δ-balls covering of F for metric ρ. We have: x, x B(xi, δ), ρ (π(x), π(y )) = ρ(x, x ) δ. Therefore π(B(xi, δ)) B(π(xi), δ), therefore N ρ δ (F) N ρ On the other hand, if {y1, ..., yn} are the centers of a covering of F X/ , a similar reasoning shows that the π 1(B(yi, δ)) give a covering of π 1(F) with (set included in) δ-balls. Generalization Bounds using Data-Dependent Fractal Dimensions The result is also quite obvious for the persistent homology dimension, even though it is a bit more complicated to write it. For more details on persistent homology please refer to (Boissonat et al., 2018; Memoli & Singhal, 2019; Schweinhart, 2019). Lemma B.9 (Persistent homology dimension in pseudo metric spaces). dim PH0(X) = dim PH0(X/ ). (59) Intuitively, the proof of this result is as follows: When constructing the VR filtration in a pseudo-metric space, points within 0 pseudo-distance will only add pairs of the form (0, 0) in their persistence homology of degree 0, because they are created with the same value of the distance parameter δ in construction of the VR filtration. Proof. Let K be a simplicial complex based on a finite point set T X. Let us denote by K := π(K) the image of K by the canonical projection π : X X/ , defined by its value on the simplices: π([a0, . . . , as]) := [π(a0), . . . , π(as)]. (60) We also introduce a section of π, i.e. an injective application s : X/ X, such that π s = Id X/ . Clearly, K is still a simplicial complex.The map π does not preserve the dimension of the simplices, as [π(a0), . . . , π(as)] is seen as a set, and two ai can have the same image, but π always reduces the dimension. Note that K and s( K) clearly define simplicial complex, but that s( K) can only be seen as a sub-complex of K. Therefore, we define s : K K analogously to Equation (60). 
Actually, by injectivity of s, this allows us to identify K with a sub-complex of K. Thus, both π and s linear maps on the space of k-chains: π : Ck(K) Ck( K), s : Ck( K) Ck(K), which both commute with the boundary operator, indeed, for any simplex [a0, . . . , as] and ϵi K: π ([a0, . . . , as]) = π s X i=0 ϵi[a0, . . . , ai 1, ai+1, . . . , as] i=0 ϵi[π(a0), . . . , π(ai 1), π(ai+1), . . . , π(as)] = π([a0, . . . , as]), with the exact same computation for s, so that the following diagram commutes: Therefore, π and s induces linear maps between the homology groups, making the following diagram commute: Now let us consider P = {x1, . . . , xn} a finite set in (X, ρ) and denote accordingly P := π(P). Let us introduce a Vietoris-Rips filtration of P denoted by: Kδ0,1 Kδ0,α0 Kδ1,1 Kδc,αC = K, Generalization Bounds using Data-Dependent Fractal Dimensions where 0 δ1 < < δC are the time-distance indices of the filtration and for the same value of δ the simplices are ordered by their dimension and arbitrarily if they also have the same dimension. Obviously δ0 = 0. As π : P P preserves distances, it is clear that, up to allowing certain complexes to appear several times in a row, the nested sequence ( Ki,j)(0 i C,1 j αi) is a Vietoris-Rips filtration for P. Let us fix some i 0, . . . , C and j 1, . . . , αi such that either i 1 or j = α0. This way we have: a, b P, π(a) = π(b) = [a, b] Ki,j, by definition of the VR filtration (all simplices within δ0 = 0 ρ-distance have been added in the filtration). Therefore, if π(a) = π(b), as [a, b] = [a] + [b], we have that [a] = [b] in H0(Ki,j). As be definition of s, for any a P we have π s π(a) = π(a), we have the following identity (the bars denote classes in homology groups): s π([a]) = s π([a]) = [a]. Therefore, as also π s = Id, we have that s and π are inverse of one another, so that we have an isomorphism H0(Ki,j) = H0( Ki,j) and the following diagram: / H0(K0,α0 1) H0(K0,α0) / . . . / H0(KδC,αC) H0( K0,1) / . . . / H0( K0,α0 1) / H0( K0,α0) / . . . / H0( KδC,αC) As already mentioned, persistent homology of degree 0 is characterized by the multi-set of death times δi. All death before K0,α0 1 are 0 so they do not add anything the weighted life-sum of Equation (25). After K0,α0 1, the isomorphisms in the diagram show that the basis will evolve exactly in the same way so the death times will be the same, therefore the weighted sum are the same in both spaces for any P. Therefore, by definition, we have the equality between the persistent homology dimension. Combination of Equation (15), lemma B.8 and Lemma B.9 immediately gives the proof of Theorem 4.1. B.5. Proof of Theorem 3.9 In this subsection, we show how we can leverage very classical tools from high dimensional probability to give one first step toward proving lower bounds, even though the obtained lower bound may look a bit disappointing. We combine two tools, namely Gaussian complexity and Sudakov s theorem. Definition B.10 (Gaussian complexity). Given a set A Rn, and g1, . . . , gn N(0, 1) independent, the Gaussian complexity of A is defined by: As before we will denote, for S Zn: Γ(ℓ(W, S)) := 1 i=1 giℓ(w, zi) We have the following lower bound of Rademacher complexity Lemma B.11. We have: log(n) Γ(A) Generalization Bounds using Data-Dependent Fractal Dimensions Proof. For Rademacher random variables σ1, . . . , σn, let us define the following function, for α = (α1, . . . , αn) [0, 1]n: it is easy to see that f is convex and continuous on the compact set [0, 1]n. 
Therefore we know that f attains its maximum for some α0 [0, 1n.]. We denote the constant one vector by 1n Rn. For some 0 λ 1, let us write α0 = λα+(1 λ)1n, for some α. We have, by convexity: f(α0) λf(α) + (1 λ)f(1n) λf(α0) + (1 λ)Rad(A), which implies that: Rad(A) (61) Now let g1, . . . , gn N(0, 1) be independent normal random variables. Let g := maxi(|gi|). It is possible to write the following decomposition: i, gi = |gi|σi where the σi are Rademacher random variables independent of |gi|. As g > 0 almost surely, we have: i=1 σi |gi| g ai , (because Rad is non negative) , (by Equation (61)) n Eg [g ]Rad(A), (Fubini s theorem). We conclude by using that E[g ] 2 p Remark B.12. It is also possible to prove that Rad(A) p π The key ingredient for the lower bound is Sudakov s theorem, see (Vershynin, 2020, Section 7): Theorem B.13 (Sudakov s theorem). Let (Xt)t T be a mean zero gaussian process, then for any ϵ > 0 we have: E sup t T Xt where C is an absolute constant and Nϵ(T, d) the covering number of T for the following pseudo-metric: d(t, s)2 := E[(Xt Xs)2] To prove our lower bound, we first need a lower bound of the expected worst case generalization error in terms of Rademacher complexity: Proposition B.14. Desymmetrization inequality Assume that the loss ℓis in the interval [0, B] (this does not make us lose generality). Then we have: R(w) ˆRS(w) 1 2E Rad(ℓ(W, S)) B While this result is classical, we present a proof for the sake of completeness, and to exhibit the absolute constants that we get in our case. Generalization Bounds using Data-Dependent Fractal Dimensions Proof. Similarly to the symmetrization inequality, we write, with (z i)i an independent copy of (zi)i and (σi)i independent Rademacher random variables: E Rad(ℓ(W, S)) E 1 i=1 σi ℓ(w, zi) R(w) + E 1 i=1 σi R(w) i=1 σi ℓ(w, zi) ℓ(w, z i) + BE 1 , (Jensen s inequality) ℓ(w, zi) ℓ(w, z i) + BE 1 , (Symmetrization argument) R(w) ˆRS(w) + BE 1 , (Triangle inequality) R(w) ˆRS(w) + B n , (Simple case of Massart s lemma), hence the result. Then we can prove the following result: Proposition B.15. Lower bound in term of covering numbers Assume that the loss ℓis bounded by B > 0, then there is an absolute constant c > 0 (the one coming from Sudakov theorem) such that with probability at least 1 ζ, for all δ > 0 we have R(w) ˆRS(w) c δ2 log |N ρS δ (W)| n log(n) B log(2) + 9 log(1/ζ) where ρS is the data-dependent metric already used before in this project (based on an L1 empirical mean). Proof. Using the same reasoning, based on Mc-Diarmid s inequality, than for proving the upper bound, we write successively that with probability at least 1 ζ: R(w) ˆRS(w) E sup w W R(w) ˆRS(w) B n , (Mc-Diarmid s inequality) 2E Rad(ℓ(W, S)) B 2Rad(ℓ(W, S)) B n , (Mc-Diarmid s inequality) log(n) Γ(ℓ(W, S)) B Now we note that: Γ(ℓ(W, S)) := 1 i=1 giℓ(w, zi) , and introduce the following gaussian process: w W, Xw := 1 n i=1 giℓ(w, zi). Generalization Bounds using Data-Dependent Fractal Dimensions The L2 distance induced by this gaussian process on W can be computed by: d(w, w )2 = 1 i=1 gi(ℓ(w, zi) ℓ(w , zi)) 2 i=1 (ℓ(w, zi) ℓ(w , zi))2 ρS(w, w )2, (Cauchy-Schwarz s inequality) ρS(w, w ) := 1 i=1 |ℓ(w, zi) ℓ(w , zi)| is the data-dependent pseudo-metric we used previously in this work. The result then follows by applying Sudakov s theorem. Using this proposition, we can prove the following result: Theorem B.16 (Lower bound with data-dependent fractal dimension). Assume that the loss ℓis bounded by B > 0 and that almost surely we have dimρS B (W) > 0. 
Then, for all γ, ζ > 0 there is an absolute constant c > 0 and some δn,γ,ζ > 0 such that, with probability at least 1 ζ γ, for all δ δn,γ,ζ we have: R(w) ˆRS(w) c δ2 log(1/δ)d(S) 2n log(n) B log(2) + 9 log(1/ζ) Remark B.17. As many of our results, the more interesting part of this result is the underlying covering numbers bound, the annoying asymptoticity in δ being introduced when we go from the covering numbers to the data-dependent fractal dimensions. Proof. Let us fix γ, ζ (0, 1). Using the definition of the lower box-counting dimension and the fact that dimρS B (W) > 0 almost surely, we can write: lim inf δ 0 log |N ρS δ (W)| dimρS B (W) log(1/δ) = 1, we can invoke Egoroff s theorem, as in previous proofs, to argue that there exists Ωγ F n, such that µ n z (Ωγ) 1 γ, on which the above convergence is uniform. As dimρS B (W) > 0 almost surely, we can assume without loss of generality that this is also the case on Ωγ. This implies that, on Ωγ, for δ smaller than some δn,γ,ζ, we have: log |N ρS δ (W)| 1 2 log(1/δ)dimρS B (W). Then, the result immediately follows from the previous proposition. B.6. Lipschitz case As mentioned in the introduction, several authors (S ims ekli et al., 2021; Camuto et al., 2021; Hodgkinson et al., 2022) have proven worst-case generalization bounds involving the Hausdorff dimension of the hypothesis set, computed based on the Euclidean distance. Their is in particular based on a Lipschitz loss ℓ assumption. It is therefore natural to ask whether we can find a similar result, i.e. involving the Euclidean based dimension, from our results. Therefore, in this section, we assume that the function (w, z) 7 ℓ(w, z) is L-Lipschitz in w, uniformly with respect to z, with L > 0 a constant. To simplify, we demonstrate the case of a fixed hypothesis set W Rd, more general cases being derived in a similar fashion. Generalization Bounds using Data-Dependent Fractal Dimensions Then we can bound the pseudo-metric ρS by ( is the Euclidean norm): w, w Rd, ρS(w, w ) = 1 i=1 |ℓ(w, zi) ℓ(w , zi)| From this observation, denoting N e the coverings associated to the Euclidean metric, we deduce that: N ρS δ (W) N e δ/L(W). (62) The proof of Theorem 3.4 therefore leads us to write, instead of Equation (38), that, with probability at least 1 2η: R(w) ˆRS(w) 2δ + 2B 2 log |N e δ/L(W)| Then we can use proof techniques similar to that in (S ims ekli et al., 2021) in the case of a fixed hypothesis set. More precisely, let us fix some ϵ > 0, using the definition of upper box-counting dimension along with Egoroff s theorem, we have that there exists Ωγ F n, such that µ n z (Ωγ) 1 γ, on which, for δ smaller than some δγ,ϵ (which is independent of n, because the metric does not depend on the data anymore) , we have: log |N e δ/L(W)| ϵ + dim e B(W) log(L/δ). Because δγ,ϵ does not depend on n, it is possible to set ϵ = dim e B(W) and set: δ = δn := 2 n. Therefore, we have that, with probability 1 2η γ for n big enough: R(w) ˆRS(w) 4 n + 2B 2dim e B(W) log(L n) Thus, we recover a result analogous to (S ims ekli et al., 2021), up to potentially absolute constants (coming from the fact that the proof technique is different). Their bound can therefore be seen as a particular case of our result. C. Additional experimental details C.1. Granulated Kendall s coefficients Kendall s coefficient, initially introduced in (Kendall, 1938), is a well-known statistics to assess the co-monoticity of two observations, or rank correlation. It is usually denoted with letter τ. 
Consider $((g_i, d_i))_{1 \leq i \leq n}$, a sequence of joint observations of two random elements, in our case the generalization error $g$ and the intrinsic dimension $d$. In our setting it is very likely that both $(g_i)$ and $(d_i)$ have pairwise distinct values, so that ties would have little impact on the analysis; we therefore assume that there are no ties to simplify the presentation. To compute Kendall's $\tau$ coefficient, denoted $\tau((g_i)_i, (d_i)_i)$, we look at all possible pairs of couples $(g_i, d_i)$ and count $+1$ if they are ordered the same way and $-1$ otherwise. The coefficient is then normalized by the total number of pairs, which is $\binom{n}{2}$. This yields the analytical formula:
$$\tau((g_i)_i, (d_i)_i) = \frac{1}{\binom{n}{2}} \sum_{i < j} \mathrm{sign}(g_j - g_i)\,\mathrm{sign}(d_j - d_i).$$
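The sketch below (our names, toy data) implements this formula directly under the no-ties assumption discussed above; an off-the-shelf implementation such as scipy.stats.kendalltau should return the same value in the absence of ties.

```python
from itertools import combinations

def kendall_tau(g, d):
    """Kendall rank correlation between generalization errors g and intrinsic
    dimensions d, assuming no ties: +1 for each concordant pair, -1 for each
    discordant pair, normalized by the number of pairs C(n, 2)."""
    assert len(g) == len(d)
    n = len(g)
    s = 0
    for i, j in combinations(range(n), 2):
        s += 1 if (g[i] - g[j]) * (d[i] - d[j]) > 0 else -1
    return s / (n * (n - 1) / 2)

if __name__ == "__main__":
    g = [0.12, 0.30, 0.25, 0.41, 0.08]   # toy generalization errors
    d = [3.1, 5.2, 4.9, 6.0, 2.7]        # toy dimension estimates
    print(kendall_tau(g, d))             # 1.0: perfectly co-monotone toy data
```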