# nearoptimal_learning_with_average_hölder_smoothness__6b8d7007.pdf

Near-optimal learning with average Hölder smoothness

Steve Hanneke Department of Computer Science Purdue University steve.hanneke@gmail.com

Aryeh Kontorovich Department of Computer Science Ben-Gurion University of the Negev karyeh@cs.bgu.ac.il

Guy Kornowski Department of Computer Science and Applied Mathematics Weizmann Institute of Science guy.kornowski@weizmann.ac.il

We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. [2021] by extending it to Hölder smoothness. This measure of the effective smoothness of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic worst-case Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.

1 Introduction

A fundamental theme throughout learning theory and statistics is that smooth functions ought to be easier to learn than rough ones an intuition that has been formalized and rigorously established in various frameworks [Györfi et al., 2002, Tsybakov, 2008, Giné and Nickl, 2021]. Hölder continuity is a natural and well-studied notion of smoothness that measures the extent to which nearby points can differ in function value and includes Lipschitz continuity as an important special case.

These global moduli of smoothness, while convenient for theoretical analysis, suffer from the shortcoming of being overly pessimistic. Indeed, being distribution-independent, they fail to distinguish a function that is highly oscillatory everywhere from one that is smooth over most of the probability mass; see Figure 1 for a simple illustration. Moreover, classically studied classes of average smoothness (e.g. Besov space) typically fix some distribution in advance (predominantly uniform), and then turn to consider smooth functions with respect to that single distribution. Thus, from a distribution-free statistical learning perspective where the underlying distribution is assumed to be unknown such classes fall short.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

Figure 1: Illustration of a function and a measure µ exhibiting a large gap between worst-case smoothness (occurring in low density regions) and average-smoothness with respect to µ.

Seeking to address these drawbacks, Ashlagi et al. [2021] proposed a natural notion of average Lipschitz smoothness with respect to a distribution. Their average Lipschitz modulus can be considerably (even infinitely) smaller than the standard Lipschitz constant, while still being able to control the excess risk. However, the risk bounds obtained by Ashlagi et al. are far from optimal, while the optimal rates for distribution-free learning of average smoothness classes remained unknown. In particular, the cost of adapting to the smoothness with respect to the underlying distribution (in contrast to using classic worst-case smoothness) remained unclear so far.

Our contributions. In this work, we generalize the aforementioned notion of average Lipschitz smoothness by extending it to Hölder smoothness of any exponent β (0, 1]. After formally defining the average Hölder smoothness of a function with respect to a distribution in Section 2.1, our contributions can be summarized as follows:

Bracketing numbers upper bound (Theorem 3.1). We establish a nearly-optimal distribution-free bound on the bracketing entropy of our proposed average-smooth function class, serving as the main crux on which we base our analyses throughout the paper. In particular, although it is known that asymptotically empirical covering numbers yield sharper bounds than bracketing numbers,1 in the case of average smoothness we reveal that the latter are tight up to a logarithmic factor.

Realizable sample complexity (Theorem 3.4). We derive a nearly-optimal sample complexity required for uniform convergence of average-Hölder functions in the realizable case, which was not previously known even in the special case of average Lipschitz functions.

Optimal realizable learning algorithm (Theorem 4.1). Since our notion of average smoothness is defined with respect to the unknown sampling distribution, the learner does not have an explicit representation of the function class, and hence is unable to execute ERM.2 We are able to overcome this obstacle by constructing a realizable nonparametric regression algorithm with a nearly-optimal learning rate. Such a rate was not previously known even in the special case of average Lipschitz smoothness.

Agnostic learning algorithm (Theorem 5.1). We provide yet another learning algorithm for the fully agnostic (i.e. noisy) regression setting. Once again, our derived rate was not previously known even in the special case of average Lipschitz smoothness.

Matching lower bound (Theorem 6.1). We prove a lower bound, showing that all the results mentioned above are tight up to logarithmic factors in the realizable case, establishing the (nearly) minimax risk rate for average-smooth classes.

Illustrative comparisons (Section 7). Finally, we illustrate the extent to which the proposed smoothness notion is sharper than previously studied notions. We provide examples in which

1Uniform convergence of the L1(D) distance between the upper and lower bracket functions implies that, in the limit of sample size, the ε-bracket functions are almost surely an empirical (1 + o(1))ε-cover; see Section 2 for a reminder of relevant definitions. 2Indeed, the learner cannot know for sure whether any given non-classically Hölder function belongs to the average-Hölder class.

the optimistic average-Hölder constant is infinitely apart from both its pessimistic worst-case counterpart, or even the average-Lipschitz (β = 1) constant, exemplifying the substantial (possibly infinite) speed-ups in terms of learning rates.

1.1 Related work.

The sample complexities associated to distribution-free learning of (classic) Hölder classes is well covered in the literature, see for example the books by Györfi et al. [2002], Tsybakov [2008].

Previous notions of average smoothness include Bounded Variation (BV) [Appell et al., 2014] in dimensions one and higher [Kuipers and Niederreiter, 1974, Niederreiter and Talay, 2006]. Onedimensional BV has found learning-theoretic applications [Bartlett et al., 1997, Long, 2004, Anthony and Bartlett, 1999], but to our knowledge the higher-dimensional variants have not. Moreover, the positive results require µ to be uniformly distributed on a segment, and the aforementioned results break down for more general measures especially if µ is not known to the learner.

Sobolev spaces, and the Sobolev-Slobodetskii norm in particular [Agranovich, 2015], bear some resemblance to our average Hölder smoothness. However, Ashlagi et al. [2021, Appendix I] demonstrate that from a learning-theoretic perspective this notion is inadequate for general (i.e., non-uniform or Lebesgue) measures, as it cannot be used to control sample complexity. Results for controlling bracketing in terms of various measures of average smoothness include Nickl and Pötscher [2007], who bound the bracketing numbers of Besovand Sobolev-type classes and Malykhin [2010], who used the averaged modulus of continuity developed by Sendov and Popov [1988]; again, these are all defined under the Lebesgue measure. While it is easy to define these smoothness notions with respect to arbitrary distributions, we are not aware of any existing work to bound their corresponding sample complexity (or even their covering or bracketing numbers) in a distribution-independent manner. Moreover, the smoothness notion studied in this paper is defined over arbitrary metric spaces, whereas previous notions are typically restricted to Euclidean structures (or variants thereof). Despite of the considerable generality of our setting, we are able to provide tight bounds for all metric spaces alike, without requiring specialized analyses.

A seminal work on recovering functions with spatially inhomogeneous smoothness from noisy samples is Donoho and Johnstone [1998]. Arguably in the spirit of µ-dependent Hölder smoothness, some of the classic results on k-NN risk decay rates were refined by Chaudhuri and Dasgupta [2014] via an analysis that captures the interplay between the metric and the sampling distribution. Another related notion is that of Probabilistic Lipschitzness (PL) [Urner and Ben-David, 2013, Urner et al., 2013, Kpotufe et al., 2015], which seeks to relax a hard Lipschitz condition on the regression function. While PL is in the same spirit as our notion, one critical distinction from our work is that, while existing analyses of learning under PL have focused specifically on binary classification, our interest in the present work is learning real-valued functions.

As previously mentioned, the main feature setting this work apart from others studying regression under average smoothness is that our notion is defined with respect to a general, unknown measure µ. The notable exception is, of course, Ashlagi et al. [2021] who introduced the framework of efficiently learning smooth-on-average functions with respect to an unknown distribution. Although extending their definition from Lipschitz to Hölder average smoothness was straightforward, optimal minimax rates are likely inaccessible via their techniques, which relied on empirical covering numbers. Estimating the magnitude of these random objects was a formidable challenge, and Ashlagi et al. were only able to do so to within an additive error decaying with sample size; this sampling noise appears to present an inherent obstruction to optimal rates. Thus, our results required a novel technique to overcome this obstruction, which we did by tightly controlling the bracketing entropy. Our Höldertype extension is a direct adaptation of the Pointwise Minimum Slope Extension (PMSE) developed for the Lipschitz special case by Ashlagi et al., which in turn is closely related to the one introduced by Oberman [2008].

2 Preliminaries.

Setting. Throughout the paper we consider functions f : Ω [0, 1] where (Ω, ρ) is a metric space. We will consider a distribution D over Ω [0, 1] with marginal µ over Ω, such that (Ω, ρ, µ) forms a metric probability space (namely, µ is supported on the Borel σ-algebra induced by ρ). We associate

to any measurable function f : Ω [0, 1] its L1 risk LD(f) := E(X,Y ) D |f(X) Y |, and its empirical risk LS(f) := 1

n Pn i=1 |f(Xi) Yi| with respect to a sample S = (Xi, Yi)n i=1 Dn. More generally, we associate to any measurable function its L1 norm f L1(µ) := EX µ |f(X)|, and given a sample (X1, . . . , Xn) µn we denote its L1 norm with respect to the empirical measure f L1(µn) := 1

n Pn i=1 |f(Xi)|.

We say that a distribution D over Ω [0, 1] is realizable by a function class F [0, 1]Ωif there exists an f F such that LD(f ) = 0. Thus, f (X) = Y almost surely, where (X, Y ) D.

Metric notions. The diameter of A Ωis diam(A) := supx,x A ρ(x, x ), and we denote by B(x, r) := {x Ω: ρ(x, x ) r} the closed ball around x Ωof radius r > 0. For t > 0, A, B Ω, we say that A is a t-cover of B if B S a A B(a, t), and define the t-covering number NB(t) to be the minimal cardinality of any t-cover of B. We say that A B Ωis a t-packing of B if ρ(a, a ) t for all a = a A. We call V a t-net of B if it is a t-cover and a t-packing. The induced Voronoi partition of B with respect to a net V is its partitioning into subsets sharing the same nearest neighbor in V (with ties broken in some consistent arbitrary manner). A metric space (Ω, ρ) is said to be doubling if there exists d N such that every r-ball in Ωis contained in the union of some d r/2-balls. The doubling dimension is defined as mind 1 log2 d where the minimum is taken over d satisfying the doubling property.

Bracketing. Given any two functions l, u : Ω [0, 1], we say that f : Ω [0, 1] belongs to the bracket [l, u] if l f u. A set of brackets B is said to cover a function class F if any function in F belongs to some bracket in B. We say that [l, u] is a t-bracket with respect to a norm if u l t. The t-bracketing number N[ ](F, , t) is defined as the minimal cardinality of any set of t-brackets that covers F. The logarithm of this quantity is called the bracketing entropy.

Remark 2.1 (Covering vs. bracketing). Having recalled two notions that quantify the size of a normed function space (F, ) namely, its covering and bracketing numbers it is useful to note they are related through

NF(ε) N[ ](F, , 2ε) , (1)

though no converse inequality of this sort holds in general. On the other hand, the main advantage of using bracketing numbers for generalization bounds is that it suffices to bound the ambient bracketing numbers with respect to the distribution-specific metric, as opposed to the empirical covering numbers which are necessary to guarantee generalization [van der Vaart and Wellner, 1996, Section 2.1.1].

Strong and weak mean. For any non-negative random variable Z we define its weak mean by W[Z] := supt>0 t Pr[Z t], and note that W[Z] E[Z] by Markov s inequality. In the special case where Z has finite support of size N 3 where each atom has mass 1/N we have the reverse inequality E[Z] 2 log(N)W[Z] [Ashlagi et al., 2021, Lemma 22].

2.1 Average smoothness.

For β (0, 1] and f : Ω R, we define its β-slope at x Ωto be Λβ f(x) := supy Ω\{x} |f(x) f(y)|

Recall that f is called β-Hölder continuous if f H olβ := supx ΩΛβ f(x) < , with this quantity serving as its Hölder seminorm. In particular, when β = 1 these are exactly the Lipschitz functions equipped with the Lipschitz seminorm. For a metric probability space (Ω, ρ, µ), we consider the average β-slope to be the mean of Λβ f(X) where X µ. Namely, we define

Λ β f(µ) := EX µ[Λβ f(X)] ,

eΛβ f(µ) := WX µ[Λβ f(X)] = sup t>0 t µ(x : Λβ f(x) t) .

eΛβ f(µ) Λ β f(µ) f H olβ , (2)

where each subsequent pair can be infinitely apart as we demonstrate in Section 7. Having defined notions of averaged smoothness, we can further define their corresponding function spaces

H olβ L(Ω) := {f : Ω [0, 1] : f H olβ L} ,

H ol β L(Ω, µ) := n f : Ω [0, 1] : Λ β f(µ) L o ,

g H ol β L(Ω, µ) := n f : Ω [0, 1] : eΛβ f(µ) L o .

We occasionally omit µ when it is clear from context. Note that H olβ L(Ω) H ol β L(Ω, µ) g H ol β L(Ω, µ) due to Eq. (2), where both containments are strict in general. The special case of β = 1 recovers the average-Lipschitz spaces Lip L(Ω) Lip L(Ω, µ) g Lip L(Ω, µ) studied by Ashlagi et al. [2021].

3 Generalization bounds

Our first goal is to bound the bracketing entropy (namely, the logarithm of the bracketing number) of average-Hölder classes. We present this bound in full generality in terms of the underlying metric space, as captured by its covering number (see Corollary 3.5 for the typical scaling of covering numbers). As we will soon establish, this bound implies nearly-tight generalization guarantees in terms of the average smoothness constant. Theorem 3.1. For any metric probability space (Ω, ρ, µ), any β (0, 1] and any 0 < ϵ < L, it holds

log N[ ](H ol β L(Ω, µ), L1(µ), ε) log N[ ](g H ol β L(Ω, µ), L1(µ), ε)

ε 128L log(1/ε)

log 16 log2(1/ε)

Crucially, the bound above does not depend on µ, allowing us to obtain distribution-free generalization guarantees. We defer the proof of Theorem 3.1 to Appendix B.1. We start by showing that bounding the bracketing entropy implies a generalization bound in the realizable case: Proposition 3.2. Suppose (Ω, ρ) is a metric space, F [0, 1]Ωis a function class, and let D be a distribution over Ω [0, 1] which is realizable by F, with marginal µ over Ω. Then with probability at least 1 δ over drawing a sample S Dn it holds that for all f F :

LD(f) 1.01LS(f) + inf α 0

α + 205 log N[ ](F, L1(µ), α)

+ 205 log(1/δ)

Remark 3.3 (Constant is arbitrary). In Proposition 3.2 and in what follows, the constant multiplying LS(f) is arbitrary, and can be replaced by (1 + γ) for any γ > 0 at the expense of multiplying the remaining summands by γ 1. In the next section we will provide a realizable regression algorithm that returns an approximate empirical risk minimizer f for which LS(f) 0, thus this constant will not matter for our purposes.

We prove Proposition 3.2 in Appendix B.2. By combining Theorem 3.1 with Proposition 3.2 and setting α = ε/2, we obtain the following realizable sample complexity result. Theorem 3.4. For any metric space (Ω, ρ), any β (0, 1] and any 0 < ϵ < L, let D be a distribution

over Ω [0, 1] realizable by g H ol β L(Ω, µ). Then there exists N = N(β, ε, δ) N satisfying

ε 256L log(1/ε) 1/β + log(1/δ)

such that as long as n N, with probability at least 1 δ over drawing a sample S Dn it holds

that for all f g H ol β L(Ω, µ) : LD(f) 1.01LS(f) + ε .

The same claim holds for the smaller class H ol β L(Ω, µ).

Corollary 3.5 (Doubling metrics). In most cases of interest, (Ω, ρ) is a doubling metric space of some dimension d,3 e.g. when Ωis a subset of Rd (or more generally a d-dimensional Banach space). For d-dimensional doubling spaces of finite diameter we have NΩ(ε) 1

ε d [Gottlieb et al., 2016, Lemma 2.1], which, plugged into Theorem 3.4, yields the simplified sample complexity bound

N = e O Ld/β

or equivalently

LD(f) 1.01LS(f) + e O Ld/(d+β)

up to a constant which depends (exponentially) on d, but is independent of L, n.

Remark 3.6 (Tightness). The bounds in Theorem 3.1 and Theorem 3.4 are both tight up to logarithmic factors, as we will prove in Section 6.

4 Realizable learning algorithm

Recall that without knowing µ, the underlying distribution over Ω, we cannot know for sure whether a function f belongs to H ol β L(Ω, µ) (except for the trivial case f H olβ L(Ω)). This gives rise to the challenge of designing a fully empirical algorithm since standard empirical risk minimization is not possible. To that end, we provide the following algorithmic result with optimal guarantees (up to logarithmic factors).

Theorem 4.1. For any metric space (Ω, ρ), any β (0, 1] and any 0 < ϵ < L, let D be a distribution over Ω [0, 1] realizable by H ol β L(Ω, µ). Then there exists a polynomial time learning algorithm A, which, given a sample S Dn of size n N for some N = N(β, ε, δ) N satisfying

ε 256L log(1/ε) 1/β + log(1/δ)

constructs a hypothesis f = A(S) such that LD(f) ε with probability at least 1 δ.

Remark 4.2 (Doubling metrics). As mentioned in Corollary 3.5, in most cases of interest we have NΩ(ε) 1

ε d. In that case, the algorithm above has sample complexity

N = e O Ld/β

or equivalently

LD(f) = e O Ld/(d+β)

up to a constant which depends (exponentially) on d, but is independent of L, n.

Remark 4.3 (Computational complexity). The algorithm constructed in Theorem 4.1 involves a one-time preprocessing step after which f(x) can be evaluated at any given x Ωin O(n2) time. We note that the computation at inference time matches that of (classic) Lipschitz/Hölder regression (e.g. Gottlieb et al., 2017). Furthermore, the computational complexity of the preprocessing step is similar to that in Ashlagi et al. [2021, Theorem 7] for the average Lipschitz case, where it is shown to run in time e O(n2). The complexity analysis of our prepossessing step is entirely analogous to theirs, and we forgo repeating it here.

We will now outline the proof of Theorem 4.1, which appears in Appendix B.3. The key idea is to analyze a natural fully-empirical quantity that will serve as an estimator of the true unknown average

3Namely, any ball of radius r > 0 can be covered by 2d balls of radius r/2.

smoothness. To that end, given a sample S = (Xi, Yi)n i=1 Dn and a function f : Ω [0, 1], consider the following quantity which can be established directly from the data:

i=1 sup Xj =Xi

|f(Xi) f(Xj)|

ρ(Xi, Xj)β .

Namely, this is the empirical average smoothness with respect to the sampled points. It would suit us well if the empirical average smoothness of a function did not greatly exceed its true average smoothness, with high probability. The fact something like this turns out to be true is somewhat surprising and may be of independent interest:

Proposition B.1. (Informal) Let f : Ω [0, 1]. Then with high probability bΛβ f Λ β f .

The proposition above implies that restricting to the sample, and letting bf(Xi) := Yi yields a function over {X1, . . . , Xn} which is empirically average-smooth over the sample (with high probability). We then turn to show that any such function can be approximately extended to the whole space, in a way that guarantees its average smoothness with respect to the underlying distribution.

Proposition B.3. (Informal) Let bf : {X1, . . . , Xn} [0, 1] where (Xi)n i=1 µn. Then it is possible to construct f : Ω [0, 1] such that with high probability f(Xi) bf(Xi) for all i [N], and Λ β f(µ) bΛβ b f .

We will now sketch the procedure described in Proposition B.3, which serves as the main challenge in proving Theorem 4.1. Roughly speaking, the algorithm sorts the sampled points with respect to their relative slope to one another. Then, it discards a fraction of the sampled points with largest relative slope, which can be thought of as outliers . Then, the algorithm proceeds to extend the function in a smooth fashion among the remaining well-behaved samples. A careful probabilistic analysis shows that disregarding just the right amount of samples induces small error, while being average-smooth with high probability.

Overall this procedure yields a function f : Ω [0, 1] which is an approximate empirical-minimizer (since f(Xi) bf(Xi) = Yi), while guaranteed to be averagely-smooth with respect to the unknown distribution. Thus we can apply the uniform convergence of Theorem 3.4, proving Theorem 4.1.

5 Agnostic learning algorithm

Noticeably, up to this point, both the uniform convergence result we derived (Theorem 3.4) as well as the algorithmic result (Theorem 4.1) are tailored for the realizable regression setting. Inspired by a recent result of Hopkins et al. [2022] that showed a reduction from agnostic learning to realizable learning, we provide an algorithm for agnostic (i.e. noisy) regression of average-smooth functions. It is worth noting that the following algorithm does not require any prior assumption on the noise model, unlike many nonparametric regression methods, due to our distribution free analysis. Theorem 5.1. There exists a learning algorithm A such that for any metric space (Ω, ρ), any β (0, 1], 0 < ϵ < L, and any distribution D over Ω [0, 1], given a sample S Dn of size n N for some N = N(β, ε, δ) satisfying

ε 640L log(1/ε) 1/β + log(1/δ)

the algorithm constructs a hypothesis f = A(S) such that LD(f) inff H ol β L(Ω,µ) LD(f ) + ϵ with probability at least 1 δ. Remark 5.2 (Doubling metrics). As mentioned in Corollary 3.5, in most cases of interest we have NΩ(ε) 1

ε d. In that case, the algorithm above has sample complexity

N = e O Ld/β

or equivalently

LD(f) = inf f H ol β L(Ω,µ) LD(f ) + e O Ld/(d+2β)

up to a constant which depends (exponentially) on d, but is independent of L, n.

Though our agnostic algorithm is similar in spirit to that obtained by the reduction of Hopkins et al. [2022], our analysis is self-contained and crucially relies on the bracketing bound given by Theorem 3.1, as well as analyzing the empirical smoothness estimator as provided by Proposition B.1. We also note that unlike our algorithm for realizable learning, the agnostic algorithm is not computationally efficient. This seems to be inherent for such reductions, and we do not know whether this blow-up in running time can be avoided or not.

We will now describe the proof of Theorem 5.1 which appears in Appendix B.4. Given a sample S of size n, consider dividing it into two sub-samples S1 S2 = S of size n/2 each. We first use S1 in order to construct an empirical ϵ-net h1, . . . , h N : S1 [0, 1], namely a set of functions which are sufficiently empirically smooth over the sample, yet far away enough from one another when averaged over the sample. Recalling that bracketing numbers upper bound covering numbers (Eq. (1)), and since Theorem 3.1 holds true for every measure (in particular for the empirical measure), we can bound log N NΩ((ϵ/L)1/β). Moreover, using Proposition B.1 we know that f := arg minf H ol β L(Ω,µ) LD(f) is likely to be e O(L) average-smooth over S1, so there must exist some hj with ϵ excess empirical loss (since f is in the class we are ϵ-covering). Thus running the realizable algorithm of Theorem 4.1 over all {h1, . . . , h N}, producing f1, . . . , f N : Ω [0, 1], yields at least one function which has both small excess empirical error, while being smooth with respect to the underlying distribution. Finally, running ERM over {f1, . . . , f N} with respect to the fresh sample S2 reveals such a good candidate function within log(N)+log(1/δ)

ϵ2 samples by applying standard uniform convergence for finite classes (i.e. Hoeffding s inequality with the union bound).

6 Lower bound

We now turn to show that the bounds proved in Theorem 3.1, Theorem 3.4 and Theorem 4.1 are all tight up to logarithmic factors. In fact, since the bracketing entropy bound of Theorem 3.1 implies the generalization bound of Theorem 3.4 and the latter implies the sample complexity in Theorem 4.1, it is enough to show that the latter is nearly optimal.

Theorem 6.1. For any β (0, 1], ε (0, 1) any metric space (Ω, ρ) and L 8 diam(Ω), there exists

a distribution D over Ω [0, 1] which is realizable by H ol β L(Ω) such that any learning algorithm that produces f = A(S) with LD(f) ε with constant probability, must have sample complexity

n = Ω NΩ((ε/L)1/β)

Remark 6.2 (Typical case). In most cases of interest it holds that NΩ(ε) 1

ε d for some constant d, e.g. when Ωis a subset of non-empty interior in Rd (or more generally in any d-dimensional Banach space).4 That being the case, Theorem 6.1 yields the simplified sample complexity lower bound of

Equivalently, we obtain an excess risk lower bound of

LD(f) = Ω Ld/(d+β)

We will now provide a proof sketch of Theorem 6.1, while the full proof appears in Appendix B.5. Suppose K Ωis a (ε/L)1/β-net of most of Ω, yet x0 Ωis some isolated point at constant

4Note that assuming a subset has nonempty interior implies that it cannot be isometrically embedded to a lower dimensional space. Hence, this d encapsulates the true intrinsic metric dimension.

distance away from K (we show that such x0, K always exist). Let µ be the measure that assigns 1 ε probability mass to x0, while the rest of the probability mass is distributed uniformly over K. Now consider a (random) function that independently assigns either 0 or 1 to each point in K uniformly, and is constant over x0. Since points in K are (ε/L)1/β away from one another, the local β-slope at each point in K is roughly 1/((ε/L)1/β)β = L/ε, while the slope at x0 is small since it is far enough from other points. Averaging over the space with respect to µ, we see that the function is µ(K) L/ε = L average-Hölder. Now, we imitate the standard lower bound proof for VC classes over K: Since any point in K is sampled with probability ε/|K|, any learning algorithm with much fewer than |K|/ε NΩ((ε/L)1/β)/ε examples will guess wrong a large portion of K, suffering L1-loss of at least order of µ(K) = ε.

7 Illustrative examples

Having established the control that average-Hölder smoothness has on generalization, we illustrate the vast possible gap between the average smoothness and it s worst-case classic counterpart. Indeed, in the examples we provide, the gap is infinite. Moreover, we also show that classes of average-Hölder smoothness are significantly richer than the previously studied average-Lipschitz, motivating the more general Hölder framework considered in this work. Finally, it is illuminating to notice that both claims to follow actually consist of the same simple function f(x) = 1[x > 1

2] though with respect to different distributions, emphasizing the crucial role of the underlying distribution in terms of establishing the function classes.

Claim 7.1. For any L > 0, β (0, 1), there exist f : Ω [0, 1] and a probability measure µ such that

f is average-Hölder: f H ol β L(Ω, µ) .

f is not Hölder with any finite Hölder constant: For all M > 0 : f / H olβ M(Ω) .

f is not (even) weakly-average-Lipschitz with any finite modulus: For all M > 0 : f / g Lip M(Ω, µ) .

Thus, H ol β L(Ω, µ) S M=0 H olβ M(Ω) g Lip M(Ω, µ) .

Claim 7.2. For any L > 0, β (0, 1), there exist f : Ω [0, 1] and a probability measure µ such that

f is weakly-average-Hölder: f g H ol β L(Ω, µ) .

f is not strongly-average-Hölder with any finite modulus: For all M > 0 : f / H ol β M(Ω) .

f is not (even) weakly-average-Lipschitz with any finite modulus: For all M > 0 : f / g Lip M(Ω, µ) .

Thus, g H ol β L(Ω, µ) S M=0

H ol β M(Ω, µ) g Lip M(Ω, µ) .

We prove both of the claims above in Appendix B.6.

8 Discussion

In this work, we have defined a notion of an average-Hölder smoothness, extending the average Lipschitz one introduced by Ashlagi et al. [2021]. Using proof techniques based on bracketing numbers, we have established the minimax rate for average-smoothness classes in the realizable setting with respect to the L1 risk up to logarithmic factors, and have provided a nontrivial learning algorithm that attains this nearly-optimal learning rate. Moreover, we have also provided yet another learning algorithm for the agnostic setting. All of these results improve upon previously known rates even in the special case of average-Lipschitz classes.

A few notes are in order. First, the choice of focusing on L1 risk as opposed to general Lp losses is merely a matter of conciseness, as to avoid introducing additional parameters. Indeed, the only place throughout the proofs which we use the L1 loss is in the proof of Proposition 3.2, where we show that the loss-class LF := {x 7 |f(x) f (x)| : f F} satisfies

N[ ](LF, L1(µ), α) N[ ](F, L1(µ), α) .

It is easy to show via essentially the same proof that for any p [1, ), the Lp-composed lossclass satisfies N[ ](LF, L1(µ), α) N[ ](F, L1(µ), α1/p), and the remaining proofs can be invoked verbatim. This yields a realizable sample complexity (in the typical, d-dimensional case) of order N = e O Ld/pβ

ε(d+pβ)/pβ , or equivalently Lp-risk decay rate of LD(f) = e O Ld/(d+pβ)

npβ/(d+pβ) which are also easily translatable to their corresponding agnostic rates.

Focusing again on L1 minimax rates of average-Hölder classes, it is interesting to compare them to the minimax rates of classic (i.e., worst-case) Hölder classes. Schreuder [2020] has shown the minimax risk to be of order n β/d, whereas we showed the average-smooth case has the slightly worse rate of n β/(d+β) (which cannot be improved, due to our matching lower bound). However, comparing the rates alone is rather misleading, since both risks are multiplied by a factor depending on their corresponding Hölder constant, which can be considerably smaller in the average-case result. Still, it is interesting to note that in the asymptotic regime there is a marginal advantage in case the learned function is worst-case Hölder, as opposed to Hölder on average.

Our work leaves open several questions. A relatively straightforward one is to compute the minimax rates and construct an optimal algorithm for the classification setting, which is not addressed by this paper. Moreover, there is a slight mismatch between our established upper and lower bounds in the agnostic setting, ranging between e O(n β/(d+2β)) and Ω(n β/(d+β)). Closing this gap is an interesting problem which we leave for future work.

Mikhail S. Agranovich. Sobolev spaces, their generalizations and elliptic problems in smooth and Lipschitz domains. Springer Monographs in Mathematics. Springer, Cham, 2015. ISBN 978-3319-14647-8; 978-3-319-14648-5. doi: 10.1007/978-3-319-14648-5. URL https://doi.org/ 10.1007/978-3-319-14648-5. Revised translation of the 2013 Russian original.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 0-521-57353-X. doi: 10.1017/CBO9780511624216. URL http://dx.doi.org/10.1017/CBO9780511624216.

Jürgen Appell, Józef Bana s, and Nelson Merentes. Bounded variation and around, volume 17 of De Gruyter Series in Nonlinear Analysis and Applications. De Gruyter, Berlin, 2014. ISBN 978-3-11-026507-1; 978-3-11-026511-8.

Yair Ashlagi, Lee-Ad Gottlieb, and Aryeh Kontorovich. Functions with average smoothness: structure, algorithms, and learning. In Conference on Learning Theory, pages 186 236. PMLR, 2021.

Peter L. Bartlett, Sanjeev R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Trans. Information Theory, 43(5):1721 1724, 1997. doi: 10.1109/18.623181. URL https://doi.org/10.1109/18.623181.

Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.

David L. Donoho and Iain M. Johnstone. Minimax estimation via wavelet shrinkage. Ann. Statist., 26 (3):879 921, 1998. ISSN 0090-5364. doi: 10.1214/aos/1024691081. URL https://doi.org/ 10.1214/aos/1024691081.

Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge university press, 2021.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105 118, 2016.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient regression in metric spaces via approximate lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838 4849, 2017. doi: 10.1109/TIT.2017.2713820.

László Györfi, Michael Kohler, Adam Krzy zak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Series in Statistics. Springer-Verlag, New York, 2002. ISBN 0-387-95441-4. doi: 10.1007/b97848. URL http://dx.doi.org/10.1007/b97848.

Max Hopkins, Daniel M Kane, Shachar Lovett, and Gaurav Mahajan. Realizable learning is all you need. In Conference on Learning Theory, pages 3015 3069. PMLR, 2022.

Michael J Kearns and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.

Samory Kpotufe, Ruth Urner, and Shai Ben-David. Hierarchical label queries with data-dependent partitions. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pages 1176 1189. JMLR.org, 2015. URL http:// proceedings.mlr.press/v40/Kpotufe15.html.

L. Kuipers and H. Niederreiter. Uniform distribution of sequences. Wiley-Interscience [John Wiley & Sons], New York-London-Sydney, 1974. Pure and Applied Mathematics.

Philip M. Long. Efficient algorithms for learning functions with bounded variation. Inf. Comput., 188 (1):99 115, 2004. doi: 10.1016/S0890-5401(03)00164-0. URL https://doi.org/10.1016/ S0890-5401(03)00164-0.

Yu. V. Malykhin. Averaged modulus of continuity and bracket compactness. Mat. Zametki, 87(3): 468 471, 2010. ISSN 0025-567X. doi: 10.1134/S0001434610030181. URL https://doi.org/ 10.1134/S0001434610030181.

Richard Nickl and Benedikt M. Pötscher. Bracketing metric entropy rates and empirical central limit theorems for function classes of Besovand Sobolev-type. J. Theoret. Probab., 20(2):177 199, 2007. ISSN 0894-9840. doi: 10.1007/s10959-007-0058-1. URL https://doi.org/10.1007/ s10959-007-0058-1.

Harald Niederreiter and Denis Talay, editors. Monte Carlo and quasi-Monte Carlo methods 2004, 2006. Springer-Verlag, Berlin. ISBN 978-3-540-25541-3; 3-540-25541-9. doi: 10.1007/3-540-31186-6. URL https://doi.org/10.1007/3-540-31186-6.

Adam M. Oberman. An explicit solution of the Lipschitz extension problem. Proc. Amer. Math. Soc., 136(12):4329 4338, 2008. ISSN 0002-9939. doi: 10.1090/S0002-9939-08-09457-4. URL https://doi.org/10.1090/S0002-9939-08-09457-4.

Nicolas Schreuder. Bounding the expectation of the supremum of empirical processes indexed by hölder classes. Mathematical Methods of Statistics, 29(1):76 86, 2020.

Blagovest Sendov and Vasil A. Popov. The averaged moduli of smoothness. Pure and Applied Mathematics (New York). John Wiley & Sons, Ltd., Chichester, 1988. ISBN 0-471-91952-7. Applications in numerical methods and approximation, A Wiley-Interscience Publication.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.

Ruth Urner and Shai Ben-David. Probabilistic Lipschitzness: A niceness assumption for deterministic labels. In Learning Faster from Easy Data Workshop NIPS, 2013.

Ruth Urner, Sharon Wullf, and Shai Ben-David. PLAL: Cluster-based active learning. In Conference on Learning Theory, 2013.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

A Minimal β-slope Hölder extension

In this section we describe a procedure that extends Hölder functions in an optimally smoothest manner at every point, as it will serve as a crucial ingredient in our proofs. That is, given a subset of a metric space A Ωand a function f : Ω [0, 1], it produces FA : Ω [0, 1] such that

1. It extends f|A : FA|A = f|A.

2. For any e F : Ω [0, 1] that extends f|A, it holds that Λβ FA(x) Λβ e F (x) for all x Ω.

Such a procedure was described for Lipschitz extensions (namely when β = 1) in Ashlagi et al. [2021]. The purpose of this section is to generalize this procedure to any Hölder exponent.

Throughout this section we fix β (0, 1], = A Ωand f : Ω [0, 1], and will always assume the following. Assumption A.1. f|A H olβ < and diam(A) < .

Keeping in mind that the case we are really interested in is when A is finite (i.e. a sample), the conditions above are trivially satisfied. Nonetheless, everything we will present continues to hold in this more general setting. For u, v A we introduce the following notation:

Rx(u, v) := f(v) f(u) ρ(x, v)β + ρ(x, u)β ,

Fx(u, v) := f(u) + Rx(u, v)ρ(x, u)β , R x := sup u,v A Rx(u, v) ,

Wx(ε) := {(u, v) A A : Rx(u, v) > R x ε} , 0 < ε < R x Φx(ε) := {Fx(u, v) : (u, v) Wx(ε)} .

Definition A.2. We define the β-pointwise minimal slope extension (β-PMSE) to be the function FA : Ω R satisfying FA(x) := lim ε 0+ Φx(ε) .

In the degenerate case in which f(u) = f(v) for all u, v A, define FA(x) := f(u) for some (and hence any) u A. Theorem A.3. Let = A Ω, f : Ω [0, 1], such that Assumption A.1 holds. Then FA : Ω [0, 1] is well defined, and satisfies for any x Ω: Λβ FA(x) Λβ f(x). Furthermore, if A is finite, then FA(x) can be computed for any x Ωwithin O(|A|2) arithmetic operations. Remark A.4. When Rx( , ) has a unique maximizer (u x, v x) A A, the definition of FA simplifies to

FA(x) = f(u x) + ρ(x, u x)β

ρ(x, u x)β + ρ(x, v x)β (f(v x) f(u x)) . (3)

We conclude that under Assumption A.1, we can assume without loss of generality that for each x Ωthere is such a unique maximizer (since the function is well defined, thus does not depend on the choice of the maximizer). Furthermore, this readily shows that when A is finite, we can compute FA(x) for any x Ωwithin O(|A|2) arithmetic operations simply by finding this maximizer.

Proof. (of Theorem A.3)

We will assume that there exist u, v A such that f(u) = f(v), since the degenerate (constant extension) case is trivial to verify. This assumption implies that R x > 0. It is also easy to verify that supx ΩR x < f H olβ < .

Lemma A.5. FA is well defined. Namely, under Assumption A.1 the limit limε 0+ Φx(ε) [0, 1] exists.

Proof. Fix x Ω(we will omit the x subscripts from now on). Let ε < R /2, (u, v), (u , v ) W(ε). Note that R(u, v) > 0 and that F(u, v) = f(v) R(u, v)ρ(x, v)β. Hence

f(u) F(u, v) f(v) , (4)

and the same clearly holds if we replace (u, v) by (u , v ). Assume without loss of generality that F(u, v) F(u , v ), hence f(u) F(u, v) F(u , v ) f(v ). We get

R(u , v ) + ε > R

f(v ) f(u) ρ(x, v )β + ρ(x, u)β

= f(v ) F(u , v ) + F(u, v) f(u)

ρ(x, v )β + ρ(x, u)β + F(u , v ) F(u, v)

ρ(x, v )β + ρ(x, u)β

= R(u , v )ρ(x, v )β + R(u, v)ρ(x, u)β

ρ(x, v )β + ρ(x, u)β + F(u , v ) F(u, v)

ρ(x, v )β + ρ(x, u)β

R(u , v )ρ(x, v )β + (R(u , v ) ε)ρ(x, u)β

ρ(x, v )β + ρ(x, u)β + F(u , v ) F(u, v)

R(u , v ) ε + F(u , v ) F(u, v)

= |Fx(u, v) Fx(u , v )| 4ε diam(A)β . We conclude that if diam(A) < then limε 0+ Φx(ε) indeed exists.

It remains to prove the optimality of the β-slope. Throughout the proof we will denote for any u = v Ω:

S(u, v) := |FA(u) FA(v)|

and for any point x Ω, subset B Ωand function g : Ω [0, 1] we let

Λβ g(x, B) := sup y B\{x}

|g(x) g(y)|

The proof is split into three claims.

Claim I. x Ω\ A : Λβ FA(x, A) Λβ f(x, A).

Let x Ω\ A, and let (u , v ) A A be its associated maximizer of Rx. Recall Eq. (4) from which we can deduce that FA(u ) FA(x) FA(v ). Also note that a simple rearrangement based on Eq. (3) (and the fact that f and FA agree on A) shows that S(u , x) = Rx(u , v ) = S(x, v ). Furthermore, we claim that Λβ FA(x, A) := supy A\{x} S(x, y) = S(x, u ). If this were not true then we would have S(x, y) > S(x, u ) = S(x, v ) for some y A \ {x, u , v }. Using the mediant inequality, if f(y) f(x) this implies

Rx(u , y) = f(y) f(u ) ρ(x, y)β + ρ(x, u )β = FA(y) FA(x) + FA(x) FA(u )

ρ(x, y)β + ρ(x, u )β > S(x, u ) = Rx(u , v ) ,

while if f(y) < f(x) then

Rx(y, v ) = f(v ) f(y) ρ(x, v )β + ρ(x, y)β = FA(v ) FA(x) + FA(x) FA(y)

ρ(x, v )β + ρ(x, y)β > S(x, v ) = Rx(u , v ) ,

both contradicting the maximizing property of (u , v ) - so indeed Λβ FA(x, A) = S(x, u ) = S(x, v ). In particular, we see that if FA(x) f(x) then

Λβ f(x, A) = sup y A\{x}

|f(y) f(x)|

ρ(y, x)β f(v ) f(x)

ρ(v , x)β FA(v ) FA(x)

ρ(v , x)β = S(x, v ) = Λβ FA(x, A) ,

while if FA(x) < f(x) then

Λβ f(x, A) = sup y A\{x}

|f(x) f(u)|

ρ(x, y)β f(x) f(u )

ρ(x, u )β > FA(x) FA(u )

ρ(x, u )β = S(x, u ) = Λβ FA(x, A) ,

proving Claim I in either case.

Claim II. x Ω\ A : Λβ FA(x, Ω\ A) Λβ FA(x, A), in particular Λβ FA(x, Ω) = Λβ FA(x, A).

It suffices to show that for any x, y Ω\ A :

S(x, y) min{Λβ FA(x, A), Λβ FA(y, A)} ,

since taking the supremum of the left hand side over y Ω\A shows the claim. Let (u x, v x), (u y, v y) the associated maximizers of Rx, Ry respectively, and note that by definition we have

Λβ FA(x, A) = sup z A\{x} S(x, z) max{S(x, u y), S(x, v y)} . (5)

We assume without loss of generality that Λβ FA(x, A) Λβ FA(y, A), and recall that by Eq. (4) we can deduce that FA(u x) FA(x) FA(v x) and FA(u y) FA(y) FA(v y). Now suppose by contradiction that S(x, y) > Λβ FA(x, A). If FA(x) FA(y) then

FA(v y) = FA(x) + ρ(x, y)βS(x, y) + ρ(y, v y)βΛβ FA(y, A)

> FA(x) + ρ(x, y)βΛβ FA(x, A) + ρ(y, v y)βΛβ FA(x, A)

FA(x) + ρ(x, v y)βΛβ FA(x, A) ,

thus S(x, v y) = |FA(x) FA(v y)| ρ(x,v y)β > Λβ FA(x, A) which contradicts Eq. (5). On the other hand, if

FA(x) > FA(y) then

FA(x) = FA(u y) + ρ(u y, y)βΛβ FA(y, A) + ρ(y, x)βS(x, y)

> FA(u y) + ρ(u y, y)βΛβ FA(x, A) + ρ(y, x)βΛβ FA(x, A)

FA(u y) + ρ(u y, x)βΛβ FA(x, A) ,

thus S(x, u y) = |FA(x) FA(u y)| ρ(x,u y)β > Λβ FA(x, A) which contradicts Eq. (5), and proves claim Claim II.

Claim III. x A : Λβ FA(x, Ω) = Λβ FA(x, A) Λβ f(x, Ω).

Let x A. Assume towards contradiction that there exists y / A such that

ΛFA(x, Ω) S(x, y) > Λβ FA(x, A) .

We denote by (u y, v y) A A the maximizer of Ry( , ). Recall that since x A, in the proof of Claim I we showed that S(x, y) S(y, u y) = S(y, v y). If FA(x) FA(y) FA(v y) then

S(x, v y) = FA(v y) FA(x) ρ(v y, x)β FA(v y) FA(y) + FA(y) FA(x)

ρ(v y, y)β + ρ(x, y)β

min{S(y, v y), S(x, y)} = S(x, y) > Λβ FA(x, A) ,

while on the other hand if FA(x) > FA(y) FA(u y) then

S(x, u y) = FA(x) FA(u y) ρ(x, u y)β FA(x) FA(y) + FA(y) FA(u y) ρ(x, y)β + ρ(u y, y)β

min{S(x, y), S(y, u y)} = S(x, y) > Λβ FA(x, A) ,

where in both calculations we used the mediant inequality. Both inequalities above contradict the definition of Λβ FA(x, A), thus proving Claim III.

Combining the ingredients. We are now ready to finish the proof. For x Ω, if x A then Claim III provides the desired inequality. Otherwise, if x Ω\ A then

Λβ FA(x, Ω) Claim II = Λβ FA(x, A) Claim I Λβ f(x, A) Λβ f(x, Ω) .

B.1 Proof of Theorem 3.1

We start by stating a strengthened version of the triangle inequality (also known as the snowflake triangle inequality) which we will use later on. For any β (0, 1], x = y, z Ω:

ρ(x, y)β ρ(x, z)β + ρ(z, y)β . (6)

Indeed, this follows from

ρ(x, z)β + ρ(z, y)β

ρ(x, y)β ρ(x, z)β + ρ(z, y)β

(ρ(x, z) + ρ(z, y))β = ρ(x, z) ρ(x, z) + ρ(z, y)

β + ρ(z, y) ρ(x, z) + ρ(z, y)

ρ(x, z) ρ(x, z) + ρ(z, y)

+ ρ(z, y) ρ(x, z) + ρ(z, y)

Let 0 < ε < 1

4, denote K := log2(1/ε) , ε := 1 (K+1)2K and note that

ε 1 (log2(1/ε) + 2) 2log2(1/ε)+1 = ε 2 (log2(1/ε) + 2) ε 4 log2(1/ε) . (7)

Let N = {x1, . . . , x|N|} be a ε 32L 1/β -net of Ωof size |N| = NΩ

ε 32L 1/β , and let Π =

{C1, . . . , C|N|} be its induced Voronoi partition. We define B = {[lj, uj]}j J [0, 1]Ω [0, 1]Ωto be the pairs of functions constructed as follows:

l, u are both constant over every cell Ci Π, and map each cell to a value in {0, ε

2 , . . . , 1}.

Choose some cells S1 Π such that µ(S

Ci S1 Ci) ε and set for any Ci S1 : l|Ci = 0, u|Ci = 1.

For m = 2, . . . , K choose some unchosen cells Sm Π \ S

j<m Sj such that µ(S

Ci Sm Ci) 2m 1ε and set for any Ci Sm : l|Ci {0, 1 2m , 2 2m , . . . , 2m 2

2m }, , u|Ci = l + 1 2m 1 .

In the remaining cells SK+1 := Π \ S

j K Sj set for any Ci SK+1 :

l|Ci 0, 1 2K+1 , 2 2K+1 , . . . , 2K+1 2

, u|Ci = l + 1

Notice that for any [l, u] B we have

l u L1(µ) = X

Ci |l(x) u(x)|dµ(x) =

Ci |l(x) u(x)|dµ(x)

1 2m 1 dµ(x) =

Ci Sm µ(Ci)

2m 1 = ε (K + 1) = 1

Furthermore, we can bound |B| by noticing that any such l is defined by its values over |N| cells who all belong to {0, ε

2 , ε , . . . , 1}, and that once l is fixed then any associated u has at most K + 1 possible values over each cell since it equals l + 1 2m 1 for some m [K + 1]. Thus

|B| (K + 1) 8

16 log2(1/ε)

N ( ε 128L log(1/ε)) 1/β

where the last inequality uses Eq. (7) and definition of K. In order to finish the proof, in remains to

show that B indeed cover g H ol β L(Ω, µ) as brackets. Namely, that for any f g H ol β L(Ω, µ) there exist

[l, u] B such that l f u. To that end, let f g H ol β L(Ω, µ). Denote

Sf 1 := Ci Π : x Ci : Λβ f(x) L

and notice that S{Ci Sf 1 } {x : Λβ f(x) L/ε } = µ(S{Ci Sf 1 }) ε . Hence we can pick [l, u] B such that (l|Ci, u|Ci) (0, 1) for any Ci Sf 1 (serving as S1 in their construction). Clearly any such l, u bound f over these cells. Furthermore, for m = 2, . . . , K we denote

j<m Sf j : x Ci : Λβ f(x) L 2m 1ε

and notice that S{Ci Sf m} {x : Λβ f(x) L/(2m 1ε )} = µ(S{Ci Sf m}) 2m 1ε . Consequently we can let Sf m serve as Sm in the construction of [l, u] B, assuming we will show such a choice can serve as a bracket of f over such cells. Indeed, for any x Ci we have

|f(x) f(zi)| Λβ f(zi) ρ(x, zi)β Eq. (6) L 2m 2ε 2ε

32L = 1 2m+2 ,

which by the triangle inequality shows in particular that for any x, y Ci :

|f(x) f(y)| |f(x) f(zi)| + |f(zi) f(y)| 1 2m+1 = 1 4 2m 1 .

So clearly there exists αi {0, 1 2m , 2 2m , . . . , 2m 2

2m } such that αi f|Ci αi+ 1 2m 1 , and by setting l|Ci, u|Ci = (αi, αi + 1 2m 1 ) for any Ci Sf m we ensure the bracketing property. Finally, for any of the remaining cells Sf K+1 := Π \ S j K Sf j we get by construction that zi Ci : Λβ f(zi) < L 2Kε (otherwise they would satisfy the condition for some previously constructed Sf m). Hence for any x Ci we have

|f(x) f(zi)| Λβ f(zi) ρ(x, zi)β Eq. (6) L 2Kε 2ε

32L = 1 2K+4 ,

which by the triangle inequality shows that for any x, y Ci :

|f(x) f(y)| 1 2K+3 = 1 8 2K .

So as before, there clearly exists αi {0, 1 2K+1 , 2 2K+1 , . . . , 2K+1 2

2K+1 } such that αi f|Ci αi+ 1

2K , and by setting l|Ci, u|Ci = (αi, αi + 1 2K ) for any Ci Sf m we ensure the bracketing property over all of Ω, which finishes the proof.

B.2 Proof of Proposition 3.2

Recalling that the realizability assumption ensures a perfect predictor f F, we start by introducing the loss class LF [0, 1]Ω:

LF = {ℓf(x) := |f(x) f (x)| : f F} .

Fix α > 0. We observe that LF is no larger than F in terms of bracketing entropy, namely

N[ ](LF, L1(µ), α) N[ ](F, L1(µ), α) . (8)

Indeed, suppose we are given an α-bracketing of F denoted by Bα, and denote for any f F by [lf, uf] Bα its associated bracket. Then any ℓf LF is inside the bracket [lℓf , uℓf ] where

lℓf := max{0 , min{lf f , f uf}} ,

uℓf := min{1 , max{uf f , f lf}} .

It is straightforward to verify that uℓf lℓf L1(µ) uf lf L1(µ) α, and clearly the set of all such brackets is of size at most |Bα|, yielding Eq. (8).

Now notice that for any f F :

LD(f) 1.01LS(f) = ℓf L1(µ) 1.01 ℓf L1(µn) α + lℓf L1(µ) 1.01 lℓf L1(µn) ,

sup f F (LD(f) 1.01LS(f)) α + max lℓf ( lℓf L1(µ) 1.01 lℓf L1(µn)) . (9)

In order to bound the right hand side, fix some lℓf , and note that Var(lℓf ) l2 ℓf L1(µ) lℓf L1(µ), since lℓf (x) [0, 1]. Thus by Bernstein s inequality and the AM-GM inequality we get that with probability at least 1 γ :

lℓf L1(µ) lℓf L1(µn) log(1/γ)

2 lℓf L1(µ) log(1/γ)

202 log(1/γ)

n + 1 101 lℓf L1(µ)

= lℓf L1(µ) 1.01 lℓf L1(µn) 205 log(1/γ)

Setting γ = δ/N[ ](F, L1(µ), α) and taking a union bound over lℓf whose number is bounded due to Eq. (8), we see that with probability 1 δ :

max lℓf ( lℓf L1(µ) 1.01 lℓf L1(µn)) 205 log N[ ](F, L1(µ), α) + 205 log(1/δ)

Plugging this back into Eq. (9), and minimizing over α > 0 finishes the proof.

B.3 Proof of Theorem 4.1

Proposition B.1. Let f : Ω [0, 1]. Then with probability at least 1 δ/2 over drawing a sample it holds that bΛβ f 4 log2(4n/δ)Λ β f(µ) + 4 log2(4n/δ)

Corollary B.2. If D is realizable by H ol β L(Ω, µ), then for f : Ω [0, 1] such that LD(f ) = 0 it holds with probability at least 1 δ/2 : bΛβ f 5 log2(4n/δ)L. Hence, bf(Xi) := f (Xi) = Yi satisfies LS( bf) = 0 and bΛβ b f 5 log2(4n/δ)L .

Proof. (of Proposition B.1) Fix f : Ω [0, 1]. Given a sample (Xi)n i=1 µn which induces an empirical measure µn, we get

i=1 sup z =Xi

|f(Xi) f(z)|

ρ(Xi, z)β = E X µn [Λβ f(X)] 2 log(n)WX µn[Λβ f(X)] , (10)

where the last inequality follows from the reversed strong-weak mean inequality for uniform measures. We will now show that with high probability WX µn[Λβ f(X)] WX µ[Λβ f(X)] = eΛβ f. To that end, we denote for any t > 0 : Mf(t) := {x : Λβ f(x) t} Ω, let K := eΛβ f(µ), N := 2 log(4n/δ) log log(4n/δ) and note that

WX µn[Λβ f(X)] = sup t>0 tµn(Mf(t)) (11)

sup 0<t K tµn(Mf(t)) + 2 max j {0,1,...,N 1} 2j Kµn(Mf(2j K)) + sup t 2NK tµn(Mf(t)) .

We will bound all three summands above. We easily bound the first term by

sup 0<t K tµn(Mf(t)) K 1 = eΛβ f(µ) . (12)

For the second term, denote for any t > 0 by M + f (t) Mf(t) a containing set for which 1

n µ(M + f (t)) µ(Mf(t))+ 1

n. We can always assume without loss of generality that such a set exists.5

By the multiplicative Chernoff bound we have for any t, α > 0 :

h µn(M + f (t)) (1 + α)µ(M + f (t)) i eα

(1 + α)1+α ,

hence by the union bound we get with probability at least 1 Neα (1+α)1+α :

max j {0,1,...,N 1} 2j Kµn(Mf(2j K)) max j {0,1,...,N 1} 2j Kµn(M + f (2j K))

(1 + α) max j {0,1,...,N 1} 2j Lµ(M + f (2j K))

(1 + α) max j {0,1,...,N 1} 2j K µ(Mf(2j K)) + 1

(1 + α)eΛβ f(µ) + 1 + α

Letting α = log(4n/δ) 1, by our choice of N = 2 log(4n/δ) log log(4n/δ) we get that with probability at least 1 δ/4 :

2 max j {0,1,...,N 1} 2j Kµn(Mf(2j K)) 2 log(4n/δ)eΛβ f(µ) + 2 log(4n/δ)

In order to bound the last term in Eq. (11), we observe that the empirical measure satisfies for any A Ω: µn(A) < 1

n µn(A) = 0, and that Mf(s) Mf(t) for s > t. Furthermore, by definition of K = eΛβ f(µ) we have µ(Mf(t)) K

t , hence by Markov s inequality

sup s t µn(Mf(s)) = 0 Pr S [µn(Mf(t)) = 0] = Pr S

µn(Mf(t)) 1

For t := 2NK yields Pr S sups 2n K µn(Mf(s)) = 0 n 2N δ

4. Combining this with Eq. (12), Eq. (13) and plugging back into Eq. (11), we get that with probability at least 1 δ/2 :

WX µn[Λβ f(X)] (1+2 log(4n/δ))eΛβ f(µ)+2 log(4n/δ)

n (1+2 log(4n/δ))Λ β f(µ)+2 log(4n/δ)

Recalling Eq. (10), we get overall that

bΛβ f 2 log(n) (1 + 2 log(4n/δ))Λ β f(µ) + 2 log(4n/δ)

Simplifying the expression above finishes the proof.

Proposition B.3. Under the same setting, for any γ > 0 there exists an algorithm that given a sample S Dn and any function bf : S [0, 1], provided that n N for N = e O NΩ(γ)+log(1/δ)

constructs a function f : Ω [0, 1] such that with probability at least 1 δ/2 :

f bf L1(µn) γ(1 + 2bΛβ b f). In particular LS(f) LS( bf) + γ(1 + 2bΛβ b f).

Λ β f(µ) 5bΛβ b f .

5Such a set does not exist only in the case of atoms x0 Ωwith large probability mass µ(x0). If that is the case, consider a copy metric space eΩwith x0 split into two points x1, x2 eΩat distance ε apart and each of mass µ(x0)/2. Any function f : Ω R is extended to ef : eΩ R via ef(x1) = ef(x2) = f(x0). Repeating the split if necessary and taking ε 0, we obtain a space eΩwith all of the relevant properties of Ωbut no atoms of large mass.

Proof. Throughout the proof, we denote for any point x Ω, subset B Ωand function g : B [0, 1] :

Λβ g(x, B) := sup y B\{x}

|g(x) g(y)|

Give the sample S = (Xi, Yi)n i=1, we denote Sx = (Xi)n i=1. Let γ > 0. The algorithm constructs f : Ω [0, 1] as follows:

1. Let Sx(γ) Sx consist of the γn points whose Λ b f( , Sx) values are the largest (with ties broken arbitrarily), and S x(γ) := Sx \ Sx(γ) be the rest.

2. Let A S x(γ) be a γ1/β-net of S x(γ).

3. Define f : Ω [0, 1] to be the β-PMSE extension of bf from A to Ωas defined in Definition A.2 (and analyzed throughout Appendix A).

We will prove that f satisfies both requirements. For the first requirement, since f|A = bf|A and Sx = S x(γ) Sx(γ) we have

f bf L1(µn) := 1

i=1 |f(xi) g(xi)| = 1

x Sx(γ)\A |f(x) bf(x)|+ 1

x S x(γ)\A |f(x) bf(x)| .

The first summand above is bounded by γ since 0 f, bf 1 = |f(x) bf(x)| 1 and |Sx(γ)| γn. In order to bound the second term, we denote by NA : S x(γ) A to be the mapping of each element to its nearest neighbor in the net, and note that ρ(x, NA(x)) γ1/β. Then

x S x(γ)\A |f(x) bf(x)| 1

γ ρ(x, NA(x))β |f(x) bf(x)|

|f(x) bf(NA(x))| + | bf(NA(x)) bf(x)|

ρ(x, NA(x))β

|f(x) f(NA(x))|

ρ(x, NA(x))β + | bf(NA(x)) bf(x)|

ρ(x, NA(x))β

x S x(γ)\A Λβ f(x, A) + Λβ b f(x, A)

[Theorem A.3] 2γ

x S x(γ)\A Λβ b f(x, A)

So overall we get f bf L1(µn) γ + 2γL = γ(1 + 2L) as claimed in the first bullet.

We move on to prove the second bullet. Let U Ωbe a γ1/β

4 -net of Ω, Π be its induced Voronoi partition and let m := |Π| NΩ(γ1/β/4). Let Consider the following partition of Π into light and heavy cells: Πl := {C Π : µn(C) < nγ/m} , Πh := Π \ Πl .

We will now state three lemmas required for the proof, two of which are due to [Ashlagi et al., 2021].

Lemma B.4. Suppose A Ωand that f : Ω [0, 1] is the β-PMSE extension of some function from

A to Ω. Let E Ωsuch that diam(E)β 1

2 minx =x A ρ(x, x )β. Then supx,x E Λβ f (x)

Λβ f (x ) 2.

Proof. Let u x, v x A be the pair of points which satisfy Λβ f(x) = f(v x) f(u x) ρ(v x,x)β+ρ(u x,x)β . By assumption on E, we know that 2diam(E)β ρ(v x, u x)β ρ(v x, x)β + ρ(u x, x)β, hence

ρ(v x, x)β + ρ(u x, x)β + 2diam(E)β 2(ρ(v x, x)β + ρ(u x, x)β). We get

Λβ f(x ) f(v x) f(u x) ρ(v x, x )β + ρ(u x, x )β

f(v x) f(u x) ρ(v x, x)β + diam(E)β + ρ(u x, x)β + diam(E)β

f(v x) f(u x) 2(ρ(v x, x)β + ρ(u x, x)β) = 1

Lemma B.5 (Ashlagi et al., 2021, Lemma 16). If nγ2 m, then

min C Πh µn(C)

1 m exp( nγ/4m) ,

max C Πh µn(C)

µ(C) < 2 1 m exp( nγ/3m) ,

C Πl µ(C) < 2γ

1 exp n(γ p

Lemma B.6 (Ashlagi et al., 2021, Lemma 17). f H olβ 2L

Equipped with the three lemmas, we calculate

Λ β f(µ) = Z

Ω Λβ f(x)dµ = X

C Λβ f(x)dµ + X

C Λβ f(x)dµ . (14)

The first summand above is bounded with high probability using Lemma B.5 and Lemma B.6, since under the event described in Lemma B.5 we have: X

C Λβ f(x)dµ X

In order to bound the second term in Eq. (14), let C Π, x C and note that by applying Lemma B.4 to E := Sx C we get that Λβ f(x ) 2 minx Sx C Λβ f(x). Thus, under the high probability event described in Lemma B.5 we have X

C Λβ f(x)dµ X

C 2 min x Sx C Λβ f(x)dµ = 2 X

C Πh min x Sx C Λβ f(x)µ(C)

C Πh min x Sx C Λβ f(x)µn(C) = 4

x Sx C min x Sx C Λβ f(x)

x Sx C Λβ f(x ) 4

x Sx Λβ f(x ) 4L ,

where the last inequality is due to the extension property of Theorem A.3. Overall, plugging these bounds into Eq. (14) and using the union bound to ensure all required events to hold simultaneously, we see that the desired second bullet holds holds with probability at least 1 m exp( nγ/4m)

m/n)2/2 . A straightforward computation shows that by our assumption on n

being large enough, this probability exceeds 1 δ/2 as required.

We are now ready to finish the proof of Theorem 4.1. Let γ > 0. By Corollary B.2, we can construct bf : S [0, 1] such that with probability at least 1 δ/2 : LS( bf) = 0 and bΛβ b f

5 log2(4n/δ)L. Assuming n is appropriately large, we further apply Proposition B.3 in order to obtain f : Ω [0, 1] such that with probability at least 1 δ/2 : f H ol β 25 log2(4n/δ)L(Ω) and also

LS(f) LS( bf) + γ(1 + 2L) = γ(1 + 2L). By the union bound, we get that with probability at least 1 δ :

LD(f) = 1.01LS(f) + (LD(f) 1.01LS(f)) γ(1 + 2L) + sup

f H ol β 25 log2(4n/δ)L(Ω) (LD(f) 1.01LS(f)) .

where ( ) is justified by setting γ = Θ(ε/L) and applying Theorem 3.4 for appropriately large n.

B.4 Proof of Theorem 5.1

Given a sample S = (Xi, Yi)n i=1 Dn, denote the empirically smooth class

d H ol := n f : {X1, . . . , X n/2 } [0, 1] : bΛβ f 5 log2(4n/δ)L o .

Consider the following procedure:

1. (Empirical cover) Construct h1, . . . , h N d H ol for maximal N such that i = j [N] : hi hj L1(µn) ϵ

2. (Run realizable algorithm on cover) For any j [N], execute the realizable algorithm Arealizable of Theorem 4.1 on the relabeled dataset (Xi, hj(Xi)) n/2 i=1 , and obtain fj : Ω [0, 1]. 3. (ERM) Return arg minf1,...,f N Pn i= n/2 +1 |fj(Xi) Yi|.

We will now prove that the algorithm above satisfies the theorem. Let f arg minf H ol β L(Ω,µ) LD(f),6 and note that by Proposition B.1 (as explained in Corollary B.1) we

have f d H ol with probability at least 1 δ/2. By construction, h1, . . . , h N is a maximal ϵ

4-packing of d H ol, which is known to imply that it is also a ϵ

4-net [Vershynin, 2018, Lemma 4.2.8] with respect to the metric L1(µn). In particular, this implies that there exists j [N] such that

f hj L1(µn) ϵ

4 = LS(hj ) LS(f ) + ϵ

Further note for any j [N] : hj d H ol, so our realizable algorithm (as manifested in Proposition B.3 for γ = Θ(ϵ/L)) when fed the smoothed labels (Xi, hj(Xi)) n/2 i=1 will produce fj such that

LS(fj) LS(hj) ϵ

4 and Λ β fj(µ) 5bΛβ hj 25 log2(4n/δ)L. In particular

LS(fj ) LS(hj ) + ϵ

4 LS(f ) + ϵ

Finally, by Eq. (1) and Theorem 3.1 (which holds for any measure, in particular for the empirical measure µn)

log N log Nd H ol(ϵ/2)

log N[ ](d H ol, L1(µn), ϵ)

ε 640 log2(4n/δ)L log(1/ε)

log 16 log2(1/ε)

Hence, by a standard Chernoff-Hoeffding bound over the finite class {f1, . . . , f N}, step (3) of the

algorithm yields ϵ

2 excess risk as long as n

2 = Ω log(N)+log(1/δ)

6We assume without loss of generality that the infimum is obtained. Otherwise we can take a function whose loss is arbitrarily close enough to the optimal value and continue with the proof verbatim.

B.5 Proof of Theorem 6.1

We start by providing a simple structural result which we will use for our lower bound construction, showing that in any metric space there exists a sufficiently isolated point from a large enough subset. Lemma B.7. There exists a point x0 Ωand a subset K Ωsuch that

x K : ρ(x0, x) diam(Ω)

x = y K : ρ(x, y) (ε/L)1/β .

|K| = j NΩ((ε/L)1/β)

Proof. Denote D := diam(Ω), let x0, x1 be two points such that ρ(x0, x1) > D/2, and let Π = {C0, C1} be a Voronoi partition of Ωinduced by {x0, x1}. For γ > 0, let Nγ be a maximal γ-packing of Ω. By the pigeonhole principle there must exist a cell Ci Π such that |Ci Nγ| |Nγ|/2, which we assume without loss of generality to be C1. Now note that any x C1 satisfies ρ(x, x0) 1 2ρ(x, x0) + 1

2ρ(x, x1) 1

2ρ(x0, x1) > D/4. Finally, set γ := ε1/β and let K C1 Nγ be any

subset of size j NΩ((ε/L)1/β)

Given x0, K from the lemma above, we denote K = {x0} K and define the distribution µ over Ωsupported on K such that µ(x0) = 1 ε

2 and µ(x) = ε 2|K| for all x K. From now on, the proof is similar to a classic lower bound strategy for VC classes in the realizable case (e.g. Kearns and Vazirani, 1994, Proof of Theorem 3.5). To that end, it is enough to provide a distribution over functions in H ol β L(Ω, µ) such that with constant probability any algorithm must suffer significant loss for some function supported by the distribution.

We define such a distribution over functions f : K {0, 1} as follows: Pr[f(x0) = 0] = 1, while for any x K : Pr[f(x) = 0] = Pr[f(x) = 1] = 1

2 independently of other points. We will now show that any such f : K {0, 1} is average Hölder smooth with respect to µ. Indeed, for every x K :

f(x) = sup x K\{x}

|f(x) f(x )|

ρ(x, x )β 1 ε/L = L

f(x0) = sup x K\{x0}

|f(x0) f(x )|

ρ(x0, x )β 1 diam(Ω)/4 = 4 diam(Ω) ,

Λ β f(x) = µ(x0)Λβ

Finally, we define the (random) function f : Ω [0, 1] to be the β-PMSE extension of f from K to Ωas defined in Definition A.2, and note that f satisfies the required smoothness assumption. Setting D over Ω [0, 1] to have marginal µ and Y = f (X), we ensure that D is indeed realizable by H ol β L(Ω).

Now assume A is a learning algorithm which is given a sample S of size |S| |K|

4ε and produces A(S) : Ω [0, 1]. We call a point x K "misclassified" by the algorithm if |A(S)(x) f (x)| 1

2, and denote the set of misclassified points by M K. Recalling that x K : Pr[f(x) = 0] = Pr[f(x) = 1] = 1

2 independently, and that µ(x) = ε 2|K|, we observe that with probability at least 1

2 the algorithm will misclassify more than |K|/2 points.7 Thus, we get that with probability at least 1

LD(A(S)) = E X µ[|A(S)(X) f (X)|] X

x M µ(x) |A(S)(x) f (x)| |K|

7Indeed, denoting C = K \ M we see that Pr[|C| |K|/8] 8 |K| E[|C|] = 8 |K| |S|

By rescaling ε, we see that in order to obtain LD(A(S)) ε the sample size must be of size

= Ω NΩ((ε/L)1/β)

B.6 Proofs from Section 7

Proof of Claim 7.1. Let β (0, 1). Consider the unit segment Ω= [0, 1] with the standard metric, equipped with the probability measure µ with density dµ

dx = 1 Z |x 1

2 (where Z = R 1 0 |x 1

2 < is a normalizing constant). We examine the function f(x) = 1[x > 1

2] which is clearly not Hölder continuous since it is discontinuous. Furthermore,

µ({x : Λ1 f(x) t}) = µ x 1

= eΛ1 f = sup t>0 t µ({x : Λ1 f(x) t}) sup t>0 t 1 β

hence f / g Lip M(Ω, µ) for all M > 0. On the other hand, Λβ f(x) = 1 |x 1

Λ β f = Z 1

0 Λβ f(x)dµ = 1

2 dx (β<1) < ,

thus f H ol β L(Ω) for some L < . Note that by normalizing the function, the claim holds even for L = 1.

Proof of Claim 7.2. Let β (0, 1). Consider the unit segment Ω= [0, 1] with the standard metric, equipped with the probability measure µ with density dµ

dx = 1 Z |x 1

2|β 1 (where Z = R 1 0 |x 1

2|β 1dx < is a normalizing constant). We examine the function f(x) = 1[x > 1

2]. Note that for any x = 1

2 : Λ1 f(x) = 1 |x 1

µ({x : Λ1 f(x) t}) = µ x : |x 1

0 xβ 1dx t β .

This shows that eΛ1 f = sup t>0 t µ({x : Λ1 f(x) t}) sup t>0 t1 β = ,

hence f / g Lip M(Ω, µ) for all M > 0. Furthermore, for x = 1

2 : Λβ f(x) = 1 |x 1

Λ β f = Z 1

hence f / g H ol β M(Ω, µ) for all M > 0. On the other hand

µ({x : Λβ f(x) t}) = µ({|x 1

2| t 1/β}) = 2

0 xβ 1dx t 1

= eΛβ f = sup t>0 t µ({x : Λβ f(x) t}) < ,

thus f g H ol β L(Ω) for some L < . Note that by normalizing the function, the claim holds even for L = 1.