Journal of Machine Learning Research 25 (2024) 1-29 Submitted 10/21; Revised 4/24; Published 5/24

Fat-Shattering Dimension of k-fold Aggregations

Idan Attias idanatti@post.bgu.ac.il
Aryeh Kontorovich karyeh@cs.bgu.ac.il
Department of Computer Science, Ben-Gurion University of the Negev, Beer Sheva, Israel

Editor: John Shawe-Taylor

Abstract

We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing k functions, one from each of the k classes, and computing pointwise an aggregate function of these, such as the median, mean, and maximum. The bounds are stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on k. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

Keywords: combinatorial dimension, scale-sensitive dimension, fat-shattering dimension, aggregation rules, k-fold maximum, ensemble methods

1. Introduction

The fat-shattering dimension, also known as the scale-sensitive dimension and the parametrized variant of the P-dimension, was first defined by Kearns and Schapire (1994); its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett and Long, 1998). In this paper, we study the behavior of the fat-shattering dimension under various k-fold aggregations. Let F_1, …, F_k ⊆ R^Ω be real-valued function classes, and let G : R^k → R be an aggregation rule. We consider the aggregate function class G(F_1, …, F_k), which consists of all mappings x ↦ G(f_1(x), …, f_k(x)), for f_1 ∈ F_1, …, f_k ∈ F_k. Some natural aggregation rules include the pointwise k-fold maximum, median, mean, and max-min.
We seek to bound the fat-shattering complexity of G(F_1, …, F_k) in terms of the fat-shattering dimensions of the constituent F_i's. This question naturally arises in the context of ensemble methods, such as boosting and bagging, where the learner's prediction consists of an aggregation of base learners. The analogous question regarding aggregations of VC classes (the VC dimension being the combinatorial complexity controlling the learnability of Boolean function classes) has been studied in detail and largely resolved (Baum and Haussler, 1989; Blumer et al., 1989; Eisenstat and Angluin, 2007; Eisenstat, 2009; Csikós et al., 2019). Furthermore, closure properties were also studied in the context of online classification and private PAC learning (Alon et al., 2020; Ghazi et al., 2021) for the Littlestone and Threshold dimensions. However, for real-valued functions, this question remained largely uninvestigated.

∗. A previous version of this paper was titled "Fat-shattering dimension of k-fold maxima".
©2024 Idan Attias and Aryeh Kontorovich. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1193.html.

1.1 Our Contributions

For a natural class of aggregation rules that commute with shifts (see definition (7)) and commute with truncation (see definition (21)), assuming fat_γ(F_i) ≤ d for 1 ≤ i ≤ k, we show that

fat_γ(G(F_1, …, F_k)) ≤ O(dk log²(dk)), γ > 0.

In particular, this result holds for the maximum, minimum, median, and max-min aggregations. The formal statement is given in Theorem 1.

By using an entirely different approach, for aggregations that are L-Lipschitz (L ≥ 1) in the supremum norm (see definition (8)) and for bounded function classes F_1, …, F_k ⊆ [−R, R]^Ω with fat_{cεγ}(F_i) ≤ d for 1 ≤ i ≤ k, we show that

fat_γ(G(F_1, …, F_k)) ≤ O(dk log^{1+ε}(LRk/γ)), 0 < γ/L < R and 0 < ε < log 2.
In particular, this result holds for the maximum, minimum, median, mean, and max-min aggregations. The formal statement is given in Theorem 2.

For R-bounded affine functions and for aggregations that are L-Lipschitz in the supremum norm, we show the following dimension-free bound:

fat_γ(G(F_1, …, F_k)) ≤ O(L²R²k log(k)/γ²), 0 < γ/L < R.

This result also extends to the hinge-loss class of affine functions. In particular, this result holds for the maximum, minimum, median, mean, and max-min aggregations. We improve by a log factor the estimate of Fefferman et al. (2016, Lemma 6) on the fat-shattering dimension of the max-min aggregation of linear functions. The formal statement is given in Theorem 3. Furthermore, in Corollary 5 we show an upper bound on the Rademacher complexity of the k-fold maximum aggregation of affine functions and hinge-loss affine functions. Our bound scales as √(k log k), improving upon Raviv et al. (2018), where the dependence on k is linear.

For affine functions and the k-fold maximum aggregation, we show tight dimension-dependent bounds (up to constants):

fat_γ(G_max(F_1, …, F_k)) = Θ(dk log k), γ > 0,

where d is the Euclidean dimension. For the formal statements, see Theorems 7 and 8.

1.2 Applications

The need to analyze the combinatorial complexity of a k-fold maximum of function classes (see (4) for the formal definition) arises in a number of diverse settings. One natural example is adversarially robust PAC learning of real-valued functions with respect to test-time attacks (Attias et al., 2022; Attias and Hanneke, 2023). In this setting, the learner observes an i.i.d. labeled sample from an unknown distribution, and the goal is to output a hypothesis with a small error on unseen examples from the same distribution, with high probability. The difference from the standard PAC learning model is that at test time, the learner only observes a corrupted example, while the prediction is tested on the original label.
Formally, (x, y) is drawn from the unknown distribution, and there is an adversary that can map x to one of k possible corruptions z that are known to the learner. The learner observes only z, while its loss is with respect to the original label y. This scenario is naturally captured by the k-fold max: the learner aims to learn the maximum aggregation of the loss classes. Attias et al. (2022) showed that uniform convergence holds in this case, and so the sample complexity of an empirical risk minimization algorithm is determined by the complexity measure of the k-fold maximum aggregation.

Analyzing the k-fold maximum also arises in the setting of learning polyhedra with a margin. Gottlieb et al. (2018) provided a learning algorithm that represents polyhedra as intersections of bounded affine functions. The sample complexity of the algorithm is determined by the complexity measure of the maximum aggregation of affine function classes.

Another natural example where the k-fold maximum and k-fold max-min play a role is in analyzing the convergence of k-means clustering. Fefferman et al. (2016) bounded the max-min aggregation, and Klochkov et al. (2021); Biau et al. (2008); Appert and Catoni (2021); Zhivotovskiy (2022) bounded the max aggregation. The main challenge in this setting is bounding the covering numbers of the aggregation over k function classes, which can be obtained by bounding the Rademacher complexity or the fat-shattering dimension.

Finally, there are numerous ensemble methods for regression that output some aggregation of base learners, such as the median or mean. Examples of these methods include boosting (e.g., Freund and Schapire (1997); Kégl (2003)), bagging (bootstrap aggregation) by Breiman (1996), and its extension to the random forest algorithm (Breiman, 2001).

1.3 Related Work

It was claimed in Attias et al. (2019, Theorem 12) that

fat_γ(G_max(F_1, …, F_k)) ≤ 2 log(3k) Σ_{i=1}^k fat_γ(F_i),

but the proof had a mistake (see Section 5); our Open Problem (28) asks if the general form of the bound does hold (we conjecture it does, at least for the max aggregation). Using the recent disambiguation result of Alon et al. (2022), presented in Lemma 9 here, Attias et al. (2022, Lemma 15) obtained the bound

fat_γ(G_max(F_1, …, F_k)) ≤ O( log(k) log²(|Ω|) Σ_{i=1}^k fat_γ(F_i) ), (1)

where Ω is the domain of the function classes F_1, …, F_k. The latter is, in general, incomparable to Theorem 1. However, for large or infinite Ω, Theorem 1 is clearly a considerable improvement over (1). Using the covering number results of Mendelson and Vershynin (2003); Talagrand (2003) (see Section A.2), Duan (2012, Theorem 6.2) obtained a general result, which, when specialized to k-fold maxima, yields the bound (2), for a universal constant c > 0; (2) is an immediate consequence of Theorem 10 (with p = 2), Lemma 18, and Lemma 19 in this paper. Our results improve over (2) by removing the dependence on k in the scale of the fat-shattering dimensions; however, Duan's general method is applicable to a wider class of uniformly continuous k-fold aggregations. Srebro et al. (2010, Lemma A.2) bounded the fat-shattering dimension in terms of the Rademacher complexity. Foster and Rakhlin (2019) bounded the Rademacher complexity of a smooth k-fold aggregate; see also references therein. Inspired by Appert and Catoni (2021), Zhivotovskiy (2022) obtained the best known upper bound on the Rademacher complexity of k-fold maxima over linear function classes. Raviv et al. (2018) upper bounded the Rademacher complexity of the k-fold maximum aggregation of affine functions and hinge-loss affine functions.

2. Preliminaries

2.1 Aggregation Rules

A k-fold aggregation rule is any mapping G : R^k → R. Just as G maps k-tuples of reals into reals, it naturally aggregates k-tuples of functions into a single one: for f_1, …
, f_k : Ω → R, we define G(f_1, …, f_k) : Ω → R as the mapping x ↦ G(f_1(x), …, f_k(x)). Finally, the aggregation extends to k-tuples of function classes: for F_1, …, F_k ⊆ R^Ω, we define

G(F_1, …, F_k) := {x ↦ G(f_1(x), …, f_k(x)) : f_i ∈ F_i, i ∈ [k]}. (3)

Examples. A canonical example of an aggregation rule is the k-fold max, induced by the mapping

G_max(x_1, …, x_k) := max_{i∈[k]} x_i. (4)

The minimum is defined analogously as G_min(x_1, …, x_k) := min_{i∈[k]} x_i. The mean aggregation is defined as G_mean(x_1, …, x_k) := (1/k) Σ_{i=1}^k x_i. Denoting by x_(1), …, x_(k) the ascending order of a sequence x_1, …, x_k, that is, x_(1) ≤ … ≤ x_(k), the (lower¹) median is defined as

G_med(x_1, …, x_k) := x_(⌈k/2⌉). (5)

We also define G_max-min : R^{kℓ} → R as

G_max-min(x_11, …, x_kℓ) := max_{j∈[ℓ]} min_{i∈[k]} x_ij. (6)

Next, we consider some properties that an aggregation rule might possess.

Commuting with shifts. We say that an aggregation rule G commutes with shifts if

G(x) − r = G(x − r), x ∈ R^k, r ∈ R, (7)

where x − r is defined as (x_1 − r, …, x_k − r) for x = (x_1, …, x_k). It is readily verified that the maximum, minimum, max-min, mean, and median commute with shifts.

Lipschitz continuity. The mapping G : R^k → R is L-Lipschitz with respect to ‖·‖_∞ if

|G(x) − G(x′)| ≤ L ‖x − x′‖_∞ = L max_{i∈[k]} |x_i − x′_i|, x, x′ ∈ R^k. (8)

In Appendix A.1, we show that the maximum, median, and max-min aggregations are 1-Lipschitz (Lemmas 13, 14, 15, respectively). Showing it for the mean is a simple exercise. The proof for the minimum is similar to the one for the maximum. We also consider aggregations that commute with truncation; see Section 4.1 for the formal definition.

2.2 Complexity Measures

Fat-shattering dimension. Let Ω be a set and F ⊆ R^Ω. For γ > 0, a set S = {x_1, …, x_m} ⊆ Ω is said to be γ-shattered by F if

sup_{r∈R^m} min_{y∈{−1,1}^m} sup_{f∈F} min_{i∈[m]} y_i(f(x_i) − r_i) ≥ γ. (9)

The γ-fat-shattering dimension, denoted by fat_γ(F), is the size of the largest γ-shattered set (possibly ∞).
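The aggregation rules and the two properties above lend themselves to a quick numerical sanity check. The following sketch is our own illustration (the function names are ours, not the paper's); it verifies, on random inputs, that the max, min, mean, and lower median commute with shifts (7) and are 1-Lipschitz in the sup norm (8):

```python
import random

# Pointwise aggregation rules from Section 2.1 (definitions (4)-(6)).
def g_max(x):  return max(x)
def g_min(x):  return min(x)
def g_mean(x): return sum(x) / len(x)

def g_median(x):
    # Lower median (5): the ceil(k/2)-th smallest entry, hence a value
    # actually occurring in x (as required for the proof of Theorem 1).
    return sorted(x)[(len(x) - 1) // 2]

def g_max_min(cols):
    # Max-min (6): cols[j][i] holds x_ij; min over i, then max over j.
    return max(min(col) for col in cols)

# Randomized check of commuting with shifts (7) and 1-Lipschitzness (8).
rng = random.Random(0)
for _ in range(1000):
    k = rng.randint(1, 7)
    x = [rng.uniform(-5, 5) for _ in range(k)]
    y = [rng.uniform(-5, 5) for _ in range(k)]
    r = rng.uniform(-5, 5)
    for G in (g_max, g_min, g_mean, g_median):
        assert abs((G(x) - r) - G([xi - r for xi in x])) < 1e-9           # (7)
        assert abs(G(x) - G(y)) <= max(abs(a - b)
                                       for a, b in zip(x, y)) + 1e-9      # (8)
print("shift and Lipschitz checks passed")
```

A randomized check of course proves nothing; it merely exercises the definitions (the formal proofs are in Appendix A.1).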
Fat-shattering dimension at zero. As in Gottlieb et al. (2014), we also define the notion of γ-shattering at 0, where the shift r in (9) is constrained to be 0. Formally, the shattering condition is min_{y∈{−1,1}^m} sup_{f∈F} min_{i∈[m]} y_i f(x_i) ≥ γ, and we denote the corresponding dimension by fat⁰_γ(F). Attias et al. (2019, Lemma 13) showed that for all F ⊆ R^Ω,

fat_γ(F) = max_{r∈R^Ω} fat⁰_γ(F − r), γ > 0, (10)

where F − r = {f − r; f ∈ F} is the r-shifted class (the maximum is always achieved). Lemma 24 presents another, apparently novel, connection between fat and fat⁰.

1. Ordinarily, for even k, any m ∈ [x_(k/2), x_(k/2+1)] is a median of x. For the proof of Theorem 1, the median must be a value actually occurring in x.

Rademacher complexity. Let F be a real-valued function class on the domain space W. Define the empirical Rademacher complexity of F on a given sequence (w_1, …, w_n) ∈ W^n as

R_n(F | w_1, …, w_n) = (1/n) E_σ sup_{f∈F} Σ_{i=1}^n σ_i f(w_i),

where σ = (σ_1, …, σ_n) are independent random variables uniformly chosen from {−1, 1}. The Rademacher complexity of F with respect to a distribution D is defined as R_n(F) = E_{w_1,…,w_n∼D} R_n(F | w_1, …, w_n). It is a classic fact (Mohri et al., 2012, Theorem 3.1) that the Rademacher complexity controls generalization bounds in a wide range of supervised learning settings.

Covering numbers. Whenever Ω is endowed with a probability measure µ, this induces, for p ∈ [1, ∞) and f : Ω → R^k, the norm

‖f‖_{L_p^{(k)}(µ)} = (E_{X∼µ} ‖f(X)‖_p^p)^{1/p} = ( ∫_Ω ‖f(x)‖_p^p dµ(x) )^{1/p}

on L_p^{(k)}(µ) := { f ∈ (R^k)^Ω : ‖f‖_{L_p^{(k)}(µ)} < ∞ }. When k = 1, we write L_p(µ) := L_p^{(1)}(µ). For p = ∞, ‖f‖_{L_∞^{(k)}(µ)} is the essential supremum of ‖f(X)‖_∞ with respect to µ. For t > 0 and H ⊆ F ⊆ L_p(µ), we say that H is a t-cover of F under L_p(µ) if sup_{f∈F} inf_{h∈H} ‖f − h‖_{L_p(µ)} ≤ t. The t-covering number of F, denoted by N(F, L_p(µ), t), is the cardinality of the smallest t-cover of F (possibly ∞).
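For a finite class restricted to a finite set, the γ-shattering-at-0 condition can be checked by brute force, enumerating all sign patterns. The sketch below is our own illustration (exponential in the sample size m, so purely pedagogical):

```python
from itertools import combinations, product

def shattered_at_zero(rows, gamma):
    """rows[f] = (f(x_1), ..., f(x_m)) for each f in a finite class F.
    True iff {x_1, ..., x_m} is gamma-shattered with shift r = 0:
    every sign pattern y is realized by some f with y_i * f(x_i) >= gamma."""
    m = len(rows[0])
    return all(
        any(all(yi * v >= gamma for yi, v in zip(y, row)) for row in rows)
        for y in product([-1, 1], repeat=m)
    )

def fat_at_zero(rows, gamma):
    """Brute-force gamma-fat-shattering dimension at zero of a finite class
    on a finite domain: size of the largest shattered subset of points."""
    m = len(rows[0])
    best = 0
    for size in range(1, m + 1):
        for S in combinations(range(m), size):
            if shattered_at_zero([tuple(r[i] for i in S) for r in rows], gamma):
                best = size
    return best
```

For example, the four sign vectors on two points, rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)], are 1-shattered at 0, so fat_at_zero returns 2; dropping the two mixed-sign rows brings it down to 1.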
We note the obvious relation

p > q ⟹ N(F, L_p(µ), t) ≥ N(F, L_q(µ), t), (11)

which holds for all probability measures µ and all t > 0. We sometimes overload the notation for aggregations by defining G on k-tuples of functions (instead of k-tuples of reals), G : (R^Ω)^k → R^Ω. We say that G is L-Lipschitz with respect to L_p^{(k)}(µ) if

‖G(f_{1:k}) − G(f′_{1:k})‖_{L_p(µ)} ≤ L ‖(f_{1:k}) − (f′_{1:k})‖_{L_p^{(k)}(µ)}, f_{1:k}, f′_{1:k} ∈ (R^k)^Ω.

2.3 Notation

We write N = {0, 1, …} to denote the natural numbers. For n ∈ N, we write [n] := {1, 2, …, n}. All of our logarithms are base e, unless explicitly denoted otherwise. We use max{u, v} and u ∨ v interchangeably, and write Log(x) := log(e ∨ x). For any function class F over a set Ω and E ⊆ Ω, F(E) = F|_E denotes the projection on (restriction to) E. In line with the common convention in functional analysis, absolute numerical constants will be denoted by letters such as C, c, whose value may change from line to line. Any transformation ϕ : R → R may be applied to a function f ∈ R^Ω via ϕ(f) := ϕ ∘ f, as well as to F ⊆ R^Ω via ϕ(F) := {ϕ(f); f ∈ F}. The sign function thresholds at 0: sign(t) = 1[t ≥ 0].

3. Main Results

Our main results involve upper-bounding the fat-shattering dimension of aggregation rules in terms of the dimensions of the component classes. We begin with the simplest (to present):

Theorem 1 (General F and G that commutes with shifts and truncation) For F_1, …, F_k ⊆ R^Ω and an aggregation rule G that commutes with shifts (see definition (7)) and commutes with truncation (see definition (21)), we have

fat_γ(G(F_1, …, F_k)) ≤ 35 D_γ log²(126 D_γ), γ > 0,

where D_γ := Σ_{i=1}^k fat_γ(F_i) > 0. In the degenerate case where D_γ = 0, fat_γ(G) = 0. In particular, this result holds for the maximum, minimum, max-min, and median aggregation rules.

Remark. We made no attempt to optimize the constants; these are only provided to give a rough order-of-magnitude sense.
In the sequel, we forgo numerical estimates and state the results in terms of unspecified universal constants. The next result provides an alternative bound based on an entirely different technique:

Theorem 2 (Bounded function classes and Lipschitz aggregations) For 0 < ε < log 2, F_1, …, F_k ⊆ [−R, R]^Ω, and an aggregation rule G that is L-Lipschitz (L ≥ 1) in the supremum norm (see definition (8)), we have

fat_γ(G(F_1, …, F_k)) ≤ C D Log^{1+ε}(LRk/γ), 0 < γ/L < R,

where D := Σ_{i=1}^k fat_{cεγ}(F_i) and C, c > 0 are universal constants. In particular, this result holds for natural aggregation rules, such as the maximum, minimum, max-min, mean, and median.

Remark. The bounds in Theorems 1 and 2 are, in general, incomparable, and not just because of the unspecified constants in the latter. One notable difference is that Theorem 1 only depends on the shattering scale γ, while Theorem 2 additionally features a (weak) explicit dependence on the aspect ratio R/γ. In particular, Theorem 1 is applicable to semi-bounded affine classes (see Section A.4), while Theorem 2 is not. Still, for fixed R, γ and large k, the latter presents a significant asymptotic improvement over the former.

For the special case of affine functions and hinge-loss affine functions, the technique of Theorem 2 yields a considerably sharper estimate:

Theorem 3 (Dimension-free bound for Lipschitz aggregations of affine functions) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and

F_i = { x ↦ w · x + b; ‖w‖ ∨ |b| ≤ R_i, w ∈ R^d, b ∈ R }, R_i > 0, i ∈ [k], (12)

be k collections of R_i-bounded affine functions on Ω = B, and let G be an aggregation rule that is L-Lipschitz in the supremum norm (see definition (8)). Then

fat_γ(G(F_1, …, F_k)) ≤ C L² Log(k) Σ_{i=1}^k R_i² / γ², 0 < γ/L < min_{i∈[k]} R_i, (13)

where C > 0 is a universal constant. Further, if

F^Hinge_i = {(x, y) ↦ max{0, 1 − y f(x)}; f ∈ F_i} (14)

is a family of R_i-bounded hinge-loss affine functions for i ∈ [k] and G^Hinge := G(F^Hinge_1, …
, F^Hinge_k) for an aggregation rule G that is L-Lipschitz in the supremum norm, then the same bound as in (13) holds for fat_γ(G^Hinge). In particular, this result holds for the maximum, minimum, max-min, mean, and median aggregation rules.

Theorem 3 improves by a log factor the estimate of Fefferman et al. (2016) on the fat-shattering dimension of the max-min aggregation (defined in Section 2) of linear functions:²

Lemma 4 (Fefferman et al. (2016), Lemma 6) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and

F_ij = { x ↦ w · x; ‖w‖ ≤ 1, w ∈ R^d }, i ∈ [k], j ∈ [ℓ],

be kℓ (identical) linear function classes defined on Ω = B. If G_max-min is the max-min aggregation rule (6), then

fat_γ(G_max-min(F_11, …, F_kℓ)) ≤ C kℓ log²(kℓ) / γ²,

where C > 0 is a universal constant.

Our Theorem 3 improves the latter by a log factor:

fat_γ(G_max-min(F_11, …, F_kℓ)) ≤ C kℓ log(kℓ) / γ².

Corollary 5 (Rademacher complexity for k-fold maximum of affine functions) Let F_i be an R_i-bounded affine function class as in (12) or a hinge-loss affine function class as in (14), let G_max be the maximum aggregation rule, and let R̄ = max_i R_i. Then

R_n(G_max(F_1, …, F_k)) ≤ C √( Log(k) Log³(R̄n) R̄² Σ_{i=1}^k R_i² / n ),

where R_n is the Rademacher complexity and C > 0 is a universal constant.

2. The max-min aggregation is shown to be 1-Lipschitz in the supremum norm in Lemma 15 of Section A.1.

Corollary 5 improves upon Raviv et al. (2018, Theorem 7). Their upper bound scales linearly with k, whereas ours scales as √(k log k). Note, however, that for linear classes a better bound is known:

Theorem 6 (Zhivotovskiy (2022)) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and F_i = { x ↦ w · x; ‖w‖ ≤ 1, w ∈ R^d }, i ∈ [k], be k (identical) linear function classes defined on Ω = B. If G_max is the maximum aggregation rule, then

R_n(G_max(F_1, …, F_k)) ≤ C log n / √n,

where R_n is the Rademacher complexity and C > 0 is a universal constant.

The estimate in Theorem 3 is dimension-free in the sense of being independent of d.
In applications where a dependence on d is admissible, an optimal bound can be obtained:

Theorem 7 (Dimension-dependent bound for k-fold maximum of affine functions) Let Ω = R^d and let F_i ⊆ R^Ω be k (identical) function classes consisting of all real-valued affine functions:

F_i = { x ↦ w · x + b; w ∈ R^d, b ∈ R }, i ∈ [k],

and let G_max be the k-fold maximum (see definition (4)). Then

fat_γ(G_max(F_1, …, F_k)) ≤ C dk Log k, γ > 0,

where C > 0 is a universal constant.

The optimality of the upper bound in Theorem 7 is witnessed by the matching lower bound:

Theorem 8 (Dimension-dependent lower bound for k-fold maximum of affine functions) For k ≥ 1 and d ≥ 4, let F_1 = F_2 = … = F_k be the collection of all affine functions over Ω = R^d and let G_max be the k-fold maximum (see definition (4)). Then

fat_γ(G_max(F_1, …, F_k)) ≥ C log(k) Σ_{i=1}^k fat_γ(F_i) = C dk log k, γ > 0,

where C > 0 is a universal constant.

The scaling argument employed in the proof of Theorem 8 can be invoked to show that the claim continues to hold for Ω = B. Together, Theorems 7 and 8 show that the dependence on k is optimal.

4. Proofs

We start with upper-bounding the fat-shattering dimension of aggregation rules that commute with shifts (definition (7)) and commute with truncation (defined below), in terms of the dimensions of the component classes.

4.1 Proof of Theorem 1

Partial concept classes and disambiguation. We say that F∗ ⊆ {0, 1, ∗}^Ω is a partial concept class over Ω; this usage is consistent with Alon et al. (2022), while Attias et al. (2019, 2022) used the descriptor "ambiguous". Define the disambiguation operator D : {0, 1, ∗} → 2^{{0,1}} as

D(0) = {0}; D(1) = {1}; D(∗) = {0, 1}. (15)

For f∗ ∈ F∗, define its disambiguation set D(f∗) ⊆ {0, 1}^Ω as

D(f∗) = { g ∈ {0, 1}^Ω : ∀x ∈ Ω, g(x) ∈ D(f∗(x)) }; (16)

in words, D(f∗) consists of the total concepts g : Ω → {0, 1} that agree pointwise with f∗ whenever the latter takes a value in {0, 1}.
We say that F̄ ⊆ {0, 1}^Ω disambiguates F∗ if for all f∗ ∈ F∗ we have F̄ ∩ D(f∗) ≠ ∅; in words, every f∗ ∈ F∗ must have a disambiguated representative in F̄.³ As in Alon et al. (2022); Attias et al. (2022), we say⁴ that S ⊆ Ω is VC-shattered by F∗ if F∗(S) ⊇ {0, 1}^S. We write vc(F∗) to denote the size of the largest VC-shattered set (possibly ∞). The obvious relation vc(F∗) ≤ vc(F̄) always holds between a partial concept class and any of its disambiguations. Alon et al. (2022, Theorem 13) proved the following variant of the Sauer-Shelah-Perles Lemma for partial concept classes:

Lemma 9 (Alon et al. (2022)) For every F∗ ⊆ {0, 1, ∗}^Ω with d = vc(F∗) < ∞ and |Ω| < ∞, there is an F̄ disambiguating F∗ such that

|F̄(Ω)| ≤ (|Ω| + 1)^{(d+1) log₂|Ω| + 2}. (17)

For d > 0 and |Ω| > 1, this implies the somewhat more wieldy estimate⁵

|F̄(Ω)| ≤ |Ω|^{7d log₂|Ω|}. (18)

We will make use of the elementary fact

x ≤ A log₂ x ⟹ x ≤ 3A log(3A), x, A ≥ 1,

and its corollary

y ≤ A (log₂ y)² ⟹ y ≤ 5A log²(18A), y, A ≥ 1. (19)

3. Attias et al. (2022) additionally required that F̄ ⊆ ∪_{f∗∈F∗} D(f∗), but this is an unnecessary restriction, and does not affect any of the results.
4. Attias et al. (2019) had incorrectly given F∗(S) = {0, 1}^S as the shattering condition.
5. The estimate (18) does not appear in Alon et al. (2022), but is an elementary consequence of (17).

Aggregation rules commuting with truncation. Fix γ > 0 and define the truncation operator [·]_γ : R → {0, 1, ∗} as

[t]_γ = 0 if t ≤ −γ; 1 if t ≥ γ; ∗ otherwise. (20)

Let x_i ∈ R, i ∈ [k], let [x_i]_γ ∈ {0, 1, ∗} be the γ-truncation, and let x̃_i ∈ D([x_i]_γ) ⊆ {0, 1} be a disambiguation. We say that an aggregation rule G : R^k → R commutes with truncation if for any γ > 0,

G(x̃_1, …, x̃_k) ∈ D([G(x_1, …, x_k)]_γ) (21)

for all disambiguations x̃_i, i ∈ [k] (see the definitions in (15) and (16)). In Appendix A.1, we show that the median and max-min aggregations commute with truncation (Lemmas 16, 17, respectively). Showing it for the maximum and minimum is a simple exercise.
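Definitions (15), (20), and (21) can be exercised numerically. The sketch below is our own randomized check (with ∗ represented by None); on random inputs it confirms that the max and min commute with truncation, and exhibits a violation for the mean:

```python
import itertools
import random

def trunc(t, gamma):
    # Truncation operator (20): 0 if t <= -gamma, 1 if t >= gamma, * (None) else.
    return 0 if t <= -gamma else 1 if t >= gamma else None

def disamb(b):
    # Disambiguation sets (15): D(0) = {0}, D(1) = {1}, D(*) = {0, 1}.
    return (0, 1) if b is None else (b,)

def commutes_with_truncation(G, k, gamma=1.0, trials=2000, seed=0):
    """Randomized test of condition (21): every disambiguation of the
    truncated inputs must aggregate into D([G(x)]_gamma)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-3, 3) for _ in range(k)]
        allowed = disamb(trunc(G(x), gamma))
        for xt in itertools.product(*(disamb(trunc(xi, gamma)) for xi in x)):
            if G(list(xt)) not in allowed:
                return False
    return True
```

For instance, commutes_with_truncation(max, 3) passes every trial, while the mean fails quickly: inputs such as (0.5, 2) truncate to (∗, 1), whose disambiguation (0, 1) averages to 0.5, which lies in neither D(0) nor D(1).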
We note that the mean aggregation does not satisfy this property.

Proof [of Theorem 1] We follow the basic techniques of discretization and r-shifting employed in Attias et al. (2019, 2022). Fix γ > 0 and recall the truncation operator [·]_γ : R → {0, 1, ∗} defined in (20). We also define the truncation operator over functions, [·]_γ : R^Ω → {0, 1, ∗}^Ω, as [f]_γ = f∗ where f∗(x) = [f(x)]_γ for x ∈ Ω. Observe that for all F ⊆ R^Ω and [F]_γ := {[f]_γ; f ∈ F}, we have fat⁰_γ(F) = vc([F]_γ). Let G : R^k → R be a k-fold aggregation rule and let F_1, …, F_k ⊆ R^Ω be real-valued function classes. Suppose that some S = {x_1, …, x_ℓ} ⊆ Ω is γ-shattered by G := G(F_1, …, F_k). Proving the claim amounts to upper-bounding ℓ appropriately. By (10), there is an r ∈ R^Ω such that fat_γ(G) = fat⁰_γ(G − r) = vc([G − r]_γ). Put F′_i := F_i − r; since G commutes with the r-shift, as defined in (7), we have

G′ := G(F′_1, …, F′_k) = G(F_1 − r, …, F_k − r) = G(F_1, …, F_k) − r. (22)

Hence, S is VC-shattered by [G′]_γ and

v_i := vc([F′_i]_γ) = fat⁰_γ(F′_i) ≤ fat_γ(F′_i) = fat_γ(F_i), i ∈ [k]. (23)

Let us assume for now that each v_i > 0; in this case, there is no loss of generality in assuming ℓ > 1. Let F̃_i be a good disambiguation of [F′_i]_γ on S, as furnished by Lemma 9: |F̃_i(S)| ≤ ℓ^{7 v_i log₂ ℓ}. Observe that G̃ := G(F̃_1, …, F̃_k) is a valid disambiguation of [G′]_γ, since we assume that G commutes with truncation. It follows that

2^ℓ = |G̃(S)| ≤ ∏_{i=1}^k |F̃_i(S)| ≤ ℓ^{7 log₂(ℓ) Σ_{i=1}^k v_i}. (24)

Thus, (19) implies that ℓ ≤ 35(Σ v_i) log²(126 Σ v_i), and the latter is an upper bound on vc(G̃) and hence also on vc([G′]_γ) = fat_γ(G). The claim now follows from (23).

If any one given v_i = 0, we claim that (24) is unaffected. This is because any C∗ ⊆ {0, 1, ∗}^Ω with vc(C∗) = 0 has a singleton disambiguation C̄ = {c̄}. Indeed, any given x ∈ Ω can receive at most one of {0, 1} as a label from the members of C∗ (otherwise, it would be shattered, forcing vc(C∗) ≥ 1).
If any c∗ ∈ C∗ labels x with 0, then all members of C∗ are disambiguated to label x with 0 (and, mutatis mutandis, 1). Any x labeled with ∗ by every c∗ ∈ C∗ can be disambiguated arbitrarily (say, to 0). Disambiguating the degenerate [F′_i]_γ to the singleton F̃_i(S) has no effect on the product in (24). The foregoing argument continues to hold if more than one v_i = 0. In particular, in the degenerate case where fat_γ(F_1) = fat_γ(F_2) = … = fat_γ(F_k) = 0, we have ∏ |F̃_i(S)| = 1, which forces ℓ = 0.

4.2 Proof of Theorem 2

First, we upper bound the covering numbers of Lipschitz aggregations as a function of the covering numbers of the component classes.

Theorem 10 (Covering number of L-Lipschitz aggregations) Let t > 0, p ∈ [1, ∞], and F_1, …, F_k ⊆ L_p(µ). Let G be an aggregation rule that is L-Lipschitz. Then, for all probability measures µ on Ω,

N(G(F_1, …, F_k), L_p(µ), t) ≤ ∏_{i=1}^k N(F_i, L_p(µ), t/(Lk^{1/p})), p < ∞,
N(G(F_1, …, F_k), L_p(µ), t) ≤ ∏_{i=1}^k N(F_i, L_p(µ), t/L), p = ∞.

We proceed to the main proof.

Proof [of Theorem 2] Let G : R^k → R be an aggregation rule that is L-Lipschitz (L ≥ 1) in the supremum norm, as defined in (8), and let F_1, …, F_k ⊆ [−R, R]^Ω be real-valued function classes. Suppose that some Ω_ℓ = {x_1, …, x_ℓ} ⊆ Ω is a maximal set that is γ-shattered by G, let F_i(Ω_ℓ) = F_i|_{Ω_ℓ}, and let µ_ℓ be the uniform distribution on Ω_ℓ. We upper bound the covering number with the fat-shattering dimension as in Lemma 21 (see Section A.2), with n = ℓ and p = ∞:

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ) ≤ C v_i log(Rℓ/(v_i γ)) log^ε(ℓ/v_i), 0 < γ < R,

where v_i = fat_{cεγ}(F_i). Then Theorem 10 implies that

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≤ Σ_{i=1}^k log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/(2L))
≤ C Σ_{i=1}^k v_i log(LRℓ/(v_i γ)) log^ε(ℓ/v_i)
≤(a) C Σ_{i=1}^k v_i log^{1+ε}(LRℓ/(v_i γ))
≤(b) C D log^{1+ε}(LRℓk/(Dγ)),

where D := Σ_{i=1}^k v_i, (a) follows since R/γ > 1 and L ≥ 1 by assumption, and (b) follows by the concavity of x ↦ x log^{1+ε}(u/x) (see Lemma 28 in Section A.5). We can assume ℓ ≥ 2 without loss of generality.
Combining the monotonicity of the covering number (see (11)), a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 18 in Section A.2), and the fact that Ω_ℓ is a maximal set that is γ-shattered by G yields

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≥ c fat_γ(G) = cℓ,

whence ℓ ≤ CD Log^{1+ε}(LRℓk/(Dγ)). Using the elementary fact

x ≤ A Log^{1+ε} x ⟹ x ≤ c′A Log^{1+ε} A, x, A ≥ 1

(with x = LRℓk/(Dγ) and A = c LRk/γ), we get ℓ ≤ CD Log^{1+ε}(LRk/γ), which implies the claim.

4.3 Proof of Theorem 3

We use the notation and results from the Appendix, and in particular from Section A.3.

Proof [of Theorem 3] A bound of this form for the k-fold maximum aggregation was claimed in Kontorovich (2018); however, the argument there was flawed, see Section 5. Let G : R^k → R be an aggregation rule that is L-Lipschitz in the supremum norm, as defined in (8), and let F_1, …, F_k be bounded affine function classes, as defined in (12). Suppose that some Ω_ℓ = {x_1, …, x_ℓ} ⊆ Ω = B is a maximal set that is γ-shattered by G, let F_i(Ω_ℓ) = F_i|_{Ω_ℓ}, and let µ_ℓ be the uniform distribution on Ω_ℓ. We upper bound the covering number as in Lemma 23 (with m = ℓ):

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ) ≤ C (R_i²/γ²) Log(ℓγ²/R_i²), 0 < γ < R_i.

Denote v_i := L²R_i²/γ², and consider the L_∞ covering number of F_i(Ω_ℓ) at scale γ/L:

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/L) ≤ C v_i Log(ℓ/v_i).

Then Theorem 10 implies that

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≤ Σ_{i=1}^k log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/(2L)) ≤ C Σ_{i=1}^k v_i Log(ℓ/v_i) ≤(a) CD Log(kℓ/D),

where D := Σ_{i=1}^k v_i and (a) follows by the concavity of x ↦ x log(u/x) (see Corollary 27 in Section A.5). Combining the monotonicity of the covering number (see (11)), a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 18 in Section A.2), and the fact that Ω_ℓ is a maximal set that is γ-shattered by G yields

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≥ c fat_γ(G) = cℓ,

whence ℓ ≤ CD Log(kℓ/D). Using the elementary fact

x ≤ A Log x ⟹ x ≤ c′A Log A, x, A ≥ 1

(with x = kℓ/D and A = ck), we get ℓ ≤ cD Log k, which implies the claim.
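Both proofs above rest on the product-cover bound of Theorem 10. For the 1-Lipschitz max aggregation with p = ∞, the mechanism can be verified numerically: cover each class separately and aggregate one net element per class; the result covers the aggregate class at the same scale. The sketch below is our own illustration on finite random classes (functions restricted to a finite domain, represented as vectors):

```python
import itertools
import random

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_cover(F, t):
    """Greedy t-net of a finite family of vectors under the sup norm;
    returns the net and a map sending each f to its covering element."""
    net, cover_of = [], {}
    for f in F:
        h = next((h for h in net if sup_dist(f, h) <= t), None)
        if h is None:
            net.append(f)
            h = f
        cover_of[f] = h
    return net, cover_of

def pointwise_max(fs):
    return tuple(map(max, zip(*fs)))

rng = random.Random(1)
l, k, t = 5, 3, 0.25
# Each class: 6 random functions on an l-point domain.
classes = [[tuple(rng.uniform(-1, 1) for _ in range(l)) for _ in range(6)]
           for _ in range(k)]
nets = [greedy_cover(F, t) for F in classes]

# Since max is 1-Lipschitz in the sup norm, aggregating the covering
# elements covers the aggregated class at scale t, so
# N(G_max(F_1, ..., F_k), t) <= prod_i N(F_i, t), as in Theorem 10.
worst = 0.0
for combo in itertools.product(*classes):
    g = pointwise_max(combo)
    g_cov = pointwise_max([nets[i][1][combo[i]] for i in range(k)])
    worst = max(worst, sup_dist(g, g_cov))
assert worst <= t
print("worst deviation:", round(worst, 4), "net sizes:", [len(n[0]) for n in nets])
```

The assertion holds deterministically: |max_j f_j(x) − max_j h_j(x)| ≤ max_j |f_j(x) − h_j(x)| ≤ t at every point x, which is exactly the p = ∞, L = 1 case of Theorem 10.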
The result can easily be generalized to hinge-loss affine classes. Let F_i be an affine function class as in (12), define F′_i as the function class on B × {−1, 1} given by F′_i = {(x, y) ↦ y f(x); f ∈ F_i}, and define the hinge-loss affine class F^Hinge_i as the function class on B × {−1, 1} given by F^Hinge_i = {(x, y) ↦ max{0, 1 − f(x, y)}; f ∈ F′_i}. One first observes that the restriction of F′_i to any {(x_1, y_1), …, (x_n, y_n)}, as a body in R^n, is identical to the restriction of F_i to {x_1, …, x_n}. Interpreting F^Hinge_i as a 2-fold maximum over the singleton class H = {h ≡ 0} and the bounded affine class F′_i lets us invoke Theorem 10 to argue that F_i and F^Hinge_i have the same L_∞ covering numbers. Hence, the argument we deployed here to establish (13) for affine classes also applies to k-fold L-Lipschitz aggregations of hinge-loss classes.

4.4 Proof of Corollary 5

Proof [of Corollary 5] Raviv et al. (2018, Theorem 7) upper-bounded the Rademacher complexity of the maximum aggregation of k hinge-loss affine functions by k/√n. For R_i-bounded affine functions or hinge-loss affine functions, the analysis above, combined with the calculation in Kontorovich (2018), yields a bound of O(√( Log(k) Log³(n) Σ_{i=1}^k R_i² / n )). For completeness, we provide the full proof. Let G_max : R^k → R be the k-fold maximum aggregation rule, as defined in (4), and let F_1, …, F_k ⊆ R^Ω be R_i-bounded affine function classes as in (12) or hinge-loss affine function classes as in (14). Since this aggregation is 1-Lipschitz in the supremum norm, Theorem 3 implies that

fat_γ(G_max) ≤ C Log(k) Σ_{i=1}^k R_i² / γ², 0 < γ < min_{i∈[k]} R_i,

where C > 0 is a universal constant.

From fat-shattering to Rademacher. The fat-shattering estimate above can be used to upper-bound the Rademacher complexity by converting the former to a covering number bound and plugging it into Dudley's chaining integral (Dudley, 1967):

R_n(F) ≤ inf_{α≥0} ( 4α + 12 ∫_α^1 √( log N(F, ‖·‖_2, t) / n ) dt ),

where N(·) are the L_2 covering numbers.
It remains to bound the covering numbers. A simple way of doing so is to invoke Lemmas 2.6, 3.2, and 3.3 in Alon et al. (1997), but this incurs superfluous logarithmic factors in n. Instead, we use the sharper estimate of Mendelson and Vershynin (2003), stated here in Lemma 19. Putting R̄ = max_i R_i, the latter yields

R_n(G_max) ≤ inf_{α≥0} ( 4α + 12 ∫_α^1 √( log N(G_max, ‖·‖_2, t) / n ) dt )
≤ 4α + 12c ∫_α^1 √( fat_{ct/R̄}(G_max) log(2R̄/t) / n ) dt
≤ 4α + 12c √( Log(k) Σ_{i=1}^k R_i² / n ) · 2R̄ ( log^{3/2}(2R̄/α) − (log 2R̄)^{3/2} ),

and choosing α = 1/√n yields

R_n(G_max) ≤ 4/√n + 12c √( Log(k) Σ_{i=1}^k R_i² / n ) · 2R̄ ( log^{3/2}(2R̄√n) − (log 2R̄)^{3/2} )
≤ C √( Log(k) log³(R̄n) R̄² Σ_{i=1}^k R_i² / n ).

4.5 Proof of Theorem 7

Proof [of Theorem 7] Let G_max : R^k → R be the k-fold maximum aggregation rule, as defined in (4), and let F_1, …, F_k ⊆ R^Ω be identical function classes consisting of all real-valued affine functions. Note that G_max is an aggregation that commutes with shifts, as defined in (7). By (10), there is an r ∈ R^Ω such that fat_γ(G_max) = fat⁰_γ(G_max − r). As in (22), put F′_i := F_i − r and G′_max := G_max − r = G_max(F′_1, …, F′_k). Define G̃_max = sign(G′_max) and F̃_i = sign(F′_i). Since sign and max commute, we have G̃_max = max_{i∈[k]}(F̃_i). We claim that

fat⁰_γ(G′_max) ≤ vc(G̃_max). (26)

Indeed, any S ⊆ Ω that is γ-shattered with shift r = 0 by any G ⊆ R^Ω is also VC-shattered by sign(G). (See Section 4.1, and notice that the converse implication and the reverse inequality do not hold.) It holds that

d + 1 =(a) vc(F̃_i) =(b) fat⁰_γ(F′_i) =(c) fat_γ(F′_i) =(d) fat_γ(F_i),

where (a) follows from a standard argument (e.g., Mohri et al. (2012, Example 3.2)), (b) holds because any S ⊆ R^d that is VC-shattered by sign(F′_i) is also γ-shattered by F′_i with shift r = 0, (c) follows from Lemma 24, since the class satisfies the closure property (34), and (d) holds since the shattering remains the same for the shifted class. Now the argument of Blumer et al. (1989, Lemma 3.2.3) applies:

vc(G̃_max) ≤ 2(d + 1)k log(3k) (27)

(this holds for any k-fold aggregation function, not just the maximum).
Combining (26) with (27) proves the claim.

4.6 Proof of Theorem 8

Proof [of Theorem 8] It follows from Mohri et al. (2012, Example 3.2) that $\mathrm{vc}(\mathrm{sign}(F_i)) = d + 1$. Since $F_i$ is closed under scalar multiplication, a scaling argument shows that any $S \subseteq \mathbb{R}^d$ that is VC-shattered by $\mathrm{sign}(F_i)$ is also $\gamma$-shattered by $F_i$ with shift $r = 0$, whence $\widetilde{\mathrm{fat}}_\gamma(F_i) = d + 1$ for all $\gamma > 0$; invoking Lemma 24 extends this to $\mathrm{fat}_\gamma(F_i)$ as well. Now Csikós et al. (2019, Theorem 1) shows that the k-fold unions of half-spaces necessarily shatter some set $S \subseteq \mathbb{R}^d$ of size at least $cdk\log k$. Since union is a special case of the max operator, and the latter commutes with sign, the scaling argument shows that this $S$ is $\gamma$-shattered by $G_{\max}$ with shift $r = 0$. Hence, $\mathrm{fat}_\gamma(G_{\max}) \ge \widetilde{\mathrm{fat}}_\gamma(G_{\max}) \ge |S|$, which proves the claim.

5. Discussion

In this paper, we proved upper and lower bounds on the fat-shattering dimension of aggregation rules as a function of the fat-shattering dimension of the component classes. We leave some remaining gaps for future work. First, for aggregation rules that commute with shifts and commute with truncation, assuming $\mathrm{fat}_\gamma(F_i) \le d$ for $1 \le i \le k$, we show in Theorem 1 that
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \le Cdk\,\mathrm{Log}^2(dk), \qquad \gamma > 0,$$
where $C > 0$ is a universal constant. We pose the following

Open problem. Let $G$ be an aggregation rule with the properties as in Theorem 1. Is it the case that for all $F_i \subseteq \mathbb{R}^\Omega$ with $\mathrm{fat}_\gamma(F_i) \le d$, $i \in [k]$, we have
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \le Cdk\,\mathrm{Log}(k), \qquad \gamma > 0, \qquad (28)$$
for some universal $C > 0$? In light of Theorem 8, this is the best one could hope for in general. We also pose the following conjecture about bounded affine functions.

Conjecture 11 Theorem 3 is tight up to constants. For $R_i$-bounded affine functions and an aggregation rule $G$ that is 1-Lipschitz in supremum norm,
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \ge C\,\frac{\mathrm{Log}(k)}{\gamma^2}\sum_{i=1}^k R_i^2, \qquad 0 < \gamma < \min_{i\in[k]} R_i, \qquad (29)$$
where $C > 0$ is a universal constant.
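The way k-fold aggregation inflates combinatorial dimension can be seen already in one dimension. The brute-force Python check below uses unions of $k$ intervals on the line, a classical example whose VC dimension is exactly $2k$; this is an illustration only, not the half-space construction of Csikós et al.

```python
def k_interval_vc_shatters(n_points, k):
    """Brute-force check: do unions of k closed intervals shatter n_points
    distinct reals? A labeling is realizable iff its positive points form
    at most k consecutive runs along the sorted order."""
    realizable = 0
    for mask in range(2 ** n_points):
        labels = [(mask >> i) & 1 for i in range(n_points)]
        runs = sum(1 for i in range(n_points)
                   if labels[i] == 1 and (i == 0 or labels[i - 1] == 0))
        if runs <= k:
            realizable += 1
    return realizable == 2 ** n_points
```

Running the check confirms that $2k$ points are shattered while $2k+1$ are not, matching the linear-in-$k$ growth (without the $\log k$ factor, which requires dimension $d \ge 2$).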
Throughout the paper, we mentioned several mistaken claims in the literature. In this section, we briefly discuss the nature of these mistakes, which are, in a sense, variations on the same kind of error. We begin with Attias et al. (2019, Lemma 14), which incorrectly claimed that any partial function class $F$ has a disambiguation $\bar F$ such that $\mathrm{vc}(\bar F) \le \mathrm{vc}(F)$ (see Section 4.1 for the definitions). The mistake was pointed out to us by Yann Guermeur, and later, Alon et al. (2022, Theorem 11) showed that there exist partial classes $F$ with $\mathrm{vc}(F) = 1$ for which every disambiguation $\bar F$ has $\mathrm{vc}(\bar F) = \infty$. Kontorovich (2018) attempted to prove the bound stated in our Theorem 3 (up to constants, and only for linear classes). The argument proceeded via a reduction to the Boolean case, as in our proof of Theorem 7. It was correctly observed that if, say, some finite $S \subseteq \Omega$ is 1-shattered by $F_i$ with shift $r = 0$, then it is also VC-shattered by $\mathrm{sign}(F_i)$. Neglected was the fact that $\mathrm{sign}(F_i)$ might shatter additional points in $\Omega \setminus S$, and, in sufficiently high dimension, it necessarily will. The crux of the matter is that (26) holds in the dimension-dependent but not the dimension-free setting; again, this may be seen as a variant of the disambiguation mistake. Finally, the proof of Hanneke and Kontorovich (2019, Lemma 6) claims, in the first display, that the shattered set can be classified with large margin, which is incorrect: yet another variant of mistaken disambiguation.

Acknowledgments

We thank Steve Hanneke and Ramon van Handel for very helpful discussions; the latter, in particular, patiently explained to us how to prove Lemma 23. Roman Vershynin kindly gave us permission to share his example in Remark 20. We deeply thank the anonymous reviewers for their insightful comments and suggestions. This research was partially supported by the Israel Science Foundation (grant No.
1602/19), an Amazon Research Award, the Ben-Gurion University Data Science Research Center, the Cyber Security Research Center, Prime Minister's Office, and the Vatat Scholarship from the Israeli Council for Higher Education.

References

Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119-152. PMLR, 2020.

Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 658-671, 2022.

Gautier Appert and Olivier Catoni. New bounds for k-means and information k-means. arXiv preprint arXiv:2101.05728, 2021.

S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313-1328, 2004.

Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1172-1199, 2023.

Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT 2019, volume 98 of Proceedings of Machine Learning Research, pages 162-183. PMLR, 2019.

Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for adversarially robust learning. The Journal of Machine Learning Research, 23(1):7897-7927, 2022.

Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43-54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.

Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 56(2):174-190, 1998.
Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Comput., 1(1):151-160, 1989.

Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781-790, 2008.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929-965, 1989.

Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.

Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

Doron Cohen, Aryeh Kontorovich, Aaron Koolyk, and Geoffrey Wolfer. Dimension-free empirical entropy estimation. IEEE Transactions on Information Theory, 69(5):3190-3202, 2023. doi: 10.1109/TIT.2022.3232739.

Mónika Csikós, Nabil H. Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. J. Mach. Learn. Res., 20:81:1-81:8, 2019.

Hubert Haoyang Duan. Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective. PhD thesis, University of Waterloo, 2012.

Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290-330, 1967.

David Eisenstat. k-fold unions of low-dimensional concept classes. Inf. Process. Lett., 109(23-24):1232-1234, 2009.

David Eisenstat and Dana Angluin. The VC dimension of k-fold union. Inf. Process. Lett., 101(5):181-184, 2007.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983-1049, 2016.

Dylan J. Foster and Alexander Rakhlin. ℓ∞ vector contraction for Rademacher complexity. CoRR, abs/1911.06468, 2019.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119-139, 1997.

Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Near-tight closure bounds for the Littlestone and threshold dimensions. In Algorithmic Learning Theory, pages 686-696. PMLR, 2021.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract: COLT 2010). IEEE Transactions on Information Theory, 60(9):5750-5759, 2014.

Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, and Gabriel Nivasch. Learning convex polytopes with margin. In Neural Information Processing Systems (NIPS), 2018.

Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: novel proofs and tighter bounds. Theor. Comput. Sci., 796:99-113, 2019.

Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464-497, 1994.

Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 258-272. Springer, 2003.

Yegor Klochkov, Alexey Kroshnin, and Nikita Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics, 49(4):2206-2230, 2021.

Aryeh Kontorovich. Rademacher complexity of k-fold maxima of hyperplanes. 2018.

S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 152(1):37-55, 2003.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

Dolev Raviv, Tamir Hazan, and Margarita Osadchy. Hinge-minimax learner for the ensemble of hyperplanes. J. Mach. Learn. Res., 19:62:1-62:30, 2018.

M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, 164(2):603-648, 2006.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari.
Smoothness, low noise and fast rates. Advances in Neural Information Processing Systems, 23, 2010.

Michel Talagrand. Vapnik-Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31(3):1565-1582, 2003.

Roman Vershynin. High-Dimensional Probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018.

Roman Vershynin. Private communication, 2021.

Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.

Nikita Zhivotovskiy. A bound for k-fold maximum. 2022.

Appendix A. Auxiliary results

A.1 Properties of Aggregation Rules

Lemma 12 If $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz under $\|\cdot\|_p$, then $G : (\mathbb{R}^\Omega)^k \to \mathbb{R}^\Omega$ is L-Lipschitz in $L_p^{(k)}(\mu)$.

Proof
$$\|G(f_1,\dots,f_k) - G(f'_1,\dots,f'_k)\|_{L_p(\mu)}^p = \int_\Omega |G(f_1,\dots,f_k)(x) - G(f'_1,\dots,f'_k)(x)|^p\,d\mu(x)$$
$$= \int_\Omega |G(f_1(x),\dots,f_k(x)) - G(f'_1(x),\dots,f'_k(x))|^p\,d\mu(x) \le \int_\Omega L^p\,\|(f_1(x),\dots,f_k(x)) - (f'_1(x),\dots,f'_k(x))\|_p^p\,d\mu(x),$$
where the inequality follows from the assumption that $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz in $\|\cdot\|_p$. This proves
$$\|G(f_1,\dots,f_k) - G(f'_1,\dots,f'_k)\|_{L_p(\mu)} \le L\,\|(f_1,\dots,f_k) - (f'_1,\dots,f'_k)\|_{L_p^{(k)}(\mu)},$$
and hence the claim.

Proof [of Theorem 10] Suppose $p < \infty$, and let $g = G(f_1,\dots,f_k) \in G(F_1,\dots,F_k)$. For each $i \in [k]$, let $\hat F_i \subseteq F_i$ be a $t/(Lk^{1/p})$-cover of $F_i$. Let each $f_i$ be $t/(Lk^{1/p})$-covered by some $\hat f_i \in \hat F_i$, in the sense that $\|f_i - \hat f_i\|_{L_p(\mu)} \le t/(Lk^{1/p})$. Assuming that $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz in $\|\cdot\|_p$, Lemma 12 implies that $G : (\mathbb{R}^\Omega)^k \to \mathbb{R}^\Omega$ is L-Lipschitz in $L_p^{(k)}(\mu)$. Then it follows that $g$ is $t$-covered by $G(\hat f_1,\dots,\hat f_k)$, since
$$\|G(f_1,\dots,f_k) - G(\hat f_1,\dots,\hat f_k)\|_{L_p(\mu)}^p \le L^p\,\|(f_1,\dots,f_k) - (\hat f_1,\dots,\hat f_k)\|_{L_p^{(k)}(\mu)}^p = L^p\int_\Omega \|(f_1(x),\dots,f_k(x)) - (\hat f_1(x),\dots,\hat f_k(x))\|_p^p\,d\mu(x)$$
$$= L^p\sum_{i=1}^k \int_\Omega |f_i(x) - \hat f_i(x)|^p\,d\mu(x) \le L^p k\left(\frac{t}{Lk^{1/p}}\right)^p = t^p,$$
and so $\|G(f_1,\dots,f_k) - G(\hat f_1,\dots,\hat f_k)\|_{L_p(\mu)} \le t$. We conclude that $G(F_1,$
$\dots, F_k)$ has a $t$-cover of size $|\hat F_1 \times \hat F_2 \times \cdots \times \hat F_k|$, which proves the claim. The case $p = \infty$ is proved analogously (or, alternatively, as a limiting case of $p < \infty$).

We show that natural aggregations are Lipschitz in $\|\cdot\|_p$ norms, $p \in [1,\infty)$, and in supremum norm. The following facts are elementary:
$$|a \vee b - c \vee d| \le |a - c| \vee |b - d|, \qquad a, b, c, d \in \mathbb{R}; \qquad (30)$$
$$|a \wedge b - c \wedge d| \le |a - c| \vee |b - d|, \qquad a, b, c, d \in \mathbb{R}, \qquad (31)$$
where $s \vee t := \max\{s,t\}$ and $s \wedge t := \min\{s,t\}$.

Lemma 13 (Maximum aggregation is 1-Lipschitz) Let $G_{\max} : \mathbb{R}^k \to \mathbb{R}$ be the maximum aggregation; then for any $x, x' \in \mathbb{R}^k$ and $p \in [1,\infty]$, $|G_{\max}(x) - G_{\max}(x')| \le \|x - x'\|_p$.

Proof For $k = 2$ and $p = \infty$, the claim follows from the stronger, pointwise inequality (30). The proof follows by simple induction on $k$. Since $\|\cdot\|_\infty \le \|\cdot\|_p$, we conclude the proof for $p \in [1,\infty]$.

Lemma 14 (Median aggregation is 1-Lipschitz) Let $G_{\mathrm{med}} : \mathbb{R}^k \to \mathbb{R}$ be the median aggregation; then for any $x, x' \in \mathbb{R}^k$ and $p \in [1,\infty]$, $|G_{\mathrm{med}}(x) - G_{\mathrm{med}}(x')| \le \|x - x'\|_p$.

Proof Denote by $x_{(1)},\dots,x_{(k)}$ the ascending order of a sequence $x_1,\dots,x_k$, that is, $x_{(1)} \le \dots \le x_{(k)}$. For all $x, x' \in \mathbb{R}^k$ we have
$$|G_{\mathrm{med}}(x_1,\dots,x_k) - G_{\mathrm{med}}(x'_1,\dots,x'_k)| = |x_{(\lceil k/2\rceil)} - x'_{(\lceil k/2\rceil)}| \le \max_{i\in[k]}|x_{(i)} - x'_{(i)}| \le \max_{i\in[k]}|x_i - x'_i|,$$
where the last inequality follows from Cohen et al. (2023, Eq. (16)).⁶ Since $\|\cdot\|_\infty \le \|\cdot\|_p$, we conclude the proof for $p \in [1,\infty]$.

Lemma 15 (Max-Min aggregation is 1-Lipschitz) Let $G_{\text{max-min}} : \mathbb{R}^{k\times\ell} \to \mathbb{R}$ be the max-min aggregation; then for any $x, x' \in \mathbb{R}^{k\times\ell}$ and $p \in [1,\infty]$, $|G_{\text{max-min}}(x) - G_{\text{max-min}}(x')| \le \|x - x'\|_p$.

⁶ Stated there for distributions, but true for all vectors, by the same argument.

Proof The inequalities (30), (31) imply that the k-fold max and min aggregations are both 1-Lipschitz with respect to $\|\cdot\|_\infty$. Hence, for all $x, y \in \mathbb{R}^{k\times\ell}$, we have
$$\left|\min_{i\in[k]} x_{ij} - \min_{i\in[k]} y_{ij}\right| \le \max_{i\in[k]}|x_{ij} - y_{ij}|, \qquad j \in [\ell],$$
and further,
$$\left|\max_{j\in[\ell]}\min_{i\in[k]} x_{ij} - \max_{j\in[\ell]}\min_{i\in[k]} y_{ij}\right| \le \max_{j\in[\ell]}\max_{i\in[k]}|x_{ij} - y_{ij}|.$$
This proves that $|G_{\text{max-min}}(x) - G_{\text{max-min}}(x')| \le \|x - x'\|_\infty$. Since $\|\cdot\|_\infty \le \|\cdot\|_p$, the claim holds for all $p \in [1,\infty]$.
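Lemmas 13-15 admit a quick randomized sanity check. The sketch below assumes the lower median as the convention in (5) and a k-by-ℓ matrix encoding for max-min; both are assumptions for illustration, and any fixed order statistic is 1-Lipschitz by the same argument.

```python
import random

def g_max(x):
    return max(x)

def g_med(x):
    return sorted(x)[(len(x) - 1) // 2]   # lower median (assumed convention for (5))

def g_maxmin(x):                          # x is a k-by-l matrix: max over rows of row minima
    return max(min(row) for row in x)

random.seed(0)
for _ in range(1000):
    k, l = 4, 3
    x = [[random.uniform(-5, 5) for _ in range(l)] for _ in range(k)]
    y = [[random.uniform(-5, 5) for _ in range(l)] for _ in range(k)]
    flat_x = [v for row in x for v in row]
    flat_y = [v for row in y for v in row]
    d_inf = max(abs(a - b) for a, b in zip(flat_x, flat_y))
    # each aggregation moves by at most the sup-norm perturbation
    assert abs(g_max(flat_x) - g_max(flat_y)) <= d_inf + 1e-12
    assert abs(g_med(flat_x) - g_med(flat_y)) <= d_inf + 1e-12
    assert abs(g_maxmin(x) - g_maxmin(y)) <= d_inf + 1e-12
```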
Lemma 16 (Median aggregation commutes with truncation) Let $G_{\mathrm{med}} : \mathbb{R}^k \to \mathbb{R}$ be the median aggregation; then $G_{\mathrm{med}}$ commutes with truncation. That is, for any $\gamma > 0$ and $x \in \mathbb{R}^k$,
$$G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k) \in D([G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma)$$
for all disambiguations $\tilde x_i \in D([x_i]_\gamma)$, $i \in [k]$.

Proof Fix $\gamma > 0$ and denote by $x_{(1)},\dots,x_{(k)}$ the ascending order of a sequence $x_1,\dots,x_k$. Now for $\tilde x_i \in D([x_i]_\gamma) \subseteq \{0,1\}$, our definition of the median (5) implies that $G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k) \in \{0,1\}$. It remains to perform an exhaustive verification of the possible cases. If $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = *$, then any value in $\{0,1\}$ is valid. If $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = 0$, it means that $G_{\mathrm{med}}(x_1,\dots,x_k)$ outputs a value smaller than $-\gamma$, which means that at least half of the inputs $x_1,\dots,x_k$ have a value smaller than $-\gamma$. Let these values be $x_{(1)},\dots,x_{(m)}$, where $m \ge \lceil k/2\rceil$. We have $[x_{(j)}]_\gamma = 0$ for $j \in [m]$ and $[x_{(\ell)}]_\gamma \in \{*, 1\}$ for $\ell \in \{m+1,\dots,k\}$. For any disambiguation $\tilde x_{(\ell)}$, $G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k)$ would still output 0, which is a valid disambiguation of $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma$. The case $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = 1$ follows from the same argument.

Lemma 17 (Max-Min aggregation commutes with truncation) Let $G_{\text{max-min}} : \mathbb{R}^{k\times\ell} \to \mathbb{R}$ be the max-min aggregation; then $G_{\text{max-min}}$ commutes with truncation. That is, for any $\gamma > 0$ and $x \in \mathbb{R}^{k\times\ell}$,
$$G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell}) \in D([G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma)$$
for all disambiguations $\tilde x_{ij} \in D([x_{ij}]_\gamma)$, $i \in [k]$, $j \in [\ell]$.

Proof Fix $\gamma > 0$. For any $i \in [k]$, denote by $x_{i(1)},\dots,x_{i(\ell)}$ the ascending order of the sequence $x_{i1},\dots,x_{i\ell}$. We assume $\tilde x_{ij} \in D([x_{ij}]_\gamma) \subseteq \{0,1\}$, and $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ outputs a value in $\{0,1\}$ by our definition of the max-min. We check all possible outputs of $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma$ and verify that $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ is a valid disambiguation. If $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = *$, then any value in $\{0,1\}$ is valid. If $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = 0$, it means that $G_{\text{max-min}}(x_{11},\dots,x_{k\ell})$ outputs a value smaller than $-\gamma$.
This means that all values that minimize each row, $x_{1(1)},\dots,x_{k(1)}$, are smaller than $-\gamma$, since the maximum of them is smaller than $-\gamma$. We have $[x_{i(1)}]_\gamma = 0$ for $i \in [k]$. For any disambiguation $\tilde x_{ij}$, $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ would still output 0, which is a valid disambiguation of $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma$. The case $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = 1$ follows from the same argument.

A.2 Covering numbers and the fat-shattering dimension

In this section, we summarize some known results connecting the covering numbers of a bounded function class to its fat-shattering dimension.

Lemma 18 (Talagrand (2003), Proposition 1.4) For any $F \subseteq [-R,R]^\Omega$, there exists a probability measure $\mu$ on $\Omega$ such that
$$N(F, L_2(\mu), t) \ge 2^{C\,\mathrm{fat}_{2t}(F)}, \qquad 0 < t < R, \qquad (32)$$
where $C > 0$ is a universal constant. Moreover, $\mu$ may be taken to be the uniform distribution on any $2t$-shattered subset of $\Omega$.

Remark. The tightness of (32) is trivially demonstrated by the example $F = \{-\gamma,\gamma\}^n$.

Lemma 19 (Mendelson and Vershynin (2003), Theorem 1) For all $F \subseteq [-1,1]^\Omega$ and all probability measures $\mu$,
$$N(F, L_2(\mu), t) \le \left(\frac{2}{t}\right)^{C\,\mathrm{fat}_{ct}(F)}, \qquad 0 < t < 1, \qquad (33)$$
where $C, c > 0$ are universal constants.

Remark 20 The following example, due to Vershynin (2021), shows that (33) is tight. Take $\Omega = [n]$ and $F = [-1,1]^\Omega$. Then, for all sufficiently small $t > 0$, we have $\mathrm{fat}_t(F) = n$. However, a simple volumetric calculation shows that $N(F, L_2(\mu), t)$ behaves as $(C/t)^n$ for small $t$, where $C > 0$ is a constant.

Lemma 21 (Rudelson and Vershynin (2006)) Suppose that $p \in [2,\infty)$, $\mu$ is a probability measure on $\Omega$, and $R > 0$. If $F \subseteq L_p(\Omega,\mu)$ satisfies $\sup_{f\in F}\|f\|_{L_{2p}(\mu)} \le R$, then
$$\log N(F, L_p(\mu), t) \le Cp^2\,\mathrm{fat}_{ct}(F)\log\frac{R}{ct}, \qquad 0 < t < R;$$
furthermore, for all $\varepsilon > 0$, if $\sup_{f\in F}\|f\|_{L_\infty(\mu)} \le R$, then
$$\log N(F, L_\infty(\mu), t) \le Cv\log\frac{Rn}{vt}\,\log^\varepsilon\frac{n}{v}, \qquad 0 < t < R,$$
where $n = |\Omega|$, $v = \mathrm{fat}_{c\varepsilon t}(F)$, and $C, c > 0$ are universal constants.
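Returning to Lemma 16: its case analysis can be verified by exhaustive enumeration. The truncation map used below (0 below $-\gamma$, 1 above $\gamma$, $*$ in between) is an assumed form of $[\cdot]_\gamma$, and the lower median stands in for definition (5); both are labeled assumptions for this sketch.

```python
import itertools
import random

def trunc(t, gamma):
    # assumed three-valued truncation: 0 below -gamma, 1 above +gamma, '*' in between
    return 0 if t <= -gamma else 1 if t >= gamma else "*"

def disamb(v):
    # the set of valid disambiguations of a partial value
    return (0, 1) if v == "*" else (v,)

def med(xs):
    return sorted(xs)[(len(xs) - 1) // 2]   # lower median

random.seed(1)
gamma = 1.0
for _ in range(200):
    x = [random.uniform(-3, 3) for _ in range(5)]
    target = disamb(trunc(med(x), gamma))
    # the median of every disambiguation must itself be a valid disambiguation
    for xt in itertools.product(*(disamb(trunc(t, gamma)) for t in x)):
        assert med(xt) in target
```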
A.3 Covering numbers of linear and affine classes

Let $B \subseteq \mathbb{R}^d$ be the d-dimensional Euclidean unit ball and $F = \{x \mapsto w\cdot x + b ;\, \|w\| \vee |b| \le R\}$ be the collection of R-bounded affine functions on $\Omega = B$.

Remark 22 There is a trivial reduction from an R-bounded affine class in $d$ dimensions to a 2R-bounded linear class in $d+1$ dimensions, via the standard trick of adding an extra dummy dimension. This only affects the covering number bounds up to constants.

For $\Omega_n \subseteq B$, $|\Omega_n| = n$, define $F(\Omega_n) = F|_{\Omega_n}$, and endow $\Omega_n$ with the uniform measure $\mu_n$. Zhang (2002, Theorem 4) implies the covering number estimate
$$\log N(F(\Omega_n), L_\infty(\mu_n), t) \le C\,\frac{R^2}{t^2}\,\mathrm{Log}(n),$$
where $C > 0$ is a universal constant (Zhang's result is more general and allows to compute explicit constants). We will use the following sharper bound:

Lemma 23
$$\log N(F(\Omega_n), L_\infty(\mu_n), t) \le C\,\frac{R^2}{t^2}\,\mathrm{Log}\!\left(2 + \frac{mt^2}{R^2}\right), \qquad 0 < t < R,$$
where $m = \min\{n, d\}$ and $C > 0$ is a universal constant.

Proof The result is folklore knowledge, but we provide a proof for completeness. Let $B = B_2^d$ be the Euclidean unit ball and $X = \{x_1,\dots,x_n\} \subseteq B$. This induces the set $F = \{(w\cdot x_1, w\cdot x_2,\dots, w\cdot x_n) ;\, w \in B\} \subseteq \mathbb{R}^n$. We argue that there is no loss of generality in assuming $n \le d$. Indeed, if $n > d$, then $X$ is spanned by some $X' = \{x'_1,\dots,x'_d\} \subseteq B$ and $F \subseteq \mathrm{span}(X')$ is also an (at most) d-dimensional set. Thus, we assume $n \le d$ henceforth. Via a standard infinitesimal perturbation, we can assume that $X$ is a linearly independent set (i.e., spans $\mathbb{R}^n$). If we treat $X$ as an $n \times d$ matrix, then $F = XB$, which means that $F$ is an ellipsoid. We are interested in estimating the $\ell_\infty$ covering numbers of $F$. Let $K \subseteq \mathbb{R}^d$ be such that $XK = L$, where $L = B_\infty^n$ is the unit cube. (The existence of a $K$ such that $XK \subseteq L$ is obvious, but because we assumed that $X$ spans $\mathbb{R}^n$, every point in $[-1,1]^n$ has a pre-image under $X$.) Let us compute the polar body $K^\circ$, defined as
$$K^\circ = \left\{u \in \mathbb{R}^d : \sup_{v\in K} v\cdot u \le 1\right\}.$$
We claim that
$$K^\circ = \mathrm{absconv}(X) := \left\{\sum_{i=1}^n \alpha_i x_i ;\, \sum_{i=1}^n |\alpha_i| \le 1\right\}.$$
Indeed, consider a $z = \sum_{i=1}^n \alpha_i x_i \in \mathrm{absconv}(X)$.
Then, for any $v \in K$, we have
$$v\cdot z = \sum_{i=1}^n \alpha_i(v\cdot x_i) \le \sum_{i=1}^n |\alpha_i| \le 1 \implies z \in K^\circ,$$
where we have used $|v\cdot x_i| \le 1$, since $XK = L = B_\infty^n = [-1,1]^n$. This shows that $\mathrm{absconv}(X) \subseteq K^\circ$. On the other hand, consider any $u \in K^\circ$. There is no loss of generality in assuming that $u$ is in the span of $X$, that is, $u = \sum_{i=1}^n \alpha_i x_i$, for $\alpha_i \in \mathbb{R}$. By definition of $u \in K^\circ$, we have
$$\sup_{v\in K} v\cdot u = \sup_{v\in K} v\cdot\sum_{i=1}^n \alpha_i x_i = \sup_{v\in K}\sum_{i=1}^n \alpha_i(v\cdot x_i) \le 1.$$
Now because $XK = [-1,1]^n$, for each choice of $\alpha \in \mathbb{R}^n$, there is a $v \in K$ such that $v\cdot x_i = \mathrm{sign}(\alpha_i)$ for all $i \in [n]$. This shows that we must have $\sum_{i=1}^n |\alpha_i| \le 1$, and proves $K^\circ \subseteq \mathrm{absconv}(X)$.

It is well-known (and easy to verify) that covering numbers enjoy an affine invariance:
$$N(F, L) := N(XB, XK) = N(B, K),$$
where $N(A, B)$, for two sets $A, B$, is the smallest number of copies of $B$ necessary to cover $A$. Now the seminal result of Artstein et al. (2004) applies: for all $t > 0$,
$$\log N(B, tK) \le a\log N(K^\circ, btB),$$
where $a, b > 0$ are universal constants. This reduces the problem to estimating the $\ell_2$-covering numbers of $\mathrm{absconv}(X)$. The latter may be achieved via Maurey's method (Vershynin, 2018, Corollary 0.0.4 and Exercise 0.0.6): the $t$-covering number of $\mathrm{absconv}(rX)$ under $\ell_2$ is at most $(c + cmt^2/r^2)^{r^2/t^2}$, where $c > 0$ is a universal constant.

A.4 Fat-shattering dimension of linear and affine classes

In this section, $\Omega = \mathbb{R}^d$ and $B \subseteq \mathbb{R}^d$ denotes the Euclidean unit ball. A function $f : \Omega \to \mathbb{R}$ is said to be affine if it is of the form $f(x) = w\cdot x + b$, for some $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, where $\cdot$ denotes the Euclidean inner product.

Throughout the paper, we have referred to R-bounded affine function classes as those for which $\|w\| \vee |b| \le R$. In this section, we define the larger class of R-semi-bounded affine functions as those for which $\|w\| \le R$, but $b$ may be unbounded. In particular, the covering-number results (and the reduction to linear classes spelled out in Remark 22) do not apply to semi-bounded affine classes. The following simple result may be of independent interest.
Lemma 24 Let $F \subseteq \mathbb{R}^\Omega$ be some collection of functions with the closure property
$$f, g \in F \implies (f - g)/2 \in F. \qquad (34)$$
Then, for all $\gamma > 0$, we have $\mathrm{fat}_\gamma(F) = \widetilde{\mathrm{fat}}_\gamma(F)$.

Proof Suppose that some set $\{x_1,\dots,x_k\}$ is $\gamma$-shattered by $F$. That means that there is an $r \in \mathbb{R}^k$ such that for all $y \in \{-1,1\}^k$, there is an $f = f_y \in F$ for which
$$\gamma \le y_i(f(x_i) - r_i), \qquad i \in [k]. \qquad (35)$$
Now for any $y \in \{-1,1\}^k$, let $\hat f = f_y$ and $\check f = f_{-y}$. Then, for each $i \in [k]$, we have
$$\gamma \le y_i(\hat f(x_i) - r_i), \qquad \gamma \le -y_i(\check f(x_i) - r_i).$$
It follows that $f^* = (\hat f - \check f)/2$ achieves (35), for the given $y$, with $r \equiv 0$. Now (34) implies that the function defined by $f^*$ belongs to $F$, which completes the proof.

Now it is well-known (Bartlett and Shawe-Taylor, 1999, Theorem 4.6) that bounded linear functions, i.e., function classes on $B$ of the form $F = \{x \mapsto w\cdot x ;\, \|w\| \le R\}$, also known as homogeneous hyperplanes, satisfy $\mathrm{fat}_\gamma(F) \le (R/\gamma)^2$. The discussion in Hanneke and Kontorovich (2019, p. 102) shows that the common approach of reducing the general (affine) case to the linear (homogeneous, $b = 0$) case, via the addition of a dummy coordinate, incurs a large suboptimal factor in the bound. Hanneke and Kontorovich (2019, Lemma 6) is essentially an analysis of the fat-shattering dimension of bounded affine functions. Although this result contains a mistake (see Section 5), much of the proof technique can be salvaged:

Lemma 25 The semi-bounded affine function class on $B$ defined by $F = \{x \mapsto w\cdot x + b ;\, \|w\| \le R\}$ in $d$ dimensions satisfies
$$\mathrm{fat}_\gamma(F) \le \min\left\{d + 1,\, \left(\frac{3R}{\gamma}\right)^2\right\}.$$

Proof Since $F$ satisfies (34), it suffices to consider $\widetilde{\mathrm{fat}}_\gamma(F)$, and so the shattering condition simplifies to
$$\gamma \le y_i(w\cdot x_i + b), \qquad i \in [k]. \qquad (36)$$
Now $\widetilde{\mathrm{fat}}_\gamma(F)$ is always upper-bounded by the VC-dimension of the corresponding class thresholded at zero, i.e., $\mathrm{sign}(F)$. For d-dimensional inhomogeneous hyperplanes, the latter is exactly $d + 1$ (Mohri et al., 2012, Example 3.2). Having dispensed with the dimension-dependent part of the bound, we now focus on the R-dependent one.
Let us observe, as in Hanneke and Kontorovich (2019, Lemma 6), that for $\|x_i\| \le 1$ and $\|w\| \vee \gamma \le R$, one can always realize (36) with $|b| \le 2R$; this is what we shall assume, without loss of generality, henceforth. Summing up the $k$ inequalities in (36) yields
$$k\gamma \le \sum_{i=1}^k y_i(w\cdot x_i + b) = w\cdot\sum_{i=1}^k y_i x_i + b\sum_{i=1}^k y_i \le R\left\|\sum_{i=1}^k y_i x_i\right\| + 2R\left|\sum_{i=1}^k y_i\right|.$$
Letting $y$ be drawn uniformly from $\{-1,1\}^k$ and taking expectations, we have
$$k\gamma \le R\,\mathbb{E}\left\|\sum_{i=1}^k y_i x_i\right\| + 2R\,\mathbb{E}\left|\sum_{i=1}^k y_i\right| \le R\sqrt{\sum_{i=1}^k \|x_i\|^2} + 2R\sqrt{\sum_{i=1}^k \mathbb{E} y_i^2} \le 3R\sqrt{k}.$$
Isolating $k$ on the left-hand side of the inequality proves the claim: $k \le (3R/\gamma)^2$.

Following a referee's suggestion, we improve the constant as follows. Note that
$$\mathbb{E}\left|\sum_{i=1}^k y_i\right| = 2^{-k}\sum_{i=0}^k \binom{k}{i}|k - 2i| = k\,2^{1-k}\binom{k-1}{\lfloor (k-1)/2\rfloor} \le \sqrt{\frac{2}{\pi}}\cdot\frac{k}{\sqrt{k-1}},$$
where the inequality follows from a binomial coefficient estimate via Stirling's approximation. Thus,
$$k\gamma \le R\sqrt{k} + 2R\sqrt{\frac{2}{\pi}}\cdot\frac{k}{\sqrt{k-1}},$$
which proves that $k \le \left(\left(1 + 2\sqrt{2/\pi}\right)\frac{R}{\gamma}\right)^2$ up to lower-order terms, improving the constant $3$ to $1 + 2\sqrt{2/\pi} < 2.6$.

A.5 Concavity miscellanea

The results below are routine exercises in differentiation and Jensen's inequality.

Lemma 26 For $u > 0$, the function $x \mapsto x\log(u/x)$ is concave on $(0,\infty)$.

Corollary 27 For all $u > 0$ and $v_i > 0$, $i \in [k]$,
$$\sum_{i=1}^k v_i\log(u/v_i) \le \left(\sum_{i=1}^k v_i\right)\log\frac{uk}{\sum_{i=1}^k v_i}.$$

Lemma 28 For $0 \le \varepsilon \le \log 2$ and $u \ge 2$, the function $x \mapsto x\log^{1+\varepsilon}(u/x)$ is concave on $[1,\infty)$. It follows that for $\varepsilon, u$ as above and $v_i \ge 1$, $i \in [k]$,
$$\sum_{i=1}^k v_i\log^{1+\varepsilon}(u/v_i) \le \left(\sum_{i=1}^k v_i\right)\log^{1+\varepsilon}\frac{uk}{\sum_{i=1}^k v_i}.$$
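Several estimates from this appendix can be checked mechanically. The sketch below verifies (i) the classical explicit construction witnessing $\mathrm{fat}_\gamma \ge (R/\gamma)^2$ for homogeneous hyperplanes, the matching lower bound to the Bartlett and Shawe-Taylor upper bound quoted before Lemma 25, (ii) the exact binomial identity for $\mathbb{E}|\sum_i y_i|$ used in the referee's improvement, and (iii) Corollary 27 on random inputs; the specific numeric choices are illustrative.

```python
import itertools
import math
import random
from math import comb

# (i) gamma-shatter k = (R/gamma)^2 orthonormal points e_1, ..., e_k in the
#     unit ball using w = gamma * sum_i y_i e_i, which satisfies ||w|| <= R.
R, gamma, d = 1.0, 1.0 / 3.0, 12
k = round((R / gamma) ** 2)               # 9 points
for y in itertools.product([-1, 1], repeat=k):
    w = [gamma * y[i] for i in range(k)] + [0.0] * (d - k)
    assert math.sqrt(sum(c * c for c in w)) <= R + 1e-9
    for i in range(k):
        assert y[i] * w[i] >= gamma - 1e-9   # y_i * (w . e_i) = y_i * w_i >= gamma

# (ii) exact expectation E|y_1 + ... + y_k| over uniform sign vectors,
#      computed by enumerating binomial counts, against the closed form.
def e_abs_sum(m):
    return sum(comb(m, i) * abs(m - 2 * i) for i in range(m + 1)) / 2 ** m

for m in range(1, 12):
    assert abs(e_abs_sum(m) - m * 2 ** (1 - m) * comb(m - 1, (m - 1) // 2)) < 1e-12

# (iii) Corollary 27 (Jensen's inequality for x -> x log(u/x)) on random data
random.seed(2)
for _ in range(1000):
    kk = random.randint(1, 8)
    u = random.uniform(0.1, 10.0)
    vs = [random.uniform(0.01, 5.0) for _ in range(kk)]
    s = sum(vs)
    assert sum(v * math.log(u / v) for v in vs) <= s * math.log(u * kk / s) + 1e-9
```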