Journal of Machine Learning Research 25 (2024) 1-29 Submitted 10/21; Revised 4/24; Published 5/24

Fat-Shattering Dimension of k-fold Aggregations

Idan Attias idanatti@post.bgu.ac.il
Aryeh Kontorovich karyeh@cs.bgu.ac.il
Department of Computer Science, Ben-Gurion University of the Negev, Beer Sheva, Israel

Editor: John Shawe-Taylor

Abstract

We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing k functions, one from each of the k classes, and computing pointwise an aggregate function of these, such as the median, mean, and maximum. The bounds are stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on k. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

Keywords: combinatorial dimension, scale-sensitive dimension, fat-shattering dimension, aggregation rules, k-fold maximum, ensemble methods

1. Introduction

The fat-shattering dimension, also known as the scale-sensitive dimension and the parametrized variant of the P-dimension, was first defined by Kearns and Schapire (1994); its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett and Long, 1998). In this paper, we study the behavior of the fat-shattering dimension under various k-fold aggregations. Let F_1, …, F_k ⊆ R^Ω be real-valued function classes, and let G : R^k → R be an aggregation rule. We consider the aggregate function class G(F_1, …, F_k), which consists of all mappings x ↦ G(f_1(x), …, f_k(x)), for f_1 ∈ F_1, …, f_k ∈ F_k. Some natural aggregation rules include the pointwise k-fold maximum, median, mean, and max-min.
We seek to bound the fat-shattering complexity of G(F_1, …, F_k) in terms of the fat-shattering dimensions of the constituent F_i's. This question naturally arises in the context of ensemble methods, such as boosting and bagging, where the learner's prediction consists of an aggregation of base learners. The analogous question regarding aggregations of VC classes (the VC dimension being the combinatorial complexity controlling the learnability of Boolean function classes) has been studied in detail and largely resolved (Baum and Haussler, 1989; Blumer et al., 1989; Eisenstat and Angluin, 2007; Eisenstat, 2009; Csikós et al., 2019). Furthermore, closure properties were also studied in the context of online classification and private PAC learning (Alon et al., 2020; Ghazi et al., 2021) for the Littlestone and Threshold dimensions. However, for real-valued functions, this question remained largely uninvestigated.

∗. A previous version of this paper was titled "Fat-shattering dimension of k-fold maxima".
©2024 Idan Attias and Aryeh Kontorovich. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1193.html.

1.1 Our Contributions

For a natural class of aggregation rules that commute with shifts (see definition (7)) and commute with truncation (see definition (21)), assuming fat_γ(F_i) ≤ d for 1 ≤ i ≤ k, we show that

fat_γ(G(F_1, …, F_k)) ≤ O(dk log²(dk)), γ > 0.

In particular, this result holds for the maximum, minimum, median, and max-min aggregations. The formal statement is given in Theorem 1.

By using an entirely different approach, for aggregations that are L-Lipschitz (L ≥ 1) in the supremum norm (see definition (8)) and for bounded function classes F_1, …, F_k ⊆ [−R, R]^Ω with fat_{cεγ}(F_i) ≤ d for 1 ≤ i ≤ k, we show that

fat_γ(G(F_1, …, F_k)) ≤ O(dk log^{1+ε}(LRk/γ)), 0 < γ/L < R and 0 < ε < log 2.
In particular, this result holds for the maximum, minimum, median, mean, and max-min aggregations. The formal statement is given in Theorem 2.

For R-bounded affine functions and for aggregations that are L-Lipschitz in the supremum norm, we show the following dimension-free bound:

fat_γ(G(F_1, …, F_k)) ≤ O(L²R²k log(k)/γ²), 0 < γ/L < R.

This result also extends to the hinge-loss class of affine functions. In particular, this result holds for the maximum, minimum, median, mean, and max-min aggregations. We improve by a log factor the estimate of Fefferman et al. (2016, Lemma 6) on the fat-shattering dimension of the max-min aggregation of linear functions. The formal statement is given in Theorem 3. Furthermore, in Corollary 5 we show an upper bound on the Rademacher complexity of the k-fold maximum aggregation of affine functions and hinge-loss affine functions. Our bound scales as √(k log k), improving upon Raviv et al. (2018), where the dependence on k is linear.

For affine functions and the k-fold maximum aggregation, we show tight dimension-dependent bounds (up to constants):

fat_γ(G_max(F_1, …, F_k)) = Θ(dk log k), γ > 0,

where d is the Euclidean dimension. For the formal statements, see Theorems 7 and 8.

1.2 Applications

The need to analyze the combinatorial complexity of a k-fold maximum of function classes (see (4) for the formal definition) arises in a number of diverse settings. One natural example is adversarially robust PAC learning of real-valued functions with respect to test-time attacks (Attias et al., 2022; Attias and Hanneke, 2023). In this setting, the learner observes an i.i.d. labeled sample from an unknown distribution, and the goal is to output a hypothesis with a small error on unseen examples from the same distribution, with high probability. The difference from the standard PAC learning model is that at test time, the learner only observes a corrupted example, while the prediction is tested on the original label.
Formally, (x, y) is drawn from the unknown distribution, and there is an adversary that can map x to one of k possible corruptions z that are known to the learner. The learner observes only z, while its loss is with respect to the original label y. This scenario is naturally captured by the k-fold max: the learner aims to learn the maximum aggregation of the loss classes. Attias et al. (2022) showed that uniform convergence holds in this case, and so the sample complexity of an empirical risk minimization algorithm is determined by the complexity measure of the k-fold maximum aggregation.

Analyzing the k-fold maximum also arises in the setting of learning polyhedra with a margin. Gottlieb et al. (2018) provided a learning algorithm that represents polyhedra as intersections of bounded affine functions. The sample complexity of the algorithm is determined by the complexity measure of the maximum aggregation of affine function classes.

Another natural example where the k-fold maximum and k-fold max-min play a role is in analyzing the convergence of k-means clustering. Fefferman et al. (2016) bounded the max-min aggregation, and Klochkov et al. (2021); Biau et al. (2008); Appert and Catoni (2021); Zhivotovskiy (2022) bounded the max aggregation. The main challenge in this setting is bounding the covering numbers of the aggregation over k function classes, which can be obtained by bounding the Rademacher complexity or the fat-shattering dimension.

Finally, there are numerous ensemble methods for regression that output some aggregation of base learners, such as the median or mean. Examples of these methods include boosting (e.g., Freund and Schapire (1997); Kégl (2003)), bagging (bootstrap aggregation) by Breiman (1996), and its extension to the random forest algorithm (Breiman, 2001).

1.3 Related Work

It was claimed in Attias et al. (2019, Theorem 12) that

fat_γ(G_max(F_1, …, F_k)) ≤ 2 log(3k) Σ_{i=1}^k fat_γ(F_i),

but the proof had a mistake (see Section 5); our Open Problem (28) asks if the general form of the bound does hold (we conjecture it does, at least for the max aggregation). Using the recent disambiguation result of Alon et al. (2022), presented in Lemma 9 here, Attias et al. (2022, Lemma 15) obtained the bound

fat_γ(G_max(F_1, …, F_k)) ≤ O( log(k) log²(|Ω|) Σ_{i=1}^k fat_γ(F_i) ), (1)

where Ω is the domain of the function classes F_1, …, F_k. The latter is, in general, incomparable to Theorem 1. However, for large or infinite Ω, Theorem 1 is clearly a considerable improvement over (1). Using the covering number results of Mendelson and Vershynin (2003); Talagrand (2003) (see Section A.2), Duan (2012, Theorem 6.2) obtained a general result, which, when specialized to k-fold maxima, yields the bound (2), for a universal constant c > 0; (2) is an immediate consequence of Theorem 10 (with p = 2), Lemma 18, and Lemma 19 in this paper. Our results improve over (2) by removing the dependence on k in the scale of the fat-shattering dimensions; however, Duan's general method is applicable to a wider class of uniformly continuous k-fold aggregations. Srebro et al. (2010, Lemma A.2) bounded the fat-shattering dimension in terms of the Rademacher complexity. Foster and Rakhlin (2019) bounded the Rademacher complexity of a smooth k-fold aggregate; see also references therein. Inspired by Appert and Catoni (2021), Zhivotovskiy (2022) obtained the best known upper bound on the Rademacher complexity of k-fold maxima over linear function classes. Raviv et al. (2018) upper bounded the Rademacher complexity of the k-fold maximum aggregation of affine functions and hinge-loss affine functions.

2. Preliminaries

2.1 Aggregation Rules

A k-fold aggregation rule is any mapping G : R^k → R. Just as G maps k-tuples of reals into reals, it naturally aggregates k-tuples of functions into a single one: for f_1, …
, f_k : Ω → R, we define G(f_1, …, f_k) : Ω → R as the mapping x ↦ G(f_1(x), …, f_k(x)). Finally, the aggregation extends to k-tuples of function classes: for F_1, …, F_k ⊆ R^Ω, we define

G(F_1, …, F_k) := {x ↦ G(f_1(x), …, f_k(x)) : f_i ∈ F_i, i ∈ [k]}. (3)

Examples. A canonical example of an aggregation rule is the k-fold max, induced by the mapping

G_max(x_1, …, x_k) := max_{i∈[k]} x_i. (4)

The minimum is defined analogously as G_min(x_1, …, x_k) := min_{i∈[k]} x_i. The mean aggregation is defined as G_mean(x_1, …, x_k) := (1/k) Σ_{i=1}^k x_i. Denoting by x_(1), …, x_(k) the ascending order of a sequence x_1, …, x_k, that is, x_(1) ≤ … ≤ x_(k), the (lower¹) median is defined as

G_med(x_1, …, x_k) := x_(⌈k/2⌉). (5)

We also define G_max-min : R^{kℓ} → R as

G_max-min(x_11, …, x_kℓ) := max_{j∈[ℓ]} min_{i∈[k]} x_ij. (6)

Next, we consider some properties that an aggregation rule might possess.

Commuting with shifts. We say that an aggregation rule G commutes with shifts if

G(x) − r = G(x − r), x ∈ R^k, r ∈ R, (7)

where x − r is defined as (x_1 − r, …, x_k − r) for x = (x_1, …, x_k). It is readily verified that the maximum, minimum, max-min, mean, and median commute with shifts.

Lipschitz continuity. The mapping G : R^k → R is L-Lipschitz with respect to ‖·‖_∞ if

|G(x) − G(x′)| ≤ L ‖x − x′‖_∞ = L max_{i∈[k]} |x_i − x′_i|, x, x′ ∈ R^k. (8)

In Appendix A.1, we show that the maximum, median, and max-min aggregations are 1-Lipschitz (Lemmas 13, 14, 15, respectively). Showing it for the mean is a simple exercise. The proof for the minimum is similar to the one for the maximum. We also consider aggregations that commute with truncation; see Section 4.1 for the formal definition.

2.2 Complexity Measures

Fat-shattering dimension. Let Ω be a set and F ⊆ R^Ω. For γ > 0, a set S = {x_1, …, x_m} ⊆ Ω is said to be γ-shattered by F if

sup_{r∈R^m} min_{y∈{−1,1}^m} sup_{f∈F} min_{i∈[m]} y_i(f(x_i) − r_i) ≥ γ. (9)

The γ-fat-shattering dimension, denoted by fat_γ(F), is the size of the largest γ-shattered set (possibly ∞).
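The aggregation rules and the two properties above lend themselves to a quick numerical sanity check. The following sketch is our own illustration (the function names are ours, not the paper's); it verifies, on random inputs, that the max, min, mean, and lower median commute with shifts (7) and are 1-Lipschitz in the sup norm (8):

```python
import random

# Pointwise aggregation rules from Section 2.1 (definitions (4)-(6)).
def g_max(x):  return max(x)
def g_min(x):  return min(x)
def g_mean(x): return sum(x) / len(x)

def g_median(x):
    # Lower median (5): the ceil(k/2)-th smallest entry, hence a value
    # actually occurring in x (as required for the proof of Theorem 1).
    return sorted(x)[(len(x) - 1) // 2]

def g_max_min(cols):
    # Max-min (6): cols[j][i] holds x_ij; min over i, then max over j.
    return max(min(col) for col in cols)

# Randomized check of commuting with shifts (7) and 1-Lipschitzness (8).
rng = random.Random(0)
for _ in range(1000):
    k = rng.randint(1, 7)
    x = [rng.uniform(-5, 5) for _ in range(k)]
    y = [rng.uniform(-5, 5) for _ in range(k)]
    r = rng.uniform(-5, 5)
    for G in (g_max, g_min, g_mean, g_median):
        assert abs((G(x) - r) - G([xi - r for xi in x])) < 1e-9           # (7)
        assert abs(G(x) - G(y)) <= max(abs(a - b)
                                       for a, b in zip(x, y)) + 1e-9      # (8)
print("shift and Lipschitz checks passed")
```

A randomized check of course proves nothing; it merely exercises the definitions (the formal proofs are in Appendix A.1).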
Fat-shattering dimension at zero. As in Gottlieb et al. (2014), we also define the notion of γ-shattering at 0, where the shift r in (9) is constrained to be 0. Formally, the shattering condition is min_{y∈{−1,1}^m} sup_{f∈F} min_{i∈[m]} y_i f(x_i) ≥ γ, and we denote the corresponding dimension by fat⁰_γ(F). Attias et al. (2019, Lemma 13) showed that for all F ⊆ R^Ω,

fat_γ(F) = max_{r∈R^Ω} fat⁰_γ(F − r), γ > 0, (10)

where F − r = {f − r; f ∈ F} is the r-shifted class (the maximum is always achieved). Lemma 24 presents another, apparently novel, connection between fat and fat⁰.

1. Ordinarily, for even k, any m ∈ [x_(k/2), x_(k/2+1)] is a median of x. For the proof of Theorem 1, the median must be a value actually occurring in x.

Rademacher complexity. Let F be a real-valued function class on the domain space W. Define the empirical Rademacher complexity of F on a given sequence (w_1, …, w_n) ∈ W^n as

R_n(F | w_1, …, w_n) = (1/n) E_σ sup_{f∈F} Σ_{i=1}^n σ_i f(w_i),

where σ = (σ_1, …, σ_n) are independent random variables uniformly chosen from {−1, 1}. The Rademacher complexity of F with respect to a distribution D is defined as R_n(F) = E_{w_1,…,w_n∼D} R_n(F | w_1, …, w_n). It is a classic fact (Mohri et al., 2012, Theorem 3.1) that the Rademacher complexity controls generalization bounds in a wide range of supervised learning settings.

Covering numbers. Whenever Ω is endowed with a probability measure µ, this induces, for p ∈ [1, ∞) and f : Ω → R^k, the norm

‖f‖_{L_p^{(k)}(µ)} = (E_{X∼µ} ‖f(X)‖_p^p)^{1/p} = ( ∫_Ω ‖f(x)‖_p^p dµ(x) )^{1/p}

on L_p^{(k)}(µ) := { f ∈ (R^k)^Ω : ‖f‖_{L_p^{(k)}(µ)} < ∞ }. When k = 1, we write L_p(µ) := L_p^{(1)}(µ). For p = ∞, ‖f‖_{L_∞^{(k)}(µ)} is the essential supremum of ‖f(X)‖_∞ with respect to µ. For t > 0 and H ⊆ F ⊆ L_p(µ), we say that H is a t-cover of F under L_p(µ) if sup_{f∈F} inf_{h∈H} ‖f − h‖_{L_p(µ)} ≤ t. The t-covering number of F, denoted by N(F, L_p(µ), t), is the cardinality of the smallest t-cover of F (possibly ∞).
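For a finite class restricted to a finite set, the γ-shattering-at-0 condition can be checked by brute force, enumerating all sign patterns. The sketch below is our own illustration (exponential in the sample size m, so purely pedagogical):

```python
from itertools import combinations, product

def shattered_at_zero(rows, gamma):
    """rows[f] = (f(x_1), ..., f(x_m)) for each f in a finite class F.
    True iff {x_1, ..., x_m} is gamma-shattered with shift r = 0:
    every sign pattern y is realized by some f with y_i * f(x_i) >= gamma."""
    m = len(rows[0])
    return all(
        any(all(yi * v >= gamma for yi, v in zip(y, row)) for row in rows)
        for y in product([-1, 1], repeat=m)
    )

def fat_at_zero(rows, gamma):
    """Brute-force gamma-fat-shattering dimension at zero of a finite class
    on a finite domain: size of the largest shattered subset of points."""
    m = len(rows[0])
    best = 0
    for size in range(1, m + 1):
        for S in combinations(range(m), size):
            if shattered_at_zero([tuple(r[i] for i in S) for r in rows], gamma):
                best = size
    return best
```

For example, the four sign vectors on two points, rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)], are 1-shattered at 0, so fat_at_zero returns 2; dropping the two mixed-sign rows brings it down to 1.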
We note the obvious relation

p > q ⟹ N(F, L_p(µ), t) ≥ N(F, L_q(µ), t), (11)

which holds for all probability measures µ and all t > 0. We sometimes overload the notation for aggregations by defining G on k-tuples of functions (instead of k-tuples of reals), G : (R^Ω)^k → R^Ω. We say that G is L-Lipschitz with respect to L_p^{(k)}(µ) if

‖G(f_{1:k}) − G(f′_{1:k})‖_{L_p(µ)} ≤ L ‖(f_{1:k}) − (f′_{1:k})‖_{L_p^{(k)}(µ)}, f_{1:k}, f′_{1:k} ∈ (R^k)^Ω.

2.3 Notation

We write N = {0, 1, …} to denote the natural numbers. For n ∈ N, we write [n] := {1, 2, …, n}. All of our logarithms are base e, unless explicitly denoted otherwise. We use max{u, v} and u ∨ v interchangeably, and write Log(x) := log(e ∨ x). For any function class F over a set Ω and E ⊆ Ω, F(E) = F|_E denotes the projection on (restriction to) E. In line with the common convention in functional analysis, absolute numerical constants will be denoted by letters such as C, c, whose value may change from line to line. Any transformation ϕ : R → R may be applied to a function f ∈ R^Ω via ϕ(f) := ϕ ∘ f, as well as to F ⊆ R^Ω via ϕ(F) := {ϕ(f); f ∈ F}. The sign function thresholds at 0: sign(t) = 1[t ≥ 0].

3. Main Results

Our main results involve upper-bounding the fat-shattering dimension of aggregation rules in terms of the dimensions of the component classes. We begin with the simplest (to present):

Theorem 1 (General F and G that commutes with shifts and truncation) For F_1, …, F_k ⊆ R^Ω and an aggregation rule G that commutes with shifts (see definition (7)) and commutes with truncation (see definition (21)), we have

fat_γ(G(F_1, …, F_k)) ≤ 35 D_γ log²(126 D_γ), γ > 0,

where D_γ := Σ_{i=1}^k fat_γ(F_i) > 0. In the degenerate case where D_γ = 0, fat_γ(G) = 0. In particular, this result holds for the maximum, minimum, max-min, and median aggregation rules.

Remark. We made no attempt to optimize the constants; these are only provided to give a rough order-of-magnitude sense.
In the sequel, we forgo numerical estimates and state the results in terms of unspecified universal constants. The next result provides an alternative bound based on an entirely different technique:

Theorem 2 (Bounded function classes and Lipschitz aggregations) For 0 < ε < log 2, F_1, …, F_k ⊆ [−R, R]^Ω, and an aggregation rule G that is L-Lipschitz (L ≥ 1) in the supremum norm (see definition (8)), we have

fat_γ(G(F_1, …, F_k)) ≤ C D Log^{1+ε}(LRk/γ), 0 < γ/L < R,

where D := Σ_{i=1}^k fat_{cεγ}(F_i) and C, c > 0 are universal constants. In particular, this result holds for natural aggregation rules, such as the maximum, minimum, max-min, mean, and median.

Remark. The bounds in Theorems 1 and 2 are, in general, incomparable, and not just because of the unspecified constants in the latter. One notable difference is that Theorem 1 only depends on the shattering scale γ, while Theorem 2 additionally features a (weak) explicit dependence on the aspect ratio R/γ. In particular, Theorem 1 is applicable to semi-bounded affine classes (see Section A.4), while Theorem 2 is not. Still, for fixed R, γ and large k, the latter presents a significant asymptotic improvement over the former.

For the special case of affine functions and hinge-loss affine functions, the technique of Theorem 2 yields a considerably sharper estimate:

Theorem 3 (Dimension-free bound for Lipschitz aggregations of affine functions) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and

F_i = { x ↦ w · x + b; ‖w‖ ∨ |b| ≤ R_i, w ∈ R^d, b ∈ R }, R_i > 0, i ∈ [k], (12)

be k collections of R_i-bounded affine functions on Ω = B, and let G be an aggregation rule that is L-Lipschitz in the supremum norm (see definition (8)). Then

fat_γ(G(F_1, …, F_k)) ≤ C L² Log(k) Σ_{i=1}^k R_i² / γ², 0 < γ/L < min_{i∈[k]} R_i, (13)

where C > 0 is a universal constant. Further, if

F^Hinge_i = {(x, y) ↦ max{0, 1 − y f(x)}; f ∈ F_i} (14)

is a family of R_i-bounded hinge-loss affine functions for i ∈ [k] and G^Hinge := G(F^Hinge_1, …
, F^Hinge_k) for an aggregation rule G that is L-Lipschitz in the supremum norm, then the same bound as in (13) holds for fat_γ(G^Hinge). In particular, this result holds for the maximum, minimum, max-min, mean, and median aggregation rules.

Theorem 3 improves by a log factor the estimate of Fefferman et al. (2016) on the fat-shattering dimension of the max-min aggregation (defined in Section 2) of linear functions:²

Lemma 4 (Fefferman et al. (2016), Lemma 6) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and

F_ij = { x ↦ w · x; ‖w‖ ≤ 1, w ∈ R^d }, i ∈ [k], j ∈ [ℓ],

be kℓ (identical) linear function classes defined on Ω = B. If G_max-min is the max-min aggregation rule (6), then

fat_γ(G_max-min(F_11, …, F_kℓ)) ≤ C kℓ log²(kℓ) / γ²,

where C > 0 is a universal constant.

Our Theorem 3 improves the latter by a log factor:

fat_γ(G_max-min(F_11, …, F_kℓ)) ≤ C kℓ log(kℓ) / γ².

Corollary 5 (Rademacher complexity for k-fold maximum of affine functions) Let F_i be an R_i-bounded affine function class as in (12) or a hinge-loss affine function class as in (14), let G_max be the maximum aggregation rule, and let R̄ = max_i R_i. Then

R_n(G_max(F_1, …, F_k)) ≤ C √( Log(k) Log³(R̄n) R̄² Σ_{i=1}^k R_i² / n ),

where R_n is the Rademacher complexity and C > 0 is a universal constant.

2. The max-min aggregation is shown to be 1-Lipschitz in the supremum norm in Lemma 15 of Section A.1.

Corollary 5 improves upon Raviv et al. (2018, Theorem 7). Their upper bound scales linearly with k, whereas ours scales as √(k log k). Note, however, that for linear classes a better bound is known:

Theorem 6 (Zhivotovskiy (2022)) Let B ⊆ R^d be the d-dimensional Euclidean unit ball and F_i = { x ↦ w · x; ‖w‖ ≤ 1, w ∈ R^d }, i ∈ [k], be k (identical) linear function classes defined on Ω = B. If G_max is the maximum aggregation rule, then

R_n(G_max(F_1, …, F_k)) ≤ C log n / √n,

where R_n is the Rademacher complexity and C > 0 is a universal constant.

The estimate in Theorem 3 is dimension-free in the sense of being independent of d.
In applications where a dependence on d is admissible, an optimal bound can be obtained:

Theorem 7 (Dimension-dependent bound for k-fold maximum of affine functions) Let Ω = R^d and let F_i ⊆ R^Ω be k (identical) function classes consisting of all real-valued affine functions:

F_i = { x ↦ w · x + b; w ∈ R^d, b ∈ R }, i ∈ [k],

and let G_max be the k-fold maximum (see definition (4)). Then

fat_γ(G_max(F_1, …, F_k)) ≤ C dk Log k, γ > 0,

where C > 0 is a universal constant.

The optimality of the upper bound in Theorem 7 is witnessed by the matching lower bound:

Theorem 8 (Dimension-dependent lower bound for k-fold maximum of affine functions) For k ≥ 1 and d ≥ 4, let F_1 = F_2 = … = F_k be the collection of all affine functions over Ω = R^d and let G_max be the k-fold maximum (see definition (4)). Then

fat_γ(G_max(F_1, …, F_k)) ≥ C log(k) Σ_{i=1}^k fat_γ(F_i) = C dk log k, γ > 0,

where C > 0 is a universal constant.

The scaling argument employed in the proof of Theorem 8 can be invoked to show that the claim continues to hold for Ω = B. Together, Theorems 7 and 8 show that the dependence on k is optimal.

4. Proofs

We start with upper-bounding the fat-shattering dimension of aggregation rules that commute with shifts (definition (7)) and commute with truncation (defined below), in terms of the dimensions of the component classes.

4.1 Proof of Theorem 1

Partial concept classes and disambiguation. We say that F∗ ⊆ {0, 1, ∗}^Ω is a partial concept class over Ω; this usage is consistent with Alon et al. (2022), while Attias et al. (2019, 2022) used the descriptor "ambiguous". Define the disambiguation operator D : {0, 1, ∗} → 2^{{0,1}} as

D(0) = {0}; D(1) = {1}; D(∗) = {0, 1}. (15)

For f∗ ∈ F∗, define its disambiguation set D(f∗) ⊆ {0, 1}^Ω as

D(f∗) = { g ∈ {0, 1}^Ω : ∀x ∈ Ω, g(x) ∈ D(f∗(x)) }; (16)

in words, D(f∗) consists of the total concepts g : Ω → {0, 1} that agree pointwise with f∗ whenever the latter takes a value in {0, 1}.
We say that F̄ ⊆ {0, 1}^Ω disambiguates F∗ if for all f∗ ∈ F∗ we have F̄ ∩ D(f∗) ≠ ∅; in words, every f∗ ∈ F∗ must have a disambiguated representative in F̄.³ As in Alon et al. (2022); Attias et al. (2022), we say⁴ that S ⊆ Ω is VC-shattered by F∗ if F∗(S) ⊇ {0, 1}^S. We write vc(F∗) to denote the size of the largest VC-shattered set (possibly ∞). The obvious relation vc(F∗) ≤ vc(F̄) always holds between a partial concept class and any of its disambiguations. Alon et al. (2022, Theorem 13) proved the following variant of the Sauer-Shelah-Perles Lemma for partial concept classes:

Lemma 9 (Alon et al. (2022)) For every F∗ ⊆ {0, 1, ∗}^Ω with d = vc(F∗) < ∞ and |Ω| < ∞, there is an F̄ disambiguating F∗ such that

|F̄(Ω)| ≤ (|Ω| + 1)^{(d+1) log₂|Ω| + 2}. (17)

For d > 0 and |Ω| > 1, this implies the somewhat more wieldy estimate⁵

|F̄(Ω)| ≤ |Ω|^{7d log₂|Ω|}. (18)

We will make use of the elementary fact

x ≤ A log₂ x ⟹ x ≤ 3A log(3A), x, A ≥ 1,

and its corollary

y ≤ A (log₂ y)² ⟹ y ≤ 5A log²(18A), y, A ≥ 1. (19)

3. Attias et al. (2022) additionally required that F̄ ⊆ ∪_{f∗∈F∗} D(f∗), but this is an unnecessary restriction, and does not affect any of the results.
4. Attias et al. (2019) had incorrectly given F∗(S) = {0, 1}^S as the shattering condition.
5. The estimate (18) does not appear in Alon et al. (2022), but is an elementary consequence of (17).

Aggregation rules commuting with truncation. Fix γ > 0 and define the truncation operator [·]_γ : R → {0, 1, ∗} as

[t]_γ = 0 if t ≤ −γ; 1 if t ≥ γ; ∗ otherwise. (20)

Let x_i ∈ R, i ∈ [k], let [x_i]_γ ∈ {0, 1, ∗} be the γ-truncation, and let x̃_i ∈ D([x_i]_γ) ⊆ {0, 1} be a disambiguation. We say that an aggregation rule G : R^k → R commutes with truncation if for any γ > 0,

G(x̃_1, …, x̃_k) ∈ D([G(x_1, …, x_k)]_γ) (21)

for all disambiguations x̃_i, i ∈ [k] (see the definitions in (15) and (16)). In Appendix A.1, we show that the median and max-min aggregations commute with truncation (Lemmas 16, 17, respectively). Showing it for the maximum and minimum is a simple exercise.
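Definitions (15), (20), and (21) can be exercised numerically. The sketch below is our own randomized check (with ∗ represented by None); on random inputs it confirms that the max and min commute with truncation, and exhibits a violation for the mean:

```python
import itertools
import random

def trunc(t, gamma):
    # Truncation operator (20): 0 if t <= -gamma, 1 if t >= gamma, * (None) else.
    return 0 if t <= -gamma else 1 if t >= gamma else None

def disamb(b):
    # Disambiguation sets (15): D(0) = {0}, D(1) = {1}, D(*) = {0, 1}.
    return (0, 1) if b is None else (b,)

def commutes_with_truncation(G, k, gamma=1.0, trials=2000, seed=0):
    """Randomized test of condition (21): every disambiguation of the
    truncated inputs must aggregate into D([G(x)]_gamma)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-3, 3) for _ in range(k)]
        allowed = disamb(trunc(G(x), gamma))
        for xt in itertools.product(*(disamb(trunc(xi, gamma)) for xi in x)):
            if G(list(xt)) not in allowed:
                return False
    return True
```

For instance, commutes_with_truncation(max, 3) passes every trial, while the mean fails quickly: inputs such as (0.5, 2) truncate to (∗, 1), whose disambiguation (0, 1) averages to 0.5, which lies in neither D(0) nor D(1).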
We note that the mean aggregation does not satisfy this property.

Proof [of Theorem 1] We follow the basic techniques of discretization and r-shifting employed in Attias et al. (2019, 2022). Fix γ > 0 and recall the truncation operator [·]_γ : R → {0, 1, ∗} defined in (20). We also define the truncation operator over functions, [·]_γ : R^Ω → {0, 1, ∗}^Ω, as [f]_γ = f∗ where f∗(x) = [f(x)]_γ for x ∈ Ω. Observe that for all F ⊆ R^Ω and [F]_γ := {[f]_γ; f ∈ F}, we have fat⁰_γ(F) = vc([F]_γ). Let G : R^k → R be a k-fold aggregation rule and let F_1, …, F_k ⊆ R^Ω be real-valued function classes. Suppose that some S = {x_1, …, x_ℓ} ⊆ Ω is γ-shattered by G := G(F_1, …, F_k). Proving the claim amounts to upper-bounding ℓ appropriately. By (10), there is an r ∈ R^Ω such that fat_γ(G) = fat⁰_γ(G − r) = vc([G − r]_γ). Put F′_i := F_i − r; since G commutes with the r-shift, as defined in (7), we have

G′ := G(F′_1, …, F′_k) = G(F_1 − r, …, F_k − r) = G(F_1, …, F_k) − r. (22)

Hence, S is VC-shattered by [G′]_γ and

v_i := vc([F′_i]_γ) = fat⁰_γ(F′_i) ≤ fat_γ(F′_i) = fat_γ(F_i), i ∈ [k]. (23)

Let us assume for now that each v_i > 0; in this case, there is no loss of generality in assuming ℓ > 1. Let F̃_i be a good disambiguation of [F′_i]_γ on S, as furnished by Lemma 9: |F̃_i(S)| ≤ ℓ^{7 v_i log₂ ℓ}. Observe that G̃ := G(F̃_1, …, F̃_k) is a valid disambiguation of [G′]_γ, since we assume that G commutes with truncation. It follows that

2^ℓ = |G̃(S)| ≤ ∏_{i=1}^k |F̃_i(S)| ≤ ℓ^{7 log₂(ℓ) Σ_{i=1}^k v_i}. (24)

Thus, (19) implies that ℓ ≤ 35(Σ v_i) log²(126 Σ v_i), and the latter is an upper bound on vc(G̃) and hence also on vc([G′]_γ) = fat_γ(G). The claim now follows from (23).

If any one given v_i = 0, we claim that (24) is unaffected. This is because any C∗ ⊆ {0, 1, ∗}^Ω with vc(C∗) = 0 has a singleton disambiguation C̄ = {c̄}. Indeed, any given x ∈ Ω can receive at most one of {0, 1} as a label from the members of C∗ (otherwise, it would be shattered, forcing vc(C∗) ≥ 1).
If any c∗ ∈ C∗ labels x with 0, then all members of C∗ are disambiguated to label x with 0 (and, mutatis mutandis, 1). Any x labeled with ∗ by every c∗ ∈ C∗ can be disambiguated arbitrarily (say, to 0). Disambiguating the degenerate [F′_i]_γ to the singleton F̃_i(S) has no effect on the product in (24). The foregoing argument continues to hold if more than one v_i = 0. In particular, in the degenerate case where fat_γ(F_1) = fat_γ(F_2) = … = fat_γ(F_k) = 0, we have ∏ |F̃_i(S)| = 1, which forces ℓ = 0.

4.2 Proof of Theorem 2

First, we upper bound the covering numbers of Lipschitz aggregations as a function of the covering numbers of the component classes.

Theorem 10 (Covering number of L-Lipschitz aggregations) Let t > 0, p ∈ [1, ∞], and F_1, …, F_k ⊆ L_p(µ). Let G be an aggregation rule that is L-Lipschitz. Then, for all probability measures µ on Ω,

N(G(F_1, …, F_k), L_p(µ), t) ≤ ∏_{i=1}^k N(F_i, L_p(µ), t/(Lk^{1/p})), p < ∞,
N(G(F_1, …, F_k), L_p(µ), t) ≤ ∏_{i=1}^k N(F_i, L_p(µ), t/L), p = ∞.

We proceed to the main proof.

Proof [of Theorem 2] Let G : R^k → R be an aggregation rule that is L-Lipschitz (L ≥ 1) in the supremum norm, as defined in (8), and let F_1, …, F_k ⊆ [−R, R]^Ω be real-valued function classes. Suppose that some Ω_ℓ = {x_1, …, x_ℓ} ⊆ Ω is a maximal set that is γ-shattered by G, let F_i(Ω_ℓ) = F_i|_{Ω_ℓ}, and let µ_ℓ be the uniform distribution on Ω_ℓ. We upper bound the covering number with the fat-shattering dimension as in Lemma 21 (see Section A.2), with n = ℓ and p = ∞:

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ) ≤ C v_i log(Rℓ/(v_i γ)) log^ε(ℓ/v_i), 0 < γ < R,

where v_i = fat_{cεγ}(F_i). Then Theorem 10 implies that

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≤ Σ_{i=1}^k log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/(2L))
≤ C Σ_{i=1}^k v_i log(LRℓ/(v_i γ)) log^ε(ℓ/v_i)
≤(a) C Σ_{i=1}^k v_i log^{1+ε}(LRℓ/(v_i γ))
≤(b) C D log^{1+ε}(LRℓk/(Dγ)),

where D := Σ_{i=1}^k v_i, (a) follows since R/γ > 1 and L ≥ 1 by assumption, and (b) follows by the concavity of x ↦ x log^{1+ε}(u/x) (see Lemma 28 in Section A.5). We can assume ℓ ≥ 2 without loss of generality.
Combining the monotonicity of the covering number (see (11)), a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 18 in Section A.2), and the fact that Ω_ℓ is a maximal set that is γ-shattered by G yields

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≥ c fat_γ(G) = cℓ,

whence ℓ ≤ CD Log^{1+ε}(LRℓk/(Dγ)). Using the elementary fact

x ≤ A Log^{1+ε} x ⟹ x ≤ c′A Log^{1+ε} A, x, A ≥ 1

(with x = LRℓk/(Dγ) and A = c LRk/γ), we get ℓ ≤ CD Log^{1+ε}(LRk/γ), which implies the claim.

4.3 Proof of Theorem 3

We use the notation and results from the Appendix, and in particular from Section A.3.

Proof [of Theorem 3] A bound of this form for the k-fold maximum aggregation was claimed in Kontorovich (2018); however, the argument there was flawed, see Section 5. Let G : R^k → R be an aggregation rule that is L-Lipschitz in the supremum norm, as defined in (8), and let F_1, …, F_k be bounded affine function classes, as defined in (12). Suppose that some Ω_ℓ = {x_1, …, x_ℓ} ⊆ Ω = B is a maximal set that is γ-shattered by G, let F_i(Ω_ℓ) = F_i|_{Ω_ℓ}, and let µ_ℓ be the uniform distribution on Ω_ℓ. We upper bound the covering number as in Lemma 23 (with m = ℓ):

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ) ≤ C (R_i²/γ²) Log(ℓγ²/R_i²), 0 < γ < R_i.

Denote v_i := L²R_i²/γ², and consider the L_∞ covering number of F_i(Ω_ℓ) at scale γ/L:

log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/L) ≤ C v_i Log(ℓ/v_i).

Then Theorem 10 implies that

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≤ Σ_{i=1}^k log N(F_i(Ω_ℓ), L_∞(µ_ℓ), γ/(2L)) ≤ C Σ_{i=1}^k v_i Log(ℓ/v_i) ≤(a) CD Log(kℓ/D),

where D := Σ_{i=1}^k v_i and (a) follows by the concavity of x ↦ x log(u/x) (see Corollary 27 in Section A.5). Combining the monotonicity of the covering number (see (11)), a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 18 in Section A.2), and the fact that Ω_ℓ is a maximal set that is γ-shattered by G yields

log N(G(Ω_ℓ), L_∞(µ_ℓ), γ/2) ≥ c fat_γ(G) = cℓ,

whence ℓ ≤ CD Log(kℓ/D). Using the elementary fact

x ≤ A Log x ⟹ x ≤ c′A Log A, x, A ≥ 1

(with x = kℓ/D and A = ck), we get ℓ ≤ cD Log k, which implies the claim.
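Both proofs above rest on the product-cover bound of Theorem 10. For the 1-Lipschitz max aggregation with p = ∞, the mechanism can be verified numerically: cover each class separately and aggregate one net element per class; the result covers the aggregate class at the same scale. The sketch below is our own illustration on finite random classes (functions restricted to a finite domain, represented as vectors):

```python
import itertools
import random

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_cover(F, t):
    """Greedy t-net of a finite family of vectors under the sup norm;
    returns the net and a map sending each f to its covering element."""
    net, cover_of = [], {}
    for f in F:
        h = next((h for h in net if sup_dist(f, h) <= t), None)
        if h is None:
            net.append(f)
            h = f
        cover_of[f] = h
    return net, cover_of

def pointwise_max(fs):
    return tuple(map(max, zip(*fs)))

rng = random.Random(1)
l, k, t = 5, 3, 0.25
# Each class: 6 random functions on an l-point domain.
classes = [[tuple(rng.uniform(-1, 1) for _ in range(l)) for _ in range(6)]
           for _ in range(k)]
nets = [greedy_cover(F, t) for F in classes]

# Since max is 1-Lipschitz in the sup norm, aggregating the covering
# elements covers the aggregated class at scale t, so
# N(G_max(F_1, ..., F_k), t) <= prod_i N(F_i, t), as in Theorem 10.
worst = 0.0
for combo in itertools.product(*classes):
    g = pointwise_max(combo)
    g_cov = pointwise_max([nets[i][1][combo[i]] for i in range(k)])
    worst = max(worst, sup_dist(g, g_cov))
assert worst <= t
print("worst deviation:", round(worst, 4), "net sizes:", [len(n[0]) for n in nets])
```

The assertion holds deterministically: |max_j f_j(x) − max_j h_j(x)| ≤ max_j |f_j(x) − h_j(x)| ≤ t at every point x, which is exactly the p = ∞, L = 1 case of Theorem 10.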
The result can easily be generalized to hinge-loss affine classes. Let F_i be an affine function class as in (12), define F′_i as the function class on B × {−1, 1} given by F′_i = {(x, y) ↦ y f(x); f ∈ F_i}, and define the hinge-loss affine class F^Hinge_i as the function class on B × {−1, 1} given by F^Hinge_i = {(x, y) ↦ max{0, 1 − f(x, y)}; f ∈ F′_i}. One first observes that the restriction of F′_i to any {(x_1, y_1), …, (x_n, y_n)}, as a body in R^n, is identical to the restriction of F_i to {x_1, …, x_n}. Interpreting F^Hinge_i as a 2-fold maximum over the singleton class H = {h ≡ 0} and the bounded affine class F′_i lets us invoke Theorem 10 to argue that F_i and F^Hinge_i have the same L_∞ covering numbers. Hence, the argument we deployed here to establish (13) for affine classes also applies to k-fold L-Lipschitz aggregations of hinge-loss classes.

4.4 Proof of Corollary 5

Proof [of Corollary 5] Raviv et al. (2018, Theorem 7) upper-bounded the Rademacher complexity of the maximum aggregation of k hinge-loss affine functions by k/√n. For R_i-bounded affine functions or hinge-loss affine functions, the analysis above, combined with the calculation in Kontorovich (2018), yields a bound of O(√( Log(k) Log³(n) Σ_{i=1}^k R_i² / n )). For completeness, we provide the full proof. Let G_max : R^k → R be the k-fold maximum aggregation rule, as defined in (4), and let F_1, …, F_k ⊆ R^Ω be R_i-bounded affine function classes as in (12) or hinge-loss affine function classes as in (14). Since this aggregation is 1-Lipschitz in the supremum norm, Theorem 3 implies that

fat_γ(G_max) ≤ C Log(k) Σ_{i=1}^k R_i² / γ², 0 < γ < min_{i∈[k]} R_i,

where C > 0 is a universal constant.

From fat-shattering to Rademacher. The fat-shattering estimate above can be used to upper-bound the Rademacher complexity by converting the former to a covering number bound and plugging it into Dudley's chaining integral (Dudley, 1967):

R_n(F) ≤ inf_{α≥0} ( 4α + 12 ∫_α^1 √( log N(F, ‖·‖_2, t) / n ) dt ),

where N(·) are the L_2 covering numbers.
It remains to bound the covering numbers. A simple way of doing so is to invoke Lemmas 2.6, 3.2, and 3.3 in Alon et al. (1997), but this incurs superfluous logarithmic factors in n. Instead, we use the sharper estimate of Mendelson and Vershynin (2003), stated here in Lemma 19. Putting R̄ = max_i R_i, the latter yields

R_n(G_max) ≤ inf_{α≥0} ( 4α + 12 ∫_α^1 √( log N(G_max, ‖·‖_2, t) / n ) dt )
≤ 4α + 12c ∫_α^1 √( fat_{ct/R̄}(G_max) log(2R̄/t) / n ) dt
≤ 4α + 12c √( Log(k) Σ_{i=1}^k R_i² / n ) · 2R̄ ( log^{3/2}(2R̄/α) − (log 2R̄)^{3/2} ),

and choosing α = 1/√n yields

R_n(G_max) ≤ 4/√n + 12c √( Log(k) Σ_{i=1}^k R_i² / n ) · 2R̄ ( log^{3/2}(2R̄√n) − (log 2R̄)^{3/2} )
≤ C √( Log(k) log³(R̄n) R̄² Σ_{i=1}^k R_i² / n ).

4.5 Proof of Theorem 7

Proof [of Theorem 7] Let G_max : R^k → R be the k-fold maximum aggregation rule, as defined in (4), and let F_1, …, F_k ⊆ R^Ω be identical function classes consisting of all real-valued affine functions. Note that G_max is an aggregation that commutes with shifts, as defined in (7). By (10), there is an r ∈ R^Ω such that fat_γ(G_max) = fat⁰_γ(G_max − r). As in (22), put F′_i := F_i − r and G′_max := G_max − r = G_max(F′_1, …, F′_k). Define G̃_max = sign(G′_max) and F̃_i = sign(F′_i). Since sign and max commute, we have G̃_max = max_{i∈[k]}(F̃_i). We claim that

fat⁰_γ(G′_max) ≤ vc(G̃_max). (26)

Indeed, any S ⊆ Ω that is γ-shattered with shift r = 0 by any G ⊆ R^Ω is also VC-shattered by sign(G). (See Section 4.1, and notice that the converse implication and the reverse inequality do not hold.) It holds that

d + 1 =(a) vc(F̃_i) =(b) fat⁰_γ(F′_i) =(c) fat_γ(F′_i) =(d) fat_γ(F_i),

where (a) follows from a standard argument (e.g., Mohri et al. (2012, Example 3.2)), (b) holds because any S ⊆ R^d that is VC-shattered by sign(F′_i) is also γ-shattered by F′_i with shift r = 0, (c) follows from Lemma 24, since the class satisfies the closure property (34), and (d) holds since the shattering remains the same for the shifted class. Now the argument of Blumer et al. (1989, Lemma 3.2.3) applies:

vc(G̃_max) ≤ 2(d + 1)k log(3k) (27)

(this holds for any k-fold aggregation function, not just the maximum).
Combining (26) with (27) proves the claim.

4.6 Proof of Theorem 8

Proof [of Theorem 8] It follows from Mohri et al. (2012, Example 3.2) that $\mathrm{vc}(\mathrm{sign}(F_i)) = d + 1$. Since $F_i$ is closed under scalar multiplication, a scaling argument shows that any $S \subseteq \mathbb{R}^d$ that is VC-shattered by $\mathrm{sign}(F_i)$ is also $\gamma$-shattered by $F_i$ with shift $r = 0$, whence $\widetilde{\mathrm{fat}}_\gamma(F_i) = d + 1$ for all $\gamma > 0$; invoking Lemma 24 extends this to $\mathrm{fat}_\gamma(F_i)$ as well. Now Csikós et al. (2019, Theorem 1) shows that the k-fold unions of half-spaces necessarily shatter some set $S \subseteq \mathbb{R}^d$ of size at least $cdk\log k$. Since union is a special case of the max operator, and the latter commutes with sign, the scaling argument shows that this $S$ is $\gamma$-shattered by $G_{\max}$ with shift $r = 0$. Hence, $\mathrm{fat}_\gamma(G_{\max}) \ge \widetilde{\mathrm{fat}}_\gamma(G_{\max}) \ge |S|$, which proves the claim.

5. Discussion

In this paper, we proved upper and lower bounds on the fat-shattering dimension of aggregation rules as a function of the fat-shattering dimension of the component classes. We leave some remaining gaps for future work. First, for aggregation rules that commute with shifts and commute with truncation, assuming $\mathrm{fat}_\gamma(F_i) \le d$ for $1 \le i \le k$, we show in Theorem 1 that
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \le Cdk\,\mathrm{Log}^2(dk), \qquad \gamma > 0,$$
where $C > 0$ is a universal constant. We pose the following

Open problem. Let $G$ be an aggregation rule with the properties as in Theorem 1. Is it the case that for all $F_i \subseteq \mathbb{R}^\Omega$ with $\mathrm{fat}_\gamma(F_i) \le d$, $i \in [k]$, we have
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \le Cdk\,\mathrm{Log}(k), \qquad \gamma > 0, \qquad (28)$$
for some universal $C > 0$? In light of Theorem 8, this is the best one could hope for in general. We also pose the following conjecture about bounded affine functions.

Conjecture 11 Theorem 3 is tight up to constants. For $R_i$-bounded affine functions and an aggregation rule $G$ that is 1-Lipschitz in supremum norm,
$$\mathrm{fat}_\gamma(G(F_1,\dots,F_k)) \ge C\,\frac{\mathrm{Log}(k)}{\gamma^2}\sum_{i=1}^k R_i^2, \qquad 0 < \gamma < \min_{i\in[k]} R_i, \qquad (29)$$
where $C > 0$ is a universal constant.
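The way k-fold aggregation inflates combinatorial dimension can be seen already in one dimension. The brute-force Python check below uses unions of $k$ intervals on the line, a classical example whose VC dimension is exactly $2k$; this is an illustration only, not the half-space construction of Csikós et al.

```python
def k_interval_vc_shatters(n_points, k):
    """Brute-force check: do unions of k closed intervals shatter n_points
    distinct reals? A labeling is realizable iff its positive points form
    at most k consecutive runs along the sorted order."""
    realizable = 0
    for mask in range(2 ** n_points):
        labels = [(mask >> i) & 1 for i in range(n_points)]
        runs = sum(1 for i in range(n_points)
                   if labels[i] == 1 and (i == 0 or labels[i - 1] == 0))
        if runs <= k:
            realizable += 1
    return realizable == 2 ** n_points
```

Running the check confirms that $2k$ points are shattered while $2k+1$ are not, matching the linear-in-$k$ growth (without the $\log k$ factor, which requires dimension $d \ge 2$).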
Throughout the paper, we mentioned several mistaken claims in the literature. In this section, we briefly discuss the nature of these mistakes, which are, in a sense, variations on the same kind of error. We begin with Attias et al. (2019, Lemma 14), which incorrectly claimed that any partial function class $F$ has a disambiguation $\bar F$ such that $\mathrm{vc}(\bar F) \le \mathrm{vc}(F)$ (see Section 4.1 for the definitions). The mistake was pointed out to us by Yann Guermeur, and later, Alon et al. (2022, Theorem 11) showed that there exist partial classes $F$ with $\mathrm{vc}(F) = 1$ for which every disambiguation $\bar F$ has $\mathrm{vc}(\bar F) = \infty$. Kontorovich (2018) attempted to prove the bound stated in our Theorem 3 (up to constants, and only for linear classes). The argument proceeded via a reduction to the Boolean case, as in our proof of Theorem 7. It was correctly observed that if, say, some finite $S \subseteq \Omega$ is 1-shattered by $F_i$ with shift $r = 0$, then it is also VC-shattered by $\mathrm{sign}(F_i)$. Neglected was the fact that $\mathrm{sign}(F_i)$ might shatter additional points in $\Omega \setminus S$, and, in sufficiently high dimension, it necessarily will. The crux of the matter is that (26) holds in the dimension-dependent but not the dimension-free setting; again, this may be seen as a variant of the disambiguation mistake. Finally, the proof of Hanneke and Kontorovich (2019, Lemma 6) claims, in the first display, that the shattered set can be classified with large margin, which is incorrect: yet another variant of mistaken disambiguation.

Acknowledgments

We thank Steve Hanneke and Ramon van Handel for very helpful discussions; the latter, in particular, patiently explained to us how to prove Lemma 23. Roman Vershynin kindly gave us permission to share his example in Remark 20. We deeply thank the anonymous reviewers for their insightful comments and suggestions. This research was partially supported by the Israel Science Foundation (grant No.
1602/19), an Amazon Research Award, the Ben-Gurion University Data Science Research Center, the Cyber Security Research Center, Prime Minister's Office, and the Vatat Scholarship from the Israeli Council for Higher Education.

References

Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119-152. PMLR, 2020.

Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 658-671, 2022.

Gautier Appert and Olivier Catoni. New bounds for k-means and information k-means. arXiv preprint arXiv:2101.05728, 2021.

S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313-1328, 2004.

Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1172-1199, 2023.

Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT 2019, volume 98 of Proceedings of Machine Learning Research, pages 162-183. PMLR, 2019.

Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for adversarially robust learning. The Journal of Machine Learning Research, 23(1):7897-7927, 2022.

Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43-54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.

Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 56(2):174-190, 1998.
Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Comput., 1(1):151-160, 1989.

Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781-790, 2008.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929-965, 1989.

Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.

Leo Breiman. Random forests. Machine Learning, 45:5-32, 2001.

Doron Cohen, Aryeh Kontorovich, Aaron Koolyk, and Geoffrey Wolfer. Dimension-free empirical entropy estimation. IEEE Transactions on Information Theory, 69(5):3190-3202, 2023. doi: 10.1109/TIT.2022.3232739.

Mónika Csikós, Nabil H. Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. J. Mach. Learn. Res., 20:81:1-81:8, 2019.

Hubert Haoyang Duan. Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective. PhD thesis, University of Waterloo, 2012.

Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290-330, 1967.

David Eisenstat. k-fold unions of low-dimensional concept classes. Inf. Process. Lett., 109(23-24):1232-1234, 2009.

David Eisenstat and Dana Angluin. The VC dimension of k-fold union. Inf. Process. Lett., 101(5):181-184, 2007.

Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983-1049, 2016.

Dylan J. Foster and Alexander Rakhlin. ℓ∞ vector contraction for Rademacher complexity. CoRR, abs/1911.06468, 2019.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119-139, 1997.

Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Near-tight closure bounds for the Littlestone and threshold dimensions. In Algorithmic Learning Theory, pages 686-696. PMLR, 2021.

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract: COLT 2010). IEEE Transactions on Information Theory, 60(9):5750-5759, 2014.

Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, and Gabriel Nivasch. Learning convex polytopes with margin. In Neural Information Processing Systems (NIPS), 2018.

Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: novel proofs and tighter bounds. Theor. Comput. Sci., 796:99-113, 2019.

Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464-497, 1994.

Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 258-272. Springer, 2003.

Yegor Klochkov, Alexey Kroshnin, and Nikita Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics, 49(4):2206-2230, 2021.

Aryeh Kontorovich. Rademacher complexity of k-fold maxima of hyperplanes. 2018.

S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 152(1):37-55, 2003.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

Dolev Raviv, Tamir Hazan, and Margarita Osadchy. Hinge-minimax learner for the ensemble of hyperplanes. J. Mach. Learn. Res., 19:62:1-62:30, 2018.

M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, 164(2):603-648, 2006.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari.
Smoothness, low noise and fast rates. Advances in Neural Information Processing Systems, 23, 2010.

Michel Talagrand. Vapnik-Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31(3):1565-1582, 2003.

Roman Vershynin. High-Dimensional Probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018.

Roman Vershynin. Private communication, 2021.

Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.

Nikita Zhivotovskiy. A bound for k-fold maximum. 2022.

Appendix A. Auxiliary results

A.1 Properties of Aggregation Rules

Lemma 12 If $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz under $\|\cdot\|_p$, then $G : (\mathbb{R}^\Omega)^k \to \mathbb{R}^\Omega$ is L-Lipschitz in $L_p^{(k)}(\mu)$.

Proof
$$\|G(f_1,\dots,f_k) - G(f'_1,\dots,f'_k)\|_{L_p(\mu)}^p = \int_\Omega |G(f_1,\dots,f_k)(x) - G(f'_1,\dots,f'_k)(x)|^p\,d\mu(x)$$
$$= \int_\Omega |G(f_1(x),\dots,f_k(x)) - G(f'_1(x),\dots,f'_k(x))|^p\,d\mu(x) \le \int_\Omega L^p\,\|(f_1(x),\dots,f_k(x)) - (f'_1(x),\dots,f'_k(x))\|_p^p\,d\mu(x),$$
where the inequality follows from the assumption that $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz in $\|\cdot\|_p$. This proves
$$\|G(f_1,\dots,f_k) - G(f'_1,\dots,f'_k)\|_{L_p(\mu)} \le L\,\|(f_1,\dots,f_k) - (f'_1,\dots,f'_k)\|_{L_p^{(k)}(\mu)},$$
and hence the claim.

Proof [of Theorem 10] Suppose $p < \infty$, and let $g = G(f_1,\dots,f_k) \in G(F_1,\dots,F_k)$. For each $i \in [k]$, let $\hat F_i \subseteq F_i$ be a $t/(Lk^{1/p})$-cover of $F_i$. Let each $f_i$ be $t/(Lk^{1/p})$-covered by some $\hat f_i \in \hat F_i$, in the sense that $\|f_i - \hat f_i\|_{L_p(\mu)} \le t/(Lk^{1/p})$. Assuming that $G : \mathbb{R}^k \to \mathbb{R}$ is L-Lipschitz in $\|\cdot\|_p$, Lemma 12 implies that $G : (\mathbb{R}^\Omega)^k \to \mathbb{R}^\Omega$ is L-Lipschitz in $L_p^{(k)}(\mu)$. Then it follows that $g$ is $t$-covered by $G(\hat f_1,\dots,\hat f_k)$, since
$$\|G(f_1,\dots,f_k) - G(\hat f_1,\dots,\hat f_k)\|_{L_p(\mu)}^p \le L^p\,\|(f_1,\dots,f_k) - (\hat f_1,\dots,\hat f_k)\|_{L_p^{(k)}(\mu)}^p = L^p\int_\Omega \|(f_1(x),\dots,f_k(x)) - (\hat f_1(x),\dots,\hat f_k(x))\|_p^p\,d\mu(x)$$
$$= L^p\sum_{i=1}^k \int_\Omega |f_i(x) - \hat f_i(x)|^p\,d\mu(x) \le L^p k\left(\frac{t}{Lk^{1/p}}\right)^p = t^p,$$
and so $\|G(f_1,\dots,f_k) - G(\hat f_1,\dots,\hat f_k)\|_{L_p(\mu)} \le t$. We conclude that $G(F_1,$
$\dots, F_k)$ has a $t$-cover of size $|\hat F_1 \times \hat F_2 \times \cdots \times \hat F_k|$, which proves the claim. The case $p = \infty$ is proved analogously (or, alternatively, as a limiting case of $p < \infty$).

We show that natural aggregations are Lipschitz in $\|\cdot\|_p$ norms, $p \in [1,\infty)$, and in supremum norm. The following facts are elementary:
$$|a \vee b - c \vee d| \le |a - c| \vee |b - d|, \qquad a, b, c, d \in \mathbb{R}; \qquad (30)$$
$$|a \wedge b - c \wedge d| \le |a - c| \vee |b - d|, \qquad a, b, c, d \in \mathbb{R}, \qquad (31)$$
where $s \vee t := \max\{s,t\}$ and $s \wedge t := \min\{s,t\}$.

Lemma 13 (Maximum aggregation is 1-Lipschitz) Let $G_{\max} : \mathbb{R}^k \to \mathbb{R}$ be the maximum aggregation; then for any $x, x' \in \mathbb{R}^k$ and $p \in [1,\infty]$, $|G_{\max}(x) - G_{\max}(x')| \le \|x - x'\|_p$.

Proof For $k = 2$ and $p = \infty$, the claim follows from the stronger, pointwise inequality (30). The proof follows by simple induction on $k$. Since $\|\cdot\|_\infty \le \|\cdot\|_p$, we conclude the proof for $p \in [1,\infty]$.

Lemma 14 (Median aggregation is 1-Lipschitz) Let $G_{\mathrm{med}} : \mathbb{R}^k \to \mathbb{R}$ be the median aggregation; then for any $x, x' \in \mathbb{R}^k$ and $p \in [1,\infty]$, $|G_{\mathrm{med}}(x) - G_{\mathrm{med}}(x')| \le \|x - x'\|_p$.

Proof Denote by $x_{(1)},\dots,x_{(k)}$ the ascending order of a sequence $x_1,\dots,x_k$, that is, $x_{(1)} \le \dots \le x_{(k)}$. For all $x, x' \in \mathbb{R}^k$ we have
$$|G_{\mathrm{med}}(x_1,\dots,x_k) - G_{\mathrm{med}}(x'_1,\dots,x'_k)| = |x_{(\lceil k/2\rceil)} - x'_{(\lceil k/2\rceil)}| \le \max_{i\in[k]}|x_{(i)} - x'_{(i)}| \le \max_{i\in[k]}|x_i - x'_i|,$$
where the last inequality follows from Cohen et al. (2023, Eq. (16)).⁶ Since $\|\cdot\|_\infty \le \|\cdot\|_p$, we conclude the proof for $p \in [1,\infty]$.

Lemma 15 (Max-Min aggregation is 1-Lipschitz) Let $G_{\text{max-min}} : \mathbb{R}^{k\times\ell} \to \mathbb{R}$ be the max-min aggregation; then for any $x, x' \in \mathbb{R}^{k\times\ell}$ and $p \in [1,\infty]$, $|G_{\text{max-min}}(x) - G_{\text{max-min}}(x')| \le \|x - x'\|_p$.

⁶ Stated there for distributions, but true for all vectors, by the same argument.

Proof The inequalities (30), (31) imply that the k-fold max and min aggregations are both 1-Lipschitz with respect to $\|\cdot\|_\infty$. Hence, for all $x, y \in \mathbb{R}^{k\times\ell}$, we have
$$\left|\min_{i\in[k]} x_{ij} - \min_{i\in[k]} y_{ij}\right| \le \max_{i\in[k]}|x_{ij} - y_{ij}|, \qquad j \in [\ell],$$
and further,
$$\left|\max_{j\in[\ell]}\min_{i\in[k]} x_{ij} - \max_{j\in[\ell]}\min_{i\in[k]} y_{ij}\right| \le \max_{j\in[\ell]}\max_{i\in[k]}|x_{ij} - y_{ij}|.$$
This proves that $|G_{\text{max-min}}(x) - G_{\text{max-min}}(x')| \le \|x - x'\|_\infty$. Since $\|\cdot\|_\infty \le \|\cdot\|_p$, the claim holds for all $p \in [1,\infty]$.
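Lemmas 13-15 admit a quick randomized sanity check. The sketch below assumes the lower median as the convention in (5) and a k-by-ℓ matrix encoding for max-min; both are assumptions for illustration, and any fixed order statistic is 1-Lipschitz by the same argument.

```python
import random

def g_max(x):
    return max(x)

def g_med(x):
    return sorted(x)[(len(x) - 1) // 2]   # lower median (assumed convention for (5))

def g_maxmin(x):                          # x is a k-by-l matrix: max over rows of row minima
    return max(min(row) for row in x)

random.seed(0)
for _ in range(1000):
    k, l = 4, 3
    x = [[random.uniform(-5, 5) for _ in range(l)] for _ in range(k)]
    y = [[random.uniform(-5, 5) for _ in range(l)] for _ in range(k)]
    flat_x = [v for row in x for v in row]
    flat_y = [v for row in y for v in row]
    d_inf = max(abs(a - b) for a, b in zip(flat_x, flat_y))
    # each aggregation moves by at most the sup-norm perturbation
    assert abs(g_max(flat_x) - g_max(flat_y)) <= d_inf + 1e-12
    assert abs(g_med(flat_x) - g_med(flat_y)) <= d_inf + 1e-12
    assert abs(g_maxmin(x) - g_maxmin(y)) <= d_inf + 1e-12
```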
Lemma 16 (Median aggregation commutes with truncation) Let $G_{\mathrm{med}} : \mathbb{R}^k \to \mathbb{R}$ be the median aggregation; then $G_{\mathrm{med}}$ commutes with truncation. That is, for any $\gamma > 0$ and $x \in \mathbb{R}^k$,
$$G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k) \in D([G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma)$$
for all disambiguations $\tilde x_i \in D([x_i]_\gamma)$, $i \in [k]$.

Proof Fix $\gamma > 0$ and denote by $x_{(1)},\dots,x_{(k)}$ the ascending order of a sequence $x_1,\dots,x_k$. Now for $\tilde x_i \in D([x_i]_\gamma) \subseteq \{0,1\}$, our definition of the median (5) implies that $G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k) \in \{0,1\}$. It remains to perform an exhaustive verification of the possible cases. If $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = *$, then any value in $\{0,1\}$ is valid. If $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = 0$, it means that $G_{\mathrm{med}}(x_1,\dots,x_k)$ outputs a value smaller than $-\gamma$, which means that at least half of the inputs $x_1,\dots,x_k$ have a value smaller than $-\gamma$. Let these values be $x_{(1)},\dots,x_{(m)}$, where $m \ge \lceil k/2\rceil$. We have $[x_{(j)}]_\gamma = 0$ for $j \in [m]$ and $[x_{(\ell)}]_\gamma \in \{*, 1\}$ for $\ell \in \{m+1,\dots,k\}$. For any disambiguation $\tilde x_{(\ell)}$, $G_{\mathrm{med}}(\tilde x_1,\dots,\tilde x_k)$ would still output 0, which is a valid disambiguation of $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma$. The case $[G_{\mathrm{med}}(x_1,\dots,x_k)]_\gamma = 1$ follows from the same argument.

Lemma 17 (Max-Min aggregation commutes with truncation) Let $G_{\text{max-min}} : \mathbb{R}^{k\times\ell} \to \mathbb{R}$ be the max-min aggregation; then $G_{\text{max-min}}$ commutes with truncation. That is, for any $\gamma > 0$ and $x \in \mathbb{R}^{k\times\ell}$,
$$G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell}) \in D([G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma)$$
for all disambiguations $\tilde x_{ij} \in D([x_{ij}]_\gamma)$, $i \in [k]$, $j \in [\ell]$.

Proof Fix $\gamma > 0$. For any $i \in [k]$, denote by $x_{i(1)},\dots,x_{i(\ell)}$ the ascending order of the sequence $x_{i1},\dots,x_{i\ell}$. We assume $\tilde x_{ij} \in D([x_{ij}]_\gamma) \subseteq \{0,1\}$, and $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ outputs a value in $\{0,1\}$ by our definition of the max-min. We check all possible outputs of $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma$ and verify that $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ is a valid disambiguation. If $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = *$, then any value in $\{0,1\}$ is valid. If $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = 0$, it means that $G_{\text{max-min}}(x_{11},\dots,x_{k\ell})$ outputs a value smaller than $-\gamma$.
This means that all values that minimize each row, $x_{1(1)},\dots,x_{k(1)}$, are smaller than $-\gamma$, since the maximum of them is smaller than $-\gamma$. We have $[x_{i(1)}]_\gamma = 0$ for $i \in [k]$. For any disambiguation $\tilde x_{ij}$, $G_{\text{max-min}}(\tilde x_{11},\dots,\tilde x_{k\ell})$ would still output 0, which is a valid disambiguation of $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma$. The case $[G_{\text{max-min}}(x_{11},\dots,x_{k\ell})]_\gamma = 1$ follows from the same argument.

A.2 Covering numbers and the fat-shattering dimension

In this section, we summarize some known results connecting the covering numbers of a bounded function class to its fat-shattering dimension.

Lemma 18 (Talagrand (2003), Proposition 1.4) For any $F \subseteq [-R,R]^\Omega$, there exists a probability measure $\mu$ on $\Omega$ such that
$$N(F, L_2(\mu), t) \ge 2^{C\,\mathrm{fat}_{2t}(F)}, \qquad 0 < t < R, \qquad (32)$$
where $C > 0$ is a universal constant. Moreover, $\mu$ may be taken to be the uniform distribution on any $2t$-shattered subset of $\Omega$.

Remark. The tightness of (32) is trivially demonstrated by the example $F = \{-\gamma,\gamma\}^n$.

Lemma 19 (Mendelson and Vershynin (2003), Theorem 1) For all $F \subseteq [-1,1]^\Omega$ and all probability measures $\mu$,
$$N(F, L_2(\mu), t) \le \left(\frac{2}{t}\right)^{C\,\mathrm{fat}_{ct}(F)}, \qquad 0 < t < 1, \qquad (33)$$
where $C, c > 0$ are universal constants.

Remark 20 The following example, due to Vershynin (2021), shows that (33) is tight. Take $\Omega = [n]$ and $F = [-1,1]^\Omega$. Then, for all sufficiently small $t > 0$, we have $\mathrm{fat}_t(F) = n$. However, a simple volumetric calculation shows that $N(F, L_2(\mu), t)$ behaves as $(C/t)^n$ for small $t$, where $C > 0$ is a constant.

Lemma 21 (Rudelson and Vershynin (2006)) Suppose that $p \in [2,\infty)$, $\mu$ is a probability measure on $\Omega$, and $R > 0$. If $F \subseteq L_p(\Omega,\mu)$ satisfies $\sup_{f\in F}\|f\|_{L_{2p}(\mu)} \le R$, then
$$\log N(F, L_p(\mu), t) \le Cp^2\,\mathrm{fat}_{ct}(F)\log\frac{R}{ct}, \qquad 0 < t < R;$$
furthermore, for all $\varepsilon > 0$, if $\sup_{f\in F}\|f\|_{L_\infty(\mu)} \le R$, then
$$\log N(F, L_\infty(\mu), t) \le Cv\log\frac{Rn}{vt}\,\log^\varepsilon\frac{n}{v}, \qquad 0 < t < R,$$
where $n = |\Omega|$, $v = \mathrm{fat}_{c\varepsilon t}(F)$, and $C, c > 0$ are universal constants.
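Returning to Lemma 16: its case analysis can be verified by exhaustive enumeration. The truncation map used below (0 below $-\gamma$, 1 above $\gamma$, $*$ in between) is an assumed form of $[\cdot]_\gamma$, and the lower median stands in for definition (5); both are labeled assumptions for this sketch.

```python
import itertools
import random

def trunc(t, gamma):
    # assumed three-valued truncation: 0 below -gamma, 1 above +gamma, '*' in between
    return 0 if t <= -gamma else 1 if t >= gamma else "*"

def disamb(v):
    # the set of valid disambiguations of a partial value
    return (0, 1) if v == "*" else (v,)

def med(xs):
    return sorted(xs)[(len(xs) - 1) // 2]   # lower median

random.seed(1)
gamma = 1.0
for _ in range(200):
    x = [random.uniform(-3, 3) for _ in range(5)]
    target = disamb(trunc(med(x), gamma))
    # the median of every disambiguation must itself be a valid disambiguation
    for xt in itertools.product(*(disamb(trunc(t, gamma)) for t in x)):
        assert med(xt) in target
```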
A.3 Covering numbers of linear and affine classes

Let $B \subseteq \mathbb{R}^d$ be the d-dimensional Euclidean unit ball and $F = \{x \mapsto w\cdot x + b ;\, \|w\| \vee |b| \le R\}$ be the collection of R-bounded affine functions on $\Omega = B$.

Remark 22 There is a trivial reduction from an R-bounded affine class in $d$ dimensions to a 2R-bounded linear class in $d+1$ dimensions, via the standard trick of adding an extra dummy dimension. This only affects the covering number bounds up to constants.

For $\Omega_n \subseteq B$, $|\Omega_n| = n$, define $F(\Omega_n) = F|_{\Omega_n}$, and endow $\Omega_n$ with the uniform measure $\mu_n$. Zhang (2002, Theorem 4) implies the covering number estimate
$$\log N(F(\Omega_n), L_\infty(\mu_n), t) \le C\,\frac{R^2}{t^2}\,\mathrm{Log}(n),$$
where $C > 0$ is a universal constant (Zhang's result is more general and allows to compute explicit constants). We will use the following sharper bound:

Lemma 23
$$\log N(F(\Omega_n), L_\infty(\mu_n), t) \le C\,\frac{R^2}{t^2}\,\mathrm{Log}\!\left(2 + \frac{mt^2}{R^2}\right), \qquad 0 < t < R,$$
where $m = \min\{n, d\}$ and $C > 0$ is a universal constant.

Proof The result is folklore knowledge, but we provide a proof for completeness. Let $B = B_2^d$ be the Euclidean unit ball and $X = \{x_1,\dots,x_n\} \subseteq B$. This induces the set $F = \{(w\cdot x_1, w\cdot x_2,\dots, w\cdot x_n) ;\, w \in B\} \subseteq \mathbb{R}^n$. We argue that there is no loss of generality in assuming $n \le d$. Indeed, if $n > d$, then $X$ is spanned by some $X' = \{x'_1,\dots,x'_d\} \subseteq B$ and $F \subseteq \mathrm{span}(X')$ is also an (at most) d-dimensional set. Thus, we assume $n \le d$ henceforth. Via a standard infinitesimal perturbation, we can assume that $X$ is a linearly independent set (i.e., spans $\mathbb{R}^n$). If we treat $X$ as an $n \times d$ matrix, then $F = XB$, which means that $F$ is an ellipsoid. We are interested in estimating the $\ell_\infty$ covering numbers of $F$. Let $K \subseteq \mathbb{R}^d$ be such that $XK = L$, where $L = B_\infty^n$ is the unit cube. (The existence of a $K$ such that $XK \subseteq L$ is obvious, but because we assumed that $X$ spans $\mathbb{R}^n$, every point in $[-1,1]^n$ has a pre-image under $X$.) Let us compute the polar body $K^\circ$, defined as
$$K^\circ = \left\{u \in \mathbb{R}^d : \sup_{v\in K} v\cdot u \le 1\right\}.$$
We claim that
$$K^\circ = \mathrm{absconv}(X) := \left\{\sum_{i=1}^n \alpha_i x_i ;\, \sum_{i=1}^n |\alpha_i| \le 1\right\}.$$
Indeed, consider a $z = \sum_{i=1}^n \alpha_i x_i \in \mathrm{absconv}(X)$.
Then, for any $v \in K$, we have
$$v\cdot z = \sum_{i=1}^n \alpha_i(v\cdot x_i) \le \sum_{i=1}^n |\alpha_i| \le 1 \implies z \in K^\circ,$$
where we have used $|v\cdot x_i| \le 1$, since $XK = L = B_\infty^n = [-1,1]^n$. This shows that $\mathrm{absconv}(X) \subseteq K^\circ$. On the other hand, consider any $u \in K^\circ$. There is no loss of generality in assuming that $u$ is in the span of $X$, that is, $u = \sum_{i=1}^n \alpha_i x_i$, for $\alpha_i \in \mathbb{R}$. By definition of $u \in K^\circ$, we have
$$\sup_{v\in K} v\cdot u = \sup_{v\in K} v\cdot\sum_{i=1}^n \alpha_i x_i = \sup_{v\in K}\sum_{i=1}^n \alpha_i(v\cdot x_i) \le 1.$$
Now because $XK = [-1,1]^n$, for each choice of $\alpha \in \mathbb{R}^n$, there is a $v \in K$ such that $v\cdot x_i = \mathrm{sign}(\alpha_i)$ for all $i \in [n]$. This shows that we must have $\sum_{i=1}^n |\alpha_i| \le 1$, and proves $K^\circ \subseteq \mathrm{absconv}(X)$.

It is well-known (and easy to verify) that covering numbers enjoy an affine invariance:
$$N(F, L) := N(XB, XK) = N(B, K),$$
where $N(A, B)$, for two sets $A, B$, is the smallest number of copies of $B$ necessary to cover $A$. Now the seminal result of Artstein et al. (2004) applies: for all $t > 0$,
$$\log N(B, tK) \le a\log N(K^\circ, btB),$$
where $a, b > 0$ are universal constants. This reduces the problem to estimating the $\ell_2$-covering numbers of $\mathrm{absconv}(X)$. The latter may be achieved via Maurey's method (Vershynin, 2018, Corollary 0.0.4 and Exercise 0.0.6): the $t$-covering number of $\mathrm{absconv}(rX)$ under $\ell_2$ is at most $(c + cmt^2/r^2)^{r^2/t^2}$, where $c > 0$ is a universal constant.

A.4 Fat-shattering dimension of linear and affine classes

In this section, $\Omega = \mathbb{R}^d$ and $B \subseteq \mathbb{R}^d$ denotes the Euclidean unit ball. A function $f : \Omega \to \mathbb{R}$ is said to be affine if it is of the form $f(x) = w\cdot x + b$, for some $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, where $\cdot$ denotes the Euclidean inner product.

Throughout the paper, we have referred to R-bounded affine function classes as those for which $\|w\| \vee |b| \le R$. In this section, we define the larger class of R-semi-bounded affine functions as those for which $\|w\| \le R$, but $b$ may be unbounded. In particular, the covering-number results (and the reduction to linear classes spelled out in Remark 22) do not apply to semi-bounded affine classes. The following simple result may be of independent interest.
Lemma 24 Let $F \subseteq \mathbb{R}^\Omega$ be some collection of functions with the closure property
$$f, g \in F \implies (f - g)/2 \in F. \qquad (34)$$
Then, for all $\gamma > 0$, we have $\mathrm{fat}_\gamma(F) = \widetilde{\mathrm{fat}}_\gamma(F)$.

Proof Suppose that some set $\{x_1,\dots,x_k\}$ is $\gamma$-shattered by $F$. That means that there is an $r \in \mathbb{R}^k$ such that for all $y \in \{-1,1\}^k$, there is an $f = f_y \in F$ for which
$$\gamma \le y_i(f(x_i) - r_i), \qquad i \in [k]. \qquad (35)$$
Now for any $y \in \{-1,1\}^k$, let $\hat f = f_y$ and $\check f = f_{-y}$. Then, for each $i \in [k]$, we have
$$\gamma \le y_i(\hat f(x_i) - r_i), \qquad \gamma \le -y_i(\check f(x_i) - r_i).$$
It follows that $f^* = (\hat f - \check f)/2$ achieves (35), for the given $y$, with $r \equiv 0$. Now (34) implies that the function defined by $f^*$ belongs to $F$, which completes the proof.

Now it is well-known (Bartlett and Shawe-Taylor, 1999, Theorem 4.6) that bounded linear functions, i.e., function classes on $B$ of the form $F = \{x \mapsto w\cdot x ;\, \|w\| \le R\}$, also known as homogeneous hyperplanes, satisfy $\mathrm{fat}_\gamma(F) \le (R/\gamma)^2$. The discussion in Hanneke and Kontorovich (2019, p. 102) shows that the common approach of reducing the general (affine) case to the linear (homogeneous, $b = 0$) case, via the addition of a dummy coordinate, incurs a large suboptimal factor in the bound. Hanneke and Kontorovich (2019, Lemma 6) is essentially an analysis of the fat-shattering dimension of bounded affine functions. Although this result contains a mistake (see Section 5), much of the proof technique can be salvaged:

Lemma 25 The semi-bounded affine function class on $B$ defined by $F = \{x \mapsto w\cdot x + b ;\, \|w\| \le R\}$ in $d$ dimensions satisfies
$$\mathrm{fat}_\gamma(F) \le \min\left\{d + 1,\, \left(\frac{3R}{\gamma}\right)^2\right\}.$$

Proof Since $F$ satisfies (34), it suffices to consider $\widetilde{\mathrm{fat}}_\gamma(F)$, and so the shattering condition simplifies to
$$\gamma \le y_i(w\cdot x_i + b), \qquad i \in [k]. \qquad (36)$$
Now $\widetilde{\mathrm{fat}}_\gamma(F)$ is always upper-bounded by the VC-dimension of the corresponding class thresholded at zero, i.e., $\mathrm{sign}(F)$. For d-dimensional inhomogeneous hyperplanes, the latter is exactly $d + 1$ (Mohri et al., 2012, Example 3.2). Having dispensed with the dimension-dependent part of the bound, we now focus on the R-dependent one.
Let us observe, as in Hanneke and Kontorovich (2019, Lemma 6), that for $\|x_i\| \le 1$ and $\|w\| \vee \gamma \le R$, one can always realize (36) with $|b| \le 2R$; this is what we shall assume, without loss of generality, henceforth. Summing up the $k$ inequalities in (36) yields
$$k\gamma \le \sum_{i=1}^k y_i(w\cdot x_i + b) = w\cdot\sum_{i=1}^k y_i x_i + b\sum_{i=1}^k y_i \le R\left\|\sum_{i=1}^k y_i x_i\right\| + 2R\left|\sum_{i=1}^k y_i\right|.$$
Letting $y$ be drawn uniformly from $\{-1,1\}^k$ and taking expectations, we have
$$k\gamma \le R\,\mathbb{E}\left\|\sum_{i=1}^k y_i x_i\right\| + 2R\,\mathbb{E}\left|\sum_{i=1}^k y_i\right| \le R\sqrt{\sum_{i=1}^k \|x_i\|^2} + 2R\sqrt{\sum_{i=1}^k \mathbb{E} y_i^2} \le 3R\sqrt{k}.$$
Isolating $k$ on the left-hand side of the inequality proves the claim: $k \le (3R/\gamma)^2$.

Following a referee's suggestion, we improve the constant as follows. Note that
$$\mathbb{E}\left|\sum_{i=1}^k y_i\right| = 2^{-k}\sum_{i=0}^k \binom{k}{i}|k - 2i| = k\,2^{1-k}\binom{k-1}{\lfloor (k-1)/2\rfloor} \le \sqrt{\frac{2}{\pi}}\cdot\frac{k}{\sqrt{k-1}},$$
where the inequality follows from a binomial coefficient estimate via Stirling's approximation. Thus,
$$k\gamma \le R\sqrt{k} + 2R\sqrt{\frac{2}{\pi}}\cdot\frac{k}{\sqrt{k-1}},$$
which proves that $k \le \left(\left(1 + 2\sqrt{2/\pi}\right)\frac{R}{\gamma}\right)^2$ up to lower-order terms, improving the constant $3$ to $1 + 2\sqrt{2/\pi} < 2.6$.

A.5 Concavity miscellanea

The results below are routine exercises in differentiation and Jensen's inequality.

Lemma 26 For $u > 0$, the function $x \mapsto x\log(u/x)$ is concave on $(0,\infty)$.

Corollary 27 For all $u > 0$ and $v_i > 0$, $i \in [k]$,
$$\sum_{i=1}^k v_i\log(u/v_i) \le \left(\sum_{i=1}^k v_i\right)\log\frac{uk}{\sum_{i=1}^k v_i}.$$

Lemma 28 For $0 \le \varepsilon \le \log 2$ and $u \ge 2$, the function $x \mapsto x\log^{1+\varepsilon}(u/x)$ is concave on $[1,\infty)$. It follows that for $\varepsilon, u$ as above and $v_i \ge 1$, $i \in [k]$,
$$\sum_{i=1}^k v_i\log^{1+\varepsilon}(u/v_i) \le \left(\sum_{i=1}^k v_i\right)\log^{1+\varepsilon}\frac{uk}{\sum_{i=1}^k v_i}.$$
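Several estimates from this appendix can be checked mechanically. The sketch below verifies (i) the classical explicit construction witnessing $\mathrm{fat}_\gamma \ge (R/\gamma)^2$ for homogeneous hyperplanes, the matching lower bound to the Bartlett and Shawe-Taylor upper bound quoted before Lemma 25, (ii) the exact binomial identity for $\mathbb{E}|\sum_i y_i|$ used in the referee's improvement, and (iii) Corollary 27 on random inputs; the specific numeric choices are illustrative.

```python
import itertools
import math
import random
from math import comb

# (i) gamma-shatter k = (R/gamma)^2 orthonormal points e_1, ..., e_k in the
#     unit ball using w = gamma * sum_i y_i e_i, which satisfies ||w|| <= R.
R, gamma, d = 1.0, 1.0 / 3.0, 12
k = round((R / gamma) ** 2)               # 9 points
for y in itertools.product([-1, 1], repeat=k):
    w = [gamma * y[i] for i in range(k)] + [0.0] * (d - k)
    assert math.sqrt(sum(c * c for c in w)) <= R + 1e-9
    for i in range(k):
        assert y[i] * w[i] >= gamma - 1e-9   # y_i * (w . e_i) = y_i * w_i >= gamma

# (ii) exact expectation E|y_1 + ... + y_k| over uniform sign vectors,
#      computed by enumerating binomial counts, against the closed form.
def e_abs_sum(m):
    return sum(comb(m, i) * abs(m - 2 * i) for i in range(m + 1)) / 2 ** m

for m in range(1, 12):
    assert abs(e_abs_sum(m) - m * 2 ** (1 - m) * comb(m - 1, (m - 1) // 2)) < 1e-12

# (iii) Corollary 27 (Jensen's inequality for x -> x log(u/x)) on random data
random.seed(2)
for _ in range(1000):
    kk = random.randint(1, 8)
    u = random.uniform(0.1, 10.0)
    vs = [random.uniform(0.01, 5.0) for _ in range(kk)]
    s = sum(vs)
    assert sum(v * math.log(u / v) for v in vs) <= s * math.log(u * kk / s) + 1e-9
```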