Agnostic Sample Compression Schemes for Regression

Idan Attias 1, Steve Hanneke 2, Aryeh Kontorovich 1, Menachem Sadigurschi 1

1 Department of Computer Science, Ben-Gurion University, Israel. 2 Department of Computer Science, Purdue University, USA. Correspondence to: Idan Attias. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We obtain the first positive results for bounded sample compression in the agnostic regression setting with the ℓ_p loss, where p ∈ [1, ∞]. We construct a generic approximate sample compression scheme for real-valued function classes whose size is exponential in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression of size linear in the dimension is constructed. Moreover, for the ℓ_1 and ℓ_∞ losses, we can even exhibit an efficient exact sample compression scheme of size linear in the dimension. We further show that for every other ℓ_p loss, p ∈ (1, ∞), there does not exist an exact agnostic compression scheme of bounded size. This refines and generalizes a negative result of David, Moran, and Yehudayoff (2016) for the ℓ_2 loss. We close by posing general open questions: for agnostic regression with the ℓ_1 loss, does every function class admit an exact compression scheme of polynomial size in the pseudo-dimension? For the ℓ_2 loss, does every function class admit an approximate compression scheme of polynomial size in the fat-shattering dimension? These questions generalize Warmuth's classic sample compression conjecture for realizable-case classification (Warmuth, 2003).

1. Introduction

Sample compression is a central problem in learning theory, whereby one seeks to retain a small subset of the labeled sample that uniquely defines a good hypothesis. Quantifying "small" and "good" specifies the different variants of the problem. For instance, in the classification setting, taking "small" to mean constant size (i.e., depending only on the VC-dimension d of the concept class but not on the sample size m) and "good" to mean consistent with the sample specifies the classic realizable sample compression problem for VC classes. The feasibility of the latter was an open problem between its being posed by Littlestone and Warmuth (1986b) and its positive resolution by Moran & Yehudayoff (2016), with various intermediate steps in between (Floyd, 1989; Helmbold, Sloan, and Warmuth, 1992; Floyd and Warmuth, 1995b; Ben-David and Litman, 1998; Kuzmin and Warmuth, 2007; Rubinstein, Bartlett, and Rubinstein, 2009; Rubinstein and Rubinstein, 2012; Chernikov and Simon, 2013; Livni and Simon, 2013; Moran, Shpilka, Wigderson, and Yehudayoff, 2017). A stronger form of this problem, where "small" means O(d) (or even exactly d), remains open (Warmuth, 2003). David, Moran, and Yehudayoff (2016) generalized the definition of a compression scheme to the agnostic case, where it is required that the function reconstructed from the compression set obtains an average loss on the full data set nearly as small as that of the function in the class minimizing this quantity. In Remark 2.2, we give a strong motivation for this criterion by arguing an equivalence to the generalization ability of the compression-based learning algorithm. Under this definition, David et al.
(2016) extended the realizablecase result for VC classes to cover the agnostic case as well: a bounded-size compression scheme for the former implies such a scheme (in fact, of the same size) for the latter. They also generalized from binary to multiclass concept families, with the graph dimension in place of VC-dimension. Proceeding to real-valued function classes, David et al. (2016) came to a starkly negative conclusion: they established that there is no constant-size exact agnostic sample compression scheme for linear functions under the ℓ2 loss. (Realizable linear regression in Rd trivially admits sample compression of size d + 1, under any loss, by selecting a minimal subset that spans the data.) Main results. We are the first to construct bounded sample compression schemes for agnostic regression with ℓp loss, p [1, ]. Table 1 summarizes our contributions in the context of previous results. We refer to an α-approximate compression as one where the function reconstructed from the compression set achieves an average error at most α compared to the optimal function in the class. We consider the sample compression to be exact when we precisely recover this error. See Equations (3) and (4) for formal definitions. Agnostic Sample Compression Schemes for Regression Our approach begins with proposing a boosting method (Algorithm 1) to construct an α-approximate sample compression scheme for agnostic ℓp regression, within function classes characterized by a finite fat-shattering dimension. The scheme has a size of O(fat(F, cα/p) fat (F, cα/p))1, for some numerical constant c > 0, as established by Theorem 3.1. Here, fat(F, cα/p) represents the fat-shattering dimension of function class F at scale cα/p, and fat is the dimension of the dual-class, which is finite as long as the dimension of the primal class is finite and can be at most exponentially larger, see Equation (5). Notably, our compression size is independent of the sample size. A major open question is how to improve the exponential dependence in the dimension, even in the realizable binary classification setting (Warmuth, 2003). While such an approximate compression has been previously acknowledged in realizable regression (Hanneke et al., 2019), and exact compression in agnostic binary classification (David et al., 2016), in Section 3 we delve into the details of our techniques and elucidate why methods previously suggested fall short in addressing agnostic regression. We proceed with exploring linear regression. The negative result of David et al. (2016) regarding the impossibility of achieving an exact compression for linear regression with the ℓ2 (squared) loss raises a general doubt over whether exact sample compression is ever a viable approach to agnostic learning of real-valued functions. We address this concern by proving that, if we replace the ℓ2 loss with the ℓ1 or ℓ loss, then there is a simple exact agnostic compression scheme of size d + 1 for ℓ1 linear regression and d + 2 for ℓ in Rd, see Theorems 4.3 and 4.4. This is somewhat surprising, given the above negative result for the ℓ2 loss. Computationally, our compression schemes for ℓ1 and ℓ involve solving a linear program of linear size. We then propose Algorithm 2 for an α-approximate sample compression for ℓp linear regression of size O(d log(p/α)), where p (1, ), see Theorem 4.2. Roughly speaking, we reduce the problem to realizable binary classification with linear functions. 
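This reduction is spelled out in Algorithm 2 (Section 4.1). As a rough, hedged illustration of the idea (not the paper's implementation), the Python sketch below discretizes the label range, labels the resulting grid points by the sign of f*(x) − y, fits a linear separator, and reads a regressor back off the separating hyperplane. It uses scikit-learn's SVC with a large C as a stand-in for a hard-margin SVM and an ordinary least-squares fit as a stand-in for the ℓ_p-optimal regressor f*; all data and parameter choices are illustrative assumptions.

```python
# Hedged, self-contained sketch (not the paper's code) of the reduction idea:
# agnostic linear regression -> realizable linear classification on a discretized label grid.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, m, alpha = 2, 50, 0.1

X = rng.uniform(-1, 1, size=(m, d)) / np.sqrt(d)                 # ||x_i||_2 <= 1
y = np.clip(X @ np.array([0.5, -0.3]) + 0.05 * rng.normal(size=m), 0, 1)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)                    # stand-in for f* = argmin L_p(f, S)
f_star = X @ w_star

# Discretized dataset: for every x_i, a grid of candidate labels in [-1-alpha, 1+alpha],
# classified by the sign of f*(x_i) - grid value (realizable by construction).
grid = np.arange(-1 - alpha, 1 + 2 * alpha, alpha)
pts = np.array([np.append(x, g) for x in X for g in grid])        # points in R^{d+1}
cls = np.array([1 if fx - g >= 0 else -1 for fx in f_star for g in grid])

svm = SVC(kernel="linear", C=1e8).fit(pts, cls)                   # ~hard-margin SVM
w, b = svm.coef_[0], svm.intercept_[0]

# Reinterpret the separating hyperplane <w, (x, y)> + b = 0 as a regressor y = f_hat(x).
f_hat = -(X @ w[:d] + b) / w[d]
print("max |f_hat - f*| on S:", np.abs(f_hat - f_star).max(), "(should be roughly <= alpha)")
```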
Our approach involves introducing a discretized dataset on which the optimal solution of Support Vector Machine (SVM) pointwise approximates an optimal regressor on the original dataset. We complement this result by showing that p {1, } are the only two ℓp losses for which a constant-size exact compression scheme exists (Theorem 4.6), generalizing the argument of David et al. (2016). These appear to be the first positive results for bounded agnostic sample compression for real-valued function classes. We close by posing intriguing open questions generalizing our result to arbitrary function classes: under the ℓ1 loss, does every function class admit an exact agnostic compres- 1 O hides polylogarithmic factors in the specified expression. sion scheme of size equal to its pseudo-dimension? under the ℓ2 loss, does every function class admit an approximate agnostic compression of size equal to its fat-shattering dimension? We argue that this represents a generalization of Warmuth s classic sample compression problem, which asks whether every space of classifiers admits a compression scheme of size VC-dimension in the realizable case. Related work. Sample compression scheme is a classic technique for proving generalization bounds, introduced by Littlestone & Warmuth (1986a); Floyd & Warmuth (1995a). These bounds proved to be useful in numerous learning settings, particularly when the uniform convergence property does not hold or provides suboptimal rates, such as binary classification (Graepel et al., 2005b; Moran & Yehudayoff, 2016; Bousquet et al., 2020), multiclass classification (Daniely et al., 2015; Daniely & Shalev-Shwartz, 2014; David et al., 2016; Brukhim et al., 2022), regression (Hanneke et al., 2019; Attias et al., 2023), active learning (Wiener et al., 2015), density estimation (Ashtiani et al., 2020), adversarially robust learning (Montasser et al., 2019; 2020; 2021; 2022; Attias et al., 2022; Attias & Hanneke, 2023), learning with partial concepts (Alon et al., 2022), and showing Bayes-consistency for nearest-neighbor methods (Gottlieb et al., 2014; Kontorovich et al., 2017). As a matter of fact, compressibility and learnability are known to be equivalent for general learning problems (David et al., 2016). A remarkable result by Moran & Yehudayoff (2016) showed that VC classes enjoy a sample compression that is independent of the sample size. David et al. (2016) introduced sample compression in the context of regression. They showed that an exact compression scheme for ℓ2 agnostic linear regression requires a linear growth relative to the sample size. Additionally, they showed that it is feasible to have an α-approximate compression for zero-dimensional linear regression with a size of log(1/α)/α. In a broader sense, they established the equivalence between learnability and the presence of an approximate compression in regression. Hanneke et al. (2019) showed how to convert consistent real-valued learners into constant-size (i.e., independent of sample size) efficiently computable approximate compression schemes for the realizable (or nearly realizable) regression with the ℓ loss. This result was obtained via a weak-to-strong boosting procedure, coupled with a generic construction of weak learners out of abstract regressors. The agnostic variant of this problem remains open in its full generality. Ashtiani et al. (2020) adapted the notion of a compression scheme to the distribution learning problem. 
They showed that if a class of distributions admits robust compressibility then it is agnostically learnable.

| Problem Setup | Compression Type | Compression Size | Reference |
| Realizable/Agnostic Binary Classification | Exact | O(VC · VC*) | (Moran & Yehudayoff, 2016; David et al., 2016) |
| Realizable/Agnostic Multiclass Classification | Exact | O(d_G · d_G*) | (David et al., 2016) |
| | | O(DS^{1.5} · polylog(m)) | (Brukhim et al., 2022) |
| | | Ω(log(m)^{1−o(1)}) | (Pabbaraju, 2023) |
| Realizable ℓ_∞ Regression | α-Approximate | O(fat_{cα} · fat*_{cα} · polylog(fat_{cα}, fat*_{cα}, 1/α)) | (Hanneke et al., 2019) |
| Agnostic ℓ_p Regression, p ∈ (1, ∞) | α-Approximate | O(fat_{cα} · fat*_{cα} · polylog(fat_{cα}, fat*_{cα}, p, 1/α)) | This work |
| Agnostic ℓ_p Regression, p ∈ {1, ∞} | α-Approximate | O(fat_{cα} · fat*_{cα} · polylog(fat_{cα}, fat*_{cα}, 1/α)) | This work |
| Agnostic ℓ_p Linear Regression, p ∈ {1, ∞} | Exact | O(d) | This work |
| Agnostic ℓ_p Linear Regression, p ∈ (1, ∞) | α-Approximate | O(d log(p/α)) | This work |
| Agnostic ℓ_2 Linear Regression | Exact | Ω(m) | (David et al., 2016) |
| Agnostic ℓ_p Linear Regression, p ∈ (1, ∞) | Exact | Ω(log(m)) | This work |

Table 1. Sample compression schemes for classification and regression. We denote the sample size by m; c > 0 is a numerical constant. The o(1) term vanishes as m → ∞. (i) Binary classification: VC is the Vapnik-Chervonenkis dimension, which characterizes realizable and agnostic learnability; a dimension marked with an asterisk (*) denotes the dimension of the dual class. (ii) Multiclass classification: d_G is the graph dimension and DS is the Daniely-Shalev-Shwartz dimension. For a finite label set, both dimensions characterize realizable and agnostic learnability; for an infinite label set, only finiteness of the DS dimension is equivalent to learnability, and there exist learnable function classes with infinite graph dimension and finite DS dimension. (iii) Regression: fat_{cα} is the fat-shattering dimension at scale cα. A function class is agnostically learnable in this setting if and only if its fat-shattering dimension is finite at every scale; in the realizable case, however, there are learnable classes with infinite fat-shattering dimension. We comment that the results in (Hanneke et al., 2019) are stated for ℓ_∞ but still hold for any ℓ_p (with extra polylog factors in p) due to Lipschitzness of the loss. (iv) Linear regression: d is the vector space dimension. We refer to Section 5 for open problems.

2. Preliminaries

We denote [m] := {1, ..., m}. Let F ⊆ Y^X be a hypothesis class. The ℓ_p loss incurred by a hypothesis f ∈ F on (x, y) is given by (x, y) ↦ |f(x) − y|^p, where p ∈ [1, ∞]. For p ∈ [1, ∞), the loss incurred by a hypothesis f ∈ F on a labeled sample S = {(x_i, y_i) : i ∈ [m]} is given by

L_p(f, S) := (1/m) ∑_{i=1}^{m} |f(x_i) − y_i|^p, (1)

while for p = ∞,

L_∞(f, S) := max_{1 ≤ i ≤ m} |f(x_i) − y_i|. (2)

Remark 2.1. The ℓ_p regression objective is typically written without taking the p-th root so as to facilitate optimization algorithms. As we avoid taking the p-th root, the resulting p-norm formulation does not directly converge to ℓ_∞ as p approaches infinity. Consequently, our ℓ_p results explicitly depend on p, similar to results in the literature.
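As a small, hedged illustration of these definitions (not code from the paper), the empirical objectives in Equations (1) and (2) can be computed as follows; note, as in Remark 2.1, that no p-th root is taken, so L_p does not tend to L_∞ as p grows. The toy sample and constant predictor below are illustrative assumptions.

```python
# Hedged sketch of the empirical objectives in Equations (1) and (2); not from the paper.
import numpy as np

def empirical_loss(f, S, p):
    """L_p(f, S) = (1/m) * sum_i |f(x_i) - y_i|^p for p < inf, and max_i |f(x_i) - y_i| for p = inf."""
    residuals = np.array([abs(f(x) - y) for x, y in S])
    return residuals.max() if np.isinf(p) else np.mean(residuals ** p)

# Toy example: a constant predictor f(x) = 0.5 on three labeled points.
S = [(0.0, 0.2), (0.5, 0.9), (1.0, 0.4)]
f = lambda x: 0.5
for p in (1, 2, np.inf):
    print(p, empirical_loss(f, S, p))
# Per Remark 2.1, L_p here is an average of p-th powers (no p-th root),
# so it does not converge to L_inf as p -> infinity.
```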
Now let us introduce a formal definition of sample compression, and a criterion we require of any valid agnostic compression scheme. Following the definition, we provide a strong motivation for this criterion in terms of an equivalence to the generalization ability of the learning algorithm under general conditions.

Approximate and exact sample compression schemes. Following David et al. (2016), a selection scheme (κ, ρ) for a hypothesis class F ⊆ Y^X is defined as follows. A k-selection function κ maps sequences {(x_1, y_1), ..., (x_m, y_m)} ∈ ∪_{ℓ≥1} (X × Y)^ℓ to elements of K := ∪_{ℓ≤k'} (X × Y)^ℓ × ∪_{ℓ≤k''} {0, 1}^ℓ, where k' + k'' ≤ k; that is, κ retains at most k' sample points together with at most k'' side-information bits. A reconstruction is a function ρ : K → Y^X. We say that (κ, ρ) is a k-size agnostic exact sample compression scheme for F if κ is a k-selection and, for all S = {(x_i, y_i) : i ∈ [m]}, the function f_S := ρ(κ(S)) achieves F-competitive empirical loss:

L_p(f_S, S) ≤ inf_{f ∈ F} L_p(f, S). (3)

We also define a relaxed notion of agnostic α-approximate sample compression, in which f_S should satisfy

L_p(f_S, S) ≤ inf_{f ∈ F} L_p(f, S) + α. (4)

In principle, the size k of an agnostic compression scheme may depend on the data set size m, in which case we denote this dependence by k(m). However, in this work we are primarily interested in the case when k(m) is bounded: that is, k(m) ≤ k for some m-independent value k. Note that the above definition is fully general, in that it defines a notion of agnostic compression scheme for any function class F and loss function L, though in the present work we focus on the L_p loss for 1 ≤ p ≤ ∞.
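To make the (κ, ρ) formalism concrete, here is a hedged toy sketch (not from the paper) of an exact agnostic compression scheme of size 1: for zero-dimensional ℓ_1 regression, the selection function keeps a single point carrying an empirical median label, and the reconstruction returns the corresponding constant predictor, which minimizes the empirical ℓ_1 loss and hence satisfies Equation (3). (This example reappears in the proof of Theorem 4.3.) The function names kappa and rho below are illustrative.

```python
# Hedged sketch of the selection/reconstruction formalism for a toy case (not from the paper):
# zero-dimensional l_1 regression, where retaining one median point gives an
# exact agnostic compression scheme, i.e., Equation (3) holds.
import numpy as np

def kappa(S):
    """1-selection: keep one labeled point whose label is an empirical (lower) median."""
    ys = sorted(y for _, y in S)
    med = ys[(len(ys) - 1) // 2]
    return [next(p for p in S if p[1] == med)], []      # (retained points, side-information bits)

def rho(compression):
    """Reconstruction: the constant predictor given by the retained label."""
    points, _bits = compression
    _x, y0 = points[0]
    return lambda x: y0

S = [(0, y) for y in [0.1, 0.9, 0.4, 0.35, 0.7]]
f_S = rho(kappa(S))
l1 = lambda f: np.mean([abs(f(x) - y) for x, y in S])
# For l_1, some minimizing constant can always be taken at a sample label.
best = min(l1(lambda x, c=c: c) for c in [y for _, y in S])
print(l1(f_S), "<=", best)                              # F-competitive empirical loss, Equation (3)
```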
Remark 2.2. At first, it might seem unclear why this is an appropriate generalization of sample compression to the agnostic setting. To see that it is, note that one of the main interests in sample compression schemes is their ability to generalize: that is, to achieve low excess risk under a distribution P on X × Y when the data S are sampled i.i.d. according to P (Littlestone and Warmuth, 1986b; Floyd and Warmuth, 1995b; Graepel, Herbrich, and Shawe-Taylor, 2005a). Also, as mentioned, in this work we are primarily interested in sample compression schemes of bounded size: k(m) ≤ k for an m-independent value k. Furthermore, we focus on the most general case, where this size bound should be independent of everything else in the scenario, such as the data S or the underlying distribution P. Given these interests, we claim that the above definition is essentially the only reasonable choice. More specifically, for the L_p loss with 1 ≤ p < ∞, any compression scheme with k(m) bounded whose expected excess risk under every P converges to 0 as m → ∞ necessarily satisfies the above condition (or is easily converted into one that does). To see this, note that for any data set S on which such a compression scheme fails to satisfy the above F-competitive empirical loss criterion, we can define a distribution P that is simply uniform on S; the compression scheme's selection function would then be choosing a bounded number of points from S and a bounded number of bits, while guaranteeing that the excess risk under P approaches 0, or equivalently, that the excess empirical loss approaches 0. To make this argument fully formal, only a slight modification is needed, to handle multiple copies of points from S in the compression set; since the size is bounded, these repetitions can be encoded in a bounded number of extra bits, so that we can stick to strictly distinct points in the compression set. In the converse direction, we also note that any bounded-size agnostic compression scheme (in the sense of the above definition) is guaranteed to have excess risk under P converging to 0 as m → ∞ when S is sampled i.i.d. according to P, for losses L_p with 1 ≤ p < ∞, as long as P guarantees that (X, Y) ∼ P has Y bounded (almost surely). This follows from classic arguments about the generalization ability of compression schemes, which include results for the agnostic case (Graepel, Herbrich, and Shawe-Taylor, 2005a). For unbounded Y one cannot, in general, obtain distribution-free generalization bounds; however, one can still obtain generalization under certain broader restrictions (see, e.g., Mendelson, 2015, and references therein). The generalization problem becomes more subtle for the L_∞ loss: this loss cannot be expressed as a sum of pointwise losses, and there are no standard techniques for bounding the deviation of the sample risk from the true risk. One recently studied guarantee achieved by minimizing the empirical L_∞ loss is a kind of hybrid error generalization, developed in Hanneke et al. (2019, Theorem 9). We refer the interested reader to that work for the details of those results, which can easily be extended to apply to our notion of an agnostic compression scheme.

Complexity measures. Let F ⊆ [0, 1]^X and γ > 0. We say that S = {x_1, ..., x_m} ⊆ X is γ-shattered by F if there exists a witness r = (r_1, ..., r_m) ∈ [0, 1]^m such that for each σ = (σ_1, ..., σ_m) ∈ {−1, 1}^m there is a function f_σ ∈ F with

f_σ(x_i) ≥ r_i + γ if σ_i = 1, and f_σ(x_i) ≤ r_i − γ if σ_i = −1.

The fat-shattering dimension of F at scale γ, denoted by fat(F, γ), is the cardinality of the largest set of points in X that can be γ-shattered by F. This parametrized variant of the pseudo-dimension (Alon et al., 1997) was first proposed by Kearns & Schapire (1994). Its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett & Long, 1998). We also define the dual dimension. Define the dual class F* ⊆ [0, 1]^F of F as the set of all functions g_w : F → [0, 1] defined by g_w(f) = f(w), for w ∈ X. If we think of a function class as a matrix whose rows and columns are indexed by functions and points, respectively, then the dual class is given by the transpose of this matrix. The dual fat-shattering dimension at scale γ is defined as the fat-shattering dimension at scale γ of the dual class, and is denoted by fat*(F, γ). We have the following bound due to Kleer & Simon (2021, Corollary 3.8 and inequality 3.1):

fat*(F, γ) ≤ (1/γ) · 2^{fat(F, γ/2) + 1}. (5)
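To make the definition concrete, the following hedged sketch (not from the paper) checks by brute force whether a finite set of points is γ-shattered by a small pool of functions, searching for a witness r over a grid; it is exponential in the number of points and is intended only to illustrate the definition. The threshold-function example at the end is an illustrative assumption.

```python
# Hedged brute-force illustration of gamma-shattering; not from the paper,
# and exponential in |S| -- for intuition only.
import itertools
import numpy as np

def is_gamma_shattered(S, fs, gamma, grid=np.linspace(0, 1, 21)):
    """True if some witness r (searched over a grid) realizes every sign pattern on S."""
    for r in itertools.product(grid, repeat=len(S)):
        ok = True
        for sigma in itertools.product([-1, 1], repeat=len(S)):
            # need some f with f(x_i) >= r_i + gamma when sigma_i = +1,
            # and f(x_i) <= r_i - gamma when sigma_i = -1
            if not any(all((f(x) >= ri + gamma) if s == 1 else (f(x) <= ri - gamma)
                           for x, ri, s in zip(S, r, sigma)) for f in fs):
                ok = False
                break
        if ok:
            return True
    return False

# One-dimensional threshold functions gamma-shatter any single point but no pair of points.
fs = [lambda x, t=t: float(x >= t) for t in np.linspace(0, 1, 11)]
print(is_gamma_shattered([0.5], fs, gamma=0.4))        # True
print(is_gamma_shattered([0.3, 0.7], fs, gamma=0.4))   # False
```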
3. Approximate Agnostic Compression for Real-Valued Function Classes

In this section, we construct an approximate compression scheme for all real-valued function classes that are agnostically PAC learnable, that is, classes with finite fat-shattering dimension at every scale (Alon et al., 1997; Bartlett & Long, 1998). We prove the following main result.

Theorem 3.1 (Approximate compression for agnostic regression). Let F ⊆ [0, 1]^X, S = {(x_i, y_i) : i ∈ [m]} ⊆ X × [0, 1], an approximation parameter α ∈ [0, 1], a weak learner parameter β ∈ (0, 1/2], and the ℓ_p loss with p ∈ [1, ∞]. By setting Algorithm 1 with T = O((1/β²) log(m)) and

d = O(fat(F, cα/p)), n = Õ(fat*(F, cα/p)) for p ∈ [1, ∞), and d = O(fat(F, cα)), n = Õ(fat*(F, cα)) for p = ∞,

we get an α-approximate sample compression scheme of size

Õ((1/β²) · fat(F, cα/p) · fat*(F, cα/p)) for p ∈ [1, ∞), and Õ((1/β²) · fat(F, cα) · fat*(F, cα)) for p = ∞,

for some universal constant c > 0. Recall that the dual fat-shattering dimension is at most exponential in the primal dimension; see Equation (5). Õ(·) hides polylogarithmic factors in (fat, fat*, p, 1/α, 1/β).

Remark 3.2. Note that having an α-approximate compression of size k implies a generalization error bound of α + Õ(√(k/m)) (David et al., 2016, Theorem 4.2).

Our algorithm incorporates a boosting approach for real-valued functions; therefore, we need a suitable definition of weak learners in this context.

Definition 3.3 (Approximate weak real-valued learner). Let β ∈ (0, 1/2] and α ∈ (0, 1). We say that g : X → [0, 1] is an approximate (α, β)-weak learner, with respect to P and a target function f* ∈ F, if

P_{(x,y)∼P} { (x, y) : |g(x) − y| > |f*(x) − y| + α } ≤ 1/2 − β.

This notion of a weak learner must be formulated carefully. For example, a learner guaranteeing absolute loss at most 1/2 − β is known to not be strong enough for boosting to work; see the discussion in Hanneke et al. (2019, Section 4). On the other hand, if the requirement is made too strong (for example, AdaBoost.R in Freund & Schapire (1997)), then the sample complexity of weak learning is so high that weak learners cannot be expected to exist for certain function classes. We can now present the main algorithm.

Algorithm 1 Approximate Agnostic Sample Compression for ℓ_p Regression, p ∈ [1, ∞]
Input: F ⊆ [0, 1]^X, S = {(x_i, y_i) : i ∈ [m]} ⊆ X × [0, 1].
Parameters: approximation parameter α ∈ (0, 1), weak learner parameter β ∈ (0, 1/2], weak learner sample size d ≥ 1, sparsification parameter n ≥ 1, number of boosting rounds T ≥ 1, loss parameter p ∈ [1, ∞].
Initialize: P_1 ← Uniform(S).
Find an optimal function f* ∈ F; the goal is to construct a function that pointwise approximates f* on S.
1. Compute:
 (a) f* ∈ argmin_{f ∈ F} L_p(f, S) (defined in Equations (1) and (2));
 (b) ψ(x, y) ← |f*(x) − y| for all (x, y) ∈ S.
Median boosting for real-valued functions.
2. For t = 1, ..., T:
 (a) Get a (2α, β)-approximate weak learner f̂_t with respect to the distribution P_t: find a multiset S_t ⊆ S of d points such that, for any f ∈ F with |f(x) − y| ≤ ψ(x, y) + α for all (x, y) ∈ S_t, it holds that P_{(x,y)∼P_t}{(x, y) : |f(x) − y| > ψ(x, y) + 2α} ≤ 1/2 − β. (Such an S_t exists by Theorem A.1.)
 (b) For i = 1, ..., m: set w_i^{(t)} ← 1 − 2·I[ |f̂_t(x_i) − y_i| > ψ(x_i, y_i) + 2α ].
 (c) Set α_t ← (1/2) log( ((1 − β) ∑_{i=1}^{m} P_t(x_i, y_i) I[w_i^{(t)} = 1]) / ((1 + β) ∑_{i=1}^{m} P_t(x_i, y_i) I[w_i^{(t)} = −1]) ).
 (d) If α_t = ∞: return T copies of f̂_t, (α_1 = 1, ..., α_T = 1), and S_t. Else: P_{t+1}(x_i, y_i) ← P_t(x_i, y_i) exp(−α_t w_i^{(t)}) / ∑_{j=1}^{m} P_t(x_j, y_j) exp(−α_t w_j^{(t)}).
Sparsifying the weighted ensemble {f̂_i}_{i=1}^{T} returned from boosting, via sampling.
3. Repeat:
 (a) Sampling: (J_1, ..., J_n) ∼ Categorical( α_1 / ∑_{s=1}^{T} α_s, ..., α_T / ∑_{s=1}^{T} α_s );
 (b) let F̃ = {f̂_{J_1}, ..., f̂_{J_n}};
 (c) until, for every (x, y) ∈ S: |{ f ∈ F̃ : |f(x) − y| > ψ(x, y) + 3α }| < n/2.
Compression: the multisets S_{J_1}, ..., S_{J_n} and the cut-offs ψ|_{S_{J_1}}, ..., ψ|_{S_{J_n}} corresponding to the weak learners in F̃.
Reconstruction: reconstruct the weak learners f̂_{J_i} from S_{J_i} and ψ|_{S_{J_i}}, i ∈ [n], and output their median Median(f̂_{J_1}, ..., f̂_{J_n}).

The challenges beyond realizable regression and agnostic classification. There is a crucial difference from previous boosting algorithms for real-valued functions used by Kégl (2003) and Hanneke et al. (2019) in the realizable case. In our approach, the cut-offs ψ(x, y) are allowed to vary across different points, in contrast to a fixed cut-off applied uniformly across all points. This flexibility enables us to address the agnostic setting, wherein the loss of an optimal minimizer may differ across the points of the sample. To prove the existence of weak learners we require a generalization theorem that is compatible with changing cut-offs; see Theorem A.1. A similar generalization result was used in the context of adversarially robust learning (Attias & Hanneke, 2023). The compression approach for agnostic binary classification, as discussed in David et al. (2016), encounters a similar challenge. In this method, our initial emphasis is on identifying the points correctly classified by an optimal function in the class.
Subsequently, we apply compression techniques for realizable classification. However, in regression, discarding points where the optimal function makes mistakes is not feasible, given that the loss is not strictly zero-one. Instead, we utilize the entire sample, targeting the error for each point and constructing a function with a similar approximated error on each point. Proof overview. First, we show that the returned output of Algorithm 1 is a valid compression. Then we bound the size of this compression. Approximate compression correctness. In step 1, we compute some f F the minimizes the empirical ℓp error on the sample S, f argmin f F Lp(f, S), as defined in Equations (1) and (2). Let ψ : X Y [0, 1] be the ℓ1 loss of f on each point in S, ψ(x, y) |f (x) y|, (x, y) S. In step 2, we implement a boosting algorithm, following Definition 3.3 of weak learners. By using Theorem A.1 with δ = 1/3 and ε = 1/2 β, for any distribution Pt on S, upon receiving an i.i.d. sample St S from Pt of size d = O fat(F, α/8) log2 fat(F, α/8) with probability 2/3 over sampling St from Pt, for any f F satisfying (x, y) St : |f(x) y| ψ(x, y)+α, it holds that P(x,y) Pt{(x, y) : |f(x) y| > ψ(x, y) + 2α} 1 That is, such a function is an approximate (2α, β)-weak learner for Pt and f . Since this holds with probability 2/3, there must be such St S. In order to construct an approximate (2α, β)-weak learner ˆft, we need to find f F such that (x, y) St : |f(x) y| ψ(x, y) + α, and so the weak learner can be encoded by St of size d and the set of cut-offs ψ(x, y) [0, 1] for all (x, y) St. We encode only approximations of the cut-offs to keep the compression size bounded (see the paragraph about the compression size below). For T = O 1 β2 log(m) rounds of boosting, Lemma A.3 guarantees that for all (x, y) S the output of the boosting algorithm satisfies Median ˆf1, . . . , ˆf T ; α1, . . . , αT (x) y ψ(x, y) + 2α. Finally, we use sampling to reduce the number of hypotheses in the ensemble from O 1 β2 log(m) to size that is independent of m. Lemma A.4 implies that the sparsification method in Step 3 ensures that we can sample n = O fat (F, cα) log2(fat (F, cα) /α) such that for all (x, y) S |Median(f J1(x), . . . , f Jn(x)) y| ψ(x, y) + 3α, where c > 0 is an absolute constant. By rescaling 3α to α, this proves the ℓ1 and ℓ losses. For p (1, ), we use the Lipschitzness of the ℓp loss and rescale the approximate parameter accordingly. We constructed a function h with |h(x) y| ψ(x, y)+α for any (x, y) S, which implies |h(x) y|p (i) ((ψ(x, y)) + α)p (ii) ψ(x, y)p + pα, and that will finish the proof. (i) Follows by just raising both sides to the power of p. (ii) Follows since the function x 7 |x y|p is p-Lipschitz for (x y) [0, 1], and so |(ψ(x, y) + α)p ψ(x, y)p| p|ψ(x, y) + α ψ(x, y)| By rescaling pα to α, we get |Median(f J1(x), . . . , f Jn(x)) y|p ψ(x, y)p + α, β2 fat (F, cα/p) log2 p fat (F, cα/p) d = O fat(F, cα/p) log2 p fat(F, cα/p) We proved the correctness of an α-approximate compression Lp(Median(f J1, . . . , f Jn) , S) inf f F Lp(f, S) + α. Approximate compression size. Each weak learner is encoded by a multiset S S of size d and is constructed by computing some f F that solves the constrained optimization |f (x) y| ψ(x, y) + α, (x, y) S . We encode each ψ(x, y) by some approximation ψ(x, y), such that ψ(x, y) ψ(x, y) α, by discretizing [0, 1] to 1/α buckets of size α, and each ψ(x, y) is rounded down to the closest value ψ(x, y). 
Each approximation requires to encode log(1/α) bits, and so each learner encodes Agnostic Sample Compression Schemes for Regression d log(1/α) bits and d samples. We have n weak learners, and the compression size is n(d + d log(1/α)) 2nd log(1/α) . By plugging in n and d, and rescaling α, we conclude β2 fat(F, cα/p) fat (F, cα/p) , p [1, ) β2 fat(F, cα) fat (F, cα) , p = . 4. Agnostic Compression for Linear Regression In this section, our focus is on ℓp linear regression in Rd. We begin by improving upon the construction of an approximate sample compression scheme for general classes, incorporating the structure of linear functions. Next, we demonstrate the feasibility of constructing an exact compression for p {1, } with a size linear in d. In sharp contrast, we exhibit that this holds only for p {1, }. We prove an impossibility result of achieving a bounded-size exact compression scheme for p (1, ). We use the following notation. Vectors v Rd are denoted by boldface, and their jth coordinate is indicated by v(j). (Thus, vi(j) indicates the jth coordinate of the ith vector in a sequence.) 4.1. Approximate Compression for p [1, ] In this subsection, our instance space is X = [0, 1]d, label space is Y = [0, 1], and hypothesis class is bounded homogeneous linear functions F YX , consisting of all fw : X Y given by fw(x) = w, x , indexed by w Rd, where w 2 1. In Section 3 we proved an approximate compression for general function classes with ℓp losses of size O fatcα/p fat cα/p polylog fatcα/p, fat cα/p, p, 1/α . We have an immediate corollary for linear functions. Let Pdim(F) be the pseudo-dimension of a function class F (Pollard, 1990b; Haussler, 1992), that can be defined as Pdim(F) = lim γ 0fatγ(F). The fat-shattering dimension (at any scale) is upper bounded by the pseudo-dimension. Moreover, the vector space dimension is of the same order as the pseudo-dimension (Anthony et al., 1999), and the dimension of the dual vector space is equal to the one of the primal space. This implies the following. Corollary 4.1. Algorithm 1 is a sample compression scheme of size O d2 polylog d, p, 1 α for bounded linear regression in dimension d with the ℓp loss, for p [1, ]. Another baseline solution involves encoding the coefficients of the linear regressor up to a certain approximation parameter. To achieve an α-approximate sample compression, each coefficient should be accurate up to an additive error of α/dp for p [1, ), and α/d for p = . Thus, in this solution, we will encode d log (dp/α) bits without retaining any samples for p [1, ), and d log (d/α) for p = . In this section, Theorems 4.2 to 4.4 improve upon these bounds by using a dedicated algorithm for linear functions. We start with the following result: Theorem 4.2 (Approximate compression for agnostic linear regression). Let F = x 7 w, x : w Rd, w 2 1 , S = {(xi, yi) : xi 2 1, i [m]} X [0, 1], and an approximation parameter α (0, 1). Algorithm 2 is an α-approximate sample compression scheme for the ℓp loss of size Algorithm 2 Approximate Agnostic Compression for ℓp Linear Regression, p [1, ] Input: F = x 7 w, x : w Rd, w 2 1 , S = {(xi, yi) : xi 2 1, i [m]} X [0, 1]. Parameters: Approximation parameter α [0, 1]. Find an optimal regressor for S 1. f argminf F Lp(f, S) Define a discretized dataset where the new labels are discretized to a resolution of α 2. Define Sα = A B, where A = (xi, jα) : i [m], j 1 α, . . . , 1, 0, 1, . . . 
, 1 B = (xi, j(1 + α)) : i [m], j { 1, +1} Label by 1 the discretized dataset with f Sα(f ) = {((xi, y) , z) : for any (xi, y) Sα : z = +1 if f (xi) y 0, otherwise z = 1} Compression: Run SVM for realizable binary classification on Sα(f ) and return a set of support vectors. Reconstruction: Run SVM on the compression set. Proof. Let F be the set of homogeneous linear predictors bounded by 1, F = x 7 w, x : w Rd, w 2 1 , and data a set S = {(xi, yi) : xi 2 1, i [m]} X [0, 1]. Agnostic Sample Compression Schemes for Regression Approximate compression correctness. The algorithmic idea is as follows. We first compute in Step 1 an optimal linear regressor f F for the ℓp loss. In step 2, we create a discretized dataset Sα of size m(2/α + 3), where for each example xi we create (2/α + 3) real-valued labels { 1 α, 1, . . . , 2α, α, 0, α, 2α, . . . , 1, 1 + α}. Then in step 3, we use the regressor f for classifying the dataset Sα. That is, for any (xi, y) Sα, we have ((xi, y) , +1) whenever f (xi) y 0, and ((xi, y) , 1) otherwise. We denote this dataset by Sα(f ). Note that for each xi we created a grid of binary labels of resolution α in the range [ 1 α, 1 + α], and since |f (xi)| 1, for each vector xi there exists y1, y2 such that (xi, y1), (xi, y2) Sα(f ) have different labels. To obtain compression, we execute Support Vector Machine(SVM) for realizable classification on Sα(f ). Note that the classification problem is in Rd+1 and the original regression problem is in Rd. Applying Caratheodory s theorem allows us to express its output as a linear combination of d + 2 support vectors (along with their labels). The set of returned support vectors constitutes the compression set. For reconstruction, we utilize SVM on these support vectors. The hyperplane returned by SVM can be re-interpreted as a function from Rd to R that pointwise approximates f on all xi in S. We proceed to prove the correctness. Denote the output of the compression scheme by fsvm = ρ(κ(S)) = (wsvm, bsvm), which a affine linear function in Rd+1. This function can be re-interpreted as an affine linear function ˆf : Rd R, for any x Rd we compute y R by solving wsvm, (x, y) + bsvm = 0, ˆf(x) = y = wd svm, x + bsvm wsvm(d + 1) , where wd svm = (wsvm(1), . . . , wsvm(d)). It holds that wsvm(d + 1) = 0, since for any xi there exists y1, y2 such that (xi, y1), (xi, y2) Sα(f ) have different labels. If wsvm(d + 1) = 0 it means that the SVM hyperplane cannot distinguish between these two points, and thus, it makes a mistake on a realizable dataset, which is a contradiction. Since the output of SVM is a valid compression scheme for realizable binary classification, ˆf should classify correctly all points in Sα(f ). It follows that for any xi in S, f (xi) ˆf(xi) α, due to the two adjacent grid points with resolution α lying above and below both the hyperplane of f and the ˆf hyperplane. Therefore, for any (xi, yi) S |f (xi) yi| | ˆf(xi) yi| (i) f (xi) yi ˆf(xi) + yi = f (xi) ˆf(xi) where (i) follows from the triangle inequality, and so ˆf is an α-approximate sample compression scheme for the ℓ1 and ℓ losses. For p (1, ), using Lipschitzness of the ℓp loss, we have |f (xi) yi|p | ˆf(xi) yi|p p |f (xi) yi| | ˆf(xi) yi| = p |f (xi) yi| | ˆf(xi) yi| By rescaling pα to α, we have an α-approximate compression scheme for the ℓp loss. Approximate compression size. The SVM running on Sα(f ) returns a set of support vectors of size at most d + 2, since the input is in dimension d + 1. The x vectors are part of the original sample S. 
We need to keep the grid point labels of the support vectors as well, each one of them requires log(1/α) bits, and each classification 1 costs an extra bit. We get a compression of size d + 2 + (d + 2) log(1/α) + d + 2 = O(d log(1/α)) . 4.2. Exact Compression for p {1, } In this section, we show that agnostic linear regression in Rd admits an exact compression scheme of size d + 1 under ℓ1 and d+2 under ℓ . Our instance space is X = Rd, label space is Y = R, and hypothesis class is F YX , consisting of all fw,b : X Y given by fw,b(x) = w, x + b, indexed by w Rd, b R. Note that we allow unbounded norms for the linear functions and the data can be unbounded as well, as opposed the the results in Section 4.1. Theorem 4.3. There exists an efficiently computable (see the linear program in Equation (8)) exact compression scheme for agnostic ℓ1 linear regression of size d + 1. The optimization technique based on minimizing the sum of absolute deviations is known as Least Absolute Deviations (LAD) and was introduced by Boscovich in 1757 (see, for example, Dodge (2008)). We derive a compression scheme from this method. Similarly, we can obtain a compression scheme for ℓ loss via linear programming. Theorem 4.4. There exists an efficiently computable (see the linear program in Equation (9)) exact compression scheme for agnostic ℓ linear regression of size d + 2. 4.3. Exact Constant Size Compression Is Impossible for p (1, ) We proceed to show that it is impossible to have an exact compression scheme of constant size (independent of the sample size) for p (1, ), generalizing the result for the ℓ2 loss by David et al. (2016, Theorem 4.1). Agnostic Sample Compression Schemes for Regression Theorem 4.5 (David et al. (2016)). There is no exact agnostic sample compression scheme for zero-dimensional linear regression with size k(m) m/2. Theorem 4.6. There is no exact agnostic sample compression scheme for zero-dimensional linear regression under ℓp loss, 1 < p < , with size k(m) < log(m). 5. Open Problems The positive result for ℓ1 loss may also lead us to wonder how general of a result might be possible. In particular, noting that the pseudo-dimension (Pollard, 1984; 1990a; Anthony et al., 1999) of linear functions in Rd is precisely d+1 (Anthony et al., 1999), there is an intriguing possibility for the following generalization. For any class F of real-valued functions, denote by Pdim(F) the pseudo-dimension of F. Open Problem: Compressing to pseudo-dimension Number of Points. Under the ℓ1 loss, does every class F of real-valued functions admit an exact agnostic compression scheme of size Pdim(F)? It is also interesting, and perhaps more approachable as an initial aim, to ask whether there is an agnostic compression scheme of size at most proportional to Pdim(F). Even falling short of this, one can ask the more-basic question of whether classes with Pdim(F) < always have bounded agnostic compression schemes (i.e., independent of sample size m), and more specifically whether the bound is expressible purely as a function of Pdim(F) (Moran & Yehudayoff (2016) have shown this is always possible in the realizable classification setting). These questions are directly related to (and inspired by) the well-known long-standing conjecture of Floyd & Warmuth (1995b); Warmuth (2003), which asks whether, for realizable-case binary classification, there is always a compression scheme of size at most linear in the VC dimension of the concept class. 
Indeed, it is clear that a positive solution of our open problem above would imply a positive solution to the original sample compression conjecture, since in the realizable case with a function class F of {0, 1}-valued functions, the minimal empirical ℓ1 loss on the data is zero, and any function obtaining zero empirical ℓ1 loss on a data set labeled with {0, 1} values must be {0, 1}-valued on that data set, and thus can be thought of as a sample-consistent classifier.2 Noting that, for F containing {0, 1}-valued functions, Pdim(F) is equal the VC dimension, the implication is clear. The converse of this direct relation is not necessarily true. Specifically, for a set F of real-valued functions, consider the set H of subgraph sets: hf(x, y) = I[y f(x)], f F. 2To make such a function actually binary-valued everywhere, it suffices to threshold at 1/2. In particular, note that the VC dimension of H is precisely Pdim(F). It is not true that any realizable classification compression scheme for H is also an agnostic compression scheme for F under ℓ1 loss. Nevertheless, this reduction-toclassification approach seems intuitively appealing, and it might possibly be the case that there is some way to modify certain types of compression schemes for H to convert them into agnostic compression schemes for F. Following up on this line of investigation seems the natural next step toward resolving the above general open question. Similarly, we ask the analogous question for the ℓ2 loss and approximate sample compression schemes. Open Problem: Compressing to fat-shattering Number of Points. Let c > 0 be an absolute constant. Under the ℓ2 loss, does every class F of real-valued functions admit an α-approximate agnostic compression scheme of size fat(F, cα)? Impact Statement This work builds upon the community s understanding of machine learning methods. This has a positive impact on the scientific advancement of the field, and may lead to further improvements in our understanding, methodologies and applications of machine learning and AI. While there are not obvious direct societal implications of the present work, the indirect and longer term impact on society may be positive, negative or both depending on how, where and for what machine learning method that will have benefited from our research are used in the future. Acknowledgements IA is supported by the Vatat Scholarship from the Israeli Council for Higher Education. AK was partially supported by the Israel Science Foundation (grant No. 1602/19), an Amazon Research Award, and the Ben-Gurion University Data Science Research Center. We thank Mark Kozlenko for helpful comments on an earlier manuscript. Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615 631, 1997. Alon, N., Hanneke, S., Holzman, R., and Moran, S. A theory of pac learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 658 671. IEEE, 2022. Anthony, M. and Bartlett, P. L. Function learning from in- Agnostic Sample Compression Schemes for Regression terpolation. Combinatorics, Probability and Computing, 9(3):213 225, 2000. Anthony, M., Bartlett, P. L., Bartlett, P. L., et al. Neural network learning: Theoretical foundations, volume 9. cambridge university press Cambridge, 1999. Ashtiani, H., Ben-David, S., Harvey, N. J., Liaw, C., Mehrabian, A., and Plan, Y. 
Near-optimal sample complexity bounds for robust learning of gaussian mixtures via compression schemes. Journal of the ACM (JACM), 67(6): 1 42, 2020. Attias, I. and Hanneke, S. Adversarially robust pac learnability of real-valued functions. In International Conference on Machine Learning, pp. 1172 1199. PMLR, 2023. Attias, I., Hanneke, S., and Mansour, Y. A characterization of semi-supervised adversarially robust pac learnability. Advances in Neural Information Processing Systems, 35: 23646 23659, 2022. Attias, I., Hanneke, S., Kalavasis, A., Karbasi, A., and Velegkas, G. Optimal learners for realizable regression: Pac learning and online learning. Advances in Neural Information Processing Systems, 2023. Bartlett, P. L. and Long, P. M. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174 190, 1998. Ben-David, S. and Litman, A. Combinatorial variability of vapnik-chervonenkis classes with applications to sample compression schemes. Discrete Applied Mathematics, 86 (1):3 25, 1998. Bousquet, O., Hanneke, S., Moran, S., and Zhivotovskiy, N. Proper learning, helly number, and an optimal svm bound. In Conference on Learning Theory, pp. 582 609. PMLR, 2020. Brukhim, N., Carmon, D., Dinur, I., Moran, S., and Yehudayoff, A. A characterization of multiclass learnability. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 943 955. IEEE, 2022. Chernikov, A. and Simon, P. Externally definable sets and dependent pairs. Israel J. Math., 194(1):409 425, 2013. Daniely, A. and Shalev-Shwartz, S. Optimal learners for multiclass problems. In Conference on Learning Theory, pp. 287 316. PMLR, 2014. Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the erm principle. J. Mach. Learn. Res., 16(1):2377 2404, 2015. David, O., Moran, S., and Yehudayoff, A. Supervised learning through the lens of compression. In Advances in Neural Information Processing Systems, pp. 2784 2792, 2016. Dodge, Y. Least absolute deviation regression. The Concise Encyclopedia of Statistics, pp. 299 302, 2008. Floyd, S. Space-bounded learning and the vapnikchervonenkis dimension. In Proceedings of the second annual workshop on Computational learning theory, pp. 349 364. Morgan Kaufmann Publishers Inc., 1989. Floyd, S. and Warmuth, M. Sample compression, learnability, and the vapnik-chervonenkis dimension. Machine learning, 21(3):269 304, 1995a. Floyd, S. and Warmuth, M. K. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269 304, 1995b. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997. Gottlieb, L.-A., Kontorovich, A., and Nisnevitch, P. Nearoptimal sample compression for nearest neighbors. Advances in Neural Information Processing Systems, 27, 2014. Graepel, T., Herbrich, R., and Shawe-Taylor, J. PACbayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2):55 76, 2005a. Graepel, T., Herbrich, R., and Shawe-Taylor, J. Pacbayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59:55 76, 2005b. Hanneke, S., Kontorovich, A., and Sadigurschi, M. Sample compression for real-valued learners. In Algorithmic Learning Theory, pp. 466 488. PMLR, 2019. Haussler, D. 
Decision theoretic generalizations of the pac model for neural net and other learning applications. Information and computation, 100(1):78 150, 1992. Helmbold, D., Sloan, R., and Warmuth, M. K. Learning integer lattices. SIAM Journal on Computing, 21(2):240 266, 1992. Kearns, M. J. and Schapire, R. E. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464 497, 1994. Agnostic Sample Compression Schemes for Regression K egl, B. Robust regression by boosting the median. In Learning Theory and Kernel Machines, pp. 258 272. Springer, 2003. Kleer, P. and Simon, H. Primal and dual combinatorial dimensions. ar Xiv preprint ar Xiv:2108.10037, 2021. Kontorovich, A. and Attias, I. Fat-shattering dimension of k-fold maxima. ar Xiv preprint ar Xiv:2110.04763, 2021. Kontorovich, A., Sabato, S., and Weiss, R. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. Advances in Neural Information Processing Systems, 30, 2017. Kuzmin, D. and Warmuth, M. K. Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research, 8:2047 2081, 2007. Littlestone, N. and Warmuth, M. Relating data compression and learnability. 1986a. Littlestone, N. and Warmuth, M. K. Relating data compression and learnability. Technical report, Department of Computer and Information Sciences, Santa Cruz, CA, Ju, 1986b. Livni, R. and Simon, P. Honest compressions and their application to compression schemes. In Conference on Learning Theory, pp. 77 92, 2013. Mendelson, S. Learning without concentration. J. ACM, 62 (3):21:1 21:25, 2015. Montasser, O., Hanneke, S., and Srebro, N. Vc classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, pp. 2512 2530. PMLR, 2019. Montasser, O., Hanneke, S., and Srebro, N. Reducing adversarially robust learning to non-robust pac learning. Advances in Neural Information Processing Systems, 33: 14626 14637, 2020. Montasser, O., Hanneke, S., and Srebro, N. Adversarially robust learning with unknown perturbation sets. In Conference on Learning Theory, pp. 3452 3482. PMLR, 2021. Montasser, O., Hanneke, S., and Srebro, N. Adversarially robust learning: A generic minimax optimal learner and characterization. Advances in Neural Information Processing Systems, 35:37458 37470, 2022. Moran, S. and Yehudayoff, A. Sample compression schemes for vc classes. Journal of the ACM (JACM), 63(3):1 10, 2016. Moran, S., Shpilka, A., Wigderson, A., and Yehudayoff, A. Teaching and compressing for low vc-dimension. In A Journey Through Discrete Mathematics, pp. 633 656. Springer, 2017. Pabbaraju, C. Multiclass learnability does not imply sample compression. ar Xiv preprint ar Xiv:2308.06424, 2023. Pollard, D. Convergence of Stochastic Processes. Springer Verlag, 1984. Pollard, D. Empirical processes: theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 2. Institute of Mathematical Statistics, 1990a. Pollard, D. Empirical processes: theory and applications. In NSF-CBMS regional conference series in probability and statistics, pp. i 86. JSTOR, 1990b. Rubinstein, B. I. and Rubinstein, J. H. A geometric approach to sample compression. Journal of Machine Learning Research, 13(4), 2012. Rubinstein, B. I., Bartlett, P. L., and Rubinstein, J. H. Shifting: One-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1): 37 59, 2009. Warmuth, M. K. Compressing to VC dimension many points. 
In Proceedings of the 16th Conference on Learning Theory, 2003. Wiener, Y., Hanneke, S., and El-Yaniv, R. A compression technique for analyzing disagreement-based active learning. J. Mach. Learn. Res., 16:713 745, 2015. Agnostic Sample Compression Schemes for Regression A. Auxiliary Proofs for Section 3 Our proof relies on several auxiliary results. Existence of approximate weak learners. We start with a result about generalization from interpolation. Anthony & Bartlett (2000) established such a result for interpolation models (Anthony et al. (1999, Section 21.4)), where the cut-off parameter η > 0 is fixed. The following results extend to cut-offs that may differ for different points. A similar result appeared in Attias & Hanneke (2023) in the context of adversarially robust learning. Theorem A.1 (Generalization from approximate interpolation with changing cutoffs). Let F [0, 1]X be a function class with a finite fat-shattering dimension (at any scale). For any α, ϵ, δ (0, 1), any function ψ : X Y [0, 1], any distribution P over X Y, for a random sample S P m, if fat(F, α/8) log2 fat(F, α/8) then with probability at least 1 δ over S, for any f F satisfying |f(x) y| ψ(x, y) + α, (x, y) S, it holds that P(x,y) P {(x, y) : |f(x) y| ψ(x, y) + 2α} 1 ϵ. Theorem A.2 (Generalization from approximate interpolation). (Anthony et al., 1999, Theorems 21.13 and 21.14) Let F [0, 1]X be a function class with a finite fat-shattering dimension (at any scale). For any η, α, ϵ, δ (0, 1), any distribution D over X, any function t : X [0, 1], for a random sample S Dm, if m(η, α, ϵ, δ) = O 1 fat(F, α/8) log2 fat(F, α/8) then with probability at least 1 δ over S, for any f F satisfying |f(x) t(x)| η (x, y) S, it holds that Px D{x : |f(x) t(x)| η + α} 1 ϵ. Proof of Theorem A.1. Let F [0, 1]X and let H = {(x, y) 7 |f(x) y| : f F} . Define the function classes F1 = {(x, y) 7 |f(x) y| ψ(x, y) : f F} , and F2 = {(x, y) 7 max{f(x, y), 0} : f F1} . We claim that fat(H, γ) = fat(F1, γ). Take a set S = {(x1, y1), . . . , (xm, ym)} that is γ-shattered by H. There exists a witness r = (r1, . . . , rm) [0, 1]m such that for each σ = (σ1, . . . , σm) { 1, 1}m there is a function hσ H such that ( hσ((xi, yi)) ri + γ, if σi = 1 hσ((xi, yi)) ri γ, if σi = 1. The set S is shattered by F1 by taking r = (r1 + η(x1, y1), . . . , rm + η(xm, ym)). Similarly, any set that is shattered by F1 is also shattered by H. The class F2 consists of choosing a function from F1 and computing its pointwise maximum with the constant function 0. In general, for two function classes G1, G2, we can define the maximum aggregation class max(G1, G2) = {x 7 max{g1(x), g2(x)} : gi Gi}, Kontorovich & Attias (2021) showed that for any G1, G2 fat(max(G1, G2), γ) (fat(G1, γ) + fat(G2, γ)) log2(fat(G1, γ) + fat(G2, γ)) . Taking G1 = F1 and G2 0, we get fat(F2, γ) fat(F1, γ) log2(fat(F1, γ)) . Agnostic Sample Compression Schemes for Regression For the particular case G2 0, we can show a better bound of fat(F2, γ) fat(F1, γ) . In words, it means that truncation cannot increase the shattering dimension. Indeed, take a set S = {(x1, y1), . . . , (xk, yk)} that is γ-shattered by F2 = max(F1, 0), we show that this set is γ-shattered by F1. There exists a witness r = (r1, . . . , rk) [0, 1]k such that for each σ = (σ1, . . . , σk) { 1, 1}k there is a function fσ F1 such that ( max{fσ((xi, yi)), 0} ri + γ, if σi = 1 max{fσ((xi, yi)), 0} ri γ, if σi = 1. For max{fσ((xi, yi)), 0} ri γ, we simply have that fσ((xi, yi)) ri γ. Moreover, this implies that ri γ. 
As a result, max{fσ((xi, yi)), 0} ri + γ which means that fσ((xi, yi)) ri + γ. This shows that F1 γ-shatters S as well. We can conclude the proof by applying Theorem A.2 to the class F2 with t(x) = 0 and η = α. The following boosting and sparsification claims were proven for the case of a fixed cut-off parameter. The proofs extend similarly to the case of a changing cut-off parameter ψ : X Y [0, 1]. Boosting. Following (Hanneke et al., 2019), we define the weighted median as Median(y1, . . . , y T ; α1, . . . , αT ) = min yj : PT t=1 αt I[yj < yt] PT t=1 αt < 1 and the weighted quantiles, for β [0, 1/2], as Q+ β (y1, . . . , y T ; α1, . . . , αT ) = min yj : PT t=1 αt I[yj < yt] PT t=1 αt < 1 Q β (y1, . . . , y T ; α1, . . . , αT ) = max yj : PT t=1 αt I[yj > yt] PT t=1 αt < 1 We define Q+ β (f1, . . . , f T ; α1, . . . , αT )(x) = Q+ β (f1(x), . . . , f T (x); α1, . . . , αT ), and Q β (f1, . . . , f T ; α1, . . . , αT )(x) = Q β (f1(x), . . . , f T (x); α1, . . . , αT ), and Median(f1, . . . , f T ; α1, . . . , αT )(x) = Median(f1(x), . . . , f T (x); α1, . . . , αT ). We omit the weights αi when they are equal to each other. The following guarantee holds for the boosting procedure. Lemma A.3. Let S = {(xi, yi)}m i=1, T = O 1 β2 log(m) . Let ˆf1, . . . , ˆf T and α1, . . . , αT be the functions and coefficients returned from the median boosting procedure with changing cut-offs (Step 2 in Algorithm 1). For any i {1, . . . , m} it holds that max n Q+ β/2( ˆf1, . . . , ˆf T ; α1, . . . , αT ))(xi) yi , Q β/2 ˆf1, . . . , ˆf T ; α1, . . . , αT )(xi) yi o ψ(x, y) + 2α. Sparsification. Lemma A.4. Choosing β2 fat (F, cα) log2(fat (F, cα) /α) in Step 3 of Algorithm 1, we have for all (x, y) S |Median(f J1(x), . . . , f Jn(x)) y| ψ(x, y) + 3α, where c > 0 is a universal constant. Agnostic Sample Compression Schemes for Regression 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 x 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 x Figure 1. A sample S of m = 20 points (xi, yi) was drawn iid uniformly from [0, 1]2. On this sample, ℓ1 regression was performed by solving the LP in (7), shown on the left, and ℓ regression was performed by solving the LP in (9), on the right. In each case, the regressor provided by the LP solver is indicated by the thick (red) line. Notice that for ℓ1, the line contains exactly 2 datapoints. For ℓ , the regressor contains no datapoints; rather, the d + 2 = 3 support vectors are indicated by . B. Missing Proofs for Section 4 Proof of Theorem 4.3. We start with d = 0. The sample then consists of (y1, . . . , ym) [formally: pairs (xi, yi), where xi 0], and F = R [formally, all functions h : 0 7 R]. We define f S to be the median of (y1, . . . , ym), which for odd m is defined uniquely and for even m can be taken arbitrarily as the smaller of the two midpoints. It is well-known that such a choice minimizes the empirical ℓ1 risk, and it clearly constitutes a compression scheme of size 1. The case d = 1 will require more work. The sample consists of (xi, yi)i [m], where xi, yi R, and F = {R x 7 wx + b : a, b R}. Let (w , b ) be a (possibly non-unique) minimizer of L(w, b) := X i [m] |(wxi + b) yi|, (6) achieving the value L . We claim that we can always find two indices ˆı, ˆȷ [m] such that the line determined by (xˆı, yˆı) and (xˆȷ, yˆȷ) also achieves the optimal empirical risk L . More precisely, the line ( ˆw,ˆb) induced by ((xˆı, yˆı), (xˆȷ, yˆȷ)) via3 ˆw = (yˆȷ yˆı)/(xˆȷ xˆı) and ˆb = yˆı ˆwxˆı, verifies L( ˆw,ˆb) = L . To prove this claim, we begin by recasting (6) as a linear program. 
min (ε1,...,εm,w,b) Rm+2 i=1 εi s.t. (7) i [m] wxi + b yi εi i [m] wxi b + yi εi. We observe that the linear program in (7) is feasible with a finite solution (and actually, the constraints εi 0 are redundant). Furthermore, any optimal value is achievable at one of the extreme points of the constraint-set polytope P Rm+2. Next, we claim that the extreme points of the polytope P are all of the form v P with two (or more) of the εis equal to 0. This suffices to prove our main claim, since εi = 0 in v P iff the (w, b) induced by v verifies wxi + b = yi; in other words, the line induced by (w, b) contains the point (xi, yi). If a line contains two data points, it is uniquely determined by them: these constitute a compression set of size 2. (See illustration in Figure 1.) Now we prove our claimed property of the extreme points. First, we claim that any extreme point of P must have at least one εi equal to 0. Indeed, let (w, b) define a line. Define b+ := min n b [b, ) : i [m], wxi + b = yi) o 3We ignore the degenerate possibility of vertical lines, which reduces to the 0-dimensional case. Agnostic Sample Compression Schemes for Regression and analogously, b := max n b ( , b] : i [m], wxi + b = yi) o . In words, (w, b+) is the line obtained by increasing b to a maximum value of b+, where the line (w, b+) touches a datapoint, and likewise, (w, b ) is the line obtained by decreasing b to a minimum value of b , where the line (w, b ) touches a datapoint. Define by S+ a,b := {i : |wxi + b < yi|} the points above the line defined by (w, b) and S a,b := {i : |wxi + b > yi|} the points below the line defined by (w, b). For a line (w, b) which does not contain a data point we can rewrite the sample loss as L(w, b) = X (yi (wxi + b)) + X ((wxi + b) yi) a + |S a,b| |S+ a,b| b + =: λa + µb + ν. Since for fixed a and b [b , b+], the quantities S a,b, S+ a,b are constant, it follows that the function L(w, ) is affine in b, and hence minimized at b {b , b+}. Thus, there is no loss of generality in taking b = b , which implies that the optimal solution s line (w , b ) contains a data point (xˆı, yˆı). If the line (w , b ) contains other data points then we are done, so assume to the contrary that εˆı is the only εi that vanishes in the corresponding solution v P. Let Pˆı P consist of all v for which εˆı = 0, corresponding to all feasible solutions whose line contains the data point (xˆı, yˆı). Let us say that two lines (w1, b1), (w2, b2) are equivalent if they induce the same partition on the data points, in the sense of linear separation in the plane. The formal condition is S w1,b1 = S w1,b1, which is equivalent to S+ w1,b1 = S+ w1,b1. Define P ˆı Pˆı to consist of those feasible solutions whose line is equivalent to (w , b ). Denote by w+ := max {a : (ε1, .., εm, w, b) P ˆı } and define v+ to be a feasible solution in P ˆı with slope w+, and analogously, w := min {w : (ε1, .., εm, w, b) P ˆı } and v P ˆı with slope w . Geometrically this corresponds to rotating the line (w , b ) about the point (xˆı, yˆı) until it encounters a data point above and below. Writing, as above, the sample loss in the form L(w, b), we see that L( , b ) is affine in a over the range w [w , w+] and hence is minimized at one of the endpoints. This furnishes another datapoint (xˆȷ, yˆȷ) verifying ˆwxˆȷ + ˆb = yˆȷ for L( ˆw,ˆb) = L , and hence proves compressibility into two points for d = 1. Generalizing to d > 1 is quite straightforward. 
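Before the general-d argument below, here is a hedged numerical sketch (not the paper's code) of the d = 1 linear program in (7), using scipy.optimize.linprog; a vertex-optimal solution typically yields a line passing through at least two sample points, i.e., a compression set of size 2. The random data and solver choice are illustrative assumptions.

```python
# Hedged numerical sketch of the l_1 (LAD) linear program (7) for d = 1; not the paper's code.
# With the "highs" solver one typically obtains a vertex solution, whose line passes
# through (at least) two sample points -- the claimed compression set of size 2.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m = 20
x = rng.uniform(0, 1, m)
y = rng.uniform(0, 1, m)

c = np.r_[np.ones(m), 0.0, 0.0]                       # minimize sum_i eps_i over variables (eps, w, b)
A = np.zeros((2 * m, m + 2)); rhs = np.zeros(2 * m)
for i in range(m):
    A[2 * i, i] = -1;     A[2 * i, m] = x[i];      A[2 * i, m + 1] = 1;   rhs[2 * i] = y[i]       #  w x_i + b - y_i <= eps_i
    A[2 * i + 1, i] = -1; A[2 * i + 1, m] = -x[i]; A[2 * i + 1, m + 1] = -1; rhs[2 * i + 1] = -y[i]  # -(w x_i + b) + y_i <= eps_i
bounds = [(0, None)] * m + [(None, None), (None, None)]
res = linprog(c, A_ub=A, b_ub=rhs, bounds=bounds, method="highs")

w, b = res.x[m], res.x[m + 1]
on_line = np.sum(np.isclose(w * x + b, y, atol=1e-8))
print("optimal sum of absolute deviations:", res.fun, "| data points on the optimal line:", on_line)  # expect >= 2
```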
We define L(w, b) = X i [m] |( w, xi + b) yi| and express it as a linear program analogous to (7), Linear programming for ℓ1 regression: min (ε1,...,εm,w,b) Rm+d+1 i=1 εi s.t. (8) i [m] w, xi + b yi εi i [m] w, xi b + yi εi. Given an optimal solution (w , b ), we argue exactly as above that b may be chosen so that the optimal regressor contains some datapoint say, (x1, y1). Holding b and w(j), j = 1 fixed, we argue, as above, that w(1) may be chosen so that the optimal regressor contains another datapoint (say, (x2, y2)). Proceeding in this fashion, we inductively argue that the optimal regressor may be chosen to contain some d + 1 datapoints, which provides the requisite compression scheme. Agnostic Sample Compression Schemes for Regression Proof of Theorem 4.4. Given m labeled points in Rd R, S = {(xi, yi) : i [m]} and any w Rd, b R define the empirical risk L(w, b) := max {| w, xi + b yi| : i [m]} . We cast the risk minimization problem as a linear program. Linear programming for ℓ regression: min (ε,w,b) Rd+2 : ε (9) s.t. i : ε w, xi b + yi 0 ε + w, xi + b yi 0. (As before, the constraint ε 0 is implicit in the other constraints.) Introducing the Lagrange multipliers λi, µi 0, i [m], we cast the optimization problem in the form of a Lagrangian: L(ε, w, b, µ1 . . . , µm, λ1 . . . , λm) = ε i=1 λi (ε w, xi b + yi) i=1 µi (ε + w, xi + b yi) . The KKT conditions imply, in particular, that i : λi(ε w, xi b + yi) = 0 µi(ε + w, xi + b yi) = 0. Geometrically, this means that either the constraints corresponding to the ith datapoint are inactive in which case, omitting the datapoint does not affect the solution or otherwise, the ith datapoint induces the active constraint w, xi + b yi = ε. (10) On analogy with SVM, let us refer to the datapoints satisfying (10) as the support vectors; clearly, the remaining sample points may be discarded without affecting the solution. Solutions to (9) lie in Rd+2 and hence d + 2 linearly independent datapoints suffice to uniquely pin down an optimal (ε, w, b) via the equations (10). Proof of Theorem 4.6. Consider a sample (y1, . . . , ym) {0, 1}m. Partition the indices i [m] into S0 := {i [m] : yi = 0} and S1 := {i [m] : yi = 1}. The empirical risk minimizer is given by ˆr := argmin s R i=1 |yi s|p. To obtain an explicit expression for ˆr, define i=1 |yi s|p = |S1|(1 s)p + |S0|sp =: N1(1 s)p + N0sp. We then compute F (s) = p N0sp 1 p N1(1 s)p 1 and find that F (s) = 0 occurs at ˆs = µ1/(p 1) 1 + µ1/(p 1), Agnostic Sample Compression Schemes for Regression where µ = N1/N0. A straightforward analysis of the second derivative shows that ˆs = ˆr is indeed the unique minimizer of F. Thus, given a sample of size m, the unique minimizer ˆr is uniquely determined by N0 which can take on any of integer m + 1 values between 0 and m. On the other hand, every output of a k-selection function κ outputs a multiset ˆS S of size k and a binary string of length k = k k . Thus, the total number of values representable by a k-selection scheme is at most k =0 k 2k k < 2k+1 k, which, for k < log m, is less than m. Remark B.1. A more refined analysis, along the lines of David et al. (2016, Theorem 4.1), should yield a lower bound of k = Ω(m). A technical complication is that unlike the p = 2 case, whose empirical risk minimizer has a simple explicit form, the general ℓp loss does not admit a closed-form solution and uniqueness must be argued from general convexity principles.
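As a quick numerical sanity check of the closed form in the proof of Theorem 4.6 (a hedged sketch, not from the paper), one can compare ŝ = μ^{1/(p−1)} / (1 + μ^{1/(p−1)}) against a direct grid minimization of F(s) = N_1 (1 − s)^p + N_0 s^p; the counts N_0, N_1 below are illustrative.

```python
# Hedged numerical check (not from the paper) of the closed-form minimizer in the proof of
# Theorem 4.6: s_hat = mu^(1/(p-1)) / (1 + mu^(1/(p-1))), with mu = N1/N0, minimizes
# F(s) = N1*(1-s)^p + N0*s^p over s in [0, 1].
import numpy as np

N0, N1 = 7, 13                      # counts of 0-labels and 1-labels in the sample
for p in (1.5, 2.0, 3.0):
    mu = N1 / N0
    s_hat = mu ** (1 / (p - 1)) / (1 + mu ** (1 / (p - 1)))
    grid = np.linspace(0, 1, 100001)
    F = N1 * (1 - grid) ** p + N0 * grid ** p
    s_grid = grid[np.argmin(F)]
    print(p, round(s_hat, 5), round(s_grid, 5))    # the two values should agree to grid resolution
```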