# Adversarially Robust PAC Learnability of Real-Valued Functions

Idan Attias (Department of Computer Science, Ben-Gurion University, Israel); Steve Hanneke (Department of Computer Science, Purdue University, USA). Correspondence to: Idan Attias, Steve Hanneke. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We study robustness to test-time adversarial attacks in the regression setting with ℓp losses and arbitrary perturbation sets. We address the question of which function classes are PAC learnable in this setting. We show that classes of finite fat-shattering dimension are learnable in both the realizable and agnostic settings. Moreover, convex function classes are even properly learnable, whereas some non-convex function classes provably require improper learning algorithms. Our main technique is based on a construction of an adversarially robust sample compression scheme of a size determined by the fat-shattering dimension. Along the way, we introduce a novel agnostic sample compression scheme for real-valued functions, which may be of independent interest.

1. Introduction

Learning a predictor that is resilient to test-time adversarial attacks is a fundamental problem in contemporary machine learning. A long line of research has studied the vulnerability of deep learning-based models to small perturbations of their inputs (e.g., Szegedy et al. (2013); Biggio et al. (2013); Goodfellow et al. (2014); Madry et al. (2017)). From the theoretical standpoint, there has been considerable effort to provide provable guarantees for such methods (e.g., Feige et al. (2015); Schmidt et al. (2018); Khim & Loh (2018); Yin et al. (2019); Cullina et al. (2018); Attias et al. (2019; 2022); Montasser et al. (2021b; 2020a;b; 2021a); Ashtiani et al. (2020); Dan et al. (2020); Awasthi et al. (2020; 2021b;a; 2022a;b; 2023); Bhattacharjee et al. (2021); Xing et al. (2021); Mao et al. (2023)), which is the focus of this work.

In the robust PAC learning framework, the problem of learning binary function classes was studied by Montasser et al. (2019). They showed that uniform convergence does not hold in this setting, and as a result, robust empirical risk minimization is not sufficient to ensure learnability. Yet, they showed that VC classes are learnable by considering an improper learning rule: the learning algorithm may output a function outside the function class that we aim to learn.

In this work, we provide a theoretical understanding of the robustness of real-valued predictors in the PAC learning model, with arbitrary perturbation sets. Attias et al. (2019) considered this question for finite perturbation sets; they obtained sample complexity guarantees based on uniform convergence, which no longer holds for arbitrary perturbation sets. We address the fundamental question: which real-valued function classes are robustly learnable? Furthermore, we study the robust learnability of convex classes, a natural and commonly studied subcategory for regression. We address the question: are real-valued convex classes properly robustly learnable? On the one hand, some non-convex function classes provably require improper learning, due to Montasser et al. (2019). On the other hand, Mendelson (2019) showed that non-robust regression with the mean squared error is properly learnable.

We study the following learning models for real-valued functions. An adversarial attack is formalized by a perturbation function U : X → 2^X, where U(x) is the set of possible perturbations (attacks) on x. In practice, we usually consider U(x) to be the ℓ1 ball centered at x. In this work, we have no restriction on U besides x ∈ U(x). Let D be an unknown distribution over X × [0, 1] and let H ⊆ [0, 1]^X be a concept class.
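For concreteness, a perturbation function can be sketched in a few lines of code. This is a hypothetical illustration, not from the paper: the function name and the choice of a finite set of ℓ∞-ball corners are ours; the paper allows U(x) to be an arbitrary set containing x.

```python
import itertools

def linf_ball_corners(x, eps=0.1):
    """A toy perturbation function U: return x together with the corners
    of the l_inf ball of radius eps around it (a finite subset of the ball)."""
    corners = [tuple(xi + s * eps for xi, s in zip(x, signs))
               for signs in itertools.product((-1.0, 1.0), repeat=len(x))]
    return [tuple(x)] + corners  # including x itself guarantees x in U(x)

# In R^2 this yields the original point plus 4 corners.
print(len(linf_ball_corners((0.0, 0.0))))  # 5
```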
In our first model, the robust error of a concept h is defined as

Err_ℓp(h; D) = E_{(x,y)∼D} [ sup_{z∈U(x)} |h(z) − y|^p ].

The learner gets an i.i.d. sample from D and would like to output a function ĥ such that, with high probability,

Err_ℓp(ĥ; D) ≤ inf_{h∈H} Err_ℓp(h; D) + ϵ. (1)

The sample complexity for learning H is the minimal size of an i.i.d. sample from D such that there exists a learning algorithm with output as in Equation (1). We refer to this model as Robust Regression with ℓp robust losses. It is a robust formulation of the classic nonparametric regression setting.

In the second model, the robust error of a concept h is defined as

Err_η(h; D) = E_{(x,y)∼D} [ I[ sup_{z∈U(x)} |h(z) − y| > η ] ].

We refer to the loss function in this model as the cutoff loss, where η ≥ 0 is a predefined cutoff parameter. The learner gets an i.i.d. sample from D and would like to output a function ĥ such that, with high probability,

Err_{η+β}(ĥ; D) ≤ inf_{h∈H} Err_η(h; D) + ϵ,

where β > 0 is a predefined parameter. The sample complexity is defined similarly to the previous model. We refer to this model as Robust (η, β)-Regression. The non-robust formulation of this setting was studied by, e.g., Anthony & Bartlett (2000) and Simon (1997); see also Anthony et al. (1999, Section 21.4) and references therein.

Main results and roadmap. Denote the γ-fat-shattering dimension of H by fat(H, γ), and the dual γ-fat-shattering dimension by fat*(H, γ), which is the fat-shattering dimension of the dual class. The dimension of the dual class is finite as long as the γ-fat-shattering dimension of the primal class is finite (see Kleer & Simon (2021) and Equation (7)). In Section 3 we provide a learning algorithm for robust regression with ℓp losses, with sample complexity

Õ( fat³(H, ϵ/p) · fat*(H, ϵ/p) )

(omitting polynomial dependence on 1/ϵ and 1/δ). Moreover, this algorithm is proper for convex function classes.
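The two robust error notions above can be sketched as empirical estimates. This is a minimal illustration with names of our choosing, assuming a finite perturbation set U(x) so that the supremum becomes a maximum.

```python
def lp_robust_loss(h, x, y, U, p=2):
    # sup over z in U(x) of |h(z) - y|^p, for a finite U(x)
    return max(abs(h(z) - y) ** p for z in U(x))

def eta_ball_loss(h, x, y, U, eta=0.1):
    # indicator that some perturbation drives the absolute error above eta
    return 1.0 if max(abs(h(z) - y) for z in U(x)) > eta else 0.0

def empirical_robust_error(h, sample, U, loss, **kw):
    # plug-in estimate of the corresponding robust error on a finite sample
    return sum(loss(h, x, y, U, **kw) for x, y in sample) / len(sample)

# Toy setup: identity predictor, exact labels, three-point perturbation sets.
U = lambda x: (x - 0.1, x, x + 0.1)
h = lambda x: x
sample = [(0.2, 0.2), (0.5, 0.5)]
print(empirical_robust_error(h, sample, U, lp_robust_loss, p=1))      # ~0.1
print(empirical_robust_error(h, sample, U, eta_ball_loss, eta=0.2))   # 0.0
```

Note how the ℓp error charges the worst-case perturbed deviation on every point, while the cutoff loss only counts the points where that deviation exceeds η.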
We circumvent a negative result for non-convex function classes, for which proper learning is impossible even for binary-valued functions (Montasser et al., 2019). In Section 4 we provide a learning algorithm with a substantial sample complexity improvement for the ℓ1 loss:

Õ( fat(H, ϵ) · fat*(H, ϵ) )

(again omitting polynomial dependence on 1/ϵ and 1/δ). In Section 5, we provide learning algorithms for the robust (η, β)-regression setting in the realizable and agnostic settings. Our sample complexity is Õ( fat(H, β) · fat*(H, β) ) for the realizable case, and likewise Õ( fat(H, β) · fat*(H, β) ) for the agnostic case, the two differing in their polynomial dependence on 1/ϵ, 1/β, and 1/δ.

Technical contributions and related work. The setting of agnostic adversarially robust regression with finite perturbation sets was studied by Attias et al. (2019). Subsequently, improved bounds appeared in Kontorovich & Attias (2021). Adversarially robust PAC learnability of binary-valued function classes with arbitrary perturbation sets was studied by Montasser et al. (2019). They showed that uniform convergence does not hold in this setting, which means that some classes provably require improper learning. Their main technique is constructing a sample compression scheme from a boosting-style algorithm, where generalization follows from sample compression bounds. First, we explain the new technical ideas behind our algorithms for robust (η, β)-regression and compare them to those of Montasser et al. (2019) in the classification setting. We then explain why that approach fails in the general robust regression setting and introduce the new ingredients behind the proofs for this setting.

Robust (η, β)-regression. We construct an adversarially robust sample compression scheme of a size determined by the fat-shattering dimension of the function class. The following steps differ from the binary-valued case. First, we use a modified boosting algorithm for real-valued functions. In the non-robust setting, Hanneke et al.
(2019) showed how to convert a boosting algorithm (originally introduced by Kégl (2003)) into a sample compression scheme. In order to find weak learners (and prove their existence), we rely on generalization from approximate interpolation (see Anthony & Bartlett (2000) and Anthony et al. (1999, Section 21.4)). The idea is that any function f ∈ F that approximately interpolates a sample S ∼ D^m, that is, |f(x) − y| ≤ η for all (x, y) ∈ S, also satisfies

P{(x, y) : |f(x) − y| ≤ η + β} ≥ 1 − ϵ

with high probability, as long as |S| ≥ Õ(fat(F, β)/ϵ). Crucially, this result relies on uniform convergence and does not apply to the robust loss function.

Another difference is in the discretization step. In the classification setting, we inflate the data set to include all possible perturbations (a potentially infinite set). We then define a function class Ĥ by running a robust empirical risk minimizer on every subset of size VC(H) from the training set, where H is the class we want to learn. Ĥ induces a finite partition of the inflated set into regions, such that any h ∈ Ĥ has a constant error on each region. This is no longer true in the real-valued case. Instead, we discretize the inflated set by taking a uniform cover with respect to the supremum metric and controlling the errors that arise from the cover.

Robust regression. We first explain which natural techniques fail. We cannot run boosting for the ℓp loss, as explained by Hanneke et al. (2019): Duffy & Helmbold (2002, Remark 2.1) spell out a central technical challenge: no boosting algorithm can always force the base regressor to output a useful function by simply modifying the distribution over the sample. This is because, unlike a binary classifier, which localizes errors on specific examples, a real-valued hypothesis can spread its error evenly over the entire sample, and then it will not be affected by reweighting.
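A toy numerical sketch of this obstacle (our own illustration, not from the paper): a regressor whose absolute loss is identical on every example has the same weighted loss under any reweighting, so changing the distribution gives a booster no leverage.

```python
# Hypothesis that "spreads" a constant loss of ~0.3 over all five examples.
sample = [(x, float(x)) for x in range(5)]   # labels y = x
h = lambda x: x + 0.3

losses = [abs(h(x) - y) for x, y in sample]  # every entry is ~0.3

# Weighted empirical loss under a distribution w over the sample.
wloss = lambda w: sum(wi * li for wi, li in zip(w, losses))

uniform = [1 / 5] * 5
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]           # an arbitrary reweighting
print(abs(wloss(uniform) - wloss(skewed)) < 1e-9)  # True: reweighting changes nothing
```

A binary classifier with the same average error would instead concentrate its mistakes on specific points, which reweighting can then penalize.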
As a first attempt, we could try to learn with respect to the cutoff loss (with a fixed cutoff parameter) and conclude learnability in the general regression setting. However, the ℓp loss can spread over different values at different points, which means that this approach fails. As another possible attempt, we could try to solve the realizable case first and reduce agnostic to realizable learning, as Montasser et al. (2019) did for binary-valued functions and as we do for the agnostic setting of robust (η, β)-regression. However, this attempt fails for the same reasons mentioned above.

Therefore, we introduce a novel technique for working with changing cutoffs. We establish generalization from approximate interpolation with different cutoff parameters, and thereby find a learner that approximates the loss of the target function on different points. Utilizing this idea, we provide a learning algorithm for ℓp robust losses that constructs an ensemble and predicts with the average. Further, we show that this algorithm is proper for convex function classes. In contrast, some non-convex function classes provably require improper learning (Montasser et al., 2019). Moreover, we show how to reduce the sample complexity substantially for the ℓ1 robust loss with a different algorithm, by constructing an ensemble of weak learners and predicting with the median. Both algorithms can be represented as an agnostic sample compression scheme for the robust loss. This is a new result, since constructing a sample compression scheme for real-valued functions was previously known only in the realizable setting (Hanneke et al., 2019). We believe that this technique may be of independent interest.

2. Problem Setup and Preliminaries

Let H ⊆ [0, 1]^X be a concept class. We implicitly assume that all concept classes satisfy mild measure-theoretic conditions (see, e.g., Dudley (1984, Section 10.3.1) and Pollard (2012, Appendix C)). Let D be a distribution over X × Y, where Y = [0, 1].
Define a perturbation function U : X → 2^X that maps an input to an arbitrary set U(x) ⊆ X such that x ∈ U(x). We consider the following loss functions. For 1 ≤ p < ∞, define the ℓp robust loss function of h on (x, y) with respect to a perturbation function U,

ℓ_{p,U}(h; (x, y)) = sup_{z∈U(x)} |h(z) − y|^p. (2)

We also define the η-ball robust loss function of h on (x, y) with respect to a perturbation function U,

ℓ^η_U(h; (x, y)) = I[ sup_{z∈U(x)} |h(z) − y| > η ]. (3)

The non-robust version of this loss function is also known as the η-ball or η-tube loss (see, for example, Anthony et al. (1999, Section 21.4)). Define the error of a function h on distribution D with respect to the ℓp robust loss,

Err_ℓp(h; D) = E_{(x,y)∼D} [ sup_{z∈U(x)} |h(z) − y|^p ],

and the error with respect to the η-ball robust loss,

Err_η(h; D) = E_{(x,y)∼D} [ I[ sup_{z∈U(x)} |h(z) − y| > η ] ].

Note that in our model the learner is tested against the original label y while observing only the perturbed example z. There are formulations of robustness where the learner is instead compared to the value of the optimal function in the class on the perturbed example; i.e., if the optimal function in the class is h*, then the ℓ1 robust loss would be sup_{z∈U(x)} |h(z) − h*(z)|. For more details and comparisons of the two models, see Gourdeau et al. (2021); Diochnos et al. (2018); Bubeck et al. (2019).

Learning models. We precisely define the models for robustly learning real-valued functions. Our first model is learning with the ℓp robust losses (see Equation (2)); we refer to this model as Robust Regression.

Definition 2.1 (Robust regression). For any ϵ, δ ∈ (0, 1), the sample complexity of robust (ϵ, δ)-PAC learning a concept class H ⊆ [0, 1]^X with ℓp robust losses, denoted M(ϵ, δ, H, U, ℓp), is the smallest integer m such that the following holds: there exists a learning algorithm A : (X × Y)^m → [0, 1]^X such that for any distribution D over X × [0, 1], for an i.i.d. random sample S ∼ D^m, with probability at least 1 − δ over S, it holds that

Err_ℓp(A(S); D) ≤ inf_{h∈H} Err_ℓp(h; D) + ϵ.
If no such m exists, define M(ϵ, δ, H, U, ℓp) = ∞, and H is not robustly (ϵ, δ)-PAC learnable. We use the shorthand M = M(ϵ, δ, H, U, ℓp) for notational simplicity.

Our second model is learning with the η-ball robust loss (see Equation (3)) in the realizable and agnostic settings; we refer to this model as Robust (η, β)-Regression. We say that a distribution D is η-uniformly realizable with respect to H and U if there exists h* ∈ H such that

Err_η(h*; D) = 0. (4)

Definition 2.2 (Robust (η, β)-regression). For any β, ϵ, δ ∈ (0, 1) and η ∈ [0, 1], the sample complexity of realizable robust (η, β, ϵ, δ)-PAC learning a concept class H ⊆ [0, 1]^X, denoted M_RE(η, β, ϵ, δ, H, U), is the smallest integer m such that the following holds: there exists a learning algorithm A : (X × Y)^m → [0, 1]^X such that for any distribution D over X × [0, 1] that is η-uniformly realizable w.r.t. H and U (see Equation (4)), for an i.i.d. random sample S ∼ D^m, with probability at least 1 − δ over S, it holds that

Err_{η+β}(A(S); D) ≤ ϵ.

If no such m exists, define M_RE(η, β, ϵ, δ, H, U) = ∞, and H is not robustly (η, β, ϵ, δ)-PAC learnable. The agnostic sample complexity, denoted M_AG(η, β, ϵ, δ, H, U), is defined similarly, with the following difference: we require the learning algorithm to output a function such that, with probability at least 1 − δ,

Err_{η+β}(A(S); D) ≤ inf_{h∈H} Err_η(h; D) + ϵ.

We use the shorthands M^{η,β}_RE = M_RE(η, β, ϵ, δ, H, U) and M^{η,β}_AG = M_AG(η, β, ϵ, δ, H, U) for notational simplicity. We do not define a separate realizable setting for robust regression, since it coincides with the realizable setting of robust (η, β)-regression by taking η = 0, β = ϵ/2, and rescaling ϵ to ϵ/2. Moreover, we could define an ℓp variant of the η-ball loss in robust (η, β)-regression; however, results for our definition translate immediately by taking η^{1/p}. Note that there is a fundamental difference between the models.
In robust (η, β)-regression, we require the learning algorithm to find a function that is almost everywhere within η + β of the target function in the class. That is, on a 1 − ϵ mass of elements in the support of D, we find an approximation up to η + β. In the robust regression model, on the other hand, we aim to be close to the target function on average, and the error can possibly spread across all elements of the support.

Proper and improper learning algorithms. The learning algorithm is not limited to returning a function inside the concept class that we aim to learn. When learning a class H, whenever the learning algorithm returns a function inside the class, that is, A : (X × Y)^m → H, we say that the algorithm is proper and the class is properly learnable. Otherwise, we say that the algorithm is improper. Improper algorithms are extremely powerful, and using them often circumvents computational issues and sample complexity barriers (Srebro et al., 2004; Candes & Recht, 2012; Anava et al., 2013; Hazan et al., 2015; Hanneke, 2016; Hazan & Ma, 2016; Hazan et al., 2012; Agarwal et al., 2019; Daniely et al., 2011; Daniely & Shalev-Shwartz, 2014; Angluin, 1988; Montasser et al., 2019).

Oracles. We make use of the following robust empirical risk minimizers. Let S = {(x_i, y_i)}_{i=1}^m. Define an η-robust empirical risk minimizer η-RERM_H : (X × Y)^m × [0, 1]^m × N → H,

η-RERM_H(S, η_S, p) := any h ∈ H such that sup_{z∈U(x)} |h(z) − y|^p ≤ η(x, y) for all (x, y) ∈ S, (5)

where η_S = (η(x_1, y_1), . . . , η(x_m, y_m)). We refer to η(x, y) as a cutoff parameter; note that η is a function of (x, y) and not necessarily a constant. Define a robust empirical risk minimizer for the ℓp robust loss, ℓp-RERM_H : (X × Y)^m → H,

ℓp-RERM_H(S) := argmin_{h∈H} Σ_{(x,y)∈S} sup_{z∈U(x)} |h(z) − y|^p. (6)

Complexity measures. Fat-shattering dimension. Let F ⊆ [0, 1]^X and γ > 0. We say that S = {x_1, . . . , x_m} ⊆ X is γ-shattered by F if there exists a witness r = (r_1, . . . , r_m) ∈ [0, 1]^m such that for each σ = (σ_1, . . .
, σ_m) ∈ {−1, 1}^m there is a function f_σ ∈ F such that

f_σ(x_i) ≥ r_i + γ if σ_i = 1, and f_σ(x_i) ≤ r_i − γ if σ_i = −1.

The fat-shattering dimension of F at scale γ, denoted fat(F, γ), is the cardinality of the largest set of points in X that can be γ-shattered by F. This parametrized variant of the pseudo-dimension (Alon et al., 1997) was first proposed by Kearns & Schapire (1994). Its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett & Long, 1998).

Dual fat-shattering dimension. Define the dual class F* ⊆ [0, 1]^F of F as the set of all functions g_w : F → [0, 1], for w ∈ X, defined by g_w(f) = f(w). If we think of a function class as a matrix whose rows and columns are indexed by functions and points, respectively, then the dual class is given by the transpose of the matrix. The dual fat-shattering dimension at scale γ is defined as the fat-shattering dimension at scale γ of the dual class and is denoted fat*(F, γ). We have the following bound, due to Kleer & Simon (2021, Corollary 3.8 and Inequality 3.1):

fat*(F, γ) ≲ (1/γ) · 2^{fat(F,γ/2)+1}. (7)

Covering numbers. We say that G ⊆ [0, 1]^Ω is an ϵ-cover for F ⊆ [0, 1]^Ω in the d_∞ norm if for any f ∈ F there exists g ∈ G such that for all x ∈ Ω, |f(x) − g(x)| ≤ ϵ. The ϵ-covering number of F, denoted N(ϵ, F, d_∞), is the minimal cardinality of any ϵ-cover.

Sample compression scheme. We say that a pair of functions (κ, ρ) is a uniform α-approximate sample compression scheme of size k for a class H ⊆ [0, 1]^X and the ℓp loss if, for any m ∈ N, h* ∈ H, and sample S = {(x_i, y_i)}_{i=1}^m, the compression function satisfies κ(S) ⊆ S and |κ(S)| ≤ k, and the reconstruction function ρ(κ(S)) = ĥ satisfies

∀i ∈ [m]: |ĥ(x_i) − y_i|^p ≤ |h*(x_i) − y_i|^p + α. (8)

Similarly, we define an adversarially robust uniform α-approximate sample compression scheme by requiring

∀i ∈ [m]: sup_{z_i∈U(x_i)} |ĥ(z_i) − y_i|^p ≤ sup_{z_i∈U(x_i)} |h*(z_i) − y_i|^p + α. (9)

Notation.
We use the notation Õ(·) to omit poly-logarithmic factors in fat(H, γ), fat*(H, γ), 1/ϵ, 1/δ, 1/η, 1/β. We denote [n] = {1, . . . , n} and exp(·) = e^{(·)}. The symbols ≲ and ≳ denote inequalities up to a constant factor, and ≂ denotes equality up to a constant factor. Vectors are written using bold symbols.

3. Robust Regression for ℓp Losses

In this section, we provide an algorithm and prove its sample complexity for robust regression with ℓp losses. Moreover, our learning algorithm is proper for convex function classes, arguably the most commonly studied subcategory of real-valued function classes for regression. This result circumvents a negative result from Montasser et al. (2019): there exist non-convex function classes where proper learning is impossible.

Theorem 3.1. The sample complexity of Algorithm 1 for robust (ϵ, δ)-PAC learning a concept class H with the ℓp robust loss is

Õ( fat³(H, cϵ/p) · fat*(H, cϵ/p) )

(omitting polynomial dependence on 1/ϵ and 1/δ), for some numerical constant c ∈ (0, ∞). Recall that fat*(F, ϵ) ≲ (1/ϵ) · 2^{fat(F,ϵ/2)+1} by Equation (7).

Remark 3.2. The output of Algorithm 1 is a convex combination of functions from the concept class, which is a proper predictor assuming convexity of the function class.

Remark 3.3. As in non-robust regression, our results generalize to loss functions with bounded codomain [0, M]. The generalization bound should be multiplied by pM^p, and the scaling of the fat-shattering dimension should be ϵ/(pM^p).

In the following result, we establish generalization from approximate interpolation with cutoff parameters that change across points. This generalizes a result of Anthony & Bartlett (2000), where the cutoff parameter is fixed for all points. The proof is in Appendix A.

Theorem 3.4 (Generalization from approximate interpolation with changing cutoffs). Let H ⊆ [0, 1]^X with a finite fat-shattering dimension (at any scale).
For any β, ϵ, δ ∈ (0, 1), any function η : X × Y → [0, 1], and any distribution D over X × Y, for a random sample S ∼ D^m with

m ≳ (1/ϵ) ( fat(H, β/8) log²(fat(H, β/8)) + log(1/δ) ),

with probability at least 1 − δ over S, every h ∈ H satisfying |h(x) − y| ≤ η(x, y) for all (x, y) ∈ S also satisfies

P_{(x,y)∼D}{ (x, y) : |h(x) − y| ≤ η(x, y) + β } ≥ 1 − ϵ.

Algorithm 1 (Improper Robust Regressor with High-Vote).
Input: H ⊆ [0, 1]^X, S = {(x_i, y_i)}_{i=1}^m. Parameters: ϵ. Algorithms used: ℓp-RERM_H (Equation (6)), η-RERM_H (Equation (5)), a variant of Multiplicative Weights (Algorithm 4).
1. Compute h* ← ℓp-RERM_H(S). Denote η(x, y) = sup_{z∈U(x)} |h*(z) − y|^p for all (x, y) ∈ S.
2. Inflate S to S_U to include all perturbed points.
3. Discretize S_U → S̄_U: (i) Construct a function class Ĥ, where each ĥ ∈ Ĥ is defined by the η-RERM optimizer on O((1/ϵ) fat(H, O(ϵ/p))) points from S. The input cutoff parameters to the optimizer are the η(x, y) computed in step 1. (ii) Each (z, y) ∈ S_U defines a function in the dual space, f_{(z,y)} : Ĥ → [0, 1], with f_{(z,y)}(h) = |h(z) − y|^p. Define S̄_U to be a minimal cover of S_U under the d_∞ norm at scale O(ϵ).
4. Compute a modified Multiplicative Weights on S̄_U. Let {ĥ_1, . . . , ĥ_T} be the returned set of functions.
Output: ĥ = (1/T) Σ_{i=1}^T ĥ_i.

We construct an adversarially robust sample compression scheme of a size determined by the fat-shattering dimension of the function class. Recall that uniform convergence does not necessarily hold; instead, we derive generalization from sample compression bounds. Our proof crucially relies on Theorem 3.4.

Proof overview and algorithm outline. The complete proof is in Appendix B. We follow the steps in Algorithm 1. (1) We start by computing a robust empirical risk minimizer (ERM) h* on S for the ℓp robust loss. This defines the target loss we are aiming for at any point in S.
In other words, the robust loss of h* on (x, y) defines a cutoff η(x, y), and our goal is to construct a predictor with a loss of at most η(x, y) + ϵ for every (x, y) ∈ S, which means that this predictor is an approximate robust ERM. In order to derive generalization, we cannot rely on uniform convergence. Instead, our predictor is based on a sample compression scheme, from which we can generalize. (2) Inflate the training set to include all possible perturbations. Whenever the same perturbation is mapped to more than one input, we assign the label of the input with the smallest index to prevent ambiguity. We denote this set by S_U. (3) Discretize the set S_U as follows: (i) Construct a set of functions Ĥ, such that each function is the output of η-RERM for H (defined in Equation (5)) on a subset S′ ⊆ S of size d = O((1/ϵ) fat(H, O(ϵ/p))). This means that for any S′ ⊆ S there exists ĥ ∈ Ĥ that is an approximate robust ERM on S′; that is, ĥ is within η(x, y) + ϵ for every (x, y) ∈ S′. The size of Ĥ is bounded by the number of size-d subsets of S, i.e., by (m choose d), where |S| = m. (ii) Define a discretization S̄_U ⊆ S_U by taking a uniform cover of the dual space. In the dual space, each (z, y) ∈ S_U defines a function f_{(z,y)} : Ĥ → [0, 1] with f_{(z,y)}(h) = |h(z) − y|^p. We take a minimal O(ϵ)-cover of S_U in the supremum norm, which is of size N(O(ϵ), S_U, d_∞). We use covering-number arguments (Rudelson & Vershynin, 2006) to upper bound the size of S̄_U. (4) Compute a variant of the Multiplicative Weights (MW) update (Algorithm 4) on S̄_U for T ≲ log|S̄_U| rounds, as follows. From a corollary of Theorem 3.4 (see Corollary A.1), we know that for any distribution P on S̄_U, upon receiving an i.i.d. sample S̃ from P of size d, with probability 2/3 over the sampling of S̃ from P, any h ∈ H with |h(z) − y|^p ≤ η(z, y) for all (z, y) ∈ S̃ satisfies

P_{(z,y)∼P}{ (z, y) : |h(z) − y|^p ≤ η(z, y) + ϵ } ≥ 1 − ϵ,

where η(z, y) is the η(x, y) for which z ∈ U(x), as defined in step 2. We can conclude that for any distribution P on S̄_U, there exists such a set of points S̃ ⊆ S̄_U.
Then, we can find a set S′ of d points in S from which S̃ originated; formally, S′ ⊆ S such that S̃ ⊆ ∪_{(x,y)∈S′} {(z, y) : z ∈ U(x)}. We execute the optimizer ĥ ← η-RERM on S′ with the relevant cutoff parameters. Then ĥ has error at most η(z, y) + ϵ on a (1 − ϵ)-fraction of points with respect to the distribution P. We start with P_1 as the uniform distribution over S̄_U and find ĥ_1 accordingly. We then perform a multiplicative weights update on the distribution, find the next hypothesis with respect to the new distribution, and so forth. Following the analysis of MW (or α-Boost) from Schapire & Freund (2013, Section 6), we know that for any point in S̄_U, roughly a (1 − ϵ)-fraction of the base learners are within ϵ of the target cutoff. The remaining ϵ-fraction can contribute an error of at most ϵ, since the loss is bounded by 1. We get that for any point in S̄_U, the average loss of the hypotheses in the ensemble is within 2ϵ of the target cutoff. Crucially, we use strong base learners in the ensemble. By the covering argument, we get that for any point in S_U, the average loss of the ensemble is within 4ϵ:

∀(z, y) ∈ S_U : (1/T) Σ_{i=1}^T |ĥ_i(z) − y|^p ≤ η(z, y) + 4ϵ.

We want the average prediction (1/T) Σ_{i=1}^T ĥ_i itself to be within the target cutoffs. For that reason, we use the convexity of the ℓp loss to show that

| (1/T) Σ_{i=1}^T ĥ_i(z) − y |^p ≤ (1/T) Σ_{i=1}^T |ĥ_i(z) − y|^p.

Therefore, we conclude that

∀(z, y) ∈ S_U : | (1/T) Σ_{i=1}^T ĥ_i(z) − y |^p ≤ η(z, y) + 4ϵ,

which implies that we have an approximate robust ERM for S:

∀(x, y) ∈ S : sup_{z∈U(x)} | (1/T) Σ_{i=1}^T ĥ_i(z) − y |^p ≤ η(x, y) + 4ϵ.

The proof follows by applying a sample compression generalization bound for the agnostic case, bounding the compression size, and rescaling ϵ. For convex classes, we have a proper learner: the output of the algorithm is a convex combination of functions from H, which is also in the class.

4. Better Sample Complexity for the ℓ1 Loss

In this section, we provide an algorithm with a substantial sample complexity improvement for the ℓ1 robust loss.
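The gain comes from the aggregation rule. A toy sketch of the underlying phenomenon (the values are hypothetical, our own illustration): the median of an ensemble stays accurate as long as a majority of its members are accurate, whereas an average can be dragged off target by a small minority of wild predictions.

```python
import statistics

target, eps = 0.5, 0.05
# Four weak learners within eps of the target, plus one wildly wrong member.
preds = [0.48, 0.52, 0.50, 0.49, 1.0]

med = statistics.median(preds)
avg = sum(preds) / len(preds)
print(abs(med - target) <= eps)  # True: the median stays within eps
print(abs(avg - target) <= eps)  # False: the average is pulled away
```

This robustness of the median is what lets Algorithm 2 below get by with weak learners, each trained on fewer samples.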
The key technical idea in this result is that if we replace strong base learners with weak learners in the improper ensemble predictor, we can still obtain an accurate prediction by taking the median aggregation of the ensemble. Thus, we incorporate a variant of median boosting for real-valued functions (Kégl, 2003; Hanneke et al., 2019) into our algorithm. Each base learner requires fewer samples, and as a result, we improve the sample complexity. In contrast, in Algorithm 1 we obtained accurate predictions from a (1 − O(ϵ))-fraction of the predictors and output their average.

Theorem 4.1. The sample complexity of Algorithm 2 for robust (ϵ, δ)-PAC learning a concept class H with the ℓ1 robust loss is

Õ( fat(H, cϵ) · fat*(H, cϵ) )

(omitting polynomial dependence on 1/ϵ and 1/δ), for some numerical constant c ∈ (0, ∞). Recall that fat*(F, ϵ) ≲ (1/ϵ) · 2^{fat(F,ϵ/2)+1} by Equation (7).

We shall define the notion of weak learners in the context of real-valued learning.

Definition 4.2 (Weak real-valued learner). Let ξ ∈ (0, 1/2], ζ ∈ [0, 1]. We say that f : X → [0, 1] is a (ζ, ξ)-weak learner for the ℓp loss, with respect to D and a target function h* ∈ H, if

P_{(x,y)∼D}{ (x, y) : |f(x) − y|^p > |h*(x) − y|^p + ζ } ≤ 1/2 − ξ.

This notion of a weak learner must be formulated carefully. For example, a learner guaranteeing absolute loss at most 1/2 − ξ is known to not be strong enough for boosting to work. On the other hand, if the requirement is made too strong (for example, AdaBoost.R in Freund & Schapire (1997)), then the sample complexity of weak learning is so high that weak learners cannot be expected to exist for certain function classes. We can now present an overview of the proof and the algorithm.

Algorithm 2 (Improper Robust Regressor).
Input: H ⊆ [0, 1]^X, S = {(x_i, y_i)}_{i=1}^m. Parameters: ϵ. Algorithms used: ℓ1-RERM_H (Equation (6)), η-RERM_H (Equation (5)), a variant of median boosting: MedBoost (Algorithm 5), a sparsification method (Algorithm 6).
1. Compute h* ← ℓ1-RERM_H(S).
Denote η(x, y) = sup_{z∈U(x)} |h*(z) − y| for all (x, y) ∈ S.
2. Inflate S to S_U to include all perturbed points.
3. Discretize S_U → S̄_U: (i) Construct a function class Ĥ, where each ĥ ∈ Ĥ is defined by the η-RERM optimizer on O(fat(H, O(ϵ))) points from S. The input cutoff parameters to the optimizer are the η(x, y) computed in step 1. (ii) Each (z, y) ∈ S_U defines a function in the dual space, f_{(z,y)} : Ĥ → [0, 1], with f_{(z,y)}(h) = |h(z) − y|. Define S̄_U to be a minimal cover of S_U under the d_∞ norm at scale O(ϵ).
4. Compute a modified MedBoost on S̄_U, where Ĥ consists of weak learners for any distribution over S̄_U. Let F = {ĥ_1, . . . , ĥ_T} be the returned set of functions.
5. Sparsify the set F to a smaller set {ĥ_1, . . . , ĥ_k}.
Output: ĥ = Median(ĥ_1, . . . , ĥ_k).

Proof overview and algorithm outline. The complete proof is in Appendix C. We explain the main differences from Algorithm 1 and where the sample complexity improvement comes from. In the discretization step, we replace the strong base learners in Ĥ with weak learners. We construct an improper ensemble predictor via a median boosting algorithm, where the weak learners are chosen from Ĥ. Specifically, each function in Ĥ is the output of η-RERM for H (defined in Equation (5)) on a subset S′ ⊆ S of size d = O(fat(H, O(ϵ))). This is in contrast to Algorithm 1, where we use a Multiplicative Weights update that operates with strong base learners. We can still make accurate predictions by taking the median aggregation, thanks to the ℓ1 loss. Another improvement arises from sparsifying the ensemble (Hanneke et al., 2019) so that its size is independent of the sample size, while keeping the median accurate at almost the same resolution. The sparsification step uses sampling and uniform convergence in the dual space (with respect to the non-robust loss). We elaborate on the steps in Algorithm 2.
Steps (1), (2), and (3) are similar to Algorithm 1, besides the construction of Ĥ, as explained above. In step (4), we compute a modified version of the real-valued boosting algorithm MedBoost (Kégl, 2003) on the discretized set S̄_U; see Algorithm 5. Hanneke et al. (2019) showed how to construct a sample compression scheme from MedBoost. From this step, we have that for any point in S̄_U, the median of the losses of the hypotheses in the ensemble is within ϵ of the target cutoff computed in step 1. By the covering argument, the median of the losses is within 3ϵ for any point (z, y) ∈ S_U:

Med( |ĥ_1(z) − y|, . . . , |ĥ_T(z) − y| ) ≤ η(z, y) + 3ϵ.

The median is translation invariant, so we have

| Med( ĥ_1(z), . . . , ĥ_T(z) ) − y | ≤ η(z, y) + 3ϵ.

Finally, for any (x, y) ∈ S,

sup_{z∈U(x)} | Med( ĥ_1(z), . . . , ĥ_T(z) ) − y | ≤ η(x, y) + 3ϵ.

To further reduce the sample compression size, in step (5) we sparsify the ensemble to k = O(fat*(H, cϵ)) functions, obtaining

sup_{z∈U(x)} | Med( ĥ_1(z), . . . , ĥ_k(z) ) − y | ≤ η(x, y) + 4ϵ.

The proof follows by applying a sample compression generalization bound for the agnostic case, bounding the compression size, and rescaling ϵ.

5. Robust (η, β)-Regression

In this section, we study robust (η, β)-regression in the realizable and agnostic settings. We provide an algorithm for the realizable setting and show how to reduce agnostic to realizable learning. We conclude by deriving sample complexity guarantees for both settings. This model differs from regression guaranteeing a small expected error (with high probability): in robust (η, β)-regression, we aim for a small pointwise absolute error almost everywhere on the support of the distribution. Results for this model do not follow from the standard regression model. We first present our result for the realizable case. The proof is in Appendix D.

Theorem 5.1.
Assuming the distribution $D$ is $\eta$-uniformly realizable (see Equation (4)) by a class $H \subseteq [0, 1]^X$, the sample complexity of Algorithm 3 for robust $(\eta, \beta, \epsilon, \delta)$-PAC learning a concept class $H$ is $O\big(\mathrm{fat}(H, c\beta) \cdot \mathrm{fat}^*(H, c\beta)\big)$ for some numerical constant $c \in (0, \infty)$. Recall that $\mathrm{fat}^*(F, \epsilon) \le \frac{1}{\epsilon}\, 2^{\mathrm{fat}(F, \epsilon/2) + 1}$ by Equation (7).

Algorithm 3 Improper Robust (η, β)-Regressor
Input: $H \subseteq [0, 1]^X$, $S = \{(x_i, y_i)\}_{i=1}^m$. Parameters: $\eta, \beta$.
Algorithms used: $\eta$-RERM$_H$ (Equation (5)), a variant of median boosting: MedBoost (Algorithm 5), a sparsification method (Algorithm 6).
1. Inflate $S$ to $S_U$ to include all perturbed points.
2. Discretize $S_U \to \bar{S}_U$:
(i) Construct a function class $\hat{H}$, where each $\hat{h} \in \hat{H}$ is defined by the $\eta$-RERM optimizer on $O(\mathrm{fat}(H, O(\beta)))$ points from $S$. The input cutoff parameter to the optimizer is the fixed $\eta$, for all points.
(ii) Each $(z, y) \in S_U$ defines a function in the dual space, $f_{(z,y)} : \hat{H} \to [0, 1]$, such that $f_{(z,y)}(h) = |h(z) - y|$. Define $\bar{S}_U$ to be a minimal cover of $S_U$ under the $d_\infty$ norm at scale $O(\beta)$.
3. Run the modified MedBoost on $\bar{S}_U$, where $\hat{H}$ consists of weak learners for any distribution over $\bar{S}_U$. Let $F = \{\hat{h}_1, \ldots, \hat{h}_T\}$ be the returned set of classifiers.
4. Sparsify the set $F$ to a smaller set $\{\hat{h}_1, \ldots, \hat{h}_k\}$.
Output: $\hat{h} = \mathrm{Median}(\hat{h}_1, \ldots, \hat{h}_k)$.

We explain the main differences from Algorithm 2. This model is different from robust regression with the $\ell_1$ loss. Our goal is to find a predictor whose prediction is within $\eta + \beta$ of the true label almost everywhere on the domain, assuming that the distribution is $\eta$-uniformly realizable by the function class (Equation (4)). In this model, the cutoff parameter is given to us as a parameter and is fixed for all points. This is different from Algorithms 1 and 2, where we computed the changing cutoffs with a robust ERM oracle. Moreover, the weak learners in $\hat{H}$ are defined as the output of $\eta$-RERM run on a subset $S' \subseteq S$ of size $d = O(\mathrm{fat}(H, O(\beta)))$. Note that the scale of shattering depends on $\beta$ and not on $\epsilon$.
The resolution of the discretization in the cover depends on $\beta$ as well.

Agnostic setting. We establish an upper bound on the sample complexity of the agnostic setting by a reduction to the realizable case. The main argument was originally suggested in David et al. (2016) for the 0-1 loss, and it holds for the $\eta$-ball robust loss as well. The proof is in Appendix D.

Theorem 5.2. The sample complexity for agnostic robust $(\eta, \beta, \epsilon, \delta)$-PAC learning a concept class $H$ is $O\big(\mathrm{fat}(H, c\beta) \cdot \mathrm{fat}^*(H, c\beta)\big)$ for some numerical constant $c \in (0, \infty)$. Recall that $\mathrm{fat}^*(F, \epsilon) \le \frac{1}{\epsilon}\, 2^{\mathrm{fat}(F, \epsilon/2) + 1}$ by Equation (7).

Remark. An agnostic learner for robust $(\eta, \beta)$-regression does not apply to the robust regression setting. The reason is that the optimal function in $H$ may have different scales of robustness at different points, which motivates our approach of using changing cutoffs for different points. In Appendix D.3 we show that by using a fixed cutoff for all points we can obtain an error of only $\mathrm{OPT}_H + \epsilon$.

6. Discussion

In this paper, we studied the robustness of real-valued functions to test-time attacks. We showed that a finite fat-shattering dimension is sufficient for learnability. We proved a sample complexity guarantee for learning with general $\ell_p$ losses and improved it for the $\ell_1$ loss. We also studied a model of regression with a cutoff loss, and proved sample complexity guarantees in the realizable and agnostic settings. We leave several interesting open questions for future research. (i) Improve the upper bound for learning with $\ell_p$ robust losses (if possible) and show a lower bound. There might be a gap between the sample complexities for different values of $p$. More specifically, what is the sample complexity for learning with the $\ell_2$ robust loss? (ii) We showed that a finite fat-shattering dimension is a sufficient condition for learnability. What is a necessary condition? In the binary-valued case, we know that having a finite VC dimension is not necessary.
(iii) To what extent can we benefit from unlabeled samples when learning real-valued functions? This question was considered by Attias et al. (2022) for binary function classes, where they showed that the labeled sample complexity can be arbitrarily smaller than in the fully-supervised setting. (iv) In this work we focused on the statistical aspects of robustly learning real-valued functions. It would be interesting to explore the computational aspects as well.

Acknowledgments

We thank Meni Sadigurschi for helpful discussions regarding sample compression schemes of real-valued functions. We are also grateful to Uri Sherman for discussions on the generalization of ERM in Stochastic Convex Optimization. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), the Israel Science Foundation (grants 993/17, 1602/19), the Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University. I.A. is supported by the Vatat Scholarship from the Israeli Council for Higher Education and by the Kreitman School of Advanced Graduate Studies.

References

Agarwal, N., Bullins, B., Hazan, E., Kakade, S., and Singh, K. Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111–119. PMLR, 2019.

Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615–631, 1997.

Anava, O., Hazan, E., Mannor, S., and Shamir, O. Online learning for time series prediction. In Conference on Learning Theory, pp. 172–184. PMLR, 2013.

Angluin, D. Queries and concept learning. Machine Learning, 2:319–342, 1988.

Anthony, M. and Bartlett, P. L. Function learning from interpolation. Combinatorics, Probability and Computing, 9(3):213–225, 2000.

Anthony, M. and Bartlett, P. L.
Neural Network Learning: Theoretical Foundations, volume 9. Cambridge University Press, 1999.

Ashtiani, H., Pathak, V., and Urner, R. Black-box certification and learning under adversarial perturbations. In International Conference on Machine Learning, pp. 388–398. PMLR, 2020.

Attias, I., Kontorovich, A., and Mansour, Y. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, pp. 162–183. PMLR, 2019.

Attias, I., Hanneke, S., and Mansour, Y. A characterization of semi-supervised adversarially-robust PAC learnability. arXiv preprint arXiv:2202.05420, 2022.

Awasthi, P., Frank, N., and Mohri, M. Adversarial learning guarantees for linear hypotheses and neural networks. In International Conference on Machine Learning, pp. 431–441. PMLR, 2020.

Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate losses. Advances in Neural Information Processing Systems, 34, 2021a.

Awasthi, P., Frank, N., and Mohri, M. On the existence of the adversarial Bayes classifier. Advances in Neural Information Processing Systems, 34, 2021b.

Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp. 1117–1174. PMLR, 2022a.

Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Multi-class H-consistency bounds. Advances in Neural Information Processing Systems, 35:782–795, 2022b.

Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pp. 10077–10094. PMLR, 2023.

Bartlett, P. L. and Long, P. M. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174–190, 1998.

Bhattacharjee, R., Jha, S., and Chaudhuri, K.
Sample complexity of robust linear classification on separated data. In International Conference on Machine Learning, pp. 884–893. PMLR, 2021.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.

Bubeck, S., Lee, Y. T., Price, E., and Razenshteyn, I. Adversarial examples from computational constraints. In International Conference on Machine Learning, pp. 831–840. PMLR, 2019.

Candes, E. and Recht, B. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

Cullina, D., Bhagoji, A. N., and Mittal, P. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, pp. 230–241, 2018.

Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guarantees for adversarially robust Gaussian classification. In International Conference on Machine Learning, pp. 2345–2355. PMLR, 2020.

Daniely, A. and Shalev-Shwartz, S. Optimal learners for multiclass problems. In Conference on Learning Theory, pp. 287–316. PMLR, 2014.

Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 207–232. JMLR Workshop and Conference Proceedings, 2011.

David, O., Moran, S., and Yehudayoff, A. Supervised learning through the lens of compression. Advances in Neural Information Processing Systems, 29:2784–2792, 2016.

Diochnos, D., Mahloujifar, S., and Mahmoody, M. Adversarial risk and robustness: General definitions and implications for the uniform distribution. Advances in Neural Information Processing Systems, 31, 2018.

Dudley, R. M. A course on empirical processes. In École d'Été de Probabilités de Saint-Flour XII–1982, pp. 1–142. Springer, 1984.

Duffy, N. and Helmbold, D.
Boosting methods for regression. Machine Learning, 47(2):153–200, 2002.

Feige, U., Mansour, Y., and Schapire, R. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, pp. 637–657, 2015.

Floyd, S. and Warmuth, M. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Gourdeau, P., Kanade, V., Kwiatkowska, M., and Worrell, J. On the hardness of robust classification. The Journal of Machine Learning Research, 22(1):12521–12549, 2021.

Graepel, T., Herbrich, R., and Shawe-Taylor, J. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1-2):55–76, 2005.

Hanneke, S. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.

Hanneke, S., Kontorovich, A., and Sadigurschi, M. Sample compression for real-valued learners. In Algorithmic Learning Theory, pp. 466–488. PMLR, 2019.

Hazan, E. and Ma, T. A non-generative framework and convex relaxations for unsupervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Hazan, E., Kale, S., and Shalev-Shwartz, S. Near-optimal algorithms for online matrix prediction. In Conference on Learning Theory, pp. 38.1. JMLR Workshop and Conference Proceedings, 2012.

Hazan, E., Livni, R., and Mansour, Y. Classification with low rank and missing data. In International Conference on Machine Learning, pp. 257–266. PMLR, 2015.

Kearns, M. J. and Schapire, R. E. Efficient distribution-free learning of probabilistic concepts.
Journal of Computer and System Sciences, 48(3):464–497, 1994.

Kégl, B. Robust regression by boosting the median. In Learning Theory and Kernel Machines, pp. 258–272. Springer, 2003.

Khim, J. and Loh, P.-L. Adversarial risk bounds via function transformation. arXiv preprint arXiv:1810.09519, 2018.

Kleer, P. and Simon, H. Primal and dual combinatorial dimensions. arXiv preprint arXiv:2108.10037, 2021.

Kontorovich, A. and Attias, I. Fat-shattering dimension of k-fold maxima. arXiv preprint arXiv:2110.04763, 2021.

Littlestone, N. and Warmuth, M. Relating data compression and learnability. 1986.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Mao, A., Mohri, M., and Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. arXiv preprint arXiv:2304.07288, 2023.

Mendelson, S. An optimal unrestricted learning procedure. Journal of the ACM, 66(6):1–42, 2019.

Montasser, O., Hanneke, S., and Srebro, N. VC classes are adversarially robustly learnable, but only improperly. arXiv preprint arXiv:1902.04217, 2019.

Montasser, O., Goel, S., Diakonikolas, I., and Srebro, N. Efficiently learning adversarially robust halfspaces with noise. In International Conference on Machine Learning, pp. 7010–7021. PMLR, 2020a.

Montasser, O., Hanneke, S., and Srebro, N. Reducing adversarially robust learning to non-robust PAC learning. Advances in Neural Information Processing Systems, 33:14626–14637, 2020b.

Montasser, O., Hanneke, S., and Srebro, N. Adversarially robust learning with unknown perturbation sets. In Conference on Learning Theory, pp. 3452–3482. PMLR, 2021a.

Montasser, O., Hanneke, S., and Srebro, N. Transductive robust learning guarantees. arXiv preprint arXiv:2110.10602, 2021b.

Pollard, D. Convergence of Stochastic Processes. Springer Science & Business Media, 2012.

Rudelson, M. and Vershynin, R.
Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, pp. 603–648, 2006.

Schapire, R. E. and Freund, Y. Boosting: Foundations and algorithms. Kybernetes, 2013.

Schmidt, L., Santurkar, S., Tsipras, D., Talwar, K., and Madry, A. Adversarially robust generalization requires more data. Advances in Neural Information Processing Systems, 31, 2018.

Simon, H. U. Bounds on the number of examples needed for learning functions. SIAM Journal on Computing, 26(3):751–763, 1997.

Srebro, N., Rennie, J., and Jaakkola, T. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17, 2004.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Xing, Y., Zhang, R., and Cheng, G. Adversarially robust estimate and risk analysis in linear regression. In International Conference on Artificial Intelligence and Statistics, pp. 514–522. PMLR, 2021.

Yin, D., Kannan, R., and Bartlett, P. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning, pp. 7085–7094. PMLR, 2019.

A. Auxiliary Results

Proof of Theorem 3.4. Let $F \subseteq [0, 1]^X$ and let $H = \{(x, y) \mapsto |f(x) - y| : f \in F\}$. Define the function classes
$F_1 = \{(x, y) \mapsto h(x, y) - \eta(x, y) : h \in H\}$,
and
$F_2 = \{(x, y) \mapsto \max\{f(x, y), 0\} : f \in F_1\}.$
We claim that $\mathrm{fat}(H, \gamma) = \mathrm{fat}(F_1, \gamma)$. Take a set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ that is $\gamma$-shattered by $H$. There exists a witness $r = (r_1, \ldots, r_m) \in [0, 1]^m$ such that for each $\sigma = (\sigma_1, \ldots, \sigma_m) \in \{-1, 1\}^m$ there is a function $h_\sigma \in H$ such that $h_\sigma((x_i, y_i)) \ge r_i + \gamma$ if $\sigma_i = 1$, and $h_\sigma((x_i, y_i)) \le r_i - \gamma$ if $\sigma_i = -1$. The set $S$ is then $\gamma$-shattered by $F_1$ with the shifted witness $(r_1 - \eta(x_1, y_1), \ldots, r_m - \eta(x_m, y_m))$, since each function in $F_1$ is a function in $H$ shifted pointwise by $-\eta$. Similarly, any set that is $\gamma$-shattered by $F_1$ is also $\gamma$-shattered by $H$.
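These shattering arguments can be made concrete on finite samples. Below is a brute-force $\gamma$-shattering check for a finite set of hypotheses, a hypothetical helper for tiny instances only (the search is exponential); the fat-shattering dimension is the size of the largest shattered set:

```python
from itertools import product

def is_gamma_shattered(points, hypotheses, gamma):
    """Brute-force check that the finite set `hypotheses` (callables)
    gamma-shatters `points`: some shared witness r must realize every
    sign pattern with margin gamma. Exponential; tiny instances only."""
    vals = [[h(x) for h in hypotheses] for x in points]
    # For a finite class it suffices to try witness coordinates of the
    # form h(x_i) +/- gamma (sliding each coordinate to a tight constraint).
    cands = [sorted({v + s * gamma for v in col for s in (-1, 1)}) for col in vals]
    patterns = list(product((1, -1), repeat=len(points)))
    for r in product(*cands):
        if all(
            any(
                all(
                    vals[i][j] >= r[i] + gamma if sigma[i] == 1
                    else vals[i][j] <= r[i] - gamma
                    for i in range(len(points))
                )
                for j in range(len(hypotheses))
            )
            for sigma in patterns
        ):
            return True
    return False
```

For example, the two constant functions $0$ and $1$ shatter any single point at scale $0.25$ but no pair of points, matching the fact that constant classes have fat-shattering dimension 1.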
The class $F_2$ consists of choosing a function from $F_1$ and computing its pointwise maximum with the constant function 0. In general, for two function classes $G_1, G_2$, we can define the maximum aggregation class $\max(G_1, G_2) = \{x \mapsto \max\{g_1(x), g_2(x)\} : g_i \in G_i\}$, and Kontorovich & Attias (2021) showed that for any $G_1, G_2$,
$\mathrm{fat}(\max(G_1, G_2), \gamma) \le O\big((\mathrm{fat}(G_1, \gamma) + \mathrm{fat}(G_2, \gamma)) \log^2(\mathrm{fat}(G_1, \gamma) + \mathrm{fat}(G_2, \gamma))\big).$
Taking $G_1 = F_1$ and $G_2 \equiv 0$, we get $\mathrm{fat}(F_2, \gamma) \le O\big(\mathrm{fat}(F_1, \gamma) \log^2(\mathrm{fat}(F_1, \gamma))\big)$. For the particular case $G_2 \equiv 0$, we can show the better bound $\mathrm{fat}(F_2, \gamma) \le \mathrm{fat}(F_1, \gamma)$. In words, truncation cannot increase the shattering dimension. Indeed, take a set $S = \{(x_1, y_1), \ldots, (x_k, y_k)\}$ that is $\gamma$-shattered by $F_2 = \max(F_1, 0)$; we show that this set is $\gamma$-shattered by $F_1$. There exists a witness $r = (r_1, \ldots, r_k) \in [0, 1]^k$ such that for each $\sigma = (\sigma_1, \ldots, \sigma_k) \in \{-1, 1\}^k$ there is a function $f_\sigma \in F_1$ such that $\max\{f_\sigma((x_i, y_i)), 0\} \ge r_i + \gamma$ if $\sigma_i = 1$, and $\max\{f_\sigma((x_i, y_i)), 0\} \le r_i - \gamma$ if $\sigma_i = -1$. For $\max\{f_\sigma((x_i, y_i)), 0\} \le r_i - \gamma$, we simply have that $f_\sigma((x_i, y_i)) \le r_i - \gamma$; moreover, since the truncated value is nonnegative, this implies that $r_i \ge \gamma$. As a result, whenever $\max\{f_\sigma((x_i, y_i)), 0\} \ge r_i + \gamma$, the maximum is at least $2\gamma > 0$ and is therefore attained by $f_\sigma$ itself, which means that $f_\sigma((x_i, y_i)) \ge r_i + \gamma$. This shows that $F_1$ $\gamma$-shatters $S$ as well. We conclude the proof by applying Theorem A.2 to the class $F_2$ and taking $\eta(x, y) \equiv 0$.

Corollary A.1. Let $H \subseteq [0, 1]^X$ have a finite fat-shattering dimension (at every scale). For any $\beta, \epsilon, \delta \in (0, 1)$, $1 \le p < \infty$, any function $\eta : X \times Y \to [0, 1]$, and any distribution $D$ over $X \times Y$: for a random sample $S \sim D^m$ with
$m(\eta, \beta, \epsilon, \delta) = O\Big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \beta/8) \log^2 \frac{\mathrm{fat}(H, \beta/8)}{\delta}\Big),$
with probability at least $1 - \delta$ over $S$, every $h \in H$ satisfying $|h(x) - y|^p \le \eta(x, y)$ for all $(x, y) \in S$ also satisfies
$P_{(x, y) \sim D}\{(x, y) : |h(x) - y|^p \le \eta(x, y) + p\beta\} \ge 1 - \epsilon.$
Proof.
Following Theorem 3.4, we know that if $m = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \beta/8) \log^2 \frac{\mathrm{fat}(H, \beta/8)}{\delta}\big)$, then with high probability over $S \sim D^m$, any $h \in H$ satisfying $|h(x) - y| \le (\eta(x, y))^{1/p}$ for all $(x, y) \in S$ also satisfies
$P_{(x, y) \sim D}\big\{(x, y) : |h(x) - y| \le (\eta(x, y))^{1/p} + \beta\big\} \ge 1 - \epsilon.$
We show that for any $(x, y)$ with $|h(x) - y| \le (\eta(x, y))^{1/p} + \beta$, it holds that
$|h(x) - y|^p \overset{(i)}{\le} \big((\eta(x, y))^{1/p} + \beta\big)^p \overset{(ii)}{\le} \eta(x, y) + p\beta,$
and that will finish the proof. (i) follows by raising both sides to the power of $p$. (ii) follows since the function $t \mapsto t^p$ is $p$-Lipschitz for $t \in [0, 1]$, and so
$\big((\eta(x, y))^{1/p} + \beta\big)^p - \eta(x, y) = \big((\eta(x, y))^{1/p} + \beta\big)^p - \big((\eta(x, y))^{1/p}\big)^p \le p\big((\eta(x, y))^{1/p} + \beta - (\eta(x, y))^{1/p}\big) = p\beta.$

Theorem A.2 (Generalization from approximate interpolation; Anthony et al., 1999, Theorems 21.13 and 21.14). Let $H \subseteq [0, 1]^X$ have a finite fat-shattering dimension (at every scale). For any $\beta, \epsilon, \delta \in (0, 1)$, $\eta \in [0, 1]$, and any distribution $D$ over $X \times Y$: for a random sample $S \sim D^m$ with
$m(\eta, \beta, \epsilon, \delta) = O\Big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \beta/8) \log^2 \frac{\mathrm{fat}(H, \beta/8)}{\delta}\Big),$
with probability at least $1 - \delta$ over $S$, every $h \in H$ satisfying $|h(x) - y| \le \eta$ for all $(x, y) \in S$ also satisfies
$P_{(x, y) \sim D}\{(x, y) : |h(x) - y| \le \eta + \beta\} \ge 1 - \epsilon.$

The following is a bound on covering numbers in $d_\infty$.

Lemma A.3 (Covering numbers for the infinity metric; Rudelson & Vershynin, 2006, Theorem 4.4). Let $F \subseteq [0, 1]^\Omega$ be a class of functions with $|\Omega| = n$. Then for any $0 < a \le 1$ and $0 < t < 1/2$,
$\log N(t, F, d_\infty) \le C v \log\frac{n}{vt} \cdot \log^a\frac{2n}{v},$
where $v = \mathrm{fat}(F, ct)$, and $C, c$ are universal constants.

Lemma A.4 (Fat-shattering of the loss class). Let $H \subseteq \mathbb{R}^m$ be a real-valued function class on $m$ points. Denote by $L^p_H$ the $\ell_p$ loss class of $H$, for $1 \le p < \infty$, and assume $L^p_H$ is bounded by $M$. For any such $H$,
$\mathrm{fat}(L^p_H, \gamma) \le O\big(\log^2(m)\, \mathrm{fat}(H, \gamma/(p M^{p-1}))\big).$

Proof. For any $X$ and any function class $H \subseteq \mathbb{R}^X$, define the difference class $H^- \subseteq \mathbb{R}^{X \times \mathbb{R}}$ as
$H^- = \{(x, y) \mapsto h(x) - y : h \in H\}.$
In words, $H^-$ consists of all functions $h^-(x, y) = h(x) - y$ indexed by $h \in H$. It is easy to see that for all $\gamma > 0$ we have $\mathrm{fat}_\gamma(H^-) \le \mathrm{fat}_\gamma(H)$.
Indeed, if $H^-$ $\gamma$-shatters some set $\{(x_1, y_1), \ldots, (x_k, y_k)\} \subseteq X \times \mathbb{R}$ with witness $r \in \mathbb{R}^k$, then $H$ $\gamma$-shatters the set $\{x_1, \ldots, x_k\} \subseteq X$ with witness $r + (y_1, \ldots, y_k)$. Next, we observe that taking the absolute value does not significantly increase the fat-shattering dimension. Indeed, for any real-valued function class $F$, define $\mathrm{abs}(F) := \{|f| : f \in F\}$. Observe that $\mathrm{abs}(F) \subseteq \max((F_j)_{j \in [2]})$, where $F_1 = F$ and $F_2 = -F := \{-f : f \in F\}$. It follows from Attias et al. (2019, Theorem 13) that
$\mathrm{fat}_\gamma(\mathrm{abs}(F)) \le O\big(\log^2(m)(\mathrm{fat}_\gamma(F) + \mathrm{fat}_\gamma(-F))\big) \le O\big(\log^2(m)\, \mathrm{fat}_\gamma(F)\big). \quad (10)$
Next, define $F$ as the $L_1$ loss class of $H$:
$F = \{(x, y) \mapsto |h(x) - y| : h \in H\}.$
Then
$\mathrm{fat}_\gamma(F) = \mathrm{fat}_\gamma(\mathrm{abs}(H^-)) \le O\big(\log^2(m)\, \mathrm{fat}_\gamma(H^-)\big) \le O\big(\log^2(m)\, \mathrm{fat}_\gamma(H)\big);$
this proves the claim for $L_1$. To analyze the general $L_p$ case, consider $F \subseteq [0, M]^X$ and define $F^p := \{f^p : f \in F\}$. We would like to bound $\mathrm{fat}_\gamma(F^p)$ in terms of $\mathrm{fat}_\gamma(F)$. Suppose that $F^p$ $\gamma$-shatters some set $\{x_1, \ldots, x_k\}$ with witness $r^p = (r_1^p, \ldots, r_k^p)$ (there is no loss of generality in assuming that the witness has the same range as the function class). Since the function $t \mapsto t^p$ is $p M^{p-1}$-Lipschitz on $[0, M]$, that is, $|a^p - b^p| \le p M^{p-1} |a - b|$ for all $a, b \in [0, M]$, we conclude that $F$ is able to $\gamma/(p M^{p-1})$-shatter the same $k$ points, and thus
$\mathrm{fat}_\gamma(F^p) \le \mathrm{fat}_{\gamma/(p M^{p-1})}(F).$
To extend this result to the case where $F \subseteq [-M, M]^X$, we use Equation (10). In particular, define $F$ as the $L_p$ loss class of $H$:
$F = \{(x, y) \mapsto |h(x) - y|^p : h \in H\}.$
Then
$\mathrm{fat}_\gamma(F) = \mathrm{fat}_\gamma\big((\mathrm{abs}(H^-))^p\big) \le \mathrm{fat}_{\gamma/(p M^{p-1})}(\mathrm{abs}(H^-)) \le O\big(\log^2(m)\, \mathrm{fat}_{\gamma/(p M^{p-1})}(H^-)\big) \le O\big(\log^2(m)\, \mathrm{fat}_{\gamma/(p M^{p-1})}(H)\big).$

The following generalization bound for sample compression in the realizable case was proven by Littlestone & Warmuth (1986) and Floyd & Warmuth (1995). Their proof is for the 0-1 loss, but it applies similarly to bounded loss functions.

Lemma A.5 (Sample compression generalization bound). Let $(\kappa, \rho)$ be a sample compression scheme, and let $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ be a loss function.
In the realizable case, for any $|\kappa(S)| \le m$, any $\delta \in (0, 1)$, and any distribution $D$ over $X \times \{0, 1\}$: for $S \sim D^m$, with probability $1 - \delta$,
$\mathrm{Err}(\rho(\kappa(S)); D) \le \widehat{\mathrm{Err}}(\rho(\kappa(S)); S) + O\Big(\frac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}\Big).$

The following generalization bound for sample compression in the agnostic case was proven by Graepel et al. (2005). Their proof is for the 0-1 loss, but it applies similarly to bounded loss functions. We use it with the $\eta$-ball robust loss.

Lemma A.6 (Agnostic sample compression generalization bound). Let $(\kappa, \rho)$ be a sample compression scheme, and let $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ be a loss function. In the agnostic case, for any $|\kappa(S)| \le m$, any $\delta \in (0, 1)$, and any distribution $D$ over $X \times \{0, 1\}$: for $S \sim D^m$, with probability $1 - \delta$,
$\mathrm{Err}(\rho(\kappa(S)); D) \le \widehat{\mathrm{Err}}(\rho(\kappa(S)); S) + O\Big(\sqrt{\frac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}}\Big).$

B. Proofs for Section 3: Robust Regression for ℓp Losses

Proof of Theorem 3.1. Fix $\epsilon, \delta \in (0, 1)$. Let $H \subseteq [0, 1]^X$. Fix a distribution $D$ over $X \times Y$, and let $S = \{(x_i, y_i)\}_{i=1}^m$ be an i.i.d. sample from $D$. We elaborate on each one of the steps described in Algorithm 1.

1. Compute $h^* \in \ell_p\text{-RERM}_H(S)$ in order to get the set of cutoffs $\eta(x, y) = \sup_{z \in U(x)} |h^*(z) - y|^p$ for $(x, y) \in S$. Let $\eta_S = (\eta(x_1, y_1), \ldots, \eta(x_m, y_m))$. Our goal is to construct a predictor with an empirical robust loss of at most $\eta(x, y) + \epsilon$ for every $(x, y) \in S$, for the $\ell_p$ loss, which means that our predictor is an approximate robust ERM.

2. Define the inflated training data set $S_U = \bigcup_{i \in [m]} \{(z, y_{I(z)}) : z \in U(x_i)\}$, where $I(z) = \min\{i \in [m] : z \in U(x_i)\}$. For $(z, y) \in S_U$, let $\eta(z, y)$ be the $\eta(x, y)$ for which $z \in U(x)$ and $y_{I(z)} = y$.

3. Discretize $S_U$ to a finite set $\bar{S}_U$ as follows.
(a) Define a set of functions, each defined by the $\eta$-RERM$_H$ optimizer on $d = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \epsilon/8p) \log^2 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2}\big)$ points from $S$:
$\hat{H} = \{\eta\text{-RERM}_H(S', \eta_{S'}, p) : S' \subseteq S, |S'| = d\}.$
Recall the definition of $\eta$-RERM$_H$ in Equation (5).
The cardinality of this class is bounded by $|\hat{H}| \le \binom{m}{d}$.
(b) The discretization $\bar{S}_U \subseteq S_U$ is defined by a covering of the dual class in the $d_\infty$ norm. Let $L^p_{\hat{H}}$ be the $L_p$ loss class of $\hat{H}$, namely,
$L^p_{\hat{H}} = \{(z, y) \mapsto |h(z) - y|^p : h \in \hat{H}\}.$
The dual class $(L^p_{\hat{H}})^* \subseteq [0, 1]^{\hat{H}}$ is defined as the set of all functions $f_{(z, y)} : \hat{H} \to [0, 1]$ such that $f_{(z, y)}(h) = |h(z) - y|^p$, for $(z, y) \in S_U$. Formally, $(L^p_{\hat{H}})^* = \{f_{(z, y)} : (z, y) \in S_U\}$, where $f_{(z, y)} = (f_{(z, y)}(h_1), \ldots, f_{(z, y)}(h_{|\hat{H}|}))$. We take $\bar{S}_U \subseteq S_U$ to be a minimal $\epsilon$-cover of $S_U$ in $d_\infty$:
$\sup_{(z, y) \in S_U} \inf_{(\bar{z}, \bar{y}) \in \bar{S}_U} \|f_{(z, y)} - f_{(\bar{z}, \bar{y})}\|_\infty \le \epsilon. \quad (12)$
Let $\mathrm{fat}^*(H, \epsilon)$ be the dual $\epsilon$-fat-shattering dimension of $H$. Using Lemma A.4, we can bound the dual fat-shattering dimension of the $L^p_H$ loss class by the fat-shattering dimension of $H$:
$\mathrm{fat}^*(L^p_H, \epsilon) \le \log^2(m)\, \mathrm{fat}^*(H, \epsilon/p). \quad (13)$
Applying the covering number argument of Lemma A.3 (taking $a = 1$) on the dual space, and upper bounding the dual fat-shattering dimension of the $L_p$ loss class as in Equation (13), we have the bound
$\log |\bar{S}_U| = \log N(\epsilon, S_U, d_\infty) \le O\Big(\mathrm{fat}^*(L^p_H, c\epsilon) \log^2 \frac{|\hat{H}|}{\epsilon\, \mathrm{fat}^*(L^p_H, c\epsilon)}\Big) \le O\Big(\mathrm{fat}^*(H, c\epsilon/p) \log^2(m) \log^2 \frac{|\hat{H}|}{\epsilon\, \mathrm{fat}^*(H, c\epsilon/p)}\Big), \quad (14)$
where $c \in (0, \infty)$ is a numerical constant, derived from the covering argument in Lemma A.3.

4. Compute the following variant of the Multiplicative Weights (MW) algorithm on the discretized set $\bar{S}_U$ for $T = O(\log |\bar{S}_U|)$ rounds. Let $d = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \epsilon/8p) \log^2 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2}\big)$, and let $\eta(z, y)$ be the $\eta(x, y)$ for which $z \in U(x)$, as defined in step 2. From Corollary A.1, taking $\delta = 1/3$ and $\beta = \epsilon/p$, we know that for any distribution $P$ on $\bar{S}_U$, upon receiving an i.i.d. sample $\bar{S}'$ from $P$ of size $d$, with probability $2/3$ over the sampling, any $h \in H$ with $|h(z) - y|^p \le \eta(z, y)$ for all $(z, y) \in \bar{S}'$ satisfies
$P_{(z, y) \sim P}\{(z, y) : |h(z) - y|^p \le \eta(z, y) + \epsilon\} \ge 1 - \epsilon.$
We can conclude that for any distribution $P$ on $\bar{S}_U$, there exists such a set of points $\bar{S}' \subseteq \bar{S}_U$.
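The minimal $d_\infty$-cover in step 3(b) exists by a counting argument; algorithmically, a single greedy pass already yields a (not necessarily minimal) $\epsilon$-cover. A small sketch over explicit dual vectors, illustration only:

```python
def greedy_sup_cover(vectors, eps):
    """Greedily pick a subset `cover` of `vectors` (tuples of equal length)
    such that every vector is within eps of some member of `cover` in the
    sup-norm. A simple stand-in for the minimal cover used in step 3(b)."""
    cover = []
    for v in vectors:
        if all(max(abs(a - b) for a, b in zip(v, c)) > eps for c in cover):
            cover.append(v)
    return cover

# Four dual vectors (loss profiles over a tiny class), covered at scale 0.1.
pts = [(0.0, 0.0), (0.05, 0.02), (0.5, 0.5), (0.52, 0.48)]
cover = greedy_sup_cover(pts, 0.1)
```

Greedy selection keeps centers pairwise more than $\epsilon$ apart, so its size is at most the $\epsilon$-packing number, which is within a factor of the covering number the lemma bounds.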
Given that set, we can find a function with the aforementioned property in $\hat{H}$. Let $S'$ be the $d$ points in $S$ from which the perturbed points in $\bar{S}'$ originated; that is, $S' \subseteq S$ with $\bar{S}' \subseteq \bigcup_{(x, y) \in S'} \{(z, y) : z \in U(x)\}$. Taking $\hat{h} = \eta\text{-RERM}_H(S', \eta_{S'}, p) \in \hat{H}$, it holds that $|\hat{h}(z) - y|^p \le \eta(z, y)$ for all $(z, y) \in \bar{S}'$, and as a result we get
$P_{(z, y) \sim P}\{(z, y) : |\hat{h}(z) - y|^p \le \eta(z, y) + \epsilon\} \ge 1 - \epsilon.$

Algorithm 4 Modified Multiplicative Weights
Input: $H$, $S$, $\bar{S}_U$. Parameters: $\epsilon$, $T$, $\eta_S = (\eta(x_1, y_1), \ldots, \eta(x_m, y_m))$ for $(x_i, y_i) \in S$.
Algorithms used: robust ERM for the $\eta$-ball robust loss: $\eta$-RERM$_H$ (Equation (5)).
Initialize $P_1 = \mathrm{Uniform}(\bar{S}_U)$, $d = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \epsilon/8p) \log^2 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2}\big)$; $\eta(z, y)$ is the $\eta(x, y)$ for which $z \in U(x)$.
For $t = 1, \ldots, T$:
Compute a strong base learner w.r.t. the distribution $P_t$ by finding $d$ points in $S$ and executing $\eta$-RERM$_H$ on them:
(a) Find $d$ points $\bar{S}_t \subseteq \bar{S}_U$ such that any $h \in H$ satisfying $|h(z) - y|^p \le \eta(z, y)$ for all $(z, y) \in \bar{S}_t$ also satisfies $\mathbb{E}_{(z, y) \sim P_t}\big[\mathbb{1}\{|h(z) - y|^p \le \eta(z, y) + \epsilon\}\big] \ge 1 - \epsilon$. (See the analysis for why this set exists.)
(b) Let $S_t$ be the $d$ points in $S$ from which $\bar{S}_t$ originated. Formally, $S_t \subseteq S$ with $\bar{S}_t \subseteq \bigcup_{(x, y) \in S_t} \{(z, y) : z \in U(x)\}$.
(c) Compute $\hat{h}_t = \eta\text{-RERM}_H(S_t, \eta_{S_t}, p)$.
Make a multiplicative weights update on $P_t$:
(d) For each $(z, y) \in \bar{S}_U$: $P_{t+1}(z, y) \propto P_t(z, y) \cdot e^{-\xi \cdot \mathbb{1}\{|\hat{h}_t(z) - y|^p \le \eta(z, y)\}}$.
Output: classifiers $\hat{h}_1, \ldots, \hat{h}_T$ and sets $S_1, \ldots, S_T$.

A uniformly 4ϵ-approximate adversarially robust sample compression scheme for S. The output of the algorithm is a sequence of functions $\hat{h}_1, \ldots, \hat{h}_T$ and the corresponding sets that encode them, $S_1, \ldots, S_T$, where we predict with the average of the returned hypotheses, $\frac{1}{T} \sum_{i=1}^T \hat{h}_i(\cdot)$. For $T = O(\log |\bar{S}_U|)$, we show that
$\forall (\bar{z}, \bar{y}) \in \bar{S}_U : \frac{1}{T} \sum_{i=1}^T |\hat{h}_i(\bar{z}) - \bar{y}|^p \le \eta(\bar{z}, \bar{y}) + 2\epsilon. \quad (15)$
For any distribution $P_t$ over $\bar{S}_U$, we have a base learner $\hat{h}$ satisfying $\mathbb{E}_{(\bar{z}, \bar{y}) \sim P_t}\big[\mathbb{1}\{|\hat{h}(\bar{z}) - \bar{y}|^p \le \eta(\bar{z}, \bar{y}) + \epsilon\}\big] \ge 1 - \epsilon$, due to Theorem 3.4.
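Step (d) is a standard multiplicative-weights reweighting: points the current base learner already handles within its cutoff are downweighted, concentrating later rounds on the hard points. A minimal sketch (the step size `xi` and the boolean `correct` vector are hypothetical stand-ins for the indicator above):

```python
import math

def mw_update(weights, correct, xi=0.5):
    """One multiplicative-weights step (sketch of Algorithm 4, step (d)):
    multiply each point's weight by exp(-xi) if the base learner was
    within the cutoff there (`correct[i]` is True), then renormalize."""
    new = [w * math.exp(-xi * int(c)) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four discretized points; the current base learner covers the first three.
P = mw_update([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After the update, the uncovered fourth point carries strictly more than its previous $1/4$ probability mass, which is what drives the $O(\log|\bar{S}_U|)$ bound on the number of rounds in the α-Boost analysis.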
Following the standard analysis of MW / α-Boost (see Schapire & Freund (2013, Section 6)), for any $(\bar{z}, \bar{y}) \in \bar{S}_U$, a $1 - \epsilon$ fraction of the base learners has an error within $\eta(\bar{z}, \bar{y}) + \epsilon$. The loss is bounded by 1, so the remaining $\epsilon$ fraction can add an error of at most $\epsilon$. The overall average loss of the base learners is therefore upper bounded by $\eta(\bar{z}, \bar{y}) + 2\epsilon$. Note that we can find these base learners in $\hat{H}$, as defined in step 3(a) of the main algorithm. Crucially, we use strong base learners in order to ensure a low empirical loss of the averaged base learners. From the covering argument (Equation (12)), we have
$\forall (z, y) \in S_U : \frac{1}{T} \sum_{i=1}^T |\hat{h}_i(z) - y|^p \le \eta(z, y) + 4\epsilon. \quad (16)$
Indeed, for any $(z, y) \in S_U$ there exists $(\bar{z}, \bar{y}) \in \bar{S}_U$ such that for any $h \in \hat{H}$,
$\big| |h(z) - y|^p - |h(\bar{z}) - \bar{y}|^p \big| \le \epsilon;$
in particular, this holds for $\{\hat{h}_1, \ldots, \hat{h}_T\} \subseteq \hat{H}$ and for $h^*$. Using the triangle inequality, we have
$\frac{1}{T} \sum_{i=1}^T |\hat{h}_i(z) - y|^p \le \frac{1}{T} \sum_{i=1}^T |\hat{h}_i(\bar{z}) - \bar{y}|^p + \epsilon, \quad (17)$
and
$\eta(\bar{z}, \bar{y}) = |h^*(\bar{z}) - \bar{y}|^p \le |h^*(z) - y|^p + \epsilon = \eta(z, y) + \epsilon. \quad (18)$
Combining Equations (15), (17) and (18), we get Equation (16). Finally, using the convexity of the $\ell_p$ loss,
$\Big| \frac{1}{T} \sum_{i=1}^T \hat{h}_i(z) - y \Big|^p \le \frac{1}{T} \sum_{i=1}^T |\hat{h}_i(z) - y|^p. \quad (19)$
From Equations (16) and (19) we conclude a uniformly $4\epsilon$-approximate adversarially robust sample compression scheme for $S$:
$\forall (z, y) \in S_U : \Big| \frac{1}{T} \sum_{i=1}^T \hat{h}_i(z) - y \Big|^p \le \eta(z, y) + 4\epsilon, \quad (20)$
which implies that
$\forall (x, y) \in S : \sup_{z \in U(x)} \Big| \frac{1}{T} \sum_{i=1}^T \hat{h}_i(z) - y \Big|^p \le \eta(x, y) + 4\epsilon.$
We summarize the compression size. We have $T = O(\log |\bar{S}_U|)$ predictors, where each one is representable by $d = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \epsilon/8p) \log^2 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2}\big)$ points.
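Equation (19) is Jensen's inequality for the convex map $t \mapsto |t - y|^p$. A quick randomized sanity check (illustration only):

```python
import random

random.seed(2)
p = 2
for _ in range(1000):
    preds = [random.random() for _ in range(7)]
    y = random.random()
    avg = sum(preds) / len(preds)
    # Convexity of t -> |t - y|**p (Jensen's inequality), as in Equation (19):
    # the loss of the averaged predictor never exceeds the averaged loss.
    assert abs(avg - y) ** p <= sum(abs(h - y) ** p for h in preds) / len(preds) + 1e-12
```

This is exactly why averaging suffices for the $\ell_p$ losses here, while the median is the right aggregation for the pointwise cutoff guarantees of Sections 4 and 5.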
By counting the number of predictors using Equation (14) and noting that $\log |\hat{H}| \le d \log m$, we have
$T = O\Big(\mathrm{fat}^*(H, c\epsilon/p)\, d^2 \log^2(m) \log^2 \frac{1}{\epsilon\, \mathrm{fat}^*(H, c\epsilon/p)}\Big).$
We therefore get a uniformly $4\epsilon$-approximate adversarially robust sample compression scheme for $S$ of size
$T \cdot d = O\Big(\mathrm{fat}^*(H, c\epsilon/p)\, d^3 \log^2(m) \log^2 \frac{1}{\epsilon\, \mathrm{fat}^*(H, c\epsilon/p)}\Big).$
By plugging in $d = O\big(\frac{1}{\epsilon}\, \mathrm{fat}(H, \epsilon/8p) \log^2 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2}\big)$, we have
$|\kappa(S)| = O\Big(\frac{1}{\epsilon^3}\, \mathrm{fat}^3(H, \epsilon/8p)\, \mathrm{fat}^*(H, c\epsilon/p) \log^6 \frac{p\, \mathrm{fat}(H, \epsilon/8p)}{\epsilon^2} \log^2(m) \log^2 \frac{1}{\epsilon\, \mathrm{fat}^*(H, c\epsilon/p)}\Big).$
Let $(\kappa, \rho)$ be the compression scheme and $|\kappa(S)|$ the compression size. Let $\widehat{\mathrm{Err}}_{\ell_p}(h; S)$ be the empirical loss of $h$ on $S$ with the $\ell_p$ robust loss. We can derive the error as follows:
$\mathrm{Err}_{\ell_p}(\rho(\kappa(S)); D) \overset{(i)}{\le} \widehat{\mathrm{Err}}_{\ell_p}(\rho(\kappa(S)); S) + O\Big(\sqrt{\tfrac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}}\Big) \overset{(ii)}{\le} \widehat{\mathrm{Err}}_{\ell_p}(h^*; S) + 4\epsilon + O\Big(\sqrt{\tfrac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}}\Big) \overset{(iii)}{\le} \mathrm{Err}_{\ell_p}(h^*; D) + 4\epsilon + O\Big(\sqrt{\tfrac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}}\Big),$
where (i) follows from the generalization bound for sample compression schemes in the agnostic case (Lemma A.6), (ii) follows from Equation (20), and (iii) follows from Hoeffding's inequality. Take $m$ sufficiently large such that
$O\Big(\sqrt{\frac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{m}}\Big) \le \epsilon.$
Re-scaling $\epsilon \to \epsilon/5$ and plugging in the compression size, we get a sample complexity of
$m = O\Big(\frac{|\kappa(S)| \log(m) + \log\frac{1}{\delta}}{\epsilon^2}\Big),$
where $|\kappa(S)|$ is upper bounded as above. We conclude a sample complexity of
$\tilde{O}\Big(\frac{1}{\epsilon^5}\, \mathrm{fat}^3(H, c\epsilon/p)\, \mathrm{fat}^*(H, c\epsilon/p)\Big)$
for some numerical constant $c \in (0, \infty)$, where $\tilde{O}$ suppresses logarithmic factors.

C. Proofs for Section 4: Better Sample Complexity for the ℓ1 Loss

Proof of Theorem 4.1. Fix $\epsilon, \delta \in (0, 1)$. Let $H \subseteq [0, 1]^X$. Fix a distribution $D$ over $X \times Y$, and let $S = \{(x_i, y_i)\}_{i=1}^m$ be an i.i.d.
sample from $D$. We elaborate on each one of the steps described in Algorithm 2.

1. Compute $h^* \in \ell_1\text{-RERM}_H(S)$ in order to get the set of cutoffs $\eta(x, y) = \sup_{z \in U(x)} |h^*(z) - y|$ for $(x, y) \in S$. Let $\eta_S = (\eta(x_1, y_1), \ldots, \eta(x_m, y_m))$. Our goal is to construct a predictor with an empirical robust loss of at most $\eta(x, y) + \epsilon$ for every $(x, y) \in S$, which means that our predictor is an approximate robust ERM.

2. Define the inflated training data set $S_U = \bigcup_{i \in [m]} \{(z, y_{I(z)}) : z \in U(x_i)\}$, where $I(z) = \min\{i \in [m] : z \in U(x_i)\}$. For $(z, y) \in S_U$, let $\eta(z, y)$ be the $\eta(x, y)$ for which $z \in U(x)$ and $y_{I(z)} = y$.

3. Discretize $S_U$ to a finite set $\bar{S}_U$ as follows.
(a) Define a set of functions, each defined by the $\eta$-RERM$_H$ optimizer on $d = O\big(\mathrm{fat}(H, \epsilon/8) \log^2 \frac{\mathrm{fat}(H, \epsilon/8)}{\epsilon^2}\big)$ points from $S$:
$\hat{H} = \{\eta\text{-RERM}_H(S', \eta_{S'}, 1) : S' \subseteq S, |S'| = d\}.$
Recall the definition of $\eta$-RERM$_H$ in Equation (5). To see what this definition of $\hat{H}$ serves for, see step 4 below. The cardinality of this class is bounded by $|\hat{H}| \le \binom{m}{d}$.
(b) The discretization $\bar{S}_U \subseteq S_U$ is defined by a covering of the dual class in the $d_\infty$ norm. Let $L^1_{\hat{H}}$ be the $L_1$ loss class of $\hat{H}$, namely,
$L^1_{\hat{H}} = \{(z, y) \mapsto |h(z) - y| : h \in \hat{H}\}.$
The dual class $(L^1_{\hat{H}})^* \subseteq [0, 1]^{\hat{H}}$ is defined as the set of all functions $f_{(z, y)} : \hat{H} \to [0, 1]$ such that $f_{(z, y)}(h) = |h(z) - y|$, for $(z, y) \in S_U$. Formally, $(L^1_{\hat{H}})^* = \{f_{(z, y)} : (z, y) \in S_U\}$, where $f_{(z, y)} = (f_{(z, y)}(h_1), \ldots, f_{(z, y)}(h_{|\hat{H}|}))$. We take $\bar{S}_U \subseteq S_U$ to be a minimal $\epsilon$-cover of $S_U$ in $d_\infty$:
$\sup_{(z, y) \in S_U} \inf_{(\bar{z}, \bar{y}) \in \bar{S}_U} \|f_{(z, y)} - f_{(\bar{z}, \bar{y})}\|_\infty \le \epsilon. \quad (23)$
Let $\mathrm{fat}^*(H, \epsilon)$ be the dual $\epsilon$-fat-shattering dimension of $H$. Using Lemma A.4, we can bound the dual fat-shattering dimension of the $L^1_H$ loss class by the fat-shattering dimension of $H$:
$\mathrm{fat}^*(L^1_H, \epsilon) \le \log^2(m)\, \mathrm{fat}^*(H, \epsilon).$
(24)

By applying a covering number argument from Lemma A.3 (taking $a = 1$) on the dual space, and upper bounding the dual fat-shattering dimension of the $L^1$ loss class as in Equation (24), we have the following bound:

$$|\tilde S_U| = N(\epsilon, S_U, d_\infty) \le \left(\frac{|\hat H|}{\epsilon\,\mathrm{fat}^*(L^1_{\hat H}, c\epsilon)}\right)^{O\big(\mathrm{fat}^*(L^1_{\hat H}, c\epsilon)\,\log\frac{|\hat H|}{\epsilon\,\mathrm{fat}^*(L^1_{\hat H}, c\epsilon)}\big)} \le \left(\frac{|\hat H|}{\epsilon\,\mathrm{fat}^*(H, c\epsilon)}\right)^{O\big(\mathrm{fat}^*(H, c\epsilon)\log_2(m)\,\log\frac{|\hat H|}{\epsilon\,\mathrm{fat}^*(H, c\epsilon)}\big)}, \qquad (25)$$

where $c \in (0, \infty)$ is a numerical constant derived from the covering argument in Lemma A.3.

4. Compute a modified version of the real-valued boosting algorithm MedBoost (Kégl, 2003; Hanneke et al., 2019) on the discretized set $\tilde S_U$. The output of the algorithm is a uniformly $\epsilon$-approximate sample compression scheme for the set $\tilde S_U$, obtained with $T \le \log|\tilde S_U|$ boosting rounds. Moreover, the weak learners are chosen from the set $\hat H$. Once we have these weak learners, the guarantee of the algorithm follows from Hanneke et al. (2019, Corollary 6). We should explain why we have a weak learner for any distribution over $\tilde S_U$.

The existence of weak learners in $\hat H$. Let $d = O\big(\mathrm{fat}(H, \epsilon/8)\log^2(\mathrm{fat}(H, \epsilon/8))/\epsilon^2\big)$ and let $\eta(z, y)$ be the $\eta(x, y)$ for which $z \in U(x)$, as defined in step 2. Taking $\delta = 1/3$, we know that for any distribution $P$ on $\tilde S_U$, upon receiving an i.i.d. sample $\tilde S'$ from $P$ of size $d$, with probability $2/3$ over the sampling of $\tilde S'$ from $P$, any $h \in H$ satisfying $\forall (z, y) \in \tilde S' : |h(z) - y| \le \eta(z, y)$ also satisfies

$$P_{(z,y) \sim P}\big( |h(z) - y| > \eta(z, y) + \epsilon \big) \le 1/3.$$

That is, such a function is an $(\epsilon, 1/6)$-weak learner for $P$ and $h^*$ (computed in step 1). We can conclude that for any distribution $P$ on $\tilde S_U$, there exists a set of points $\tilde S' \subseteq \tilde S_U$ of size $d$ that defines a weak learner for $P$ and $h^*$. Furthermore, we can find these weak learners in $\hat H$ as follows. Let $S'$ be the $d$ points in $S$ from which the perturbed points $\tilde S'$ originated, that is, $S' = \{(x, y) \in S : \exists (z, \tilde y) \in \tilde S',\ z \in U(x)\}$. Therefore, we can conclude that $\hat h = \eta\text{-RERM}_H(S', \eta_{S'})$ is a weak learner, and it can be found in $\hat H$. So we can think of $\hat H$ as a pool of weak learners for any possible distribution over the discretized set $\tilde S_U$.
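The discretization in step 3(b) admits a small illustration. Assuming, for the sketch, that each pair $(z, y)$ is represented by its finite dual loss vector $(|h(z) - y|)_{h \in \hat H}$, a greedy pass that keeps one representative per $\epsilon$-ball in the sup norm yields a valid (though not necessarily minimal) $\epsilon$-cover; the function names below are our own, not from the paper:

```python
def greedy_sup_cover(vectors, eps):
    """Greedy (not necessarily minimal) eps-cover in the d_infty metric:
    keep a vector as a new center only if it is eps-far from all kept centers.
    Each vector plays the role of a dual loss vector (|h(z)-y|)_{h in H_hat}."""
    def d_inf(u, v):
        return max(abs(a - b) for a, b in zip(u, v))
    centers = []
    for v in vectors:
        # v is covered by an earlier center iff some center is within eps in sup norm
        if all(d_inf(v, c) > eps for c in centers):
            centers.append(v)
    return centers
```

Since the kept centers form an $\epsilon$-packing, every input vector lies within $\epsilon$ of some center, mirroring Equation (23).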
Algorithm 5 Modified MedBoost

Input: $H$, $S$, $\tilde S_U$.
Parameters: $\epsilon$, $T$, $\eta_S = (\eta(x_1, y_1), \ldots, \eta(x_m, y_m))$ for $(x_i, y_i) \in S$.
Algorithms used: robust ERM for the $\eta$-ball robust loss, $\eta$-RERM$_H$ (Equation (5)).
Initialize $P_1 = \mathrm{Uniform}(\tilde S_U)$ and $d = O\big(\mathrm{fat}(H, \epsilon/8)\log^2(\mathrm{fat}(H, \epsilon/8))/\epsilon^2\big)$; $\eta(z, y)$ is the $\eta(x, y)$ for which $z \in U(x)$, as defined in step 2 of the main algorithm.
For $t = 1, \ldots, T$: compute a weak base learner w.r.t. the distribution $P_t$ by finding $d$ points in $S$ and executing $\eta$-RERM$_H$ on them.
(a) Find $d$ points $\tilde S'_t \subseteq \tilde S_U$ such that any $h \in H$ satisfying $\forall (z, y) \in \tilde S'_t : |h(z) - y| \le \eta(z, y)$ also satisfies

$$\mathbb{E}_{(z,y) \sim P_t}\Big[\mathbb{I}\big\{ |h(z) - y| > \eta(z, y) + \epsilon \big\}\Big] \le 1/3.$$

(See the analysis for why this set exists.)
(b) Let $S'_t$ be the $d$ points in $S$ that $\tilde S'_t$ originated from. Formally, $S'_t = \{(x, y) \in S : \exists (z, \tilde y) \in \tilde S'_t,\ z \in U(x)\}$.
(c) Compute $\hat h_t = \eta\text{-RERM}_H(S'_t, \eta_{S'_t}, 1)$. From steps (a) and (b), it follows that $\hat h_t$ is an $(\epsilon, 1/6)$-weak learner with respect to the distribution $P_t$ over $\tilde S_U$.
Update the weight of the weak learner in the ensemble and make a multiplicative weights update on $P_t$:
(d) For $i = 1, \ldots, n = |\tilde S_U|$:
 i. Set $w^{(t)}_i = 1 - 2\,\mathbb{I}\big\{ |\hat h_t(z_i) - y_i| > \eta(z_i, y_i) + \epsilon \big\}$.
 ii. Set

$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{(1 - 1/6)\sum_{i=1}^n P_t(z_i, y_i)\,\mathbb{I}\big[w^{(t)}_i = 1\big]}{(1 + 1/6)\sum_{i=1}^n P_t(z_i, y_i)\,\mathbb{I}\big[w^{(t)}_i = -1\big]}\right).$$

 iii. If $\alpha_t = \infty$: return $T$ copies of $\hat h_t$, $(\alpha_1 = 1, \ldots, \alpha_T = 1)$, and $S'_t$. Else:

$$P_{t+1}(z_i, y_i) = \frac{P_t(z_i, y_i)\exp(-\alpha_t w^{(t)}_i)}{\sum_{j=1}^n P_t(z_j, y_j)\exp(-\alpha_t w^{(t)}_j)}.$$

Output: hypotheses $\hat h_1, \ldots, \hat h_T$, coefficients $\alpha_1, \ldots, \alpha_T$, and sets $S'_1, \ldots, S'_T$.

A uniformly 3ϵ-approximate adversarially robust sample compression scheme for S. The output of MedBoost is a uniformly $\epsilon$-approximate sample compression scheme for the set $\tilde S_U$. We show that this yields a uniformly $3\epsilon$-approximate adversarially robust sample compression scheme for $S$, that is, a sample compression scheme for $S$ with respect to the robust loss. For $T \le \log|\tilde S_U|$ boosting rounds, it follows from Hanneke et al.
(2019, Corollary 6) that the output of the algorithm satisfies

$$\forall (\tilde z, \tilde y) \in \tilde S_U : \Big| \mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big) - \tilde y \Big| \le \eta(\tilde z, \tilde y) + \epsilon, \qquad (26)$$

where $\mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big)$ is the weighted median of $\hat h_1, \ldots, \hat h_T$ with weights $\alpha_1, \ldots, \alpha_T$. From the covering argument (Equation (23)), this implies that

$$\forall (z, y) \in S_U : \Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big| \le \eta(z, y) + 3\epsilon. \qquad (27)$$

Indeed, for any $(z, y) \in S_U$ there exists $(\tilde z, \tilde y) \in \tilde S_U$ such that for any $h \in H$, $\big|\, |h(z) - y| - |h(\tilde z) - \tilde y| \,\big| \le \epsilon$; in particular, this holds for $\{\hat h_1, \ldots, \hat h_T\} \subseteq \hat H$ and for $h^* \in H$. So

$$\begin{aligned}
\Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big|
&\overset{(a)}{=} \Big| \mathrm{Med}\big(\hat h_1(z) - y, \ldots, \hat h_T(z) - y; \alpha_1, \ldots, \alpha_T\big) \Big| \\
&\overset{(b)}{\le} \Big| \mathrm{Med}\big(\hat h_1(\tilde z) - \tilde y, \ldots, \hat h_T(\tilde z) - \tilde y; \alpha_1, \ldots, \alpha_T\big) \Big| + \epsilon \\
&\overset{(c)}{=} \Big| \mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big) - \tilde y \Big| + \epsilon \\
&\overset{(d)}{\le} |h^*(\tilde z) - \tilde y| + 2\epsilon \\
&\overset{(e)}{\le} |h^*(z) - y| + 3\epsilon \\
&\overset{(f)}{\le} \eta(z, y) + 3\epsilon, \qquad (28)
\end{aligned}$$

where (a) and (c) follow since the weighted median is translation invariant, (b) and (e) follow from the covering argument, (d) holds since the function returned by MedBoost is a uniformly $\epsilon$-approximate sample compression for $\tilde S_U$, and (f) follows from the definition of $\eta$ in step 2 of the algorithm. Finally, from Equation (28) we conclude a uniformly $3\epsilon$-approximate adversarially robust sample compression scheme for $S$:

$$\forall (x, y) \in S : \sup_{z \in U(x)} \Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big| \le \eta(x, y) + 3\epsilon. \qquad (29)$$

We summarize the compression size. We have $T = O\big(\log|\tilde S_U|\big)$ predictors, where each one is representable by $d = O\big(\mathrm{fat}(H, \epsilon/8)\log^2(\mathrm{fat}(H, \epsilon/8))/\epsilon^2\big)$ points. By counting the number of predictors using Equation (25), and using $\log_2|\hat H| \le d\log_2(m)$, we get

$$T = O\big(\log|\tilde S_U|\big) = O\!\left(\mathrm{fat}^*(H, c\epsilon)\log_2(m)\left(\log^2\frac{\mathrm{fat}^*(H, c\epsilon)}{\epsilon} + d^2\log^2(m)\right)\right).$$

All together we have a compression of size $O\big(d \log|\tilde S_U|\big)$, which is already sufficient for deriving generalization.
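As an illustration of the boosting loop in Algorithm 5, here is a minimal, self-contained sketch. It replaces the $\eta$-RERM weak-learner oracle with a brute-force search over a small hypothetical pool of candidate functions, and uses the $w^{(t)}_i$ and $\alpha_t$ updates in the form reconstructed above; all names and the toy pool are illustrative, not from the paper:

```python
import math

def weighted_median(values, weights):
    """Weighted median: smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half, acc = sum(weights) / 2.0, 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]

def med_boost(zs, ys, eta, eps, pool, T):
    """Sketch of the MedBoost loop: multiplicative-weights boosting of pool
    members until the weighted median is within eta_i + eps on every point."""
    n = len(zs)
    P = [1.0 / n] * n                       # P_1 = Uniform
    hs, alphas = [], []
    for _ in range(T):
        # weak learner: pool member with the least P_t-mass of (eta+eps)-mistakes
        # (brute-force stand-in for the eta-RERM oracle)
        h = min(pool, key=lambda f: sum(P[i] for i in range(n)
                                        if abs(f(zs[i]) - ys[i]) > eta[i] + eps))
        w = [-1.0 if abs(h(zs[i]) - ys[i]) > eta[i] + eps else 1.0 for i in range(n)]
        good = sum(P[i] for i in range(n) if w[i] > 0)
        bad = 1.0 - good
        if bad <= 0.0:                      # alpha_t = infinity: h alone suffices
            return [h] * T, [1.0] * T
        alphas.append(0.5 * math.log(((1 - 1 / 6) * good) / ((1 + 1 / 6) * bad)))
        hs.append(h)
        scale = [P[i] * math.exp(-alphas[-1] * w[i]) for i in range(n)]
        Z = sum(scale)
        P = [s / Z for s in scale]          # multiplicative weights update on P_t
    return hs, alphas
```

The returned ensemble is used through `weighted_median` at prediction time, mirroring the Med$(\cdot\,; \alpha_1, \ldots, \alpha_T)$ operator.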
We can further reduce the number of predictors to be independent of the sample size, thereby reducing the sample compression size and improving the sample complexity.

5. We follow the sparsification method suggested by Hanneke et al. (2019). The idea is that by sampling functions from the ensemble, we can guarantee, via uniform convergence in the dual space, that with high probability roughly $\mathrm{fat}^*(H, \epsilon)$ predictors suffice. For $\alpha_1, \ldots, \alpha_T \in [0, 1]$ with $\sum_{t=1}^T \alpha_t = 1$, we denote by $\mathrm{Cat}(\alpha_1, \ldots, \alpha_T)$ the categorical distribution, which is a discrete distribution on the set $[T]$ with probability $\alpha_t$ on each $t \in [T]$. The inputs to the algorithm are $\tau(x, y) = \eta(x, y) + 4\epsilon$ and $k = O\big(\mathrm{fat}^*(H, c\epsilon)\log^2(\mathrm{fat}^*(H, c\epsilon))/\epsilon^2\big)$, where $c \in (0, \infty)$ is a numerical constant.

Algorithm 6 Sparsify

Input: hypotheses $\hat h_1, \ldots, \hat h_T$, coefficients $\alpha_1, \ldots, \alpha_T$, $S = \{(x_i, y_i)\}_{i=1}^m$.
Parameters: $\epsilon$, $k$, $\tau = (\tau(x_1, y_1), \ldots, \tau(x_m, y_m))$.
(a) Let $\alpha'_t = \alpha_t / \sum_{s=1}^T \alpha_s$.
(b) Repeat:
 i. Sample $(J_1, \ldots, J_k) \sim \mathrm{Cat}(\alpha'_1, \ldots, \alpha'_T)^k$.
 ii. Let $F = \{\hat h_{J_1}, \ldots, \hat h_{J_k}\}$.
 iii. Until $\forall (x, y) \in S : \big|\big\{ f \in F : \sup_{z \in U(x)} |f(z) - y| > \tau(x, y) \big\}\big| < k/2$.
Output: hypotheses $\hat h_{J_1}, \ldots, \hat h_{J_k}$.

The sparsification method returns, with high probability, functions $\{f_1, \ldots, f_k\}$ such that

$$\forall (x, y) \in S : \sup_{z \in U(x)} \Big| \mathrm{Med}\big(f_1(z), \ldots, f_k(z)\big) - y \Big| \le \eta(x, y) + 4\epsilon. \qquad (30)$$

We get a uniformly $4\epsilon$-approximate adversarially robust sample compression scheme for $S$, where we have $O\big(\mathrm{fat}^*(H, c\epsilon)\log^2(\mathrm{fat}^*(H, c\epsilon))/\epsilon^2\big)$ functions, and each function is representable by $O\big(\mathrm{fat}(H, \epsilon/8)\log^2(\mathrm{fat}(H, \epsilon/8))/\epsilon^2\big)$ points. Therefore, the compression set size is

$$O\!\left(\frac{1}{\epsilon^4}\,\mathrm{fat}(H, \epsilon/8)\,\mathrm{fat}^*(H, c\epsilon)\,\log^2\big(\mathrm{fat}(H, \epsilon/8)\big)\log^2\big(\mathrm{fat}^*(H, c\epsilon)\big)\right).$$

Let $(\kappa, \rho)$ be the compression scheme and $|\kappa(S)|$ the compression size. Let $\widehat{\mathrm{Err}}_{\ell_1}(h; S)$ be the empirical loss of $h$ on $S$ with the $\ell_1$ robust loss.
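The rejection-sampling loop of Algorithm 6 can be sketched as follows, assuming for illustration a finite perturbation set so the supremum over $U(x)$ is a maximum; the names are illustrative:

```python
import random

def sparsify(hs, alphas, xs, ys, tau, k, perturb, seed=0):
    """Sketch of Algorithm 6: rejection-sample k ensemble members until, on every
    training point, fewer than k/2 of them exceed the robust cutoff tau.
    `perturb(x)` returns a finite stand-in for the perturbation set U(x)."""
    rng = random.Random(seed)
    total = sum(alphas)
    probs = [a / total for a in alphas]     # alpha'_t = alpha_t / sum_s alpha_s
    while True:
        F = rng.choices(hs, weights=probs, k=k)
        # accept iff, for each point, a strict majority respects the cutoff
        if all(sum(1 for f in F
                   if max(abs(f(z) - y) for z in perturb(x)) > t) < k / 2
               for x, y, t in zip(xs, ys, tau)):
            return F
```

Because fewer than $k/2$ sampled functions violate the cutoff on any point, the (weighted) median of the sampled predictions respects it, which is exactly what Equation (30) records.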
We can derive the error as follows:

$$\begin{aligned}
\mathrm{Err}_{\ell_1}(\rho(\kappa(S)); D) &\overset{(i)}{\le} \widehat{\mathrm{Err}}_{\ell_1}(\rho(\kappa(S)); S) + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right) \\
&\overset{(ii)}{\le} \widehat{\mathrm{Err}}_{\ell_1}(h^*; S) + 4\epsilon + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right) \\
&\overset{(iii)}{\le} \mathrm{Err}_{\ell_1}(h^*; D) + 4\epsilon + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right),
\end{aligned}$$

where (i) follows from a generalization bound for sample compression schemes in the agnostic case (see Lemma A.6), (ii) follows from Equation (30), and (iii) follows from Hoeffding's inequality. Take $m$ sufficiently large such that

$$\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}} \le \epsilon.$$

Re-scaling $\epsilon \mapsto \epsilon/5$ and plugging in the compression size, we get a sample complexity of

$$\tilde O\!\left(\frac{1}{\epsilon^2}\left(\frac{\mathrm{fat}(H, c\epsilon)\,\mathrm{fat}^*(H, c\epsilon)}{\epsilon^4}\log^2\big(\mathrm{fat}(H, c\epsilon)\big)\log^2\big(\mathrm{fat}^*(H, c\epsilon)\big) + \log\frac{1}{\delta}\right)\right)$$

for some numerical constant $c \in (0, \infty)$.

D. Proofs for Section 5: Robust (η, β)-Regression

D.1. Realizable

Proof of Theorem 5.1. Fix $\epsilon, \delta, \beta \in (0, 1)$ and $\eta \in [0, 1]$. Let $H \subseteq [0, 1]^X$. Fix a distribution $D$ over $X \times Y$, and let $S = \{(x_i, y_i)\}_{i=1}^m$ be an i.i.d. sample from $D$. We elaborate on each of the steps described in Algorithm 3.

1. Define the inflated training data set

$$S_U = \bigcup_{i \in [m]} \big\{ (z, y_{I(z)}) : z \in U(x_i) \big\}, \quad \text{where } I(z) = \min\{i \in [m] : z \in U(x_i)\}.$$

2. Discretize $S_U$ to a finite set $\tilde S_U$ as follows.

(a) Define a set of functions, where each function is robustly accurate on $d = O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$ points in $S$, with respect to the $\eta$-ball robust loss:

$$\hat H = \big\{\eta\text{-RERM}_H(S') : S' \subseteq S,\ |S'| = d\big\}.$$

Recall the definition of $\eta$-RERM$_H$ in Equation (5). To see what purpose this definition of $\hat H$ serves, see step 3 below. The cardinality of this class is bounded by $|\hat H| \le \binom{m}{d} \le m^d$.

(b) A discretization $\tilde S_U \subseteq S_U$ will be defined by a cover of the dual class in the $d_\infty$ norm. Let $L^1_{\hat H}$ be the $L_1$ loss class of $\hat H$, namely,

$$L^1_{\hat H} = \big\{ Z \times Y \ni (z, y) \mapsto |h(z) - y| : h \in \hat H \big\}.$$

The dual class of $L^1_{\hat H}$, denoted $L^{1*}_{\hat H} \subseteq [0, 1]^{\hat H}$, is defined as the set of all functions $f_{(z,y)} : \hat H \to [0, 1]$ such that $f_{(z,y)}(h) = |h(z) - y|$, for $(z, y) \in S_U$. Formally, $L^{1*}_{\hat H} = \{ f_{(z,y)} : (z, y) \in S_U \}$, where $f_{(z,y)} = \big(f_{(z,y)}(h_1), \ldots, f_{(z,y)}(h_{|\hat H|})\big)$.
We take $\tilde S_U \subseteq S_U$ to be a minimal $\beta$-cover for $S_U$ in $d_\infty$:

$$\sup_{(z,y) \in S_U} \ \inf_{(\tilde z, \tilde y) \in \tilde S_U} \big\| f_{(z,y)} - f_{(\tilde z, \tilde y)} \big\|_\infty \le \beta. \qquad (32)$$

Let $\mathrm{fat}^*(H, \beta)$ be the dual $\beta$-fat-shattering dimension of $H$. Using Lemma A.4, we can bound the dual fat-shattering dimension of the $L^1_H$ loss class by the fat-shattering dimension of $H$:

$$\mathrm{fat}^*\big(L^1_H, \beta\big) \le \log_2(m)\,\mathrm{fat}^*(H, \beta). \qquad (33)$$

By applying a covering number argument from Lemma A.3 (taking $a = 1$) on the dual space, and upper bounding the dual fat-shattering dimension of the $L^1$ loss class as in Equation (33), we have the following bound:

$$|\tilde S_U| = N(\beta, S_U, d_\infty) \le \left(\frac{|\hat H|}{\beta\,\mathrm{fat}^*(L^1_{\hat H}, c\beta)}\right)^{O\big(\mathrm{fat}^*(L^1_{\hat H}, c\beta)\,\log\frac{|\hat H|}{\beta\,\mathrm{fat}^*(L^1_{\hat H}, c\beta)}\big)} \le \left(\frac{|\hat H|}{\beta\,\mathrm{fat}^*(H, c\beta)}\right)^{O\big(\mathrm{fat}^*(H, c\beta)\log_2(m)\,\log\frac{|\hat H|}{\beta\,\mathrm{fat}^*(H, c\beta)}\big)},$$

where $c \in (0, \infty)$ is a numerical constant derived from the covering argument in Lemma A.3.

3. Compute a modified version of the real-valued boosting algorithm MedBoost (Algorithm 5) on the discretized set $\tilde S_U$. The inputs to the algorithm are as follows: set $\epsilon = \beta$, $\eta_S = (\eta, \ldots, \eta)$, and $T \le \log|\tilde S_U|$ rounds of boosting. The output of the algorithm is a uniformly $\epsilon$-approximate sample compression scheme for the set $\tilde S_U$. Moreover, the weak learners are chosen from the set $\hat H$. Once we have these weak learners, the guarantee of the algorithm follows from Hanneke et al. (2019, Corollary 6). We should explain why we have a weak learner for any distribution over $\tilde S_U$.

The existence of weak learners in $\hat H$. From Theorem A.2, taking $\epsilon = \delta = 1/3$, we know that for any distribution $P$ on $\tilde S_U$, upon receiving an i.i.d. sample $\tilde S'$ from $P$ of size $O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$, with probability $2/3$ over the sampling of $\tilde S'$ from $P$, any $h \in H$ with $\forall (z, y) \in \tilde S' : |h(z) - y| \le \eta$ satisfies

$$P_{(z,y) \sim P}\big( |h(z) - y| > \eta + \beta \big) \le 1/3.$$

That is, such a function is an $(\eta + \beta, 1/6)$-weak learner for $P$ (see Definition 4.2). We can conclude that for any distribution $P$ on $\tilde S_U$, there exists a set of points $\tilde S' \subseteq \tilde S_U$ of size $O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$ that defines a weak learner for $P$.
Moreover, we can find these weak learners in $\hat H$ as follows. Let $S'$ be the $O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$ points in $S$ from which the perturbed points $\tilde S'$ originated. That is, $S' = \{(x, y) \in S : \exists (z, \tilde y) \in \tilde S',\ z \in U(x)\}$. Therefore, we can conclude that $\hat h = \eta\text{-RERM}_H(S')$ is a weak learner, and it can be found in $\hat H$. So we can think of $\hat H$ as a pool of weak learners for any possible distribution over the discretized set $\tilde S_U$.

A uniformly 3β-approximate adversarially robust sample compression scheme for S. The output of MedBoost is a uniformly $\beta$-approximate sample compression scheme for the set $\tilde S_U$. We show that this yields a uniformly $3\beta$-approximate adversarially robust sample compression scheme for $S$, that is, a sample compression scheme for $S$ with respect to the robust loss. For $T \le \log|\tilde S_U|$ boosting rounds, it follows from Hanneke et al. (2019, Corollary 6) that the output of the algorithm satisfies

$$\forall (\tilde z, \tilde y) \in \tilde S_U : \Big| \mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big) - \tilde y \Big| \le \eta(\tilde z, \tilde y) + \beta, \qquad (35)$$

where $\mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big)$ is the weighted median of $\hat h_1, \ldots, \hat h_T$ with weights $\alpha_1, \ldots, \alpha_T$ (here $\eta(z, y) = \eta$ for every point). From the covering argument (Equation (32)), this implies that

$$\forall (z, y) \in S_U : \Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big| \le \eta(z, y) + 3\beta. \qquad (36)$$

Indeed, for any $(z, y) \in S_U$ there exists $(\tilde z, \tilde y) \in \tilde S_U$ such that for any $h \in H$, $\big|\, |h(z) - y| - |h(\tilde z) - \tilde y| \,\big| \le \beta$; in particular, this holds for $\{\hat h_1, \ldots, \hat h_T\} \subseteq \hat H$ and for $h^* \in H$. So

$$\begin{aligned}
\Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big|
&\overset{(a)}{=} \Big| \mathrm{Med}\big(\hat h_1(z) - y, \ldots, \hat h_T(z) - y; \alpha_1, \ldots, \alpha_T\big) \Big| \\
&\overset{(b)}{\le} \Big| \mathrm{Med}\big(\hat h_1(\tilde z) - \tilde y, \ldots, \hat h_T(\tilde z) - \tilde y; \alpha_1, \ldots, \alpha_T\big) \Big| + \beta \\
&\overset{(c)}{=} \Big| \mathrm{Med}\big(\hat h_1(\tilde z), \ldots, \hat h_T(\tilde z); \alpha_1, \ldots, \alpha_T\big) - \tilde y \Big| + \beta \\
&\overset{(d)}{\le} |h^*(\tilde z) - \tilde y| + 2\beta \\
&\overset{(e)}{\le} |h^*(z) - y| + 3\beta \\
&\overset{(f)}{\le} \eta(z, y) + 3\beta,
\end{aligned}$$

where (a) and (c) follow since the weighted median is translation invariant, (b) and (e) follow from the covering argument, (d) holds since the function returned by MedBoost is a uniformly $\beta$-approximate sample compression for $\tilde S_U$, and (f) follows from the definition in step 2 of the algorithm.
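Steps (a)+(c) rely on translation invariance of the weighted median, and steps (b)+(e) on the fact that moving every input value by at most $\beta$ moves the weighted median by at most $\beta$ (monotonicity of the weighted quantile). Both properties can be sanity-checked numerically with a minimal sketch; the implementation of the weighted median below is our own:

```python
import random

def weighted_median(values, weights):
    """Weighted median: smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half, acc = sum(weights) / 2.0, 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]

rng = random.Random(0)
for _ in range(200):
    vals = [rng.random() for _ in range(5)]
    wts = [rng.random() + 0.1 for _ in range(5)]
    c = rng.random()
    # steps (a)+(c): translation invariance, Med(v - c) = Med(v) - c
    assert abs(weighted_median([v - c for v in vals], wts)
               - (weighted_median(vals, wts) - c)) < 1e-12
    # steps (b)+(e): perturbing every value by at most beta moves the median by at most beta
    beta = 0.05
    pert = [v + rng.uniform(-beta, beta) for v in vals]
    assert abs(weighted_median(pert, wts) - weighted_median(vals, wts)) <= beta + 1e-12
```

The second check is exactly the covering step: when every $|\hat h_t(z) - y|$ is within $\beta$ of $|\hat h_t(\tilde z) - \tilde y|$, the weighted medians differ by at most $\beta$.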
Finally, from Equation (36) we conclude a uniformly $3\beta$-approximate adversarially robust sample compression scheme for $S$:

$$\forall (x, y) \in S : \sup_{z \in U(x)} \Big| \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big) - y \Big| \le \eta(x, y) + 3\beta. \qquad (37)$$

We summarize the compression size. We have $T = O\big(\log|\tilde S_U|\big)$ predictors, where each one is representable by $d = O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$ points. By counting the number of predictors using the bound on $|\tilde S_U|$ above, and using $\log_2|\hat H| \le d\log_2(m)$, we get

$$T = O\big(\log|\tilde S_U|\big) = O\!\left(\mathrm{fat}^*(H, c\beta)\log_2(m)\left(\log^2\frac{\mathrm{fat}^*(H, c\beta)}{\beta} + d^2\log^2(m)\right)\right).$$

All together we have a compression of size $O\big(d \log|\tilde S_U|\big)$, which is already sufficient for deriving generalization. We can further reduce the number of predictors to be independent of the sample size, thereby reducing the sample compression size and improving the sample complexity.

4. Compute the sparsification method (Algorithm 6). The idea is that by sampling functions from the ensemble, we can guarantee, via uniform convergence in the dual space, that with high probability roughly $\mathrm{fat}^*(H, \beta)$ predictors suffice. Applying Hanneke et al. (2019, Theorem 10) with the parameters $\tau = \eta + 4\beta$ and $k = O\big(\mathrm{fat}^*(H, c\beta)\log^2(\mathrm{fat}^*(H, c\beta))/\beta^2\big)$, where $c \in (0, \infty)$ is a numerical constant, the sparsification method returns, with high probability, functions $\{f_1, \ldots, f_k\}$ such that

$$\forall (x, y) \in S : \sup_{z \in U(x)} \Big| \mathrm{Med}\big(f_1(z), \ldots, f_k(z)\big) - y \Big| \le \eta + 4\beta.$$

We get a uniformly $4\beta$-approximate adversarially robust sample compression scheme for $S$, where we have $O\big(\mathrm{fat}^*(H, c\beta)\log^2(\mathrm{fat}^*(H, c\beta))/\beta^2\big)$ functions, and each function is representable by $O\big(\mathrm{fat}(H, \beta/8)\log^2(\mathrm{fat}(H, \beta/8))/\beta^2\big)$ points. Therefore, the compression set size is

$$O\!\left(\frac{1}{\beta^4}\,\mathrm{fat}(H, \beta/8)\,\mathrm{fat}^*(H, c\beta)\,\log^2\big(\mathrm{fat}(H, \beta/8)\big)\log^2\big(\mathrm{fat}^*(H, c\beta)\big)\right).$$

Let $(\kappa, \rho)$ be the compression scheme and $|\kappa(S)|$ the compression size.
Let $\widehat{\mathrm{Err}}_{\ell_1}(h; S)$ be the empirical loss of $h$ on $S$ with the $\ell_1$ robust loss. We can derive the error as follows:

$$\begin{aligned}
\mathrm{Err}_{\ell_1}(\rho(\kappa(S)); D) &\overset{(i)}{\le} \widehat{\mathrm{Err}}_{\ell_1}(\rho(\kappa(S)); S) + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right) \\
&\overset{(ii)}{\le} \widehat{\mathrm{Err}}_{\ell_1}(h^*; S) + 4\beta + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right) \\
&\overset{(iii)}{\le} \mathrm{Err}_{\ell_1}(h^*; D) + 4\beta + O\!\left(\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}}\right),
\end{aligned}$$

where (i) follows from a generalization bound for sample compression schemes in the agnostic case (see Lemma A.6), (ii) follows from the $4\beta$-approximate compression guarantee above, and (iii) follows from Hoeffding's inequality. Take $m$ sufficiently large such that

$$\sqrt{\frac{|\kappa(S)|\log(m) + \log\frac{1}{\delta}}{m}} \le \beta.$$

Re-scaling $\beta \mapsto \beta/5$ and plugging in the compression size, we get a sample complexity of

$$\tilde O\!\left(\frac{1}{\beta^2}\left(\frac{\mathrm{fat}(H, c\beta)\,\mathrm{fat}^*(H, c\beta)}{\beta^4}\log^2\big(\mathrm{fat}(H, c\beta)\big)\log^2\big(\mathrm{fat}^*(H, c\beta)\big) + \log\frac{1}{\delta}\right)\right)$$

for some numerical constant $c \in (0, \infty)$.

D.2. Agnostic

Proof of Theorem 5.2. The construction follows a reduction to the realizable case, similar to David et al. (2016), which handles the non-robust zero-one loss. Moreover, we use a margin-based analysis of the MedBoost algorithm (see Kégl (2003, Theorem 1)) and overcome some technical challenges. Denote by $\Lambda_{RE} = \Lambda_{RE}(\eta, 1/3, 1/3, H, U, \ell^\eta_U)$ the sample complexity of Robust $(\eta, \beta)$-regression for a class $H$ with respect to a perturbation function $U$, taking $\epsilon = \delta = 1/3$. Using a robust ERM, find a maximal subset $S' \subseteq S$ with zero empirical robust loss (for the $\eta$-ball loss), that is, $\inf_{h \in H} \widehat{\mathrm{Err}}_\eta(h; S') = 0$. Now, $\Lambda_{RE}$ samples suffice for weak robust learning under any distribution on $S'$. Compute MedBoost on $S'$, with $T \le \log(|S'|)$ boosting rounds, where each weak robust learner is trained on $\Lambda_{RE}$ samples. The returned weighted median $\hat h = \mathrm{Med}\big(\hat h_1(z), \ldots, \hat h_T(z); \alpha_1, \ldots, \alpha_T\big)$ satisfies $\widehat{\mathrm{Err}}_\eta\big(\hat h; S'\big) = 0$, and each hypothesis $\hat h_t \in \{\hat h_1, \ldots, \hat h_T\}$ is representable by a set of size $O(\Lambda_{RE})$. This defines a compression scheme of size $\Lambda_{RE} T$, and $\hat h$ can be reconstructed from a compression set of points from $S$ of size $\Lambda_{RE} T$.
Recall that $S' \subseteq S$ is a maximal subset satisfying $\inf_{h \in H} \widehat{\mathrm{Err}}_\eta(h; S') = 0$, which implies that $\widehat{\mathrm{Err}}_\eta(\hat h; S) \le \inf_{h \in H} \widehat{\mathrm{Err}}_\eta(h; S)$. Plugging this into the agnostic sample compression bound of Lemma A.6, we have a sample complexity of $O\big(\Lambda_{RE} T / \epsilon^2\big)$, which translates into $\tilde O\big(\mathrm{fat}(H, c\eta)\,\mathrm{fat}^*(H, c\eta)/\epsilon^2\big)$ for some numerical constant $c \in (0, \infty)$.

D.3. Naive Approach with a Fixed Cutoff

An agnostic learner for Robust $(\eta, \beta)$-regression does not directly apply to the robust regression setting: the optimal function in $H$ has different scales of robustness at different points. We show that by using a fixed cutoff for all points we can obtain an error of $O\big(\sqrt{\mathrm{OPT}_H}\big) + \epsilon$.

Theorem D.1. For any $H \subseteq [0, 1]^X$ with finite $\gamma$-fat-shattering dimension for all $\gamma > 0$, any $U : X \to 2^X$, and any $\epsilon, \delta \in (0, 1)$, $\eta \in [0, 1]$, for some numerical constant $c \in (0, \infty)$: with probability $1 - \delta$, Algorithm 7 outputs a function with error at most $O\big(\sqrt{\mathrm{OPT}_H}\big) + \epsilon$ for the $\ell_{1, U(\cdot)}$ robust loss, using a sample of size $\tilde O\big(\mathrm{fat}(H, c\epsilon)\,\mathrm{fat}^*(H, c\epsilon)/\epsilon^2\big)$.

Recall that $\mathrm{fat}^*(F, \epsilon) \le \frac{1}{\epsilon} 2^{\mathrm{fat}(F, \epsilon/2) + 1}$ by Equation (7).

Algorithm 7

Input: $H \subseteq [0, 1]^X$, $S = \{(x_i, y_i)\}_{i=1}^m$, holdout set $S'' = \{(x'_i, y'_i)\}_{i=1}^n$.
Algorithms used: agnostic learner for Robust $(\eta, \beta)$-regression (see Theorem 5.2): Agnostic-$\eta$-Regressor.
1. Define a grid $\Theta = \{\frac{1}{m}, \frac{2}{m}, \ldots, 1\}$.
2. Define $H_\Theta = \{h_\theta = \text{Agnostic-}\theta\text{-Regressor}(S) : \theta \in \Theta\}$.
3. Find an optimal function on the holdout set:

$$\hat h_{\hat\theta} = \operatorname*{argmin}_{h_\theta \in H_\Theta} \frac{1}{n}\sum_{(x, y) \in S''} \sup_{z \in U(x)} |h_\theta(z) - y|.$$

Output: $\hat h_{\hat\theta}$.

Proof of Theorem D.1. Let

$$\mathrm{OPT}_H = \inf_{h \in H} \mathbb{E}_{(x,y) \sim D}\Big[\sup_{z \in U(x)} |h(z) - y|\Big],$$

which is obtained by $h^* \in H$. By Markov's inequality we have

$$P\Big(\sup_{z \in U(x)} |h^*(z) - y| > \eta\Big) \le \frac{\mathbb{E}_{(x,y) \sim D}\big[\sup_{z \in U(x)} |h^*(z) - y|\big]}{\eta}.$$

Taking $\eta = \sqrt{\mathrm{OPT}_H}$,

$$P\Big(\sup_{z \in U(x)} |h^*(z) - y| > \sqrt{\mathrm{OPT}_H}\Big) \le \sqrt{\mathrm{OPT}_H}.$$

This means that we can apply the algorithm for agnostic robust uniform $\eta$-regression with $\eta = \sqrt{\mathrm{OPT}_H}$, and obtain an error of $O\big(\sqrt{\mathrm{OPT}_H}\big) + \epsilon$. The problem is that $\mathrm{OPT}_H$ is not known in advance. To overcome this issue, we run a grid search over the scale of $\eta$, and then verify our choice using a holdout training set.
We define a grid $\Theta = \{\frac{1}{m}, \frac{2}{m}, \ldots, 1\}$, such that one of its elements satisfies $\sqrt{\mathrm{OPT}_H} < \hat\theta < 2\sqrt{\mathrm{OPT}_H}$. For each element in the grid, we compute the agnostic regressor for the $\eta$-robust loss. That is, we define $H_\Theta = \{h_\theta = \text{Agnostic-}\theta\text{-Regressor}(S) : \theta \in \Theta\}$. We choose the optimal function on a holdout labeled set $S''$ of size $O\big(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\big)$:

$$\hat h_{\hat\theta} = \operatorname*{argmin}_{h_\theta \in H_\Theta} \frac{1}{n}\sum_{(x, y) \in S''} \sup_{z \in U(x)} |h_\theta(z) - y|.$$

With high probability, the algorithm outputs a function with error at most $O\big(\sqrt{\mathrm{OPT}_H}\big) + \epsilon$ for the $\ell_1$ robust loss, using a sample of size $\tilde O\big(\mathrm{fat}(H, c\epsilon)\,\mathrm{fat}^*(H, c\epsilon)/\epsilon^2\big)$.
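The grid search with holdout selection in Algorithm 7 can be sketched as follows, with a hypothetical `learner` standing in for Agnostic-$\theta$-Regressor and a finite stand-in for the perturbation set $U(x)$; all names are illustrative:

```python
def select_cutoff(train, holdout, learner, perturb, m):
    """Sketch of Algorithm 7: train one robust theta-regressor per grid point
    theta in {1/m, ..., 1} and keep the hypothesis with the smallest empirical
    robust ell_1 loss on the holdout set. `learner` is a hypothetical stand-in
    for Agnostic-theta-Regressor; `perturb(x)` is a finite stand-in for U(x)."""
    best, best_loss = None, float("inf")
    for i in range(m):
        theta = (i + 1) / m
        h = learner(train, theta)
        # empirical robust ell_1 loss of h_theta on the holdout set
        loss = sum(max(abs(h(z) - y) for z in perturb(x))
                   for x, y in holdout) / len(holdout)
        if loss < best_loss:
            best, best_loss = h, loss
    return best, best_loss
```

For illustration, a toy learner that simply predicts the constant $\theta$ makes the grid search recover the grid element closest to the labels, since only the holdout loss drives the selection.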