Distribution Learnability and Robustness

Shai Ben-David (University of Waterloo, Vector Institute) shai@uwaterloo.ca
Alex Bie (University of Waterloo) yabie@uwaterloo.ca
Gautam Kamath (University of Waterloo, Vector Institute) g@csail.mit.edu
Tosca Lechner (University of Waterloo) tlechner@uwaterloo.ca

Abstract

We examine the relationship between learnability and robust (or agnostic) learnability for the problem of distribution learning. We show that learnability of a distribution class implies robust learnability with only additive corruption, but not if there may be subtractive corruption. Thus, contrary to other learning settings (e.g., PAC learning of function classes), realizable learnability does not imply agnostic learnability. We also explore related implications in the context of compression schemes and differentially private learnability.

1 Introduction

Distribution learning (sometimes called density estimation) refers to the following statistical task: given i.i.d. samples from some (unknown) distribution p, produce an estimate of p. This is one of the most fundamental and well-studied questions in both statistics [DL01] and computer science [Dia16], often equivalent to classic problems of parameter estimation (e.g., mean estimation) in parametric settings. It is easy to see that no learner can meaningfully approximate any given p without having some prior knowledge. The problem then becomes: assuming the sample-generating distribution p belongs to a given class of distributions C, and given parameters ε, δ ∈ (0, 1), output some distribution p̂ such that with probability at least 1 − δ, the statistical distance between p and p̂ is at most ε. Specifically, we employ total variation distance, the most studied metric in density estimation [DL01, Dia16], using d_TV(p, q) to denote the distance between distributions p and q. This case, when p ∈ C, is often called the realizable setting. If, for some particular class C, this is doable with a finite number of samples n(ε, δ), then we say the distribution class is (ε, δ)-learnable.¹ A class is learnable if it is (ε, δ)-learnable for every (ε, δ) ∈ (0, 1)². A significant amount of work has focused on proving bounds on n(ε, δ) for a number of classes C of interest; for example, one can consider the class C of all Gaussian distributions N(µ, Σ) in some Euclidean space R^d. However, this framework is restrictive in the sense that it requires the unknown distribution to be exactly a member of the class C of interest. This may not be the case for a variety of possible reasons, some innocuous and some malicious. As one example, while it is a common modelling assumption to posit that data comes from a Gaussian distribution, Nature rarely samples exactly from Gaussians; we consider this only to be a convenient approximation. More generally, the class C that the learner assumes can be thought of as reflecting some prior knowledge about the task at hand. Such prior knowledge is almost always only an approximation of reality.

Authors are listed in alphabetical order.
¹ For the sake of exposition, we defer formal definitions of our learnability notions to Section 1.1.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
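For intuition, total variation distance between distributions p and q satisfies d_TV(p, q) = sup_B |p(B) − q(B)| = (1/2)·||p − q||₁. Below is a minimal sketch (ours, not from the paper) computing it for discrete distributions over a finite domain:

```python
import numpy as np

def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance between two discrete distributions,
    given as probability vectors over the same finite domain:
    d_TV(p, q) = (1/2) * sum_x |p(x) - q(x)|."""
    assert p.shape == q.shape
    return 0.5 * np.abs(p - q).sum()

# Example: a fair coin vs. a slightly biased coin.
p = np.array([0.5, 0.5])
q = np.array([0.6, 0.4])
print(tv_distance(p, q))  # 0.1
```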
Alternatively, we may be in an adversarial setting, where a malicious actor has the ability to modify an otherwise well-behaved distribution, say by injecting datapoints of their own (known in the machine learning literature as data poisoning attacks [BNL12, SKL17, DKK+19, GTX+20, GFH+21, LKY23]). More formally, the classic problem of agnostic learnability is generally described as follows: given a (known) class of distributions C, and a (finite) set of samples drawn i.i.d. from some (unknown) distribution p, find a distribution p̂ whose statistical distance from p is not much more than that of the closest member of C. It is not hard to see that this is equivalent to a notion of robust learnability, where the distribution p is not viewed as arbitrary, but instead as an adversarial corruption of some distribution within C.² Given their equivalence, throughout this work we will use agnostic and robust learnability interchangeably. The difference between the robust setting and the previous realizable one is that now, instead of assuming p ∈ C and asking for an arbitrarily good approximation of p, we make no prior assumption about the data-generating distribution and only ask to approximate it as well as (or close to) what the best member of some benchmark class C can do. We address the following question:

Assuming a class of distributions is learnable, under which notions of robustness is it guaranteed to be robustly learnable?

We focus entirely on information-theoretic learnability, and eschew concerns of computational efficiency. Indeed, our question of interest is so broad that computationally efficient algorithms may be too much to hope for. We shall consider a few variants of robust learnability. Specifically, we will impose requirements on the nature of the difference between the data-generating distribution p and members of the class C. Obviously, such requirements can only make the task of robust learning easier. One such model considers additive robustness: the underlying distribution is restricted to be a mixture of a distribution p from C and some contaminating distribution q. In the Statistics community, this celebrated model is known as Huber's contamination model [Hub64]. Analogously, one can consider subtractive robustness. This includes the case where the starting point is a distribution in the class C, but a fraction of the probability mass is removed and samples are drawn from the resulting distribution (after rescaling). These two models are related to adversaries who can add or remove points from a sampled dataset; see the discussion at the end of Section 1.2. A significant line of work focuses on understanding the sample complexity of agnostic distribution learning (see examples and discussion in Section 1.3). Most of these works study restricted classes of distributions, with analyses that apply only to those classes. Some works have found quantitative separations between the different robustness models. For instance, in the specific case of Gaussian mean estimation, [DKK+16, LRV16, DKS17, DKK+18] give strong evidence that efficient algorithms can achieve better error if they must only be additively robust, rather than robust in general. However, such findings are again restricted to specific cases, and say little about the overall relationship between learnability in general and these various robust learning models. Current results leave open a more comprehensive treatment of robustness in distribution learning. Specifically, what is the relative power of these different robustness models, and what is their impact on which types of distributions are learnable?
Are there more generic ways to design robust learning algorithms? Our two main contributions are the following:

- We give a generic algorithm which converts a learner for a distribution class into a learner for that class that is robust to additive corruptions.
- We show that there exist distribution classes which are learnable, but are no longer learnable after subtractive corruption.

Stated succinctly: we show that learnability implies robust learnability when an adversary can make additive corruptions, but not subtractive corruptions. Other results explore implications related to compression schemes and differentially private learnability.

² A related notion of robust learnability instead imagines that the adversary modifies the samples from a distribution in C, rather than the distribution itself. This adaptive model is discussed further in Section 1.2.

1.1 Definitions of Learnability

In order to more precisely describe our results, we define various notions of learnability. We start with the standard notion of PAC learnability for a distribution class: we get samples from a distribution p belonging to a distribution class C, and the goal is to output a distribution similar to p.

Definition 1.1 (Learnability). We say that a class C of probability distributions is learnable (or, realizably learnable) if there exists a learner A and a function n_C : (0, 1)² → N, such that for every probability distribution p ∈ C and every (ε, δ) ∈ (0, 1)², for every n ≥ n_C(ε, δ), the probability over samples S of that size drawn i.i.d. from the distribution p that d_TV(p, A(S)) ≤ ε is at least 1 − δ.

We next introduce the more challenging setting of robust, or agnostic, learning. In this setting, the sampled distribution is within bounded distance of the distribution class C, rather than being in C itself. For technical reasons, we introduce two closely related definitions. Roughly speaking, the latter definition assumes a known bound η on the distance from the sampling distribution to C, whereas the former (more commonly considered in the agnostic learning literature) does not. Note that in many cases, robust algorithms designed with knowledge of the distance η to C can be modified to do without it [JOR22].

Definition 1.2 (Robust learnability).
1. For α > 0, we say that a class C of probability distributions is α-robustly learnable (also referred to as α-agnostically learnable) if there exists a learner A and a function n_C : (0, 1)² → N, such that for every probability distribution p and every (ε, δ) ∈ (0, 1)², for every n ≥ n_C(ε, δ), the probability over samples S of that size drawn i.i.d. from the distribution p that d_TV(p, A(S)) ≤ α · min{d_TV(p, p′) : p′ ∈ C} + ε is at least 1 − δ. When α = 1 we omit it and say that the class is robustly (or agnostically) learnable.
2. For η ≥ 0 and α > 0, we say that a class C of probability distributions is η-α-robustly learnable if there exists a learner A and a function n_C : (0, 1)² → N, such that for every probability distribution p with min{d_TV(p, p′) : p′ ∈ C} ≤ η and every (ε, δ) ∈ (0, 1)², for every n ≥ n_C(ε, δ), the probability over samples S of that size drawn i.i.d. from the distribution p that d_TV(p, A(S)) ≤ αη + ε is at least 1 − δ.

Finally, we introduce notions of robust learnability which correspond to only additive or only subtractive deviations from the distribution class C. These requirements are more stringent than standard (realizable) learnability, but more lenient than η-α-robust learnability: the adversary in that setting can deviate from the distribution class C with both additive and subtractive modifications simultaneously.
Definition 1.3 (Additive robust learnability). Given parameters 0 ≤ η ≤ 1 and α > 0, we say that a class C of probability distributions is η-additive α-robustly learnable if there exists a learner A and a function n_C : (0, 1)² → N, such that for every probability distribution q, every p ∈ C, and every (ε, δ) ∈ (0, 1)², for every n ≥ n_C(ε, δ), the probability over samples S of that size drawn i.i.d. from the distribution ηq + (1 − η)p that d_TV(A(S), p) ≤ αη + ε is at least 1 − δ.

Definition 1.4 (Subtractive robust learnability). Given parameters 0 ≤ η ≤ 1 and α > 0, we say that a class C of probability distributions is η-subtractive α-robustly learnable if there exists a learner A and a function n_C : (0, 1)² → N, such that for every probability distribution p for which there exists a probability distribution q with ηq + (1 − η)p ∈ C, and for every (ε, δ) ∈ (0, 1)², for every n ≥ n_C(ε, δ), the probability over samples S of that size drawn i.i.d. from the distribution p that d_TV(A(S), p) ≤ αη + ε is at least 1 − δ.

1.2 Results and Techniques

We explore how different robustness models affect learnability of distributions, showing strong separations between them. Our first main result shows that learnability implies additively robust learnability.

Theorem 1.5. Any class of probability distributions Q which is realizably learnable is also η-additively 2-robustly learnable for every η ∈ (0, 1/4).

Note that, since additively robust learnability trivially implies learnability, this shows an equivalence between learnability and additively robust learnability. Our algorithm enumerates over all subsets of the dataset of an appropriate size, such that at least one subset contains no samples from the contaminating distribution. A realizable learner is applied to each subset, and techniques from hypothesis selection [Yat85, DL96, DL97, DL01] are used to pick the best of the learned distributions. Further details appear in Section 2. We also note that since our robust learning algorithm enumerates all large subsets of the training dataset, it is not computationally efficient. Indeed, for such a broad characterization, this would be too much to ask. Efficient algorithms for robust learnability are an exciting and active field of study, but outside the scope of this work. For further discussion see Section 1.3. Our other main result shows that a distribution class being learnable does not imply that it is subtractively robustly learnable.

Theorem 1.6. For every α > 0, there exists a class that is learnable, but not η-subtractively α-robustly learnable for any 0 ≤ η ≤ 1/(16α).

An immediate corollary is that learnability does not imply robust (or agnostic) learnability, since this is a more demanding notion than subtractive robust learnability. Our proof of this theorem proceeds by constructing a class of distributions that is learnable, but such that classes obtained by subtracting light-weight parts of these distributions are not α-robustly learnable with respect to the original learnable class. More concretely, our construction works as follows. We start by considering a distribution class that, by itself, is not learnable with any finite number of samples. We map each distribution in that class to a new distribution, which additionally features a point with nontrivial mass that encodes the identity of the distribution, thus creating a new class of distributions which is learnable. Subtractive contamination is then able to erase this point mass, leaving a learner with sample access only to the original (unlearnable) class.
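To make the two corruption models concrete, here is a toy sketch (ours, not from the paper) over a four-point domain: the additive adversary of Definition 1.3 mixes in mass of its choosing, while the subtractive adversary of Definition 1.4 deletes mass (here, a small identifying point mass, mirroring the construction just described) and renormalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "clean" distribution over {0, 1, 2, 3}: most mass on 0,
# plus a small identifying point mass on 3 (weight eta).
eta = 0.05
p = np.array([0.80, 0.10, 0.05, 0.05])

# Additive corruption (Definition 1.3): samples come from
# eta * q + (1 - eta) * p for an adversarial q.
q = np.array([0.0, 0.0, 1.0, 0.0])
p_additive = eta * q + (1 - eta) * p

# Subtractive corruption (Definition 1.4): the adversary removes
# the point mass on 3 and renormalizes; samples come from the result.
p_subtractive = p.copy()
p_subtractive[3] = 0.0
p_subtractive /= p_subtractive.sum()

sample = rng.choice(4, size=10, p=p_subtractive)
print(p_additive, p_subtractive, sample)
```

Note that p is an η-subtractive corruption's preimage here: p = η·δ₃ + (1 − η)·p_subtractive, so a learner seeing only p_subtractive never observes the identifying point.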
Our construction is inspired by the recent construction of Lechner and Ben-David [LBD23], showing that the learnability of classes of probability distributions cannot be characterized by any notion of combinatorial dimension. For more details, see Section 3. Thus far, we have only considered additive and subtractive robustness separately. General robustness, where probability mass can be both added and removed, is more powerful than either model individually. However, if a class is additively robustly learnable and subtractively robustly learnable, is it robustly learnable? Though this is intuitively true, we are not aware of an immediate proof. Using a similar argument as for Theorem 1.5, we derive a stronger statement: that subtractively robust learnability alone implies robust learnability.

Theorem 1.7. If a class C is η-subtractive α-robustly learnable, then it is also η-(2α + 4)-robustly learnable.

Adjacent to distribution learning is the notion of sample compression schemes. Recent work by Ashtiani, Ben-David, Harvey, Liaw, Mehrabian, and Plan [ABDH+18] expanded notions of sample compression schemes to apply to the task of learning probability distributions. They showed that the existence of such sample compression schemes for a class of distributions implies the learnability of that class. While the existence of sample compression schemes for classification tasks implies the existence of such schemes for robust learning, the question of whether a similar implication holds for distribution learning schemes was left unanswered. We strongly refute this statement, using a construction similar to that of Theorem 1.6; see Section 3.1 for more details.

Theorem 1.8. The existence of compression schemes for a class of probability distributions does not imply the existence of robust compression schemes for that class.

Finally, a natural question is whether other forms of learnability imply robust learnability. We investigate when differentially private³ (DP) learnability does or does not imply robust learnability. We find that the same results and separations as before hold when the distribution class is learnable under approximate differential privacy (i.e., (ε, δ)-DP), but, perhaps surprisingly, under pure differential privacy (i.e., (ε, 0)-DP), private learnability implies robust learnability for all considered adversaries.⁴

Theorem 1.9 (Informal). (ε, 0)-DP learnability implies robust (ε, 0)-DP learnability. For any δ > 0, (ε, δ)-DP learnability implies additively robust learnability, but not subtractively robust learnability.

For pure DP learnability, we employ an equivalence between learnability under pure differential privacy and packing [BKSW19]. Existence of such a packing in turn implies learnability both under pure differential privacy and with both additive and subtractive contamination. For approximate DP learnability, we note that the corresponding version of Theorem 1.5 automatically holds, since private learnability implies learnability. We show that our construction for Theorem 1.6 is still learnable under approximate differential privacy, and thus the corresponding non-implication holds.

³ Differential privacy is a popular and rigorous notion of data privacy. For the definition of differential privacy, see Section 4.
⁴ In the context of differential privacy, we diverge slightly from the established notation. Specifically, we align ourselves with common notation in the DP literature, using ε and δ for privacy parameters, and using α (in place of ε) and β (in place of δ) for accuracy parameters.
See Section 4 for more details. To summarize the qualitative versions of our findings:

- Learnability and additively robust learnability are equivalent (Theorem 1.5);
- Learnability does not imply subtractively robust learnability (Theorem 1.6);
- Subtractively robust learnability implies robust learnability (Theorem 1.7);
- Pure DP learnability is equivalent to robust pure DP learnability (Theorem 1.9);
- Approximate DP learnability implies additively robust learnability,⁵ but not subtractively robust learnability (Theorem 1.9);
- Existence of sample compression schemes does not imply the existence of robust sample compression schemes (Theorem 1.8).

Quantitative versions of these statements can be found in their respective theorems.

Adaptive Adversaries. Our definition of robustness allows an adversary to make changes to the underlying distribution. Equivalently, it corresponds to an adversary who can add or remove points from a dataset, but must commit to these modifications before the dataset is actually sampled. A stronger⁶ adversary would be able to choose which points to add or remove after seeing the sampled dataset. Such an adversary is referred to as adaptive. Since adaptive adversaries are stronger than the ones we consider, any impossibility result that we show also holds in this setting (e.g., learnability does not imply subtractive robust learnability when the adversary is adaptive). It is an interesting open question to understand whether our algorithms can be strengthened to work in this setting. Some positive evidence in this direction is due to Blanc, Lange, Malik, and Tan [BLMT22], who show that adaptive and non-adaptive adversaries have qualitatively similar power in many settings.

1.3 Related Work

Robust estimation is a classic subfield of Statistics; see, for example, the classic works [Tuk60, Hub64]. Our work fits more into the Computer Science literature on distribution estimation, initiated by the work of [KMR+94], which was in turn inspired by Valiant's PAC learning model [Val84]. Since then, several works have focused on algorithms for learning specific classes of distributions, see, e.g., [CDSS13, CDSS14a, CDSS14b, LS17, ABDH+20]. A recent line of work, initiated by [DKK+16, LRV16], focuses on computationally efficient robust estimation of multivariate distributions, see, e.g., [DKK+17, SCV18, DKK+18, KSS18, HL18, DKK+19, LM21, LM22, BDJ+22, JKV23], and [DK22] for a reference. In contrast to all of these works, we focus on broad and generic connections between learnability and robust learnability, rather than studying robust learnability of a particular class of distributions. Some of our algorithmic results employ tools from hypothesis selection, a problem which focuses on agnostic learning with respect to a specified finite set of distributions. The most popular approaches are based on ideas introduced by Yatracos [Yat85] and subsequently refined by Devroye and Lugosi [DL96, DL97, DL01]. Several others have studied hypothesis selection with an eye toward other considerations, including running time, approximation factor, robustness, privacy, parallelization, and more [MS08, DDS12, DK14, SOAJ14, DKK+16, ABDH+18, AFJ+18, BKSW19, BKM19, GKK+20, BBK+22, AAK21]. Distribution learning under the constraint of differential privacy [DMNS06] has been an active area of research, see, e.g., [KV18, KLSU19, BS19, ASZ21, CWZ21, KMS+22b, KMS22a], and [KU20] for a survey.

⁵ A natural open question is whether approximate DP learnability implies additively-robust approximate-DP learnability.
⁶ And, in the case of removals, much more natural.

A number of these works have focused on connections between robustness and privacy [DL09, BKSW19, KSU20, AM20, BGS+21, LKKO21, LKO22, KMV22, RJC22, SM22, HKM22, GH22, HKMN23, AKT+23]. Again, these results either focus on specific classes of distributions, or give implications that require additional technical conditions, whereas we aim to give characterizations of robust learnability under minimal assumptions. The question of whether learnability under realizability assumptions extends to the non-realizable setting has a long history. For binary classification tasks, both notions are characterized by the finiteness of the VC dimension, and are therefore equivalent [VC71, VC74, BEHW89, Hau92]. [BPS09] show a similar result for online learning: namely, that agnostic (non-realizable) learnability is characterized by the finiteness of the Littlestone dimension, and is therefore equivalent to realizable learnability. Going beyond binary classification, recent work [HKLM22] shows that the equivalence of realizable and agnostic learnability extends across a wide variety of settings. These include models with no known characterization of learnability, such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. This stands in contrast to our results for the distribution learning setting. We show that realizable learnability of a class of probability distributions does not imply its agnostic learnability. It is interesting and natural to explore the relationship between various notions of distribution learnability, of which we have only scratched the surface in this work.

2 Learnability Implies Additive Robust Learnability

We recall Theorem 1.5, which shows that any class that is realizably learnable is also additively robustly learnable.

Theorem 1.5. Any class of probability distributions Q which is realizably learnable is also η-additively 2-robustly learnable for every η ∈ (0, 1/4).

We prove this theorem by providing an algorithm based on classical tools for hypothesis selection [Yat85, DL96, DL97, DL01]. These methods take as input a set of samples from an unknown distribution and a collection of candidate hypothesis distributions. If the unknown distribution is close to one of the hypotheses, then, given enough samples, the algorithm will output a close hypothesis. Roughly speaking, our algorithm looks at all large subsets of the dataset, such that at least one will correspond to an uncontaminated set of samples. A learner for the realizable setting (whose existence we assumed) is applied to each to generate a set of hypotheses, and we then use hypothesis selection to pick one with sufficient accuracy. The proof of Theorem 1.7 (showing that subtractively robust learnability implies robust learnability) follows almost exactly the same recipe, except that the realizable learner is replaced with a learner robust to subtractive contamination. We recall some preliminaries in Section A. We then prove Theorem 1.5 in Section B, and we formalize and prove a version of Theorem 1.7 in Section 2.1.
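The following skeleton (ours; exponential-time by design, matching the remark above that the reduction is not computationally efficient) shows the overall shape of the argument. The subset size and the injected `select` routine are stand-ins for the formal objects of Section B; a sketch of a suitable selection rule appears after Theorem A.1 in Appendix A.

```python
import itertools
import numpy as np

def additively_robust_learner(sample, realizable_learner, select, eta):
    """Reduction sketch: with an eta fraction of contaminated points,
    some subset of size about (1 - eta) * n is fully clean, so running
    the realizable learner on every such subset puts at least one good
    hypothesis into the candidate pool; `select` then picks among the
    candidates (e.g., the Scheffe-set test of Theorem A.1)."""
    n = len(sample)
    k = int(np.ceil((1 - 2 * eta) * n))  # crude slack for sampling noise
    candidates = [realizable_learner(np.array(sub))
                  for sub in itertools.combinations(sample, k)]
    return select(candidates, sample)
```

Run this on tiny inputs only; the enumeration over all size-k subsets is exponential in n.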
We note that α = 2 and α = 3 are often the optimal factors to expect in distribution learning settings, even for the case of finite distribution classes. For example, for proper agnostic learning, the factor α = 3 is known to be optimal for finite collections of distributions; this holds even for classes with only 2 distributions [BKM19]. Similarly, the factor of α = 2 is optimal if the notion of learning is relaxed to improper learners [BKM19, CDSS14b]. While we are not aware of lower bounds for the additive setting, a small constant factor such as 2 is within expectations for these problems. For the proof of Theorem 1.5, we refer the reader to Section B in the appendix.

2.1 Subtractive Robust Learnability Implies Robust Learnability

Similarly, we can show that robustness with respect to a subtractive adversary implies robustness with respect to a general adversary. We note that this theorem requires a change in constants, from α to 2α + 4.

Theorem 1.7. If a class C is η-subtractive α-robustly learnable, then it is also η-(2α + 4)-robustly learnable.

The proof follows a similar argument as the proof of Theorem 1.5 and can be found in Section C in the appendix.

3 Learnability Does Not Imply Robust (Agnostic) Learnability

In this section we show that there are classes of distributions which are realizably learnable, but not robustly learnable.

Theorem 3.1. There are classes of distributions Q such that Q is realizably learnable, but for every α ∈ R, Q is not α-robustly learnable. Moreover, the sample complexity of learning Q can be arbitrarily close to (but larger than) linear. Namely, for any super-linear function g, there is a class Q_g such that:

- Q_g is realizably learnable with sample complexity n^re_{Q_g}(ε, δ) ≤ log(1/δ) g(1/ε);
- for every α ∈ R, Q_g is not α-robustly learnable.

Note that this statement appears slightly weaker than Theorem 1.6, in that it holds for α-robust learnability rather than η-subtractive α-robust learnability. In fact, the two statements are incomparable, due to the order of quantifiers in the construction. Here we provide a single class which is not α-robustly learnable for every α, whereas in the proof of Theorem 1.6 we give a different class for each α (though the two constructions are similar). For simplicity we focus here on Theorem 3.1, whereas the proofs of Theorem 1.6 and other claims appear in Section D. The key idea of the proof is to construct a class which is easy to learn in the realizable case, by having each distribution of the class have a unique support element that is not shared by any other distribution in the class. Distributions on which this indicator element has sufficient mass will be easily identified, independent of how rich the class is on other domain elements. That richness makes the class hard to learn from samples that miss those indicators. Furthermore, we construct the class in a way that its members are close in total variation distance to distributions that place no weight on those indicator elements. This is done by making the mass on these indicator elements small, so that the members of the class of distributions that results from deleting these indicator bits are close to the initially constructed class Q_g. In order to make this work for every target accuracy and sample complexity, we need to take a union of such classes with decreasingly small mass on the indicator bits. In order for this not to interfere with realizable learnability, we let the distributions with small mass on the indicator bits have most of their mass on one point (0, 0) that is the same for all distributions in the class.
This ensures that distributions for which the indicator bit will likely not be observed, because its mass is smaller than some η, are still easily ε-approximated by a constant distribution (δ_{(0,0)}). Lastly, we ensure the impossibility of agnostic learnability by controlling the rate at which η approaches zero to be faster than the rate at which ε approaches zero. With this intuition in mind, we will now describe the construction and prove the theorem.

Proof. We first define the distributions in Q_g. Let {A_i ⊆ N : i ∈ N} be an enumeration of all finite subsets of N. Define distributions over N × N as follows:

    q_{i,j,k} = (1 − 1/j) δ_{(0,0)} + (1/j − 1/k) U_{A_i × {2j+1}} + (1/k) δ_{(i,2j+2)},    (1)

where, for every finite set W, U_W denotes the uniform distribution over W. For a monotone, super-linear function g : N → N, we now let Q_g = {q_{i,j,g(j)} : i, j ∈ N}. The first bullet point of the theorem (the class is learnable) follows from Claim 3.2, and the second bullet point (the class is not robustly learnable) follows from Claim 3.3.

Claim 3.2. For a monotone function g : N → N, let Q_g = {q_{i,j,g(j)} : i, j ∈ N}. Then the sample complexity of Q_g in the realizable case is upper bounded by n^re_{Q_g}(ε, δ) ≤ log(1/δ) g(1/ε).

This claim can be proved by showing that the learner defined by

    A(S) = q_{i,j,g(j)} if (i, 2j + 2) ∈ S, and A(S) = δ_{(0,0)} otherwise,

is a successful learner in the realizable case. Intuitively, this learner is successful for distributions q_{i,j,g(j)} for which j is large (i.e., j > 1/ε), since this means that d_TV(q_{i,j,g(j)}, δ_{(0,0)}) is small. Furthermore, it is successful for distributions q_{i,j,g(j)} for which j is small (i.e., upper bounded by some constant depending on ε), because this lower bounds the probability 1/g(j) of observing the indicator bit on (i, 2j + 2). Once the indicator bit is observed, the distribution is uniquely identified.

Claim 3.3. For every function g ∈ ω(n), the class Q_g is not α-robustly learnable for any α > 0.

This claim can be proven by showing that for every α there is an η such that the class of distributions Q̃, in which for every q̃ ∈ Q̃ there is q ∈ Q_g with d_TV(q̃, q) < η, is not αη-weakly learnable.⁷ In particular, for every q̃ ∈ Q̃ there are q ∈ Q_g and p such that q = (1 − η)q̃ + ηp. We construct this class and show that it is not learnable by using the construction and Lemma 3 from [LBD23].
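A small sketch of the construction and of the Claim 3.2 learner (ours; the mixture weights follow our reconstruction of Equation (1) above, and the index sets are truncated to finite examples):

```python
import numpy as np

def sample_q(i_set, i, j, k, n, rng):
    """Draw n points from q_{i,j,k}: mass (1 - 1/j) on (0,0),
    mass (1/j - 1/k) uniform on A_i x {2j+1}, mass 1/k on (i, 2j+2)."""
    pts = []
    for _ in range(n):
        u = rng.random()
        if u < 1 - 1 / j:
            pts.append((0, 0))
        elif u < 1 - 1 / k:
            pts.append((rng.choice(i_set), 2 * j + 1))
        else:
            pts.append((i, 2 * j + 2))  # the identifying point mass
    return pts

def realizable_learner(sample):
    """Claim 3.2 learner: if some indicator point (i, 2j+2) occurs,
    output (the parameters of) q_{i,j,g(j)}; otherwise output delta_{(0,0)}."""
    for (x, y) in sample:
        if y >= 2 and y % 2 == 0:          # y = 2j + 2 identifies (i, j)
            return ("q", x, (y - 2) // 2)  # i = x, j = (y - 2) / 2
    return ("delta", 0, 0)

rng = np.random.default_rng(1)
S = sample_q(i_set=[3, 1, 4], i=7, j=2, k=10, n=50, rng=rng)
print(realizable_learner(S))
```

A subtractive adversary that deletes the 1/k point mass on (i, 2j+2) and renormalizes leaves the learner with no indicator to find, which is exactly the failure mode Claim 3.3 exploits.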
3.1 Existence of sample compression schemes

Sample compression schemes are combinatorial properties of classes that imply their learnability. For a variety of learning tasks, such as binary classification or expectation maximization, a class has a sample compression scheme if and only if it is learnable [MY16, BHM+17]. For classification tasks, sample compression for realizable samples implies agnostic sample compression. [ABDH+20] used compression schemes to show learnability of classes of distributions in the realizable case, but left open the question of whether, for learning probability distributions, the existence of realizable sample compression schemes implies the existence of similar schemes for the non-realizable (agnostic, or robust) settings. We provide a negative answer to this question.

More concretely, let Q be a class of distributions over some domain X. A compression scheme for Q involves two agents: an encoder and a decoder. The encoder knows a distribution q ∈ Q and receives a sample S generated by this distribution. The encoder picks a bounded-size sub-sample and sends it, possibly with a few additional bits, to the decoder. The decoder receives the message and uses an agreed-upon decoding rule (that may depend on Q but not on q or S) to construct a distribution p that is close to q. Of course, there is some probability that the samples are not representative of the distribution q, in which case the compression scheme will fail. Thus, we only require that the decoding succeed with constant probability. We say that a class Q has a sample compression scheme (realizable or robust) if for every accuracy parameter ε > 0, the minimal required size of the sample S, and upper bounds on the size of the sub-sample and on the number of additional bits in the encoder's message, depend only on Q and ε (and are independent of the sample-generating q and of the sample S). A realizable compression scheme is required to handle only q's in Q and to output p such that d_TV(p, q) ≤ ε, while a robust compression scheme should handle any q, but the decoder's output p is only required to be (α min{d_TV(q, q′) : q′ ∈ Q} + ε)-close to q.

Theorem 3.4 (Formal version of Theorem 1.8). For every α ∈ R, the existence of a realizable compression scheme does not imply the existence of an α-robust compression scheme. That is, there is a class Q that has a realizable compression scheme, but for every α ∈ R, Q does not have an α-robust compression scheme.

Proof. Consider the class Q = Q_g from Section 3. We note that this class has a compression scheme of size 1. However, from [ABDH+18], we know that having an α-robust compression scheme implies α-agnostic learnability. We showed in Theorem 1.6 that for every α and for every super-linear function g, the class Q_g is not α-agnostically learnable. It follows that Q_g does not have an α-robust compression scheme.

⁷ We provide a definition of ε-weak learnability as Definition D.2. We note that the definition we provide is what would usually be referred to as (1/2 − ε)-weak learnability in the supervised learning literature. For simplicity, because ε is our parameter of interest, we reparameterized the definition to be more intuitive.

In Section E, we present a precise quantitative definition of sample compression schemes, as well as the proof that the class Q_g has a sample compression scheme of size 1.

4 Implications of Private Learnability

Qualitatively speaking, differentially private algorithms offer a form of robustness: the output distribution of a differentially private algorithm is insensitive to the change of a single point in its input sample. The relationship between privacy and notions of robustness has been studied under various settings, where it has been shown that robust algorithms can be made private and vice versa [DL09, GH22, HKMN23]. For distribution learning, we find that: (1) the requirement of approximate differentially private learnability also does not imply (general) robust learnability; and (2) the stronger requirement of pure differentially private learnability does imply robust learnability.

Definition 4.1 (Differential Privacy [DMNS06]). Let X be an input domain and Y an output domain. A randomized algorithm A : Xⁿ → Y is (ε, δ)-differentially private (DP) if for every x, x′ ∈ Xⁿ that differ in one entry, P[A(x) ∈ B] ≤ e^ε P[A(x′) ∈ B] + δ for all B ⊆ Y. If A is (ε, δ)-DP for δ > 0, we say it satisfies approximate DP. If it satisfies (ε, 0)-DP, we say it satisfies pure DP.
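As a standard illustration of Definition 4.1 (a textbook example, not specific to this paper): adding Laplace noise of scale 1/ε to a counting query, whose value changes by at most 1 when one entry of x changes, yields an (ε, 0)-DP algorithm.

```python
import numpy as np

def dp_count(x, predicate, eps, rng):
    """(eps, 0)-DP count of entries satisfying `predicate`.
    The true count has sensitivity 1, so Laplace(1/eps) noise suffices."""
    true_count = sum(1 for xi in x if predicate(xi))
    return true_count + rng.laplace(scale=1.0 / eps)

rng = np.random.default_rng(2)
data = [0.3, 1.2, 0.7, 2.5, 0.1]
print(dp_count(data, lambda v: v > 0.5, eps=1.0, rng=rng))
```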
Definition 4.2 (DP learnable class). We say that a class C of probability distributions is (approximate) DP learnable if there exists a randomized learner A and a function n_C : (0, 1)⁴ → N, such that for every probability distribution p ∈ C and every (α, β, ε, δ) ∈ (0, 1)⁴, for n ≥ n_C(α, β, ε, δ):

1. A is (ε, δ)-DP; and
2. the probability, over samples S of size n drawn i.i.d. from the distribution p as well as over the randomness of A, that d_TV(p, A(S)) ≤ α is at least 1 − β.

We say C is pure DP learnable if a learner A can be found that satisfies (ε, 0)-DP, in which case the sample complexity function n_C : (0, 1)³ → N does not take δ as a parameter.

Theorem 4.3 (Approximate DP learnability vs. robust learnability).
1. If a class Q is approximate DP learnable, then Q is η-additive 2-robustly learnable for any η ∈ (0, 1/4).
2. There exists an approximate DP learnable class Q that is not α-robustly learnable for any α ≥ 1.

Note that the first claim is immediate from Theorem 1.5, since approximate DP learnability implies learnability. To prove the second claim, we show that the learner for the class Q described in Theorem 3.1 can be made differentially private by employing stability-based histograms [BNS16]. The proof appears in Section F.

Theorem 4.4 (Pure DP learnability vs. robust learnability). If a class Q is pure DP learnable, then Q is 3-robustly learnable.

The proof relies on the finite-cover characterization of pure DP learnability.

Proposition 4.5 (Packing lower bound, Lemma 5.1 of [BKSW19]). Let C be a class of distributions, and let α, ε > 0. Suppose P_α is an α-packing of C, that is, P_α ⊆ C such that for any p ≠ q ∈ P_α, d_TV(p, q) > α. Then any ε-DP algorithm A that takes n i.i.d. samples S from any p ∈ C and has d_TV(p, A(S)) ≤ α/2 with probability ≥ 9/10 requires n ≥ (log |P_α| − log(10/9)) / ε.

Proof of Theorem 4.4. Let α, β > 0. Pure DP learnability of Q implies that there exist a 1-DP algorithm A_DP and n = n_C(α/12, 1/10, 1) such that for any p ∈ Q, with probability ≥ 9/10 over the sampling of n i.i.d. samples S from p, as well as over the randomness of the algorithm A_DP, we have d_TV(p, A_DP(S)) ≤ α/12. By Proposition 4.5, any (α/6)-packing P_{α/6} of Q has |P_{α/6}| ≤ (10/9) exp(n). Let Q̂ be such a maximal (α/6)-packing. By maximality, Q̂ is also an (α/6)-cover of Q. Hence, running Yatracos' 3-robust finite-class learner (Theorem A.1) A over Q̂ with n_{Q̂}(α/2, β) = O((log |Q̂| + log(1/β)) / α²) samples drawn i.i.d. from p yields, with probability ≥ 1 − β,

    d_TV(p, A(S)) ≤ 3 min{d_TV(p, p′) : p′ ∈ Q̂} + α/2 ≤ 3 (min{d_TV(p, p′) : p′ ∈ Q} + α/6) + α/2 = 3 min{d_TV(p, p′) : p′ ∈ Q} + α.

Note that Yatracos' algorithm for hypothesis selection can be replaced with a pure DP algorithm for hypothesis selection (Theorem 27 of [AAK21]) in order to achieve the following stronger implication.

Theorem 4.6 (Pure DP learnability vs. robust pure DP learnability). If a class Q is pure DP learnable, then Q is pure DP 3-robustly learnable.
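The stability-based histogram invoked for Theorem 4.3, claim 2, releases the identities of heavy hitters over a countable domain with approximate DP; this is what lets a private learner detect the indicator point of the Q_g construction. A minimal sketch of the thresholded noisy histogram (ours, following one standard analysis of this technique; the threshold constant is illustrative, not taken from [BNS16]):

```python
import numpy as np
from collections import Counter

def stable_histogram(sample, eps, delta, rng):
    """Approximate-DP histogram over an unbounded discrete domain:
    add Laplace(2/eps) noise to each observed bin and keep bins above
    a threshold of order log(1/delta)/eps. Bins never observed are
    never released, which is what costs the additive delta."""
    n = len(sample)
    threshold = 1 + (2.0 / eps) * np.log(2.0 / delta)
    out = {}
    for x, c in Counter(sample).items():
        noisy = c + rng.laplace(scale=2.0 / eps)
        if noisy > threshold:
            out[x] = noisy / n
    return out

rng = np.random.default_rng(3)
S = [(0, 0)] * 800 + [(7, 6)] * 200  # mostly (0,0), some indicator (7,6)
print(stable_histogram(S, eps=1.0, delta=1e-6, rng=rng))
```

With these parameters the indicator bin (7, 6) comfortably clears the threshold, so the released histogram reveals which q_{i,j,g(j)} generated the sample.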
5 Conclusions

We examine the connection between learnability and robust learnability for general classes of probability distributions. Our main findings are somewhat surprising in that, in contrast to most known learning scenarios, learnability does not imply robust learnability. We also show that learnability does imply additively robust learnability. We use our proof techniques to draw new insights related to compression schemes and differentially private distribution learning.

Acknowledgments

Thanks to Argyris Mouzakis for helpful conversations in the early stages of this work. AB was supported by an NSERC Discovery Grant and a David R. Cheriton Graduate Scholarship. GK was supported by a Canada CIFAR AI Chair, an NSERC Discovery Grant, and an unrestricted gift from Google. TL was supported by a Vector Research Grant and a Waterloo Apple PhD Fellowship in Data Science and Machine Learning.

References

[AAK21] Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath. On the sample complexity of privately learning unbounded high-dimensional Gaussians. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, ALT '21, pages 185–216. JMLR, Inc., 2021.

[AAL21] Ishaq Aden-Ali, Hassan Ashtiani, and Christopher Liaw. Privately learning mixtures of axis-aligned Gaussians. In Advances in Neural Information Processing Systems 34, NeurIPS '21. Curran Associates, Inc., 2021.

[ABDH+18] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. In Advances in Neural Information Processing Systems 31, NeurIPS '18, pages 3412–3421. Curran Associates, Inc., 2018.

[ABDH+20] Hassan Ashtiani, Shai Ben-David, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes. Journal of the ACM, 67(6):32:1–32:42, 2020.

[AFJ+18] Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Maximum selection and sorting with adversarial comparators. Journal of Machine Learning Research, 19(1):2427–2457, 2018.

[AKT+23] Daniel Alabi, Pravesh K. Kothari, Pranay Tankala, Prayaag Venkat, and Fred Zhang. Privately estimating a Gaussian: Efficient, robust and optimal. In Proceedings of the 55th Annual ACM Symposium on the Theory of Computing, STOC '23, New York, NY, USA, 2023. ACM.

[AM20] Marco Avella-Medina. The role of robust statistics in private data analysis. Chance, 33(4):37–42, 2020.

[ASZ21] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private Assouad, Fano, and Le Cam. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, ALT '21, pages 48–78. JMLR, Inc., 2021.

[BBK+22] Olivier Bousquet, Mark Braverman, Gillat Kol, Klim Efremenko, and Shay Moran. Statistically near-optimal hypothesis selection. In Proceedings of the 62nd Annual IEEE Symposium on Foundations of Computer Science, FOCS '21, pages 909–919. IEEE Computer Society, 2022.

[BDJ+22] Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M. Kane, Pravesh K. Kothari, and Santosh S. Vempala. Robustly learning mixtures of k arbitrary Gaussians. In Proceedings of the 54th Annual ACM Symposium on the Theory of Computing, STOC '22, pages 1234–1247, New York, NY, USA, 2022. ACM.

[BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

[BGS+21] Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. In Advances in Neural Information Processing Systems 34, NeurIPS '21. Curran Associates, Inc., 2021.

[BHM+17] Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff. A learning problem that is independent of the set theory ZFC axioms. CoRR, abs/1711.05195, 2017.

[BKM19] Olivier Bousquet, Daniel M. Kane, and Shay Moran. The optimal approximation factor in density estimation.
In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 318–341, 2019.

[BKSW19] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 156–167. Curran Associates, Inc., 2019.

[BLMT22] Guy Blanc, Jane Lange, Ali Malik, and Li-Yang Tan. On the power of adaptivity in statistical adversaries. In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 5030–5061, 2022.

[BNL12] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, pages 1467–1474. JMLR, Inc., 2012.

[BNS16] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, ITCS '16, pages 369–380, New York, NY, USA, 2016. ACM.

[BPS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.

[BS19] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. In Advances in Neural Information Processing Systems 32, NeurIPS '19, pages 181–191. Curran Associates, Inc., 2019.

[CDSS13] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '13, pages 1380–1394, Philadelphia, PA, USA, 2013. SIAM.

[CDSS14a] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 604–613, New York, NY, USA, 2014. ACM.

[CDSS14b] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In Advances in Neural Information Processing Systems 27, NIPS '14, pages 1844–1852. Curran Associates, Inc., 2014.

[CWZ21] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5):2825–2850, 2021.

[DDS12] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson binomial distributions. In Proceedings of the 44th Annual ACM Symposium on the Theory of Computing, STOC '12, pages 709–728, New York, NY, USA, 2012. ACM.

[Dia16] Ilias Diakonikolas. Learning structured distributions. In Peter Bühlmann, Petros Drineas, Michael J. Kane, and Mark J. van der Laan, editors, Handbook of Big Data, pages 267–283. Chapman and Hall/CRC, 2016.

[DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Proceedings of the 27th Annual Conference on Learning Theory, COLT '14, pages 1183–1213, 2014.

[DK22] Ilias Diakonikolas and Daniel Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2022.

[DKK+16] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability.
In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS '16, pages 655–664, Washington, DC, USA, 2016. IEEE Computer Society.

[DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 999–1008. JMLR, Inc., 2017.

[DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, Philadelphia, PA, USA, 2018. SIAM.

[DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML '19, pages 1596–1606. JMLR, Inc., 2019.

[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science, FOCS '17, pages 73–84, Washington, DC, USA, 2017. IEEE Computer Society.

[DL96] Luc Devroye and Gábor Lugosi. A universally acceptable smoothing factor for kernel density estimation. The Annals of Statistics, 24(6):2499–2512, 1996.

[DL97] Luc Devroye and Gábor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25(6):2626–2637, 1997.

[DL01] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer, 2001.

[DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on the Theory of Computing, STOC '09, pages 371–380, New York, NY, USA, 2009. ACM.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer.

[GFH+21] Jonas Geiping, Liam Fowl, W. Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller, and Tom Goldstein. Witches' brew: Industrial scale data poisoning via gradient matching. In Proceedings of the 9th International Conference on Learning Representations, ICLR '21, 2021.

[GH22] Kristian Georgiev and Samuel B. Hopkins. Privacy induces robustness: Information-computation gaps and sparse mean estimation. In Advances in Neural Information Processing Systems 35, NeurIPS '22. Curran Associates, Inc., 2022.

[GKK+20] Sivakanth Gopi, Gautam Kamath, Janardhan Kulkarni, Aleksandar Nikolov, Zhiwei Steven Wu, and Huanyu Zhang. Locally private hypothesis selection. In Proceedings of the 33rd Annual Conference on Learning Theory, COLT '20, pages 1785–1816, 2020.

[GTX+20] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. arXiv preprint arXiv:2012.10544, 2020.

[Hau92] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[HKLM22] Max Hopkins, Daniel M. Kane, Shachar Lovett, and Gaurav Mahajan. Realizable learning is all you need.
In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 3015–3069, 2022.

[HKM22] Samuel B. Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. In Proceedings of the 54th Annual ACM Symposium on the Theory of Computing, STOC '22, pages 1406–1417, New York, NY, USA, 2022. ACM.

[HKMN23] Samuel B. Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. In Proceedings of the 55th Annual ACM Symposium on the Theory of Computing, STOC '23, New York, NY, USA, 2023. ACM.

[HL18] Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 1021–1034, New York, NY, USA, 2018. ACM.

[Hub64] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

[JKV23] He Jia, Pravesh K. Kothari, and Santosh S. Vempala. Beyond moments: Robustly learning affine transformations with asymptotically optimal error. arXiv preprint arXiv:2302.12289, 2023.

[JOR22] Ayush Jain, Alon Orlitsky, and Vaishakh Ravindrakumar. Robust estimation algorithms don't need to know the corruption level. arXiv preprint arXiv:2202.05453, 2022.

[KBG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Mario A. Nascimento, M. Tamer Özsu, Donald Kossmann, Renée J. Miller, José A. Blakeley, and K. Bernhard Schiefer, editors, (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, VLDB 2004, Toronto, Canada, August 31 - September 3, 2004, pages 180–191. Morgan Kaufmann, 2004.

[KKMN09] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In Proceedings of the 18th International World Wide Web Conference, WWW '09, pages 171–180, New York, NY, USA, 2009. ACM.

[KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 1853–1902, 2019.

[KMR+94] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Annual ACM Symposium on the Theory of Computing, STOC '94, pages 273–282, New York, NY, USA, 1994. ACM.

[KMS22a] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. In Advances in Neural Information Processing Systems 35, NeurIPS '22. Curran Associates, Inc., 2022.

[KMS+22b] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded Gaussians. In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 544–572, 2022.

[KMV22] Pravesh K. Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 723–777, 2022.

[KSS18] Pravesh Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 1035–1046, New York, NY, USA, 2018. ACM.
[KSU20] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Proceedings of the 33rd Annual Conference on Learning Theory, COLT '20, pages 2204–2235, 2020.

[KU20] Gautam Kamath and Jonathan Ullman. A primer on private statistics. arXiv preprint arXiv:2005.00010, 2020.

[KV18] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS '18, pages 44:1–44:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.

[LBD23] Tosca Lechner and Shai Ben-David. Impossibility of characterizing distribution learning: a simple solution to a long-standing problem, 2023.

[LKKO21] Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. In Advances in Neural Information Processing Systems 34, NeurIPS '21. Curran Associates, Inc., 2021.

[LKO22] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 1167–1246, 2022.

[LKY23] Yiwei Lu, Gautam Kamath, and Yaoliang Yu. Exploring the limits of model-targeted indiscriminate data poisoning attacks. In Proceedings of the 40th International Conference on Machine Learning, ICML '23, pages 22856–22879. JMLR, Inc., 2023.

[LM21] Allen Liu and Ankur Moitra. Settling the robust learnability of mixtures of Gaussians. In Proceedings of the 53rd Annual ACM Symposium on the Theory of Computing, STOC '21, pages 518–531, New York, NY, USA, 2021. ACM.

[LM22] Allen Liu and Ankur Moitra. Learning GMMs with nearly optimal robustness guarantees. In Proceedings of the 35th Annual Conference on Learning Theory, COLT '22, pages 2815–2895, 2022.

[LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS '16, pages 665–674, Washington, DC, USA, 2016. IEEE Computer Society.

[LS17] Jerry Li and Ludwig Schmidt. Robust proper learning for mixtures of Gaussians via systems of polynomial inequalities. In Proceedings of the 30th Annual Conference on Learning Theory, COLT '17, pages 1302–1382, 2017.

[MS08] Satyaki Mahalanabis and Daniel Stefankovic. Density estimation in linear time. In Proceedings of the 21st Annual Conference on Learning Theory, COLT '08, pages 503–512, 2008.

[MY16] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM, 63(3):21:1–21:10, 2016.

[RJC22] Kelly Ramsay, Aukosh Jagannath, and Shoja'eddin Chenouri. Concentration of the exponential mechanism and differentially private multivariate medians. arXiv preprint arXiv:2210.06459, 2022.

[SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.

[SCV18] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS '18, pages 45:1–45:21, Dagstuhl, Germany, 2018. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.

[SKL17] Jacob Steinhardt, Pang Wei W. Koh, and Percy S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, NeurIPS '17, pages 3520–3532. Curran Associates, Inc., 2017.
[SM22] Aleksandra Slavkovic and Roberto Molinari. Perturbed M-estimation: A further investigation of robust statistics for differential privacy. In Alicia L. Carriquiry, Judith M. Tanur, and William F. Eddy, editors, Statistics in the Public Interest: In Memory of Stephen E. Fienberg, pages 337–361. Springer, 2022.

[SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems 27, NIPS '14, pages 1395–1403. Curran Associates, Inc., 2014.

[Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 448–485, 1960.

[Val84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[VC71] Vladimir Naumovich Vapnik and Alexey Yakovlevich Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.

[VC74] Vladimir Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.

[Yat85] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

A Additional Preliminaries

We recall a classic theorem for the problem of hypothesis selection. Given a set of candidate hypothesis distributions, the algorithm selects one which is close to an unknown distribution (to which we have sample access). The requisite number of samples from the unknown distribution is logarithmic in the size of the set of candidates.

Theorem A.1 (Yatracos' 3-robust learner for finite classes (Theorem 4.4 of [ABDH+20], based on Theorem 1 of [Yat85])). Let C be a finite class of distributions over a domain X. There exists an algorithm A such that for any (ε, δ) ∈ (0, 1)² and for any distribution p over X, given a sample S of size m = O((log |C| + log(1/δ)) / ε²) drawn i.i.d. from p, we have that with probability ≥ 1 − δ,

    d_TV(p, A(S)) ≤ 3 min{d_TV(p, p′) : p′ ∈ C} + ε.

We also use the following form of the Chernoff bound.

Proposition A.2 (Chernoff bound). Let X = X₁ + ... + X_m, where X₁, ..., X_m are independent draws from Bernoulli(p). For any t ∈ (0, 1),

    P[X/m ≤ (1 − t)p] ≤ exp(−t²mp/2).
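A compact sketch (ours) of the selection rule that Theorem A.1 abstracts, for discrete distributions given as probability vectors: for each ordered pair of candidates, form the Scheffé set {x : p_i(x) ≥ p_j(x)}, and output the candidate whose mass on every such set best matches the empirical mass.

```python
import numpy as np

def yatracos_select(candidates, sample, domain_size):
    """Minimum-distance selection over the Scheffe (Yatracos) sets,
    for candidate pmf vectors over {0, ..., domain_size - 1}."""
    emp = np.bincount(np.asarray(sample), minlength=domain_size) / len(sample)
    # Scheffe sets of all ordered candidate pairs, as boolean masks.
    scheffe_sets = [p >= q for p in candidates for q in candidates]
    def discrepancy(p):
        return max(abs(p[A].sum() - emp[A].sum()) for A in scheffe_sets)
    return min(candidates, key=discrepancy)

# Tiny demo: the unknown distribution is closest to the second candidate.
rng = np.random.default_rng(4)
truth = np.array([0.7, 0.2, 0.1])
candidates = [np.array([0.1, 0.1, 0.8]), np.array([0.65, 0.25, 0.10])]
sample = rng.choice(3, size=500, p=truth)
print(yatracos_select(candidates, sample, domain_size=3))
```

This routine can serve as the `select` argument of the reduction sketched in Section 2 (e.g., select = lambda cands, s: yatracos_select(cands, s, domain_size)).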
The learner now randomly partitions the sample S into S_1 and S_2 of sizes n_1 and n_2, respectively. Now let S′_1 = S_1 ∩ S′ and S′_2 = S_2 ∩ S′ be the intersections of these sets with the clean set S′, and let n′_1 and n′_2 be their respective sizes. We note that n′_1 ~ Hypergeometric(n, n′, n_1) and n′_2 ~ Hypergeometric(n, n′, n_2).⁸ Thus, assuming n′ ≥ n(1 − η − ε/9) and using Proposition A.2 (hypergeometric random variables are at least as concentrated as their binomial counterparts), we have that

Pr[|S′_1| ≤ (1 − η − 2ε/9)n_1] ≤ δ/5 and Pr[|S′_2| ≤ (1 − η − 2ε/9)n_2] ≤ δ/5,

where the probability is over the random partition of S. We note that by our choices of n_1 and n_2, with probability at least 1 − 2δ/5, the clean fractions of S_1 and S_2 (namely |S′_1|/|S_1| and |S′_2|/|S_2|) are each at least 1 − η − 2ε/9.

⁸ Recall that Hypergeometric(N, K, n) is the random variable counting the number of successes when n draws are made without replacement from a set of size N, of which K elements are considered successes.

Let

Ĥ = {A^re_Q(S̄) : S̄ ⊆ S_1 with |S̄| = (1 − η − 2ε/9)n_1}

be the set of distributions output by the realizable learning algorithm A^re_Q when given as input all possible subsets of S_1 of size exactly (1 − η − 2ε/9)n_1. For ε < 9η/2, this set of distributions has size

|Ĥ| ≤ (n_1 choose (1 − η − 2ε/9)n_1) = (n_1 choose (η + 2ε/9)n_1) ≤ n_1^{2ηn_1}.

By the guarantee of the realizable learning algorithm A^re_Q, if there exists a clean subset S̄′_1 ⊆ S_1 with |S̄′_1| = n_1(1 − η − 2ε/9) (i.e., S̄′_1 ~ p^{n_1(1 − η − 2ε/9)}), then with probability at least 1 − δ/5 there exists a candidate distribution q* ∈ Ĥ with d_TV(p, q*) ≤ ε/9.

We now define and consider the Yatracos sets.⁹ For every q_i, q_j ∈ Ĥ, define the Yatracos set between q_i and q_j to be A_{i,j} = {x : q_i(x) > q_j(x)}.¹⁰ We let Y(Ĥ) = {A_{i,j} ⊆ X : q_i, q_j ∈ Ĥ} denote the set of all pairwise Yatracos sets between distributions in the set Ĥ. We now consider the A-distance [KBG04] between two distributions with respect to the Yatracos sets, i.e., we consider

d_{Y(Ĥ)}(p′, q′) = sup_{B ∈ Y(Ĥ)} |p′(B) − q′(B)|.

This distance is the supremum of the discrepancy between the two distributions across the Yatracos sets. Consequently, for any two distributions p′, q′ we have d_{Y(Ĥ)}(p′, q′) ≤ d_TV(p′, q′), since total variation distance is the supremum of the discrepancy across all possible sets. Furthermore, if p′, q′ ∈ Ĥ, then d_TV(p′, q′) = d_{Y(Ĥ)}(p′, q′), since one of the Yatracos sets between the two distributions serves as a set that realizes the total variation distance between them.

Suppose there is some q* ∈ Ĥ with d_TV(p, q*) ≤ ε/9. Then for every q ∈ Ĥ:

d_TV(p, q) ≤ d_TV(p, q*) + d_TV(q*, q)
= d_TV(p, q*) + d_{Y(Ĥ)}(q*, q)
≤ ε/9 + d_{Y(Ĥ)}(q*, p) + d_{Y(Ĥ)}(p, q)
≤ ε/9 + d_TV(q*, p) + d_{Y(Ĥ)}(p, q)
≤ 2ε/9 + d_{Y(Ĥ)}(p, q).

Lastly, we will argue that we can empirically approximate d_{Y(Ĥ)}(p, q), which we can then use to select a hypothesis. We note that, since Y(Ĥ) is a finite set of size

|Y(Ĥ)| ≤ (n_1 choose (η + 2ε/9)n_1)² ≤ n_1^{2(η + 2ε/9)n_1},

we have uniform convergence¹¹ with respect to Y(Ĥ). Recall that we assumed that there is a clean subsample S′_2 ⊆ S_2 which is i.i.d. distributed according to p and of size at least (1 − η − 2ε/9)n_2. We also note that the clean samples S′_1 and S′_2 are drawn independently from each other. Thus, with probability at least 1 − δ/5, S′_2 is (ε/9)-representative of p with respect to Y(Ĥ). For a sample S_0 and a set B ⊆ X, let us denote S_0(B) = |S_0 ∩ B| / |S_0|.

⁹ These sets are also sometimes called Scheffé sets in the literature.
¹⁰ Note that this definition is asymmetric: A_{i,j} ≠ A_{j,i}.
¹¹ A collection of sets W has the uniform convergence property if for every (ε, δ) ∈ (0, 1)² there is a number m_W(ε, δ) such that for every probability distribution p, with probability at least 1 − δ over samples S of size n > m_W(ε, δ) generated i.i.d. by p, the sample S is ε-representative for W with respect to p; namely, for every A ∈ W, | |S ∩ A|/|S| − p(A) | ≤ ε. If W is finite, then m_W(ε, δ) = O((log |W| + log(1/δ))/ε²). For more details see Chapter 4 in [SB14].
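The construction of Ĥ just described is simple to phrase as code. Here is a deliberately brute-force sketch, where learn_realizable stands in for the black-box learner A^re_Q (an assumption of ours); the subset enumeration is exponential in |S_1|, which is irrelevant for sample-complexity purposes. The selection step then applies a rule like yatracos_select above to Ĥ and S_2.

```python
from itertools import combinations

def build_candidates(S1, eta, eps, learn_realizable):
    """All hypotheses obtained by running the realizable learner on every
    subset of S1 of size (1 - eta - 2*eps/9)|S1|, mirroring the set H^."""
    m = int((1 - eta - 2 * eps / 9) * len(S1))
    return [learn_realizable(list(subset)) for subset in combinations(S1, m)]
```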
Because of the (ε/9)-representativeness of S′_2, we have |p(B) − S′_2(B)| ≤ ε/9 for every B ∈ Y(Ĥ). Moreover, since |S_2 ∩ B| = |S′_2 ∩ B| + |(S_2 \ S′_2) ∩ B| and |S_2 \ S′_2| ≤ (η + 2ε/9)n_2, we have, for every B ∈ Y(Ĥ),

S′_2(B)(1 − η − 2ε/9) ≤ S_2(B) ≤ S′_2(B) + η + 2ε/9,

so that |S_2(B) − S′_2(B)| ≤ η + 2ε/9, and therefore

|p(B) − S_2(B)| ≤ |p(B) − S′_2(B)| + |S′_2(B) − S_2(B)| ≤ η + 3ε/9.

Let the empirical A-distance with respect to the Yatracos sets be defined by

d_{Y(Ĥ)}(q, S_0) = sup_{B ∈ Y(Ĥ)} |q(B) − S_0(B)|.

Now, if the learner outputs q̂ ∈ argmin_{q ∈ Ĥ} d_{Y(Ĥ)}(q, S_2), then putting all of our guarantees together, we get that with probability at least 1 − δ,

d_TV(q̂, p) ≤ 2ε/9 + d_{Y(Ĥ)}(q̂, p)
≤ 2ε/9 + (η + 3ε/9) + d_{Y(Ĥ)}(q̂, S_2)
≤ 5ε/9 + η + d_{Y(Ĥ)}(q*, S_2)
≤ 5ε/9 + η + (η + 3ε/9) + d_{Y(Ĥ)}(q*, p)
≤ 8ε/9 + 2η + d_TV(q*, p)
≤ 2η + ε,

which establishes that Q is η-additively 2-robustly learnable.

C Robust Learnability with Subtractive Contamination Implies Robust Learnability with General Contamination

In this section we provide the proof of Theorem 1.7.

Theorem 1.7. If a class C is η-subtractively α-robustly learnable, then it is also η-(2α + 4)-robustly learnable.

Proof. Let C be a concept class that is η-subtractively α-robustly learnable. Then there exists a successful η-subtractive α-robust learner A^sub_C with sample complexity n^sub_C for the class C. Let ε < 9η/2 and δ ∈ (0, 1) be arbitrary. Let

n_1 ≥ max{2 n^sub_C(ε/9, δ/5), 162(1 + log(5/δ))/ε²} and
n_2 ≥ max{162(4(η + 2ε/9) n_1 log(n_1) + log(5/δ))/ε², 162(1 + log(5/δ))/ε²},

and lastly let n = n_1 + n_2. Let p be an arbitrary distribution with d_TV(p, C) ≤ η. The (2α + 4)-robust learner receives a sample S ~ p^n, and there is q ∈ C such that d_TV(p, q) ≤ η. Thus there exist distributions q_1, q_2, q_3 such that

(1 − η)q_1 + ηq_2 = p and (1 − η)q_1 + ηq_3 = q.

We now use the same learning strategy as in Theorem 1.5: we split the sample randomly into two subsamples S_1 and S_2, where we use S_1 to generate candidate hypotheses and then use S_2 to select a hypothesis from the candidate set. The goal in both settings is to find as close an approximation to q_1 as possible. The candidate set based on S_1 is created by feeding subsamples of S_1 into the subtractively robust learner in such a way that, with high probability, one of the subsamples is guaranteed to be i.i.d. generated by q_1 and thus (with high probability) to yield a good hypothesis.

More precisely, the learner randomly splits the sample S into S_1 and S_2 with |S_1| = n_1 and |S_2| = n_2. We now consider the "clean" part of S, i.e., the part S′ ⊆ S that is i.i.d. distributed according to q_1. We note that the size of this "clean" sample, |S′| = n′, is a random variable distributed according to the binomial distribution Bin(n, 1 − η). Applying a Chernoff bound, with the same argument as in the proof of Theorem 1.5, we get that with probability at least 1 − δ/5 we have n′ ≥ n(1 − η − ε/9). Now let S′_1 = S_1 ∩ S′ and S′_2 = S_2 ∩ S′ be the "clean parts" of the subsamples S_1 and S_2, respectively, with sizes |S′_1| = n′_1 and |S′_2| = n′_2. We note that n′_1 ~ Hypergeometric(n, n′, n_1) and n′_2 ~ Hypergeometric(n, n′, n_2). Thus,

Pr[|S′_1| ≤ (1 − η − 2ε/9)n_1] ≤ δ/5 and Pr[|S′_2| ≤ (1 − η − 2ε/9)n_2] ≤ δ/5,

where the probability is over the random split. Taking together the guarantees on our random split and on the size of n′, we note that by our choices of n_1 and n_2, with probability at least 1 − 3δ/5, the fractions of the parts that are i.i.d. generated by q_1 (namely |S′_1|/|S_1| and |S′_2|/|S_2|) are each at least 1 − η − 2ε/9.
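The two concentration steps above (a binomial bound on the size of the clean part, then a hypergeometric bound on how it splits) are easy to sanity-check numerically. The following snippet is a simulation sketch with arbitrary illustrative parameters; it is not part of the formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, eps, trials = 20_000, 0.1, 0.3, 1_000
n1 = n // 2

# Size of the clean part: n' ~ Bin(n, 1 - eta).
n_clean = rng.binomial(n, 1 - eta, size=trials)
# Clean points landing in a random half S1: Hypergeometric(n, n', n1).
n1_clean = rng.hypergeometric(n_clean, n - n_clean, n1)

bad = np.mean(n1_clean / n1 <= 1 - eta - 2 * eps / 9)
print(f"empirical P[clean fraction of S1 <= 1 - eta - 2*eps/9] = {bad:.4f}")
```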
Going forward we will assume that this is indeed the case. Let

Ĥ = {A^sub_C(S̄) : S̄ ⊆ S_1 with |S̄| = (1 − η − 2ε/9)n_1}.

Using our assumption that |S′_1| ≥ (1 − η − 2ε/9)n_1, we know that there is S̄′_1 ⊆ S′_1 with A^sub_C(S̄′_1) ∈ Ĥ. As S̄′_1 ~ q_1^{(1 − η − 2ε/9)n_1} and q_1 is an η-subtractive contamination of q ∈ C, by the learning guarantee of A^sub_C, with probability at least 1 − δ/5 there is a candidate distribution q* ∈ Ĥ with

d_TV(p, q*) ≤ d_TV(p, q_1) + d_TV(q_1, q*) ≤ η + (αη + ε/9) = (α + 1)η + ε/9.

We now consider the Yatracos sets. For every q_i, q_j ∈ Ĥ, let A_{i,j} = {x : q_i(x) > q_j(x)}, and let Y(Ĥ) = {A_{i,j} ⊆ X : q_i, q_j ∈ Ĥ}. We again consider the A-distance [KBG04] between two distributions with respect to the Yatracos sets, i.e., d_{Y(Ĥ)}(p′, q′) = sup_{B ∈ Y(Ĥ)} |p′(B) − q′(B)|. We note that for any two distributions p′, q′ we have d_{Y(Ĥ)}(p′, q′) ≤ d_TV(p′, q′). Furthermore, if p′, q′ ∈ Ĥ, then d_TV(p′, q′) = d_{Y(Ĥ)}(p′, q′).

Assume there is q* ∈ Ĥ with d_TV(p, q*) ≤ (α + 1)η + ε/9. Then for every q ∈ Ĥ:

d_TV(p, q) ≤ d_TV(p, q*) + d_TV(q*, q)
≤ (α + 1)η + ε/9 + d_{Y(Ĥ)}(q*, q)
≤ (α + 1)η + ε/9 + d_{Y(Ĥ)}(q*, p) + d_{Y(Ĥ)}(p, q)
≤ (α + 1)η + ε/9 + d_TV(q*, p) + d_{Y(Ĥ)}(p, q)
≤ 2(α + 1)η + 2ε/9 + d_{Y(Ĥ)}(p, q).

Lastly, we will argue that we can empirically approximate d_{Y(Ĥ)}(q_1, q), which we can then use to select a hypothesis. We note that, since Y(Ĥ) is a finite set of size |Y(Ĥ)| ≤ (n_1 choose (1 − η − 2ε/9)n_1)² ≤ n_1^{2(η + 2ε/9)n_1}, we have uniform convergence with respect to Y(Ĥ). Recall that S′_2 ~ q_1^{n′_2} and, by our previous assumption, n′_2 ≥ n_2(1 − η − 2ε/9). Thus, by our choice of n_2, with probability at least 1 − δ/5, there is S̄′_2 ⊆ S′_2 ⊆ S_2 with |S̄′_2| = (1 − η − 2ε/9)n_2 such that S̄′_2 is (ε/9)-representative of q_1 with respect to Y(Ĥ). For a sample S_0 and a set B ⊆ X, let us again denote S_0(B) = |S_0 ∩ B| / |S_0|.

Because of the (ε/9)-representativeness of S̄′_2, we have |q_1(B) − S̄′_2(B)| ≤ ε/9 for every B ∈ Y(Ĥ). Moreover, since |S_2 ∩ B| = |S̄′_2 ∩ B| + |(S_2 \ S̄′_2) ∩ B| and |S_2 \ S̄′_2| = (η + 2ε/9)n_2, the same computation as in the proof of Theorem 1.5 gives |S_2(B) − S̄′_2(B)| ≤ η + 2ε/9, and hence, for every B ∈ Y(Ĥ),

|q_1(B) − S_2(B)| ≤ |q_1(B) − S̄′_2(B)| + |S̄′_2(B) − S_2(B)| ≤ η + 3ε/9.

Let us recall that the empirical A-distance with respect to the Yatracos sets is defined by d_{Y(Ĥ)}(q, S_0) = sup_{B ∈ Y(Ĥ)} |q(B) − S_0(B)|. Now if the learner outputs q̂ ∈ argmin_{q ∈ Ĥ} d_{Y(Ĥ)}(q, S_2), then putting all of our guarantees together, with probability at least 1 − δ we get

d_TV(q̂, p) ≤ 2(α + 1)η + 2ε/9 + d_{Y(Ĥ)}(q̂, p)
≤ 2(α + 1)η + 2ε/9 + d_{Y(Ĥ)}(q̂, q_1) + d_{Y(Ĥ)}(q_1, p)
≤ 2(α + 1)η + 2ε/9 + η + (η + 3ε/9) + d_{Y(Ĥ)}(q̂, S_2)
≤ 2(α + 2)η + 5ε/9 + d_{Y(Ĥ)}(q̂, S_2)
≤ 2(α + 2)η + 5ε/9 + d_{Y(Ĥ)}(q*, S_2)
≤ 2(α + 2)η + 5ε/9 + (η + 3ε/9) + d_{Y(Ĥ)}(q*, q_1)
≤ (2α + 3)η + 8ε/9 + η + d_{Y(Ĥ)}(q*, p)
≤ (2α + 4)η + ε + d_TV(q*, p).

D Learnability Does Not Imply Robust Learnability

We start with an upper bound, showing that our class Q_g is realizably learnable.

Claim 3.2. For a monotone function g : N → N, let Q_g = {q_{i,j,g(j)} : i, j ∈ N}. Then the sample complexity of Q_g in the realizable case is upper bounded by n^re_{Q_g}(ε, δ) ≤ log(1/δ) g(1/ε).

Proof. Let the realizable learner A be

A(S) = q_{i,j,g(j)} if (i, 2j + 2) ∈ S, and δ_{(0,0)} otherwise.

Note that for all Q_g-realizable samples this learner is well-defined. Furthermore, we note that in the realizable case, whenever A outputs a distribution different from δ_{(0,0)}, then A(S) outputs the ground-truth distribution, i.e., the output has TV-distance 0 to the true distribution.
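The learner A is concrete enough to write out as code. A minimal sketch follows, where q_from_index and point_mass_at_origin are hypothetical helpers standing in for materializing q_{i,j,g(j)} and δ_{(0,0)}:

```python
def A_realizable(sample, q_from_index, point_mass_at_origin):
    """Claim 3.2's learner: output q_{i,j,g(j)} iff its witness point
    (i, 2j + 2) occurs in the sample; otherwise the point mass at (0, 0)."""
    for (a, b) in sample:
        if b >= 2 and b % 2 == 0:  # second coordinate of the form 2j + 2
            return q_from_index(a, (b - 2) // 2)
    return point_mass_at_origin
```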
Lastly, we note that for an i.i.d. sample S ~ q_{i,j,g(j)}^n, we have the following bound on the probability of the learner identifying the correct distribution:

P_{S ~ q_{i,j,g(j)}^n}[A(S) = q_{i,j,g(j)}] = P_{S ~ q_{i,j,g(j)}^n}[(i, 2j + 2) ∈ S] = 1 − (1 − 1/g(j))^n.

We note that, since g is a monotone function, if ε ≤ 1/j, then g(j) ≤ g(1/ε) and therefore (1 − 1/g(j))^n ≤ (1 − 1/g(1/ε))^n. Furthermore, for q_{i,j,g(j)} we have d_TV(δ_{(0,0)}, q_{i,j,g(j)}) = 1/j. Putting these two observations together, we get

P_{S ~ q_{i,j,g(j)}^n}[d_TV(A(S), q_{i,j,g(j)}) > ε] ≤ (1 − 1/g(1/ε))^n if 1/j > ε, and = 0 otherwise.

Thus, for every q ∈ Q_g,

P_{S ~ q^n}[d_TV(A(S), q) > ε] ≤ (1 − 1/g(1/ε))^n ≤ exp(−n / g(1/ε)).

Letting the right-hand side equal the failure probability δ and solving for n, we get

exp(−n / g(1/ε)) ≤ δ ⟺ n / g(1/ε) ≥ log(1/δ) ⟺ n ≥ log(1/δ) g(1/ε).

Thus, we have a sample complexity bound of n^re_{Q_g}(ε, δ) ≤ log(1/δ) g(1/ε).

Now, we show a lower bound: the class Q_g is not robustly learnable. Before we do that, we require a few more preliminaries. For a distribution class Q and a distribution p, let their total variation distance be defined by d_TV(p, Q) = inf_{q ∈ Q} d_TV(p, q). We also use the following lemma from [LBD23].

Lemma D.1 (Lemma 3 from [LBD23]). Let P_{η,4k} = {(1 − η)δ_{(0,0)} + ηU_{A×{2j+1}} : A ⊆ [4k]}. For Q = P_{η,4k}, we have n^re_Q(η/8, 1/7) ≥ k.

Finally, we recall the definition of weak learnability, which says that a distribution class is learnable only for some particular value of the accuracy parameter.

Definition D.2. A class Q is ε-weakly learnable if there is a learner A and a sample complexity function n : (0, 1) → N such that for every δ ∈ (0, 1), every p ∈ Q, and every n ≥ n(δ),

P_{S ~ p^n}[d_TV(A(S), p) > ε] < δ.

Learnability clearly implies ε-weak learnability for every ε ∈ (0, 1). While in some learning models (e.g., binary classification) learnability and weak learnability are equivalent, the same is not true for distribution learning [LBD23]. We are now ready to prove that Q_g is not robustly learnable.

Claim 3.3. For every function g ∈ ω(n), the class Q_g is not α-robustly learnable for any α > 0.

Proof. Consider

q′_{i,j} = (1 − 1/j)δ_{(0,0)} + (1/j)U_{A_i×{2j+1}}.

Note that d_TV(q′_{i,j}, Q_g) = 1/g(j). Therefore, in order to show that Q_g is not α-robustly learnable, it is sufficient to show that there are j and ε such that the class Q′_j = {q′_{i,j} : i ∈ N} is not (α/g(j) + ε)-weakly learnable.

We will now show that for any γ < 1/(8j), the class Q′_j is not γ-weakly learnable. Recalling the notation from Lemma D.1, we note that for every n ∈ N the class P_{1/j,4n} ⊆ Q′_j. By monotonicity of the sample complexity and Lemma D.1, we have n_{Q′_j}(1/(8j), 1/7) ≥ n_{P_{1/j,4n}}(1/(8j), 1/7) ≥ n for every n ∈ N, proving that this class is not weakly learnable at any accuracy γ < 1/(8j).

Lastly, we need to show that for every α there are ε and j such that the claim above applies with γ = α/g(j) + ε; that is, we need α/g(j) + ε < 1/(8j). Let ε = 1/(16j). Now, g is super-linear, i.e., for every c ∈ R there is t_c ∈ N such that for every t ≥ t_c, g(t) ≥ ct. This implies that for any α ∈ R there is j ∈ N such that g(j) > 16jα, and hence α/g(j) + 1/(16j) < 1/(16j) + 1/(16j) = 1/(8j). Thus, for any super-linear function g and any α ∈ R, the class Q_g is not α-robustly learnable.

D.1 Proof of Theorem 1.6

The result of Theorem 1.6 follows directly from the construction of the class Q_g for Theorem 3.1, Claim 3.2, which shows that this class is realizably learnable, and the following adapted version of Claim 3.3:

Claim D.3. For every α, there is g(t) ∈ O(t²) such that for every 0 < η ≤ 1/(16α), the class Q_g is not η-subtractively α-robustly learnable.

Proof. Let α > 1 be arbitrary. Let g : N → N be defined by g(t) = 32αt² for all t ∈ N. Now for every 0 < η ≤ 1/(16α), there exists some j such that j/g(j) ≤ η ≤ 1/(16αj).
For such j, we consider the distributions q′_{i,j} = (1 − 1/j)δ_{(0,0)} + (1/j)U_{A_i×{2j+1}} as in the proof of Claim 3.3. Recall that the elements of Q_g are of the form

q_{i,j,g(j)} = (1 − 1/j)δ_{(0,0)} + (1/j − 1/g(j))U_{A_i×{2j+1}} + (1/g(j))δ_{(i,2j+2)}.

Then we have

q_{i,j,g(j)} = (1 − η)q′_{i,j} + η p_{i,j}, where p_{i,j} = q′_{i,j} + (1/(ηg(j)))(δ_{(i,2j+2)} − U_{A_i×{2j+1}}).

Note that p_{i,j} is a valid probability distribution: its total mass is 1, and the mass it assigns to U_{A_i×{2j+1}} is 1/j − 1/(ηg(j)) ≥ 0, since η ≥ j/g(j).

Thus, for every element q′_{i,j} of the class Q′_j = {q′_{i,j} : i ∈ N}, there is a distribution p such that (1 − η)q′_{i,j} + ηp ∈ Q_g. That is, every element of Q′_j results from the η-subtractive contamination of some element of Q_g. Thus, for showing that Q_g is not η-subtractively α-robustly learnable, it is sufficient to show that Q′_j is not (αη + ε)-weakly learnable for ε = 1/(16j). As we have seen in the proof of Claim 3.3, we can use Lemma D.1 to show that n_{Q′_j}(1/(8j), 1/7) ≥ n for every n. Lastly, we need that 1/(8j) ≥ αη + ε; after substituting ε = 1/(16j), this amounts to 1/(16j) ≥ αη, which follows directly from the choice of g, since η ≤ 1/(16αj).

E Existence of sample compression schemes

We adopt the definition of sample compression schemes of [ABDH+20]. We let C be a class of distributions over some domain X. A compression scheme for C involves two parties: an encoder and a decoder. The encoder knows some distribution q ∈ C and receives n samples from q. They send a succinct message (dependent on q) to the decoder, which will allow the decoder to output a distribution close to q. This message consists of a subset of size τ of the n samples, as well as t additional bits. The decoder receives the τ samples and t bits and outputs a distribution which is close to q. Since this process inherently involves randomness (of the samples drawn from q), we only require that this interaction succeeds at outputting a distribution close to q with constant probability. More formally, we have the following definitions of a decoder and of a (robust) compression scheme.

Definition E.1 (decoder, Definition 4.1 of [ABDH+20]). A decoder for C is a deterministic function J : ⋃_{n=0}^∞ X^n × ⋃_{n=0}^∞ {0,1}^n → C, which takes a finite sequence of elements of X and a finite sequence of bits, and outputs a member of C.

The formal definition of a compression scheme follows.

Definition E.2 (robust compression schemes, Definition 4.2 of [ABDH+20]). Let τ, t, n : (0, 1) → Z_{≥0} be functions, and let r ≥ 0. We say C admits (τ, t, n) r-robust compression if there exists a decoder J for C such that for any distribution q ∈ C and any distribution p on X with d_TV(p, q) ≤ r, the following holds: for any ε ∈ (0, 1), if a sample S is drawn from p^{n(ε)}, then, with probability at least 2/3, there exists a sequence L of at most τ(ε) elements of S, and a sequence B of at most t(ε) bits, such that d_TV(J(L, B), q) ≤ r + ε.

Note that S and L are sequences rather than sets, and can potentially contain repetitions.

Theorem E.3 (Compression implies learning, Theorem 4.5 of [ABDH+20]). Suppose C admits (τ, t, n) r-robust compression. Let τ′(ε) = τ(ε) + t(ε). Then C can be max{3, 2/r}-learned in the agnostic setting using

O( n(ε/6) log(1/δ) + (τ′(ε/6) log(n(ε/6) log(1/δ)) + log(1/δ)) / ε² )

samples. If Q admits (τ, t, n) non-robust compression, then Q can be learned in the realizable setting using the same number of samples.

Theorem E.4. The class Q = Q_g from Section 3 has a compression scheme of message size 1 (i.e., using just a single sample point).
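Before the formal proof, here is a hedged sketch of the one-point encoder; the decoder simply applies the rule of the learner A from Claim 3.2 (see the A_realizable sketch in Appendix D) to the transmitted point. The tuple encoding of domain points is our own convention.

```python
def encode(sample):
    """One-point compression for Q_g: transmit the witness (i, 2j + 2)
    if the sample contains one, and the default point (0, 0) otherwise."""
    for (a, b) in sample:
        if b >= 2 and b % 2 == 0:  # witness points have even second coordinate >= 2
            return [(a, b)]
    return [(0, 0)]
```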
Proof. Let n(ε) = 10 g(1/ε) and, given a sample S of at least that size, let the encoder pick the subset L(S) ⊆ S given by

L(S) = {(i, 2j + 2)} if (i, 2j + 2) ∈ S, and {(0, 0)} otherwise.

Let the decoder output

J(L) = q_{i,j,g(j)} if (i, 2j + 2) ∈ L, and δ_{(0,0)} otherwise.

With this construction established, the analysis follows very similarly to the analysis in the proof of Claim 3.2. We note that the claim of Theorem 1.8 (and Theorem 3.4) follows directly.

F Approximate DP learnability vs robust learnability

We prove the second claim in Theorem 4.3 by showing that the learner for the class Q described in Theorem 3.1 can be made differentially private by employing stability-based histograms [KKMN09, BNS16].

Proposition F.1 (Stability-based histograms [KKMN09, BNS16]; Lemma 4.1 of [AAL21]). Let X be a domain of examples. Let K be a countable index set, and let (h_k)_{k ∈ K} be a sequence of disjoint histogram bins over X. For every (α, β, ε, δ) ∈ (0, 1)⁴, there is an (ε, δ)-DP algorithm that takes a dataset S of size n and, with probability at least 1 − β, outputs bin frequency estimates (f_k)_{k ∈ K} such that

| f_k − |{x ∈ S : x ∈ h_k}| / n | ≤ α for all k ∈ K,

so long as n ≥ Ω( log(1/(βδ)) / (αε) ).

Proof of Theorem 4.3. Recall the realizably-but-not-robustly-learnable class of distributions Q_g = {q_{i,j,g(j)} : i, j ∈ N} over N² from Theorem 3.1, where g : N → N is a monotone, super-linear function and q_{i,j,k} is as defined in (1). Fix (α, β, ε, δ) ∈ (0, 1)⁴. We define our DP learner A_DP for Q_g as follows: we take a sample S of size

n ≥ Ω( 4 g(1/α) log(2/(βδ)) / ε ) + 32 log(2/β) g(1/α)

and run the stability-based histogram from Proposition F.1 targeting (ε, δ)-DP, with singleton histogram buckets (h_{(a,b)})_{(a,b) ∈ N²}, where each h_{(a,b)} = {(a, b)}. Applying Proposition F.1 with accuracy parameter 1/(4g(1/α)) and failure probability β/2 (which the first term in the bound on n permits), this yields frequency estimates (f_{(a,b)})_{(a,b) ∈ N²} satisfying

| f_{(a,b)} − |{x ∈ S : x = (a, b)}| / n | ≤ 1/(4g(1/α)) for all (a, b) ∈ N²

with probability at least 1 − β/2. Then let

A_DP(S) = q_{i,j,g(j)} if f_{(i,2j+2)} ≥ 1/(2g(1/α)), and δ_{(0,0)} otherwise.

Note that by post-processing, A_DP is indeed (ε, δ)-DP. Now suppose q_{i,j,g(j)} is our unknown distribution. There are two cases:

If 1/j ≤ α, then, conditioned on the success of the histogram algorithm, the only possible outputs of A_DP are δ_{(0,0)} and q_{i,j,g(j)}: any bin passing the threshold 1/(2g(1/α)) must contain a sample point, and (i, 2j + 2) is the only point of the form (a, 2b + 2) in the support of q_{i,j,g(j)}. Since d_TV(δ_{(0,0)}, q_{i,j,g(j)}) = 1/j ≤ α, we get d_TV(A_DP(S), q_{i,j,g(j)}) ≤ α with probability at least 1 − β/2.

If 1/j > α, then g(j) ≤ g(1/α), so the point (i, 2j + 2) has probability mass 1/g(j) ≥ 1/g(1/α). By the second term in the bound on n and the Chernoff bound (Proposition A.2), we can conclude that

P[ |{x ∈ S : x = (i, 2j + 2)}| ≤ (3/(4g(1/α))) n ] ≤ β/2.

If the above event does not occur and the histogram algorithm does not fail, then f_{(i,2j+2)} ≥ 3/(4g(1/α)) − 1/(4g(1/α)) = 1/(2g(1/α)), and A_DP outputs q_{i,j,g(j)} exactly. So d_TV(A_DP(S), q_{i,j,g(j)}) = 0 with probability at least 1 − β.
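To summarize the construction, here is a hedged Python sketch of A_DP built on a standard stability-based histogram (Laplace noise on non-empty bins plus a stability threshold). The helpers q_from_index and point_mass_at_origin are hypothetical, g is assumed callable on reals, and the threshold constant follows the usual stability-histogram analysis rather than the exact statement of Proposition F.1.

```python
import numpy as np
from collections import Counter

def dp_learner(sample, alpha, eps, delta, g,
               q_from_index, point_mass_at_origin, rng=None):
    """Sketch of A_DP: a stability-based histogram over singleton bins,
    followed by the privacy-free post-processing rule of Theorem 4.3."""
    rng = rng or np.random.default_rng()
    n = len(sample)
    # Noise only non-empty bins; suppress any bin whose noisy count falls
    # below the stability threshold. The suppression step is what yields
    # (eps, delta)-DP over the countably infinite domain N^2.
    threshold = 1 + 2 * np.log(2 / delta) / eps
    released = {}
    for point, count in Counter(sample).items():
        noisy = count + rng.laplace(scale=2 / eps)
        if noisy >= threshold:
            released[point] = noisy / n
    # Post-processing: output q_{i,j,g(j)} iff its witness bin is heavy.
    for (a, b), freq in released.items():
        if b >= 2 and b % 2 == 0 and freq >= 1 / (2 * g(1 / alpha)):
            return q_from_index(a, (b - 2) // 2)
    return point_mass_at_origin
```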