Representation Preserving Multiclass Agnostic to Realizable Reduction

Steve Hanneke 1 Qinglin Meng 1 Amirreza Shaeiri 1

1Department of Computer Science, Purdue University, West Lafayette, IN, USA.

Abstract

We study the problem of multiclass classification when the number of labels can be unbounded within the PAC learning framework (Valiant, 1984). Our main contribution is a theory that demonstrates a simple and elegant agnostic to realizable reduction for this framework. This resolves an open problem raised by the recent work of (Hopkins et al., 2022). Notably, our result is the first representation preserving multiclass agnostic to realizable reduction, in contrast with the compression based approach of the work of (David et al., 2016). Furthermore, our main theorem is stated in an abstract framework, called Unified PAC Learning, which encompasses a range of frameworks, including multiclass PAC learning, list PAC learning, and multilabel PAC learning. In addition, we explore representation preserving reductions to the realizable setting for two noise models, namely Massart noise and Tsybakov noise, in the multiclass PAC learning framework. We believe our technique may find other applications in ensuing studies of theoretical machine learning.

1. Introduction

In many frameworks within the field of statistical learning theory, a surprising equivalence emerges between realizable and agnostic learnability, where both are characterized by the same quantity despite their inherent difference. This phenomenon encompasses fundamental frameworks, including binary PAC learning, multiclass PAC learning, list PAC learning, binary online learning, and multiclass online learning, among others. Nonetheless, the proof techniques employed to establish the realizable and agnostic results can differ significantly in many of these frameworks. This observation raises an intriguing question: Do we possess
Correspondence to: Steve Hanneke.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

a unified theory that underlies these equivalences? Even more optimistically, could there exist a simple and elegant unified theory, both in terms of the analysis and reduction properties, that accounts for these equivalences?

The seminal work of (David et al., 2016) provides a basis for developing a unified theoretical framework. In fact, for almost a decade, their sample compression scheme based approach, which involves boosting, has served as the only tool for theorists to prove agnostic learnability in frameworks such as partial PAC learning, multiclass PAC learning, and list PAC learning, where the number of labels can be unbounded in the latter two cases. However, this approach can produce more complex functions than those output by a realizable learner, due to its reliance on the boosting technique, which combines multiple realizable learners. In particular, consider the space of functions that a learner may output in the realizable setting. Then, in the agnostic setting, the functions produced by this approach do not necessarily belong to that space.

In addition to the mentioned reduction, the recent work of (Hopkins et al., 2022) provides an affirmative answer to the aforementioned questions across various frameworks. More specifically, they introduced a three-line algorithm accompanied by a relatively simple proof technique, which effectively demonstrates the existence of reductions in many frameworks. Moreover, their algorithm uncovers some previously unknown equivalences between realizable and agnostic learnability. However, despite these advancements, their approach does not extend to important problems, such as the multiclass PAC learning framework with a possibly unbounded label space.1 We suggest readers refer to section 2 for a detailed discussion on the significance of this setting.
Consequently, they posed the following open question: Is there a unifying yet still simple and elegant technique that can also address the multiclass PAC learning framework with a possibly infinite label space? In this work, we provide a POSITIVE ANSWER to their open question. Specifically, we demonstrate that a clever modification of their three-line algorithm indeed effectively reduces agnostic learnability to realizable learnability in frameworks such as multiclass learning with potentially unbounded label spaces.

1This is discussed formally in Appendix A.

Additionally, our algorithm possesses a UNIQUE PROPERTY inherited from the algorithm of (Hopkins et al., 2022). To illustrate, consider the space of functions that a learner may output in the realizable setting. Our reduction always outputs a function that belongs to that space, which is in sharp contrast to the approach of (David et al., 2016), as we discussed. Going further, our reduction even preserves data-dependent properties of the output function of the realizable setting, such as the geometric margin used in Support Vector Machines, rather than merely producing a function from the mentioned space. As a result, our work relates to the line of research focused on learning with simple predictors, which has gained significant attention in recent years. For instance, the work of (Hanneke et al., 2021) explored this topic in the context of online learning, while (Aden-Ali et al., 2024; Bousquet et al., 2020) studied it in the context of binary PAC learning. Therefore, our technique provides a valuable new tool for theorists to utilize in their problems when required. In fact, we state our main theorem within a new abstract framework, which we call UNIFIED PAC LEARNING.
This framework encompasses a range of frameworks, including multiclass PAC learning (for instance, see the work of (Brukhim et al., 2022)), list PAC learning (for instance, see (Charikar & Pabbaraju, 2023)), and multilabel PAC learning. As a result, our finding reveals some new, previously unknown equivalences and offers new insights into the relationship between realizable and agnostic learnability. We believe that Unified PAC Learning offers a promising avenue for further exploration of realizable learnability in future research.

Subsequently, we delve into the exploration of NOISE MODELS, which occupy an intermediate position between the realizable and agnostic settings; see (Tsybakov, 2004; Massart & Nédélec, 2006; Agarwal, 2013; Hanneke & Yang, 2015) for more details. In particular, we focus on two very popular noise models, namely Massart noise and Tsybakov noise, for the multiclass PAC learning framework. Using almost the same technique, we demonstrate that focusing on noise models, rather than the agnostic setting, allows for an improvement in the sample complexity.

Finally, we extend our main theorem on the Unified PAC Learning framework to address PARTIAL CONCEPT CLASSES as well. We recommend readers refer to the work of (Alon et al., 2022) for a comprehensive discussion on the importance of this setting. While the same algorithm can be applied to partial concept classes, the analysis becomes more involved. Notably, the work of (Hopkins et al., 2022) also addresses this setting; however, they adopt a different definition of agnostic learnability. In contrast, we employ the definition provided in the recent work of (Alon et al., 2022) on partial concept classes, which is more favorable in this context.

1.1. Overview of the Main Result and Technique

In the following subsection, we provide an overview of our main result along with a summary of the primary proof technique in our work. Before proceeding, let us informally define the Unified PAC Learning framework.
In the Unified PAC Learning framework, we have an instance space X, an output space Y, an action space A, and a loss function L : A × Y → {0, 1}. A predictor is a function from X to A. Moreover, we have a set of predictors C as a concept class. In addition, we have another set of predictors H as a hypothesis class. As a result, a 6-tuple consisting of X, Y, A, L, C, and H specifies an instance of this framework. Importantly, both the output space Y and the action space A can be unbounded. Furthermore, in this framework, a learning algorithm maps a sequence of instance-output pairs of any size to a predictor. Also, a data distribution is defined on X × Y. Based on these, we adopt the definitions of loss for a given data distribution and predictor, realizable data distribution, realizable PAC learnability, and agnostic PAC learnability within this framework. Importantly, in both the definitions of realizable PAC learnability and agnostic PAC learnability, we require that the image of the learning algorithm be a subset of H. See section 3 for formal notations and definitions.

In this paragraph, we provide a concise explanation of how our framework can be utilized to encompass various settings, illustrated through two examples. For instance, by defining Y as a set of labels and A as {a | a ⊆ Y, |a| ≤ L} for some L ∈ N, L ≤ |Y|, we can effectively capture the list PAC learning framework of (Charikar & Pabbaraju, 2023). Conversely, A can also be interpreted as a set of labels; then, by specifying Y as {y | y ⊆ A, |y| ≤ L} for some L ∈ N, L ≤ |A|, it becomes possible to accommodate the multilabel PAC learning framework. It is worth noting that this definition is a special case of the general multilabel setting in practice, which we find particularly interesting as it serves as a dual formulation of list learning. For further details, refer to section 4. Now, we informally present our main theorem in the current work.
Theorem 1.1 (Realizable Unified PAC Learnability ⟹ Agnostic Unified PAC Learnability). Let Q = (X, Y, A, L, C, H) be an instance of the Unified PAC Learning framework. If Q is realizable PAC learnable, then Q is agnostic PAC learnable.

The key idea behind the proof of the above theorem is to run the realizable learning algorithm on each subset of the samples, followed by applying empirical risk minimization (ERM) on the resulting finite collection of predictors (Algorithm 1). Moreover, by applying a concentration bound to this finite set of predictors, combined with the error rate guarantee from the realizable PAC learnability for the predictors generated from each subset, we can get the guarantee of agnostic learnability. See section 4 for more details. As a direct corollary of the above theorem, one can derive a reduction from agnostic PAC learnability to realizable PAC learnability for the multiclass PAC learning framework, as described in subsection 4.2. Additionally, our result reveals new equivalences, including a previously unknown equivalence between realizable multilabel PAC learnability and agnostic multilabel PAC learnability, as discussed in subsection 4.4. Furthermore, building upon the work of (David et al., 2016), we establish another reduction for the Unified PAC Learning framework. While this reduction is more favorable from the sample complexity perspective, crucially, it does not retain properties such as representation preservation. In particular, given an accuracy parameter ϵ ∈ (0, 1), our agnostic result involves a factor of 1/ϵ³, compared to the more desirable factor of 1/ϵ² that can be obtained using a sample compression scheme based proof. For further details, refer to section 4 and Appendix C.
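The reduction itself is short; the following Python sketch is our own hypothetical rendering of the two steps described above (run the realizable learner on every subset, then apply ERM), with a trivial constant-function learner standing in for the realizable learner. None of these names come from the paper.

```python
from itertools import combinations

def all_subsets(data):
    """Every subset of the holdout V (exponentially many: the reduction
    trades computation for its representation-preserving guarantee)."""
    for r in range(len(data) + 1):
        yield from combinations(data, r)

def empirical_risk(h, data, loss):
    # Average loss of predictor h over a labeled data set.
    return sum(loss(h(x), y) for x, y in data) / max(len(data), 1)

def agnostic_from_realizable(realizable_learner, sample, loss):
    """Sketch of the reduction: split S into V and T, run the realizable
    learner on every subset of V, then pick the ERM winner over T."""
    mid = len(sample) // 2
    V, T = sample[:mid], sample[mid:]
    H_V = [realizable_learner(list(sub)) for sub in all_subsets(V)]
    return min(H_V, key=lambda h: empirical_risk(h, T, loss))

# Toy instance: constant predictors under the 0-1 multiclass loss.
zero_one = lambda a, y: 0 if a == y else 1

def constant_learner(data):
    # Outputs a constant function consistent with its (realizable) input.
    y0 = data[0][1] if data else 0
    return lambda x: y0

S = [(x, 1) for x in range(6)] + [(6, 2)]  # mostly label 1, one noisy point
h = agnostic_from_realizable(constant_learner, S, zero_one)
assert h(0) == 1  # ERM over the subsets' outputs recovers the majority label
```

Running the realizable learner on subsets, rather than on all of V, is what handles the noisy points: some subset avoids them entirely, and the ERM step then finds (a hypothesis as good as) the one trained on that clean subset.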
However, for every n, m ∈ N, we give an example of an instance of the Unified PAC Learning framework Q = (X, Y, A, L, C, H) with |X| = n and |Y| = |A| = |C| = |H| = m for which the following holds: for every ϵ, δ ∈ (0, 1), there exists an Ω(1/ϵ²) gap between the optimal sample complexity of realizable PAC learning with parameters ϵ and δ and the optimal sample complexity of agnostic PAC learning with the same parameters. In particular, this example lies within the multiclass PAC learning framework. Moreover, it builds upon the concept class of constant functions over some domain. For further details, refer to section 6.

1.2. Organization

The remainder of this paper is structured as follows. In section 2, we discuss a broader range of related works. Then, in section 3, we present the formal notations and definitions. Subsequently, in section 4, we establish our main reduction along with some corollaries. Following this, in section 5, we explore noise models in the multiclass PAC learning framework. Then, in section 6, we present our lower bound. Finally, in section 7, we conclude the manuscript.

2. Related Work

PAC Learning. The Probably Approximately Correct (PAC) learning framework, introduced by (Valiant, 1984), has been a cornerstone in the field of statistical learning theory. The works of (Blumer et al., 1989; Vapnik & Chervonenkis, 2015; Valiant, 1984; Vapnik, 2006) characterize learnable classes within the binary PAC learning framework in the realizable setting via a combinatorial parameter called the VC dimension. This result was later extended to the agnostic setting by (Haussler, 1992). Since then, PAC learning has been extensively studied in various learning theoretic settings.

Multiclass Classification. A large body of theoretical research has been devoted to studying multiclass classification in different frameworks.
This includes contributions from (Natarajan & Tadepalli, 1988; Natarajan, 1989; Ben-David et al., 1992; Haussler & Long, 1995; Rubinstein et al., 2006; Daniely et al., 2011; 2012; Daniely & Shalev-Shwartz, 2014; Brukhim et al., 2021). Nevertheless, the combinatorial characterization of multiclass classification, when the number of labels can be unbounded, within Valiant's PAC learning framework remained an open question until recently, even in the realizable setting. The seminal work of (Brukhim et al., 2022) addressed this gap. Furthermore, the same dimension also characterizes the agnostic version of this problem (David et al., 2016). Also, see the recent work of (Hanneke et al., 2023). There are several reasons motivating this interest in multiclass classification when the number of labels can be unbounded. First, in multiclass settings, guarantees should ideally not depend on the number of labels, even when finite. Second, mathematical concepts that involve infinity often offer clearer and more elegant insights. Finally, from a practical standpoint, many critical machine learning tasks require classification into very large label spaces; image object recognition is one such task.

3. Preliminaries

In this section, we introduce the necessary notations and definitions that will be used throughout the remainder of this work.

3.1. Notations

We begin by introducing some necessary notation before describing and analyzing our algorithm. Fix a non-empty set X, equipped with a σ-algebra specifying the measurable subsets, as an instance space. Fix a non-empty set Y as an output space. Also, fix a non-empty set A as an action space. In addition, fix a loss function L : A × Y → {0, 1}. A predictor is a function from X to A. With this in mind, fix a non-empty set of predictors C as a concept class. Additionally, fix another set of predictors H as a hypothesis class. Notably, we implicitly assume standard measurability assumptions on C and H.
In particular, a 6-tuple Q = (X, Y, A, L, C, H) specifies an instance of the Unified PAC Learning framework. A learning algorithm A is a mapping from (X × Y)* to a predictor ĥ. Also, a data distribution D is a probability measure on X × Y. Finally, we use O(·), o(·), Ω(·), ω(·), and Θ(·) as is standard in theoretical computer science. We also use Õ(·), Ω̃(·), and Θ̃(·) to suppress logarithmic factors as well as constant coefficients.

3.2. Definitions

Given the above notations, we define the true risk and the empirical risk. For a data distribution D and a predictor ĥ, define the true risk of ĥ with respect to D as follows:

L_D(ĥ) := P_{(x,y)∼D}[L(ĥ(x), y) = 1].

Based on this, we say that a data distribution D is realizable by a concept class C if there exists c* ∈ C such that L_D(c*) = 0. Similarly, we say that a data distribution D is realizable by a hypothesis class H if inf_{h∈H} L_D(h) = 0 holds. For simplicity, we will suppose the infimum inf_{c∈C} L_D(c) is actually achieved by a concept c* ∈ C.2 Next, a data set of size m for some m ∈ N is a subset of X × Y of size m. For such a data set S = {(x1, y1), (x2, y2), . . . , (xm, ym)} and a predictor ĥ, define the empirical risk of ĥ with respect to S as follows:

L_S(ĥ) := (1/m) Σ_{i=1}^{m} L(ĥ(xi), yi). (1)

We say a data set S is realizable by a predictor ĥ if L_S(ĥ) = 0. Subsequently, we present the definitions of realizable and agnostic PAC learnability for this framework.

Definition 3.1 (Unified PAC Learning). We say that Q = (X, Y, A, L, C, H) is realizable Unified PAC learnable if, for every ϵ, δ ∈ (0, 1), there exist a finite M_RE(ϵ, δ) ∈ N and a learning algorithm A with Im(A) ⊆ H such that, for every distribution D on X × Y realizable w.r.t. C, for S ∼ D^{M_RE(ϵ,δ)}, with probability at least 1 − δ, L_D(A(S)) ≤ ϵ.
The value M_RE(ϵ, δ) is called the sample complexity of A, and the optimal sample complexity3 is the minimum achievable value of M_RE(ϵ, δ) for every given ϵ, δ.

Definition 3.2 (Unified Agnostic PAC Learning). We say that Q = (X, Y, A, L, C, H) is agnostic Unified PAC learnable if, for every ϵ, δ ∈ (0, 1), there exist a finite M_AG(ϵ, δ) ∈ N and a learning algorithm A with Im(A) ⊆ H such that, for every distribution D on X × Y, for S ∼ D^{M_AG(ϵ,δ)}, with probability at least 1 − δ, L_D(A(S)) ≤ L_D(c*) + ϵ, where c* is the optimal concept in the concept class C. The value M_AG(ϵ, δ) is called the sample complexity of A, and the optimal sample complexity is the minimum achievable value of M_AG(ϵ, δ) for every given ϵ, δ.

2If not, for any fixed ϵ > 0, we can choose c* such that L_D(c*) ≤ inf_{c∈C} L_D(c) + ϵ/2.
3For brevity, we refer to the optimal sample complexity as sample complexity.

Algorithm 1 Agnostic to Realizable Reduction
Input: Concept class C, hypothesis class H, realizable PAC learner A, labeled sample S.
1: Divide S into two parts, V and T.
2: Run A over all subsets of V to get: H_V := {A(S′) | S′ ⊆ V}.
3: Return the hypothesis in H_V with the lowest empirical error over T.

4. Reduction

We present the unified agnostic-to-realizable reduction in subsection 4.1. The applications of this reduction will be discussed in subsections 4.2, 4.3, and 4.4.

4.1. General Reduction

The following theorem demonstrates a representation preserving reduction from the agnostic to the realizable setting within the Unified PAC Learning framework.

Theorem 4.1 (Agnostic to Realizable). Let A be a unified realizable learner for a learning instance Q = (X, Y, A, L, C, H) with sample complexity M_RE(ϵ, δ). Then Algorithm 1 is also a unified agnostic learner for Q = (X, Y, A, L, C, H) with sample complexity

M_AG(ϵ, δ) = O( (1/ϵ²) · ( max_{µ∈[ϵ/2,1]} { M_RE(ϵ/(2µ), δ/3)/µ } + log(1/δ) ) ).

We emphasize that Theorem 4.1 presents the first representation preserving reduction from agnostic to realizable within the Unified PAC Learning regime.
Its key application to multiclass learning further highlights the first such reduction in this domain. With this settled, we now proceed to prove Theorem 4.1. The analysis naturally divides into two parts, corresponding to steps 2 and 3 of Algorithm 1, respectively. In the first part, we will show that H_V, the set of output hypotheses corresponding to running the realizable learner A over all subsets S′ of V, contains a hypothesis h* which is close to the optimal concept c* in C. More formally, we have the following lemma.

Lemma 4.2. For any distribution D over X × Y, with probability at least 1 − 2δ/3, there exists h* ∈ H_V which satisfies

L_D(h*) ≤ L_D(c*) + ϵ/2,

where c* is the optimal concept in C.

Once we prove this lemma, the second part is to show that step 3, an empirical risk minimization process on H_V over the labeled sample T, gives the desired agnostic hypothesis. Since H_V is finite, a standard Chernoff bound gives that, with probability at least 1 − δ/3, the empirical risk L_T(h) of every hypothesis h in H_V is ϵ/4-close to its true risk L_D(h), as long as T is sufficiently large. Combining these two parts, Algorithm 1 returns a hypothesis with error rate at most L_D(c*) + ϵ with high probability. It remains to prove Lemma 4.2. The key observation is that the output A(S′) of running A over a subset S′ realizable by c* is close to c*.

Proof of Lemma 4.2. The optimal concept c* may not be realizable with respect to the distribution D over X × Y. We therefore divide the sample space X × Y into two parts according to whether each pair is realizable with respect to c*. Let (X × Y)* denote the realizable part, and let D* denote the restriction of D to it. Similarly, let D̄* denote the restriction of D to the complement, that is, (X × Y) \ (X × Y)*. With this in mind, let µ* denote the probability mass of D on (X × Y)*.
Since we are restricting our attention to classification error, we can decompose the true risk of c* over D as

L_D(c*) = µ* L_{D*}(c*) + (1 − µ*) L_{D̄*}(c*) = 1 − µ*,

where the first step is by the decomposition of the sample space X × Y, and the last step follows from the observation that L_{D*}(c*) = 0 and L_{D̄*}(c*) = 1 always hold by definition. To get a hypothesis within ϵ/2 true risk of L_D(c*), we claim that it is sufficient to prove that H_V contains some h with true risk at most ϵ/(2µ*) over D*, that is, some h satisfying

L_{D*}(h) ≤ ϵ/(2µ*). (2)

This follows from an analysis similar to that of L_D(c*) above. We can decompose L_D(h) as

L_D(h) = µ* L_{D*}(h) + (1 − µ*) L_{D̄*}(h) ≤ µ* · ϵ/(2µ*) + (1 − µ*) = ϵ/2 + L_D(c*).

It remains to prove the claim that H_V contains a hypothesis satisfying Equation (2) with high probability. Notice that S′ can be seen as drawn from the distribution D*. Then, by the definition of realizable learning, on a labeled sample S′ ∼ D* of size M_RE(ϵ/(2µ*), δ/3), the realizable learner A will output h_{S′} satisfying

L_{D*}(h_{S′}) ≤ ϵ/(2µ*)

with probability at least 1 − δ/3. Then we just need to draw a large enough sample V to make sure that the size of S′ is at least M_RE(ϵ/(2µ*), δ/3). By a Chernoff bound, it is enough to draw c · M_RE(ϵ/(2µ*), δ/3)/µ* data points to achieve this, for some constant c > 0. Since we do not know µ*, we need to draw c · max_{µ∈[ϵ/2,1]} { M_RE(ϵ/(2µ), δ/3)/µ } data points to ensure this claim holds (if µ* < ϵ/2, note that any hypothesis gives a valid solution). By a union bound, the overall process succeeds with probability at least 1 − 2δ/3.

Proof of Theorem 4.1. By Lemma 4.2, with probability at least 1 − 2δ/3, H_V contains a hypothesis h_{S′} such that

L_D(h_{S′}) ≤ L_D(c*) + ϵ/2.

We can now use standard empirical risk minimization bounds on H_V to find a hypothesis with true risk at most L_D(c*) + ϵ. A Chernoff bound and a union bound imply that, with probability at least 1 − δ/3, the empirical risk of every hypothesis in H_V is at most ϵ/4 away from its true risk on a sample of size O(log(|H_V|/δ)/ϵ²).
Since h_{S′} has true risk L_D(h_{S′}) at most L_D(c*) + ϵ/2, its empirical risk L_T(h_{S′}) is at most L_D(c*) + 3ϵ/4. Then we can be sure there exists a hypothesis in H_V with empirical risk at most L_D(c*) + 3ϵ/4, and by the above guarantee, any hypothesis in H_V with empirical risk at most L_D(c*) + 3ϵ/4 has true risk at most L_D(c*) + ϵ. Combining Lemma 4.2 and the above analysis of empirical risk minimization, we have that, with probability at least 1 − δ over the entire process, the output of Algorithm 1 is a desired agnostic learner. The sample complexity follows from noting that |H_V| is at most 2^{|V|} with |V| = O(max_{µ∈[ϵ/2,1]} { M_RE(ϵ/(2µ), δ/3)/µ }), due to Lemma 4.2.

4.2. Multiclass Learning

Building on Theorem 4.1, we can readily adapt it to the multiclass learning setting, with the action space A equal to the (possibly infinite) label space Y and with L the classification error. In this way, a unified PAC learning instance Q = (X, Y, A, L, C, H) can be specified as a multiclass learning instance (X, Y, C, H). Then Theorem 4.1 can be applied to this learning setting; together with Theorem 1 and Algorithm 1 in Brukhim et al. (2022), we have the following corollary.

Corollary 4.3 (Agnostic to Realizable (Multiclass Learning)). If H has finite DS dimension d, Algorithm 1 is an agnostic multiclass learner with sample complexity

M_AG(ϵ, δ) = Õ( (d^{3/2} + log(1/δ)) / ϵ³ ).

4.3. List Learning

We can also adapt Theorem 4.1 to the L-list learning setting, with the action space A = {Y′ ⊆ Y : |Y′| ≤ L}. The concept class C consists of functions c : X → A′, where A′ = {Y′ ⊆ Y : |Y′| = 1}. Similarly, the hypothesis class H consists of functions h : X → A. By definition, there is a bijection between Y and A′ mapping each y to the corresponding Y′ such that y ∈ Y′. According to this bijection, the loss function L can be defined on A × A′ such that L(h(x), c(x)) = 0 if and only if c(x) ⊆ h(x).
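These set-valued losses are easy to state in code; the following small Python sketch is our own illustration (not code from the paper) of the list-learning loss just described and of its multilabel dual, both instances of the Unified PAC loss L : A × Y → {0, 1}.

```python
# Hypothetical sketch (our own illustration): the list-learning loss and its
# multilabel dual, both {0, 1}-valued losses in the Unified PAC framework.

def list_loss(predicted_list, true_label):
    """List learning: the hypothesis outputs a set of at most L labels and is
    correct iff the single true label lands inside that set."""
    return 0 if true_label in predicted_list else 1

def multilabel_loss(true_label_set, predicted_label):
    """Multilabel learning (the dual): the concept outputs a set of at most L
    labels, and a singleton prediction is correct iff it lands inside it."""
    return 0 if predicted_label in true_label_set else 1

assert list_loss({1, 2, 3}, 2) == 0        # 2 is covered by the list
assert list_loss({1, 2, 3}, 5) == 1        # 5 is missed by the list
assert multilabel_loss({1, 2, 3}, 2) == 0  # prediction 2 is a valid label
assert multilabel_loss({1, 2, 3}, 5) == 1  # prediction 5 is not
```

The duality is visible directly: the two losses are the same membership test with the roles of output and action swapped.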
In this way, a unified PAC learning instance Q = (X, Y, A, L, C, H) can be specified as an L-list learning instance (X, Y, A, C, H). Thus, Theorem 4.1 can be applied to this learning setting; together with Theorem 2 and Algorithm 1 in Charikar & Pabbaraju (2023), we have the following corollary.

Corollary 4.4 (Agnostic to Realizable (L-List Learning)). If H has finite L-DS dimension d, Algorithm 1 is an agnostic list learner with sample complexity

M_AG(ϵ, δ) = Õ( (L⁶ d^{3/2} + log(1/δ)) / ϵ³ ).

4.4. Multilabel Learning

With Theorem 4.1 in mind, we can also adapt it to the multilabel learning setting, where the action space is defined as A = {Y′ ⊆ Y : |Y′| ≤ L}. The concept class C consists of functions c : X → A. Similarly, the hypothesis class H consists of functions h : X → A′, where A′ = {Y′ ⊆ Y : |Y′| = 1}. By definition, there is a bijection between Y and A′ mapping each y to the corresponding Y′ such that y ∈ Y′. According to this bijection, the loss function L can be defined on A × A′ such that L(c(x), h(x)) = 0 if and only if h(x) ⊆ c(x). In this way, a unified PAC learning instance Q = (X, Y, A, L, C, H) can be specified as a multilabel learning instance (X, Y, A, C, H), and thus we can apply Theorem 4.1 to this learning setting.

5. Noise Models

In this section, we provide the sample complexity of our multiclass classification reduction algorithm under two noise models: Massart noise (Massart & Nédélec, 2006) and Tsybakov noise (Tsybakov, 2004). We use M_D(ϵ, δ) to denote the agnostic sample complexity constrained to distributions D ∈ D. Since multiclass learnability is characterized by the DS dimension d, meaning that a hypothesis class H is PAC learnable if and only if it has finite DS dimension, we restrict our reduction to hypothesis classes with finite DS dimension. For the realizable learner A, we use the one-inclusion graph algorithm introduced by Brukhim et al. (2022). Before presenting our results, we first give the definitions of these two noise models.

Definition 5.1 (Massart Noise).
For ∆ ∈ (0, 1), define MN(∆) as the collection of joint distributions P_XY over X × Y such that f* ∈ C and

P(y = f*(x) | x) − max_{y′ ≠ f*(x)} P(y = y′ | x) ≥ ∆ (3)

holds almost surely, where f* is the Bayes optimal classifier.

Definition 5.2 (Tsybakov Noise). For a ∈ [1, ∞) and α ∈ (0, 1), define TN(a, α) as the collection of joint distributions P_XY over X × Y such that f* ∈ C and, for all γ > 0,

P_X( x : P(y = f*(x) | x) − max_{y′ ≠ f*(x)} P(y = y′ | x) ≤ γ ) ≤ a′ γ^{α/(1−α)}, (4)

where a′ = (1 − α)(2α)^{α/(1−α)} a^{1/(1−α)} and f* is the Bayes optimal classifier.

The following theorem shows the reduction from Massart noise to realizable.

Theorem 5.3 (Massart to Realizable (Multiclass Learning)). If H has finite DS dimension d, Algorithm 1 is an agnostic multiclass learner with sample complexity

M_MN(∆)(ϵ, δ) = Õ( (d^{3/2} + log(1/δ)) / (∆ϵ²) ).

Before providing the proof of Theorem 5.3, we state a useful lemma that will be used in the proof.

Lemma 5.4. For any distribution D over X × Y, with probability at least 1 − δ, we have

L_D(ĥ) − L_D(c*) − ϵ/2 ≤ c log(|H_V|/δ)/|T| + c √( (P(ĥ(x) ≠ c*(x)) + ϵ/2) · log(|H_V|/δ)/|T| ),

where H_V is the hypothesis class generated in step 2 of Algorithm 1, ĥ is the hypothesis output by ERM in step 3 of Algorithm 1, c* ∈ C is the Bayes optimal classifier, and c is some positive constant.

Proof of Lemma 5.4. By a uniform Bernstein inequality, for all h, h′ ∈ H_V and any distribution D over X × Y, with probability at least 1 − δ/3, we have

L_D(h) − L_D(h′) ≤ L_T(h) − L_T(h′) + c log(|H_V|/δ)/|T| + c √( P(h(x) ≠ h′(x), y ∈ {h(x), h′(x)}) · log(|H_V|/δ)/|T| ). (5)

For ĥ, h_{S′} ∈ H_V and c* ∈ C, we have the following triangle inequality:

P(ĥ(x) ≠ h_{S′}(x), y ∈ {ĥ(x), h_{S′}(x)}) ≤ P(ĥ(x) ≠ c*(x), y ∈ {ĥ(x), c*(x)}) + P(h_{S′}(x) ≠ c*(x), y ∈ {h_{S′}(x), c*(x)}). (6)

We can further upper bound P(h_{S′}(x) ≠ c*(x), y ∈ {h_{S′}(x), c*(x)}) by ϵ/2, due to the construction of h_{S′} in Lemma 4.2. Then, combining Equation (5), Equation (6), and Lemma 4.2, we get the desired conclusion.

With the above lemma in mind, we start the proof of Theorem 5.3.

Proof of Theorem 5.3.
For every distribution D ∈ MN(∆) and every h ∈ Y^X, with probability 1, we have

L_D(h) − L_D(c*) ≥ ∆ · P(h(x) ≠ c*(x)). (7)

Combining Lemma 5.4 and Equation (7), with probability at least 1 − δ, we have

L_D(ĥ) − L_D(c*) − ϵ/2 ≤ c log(|H_V|/δ)/|T| + c √( ((L_D(ĥ) − L_D(c*))/∆ + ϵ/2) · log(|H_V|/δ)/|T| ), (8)

where c is some positive constant. Rearranging Equation (8) so that the term L_D(ĥ) − L_D(c*) appears only on the left side of the inequality (using the AM–GM bound √(uv) ≤ u/2 + v/2 to absorb the square root), with probability at least 1 − δ, we have

L_D(ĥ) − L_D(c*) ≤ ϵ/2 + c′ ( log(|H_V|/δ)/(∆|T|) + √( ϵ log(|H_V|/δ)/(∆|T|) ) ), (9)

where c′ is some positive constant. Upper bounding Equation (9) by ϵ and solving for |T|, we find that when |T| ≥ c′′ log(|H_V|/δ)/(∆ϵ) for some positive constant c′′, with probability at least 1 − δ, the output of Algorithm 1 is a desired learner under Massart noise. The sample complexity follows from noting that |H_V| is at most 2^{|V|} with |V| = O(max_{µ∈[ϵ/2,1]} { M_RE(ϵ/(2µ), δ/3)/µ }), due to Lemma 4.2.

The following theorem shows the reduction from Tsybakov noise to realizable.

Theorem 5.5 (Tsybakov to Realizable (Multiclass Learning)). If H has finite DS dimension d, Algorithm 1 is an agnostic multiclass learner with sample complexity

M_TN(a,α)(ϵ, δ) = Õ( a (d^{3/2} + log(1/δ)) / ϵ^{3−α} ).

Proof of Theorem 5.5. For every distribution D ∈ TN(a, α) and every h ∈ Y^X, we have

P(h(x) ≠ c*(x)) ≤ a (L_D(h) − L_D(c*))^α, (10)

which is also referred to as the Bernstein class condition (Hanneke & Yang, 2015). Combining Lemma 5.4 and Equation (10), with probability at least 1 − δ, we have

L_D(ĥ) − L_D(c*) − ϵ/2 ≤ c log(|H_V|/δ)/|T| + c √( (a (L_D(ĥ) − L_D(c*))^α + ϵ/2) · log(|H_V|/δ)/|T| ), (11)

where c is some positive constant. For |T| ≥ c′ a log(|H_V|/δ)/ϵ^{2−α} with some positive constant c′, with probability at least 1 − δ, Equation (11) yields L_D(ĥ) − L_D(c*) ≤ ϵ. In this way, the output of Algorithm 1 is a desired learner under Tsybakov noise. The sample complexity follows from noting that |H_V| is at most 2^{|V|} with |V| = O(max_{µ∈[ϵ/2,1]} { M_RE(ϵ/(2µ), δ/3)/µ }), due to Lemma 4.2.

6.
Lower Bound

In this section, we demonstrate that although our sample complexity upper bound includes an additional 1/ϵ factor compared to the upper bound obtained via a compression scheme, this extra factor is sometimes unavoidable due to the representation-preserving property maintained by our algorithm. Formally, we state the following theorem.

Theorem 6.1. For every n, m ∈ N, there exists an instance of the Unified PAC Learning framework Q = (X, Y, A, L, C, H) with |X| = n and |Y| = |A| = |C| = |H| = m for which we have: for all ϵ, δ ∈ (0, 1),

M_AG(ϵ, δ) / M_RE(ϵ, δ) ≥ Ω(1/ϵ²),

where M_RE(ϵ, δ) and M_AG(ϵ, δ) are the optimal sample complexities of realizable PAC learning and agnostic PAC learning, respectively.

Proof. Let X = {1, 2, . . . , n}. Also, let Y = A = {1, 2, . . . , m}. In addition, let C = H = {c_y | c_y : X → {y}, y ∈ Y}. Additionally, define L as the multiclass classification loss of subsection 4.2. Now, observe that Q is realizable PAC learnable with only one sample: a single labeled example identifies the constant concept. Therefore, for all ϵ, δ ∈ (0, 1), M_RE(ϵ, δ) ≤ O(1). Also, for all ϵ, δ ∈ (0, 1), M_AG(ϵ, δ) ≥ Ω(1/ϵ²). This follows from the standard technique based on a slightly unbiased coin; for instance, see (Daniely et al., 2011). As a result, for all ϵ, δ ∈ (0, 1), M_AG(ϵ, δ)/M_RE(ϵ, δ) ≥ Ω(1/ϵ²). We note that it is possible to construct a similar example using partial concept classes (see Appendix B.1). This finishes the proof.

Remark 6.2. Indeed, it would be preferable to provide an example of an instance of the Unified PAC Learning framework where the aforementioned result holds true but we also have an Ω(1/ϵ) lower bound on the optimal sample complexity of realizable PAC learning for every ϵ, δ ∈ (0, 1). However, this claim holds if and only if there is a sublinear proper sample compression scheme for a given hypothesis class, which is an interesting and important topic to explore.
Actually, if we have a proper sample compression scheme of size $k(n) = O(n^{1/3}/\log n)$, then we can get an agnostic proper sample compression scheme with the same size $k(n)$. An agnostic proper sample compression scheme of size $k(n)$ implies a proper agnostic learner with error rate
$$\epsilon(n, \delta) = O\left(\sqrt{\frac{k(n)\log\frac{n}{k(n)} + k(n) + \log\frac{1}{\delta}}{n}}\right).$$
In other words, the sample complexity of this agnostic learner would be $M_{AG}(\epsilon, \delta) = O(1/\epsilon^3)$, which matches the sample complexity upper bound of our algorithm. However, even in the binary case, the existence of such a proper sample compression scheme remains unproven. We leave this as an open question for further exploration.

7. Conclusion

In this work, we established a simple and elegant agnostic to realizable reduction within the Unified PAC Learning framework. In particular, our result leads to the first representation preserving agnostic to realizable reduction within the multiclass PAC learning framework, especially when the number of labels can be unbounded. Also, we explored representation preserving reductions to the realizable setting for two noise models in the multiclass PAC learning framework. In addition, we showed similar results for the Unified PAC Learning framework with partial concept classes. We believe that our techniques could have applications in future studies in the field of statistical learning theory.

Impact Statement

This paper presents work whose goal is to advance the field of Statistical Learning Theory. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Aden-Ali, I., Høgsgaard, M. M., Larsen, K. G., and Zhivotovskiy, N. Majority-of-three: The simplest optimal learner? In The Thirty Seventh Annual Conference on Learning Theory, pp. 22–45. PMLR, 2024.

Agarwal, A. Selective sampling algorithms for cost-sensitive multiclass prediction.
In International Conference on Machine Learning, pp. 1220–1228. PMLR, 2013.

Alon, N., Hanneke, S., Holzman, R., and Moran, S. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 658–671. IEEE, 2022.

Ben-David, S., Cesa-Bianchi, N., and Long, P. M. Characterizations of learnability for classes of {0, . . . , n}-valued functions. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 333–340, 1992.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

Bousquet, O., Hanneke, S., Moran, S., and Zhivotovskiy, N. Proper learning, Helly number, and an optimal SVM bound. In Conference on Learning Theory, pp. 582–609. PMLR, 2020.

Brukhim, N., Hazan, E., Moran, S., Mukherjee, I., and Schapire, R. E. Multiclass boosting and the cost of weak learning. Advances in Neural Information Processing Systems, 34:3057–3067, 2021.

Brukhim, N., Carmon, D., Dinur, I., Moran, S., and Yehudayoff, A. A characterization of multiclass learnability. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 943–955. IEEE, 2022.

Charikar, M. and Pabbaraju, C. A characterization of list learnability. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 1713–1726, 2023.

Daniely, A. and Shalev-Shwartz, S. Optimal learners for multiclass problems. In Conference on Learning Theory, pp. 287–316. PMLR, 2014.

Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 207–232. JMLR Workshop and Conference Proceedings, 2011.

Daniely, A., Sabato, S., and Shalev-Shwartz, S. Multiclass learning approaches: A theoretical comparison with implications. Advances in Neural Information Processing Systems, 25, 2012.
David, O., Moran, S., and Yehudayoff, A. Supervised learning through the lens of compression. Advances in Neural Information Processing Systems, 29, 2016.

Hanneke, S. and Yang, L. Minimax analysis of active learning. Journal of Machine Learning Research, 16(1):3487–3602, 2015.

Hanneke, S., Livni, R., and Moran, S. Online learning with simple predictors and a combinatorial characterization of minimax in 0/1 games. In Conference on Learning Theory, pp. 2289–2314. PMLR, 2021.

Hanneke, S., Moran, S., and Zhang, Q. Universal rates for multiclass learning. In The Thirty Sixth Annual Conference on Learning Theory, pp. 5615–5681. PMLR, 2023.

Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

Haussler, D. and Long, P. M. A generalization of Sauer's lemma. Journal of Combinatorial Theory, Series A, 71(2):219–240, 1995.

Hopkins, M., Kane, D. M., Lovett, S., and Mahajan, G. Realizable learning is all you need. In Conference on Learning Theory, pp. 3015–3069. PMLR, 2022.

Massart, P. and Nédélec, É. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326–2366, 2006.

Natarajan, B. K. On learning sets and functions. Machine Learning, 4:67–97, 1989.

Natarajan, B. K. and Tadepalli, P. Two new frameworks for learning. In Machine Learning Proceedings 1988, pp. 402–415. Elsevier, 1988.

Rubinstein, B., Bartlett, P., and Rubinstein, J. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. Advances in Neural Information Processing Systems, 19, 2006.

Tsybakov, A. B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Vapnik, V. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.

Vapnik, V. N. and Chervonenkis, A. Y.
On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis, pp. 11–30. Springer, 2015.

A. Failure of Hopkins et al. (2022) in Infinite-Label Multiclass Learning

We address this by noting that the Hopkins et al. (2022) reduction fails for the well-studied stars and sets concept class, namely Example 4.1 of Daniely et al. (2011). Specifically, consider $X = [0, 1]$, and denote by $F(X)$ the collection of all finite subsets $A \subseteq X$. Let the label space be $Y = F(X) \cup \{*\}$, where $*$ is a special label (not to be confused with the symbol $\star$ from the partial concept classes section). For every $A \in F(X) \cup \{X\}$, define $f_A : X \to Y$ as follows:
$$f_A(x) = \begin{cases} * & \text{if } x \in A, \\ A & \text{otherwise.} \end{cases}$$
Let the concept and hypothesis class be $C = H = \{f_A : A \in F(X) \cup \{X\}\}$. Then, there is a realizable learner $\mathcal{A}_{\mathrm{good}}$ that returns $f_X$ unless a label $A \in F(X)$ appeared in the sample, in which case it returns $f_A$. Let D have as its marginal on X the uniform distribution over X, and let the labels always be $*$ (i.e., D is realizable with target $f_X$). Now, consider the algorithm of Hopkins et al. in this scenario with the realizable learner $\mathcal{A}_{\mathrm{good}}$. Let the unlabeled dataset be $S_U = \{x_1, x_2, \ldots, x_n\}$ and the labeled dataset be $S_L = \{(x_{n+1}, *), (x_{n+2}, *), \ldots, (x_{n+m}, *)\}$; with probability one these x's are all distinct. Denote $A = \{x_{n+1}, x_{n+2}, \ldots, x_{n+m}\} \in F(X)$. Then their algorithm would run $\mathcal{A}_{\mathrm{good}}$ on all realizable labelings of $S_U$; in particular, one of these is $(S_U, f_A(S_U)) = \{(x_1, A), \ldots, (x_n, A)\}$, and the output hypothesis of $\mathcal{A}_{\mathrm{good}}(S_U, f_A(S_U))$ would be $f_A$. By the definition of $f_A$, we know that the empirical error of $f_A$ on $S_L$ is 0. Their algorithm then outputs any ERM on $S_L$ from these functions produced by $\mathcal{A}_{\mathrm{good}}$, which means their algorithm can output $f_A$. However, the true error rate of $f_A$ is 1, while the best error in the concept class is 0.
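The failure mode above can be made concrete in code. The sketch below is our illustrative encoding (the names `STAR`, `f`, and `A_good` are ours, not an implementation from either paper): the ERM step can select $f_A$, which has zero empirical error on the labeled data yet errs on every fresh point, while $\mathcal{A}_{\mathrm{good}}$ run on the true labels returns $f_X$.

```python
# Illustrative sketch (our encoding) of the stars-and-sets failure mode.
STAR = "*"  # the special label; points inside A are labeled *

def f(A):
    """f_A labels points inside A with * and everything else with A itself."""
    A = frozenset(A)
    return lambda x: STAR if x in A else A

def A_good(labeled_sample):
    """Realizable learner: return f_A if some finite label A appears, else f_X."""
    for _, y in labeled_sample:
        if y != STAR:
            return f(y)
    return lambda x: STAR  # f_X labels every point with *

# Unlabeled points x_1..x_n and labeled points x_{n+1}..x_{n+m} (all * labels).
SU = [0.11, 0.22, 0.33]
SL = [(0.44, STAR), (0.55, STAR), (0.66, STAR)]
A = frozenset(x for x, _ in SL)

# One candidate relabeling of SU is by f_A; A_good then outputs f_A ...
h_bad = A_good([(x, f(A)(x)) for x in SU])
# ... which has empirical error 0 on SL but errs on every fresh point:
assert all(h_bad(x) == STAR for x, _ in SL)
assert h_bad(0.77) != STAR  # true error 1 under the uniform marginal

# A_good run on the true labels returns f_X, which is always correct here.
h_good = A_good(SL)
assert all(h_good(x) == STAR for x, _ in SL + [(0.77, STAR)])
```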
Thus, their algorithm fails for this concept class (for essentially the same reason ERM fails for this concept class). In contrast, our algorithm returns $f_X$, and hence achieves error 0.

B. Partial Concept Classes

We discuss the reduction for partial PAC learning in this section.

B.1. Notations

We first develop some necessary notation for partial PAC learning. A partial predictor is a function from X to $A \cup \{\star\}$. With this in mind, fix a non-empty set of partial predictors C as a concept class. Additionally, fix another set of predictors H as a hypothesis class. The support of a partial predictor h is the set $\mathrm{supp}(h) = h^{-1}(A)$. A loss function L is a map $L : (A \cup \{\star\}) \times Y \to \{0, 1\}$, with $L(\star, \cdot) \equiv 1$. As in unified PAC learning, a 6-tuple $Q = (X, Y, A, L, C, H)$ specifies an instance of the Unified Partial PAC Learning framework. A learning algorithm $\mathcal{A}$ is a mapping from $(X \times Y)^*$ to a partial predictor $\hat{h}$. Given the above notation, we make the following definition of realizability. A probability distribution D over the product space $X \times Y$ of the instance space and the label space is realizable by a partial concept class C if almost surely (i.e., with probability 1), a sample $S = ((x_i, y_i))_{i=1}^n \sim D^n$ (for any $n \in \mathbb{N}$) is realizable by some partial concept $c \in C$: that is, $\{x_i\}_{i=1}^n \subseteq \mathrm{supp}(c)$ and $L(c(x_i), y_i) = 0$ for all $i \leq n$. For a partial concept c and a distribution D on $X \times Y$, we define the true risk $L_D(c) := \mathbb{E}_{(x,y) \sim D}[L(c(x), y)]$. The true risk of a hypothesis h is defined analogously as $L_D(h) := \mathbb{E}_{(x,y) \sim D}[L(h(x), y)]$. Given the above notation and definitions, we define Unified Partial PAC learning, which can be easily adapted to a specific learning regime.

Definition B.1 (Unified Partial PAC Learning). We say that $Q = (X, Y, A, L, C, H)$ is unified partial PAC learnable if, for every $\epsilon, \delta \in (0, 1)$, there exist a finite $M_{RE}(\epsilon, \delta) \in \mathbb{N}$ and a learning algorithm $\mathcal{A}$ with $\mathrm{Im}(\mathcal{A}) \subseteq H$ such that, for every distribution D on $X \times Y$ realizable w.r.t. C, for $S \sim D^{M_{RE}(\epsilon,\delta)}$, with probability at least $1 - \delta$, $L_D(\mathcal{A}(S)) \leq \epsilon$.
The value $M_{RE}(\epsilon, \delta)$ is called the sample complexity of $\mathcal{A}$, and the optimal sample complexity⁴ is the minimum achievable value of $M_{RE}(\epsilon, \delta)$ for every given $\epsilon, \delta$. For any $n \in \mathbb{N}$ and data sequence $S \in (X \times Y)^n$, define the empirical risk of any partial concept c as $L_S(c) = \frac{1}{n}\sum_{i=1}^n L(c(x_i), y_i)$. For a distribution D on $X \times Y$, define the approximation error of C as $L_D(C) = \lim_{n \to \infty} \mathbb{E}_{S \sim D^n}[\min_{c \in C} L_S(c)]$.

⁴For brevity, we refer to the optimal sample complexity as sample complexity.

Definition B.2 (Unified Agnostic Partial PAC Learning). We say that $Q = (X, Y, A, L, C, H)$ is agnostic unified partial PAC learnable if, for every $\epsilon, \delta \in (0, 1)$, there exist a finite $M_{AG}(\epsilon, \delta) \in \mathbb{N}$ and a learning algorithm $\mathcal{A}$ with $\mathrm{Im}(\mathcal{A}) \subseteq H$ such that, for every distribution D on $X \times Y$, for $S \sim D^{M_{AG}(\epsilon,\delta)}$, with probability at least $1 - \delta$, $L_D(\mathcal{A}(S)) \leq L_D(C) + \epsilon$. The value $M_{AG}(\epsilon, \delta)$ is called the sample complexity of $\mathcal{A}$, and the optimal sample complexity is the minimum achievable value of $M_{AG}(\epsilon, \delta)$ for every given $\epsilon, \delta$.

B.2. General Reduction

Given the definitions above, we provide the unified partial agnostic to realizable reduction in the following theorem.

Theorem B.3 (Agnostic → Realizable (Partial Learning)). Let $\mathcal{A}$ be a realizable unified partial learner for a learning instance $Q = (X, Y, A, L, C, H)$ with sample complexity $M_{RE}(\epsilon, \delta)$. Then Algorithm 1 is an agnostic unified partial learner for $(X, Y, A, L, C, H)$ with sample complexity:
$$M_{AG}(\epsilon, \delta) = O\left(\frac{\max_{\mu \in [\epsilon/2,1]}\left\{\frac{M_{RE}(\epsilon/2\mu,\,\delta/3)}{\mu}\right\} + \log(1/\delta)}{\epsilon^2}\right).$$
Theorem B.3 also gives the first representation preserving reduction within the unified partial PAC learning regime. However, compared to the reduction-to-realizable technique of David et al. (2016), Theorem B.3 incurs an extra $\max_{\mu \in [\epsilon/2,1]}\{M_{RE}(\epsilon/2\mu, \delta/3)/\mu\}$ term, which can scale as $1/\epsilon$.
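The proof that follows suggests a simple schematic form for Algorithm 1: run the realizable learner on every subset of a first sample S to form a candidate set $\mathcal{H}_S$, then return the empirical risk minimizer over $\mathcal{H}_S$ on a second sample T. The sketch below is our reconstruction under that reading, with a toy threshold class and a hypothetical consistent learner `A`; it is not the paper's actual pseudocode.

```python
from itertools import chain, combinations

# Schematic sketch (our reconstruction from the proof of Theorem B.3) of the
# reduction: run the realizable learner A on every subset of S, then return
# the empirically best candidate on a fresh sample T.
def reduction(A, S, T, loss):
    subsets = chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))
    H_S = [A(list(sub)) for sub in subsets]          # candidate set H_S
    return min(H_S, key=lambda h: sum(loss(h(x), y) for x, y in T))  # ERM on T

# Toy instance: threshold classifiers on the line; A puts the threshold at the
# leftmost positive example (consistent whenever its subsample is realizable).
def A(sample):
    ones = [x for x, y in sample if y == 1]
    t = min(ones) if ones else float("inf")
    return lambda x, t=t: 1 if x >= t else 0

loss = lambda y_hat, y: int(y_hat != y)
S = [(1.0, 0), (2.0, 0), (1.5, 1), (3.0, 1), (4.0, 1)]   # (1.5, 1) is noise
T = [(0.5, 0), (2.5, 0), (3.2, 1), (5.0, 1)]
h = reduction(A, S, T, loss)
assert h(2.9) == 0 and h(3.1) == 1   # ERM on T discards the noisy candidate
```

Running A on the noisy subsets produces a bad candidate (threshold 1.5), but the ERM step over the fresh sample T filters it out, which is exactly the role the proof assigns to T.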
This raises the open question of whether this extra factor can be removed. Although Theorem B.3 achieves the same sample complexity as Theorem 4.1, the proofs differ significantly, as there is no optimal concept $c^*$ in C under the framework of agnostic partial PAC learning. Before presenting the proof, we introduce the following notation. Let $\hat{h}$ be the output hypothesis in step 3 of Algorithm 1. For a given sample T, let $c^* = \arg\min_{c \in C} L_T(c)$. According to $c^*$, let $S' \subseteq S$ be the largest realizable subsample of S, i.e., $S' = \{(x, y) \in S \mid L(c^*(x), y) = 0\}$. Also, let $h_{S'} = \mathcal{A}(S')$. With these notations in mind, we now present the proof of Theorem B.3.

Proof of Theorem B.3. Leveraging a standard Chernoff bound over $\mathcal{H}_S$, for any $h \in \mathcal{H}_S$, with probability at least $1 - \delta/4$, we have $L_D(h) \leq L_T(h) + c_1\sqrt{\log(|\mathcal{H}_S|/\delta)/|T|}$, for some positive constant $c_1$. In particular, for the output hypothesis $\hat{h}$ in step 3 of Algorithm 1, with probability at least $1 - \delta/4$, we have $L_D(\hat{h}) \leq L_T(\hat{h}) + c_1\sqrt{\log(|\mathcal{H}_S|/\delta)/|T|}$. By the construction of $\hat{h}$, we have $L_T(\hat{h}) \leq L_T(h)$ for all $h \in \mathcal{H}_S$. Since $h_{S'} \in \mathcal{H}_S$, we have $L_T(\hat{h}) \leq L_T(h_{S'})$, which further leads to
$$L_D(\hat{h}) \leq L_T(h_{S'}) + c_1\sqrt{\log(|\mathcal{H}_S|/\delta)/|T|} \qquad (12)$$
holding with probability at least $1 - \delta/4$. On the other hand, since the function $L_T(c^*)$ has bounded differences with bound $1/|T|$, leveraging McDiarmid's inequality, with probability at least $1 - \delta/4$, we have $L_T(c^*) \leq \mathbb{E}[L_T(c^*)] + c_2\sqrt{\log(1/\delta)/|T|}$, for some positive constant $c_2$. Since $\mathbb{E}[L_T(c^*)]$ is non-decreasing in the size of the sample T (see Lemma 39 in Alon et al. (2022)), we have $\mathbb{E}[L_T(c^*)] \leq L_D(C)$ for any T sampled from the distribution D. Thus, with probability at least $1 - \delta/4$, we have
$$L_T(c^*) \leq L_D(C) + c_2\sqrt{\log(1/\delta)/|T|}, \qquad (13)$$
for some positive constant $c_2$. If we can further bound the difference $L_T(h_{S'}) - L_T(c^*)$ with high probability, then combining Equations (12) and (13) yields the desired conclusion. A natural idea is to divide T into two parts: $T'$, the realizable part according to $c^*$, and $T''$, the complement of $T'$.
Then, we can decompose $L_T(c^*)$ as
$$L_T(c^*) = \mu^* L_{T'}(c^*) + (1 - \mu^*) L_{T''}(c^*) = 1 - \mu^*, \qquad (14)$$
where $\mu^* = |T'|/|T|$ denotes the portion of samples in T realizable by $c^*$. Similarly, we can decompose $L_T(h_{S'})$ as
$$L_T(h_{S'}) = \mu^* L_{T'}(h_{S'}) + (1 - \mu^*) L_{T''}(h_{S'}) \leq (1 - \mu^*) + \mu^* L_{T'}(h_{S'}) = L_T(c^*) + \mu^* L_{T'}(h_{S'}), \qquad (15)$$
where the inequality comes from $L_{T''}(h_{S'}) \leq 1$, and the last equality is due to Equation (14). Since both $S'$ and $T'$ are realizable by $c^*$, as long as $|S'| \geq M_{RE}(\epsilon/2\mu^*, \delta/4)$, with probability at least $1 - \delta/4$, we have $L_{T'}(h_{S'}) \leq \epsilon/2\mu^*$, and hence $\mu^* L_{T'}(h_{S'}) \leq \epsilon/2$. Leveraging a standard Chernoff bound, when $|S| = c_3 \max_{\mu \in [\epsilon/2,1]}\{M_{RE}(\epsilon/2\mu, \delta/4)/\mu\}$, with probability at least $1 - \delta/4$, we have $|S'| \geq M_{RE}(\epsilon/2\mu^*, \delta/4)$. Choosing $|T| = O(\log(|\mathcal{H}_S|/\delta)/\epsilon^2)$ and applying a union bound across the whole process, we have, with probability at least $1 - \delta$, $L_D(\hat{h}) \leq L_D(C) + \epsilon$. In this way, we prove that Algorithm 1 is an agnostic partial unified PAC learner. The sample complexity comes from $|\mathcal{H}_S| \leq 2^{|S|}$ with $|S| = c_3 \max_{\mu \in [\epsilon/2,1]}\{M_{RE}(\epsilon/2\mu, \delta/4)/\mu\}$ for some positive constant $c_3$.

B.3. Multiclass Learning

With Theorem B.3 in mind, we can readily adapt it to the multiclass learning setting with the action space $A = Y \cup \{\star\}$, where L represents the classification error. In this way, a unified PAC learning instance $Q = (X, Y, A, L, C, H)$ can be specified as a multiclass learning instance $(X, Y, C, H)$.

Corollary B.4 (Agnostic → Realizable (Partial Multiclass Learning)). Let $\mathcal{A}$ be a realizable partial multiclass learner for a learning instance $(X, Y, C, H)$ with sample complexity $M_{RE}(\epsilon, \delta)$. Then Algorithm 1 is an agnostic partial multiclass learner for $(X, Y, C, H)$ with sample complexity:
$$M_{AG}(\epsilon, \delta) = O\left(\frac{\max_{\mu \in [\epsilon/2,1]}\left\{\frac{M_{RE}(\epsilon/2\mu,\,\delta/3)}{\mu}\right\} + \log(1/\delta)}{\epsilon^2}\right).$$
Moreover, if H has finite DS dimension d, Algorithm 1 only needs
$$M_{AG}(\epsilon, \delta) = O\left(\frac{d^{3/2} + \log(1/\delta)}{\epsilon^3}\right)$$
labeled samples.

B.4. List Learning

Building on Theorem B.3, we can adapt it to the L-list learning setting with the action space $A = \{Y' \subseteq Y : |Y'| \leq L\}$.
The concept class C consists of functions $c : X \to A' \cup \{\star\}$, where $A' = \{Y' \subseteq Y : |Y'| = 1\}$. Similarly, the hypothesis class H consists of functions $h : X \to A \cup \{\star\}$. By definition, there is a bijection between Y and $A'$ mapping each y to the corresponding singleton $Y'$ such that $y \in Y'$. According to this bijection, the loss function L can be defined on $(A \cup \{\star\}) \times (A' \cup \{\star\})$ such that $L(h(x), c(x)) = 0$ if and only if $c(x) \subseteq h(x)$. In this way, a unified PAC learning instance $Q = (X, Y, A, L, C, H)$ can be specified as an L-list learning instance $(X, Y, A, C, H)$.

Corollary B.5 (Agnostic → Realizable (Partial List Learning)). Let $\mathcal{A}$ be a realizable partial list learner for a learning instance $(X, Y, A, C, H)$ with sample complexity $M_{RE}(\epsilon, \delta)$. Then Algorithm 1 is an agnostic partial list learner for $(X, Y, A, C, H)$ with sample complexity:
$$M_{AG}(\epsilon, \delta) = O\left(\frac{\max_{\mu \in [\epsilon/2,1]}\left\{\frac{M_{RE}(\epsilon/2\mu,\,\delta/3)}{\mu}\right\} + \log(1/\delta)}{\epsilon^2}\right).$$
Moreover, if H has finite L-DS dimension d, Algorithm 1 needs only
$$M_{AG}(\epsilon, \delta) = O\left(\frac{L^6 d^{3/2} + \log(1/\delta)}{\epsilon^3}\right)$$
labeled samples.

B.5. Multilabel Learning

We can also adapt Theorem B.3 to the multilabel learning setting, where the action space is defined as $A = \{Y' \subseteq Y : |Y'| \leq L\}$. The concept class C consists of functions $c : X \to A$. Similarly, the hypothesis class H consists of functions $h : X \to A'$, where $A' = \{Y' \subseteq Y : |Y'| = 1\}$. By definition, there is a bijection between Y and $A'$ mapping each y to the corresponding singleton $Y'$ such that $y \in Y'$. According to this bijection, the loss function L can be defined on $(A \cup \{\star\}) \times (A' \cup \{\star\})$ such that $L(c(x), h(x)) = 0$ if and only if $h(x) \subseteq c(x)$. In this way, a unified PAC learning instance $Q = (X, Y, A, L, C, H)$ can be specified as a multilabel learning instance $(X, Y, A, C, H)$.

Corollary B.6 (Agnostic → Realizable (Partial Multilabel Learning)). Let $\mathcal{A}$ be a realizable partial multilabel learner for $(X, Y, A, C, H)$ with sample complexity $M_{RE}(\epsilon, \delta)$.
Then Algorithm 1 is an agnostic partial multilabel learner for $(X, Y, A, C, H)$ with sample complexity:
$$M_{AG}(\epsilon, \delta) = O\left(\frac{\max_{\mu \in [\epsilon/2,1]}\left\{\frac{M_{RE}(\epsilon/2\mu,\,\delta/3)}{\mu}\right\} + \log(1/\delta)}{\epsilon^2}\right).$$

C. Reduction via Compression Scheme

For completeness, we state the sample complexity of the reduction via the compression method in this section. The following theorem presents the sample complexity of the agnostic to realizable reduction achieved through a compression scheme.

Theorem C.1 (Agnostic → Realizable (Compression)). If C is unified PAC learnable with sample complexity $M_{RE}(\epsilon, \delta)$, then it is unified agnostic PAC learnable with sample complexity:
$$M_{AG}(\epsilon, \delta) = O\left(\frac{M_{RE}(1/3, 1/3)\log^2(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right).$$
Proof of Theorem C.1. The proof follows directly from Theorem 3.1 and Theorem 3.2 in David et al. (2016).

The following theorem presents the sample complexity of the Massart noise to realizable reduction achieved through a compression scheme.

Theorem C.2 (Massart → Realizable (Compression)). If a multiclass concept class C is learnable with sample complexity $M_{RE}(\epsilon, \delta)$, then it is also learnable under Massart noise with sample complexity:
$$M_{MN(\beta)}(\epsilon, \delta) = O\left(\frac{M_{RE}(1/3, 1/3)\log^2(1/\beta\epsilon) + \log(1/\delta)}{\beta\epsilon}\right).$$
Proof of Theorem C.2. According to Theorem 3.2 in David et al. (2016), if C can be learned with sample complexity $M_{RE}(\epsilon, \delta)$, then there is a sample compression scheme of size $k(m) = O(d_0 \log(m) \log(d_0 \log(m)))$, where $d_0 = M_{RE}(1/3, 1/3)$. Furthermore, a sample compression scheme of size $k(m)$ implies an agnostic sample compression scheme of size $k(m)$. Denote this sample compression scheme by $\mathcal{A}(S)$, where S is the sample, of size $|S| = m$. Since the size of the compression scheme is at most $k(m)$, the size of the class of compression functions $|\mathcal{H}_c|$ can be upper bounded by
$$|\mathcal{H}_c| \leq m^{k(m)}. \qquad (16)$$
Then a uniform Bernstein inequality gives that, for all $\hat{h} \in \mathcal{H}_c$, with probability at least $1 - \delta$, we have
$$\big|(L_S(\hat{h}) - L_S(c^*)) - (L_D(\hat{h}) - L_D(c^*))\big| \leq c\sqrt{\frac{\mathbb{P}_D\big(\hat{h}(x) \neq c^*(x),\, y \in \{\hat{h}(x), c^*(x)\}\big)\log(|\mathcal{H}_c|/\delta)}{|S|}} + \frac{c\log(|\mathcal{H}_c|/\delta)}{|S|}, \qquad (17)$$
where $c^*$ is the Bayes optimal classifier in C. In particular, for the agnostic compression scheme $\mathcal{A}(S)$, with probability at least $1 - \delta$, we have
$$L_D(\mathcal{A}(S)) - L_D(c^*) \leq c\sqrt{\frac{\mathbb{P}_D\big(\mathcal{A}(S)(x) \neq c^*(x),\, y \in \{\mathcal{A}(S)(x), c^*(x)\}\big)\log(|\mathcal{H}_c|/\delta)}{|S|}} + \frac{c\log(|\mathcal{H}_c|/\delta)}{|S|}, \qquad (18)$$
due to the definition of an agnostic compression scheme. Combining Equation (18) and Equation (7), with probability at least $1 - \delta$, we have
$$L_D(\mathcal{A}(S)) - L_D(c^*) \leq c\sqrt{\frac{\frac{1}{\beta}\big(L_D(\mathcal{A}(S)) - L_D(c^*)\big)\log(|\mathcal{H}_c|/\delta)}{|S|}} + \frac{c\log(|\mathcal{H}_c|/\delta)}{|S|}. \qquad (19)$$
Combining Equation (19) with Equation (16) and solving, we have $L_D(\mathcal{A}(S)) - L_D(c^*) \leq \epsilon$ if
$$|S| = O\left(\frac{M_{RE}(1/3, 1/3)\log^2(1/\beta\epsilon) + \log(1/\delta)}{\beta\epsilon}\right).$$
In this way, we prove that the compression scheme is a learner under Massart noise, and we also prove the claimed upper bound on the sample complexity.

The following theorem presents the sample complexity of the Tsybakov noise to realizable reduction achieved through a compression scheme.

Theorem C.3 (Tsybakov → Realizable (Compression)). If a multiclass concept class C is learnable with sample complexity $M_{RE}(\epsilon, \delta)$, then it is also learnable under Tsybakov noise with sample complexity:
$$M_{TN(a,\alpha)}(\epsilon, \delta) = O\left(\frac{a\big(M_{RE}(1/3, 1/3)\log^2(1/\epsilon) + \log(1/\delta)\big)}{\epsilon^{2-\alpha}}\right).$$
Proof of Theorem C.3. The proof closely mirrors the argument in Theorem C.2, with the key difference being the substitution of the Massart noise condition (i.e., Equation (7)) by the Tsybakov noise condition (i.e., Equation (10)) in Equation (19).

D. Concentration Inequalities

In this section, we list the concentration inequalities used in our paper, without providing proofs.

Theorem D.1 (Generic Chernoff Bound). For a random variable X, a positive number t, and any a, we have
$$\mathbb{P}(X \geq a) \leq M(t)\exp(-ta),$$
where $M(t) = \mathbb{E}[\exp(tX)]$ is the moment generating function of X.

Theorem D.2 (Uniform Bernstein Inequality).
For all $h, h' \in \mathcal{H}$ and any distribution D over $X \times Y$, with probability at least $1 - \delta$, we have
$$\big|(L_S(h) - L_S(h')) - (L_D(h) - L_D(h'))\big| \leq c\sqrt{\frac{\mathbb{P}_D\big(h(x) \neq h'(x),\, y \in \{h(x), h'(x)\}\big)\log(|\mathcal{H}|/\delta)}{|S|}} + \frac{c\log(|\mathcal{H}|/\delta)}{|S|},$$
where $\mathcal{H}$ is a multiclass hypothesis class and c is some positive constant.

Theorem D.3 (McDiarmid's Inequality). Let $f : \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n \to \mathbb{R}$ satisfy the bounded differences property with bounds $c_1, c_2, \ldots, c_n$. Consider independent random variables $X_1, X_2, \ldots, X_n$, where $X_i \in \mathcal{X}_i$ for all i. Then, for any $\epsilon > 0$,
$$\mathbb{P}\big(f(X_1, X_2, \ldots, X_n) - \mathbb{E}[f(X_1, X_2, \ldots, X_n)] \geq \epsilon\big) \leq \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right),$$
$$\mathbb{P}\big(f(X_1, X_2, \ldots, X_n) - \mathbb{E}[f(X_1, X_2, \ldots, X_n)] \leq -\epsilon\big) \leq \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right).$$
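As a quick numerical illustration of Theorem D.1, the sketch below instantiates the generic Chernoff bound for a Binomial(n, p) variable, whose moment generating function is $M(t) = (1 - p + pe^t)^n$, and checks it against the exact tail probability; the grid search over t is our own simplification (any fixed $t > 0$ already gives a valid bound).

```python
import math

# Numerical sanity check of the generic Chernoff bound (Theorem D.1)
# for X ~ Binomial(n, p), whose MGF is M(t) = (1 - p + p*e^t)^n.
n, p, a = 100, 0.5, 70

def mgf(t):
    return (1 - p + p * math.exp(t)) ** n

def chernoff_bound(a):
    # Optimize over a small grid of t > 0; every t gives a valid upper bound.
    return min(mgf(t) * math.exp(-t * a) for t in [i / 100 for i in range(1, 300)])

def exact_tail(a):
    # Exact P(X >= a) from the binomial probability mass function.
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(a, n + 1))

assert exact_tail(a) <= chernoff_bound(a) <= 1.0  # valid and nontrivial bound
```

For these parameters the exact tail is on the order of $10^{-5}$ and the optimized Chernoff bound on the order of $10^{-4}$: the bound is loose by a constant in the exponent but captures the exponential decay.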