# On Robust Multiclass Learnability

Jingyuan Xu, School of Computer Science, Wuhan University, jingyuanxu777@gmail.com
Weiwei Liu, School of Computer Science, Wuhan University, liuweiwei863@gmail.com

Abstract

This work analyzes the robust learning problem in the multiclass setting. Under the framework of Probably Approximately Correct (PAC) learning, we first show that the graph dimension and the Natarajan dimension, which characterize standard multiclass learnability, are no longer applicable in the robust learning problem. We then generalize these notions to the robust learning setting, denoted as the adversarial graph dimension (AG-dimension) and the adversarial Natarajan dimension (AN-dimension). Upper and lower bounds on the sample complexity of robust multiclass learning are rigorously derived based on the AG-dimension and AN-dimension, respectively. Moreover, we calculate the AG-dimension and AN-dimension of the class of linear multiclass predictors, and show that the graph (Natarajan) dimension is of the same order as the AG (AN)-dimension. Finally, we prove that the AG-dimension and AN-dimension are not equivalent.

1 Introduction

Learning models that are robust to adversarial perturbations has attracted significant research attention in recent years [1, 2, 3]. Notably, prior work on this subject has largely studied the theory of robust learnability in the context of binary supervised learning. However, many important problems require classification into a large number of target classes. For example, in image object recognition and language model building [4, 5, 6, 7, 8, 9], the number of classes scales with the number of possible objects or the dictionary size, respectively. It is thus of both practical and theoretical interest to generalize the robust binary learning theory to the robust multiclass setting.

In the standard multiclass learning setting, the finiteness of the Natarajan dimension or graph dimension [10] (both of which generalize the VC-dimension [11]) is necessary and sufficient to ensure learnability [12]. Recent work has shown that the finiteness of the VC-dimension is neither sufficient nor necessary for (proper) robust binary classification [13]. Inspired by [13], this paper demonstrates that the Natarajan/graph dimension can no longer characterize robust learnability in the multiclass problem. Meanwhile, some questions naturally arise: in the multiclass setting, how can robust learning be ensured? And what is necessary for robust learning? To answer these questions, we generalize the corrupted hypothesis classes [14], which arise from standard hypothesis classes in the binary setting in the presence of an adversary, and use them to reduce the robust learning problem to a non-robust one. By considering the graph dimension and the Natarajan dimension of the corrupted hypothesis classes, which we define as the adversarial graph dimension (AG-dimension) and the adversarial Natarajan dimension (AN-dimension) respectively, we derive upper and lower bounds on the sample complexity of the robust multiclass learning problem. We then analyze the AG-dimension and the AN-dimension of the class of linear multiclass predictors, showing that the AG (AN)-dimension and the graph (Natarajan) dimension are of the same order. Since it is proven that the graph dimension and the Natarajan dimension are equivalent when the
target class is finite [12], is it possible to use the AG-dimension and AN-dimension to bound each other? Our work shows that this equivalence does not hold, which we illustrate by constructing a hypothesis class that has finite AN-dimension but is not robustly learnable.

Related work. [15, 16, 17, 18] focus on the robust binary classification problem. [10] characterizes multiclass learnability in the non-robust setting. [19, 12, 20] define a large family of notions of dimension, all of which generalize the VC-dimension and may be used to estimate the sample complexity of multiclass classification. [21] calculates the graph dimension and the Natarajan dimension of the class of linear multiclass predictors. [22] upper bounds the adversarial risk of linear multiclass classifiers based on Rademacher complexity, from which sample complexity bounds for general robust multiclass learning problems cannot be directly derived. The theory of robust multiclass learnability for general hypothesis classes is less explored. This work takes some steps towards solving this problem.

2 Preliminaries

This section introduces the basic notation and concepts used throughout the present study. In the remainder of this article, $\mathbb{R}$, $\mathbb{N}$, $\mathbb{R}_+$ and $\mathbb{R}^d$ represent the sets of real numbers, natural numbers, non-negative real numbers and $d$-dimensional vectors over $\mathbb{R}$, respectively. We denote the set $\{1, \dots, n\}$ (for $n \in \mathbb{N}$) by $[n]$. If $A$ and $B$ are sets, we use $B^A$ to denote the collection of all mappings from $A$ to $B$, and $2^A$ to denote the power set of $A$, that is, the collection of all subsets of $A$. Let $C = \{c_1, \dots, c_m\} \subseteq \mathcal{X}$ and let $C' \subseteq C$ be a subset of $C$; then $I_{C'} \subseteq [m]$ denotes the index set of $C'$. Let $\mathcal{H}$ be a class of functions defined on $\mathcal{X}$; then $\mathcal{H}|_C$ represents the restriction of $\mathcal{H}$ to $C$, that is, $\mathcal{H}|_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\}$. We denote the indicator function by $\mathbb{1}(\text{event})$, which is 1 if the event happens and 0 otherwise. Finally, $\|\cdot\|_p$ represents the $\ell_p$ norm.

The robust learning problem is formalized as follows. Let $\mathcal{X} = \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{1, \dots, k\}$ be the label space. For an unknown distribution $D$ over $\mathcal{X} \times \mathcal{Y}$ (formally, there is a sigma-algebra $\mathcal{F} \subseteq 2^{\mathcal{X} \times \mathcal{Y}}$ of events and $D$ is a probability measure on $(\mathcal{X} \times \mathcal{Y}, \mathcal{F})$) and a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, the goal of robust learning is to find a function $f \in \mathcal{H}$, based on $n$ labeled training samples $S = \{(x_i, y_i)\}_{i=1}^n$ drawn independent and identically distributed (i.i.d.) from $D$, such that under a small perturbation $U : \mathcal{X} \to 2^{\mathcal{X}}$, the adversarial risk

$$R_U(f; D) \triangleq \mathbb{E}_{(x,y)\sim D}\Big[\sup_{x' \in U(x)} l(f(x'), y)\Big]$$

is minimal for the 0-1 loss $l(\hat{y}, y) \triangleq \mathbb{1}(\hat{y} \ne y)$. The perturbation set $U(x)$ is required to be nonempty, so some choice of $x'$ is always available. One choice for the perturbation is the $p$-norm ball ($p \ge 1$) with a small radius $r$, i.e. $U(x) = \{z \in \mathcal{X} : \|z - x\|_p \le r\}$. Selecting $r = 0$ gives the identity perturbation $I(x) \triangleq \{x\}$. Note that in this case the problem reduces to a standard learning problem, and $R_I(f; D)$ is called the standard risk of $f$ over $D$.

The mechanism for finding a function is called a learning algorithm. In this paper, we focus on proper learning algorithms, that is, learning algorithms that always pick a function from $\mathcal{H}$. Formally, a learning algorithm is a function $\mathbb{A} : \bigcup_{n=0}^{\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$. A common choice of robust learning algorithm is to learn through adversarial empirical risk minimization, or AERM for short:

$$\hat{f} \in \mathrm{AERM}_U(\mathcal{H}; S) \triangleq \arg\min_{f \in \mathcal{H}} R_U(f; \hat{D}_n),$$

where $\hat{D}_n$ is the empirical distribution generated by $S$, i.e. for $(x, y) \sim \hat{D}_n$, $(x, y)$ is equal to $(x_i, y_i)$ with probability $1/n$ for each $i \in [n]$.
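As a concrete illustration, the following is a minimal sketch of AERM for a finite hypothesis class, under the simplifying assumption that each perturbation set $U(x)$ has been discretized to a finite list of points; the paper itself gives no code, and all names below are illustrative.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]
Hypothesis = Callable[[Point], int]

def adversarial_empirical_risk(
    h: Hypothesis,
    sample: List[Tuple[Point, int]],
    U: Callable[[Point], Iterable[Point]],
) -> float:
    """R_U(h; D_hat_n) under the 0-1 loss: a point (x, y) counts as an error
    if ANY perturbation x' in U(x) makes h(x') != y (the sup inside the risk)."""
    errors = sum(any(h(xp) != y for xp in U(x)) for x, y in sample)
    return errors / len(sample)

def aerm(
    hypotheses: Sequence[Hypothesis],
    sample: List[Tuple[Point, int]],
    U: Callable[[Point], Iterable[Point]],
) -> Hypothesis:
    """Proper AERM: return some f in H minimizing the adversarial empirical risk."""
    return min(hypotheses, key=lambda h: adversarial_empirical_risk(h, sample, U))
```

With `U = lambda x: [x]` (the identity perturbation $I$), `adversarial_empirical_risk` reduces to the standard empirical risk.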
Learning algorithms like AERM do not always produce a function that achieves the optimal adversarial risk, because a training set consisting of finitely many samples is not always sufficiently informative to represent $D$. Thus, we introduce the definition of robust probably approximately correct (PAC) learnability in the realizable and agnostic settings [14, 13]:

Definition 1 (Agnostic Robust PAC Learnability). Under perturbation $U$, a hypothesis class $\mathcal{H}$ is robustly PAC learnable in the agnostic setting if there is a function $m^{ag}_{\mathcal{H},U} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm $\mathbb{A} : \bigcup_{n=0}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$ with the following property. For every $\epsilon, \delta \in (0,1)$ and every distribution $D$ over $\mathcal{X}\times\mathcal{Y}$, let $S = \{(x_i, y_i)\}_{i=1}^n$ be $n$ samples generated i.i.d. by $D$ s.t. $n > m^{ag}_{\mathcal{H},U}(\epsilon, \delta)$. Then with probability at least $1-\delta$,

$$R_U(\mathbb{A}(S); D) \le \min_{f\in\mathcal{H}} R_U(f; D) + \epsilon.$$

The function $m^{ag}_{\mathcal{H},U}$ is defined as the sample complexity of learning $\mathcal{H}$ under perturbation $U$ in the agnostic setting.

Definition 2 (Realizable Robust PAC Learnability). Under perturbation $U$, a hypothesis class $\mathcal{H}$ is robustly PAC learnable in the realizable setting if there is a function $m^{re}_{\mathcal{H},U} : (0,1)^2 \to \mathbb{N}$ and a learning algorithm $\mathbb{A} : \bigcup_{n=0}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$ with the following property. For every $\epsilon, \delta \in (0,1)$ and every distribution $D$ over $\mathcal{X}\times\mathcal{Y}$ for which there exists $f^* \in \mathcal{H}$ with $R_U(f^*; D) = 0$, let $S = \{(x_i, y_i)\}_{i=1}^n$ be $n$ samples generated i.i.d. by $D$ s.t. $n > m^{re}_{\mathcal{H},U}(\epsilon, \delta)$. Then with probability at least $1-\delta$, $R_U(\mathbb{A}(S); D) \le \epsilon$. The function $m^{re}_{\mathcal{H},U}$ is defined as the sample complexity of learning $\mathcal{H}$ under perturbation $U$ in the realizable setting.

These definitions agree with standard PAC learnability when $U = I$, which has been widely studied in both the binary ($k = 2$) and multiclass ($k > 2$) settings. We recall the known result regarding the sample complexity of binary learning, beginning with the definition of the Vapnik-Chervonenkis dimension (VC-dimension) [11]:

Definition 3 (VC-dimension). Let $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $S$ is shattered by $\mathcal{H}$ if $\mathcal{H}|_S = \{0,1\}^S$. The VC-dimension of $\mathcal{H}$, denoted by $\mathrm{VC}(\mathcal{H})$, is the maximal cardinality of a set that is shattered by $\mathcal{H}$.

The following theorem, based on the VC-dimension, characterizes the sample complexity of learning binary hypothesis classes.

Theorem 1 ([11] and [23]). There are absolute constants $C_1, C_2 > 0$ such that for every $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$,

$$C_1\,\frac{\mathrm{VC}(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon} \le m^{re}_{\mathcal{H},I}(\epsilon,\delta) \le C_2\,\frac{\mathrm{VC}(\mathcal{H})\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}}{\epsilon},$$

$$C_1\,\frac{\mathrm{VC}(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2} \le m^{ag}_{\mathcal{H},I}(\epsilon,\delta) \le C_2\,\frac{\mathrm{VC}(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2}.$$

It is natural to seek a generalization of the VC-dimension to hypothesis classes of non-binary functions. We recall two such generalizations, both introduced by [10]:

Definition 4 (Graph dimension). Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $\mathcal{H}$ G-shatters $S$ if there exists an $f : S \to \mathcal{Y}$ such that for every $T \subseteq S$, there is a $g \in \mathcal{H}$ such that $\forall x \in T,\ g(x) = f(x)$, and $\forall x \in S\setminus T,\ g(x) \ne f(x)$. The graph dimension of $\mathcal{H}$, denoted by $d_G(\mathcal{H})$, is the maximal cardinality of a set that is G-shattered by $\mathcal{H}$.

Definition 5 (Natarajan dimension). Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $\mathcal{H}$ N-shatters $S$ if there exist $f_1, f_2 : S \to \mathcal{Y}$ such that $\forall x \in S,\ f_1(x) \ne f_2(x)$, and for every $T \subseteq S$, there is a $g \in \mathcal{H}$ such that $\forall x \in T,\ g(x) = f_1(x)$, and $\forall x \in S\setminus T,\ g(x) = f_2(x)$. The Natarajan dimension of $\mathcal{H}$, denoted by $d_N(\mathcal{H})$, is the maximal cardinality of a set that is N-shattered by $\mathcal{H}$.

Both the graph dimension and the Natarajan dimension coincide with the VC-dimension for $k = 2$.
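For a finite hypothesis class restricted to a finite set $S$, Definitions 4 and 5 can be checked by brute force. A sketch (exponential in $|S|$, so only for small instances; the helper names are ours, not the paper's):

```python
from itertools import chain, combinations

def _subsets(n):
    """All subsets of {0, ..., n-1}, as sets of indices."""
    return (set(c) for c in
            chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))

def g_shatters(H, S):
    """Definition 4: some witness f admits, for every T, a g agreeing with f
    exactly on T. A witness may be taken from H|_S (take T = S)."""
    patterns = {tuple(h(x) for x in S) for h in H}  # H restricted to S
    n = len(S)
    return any(
        all(any(all((p[i] == f[i]) == (i in T) for i in range(n)) for p in patterns)
            for T in _subsets(n))
        for f in patterns
    )

def n_shatters(H, S):
    """Definition 5: witnesses f1, f2 disagree everywhere on S, and every T is
    realized by some g equal to f1 on T and to f2 off T."""
    patterns = {tuple(h(x) for x in S) for h in H}
    n = len(S)
    return any(
        all(any(all(p[i] == (f1[i] if i in T else f2[i]) for i in range(n))
                for p in patterns)
            for T in _subsets(n))
        for f1 in patterns for f2 in patterns
        if all(f1[i] != f2[i] for i in range(n))
    )
```

The graph (Natarajan) dimension of a finite class is then the largest $|S|$ for which `g_shatters` (`n_shatters`) returns `True`; both checkers may restrict the witnesses to $\mathcal{H}|_S$, since taking $T = S$ (and $T = \emptyset$) forces each witness to be realized by some hypothesis.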
Based on these notions, the following theorem provides upper and lower bounds on the sample complexity of standard multiclass learning:

Theorem 2 ([10] and [12]). There are absolute constants $C_1, C_2 > 0$ such that for every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$,

$$C_1\,\frac{d_N(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon} \le m^{re}_{\mathcal{H},I}(\epsilon,\delta) \le C_2\,\frac{d_G(\mathcal{H})\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}}{\epsilon},$$

$$C_1\,\frac{d_N(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2} \le m^{ag}_{\mathcal{H},I}(\epsilon,\delta) \le C_2\,\frac{d_G(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2}.$$

3 AG-dimension and upper bounds for robust multiclass learning

We start by showing that a finite graph dimension is not sufficient for robust multiclass learnability. We then introduce the notion of corrupted hypotheses, presented by [14]. By computing the graph dimension of these hypotheses, which we define as the adversarial graph dimension, we can upper bound the sample complexity of the robust multiclass learning problem in both the realizable and agnostic settings.

3.1 Finite graph dimension is not sufficient for robust multiclass learning

Theorem 2 shows that finiteness of the graph dimension is a sufficient condition for multiclass learnability in the non-robust setting. In the adversarial setting, however, for hypothesis classes with finite graph dimension, one cannot ensure the robust PAC learnability of $\mathcal{H}$, even if $d_G(\mathcal{H}) = 1$.

Theorem 3. There exists a hypothesis class $\mathcal{H}$ with $d_G(\mathcal{H}) = 1$ and an adversary $U$ such that $\mathcal{H}$ is not robustly PAC learnable with respect to $U$.

To prove this theorem, we first present a lemma, which is generalized from Lemma 3 in [13].

Lemma 1. Let $m \in \mathbb{N}$. Then, there exists a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ such that for any learning rule $\mathbb{A} : \bigcup_{n=0}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$, there exists a distribution $D$ over $\mathcal{X}\times\mathcal{Y}$ such that:

1. There exists a function $f^* \in \mathcal{H}$ with $R_U(f^*; D) = 0$.
2. With probability at least $1/7$ over the choice of $S \sim D^m$, we have $R_U(\mathbb{A}(S); D) \ge 1/8$.

The proof of this lemma is presented in the Appendix. We now proceed with the proof of Theorem 3.

Proof of Theorem 3. The argument closely follows a proof of an analogous result by [13] for binary robust learning, but generalizes the constructions and analyses to match the definition of the graph dimension in the multiclass setting. For each $m \in \mathbb{N}$, let $X_m = \{x^{(m)}_1, \dots, x^{(m)}_{3m}\}$ be a set of $3m$ distinct points in $\mathcal{X}$, subject to: for all $x_i, x_j \in \bigcup_{m=1}^{\infty} X_m$, if $x_i \ne x_j$, then $U(x_i) \cap U(x_j) = \emptyset$. We construct $\mathcal{H}_m$ on $X_m$ as follows. For each $b \in \{0,1\}^{3m}$, we construct a set $Z_b$: initialize $Z_b = \emptyset$; for each $i \in [3m]$, if $b_i = 1$, pick a point $z \in U(x^{(m)}_i)\setminus X_m$ such that $z \notin Z_{b'}$ for every $b' \ne b$, and add it to $Z_b$. Let $h_b : \mathcal{X} \to \mathcal{Y}$ be a hypothesis such that $h_b(x) = 1$ if and only if $x \notin Z_b$ and $x \notin X_{m'}$ for $m' \ne m$. Furthermore, for fixed $m \in \mathbb{N}$ and $x \in X_{m'}$ s.t. $m' \ne m$, we require all $h_b$ to map $x$ to the same label $y_x\ (\ne 1)$, where $y_x$ is consistent for all $m$. That is,

$$h_b(x) = \begin{cases} \bar{1}, & x \in Z_b,\\ y_x\ (\ne 1), & x \in X_{m'},\ m' \ne m,\\ 1, & \text{otherwise}, \end{cases}$$

where $\bar{1}$ is some label that is not equal to 1. Then define $\mathcal{H}_m = \{h_b : b \in \{0,1\}^{3m},\ \sum_{i=1}^{3m} b_i = m\}$, and let $\mathcal{H} = \bigcup_{m=1}^{\infty} \mathcal{H}_m$.

We claim that $d_G(\mathcal{H}) \le 1$; that is, if we pick two points $x_1 \ne x_2 \in \mathcal{X}$, then $\mathcal{H}$ cannot G-shatter $\{x_1, x_2\}$. For every $h \in \mathcal{H}$ and $x \in \mathcal{X}$, there are at most two possible values of $h(x)$, one of which is 1; hence the only $f : \{x_1, x_2\} \to \mathcal{Y}$ in Definition 4 that needs to be considered is $f(x_1) = f(x_2) = 1$. There are six cases to consider:

(1) There exists $m \in \mathbb{N}$ such that $x_1, x_2 \in X_m$. Then for every $h_b \in \mathcal{H}_m$, $x_1, x_2 \notin Z_b$ by the construction of $Z_b$; hence $h_b(x_1) = h_b(x_2) = 1$. And for every $h \in \mathcal{H}_{m'}$ with $m' \ne m$, we have $h(x_1) = y_{x_1} \ne 1$ and $h(x_2) = y_{x_2} \ne 1$. So we only obtain the labelings $(1,1)$ and $(y_{x_1}, y_{x_2})$, with both coordinates of the latter different from 1.

(2) There exists $m \in \mathbb{N}$ such that $x_1, x_2 \in U(X_m)\setminus X_m$. There are two sub-cases to consider. (i) There exists $b$ s.t. $h_b \in \mathcal{H}_m$ and $x_1, x_2 \in Z_b$; then we have $h_b(x_1) = h_b(x_2) = \bar{1}$, while for the remaining $h \in \mathcal{H}$ we have $h(x_1) = h(x_2) = 1$.
So we only obtain the labelings $(1,1)$ and $(\bar{1},\bar{1})$ in this sub-case. (ii) In the remaining sub-case, a labeling with both coordinates different from 1 cannot be obtained, since $x_1, x_2 \notin X_{m'}$ for all $m'$ and there is no $b$ such that $x_1$ and $x_2$ are in $Z_b$ at the same time.

(3) $x_1 \in X_m$ and $x_2 \in X_{m'}$ for $m \ne m'$. In this case, we have $h(x_1) = 1$ and $h(x_2) = y_{x_2} \ne 1$ for $h \in \mathcal{H}_m$, and $h(x_1) = y_{x_1} \ne 1$ and $h(x_2) = 1$ for $h \in \mathcal{H}_{m'}$. For those predictors that are in neither $\mathcal{H}_m$ nor $\mathcal{H}_{m'}$, we have $h(x_1) = y_{x_1} \ne 1$ and $h(x_2) = y_{x_2} \ne 1$. Hence we cannot label both points $x_1$ and $x_2$ with $(1,1)$.

(4) $x_1 \in X_m$ and $x_2 \in U(X_{m'})\setminus X_{m'}$ for $m \ne m'$. Observe that $h(x_1) = 1$ if and only if $h \in \mathcal{H}_m$. But every $h \in \mathcal{H}_m$ labels $x_2$ with 1. Hence the labeling $(1, \bar{1})$ cannot be obtained.

(5) $x_1 \in U(X_m)\setminus X_m$ and $x_2 \in U(X_{m'})\setminus X_{m'}$ for $m \ne m'$. Similarly to the discussion above, for all $h \in \mathcal{H}$, either $h(x_1) = 1$ or $h(x_2) = 1$. So we cannot obtain the labeling $(\bar{1}, \bar{1})$.

(6) The remaining cases, i.e. $x_1, x_2 \notin U(X_m)$ for all $m \in \mathbb{N}$. In this case we can only obtain the labeling $(1,1)$, since for all $h \in \mathcal{H}$ we have $h(x_1) = h(x_2) = 1$.

The arguments above show that $d_G(\mathcal{H}) \le 1$. It remains to show that $\mathcal{H}$ is not robustly PAC learnable. By Lemma 1, for any learning rule $\mathbb{A} : \bigcup_{n=0}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$ and any $m \in \mathbb{N}$, there exists a distribution $D$ over $X_m \times \mathcal{Y}$ for which there exists a predictor $h^* \in \mathcal{H}$ with $R_U(h^*; D) = 0$, but with probability at least $1/7$ over $S \sim D^m$, $R_U(\mathbb{A}(S); D) > 1/8$ if $\mathbb{A}(S) \in \mathcal{H}_m$. Classifiers in $\mathcal{H}_{m'}$ with $m' \ne m$ are also non-robust, since they make mistakes on points in $X_m$. We conclude that $\mathcal{H}$ is not robustly PAC learnable, which completes the proof.

3.2 Adversarial graph dimension

Consider a given hypothesis $f : \mathcal{X} \to \mathcal{Y}$. A labeled adversarial sample $(x', y)$ is classified correctly if $x' \in f^{-1}(y)$; a labeled example $(x, y)$ is classified correctly under $U$ if $U(x) \subseteq f^{-1}(y)$. Following [14], let $\tilde{\mathcal{Y}} = \mathcal{Y} \cup \{\bot\}$, where $\bot$ is the special always-wrong output, i.e. $l(y, \bot) = l(\bot, y) = 1$ for every $y \in \mathcal{Y}$. We define the mapping $\kappa_U : \mathcal{Y}^{\mathcal{X}} \to \tilde{\mathcal{Y}}^{\mathcal{X}}$:

$$\kappa_U(f)(x) = \begin{cases} y, & U(x) \subseteq f^{-1}(y),\\ \bot, & \text{otherwise}. \end{cases} \qquad (1)$$

The corrupted set of hypotheses induced by perturbation $U$ is then defined by $\tilde{\mathcal{H}} = \{\kappa_U(f) : f \in \mathcal{H}\}$.

Lemma 2 (Lemma 2 in [14]). For any perturbation $U$ and distribution $D$, $R_U(f; D) = R_I(\kappa_U(f); D)$.

This lemma establishes the equivalence between robustly learning $\mathcal{H}$ and learning $\tilde{\mathcal{H}}$ without an adversary, which enables us to use standard techniques to bound the sample complexity. Intuitively, we can derive a sufficient condition by considering the graph dimension of $\tilde{\mathcal{H}}$, which can be equivalently formalized as follows:

Definition 6 (Adversarial Graph Dimension). Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $\mathcal{H}$ adversarially G-shatters $S$ if there exists an $f : \mathcal{X} \to \mathcal{Y}$ such that for every $T \subseteq S$, there is a $g \in \mathcal{H}$ such that $\forall x \in T, \forall x' \in U(x),\ g(x') = f(x)$, and $\forall x \in S\setminus T, \exists x' \in U(x),\ g(x') \ne f(x)$. The adversarial graph dimension (AG-dimension) of $\mathcal{H}$, denoted by $d^U_G(\mathcal{H})$, is the maximal cardinality of a set that is adversarially G-shattered by $\mathcal{H}$.

Proposition 1. Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $\tilde{\mathcal{H}}$ be the corrupted set of hypotheses induced by perturbation $U$. Then we have $d_G(\tilde{\mathcal{H}}) = d^U_G(\mathcal{H})$.

The proof is presented in the Appendix. Intuitively, by combining the above proposition with Lemma 2 and Theorem 2, we can upper bound the sample complexity of robust multiclass learning. Formally:

Theorem 4. There is an absolute constant $C > 0$ such that for every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$,

$$m^{re}_{\mathcal{H},U}(\epsilon,\delta) \le C\,\frac{d^U_G(\mathcal{H})\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}}{\epsilon}, \quad \text{and} \quad m^{ag}_{\mathcal{H},U}(\epsilon,\delta) \le C\,\frac{d^U_G(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2}.$$

Proof. Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class with $d^U_G(\mathcal{H}) = d$. For every $f \in \mathcal{H}$, define $\tilde{f} : \mathcal{X}\times\mathcal{Y} \to \{0,1\}$ by setting $\tilde{f}(x,y) = 1$ if and only if $f(x') = y$ for all $x' \in U(x)$.
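Both constructions, $\kappa_U$ from (1) and the indicator $\tilde{f}$ just defined, are mechanical once the perturbation sets are finite. A minimal sketch, with a sentinel object standing in for $\bot$ (our own device, not the paper's):

```python
BOTTOM = object()  # stand-in for the always-wrong output ⊥

def kappa_U(f, U):
    """The corrupted hypothesis of (1): κ_U(f)(x) = y if U(x) ⊆ f^{-1}(y), else ⊥."""
    def corrupted(x):
        labels = {f(xp) for xp in U(x)}   # all labels f assigns on U(x)
        return next(iter(labels)) if len(labels) == 1 else BOTTOM
    return corrupted

def f_tilde(f, U):
    """The indicator from the proof of Theorem 4:
    f̃(x, y) = 1 iff f(x') = y for every x' in U(x)."""
    return lambda x, y: int(all(f(xp) == y for xp in U(x)))
```

Since $\kappa_U(f)(x) = y$ exactly when $\tilde{f}(x,y) = 1$, the 0-1 empirical error of `kappa_U(f)` on a clean sample equals the adversarial empirical risk of `f`, giving a finite-sample check of Lemma 2.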
Note that, according to our construction, we have

$$\Pr_{(x,y)\sim D}[\tilde{f}(x,y) \ne 1] = \mathbb{E}_{(x,y)\sim D}\Big[\sup_{x'\in U(x)} \mathbb{1}(f(x') \ne y)\Big]. \qquad (2)$$

In the realizable case, let $\bar{\mathcal{H}} = \{\tilde{f} : f \in \mathcal{H}\}$. From the definitions, we can see that $\mathrm{VC}(\bar{\mathcal{H}}) = d^U_G(\mathcal{H})$. By Theorem 1, there exists $C > 0$ such that when we draw $m > C\,\frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon}$ samples i.i.d. from some distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, then with probability at least $1-\delta$ the learned $\tilde{f}$ satisfies $\Pr_{(x,y)\sim D}[\tilde{f}(x,y) \ne 1] \le \epsilon$. Plugging (2) into this inequality completes the proof for the realizable case; the agnostic case is similar.

4 AN-dimension and lower bounds for robust multiclass learning

In this section, we discuss necessary conditions for robust multiclass learnability.

4.1 Finite Natarajan dimension is not necessary for robust multiclass learning

We first show that finiteness of the ordinary Natarajan dimension is also not necessary for robust multiclass learnability.

Theorem 5. There exist $\mathcal{H}, U$ such that $d_N(\mathcal{H}) = \infty$ but $\mathcal{H}$ is robustly PAC learnable under $U$, i.e. $m^{ag}_{\mathcal{H},U}(\epsilon,\delta) < \infty$.

Proof. The proof follows the construction from [13], which shows that a finite VC-dimension is not necessary for robust binary learnability. Let $\mathcal{H} = \mathcal{Y}^{\mathcal{X}}$ be the family of all mappings from $\mathcal{X}$ to $\mathcal{Y}$, and let $U(x) = \mathcal{X}$ be an all-powerful adversary. It is easy to see that for every $n \in \mathbb{N}$ there exist $n$ distinct points in $\mathbb{R}^d$ that are N-shattered by $\mathcal{H}$; hence $d_N(\mathcal{H}) = \infty$. Given a distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, let $S = \{(x_i, y_i)\}_{i=1}^m$ be $m$ samples drawn i.i.d. from $D$, and let $\mathbb{A}$ be a learning rule such that $\mathbb{A}(S)$ returns the constant-label predictor $h$:

$$h(x) = k_m := \arg\max_{k\in\mathcal{Y}} |\{1 \le i \le m : y_i = k\}|, \quad \forall x \in \mathcal{X}.$$

According to the weak law of large numbers, $\hat{D}_m \to D$ as $m \to \infty$, where $\hat{D}_m$ is the empirical distribution generated by $S$; hence $\Pr\big[k_m = \arg\max_{k\in\mathcal{Y}} \Pr_{(x,y)\sim D}[y = k]\big] \to 1$ as $m \to \infty$. Hence for all $\delta \in (0,1)$ there exists an $m_0 \in \mathbb{N}$ such that when $m > m_0$, with probability at least $1-\delta$, $R_U(h; D)$ attains the optimal adversarial risk in $\mathcal{H}$, i.e. $R_U(h; D) = \min_{f\in\mathcal{H}} R_U(f; D)$.

4.2 Adversarial Natarajan dimension

Following the idea of generalizing the graph dimension to the AG-dimension, for a given hypothesis class $\mathcal{H}$, one can consider the Natarajan dimension of its corrupted hypothesis class to lower bound the sample complexity of robust learning. However, there are several natural but different ways in which the notion of N-shattering can be generalized: we can either let the $f_2$ that witnesses $\mathcal{H}$ N-shattering $S$ take the value $\bot$ or not. Both notions yield a necessary condition for robust learnability. In this work, we choose the stronger notion; i.e., we require the entire perturbation sets to be N-shattered by $\mathcal{H}$.

Definition 7 (Adversarial Natarajan Dimension). Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $\mathcal{H}$ adversarially N-shatters $S$ if there exist $f_1, f_2 : \mathcal{X} \to \mathcal{Y}$ such that $\forall x \in S,\ f_1(x) \ne f_2(x)$, and for every $T \subseteq S$, there is a $g \in \mathcal{H}$ such that $\forall x \in T, \forall x' \in U(x),\ g(x') = f_1(x)$, and $\forall x \in S\setminus T, \forall x' \in U(x),\ g(x') = f_2(x)$. The adversarial Natarajan dimension (AN-dimension) of $\mathcal{H}$, denoted by $d^U_N(\mathcal{H})$, is the maximal cardinality of a set that is adversarially N-shattered by $\mathcal{H}$.
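On finite instances, Definition 7 can again be checked exhaustively. The sketch below (same assumptions and caveats as the earlier checkers: finite class, finite discretized perturbation sets, helper names our own) uses the fact that a hypothesis can satisfy the branch "$\forall x' \in U(x),\ g(x') = f_1(x)$" only if it is constant on $U(x)$:

```python
from itertools import chain, combinations

def _subsets(n):
    return (set(c) for c in
            chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))

def _adv_pattern(h, S, U):
    """Per point x in S: the single label h assigns on all of U(x), or None."""
    pattern = []
    for x in S:
        labels = {h(xp) for xp in U(x)}
        pattern.append(next(iter(labels)) if len(labels) == 1 else None)
    return tuple(pattern)

def adversarially_n_shatters(H, S, U):
    """Definition 7. The witnesses f1, f2 must themselves be realized as
    constant labels on every U(x), x in S (take T = S and T = ∅)."""
    patterns = {_adv_pattern(h, S, U) for h in H}
    witnesses = [p for p in patterns if None not in p]
    n = len(S)
    return any(
        all(any(all(p[i] == (f1[i] if i in T else f2[i]) for i in range(n))
                for p in patterns)
            for T in _subsets(n))
        for f1 in witnesses for f2 in witnesses
        if all(f1[i] != f2[i] for i in range(n))
    )
```

With `U = lambda x: [x]`, this coincides with the standard N-shattering check, consistent with $d^I_N(\mathcal{H}) = d_N(\mathcal{H})$.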
Based on this definition, we present the following lower bounds:

Theorem 6. There is an absolute constant $C > 0$ such that for every $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$,

$$m^{re}_{\mathcal{H},U}(\epsilon,\delta) \ge C\,\frac{d^U_N(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon}, \quad \text{and} \quad m^{ag}_{\mathcal{H},U}(\epsilon,\delta) \ge C\,\frac{d^U_N(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon^2}.$$

Proof. Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class such that $d^U_N(\mathcal{H}) = d$, and let $\mathcal{H}_d = \{0,1\}^{[d]}$ be the family of all mappings from $[d]$ to $\{0,1\}$. We want to show that for every learning algorithm $\mathbb{A}$ for $\mathcal{H}$ and every $\epsilon, \delta \in (0,1)$, there exists a learning algorithm $\bar{\mathbb{A}}$ satisfying: if there exists a function $m^{re}_{\mathbb{A},U} : (0,1)^2 \to \mathbb{N}$ such that when $\mathbb{A}$ witnesses $m > m^{re}_{\mathbb{A},U}(\epsilon,\delta)$ samples (denoted $S$) drawn i.i.d. from some distribution $D$, $\Pr[R_U(\mathbb{A}(S); D) \le \epsilon] > 1-\delta$, then when $\bar{\mathbb{A}}$ witnesses $m$ samples (denoted $S'$) drawn i.i.d. from some distribution $D'$ on $[d]\times\{0,1\}$, $\Pr[R_I(\bar{\mathbb{A}}(S'); D') \le \epsilon] > 1-\delta$ holds. By this claim, we have $m^{re}_{\bar{\mathbb{A}},I} \le m^{re}_{\mathbb{A},U}$, and thus $m^{re}_{\mathcal{H}_d,I} \le m^{re}_{\mathcal{H},U}$. Note that $\mathrm{VC}(\mathcal{H}_d) = d$, so by Theorem 1 we have $m^{re}_{\mathcal{H},U}(\epsilon,\delta) \ge C\big(\frac{d + \ln(1/\delta)}{\epsilon}\big)$ for some constant $C > 0$ (independent of $\mathcal{H}$), and similarly for the agnostic case.

We now prove the claim. Let $\{s_1, \dots, s_d\} \subseteq \mathcal{X}$ be a set that is adversarially N-shattered by $\mathcal{H}$, and let $f_0, f_1$ be functions that witness the adversarial N-shattering. Given a sample set $S' = \{(x_i, y_i)\}_{i=1}^m \subseteq [d]\times\{0,1\}$, let $g = \mathbb{A}(\{(s_{x_i}, f_{y_i}(s_{x_i}))\}_{i=1}^m)$. Consider the learning algorithm $\bar{\mathbb{A}} : \bigcup_{n=0}^{\infty}([d]\times\{0,1\})^n \to \mathcal{H}_d$:

$$\bar{\mathbb{A}}(S')(j) = \begin{cases} 1, & g(s'_j) = f_1(s_j)\ \ \forall s'_j \in U(s_j),\\ 0, & \text{otherwise}. \end{cases}$$

Denote $\bar{\mathbb{A}}(S')$ by $\bar{f}$. By the definition of adversarial N-shattering, it can be derived that $\bar{f}(j) = 0$ if and only if $g(s'_j) = f_0(s_j)$ for all $s'_j \in U(s_j)$. By the construction we have

$$\mathbb{E}_{(j,y)\sim D'}\big[\mathbb{1}(\bar{f}(j) \ne y)\big] = \mathbb{E}_{(j,y)\sim D'}\Big[\sup_{s'_j\in U(s_j)} \mathbb{1}\big(g(s'_j) \ne f_y(s_j)\big)\Big] = R_U(g; \hat{D}'_m),$$

where $\hat{D}'_m$ is defined by $\Pr_{\hat{D}'_m}(X = s_j, Y = f_y(s_j)) = \Pr_{D'}(J = j, B = y)$.

5 The AG/AN-dimension of linear multiclass predictors

In this section, we calculate the AG-dimension and AN-dimension of linear multiclass predictors under bounded $\ell_p$ perturbations, motivated by the binary results of [14] and the non-adversarial results of [21]. Recall the definition of linear multiclass predictors. Let $\Psi : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}^{dk}$ be a class-sensitive feature mapping [24]:

$$\Psi(x, y) = \big[\underbrace{0, \dots, 0}_{\in\,\mathbb{R}^{(y-1)d}},\ \underbrace{x^{(1)}, \dots, x^{(d)}}_{\in\,\mathbb{R}^{d}},\ \underbrace{0, \dots, 0}_{\in\,\mathbb{R}^{(k-y)d}}\big],$$

where $x^{(i)}$ represents the $i$th coordinate of $x$. The class of linear multiclass predictors is defined as

$$\mathcal{H}_{\Psi} = \Big\{x \mapsto \arg\max_{i\in\mathcal{Y}} \langle w, \Psi(x, i)\rangle : w \in \mathbb{R}^{dk}\Big\}. \qquad (3)$$

It is known that the graph dimension and Natarajan dimension of $\mathcal{H}_{\Psi}$ satisfy $(d-1)(k-1) \le d_N(\mathcal{H}_{\Psi}) \le d_G(\mathcal{H}_{\Psi}) \le O(dk\ln(dk))$ [21]. We will demonstrate in the remainder of this section that the AG-dimension and AN-dimension also satisfy this bound. To this end, we start by calculating the AN-dimension of $\mathcal{H}_{\Psi}$. Specifically, one can show that the AN-dimension of $\mathcal{H}_{\Psi}$ equals the standard Natarajan dimension of $\mathcal{H}_{\Psi}$.

Theorem 7. Let $U(x) = \{z \in \mathcal{X} : \|z - x\|_p \le r\}$ for some $r \in \mathbb{R}_+$ and $p \in \mathbb{R}_+ \cup \{\infty\}$, and let $\mathcal{H}_{\Psi}$ be as defined in (3). Then the AN-dimension of $\mathcal{H}_{\Psi}$ satisfies $d^U_N(\mathcal{H}_{\Psi}) = d_N(\mathcal{H}_{\Psi})$.

Proof. Since adversarial N-shattering implies N-shattering, we have $d^U_N(\mathcal{H}_{\Psi}) \le d_N(\mathcal{H}_{\Psi})$. To prove the reverse inequality, note that the classifiers in $\mathcal{H}_{\Psi}$ are linear; hence if $\mathcal{H}_{\Psi}$ N-shatters $S$, it also N-shatters the set obtained by replacing any $x \in S$ with a point $x' \in l_x$, where $l_x$ is the half-line that terminates at 0 and goes through $x$. For those $x \in S$ that cannot be adversarially N-shattered by $\mathcal{H}_{\Psi}$, we can always find some $x' \in l_x$ such that $U(x')$ does not cross the decision boundary of the $g_T$ that witnesses the N-shattering, for all $T \subseteq S$.

Note that the Natarajan dimension of $\mathcal{H}_{\Psi}$ satisfies $(d-1)(k-1) \le d_N(\mathcal{H}_{\Psi}) \le dk$ (Corollary 29.8 in [25]). Combining this result with Theorem 7, we have:

Corollary 1. Under the conditions of Theorem 7, $(d-1)(k-1) \le d^U_N(\mathcal{H}_{\Psi}) \le dk$.

Next, we present the AG-dimension of $\mathcal{H}_{\Psi}$:

Theorem 8. Let $U(x) = \{z \in \mathcal{X} : \|z - x\|_p \le r\}$ for some $r \in \mathbb{R}_+$ and $p \in \mathbb{R}_+ \cup \{\infty\}$, and let $\mathcal{H}_{\Psi}$ be as defined in (3). Then the AG-dimension of $\mathcal{H}_{\Psi}$ satisfies $(d-1)(k-1) \le d^U_G(\mathcal{H}_{\Psi}) \le O(dk\ln(dk))$.
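Before turning to the proof, a concrete rendering of the class (3) may be helpful. Splitting $w = (w_1, \dots, w_k) \in \mathbb{R}^{dk}$ into $k$ blocks gives $\langle w, \Psi(x, i)\rangle = \langle w_i, x\rangle$, and robust correctness at $(x, y)$ under the $\ell_p$ ball reduces, by Hölder's inequality, to a margin condition in the dual norm. The margin test below is a standard fact that we supply for illustration, not a result of the paper:

```python
import numpy as np

def predict(W: np.ndarray, x: np.ndarray) -> int:
    """H_Ψ from (3): W is the k×d matrix whose rows are the blocks w_1,...,w_k
    of w in R^{dk}; the predictor is argmax_i <w_i, x>. Labels are {1,...,k}."""
    return int(np.argmax(W @ x)) + 1

def robust_at(W: np.ndarray, x: np.ndarray, y: int, r: float, p: float = 2.0) -> bool:
    """True iff predict(W, x') = y for every ||x' - x||_p <= r (p >= 1).
    By Hölder, min over the ball of <w_y - w_j, x'> equals
    <w_y - w_j, x> - r * ||w_y - w_j||_q with 1/p + 1/q = 1; we require it
    positive for every j != y (strictness resolves argmax ties conservatively)."""
    if p == 1:
        q = np.inf
    elif np.isinf(p):
        q = 1.0
    else:
        q = p / (p - 1.0)
    wy = W[y - 1]
    return all(
        (wy - W[j]) @ x - r * np.linalg.norm(wy - W[j], ord=q) > 0
        for j in range(W.shape[0]) if j != y - 1
    )
```

For instance, with $k = 3$ and $d = 2$, `W = np.array([[2., 0.], [0., 2.], [-1., -1.]])` and `x = np.array([1., 0.])`, `predict` returns 1, and `robust_at(W, x, 1, r)` holds for $p = 2$ exactly when $r < 1/\sqrt{2}$ (the binding constraint is against class 2).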
To prove this result, we first present a lemma, which can be viewed as a generalization of Sauer's Lemma [26] to the adversarial setting. We first define the restriction of a corrupted class. Let $\tilde{\mathcal{H}}$ be the corrupted hypothesis class induced by $U$, and let $S = \{x_1, \dots, x_m\} \subseteq \mathcal{X}$. Fix an $f \in \mathcal{Y}^S$; for each $h \in \mathcal{H}$, define $h_f : S \to \{0,1\}$ such that $h_f(x_i) = 1$ if and only if $\kappa_U(h)(x_i) = f(x_i)$, $i \in [m]$, and define $\mathcal{H}_f = \{h_f : h \in \mathcal{H}\}$. The restriction of $\tilde{\mathcal{H}}$ to $S$ is defined by $\tilde{\mathcal{H}}|_S = \mathcal{H}_{f^*}|_S$, where $f^* = \arg\max_{f\in\mathcal{Y}^S} |\mathcal{H}_f|_S|$.

Lemma 3. Let $U$ be an adversary and $\mathcal{H}$ be a hypothesis class with $d^U_G(\mathcal{H}) \le d_G < \infty$. Let $\tilde{\mathcal{H}}$ be the corrupted hypothesis class induced by $U$, and let $S = \{x_1, \dots, x_m\} \subseteq \mathcal{X}$. Then, for all $m > d_G + 1$ we have

$$\big|\tilde{\mathcal{H}}|_S\big| \le \Big(\frac{em}{d_G}\Big)^{d_G}.$$

Proof. By Sauer's Lemma [26] and the fact that $\mathrm{VC}(\mathcal{H}_f) \le d^U_G(\mathcal{H})$ for all $f \in \mathcal{Y}^S$.

We now prove Theorem 8. The construction idea follows the proof of the non-adversarial result in [21].

Proof of Theorem 8. The lower bound follows from Theorem 7 and the fact that $d^U_G(\mathcal{H}) \ge d^U_N(\mathcal{H})$ for all $\mathcal{H}$. To upper bound $d_G := d^U_G(\mathcal{H}_{\Psi})$, let $S = (x_1, \dots, x_{d_G}) \subseteq \mathbb{R}^d$ be a set that is adversarially G-shattered by $\mathcal{H}_{\Psi}$, and let $f : S \to \mathcal{Y}$ be a function that witnesses the shattering. For every $(i,j) \in [d_G]\times[k]$, define $z_{i,j} = \Psi(x_i, f(x_i)) - \Psi(x_i, j)$, and denote $Z = \{z_{i,j} : (i,j) \in [d_G]\times[k]\}$. We now show that there exists an injective mapping from the subsets of $S$ to $(\mathcal{W}_{dk})_f|_Z$, where $\mathcal{W}_{dk}$ is the class of halfspace classifiers over $\mathbb{R}^{dk}$. For each $T \subseteq S$, by the definition of adversarial G-shattering, there is an $h_T \in \mathcal{H}_{\Psi}$ such that

$$\forall x \in T, \forall x' \in U(x),\ h_T(x') = f(x), \quad \text{and} \quad \forall x \in S\setminus T, \exists x' \in U(x),\ h_T(x') \ne f(x). \qquad (4)$$

Let $w_T \in \mathbb{R}^{dk}$ be the vector defining $h_T$; then (4) implies

$$\forall x \in T, \forall x' \in U(x), \forall j \in [k],\ \langle w_T, \Psi(x', f(x))\rangle \ge \langle w_T, \Psi(x', j)\rangle, \quad \text{and} \quad \forall x \in S\setminus T, \exists x' \in U(x), \exists j \in [k],\ \langle w_T, \Psi(x', f(x))\rangle < \langle w_T, \Psi(x', j)\rangle. \qquad (5)$$

It can be easily derived from (5) that $\forall i \in I_{S\setminus T}, \exists j \in [k], \exists z \in V(z_{i,j})$ s.t. $\langle w_T, z\rangle < 0$, where $V$ is the $p$-norm ball in $\mathbb{R}^{dk}$ with radius $r$. And from (5) we can assume w.l.o.g. that $\forall i \in I_T, \forall j \in [k], \forall z \in V(z_{i,j}),\ \langle w_T, z\rangle \ge 0$. This is because, if there exists some $z \in V(z_{i,j})$ s.t. $\langle w_T, z\rangle < 0$, then, noticing that $z = z_{i,j} + c'$ for some bounded $c' \in V(0)$, we can replace $x_i$ by some $x'_i$ without destroying the adversarial G-shattering (see the trick used in the proof of Theorem 7), to ensure $\langle w_T, \Psi(x'_i, f(x_i)) - \Psi(x'_i, j)\rangle > \langle w_T, c'\rangle$ for all $c' \in V(0)$. Consequently, for every $T \subseteq S$ we obtain a label pattern $P_T : [d_G]\times[k] \to \{0,1\}$:

$$P_T(i,j) = \mathbb{1}\Big(\min_{z\in V(z_{i,j})} \langle w_T, z\rangle \ge 0\Big).$$

It satisfies $\forall i \in I_T, \forall j \in [k],\ P_T(i,j) = 1$ and $\forall i \in I_{S\setminus T}, \exists j \in [k],\ P_T(i,j) = 0$, which implies $P_T \ne P_{T'}$ for $T \ne T'$. And by definition, each $P_T$ corresponds to a unique element of $(\mathcal{W}_{dk})_f|_Z$, which completes the proof of our claim.

Note that the AG-dimension of the halfspace classifier class with respect to $U$ coincides with the definition of the adversarial VC-dimension in [14]; thus $d^U_G(\mathcal{W}_{dk}) = dk + 1$ by Theorem 2 in [14]. Consequently, we have

$$|2^S| = 2^{d_G} \le \big|(\mathcal{W}_{dk})_f|_Z\big| \le \big|\tilde{\mathcal{W}}_{dk}|_Z\big| \overset{(i)}{\le} \Big(\frac{e|Z|}{d^U_G(\mathcal{W}_{dk})}\Big)^{d^U_G(\mathcal{W}_{dk})} \le (k\,d_G)^{dk+1},$$

where we have used Lemma 3 in step (i). We conclude that $d_G \le O(dk\ln(dk))$.

6 AG-dimension and AN-dimension are not equivalent

In the previous sections, we have shown that finiteness of the AN-dimension is a necessary condition for robust learnability, and that finiteness of the AG-dimension is a sufficient condition for robust learnability. Furthermore, analyzing these definitions on the class of linear multiclass predictors, we found that the AG-dimension (AN-dimension) matches the order of the graph dimension (Natarajan dimension) for this class. The following question therefore naturally arises: are these two quantities equivalent?
In [12], it is proven that for every hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$,

$$d_N(\mathcal{H}) \le d_G(\mathcal{H}) \le 4.67\log_2(k)\,d_N(\mathcal{H}). \qquad (6)$$

That is, in the standard learning setting, if $k < \infty$ then the Natarajan dimension and the graph dimension are indeed equivalent; hence both the Natarajan dimension and the graph dimension can characterize the standard learnability of a hypothesis class. However, in the adversarial setting ($U \ne I$), the generalized definitions derived above cannot bound each other. To show this, we claim that a finite AN-dimension is not sufficient for robust learnability.

Theorem 9. There exist $\mathcal{H}, U$ such that $d^U_N(\mathcal{H}) = 0$ but $\mathcal{H}$ is not robustly PAC learnable under $U$, i.e. $m^{ag}_{\mathcal{H},U}(\epsilon,\delta) = \infty$.

Proof. Let $U$ be the 2-norm ball: $U(x) = \{z \in \mathcal{X} : \|z - x\|_2 \le r\}$ for all $x \in \mathcal{X}$ and some $r > 0$. We construct $\mathcal{H}$ as follows. Pick a sequence $\{x_i\}_{i=1}^{\infty} \subseteq \mathcal{X}$ such that for any $i \ne j$, $U(x_i) \cap U(x_j) = \emptyset$. Let $f : \mathcal{X} \to \mathcal{Y}$ be a function satisfying: there exists $z_i \in U(x_i)\setminus\{x_i\}$ such that $f(z_i) \ne f(x_i)$ for each $i \in \mathbb{N}$. For each sequence $b = \{b_i\}_{i=1}^{\infty} \in \{0,1\}^{\mathbb{N}}$, define $h_b : \mathcal{X} \to \mathcal{Y}$:

$$h_b(x) = \begin{cases} f(x_i), & x \in U(x_i)\setminus\{z_i\},\ i \in \mathbb{N},\\ f(x_i), & x = z_i \text{ and } b_i = 1,\\ f(z_i), & x = z_i \text{ and } b_i = 0,\\ 1, & \text{otherwise}. \end{cases}$$

Let $\mathcal{H} = \{h_b : b \text{ is a sequence in } \{0,1\}\}$. Using the same approach employed in the proof of Lemma 1, for any learning rule $\mathbb{A} : \bigcup_{n=0}^{\infty}(\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$, there exists a distribution $D$ over $\mathcal{X}\times\mathcal{Y}$ such that there exists a function $f^* \in \mathcal{H}$ with $R_U(f^*; D) = 0$, and with probability at least $1/7$ over the choice of $S \sim D^m$ we have $R_U(\mathbb{A}(S); D) \ge 1/8$, by considering the distribution family containing the distributions that are uniform over $2m$ points in $\{(x_1, f(x_1)), \dots, (x_{3m}, f(x_{3m}))\}$. We complete the proof by noting that $d^U_N(\mathcal{H}) = 0$, since there is no single $r$-ball that is adversarially N-shattered by $\mathcal{H}$.

7 Conclusion

In this work, we construct the concepts of AG-dimension and AN-dimension to upper bound and lower bound, respectively, the sample complexity of robust multiclass learning. We further analyze the AG/AN-dimension of linear multiclass predictors and prove that the AG-dimension and AN-dimension are not equivalent in general. However, establishing a complexity measure that characterizes robust learnability still remains an open question.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant 61976161.

References

[1] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
[2] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7472–7482, 2019.
[3] Zekai Wang and Weiwei Liu. Robustness verification for contrastive learning. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 22865–22883, 2022.
[4] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE SP, pages 39–57, 2017.
[5] Nicholas Carlini and David A. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In IEEE SP, pages 1–7, 2018.
[6] Weiwei Liu, Donna Xu, Ivor W. Tsang, and Wenjie Zhang. Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):408–422, 2019.
[7] Weiwei Liu, Ivor W. Tsang, and Klaus-Robert Müller. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18:94:1–94:38, 2017.
[8] Xiuwen Gong, Dong Yuan, and Wei Bao.
Discriminative metric learning for partial label learning. IEEE Transactions on Neural Networks and Learning Systems, pages 1–12, 2021.
[9] Xiuwen Gong, Dong Yuan, and Wei Bao. Top-k partial label machine. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[10] B. K. Natarajan. On learning sets and functions. Machine Learning, 4:67–97, 1989.
[11] Vladimir Naumovich Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[12] Shai Ben-David, Nicolò Cesa-Bianchi, David Haussler, and Philip M. Long. Characterizations of learnability for classes of {0, ..., n}-valued functions. Journal of Computer and System Sciences, 50(1):74–86, 1995.
[13] Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In COLT, volume 99, pages 2512–2530, 2019.
[14] Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of adversaries. In NeurIPS, pages 228–239, 2018.
[15] Justin Khim and Po-Ling Loh. Adversarial risk bounds for binary classification via function transformation. CoRR, abs/1810.09519, 2018.
[16] Robi Bhattacharjee, Somesh Jha, and Kamalika Chaudhuri. Sample complexity of robust linear classification on separated data. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 884–893, 2021.
[17] Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport. In NeurIPS, pages 7496–7508, 2019.
[18] Dimitrios I. Diochnos, Saeed Mahloujifar, and Mohammad Mahmoody. Lower bounds for adversarially robust PAC learning under evasion and hybrid attacks. In ICMLA, pages 717–722, 2020.
[19] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[20] David Haussler and Philip M. Long. A generalization of Sauer's lemma. Journal of Combinatorial Theory, Series A, 71(2):219–240, 1995.
[21] Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In NeurIPS, pages 494–502, 2012.
[22] Dong Yin, Kannan Ramchandran, and Peter L. Bartlett. Rademacher complexity for adversarially robust generalization. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7085–7094, 2019.
[23] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[24] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[25] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[26] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.