A Trichotomy for Transductive Online Learning

Steve Hanneke
Department of Computer Science, Purdue University
steve.hanneke@gmail.com

Shay Moran
Faculty of Mathematics, Faculty of Computer Science, and Faculty of Data and Decision Sciences, Technion, Israel Institute of Technology
smoran@technion.ac.il

Jonathan Shafer
Computer Science Division, UC Berkeley
shaferjo@berkeley.edu

We present new upper and lower bounds on the number of learner mistakes in the transductive online learning setting of Ben-David, Kushilevitz and Mansour (1997). This setting is similar to standard online learning, except that the adversary fixes a sequence of instances x1, ..., xn to be labeled at the start of the game, and this sequence is known to the learner. Qualitatively, we prove a trichotomy, stating that the minimal number of mistakes made by the learner as n grows can take only one of precisely three possible values: n, Θ(log(n)), or Θ(1). Furthermore, this behavior is determined by a combination of the VC dimension and the Littlestone dimension. Quantitatively, we show a variety of bounds relating the number of mistakes to well-known combinatorial dimensions. In particular, we improve the known lower bound on the constant in the Θ(1) case from Ω(√log(d)) to Ω(log(d)), where d is the Littlestone dimension. Finally, we extend our results to cover multiclass classification and the agnostic setting.

1 Introduction

In classification tasks like PAC learning and online learning, the learner simultaneously confronts two distinct types of uncertainty: labeling-related uncertainty regarding the best labeling function, and instance-related uncertainty regarding the instances that the learner will be required to classify in the future.
To gain insight into the role played by each type of uncertainty, researchers have studied modified classification tasks in which the learner faces only one type of uncertainty, while the other type has been removed. In the context of PAC learning, [BB11] studied a variant of proper PAC learning in which the true labeling function is known to the learner, and only the distribution over the instances is not known. They show bounds on the sample complexity in this setting, which conceptually quantify the instance-related uncertainty. Conversely, labeling-related uncertainty is captured by PAC learning with respect to a fixed (e.g., uniform) domain distribution [BI91], a setting which has been studied extensively.

In this paper we improve upon the work of [BKM97], who quantified the label-related uncertainty in online learning. They introduced a model of transductive online learning,¹ in which the adversary commits in advance to a specific sequence of instances, thereby eliminating the instance-related uncertainty.

¹ [BKM97] call their model "off-line learning with the worst sequence", but in this paper we opt for "transductive online learning", a name that has appeared in a number of publications, including [KK05, Pec08, CS13, SKS16]. We remark that there are at least two different variants referred to in the literature as "transductive online learning".

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

1.1 The Transductive Online Learning Model

The model of learning studied in this paper, due to [BKM97], is a zero-sum, finite, complete-information, sequential game involving two players, the learner and the adversary. Let n ∈ ℕ, let X and Y be sets, and let H ⊆ Y^X be a collection of functions called the hypothesis class. The game proceeds as follows (see Section 2 for further formal definitions). First, the adversary selects an arbitrary sequence of instances, x1, ..., xn ∈ X. Then, for each t = 1, ..., n:

1. The learner selects a prediction, ŷt ∈ Y.
2. The adversary selects a label, yt ∈ Y.
In each step t ∈ [n], the adversary must select a label yt such that the sequence (x1, y1), ..., (xt, yt) is realizable by H, meaning that there exists h ∈ H satisfying h(xi) = yi for all i ∈ [t]. The learner's objective is to minimize the quantity

M(A, x, h) = |{t ∈ [n] : ŷt ≠ h(xt)}|,

which is the number of mistakes when the learner plays strategy A and the adversary chooses sequence x ∈ X^n and labels consistent with hypothesis h ∈ H. We are interested in understanding the value of this game,

M(H, n) = inf_{A ∈ 𝒜} sup_{x ∈ X^n} sup_{h ∈ H} M(A, x, h),

where 𝒜 is the set of all learner strategies. Note that neither party can benefit from using randomness, so without loss of generality we consider only deterministic strategies.

1.2 A Motivating Example

Transductive predictions of the type studied in this paper appear in many real-world situations, essentially in any case where a party has a schedule or a to-do list, known in advance, of specific tasks that need to be completed in order, and there is some uncertainty as to the precise conditions that will arise in each task. As a concrete example, consider the logistics that the management of an airport faces when scheduling the work of passport-control officers.

Example 1.1. An airport knows in advance what flights are scheduled for each day. However, it does not know in advance exactly how many passengers will go through passport control each day (because tickets can be booked and cancelled at the last minute, and entire flights can be cancelled, delayed, or rerouted, etc.). Each day can be regular (the number of passengers is normal), or it can be busy (more passengers than usual). Correspondingly, each day the airport must decide whether to schedule a regular shift of passport-control officers, or an extended shift that contains more officers.
If the airport assigns a regular shift for a busy day, then passengers experience long lines at passport control, and the airport suffers a loss of 1; if the airport assigns an extended shift for a regular day, then it wastes money on excess manpower, and it again suffers a loss of 1; if the airport assigns a regular shift to a regular day, or an extended shift to a busy day, then it suffers a loss of 0. Hence, when the airport schedules its staff, it is essentially attempting to predict for each day whether it will be a regular day or a busy day, using information it knows well in advance about which flights are scheduled for each day. This is precisely a transductive online learning problem.

1.3 Our Contributions

I. Trichotomy. We show the following trichotomy. It shows that the rate at which M(H, n) grows as a function of n is determined by the VC dimension and the Littlestone dimension (LD).

¹ (cont.) For example, [SKS16] write of "a transductive setting [BKM97] in which the learner knows the arriving contexts a priori, or, less stringently, knows only the set, but not necessarily the actual sequence or multiplicity with which each context arrives." That is, in one setting, the learner knows the sequence (x1, ..., xn) in advance, but in another setting the learner only knows the set {x1, ..., xn}. One could distinguish between these two settings by calling them "sequence-transductive" and "set-transductive", respectively. Seeing as the current paper deals exclusively with the sequence-transductive setting, we refer to it herein simply as the transductive setting.

Theorem (Informal Version of Theorem 4.1). Every hypothesis class H ⊆ {0,1}^X satisfies precisely one of the following:
1. M(H, n) = n. This happens if VC(H) = ∞.
2. M(H, n) = Θ(log(n)). This happens if VC(H) < ∞ and LD(H) = ∞.
3. M(H, n) = Θ(1). This happens if LD(H) < ∞.
The Θ(·) notations in Items 2 and 3 hide a dependence on VC(H) and LD(H), respectively.
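For intuition about the two dimensions governing the trichotomy, both can be computed by brute force for small finite classes. The sketch below is ours, not from the paper: the helper names `vc_dimension` and `littlestone_dim` are hypothetical, and the latter uses the standard recursive characterization that LD(H) ≥ d + 1 iff some instance x splits H into two subclasses, each of Littlestone dimension ≥ d.

```python
from itertools import combinations

def vc_dimension(H, domain):
    """Largest |S| such that H realizes all 2^|S| labelings of S (brute force)."""
    def shatters(S):
        patterns = {tuple(h[x] for x in S) for h in H}
        return len(patterns) == 2 ** len(S)
    return max(len(S) for r in range(len(domain) + 1)
               for S in combinations(domain, r) if shatters(S))

def littlestone_dim(H):
    """Depth of the deepest Littlestone tree shattered by H (brute force)."""
    if len(H) <= 1:
        return 0
    best = 0
    for x in H[0].keys():
        h0 = [h for h in H if h[x] == 0]
        h1 = [h for h in H if h[x] == 1]
        if h0 and h1:  # x splits the class; both branches must still be shattered
            best = max(best, 1 + min(littlestone_dim(h0), littlestone_dim(h1)))
    return best
```

For example, the full cube on two points has VC and Littlestone dimension 2, while four threshold functions on three points have VC dimension 1 but Littlestone dimension 2, illustrating that the two dimensions can differ.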
The proof uses bounds on the number of mistakes in terms of the threshold dimension (Section 3.2), among other tools.

II. Littlestone classes. The minimal constant upper bound in the Θ(1) case of Theorem 4.1 is some value C(H) that depends on the class H, but the precise mapping H ↦ C(H) is not known in general. [BKM97] showed that C(H) = Ω(√log(LD(H))). In Section 3 and Appendix A we improve upon their result as follows.

Theorem (Informal Version of Theorem 3.1). Let H ⊆ {0,1}^X such that LD(H) = d < ∞. Then M(H, n) = Ω(log(d)).

III. Multiclass setting. In Section 5, we generalize Theorem 4.1 to the multiclass setting with a finite label set Y, showing a trichotomy based on the Natarajan dimension. The proof uses a simple result from Ramsey theory, among other tools. Additionally, we show that the DS dimension of [DS14] does not characterize multiclass transductive online learning.

IV. Agnostic setting. In the standard (non-transductive) agnostic online setting, [BPS09] showed that R_online(H, n), the agnostic online regret for a hypothesis class H for a sequence of length n, satisfies

Ω(√(LD(H) · n)) ≤ R_online(H, n) ≤ O(√(LD(H) · n · log n)). (1)

Later, [ABD+21] showed an improved bound of R_online(H, n) = Θ(√(LD(H) · n)). In Section 6 we show a result similar to Eq. (1) for the transductive agnostic online setting.

Theorem (Informal Version of Theorem 6.1). Let H ⊆ {0,1}^X such that 0 < VC(H) < ∞. Then the agnostic transductive regret for H satisfies

Ω(√(VC(H) · n)) ≤ R(H, n) ≤ O(√(VC(H) · n · log n)).

1.4 Related Works

The general idea of bounding the number of mistakes by learning algorithms in sequential prediction problems was introduced in the seminal work of Littlestone [Lit87]. That work introduced the online learning model, where the sequence of examples is revealed to the learner one example at a time. After each example x is revealed, the learner makes a prediction, after which the true target label y is revealed.
The constraint, which makes learning even plausible, is that this sequence of (x, y) pairs should maintain the property that there is an (unknown) target concept in a given concept class H which is correct on the entire sequence. Littlestone [Lit87] also identified the optimal predictor for this problem (called the SOA, for Standard Optimal Algorithm) and a general complexity measure which is precisely equal to the optimal bound on the number of mistakes: a quantity now referred to as the Littlestone dimension. Later works discussed variations on this framework. In particular, as mentioned, the transductive model discussed in the present work was introduced in the work of [BKM97].

The idea (and terminology) of transductive learning was introduced by [VC74, Vap82, Kuh99], to capture scenarios where learning may be easier due to knowing in advance which examples the learner will be tested on. [VC74, Vap82, Kuh99] study transductive learning in a model closer in spirit to the PAC framework, where some uniform random subset of examples have their labels revealed to the learner and it is tasked with predicting the labels of the remaining examples. In contrast, [BKM97] study transductive learning in a sequential prediction setting, analogous to the online learning framework of Littlestone. In this case, the sequence of examples x is revealed to the learner all at once, and only the target labels (the y's) are revealed in an online fashion, with the label of each example revealed just after the learner's prediction for that example in the given sequential order. Since a mistake bound in this setting is still required to hold for any sequence, for the purpose of analysis we may think of the sequence of x's as being a worst-case set of examples and ordering thereof, for a given learning algorithm. [BKM97] compare and contrast the optimal mistake bound for this setting to that of the original online model of [Lit87].
Denoting by d the Littlestone dimension of the concept class, it is clear that the optimal mistake bound is no larger than d. However, they also argue that the optimal mistake bound in the transductive model is never smaller than Ω(√log(d)) (as mentioned, we improve this to log(d) in the present work). They further exhibit a family of concept classes of variable d for which the transductive mistake bound is strictly smaller than d, by a factor of 3/2. They additionally provide a general equivalent description of the optimal transductive mistake bound in terms of the maximum possible rank among a certain family of trees, each representing the game tree for the sequential game on a given sequence of examples x.

In addition to these two models of sequential prediction, the online learning framework has also been explored in other variations, including exploring the optimal mistake bound under a best-case order of the data sequence x, or even a self-directed adaptive order in which the learning algorithm selects the next point for prediction from the remaining x's of the given sequence on each round [BKM97, BEK95, GS94, BE98, Kuh99]. Unlike the online learning model of Littlestone, the transductive model additionally allows for nontrivial mistake bounds in terms of the sequence length n (the online model generally has min{d, n} as the optimal mistake bound). In this case, it follows immediately from the Sauer-Shelah-Perles lemma and a halving technique that the optimal transductive mistake bound is no larger than O(v log(n)) [KK05], where v is the VC dimension of the concept class [VC71, VC74].

2 Preliminaries

Notation 2.1. Let X be a set and n, k ∈ ℕ. For a sequence x = (x1, ..., xn) ∈ X^n, we write x≤k to denote the subsequence (x1, ..., xk). If k ≤ 0 then x≤k denotes the empty sequence, which belongs to X^0.

Definition 2.2. Let k ∈ ℕ, let X and Y be sets, and let H ⊆ Y^X. A sequence (x1, y1), ..., (xk, yk) ∈ (X × Y)^k is realizable by H, or H-realizable, if ∃h ∈ H ∀i ∈ [k] : h(xi) = yi.
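For a finite class, Definition 2.2 can be checked mechanically; a minimal sketch (the function name `is_realizable` is ours, not from the paper):

```python
def is_realizable(H, labeled_pairs):
    """Return True iff some h in H agrees with every (x, y) pair.

    H is a finite class given as a list of dicts mapping instances to labels,
    and labeled_pairs is the prefix (x1, y1), ..., (xk, yk) labeled so far.
    """
    return any(all(h[x] == y for x, y in labeled_pairs) for h in H)
```

In the game of Section 1.1, the adversary may announce yt at step t only if the extended prefix still passes this check.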
Definition 2.3. Let X be a set, let H ⊆ {0,1}^X, let d ∈ ℕ, and let X = {x1, ..., xd} ⊆ X. We say that H shatters X if for every y ∈ {0,1}^d there exists h ∈ H such that for all i ∈ [d], h(xi) = yi. The Vapnik-Chervonenkis (VC) dimension of H is VC(H) = sup{|X| : X ⊆ X finite, H shatters X}.

Definition 2.4 ([Lit87]). Let X be a set and let d ∈ ℕ. A Littlestone tree of depth d with domain X is a set

T = {x_u : u ∈ ⋃_{s=0}^{d} {0,1}^s} ⊆ X, (2)

that is, a complete binary tree of depth d whose nodes are indexed by binary strings of length at most d and labeled by elements of X. Let H ⊆ {0,1}^X. We say that H shatters a tree T as in Eq. (2) if for every u ∈ {0,1}^{d+1} there exists h_u ∈ H such that ∀i ∈ [d+1] : h_u(x_{u≤i−1}) = u_i. The Littlestone dimension of H, denoted LD(H), is the supremum over all d ∈ ℕ such that there exists a Littlestone tree of depth d with domain X that is shattered by H.

Theorem 2.5 ([Lit87]). Let X be a set and let H ⊆ {0,1}^X such that d = LD(H) < ∞. Then there exists a strategy for the learner that guarantees that the learner will make at most d mistakes in the standard (non-transductive) online learning setting, regardless of the adversary's strategy and of the number of instances to be labeled.

Figure 1: A shattered Littlestone tree of depth 2; the depicted labels h(x_λ) = 1, h(x_1) = 0, h(x_10) = 1 trace one branch. The empty sequence is denoted by λ. (Source: [BHM+21])

Theorem 2.6 (Sauer-Shelah-Perles; [She72, Sau72]). Let n, d ∈ ℕ, let X be a set of cardinality n, and let H ⊆ {0,1}^X such that VC(H) = d. Then |H| ≤ ∑_{i=0}^{d} (n choose i) ≤ (en/d)^d.

3 Quantitative Bounds

3.1 Littlestone Dimension: A Tighter Lower Bound

The Littlestone dimension is an upper bound on the number of mistakes, namely

∀n ∈ ℕ : M(H, n) ≤ LD(H) (3)

for any class H. This holds because LD(H) is an upper bound on the number of mistakes for standard (non-transductive) online learning [Lit87], and the adversary in the transductive setting is strictly weaker. The Littlestone dimension also supplies a lower bound. We give a quadratic improvement on the previous lower bound of [BKM97], as follows.

Theorem 3.1. Let X be a set, let H ⊆ {0,1}^X such that d = LD(H) < ∞, and let n ∈ ℕ. Then M(H, n) ≥ min{log(d)/2, log log(n)/2}.
Proof idea for Theorem 3.1. Let T be a Littlestone tree of depth d that is shattered by H, and let H1 ⊆ H be a collection of 2^{d+1} functions that witness the shattering. The adversary selects the sequence consisting of the nodes of T in breadth-first order. For each time step t ∈ [n], let Ht denote the version space, i.e., the subset of H1 that is consistent with all previously assigned labels. The adversary's adaptive labeling strategy at time t is as follows. If Ht is very unbalanced, meaning that a large majority of functions in Ht assign the same value to xt, then the adversary chooses yt to be that value. Otherwise, if Ht is fairly balanced, the adversary forces a mistake (it can do so without violating H-realizability). The pivotal observation is that: (1) under this strategy, the version space decreases in cardinality significantly more during steps where the adversary forces a mistake than during steps where it does not; (2) let xt be the t-th node in the breadth-first order; it has distance ℓ = ⌊log(t)⌋ from the root of T. Because T is a binary tree, the subtree T′ of T rooted at xt is a tree of depth d − ℓ. In particular, seeing as Ht contains only functions necessary for shattering T′, |Ht| ≤ 2^{d−ℓ+1}, so |Ht| must decrease not too slowly with t. Combining (1) and (2) yields that the adversary must be able to force a mistake not too rarely. A careful quantitative analysis shows that the number of mistakes the adversary can force is at least logarithmic in d. The full proof of Theorem 3.1 appears in Appendix A.

3.2 Threshold Dimension

We also show some bounds on the number of mistakes in terms of the threshold dimension.

Definition 3.2. Let X be a set, let X = {x1, ..., xk} ⊆ X, and let H ⊆ {0,1}^X. We say that X is threshold-shattered by H if there exist h1, ..., hk ∈ H such that hi(xj) = 1(j ≤ i) for all i, j ∈ [k].
The threshold dimension of H, denoted TD(H), is the supremum of the set of integers k for which there exists a threshold-shattered set of cardinality k. The following connection between the threshold and Littlestone dimensions is well known.

Theorem 3.3 ([She90, Hod97]). Let X be a set, let H ⊆ {0,1}^X, and let d ∈ ℕ. Then:
1. If LD(H) ≥ d then TD(H) ≥ ⌊log d⌋.
2. If TD(H) ≥ d then LD(H) ≥ ⌊log d⌋.

Item 1 in Theorem 3.3 and Eq. (3) imply that ∀n ∈ ℕ : M(H, n) ≤ 2^{TD(H)+1} for any class H. Similarly, Item 2 in Theorem 3.3 and Theorem 3.1 imply a mistake lower bound of Ω(log log(TD(H))). However, one can do exponentially better than that, as follows.

Claim 3.4. Let X be a set, let H ⊆ {0,1}^X such that d = TD(H) < ∞, and let n ∈ ℕ. Then M(H, n) ≥ min{⌊log(d)⌋, ⌊log(n)⌋}.

One of the ideas used in this proof appeared in an example called σ_worst in Section 4.1 of [BKM97].

Figure 2: Construction of the sequence q in the proof of Claim 3.4. q is a breadth-first enumeration of the depicted binary tree.

Proof of Claim 3.4. Let k = min{⌊log(d)⌋, ⌊log(n)⌋} and let N = 2^k. Let X = {x1, ..., x_{N−1}} ⊆ X be a set that is threshold-shattered by functions h1, ..., h_{N−1} ∈ H, so that hi(xj) = 1(j ≤ i) for all i, j ∈ [N−1]. The strategy for the adversary is to present X in dyadic order, namely x_{N/2}, x_{N/4}, x_{3N/4}, x_{N/8}, ..., x_{(2^k−1)N/2^k}. More explicitly, the adversary chooses the sequence q = q1 ∘ q2 ∘ ... ∘ qk, where ∘ denotes sequence concatenation and

qi = (x_{N/2^i}, x_{3N/2^i}, x_{5N/2^i}, ..., x_{(2^i−1)N/2^i}) for all i ∈ [k].

See Figure 2. We prove by induction that for each i ∈ [k], all labels chosen by the adversary for the subsequences prior to qi are H-realizable, and additionally there exists an instance in subsequence qi on which the adversary can force a mistake regardless of the learner's predictions. The base case is that the adversary can always force a mistake on the first instance, q1, by choosing the label opposite to the learner's prediction (both labels 0 and 1 are H-realizable for this instance).
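The dyadic order in which the adversary presents the indices 1, ..., N − 1 can be generated level by level, as in Figure 2. The following sketch is ours (the helper name `dyadic_order` is hypothetical); it returns the indices m of the instances x_m in the order q = q1 ∘ ... ∘ qk:

```python
def dyadic_order(k):
    """Indices N/2^i, 3N/2^i, ..., (2^i - 1)N/2^i for i = 1, ..., k, where N = 2^k.

    This is a breadth-first enumeration of a complete binary tree whose
    nodes are the indices 1, ..., N - 1 (cf. Figure 2).
    """
    n_total = 2 ** k
    order = []
    for i in range(1, k + 1):
        step = n_total // 2 ** i
        order.extend(range(step, n_total, 2 * step))  # odd multiples of N / 2^i
    return order
```

For k = 3 the order is [4, 2, 6, 1, 3, 5, 7]: each level inserts one new index strictly between every pair of consecutive already-presented indices, which is exactly the property the induction step of the proof relies on.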
Subsequently, for any i > 1, note that by the induction hypothesis, the labels chosen by the adversary for all instances in the previous subsequences are H-realizable. In particular, there exists an index a ∈ [N−1] such that instance xa has already been labeled, and all the labels chosen so far are consistent with ha. Let b be the minimal integer such that b > a and xb has also been labeled. Then xa and all labeled instances with smaller indices received label 1, while xb and all labeled instances with larger indices received label 0. Because the sequence is dyadic, subsequence qi contains an element xm such that a < m < b. The adversary can force a mistake on xm, because ha and hm agree on all previously labeled instances, but disagree on xm.

Claim 3.4 is used in the proof of the trichotomy (Theorem 4.1, below). Finally, we note that for every d ∈ ℕ there exists a hypothesis class H such that d = TD(H) and ∀n ∈ ℕ : M(H, n) = min{d, n}. Indeed, take X = [d] and H = {0,1}^X. The upper bound holds because |X| = d, and the lower bound holds by Item 2 in Theorem 4.1, because VC(H) = d.

4 Trichotomy

Theorem 4.1. Let X be a set, let H ⊆ {0,1}^X, and let n ∈ ℕ such that n ≤ |X|.
1. If VC(H) = ∞ then M(H, n) = n.
2. Otherwise, if VC(H) = d < ∞ and LD(H) = ∞ then

max{min{d, n}, ⌊log(n)⌋} ≤ M(H, n) ≤ O(d log(n/d)). (4)

Furthermore, each of the bounds in Eq. (4) is tight for some classes. The Ω(·) and O(·) notations hide universal constants that do not depend on X or H.
3. Otherwise, there exists an integer C(H) ≤ LD(H) (that depends on X and H but does not depend on n) such that M(H, n) ≤ C(H).

Proof of Theorem 4.1. For Item 1, assume VC(H) = ∞. Then there exists a set X = {x1, ..., xn} ⊆ X of cardinality n that is shattered by H. The adversary can force the learner to make n mistakes by selecting the sequence (x1, ..., xn), and then selecting labels yt = 1 − ŷt for all t ∈ [n]. This choice of labels is H-realizable because X is a shattered set.
To obtain the upper bound in Item 2, the learner can use the halving algorithm, as follows. Let x = (x1, ..., xn) be the sequence chosen by the adversary, and let H|x denote the collection of functions from elements of x to {0,1} that are restrictions of functions in H. For each t ∈ {0, ..., n}, let

Ht = {f ∈ H|x : ∀i ∈ [t] : f(xi) = yi}

be a set called the version space at time t. At each step t ∈ [n], the learner makes prediction

ŷt = argmax_{b ∈ {0,1}} |{f ∈ Ht−1 : f(xt) = b}|.

In words, the learner chooses ŷt according to a majority vote among the functions in version space Ht−1, and then any function whose vote was incorrect is excluded from the next version space, Ht. This implies that for any t ∈ [n], if the learner made a mistake at time t then

|Ht| ≤ (1/2) · |Ht−1|. (5)

Let M = M(H, n). The adversary selects H-realizable labels, so Hn cannot be empty. Hence, applying Eq. (5) recursively yields

1 ≤ |Hn| ≤ 2^{−M} · |H0| ≤ 2^{−M} · O((n/d)^d),

where the last inequality follows from VC(H0) ≤ VC(H) = d and the Sauer-Shelah-Perles lemma (Theorem 2.6). Hence M = O(d log(n/d)), as desired.

For the min{d, n} lower bound in Item 2, if n ≤ d then the adversary can force n mistakes by the same argument as in Item 1. For the logarithmic lower bound in Item 2, the assumption that LD(H) = ∞ and Theorem 3.3 imply that TD(H) = ∞, and in particular TD(H) ≥ n, and this implies the bound by Claim 3.4.

For Item 3, the assumption LD(H) = k < ∞ and Theorem 2.5 imply that for any n, the learner will make at most k mistakes. This is because the adversary in the transductive setting is strictly weaker than the adversary in the standard online setting. So there exists some C(H) ∈ {0, ..., k} as desired.

Remark 4.2. One can use Theorem 3.1 to obtain a lower bound for the case of Item 2 in Theorem 4.1. However, that yields a lower bound of Ω(log log(n)), which is exponentially weaker than the bound we show.
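The halving strategy from the proof of Item 2 is easy to state in code. The following sketch is ours (the helper name `halving_mistakes` is hypothetical); it simulates one realizable game against labels drawn from a target hypothesis, so that Eq. (5) guarantees at most ⌊log2 |H|x|⌋ mistakes:

```python
def halving_mistakes(H, xs, target):
    """Play the transductive game with the halving learner against `target`.

    H: finite class as a list of dicts (restrictions to the instances in xs).
    target: some h* in H; a realizable adversary labels each xt by h*(xt).
    Returns the number of mistakes the learner makes.
    """
    version_space = list(H)
    mistakes = 0
    for x in xs:
        ones = sum(h[x] for h in version_space)
        y_hat = int(2 * ones >= len(version_space))  # majority vote
        y = target[x]
        if y != y_hat:
            mistakes += 1
        # any function whose vote disagrees with the revealed label is excluded
        version_space = [h for h in version_space if h[x] == y]
    return mistakes
```

On every mistake at most half of the version space survives, so on a class of 8 functions the learner makes at most 3 mistakes regardless of which target the adversary plays.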
5 Multiclass Setting

The trichotomy of Theorem 4.1 can be generalized to the multiclass setting, in which the label set Y contains more than two labels. In this setting, the VC dimension is replaced by the Natarajan dimension [Nat89], denoted ND, and the Littlestone dimension is generalized in the natural way. The result holds for finite sets Y.

Theorem 5.1 (Informal Version of Theorem B.3). Let X be a set, let Y be a finite set, and let H ⊆ Y^X. Then H satisfies precisely one of the following:
1. M(H, n) = n. This happens if ND(H) = ∞.
2. M(H, n) = Θ(log(n)). This happens if ND(H) < ∞ and LD(H) = ∞.
3. M(H, n) = O(1). This happens if LD(H) < ∞.

The proof of Theorem 5.1 appears in Appendix B, along with the necessary definitions. The main innovation in the proof involves the use of the multiclass threshold bounds developed in Appendix D, which in turn rely on a basic result from Ramsey theory, stated in Appendix C.

5.1 The Case of an Infinite Label Set

It is interesting to observe that the analogy between the binary classification and multiclass classification settings breaks down when the label set Y is not finite.

Example 5.2. There exists a class G ⊆ Y^X such that Y is countable, LD(G) is infinite, but the class is learnable with a mistake bound of M(G, n) = 1. To see this, let X be countable, and let H ⊆ {0,1}^X be a class with LD(H) = ∞. For each i ∈ ℕ, let Ti be a Littlestone tree of depth i that is shattered by H, and let {h^i_1, ..., h^i_{2^{i+1}}} ⊆ H be a subset that witnesses the shattering. Let G = {g^i_j : i ∈ ℕ, j ∈ [2^{i+1}]} be a set of functions such that g^i_j(x) = (h^i_j(x), i, j) for all i, j. Let Y = {0,1} × ℕ × ℕ. Observe that G ⊆ Y^X is a countable set of functions with a countable set of labels. Furthermore, LD(G) = ∞ because G shatters a sequence of suitable Littlestone trees corresponding to T1, T2, .... However, G can be learned with mistake bound 1, because a single example of the form (x, (h^i_j(x), i, j)) reveals the correct labeling function h^i_j.
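The mechanism in Example 5.2 can be sketched in a few lines. This is our illustration, not part of the paper: the helper names are hypothetical, and we use small stand-in base functions rather than an actual shattering family. The point is only that once every label carries the pair (i, j), a single labeled example identifies the target function, so the learner errs at most once.

```python
def make_revealing_class(base):
    """Wrap each base function h_{i,j} so that its labels also encode (i, j)."""
    return {key: (lambda x, h=h, key=key: (h(x),) + key) for key, h in base.items()}

def run_game(g_class, xs, target_key):
    """Predict with the identified function once a single label has been seen."""
    identified = None
    mistakes = 0
    for x in xs:
        true_label = g_class[target_key](x)
        guess = g_class[identified](x) if identified is not None else None
        if guess != true_label:
            mistakes += 1
        identified = true_label[1:]  # read (i, j) off the revealed label
    return mistakes
```

After the first revealed label, `identified` equals the target's (i, j) index and every subsequent prediction is correct.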
Recent work by [BCD+22] has shown that multiclass PAC learning with infinite Y is not characterized by the Natarajan dimension, and that instead it is characterized by the DS dimension (introduced by [DS14]). It is therefore natural to ask whether the DS dimension might also characterize multiclass transductive online learning with infinite Y. We show that the answer to that question is negative. Recall the definition of the DS dimension.

Definition 5.3. Let d ∈ ℕ, let X and Y be sets, and let H ⊆ Y^X. For an index i ∈ [d] and vectors y = (y1, ..., yd) ∈ Y^d and y′ = (y′1, ..., y′d) ∈ Y^d, we say that y and y′ are i-neighbors, denoted y ∼i y′, if {j ∈ [d] : yj ≠ y′j} = {i}. We say that C ⊆ Y^d is a d-pseudocube if C is non-empty and finite, and

∀y ∈ C ∀i ∈ [d] ∃y′ ∈ C : y ∼i y′.

For a vector x = (x1, ..., xd) ∈ X^d, we say that H DS-shatters x if the set H|x := {(h(x1), ..., h(xd)) : h ∈ H} ⊆ Y^d contains a d-pseudocube. Finally, the Daniely-Shalev-Shwartz (DS) dimension of H is

DS(H) = sup{d ∈ ℕ : ∃x ∈ X^d : H DS-shatters x}.

See [BCD+22] for figures and further discussion of the DS dimension. The following claim shows that the DS dimension does not characterize transductive online learning, even when Y is finite.

Claim 5.4. For every n ∈ ℕ, there exists a hypothesis class Hn such that DS(Hn) = 1 but the adversary in transductive online learning can force M(Hn, n) = n mistakes.

Proof. Fix n ∈ ℕ and let X = {0, 1, 2, ..., n}. Consider a complete binary tree T of depth n such that for each x ∈ X, all the nodes at depth x (at distance x from the root) are labeled by x, and each edge in T is labeled by a distinct label. Let Hn be a minimal hypothesis class that shatters T, namely, Hn shatters T and there does not exist a strict subset of Hn that shatters T. Observe that M(Hn, n) = n, because the adversary can present the sequence 0, 1, 2, ..., n − 1 and force a mistake at each time step.
To see that DS(Hn) = 1, assume for contradiction that there exists a vector x = (x1, x2) ∈ X^2 that is DS-shattered by Hn, namely, there exists a 2-pseudocube C ⊆ Hn|x. Note that x1 ≠ x2, and without loss of generality x1 < x2 (Hn DS-shatters (x1, x2) if and only if it DS-shatters (x2, x1)). Fix y ∈ C, so y = (h(x1), h(x2)) for some h ∈ Hn. Because C is a 2-pseudocube, there exists y′ ∈ C that is a 1-neighbor of y. Namely, there exists g ∈ Hn such that y′ = (g(x1), g(x2)) ∈ C, y′1 ≠ y1 and y′2 = y2. However, because each edge in T has a distinct label, and Hn is minimal, it follows that for any x ∈ X,

g(x) = h(x) ⟹ ∀x′ ∈ {0, 1, ..., x} : g(x′) = h(x′).

In particular, g(x2) = y′2 = y2 = h(x2) implies y′1 = g(x1) = h(x1) = y1, which is a contradiction to the choice of y′.

6 Agnostic Setting

The agnostic transductive online learning setting is defined analogously to the realizable (non-agnostic) transductive online learning setting described in Section 1.1. An early work by [Cov65] observed that it is not possible for a learner to achieve vanishing regret in an agnostic online setting with complete information. Therefore, we consider a game with incomplete information, as follows. First, the adversary selects an arbitrary sequence of instances, x1, ..., xn ∈ X, and reveals the sequence to the learner. Then, for each t = 1, ..., n:

1. The adversary selects a label yt ∈ Y.
2. The learner selects a prediction ŷt ∈ Y and reveals it to the adversary.
3. The adversary reveals yt to the learner.

At each time step t ∈ [n], the adversary may select any yt ∈ Y, without restrictions.² The learner, which is typically randomized, has the objective of minimizing the regret, namely

R(A, H, x, y) = E[|{t ∈ [n] : ŷt ≠ yt}|] − min_{h ∈ H} |{t ∈ [n] : h(xt) ≠ yt}|,

where the expectation is over the learner's randomness.
In words, the regret is the expected excess number of mistakes the learner makes when it plays strategy A and the adversary chooses the sequence x ∈ X^n and labels y ∈ Y^n, as compared to the number of mistakes made by the best fixed hypothesis h ∈ H. We are interested in understanding the value of this game, namely

R(H, n) = inf_{A ∈ 𝒜} sup_{x ∈ X^n} sup_{y ∈ Y^n} R(A, H, x, y),

where 𝒜 is the set of all learner strategies. We show the following result.

Theorem 6.1. Let X be a set, let H ⊆ {0,1}^X, and let n ∈ ℕ such that n ≤ |X|. Assume 0 < VC(H) < ∞. Then the agnostic transductive regret for H on sequences of length n satisfies

Ω(√(VC(H) · n)) ≤ R(H, n) ≤ O(√(VC(H) · n · log(n/VC(H)))).

The upper bound in Theorem 6.1 follows directly from the Sauer-Shelah-Perles lemma (Theorem 2.6), together with the following well-known bound on the regret of the Multiplicative Weights algorithm (see, e.g., Theorem 21.10 in [SB14]).

Theorem 6.2. Let X be a set and let H ⊆ {0,1}^X be finite. There exists an algorithm for the standard (non-transductive) agnostic online learning setting that satisfies R_online(H, n) ≤ √(2n log(|H|)).

² Hence the name "agnostic", implying that we make no assumptions concerning the choice of labels.

Theorem 6.2 implies the upper bound of Theorem 6.1, because the adversary in the transductive agnostic setting is weaker than the adversary in the standard agnostic setting. We prove the lower bound of Theorem 6.1 using an anti-concentration technique from Lemma 14 of [BPS09]. The proof appears in Appendix E.

Remark 6.3. Additionally:
1. If VC(H) = 0 (i.e., for classes with a single function) then the regret is 0.
2. If VC(H) < ∞ and LD(H) < ∞ then the regret is R(H, n) = O(√(LD(H) · n)), by [ABD+21] (as mentioned above). Namely, in some cases the log(n) factor in Theorem 6.1 can be removed.
3. If VC(H) = ∞ then the regret is Ω(n).

7 Future Work

Some remaining open problems include:

1. Showing a sharper bound for the Θ(1) case in the trichotomy (Theorem 4.1).
Currently, there is an exponential gap between the best known upper and lower bounds for Littlestone classes.
2. Showing sharper bounds for the Θ(log n) cases in the trichotomy (Theorem 4.1) and the multiclass trichotomy (Theorem B.3).
3. Showing a sharper bound for the agnostic case (Theorem 6.1).
4. Characterizing the number of mistakes in the multiclass setting with an infinite label set Y (complementing Theorem B.3).

Acknowledgments

Zachary Chase contributed significantly to this paper. All errors are our own. SM is a Robert J. Shillman Fellow; he acknowledges support by ISF grant 1225/20, by BSF grant 2018385, by an Azrieli Faculty Fellowship, by Israel PBC-VATAT, by the Technion Center for Machine Learning and Intelligent Systems (MLIS), and by the European Union (ERC, GENERALIZATION, 101039692). JS was supported by DARPA (Defense Advanced Research Projects Agency) contract #HR001120C0015 and the Simons Collaboration on The Theory of Algorithmic Fairness. Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union, the European Research Council Executive Agency, DARPA, or the Simons Foundation. The European Union, the granting authority, DARPA, and the Simons Foundation cannot be held responsible for them.

References

[ABD+21] Noga Alon, Omri Ben-Eliezer, Yuval Dagan, Shay Moran, Moni Naor, and Eylon Yogev. Adversarial laws of large numbers and optimal regret in online classification. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC 2021: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 447-455. ACM, 2021. doi:10.1145/3406325.3451041.

[ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852-860.
ACM, 2019. doi:10.1145/3313276.3316312.
[BB11] Shalev Ben-David and Shai Ben-David. Learning a classifier when the labeling is known. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory - 22nd International Conference, ALT 2011, Espoo, Finland, October 5-7, 2011. Proceedings, volume 6925 of Lecture Notes in Computer Science, pages 440–451. Springer, 2011. doi:10.1007/978-3-642-24412-4_34.
[BCD+22] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 943–955. IEEE, 2022. doi:10.1109/FOCS54457.2022.00093.
[BE98] Shai Ben-David and Nadav Eiron. Self-directed learning and its relation to the VC-dimension and to teacher-directed learning. Mach. Learn., 33(1):87–104, 1998. doi:10.1023/A:1007510732151.
[BEK95] Shai Ben-David, Nadav Eiron, and Eyal Kushilevitz. On self-directed learning. In Wolfgang Maass, editor, Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT 1995, Santa Cruz, California, USA, July 5-8, 1995, pages 136–143. ACM, 1995. doi:10.1145/225298.225314.
[BHM+21] Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, and Amir Yehudayoff. A theory of universal learning. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC 2021: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 532–541. ACM, 2021. doi:10.1145/3406325.3451087.
[BI91] Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theor. Comput. Sci., 86(2):377–390, 1991. doi:10.1016/0304-3975(91)90026-X.
[BKM97] Shai Ben-David, Eyal Kushilevitz, and Yishay Mansour. Online learning versus offline learning. Mach. Learn., 29(1):45–63, 1997. doi:10.1023/A:1007465907571.
[BPS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz.
Agnostic online learning. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009. URL http://www.cs.mcgill.ca/%7Ecolt2009/papers/032.pdf#page=1.
[CL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. doi:10.1017/CBO9780511546921.
[Cov65] Thomas M. Cover. Behavior of sequential predictors of binary sequences. In Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes. Academic Press, 1965. URL https://isl.stanford.edu/~cover/papers/paper3.pdf.
[CS13] Nicolò Cesa-Bianchi and Ohad Shamir. Efficient transductive online learning via randomized rounding. In Bernhard Schölkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference - Festschrift in Honor of Vladimir N. Vapnik, pages 177–194. Springer, 2013. doi:10.1007/978-3-642-41136-6_16.
[DS14] Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Maria-Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 287–316. JMLR.org, 2014. URL http://proceedings.mlr.press/v35/daniely14b.html.
[DSBS15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16:2377–2404, 2015. doi:10.5555/2789272.2912074.
[GS94] Sally A. Goldman and Robert H. Sloan. The power of self-directed learning. Mach. Learn., 14(1):271–294, 1994. doi:10.1023/A:1022605628675.
[HL95] David Haussler and Philip M. Long. A generalization of Sauer's lemma. J. Comb. Theory, Ser. A, 71(2):219–240, 1995. doi:10.1016/0097-3165(95)90001-2.
[Hod97] Wilfrid Hodges. A Shorter Model Theory. Cambridge University Press, 1997.
[KK05] Sham M. Kakade and Adam Kalai. From batch to transductive online learning.
In Advances in Neural Information Processing Systems 18, Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada, pages 611–618, 2005. URL https://proceedings.neurips.cc/paper/2005/hash/17693c91d9204b7a7646284bb3adb603-Abstract.html.
[Kuh99] Christian Kuhlmann. On teaching and learning intersection-closed concept classes. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory, 4th European Conference, EuroCOLT 1999, Nordkirchen, Germany, March 29-31, 1999, Proceedings, volume 1572 of Lecture Notes in Computer Science, pages 168–182. Springer, 1999. doi:10.1007/3-540-49097-3_14.
[Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn., 2(4):285–318, 1987. doi:10.1007/BF00116827.
[Nat89] B. K. Natarajan. On learning sets and functions. Mach. Learn., 4:67–97, 1989. doi:10.1007/BF00114804.
[Pec08] Dmitry Pechyony. Theory and practice of transductive learning. PhD thesis, Technion - Israel Institute of Technology, Israel, 2008. URL https://technion.primo.exlibrisgroup.com/permalink/972TEC_INST/q1jq5o/alma990023032150203971.
[Sau72] Norbert Sauer. On the density of families of sets. J. Comb. Theory, Ser. A, 13(1):145–147, 1972. doi:10.1016/0097-3165(72)90019-2.
[SB14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. doi:10.1017/CBO9781107298019.
[She72] Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972. doi:10.2140/pjm.1972.41.247.
[She90] Saharon Shelah. Classification Theory: And the Number of Non-Isomorphic Models, volume 92 of Studies in Logic and the Foundations of Mathematics. North-Holland, second edition, 1990.
[SKS16] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire.
Efficient algorithms for adversarial contextual learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2159–2168. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/syrgkanis16.html.
[Vap82] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982. doi:10.1007/0-387-34239-7.
[VC71] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971. doi:10.1137/1116025.
[VC74] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.

A Proof of Lower Bound for Littlestone Classes

Proof of Theorem 3.1. Let T be a Littlestone tree of depth d that is shattered by H, and let H1 ⊆ H be a collection of 2^(d+1) functions that witness the shattering. T contains n_T = 2^(d+1) − 1 nodes. The adversary selects the sequence x1, x2, . . . , xn consisting of the first n nodes of T in breadth-first order (if n > n_T, then the adversary chooses the suffix x_{n_T+1}, . . . , xn arbitrarily). For each time step t ∈ [n], let Ht denote the version space, i.e., the subset of H1 that is consistent with all previously-assigned labels. Namely, for any t > 1,

Ht = {h ∈ H1 : ∀s ∈ [t − 1] : h(xs) = ys}.

Similarly, for each b ∈ {0, 1}, let H_{t,b} = {h ∈ Ht : h(xt) = b}.

  send x1, . . . , xn to learner
  k ← 1
  for t = 1, 2, . . . , n:
      εt ← 1/m_k
      rt ← |H_{t,1}| / |Ht|
      receive ŷt from learner
      yt ← 1 − ŷt if rt ∈ [εt, 1 − εt];  yt ← 0 if rt ∈ [0, εt);  yt ← 1 if rt ∈ (1 − εt, 1]
      send yt to learner
      if rt ∈ [εt, 1 − εt]:
          k ← k + 1

Algorithm 1: An adversary that forces Ω(log(LD(H))) mistakes.

The adversary operates according to Algorithm 1.
Conceptually, at each time step t ∈ [n], if Ht is very unbalanced, meaning that a large majority of the functions in Ht assign the same value to xt, then the adversary chooses yt to be that value. Otherwise, if Ht is fairly balanced, the adversary forces a mistake. Note that if Ht is fairly balanced then the adversary can force a mistake without violating H-realizability. We now argue that using this strategy, the adversary forces Ω(log(d)) mistakes. Let F = {t1, t2, . . . } = {t ∈ [n] : rt ∈ [εt, 1 − εt]} be the set of time steps where the adversary forces a mistake. Note that in the for-loop in Algorithm 1, the value of k at the beginning of iteration tk is k (e.g., at the beginning of iteration t3, k = 3). We argue by induction that for any k ∈ ℕ, if m_k := 2^(2^(2k)) ≤ n then:

1. |F| ≥ k and t_k ≤ m_k; and
2. |H_{t_k}| ≥ (1/m_k)² · |H1|.

The base case is immediate for t1 = 1 ∈ F. For the induction step, assuming that Items 1 and 2 hold for some k ∈ ℕ such that m_{k+1} ≤ n, we show that they also hold for k + 1. For Item 1, assume for contradiction that t ∉ F for all t such that t_k < t ≤ m_{k+1}. For each t, t_k < t ≤ m_{k+1}, the definition of rt and the adversary's labeling strategy imply that the label yt agrees with at least a (1 − εt)-majority of the functions in the version space Ht. Hence,

|H_{m_{k+1}}| ≥ |H_{t_k}| · ∏_{t=t_k+1}^{m_{k+1}} (1 − εt)
= |H_{t_k}| · (1 − 1/m_{k+1})^(m_{k+1} − t_k)
≥ |H_{t_k}| · (1 − 1/m_{k+1})^(m_{k+1})
≥ |H1| · (1/m_k)² · (1 − 1/m_{k+1})^(m_{k+1})    (induction hypothesis for Item 2)
≥ |H1| · (1/m_k)² · (1/4) = |H1| · 2^(−2^(2k+1) − 2).    (6)

Observe that for every t ∈ [n], if xt is a node with depth ℓ in T (i.e., the shortest path from the root to xt contains ℓ edges), then there exists an active node x′_ℓ with the same depth ℓ in T such that the version space Ht contains only functions from H1 that are consistent with the labels along the path from the root of T to x′_ℓ. Namely, Ht is a subset of the 2^(d − ℓ + 1) functions in H1 that witness the shattering of the subtree T_ℓ of T that is rooted at x′_ℓ.
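The numeric facts driving this induction can be checked mechanically. The following quick script (our own sanity check) verifies the elementary estimate (1 − 1/m)^m ≥ 1/4 for m ≥ 2, which is used in Eq. (6), and that the quantity with exponent −2^(2k+1) − 2 from Eq. (6) strictly exceeds the quantity with exponent −2^(2k+2) from Eq. (7) for every k ≥ 1, which is the contradiction at the heart of the argument.

```python
# Step used in Eq. (6): (1 - 1/m)^m >= 1/4 for every integer m >= 2
# (the minimum is attained at m = 2, where it equals exactly 1/4).
assert all((1 - 1 / m) ** m >= 0.25 for m in range(2, 10_000))

# Eq. (6) lower-bounds |H_{m_{k+1}}| by |H_1| * 2^(-2^(2k+1) - 2), while
# Eq. (7) upper-bounds it by |H_1| * 2^(-2^(2k+2)); for k >= 1 the lower
# bound is strictly larger, which is the desired contradiction.
for k in range(1, 6):
    lower_exp = -(2 ** (2 * k + 1)) - 2   # exponent appearing in Eq. (6)
    upper_exp = -(2 ** (2 * k + 2))       # exponent appearing in Eq. (7)
    assert lower_exp > upper_exp
```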
In particular, the depth (distance from the root) of node x_{m_{k+1}} is log(2^(2^(2(k+1)))) = 2^(2k+2), so

|H_{m_{k+1}}| ≤ 2^(d − 2^(2k+2) + 1) = 2^(d+1) · 2^(−2^(2k+2)) = |H1| · 2^(−2^(2k+2)).    (7)

Combining Eqs. (6) and (7) yields 2^(−2^(2k+1) − 2) ≤ 2^(−2^(2k+2)), which is a contradiction. This establishes Item 1. Item 2 follows by a similar calculation, which accounts for the fact that at time t_{k+1} the adversary forces a mistake, and the version space retains at least an ε_{t_{k+1}}-fraction of its functions:

|H_{t_{k+1}}| ≥ |H_{t_k}| · (1 − 1/m_{k+1})^(m_{k+1}) · (1/m_{k+1})
≥ |H_{t_k}| · (1/4) · (1/m_{k+1})
≥ |H1| · (1/m_k)² · (1/4) · (1/m_{k+1})    (induction hypothesis for Item 2)
= |H1| · 2^(−2·2^(2k)) · (1/4) · 2^(−2^(2k+2))
= |H1| · 2^(−6·2^(2k) − 2)
≥ |H1| · 2^(−8·2^(2k)) = |H1| · 2^(−2·2^(2(k+1))) = |H1| · (1/m_{k+1})².

This completes the induction. To complete the proof, let k* = min{⌊log(d)/2⌋, ⌊log(log(n))/2⌋}. Then m_{k*} ≤ 2^d < 2^(d+1) − 1 = n_T, so T contains at least m_{k*} nodes. Additionally, m_{k*} ≤ n, so Item 1 implies that |F| ≥ k*, namely, the adversary can force at least k* mistakes, as desired.

B Multiclass Trichotomy

The following generalization of the Littlestone dimension to the multiclass setting is due to [DSBS15].

Definition B.1 (Multiclass Littlestone Dimension). Let X and Y be sets and let d ∈ ℕ. A Littlestone tree of depth d with domain X and label set Y is a set

{(x_u, y_{u∘0}, y_{u∘1}) ∈ X × Y × Y : u ∈ ⋃_{s=0}^{d} {0, 1}^s}    (8)

such that y_{u∘0} ≠ y_{u∘1} for every u, where ∘ denotes string concatenation. Let H ⊆ Y^X. We say that H shatters a tree T as in Eq. (8) if for every u ∈ {0, 1}^(d+1) there exists h_u ∈ H such that

∀i ∈ [d + 1] : h_u(x_{u≤i−1}) = y_{u≤i},

where u≤j denotes the prefix (u1, . . . , uj) of u. The Littlestone dimension of H, denoted LD(H), is the supremum over all d ∈ ℕ such that there exists a Littlestone tree of depth d with domain X and label set Y that is shattered by H.

The Natarajan dimension is a popular generalization of the VC dimension to the multiclass setting.

Definition B.2 ([Nat89]). Let X and Y be sets, let H ⊆ Y^X, let d ∈ ℕ, and let X′ = {x1, . . . , xd} ⊆ X. We say that H Natarajan-shatters X′ if there exist f0, f1 : X′ → Y such that:
1. ∀x ∈ X′ : f0(x) ≠ f1(x); and
2. ∀A ⊆ X′ ∃h ∈ H ∀x ∈ X′ : h(x) = f_{1(x∈A)}(x).
The Natarajan dimension of H is ND(H) = sup{|X′| : X′ ⊆ X is finite and H Natarajan-shatters X′}. We show the following generalization of Theorem 4.1 for the multiclass setting.

Theorem B.3 (Formal Version of Theorem 5.1). Let X and Y be sets with k = |Y| < ∞, let H ⊆ Y^X, and let n ∈ ℕ such that n ≤ |X|.
1. If ND(H) = ∞ then M(H, n) = n.
2. Otherwise, if ND(H) = d < ∞ and LD(H) = ∞ then
Ω(max{min{d, n}, log(n)}) ≤ M(H, n) ≤ O(d · log(nk/d)).    (9)
The Ω(·) and O(·) notations hide universal constants that do not depend on X, Y or H.
3. Otherwise, there exists a number C(H) ∈ ℕ (that depends on X, Y and H but does not depend on n) such that M(H, n) ≤ C(H).

The proof of Theorem B.3 uses the following generalization of the Sauer–Shelah–Perles lemma.

Theorem B.4 ([Nat89]; Corollary 5 in [HL95]). Let d, n, k ∈ ℕ, let X and Y be sets of cardinality n and k respectively, and let H ⊆ Y^X such that ND(H) ≤ d. Then |H| ≤ Σ_{i=0}^{d} (n choose i) · ((k+1) choose 2)^i.

Proof of Theorem B.3. Items 1 and 3 and the min{d, n} lower bound in Item 2 follow similarly to the corresponding items in Theorem 4.1. The upper bound in Item 2 also follows similarly to the corresponding item in Theorem 4.1, except that it uses Theorem B.4 instead of the Sauer–Shelah–Perles lemma. The log(n) lower bound in Item 2 follows from Theorem D.5 and Claim D.4.

C Combinatorics of Trees

In this section we present a simple lemma from Ramsey theory about trees that is used for proving Theorem D.5. We start with a generalized definition of subtrees.

Definition C.1. Let X be a finite set and let (X, ≺) be a partial order relation. For p, c ∈ X, we say that c is a child of p if p ≺ c and there does not exist m ∈ X such that p ≺ m ≺ c. We say that z ∈ X is a leaf if there exists no x ∈ X such that z ≺ x. (X, ≺) is a binary tree if every non-leaf x ∈ X has precisely 2 children. The depth of z ∈ X is the largest d ∈ ℕ for which there exist distinct x1, . . . , xd ∈ X such that x1 ≺ x2 ≺ · · · ≺ xd ≺ z. For d ∈ ℕ, we say that (X, ≺) is a complete binary tree of depth d if (X, ≺) is a binary tree and all the leaves in X have depth d.
We say that a partial order (X′, ≺′) is a subtree of (X, ≺) if X′ ⊆ X, and ∀a, b ∈ X′ : a ≺′ b ⟺ a ≺ b.

Lemma C.2 (Lemma 16 in [ALMM19]). Let p, q ∈ ℝ be non-negative such that p + q ∈ ℕ. Let T = (X, ≺) be a complete binary tree of depth d = p + q − 1, and let f : X → {0, 1}. Then at least one of the following statements holds:
• T has a 0-monochromatic complete binary subtree of depth at least p. Namely, there exists T′ = (X′, ≺′) such that T′ is a subtree of T, T′ is a complete binary tree of depth at least p, and f(x) = 0 for all x ∈ X′.
• T has a 1-monochromatic complete binary subtree of depth at least q.

For completeness, we include a proof of this lemma.

Proof of Lemma C.2. We prove the claim by induction on the depth d. The base case of d = 0 (a tree with a single node) is immediate. For the induction step, let a denote the root of T, and let T_ℓ and T_r denote the subtrees of T of depth d − 1 consisting of all descendants of the left and right child of a, respectively. Assume that f(a) = 0. If T_ℓ or T_r contains a 1-monochromatic subtree of depth at least q, then we are done. Otherwise, by the induction hypothesis, both trees contain a 0-monochromatic subtree of depth at least p − 1. Joining these two subtrees as children of the root a yields a 0-monochromatic subtree of depth at least p, as desired. The proof for the case f(a) = 1 is similar.

We use the following corollary of Lemma C.2.

Lemma C.3. Let k, d ∈ ℕ. Let T = (X, ≺) be a complete binary tree of depth d, and let f : X → [k]. Then T has an f-monochromatic complete binary subtree T′ = (X′, ≺′) of depth at least d′ = ⌈(d + 1)/2^⌈log(k)⌉⌉. Namely, there exists T′ such that T′ is a subtree of T, T′ is a complete binary tree of depth at least d′, and |{f(a) : a ∈ X′}| = 1.

Proof of Lemma C.3. We will show that for any b ∈ ℕ, if k ≤ 2^b then there exists an f-monochromatic subtree of T of depth at least (d + 1)/2^b. This implies the lemma, which corresponds to the special case b = ⌈log(k)⌉. We proceed by induction on b. The base case b = 1 follows from Lemma C.2.
For the induction step, we assume that the claim holds for b and prove that it holds for b + 1. Namely, we show that if f : X → [k] and k ≤ 2^(b+1) then there exists an f-monochromatic subtree of depth at least (d + 1)/2^(b+1). Define g : X → {1, 2} by g(x) = 1 + (f(x) mod 2). By Lemma C.2, there exists a g-monochromatic complete binary subtree T0 = (X0, ≺) of T of depth at least (d + 1)/2. In particular, |{f(x) : x ∈ X0}| ≤ 2^b. By invoking the induction hypothesis on T0, there exists a complete binary subtree of T0 that is f-monochromatic and has depth at least ((d + 1)/2 + 1)/2^b > (d + 1)/2^(b+1), as desired.

D Multiclass Threshold Bounds

Definition D.1. Let X and Y be sets, let X′ = {x1, . . . , xt} ⊆ X, and let H ⊆ Y^X. We say that X′ is threshold-shattered by H if there exist distinct y0, y1 ∈ Y and functions h1, . . . , ht ∈ H such that hi(xj) = y_{1(j ≤ i)} for all i, j ∈ [t]. The threshold dimension of H, denoted TD(H), is the supremum of the set of integers t for which there exists a threshold-shattered set of cardinality t.

We introduce the following generalization of the threshold dimension.

Definition D.2. Let X and Y be sets, let X′ = {x1, . . . , xt} ⊆ X, and let H ⊆ Y^X. We say that X′ is multi-class threshold-shattered by H if there exist y1, y′1, . . . , yt, y′t ∈ Y such that yi ≠ y′j for all i, j ∈ [t], and there exist functions h1, . . . , ht ∈ H such that

hi(xj) = yi if j ≤ i, and hi(xj) = y′j if j > i.

The multi-class threshold dimension of H, denoted MTD(H), is the supremum of the set of integers t for which there exists a multi-class threshold-shattered set of cardinality t. See Table 1 for an illustration of this definition.

Claim D.3. Let X and Y be sets, k = |Y| < ∞, and let H ⊆ Y^X. Then TD(H) ≥ MTD(H)/k².

Proof of Claim D.3. The proof follows from two applications of the pigeonhole principle.

       x1    x2    x3    x4    x5
  h1   y1    y′2   y′3   y′4   y′5
  h2   y2    y2    y′3   y′4   y′5
  h3   y3    y3    y3    y′4   y′5
  h4   y4    y4    y4    y4    y′5
  h5   y5    y5    y5    y5    y5

Table 1: An illustration of Definition D.2. The table shows a collection of points {x1, . . . , x5} that are multi-class threshold-shattered by functions {h1, . . . , h5}.

Claim D.4. Let X and Y be sets, let H ⊆ Y^X such that d = TD(H) < ∞, and let n ∈ ℕ. Then M(H, n) ≥ min{⌊log(d)⌋, ⌊log(n)⌋}.

The proof of Claim D.4 is similar to that of Claim 3.4.

Theorem D.5. Let X and Y be sets with k = |Y| < ∞, and let H ⊆ Y^X. If LD(H) = ∞ then MTD(H) = ∞.

Proof of Theorem D.5. Let fk(d) be the largest number such that every class with Littlestone dimension d has multi-class threshold dimension at least fk(d). We show by induction on d that fk satisfies the following recurrence relation:

fk(d) ≥ 1 if d = 1;  fk(d) ≥ 1 + fk(⌈d/2k⌉ − 1) if d > 1.

In particular, this implies that fk(d) → ∞ as d → ∞, which implies the theorem. The base case d = LD(H) = 1 is immediate. For the induction step, we assume the relation holds whenever LD(H) ∈ [d − 1], and prove that it holds for LD(H) = d. Let T be a Littlestone tree of depth d that is shattered by H. Fix h ∈ H. Then h is a k-coloring of the nodes of T. By Lemma C.3, there exists an h-monochromatic subtree T′ of T of depth at least ⌈d/2k⌉. Let y be the color assigned by h to all nodes of T′. T is shattered by H, so there exists a child c (in T) of the root x of T′ such that the edge from x to c is labeled by some value y′ ≠ y. Let H′ = {g ∈ H : g(x) = y′}. H′ shatters the subtree rooted at c, so LD(H′) ≥ ⌈d/2k⌉ − 1. By the induction hypothesis, there exist x1, . . . , xs for s = fk(⌈d/2k⌉ − 1) that are multi-class threshold-shattered by functions h1, . . . , hs ∈ H′. By construction, the set X′ = {x1, . . . , xs, xs+1 = x} is multi-class threshold-shattered by {h1, . . . , hs, hs+1 = h}, because hs+1(xj) = y for all j ∈ [s + 1], and hi(xs+1) = y′ for all i ∈ [s]. Hence, fk(d) ≥ s + 1 = 1 + fk(⌈d/2k⌉ − 1), as desired.

E Proof of Agnostic Lower Bound

The lower bound in Theorem 6.1 is derived using an anti-concentration technique from Lemma 14 of [BPS09]. Specifically, this technique uses the following inequality.

Theorem E.1 (Khinchine's inequality; Lemma 8.2 in [CL06]). Let k ∈ ℕ, and let σ1, σ2, . . .
, σk be random variables sampled independently and uniformly at random from {−1, +1}. Then E[|σ1 + σ2 + · · · + σk|] ≥ √(k/2).

Proof of lower bound in Theorem 6.1. Let d = VC(H). Let {x′1, . . . , x′d} ⊆ X be a set of cardinality d that is VC-shattered by H. Let k ∈ ℕ be the largest integer such that kd ≤ n. Let x ∈ X^n be a sequence consisting of k copies of the shattered set, namely,

(x1, . . . , x_{kd}) = (x¹1, x²1, . . . , xᵏ1, x¹2, x²2, . . . , xᵏ2, . . . , x¹d, x²d, . . . , xᵏd),

such that x^j_i = x′_i for all i ∈ [d] and j ∈ [k]. If kd < n then the remaining n − kd elements of x may be arbitrary. Consider a randomized adversary that selects the sequence x, and chooses all labels to be i.i.d. uniform random bits. For each i ∈ [d] and j ∈ [k], let y^j_i = y_{(i−1)k+j} and ŷ^j_i = ŷ_{(i−1)k+j} denote, respectively, the labels and the predictions corresponding to x^j_i. Then for any learner strategy A,

E_{y∼U({0,1}^n)}[R(A, H, x, y)]
= E_{y∼U({0,1}^n)}[Σ_{t∈[n]} 1(ŷt ≠ yt)] − E_{y∼U({0,1}^n)}[min_{h∈H} Σ_{t∈[n]} 1(h(xt) ≠ yt)]
= n/2 − E_{y∼U({0,1}^n)}[min_{h∈H} Σ_{t∈[n]} 1(h(xt) ≠ yt)]
≥ kd/2 − E_{y∼U({0,1}^(kd))}[min_{h∈H} Σ_{i∈[d]} Σ_{j∈[k]} 1(h(x^j_i) ≠ y^j_i)]    (10)
= kd/2 − E_{y∼U({0,1}^(kd))}[Σ_{i∈[d]} min_{h∈H} Σ_{j∈[k]} 1(h(x^j_i) ≠ y^j_i)]    (H shatters {x′1, . . . , x′d})
= kd/2 − Σ_{i=1}^{d} E_{y_i∼U({0,1}^k)}[min{r_i, k − r_i}]    (let r_i = Σ_{j∈[k]} y^j_i)
= kd/2 − Σ_{i=1}^{d} E_{y_i∼U({0,1}^k)}[k/2 − (1/2)·|Σ_{j∈[k]} σ^j_i|]    (let σ^j_i = 1 if y^j_i = 1, and σ^j_i = −1 if y^j_i = 0)
= (1/2) Σ_{i=1}^{d} E_{y_i∼U({0,1}^k)}[|Σ_{j∈[k]} σ^j_i|]
≥ (d/2)·√(k/2),    (11)

where the final inequality is Khinchine's inequality (Theorem E.1). To see that Inequality (10) holds, let h* ∈ argmin_{h∈H} Σ_{t=1}^{kd} 1(h(xt) ≠ yt), and then

E_{y∼U({0,1}^n)}[min_{h∈H} Σ_{t∈[n]} 1(h(xt) ≠ yt)]
≤ E_{y∼U({0,1}^n)}[Σ_{t=1}^{kd} 1(h*(xt) ≠ yt)] + E_{y∼U({0,1}^n)}[Σ_{t=kd+1}^{n} 1(h*(xt) ≠ yt)]
= E_{y∼U({0,1}^n)}[Σ_{t=1}^{kd} 1(h*(xt) ≠ yt)] + (n − kd)/2.    ({yt}_{t>kd} are independent of h*)

In particular, Eq. (11) implies that for every learner strategy A there exists y ∈ {0, 1}^n such that R(A, H, x, y) ≥ (d/2)·√(k/2) = Ω(√(VC(H) · n)).
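The proof above hinges on the Khinchine lower bound E|σ1 + · · · + σk| ≥ √(k/2) from Theorem E.1. The following Monte-Carlo sanity check is our own illustration rather than anything from the paper; it estimates the left-hand side empirically and compares it to the bound.

```python
import math
import random

def khinchine_lhs(k, trials=50_000, seed=0):
    """Monte-Carlo estimate of E|sigma_1 + ... + sigma_k| for i.i.d. uniform
    signs sigma_i in {-1, +1}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += abs(sum(rng.choice((-1, 1)) for _ in range(k)))
    return total / trials

# Theorem E.1 asserts E|sum| >= sqrt(k/2); for large k the expectation
# approaches sqrt(2k/pi) ~ 0.798*sqrt(k), comfortably above 0.707*sqrt(k).
for k in (4, 16, 64):
    assert khinchine_lhs(k) >= math.sqrt(k / 2)
```

For k = 4 the expectation can be computed exactly: E|σ1 + σ2 + σ3 + σ4| = 1.5, versus the bound √2 ≈ 1.414, so the constant 1/√2 is not far from tight even for small k.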