# Multiclass Transductive Online Learning

Steve Hanneke, Department of Computer Science, Purdue University, West Lafayette, IN 47907, steve.hanneke@gmail.com
Vinod Raman, Department of Statistics, University of Michigan, Ann Arbor, MI 48104, vkraman@umich.edu
Amirreza Shaeri, Department of Computer Science, Purdue University, West Lafayette, IN 47907, amirreza.shaeiri@gmail.com
Unique Subedi, Department of Statistics, University of Michigan, Ann Arbor, MI 48104, subedi@umich.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

We consider the problem of multiclass transductive online learning when the number of labels can be unbounded. Previous works by Ben-David et al. [1997] and Hanneke et al. [2023b] only consider the case of binary and finite label spaces, respectively. The latter work determined that their techniques fail to extend to the case of unbounded label spaces, and they pose the question of characterizing the optimal mistake bound for unbounded label spaces. We answer this question by showing that a new dimension, termed the Level-constrained Littlestone dimension, characterizes online learnability in this setting. Along the way, we show that the trichotomy of possible minimax rates of the expected number of mistakes established by Hanneke et al. [2023b] for finite label spaces in the realizable setting continues to hold even when the label space is unbounded. In particular, if the learner plays for T ∈ N rounds, its minimax expected number of mistakes can only grow like Θ(T), Θ(log T), or Θ(1). To prove this result, we give another combinatorial dimension, termed the Level-constrained Branching dimension, and show that its finiteness characterizes constant minimax expected mistake bounds. The trichotomy is then determined by a combination of the Level-constrained Littlestone and Branching dimensions. Quantitatively, our upper bounds improve upon existing multiclass upper bounds in Hanneke et al. [2023b] by removing the dependence on the label set size. In doing so, we explicitly construct learning algorithms that can handle extremely large or unbounded label spaces. A key and novel component of our algorithm is a new notion of shattering that exploits the sequential nature of transductive online learning. Finally, we complete our results by proving expected regret bounds in the agnostic setting, extending the result of Hanneke et al. [2023b].

1 Introduction

Imagine you are a famous musician who has released K ∈ N songs. You are now on tour visiting T ∈ N cities worldwide based on a pre-specified plan, each with unique musical preferences that you have some understanding of. At each city, you can perform only one song in your concert, and following each performance, the audience provides feedback indicating their preferred song from your repertoire. Your goal is to select the song that aligns with the majority's taste in each city to maximize satisfaction. How can you effectively select songs to ensure the highest audience satisfaction across most cities under minimal assumptions?

The above example and similar real-world situations, where entities operate according to a possibly adversarially chosen pre-specified schedule, can be formulated in a framework called Multiclass Transductive Online Learning. Formally, in this setting, an adversary plays a repeated game against the learner over some T ∈ N rounds.
Before the game begins, the adversary selects a sequence of T instances (x_1, . . . , x_T) ∈ X^T from some non-empty instance space X (e.g. images) and reveals it to the learner. Subsequently, during each round t ∈ {1, . . . , T}, the learner predicts a label ŷ_t ∈ Y from some non-empty label space Y (e.g. categories of images), the adversary reveals the true label y_t ∈ Y, and the learner suffers the 0-1 loss, namely 1{ŷ_t ≠ y_t}. Importantly, the label space Y is not even required to be countable; we assume only standard measure-theoretic properties for it. Following the well-established frameworks in learning theory, given a concept class C ⊆ Y^X of functions c : X → Y, the goal of the learner is to minimize the number of mistakes relative to the best fixed concept in C. If there exists c ∈ C such that c(x_t) = y_t for all t ∈ {1, . . . , T}, we say we are in the realizable setting, and otherwise in the agnostic setting. We briefly note that if the learner's predictions are randomized, we focus on the expected value of the mentioned objective.
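As an illustration (not part of the formal development), the interaction protocol above can be summarized by the following minimal Python sketch. The interface (receive_instances, predict, update) and the toy MemorizingLearner are our own illustrative choices, not constructions from the paper.

```python
# Minimal sketch of the transductive online protocol (realizable case).
from typing import Any, Callable, Sequence

class MemorizingLearner:
    """Toy learner: it sees the whole instance sequence up front and simply memorizes
    every (instance, label) pair it has been shown, guessing a default otherwise."""
    def __init__(self, default_label: Any):
        self.default = default_label
        self.seen: dict = {}
        self.x_seq: list = []

    def receive_instances(self, x_seq: Sequence[Any]) -> None:
        self.x_seq = list(x_seq)        # the entire sequence is revealed before the game begins

    def predict(self, t: int) -> Any:
        return self.seen.get(self.x_seq[t], self.default)

    def update(self, t: int, y: Any) -> None:
        self.seen[self.x_seq[t]] = y    # the true label is revealed after the prediction


def run_protocol(learner, x_seq: Sequence[Any], target: Callable[[Any], Any]) -> int:
    """Play T rounds on the pre-specified sequence; return the number of mistakes."""
    learner.receive_instances(x_seq)
    mistakes = 0
    for t, x in enumerate(x_seq):
        y_hat = learner.predict(t)
        y_true = target(x)              # realizability: labels come from some concept in C
        mistakes += int(y_hat != y_true)
        learner.update(t, y_true)
    return mistakes


# Example: repeated instances can be answered correctly after one reveal each.
print(run_protocol(MemorizingLearner(default_label=0), [1, 1, 2, 2, 1], target=lambda x: x + 10))
# -> 2 (one mistake per distinct instance)
```

Any learner considered in this paper can be viewed through this interface; the realizable setting corresponds to the target labels being generated by some concept in C.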
In this paper, our main contribution is algorithmically answering the following question in the multiclass transductive online learning framework: Given a concept class C ⊆ Y^X, what is the minimum expected number of mistakes achievable by a learner against any realizable adversary? For the special case of binary classification (|Y| = 2), this question was first considered by Ben-David et al. [1997] and then later fully resolved by Hanneke et al. [2023b]. Additionally, Hanneke et al. [2023b] considered the case where |Y| > 2, but did not resolve this question when Y is unbounded. In fact, the bounds by Hanneke et al. [2023b] break down even when |Y| ≥ 2^T. As a result, Hanneke et al. [2023b] posed the characterization of the minimax expected number of mistakes in the multiclass setting with an infinite label set as an open question, which we resolve in this paper.

1.1 Online Learning and Multiclass Classification

In this work, we study the transductive online learning framework, where the adversary reveals the entire sequence of instances (x_1, . . . , x_T) to the learner before the game begins. In the traditional online learning framework, the sequence of instances (x_1, . . . , x_T) is revealed to the learner sequentially, one at a time. That is, on round t ∈ {1, . . . , T}, the learner has only observed x_1, . . . , x_t. The celebrated work of Littlestone [1988] introduced this framework for binary classification and quantified the best achievable number of mistakes against any realizable adversary for a concept class C ⊆ {0, 1}^X in terms of a combinatorial parameter called the Littlestone dimension. Later, the work of Ben-David et al. [2009] showed that the Littlestone dimension of a concept class C ⊆ {0, 1}^X continues to quantify the expected relative number of mistakes (i.e. expected regret) in this framework in the more general agnostic setting. More recently, Daniely et al. [2012] and Hanneke et al. [2023a] extended these results to multiclass online learning in the realizable and agnostic settings, respectively. See Section A for more details.

In traditional online classification, there are two sources of uncertainty: one associated with the sequence of instances, and the other with respect to the true labels. Ben-David et al. [1997] initiated the study of transductive online classification with the aim of understanding how exclusively label uncertainty impacts the optimal number of mistakes. Furthermore, removing the uncertainty with respect to the instances can significantly reduce the optimal number of mistakes. For example, for the concept class of halfspaces in the realizable setting, the optimal number of mistakes grows linearly with the time horizon T in the traditional online binary classification framework, while growing only as Θ(log T) in the transductive online binary classification framework. So, it is natural to expect that additional assumptions of this kind reduce the optimal number of mistakes or enlarge the collection of learnable classes. Notably, Ben-David et al. [1997] initially called this setting "offline learning", but it was later renamed "transductive online learning" by Hanneke et al. [2023b] due to its close resemblance to transductive PAC learning [Vapnik and Chervonenkis, 1974, Vapnik, 1982, 1998]. See Section A for more details.

While Ben-David et al. [1997] and Hanneke et al. [2023b] mainly focused on binary classification, in this work, we focus on the more general multiclass classification setting. Natarajan and Tadepalli [1988], Natarajan [1989] and Daniely et al. [2012] initiated the study of multiclass prediction within the foundational PAC framework and the traditional online framework, respectively. More recently, following the work of Brukhim et al. [2021], there has been a growing interest in understanding multiclass learning when the size of the label space is unbounded, including Hanneke et al. [2023c,a], Raman et al. [2023]. This interest is driven by several motivations. Firstly, guarantees for the multiclass setting should not inherently depend on the number of labels, even when it is finite. Secondly, in mathematics, concepts involving infinities often provide cleaner insights. Thirdly, insights from this problem might also advance understanding of real-valued regression problems [Attias et al., 2023]. Finally, on a practical front, many crucial machine learning tasks involve classification into extremely large label spaces. For instance, in image object recognition, the number of classes corresponds to the variety of recognizable objects, and in language models, the class count expands with the dictionary size. See Section A for more details.

1.2 Main Results and Techniques

In the following subsections, we present an overview of our main findings along with a summary of our proof techniques.

1.2.1 Realizable Setting

In the realizable setting, we assume that the sequence of labeled instances (x_1, y_1), . . . , (x_T, y_T) played by the adversary is consistent with at least one concept in C. Here, our objective is to minimize the well-known notion of the expected number of mistakes. We provide upper and lower bounds on the best achievable worst-case expected number of mistakes by the learner as a function of T and C, which we denote by M*(T, C). Hanneke et al. [2023b] established a trichotomy of rates in the case of binary classification. That is, for every C ⊆ {0, 1}^X, we have that M*(T, C) can only grow like Θ(T), Θ(log T), or Θ(1), where the Littlestone and Vapnik-Chervonenkis (VC) dimensions of C characterize the possible rate. In this work, we extend this trichotomy to the multiclass classification setting, even when Y is unbounded. To do so, we introduce two new combinatorial parameters, termed the Level-constrained Littlestone dimension and the Level-constrained Branching dimension. To define the Level-constrained Littlestone dimension, we first need to define the Level-constrained Littlestone tree. A Level-constrained Littlestone tree is a Littlestone tree with the additional requirement that the same instance has to label all the internal nodes across a given level.
Then, the Level-constrained Littlestone dimension is the largest natural number d ∈ N such that there exists a shattered Level-constrained Littlestone tree T of depth d. To define the Level-constrained Branching dimension, we first need to define the Level-constrained Branching tree. The Level-constrained Branching tree is a Level-constrained Littlestone tree without the restriction that the labels on the two outgoing edges are distinct. Then, the Level-constrained Branching dimension is the smallest natural number d ∈ N such that for every shattered Level-constrained Branching tree T, there exists a path down T such that the number of nodes whose outgoing edges are labeled by different elements of Y is at most d. The Level-constrained Littlestone dimension reduces to the VC dimension when |Y| = 2. Additionally, the finiteness of the Level-constrained Branching and Littlestone dimensions coincide when |Y| = 2. Finally, we note that the Level-constrained Branching dimension is exactly equal to the notion of rank in the work of Ben-David et al. [1997]; however, we believe it is simpler to understand. Using the Level-constrained Littlestone and Branching dimensions, we establish the following trichotomy.

Theorem 1 (Trichotomy). Let C ⊆ Y^X be a concept class. Then, we have:
M*(T, C) = Θ(1), if B(C) < ∞;
M*(T, C) = Θ(log T), if D(C) < ∞ and B(C) = ∞;
M*(T, C) = Θ(T), if D(C) = ∞.
Here, B(C) is the Level-constrained Branching dimension and D(C) is the Level-constrained Littlestone dimension, defined in Section 2.

To prove the O(log T) upper bound for binary online classification, Hanneke et al. [2023b] run the Halving algorithm on the projection of C onto x_1, ..., x_T and use the Sauer-Shelah-Perles (SSP) lemma to bound the size of this projection by O(T^{VC(C)}). However, this approach is not applicable when Y is unbounded. For example, when C = {x ↦ n : n ∈ N} is the set of all constant functions over N, the size of the projection of C onto even a single x ∈ X is infinite. Moreover, the mentioned class can be learned with at most one mistake. Thus, fundamentally new techniques are required. To this end, we define a new notion of shattering which makes it possible to apply an analog of the Halving algorithm. Additionally, while the proof of the O(1) upper bound in Hanneke et al. [2023b] follows immediately from the guarantee of the Standard Optimal Algorithm (SOA) by Littlestone [1988], our O(1) upper bound in terms of the Level-constrained Branching dimension requires a modification of the SOA. We complement our results by presenting matching lower bounds. See Section 3 for more details. In Section F, we provide a comprehensive comparison between our dimensions and existing multiclass combinatorial complexity parameters.
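To illustrate the point above about the class of constant functions, here is a small, hedged sketch. The learner below is our own toy construction, used only for illustration and not the algorithm developed in this paper: it shows that C = {x ↦ n : n ∈ N} has an infinite projection onto any single instance (so the projection-plus-Halving strategy gives no nontrivial bound), yet a single revealed label identifies the target concept.

```python
# Toy illustration: the class of constant functions over N is learnable with one mistake,
# even though its projection onto any instance is the infinite set {0, 1, 2, ...}.

class OneShotConstantLearner:
    def __init__(self):
        self.known_constant = None

    def receive_instances(self, x_seq):
        pass                              # the instances carry no information for this class

    def predict(self, t):
        return self.known_constant if self.known_constant is not None else 0

    def update(self, t, y):
        self.known_constant = y           # any single label identifies the target constant


def mistakes_on(learner, x_seq, target):
    learner.receive_instances(x_seq)
    m = 0
    for t, x in enumerate(x_seq):
        y = target(x)
        m += int(learner.predict(t) != y)
        learner.update(t, y)
    return m

# At most one mistake, for any horizon and any constant target:
print(mistakes_on(OneShotConstantLearner(), list(range(100)), target=lambda x: 42))  # -> 1

# By contrast, the projection of C onto even a single instance x is {0, 1, 2, ...}, so a
# Halving argument over the projected class (the binary-case strategy) has an infinite
# version space to halve and yields no nontrivial mistake bound.
```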
1.2.2 Agnostic Setting

In the agnostic setting, we make no assumptions about the sequence (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T) played by the adversary. Here, our focus shifts to the well-established notion of expected regret, which compares the expected number of mistakes made by the algorithm to that made by the best concept in the concept class over the sequence. As in the realizable setting, we aim to establish both upper and lower bounds on the optimal worst-case expected regret achievable by the learner, expressed as a function of T and the concept class C, denoted by R*(T, C). The prior work by Hanneke et al. [2023b] showed that in the case of binary classification, R*(T, C) is Θ̃(√(VC(C) · T)) whenever VC(C) < ∞ and Θ(T) otherwise, where Θ̃ hides logarithmic factors in T. Using the Level-constrained Littlestone dimension, we extend these results to multiclass classification.

Theorem 2. For every concept class C ⊆ Y^X and T ≥ D(C), we have the following:
√(T · D(C) / 8) ≤ R*(T, C) ≤ √(T · D(C)) · log(eT / D(C)),
where D(C) is the Level-constrained Littlestone dimension defined in Section 2.

Our results in the agnostic setting can be proved using core ideas in the proofs of the agnostic results from Ben-David et al. [2009], Hanneke et al. [2023a], and Hanneke et al. [2023b]. See Section E for more details.

2 Preliminaries

2.1 Notation

Let X denote an example space and Y denote the label space. We make no assumptions about Y, so it can be unbounded and even uncountable (e.g. Y = R). Following the work of Hanneke et al. [2023a], if we consider randomized learning algorithms, the associated σ-algebra is of little consequence, except that singleton sets {y} should be measurable. Let C ⊆ Y^X denote a concept class. We abbreviate a sequence z_1, ..., z_T by z_{1:T}. Moreover, we also define z_{1:0} to be the empty sequence.

[…] Otherwise, if B(V_t, x_{t:T}) = 0, then we have {c(x_t) : c ∈ V_t} = {y} for some y ∈ Y. So, by our prediction rule, we cannot have y_t ≠ ŷ_t under realizability. To establish (1), we want to show that B(V_{t+1}, x_{t+1:T}) < B(V_t, x_{t:T}). Suppose, for the sake of contradiction, we instead have B(V_{t+1}, x_{t+1:T}) ≥ B(V_t, x_{t:T}). Then, let us define d := B(V_{t+1}, x_{t+1:T}). If d = 0, then our proof is complete because 0 ≤ B(V_t, x_{t:T}) − 1{y_t ≠ ŷ_t}. Assume that d > 0 and recall that V_{t+1} = V_t^{y_t}. By the definition of B(V_t^{y_t}, x_{t+1:T}) and its equivalent shattered-trees representation, there exists a level-constrained tree T_{y_t} of depth T − t whose internal nodes are labeled by x_{t+1}, . . . , x_T and which is shattered by V_t^{y_t}. Moreover, every path down T_{y_t} has branching factor at least d. Next, as ŷ_t = argmax_{y ∈ Y} B(V_t^y, x_{t+1:T}), we further have B(V_t^{ŷ_t}, x_{t+1:T}) ≥ B(V_t^{y_t}, x_{t+1:T}) ≥ d. Thus, there exists another level-constrained tree T_{ŷ_t} of depth T − t whose internal nodes are labeled by x_{t+1}, . . . , x_T, that is shattered by V_t^{ŷ_t}, and every path down T_{ŷ_t} has branching factor at least d. Finally, consider a new tree T with its root node labeled by x_t, the left outgoing edge from the root node labeled by y_t, and the right outgoing edge labeled by ŷ_t. Moreover, the subtree following the outgoing edge labeled by y_t is T_{y_t}, and the subtree following the outgoing edge labeled by ŷ_t is T_{ŷ_t}. Since both T_{y_t} and T_{ŷ_t} are valid level-constrained trees, each with internal nodes labeled by x_{t+1}, . . . , x_T, the newly constructed tree T is also a level-constrained tree of depth T − t + 1 with internal nodes labeled by x_t, . . . , x_T. In addition, as T_{y_t} and T_{ŷ_t} are shattered by V_t^{y_t} and V_t^{ŷ_t} respectively, the tree T must be shattered by V_t^{y_t} ∪ V_t^{ŷ_t}. Finally, as every path down each of the subtrees T_{y_t} and T_{ŷ_t} has branching factor at least d and y_t ≠ ŷ_t, every path of T must have branching factor at least d + 1. This shows that B(V_t^{y_t} ∪ V_t^{ŷ_t}, x_{t:T}) ≥ d + 1. And since V_t^{y_t} ∪ V_t^{ŷ_t} ⊆ V_t, we have d + 1 ≤ B(V_t^{y_t} ∪ V_t^{ŷ_t}, x_{t:T}) ≤ B(V_t, x_{t:T}) by monotonicity. This contradicts our assumption that d := B(V_{t+1}, x_{t+1:T}) ≥ B(V_t, x_{t:T}). Therefore, we must have B(V_{t+1}, x_{t+1:T}) < B(V_t, x_{t:T}). This establishes (1), completing our proof.
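The surviving portion of the proof above analyzes the prediction rule ŷ_t = argmax_y B(V_t^y, x_{t+1:T}). The following Python sketch is only our schematic restatement of that rule, not an implementable algorithm from the paper: B is treated as a black-box oracle (computing the Level-constrained Branching dimension of a version space is not addressed here), and the argmax is restricted to the labels actually attained by the current version space at x_t, an assumption of this sketch.

```python
# Schematic of the branching-factor-maximizing prediction rule (realizability assumed,
# so the version space V never becomes empty). B(V, xs) is an oracle returning the
# Level-constrained Branching dimension of V on the remaining instances xs.

def branching_learner(x_seq, concepts, B, reveal_label):
    """Predict, on each round, the label whose induced version space has the largest
    branching factor on the future instances; then shrink the version space."""
    V = set(concepts)
    mistakes = 0
    for t, x in enumerate(x_seq):
        candidates = {c(x) for c in V}                  # labels still consistent at x_t
        future = x_seq[t + 1:]
        y_hat = max(candidates, key=lambda y: B({c for c in V if c(x) == y}, future))
        y_true = reveal_label(t)                        # adversary reveals the true label
        mistakes += int(y_hat != y_true)
        V = {c for c in V if c(x) == y_true}            # keep only consistent concepts
    return mistakes
```

The proof above shows that with this rule the branching factor of the version space strictly decreases whenever a mistake is made, which is what yields the B(C) mistake bound.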
3.2 Proof of Upper Bound D(C) log(eT / D(C))

Proof. Fix the time horizon T ∈ N and let x_{1:T} := (x_1, ..., x_T) ∈ X^T be the sequence of T instances revealed to the learner at the beginning of the game. We say a subsequence x′_{1:n} := (x′_1, ..., x′_n), preserving the same order as in x_{1:T}, is shattered by V ⊆ C if there exists a sequence of functions {Y_t}_{t=1}^n, where Y_t : {0, 1}^t → Y, such that for every σ ∈ {0, 1}^n, we have that (i) Y_t(σ […] ≤ 2^q − 1. We now proceed by induction on q. For the base case q = 1, let T denote a level-constrained tree with depth(T) ≥ 1 shattered by C that satisfies properties (a) and (b) specified in Lemma 1. First, consider the case where the root node of T has branching. Since every path in T can have exactly 1 branching node, there can be no further branching in T. Next, consider the case where the root node of T is not a branching node and ℓ is the first level in T with branching. There must be 2^{ℓ−1} nodes in this level, henceforth denoted by {v_i}_{i=1}^{2^{ℓ−1}}. Moreover, denote by T_{v_i} the corresponding subtree of T with v_i as its root node. Since T satisfies property (b) and there are no branching nodes before level ℓ, the subtrees {T_{v_i}}_{i=1}^{2^{ℓ−1}} must be identical. Since all subtrees {T_{v_i}}_{i=1}^{2^{ℓ−1}} have branching at the root node, there can be no further branching in these subtrees beyond the root node. Therefore, there cannot be any other level ℓ′ > ℓ in T with a branching node. This establishes that ℓ is the only level in T with at least one branching node. In either case, we have BN(T) ≤ 1 = 2^q − 1. Assume that Lemma 1 is true for some q = n ∈ N. We now establish Lemma 1 for q = n + 1. To that end, let T be a level-constrained tree with branching factor q = n + 1 shattered by C that satisfies (a) and (b). Let ℓ ≥ 1 be the first level in T with at least one branching node, and let {T_i}_{i=1}^{2^{ℓ−1}} be all the subtrees whose root nodes are the nodes on level ℓ. As argued in the base case, all these subtrees must be identical. Thus, branching occurs on the same set of levels in all these subtrees, which implies that BN(T) = BN(T_i) for all i ∈ [2^{ℓ−1}]. Let T_1^0 and T_1^1 denote the left and right subtrees following the two outgoing edges from the root node of T_1. Since there is branching at the root node of T_1, we must have BN(T_1) ≤ 1 + BN(T_1^0) + BN(T_1^1). For each i ∈ {0, 1}, the subtree T_1^i is a level-constrained tree shattered by C that satisfies properties (a) and (b) for q = n. Using the inductive hypothesis, we have BN(T_1^i) ≤ 2^n − 1 for i ∈ {0, 1}. Therefore, combining everything, BN(T) = BN(T_1) ≤ 1 + BN(T_1^0) + BN(T_1^1) ≤ 1 + (2^n − 1) + (2^n − 1) = 2^{n+1} − 1. This completes our induction step.

E Minimax Rates in the Agnostic Setting

We go beyond the realizable setting and establish the minimax regret in the agnostic setting in terms of the Level-constrained Littlestone dimension.

Theorem 4 (Regret bound). For every concept class C ⊆ Y^X and T ≥ D(C), we have
√(T · D(C) / 8) ≤ R*(T, C) ≤ √(T · D(C)) · log(eT / D(C)).

We note that the upper and lower bounds in Theorem 4 are off only by a factor logarithmic in T. We leave it as an open question to establish matching upper and lower bounds.
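For readability, the substitution carried out at the end of the following upper-bound proof can be spelled out as below. This is only the arithmetic; the intermediate guarantee √(T · M · log(eT/M)) is the one cited from Hanneke et al. [2023a], and M denotes the realizable mistake bound.

```latex
% With M = D(C)\log\frac{eT}{D(C)}, and M \ge D(C) since T \ge D(C)
% (hence \log\frac{eT}{M} \le \log\frac{eT}{D(C)}):
\sqrt{T \cdot M \cdot \log\tfrac{eT}{M}}
  \;=\; \sqrt{T \cdot D(C)\log\tfrac{eT}{D(C)} \cdot \log\tfrac{eT}{D(C)\log\frac{eT}{D(C)}}}
  \;\le\; \sqrt{T \cdot D(C)} \cdot \log\tfrac{eT}{D(C)} .
```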
Proof (of the upper bound in Theorem 4). To prove the upper bound, we will use the agnostic-to-realizable reduction from Hanneke et al. [2023a] to convert our realizable learner in Section 3.2 into an agnostic learner with the claimed upper bound on expected regret. By Theorem 4 in [Hanneke et al., 2023a], any conservative deterministic learner A with mistake bound M can be converted into an agnostic learner with expected regret at most √(T · M · log(eT / M)). Although the proof by Hanneke et al. [2023a] only converts the conservative Standard Optimal Algorithm into an agnostic learner, the arguments are general enough that the conversion can be adapted to any conservative deterministic mistake-bound learner. By Theorem 3 and the proof in Section 3.2, there exists a conservative deterministic realizable learner with mistake bound at most D(C) log(eT / D(C)). Using the realizable-to-agnostic conversion from Theorem 4 in [Hanneke et al., 2023a] with the conservative version of the realizable learner in Section 3.2 gives an agnostic learner with expected regret at most
√( T · D(C) log(eT / D(C)) · log(eT / (D(C) log(eT / D(C)))) ) ≤ √(T · D(C)) · log(eT / D(C)),
completing the proof.

Proof (of the lower bound in Theorem 4). Our proof of the lower bound is identical to the lower bound for the binary setting proved in [Hanneke et al., 2023b, Theorem 6.1], which is a simple adaptation of the standard lower bound technique from [Ben-David et al., 2009]. Thus, we only outline a sketch of the proof here. Let d = D(C). Consider a sequence of instances x′_1, . . . , x′_d ∈ X and a sequence of functions {Y_i}_{i=1}^d that is shattered by C according to Definition 2. Pick the largest odd number k ∈ N such that kd ≤ T. First, the adversary reveals the instances x_1, . . . , x_T such that x_t = x′_1 for t = 1, . . . , k, followed by x_t = x′_2 for t = k + 1, . . . , 2k, and so forth. If T > kd, take x_t = x′_d for all t > kd. As for labels, the adversary first samples (σ_1, σ_2, . . . , σ_T) ∼ Uniform({0, 1}^T). Then, for t = 1, . . . , k, the labels are selected as y_t = Y_1(σ_t). For t = k + 1, . . . , 2k, the labels are selected as y_t = Y_2((σ̄_1, σ_t)), where σ̄_1 = 1{ Σ_{t=1}^k 1[σ_t = 0] < Σ_{t=1}^k 1[σ_t = 1] } is the majority bit in the first block t = 1, . . . , k. One can define y_t for all t > 2k analogously. For this stream, the label y_t is essentially equivalent to the bit σ_t ∈ {0, 1}. Therefore, following the exact same arguments as in [Hanneke et al., 2023b, Theorem 6.1] establishes the lower bound of √(Td / 8). This completes the sketch of our proof.

Remark 1. Let k ∈ N be the number of classes and let C ⊆ {1, 2, . . . , k}^X be a concept class. It is notable that for a small number of classes k (i.e. k ≪ 2^{(log T)^2}), the Natarajan bound that can be proved using the technique of Hanneke et al. [2023b] can be smaller than the upper bound in terms of the Level-constrained Littlestone dimension. However, for large k (i.e. k ≫ 2^{(log T)^2}), our upper bound in terms of D(C) can be better.
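As an illustration, the hard stream used in the lower-bound sketch above can be written out programmatically. The snippet below is our own rendering under stated assumptions: Y_funcs[i] stands in for the shattering function Y_{i+1} (mapping a bit pattern of length i+1 to a label), leftover rounds beyond kd reuse the last block, and the snippet only constructs the adversary's instance/label stream; the regret analysis is the one referenced above.

```python
# Sketch of the adversary's stream from the lower-bound proof (names are ours).
import random

def hard_stream(shattered_xs, Y_funcs, T, rng=random):
    d = len(shattered_xs)
    k = (T // d) if (T // d) % 2 == 1 else (T // d) - 1   # largest odd k with k*d <= T
    sigma = [rng.randint(0, 1) for _ in range(T)]          # uniform random bits
    xs, ys, majorities = [], [], []
    for t in range(T):
        block = min(t // k, d - 1)                         # index of the current block
        xs.append(shattered_xs[block])
        if t == block * k and block > 0:                   # entering a new block:
            prev = sigma[(block - 1) * k: block * k]       #   record the previous block's
            majorities.append(int(sum(prev) > len(prev) // 2))  # majority bit
        ys.append(Y_funcs[block](tuple(majorities[:block]) + (sigma[t],)))
    return xs, ys
```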
F Comparisons to Existing Combinatorial Dimensions

In this section, we compare the Level-constrained Littlestone dimension (Definition 2) and the Level-constrained Branching dimension (Definition 3) to existing combinatorial dimensions in multiclass learning.

F.1 Existing Combinatorial Dimensions

Definition 4 (i-neighbour). Let f, g ∈ Y^d for some d ∈ N. For every i ∈ [d], we say that f and g are i-neighbours if f_i ≠ g_i and f_j = g_j for all j ∈ [d] \ {i}.

Definition 5 (DS dimension, Daniely and Shalev-Shwartz [2014]). Let C ⊆ Y^X be a concept class and let S ∈ X^d be a sequence for some d ∈ N. We say that S is DS-shattered by C if there exists F ⊆ C with |F| < ∞ such that for all f ∈ {g ∈ Y^d : ∃h ∈ F such that g_i = h(S_i) for all i ∈ [d]} and for all i ∈ [d], f has at least one i-neighbour in this set. The DS dimension of C, denoted DS(C), is the maximal size of a sequence S ∈ X^d, over d ∈ N, that is DS-shattered by C.

Definition 6 (Graph dimension). Let C ⊆ Y^X be a concept class and let S ⊆ X. We say that S is G-shattered by C if there exists an f : S → Y such that for every T ⊆ S there is a g ∈ C with g(x) = f(x) for all x ∈ T and g(x) ≠ f(x) for all x ∈ S \ T. The graph dimension of C, denoted G(C), is the maximal cardinality of a set S ⊆ X that is G-shattered by C.

Definition 7 (Natarajan Threshold dimension). Let C ⊆ Y^X be a concept class and let S ∈ X^d be a sequence for some d ∈ N. We say that S is NT-shattered by C if there exist f, g : [d] → Y with f(i) ≠ g(i) for all i ∈ [d], and there exist (c_0, c_1, c_2, . . . , c_d) ∈ C^{d+1} such that for every i ∈ [d + 1] and j ∈ [d]:
c_{i−1}(S_j) = f(j) if j < i, and c_{i−1}(S_j) = g(j) if j ≥ i.
The Natarajan Threshold dimension of C, denoted NT(C), is the maximal size of a sequence S ∈ X^d, over d ∈ N, that is NT-shattered by C.
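To make Definition 5 concrete, here is a toy instance (our own example, not taken from the paper):

```latex
% Take d = 2 and suppose the projection of some finite F \subseteq C onto S = (S_1, S_2)
% is exactly the set of all four binary patterns:
F|_S \;=\; \{(0,0),\ (0,1),\ (1,0),\ (1,1)\} \;\subseteq\; Y^2 .
% Every f \in F|_S then has a 1-neighbour (flip the first coordinate) and a 2-neighbour
% (flip the second coordinate) inside F|_S, so S is DS-shattered by C and DS(C) \ge 2.
```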
F.2 Comparison

It is easy to show that for every concept class C ⊆ Y^X, its Natarajan dimension is always less than or equal to its DS dimension. Moreover, the work of Brukhim et al. [2022] demonstrated that there exists a concept class C ⊆ Y^X for which the Natarajan dimension is 1 but DS(C) = ∞. Here, we show that for every concept class C ⊆ Y^X, its DS dimension is always less than or equal to its Level-constrained Littlestone dimension. Furthermore, we demonstrate that there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞. These two results are shown in Proposition 2.

Proposition 2. For every concept class C ⊆ Y^X, we have DS(C) ≤ D(C). Moreover, there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞.

Proof. First, we prove that for every concept class C ⊆ Y^X, we have DS(C) ≤ D(C). Let C ⊆ Y^X be a concept class such that DS(C) is finite. We show that we can construct a Level-constrained Littlestone tree T of depth DS(C) that is shattered by C; thus DS(C) ≤ D(C). Let S ∈ X^{DS(C)} be a sequence of instances that is DS-shattered by C. We show that we can construct a Level-constrained Littlestone tree T of depth DS(C), having the members of S as its level labels in order (the first member labeling its root, and so on), which is shattered by C. We show the construction by induction. If DS(C) = 1, it is clear that we can construct a Level-constrained Littlestone tree T of depth 1 that is shattered by C, because there must be two concepts in C which disagree on the single member of S. We assume that if DS(C) = d, we can construct such a Level-constrained Littlestone tree T of depth d shattered by C, where S ∈ X^d is a sequence of size d witnessing DS(C) = d. Now, we prove that if DS(C) = d + 1, we can construct a Level-constrained Littlestone tree T of depth d + 1, having the members of S as its level labels in order, which is shattered by C, where S ∈ X^{d+1} is a sequence of size d + 1 witnessing DS(C) = d + 1. Let F ⊆ C be a finite set witnessing DS(C) = d + 1. Take two concepts c_1, c_2 ∈ F with c_1(S_1) ≠ c_2(S_1); such a pair exists because every element of the projection of F onto S has a 1-neighbour. Define F′ := {f ∈ F : f(S_1) = c_1(S_1)} and F″ := {f ∈ F : f(S_1) = c_2(S_1)}. Observe that S_2, S_3, . . . , S_{d+1} and F′ witness the DS-shattering of a sequence of size d, and similarly S_2, S_3, . . . , S_{d+1} and F″ witness the DS-shattering of a sequence of size d. Now, we set the root of T to be labeled by S_1, with its two outgoing edges labeled by c_1(S_1) and c_2(S_1). Based on the inductive assumption combined with the facts mentioned above, we can complete the construction of a Level-constrained Littlestone tree T of depth DS(C), having the members of S as its level labels in order, which is shattered by C. Finally, we note that if DS(C) = ∞, since we can carry out the construction for every depth d ∈ N, we must have D(C) = ∞.

Second, we prove that there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞. To show this, we use our next proposition, namely Proposition 3, combined with the well-known fact that for every C ⊆ Y^X, we have DS(C) ≤ G(C).

Next, we show that there exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. On the other hand, we also prove the existence of a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1. These two results, shown in Proposition 3, imply that the Level-constrained Littlestone dimension and the Graph dimension are not comparable. Moreover, our first claim has an interesting consequence. In particular, it illustrates that having a finite Level-constrained Littlestone dimension is not necessary for having a bounded-size sample compression scheme. This follows from the fact that having a finite Graph dimension is sufficient for having a bounded-size sample compression scheme [David et al., 2016]. We also remark that for every concept class C ⊆ Y^X, its DS dimension is always less than or equal to its Graph dimension.

Proposition 3. There exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. Moreover, there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1.

Proof. First, we prove the second claim. To show it, we rely on Example 1 in Hanneke et al. [2023a]. In particular, they showed there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and L(C′) = 1. As we know D(C′) ≤ L(C′), we conclude there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1. Second, we prove that there exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. Let T be an infinite-depth rooted perfect binary tree such that all of its levels and edges are labeled by distinct elements. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the levels of T and Y be the elements on the edges of T. Also, define the concept class C ⊆ Y^X as follows: C contains exactly the concepts consistent with a branch of T. Thus, clearly, we have D(C) = ∞. Now, we show that G(C) = 1. We prove this by contradiction. Assume G(C) ≥ 2. Thus, there exist a set S = {x_1, x_2} ⊆ X of size 2 and f : S → Y witnessing that G(C) ≥ 2. Without loss of generality, we assume that x_1 is above x_2 in T. Using the fact that the edges of T are labeled with distinct elements of Y, there cannot exist both c_1 ∈ C and c_2 ∈ C such that c_1(x_1) = f(x_1), c_1(x_2) = f(x_2), c_2(x_1) ≠ f(x_1), and c_2(x_2) = f(x_2). This is a contradiction, thus G(C) = 1.

It is well known that for every concept class C ⊆ Y^X, its Littlestone dimension is always less than or equal to its sequential graph dimension. Moreover, the work of Hanneke et al. [2023a] demonstrated that there exists a concept class C ⊆ Y^X such that L(C) = 1 and SG(C) = ∞. Here, we show that for every concept class C ⊆ Y^X, its Level-constrained Branching dimension is always less than or equal to its Littlestone dimension. Furthermore, we demonstrate that there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞. These two results are shown in Proposition 4.

Proposition 4. For every concept class C ⊆ Y^X, we have B(C) ≤ L(C). Moreover, there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞.

Proof. The claim that for every concept class C ⊆ Y^X we have B(C) ≤ L(C) is given by Proposition 1. Therefore, we focus on showing that there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞.
Let T be an infinite-depth rooted perfect binary tree such that all of its nodes are labeled by distinct elements, all of its left edges are labeled by 0, and all of its right edges are labeled by 1. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the nodes of T. Also, define the concept class C′ as follows: C′ contains exactly the concepts consistent with a branch of T. Further, each of these concepts predicts a unique label on all instances outside its associated branch. In addition, define Y as the union of {0, 1} and all the unique labels used in the definition of C′. Thus, we have L(C′) = ∞. Now, we show that B(C′) ≤ 2. To prove this, we demonstrate that for every T ∈ N, we have inf_{deterministic A} M_A(T, C′) ≤ 2. As a result, we can then conclude that B(C′) ≤ 2. To see why inf_{deterministic A} M_A(T, C′) ≤ 2 for every T ∈ N implies that B(C′) ≤ 2, suppose for the sake of contradiction that B(C′) ≥ 3. Then there exists a Level-constrained Branching tree of depth d ∈ N whose branching factor is at least 3. Let T = d. It is not hard to see that there exists a sequence of instances of size T such that for every deterministic learner, there exists a realizable labeling of the instances that forces the learner to make at least 3 mistakes over the T rounds. This leads to a contradiction. Thus, we conclude that B(C′) ≤ 2.

We now construct a deterministic learner A such that M_A(T, C′) ≤ 2 for every T ∈ N. Let T ∈ N, and let S ∈ X^T be the sequence chosen by the adversary at the beginning of the game. Also, let c* ∈ C′ be the target concept chosen by the adversary, and let u be the root-to-leaf path in T associated with the concept c*. In addition, for every i ∈ [T], let v_i be a root-to-leaf path in T containing the first i members of S, if it exists. Finally, let i* be the smallest positive integer such that v_{i*} does not exist; if no such i* exists, let i* = T + 1. Our algorithm A predicts according to the {0, 1} labels associated with the path v_{i*−1} for the first i* − 1 points in S. Furthermore, if the adversary ever reveals a unique label, we use its corresponding c ∈ C′ to make predictions in all future rounds. For the i*-th member of S, if it exists, we predict arbitrarily. To see that this algorithm makes at most 2 mistakes, we consider two cases. (1) If i* = T + 1, then our algorithm makes at most one mistake. In fact, our algorithm makes a mistake (a) if the adversary switches the label from a bit in {0, 1} to a unique label corresponding to the target concept c*, or (b) possibly on the last instance. (2) Otherwise, the algorithm makes at most two mistakes; the first mistake can occur on round i* − 1 and the second on round i*, after which the true c* is known to the learner from its unique label. Indeed, if the adversary switches the label from a bit in {0, 1} to a unique label corresponding to the target concept c* before round i* − 1, we make only one mistake. This completes the proof.
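As a reading aid, the learner described in the proof above can be summarized schematically. The following is only our paraphrase: choose_path, path.covers, path.edge_label, and concept_from_unique_label are hypothetical stand-ins for the tree bookkeeping in the proof (choosing the path v_{i*−1}, reading off its {0, 1} edge labels, and recovering c* from a revealed unique label).

```python
# Schematic two-phase learner for the tree-based class C' (our paraphrase of the proof).
def two_mistake_learner(x_seq, reveal_label, choose_path, concept_from_unique_label):
    path = choose_path(x_seq)     # longest root-to-leaf path consistent with a prefix of x_seq
    identified = None             # becomes the target concept once a unique label is seen
    mistakes = 0
    for t, x in enumerate(x_seq):
        if identified is not None:
            y_hat = identified(x)              # phase 2: the target concept is known exactly
        elif path.covers(x):
            y_hat = path.edge_label(x)         # phase 1: follow the committed path ({0,1} labels)
        else:
            y_hat = 0                          # the single "arbitrary" prediction at round i*
        y = reveal_label(t)
        mistakes += int(y_hat != y)
        if y not in (0, 1):                    # a unique label pins down the target concept
            identified = concept_from_unique_label(y)
    return mistakes
```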
Finally, the works of Shelah [1990] and Hodges [1997] showed that the finiteness of the Littlestone and Threshold dimensions coincide in the binary setting. Here, we show that this is not the case for the Level-constrained Branching dimension and the Natarajan Threshold dimension. More specifically, we show that for every concept class C ⊆ Y^X, its Level-constrained Branching dimension is always greater than or equal to the logarithm of its Natarajan Threshold dimension. However, we give a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞. These two results are shown in Proposition 5. Notably, the lower bound of Hanneke et al. [2023b], based on the threshold dimension, can be easily generalized to our setting for the Natarajan Threshold dimension.

Proposition 5. For every concept class C ⊆ Y^X, we have log(NT(C)) ≤ B(C). Moreover, there exists a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞.

Proof. First, we prove that for every concept class C ⊆ Y^X, we have log(NT(C)) ≤ B(C). Let C ⊆ Y^X be a concept class such that NT(C) = d for some d ∈ N, and let T = d. On the one hand, by presenting the sequence of instances that is NT-shattered by C to the learner, we can use a technique similar to [Hanneke et al., 2023b, Claim 3.4] to prove a lower bound of log(NT(C)) on M*(T, C). On the other hand, based on Section 3, we can prove an upper bound of B(C) on M*(T, C). Thus, we have log(NT(C)) ≤ B(C).

Second, we prove that there exists a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞. Let T be a rooted binary tree with the following three properties: (1) all of its levels and edges are labeled by distinct elements; (2) each level contains only one node with two children; (3) its branching factor is infinite. It is not hard to see that such a tree exists. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the levels of T and Y be the elements on the edges of T. Also, define the concept class C′ ⊆ Y^X as follows: C′ contains exactly the concepts consistent with a branch of T. Thus, clearly, we have B(C′) = ∞. Now, we show that NT(C′) = 1. We prove this by contradiction. Assume NT(C′) ≥ 2. Then, there exist S = (x_1, x_2) ∈ X^2 and (c_0, c_1, c_2) ∈ C′^3, together with f, g as in Definition 7, witnessing NT(C′) ≥ 2. Without loss of generality, we assume that x_1 is above x_2 in T. Based on our construction of T, since c_0(x_1) = g(1) ≠ f(1) = c_1(x_1), the branches corresponding to c_0 and c_1 diverge at or above the level of x_1, and because all edges of T are labeled by distinct elements, it follows that c_0(x_2) ≠ c_1(x_2). However, NT-shattering requires c_0(x_2) = c_1(x_2) = g(2). Thus, NT(C′) cannot even be 2, which completes our contradiction-based proof.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: In the abstract and introduction, we claim that we establish a trichotomy of rates in multiclass transductive online learning in the realizable setting and near-tight upper and lower bounds in the agnostic setting for arbitrary label spaces. The first claim is proven in Section 3 and the second claim is proven in Section E.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In Section 1.2.2, we point out that there is a gap of log T between our lower and upper bounds on regret in the agnostic setting. We also point out future directions in the Discussion section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: This paper provides a full set of assumptions and a complete proof for every theoretical result. Proposition 1 is proved in Appendix B, Theorem 1 is proved in Section 3, Theorem 3 is proved in Section 3 and Appendix C, and Lemma 1 is proved in Appendix D. All theorems and lemmas are properly referenced.

4. Experimental Result Reproducibility
Answer: [NA]
Justification: This paper does not include experiments.

5. Open Access to Data and Code
Answer: [NA]
Justification: This paper does not include experiments.

6. Experimental Setting/Details
Answer: [NA]
Justification: This paper does not include experiments.

7. Experiment Statistical Significance
Answer: [NA]
Justification: This paper does not include experiments.

8. Experiments Compute Resources
Answer: [NA]
Justification: This paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
Answer: [Yes]
Justification: We have reviewed the NeurIPS Code of Ethics and ensured that our paper conforms, in every respect, with it. We have also made sure to preserve anonymity.

10. Broader Impacts
Answer: [NA]
Justification: As this paper is completely theoretical in nature, there does not seem to be any societal impact of the work performed.

11. Safeguards
Answer: [NA]
Justification: This paper is theoretical and poses no such risks.

12. Licenses for Existing Assets
Answer: [NA]
Justification: This paper does not use existing assets.

13. New Assets
Answer: [NA]
Justification: This paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.