# Multiclass Transductive Online Learning

Steve Hanneke, Department of Computer Science, Purdue University, West Lafayette, IN 47907, steve.hanneke@gmail.com
Vinod Raman, Department of Statistics, University of Michigan, Ann Arbor, MI 48104, vkraman@umich.edu
Amirreza Shaeri, Department of Computer Science, Purdue University, West Lafayette, IN 47907, amirreza.shaeiri@gmail.com
Unique Subedi, Department of Statistics, University of Michigan, Ann Arbor, MI 48104, subedi@umich.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

We consider the problem of multiclass transductive online learning when the number of labels can be unbounded. Previous works by Ben-David et al. [1997] and Hanneke et al. [2023b] only consider the case of binary and finite label spaces, respectively. The latter work determined that their techniques fail to extend to the case of unbounded label spaces, and they pose the question of characterizing the optimal mistake bound for unbounded label spaces. We answer this question by showing that a new dimension, termed the Level-constrained Littlestone dimension, characterizes online learnability in this setting. Along the way, we show that the trichotomy of possible minimax rates of the expected number of mistakes established by Hanneke et al. [2023b] for finite label spaces in the realizable setting continues to hold even when the label space is unbounded. In particular, if the learner plays for T ∈ N rounds, its minimax expected number of mistakes can only grow like Θ(T), Θ(log T), or Θ(1). To prove this result, we give another combinatorial dimension, termed the Level-constrained Branching dimension, and show that its finiteness characterizes constant minimax expected mistake bounds. The trichotomy is then determined by a combination of the Level-constrained Littlestone and Branching dimensions. Quantitatively, our upper bounds improve upon existing multiclass upper bounds in Hanneke et al. [2023b] by removing the dependence on the label set size. In doing so, we explicitly construct learning algorithms that can handle extremely large or unbounded label spaces. A key and novel component of our algorithm is a new notion of shattering that exploits the sequential nature of transductive online learning. Finally, we complete our results by proving expected regret bounds in the agnostic setting, extending the result of Hanneke et al. [2023b].

1 Introduction

Imagine you are a famous musician who has released K ∈ N songs. You are now on tour visiting T ∈ N cities worldwide based on a pre-specified plan, each with unique musical preferences that you have some understanding of. At each city, you can perform only one song in your concert, and following each performance, the audience provides feedback indicating their preferred song from your repertoire. Your goal is to select the song that aligns with the majority's taste in each city to maximize satisfaction. How can you effectively select songs to ensure the highest audience satisfaction across most cities under minimal assumptions?

The above example and similar real-world situations, where entities operate according to a possibly adversarially chosen pre-specified schedule, can be formulated in a framework called Multiclass Transductive Online Learning. Formally, in this setting, an adversary plays a repeated game against the learner over some T ∈ N rounds.
Before the game begins, the adversary selects a sequence of T instances (x_1, . . . , x_T) ∈ X^T from some non-empty instance space X (e.g. images) and reveals it to the learner. Subsequently, during each round t ∈ {1, . . . , T}, the learner predicts a label ŷ_t ∈ Y from some non-empty label space Y (e.g. categories of images), the adversary reveals the true label y_t ∈ Y, and the learner suffers the 0-1 loss, namely 1{ŷ_t ≠ y_t}. Importantly, the label space Y is not even required to be countable; we assume only standard measure-theoretic properties for it. Following the well-established frameworks in learning theory, given a concept class C ⊆ Y^X of functions c : X → Y, the goal of the learner is to minimize the number of mistakes relative to the best fixed concept in C. If there exists c ∈ C such that c(x_t) = y_t for all t ∈ {1, . . . , T}, we say we are in the realizable setting, and otherwise in the agnostic setting. We briefly note that if the learner's predictions are randomized, we focus on the expected value of the mentioned objective.
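As an illustration (not part of the formal development), the interaction protocol above can be summarized by the following minimal Python sketch. The interface (receive_instances, predict, update) and the toy MemorizingLearner are our own illustrative choices, not constructions from the paper.

```python
# Minimal sketch of the transductive online protocol (realizable case).
from typing import Any, Callable, Sequence

class MemorizingLearner:
    """Toy learner: it sees the whole instance sequence up front and simply memorizes
    every (instance, label) pair it has been shown, guessing a default otherwise."""
    def __init__(self, default_label: Any):
        self.default = default_label
        self.seen: dict = {}
        self.x_seq: list = []

    def receive_instances(self, x_seq: Sequence[Any]) -> None:
        self.x_seq = list(x_seq)        # the entire sequence is revealed before the game begins

    def predict(self, t: int) -> Any:
        return self.seen.get(self.x_seq[t], self.default)

    def update(self, t: int, y: Any) -> None:
        self.seen[self.x_seq[t]] = y    # the true label is revealed after the prediction


def run_protocol(learner, x_seq: Sequence[Any], target: Callable[[Any], Any]) -> int:
    """Play T rounds on the pre-specified sequence; return the number of mistakes."""
    learner.receive_instances(x_seq)
    mistakes = 0
    for t, x in enumerate(x_seq):
        y_hat = learner.predict(t)
        y_true = target(x)              # realizability: labels come from some concept in C
        mistakes += int(y_hat != y_true)
        learner.update(t, y_true)
    return mistakes


# Example: repeated instances can be answered correctly after one reveal each.
print(run_protocol(MemorizingLearner(default_label=0), [1, 1, 2, 2, 1], target=lambda x: x + 10))
# -> 2 (one mistake per distinct instance)
```

Any learner considered in this paper can be viewed through this interface; the realizable setting corresponds to the target labels being generated by some concept in C.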
In this paper, our main contribution is algorithmically answering the following question in the multiclass transductive online learning framework: Given a concept class C ⊆ Y^X, what is the minimum expected number of mistakes achievable by a learner against any realizable adversary? For the special case of binary classification (|Y| = 2), this question was first considered by Ben-David et al. [1997] and then later fully resolved by Hanneke et al. [2023b]. Additionally, Hanneke et al. [2023b] considered the case where |Y| > 2, but did not resolve this question when Y is unbounded. In fact, the bounds by Hanneke et al. [2023b] break down even when |Y| ≥ 2^T. As a result, Hanneke et al. [2023b] posed the characterization of the minimax expected number of mistakes in the multiclass setting with an infinite label set as an open question, which we resolve in this paper.

1.1 Online Learning and Multiclass Classification

In this work, we study the transductive online learning framework, where the adversary reveals the entire sequence of instances (x_1, . . . , x_T) to the learner before the game begins. In the traditional online learning framework, the sequence of instances (x_1, . . . , x_T) is revealed to the learner sequentially, one at a time. That is, on round t ∈ {1, . . . , T}, the learner has only observed x_1, . . . , x_t. The celebrated work of Littlestone [1988] introduced this framework for binary classification and quantified the best achievable number of mistakes against any realizable adversary for a concept class C ⊆ {0, 1}^X in terms of a combinatorial parameter called the Littlestone dimension. Later, the work of Ben-David et al. [2009] showed that the Littlestone dimension of a concept class C ⊆ {0, 1}^X continues to quantify the expected relative number of mistakes (i.e. expected regret) in this framework in the more general agnostic setting. More recently, Daniely et al. [2012] and Hanneke et al. [2023a] extended these results to multiclass online learning in the realizable and agnostic settings, respectively. See Section A for more details.

In traditional online classification, there are two sources of uncertainty: one associated with the sequence of instances, and the other with respect to the true labels. Ben-David et al. [1997] initiated the study of transductive online classification with the aim of understanding how exclusively label uncertainty impacts the optimal number of mistakes. Furthermore, removing the uncertainty with respect to the instances can significantly reduce the optimal number of mistakes. For example, for the concept class of halfspaces in the realizable setting, the optimal number of mistakes grows linearly with the time horizon T in the traditional online binary classification framework, while growing only as Θ(log T) in the transductive online binary classification framework. So, it is natural to expect that additional assumptions of this kind reduce the optimal number of mistakes or enlarge the collection of learnable classes. Notably, Ben-David et al. [1997] initially called this setting "offline learning", but it was later renamed "transductive online learning" by Hanneke et al. [2023b] due to its close resemblance to transductive PAC learning [Vapnik and Chervonenkis, 1974, Vapnik, 1982, 1998]. See Section A for more details.

While Ben-David et al. [1997] and Hanneke et al. [2023b] mainly focused on binary classification, in this work, we focus on the more general multiclass classification setting. Natarajan and Tadepalli [1988], Natarajan [1989] and Daniely et al. [2012] initiated the study of multiclass prediction within the foundational PAC framework and the traditional online framework, respectively. More recently, following the work of Brukhim et al. [2021], there has been a growing interest in understanding multiclass learning when the size of the label space is unbounded, including Hanneke et al. [2023c,a], Raman et al. [2023]. This interest is driven by several motivations. Firstly, guarantees for the multiclass setting should not inherently depend on the number of labels, even when it is finite. Secondly, in mathematics, concepts involving infinities often provide cleaner insights. Thirdly, insights from this problem might also advance understanding of real-valued regression problems [Attias et al., 2023]. Finally, on a practical front, many crucial machine learning tasks involve classification into extremely large label spaces. For instance, in image object recognition, the number of classes corresponds to the variety of recognizable objects, and in language models, the class count expands with the dictionary size. See Section A for more details.

1.2 Main Results and Techniques

In the following subsections, we present an overview of our main findings along with a summary of our proof techniques.

1.2.1 Realizable Setting

In the realizable setting, we assume that the sequence of labeled instances (x_1, y_1), . . . , (x_T, y_T) played by the adversary is consistent with at least one concept in C. Here, our objective is to minimize the well-known notion of the expected number of mistakes. We provide upper and lower bounds on the best achievable worst-case expected number of mistakes by the learner as a function of T and C, which we denote by M*(T, C). Hanneke et al. [2023b] established a trichotomy of rates in the case of binary classification. That is, for every C ⊆ {0, 1}^X, we have that M*(T, C) can only grow like Θ(T), Θ(log T), or Θ(1), where the Littlestone and Vapnik-Chervonenkis (VC) dimensions of C characterize the possible rate. In this work, we extend this trichotomy to the multiclass classification setting, even when Y is unbounded. To do so, we introduce two new combinatorial parameters, termed the Level-constrained Littlestone dimension and the Level-constrained Branching dimension. To define the Level-constrained Littlestone dimension, we first need to define the Level-constrained Littlestone tree. A Level-constrained Littlestone tree is a Littlestone tree with the additional requirement that the same instance has to label all the internal nodes across a given level.
Then, the Level-constrained Littlestone dimension is the largest natural number d ∈ N such that there exists a shattered Level-constrained Littlestone tree T of depth d. To define the Level-constrained Branching dimension, we first need to define the Level-constrained Branching tree. The Level-constrained Branching tree is a Level-constrained Littlestone tree without the restriction that the labels on the two outgoing edges are distinct. Then, the Level-constrained Branching dimension is the smallest natural number d ∈ N such that for every shattered Level-constrained Branching tree T, there exists a path down T such that the number of nodes whose outgoing edges are labeled by different elements of Y is at most d. The Level-constrained Littlestone dimension reduces to the VC dimension when |Y| = 2. Additionally, the finiteness of the Level-constrained Branching and Littlestone dimensions coincide when |Y| = 2. Finally, we note that the Level-constrained Branching dimension is exactly equal to the notion of rank in the work of Ben-David et al. [1997]; however, we believe it is simpler to understand. Using the Level-constrained Littlestone and Branching dimensions, we establish the following trichotomy.

Theorem 1 (Trichotomy). Let C ⊆ Y^X be a concept class. Then, we have:
M*(T, C) = Θ(1), if B(C) < ∞;
M*(T, C) = Θ(log T), if D(C) < ∞ and B(C) = ∞;
M*(T, C) = Θ(T), if D(C) = ∞.
Here, B(C) is the Level-constrained Branching dimension and D(C) is the Level-constrained Littlestone dimension, defined in Section 2.

To prove the O(log T) upper bound for binary online classification, Hanneke et al. [2023b] run the Halving algorithm on the projection of C onto x_1, ..., x_T and use the Sauer-Shelah-Perles (SSP) lemma to bound the size of this projection by O(T^{VC(C)}). However, this approach is not applicable when Y is unbounded. For example, when C = {x ↦ n : n ∈ N} is the set of all constant functions over N, the size of the projection of C onto even a single x ∈ X is infinite. Moreover, the mentioned class can be learned with at most one mistake. Thus, fundamentally new techniques are required. To this end, we define a new notion of shattering which makes it possible to apply an analog of the Halving algorithm. Additionally, while the proof of the O(1) upper bound in Hanneke et al. [2023b] follows immediately from the guarantee of the Standard Optimal Algorithm (SOA) by Littlestone [1988], our O(1) upper bound in terms of the Level-constrained Branching dimension requires a modification of the SOA. We complement our results by presenting matching lower bounds. See Section 3 for more details. In Section F, we provide a comprehensive comparison between our dimensions and existing multiclass combinatorial complexity parameters.
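To illustrate the point above about the class of constant functions, here is a small, hedged sketch. The learner below is our own toy construction, used only for illustration and not the algorithm developed in this paper: it shows that C = {x ↦ n : n ∈ N} has an infinite projection onto any single instance (so the projection-plus-Halving strategy gives no nontrivial bound), yet a single revealed label identifies the target concept.

```python
# Toy illustration: the class of constant functions over N is learnable with one mistake,
# even though its projection onto any instance is the infinite set {0, 1, 2, ...}.

class OneShotConstantLearner:
    def __init__(self):
        self.known_constant = None

    def receive_instances(self, x_seq):
        pass                              # the instances carry no information for this class

    def predict(self, t):
        return self.known_constant if self.known_constant is not None else 0

    def update(self, t, y):
        self.known_constant = y           # any single label identifies the target constant


def mistakes_on(learner, x_seq, target):
    learner.receive_instances(x_seq)
    m = 0
    for t, x in enumerate(x_seq):
        y = target(x)
        m += int(learner.predict(t) != y)
        learner.update(t, y)
    return m

# At most one mistake, for any horizon and any constant target:
print(mistakes_on(OneShotConstantLearner(), list(range(100)), target=lambda x: 42))  # -> 1

# By contrast, the projection of C onto even a single instance x is {0, 1, 2, ...}, so a
# Halving argument over the projected class (the binary-case strategy) has an infinite
# version space to halve and yields no nontrivial mistake bound.
```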
1.2.2 Agnostic Setting

In the agnostic setting, we make no assumptions about the sequence (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T) played by the adversary. Here, our focus shifts to the well-established notion of expected regret, which compares the expected number of mistakes made by the algorithm to that made by the best concept in the concept class over the sequence. As in the realizable setting, we aim to establish both upper and lower bounds on the optimal worst-case expected regret achievable by the learner, expressed as a function of T and the concept class C, denoted by R*(T, C). The prior work by Hanneke et al. [2023b] showed that in the case of binary classification, R*(T, C) is Θ̃(√(VC(C) · T)) whenever VC(C) < ∞ and Θ(T) otherwise, where Θ̃ hides logarithmic factors in T. Using the Level-constrained Littlestone dimension, we extend these results to multiclass classification.

Theorem 2. For every concept class C ⊆ Y^X and T ≥ D(C), we have the following:
√(T · D(C) / 8) ≤ R*(T, C) ≤ √(T · D(C)) · log(eT / D(C)),
where D(C) is the Level-constrained Littlestone dimension defined in Section 2.

Our results in the agnostic setting can be proved using core ideas in the proofs of the agnostic results from Ben-David et al. [2009], Hanneke et al. [2023a], and Hanneke et al. [2023b]. See Section E for more details.

2 Preliminaries

2.1 Notation

Let X denote an example space and Y denote the label space. We make no assumptions about Y, so it can be unbounded and even uncountable (e.g. Y = R). Following the work of Hanneke et al. [2023a], if we consider randomized learning algorithms, the associated σ-algebra is of little consequence, except that singleton sets {y} should be measurable. Let C ⊆ Y^X denote a concept class. We abbreviate a sequence z_1, ..., z_T by z_{1:T}. Moreover, we also define z_{1:0} to be the empty sequence.

[…] Otherwise, if B(V_t, x_{t:T}) = 0, then we have {c(x_t) : c ∈ V_t} = {y} for some y ∈ Y. So, by our prediction rule, we cannot have y_t ≠ ŷ_t under realizability. To establish (1), we want to show that B(V_{t+1}, x_{t+1:T}) < B(V_t, x_{t:T}). Suppose, for the sake of contradiction, we instead have B(V_{t+1}, x_{t+1:T}) ≥ B(V_t, x_{t:T}). Then, let us define d := B(V_{t+1}, x_{t+1:T}). If d = 0, then our proof is complete because 0 ≤ B(V_t, x_{t:T}) − 1{y_t ≠ ŷ_t}. Assume that d > 0 and recall that V_{t+1} = V_t^{y_t}. By the definition of B(V_t^{y_t}, x_{t+1:T}) and its equivalent shattered-trees representation, there exists a level-constrained tree T_{y_t} of depth T − t whose internal nodes are labeled by x_{t+1}, . . . , x_T and which is shattered by V_t^{y_t}. Moreover, every path down T_{y_t} has branching factor at least d. Next, as ŷ_t = argmax_{y ∈ Y} B(V_t^y, x_{t+1:T}), we further have B(V_t^{ŷ_t}, x_{t+1:T}) ≥ B(V_t^{y_t}, x_{t+1:T}) ≥ d. Thus, there exists another level-constrained tree T_{ŷ_t} of depth T − t whose internal nodes are labeled by x_{t+1}, . . . , x_T, that is shattered by V_t^{ŷ_t}, and every path down T_{ŷ_t} has branching factor at least d. Finally, consider a new tree T with its root node labeled by x_t, the left outgoing edge from the root node labeled by y_t, and the right outgoing edge labeled by ŷ_t. Moreover, the subtree following the outgoing edge labeled by y_t is T_{y_t}, and the subtree following the outgoing edge labeled by ŷ_t is T_{ŷ_t}. Since both T_{y_t} and T_{ŷ_t} are valid level-constrained trees, each with internal nodes labeled by x_{t+1}, . . . , x_T, the newly constructed tree T is also a level-constrained tree of depth T − t + 1 with internal nodes labeled by x_t, . . . , x_T. In addition, as T_{y_t} and T_{ŷ_t} are shattered by V_t^{y_t} and V_t^{ŷ_t} respectively, the tree T must be shattered by V_t^{y_t} ∪ V_t^{ŷ_t}. Finally, as every path down each of the subtrees T_{y_t} and T_{ŷ_t} has branching factor at least d and y_t ≠ ŷ_t, every path of T must have branching factor at least d + 1. This shows that B(V_t^{y_t} ∪ V_t^{ŷ_t}, x_{t:T}) ≥ d + 1. And since V_t^{y_t} ∪ V_t^{ŷ_t} ⊆ V_t, we have d + 1 ≤ B(V_t^{y_t} ∪ V_t^{ŷ_t}, x_{t:T}) ≤ B(V_t, x_{t:T}) by monotonicity. This contradicts our assumption that d := B(V_{t+1}, x_{t+1:T}) ≥ B(V_t, x_{t:T}). Therefore, we must have B(V_{t+1}, x_{t+1:T}) < B(V_t, x_{t:T}). This establishes (1), completing our proof.
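The surviving portion of the proof above analyzes the prediction rule ŷ_t = argmax_y B(V_t^y, x_{t+1:T}). The following Python sketch is only our schematic restatement of that rule, not an implementable algorithm from the paper: B is treated as a black-box oracle (computing the Level-constrained Branching dimension of a version space is not addressed here), and the argmax is restricted to the labels actually attained by the current version space at x_t, an assumption of this sketch.

```python
# Schematic of the branching-factor-maximizing prediction rule (realizability assumed,
# so the version space V never becomes empty). B(V, xs) is an oracle returning the
# Level-constrained Branching dimension of V on the remaining instances xs.

def branching_learner(x_seq, concepts, B, reveal_label):
    """Predict, on each round, the label whose induced version space has the largest
    branching factor on the future instances; then shrink the version space."""
    V = set(concepts)
    mistakes = 0
    for t, x in enumerate(x_seq):
        candidates = {c(x) for c in V}                  # labels still consistent at x_t
        future = x_seq[t + 1:]
        y_hat = max(candidates, key=lambda y: B({c for c in V if c(x) == y}, future))
        y_true = reveal_label(t)                        # adversary reveals the true label
        mistakes += int(y_hat != y_true)
        V = {c for c in V if c(x) == y_true}            # keep only consistent concepts
    return mistakes
```

The proof above shows that with this rule the branching factor of the version space strictly decreases whenever a mistake is made, which is what yields the B(C) mistake bound.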
3.2 Proof of Upper Bound D(C) log(eT / D(C))

Proof. Fix the time horizon T ∈ N and let x_{1:T} := (x_1, ..., x_T) ∈ X^T be the sequence of T instances revealed to the learner at the beginning of the game. We say a subsequence x′_{1:n} := (x′_1, ..., x′_n), preserving the same order as in x_{1:T}, is shattered by V ⊆ C if there exists a sequence of functions {Y_t}_{t=1}^n, where Y_t : {0, 1}^t → Y, such that for every σ ∈ {0, 1}^n, we have that (i) Y_t(σ […] ≤ 2^q − 1. We now proceed by induction on q. For the base case q = 1, let T denote a level-constrained tree with depth(T) ≥ 1 shattered by C that satisfies properties (a) and (b) specified in Lemma 1. First, consider the case where the root node of T has branching. Since every path in T can have exactly 1 branching node, there can be no further branching in T. Next, consider the case where the root node of T is not a branching node and ℓ is the first level in T with branching. There must be 2^{ℓ−1} nodes in this level, henceforth denoted by {v_i}_{i=1}^{2^{ℓ−1}}. Moreover, denote by T_{v_i} the corresponding subtree of T with v_i as its root node. Since T satisfies property (b) and there are no branching nodes before level ℓ, the subtrees {T_{v_i}}_{i=1}^{2^{ℓ−1}} must be identical. Since all subtrees {T_{v_i}}_{i=1}^{2^{ℓ−1}} have branching at the root node, there can be no further branching in these subtrees beyond the root node. Therefore, there cannot be any other level ℓ′ > ℓ in T with a branching node. This establishes that ℓ is the only level in T with at least one branching node. In either case, we have BN(T) ≤ 1 = 2^q − 1. Assume that Lemma 1 is true for some q = n ∈ N. We now establish Lemma 1 for q = n + 1. To that end, let T be a level-constrained tree with branching factor q = n + 1 shattered by C that satisfies (a) and (b). Let ℓ ≥ 1 be the first level in T with at least one branching node, and let {T_i}_{i=1}^{2^{ℓ−1}} be all the subtrees whose root nodes are the nodes on level ℓ. As argued in the base case, all these subtrees must be identical. Thus, branching occurs on the same set of levels in all these subtrees, which implies that BN(T) = BN(T_i) for all i ∈ [2^{ℓ−1}]. Let T_1^0 and T_1^1 denote the left and right subtrees following the two outgoing edges from the root node of T_1. Since there is branching at the root node of T_1, we must have BN(T_1) ≤ 1 + BN(T_1^0) + BN(T_1^1). For each i ∈ {0, 1}, the subtree T_1^i is a level-constrained tree shattered by C that satisfies properties (a) and (b) for q = n. Using the inductive hypothesis, we have BN(T_1^i) ≤ 2^n − 1 for i ∈ {0, 1}. Therefore, combining everything, BN(T) = BN(T_1) ≤ 1 + BN(T_1^0) + BN(T_1^1) ≤ 1 + (2^n − 1) + (2^n − 1) = 2^{n+1} − 1. This completes our induction step.

E Minimax Rates in the Agnostic Setting

We go beyond the realizable setting and establish the minimax regret in the agnostic setting in terms of the Level-constrained Littlestone dimension.

Theorem 4 (Regret bound). For every concept class C ⊆ Y^X and T ≥ D(C), we have
√(T · D(C) / 8) ≤ R*(T, C) ≤ √(T · D(C)) · log(eT / D(C)).

We note that the upper and lower bounds in Theorem 4 are off only by a factor logarithmic in T. We leave it as an open question to establish matching upper and lower bounds.
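For readability, the substitution carried out at the end of the following upper-bound proof can be spelled out as below. This is only the arithmetic; the intermediate guarantee √(T · M · log(eT/M)) is the one cited from Hanneke et al. [2023a], and M denotes the realizable mistake bound.

```latex
% With M = D(C)\log\frac{eT}{D(C)}, and M \ge D(C) since T \ge D(C)
% (hence \log\frac{eT}{M} \le \log\frac{eT}{D(C)}):
\sqrt{T \cdot M \cdot \log\tfrac{eT}{M}}
  \;=\; \sqrt{T \cdot D(C)\log\tfrac{eT}{D(C)} \cdot \log\tfrac{eT}{D(C)\log\frac{eT}{D(C)}}}
  \;\le\; \sqrt{T \cdot D(C)} \cdot \log\tfrac{eT}{D(C)} .
```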
Proof (of the upper bound in Theorem 4). To prove the upper bound, we will use the agnostic-to-realizable reduction from Hanneke et al. [2023a] to convert our realizable learner in Section 3.2 into an agnostic learner with the claimed upper bound on expected regret. By Theorem 4 in [Hanneke et al., 2023a], any conservative deterministic learner A with mistake bound M can be converted into an agnostic learner with expected regret at most √(T · M · log(eT / M)). Although the proof by Hanneke et al. [2023a] only converts the conservative Standard Optimal Algorithm into an agnostic learner, the arguments are general enough that the conversion can be adapted to any conservative deterministic mistake-bound learner. By Theorem 3 and the proof in Section 3.2, there exists a conservative deterministic realizable learner with mistake bound at most D(C) log(eT / D(C)). Using the realizable-to-agnostic conversion from Theorem 4 in [Hanneke et al., 2023a] with the conservative version of the realizable learner in Section 3.2 gives an agnostic learner with expected regret at most
√( T · D(C) log(eT / D(C)) · log(eT / (D(C) log(eT / D(C)))) ) ≤ √(T · D(C)) · log(eT / D(C)),
completing the proof.

Proof (of the lower bound in Theorem 4). Our proof of the lower bound is identical to the lower bound for the binary setting proved in [Hanneke et al., 2023b, Theorem 6.1], which is a simple adaptation of the standard lower bound technique from [Ben-David et al., 2009]. Thus, we only outline a sketch of the proof here. Let d = D(C). Consider a sequence of instances x′_1, . . . , x′_d ∈ X and a sequence of functions {Y_i}_{i=1}^d that is shattered by C according to Definition 2. Pick the largest odd number k ∈ N such that kd ≤ T. First, the adversary reveals the instances x_1, . . . , x_T such that x_t = x′_1 for t = 1, . . . , k, followed by x_t = x′_2 for t = k + 1, . . . , 2k, and so forth. If T > kd, take x_t = x′_d for all t > kd. As for labels, the adversary first samples (σ_1, σ_2, . . . , σ_T) ∼ Uniform({0, 1}^T). Then, for t = 1, . . . , k, the labels are selected as y_t = Y_1(σ_t). For t = k + 1, . . . , 2k, the labels are selected as y_t = Y_2((σ̄_1, σ_t)), where σ̄_1 = 1{ Σ_{t=1}^k 1[σ_t = 0] < Σ_{t=1}^k 1[σ_t = 1] } is the majority bit in the first block t = 1, . . . , k. One can define y_t for all t > 2k analogously. For this stream, the label y_t is essentially equivalent to the bit σ_t ∈ {0, 1}. Therefore, following the exact same arguments as in [Hanneke et al., 2023b, Theorem 6.1] establishes the lower bound of √(Td / 8). This completes the sketch of our proof.

Remark 1. Let k ∈ N be the number of classes and let C ⊆ {1, 2, . . . , k}^X be a concept class. It is notable that for a small number of classes k (i.e. k ≪ 2^{(log T)^2}), the Natarajan bound that can be proved using the technique of Hanneke et al. [2023b] can be smaller than the upper bound in terms of the Level-constrained Littlestone dimension. However, for large k (i.e. k ≫ 2^{(log T)^2}), our upper bound in terms of D(C) can be better.
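As an illustration, the hard stream used in the lower-bound sketch above can be written out programmatically. The snippet below is our own rendering under stated assumptions: Y_funcs[i] stands in for the shattering function Y_{i+1} (mapping a bit pattern of length i+1 to a label), leftover rounds beyond kd reuse the last block, and the snippet only constructs the adversary's instance/label stream; the regret analysis is the one referenced above.

```python
# Sketch of the adversary's stream from the lower-bound proof (names are ours).
import random

def hard_stream(shattered_xs, Y_funcs, T, rng=random):
    d = len(shattered_xs)
    k = (T // d) if (T // d) % 2 == 1 else (T // d) - 1   # largest odd k with k*d <= T
    sigma = [rng.randint(0, 1) for _ in range(T)]          # uniform random bits
    xs, ys, majorities = [], [], []
    for t in range(T):
        block = min(t // k, d - 1)                         # index of the current block
        xs.append(shattered_xs[block])
        if t == block * k and block > 0:                   # entering a new block:
            prev = sigma[(block - 1) * k: block * k]       #   record the previous block's
            majorities.append(int(sum(prev) > len(prev) // 2))  # majority bit
        ys.append(Y_funcs[block](tuple(majorities[:block]) + (sigma[t],)))
    return xs, ys
```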
F Comparisons to Existing Combinatorial Dimensions

In this section, we compare the Level-constrained Littlestone dimension (Definition 2) and the Level-constrained Branching dimension (Definition 3) to existing combinatorial dimensions in multiclass learning.

F.1 Existing Combinatorial Dimensions

Definition 4 (i-neighbour). Let f, g ∈ Y^d for some d ∈ N. For every i ∈ [d], we say that f and g are i-neighbours if f_i ≠ g_i and f_j = g_j for all j ∈ [d] \ {i}.

Definition 5 (DS dimension, Daniely and Shalev-Shwartz [2014]). Let C ⊆ Y^X be a concept class and let S ∈ X^d be a sequence for some d ∈ N. We say that S is DS-shattered by C if there exists F ⊆ C with |F| < ∞ such that for all f ∈ {g ∈ Y^d : ∃h ∈ F such that g_i = h(S_i) for all i ∈ [d]} and for all i ∈ [d], f has at least one i-neighbour in this set. The DS dimension of C, denoted DS(C), is the maximal size of a sequence S ∈ X^d, over d ∈ N, that is DS-shattered by C.

Definition 6 (Graph dimension). Let C ⊆ Y^X be a concept class and let S ⊆ X. We say that S is G-shattered by C if there exists an f : S → Y such that for every T ⊆ S there is a g ∈ C with g(x) = f(x) for all x ∈ T and g(x) ≠ f(x) for all x ∈ S \ T. The graph dimension of C, denoted G(C), is the maximal cardinality of a set S ⊆ X that is G-shattered by C.

Definition 7 (Natarajan Threshold dimension). Let C ⊆ Y^X be a concept class and let S ∈ X^d be a sequence for some d ∈ N. We say that S is NT-shattered by C if there exist f, g : [d] → Y with f(i) ≠ g(i) for all i ∈ [d], and there exist (c_0, c_1, c_2, . . . , c_d) ∈ C^{d+1} such that for every i ∈ [d + 1] and j ∈ [d]:
c_{i−1}(S_j) = f(j) if j < i, and c_{i−1}(S_j) = g(j) if j ≥ i.
The Natarajan Threshold dimension of C, denoted NT(C), is the maximal size of a sequence S ∈ X^d, over d ∈ N, that is NT-shattered by C.
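To make Definition 5 concrete, here is a toy instance (our own example, not taken from the paper):

```latex
% Take d = 2 and suppose the projection of some finite F \subseteq C onto S = (S_1, S_2)
% is exactly the set of all four binary patterns:
F|_S \;=\; \{(0,0),\ (0,1),\ (1,0),\ (1,1)\} \;\subseteq\; Y^2 .
% Every f \in F|_S then has a 1-neighbour (flip the first coordinate) and a 2-neighbour
% (flip the second coordinate) inside F|_S, so S is DS-shattered by C and DS(C) \ge 2.
```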
F.2 Comparison

It is easy to show that for every concept class C ⊆ Y^X, its Natarajan dimension is always less than or equal to its DS dimension. Moreover, the work of Brukhim et al. [2022] demonstrated that there exists a concept class C ⊆ Y^X for which the Natarajan dimension is 1 but DS(C) = ∞. Here, we show that for every concept class C ⊆ Y^X, its DS dimension is always less than or equal to its Level-constrained Littlestone dimension. Furthermore, we demonstrate that there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞. These two results are shown in Proposition 2.

Proposition 2. For every concept class C ⊆ Y^X, we have DS(C) ≤ D(C). Moreover, there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞.

Proof. First, we prove that for every concept class C ⊆ Y^X, we have DS(C) ≤ D(C). Let C ⊆ Y^X be a concept class such that DS(C) is finite. We show that we can construct a Level-constrained Littlestone tree T of depth DS(C) that is shattered by C; thus DS(C) ≤ D(C). Let S ∈ X^{DS(C)} be a sequence of instances that is DS-shattered by C. We show that we can construct a Level-constrained Littlestone tree T of depth DS(C), having the members of S as its level labels in order (the first member labeling its root, and so on), which is shattered by C. We show the construction by induction. If DS(C) = 1, it is clear that we can construct a Level-constrained Littlestone tree T of depth 1 that is shattered by C, because there must be two concepts in C which disagree on the single member of S. We assume that if DS(C) = d, we can construct such a Level-constrained Littlestone tree T of depth d shattered by C, where S ∈ X^d is a sequence of size d witnessing DS(C) = d. Now, we prove that if DS(C) = d + 1, we can construct a Level-constrained Littlestone tree T of depth d + 1, having the members of S as its level labels in order, which is shattered by C, where S ∈ X^{d+1} is a sequence of size d + 1 witnessing DS(C) = d + 1. Let F ⊆ C be a finite set witnessing DS(C) = d + 1. Take two concepts c_1, c_2 ∈ F with c_1(S_1) ≠ c_2(S_1); such a pair exists because every element of the projection of F onto S has a 1-neighbour. Define F′ := {f ∈ F : f(S_1) = c_1(S_1)} and F″ := {f ∈ F : f(S_1) = c_2(S_1)}. Observe that S_2, S_3, . . . , S_{d+1} and F′ witness the DS-shattering of a sequence of size d, and similarly S_2, S_3, . . . , S_{d+1} and F″ witness the DS-shattering of a sequence of size d. Now, we set the root of T to be labeled by S_1, with its two outgoing edges labeled by c_1(S_1) and c_2(S_1). Based on the inductive assumption combined with the facts mentioned above, we can complete the construction of a Level-constrained Littlestone tree T of depth DS(C), having the members of S as its level labels in order, which is shattered by C. Finally, we note that if DS(C) = ∞, since we can carry out the construction for every depth d ∈ N, we must have D(C) = ∞.

Second, we prove that there exists a concept class C′ ⊆ Y^X such that DS(C′) = 1 but D(C′) = ∞. To show this, we use our next proposition, namely Proposition 3, combined with the well-known fact that for every C ⊆ Y^X, we have DS(C) ≤ G(C).

Next, we show that there exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. On the other hand, we also prove the existence of a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1. These two results, shown in Proposition 3, imply that the Level-constrained Littlestone dimension and the Graph dimension are not comparable. Moreover, our first claim has an interesting consequence. In particular, it illustrates that having a finite Level-constrained Littlestone dimension is not necessary for having a bounded-size sample compression scheme. This follows from the fact that having a finite Graph dimension is sufficient for having a bounded-size sample compression scheme [David et al., 2016]. We also remark that for every concept class C ⊆ Y^X, its DS dimension is always less than or equal to its Graph dimension.

Proposition 3. There exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. Moreover, there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1.

Proof. First, we prove the second claim. To show it, we rely on Example 1 in Hanneke et al. [2023a]. In particular, they showed there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and L(C′) = 1. As we know D(C′) ≤ L(C′), we conclude there exists a concept class C′ ⊆ Y^X such that G(C′) = ∞ and D(C′) = 1. Second, we prove that there exists a concept class C ⊆ Y^X such that G(C) = 1 and D(C) = ∞. Let T be an infinite-depth rooted perfect binary tree such that all of its levels and edges are labeled by distinct elements. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the levels of T and Y be the elements on the edges of T. Also, define the concept class C ⊆ Y^X as follows: C contains exactly the concepts consistent with a branch of T. Thus, clearly, we have D(C) = ∞. Now, we show that G(C) = 1. We prove this by contradiction. Assume G(C) ≥ 2. Thus, there exist a set S = {x_1, x_2} ⊆ X of size 2 and f : S → Y witnessing that G(C) ≥ 2. Without loss of generality, we assume that x_1 is above x_2 in T. Using the fact that the edges of T are labeled with distinct elements of Y, there cannot exist both c_1 ∈ C and c_2 ∈ C such that c_1(x_1) = f(x_1), c_1(x_2) = f(x_2), c_2(x_1) ≠ f(x_1), and c_2(x_2) = f(x_2). This is a contradiction, thus G(C) = 1.

It is well known that for every concept class C ⊆ Y^X, its Littlestone dimension is always less than or equal to its sequential graph dimension. Moreover, the work of Hanneke et al. [2023a] demonstrated that there exists a concept class C ⊆ Y^X such that L(C) = 1 and SG(C) = ∞. Here, we show that for every concept class C ⊆ Y^X, its Level-constrained Branching dimension is always less than or equal to its Littlestone dimension. Furthermore, we demonstrate that there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞. These two results are shown in Proposition 4.

Proposition 4. For every concept class C ⊆ Y^X, we have B(C) ≤ L(C). Moreover, there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞.

Proof. The claim that for every concept class C ⊆ Y^X we have B(C) ≤ L(C) is given by Proposition 1. Therefore, we focus on showing that there exists a concept class C′ ⊆ Y^X such that B(C′) ≤ 2 and L(C′) = ∞.
Let T be an infinite-depth rooted perfect binary tree such that all of its nodes are labeled by distinct elements, all of its left edges are labeled by 0, and all of its right edges are labeled by 1. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the nodes of T. Also, define the concept class C′ as follows: C′ contains exactly the concepts consistent with a branch of T. Further, each of these concepts predicts a unique label on all instances outside its associated branch. In addition, define Y as the union of {0, 1} and all the unique labels used in the definition of C′. Thus, we have L(C′) = ∞. Now, we show that B(C′) ≤ 2. To prove this, we demonstrate that for every T ∈ N, we have inf_{deterministic A} M_A(T, C′) ≤ 2. As a result, we can then conclude that B(C′) ≤ 2. To see why inf_{deterministic A} M_A(T, C′) ≤ 2 for every T ∈ N implies that B(C′) ≤ 2, suppose for the sake of contradiction that B(C′) ≥ 3. Then there exists a Level-constrained Branching tree of depth d ∈ N whose branching factor is at least 3. Let T = d. It is not hard to see that there exists a sequence of instances of size T such that for every deterministic learner, there exists a realizable labeling of the instances that forces the learner to make at least 3 mistakes over the T rounds. This leads to a contradiction. Thus, we conclude that B(C′) ≤ 2.

We now construct a deterministic learner A such that M_A(T, C′) ≤ 2 for every T ∈ N. Let T ∈ N, and let S ∈ X^T be the sequence chosen by the adversary at the beginning of the game. Also, let c* ∈ C′ be the target concept chosen by the adversary, and let u be the root-to-leaf path in T associated with the concept c*. In addition, for every i ∈ [T], let v_i be a root-to-leaf path in T containing the first i members of S, if it exists. Finally, let i* be the smallest positive integer such that v_{i*} does not exist; if no such i* exists, let i* = T + 1. Our algorithm A predicts according to the {0, 1} labels associated with the path v_{i*−1} for the first i* − 1 points in S. Furthermore, if the adversary ever reveals a unique label, we use its corresponding c ∈ C′ to make predictions in all future rounds. For the i*-th member of S, if it exists, we predict arbitrarily. To see that this algorithm makes at most 2 mistakes, we consider two cases. (1) If i* = T + 1, then our algorithm makes at most one mistake. In fact, our algorithm makes a mistake (a) if the adversary switches the label from a bit in {0, 1} to a unique label corresponding to the target concept c*, or (b) possibly on the last instance. (2) Otherwise, the algorithm makes at most two mistakes; the first mistake can occur on round i* − 1 and the second on round i*, after which the true c* is known to the learner from its unique label. Indeed, if the adversary switches the label from a bit in {0, 1} to a unique label corresponding to the target concept c* before round i* − 1, we make only one mistake. This completes the proof.
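As a reading aid, the learner described in the proof above can be summarized schematically. The following is only our paraphrase: choose_path, path.covers, path.edge_label, and concept_from_unique_label are hypothetical stand-ins for the tree bookkeeping in the proof (choosing the path v_{i*−1}, reading off its {0, 1} edge labels, and recovering c* from a revealed unique label).

```python
# Schematic two-phase learner for the tree-based class C' (our paraphrase of the proof).
def two_mistake_learner(x_seq, reveal_label, choose_path, concept_from_unique_label):
    path = choose_path(x_seq)     # longest root-to-leaf path consistent with a prefix of x_seq
    identified = None             # becomes the target concept once a unique label is seen
    mistakes = 0
    for t, x in enumerate(x_seq):
        if identified is not None:
            y_hat = identified(x)              # phase 2: the target concept is known exactly
        elif path.covers(x):
            y_hat = path.edge_label(x)         # phase 1: follow the committed path ({0,1} labels)
        else:
            y_hat = 0                          # the single "arbitrary" prediction at round i*
        y = reveal_label(t)
        mistakes += int(y_hat != y)
        if y not in (0, 1):                    # a unique label pins down the target concept
            identified = concept_from_unique_label(y)
    return mistakes
```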
Finally, the works of Shelah [1990] and Hodges [1997] showed that the finiteness of the Littlestone and Threshold dimensions coincide in the binary setting. Here, we show that this is not the case for the Level-constrained Branching dimension and the Natarajan Threshold dimension. More specifically, we show that for every concept class C ⊆ Y^X, its Level-constrained Branching dimension is always greater than or equal to the logarithm of its Natarajan Threshold dimension. However, we give a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞. These two results are shown in Proposition 5. Notably, the lower bound of Hanneke et al. [2023b], based on the threshold dimension, can be easily generalized to our setting for the Natarajan Threshold dimension.

Proposition 5. For every concept class C ⊆ Y^X, we have log(NT(C)) ≤ B(C). Moreover, there exists a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞.

Proof. First, we prove that for every concept class C ⊆ Y^X, we have log(NT(C)) ≤ B(C). Let C ⊆ Y^X be a concept class such that NT(C) = d for some d ∈ N, and let T = d. On the one hand, by presenting the sequence of instances that is NT-shattered by C to the learner, we can use a technique similar to [Hanneke et al., 2023b, Claim 3.4] to prove a lower bound of log(NT(C)) on M*(T, C). On the other hand, based on Section 3, we can prove an upper bound of B(C) on M*(T, C). Thus, we have log(NT(C)) ≤ B(C).

Second, we prove that there exists a concept class C′ ⊆ Y^X such that NT(C′) = 1 and B(C′) = ∞. Let T be a rooted binary tree with the following three properties: (1) all of its levels and edges are labeled by distinct elements; (2) each level contains only one node with two children; (3) its branching factor is infinite. It is not hard to see that such a tree exists. The definition of such a tree is similar to Definition 1.7 in the work of Bousquet et al. [2021]. Let X be the elements on the levels of T and Y be the elements on the edges of T. Also, define the concept class C′ ⊆ Y^X as follows: C′ contains exactly the concepts consistent with a branch of T. Thus, clearly, we have B(C′) = ∞. Now, we show that NT(C′) = 1. We prove this by contradiction. Assume NT(C′) ≥ 2. Then, there exist S = (x_1, x_2) ∈ X^2 and (c_0, c_1, c_2) ∈ C′^3, together with f, g as in Definition 7, witnessing NT(C′) ≥ 2. Without loss of generality, we assume that x_1 is above x_2 in T. Based on our construction of T, since c_0(x_1) = g(1) ≠ f(1) = c_1(x_1), the branches corresponding to c_0 and c_1 diverge at or above the level of x_1, and because all edges of T are labeled by distinct elements, it follows that c_0(x_2) ≠ c_1(x_2). However, NT-shattering requires c_0(x_2) = c_1(x_2) = g(2). Thus, NT(C′) cannot even be 2, which completes our contradiction-based proof.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: In the abstract and introduction, we claim that we establish a trichotomy of rates in multiclass transductive online learning in the realizable setting and near-tight upper and lower bounds in the agnostic setting for arbitrary label spaces. The first claim is proven in Section 3 and the second claim is proven in Section E.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In Section 1.2.2, we point out that there is a gap of log T between our lower and upper bounds on regret in the agnostic setting. We also point out future directions in the Discussion section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: This paper provides a full set of assumptions and a complete proof for every theoretical result. Proposition 1 is proved in Appendix B, Theorem 1 is proved in Section 3, Theorem 3 is proved in Section 3 and Appendix C, and Lemma 1 is proved in Appendix D. All theorems and lemmas are properly referenced.

4. Experimental Result Reproducibility
Answer: [NA]
Justification: This paper does not include experiments.

5. Open Access to Data and Code
Answer: [NA]
Justification: This paper does not include experiments.

6. Experimental Setting/Details
Answer: [NA]
Justification: This paper does not include experiments.

7. Experiment Statistical Significance
Answer: [NA]
Justification: This paper does not include experiments.

8. Experiments Compute Resources
Answer: [NA]
Justification: This paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
Answer: [Yes]
Justification: We have reviewed the NeurIPS Code of Ethics and ensured that our paper conforms, in every respect, with it. We have also made sure to preserve anonymity.

10. Broader Impacts
Answer: [NA]
Justification: As this paper is completely theoretical in nature, there does not seem to be any societal impact of the work performed.

11. Safeguards
Answer: [NA]
Justification: This paper is theoretical and poses no such risks.

12. Licenses for Existing Assets
Answer: [NA]
Justification: This paper does not use existing assets.

13. New Assets
Answer: [NA]
Justification: This paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.