# Understanding the Eluder Dimension

Gene Li (Toyota Technological Institute at Chicago, gene@ttic.edu) · Pritish Kamath (Google Research, pritish@alum.mit.edu) · Dylan J. Foster (Microsoft Research, dylanfoster@microsoft.com) · Nathan Srebro (Toyota Technological Institute at Chicago, nati@ttic.edu)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** We provide new insights on eluder dimension, a complexity measure that has been extensively used to bound the regret of algorithms for online bandits and reinforcement learning with function approximation. First, we study the relationship between the eluder dimension of a function class and a generalized notion of rank, defined for any monotone activation $\sigma : \mathbb{R} \to \mathbb{R}$, which corresponds to the minimal dimension required to represent the class as a generalized linear model. It is known that when $\sigma$ has derivatives bounded away from 0, the $\sigma$-rank gives rise to an upper bound on eluder dimension for any function class; we show, however, that eluder dimension can be exponentially smaller than $\sigma$-rank. We also show that the condition on the derivative is necessary; namely, when $\sigma$ is the relu activation, the eluder dimension can be exponentially larger than $\sigma$-rank. For binary-valued function classes, we obtain a characterization of the eluder dimension in terms of star number and threshold dimension, quantities which are relevant in active learning and online learning respectively.

## 1 Introduction

Russo and Van Roy [37] introduced the notion of eluder dimension for a function class and used it to analyze algorithms (based on the Upper Confidence Bound (UCB) and Thompson Sampling paradigms) for the multi-armed bandit problem with function approximation. Since then, eluder dimension has been extensively used to construct and analyze the regret of algorithms for contextual bandits and reinforcement learning (RL) with function approximation [see, e.g., 46, 36, 44, 6, 14, 19, 12, 16, 26, 24]. Even though the eluder dimension has become a central technique for reinforcement learning theory, little is known about when exactly it is bounded. This paper makes progress toward filling this gap in our knowledge.

Russo and Van Roy [37] established upper bounds on eluder dimension for (i) function classes for which inputs have finite cardinality (the tabular setting), (ii) linear functions over $\mathbb{R}^d$ of bounded norm, and (iii) generalized linear functions over $\mathbb{R}^d$ of bounded norm, with any activation that has derivatives bounded away from 0. Apart from these function classes (and those that can be embedded into them), understanding of eluder dimension has been limited. Indeed, one might wonder whether a function class has small eluder dimension only if it can be realized as a class of (generalized) linear functions! This leads us to our first motivating question:

**Question 1.** Are all function classes with small eluder dimension essentially generalized linear models?

Answering this question has substantial ramifications for the scope of prior work. An answer of *yes* would imply that the results in the aforementioned work, which give regret guarantees in terms of eluder dimension, do not go beyond already-established regret guarantees for generalized linear bandit or RL settings [see, e.g., 17, 32, 45, 31]. An answer of *no* can be construed as a positive result for RL theory, as it would indicate that existing (and future) results which use the eluder dimension apply to a richer set of function classes than generalized linear models.
To answer Question 1, we first formally define what it means for a function class to be written as a generalized linear model (GLM). Informally, for an activation $\sigma : \mathbb{R} \to \mathbb{R}$ and a function class $F \subseteq (X \to \mathbb{R})$, we define the $\sigma$-rank to be the smallest dimension $d$ needed to express every function in $F$ as a generalized linear function in $\mathbb{R}^d$ with activation $\sigma$ (see Definition 3). Intuitively, the $\sigma$-rank captures the best possible upper bound on eluder dimension that the results from Russo and Van Roy [37] can give for a given $F$ by treating it as a GLM with activation $\sigma$. We ask how the eluder dimension of any function class relates to its $\sigma$-rank for various activations $\sigma$.

We show that the answer to Question 1 is indeed *no*: there exists a function class with eluder dimension $d$ whose $\sigma$-rank, for any monotone $\sigma$, is at least $\exp(\Omega(d))$. Thus, while Russo and Van Roy [37] show that the set of function classes with small eluder dimension is a superset of the set of GLM function classes, we (roughly speaking) show that the set of function classes with small eluder dimension is strictly larger than the set of GLM function classes.

We also prove that the requirement from Russo and Van Roy [37] that the derivative of the activation be bounded away from 0 is necessary in order to bound the eluder dimension of GLMs. The upper bound in the paper [37] becomes vacuous when the activation function has zero derivative; we show a lower bound which indicates that this requirement cannot be dropped. Namely, when $\sigma$ is the relu activation, we show that eluder dimension can be exponentially larger than $\sigma$-rank.

In a second line of inquiry, we study a combinatorial version of eluder dimension. The original definition of Russo and Van Roy [37] is stated for real-valued function classes, but one can specialize the definition to binary-valued function classes, leading to a so-called combinatorial eluder dimension. Thus, our second motivating question is:

**Question 2.** Can we bound the combinatorial eluder dimension, perhaps in terms of more familiar learning-theoretic quantities?

One might wonder: if the combinatorial eluder dimension is just a special case of the scale-sensitive version, why study it at all? Our reasons are threefold. The first and most immediate reason is that we are able to show new characterizations of eluder dimension once we move to the combinatorial definition. We elucidate a fundamental connection between eluder dimension and two other well-studied learning-theoretic quantities: (1) star number, a quantity that characterizes the label complexity of pool-based active learning [21], and (2) threshold dimension, a quantity that characterizes the regret of online learning [3].¹ We believe that this new result may help us better understand how different learning tasks relate to each other. The second reason is that the combinatorial eluder dimension (or a multi-class variant of it) has already been studied for policy-based learning for contextual bandits and RL [see, e.g., 19, 34]. Thus, understanding the combinatorial eluder dimension may shed light on the challenges of policy-based RL. Our last reason has a more philosophical bent. Historically, the discovery of VC dimension placed statistical learning theory on solid footing. Specifically, the fundamental theorem of statistical learning allows us to precisely characterize the statistical complexity of PAC learnability in terms of the combinatorial VC dimension.
The insights from understanding the role of VC dimension in classification have led researchers to develop scale-sensitive complexity measures such as the fat-shattering dimension to provide sharper guarantees on learning. While we do not claim that the eluder dimension is fundamental to online RL and bandit settings, we believe that a better combinatorial understanding can lead to a better understanding of online decision making. In some sense, we are working backwards from the original scale-sensitive definition of eluder dimension to understand its combinatorial properties.

¹ Finiteness of the threshold dimension is equivalent to finiteness of the Littlestone dimension [40, 22, 3].

### 1.1 Main contributions

In this work, we provide several results which show when eluder dimension can be bounded (or is unbounded). Our results can be separated into two categories: (1) those pertaining to the (scale-sensitive) eluder dimension (as originally defined by Russo and Van Roy [37]); (2) those pertaining to the combinatorial eluder dimension, defined for binary-valued function classes.

First, we investigate the relationship between eluder dimension and our notion of generalized rank, which captures the realizability of any function class as a GLM. In Section 2, we formally introduce the eluder dimension (as well as the related quantity of scale-sensitive star number). In Section 3, we formalize the notion of generalized rank. In Section 4, we provide several results.

1. We show that eluder dimension can be exponentially smaller than $\sigma$-rank for any monotone activation $\sigma$, not just those with derivatives bounded away from 0 (Theorem 6).
2. We show that the condition on the derivative being bounded away from 0 is necessary for $\sigma$-rank to be an upper bound on eluder dimension. Namely, when $\sigma$ is the relu activation, we show that eluder dimension can be exponentially larger than $\sigma$-rank (Theorem 7).²

In Section 5, we specialize the eluder dimension to the binary-valued setting and present our results on the combinatorial eluder dimension.

1. We show that eluder dimension is finite if and only if both star number and threshold dimension are finite. Specifically, in Theorem 8 we show the following:
$$\max\{\mathrm{Sdim}(F), \mathrm{Tdim}(F)\} \;\le\; \mathrm{Edim}(F) \;\le\; \exp\big(O\big(\max\{\mathrm{Sdim}(F), \mathrm{Tdim}(F)\}\big)\big).$$
Furthermore, we demonstrate that both inequalities can be tight (see Theorem 9 and the discussion above it).
2. We investigate the comparison between eluder dimension and sign-rk and prove stronger separations than in the scale-sensitive case; namely, we show examples where one quantity is finite while the other is infinite (Theorem 10).

### 1.2 Related work

In this section, we highlight several related works.

**Bounds on eluder dimension.** Several papers provide bounds on eluder dimension for various function classes. The original bounds on tabular, linear, and generalized linear functions were proved by Russo and Van Roy [37] (and later generalized by Osband and Van Roy [36]). Mou et al. [34] provide several bounds for the combinatorial eluder dimension, mostly focusing on linear function classes. When the function class lies in an RKHS, Huang et al. [25] show that the eluder dimension is equivalent to the notion of information gain [41, 38], which can be seen as an infinite-dimensional generalization of the fact that the eluder dimension for linear functions over $\mathbb{R}^d$ is $\Theta(d)$. In concurrent work, Dong et al. [12] also prove an exponential lower bound on the eluder dimension for ReLU networks.

**Applications of eluder dimension.**
The main application of the eluder dimension is to design algorithms and prove regret guarantees for contextual bandits and reinforcement learning. A few examples include the papers [46, 36, 44, 6, 14, 19, 28, 16, 26, 24, 34]. While the majority of papers prove upper bounds via eluder dimension, Foster et al. [19] provided lower bounds for contextual bandits in terms of eluder dimension, if one is hoping for instance-dependent regret bounds. In addition, several works observe that eluder dimension sometimes does not characterize the sample complexity, as the guarantee via eluder dimension can be too loose [23, 20]. Beyond the online RL setting, eluder dimension has been applied to risk-sensitive RL [15], Markov games [24, 29], representation learning [47], and active online learning [10].

**Other complexity measures for RL.** We also touch upon other complexity measures which have been suggested for RL. One category is the Bellman/Witness rank approach [see, e.g., 27, 11, 42], which is generalized to bilinear classes [13]. These complexity measures capture an interplay between the MDP dynamics and the function approximator class; in contrast, eluder dimension is purely a property of the function approximator class and can be stated (and studied) without referring to an online RL problem. Jin et al. [28] define a Bellman-Eluder dimension which captures function classes which have small Bellman rank or eluder dimension. Lastly, Foster et al. [20] propose a Decision-Estimation Coefficient and prove that it is necessary and sufficient for sample-efficient interactive learning.

² This result was independently established by Dong et al. [12].

**Notions of rank.** The notion of rank we propose is a generalization of the classical notion of sign rank, also called dimension complexity. Sign rank has been studied extensively in combinatorics, learning theory, and communication complexity [see, e.g., 1, 18, 5, 2, and references therein]. The norm requirements in our definition of $\sigma$-rank are related to the notion of margin complexity [5, 7, 30].

## 2 Eluder dimension and star number

Eluder dimension is a sequential notion of complexity for function classes, originally defined by Russo and Van Roy [37]. Informally speaking, it characterizes the longest sequence of adversarially chosen points one must observe in order to accurately estimate the function value at any other point. We consider a variant of the original definition, proposed by Foster et al. [19], that is never larger and is sufficient to analyze all the applications of eluder dimension in the literature.

**Definition 1.** For any function class $F \subseteq (X \to \mathbb{R})$, $f^\star \in F$, and scale $\varepsilon \ge 0$, the exact eluder dimension $\widetilde{\mathrm{Edim}}_{f^\star}(F, \varepsilon)$ is the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$:
$$|f_i(x_i) - f^\star(x_i)| > \varepsilon, \quad \text{and} \quad \sum_{j < i} \big(f_i(x_j) - f^\star(x_j)\big)^2 \le \varepsilon^2.$$
Then, for all $\varepsilon > 0$, the eluder dimension is $\mathrm{Edim}_{f^\star}(F, \varepsilon) := \sup_{\varepsilon' \ge \varepsilon} \widetilde{\mathrm{Edim}}_{f^\star}(F, \varepsilon')$. We also write $\widetilde{\mathrm{Edim}}(F, \varepsilon) := \sup_{f^\star \in F} \widetilde{\mathrm{Edim}}_{f^\star}(F, \varepsilon)$ and $\mathrm{Edim}(F, \varepsilon) := \sup_{f^\star \in F} \mathrm{Edim}_{f^\star}(F, \varepsilon)$.

This definition is never larger than the original definition of Russo and Van Roy [37], which asks for a witnessing pair of functions $f_i, f_i' \in F$ (the above restricts $f_i' = f^\star$). Hence, all lower bounds on our variant of eluder dimension immediately apply to the original definition. Moreover, all upper bounds on eluder dimension in this paper can also be shown to hold for the other definition (unless stated otherwise). We also consider the closely related notion of star number defined by Foster et al.
[19], which generalizes a combinatorial parameter introduced in the active learning literature by Hanneke and Yang [21] (we will denote it as Sdim for consistency). We study the combinatorial star number in more detail in Section 5. The only difference between the definitions of eluder dimension and star number is that the sum $\sum_{j < i}$ over previously chosen points is replaced by the sum $\sum_{j \ne i}$ over all other points.

**Definition 2.** For any function class $F \subseteq (X \to \mathbb{R})$, $f^\star \in F$, and scale $\varepsilon \ge 0$, the exact star number $\widetilde{\mathrm{Sdim}}_{f^\star}(F, \varepsilon)$ is the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$:
$$|f_i(x_i) - f^\star(x_i)| > \varepsilon, \quad \text{and} \quad \sum_{j \ne i} \big(f_i(x_j) - f^\star(x_j)\big)^2 \le \varepsilon^2.$$
Then, for all $\varepsilon > 0$, the star number is $\mathrm{Sdim}_{f^\star}(F, \varepsilon) := \sup_{\varepsilon' \ge \varepsilon} \widetilde{\mathrm{Sdim}}_{f^\star}(F, \varepsilon')$. We also write $\widetilde{\mathrm{Sdim}}(F, \varepsilon) := \sup_{f^\star \in F} \widetilde{\mathrm{Sdim}}_{f^\star}(F, \varepsilon)$ and $\mathrm{Sdim}(F, \varepsilon) := \sup_{f^\star \in F} \mathrm{Sdim}_{f^\star}(F, \varepsilon)$.

It immediately follows from these definitions that the star number is never larger than the eluder dimension. On the other hand, the star number can be arbitrarily smaller than the eluder dimension.

**Proposition 1.** For all $F$, $f^\star \in F$ and scale $\varepsilon \ge 0$, it holds that³
$$\mathrm{Sdim}_{f^\star}(F, \varepsilon) \;\le\; \mathrm{Edim}_{f^\star}(F, \varepsilon) \;\le\; \min\{|X|, |F| - 1\}.$$

**Proposition 2** (simplified from Foster et al. [19, Prop. 2.3]). For the class of threshold functions given as $F^{\mathrm{th}}_n := \{f_i : [n] \to \{0, 1\} \mid i \in [n+1]\}$, where $f_i(x) := \mathbb{1}\{x \ge i\}$, and for $f^\star = f_{n+1}$, it holds for all $\varepsilon < 1$ that $\mathrm{Sdim}_{f^\star}(F, \varepsilon) = 2$ and $\mathrm{Edim}_{f^\star}(F, \varepsilon) = n$.

## 3 Generalized rank

Dimension complexity has been studied extensively in combinatorics, learning theory, and communication complexity [see, e.g., 1, 18, 5, 2]. The classical notion of dimension complexity, also known as sign rank, corresponds to the smallest dimension required to embed the input space such that all hypotheses in the function class under consideration are realizable as halfspaces. We consider a generalized notion of rank that is specified for any particular activation $\sigma : \mathbb{R} \to \mathbb{R}$, and captures the smallest dimension required to represent the function class as a GLM when $\sigma$ is the activation. In what follows, we let $B_d(R) := \{x \in \mathbb{R}^d \mid \|x\|_2 \le R\}$.

**Definition 3.** For any $\sigma : \mathbb{R} \to \mathbb{R}$, the $\sigma$-rank of a function class $F \subseteq (X \to \mathbb{R})$ at scale $R > 0$, denoted $\sigma\text{-rk}(F, R)$, is the smallest dimension $d$ for which there exist mappings $\phi : X \to B_d(1)$ and $w : F \to B_d(R)$ such that⁴ for all $(x, f) \in X \times F$:
$$f(x) = \sigma\big(\langle w(f), \phi(x) \rangle\big), \quad (2)$$
or $\infty$ if no such $d$ exists. For a collection of activation functions $\Sigma \subseteq (\mathbb{R} \to \mathbb{R})$, the $\Sigma$-rank is $\Sigma\text{-rk}(F, R) := \min_{\sigma \in \Sigma} \sigma\text{-rk}(F, R)$.

**Examples.** We present some examples of $\Sigma\text{-rk}$ that motivate our definition above.

1. *Threshold activation.* $\mathrm{sign}(z)$ yields the classic notion of sign-rank (equivalent to dimension complexity, as already mentioned). In this case, the scale $R$ is irrelevant, so we denote $\mathrm{sign\text{-}rk}(F) := \mathrm{sign\text{-}rk}(F, R)$ for any $R$. Note that this quantity is meaningful only for $F \subseteq (X \to \{-1, 1\})$.
2. *Identity activation.* For $\mathrm{id}(z) := z$, $\mathrm{id\text{-}rk}(F, R)$ is the smallest dimension needed to represent each $f \in F$ as a (norm-bounded) linear function. We abbreviate $\mathrm{rk} := \mathrm{id\text{-}rk}$, as this corresponds to the standard notion of rank of the matrix $(f(x))_{x, f}$ (albeit with the additional norm constraint).
3. *Monotone activations.* For $L \ge \mu \ge 0$, $M^L_\mu$ consists of all activations $\sigma$ such that for all $z < z'$, it holds that $\mu \le \frac{\sigma(z') - \sigma(z)}{z' - z} \le L$ (for differentiable $\sigma$, this is equivalent to $\mu \le \sigma'(z) \le L$ for all $z \in \mathbb{R}$).⁵ An important special case is $\mu = 0$, and a particular $\sigma \in M^1_0$ of interest is the rectified linear unit (ReLU), defined as $\mathrm{relu}(z) := \max\{z, 0\}$. For ease of notation, we denote $M_\mu := M^1_\mu$. While we will always be explicit about the Lipschitz constant, note that the scale of the Lipschitz constant $L$ (and $\mu$) is interchangeable with the scale of $R$. In particular,
$$M^L_\mu\text{-rk}(F, R) = M_{\mu/L}\text{-rk}(F, RL). \quad (3)$$
4. *All activations.* $\Sigma_{\mathrm{all}}$ consists of all activations $\sigma$. We mention this notion of rank only in passing, and we will not focus on it for the rest of the paper.
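As a toy illustration of Definition 3 (ours, not from the paper), the sketch below writes the ±1-valued threshold class on $[n]$ as a GLM with the sign activation in dimension 2, i.e., its sign-rank is at most 2. The particular maps `phi` and `w` are illustrative choices, and, as noted in Example 1, the scale $R$ is irrelevant for the sign activation.

```python
# Toy sketch (ours, not from the paper): the +/-1-valued thresholds on [n] form a GLM with
# the sign activation in dimension 2, so their sign-rank is at most 2. The maps phi and w
# below are illustrative choices satisfying f_i(x) = sign(<w(f_i), phi(x)>).
import numpy as np

n = 10

def f(i, x):                 # threshold functions: f_i(x) = +1 if x >= i, else -1
    return 1 if x >= i else -1

def phi(x):                  # feature map X -> B_2(1)
    return np.array([x, 1.0]) / np.sqrt(n**2 + 1)

def w(i):                    # weight map F -> B_2(R) for a suitable scale R
    return np.array([1.0, 0.5 - i]) * np.sqrt(n**2 + 1)

assert all(np.sign(w(i) @ phi(x)) == f(i, x)
           for i in range(1, n + 2) for x in range(1, n + 1))
print("every threshold f_i is realized as sign(<w(f_i), phi(x)>) in dimension 2")
```

Note the contrast with Proposition 2: the same threshold class (in its 0/1-valued form) has eluder dimension $n$, so a small $\sigma$-rank for an activation with zero derivative, such as sign, does not by itself control the eluder dimension; Theorem 7 below makes this gap exponential for the relu activation.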
We present a result which relates the aforementioned quantities (proof in Appendix A).

³ For the definition of eluder dimension considered by [37], an upper bound of $\min\{|X|, \binom{|F|}{2}\}$ holds, which can be tight. This upper bound holds because the witnessing pair of functions $(f_i, f_i')$ has to be distinct for each $i$.

⁴ Note that only the product of the scales of $\phi$ and $w$ is relevant. The definition remains equivalent if we let $\phi : X \to B_d(R_\phi)$ and $w : F \to B_d(R_w)$ for any $R_\phi$ and $R_w$ such that $R = R_\phi \cdot R_w$.

⁵ To prove upper bounds on eluder dimension, it suffices for this condition to hold only when $|z| \le R$ [see, e.g., 37]. Since we fix $\sigma$ in our definition first and then consider $\sigma$-rank at different scales $R$, this weaker condition complicates our definitions. Note that at any specific scale $R$, we can always modify $\sigma$ to satisfy the required constraint everywhere by extending it linearly whenever $|z| > R$.

Figure 1: (a) Each arrow $M_1 \to M_2$ indicates that $M_1(F) \gtrsim M_2(F)$ for all $F$, where the dependence on $R$ and $\varepsilon$ is suppressed for clarity (see Propositions 1, 3, 4 for precise bounds). Whenever the arrow $M_2 \to M_1$ is missing, there is an example of a class $F$ where $M_1(F) \ll M_2(F)$. (b) An entry $F$ in cell $(M_1, M_2)$ means that $M_1(F) \ll M_2(F)$; green cells indicate that $M_1(F) \gtrsim M_2(F)$ for all $F$. (The quantities compared are rk, $M_\mu$-rk, $M_0$-rk, Edim, and Sdim.)

**Proposition 3.** For all $F \subseteq (X \to \mathbb{R})$, $R > 0$ and $\mu \in (0, 1]$, we have:
$$\mathrm{rk}(F, R) \;\ge\; M_\mu\text{-rk}(F, R) \;\ge\; M_0\text{-rk}(F, R) \;\ge\; \mathrm{sign\text{-}rk}(F) - 1,$$
where the last inequality is meaningful only for $F \subseteq (X \to \{-1, 1\})$. Moreover, for each inequality above, there exists a function class $F$ which exhibits an infinite gap between the two quantities.

## 4 Eluder dimension versus generalized rank

In this section, we compare eluder dimension and star number with each notion of generalized rank: rk, $M_\mu$-rk (for $\mu > 0$) and $M_0$-rk. Our results are summarized in Figure 1.

**Eluder vs. rk and $M_\mu$-rk.** Russo and Van Roy [37] and Osband and Van Roy [36] provided upper bounds on eluder dimension for linear and generalized linear function classes. For completeness, we restate this result, with a slight improvement, and include the proof with precise dependence on problem parameters in Appendix B. Intuitively, the generalized linear rank allows us to capture the tightest possible upper bound that the guarantees in the papers [37, 36] can provide on eluder dimension.

**Proposition 4** (cf. [37], Prop. 6, 7; [36], Prop. 2, 4). For any function class $F \subseteq (X \to \mathbb{R})$ and $\varepsilon > 0$:
(i) For all $R > 0$, $\mathrm{Edim}(F, \varepsilon) \le \mathrm{rk}(F, R) \cdot O\!\big(\log \tfrac{R}{\varepsilon}\big)$.
(ii) For all $L, \mu, R > 0$, $\mathrm{Edim}(F, \varepsilon) \le M^L_\mu\text{-rk}(F, R) \cdot O\!\big(\tfrac{L^2}{\mu^2} \log \tfrac{RL}{\varepsilon}\big)$.

This result has been used to prove upper bounds on the eluder dimension of various function classes beyond GLMs; for example, the class of bounded-degree polynomials, by taking the feature map $\phi(x)$ to be the vector of low-degree monomials (see the short sketch at the end of this subsection). The upper bound in Part (i) is in fact tight (up to constants) for the class of linear functions, as shown in Proposition 5 below. This trivially implies the optimality of the bound in Part (ii) up to the factor of $(L/\mu)^2$, which to the best of our knowledge is open.

**Proposition 5** ([33]). For any $R > 0$, $X := B_d(1)$ and $F := \{f_\theta : x \mapsto \langle \theta, x \rangle \mid \theta \in B_d(R)\}$, it holds that:
$$\mathrm{Edim}(F, \varepsilon) \ge \Omega\!\big(d \log \tfrac{R}{\varepsilon}\big).$$
For completeness, we include the proof in Appendix C.
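To illustrate the bounded-degree polynomial example, here is a small numerical sketch (ours, not from the paper; the helper `monomial_features` and the particular normalization are illustrative, and the norm bookkeeping needed to meet the $B_d(1)$/$B_d(R)$ constraints is omitted). It checks that a degree-2 polynomial over $\mathbb{R}^m$ is a linear function of the monomial feature map, so its id-rank, and hence its eluder dimension via Proposition 4(i), is controlled by the number of monomials, $1 + m + \binom{m+1}{2}$.

```python
# Sketch (ours): degree-2 polynomials over R^m are linear in the monomial feature map,
# so Proposition 4(i) bounds their eluder dimension through the id-rank (norm bookkeeping
# for the B_d(1)/B_d(R) constraints is omitted). `monomial_features` is an illustrative helper.
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(x, degree=2):
    """All monomials of x of degree at most `degree`, including the constant 1."""
    feats = [1.0]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            feats.append(float(np.prod([x[i] for i in idx])))
    return np.array(feats)

m = 3
rng = np.random.default_rng(0)
c, a = rng.normal(), rng.normal(size=m)
B = rng.normal(size=(m, m))
B = (B + B.T) / 2                                     # symmetric quadratic form

def f(x):                                             # a generic degree-2 polynomial
    return c + a @ x + x @ B @ x

# The same polynomial as a single linear functional <w, phi(x)> over monomial features.
w = np.array([c] + list(a) + [B[i, j] * (1 if i == j else 2)
                              for i, j in combinations_with_replacement(range(m), 2)])

x = rng.normal(size=m)
assert np.isclose(f(x), w @ monomial_features(x))
print(len(w))   # 1 + m + m(m+1)/2 = 10 for m = 3: the id-rank is at most this dimension
```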
**Eluder vs. $M_0$-rk.** It turns out that eluder dimension and $M_0$-rk are incomparable. That is, there exists a function class for which eluder dimension is exponentially smaller than $M_0$-rk (and hence $M_\mu$-rk and rk, by Proposition 3). Moreover, there exists a different function class for which eluder dimension (even the star number) is exponentially larger than relu-rk (and hence $M_0$-rk).

First, we show that the eluder dimension can be exponentially smaller than $M_0$-rk for the class of parities over $d$ bits. Thus, parities over $d$ bits exhibit an example where the eluder dimension is exponentially smaller than the best possible bound one can derive using the existing results of Proposition 4.

**Theorem 6.** For $X = \{-1, 1\}^d$ and $F_\oplus := \{f_S : x \mapsto \prod_{i \in S} x_i \mid S \subseteq [d]\}$, it holds that
(i) $M_0\text{-rk}(F_\oplus, R) \ge 2^{d/2} - 1$ for all $R > 0$.
(ii) $\mathrm{Sdim}(F_\oplus, \varepsilon) \le \mathrm{Edim}(F_\oplus, \varepsilon) \le d$ for all $\varepsilon \ge 0$.

*Proof.* Part (i). From Proposition 3, we have that $\sigma\text{-rk}(F_\oplus, R) \ge \mathrm{sign\text{-}rk}(F_\oplus) - 1$ for any $\sigma \in M_0$, and hence $M_0\text{-rk}(F_\oplus, R) \ge \mathrm{sign\text{-}rk}(F_\oplus) - 1$. The proof is now complete by noting the well-known result that $\mathrm{sign\text{-}rk}(F_\oplus) \ge 2^{d/2}$ [18].

Part (ii). For any $x \in \{-1, 1\}^d$, consider its 0-1 representation $\tilde{x} \in \mathbb{F}_2^d$ (representing $+1$ by $0$ and $-1$ by $1$). All functions in $F_\oplus$ can be simply viewed as linear functions over $\mathbb{F}_2$. Namely, any parity function is indexed by a vector $a \in \mathbb{F}_2^d$, with $f_a(x) := (-1)^{\langle a, \tilde{x} \rangle}$. Note that $\widetilde{\mathrm{Edim}}(F_\oplus, \varepsilon) = 0$ for all $\varepsilon \ge 2$. For any $\varepsilon < 2$, suppose $\widetilde{\mathrm{Edim}}_{f^\star}(F_\oplus, \varepsilon) = m$, witnessed by $(x_1, f_{a_1}), \ldots, (x_m, f_{a_m}) \in \{-1, 1\}^d \times F_\oplus$ and $f^\star = f_{a^\star}$. We have $f_{a_i}(x_i) \ne f_{a^\star}(x_i)$, and $f_{a_i}(x_j) = f_{a^\star}(x_j)$ for all $j < i$: since $\sum_{j < i} (f_{a_i}(x_j) - f_{a^\star}(x_j))^2 \le \varepsilon^2 < 4$ and each term is either $0$ or $4$, every term must be $0$. Over $\mathbb{F}_2$, this says that $\langle a_i - a^\star, \tilde{x}_i \rangle = 1$ while $\langle a_i - a^\star, \tilde{x}_j \rangle = 0$ for all $j < i$; hence $\tilde{x}_i$ cannot lie in the span of $\tilde{x}_1, \ldots, \tilde{x}_{i-1}$, so $\tilde{x}_1, \ldots, \tilde{x}_m$ are linearly independent over $\mathbb{F}_2$ and $m \le d$. □

**Theorem 7.** For any $R > 0$ and $X = B_d(1)$, define $F_{\mathrm{relu}} := \{f_{\theta, b} : x \mapsto \mathrm{relu}(\langle \theta, x \rangle + b) \mid \|\theta\|^2 + b^2 \le R^2\}$. It holds that
(i) $M_0\text{-rk}(F_{\mathrm{relu}}, R) \le \mathrm{relu\text{-}rk}(F_{\mathrm{relu}}, R) \le d + 1$,
(ii) $\mathrm{Edim}(F_{\mathrm{relu}}, \varepsilon) \ge \mathrm{Sdim}(F_{\mathrm{relu}}, \varepsilon) \ge \big(\tfrac{R}{4\varepsilon}\big)^{d/2}$ for all $\varepsilon \in (0, R/4)$.

*Proof.* Part (i) is immediate from the definition. We show Part (ii) in the special case of $R = 2$; the general case follows by rescaling $\varepsilon$, since relu is homogeneous, namely, $\mathrm{relu}(az) = a \cdot \mathrm{relu}(z)$ for $a \ge 0$. Consider any $U \subseteq X$ such that $\|u\| = 1$ and $\langle u, v \rangle \le 1 - \varepsilon$ for all distinct $u, v \in U$. It holds that $\mathrm{Sdim}_{f^\star}(F_{\mathrm{relu}}, \varepsilon) \ge |U|$ when $f^\star$ is the identically zero function, since the function $f_u(x) = \mathrm{relu}(\langle u, x \rangle - (1 - \varepsilon))$ is such that $f_u(v) = 0$ for all $v \in U \setminus \{u\}$, whereas $f_u(u) = \varepsilon$. A standard sphere-packing argument shows that such a set $U$ exists with $|U| \ge (1/2\varepsilon)^{d/2}$ for all $\varepsilon < 1/2$. In particular, the $\delta$-packing number of the unit sphere is at least $(1/\delta)^d$ [43, Cor. 4.2.13]. Thus, we can find $(1/\delta)^d$ points such that each pair $u, v$ satisfies $\|u - v\| \ge \delta$, or equivalently $\langle u, v \rangle \le 1 - \delta^2/2$. Setting $\delta = \sqrt{2\varepsilon}$ proves the claimed lower bound. □

Theorem 7 was independently shown by Dong et al. [12, Thm. 5.1]. We remark that while we considered the variant of eluder dimension as defined by Foster et al. [19], the lower bound on eluder dimension in Theorem 7 immediately holds for the notion defined by Russo and Van Roy [37]. On the other hand, the upper bound on eluder dimension in Theorem 6 can be shown to hold even with the definition of Russo and Van Roy [37] (by replacing every instance of $a^\star$ by $a_i'$).
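As a quick numerical sanity check on Proposition 2 and Theorem 6(ii), the following self-contained sketch (ours, not from the paper; names such as `exact_eluder_dim` are illustrative) brute-forces the exact eluder dimension of Definition 1 for small finite classes by depth-first search over witnessing sequences, and recovers $\mathrm{Edim} = n$ for thresholds and $\mathrm{Edim} \le d$ for parities.

```python
# Brute-force sketch (ours, not from the paper) of the exact eluder dimension in Definition 1
# for small finite classes; each function is stored as the tuple of its values on the domain.
from itertools import product
from math import prod

def exact_eluder_dim(F, f_star, eps):
    """Longest sequence (x_1, f_1), ..., (x_m, f_m) with |f_i(x_i) - f*(x_i)| > eps and
    sum_{j<i} (f_i(x_j) - f*(x_j))^2 <= eps^2, found by depth-first search."""
    n, best = len(f_star), 0
    def extend(chosen_pts):
        nonlocal best
        best = max(best, len(chosen_pts))
        for x, f in product(range(n), F):
            if (abs(f[x] - f_star[x]) > eps
                    and sum((f[xj] - f_star[xj]) ** 2 for xj in chosen_pts) <= eps ** 2):
                extend(chosen_pts + [x])
    extend([])
    return best

# Thresholds on [n] (Proposition 2): f_i(x) = 1{x >= i}; the eluder dimension equals n.
n = 5
thresholds = [tuple(1 if x >= i else 0 for x in range(1, n + 1)) for i in range(1, n + 2)]
print(max(exact_eluder_dim(thresholds, f, 0.5) for f in thresholds))   # -> 5  (= n)

# Parities over d bits (Theorem 6(ii)): the eluder dimension is at most d.
d = 3
points = list(product([-1, 1], repeat=d))
parities = [tuple(prod(x[i] for i in range(d) if S[i]) for x in points)
            for S in product([0, 1], repeat=d)]
print(max(exact_eluder_dim(parities, f, 1.0) for f in parities))       # -> 3  (= d)
```

Only the previously chosen points need to be tracked in the search, since the constraint on $f_i$ involves only $f^\star$ and the earlier points; swapping the sum over $j < i$ for a sum over $j \ne i$ would give the analogous check for the star number of Definition 2.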
## 5 Relationships between combinatorial measures

In this section, we specialize the eluder dimension to binary-valued function classes and prove several additional characterizations that relate this combinatorial eluder dimension to other learning-theoretic quantities. First, we (re)define these learning-theoretic quantities. The first two definitions (of combinatorial eluder dimension and star number, respectively) are the specializations of the scale-sensitive versions (cf. Definitions 1 and 2) to the binary-valued function setting. We abuse notation by dropping the argument $\varepsilon$ from the previous definitions in order to be consistent.

**Definition 4.** Fix any function class $F \subseteq (X \to \{-1, 1\})$ and any $f^\star \in F$.
- The combinatorial eluder dimension w.r.t. $f^\star$, denoted $\mathrm{Edim}_{f^\star}(F)$, is defined as the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$: $f_i(x_i) \ne f^\star(x_i)$, and $f_i(x_j) = f^\star(x_j)$ for all $j < i$.
- The star number w.r.t. $f^\star$, denoted $\mathrm{Sdim}_{f^\star}(F)$, is defined as the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$: $f_i(x_i) \ne f^\star(x_i)$, and $f_i(x_j) = f^\star(x_j)$ for all $j \ne i$.
- The threshold dimension w.r.t. $f^\star$, denoted $\mathrm{Tdim}_{f^\star}(F)$, is defined as the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$: $f_i(x_k) \ne f^\star(x_k)$ for all $k \ge i$, and $f_i(x_j) = f^\star(x_j)$ for all $j < i$.

As before, we define the combinatorial eluder dimension (resp. star number and threshold dimension) to be $\mathrm{Edim}(F) := \sup_{f^\star \in F} \mathrm{Edim}_{f^\star}(F)$ ($\mathrm{Sdim}(F)$ and $\mathrm{Tdim}(F)$ are defined similarly).

Figure 2: Illustration of witnessing sequences of length 6 for eluder dimension, star number and threshold dimension with respect to $f^\star = 0$ (we use the 0/1 representation of functions for clarity). With rows indexed by $f_i$ and columns by $x_j$, the eluder witness has 1s on the diagonal, 0s below, and free values ($*$, either 0 or 1) above; the star witness is the identity pattern; the threshold witness has 1s on and above the diagonal and 0s below.

Let us pause to unpack these definitions and give some background. In fact, the star number definition stated above is the original definition of Hanneke and Yang [21], who give tight upper and lower bounds on the label complexity of pool-based active learning via the star number $\mathrm{Sdim}(F)$ and show that almost every previously proposed complexity measure for active learning takes a worst-case value equal to the star number. Roughly speaking, the star number corresponds to the number of singletons one can embed in a function class; that is, the maximum number of functions that differ from a base function $f^\star$ at exactly one point among a subset of the domain $\{x_1, \ldots, x_m\} \subseteq X$. The threshold dimension has recently gained attention due to its role in proving an equivalence between private PAC learning and online learning [see, e.g., 3, 9]. We slightly generalize the definition of Alon et al. [3] to allow for any base function $f^\star$, in the spirit of the other two definitions. A classical result in model theory provides a link between the threshold dimension and the Littlestone dimension (which we denote Ldim), a quantity which is both necessary and sufficient for online learnability [8, 4]. In particular, results by Shelah [40] and Hodges et al. [22] show that for any binary-valued $F$ and any $f^\star \in F$:
$$\log \mathrm{Tdim}_{f^\star}(F) \;\le\; \mathrm{Ldim}(F) \;\le\; 2^{\mathrm{Tdim}_{f^\star}(F)}.$$
A combinatorial proof of this fact can be found in Thm. 3 of Alon et al. [3].⁶ Thus, finiteness of threshold dimension is necessary and sufficient for online learnability (albeit in a much weaker, qualitative sense).

### 5.1 A qualitative equivalence

Now, we prove a rather surprising characterization that closely ties all three quantities in Definition 4 together. Our result implies that for any binary-valued function class, finiteness of the combinatorial eluder dimension is equivalent to finiteness of both star number and threshold dimension.

**Theorem 8.** For any function class $F \subseteq (X \to \{-1, 1\})$ and any $f^\star \in F$, the following holds:
$$\max\{\mathrm{Sdim}_{f^\star}(F), \mathrm{Tdim}_{f^\star}(F)\} \;\le\; \mathrm{Edim}_{f^\star}(F) \;\le\; 4^{\max\{\mathrm{Sdim}_{f^\star}(F),\, \mathrm{Tdim}_{f^\star}(F)\}}.$$

The proof of Theorem 8 can be found in Appendix D. The lower bound is trivial by examining Definition 4. To prove the upper bound, we rely on a novel connection to Ramsey theory.
In particular, we show that sequences $(x_1, f_1), \ldots, (x_m, f_m)$ which witness $\mathrm{Edim}_{f^\star} = m$ are in bijection with red-blue colorings of the complete graph $K_m$, while subsequences of the witnessing eluder sequence that are valid star number witnesses or threshold dimension witnesses can be interpreted as monochromatic colorings of subgraphs of $K_m$ (see Figure 2). Thus, applying classical bounds from Ramsey theory implies the result.

Theorem 8 has an exponential gap between the upper and lower bounds. Can we improve either of the inequalities? The lower bound cannot be improved, by considering the simple examples over $X = [n]$ of the singleton class $F^{\mathrm{sing}}_n := \{x \mapsto \mathbb{1}\{x = i\} \mid i \in [n+1]\}$ and the threshold class $F^{\mathrm{th}}_n := \{x \mapsto \mathbb{1}\{x \ge i\} \mid i \in [n+1]\}$. We also show that the upper bound cannot be improved, for example, to $\mathrm{Edim}(F) \le \mathrm{poly}(\mathrm{Sdim}(F), \mathrm{Tdim}(F))$ in general.

**Theorem 9.** For every $N > 0$, there exists a function class $F_N$ such that $\mathrm{Edim}(F_N) = N$ and $\max\{\mathrm{Sdim}(F_N), \mathrm{Tdim}(F_N)\} < c \log_2 N$, where $c > 1/2$ is some absolute numerical constant.

The proof of Theorem 9 can be found in Appendix E. It relies on the probabilistic method to show the existence of a randomly constructed $F$ which satisfies the desired properties.

### 5.2 Comparisons with sign-rk

In this section, we investigate the comparison of these combinatorial quantities (eluder, star, threshold) with sign-rk, and examine whether we can prove stronger separations. One direction is clear; for the class of linear classifiers in $\mathbb{R}^d$, the combinatorial eluder dimension, star number, and threshold dimension are all infinite. However, we ask if the other separation is also possible: can we construct $F$ where eluder/star/threshold are finite but $\mathrm{sign\text{-}rk} = \infty$? We have already provided an explicit exponential separation: Theorem 6 shows that for the function class $F_\oplus$, we had $\mathrm{Edim}(F_\oplus) = \mathrm{Sdim}(F_\oplus) = d$ but $\mathrm{sign\text{-}rk}(F_\oplus) \ge 2^{d/2}$. (One can also show that $\mathrm{Tdim}(F_\oplus) = d$.) We are able to show stronger (but nonconstructive) separations by extending the probabilistic techniques of Alon et al. [2], who recently provided similar separations for VC dimension versus sign-rk. In view of Proposition 3, this result also provides a separation for the scale-sensitive Definition 1.

**Theorem 10.** For every $N > 0$, there exists a function class $F_N \subseteq ([N] \to \{-1, 1\})$ such that $\mathrm{Edim}_{\mathbf{1}}(F_N) = 4$ and $\mathrm{sign\text{-}rk}(F_N) \ge \Omega(N^{1/9} / \log N)$, where $\mathbf{1}$ is shorthand for the all-1s function.

The proof of Theorem 10 can be found in Appendix F. It is straightforward to replace the reference function $f^\star(x) = 1$ with any fixed reference function $f^\star : [N] \to \{-1, 1\}$. First, we use Lemma 22 of Alon et al. [2], which bounds the number of distinct matrices with $\mathrm{sign\text{-}rk} = r$; then, using a probabilistic argument, we show that there must be many (more) matrices with $\mathrm{Edim}_{\mathbf{1}} = 4$, so at least one of them must have large sign-rk. The careful reader might notice that we do not prove the existence of a function class where $\mathrm{Edim}(F)$ is constant and the sign-rk is infinite; instead, we prove the weaker statement that there exists a function class whose eluder dimension with respect to a fixed reference function $f^\star$ is bounded. We conjecture that the stronger statement holds; see Appendix F.1 for more details.

⁶ Alon et al. [3] prove the result for $f^\star(x) = 1$, but it is easy to extend their proof to hold for any $f^\star$.

### 5.3 Back to scale-sensitive?

It is natural to ask if our results can be extended back to the scale-sensitive definitions. First, we require a scale-sensitive version of threshold dimension. One proposal is the following:

**Definition 5.**
For any function class $F \subseteq (X \to \mathbb{R})$, $f^\star \in F$, and scale $\varepsilon \ge 0$, the exact threshold dimension $\widetilde{\mathrm{Tdim}}_{f^\star}(F, \varepsilon)$ is the largest $m$ such that there exists $(x_1, f_1), \ldots, (x_m, f_m) \in X \times F$ satisfying, for all $i \in [m]$:
$$|f_i(x_j) - f^\star(x_j)| > \varepsilon \ \text{ for all } j \ge i, \quad \text{and} \quad \sum_{j < i} \big(f_i(x_j) - f^\star(x_j)\big)^2 \le \varepsilon^2.$$
Then, for all $\varepsilon > 0$, the threshold dimension is $\mathrm{Tdim}_{f^\star}(F, \varepsilon) := \sup_{\varepsilon' \ge \varepsilon} \widetilde{\mathrm{Tdim}}_{f^\star}(F, \varepsilon')$. We also write $\widetilde{\mathrm{Tdim}}(F, \varepsilon) := \sup_{f^\star \in F} \widetilde{\mathrm{Tdim}}_{f^\star}(F, \varepsilon)$ and $\mathrm{Tdim}(F, \varepsilon) := \sup_{f^\star \in F} \mathrm{Tdim}_{f^\star}(F, \varepsilon)$.

Definition 5 mirrors the scale-sensitive definitions for eluder dimension and star number; it also recovers the combinatorial definition when $F$ is binary-valued. By definition, the relationship $\mathrm{Edim}_{f^\star}(F, \varepsilon) \ge \max\{\mathrm{Sdim}_{f^\star}(F, \varepsilon), \mathrm{Tdim}_{f^\star}(F, \varepsilon)\}$ for every $F$, $f^\star \in F$, $\varepsilon > 0$ is trivial. However, one cannot hope to prove the corresponding upper bound under this definition. For example, for any $\varepsilon > 0$, take the function class which is represented by the $N \times N$ matrix
$$f_i(x_j) = \begin{cases} 0 & \text{if } i < j, \\ \varepsilon & \text{if } i = j, \\ 0.99\,\varepsilon & \text{if } i > j. \end{cases}$$
It is easy to see that $\mathrm{Edim}_0(F, \varepsilon) = N$, while $\mathrm{Sdim}_0(F, \varepsilon) = 2$ and $\mathrm{Tdim}_0(F, \varepsilon) = 1$. Notice that this class is still "threshold-like", but Definition 5 does not capture this for said value of $\varepsilon$. Generally speaking, it is unclear how to carry over the intuition from Ramsey theory that applies in the combinatorial case to the scale-sensitive case; we leave this to future work.

## Acknowledgments and Disclosure of Funding

We thank Gaurav Mahajan for allowing us to include the proof of Proposition 5 [33]. We thank Akshay Krishnamurthy, Tengyu Ma, and Ruosong Wang for helpful discussions. GL was partially supported by NSF award IIS-1764032. PK was partially supported by NSF BIGDATA award 1546500. Part of this work was done while GL, PK, and DF were participating in the Simons program on the Theoretical Foundations of Reinforcement Learning.

## References

[1] N. Alon, P. Frankl, and V. Rödl. Geometrical realization of set systems and probabilistic communication complexity. In 26th Annual Symposium on Foundations of Computer Science, pages 277–280. IEEE Computer Society, 1985. URL https://doi.org/10.1109/SFCS.1985.30.

[2] N. Alon, S. Moran, and A. Yehudayoff. Sign rank versus VC dimension. In 29th Annual Conference on Learning Theory, volume 49 of PMLR, pages 47–80, 2016. URL http://proceedings.mlr.press/v49/alon16.html.

[3] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), pages 852–860, New York, NY, USA, 2019. Association for Computing Machinery. URL https://doi.org/10.1145/3313276.3316312.

[4] N. Alon, O. Ben-Eliezer, Y. Dagan, S. Moran, M. Naor, and E. Yogev. Adversarial laws of large numbers and optimal regret in online classification. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 447–455, 2021.

[5] R. I. Arriaga and S. S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. Mach. Learn., 63(2):161–182, 2006. URL https://doi.org/10.1007/s10994-006-6265-7.

[6] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforcement learning with value-targeted regression. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of PMLR, pages 463–474, 2020. URL http://proceedings.mlr.press/v119/ayoub20a.html.

[7] S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3(Nov):441–461, 2002.

[8] S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

[9] M. Bun, R. Livni, and S. Moran.
An equivalence between private classification and online prediction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 389–402. IEEE, 2020.

[10] Y. Chen, H. Luo, T. Ma, and C. Zhang. Active online learning with hidden shifting domains. In International Conference on Artificial Intelligence and Statistics, pages 2053–2061. PMLR, 2021.

[11] K. Dong, J. Peng, Y. Wang, and Y. Zhou. Root-n-regret for learning in Markov decision processes with function approximation and low Bellman rank. In Conference on Learning Theory, pages 1554–1557. PMLR, 2020.

[12] K. Dong, J. Yang, and T. Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021. URL https://arxiv.org/abs/2102.04168.

[13] S. Du, S. Kakade, J. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, pages 2826–2836. PMLR, 2021.

[14] S. S. Du, J. D. Lee, G. Mahajan, and R. Wang. Agnostic Q-learning with function approximation in deterministic systems: Near-optimal bounds on approximation error and sample complexity. In Advances in Neural Information Processing Systems, volume 33, pages 22327–22337. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/fd5c905bcd8c3348ad1b35d7231ee2b1-Paper.pdf.

[15] Y. Fei, Z. Yang, and Z. Wang. Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, pages 3198–3207. PMLR, 2021.

[16] F. Feng, W. Yin, A. Agarwal, and L. Yang. Provably correct optimization and exploration with non-linear policies. In International Conference on Machine Learning, pages 3263–3273. PMLR, 2021.

[17] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/c2626d850c80ea07e7511bbae4c76f4b-Paper.pdf.

[18] J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. J. Comput. Syst. Sci., 65(4):612–625, 2002. URL https://doi.org/10.1016/S0022-0000(02)00019-3.

[19] D. J. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020. URL https://arxiv.org/abs/2010.03104.

[20] D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.

[21] S. Hanneke and L. Yang. Minimax analysis of active learning. Journal of Machine Learning Research, 16(109):3487–3602, 2015. URL http://jmlr.org/papers/v16/hanneke15a.html.

[22] W. Hodges et al. A shorter model theory. Cambridge University Press, 1997.

[23] B. Huang, K. Huang, S. Kakade, J. D. Lee, Q. Lei, R. Wang, and J. Yang. Going beyond linear RL: Sample efficient neural function approximation. Advances in Neural Information Processing Systems, 34, 2021.

[24] B. Huang, J. D. Lee, Z. Wang, and Z. Yang. Towards general function approximation in zero-sum Markov games. arXiv preprint arXiv:2107.14702, 2021.

[25] K. Huang, S. M. Kakade, J. D. Lee, and Q. Lei. A short note on the relationship of information gain and eluder dimension. arXiv preprint arXiv:2107.02377, 2021.

[26] H. Ishfaq, Q. Cui, V. Nguyen, A. Ayoub, Z.
Yang, Z. Wang, D. Precup, and L. Yang. Randomized exploration in reinforcement learning with general value function approximation. In International Conference on Machine Learning, pages 4607–4616. PMLR, 2021.

[27] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1704–1713, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/jiang17c.html.

[28] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00815, 2021. URL https://arxiv.org/abs/2102.00815.

[29] C. Jin, Q. Liu, and T. Yu. The power of exploiter: Provable multi-agent RL in large state spaces. arXiv preprint arXiv:2106.03352, 2021.

[30] P. Kamath, O. Montasser, and N. Srebro. Approximate is good enough: Probabilistic variants of dimensional and margin complexity. In Conference on Learning Theory, pages 2236–2262. PMLR, 2020.

[31] B. Kveton, M. Zaheer, C. Szepesvari, L. Li, M. Ghavamzadeh, and C. Boutilier. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 2066–2076. PMLR, 2020.

[32] L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2071–2080. PMLR, 06–11 Aug 2017. URL http://proceedings.mlr.press/v70/li17c.html.

[33] G. Mahajan and S. Lovett. Personal communication, 2021.

[34] W. Mou, Z. Wen, and X. Chen. On the sample complexity of reinforcement learning with policy space generalization. arXiv preprint arXiv:2008.07353, 2020.

[35] D. Mubayi and A. Suk. A survey of quantitative bounds for hypergraph Ramsey problems. arXiv preprint arXiv:1707.04229, 2017.

[36] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/1141938ba2c2b13f5505d7c424ebae5f-Paper.pdf.

[37] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/41bfd20a38bb1b0bec75acf0845530a7-Paper.pdf.

[38] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.

[39] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014. URL https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/.

[40] S. Shelah. Classification theory: and the number of non-isomorphic models. Elsevier, 1990.

[41] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[42] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches.
In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.

[43] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018. URL https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html.

[44] R. Wang, R. R. Salakhutdinov, and L. Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. In Advances in Neural Information Processing Systems, volume 33, pages 6123–6135. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/440924c5948e05070663f88e69e8242b-Paper.pdf.

[45] Y. Wang, R. Wang, S. S. Du, and A. Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.

[46] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/bad5f33780c42f2588878a9d07405083-Paper.pdf.

[47] Z. Xu and A. Tewari. Representation learning beyond linear prediction functions. Advances in Neural Information Processing Systems, 34, 2021.