# On the Complexity of Adversarial Decision Making

Dylan J. Foster (dylanfoster@microsoft.com), Alexander Rakhlin (rakhlin@mit.edu), Ayush Sekhari (sekhari@mit.edu), Karthik Sridharan (ks999@cornell.edu)

A central problem in online learning and decision making, from bandits to reinforcement learning, is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show, via new upper and lower bounds, that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. [17] in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results, both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy [47] and the Exploration-by-Optimization objective of Lattimore and György [32].

1 Introduction

To reliably deploy data-driven decision making methods in real-world systems where safety is critical, such methods should satisfy two desiderata: (i) provable robustness in the face of dynamic or even adversarial environments, and (ii) the ability to effectively take advantage of problem structure as modeled by the practitioner. In high-dimensional problems, this entails efficiently generalizing across states and actions while delicately exploring new decisions. For decision making in static, stochastic environments, recent years have seen extensive investigation into optimal sample complexity and algorithm design principles, and the foundations are beginning to take shape. With an emphasis on reinforcement learning, a burgeoning body of research identifies specific modeling assumptions under which sample-efficient interactive decision making is possible [12, 54, 20, 39, 5, 27, 14, 36, 13, 56], as well as general structural conditions that aim to unify these assumptions [45, 19, 51, 53, 15, 21, 17]. For dynamic or adversarial settings, however, comparatively little is known outside of (i) positive results for special cases such as adversarial bandit problems [4, 3, 18, 10, 1, 7, 26, 16, 8, 29], and (ii) a handful of negative results suggesting that online reinforcement learning in agnostic or adversarial settings can actually be statistically intractable [48, 37]. These developments raise the following questions: (a) What are the underlying phenomena that govern the statistical complexity of decision making in adversarial settings? (b) What are the corresponding algorithmic design principles that attain optimal statistical complexity?

Contributions. We consider an adversarial variant of the Decision Making with Structured Observations (DMSO) framework introduced in Foster et al.
[17], where a learner or decision-maker interacts with a sequence of models (reward distributions in the case of bandits, or MDPs in the case of reinforcement learning) chosen by an adaptive adversary, and aims to minimize regret against the best decision in hindsight. Models are assumed to belong to a known model class, which reflects the learner's prior knowledge about the problem. The main question we investigate is:

How does the structure of the model class determine the minimax regret for adversarial decision making?

We show:

1. For any model class, one can obtain high-probability regret bounds that scale with a convexified version of the Decision-Estimation Coefficient (DEC), a complexity measure introduced by Foster et al. [17].
2. For any algorithm with reasonable tail behavior, the optimal regret for adversarial decision making is lower bounded by (a suitably localized version of) the convexified DEC.

In the process of obtaining these results, we draw new connections to several existing complexity measures.

1.1 Problem Setting

We adopt an adversarial variant of the DMSO framework of Foster et al. [17] consisting of $T$ rounds, where at each round $t = 1, \ldots, T$:

1. The learner selects a decision $\pi^{(t)} \in \Pi$, where $\Pi$ is the decision space.
2. Nature selects a model $M^{(t)} \in \mathcal{M}$, where $\mathcal{M}$ is a model class.
3. The learner receives a reward $r^{(t)} \in \mathcal{R} \subseteq \mathbb{R}$ and observation $o^{(t)} \in \mathcal{O}$ sampled via $(r^{(t)}, o^{(t)}) \sim M^{(t)}(\pi^{(t)})$, where $\mathcal{O}$ is the observation space.

We abbreviate $z^{(t)} := (r^{(t)}, o^{(t)})$ and $\mathcal{Z} := \mathcal{R} \times \mathcal{O}$. Here, each model $M = M(\cdot, \cdot \mid \cdot) \in \mathcal{M}$ is a conditional distribution $M : \Pi \to \Delta(\mathcal{R} \times \mathcal{O})$ that maps the learner's decision to a distribution over rewards and observations. This setting subsumes (adversarial) bandit problems, where models correspond to reward functions (or distributions), as well as adversarial reinforcement learning, where models correspond to Markov decision processes (MDPs). In both cases, the model class $\mathcal{M}$ encodes prior knowledge about the decision making problem, such as structure of rewards or dynamics (e.g., linearity or convexity). The model class might be parameterized by linear models, neural networks, or other rich function approximators depending on the problem domain.

For a model $M \in \mathcal{M}$, $\mathbb{E}^{M,\pi}[\cdot]$ denotes expectation under the process $(r, o) \sim M(\pi)$. We define $f^{M}(\pi) := \mathbb{E}^{M,\pi}[r]$ as the mean reward function and $\pi_M := \arg\max_{\pi \in \Pi} f^{M}(\pi)$ as the decision with greatest reward for $M$. We let $\mathcal{F}_{\mathcal{M}} = \{f^{M} \mid M \in \mathcal{M}\}$ denote the induced class of reward functions. We measure performance via regret to the best fixed decision in hindsight:¹

$\mathrm{Reg}_{\mathrm{DM}} := \sup_{\pi \in \Pi} \sum_{t=1}^{T} \mathbb{E}_{\pi^{(t)} \sim p^{(t)}}\big[f^{M^{(t)}}(\pi) - f^{M^{(t)}}(\pi^{(t)})\big],$

where $p^{(t)}$ denotes the learner's distribution over decisions at round $t$. This formulation, in which models are selected by a potentially adaptive adversary, generalizes Foster et al. [17], who considered a stochastic setting where $M^{(t)} = M^{\star}$ is fixed across all rounds. Examples include:

Adversarial bandits. With no observations ($\mathcal{O} = \{\emptyset\}$), the adversarial DMSO framework is equivalent to the adversarial bandit problem with structured rewards. In this context, $\pi^{(t)}$ is typically referred to as an action or arm and $\Pi$ is referred to as the action space. The most basic example here is the adversarial finite-armed bandit problem with $A$ actions [4, 3, 18], where $\Pi = \{1, \ldots, A\}$ and $\mathcal{F}_{\mathcal{M}} = \mathbb{R}^{A}$. Other well-studied examples include adversarial linear bandits [10, 1, 7], bandit convex optimization [26, 16, 8, 29], and nonparametric bandits [26, 6, 38].²
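To make the protocol concrete, here is a minimal sketch of the interaction loop above, specialized to a toy Bernoulli bandit class. The class and function names (`BernoulliBanditModel`, `run_protocol`) and the particular learner/adversary are our own illustrative choices, not part of the formal framework or any codebase.

```python
# A minimal sketch of the DMSO interaction protocol, specialized to a Bernoulli
# bandit class (trivial observation space). Illustrative names only.
import numpy as np


class BernoulliBanditModel:
    """A model M: decision -> distribution over (reward, observation)."""

    def __init__(self, means):
        self.means = np.asarray(means, dtype=float)  # f^M(pi) for each decision pi

    def sample(self, pi, rng):
        r = float(rng.random() < self.means[pi])     # Bernoulli reward in {0, 1}
        o = None                                     # trivial observation
        return r, o


def run_protocol(learner, adversary, T, seed=0):
    """One run of the T-round protocol; returns the regret Reg_DM (in mean rewards)."""
    rng = np.random.default_rng(seed)
    history, mean_rows = [], []
    for t in range(T):
        p_t = learner(history)                 # learner's distribution over decisions
        pi_t = rng.choice(len(p_t), p=p_t)     # learner samples pi^(t) ~ p^(t)
        M_t = adversary(history)               # nature picks M^(t) (may depend on history)
        r_t, o_t = M_t.sample(pi_t, rng)       # (r^(t), o^(t)) ~ M^(t)(pi^(t))
        history.append((pi_t, r_t, o_t))
        mean_rows.append(M_t.means)            # mean rewards, kept for regret bookkeeping
    mean_rows = np.array(mean_rows)            # shape (T, |Pi|)
    chosen = np.array([h[0] for h in history])
    best_fixed = mean_rows.sum(axis=0).max()   # best fixed decision in hindsight
    return best_fixed - mean_rows[np.arange(T), chosen].sum()


# Example: a uniform learner against an adversary that alternates between two models.
models = [BernoulliBanditModel([0.9, 0.1]), BernoulliBanditModel([0.1, 0.9])]
uniform = lambda hist: np.ones(2) / 2
alternate = lambda hist: models[len(hist) % 2]
print(run_protocol(uniform, alternate, T=1000))
```

The adversary here is allowed to depend on the history, matching the adaptive-adversary setting; the stochastic setting of Foster et al. [17] corresponds to an adversary that ignores the history and returns the same model every round.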
Reinforcement learning. The adversarial DMSO framework encompasses finite-horizon, episodic online reinforcement learning, with each round $t$ corresponding to a single episode: $\pi^{(t)}$ is a policy (a mapping from states to actions) to play in the episode, $r^{(t)}$ is the cumulative reward in the episode, and the observation $o^{(t)}$ is the episode's trajectory (the sequence of observed states, actions, and rewards). Online reinforcement learning in the stochastic setting where $M^{(t)} = M^{\star}$ is fixed has received extensive attention [19, 51, 20, 53, 15, 21, 17], but the adversarial setting we study has received less investigation. Examples include the adversarial MDP problem, where an adversary chooses a sequence of tabular MDPs, which is known to be intractable [37], and the easier problem in which there is a fixed (known) MDP but rewards are adversarial [40, 57, 41, 22]. See Appendix D for more details.

We refer to Appendix B for additional measure-theoretic details and background, and to Foster et al. [17] for further examples and detailed discussion.³

¹The results in this paper immediately extend to the regret $\sup_{\pi^{\star} \in \Pi} \sum_{t=1}^{T} r^{(t)}(\pi^{\star}) - r^{(t)}(\pi^{(t)})$ through standard tail bounds.

²Typically, these examples are formulated with deterministic rewards, which we encompass by restricting models in $\mathcal{M}$ to be deterministic. Our formulation is more general and allows for, e.g., semi-stochastic adversaries.

Understanding statistical complexity (i.e., minimax regret) for the DMSO setting at this level of generality is a challenging problem. Even if one restricts only to bandit-type problems with no observations, any complexity measure must capture the role of structural assumptions such as convexity or smoothness in determining the optimal rates. To go beyond bandit problems and handle the general setting, one must accommodate problems with rich, structured feedback such as reinforcement learning, where observations (as well as subtle features of the noise distribution) can reveal information about the underlying model.

1.2 Overview of Results

For a model class $\mathcal{M}$, reference model $\bar{M} \in \mathcal{M}$, and scale parameter $\gamma > 0$, the Decision-Estimation Coefficient [17] is defined via

$\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M}) = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}} \mathbb{E}_{\pi \sim p}\big[f^{M}(\pi_M) - f^{M}(\pi) - \gamma \cdot D^{2}_{\mathrm{H}}\big(M(\pi), \bar{M}(\pi)\big)\big],$

where we recall that for probability measures $\mathbb{P}$ and $\mathbb{Q}$ with a common dominating measure $\nu$, the (squared) Hellinger distance is given by

$D^{2}_{\mathrm{H}}(\mathbb{P}, \mathbb{Q}) = \int \Big(\sqrt{\tfrac{d\mathbb{P}}{d\nu}} - \sqrt{\tfrac{d\mathbb{Q}}{d\nu}}\Big)^{2} d\nu.$

We define $\mathrm{dec}_{\gamma}(\mathcal{M}) = \sup_{\bar{M} \in \mathcal{M}} \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})$, and let $\mathrm{co}(\mathcal{M})$ denote the convex hull of $\mathcal{M}$, which can be viewed as the set of all mixtures of models in $\mathcal{M}$. Our main results show that the convexified Decision-Estimation Coefficient, $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$, leads to upper and lower bounds on the optimal regret for adversarial decision making.

Theorem (informal). For any model class $\mathcal{M}$, Algorithm 1 ensures that with high probability,

$\mathrm{Reg}_{\mathrm{DM}} \lesssim \mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \cdot T, \quad (4)$

where $\gamma$ satisfies the balance $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \asymp \frac{\gamma}{T} \log|\Pi|$. Moreover, for any algorithm with reasonable tail behavior (Section 2.2), regret must scale with a localized version of the same quantity. As a consequence, there exists an algorithm for which $\mathbb{E}[\mathrm{Reg}_{\mathrm{DM}}] \leq o(T)$ if and only if $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \lesssim \gamma^{-\delta}$ for some $\delta > 0$.

For the stochastic version of our setting, Foster et al. [17] give upper and lower bounds that scale with $\mathrm{dec}_{\gamma}(\mathcal{M})$, without convexifying (under appropriate technical assumptions; cf. Section 2.3). Hence, our results show that in general, the gap in optimal regret for stochastic and adversarial decision making (or, "price of adversarial outcomes") is governed by the behavior of the DEC under convexification.
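As a concrete illustration of the definition above, the Decision-Estimation Coefficient of a very small class can be approximated by brute force, replacing the infimum over $\Delta(\Pi)$ with a grid. The sketch below does this for a two-armed Bernoulli bandit class and also approximates $\mathrm{dec}_\gamma(\mathrm{co}(\mathcal{M}))$ by discretizing the mixture weights; the helper names and the grid-based approximation are our own, and the output is a coarse numerical estimate rather than an exact evaluation.

```python
# Brute-force approximation of dec_gamma for a tiny Bernoulli bandit class,
# discretizing Delta(Pi) with a grid. Hypothetical illustration only.
import itertools
import numpy as np


def hellinger_sq(p_mean, q_mean):
    """Squared Hellinger distance between Bernoulli(p_mean) and Bernoulli(q_mean)."""
    p_mean, q_mean = np.asarray(p_mean), np.asarray(q_mean)
    return (np.sqrt(p_mean) - np.sqrt(q_mean)) ** 2 + (
        np.sqrt(1 - p_mean) - np.sqrt(1 - q_mean)) ** 2


def dec_gamma(models, ref, gamma, grid=21):
    """Approximate dec_gamma(models, ref): inf over a grid of p in Delta(Pi) of
    sup over M of E_{pi~p}[ f^M(pi_M) - f^M(pi) - gamma * D_H^2(M(pi), ref(pi)) ]."""
    models = np.asarray(models, dtype=float)   # shape (num_models, num_decisions)
    ref = np.asarray(ref, dtype=float)
    A = models.shape[1]
    ticks = np.linspace(0, 1, grid)
    simplex = [np.array(w) for w in itertools.product(ticks, repeat=A)
               if abs(sum(w) - 1.0) < 1e-9]    # coarse grid over Delta(Pi)
    best = np.inf
    for p in simplex:
        worst = -np.inf
        for f in models:
            gap = f.max() - f                  # f^M(pi_M) - f^M(pi), per decision
            info = hellinger_sq(f, ref)        # D_H^2(M(pi), ref(pi)), per decision
            worst = max(worst, float(p @ (gap - gamma * info)))
        best = min(best, worst)
    return best


# Two-armed Bernoulli class: each row gives the mean rewards of one model.
M = [[0.6, 0.4], [0.4, 0.6]]
print(dec_gamma(M, ref=[0.5, 0.5], gamma=10.0))

# Approximate the convexified class co(M): a mixture of Bernoulli reward
# distributions is again Bernoulli, with the mixed mean.
weights = np.linspace(0, 1, 21)
co_M = [w * np.array(M[0]) + (1 - w) * np.array(M[1]) for w in weights]
print(max(dec_gamma(co_M, ref=r, gamma=10.0) for r in co_M))  # ~ dec_gamma(co(M))
```

For Bernoulli reward models, a mixture of models at a fixed arm is again a Bernoulli distribution with the mixed mean, which is why the convex hull can be parameterized by mixed mean vectors here; for richer observation structures such as MDPs, elements of $\mathrm{co}(\mathcal{M})$ are genuinely new models.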
For example, multi-armed bandits, linear bandits, and convex bandits correspond to convex model classes (where $\mathrm{co}(\mathcal{M}) = \mathcal{M}$), which gives a post-hoc explanation for why these problems are tractable in the adversarial setting. Finite state/action Markov decision processes do not correspond to a convex model class, and have $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$ exponentially large compared to $\mathrm{dec}_{\gamma}(\mathcal{M})$; in this case, our results recover lower bounds of Liu et al. [37].

Beyond these results, we prove that the convexified Decision-Estimation Coefficient is equivalent to:

1. a parameterized variant of the generalized Information Ratio of Lattimore and György [32].
2. a novel high-probability variant of the Exploration-by-Optimization objective of Lattimore and Szepesvári [35], Lattimore and György [32].

³We mention in passing that the upper bounds in this paper encompass the more general setting where rewards are not observed by the learner (i.e., $z^{(t)}$ does not contain the reward), thus subsuming the partial monitoring problem. Our lower bounds, however, require that rewards are observed. See Appendix A.

Our techniques. On the lower bound side, we strengthen the approach from Foster et al. [17] with an improved change-of-measure argument (leading to improved results even in the stochastic setting), and combine this with the simple idea of constructing adversaries based on static mixture models. On the upper bound side, we extend the powerful Exploration-by-Optimization machinery of Lattimore and György [32] to the DMSO setting, and give a novel high-probability variant of the technique which leads to regret bounds for adaptive adversaries. We show that the performance of this method is controlled by a complexity measure whose value is equivalent to the convexified DEC, as well as to a parameterized variant of the Information Ratio (we present results in terms of the former to draw comparison to the stochastic setting). Overall, our results draw heavily on the work of Foster et al. [17] and Lattimore and György [32], but we believe they play a valuable role in bridging these lines of research and formalizing connections.

Organization. Section 2 presents our main results, including upper and lower bounds on regret and a characterization of learnability. In Section 3, we provide new structural results connecting the DEC to Exploration-by-Optimization and the Information Ratio. We close with a discussion of future directions (Section 4). Additional comparison to related work is deferred to Appendix A. The appendix also contains proofs and additional results, including examples (Appendix D) and further structural results (Appendix E).

2 Main Results

We now present our main results. First, using a new high-probability variant of the Exploration-by-Optimization technique [35, 32], we provide an upper bound on regret based on the (convexified) Decision-Estimation Coefficient (Section 2.1). Next, we present a lower bound that scales with a localized version of the same quantity (Section 2.2). Finally, we use these results to give a characterization for learnability (Section 2.3), and discuss the gap between stochastic and adversarial decision making. To keep presentation as simple as possible, we make the following assumption.

Assumption 2.1. The decision space satisfies $|\Pi| < \infty$, and we have $\mathcal{R} = [0, 1]$.

This assumption only serves to facilitate the use of the minimax theorem, and we expect that our results can be generalized (e.g., with covering numbers as in Section 3.4 of Foster et al. [17]).
2.1 Upper Bound

In this section we give regret bounds for adversarial decision making based on the (convexified) Decision-Estimation Coefficient. A priori, it is not obvious why the DEC should bear any relevance to the adversarial setting we consider: the algorithms and regret bounds based on the DEC that Foster et al. [17] introduce for the stochastic setting rely heavily on the ability to estimate a static underlying model, yet in the adversarial setting, the learner may only interact with each model a single time. This renders any sort of global estimation (e.g., for dynamics of an MDP) impossible. In spite of this difficulty, we show that regret bounds can be achieved by building on the Exploration-by-Optimization technique of Lattimore and Szepesvári [35], Lattimore and György [32], which provides an elegant approach to estimating rewards that exploits the structure of the model class under consideration.

Exploration-by-Optimization, introduced by Lattimore and Szepesvári [35] and substantially expanded in Lattimore and György [32], can be thought of as a generalization of the classical EXP3 algorithm [4] for finite-action bandits, which applies the exponential weights method for full-information online learning to a sequence of unbiased importance-weighted estimators for rewards. While EXP3 is near-optimal for finite-action bandits, it is unsuitable for general model classes because the reward estimators the algorithm uses do not exploit the structure of the decision space. Consequently, the regret scales with $|\Pi|$ rather than with, e.g., dimension, as one might hope for problems like linear bandits. The idea behind Exploration-by-Optimization is to solve an optimization problem at each round to search for a (potentially biased) reward estimator and modified sampling distribution that better exploit the structure of the model class $\mathcal{M}$, leading to information sharing and improved regret. Lattimore and György [32] showed that for a general partial monitoring setting (cf. Appendix A), the expected regret for this method, and for a more general family of algorithms based on Bregman divergences, is bounded by a generalization of the Information Ratio of Russo and Van Roy [46, 47].

Our development builds on that of Lattimore and György [32], but we pursue high-probability guarantees rather than in-expectation guarantees. This allows us to provide regret bounds that hold for adaptive adversaries, rather than the oblivious adversaries considered in prior work.⁴ Beyond this basic motivation, our interest in high-probability guarantees comes from the lower bound in the sequel (Section 2.2), which shows that the convexified Decision-Estimation Coefficient lower bounds regret for algorithms with reasonable tail behavior. To develop high-probability regret bounds and complement this lower bound, we use a novel variant of the Exploration-by-Optimization objective and a specialized analysis that goes beyond the Bregman divergence framework. Our algorithm, ExO+, is displayed in Algorithm 1.
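Before turning to Algorithm 1 itself, here is a minimal sketch of the EXP3-style baseline described above: exponential weights applied to importance-weighted reward estimators with the naive estimation function $g(\pi; \pi', z) = r \cdot \mathbb{1}\{\pi = \pi'\}$. This is not ExO+, only the classical special case it generalizes, and the variable names and learning-rate choice are our own.

```python
# Minimal sketch of the EXP3-style baseline: exponential weights over naive
# importance-weighted reward estimators. Illustrative only; not ExO+.
import numpy as np


def exp3(sample_reward, num_decisions, T, eta, seed=0):
    """sample_reward(t, pi) returns the (adversarial) reward of decision pi at round t."""
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(num_decisions)            # running sum of estimated rewards
    total_reward = 0.0
    for t in range(T):
        # Exponential weights distribution (computed stably via a max shift).
        logits = eta * (cum_est - cum_est.max())
        q = np.exp(logits) / np.exp(logits).sum()
        pi = rng.choice(num_decisions, p=q)      # sample a decision
        r = sample_reward(t, pi)
        total_reward += r
        # Importance-weighted estimator: unbiased, but heavy-tailed when q(pi) is small.
        est = np.zeros(num_decisions)
        est[pi] = r / q[pi]
        cum_est += est
    return total_reward


# Example: two decisions whose rewards swap halfway through the horizon.
T = 2000
rewards = lambda t, pi: float(pi == 0) if t < T // 2 else float(pi == 1)
best_fixed = max(sum(rewards(t, a) for t in range(T)) for a in range(2))
collected = exp3(rewards, num_decisions=2, T=T, eta=np.sqrt(np.log(2) / (2 * T)))
print(best_fixed - collected)    # realized regret against the best fixed decision
```

Because the estimation function above ignores the structure of the class, the resulting guarantees degrade with $|\Pi|$; ExO+ instead searches for a better estimation function and sampling distribution at every round.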
Algorithm 1: High-Probability Exploration-by-Optimization (ExO+)
1: parameters: Learning rate $\eta > 0$.
2: for $t = 1, 2, \ldots, T$ do
3: Define $q^{(t)} \in \Delta(\Pi)$ via the exponential weights update:
$q^{(t)}(\pi) = \frac{\exp\big(\eta \sum_{i=1}^{t-1} \hat{f}^{(i)}(\pi)\big)}{\sum_{\pi' \in \Pi} \exp\big(\eta \sum_{i=1}^{t-1} \hat{f}^{(i)}(\pi')\big)}. \quad (5)$
4: Solve the high-probability exploration-by-optimization objective: // See Eq. (8)
$(p^{(t)}, g^{(t)}) \leftarrow \arg\min_{p \in \Delta(\Pi),\, g \in \mathcal{G}} \sup_{M \in \mathcal{M},\, \pi^{\star} \in \Pi} \Gamma_{q^{(t)}, \eta}(p, g; \pi^{\star}, M). \quad (6)$
5: Sample decision $\pi^{(t)} \sim p^{(t)}$ and observe $z^{(t)} = (r^{(t)}, o^{(t)})$.
6: Form the reward estimator:
$\hat{f}^{(t)}(\pi) = \frac{g^{(t)}(\pi; \pi^{(t)}, z^{(t)})}{p^{(t)}(\pi^{(t)})}. \quad (7)$

At each round $t$, the algorithm computes a reference distribution $q^{(t)} \in \Delta(\Pi)$ by applying the standard exponential weights update (with learning rate $\eta > 0$) to the sequence of reward estimators $\hat{f}^{(1)}, \ldots, \hat{f}^{(t-1)}$ from previous rounds (Line 3). For the main step (Line 4), the algorithm obtains a sampling distribution $p^{(t)} \in \Delta(\Pi)$ and an estimation function $g^{(t)} \in \mathcal{G} := \{g : \Pi \times \Pi \times \mathcal{Z} \to \mathbb{R}\}$ by solving a minimax optimization problem based on a new objective we term high-probability exploration-by-optimization. Defining

$\Gamma_{q,\eta}(p, g; \pi^{\star}, M) := \mathbb{E}_{\pi \sim p}\big[f^{M}(\pi^{\star}) - f^{M}(\pi)\big] + \frac{1}{\eta}\, \mathbb{E}_{\pi \sim p,\, z \sim M(\pi)}\, \mathbb{E}_{\pi_0 \sim q}\Big[e^{\eta\, p(\pi)^{-1}\left(g(\pi_0; \pi, z) - g(\pi^{\star}; \pi, z)\right)} - 1\Big], \quad (8)$

the algorithm solves

$(p^{(t)}, g^{(t)}) \leftarrow \arg\min_{p \in \Delta(\Pi),\, g \in \mathcal{G}} \sup_{M \in \mathcal{M},\, \pi^{\star} \in \Pi} \Gamma_{q^{(t)}, \eta}(p, g; \pi^{\star}, M). \quad (9)$

Finally (Lines 5 and 6), the algorithm samples $\pi^{(t)} \sim p^{(t)}$, observes $z^{(t)} = (r^{(t)}, o^{(t)})$, and then forms an importance-weighted reward estimator via $\hat{f}^{(t)}(\pi) := g^{(t)}(\pi; \pi^{(t)}, z^{(t)}) / p^{(t)}(\pi^{(t)})$.

The interpretation of the high-probability Exploration-by-Optimization objective (8) is as follows: for a given round $t$, the model $M \in \mathcal{M}$ and decision $\pi^{\star} \in \Pi$ should be thought of as proxies for the true model $M^{(t)}$ and the optimal decision, respectively. By solving the minimax problem in (9), the min-player aims, in the face of an unknown, worst-case model, to find a sampling distribution that minimizes instantaneous regret yet ensures good tail behavior for the importance-weighted estimator $g(\cdot; \pi, z)/p(\pi)$. Tail behavior is captured by the moment generating function-like term in (8), which penalizes the learner for over-estimating rewards under the reference distribution $q$ or under-estimating rewards under $\pi^{\star}$.

⁴In general, in-expectation regret bounds do not imply high-probability bounds. For example, in adversarial bandits, the EXP3 algorithm can experience linear regret with constant probability [34].

We show that this approach leads to a bound on regret that scales with the convexified DEC.

Theorem 2.1 (Main upper bound). For any choice of $\eta > 0$, Algorithm 1 ensures that for all $\delta > 0$, with probability at least $1 - \delta$,

$\mathrm{Reg}_{\mathrm{DM}} \leq \mathrm{dec}_{(8\eta)^{-1}}(\mathrm{co}(\mathcal{M})) \cdot T + 2\eta^{-1} \log(|\Pi|/\delta). \quad (10)$

In particular, for any $\delta > 0$, with an appropriate choice of $\eta$, the algorithm ensures that with probability at least $1 - \delta$,

$\mathrm{Reg}_{\mathrm{DM}} \leq O(1) \cdot \inf_{\gamma > 0}\big\{\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \cdot T + \gamma \log(|\Pi|/\delta)\big\}. \quad (11)$

This regret bound holds for arbitrary, potentially adaptive adversaries. The result should be compared to the upper bound for the stochastic setting in Foster et al. [17] (e.g., Theorem 3.3), which takes a similar form, but scales with the weaker quantity $\sup_{\bar{M} \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})$.⁵ See Appendix A for comparison to Lattimore and Szepesvári [35], Lattimore and György [32].

⁵If a proper estimation algorithm (i.e., an algorithm producing estimators that lie in $\mathcal{M}$) is available, Foster et al. [17] (Theorem 4.1) give tighter bounds scaling with $\mathrm{dec}_{\gamma}(\mathcal{M})$.

Equivalence of Exploration-by-Optimization and the Decision-Estimation Coefficient. We now discuss a deeper connection between Exploration-by-Optimization and the DEC. Define the minimax value of the high-probability Exploration-by-Optimization objective via

$\mathrm{exo}_{\eta}(\mathcal{M}, q) := \inf_{p \in \Delta(\Pi),\, g \in \mathcal{G}} \sup_{M \in \mathcal{M},\, \pi^{\star} \in \Pi} \Gamma_{q, \eta}(p, g; \pi^{\star}, M), \quad (12)$

and let $\mathrm{exo}_{\eta}(\mathcal{M}) := \sup_{q \in \Delta(\Pi)} \mathrm{exo}_{\eta}(\mathcal{M}, q)$. This quantity can be interpreted as a complexity measure for $\mathcal{M}$ whose value reflects the difficulty of exploration. The following structural result (Corollary 3.1 in Section 3), which is critical to the proof of Theorem 2.1, shows that this complexity measure is equivalent to the convexified Decision-Estimation Coefficient:

$\mathrm{dec}_{(4\eta)^{-1}}(\mathrm{co}(\mathcal{M})) \leq \mathrm{exo}_{\eta}(\mathcal{M}) \leq \mathrm{dec}_{(8\eta)^{-1}}(\mathrm{co}(\mathcal{M})), \quad \forall \eta > 0. \quad (13)$

As we show, the regret of Algorithm 1 is controlled by the value of $\mathrm{exo}_{\eta}(\mathcal{M})$, and thus Theorem 2.1 follows.
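To get a feel for the bound in (11), the following sketch plugs in a hypothetical DEC profile $\mathrm{dec}_\gamma(\mathrm{co}(\mathcal{M})) = C/\gamma$ (the form that arises for multi-armed bandits with $C = A$; see Section 2.3) and minimizes the right-hand side of (11) over $\gamma$ numerically. The closed-form value $2\sqrt{CT\log(|\Pi|/\delta)}$ used for comparison is elementary calculus on this hypothetical profile, not a statement taken from the paper.

```python
# Numerically balance the two terms of the bound (11) for a hypothetical DEC
# profile dec_gamma(co(M)) = C / gamma, and compare with the closed form
# 2 * sqrt(C * T * log(|Pi| / delta)) obtained by optimizing over gamma by hand.
import numpy as np

C, T, num_decisions, delta = 10.0, 10_000, 10, 0.01
log_term = np.log(num_decisions / delta)

gammas = np.logspace(-2, 6, 2000)
bound = (C / gammas) * T + gammas * log_term     # dec_gamma * T + gamma * log(|Pi|/delta)

print(bound.min(), 2 * np.sqrt(C * T * log_term))            # agree up to grid resolution
print(gammas[bound.argmin()], np.sqrt(C * T / log_term))     # balancing value of gamma
```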
In the process of proving (13), we also establish equivalence of the Exploration-by-Optimization objective and a parameterized version of the Information Ratio, which is of independent interest (cf. Section 3). Both results build on, but go beyond, the Bregman divergence-based framework in Lattimore and György [32], and exploit a somewhat obscure connection between Hellinger distance and the moment generating function (MGF) for the logarithmic loss. In particular, we use a technical lemma (proven in Appendix C), which shows that up to constants, the Hellinger distance between two probability distributions can be expressed as a variational problem based on the associated MGFs.

Lemma 2.1. Let $\mathbb{P}$ and $\mathbb{Q}$ be probability distributions over a measurable space $(\mathcal{X}, \mathscr{F})$. Then $D^{2}_{\mathrm{H}}(\mathbb{P}, \mathbb{Q})$ is equivalent, up to absolute constants, to the value of a variational problem of the form $\sup_{g : \mathcal{X} \to \mathbb{R}}\{\cdots\}$ in which the objective is built from the moment generating functions of the test function $g$ under $\mathbb{P}$ and $\mathbb{Q}$ (Eq. (14); see Appendix C).

The lower inequality in Lemma 2.1 is proven using a trick similar to one used by Zhang [55] to prove high-probability bounds for maximum likelihood estimation based on Hellinger distance. To prove the upper bound in (13), we apply the lower inequality in (14) with the test function $g$ taking the role of the estimation function in the Exploration-by-Optimization objective.

Further remarks. The main focus of this work is statistical complexity (in particular, minimax regret), and the runtime and memory requirements of Algorithm 1, which are linear in $|\Pi|$, are not practical for large decision spaces. Improving the computational efficiency is an interesting question for future work. We mention in passing that Theorem 2.1 answers a question raised by Foster et al. [17] of obtaining, in the frequentist setting, a regret bound matching the Bayesian regret bound in their Theorem 3.6.

2.2 Lower Bound

We now complement the regret bound in the prequel with a lower bound based on the convexified DEC. Our most general result shows that for any algorithm, either the expected regret or its (one-sided) second moment must scale with a localized version of the convexified DEC. To state the result, we define the localized model class around a model $\bar{M}$ via

$\mathcal{M}_{\varepsilon}(\bar{M}) := \big\{M \in \mathcal{M} : f^{M}(\pi_M) \geq f^{\bar{M}}(\pi_{\bar{M}}) - \varepsilon\big\},$

and define $\mathrm{dec}_{\gamma, \varepsilon}(\mathcal{M}) := \sup_{\bar{M} \in \mathcal{M}} \mathrm{dec}_{\gamma}(\mathcal{M}_{\varepsilon}(\bar{M}), \bar{M})$ as the localized Decision-Estimation Coefficient. We let $(x)_{+} := \max\{x, 0\}$ and define $V(\mathcal{M}) := \sup_{M, M' \in \mathcal{M}} \sup_{\pi \in \Pi} \sup_{A} \frac{M(A \mid \pi)}{M'(A \mid \pi)} \vee e$, where the last supremum ranges over measurable subsets $A$ of $\mathcal{R} \times \mathcal{O}$;⁶ finiteness of $V(\mathcal{M})$ is not necessary, but removes a $\log(T)$ factor from Theorem 2.2.

Theorem 2.2 (Main lower bound). Let $C(T) := c \log(T \cdot V(\mathcal{M}))$ for a sufficiently large numerical constant $c > 0$, and set $\varepsilon_{\gamma} := \frac{\gamma}{4 C(T) T}$. For any algorithm, there exists an oblivious adversary for which

$\mathbb{E}[\mathrm{Reg}_{\mathrm{DM}}] + \sqrt{\mathbb{E}\big[\big((\mathrm{Reg}_{\mathrm{DM}})_{+}\big)^{2}\big]} \gtrsim \mathrm{dec}_{\gamma, \varepsilon_{\gamma}}(\mathrm{co}(\mathcal{M})) \cdot T - O(T^{1/2}). \quad (15)$

Theorem 2.2 implies that for any algorithm (such as Algorithm 1) with tail behavior beyond what is granted by control of the first moment, the regret in Theorem 2.1 cannot be substantially improved. In more detail, consider the notion of a sub-Chebychev algorithm.

Definition 2.1 (Sub-Chebychev Algorithm). A regret minimization algorithm is said to be sub-Chebychev with parameter $R$ if for all $t > 0$,

$\mathbb{P}\big((\mathrm{Reg}_{\mathrm{DM}})_{+} \geq t\big) \leq R^{2}/t^{2}. \quad (16)$

For sub-Chebychev algorithms, both the mean and (root) second moment of regret are bounded by the parameter $R$ (cf. Appendix F.4), which has the following consequence.

Corollary 2.1. Any regret minimization algorithm with sub-Chebychev parameter $R > 0$ must have

$R \geq \widetilde{\Omega}(1) \cdot \sup_{\gamma > 0}\big\{\mathrm{dec}_{\gamma, \varepsilon_{\gamma}}(\mathrm{co}(\mathcal{M})) \cdot T\big\} - O(T^{1/2}). \quad (17)$
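As a concrete reading of Definition 2.1, the sketch below takes Monte Carlo samples of a made-up regret distribution and computes the smallest parameter $R$ consistent with the empirical tail, $\hat{R} = \sup_t t\sqrt{\hat{\mathbb{P}}((\mathrm{Reg}_{\mathrm{DM}})_+ \geq t)}$. The data and the estimator are our own illustration, not a construction from the paper.

```python
# Illustration of Definition 2.1 with simulated data: the smallest sub-Chebychev
# parameter consistent with an empirical regret distribution is
#   R_hat = sup_t t * sqrt( P_hat( (Reg_DM)_+ >= t ) ).
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical positive-part regret samples (stand-in for Monte Carlo runs of an algorithm).
regret = rng.standard_gamma(2.0, size=100_000) * 50.0

ts = np.linspace(1.0, regret.max(), 500)
tail = np.array([(regret >= t).mean() for t in ts])   # empirical tail probabilities
R_hat = float((ts * np.sqrt(tail)).max())

# A Chebychev-type tail with parameter R_hat forces E[(Reg)_+] <= 2 * R_hat,
# consistent with the remark that R controls the first moment (cf. Appendix F.4).
print(R_hat, regret.mean())
```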
To interpret this result, suppose for simplicity that $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$ and $\mathrm{dec}_{\gamma, \varepsilon_{\gamma}}(\mathrm{co}(\mathcal{M}))$ are continuous with respect to $\gamma > 0$, and that $\mathrm{dec}_{\gamma, \varepsilon_{\gamma}}(\mathrm{co}(\mathcal{M})) \gtrsim \gamma^{-1}$, which is satisfied for non-trivial classes.⁷ In this case, it follows from Theorem 2.1 (cf. Proposition F.2 for details) that by setting $\delta = 1/T^{2}$, Algorithm 1 is sub-Chebychev with parameter

$\inf_{\gamma > 0}\big\{\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \cdot T + \gamma \log|\Pi|\big\} = \widetilde{O}\big(\mathrm{dec}_{\gamma_{\mathrm{u}}}(\mathrm{co}(\mathcal{M})) \cdot T\big), \quad (18)$

where $\gamma_{\mathrm{u}}$ satisfies the balance $\mathrm{dec}_{\gamma_{\mathrm{u}}}(\mathrm{co}(\mathcal{M})) \asymp \frac{\gamma_{\mathrm{u}}}{T} \log|\Pi|$. On the other hand, the lower bound in (17) can be shown to scale with

$\mathrm{dec}_{\gamma_{\star}, \varepsilon_{\gamma_{\star}}}(\mathrm{co}(\mathcal{M})) \cdot T, \quad (19)$

where $\gamma_{\star}$ satisfies the balance $\mathrm{dec}_{\gamma_{\star}, \varepsilon_{\gamma_{\star}}}(\mathrm{co}(\mathcal{M})) \asymp \frac{\gamma_{\star}}{T}$. We conclude that the upper bound from Theorem 2.1 cannot be improved beyond (i) localization and (ii) dependence on $\log|\Pi|$. As an example, we show in Appendix D.3 that for the multi-armed bandit problem with $\Pi = \{1, \ldots, A\}$, the upper bound in (18) yields $R = \widetilde{O}(\sqrt{AT \log A})$, while the lower bound in (19) yields $R = \Omega(\sqrt{AT})$. See Appendix D for additional examples which further illustrate the scaling in the upper and lower bounds.

⁶Recall (Appendix B) that $M(\cdot, \cdot \mid \pi)$ is the conditional distribution given $\pi$.

⁷The dominant term $\mathrm{dec}_{\gamma, \varepsilon_{\gamma}}(\mathrm{co}(\mathcal{M})) \cdot T$ in (15) scales as $\Omega(T^{1/2})$ for any class that is non-trivial in the sense that it embeds the two-armed bandit problem, so that the $O(T^{1/2})$ term can be discarded.

The dependence on $\log|\Pi|$ cannot be removed from the upper bound or made to appear in the lower bound in general (cf. Section 3.5 of Foster et al. [17]). As shown in Foster et al. [17], localization is inconsequential for most model classes commonly studied in the literature. The same is true for the examples we consider here (Appendix D), where Theorem 2.2 leads to the correct rate up to small polynomial factors. However, improving the upper bound to achieve localization, which Foster et al. [17] show is possible in the stochastic setting, is an interesting future direction. See Appendix A for further discussion and for comparison to a related lower bound in Lattimore [31].

Why convexity? At this point, a natural question is why the convex hull $\mathrm{co}(\mathcal{M})$ plays a fundamental role in the adversarial setting. For the lower bound, the intuition is simple: given a model class $\mathcal{M}$, the adversary can pick any mixture distribution $\mu \in \Delta(\mathcal{M})$, then choose the sequence of models $M^{(1)}, \ldots, M^{(T)}$ by sampling $M^{(t)} \sim \mu$ independently at each round. This is equivalent to playing a static mixture model $M^{\star} = \mathbb{E}_{M \sim \mu}[M] \in \mathrm{co}(\mathcal{M})$, which is what allows us to prove a lower bound based on the DEC for the set $\mathrm{co}(\mathcal{M})$ of all such models. In view of the fact that the lower bound is obtained through this static (and stochastic) adversary, we believe the more surprising result here is that good behavior of the convexified DEC is also sufficient for low regret for fully adversarial decision making.
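The reduction in the previous paragraph is easy to state in code: sampling $M^{(t)}$ i.i.d. from a mixture $\mu$ is, from the learner's perspective, the same as facing the single mixture model $\mathbb{E}_{M\sim\mu}[M]$ in every round. The sketch below (hypothetical names, Bernoulli bandit models) compares the reward distributions of the two constructions.

```python
# The static mixture adversary from the "Why convexity?" discussion: redrawing
# M^(t) ~ mu each round is indistinguishable from playing the single mixture
# model E_{M~mu}[M] in co(M). Illustrated for Bernoulli bandit models.
import numpy as np

rng = np.random.default_rng(0)
models = np.array([[0.9, 0.1],       # mean rewards of model 1
                   [0.2, 0.8]])      # mean rewards of model 2
mu = np.array([0.3, 0.7])            # mixing distribution over the two models

def adversary_iid_mixture(pi, n):
    """Adversary that redraws M^(t) ~ mu each round and emits a Bernoulli reward."""
    idx = rng.choice(len(mu), size=n, p=mu)
    return (rng.random(n) < models[idx, pi]).astype(float)

def static_mixture_model(pi, n):
    """The single model M* = E_{M~mu}[M]: Bernoulli with the mixed mean."""
    return (rng.random(n) < mu @ models[:, pi]).astype(float)

n = 200_000
for pi in (0, 1):
    print(pi, adversary_iid_mixture(pi, n).mean(), static_mixture_model(pi, n).mean())
# The per-decision reward distributions coincide (up to sampling error), and rounds
# are drawn independently, so no learner can distinguish the two adversaries.
```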
2.3 Learnability and Comparison to Stochastic Setting

Building on the upper and lower bounds in the prequel, we give a characterization for learnability (i.e., when non-trivial regret is possible) for adversarial decision making. This extends the learnability characterization for the stochastic setting in Foster et al. [17], and follows a long tradition in learning theory [52, 2, 49, 44, 11]. To state the result, we define the minimax regret for a model class $\mathcal{M}$ as

$\mathfrak{M}(\mathcal{M}, T) = \inf_{p^{(1)}, \ldots, p^{(T)}} \sup_{M^{(1)}, \ldots, M^{(T)}} \mathbb{E}[\mathrm{Reg}_{\mathrm{DM}}],$

where $p^{(t)} : (\Pi \times \mathcal{Z})^{t-1} \to \Delta(\Pi)$ and $M^{(t)} : (\Pi \times \mathcal{Z})^{t-1} \to \mathcal{M}$ are policies for the learner and adversary, respectively. Our characterization is as follows.

Theorem 2.3. Suppose there exists $M_0 \in \mathcal{M}$ such that $f^{M_0}$ is a constant function, and that $|\Pi| < \infty$.

1. If there exists $\delta > 0$ such that $\lim_{\gamma \to \infty} \mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \cdot \gamma^{\delta} = 0$, then $\lim_{T \to \infty} \mathfrak{M}(\mathcal{M}, T)/T^{p} = 0$ for some $p < 1$.
2. If $\lim_{\gamma \to \infty} \mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \cdot \gamma^{\delta} > 0$ for all $\delta > 0$, then $\lim_{T \to \infty} \mathfrak{M}(\mathcal{M}, T)/T^{p} = \infty$ for all $p < 1$.

The same conclusion holds when $\Pi = \Pi_T$ grows with $T$, but has $\log|\Pi_T| = O(T^{q})$ for any $q < 1$.⁸

Theorem 2.3 shows that polynomial decay of the convexified DEC is necessary and sufficient for low regret. We emphasize that this result is complementary to Theorem 2.2, and does not require localization or any assumption on the tail behavior of the algorithm. This is a consequence of the coarse, asymptotic nature of the result, which allows us to use rescaling arguments to remove these conditions.

Comparison to stochastic setting. Having shown that the convexified Decision-Estimation Coefficient leads to upper and lower bounds on the optimal regret for the adversarial DMSO setting, we now contrast with the stochastic setting. There, Foster et al. [17] obtain upper bounds on regret that have the same form as (11), but scale with the weaker quantity $\max_{\bar{M} \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})$.⁹ For classes that are not convex, but where proper estimators are available (including finite-state/action MDPs), the upper bounds in Foster et al. [17] can further be improved to scale with $\mathrm{dec}_{\gamma}(\mathcal{M})$. Hence, our results show that in general, the price of adversarial outcomes can be as large as $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))/\mathrm{dec}_{\gamma}(\mathcal{M})$. Examples (see Appendix D for details and more) include:

For tabular (finite-state/action) MDPs with horizon $H$, $S$ states, and $A$ actions, Foster et al. [17] show that $\mathrm{dec}_{\gamma}(\mathcal{M}) \leq \mathrm{poly}(H, S, A)/\gamma$, and use this to obtain regret $\mathrm{poly}(H, S, A) \cdot \sqrt{T}$. Tabular MDPs are not a convex class, and $\mathrm{co}(\mathcal{M})$ is equivalent to the class of so-called latent MDPs, which are known to be intractable [28, 37]. Indeed, we show (Appendix D) that $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$ remains lower bounded by a constant unless $\gamma \geq \Omega(A^{\min\{S, H\}})$, which implies an exponential lower bound on regret through Theorem 2.2. This highlights that in general, the gap between stochastic and adversarial outcomes can be quite large.

For many common bandit problems, one has $\mathrm{co}(\mathcal{M}) = \mathcal{M}$, leading to polynomial bounds on regret in the adversarial setting. For example, the multi-armed bandit problem with $A$ actions has $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \leq O(A/\gamma)$, leading to $\sqrt{AT \log A}$ regret (via Theorem 2.1), and the linear bandit problem in $d$ dimensions has $\mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \leq O(d/\gamma)$, leading to regret $\sqrt{dT \log|\Pi|}$.

⁸Allowing $\Pi$ to grow with $T$ is useful when considering infinite decision spaces, because it facilitates covering arguments.

⁹Theorem 3.1 of Foster et al. [17] attains $\mathrm{Reg}_{\mathrm{DM}} \lesssim \inf_{\gamma > 0}\big\{\max_{\bar{M} \in \mathrm{co}(\mathcal{M})} \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M}) \cdot T + \gamma \log|\mathcal{M}|\big\}$ with high probability.

3 Connections Between Complexity Measures

The Decision-Estimation Coefficient bears a resemblance to the generalized Information Ratio introduced by Lattimore and György [32], Lattimore [30], which extends the original Information Ratio of Russo and Van Roy [46, 47]. In what follows, we establish deeper connections between these complexity measures. All of the results in this section are proven in Appendix E.

Let us call any function $D(\cdot \,\|\, \cdot) : \Delta(\Pi) \times \Delta(\Pi) \to \mathbb{R}_{+}$ a divergence-like function. We restate the generalized Information Ratio from Lattimore [31]. For a distribution $\mu \in \Delta(\mathcal{M} \times \Pi)$ and decision distribution $p \in \Delta(\Pi)$, let $\mathbb{P}$ be the law of the process $(M, \pi^{\star}) \sim \mu$, $\pi \sim p$, $z \sim M(\pi)$, and define $\mu_{\mathrm{pr}}(\pi_0) := \mathbb{P}(\pi^{\star} = \pi_0)$ and $\mu_{\mathrm{po}}(\pi_0; \pi, z) := \mathbb{P}(\pi^{\star} = \pi_0 \mid (\pi, z))$. The distribution $\mu_{\mathrm{pr}}$ should be thought of as the prior over $\pi^{\star}$, and $\mu_{\mathrm{po}}$ should be thought of as the posterior over $\pi^{\star}$
after observing $(z, \pi)$; note that the law $\mu_{\mathrm{po}}$ does not depend on the distribution $p$. For a parameter $\lambda > 1$, Lattimore [31] defines the generalized Information Ratio for a class $\mathcal{M}$ via¹⁰

$\Lambda^{D}_{\lambda}(\mathcal{M}) = \sup_{\mu \in \Delta(\mathcal{M} \times \Pi)} \inf_{p \in \Delta(\Pi)} \frac{\big(\mathbb{E}_{(M, \pi^{\star}) \sim \mu}\, \mathbb{E}_{\pi \sim p}\big[f^{M}(\pi^{\star}) - f^{M}(\pi)\big]\big)^{\lambda}}{\mathbb{E}_{\pi \sim p}\, \mathbb{E}_{z \mid \pi}\big[D\big(\mu_{\mathrm{po}}(\cdot\,; \pi, z) \,\|\, \mu_{\mathrm{pr}}\big)\big]}. \quad (20)$

Here, we have slightly generalized the original definition in Lattimore [31] by incorporating models in $\mathcal{M}$ rather than placing an arbitrary prior over observations $z$ directly. We also use a general divergence-like function, while Lattimore [31] uses KL divergence and Lattimore and György [32] use Bregman divergences. To understand the connection to the Decision-Estimation Coefficient, it will be helpful to introduce another variant of the Information Ratio, which we call the parameterized Information Ratio.

Definition 3.1. For a divergence-like function $D(\cdot \,\|\, \cdot) : \Delta(\Pi) \times \Delta(\Pi) \to \mathbb{R}_{+}$, the parameterized Information Ratio, which we denote by $\mathsf{ir}^{D}_{\gamma}(\mathcal{M})$, is given by

$\mathsf{ir}^{D}_{\gamma}(\mathcal{M}) = \sup_{\mu \in \Delta(\mathcal{M} \times \Pi)} \inf_{p \in \Delta(\Pi)} \Big\{\mathbb{E}_{(M, \pi^{\star}) \sim \mu}\, \mathbb{E}_{\pi \sim p}\big[f^{M}(\pi^{\star}) - f^{M}(\pi)\big] - \gamma\, \mathbb{E}_{\pi \sim p}\, \mathbb{E}_{z \mid \pi}\big[D\big(\mu_{\mathrm{po}}(\cdot\,; \pi, z) \,\|\, \mu_{\mathrm{pr}}\big)\big]\Big\}. \quad (21)$

The parameterized Information Ratio is always bounded by the generalized Information Ratio in (20); in particular, we have

$\mathsf{ir}^{D}_{\gamma}(\mathcal{M}) \leq \big(\Lambda^{D}_{\lambda}(\mathcal{M})/\gamma\big)^{\frac{1}{\lambda - 1}} \quad \forall \gamma > 0.$

All regret bounds based on the generalized Information Ratio that we are aware of [32, 31] implicitly bound regret by the parameterized Information Ratio, and then invoke the inequality above to move to the generalized Information Ratio. In general, though, it does not appear that these notions are equivalent. Informally, this is because the notion in (20) is equivalent to requiring that a single distribution $p$ certify a certain bound on the value in (21) for all values of the parameter $\gamma$ simultaneously, while the parameterized Information Ratio allows the distribution $p$ to vary as a function of $\gamma > 0$ (hence the name); see also Appendix E.

Letting $\mathsf{ir}^{\mathrm{H}}_{\gamma}(\mathcal{M})$ denote the parameterized Information Ratio with $D = D^{2}_{\mathrm{H}}(\cdot, \cdot)$, we show that this notion is equivalent to the convexified Decision-Estimation Coefficient.

Theorem 3.1. For all $\gamma > 0$, $\mathsf{ir}^{\mathrm{H}}_{\gamma}(\mathcal{M}) \leq \mathrm{dec}_{\gamma}(\mathrm{co}(\mathcal{M})) \leq \mathsf{ir}^{\mathrm{H}}_{\gamma'}(\mathcal{M})$, where $\gamma'$ differs from $\gamma$ by at most an absolute constant factor.

¹⁰Lattimore and György [32] give a slightly different but essentially equivalent definition; cf. Appendix E.

This result is a special case of Theorem E.1 in Appendix E, which shows that a similar equivalence holds for a class of well-behaved $f$-divergences that includes KL divergence, but not necessarily for general Bregman divergences. The idea is to use Bayes' law to move from the Decision-Estimation Coefficient, which considers distance between distributions over observations, to the Information Ratio, which considers distance between distributions over decisions. In light of this characterization, the results in this paper could have equivalently been presented in terms of the parameterized Information Ratio. We chose to present them in terms of the Decision-Estimation Coefficient in order to draw parallels to the stochastic setting, where guarantees that scale with $\mathrm{dec}_{\gamma}(\mathcal{M})$ (without convexification) are available. It is unclear whether the Information Ratio can accurately reflect the complexity for both stochastic and adversarial settings in the same fashion, because unlike the DEC it is invariant under convexification, as the following proposition shows.¹¹

Proposition 3.1. For any divergence-like function $D(\cdot \,\|\, \cdot)$, we have $\mathsf{ir}^{D}_{\gamma}(\mathcal{M}) = \mathsf{ir}^{D}_{\gamma}(\mathrm{co}(\mathcal{M}))$ for all $\gamma > 0$.

¹¹The variants in Lattimore and György [32], Lattimore [31] are also invariant under convexification.
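To make the prior and posterior objects in these definitions concrete, the following sketch computes $\mu_{\mathrm{pr}}$ and $\mu_{\mathrm{po}}(\cdot; \pi, z)$ by Bayes' rule for a small Bernoulli bandit class and evaluates the regret and divergence terms entering (21) with $D = D^2_{\mathrm{H}}$. Coupling $\pi^\star$ with each model's optimal decision is one natural choice of $\mu \in \Delta(\mathcal{M} \times \Pi)$, and all variable names are our own illustration.

```python
# Bayes computation of mu_pr and mu_po for a finite Bernoulli bandit class, and
# the two terms (regret and expected Hellinger divergence) that enter the
# parameterized Information Ratio. Illustrative discretization and names.
import numpy as np

models = np.array([[0.7, 0.3, 0.5],
                   [0.2, 0.6, 0.4],
                   [0.3, 0.4, 0.8]])            # mean rewards f^M(pi)
mu_models = np.array([0.5, 0.3, 0.2])            # prior over models
pi_star = models.argmax(axis=1)                  # couple pi* with each model's optimum
num_decisions = models.shape[1]
p = np.full(num_decisions, 1.0 / num_decisions)  # a fixed decision distribution

def hellinger_sq(a, b):
    return float(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

# Prior over pi*: mu_pr(pi0) = P(pi* = pi0).
mu_pr = np.zeros(num_decisions)
np.add.at(mu_pr, pi_star, mu_models)

# Regret term: E_{(M, pi*) ~ mu} E_{pi ~ p}[ f^M(pi*) - f^M(pi) ].
regret = float(np.sum(mu_models * (models[np.arange(len(models)), pi_star]
                                   - models @ p)))

# Information term: E_{pi ~ p} E_{z | pi}[ D_H^2( mu_po(.; pi, z) || mu_pr ) ],
# where z in {0, 1} is the Bernoulli reward and mu_po is the Bayes posterior over pi*.
info = 0.0
for pi in range(num_decisions):
    for z in (0.0, 1.0):
        like = np.where(z == 1.0, models[:, pi], 1.0 - models[:, pi])  # P(z | M, pi)
        joint = mu_models * like
        p_z = joint.sum()                        # marginal P(z | pi)
        post_models = joint / p_z
        mu_po = np.zeros(num_decisions)
        np.add.at(mu_po, pi_star, post_models)
        info += p[pi] * p_z * hellinger_sq(mu_po, mu_pr)

gamma = 5.0
print(regret, info, regret - gamma * info)       # the objective inside Definition 3.1
```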
For a final structural result, we show that up to constants, the parameterized Information Ratio is equivalent to the high-probability Exploration-by-Optimization objective.

Theorem 3.2. For all $\eta > 0$, $\mathsf{ir}^{\mathrm{H}}_{(4\eta)^{-1}}(\mathcal{M}) \leq \mathrm{exo}_{\eta}(\mathcal{M}) \leq \mathsf{ir}^{\mathrm{H}}_{(8\eta)^{-1}}(\mathcal{M})$.

This result is proven through a direct argument (cf. Section 2.1), and the equivalence of the DEC and Exploration-by-Optimization in (13) is proven by combining it with Theorem 3.1. Summarizing the equivalence:

Corollary 3.1. For all $\eta > 0$,

$\mathrm{dec}_{(4\eta)^{-1}}(\mathrm{co}(\mathcal{M})) \leq \mathsf{ir}^{\mathrm{H}}_{(4\eta)^{-1}}(\mathcal{M}) \leq \mathrm{exo}_{\eta}(\mathcal{M}) \leq \mathsf{ir}^{\mathrm{H}}_{(8\eta)^{-1}}(\mathcal{M}) \leq \mathrm{dec}_{(8\eta)^{-1}}(\mathrm{co}(\mathcal{M})).$

Since this equivalence depends on the value of the parameter $\gamma > 0$ in the parameterized Information Ratio, it is not clear whether a similar equivalence can be established using the generalized Information Ratio in (20). We note in passing that one can use similar arguments to lower bound the Bregman divergence-based Exploration-by-Optimization objective in Lattimore and György [32] by the parameterized Information Ratio for the Bregman divergence of interest, complementing their upper bound.

4 Discussion

We have shown that the convexified Decision-Estimation Coefficient is necessary and sufficient to achieve low regret for adversarial interactive decision making, establishing that convexity governs the price of adversarial outcomes. Our results elucidate the relationship between the DEC, Exploration-by-Optimization, and the Information Ratio, and we hope they will find broader use. This work adds to a growing body of research which shows that online reinforcement learning with agnostic or adversarial outcomes can be statistically intractable [48, 37]. A promising future direction is to extend our techniques to natural semi-adversarial models in which reinforcement learning is tractable. Another interesting direction is to address the issue of computational efficiency for large decision spaces.

Acknowledgements

We thank Zak Mhammedi for useful comments and feedback. AR acknowledges support from the ONR through awards N00014-20-1-2336 and N00014-20-1-2394, from the ARO through award W911NF-21-1-0328, and from the DOE through award DE-SC0022199. Part of this work was completed while DF was visiting the Simons Institute for the Theory of Computing. KS acknowledges support from NSF CAREER Award 1750575.

References

[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proc. of the 21st Annual Conference on Learning Theory (COLT), 2008.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44:615–631, 1997.
[3] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pages 1–122, 2009.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[5] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
[6] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
[7] S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR Workshop and Conference Proceedings, 2012.
[8] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72–85, 2017.
[9] N.
Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.
[10] V. Dani, T. P. Hayes, and S. Kakade. The price of bandit information for online optimization.
[11] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pages 207–232, 2011.
[12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679, 2020.
[13] S. Dong, B. Van Roy, and Z. Zhou. Provably efficient reinforcement learning with aggregated states. arXiv preprint arXiv:1912.06366, 2019.
[14] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
[15] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL. International Conference on Machine Learning, 2021.
[16] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
[17] D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
[18] E. Hazan and S. Kale. Better algorithms for benign bandits. Journal of Machine Learning Research, 12(4), 2011.
[19] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713, 2017.
[20] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143, 2020.
[21] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Neural Information Processing Systems, 2021.
[22] T. Jin and H. Luo. Simultaneously learning stochastic and adversarial episodic MDPs with known transition. Advances in Neural Information Processing Systems, 33:16557–16566, 2020.
[23] J. Kirschner. Information-Directed Sampling: Frequentist Analysis and Applications. PhD thesis, ETH Zurich, 2021.
[24] J. Kirschner, T. Lattimore, and A. Krause. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pages 2328–2369. PMLR, 2020.
[25] J. Kirschner, T. Lattimore, C. Vernade, and C. Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pages 2777–2821. PMLR, 2021.
[26] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 17:697–704, 2004.
[27] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
[28] J. Kwon, Y. Efroni, C. Caramanis, and S. Mannor. RL for latent MDPs: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
[29] T. Lattimore. Improved regret for zeroth-order adversarial bandit convex optimisation.
Mathematical Statistics and Learning, 2(3):311–334, 2020.
[30] T. Lattimore. Minimax regret for bandit convex optimisation of ridge functions. arXiv preprint arXiv:2106.00444, 2021.
[31] T. Lattimore. Minimax regret for partial monitoring: Infinite outcomes and Rustichini's regret. arXiv preprint arXiv:2202.10997, 2022.
[32] T. Lattimore and A. György. Mirror descent and the information ratio. In Conference on Learning Theory, pages 2965–2992. PMLR, 2021.
[33] T. Lattimore and C. Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. In Conference on Learning Theory, pages 2111–2139. PMLR, 2019.
[34] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
[35] T. Lattimore and C. Szepesvári. Exploration by optimisation in partial monitoring. In Conference on Learning Theory, pages 2488–2515. PMLR, 2020.
[36] L. Li. A unifying framework for computational reinforcement learning theory. Rutgers, The State University of New Jersey–New Brunswick, 2009.
[37] Q. Liu, Y. Wang, and C. Jin. Learning Markov games with adversarial opponents: Efficient algorithms and fundamental limits. arXiv preprint arXiv:2203.06803, 2022.
[38] S. Magureanu, R. Combes, and A. Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
[39] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
[40] G. Neu, A. György, C. Szepesvári, et al. The online loop-free stochastic shortest-path problem. In COLT, volume 2010, pages 231–243. Citeseer, 2010.
[41] G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676–691, 2014.
[42] Y. Polyanskiy. Information theoretic methods in statistics and computer science. 2020. URL https://people.lids.mit.edu/yp/homepage/sdpi_course.html.
[43] Y. Polyanskiy and Y. Wu. Lecture notes on information theory. 2014.
[44] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984–1992, 2010.
[45] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
[46] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[47] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018.
[48] A. Sekhari, C. Dann, M. Mohri, Y. Mansour, and K. Sridharan. Agnostic reinforcement learning with low-rank MDPs and rich observations. Advances in Neural Information Processing Systems, 34, 2021.
[49] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability, and uniform convergence. Journal of Machine Learning Research (JMLR), 2010.
[50] M. Sion. On general minimax theorems. Pacific J. Math., 8:171–176, 1958.
[51] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
[52] V. N. Vapnik. The nature of statistical learning theory. 1995.
[53] R. Wang, R. R. Salakhutdinov, and L. Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
[54] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
[55] T. Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
[56] D. Zhou, Q. Gu, and C. Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR, 2021.
[57] A. Zimin and G. Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. Advances in Neural Information Processing Systems, 26, 2013.