# Online learning with dynamics: A minimax perspective

Kush Bhatia, EECS, UC Berkeley, kush@cs.berkeley.edu
Karthik Sridharan, CS, Cornell University, sridharan@cs.cornell.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round of the interaction, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and the current state of the world. The state-evolution dynamics and the costs are allowed to be time-varying, in a possibly adversarial way. In this setting, we study the problem of minimizing policy regret and provide non-constructive upper bounds on the minimax rate for the problem. Our main results provide sufficient conditions for online learnability for this setup with corresponding rates. The rates are characterized by: 1) a complexity term capturing the expressiveness of the underlying policy class under the dynamics of state change, and 2) a dynamic stability term measuring the deviation of the instantaneous loss from a certain counterfactual loss. Further, we provide matching lower bounds which show that both of these complexity terms are indeed necessary. Our approach provides a unifying analysis that recovers regret bounds for several well-studied problems, including online learning with memory, online control of linear quadratic regulators, online Markov decision processes, and tracking adversarial targets. In addition, we show how our tools help obtain tight regret bounds for new problems (with non-linear dynamics and non-convex losses) for which such bounds were not known prior to our work.

## 1 Introduction

Machine learning systems deployed in the real world interact with people through their decision making. Such systems form a feedback loop with their environment: they learn to make decisions from real-world data, and decisions made by these systems in turn affect the data that is collected. In addition, people often learn to adapt to such automated decision makers in an attempt to maximize their own utility, rendering any assumption on the data generation process futile. Motivated by these aspects of decision making, we propose the problem of online learning with dynamics, which involves repeated interaction between a learner and an environment with an underlying state. The decisions made by the learner affect this state of the environment, which evolves as a dynamical system. Further, we place no distributional assumptions on the learning data and allow it to be adversarial.

Given such a setup, a natural question to ask is: how does one measure the performance of the learner? Classical online learning studies one such notion of performance known as regret. This measure compares the performance of the learner to that of a fixed best policy in hindsight, when evaluated on the same states which were observed by the learner. Such a measure of performance clearly does not work for the above setup: had we deployed a different policy, we would have observed different states of the environment. To overcome this, we study a counterfactual notion of regret, called Policy Regret, where the comparator term is the performance of a policy on the states one would have observed had this policy been deployed from the beginning of time.
Such a notion of regret has been studied in the online learning literature for understanding memory-based adversaries [20, 4, 5] and, more recently, for the study of specific reinforcement learning models [11, 1, 10]. However, the vast majority of these works have focused on known and fixed models of state evolution, often restricting the scope to linear dynamical systems. Further, these works have focused on simplistic policy classes as the comparators in their notion of policy regret. Contrast this with the vast literature on statistical learning [28, 6] and classical online learning [22], which study the question of learnability in full generality: for arbitrary losses and general function classes.

Our work is a step towards addressing this gap. We study the problem of learnability for a class of online learning problems with underlying states evolving as a dynamical system, in its full generality. Our main results provide sufficient conditions (along with non-asymptotic upper bounds) on when such problems are learnable, that is, can have vanishing policy regret. Our approach is non-constructive and provides a complexity term that upper bounds the minimax rates for these problems. Further, we provide lower bounds showing that for a large class of problems, our upper bounds are tight up to constant factors. By studying the problem in full generality, we show how several well-studied problems in the literature, comprising online Markov decision processes [11], online adversarial tracking [1], the online linear quadratic regulator [10], online control with adversarial noise [2], and online learning with memory [5, 4], can be seen as specific examples of our general framework. We recover the best known rates for a majority of these problems, oftentimes even generalizing these setups. We also provide examples where, to the best of our knowledge, previous techniques are not able to obtain useful bounds on regret; using our minimax tools, however, we are able to provide tight bounds on the policy regret for these examples.

Formally, we consider the setup where $\mathcal{X}$ denotes an arbitrary set of states, $\Pi$ an arbitrary class of policies, and $\mathcal{Z}$ an arbitrary instance space. Given this, the interaction between the learner and nature can be expressed as a $T$-round protocol where on each round $t \in [T]$, the learner picks a policy $\pi_t \in \Pi$ and the adversary simultaneously picks an instance $(z_t, \zeta_t) \in \mathcal{Z}$. The learner suffers loss $\ell(\pi_t, x_t, z_t)$ and the state of the system evolves^1 as $x_{t+1} = \Phi(x_t, \pi_t, \zeta_t)$, where $\Phi$ is known to the learner. The goal of the learner is to minimize the policy regret

$$\mathrm{Reg}^{\mathrm{pol}}_T \;=\; \sum_{t=1}^{T} \ell(\pi_t, x_t, z_t) \;-\; \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell\big(\pi, x_t[\pi^{(t-1)}, \zeta_{1:t-1}], z_t\big),$$

where $x_t$ are the states of the system based on the learner's choices of policies and $x_t[\pi^{(t-1)}, \zeta_{1:t-1}]$ represents the state of the system at time $t$ had the policy $\pi$ been used in all of the previous $t-1$ rounds. We refer to the loss $\ell(\pi, x_t[\pi^{(t-1)}, \zeta_{1:t-1}], z_t)$ as the counterfactual loss of policy $\pi$. Notice that the dynamics $\Phi$ being fixed and known in advance to the learner is not really restrictive, since the adversary can encode arbitrary state-transition maps in the actions $\zeta_t$, and $\Phi$ can then be seen simply as an applicator of these maps.

Our contributions. We are interested in the following question: for a given problem instance $(\Pi, \mathcal{Z}, \Phi, \ell)$, is the problem learnable, that is, does there exist a learning algorithm whose policy regret satisfies $\mathrm{Reg}^{\mathrm{pol}}_T = o(T)$? Below we highlight some of the key contributions of this paper.
1. We show that the minimax policy regret for any problem specified by $(\Pi, \mathcal{Z}, \Phi, \ell)$ can be upper bounded by the sum of two terms: i) a sequential Rademacher complexity-like term for the class of counterfactual losses of the policy class, and ii) a term we refer to as the dynamic stability term for the Empirical Risk Minimizer (ERM) (or regularized ERM) algorithm.

2. We analyze the problem in the dual game. While in most cases ERM does not even have low classical regret, let alone policy regret, we show that an ERM-like strategy in the dual game can lead to the two-term decomposition of minimax policy regret mentioned above.

3. Ours is the first work that studies arbitrary online dynamical systems and provides an analysis for general policy classes and loss functions (possibly non-convex).

4. We provide lower bounds showing that our sufficient conditions are tight for a large class of problem instances, establishing that both terms in our upper bounds are indeed necessary.

5. We delineate a number of previously studied problems, including the online linear quadratic regulator and online learning with memory, for which we recover existing rates. More importantly, we provide examples of new non-convex and general online learning with dynamics problems and obtain tight regret bounds. For these examples, none of the previous methods are able to obtain any non-degenerate regret bounds.

^1 While we consider deterministic dynamics here, Section 2 considers general dynamics with stochastic noise.

Related work. The classical online learning setup [8] considers a repeated interactive game between a learner and an environment without any notion of underlying dynamics. Sequential complexity measures were introduced in [22] to obtain a tight characterization of minimax regret rates for the classical online learning setting. They showed that for the class of online supervised learning problems, one can upper and lower bound the minimax rate in terms of a sequential Rademacher complexity of the predictor class. The works [18, 7] provided an analog of VC theory for online classification, and the sequential complexity measures of [22] provided such a theory for general supervised online learning. This paper can be seen as deriving such a characterization of learnability and tight rates for the problem of online learning with dynamics. In the more general setting we consider, while the main mathematical tools introduced in [22] are useful, they are not by themselves sufficient because of the complexities of policy regret and the state dynamics. This is evident from our upper bound, which consists of two terms (both of which we show are necessary), only one of which is a sequential Rademacher complexity-type term.

Another line of work closely related to ours is the theory of optimal control (see [17] for a review). Linear dynamical systems with simple zero-mean noise models, such as Gaussian noise for the state dynamics, have been extensively studied (see the surveys [19] and [14] for an extensive review). While the majority of the work in control has focused on linear dynamics with fixed noise models, the H∞ control (and more generally, robust control) literature has aimed at extending the setting to worst-case perturbations (see [26]). However, these works focus on cumulative costs and are often not practical for machine learning scenarios, where such algorithms tend to be overly conservative. There has been recent work dealing with adversarial costs and linear dynamics with either stochastic or adversarial noise.
Online Markov decision processes [11], online adversarial tracking [1], and the online linear quadratic regulator [10, 25] are all examples of such work that deal with specific forms of possibly adversarially chosen cost functions; however, the loss functions in these problems are very specific and the dynamics are essentially linear, with either fixed stochastic noise or no noise. Perhaps the closest comparison to our work is the work by Agarwal et al. [2] (and also [3, 12]), where adversarial but convex costs, linear policies, and linear dynamics with an adversarial component are considered. In contrast, we consider an arbitrary class of policies, adversarially chosen costs (possibly non-convex) and dynamics that are presented on the fly, and an arbitrary state space.

## 2 Online learning with dynamics

We now formally define the online learning with dynamics problem. We let $\mathcal{X}$ represent the state space, $\Pi$ denote the set of learner policies, and $\mathcal{Z} = \mathcal{Z}_\ell \times \mathcal{Z}_\Phi$ denote the space of the adversary's moves.

### 2.1 Problem setup

The problem of online learning with dynamics proceeds as a repeated game between a learner and an adversary played over $T$ rounds. The state of the system at time $t$, denoted by $x_t \in \mathcal{X}$, evolves according to a stochastic dynamical system as $x_{t+1} = \Phi(x_t, \pi_t, \zeta_t) + w_t$, where $\Phi : \mathcal{X} \times \Pi \times \mathcal{Z}_\Phi \to \mathcal{X}$ is the transition function and $w_t \sim \mathcal{D}_w$ is zero-mean additive noise. The transition function $\Phi$ is allowed to depend on the adversary's action $\zeta_t$, allowing the dynamics to change across time steps. We assume that the dynamics function $\Phi$ and the distribution $\mathcal{D}_w$ are fixed a priori and are known to the learner before the game begins. Given these dynamics, the repeated online game between the learner and the adversary starts at an initial state $x_1$ and proceeds via the following interactive protocol. On round $t = 1, \ldots, T$:

- the learner picks a policy $\pi_t \in \Pi$, and the adversary simultaneously selects an instance $(z_t, \zeta_t) \in \mathcal{Z}$;
- the learner receives the payoff (loss) signal $\ell(\pi_t, x_t, z_t)$;
- the state of the system transitions to $x_{t+1} = \Phi(x_t, \pi_t, \zeta_t) + w_t$.

We consider the full-information version of the above game: the learner gets to observe the instance $(z_t, \zeta_t)$ at time $t$. The objective of the learner is to minimize, in expectation, the policy regret

$$\mathrm{Reg}^{\mathrm{pol}}_T \;=\; \sum_{t=1}^{T} \mathbb{E}_{w}\big[\ell(\pi_t, x_t[\pi_{1:t-1}, w_{1:t-1}, \zeta_{1:t-1}], z_t)\big] \;-\; \inf_{\pi \in \Pi} \mathbb{E}_{w}\Big[\sum_{t=1}^{T} \ell\big(\pi, x_t[\pi^{(t-1)}, w_{1:t-1}, \zeta_{1:t-1}], z_t\big)\Big] \tag{1}$$

with respect to a policy class $\Pi$ and dynamics model $\Phi$. In the above definition, the notation $x_t[\pi_{1:t-1}, w_{1:t-1}, \zeta_{1:t-1}]$ makes explicit the dependence of the state $x_t$ on the previous policies, noise, and adversarial actions. For notational convenience, we will often drop the dependence on the noise variables $w$ and adversarial actions $\zeta$ when it is clear from context. Observe that in the definition of policy regret, the loss depends on the state of the system at that instance, which can be very different for the learner and a comparator policy $\pi$. This lends an additional source of complexity to the interactive game, and can make the problem much harder than its counterpart without dynamics.

The problem of online learning with dynamics generalizes the online learning problem, where the loss functions $\ell(\pi, x, z) = \ell(\pi, z)$ are independent of the underlying state variables. Indeed, our notion of policy regret in equation (1) then reduces to the notion of external regret studied in the online learning literature. Also, the problem of online learning with memory involves adversaries which have bounded memory of length $m$, so that the loss incurred by the learner at any time is a function of its past $m$ moves. By setting the state variable $x_t = [\pi_{t-m}, \ldots, \pi_{t-1}]$, the dynamics function $\Phi(x_t, \pi_t) = [\pi_{t-m+1}, \ldots, \pi_t]$, and the noise disturbances $w_t = 0$, we see that bounded-memory adversaries are a special case of our problem with dynamics.
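To make the interaction protocol and the counterfactual comparator concrete, the following minimal Python sketch simulates the game for this bounded-memory special case and computes the policy regret of a naive learner. The dynamics `Phi`, the loss function, and the tiny class of constant policies are illustrative choices for this sketch, not constructions from the paper.

```python
import random

T, m = 50, 2                       # horizon and adversary memory length
actions = [0.0, 0.5, 1.0]          # a tiny finite class of constant policies (illustrative)

def Phi(state, policy):
    """Bounded-memory dynamics: the state stores the last m policies played."""
    return (state + (policy,))[-m:]

def loss(policy, state, z):
    """Illustrative state-dependent loss: penalize disagreement with z and with the past m moves."""
    return (policy - z) ** 2 + sum((policy - s) ** 2 for s in state) / max(m, 1)

random.seed(0)
zs = [random.random() for _ in range(T)]        # (oblivious) adversarial loss parameters

# Learner: for the sketch, play a naive randomized strategy (uniform over the actions).
state = tuple(0.0 for _ in range(m))
learner_loss = 0.0
for t in range(T):
    pi_t = random.choice(actions)
    learner_loss += loss(pi_t, state, zs[t])
    state = Phi(state, pi_t)

# Comparator: counterfactual loss of each fixed policy, replayed from the beginning of time.
def counterfactual_total(pi):
    s, total = tuple(0.0 for _ in range(m)), 0.0
    for t in range(T):
        total += loss(pi, s, zs[t])
        s = Phi(s, pi)
    return total

policy_regret = learner_loss - min(counterfactual_total(pi) for pi in actions)
print(f"policy regret of the naive learner: {policy_regret:.2f}")
```

The key point of the sketch is the comparator: each fixed policy is scored on the states it would itself have generated, not on the states the learner visited.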
### 2.2 Minimax Policy Regret

Given the setup of the previous section, we study the online learning with dynamics game between the learner and the adversary from a minimax perspective. Studying this minimax value allows one to understand the limits of learnability for a tuple $(\Pi, \mathcal{Z}, \Phi, \ell)$: upper bounds on this value imply the existence of algorithms with corresponding rates, while lower bounds on this value represent the information-theoretic limits of learnability. In the following proposition, we formally define the value $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$ of the minimax policy regret for a given problem, which, informally, is the policy regret of the best learning algorithm against the worst-case adversary.

Proposition 1 (Value of the Dual Game). Let $\mathcal{Q}$ and $\mathcal{P}$ denote the sets of probability distributions over the policy class $\Pi$ and the adversarial actions $\mathcal{Z}$ respectively, satisfying the necessary conditions for the minimax theorem to hold. Then, we have that^2

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell) \;=\; \Big\llbracket \inf_{q_t \in \mathcal{Q}} \, \sup_{(z_t, \zeta_t) \in \mathcal{Z}} \, \mathbb{E}_{\pi_t \sim q_t} \Big\rrbracket_{t=1}^{T} \big[\mathrm{Reg}^{\mathrm{pol}}_T\big] \;=\; \Big\llbracket \sup_{p_t \in \mathcal{P}} \, \inf_{\pi_t} \, \mathbb{E}_{(z_t, \zeta_t) \sim p_t} \Big\rrbracket_{t=1}^{T} \big[\mathrm{Reg}^{\mathrm{pol}}_T\big]. \tag{2}$$

^2 $\llbracket \cdots \rrbracket_{t=1}^{T}$ denotes the interleaved application of the sequence of operators inside. For example, for $T = 2$, $\llbracket \sup_{p_t} \inf_{q_t} \rrbracket_{t=1}^{2}[\,\cdot\,] = \sup_{p_1} \inf_{q_1} \sup_{p_2} \inf_{q_2}[\,\cdot\,]$.

The proof of the proposition is deferred to Appendix A. The proof proceeds via a repeated application of von Neumann's minimax theorem (see, for instance, [23, Appendix A]). Notice that the minimax theorem changes the order of the online sequential game defined in the setup above: at every time step $t$, the adversary proceeds first and outputs a distribution $p_t$ over instances, and the learner responds with $\pi_t$ after having observed this distribution. The actual loss instance $(z_t, \zeta_t)$ is then sampled from the revealed distribution $p_t$. On the other hand, the comparator remains the same as before: the best policy $\pi \in \Pi$ in hindsight. This reversed game, termed the Dual Game, forms the basis of our analysis and allows us to study the complexity of the online learning with dynamics problem.

## 3 Upper bounds on value of the game

Our main result in this section is an upper bound on the value of the sequential game $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$, relating it to certain stability properties of empirical minimizers and the stochastic processes associated with them. Before we proceed to describe the main result, we revisit some preliminaries and set up notation that will be helpful in describing it.

Sequential Rademacher complexity. The notion of sequential Rademacher complexity, introduced in [22], is a natural generalization of the Rademacher complexity to online learning. However, observe that the loss of the comparator term in the definition of policy regret in equation (1) depends on the adversarial actions $\zeta_{1:t-1}$ through the dynamics and on $z_t$ through the loss function $\ell$. We define the following version of sequential Rademacher complexity for such dynamics-based losses.

Definition 1. The sequential Rademacher complexity of a policy class $\Pi$ with respect to a loss function $\ell : \Pi \times \mathcal{X} \times \mathcal{Z}_\ell \to \mathbb{R}$ and dynamics $\Phi : \mathcal{X} \times \Pi \times \mathcal{Z}_\Phi \to \mathcal{X}$ is defined as

$$\mathfrak{R}^{\mathrm{seq}}_T(\ell \circ \Pi) \;=\; \sup_{(\mathbf{z}, \boldsymbol{\zeta})} \; \mathbb{E}_{\epsilon} \Big[ \sup_{\pi \in \Pi} \sum_{t=1}^{T} \epsilon_t \, \ell\big(\pi, x_t[\boldsymbol{\zeta}_1(\epsilon), \ldots, \boldsymbol{\zeta}_{t-1}(\epsilon)], \mathbf{z}_t(\epsilon)\big) \Big],$$

where the outer supremum is taken over $\mathcal{Z} = \mathcal{Z}_\ell \times \mathcal{Z}_\Phi$-valued trees^3 of depth $T$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of i.i.d. Rademacher random variables.

^3 A $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $d$ is defined as a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_d)$ of mappings $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$ (see [21]).
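As a concrete illustration of Definition 1, the sketch below estimates the inner expectation for one fixed $\mathcal{Z}$-valued tree and a tiny finite class of constant policies, with the state at round $t$ obtained by replaying $\pi$ against the tree's $\zeta$-path (as in the counterfactual loss class $\ell^{\Phi} \circ \Pi$ used in the bounds below). The outer supremum over all trees is not computed, and the dynamics, loss, and tree values are illustrative stand-ins.

```python
import itertools
import random

T = 4
policies = [-1.0, 0.0, 1.0]          # tiny class of constant policies (illustrative)

def Phi(x, pi, zeta):
    """Illustrative contracting dynamics; any known transition map would do for the sketch."""
    return 0.5 * x + 0.5 * pi + zeta

def loss(pi, x, z):
    return pi * z + 0.1 * x * x      # illustrative state-dependent loss

# A Z-valued tree of depth T: the node used at round r is indexed by the first r-1 Rademacher signs.
random.seed(1)
tree = {prefix: (random.uniform(-1, 1), random.uniform(-1, 1))
        for t in range(T) for prefix in itertools.product((-1, 1), repeat=t)}

def sup_over_policies(eps):
    """sup_pi sum_t eps_t * loss(pi, x_t[zeta-path(eps)], z_t(eps)) for one sign sequence eps."""
    best = -float("inf")
    for pi in policies:
        x, total = 0.0, 0.0
        for t in range(T):
            z_t, zeta_t = tree[eps[:t]]              # node reached by the first t signs
            total += eps[t] * loss(pi, x, z_t)       # state from replaying pi along the zeta-path
            x = Phi(x, pi, zeta_t)
        best = max(best, total)
    return best

# Exact expectation over all 2^T Rademacher sign sequences (feasible for this small T).
value = sum(sup_over_policies(eps) for eps in itertools.product((-1, 1), repeat=T)) / 2 ** T
print(f"sequential Rademacher value for this one fixed tree: {value:.3f}")
```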
A similar definition was also used by Han et al. [13] in the context of online learning with strategies, where the notion of regret was defined with respect to a set of strategies rather than a fixed action. Compared with the classical online learning problem, the above comprises problems where the loss at time $t$ depends on the complete history $(\zeta_1, \ldots, \zeta_{t-1})$ of adversarial choices along with $z_t$. As noted by [13], such dependence on the adversary's history can often make the problem harder to learn than the classical online learning problem.

Empirical Risk Minimization (ERM). Given a sequence of loss functions $\ell_t : \mathcal{F} \to \mathbb{R}$ for $t \in [T]$, the ERM with respect to a function class $\mathcal{F}$ is defined to be a minimizer of the cumulative loss, $\hat{f}_{\mathrm{ERM},T} \in \operatorname{argmin}_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f)$. In the statistical learning setup, the problems of supervised classification and regression are known to be learnable with respect to a function class $\mathcal{F}$ if and only if the empirical risks converge uniformly over the class $\mathcal{F}$ to the population risks. In contrast, our results provide sufficient conditions for learnability in terms of certain stability properties of such empirical risk minimizers.

Dynamic stability. We introduce the notion of dynamic stability, which captures the stability of an algorithm's interaction with the underlying dynamics $\Phi$. In order to do so, we define the counterfactual loss $\ell^{\Phi}_t$ of a policy $\pi$ as the loss incurred by a learner which selects $\pi$ on all rounds $1, \ldots, t$.

Definition 2 (Counterfactual Losses). Given a sequence of adversarial actions $\zeta_{1:t-1}, z_t$, dynamics function $\Phi$, and noise distribution $\mathcal{D}_w$, the counterfactual loss of a policy $\pi$ at time $t$ is

$$\ell^{\Phi}_t(\pi, \zeta_{1:t-1}, z_t) \;=\; \mathbb{E}_{w_s \sim \mathcal{D}_w}\big[\ell\big(\pi, x_t[\pi^{(t-1)}, w_{1:t-1}, \zeta_{1:t-1}], z_t\big)\big].$$

With this definition, observe that the comparator term in the value $\mathcal{V}_T$ in equation (2) is in fact a cumulative sum of counterfactual losses for a policy $\pi$. Any algorithm $\mathcal{A}$ that plays a sequence of policies $\{\pi_t\}$ in the online game incurs an instantaneous loss $\ell(\pi_t, x_t[\pi_{1:t-1}, \zeta_{1:t-1}], z_t)$ at time $t$. In comparison, the counterfactual loss $\ell^{\Phi}_t(\pi_t, \zeta_{1:t-1}, z_t)$ represents a scenario where the algorithm commits to the policy $\pi_t$ from the beginning of the game. Our notion of dynamic stability of an algorithm is precisely the deviation between these two types of losses: instantaneous and counterfactual.

Definition 3 (Dynamic Stability). An algorithm $\mathcal{A}$ is said to be $\{\beta_t\}$-dynamically stable if for all sequences of adversarial actions $[(z_1, \zeta_1), \ldots, (z_T, \zeta_T)]$ and time instances $t \in [T]$,

$$\Big| \mathbb{E}_{w_{1:t-1}}\big[\ell\big(\pi_t, x_t[\pi_{1:t-1}, w_{1:t-1}, \zeta_{1:t-1}], z_t\big)\big] \;-\; \ell^{\Phi}_t(\pi_t, \zeta_{1:t-1}, z_t) \Big| \;\leq\; \beta_t, \qquad \text{where } \pi_t = \mathcal{A}\big((z_{1:t-1}, \zeta_{1:t-1})\big).$$

It is interesting to note that if the loss functions are independent of the underlying states, that is, $\ell(\pi, x, z) = \ell(\pi, z)$, then any algorithm is trivially dynamically stable with stability parameters $\beta_t = 0$ for all time instances $t$.

With these definitions in place, we now proceed to describe our main result. Recall that Proposition 1 translates the problem of studying the value of the game $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$ to that of studying the policy regret in a dual game. In this dual game, the learner has access to the adversary's distributions $\{p_s\}_{s=1}^{t}$ at time $t$, and the policy $\pi_t$ can be a function of these. For a regularization function $\Omega : \Pi \to \mathbb{R}_+$, we denote the regularized ERMs with respect to the policy class $\Pi$ and the counterfactual losses $\ell^{\Phi}$ by

$$\pi_{\mathrm{RERM},t} \;\in\; \operatorname*{argmin}_{\pi \in \Pi} \; \sum_{s=1}^{t} \mathbb{E}_{z_s}\big[\ell^{\Phi}_s(\pi, \zeta_{1:s-1}, z_s)\big] \;+\; \lambda\, \Omega(\pi), \tag{3}$$

where $\lambda \geq 0$ is the regularization parameter.
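The sketch below makes Definitions 2 and 3 and the regularized ERM of equation (3) concrete for a finite class of constant policies and deterministic dynamics (so the expectations over the noise are dropped). All numerical choices, the regularizer, and the helper names are illustrative.

```python
import random

T, lam = 30, 0.1
policies = [-1.0, -0.5, 0.0, 0.5, 1.0]     # finite class of constant policies (illustrative)
Omega = lambda pi: pi * pi                 # illustrative regularization function

def Phi(x, pi, zeta):
    return 0.8 * x + 0.2 * pi + zeta       # illustrative deterministic dynamics

def loss(pi, x, z):
    return (pi - z) ** 2 + 0.5 * x * x

random.seed(2)
adversary = [(random.uniform(-1, 1), random.uniform(-0.1, 0.1)) for _ in range(T)]  # (z_t, zeta_t)

def counterfactual_loss(pi, t):
    """ell^Phi_t(pi): loss at round t had pi been played on all rounds 1..t (Definition 2, no noise)."""
    x = 0.0
    for s in range(t):
        x = Phi(x, pi, adversary[s][1])
    return loss(pi, x, adversary[t][0])

def regularized_erm(t):
    """Regularized ERM of eq. (3): minimize cumulative counterfactual losses plus lam * Omega."""
    return min(policies,
               key=lambda pi: sum(counterfactual_loss(pi, s) for s in range(t + 1)) + lam * Omega(pi))

# Run the RERM strategy and record its dynamic-stability parameters beta_t (Definition 3).
x, betas = 0.0, []
for t in range(T):
    pi_t = regularized_erm(t)                    # dual-game view: pi_t may depend on round t's losses
    instantaneous = loss(pi_t, x, adversary[t][0])
    betas.append(abs(instantaneous - counterfactual_loss(pi_t, t)))
    x = Phi(x, pi_t, adversary[t][1])

print(f"sum of dynamic-stability parameters: {sum(betas):.3f}")
```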
The following theorem provides an upper bound on the value $\mathcal{V}_T$ in terms of the dynamic stability parameters of the regularized ERMs above, as well as the sequential Rademacher complexity of the effective loss class $\ell^{\Phi} \circ \Pi = \{\ell^{\Phi}(\pi, \cdot) : \pi \in \Pi\}$.

Theorem 1 (Upper bound on value). For any online learning with dynamics instance $(\Pi, \mathcal{Z}, \Phi, \ell)$, consider the set of regularized ERMs given by eq. (3) with regularization function $\Omega$ and parameter $\lambda \geq 0$, having dynamic stability parameters $\{\beta_{\mathrm{RERM},t}\}_{t=1}^{T}$. Then, the value of the game satisfies

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell) \;\leq\; \sum_{t=1}^{T} \beta_{\mathrm{RERM},t} \;+\; 2\,\mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi} \circ \Pi) \;+\; 2\lambda \sup_{\pi \in \Pi} \Omega(\pi). \tag{4}$$

The complete proof of the above theorem can be found in Appendix B. A few comments on Theorem 1 are in order. The theorem provides sufficient conditions to ensure learnability of the online learning with dynamics problem. In particular, the two terms Term (I) $= \sum_{t=1}^{T} \beta_{\mathrm{RERM},t}$ and Term (II) $= \mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi} \circ \Pi)$ contain the main essence of the upper bound. Term (I) concerns the dynamic stability of the regularized ERM in the dual game: if there exist approximate (regularized) minimizers of the sequence of counterfactual losses within the policy class $\Pi$ such that $\pi_{\mathrm{RERM},t}$ is uniformly close to $\pi_{\mathrm{RERM},t+1}$, the dynamic stability parameters can be made small. Term (II) comprises the sequential Rademacher complexity of the loss class $\ell^{\Phi} \circ \Pi$, which involves the underlying policy class $\Pi$ as well as the counterfactual loss $\ell^{\Phi}$. This measure of complexity corresponds to an effective online game where the loss at time $t$ depends on the adversarial actions up to time $t$. Compare this to the instantaneous loss $\ell(\pi_t, x_t[\pi_{1:t-1}, \zeta_{1:t-1}], z_t)$, which depends on both the policies and the adversarial actions up to time $t$. Observe that for the classical online learning setup without dynamics, the dynamic stability parameters $\beta_{\mathrm{RERM},t} \equiv 0$; setting the regularization parameter $\lambda = 0$, we recover the learnability result of Rakhlin et al. [22].

We would like to highlight that the complexity-based learnability guarantees of Theorem 1 are non-constructive in nature. In particular, the theorem says that any non-trivial upper bounds on the stability and sequential complexity terms guarantee the existence of an online learning algorithm with the corresponding policy regret. Our minimax perspective allows us to study the problem in full generality, without making assumptions on the policy class $\Pi$, the adversarial actions $\mathcal{Z}$, or the underlying (possibly adversarial) dynamics $\Phi$, and to provide sufficient conditions for learnability.

Given the upper bound on the value $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$, one can observe that there is a possible tension between the two complexity terms: while the dynamic stability term promotes using policies which are "similar" across time steps, the regularized complexity term seeks policies which are minimizers of cumulative losses and might vary across time steps. In order to balance similar trade-offs, a natural Mini-Batching Algorithm has been proposed in various works on online learning with memory [5] and online learning with switching costs [9]. The key idea is that the learner divides the time horizon $T$ into intervals of length $\tau > 0$ and commits to playing the same strategy over each such interval.
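A minimal sketch of this blocking idea, assuming access to some routine `erm(s)` that returns a minimizer of the cumulative counterfactual losses over the first `s` rounds (for instance, the regularized ERM sketched above); the names are illustrative.

```python
def minibatched_policy(erm, t, tau):
    """Blocked strategy: refresh the policy only at block boundaries and replay it within the block."""
    s = (t // tau) * tau        # most recent multiple of tau that is at most t
    return erm(s)

# Toy usage with a stand-in ERM: the played policy changes at most T / tau times over the horizon.
erm = lambda s: f"ERM fit on first {s} rounds"
print([minibatched_policy(erm, t, tau=5) for t in range(1, 21)])
```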
Let us denote any such mini-batching algorithm by $\mathcal{A}_\tau$ and the corresponding minimax value, restricted to this class of algorithms, by $\mathcal{V}_{T,\tau}(\Pi, \mathcal{Z}, \Phi, \ell)$, where the infimum in equation (2) is taken over all mini-batching algorithms $\mathcal{A}_\tau$. Similar to the regularized ERM of equation (3), we define the following mini-batched ERMs:

$$\pi^{\tau}_{\mathrm{ERM},t} \;=\; \begin{cases} \pi_{\mathrm{ERM}}(t) & \text{if } t \equiv 0 \ (\mathrm{mod}\ \tau), \\ \pi_{\mathrm{ERM}}\big(\tau \lfloor t/\tau \rfloor\big) & \text{otherwise}, \end{cases} \tag{5}$$

where we have used the notation $\pi_{\mathrm{ERM}}(t) = \pi_{\mathrm{ERM},t}$. In the following proposition, we prove an upper bound analogous to that of Theorem 1 for this class of mini-batching algorithms.^4

^4 For this class of mini-batching algorithms, we consider an oblivious adversary which cannot adapt to the randomness of the learner.

Proposition 2 (Mini-batching algorithms). For any online learning with dynamics game $(\Pi, \mathcal{Z}, \Phi, \ell)$, consider the set of mini-batched ERMs given by equation (5), having dynamic stability parameters $\{\beta^{\tau}_{\mathrm{ERM},t}\}_{t=1}^{T}$. Then, the value of the game satisfies

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell) \;\leq\; \inf_{\tau > 0} \mathcal{V}_{T,\tau}(\Pi, \mathcal{Z}, \Phi, \ell) \;\leq\; \inf_{\tau > 0} \Big( \sum_{t=1}^{T} \beta^{\tau}_{\mathrm{ERM},t} \;+\; 2\tau \sup_{s \in [\tau]} \mathfrak{R}^{\mathrm{seq}}_{T/\tau}(\ell^{\Phi}_s \circ \Pi) \Big), \tag{6}$$

where $\ell^{\Phi}_s$ is the counterfactual loss for the $s$th batch.

We defer the proof of the above proposition to Appendix B. In comparison with the upper bound of Theorem 1, this bound concerns the dynamic stability of the mini-batched ERMs rather than their regularized counterparts. Oftentimes, obtaining bounds on the stability parameters $\{\beta^{\tau}_{\mathrm{ERM},t}\}_{t=1}^{T}$ can be much easier than for the regularized ERMs. For instance, it is easy to see that for the problem of online learning with memory against adversaries having memory $m$, one can bound $\sum_{t=1}^{T} \beta^{\tau}_{\mathrm{ERM},t} = O\big(\frac{mT}{\tau}\big)$ whenever the losses are bounded, providing a natural trade-off between the two complexity terms.

## 4 Lower bounds on value of the game

Having established sufficient conditions for the learnability of the online learning with dynamics problem in the previous section, we now turn to the optimality of these conditions. Recall that Theorem 1 and Proposition 2 established upper bounds on the value $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$ for instances of our problem. The following theorem shows that both the upper bounds of equations (4) and (6) are indeed tight up to constant factors.

Theorem 2 (Lower Bound). For the online learning with dynamics problem, there exist problem instances $\{(\Pi, \mathcal{Z}, \Phi, \ell_i)\}_{i=1}^{3}$, a regularization function $\Omega$, and a universal constant $c > 0$ such that

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell_1) \;\geq\; c\, \mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi}_1 \circ \Pi), \tag{7a}$$

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell_2) \;\geq\; c\, \inf_{\lambda > 0} \Big( \sum_{t=1}^{T} \beta_{\mathrm{RERM},t} + \lambda \sup_{\pi \in \Pi} \Omega(\pi) \Big), \tag{7b}$$

$$\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell_3) \;\geq\; c\, \inf_{\tau > 0} \Big( \sum_{t=1}^{T} \beta^{\tau}_{\mathrm{ERM},t} + 2\tau\, \mathfrak{R}^{\mathrm{seq}}_{T/\tau}(\ell^{\Phi}_3 \circ \Pi) \Big), \tag{7c}$$

where $\beta_{\mathrm{RERM},t}$ and $\beta^{\tau}_{\mathrm{ERM},t}$ are the dynamic stability parameters of the regularized ERM w.r.t. $\ell_2$ (eq. (3)) and the mini-batched ERM w.r.t. $\ell_3$ (eq. (5)), respectively.

A few comments on Theorem 2 are in order. The theorem shows that the sufficient conditions from Theorem 1 and Proposition 2 are indeed necessary, by exhibiting instances whose value is lower bounded by the corresponding terms. In particular, equation (7a) shows that the sequential Rademacher term is necessary, (7b) establishes the necessity of the dynamic stability of the regularized ERM, and (7c) shows that the mini-batching upper bound is also tight. It is worth noting that these lower bounds are not instance dependent; rather, they construct specific examples to demonstrate the tightness of our upper bounds from the previous section. We next present the key idea for the proof of the theorem and defer the complete details to Appendix C.

Proof sketch. We now describe the example instances which form the crux of the proof of Theorem 2.
Consider the online learning with dynamics game between a learner and an adversary with state space $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$ and set of adversarial actions $\mathcal{Z}^{\mathrm{Lin}}_\ell = \{z \in \mathbb{R}^d : \|z\|_2 \leq 1\}$. Further, we consider the constant policy class $\Pi_{\mathrm{Lin}} = \{\pi_f : \pi_f(x) = f \text{ for all states } x, \text{ with } f \in \mathbb{B}_d(1)\}$, consisting of policies $\pi_f$ which select the same action $f$ at each state $x$. With a slight abuse of notation, we represent the policy $\pi_t$ played by the learner at time $t$ by the corresponding $d$-dimensional vector $f_t$. Further, we let the dynamics function $\Phi_{\mathrm{Lin}}(x_t, f_t, \zeta_t) = f_t$. We now define the loss function, which consists of two parts, a linear loss and an $L$-Lipschitz loss involving the dynamics:

$$\ell_L(f_t, x_t, z_t) \;=\; \langle f_t, z_t \rangle + \sigma(f_t, x_t), \qquad \text{where} \qquad \sigma(f_t, x_t) \;=\; \begin{cases} L\, \|f_t - x_t\|_2 & \text{if } \|f_t - x_t\|_2 \leq \frac{1}{L}, \\ 1 & \text{otherwise}. \end{cases} \tag{8}$$

Observe that this construction yields a family of instances, one for each value of the Lipschitz constant $L$ of the function $\sigma$. For this family of instances, we establish that the value

$$\mathcal{V}_T(\Pi_{\mathrm{Lin}}, \mathcal{Z}, \Phi_{\mathrm{Lin}}, \ell_L) \;\gtrsim\; \begin{cases} \sqrt{T} & \text{for } 0 < L < 1, \\ \sqrt{LT} & \text{for } 1 \leq L \leq (4T)^{1/3}, \\ 2^{-1/3}\, T^{2/3} & \text{for } L > (4T)^{1/3}. \end{cases}$$

The proof finally connects these lower bounds to the upper bounds of Theorem 1 and Proposition 2.

With the lower bounds given in Theorem 2, it is natural to ask whether the sufficient conditions in Theorem 1 and Proposition 2 are indeed necessary for every instance of the online learning with dynamics problem. The answer is, unsurprisingly, no, given the generality in which we study this problem. Consider the following simple instance: $\Pi = \mathcal{X}$, $\ell(\pi, x, z) = \ell(\pi, z) + \mathbb{I}[\pi = x]$, and $x_{t+1} = \pi_t$, for any non-negative bounded loss $\ell(\pi, z) \in [0, 1]$ for all $\pi \in \Pi, z \in \mathcal{Z}_\ell$, and consider any policy class for which $\mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi} \circ \Pi) > 0$. Both bounds (4) and (6) suggest that the problem is learnable with rate at least $\mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi} \circ \Pi)$. However, observe that the indicator term in the loss is quite severe on the comparator; it ensures that the comparator term is at least $T$. Thus, any algorithm which selects a policy from $\Pi$ at every instance can ensure that the policy regret is at most $0$! While the above example establishes that the sufficient conditions are not necessary in an instance-dependent manner, our next proposition establishes that they are indeed tight for a large class of problem instances.

Proposition 3 (Instance-dependent lower bound). a) Given any online learning problem $(\mathcal{F}, \mathcal{Z}_\ell, \ell)$ with a bounded loss function $\ell : \mathcal{F} \times \mathcal{Z}_\ell \to [-1, 1]$, there exists an online learning with dynamics problem $(\Pi_{\mathcal{F}}, \mathcal{Z}_\ell \times \{-1, 1\}, \Phi, \tilde{\ell})$ and a universal constant $c > 0$ such that

$$\mathcal{V}_T(\Pi_{\mathcal{F}}, \mathcal{Z}_\ell \times \{-1, 1\}, \Phi, \tilde{\ell}) \;\geq\; c\, \inf_{\tau > 0} \Big( \sum_{t=1}^{T} \beta^{\tau}_{\mathrm{ERM},t} + 2\tau\, \mathfrak{R}^{\mathrm{seq}}_{T/\tau}(\tilde{\ell}^{\Phi} \circ \Pi) \Big),$$

where $\beta^{\tau}_{\mathrm{ERM},t}$ are the dynamic stability parameters of the mini-batched ERM w.r.t. $\tilde{\ell}$ (eq. (5)).

b) Given a policy class $\Pi$ and dynamics function $\Phi$, there exists an online learning with dynamics problem $(\Pi, \mathcal{Z}, \Phi, \ell)$ and a universal constant $c > 0$ such that $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell) \geq c\, \mathfrak{R}^{\mathrm{seq}}_T(\ell^{\Phi} \circ \Pi)$.

We defer the proof of the proposition to Appendix C. This proposition can be seen as a strengthening of the lower bounds (7a) and (7c), showing that for a very large class of problems, the upper bound given by the mini-batching algorithm and the sequential complexity term are in fact necessary.
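For concreteness, the Lipschitz loss family of equation (8) from the proof sketch above can be transcribed directly; the short sketch below (with illustrative vectors) shows how moving the played action away from the previous state trades the linear gain against the clipped tracking penalty.

```python
import math

def sigma(f, x, L):
    """Clipped tracking penalty from eq. (8): L * ||f - x||_2 up to the clip level 1/L, and 1 beyond it."""
    dist = math.dist(f, x)
    return L * dist if dist <= 1.0 / L else 1.0

def loss_L(f, x, z, L):
    """ell_L(f_t, x_t, z_t) = <f_t, z_t> + sigma(f_t, x_t); the dynamics then set x_{t+1} = f_t."""
    return sum(fi * zi for fi, zi in zip(f, z)) + sigma(f, x, L)

# One round in d = 2 with a large L: the previous action is x, the adversary plays z.
x, z, L = (-1.0, 0.0), (-1.0, 0.0), 10.0
print(loss_L((-1.0, 0.0), x, z, L))  # stay at the previous action: no tracking penalty, linear term +1
print(loss_L((1.0, 0.0), x, z, L))   # move to the linear minimizer: linear term -1 plus clipped penalty 1
```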
## 5 Examples

In this section, we look at specific examples of the online learning with dynamics problem and obtain learnability guarantees for these instances using our upper bounds from Theorem 1. For clarity of exposition, our focus in this section is on the scaling of the value $\mathcal{V}_T(\Pi, \mathcal{Z}, \Phi, \ell)$ with the time horizon $T$; the proofs in Appendix D explicitly detail all the problem-dependent parameters.

Example 1: Online Isotron with dynamics. Single Index Models (SIMs) are a class of semiparametric models widely studied in the econometrics and operations research communities. Kalai and Sastry [16] introduced the Isotron algorithm for learning SIMs, and Rakhlin et al. [23] established that the online version of this problem is learnable. Here, we introduce a version of this problem with a state variable that requires a component of the model to vary slowly across time:

$$\mathcal{X} = \mathbb{R}, \quad \mathcal{F} = \big\{ f = (\sigma, w = (w_1, \bar{w})) : \sigma : [-1, 1] \to [-1, 1] \text{ is } 1\text{-Lipschitz},\; w \in \mathbb{R}^{d+1},\; |w_1| \leq 1,\; \|\bar{w}\|_2 \leq 1 \big\},$$
$$\Pi_{\mathcal{F}} = \{\pi_f : \pi_f(x) = f \text{ for all } x \in \mathcal{X},\; f \in \mathcal{F}\}, \quad \mathcal{Z}_\ell = [-1, 1]^{d+1} \times [-1, 1], \quad \mathcal{Z}_\Phi = \varnothing,$$
$$\ell\big(\pi_f, x, z = (z_1, \bar{z}, y)\big) = \big(y - \sigma(\langle \bar{z}, \bar{w} \rangle)\big)^2 + (z_1 - w_1)^2 + (x - w_1)^2, \quad \Phi(x, \pi_{(\sigma, w)}, \zeta) = w_1, \quad \mathcal{D}_w = 0. \tag{9}$$

The following corollary establishes the learnability of the online Isotron problem with dynamics.

Corollary 1. For the online Isotron problem with dynamics given by $(\Pi_{\mathcal{F}}, \mathcal{Z}, \Phi, \ell)$ in equation (9), the minimax value satisfies $\mathcal{V}_{\mathrm{Iso},T}(\Pi_{\mathcal{F}}, \mathcal{Z}, \Phi, \ell) = O(\sqrt{T})$.

It is important to note that, as with the online learning version of this problem [23], it is not clear whether a computationally efficient method attaining the above guarantee exists.

Example 2: Online Markov decision processes [11]. This example considers the problem of online Markov decision processes (MDPs) studied in Even-Dar et al. [11]. The setup consists of a finite state space $\mathcal{X} = \mathcal{S}$, a finite action space $\mathcal{U} = \mathcal{A}$, and

$$\mathcal{Z}_\ell = \{z : z \in [0, 1]^{\mathcal{S} \times \mathcal{A}}\}, \quad \mathcal{Z}_\Phi = \varnothing, \quad \Pi_{\mathrm{MDP}} = \{\pi : \pi \in \mathcal{X} \to \Delta(\mathcal{U})\}, \quad \ell(\pi, x, z) = z(x, \pi(x)),$$
$$\Phi \text{ given by the transition kernel } P : \mathcal{X} \times \mathcal{U} \to \Delta(\mathcal{X}) \text{ with } x_{t+1} \sim P(\,\cdot \mid x_t, \pi_t(x_t)). \tag{10}$$

With this setup, we now provide a bound on the minimax value $\mathcal{V}_{\mathrm{MDP},T}$, assuming, as in [11], that the underlying MDP is unichain and satisfies a mixability assumption (see Appendix D for details).

Corollary 2. For the online MDP problem given by $(\Pi_{\mathrm{MDP}}, \mathcal{Z}, \Phi, \ell)$ in equation (10), the minimax value satisfies $\mathcal{V}_{\mathrm{MDP},T}(\Pi_{\mathrm{MDP}}, \mathcal{Z}, \Phi, \ell) = O(\sqrt{T})$.

The above corollary recovers the same $O(\sqrt{T})$ regret bound that was obtained by [11].

Example 3: Online Linear Quadratic Regulator [10]. The online Linear Quadratic Regulator (LQR) setup considered here was first studied in Cohen et al. [10]. It consists of an LQ system, with linear dynamics and quadratic costs, where the cost functions are chosen adversarially. The comparator class consists of a set of strongly stable linear policies (see Appendix D):

$$\mathcal{X} = \mathbb{R}^d, \quad \mathcal{Z}_\ell = \{(Q, R) : Q, R \succeq 0,\; \mathrm{tr}(Q), \mathrm{tr}(R) \leq C\}, \quad \mathcal{Z}_\Phi = \varnothing, \quad \Pi_{\mathrm{LQR}} = \{K \in \mathbb{R}^{k \times d} : K \text{ is } (\kappa, \gamma)\text{-strongly stable}\},$$
$$\ell(\pi_K, x, z) = x^\top Q x + (Kx)^\top R (Kx), \quad \Phi(x, K) = (A + BK)x, \quad \mathcal{D}_w = \mathcal{N}(0, I). \tag{11}$$

With this setup, we now establish the learnability of this problem in the following corollary.

Corollary 3. For the online LQR problem given by $(\Pi_{\mathrm{LQR}}, \mathcal{Z}, \Phi, \ell)$ in equation (11), the minimax value satisfies $\mathcal{V}_{\mathrm{LQR},T}(\Pi_{\mathrm{LQR}}, \mathcal{Z}, \Phi, \ell) = O(\sqrt{T})$.

Note that [10] obtained a similar policy regret bound of $O(\sqrt{T})$, but their analysis only worked for an oblivious adversary, whereas the guarantee of Corollary 3 holds for an adaptive adversary.
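A minimal simulation sketch for the LQR instance in equation (11): a stabilizing linear policy $K$ keeps the closed-loop state bounded, and the comparator term in the policy regret is the counterfactual cumulative cost obtained by rolling $x_{t+1} = (A + BK)x_t + w_t$ forward from the start. The matrices, the grid of policies, and the spectral-radius filter standing in for the $(\kappa, \gamma)$-strong-stability condition are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 2, 1, 200
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # illustrative open-loop dynamics
B = np.array([[0.0], [1.0]])

def counterfactual_cost(K, Qs, Rs, noise):
    """Roll x_{t+1} = (A + B K) x_t + w_t from x_1 = 0 and sum the costs x'Qx + (Kx)'R(Kx)."""
    x, total = np.zeros(d), 0.0
    for Q, R, w in zip(Qs, Rs, noise):
        u = K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + w
    return total

# Adversarially chosen quadratic costs (random PSD here, purely for illustration) and Gaussian noise.
Qs = [np.eye(d) * rng.uniform(0.1, 1.0) for _ in range(T)]
Rs = [np.eye(k) * rng.uniform(0.1, 1.0) for _ in range(T)]
noise = [0.1 * rng.normal(size=d) for _ in range(T)]

# A small grid of linear policies standing in for the (kappa, gamma)-strongly-stable comparator class.
grid = [np.array([[k1, k2]]) for k1 in np.linspace(-0.5, 0.0, 6) for k2 in np.linspace(-1.0, 0.0, 6)]
stable = [K for K in grid if np.max(np.abs(np.linalg.eigvals(A + B @ K))) < 0.95]

best = min(counterfactual_cost(K, Qs, Rs, noise) for K in stable)
print(f"best fixed stabilizing policy, counterfactual cumulative cost: {best:.2f}")
```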
Example 4: Online Control with Adversarial Disturbances [2]. In this example, we consider a simplified version of the setup from Agarwal et al. [2], where the adversary is allowed to perturb the dynamics at each time instance along with the loss functions. We consider the LQ version of the problem, where the dynamics are linear and the cost functions quadratic. As in the example above, the comparator class consists of a set of strongly stable linear policies (see Appendix D):

$$\mathcal{X} = \mathbb{R}^d, \quad \mathcal{Z}_\ell = \{(Q, R) : Q, R \succeq 0,\; \mathrm{tr}(Q), \mathrm{tr}(R) \leq C\}, \quad \mathcal{Z}_\Phi = \{\zeta : \|\zeta\|_2 \leq W\}, \quad \Pi_{\mathrm{LQR}} = \{K \in \mathbb{R}^{k \times d} : K \text{ is } (\kappa, \gamma)\text{-strongly stable}\},$$
$$\ell(\pi_K, x, z) = x^\top Q x + (Kx)^\top R (Kx), \quad \Phi(x, K, \zeta) = (A + BK)x + \zeta, \quad \mathcal{D}_w = \varnothing. \tag{12}$$

With this setup, we now establish the learnability of this problem in the following corollary.

Corollary 4. For the Online Control with Adversarial Disturbances problem instance $(\Pi_{\mathrm{LQR}}, \mathcal{Z}, \Phi, \ell)$ detailed in equation (12), the minimax value satisfies $\mathcal{V}_{\mathrm{adv},T}(\Pi_{\mathrm{LQR}}, \mathcal{Z}, \Phi, \ell) = O(\sqrt{T})$.

In contrast to Example 3 above, the disturbances in the dynamics are actually controlled by the adversary instead of being random. Our result above shows that this harder version of the problem is also learnable at an $O(\sqrt{T})$ rate. In addition to these four examples, in Appendix D we consider additional examples of the framework, including online adversarial tracking [1] and a non-linear generalization of the LQ problem [10, 2] which we call online non-linear control.

## Broader Impact

This work is mainly theoretical in nature and hopes to provide theoretical foundations for learning under dynamical systems. The work is expected to have a broader impact in the future by opening up research on learning and non-linear dynamical systems with complex policy classes. In the future, we hope that our work will enable ML systems to be deployed reliably in more reactive environments.

## Acknowledgments and Disclosure of Funding

We would like to thank Dylan Foster, Mehryar Mohri and Ayush Sekhari for helpful discussions. KB is supported by a JP Morgan AI Fellowship. KS would like to acknowledge support from NSF CAREER Award 1750575 and a Sloan Research Fellowship.

## References

[1] Y. Abbasi-Yadkori, P. Bartlett, and V. Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369-377, 2014.

[2] N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.

[3] N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175-10184, 2019.

[4] O. Anava, E. Hazan, and S. Mannor. Online learning for adversaries with memory: Price of past mistakes. In Advances in Neural Information Processing Systems, pages 784-792, 2015.

[5] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.

[7] S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In Conference on Learning Theory, 2009.

[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[9] L. Chen, Q. Yu, H. Lawrence, and A. Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. arXiv preprint arXiv:1910.10873, 2019.

[10] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. arXiv preprint arXiv:1806.07104, 2018.

[11] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.

[12] D. J. Foster and M. Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020.

[13] W. Han, A. Rakhlin, and K. Sridharan. Competing with strategies. In Conference on Learning Theory, pages 966-992, 2013.
[14] M. Hardt, T. Ma, and B. Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(1), 2018.

[15] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[16] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory, 2009.

[17] D. E. Kirk. Optimal Control Theory: An Introduction. Courier Corporation, 2004.

[18] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.

[19] L. Ljung. System identification. Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.

[20] N. Merhav, E. Ordentlich, G. Seroussi, and M. J. Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947-1958, 2002.

[21] A. Rakhlin and K. Sridharan. Statistical Learning and Sequential Prediction. Lecture Notes, 2014.

[22] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, 2010.

[23] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 16(2):155-186, 2015.

[24] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

[25] M. Simchowitz and D. J. Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.

[26] R. F. Stengel. Optimal Control and Estimation. Courier Corporation, 1994.

[27] A. S. Suggala and P. Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. arXiv preprint arXiv:1903.08110, 2019.

[28] V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16, 1971.