# Logarithmic Regret for Adversarial Online Control

Dylan J. Foster (Massachusetts Institute of Technology) and Max Simchowitz (UC Berkeley). Correspondence to: Dylan Foster. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR 119, 2020.

**Abstract.** We introduce a new algorithm for online linear-quadratic control in a known system subject to adversarial disturbances. Existing regret bounds for this setting scale as $\sqrt{T}$ unless strong stochastic assumptions are imposed on the disturbance process. We give the first algorithm with logarithmic regret for arbitrary adversarial disturbance sequences, provided the state and control costs are given by known quadratic functions. Our algorithm and analysis use a characterization for the optimal offline control law to reduce the online control problem to (delayed) online learning with approximate advantage functions. Compared to previous techniques, our approach does not need to control movement costs for the iterates, leading to logarithmic regret.

## 1. Introduction

Reinforcement learning and control consider the behavior of an agent making decisions in a dynamic environment in order to suffer minimal loss. In light of recent practical breakthroughs in data-driven approaches to continuous RL and control (Lillicrap et al., 2016; Mnih et al., 2015; Silver et al., 2017), there is great interest in applying these techniques in real-world decision making applications. However, to reliably deploy data-driven RL and control in physical systems such as self-driving cars, it is critical to develop principled algorithms with provable safety and robustness guarantees. At the same time, algorithms should not be overly pessimistic, and should be able to take advantage of benign environments whenever possible. In this paper we develop algorithms for online linear-quadratic control which ensure robust worst-case performance while optimally adapting to the environment at hand.

Linear control has traditionally been studied in settings where the dynamics of the environment are either governed by a well-behaved stochastic process or driven by a worst-case process to which the learner must remain robust in the $\mathcal{H}_\infty$ sense. We consider an intermediate approach introduced by Agarwal et al. (2019a) in which disturbances are non-stochastic but performance is evaluated in terms of regret. This benchmark forces the learner's control policy to achieve near-optimal performance on any specific disturbance process encountered.

Concretely, we consider a setting in which the state evolves according to linear dynamics:
$$x_{t+1} = Ax_t + Bu_t + w_t, \qquad (1)$$
where $x_t \in \mathbb{R}^{d_x}$ are states, $u_t \in \mathbb{R}^{d_u}$ are inputs, and $A \in \mathbb{R}^{d_x \times d_x}$ and $B \in \mathbb{R}^{d_x \times d_u}$ are system matrices known to the learner. We refer to $w_t \in \mathbb{R}^{d_x}$ as the disturbance (or "noise"), which we assume is selected by an adaptive adversary and satisfies $\|w_t\| \le 1$; we let $w$ refer to the entire sequence $w_{1:T}$. We consider fixed quadratic costs of the form $\ell(x,u) = x^\top R_x x + u^\top R_u u$, where $R_x, R_u \succ 0$ are given. This model encompasses noise which is uncorrelated ($\mathcal{H}_2$), worst-case ($\mathcal{H}_\infty$), or governed by some nonstationary stochastic process. The model also approximates control techniques such as feedback linearization and trajectory tracking (Slotine & Li, 1991), where $A$ and $B$ are the result of linearizing a known nonlinear system and the disturbances arise due to systematic errors in linearization rather than from a benign noise process.
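To make the interaction protocol concrete, here is a minimal simulation sketch (illustrative only, not from the paper) that rolls out the dynamics (1) under a fixed state-feedback law $u_t = -Kx_t$ and accumulates the total quadratic cost; the system matrices and the gain $K$ below are toy choices.

```python
import numpy as np

def rollout_cost(A, B, Rx, Ru, K, w, x0=None):
    """Total cost sum_t x_t' Rx x_t + u_t' Ru u_t of the policy u_t = -K x_t on disturbances w."""
    x = np.zeros(A.shape[0]) if x0 is None else x0
    cost = 0.0
    for w_t in w:                       # w is a (T, d_x) array of adversarial disturbances
        u = -K @ x                      # state-feedback control
        cost += x @ Rx @ x + u @ Ru @ u
        x = A @ x + B @ u + w_t         # linear dynamics, Eq. (1)
    return cost

# Toy double-integrator-style system with ||w_t|| <= 1; the gain K is illustrative only.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Rx, Ru = np.eye(2), np.eye(1)
K = np.array([[0.5, 1.0]])
w = np.random.uniform(-1.0, 1.0, size=(100, 2)) / np.sqrt(2)
print(rollout_cost(A, B, Rx, Ru, K, w))
```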
For any policy $\pi$ that selects controls based on the current state and disturbances observed so far, we measure its performance over a time horizon $T$ by $J_T(\pi;w) = \sum_{t=1}^T \ell(x^\pi_t, u^\pi_t)$, the total cost incurred by following $u_t = \pi_t(x_t, w_{1:t-1})$. Letting $\pi^K$ denote a state-feedback control law of the form $\pi^K_t(x) = -Kx$ for all $t$, the learning algorithm's goal is to minimize
$$\mathrm{Reg}_T = J_T(\pi^{\mathrm{alg}};w) - \inf_{K \in \mathcal{K}} J_T(\pi^K;w),$$
where $\pi^{\mathrm{alg}}$ denotes the learner's policy and $\mathcal{K}$ is an appropriately defined set of stabilizing controllers. Thus, $\pi^{\mathrm{alg}}$ has low regret when its performance nearly matches the optimal controller $K \in \mathcal{K}$ on the specific, realized noise sequence. While the class $\mathcal{K}$ contains the optimal $\mathcal{H}_\infty$ and $\mathcal{H}_2$ control policies, we also develop algorithms to compete with a more general class of stabilizing linear controllers, which may fare better for certain noise sequences (Appendix B).

**Logarithmic regret in online control.** Agarwal et al. (2019a) introduced the adversarial LQR setting we study and provided an efficient algorithm with $\sqrt{T}$-regret. Subsequent works (Agarwal et al., 2019b; Simchowitz et al., 2020) have shown that logarithmic regret is possible when the disturbances follow a semi-adversarial process with persistent excitation. Our main result is to achieve logarithmic regret for fully adversarial disturbances, provided that costs are known and quadratic.

### 1.1. Contributions

We introduce Riccatitron (Algorithm 1), a new algorithm for online linear control with adversarial disturbances which attains polylogarithmic regret.

**Theorem 1 (informal).** Riccatitron attains regret $O^\star(\log^3 T)$, where $O^\star(\cdot)$ hides factors polynomial in relevant problem parameters.

Riccatitron has comparable computational efficiency to previous methods. We show in Appendix B that the algorithm also extends to a more general benchmark class of linear controllers with internal state, and to tracking loss functions of the form $\ell_t(x,u) = \ell(x - a_t, u - b_t)$. Some conceptual contributions are as follows.

**When is logarithmic regret possible in online control?** Simchowitz & Foster (2020) and Cassel et al. (2020) independently show that logarithmic regret is impossible in a minimax sense if the system matrices $(A,B)$ are unknown, even when disturbances are i.i.d. Gaussian. Conversely, our result shows that if $A$ and $B$ are known, logarithmic regret is possible even when disturbances are adversarial. Together, these results paint a clear picture of when logarithmic regret is achievable in online linear control. We note, however, that our approach heavily leverages the structure of linear control with strongly convex, quadratic costs. We refer the reader to the related work section for discussion of further structural assumptions that facilitate logarithmic regret.

**Addressing trajectory mismatch.** Riccatitron represents a new approach to a problem we call trajectory mismatch that arises when considering policy regret in online learning problems with state. In dynamic environments, different policies inevitably visit different state trajectories. Low-regret algorithms must address the mismatch between the performance of the learner's policy $\pi^{\mathrm{alg}}$ on its own realized trajectory and the performance of each benchmark policy $\pi$ on the alternative trajectories it induces.
Most algorithms with policy regret guarantees (Even-Dar et al., 2009; Zimin & Neu, 2013; Abbasi-Yadkori et al., 2013; Arora et al., 2012; Anava et al., 2015; Abbasi-Yadkori et al., 2014; Cohen et al., 2018; Agarwal et al., 2019a;b; Simchowitz et al., 2020) adopt an approach to addressing this trajectory mismatch that we refer to as "online learning with stationary costs", or OLwS. At each round $t$, the learner's adaptive policy $\pi^{\mathrm{alg}}$ commits to a policy $\pi^{(t)}$, typically from a benchmark class $\Pi$. The goal is to ensure that the iterates $\pi^{(t)}$ attain low regret on a proxy sequence of stationary cost functions $\pi \mapsto \lambda_t(\pi)$ that describe the loss the learner would suffer at stage $t$ under the fictional trajectory that would arise if she had played the policy $\pi$ at all stages up to time $t$ (or, in some cases, on the corresponding steady-state trajectory as $t \to \infty$). Since the stationary cost does not depend on the learner's state, low regret on the sequence $\{\lambda_t\}$ can be obtained by feeding these losses directly into a standard online learning algorithm. To relate regret on the proxy sequence back to regret on the true sequence, most approaches use that the iterates produced by the online learner are sufficiently slow-moving. The main technical challenge Riccatitron overcomes is that for the stationary costs that arise in our setting, no known algorithm produces iterates which move sufficiently slowly to yield logarithmic regret via OLwS (Appendix C.4).

We adopt a new approach for online control we call online learning with advantages, or OLwA, which abandons stationary costs, and instead considers the control-theoretic advantages of actions relative to the unconstrained offline optimal policy $\pi^\star$. Somewhat miraculously, we find that these advantages remove the explicit dependence on the learner's state, thereby eliminating the issue of trajectory mismatch described above. In particular, unlike OLwS, we do not need to verify that the iterates produced by our algorithm change slowly.

### 1.2. Our approach: Online learning with advantages

In this section we sketch the online learning with advantages (OLwA) technique underlying Riccatitron. Let $\pi^\star$ denote the optimal unconstrained policy given knowledge of the entire disturbance sequence $w$, and let $Q^\star_t(x,u;w)$ be the associated Q-function (this quantity is formally defined in Definition 3). The advantage with respect to $\pi^\star$,
$$A^\star_t(u;x,w) = Q^\star_t(x,u;w) - Q^\star_t(x,\pi^\star_t(x;w);w),$$
describes the difference between the total cost accumulated by selecting action $u$ in state $x$ at time $t$ and subsequently playing according to the optimal policy $\pi^\star$, versus choosing $u_t = \pi^\star_t(x;w)$ as well. (Since we use losses rather than rewards, "advantage" here refers to the advantage of $\pi^\star$ over $u$ rather than the advantage of $u$ over $\pi^\star$; the latter terminology is more common in reinforcement learning.) By the well-known performance difference lemma (Kakade, 2003), the relative cost of a policy is equal to the sum of the advantages under the states visited by said policy:
$$J_T(\pi;w) - J_T(\pi^\star;w) = \sum_{t=1}^T A^\star_t(u^\pi_t; x^\pi_t, w). \qquad (2)$$
(See Lemma D.12 in Appendix E for a general statement of the performance difference lemma. The invocation of the performance difference lemma here is slightly different from other results on online learning in MDPs such as Even-Dar et al. (2009), in that the roles of $\pi$ and $\pi^\star$ are swapped.) With this observation, the regret $\mathrm{Reg}_T(\pi^{\mathrm{alg}};w,\Pi)$ of any algorithm $\pi^{\mathrm{alg}}$ to a policy class $\Pi$ can be expressed as:
$$\sum_{t=1}^T A^\star_t(u^{\mathrm{alg}}_t; x^{\mathrm{alg}}_t, w) - \inf_{\pi \in \Pi} \sum_{t=1}^T A^\star_t(u^\pi_t; x^\pi_t, w). \qquad (3)$$
The expression (3) suggests that a reasonable approach might be to run an online learner on the functions $\pi \mapsto A^\star_t(u^\pi_t; x^\pi_t, w)$. However, there are two issues.
First, the advantages in the first sum are evaluated on the states $x^{\mathrm{alg}}_t$ under $\pi^{\mathrm{alg}}$, and in the second sum under the comparator trajectories $x^\pi_t$ (trajectory mismatch). Second, like $\pi^\star$ itself, the advantages require knowledge of all future disturbances, which are not yet known to the learner at time $t$. We show that if the control policies are parametrized using a particular optimal control law, the advantages do not depend on the state, and can be approximated using only finite lookahead.

**Theorem 2 (informal).** For control policies $\pi$ with a suitable parametrization, the mapping $\pi \mapsto A^\star_t(u^\pi_t; x^\pi_t, w)$ can be arbitrarily well approximated by a function $\pi \mapsto \bar{A}_{t;h}(\pi; w_{1:t+h})$ which (1) does not depend on the state, (2) can be determined by the learner at time $t + h$, and (3) has a simple quadratic structure.

The magic behind this theorem is that the functional dependence of the unconstrained optimal policy $\pi^\star(x;w)$ on the state $x$ is linear, and does not depend on $w$ (Theorem 3). As a consequence, the state-dependent portion of $\pi^\star$ can be built into the controller parametrization, leaving only the $w$-dependent portion up to the online learner. In light of this result, we use online learning to ensure low regret on the sequence of loss functions $f_t(\pi) = \bar{A}_{t;h}(\pi; w_{1:t+h})$; we address the fact that $f_t$ is only revealed to the learner after a delay of $h$ steps via a standard reduction (Joulani et al., 2013). We then show that for an appropriate controller parameterization $f_t(\pi)$ is exp-concave with respect to the learner's policy, and hence second-order online learning algorithms attain logarithmic regret (Hazan et al., 2007).

We refer the reader to Appendix C for an in-depth overview of the OLwS framework, its relationship to OLwA, and challenges associated with using these techniques to achieve logarithmic regret.

### 1.3. Preliminaries

We consider the linear control setting in (1). For normalization, we assume $\|w_t\| \le 1$ for all $t$. We also assume $x_1 = 0$.

**Policies and trajectories.** We consider policies $\pi$ parameterized as functions of $x_t$ and $w$ via $u_t = \pi_t(x_t;w)$. We assume that, when selecting action $u_t$ at time $t$, the learner has access to all states $x_{1:t}$ and inputs $u_{1:t-1}$, as well as $w_{1:t-1}$ (the latter assumption is without loss of generality by the identity $w_s = x_{s+1} - Ax_s - Bu_s$). Thus, a policy is said to be executable if $\pi_t(x;w)$ depends only on $x$ and $w_{1:t-1}$, i.e. $\pi_t(x;w) = \pi_t(x;w_{1:t-1})$. For analysis purposes, we also consider non-executable policies whose value at time $t$ may depend on the entire sequence $w$. For a policy $\pi$ and sequence $w$, we let $x^\pi_t(w), u^\pi_t(w)$ denote the resulting state and input trajectories (which we note depend only on $w_{1:t-1}$). For simplicity, we often write $x^\pi_t$ and $u^\pi_t$, suppressing the $w$-dependence. We shall let $\pi^{\mathrm{alg}}$ refer to the policy selected by the learner's algorithm, and use the shorthand $x^{\mathrm{alg}}_t(w), u^{\mathrm{alg}}_t(w)$ to denote the corresponding trajectories. Given a class of policies $\Pi$, the regret of the policy $\pi^{\mathrm{alg}}$ is given by
$$\mathrm{Reg}_T(\pi^{\mathrm{alg}};\Pi,w) = J_T(\pi^{\mathrm{alg}};w) - \inf_{\pi \in \Pi} J_T(\pi;w).$$
We consider a benchmark class of policies induced by state feedback control laws $\pi^K_t(x) = -Kx$, indexed by matrices $K \in \mathbb{R}^{d_u \times d_x}$.

**Linear control theory.**
We say that a linear controller $K \in \mathbb{R}^{d_u \times d_x}$ is stabilizing if $A - BK$ is stable, that is, $\rho(A - BK) < 1$, where $\rho(\cdot)$ denotes the spectral radius (for a possibly asymmetric matrix $A$, $\rho(A) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}$). We assume the system $(A,B)$ is stabilizable in the sense that there exists a stabilizing controller $K$. For any stabilizable system, there is a unique positive semidefinite solution $P_\infty \succeq 0$ to the discrete algebraic Riccati equation (henceforth, DARE),
$$P_\infty = A^\top P_\infty A + R_x - A^\top P_\infty B(R_u + B^\top P_\infty B)^{-1} B^\top P_\infty A. \qquad (4)$$
The solution $P_\infty$ to (4) is an intrinsic property of the system (1) with $(A,B)$ and characterizes the optimal infinite-horizon cost for control in the absence of noise (Bertsekas, 2005). Our algorithms and analysis make use of this parameter, as well as the corresponding optimal state feedback controller $K_\infty = (R_u + B^\top P_\infty B)^{-1} B^\top P_\infty A$. We also use the steady-state covariance matrix $\Sigma_\infty = R_u + B^\top P_\infty B$ and the closed-loop dynamics matrix $A_{\mathrm{cl},\infty} = A - BK_\infty$.

**Competing with state feedback.** While $K_\infty$ represents the (asymptotically) optimal control law in the presence of uncorrelated, unbiased stochastic noise, $\pi^{K_\infty}$ may not be the optimal state feedback policy in hindsight for a given sequence of adversarial perturbations $w_t$. Hence, we compete with linear controllers that satisfy a quantitative version of the stability property.

**Definition 1 (Strong Stability (Cohen et al., 2018)).** We say that $A - BK \in \mathbb{R}^{d_x \times d_x}$ is $(\kappa,\gamma)$-strongly stable if there exist matrices $H, L \in \mathbb{R}^{d_x \times d_x}$ such that $A - BK = HLH^{-1}$, $\|H\|_{\mathrm{op}}\|H^{-1}\|_{\mathrm{op}} \le \kappa$, and $\|L\|_{\mathrm{op}} \le \gamma$.

Given parameters $(\kappa_0,\gamma_0)$, we consider the benchmark class
$$\mathcal{K}_0 = \{K : \|K\|_{\mathrm{op}} \le \kappa_0,\ A - BK \text{ is } (\kappa_0,\gamma_0)\text{-strongly stable}\}.$$
Lemma D.1 (Appendix D.1) shows that the closed-loop dynamics for $K_\infty$ are always $(\kappa_\infty,\gamma_\infty)$-strongly stable for suitable $\gamma_\infty, \kappa_\infty$. We assume that $\mathcal{K}_0$ is chosen such that $\kappa_\infty \le \kappa_0$ and $\gamma_\infty \le \gamma_0$ (this assumption only serves to keep notation compact). Our algorithms minimize policy regret to the class of induced policies for $\mathcal{K}_0$:
$$\mathcal{K}_0\text{-}\mathrm{Reg}_T(\pi^{\mathrm{alg}};w) = J_T(\pi^{\mathrm{alg}};w) - \inf_{K \in \mathcal{K}_0} J_T(\pi^K;w).$$

**Problem parameters.** Our regret bounds depend on the following basic parameters for the LQR problem:
$$\Psi = \max\{1, \|A\|_{\mathrm{op}}, \|B\|_{\mathrm{op}}, \|R_x\|_{\mathrm{op}}, \|R_u\|_{\mathrm{op}}\}, \quad \beta = \max\{1, \lambda_{\min}^{-1}(R_u), \lambda_{\min}^{-1}(R_x)\}, \quad \Gamma = \max\{1, \|P_\infty\|_{\mathrm{op}}\}.$$

**Additional notation.** We adopt non-asymptotic big-oh notation: for functions $f, g : \mathcal{X} \to \mathbb{R}_+$, we write $f = O(g)$ if there exists some constant $C > 0$ such that $f(x) \le Cg(x)$ for all $x \in \mathcal{X}$. We use $\tilde{O}(\cdot)$ to suppress logarithmic dependence on system parameters, and we use $O^\star(\cdot)$ to suppress all dependence on system parameters. For a vector $x \in \mathbb{R}^d$, we let $\|x\|$ denote the Euclidean norm and $\|x\|_\infty$ denote the element-wise $\ell_\infty$ norm. For a matrix $A$, we let $\|A\|_{\mathrm{op}}$ denote the operator norm. If $A$ is symmetric, we let $\lambda_{\min}(A)$ denote the minimum eigenvalue. When $P \succ 0$ is a positive definite matrix, we let $\|x\|_P = \sqrt{\langle x, Px \rangle}$ denote the induced weighted Euclidean norm. We let $w_{t-1} = (w_{t-1}, w_{t-2}, \ldots, w_1, 0, 0, \ldots)$ denote a sequence of past disturbances, terminating in an infinite sequence of zeros. To simplify indexing, we let $w_s \equiv 0$ for $s \le 0$, so that $w_{t-1} = (w_{t-1}, w_{t-2}, \ldots)$; we also let $w_s \equiv 0$ for $s > T$.
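As a concrete illustration of the quantities defined in this subsection, the following sketch (not part of the paper) computes the DARE solution $P_\infty$ of (4), the optimal gain $K_\infty$, the matrix $\Sigma_\infty$, and the closed-loop matrix $A_{\mathrm{cl},\infty}$, assuming SciPy's `solve_discrete_are` is available.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def riccati_infinite_horizon(A, B, Rx, Ru):
    P = solve_discrete_are(A, B, Rx, Ru)         # P_inf solves the DARE (4)
    Sigma = Ru + B.T @ P @ B                     # Sigma_inf = R_u + B' P_inf B
    K = np.linalg.solve(Sigma, B.T @ P @ A)      # K_inf = Sigma_inf^{-1} B' P_inf A
    return P, K, Sigma, A - B @ K                # last entry: A_cl,inf = A - B K_inf

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
P, K, Sigma, A_cl = riccati_infinite_horizon(A, B, np.eye(2), np.eye(1))
rho = max(abs(np.linalg.eigvals(A_cl)))          # spectral radius rho(A - B K_inf)
assert rho < 1, "A - B K_inf is stable whenever (A, B) is stabilizable"
```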
### 1.4. Organization

Section 2 introduces the Riccatitron algorithm, states its formal regret guarantee, and gives an overview of the algorithm's building blocks and proof techniques. Section 3 gives a high-level proof of the key approximate advantage theorem used by the algorithm. Omitted proofs are deferred to Appendix E and Appendix F, and additional technical tools are stated and proven in Appendix D. Appendix A gives a detailed survey of related work. Appendix B sketches extensions of Riccatitron to more general settings, and Appendix C gives a detailed survey of challenges associated with applying previous approaches to online reinforcement learning to obtain logarithmic regret in our setting.

## 2. Logarithmic regret for online linear control

Our main algorithm, Riccatitron, is described in Algorithm 1. The algorithm combines several ideas.

1. Following Agarwal et al. (2019a), we move from linear policies of the form $\pi^K(x;w) = -Kx$ to a relaxed set of disturbance-action (DAP) policies of the form $\pi^{(M)}_t(x;w) = -K_\infty x - q_M(w_{t-1})$, where $q_M(w_{t-1}) = \sum_{i=1}^m M^{[i]} w_{t-i}$, and where $K_\infty$ is the linear controller from the DARE (4).

2. We show that the optimal unconstrained policy with full knowledge of the sequence $w$ takes the form $\pi^\star_t(x;w) = -K_t x - q^\star_t(w_{t:T})$, where $(K_t)$ is a particular sequence of linear controllers that arises from the so-called Riccati recursion. We then show that for any policy of the form $\pi_t(x;w) = -K_\infty x - q_t(w)$ (in particular, for the DAP parameterization above), the advantage functions $A^\star_t(u^\pi_t; x^\pi_t, w)$ can be well approximated by simple quadratic functions of the form $\|q_t(w) - q^\star_t(w_{t:T})\|^2_{\Sigma_\infty}$. This essentially removes the learner's state from the equation, and reduces the problem of control to that of predicting the optimal controller's bias vector $q^\star_t(w_{t:T})$. The remaining challenge is that the optimal bias vectors depend on the future disturbances, which are not available to the learner at time $t$.

3. We show that the advantages can be truncated to require only finite lookahead, thereby reducing the problem to online learning with delayed feedback. We then apply a reduction from delayed online learning to classical online learning (Joulani et al., 2013), which proceeds by running multiple copies of a base online learning algorithm over separate subsequences of rounds.

4. Finally, using the structure of the disturbance-action parameterization, we show that the resulting online learning problem is exp-concave. As a result, we can use a second-order online learning algorithm, either online Newton step (ONS, Hazan et al. (2007)), given in Algorithm 2, or Vovk-Azoury-Warmuth (VAW, Vovk (1998); Azoury & Warmuth (2001)), given in Algorithm 3, as our base learner to obtain logarithmic regret.

**Algorithm 1: Riccatitron**

1. **Parameters:** horizon $h$, DAP length $m$, radius $R$, decay factor $\gamma$; online Newton parameters $\eta_{\mathrm{ons}}, \varepsilon_{\mathrm{ons}}$, or Vovk-Azoury-Warmuth parameter $\varepsilon_{\mathrm{vaw}}$.
2. **Initialize:** let $\mathcal{M}_0 \leftarrow \mathcal{M}(m,R,\gamma)$ (Eq. (5)). Instantiate base learners $\mathrm{BL}^{(1)}, \ldots, \mathrm{BL}^{(h+1)}$ as either $\mathrm{ONS}(\varepsilon_{\mathrm{ons}}, \eta_{\mathrm{ons}}, \mathcal{M}_0)$ or $\mathrm{VAW}(\varepsilon_{\mathrm{vaw}}, \mathcal{M}_0, \Sigma_\infty)$.
3. Let $\tau_t = (t-1) \bmod (h+1) + 1 \in [h+1]$.
4. For $t = 1, \ldots, T$:
   - *(Predict using base learner $\tau_t$.)* Let $M_t$ denote the $k_t$-th iterate produced by $\mathrm{BL}^{(\tau_t)}$, where $k_t = \lceil t/(h+1) \rceil$.
   - Play $u_t = -K_\infty x_t - q_{M_t}(w_{t-1})$ (Definition 2).
   - Observe $x_{t+1}$ and $w_t$.
   - *(Update base learner $\tau_{t+1}$.)* If $t \ge h+1$: update $\mathrm{BL}^{(\tau_{t+1})}$ with the approximate advantage $\bar{A}_{t-h;h}(M; w_{1:t})$ from Eq. (10).

Together, these components give rise to the scheme in Algorithm 1. At time $t$, the algorithm plays the action $u_t = -K_\infty x_t - q_{M_t}(w_{t-1})$, where $M_t$ is provided by the ONS (or VAW) instance responsible for the current round. The algorithm then observes $w_t$ and uses this to form the approximate advantage function for time $t - h$, where $h$ is the lookahead distance. The advantage is then used to update the ONS/VAW instance responsible for the next round.
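The following Python sketch mirrors the control loop of Algorithm 1 at a schematic level. It is illustrative only: the base learners (objects with `predict`/`update` methods), the bias map `q_M`, and the approximate-advantage constructor `approx_advantage` are assumed interfaces rather than the authors' code, and disturbances are passed in directly rather than recovered from $x_{t+1} - Ax_t - Bu_t$.

```python
import numpy as np

def riccatitron_loop(A, B, K_inf, base_learners, q_M, approx_advantage, disturbances, h):
    """Schematic round-robin loop of Algorithm 1 (assumed interfaces, see lead-in).

    base_learners: list of h+1 learners, each exposing .predict() -> M and .update(loss).
    q_M(M, w_hist): DAP bias q_M(w_{t-1}); approx_advantage(s, w_hist): loss for round s.
    """
    x = np.zeros(A.shape[0])
    w_hist = []                                     # w_1, ..., w_{t-1} observed so far
    for t in range(1, len(disturbances) + 1):
        tau = (t - 1) % (h + 1)                     # base learner responsible for round t
        M_t = base_learners[tau].predict()
        u = -K_inf @ x - q_M(M_t, w_hist)           # play u_t = -K_inf x_t - q_{M_t}(w_{t-1})
        x = A @ x + B @ u + disturbances[t - 1]     # observe x_{t+1}, hence w_t
        w_hist.append(disturbances[t - 1])
        if t >= h + 1:                              # the loss for round t-h is now available
            base_learners[t % (h + 1)].update(approx_advantage(t - h, w_hist))
    return x
```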
The main regret guarantee for this approach is as follows.

**Theorem 1.** For an appropriate choice of parameters, Riccatitron ensures $\mathcal{K}_0\text{-}\mathrm{Reg}_T \le O^\star(d_x d_u \log^3 T)$, where $O^\star$ suppresses polynomial dependence on system parameters. Suppressing only logarithmic dependence on system parameters, the regret is at most $O(d_x d_u \log^3 T \cdot \beta^{11} \Psi^{19} \Gamma^{11} \kappa_0^8 (1-\gamma_0)^{-4})$.

In the remainder of this section we overview the algorithmic building blocks of Riccatitron and the key ideas of the proof.

### 2.1. Disturbance-action policies

Cost functionals parametrized by state feedback controllers (e.g., $K \mapsto J_T(\pi^K;w)$) are generally non-convex (Fazel et al., 2018). To enable the use of tools from online convex optimization, we use a convex disturbance-action controller parameterization introduced by Agarwal et al. (2019a).

**Algorithm 2: Online Newton Step ($\mathrm{ONS}(\varepsilon, \eta, \mathcal{C})$)**

1. **Parameters:** learning rate $\eta > 0$, regularization parameter $\varepsilon > 0$, convex constraint set $\mathcal{C}$. (OCO with exp-concave costs $f_k(z)$, where $z \in \mathcal{C} \subseteq \mathbb{R}^d$.)
2. **Initialize:** $d \leftarrow \dim(\mathcal{C})$, $z_1 \in \mathcal{C}$, $E_0 \leftarrow \varepsilon I_d$.
3. For $k = 1, 2, \ldots$:
   - Play $z_k$ and receive gradient $\nabla_k = \nabla f_k(z_k)$.
   - $E_k \leftarrow E_{k-1} + \nabla_k \nabla_k^\top$.
   - $\tilde{z}_{k+1} \leftarrow z_k - \eta E_k^{-1} \nabla_k$.
   - $z_{k+1} \leftarrow \arg\min_{z \in \mathcal{C}} \|z - \tilde{z}_{k+1}\|^2_{E_k}$.

**Definition 2 (Disturbance-action policy (DAP)).** Let $M = (M^{[i]})_{i=1}^m$ denote a sequence of matrices $M^{[i]} \in \mathbb{R}^{d_u \times d_x}$. We define the corresponding disturbance-action policy $\pi^{(M)}$ as
$$\pi^{(M)}_t(x;w) = -K_\infty x - q_M(w_{t-1}), \quad \text{where} \quad q_M(w_{t-1}) = \sum_{i=1}^m M^{[i]} w_{t-i}.$$

We work with DAPs for which the sequence $M$ belongs to the set
$$\mathcal{M}(m,R,\gamma) = \{M = (M^{[i]})_{i=1}^m : \|M^{[i]}\|_{\mathrm{op}} \le R\gamma^{i-1}\}, \qquad (5)$$
where $m$, $R$, and $\gamma$ are algorithm parameters. We note that DAPs can be defined with general stabilizing controllers $K \neq K_\infty$, but the choice $K = K_\infty$ is critical in the design and analysis of our main algorithm.

The first lemma we require is a variant of a result of Agarwal et al. (2019a), which shows that disturbance-action policies are sufficiently rich to approximate all state feedback laws.

**Lemma 2.1 (Expressivity of DAP).** Suppose we choose our set of disturbance-action matrices as $\mathcal{M}_0 = \mathcal{M}(m, R, \gamma_0)$, where $m = (1-\gamma_0)^{-1} \log((1-\gamma_0)^{-1} T)$ and $R = 2\beta \Psi^2 \Gamma \kappa_0^2$. Then for all $w$, we have
$$\inf_{M \in \mathcal{M}_0} J_T(\pi^{(M)};w) \le \inf_{K \in \mathcal{K}_0} J_T(\pi^K;w) + C_{\mathrm{apx}},$$
where $C_{\mathrm{apx}} \le O(\beta^2 \Psi^8 \Gamma^2 \kappa_0^7 (1-\gamma_0)^{-2})$.

We refer the reader to Appendix E.2 for a proof. Going forward, we define
$$D_q = O(\beta^{5/2} \Psi^3 \Gamma^{5/2} \kappa_0^2 (1-\gamma_0)^{-1}), \qquad (6)$$
which serves as an upper bound on $\|q_M(w_{t-1})\|$ for $M \in \mathcal{M}_0$, as well as on certain other bias vector sequences that arise in the subsequent analysis. In light of Lemma 2.1, the remainder of our discussion will directly bound regret with respect to DAPs:
$$\mathcal{M}_0\text{-}\mathrm{Reg}_T(\pi;w) = J_T(\pi;w) - \inf_{M \in \mathcal{M}_0} J_T(\pi^{(M)};w).$$
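For concreteness, here is a minimal numpy sketch of the Online Newton Step base learner from Algorithm 2. It is an illustration rather than the authors' implementation: the loss is reported through its gradient at the current iterate, and the generalized projection onto the constraint set (taken in the $\|\cdot\|_{E_k}$ norm in Algorithm 2) is left as a user-supplied callable, since for the DAP set $\mathcal{M}(m,R,\gamma)$ it is problem-specific.

```python
import numpy as np

class OnlineNewtonStep:
    """Sketch of ONS (Algorithm 2) with an assumed predict/update interface."""

    def __init__(self, dim, eta, eps, project=lambda z, E: z):
        self.z = np.zeros(dim)          # current iterate z_k (assumed to lie in C)
        self.E = eps * np.eye(dim)      # E_0 = eps * I_d
        self.eta = eta
        self.project = project          # (z, E) -> argmin_{z' in C} ||z' - z||_E^2

    def predict(self):
        return self.z

    def update(self, grad):
        """grad = gradient of the k-th exp-concave loss f_k at the played iterate z_k."""
        self.E += np.outer(grad, grad)                           # E_k = E_{k-1} + grad grad'
        z_half = self.z - self.eta * np.linalg.solve(self.E, grad)
        self.z = self.project(z_half, self.E)                    # generalized projection step
```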
We note in passing that DAPs are actually rich enough to compete with a broader class of linear control policies with internal state; this extension is addressed in Appendix B.2.

### 2.2. Advantages in linear control

To proceed, we adopt the OLwA paradigm, which minimizes approximations to the advantages (or, differences between the Q-functions) relative to the optimal unconstrained policy $\pi^\star$ given access to the entire sequence $w$. Recalling $\ell(x,u) = \|x\|^2_{R_x} + \|u\|^2_{R_u}$, we define the optimal controller $\pi^\star$ and associated Q-functions and advantages by induction.

**Definition 3.** The optimal Q-function and policy at time $T$ are given by $Q^\star_T(x,u;w) = \ell(x,u)$, $\pi^\star_T(x;w) = \arg\min_u Q^\star_T(x,u;w) = 0$, and $V^\star_T(x;w) = \ell(x,0) = \|x\|^2_{R_x}$. For each timestep $t < T$, the optimal Q-function and policy are given by
$$Q^\star_t(x,u;w) = \|x\|^2_{R_x} + \|u\|^2_{R_u} + V^\star_{t+1}(Ax + Bu + w_t; w),$$
$$\pi^\star_t(x;w) = \arg\min_{u \in \mathbb{R}^{d_u}} Q^\star_t(x,u;w), \qquad V^\star_t(x;w) = \min_{u \in \mathbb{R}^{d_u}} Q^\star_t(x,u;w) = Q^\star_t(x, \pi^\star_t(x;w); w).$$
The advantage function for the optimal policy is
$$A^\star_t(u;x,w) = Q^\star_t(x,u;w) - Q^\star_t(x, \pi^\star_t(x;w); w).$$

The advantage function $A^\star_t(u;x,w)$ represents the total excess cost incurred by selecting a control $u \neq \pi^\star_t(x;w)$ at state $x$ and time $t$, assuming we follow $\pi^\star$ for the remaining rounds. We have $A^\star_t(u;x,w) \ge 0$ since, by Bellman's optimality condition, $\pi^\star_t(x;w)$ is a minimizer of $Q^\star_t(x,\cdot;w)$.

The advantages arise in our setting through application of the performance difference lemma (Lemma D.12), which we recall states that for any policy $\pi$, the regret to $\pi^\star$ is equal to the sum of advantages under the trajectory induced by $\pi$, i.e. $J_T(\pi;w) - J_T(\pi^\star;w) = \sum_{t=1}^T A^\star_t(u^\pi_t; x^\pi_t, w)$. To analyze Riccatitron, we apply this identity to obtain the regret decomposition
$$\mathcal{M}_0\text{-}\mathrm{Reg}_T(\pi;w) = \sum_{t=1}^T A^\star_t(u^\pi_t; x^\pi_t, w) - \inf_{M \in \mathcal{M}_0} \sum_{t=1}^T A^\star_t(u^{\pi^{(M)}}_t; x^{\pi^{(M)}}_t, w). \qquad (8)$$
This decomposition is exact, and avoids the pitfalls of the usual stationary cost-based regret decomposition associated with the classical OLwS approach (cf. Appendix C). Our goal going forward will be to treat these advantages as losses that can be fed into an appropriate online learning algorithm to select controls. However, this approach presents three challenges: (a) the advantages for the policy $\pi$ are evaluated on the trajectory $x^\pi_t$, while the advantages for the comparator are evaluated under the trajectory induced by $\pi^{(M)}$; (b) the advantage is a difference in Q-functions that accounts for all future costs. In particular, $A^\star_t(\cdot;\cdot,w)$ depends on all future disturbances, including those not yet revealed to the learner; (c) the functional form of the advantages is opaque, and it is not clear that any online learning algorithm can achieve logarithmic regret even if it were able to evaluate $A^\star_t$ at time $t$.

### 2.3. Approximate advantages

Our main structural result, and the starting point for Riccatitron, is the following observation. Let $\pi$ be any policy of the form $\pi_t(x;w) = -K_\infty x - q_{M_t}(w_{t-1})$, where $M_t = M_t(w_{t-1})$ are arbitrary functions of past disturbances, and where $K_\infty$ is the infinite-horizon Riccati optimal controller. Then $A^\star_t(u^\pi_t; x^\pi_t, w)$ is well-approximated by an approximate advantage function $\bar{A}_{t;h}(M; w_{1:t+h})$ which (a) does not depend on the state, (b) depends on only a small horizon $h$ of future disturbances, and (c) is a pure quadratic function of $M$, and thereby amenable to fast (logarithmic) rates for online learning.

Let $h$ be a horizon/lookahead parameter. Defining
$$q^\star_{;h}(w_{1:h+1}) = \sum_{i=1}^{h+1} \Sigma_\infty^{-1} B^\top (A_{\mathrm{cl},\infty}^\top)^{i-1} P_\infty w_i, \qquad (9)$$
the approximate advantage function is
$$\bar{A}_{t;h}(M; w_{1:t+h}) = \|q_M(w_{t-1}) - q^\star_{;h}(w_{t:t+h})\|^2_{\Sigma_\infty}. \qquad (10)$$
The following theorem facilitates the use of the approximate advantages.

**Theorem 2.** Let $\pi$ be any policy of the form $\pi_t(x;w) = -K_\infty x - q_{M_t}(w_{t-1})$, where $M_t = M_t(w) \in \mathcal{M}_0$. Then, by choosing $h = 2(1-\gamma_\infty)^{-1} \log(\kappa_\infty^2 \beta^2 \Psi \Gamma^2 T^2)$ as the horizon parameter, we have
$$\sum_{t=1}^T \big| A^\star_t(u^\pi_t; x^\pi_t, w) - \bar{A}_{t;h}(M_t; w_{1:t+h}) \big| \le C_{\mathrm{adv}},$$
where $C_{\mathrm{adv}} = O(\beta^{11} \Psi^{19} \Gamma^{11} \kappa_0^8 (1-\gamma_0)^{-4} \log^2 T)$.

The proof of this theorem constitutes a primary technical contribution of our paper, and is proven in Section 3. Briefly, the idea behind the result is to use that the optimal policy $\pi^\star$ itself satisfies $\pi^\star_t(x;w) \approx -K_\infty x - q^\star_{;h}(w_{t:t+h})$ whenever $h$ is sufficiently large and $t \le T - O^\star(\log T)$, and that $A^\star_t$ has a simple quadratic structure. This characterization is why it is essential to consider advantages with respect to the optimal policy $\pi^\star$, and why our DAPs use the controller $K_\infty$ as opposed to an arbitrary stabilizing controller as in Agarwal et al. (2019a).
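As an illustration of Eqs. (9) and (10), the following sketch (helper names are ours, not the paper's) computes the lookahead bias $q^\star_{;h}$ and the approximate advantage for a given DAP parameter $M$; it could play the role of the `q_M`/`approx_advantage` callables assumed in the earlier loop sketch.

```python
import numpy as np

def q_M(M, w_past):
    """q_M(w_{t-1}) = sum_{i=1}^m M[i] w_{t-i}, with w_past = (w_1, ..., w_{t-1})."""
    out = np.zeros(M[0].shape[0])
    for i, M_i in enumerate(M, start=1):
        if len(w_past) >= i:
            out += M_i @ w_past[-i]
    return out

def q_star_h(P, K, Sigma, A, B, w_future):
    """Eq. (9): q*_{;h}(w_1,...,w_{h+1}) = sum_i Sigma^{-1} B' (A_cl')^{i-1} P w_i."""
    A_cl = A - B @ K
    out = np.zeros(B.shape[1])
    power = np.eye(A.shape[0])                     # (A_cl')^{i-1}, starting from the identity
    for w_i in w_future:                           # w_future = (w_t, ..., w_{t+h})
        out += np.linalg.solve(Sigma, B.T @ power @ P @ w_i)
        power = power @ A_cl.T
    return out

def approx_advantage(M, w_past, w_future, P, K, Sigma, A, B):
    """Eq. (10): squared Sigma_inf-norm of q_M(w_{t-1}) - q*_{;h}(w_{t:t+h})."""
    v = q_M(M, w_past) - q_star_h(P, K, Sigma, A, B, w_future)
    return v @ Sigma @ v
```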
### 2.4. Online learning with delays

An immediate consequence of Theorem 2 is that for any algorithm (in particular, Riccatitron) which selects $\pi_t(x;w) = -K_\infty x - q_{M_t}(w_{t-1})$, the regret $\mathcal{M}_0\text{-}\mathrm{Reg}_T(\pi;w)$ is at most
$$\sum_{t=1}^T \bar{A}_{t;h}(M_t; w_{1:t+h}) - \inf_{M \in \mathcal{M}_0} \sum_{t=1}^T \bar{A}_{t;h}(M; w_{1:t+h}) + 2C_{\mathrm{adv}}. \qquad (11)$$
This is simply an online convex optimization problem with $M_1, \ldots, M_T$ as iterates; the only catch is that the loss at time $t$, $\bar{A}_{t;h}(M_t; w_{1:t+h})$, can only be evaluated after observing $w_{t+h}$, which is not revealed to the learner until after round $t+h$. This is therefore an instance of online learning with delays: namely, the loss function suffered at time $t$ is only available at time $t + h + 1$ (since $w_t$ is revealed at time $t+1$).

To reduce the problem of minimizing regret on the approximate advantages in (11) to classical online learning without delays, we use a simple black-box reduction. Consider a generic online convex optimization setting where, at each time $t$, the learner proposes an iterate $z_t$, then suffers cost $f_t(z_t)$ and observes $f_t$ (or some function of it). Suppose we have an algorithm for this non-delayed setting that guarantees that for every sequence, $\sum_{t=1}^T f_t(z_t) - \inf_{z \in \mathcal{C}} \sum_{t=1}^T f_t(z) \le R(T)$, where $R$ is increasing in $T$. Now consider the same setting with delay $h$, and let $\tau(t) = (t-1) \bmod (h+1) + 1 \in [h+1]$. We use the following strategy: make $h+1$ copies of the base algorithm. At round $t$, predict $z_t$ using the output of instance $\tau(t)$, then update instance $\tau(t+1)$ using the loss $f_{t-h}(z_{t-h})$ (which is now available).

**Lemma 2.2 (cf. Joulani et al. (2013)).** The generic delayed online learning reduction has regret at most
$$\sum_{t=1}^T f_t(z_t) - \inf_{z \in \mathcal{C}} \sum_{t=1}^T f_t(z) \le (h+1)\, R(T/(h+1)),$$
where $R(T)$ is the regret of the base instance.

Lemma 2.2 shows that minimizing the regret in (11) is as easy as minimizing regret in the non-delayed setting, up to a factor of $h + 1 = O^\star(\log T)$. For completeness, we provide a proof in Appendix E.4. All that remains is to specify the base algorithm for the reduction.
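The following small sketch (generic and not specific to control) spells out the reduction behind Lemma 2.2: it maintains $h+1$ independent copies of a non-delayed base learner and routes each delayed loss to the copy that produced the corresponding iterate; the `predict`/`update` interface is assumed.

```python
def delayed_to_nondelayed(make_base_learner, h):
    """Black-box delay reduction: h+1 copies of a non-delayed learner, used round-robin."""
    copies = [make_base_learner() for _ in range(h + 1)]

    def predict(t):
        # The iterate z_t comes from copy tau(t), where tau(t) = (t-1) mod (h+1) + 1.
        return copies[(t - 1) % (h + 1)].predict()

    def update(t, loss_t_minus_h):
        # At round t >= h+1 the loss f_{t-h} becomes available; it belongs to copy
        # tau(t-h), which coincides with tau(t+1) since (t-h-1) mod (h+1) = t mod (h+1).
        copies[(t - h - 1) % (h + 1)].update(loss_t_minus_h)

    return predict, update
```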
### 2.5. Exp-concave online learning

We have reduced the problem of obtaining logarithmic regret for online control to obtaining logarithmic regret for online learning with approximate advantages of the form in (11). A sufficient condition for fast rates in online learning is strong convexity of the loss (Hazan, 2016), but while the advantages $\bar{A}_{t;h}(M; w_{1:t+h})$ are strongly convex with respect to $q_M(w_{t-1})$, they are not strongly convex with respect to the parameter $M$ itself. Fortunately, logarithmic regret can also be achieved for loss functions that satisfy a weaker condition called exp-concavity (Hazan et al., 2007; Cesa-Bianchi & Lugosi, 2006).

**Definition 4.** A function $f : \mathcal{C} \to \mathbb{R}$ is $\alpha$-exp-concave if $\nabla^2 f(z) \succeq \alpha (\nabla f(z))(\nabla f(z))^\top$ for all $z \in \mathcal{C}$.

Intuitively, an exp-concave function $f$ exhibits strong curvature along the directions of its gradient, which are precisely the directions along which $f$ is sensitive to change. This property holds for linear regression-type losses, as the following standard lemma (Appendix E.4) shows.

**Lemma 2.3.** Let $A \in \mathbb{R}^{d_1 \times d_2}$, and consider the function $f(z) = \|Az - b\|^2_\Sigma$, where $\Sigma \succeq 0$. If we restrict to $z \in \mathbb{R}^{d_2}$ for which $f(z) \le R$, then $f$ is $(2R)^{-1}$-exp-concave.

Observe that the approximate advantage functions $\bar{A}_{t;h}(M; w_{1:t+h})$ indeed have the form $f(z) = \|Az - b\|^2_\Sigma$ (viewing the map $M \mapsto q_M(w_{t-1})$ as a linear operator), and thus satisfy exp-concavity for appropriate $\alpha > 0$. To take advantage of this property we use online Newton step (ONS, Algorithm 2), a second-order online convex optimization algorithm which guarantees logarithmic regret for exp-concave losses.

**Lemma 2.4 (Hazan (2016)).** Suppose that $\sup_{z,z' \in \mathcal{C}} \|z - z'\| \le D$, $\sup_{z \in \mathcal{C}} \|\nabla f_k(z)\| \le G$, and that each loss $f_k$ is $\alpha$-exp-concave. Then by setting $\eta = 2\max\{4GD, \alpha^{-1}\}$ and $\varepsilon = \eta^2/D$, the online Newton step algorithm guarantees
$$\sum_{k=1}^T f_k(z_k) - \inf_{z \in \mathcal{C}} \sum_{k=1}^T f_k(z) \le 5(\alpha^{-1} + GD)\, d \log T.$$

**Putting everything together.** With the regret decomposition in terms of approximate advantages (Theorem 2) and the black-box reduction for online learning with delays (Lemma 2.2), the design and analysis of Riccatitron (Algorithm 1) is rather simple. In view of Lemma 2.1, we initialize the set $\mathcal{M}_0$ to be sufficiently large to compete with the appropriate state-feedback controllers (Line 2). Using Theorem 2, our goal is to obtain a regret bound for the approximate advantages in (11). In view of the delayed online learning reduction (Lemma 2.2), we initialize $h+1$ base online learners (Line 2). Since the approximate advantages $\bar{A}_t$ are pure quadratics, we use online Newton step for the base learner, which ensures logarithmic regret via Lemma 2.4.

### 2.6. Sharpening the regret bound

With online Newton step as the base algorithm, Riccatitron has regret $O^\star(d_x d_u \cdot \sqrt{d_x d_u}\, \log^3 T)$. The $d_x d_u$ factor comes from the hard dependence on $\dim(\mathcal{C})$ in the ONS regret bound (Lemma 2.4), while the $\sqrt{d_x d_u}$ factor is an upper bound on the Frobenius norm for each $M \in \mathcal{M}_0$. We can obtain improved dimension dependence by replacing ONS with a vector-valued variant of the classical Vovk-Azoury-Warmuth algorithm (VAW), described in Algorithm 3 (Appendix E.3). The VAW algorithm goes beyond the generic exp-concave online learning setting and exploits the quadratic structure of the approximate advantages. Theorem 5 in Appendix E.3 shows that its regret depends only logarithmically on the Frobenius norm of the parameter vectors, so it avoids the $\sqrt{d_x d_u}$ factor paid by ONS (up to a log term). This leads to a final regret bound of $O^\star(d_x d_u \log^3 T)$ for Riccatitron. The runtime for both algorithms is identical. The calculation for the final regret bound is carried out in Appendix E.1.

## 3. Advantages without states

We now prove the key approximate advantage theorem (Theorem 2) used in the analysis of Riccatitron. The roadmap for the proof is as follows:

1. In Section 3.1, we show that the unconstrained optimal policy takes the form $\pi^\star_t(x;w) = -K_t x - q^\star_t(w)$, where $q^\star_t(w)$ depends on all future disturbances, and where $K_t$ is the finite-horizon solution to the Riccati recursion (Definition 5).

2. Next, Section 3.2 presents an intermediate version of the approximate advantage theorem for policies of the form $\bar{\pi}_t(x;w) = -K_t x - q_{M_t}(w_{t-1})$. Because any such policy has the same state dependence as the optimal policy $\pi^\star$, we are able to show that $A^\star_t(u^{\bar{\pi}}_t; x^{\bar{\pi}}_t, w)$ has no state dependence. Moreover, the linear structure of the dynamics and quadratic structure of the losses ensure that $A^\star_t(u^{\bar{\pi}}_t; x^{\bar{\pi}}_t, w)$ is a quadratic of the form $\|q_{M_t}(w_{t-1}) - q^\star_t(w_{t:T})\|^2_{\Sigma_t}$, where $\Sigma_t$ is a finite-horizon approximation to $\Sigma_\infty$, and $q^\star_t(w_{t:T})$ is the bias vector of the optimal controller.

3. Finally (Section 3.3), we use stability of the Riccati recursion to show that $q^\star_t(w)$ can be replaced with a term that depends only on $w_{t:t+h}$, up to a small error. Similarly, we show that $\Sigma_t$ can be replaced by $\Sigma_\infty$ and $K_t$ by $K_\infty$.
This argument implies that a slightly modified analogue of Riccatitron which replaces the infinite-horizon quantities ($K_\infty$, $\Sigma_\infty$, ...) with finite-horizon analogues from the Riccati recursion attains a similar regret. We state Riccatitron with the infinite-horizon analogues to simplify presentation, as well as implementation.

### 3.1. A closed form for the true optimal policy

Our first result characterizes the unconstrained optimal controller $\pi^\star$ given full knowledge of the disturbance sequence $w$, as well as the corresponding value function. To begin, we introduce a variant of the classical Riccati recursion.

**Definition 5 (Riccati recursion).** Define $P_{T+1} = 0$ and $c_{T+1} = 0$ and consider the recursion:
$$P_t = R_x + A^\top P_{t+1} A - A^\top P_{t+1} B \Sigma_t^{-1} B^\top P_{t+1} A, \qquad \Sigma_t = R_u + B^\top P_{t+1} B, \qquad K_t = \Sigma_t^{-1} B^\top P_{t+1} A,$$
$$c_t(w_{t:T}) = (A - BK_t)^\top \big( P_{t+1} w_t + c_{t+1}(w_{t+1:T}) \big).$$
We also define corresponding closed-loop matrices via $A_{\mathrm{cl},t} = A - BK_t$.

For i.i.d. disturbances with $\mathbb{E}[w_t] = 0$ for all times $t$, the optimal controller is the state feedback law $\pi_t(x) = -K_t x$, and $K_t \to K_\infty$ as $T - t \to \infty$. The following theorem shows that for arbitrary disturbances the optimal controller applies the same state feedback law, but with an extra bias term that depends on the disturbance sequence.

**Theorem 3.** The optimal controller is given by $\pi^\star_t(x;w) = -K_t x - q^\star_t(w_{t:T})$, where
$$q^\star_t(w_{t:T}) = \sum_{i=t}^{T-1} \Sigma_t^{-1} B^\top \Big( \prod_{j=t+1}^{i} A_{\mathrm{cl},j}^\top \Big) P_{i+1} w_i. \qquad (12)$$
Moreover, for each time $t$ we have
$$V^\star_t(x;w) = \|x\|^2_{P_t} + 2\langle x, c_t(w_{t:T}) \rangle + f_t(w_{t:T}), \qquad (13)$$
where $f_t$ is a function that does not depend on the state $x$.

Theorem 3 is a special case of a more general result, Theorem 4, proven in Appendix D.

### 3.2. Removing the state

We now use the characterization of $\pi^\star$ to show that the advantages $A^\star_t(u^{\bar{\pi}}_t; x^{\bar{\pi}}_t, w)$ have a particularly simple structure when we consider policies of the form $\bar{\pi}_t(x;w) = -K_t x - q_t(w)$, where $q_t(w)$ is an arbitrary function of $w$. For such policies, $A^\star_t$ is a quadratic function which does not depend explicitly on the state.

**Lemma 3.1.** Consider a policy $\bar{\pi}$ of the form $\bar{\pi}_t(x;w) = -K_t x - q_t(w)$. Then, for all $x$,
$$A^\star_t(\bar{\pi}_t(x;w); x, w) = \|q_t(w) - q^\star_t(w_{t:T})\|^2_{\Sigma_t}.$$

*Proof.* Since $Q^\star_t(x,\cdot;w)$ is a strongly convex quadratic, and since $\pi^\star_t(x;w) = \arg\min_{u \in \mathbb{R}^{d_u}} Q^\star_t(x,u;w)$, first-order optimality conditions imply that for any $u$,
$$A^\star_t(u;x,w) = Q^\star_t(x,u;w) - Q^\star_t(x, \pi^\star_t(x;w); w) = \|u - \pi^\star_t(x;w)\|^2_{\frac{1}{2}\nabla^2_u Q^\star_t(x,u;w)}.$$
A direct computation based on (13) reveals that $\frac{1}{2}\nabla^2_u Q^\star_t(x,u;w) = R_u + B^\top P_{t+1} B = \Sigma_t$, so that $A^\star_t(u;x,w) = \|u - \pi^\star_t(x;w)\|^2_{\Sigma_t}$. Finally, since $\pi^\star_t(x;w) = -K_t x - q^\star_t(w_{t:T})$, we have that if $u = \bar{\pi}_t(x;w) = -K_t x - q_t(w)$, then the state terms in the expression $u - \pi^\star_t(x;w)$ cancel, leaving $u - \pi^\star_t(x;w) = -(q_t(w) - q^\star_t(w_{t:T}))$.
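To make Definition 5 and Theorem 3 concrete, here is a small illustrative sketch (not the authors' code) of the backward Riccati recursion and of the optimal bias vectors $q^\star_t(w_{t:T})$ from Eq. (12).

```python
import numpy as np

def riccati_recursion(A, B, Rx, Ru, T):
    """Backward recursion of Definition 5: returns P_t, Sigma_t, K_t for t = 1, ..., T."""
    dx = A.shape[0]
    Ps, Sigmas, Ks = [None] * (T + 1), [None] * (T + 1), [None] * (T + 1)
    P_next = np.zeros((dx, dx))                       # P_{T+1} = 0
    for t in range(T, 0, -1):
        Sigma = Ru + B.T @ P_next @ B                 # Sigma_t = R_u + B' P_{t+1} B
        K = np.linalg.solve(Sigma, B.T @ P_next @ A)  # K_t = Sigma_t^{-1} B' P_{t+1} A
        P_t = Rx + A.T @ P_next @ A - A.T @ P_next @ B @ K
        Ps[t], Sigmas[t], Ks[t] = P_t, Sigma, K
        P_next = P_t
    return Ps, Sigmas, Ks

def optimal_bias(t, w, A, B, Ps, Sigmas, Ks):
    """Eq. (12): q*_t(w_{t:T}) = sum_{i=t}^{T-1} Sigma_t^{-1} B' (A_cl,{t+1}'...A_cl,i') P_{i+1} w_i."""
    T = len(w)                                        # w = (w_1, ..., w_T); w[i-1] is w_i
    acc = np.zeros(B.shape[1])
    prod = np.eye(A.shape[0])                         # empty product for i = t
    for i in range(t, T):                             # i = t, ..., T-1
        acc += np.linalg.solve(Sigmas[t], B.T @ prod @ Ps[i + 1] @ w[i - 1])
        prod = prod @ (A - B @ Ks[i + 1]).T           # append A_cl,{i+1}' to the product
    return acc
```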
### 3.3. Truncating the future and passing to infinite horizon

The next lemma, proven in Appendix F, shows that we can truncate $q^\star_t(w_{t:T})$ to only depend on disturbances at most $h$ steps in the future.

**Lemma 3.2.** For any $h \in [T]$, define a truncated version of $q^\star_t$ as follows:
$$q^\star_{t;t+h}(w_{t:t+h}) = \sum_{i=t}^{((t+h) \wedge T) - 1} \Sigma_t^{-1} B^\top \Big( \prod_{j=t+1}^{i} A_{\mathrm{cl},j}^\top \Big) P_{i+1} w_i. \qquad (14)$$
Then for any $t$ such that $t + h < T - O(\beta \Psi^2 \Gamma)$, setting $\bar{\gamma} = \frac{1}{2}(1 + \gamma_\infty) < 1$, we have the bound
$$\|q^\star_{t;t+h}(w_{t:t+h}) - q^\star_t(w_{t:T})\| \le \kappa_\infty^2 \beta^2 \Psi \Gamma^2 (T - h)\, \bar{\gamma}^{\,h},$$
which is geometrically decreasing in $h$.

Going forward we use that both $q^\star_t$ and $q^\star_{t;t+h}$ have norm at most $\beta \Psi \Gamma \kappa_\infty (1-\gamma_\infty)^{-1} = D_{q^\star}$ (Lemma D.6). As an immediate corollary of Lemma 3.2, we approximate the advantages using finite lookahead.

**Lemma 3.3.** Consider a policy $\bar{\pi}_t(x;w) = -K_t x - q_t(w)$, and suppose that $\|q_t\| \le D_q$, where $D_q \ge D_{q^\star}$. If we choose $h = 2(1-\gamma_\infty)^{-1} \log(\kappa_\infty^2 \beta^2 \Psi \Gamma^2 T^2)$, we are guaranteed that
$$\sum_{t=1}^T \Big| A^\star_t(u^{\bar{\pi}}_t; x^{\bar{\pi}}_t, w) - \|q_t(w) - q^\star_{t;t+h}(w_{t:t+h})\|^2_{\Sigma_t} \Big| \le C_{\mathrm{trunc}},$$
where $C_{\mathrm{trunc}} \le O(D_q^2 \beta \Psi^4 \Gamma^2 (1-\gamma_\infty)^{-1} \log T)$.

At this point, we have established an analogue of Theorem 2, except that we are still using the state feedback controllers $K_t$ rather than $K_\infty$, and the approximate advantages in Lemma 3.3 use the finite-horizon counterparts of $\Sigma_\infty$ and $q^\star_{;h}$. The following lemmas show that we can pass to these infinite-horizon quantities by paying a small approximation cost.

**Lemma 3.4.** Let policies $\pi_t(x;w) = -K_\infty x - q_t(w)$ and $\bar{\pi}_t(x;w) = -K_t x - q_t(w)$ be given, where $q_t$ is arbitrary but satisfies $\|q_t\| \le D_q$ for some $D_q \ge 1$. Then $|J_T(\bar{\pi};w) - J_T(\pi;w)| \le C_{K_\infty}$, where $C_{K_\infty} \le O(\kappa_\infty^4 \beta^6 \Psi^{13} \Gamma^6 (1-\gamma_\infty)^{-2} D_q^2 \log(D_q T))$.

**Lemma 3.5.** Let $(q_t)_{t=1}^T$ be an arbitrary sequence with $\|q_t\| \le D_q$ for some $D_q \ge D_{q^\star}$. Then it holds that
$$\sum_{t=1}^T \Big| \|q_t - q^\star_{t;t+h}(w_{t:t+h})\|^2_{\Sigma_t} - \|q_t - q^\star_{;h}(w_{t:t+h})\|^2_{\Sigma_\infty} \Big| \le C_\infty,$$
where $C_\infty \le O(D_q^2 \beta^4 \Psi^7 \Gamma^4 \kappa_\infty^2 (1-\gamma_\infty)^{-1} h \log(D_q T))$.

Combining these results immediately yields the proof of Theorem 2; details are given in Appendix F.

## 4. Conclusion

We have presented the first efficient algorithm with logarithmic regret for online linear control with arbitrary adversarial disturbance sequences. Our result highlights the power of online learning with advantages, and we are hopeful that this framework will find broader use. Numerous questions naturally arise for future work: Does our framework extend to more general loss functions, or to more general classes of dynamical systems in control and reinforcement learning? Can our results be extended to handle partially observed dynamical systems? Can we obtain $\sqrt{T}$-regret for adversarial disturbances in unknown systems, as is possible in the stochastic regime?

## Acknowledgements

DF acknowledges the support of NSF TRIPODS award #1740751. MS is generously supported by an Open Philanthropy AI Fellowship. We thank Ali Jadbabaie for helpful discussions.

## References

Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1-26, 2011.

Abbasi-Yadkori, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pp. 2508-2516, 2013.

Abbasi-Yadkori, Y., Bartlett, P., and Kanade, V. Tracking adversarial targets. In International Conference on Machine Learning, pp. 369-377, 2014.

Agarwal, N., Bullins, B., Hazan, E., Kakade, S., and Singh, K. Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111-119, 2019a.

Agarwal, N., Hazan, E., and Singh, K. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pp. 10175-10184, 2019b.

Anava, O., Hazan, E., and Mannor, S. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pp. 784-792, 2015.

Arora, R., Dekel, O., and Tewari, A. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pp. 1747-1754, 2012.

Azoury, K. S. and Warmuth, M. K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, June 2001.

Bertsekas, D. P. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific, 2005.
Cassel, A., Cohen, A., and Koren, T. Logarithmic regret for learning linear quadratic regulators efficiently. International Conference on Machine Learning (ICML), 2020.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cohen, A., Hassidim, A., Koren, T., Lazic, N., Mansour, Y., and Talwar, K. Online linear quadratic control. In International Conference on Machine Learning, pp. 1028-1037, 2018.

Cohen, A., Koren, T., and Mansour, Y. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pp. 1300-1309, 2019.

Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pp. 4188-4197, 2018.

Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.

Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. On optimality of adaptive linear-quadratic regulators. arXiv preprint arXiv:1806.10749, 2018.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1467-1476, 2018.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Hazan, E. and Kale, S. Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems, pp. 891-899, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Hazan, E., Kakade, S., and Singh, K. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pp. 408-421. PMLR, 2020.

Joulani, P., Gyorgy, A., and Szepesvári, C. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453-1461, 2013.

Kakade, S. M. On the sample complexity of reinforcement learning. PhD thesis, University College London (University of London), 2003.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

Lincoln, B. and Rantzer, A. Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8):1249-1260, 2006.

Mania, H., Tu, S., and Recht, B. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pp. 10154-10164, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Orabona, F., Crammer, K., and Cesa-Bianchi, N. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411-435, 2015.

Rakhlin, A. and Sridharan, K. Online nonparametric regression. In Conference on Learning Theory, 2014.

Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pp. 5478-5486, 2019.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Simchowitz, M. and Foster, D. J. Naive exploration is optimal for online LQR. International Conference on Machine Learning (ICML), 2020.

Simchowitz, M., Singh, K., and Hazan, E. Improper learning for non-stochastic control. Conference on Learning Theory (COLT), 2020.

Slotine, J.-J. E. and Li, W. Applied Nonlinear Control. Prentice-Hall, 1991.

Vovk, V. Aggregating strategies. In Proceedings of the Conference on Computational Learning Theory, 1990.

Vovk, V. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 51-60. ACM, 1995.

Vovk, V. Competitive on-line linear regression. In NIPS '97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pp. 364-370, Cambridge, MA, USA, 1998. MIT Press.

Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757, 2009.

Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.