# Online Control with Adversarial Disturbances

Naman Agarwal (1), Brian Bullins (2,1), Elad Hazan (2,1), Sham M. Kakade (3,4,1), Karan Singh (2,1)

(1) Google AI Princeton; (2) Department of Computer Science, Princeton University; (3) Allen School of Computer Science and Engineering, University of Washington; (4) Department of Statistics, University of Washington.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We study the control of linear dynamical systems with adversarial disturbances, as opposed to statistical noise. We present an efficient algorithm that achieves nearly-tight regret bounds in this setting. Our result generalizes upon previous work in two main aspects: the algorithm can accommodate adversarial noise in the dynamics, and can handle general convex costs.

## 1. Introduction

This paper studies the robust control of linear dynamical systems. A linear dynamical system (LDS) is governed by the dynamics equation

$$x_{t+1} = Ax_t + Bu_t + w_t, \tag{1.1}$$

where $x_t$ is the state, $u_t$ is the control and $w_t$ is a disturbance to the system. At every time step $t$, the controller suffers a cost $c_t(x_t, u_t)$ to enforce the control. In this paper, we consider the setting of online control with arbitrary disturbances, under known transition dynamics specified by the matrices $A$ and $B$. Formally, the setting involves, at every time step $t$, an adversary selecting a convex cost function $c_t(x, u)$ and a disturbance $w_t$, and the goal of the controller is to generate a sequence of controls $u_t$ such that the sequence of convex costs $c_t(x_t, u_t)$ is minimized.

The above setting generalizes a fundamental problem in control theory, the Linear Quadratic Regulator, which has been studied over several decades. However, despite the significant amount of literature on the problem, several challenges remained.

Challenge 1. Perhaps the most important challenge we address is in dealing with arbitrary disturbances $w_t$ in the dynamics. This is a difficult problem, and so the standard approaches often assume i.i.d. Gaussian noise. Worst-case approaches in the control literature, also known as $H_\infty$-control and its variants, are considered overly pessimistic. Instead, we take an online (adaptive) approach to dealing with adversarial disturbances.

Challenge 2. Another limitation for efficient methods is the classical assumption that the costs $c_t(x, u)$ are quadratic, as is the case for the linear quadratic regulator. Part of the focus in the literature on quadratic costs is due to special properties that allow for efficient computation of the best linear controller in hindsight via dynamic programming. One of our main goals is to introduce a more general technique that allows for efficient algorithms even when faced with arbitrary convex costs.

Our contributions. In this paper, we tackle both of the challenges outlined above: coping with adversarial noise, and general loss functions in an online setting. For this we turn to the algorithmic methodology of regret minimization in online learning. To define the performance metric, we denote the cost of a control algorithm $\mathcal{A}$ as $J_T(\mathcal{A}) = \sum_{t=0}^{T} c_t(x_t, u_t)$. The standard comparator in control is a linear controller, which generates a control signal as a linear function of the state, i.e., $u_t = -Kx_t$. Let $J_T(K)$ denote the cost of a linear controller $K$ from a certain class $K \in \mathcal{K}$.
For an algorithm $\mathcal{A}$, we define the regret as the sub-optimality of its cost with respect to the best linear controller from a certain set, i.e.,
$$\text{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$

Our main result is an efficient algorithm for control which achieves $O(\sqrt{T})$ regret in the setting described above. While a similar setting has been considered in the literature before (Cohen et al., 2018), our work generalizes previous work in the following ways:

1. Our algorithm achieves regret $O(\sqrt{T})$ even in the presence of bounded adversarial disturbances. Previous regret bounds needed to assume that the disturbances $w_t$ are drawn from a distribution with zero mean and bounded variance.

2. Our regret bounds apply to any sequence of adversarially chosen convex loss functions. Previous efficient algorithms applied to convex quadratic costs only.

Our results above are obtained using several techniques from online convex optimization, notably online learning for loss functions with memory, and improper learning using convex relaxations.

## 2. Related Work

Online Learning: This work advocates for worst-case regret as a robust performance metric in the presence of adversarial noise. A special case of our setting is that of regret minimization in stateless systems (where $A = 0$), which is a well-studied problem in machine learning. We refer the reader to various books and surveys on the topic (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz et al., 2012; Hazan, 2016). Of particular interest to our study is the setting of online learning with memory (Anava et al., 2015).

Learning and Control in Linear Dynamical Systems: The modern setting for linear dynamical systems arose in the seminal work of Kalman (1960), who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system in latent-state systems. The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis. We refer the reader to the classic survey (Ljung, 1998), and the extensive overview of recent literature in (Hardt et al., 2018). Most of this literature, as well as much of classical control theory, deals with zero-mean random noise, typically normally distributed. Recently, there has been a renewed interest in learning both fully-observable and latent-state linear dynamical systems. For fully-observable systems, sample complexity and regret bounds for control (under Gaussian noise) were obtained in (Abbasi-Yadkori & Szepesvári, 2011; Dean et al., 2018; Abbasi-Yadkori et al., 2019). The technique of spectral filtering for learning and open-loop control of non-observable systems was introduced and studied in (Hazan et al., 2017; Arora et al., 2018; Hazan et al., 2018). Provable control in the Gaussian noise setting was also studied in (Fazel et al., 2018).

Robust Control: A notable attempt to handle adversarial perturbations in the dynamics takes place in $H_\infty$ control (Stengel, 1994; Zhou et al., 1996). In this setting, the controller solves for the best linear controller assuming worst-case noise to come, i.e.,
$$\min_{K_1} \max_{w_1} \min_{K_2} \max_{w_2} \cdots \min_{K_T} \max_{w_T} \sum_t c_t(x_t, u_t),$$
assuming similar linear dynamics as in equation (1.1). This approach is overly pessimistic, and leads to sub-optimal performance in various cases of interest. In comparison, we do not solve for the entire noise trajectory in advance, but adjust for it iteratively, and, in this manner, offer an instance-dependent guarantee.
Another difference is computational: the above mathematical program may be hard to compute for general cost functions, as compared to our efficient gradient-based algorithm.

Non-stochastic MDPs: The setting we consider, namely control in systems with linear transition dynamics (Bertsekas, 2005) in the presence of adversarial disturbances, can be cast as that of planning in an adversarially changing MDP (Arora et al., 2012; Dekel & Hazan, 2013). The results obtained via this reduction are unsatisfactory because these regret bounds scale with the size of the state space, which is exponential in the dimension of the system. In addition, the regret in these settings scales as $\Omega(T^{2/3})$. In comparison, Yu et al. (2009); Even-Dar et al. (2009) solve the online planning problem for MDPs with fixed dynamics and changing costs. The satisfying aspect of their result is that the regret bound does not explicitly depend on the size of the state space, and scales as $O(\sqrt{T})$. However, these assume that the dynamics are fixed and without (adversarial) noise.

LQR with Changing Costs: For the Linear Quadratic Regulator problem, Cohen et al. (2018) consider changing quadratic costs with stochastic noise to achieve an $O(\sqrt{T})$ regret bound. This work is well aligned with our results, and the present paper employs some notions developed therein (e.g., strong stability). However, the techniques used in (Cohen et al., 2018) (such as the SDP formulation for a linear controller) heavily rely on the quadratic nature of the cost functions and the stochasticity of the disturbances. In particular, even for the offline problem, to the best of our knowledge, there does not exist an SDP formulation to determine the best linear controller for convex losses. In an earlier work, Abbasi-Yadkori & Szepesvári (2011) consider a more restricted setting with fixed, deterministic dynamics (hence, noiseless) and changing quadratic costs. Ng & Kim (2005) consider the adaptation of online learning to control in a manner that ensures stability. While the work does not provide any regret guarantee, the proposed notion of stability continues to be useful in the literature (e.g., (Cohen et al., 2018)) and this paper.

Iterative Learning Control: We also note the similarity of our proposed algorithm to methods involving iterative learning control (Ahn et al., 2007; Owens & Hätönen, 2005) and feedback error learning (Nakanishi & Schaal, 2004). These procedures, often operating over deterministic systems in the episodic setting, attempt to learn a control law via iteratively adjusting the controller as a function of the tracking error.

## 3. Problem Setting

### 3.1. Interaction Model

The linear dynamical system is a Markov decision process on continuous state and action spaces, with linear transition dynamics. In each round $t$, the learner outputs an action $u_t$ upon observing the state $x_t$ and incurs a cost of $c_t(x_t, u_t)$, where $c_t$ is convex. The system then transitions to a new state $x_{t+1}$ according to
$$x_{t+1} = Ax_t + Bu_t + w_t.$$
In the above definition, $w_t$ is the disturbance the system suffers at each time step. In this paper, we make no distributional assumptions on $w_t$, and the sequence $\{w_t\}$ is not made known to the learner in advance. For any algorithm $\mathcal{A}$, we attribute a cost of $J_T(\mathcal{A}) = \sum_{t=0}^{T} c_t(x_t, u_t)$, where $x_{t+1} = Ax_t + Bu_t + w_t$ and $u_t = \mathcal{A}(x_1, \dots, x_t)$. Overloading notation, we shall use $J(K)$ to denote the cost of a linear controller $K$ which chooses the action as $u_t = -Kx_t$.
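To make the interaction protocol above concrete, the following is a minimal sketch of the loop of this section for a small system. The dynamics, cost functions, disturbance sequence, and the controller `K` are illustrative assumptions of ours, not taken from the paper; any convex $c_t$ and any bounded $w_t$ are admissible.

```python
import numpy as np

def run_interaction(A, B, controller, costs, disturbances):
    """Roll out the protocol of Section 3.1 and return the total cost J_T."""
    x = np.zeros(A.shape[0])     # x_0 = 0 (Section 3.2 argues this is without loss of generality)
    total = 0.0
    for c_t, w_t in zip(costs, disturbances):
        u = controller(x)        # the learner only observes the current state
        total += c_t(x, u)       # pays the adversarially chosen convex cost
        x = A @ x + B @ u + w_t  # x_{t+1} = A x_t + B u_t + w_t
    return total

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    T = 200
    # A deterministic, non-zero-mean disturbance sequence (no distributional assumptions).
    disturbances = [0.1 * np.array([np.sin(t / 7.0), np.cos(t / 5.0)]) for t in range(T)]
    # Changing convex costs; any convex c_t(x, u) is allowed.
    costs = [lambda x, u, a=1.0 + 0.5 * np.sin(t / 11.0): a * (x @ x) + u @ u for t in range(T)]
    K = np.array([[0.5, 1.0]])   # a stabilizing linear controller (assumed)
    print(f"J_T(K) = {run_interaction(A, B, lambda x: -K @ x, costs, disturbances):.3f}")
```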
### 3.2. Assumptions

We make the following assumptions throughout the paper. We remark that they are less restrictive than those considered by previous works, and hence allow for more general systems. In particular, we allow for adversarial (rather than i.i.d. stochastic) noise, and we also handle general convex cost functions. In addition, the non-stochastic nature of the disturbances permits, without loss of generality, the assumption that $x_0 = 0$.

Assumption 3.1. The matrices that govern the dynamics are bounded, i.e., $\|A\| \le \kappa_A$, $\|B\| \le \kappa_B$. The perturbation introduced per time step is bounded, i.e., $\|w_t\| \le W$.

Assumption 3.2. The costs $c_t(x, u)$ are convex. Further, as long as it is guaranteed that $\|x\|, \|u\| \le D$, it holds that $|c_t(x, u)| \le \beta D^2$ and $\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G_c D$.

Following the definitions in (Cohen et al., 2018), we work with the following class of linear controllers.

Definition 3.3. A linear policy $K$ is $(\kappa, \gamma)$-strongly stable if there exist matrices $L, Q$ satisfying $A - BK = QLQ^{-1}$, such that the following two conditions are met:

1. The spectral norm of $L$ is strictly smaller than one, i.e., $\|L\| \le 1 - \gamma$.

2. The controller and the transforming matrices are bounded, i.e., $\|K\| \le \kappa$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.

(A numerical sketch of this condition is given at the end of this section.)

### 3.3. Regret Formulation

Let $\mathcal{K} = \{K : K \text{ is } (\kappa, \gamma)\text{-strongly stable}\}$. For an algorithm $\mathcal{A}$, the regret is the sub-optimality of its cost with respect to the best linear controller, i.e.,
$$\text{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$

### 3.4. Proof Techniques and Overview

Choice of Policy Class: We begin by parameterizing the policy we execute at every step as a linear function of the past disturbances, in Definition 4.1. A similar parameterization has been considered in the system level synthesis framework (see (Wang et al., 2019)). This leads to a convex relaxation of the problem. Optimization over alternative parameterizations, including an SDP-based framework (Cohen et al., 2018) or a direct parameterization (Fazel et al., 2018), has been studied in the literature, but these seem unable to capture general convex functions as well as adversarial disturbances, or lead to a non-convex loss. To avoid a linear dependence on time for the number of parameters in our policy, we additionally include a stable linear controller in our policy, allowing us to effectively consider only $O(\gamma^{-1}\log T)$ previous perturbations. Lemma 5.2 makes this notion of approximation precise.

Reduction to OCO with Memory: The choice of policy class with an appropriately chosen horizon $H$ allows us to reduce the problem to competing with functions with truncated memory. This naturally falls under the class of online convex optimization with memory (see Section 4.5). Theorem 5.3 makes this reduction precise. Finally, to bound the regret on truncated functions we use the Online Gradient Descent based approach specified in (Anava et al., 2015), which requires a bound on Lipschitz constants, which we provide in Section 5.3.1. This reduction is inspired by the ideas introduced in (Even-Dar et al., 2009).

### 3.5. Roadmap

The following section provides the notation required to define our algorithm and regret bounds. Section 5 describes our primary method (Algorithm 1) and establishes the central regret guarantees (Theorem 5.1), along with the requisite lemmas and their proofs.
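The following is a minimal numerical sketch of the strong-stability condition of Definition 3.3. It certifies constants $(\kappa, \gamma)$ using the eigendecomposition of $A - BK$ as the witness $(Q, L)$, which is sufficient but not necessary (a different choice of $Q$ may certify better constants). The matrices and the helper name `strong_stability_witness` are hypothetical examples of ours.

```python
import numpy as np

def strong_stability_witness(A, B, K):
    """Certify (kappa, gamma) for K via the eigendecomposition of A - BK (Definition 3.3).

    Taking Q from the eigendecomposition is only one possible witness, so the returned
    constants are sufficient rather than tight.
    """
    eigvals, Q = np.linalg.eig(A - B @ K)
    L = np.diag(eigvals)
    spec = lambda M: np.linalg.norm(M, 2)                  # spectral norm
    gamma = 1.0 - spec(L)                                  # ||L|| <= 1 - gamma
    kappa = max(spec(K), spec(Q), spec(np.linalg.inv(Q)))  # ||K||, ||Q||, ||Q^{-1}|| <= kappa
    return kappa, gamma

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    K = np.array([[0.7, 1.2]])                             # a candidate controller (assumed)
    kappa, gamma = strong_stability_witness(A, B, K)
    verdict = "strongly stable" if gamma > 0 else "not certified by this witness"
    print(f"kappa ~ {kappa:.2f}, gamma ~ {gamma:.3f} ({verdict})")
```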
## 4. Preliminaries

In this section, we establish some important definitions that will prove to be useful throughout the paper.

### 4.1. Notation

We reserve the letters $x, y$ for states and $u, v$ for control actions. We denote by $d = \max(\dim(x), \dim(u))$, i.e., a bound on the dimensionality of the problem. We reserve capital letters $A, B, K, M$ for matrices associated with the system and the policy. Other capital letters are reserved for universal constants in the paper. We use the shorthand $M_{i:j}$ to denote a subsequence $\{M_i, \dots, M_j\}$.

### 4.2. A Disturbance-Action Policy Class

We establish the notion of a disturbance-action controller which chooses the action as a linear map of the past disturbances. Any disturbance-action controller ensures that the state of a system executing such a policy may be expressed as a linear function of the parameters of the policy. This property is convenient in that it permits efficient optimization over the parameters of such a policy. The situation may be contrasted with that of a linear controller. While the action recommended by a linear controller is also linear in past disturbances (a consequence of being linear in the current state), the state sequence produced on the execution of a linear policy is not a linear function of its parameters.

Definition 4.1 (Disturbance-Action Policy). A disturbance-action policy $\pi(M, K)$ is specified by parameters $M = (M^{[0]}, \dots, M^{[H-1]})$ and a fixed matrix $K$, for a horizon $H \ge 1$. At every time $t$, such a policy $\pi(M, K)$ chooses the recommended action $u_t$ at a state $x_t$ (note that $x_t$ is completely determined given $w_0, \dots, w_{t-1}$; hence, the use of $x_t$ only serves to ease presentation), defined as
$$u_t = -Kx_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}.$$
For notational convenience, here it may be considered that $w_i = 0$ for all $i < 0$.

We refer to the policy played at time $t$ as $M_t = \{M_t^{[i]}\}$, where the subscript $t$ refers to the time index and the superscript $[i]$ refers to the action of $M_t$ on $w_{t-i}$. Note that such a policy can be executed because $w_{t-1}$ is perfectly determined on the specification of $x_t$ as $w_{t-1} = x_t - Ax_{t-1} - Bu_{t-1}$. It shall be established in later sections that such a policy class can approximate any linear policy with a strongly stable matrix in terms of the total cost suffered.

### 4.3. Evolution of State

This section describes the evolution of the state of a linear dynamical system under a non-stationary policy $\pi = (\pi_0, \dots, \pi_{T-1})$ composed of $T$ policies, where each $\pi_t$ is specified by $\pi_t(M_t = (M_t^{[0]}, \dots, M_t^{[H-1]}), K)$. With some abuse of notation, we shall use $\pi((M_0, \dots, M_{T-1}), K)$ to denote such a non-stationary policy. The following definitions ease the burden of notation.

1. Define $A_K = A - BK$. $A_K$ shall be helpful in describing the evolution of state starting from a non-zero state in the absence of disturbances.

2. $x_t^K(M_{0:t-1})$ is the state attained by the system upon execution of a non-stationary policy $\pi(M_{0:t-1}, K)$. We similarly define $u_t^K(M_{0:t-1})$ to be the action executed at time $t$. If the same policy $M$ is used across all time steps, we compress the notation to $x_t^K(M), u_t^K(M)$. Note that $x_t^K(0), u_t^K(0)$ refers to running the linear policy $K$.

3. $\Psi_{t,i}^{K,h}(M_{t-h:t})$ is a transfer matrix that describes the effect of $w_{t-i}$, with respect to the past $h+1$ policies, on the state $x_{t+1}$; it is formally defined below. When $M$ is the same across all arguments we compress the notation to $\Psi_{t,i}^{K,h}(M)$.

Definition 4.2. For any $t$, $h \le t$, $i \le H + h$, define the disturbance-state transfer matrix $\Psi_{t,i}^{K,h}$ to be a function with $h+1$ inputs defined as
$$\Psi_{t,i}^{K,h}(M_{t-h:t}) = A_K^{i}\,\mathbf{1}_{i \le h} + \sum_{j=0}^{h} A_K^{j} B M_{t-j}^{[i-j-1]}\,\mathbf{1}_{i-j \in [1, H]}.$$
It will be worthwhile to note that $\Psi_{t,i}^{K,h}$ is linear in its arguments $M_{t-h:t}$. For the rest of the paper we will set $h$ to be $H$ unless specified otherwise.
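The transfer matrix of Definition 4.2 is straightforward to compute directly; the sketch below does so for a window of policies. The array shapes and helper names (`psi`, `M_window`) are our own conventions for illustration, not the paper's.

```python
import numpy as np

def psi(A_K, B, M_window, i, H):
    """Disturbance-state transfer matrix Psi^{K,h}_{t,i}(M_{t-h:t}) of Definition 4.2.

    M_window lists the policies (M_{t-h}, ..., M_t); each entry has shape (H, d_u, d_x),
    so M_window[-1 - j] plays the role of M_{t-j}.
    """
    h = len(M_window) - 1
    d_x = A_K.shape[0]
    # A_K^i * 1{i <= h}
    out = np.linalg.matrix_power(A_K, i) if i <= h else np.zeros((d_x, d_x))
    # sum_{j=0}^{h} A_K^j B M_{t-j}^{[i-j-1]} * 1{i-j in [1, H]}
    for j in range(h + 1):
        if 1 <= i - j <= H:
            out = out + np.linalg.matrix_power(A_K, j) @ B @ M_window[-1 - j][i - j - 1]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_x, d_u, H = 3, 2, 4
    A_K = 0.5 * rng.standard_normal((d_x, d_x))
    B = rng.standard_normal((d_x, d_u))
    M = rng.standard_normal((H, d_u, d_x))
    # With a single policy in the window (h = 0) and i = 0, Psi reduces to the identity,
    # matching the base case of Lemma 4.3 below.
    print(np.allclose(psi(A_K, B, [M], 0, H), np.eye(d_x)))  # True
```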
Lemma 4.3. If $u_t$ is chosen as recommended by the non-stationary policy $\pi((M_0, \dots, M_T), K)$, then the state sequence satisfies the following recurrence for any time $t$ and any $h \ge 0$:
$$x_{t+1} = A_K^{h+1} x_{t-h} + \sum_{i=0}^{H+h} \Psi_{t,i}^{K,h}(M_{t-h:t})\, w_{t-i}. \tag{4.1}$$

Proof. For any time $t$, we prove the claim inductively on $h$. For $h = 0$, we have that
$$x_{t+1} = A_K x_t + \sum_{i=1}^{H} B M_t^{[i-1]} w_{t-i} + w_t = A_K x_t + \sum_{i=0}^{H} \Psi_{t,i}^{K,0}(M_t)\, w_{t-i}.$$
Further, suppose the lemma holds for some $h \ge 0$; we prove it for $h+1$. Adopting the convention that $\Psi_{t,i}^{K,h} = 0$ for $i > H + h$, we have that
$$\begin{aligned}
x_{t+1} &= A_K^{h+1} x_{t-h} + \sum_{i=0}^{H+h} \Psi_{t,i}^{K,h}(M_{t-h:t})\, w_{t-i} \\
&= A_K^{h+1} \Big( A_K x_{t-h-1} + \sum_{i=1}^{H} B M_{t-h-1}^{[i-1]} w_{t-h-i-1} + w_{t-h-1} \Big) + \sum_{i=0}^{H+h} \Psi_{t,i}^{K,h}(M_{t-h:t})\, w_{t-i} \\
&= A_K^{h+2} x_{t-h-1} + \sum_{i=0}^{H+h+1} \Big( \Psi_{t,i}^{K,h}(M_{t-h:t}) + A_K^{i}\,\mathbf{1}_{i = h+1} + A_K^{h+1} B M_{t-h-1}^{[i-h-2]}\,\mathbf{1}_{i-h-1 \in [1,H]} \Big) w_{t-i} \\
&= A_K^{h+2} x_{t-h-1} + \sum_{i=0}^{H+h+1} \Psi_{t,i}^{K,h+1}(M_{t-h-1:t})\, w_{t-i}.
\end{aligned}$$
The proof now follows by induction.

### 4.4. Idealized Setting

Note that the counterfactual nature of regret in the control setting implies that the loss at a time step $t$ depends on all the choices made in the past. To deal with this efficiently, we propose that our optimization problem only consider the effect of the past $H$ steps while planning, forgetting the state the system was in at time $t - H$. We will show later that the above scheme tracks the true cost suffered up to a small additional loss. To formally define this idea, we need the following notion of an ideal state.

Definition 4.4 (Ideal State & Action). Define an ideal state $y_{t+1}^K(M_{t-H:t})$, which is the state the system would have reached if it played the non-stationary policy $M_{t-H:t}$ at all time steps from $t-H$ to $t$, assuming the state at $t-H$ is $0$. Similarly, define $v_{t+1}^K(M_{t-H:t+1})$ to be the idealized action that would have been executed at time $t+1$ if the state observed at time $t+1$ is $y_{t+1}^K(M_{t-H:t})$. Formally,
$$y_{t+1}^K(M_{t-H:t}) = \sum_{i=0}^{2H} \Psi_{t,i}^{K,H}(M_{t-H:t})\, w_{t-i},$$
$$v_{t+1}^K(M_{t-H:t+1}) = -K\, y_{t+1}^K(M_{t-H:t}) + \sum_{i=1}^{H} M_{t+1}^{[i-1]} w_{t+1-i}.$$
When $M$ is the same across all arguments we compress the notation to $y_{t+1}^K(M), v_{t+1}^K(M)$.

We can now consider the loss of the ideal state and the ideal action.

Definition 4.5 (Ideal Cost). Define the idealized cost function $f_t$ to be the cost associated with the idealized state and idealized action, i.e.,
$$f_t(M_{t-H-1:t}) = c_t\big(y_t^K(M_{t-H-1:t-1}),\, v_t^K(M_{t-H-1:t})\big).$$
When $M$ is the same across all arguments we compress the notation to $f_t(M)$.

The linearity of $y_t^K$ in the past controllers and the linearity of $v_t^K$ in its immediate state imply that $f_t$ is a convex function of a linear transformation of $M_{t-H-1:t}$, and hence convex in $M_{t-H-1:t}$. This renders it amenable to algorithms for online convex optimization. In Theorem 5.3 we show that, for a given sequence of policies $\{M_i\}$, the idealized cost $f_t$ and the real cost $c_t$ are close, and this reduction allows us to only consider the truncated $f_t$ while planning, hence allowing for efficiency. The precise notion of minimizing regret on such truncated $f_t$ was considered in the online learning literature before as online convex optimization (OCO) with memory (Anava et al., 2015). We present an overview of this framework next.
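As a concrete illustration of Definitions 4.4 and 4.5, the following sketch computes the ideal state $y_t$, ideal action $v_t$, and surrogate loss $f_t(M)$ for a fixed policy $M$ by rolling the truncated system forward for $H+1$ steps from a zero state, using only recorded disturbances. The system, cost, and constants are assumptions of ours for illustration.

```python
import numpy as np

def ideal_state_action(A, B, K, M, w, t):
    """Ideal state y_t and action v_t for a fixed policy M (Definitions 4.4 and 4.5).

    w is a list of recorded disturbances indexed by time; only w_{t-2H-1}, ..., w_{t-1}
    are used, and the state at time t-H-1 is taken to be zero.
    """
    H, d_x = M.shape[0], A.shape[0]
    y = np.zeros(d_x)
    for s in range(t - H - 1, t):            # roll the truncated system forward to time t
        u = -K @ y + sum(M[i - 1] @ w[s - i] for i in range(1, H + 1) if s - i >= 0)
        y = A @ y + B @ u + w[s]
    v = -K @ y + sum(M[i - 1] @ w[t - i] for i in range(1, H + 1) if t - i >= 0)
    return y, v

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    H, T = 3, 20
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    K = np.array([[0.3, 0.4]])
    M = 0.1 * rng.standard_normal((H, 1, 2))
    w = [0.05 * np.array([np.sin(s), np.cos(2 * s)]) for s in range(T)]
    y_t, v_t = ideal_state_action(A, B, K, M, w, t=10)
    f_t = y_t @ y_t + v_t @ v_t              # an illustrative convex cost c_t(y_t, v_t)
    print(f"f_t(M) = {f_t:.4f}")
```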
### 4.5. OCO with Memory

We now present an overview of the online convex optimization (OCO) with memory framework, as established by (Anava et al., 2015). In particular, we consider the setting where, for every $t$, an online player chooses some point $x_t \in \mathcal{K} \subset \mathbb{R}^d$, a loss function $f_t : \mathcal{K}^{H+1} \mapsto \mathbb{R}$ is revealed, and the learner suffers a loss of $f_t(x_{t-H:t})$. We assume a certain coordinate-wise Lipschitz regularity on $f_t$, namely that, for any $j \in \{0, \dots, H\}$ and any $x_{0:H}, \tilde{x}_j \in \mathcal{K}$,
$$|f_t(x_{0:j-1}, x_j, x_{j+1:H}) - f_t(x_{0:j-1}, \tilde{x}_j, x_{j+1:H})| \le L\, \|x_j - \tilde{x}_j\|. \tag{4.2}$$
In addition, we define $\tilde{f}_t(x) = f_t(x, \dots, x)$, and we let
$$G_f = \sup_{t \in \{0, \dots, T\},\, x \in \mathcal{K}} \|\nabla \tilde{f}_t(x)\|, \qquad D = \sup_{x, y \in \mathcal{K}} \|x - y\|. \tag{4.3}$$
The resulting goal is to minimize the policy regret (Arora et al., 2012), which is defined as
$$\text{Policy Regret} = \sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \tilde{f}_t(x).$$
As shown by (Anava et al., 2015), by running a memory-based OGD, we may bound the policy regret by the following theorem.

Theorem 4.6. Let $\{f_t\}_{t=H}^{T}$ be Lipschitz continuous loss functions with memory such that the $\tilde{f}_t$ are convex, and let $L$, $D$, and $G_f$ be as defined in (4.2) and (4.3). Then, there exists an algorithm which generates a sequence $\{x_t\}_{t=0}^{T}$ such that
$$\sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \tilde{f}_t(x) \le 3D\sqrt{G_f(G_f + LH^2)T}.$$

## 5. Algorithm & Main Result

Algorithm 1 (Online Control Algorithm)

1: Input: step size $\eta$, control matrix $K$, parameters $\kappa_B, \kappa, \gamma, T$.
2: Define $H = \gamma^{-1} \log(T\kappa^2)$.
3: Define $\mathcal{M} = \{M = \{M^{[0]}, \dots, M^{[H-1]}\} : \|M^{[i-1]}\| \le \kappa^3 \kappa_B (1-\gamma)^i\}$.
4: Initialize $M_0 \in \mathcal{M}$ arbitrarily.
5: for $t = 0, \dots, T-1$ do
6: Choose the action $u_t = -Kx_t + \sum_{i=1}^{H} M_t^{[i-1]} w_{t-i}$.
7: Observe the new state $x_{t+1}$ and record $w_t = x_{t+1} - Ax_t - Bu_t$.
8: Set $M_{t+1} = \Pi_{\mathcal{M}}\big(M_t - \eta \nabla f_t(M_t)\big)$.
9: end for

Algorithm 1 describes our proposed algorithm for controlling linear dynamical systems with adversarial disturbances, which at all times maintains a disturbance-action controller. The algorithm implements the memory-based OGD on the loss $f_t(\cdot)$ as described in the previous section. The algorithm requires the specification of a $(\kappa, \gamma)$-strongly stable matrix $K$ once before the online game. Such a matrix can be obtained offline using an SDP relaxation as described in (Cohen et al., 2018). The following theorem states the regret bound Algorithm 1 guarantees.

Theorem 5.1 (Main Theorem). Suppose Algorithm 1 is executed with $\eta = \Theta\big((G_c W \sqrt{T})^{-1}\big)$, on an LDS satisfying Assumption 3.1 with control costs satisfying Assumption 3.2. Then, it holds true that
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\big(G_c W^2 \sqrt{T}\log T\big).$$
Furthermore, the algorithm maintains at most $O(1)$ parameters and can be implemented in time $O(1)$ per time step. Here $O(\cdot), \Theta(\cdot)$ contain polynomial factors in $\gamma^{-1}, \kappa_B, \kappa, d$.

Proof of Theorem 5.1. Note that by the definition of the algorithm we have that all $M_t \in \mathcal{M}$, where $\mathcal{M} = \{M = \{M^{[0]}, \dots, M^{[H-1]}\} : \|M^{[i-1]}\| \le \kappa^3 \kappa_B (1-\gamma)^i\}$. Let $D$ be defined as
$$D \triangleq \frac{W\kappa^3(1 + H\kappa_B\tau)}{\gamma\big(1 - \kappa^2(1-\gamma)^{H+1}\big)} + \kappa_B\kappa^3 W,$$
with $\tau = \kappa_B\kappa^3$. Let $K^*$ be the optimal linear policy in hindsight. By definition, $K^*$ is a $(\kappa, \gamma)$-strongly stable matrix. Using Lemma 5.2 and Theorem 5.3, we have that
$$\sum_{t=0}^{T} f_t(M^*) \le \sum_{t=0}^{T} c_t\big(x_t^{K}(M^*), u_t^{K}(M^*)\big) + 2TG_cD^2\kappa^3(1-\gamma)^{H+1} \le \sum_{t=0}^{T} c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big) + 2TG_cD(1-\gamma)^{H+1}\left(\frac{WH\kappa_B^2\kappa^5}{\gamma} + D\kappa^3\right). \tag{5.1}$$
Let $M_1, \dots, M_T$ be the sequence of policies played by the algorithm. Note that by the definition of the constraint set $\mathcal{M}$, we have that for all $t \in [T]$ and $i \in [H]$, $\|M_t^{[i]}\| \le \kappa_B\kappa^3(1-\gamma)^i$. Using Theorem 5.3 we have that
$$\left|\sum_{t=0}^{T} c_t\big(x_t^{K}(M_{0:t-1}), u_t^{K}(M_{0:t})\big) - \sum_{t=0}^{T} f_t(M_{t-H-1:t})\right| \le 2TG_cD^2\kappa^3(1-\gamma)^{H+1}. \tag{5.2}$$
Finally, using Theorem 4.6 and using Lemmas 5.6 and 5.7 to bound the constants $G_f$ and $L$ associated with the function $f_t$, and by noting that $\max_{M_1, M_2 \in \mathcal{M}} \|M_1 - M_2\| \le \kappa_B\kappa^3$, we have that
$$\sum_{t=0}^{T} f_t(M_{t-H-1:t}) - \min_{M \in \mathcal{M}} \sum_{t=0}^{T} f_t(M) \le 8G_c W D d^{3/2} \kappa_B^2 \kappa^6 H^{2.5} \gamma^{-1} \sqrt{T}. \tag{5.3}$$
Summing up (5.1), (5.2) and (5.3), and using the condition that $H = \gamma^{-1}\log(T\kappa^2)$, we get the result.
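The following is a runnable sketch of Algorithm 1 under simplifying assumptions of ours: the gradient $\nabla f_t(M_t)$ is approximated by finite differences on the truncated surrogate loss (the paper's analysis uses the exact gradient, which is available since $f_t$ is convex in $M$), and the projection $\Pi_{\mathcal{M}}$ is replaced by a crude per-matrix rescaling onto the spectral-norm bounds of Line 3. The system, costs, and constants are illustrative, not the paper's.

```python
import numpy as np

def surrogate_loss(M, A, B, K, w, t, cost):
    """f_t(M): cost of the ideal state/action after an (H+1)-step rollout from a zero state."""
    H, d_x = M.shape[0], A.shape[0]
    y = np.zeros(d_x)
    for s in range(t - H - 1, t):
        u = -K @ y + sum(M[i - 1] @ w[s - i] for i in range(1, H + 1) if s - i >= 0)
        y = A @ y + B @ u + w[s]
    v = -K @ y + sum(M[i - 1] @ w[t - i] for i in range(1, H + 1) if t - i >= 0)
    return cost(y, v)

def project(M, kappa, kappa_B, gamma):
    """Crude stand-in for Pi_M: rescale each M^{[i-1]} to spectral norm kappa^3 kappa_B (1-gamma)^i."""
    for i in range(M.shape[0]):
        bound = kappa ** 3 * kappa_B * (1 - gamma) ** (i + 1)
        norm = np.linalg.norm(M[i], 2)
        if norm > bound:
            M[i] *= bound / norm
    return M

def online_control(A, B, K, costs, disturbances, H, eta, kappa=2.0, kappa_B=1.5, gamma=0.2):
    d_x, d_u = B.shape
    T = len(costs)
    M = np.zeros((H, d_u, d_x))                           # M_0 in the constraint set
    w = [np.zeros(d_x)] * T
    x, total = np.zeros(d_x), 0.0
    for t in range(T):
        u = -K @ x + sum(M[i - 1] @ w[t - i] for i in range(1, H + 1) if t - i >= 0)
        total += costs[t](x, u)
        x_next = A @ x + B @ u + disturbances[t]
        w[t] = x_next - A @ x - B @ u                     # recover w_t from the observed state
        if t > H + 1:                                     # memory-OGD step on f_t
            grad = np.zeros_like(M)
            f0, eps = surrogate_loss(M, A, B, K, w, t, costs[t]), 1e-5
            for idx in np.ndindex(M.shape):               # finite-difference gradient (sketch only)
                Mp = M.copy()
                Mp[idx] += eps
                grad[idx] = (surrogate_loss(Mp, A, B, K, w, t, costs[t]) - f0) / eps
            M = project(M - eta * grad, kappa, kappa_B, gamma)
        x = x_next
    return total

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    K = np.array([[0.5, 1.0]])                            # any (kappa, gamma)-strongly stable K (assumed)
    T = 150
    disturbances = [0.1 * np.array([np.sin(t / 7.0), np.cos(t / 5.0)]) for t in range(T)]
    costs = [lambda x, u: x @ x + u @ u] * T
    print("J_T(alg) =", round(online_control(A, B, K, costs, disturbances, H=5, eta=0.05), 3))
```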
### 5.1. Sufficiency of Disturbance-Action Policies

The class of policies described in Definition 4.1 is powerful enough to capture any fixed linear policy. Lemma 5.2 establishes this equivalence in terms of the state and action sequence that each policy produces.

Lemma 5.2 (Sufficiency). For any two $(\kappa, \gamma)$-strongly stable matrices $K^*, K$, there exists a policy $\pi(M^*, K)$, with $M^* = (M^{*[0]}, \dots, M^{*[H-1]})$ defined as $M^{*[i]} = (K - K^*)(A - BK^*)^i$, such that
$$\left|\sum_{t=0}^{T} c_t\big(x_t^{K}(M^*), u_t^{K}(M^*)\big) - \sum_{t=0}^{T} c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big)\right| \le 2TG_cDWH\kappa_B^2\kappa^5(1-\gamma)^{H}.$$

Proof of Lemma 5.2. By definition we have that
$$x_{t+1}^{K^*}(0) = \sum_{i=0}^{t} A_{K^*}^{i} w_{t-i}.$$
Consider the following calculation for $M^*$ with $M^{*[i]} = (K - K^*)(A - BK^*)^i$, and for any $i \in \{0, \dots, H\}$. We have, for any $h \ge H$, that
$$\Psi_{t,i}^{K,h}(M^*) = A_K^{i} + \sum_{j=1}^{i} A_K^{i-j} B M^{*[j-1]} = A_K^{i} + \sum_{j=1}^{i} A_K^{i-j} B (K - K^*) A_{K^*}^{j-1} = A_K^{i} + \sum_{j=1}^{i} A_K^{i-j} (A_{K^*} - A_K) A_{K^*}^{j-1} = A_K^{i} + \sum_{j=1}^{i} \big( A_K^{i-j} A_{K^*}^{j} - A_K^{i-j+1} A_{K^*}^{j-1} \big) = A_{K^*}^{i}.$$
The final equality follows as the sum telescopes. Therefore, we have that
$$x_{t+1}^{K}(M^*) = \sum_{i=0}^{H} A_{K^*}^{i} w_{t-i} + \sum_{i=H+1}^{H+t} \Psi_{t,i}^{K,t}(M^*)\, w_{t-i}.$$
From the above we get that
$$\|x_t^{K^*}(0) - x_t^{K}(M^*)\| \le W \sum_{i=H+1}^{H+t} \big( \|A_{K^*}^{i}\| + \|\Psi_{t,i}^{K,t}(M^*)\| \big) \le 2WH\kappa_B^2\kappa^5(1-\gamma)^{H},$$
where the last inequality follows from Lemma 5.4 and the fact that $\|M^{*[i]}\| \le \kappa_B\kappa^3(1-\gamma)^i$. Further, comparing the actions taken by the two policies, we may see that
$$\|u_t^{K^*}(0) - u_t^{K}(M^*)\| = \Big\| -K^* x_t^{K^*}(0) + K x_t^{K}(M^*) - \sum_{i=1}^{H} (K - K^*) A_{K^*}^{i-1} w_{t-i} \Big\| \le \sum_{i=H+1}^{H+t} \big\| K^* A_{K^*}^{i} + K\,\Psi_{t,i}^{K,t}(M^*) \big\|\, \|w_{t-i}\| \le 2WH\kappa_B^2\kappa^5(1-\gamma)^{H}.$$
Using the above, Assumption 3.2 and Lemma 5.5, we get that
$$\left|\sum_{t=0}^{T} c_t\big(x_t^{K}(M^*), u_t^{K}(M^*)\big) - \sum_{t=0}^{T} c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big)\right| \le 2TG_cDWH\kappa_B^2\kappa^5(1-\gamma)^{H}.$$

### 5.2. Approximation Theorems

The following theorem relates the cost of $f_t(M_{t-H-1:t})$ to the actual cost $c_t(x_t^K(M_{0:t-1}), u_t^K(M_{0:t}))$.

Theorem 5.3. For any $(\kappa, \gamma)$-strongly stable $K$, any $\tau > 0$, and any sequence of policies $M_1, \dots, M_T$ satisfying $\|M_t^{[i]}\| \le \tau(1-\gamma)^i$, if the perturbations are bounded by $W$, we have that
$$\left|\sum_{t=1}^{T} f_t(M_{t-H-1:t}) - \sum_{t=1}^{T} c_t\big(x_t^{K}(M_{0:t-1}), u_t^{K}(M_{0:t})\big)\right| \le 2TG_cD^2\kappa^3(1-\gamma)^{H+1}, \quad \text{where } D \triangleq \frac{W\kappa^3(1 + H\kappa_B\tau)}{\gamma\big(1 - \kappa^2(1-\gamma)^{H+1}\big)} + \tau W.$$

Before proving the above theorem, we will need a few lemmas which will be useful.

Lemma 5.4. Let $K$ be a $(\kappa, \gamma)$-strongly stable matrix, $\tau > 0$, and $M_t$ be a sequence such that for all $i, t$ we have $\|M_t^{[i]}\| \le \tau(1-\gamma)^i$. Then we have that, for all $i, t, h$,
$$\|\Psi_{t,i}^{K,h}\| \le \kappa^2(1-\gamma)^i\,\mathbf{1}_{i \le H} + H\kappa_B\kappa^2\tau(1-\gamma)^{i-1}.$$

We now derive a bound on the norm of each of the states.

Lemma 5.5. Suppose the system satisfies Assumption 3.1 and let $M_t$ be a sequence such that for all $i, t$ we have that $\|M_t^{[i]}\| \le \tau(1-\gamma)^i$ for a number $\tau$. Define
$$D \triangleq \frac{W\kappa^3(1 + H\kappa_B\tau)}{\gamma\big(1 - \kappa^2(1-\gamma)^{H+1}\big)} + \tau W.$$
Further suppose $K$ is a $(\kappa, \gamma)$-strongly stable matrix. We have that, for all $t$,
$$\|x_t^K(M_{0:t-1})\|,\ \|y_t^K(M_{t-H-1:t-1})\|,\ \|x_t^K(0)\| \le D,$$
$$\|u_t^K(M_{0:t})\|,\ \|v_t^K(M_{t-H-1:t})\| \le D,$$
$$\|x_t^K(M_{0:t-1}) - y_t^K(M_{t-H-1:t-1})\| \le \kappa^2(1-\gamma)^{H+1}D,$$
$$\|u_t^K(M_{0:t}) - v_t^K(M_{t-H-1:t})\| \le \kappa^3(1-\gamma)^{H+1}D.$$

Finally, we prove Theorem 5.3.

Proof of Theorem 5.3. Using the above lemmas, we can now bound the approximation error between $f_t$ and $c_t$ using Assumption 3.2:
$$\begin{aligned}
\big|c_t\big(x_t^{K}(M_{0:t-1}), u_t^{K}(M_{0:t})\big) - f_t(M_{t-H-1:t})\big| &= \big|c_t\big(x_t^{K}(M_{0:t-1}), u_t^{K}(M_{0:t})\big) - c_t\big(y_t^{K}(M_{t-H-1:t-1}), v_t^{K}(M_{t-H-1:t})\big)\big| \\
&\le G_cD\,\|x_t^{K}(M_{0:t-1}) - y_t^{K}(M_{t-H-1:t-1})\| + G_cD\,\|u_t^{K}(M_{0:t}) - v_t^{K}(M_{t-H-1:t})\| \\
&\le 2G_cD^2\kappa^3(1-\gamma)^{H+1}.
\end{aligned}$$
This finishes the proof of Theorem 5.3.

### 5.3. Bounding the Properties of OCO with Memory

Thus far, Lemma 5.2 establishes that it is sufficient to compare against the class of disturbance-action policies. For such policies, Theorem 5.3 reduces the counterfactual regret minimization problem to online learning on loss functions with memory. It remains to quantify the constants with which Theorem 4.6 may be invoked to obtain a regret bound. Note that Line 3 of Algorithm 1 places an upper bound on the diameter $D$.
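Before bounding these constants, here is a small numerical check of the construction in Lemma 5.2: with $M^{*[i]} = (K - K^*)(A - BK^*)^i$, the disturbance-action policy $\pi(M^*, K)$ should track the trajectory of the linear policy $K^*$ up to an error that decays geometrically in $H$. The system, controllers, and disturbances below are hypothetical choices of ours.

```python
import numpy as np

def rollout_linear(A, B, K_star, w):
    """States produced by the linear policy u_t = -K* x_t, i.e. x_{t+1} = (A - B K*) x_t + w_t."""
    x, xs = np.zeros(A.shape[0]), []
    for w_t in w:
        xs.append(x)
        x = (A - B @ K_star) @ x + w_t
    return np.array(xs)

def rollout_disturbance_action(A, B, K, M, w):
    """States produced by pi(M, K): u_t = -K x_t + sum_i M^{[i-1]} w_{t-i}."""
    H, d_x = M.shape[0], A.shape[0]
    x, xs = np.zeros(d_x), []
    for t, w_t in enumerate(w):
        xs.append(x)
        u = -K @ x + sum(M[i - 1] @ w[t - i] for i in range(1, H + 1) if t - i >= 0)
        x = A @ x + B @ u + w_t
    return np.array(xs)

if __name__ == "__main__":
    A = np.array([[0.5, 0.05], [0.0, 0.5]])
    B = np.array([[0.0], [1.0]])
    K_star = np.array([[0.1, 0.2]])
    K = np.array([[0.05, 0.1]])            # two stabilizing controllers (assumed)
    T = 200
    w = [0.1 * np.array([np.sin(t / 7.0), np.cos(t / 5.0)]) for t in range(T)]
    for H in (2, 5, 10, 20):
        A_Kstar = A - B @ K_star
        M_star = np.array([(K - K_star) @ np.linalg.matrix_power(A_Kstar, i) for i in range(H)])
        gap = np.abs(rollout_linear(A, B, K_star, w)
                     - rollout_disturbance_action(A, B, K, M_star, w)).max()
        print(f"H = {H:2d}: max state gap = {gap:.2e}")   # should shrink as H grows
```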
#### 5.3.1. Bounding the Lipschitz Constant

We begin by establishing an upper bound on the Lipschitz constant $L$ as defined in Equation (4.2).

Lemma 5.6. Consider two policy sequences $\{M_{t-H-1}, \dots, M_{t-k}, \dots, M_t\}$ and $\{M_{t-H-1}, \dots, \tilde{M}_{t-k}, \dots, M_t\}$ which differ in exactly one policy, played at a time step $t-k$ for $k \in \{0, \dots, H\}$. Then we have that
$$\big|f_t(M_{t-H-1}, \dots, M_{t-k}, \dots, M_t) - f_t(M_{t-H-1}, \dots, \tilde{M}_{t-k}, \dots, M_t)\big| \le 2G_cDW\kappa_B\kappa^3(1-\gamma)^{k} \sum_{i=0}^{H-1} \big\|M_{t-k}^{[i]} - \tilde{M}_{t-k}^{[i]}\big\|.$$

#### 5.3.2. Bounding the Gradient

A bound on the norm of the gradient follows similarly.

Lemma 5.7. For all $M$ such that $\|M^{[j]}\| \le \tau(1-\gamma)^j$ for all $j \in [0, H-1]$, we have that
$$\|\nabla_M f_t(M)\|_F \le 2G_cDWHd\,\kappa_B\kappa^3.$$

Note that, since $M$ is a matrix, the $\ell_2$ norm of the gradient $\nabla_M f_t$ corresponds to the Frobenius norm of the $\nabla_M f_t$ matrix. Due to space constraints, we provide the proof in the appendix.

## 6. Conclusion

In this paper, we demonstrate an algorithmic methodology to control linear dynamical systems with adversarial disturbances through regret minimization, as well as how to handle general convex costs. Our notion of a robust controller is able to learn and adapt according to the noise encountered during the process. This deviates from the study of robust control in the framework of $H_\infty$ control, which attempts to find a control with the anticipation of worst-case instances for all future perturbations.

## Acknowledgements

Elad Hazan acknowledges funding from the NSF, award number 1704860. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, the DARPA award FA8650-182-7836, and the ONR award N00014-18-1-2247.

## References

Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, 2011.

Abbasi-Yadkori, Y., Lazic, N., and Szepesvári, C. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3108–3117, 2019.

Ahn, H.-S., Chen, Y., and Moore, K. L. Iterative learning control: Brief survey and categorization. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(6):1099–1121, 2007.

Anava, O., Hazan, E., and Mannor, S. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pp. 784–792, 2015.

Arora, R., Dekel, O., and Tewari, A. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pp. 1503–1510, 2012.

Arora, S., Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. Towards provable control for unknown linear dynamical systems. 2018.

Bertsekas, D. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 2005.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.

Cohen, A., Hasidim, A., Koren, T., Lazic, N., Mansour, Y., and Talwar, K. Online linear quadratic control. In International Conference on Machine Learning, pp. 1028–1037, 2018.

Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pp. 4188–4197, 2018.

Dekel, O. and Hazan, E. Better rates for any adversarial deterministic MDP. In International Conference on Machine Learning, pp. 675–683, 2013.

Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pp. 1466–1475, 2018.

Hardt, M., Ma, T., and Recht, B. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068, 2018.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016. ISSN 2167-3888. doi: 10.1561/2400000013. URL http://dx.doi.org/10.1561/2400000013.

Hazan, E., Singh, K., and Zhang, C. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pp. 6702–6712, 2017.

Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pp. 4634–4643, 2018.

Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

Ljung, L. System identification: Theory for the user. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

Nakanishi, J. and Schaal, S. Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10):1453–1465, 2004.

Ng, A. Y. and Kim, H. Stable adaptive control with online learning. In Advances in Neural Information Processing Systems, pp. 977–984, 2005.

Owens, D. H. and Hätönen, J. Iterative learning control: an optimization paradigm. Annual Reviews in Control, 29(1):57–70, 2005.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Stengel, R. F. Optimal control and estimation. Courier Corporation, 1994.

Wang, Y.-S., Matni, N., and Doyle, J. C. A system level approach to controller synthesis. IEEE Transactions on Automatic Control, 2019.

Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.

Zhou, K., Doyle, J. C., Glover, K., et al. Robust and optimal control, volume 40. Prentice Hall, New Jersey, 1996.