# Logarithmic Regret for Online Control

Naman Agarwal¹, Elad Hazan¹·², Karan Singh¹·²
¹ Google AI Princeton   ² Computer Science, Princeton University
namanagarwal@google.com, {ehazan,karans}@princeton.edu

We study optimal regret bounds for control in linear dynamical systems under adversarially changing strongly convex cost functions, given knowledge of the transition dynamics. This includes several well-studied and fundamental frameworks such as the Kalman filter and the linear quadratic regulator. State-of-the-art methods achieve regret which scales as $O(\sqrt{T})$, where $T$ is the time horizon. We show that the optimal regret in this setting can be significantly smaller, scaling as $O(\mathrm{poly}(\log T))$. This regret bound is achieved by two different efficient iterative methods, online gradient descent and online natural gradient.

## 1 Introduction

Algorithms for regret minimization typically attain one of two performance guarantees. For general convex losses, regret scales as the square root of the number of iterations, and this is tight. However, if the loss functions exhibit more curvature, such as quadratic loss functions, there exist algorithms that attain poly-logarithmic regret. This distinction is also known as fast rates in statistical estimation.

Despite their ubiquitous use in online learning and statistical estimation, logarithmic regret algorithms are almost non-existent in control of dynamical systems. This can be attributed to fundamental challenges in computing the optimal controller in the presence of noise.

Time-varying cost functions in dynamical systems can be used to model unpredictable dynamic resource constraints, and the tracking of a desired sequence of exogenous states. However, with changing (even strongly) convex loss functions, the optimal controller for a linear dynamical system is not immediately computable via a convex program. For the special case of quadratic loss, some previous works [9] remedy the situation by taking a semi-definite relaxation, and thereby obtain a controller which has provable guarantees on regret and computational requirements. However, this semi-definite relaxation reduces the problem to regret minimization over linear costs, and removes the curvature which is necessary to obtain logarithmic regret.

In this paper we give the first efficient poly-logarithmic regret algorithms for controlling a linear dynamical system with noise in the dynamics (i.e., the standard model). Our results apply to general convex loss functions that are strongly convex, and not only to quadratics.

| Reference | Noise | Regret | Loss functions |
|---|---|---|---|
| [1] | none | $O(\log^2 T)$ | quadratic (fixed Hessian) |
| [4] | adversarial | $O(\sqrt{T})$ | convex |
| [9] | stochastic | $O(\sqrt{T})$ | quadratic |
| here | stochastic | $O(\log^7 T)$ | strongly convex |

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

### 1.1 Our Results

The setting we consider is a linear dynamical system, a continuous-state Markov decision process with linear transitions, described by the following equation:
$$x_{t+1} = A x_t + B u_t + w_t. \qquad (1.1)$$
Here $x_t$ is the state of the system, $u_t$ is the action (or control) taken by the controller, and $w_t$ is the noise. In each round $t$, the learner outputs an action $u_t$ upon observing the state $x_t$ and incurs a cost of $c_t(x_t, u_t)$, where $c_t$ is convex. The objective is to choose a sequence of adaptive controls $u_t$ so that the total cost incurred is minimized.
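To make the setting in (1.1) concrete, here is a minimal simulation sketch of a linear dynamical system under a fixed linear controller, accumulating the total cost that $J_T(K)$ (defined in Section 2) measures. The dimensions, the matrices `A`, `B`, `K`, the quadratic cost, and the noise law are hypothetical stand-ins chosen for illustration; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small system; the paper assumes the dynamics (A, B) are known.
d_x, d_u, T = 2, 1, 1000
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.5]])           # linear controller: u_t = -K x_t
W = 1.0                               # bound on the noise magnitude

def simulate(K, T):
    """Run x_{t+1} = A x_t + B u_t + w_t with u_t = -K x_t and return the total
    cost sum_t c_t(x_t, u_t) for the strongly convex cost c_t(x, u) = ||x||^2 + ||u||^2."""
    x, total = np.zeros(d_x), 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ x + u @ u                            # per-round cost
        w = rng.uniform(-W, W, size=d_x) / np.sqrt(d_x)   # bounded, zero-mean noise
        x = A @ x + B @ u + w
    return total

print("J_T(K) ≈", simulate(K, T))
```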
The approach taken by [9] and other previous works is to use a semi-definite relaxation for the controller. However, this removes the properties associated with the curvature of the loss functions, by reducing the problem to an instance of online linear optimization. It is known that without curvature, $O(\sqrt{T})$ regret bounds are tight (see [13]). Therefore we take a different approach, initiated by [4]. We consider controllers that depend on the previous noise terms and take the form $u_t = \sum_{i=1}^{H} M_i w_{t-i}$. While this resulting convex relaxation does not remove the curvature of the loss functions altogether, it results in an overparametrized representation of the controller, and it is not a priori clear that the loss functions are strongly convex with respect to this parameterization. We demonstrate the appropriate conditions on the linear dynamical system under which strong convexity is retained.

Henceforth we present two methods that attain poly-logarithmic regret. They differ in terms of the regret bounds they afford and the computational cost of their execution. The online gradient descent update (OGD) requires only gradient computation and update, whereas the online natural gradient (ONG) update, in addition, requires the computation of the preconditioner, which is the expected Gram matrix of the Jacobian, denoted $J$, and its inverse. However, the natural gradient update admits an instance-dependent upper bound on the regret, which, while being at least as good as the regret bound on OGD, offers better guarantees on benign instances (see Corollary 4.5, for example).

| Algorithm | Update rule (simplified) |
|---|---|
| OGD | $M_{t+1} \leftarrow M_t - \eta_t \nabla f_t(M_t)$ |
| ONG | $M_{t+1} \leftarrow M_t - \eta_t\,(\mathbb{E}[J^\top J])^{-1} \nabla f_t(M_t)$ |

Both updates apply whenever there exist $K$ and a diagonal $L$ such that $A - BK = QLQ^{-1}$ with $\|L\| \le 1 - \delta$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.

### 1.2 Related Work

For a survey of linear dynamical systems (LDS), as well as learning, prediction and control problems, see [17]. Recently, there has been a renewed interest in learning dynamical systems in the machine learning literature. For fully-observable systems, sample complexity and regret bounds for control (under Gaussian noise) were obtained in [3, 10, 2]. The technique of spectral filtering for learning and open-loop control of partially observable systems was introduced and studied in [15, 7, 14]. Provable control in the Gaussian noise setting via the policy gradient method was also studied in [11].

The closest works to ours are [1] and [9], aimed at controlling LDS with adversarial loss functions. The authors in [1] obtain an $O(\log^2 T)$ regret algorithm for changing quadratic costs (with a fixed Hessian), but for dynamical systems that are noise-free. In contrast, our results apply to the full (noisy) LDS setting, which presents the main challenges as discussed before. Cohen et al. [9] consider changing quadratic costs with stochastic noise and achieve an $O(\sqrt{T})$ regret bound.

We make extensive use of techniques from online learning [8, 16, 13]. Of particular interest to our study is the setting of online learning with memory [5]. We also build upon the recent control work of [4], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations.

## 2 Problem Setting

We consider a linear dynamical system as defined in (1.1) with costs $c_t(x_t, u_t)$, where $c_t$ is strongly convex. In this paper we assume that the noise $w_t$ is a random variable generated independently at every time step. To any algorithm $\mathcal{A}$ we attribute the cost
$$J_T(\mathcal{A}) = \mathbb{E}_{\{w_t\}}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right], \quad \text{where } x_{t+1} = A x_t + B u_t + w_t, \quad u_t = \mathcal{A}(x_1, \ldots, x_t),$$
and $\mathbb{E}_{\{w_t\}}$ denotes the expectation over the entire noise sequence. For the rest of the paper we drop the subscript $\{w_t\}$ from the expectation, as it is the only source of randomness.
Overloading notation, we use $J_T(K)$ to denote the cost of a linear controller $K$, which chooses the action $u_t = -K x_t$.

Assumptions. In the paper we assume that $x_1 = 0$; this is only for convenience of presentation, and the case of a bounded $x_1$ can be handled similarly. We further make the following assumptions.

Assumption 2.1. We assume that $\|B\| \le \kappa_B$. Furthermore, the perturbation introduced per time step is bounded, i.i.d., and zero-mean with lower-bounded covariance, i.e.,
$$\forall t, \quad w_t \sim \mathcal{D}_w, \quad \mathbb{E}[w_t] = 0, \quad \mathbb{E}[w_t w_t^\top] \succeq \sigma^2 I, \quad \|w_t\| \le W.$$
This may be adapted to the case of sub-Gaussian noise by conditioning on the event that none of the noise vectors are ever large. Such an adaptation introduces a multiplicative $\log(T)$ factor in the regret.

Assumption 2.2. The costs $c_t(x, u)$ are $\alpha$-strongly convex. Whenever $\|x\|, \|u\| \le D$, it holds that $\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le GD$.

The class of linear controllers we work with is defined as follows; see Section A for a detailed note.

Definition 2.3 (Diagonal Strong Stability). Given dynamics $(A, B)$, a linear controller $K$ is $(\kappa, \gamma)$-diagonal strongly stable for real numbers $\kappa \ge 1$, $\gamma < 1$, if there exist a complex diagonal matrix $L$ and a non-singular complex matrix $Q$ such that $A - BK = QLQ^{-1}$, with the following being true:
1. The spectral norm of $L$ is strictly smaller than one, i.e., $\|L\| \le 1 - \gamma$.
2. The controller and transforming matrices are bounded, i.e., $\|K\| \le \kappa$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.

Regret Formulation. Let $\mathcal{K} = \{K : K \text{ is } (\kappa, \gamma)\text{-diagonal strongly stable}\}$. For an algorithm $\mathcal{A}$, the notion of regret we consider is pseudo-regret, i.e., the sub-optimality of its cost with respect to the cost of the best linear controller:
$$\mathrm{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$

## 3 Preliminaries

Notation. We reserve the letters $x, y$ for states and $u, v$ for actions. We denote by $d_x, d_u$ the dimensionality of the state and control spaces respectively, and let $d = \max(d_x, d_u)$. We reserve the capital letters $A, B, K, M$ for matrices associated with the system and the policy; other capital letters are reserved for universal constants. We use the shorthand $M_{i:j}$ to denote the subsequence $\{M_i, \ldots, M_j\}$. For any matrix $U$, define $U_{\mathrm{vec}}$ to be a flattening of the matrix where we stack the columns upon each other. Further, for a collection of matrices $M = \{M^{[i]}\}$, let $M_{\mathrm{vec}}$ be the flattening defined by stacking the flattenings of the $M^{[i]}$ upon each other. We use $\|x\|_U^2 = x^\top U x$ to denote the matrix-induced norm. The rest of this section provides a recap of the relevant definitions and concepts introduced in [4].

### 3.1 Reference Policy Class

For the rest of the paper, we fix a $(\kappa, \gamma)$-diagonal strongly stable matrix $K$ (this matrix is treated as fixed, not as a parameter of the policy). Note that this can be any such matrix, and it can be computed via a semi-definite feasibility program [9] given the knowledge of the dynamics, before the start of the game. We work with the following class of policies.

Definition 3.1 (Disturbance-Action Policy). A disturbance-action policy $M = (M^{[0]}, \ldots, M^{[H-1]})$, for horizon $H \ge 1$, is defined as the policy which at every time $t$ chooses the recommended action $u_t$ at a state $x_t$, defined as
$$u_t(M) \triangleq -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}.$$
(Note that $x_t$ is completely determined given $w_0, \ldots, w_{t-1}$; the use of $x_t$ only serves to ease the burden of presentation.) For notational convenience, it may be considered that $w_i = 0$ for all $i < 0$. The policy applies a linear transformation to the disturbances observed in the past $H$ steps.
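As an illustration of Definition 3.1, the sketch below executes a disturbance-action policy on a simulated system: at each step it plays $u_t = -Kx_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$ and then recovers the disturbance from the observed transition. The concrete matrices, horizon, and noise are hypothetical; only the action rule and the recovery of $w_t$ mirror the definition.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)
d_x, d_u, H, T = 2, 1, 5, 200

# Hypothetical known dynamics and a fixed stabilizing K (illustrative assumptions).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.5]])
M = [0.05 * rng.standard_normal((d_u, d_x)) for _ in range(H)]  # M^[0], ..., M^[H-1]

x = np.zeros(d_x)
past_w = deque([np.zeros(d_x)] * H, maxlen=H)   # holds w_{t-1}, ..., w_{t-H}

for t in range(T):
    # Disturbance-action policy: u_t = -K x_t + sum_i M^[i-1] w_{t-i}
    u = -K @ x + sum(M[i] @ past_w[i] for i in range(H))
    x_next = A @ x + B @ u + 0.1 * rng.standard_normal(d_x)
    # w_t is perfectly determined once x_{t+1} is observed:
    w = x_next - A @ x - B @ u
    past_w.appendleft(w)
    x = x_next
```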
Since $(x, u)$ is a linear function of the past disturbances under a linear controller $K$, formulating the policy this way can be seen as a relaxation of the class of linear policies. Note that $K$ is a fixed matrix and is not part of the parameterization of the policy. As was established in [4] (and we include the proof for completeness), with the appropriate choice of parameters, superimposing such a $K$ onto the policy class allows it to approximate any linear policy, in terms of the total cost suffered, with a finite horizon parameter $H$. We refer to the policy played at time $t$ as $M_t = \{M_t^{[i]}\}$, where the subscript $t$ refers to the time index and the superscript $[i-1]$ refers to the matrix acting on $w_{t-i}$. Note that such a policy can be executed because $w_{t-1}$ is perfectly determined upon the specification of $x_t$ as $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$.

### 3.2 Evolution of State

This section describes the evolution of the state of the linear dynamical system under a non-stationary policy composed of a sequence of $T$ policies, where at each time the policy is specified by $M_t = (M_t^{[0]}, \ldots, M_t^{[H-1]})$. We will use $M_{0:T-1}$ to denote such a non-stationary policy. The following definitions ease the burden of notation.

1. Define $\tilde{A} = A - BK$. $\tilde{A}$ is helpful in describing the evolution of the state starting from a non-zero state in the absence of disturbances.
2. For any sequence of matrices $M_{0:H}$, define $\Psi_i$ as a linear function that describes the effect of $w_{t-i}$ on the state $x_t$, formally defined below.

Definition 3.2. For any sequence of matrices $M_{0:H}$, define the disturbance-state transfer matrix $\Psi_i$, for $i \in \{0, 1, \ldots, 2H\}$, to be a function with $H+1$ inputs defined as
$$\Psi_i(M_{0:H}) \triangleq \tilde{A}^i \mathbf{1}_{i \le H} + \sum_{j=0}^{H} \tilde{A}^j B M_{H-j}^{[i-j-1]} \mathbf{1}_{i-j \in [1, H]}.$$
It will be important to note that $\Psi_i$ is a linear function of its argument.

### 3.3 Surrogate State and Surrogate Cost

This section introduces a couple of definitions required to describe our main algorithm. In essence, they describe a notion of state, its derivative, and the expected cost if the system evolved solely under the past $H$ steps of a non-stationary policy.

Definition 3.3 (Surrogate State & Surrogate Action). Given a sequence of matrices $M_{0:H+1}$ and $2H$ independent invocations of the random variable $w$ given by $\{w_j \sim \mathcal{D}_w\}_{j=0}^{2H-1}$, define the following random variables denoting the surrogate state and the surrogate action:
$$y(M_{0:H}) = \sum_{i=0}^{2H} \Psi_i(M_{0:H})\, w_{2H-i-1}, \qquad v(M_{0:H+1}) = -K\, y(M_{0:H}) + \sum_{i=1}^{H} M_{H+1}^{[i-1]} w_{2H-i},$$
where we set $w_{-1} = 0$. When $M$ is the same across all arguments we compress the notation to $y(M)$ and $v(M)$ respectively.

Algorithm 1: Online Control Algorithm
1: Input: step size schedule $\eta_t$, parameters $\kappa_B, \kappa, \gamma, T$.
2: Define $H = \gamma^{-1} \log(T\kappa^2)$.
3: Define $\mathcal{M} = \{M = \{M^{[0]}, \ldots, M^{[H-1]}\} : \|M^{[i-1]}\| \le \kappa^3 \kappa_B (1-\gamma)^i\}$.
4: Initialize $M_0 \in \mathcal{M}$ arbitrarily.
5: for $t = 0, \ldots, T-1$ do
6:   Choose the action $u_t = -K x_t + \sum_{i=1}^{H} M_t^{[i-1]} w_{t-i}$.
7:   Observe the new state $x_{t+1}$ and record $w_t = x_{t+1} - A x_t - B u_t$.
8:   Online Gradient Update: $M_{t+1} = \Pi_{\mathcal{M}}\big(M_t - \eta_t \nabla f_t(M_t)\big)$
9:   Online Natural Gradient Update: $M_{\mathrm{vec},t+1} = \Pi_{\mathcal{M}}\big(M_{\mathrm{vec},t} - \eta_t (\mathbb{E}[J^\top J])^{-1} \nabla_{M_{\mathrm{vec}}} f_t(M_t)\big)$
10: end for

Definition 3.4 (Surrogate Cost). Define the surrogate cost function $f_t$ to be the cost associated with the surrogate state-action pair defined above, i.e.,
$$f_t(M_{0:H+1}) = \mathbb{E}\left[c_t\big(y(M_{0:H}), v(M_{0:H+1})\big)\right].$$
When $M$ is the same across all arguments we compress the notation to $f_t(M)$.
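The following sketch gives one plausible reading of Definitions 3.2 to 3.4 in code: it builds the transfer matrices $\Psi_i$ for a fixed $M$ (all arguments equal), samples fresh disturbances, and forms a Monte Carlo estimate of the surrogate cost $f_t(M)$ for a quadratic cost. The system matrices, dimensions, and cost are hypothetical, and the index conventions follow the reconstruction above rather than a verified source.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_u, H = 2, 1, 4

A = np.array([[0.9, 0.2], [0.0, 0.8]])           # hypothetical dynamics
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.5]])
A_tilde = A - B @ K                               # closed-loop matrix A - BK
M = [0.05 * rng.standard_normal((d_u, d_x)) for _ in range(H)]

def psi(i):
    """Disturbance-state transfer matrix Psi_i (Definition 3.2), with every
    argument of M_{0:H} equal to the same M."""
    out = np.linalg.matrix_power(A_tilde, i) if i <= H else np.zeros((d_x, d_x))
    for j in range(H + 1):
        if 1 <= i - j <= H:
            out = out + np.linalg.matrix_power(A_tilde, j) @ B @ M[i - j - 1]
    return out

def surrogate_cost(n_samples=2000):
    """Monte Carlo estimate of f_t(M) = E[c_t(y(M), v(M))] for c_t(x,u) = ||x||^2 + ||u||^2."""
    total = 0.0
    for _ in range(n_samples):
        w = rng.uniform(-1, 1, size=(2 * H, d_x))              # w_0, ..., w_{2H-1}
        get_w = lambda j: w[j] if j >= 0 else np.zeros(d_x)    # w_j = 0 for j < 0
        y = sum(psi(i) @ get_w(2 * H - i - 1) for i in range(2 * H + 1))
        v = -K @ y + sum(M[i - 1] @ get_w(2 * H - i) for i in range(1, H + 1))
        total += y @ y + v @ v
    return total / n_samples

print("estimated f_t(M) ≈", surrogate_cost())
```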
Definition 3.5 (Jacobian). Let $z(M) = \begin{bmatrix} y(M) \\ v(M) \end{bmatrix}$. Since $y(M), v(M)$ are random linear functions of $M$, $z(M)$ can be reparameterized as
$$z(M) = J M_{\mathrm{vec}} = \begin{bmatrix} J_y \\ J_v \end{bmatrix} M_{\mathrm{vec}},$$
where $J$ is a random matrix which derives its randomness from the random perturbations $w_i$.

### 3.4 OCO with Memory

We now describe the setting of online convex optimization with memory, introduced in [5]. In this setting, at every step $t$ an online player chooses some point $x_t \in \mathcal{K} \subseteq \mathbb{R}^d$, a loss function $f_t : \mathcal{K}^{H+1} \to \mathbb{R}$ is then revealed, and the learner suffers a loss of $f_t(x_{t-H:t})$. We assume a coordinate-wise Lipschitz regularity on $f_t$ of the form: for any $j \in \{0, \ldots, H\}$ and any $x_{0:H}, \tilde{x}_j \in \mathcal{K}$,
$$\left|f_t(x_{0:j-1}, x_j, x_{j+1:H}) - f_t(x_{0:j-1}, \tilde{x}_j, x_{j+1:H})\right| \le L \|x_j - \tilde{x}_j\|. \qquad (3.1)$$
In addition, we define $\tilde{f}_t(x) = f_t(x, \ldots, x)$, and we let
$$G_f = \sup_{t \in \{0,\ldots,T\},\, x \in \mathcal{K}} \|\nabla \tilde{f}_t(x)\|, \qquad D = \sup_{x, y \in \mathcal{K}} \|x - y\|. \qquad (3.2)$$
The resulting goal is to minimize the policy regret [6], defined as
$$\text{Policy Regret} = \sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \tilde{f}_t(x).$$

## 4 Algorithms & Statement of Results

The two variants of our method are spelled out in Algorithm 1. Theorems 4.1 and 4.3 provide the main guarantees for the two algorithms.

Online Gradient Update.

Theorem 4.1 (Online Gradient Update). Suppose Algorithm 1 (Online Gradient Update) is executed with $K$ being any $(\kappa, \gamma)$-diagonal strongly stable matrix and $\eta_t = \Theta\!\left((\alpha \sigma^2 t)^{-1}\right)$, on an LDS satisfying Assumption 2.1 with control costs satisfying Assumption 2.2. Then it holds true that
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\!\left(\frac{G^2 W^4}{\alpha \sigma^2} \log^7(T)\right).$$

The above result leverages the following lemma, which shows that the function $f_t(\cdot)$ is strongly convex with respect to its argument $M$. Note that strong convexity of the cost functions $c_t$ over the state-action space does not by itself imply strong convexity of the surrogate cost $f_t$ over the space of controllers $M$. This is because, in the surrogate cost $f_t$, $c_t$ is applied to $y(M), v(M)$, which are themselves linear functions of $M$, and the associated linear map is necessarily column-rank-deficient: it maps from a space of dimensionality $H \cdot d_x \cdot d_u$ to one of dimensionality $d_x + d_u$. The next lemma, which forms the core of our analysis, shows that strong convexity is nonetheless retained, using the inherent stochastic nature of the dynamical system.

Lemma 4.2. If the cost functions $c_t(\cdot, \cdot)$ are $\alpha$-strongly convex, $K$ is a $(\kappa, \gamma)$-diagonal strongly stable matrix, and Assumption 2.1 is met, then the idealized functions $f_t(M)$ are $\lambda$-strongly convex with respect to $M$, where $\lambda = \Omega(\alpha \gamma^2 \sigma^2)$ (with the implicit constant polynomial in $\kappa^{-1}$).

We present the proof for simple cases in Section 6, deferring the general proof to Section F.

Online Natural Gradient Update.

Theorem 4.3 (Online Natural Gradient Update). Suppose Algorithm 1 (Online Natural Gradient Update) is executed with $\eta_t = \Theta\!\left((\alpha t)^{-1}\right)$, on an LDS satisfying Assumption 2.1 and with control costs satisfying Assumption 2.2. Then it holds true that
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\!\left(\frac{G W^2}{\alpha \mu} \log^7(T)\right),$$
where $\mu^{-1} \triangleq \max_{M \in \mathcal{M}} \left\|(\mathbb{E}[J^\top J])^{-1} \nabla_{M_{\mathrm{vec}}} f_t(M)\right\|$.

In Theorem 4.3, the regret guarantee depends on an instance-dependent parameter $\mu$, which is a measure of the hardness of the problem. First, we note that the proof of Lemma 4.2 establishes that the Gram matrix of the Jacobian (Definition 3.5) is strictly positive definite, and hence we recover the logarithmic regret guarantee achieved by the Online Gradient Descent update, with the constants preserved.
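To contrast the two updates of Algorithm 1 whose guarantees are stated above, here is a sketch of the online loop for each, operating on the flattened parameter $M_{\mathrm{vec}}$. The gradient oracle, the fixed matrix standing in for $\mathbb{E}[J^\top J]$, and the per-matrix norm clipping used in place of the exact projection $\Pi_{\mathcal{M}}$ are all simplifications for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_u, H = 2, 1, 4
dim = H * d_u * d_x                       # dimension of the flattened parameter M_vec
kappa, kappa_B, gamma = 1.5, 1.0, 0.2     # hypothetical stability parameters

def project(M_vec):
    """Crude stand-in for the projection onto M: clip each M^[i-1] to the
    prescribed norm bound kappa^3 * kappa_B * (1 - gamma)^i."""
    M = M_vec.reshape(H, d_u, d_x).copy()
    for i in range(H):
        bound = kappa**3 * kappa_B * (1 - gamma) ** (i + 1)
        norm = np.linalg.norm(M[i])
        if norm > bound:
            M[i] *= bound / norm
    return M.reshape(-1)

def grad_oracle(M_vec, t):
    """Hypothetical gradient of the surrogate loss at M_vec (placeholder, not the paper's f_t)."""
    return M_vec - 0.1 * np.sin(t) * np.ones(dim)

# Fixed positive-definite preconditioner standing in for E[J^T J].
EJJ = np.eye(dim) + 0.5 * np.diag(rng.random(dim))
EJJ_inv = np.linalg.inv(EJJ)

M_ogd, M_ong = np.zeros(dim), np.zeros(dim)
for t in range(1, 200):
    eta = 1.0 / t                                    # eta_t = Theta(1/t) schedule
    g_ogd, g_ong = grad_oracle(M_ogd, t), grad_oracle(M_ong, t)
    M_ogd = project(M_ogd - eta * g_ogd)             # online gradient update
    M_ong = project(M_ong - eta * EJJ_inv @ g_ong)   # online natural gradient update
```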
Corollary 4.4. In addition to the assumptions of Theorem 4.3, if $K$ is a $(\kappa, \gamma)$-diagonal strongly stable matrix, then for the natural gradient update
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\!\left(\frac{G^2 W^4}{\alpha \sigma^2} \log^7(T)\right).$$

Proof. The conclusion follows from Lemma 5.2 and Lemma 6.1, the latter being the core component in the proof of Lemma 4.2, showing that $\mathbb{E}[J^\top J] \succeq \Omega(\gamma^2 \sigma^2)\, I$.

Secondly, we note that, being instance-dependent, the guarantee the natural gradient update offers can potentially be stronger than that of the online gradient method. A case in point is the following corollary involving spherically symmetric quadratic costs, in which case the natural gradient update yields a regret guarantee under demonstrably more general conditions, in that the bound does not depend on the minimum eigenvalue $\sigma^2$ of the covariance of the disturbances, unlike OGD.

Corollary 4.5. Under the assumptions of Theorem 4.3, if the cost functions are of the form $c_t(x, u) = r_t(\|x\|^2 + \|u\|^2)$, where $r_t \in [\alpha, \beta]$ is an adversarially chosen sequence of numbers and $K$ is chosen to be a $(\kappa, \gamma)$-diagonal strongly stable matrix, then the natural gradient update guarantees
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\!\left(\frac{\beta^2 W^2}{\alpha} \log^7(T)\right).$$

Proof. Note that
$$\left\|\nabla_{M_{\mathrm{vec}}} f_t(M)\right\|^2_{(\mathbb{E}[J^\top J])^{-1}} = \left\|\mathbb{E}\left[J^\top (r_t I) J\right] M_{\mathrm{vec}}\right\|^2_{(\mathbb{E}[J^\top J])^{-1}} \le \beta^2 \left\|M_{\mathrm{vec}}\right\|^2_{\mathbb{E}[J^\top J]}.$$

## 5 Regret Analysis

This section is a condensation of the results from [4], which we present in this form to highlight the reduction to OCO with memory.

### 5.1 Reduction to Low Regret with Memory

The next lemma shows that achieving low policy regret on the memory-based functions $f_t$ is sufficient to ensure low regret on the overall dynamical system. Since the proof is essentially provided by [4], we include it in the Appendix for completeness. Define
$$\mathcal{M} \triangleq \left\{M = \{M^{[0]}, \ldots, M^{[H-1]}\} : \|M^{[i-1]}\| \le \kappa^3 \kappa_B (1-\gamma)^i\right\}.$$

Lemma 5.1. Let the dynamical system satisfy Assumption 2.1 and let $K$ be any $(\kappa, \gamma)$-diagonal strongly stable matrix. Consider a sequence of loss functions $c_t(x, u)$ satisfying Assumption 2.2 and a sequence of policies $M_0, \ldots, M_T$ satisfying
$$\text{Policy Regret} = \sum_{t=0}^{T} f_t(M_{t-H-1:t}) - \min_{M \in \mathcal{M}} \sum_{t=0}^{T} f_t(M) \le R(T)$$
for some function $R(T)$, with $f_t$ as defined in Definition 3.4. Let $\mathcal{A}$ be an online algorithm that plays the non-stationary controller sequence $\{M_0, \ldots, M_T\}$. Then, as long as $H$ is chosen to be larger than $\gamma^{-1} \log(T\kappa^2)$, we have
$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le R(T) + O(GW^2 \log(T)).$$
Here $O(\cdot), \Theta(\cdot)$ contain polynomial factors in $\gamma^{-1}, \kappa_B, \kappa, d$.

Lemma 5.2. The function $f_t$ as defined in Definition 3.4 is coordinate-wise $L$-Lipschitz and the norm of its gradient is bounded by $G_f$, where
$$L = \frac{2 D G W \kappa_B \kappa^3}{\gamma}, \qquad G_f \le GDWHd\left(H + 2\kappa_B \kappa^3\right), \qquad \text{with} \quad D \le \frac{W \kappa^2 \left(1 + H \kappa_B^2 \kappa^3\right)}{\gamma \left(1 - \kappa^2 (1-\gamma)^{H+1}\right)} + \kappa_B \kappa^3 W.$$
The proof of this lemma is identical to the analogous lemma in [4] and hence is omitted.

### 5.2 Analysis for Online Gradient Descent

In the setting of online convex optimization with memory, as shown by [5], running a memory-based OGD allows us to bound the policy regret via the following theorem, proven in the appendix.

Theorem 5.3. Consider the OCO with memory setting defined in Section 3.4. Let $\{f_t\}_{t=H}^{T}$ be Lipschitz loss functions with memory such that the $\tilde{f}_t(x)$ are $\lambda$-strongly convex, and let $L$ and $G_f$ be as defined in (3.1) and (3.2). Then there exists an algorithm which generates a sequence $\{x_t\}_{t=0}^{T}$ such that
$$\sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \tilde{f}_t(x) \le \frac{G_f^2 + L H^2 G_f}{\lambda} \left(1 + \log(T)\right).$$

Proof of Theorem 4.1. Setting $H = \gamma^{-1} \log(T\kappa^2)$, Theorem 5.3, in conjunction with Lemma 5.2, implies that the policy regret is bounded by $O\!\left(\frac{G^2 W^4 H^6}{\alpha \sigma^2} \log T\right)$. An invocation of Lemma 5.1 now suffices to conclude the proof of the claim.
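A minimal sketch of the OCO-with-memory template behind Theorem 5.3, assuming simple made-up strongly convex losses: the player runs projected OGD on the unary losses $\tilde f_t$, while the regret of interest is measured on windows of the last $H+1$ iterates. The $\eta_t \propto 1/t$ schedule keeps consecutive iterates close, which is what controls the memory term.

```python
import numpy as np

rng = np.random.default_rng(4)
d, H, T, radius = 3, 5, 500, 1.0

# Strongly convex unary losses, chosen arbitrarily: f~_t(x) = ||x - b_t||^2.
targets = rng.uniform(-0.5, 0.5, size=(T + 1, d))
unary = lambda x, t: np.sum((x - targets[t]) ** 2)

def with_memory(window, t):
    """Loss with memory f_t(x_{t-H}, ..., x_t): here, the average unary loss
    over the window (one simple way such a loss can arise)."""
    return np.mean([unary(x, t) for x in window])

def project(x):
    n = np.linalg.norm(x)
    return x if n <= radius else x * radius / n

x, iterates, policy_loss = np.zeros(d), [np.zeros(d)], 0.0
for t in range(1, T + 1):
    if t > H:
        policy_loss += with_memory(iterates[-(H + 1):], t)  # loss on the last H+1 iterates
    grad = 2 * (x - targets[t])            # gradient of the unary loss f~_t
    x = project(x - grad / (2 * t))        # eta_t ~ 1 / (strong-convexity * t)
    iterates.append(x)

print("accumulated policy loss:", policy_loss)
```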
### 5.3 Analysis for Online Natural Gradient Descent

Consider structured loss functions of the form $f_t(M_{0:H+1}) = \mathbb{E}[c_t(z)]$, where $z = \sum_{i=0}^{H+1} J_i [M_i]_{\mathrm{vec}}$, the $J_i$ are random matrices, and the $c_t$ are adversarially chosen strongly convex loss functions. In a similar vein, define $f_t(M)$ to be the specialization of $f_t$ when every argument equals $M$, and define $J = \sum_{i=0}^{H+1} J_i$. The proof of the following theorem may be found in the appendix.

Theorem 5.4. In the setting described in this subsection, let $c_t$ be $\alpha$-strongly convex, let $f_t$ satisfy equation (3.1) with constant $L$, and let
$$G_f = \max_{M \in \mathcal{M}} \left\|(\mathbb{E}[J^\top J])^{-1} \nabla_{M_{\mathrm{vec}}} f_t(M)\right\|.$$
Then the online natural gradient update generates a sequence $\{M_t\}_{t=0}^{T}$ such that
$$\sum_{t=H}^{T} f_t(M_{t-H:t}) - \min_{M \in \mathcal{M}} \sum_{t=H}^{T} f_t(M) \le \frac{\max_{M \in \mathcal{M}} \left\|\nabla_{M_{\mathrm{vec}}} f_t(M)\right\|^2_{(\mathbb{E}[J^\top J])^{-1}} + L H^2 G_f}{\alpha} \left(1 + \log(T)\right).$$

Proof of Theorem 4.3. First observe that
$$\left\|\nabla_{M_{\mathrm{vec}}} f_t(M)\right\|^2_{(\mathbb{E}[J^\top J])^{-1}} \le \mu^{-1} \left\|\nabla_{M_{\mathrm{vec}}} f_t(M)\right\|.$$
Setting $H = \gamma^{-1} \log(T\kappa^2)$, Theorem 5.4, in conjunction with Lemma 5.2, implies the stated bound on the policy regret. An invocation of Lemma 5.1 suffices to conclude the proof of the claim.

## 6 Proof of Strong Convexity in Simpler Cases

We will need some definitions and preliminaries, outlined below. By definition we have that $f_t(M) = \mathbb{E}[c_t(y(M), v(M))]$. Since $c_t$ is strongly convex, we have
$$\nabla^2 f_t(M) = \mathbb{E}_{\{w_k\}_{k=0}^{2H-1}}\left[J^\top \nabla^2 c_t\big(y(M), v(M)\big)\, J\right] \succeq \alpha\, \mathbb{E}_{\{w_k\}_{k=0}^{2H-1}}\left[J_y^\top J_y + J_v^\top J_v\right].$$
Here $J_y, J_v$ are random matrices dependent on the noise $\{w_k\}_{k=0}^{2H-1}$. The next lemma implies Lemma 4.2.

Lemma 6.1. If Assumption 2.1 is satisfied and $K$ is chosen to be a $(\kappa, \gamma)$-diagonal strongly stable matrix, then the following holds:
$$\mathbb{E}_{\{w_k\}_{k=0}^{2H-1}}\left[J_y^\top J_y + J_v^\top J_v\right] \succeq \Omega(\gamma^2 \sigma^2)\, I.$$

To analyze $J_y, J_v$, we rearrange the definition of $y(M)$ to make the dependence on each individual $M^{[i]}$ explicit. To this end, consider the following definition for all $k \in \{0\} \cup [H+1]$:
$$v_k(M) \triangleq \sum_{i=1}^{H} M^{[i-1]} w_{2H-i-k}.$$
Under this definition it follows that
$$y(M) = \sum_{k=1}^{H+1} (A - BK)^{k-1} B\, v_k(M) + \sum_{k=1}^{H+1} (A - BK)^{k-1} w_{2H-k}, \qquad v(M) = -K\, y(M) + v_0(M).$$
From the above definitions, $(J_y, J_v)$ may be characterized in terms of the Jacobian of $v_k$ with respect to $M$, which we denote for the rest of this section by $J_{v_k}$. Defining $M_{\mathrm{vec}}$ as the stacking of the rows of each $M^{[i]}$ vertically, i.e., stacking the columns of $(M^{[i]})^\top$, it can be observed that for all $k$,
$$J_{v_k} = \frac{\partial v_k(M)}{\partial M_{\mathrm{vec}}} = \begin{bmatrix} I_{d_u} \otimes w_{2H-k-1}^\top & I_{d_u} \otimes w_{2H-k-2}^\top & \cdots & I_{d_u} \otimes w_{H-k}^\top \end{bmatrix},$$
where $d_u$ is the dimension of the controls. We are now ready to analyze the two simpler cases. For the rest of this section we drop the subscript $\{w_k\}_{k=0}^{2H-1}$ from the expectations for brevity.

### 6.1 Proof of Lemma 6.1: K = 0

In this section we assume that $K = 0$ is a $(\kappa, \gamma)$-diagonal strongly stable policy for $(A, B)$. By definition, we have $v(M) = v_0(M)$. One may conclude the proof with the following observation:
$$\mathbb{E}[J_y^\top J_y + J_v^\top J_v] \succeq \mathbb{E}[J_v^\top J_v] = \mathbb{E}[J_{v_0}^\top J_{v_0}] = I_{d_u H} \otimes \Sigma \succeq \sigma^2 I,$$
where $\Sigma = \mathbb{E}[w w^\top]$.

### 6.2 Proof of Lemma 6.1: 1-dimensional case

Here we specialize Lemma 4.2 to a one-dimensional state and one-dimensional control. This case highlights the difficulty caused in the proof by choosing a non-zero $K$, and presents the main ideas of the proof in simplified notation. Note that in the one-dimensional case, the policy given by $M = \{M^{[i]}\}_{i=0}^{H-1}$ is an $H$-dimensional vector with each $M^{[i]}$ a scalar. Furthermore, $y(M), v(M), v_k(M)$ are scalars, and hence their Jacobians $J_y, J_v, J_{v_k}$ with respect to $M$ are $1 \times H$ vectors. In particular, we have that
$$J_{v_k} = \frac{\partial v_k(M)}{\partial M} = \begin{bmatrix} w_{2H-k-1} & w_{2H-k-2} & \cdots & w_{H-k} \end{bmatrix}.$$
Therefore, using the fact that $\mathbb{E}[w_i w_j] = 0$ for $i \ne j$ and $\mathbb{E}[w_i^2] = \sigma^2$, it can be observed that for any $k_1, k_2$ we have
$$\mathbb{E}[J_{v_{k_1}}^\top J_{v_{k_2}}] = T_{k_1 - k_2}\, \sigma^2, \qquad (6.1)$$
where $T_m$ is defined as the $H \times H$ matrix with $[T_m]_{ij} = 1$ if and only if $i - j = m$, and $0$ otherwise. This in particular immediately gives us that
$$\mathbb{E}[J_y^\top J_y] = \underbrace{\left(\sum_{k_1=1}^{H+1} \sum_{k_2=1}^{H+1} T_{k_1 - k_2}\, (A - BK)^{k_1 + k_2 - 2}\right)}_{\triangleq\, G} B^2 \sigma^2, \qquad (6.2)$$
$$\mathbb{E}[J_{v_0}^\top J_y] = \underbrace{\left(\sum_{k=1}^{H+1} T_{-k}\, (A - BK)^{k-1}\right)}_{\triangleq\, Y} B\, \sigma^2. \qquad (6.3)$$
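Using the one-dimensional formulas reconstructed above, the sketch below Monte-Carlo-estimates $\mathbb{E}[J_y^\top J_y + J_v^\top J_v]$ and reports its smallest eigenvalue, which Lemma 6.1 asserts is bounded away from zero. The scalar system $(a, b, k)$ and the uniform noise are hypothetical choices for the experiment.

```python
import numpy as np

rng = np.random.default_rng(5)
H = 6
a, b, k = 0.9, 1.0, 0.5      # hypothetical scalars A, B, K with |a - b*k| < 1
a_cl = a - b * k             # closed-loop scalar A - BK

def jacobians(w):
    """1-D Jacobians: J_{v_k} = [w_{2H-k-1}, ..., w_{H-k}] with w_j = 0 for j < 0,
    J_y = sum_{k=1}^{H+1} a_cl^{k-1} * b * J_{v_k} (the direct noise terms carry no
    M-dependence and hence do not contribute), and J_v = -K * J_y + J_{v_0}."""
    def w_at(j):
        return w[j] if j >= 0 else 0.0
    def J_vk(kk):
        return np.array([w_at(2 * H - i - kk) for i in range(1, H + 1)])
    J_y = sum(a_cl ** (kk - 1) * b * J_vk(kk) for kk in range(1, H + 2))
    J_v = -k * J_y + J_vk(0)
    return J_y, J_v

acc, n = np.zeros((H, H)), 20000
for _ in range(n):
    w = rng.uniform(-1, 1, size=2 * H)   # w_0, ..., w_{2H-1}; variance sigma^2 = 1/3
    J_y, J_v = jacobians(w)
    acc += np.outer(J_y, J_y) + np.outer(J_v, J_v)
acc /= n

# Lemma 6.1 asserts this is bounded away from zero (for K = 0 it would be >= sigma^2).
print("lambda_min of E[J_y^T J_y + J_v^T J_v] ≈", np.linalg.eigvalsh(acc).min())
```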
First, we prove a few spectral properties of the matrices $G$ and $Y$ defined above. From Gershgorin's circle theorem, and the fact that $K$ is $(\kappa, \gamma)$-diagonal strongly stable, we have
$$\left\|\sum_{k=1}^{H+1} (T_{-k} + T_k)(A - BK)^{k-1}\right\| \le 2\gamma^{-1}. \qquad (6.4)$$
The spectral properties of $G$ summarized in the lemma below form the core of our analysis.

Lemma 6.2. $G$ is a symmetric positive definite matrix. In particular, $G \succeq \frac{1}{4} I$.

Now consider the following statements, which follow from the respective definitions:
$$\mathbb{E}[J_v^\top J_v] = K^2\, \mathbb{E}[J_y^\top J_y] - K\, \mathbb{E}[J_y^\top J_{v_0}] - K\, \mathbb{E}[J_{v_0}^\top J_y] + \mathbb{E}[J_{v_0}^\top J_{v_0}] = \sigma^2 \underbrace{\left(B^2 K^2 G - BK\,(Y + Y^\top) + I\right)}_{\triangleq\, F}.$$
Now $F \succeq 0$, since $\sigma^2 F = \mathbb{E}[J_v^\top J_v]$, and consequently $\mathbb{E}[J_y^\top J_y + J_v^\top J_v] = \sigma^2 (F + B^2 G)$. We finish the proof by considering two cases.

The first case is when $3|B|\gamma^{-1}\kappa \ge 1$. Noting $\kappa \ge 1$, in this case Lemma 6.2 immediately implies that for any vector $m$,
$$m^\top (F + B^2 G)\, m \ \ge\ B^2\, m^\top G\, m \ \ge\ \frac{B^2}{4}\,\|m\|^2 \ \ge\ \frac{\gamma^2}{36\kappa^2}\,\|m\|^2.$$
In the second case (when $3|B|\gamma^{-1}\kappa \le 1$), (6.4) implies that
$$m^\top (F + B^2 G)\, m \ \ge\ m^\top \left(I - BK\,(Y + Y^\top)\right) m \ \ge\ \frac{1}{3}\,\|m\|^2 \ \ge\ \frac{\gamma^2}{36\kappa^2}\,\|m\|^2.$$

#### 6.2.1 Proof of Lemma 6.2

Recall that $T_m$ is defined as the $H \times H$ matrix with $[T_m]_{ij} = 1$ if and only if $i - j = m$, and $0$ otherwise. Define the following matrix for any complex number $\psi$ with $|\psi| \le 1$:
$$G(\psi) \triangleq \sum_{k_1=1}^{H+1} \sum_{k_2=1}^{H+1} T_{k_1 - k_2}\, \bar{\psi}^{\,k_1 - 1}\, \psi^{\,k_2 - 1}.$$
Note that $G$ in Lemma 6.2 is equal to $G(A - BK)$. The following lemma, proven in Section E, provides a lower bound on the spectrum of the matrix $G(\psi)$. The lemma treats the more general case of complex $\psi$, which aids the multi-dimensional case. A special case, $\psi = 1$, was proven in [12], and we follow a similar approach relying on the inverse.

Lemma 6.3. Let $\psi$ be a complex number such that $|\psi| \le 1$. We have that $G(\psi) \succeq \frac{1}{4} I_H$.
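As a quick numerical sanity check of Lemma 6.3 (under the reconstruction of $G(\psi)$ above, including the conjugation on the first factor, which is an assumption of this rewrite), the sketch below builds the matrices $T_m$, forms $G(\psi)$ on a grid of $\psi$ with $|\psi| \le 1$, and reports the smallest eigenvalue observed; it should not drop below $1/4$.

```python
import numpy as np

H = 8

def T_mat(m):
    """H x H matrix T_m with [T_m]_{ij} = 1 iff i - j = m, else 0."""
    out = np.zeros((H, H))
    for i in range(H):
        j = i - m
        if 0 <= j < H:
            out[i, j] = 1.0
    return out

def G(psi):
    """G(psi) = sum_{k1,k2=1}^{H+1} T_{k1-k2} * conj(psi)^(k1-1) * psi^(k2-1)."""
    out = np.zeros((H, H), dtype=complex)
    for k1 in range(1, H + 2):
        for k2 in range(1, H + 2):
            out += T_mat(k1 - k2) * np.conj(psi) ** (k1 - 1) * psi ** (k2 - 1)
    return out

min_eig = min(
    np.linalg.eigvalsh(G(r * np.exp(1j * theta))).min()
    for r in np.linspace(0, 1, 11)
    for theta in np.linspace(0, 2 * np.pi, 25)
)
print("smallest eigenvalue of G(psi) over the sampled psi:", min_eig)  # expect >= 0.25
```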
## 7 Conclusion

We presented two algorithms for controlling linear dynamical systems with strongly convex costs, with regret that scales poly-logarithmically with time. This improves upon state-of-the-art regret bounds that scale as $O(\sqrt{T})$. It remains open to extend the poly-logarithmic regret guarantees to more general systems and loss functions, such as exp-concave losses, or, alternatively, to show that this is impossible.

## Acknowledgements

The authors thank Sham Kakade and Cyril Zhang for various thoughtful discussions. Elad Hazan acknowledges funding from NSF grant #CCF-1704860.

## References

[1] Yasin Abbasi-Yadkori, Peter Bartlett, and Varun Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369–377, 2014.
[2] Yasin Abbasi-Yadkori, Nevena Lazic, and Csaba Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.
[3] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
[4] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In Proceedings of the 36th International Conference on Machine Learning, pages 111–119, 2019.
[5] Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.
[6] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pages 1503–1510, 2012.
[7] Sanjeev Arora, Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Towards provable control for unknown linear dynamical systems. 2018.
[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[9] Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.
[10] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
[11] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466–1475, 2018.
[12] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In Proceedings of the 35th International Conference on Machine Learning, pages 1783–1791, 2018.
[13] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[14] Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643, 2018.
[15] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.
[16] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[17] Robert F. Stengel. Optimal Control and Estimation. Courier Corporation, 1994.
[18] Gilbert Strang. Introduction to Linear Algebra, volume 3.