# Geometric Exploration for Online Control

Orestis Plevrakis, Princeton University, orestisp@princeton.edu
Elad Hazan, Princeton University and Google AI Princeton, ehazan@princeton.edu

Abstract

We study the control of an unknown linear dynamical system under general convex costs. The objective is minimizing regret against the class of strongly-stable linear policies. In this work, we first consider the case of known cost functions, for which we design the first polynomial-time algorithm with $n^3\sqrt{T}$-regret, where $n$ is the dimension of the state plus the dimension of the control input. The $\sqrt{T}$-horizon dependence is optimal, and improves upon the previous best known bound of $T^{2/3}$. The main component of our algorithm is a novel geometric exploration strategy: we adaptively construct a sequence of barycentric spanners in an over-parameterized policy space. Second, we consider the case of bandit feedback, for which we give the first polynomial-time algorithm with $\mathrm{poly}(n)\sqrt{T}$-regret, building on Stochastic Bandit Convex Optimization.

1 Introduction

In this paper we study the online control of an unknown linear dynamical system under general convex costs. This is a fundamental problem in control theory, and it also embodies a central challenge of reinforcement learning: balancing exploration and exploitation in continuous spaces. For this reason, it has recently received considerable attention from the machine learning community.

Controlling an unknown LDS: In a linear dynamical system (LDS), the system state $x_t \in \mathbb{R}^{d_x}$ evolves as
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t, \qquad x_1 = 0, \tag{1}$$
where $u_t \in \mathbb{R}^{d_u}$ is the learner's control input, $w_t \in \mathbb{R}^{d_x}$ is a noise process drawn as $w_t \sim_{\mathrm{i.i.d.}} N(0, I)$, and $A_\star, B_\star$ are unknown system matrices. The learner applies control $u_t$ at timestep $t$, then observes the state $x_{t+1}$ and suffers cost $c(x_t, u_t)$, where $c$ is a convex function. We consider two forms of cost information for the learner: the case where $c$ is known in advance, and the bandit version where only the scalar cost is observed.

Even if the dynamics were known, there are problem instances where the optimal policy is a very complicated function [6]. A way to circumvent this is to consider a policy class that is both expressive and tractable, and to aim to perform as well as the best policy from that class. The objective that captures this goal is regret, which is a standard performance metric in online control. In this paper, in accordance with previous works, we choose the policy class to be all strongly-stable linear policies.¹ These are policies that 1) select the control as a linear function of the state, and 2) mix fast to a steady state. These policies are a reasonable baseline, since they are optimal for the special case where the cost $c$ is a strongly convex quadratic.

¹In Section 5, we explain how to extend all our results to the more general class of stabilizing linear-dynamical-control policies.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Formally, regret with respect to the class $\mathcal{K}$ of all strongly-stable linear policies is defined as
$$R_T = \sum_{t=1}^{T} c(x_t, u_t) - T \cdot \min_{K \in \mathcal{K}} J(K), \tag{2}$$
where a policy $K \in \mathcal{K}$ applies $u_t = K x_t$ and, letting $E_K$ denote expectation under this policy,
$$J(K) = \lim_{T \to \infty} \frac{1}{T}\, E_K\!\left[\sum_{t=1}^{T} c(x_t, u_t)\right]$$
is the average infinite-horizon cost of $K$. Our main result is a polynomial-time algorithm for the case of known cost function that achieves $n^3\sqrt{T}$-regret, where $n = d_x + d_u$. This is the optimal dependence on the time horizon, and our result improves upon the previous best known bound of $T^{2/3}$ [17, 29]. Perhaps more important than the regret bound itself is that, using ideas from convex geometry, we design a novel exploration strategy which significantly expands the existing algorithmic toolbox for balancing exploration and exploitation in linear dynamical systems, as we explain below.
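To make the setup concrete, here is a minimal sketch (not from the paper) that simulates the dynamics of equation (1) under a fixed linear policy $u_t = K x_t$ and estimates its average cost $J(K)$ empirically. The matrices `A_star`, `B_star`, `K` and the cost `c` are illustrative placeholders.

```python
import numpy as np

def average_cost(A_star, B_star, K, c, T=10_000, rng=np.random.default_rng(0)):
    """Roll out u_t = K x_t on the LDS of equation (1) and average the costs."""
    dx = A_star.shape[0]
    x = np.zeros(dx)
    total = 0.0
    for _ in range(T):
        u = K @ x                        # linear policy u_t = K x_t
        total += c(x, u)                 # incur the convex cost c(x_t, u_t)
        w = rng.standard_normal(dx)      # disturbance w_t ~ N(0, I)
        x = A_star @ x + B_star @ u + w  # dynamics of equation (1)
    return total / T

# Hypothetical 2-state / 1-control system with an L1-type convex cost.
A_star = np.array([[0.9, 0.1], [0.0, 0.8]])
B_star = np.array([[0.0], [1.0]])
K = np.array([[-0.2, -0.5]])
c = lambda x, u: np.abs(x).sum() + np.abs(u).sum()
print(average_cost(A_star, B_star, K, c))
```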
Beyond explore-then-commit: the challenge of exploration

The only previously known algorithms for this problem apply the simplest exploration strategy: explore-then-commit (ETC), known in the control literature as certainty equivalence. In ETC, the learner spends the first $T_0$ steps playing random controls (e.g., $u_t \sim N(0, I)$), then estimates the system dynamics, and thereafter executes a greedy policy based on these estimates. On the other hand, the stochastic bandit and RL theory literature is largely about sophisticated and sample-efficient exploration, mostly relying on the principle of optimism in the face of uncertainty (OFU). Unfortunately, implementing OFU in online control requires solving optimization problems that are intractable in general [1]. Even though this computational issue can be circumvented for the case of quadratic costs using semidefinite programming [12], these techniques do not apply to general convex costs. In this work, we do not follow the OFU principle. Our exploration strategy is based on adaptively constructing a sequence of barycentric spanners (Definition 7) in an over-parameterized policy space.
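For contrast with the geometric exploration developed in this paper, the following is a minimal sketch of the identification phase of ETC described above: play random controls for $T_0$ steps, then fit $(A, B)$ by least squares on the observed transitions. The function `step` and all names are hypothetical stand-ins for the unknown system; this is not the paper's algorithm.

```python
import numpy as np

def etc_identify(step, dx, du, T0=2_000, rng=np.random.default_rng(1)):
    """Exploration phase of ETC: random controls, then a least-squares fit of (A, B).

    `step(x, u)` is assumed to return the next state of the unknown system.
    """
    Z, X_next = [], []
    x = np.zeros(dx)
    for _ in range(T0):
        u = rng.standard_normal(du)        # pure exploration: u_t ~ N(0, I)
        x_next = step(x, u)
        Z.append(np.concatenate([x, u]))   # regressor z_t = (x_t, u_t)
        X_next.append(x_next)              # target x_{t+1}
        x = x_next
    Z, X_next = np.array(Z), np.array(X_next)
    # Solve min_{(A B)} sum_t ||(A B) z_t - x_{t+1}||^2 by ordinary least squares.
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    AB_hat = W.T                           # shape (dx, dx + du)
    return AB_hat[:, :dx], AB_hat[:, dx:]  # (A_hat, B_hat)
```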
The importance of general convex costs

The special case of convex quadratic costs is the classical linear quadratic regulator, and it is frequently used because it leads to a nice analytical solution when the system is known [7]. However, this modeling choice is fairly restrictive, and in 1987, Tyrrell Rockafellar [23] proposed the use of general convex functions for modeling the cost in an LDS, in order to handle constraints on state and control. In practice, imposing constraints is crucial for ensuring safe operating conditions.

1.1 Statement of results

We consider both the setting where $A_\star$ is strongly stable (Assumption 1) and, in the Appendix, the case of unstable systems, where we assume that the learner is initially given a strongly-stable linear policy (Assumption 4).² Our main result is the geometric exploration strategy given in Algorithm 1, for the case of known cost function. Algorithm 4 is for the case of bandit feedback. We now state informal versions of our theorems. Let $C = C(A_\star, B_\star)$ denote a constant that depends polynomially on natural system parameters.

Theorem 1 (informal). For online control of an LDS with known cost function, with high probability, Algorithm 1 has regret $R_T \le \tilde{O}(C)\, n^3 \sqrt{T}$.³

Theorem 2 (informal). For online control of an LDS with bandit feedback, with high probability, Algorithm 4 has regret $R_T \le \tilde{O}(C)\, \mathrm{poly}(n)\, \sqrt{T}$.

²For unstable systems, without Assumption 4, the regret is exponential in $n$ (see [10]).
³$\tilde{O}(1)$ hides logarithmic factors.

In Theorem 2, the polynomial dependence on $n$ is rather large ($n^{36}$). A large dimension dependence is typical in $\sqrt{T}$-regret algorithms for bandit convex optimization (BCO). Our setting is even more challenging than BCO, since the environment has a state.

1.2 Our Approach and New Techniques

The cost $J(K)$ is nonconvex in $K$. To remedy this, we use an alternative disturbance-based policy class, introduced in [4], where the control is linear in the past disturbances (as opposed to the state):
$$u_t = \sum_{i=1}^{H} M^{[i-1]} w_{t-i}, \tag{6}$$
where $H$ is a hyperparameter. This parameterization is useful because the offline search for an optimal disturbance-based policy can be posed as a convex program. In [4], the authors show that disturbance-based policies can approximate all strongly-stable linear policies. We move on to describe the novel components of our work.

Geometric exploration: Algorithm 1 runs in epochs and follows the phased-elimination paradigm [20]. During epoch $r$, it focuses on a convex disturbance-based policy set $\mathcal{M}_r$, which by the end of the epoch will be substituted by $\mathcal{M}_{r+1} \subseteq \mathcal{M}_r$. At the beginning of the epoch, it decides on a set of exploratory policies to be executed during the epoch. This is done by constructing a barycentric spanner of the policy set $\mathcal{M}_r$ (Definition 7). At the end of the epoch, it refines its estimates of the system matrices $A_\star, B_\star$ and creates $\mathcal{M}_{r+1} \subseteq \mathcal{M}_r$.

Estimating the disturbances: A typical difficulty in using disturbance-based policies is that we do not know the disturbances, and since the system matrices are unknown, we cannot recover $w_t$ via $w_t = x_{t+1} - A_\star x_t - B_\star u_t$. A key contribution of this work is showing that the disturbances can be accurately estimated without accurately estimating the system matrices $A_\star, B_\star$.

Coupling: At the end of epoch $r$, we create estimates $\hat{A}, \hat{B}$ of $A_\star, B_\star$, which we use to create the next policy set $\mathcal{M}_{r+1}$. These estimates are not necessarily accurate in every direction. To argue that $\mathcal{M}_{r+1}$ contains good policies, we employ a probabilistic coupling argument between the true system and the estimated one. This could be a useful technique for future work.

Robustness of stochastic BCO: For bandit feedback, we require a generalization of the Stochastic Bandit Convex Optimization (SBCO) framework, in which the noise contains a small adversarial component in addition to the stochastic part. The need for robustness to adversarial noise arises from the estimation error on the disturbances, $\|\hat{w}_t - w_t\|$. We prove that by appropriately changing the parameters in the SBCO algorithm from [3], we can get $\sqrt{T}$-regret for this more general model.

1.3 Prior Work

LQR: When the cost $c$ is convex quadratic, we obtain the online linear quadratic regulator (LQR) [1, 14, 21, 12, 27]. The problem was introduced in [1], and [21, 12, 27] gave $\sqrt{T}$-regret algorithms with polynomial runtime and polynomial regret dependence on relevant problem parameters. In [27], the authors proved that $\sqrt{T}$-regret is optimal.

Convex costs: Closer to our work are recent papers on online control with general convex costs [17, 29, 19]. These papers consider even more general models, i.e., adversarially changing convex cost functions; [17, 29] also allow adversarial disturbances, and [29, 19] deal with partial observation. Furthermore, all the considered policy classes have linear structure, and [17] uses the same policy class as we do, i.e., strongly-stable linear policies. Despite the differences in the models, a common feature of these works is that all algorithms apply explore-then-commit (ETC). For our setting, ETC gives $T^{2/3}$-regret. Under the assumption that $c$ is strongly convex, the problem is significantly simplified and ETC achieves $\sqrt{T}$-regret [29].

Linear system identification: To address unknown systems, we make use of least-squares estimation [28, 24].
Recent papers deal with system identification under partial observation [22, 25, 30, 26].

Bandit feedback: Control with bandit feedback has been studied in [9] (known system) and [15] (both known and unknown system). Our result is comparable to [15], and improves upon the $T^{3/4}$ regret bound that they achieve when the disturbances are stochastic and the cost is a fixed function.

Barycentric spanners: Barycentric spanners have been used for exploration in stochastic linear bandits [5]. However, in that context the barycentric spanner is computed offline and remains fixed, while our algorithm adaptively changes it based on the observed states.

2 Setting and Background

Notation. For a matrix $A$, we use $\|A\|$ to denote its spectral norm, and for a positive semidefinite matrix $\Sigma \succeq 0$, we use $\|A\|_\Sigma$ to denote $\sqrt{\mathrm{tr}(A^T \Sigma A)}$. For notational simplicity, "with high probability" means with probability at least $1 - 1/T^c$, where $c$ is a sufficiently large constant.

2.1 Assumptions and Regret Benchmark

Linear policies are parameterized by a matrix $K$ and apply controls $u_t = K x_t$. We define the class of strongly-stable linear policies. This definition was introduced in [11] and quantifies the classical notion of a stable policy in a manner that permits non-asymptotic regret bounds.

Definition 3. A linear policy $K$ is $(\kappa, \gamma)$-strongly stable if there exists a decomposition $A_\star + B_\star K = Q \Lambda Q^{-1}$ with $\|\Lambda\| \le 1 - \gamma$ and $\|K\|, \|Q\|, \|Q^{-1}\| \le \kappa$.

For simplicity, in the main text we assume that $K = 0$ is strongly stable.

Assumption 1. The linear policy $K = 0$ is $(\kappa, \gamma)$-strongly stable, for some known constants $\kappa \ge 1$, $\gamma > 0$. When referring to this assumption, we will also say "$A_\star$ is $(\kappa, \gamma)$-strongly stable".

In Appendix A, we relax this assumption and consider a possibly unstable $A_\star$, by using an initial stabilizing policy (Assumption 4). Assumption 4 is standard in the online control literature (e.g., [12, 17, 8]), and without it the regret is exponential in $n$ ([10]). The next assumptions are that $B_\star$ is bounded and the cost $c$ is Lipschitz.

Assumption 2. The norm $\|B_\star\| \le \beta$, for some known constant $\beta \ge 1$.

Assumption 3. The cost function $c$ is 1-Lipschitz.⁴

⁴We can easily account for more general $L$-Lipschitz costs via rescaling. Also, we can account for quadratic costs by assuming Lipschitzness inside a ball where the state and control belong with high probability.

Let $\mathcal{K}$ be the set of all $(\kappa, \gamma)$-strongly stable linear policies. Our objective is regret with respect to the policy class $\mathcal{K}$, as given in equation (2).

2.1.1 Disturbance-based policies

We provide some background definitions, which were introduced in [4].

Definition 4. A disturbance-based policy $M = (M^{[0]}, \ldots, M^{[H-1]})$ applied at time $t$ chooses control $u_t = \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$, where $w_s = 0$ for $s \le 0$.

The advantage of disturbance-based policies is that the average infinite-horizon cost $J(M)$ of the policy $M = (M^{[0]}, \ldots, M^{[H-1]})$ is a convex function of $M$. Also, this comes for free in terms of expressiveness, since disturbance-based policies can approximate all policies in $\mathcal{K}$ (Theorem 5 below). Formally, let $H := C \gamma^{-1} \log(T d_x d_u \kappa \beta \gamma^{-1})$, where $C$ is a sufficiently large constant (e.g., 1000), and
$$\mathcal{M} = \left\{ M = \left(M^{[0]}, \ldots, M^{[H-1]}\right) : \|M^{[i]}\| \le \kappa^3 \beta (1-\gamma)^i \right\}.$$

Theorem 5 ([4]). For all $K \in \mathcal{K}$, there exists a disturbance-based policy $M \in \mathcal{M}$ such that $|J(M) - J(K)| \le 1/T$.

This theorem is a slight modification of a result in [4], and we give its proof in Appendix G.3. Under the execution of a disturbance-based policy $M$, the state can be expressed as
$$x_{t+1} = A_\star^{H+1} x_{t-H} + \sum_{i=0}^{2H} \Psi_i(M \mid A_\star, B_\star)\, w_{t-i},$$
where the $\Psi_i(M \mid A_\star, B_\star)$ are affine functions of $M$.⁵ We provide exact expressions for the $\Psi_i$ in Appendix H.

⁵We set $x_s = 0$ for $s \le 0$.
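The following is a minimal sketch (with illustrative names, not the paper's code) of executing a disturbance-based policy from Definition 4 while maintaining the disturbance estimates $\hat{w}_t = x_{t+1} - \hat{A} x_t - \hat{B} u_t$ used throughout the paper; for simplicity it keeps the matrix estimates $\hat{A}, \hat{B}$ fixed rather than refitting them at every step.

```python
import numpy as np
from collections import deque

def run_disturbance_policy(step, M, A_hat, B_hat, dx, T=1_000):
    """Execute u_t = sum_{i=1}^{H} M^{[i-1]} w_hat_{t-i} and maintain w_hat."""
    H = len(M)                                   # M = [M^{[0]}, ..., M^{[H-1]}]
    w_hat = deque([np.zeros(dx)] * H, maxlen=H)  # last H disturbance estimates
    x, states = np.zeros(dx), []
    for _ in range(T):
        # the most recent estimate w_hat_{t-1} sits at the right end of the deque
        u = sum(M[i] @ w_hat[-(i + 1)] for i in range(H))
        x_next = step(x, u)
        # w_hat_t = x_{t+1} - A_hat x_t - B_hat u_t  (uses estimates, not A*, B*)
        w_hat.append(x_next - A_hat @ x - B_hat @ u)
        states.append(x_next)
        x = x_next
    return states
```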
Because $A_\star$ is $(\kappa, \gamma)$-strongly stable, it can be shown that $A_\star^{H+1} \approx 0$. This leads to the definition of two time-independent quantities: the surrogate state and the surrogate control.

Surrogate state and control: Let $\eta = (\eta_0, \eta_1, \ldots, \eta_{2H})$, where $\eta_i \sim_{\mathrm{i.i.d.}} N(0, I)$. We define
$$u(M \mid \eta) = \sum_{i=0}^{H-1} M^{[i]} \eta_i \quad \text{and} \quad x(M \mid A_\star, B_\star, \eta) = \sum_{i=0}^{2H} \Psi_i(M \mid A_\star, B_\star)\, \eta_i.$$
As we mentioned, $A_\star^{H+1} \approx 0$, so when we execute policy $M$ and $t \ge \Omega(H)$, the state/control pair $(x_t, u_t)$ is almost identically distributed with $(x(M \mid A_\star, B_\star, \eta), u(M \mid \eta))$.

Convex surrogate cost: We now define a function that approximates the cost $J(M)$ without involving infinite limits that create computational issues. Let
$$C(M \mid A_\star, B_\star) := E_\eta\Big[ c\big( x(M \mid A_\star, B_\star, \eta),\; u(M \mid \eta) \big) \Big].$$
The cost $C$ is convex, because $x$ and $u$ are affine in $M$ and $c$ is convex. The following theorem establishes that $J(M)$ is almost equal to $C(M \mid A_\star, B_\star)$.

Theorem 6 ([4]). For all $M \in \mathcal{M}$, we have $|C(M \mid A_\star, B_\star) - J(M)| \le 1/T$.

Again, this theorem is almost proved in [4], and for completeness we provide its proof in Appendix G.4. The algorithms we present in the paper aim to minimize $C(M \mid A_\star, B_\star)$. The difficulty is that we do not know this function, since we do not know $A_\star, B_\star$.

2.1.2 Barycentric spanners

Barycentric spanners and approximate barycentric spanners were introduced in [5]. Here, we will use the affine⁶ version of approximate barycentric spanners, defined below.

⁶The reason we need the affine variant of barycentric spanners is technical, and is due to the fact that the $\Psi_i(M \mid A_\star, B_\star)$ are affine functions of $M$ (instead of just linear).

Definition 7. Let $K$ be a compact set in $\mathbb{R}^d$. A set $X = \{x_0, x_1, \ldots, x_d\} \subseteq K$ is a $C$-approximate affine barycentric spanner for $K$ if every $x \in K$ can be expressed as $x = x_0 + \sum_{i=1}^{d} \lambda_i (x_i - x_0)$, where the coefficients $\lambda_i \in [-C, C]$.

Theorem 8. Suppose $K \subseteq \mathbb{R}^d$ is compact and not contained in any proper affine subspace. Given an oracle for optimizing linear functions over $K$, for any $C > 1$ we can compute a $C$-approximate affine barycentric spanner for $K$ in polynomial time, using $O(d^2 \log_C(d))$ calls to the oracle.

The above theorem follows from Proposition 2.5 in [5], as we show in Appendix G.5.

3 Known cost function

In this section, we present our polynomial-time algorithm for the case of known cost function (Algorithm 1), state our main theorem, and give an overview of its proof. Formally, known cost function means that the algorithm has offline access to the value $c(x, u)$ and gradient $\nabla_{x,u} c(x, u)$ for all state/control pairs $(x, u)$.

3.1 Algorithm and statement of main theorem

Algorithm 1 receives as input some initial estimates $A_0, B_0$ that approximate the true system matrices $A_\star, B_\star$ within error $\epsilon$. As we state in Theorem 9, $\epsilon$ needs to be $1/\mathrm{poly}(d_x, d_u)$, and we can make sure this is satisfied by executing a standard warm-up exploration procedure, given in Appendix F. Moreover, our algorithm computes a new 2-approximate affine barycentric spanner at the beginning of each epoch. This step can be implemented in polynomial time, as we explain in Subsection 3.3.

Algorithm 1: Geometric Exploration for Control
  Input: estimates $A_0, B_0$ satisfying $\|(A_0\ B_0) - (A_\star\ B_\star)\|_F \le \epsilon$.
  Initialize the policy set $\mathcal{M}_1 = \mathcal{M}$, the matrix estimates $(\hat{A}_1\ \hat{B}_1) = (A_0\ B_0)$, and the disturbance estimates $\hat{w}_t = 0$ for $t \le 0$. Set $d = d_u d_x H$. Set $t = 1$ and observe $x_1$.
  for $r = 1, 2, \ldots$ do
    Set $\epsilon_r = 2^{-r}$.
    Compute a 2-approximate affine barycentric spanner of $\mathcal{M}_r$: $\{M_{r,0}, M_{r,1}, \ldots, M_{r,d}\}$.
    Set $T_r = \tilde{\Theta}(\kappa^4 \gamma^{-3})\, \epsilon_r^{-2}\, d_x d_u (d_x + d_u)^2$.ᵃ
    Call Execute-Policy($M_{r,0}$, $d \cdot T_r$).
    for $j = 1, \ldots, d$ do
      Call Execute-Policy($M_{r,j}$, $T_r$).
    end
    Set $t_r = t$ (the current timestep).
    Eliminate suboptimal policies:ᵇ
    $$\mathcal{M}_{r+1} = \left\{ M \in \mathcal{M}_r : C\big(M \mid \hat{A}_{t_r}, \hat{B}_{t_r}\big) \le \min_{M' \in \mathcal{M}_r} C\big(M' \mid \hat{A}_{t_r}, \hat{B}_{t_r}\big) + 3\epsilon_r \right\} \tag{8}$$
  end

ᵃThe $\tilde{\Theta}(\cdot)$ in the definition of $T_r$ denotes $T_r = C \kappa^4 \gamma^{-3} \epsilon_r^{-2} d_x d_u (d_x + d_u)^2$, for some sufficiently large $C = \mathrm{polylog}(T)$. The same applies to the definition of $\lambda$ in Algorithm 3.
ᵇEquation (8) is a definition, not a step that requires computation. The computational step related to $\mathcal{M}_{r+1}$ is the construction of the 2-approximate affine barycentric spanner at the beginning of the next epoch.

Algorithm 2: Execute-Policy
  Input: policy $M$ and execution length $L$.
  for $s = 1, 2, \ldots, L$ do
    Apply control $u_t = \sum_{i=1}^{H} M^{[i-1]} \hat{w}_{t-i}$. Observe $x_{t+1}$.
    Call System-Estimation to get $\hat{A}_{t+1}, \hat{B}_{t+1}$.
    Record the estimate $\hat{w}_t = x_{t+1} - \hat{A}_{t+1} x_t - \hat{B}_{t+1} u_t$.
    Set $t = t + 1$.
  end

Algorithm 3: System-Estimation
  Set $\lambda = \tilde{\Theta}(\kappa^{10} \beta^4 \gamma^{-5})\, d_x d_u (d_x + d_u)^3$.
  Let $(\hat{A}_{t+1}\ \hat{B}_{t+1})$ be a minimizer of
  $$\sum_{s=1}^{t} \big\|(A\ B) z_s - x_{s+1}\big\|^2 + \lambda \big\|(A\ B) - (A_0\ B_0)\big\|_F^2$$
  over all $(A\ B)$, where $z_s = (x_s;\, u_s)$ is the concatenation of state and control. (A numerical sketch of this least-squares step appears after Corollary 10 below.)

We now formally state our main theorem.

Theorem 9. Let $C_1 = \kappa^{10} \beta^4 \gamma^{-5}$, $C_2 = \kappa^{14} \beta^4 \gamma^{-8}$ and $C_3 = \kappa^5 \beta^2 \gamma^{-2.5}$. Suppose that the initial estimation error bound $\|(A_0\ B_0) - (A_\star\ B_\star)\|_F \le \epsilon$ satisfies $\epsilon^2 \le (C_1\, d_x d_u (d_x + d_u))^{-1}$. Assume $T \ge \tilde{\Omega}(C_2)\, d_x d_u (d_x + d_u)^5$. Then, with high probability, Algorithm 1 satisfies $R_T \le \tilde{O}(C_3)\, d_x d_u (d_x + d_u)\, \sqrt{T}$.

In [12], the authors analyze the warm-up exploration (Algorithm 6 in Appendix F). In Appendix F, we show that their analysis can be combined with Theorem 9 to show the following.

Corollary 10. Let $C_2 = \kappa^{14} \beta^4 \gamma^{-8}$ and $C_4 = \kappa^{13} \beta^5 \gamma^{-5.5}$. Assume $T \ge \tilde{\Omega}(C_2)\, d_x d_u (d_x + d_u)^5$. If we run the warm-up exploration and then run Algorithm 1, then with high probability, $R_T \le \tilde{O}(C_4)\, d_x d_u (d_x + d_u)\, \sqrt{T}$.
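A minimal numerical sketch of the regularized least-squares step in Algorithm 3 (System-Estimation): regress $x_{s+1}$ on $z_s = (x_s, u_s)$ with a ridge penalty that pulls the solution toward the initial estimates $(A_0\ B_0)$. Variable names are illustrative, not the paper's code.

```python
import numpy as np

def system_estimation(xs, us, xs_next, A0, B0, lam):
    """Regularized least squares of Algorithm 3: regress x_{s+1} on z_s = (x_s, u_s)."""
    Z = np.hstack([xs, us])               # rows are z_s^T, shape (t, dx + du)
    Theta0 = np.hstack([A0, B0])          # prior estimate (A0 B0), shape (dx, dx + du)
    # Minimize  sum_s ||(A B) z_s - x_{s+1}||^2 + lam * ||(A B) - (A0 B0)||_F^2.
    # Normal equations:  (A B) (Z^T Z + lam I) = X_next^T Z + lam (A0 B0).
    G = Z.T @ Z + lam * np.eye(Z.shape[1])
    R = xs_next.T @ Z + lam * Theta0
    Theta = np.linalg.solve(G, R.T).T     # G is symmetric, so this equals R G^{-1}
    dx = A0.shape[0]
    return Theta[:, :dx], Theta[:, dx:]   # (A_hat, B_hat)
```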
3.2 Proof overview

This section gives an overview of the proof of Theorem 9; formal statements and proofs are in the Appendix. The main component of our proof is bounding the stationary regret, i.e.,
$$R^{\mathrm{st}}_T = \sum_{t=1}^{T} C(M_t \mid A_\star, B_\star) - T \cdot C(M_\star \mid A_\star, B_\star), \tag{11}$$
where $M_t$ is the policy executed at time $t$ and $M_\star \in \arg\min_{M \in \mathcal{M}} C(M \mid A_\star, B_\star)$. Theorem 9 follows from the following two lemmas.

Lemma 11. With high probability, $R_T - R^{\mathrm{st}}_T \le \tilde{O}(\kappa^5 \beta^2 \gamma^{-2.5})\, (d_x + d_u)^{3/2} \sqrt{T}$.

Lemma 12. With high probability, $R^{\mathrm{st}}_T \le \tilde{O}(\kappa^2 \gamma^{-2})\, d_x d_u (d_x + d_u) \sqrt{T}$.

The proof of Lemma 11 is mostly technical and we do not discuss it in the main text. We will focus instead on the proof of Lemma 12. First, observe that the policies $M_t$ executed during epoch $r$ belong to $\mathcal{M}_r$, since they are elements of a 2-approximate affine barycentric spanner of $\mathcal{M}_r$. To bound $R^{\mathrm{st}}_T$, we show that with high probability, for all policies $M \in \mathcal{M}_r$, the suboptimality gap $R^{\mathrm{st}}(M) := C(M \mid A_\star, B_\star) - C(M_\star \mid A_\star, B_\star)$ is $O(2^{-r})$, from which Lemma 12 follows after some calculations. To bound this suboptimality gap, we show that the elimination step of Algorithm 1 (equation (8)) is effective, i.e., it removes only the $\Omega(2^{-r})$-suboptimal policies (those with $R^{\mathrm{st}}(M) \ge \Omega(2^{-r})$). This effectiveness comes from the following bound on the estimation error $|C(M \mid \hat{A}_{t_r}, \hat{B}_{t_r}) - C(M \mid A_\star, B_\star)|$.

Lemma 13. With high probability, for all epochs $r$ and for all $M \in \mathcal{M}_r$, we have
$$\left| C(M \mid \hat{A}_{t_r}, \hat{B}_{t_r}) - C(M \mid A_\star, B_\star) \right| \le 2^{-r}. \tag{12}$$

The proof of this lemma is the crux of our analysis, employs all our new techniques, and is done in two steps. The first step bounds the estimation error on the cost (the LHS of (12)) by the estimation error on the matrices, with respect to a matrix norm associated with policy $M$. More specifically, let $\Sigma(M)$ be the covariance matrix of the random vector $z(M \mid A_\star, B_\star, \eta)$ defined as
$$z(M \mid A_\star, B_\star, \eta) := \begin{pmatrix} x(M \mid A_\star, B_\star, \eta) \\ u(M \mid \eta) \end{pmatrix},$$
where $x(M \mid A_\star, B_\star, \eta)$, $u(M \mid \eta)$ and the distribution of $\eta$ are given in Subsection 2.1.1. In other words, $\Sigma(M) = E_\eta\big[ z(M \mid A_\star, B_\star, \eta)\, z(M \mid A_\star, B_\star, \eta)^T \big]$. We prove the following lemma.

Lemma 14 (informal). Let $\Delta_{t_r} = (\hat{A}_{t_r}\ \hat{B}_{t_r}) - (A_\star\ B_\star)$. For all $M \in \mathcal{M}_r$,
$$\left| C(M \mid \hat{A}_{t_r}, \hat{B}_{t_r}) - C(M \mid A_\star, B_\star) \right| \le O(\kappa^2 \gamma^{-1})\, \big\|\Delta_{t_r}^T\big\|_{\Sigma(M)}. \tag{14}$$

The proof idea here is to analyze two coupled dynamical systems: the first system has system matrices $\hat{A}_{t_r}, \hat{B}_{t_r}$ and the second has $A_\star, B_\star$. The coupling is done by applying the same controls and disturbances to both of them. The formal statement and proof are in Appendices B.1 and B.2. The second step is the following lemma. Let $C$ be a large constant.

Lemma 15 (informal). With high probability, $\|\Delta_{t_r}^T\|_{\Sigma(M)} \le (C^{-1} \kappa^{-2} \gamma)\, 2^{-r}$, for all $r$ and all $M \in \mathcal{M}_r$.

This lemma completes the proof of Lemma 13, by choosing a sufficiently large $C$. To prove Lemma 15, we first focus on the exploratory policies $M_{r,j}$, and for notational convenience we define $M_{r,d+1} = M_{r,d+2} = \cdots = M_{r,2d} = M_{r,0}$. We show that with high probability, $\sum_{j=1}^{2d} \|\Delta_{t_r}^T\|^2_{\Sigma(M_{r,j})} \le (C_1^{-1} \gamma^2 \kappa^{-4} d^{-1})\, 2^{-2r}$, where $d = d_x d_u H$ and $C_1$ is a large constant. To finish the proof, we show that for all $M \in \mathcal{M}_r$, we can express $M$ as an affine combination of $\{M_{r,j}\}_{j=1}^{2d}$ using coefficients in $[-3, 3]$. Then, we show that this implies $\Sigma(M) \preceq 18 d \sum_{j=1}^{2d} \Sigma(M_{r,j})$, which completes the proof. The formal statement and proof are in Appendices B.1 and B.3.

Estimating the disturbances: A challenge in proving both Lemmas 12 and 11 comes from using the estimates of the disturbances $(\hat{w}_t)_t$ instead of the actual ones $(w_t)_t$. We deal with this by proving that the $\hat{w}_t$ are on average very accurate.

Lemma 16. With high probability, we have $\sum_{t=1}^{T} \|\hat{w}_t - w_t\|^2 \le \tilde{O}(1)\, (d_x + d_u)^3$.⁷

⁷On average the error is very small, because of the lower bound on $T$ in Theorem 9.

We provide the proof in Appendix B.4. A notable aspect of that proof is that it does not use any information about the way we choose the controls $u_t$. On the other hand, the choice of controls matters for obtaining accurate estimates of $A_\star, B_\star$ (e.g., if we constantly play $u_t = 0$, then we get no information about $B_\star$). Thus, the disturbances can be accurately estimated without accurately estimating $A_\star, B_\star$.

3.3 Running time

It suffices to argue that for each epoch $r$, a 2-approximate affine barycentric spanner of $\mathcal{M}_r$ can be computed in polynomial time. First, the conditions of Theorem 8 are satisfied: 1) $\mathcal{M}_r$ is compact, since $c$ is Lipschitz and so $C(\cdot \mid \hat{A}_{t_{r'}}, \hat{B}_{t_{r'}})$ is continuous, and 2) from the proof of Lemma 18 (Appendix B.1) we can see that not only is $M_\star \in \mathcal{M}_r$, but there also exists a small ball around $M_\star$ which is contained in $\mathcal{M}_r$, so $\mathcal{M}_r$ is not contained in any proper affine subspace. Thus, to prove polynomial time, it suffices to have a linear optimization oracle for $\mathcal{M}_r$. Observe that $\mathcal{M}_r$ is convex, being an intersection of sublevel sets of convex functions. So, given a separation oracle for $\mathcal{M}_r$, a linear optimization oracle can be implemented in polynomial time via the ellipsoid method. Now, such a separation oracle can be implemented given access to $C(M \mid \hat{A}_{t_{r'}}, \hat{B}_{t_{r'}})$ and $\nabla_M C(M \mid \hat{A}_{t_{r'}}, \hat{B}_{t_{r'}})$ for all $M \in \mathcal{M}$ and epochs $r' < r$. Even though we do not have exact access to these quantities because of the expectations they involve, we can approximate them in polynomial time up to $1/\mathrm{poly}(T)$-error by averaging samples, since we have offline access to the values $c(x, u)$ and gradients $\nabla_{x,u} c(x, u)$ for any $x, u$. Folklore approximation arguments suffice to show that even with this $1/\mathrm{poly}(T)$-error, the ellipsoid method can optimize linear functions up to a 1.01 multiplicative error in polynomial time. Now, just by inspecting the algorithm from [5] which is used to prove Theorem 8, we can see that even with a 1.01-approximate optimization oracle (instead of an exact one), the same proof goes through. Thus, a 2-approximate affine barycentric spanner of $\mathcal{M}_r$ can be constructed in polynomial time.
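For illustration, the following is a minimal sketch of the approximate barycentric-spanner construction of Awerbuch and Kleinberg [5], in its plain (non-affine) form, assuming access to a linear optimization oracle `linopt(v)` returning $\arg\max_{x \in K} \langle v, x \rangle$ over a compact, full-dimensional $K \subset \mathbb{R}^d$. Algorithm 1 uses the affine variant of this idea over the policy set $\mathcal{M}_r$, with the oracle implemented by the ellipsoid method as described above; the code below is a simplified sketch, not the paper's implementation.

```python
import numpy as np

def barycentric_spanner(linopt, d, C=2.0):
    """C-approximate barycentric spanner of a compact, full-dimensional K in R^d,
    given linopt(v) = argmax_{x in K} <v, x>  (Awerbuch-Kleinberg construction)."""
    X = np.eye(d)                                 # columns will become spanner points

    def best_for_column(i):
        # det(X with column i replaced by x) is linear in x; by Cramer's rule its
        # coefficient vector is det(X) times the i-th row of X^{-1}.
        c = np.linalg.det(X) * np.linalg.inv(X)[i, :]
        cands = [linopt(c), linopt(-c)]           # maximize |<c, x>| over K
        return max(cands, key=lambda x: abs(c @ x))

    for i in range(d):                            # phase 1: build a basis from K
        X[:, i] = best_for_column(i)
    improved = True                               # phase 2: local improvement
    while improved:
        improved = False
        for i in range(d):
            x = best_for_column(i)
            Xi = X.copy()
            Xi[:, i] = x
            if abs(np.linalg.det(Xi)) > C * abs(np.linalg.det(X)):
                X, improved = Xi, True
    return [X[:, i] for i in range(d)]            # C-approximate spanner of K
```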
4 Bandit feedback

To tackle online control with bandit feedback, we use the stochastic bandit convex optimization (SBCO) algorithm of [3] as a black box. Before we present our algorithm and the formal theorem statement, we briefly present the SBCO setting. In SBCO, $X$ is a convex subset of $\mathbb{R}^d$ with diameter bounded by $D$, and $f : X \to \mathbb{R}$ is an $L$-Lipschitz convex function on $X$. The algorithm has access to $f$ via a noisy value oracle, i.e., it can query the value at any $x \in X$, and the response is $y = f(x) + \zeta$, where $\zeta$ is an independent $\sigma^2$-subgaussian random variable with mean zero. The goal is to minimize regret: after making $n$ queries $x_1, \ldots, x_n \in X$, the regret is $\sum_{t=1}^{n} f(x_t) - n f(x_\star)$, where $x_\star$ is a minimizer of $f$ over $X$. In [3], the authors give a polynomial-time algorithm that takes as input $d, D, L, \sigma^2, n$ and a separation oracle for $X$, and achieves regret $\tilde{O}(1)\, \mathrm{poly}(d, \log D, L, \sigma^2)\, \sqrt{n}$.

We now formally state our theorem, which says that after appropriately initializing the input parameters of the SBCO algorithm, Algorithm 4 achieves $\sqrt{T}$-regret.

Theorem 17. There exist $C_1, C_2, C_3, C_4, C_5 = \mathrm{poly}(d_x, d_u, \kappa, \beta, \gamma^{-1}, \log T)$ such that after initializing the SBCO algorithm with $d = d_x d_u H$, $D = C_1$, $L = C_2$, $\sigma^2 = C_3$ and $n = T/(2H+2)$, if the horizon $T \ge C_4$, then with high probability, warm-up exploration (Algorithm 6 in Appendix F) followed by Algorithm 4 satisfies $R_T \le C_5 \sqrt{T}$.

Our point here is that $\sqrt{T}$-regret is achievable in polynomial time, so we did not try to optimize the terms $C_i$. The proof is in Appendix C. The key step in the proof is showing that the SBCO algorithm of [3] is robust to adversarial noise in the responses when this noise is small on average. This noise comes from $\|\hat{w}_t - w_t\|$, which is small on average since we can show that Lemma 16 holds here too.

Algorithm 4: Control of unknown LDS with bandit feedback
  Input: SBCO algorithm with parameters $d, D, L, \sigma^2, n$ and domain $X = \mathcal{M}$; initial estimates $A_0, B_0$ satisfying $\|(A_0\ B_0) - (A_\star\ B_\star)\|_F \le \epsilon$.
  Set the matrix estimates $(\hat{A}_1\ \hat{B}_1) = (A_0\ B_0)$ and the noise estimates $\hat{w}_t = 0$ for $t \le 0$. The SBCO algorithm queries the first point $M$. Set the initial policy $M_1 = M$.
  for $t = 1, \ldots, T$ do
    Apply control $u_t = \sum_{i=1}^{H} M_t^{[i-1]} \hat{w}_{t-i}$. Observe $x_{t+1}$ and $c(x_t, u_t)$.
    Call System-Estimation (Algorithm 3) to get $\hat{A}_{t+1}, \hat{B}_{t+1}$.
    Record the estimate $\hat{w}_t = x_{t+1} - \hat{A}_{t+1} x_t - \hat{B}_{t+1} u_t$.
    if $t \bmod (2H + 1) = 0$ then
      Send $c(x_t, u_t)$ to the SBCO algorithm. The SBCO algorithm queries a new point $M$. Set $M_{t+1} = M$.
    else
      Set $M_{t+1} = M_t$.
    end
  end
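A minimal sketch of the control loop of Algorithm 4, assuming a generic SBCO black box with hypothetical methods `query()` (returns the next policy to try) and `update(value)` (receives one bandit cost value). The SBCO algorithm of [3] itself is not reproduced, and the per-step call to System-Estimation is omitted for brevity; `step`, `cost`, `A_hat`, and `B_hat` are placeholders.

```python
import numpy as np
from collections import deque

def bandit_control(step, cost, sbco, A_hat, B_hat, dx, H, T):
    """Control loop of Algorithm 4 wrapped around a generic SBCO black box."""
    w_hat = deque([np.zeros(dx)] * H, maxlen=H)   # disturbance estimates
    M = sbco.query()                              # SBCO queries the first policy
    x = np.zeros(dx)
    for t in range(1, T + 1):
        u = sum(M[i] @ w_hat[-(i + 1)] for i in range(H))   # disturbance-based control
        x_next = step(x, u)
        w_hat.append(x_next - A_hat @ x - B_hat @ u)        # record w_hat_t
        if t % (2 * H + 1) == 0:
            sbco.update(cost(x, u))               # one bandit value every 2H+1 steps
            M = sbco.query()                      # switch to the next queried policy
        x = x_next
```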
5 Extensions

Linear Dynamical Controllers: Clearly, our regret bounds hold against any policy class that can be approximated by $\mathcal{M}$. One such class is the set of all stabilizing linear-dynamical controllers (for background on LDCs, see [29]).

General stochastic disturbances: Other than Gaussian, we can deal with any stochastic bounded disturbance distribution. The only place where the assumption of Gaussian disturbances really helps is that, given some policy $M$ and matrices $\hat{A}, \hat{B}$, we can compute offline (to a very good approximation) the surrogate cost $C(M \mid \hat{A}, \hat{B})$, because we know the disturbance distribution. However, even when we do not, we can still use the estimated disturbances $(\hat{w}_t)_t$ as samples to approximate this expectation (i.e., the average cost).

Partial observation: The extension to partial observation is tedious but straightforward, and uses the idea of nature's $y$'s, exactly as in [29].

6 Summary and open questions

We gave the first polynomial-time algorithms with optimal regret, with respect to the time horizon, for online control of an LDS with general convex costs and comparator class the set of strongly-stable linear policies. Our main result was a novel geometric exploration scheme for the case where the cost function is known. The following open questions arise. First, can we improve the $\tilde{O}(C)\, d_x d_u (d_x + d_u) \sqrt{T}$ regret bound in terms of dimension dependence? This looks plausible, because the barycentric spanners are constructed by treating the policies as flattened vectors of dimension $d_x d_u H$, so the matrix structure is not exploited. Second, Algorithm 1 is not practical, since it employs the ellipsoid method. Is there a simpler, gradient-based algorithm that also achieves $\sqrt{T}$-regret? Third, a challenging question is whether $\sqrt{T}$-regret is achievable for nonstochastic control, where the disturbances are adversarial and the cost function changes adversarially over time. Even more broadly, can we prove regret bounds with respect to interesting nonlinear, yet tractable policy classes?

7 Broader Impact

This work does not present any foreseeable societal consequence.

8 Acknowledgements

EH acknowledges the support of NSF grant #1704860 and Google.

References

[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
[2] Alekh Agarwal, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240, 2013.
[3] Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035–1043, 2011.
[4] Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.
[5] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
[6] Alberto Bemporad, Manfred Morari, Vivek Dua, and Efstratios N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002.
[7] Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 1.
[8] Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. arXiv preprint arXiv:2002.08095, 2020.
[9] Asaf Cassel and Tomer Koren. Bandit linear control. arXiv preprint arXiv:2007.00759, 2020.
[10] Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. arXiv preprint arXiv:2007.06650, 2020.
[11] Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038, 2018.
[12] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$-regret. arXiv preprint arXiv:1902.06223, 2019.
[13] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
[14] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
[15] Paula Gradu, John Hallman, and Elad Hazan. Non-stochastic control with bandit feedback. arXiv preprint arXiv:2008.05523, 2020.
[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[17] Elad Hazan, Sham M. Kakade, and Karan Singh. The nonstochastic control problem. arXiv preprint arXiv:1911.12178, 2019.
[18] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.
[19] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. arXiv preprint arXiv:2003.11227, 2020.
[20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[21] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of LQR is efficient. arXiv preprint arXiv:1902.07826, 2019.
[22] Samet Oymak and Necmiye Ozay. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), pages 5655–5661. IEEE, 2019.
[23] R. Tyrrell Rockafellar. Linear-quadratic programming and optimal control. SIAM Journal on Control and Optimization, 25(3):781–814, 1987.
[24] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pages 5610–5618, 2019.
[25] Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite-time system identification for partially observed LTI systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.
[26] Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, pages 2714–2802, 2019.
[27] Max Simchowitz and Dylan J. Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.
[28] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, pages 439–473, 2018.
[29] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. arXiv preprint arXiv:2001.09254, 2020.
[30] Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identification. arXiv preprint arXiv:1903.09122, 2019.
[31] Ramon van Handel. Probability in high dimension. Technical report, Princeton University, 2014.
[32] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.