# Optimal Rates for Bandit Nonstochastic Control

Y. Jennifer Sun (Princeton University & Google DeepMind), ys7849@princeton.edu
Stephen Newman (Princeton University), sn9581@princeton.edu
Elad Hazan (Princeton University & Google DeepMind), ehazan@princeton.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

**Abstract.** Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control are foundational and extensively researched problems in optimal control. We investigate LQR and LQG problems with semi-adversarial perturbations and time-varying adversarial bandit loss functions. The best-known sublinear regret algorithm of Gradu et al. [2020] has a $T^{3/4}$ time-horizon dependence, and the authors posed an open question about whether a tight rate of $\sqrt{T}$ could be achieved. We answer in the affirmative, giving an algorithm for bandit LQR and LQG which attains optimal regret (up to logarithmic factors) for both known and unknown systems. A central component of our method is a new scheme for bandit convex optimization with memory, which is of independent interest.

## 1 Introduction

Linear Quadratic Regulator (LQR) and the more general Linear Quadratic Gaussian (LQG) control problems have been extensively studied in the field of control theory due to their wide range of applications and the analytical solutions provided by the seminal works of Bellman [1954] and Kalman [1960]. LQR and LQG control problems study the design of a feedback control policy for a linear dynamical system with the goal of minimizing cumulative, possibly time-varying quadratic costs. The discrete version of the problem studies the control of the following linear dynamical system governed by dynamics $(A, B, C)$[^1]:

$$x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + e_t,$$

where at time $t$, $x_t$ represents the system's state, $u_t$ represents the control exerted on the system, and $\{w_t\}_{t=1}^T$ represents a sequence of i.i.d. centered Gaussian perturbations injected into the system. For a simple example of the model, consider the problem of controlling an elevator to reach the desired floors while minimizing the discomfort (due to acceleration or deceleration) and the energy consumption. In this case, the state $x_t$ would be a vector consisting of the elevator's position and velocity at time $t$. The control input $u_t$ is the force applied by the elevator motor at time $t$, and $w_t$ represents process noise at time $t$, caused possibly by unexpected mechanical disturbances.

In the general LQG setting, the system's states are not accessible. In the elevator example, one may not have access to measurements of the elevator's velocity or its precise position. Instead, the algorithm has access to an observation $y_t$, which is a linear transformation of the state perturbed by a sequence $\{e_t\}_{t=1}^T$ of i.i.d. centered Gaussian noises, i.e. the elevator's observed position. The cost is a quadratic function of both the observation and the control exerted. The goal in LQR/LQG problems is to find a control policy in some policy class $\Pi$ that minimizes the cumulative cost over a finite time horizon $T$. With $y_t^\pi, u_t^\pi$ denoting the observation and control at time $t$ resulting from executing policy $\pi$, the objective is formally given by

$$\min_{\pi \in \Pi} \; \sum_{t=1}^{T} \left[ (y_t^\pi)^\top Q_t\, y_t^\pi + (u_t^\pi)^\top R_t\, u_t^\pi \right],$$

where $Q_t, R_t$ are some possibly time-varying state-weighting and control-weighting matrices, respectively.

[^1]: The LQR/LQG dynamics can be generalized to time-varying linear dynamical systems. Here we restrict ourselves to linear time-invariant systems for simplicity.
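
To make the interaction protocol concrete, here is a minimal simulation sketch of the dynamics and quadratic cost above. The dimensions, dynamics matrices, and noise scales are illustrative assumptions (not taken from the paper), and the zero control is only a placeholder for whatever policy is being evaluated.

```python
import numpy as np

# Minimal sketch of the LQG interaction with a quadratic cost; all matrices,
# dimensions, and noise scales below are illustrative, not the paper's setup.
rng = np.random.default_rng(0)
dx, du, dy, T = 4, 2, 2, 100
A = 0.9 * np.eye(dx)                    # stable dynamics: spectral radius < 1
B = 0.1 * rng.standard_normal((dx, du))
C = 0.5 * rng.standard_normal((dy, dx))
Q, R = np.eye(dy), np.eye(du)           # cost weights Q_t, R_t (time-invariant here)

x, total_cost = np.zeros(dx), 0.0
for t in range(T):
    w = 0.01 * rng.standard_normal(dx)  # process noise w_t
    e = 0.01 * rng.standard_normal(dy)  # observation noise e_t
    y = C @ x + e                       # learner only sees y_t, never x_t
    u = np.zeros(du)                    # placeholder control; a policy would go here
    total_cost += y @ Q @ y + u @ R @ u # c_t(y_t, u_t) = y^T Q_t y + u^T R_t u
    x = A @ x + B @ u + w               # x_{t+1} = A x_t + B u_t + w_t
print("cumulative cost:", total_cost)
```
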
In the elevator example, if our goal is to control the elevator to reach the desired floors while minimizing discomfort and energy consumption, the cost function at every time step $t$ may take the form $c_t(y_t, u_t) = c_1 \|y_t - z_{\text{goal}}\|_2^2 + c_2 \|u_t\|_2^2$, where $z_{\text{goal}}$ is the target position of the elevator and $c_1, c_2 > 0$.

Variations of the LQR/LQG problem have garnered considerable interest in the machine learning community. In the recent literature on online nonstochastic control, several setting-based generalizations of the aforementioned linear-control framework have been explored, including:

- **Adversarial cost functions** (Cohen et al. [2018], Agarwal et al. [2019a]): In this setting, the cost functions are adversarially chosen in a way that is unpredictable to the learner. This generalized formulation allows for applications such as building climate control in the presence of time-varying energy costs, as the energy costs may depend on unexpected demand fluctuations (Cohen et al. [2018]). More recently, the model of adversarially chosen cost functions, along with the adversarial perturbations described in the next item, gives rise to a connection between control theory and meta-optimization (Chen and Hazan [2023]), where the nonconvex problem of hyperparameter tuning (e.g. learning-rate tuning) can be reduced to online nonstochastic control, which often admits a convex formulation.

- **Adversarial perturbations** (Agarwal et al. [2019a], etc.): In this setting, differing from the usual Gaussian assumption on the perturbation distribution ($w_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$), nonstochastic control often considers arbitrary disturbances $w_t$ in the dynamics. The relaxed assumption on the perturbations allows for the control of nonlinear systems under model inaccuracies by approximating them as linear systems (Ghai et al. [2022]).

- **Bandit feedback:** Together with the aforementioned two generalizations, control under bandit feedback further generalizes the control problem and suits many real-world applications. In the bandit setting, the cost functions are unknown to the learner. The learner has access only to the cost incurred by their chosen control. In addition, the learner has access to the system's state ($x_t$ in the fully observed LQR setting) or a noisy, possibly and usually low-dimensional observation ($y_t$ in the partially observed LQG setting). The bandit LQG problem can be illustrated with a motivating example. In the problem of controlling a medical device such as a ventilator, the state usually represents a number of health status indicators of the patient, such as heart rate, blood pressure, and other related metrics. However, this state may not be fully observable. After the treatment is provided, the learner can observe the patient's response, e.g. changes in some vital indicators, with some observation noise, together with a cost, e.g. deviation from the targeted breathing pattern. In this limited-information setting, can a learner hope to learn meaningful controls under minimal assumptions? Previously, the works of Gradu et al. [2020] and Cassel and Koren [2020] both studied the bandit nonstochastic control problem for fully observed systems and provided provable guarantees.

Taken together, these settings give rise to a general setting in differentiable reinforcement learning that strictly contains a variety of classical problems in optimal and robust control. Naturally, when adversarial costs and perturbations are considered, an optimal solution is not defined a priori.
Instead, the primary performance metric is regret: the difference between the total cost of a control algorithm and that of the best controller from a specific policy class in hindsight. As mentioned above, this general setting of bandit nonstochastic control was considered in the recent work of Gradu et al. [2020], whose proposed Bandit Perturbation Controller (BPC) algorithm has a provable regret guarantee of $\tilde{O}(T^{3/4})$ when compared to a rich policy class called disturbance action controllers (DAC) for fully observed systems. A similar setting has also been studied by Cassel and Koren [2020], who established a regret bound of $\tilde{O}(\sqrt{T})$, optimal up to logarithmic factors, for fully observable systems under stochastic perturbations and adversarially chosen cost functions. Recently, Ghai et al. [2023] improved the dimension dependence of regret bounds by using exploration in action space rather than parameter space. However, bandit control for partially observable systems (e.g. LQG) is less understood. Thus, these developments in the search for efficient, low-regret bandit online control algorithms leave a central open question (also stated by Gradu et al. [2020]):

*Can we achieve optimal regret $\tilde{O}(\sqrt{T})$ for bandit LQG control with nonstochastic noise?*

Our work answers this question up to logarithmic factors. Our novel Ellipsoidal Bandit Perturbation Controller (EBPC) achieves an $\tilde{O}(\sqrt{T})$ regret guarantee in the presence of semi-adversarial perturbations in bandit LQG problems with strongly convex cost functions, with the additional generality of possibly unknown system dynamics. By Shamir [2013], this is asymptotically optimal up to logarithmic factors, as bandit optimization over quadratics reduces to bandit control with quadratic losses under $A = 0$, $B = I$. Our work therefore resolves the upper-bound/lower-bound gap for this generalization of LQR/LQG. The following table gives a comprehensive comparison between the regret guarantee of EBPC and existing results in the literature.

Table 1: Comparison of previous results to our contributions.

| Algorithm | Noise | Observation | Feedback | System | Regret |
| --- | --- | --- | --- | --- | --- |
| Agarwal et al. [2019a] | Adversarial | full | full | known | $\tilde{O}(\sqrt{T})$ |
| Agarwal et al. [2019b] | Stochastic | full | full | known | $\tilde{O}(1)$ |
| Foster and Simchowitz [2020] | Adversarial | full | full | known | $\tilde{O}(1)$ |
| Simchowitz et al. [2020] | Semi-Adv. | partial | full | known | $\tilde{O}(1)$ |
| Simchowitz et al. [2020] | Semi-Adv. | partial | full | unknown | $\tilde{O}(\sqrt{T})$ |
| Gradu et al. [2020] | Adversarial | full | bandit | unknown | $\tilde{O}(T^{3/4})$ |
| Cassel and Koren [2020] | Stochastic | full | bandit | known | $\tilde{O}(\sqrt{T})$ |
| Cassel and Koren [2020] | Adversarial | full | bandit | known | $\tilde{O}(T^{2/3})$ |
| Theorem 4.1 | Semi-Adv. | partial | bandit | known | $\tilde{O}(\sqrt{T})$ |
| Theorem 4.2 | Semi-Adv. | partial | bandit | unknown | $\tilde{O}(\sqrt{T})$ |

### 1.1 Related work

**Online Nonstochastic Control and Online LQR.** In the last decade, much research has been devoted to the intersection of learning and control. Abbasi-Yadkori and Szepesvári [2011] and Ibrahimi et al. [2012] considered the problem of learning a controller in LQR with known quadratic cost functions and a stochastic/martingale-difference perturbation sequence when the dynamics of the system are unknown, and achieved $\tilde{O}(\sqrt{T})$ regret in this case. Dean et al. [2018] provided the first provable low-regret, efficient algorithm for LQR problems with known cost functions and stochastic perturbations. Cohen et al. [2018] extended this result to changing quadratic costs with stochastic perturbations and provided a regret guarantee of $\tilde{O}(\sqrt{T})$. Lale et al. [2021] consider the LQG problem with stochastic noise and unknown systems.
More recently, interest has turned to nonstochastic control, in which the cost functions and the perturbations can be adversarially chosen (Agarwal et al. [2019a]). A broad spectrum of control problems was reconsidered from the nonstochastic perspective, and several different generalizations were derived. See Hazan and Singh [2022] for a comprehensive text detailing these results.

**Online Bandit Convex Optimization with Memory.** A classical approach to nonstochastic control of stable/stabilizable linear dynamical systems is to reduce control problems to online convex optimization with bounded memory. This reduction stems from the observation that the effects of past controls decay exponentially in stable/stabilizable systems. In the bandit online convex optimization with bounded memory setting, the learner iteratively plays a decision $x_t$ in a convex set $\mathcal{K} \subseteq \mathbb{R}^d$ and suffers an adversarially chosen loss $F_t(x_{t-H+1:t})$, where $x_{t-H+1:t}$ denotes the sequence of points $x_{t-H+1}, \ldots, x_t$. In particular, the loss depends on the last $H$ points played by the algorithm, and the only information revealed to the learner is the scalar loss that they incurred. The goal is to minimize regret, the difference between the loss actually suffered and the loss suffered under the best single play in hindsight:

$$\sum_{t=H}^{T} F_t(x_{t-H+1:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} F_t(x, \ldots, x).$$

Since the loss function is unknown to the learner in the bandit setting, online bandit convex optimization algorithms make use of low-bias estimators of the true gradient or Hessian. Therefore, it is standard to measure the algorithm's performance by regret over the expectation of losses, where the expectation is taken over the randomness injected when creating such estimators. Online convex optimization with memory in the full-information setting, where the loss function is known to the learner, was proposed by Anava et al. [2015]. The work of Agarwal et al. [2019a] was the first to connect this framework to control and to give a regret bound for online control with adversarial perturbations. The $\tilde{O}(T^{3/4})$ regret bound of Gradu et al. [2020] leveraged the bandit convex optimization method of Flaxman et al. [2005]. We focus on the bandit LQR/LQG setting with strongly convex and smooth loss functions. It is thus natural to use the techniques of Hazan and Levy [2014], who obtained an $\tilde{O}(\sqrt{T})$ regret guarantee for bandit convex optimization without memory.
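
For reference, the sketch below implements the classical one-point spherical gradient estimator used in this line of work (Flaxman et al. [2004, 2005]) on a memoryless quadratic. The test function, smoothing radius, and sample count are illustrative choices, not quantities from the paper.

```python
import numpy as np

def spherical_grad_estimate(f, x, delta, rng):
    """One-point spherical estimator: (n/delta) * f(x + delta*u) * u, with u uniform
    on the unit sphere, is an unbiased estimate of the gradient of a smoothed f."""
    n = x.shape[0]
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    return (n / delta) * f(x + delta * u) * u

# Illustrative check on f(z) = ||z||^2, whose gradient at x0 is 2*x0.
rng = np.random.default_rng(1)
f = lambda z: float(z @ z)
x0 = np.array([1.0, -2.0, 0.5])
samples = [spherical_grad_estimate(f, x0, delta=0.25, rng=rng) for _ in range(200_000)]
print(np.mean(samples, axis=0), 2 * x0)   # averages toward the true gradient, noisily
```
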
**Online Learning with Delay.** One technical difficulty in extending OCO with memory to the bandit setting arises from the requirement of independence between every play and the noises injected in the most recent $H$ steps. We resolve this issue by adapting online learning with delay in the subroutine used by our BCO algorithm. Online learning with delay was introduced by Quanrud and Khashabi [2015]. In particular, Flaspohler et al. [2021] relate online learning with delay to online learning with optimism and establish a sublinear regret guarantee for mirror descent algorithms. A similar delay scheme appears in [Gradu et al., 2020].

### 1.2 Notations and organization

**Notation.** For convenience and succinctness, we denote $\bar H \stackrel{\mathrm{def}}{=} H - 1$. We use lowercase bold letters (e.g. $\mathbf{x}, \mathbf{y}, \mathbf{u}, \mathbf{w}, \mathbf{e}$) to denote the states, observations, controls, and noises of the dynamical system[^2], and $d_x, d_u, d_y$ to denote their corresponding dimensions. We use $\mathbb{S}^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$ to denote the unit sphere in $\mathbb{R}^n$. $\mathbb{R}_+$ and $\mathbb{R}_{++}$ denote the non-negative and positive real numbers, respectively. For a differentiable function $F : (\mathbb{R}^n)^H \to \mathbb{R}$, we denote the gradient of $F$ with respect to its $i$-th argument vector by $\nabla_i F(\cdot)$. $\rho(\cdot)$ acting on a square matrix denotes the spectral radius of the matrix. For a sequence of matrices $M = (M^{[i]})_{i \in \mathcal{I}}$, we use $\|M\|_{\ell_1,\mathrm{op}}$ to denote the sum of operator norms, $\|M\|_{\ell_1,\mathrm{op}} = \sum_{i \in \mathcal{I}} \|M^{[i]}\|_{\mathrm{op}}$. $\sigma(A)$ denotes the $\sigma$-algebra generated by a random variable $A$. We use $O(\cdot)$ to hide all universal constants, $\tilde{O}(\cdot)$ to hide $\mathrm{poly}(\log T)$ terms, and $\tilde{O}(\cdot)$ also to hide all natural parameters.

[^2]: It shall be noted that unbolded $x, y, u$ are used in the EBCO-M algorithm (Algorithm 1) and are unrelated to the linear dynamical system.

**Organization.** Our method has two main components: a novel algorithm for BCO with memory (EBCO-M), and its application to building a novel bandit perturbation controller (EBPC). Section 2 describes our problem setting. Section 3 gives EBCO-M and its near-optimal regret guarantee. Section 4 introduces EBPC and its regret guarantees for both known and unknown systems.

## 2 The Bandit LQG Problem

In this section we provide the necessary background and describe and formalize the main problem of interest. We consider control of linear time-invariant dynamical systems of the form

$$x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + e_t, \tag{2.1}$$

with dynamics matrices $A \in \mathbb{R}^{d_x \times d_x}$, $B \in \mathbb{R}^{d_x \times d_u}$, $C \in \mathbb{R}^{d_y \times d_x}$. Here, consistent with the previous notation, $x_t \in \mathbb{R}^{d_x}$ is the state of the system at time $t$, $u_t \in \mathbb{R}^{d_u}$ is the control applied at time $t$, and $w_t \in \mathbb{R}^{d_x}$, $e_t \in \mathbb{R}^{d_y}$ are the system and measurement perturbations. At each timestep, the learner may observe $y_t \in \mathbb{R}^{d_y}$, which usually represents a possibly noisy projection of the state $x_t$ onto some (possibly and usually) low-dimensional space.

In the online bandit setting, the learner is asked to perform a control $u_t$ at time $t$. After the control is performed, the adversary chooses a quadratic cost function $c_t(y_t, u_t)$. The learner observes the scalar $c_t(y_t, u_t) \in \mathbb{R}_+$ and the signal $y_t$, but no additional information about $c_t(\cdot, \cdot)$, $x_t$, $w_t$, or $e_t$ is given. The goal of the learner is to minimize expected regret, where regret against the controller class $\Pi$ is defined as

$$\mathrm{Regret}_T = \sum_{t=1}^{T} c_t(y_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(y_t^\pi, u_t^\pi), \tag{2.2}$$

where $u_t^\pi$ is the control exerted at time $t$ by policy $\pi$, and $y_t^\pi$ is the time-$t$ observation that would have occurred against the same costs/noises if the control policy $\pi$ had been carried out from the beginning.

In controlling linear dynamical systems with partial observations, we often make use of the system's counterfactual signal had no controls been performed since the beginning of the instance:

**Definition 2.1 (Nature's y).** Nature's y at time $t$, denoted by $y^{\mathrm{nat}}_t$, is the signal that the system would have generated at time $t$ under $u_{1:t} = 0$. We may compute this as $x^{\mathrm{nat}}_{t+1} = A x^{\mathrm{nat}}_t + w_t$, $y^{\mathrm{nat}}_t = C x^{\mathrm{nat}}_t + e_t$ or, equivalently, $y^{\mathrm{nat}}_t = e_t + \sum_{i=1}^{t-1} C A^{t-i-1} w_i$.

Critically, this may be calculated via the Markov operator:

**Definition 2.2 (Markov operator).** The Markov operator $G = [G^{[i]}]_{i \in \mathbb{N}}$ corresponding to a linear system parametrized by $(A, B, C)$ as in Eq. (2.1) is a sequence of matrices in $\mathbb{R}^{d_y \times d_u}$ such that $G^{[i]} \stackrel{\mathrm{def}}{=} C A^{i-1} B$ for $i \ge 1$ and $G^{[0]} \stackrel{\mathrm{def}}{=} 0_{d_y \times d_u}$.

It follows immediately that $y^{\mathrm{nat}}_t$ may be computed from observations as $y^{\mathrm{nat}}_t = y_t - \sum_{i=1}^{t-1} G^{[i]} u_{t-i}$.
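
The identity above is easy to verify numerically. The sketch below builds a truncated Markov operator for an illustrative stable system and recovers nature's y from noiseless observations; the truncation length and all matrices are assumptions made for the example.

```python
import numpy as np

# Sketch: Markov operator G[i] = C A^{i-1} B and recovery of nature's y via
# y^nat_t = y_t - sum_{i>=1} G[i] u_{t-i}. Matrices and sizes are illustrative.
rng = np.random.default_rng(0)
dx, du, dy, T, H = 4, 2, 2, 50, 20
A = 0.5 * np.eye(dx)                     # stable, so ||G[i]|| decays geometrically
B = rng.standard_normal((dx, du))
C = rng.standard_normal((dy, dx))
G = [np.zeros((dy, du))] + [C @ np.linalg.matrix_power(A, i - 1) @ B
                            for i in range(1, H + 1)]

x, ys, us = np.zeros(dx), [], []
for t in range(T):
    ys.append(C @ x)                     # noiseless observations, so y^nat should be ~0
    us.append(rng.standard_normal(du))
    x = A @ x + B @ us[-1]

t = T - 1
ynat = ys[t] - sum(G[i] @ us[t - i] for i in range(1, H + 1))
print(np.linalg.norm(ynat))              # tiny: only the truncated tail of G remains
```
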
### 2.1 Assumptions

We impose four core assumptions on the problem.

**Assumption 2.3 (Stable system).** We assume the system is stable: the spectral radius satisfies $\rho(A) < 1$.[^3]

[^3]: Although out of the scope of this work, we refer the reader to Appendix G of Simchowitz et al. [2020] for a possible and expected extension of our results under a relaxed stability assumption: the learner is given access to a stabilizing controller of the system.

This assumption has the following important consequence:

**Remark 2.4 (Decay of stable systems).** That the system is stable implies that there exists $P \succ 0_{d_x \times d_x}$, $P \in \mathrm{Sym}(d_x)$, such that $rP \succeq A^\top P A$ for some $0 \le r < 1$, and therefore there exists $\kappa$ depending on $\|B\|_{\mathrm{op}}$, $\|C\|_{\mathrm{op}}$, $\sigma_{\min}(P)$ such that $\|G^{[i]}\|_{\mathrm{op}} \le \kappa\, r^{i-1}$. Then with $H = O(\log T)$, we can assume that $\|G\|_{\ell_1,\mathrm{op}} = \sum_{i=0}^{\infty} \|G^{[i]}\|_{\mathrm{op}} \le R_G$ and $\psi_G(H) \stackrel{\mathrm{def}}{=} \sum_{i=H}^{\infty} \|G^{[i]}\|_{\mathrm{op}} \le R_G / T$.

Together, Assumption 2.3 and its consequence Remark 2.4 allow modeling the nonstochastic control problem as an online convex optimization problem against an adversary with bounded memory, as the effect of past controls decays exponentially. Assumptions 2.5–2.7 introduce the assumptions on the adversarial cost functions and perturbations.

**Assumption 2.5 (Noise model).** The perturbations $\{w_t, e_t\}_{t=1}^T$ are assumed to be semi-adversarial: $w_t, e_t$ decompose as sums of adversarial and stochastic components, $w_t = w^{\mathrm{adv}}_t + w^{\mathrm{stoch}}_t$ and $e_t = e^{\mathrm{adv}}_t + e^{\mathrm{stoch}}_t$. The stochastic components of the perturbations are assumed to come from distributions satisfying $\mathbb{E}[w^{\mathrm{stoch}}_t] = \mathbb{E}[e^{\mathrm{stoch}}_t] = 0$ and $\mathrm{Cov}(w^{\mathrm{stoch}}_t) \succeq \sigma_w^2 I$, $\mathrm{Cov}(e^{\mathrm{stoch}}_t) \succeq \sigma_e^2 I$ for some $\sigma_w, \sigma_e > 0$. $\{w_t, e_t\}_{t=1}^T$ are bounded such that $\|y^{\mathrm{nat}}_t\|_2 \le R_{\mathrm{nat}}$ for all $t$, for some parameter $R_{\mathrm{nat}}$.

The bound on $y^{\mathrm{nat}}$ is implied by bounded noise, which is a standard assumption in the literature, together with the stability of the system. The semi-adversarial assumption also appears in prior work [Simchowitz et al., 2020], and is a necessary condition for our analysis: we depend on the regret guarantee of a bandit online convex optimization with memory algorithm, which requires strong convexity of the expected loss functions conditioned on all but the $\mathrm{poly}(\log T)$ most recent steps of history. This assumption is essentially equivalent to the adversarial assumption in applications: in almost all systems, noise is either endemic or may be injected. We also emphasize that this assumption is much weaker than that of previous optimal-rate work: even in the known-state, known-dynamics case, the previous optimal guarantee in the bandit setting depended on having no adversarial perturbation (see Table 1).

**Assumption 2.6 (Cost model).** The cost functions $c_t(\cdot, \cdot)$ are assumed to be quadratic, $\sigma_c$-strongly convex, and $\beta_c$-smooth, i.e. $c_t(y, u) = y^\top Q_t y + u^\top R_t u$ with $\sigma_c I \preceq Q_t \preceq \beta_c I$ and $\sigma_c I \preceq R_t \preceq \beta_c I$ for all $t$. They are also assumed to obey the following Lipschitz condition: for all $(y, u), (y', u') \in \mathbb{R}^{d_y + d_u}$,

$$|c_t(y, u) - c_t(y', u')| \le L_c \big( \|(y, u)\|_2 \vee \|(y', u')\|_2 \big)\, \|(y - y', u - u')\|_2. \tag{2.3}$$

These conditions are relatively standard for bandit convex optimization algorithms, and are needed for the novel BCO-with-memory algorithm which underpins our control algorithm.

**Assumption 2.7 (Adversary).** The sequence $\{c_t(\cdot, \cdot), w^{\mathrm{adv}}_t, e^{\mathrm{adv}}_t\}_{t=1}^T$ is chosen by the adversary ahead of time.

### 2.2 Disturbance Response Controllers

Regret compares the excess cost of executing our proposed control algorithm with respect to the cost of the best algorithm in hindsight from a given policy class. In particular, low regret against a rich policy class is a very strong near-optimality guarantee. We take the comparator policy class to be the set of disturbance response controllers (DRC), formally given by the following definition.

**Definition 2.8 (Disturbance Response Controllers).** The disturbance response controller (DRC) and the DRC policy class are defined as follows:

- A disturbance response controller (DRC) is a policy $\pi_M$ parametrized by $M = (M^{[j]})_{j=0}^{\bar H}$, where $H \in \mathbb{Z}_{++}$, a sequence of $H$ matrices in $\mathbb{R}^{d_u \times d_y}$, such that the control at time $t$ is given by $u_t = \sum_{j=0}^{\bar H} M^{[j]} y^{\mathrm{nat}}_{t-j}$. We write $u^M_t$ as shorthand for this control.
- The DRC policy class parametrized by $H \in \mathbb{Z}_{++}$ and $R \in \mathbb{R}_+$ is the set of all disturbance response controllers with bounded length $H$ and norm $R$: $\mathcal{M}(H, R) = \{M = (M^{[j]})_{j=0}^{\bar H} \mid \|M\|_{\ell_1,\mathrm{op}} = \sum_{j=0}^{\bar H} \|M^{[j]}\|_{\mathrm{op}} \le R\}$.
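
A minimal sketch of a DRC policy follows, assuming the nature's y signals are already available; the dimensions and the random parameterization are illustrative.

```python
import numpy as np

# Sketch of a disturbance response controller (DRC): the control is a fixed linear
# function of the last H nature's y signals. Dimensions and parameters are illustrative.
rng = np.random.default_rng(0)
du, dy, H = 2, 3, 5
M = [0.1 * rng.standard_normal((du, dy)) for _ in range(H)]   # M^[0], ..., M^[H-1]

def drc_control(M, ynat_hist):
    """u_t = sum_j M^[j] @ y^nat_{t-j}; ynat_hist[j] holds y^nat_{t-j}."""
    return sum(M[j] @ ynat_hist[j] for j in range(len(M)))

def l1_op_norm(M):
    """||M||_{l1,op} = sum_j ||M^[j]||_op, the norm bounding the class M(H, R)."""
    return sum(np.linalg.norm(Mj, ord=2) for Mj in M)

ynat_hist = [rng.standard_normal(dy) for _ in range(H)]
print(drc_control(M, ynat_hist), "norm:", l1_op_norm(M))
```
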
Previous works have demonstrated the richness of the DRC policy class. In particular, Theorem 1 of Simchowitz et al. [2020] establishes that the DRC policy class generalizes the state-of-the-art benchmark class of stabilizing linear dynamic controllers (LDC) with approximation error $e^{-\Theta(H)}$.

### 2.3 Approach and Technical Challenges

The classical approach in online nonstochastic control of stable/stabilizable systems is to reduce to a problem of online convex optimization with memory. This insight relies on the exponentially decaying effect of past states and controls on the present, which allows approximating the cost functions as functions of the most recent controls. A core technical challenge lies in the bandit convex optimization problem obtained from the bandit control problem. In the bandit setting, no gradient information is given to the learner, and thus the learner needs to construct a low-bias gradient estimator. Previous work uses the classical spherical gradient estimator proposed by Flaxman et al. [2004], but the resulting regret guarantee is suboptimal. We would like to leverage the ellipsoidal gradient estimator proposed by Hazan and Levy [2014]. However, when extending to loss functions with memory, there is no clear mechanism for obtaining a low-bias bound for general convex functions. We exploit the quadratic structure of the LQR/LQG cost functions to build EBCO-M (Algorithm 1), which uses ellipsoidal gradient estimators. We note that even outside of control applications, EBCO-M may be of independent interest in bandit online learning theory.

## 3 BCO with Memory: Quadratic and Strongly Convex Functions

As with previous works, our control algorithm will depend crucially on a generic algorithm for bandit convex optimization with memory (BCO-M). We present a new online bandit convex optimization with memory algorithm that exploits the structure of quadratic costs to achieve near-optimal regret.

### 3.1 Setting and working assumptions

In the BCO-M setting with memory length $H$, we consider an algorithm playing against an adversary. At time $t$, the algorithm is asked to play its choice $y_t$ in the convex constraint set $\mathcal{K}$. The adversary chooses a loss function $F_t : \mathcal{K}^H \to \mathbb{R}_+$ which takes as input the algorithm's current play as well as its previous $\bar H$ plays. The algorithm then observes the cost $F_t(y_{t-\bar H}, \ldots, y_t)$ (and no other information about $F_t(\cdot)$) before it chooses and plays the next action $y_{t+1}$. The goal is to minimize regret with respect to the expected loss, i.e. the excess loss incurred by the algorithm compared to the best fixed decision in $\mathcal{K}$:

$$\sum_{t=H}^{T} \mathbb{E}[F_t(y_{t-\bar H}, \ldots, y_t)] - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \mathbb{E}[F_t(x, \ldots, x)].$$

For notational convenience, we will at times shorthand $y_{t-\bar H:t} \stackrel{\mathrm{def}}{=} (y_{t-\bar H}, \ldots, y_t) \in \mathcal{K}^H$.

#### 3.1.1 BCO-M assumptions

We make the following assumptions on the loss functions $\{F_t\}_{t=H}^T$ and the constraint set $\mathcal{K}$.

**Assumption 3.1 (Constraint set).** $\mathcal{K}$ is convex, closed, and bounded with non-empty interior, with $\mathrm{diam}(\mathcal{K}) = \sup_{x, y \in \mathcal{K}} \|x - y\|_2 \le D$.

**Assumption 3.2 (Loss functions).** The loss functions chosen by the adversary obey the following regularity and curvature assumptions:

- $F_t : \mathcal{K}^H \to \mathbb{R}_+$ is quadratic and $\beta$-smooth:
  - Quadratic: $\exists\, W_t \in \mathbb{R}^{nH \times nH}$, $b_t \in \mathbb{R}^{nH}$, $c_t \in \mathbb{R}$ such that $F_t(w) = w^\top W_t w + b_t^\top w + c_t$ for all $w \in \mathcal{K}^H$.
  - Smooth: $W_t \preceq \beta I_{nH \times nH}$.
- $F_t : \mathcal{K}^H \to \mathbb{R}_+$ is $\sigma$-strongly convex in its induced unary form: $f_t : \mathcal{K} \to \mathbb{R}_+$ with $f_t(z) = F_t(z, \ldots, z)$ is $\sigma$-strongly convex, i.e. $f_t(z) \ge f_t(z') + \nabla f_t(z')^\top (z - z') + \frac{\sigma}{2}\|z - z'\|_2^2$ for all $z, z' \in \mathcal{K}$.
- $F_t$ satisfies the following diameter and gradient bounds on $\mathcal{K}$: $\exists\, B, L > 0$ such that $B = \sup_{w, w' \in \mathcal{K}^H} |F_t(w) - F_t(w')|$ and $L = \sup_{w \in \mathcal{K}^H} \|\nabla F_t(w)\|_2$.

In online control problems, when formulating the cost function $c_t$ as a function $F_t$ of the most recent $H$ controls played, the function $F_t$ itself may depend on the entire history of the algorithm through step $t - H$. Therefore, it is essential to analyze the regret guarantee of our BCO-M algorithm when playing against an adversary that can be $(t-H)$-adaptive, giving rise to the following assumption.

**Assumption 3.3 (Adversarial adaptivity).** The adversary chooses $F_t$ independently of the noises $u_{t-\bar H:t}$ drawn by the algorithm in the $H$ most recent steps, but possibly not independently of earlier noises.

Note that Assumption 3.3 is minimal for BCO: if it fails, then in the subcase of a delayed loss, the adversary may fully control the agent's observations, resulting in no possibility of learning.

**Self-concordant barrier.** The algorithm makes use of a self-concordant barrier $\mathcal{R}(\cdot)$ of $\mathcal{K}$ as the regularization function in the updates.

**Definition 3.4 (Self-concordant barrier).** A three-times continuously differentiable function $\mathcal{R}(\cdot)$ over a closed convex set $\mathcal{K} \subseteq \mathbb{R}^n$ with non-empty interior is a $\nu$-self-concordant barrier of $\mathcal{K}$ if it satisfies the following two properties:

1. (Boundary property) For any sequence $\{x_n\}_{n \in \mathbb{N}} \subset \mathrm{int}(\mathcal{K})$ such that $\lim_{n \to \infty} x_n = x \in \partial \mathcal{K}$, we have $\lim_{n \to \infty} \mathcal{R}(x_n) = \infty$.
2. (Self-concordance) For all $x \in \mathrm{int}(\mathcal{K})$ and $h \in \mathbb{R}^n$: (a) $|\nabla^3 \mathcal{R}(x)[h, h, h]| \le 2\,|\nabla^2 \mathcal{R}(x)[h, h]|^{3/2}$; (b) $|\langle \nabla \mathcal{R}(x), h \rangle| \le \sqrt{\nu}\,|\nabla^2 \mathcal{R}(x)[h, h]|^{1/2}$.

### 3.2 Algorithm specification and regret guarantee

We present EBCO-M (Algorithm 1) for online bandit convex optimization with memory. The key novelty is the use of an ellipsoidal gradient estimator. It is difficult to establish a low-bias guarantee for the ellipsoidal gradient estimator for general convex loss functions. However, thanks to the quadratic structure of the loss functions in LQR/LQG problems, we can show provably low bias for the ellipsoidal gradient estimator, and therefore achieve near-optimal regret.

**Algorithm 1: Ellipsoidal BCO with memory (EBCO-M)**

1. **Input:** convex, closed set $\mathcal{K} \subseteq \mathbb{R}^n$ with non-empty interior, time horizon $T$, memory length $H$, step size $\eta$, $\nu$-self-concordant barrier $\mathcal{R}(\cdot)$ over $\mathcal{K}$, convexity strength parameter $\sigma$.
2. Initialize $x_t = \arg\min_{x \in \mathcal{K}} \mathcal{R}(x)$ for $t = 1, \ldots, H$.
3. Compute $A_t = (\nabla^2 \mathcal{R}(x_t) + \sigma t I)^{-1/2}$ for $t = 1, \ldots, H$.
4. Sample $u_1, \ldots, u_H \sim \mathbb{S}^{n-1}$ i.i.d. uniformly at random.
5. Set $y_t = x_t + A_t u_t$ for $t = 1, \ldots, H$.
6. Set $g_t = 0$ for $t = 1, \ldots, H$.
7. Play $y_1, \ldots, y_{\bar H}$.
8. **for** $t = H, \ldots, T$ **do**
9. Play $y_t$, suffer loss $F_t(y_{t-\bar H:t})$.
10. Store $g_t = n\, F_t(y_{t-\bar H:t}) \sum_{i=0}^{\bar H} A_{t-i}^{-1} u_{t-i}$.
11. Set $x_{t+1} = \arg\min_{x \in \mathcal{K}} \sum_{s=H}^{t} \big[ \langle g_{s-\bar H}, x \rangle + \frac{\sigma}{2} \|x - x_{s-\bar H}\|_2^2 \big] + \frac{1}{\eta} \mathcal{R}(x)$.
12. Compute $A_{t+1} = (\nabla^2 \mathcal{R}(x_{t+1}) + \sigma(t+1) I)^{-1/2}$.
13. Sample $u_{t+1} \sim \mathbb{S}^{n-1}$ uniformly at random.
14. Set $y_{t+1} = x_{t+1} + A_{t+1} u_{t+1}$.
15. **end for**

Before analyzing the regret, we first make note of two properties of Algorithm 1.

**Remark 3.5 (Delayed dependence).** In Algorithm 1, $x_t$ is independent of $u_{t-\bar H:t}$ for all $t$, and therefore $A_t$ is independent of $u_{t-\bar H:t}$, as $A_t$ is determined by $x_t$.

**Remark 3.6 (Correctness).** The point $y_t$ played by Algorithm 1 lies in $\mathcal{K}$: $\|y_t - x_t\|^2_{\nabla^2 \mathcal{R}(x_t)} = \|A_t u_t\|^2_{\nabla^2 \mathcal{R}(x_t)} \le \|u_t\|_2^2 = 1$, and by Proposition C.1, the Dikin ellipsoid centered at $x_t$ is contained in $\mathcal{K}$.
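
To illustrate the sampling scheme in Algorithm 1, the sketch below computes the ellipsoidal exploration point $y_t = x_t + A_t u_t$ for a Euclidean-ball constraint set, using $\mathcal{R}(x) = -\log(1 - \|x\|_2^2)$ as an illustrative self-concordant barrier; the barrier choice, dimensions, and parameter values are assumptions made for the example.

```python
import numpy as np

# Sketch of the exploration step of EBCO-M on the unit Euclidean ball,
# with the illustrative barrier R(x) = -log(1 - ||x||^2).
def barrier_hessian(x):
    """Hessian of R(x) = -log(1 - x'x): 2I/s + 4 x x'/s^2 with s = 1 - x'x."""
    s = 1.0 - float(x @ x)
    return 2.0 * np.eye(x.size) / s + 4.0 * np.outer(x, x) / s**2

def ebco_sample(x, t, sigma, rng):
    """y_t = x_t + A_t u_t with A_t = (hess R(x_t) + sigma*t*I)^{-1/2}, u_t ~ unif(S^{n-1})."""
    Hmat = barrier_hessian(x) + sigma * t * np.eye(x.size)
    evals, evecs = np.linalg.eigh(Hmat)
    A = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    u = rng.standard_normal(x.size)
    u /= np.linalg.norm(u)
    return x + A @ u, A, u

# The gradient estimate (line 10) would then combine the observed scalar loss with
# the last H exploration directions: g_t = n * F_t(y_{t-H+1:t}) * sum_i A_{t-i}^{-1} u_{t-i}.
rng = np.random.default_rng(0)
y1, A1, u1 = ebco_sample(np.zeros(3), t=1, sigma=1.0, rng=rng)
print(y1, np.linalg.norm(y1) < 1.0)      # exploration point stays inside the ball here
```
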
We give the following two theorems on regret guarantees for the EBCO-M algorithm. In particular, Theorem 3.6.B is a slight generalization of Theorem 3.6.A in its relaxation of the curvature assumption on the sequence of loss functions $F_{H:T}$.

**Theorem 3.6.A (EBCO-M regret with strong convexity).** For any sequence of cost functions $\{F_t\}_{t=H}^T$ satisfying Assumption 3.2, constraint set $\mathcal{K}$ satisfying Assumption 3.1, adversary satisfying Assumption 3.3, and $H = \mathrm{poly}(\log T)$, Algorithm 1 satisfies the regret bound

$$\mathrm{Regret}_T(\text{EBCO-M}) = \sum_{t=H}^{T} \mathbb{E}[F_t(y_{t-\bar H:t})] - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \mathbb{E}[F_t(x, \ldots, x)] \le \tilde{O}\big(\sqrt{T}\big),$$

where the expectation is taken over the randomness of the exploration noises $u_{1:T}$, and $\tilde{O}(\cdot)$ hides all natural parameters ($B$, $D$, $L$) and logarithmic dependence on $T$.

**Theorem 3.6.B (EBCO-M regret with conditional strong convexity).** Suppose Algorithm 1 is run on $\mathcal{K}$ satisfying Assumption 3.1 against an adversary satisfying Assumption 3.3 with a sequence of random cost functions $\{F_t\}_{t=H}^T$ satisfying:

1. $F_t$ is quadratic, convex, $\beta$-smooth, and has diameter bound $B$ and gradient bound $L$ on $\mathcal{K}^H$.
2. There exists a filtration $\mathcal{G}_{1:T}$ such that $u_{1:T}$ is independent of $\mathcal{G}_T$, $F_t \in \sigma(u_{1:t-H}) \cup \mathcal{G}_t$, and $F_t$ is conditionally $\sigma$-strongly convex in its induced unary form: $\bar f_t(z) \stackrel{\mathrm{def}}{=} \mathbb{E}[f_t(z) \mid u_{1:t-H}, \mathcal{G}_{t-H}]$ is $\sigma$-strongly convex.

Then Algorithm 1 satisfies the same regret bound attained in Theorem 3.6.A, i.e.

$$\mathrm{Regret}_T(\text{EBCO-M}) = \sum_{t=H}^{T} \mathbb{E}[F_t(y_{t-\bar H:t})] - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} \mathbb{E}[F_t(x, \ldots, x)] \le \tilde{O}\big(\sqrt{T}\big),$$

where the expectation is taken over the randomness of the exploration noises $u_{1:T}$ and the random functions $F_{H:T}$, and $\tilde{O}(\cdot)$ hides all natural parameters ($B$, $D$, $L$) and logarithmic dependence on $T$.

## 4 Bandit Controller: Known and Unknown Systems

We will now use our BCO-with-memory algorithm to find an optimal controller (as in Gradu et al. [2020]), arguing that regret in the choice of controller transfers into the setting discussed in the previous section. We first consider the case where the system is known, and then reduce the unknown-system case to the known-system case.

### 4.1 Known systems

Applying Algorithm 1 to predict controllers[^4], with losses given by control losses, we obtain Algorithm 2.

**Algorithm 2: Ellipsoidal Bandit Perturbation Controller (EBPC)**

1. **Input:** time horizon $T$, memory length $H$, Markov operator $G$, BCO-M parameters $\sigma$, $\eta$, $\nu$-self-concordant barrier $\mathcal{R}(\cdot)$ over $\mathcal{M}(H, R) \subseteq \mathbb{R}^{H \times d_u \times d_y}$.
2. Initialize $M_1 = \cdots = M_H = \arg\min_{M \in \mathcal{M}(H,R)} \mathcal{R}(M)$.
3. Compute $A_i = (\nabla^2 \mathcal{R}(M_i) + \sigma i I)^{-1/2}$ for $i = 1, \ldots, H$.
4. Sample $\varepsilon_1, \ldots, \varepsilon_H \sim \mathbb{S}^{H d_u d_y - 1}$ i.i.d. uniformly at random.
5. Set $\tilde M_i = M_i + A_i \varepsilon_i$ and $g_i = 0$ for $i = 1, \ldots, H$.
6. Play control $u_i = 0$ and incur cost $c_i(y_i, u_i)$ for $i = 1, \ldots, \bar H$.
7. **for** $t = H, \ldots, T$ **do**
8. Play control $u_t = u_t(\tilde M_t) = \sum_{i=0}^{\bar H} \tilde M_t^{[i]} y^{\mathrm{nat}}_{t-i}$, incur cost $c_t(y_t, u_t)$.
9. Observe $y_{t+1}$ and compute the signal $y^{\mathrm{nat}}_{t+1} = y_{t+1} - \sum_{i=1}^{t} G^{[i]} u_{t+1-i}$.
10. Store $g_t = d_u d_y H\, c_t(y_t, u_t) \sum_{i=0}^{\bar H} A_{t-i}^{-1} \varepsilon_{t-i}$.
11. Update $M_{t+1} = \arg\min_{M \in \mathcal{M}(H,R)} \sum_{s=H}^{t} \big[ \langle g_{s-\bar H}, M \rangle + \frac{\sigma}{2} \|M - M_{s-\bar H}\|^2 \big] + \frac{1}{\eta} \mathcal{R}(M)$.
12. Compute $A_{t+1} = (\nabla^2 \mathcal{R}(M_{t+1}) + \sigma(t+1) I)^{-1/2}$.
13. Sample $\varepsilon_{t+1} \sim \mathbb{S}^{H d_u d_y - 1}$ uniformly at random. Set $\tilde M_{t+1} = M_{t+1} + A_{t+1} \varepsilon_{t+1}$.
14. **end for**

[^4]: Notation: while our controller $M$ is typically a tensor, it should be thought of as the output vector of Algorithm 1. As such, the relevant vector and matrix operations in that algorithm correspond to tensor operations here, and the notation reflects that correspondence. In particular, the inner product on line 11 is an all-dimension tensor dot product, and $A$ is a square matrix which acts on tensors of shape $(H, d_u, d_y)$.
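
The following sketch isolates the two per-round computations of Algorithm 2: the ellipsoidal perturbation of the controller parameters and the update of nature's y. The shapes, the identity stand-in for $A_t$, and the random histories are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of one round of EBPC. The controller parameter M_t is a tensor of shape
# (H, du, dy); A_t is replaced by the identity here purely for illustration.
rng = np.random.default_rng(0)
H, du, dy = 4, 2, 3
M_t = 0.1 * rng.standard_normal((H, du, dy))
A_t = np.eye(H * du * dy)                 # stands in for (hess R(M_t) + sigma*t*I)^{-1/2}

# (1) Exploration: M~_t = M_t + A_t eps_t, with eps_t uniform on the unit sphere.
eps = rng.standard_normal(H * du * dy)
eps /= np.linalg.norm(eps)
M_tilde = M_t + (A_t @ eps).reshape(H, du, dy)

# (2) Control and signal update: u_t uses the last H nature's y signals; after seeing
#     y_{t+1}, the effect of past controls is removed through the Markov operator.
ynat = [rng.standard_normal(dy) for _ in range(H)]               # ynat[i] = y^nat_{t-i}
u_t = sum(M_tilde[i] @ ynat[i] for i in range(H))
G = [np.zeros((dy, du))] + [0.5**i * rng.standard_normal((dy, du)) for i in range(1, 8)]
u_hist = [u_t] + [rng.standard_normal(du) for _ in range(len(G) - 2)]  # u_hist[i] = u_{t-i}
y_next = rng.standard_normal(dy)                                  # observed y_{t+1}
ynat_next = y_next - sum(G[i] @ u_hist[i - 1] for i in range(1, len(G)))
print(u_t, ynat_next)
```
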
**Theorem 4.1 (Known system control regret).** Consider a linear dynamical system governed by known dynamics $(A, B, C)$ and the interaction model with adversarially chosen cost functions and perturbations satisfying Assumptions 2.3, 2.5, 2.6, and 2.7. Then running Algorithm 2 with $H = \Theta(\mathrm{poly}(\log T))$, $\sigma = \frac{\sigma_c\, \sigma_{\min}(C)}{1 + \|A\|_{\mathrm{op}}^2}$, and $\eta = \frac{1}{d_u d_y L_c H^3 \sqrt{T}}$ guarantees

$$\mathbb{E}[\mathrm{Regret}_T(\text{EBPC})] \le \tilde{O}\big(\sqrt{T}\big),$$

where regret is defined as in Eq. (2.2), the expectation is taken over the exploration noises $\varepsilon_{1:T}$ of the algorithm as well as the stochastic components $\{w^{\mathrm{stoch}}_t, e^{\mathrm{stoch}}_t\}_{t=1}^T$ of the perturbations $\{w_t, e_t\}_{t=1}^T$, and $\tilde{O}(\cdot)$ hides all universal constants, natural parameters, and logarithmic dependence on $T$.

### 4.2 Unknown systems: control after estimation

Note that EBPC (Algorithm 2) relies on access to the system's Markov operator $G$, which is available if and only if the system dynamics $(A, B, C)$ are known. When the system dynamics are unknown, we can identify the system using a system-estimation algorithm, obtain an estimated Markov operator $\hat G$, and run EBPC with $G \leftarrow \hat G$. Algorithm 3 outlines the estimation of the system dynamics via least squares.

**Algorithm 3: System estimation via least squares (SysEst-LS)**

1. **Input:** estimation sample size $N$, system length $H$.
2. Initialize $\hat G^{[t]} = 0$ for all $t \ge H$.
3. Sample and play $u_t \sim \mathcal{N}(0, I_{d_u \times d_u})$ for $t = 1, \ldots, N$.
4. Set $\hat G^{[0:\bar H]} = \arg\min \sum_{t=H}^{N} \big\| y_t - \sum_{i=0}^{\bar H} \hat G^{[i]} u_{t-i} \big\|_2^2$.
5. Return $\hat G$.

**Theorem 4.2 (Unknown system control regret).** Consider a linear dynamical system governed by unknown dynamics $(A, B, C)$ and the interaction model with adversarially chosen cost functions and perturbations satisfying Assumptions 2.3, 2.5, 2.6, and 2.7. Suppose we obtain an estimated Markov operator $\hat G$ from Algorithm 3 with $N = \lceil \sqrt{T} \rceil$ and $H = \Theta(\mathrm{poly}(\log T))$. Then running Algorithm 2 with $G \leftarrow \hat G$, memory length $3H$, appropriately chosen $\sigma$, and $\eta = \frac{1}{d_u d_y L_c H^3 \sqrt{T}}$ guarantees

$$\mathbb{E}[\mathrm{Regret}_T(\text{EBPC})] \le \tilde{O}\big(\sqrt{T}\big),$$

where regret is defined as in Eq. (2.2), the expectation is taken over the exploration noises $\varepsilon_{1:T}$ in Algorithm 2, the sampled Gaussian controls $u_{1:N}$ in Algorithm 3, and the stochastic components $\{w^{\mathrm{stoch}}_t, e^{\mathrm{stoch}}_t\}_{t=1}^T$ of the perturbations $\{w_t, e_t\}_{t=1}^T$, and $\tilde{O}(\cdot)$ hides all universal constants, natural parameters, and logarithmic dependence on $T$.

## 5 Discussion and conclusion

We solve the open problem put forth by Gradu et al. [2020] on the optimal rate for online bandit control in the case of LQR/LQG, improving the regret from $\tilde{O}(T^{3/4})$ to $\tilde{O}(\sqrt{T})$ in the semi-adversarial noise model and for strongly convex LQR/LQG cost functions. Our method builds upon recent advancements in bandit convex optimization for quadratic functions, providing the first near-optimal regret algorithm for bandit convex optimization with memory in a nonstochastic setting. It would be interesting to investigate (1) whether the results can be extended to fully adversarial noise, and (2) whether a stable-controller recovery similar to that of [Chen and Hazan, 2021] for fully observable systems can be established for partially observable systems, and whether it can be incorporated to extend our result to stabilizable systems even without access to a stabilizing controller.

## 6 Acknowledgement

Elad Hazan acknowledges funding from the Office of Naval Research grant N000142312156, the NSF award 2134040, and Open Philanthropy.

## References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26. JMLR Workshop and Conference Proceedings, 2011.

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119. PMLR, 2019a.

Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. Advances in Neural Information Processing Systems, 32, 2019b.

Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. Advances in Neural Information Processing Systems, 28, 2015.

Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.

Asaf Cassel and Tomer Koren. Bandit linear control. Advances in Neural Information Processing Systems, 33:8872–8882, 2020.

Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.

Xinyi Chen and Elad Hazan. A nonstochastic control approach to optimization. arXiv preprint arXiv:2301.07902, 2023.

Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038. PMLR, 2018.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. Advances in Neural Information Processing Systems, 31, 2018.

Genevieve E Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363–3373. PMLR, 2021.

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004.

Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.

Dylan Foster and Max Simchowitz. Logarithmic regret for adversarial online control. In International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.

Udaya Ghai, Xinyi Chen, Elad Hazan, and Alexandre Megretski. Robust online control with model misspecification. In Learning for Dynamics and Control Conference, pages 1163–1175. PMLR, 2022.

Udaya Ghai, Arushi Gupta, Wenhan Xia, Karan Singh, and Elad Hazan. Online nonstochastic model-free reinforcement learning. arXiv preprint arXiv:2305.17552, 2023.

Paula Gradu, John Hallman, and Elad Hazan. Non-stochastic control with bandit feedback. Advances in Neural Information Processing Systems, 33:10764–10774, 2020.

Paula Gradu, John Hallman, Daniel Suo, Alex Yu, Naman Agarwal, Udaya Ghai, Karan Singh, Cyril Zhang, Anirudha Majumdar, and Elad Hazan. Deluca, a differentiable control library: Environments, methods, and benchmarking. arXiv preprint arXiv:2102.09968, 2021.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.

Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. Advances in Neural Information Processing Systems, 27, 2014.

Elad Hazan and Karan Singh. Introduction to online nonstochastic control. arXiv preprint arXiv:2211.09619, 2022.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. Advances in Neural Information Processing Systems, 25, 2012.

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960.

Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Adaptive control and regret minimization in linear quadratic gaussian (LQG) setting. In 2021 American Control Conference (ACC), pages 2517–2522. IEEE, 2021.

Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. Advances in Neural Information Processing Systems, 28, 2015.

Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24. PMLR, 2013.

Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.