# Bandit Linear Control

Asaf Cassel, School of Computer Science, Tel Aviv University, acassel@mail.tau.ac.il
Tomer Koren, School of Computer Science, Tel Aviv University, tkoren@tauex.tau.ac.il

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We consider the problem of controlling a known linear dynamical system under stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the full feedback setting where the entire cost function is revealed after each decision, here only the cost incurred by the learner is observed. We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret that grows with the square root of the time horizon $T$. We also give extensions of this result to general convex, possibly non-smooth costs, and to non-stochastic system noise. A key component of our algorithm is a new technique for addressing bandit optimization of loss functions with memory.

1 Introduction

Reinforcement learning studies sequential decision making problems where a learning agent repeatedly interacts with an environment and aims to improve her strategy over time based on the received feedback. One of the most fundamental tradeoffs in reinforcement learning theory is the exploration vs. exploitation tradeoff, which arises whenever the learner observes only partial feedback after each of her decisions, thus having to balance between exploring new strategies and exploiting those that are already known to perform well. The most basic and well-studied form of partial feedback is the so-called bandit feedback, where the learner only observes the cost of her chosen action on each decision round, while obtaining no information about the performance of other actions.

Traditionally, the environment dynamics in reinforcement learning are modeled as a Markov Decision Process (MDP) with a finite number of possible states and actions. The MDP model has been studied and analyzed in numerous different settings and under various assumptions on the transition parameters, the nature of the reward functions, and the feedback model. Recently, a particular focus has been given to continuous state-action MDPs, and in particular, to a specific family of models from classic control where the state transition function is linear. Concretely, in linear control the state evolution follows the linear dynamics:
$$x_{t+1} = A_\star x_t + B_\star u_t + w_t, \tag{1}$$
where $x_t \in \mathbb{R}^{d_x}$, $u_t \in \mathbb{R}^{d_u}$, $w_t \in \mathbb{R}^{d_x}$ are respectively the system state, action (control), and noise at round $t$, and $A_\star \in \mathbb{R}^{d_x \times d_x}$, $B_\star \in \mathbb{R}^{d_x \times d_u}$ are the system parameters. The goal is to minimize the total control cost with respect to cost functions $c_t(x, u) : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}$ put forward on round $t$. However, in contrast to the reinforcement learning literature, existing work on learning in linear control largely assumes the full feedback model, where after each decision round the learning agent observes the entire cost function $c_t$ used to assign costs on that round. In fact, to the best of our knowledge, thus far linear control has not been studied in the bandit setting, even in the special case where the costs are generated stochastically over time.

Contributions. In this paper, we introduce and study the bandit linear control problem, where a learning agent has to control a known linear dynamical system (as in Eq. (1)) under stochastic noise, adversarially chosen convex cost functions, and bandit feedback. Namely, after each decision round the learner only observes the incurred cost $c_t(x_t, u_t)$ as feedback. (We still assume, however, that the state evolution is fully observable.)
For strongly convex and smooth cost functions, we present an efficient bandit algorithm that achieves $O(\sqrt{T})$ regret over $T$ decision rounds, with a polynomial dependence on the natural problem parameters. This result is optimal up to polylogarithmic factors, as it matches the optimal regret rate in the easier stationary (i.e., stateless) strongly convex and smooth bandit optimization setting [26, 17].

The starting point of our algorithmic approach is an approximate reparameterization of the online control problem due to [1, 2], called the Disturbance-Action Policy. In this new parameterization, the control problem is cast in terms of bounded-memory convex loss functions, under which the cost of the learner on each round depends explicitly only on her last few decisions rather than on the entire history (this is thanks to strong stability conditions of the learned policies [9]). As a key technical tool, we develop a new reduction technique for addressing bandit convex optimization with bounded memory. While an analogous reduction has been well established in the full feedback model [3, 1, 2], its adaptation to bandit feedback is far from straightforward. Indeed, loss functions with memory in the bandit setting were previously studied by [4], who showed a black-box reduction via a mini-batching approach that, for an algorithm achieving $O(T^{1/2})$ regret in the no-memory bandit setting, achieves $O(T^{2/3})$ regret with memory. While this technique imposes very few restrictions on the adversary, it degrades performance significantly even for adversaries with bounded memory. In contrast, [3] show that the full-feedback setting enjoys nearly no degradation when the adversary's memory is fixed and bounded. Our new technique establishes a similar lossless reduction for bandit feedback with adversaries restricted to choosing smooth cost functions. Combining these ideas with standard techniques in (no-memory) bandit convex optimization [15, 25, 17] gives our main result.

Our techniques readily extend to weakly convex costs, with regret scaling as $O(T^{2/3})$ in the smooth case and $O(T^{3/4})$ without smoothness. Moreover, these bounds hold even without the stochastic assumptions on the system noise $w_t$, which were only required in our analysis for preserving the strong convexity of the costs through the reduction to loss functions with memory. We defer further details on these extensions to later sections and choose to focus first on the more challenging case that demonstrates how both the strong convexity and smoothness of the costs are exploited and where $O(\sqrt{T})$ rates are possible.

Related work. The study of linear control has seen renewed interest in recent years. Most closely related to our work are [9, 1, 2], which study online linear control in the full-information setting. The latter paper establishes logarithmic regret bounds for the case where the costs are strongly convex and smooth and the noise is i.i.d. stochastic. Subsequently, [16] established a similar result for fixed and known quadratic losses and adversarial disturbances. Thus, we exhibit a gap between the achievable regret rates in the full- and bandit-feedback cases of our problem. (A similar gap exists in standard online optimization with strongly convex and smooth losses [18, 26].)
A related yet crucially different setting of partial observability was recently studied in [28], which considered the case where the state evolution is not revealed in full to the learner and only a low-dimensional projection of the state vector is observed instead. However, this model assumes full observability of the (convex) loss function following each round, and is therefore not directly comparable to ours. When the underlying linear system is initially unknown (the so-called adaptive control setting), regret of order $\sqrt{T}$ was recently shown to be optimal for online linear control even with full feedback and quadratic (strongly convex) costs [7, 27]. Optimal and efficient algorithms matching these lower bounds were developed earlier in [10, 22, 1]. In the reinforcement learning literature on finite Markov Decision Processes (MDPs), regret minimization with bandit feedback has been studied extensively (e.g., [23, 19, 11, 13, 5, 24, 20]). Our study can thus be viewed as a first step in an analogous treatment of bandit learning in continuous linear control.

2 Preliminaries

2.1 Problem Setup: Bandit Linear Control

We consider the setting of controlling a known linear dynamical system with unknown (strongly) convex losses and bandit feedback. The linear system is an instance of the process described in Eq. (1), where $A_\star$ and $B_\star$ are known, initialized for simplicity and without loss of generality at $x_0 = 0$. (Our assumptions on the nature of the various parameters are specified below.) Our goal is to minimize the total control cost in the following online setting, where an oblivious adversary chooses cost functions $c_t : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}$ for $t \ge 1$. At round $t$:
(1) The player chooses control $u_t$;
(2) The system transitions to $x_{t+1}$ according to Eq. (1);
(3) The player observes the new state $x_{t+1}$ and the incurred cost $c_t(x_t, u_t)$.
The overall cost incurred is $J(T) = \sum_{t=1}^{T} c_t(x_t, u_t)$. We denote by $J_K(T)$ the overall cost of a linear controller $K \in \mathbb{R}^{d_u \times d_x}$, which chooses its actions as $u_t = K x_t$. For such controllers, it is useful to define the notion of strong stability [9] (and its refinement due to [2]), which is essentially a quantitative version of classic stability notions in linear control.

Definition 1 (strong stability). A controller $K$ for the system $(A_\star, B_\star)$ is $(\kappa, \gamma)$-strongly stable ($\kappa \ge 1$, $0 < \gamma \le 1$) if there exist matrices $Q, L$ such that $A_\star + B_\star K = Q L Q^{-1}$, $\|L\| \le 1 - \gamma$, and $\|K\|, \|Q\|, \|Q^{-1}\| \le \kappa$. If additionally $L$ is complex and diagonal and $Q$ is complex, then $K$ is diagonal $(\kappa, \gamma)$-strongly stable.

For fixed $\kappa, \gamma$, we define regret with respect to the class $\mathcal{K}$ of $(\kappa, \gamma)$-diagonal strongly stable policies,
$$\mathcal{K} = \big\{ K \in \mathbb{R}^{d_u \times d_x} : K \text{ is } (\kappa, \gamma)\text{-diagonal strongly stable w.r.t. } (A_\star, B_\star) \big\}. \tag{2}$$
Beyond its relative simplicity, this class is interesting as it contains an asymptotic global optimum (with respect to all policies) when the costs are constrained to a fixed quadratic function, as in classical control. The regret of an algorithm $\mathcal{A}$ compared to $K \in \mathcal{K}$ is given by $R_\mathcal{A}(T, K) = J(T) - J_K(T)$. The pseudo-regret is then defined as $\mathcal{R}_\mathcal{A}(T) = \max_{K \in \mathcal{K}} \mathbb{E}[R_\mathcal{A}(T, K)]$, where the expectation is taken with respect to the randomness of the algorithm and the system noise.

Assumptions. Throughout, we assume the following. There are known constants $\kappa_B \ge 1$ and $W, G, C, \alpha, \beta, \sigma > 0$ such that:
1. (System bound) $\|B_\star\| \le \kappa_B$;
2. (Noise bound) $\|w_t\| \le W$ for all $t \ge 1$;
3. (Cost bounds) If $\|x\|, \|u\| \le D$ for large enough $D$,¹ then $|c_t(x, u)| \le C D^2$ and $\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G D$;
4. (Curvature bounds) The costs $c_t(x, u)$ are $\alpha$-strongly convex and $\beta$-smooth;
5. (Noise) The disturbances $w_t$ are independent random variables satisfying $\mathbb{E}[w_t w_t^\mathsf{T}] \succeq \sigma^2 I$.

¹The precise $D$ for which this holds will be specified later as an explicit polynomial in the problem parameters.
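To make the setup concrete, the following is a minimal simulation sketch of the interaction protocol above: the dynamics of Eq. (1) driven by bounded noise, with randomly drawn quadratic costs satisfying Assumptions 3 and 4. All names, dimensions, and constants in the snippet are illustrative placeholders, not part of the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_u, T, W, alpha, beta = 3, 2, 100, 1.0, 0.1, 1.0   # illustrative constants

A_star = 0.9 * np.eye(d_x)                                # known system matrices in Eq. (1)
B_star = rng.standard_normal((d_x, d_u)) / d_x

def random_quadratic_cost():
    """c_t(x,u) = 0.5*(x'Qx + u'Ru) with eigenvalues of Q, R in [alpha, beta]:
    alpha-strongly convex and beta-smooth (Assumption 4)."""
    def psd(d):
        U, _ = np.linalg.qr(rng.standard_normal((d, d)))
        return U @ np.diag(rng.uniform(alpha, beta, d)) @ U.T
    Q, R = psd(d_x), psd(d_u)
    return lambda x, u: 0.5 * (x @ Q @ x + u @ R @ u)

x = np.zeros(d_x)                                         # initial state is zero
total = 0.0
for t in range(1, T + 1):
    u = np.zeros(d_u)               # placeholder policy; a learner uses x_t and past bandit feedback
    c_t = random_quadratic_cost()   # adversary's cost for round t; never revealed in full
    w = np.clip(rng.standard_normal(d_x), -W, W)          # bounded disturbance (Assumption 2)
    x_next = A_star @ x + B_star @ u + w                  # state transition, Eq. (1)
    feedback = c_t(x, u)            # bandit feedback: only this scalar and x_next are observed
    total += feedback
    x = x_next
```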
The above assumptions are fairly standard in recent literature (e.g., [1, 2]).

2.2 Online Optimization with Memory

We describe the setting of online optimization with memory [4, 3], which will serve as an intermediate framework for our algorithmic development. In this setting, an oblivious adversary chooses loss functions $f_t : \mathcal{X}_+^H \to \mathbb{R}$ over a domain $\mathcal{X}_+ \subseteq \mathbb{R}^d$, where $H \ge 1$ is the length of the adversary's memory. The game proceeds in rounds, where in round $t$ the player chooses $x_t \in \mathcal{X}_+$ and observes some form of feedback $\hat f_t$. Performance is evaluated using the expected policy regret,
$$\mathcal{R}_H(T) = \mathbb{E}\Bigg[\sum_{t=H}^{T} f_t(x_{t+1-H}, \ldots, x_t)\Bigg] - \min_{x \in \mathcal{X}} \sum_{t=H}^{T} f_t(x, \ldots, x), \tag{3}$$
where $\mathcal{X} \subseteq \mathcal{X}_+$ is the comparator set, which may differ from the domain $\mathcal{X}_+$ over which the loss functions are defined (and are well behaved). Notice that for $H = 1$, the quantity $\mathcal{R}_1(T)$ refers to the regret of standard online optimization, with no memory. We will rely on the following conditions on the loss functions. The first is a coordinate-wise Lipschitz property, while the second is standard smoothness, stated explicitly for an $H$-coordinate setup.

Definition 2. $f : \mathcal{X}_+^H \to \mathbb{R}$ is coordinate-wise $L$-Lipschitz if for all $i \in [H]$ and $x_1, \ldots, x_H, y_i \in \mathcal{X}_+$:
$$|f(x_1, \ldots, x_i, \ldots, x_H) - f(x_1, \ldots, y_i, \ldots, x_H)| \le L \|x_i - y_i\|.$$

Definition 3. $f : \mathcal{X}_+^H \to \mathbb{R}$ is $\beta$-smooth if for any $x = (x_1, \ldots, x_H), y = (y_1, \ldots, y_H) \in \mathcal{X}_+^H$:
$$f(y) \le f(x) + \sum_{i=1}^{H} \nabla_i f(x)^\mathsf{T} (y_i - x_i) + \frac{\beta}{2} \|y - x\|^2,$$
where $\nabla_i$ is the gradient with respect to $x_i$. When $x_1 = \ldots = x_H = z$, we compress notation to $\nabla_i f(z)$.

2.3 Disturbance-Action Policies

Online linear control may be approximated by certain loss functions with memory, via a reparameterization suggested in [1, 2] named the Disturbance-Action Policy (DAP). For completeness, we state the parameterization here even though our technical development will be mostly orthogonal.

Definition 4 (Disturbance-Action Policy). For a fixed linear controller $K_0 \in \mathcal{K}$ and parameters $M = (M^{[1]}, \ldots, M^{[H]})$ with $M^{[i]} \in \mathbb{R}^{d_u \times d_x}$, a disturbance-action policy chooses an action at time $t$ as
$$u_t(M) = K_0 x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i},$$
where for notational convenience we set $w_i = 0$ for $i \le 0$.

This parameterization reduces the decision of the player at time $t$ to choosing $M_t = (M_t^{[1]}, \ldots, M_t^{[H]})$. The comparator set is given by $\mathcal{M} = \mathcal{M}^{[1]} \times \cdots \times \mathcal{M}^{[H]}$, where $\mathcal{M}^{[i]} = \{ M \in \mathbb{R}^{d_u \times d_x} : \|M\| \le 2\kappa_B \kappa^3 (1-\gamma)^i \}$; however, the player may choose $M_t$ from the slightly larger set $\mathcal{M}_+ = \{2M : M \in \mathcal{M}\}$.

The following defines the adversary's cost functions, referred to as surrogate or ideal cost functions. It is a summary of Definitions 4.2, 4.4, and 4.5 of [1], as well as Definition 3.4 of [2]; while we do not use it explicitly, we give it here for the sake of concreteness.

Definition 5. Denote by $M_{0:H}$ a sequence of policies $M_0, \ldots, M_H \in \mathcal{M}_+$. For a controller $K$, let $A_K = A_\star + B_\star K$, and define:
(1) (disturbance-state transfer matrix) $\Psi_i^K(M_{0:H-1}) = A_K^i \mathbb{1}_{i \le H} + \sum_{j=1}^{H} A_K^j B_\star M_{H-j}^{[i-j]} \mathbb{1}_{i-j \in [1,H]}$;
(2) (ideal state) $y_{t+1}^K(M_{0:H-1}) = \sum_{i=0}^{2H} \Psi_i^K(M_{0:H-1}) w_{t-i}$;
(3) (ideal action) $\upsilon_t^K(M_{0:H}) = K y_t^K(M_{0:H-1}) + \sum_{i=1}^{H} M_H^{[i]} w_{t-i}$.
The surrogate or ideal costs and their expected value are respectively defined as:
$$\hat c_t(M_{0:H}) = c_t\big(y_t^K(M_{0:H-1}), \upsilon_t^K(M_{0:H})\big), \qquad \hat C_t(M_{0:H}) = \mathbb{E}_w\big[c_t\big(y_t^K(M_{0:H-1}), \upsilon_t^K(M_{0:H})\big)\big].$$

The following are statements of the reduction's key results due to [2]. Denote:
$$H = \gamma^{-1} \log\big(2\kappa^3 T\big), \qquad D_{x,u} = 8\gamma^{-1} \kappa_B \kappa^3 W (H\kappa_B + 1). \tag{4}$$
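To illustrate Definition 4, here is a minimal sketch of how a disturbance-action policy computes its action from past disturbances, and how those disturbances are recovered from observed states (the same recovery step appears in line 6 of Algorithm 1 below); the function and variable names are illustrative.

```python
import numpy as np

def dap_action(K0, M, x_t, past_w):
    """Disturbance-action policy (Definition 4):
    u_t(M) = K0 @ x_t + sum_{i=1}^H M[i] @ w_{t-i}, with w_{t-i} = 0 when t-i <= 0.
    `M` is a list of H matrices of shape (d_u, d_x); `past_w` holds w_1, ..., w_{t-1}
    in chronological order, so past_w[-i] is w_{t-i}."""
    u = K0 @ x_t
    for i in range(1, len(M) + 1):
        if len(past_w) >= i:
            u = u + M[i - 1] @ past_w[-i]
    return u

def recover_disturbance(A_star, B_star, x_t, u_t, x_next):
    """Since the system is known and the state fully observed,
    the disturbance is w_t = x_{t+1} - A_star x_t - B_star u_t."""
    return x_next - A_star @ x_t - B_star @ u_t
```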
The first result relates the costs $\hat c_t, \hat C_t$ and the associated regret to the original losses and regret.

Lemma 6. For any algorithm $\mathcal{A}$ that plays policies $M_1, \ldots, M_T \in \mathcal{M}_+$ we have:
(i) $\|x_t\|, \|u_t\| \le D_{x,u}$, and thus $|c_t(x_t, u_t)| \le C D_{x,u}^2$;
(ii) $|c_t(x_t, u_t) - \hat c_t(M_{t-H:t})| \le G D_{x,u}^2 / T$;
(iii) $\mathcal{R}_\mathcal{A}(T) \le \mathbb{E}\big[\sum_{t=H+1}^{T} \hat C_t(M_{t-H:t})\big] - \min_{M_\star \in \mathcal{M}} \sum_{t=H+1}^{T} \hat C_t(M_\star, \ldots, M_\star) + 2 D_{x,u}^2 (G + HC)$.

The second result establishes certain desirable properties of the cost functions $\hat c_t$ and $\hat C_t$.

Lemma 7. Let $\bar C_t : M \mapsto \hat C_t(M, \ldots, M)$. Then:
(i) $\hat c_t, \hat C_t$ are coordinate-wise $L_f$-Lipschitz over $\mathcal{M}_+$, with $L_f = 2\kappa_B \gamma^{-1} \kappa^3 G D_{x,u} W$;
(ii) if the $c_t(\cdot, \cdot)$ are $\beta$-smooth then $\hat c_t, \hat C_t, \bar C_t$ are $\beta_f$-smooth over $\mathcal{M}_+$ with $\beta_f = 25 \beta \kappa_B^2 \kappa^6 W^2 H / \gamma^2$;
(iii) if the $c_t(\cdot, \cdot)$ are $\alpha$-strongly convex and $\mathbb{E}[w_t w_t^\mathsf{T}] \succeq \sigma^2 I$ then the $\bar C_t$ are $\alpha_f$-strongly convex over $\mathcal{M}$ with $\alpha_f = \frac{1}{36} \alpha \sigma^2 \gamma^2 / \kappa^{10}$.

We note that the second claim of Lemma 7 was not previously established. We prove it in the full version of the paper [6] using similar techniques to those used for the first claim.

Algorithm 1 Bandit Linear Control
1: input: controller $K_0$, memory length $H$, step size $\eta$, and coordinate-wise sampling radii $r_t^{[i]}$
2: Draw $U_1 \sim \mathbb{S}^{(d_u \times d_x) \times H}$.
3: Initialize $\tau = 1$, $\bar M_1 = 0$, $M_1^{[i]} = r_1^{[i]} U_1^{[i]}$ ($\forall i \in [H]$).
4: for $t = 1, \ldots, T$ do
5:   Play $u_t = K_0 x_t + \sum_{i=1}^{H} M_t^{[i]} w_{t-i}$,
6:   Observe $x_{t+1}$ and $c_t(x_t, u_t)$; update $w_t = x_{t+1} - A_\star x_t - B_\star u_t$.
7:   Draw $b_t \sim \mathrm{Bernoulli}(1/(2H+2))$
8:   if $t \ge 2H+2$ and $b_t \prod_{i=1}^{2H+1} (1 - b_{t-i}) = 1$ then
9:     Update $\bar M_{\tau+1}^{[i]} = \Pi_{\mathcal{M}^{[i]}}\big[\bar M_\tau^{[i]} - \eta\, d_x d_u H\, c_t(x_t, u_t)\, r_\tau^{[i]} U_\tau^{[i]}\big]$ ($\forall i \in [H]$).
10:    $\tau \leftarrow \tau + 1$
11:    Draw $U_\tau \sim \mathbb{S}^{(d_u \times d_x) \times H}$.
12:    $M_{t+1}^{[i]} = \bar M_\tau^{[i]} + r_\tau^{[i]} U_\tau^{[i]}$ ($\forall i \in [H]$).
13:  else
14:    $M_{t+1} = M_t$

3 Algorithm and Main Results

We present a new algorithm for the bandit linear control problem, detailed in Algorithm 1, for which we prove:

Theorem 8. Let $H, D_{x,u}, \alpha_f, \beta_f$ be as in Eq. (4) and Lemma 7, let $K_0$ be a $(\kappa, \gamma)$-diagonal strongly stable controller, and let $d_{\min} = \min\{d_x, d_u\}$. Suppose Algorithm 1 is run with the above parameters, and
$$\eta = \sqrt{\frac{3 d_{\min}^2 + (15\beta_f/\alpha_f)\log T}{T\, d_x^2 d_u^2 C^2 D_{x,u}^4}}, \qquad r_t^{[i]} = \Big( \big(2\kappa_B\kappa^3(1-\gamma)^i\big)^{-2} + \tfrac{1}{2} \alpha_f \eta\, t \Big)^{-1/2}.$$
Then the expected pseudo-regret is bounded as
$$\mathcal{R}_\mathcal{A}(T) \le 4 d_x d_u C D_{x,u}^2 (H+1)^2 \sqrt{T\big(3 d_{\min}^2 + (15\beta_f/\alpha_f)\log T\big)} + O(T^{1/4}).$$
The big-$O$ notation in the theorem hides polynomial dependence on the problem parameters and poly-logarithmic dependence on $T$. We prove Theorem 8 later in Section 5 after discussing our reduction technique. There are three main components to the algorithm:

(1) A randomized schedule that determines the times of parameter updates, ensuring that these are at least $2(H+1)$ rounds apart and $O(H)$ apart in expectation. This is part of our new reduction scheme, which is presented and discussed in Section 4.

(2) A standard one-point gradient estimate that gives a (nearly) unbiased estimate of the gradient of a function based on bandit feedback, by perturbing the policy using uniform samples from the unit Euclidean sphere of $\mathbb{R}^{(d_u \times d_x) \times H}$; this is denoted $U \sim \mathbb{S}^{(d_u \times d_x) \times H}$.

(3) A preconditioned (online) gradient update rule that uses mixed regularization and standard Euclidean projections $\Pi_{\mathcal{M}^{[i]}}[M] = \arg\min_{M' \in \mathcal{M}^{[i]}} \|M - M'\|_F$. The mixed regularization, inspired by [17], is comprised of two terms (see $r_t^{[i]}$ in Theorem 8): the first exploits the strong convexity of the (expected) loss functions, while the second accounts for the small diameter of $\mathcal{M}^{[i]}$, which might be significantly smaller than the magnitude of the perturbations required for the gradient estimates (this is particularly problematic for large $i$). To avoid sampling too far away from the admissible set, where the cost functions are well behaved, we cap the perturbations of the one-point estimate according to the radii $\{r_t^{[i]}\}_{i \in [H]}$ and increase the regularization term to account for the higher variance of the resulting gradient estimate.
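As a concrete illustration of components (2) and (3), the following is a simplified, no-memory sketch of the one-point gradient estimate and the preconditioned projected update for a single parameter block constrained to a spectral-norm ball. It is an illustrative sketch under placeholder parameters, not the exact procedure of Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_sphere(shape):
    """Uniform sample from the unit Euclidean (Frobenius) sphere of the given shape."""
    U = rng.standard_normal(shape)
    return U / np.linalg.norm(U)

def project_spectral_ball(M, radius):
    """Frobenius projection onto {M : ||M||_2 <= radius}, via clipping singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, radius)) @ Vt

def one_point_update(M_bar, r, eta, d, radius, bandit_cost_of):
    """One-point estimate + preconditioned projected step (cf. Algorithm 1, lines 9 and 12)."""
    U = sample_unit_sphere(M_bar.shape)
    M_played = M_bar + r * U                      # play a perturbed point
    f_hat = bandit_cost_of(M_played)              # scalar bandit feedback at the played point
    g_hat = (d / r) * f_hat * U                   # one-point gradient estimate
    M_next = project_spectral_ball(M_bar - eta * r**2 * g_hat, radius)  # preconditioned step
    return M_next, M_played
```

Replacing `project_spectral_ball` with a simple rescaling onto a Frobenius-norm ball corresponds to the cheaper relaxation discussed in the complexity paragraph below.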
Computational and space complexity: From a memory perspective, the algorithm needs to maintain the matrices $\bar M_\tau^{[i]}$ and $M_t^{[i]}$, which take $O(d_x d_u H)$ memory. Computationally, the bottleneck is the projection in line 9, which is onto a spectral-norm ball. This typically requires a singular value decomposition, which takes $O(d_u^2 d_x H + d_x^3 H)$ time. However, when this is computationally prohibitive, it can be relaxed to a projection onto a Frobenius-norm ball, which takes only $O(d_x d_u H)$ time, at the cost of a factor of $d_{\min}$ in the regret bound of Theorem 8. Notice that the random draws in lines 2 and 11 are from the Euclidean (Frobenius) sphere, and thus all remaining computations take $O(d_x d_u H)$ time.

4 Bandit Convex Optimization with Memory

In this section we give the details of our reduction from BCO with memory to standard BCO, which constitutes a key element of our main algorithm. The application to bandit linear control, however, will require a slightly more general feedback model than the usual notion of bandit feedback. We consider the online optimization with memory setting described in Section 2.2, with a feedback model such that on round $t$:
(1) The player chooses $x_t \in \mathcal{X}_+$, and independently, the adversary draws a random $\xi_t$;
(2) The player observes feedback $\hat f_t = \hat f_t(x_{t+1-H:t}; \xi_{t+1-H:t})$ such that, if $x_{t+1-H:t}$ are jointly independent of $\xi_{t+1-H:t}$, then $\big|\mathbb{E}_{\xi_{t+1-H:t}}[\hat f_t] - f_t(x_{t+1-H:t})\big| \le \varepsilon$.
The above expectation is only with respect to the variables $\xi_{t+1-H:t}$, and $\varepsilon \ge 0$ is a fixed parameter (possibly unknown to the player). Our feedback model, which may potentially seem non-standard, encompasses the following ideas, both of which are necessary for the application to linear control:
- In the standard no-memory setting ($H = 1$), standard arguments apply even if the feedback received by the learner is randomized, as long as it is independent of the learner's decision on the same round. In the memory setting, the analogous condition is that the last $H$ decisions do not depend on the adversary's randomness during this time.
- We allow feedback of the form $\hat f_t = f_t(x_t) + \varepsilon_t$, where $\varepsilon_t$ is a small adaptive adversarial disturbance that can depend on all past history (yet is at most $\varepsilon$ in absolute value).
In the context of linear control, the identity of the above terms will be $x_t := M_t$, $\xi_t := w_t$, $\hat f_t := c_t(x_t, u_t)$, $f_t(\cdot) := \hat C_t(\cdot)$, and thus $d = d_x d_u H$.

4.2 Base BCO Algorithm

The reduction relies on the following properties of the base algorithm $\mathcal{A}(T)$ for BCO with no memory, which can be run against an adversary that chooses loss functions from $\mathcal{F} \subseteq \{ f : \mathcal{X}_+ \to \mathbb{R} \}$ while observing feedback satisfying $|\mathbb{E}_{\xi_t}[\hat f_t] - f_t(x_t)| \le \varepsilon$:
(i) Its regret at times $t \le T$ is bounded as $\mathcal{R}_1(t) \le R_\mathcal{A}(T)$, where $R_\mathcal{A}(T) \ge 0$;
(ii) Its predictions $x_1, \ldots, x_T$ satisfy $\|\bar x_{t+1} - \bar x_t\| \le \delta_t$ and $\|x_t - \bar x_t\| \le \vartheta_t$ almost surely, where $\bar x_t$ is the expected value of $x_t$ conditioned on all past history up to (but not including) the player's decision at time $t$, and $\delta_t, \vartheta_t$ are decreasing sequences.
The above properties are satisfied by standard BCO algorithms, often without any modification. In particular, these algorithms are amenable to our randomized and perturbed feedback model and often require only a slight modification of their regret analyses to account for the additive disturbances. (In the full version of the paper [6] we give an analysis of a concrete BCO algorithm in this setting.) Notice that, crucially, $\delta_t$ bounds the change in the algorithm's expected predictions as opposed to their actual movement. This matters because typical BCO algorithms add large perturbations to their predictions (as part of their gradient estimation procedure), with magnitude often significantly larger than the change in the underlying expected prediction; i.e., it typically holds that $\vartheta_t \gg \delta_t$. Our reduction procedure is able to exploit this observation to improve performance for smooth functions.
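Algorithm 2 below interacts with such a base algorithm only through an initialize/update interface (lines 2 and 8 of Algorithm 2). The following stub is a minimal sketch of that interface; the class name and the hook methods are illustrative, and the properties quoted in the docstring are the assumptions (i)-(ii) above rather than guarantees of this skeleton.

```python
import numpy as np

class BaseBCO:
    """Skeleton of a no-memory BCO algorithm A(T) as assumed by the reduction:
    (i) regret R_1(t) <= R_A(T) under feedback with |E[f_hat] - f(x)| <= eps;
    (ii) expected-prediction shifts bounded by delta_t, played-point deviations by theta_t."""

    def __init__(self, x1, eta):
        self.x_bar = np.asarray(x1, dtype=float)   # current expected prediction
        self.eta = eta
        self.t = 1

    def initialize(self):
        """Return the first played point (possibly a perturbation of x_bar)."""
        return self.x_bar + self._perturbation()

    def update(self, f_hat):
        """Consume one scalar feedback, move x_bar, and return the next played point."""
        g_hat = self._gradient_estimate(f_hat)     # e.g., a one-point estimate as in Section 3
        self.x_bar = self.x_bar - self.eta * g_hat
        self.t += 1
        return self.x_bar + self._perturbation()

    # The two hooks below are placeholders for a concrete instantiation.
    def _perturbation(self):
        return np.zeros_like(self.x_bar)

    def _gradient_estimate(self, f_hat):
        return np.zeros_like(self.x_bar)
```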
4.3 The Reduction

We can now present our reduction from BCO with memory to BCO with no memory ($H = 1$); see Algorithm 2. The idea is simple: we run a base algorithm for standard BCO, denoted here by $\mathcal{A}$, on the observed feedback, but make sure that $\mathcal{A}$ is updated at most once every $H$ rounds. Since the setup is adversarial, we cannot impose a deterministic update schedule; instead, we employ a randomized schedule in which $\mathcal{A}$ is invoked with probability $1/H$ on each round, but constrained so that this does not happen too frequently. (A similar technique was used in a different context in [12, 8].)

Algorithm 2 BCO Reduction
1: input: memory length $H$, BCO algorithm $\mathcal{A}(\lceil T/H \rceil)$.
2: set: $x_1 \leftarrow \mathcal{A}.\mathrm{initialize}()$
3: for $t = 1, \ldots, T$ do
4:   Play $x_t$ (independently, the adversary draws $\xi_t$)
5:   Observe feedback $\hat f_t(x_{t+1-H:t}; \xi_{t+1-H:t})$
6:   Draw $b_t \sim \mathrm{Bernoulli}(1/H)$
7:   if $t \ge H$ and $b_t \prod_{i=1}^{H-1} (1 - b_{t-i}) = 1$ then
8:     $x_{t+1} \leftarrow \mathcal{A}.\mathrm{update}(\hat f_t)$
9:   else
10:    $x_{t+1} \leftarrow x_t$

The induced spacing between consecutive updates of $\mathcal{A}$ serves two purposes at once: first, it reduces the $H$-memory loss functions to functions of a single argument, amenable to optimization using $\mathcal{A}$; second, it facilitates (conditional) probabilistic independence between consecutive updates, which is crucial for dealing with the extended feedback model required by the application to linear control. (We note that these conditions are not satisfied by existing techniques [4, 3, 1].) The following is the main result of the reduction.

Theorem 9. Suppose Algorithm 2 is run using $\mathcal{A}(\lceil T/H \rceil)$ satisfying the above properties:
(i) If $\bar f_t : x \mapsto f_t(x, \ldots, x)$ satisfy $\bar f_t \in \mathcal{F}$, and the $f_t$ are coordinate-wise $L$-Lipschitz, then
$$\mathcal{R}_H(T) \le 3H\, R_\mathcal{A}\big(\lceil T/H \rceil\big) + 2 L H^2 \sum_{t=1}^{\lceil T/H \rceil} (\delta_t + 2\vartheta_t);$$
(ii) If additionally the $f_t$ are convex and $\beta$-smooth, then
$$\mathcal{R}_H(T) \le 3H\, R_\mathcal{A}\big(\lceil T/H \rceil\big) + 2 H^2 \sum_{t=1}^{\lceil T/H \rceil + 1} \big(L\delta_t + \beta\delta_t^2 + 6\beta\vartheta_t^2\big).$$
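The following is a minimal sketch of the randomized update schedule of Algorithm 2, wrapping any object that exposes the initialize/update interface sketched above; the dummy base and feedback oracle in the usage example are illustrative placeholders.

```python
import numpy as np

def run_bco_reduction(base, H, T, play_and_get_feedback, rng=None):
    """Algorithm 2 (sketch): update the base BCO algorithm at most once every H rounds,
    using a Bernoulli(1/H) draw gated so that no draw fired in the previous H-1 rounds."""
    rng = rng or np.random.default_rng(0)
    x = base.initialize()
    b = np.zeros(T + 1, dtype=int)                 # b[t] = Bernoulli draw at round t
    for t in range(1, T + 1):
        f_hat = play_and_get_feedback(x)           # play x_t, observe scalar feedback
        b[t] = rng.binomial(1, 1.0 / H)
        gate = all(b[t - i] == 0 for i in range(1, H) if t - i >= 1)
        if t >= H and b[t] == 1 and gate:
            x = base.update(f_hat)                 # update only at well-spaced random times
        # otherwise keep playing the same point: x_{t+1} = x_t
    return x

# Illustrative usage with a trivial base and a toy feedback oracle:
class _ConstantBase:
    def __init__(self, x1): self.x = np.asarray(x1, dtype=float)
    def initialize(self): return self.x
    def update(self, f_hat): return self.x

x_final = run_bco_reduction(_ConstantBase(np.zeros(3)), H=5, T=100,
                            play_and_get_feedback=lambda x: float(np.sum(x**2)))
```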
4.4 Proof Ideas

We provide some of the main ideas for proving Theorem 9. We start with the following technical lemma, which quantifies the duration between updates of the base BCO algorithm.

Lemma 10. Suppose the $b_t$ in Algorithm 2 are drawn in advance for all $t \ge 1$. Let $t_0 = 0$ and, for $i \ge 1$, let
$$t_i = \min\Big\{ t \ge t_{i-1} + H \;\Big|\; b_t \prod_{j=1}^{H-1} (1 - b_{t-j}) = 1 \Big\}.$$
Then, denoting by $S = \{ t_i \mid H \le t_i < T \}$ the times at which Algorithm 2 updates $\mathcal{A}$, we have that (i) $|S| \le \lceil T/H \rceil$, and (ii) $\mathbb{E}[t_i - t_{i-1}] = \mathbb{E}[t_1] \le 3H$ for all $i$.

See the proof in the full version of the paper [6]. The next lemma shows how the randomized update schedule allows us to bound the regret using the base (no-memory) algorithm and an additional term that depends on the algorithm's prediction shifts.

Lemma 11. Suppose Algorithm 2 is run with $\mathcal{A}(\lceil T/H \rceil)$ as in Theorem 9. If $\bar f_t \in \mathcal{F}$ then we have that
$$\mathcal{R}_H(T) \le 3H\, R_\mathcal{A}\big(\lceil T/H \rceil\big) + \mathbb{E}\Bigg[ \sum_{t=H}^{T} \Big( f_t(x_{t+1-H}, \ldots, x_t) - \bar f_t(x_{t+1-H}) \Big) \Bigg].$$

Proof. Let $S$ be the times at which Algorithm 2 updates $\mathcal{A}$, as defined in Lemma 10. Denote $\bar\xi_t = \xi_{t+1-H:t}$, and notice that $\{\bar\xi_t\}_{t \in S}$ are mutually independent since Algorithm 2 ensures there are at least $H$ rounds between updates of $\mathcal{A}$. Moreover, this implies that for any $t \in S$, $x_{t+1-H} = \ldots = x_t$, and these are also independent of $\bar\xi_t$. Our $H$-memory feedback model thus implies that $|\bar f_t(x_t) - \mathbb{E}_{\bar\xi_t}[\hat f_t]| \le \varepsilon$ for all $t \in S$, and since $\bar f_t \in \mathcal{F}$, we can use the regret bound of $\mathcal{A}$ to get that for any $x \in \mathcal{X}$,
$$\mathbb{E}\Bigg[\sum_{t \in S} \bar f_t(x_t) - \sum_{t \in S} \bar f_t(x)\Bigg] \le \mathbb{E}\Bigg[\sum_{t \in S} \bar f_t(x_t) - \min_{x' \in \mathcal{X}} \sum_{t \in S} \bar f_t(x')\Bigg] = \mathbb{E}\big[\mathcal{R}_1(|S|)\big] \le R_\mathcal{A}\big(\lceil T/H \rceil\big),$$
where the last transition also used the fact that $|S| \le \lceil T/H \rceil$ (see Lemma 10). Next, denote $\chi_t = b_t \prod_{i=1}^{H-1} (1 - b_{t-i})$ and notice that $\mathbb{E}[\chi_t] = \mathbb{E}[\chi_H]$. Then for any fixed $x \in \mathcal{X}$ we have that
$$\mathbb{E}\Bigg[\sum_{t=H}^{T} \bar f_t(x) \chi_t\Bigg] = \sum_{t=H}^{T} \bar f_t(x)\, \mathbb{E}[\chi_t] = \mathbb{E}[\chi_H] \sum_{t=H}^{T} \bar f_t(x).$$
Next, notice that $x_{t+1-H}$ is independent of $\chi_t$, and since $\chi_t = 1$ implies that $x_t = x_{t+1-H}$, we get that
$$\mathbb{E}\Bigg[\sum_{t=H}^{T} \bar f_t(x_t) \chi_t\Bigg] = \sum_{t=H}^{T} \mathbb{E}\big[\bar f_t(x_{t+1-H})\big]\, \mathbb{E}[\chi_t] = \mathbb{E}[\chi_H]\, \mathbb{E}\Bigg[\sum_{t=H}^{T} \bar f_t(x_{t+1-H})\Bigg].$$
Finally, combining the last three equations we get that
$$\mathbb{E}\Bigg[\sum_{t=H}^{T} \bar f_t(x_{t+1-H})\Bigg] - \sum_{t=H}^{T} \bar f_t(x) \le \big(\mathbb{E}[\chi_H]\big)^{-1} R_\mathcal{A}\big(\lceil T/H \rceil\big) \le 3H\, R_\mathcal{A}\big(\lceil T/H \rceil\big),$$
where the last transition used the non-negativity of $R_\mathcal{A}(\lceil T/H \rceil)$ and the fact that $\mathbb{E}[\chi_H] \ge 1/(3H)$. Plugging the above into $\mathcal{R}_H(T)$ concludes the proof.

Completing the proof of Theorem 9 entails bounding the second term in Lemma 11. While the smooth case requires some delicate care to achieve the squared dependence on $\vartheta_t$, the proof is otherwise quite technical and thus deferred to the full version of the paper [6].

5 Proof of Theorem 8

We first require the following, mostly standard, analysis of the base procedure of Algorithm 1 for the no-memory setting ($H = 1$). See the proof in the full version of the paper [6].

Lemma 12. Consider the setting of Section 4 with $H = 1$ and $\varepsilon \le O(1/T)$, against an adversary that chooses $f_t : \mathcal{M}_+ \to \mathbb{R}$ that are $\alpha_f$-strongly convex and $\beta_f$-smooth. Let $d_\mathcal{M} = d_x d_u H$ be the dimension of $\mathcal{M}_+$, and let $\mathcal{R}_1(t)$ be the regret of a procedure that at time $t$:
(i) Draws $U_t \sim \mathbb{S}^{(d_u \times d_x) \times H}$, and plays $M_t$ where $M_t^{[i]} = \bar M_t^{[i]} + r_t^{[i]} U_t^{[i]}$ ($\forall i \in [H]$);
(ii) Observes $\hat f_t$, and sets $\hat g_t^{[i]} = (d_\mathcal{M} / r_t^{[i]})\, \hat f_t\, U_t^{[i]}$ ($\forall i \in [H]$) (one-point gradient estimate);
(iii) Updates $\bar M_{t+1}^{[i]} = \Pi_{\mathcal{M}^{[i]}}\big[\bar M_t^{[i]} - \eta (r_t^{[i]})^2 \hat g_t^{[i]}\big]$ ($\forall i \in [H]$) (preconditioned update).
If $|\hat f_t| \le \hat F$, the $r_t^{[i]}$ and $d_{\min}$ are as in Theorem 8, and $\eta \le O(T^{-1/2})$, then $\delta_t = d_\mathcal{M} \hat F \sqrt{2\eta / (\alpha_f t)}$ and $\vartheta_t^2 = 2/(\alpha_f \eta t)$ satisfy the assumptions on $\mathcal{A}(T)$ in Theorem 9, and for all $t \le T$,
$$\mathcal{R}_1(t) \le \frac{1}{\eta}\Big( H d_{\min}^2 + \frac{2\beta_f}{\alpha_f}(1 + \log T) \Big) + \frac{d_\mathcal{M}^2 \hat F^2}{2}\, \eta T + O(T^{1/4}).$$

Proof of Theorem 8. Consider $\hat c_t, \hat C_t$ from Definition 5 and notice that $\hat c_t$ depends on the last $H+1$ policies but also on the last $2(H+1)$ system noises. This means that the effective memory of the adversary is $2(H+1)$, prompting us to modify the definitions of $\hat c_t, \hat C_t$ to receive $2(H+1)$ policies but ignore the first $H+1$, i.e.,
$$\hat C_t^{\mathrm{new}}(M_{0:2H+1}) = \hat C_t(M_{H+1:2H+1}), \qquad \hat c_t^{\mathrm{new}}(M_{0:2H+1}) = \hat c_t(M_{H+1:2H+1}).$$
Henceforth, $\hat c_t, \hat C_t$ refer to $\hat c_t^{\mathrm{new}}, \hat C_t^{\mathrm{new}}$. Notice that Lemmas 6 and 7 are not impacted by this change and hold with the same $H$ as for the original functions. We thus have that the $\hat C_t$ are $L_f$-coordinate-wise Lipschitz and $\beta_f$-smooth, and the $\bar C_t$ are $\alpha_f$-strongly convex and $\beta_f$-smooth. Moreover, $|c_t(x_t, u_t)| \le C D_{x,u}^2$, and if $M_{t-1-2H:t}$ are independent of $w_{t-1-2H:t}$ then
$$\big| \mathbb{E}_{w_{t-1-2H:t}}\big[c_t(x_t, u_t) - \hat c_t(M_{t-1-2H:t})\big] \big| = \big| \mathbb{E}_{w_{t-1-2H:t}}[c_t(x_t, u_t)] - \hat C_t(M_{t-1-2H:t}) \big| \le \frac{G D_{x,u}^2}{T}.$$
Now, consider Algorithm 1 in the context of the BCO with memory setting presented in Section 4 with $\varepsilon = G D_{x,u}^2 / T$ and feedback bounded by $C D_{x,u}^2$, and let $\mathcal{R}_{2(H+1)}(T)$ be its regret against an adversary that chooses functions $f_t : \mathcal{M}_+^{2(H+1)} \to \mathbb{R}$ satisfying: the $f_t$ are $L_f$-coordinate-wise Lipschitz and $\beta_f$-smooth, and $\bar f_t : M \mapsto f_t(M, \ldots, M)$ are $\alpha_f$-strongly convex and $\beta_f$-smooth. Since the $\hat C_t$ satisfy these assumptions, and our choice of $r_t^{[i]}$ ensures that $M_t \in \mathcal{M}_+$, Lemma 6 yields that
$$\mathcal{R}_\mathcal{A}(T) \le \mathcal{R}_{2(H+1)}(T) + 2 D_{x,u}^2 (G + HC),$$
and since the second term is at most poly-logarithmic in $T$, it remains to bound $\mathcal{R}_{2(H+1)}(T)$. To that end, notice that Algorithm 1 fits the mold of our reduction procedure given in Algorithm 2, with the base procedure of Lemma 12.
Now, invoking Lemma 12 with $\hat F = C D_{x,u}^2$ and horizon $\lceil T/(2(H+1)) \rceil$, the second term of Theorem 9 satisfies
$$\sum_{t=1}^{\lceil T/(2(H+1)) \rceil + 1} \big( L_f \delta_t + \beta_f \delta_t^2 + 6\beta_f \vartheta_t^2 \big) \le \frac{12 \beta_f \log T}{\alpha_f \eta} + O(T^{1/4}),$$
and further using Lemma 12 to bound the first term of Theorem 9, and simplifying, we get that
$$\mathcal{R}_{2(H+1)}(T) \le 2(H+1)^2 \bigg( \frac{1}{\eta}\Big( 3 d_{\min}^2 + \frac{15\beta_f}{\alpha_f} \log T \Big) + d_x^2 d_u^2 C^2 D_{x,u}^4\, \eta T \bigg) + O(T^{1/4}).$$
Our choice of $\eta$ yields the final bound.

6 Extensions to General Costs and Adversarial Noise

In this section we consider the case where the cost functions chosen by the adversary are general, possibly non-smooth (weakly) convex functions. Importantly, we also allow the system noise to be chosen by an oblivious adversary. Formally, the setup is identical to Section 2.1 but with the following modifications:
1. Only Assumptions 1-3 are assumed throughout;
2. The costs $c_t(x, u)$ are (weakly) convex functions of $(x, u)$;
3. The disturbances $w_t$ are chosen by an oblivious adversary, i.e., one that has knowledge of the algorithm but must choose all disturbances before the first round. (Notice that the $w_t$ are still bounded, as per Assumption 2.)

To ease notation, recall that $d_{\min} = \min\{d_x, d_u\}$, and let $D^2 = \max_{M_1, M_2 \in \mathcal{M}} \|M_1 - M_2\|_F^2 \le 4 d_{\min}^2 \kappa_B^2 \kappa^6 / \gamma$. The following extends our main result, given in Theorem 8, to the setting above.

Theorem 13. Let $H, D_{x,u}, L_f, \beta_f$ be as in Eq. (4) and Lemma 7, let $K_0$ be a $(\kappa, \gamma)$-strongly stable controller, and let $d_\mathcal{M} = d_x d_u H$, $\hat F = C D_{x,u}^2$, and $r_0^{[i]} = 2\kappa_B \kappa^3 (1-\gamma)^i$. Then:
1. The regret of running Algorithm 1 with
$$r_t^{[i]} = \Big( (r_0^{[i]})^{-2} + \frac{4 L_f (H+1) T}{d_\mathcal{M} \hat F D} \Big)^{-1/2}, \qquad \eta = \Big( \frac{2}{(H+1)^3 L_f^2 D^2} \Big)^{1/4},$$
satisfies $\mathcal{R}_\mathcal{A}(T) \le 2\, d_x d_u d_{\min} C D_{x,u}^2\, \kappa_B \kappa^3 \gamma^{-1/2} L_f\, (H+1)^{7/2}\, T^{3/4} + O(T^{1/2})$;
2. If the $c_t$ are $\beta$-smooth, then the regret of running Algorithm 1 with
$$\eta = \frac{1}{2(H+1)\beta_f D^2}, \qquad r_t^{[i]} = \bigg( (r_0^{[i]})^{-2} + \Big( \frac{4\beta_f^2 T}{(H+1) d_\mathcal{M}^2 \hat F^2 D^2} \Big)^{1/3} \bigg)^{-1/2},$$
satisfies $\mathcal{R}_\mathcal{A}(T) \le 12\sqrt{2}\, d_x d_u d_{\min} C D_{x,u}^2\, \kappa_B \kappa^3 \sqrt{\beta_f/\gamma}\, (H+1)^3\, T^{2/3} + O(T^{1/2})$.

The proof of Theorem 13 is given in the full version of the paper [6], and follows the same ideas as the proof of Theorem 8 with a few technical adjustments. Notice that Theorem 13 only requires a strongly stable initial controller $K_0$, as opposed to the diagonal strongly stable controller needed for Theorem 8. Moreover, since the noise is no longer stochastic, our definition of pseudo-regret now coincides with the standard definition of regret, given by $\mathcal{R}_\mathcal{A}(T) = \mathbb{E}\big[\max_{K \in \mathcal{K}} R_\mathcal{A}(T, K)\big]$.

Broader Impact

There are no foreseen ethical or societal consequences for the research presented herein.

Acknowledgments and Disclosure of Funding

This work was partially supported by the Israeli Science Foundation (ISF) grant 2549/19 and by the Yandex Initiative in Machine Learning.

References

[1] N. Agarwal, B. Bullins, E. Hazan, S. Kakade, and K. Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019.
[2] N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
[3] O. Anava, E. Hazan, and S. Mannor. Online learning for adversaries with memory: Price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.
[4] R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pages 1747–1754, 2012.
[5] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
[6] A. Cassel and T. Koren. Bandit linear control. arXiv preprint arXiv:2007.00759, 2020.
[7] A. Cassel, A. Cohen, and T. Koren. Logarithmic regret for learning linear quadratic regulators efficiently. arXiv preprint arXiv:2002.08095, 2020.
[8] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Nonstochastic bandits with composite anonymous feedback. In Conference on Learning Theory, pages 750–773, 2018.
[9] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1029–1038, 2018.
[10] A. Cohen, T. Koren, and Y. Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
[11] O. Dekel and E. Hazan. Better rates for any adversarial deterministic MDP. In International Conference on Machine Learning, pages 675–683, 2013.
[12] O. Dekel, E. Hazan, and T. Koren. The blinded bandit: Learning with adaptive feedback. In Advances in Neural Information Processing Systems, pages 1610–1618, 2014.
[13] T. Dick, A. Gyorgy, and C. Szepesvari. Online learning in Markov decision processes with changing cost sequences. In International Conference on Machine Learning, pages 512–520, 2014.
[14] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[15] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
[16] D. J. Foster and M. Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020.
[17] E. Hazan and K. Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792, 2014.
[18] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[19] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[20] T. Jin and H. Luo. Learning adversarial MDPs with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
[21] D. A. Levin and Y. Peres. Markov Chains and Mixing Times. American Mathematical Society, 2017.
[22] H. Mania, S. Tu, and B. Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
[23] G. Neu, A. Antos, A. György, and C. Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812, 2010.
[24] A. Rosenberg and Y. Mansour. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pages 2209–2218, 2019.
[25] A. Saha and A. Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 636–642, 2011.
[26] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.
[27] M. Simchowitz and D. J. Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.
[28] M. Simchowitz, K. Singh, and E. Hazan. Improper learning for non-stochastic control. arXiv preprint arXiv:2001.09254, 2020.