# online_control_of_unknown_timevarying_dynamical_systems__92206030.pdf

Online Control of Unknown Time-Varying Dynamical Systems

Edgar Minasyan Princeton University, Google AI Princeton minasyan@princeton.edu

Paula Gradu UC Berkeley pgradu@berkeley.edu

Max Simchowitz MIT msimchow@mit.edu

Elad Hazan Princeton University, Google AI Princeton ehazan@princeton.edu

We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is qualitatively harder than that of either unknown time-invariant or known timevarying dynamics, and complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More speciﬁcally, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems. We prove a lower bound that no algorithm can obtain sublinear regret with respect to the ﬁrst two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that ofﬂine planning over the state linear feedback policies is NP-hard, suggesting hardness of the online learning problem. On the positive side, we give an efﬁcient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear adaptive regret bounds, which is a strictly stronger metric than standard regret and is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefﬁcient algorithm for regret against linear state feedback policies.

1 Introduction

The control of linear time-invariant (LTI) dynamical systems is well-studied and understood. This includes classical methods from optimal control such as LQR and LQG, as well as robust H control. Recent advances study regret minimization and statistical complexity for online linear control, in both stochastic and adversarial perturbation models. Despite this progress, rigorous mathematical guarantees for nonlinear control remain elusive: nonlinear control is both statistically and computationally intractable in general.

In the face of these limitations, recent research has begun to study the rich continuum of settings which lie between LTI systems and generic nonlinear ones. The hope is to provide efﬁcient and robust algorithms to solve the most general control problems that are tractable, and at the same time, to characterize precisely at which degree of nonlinearity no further progress can be made.

35th Conference on Neural Information Processing Systems (Neur IPS 2021), virtual.

This paper studies the control of linear, time-varying (LTV) dynamical systems as one such point along this continuum. This is because the ﬁrst-order Taylor approximation to the dynamics of any smooth nonlinear system about a given trajectory is an LTV system. These approximations are widely popular because they allow for efﬁcient planning, as demonstrated by the success of i LQR and i LQG methods for nonlinear receding horizon control. We study online control of discrete-time LTV systems, with dynamics and time-varying costs

xt+1 = Atxt + Btut + wt, ct(xt, ut) : (x, u) R. (1.1)

Above, xt is the state of the system, ut the control input, wt the disturbances, and At, Bt the system matrices. Our results extend naturally to partial-state observation, where the controller observes linear projections of the state yt = Ctxt. We focus on the challenges introduced when the system matrices At, Bt and perturbations wt are not known to the learner in advance, and can only be determined by live interaction with the changing systems.

In this setting, we ﬁnd that the overall change in system dynamics across time characterizes the difﬁculty of controlling the unknown LTV system. We deﬁne a measure, called system variability, which quantiﬁes this. We show both statistical and computational lower bounds as well as algorithmic upper bounds in terms of the system variabilility. Surprisingly, system variability does not impede the complexity of control when the dynamics are known [16].

1.1 Contributions

We consider the recently popularized nonstochastic model of online control, and study regret bounds with respect to common classes of policies: Disturbance Action (DAC/SLS [44]), Disturbance Response (DRC/Youla [46]), and linear feedback policies. Planning over the third class of feedback policies in LTI systems admits efﬁcient convex relaxations via the the ﬁrst two parametrizations, DAC and DRC. This insight has been the cornerstone of both robust [49, 44] and online [3, 39] control.

Separation of parametrizations. For linear time-varying systems, however, we ﬁnd that equivalences between linear feedback, DAC and DRC fail to hold: we show that there are cases where any one of the three parametrizations exhibits strictly better control performance than the other two.

Regret against convex parametrizations. Our ﬁrst set of results pertain to DAC and DRC parametrizations, which are convex and admit efﬁcient optimization. We demonstrate that no algorithm can obtain sublinear regret with respect to these classes when faced with unknown, LTV dynamics unless a certain measure of system variability also scales sublinearly in the horizon. This is true even under full observation, controllable dynamics, and ﬁxed control cost. This ﬁnding is in direct contrast to recent work which shows sublinear regret is attainable over LTV system dynamics if they are known [16].

We give an efﬁcient algorithm that attains sublinear regret against these policy classes up to an additive penalty for the aforementioned system variability term found in our lower bound. When the system variability is sufﬁciently small, our algorithm recovers state-of-the-art results for unknown LTI system dynamics up to logarithmic factors.

In fact, our algorithm enjoys sublinear adaptive regret [21], a strictly stronger metric than standard regret which is more appropriate for time-varying systems. We also show that the stronger notion of adaptivity called strongly adaptive regret [11] is out of reach in the partial information setting.

Regret against state feedback. Finally, we consider the class of state feedback policies, which are linear feedback with memory length one. We show that full-information optimization over state feedback policies is computationally hard. This suggests that obtaining sublinear regret relative to these policies may be computationally prohibitive, though does not entirely rule out the possibility of improper learning. However, improper learning cannot be done via the DRC or DAC relaxations in light of our policy class separation results. Finally, we include an inefﬁcient algorithm which attains sublinear (albeit nonparametric-rate) regret against state feedback control policies.

Paper Structure

Discussion of relevant literature and relation to our work can be found in Section 1.2. In Section 2, we formally introduce the setting of LTV nonstochastic control, the policy classes we study and our key result regarding their non-equivalence in the LTV setting (Theorem 2.1). Motivated by this

non-equivalence, the remainder of the paper is split into the study of convex policies (Section 3) and of state feedback policies (Section 4). In Section 3, we show that regret against the DAC and DRC classes cannot be sublinear unless the metric system variability (Deﬁnition 3.1) itself is sublinear (Theorem 3.1), and also propose Algorithm 2 whose adaptive regret scales at the rate of our lower bound plus a T 2/3 term (Theorem 3.4). On the other hand, in Section 4 we show sublinear regret against state feedback policies is technically possible (Theorem 4.1) with a computationally inefﬁcient algorithm, but also provide a computational lower bound (Theorem 4.2) for planning which reveals signiﬁcant difﬁculties imposed by the LTV dynamics in this scenario as well. Finally, in Section 5 we pose several future directions, concerning both questions in LTV control, as well as the extension to nonlinear control.

1.2 Related Work

Our study of LTV systems is motivated by the widespread practical popularity of iterative linearization for nonlinear receding horizon control; e.g., the i LQR [40], i LC [29], and i LQG [41] algorithms. Recent research has further demonstrated that near-optimal solutions to LTV approximations of dynamics confer stability guarantees onto the original nonlinear system of interest [45].

Low-Regret Control: We study algorithms which enjoy sublinear regret for online control of LTV systems; that is, whose performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. [1] initiated the study of online control under the regret benchmark by introducing the online LQR problem: where a learner is faced with an unknown LTI system, ﬁxed costs and i.i.d. Gaussian disturbances, and must attain performance relative to the LQR-optimal policy. Bounds for this setting were later improved and reﬁned in [12, 26, 10, 38], and extended to partial-state observation in [25, 24]. Our work instead adopts the nonstochastic control setting [3], where the adversarially chosen (i.e. non-Gaussian) noise is considered to model the drift terms that arise in linearizations of nonlinear terms, and where costs may vary with time. [3] consider known system dynamics, later extended to unknown systems under both full-state [20] and partial-state observation [39, 37]. The study of nonstochastic control of known LTV dynamics was taken up in [16], with parallel work by [32] considering known LTV dynamics under stochastic noise.

Unknown LTV dynamics: Our work is the ﬁrst to consider online (low-regret) control of unknown LTV systems in any model. There is, however, a rich body of classical work on adaptive control of LTV systems [28, 42]. These guarantees focus more heavily on error sensitivity and stability; they only permit dynamical recovery up to error that scales linearly in system noise, and thus guarantee only (vacuous) linear-in-horizon regret. More recent work has studied identiﬁcation (but not online control) of an important LTV class called switching systems [31, 35].

Online Convex Optimization: We make extensive use of techniques from the ﬁeld of online convex optimization [9, 18]. Most relevant to our work is the literature on adapting to changing environments in online learning, which starts from the works of [22, 6]. The notion of adaptive regret was introduced in [21] and signiﬁcantly studied since as a metric for adaptive learning in OCO [2, 47]. [11] proposed to strengthen adaptive regret and the stronger metric has been shown to imply results over dynamic regret [48].

Recent nonlinear control literature: Recent research has also studied provably guarantees in various complementary (but incomparable) models: planning regret in nonlinear control [4], adaptive nonlinear control under linearly-parameterized uncertainty [5], online model-based control with access to non-convex planning oracles [23], and control with nonlinear observation models [27, 13].

2 Problem Setting

We study control of a linear time-varying (LTV) system Eq. (1.1) with state xt Rdx, control input ut Rdu chosen by the learner, and the external disturbance wt Rdx chosen by Nature. The system is characterized by time-varying matrices At Rdx dx, Bt Rdx du. For simplicity, the initial state is x1 = 0. At each time t, oblivious1 adversary picks the system matrices (At, Bt), disturbances wt and cost functions ct : Rdx Rdu R. The dynamics (At, Bt) are unknown to the learner: one observes only the next state xt+1 and current cost ct( , ) after playing control ut.

1An oblivious adversary chooses the matrices, costs and perturbations prior to the control trajectory.

Adaptive Regret. The goal of the learner is to minimize regret w.r.t. a policy class Π, i.e. the difference between the cumulative cost of the learner and the best policy π Π in hindsight. Formally, the regret of an algorithm A with control inputs u1:T and corresponding states x1:T , over an interval I = [r, s] [T], is deﬁned as

Regret I(A; Π) = X

t I ct(xt, ut) inf π Π

t I ct(xπ t , uπ t ) . (2.1)

Here uπ t , xπ t indicate the control input and the corresponding state when following policy π. For a randomized algorithm A, we consider the expected regret. In this work, we focus on designing control algorithms that minimize adaptive regret, i.e. guarantee a low regret relative to the best-in-hindsight policy π I Π on any interval I [T]. This performance metric of adaptive regret is more suitable for control over LTV dynamics given its agility to compete against different local optimal policies π I Π at different times [16].

Key objects. A central object in our study is the sequence of Nature s x s xnat 1:T that arises from playing zero control input ut = 0 at each t [T], i.e. xnat t+1 = Atxnat t +wt [39]. deﬁne the following operators for all t,

Φ[0] t = I, h [1, t), Φ[h] t =

k=t Ak, i [0, t), G[i] t = Φ[i] t Bt i,

where the matrix product Qr s with s r is taken in the indicated order k = s, . . . , r. the following identities give an alternative representation for the Nature s x s xnat t and state xt with control input ut in terms of the Markov operator at time t, Gt = [G[i] t ]i 0:

i=0 Φ[i] t wt i, xt+1 = xnat t+1 +

i=0 G[i] t ut i .

These operators and the alternative representation capture the dynamics by decoupling the disturbance and the control action effects.

Assumptions. We make the three basic assumptions: we require from (i) the disturbances to not blow up the system with no control input, (ii) the system to have decaying effect over time, and (iii) the costs to be well-behaved and admit efﬁcient optimization. Formally, these assumptions are: Assumption 1. For all t [T], assume xnat t Rnat. Assumption 2. Assume there exist RG 1 and ρ (0, 1) s.t. for any h 0 and for all t [T] X

i h G[i] t op RG ρh := ψ(h) .

Assumption 3. Assume the costs ct : Rdx Rdu R are general convex functions that satisfy the conditions 0 ct(x, u) L max{1, x 2 + u 2}, and ct(x, u) L max{1, x + u } for some constant L > 0, where denotes any subgradient [7].

The conditions in Assumption 3 allow for functions whose values and gradient grow as quickly as quadratics (e.g. the costs in LQR) , and the max{1, } term ensures the inclusion of standard bounded and Lipschitz functions as well. Assumptions 1 and 2 arise from the assumption our LTV system is open-loop stable; Appendix A.2 extends to the case where a nominal stabilizing controller is known, as in prior work [3, 39]. While these two assumptions may seem unnatural at ﬁrst, they can be derived from the basic conditions of disturbance norm bound and sequential stability.

Lemma 2.1. Suppose that there exist C1 1, ρ1 (0, 1) such that Φ[h] t op C1ρh 1 for any h 0 and all t [T], and suppose that maxt wt Rw. Then, Assumption 1 holds with Rnat = C1 1 ρ1 Rw, and Assumption 2 holds with ρ = ρ1 and RG = max{1, maxt Bt op C1 1 ρ1 }.

Note that Assumption 2 implies that Gt ℓ1,op = P

i 0 G[i] t op RG. It also suggests that for a sufﬁciently large h the effect of iterations before t h are negligible at round t. This prompts introducing a truncated Markov operator: denote Gh t = [G[i] t ]i<h to be the h-truncation of the true Markov operator Gt. It follows that their difference is Gh t Gt ℓ1,op = P

i h G[i] t op ψ(h) negligible in operator norm for a sufﬁciently large h. Deﬁne the bounded set of h-truncated Markov operators to be G(h, RG) = {G = [G[i]]0 i<h : G ℓ1,op RG} with Gh t G(h, RG) for all t.

2.1 Benchmarks and Policy Classes

The performance of an algorithm, measured by Eq. (2.1), directly depends on the policy class Π that is chosen as a benchmark to compete against. In this work, we consider the following three policy classes: DRC, DAC, and linear feedback. DRC parameterizes control inputs in terms of Nature s x s xnat t , DAC does so in terms of the disturbances wt and linear feedback in terms of the states xt. We express all three in terms of a length-m parameter M = [M [i]]i<m in a bounded ball M(m, RM):

M(m, RM) = {(M [0], . . . , M [m 1]) : Pm 1 i=0 M [i] op RM} .

Deﬁnition 2.1 (DRC policy class). A DRC control policy πM drc of length m is given by u M t = Pm 1 i=0 M [i]xnat t i where M = [M [i]]i<m is the parameter of the policy. Deﬁne the bounded DRC policy class as Πdrc(m, RM) = {πM drc : M M(m, RM)}.

Deﬁnition 2.2 (DAC policy class). A DAC control policy πM dac of length m is given by u M t = Pm 1 i=0 M [i]wt i where M = [M [i]]i<m is the parameter of the policy. Deﬁne the bounded DAC policy class as Πdac(m, RM) = {πM dac : M M(m, RM)}.

Deﬁnition 2.3 (Feedback policy class). A feedback control policy πM feed of length m is given by u M t = Pm 1 i=0 M [i]xt i where M = [M [i]]i<m is the parameter of the policy. Deﬁne the bounded feedback policy class as Πfeed(m, RM) = {πM feed : M M(m, RM)}. In the special case of memory m = 1, denote the state feedback policy class as Πstate = Πfeed(m = 1).

Convexity. Both the DRC and DAC policy classes are convex parametrizations: a policy π Πdrc Πdac outputs controls ut that are linear in the policy-independent sequences xnat 1:T and w1:T , and thus the mapping from parameter M to resulting states and inputs (resp. costs) is afﬁne (resp. convex). Hence, we refer to these as the convex classes. In contrast, feedback policies select inputs based on policy-dependent states, and are therefore non-convex [15].

We drop the arguments m, RM when they are clear from the context. The state feedback policies Πstate encompass the H2 and H optimal control laws under full observation. For LTI systems, DRC and DAC are equivalent [46, 44] and approximate all linear feedback policies to arbitrarily high precision [3, 39]. However, we show that these relationships between the classes break down for LTV systems: there exist scenarios where any one of the three classes strictly outperforms the other two. Theorem 2.1 (Informal). For each class Π in {Πdrc, Πdac, Πfeed} there exists a sequence of wellbehaved (At, Bt, wt, ct) such that a policy π Π suffers 0 cumulative cost, but each of the other two classes Π {Πdrc, Πdac, Πfeed} \ Π suffers Ω(T) cost on all their constituent policies π Π .

The formal theorem that includes the deﬁnition of a well-behaved instance sequence and the ﬁnal statement dependence on m, RM along with its proof can be found in Appendix F.1.

Notation. The norm refers to Euclidean norm unless otherwise stated, [n] is used as a shorthand for [1, n], T is used as a subscript shorthand for [T]. The asymptotic notation O( ), Ω( ) suppress all terms independent of T, e O( ) additionally suppresses terms logarithmic in T. We deﬁne e O ( ) to suppress absolute constants, polynomials in Rnat, RG, RM and logarithms in T.

3 Online Control over Convex Policies

This section considers online control of unknown LTV systems so as to compete with the convex DRC and DAC policy classes. The fundamental quantity which appears throughout our results is the system variability, which measures the variation of the time-varying Markov operators Gt over intervals I. Deﬁnition 3.1. Deﬁne the system variability of an LTV dynamical system with Markov operators G = G1:T over a contiguous interval I [T] to be

Var I(G) = min G 1 |I|

t I G Gt 2 ℓ2,F = 1

t I GI Gt 2 ℓ2,F ,

where ℓ2,F indicates the ℓ2 norm of the fully vectorized operator and GI = |I| 1 P

t I Gt is the empirical average of the operators that correspond to I. Recall that Var T (G) corresponds to I = [T].

Our results in this section for both upper and lower bounds focus on expected regret: high probability results are possible as well with more technical effort using standard techniques.

3.1 A Linear Regret Lower Bound

Our ﬁrst contribution is a negative one: that the regret against the class of either DAC or DRC policies cannot scale sublinearly in the time horizon. Informally, our result shows that the regret against these classes scales as Tσ, where σ2 is the system variability.

More precisely, for any σ2 (0, 1/8], we construct a distribution Dσ over sequences (At, Bt, ct, wt), formally speciﬁed in Appendix F.2. Here, we list the essential properties of Dσ: (i) At 0, (ii) ct c is a ﬁxed cost satisfying Assumption 3 with L 4, (iii) the matrices (Bt) are i.i.d., with Bt op 2 almost surely, and E[ Bt E[Bt] 2 F] = σ2, and (iv) wt 4 for all t. These conditions imply that Assumptions 1 and 2 hold for RG = 2, ρ = 0, Rnat = 4. The condition At 0 implies that xnat t = wt for all t, so the classes DRC and DAC are equivalent and the lower bound holds over both. Moreover, by Jensen s inequality, this construction ensures that

E[Var I(G)] = |I| 1E[min G P

t I G Gt 2 ℓ2,F ]

= |I| 1E[min B P

t I B Bt 2 F] E[ Bt E[Bt] 2 F] = σ2.

In particular, E[Var T (G)] σ2. For the described construction, we show the following lower bound:

Theorem 3.1. Let C be a universal, positive constant. For any σ (0, 1/8] and any online control algorithm A, there exists a DRC policy π Πdrc(1, 1) s.t. expected regret incurred by A under the distribution Dσ and cost c(x, u) is at least

EDσ,A[Regret T (A; {π })] C Tσ C T p

E[Var T (G)],

A full construction and proof of Theorem 3.1 is given in Appendix F.2. In particular, for σ = 1/8, we ﬁnd that no algorithm can attain less than Ω(T) expected regret; a stark distinction from either unknown LTI [39, 20] or known LTV [16] systems.

3.2 Estimation of Time-Varying Vector Sequences

To devise an algorithmic upper bound that complements the result in Theorem 3.1, we ﬁrst consider the setting of online prediction under a partial information model. This setting captures the system identiﬁcation phase of LTV system control and is used to derive the ﬁnal control guarantees. Formally, consider the following repeated game between a learner and an oblivious adversary: at each round t [T], the adversary picks a target vector z t K from a convex decision set K contained in a 0-centered ball of radius Rz; simultaneously, the learner selects an estimate ˆzt K and suffers quadratic loss ℓt(ˆzt) = ˆzt z t 2. The only feedback the learner has access to is via the following noisy and costly oracle. Oracle 1 (Noisy Costly Oracle). At each time t [T], the learner selects a decision bt {0, 1} indicating whether a query is sent to the oracle. If bt = 1, the learner receives an unbiased estimate zt as response such that zt Rz and E[ zt | Ft, bt = 1] = z t . The ﬁltration Ft is the sigma algebra generated by z1:t 1, b1:t 1 and the choices of the oblivious adversary z 1:T . A completed query results in a unit cost λ > 0 for the learner.

The performance metric of an online prediction algorithm Apred is expected quadratic loss regret along with the extra cumulative oracle query cost. It is deﬁned over each interval I = [r, s] [T] as

Regret I(Apred; λ) = EF1:T

t I ℓt(ˆzt) min z K

t I ℓt(z) + λ X

To attain adaptive regret, i.e. bound Eq. (3.1) for each interval I, we propose Algorithm 1 constructed as follows. First, suppose we wanted non-adaptive (i.e. just I = [T]) guarantees. In this special case, we propose to sample bt Bernoulli(p) for an appropriate parameter p (0, 1), and perform a gradient descent update on the importance-weighted square loss ℓt(z) = 1 2p I{bt = 1} z zt 2. To

Algorithm 1 Adaptive Estimation Algorithm (ADA-PRED)

1: Input: parameter p, decision set K

2: Initialize: ˆz(1) 1 K, working dictionary S1 = {(1 : ˆz(1) 1 )}, q(1) 1 = 1, parameter α = p (Rz+ Rz)2 3: for t = 1, . . . , T do 4: Play iterate ˆzt = P

(i,ˆz(i) t ) St q(i) t ˆz(i) t 5: Draw/Receive bt Bernoulli(p) 6: if bt = 1 then 7: Request estimate zt from Oracle 1 8: Let ℓt(z) = 1 2p||z zt||2 and t = 1

p(zt z) 9: else 10: Let zt and ℓt(z) = 0 and t = 0 11: Update predictions ˆz(i) t+1 Proj K(ˆz(i) t η(i) t t) for all (i, ˆz(i) t ) St 12: Form new dictionary St+1 = (i, ˆz(i) t+1)i keys(St)

13: Construct proxy new weights q(i) t+1 = t t+1 q(i) t e α ℓt(ˆz(i) t )

j keys(St) q(j) t e α ℓt(ˆz(j) t ) for all i keys(St)

14: Add new instance St+1 St+1 (t+1, ˆz(t+1) t+1 ) for arbitrary ˆz(t+1) t+1 K with q(t+1) t+1 = 1 t+1 15: Prune St+1 to form St+1 (see Appendix C.1)

16: Normalize q(i) t+1 = q(i) t+1 P

j keys(St+1) q(j) t+1

extend this method to enjoy adaptive regret guarantees, we adopt the approach of [21]: the core idea in this approach is to initiate an instance of the base method at each round t and use a weighted average of the instance predictions as the ﬁnal prediction (Line 4). The instance weights are multiplicatively updated according to their performance (Line 13). To ensure computational efﬁciency, the algorithm only updates instances from a working dictionary St (Line 11). These dictionaries are pruned each round (Line 15) such that |St| = O(log T) (see Appendix C.1 for details).

Theorem 3.2. Given access to queries from Oracle 1 and with stepsizes η(i) t = 1 t i+1, Algorithm 1 enjoys the following adaptive regret guarantee: for all I = [r, s] [T],

Regret I(ADA-PRED; λ) 2(Rz + Rz)2(1 + log s log |I|)

p + λp|I| . (3.2)

When I = [T], the optimal choice of parameter p = log T/

λT yields regret scaling roughly as p

λT log2 T. Unfortunatelly, this gives regret scaling as

T for all interval sizes: to attain p

|I| regret on interval I, the optimal choice of p would yield T/ p

|I| regret on [T], which is considerably worse for small |I|. One may ask if there exists a strongly adaptive algorithm which adapts p as well, so as to enjoy regret polynomial in |I| for all intervals I simultaneously [11]. The following result shows this is not possible:

Theorem 3.3 (Informal). For all γ > 0 and λ > 0, there exists no online algorithm A with feedback access to Oracle 1 that enjoys strongly adaptive regret of Regret I(A; λ) = O(|I|1 γ).

Hence, in a sense, Algorithm 1 is as adaptive as one could hope for: it ensures a regret bound for all intervals I, but not a strongly adaptive one. The lower bound construction, formal statement, and proof of Theorem 3.3 are given in Appendix C.2.

3.3 Adaptive Regret for Control of Unknown Time-Varying Dynamics

We now apply our adaptive estimation algorithm (Algorithm 1) to the online control problem. Our proposed algorithm, Algorithm 2, takes in two sub-routines: a prediction algorithm Apred which enjoys low prediction regret in the sense of the previous section, and a control algorithm Actrl which has low regret for control of known systems. Our master algorithm trades off between the two in epochs τ = 1, 2, . . . of length h: each epoch corresponds to one step of Apred indexed by [τ].

At each epoch, the algorithm receives Markov operator estimates from Apred (Line 3) and makes a binary decision b[τ] Bernoulli(p). If b[τ] = 1, then it explores using i.i.d. Rademacher inputs (Line 7), and sends the resulting estimator to Apred (Line 14). This corresponds to one query from Oracle 1. Otherwise, it selects inputs in line with Actrl (Line 9), and does not give a query to Apred (Line 16). Regardless of exploration decision, the algorithm feeds costs, current estimates of the Markov operator and Nature s x s based on the Markov operator estimates to Actrl (Lines 10-12), which it uses to select inputs and update its parameter.

The prediction algorithm Apred is taken to be ADA-PRED with the decision set K = G(h, RG): the projection operation onto it and the ball BRnat is done by clipping when the norm of the argument exceeds the indicated bound. The control algorithm Actrl is taken to be DRC-OGD [39] for known systems.

Theorem 3.4. For h = log T log ρ 1 , p = T 1/3 and m

T, on any contiguous interval I [T],

Algorithm 2 enjoys the following adaptive regret guarantee:

E [Regret I(ADA-CTRL); Πdrc(m, RM)] e O Lm |I| p

E[Var I(G)] + du T 2/3 (3.3)

Proof Sketch. The analysis proceeds by reducing the regret incurred to that over a known system, accounting for: 1) the additional exploration penalty (O(p|I|)), 2) the system misspeciﬁcation induced error ( P

t I ˆGt Gh t ℓ1,op), and 3) truncation errors ( ψ(h)|I|). Via straightforward computations, the system misspeciﬁcation error can be expressed in terms of the result in Theorem 3.2, ultimately leading to an error contribution |I| p

E[Var I(G)]+p 1/2|I|1/2. The analysis is ﬁnalized by noting that the chosen p ideally balances p|I| and p 1/2|I|1/2, and that the chosen h ensures that the truncation error is negligible. The full proof can be found in Appendix D.

The adaptive regret bound in Eq. (3.3) has two notable terms. Note that the ﬁrst term |I| p

E[Var I(G)] for I = [T] matches the regret lower bound in Theorem 3.1. Furthermore, our algorithm is adaptive in this term for all intervals I. On the other hand, for unknown LTI systems with Var I(G) = 0, the algorithm recovers the state-of-the-art bound of T 2/3 [20]. However, the T 2/3 term is not adaptive to the intervals I consistent with the lower bound against strongly adaptive algorithms in Theorem 3.3.

Algorithm 2 DRC-OGD with Adaptive Exploration (ADA-CTRL)

1: Input: p, h, Apred ADA-PRED(p, ˆG0, G(h, RG)), Actrl DRC-OGD(m, RM) 2: for τ = 1, . . . , T/h do let tτ = (τ 1)h + 1 3: Set ˆGtτ , ˆGtτ +1, . . . , ˆGtτ +h 1 equal to τ-th iterate ˆG[τ] from Apred 4: Draw b[τ] Bernoulli(p) 5: for t = tτ, . . . , tτ + h 1 do 6: if b[τ] = 1 then 7: Play control ut { 1}du

8: else 9: Play control ut according to the t-th input chosen by Actrl 10: Suffer cost ct(xt, ut) , observe new state xt+1

11: Extract ˆxnat t+1 = Proj BRnat

xt+1 Ph 1 i=0 ˆG[i] t ut i

12: Feed cost, Markov operator and Nature s x estimates (ct, ˆGt, ˆxnat t+1) to Actrl.

13: if b[τ] = 1 then

14: Feed (b[τ], G[τ]) to Apred, where G[i] [τ] = xtτ +hu tτ +h i, i = 0, 1, . . . , h 1. 15: else 16: Feed (b[τ], G[τ]) to Apred, where G[τ] .

4 Online Control over State Feedback

Given the impossibility of sublinear regret against DRC/DAC without further restrictions on system variability, this section studies whether sublinear regret is possible against the class of linear feedback

policies. For simplicity, we focus on the state feedback policies ut = Kxt, that is, linear feedback policies with memory m = 1 (Deﬁnition 2.3). We note that state feedback policies were the class which motivated the relaxation to DAC policies in the ﬁrst study of nonstochastic control [3].

We present two results, rather qualitative in nature. First, we show that obtaining sublinear regret is, in the most literal sense, possible. The following result considers regret relative to a class K of static feedback controllers which satisfy the restrictive assumption that each K K stabilizes the time varying dynamics (At, Bt); see Appendix E for the formal algorithm, assumptions, and guarantees. We measure the regret against this class K:

Regret T (K) :=

t=1 ct(xt, ut) inf K K

t=1 ct(x K t , u K t ),

where (x K t , u K t ) are the iterates arising under the control law ut = Kxt.

Theorem 4.1 (Sublinear regret against state-feedback). Under a suitable stabilization assumption, there exists a computationally inefﬁcient control algorithm which attains sublinear expected regret:

E[Regret T (K)] eΩ(dxdu/2) T 1 1 2(dxdu+3) .

Above, Ω( ) suppresses a universal constant and exponent base, both of which are made explicit in a formal theorem statement in Appendix E. The bound follows by running the EXP3 bandit algorithm on a discretization of the set K (high probability regret can be obtained by instead using EXP3.P [8]). The guarantee in Theorem 4.1 is neither practical nor sharp; its sole purpose is to conﬁrm the possibility of sublinear regret. Due to the bandit reduction and exponential size of the cover of K Rdu dx, the algorithm is computationally inefﬁcient and suffers a nonparametric rate of regret [33]: ϵ-regret requires T = ϵ Ω(dimension).

One may wonder if one can do much better than this naive bandit reduction. For example, is there structure that can be leveraged? For LTV systems, we show that there is strong evidence to suggest that, at least from a computational standpoint, attaining polynomial regret (e.g. T 1 α for α > 0 independent of dimension) is computationally prohibitive.

Theorem 4.2. There exists a reduction from MAX-3SAT on m-clauses and n-literals to the problem of ﬁnding a state-feedback controler K which is within a small constant factor of optimal for the cost PT t=1 ct(x K t , x K t ) on a sequence of sequentially stable LTV systems and convex costs (At, Bt, ct) with no disturbance (wt 0), with state dimension n+1, input dimension 2, and horizon T = Θ(mn). Therefore, unless P = NP, the latter cannot be solved in time polynomial in n [17].

A more precise statement, construction, and proof are given in Appendix F.4. Theorem 4.2 demonstrates that solving the ofﬂine optimization problem over state feedback controllers K to within constant precision is NP-Hard. In particular, this means that any sublinear regret algorithm which is proper and convergent, in the sense that ut = Ktxt for some sequence Kt converges to a limit as T , must be computationally inefﬁcient. This is true even if the costs and dynamics are known in advance. Our result suggests it is computationally hard to obtain sublinear regret, but it does not rigorously imply it. For example, there may be more clever convex relaxations (other than DRC and DAC, which provably cannot work) that yield efﬁcient and sublinear regret. Secondly, this lower bound does not rule out the possibility of an computationally inefﬁcient algorithm which nevertheless attains polynomial regret.

5 Discussion and Future Work

This paper provided guarantees for and studied the limitations of sublinear additive regret in online control of an unknown, linear time-varying (LTV) dynamical system.

Our setting was motivated by the fact that the ﬁrst-order Taylor approximation (Jacobian linearization) of smooth, nonlinear systems about any smooth trajectory is LTV. One would therefore hope that low-regret guarantees against LTV systems may imply convergence to ﬁrst-order stationary points of general nonlinear control objectives [34], which in turn may enjoy stability properties [45]. Making this connection rigorous poses several challenges. Among them, one would need to extend our low-regret guarantees against oblivious adversaries to hold against adaptive adversaries, the latter

modeling how nonlinear system dynamics evolve in response to the learner s control inputs. This may require parting from our current analysis, which leverages the independence between exploratory inputs and changes in system dynamics.

Because we show that linear-in-T regret is unavoidable for changing systems with large system variability, at least for the main convex policy parametrizations, it would be interesting to study our online setting under other measures of performance. In particular, the competive ratio, or the ratio of total algorithm cost to optimal cost in hindsight (as opposed to the difference between the two measured by regret) may yield a complementary set of tradeoffs, or lead to new and exciting principles for adaptive controller design. Does system variability play the same deciding roles in competive analysis as it does in regret? And, in either competitive or regret analyses, what is the correct measure of system variability (e.g. variability in which norm/geometry, or of which system parameters) which best captures sensitivity of online cost to system changes?

Acknowledgments

Elad Hazan and Edgar Minasyan have been supported in part by NSF grant #1704860. This work was done in part when Paula Gradu was at Google AI Princeton and Princeton University. Max Simchowitz is generously supported by an Open Philanthropy AI fellowship.

[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1 26, 2011.

[2] Dmitry Adamskiy, Wouter M Koolen, Alexey Chernov, and Vladimir Vovk. A closer look at adaptive regret. The Journal of Machine Learning Research, 17(1):706 726, 2016.

[3] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111 119, 2019.

[4] Naman Agarwal, Elad Hazan, Anirudha Majumdar, and Karan Singh. A regret minimization approach to iterative learning control. ar Xiv preprint ar Xiv:2102.13478, 2021.

[5] Nicholas M. Bofﬁ, Stephen Tu, and Jean-Jacques E. Slotine. Regret bounds for adaptive nonlinear control, 2020.

[6] Olivier Bousquet and Manfred K Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363 396, 2002.

[7] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

[8] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. ar Xiv preprint ar Xiv:1204.5721, 2012.

[9] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

[10] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efﬁciently with only

T regret. In International Conference on Machine Learning, pages 1300 1309, 2019.

[11] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1405 1411, Lille, France, 07 09 Jul 2015. PMLR.

[12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188 4197, 2018.

[13] Sarah Dean and Benjamin Recht. Certainty equivalent perception-based control. ar Xiv preprint ar Xiv:2008.12332, 2020.

[14] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Experts in a markov decision process. In Advances in neural information processing systems, pages 401 408, 2005.

[15] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467 1476, 2018.

[16] Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics. ar Xiv preprint ar Xiv:2007.04393, 2020.

[17] Johan Håstad. Some optimal inapproximability results. Journal of the ACM (JACM), 48(4):798 859, 2001.

[18] Elad Hazan. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016.

[19] Elad Hazan. Introduction to online convex optimization, 2019.

[20] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408 421. PMLR, 2020.

[21] Elad Hazan and Comandur Seshadhri. Efﬁcient learning algorithms for changing environments. In Proceedings of the 26th annual international conference on machine learning, pages 393 400. ACM, 2009.

[22] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151 178, 1998.

[23] Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. ar Xiv preprint ar Xiv:2006.12466, 2020.

[24] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems, 2020.

[25] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control, 2020.

[26] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efﬁcient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154 10164, 2019.

[27] Zakaria Mhammedi, Dylan J Foster, Max Simchowitz, Dipendra Misra, Wen Sun, Akshay Krishnamurthy, Alexander Rakhlin, and John Langford. Learning the linear quadratic regulator from nonlinear observations. ar Xiv preprint ar Xiv:2010.03799, 2020.

[28] Richard H Middleton and Graham C Goodwin. Adaptive control of time-varying linear systems. IEEE Transactions on Automatic Control, 33(2):150 155, 1988.

[29] Kevin L Moore. Iterative learning control for deterministic systems. Springer Science & Business Media, 2012.

[30] Samet Oymak and Necmiye Ozay. Non-asymptotic identiﬁcation of lti systems from a single trajectory. In 2019 American control conference (ACC), pages 5655 5661. IEEE, 2019.

[31] Necmiye Ozay, Constantino Lagoa, and Mario Sznaier. Set membership identiﬁcation of switched linear systems with known number of subsystems. Automatica, 51:180 191, 2015.

[32] Guannan Qu, Yuanyuan Shi, Sahin Lale, Anima Anandkumar, and Adam Wierman. Stable online control of ltv systems stable online control of linear time-varying systems. ar Xiv preprint ar Xiv:2104.14134, 2021.

[33] Alexander Rakhlin and Karthik Sridharan. Online non-parametric regression. In Conference on Learning Theory, pages 1232 1264. PMLR, 2014.

[34] Vincent Roulet, Siddhartha Srinivasa, Dmitriy Drusvyatskiy, and Zaid Harchaoui. Iterative linearized control: stable algorithms and complexity guarantees. In International Conference on Machine Learning, pages 5518 5527. PMLR, 2019.

[35] Tuhin Sarkar, Alexander Rakhlin, and Munther Dahleh. Nonparametric system identiﬁcation of stochastic switched linear systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3623 3628. IEEE, 2019.

[36] Tuhin Sarkar, Alexander Rakhlin, and Munther A Dahleh. Finite-time system identiﬁcation for partially observed lti systems of unknown order. ar Xiv preprint ar Xiv:1902.01848, 2019.

[37] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic, 2020.

[38] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937 8948. PMLR, 2020.

[39] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control, 2020.

[40] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906 4913, 2012.

[41] Emanuel Todorov and Weiwei Li. A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005, American Control Conference, 2005., pages 300 306. IEEE, 2005.

[42] Kostas S Tsakalis and Petros A Ioannou. Linear time-varying systems: control and adaptation. Prentice-Hall, Inc., 1993.

[43] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

[44] Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. A system-level approach to controller synthesis. IEEE Transactions on Automatic Control, 64(10):4079 4093, 2019.

[45] Tyler Westenbroek, Max Simchowitz, Michael I Jordan, and S Shankar Sastry. On the stability of nonlinear receding horizon control: a geometric perspective. ar Xiv preprint ar Xiv:2103.15010, 2021.

[46] Dante Youla, Hamid Jabr, and Jr Bongiorno. Modern wiener-hopf design of optimal controllers part ii: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319 338, 1976.

[47] Lijun Zhang, Tie-Yan Liu, and Zhi-Hua Zhou. Adaptive regret of convex and smooth functions. ar Xiv preprint ar Xiv:1904.11681, 2019.

[48] Lijun Zhang, Tianbao Yang, rong jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5882 5891. PMLR, 10 15 Jul 2018.

[49] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.