# Task-Optimal Exploration in Linear Dynamical Systems

Andrew Wagenmaker¹, Max Simchowitz², Kevin Jamieson¹

Abstract. Exploration in unknown environments is a fundamental problem in reinforcement learning and control. In this work, we study task-guided exploration and determine what precisely an agent must learn about their environment in order to complete a particular task. Formally, we study a broad class of decision-making problems in the setting of linear dynamical systems, a class that includes the linear quadratic regulator problem. We provide instance- and task-dependent lower bounds which explicitly quantify the difficulty of completing a task of interest. Motivated by our lower bound, we propose a computationally efficient, experiment-design-based exploration algorithm. We show that it optimally explores the environment, collecting precisely the information needed to complete the task, and provide finite-time bounds guaranteeing that it achieves the instance- and task-optimal sample complexity, up to constant factors. Through several examples of the linear quadratic regulator problem, we show that performing task-guided exploration provably improves on exploration schemes which do not take into account the task of interest. Along the way, we establish that certainty equivalence decision making is instance- and task-optimal, and obtain the first algorithm for the linear quadratic regulator problem which is instance-optimal. We conclude with several experiments illustrating the effectiveness of our approach in practice.

## 1. Introduction

Modern reinforcement learning aims to understand how agents should best explore their environments in order to successfully complete assigned tasks. In the face of uncertainty about the environment, a naive strategy might be to explore the environment until it is uniformly understood (system identification), and then devise a plan to complete the task under this precise understanding of the environment (control). However, it is widely understood that such a two-phased approach of system identification followed by control can be wasteful since, depending on the task, some aspects of the environment ought to be estimated more accurately than others. For instance, if a task requires a precise sequence of steps to be taken in order to be completed, one need not understand all possible outcomes leading to failure after a missed early critical step. Since it may be very costly for an agent to estimate all facets of a complex or high-dimensional environment to high precision, it is far preferable to direct agents' exploration only to those aspects most relevant to their tasks. Motivated by this challenge, this paper aims to answer:

Q1. What exactly must an agent learn about its environment to carry out a particular task?

Q2. Given knowledge of the task, can the agent direct their exploration to speed up the process of learning this task-specific critical information?

Q3. Having explored its environment, how can the agent best use the information gained to complete the task of interest?

¹Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington, USA. ²Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, USA. Correspondence to: Andrew Wagenmaker. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
Our work provides answers to the above questions for a family of decision-making problems in environments parameterized by a linear dynamical system, including synthesis of the linear quadratic regulator. Specifically, for Q1 we show that accomplishing a variety of tasks amounts to maximizing a task-specific linear functional of the Fisher information matrix, a quantity of fundamental importance in optimal experimental design. Indeed, our results naturally reduce to classical linear optimal experimental design criteria (for example, A-optimality) in the absence of dynamics. Answering Q2 in the affirmative amounts to being able to learn just enough about the environment to drive the system to a sequence of states that maximize this task-specific functional as fast as possible. We accomplish this via a sequence of experimental design problems over control inputs, given a successively improving estimate of the environment. Finally, to answer Q3 we show that the certainty equivalence decision rule, which chooses the policy that would optimally complete the task if the estimate of the environment were correct, is the optimal decision rule in an instance-specific sense.

### 1.1. Main Contributions

Our primary contributions are as follows:

1. We develop task- and instance-specific lower bounds which precisely quantify how parameter estimation error translates to suboptimal task performance.
2. We cast the problem of optimal exploration as a surrogate experiment design problem we call task-optimal experiment design. For linear dynamical systems, the task-optimal design problem can be solved efficiently by projected gradient descent. We demonstrate that the solution to the design problem yields the information-theoretically optimal exploration strategy, in a strong, instance-dependent sense.
3. The task-optimal design depends on unknown problem parameters. We therefore propose a meta-algorithm, TOPLE, which sequentially solves empirical approximations to the design objective, and demonstrate that this approach matches the performance of the optimal design given knowledge of the true system parameters. As a consequence, we obtain the first instance-optimal algorithm for the LQR problem.
4. We show through numerous mathematical examples that task-specific experiment design can perform arbitrarily better on a task of interest than uniform or task-agnostic exploration. We also rigorously prove a strong suboptimality result for strategies with low regret for online LQR, such as optimism-under-uncertainty.
5. We show that, for any exploration strategy which is sufficiently non-degenerate, in a very general class of decision-making problems which includes certain classes of nonlinear dynamical systems, the certainty equivalence decision rule is instance-optimal.
6. Finally, we show that our approach yields practical gains over naive exploration schemes through several numerical examples.

All our results are non-asymptotic and polynomial in terms of the natural problem parameters.

### 1.2. Task-Specific Pure Exploration

We consider linear dynamical systems of the form

$$x_{t+1} = A_\star x_t + B_\star u_t + w_t, \qquad x_0 = 0, \qquad (1.1)$$

where $x_t, w_t \in \mathbb{R}^{d_x}$, $u_t \in \mathbb{R}^{d_u}$, $A_\star$ and $B_\star$ have appropriate dimensions, and where for simplicity we assume that $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$.¹ We let $\theta_\star = (A_\star, B_\star)$ capture the true dynamical system; importantly, $\theta_\star$ is unknown to the learner. We also define a policy as a mapping from past states and actions to future actions, $\pi : (x_{1:t}, u_{1:t-1}) \to u_t$.
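To make the interaction model of Eq. (1.1) concrete, the following sketch simulates a trajectory under an arbitrary causal policy. It is a minimal illustration, not part of the paper: the helper name `rollout`, the example system, and the isotropic-noise policy are all illustrative choices.

```python
import numpy as np

def rollout(A, B, policy, T, sigma_w=1.0, rng=None):
    """Simulate x_{t+1} = A x_t + B u_t + w_t from x_0 = 0, where the policy maps
    the history (x_{1:t}, u_{1:t-1}) to the next input u_t."""
    rng = np.random.default_rng() if rng is None else rng
    dx, du = B.shape
    xs, us = [np.zeros(dx)], []
    for t in range(T):
        u = policy(xs, us)                     # u_t = pi(x_{1:t}, u_{1:t-1})
        w = sigma_w * rng.standard_normal(dx)  # w_t ~ N(0, sigma_w^2 I)
        xs.append(A @ xs[-1] + B @ u + w)
        us.append(u)
    return np.array(xs), np.array(us)          # shapes (T+1, dx) and (T, du)

# Example: isotropic Gaussian exploration with expected power E||u_t||^2 = gamma^2.
A = np.diag([0.9, 0.5]); B = np.eye(2); gamma2 = 1.0
noise_policy = lambda xs, us: np.sqrt(gamma2 / B.shape[1]) * np.random.randn(B.shape[1])
X, U = rollout(A, B, noise_policy, T=1000)
```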
We let $\mathbb{E}_{\theta,\pi}[\cdot]$ denote the expectation over trajectories induced on instance $\theta$ by playing policy $\pi$. While we show in Appendix B that several of our results hold in a more general observation model which encompasses certain nonlinear systems, throughout Sections 2 to 5 we assume we are in the linear dynamical system setting.

We are interested in a general decision-making problem: given some smooth loss $\mathcal{J}_{\theta_\star}(a)$ parameterized by $\theta_\star$, choose $a \in \mathbb{R}^{d_a}$ such that $\mathcal{J}_{\theta_\star}(a)$ is minimized. For every $\theta_\star$, we assume there exists some optimal decision $a_{\mathrm{opt}}(\theta_\star)$ for which $\mathcal{J}_{\theta_\star}(a)$ is minimized. We require that $\mathcal{J}_\theta(a)$ and $a_{\mathrm{opt}}(\theta)$ satisfy the following assumption.

Assumption 1 (Smooth Decision-Making, Informal). The loss $\mathcal{J}_\theta(a)$ and optimal decision $a_{\mathrm{opt}}(\theta)$ are each three times differentiable within a ball around $a_{\mathrm{opt}}(\theta_\star)$ and $\theta_\star$, respectively, and their gradients can be absolutely bounded over this range. Furthermore, $\nabla_a^2 \mathcal{J}_\theta(a)$ varies smoothly in $\theta$.

Our interaction protocol is as follows.

Task-Specific Pure Exploration Problem. The learner's behavior is specified by an exploration policy $\pi_{\mathrm{exp}} : (x_{1:t}, u_{1:t-1}) \to u_t$ and a decision rule $\mathtt{dec}$, executed in the dynamics Eq. (1.1).

1. For steps $t = 1, \ldots, T$, the learner executes $\pi_{\mathrm{exp}}$ and collects a trajectory $\tau = (x_{1:T+1}, u_{1:T})$.
2. For a budget $\gamma^2 \ge 0$, the inputs $u_{1:T}$ must satisfy the constraint $\mathbb{E}_{\pi_{\mathrm{exp}}}\big[\sum_{t=1}^T \|u_t\|_2^2\big] \le T\gamma^2$.²
3. Finally, the learner proposes a decision $\hat{a} = \mathtt{dec}(\tau)$ as a function of $\tau$.

The learner's performance is evaluated on the excess risk

$$\mathcal{R}(\hat{a}; \theta_\star) := \mathcal{J}_{\theta_\star}(\hat{a}) - \inf_a \mathcal{J}_{\theta_\star}(a),$$

and their goal is to choose an exploration policy $\pi_{\mathrm{exp}}$ which induces sufficient exploration to propose a decision $\hat{a}$ with as little excess risk as possible. For simplicity, we assume that the learner only collects a single trajectory. In contrast to online control, the performance of the exploration policy is only evaluated on its final decision $\hat{a}$, not on the trajectory generated during the learning phase. To make this setting concrete, we consider several specific examples.

¹See Appendix C.2 for a discussion on accommodating non-identity, possibly unknown noise covariance.
²The upper bounds in this paper can be easily modified to ensure that the budget constraint $\sum_{t=1}^T \|u_t\|_2^2 \le T\gamma^2$ holds with high probability.

### 1.3. Examples and Applications

The task-specific pure exploration problem captures many natural settings. Under the assumed linear dynamics model of (1.1) with $\theta_\star = (A_\star, B_\star)$, the unknown quantity of interest $a_{\mathrm{opt}}(\theta_\star)$ can represent any function of the environment defined by $\theta_\star$. In the simplest case of system identification, we may have $a_{\mathrm{opt}}(\theta_\star) = \theta_\star$, $\hat{a}$ the least squares estimate of $\theta_\star$ given the trajectory $\tau = (x_{1:T+1}, u_{1:T})$, and $\mathcal{J}_{\theta_\star}(\hat{a})$ a measure of loss, for example the Frobenius norm: $\mathcal{J}_{\theta_\star}(\hat{a}) = \|\hat{a} - a_{\mathrm{opt}}(\theta_\star)\|_F^2$. Even in this simple case, the learner can reduce $\mathcal{J}_{\theta_\star}(\hat{a})$ far faster with a deliberate exploration policy than with a naive policy such as playing isotropic noise. The next several examples illustrate that the task-specific pure exploration framework generalizes far beyond this simple system identification task.

Example 1.1 (Pure Exploration LQR). In the LQR problem, the agent's objective is to design a policy that minimizes the infinite-horizon average cost, with stage losses $\ell(x, u) := x^\top R_x x + u^\top R_u u$. The resultant cost function is

$$\mathcal{J}_{\mathrm{LQR},\theta_\star}[\pi] := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\theta_\star,\pi}\Big[\sum_{t=1}^T x_t^\top R_x x_t + u_t^\top R_u u_t\Big].$$

It is well known that the optimal policies are of the form $u_t = K x_t$ where $K \in \mathbb{R}^{d_u \times d_x}$; we denote these policies $\pi_K$, and let $\mathcal{J}_{\mathrm{LQR},\theta}(K) = \mathcal{J}_{\mathrm{LQR},\theta}[\pi_K]$. Here our decision $a$ is the controller $K$ and our loss $\mathcal{J}_\theta$ is $\mathcal{J}_{\mathrm{LQR},\theta}$. Under standard conditions, $\mathcal{J}_\theta(\cdot)$ admits a unique minimizer, which we denote $K_{\mathrm{opt}}(\theta)$. Furthermore, both $\mathcal{J}_\theta$ and $K_{\mathrm{opt}}$ are smooth functions of $K$ and $\theta$, respectively, and can be shown to satisfy Assumption 1.
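For the LQR task in Example 1.1, the decision rule studied later (certainty equivalence) amounts to solving a Riccati equation for an estimated system. A minimal sketch follows, assuming SciPy is available; the estimate $(\hat A, \hat B)$ and the cost matrices are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_certainty_equivalence(A_hat, B_hat, Rx, Ru):
    """Treat the estimate (A_hat, B_hat) as the true system and return the gain K
    such that u_t = K x_t minimizes the infinite-horizon cost x'Rx x + u'Ru u."""
    P = solve_discrete_are(A_hat, B_hat, Rx, Ru)   # Riccati solution for the estimate
    K = -np.linalg.solve(Ru + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    return K, P

# Usage with a hypothetical least-squares estimate of the dynamics.
A_hat = np.array([[0.9, 0.1], [0.0, 0.8]])
B_hat = np.eye(2)
K, P = lqr_certainty_equivalence(A_hat, B_hat, Rx=np.eye(2), Ru=0.1 * np.eye(2))
print(np.max(np.abs(np.linalg.eigvals(A_hat + B_hat @ K))))  # closed-loop spectral radius
```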
Example 1.2 (Inverse Reinforcement Learning). In this setting, we assume there is some agent playing according to the control law $u_t^{\mathrm{agent}} = K_{\mathrm{agent}} x_t$ in the system $\theta_\star = (A_\star, B_\star)$, inducing the closed-loop dynamics $A_{\mathrm{cl},\star} = A_\star + B_\star K_{\mathrm{agent}}$. Furthermore, we assume that $K_{\mathrm{agent}} = K_{\mathrm{opt}}(\theta_\star; R_{u,\star})$ for some parameter $R_{u,\star} \in \mathbb{R}^{d_a}$ and a known map $K_{\mathrm{opt}}(\cdot\,;\cdot)$. Here $A_\star$, $B_\star$, and $R_{u,\star}$ are unknown, but we are told the value of $K_{\mathrm{agent}}$ (e.g., estimated through observation of the agent's actions). We assume we have access to the closed-loop system $x_{t+1} = A_{\mathrm{cl},\star} x_t + B_\star u_t + w_t$, and our goal is to infer the parameter $R_{u,\star}$ the player is utilizing. This can be thought of as an inverse reinforcement learning problem, where we assume the agent is playing in order to minimize some cost parameterized by $R_{u,\star}$, and we want to determine what that cost is. In this setting our decision $a$ is the cost vector $R_u$, and we define our loss as

$$\mathcal{J}_{\mathrm{IRL},\theta}(R_u) = \|R_u - R_{u,\mathrm{opt}}(\theta)\|_2^2,$$

and the certainty equivalence estimate as

$$R_{u,\mathrm{opt}}(\theta) = \arg\min_{R_u} \|K_{\mathrm{agent}} - K_{\mathrm{opt}}(\theta; R_u)\|_2^2.$$

Under amenable parameterizations of $K_{\mathrm{opt}}(\theta_\star; R_{u,\star})$, this will satisfy Assumption 1.

Example 1.3 (System Identification with Parametric Uncertainty). Consider the system identification problem where we only care about estimating particular entries of $(A_\star, B_\star)$, for example the gain of a particular actuator or the friction coefficient of a surface. In this setting, we choose our loss to be

$$\mathcal{J}_{\mathrm{SID},\theta_\star}(\hat{\theta}) = \|\hat{\theta} - \theta_\star\|_M^2 := \mathrm{vec}(\hat{\theta} - \theta_\star)^\top M\, \mathrm{vec}(\hat{\theta} - \theta_\star),$$

where $M \succeq 0$ has a value of 0 at coordinates which correspond to the known entries of $(A_\star, B_\star)$ and a value of 1 at coordinates which correspond to the unknown entries of $(A_\star, B_\star)$. Our decision $\hat{a}$ is the least squares estimate of $\theta_\star$.

Example 1.4 (Linear Experimental Design). If $A_\star = 0$ and $B_\star^\top = \phi_\star \in \mathbb{R}^{d_u}$, (1.1) reduces to

$$y_t = \phi_\star^\top u_t + w_t \qquad (1.2)$$

for $y_t, w_t \in \mathbb{R}$, $u_t \in \mathbb{R}^{d_u}$. This is the standard linear regression setting, and our framework therefore encompasses optimal linear experiment design with arbitrary smooth losses (Pukelsheim, 2006). For example, we may consider the A-optimal objective $\mathcal{J}_{\mathrm{LED},\phi_\star}(\phi) = \|\phi - \phi_\star\|_2^2$. Alternatively, we could minimize the negative log-likelihood relative to some reference distribution, so that $\mathcal{J}_{\mathrm{LED},\phi_\star}(\phi) = \mathbb{E}_{U,\, Y \sim p(\cdot \mid U, \phi_\star)}[-\log p(Y \mid U, \phi)]$, where $p(Y \mid U, \phi_\star)$ is the likelihood of the observations, $y_t \sim p(\cdot \mid u_t, \phi_\star)$ (Chaudhuri & Mykland, 1993; Chaudhuri et al., 2015; Pronzato & Pázman, 2013). Non-smooth, G-optimal-like objectives such as $\mathcal{J}_{\phi_\star}(\phi) = \max_{x \in \mathcal{X}} \langle x, \phi - \phi_\star\rangle^2$ for some finite set $\mathcal{X} \subseteq \mathbb{R}^{d_u}$ can be captured in our framework by using an approximate smoothed objective such as $\mathcal{J}_{\mathrm{LED},\phi_\star}(\phi) = \frac{1}{\lambda}\log \sum_{x \in \mathcal{X}} e^{\lambda \langle x, \phi - \phi_\star\rangle^2}$ for large $\lambda$.

Many other examples exist, from more general control problems to incentive design and beyond. As we will show, there is a provable gain to performing task-guided exploration on examples such as these. We present our results for general loss functions $\mathcal{J}_\theta$, but consider several of the examples stated here in more detail in Section 3.
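The losses in Examples 1.3 and 1.4 are easy to evaluate numerically. The sketch below is illustrative only: the masking convention, the soft-max temperature `lam`, and the example system are assumptions made for the demonstration, not definitions taken from the paper.

```python
import numpy as np

def J_SID(theta_hat, theta_star, M):
    """Weighted identification loss ||theta_hat - theta_star||_M^2 from Example 1.3,
    with M a PSD weight on vec(theta); a 0/1 diagonal M selects the unknown entries."""
    d = (theta_hat - theta_star).reshape(-1, order="F")   # column-major vec()
    return d @ M @ d

def J_LED_smooth(phi_hat, phi_star, X, lam=50.0):
    """Soft-max surrogate for the G-optimal-style objective in Example 1.4:
    (1/lam) * log sum_{x in X} exp(lam * <x, phi_hat - phi_star>^2)."""
    gaps = (X @ (phi_hat - phi_star)) ** 2
    return np.log(np.sum(np.exp(lam * gaps))) / lam

# Example: only the entries of B are treated as unknown.
dx, du = 2, 2
theta_star = np.hstack([np.diag([0.9, 0.5]), np.eye(2)])      # theta = [A, B]
theta_hat = theta_star + 0.01 * np.random.randn(dx, dx + du)
mask = np.concatenate([np.zeros(dx * dx), np.ones(dx * du)])  # vec([A, B]) = [vec(A); vec(B)]
print(J_SID(theta_hat, theta_star, np.diag(mask)))
```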
### 1.4. Related Works

Experiment Design and Control. Experiment design has over a century-old history in statistics, and numerous surveys have been written addressing its classical results (see, e.g., Pukelsheim (2006); Pronzato & Pázman (2013)). More recently, Chaudhuri et al. (2015) give a non-asymptotic active learning procedure for adaptive maximum likelihood estimation, again adapting the design to the unknown parameter of interest; unlike our work, their setting does not address dynamical systems. In the controls literature, there has been significant attention devoted to optimally exciting dynamical systems (Mehra, 1976; Goodwin & Payne, 1977; Jansson & Hjalmarsson, 2005; Gevers et al., 2009; Manchester, 2010; Hägg et al., 2013) to optimize classical design criteria for system identification. More recent works (Hjalmarsson et al., 1996; Hildebrand & Gevers, 2002; Katselis et al., 2012) have focused on designing inputs to meet certain task-specific objectives, as is the focus of this work. In control, the optimal design depends on the unknown parameters of the system, and prior works rely on either robust experiment design (Rojas et al., 2007; 2011; Larsson et al., 2012; Hägg et al., 2013) or adaptive experiment design (Lindqvist & Hjalmarsson, 2001; Gerencsér & Hjalmarsson, 2005; Barenthin et al., 2005; Gerencsér et al., 2007; 2009), the method of choice in this work, to address this challenge. Past results were often heuristic, and rigorous bounds are asymptotic in nature (Gerencsér et al., 2007; 2009). In contrast, we provide finite-sample upper bounds and unconditional information-theoretic lower bounds which validate the optimality of our approach. Our adaptive algorithm also admits an efficient implementation via projected gradient descent, whereas past designs require the solution of semidefinite programs, which may be prohibitive in high dimensions. More recently, Wagenmaker & Jamieson (2020) provided a finite-sample analysis of system identification in the operator norm. Our work shows that designs which optimize operator norm recovery can fare arbitrarily worse for control tasks compared to task-optimal designs. Moreover, the techniques in this work translate to providing an efficient implementation of the computationally inefficient procedure proposed by Wagenmaker & Jamieson (2020). In addition, our lower bounds consider a more realistic moderate-$\delta$ regime (see Remark D.1 for a comparison).

Non-Asymptotic Learning for Control. While the adaptive control problem has been extensively studied within the controls community (Åström & Wittenmark, 2013), machine learning has produced considerable recent interest in finite-time performance guarantees for system identification and control, guarantees which the classical adaptive control literature lacked. In the control setting, results have focused on finite-time regret bounds for the LQR problem with unknown dynamics (Abbasi-Yadkori & Szepesvári, 2011; Dean et al., 2017; 2018; Mania et al., 2019; Dean et al., 2019; Cohen et al., 2019), with Simchowitz & Foster (2020) ultimately settling the minimax optimal regret in terms of dimension and time horizon. Others have considered regret in online adversarial settings (Agarwal et al., 2019; Simchowitz et al., 2020). Recent results in system identification have focused on obtaining finite-time, high-probability bounds on the estimation error of the system's parameters when observing the system's evolution over time (Tu et al., 2017; Faradonbeh et al., 2018; Hazan et al., 2018; Hardt et al., 2018; Simchowitz et al., 2018; Sarkar & Rakhlin, 2018; Oymak & Ozay, 2019; Simchowitz et al., 2019; Sarkar et al., 2019; Tsiamis & Pappas, 2019).
Existing results rely on excitation from random noise to guarantee learning and do not consider the problem of learning with arbitrary sequences of inputs or optimally choosing inputs for excitation. Recent work has begun to consider instance-optimal bounds with more targeted excitation (Wagenmaker & Jamieson, 2020; Ziemann & Sandberg, 2020); the former is discussed above. The latter presents an asymptotic, instance-dependent lower bound for the online LQR problem. Our results, in contrast, consider offline pure exploration for a class of tasks much more general than LQR, and are finite-time. Furthermore, Ziemann & Sandberg (2020) do not provide a matching upper bound for their lower bound.

Reinforcement Learning. Viewing linear dynamical systems as a particular class of Markov Decision Processes (MDPs), our work can also be seen as studying PAC reinforcement learning (RL), where the goal is to find an $\epsilon$-good policy with probability $1 - \delta$ on a fixed MDP and reward function. Existing literature on PAC RL has tended to focus on obtaining coarse, worst-case bounds (Dann & Brunskill, 2015; Dann et al., 2017; 2019; Ménard et al., 2020). Only recently has progress been made in obtaining instance-dependent bounds, and here the results are either restricted to the much simpler generative model setting (Zanette et al., 2019; Marjani & Proutiere, 2020), or are asymptotic in nature and only apply to finding the optimal policy (Marjani et al., 2021). In contrast, our work provides tight, non-asymptotic, and instance-dependent upper and lower bounds for finding $\epsilon$-good policies, albeit in a restricted class of continuous RL problems.

### 1.5. Organization

The remainder of this paper is organized as follows. In Section 2 we provide an overview of our results, state an informal version of TOPLE, and introduce the essential quantities used in our analysis. Section 3 states several corollaries of our main result in specific settings and gives our bound for the LQR problem. Section 4 provides explicit examples where task-guided exploration yields provable gains over task-agnostic exploration, and Section 5 presents numerical experiments demonstrating that this improvement occurs in practice as well. We close in Section 6 with several interesting questions motivated by this work.

## 2. Summary of Results

We now turn to the presentation of our results. We assume we are in the setting described in Section 1.2. Throughout, we let $\mathcal{O}_\star(\cdot)$ suppress terms polynomial in problem parameters, $\log\frac{1}{\delta}$, and $\log\log T$; we write $a \lesssim b$ if $a \le C\, b$ for a universal constant $C > 0$.

### 2.1. Optimality of Certainty Equivalence

Before describing the optimal exploration policy for collecting data, we resolve the optimal procedure for synthesizing a decision, and its sample complexity. Given a trajectory $\tau = (x_{1:T+1}, u_{1:T})$, the least squares estimator of $\theta_\star$ is

$$\hat{\theta}_{\mathrm{ls}}(\tau) := \arg\min_{A, B} \sum_{t=1}^{T} \|x_{t+1} - A x_t - B u_t\|_2^2.$$

Note that $\hat{\theta}_{\mathrm{ls}}$ is the maximum-likelihood estimator of $\theta_\star$. For our upper bounds, we propose the certainty-equivalent decision rule:

Definition 2.1. The certainty equivalence decision rule selects the optimal control policy for the least-squares estimate of the dynamics: $\mathtt{ce}(\tau) := a_{\mathrm{opt}}(\hat{\theta}_{\mathrm{ls}}(\tau))$.

Certainty equivalence has a long history in controller design (Theil, 1957; Simon, 1956).
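The least-squares estimate in Definition 2.1 is a standard multivariate regression on the stacked regressors $z_t = [x_t; u_t]$. A minimal sketch, reusing the `rollout` helper introduced in Section 1.2 (an illustrative helper, not part of the paper):

```python
import numpy as np

def least_squares_estimate(X, U):
    """Least-squares estimate of theta = (A, B): X has shape (T+1, dx), U has shape
    (T, du), and x_{t+1} ~ A x_t + B u_t + w_t. Returns theta_hat = [A_hat, B_hat]."""
    Z = np.hstack([X[:-1], U])                  # regressors z_t = [x_t; u_t]
    Y = X[1:]                                   # targets x_{t+1}
    theta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return theta_hat.T                          # shape dx x (dx + du)

# Usage (illustrative system and policy):
# X, U = rollout(A, B, noise_policy, T=5000)
# A_hat, B_hat = np.split(least_squares_estimate(X, U), [A.shape[0]], axis=1)
```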
To analyze this strategy, we quantify both the error in our least squares estimator and how it translates into uncertainty about the control synthesis. The former is quantified in terms of the expected covariance matrices under exploration policies:

$$\Gamma_T(\pi; \theta) := \frac{1}{T}\, \mathbb{E}_{\theta,\pi}\Big[\sum_{t=1}^T \begin{bmatrix} x_t \\ u_t \end{bmatrix}\begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top\Big], \qquad \bar{\Gamma}_T(\pi; \theta) := I_{d_x} \otimes \Gamma_T(\pi; \theta),$$

where $\otimes$ denotes the Kronecker product. The latter requires that we measure how uncertainty in $\theta$ translates into uncertainty about the optimal decision rule for the task of interest:

Definition 2.2 (Model-Task Hessian and Idealized Risk). We define the model-task Hessian as $\mathcal{H}(\theta_\star) := \nabla_\theta^2\, \mathcal{R}_{\theta_\star}(a_{\mathrm{opt}}(\theta))\big|_{\theta = \theta_\star}$ and the idealized risk as $\Phi_T(\pi; \theta_\star) := \mathrm{tr}\big(\mathcal{H}(\theta_\star)\, \bar{\Gamma}_T(\pi; \theta_\star)^{-1}\big)$.

Intuitively, the model-task Hessian measures the local curvature of $\mathcal{J}_{\theta_\star}(a)$ as the decision $a$ varies along the directions of optimal decisions $a_{\mathrm{opt}}(\theta)$ for parameters $\theta$ in a neighborhood of $\theta_\star$. The idealized risk captures how the least-squares error propagates through this uncertainty. Our results will show that $\Phi_T$ characterizes the instance-optimal sample complexity for decision making and, consequently, by optimizing over $\pi$, of pure exploration. To formalize this, we require a notion of local minimax risk:

Definition 2.3 (Decision-Making Local Minimax Risk). Let $\mathcal{B}$ denote a subset of instances. The $T$-sample local minimax decision risk on $\mathcal{B}$ under exploration policy $\pi_{\mathrm{exp}}$ is

$$\mathfrak{M}_{\pi_{\mathrm{exp}}}(\mathcal{R}; \mathcal{B}) := \min_{\mathtt{dec}}\ \max_{\theta \in \mathcal{B}}\ \mathbb{E}_{\theta, \pi_{\mathrm{exp}}}\big[\mathcal{R}_\theta(\mathtt{dec}(\tau))\big],$$

where the minimization is over all maps from trajectories to decisions. Typically, we shall let $\mathcal{B}$ take the form $\mathcal{B}_F(r; \theta_\star) := \{\theta : \|\theta - \theta_\star\|_F \le r\}$. By choosing $\mathcal{B}$ to contain only instances close to $\theta_\star$, the local minimax risk captures the difficulty of completing our task on the specific instance $\theta_\star$, yielding an effectively instance-specific lower bound. Finally, we make the following assumption on the system dynamics.

Assumption 2. Let $\mathcal{S}$ denote the set of all stable instances, $\mathcal{S} := \{\theta = (A, B) : \rho(A) < 1\}$, where $\rho(A)$ denotes the spectral radius of $A$. We assume that $\theta_\star \in \mathcal{S}$.

While this assumption restricts our results to stable systems, similar assumptions are standard in much of the recent literature. Appendix C discusses generalizations to unstable systems. Under this assumption we have the following result.

Theorem 2.1 (Optimality of Certainty Equivalence). Let $\pi_{\mathrm{exp}}$ be any sufficiently regular policy, and consider $\mathcal{J}_\theta(a)$ and $a_{\mathrm{opt}}(\theta)$ satisfying Assumption 1 and $\theta_\star$ satisfying Assumption 2. Then for all $\delta \in (0, 1/3)$ and all $T$ sufficiently large, a trajectory $\tau$ generated by $\pi_{\mathrm{exp}}$ on $\theta_\star$ satisfies the following with probability at least $1 - \delta$:

$$\mathcal{R}_{\theta_\star}(\mathtt{ce}(\tau)) \lesssim \sigma_w^2 \log\tfrac{1}{\delta}\, \frac{\Phi_T(\pi_{\mathrm{exp}}; \theta_\star)}{T} + \mathcal{O}_\star\Big(\frac{1}{T^{3/2}}\Big).$$

Moreover, for some $r$ of order $1/T^{5/12}$ and $\mathcal{B} = \mathcal{B}_F(r; \theta_\star)$, the synthesis minimax risk is lower bounded as

$$\mathfrak{M}_{\pi_{\mathrm{exp}}}(\mathcal{R}; \mathcal{B}) \ge \frac{\sigma_w^2}{3}\, \frac{\Phi_T(\pi_{\mathrm{exp}}; \theta_\star)}{T} - \mathcal{O}_\star\Big(\frac{1}{T^{5/4}}\Big).$$

In Appendix B we state the full version of this result, which holds in a more general martingale decision-making setting encompassing certain instances of the nonlinear system formulation considered in Mania et al. (2020). This result establishes that, for a given exploration policy $\pi_{\mathrm{exp}}$ and for $T$ sufficiently large, the certainty equivalence decision $\mathtt{ce}(\tau)$ is the locally minimax optimal synthesis rule: there does not exist a more efficient way to utilize the acquired information to produce a decision. Note that, under some reasonable assumptions, an expectation bound can be obtained from the high-probability bound. We precisely quantify what it means for a policy to be sufficiently regular in Appendix B. In short, it entails that the policy sufficiently excites the system, and that the covariates concentrate around their mean. Lastly, note that our lower bound differs substantively from the $\delta \to 0$ lower bounds common in the adaptive estimation literature (see Remark D.1). Appendix C provides a more thorough discussion of these points.
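To make the quantities in Theorem 2.1 concrete, the sketch below estimates $\Gamma_T(\pi;\theta)$ by Monte Carlo and evaluates the idealized risk $\Phi_T$. It reuses the illustrative `rollout` helper from Section 1.2 and treats the model-task Hessian `H` (of size $d_x(d_x+d_u)$) as given; in practice $\mathcal{H}(\theta_\star)$ comes from differentiating the task loss.

```python
import numpy as np

def estimate_gamma(A, B, policy, T, sigma_w=1.0, n_rollouts=20, rng=None):
    """Monte-Carlo estimate of Gamma_T(pi; theta) = (1/T) E sum_t [x_t; u_t][x_t; u_t]^T."""
    rng = np.random.default_rng() if rng is None else rng
    dx, du = B.shape
    G = np.zeros((dx + du, dx + du))
    for _ in range(n_rollouts):
        X, U = rollout(A, B, policy, T, sigma_w, rng)
        Z = np.hstack([X[:-1], U])              # rows are [x_t; u_t]
        G += Z.T @ Z / T
    return G / n_rollouts

def idealized_risk(H, Gamma, dx):
    """Phi_T(pi; theta) = tr(H (I_dx kron Gamma)^{-1}) from Definition 2.2."""
    Gamma_bar = np.kron(np.eye(dx), Gamma)
    return np.trace(H @ np.linalg.inv(Gamma_bar))
```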
Proof Sketch of Theorem 2.1. For the proof of the lower bound, we first show that for any decision $\hat{a} = \mathtt{dec}(\tau)$ for which $\mathcal{R}_\theta(\hat{a})$ is small, we can infer an instance $\hat{\theta}(\hat{a})$ such that $\|\hat{\theta}(\hat{a}) - \theta\|^2_{\mathcal{H}(\theta_\star)}$ is also small (see Appendix E). This equivalence reduces our problem to that of estimating $\theta$ in the $\mathcal{H}(\theta_\star)$ norm. We then show a lower bound on a Gaussian martingale regression problem with general quadratic losses via a careful though elementary Bayesian MMSE computation. Unlike vanilla Cramér-Rao, this approach allows us to obtain a lower bound which holds for any estimator, not simply unbiased estimators (see Appendix D). Combining these results gives the stated lower bound. The proof of our upper bound mirrors this: we approximate $\mathcal{R}_{\theta_\star}(\mathtt{ce}(\tau))$ as a quadratic, $\|\hat{\theta}(\tau) - \theta_\star\|^2_{\mathcal{H}(\theta_\star)}$, and prove an upper bound on the error of the least squares estimator for martingale regression in general norms (see Appendix F).

### 2.2. Task-Optimal Experiment Design

Given that the optimal risk for a fixed exploration policy $\pi_{\mathrm{exp}}$ is governed by $\Phi_T(\pi_{\mathrm{exp}}; \theta_\star) = \mathrm{tr}(\mathcal{H}(\theta_\star)\bar{\Gamma}_T(\pi_{\mathrm{exp}}; \theta_\star)^{-1})$, it stands to reason that the optimal design procedure seeks to minimize this quantity. To this end, we introduce several quantities describing the relevant optimality properties.

Definition 2.4 (Power-Constrained Policies). Let $\Pi_{\gamma^2}$ denote the set of causal policies whose expected average power is bounded by $\gamma^2$. That is, for any $\pi \in \Pi_{\gamma^2}$, we have $\mathbb{E}_{\theta,\pi}\big[\sum_{t=1}^T \|u_t\|_2^2\big] \le T\gamma^2$ for all $\theta$.

Definition 2.5 (Optimal Risk). We define

$$\Phi_{\mathrm{opt}}(\gamma^2; \theta_\star) := \liminf_{T \to \infty}\ \inf_{\pi_{\mathrm{exp}} \in \Pi_{\gamma^2}} \Phi_T(\pi_{\mathrm{exp}}; \theta_\star),$$

the risk obtained by the policy minimizing the complexity $\Phi_T(\pi_{\mathrm{exp}}; \theta_\star)$.

Definition 2.6 (Exploration Local Minimax Risk). Let $\mathcal{B}$ denote a subset of instances. The $T$-sample local minimax exploration risk on $\mathcal{B}$ with budget $\gamma^2$ is

$$\mathfrak{M}_{\gamma^2}(\mathcal{R}; \mathcal{B}) := \min_{\pi_{\mathrm{exp}} \in \Pi_{\gamma^2}}\ \min_{\mathtt{dec}}\ \max_{\theta \in \mathcal{B}}\ \mathbb{E}_{\theta, \pi_{\mathrm{exp}}}\big[\mathcal{R}_\theta(\mathtt{dec}(\tau))\big].$$

Algorithm Sketch. We are now ready to state our algorithm. TOPLE proceeds in epochs. At each epoch it chooses a policy that minimizes the certainty-equivalence design objective, $\Phi_T(\pi; \hat{\theta})$, based on the estimate of the system's parameters produced in the previous epoch. As the estimate of $\theta_\star$ is refined, the exploration policy is improved, and it ultimately achieves near-optimal excitation of the system for the task of interest. In the policy optimization step on Line 6, we optimize over a restricted class of policies, $\Pi^p_{\gamma^2}$, which contains only periodic signals. As we show, this restriction is expressive enough to contain a near-optimal policy, while allowing us to represent $\Gamma_T(\pi; \hat{\theta}_i)$ in a convenient frequency-domain form. We then adopt a (sharp) convex relaxation of these policies that transforms the experiment design into a convex program, admitting a simple, efficient projected gradient descent implementation. A formal definition of TOPLE and a detailed explanation of these points are given in Appendix B.

Algorithm 1: Task-OPtimaL Experiment design (TOPLE), Informal

1: Input: initial epoch length $T_0$, budget $\gamma^2$.
2: $\pi_0 \leftarrow$ play $u_t \sim \mathcal{N}(0, (\gamma^2/d_u)\, I)$.
3: for epoch $i = 0, 1, 2, \ldots$ do
4: &nbsp;&nbsp; Run the system for $T_0 2^i$ steps, playing input $u_t = \pi_i(x_{1:t}, u_{1:t-1})$.
5: &nbsp;&nbsp; Compute the least squares estimate $\hat{\theta}_i \in \arg\min_{A,B} \sum_{t}\|x_{t+1} - A x_t - B u_t\|_2^2$.
6: &nbsp;&nbsp; Select the policy for epoch $i+1$: $\pi_{i+1} \leftarrow \arg\min_{\pi \in \Pi^p_{\gamma^2}} \mathrm{tr}\big(\mathcal{H}(\hat{\theta}_i)\, \bar{\Gamma}_T(\pi; \hat{\theta}_i)^{-1}\big)$.
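The sketch below shows only the epoch structure of Algorithm 1, reusing the illustrative `rollout` helper; `design_policy` is a placeholder for the experiment-design step, which in the paper is solved over periodic signals via a frequency-domain convex relaxation and projected gradient descent. One simple (much weaker) instantiation, optimizing a static Gaussian input covariance, is sketched later in Section 4.

```python
import numpy as np

def tople_sketch(T0, n_epochs, gamma2, design_policy, A_true, B_true, sigma_w=1.0):
    """Doubling-epoch skeleton: explore, re-estimate theta by least squares on all
    data so far, and re-solve the design for the next epoch. The true system is
    used only to simulate the environment."""
    dx, du = B_true.shape
    policy = lambda xs, us: np.sqrt(gamma2 / du) * np.random.randn(du)  # epoch 0: isotropic noise
    Xs, Us = [], []
    for i in range(n_epochs):
        X, U = rollout(A_true, B_true, policy, T0 * 2 ** i, sigma_w)    # run for T0 * 2^i steps
        Xs.append(X); Us.append(U)
        Z = np.vstack([np.hstack([x[:-1], u]) for x, u in zip(Xs, Us)])
        Y = np.vstack([x[1:] for x in Xs])
        theta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0].T              # [A_hat, B_hat]
        A_hat, B_hat = theta_hat[:, :dx], theta_hat[:, dx:]
        policy = design_policy((A_hat, B_hat), gamma2)                  # policy for epoch i+1
    return A_hat, B_hat
```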
Theorem 2.2 (Task-Optimal Experiment Design). Consider $\mathcal{J}_\theta(a)$ and $a_{\mathrm{opt}}(\theta)$ satisfying Assumption 1 and $\theta_\star$ satisfying Assumption 2. For sufficiently large $T$, the trajectory $\tau$ generated by Algorithm 1 enjoys the following guarantee with probability at least $1 - \delta$:

$$\mathcal{R}_{\theta_\star}(\mathtt{ce}(\tau)) \lesssim \sigma_w^2 \log\tfrac{1}{\delta}\, \frac{\Phi_{\mathrm{opt}}(\gamma^2; \theta_\star)}{T} + \mathcal{O}_\star\Big(\frac{1}{T^{3/2}}\Big).$$

Moreover, it produces inputs satisfying $\mathbb{E}_{\theta_\star, \mathrm{TOPLE}}\big[\sum_{t=1}^T \|u_t\|_2^2\big] \le T\gamma^2$, and can be implemented in polynomial time. Finally, for $r = \mathcal{O}_\star(1/T^{5/12})$ and $\mathcal{B} = \mathcal{B}_F(r; \theta_\star)$, the local minimax risk is lower bounded by

$$\mathfrak{M}_{\gamma^2}(\mathcal{R}; \mathcal{B}) \ge \frac{\sigma_w^2}{64}\, \frac{\Phi_{\mathrm{opt}}(\gamma^2; \theta_\star)}{T} - \mathcal{O}_\star\Big(\frac{1}{T^{5/4}}\Big).$$

We emphasize that the only assumptions needed for Theorem 2.2 to hold are that our system $\theta_\star$ is stable, and that the loss we are considering, $\mathcal{R}_{\theta_\star}(a)$, is sufficiently smooth. For any system and any loss satisfying these minimal assumptions, including those stated in Section 1.3, Theorem 2.2 shows that certainty equivalence decision making is instance-wise optimal, and that TOPLE attains this optimal rate. Furthermore, while TOPLE relies on experiment design, its sample complexity is also optimal over algorithms which incorporate feedback. We precisely quantify the lower-order terms and burn-in times necessary for this result to hold in Appendix B, and consider relaxations of our assumptions in Appendix C.

Proof Sketch of Theorem 2.2. The key technical difficulty lies in proving that our restricted class of policies, $\Pi^p_{\gamma^2}$, contains a near-optimal policy. We show this in Appendix H by a careful truncation argument and an application of Carathéodory's Theorem. Given this, the lower bound follows by a similar argument as in Theorem 2.1. For the upper bound, we show that once $\theta_\star$ has been estimated well enough, the certainty equivalence experiment design on Line 6 achieves the near-optimal rate (see Appendix J).

## 3. Interpreting the Results

To make our results more concrete, we return to the examples introduced in Section 1.3 and show how TOPLE applies in these settings.

### 3.1. Instance-Optimal LQR Synthesis

Consider the pure exploration LQR problem stated in Example 1.1. We define

$$\mathcal{R}_{\mathrm{LQR},\theta_\star}(K) := \mathcal{J}_{\mathrm{LQR},\theta_\star}(K) - \min_{K'} \mathcal{J}_{\mathrm{LQR},\theta_\star}(K'),$$

where $\mathcal{J}_{\mathrm{LQR},\theta_\star}(K)$ is given in Example 1.1. Recall the discrete algebraic Riccati equation, defined for some $\theta = (A, B)$:

$$P = A^\top P A - A^\top P B (R_u + B^\top P B)^{-1} B^\top P A + R_x.$$

If $\theta$ is stabilizable and $R_x, R_u \succ 0$, it is a well-known fact that this equation has a unique solution $P \succeq 0$. We denote the solution for the instance $\theta_\star = (A_\star, B_\star)$ by $P_\star$. We also recall the definition of the $\mathcal{H}_\infty$ norm of a system:

$$\|A_\star\|_{\mathcal{H}_\infty} = \max_{\omega \in [0, 2\pi]} \|(e^{\iota\omega} I - A_\star)^{-1}\|_{\mathrm{op}}.$$

Finally, we let $\Phi_{\mathrm{LQR}}(\gamma^2; \theta_\star) := \Phi_{\mathrm{opt}}(\gamma^2; \theta_\star)$ in the case when our loss is the LQR loss, $\mathcal{J}_{\theta_\star} = \mathcal{J}_{\mathrm{LQR},\theta_\star}$. Given these definitions, the following corollary shows the performance of TOPLE on the pure exploration LQR problem, and that the relevant quantities can be expressed in terms of the problem-dependent constants $\|P_\star\|_{\mathrm{op}}$, $\|B_\star\|_{\mathrm{op}}$, and $\|A_\star\|_{\mathcal{H}_\infty}$.

Corollary 1. As long as $T \ge C_{\mathrm{LQR}}(d_x \log^2 T + d_x^2)$, with probability at least $1 - \delta$, TOPLE achieves the following rate for the LQR problem:

$$\mathcal{R}_{\mathrm{LQR},\theta_\star}(\mathtt{ce}(\tau)) \lesssim \sigma_w^2 \log\tfrac{1}{\delta}\, \frac{\Phi_{\mathrm{LQR}}(\gamma^2; \theta_\star)}{T}.$$

Furthermore, any algorithm must incur the following loss:

$$\mathfrak{M}_{\gamma^2}(\mathcal{R}_{\mathrm{LQR}}; \mathcal{B}) \ge \frac{\sigma_w^2}{64}\, \frac{\Phi_{\mathrm{LQR}}(\gamma^2; \theta_\star)}{T},$$

where $\mathcal{B}$ is as in Theorem 2.2 and $C_{\mathrm{LQR}} = C'_{\mathrm{LQR}}/\min\{\sigma_u^6, 1\}$ for $C'_{\mathrm{LQR}}$ polynomial in $\|P_\star\|_{\mathrm{op}}$, $\|B_\star\|_{\mathrm{op}}$, $\|B_\star\|_{\mathrm{op}}^{-1}$, $\|A_\star\|_{\mathcal{H}_\infty}$, $\|R_u\|_{\mathrm{op}}$, $\gamma^2$, $\sigma_w^2$, $d_u$, $\log\log T$, and $\log\frac{1}{\delta}$.

As this result shows, TOPLE is instance-optimal for the LQR problem, with sample complexity governed by the constant $\Phi_{\mathrm{LQR}}(\gamma^2; \theta_\star)$. To the best of our knowledge, this is the first algorithm provably instance-optimal for LQR, albeit in the offline LQR setting.
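The problem-dependent constants appearing in Corollary 1 are straightforward to compute for a given system. The sketch below assumes SciPy and approximates the $\mathcal{H}_\infty$ norm on a finite frequency grid; the example system is illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_constants(A, B, Rx, Ru, n_freq=2000):
    """||P*||_op from the discrete Riccati equation, and the H-infinity norm
    ||A||_{H_inf} = max_w ||(e^{jw} I - A)^{-1}||_op approximated on a grid."""
    P = solve_discrete_are(A, B, Rx, Ru)
    P_op = np.linalg.norm(P, 2)
    I = np.eye(A.shape[0])
    hinf = max(
        np.linalg.norm(np.linalg.inv(np.exp(1j * w) * I - A), 2)
        for w in np.linspace(0.0, 2 * np.pi, n_freq)
    )
    return P_op, hinf

A = np.diag([0.95, 0.5]); B = np.eye(2)
print(lqr_constants(A, B, Rx=np.eye(2), Ru=0.1 * np.eye(2)))
```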
### 3.2. System Identification in Arbitrary Norms

Next, we consider the case of system identification in arbitrary norms outlined in Example 1.3. In this setting our loss is $\mathcal{R}_{\mathrm{SID},\theta_\star}(\hat{\theta}) := \mathcal{J}_{\mathrm{SID},\theta_\star}(\hat{\theta}) = \|\hat{\theta} - \theta_\star\|_M^2$, and it can be shown that our idealized risk is $\Phi_T(\pi_{\mathrm{exp}}; \theta_\star) = \mathrm{tr}(M \bar{\Gamma}_T(\pi_{\mathrm{exp}}; \theta_\star)^{-1})$. Defining

$$\Phi_{\mathrm{SID}}(\gamma^2; \theta_\star) := \liminf_{T \to \infty}\ \inf_{\pi_{\mathrm{exp}} \in \Pi_{\gamma^2}} \mathrm{tr}\big(M \bar{\Gamma}_T(\pi_{\mathrm{exp}}; \theta_\star)^{-1}\big),$$

Theorem 2.2 implies that

$$\mathcal{R}_{\mathrm{SID},\theta_\star}(\mathtt{ce}(\tau)) \lesssim \sigma_w^2 \log\tfrac{1}{\delta}\, \frac{\Phi_{\mathrm{SID}}(\gamma^2; \theta_\star)}{T},$$

up to a lower-order term proportional to $C_{\mathrm{SID}}\, \mathrm{tr}(M)\, d^3$, and that this rate is instance-optimal, for some constant $C_{\mathrm{SID}}$ polynomial in $\|B_\star\|_{\mathrm{op}}$, $\|A_\star\|_{\mathcal{H}_\infty}$, $\gamma^2$, $\sigma_w^2$, $d_u$, $\log\frac{1}{\delta}$, and $\log\log T$. In particular, if $M = I$ our loss $\mathcal{R}_{\mathrm{SID},\theta_\star}(\hat{\theta})$ reduces to the squared Frobenius norm, implying that TOPLE is the optimal Frobenius norm identification algorithm.

## 4. Task-Guided Exploration Yields Provable Gains

We turn now to several examples which illustrate that taking into account the task of interest when performing exploration yields provable gains over task-agnostic exploration schemes. We focus on the LQR setting and compare against the following natural exploration baselines:

- System Identification in Operator Norm (Wagenmaker & Jamieson, 2020): let $\pi_{\mathrm{op}}$ denote the exploration policy that is optimal for estimating $\theta_\star = (A_\star, B_\star)$ under the operator norm $\|\hat{\theta} - \theta_\star\|_{\mathrm{op}} = \sup_{u : \|u\|_2 \le 1} \|(\hat{\theta} - \theta_\star) u\|_2$. Explicitly, $\pi_{\mathrm{op}} = \arg\max_{\pi \in \Pi_{\gamma^2}} \lambda_{\min}(\Gamma_T(\pi; \theta_\star))$.
- System Identification in Frobenius Norm: let $\pi_{\mathrm{fro}}$ denote the exploration policy that is optimal for estimating $\theta_\star = (A_\star, B_\star)$ under the Frobenius norm: $\pi_{\mathrm{fro}} = \arg\min_{\pi \in \Pi_{\gamma^2}} \mathrm{tr}(\Gamma_T(\pi; \theta_\star)^{-1})$.
- Task-Optimal Gaussian Noise: let $\pi_{\mathrm{noise}}$ denote the exploration policy $\pi_{\mathrm{noise}} := \pi_{\mathrm{noise}}(\Lambda_\star)$, where $\pi_{\mathrm{noise}}(\Lambda)$ plays the inputs $u_t \sim \mathcal{N}(0, \Lambda)$ and $\Lambda_\star = \arg\min_{\Lambda : \mathrm{tr}(\Lambda) \le \gamma^2} \mathrm{tr}\big(\mathcal{H}(\theta_\star)\, \bar{\Gamma}_T(\pi_{\mathrm{noise}}(\Lambda); \theta_\star)^{-1}\big)$. A numerical sketch of this design appears below.

In stating our results, we overload notation and let $\mathcal{R}_{\mathrm{LQR},\theta_\star}(\pi_{\mathrm{exp}}) = \mathcal{R}_{\mathrm{LQR},\theta_\star}(\mathtt{ce}(\tau))$ for $\tau$ generated by $\pi_{\mathrm{exp}}$ on $\theta_\star$. We are concerned primarily with how the complexity scales with the dimension $d_x$ and with $\frac{1}{1-\rho}$, where $\rho$ is the spectral radius of the system, and use $\Omega(\cdot)$ and $\mathcal{O}(\cdot)$ to suppress lower-order dependence on these terms. Our first example shows that, if $(A_\star, B_\star)$ is properly structured, TOPLE achieves a tighter scaling in $\frac{1}{1-\rho}$ than all naive exploration approaches.

Proposition 4.1. Consider the system $A_\star = \rho\, e_1 e_1^\top$, $B_\star = b\, I$, $R_x = I$, and $R_u = \mu I$. There exist values of $b$, $\rho$, $\mu$, and $\sigma_w$ such that, in their dependence on $\rho$, the loss of TOPLE scales as $\mathcal{O}\big(\frac{1}{(1-\rho)^2}\big)$, while the losses of optimal Frobenius norm identification and of optimal operator norm identification (Wagenmaker & Jamieson, 2020) scale as $\Omega\big(\frac{1}{(1-\rho)^{5/2}}\big)$ and $\Omega\big(\frac{1}{(1-\rho)^{3}}\big)$ respectively, and optimally excited Gaussian noise likewise incurs a strictly larger power of $\frac{1}{1-\rho}$.

TOPLE achieves the optimal scaling in $\frac{1}{1-\rho}$ and, as $\rho \to 1$, will outperform the other approaches by an arbitrarily large factor. In addition, we note that Frobenius norm identification outperforms operator norm identification for this task. A key ingredient in the proof of this result is our convex relaxation of the optimal policy computation. Intuitively, on this instance the first coordinate is easily excited, and $\pi_{\mathrm{op}}$ and $\pi_{\mathrm{fro}}$ will therefore devote the majority of their energy to reducing the uncertainty in the remaining coordinates. However, the LQR cost will primarily be incurred in the first coordinate due to the same effect: this coordinate is easily excited, and therefore the first coordinate of the state grows at a much faster rate. As such, the task-optimal allocation does the opposite of $\pi_{\mathrm{op}}$ and $\pi_{\mathrm{fro}}$ and seeks to learn the first coordinate more precisely than the remaining coordinates, so as to mitigate this growth.

In our next example, our system behaves isotropically but our costs are non-isotropic. As a result, certain directions incur greater cost than others, and the task-optimal allocation seeks to primarily reduce uncertainty in these directions.
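The task-optimal Gaussian-noise baseline $\pi_{\mathrm{noise}}$ above can be approximated numerically. The sketch below is a simplified stand-in, not the paper's algorithm: it optimizes the static input covariance ($\Lambda$ above, `Sigma` in the code) by finite-difference projected gradient descent, uses a heuristic projection, treats the model-task Hessian `H` as given, and uses an illustrative step size.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def gamma_for_noise(A, B, Sigma, sigma_w=1.0):
    """Steady-state covariance of [x_t; u_t] when u_t ~ N(0, Sigma), independent of x_t:
    X solves X = A X A' + B Sigma B' + sigma_w^2 I, and the cross terms vanish."""
    dx, du = B.shape
    Xcov = solve_discrete_lyapunov(A, B @ Sigma @ B.T + sigma_w**2 * np.eye(dx))
    G = np.zeros((dx + du, dx + du))
    G[:dx, :dx], G[dx:, dx:] = Xcov, Sigma
    return G

def project(Sigma, gamma2):
    """Heuristic projection onto {Sigma >= 0, tr(Sigma) <= gamma2}: clip negative
    eigenvalues, then rescale if the power budget is exceeded."""
    w, V = np.linalg.eigh((Sigma + Sigma.T) / 2)
    w = np.clip(w, 0, None)
    if w.sum() > gamma2:
        w *= gamma2 / w.sum()
    return (V * w) @ V.T

def noise_design(A, B, H, gamma2, sigma_w=1.0, steps=200, lr=1e-2, eps=1e-4):
    """Finite-difference projected gradient descent on
    Sigma -> tr(H (I_dx kron Gamma(Sigma))^{-1})."""
    dx, du = B.shape
    obj = lambda S: np.trace(H @ np.linalg.inv(
        np.kron(np.eye(dx), gamma_for_noise(A, B, S, sigma_w))))
    Sigma = gamma2 / du * np.eye(du)
    for _ in range(steps):
        grad = np.zeros((du, du))
        for i in range(du):
            for j in range(i, du):
                E = np.zeros((du, du)); E[i, j] = E[j, i] = 1.0
                grad[i, j] = grad[j, i] = (obj(Sigma + eps * E) - obj(Sigma - eps * E)) / (2 * eps)
        Sigma = project(Sigma - lr * grad, gamma2)
    return Sigma
```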
Proposition 4.2. Consider the system $A_\star = \rho I$, $B_\star = I$, $R_x = I + e_1 e_1^\top$, and $R_u = \mu I$. Then there exists a choice of $\mu$, $\rho$, and $\sigma_w$ under which the loss of TOPLE is smaller than each of $\mathcal{R}_{\mathrm{LQR},\theta_\star}(\pi_{\mathrm{op}})$, $\mathcal{R}_{\mathrm{LQR},\theta_\star}(\pi_{\mathrm{fro}})$, and $\mathcal{R}_{\mathrm{LQR},\theta_\star}(\pi_{\mathrm{noise}})$.

We note that TOPLE improves on task-agnostic exploration by a factor of at least the dimensionality. These examples make clear that, in the setting of a linear dynamical system, when our goal is to perform a specific task, exploration agnostic to this task can be arbitrarily suboptimal.

### 4.1. Suboptimality of Low-Regret Algorithms

In contrast to our pure-exploration setting, where we do not incur cost during exploration, a significant body of work exists on regret minimization for the online LQR problem with unknown $A_\star, B_\star$. Here the goal is to choose a low-regret policy $\pi_{\mathrm{lr}}$ so as to minimize

$$\mathrm{Reg}_T := \mathbb{E}_{\theta_\star, \pi_{\mathrm{lr}}}\Big[\sum_{t=1}^T \ell(x_t, u_t)\Big] - T \min_{K} \mathcal{J}_{\mathrm{LQR},\theta_\star}(K)$$

for $\ell(x_t, u_t)$ as defined in Example 1.1. While our objectives differ, it would seem a natural strategy to run a low-regret algorithm for $T$ steps to obtain a controller $K_{\mathrm{lr}}$, and then evaluate the cost $\mathcal{J}_{\mathrm{LQR},\theta_\star}$ on this $K_{\mathrm{lr}}$. The following result shows that there is a fundamental tradeoff between regret and estimation; in particular, the optimal $\Theta(\sqrt{T})$ regret (see Simchowitz & Foster (2020)) translates to a (very suboptimal) $\Omega(1/\sqrt{T})$ excess risk $\mathcal{R}_{\mathrm{LQR},\theta_\star}(K_{\mathrm{lr}})$.

Proposition 4.3 (Suboptimality of Low Regret, Informal). For any sufficiently large $T$ and any regret bound $R \in [\sqrt{T}, T]$, any policy $\pi_{\mathrm{lr}}$ with regret $\mathbb{E}_{\pi_{\mathrm{lr}},\theta_\star}[\mathrm{Reg}_T] \le R$ which returns a controller $K_{\mathrm{lr}}$ as a function of its trajectory must have $\mathbb{E}_{\theta_\star,\pi_{\mathrm{lr}}}[\mathcal{R}_{\mathrm{LQR},\theta_\star}(K_{\mathrm{lr}})] = \Omega(d^2/R)$.

In particular, Proposition 4.3 implies that popular low-regret strategies, such as optimism-in-the-face-of-uncertainty (Abbasi-Yadkori & Szepesvári, 2011; Abeille & Lazaric, 2020), are highly suboptimal in our setting. The key intuition behind the proof is that low-regret algorithms converge to inputs $u_t \approx K_\star x_t$ approaching the optimal control policy; in doing so, they under-explore directions perpendicular to the hyperplane $\{(x, u) : u = K_\star x\}$, which are necessary for identifying the optimal control policy. We formally state and prove this result in Appendix C.7.

## 5. Numerical Experiments

Finally, we show that task-guided exploration yields practical gains. Figures 1, 2, and 3 illustrate the performance of TOPLE on several instances of the pure-exploration LQR problem. We compare against the baselines presented in Section 4 and the oracle task-optimal algorithm (which we refer to as "TOPLE Oracle"). For all baselines, we compute the inputs in an oracle, offline manner, using knowledge of $A_\star$
We therefore rely on this relaxation and a projected subgradient descent method in our implementation of the operator norm identification algorithm. All data points correspond to averaging over at least 50 runs of the algorithm. Additional details and plots with error bars are provided in Appendix M. Figures 2 and 3 illustrate performance on the instances stated in Proposition 4.1 and Proposition 4.2, respectively. Every point in the plot corresponds to the LQR loss obtained after T = 60000 steps. As these plots clearly illustrate, the theoretical gains stated in Proposition 4.1 and Proposition 4.2 appear in practice as well there is a clear improvement in terms of the scaling in and dx when performing taskguided exploration, even over moderate time regimes. Figure 1 illustrates the performance of TOPLE on a more typical problem instance: A? a single Jordan block and B?, Rx, and Ru randomly generated. Figure 1 gives the average loss versus time obtained by averaging the performance over 15 different realizations of B?, Rx, Ru. As in the previous examples, TOPLE outperforms all other approaches. 6. Conclusion In this work, we have shown that task-guided exploration of an unknown environment yields significant improvements over task-agnostic exploration. Furthermore, we have derived an instanceand task-optimal exploration algorithm which applies to a wide range of decision making problems, and derived corresponding instanceand task-dependent lower bounds. Our results also establish that certainty equivalence decision making is optimal, and we obtain the first instance-optimal algorithm for the LQR problem. This work raises several interesting questions: While our martingale decision making setting encom- passes certain classes of nonlinear systems, all our results fundamentally rely on linear observations of the parameter of interest, ?. Task-optimal exploration remains an open question for general nonlinear systems, and is an interesting future direction. We show that the smoothness conditions on our loss are met by a wide range of decision making problems. However, it remains an interesting future direction to obtain an optimal algorithm that holds without these smoothness assumptions. As (Wagenmaker & Jamieson, 2020) shows, when the loss is the operator norm which we note does not satisfy our smoothness assumption the optimal algorithm takes a form very similar to TOPLE. Does a general algorithm and analysis exist for both smooth and non-smooth losses? Our work focuses on the offline, pure-exploration set- ting. Extending our analysis to obtain instanceand task-optimal rates in the online setting is an interesting direction of future work. For the online LQR problem, (Simchowitz & Foster, 2020) obtain the optimal scaling in terms of dimension but their rates are suboptimal in terms of other problem-dependent constants. On the lower bound side, (Ziemann & Sandberg, 2020) provide an instance-dependent lower bound but give no upper bound. Solving this problem may require new algorithmic ideas, and we leave this for future work. ACKNOWLEDGEMENTS The work of AW is supported by an NSF GFRP Fellowship DGE-1762114. MS is generously supported by an Open Philanthropy AI Fellowship. The work of KJ is supported in part by grants NSF RI 1907907 and NSF CCF 2007036. Task-Optimal Exploration in Linear Dynamical Systems Abbasi-Yadkori, Y. and Szepesv ari, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1 26, 2011. 
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312-2320, 2011.

Abeille, M. and Lazaric, A. Efficient optimistic exploration in linear-quadratic regulators via Lagrangian relaxation. In International Conference on Machine Learning, pp. 23-31. PMLR, 2020.

Agarwal, N., Bullins, B., Hazan, E., Kakade, S., and Singh, K. Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111-119. PMLR, 2019.

Anderson, J., Doyle, J. C., Low, S. H., and Matni, N. System level synthesis. Annual Reviews in Control, 47:364-393, 2019.

Arias-Castro, E., Candes, E. J., and Davenport, M. A. On the fundamental limits of adaptive sensing. IEEE Transactions on Information Theory, 59(1):472-481, 2012.

Åström, K. J. and Wittenmark, B. Adaptive Control. Courier Corporation, 2013.

Barenthin, M., Jansson, H., and Hjalmarsson, H. Applications of mixed H2 and H-infinity input design in identification. IFAC Proceedings Volumes, 38(1):458-463, 2005.

Chaudhuri, K., Kakade, S., Netrapalli, P., and Sanghavi, S. Convergence rates of active learning for maximum likelihood estimation. arXiv preprint arXiv:1506.02348, 2015.

Chaudhuri, P. and Mykland, P. A. Nonlinear experiments: Optimal design and inference based on likelihood. Journal of the American Statistical Association, 88(422):538-546, 1993.

Cohen, A., Koren, T., and Mansour, Y. Learning linear-quadratic regulators efficiently with only √T regret. arXiv preprint arXiv:1902.06223, 2019.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. arXiv preprint arXiv:1510.08906, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. arXiv preprint arXiv:1703.07710, 2017.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pp. 1507-1516. PMLR, 2019.

Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pp. 4188-4197, 2018.

Dean, S., Tu, S., Matni, N., and Recht, B. Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), pp. 5582-5588. IEEE, 2019.

Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time identification in unstable linear systems. Automatica, 96:342-353, 2018.

Gerencsér, L. and Hjalmarsson, H. Adaptive input design in system identification. In Proceedings of the 44th IEEE Conference on Decision and Control, pp. 4988-4993. IEEE, 2005.

Gerencsér, L., Mårtensson, J., and Hjalmarsson, H. Adaptive input design for ARX systems. In 2007 European Control Conference (ECC), pp. 5707-5714. IEEE, 2007.

Gerencsér, L., Hjalmarsson, H., and Mårtensson, J. Identification of ARX systems with non-stationary inputs: asymptotic analysis with application to adaptive input design. Automatica, 45(3):623-633, 2009.

Gevers, M., Bazanella, A. S., Bombois, X., and Miskovic, L. Identification and the information matrix: how to get just sufficiently rich? IEEE Transactions on Automatic Control, 54:2828-2840, 2009.

Gill, R. D., Levit, B. Y., et al. Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli, 1(1-2):59-79, 1995.
Goodwin, G. C. and Payne, R. L. Dynamic System Identification: Experiment Design and Data Analysis. Academic Press, 1977.

Hägg, P., Larsson, C. A., and Hjalmarsson, H. Robust and adaptive excitation signal generation for input and output constrained systems. In 2013 European Control Conference (ECC), pp. 1416-1421. IEEE, 2013.

Hardt, M., Ma, T., and Recht, B. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025-1068, 2018.

Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pp. 4634-4643, 2018.

Hildebrand, R. and Gevers, M. Identification for control: optimal input design with respect to a worst-case ν-gap cost function. SIAM Journal on Control and Optimization, 41(5):1586-1608, 2002.

Hjalmarsson, H., Gevers, M., and De Bruyne, F. For model-based control design, closed-loop identification gives better performance. Automatica, 32(12):1659-1673, 1996.

Hsu, D., Kakade, S., Zhang, T., et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.

Jansson, H. and Hjalmarsson, H. Input design via LMIs admitting frequency-wise model specifications in confidence regions. IEEE Transactions on Automatic Control, 50(10):1534-1549, 2005.

Jedra, Y. and Proutiere, A. Sample complexity lower bounds for linear system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 2676-2681. IEEE, 2019.

Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. Information theoretic regret bounds for online nonlinear control. arXiv preprint arXiv:2006.12466, 2020.

Katselis, D., Rojas, C. R., Hjalmarsson, H., and Bengtsson, M. Application-oriented finite sample experiment design: A semidefinite relaxation approach. IFAC Proceedings Volumes, 45(16):1635-1640, 2012.

Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1-42, 2016.

Larsson, C., Geerardyn, E., and Schoukens, J. Robust input design for resonant systems under limited a priori information. IFAC Proceedings Volumes, 45(16):1611-1616, 2012.

Lindqvist, K. and Hjalmarsson, H. Identification for control: Adaptive input design using convex optimization. In Proceedings of the 40th IEEE Conference on Decision and Control, volume 5, pp. 4326-4331. IEEE, 2001.

Manchester, I. R. Input design for system identification via convex relaxation. In 49th IEEE Conference on Decision and Control (CDC), pp. 2041-2046. IEEE, 2010.

Mania, H., Tu, S., and Recht, B. Certainty equivalent control of LQR is efficient. arXiv preprint arXiv:1902.07826, 2019.

Mania, H., Jordan, M. I., and Recht, B. Active learning for nonlinear system identification with guarantees. arXiv preprint arXiv:2006.10277, 2020.

Marjani, A. A. and Proutiere, A. Best policy identification in discounted MDPs: Problem-specific sample complexity. arXiv preprint arXiv:2009.13405, 2020.

Marjani, A. A., Garivier, A., and Proutiere, A. Navigating to the best policy in Markov decision processes. arXiv preprint arXiv:2106.02847, 2021.

Mehra, R. K. Synthesis of optimal inputs for multiinput-multioutput (MIMO) systems with process noise. Part I: frequency-domain synthesis; Part II: time-domain synthesis. In Mathematics in Science and Engineering, volume 126, pp. 211-249. Elsevier, 1976.
Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. arXiv preprint arXiv:2007.13442, 2020.

Ok, J., Proutiere, A., and Tranos, D. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8874-8882, 2018.

Oymak, S. and Ozay, N. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), pp. 5655-5661. IEEE, 2019.

Pronzato, L. and Pázman, A. Design of Experiments in Nonlinear Models. Lecture Notes in Statistics, 212, 2013.

Pukelsheim, F. Optimal Design of Experiments. SIAM, 2006.

Rojas, C. R., Welsh, J. S., Goodwin, G. C., and Feuer, A. Robust optimal experiment design for system identification. Automatica, 43(6):993-1008, 2007.

Rojas, C. R., Aguero, J.-C., Welsh, J. S., Goodwin, G. C., and Feuer, A. Robustness in experiment design. IEEE Transactions on Automatic Control, 57(4):860-874, 2011.

Sarkar, T. and Rakhlin, A. How fast can linear dynamical systems be learned? arXiv preprint arXiv:1812.01251, 2018.

Sarkar, T. and Rakhlin, A. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pp. 5610-5618. PMLR, 2019.

Sarkar, T., Rakhlin, A., and Dahleh, M. A. Finite-time system identification for partially observed LTI systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.

Simchowitz, M. and Foster, D. J. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.

Simchowitz, M., Jamieson, K., and Recht, B. The simulator: Understanding adaptive sampling in the moderate-confidence regime. arXiv preprint arXiv:1702.05186, 2017.

Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.

Simchowitz, M., Boczar, R., and Recht, B. Learning linear dynamical systems with semi-parametric least squares. arXiv preprint arXiv:1902.00768, 2019.

Simchowitz, M., Singh, K., and Hazan, E. Improper learning for non-stochastic control. In Conference on Learning Theory, pp. 3320-3436. PMLR, 2020.

Simon, H. A. Dynamic programming under uncertainty with a quadratic criterion function. Econometrica, Journal of the Econometric Society, pp. 74-81, 1956.

Theil, H. A note on certainty equivalence in dynamic planning. Econometrica: Journal of the Econometric Society, pp. 346-349, 1957.

Tsiamis, A. and Pappas, G. J. Finite sample analysis of stochastic system identification. arXiv preprint arXiv:1903.09122, 2019.

Tu, S., Boczar, R., Packard, A., and Recht, B. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Wagenmaker, A. and Jamieson, K. Active learning for identification of linear dynamical systems. arXiv preprint arXiv:2002.00495, 2020.
Zanette, A., Kochenderfer, M., and Brunskill, E. Almost horizon-free structure-aware best policy identification with a generative model. 2019.

Ziemann, I. and Sandberg, H. On uninformative optimal policies in adaptive LQR with unknown B-matrix. arXiv preprint arXiv:2011.09288, 2020.