# Tracking Adversarial Targets

Yasin Abbasi-Yadkori (YASIN.ABBASIYADKORI@QUT.EDU.AU), Queensland University of Technology
Peter Bartlett (BARTLETT@EECS.BERKELEY.EDU), University of California, Berkeley and QUT
Varun Kanade (VKANADE@EECS.BERKELEY.EDU), University of California, Berkeley

We study linear control problems with quadratic losses and adversarially chosen tracking targets. We present an efficient algorithm for this problem and show that, under standard conditions on the linear system, its regret with respect to an optimal linear policy grows as $O(\log^2 T)$, where $T$ is the number of rounds of the game. We also study a problem with adversarially chosen transition dynamics; we present an exponentially-weighted average algorithm for this problem, and we give regret bounds that grow as $O(\sqrt{T})$.

## 1. Introduction

Consider a robot that controls an electron microscope to track a microorganism. Given the entire trajectory of the microorganism and the dynamics of the system, the optimal control can be computed. The trajectory, however, is not known in advance, and the target might behave in an arbitrary fashion. In such situations, designing a controller based only on prior knowledge about the target location might be sub-optimal; the observed behavior of the target must be taken into account.

We consider problems with linear transition functions and quadratic tracking losses. When the target trajectory is known in advance, the problem is called the linear quadratic (LQ) problem in the control community. The LQ problem is one of the most studied problems in the control literature and is widely applied in practice (Lai and Wei, 1982; 1987; Chen and Guo, 1987; Chen and Zhang, 1990; Fiechter, 1997; Lai and Ying, 2006; Campi and Kumar, 1998; Bittanti and Campi, 2006; Abbasi-Yadkori and Szepesvári, 2011). By the principles of dynamic programming, the optimal controller can be computed analytically. The solution is obtained by computing value functions, starting from the last round and recursing backward. This method needs to know the entire target sequence from the beginning, and its computational complexity scales linearly with the length of the trajectory. It turns out that the optimal controller is linear in the state vector, and the value functions are quadratic in state and action.

As we discussed earlier, the assumption that the entire trajectory is known in advance is not always realistic. But what would tracking mean without a reference trajectory? To make the problem well-defined, we fix a class of mappings from states to actions (also known as policies) as our competitors. Our objective is to track the trajectory nearly as well as the best policy in the comparison class in hindsight. The standard dynamic programming procedures are not applicable when the entire sequence is not known in advance. We show that we can still have an effective tracking algorithm even if the sequence is not known in advance. The proposed algorithm is perhaps the first tracking method that can deal with infinite and unknown sequences. We study the adversarial version of the LQ problem, where an adversary designs the trajectory of the target and reveals the target location only at the end of each round.
Formally, we study problems with transition dynamics and loss functions

$$x_{t+1} = A x_t + B a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t ,$$

where $x_t \in \mathbb{R}^n$ is the state at round $t$; $a_t \in \mathbb{R}^d$ is the action; $g_t$ is the target location; $\ell_t : \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}$ is the loss function; and $A$, $B$, $Q$ are known matrices.¹ The matrix $Q$ is symmetric and positive definite. The learner observes the sequence of states $(x_t)_t$. We make no assumptions on the trajectory sequence $(g_t)_t$ apart from its boundedness; the sequence might even be generated by an adversary. An LQ problem is defined by a 4-tuple $(A, B, Q, G)$, where $G$ is an upper bound on the norm of the vectors $g_t$.

¹ We can also use this formulation for loss functions of the form $\ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top R a_t$ for a positive definite matrix $R$, by writing $x_{t+1} = A x_t + (B R^{-1/2}) R^{1/2} a_t = A x_t + \widetilde{B} \widetilde{a}_t$ and $\ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top R^{1/2} R^{1/2} a_t = (x_t - g_t)^\top Q (x_t - g_t) + \widetilde{a}_t^\top \widetilde{a}_t$.

Let $T$ be a time horizon, $x^\pi_t$ be the state of the system if we run policy $\pi$ for $t$ rounds, and $\Pi$ be a class of policies. Let

$$\rho_T(g_1, \dots, g_T) = \sum_{t=1}^T \ell_t(x_t, a_t) - \min_{\pi \in \Pi} \sum_{t=1}^T \ell_t\big(x^\pi_t, \pi(x^\pi_t)\big) .$$

The objective of the learner is to suffer low loss. The performance is measured by the regret, defined by

$$R_T(A, B, Q, G, x_1, \Pi) = \sup_{g_1, \dots, g_T : \|g_t\| \le G} \rho_T(g_1, \dots, g_T) ,$$

where $\|\cdot\|$ denotes the 2-norm. In what follows, we write $R_T := R_T(A, B, Q, G, x_1, \Pi)$. For simplicity, we assume that $x_1 = 0$. As the optimal policy for the classical problem with a constant target is linear (strictly speaking, affine), a reasonable choice for the class of competitors is the set of linear (affine) policies.

The problem that we describe is an instance of Markov Decision Process (MDP) problems with fixed and known dynamics and changing loss functions. Note that the loss function depends on the target location, which can change in an arbitrary fashion. Such MDP problems were previously studied by Even-Dar et al. (2009) and Neu et al. (2010b;a). However, these papers provide results for finite MDP problems and are not applicable to problems with large state/action spaces. Our key finding is that the algorithm of Even-Dar et al. (2009) can be modified to be applicable in adversarial LQ problems. Interestingly, the resulting algorithm is identical to a Policy Iteration method (see, for example, Howard (1960)) with changing loss functions. Another interesting observation is that the gain matrix is independent of the target vectors (see Lemma 4). This simplifies the design and analysis of our algorithm. We prove that the regret of the algorithm is logarithmic in the number of rounds of the game.

Finally, we also study a more adversarial problem:

$$x_{t+1} = A_t x_t + B_t a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t .$$

Here, the time-varying transition matrices $A_t$ and $B_t$ and the target vector $g_t$ are chosen by an adversary. In this problem, we show that under a uniform stability assumption, an exponentially-weighted average algorithm recently proposed by Abbasi-Yadkori et al. (2013) enjoys an $O(\sqrt{T})$ regret bound with respect to the class of linear policies.
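As a concrete illustration of this setting, the following minimal sketch simulates the dynamics and accumulates the tracking loss of one fixed linear policy against an arbitrary bounded target sequence. All matrices, the policy, and the target sequence are made-up example values, not taken from the paper.

```python
import numpy as np

# Sketch of the problem setting: dynamics x_{t+1} = A x_t + B a_t, loss
# (x_t - g_t)^T Q (x_t - g_t) + a_t^T a_t, and the cumulative loss of a fixed
# linear policy pi(x) = -K x + c. All numerical values are illustrative.

rng = np.random.default_rng(0)
n, d, T = 2, 1, 50                         # state dim, action dim, horizon (arbitrary)
A = np.array([[0.9, 0.1], [0.0, 0.8]])     # assumed known dynamics for the demo
B = np.array([[0.0], [1.0]])
Q = np.eye(n)
G = 1.0
raw = rng.uniform(-1, 1, size=(T, n))      # stand-in for an adversarial target sequence
targets = G * raw / np.maximum(1.0, np.linalg.norm(raw, axis=1, keepdims=True))  # ||g_t|| <= G

def run_policy(K, c):
    """Roll out pi(x) = -K x + c from x_1 = 0 and return the total tracking loss."""
    x = np.zeros(n)
    total = 0.0
    for t in range(T):
        a = -K @ x + c
        g = targets[t]
        total += (x - g) @ Q @ (x - g) + a @ a
        x = A @ x + B @ a
    return total

K = np.array([[0.0, 0.5]])                 # an arbitrary bounded gain
print(run_policy(K, c=np.array([0.2])))
```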
## 2. Notation

We use $\sigma_{\min}(M)$ and $\sigma_{\max}(M)$ to denote the minimum and maximum eigenvalues of a positive semidefinite matrix $M$, respectively. We use $\|\cdot\|$ to denote the 2-norm of matrices and vectors, where the 2-norm of a matrix $M$ is defined by $\|M\| = \sqrt{\sigma_{\max}(M^\top M)}$. We write $M \succ 0$ to denote that $M$ is positive definite, and $M \succeq 0$ to denote that it is positive semidefinite. We use $M_{ij}$ to denote a block of the matrix $M$; the indices and the dimensionality of the block will be understood from the context. Similarly, $v_i$ denotes a block of the vector $v$.

## 3. Tracking Adversarial Targets

Even-Dar et al. (2009) study finite-state MDP problems with fixed and known transition functions and adversarial loss functions. Their algorithm (MDP-E) in its present form is not applicable to our problem with a continuous state space. Somewhat surprisingly, we can design a variant of the MDP-E algorithm that is applicable to our tracking problem with continuous state and action spaces.

The MDP-E algorithm, shown in Figure 1, maintains an expert algorithm in each state; that is, it treats each state as a separate problem of prediction with expert advice, where each action corresponds to an expert. At round $t$, the product of expert recommendations over the state space defines the policy, denoted by $\pi_t$. The algorithm takes action $a_t \sim \pi_t(x_t)$ and observes the loss function $\ell_t$. It computes the value function $V_{\pi_t,\ell_t}$ defined by the Bellman Optimality Equation: for some scalar $\lambda$ and for all $x, a$,

$$\lambda + V_{\pi_t,\ell_t}(x, a) = \ell_t(x, a) + \mathbb{E}_{x' \sim m(\cdot \mid x, a)}\big[ V_{\pi_t,\ell_t}(x', \pi_t(x')) \big] ,$$

where $m$ defines the state transition probabilities.² Then, the algorithm feeds the expert algorithm in state $x$ with $V_t(x, \cdot) = V_{\pi_t,\ell_t}(x, \cdot)$ as the loss function at time $t$. Thus, the computational cost of the MDP-E algorithm per round is $O(W + |X|)$, where $W$ is the cost of obtaining the value function and $X$ is the finite state space.

² In the Bellman Optimality Equation, the scalar $\lambda$ is the average loss of policy $\pi_t$, and $V_{\pi_t,\ell_t}(x, a)$ is the relative goodness of the state-action pair $(x, a)$ under policy $\pi_t$. Note that under certain assumptions the average loss is independent of the initial state; however, some states are more favorable, as the policy incurs lower losses during the transient phase when starting from those states.

Initialize an expert algorithm in each state
for t := 1, 2, . . . do
  Let πt(xt) be the prediction of the expert algorithm in state xt
  Take action at ∼ πt(xt)
  Observe loss function ℓt
  Compute Vt = Vπt,ℓt
  For all x, feed the expert algorithm in state x with loss Vt(x, ·)
end for

Figure 1. The MDP-E Algorithm

Applied to the LQ problem, the value functions are defined by: for some scalar $\lambda$ and for all $x, a$,

$$\lambda + V_t(x, a) = \ell_t(x, a) + V_t\big(Ax + Ba, \ \pi_t(Ax + Ba)\big) , \qquad (1)$$

where we use the notation $V_t = V_{\pi_t,\ell_t}$. The linear structure allows us to compute $V_t$ implicitly for all states, thus overcoming the difficulty of the infinite state space. As we will show, a suitable expert algorithm for our problem is the Follow The Leader (FTL) algorithm, which we define next.

Consider an online quadratic optimization problem where at round $t$ the adversary chooses a quadratic loss function $f_t$ defined over a convex set $D \subset \mathbb{R}^k$. Simultaneously, the learner makes a prediction $p_t \in D$, suffers loss $f_t(p_t)$, and observes the loss function $f_t$. The regret of the learner is defined by $B_T = \sum_{t=1}^T f_t(p_t) - \min_{p \in D} \sum_{t=1}^T f_t(p)$. The FTL algorithm makes the prediction

$$p_t = \operatorname*{argmin}_{p \in D} \ \sum_{s=1}^{t-1} f_s(p) .$$

The FTL algorithm enjoys the following regret bound for quadratic losses (Cesa-Bianchi and Lugosi, 2006, Theorem 3.1):

Theorem 1 (FTL Regret bound). Assume $f_t$ is convex, maps to $[0, C_1]$, is Lipschitz with constant $C_2$, and is twice differentiable everywhere with Hessian $H \succeq C_3 I$. Then the regret of the Follow The Leader algorithm is bounded by $B_T \le \frac{4 C_1 C_2^2}{C_3} (1 + \log T)$.
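For intuition about the expert algorithm, here is a small sketch of FTL on quadratic losses of the form $f_t(p) = \|p - y_t\|^2$ over a ball, the setting covered by Theorem 1. The loss sequence and the set $D$ are made up for illustration.

```python
import numpy as np

# Sketch of Follow The Leader on quadratic losses f_t(p) = ||p - y_t||^2 over a ball D.
# The sequence y_t below is arbitrary; it only illustrates the FTL prediction rule.

rng = np.random.default_rng(1)
k, T, radius = 3, 200, 1.0
ys = rng.uniform(-0.5, 0.5, size=(T, k))

def project(p, r=radius):
    """Project onto the Euclidean ball of radius r (the convex set D)."""
    nrm = np.linalg.norm(p)
    return p if nrm <= r else p * (r / nrm)

preds, cum_y = [], np.zeros(k)
for t in range(T):
    # FTL: argmin_p sum_{s<t} ||p - y_s||^2 over D = projection of the running mean
    p_t = project(cum_y / t) if t > 0 else np.zeros(k)
    preds.append(p_t)
    cum_y += ys[t]

ftl_loss = sum(np.sum((p - y) ** 2) for p, y in zip(preds, ys))
best_fixed = project(ys.mean(axis=0))      # best fixed prediction in hindsight
best_loss = np.sum((ys - best_fixed) ** 2)
print("FTL regret:", ftl_loss - best_loss)  # grows only logarithmically in T
```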
Figure 2 shows the FTL-MDP algorithm for the linear quadratic tracking problem. It corresponds to the MDP-E algorithm, with FTL as the expert algorithm for each state. The algorithm starts at state $x_1 = 0$, and the first policy is chosen to be $\pi_1(x) = -K_* x$, where $K_*$ is a gain matrix that will be defined later.³ The algorithm computes the total loss in each state, denoted by $V(x, \cdot) = \sum_{s=1}^{t-1} V_s(x, \cdot)$.

³ Adopting a convention from feedback control, we represent linear policies with a negative sign.

x1 = 0; for all x, π1(x) = −K*x, with K* given by Eq. (6)
for t := 1, 2, . . . do
  Take action at = πt(xt) and suffer the loss ℓt(xt, at)
  Move to the state xt+1 = Axt + Bat
  Compute the value function Vt = Vπt,ℓt   (2)
  Let V = Σ_{s=1..t} Vs
  Obtain the next policy by solving ∇a V(x, a) = 0: πt+1(x) = −Kt+1 x + ct+1
end for

Figure 2. FTL-MDP: The Follow The Leader Algorithm for Markov Decision Processes

The FTL strategy chooses the greedy action in each state, which is obtained by minimizing $V(x, \cdot)$. As the next lemma shows, the value functions computed in the FTL-MDP algorithm are always quadratic, and thus the function $V$ is always quadratic in state and action. This implies that the policies are linear in the state.

Lemma 2. Consider the MDP-E algorithm applied to the adversarial LQ problem $(A, B, Q, G)$. Let the expert algorithm be the FTL strategy. Assume the first policy is chosen to be an arbitrary linear policy, $\pi_1(x) = -K_1 x + c_1$. Then, for appropriate matrices $P_t$ and vectors $L_t$, the value function at time $t$ has the form

$$V_t(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top P_t \begin{bmatrix} x \\ a \end{bmatrix} + L_t^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

and the policy chosen by the algorithm at time $t$ is $\pi_t(x) = -K_t x + c_t$, where

$$K_t = \Big( \sum_{s=1}^{t-1} P_{s,22} \Big)^{-1} \sum_{s=1}^{t-1} P_{s,21} \qquad \text{and} \qquad c_t = -\Big( \sum_{s=1}^{t-1} P_{s,22} \Big)^{-1} \sum_{s=1}^{t-1} L_{s,2} \big/ 2 ,$$

and $P_{s,ij}$ and $L_{s,i}$ are the $ij$-th and $i$-th blocks of the matrix $P_s$ and the vector $L_s$, respectively. (Here the block structure naturally appears as the components corresponding to the state and the action.)

The proof uses the following lemma, which shows that the value of a linear policy is quadratic.

Lemma 3. Consider the LQ problem $(A, B, Q, G)$ with fixed target $g$. Let $K$ be a matrix such that $\|A - BK\| < 1$. The value function of the policy $\pi(x) = -Kx + c$ has the quadratic form

$$V_{\pi,\ell}(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top P \begin{bmatrix} x \\ a \end{bmatrix} + L^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

where $P = P(K)$ and $L$ are solutions to the equations

$$P = \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top P \begin{bmatrix} I \\ -K \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} + \begin{bmatrix} Q & 0 \\ 0 & I \end{bmatrix} , \qquad (3)$$

$$L = \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top L + 2 \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top P \begin{bmatrix} 0 \\ c \end{bmatrix} - \begin{bmatrix} 2 Q g \\ 0 \end{bmatrix} . \qquad (4)$$

Further, the loss in the limiting state $x^\pi$ is

$$\lambda = g^\top Q g + c^\top P_{22} c + L_2^\top c . \qquad (5)$$

The proof can be found in Appendix A.

Proof of Lemma 2. We prove the lemma by induction. By Lemma 3, the value of the first policy, $\pi_1(x) = -K_1 x + c_1$, has the form $V_1(x, a) = V_{\pi_1,\ell_1}(x, a) = [x; a]^\top P_1 [x; a] + L_1^\top [x; a]$ for some matrix $P_1$ and vector $L_1$. This establishes the base case. Next, assume that the value functions up to time $t-1$ are all quadratic. Because we use the FTL strategy as our expert algorithm in each state, the policy in state $x$ at round $t$ is

$$\pi_t(x) = \operatorname*{argmin}_{a} \ \sum_{s=1}^{t-1} V_s(x, a) .$$

We obtain the policy by setting $\nabla_a \sum_{s=1}^{t-1} V_s(x, a) = 0$. Letting $V_s(x, a) = [x; a]^\top P_s [x; a] + L_s^\top [x; a]$ for $s = 1, \dots, t-1$, we get that

$$\sum_{s=1}^{t-1} V_s(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top \widetilde{P}_t \begin{bmatrix} x \\ a \end{bmatrix} + \widetilde{L}_t^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

where $\widetilde{P}_t = \sum_{s=1}^{t-1} P_s$ and $\widetilde{L}_t = \sum_{s=1}^{t-1} L_s$. Taking the gradient with respect to the second argument and setting it to zero, we get

$$\nabla_a V(x, a) = 2 \widetilde{P}_{t,21} x + 2 \widetilde{P}_{t,22} a + \widetilde{L}_{t,2} = 0 \quad \Longrightarrow \quad a = -\widetilde{P}_{t,22}^{-1} \widetilde{P}_{t,21} x - \tfrac{1}{2} \widetilde{P}_{t,22}^{-1} \widetilde{L}_{t,2} .$$

Thus, the policy can be compactly written as $\pi_t(x) = -K_t x + c_t$ for all $x$, where $K_t = \widetilde{P}_{t,22}^{-1} \widetilde{P}_{t,21}$ and $c_t = -\widetilde{P}_{t,22}^{-1} \widetilde{L}_{t,2}/2$. Given this linear policy, we get the quadratic value function $V_t$ from Equation (2).

Lemma 2 implies that the MDP-E algorithm can be efficiently implemented in the adversarial LQ problem.
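As a minimal numerical sketch of Lemma 3 and the FTL step of Lemma 2, the code below solves the fixed-point equations (3)–(4) for a given policy $(K, c)$ and target $g$, and then computes the greedy action that minimizes the quadratic value over the action block. The system matrices, policy, and target are illustrative values, not taken from the paper.

```python
import numpy as np

# Sketch: value function V(x, a) = z^T P z + L^T z with z = [x; a] of a policy
# pi(x) = -K x + c under loss (x-g)^T Q (x-g) + a^T a, via Eqs. (3)-(4), and the
# FTL greedy action from the (accumulated) P and L. Example values are arbitrary.

def value_function(A, B, Q, K, c, g, iters=500):
    n, d = B.shape
    F = np.hstack([A, B])                       # next state x' = F z
    M = np.vstack([np.eye(n), -K])              # next state-action z' = M x' + [0; c]
    e = np.concatenate([np.zeros(n), c])
    blk = np.block([[Q, np.zeros((n, d))], [np.zeros((d, n)), np.eye(d)]])
    P = np.zeros((n + d, n + d))
    for _ in range(iters):                      # Eq. (3): P = F^T M^T P M F + blkdiag(Q, I)
        P = F.T @ M.T @ P @ M @ F + blk
    MF = M @ F
    rhs = 2 * MF.T @ P @ e - np.concatenate([2 * Q @ g, np.zeros(d)])
    L = np.linalg.solve(np.eye(n + d) - MF.T, rhs)   # Eq. (4)
    return P, L

def greedy_action(P_sum, L_sum, x):
    """Minimize a (sum of) quadratic value function(s) over the action block."""
    n = x.shape[0]
    P22, P21, L2 = P_sum[n:, n:], P_sum[n:, :n], L_sum[n:]
    return -np.linalg.solve(P22, P21 @ x) - 0.5 * np.linalg.solve(P22, L2)

A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); K = np.array([[0.2, 0.5]])
P, L = value_function(A, B, Q, K, np.array([0.1]), g=np.array([1.0, 0.0]))
print(greedy_action(P, L, x=np.zeros(2)))
```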
Before stating the main result of this paper, we describe certain assumptions and definitions from the control literature (see, for example, Bertsekas (2001)).

Definition 1. A pair $(A, B)$, where $A$ is an $n \times n$ matrix and $B$ is an $n \times d$ matrix, is said to be controllable if the $n \times nd$ matrix $[B \ \ AB \ \ \dots \ \ A^{n-1}B]$ has full rank. A pair $(A, C)$, where $A$ is an $n \times n$ matrix and $C$ is a $d \times n$ matrix, is said to be observable if the pair $(A^\top, C^\top)$ is controllable.

Roughly speaking, controllability implies that the state can be moved arbitrarily by changing the actions, while observability implies that the state can be externally measured. We assume that the system is controllable and observable. These assumptions are standard in the literature, and they allow a closed-form expression for the optimal control law.

Assumption A1 (Controllability and observability). The pair $(A, B)$ is controllable and the pair $(A, Q^{1/2})$ is observable.

Under this assumption, the gain matrix is stable, i.e. there exists $\rho \in (0, 1)$ such that $\|A - BK_*\| \le \rho < 1$, where

$$K_* = (I + B^\top S B)^{-1} B^\top S A \qquad (6)$$

is the gain matrix (Bertsekas, 2001), and $S$ is the solution of the Riccati equation

$$S = Q + A^\top S A - A^\top S B (I + B^\top S B)^{-1} B^\top S A .$$

Interestingly, as the next lemma shows, all gain matrices are equal. The proof can be found in Appendix A.

Lemma 4. Consider the FTL-MDP algorithm. Let $P_* = P(K_*)$ as defined by (3). If we choose $K_1 = K_*$, then all gain matrices are equal, $K_* = K_1 = K_2 = K_3 = \dots$, and hence $P_* = P_1 = P_2 = P_3 = \dots$ .

Lemma 4 shows that the gain matrices are independent of the target vectors and can be computed by assuming that all target vectors are zero. Given the fixed gain matrix, the system is driven to a desired target position by changing the bias term of the linear policy.

We represent the linear policy $\pi(x) = -Kx + c$ by the pair $\pi = (K, c)$. The class of $(K', C')$-bounded, $\rho$-stable linear policies is defined by

$$\Pi = \big\{ \pi = (K, c) : \|A - BK\| \le \rho, \ \|K\| \le K', \ \|c\| \le C' \big\} .$$

Theorem 5. For a controllable, observable LQ problem $(A, B, Q, G)$, the regret of the FTL-MDP algorithm with respect to the class of $(K', C')$-bounded, $\rho$-stable linear policies is $O(\log^2 T)$, where the hidden constants are polynomials in $\|A\|$, $\|B\|$, $\|Q\|$, $G$, $\|S\|$, $\|P_*\|$, $1/\lambda_{\min}(P_*)$, $\|K_*\|$, $K'$, $C'$, and $1/(1-\rho)$.

The hidden constants are all small-order polynomials. As we will show in the next section, a careful asymptotic analysis gives an asymptotic $O(\log T)$ bound. It is an open problem to show a finite-time $O(\log T)$ regret bound with polynomial constants.
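To make Assumption A1 and Eq. (6) concrete, here is a small numerical sketch (with illustrative system matrices, not taken from the paper) that iterates the Riccati equation to a fixed point, forms the gain $K_*$, and checks the stability condition $\|A - BK_*\| < 1$.

```python
import numpy as np

# Sketch of Eq. (6): iterate S = Q + A^T S A - A^T S B (I + B^T S B)^{-1} B^T S A,
# then form K_* = (I + B^T S B)^{-1} B^T S A and check ||A - B K_*|| < 1.
# The system below is an arbitrary controllable/observable example.

def riccati_gain(A, B, Q, iters=500):
    n, d = B.shape
    S = np.copy(Q)
    for _ in range(iters):
        BSB = np.eye(d) + B.T @ S @ B
        S = Q + A.T @ S @ A - A.T @ S @ B @ np.linalg.solve(BSB, B.T @ S @ A)
    K_star = np.linalg.solve(np.eye(d) + B.T @ S @ B, B.T @ S @ A)
    return S, K_star

A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
S, K_star = riccati_gain(A, B, Q)
rho = np.linalg.norm(A - B @ K_star, 2)    # stability margin; < 1 for this example
print(K_star, rho)
```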
## 4. Analysis

Let $x^\pi = \lim_{t \to \infty} x^\pi_t$ be the limiting state under a $(K', C')$-bounded, $\rho$-stable linear policy $\pi = (K, c)$; in Lemma 6 we show that this limit exists. Let $\lambda_t(\pi) = \ell_t(x^\pi, \pi(x^\pi))$ be the loss of policy $\pi$ in state $x^\pi$. We decompose the regret as

$$\sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) = \Big( \sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \lambda_t(\pi_t) \Big) + \Big( \sum_{t=1}^T \lambda_t(\pi_t) - \sum_{t=1}^T \lambda_t(\pi) \Big) + \Big( \sum_{t=1}^T \lambda_t(\pi) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) \Big) ,$$

and define

$$\alpha_T = \sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \lambda_t(\pi_t) , \qquad \beta_T = \sum_{t=1}^T \lambda_t(\pi_t) - \sum_{t=1}^T \lambda_t(\pi) , \qquad \gamma_T = \sum_{t=1}^T \lambda_t(\pi) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) .$$

The terms $\alpha_T$ and $\gamma_T$ correspond to the difference between the losses of the policies in their stationary and transient states. The term $\beta_T$ measures the regret with respect to the optimal policy. The rest of this section is devoted to bounding these terms.

### 4.1. Bounding $\alpha_T$

To bound $\alpha_T$, we need to show that the sum of the terms $\ell_t(x_t, a_t) - \lambda_t(\pi_t)$ is small. Let $x^{\pi_t} = \lim_{s \to \infty} x^{\pi_t}_s$; we will show that this limit exists. Because $\lambda_t(\pi_t) = \ell_t(x^{\pi_t}, \pi_t(x^{\pi_t}))$ and $\ell_t(x_t, a_t) = \ell_t(x_t, \pi_t(x_t))$, we need to show that $x^{\pi_t}$ is close to $x_t$. This is done in a number of steps. First, we obtain the limiting state $x^{\pi_t}$ (Lemma 6 and the discussion after it). Then, we show that the chosen policy changes slowly. Given these two results, we bound $\|x_t - x^{\pi_t}\|$, which is then used to bound $\alpha_T$.

First, we study the behavior of the state vector under any bounded and stable policy. We show that the state converges to its stationary value exponentially fast.

Lemma 6. The limiting state $x^\pi = \lim_{t \to \infty} x^\pi_t$ under a $(K', C')$-bounded and $\rho$-stable linear policy $\pi = (K, c)$ exists and is equal to $x^\pi = (I - A + BK)^{-1} B c$. Further, we have that $\|x^\pi_t\| \le \|B\| C'/(1-\rho)$ and

$$\|x^\pi_{t+1} - x^\pi\| \le \rho^t \big\| (I - A + BK)^{-1} B c \big\| .$$

Proof. We have that

$$x^\pi_{t+1} = (A - BK) x^\pi_t + Bc = (A - BK)^2 x^\pi_{t-1} + (A - BK) Bc + Bc = (A - BK)^t x_1 + \sum_{s=1}^{t} (A - BK)^{t-s} Bc = \sum_{s=0}^{t-1} (A - BK)^s Bc ,$$

where we used $x_1 = 0$ in the last equality. Thus, as $t$ goes to infinity, $x^\pi_t \to (I - A + BK)^{-1} Bc$. This also implies that $\|x^\pi_t\| \le \|B\| C'/(1-\rho)$. It is also easy to see that

$$x^\pi_{t+1} - x^\pi = \sum_{s=0}^{t-1} (A - BK)^s Bc - \sum_{s=0}^{\infty} (A - BK)^s Bc = -(A - BK)^t (I - A + BK)^{-1} Bc ,$$

so that

$$\|x^\pi_{t+1} - x^\pi\| \le \|A - BK\|^t \big\| (I - A + BK)^{-1} Bc \big\| \le \rho^t \big\| (I - A + BK)^{-1} Bc \big\| .$$

Note that even if $\pi = (K, c) \notin \Pi$ but $A - BK$ is stable, the above argument is still valid and we get a similar result. In particular,

$$x^{\pi_t} = (I - A + B K_*)^{-1} B c_t . \qquad (7)$$

Letting $C$ be an upper bound on $\|c_t\|$ for $t \le T$, a similar argument also shows that

$$\|x_t\| \le \frac{\|B\| C}{1 - \rho} . \qquad (8)$$

In what follows, we use $X$ to denote this upper bound on the norm of the state vector, $X := \|B\| C/(1-\rho)$.

Next, we prove that the chosen policy changes slowly and that the bias terms of the policies are bounded by $C$, where

$$D = \big\| P_{*,22}^{-1} B^\top (I - A + B K_*)^{-\top} Q \big\| , \qquad C = D\, G \sqrt{1 + \|K_*\|^2 \|B\|^2/(1-\rho)^2} + \|K_*\| \, \|B\| \sqrt{1 + \|K_*\|^2 \|B\|^2/(1-\rho)^2} ,$$

and $P_*$ denotes the solution to Equation (3) corresponding to the gain matrix $K_*$. The proof can be found in Appendix A.

Lemma 7. We have that (i) $\|c_t\| \le C$, and (ii) $\|c_t - c_{t-1}\| \le (DG + 2C)/(t-1)$.

Next, we show that the limiting state of policy $\pi_t$ (i.e. $\lim_{s \to \infty} x^{\pi_t}_s$) is close to the state at time $t$. The proof can be found in Appendix A.

Lemma 8. If $t > \log(T-1)/\log(1/\rho)$, then

$$\|x_t - x^{\pi_t}\| \le \frac{\|B\| (DG + 2C) \big(1 + \log(t-1)\big)}{t - \log(t-1)/\log(1/\rho)} .$$

Also, we have that

$$\sum_{t=1}^T \|x_t - x^{\pi_t}\| \le \frac{1}{1-\rho} \left( \frac{4 \|B\| C \log T}{\log(1/\rho)} + \|B\| (DG + 2C)(1 + \log T) \Big( 1 + \log T + \frac{\log T}{\log(1/\rho)} \Big) \right) .$$

Remark 9. This lemma shows an $O(\log^2 T)$ bound on $\sum_{t=1}^T \|x_t - x^{\pi_t}\|$. As we will see, this leads to an $O(\log^2 T)$ regret bound. Let $\epsilon_t = \|x_t - x^{\pi_t}\|$. To get an $O(\log T)$ regret bound, we would need to show that $\epsilon_t = H_1/t$ for a constant $H_1$. A careful examination of the proof of Lemma 8 reveals that $\epsilon_t = H_2 \sum_{s=1}^{t-1} \rho^s/(t-s) + H_3$ for constants $H_2$ and $H_3$. Let $f(t) = \sum_{s=1}^{t-1} \rho^s/(t-s)$. We want $f(t) = H_4/t$ for a constant $H_4$, so that we get an $O(\log T)$ regret bound. To show this, we argue as follows: first establish the simple recurrence $f(t+1) = \rho f(t) + \rho/t$. This implies that $f(t) \to 0$ as $t \to \infty$. Now, define $g(t+1) = t f(t+1)$. Thus $g(t+1) = \rho g(t) + \rho f(t) + \rho$. Since $\rho < 1$ and $\lim_{t \to \infty} f(t) = 0$, $\lim_{t \to \infty} g(t)$ exists, and a simple calculation shows that this limit is $\rho/(1-\rho)$. This in fact implies that $f(t) \approx \rho/(t(1-\rho))$ asymptotically, which in turn implies that the regret is $O(\log T)$ asymptotically.

Now we are ready to bound $\alpha_T$.

Lemma 10. Let

$$Z_1 = 2\big(C \|K_*\| + G \|Q\|\big) + 2\big(\|Q\| + \|K_*\|^2\big) \frac{\|B\| C}{1-\rho} , \qquad Z_2 = \frac{4 \|B\| C \log T}{\log(1/\rho)} , \qquad Z_3 = \|B\| (DG + 2C)(1 + \log T) \Big( 1 + \log T + \frac{\log T}{\log(1/\rho)} \Big) .$$

Then we have that $\alpha_T \le Z_1 (Z_2 + Z_3)/(1-\rho)$.

Proof. For a policy $\pi = (K, c)$, we have

$$\ell_t(x, \pi(x)) = x^\top (Q + K^\top K) x - 2\big(c^\top K + g_t^\top Q\big) x + c^\top c + g_t^\top Q g_t .$$

For the policy $\pi_t = (K_t, c_t) = (K_*, c_t)$, define $S_t = Q + K_*^\top K_*$ and $d_t = -2(c_t^\top K_* + g_t^\top Q)$. We write

$$\begin{aligned}
\alpha_T &= \sum_{t=1}^T \big( x_t^\top S_t x_t + d_t x_t - x^{\pi_t\top} S_t x^{\pi_t} - d_t x^{\pi_t} \big) \\
&\le \sum_{t=1}^T \Big( \big| d_t (x^{\pi_t} - x_t) \big| + \big( \|S_t^{1/2} x_t\| + \|S_t^{1/2} x^{\pi_t}\| \big) \, \|S_t^{1/2} (x_t - x^{\pi_t})\| \Big) \\
&\le \sum_{t=1}^T \Big( \|d_t\| + \|S_t^{1/2}\| \big( \|S_t^{1/2} x_t\| + \|S_t^{1/2} x^{\pi_t}\| \big) \Big) \|x_t - x^{\pi_t}\| \\
&\le Z_1 (Z_2 + Z_3)/(1-\rho) ,
\end{aligned}$$

where we used Lemma 8 in the last step.
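The bounds above rest on the geometric convergence to the limiting state established in Lemma 6. A quick numerical check of that lemma, with illustrative matrices (not taken from the paper):

```python
import numpy as np

# Numerical check of Lemma 6: under pi(x) = -K x + c with ||A - BK|| < 1, the closed
# loop started at x_1 = 0 converges geometrically to x_pi = (I - A + BK)^{-1} B c.
# All matrices below are arbitrary example values.

A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.2]])
c = np.array([0.3])

rho = np.linalg.norm(A - B @ K, 2)
assert rho < 1, "policy must be stable for the limit to exist"

x_inf = np.linalg.solve(np.eye(2) - A + B @ K, B @ c)   # closed-form limiting state
x = np.zeros(2)
for t in range(1, 31):
    x = (A - B @ K) @ x + B @ c                         # x_{t+1} = A x_t + B(-K x_t + c)
    # distance to the limit shrinks at least as fast as rho^t
    assert np.linalg.norm(x - x_inf) <= rho ** t * np.linalg.norm(x_inf) + 1e-9
print(x, x_inf)
```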
### 4.2. Bounding $\beta_T$

The term $\beta_T$ is bounded by showing a reduction to regret minimization algorithms (in this case, the FTL algorithm). To use the regret bound of the FTL algorithm (Theorem 1), we need to show boundedness of the value functions.

Lemma 11. Let $X' = \|B\| C'/(1-\rho)$, $U = \max\{\|K_*\|, K'\} \max\{X, X'\} + \max\{C, C'\}$,

$$\bar{V} = \|P_*\| (X + U)^2 + \frac{2}{1-\rho}\big(G \|Q\| + \rho C \|P_*\|\big)(X + U) , \qquad F = 2 \|P_{*,22}\| U + \|P_{*,21}\| X + \frac{2}{1-\rho}\big(G \|Q\| + \rho C \|P_*\|\big) .$$

For any $t$, and any $(K', C')$-bounded, $\rho$-stable linear policy $\pi = (K, c)$: (i) $\|a_t\| \le U$; (ii) $\|-K x^\pi + c\| \le U$; (iii) $\|-K_* x^\pi + c_t\| \le U$. For any action $a$ such that $\|a\| \le U$: (iv) $|V_t(x^\pi, a)| \le \bar{V}$. Further, $V_t(x^\pi, \cdot)$ is Lipschitz in its second argument with constant $F$. Finally, the Hessians of the value functions are positive definite, with $H\big(V_t(x^\pi, \cdot)\big) \succeq 2I$.

The proof can be found in Appendix A.

Lemma 12. We have $\beta_T \le 2 \bar{V} F^2 (1 + \log T)$.

Proof. To bound $\beta_T$, first note that, since $x^\pi$ is the stationary state of the comparator $\pi$, the action $\pi(x^\pi)$ leaves the state at $x^\pi$, so evaluating the Bellman equation of $\pi_t$ at $(x^\pi, \pi(x^\pi))$ gives

$$V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) = \ell_t\big(x^\pi, \pi(x^\pi)\big) - \lambda_t(\pi_t) + V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) = \lambda_t(\pi) - \lambda_t(\pi_t) + V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) .$$

Hence

$$\lambda_t(\pi_t) - \lambda_t(\pi) = V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) - V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) ,$$

and therefore

$$\beta_T = \sum_{t=1}^T \Big( V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) - V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) \Big) .$$

Now notice that $V_{\pi_t,\ell_t}(x^\pi, \cdot)$ is exactly the loss function that is fed to the FTL strategy in state $x^\pi$. Lemma 11 shows that the conditions of Theorem 1 are satisfied. Thus, we get the result from the regret bound for the FTL algorithm (Theorem 1): $\beta_T \le 2 \bar{V} F^2 (1 + \log T)$.

### 4.3. Bounding $\gamma_T$

Finally, we bound $\gamma_T$. The proof is similar to the proof of Lemma 10 and can be found in Appendix A.

Lemma 13. Let $Z'_1 = (C' K' + G \|Q\|) + (\|Q\| + K'^2) \|B\| C'/(1-\rho)$. Then we have that $\gamma_T \le 2 Z'_1 \|B\| C'/(1-\rho)^2$.

### 4.4. Putting Everything Together

Proof of Theorem 5. The regret bound follows from Lemmas 10, 12, and 13.

## 5. Adversarially Chosen Transition Matrices

Our results can be extended to LQ problems with adversarial transition matrices,

$$x_{t+1} = A_t x_t + B_t a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t . \qquad (9)$$

Here, the transition matrices $A_t$ and $B_t$ and the target vector $g_t$ are chosen by an adversary. Once again, we measure the regret with respect to the class of linear policies; thus, any policy $\pi$ is identified by a pair $(K, c)$ such that $\pi(x) = -Kx + c$. The only no-regret algorithm for this setting is due to Abbasi-Yadkori et al. (2013), who propose an exponentially-weighted average algorithm and analyze it under a mixing assumption. Similarly, we make the following assumption:

Assumption A2 (Uniform Stability). The choices of the learner and the adversary are restricted to sets $\mathcal{K} \times \mathcal{C} \subset \mathbb{R}^{d \times n} \times \mathbb{R}^d$ and $\mathcal{A} \times \mathcal{B} \subset \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times d}$, respectively. There exists $0 < \rho < 1$ such that for any $A \in \mathcal{A}$, $B \in \mathcal{B}$, and any $K \in \mathcal{K}$, $\|A - BK\| \le \rho$. Further, there exist $K', C' > 0$ such that for any $K \in \mathcal{K}$ and $c \in \mathcal{C}$, $\|K\| \le K'$ and $\|c\| \le C'$.

The proposed algorithm for the LQ problem (9) is shown in Figure 3. The algorithm maintains a distribution over policies. The distribution has the form

$$q_t(\pi) \propto e^{-\eta \sum_{s=1}^{t-1} \ell_s(x^\pi_s, \pi(x^\pi_s))} , \qquad \eta > 0 . \qquad (10)$$

q0: the uniform distribution over Π; η = 1/√T
for t := 1, 2, . . . do
  With probability βt = wπt−1,t−1 / wπt−1,t−2, choose the previous policy, πt = πt−1; with probability 1 − βt, choose πt according to the distribution qt
  Learner takes the action at = −Kt xt + ct
  Simultaneously, the adversary chooses transition matrices At, Bt and target vector gt
  Learner suffers loss ℓt(xt, at) and observes At, Bt, gt
  Update state: xt+1 = At xt + Bt at
  Update the distribution: qt(π) ∝ wπ,t, where wπ,t = exp(−η Σ_{s=1..t} E[ℓs(xπs, π)])
end for

Figure 3. The Exponentially Weighted Algorithm for Linear Quadratic Problems
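A minimal sketch of the weighting scheme in Eq. (10) over a small finite set of linear policies is given below. It omits the lazy re-sampling device of Figure 3 (the $\beta_t$ step) and uses made-up bounded dynamics, so it only illustrates how the distribution $q_t$ is formed from each policy's own counterfactual trajectory.

```python
import numpy as np

# Sketch of Eq. (10): exponentially weighted distribution over a finite set of linear
# policies (K, c), driven by each policy's counterfactual loss. Dynamics, targets, and
# the policy grid are illustrative; the beta_t device of Figure 3 is omitted.

rng = np.random.default_rng(2)
n, d, T = 2, 1, 100
eta = 1.0 / np.sqrt(T)
Q = np.eye(n)

policies = [(np.array([[k1, k2]]), np.array([c]))
            for k1 in (0.0, 0.2) for k2 in (0.1, 0.3) for c in (-0.2, 0.2)]
states = [np.zeros(n) for _ in policies]          # x_s^pi: one trajectory per policy
cum_loss = np.zeros(len(policies))

for t in range(T):
    # adversary picks bounded (A_t, B_t, g_t); here drawn at random for illustration
    A_t = 0.5 * np.eye(n) + 0.1 * rng.uniform(-1, 1, (n, n))
    B_t = np.array([[0.0], [1.0]])
    g_t = rng.uniform(-1, 1, n)
    q_t = np.exp(-eta * (cum_loss - cum_loss.min()))
    q_t /= q_t.sum()                              # q_t(pi) prop. to exp(-eta * past loss)
    for i, (K, c) in enumerate(policies):
        x = states[i]
        a = -K @ x + c
        cum_loss[i] += (x - g_t) @ Q @ (x - g_t) + a @ a
        states[i] = A_t @ x + B_t @ a

print("final distribution over policies:", np.round(q_t, 3))
```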
The following theorem bounds the regret of this algorithm.

Theorem 14. Consider a uniformly stable system. The regret of the algorithm in Figure 3 with respect to a finite class of policies $\Pi$ is bounded by $O\big(\sqrt{T \log |\Pi|} + \log |\Pi|\big)$.

Proof. We prove the theorem by showing that the conditions of Theorem 1 in Abbasi-Yadkori et al. (2013) are satisfied.

Uniform mixing assumption: Let $P(\pi, A, B)$ be the transition probability kernel of policy $\pi = (K, c) \in \mathcal{K} \times \mathcal{C}$ under transition dynamics $(A, B) \in \mathcal{A} \times \mathcal{B}$. Let $p_1$ and $p'_1$ be two distributions over the state space, and let $p_2 = p_1 P(\pi, A, B)$ and $p'_2 = p'_1 P(\pi, A, B)$. Let $1 \le k \le n$ be the rank of $M = A - BK$, and let $\bar{M}$ be a $k \times k$ matrix whose eigenvalues are the nonzero eigenvalues of $M$. For $x \in \mathbb{R}^n$, let $x_r(x) \in \mathbb{R}^k$ be a parameterization of the component of $x$ that lies in the row space of $M$; similarly, let $x_n(x) \in \mathbb{R}^{n-k}$ be the orthogonal component in the null space of $M$. For $u \in \mathbb{R}^k$ and $v \in \mathbb{R}^{n-k}$, let $x(u, v)$ be the vector in $\mathbb{R}^n$ such that $x_r(x(u, v)) = u$ and $x_n(x(u, v)) = v$. Finally, let $p_r(u) = \int_{\mathbb{R}^{n-k}} p_1(x(u, v)) \, dv$, and define $p'_r$ analogously. Using integration by substitution, we get that

$$\begin{aligned}
\|p_2 - p'_2\|_1 &= \int_{\mathbb{R}^k} |p_2(y) - p'_2(y)| \, dy \\
&= \int_{\mathbb{R}^k} |p_r(u) - p'_r(u)| \, |\det(\bar{M})| \, du \\
&\le \rho \int_{\mathbb{R}^k} |p_r(u) - p'_r(u)| \, du \\
&= \rho \int_{\mathbb{R}^k} \Big| \int_{\mathbb{R}^{n-k}} p_1(x(u, v)) \, dv - \int_{\mathbb{R}^{n-k}} p'_1(x(u, v)) \, dv \Big| \, du \\
&\le \rho \int_{\mathbb{R}^k} \int_{\mathbb{R}^{n-k}} |p_1(x(u, v)) - p'_1(x(u, v))| \, dv \, du \\
&= \rho \int_{\mathbb{R}^n} |p_1(x) - p'_1(x)| \, dx = \rho \, \|p_1 - p'_1\|_1 ,
\end{aligned}$$

where the first inequality holds because every nonzero eigenvalue of $M$ has modulus at most $\|M\| \le \rho < 1$, so $|\det(\bar{M})| \le \rho^k \le \rho$. This shows that the uniform mixing assumption of Abbasi-Yadkori et al. (2013) is satisfied with the choice of mixing time $\tau = 1/\log(1/\rho)$.

Bounded losses: With an argument similar to the proof of Lemma 6, we can show that the state is bounded. This, together with the boundedness of the sets $\mathcal{K}$ and $\mathcal{C}$, gives that the action is bounded. Thus, all losses are bounded.

The results of Abbasi-Yadkori et al. (2013) also apply to the simpler setting of Section 3. However, sampling from the distribution (10) can be computationally expensive, whereas the FTL-MDP algorithm is computationally efficient.
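As a small numerical check of the determinant step in the mixing argument above: if $\|A - BK\| \le \rho < 1$, every nonzero eigenvalue of $M = A - BK$ has modulus at most $\rho$, so $|\det(\bar{M})| \le \rho^k \le \rho$. The matrices in the sketch are illustrative.

```python
import numpy as np

# Check: for M = A - BK with ||M|| <= rho < 1, the product of the nonzero eigenvalues
# of M (i.e. det(M_bar)) has modulus at most rho. Example matrices are arbitrary, with
# M chosen rank-deficient so that the M_bar construction is nontrivial.

A = np.array([[0.5, 0.1, 0.0], [0.0, 0.4, 0.0], [0.2, 0.0, 0.0]])
B = np.array([[0.0], [1.0], [0.0]])
K = np.array([[0.1, 0.2, 0.0]])
M = A - B @ K

rho = np.linalg.norm(M, 2)
eigvals = np.linalg.eigvals(M)
nonzero = eigvals[np.abs(eigvals) > 1e-10]       # eigenvalues of M_bar
det_Mbar = np.prod(nonzero)
print(rho, np.abs(det_Mbar))
assert rho < 1 and np.abs(det_Mbar) <= rho + 1e-9
```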
## 6. Conclusions and Future Work

We studied the problem of controlling linear systems with adversarial quadratic tracking losses, competing with the family of policies that compute actions as linear functions of the state. We presented an algorithm whose regret, with respect to such linear policies, is logarithmic in the number of rounds of the game. An interesting direction for future work is to consider more complex families of policies, such as the class of linear policies with a limited number of switches.

Existing tracking algorithms require the target sequence to be known in advance, and their computational complexity scales linearly with the length of the trajectory. The main difficulty in the setting studied here is the adversarial nature of the target vectors, which is very different from the classical setting. The key advance is to show how the idea of Even-Dar et al. (2009) (instantiating expert algorithms in all states) can be applied to the LQ problem, which has a continuous and unbounded state space. This is done by showing that the sequence of value functions and policies will be quadratic and linear, respectively, if we choose the right expert algorithm (FTL). The compact representation of value functions and policies allows an efficient implementation of the FTL algorithm.

We showed how a related approach can be applied to adversarially chosen, changing linear dynamics. Unfortunately, this algorithm is computationally expensive. A more challenging problem is to design efficient algorithms for the case of adversarially chosen, changing transition matrices. An interesting open problem is whether there is an efficient no-regret algorithm, or whether a computational lower bound can be established.

It might be possible to extend our results to LQ problems with fixed but unknown transition matrices of the form

$$x_{t+1} = A x_t + B a_t + w_{t+1} , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t ,$$

where $w_{t+1}$ is sub-Gaussian noise and the matrices $A$ and $B$ are unknown. We expect this extension to be fairly straightforward using techniques from Neu et al. (2012) and Abbasi-Yadkori and Szepesvári (2011). Our approach is similar to that of Neu et al. (2012), with the difference that we use FTL instead of FPL.

## References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.

Yasin Abbasi-Yadkori, Peter Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In NIPS, 2013.

D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2001.

S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the bet on the best principle. Communications in Information and Systems, 6(4):299–320, 2006.

M. C. Campi and P. R. Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890–1907, 1998.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

H. Chen and L. Guo. Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization, 25(4):845–867, 1987.

H. Chen and J. Zhang. Identification and adaptive control for systems with unknown orders, delay, and coefficients. IEEE Transactions on Automatic Control, 35(8):866–877, August 1990.

Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.

C. Fiechter. PAC adaptive control of linear systems. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 72–80. ACM Press, 1997.

R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.

T. L. Lai and C. Z. Wei. Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization, 25:466–481, March 1987.

T. L. Lai and Z. Ying. Efficient recursive estimation and adaptive control in stochastic regression and ARMAX models. Statistica Sinica, 16:741–772, 2006.

Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. In NIPS, 2010a.

Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortest path problem. In COLT, 2010b.

Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.