# Tracking Adversarial Targets

Yasin Abbasi-Yadkori (YASIN.ABBASIYADKORI@QUT.EDU.AU), Queensland University of Technology
Peter Bartlett (BARTLETT@EECS.BERKELEY.EDU), University of California, Berkeley and QUT
Varun Kanade (VKANADE@EECS.BERKELEY.EDU), University of California, Berkeley

We study linear control problems with quadratic losses and adversarially chosen tracking targets. We present an efficient algorithm for this problem and show that, under standard conditions on the linear system, its regret with respect to an optimal linear policy grows as $O(\log^2 T)$, where $T$ is the number of rounds of the game. We also study a problem with adversarially chosen transition dynamics; we present an exponentially-weighted average algorithm for this problem, and we give regret bounds that grow as $O(\sqrt{T})$.

## 1. Introduction

Consider a robot that controls an electron microscope to track a microorganism. Given the entire trajectory of the microorganism and the dynamics of the system, the optimal control can be computed. The trajectory, however, is not known in advance, and the target might behave in an arbitrary fashion. In such situations, designing a controller based only on prior knowledge about the target location might be sub-optimal; the observed behavior of the target must be taken into account.

We consider problems with linear transition functions and quadratic tracking losses. When the target trajectory is known in advance, the problem is called the linear quadratic (LQ) problem in the control community. The LQ problem is one of the most studied problems in the control literature and is widely applied in practice (Lai and Wei, 1982; 1987; Chen and Guo, 1987; Chen and Zhang, 1990; Fiechter, 1997; Lai and Ying, 2006; Campi and Kumar, 1998; Bittanti and Campi, 2006; Abbasi-Yadkori and Szepesvári, 2011). By the principles of dynamic programming, the optimal controller can be computed analytically. The solution is obtained by computing value functions, starting from the last round and recursing backward. This method needs to know the entire target sequence from the beginning, and its computational complexity scales linearly with the length of the trajectory. It turns out that the optimal controller is linear in the state vector, and the value functions are quadratic in state and action.

As we discussed earlier, the assumption that the entire trajectory is known in advance is not always realistic. But what would tracking mean without a reference trajectory? To make the problem well-defined, we fix a class of mappings from states to actions (also known as policies) as our competitors. Our objective is to track the trajectory nearly as well as the best policy in the comparison class in hindsight. The standard dynamic programming procedures are not applicable when the entire sequence is not known in advance. We show that we can still have an effective tracking algorithm even if the sequence is not known in advance. The proposed algorithm is perhaps the first tracking method that can deal with infinite and unknown sequences. We study the adversarial version of the LQ problem, where an adversary designs the trajectory of the target and reveals the target location only at the end of each round.
Formally, we study problems with transition dynamics and loss functions

$$x_{t+1} = A x_t + B a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t ,$$

where $x_t \in \mathbb{R}^n$ is the state at round $t$; $a_t \in \mathbb{R}^d$ is the action; $g_t$ is the target location; $\ell_t : \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}$ is the loss function; and $A$, $B$, $Q$ are known matrices.¹ The matrix $Q$ is symmetric and positive definite. The learner observes the sequence of states $(x_t)_t$. We make no assumptions on the trajectory sequence $(g_t)_t$ apart from its boundedness; the sequence might even be generated by an adversary. An LQ problem is defined by a 4-tuple $(A, B, Q, G)$, where $G$ is an upper bound on the norm of the vectors $g_t$.

¹ We can also use this formulation for loss functions of the form $\ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top R a_t$ for a positive definite matrix $R$, by writing $x_{t+1} = A x_t + (B R^{-1/2}) R^{1/2} a_t = A x_t + \widetilde{B} \widetilde{a}_t$ and $\ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top R^{1/2} R^{1/2} a_t = (x_t - g_t)^\top Q (x_t - g_t) + \widetilde{a}_t^\top \widetilde{a}_t$.

Let $T$ be a time horizon, $x^\pi_t$ be the state of the system if we run policy $\pi$ for $t$ rounds, and $\Pi$ be a class of policies. Let

$$\rho_T(g_1, \dots, g_T) = \sum_{t=1}^T \ell_t(x_t, a_t) - \min_{\pi \in \Pi} \sum_{t=1}^T \ell_t\big(x^\pi_t, \pi(x^\pi_t)\big) .$$

The objective of the learner is to suffer low loss. The performance is measured by the regret, defined by

$$R_T(A, B, Q, G, x_1, \Pi) = \sup_{g_1, \dots, g_T : \|g_t\| \le G} \rho_T(g_1, \dots, g_T) ,$$

where $\|\cdot\|$ denotes the 2-norm. In what follows, we write $R_T := R_T(A, B, Q, G, x_1, \Pi)$. For simplicity, we assume that $x_1 = 0$. As the optimal policy for the classical problem with a constant target is linear (strictly speaking, affine), a reasonable choice for the class of competitors is the set of linear (affine) policies.

The problem that we describe is an instance of Markov Decision Process (MDP) problems with fixed and known dynamics and changing loss functions. Note that the loss function depends on the target location, which can change in an arbitrary fashion. Such MDP problems were previously studied by Even-Dar et al. (2009) and Neu et al. (2010b;a). However, these papers provide results for finite MDP problems and are not applicable to problems with large state/action spaces. Our key finding is that the algorithm of Even-Dar et al. (2009) can be modified to be applicable in adversarial LQ problems. Interestingly, the resulting algorithm is identical to a Policy Iteration method (see, for example, Howard (1960)) with changing loss functions. Another interesting observation is that the gain matrix is independent of the target vectors (see Lemma 4). This simplifies the design and analysis of our algorithm. We prove that the regret of the algorithm is logarithmic in the number of rounds of the game.

Finally, we also study a more adversarial problem:

$$x_{t+1} = A_t x_t + B_t a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t .$$

Here, the time-varying transition matrices $A_t$ and $B_t$ and the target vector $g_t$ are chosen by an adversary. In this problem, we show that under a uniform stability assumption, an exponentially-weighted average algorithm recently proposed by Abbasi-Yadkori et al. (2013) enjoys an $O(\sqrt{T})$ regret bound with respect to the class of linear policies.
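As a concrete illustration of this setting, the following minimal sketch simulates the dynamics and accumulates the tracking loss of one fixed linear policy against an arbitrary bounded target sequence. All matrices, the policy, and the target sequence are made-up example values, not taken from the paper.

```python
import numpy as np

# Sketch of the problem setting: dynamics x_{t+1} = A x_t + B a_t, loss
# (x_t - g_t)^T Q (x_t - g_t) + a_t^T a_t, and the cumulative loss of a fixed
# linear policy pi(x) = -K x + c. All numerical values are illustrative.

rng = np.random.default_rng(0)
n, d, T = 2, 1, 50                         # state dim, action dim, horizon (arbitrary)
A = np.array([[0.9, 0.1], [0.0, 0.8]])     # assumed known dynamics for the demo
B = np.array([[0.0], [1.0]])
Q = np.eye(n)
G = 1.0
raw = rng.uniform(-1, 1, size=(T, n))      # stand-in for an adversarial target sequence
targets = G * raw / np.maximum(1.0, np.linalg.norm(raw, axis=1, keepdims=True))  # ||g_t|| <= G

def run_policy(K, c):
    """Roll out pi(x) = -K x + c from x_1 = 0 and return the total tracking loss."""
    x = np.zeros(n)
    total = 0.0
    for t in range(T):
        a = -K @ x + c
        g = targets[t]
        total += (x - g) @ Q @ (x - g) + a @ a
        x = A @ x + B @ a
    return total

K = np.array([[0.0, 0.5]])                 # an arbitrary bounded gain
print(run_policy(K, c=np.array([0.2])))
```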
## 2. Notation

We use $\sigma_{\min}(M)$ and $\sigma_{\max}(M)$ to denote the minimum and maximum eigenvalues of a positive semidefinite matrix $M$, respectively. We use $\|\cdot\|$ to denote the 2-norm of matrices and vectors, where the 2-norm of a matrix $M$ is defined by $\|M\| = \sqrt{\sigma_{\max}(M^\top M)}$. We write $M \succ 0$ to denote that $M$ is positive definite, and $M \succeq 0$ to denote that it is positive semidefinite. We use $M_{ij}$ to denote a block of the matrix $M$; the indices and the dimensionality of the block will be understood from the context. Similarly, $v_i$ denotes a block of the vector $v$.

## 3. Tracking Adversarial Targets

Even-Dar et al. (2009) study finite-state MDP problems with fixed and known transition functions and adversarial loss functions. Their algorithm (MDP-E) in its present form is not applicable to our problem with a continuous state space. Somewhat surprisingly, we can design a variant of the MDP-E algorithm that is applicable to our tracking problem with continuous state and action spaces.

The MDP-E algorithm, shown in Figure 1, maintains an expert algorithm in each state; that is, it treats each state as a separate problem of prediction with expert advice, where each action corresponds to an expert. At round $t$, the product of expert recommendations over the state space defines the policy, denoted by $\pi_t$. The algorithm takes action $a_t \sim \pi_t(x_t)$ and observes the loss function $\ell_t$. It computes the value function $V_{\pi_t,\ell_t}$ defined by the Bellman Optimality Equation: for some scalar $\lambda$ and for all $x, a$,

$$\lambda + V_{\pi_t,\ell_t}(x, a) = \ell_t(x, a) + \mathbb{E}_{x' \sim m(\cdot \mid x, a)}\big[ V_{\pi_t,\ell_t}(x', \pi_t(x')) \big] ,$$

where $m$ defines the state transition probabilities.² Then, the algorithm feeds the expert algorithm in state $x$ with $V_t(x, \cdot) = V_{\pi_t,\ell_t}(x, \cdot)$ as the loss function at time $t$. Thus, the computational cost of the MDP-E algorithm per round is $O(W + |X|)$, where $W$ is the cost of obtaining the value function and $X$ is the finite state space.

² In the Bellman Optimality Equation, the scalar $\lambda$ is the average loss of policy $\pi_t$, and $V_{\pi_t,\ell_t}(x, a)$ is the relative goodness of the state-action pair $(x, a)$ under policy $\pi_t$. Note that under certain assumptions the average loss is independent of the initial state; however, some states are more favorable, as the policy incurs lower losses during the transient phase when starting from those states.

Initialize an expert algorithm in each state
for t := 1, 2, . . . do
  Let πt(xt) be the prediction of the expert algorithm in state xt
  Take action at ∼ πt(xt)
  Observe loss function ℓt
  Compute Vt = Vπt,ℓt
  For all x, feed the expert algorithm in state x with loss Vt(x, ·)
end for

Figure 1. The MDP-E Algorithm

Applied to the LQ problem, the value functions are defined by: for some scalar $\lambda$ and for all $x, a$,

$$\lambda + V_t(x, a) = \ell_t(x, a) + V_t\big(Ax + Ba, \ \pi_t(Ax + Ba)\big) , \qquad (1)$$

where we use the notation $V_t = V_{\pi_t,\ell_t}$. The linear structure allows us to compute $V_t$ implicitly for all states, thus overcoming the difficulty of the infinite state space. As we will show, a suitable expert algorithm for our problem is the Follow The Leader (FTL) algorithm, which we define next.

Consider an online quadratic optimization problem where at round $t$ the adversary chooses a quadratic loss function $f_t$ defined over a convex set $D \subset \mathbb{R}^k$. Simultaneously, the learner makes a prediction $p_t \in D$, suffers loss $f_t(p_t)$, and observes the loss function $f_t$. The regret of the learner is defined by $B_T = \sum_{t=1}^T f_t(p_t) - \min_{p \in D} \sum_{t=1}^T f_t(p)$. The FTL algorithm makes the prediction

$$p_t = \operatorname*{argmin}_{p \in D} \ \sum_{s=1}^{t-1} f_s(p) .$$

The FTL algorithm enjoys the following regret bound for quadratic losses (Cesa-Bianchi and Lugosi, 2006, Theorem 3.1):

Theorem 1 (FTL Regret bound). Assume $f_t$ is convex, maps to $[0, C_1]$, is Lipschitz with constant $C_2$, and is twice differentiable everywhere with Hessian $H \succeq C_3 I$. Then the regret of the Follow The Leader algorithm is bounded by $B_T \le \frac{4 C_1 C_2^2}{C_3} (1 + \log T)$.
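For intuition about the expert algorithm, here is a small sketch of FTL on quadratic losses of the form $f_t(p) = \|p - y_t\|^2$ over a ball, the setting covered by Theorem 1. The loss sequence and the set $D$ are made up for illustration.

```python
import numpy as np

# Sketch of Follow The Leader on quadratic losses f_t(p) = ||p - y_t||^2 over a ball D.
# The sequence y_t below is arbitrary; it only illustrates the FTL prediction rule.

rng = np.random.default_rng(1)
k, T, radius = 3, 200, 1.0
ys = rng.uniform(-0.5, 0.5, size=(T, k))

def project(p, r=radius):
    """Project onto the Euclidean ball of radius r (the convex set D)."""
    nrm = np.linalg.norm(p)
    return p if nrm <= r else p * (r / nrm)

preds, cum_y = [], np.zeros(k)
for t in range(T):
    # FTL: argmin_p sum_{s<t} ||p - y_s||^2 over D = projection of the running mean
    p_t = project(cum_y / t) if t > 0 else np.zeros(k)
    preds.append(p_t)
    cum_y += ys[t]

ftl_loss = sum(np.sum((p - y) ** 2) for p, y in zip(preds, ys))
best_fixed = project(ys.mean(axis=0))      # best fixed prediction in hindsight
best_loss = np.sum((ys - best_fixed) ** 2)
print("FTL regret:", ftl_loss - best_loss)  # grows only logarithmically in T
```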
Figure 2 shows the FTL-MDP algorithm for the linear quadratic tracking problem. It corresponds to the MDP-E algorithm, with FTL as the expert algorithm for each state. The algorithm starts at state $x_1 = 0$, and the first policy is chosen to be $\pi_1(x) = -K_* x$, where $K_*$ is a gain matrix that will be defined later.³ The algorithm computes the total loss in each state, denoted by $V(x, \cdot) = \sum_{s=1}^{t-1} V_s(x, \cdot)$.

³ Adopting a convention from feedback control, we represent linear policies with a negative sign.

x1 = 0; for all x, π1(x) = −K*x, with K* given by Eq. (6)
for t := 1, 2, . . . do
  Take action at = πt(xt) and suffer the loss ℓt(xt, at)
  Move to the state xt+1 = Axt + Bat
  Compute the value function Vt = Vπt,ℓt   (2)
  Let V = Σ_{s=1..t} Vs
  Obtain the next policy by solving ∇a V(x, a) = 0: πt+1(x) = −Kt+1 x + ct+1
end for

Figure 2. FTL-MDP: The Follow The Leader Algorithm for Markov Decision Processes

The FTL strategy chooses the greedy action in each state, which is obtained by minimizing $V(x, \cdot)$. As the next lemma shows, the value functions computed in the FTL-MDP algorithm are always quadratic, and thus the function $V$ is always quadratic in state and action. This implies that the policies are linear in the state.

Lemma 2. Consider the MDP-E algorithm applied to the adversarial LQ problem $(A, B, Q, G)$. Let the expert algorithm be the FTL strategy. Assume the first policy is chosen to be an arbitrary linear policy, $\pi_1(x) = -K_1 x + c_1$. Then, for appropriate matrices $P_t$ and vectors $L_t$, the value function at time $t$ has the form

$$V_t(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top P_t \begin{bmatrix} x \\ a \end{bmatrix} + L_t^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

and the policy chosen by the algorithm at time $t$ is $\pi_t(x) = -K_t x + c_t$, where

$$K_t = \Big( \sum_{s=1}^{t-1} P_{s,22} \Big)^{-1} \sum_{s=1}^{t-1} P_{s,21} \qquad \text{and} \qquad c_t = -\Big( \sum_{s=1}^{t-1} P_{s,22} \Big)^{-1} \sum_{s=1}^{t-1} L_{s,2} \big/ 2 ,$$

and $P_{s,ij}$ and $L_{s,i}$ are the $ij$-th and $i$-th blocks of the matrix $P_s$ and the vector $L_s$, respectively. (Here the block structure naturally appears as the components corresponding to the state and the action.)

The proof uses the following lemma, which shows that the value of a linear policy is quadratic.

Lemma 3. Consider the LQ problem $(A, B, Q, G)$ with fixed target $g$. Let $K$ be a matrix such that $\|A - BK\| < 1$. The value function of the policy $\pi(x) = -Kx + c$ has the quadratic form

$$V_{\pi,\ell}(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top P \begin{bmatrix} x \\ a \end{bmatrix} + L^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

where $P = P(K)$ and $L$ are solutions to the equations

$$P = \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top P \begin{bmatrix} I \\ -K \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} + \begin{bmatrix} Q & 0 \\ 0 & I \end{bmatrix} , \qquad (3)$$

$$L = \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top L + 2 \begin{bmatrix} A & B \end{bmatrix}^\top \begin{bmatrix} I \\ -K \end{bmatrix}^\top P \begin{bmatrix} 0 \\ c \end{bmatrix} - \begin{bmatrix} 2 Q g \\ 0 \end{bmatrix} . \qquad (4)$$

Further, the loss in the limiting state $x^\pi$ is

$$\lambda = g^\top Q g + c^\top P_{22} c + L_2^\top c . \qquad (5)$$

The proof can be found in Appendix A.

Proof of Lemma 2. We prove the lemma by induction. By Lemma 3, the value of the first policy, $\pi_1(x) = -K_1 x + c_1$, has the form $V_1(x, a) = V_{\pi_1,\ell_1}(x, a) = [x; a]^\top P_1 [x; a] + L_1^\top [x; a]$ for some matrix $P_1$ and vector $L_1$. This establishes the base case. Next, assume that the value functions up to time $t-1$ are all quadratic. Because we use the FTL strategy as our expert algorithm in each state, the policy in state $x$ at round $t$ is

$$\pi_t(x) = \operatorname*{argmin}_{a} \ \sum_{s=1}^{t-1} V_s(x, a) .$$

We obtain the policy by setting $\nabla_a \sum_{s=1}^{t-1} V_s(x, a) = 0$. Letting $V_s(x, a) = [x; a]^\top P_s [x; a] + L_s^\top [x; a]$ for $s = 1, \dots, t-1$, we get that

$$\sum_{s=1}^{t-1} V_s(x, a) = \begin{bmatrix} x \\ a \end{bmatrix}^\top \widetilde{P}_t \begin{bmatrix} x \\ a \end{bmatrix} + \widetilde{L}_t^\top \begin{bmatrix} x \\ a \end{bmatrix} ,$$

where $\widetilde{P}_t = \sum_{s=1}^{t-1} P_s$ and $\widetilde{L}_t = \sum_{s=1}^{t-1} L_s$. Taking the gradient with respect to the second argument and setting it to zero, we get

$$\nabla_a V(x, a) = 2 \widetilde{P}_{t,21} x + 2 \widetilde{P}_{t,22} a + \widetilde{L}_{t,2} = 0 \quad \Longrightarrow \quad a = -\widetilde{P}_{t,22}^{-1} \widetilde{P}_{t,21} x - \tfrac{1}{2} \widetilde{P}_{t,22}^{-1} \widetilde{L}_{t,2} .$$

Thus, the policy can be compactly written as $\pi_t(x) = -K_t x + c_t$ for all $x$, where $K_t = \widetilde{P}_{t,22}^{-1} \widetilde{P}_{t,21}$ and $c_t = -\widetilde{P}_{t,22}^{-1} \widetilde{L}_{t,2}/2$. Given this linear policy, we get the quadratic value function $V_t$ from Equation (2).

Lemma 2 implies that the MDP-E algorithm can be efficiently implemented in the adversarial LQ problem.
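As a minimal numerical sketch of Lemma 3 and the FTL step of Lemma 2, the code below solves the fixed-point equations (3)–(4) for a given policy $(K, c)$ and target $g$, and then computes the greedy action that minimizes the quadratic value over the action block. The system matrices, policy, and target are illustrative values, not taken from the paper.

```python
import numpy as np

# Sketch: value function V(x, a) = z^T P z + L^T z with z = [x; a] of a policy
# pi(x) = -K x + c under loss (x-g)^T Q (x-g) + a^T a, via Eqs. (3)-(4), and the
# FTL greedy action from the (accumulated) P and L. Example values are arbitrary.

def value_function(A, B, Q, K, c, g, iters=500):
    n, d = B.shape
    F = np.hstack([A, B])                       # next state x' = F z
    M = np.vstack([np.eye(n), -K])              # next state-action z' = M x' + [0; c]
    e = np.concatenate([np.zeros(n), c])
    blk = np.block([[Q, np.zeros((n, d))], [np.zeros((d, n)), np.eye(d)]])
    P = np.zeros((n + d, n + d))
    for _ in range(iters):                      # Eq. (3): P = F^T M^T P M F + blkdiag(Q, I)
        P = F.T @ M.T @ P @ M @ F + blk
    MF = M @ F
    rhs = 2 * MF.T @ P @ e - np.concatenate([2 * Q @ g, np.zeros(d)])
    L = np.linalg.solve(np.eye(n + d) - MF.T, rhs)   # Eq. (4)
    return P, L

def greedy_action(P_sum, L_sum, x):
    """Minimize a (sum of) quadratic value function(s) over the action block."""
    n = x.shape[0]
    P22, P21, L2 = P_sum[n:, n:], P_sum[n:, :n], L_sum[n:]
    return -np.linalg.solve(P22, P21 @ x) - 0.5 * np.linalg.solve(P22, L2)

A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); K = np.array([[0.2, 0.5]])
P, L = value_function(A, B, Q, K, np.array([0.1]), g=np.array([1.0, 0.0]))
print(greedy_action(P, L, x=np.zeros(2)))
```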
Before stating the main result of this paper, we describe certain assumptions and definitions from the control literature (see, for example, Bertsekas (2001)).

Definition 1. A pair $(A, B)$, where $A$ is an $n \times n$ matrix and $B$ is an $n \times d$ matrix, is said to be controllable if the $n \times nd$ matrix $[B \ \ AB \ \ \dots \ \ A^{n-1}B]$ has full rank. A pair $(A, C)$, where $A$ is an $n \times n$ matrix and $C$ is a $d \times n$ matrix, is said to be observable if the pair $(A^\top, C^\top)$ is controllable.

Roughly speaking, controllability implies that the state can be moved arbitrarily by changing the actions, while observability implies that the state can be externally measured. We assume that the system is controllable and observable. These assumptions are standard in the literature, and they allow a closed-form expression for the optimal control law.

Assumption A1 (Controllability and observability). The pair $(A, B)$ is controllable and the pair $(A, Q^{1/2})$ is observable.

Under this assumption, the gain matrix is stable, i.e. there exists $\rho \in (0, 1)$ such that $\|A - BK_*\| \le \rho < 1$, where

$$K_* = (I + B^\top S B)^{-1} B^\top S A \qquad (6)$$

is the gain matrix (Bertsekas, 2001), and $S$ is the solution of the Riccati equation

$$S = Q + A^\top S A - A^\top S B (I + B^\top S B)^{-1} B^\top S A .$$

Interestingly, as the next lemma shows, all gain matrices are equal. The proof can be found in Appendix A.

Lemma 4. Consider the FTL-MDP algorithm. Let $P_* = P(K_*)$ as defined by (3). If we choose $K_1 = K_*$, then all gain matrices are equal, $K_* = K_1 = K_2 = K_3 = \dots$, and hence $P_* = P_1 = P_2 = P_3 = \dots$ .

Lemma 4 shows that the gain matrices are independent of the target vectors and can be computed by assuming that all target vectors are zero. Given the fixed gain matrix, the system is driven to a desired target position by changing the bias term of the linear policy.

We represent the linear policy $\pi(x) = -Kx + c$ by the pair $\pi = (K, c)$. The class of $(K', C')$-bounded, $\rho$-stable linear policies is defined by

$$\Pi = \big\{ \pi = (K, c) : \|A - BK\| \le \rho, \ \|K\| \le K', \ \|c\| \le C' \big\} .$$

Theorem 5. For a controllable, observable LQ problem $(A, B, Q, G)$, the regret of the FTL-MDP algorithm with respect to the class of $(K', C')$-bounded, $\rho$-stable linear policies is $O(\log^2 T)$, where the hidden constants are polynomials in $\|A\|$, $\|B\|$, $\|Q\|$, $G$, $\|S\|$, $\|P_*\|$, $1/\lambda_{\min}(P_*)$, $\|K_*\|$, $K'$, $C'$, and $1/(1-\rho)$.

The hidden constants are all small-order polynomials. As we will show in the next section, a careful asymptotic analysis gives an asymptotic $O(\log T)$ bound. It is an open problem to show a finite-time $O(\log T)$ regret bound with polynomial constants.
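To make Assumption A1 and Eq. (6) concrete, here is a small numerical sketch (with illustrative system matrices, not taken from the paper) that iterates the Riccati equation to a fixed point, forms the gain $K_*$, and checks the stability condition $\|A - BK_*\| < 1$.

```python
import numpy as np

# Sketch of Eq. (6): iterate S = Q + A^T S A - A^T S B (I + B^T S B)^{-1} B^T S A,
# then form K_* = (I + B^T S B)^{-1} B^T S A and check ||A - B K_*|| < 1.
# The system below is an arbitrary controllable/observable example.

def riccati_gain(A, B, Q, iters=500):
    n, d = B.shape
    S = np.copy(Q)
    for _ in range(iters):
        BSB = np.eye(d) + B.T @ S @ B
        S = Q + A.T @ S @ A - A.T @ S @ B @ np.linalg.solve(BSB, B.T @ S @ A)
    K_star = np.linalg.solve(np.eye(d) + B.T @ S @ B, B.T @ S @ A)
    return S, K_star

A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
S, K_star = riccati_gain(A, B, Q)
rho = np.linalg.norm(A - B @ K_star, 2)    # stability margin; < 1 for this example
print(K_star, rho)
```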
## 4. Analysis

Let $x^\pi = \lim_{t \to \infty} x^\pi_t$ be the limiting state under a $(K', C')$-bounded, $\rho$-stable linear policy $\pi = (K, c)$; in Lemma 6 we show that this limit exists. Let $\lambda_t(\pi) = \ell_t(x^\pi, \pi(x^\pi))$ be the loss of policy $\pi$ in state $x^\pi$. We decompose the regret as

$$\sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) = \Big( \sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \lambda_t(\pi_t) \Big) + \Big( \sum_{t=1}^T \lambda_t(\pi_t) - \sum_{t=1}^T \lambda_t(\pi) \Big) + \Big( \sum_{t=1}^T \lambda_t(\pi) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) \Big) ,$$

and define

$$\alpha_T = \sum_{t=1}^T \ell_t(x_t, a_t) - \sum_{t=1}^T \lambda_t(\pi_t) , \qquad \beta_T = \sum_{t=1}^T \lambda_t(\pi_t) - \sum_{t=1}^T \lambda_t(\pi) , \qquad \gamma_T = \sum_{t=1}^T \lambda_t(\pi) - \sum_{t=1}^T \ell_t(x^\pi_t, \pi(x^\pi_t)) .$$

The terms $\alpha_T$ and $\gamma_T$ correspond to the difference between the losses of the policies in their stationary and transient states. The term $\beta_T$ measures the regret with respect to the optimal policy. The rest of this section is devoted to bounding these terms.

### 4.1. Bounding $\alpha_T$

To bound $\alpha_T$, we need to show that the sum of the terms $\ell_t(x_t, a_t) - \lambda_t(\pi_t)$ is small. Let $x^{\pi_t} = \lim_{s \to \infty} x^{\pi_t}_s$; we will show that this limit exists. Because $\lambda_t(\pi_t) = \ell_t(x^{\pi_t}, \pi_t(x^{\pi_t}))$ and $\ell_t(x_t, a_t) = \ell_t(x_t, \pi_t(x_t))$, we need to show that $x^{\pi_t}$ is close to $x_t$. This is done in a number of steps. First, we obtain the limiting state $x^{\pi_t}$ (Lemma 6 and the discussion after it). Then, we show that the chosen policy changes slowly. Given these two results, we bound $\|x_t - x^{\pi_t}\|$, which is then used to bound $\alpha_T$.

First, we study the behavior of the state vector under any bounded and stable policy. We show that the state converges to its stationary value exponentially fast.

Lemma 6. The limiting state $x^\pi = \lim_{t \to \infty} x^\pi_t$ under a $(K', C')$-bounded and $\rho$-stable linear policy $\pi = (K, c)$ exists and is equal to $x^\pi = (I - A + BK)^{-1} B c$. Further, we have that $\|x^\pi_t\| \le \|B\| C'/(1-\rho)$ and

$$\|x^\pi_{t+1} - x^\pi\| \le \rho^t \big\| (I - A + BK)^{-1} B c \big\| .$$

Proof. We have that

$$x^\pi_{t+1} = (A - BK) x^\pi_t + Bc = (A - BK)^2 x^\pi_{t-1} + (A - BK) Bc + Bc = (A - BK)^t x_1 + \sum_{s=1}^{t} (A - BK)^{t-s} Bc = \sum_{s=0}^{t-1} (A - BK)^s Bc ,$$

where we used $x_1 = 0$ in the last equality. Thus, as $t$ goes to infinity, $x^\pi_t \to (I - A + BK)^{-1} Bc$. This also implies that $\|x^\pi_t\| \le \|B\| C'/(1-\rho)$. It is also easy to see that

$$x^\pi_{t+1} - x^\pi = \sum_{s=0}^{t-1} (A - BK)^s Bc - \sum_{s=0}^{\infty} (A - BK)^s Bc = -(A - BK)^t (I - A + BK)^{-1} Bc ,$$

so that

$$\|x^\pi_{t+1} - x^\pi\| \le \|A - BK\|^t \big\| (I - A + BK)^{-1} Bc \big\| \le \rho^t \big\| (I - A + BK)^{-1} Bc \big\| .$$

Note that even if $\pi = (K, c) \notin \Pi$ but $A - BK$ is stable, the above argument is still valid and we get a similar result. In particular,

$$x^{\pi_t} = (I - A + B K_*)^{-1} B c_t . \qquad (7)$$

Letting $C$ be an upper bound on $\|c_t\|$ for $t \le T$, a similar argument also shows that

$$\|x_t\| \le \frac{\|B\| C}{1 - \rho} . \qquad (8)$$

In what follows, we use $X$ to denote this upper bound on the norm of the state vector, $X := \|B\| C/(1-\rho)$.

Next, we prove that the chosen policy changes slowly and that the bias terms of the policies are bounded by $C$, where

$$D = \big\| P_{*,22}^{-1} B^\top (I - A + B K_*)^{-\top} Q \big\| , \qquad C = D\, G \sqrt{1 + \|K_*\|^2 \|B\|^2/(1-\rho)^2} + \|K_*\| \, \|B\| \sqrt{1 + \|K_*\|^2 \|B\|^2/(1-\rho)^2} ,$$

and $P_*$ denotes the solution to Equation (3) corresponding to the gain matrix $K_*$. The proof can be found in Appendix A.

Lemma 7. We have that (i) $\|c_t\| \le C$, and (ii) $\|c_t - c_{t-1}\| \le (DG + 2C)/(t-1)$.

Next, we show that the limiting state of policy $\pi_t$ (i.e. $\lim_{s \to \infty} x^{\pi_t}_s$) is close to the state at time $t$. The proof can be found in Appendix A.

Lemma 8. If $t > \log(T-1)/\log(1/\rho)$, then

$$\|x_t - x^{\pi_t}\| \le \frac{\|B\| (DG + 2C) \big(1 + \log(t-1)\big)}{t - \log(t-1)/\log(1/\rho)} .$$

Also, we have that

$$\sum_{t=1}^T \|x_t - x^{\pi_t}\| \le \frac{1}{1-\rho} \left( \frac{4 \|B\| C \log T}{\log(1/\rho)} + \|B\| (DG + 2C)(1 + \log T) \Big( 1 + \log T + \frac{\log T}{\log(1/\rho)} \Big) \right) .$$

Remark 9. This lemma shows an $O(\log^2 T)$ bound on $\sum_{t=1}^T \|x_t - x^{\pi_t}\|$. As we will see, this leads to an $O(\log^2 T)$ regret bound. Let $\epsilon_t = \|x_t - x^{\pi_t}\|$. To get an $O(\log T)$ regret bound, we would need to show that $\epsilon_t = H_1/t$ for a constant $H_1$. A careful examination of the proof of Lemma 8 reveals that $\epsilon_t = H_2 \sum_{s=1}^{t-1} \rho^s/(t-s) + H_3$ for constants $H_2$ and $H_3$. Let $f(t) = \sum_{s=1}^{t-1} \rho^s/(t-s)$. We want $f(t) = H_4/t$ for a constant $H_4$, so that we get an $O(\log T)$ regret bound. To show this, we argue as follows: first establish the simple recurrence $f(t+1) = \rho f(t) + \rho/t$. This implies that $f(t) \to 0$ as $t \to \infty$. Now, define $g(t+1) = t f(t+1)$. Thus $g(t+1) = \rho g(t) + \rho f(t) + \rho$. Since $\rho < 1$ and $\lim_{t \to \infty} f(t) = 0$, $\lim_{t \to \infty} g(t)$ exists, and a simple calculation shows that this limit is $\rho/(1-\rho)$. This in fact implies that $f(t) \approx \rho/(t(1-\rho))$ asymptotically, which in turn implies that the regret is $O(\log T)$ asymptotically.

Now we are ready to bound $\alpha_T$.

Lemma 10. Let

$$Z_1 = 2\big(C \|K_*\| + G \|Q\|\big) + 2\big(\|Q\| + \|K_*\|^2\big) \frac{\|B\| C}{1-\rho} , \qquad Z_2 = \frac{4 \|B\| C \log T}{\log(1/\rho)} , \qquad Z_3 = \|B\| (DG + 2C)(1 + \log T) \Big( 1 + \log T + \frac{\log T}{\log(1/\rho)} \Big) .$$

Then we have that $\alpha_T \le Z_1 (Z_2 + Z_3)/(1-\rho)$.

Proof. For a policy $\pi = (K, c)$, we have

$$\ell_t(x, \pi(x)) = x^\top (Q + K^\top K) x - 2\big(c^\top K + g_t^\top Q\big) x + c^\top c + g_t^\top Q g_t .$$

For the policy $\pi_t = (K_t, c_t) = (K_*, c_t)$, define $S_t = Q + K_*^\top K_*$ and $d_t = -2(c_t^\top K_* + g_t^\top Q)$. We write

$$\begin{aligned}
\alpha_T &= \sum_{t=1}^T \big( x_t^\top S_t x_t + d_t x_t - x^{\pi_t\top} S_t x^{\pi_t} - d_t x^{\pi_t} \big) \\
&\le \sum_{t=1}^T \Big( \big| d_t (x^{\pi_t} - x_t) \big| + \big( \|S_t^{1/2} x_t\| + \|S_t^{1/2} x^{\pi_t}\| \big) \, \|S_t^{1/2} (x_t - x^{\pi_t})\| \Big) \\
&\le \sum_{t=1}^T \Big( \|d_t\| + \|S_t^{1/2}\| \big( \|S_t^{1/2} x_t\| + \|S_t^{1/2} x^{\pi_t}\| \big) \Big) \|x_t - x^{\pi_t}\| \\
&\le Z_1 (Z_2 + Z_3)/(1-\rho) ,
\end{aligned}$$

where we used Lemma 8 in the last step.
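The bounds above rest on the geometric convergence to the limiting state established in Lemma 6. A quick numerical check of that lemma, with illustrative matrices (not taken from the paper):

```python
import numpy as np

# Numerical check of Lemma 6: under pi(x) = -K x + c with ||A - BK|| < 1, the closed
# loop started at x_1 = 0 converges geometrically to x_pi = (I - A + BK)^{-1} B c.
# All matrices below are arbitrary example values.

A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.2]])
c = np.array([0.3])

rho = np.linalg.norm(A - B @ K, 2)
assert rho < 1, "policy must be stable for the limit to exist"

x_inf = np.linalg.solve(np.eye(2) - A + B @ K, B @ c)   # closed-form limiting state
x = np.zeros(2)
for t in range(1, 31):
    x = (A - B @ K) @ x + B @ c                         # x_{t+1} = A x_t + B(-K x_t + c)
    # distance to the limit shrinks at least as fast as rho^t
    assert np.linalg.norm(x - x_inf) <= rho ** t * np.linalg.norm(x_inf) + 1e-9
print(x, x_inf)
```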
### 4.2. Bounding $\beta_T$

The term $\beta_T$ is bounded by showing a reduction to regret minimization algorithms (in this case, the FTL algorithm). To use the regret bound of the FTL algorithm (Theorem 1), we need to show boundedness of the value functions.

Lemma 11. Let $X' = \|B\| C'/(1-\rho)$, $U = \max\{\|K_*\|, K'\} \max\{X, X'\} + \max\{C, C'\}$,

$$\bar{V} = \|P_*\| (X + U)^2 + \frac{2}{1-\rho}\big(G \|Q\| + \rho C \|P_*\|\big)(X + U) , \qquad F = 2 \|P_{*,22}\| U + \|P_{*,21}\| X + \frac{2}{1-\rho}\big(G \|Q\| + \rho C \|P_*\|\big) .$$

For any $t$, and any $(K', C')$-bounded, $\rho$-stable linear policy $\pi = (K, c)$: (i) $\|a_t\| \le U$; (ii) $\|-K x^\pi + c\| \le U$; (iii) $\|-K_* x^\pi + c_t\| \le U$. For any action $a$ such that $\|a\| \le U$: (iv) $|V_t(x^\pi, a)| \le \bar{V}$. Further, $V_t(x^\pi, \cdot)$ is Lipschitz in its second argument with constant $F$. Finally, the Hessians of the value functions are positive definite, with $H\big(V_t(x^\pi, \cdot)\big) \succeq 2I$.

The proof can be found in Appendix A.

Lemma 12. We have $\beta_T \le 2 \bar{V} F^2 (1 + \log T)$.

Proof. To bound $\beta_T$, first note that, since $x^\pi$ is the stationary state of the comparator $\pi$, the action $\pi(x^\pi)$ leaves the state at $x^\pi$, so evaluating the Bellman equation of $\pi_t$ at $(x^\pi, \pi(x^\pi))$ gives

$$V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) = \ell_t\big(x^\pi, \pi(x^\pi)\big) - \lambda_t(\pi_t) + V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) = \lambda_t(\pi) - \lambda_t(\pi_t) + V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) .$$

Hence

$$\lambda_t(\pi_t) - \lambda_t(\pi) = V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) - V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) ,$$

and therefore

$$\beta_T = \sum_{t=1}^T \Big( V_{\pi_t,\ell_t}\big(x^\pi, \pi_t(x^\pi)\big) - V_{\pi_t,\ell_t}\big(x^\pi, \pi(x^\pi)\big) \Big) .$$

Now notice that $V_{\pi_t,\ell_t}(x^\pi, \cdot)$ is exactly the loss function that is fed to the FTL strategy in state $x^\pi$. Lemma 11 shows that the conditions of Theorem 1 are satisfied. Thus, we get the result from the regret bound for the FTL algorithm (Theorem 1): $\beta_T \le 2 \bar{V} F^2 (1 + \log T)$.

### 4.3. Bounding $\gamma_T$

Finally, we bound $\gamma_T$. The proof is similar to the proof of Lemma 10 and can be found in Appendix A.

Lemma 13. Let $Z'_1 = (C' K' + G \|Q\|) + (\|Q\| + K'^2) \|B\| C'/(1-\rho)$. Then we have that $\gamma_T \le 2 Z'_1 \|B\| C'/(1-\rho)^2$.

### 4.4. Putting Everything Together

Proof of Theorem 5. The regret bound follows from Lemmas 10, 12, and 13.

## 5. Adversarially Chosen Transition Matrices

Our results can be extended to LQ problems with adversarial transition matrices,

$$x_{t+1} = A_t x_t + B_t a_t , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t . \qquad (9)$$

Here, the transition matrices $A_t$ and $B_t$ and the target vector $g_t$ are chosen by an adversary. Once again, we measure the regret with respect to the class of linear policies; thus, any policy $\pi$ is identified by a pair $(K, c)$ such that $\pi(x) = -Kx + c$. The only no-regret algorithm for this setting is due to Abbasi-Yadkori et al. (2013), who propose an exponentially-weighted average algorithm and analyze it under a mixing assumption. Similarly, we make the following assumption:

Assumption A2 (Uniform Stability). The choices of the learner and the adversary are restricted to sets $\mathcal{K} \times \mathcal{C} \subset \mathbb{R}^{d \times n} \times \mathbb{R}^d$ and $\mathcal{A} \times \mathcal{B} \subset \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times d}$, respectively. There exists $0 < \rho < 1$ such that for any $A \in \mathcal{A}$, $B \in \mathcal{B}$, and any $K \in \mathcal{K}$, $\|A - BK\| \le \rho$. Further, there exist $K', C' > 0$ such that for any $K \in \mathcal{K}$ and $c \in \mathcal{C}$, $\|K\| \le K'$ and $\|c\| \le C'$.

The proposed algorithm for the LQ problem (9) is shown in Figure 3. The algorithm maintains a distribution over policies. The distribution has the form

$$q_t(\pi) \propto e^{-\eta \sum_{s=1}^{t-1} \ell_s(x^\pi_s, \pi(x^\pi_s))} , \qquad \eta > 0 . \qquad (10)$$

q0: the uniform distribution over Π; η = 1/√T
for t := 1, 2, . . . do
  With probability βt = wπt−1,t−1 / wπt−1,t−2, choose the previous policy, πt = πt−1; with probability 1 − βt, choose πt according to the distribution qt
  Learner takes the action at = −Kt xt + ct
  Simultaneously, the adversary chooses transition matrices At, Bt and target vector gt
  Learner suffers loss ℓt(xt, at) and observes At, Bt, gt
  Update state: xt+1 = At xt + Bt at
  Update the distribution: qt(π) ∝ wπ,t, where wπ,t = exp(−η Σ_{s=1..t} E[ℓs(xπs, π)])
end for

Figure 3. The Exponentially Weighted Algorithm for Linear Quadratic Problems
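A minimal sketch of the weighting scheme in Eq. (10) over a small finite set of linear policies is given below. It omits the lazy re-sampling device of Figure 3 (the $\beta_t$ step) and uses made-up bounded dynamics, so it only illustrates how the distribution $q_t$ is formed from each policy's own counterfactual trajectory.

```python
import numpy as np

# Sketch of Eq. (10): exponentially weighted distribution over a finite set of linear
# policies (K, c), driven by each policy's counterfactual loss. Dynamics, targets, and
# the policy grid are illustrative; the beta_t device of Figure 3 is omitted.

rng = np.random.default_rng(2)
n, d, T = 2, 1, 100
eta = 1.0 / np.sqrt(T)
Q = np.eye(n)

policies = [(np.array([[k1, k2]]), np.array([c]))
            for k1 in (0.0, 0.2) for k2 in (0.1, 0.3) for c in (-0.2, 0.2)]
states = [np.zeros(n) for _ in policies]          # x_s^pi: one trajectory per policy
cum_loss = np.zeros(len(policies))

for t in range(T):
    # adversary picks bounded (A_t, B_t, g_t); here drawn at random for illustration
    A_t = 0.5 * np.eye(n) + 0.1 * rng.uniform(-1, 1, (n, n))
    B_t = np.array([[0.0], [1.0]])
    g_t = rng.uniform(-1, 1, n)
    q_t = np.exp(-eta * (cum_loss - cum_loss.min()))
    q_t /= q_t.sum()                              # q_t(pi) prop. to exp(-eta * past loss)
    for i, (K, c) in enumerate(policies):
        x = states[i]
        a = -K @ x + c
        cum_loss[i] += (x - g_t) @ Q @ (x - g_t) + a @ a
        states[i] = A_t @ x + B_t @ a

print("final distribution over policies:", np.round(q_t, 3))
```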
The following theorem bounds the regret of this algorithm.

Theorem 14. Consider a uniformly stable system. The regret of the algorithm in Figure 3 with respect to a finite class of policies $\Pi$ is bounded by $O\big(\sqrt{T \log |\Pi|} + \log |\Pi|\big)$.

Proof. We prove the theorem by showing that the conditions of Theorem 1 in Abbasi-Yadkori et al. (2013) are satisfied.

Uniform mixing assumption: Let $P(\pi, A, B)$ be the transition probability kernel of policy $\pi = (K, c) \in \mathcal{K} \times \mathcal{C}$ under transition dynamics $(A, B) \in \mathcal{A} \times \mathcal{B}$. Let $p_1$ and $p'_1$ be two distributions over the state space, and let $p_2 = p_1 P(\pi, A, B)$ and $p'_2 = p'_1 P(\pi, A, B)$. Let $1 \le k \le n$ be the rank of $M = A - BK$, and let $\bar{M}$ be a $k \times k$ matrix whose eigenvalues are the nonzero eigenvalues of $M$. For $x \in \mathbb{R}^n$, let $x_r(x) \in \mathbb{R}^k$ be a parameterization of the component of $x$ that lies in the row space of $M$; similarly, let $x_n(x) \in \mathbb{R}^{n-k}$ be the orthogonal component in the null space of $M$. For $u \in \mathbb{R}^k$ and $v \in \mathbb{R}^{n-k}$, let $x(u, v)$ be the vector in $\mathbb{R}^n$ such that $x_r(x(u, v)) = u$ and $x_n(x(u, v)) = v$. Finally, let $p_r(u) = \int_{\mathbb{R}^{n-k}} p_1(x(u, v)) \, dv$, and define $p'_r$ analogously. Using integration by substitution, we get that

$$\begin{aligned}
\|p_2 - p'_2\|_1 &= \int_{\mathbb{R}^k} |p_2(y) - p'_2(y)| \, dy \\
&= \int_{\mathbb{R}^k} |p_r(u) - p'_r(u)| \, |\det(\bar{M})| \, du \\
&\le \rho \int_{\mathbb{R}^k} |p_r(u) - p'_r(u)| \, du \\
&= \rho \int_{\mathbb{R}^k} \Big| \int_{\mathbb{R}^{n-k}} p_1(x(u, v)) \, dv - \int_{\mathbb{R}^{n-k}} p'_1(x(u, v)) \, dv \Big| \, du \\
&\le \rho \int_{\mathbb{R}^k} \int_{\mathbb{R}^{n-k}} |p_1(x(u, v)) - p'_1(x(u, v))| \, dv \, du \\
&= \rho \int_{\mathbb{R}^n} |p_1(x) - p'_1(x)| \, dx = \rho \, \|p_1 - p'_1\|_1 ,
\end{aligned}$$

where the first inequality holds because every nonzero eigenvalue of $M$ has modulus at most $\|M\| \le \rho < 1$, so $|\det(\bar{M})| \le \rho^k \le \rho$. This shows that the uniform mixing assumption of Abbasi-Yadkori et al. (2013) is satisfied with the choice of mixing time $\tau = 1/\log(1/\rho)$.

Bounded losses: With an argument similar to the proof of Lemma 6, we can show that the state is bounded. This, together with the boundedness of the sets $\mathcal{K}$ and $\mathcal{C}$, gives that the action is bounded. Thus, all losses are bounded.

The results of Abbasi-Yadkori et al. (2013) also apply to the simpler setting of Section 3. However, sampling from the distribution (10) can be computationally expensive, whereas the FTL-MDP algorithm is computationally efficient.
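As a small numerical check of the determinant step in the mixing argument above: if $\|A - BK\| \le \rho < 1$, every nonzero eigenvalue of $M = A - BK$ has modulus at most $\rho$, so $|\det(\bar{M})| \le \rho^k \le \rho$. The matrices in the sketch are illustrative.

```python
import numpy as np

# Check: for M = A - BK with ||M|| <= rho < 1, the product of the nonzero eigenvalues
# of M (i.e. det(M_bar)) has modulus at most rho. Example matrices are arbitrary, with
# M chosen rank-deficient so that the M_bar construction is nontrivial.

A = np.array([[0.5, 0.1, 0.0], [0.0, 0.4, 0.0], [0.2, 0.0, 0.0]])
B = np.array([[0.0], [1.0], [0.0]])
K = np.array([[0.1, 0.2, 0.0]])
M = A - B @ K

rho = np.linalg.norm(M, 2)
eigvals = np.linalg.eigvals(M)
nonzero = eigvals[np.abs(eigvals) > 1e-10]       # eigenvalues of M_bar
det_Mbar = np.prod(nonzero)
print(rho, np.abs(det_Mbar))
assert rho < 1 and np.abs(det_Mbar) <= rho + 1e-9
```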
## 6. Conclusions and Future Work

We studied the problem of controlling linear systems with adversarial quadratic tracking losses, competing with the family of policies that compute actions as linear functions of the state. We presented an algorithm whose regret, with respect to such linear policies, is logarithmic in the number of rounds of the game. An interesting direction for future work is to consider more complex families of policies, such as the class of linear policies with a limited number of switches.

Existing tracking algorithms require the target sequence to be known in advance, and their computational complexity scales linearly with the length of the trajectory. The main difficulty in the setting studied here is the adversarial nature of the target vectors, which is very different from the classical setting. The key advance is to show how the idea of Even-Dar et al. (2009) (instantiating expert algorithms in all states) can be applied to the LQ problem, which has a continuous and unbounded state space. This is done by showing that the sequence of value functions and policies will be quadratic and linear, respectively, if we choose the right expert algorithm (FTL). The compact representation of value functions and policies allows an efficient implementation of the FTL algorithm.

We showed how a related approach can be applied to adversarially chosen, changing linear dynamics. Unfortunately, this algorithm is computationally expensive. A more challenging problem is to design efficient algorithms for the case of adversarially chosen, changing transition matrices. An interesting open problem is whether there is an efficient no-regret algorithm, or whether a computational lower bound can be established.

It might be possible to extend our results to LQ problems with fixed but unknown transition matrices of the form

$$x_{t+1} = A x_t + B a_t + w_{t+1} , \qquad \ell_t(x_t, a_t) = (x_t - g_t)^\top Q (x_t - g_t) + a_t^\top a_t ,$$

where $w_{t+1}$ is sub-Gaussian noise and the matrices $A$ and $B$ are unknown. We expect this extension to be fairly straightforward using techniques from Neu et al. (2012) and Abbasi-Yadkori and Szepesvári (2011). Our approach is similar to that of Neu et al. (2012), with the difference that we use FTL instead of FPL.

## References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.

Yasin Abbasi-Yadkori, Peter Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In NIPS, 2013.

D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2001.

S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the bet on the best principle. Communications in Information and Systems, 6(4):299–320, 2006.

M. C. Campi and P. R. Kumar. Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM Journal on Control and Optimization, 36(6):1890–1907, 1998.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

H. Chen and L. Guo. Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization, 25(4):845–867, 1987.

H. Chen and J. Zhang. Identification and adaptive control for systems with unknown orders, delay, and coefficients. IEEE Transactions on Automatic Control, 35(8):866–877, August 1990.

Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.

C. Fiechter. PAC adaptive control of linear systems. In Proceedings of the 10th Annual Conference on Computational Learning Theory, pages 72–80. ACM Press, 1997.

R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.

T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.

T. L. Lai and C. Z. Wei. Asymptotically efficient self-tuning regulators. SIAM Journal on Control and Optimization, 25:466–481, March 1987.

T. L. Lai and Z. Ying. Efficient recursive estimation and adaptive control in stochastic regression and ARMAX models. Statistica Sinica, 16:741–772, 2006.

Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. In NIPS, 2010a.

Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortest path problem. In COLT, 2010b.

Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.