# Distributionally Robust Linear Quadratic Control

Bahar Taşkesen (EPFL, bahar.taskesen@epfl.ch) · Dan A. Iancu (Stanford University, daniancu@stanford.edu) · Çağıl Koçyiğit (University of Luxembourg, cagil.kocyigit@uni.lu) · Daniel Kuhn (EPFL, daniel.kuhn@epfl.ch)

*37th Conference on Neural Information Processing Systems (NeurIPS 2023).*

**Abstract.** Linear-Quadratic-Gaussian (LQG) control is a fundamental control paradigm that has been studied and applied in various fields such as engineering, computer science, economics, and neuroscience. It involves controlling a system with linear dynamics and imperfect observations, subject to additive noise, with the goal of minimizing a quadratic cost function depending on the state and control variables. In this work, we consider a generalization of the discrete-time, finite-horizon LQG problem, where the noise distributions are unknown and belong to Wasserstein ambiguity sets centered at nominal (Gaussian) distributions. The objective is to minimize a worst-case cost across all distributions in the ambiguity set, including non-Gaussian distributions. Despite the added complexity, we prove that a control policy that is linear in the observations is optimal, as in the classic LQG problem. We propose a numerical solution method that efficiently characterizes this optimal control policy. Our method uses the Frank-Wolfe algorithm to identify the least favorable distributions within the Wasserstein ambiguity sets and computes the controller's optimal policy using Kalman filter estimation under these distributions.

## 1. Introduction

The Linear Quadratic Regulator (LQR) is a classic control problem that has served as a building block for numerous applications in engineering and computer science [3, 12], economics [29], and neuroscience [47]. It involves controlling a system with linear dynamics and imperfect observations affected by additive noise, with the goal of minimizing a quadratic state and control cost. Under the assumption that the noise terms are independent and normally distributed (a case referred to as Linear-Quadratic-Gaussian, or LQG), it is well known that the optimal control policy depends linearly on the observations and can be obtained efficiently by using the Kalman filtering procedure and dynamic programming [8].

Motivated by practical settings where noise distributions may not be readily available or may not be Gaussian, this paper considers a discrete-time, finite-horizon generalization of the LQG setting where noise distributions are unknown and are chosen adversarially from ambiguity sets characterized by a Wasserstein distance and centered around nominal (Gaussian) distributions. We show that, even under distributional ambiguity, the optimal control policy remains linear in the system's observations. Our proof is novel and does not rely on traditional recursive dynamic programming arguments. Instead, we re-parametrize the control policy in terms of the purified state observations, and we derive an upper bound for the resulting minimax formulation by relaxing the ambiguity set (from a Wasserstein ball into a Gelbrich ball) while simultaneously restricting the controller to linear dependencies. We then use convex duality to prove that this upper bound matches a lower bound obtained by restricting the ambiguity set in the dual of the minimax formulation. This implies the optimality of linear output feedback controllers, thus generalizing the classic results to a distributionally robust setting.
We also find that the worst-case distribution is actually Gaussian, which leads to a very efficient algorithm for finding optimal controllers. Specifically, we propose an algorithm based on the Frank-Wolfe first-order method that at every step solves sub-problems corresponding to classic LQG control problems, using Kalman filtering and dynamic programming. We show that this algorithm enjoys a sublinear convergence rate and is amenable to parallelization. Lastly, we implement the algorithm leveraging PyTorch's automatic differentiation module, and we find that it yields uniformly lower runtimes than a direct method (based on solving semidefinite programs) across all problem horizons.

### 1.1. Literature Review

This paper is related to the ample literature in control theory and engineering aimed at designing controllers that are robust to noise. The classic LQR/LQG theory, developed in the 1960s, examined linear dynamical systems in either the time or the frequency domain, seeking to minimize a combination of quadratic state and control costs (in the time domain) or the $\mathcal{H}_2$ norm of the system's transfer function (in the frequency domain). Motivated by findings that LQG controllers do not provide the guaranteed robust stability properties of LQR controllers [15], much effort has been devoted subsequently to designing controllers that are robust to worst-case perturbations, typically evaluated in terms of the $\mathcal{H}_\infty$ norm of the system's transfer function (see, e.g., [16, 53] for a comprehensive review of $\mathcal{H}_\infty$ and $\mathcal{H}_2$ controllers). Because $\mathcal{H}_\infty$ controllers tend to be overly conservative [32], various approaches have been proposed to balance the performance of nominal and robust controllers, e.g., by combining $\mathcal{H}_2$ and $\mathcal{H}_\infty$ approaches [7, 17].

A parallel stream of literature has considered risk-sensitive control [51], which minimizes an entropic risk measure instead of the expected quadratic cost. Although risk-sensitive control has a distributionally robust flavor (as the entropic risk measure is equivalent to a distributionally robust quadratic objective penalized via the Kullback-Leibler divergence), risk-sensitive control models do not admit a distributionally robust formulation because the entropic risk measure is convex but not coherent [22]. In contrast, our distributionally robust model provides a direct interpretation of the exact set of noise distributions against which the controller provides safeguards, and it leads to a computationally tractable framework for finding the optimal controller. In this sense, our work is more directly related to the literature on distributionally robust control, which seeks controllers that minimize expected costs under worst-case noise distributions [11, 33, 34, 41, 50, 52].

Closest to our work are [28, 33]. [33] proves the optimality of linear state-feedback control policies for a related minimax LQR model with a Wasserstein distance but with perfect state observations. With perfect observations, the optimal policies in the classic LQR formulation are independent of the noise distribution and are thus inherently already robust, so considering imperfect observations is what makes the problem significantly more challenging in our case. [28] studies a minimax formulation based on the Wasserstein distance with both state and observation noise but without any control policy, and focuses solely on the problem of estimating the states.
Several papers have considered robust formulations with imperfect observations but for constrained systems [5, 6, 34], which are more challenging; the common approach is to restrict attention to linear feedback policies for computational tractability, without proving their optimality. Also related is the recent literature stream on distributionally robust optimization using the Wasserstein distance [36]. Within this stream, the closest works are [38, 44], which consider the problem of minimax mean-squared-error estimation when ambiguity is modeled with a Wasserstein distance from a nominal Gaussian distribution. Our proof builds on some ideas from these papers (e.g., relying on the Gelbrich distance in the construction of the upper bound), which it combines with ideas from control theory on purified output feedback to obtain the overall construction. Also related is [2], which studies multistage distributionally robust problems with ambiguity sets given by a nested Wasserstein distance for stochastic processes and identifies computationally tractable cases. For a broader overview of developments related to optimal transport and the Wasserstein distance with an emphasis on computational tractability and applications in machine learning, we refer to [42].

Finally, our paper is also related to the literature that documents the optimality of linear/affine policies in (distributionally) robust dynamic optimization models. [10, 30] prove optimality for one-dimensional linear systems affected by additive noise and with perfect state observations, but with general (convex) state and/or control costs, [27, 49] provide computationally tractable approaches to quantifying the suboptimality of affine controllers in finite- or infinite-horizon settings, and [9, 21, 25] characterize the performance of affine policies in two-stage (distributionally) robust dynamic models.

**Notation.** All random objects are defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Thus, the distribution of any random vector $\xi : \Omega \to \mathbb{R}^d$ is given by the pushforward distribution $\mathbb{P}_\xi = \mathbb{P} \circ \xi^{-1}$ of $\mathbb{P}$ with respect to $\xi$. The expectation under $\mathbb{P}$ is denoted by $\mathbb{E}_{\mathbb{P}}[\cdot]$. For any $t \in \mathbb{Z}_+$, we set $[t] = \{0, \ldots, t\}$.

## 2. Problem Definition

We consider a discrete-time linear dynamical system

$$x_{t+1} = A_t x_t + B_t u_t + w_t \quad \forall t \in [T-1] \qquad (1)$$

with states $x_t \in \mathbb{R}^n$, control inputs $u_t \in \mathbb{R}^m$, process noise $w_t \in \mathbb{R}^n$ and system matrices $A_t \in \mathbb{R}^{n \times n}$ and $B_t \in \mathbb{R}^{n \times m}$. The controller only has access to imperfect state measurements

$$y_t = C_t x_t + v_t \quad \forall t \in [T-1] \qquad (2)$$

corrupted by observation noise $v_t \in \mathbb{R}^p$, where $C_t \in \mathbb{R}^{p \times n}$ and usually $p < n$ (so that observing $y_t$ would not allow reconstructing $x_t$ even if there were no observation noise). The control inputs $u_t$ are causal, i.e., they depend on the past observations $y_0, \ldots, y_t$ but not on the future observations $y_{t+1}, \ldots, y_{T-1}$. More precisely, the set of feasible control inputs $\mathcal{U}_y$ is the set of random vectors $(u_0, u_1, \ldots, u_{T-1})$ such that for every $t$ there exists a measurable control policy $\phi_t : \mathbb{R}^{p(t+1)} \to \mathbb{R}^m$ with $u_t = \phi_t(y_0, \ldots, y_t)$. Controlling the system generates costs that depend quadratically on the states and the controls:

$$J = \sum_{t=0}^{T-1} \left( x_t^\top Q_t x_t + u_t^\top R_t u_t \right) + x_T^\top Q_T x_T, \qquad (3)$$

where $Q_t \in \mathbb{S}^n_+$ and $R_t \in \mathbb{S}^m_{++}$ represent the state and input cost matrices, respectively. The exogenous random vectors $x_0$, $\{w_t\}_{t=0}^{T-1}$ and $\{v_t\}_{t=0}^{T-1}$ are mutually independent and follow probability distributions $\mathbb{P}_{x_0}$, $\{\mathbb{P}_{w_t}\}_{t=0}^{T-1}$ and $\{\mathbb{P}_{v_t}\}_{t=0}^{T-1}$, respectively.
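To make the setup concrete, the following minimal sketch simulates the dynamics (1)–(2) under a causal policy and evaluates the cost (3); all function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def simulate_lqg(A, B, C, Q, R, QT, x0, w, v, policy):
    """Roll out dynamics (1)-(2) under a causal policy and return cost (3).

    A, B, C, Q, R: lists of system/cost matrices indexed by t = 0..T-1
    x0: initial state; w, v: lists of noise realizations
    policy: callable mapping (t, observation history) to a control input
    """
    T = len(A)
    x, cost, ys = x0, 0.0, []
    for t in range(T):
        y = C[t] @ x + v[t]              # imperfect observation (2)
        ys.append(y)
        u = policy(t, ys)                # causal: depends on y_0, ..., y_t only
        cost += x @ Q[t] @ x + u @ R[t] @ u
        x = A[t] @ x + B[t] @ u + w[t]   # linear dynamics (1)
    return cost + x @ QT @ x             # terminal cost x_T' Q_T x_T

# Example: T = 5, n = m = p = 2, zero control policy
rng = np.random.default_rng(0)
T, n = 5, 2
I = np.eye(n)
A = [I + 0.1 * rng.standard_normal((n, n)) for _ in range(T)]
B = C = Q = R = [I] * T
w = [0.1 * rng.standard_normal(n) for _ in range(T)]
v = [0.1 * rng.standard_normal(n) for _ in range(T)]
print(simulate_lqg(A, B, C, Q, R, I, rng.standard_normal(n), w, v,
                   policy=lambda t, ys: np.zeros(n)))
```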
As the control inputs are causal, the system equations (1)–(2) imply that $x_t$, $u_t$ and $y_t$ can be expressed as measurable functions of the exogenous uncertainties $x_0$ as well as $w_s$ and $v_s$, $s \in [t]$, for every $t$. From now on we may thus assume without loss of generality that $\Omega = \mathbb{R}^n \times \mathbb{R}^{nT} \times \mathbb{R}^{pT}$ is the space of realizations of the exogenous uncertainties, $\mathcal{F}$ is the Borel σ-algebra on $\Omega$, and $\mathbb{P} = \mathbb{P}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{v_t})$, where $\mathbb{P}_1 \otimes \mathbb{P}_2$ denotes the independent coupling of the distributions $\mathbb{P}_1$ and $\mathbb{P}_2$. In this context, the classic LQG model assumes that $\mathbb{P}$ is known and Gaussian, and it seeks $u \in \mathcal{U}_y$ that minimizes $\mathbb{E}_{\mathbb{P}}[J]$. Appendix A reviews the standard approach for computing optimal control inputs by estimating states through Kalman filtering techniques and using dynamic programming. In contrast, we assume that $\mathbb{P}$ is only known to belong to an ambiguity set $\mathcal{W}$, and we formulate a distributionally robust LQG problem that seeks $u \in \mathcal{U}_y$ to minimize the worst-case expected cost:

$$\min_{u \in \mathcal{U}_y} \ \max_{\mathbb{P} \in \mathcal{W}} \ \mathbb{E}_{\mathbb{P}}\left[ \sum_{t=0}^{T-1} \left( x_t^\top Q_t x_t + u_t^\top R_t u_t \right) + x_T^\top Q_T x_T \right]. \qquad (4)$$

We construct the ambiguity set $\mathcal{W}$ as a ball based on the Wasserstein distance. Specifically, we assume that a nominal Gaussian distribution $\hat{\mathbb{P}} = \hat{\mathbb{P}}_{x_0} \otimes (\otimes_{t=0}^{T-1} \hat{\mathbb{P}}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \hat{\mathbb{P}}_{v_t})$ is available such that $\hat{\mathbb{P}}_{x_0} = \mathcal{N}(0, \hat{X}_0)$, $\hat{\mathbb{P}}_{w_t} = \mathcal{N}(0, \hat{W}_t)$ and $\hat{\mathbb{P}}_{v_t} = \mathcal{N}(0, \hat{V}_t)$ for all $t \in [T-1]$, and $\mathcal{W}$ is given by $\mathcal{W} = \mathcal{W}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathcal{W}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathcal{W}_{v_t})$, where

$$\begin{aligned} \mathcal{W}_{x_0} &= \left\{ \mathbb{P}_{x_0} \in \mathcal{P}(\mathbb{R}^n) : W(\hat{\mathbb{P}}_{x_0}, \mathbb{P}_{x_0}) \le \rho_{x_0},\ \mathbb{E}_{\mathbb{P}_{x_0}}[x_0] = 0 \right\} \\ \mathcal{W}_{w_t} &= \left\{ \mathbb{P}_{w_t} \in \mathcal{P}(\mathbb{R}^n) : W(\hat{\mathbb{P}}_{w_t}, \mathbb{P}_{w_t}) \le \rho_{w_t},\ \mathbb{E}_{\mathbb{P}_{w_t}}[w_t] = 0 \right\} \\ \mathcal{W}_{v_t} &= \left\{ \mathbb{P}_{v_t} \in \mathcal{P}(\mathbb{R}^p) : W(\hat{\mathbb{P}}_{v_t}, \mathbb{P}_{v_t}) \le \rho_{v_t},\ \mathbb{E}_{\mathbb{P}_{v_t}}[v_t] = 0 \right\}, \end{aligned}$$

and $W(\cdot, \cdot)$ is the 2-Wasserstein distance. Thus, by construction, all exogenous random variables $x_0, w_0, \ldots, w_{T-1}, v_0, \ldots, v_{T-1}$ are independent under every distribution in $\mathcal{W}$.

**Definition 1 (2-Wasserstein distance).** The 2-Wasserstein distance between two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathbb{R}^d$ with finite second moments is given by

$$W(\mathbb{P}_1, \mathbb{P}_2) = \left( \inf_{\pi \in \Pi(\mathbb{P}_1, \mathbb{P}_2)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \| \xi_1 - \xi_2 \|_2^2 \ \pi(\mathrm{d}\xi_1, \mathrm{d}\xi_2) \right)^{1/2},$$

where $\Pi(\mathbb{P}_1, \mathbb{P}_2)$ denotes the set of all couplings, that is, all joint distributions of the random variables $\xi_1$ and $\xi_2$ with marginal distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, respectively.

Our model strictly generalizes the classic LQG setting,¹ which can be recovered by choosing $\rho_{x_0} = \rho_{w_t} = \rho_{v_t} = 0$. The parameters $\rho$ thus allow quantifying the uncertainty about the nominal model and building robustness to mis-specification. We emphasize that the Wasserstein ambiguity set $\mathcal{W}$ contains many non-Gaussian distributions, and it is not readily obvious that the worst-case distribution in (4) is in fact Gaussian. However, the set $\mathcal{W}$ is also non-convex, as it contains only distributions under which the exogenous uncertainties are independent, which makes the distributionally robust LQG problem potentially difficult to solve.

¹ Our assumption that the noise terms are zero-mean is consistent with the standard LQG model [8]. Requiring $\mathbb{E}_{\mathbb{P}_{x_0}}[x_0] = 0$ is assumed for clarity and without loss of generality.
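For intuition on Definition 1, the 2-Wasserstein distance between two uniformly weighted empirical distributions with equally many atoms reduces to an optimal assignment problem, so it can be evaluated exactly on samples. A small sketch with scipy (illustrative, not the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2_empirical(X, Y):
    """2-Wasserstein distance between two empirical distributions with
    equally many, uniformly weighted atoms (the rows of X and Y).

    For uniform empirical measures of equal size, an optimal coupling in
    Definition 1 is a permutation, so the infimum is an assignment problem."""
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # squared distances
    rows, cols = linear_sum_assignment(cost)                    # optimal matching
    return np.sqrt(cost[rows, cols].mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples from N(0, I)
Y = rng.normal(0.0, 2.0, size=(500, 2))   # samples from N(0, 4 I)
print(wasserstein2_empirical(X, Y))       # roughly sqrt(2) * (2 - 1) = 1.41
```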
## 3. Nash Equilibrium and Optimality of Linear Output Feedback Controllers

We henceforth view the distributionally robust LQG problem as a zero-sum game between the controller, who chooses causal control inputs, and nature, who chooses a distribution $\mathbb{P} \in \mathcal{W}$. In this section we show that this game admits a Nash equilibrium, where nature's Nash strategy is a Gaussian distribution $\mathbb{P}^\star \in \mathcal{W}$ and the controller's Nash strategy is a linear output feedback policy based on the Kalman filter evaluated under $\mathbb{P}^\star$.

**Purified Observations.** Before outlining our proof strategy, we first simplify the problem formulation by re-parametrizing the control inputs in a more convenient form (following [5, 6, 27]). Note that the control inputs in the LQG formulation are subject to cyclic dependencies, as $u_t$ depends on $y_t$, while $y_t$ depends on $x_t$ through (2), and $x_t$ depends again on $u_t$ through (1), etc. Because these dependencies make the problem hard to analyze, it is preferable to consider the controls as functions of a new set of so-called purified observations instead of the actual observations $y_t$. Specifically, we first introduce a fictitious noise-free system

$$\hat{x}_{t+1} = A_t \hat{x}_t + B_t u_t \quad \text{and} \quad \hat{y}_t = C_t \hat{x}_t \quad \forall t \in [T-1]$$

with states $\hat{x}_t \in \mathbb{R}^n$ and outputs $\hat{y}_t \in \mathbb{R}^p$, which is initialized by $\hat{x}_0 = 0$ and controlled by the same inputs $u_t$ as the original system (1)–(2). We then define the purified observation at time $t$ as $\eta_t = y_t - \hat{y}_t$, and we use $\eta = (\eta_0, \ldots, \eta_{T-1})$ to denote the trajectory of all purified observations. As the inputs $u_t$ are causal, the controller can compute the fictitious state $\hat{x}_t$ and output $\hat{y}_t$ from the observations $y_0, \ldots, y_t$. Thus, $\eta_t$ is representable as a function of $y_0, \ldots, y_t$. Conversely, one can show by induction that $y_t$ can also be represented as a function of $\eta_0, \ldots, \eta_t$. Moreover, any measurable function of $y_0, \ldots, y_t$ can be expressed as a measurable function of $\eta_0, \ldots, \eta_t$ and vice versa [27, Proposition II.1]. So if we define $\mathcal{U}_\eta$ as the set of all control inputs $(u_0, u_1, \ldots, u_{T-1})$ such that $u_t = \psi_t(\eta_0, \ldots, \eta_t)$ for some measurable function $\psi_t : \mathbb{R}^{p(t+1)} \to \mathbb{R}^m$ for every $t \in [T-1]$, the above reasoning implies that $\mathcal{U}_\eta = \mathcal{U}_y$. In view of this, we can rewrite the distributionally robust LQG problem (4) equivalently as

$$p^\star = \left\{ \begin{array}{cl} \min\limits_{x, u, y}\ \max\limits_{\mathbb{P} \in \mathcal{W}} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & u \in \mathcal{U}_y,\ x = Hu + Gw,\ y = Cx + v \end{array} \right\} = \left\{ \begin{array}{cl} \min\limits_{x, u}\ \max\limits_{\mathbb{P} \in \mathcal{W}} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & u \in \mathcal{U}_\eta,\ x = Hu + Gw, \end{array} \right\} \qquad (5)$$

where $x = (x_0, \ldots, x_T)$, $u = (u_0, \ldots, u_{T-1})$, $y = (y_0, \ldots, y_{T-1})$, $w = (x_0, w_0, \ldots, w_{T-1})$, $v = (v_0, \ldots, v_{T-1})$, $\eta = (\eta_0, \ldots, \eta_{T-1})$, and $R$, $Q$, $H$, $G$ and $C$ are suitable block matrices (see Appendix B for their precise definitions). The latter reformulation involving the purified observations $\eta$ is useful because these are independent of the inputs. Indeed, by recursively combining the equations of the original and the noise-free systems, one can show that $\eta = Dw + v$ for some block triangular matrix $D$ (see Appendix B for its construction). This shows that the purified observations depend (linearly) on the exogenous uncertainties but not on the control inputs. Hence, the cyclic dependencies complicating the original system are eliminated in (5).
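The independence of $\eta$ from the control inputs can be checked numerically: rolling out the same noise realizations under two different controllers yields identical purified observations. A minimal sketch (illustrative names, not the paper's code):

```python
import numpy as np

def purified_observations(A, B, C, x0, w, v, policy):
    """Run the true system (1)-(2) and the fictitious noise-free copy in
    parallel and return the purified observations eta_t = y_t - yhat_t."""
    T = len(A)
    x, xhat, ys, etas = x0, np.zeros_like(x0), [], []
    for t in range(T):
        y = C[t] @ x + v[t]          # true observation
        yhat = C[t] @ xhat           # fictitious noise-free observation
        etas.append(y - yhat)        # purified observation
        ys.append(y)
        u = policy(t, ys)
        x = A[t] @ x + B[t] @ u + w[t]
        xhat = A[t] @ xhat + B[t] @ u
    return np.array(etas)

rng = np.random.default_rng(2)
T, n = 4, 3
A = [np.eye(n)] * T; B = C = A
x0 = rng.standard_normal(n)
w = [rng.standard_normal(n) for _ in range(T)]
v = [rng.standard_normal(n) for _ in range(T)]
e1 = purified_observations(A, B, C, x0, w, v, lambda t, ys: np.zeros(n))
e2 = purified_observations(A, B, C, x0, w, v, lambda t, ys: ys[-1])  # feedback
print(np.allclose(e1, e2))  # True: eta = Dw + v does not depend on u
```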
Subsequently, we also study the dual of (5), defined as

$$d^\star = \left\{ \begin{array}{cl} \max\limits_{\mathbb{P} \in \mathcal{W}}\ \min\limits_{x, u} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & u \in \mathcal{U}_\eta,\ x = Hu + Gw. \end{array} \right. \qquad (6)$$

The classic minimax inequality implies that $p^\star \geq d^\star$. If we can prove that $p^\star = d^\star$, that (5) has a solution $u^\star$ and that (6) has a solution $\mathbb{P}^\star$, then $(u^\star, \mathbb{P}^\star)$ must be a Nash equilibrium of the zero-sum game at hand [43, Theorem 2]. However, because $\mathcal{U}_\eta$ is an infinite-dimensional function space and $\mathcal{W}$ is an infinite-dimensional, non-convex set of non-parametric distributions, the existence of a Nash equilibrium (in pure strategies) is not at all evident. Instead, our proof strategy will rely on constructing an upper bound for $p^\star$ and a lower bound for $d^\star$, and showing that these match.

**Upper Bound for $p^\star$.** We obtain an upper bound for $p^\star$ by suitably enlarging the ambiguity set $\mathcal{W}$ and restricting the controllers $u_t$ to linear dependencies. We enlarge $\mathcal{W}$ by ignoring all information about the distributions in $\mathcal{W}$ except for their covariance matrices, and by replacing the Wasserstein distance with the Gelbrich distance. To that end, we first define the Gelbrich distance on the space of covariance matrices.

**Definition 2 (Gelbrich distance).** The Gelbrich distance between two covariance matrices $\Sigma_1, \Sigma_2 \in \mathbb{S}^d_+$ is given by

$$G(\Sigma_1, \Sigma_2) = \sqrt{ \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right) }.$$

We are interested in the Gelbrich distance because of its close connection to the 2-Wasserstein distance. Indeed, it is known that the 2-Wasserstein distance between two distributions with zero means is bounded below by the Gelbrich distance between the respective covariance matrices.

**Proposition 3.1 (Gelbrich bound [24, Theorem 2.1]).** For any two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ on $\mathbb{R}^d$ with zero means and covariance matrices $\Sigma_1, \Sigma_2 \in \mathbb{S}^d_+$, respectively, we have $W(\mathbb{P}_1, \mathbb{P}_2) \geq G(\Sigma_1, \Sigma_2)$.

Recalling that $\hat{X}_0$, $\hat{W}_t$ and $\hat{V}_t$ respectively denote the covariance matrices of $x_0$, $w_t$ and $v_t$ under the nominal distribution $\hat{\mathbb{P}}$, we can then define the following Gelbrich ambiguity set for the exogenous uncertainties: $\mathcal{G} = \mathcal{G}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathcal{G}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathcal{G}_{v_t})$, where

$$\begin{aligned} \mathcal{G}_{x_0} &= \left\{ \mathbb{P}_{x_0} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{E}_{\mathbb{P}_{x_0}}[x_0] = 0,\ \mathbb{E}_{\mathbb{P}_{x_0}}[x_0 x_0^\top] = X_0,\ G(X_0, \hat{X}_0) \le \rho_{x_0} \right\} \\ \mathcal{G}_{w_t} &= \left\{ \mathbb{P}_{w_t} \in \mathcal{P}(\mathbb{R}^n) : \mathbb{E}_{\mathbb{P}_{w_t}}[w_t] = 0,\ \mathbb{E}_{\mathbb{P}_{w_t}}[w_t w_t^\top] = W_t,\ G(W_t, \hat{W}_t) \le \rho_{w_t} \right\} \\ \mathcal{G}_{v_t} &= \left\{ \mathbb{P}_{v_t} \in \mathcal{P}(\mathbb{R}^p) : \mathbb{E}_{\mathbb{P}_{v_t}}[v_t] = 0,\ \mathbb{E}_{\mathbb{P}_{v_t}}[v_t v_t^\top] = V_t,\ G(V_t, \hat{V}_t) \le \rho_{v_t} \right\}. \end{aligned}$$

By construction, the random vectors $x_0$, $\{w_t\}_{t=0}^{T-1}$ and $\{v_t\}_{t=0}^{T-1}$ are thus mutually independent under any $\mathbb{P} \in \mathcal{G}$. In addition, and as a direct consequence of Proposition 3.1, $\mathcal{G}$ constitutes an outer approximation for the Wasserstein ambiguity set $\mathcal{W}$, as summarized in the next result.

**Corollary 1 (Gelbrich hull).** We have $\mathcal{W} \subseteq \mathcal{G}$.

Because $\mathcal{G}$ covers $\mathcal{W}$, we henceforth refer to it as the Gelbrich hull of the Wasserstein ambiguity set $\mathcal{W}$. To finalize our construction of the upper bound on $p^\star$, we focus on linear policies² of the form $u = q + U\eta = q + U(Dw + v)$, where $q = (q_0, \ldots, q_{T-1})$ and $U$ is a block lower triangular matrix

$$U = \begin{bmatrix} U_{0,0} & & & \\ U_{1,0} & U_{1,1} & & \\ \vdots & & \ddots & \\ U_{T-1,0} & \cdots & \cdots & U_{T-1,T-1} \end{bmatrix}. \qquad (7)$$

The block lower triangularity of $U$ ensures that the corresponding controller is causal, which in turn ensures that $u \in \mathcal{U}_\eta$. In the following, we denote by $\mathbb{U}$ the set of all block lower triangular matrices of the form (7). An upper bound on problem (5) can now be obtained by restricting the controller's feasible set to causal controllers that are linear in the purified observations $\eta$ and by relaxing nature's feasible set to the Gelbrich hull $\mathcal{G}$ of $\mathcal{W}$. The resulting bounding problem is given by

$$\overline{p} = \left\{ \begin{array}{cl} \min\limits_{U, q, x, u}\ \max\limits_{\mathbb{P} \in \mathcal{G}} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & U \in \mathbb{U},\ u = q + U(Dw + v),\ x = Hu + Gw. \end{array} \right. \qquad (8)$$

As we obtained (8) by restricting the feasible set of the outer minimization problem and relaxing the feasible set of the inner maximization problem in (5), it is clear that $\overline{p} \geq p^\star$. Recall also that problem (5) constitutes an infinite-dimensional zero-sum game, where the agents optimize over measurable policies and non-parametric distributions, respectively. In contrast, the next proposition shows that problem (8) is equivalent to a finite-dimensional zero-sum game.

**Proposition 3.2.** Problem (8) is equivalent to the optimization problem

$$\overline{p} = \min_{q \in \mathbb{R}^{mT},\ U \in \mathbb{U}}\ \max_{W \in \mathbb{G}_W,\ V \in \mathbb{G}_V}\ \mathrm{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \mathrm{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q, \qquad (9)$$

where

$$\mathbb{G}_W = \left\{ W \in \mathbb{S}^{n(T+1)}_+ : \begin{array}{l} W = \mathrm{diag}(X_0, W_0, \ldots, W_{T-1}),\ X_0 \in \mathbb{S}^n_+,\ W_t \in \mathbb{S}^n_+\ \forall t \in [T-1], \\ G(X_0, \hat{X}_0)^2 \le \rho_{x_0}^2,\ G(W_t, \hat{W}_t)^2 \le \rho_{w_t}^2\ \forall t \in [T-1] \end{array} \right\}$$

$$\mathbb{G}_V = \left\{ V \in \mathbb{S}^{pT}_+ : V = \mathrm{diag}(V_0, \ldots, V_{T-1}),\ V_t \in \mathbb{S}^p_+,\ G(V_t, \hat{V}_t)^2 \le \rho_{v_t}^2\ \forall t \in [T-1] \right\}.$$

We emphasize that Proposition 3.2 remains valid even if the nominal distribution $\hat{\mathbb{P}}$ fails to be normal.

² Technically, the policies are affine because they include a constant term, but we retain the more common terminology that focuses on the dependencies.
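As a quick numerical aside, Definition 2 is straightforward to evaluate with a matrix square root. The sketch below (illustrative, using scipy) reproduces, for Gaussian covariances, the empirical Wasserstein estimate from the example in Section 2, consistent with Proposition 3.1 and with the tightness result for normal distributions (Proposition 3.3 below).

```python
import numpy as np
from scipy.linalg import sqrtm

def gelbrich(S1, S2):
    """Gelbrich distance between two covariance matrices (Definition 2)."""
    r1 = sqrtm(S1)
    inner = sqrtm(r1 @ S2 @ r1)          # (S1^{1/2} S2 S1^{1/2})^{1/2}
    return np.sqrt(max(np.trace(S1 + S2 - 2 * inner).real, 0.0))

S1, S2 = np.eye(2), 4.0 * np.eye(2)
print(gelbrich(S1, S2))  # sqrt(2 * (2 - 1)^2) = 1.4142..., equals W2(N(0,S1), N(0,S2))
```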
Note also that, while nature's feasible set in (8) is non-convex due to the independence conditions, the sets $\mathbb{G}_W$ and $\mathbb{G}_V$ are convex and even semidefinite-representable thanks to the properties of the squared Gelbrich distance.³ By dualizing the inner maximization problem, one can therefore reformulate the minimax problem (9) as a convex semidefinite program (SDP). Even though this SDP is computationally tractable in theory, it involves $O(T(mp + n^2 + p^2))$ decision variables. For practically interesting problem dimensions, it thus quickly exceeds the capabilities of existing solvers.

³ Note that the ambiguity sets $\mathbb{G}_W$ and $\mathbb{G}_V$ appearing in (9) involve the squared Gelbrich distance $G(\Sigma_1, \Sigma_2)^2$. The reason is that $G(\Sigma_1, \Sigma_2)^2$ is known to be jointly convex in $(\Sigma_1, \Sigma_2)$ and semidefinite-representable [38, Proposition 2.3], unlike the Gelbrich distance $G(\Sigma_1, \Sigma_2)$ itself, which is generally non-convex.

**Lower Bound for $d^\star$.** To derive a tractable lower bound on $d^\star$, we restrict nature's feasible set to the family $\mathcal{W}_N$ of all normal distributions in the Wasserstein ambiguity set $\mathcal{W}$. The resulting bounding problem is thus given by

$$\underline{d} = \left\{ \begin{array}{cl} \max\limits_{\mathbb{P} \in \mathcal{W}_N}\ \min\limits_{x, u} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & u \in \mathcal{U}_\eta,\ x = Hu + Gw. \end{array} \right. \qquad (10)$$

As we obtained (10) by restricting the feasible set of the outer maximization problem in (6), it is clear that $\underline{d} \leq d^\star$. Next, we show that (10) can be recast as a finite-dimensional zero-sum game. This result critically relies on the following known fact: the 2-Wasserstein distance between two normal distributions coincides with the Gelbrich distance between their covariance matrices.

**Proposition 3.3 (Tightness for normal distributions [26, Proposition 7]).** For any two normal distributions $\mathbb{P}_1 = \mathcal{N}(0, \Sigma_1)$ and $\mathbb{P}_2 = \mathcal{N}(0, \Sigma_2)$ with zero means, we have $W(\mathbb{P}_1, \mathbb{P}_2) = G(\Sigma_1, \Sigma_2)$.

With this, we can provide a finite-dimensional reformulation, as summarized in the next result.

**Proposition 3.4.** Problem (10) is equivalent to the optimization problem

$$\underline{d} = \max_{W \in \mathbb{G}_W,\ V \in \mathbb{G}_V}\ \min_{q \in \mathbb{R}^{mT},\ U \in \mathbb{U}}\ \mathrm{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \mathrm{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q, \qquad (11)$$

where $\mathbb{G}_W$ and $\mathbb{G}_V$ are defined exactly as in Proposition 3.2.

Proposition 3.4 relies on Proposition 3.3 and thus fails to hold unless $\hat{\mathbb{P}}$ is normal. Also, one can again reformulate (11) as a tractable SDP by dualizing the inner minimization problem.

**Conclusions.** Propositions 3.2 and 3.4 reveal that problems (9) and (11) are dual to each other, that is, they can be transformed into one another by interchanging minimization and maximization. The following main theorem shows that strong duality holds irrespective of the problem data.

**Theorem 3.5 (Strong duality of (9) and (11)).** We have $\overline{p} = \underline{d}$.

Theorem 3.5 follows immediately from Sion's classic minimax theorem [45], which applies because $\mathbb{G}_W$ and $\mathbb{G}_V$ are convex as well as compact thanks to [38, Lemma A.6]. By weak duality and the construction of the bounding problems (9) and (11), we trivially have $\underline{d} \leq d^\star \leq p^\star \leq \overline{p}$. Theorem 3.5 reveals that all of these inequalities are in fact equalities, each of which gives rise to a non-trivial insight. The first key insight is that (5) and (6) are strong duals.

**Corollary 2 (Strong duality of (5) and (6)).** We have $p^\star = d^\star$.

We stress that, unlike Theorem 3.5, Corollary 2 establishes strong duality between two infinite-dimensional zero-sum games. The second key implication of Theorem 3.5 is that the distributionally robust LQG problem (5) is solved by a linear output-feedback controller.
**Corollary 3 (The controller's Nash strategy is linear in the observations).** There exist $U^\star \in \mathbb{U}$ and $q^\star \in \mathbb{R}^{mT}$ such that the distributionally robust LQG problem (5) is solved by $u^\star = q^\star + U^\star y$.

The identity $p^\star = \overline{p}$ readily implies that (5) is solved by a causal controller that is linear in the purified observations. However, any causal controller that is linear in the purified observations $\eta$ can be reformulated exactly as a causal controller that is linear in the original observations $y$ and vice versa [6, Proposition 3]. Thus, Corollary 3 follows. The third key implication of Theorem 3.5 is that the dual distributionally robust LQG problem is solved by a normal distribution.

**Corollary 4 (Nature's Nash strategy is a normal distribution).** The dual distributionally robust LQG problem (6) is solved by a distribution $\mathbb{P}^\star \in \mathcal{W}_N$.

Corollary 4 is a direct consequence of the identity $d^\star = \underline{d}$. Note that the optimal normal distribution $\mathbb{P}^\star$ is uniquely determined by the covariance matrices $W^\star$ and $V^\star$ of the exogenous uncertain parameters, which can be computed by solving problem (11). That the worst-case distribution is actually Gaussian is not a priori expected and is surprising given that the Wasserstein ball contains many non-Gaussian distributions.

## 4. Efficient Numerical Solution of Distributionally Robust LQG Problems

Having established these structural results, we next turn our attention to the problem of finding the optimal strategies. Our next result shows that, under a mild regularity condition, the optimal controller $u^\star$ of the distributionally robust LQG problem (5) can be computed efficiently from $\mathbb{P}^\star$.

**Proposition 4.1 (Optimality of Kalman filter-based feedback controllers).** If $\hat{V}_t \succ 0$ for all $t \in [T-1]$, then problem (6) is solved by a Gaussian distribution $\mathbb{P}^\star$ under which $v_t$ has a covariance matrix $V_t^\star \succ 0$ for every $t \in [T-1]$, and (5) is solved by the optimal LQG controller corresponding to $\mathbb{P}^\star$. Additionally, the optimal value of problem (9) and its strong dual (11) does not change if we restrict $\mathbb{G}_W$ and $\mathbb{G}_V$ to $\mathbb{G}_W^+$ and $\mathbb{G}_V^+$, respectively, where

$$\mathbb{G}_W^+ = \left\{ W \in \mathbb{G}_W : X_0 \succeq \lambda_{\min}(\hat{X}_0) I,\ W_t \succeq \lambda_{\min}(\hat{W}_t) I\ \forall t \in [T-1] \right\}, \qquad \mathbb{G}_V^+ = \left\{ V \in \mathbb{G}_V : V_t \succeq \lambda_{\min}(\hat{V}_t) I\ \forall t \in [T-1] \right\}.$$

This implies that the optimal controller can be computed by solving a classic LQG problem corresponding to nature's optimal strategy $\mathbb{P}^\star$, which can be done very efficiently through Kalman filtering and dynamic programming (see Appendix A for details). It thus suffices to design an efficient algorithm for computing $\mathbb{P}^\star$, which is uniquely determined by the covariance matrices $(W^\star, V^\star)$ that solve problem (11). To this end, we first reformulate (11) as

$$\max_{W \in \mathbb{G}_W^+,\ V \in \mathbb{G}_V^+} f(W, V), \qquad (12)$$

where we restrict $\mathbb{G}_W$ and $\mathbb{G}_V$ to $\mathbb{G}_W^+$ and $\mathbb{G}_V^+$, respectively, thanks to Proposition 4.1, and where $f(W, V)$ denotes the optimal value function of the inner minimization problem in (11). As (11) is a reformulation of (10) and as the family of all causal purified output-feedback controllers matches the family of causal output-feedback controllers, $f(W, V)$ can also be viewed as the optimal value of the classic LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. These insights lead to the following structural result.

**Proposition 4.2.** $f(W, V)$ is concave and $\beta$-smooth on $\mathbb{G}_W^+ \times \mathbb{G}_V^+$ for some $\beta > 0$.

By Proposition 4.2, it is possible to address problem (12) with a Frank-Wolfe algorithm [13, 18, 19, 20, 23, 35]. Each iteration of this algorithm solves a direction-finding subproblem, that is, a variant of problem (12) that maximizes the first-order Taylor expansion of $f(W, V)$ around the current iterates:
$$\max_{L_W \in \mathbb{G}_W^+,\ L_V \in \mathbb{G}_V^+}\ \left\langle \nabla_W f(W, V), L_W - W \right\rangle + \left\langle \nabla_V f(W, V), L_V - V \right\rangle. \qquad (13)$$

The next iterates are then obtained by moving towards a maximizer $(L_W^\star, L_V^\star)$ of (13), i.e., we update $(W, V) \leftarrow (W, V) + \alpha (L_W^\star - W, L_V^\star - V)$, where $\alpha$ is an appropriate step size. The proposed Frank-Wolfe algorithm enjoys a very low per-iteration complexity because problem (13) is separable. To see this, we reformulate (13) as

$$\begin{array}{cl} \max\limits_{L_W, L_V} & \left\langle \nabla_{X_0} f(W, V), L_{X_0} - X_0 \right\rangle + \sum\limits_{t=0}^{T-1} \left( \left\langle \nabla_{W_t} f(W, V), L_{W_t} - W_t \right\rangle + \left\langle \nabla_{V_t} f(W, V), L_{V_t} - V_t \right\rangle \right) \\ \text{s.t.} & G(L_{X_0}, \hat{X}_0)^2 \le \rho_{x_0}^2,\ G(L_{W_t}, \hat{W}_t)^2 \le \rho_{w_t}^2,\ G(L_{V_t}, \hat{V}_t)^2 \le \rho_{v_t}^2 \quad \forall t \in [T-1] \\ & L_{X_0} \succeq \lambda_{\min}(\hat{X}_0) I,\ L_{W_t} \succeq \lambda_{\min}(\hat{W}_t) I,\ L_{V_t} \succeq \lambda_{\min}(\hat{V}_t) I \quad \forall t \in [T-1]. \end{array}$$

Hence, (13) decomposes into $2T + 1$ separate subproblems that can be solved in parallel. That is, for any matrix $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ we solve a separate subproblem of the form

$$\max_{L_Z \succeq \lambda_{\min}(\hat{Z}) I} \left\{ \left\langle \nabla_Z f(W, V), L_Z - Z \right\rangle : G(L_Z, \hat{Z})^2 \le \rho_z^2 \right\}. \qquad (14)$$

These subproblems can be reformulated as tractable SDPs and are thus amenable to efficient off-the-shelf solvers. By [38, Theorem 6.2], however, one can exploit the structure of the Gelbrich distance in order to reduce (14) to a univariate algebraic equation that can be solved to any desired accuracy $\delta > 0$ by a highly efficient bisection algorithm. We say that $L_Z^\delta$ is a $\delta$-approximate solution of problem (14) for some $\delta \in (0, 1)$ if $L_Z^\delta$ is feasible in (14) and if $\langle \nabla_Z f(W, V), L_Z^\delta - Z \rangle \geq \delta \langle \nabla_Z f(W, V), L_Z^\star - Z \rangle$, where $L_Z^\star$ is an exact maximizer of (14). Note that, by the concavity of $f(W, V)$, the inner product on the right-hand side is nonnegative and vanishes if and only if $Z$ maximizes $f(W, V)$ over the feasible set of (14). For further details we refer to Appendix E in the supplementary material.

**Remark 1 (Automatic differentiation).** Recall that $f(W, V)$ coincides with the optimal value of the LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. By using the underlying dynamic programming equations, $f(W, V)$ can thus be expressed in closed form as a serial composition of $O(T)$ rational functions (see Appendix A for details). Hence, $\nabla_Z f(W, V)$ can be calculated symbolically for any $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ by repeatedly applying the chain and product rules. However, the resulting formulas are lengthy and cumbersome. We thus compute the gradients numerically using backpropagation. The cost of evaluating $\nabla_Z f(W, V)$ is then of the same order of magnitude as the cost of evaluating $f(W, V)$.

A detailed description of the proposed Frank-Wolfe method is given in Algorithm 1 below. By [31, Theorem 1 and Lemma 7], which applies thanks to the structural properties of $f(W, V)$ established in Proposition 4.2, Algorithm 1 attains a suboptimality gap of $\epsilon$ within $O(1/\epsilon)$ iterations.

**Algorithm 1** Frank-Wolfe algorithm for solving (12)

**Input:** initial iterates $W, V$; nominal covariance matrices $\hat{W}, \hat{V}$; oracle precision $\delta \in (0, 1)$
1: set initial iteration counter $k = 0$
2: **while** stopping criterion is not met **do**
3: &emsp;**for** $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$ **do in parallel**
4: &emsp;&emsp;compute $\nabla_Z f(W, V)$
5: &emsp;&emsp;find a $\delta$-approximate solution $L_Z^\delta$ of (14)
6: &emsp;**end for**
7: &emsp;$g \leftarrow \langle \nabla_W f(W, V), L_W^\delta - W \rangle + \langle \nabla_V f(W, V), L_V^\delta - V \rangle$
8: &emsp;$(W, V) \leftarrow (W, V) + \frac{2}{2+k} (L_W^\delta - W, L_V^\delta - V)$ and $k \leftarrow k + 1$
9: **end while**
10: **Output:** $W$ and $V$
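A schematic Python rendering of Algorithm 1 is sketched below; `grad_f` (the gradient of the LQG value function, e.g., via automatic differentiation as in Remark 1) and `lmo` (a δ-approximate oracle for (14), e.g., the bisection of Appendix E) are hypothetical helper functions standing in for components described elsewhere in the paper.

```python
def frank_wolfe(W, V, grad_f, lmo, tol=1e-3, max_iter=1000):
    """Frank-Wolfe loop for problem (12), mirroring Algorithm 1.

    W, V:    dicts of current covariance blocks {X0, W_t} and {V_t}
    grad_f:  returns dicts of gradients of f with respect to each block
    lmo:     delta-approximate linearization oracle solving (14) per block
    """
    for k in range(max_iter):
        gW, gV = grad_f(W, V)                      # steps 4-5 are independent
        LW = {z: lmo(gW[z], z) for z in W}         # per block and hence
        LV = {z: lmo(gV[z], z) for z in V}         # parallelizable
        # step 7: surrogate optimality gap (Frobenius inner products)
        gap = sum((gW[z] * (LW[z] - W[z])).sum() for z in W) \
            + sum((gV[z] * (LV[z] - V[z])).sum() for z in V)
        if gap < tol:                              # stopping criterion
            break
        alpha = 2.0 / (2.0 + k)                    # step 8: open-loop step size
        W = {z: W[z] + alpha * (LW[z] - W[z]) for z in W}
        V = {z: V[z] + alpha * (LV[z] - V[z]) for z in V}
    return W, V
```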
*Figure 1: (a) Execution time in seconds (log scale) for MOSEK and the Frank-Wolfe algorithm over 10 simulation runs as a function of the horizon T (solid lines show the mean and the shaded areas correspond to one standard deviation). (b) Convergence of the optimality gap for the Frank-Wolfe algorithm with horizon T = 10.*

## 5. Numerical Experiments

All experiments are run on an Intel i7-8700 CPU (3.2 GHz) machine with 16 GB RAM. All linear SDP problems are modeled in Python 3.8.6 using CVXPY [1, 14] and solved with MOSEK [37]. The gradients of $f(W, V)$ are computed via Pymanopt [48] with PyTorch's automatic differentiation module [39, 40].

Consider a class of distributionally robust LQG problems with $n = m = p = 10$. We set $A_t = 0.1\,\bar{A}$, where $\bar{A}$ has ones on the main diagonal and the superdiagonal and zeros everywhere else ($\bar{A}_{i,j} = 1$ if $i = j$ or $i = j - 1$ and $\bar{A}_{i,j} = 0$ otherwise), and we set the other matrices to $B_t = C_t = Q_t = R_t = I$. The Wasserstein radii are set to $\rho_{x_0} = \rho_{w_t} = \rho_{v_t} = 10^{-1}$. The nominal covariance matrices of the exogenous uncertainties are constructed randomly with eigenvalues in the interval $[1, 2]$ (so as to ensure they are positive definite). The code is publicly available in the GitHub repository https://github.com/RAO-EPFL/DR-Control.

The optimal value of the distributionally robust LQG problem (5) can be computed by directly solving the SDP reformulation of (11) with MOSEK or by solving the nonlinear SDP (12) with our Frank-Wolfe method detailed in Algorithm 1. We next compare these two approaches in 10 independent simulation runs, where we set a stopping criterion corresponding to an optimality gap below $10^{-3}$ and run the Frank-Wolfe method with $\delta = 0.95$. Figure 1a illustrates the execution time for both approaches as a function of the planning horizon T; runs where MOSEK exceeds 100 s are not reported. Figure 1b visualizes the empirical convergence behavior of the Frank-Wolfe algorithm. The results highlight that the Frank-Wolfe algorithm achieves running times that are uniformly lower than MOSEK across all problem horizons and is able to find highly accurate solutions already after a small number of iterations (50 iterations for problem instances with time horizon T = 10).
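For reference, the stated experimental configuration translates directly into numpy (the nominal covariances are drawn as described in Appendix F; variable names here are illustrative):

```python
import numpy as np

n = m = p = 10
T = 10                                            # planning horizon
Abar = np.eye(n) + np.diag(np.ones(n - 1), k=1)   # ones on main and superdiagonal
A = [0.1 * Abar] * T                              # A_t = 0.1 * Abar
B = C = Q = R = [np.eye(n)] * T                   # B_t = C_t = Q_t = R_t = I
QT = np.eye(n)                                    # terminal cost matrix
rho_x0 = rho_w = rho_v = 0.1                      # Wasserstein radii
```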
## 6. Concluding Remarks and Limitations

In view of the popularity of LQG models, the results in this work carry important theoretical and practical implications. Despite considering a generalization of the classic LQG setting where the noise affecting the system dynamics and the observations follows unknown (and potentially non-Gaussian) distributions, our findings show that certain classic structural results continue to hold and that highly efficient methods can be adapted to tackle this more realistic (and more challenging) problem. Specifically, that control policies depending linearly on observations continue to be optimal and that the worst-case distribution turns out to be Gaussian is surprising from a theoretical angle and also has direct practical implications, because it allows leveraging the highly efficient Kalman filter in conjunction with dynamic programming and a Frank-Wolfe method to design an efficient computational procedure for solving the problem.

The results also raise several important questions that warrant future exploration. First, it would be highly relevant to consider extensions where the system matrices are also affected by uncertainty, as this captures many applications of practical interest in, e.g., reinforcement learning or revenue management. Second, it would be worth exploring an infinite-horizon setting or relaxing the assumption that the nominal distribution is Gaussian, as both assumptions may be limiting the practical appeal of the framework. Third, one could also attempt to prove structural optimality results or design novel algorithms for generating high-quality suboptimal solutions for the more general setting involving constraints on states and/or control inputs. Lastly, one could improve the present algorithmic proposal by exploiting topological properties of the objective so as to guarantee linear convergence rates in the Frank-Wolfe procedure.

**Acknowledgements.** This research was supported by the Swiss National Science Foundation under the NCCR Automation, grant agreement 51NF40_180545. Dan A. Iancu would like to acknowledge INSEAD for financial support during the duration of the project.

## References

[1] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
[2] Rohit Arora and Rui Gao. Data-driven multistage distributionally robust optimization with nested distance: Time consistency and tractable dynamic reformulations. Available at Optimization Online, 2022.
[3] François Auger, Mickael Hilairet, Josep M. Guerrero, Eric Monmasson, Teresa Orlowska-Kowalska, and Seiichiro Katsura. Industrial applications of the Kalman filter: A review. IEEE Transactions on Industrial Electronics, 60(12):5458–5471, 2013.
[4] Richard Bellman. Some inequalities for the square root of a positive definite matrix. Linear Algebra and its Applications, 1(3):321–324, 1968.
[5] Aharon Ben-Tal, Stephen Boyd, and Arkadi Nemirovski. Control of uncertainty-affected discrete time linear systems via convex programming. Available at Optimization Online, 2005.
[6] Aharon Ben-Tal, Stephen Boyd, and Arkadi Nemirovski. Extending scope of robust optimization: Comprehensive robust counterparts of uncertain problems. Mathematical Programming, 107(1):63–89, 2006.
[7] Denis S. Bernstein and Wassim M. Haddad. LQG control with an H∞ performance bound: A Riccati equation approach. In American Control Conference, pages 796–802, 1988.
[8] Dimitri Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, 2017.
[9] Dimitris Bertsimas and Vineet Goyal. On the power and limitations of affine policies in two-stage adaptive optimization. Mathematical Programming, 134(2):491–531, 2012.
[10] Dimitris Bertsimas, Dan A. Iancu, and Pablo A. Parrilo. Optimality of affine policies in multistage robust optimization. Mathematics of Operations Research, 35(2):363–394, 2010.
[11] Dimitris Bertsimas, Dan A. Iancu, and Pablo A. Parrilo. A hierarchy of near-optimal policies for multistage adaptive optimization. IEEE Transactions on Automatic Control, 56(12):2809–2824, 2011.
[12] Shen-Yong Chen. Kalman filter for robot vision: A survey. IEEE Transactions on Industrial Electronics, 59(11):4409–4420, 2012.
[13] Vladimir F. Demyanov and Aleksandr M. Rubinov. Approximate Methods in Optimization Problems. Elsevier, 1970.
[14] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[15] John Doyle. Guaranteed margins for LQG regulators. IEEE Transactions on Automatic Control, 23(4):756–757, 1978.
[16] John Doyle, Keith Glover, Pramod Khargonekar, and Bruce Francis. State-space solutions to standard H2 and H∞ control problems. In American Control Conference, pages 1691–1696, 1988.
[17] John Doyle, Kemin Zhou, and Bobby Bodenheimer. Optimal control with mixed H2 and H∞ performance objectives. In American Control Conference, pages 2065–2070, 1989.
[18] Joseph C. Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2):187–211, 1979.
[19] Joseph C. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18(5):473–487, 1980.
[20] Joseph C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
[21] Omar El Housni and Vineet Goyal. On the optimality of affine policies for budgeted uncertainty sets. Mathematics of Operations Research, 46(2):674–711, 2021.
[22] Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. de Gruyter, 2011.
[23] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3(1-2):95–110, 1956.
[24] Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.
[25] Angelos Georghiou, Angelos Tsoukalas, and Wolfram Wiesemann. On the optimality of affine decision rules in robust and distributionally robust optimization. Available at Optimization Online, 2021.
[26] Clark R. Givens and Rae M. Shortt. A class of Wasserstein metrics for probability distributions. Michigan Mathematical Journal, 31(2):231–240, 1984.
[27] Michael J. Hadjiyiannis, Paul J. Goulart, and Daniel Kuhn. An efficient method to estimate the suboptimality of affine controllers. IEEE Transactions on Automatic Control, 56(12):2841–2853, 2011.
[28] Bingyan Han. Distributionally robust Kalman filtering with volatility uncertainty. arXiv preprint arXiv:2302.05993, 2023.
[29] Lars Peter Hansen and Thomas J. Sargent. Robust estimation and control under commitment. Journal of Economic Theory, 124(2):258–301, 2005.
[30] Dan A. Iancu, Mayank Sharma, and Maxim Sviridenko. Supermodularity and affine policies in dynamic robust optimization. Operations Research, 61(4):941–956, 2013.
[31] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
[32] Aren Karapetyan, Andrea Iannelli, and John Lygeros. On the regret of H∞ control. In IEEE Conference on Decision and Control, pages 6181–6186, 2022.
[33] Kihyun Kim and Insoon Yang. Distributional robustness in minimax linear quadratic control with Wasserstein distance. SIAM Journal on Control and Optimization, 61(2):458–483, 2023.
[34] Georgios Kotsalis, Guanghui Lan, and Arkadi S. Nemirovski. Convex optimization for finite-horizon robust covariance control of linear stochastic systems. SIAM Journal on Control and Optimization, 59(1):296–319, 2021.
[35] Evgenii S. Levitin and Boris T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
[36] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
[37] MOSEK ApS. The MOSEK Optimization Toolbox. Version 9.2, 2019.
[38] Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research, 48(1):1–37, 2023.
[39] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop, 2017.
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[41] Ian R. Petersen, Matthew R. James, and Paul Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398–412, 2000.
[42] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[43] R. Tyrrell Rockafellar. Conjugate Duality and Optimization. SIAM, 1974.
[44] Soroosh Shafieezadeh-Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin Esfahani. Wasserstein distributionally robust Kalman filtering. In Advances in Neural Information Processing Systems, pages 8483–8492, 2018.
[45] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
[46] Joelle Skaf and Stephen P. Boyd. Design of affine controllers via convex optimization. IEEE Transactions on Automatic Control, 55(11):2476–2487, 2010.
[47] Emanuel Todorov and Michael I. Jordan. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11):1226–1235, 2002.
[48] James Townsend, Niklas Koep, and Sebastian Weichwald. Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17(137):1–5, 2016.
[49] Bart P. G. Van Parys, Paul J. Goulart, and Manfred Morari. Infinite horizon performance bounds for uncertain constrained systems. IEEE Transactions on Automatic Control, 58(11):2803–2817, 2013.
[50] Bart P. G. Van Parys, Daniel Kuhn, Paul J. Goulart, and Manfred Morari. Distributionally robust control of constrained stochastic systems. IEEE Transactions on Automatic Control, 61(2):430–442, 2016.
[51] Peter Whittle. Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability, 13(4):764–777, 1981.
[52] Insoon Yang. Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Transactions on Automatic Control, 66(8):3863–3870, 2021.
[53] Kemin Zhou and John C. Doyle. Essentials of Robust Control. Prentice Hall, 1998.

The supplementary material is structured as follows. Appendix A presents the well-known solution to the classic LQG problem using dynamic programming and Kalman filter estimation. Appendix B provides the definitions of the stacked system matrices utilized in the compact formulation (5) of the distributionally robust LQG problem. Appendix C contains the proofs of the formal statements in the main text and provides additional technical results. Appendix D derives the SDP reformulation of the dual problem (11). Appendix E, finally, elaborates on the bisection algorithm used for solving the linearization oracle of the Frank-Wolfe algorithm.

## A. Solution of the LQG Problem

The classic LQG problem can be solved efficiently via dynamic programming; see, e.g., [8].
That is, the unique optimal control inputs satisfy $u_t^\star = K_t \hat{x}_t$ for every $t \in [T-1]$, where $K_t \in \mathbb{R}^{m \times n}$ is the optimal feedback gain matrix and $\hat{x}_t = \mathbb{E}_{\mathbb{P}}[x_t \mid y_0, \ldots, y_t]$ is the minimum mean-squared-error estimator of $x_t$ given the observation history up to time $t$. Thanks to the celebrated separation principle, $K_t$ can be computed by pretending that the system is deterministic and allows for perfect state observations, and $\hat{x}_t$ can be computed while ignoring the control problem. To compute $K_t$, one first solves the deterministic LQR problem corresponding to the LQG problem at hand. Its value function $x_t^\top P_t x_t$ at time $t$ is quadratic in $x_t$, and $P_t$ obeys the backward recursion

$$P_t = A_t^\top P_{t+1} A_t + Q_t - A_t^\top P_{t+1} B_t \left( R_t + B_t^\top P_{t+1} B_t \right)^{-1} B_t^\top P_{t+1} A_t \quad \forall t \in [T-1] \qquad (\text{A.15a})$$

initialized by $P_T = Q_T$. The optimal feedback gain matrix $K_t$ can then be computed from $P_{t+1}$ as

$$K_t = -\left( R_t + B_t^\top P_{t+1} B_t \right)^{-1} B_t^\top P_{t+1} A_t \quad \forall t \in [T-1]. \qquad (\text{A.15b})$$

Since $x_t$ and $(y_0, \ldots, y_t)$ are jointly normally distributed, the minimum mean-squared-error estimator $\hat{x}_t$ can be calculated directly using the formula for the mean of a conditional normal distribution. Alternatively, however, one can use the Kalman filter to compute $\hat{x}_t$ recursively, which is significantly more insightful and efficient. The Kalman filter also recursively computes the covariance matrix $\Sigma_t$ of $x_t$ conditional on $y_0, \ldots, y_t$ and the covariance matrix $\Sigma_{t+1|t}$ of $x_{t+1}$ conditional on $y_0, \ldots, y_t$, both evaluated under $\mathbb{P}$. Specifically, these covariance matrices obey the forward recursion

$$\Sigma_t = \Sigma_{t|t-1} - \Sigma_{t|t-1} C_t^\top \left( C_t \Sigma_{t|t-1} C_t^\top + V_t \right)^{-1} C_t \Sigma_{t|t-1}, \qquad \Sigma_{t+1|t} = A_t \Sigma_t A_t^\top + W_t \quad \forall t \in [T-1] \qquad (\text{A.16})$$

initialized by $\Sigma_{0|-1} = X_0$. Using $\Sigma_t$, we then define the Kalman filter gain as

$$L_t = \Sigma_t C_t^\top V_t^{-1} \quad \forall t \in [T-1],$$

which allows us to compute the minimum mean-squared-error estimator via the forward recursion

$$\hat{x}_{t+1} = A_t \hat{x}_t + B_t u_t + L_{t+1} \left( y_{t+1} - C_{t+1} (A_t \hat{x}_t + B_t u_t) \right) \quad \forall t \in [T-2]$$

initialized by $\hat{x}_0 = L_0 y_0$. One can also show that the optimal value of the LQG problem amounts to

$$\sum_{t=0}^{T-1} \mathrm{Tr}\left( (Q_t - P_t) \Sigma_t \right) + \sum_{t=1}^{T} \mathrm{Tr}\left( P_t \left( A_{t-1} \Sigma_{t-1} A_{t-1}^\top + W_{t-1} \right) \right) + \mathrm{Tr}(P_0 X_0). \qquad (\text{A.17})$$
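A compact numpy rendering of the recursions (A.15)–(A.17) is sketched below; helper names are illustrative, and the returned value is exactly the quantity used as $f(W, V)$ in the main text.

```python
import numpy as np

def lqg_value(A, B, C, Q, R, QT, X0, W, V):
    """Solve the classic LQG problem via (A.15)-(A.17).

    Returns the feedback gains K_t, filter covariances Sigma_t,
    and the optimal expected cost (A.17)."""
    T = len(A)
    # Backward Riccati recursion (A.15a) and gains (A.15b)
    P, K = [None] * (T + 1), [None] * T
    P[T] = QT
    for t in reversed(range(T)):
        S = R[t] + B[t].T @ P[t + 1] @ B[t]
        K[t] = -np.linalg.solve(S, B[t].T @ P[t + 1] @ A[t])
        P[t] = A[t].T @ P[t + 1] @ A[t] + Q[t] + A[t].T @ P[t + 1] @ B[t] @ K[t]
    # Forward Kalman covariance recursion (A.16)
    Sigma, pred = [None] * T, X0            # pred holds Sigma_{t|t-1}
    for t in range(T):
        Gt = pred @ C[t].T @ np.linalg.inv(C[t] @ pred @ C[t].T + V[t])
        Sigma[t] = pred - Gt @ C[t] @ pred
        pred = A[t] @ Sigma[t] @ A[t].T + W[t]
    # Optimal value (A.17)
    val = np.trace(P[0] @ X0)
    val += sum(np.trace((Q[t] - P[t]) @ Sigma[t]) for t in range(T))
    val += sum(np.trace(P[t + 1] @ (A[t] @ Sigma[t] @ A[t].T + W[t]))
               for t in range(T))
    return K, Sigma, val
```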
## B. Definitions of Stacked System Matrices

The stacked system matrices appearing in the distributionally robust LQG problem (5) are defined as follows. First, the stacked state and input cost matrices $Q \in \mathbb{S}^{n(T+1)}$ and $R \in \mathbb{S}^{mT}$ are set to

$$Q = \begin{bmatrix} Q_0 & & \\ & \ddots & \\ & & Q_T \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} R_0 & & \\ & \ddots & \\ & & R_{T-1} \end{bmatrix},$$

respectively. Similarly, the stacked matrices appearing in the linear dynamics and the measurement equations, $C \in \mathbb{R}^{pT \times n(T+1)}$, $G \in \mathbb{R}^{n(T+1) \times n(T+1)}$ and $H \in \mathbb{R}^{n(T+1) \times mT}$, are defined as

$$C = \begin{bmatrix} C_0 & & & 0 \\ & \ddots & & \vdots \\ & & C_{T-1} & 0 \end{bmatrix}, \quad G = \begin{bmatrix} A^0_0 & & & \\ A^1_0 & A^1_1 & & \\ \vdots & & \ddots & \\ A^T_0 & A^T_1 & \cdots & A^T_T \end{bmatrix}, \quad H = \begin{bmatrix} 0 & & & \\ A^1_1 B_0 & 0 & & \\ A^2_1 B_0 & A^2_2 B_1 & 0 & \\ \vdots & & & \ddots \\ A^T_1 B_0 & A^T_2 B_1 & \cdots & A^T_T B_{T-1} \end{bmatrix},$$

respectively, where $A^t_s = \prod_{k=s}^{t-1} A_k$ for every $s < t$ and $A^t_s = I_n$ for $s = t$. Using the stacked system matrices, we can now express the purified observation process $\eta$ as a linear function of the exogenous uncertainties $w$ and $v$ that is not impacted by $u$; see also [5, 46].

**Lemma B.1.** We have $\eta = Dw + v$, where $D = CG$.

*Proof of Lemma B.1.* The purified observation process is defined as $\eta = y - \hat{y}$. Recall now that the observations of the original system satisfy $y = Cx + v$. Similarly, one readily verifies that the observations of the fictitious noise-free system satisfy $\hat{y} = C\hat{x}$. Thus, we have $\eta = C(x - \hat{x}) + v$. Next, recall that the state of the original system satisfies $x = Hu + Gw$, and note that the state of the fictitious noise-free system satisfies $\hat{x} = Hu$. Combining all of these linear equations finally shows that $u$ cancels out and that $\eta = CGw + v = Dw + v$. □

### C.1. Additional Technical Results

It is well known that every causal controller that is linear in the original observations $y$ can be reformulated as a causal controller that is linear in the purified observations $\eta$ and vice versa [5, 46]. Perhaps surprisingly, however, the one-to-one transformation between the respective coefficients of $y$ and $\eta$ is not linear. To keep this paper self-contained, we review these insights in the next lemma.

**Lemma C.1.** If $u = U\eta + q$ for some $U \in \mathbb{U}$ and $q \in \mathbb{R}^{mT}$, then $u = U'y + q'$ for $U' = (I + UCH)^{-1}U$ and $q' = (I + UCH)^{-1}q$. Conversely, if $u = U'y + q'$ for some $U' \in \mathbb{U}$ and $q' \in \mathbb{R}^{mT}$, then $u = U\eta + q$ for $U = (I - U'CH)^{-1}U'$ and $q = (I - U'CH)^{-1}q'$.

*Proof of Lemma C.1.* If $u = U\eta + q$ for some $U \in \mathbb{U}$ and $q \in \mathbb{R}^{mT}$, then we have

$$u = U\eta + q = U(y - \hat{y}) + q = Uy - UC\hat{x} + q = Uy - UCHu + q,$$

where the second equality follows from the definition of $\eta$, the third equality holds because $y = Cx + v$, and the last equality exploits our earlier insight that $\hat{y} = C\hat{x}$ together with $\hat{x} = Hu$. The last expression depends only on $y$ and $u$. Solving for $u$ yields $u = U'y + q'$, where $U' = (I + UCH)^{-1}U$ and $q' = (I + UCH)^{-1}q$. Note that $I + UCH$ is indeed invertible because it is a lower triangular matrix with all diagonal entries equal to one, ensuring a determinant of one. Similarly, if $u = U'y + q'$ for some $U' \in \mathbb{U}$ and $q' \in \mathbb{R}^{mT}$, then we have

$$u = U'y + q' = U'(\eta + \hat{y}) + q' = U'\eta + U'C\hat{x} + q' = U'\eta + U'CHu + q'.$$

Solving for $u$ yields $u = U\eta + q$, where $U = (I - U'CH)^{-1}U'$ and $q = (I - U'CH)^{-1}q'$. Note again that $I - U'CH$ is indeed invertible because it is a lower triangular matrix with all diagonal entries equal to one. □

### C.2. Proofs of Section 3

*Proof of Proposition 3.2.* In problem (8), both $u$ and $x$ are linear in $w$ and $v$, i.e., $u = q + UDw + Uv$ and $x = Hu + Gw = Hq + HUDw + HUv + Gw$. By substituting the linear representations of $u$ and $x$ into the objective function of problem (8), we obtain the following equivalent reformulation:

$$\min_{q \in \mathbb{R}^{mT},\ U \in \mathbb{U}}\ \max_{\mathbb{P} \in \mathcal{G}}\ \mathbb{E}_{\mathbb{P}}\left[ w^\top \left( D^\top U^\top (R + H^\top Q H) U D + 2 D^\top U^\top H^\top Q G + G^\top Q G \right) w \right] + \mathbb{E}_{\mathbb{P}}\left[ v^\top U^\top (R + H^\top Q H) U v \right] + q^\top (R + H^\top Q H) q.$$

For any fixed $\mathbb{P} \in \mathcal{G}$, we can express the expectation in the objective function of the above problem in terms of the covariance matrices $W = \mathbb{E}_{\mathbb{P}}[ww^\top]$ and $V = \mathbb{E}_{\mathbb{P}}[vv^\top]$. Thus, the problem becomes

$$\begin{array}{cl} \min\limits_{q \in \mathbb{R}^{mT},\ U \in \mathbb{U}}\ \max\limits_{W, V, \mathbb{P}} & \mathrm{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \mathrm{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q \\ \text{s.t.} & \mathbb{P} \in \mathcal{G},\ W = \mathbb{E}_{\mathbb{P}}[ww^\top],\ V = \mathbb{E}_{\mathbb{P}}[vv^\top]. \end{array} \qquad (\text{A.18})$$

Recall now the definition of $\mathcal{G}$, and note that the requirements $G(X_0, \hat{X}_0) \le \rho_{x_0}$, $G(W_t, \hat{W}_t) \le \rho_{w_t}$ and $G(V_t, \hat{V}_t) \le \rho_{v_t}$ are equivalent to the convex constraints $G(X_0, \hat{X}_0)^2 \le \rho_{x_0}^2$, $G(W_t, \hat{W}_t)^2 \le \rho_{w_t}^2$ and $G(V_t, \hat{V}_t)^2 \le \rho_{v_t}^2$, respectively, for all $t \in [T-1]$. The definition of $\mathcal{G}$ also implies that $W = \mathbb{E}_{\mathbb{P}}[ww^\top] = \mathrm{diag}(X_0, W_0, \ldots, W_{T-1})$ and $V = \mathbb{E}_{\mathbb{P}}[vv^\top] = \mathrm{diag}(V_0, \ldots, V_{T-1})$. The inner maximization problem in (9) thus constitutes a relaxation of the inner maximization problem in (A.18), whose feasible set it contains. Moreover, for any $W$ and $V$ feasible in the inner maximization problem in (9), the distribution $\mathbb{P} = \mathbb{P}_{x_0} \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{w_t}) \otimes (\otimes_{t=0}^{T-1} \mathbb{P}_{v_t})$ defined through $\mathbb{P}_{x_0} = \mathcal{N}(0, X_0)$, $\mathbb{P}_{w_t} = \mathcal{N}(0, W_t)$ and $\mathbb{P}_{v_t} = \mathcal{N}(0, V_t)$, $t \in [T-1]$, is feasible in the inner maximization problem in (A.18) with the same objective value. The relaxation is thus exact, and the optimal values of (8), (9) and (A.18) coincide. □

*Proof of Proposition 3.4.* Recall that the space $\mathcal{U}_y$ of all causal output-feedback controllers coincides with the space $\mathcal{U}_\eta$ of all causal purified output-feedback controllers. We can thus replace the feasible set $\mathcal{U}_\eta$ of the inner minimization problem in (10) with $\mathcal{U}_y$.
Hence, for any fixed $\mathbb{P} \in \mathcal{W}_N$, the inner minimization problem in (10) constitutes a classic LQG problem. By standard LQG theory [8], it is solved by a linear output-feedback controller of the form $u = U'y + q'$ for some $U' \in \mathbb{U}$ and $q' \in \mathbb{R}^{mT}$; see also Appendix A. Lemma C.1 shows, however, that any linear output-feedback controller can be equivalently expressed as a linear purified output-feedback controller of the form $u = U\eta + q$ for some $U \in \mathbb{U}$ and $q \in \mathbb{R}^{mT}$. In summary, the above reasoning shows that the feasible set of the inner minimization problem in (10) can be reduced to the family of all linear purified output-feedback controllers without sacrificing optimality. Thus, problem (10) is equivalent to

$$\begin{array}{cl} \max\limits_{\mathbb{P} \in \mathcal{W}_N}\ \min\limits_{q, U, x, u} & \mathbb{E}_{\mathbb{P}}\left[ u^\top R u + x^\top Q x \right] \\ \text{s.t.} & U \in \mathbb{U},\ u = q + U\eta,\ x = Hu + Gw. \end{array}$$

Using a similar reasoning as in the proof of Proposition 3.2, we can now substitute the linear representations of $u$ and $x$ into the objective function and reformulate the above problem as

$$\begin{array}{cl} \max\limits_{W, V, \mathbb{P}}\ \min\limits_{q \in \mathbb{R}^{mT},\ U \in \mathbb{U}} & \mathrm{Tr}\left( \left( D^\top U^\top (R + H^\top Q H) U D + 2 G^\top Q H U D + G^\top Q G \right) W \right) + \mathrm{Tr}\left( U^\top (R + H^\top Q H) U V \right) + q^\top (R + H^\top Q H) q \\ \text{s.t.} & \mathbb{P} \in \mathcal{W}_N,\ W = \mathbb{E}_{\mathbb{P}}[ww^\top],\ V = \mathbb{E}_{\mathbb{P}}[vv^\top]. \end{array}$$

As $\mathcal{W}_N$ contains only normal distributions, Proposition 3.3 implies that $W(\mathbb{P}_{x_0}, \hat{\mathbb{P}}_{x_0}) = G(X_0, \hat{X}_0)$, $W(\mathbb{P}_{w_t}, \hat{\mathbb{P}}_{w_t}) = G(W_t, \hat{W}_t)$ and $W(\mathbb{P}_{v_t}, \hat{\mathbb{P}}_{v_t}) = G(V_t, \hat{V}_t)$ for all $t \in [T-1]$. We may thus replace the requirement $W(\mathbb{P}_{x_0}, \hat{\mathbb{P}}_{x_0}) \le \rho_{x_0}$ in the definition of $\mathcal{W}_N$ by $G(X_0, \hat{X}_0) \le \rho_{x_0}$, which is equivalent to the convex constraint $G(X_0, \hat{X}_0)^2 \le \rho_{x_0}^2$. The conditions on the marginal distributions of $w_t$ and $v_t$, $t \in [T-1]$, admit similar reformulations. The definition of $\mathcal{W}_N$ also implies that $W = \mathbb{E}_{\mathbb{P}}[ww^\top] = \mathrm{diag}(X_0, W_0, \ldots, W_{T-1})$ and $V = \mathbb{E}_{\mathbb{P}}[vv^\top] = \mathrm{diag}(V_0, \ldots, V_{T-1})$. Thus, the feasible set of the outer maximization problem in (11) constitutes a relaxation of that in (10). One readily verifies that the relaxation is exact by using similar arguments as in the proof of Proposition 3.2. Thus, the claim follows. □

*Proof of Theorem 3.5.* By Proposition 3.2, $\overline{p}$ coincides with the minimum of (9). Similarly, by Proposition 3.4, $\underline{d}$ coincides with the maximum of (11). Note that problems (9) and (11) only differ by the order of minimization and maximization. Note also that $\mathbb{U}$ is convex and closed, $\mathbb{G}_W$ and $\mathbb{G}_V$ are convex and compact by virtue of [38, Lemma A.6], and the (identical) objective functions of (9) and (11) are linear in $(W, V)$ and convex in $(U, q)$. The claim thus follows from Sion's minimax theorem [45]. □

### C.3. Proofs of Section 4

Note that Proposition 4.1 is consistent with Corollary 3 because the optimal LQG controller corresponding to $\mathbb{P}^\star$ is linear in the past observations.

*Proof of Proposition 4.1.* By [38, Lemma A.3], the inner problem in (9) admits a maximizer $(W^\star, V^\star)$ with $X_0^\star \succeq \lambda_{\min}(\hat{X}_0) I$ as well as $W_t^\star \succeq \lambda_{\min}(\hat{W}_t) I$ and $V_t^\star \succeq \lambda_{\min}(\hat{V}_t) I$ for all $t \in [T-1]$. Thus, the optimal value of problem (9) and its strong dual (11) does not change if we restrict $\mathbb{G}_W$ and $\mathbb{G}_V$ to $\mathbb{G}_W^+$ and $\mathbb{G}_V^+$, respectively. We may thus conclude that problem (11) has a maximizer $(W^\star, V^\star)$ with $V_t^\star \succeq \lambda_{\min}(\hat{V}_t) I \succ 0$ for all $t \in [T-1]$. This in turn implies that problem (6) is solved by a normal distribution $\mathbb{P}^\star$ under which the covariance matrix of the observation noise $v_t$ satisfies $V_t^\star \succ 0$ for every $t \in [T-1]$. As (5) and (6) are strong duals, the optimal solution $u^\star$ of problem (5) forms a Nash equilibrium with $\mathbb{P}^\star$, i.e., $u^\star$ is a best response to $\mathbb{P}^\star$ and thus solves the classic LQG problem corresponding to $\mathbb{P}^\star$.
As $R_t \succ 0$ for every $t \in [T-1]$, this best response $u^\star$ is unique, and as $V_t^\star \succ 0$ for every $t \in [T-1]$, $u^\star$ is in fact the Kalman filter-based optimal output-feedback strategy corresponding to $\mathbb{P}^\star$ (which can be obtained using the techniques highlighted in Appendix A). □

Before proving Proposition 4.2, recall that $f(W, V)$ is called $\beta$-smooth for some $\beta > 0$ if

$$\left\| \nabla f(W, V) - \nabla f(W', V') \right\|_F \le \beta \left( \| W - W' \|_F^2 + \| V - V' \|_F^2 \right)^{1/2} \quad \forall W, W' \in \mathbb{G}_W^+,\ V, V' \in \mathbb{G}_V^+,$$

where $\| \cdot \|_F$ denotes the Frobenius norm.

*Proof of Proposition 4.2.* The function $f(W, V)$ is concave because the objective function of the inner minimization problem in (11) is linear (and hence concave) in $W$ and $V$ and because concavity is preserved under minimization. To prove that $f(W, V)$ is $\beta$-smooth, we first recall from Proposition 3.3 that it coincides with the optimal value of the inner minimization problem in (10). As $\mathcal{U}_\eta = \mathcal{U}_y$, $f(W, V)$ can thus be viewed as the optimal value of the classic LQG problem corresponding to the normal distribution $\mathbb{P}$ determined by the covariance matrices $W$ and $V$. Hence, $f(W, V)$ coincides with (A.17), where $\Sigma_t$, for $t \in [T-1]$, is a function of $(W, V)$ defined recursively through the Kalman filter equations (A.16). Note that all inverse matrices in (A.16) are well-defined because any $V \in \mathbb{G}_V^+$ is strictly positive definite. Therefore, $\Sigma_t$ constitutes a proper rational function (that is, a ratio of two polynomials with the polynomial in the denominator being strictly positive) for every $t \in [T-1]$. Thus, $f(W, V)$ is infinitely often continuously differentiable on a neighborhood of $\mathbb{G}_W^+ \times \mathbb{G}_V^+$. As $f(W, V)$ is concave and (at least) twice continuously differentiable, it is $\beta$-smooth on $\mathbb{G}_W^+ \times \mathbb{G}_V^+$ if and only if the largest eigenvalue of the Hessian matrix of $-f(W, V)$ is bounded above by $\beta$ throughout $\mathbb{G}_W^+ \times \mathbb{G}_V^+$. Also, the largest eigenvalue of the positive semidefinite Hessian matrix $\nabla^2 (-f(W, V))$ coincides with the spectral norm of $\nabla^2 f(W, V)$. We may thus set

$$\beta = \sup_{W \in \mathbb{G}_W^+,\ V \in \mathbb{G}_V^+} \left\| \nabla^2 f(W, V) \right\|_2, \qquad (\text{A.19})$$

where $\| \cdot \|_2$ denotes the spectral norm. The supremum in the above maximization problem is finite and attained thanks to the Weierstrass theorem, which applies because $f(W, V)$ is twice continuously differentiable and the spectral norm is continuous, while the sets $\mathbb{G}_W^+$ and $\mathbb{G}_V^+$ are compact by virtue of [38, Lemma A.6]. This observation completes the proof. □

## D. SDP Reformulation of the Lower Bound Problem (11)

Instead of solving the dual problem (11) with the customized Frank-Wolfe algorithm of Section 4, it can be reformulated as an SDP amenable to off-the-shelf solvers. This reformulation is obtained by dualizing the inner minimization problem and by exploiting the following preliminary lemma.

**Lemma D.1.** For any $\hat{Z} \in \mathbb{S}^d_+$ and $\rho_z \geq 0$, the set $\mathcal{G}_Z = \{ Z \in \mathbb{S}^d_+ : G(Z, \hat{Z}) \le \rho_z \}$ coincides with

$$\left\{ Z \in \mathbb{S}^d_+ : \exists E_z \in \mathbb{S}^d_+ \ \text{with} \ \mathrm{Tr}\left( Z + \hat{Z} - 2 E_z \right) \le \rho_z^2,\ \begin{bmatrix} \hat{Z}^{1/2} Z \hat{Z}^{1/2} & E_z \\ E_z & I \end{bmatrix} \succeq 0 \right\}.$$

*Proof of Lemma D.1.* By Definition 2, we have $\mathcal{G}_Z = \{ Z \in \mathbb{S}^d_+ : \mathrm{Tr}(Z + \hat{Z} - 2 (\hat{Z}^{1/2} Z \hat{Z}^{1/2})^{1/2}) \le \rho_z^2 \}$. Next, introduce an auxiliary variable $E_z \in \mathbb{S}^d_+$ subject to the matrix inequality $E_z^2 \preceq \hat{Z}^{1/2} Z \hat{Z}^{1/2}$. By [4, Theorem 1], this inequality can be recast as $E_z \preceq (\hat{Z}^{1/2} Z \hat{Z}^{1/2})^{1/2}$. Hence, we can reformulate the nonlinear matrix inequality in the above representation of $\mathcal{G}_Z$ as $\mathrm{Tr}(Z + \hat{Z} - 2 E_z) \le \rho_z^2$. A standard Schur complement argument reveals that the inequality $E_z^2 \preceq \hat{Z}^{1/2} Z \hat{Z}^{1/2}$ is also equivalent to

$$\begin{bmatrix} \hat{Z}^{1/2} Z \hat{Z}^{1/2} & E_z \\ E_z & I \end{bmatrix} \succeq 0.$$

The claim then follows by combining all of these insights. □

We are now ready to derive the desired SDP reformulation of problem (11).
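Lemma D.1 translates directly into convex constraints. A minimal CVXPY sketch (illustrative, not the paper's code; the default installed conic solver is assumed to handle the semidefinite constraints) encoding membership of a matrix variable in the Gelbrich ball around a nominal covariance:

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

def gelbrich_ball_constraints(Z, Z_hat, rho):
    """Convex constraints for G(Z, Z_hat) <= rho, following Lemma D.1."""
    d = Z_hat.shape[0]
    Zh = np.real(sqrtm(Z_hat))                     # Z_hat^{1/2}, a constant
    E = cp.Variable((d, d), symmetric=True)        # auxiliary variable E_z
    lmi = cp.bmat([[Zh @ Z @ Zh, E], [E, np.eye(d)]])
    return [Z >> 0, lmi >> 0,
            cp.trace(Z + Z_hat - 2 * E) <= rho ** 2]

# Example: largest-trace covariance in a Gelbrich ball around the identity
d, rho = 3, 0.5
Z = cp.Variable((d, d), symmetric=True)
prob = cp.Problem(cp.Maximize(cp.trace(Z)),
                  gelbrich_ball_constraints(Z, np.eye(d), rho))
prob.solve()
print(prob.value)
```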
We are now ready to derive the desired SDP reformulation of problem (11).

Proposition D.2. If $\hat{V}_t \succ 0$ for all $t \in [T-1]$, then problem (11) is equivalent to the SDP
\[
\begin{array}{cl}
\max & \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\!\left( F (R + H^\top Q H)^{-1} \right) \\[4pt]
\text{s.t.} & W \in \mathbb{S}^{n(T+1)}_+, \;\; V \in \mathbb{S}^{pT}_+, \;\; M \in \mathcal{M}, \;\; F \in \mathbb{S}^{mT}_+ \\[2pt]
& E_{x_0} \in \mathbb{S}^n_+, \;\; E_{w_t} \in \mathbb{S}^n_+, \;\; E_{v_t} \in \mathbb{S}^p_+ \quad \forall t \in [T-1] \\[2pt]
& \operatorname{Tr}(W_0 + \hat{X}_0 - 2E_{x_0}) \le \rho_{x_0}^2 \\[2pt]
& \operatorname{Tr}(W_{t+1} + \hat{W}_t - 2E_{w_t}) \le \rho_{w_t}^2, \;\; \operatorname{Tr}(V_t + \hat{V}_t - 2E_{v_t}) \le \rho_{v_t}^2 \quad \forall t \in [T-1] \\[2pt]
& \begin{bmatrix} \hat{X}_0^{\frac{1}{2}} W_0 \hat{X}_0^{\frac{1}{2}} & E_{x_0} \\ E_{x_0} & I_n \end{bmatrix} \succeq 0, \;\;
\begin{bmatrix} \hat{W}_t^{\frac{1}{2}} W_{t+1} \hat{W}_t^{\frac{1}{2}} & E_{w_t} \\ E_{w_t} & I_n \end{bmatrix} \succeq 0, \;\;
\begin{bmatrix} \hat{V}_t^{\frac{1}{2}} V_t \hat{V}_t^{\frac{1}{2}} & E_{v_t} \\ E_{v_t} & I_p \end{bmatrix} \succeq 0 \quad \forall t \in [T-1] \\[2pt]
& \begin{bmatrix} F & H^\top Q G W D^\top + M/2 \\ (H^\top Q G W D^\top + M/2)^\top & D W D^\top + V \end{bmatrix} \succeq 0 \\[2pt]
& W_0 \succeq \lambda_{\min}(\hat{X}_0) I, \;\; W_{t+1} \succeq \lambda_{\min}(\hat{W}_t) I, \;\; V_t \succeq \lambda_{\min}(\hat{V}_t) I \quad \forall t \in [T-1].
\end{array}
\tag{A.20}
\]
Here, $\mathcal{M}$ denotes the set of all strictly upper block triangular matrices of the form
\[
M = \begin{bmatrix}
0 & M_{1,2} & M_{1,3} & \cdots & M_{1,T} \\
 & 0 & M_{2,3} & \cdots & M_{2,T} \\
 & & \ddots & \ddots & \vdots \\
 & & & 0 & M_{T-1,T} \\
 & & & & 0
\end{bmatrix},
\]
where $M_{t,s} \in \mathbb{R}^{m \times p}$ for every $t, s \in \mathbb{Z}$ with $1 \le t < s \le T$.

Proof of Proposition D.2. The proof relies on dualizing the inner minimization problem in (11). Note that strong duality holds because the primal problem is trivially feasible and involves only equality constraints, which implies that any feasible point is in fact a Slater point. In the following, we use $M \in \mathcal{M}$ to denote the Lagrange multiplier of the constraint $U \in \mathcal{U}$, which requires all blocks of the matrix $U$ above the main diagonal to vanish. The Lagrangian function of the inner minimization problem in (11) can therefore be represented as
\[
L(q, U, M) = \operatorname{Tr}\!\left[ \left( D^\top U^\top (R + H^\top Q H) U D + G^\top Q G \right) W \right] + 2 \operatorname{Tr}(G^\top Q H U D W) + \operatorname{Tr}\!\left[ U^\top (R + H^\top Q H) U\, V \right] + q^\top (R + H^\top Q H)\, q + \operatorname{Tr}(U M^\top).
\]
Recall now that $R \succ 0$ and $Q \succeq 0$, and thus $R + H^\top Q H \succ 0$. Consequently, $L$ is minimized by $q^\star = 0$ for any fixed $U$ and $M$. In addition, the partial gradient of $L$ with respect to $U$ is given by
\[
\frac{\partial L}{\partial U} = 2(R + H^\top Q H) U D W D^\top + 2(R + H^\top Q H) U V + 2 H^\top Q G W D^\top + M.
\]
Recall also that any $V \in \mathcal{G}_V^+$ is strictly positive definite, which implies that $D W D^\top + V \succ 0$ is invertible. As we already know that $R + H^\top Q H \succ 0$ is invertible as well, $L$ is minimized by
\[
U^\star = -(R + H^\top Q H)^{-1} \left( H^\top Q G W D^\top + M/2 \right) (D W D^\top + V)^{-1}
\]
for any fixed $M$. Substituting both $q^\star$ and $U^\star$ into $L$ yields the dual objective function
\[
g(M) = L(q^\star, U^\star, M) = \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\!\left[ (R + H^\top Q H)^{-1} \left( H^\top Q G W D^\top + M/2 \right) (D W D^\top + V)^{-1} \left( H^\top Q G W D^\top + M/2 \right)^\top \right].
\]
The dual of the inner minimization problem in (11) is thus given by $\max_{M \in \mathcal{M}} g(M)$. To linearize the dual objective function, we next introduce an auxiliary variable $F \in \mathbb{S}^{mT}_+$ subject to the matrix inequality $F \succeq (H^\top Q G W D^\top + M/2)(D W D^\top + V)^{-1}(H^\top Q G W D^\top + M/2)^\top$. By using a standard Schur complement reformulation, we can then rewrite the dual problem as
\[
\begin{array}{cl}
\max & \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\!\left( (R + H^\top Q H)^{-1} F \right) \\[2pt]
\text{s.t.} & M \in \mathcal{M}, \;\; F \in \mathbb{S}^{mT}_+ \\[2pt]
& \begin{bmatrix} F & H^\top Q G W D^\top + M/2 \\ (H^\top Q G W D^\top + M/2)^\top & D W D^\top + V \end{bmatrix} \succeq 0.
\end{array}
\tag{A.21}
\]
Next, by replacing the inner problem in (11) with its strong dual (A.21), we can reformulate (11) as
\[
\begin{array}{cl}
\max & \operatorname{Tr}(G^\top Q G W) - \operatorname{Tr}\!\left( (R + H^\top Q H)^{-1} F \right) \\[2pt]
\text{s.t.} & M \in \mathcal{M}, \;\; F \in \mathbb{S}^{mT}_+, \;\; W \in \mathbb{S}^{n(T+1)}_+, \;\; V \in \mathbb{S}^{pT}_+ \\[2pt]
& \begin{bmatrix} F & H^\top Q G W D^\top + M/2 \\ (H^\top Q G W D^\top + M/2)^\top & D W D^\top + V \end{bmatrix} \succeq 0 \\[2pt]
& \mathbb{G}(X_0, \hat{X}_0)^2 \le \rho_{x_0}^2, \;\; \mathbb{G}(W_t, \hat{W}_t)^2 \le \rho_{w_t}^2, \;\; \mathbb{G}(V_t, \hat{V}_t)^2 \le \rho_{v_t}^2 \quad \forall t \in [T-1].
\end{array}
\tag{A.22}
\]
By Proposition 4.1, the inclusion of the constraints $X_0 \succeq \lambda_{\min}(\hat{X}_0) I$, $W_t \succeq \lambda_{\min}(\hat{W}_t) I$ and $V_t \succeq \lambda_{\min}(\hat{V}_t) I$ for all $t \in [T-1]$ has no effect on the optimal value of problem (A.22). In addition, by Lemma D.1, each (non-linear) Gelbrich constraint in (A.22) can be reformulated as an equivalent (linear) SDP constraint. Thus, problem (A.22) reduces to (A.20), and the claim follows.

E. Bisection Algorithm for the Linearization Oracle

We now show that the direction-finding subproblem (14) can be solved efficiently via bisection. To this end, we first establish that (14) can be reduced to the solution of a univariate algebraic equation.

Proposition E.1 ([38, Proposition A.4 (iii)]). If $\hat{Z} \in \mathbb{S}^d_{++}$, $\Gamma_Z \in \mathbb{S}^d_+$ with $\Gamma_Z \neq 0$, $\rho_z \in \mathbb{R}_{++}$, and $Z \in \mathbb{S}^d_+$ is a fixed reference matrix (whose contribution to the objective is a constant that does not affect the maximizer), then
\[
\begin{array}{cl}
\max\limits_{L} & \langle \Gamma_Z, L - Z \rangle \\[2pt]
\text{s.t.} & \mathbb{G}(L, \hat{Z}) \le \rho_z, \quad L \succeq \lambda_{\min}(\hat{Z}) I
\end{array}
\tag{A.23}
\]
is uniquely solved by $L^\star = (\gamma^\star)^2 (\gamma^\star I - \Gamma_Z)^{-1} \hat{Z} (\gamma^\star I - \Gamma_Z)^{-1}$, where $\gamma^\star$ is the unique solution of
\[
\rho_z^2 - \left\langle \hat{Z}, \left( I - \gamma (\gamma I - \Gamma_Z)^{-1} \right)^2 \right\rangle = 0
\tag{A.24}
\]
in the interval $(\lambda_{\max}(\Gamma_Z), \infty)$.

In practice, we need to solve the algebraic equation (A.24) numerically. The numerical error in approximating $\gamma^\star$ must be controlled to ensure that $L^\star$ approximates the exact maximizer of problem (A.23). The next proposition shows that, for any tolerance $\delta \in (0, 1)$, a $\delta$-approximate solution of (A.23) can be computed with an efficient bisection algorithm.
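Before turning to that algorithm, the following sketch illustrates Proposition E.1 numerically. It is not the paper's method: as an assumption of this example, SciPy's generic `brentq` root finder stands in for the dedicated bisection scheme (Algorithm A.2 below), and all data are randomly generated. The script solves (A.24) and verifies that the resulting $L^\star$ lies on the boundary of the Gelbrich ball, i.e., $\mathbb{G}(L^\star, \hat{Z}) \approx \rho_z$.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import brentq

rng = np.random.default_rng(2)
d, rho = 4, 0.3

A = rng.standard_normal((d, d))
Z_hat = A @ A.T + np.eye(d)           # nominal covariance, Z_hat > 0
B = rng.standard_normal((d, d))
Gamma = B @ B.T                       # gradient matrix, Gamma >= 0, Gamma != 0

lam_max = np.max(np.linalg.eigvalsh(Gamma))
I = np.eye(d)

def lhs(gamma):
    """Left-hand side of (A.24): rho^2 - <Z_hat, (I - gamma (gamma I - Gamma)^{-1})^2>."""
    Minv = np.linalg.inv(gamma * I - Gamma)
    Tm = I - gamma * Minv             # equals -Gamma (gamma I - Gamma)^{-1}
    return rho**2 - np.trace(Z_hat @ Tm @ Tm)

# Since Z_hat > 0, lhs -> -inf as gamma -> lam_max+ and lhs -> rho^2 > 0 as
# gamma -> inf, so the unique root in (lam_max, inf) can be bracketed.
lo = lam_max + 1e-6
hi = lam_max + 1.0
while lhs(hi) < 0:
    hi *= 2.0
gamma_star = brentq(lhs, lo, hi)

Minv = np.linalg.inv(gamma_star * I - Gamma)
L_star = gamma_star**2 * Minv @ Z_hat @ Minv

# Verify that L_star sits on the boundary of the Gelbrich ball.
Zh = np.real(sqrtm(Z_hat))
mid = np.real(sqrtm(Zh @ L_star @ Zh))
gelbrich = np.sqrt(max(np.trace(L_star + Z_hat - 2 * mid), 0.0))
print(f"gamma* = {gamma_star:.4f}, G(L*, Z_hat) = {gelbrich:.4f} (target rho = {rho})")
```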
Proposition E.2 ([38, Theorem 6.4]). For any fixed $\rho_z \in \mathbb{R}_{++}$, $\hat{Z} \in \mathbb{S}^d_{++}$ and $\Gamma_Z \in \mathbb{S}^d_+$ with $\Gamma_Z \neq 0$, define $\mathcal{G}_Z^+ = \{ Z \in \mathbb{S}^d_+ : \mathbb{G}(Z, \hat{Z}) \le \rho_z, \; Z \succeq \lambda_{\min}(\hat{Z}) I \}$ as the feasible set of problem (A.23), and let $Z \in \mathcal{G}_Z^+$ be any reference covariance matrix. Additionally, let $\delta \in (0, 1)$ be the desired oracle precision, and define
\[
\varphi(\gamma) = \gamma \left( \rho_z^2 + \left\langle \gamma (\gamma I - \Gamma_Z)^{-1} - I, \hat{Z} \right\rangle \right) - \langle Z, \Gamma_Z \rangle
\]
for any $\gamma > \lambda_{\max}(\Gamma_Z)$. Then, Algorithm A.2 returns in finite time a matrix $L_Z^\delta \in \mathbb{S}^d_+$ with the following properties.
(i) Feasibility: $L_Z^\delta \in \mathcal{G}_Z^+$.
(ii) $\delta$-Suboptimality: $\langle L_Z^\delta - Z, \Gamma_Z \rangle \ge \delta \max_{L \in \mathcal{G}_Z^+} \langle \Gamma_Z, L - Z \rangle$.

Algorithm A.2 Bisection algorithm to compute $L_Z^\delta$
Input: nominal covariance matrix $\hat{Z} \in \mathbb{S}^d_{++}$, radius $\rho_z \in \mathbb{R}_{++}$, reference covariance matrix $Z \in \mathcal{G}_Z^+$, gradient matrix $\Gamma_Z \in \mathbb{S}^d_+$ with $\Gamma_Z \neq 0$, precision $\delta \in (0, 1)$, dual objective function $\varphi(\gamma)$ defined in Proposition E.2
1: set $\lambda_1 \leftarrow \lambda_{\max}(\Gamma_Z)$, and let $p_1$ be a unit eigenvector for $\lambda_1$
2: set $\underline{\gamma} \leftarrow \lambda_1 \left(1 + (p_1^\top \hat{Z} p_1)^{\frac{1}{2}} / \rho_z\right)$ and $\overline{\gamma} \leftarrow \lambda_1 \left(1 + \operatorname{Tr}(\hat{Z})^{\frac{1}{2}} / \rho_z\right)$
3: repeat
4: set $\tilde{\gamma} \leftarrow (\underline{\gamma} + \overline{\gamma})/2$ and $L \leftarrow \tilde{\gamma}^2 (\tilde{\gamma} I - \Gamma_Z)^{-1} \hat{Z} (\tilde{\gamma} I - \Gamma_Z)^{-1}$
5: if $\frac{d\varphi}{d\gamma}(\tilde{\gamma}) < 0$ then set $\underline{\gamma} \leftarrow \tilde{\gamma}$ else set $\overline{\gamma} \leftarrow \tilde{\gamma}$ endif
6: until $\frac{d\varphi}{d\gamma}(\tilde{\gamma}) > 0$ and $\langle L - Z, \Gamma_Z \rangle \ge \delta \varphi(\tilde{\gamma})$
Output: $L$

In summary, for any $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$, Algorithm A.2 computes a $\delta$-approximate solution to the direction-finding subproblem (14) with $\Gamma_Z = \nabla_Z f(W, V)$.

F. Additional Information on Experiments

Generation of Nominal Covariance Matrices. The nominal covariance matrices of the exogenous uncertainties are constructed randomly via the following procedure. For each exogenous uncertainty $z \in \{x_0, w_0, \ldots, w_{T-1}, v_0, \ldots, v_{T-1}\}$, we denote the dimension of $z$ by $d$ and sample a matrix $M_Z \in \mathbb{R}^{d \times d}$ from the uniform distribution on the hypercube $[0, 1]^{d \times d}$. Next, we define $\Xi_Z \in \mathbb{R}^{d \times d}$ as the orthogonal matrix whose columns are the orthonormal eigenvectors of the symmetric matrix $M_Z + M_Z^\top$. Finally, we set $\hat{Z} = \Xi_Z \Lambda_Z \Xi_Z^\top$, where $\Lambda_Z$ is a diagonal matrix whose main diagonal is sampled uniformly from the interval $[1, 2]^d$. The rationale for adopting this cumbersome procedure is to ensure that the covariance matrix $\hat{Z}$ is positive definite (see also the sketch at the end of this appendix).

Optimality Gap. The optimality gap of the Frank-Wolfe algorithm visualized in Figure 1b is calculated as the sum of the surrogate optimality gaps $\langle L_Z^\delta - Z, \nabla_Z f(W, V) \rangle$ across all $Z \in \{X_0, W_0, \ldots, W_{T-1}, V_0, \ldots, V_{T-1}\}$. For more information on surrogate optimality gaps, see [31].
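For reference, the nominal covariance construction described above takes only a few lines of NumPy. The sketch below follows the stated recipe; the function name and the chosen dimension are illustrative.

```python
import numpy as np

def nominal_covariance(d, rng):
    """Generate a random nominal covariance matrix as in Appendix F:
    eigenvectors from a symmetrized uniform random matrix, eigenvalues in [1, 2]."""
    M = rng.uniform(0.0, 1.0, size=(d, d))        # M_Z ~ U([0, 1]^{d x d})
    _, Xi = np.linalg.eigh(M + M.T)                # orthonormal eigenvectors of M_Z + M_Z'
    Lam = np.diag(rng.uniform(1.0, 2.0, size=d))   # diagonal sampled uniformly from [1, 2]^d
    return Xi @ Lam @ Xi.T                         # Z_hat = Xi Lam Xi', positive definite

rng = np.random.default_rng(3)
Z_hat = nominal_covariance(5, rng)
print(np.linalg.eigvalsh(Z_hat))  # all eigenvalues lie in [1, 2], so Z_hat > 0
```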