# Online Optimal Control with Affine Constraints

Yingying Li,1 Subhro Das,2 Na Li1
1 John A. Paulson School of Engineering and Applied Sciences, Harvard University
2 MIT-IBM Watson AI Lab, IBM Research
yingyingli@g.harvard.edu, subhro.das@ibm.com, nali@seas.harvard.edu

This paper considers online optimal control with affine constraints on the states and actions under linear dynamics with bounded random disturbances. The system dynamics and constraints are assumed to be known and time-invariant, but the convex stage cost functions change adversarially. To solve this problem, we propose Online Gradient Descent with Buffer Zones (OGD-BZ). Theoretically, we show that OGD-BZ with proper parameters can guarantee that the system satisfies all the constraints despite any admissible disturbances. Further, we investigate the policy regret of OGD-BZ, which compares OGD-BZ's performance with the performance of the optimal linear policy in hindsight. We show that, under proper algorithm parameters, OGD-BZ achieves a policy regret upper bound that is the square root of the horizon length multiplied by some logarithmic terms of the horizon length.

Introduction

Recently, there has been a lot of interest in solving control problems by learning-based techniques, e.g., online learning and reinforcement learning (Agarwal et al. 2019; Li, Chen, and Li 2019; Ibrahimi, Javanmard, and Roy 2012; Dean et al. 2018; Fazel et al. 2018; Yang et al. 2019; Li et al. 2019a). This is motivated by applications such as data centers (Lazic et al. 2018; Li et al. 2019b), robotics (Fisac et al. 2018), autonomous vehicles (Sallab et al. 2017), power systems (Chen et al. 2021), etc. For real-world implementation, it is crucial to design safe algorithms that ensure the system satisfies certain (physical) constraints despite unknown disturbances. For example, temperatures in data centers should stay within certain ranges to reduce task failures despite disturbances from unmodeled heat sources, quadrotors should avoid collisions even when perturbed by wind, etc. In addition to safety, many applications involve time-varying environments, e.g., varying electricity prices, moving targets, etc. Hence, safe algorithms should not be over-conservative and should adapt to varying environments for desirable performance.

In this paper, we design safe algorithms for time-varying environments by considering the following constrained online optimal control problem. Specifically, we consider a linear system with random disturbances,

x_{t+1} = A x_t + B u_t + w_t,  t ≥ 0,  (1)

where the disturbance w_t is random and satisfies ‖w_t‖_∞ ≤ w̄. Consider affine constraints on the state x_t and the action u_t:

D_x x_t ≤ d_x,  D_u u_t ≤ d_u,  ∀ t ≥ 0.  (2)

For simplicity, we assume that the system parameters A, B, w̄ and the constraints are known. At each stage 0 ≤ t ≤ T, a convex cost function c_t(x_t, u_t) is adversarially generated and the decision maker selects a feasible action u_t before c_t(x_t, u_t) is revealed. We aim to achieve two goals simultaneously: (i) minimize the sum of the adversarially varying costs, and (ii) satisfy the constraints (2) for all t despite the disturbances. Many studies address each goal separately, but, as discussed below, results that address both goals together are lacking. Firstly, there is recent progress on online optimal control to address Goal (i).
A commonly adopted performance metric is policy regret, which compares the online cost with the cost of the optimal linear policy in hindsight (Agarwal et al. 2019). Sublinear policy regrets have been achieved for linear systems with either stochastic disturbances (Cohen et al. 2018; Agarwal, Hazan, and Singh 2019) or adversarial disturbances (Agarwal et al. 2019; Foster and Simchowitz 2020; Goel and Hassibi 2020a,b). However, most literature only considers the unconstrained control problem. Recently, Nonhoff and M uller (2020) studies constrained online optimal control but assumes no disturbances. Secondly, there are many papers from the control community to address Goal (ii): constraints satisfaction. Perhaps the most famous algorithms are Model Predictive Control (MPC) (Rawlings and Mayne 2009) and its variants, such as robust MPC which guarantees (hard) constraint satisfaction in the presence of disturbances (Bemporad and Morari 1999; Kouvaritakis, Rossiter, and Schuurmans 2000; Mayne, Seron, and Rakovi c 2005; Limon et al. 2010; Zafiriou 1990) as well as stochastic MPC which considers soft constraints and allows constraints violation (Oldewurtel, Jones, and Morari 2008; Mesbah 2016). However, there lack algorithms with both regret/optimality guarantees and constraint satisfaction guarantees. Therefore, an important question remains to be addressed: Q: how to design online algorithms to both satisfy the constraints despite disturbances and yield o(T) policy regrets? Our Contributions In this paper, we answer the question above by proposing an online control algorithm: Online Gradient Descent with Buffer Zones (OGD-BZ). To develop The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) OGD-BZ, we first convert the constrained online optimal control problem into an online convex optimization (OCO) problem with temporal-coupled stage costs and temporalcoupled stage constraints, and then convert the temporalcoupled OCO problem into a classical OCO problem. The problem conversion leverages the techniques from recent unconstrained online control literature and robust optimization literature. Since the conversion is not exact/equivalent, we tighten the constraint set by adding buffer zones to account for approximation errors caused by the problem conversion. We then apply classical OCO method OGD to solve the problem and call the resulting algorithm as OGD-BZ. Theoretically, we show that, with proper parameters, OGD-BZ can ensure all the states and actions to satisfy the constraints (2) for any disturbances bounded by w. In addition, we show that OGD-BZ s policy regret can be bounded by O( T) for general convex cost functions ct(xt, ut) under proper assumptions and parameters. As far as we know, OGD-BZ is the first algorithm with theoretical guarantees on both sublinear policy regret and robust constraint satisfaction. Further, our theoretical results explicitly characterize a trade-off between the constraint satisfaction and the low regret when deciding the size of the buffer zone of OGD-BZ. That is, a larger buffer zone, which indicates a more conservative search space, is preferred for constraints satisfaction; while a smaller buffer zone is preferred for low regret. Related Work We provide more literature review below. Safe reinforcement learning. There is a rich body of literature on safe RL and safe learning-based control that studies how to learn optimal policies without violating constraints and without knowing the system (Fisac et al. 2018; Aswani et al. 
2013; Wabersich and Zeilinger 2018; Garcıa and Fern andez 2015; Cheng et al. 2019; Zanon and Gros 2019; Fulton and Platzer 2018). Perhaps the most relevant paper is Dean et al. (2019b), which proposes algorithms to learn optimal linear policies for a constrained linear quadratic regulator problem. However, most theoretical guarantees in the safe RL literature require time-invariant environment and there lacks policy regret analysis when facing time-varying objectives. This paper addresses the timevarying objectives but considers known system dynamics. It is our ongoing work to combine both safe RL and our approach to design safe learning algorithms with policy regret guarantees in time-varying problems. Another important notion of safety is the system stability, which is also studied in the safe RL/learning-based control literature (Dean et al. 2018, 2019a; Chow et al. 2018). Online convex optimization (OCO). Hazan (2019) provides a review on classical (decoupled) OCO. OCO with memory considers coupled costs and decoupled constraints (Anava, Hazan, and Mannor 2015; Li, Qu, and Li 2020; Li and Li 2020). The papers on OCO with coupled constraints usually allow constraint violation (Yuan and Lamperski 2018; Cao, Zhang, and Poor 2018; Kveton et al. 2008). Besides, OCO does not consider system dynamics or disturbances. Constrained optimal control. Constrained optimal control enjoys a long history of research. Without disturbances, it is known that the optimal controller for linearly constrained linear quadratic regulator is piecewise linear (Bemporad et al. 2002). With disturbances (as considered in this paper), the problem is much more challenging. Current methods such as robust MPC (Limon et al. 2008, 2010; Rawlings and Mayne 2009) and stochastic MPC (Mesbah 2016; Oldewurtel, Jones, and Morari 2008) usually deploy linear policies for fast computation even though linear policies are suboptimal. Besides, most theoretical analysis of robust/stochastic MPC focus on stability, recursive feasibility, and constraints satisfaction, instead of policy regrets. Notations and Conventions We let 1, 2, denote the L1, L2, L norms respectively for vectors and matrices. Let 1n denote an all-one vector in Rn. For two vectors a, b Rn, we write a b if ai bi for any entry i. Let vec(A) denote the vectorization of matrix A. For better exposition, some bounds use Θ( ) to omit constants that do not depend on T or the problem dimensions explicitly. Problem Formulation In this paper, we consider an online optimal control problem with linear dynamics and affine constraints. Specifically, at each stage t {0, 1, . . . , T}, an agent observes the current state xt and implements an action ut, which incurs a cost ct(xt, ut). The stage cost function ct( , ) is generated adversarially and revealed to the agent after the action ut is taken. The system evolves to the next state according to (1), where x0 is fixed, wt is a random disturbance bounded by wt W = {w Rn : w w}, and states and actions should satisfy the affine constraints (2). We denote the corresponding constraint sets as X ={x Rn:Dxx dx}, U ={u Rm:Duu du}, where dx Rkx and du Rku. Define kc = kx + ku as the total number of the constraints. For simplicity, we consider that the parameters A, B, w, Dx, dx, Du, du are known a priori and that the initial value satisfies x0 = 0. We leave the study of unknown parameters and general x0 for the future. Definition 1 (Safe controller). 
Consider a controller (or an algorithm) A that chooses action u A t U based on history states {x A k }t k=0 and cost functions {ck( , )}t 1 k=0. The controller A is called safe if x A t X and u A t U for all 0 t T and all disturbances {wk W}T k=0. Define the total cost of a safe algorithm/controller A as: JT (A) = E {wk} t=0 ct(x A t , u A t ) Benchmark Policy and Policy Regret In this paper, we consider linear policies of the form ut = Kxt as our benchmark policy for simplicity, though the optimal policy for the constrained control of noisy systems may be nonlinear (Rawlings and Mayne 2009). We leave the discussion on nonlinear policies as future work. Based on (Cohen et al. 2018), we define strong stability, which is a quantitative version of stability and is commonly introduced to ease non-asymptotic regret analysis in the online control literature (Agarwal, Hazan, and Singh 2019; Agarwal et al. 2019). Definition 2 (Strong Stability). A linear controller ut = Kxt is (κ, γ)-strongly stable for κ 1 and γ (0, 1] if there exists a matrix L and an invertible matrix Q such that A BK = Q 1LQ, with L 2 1 γ and max( Q 2, Q 1 2, K 2) κ. As shown in Cohen et al. (2018), strongly stable controllers can be computed efficiently by SDP formulation. Our benchmark policy class includes any linear controller ut = Kxt satisfying the conditions below: K = {K : K is safe and (κ, γ)-strongly stable}, where K is called safe if the controller ut = Kxt is safe according to Definition 1. The policy regret of online algorithm A is defined as: Reg(A) = JT (A) min K K JT (K). (4) Assumptions and Definitions For the rest of the paper, we define κB = max( B 2, 1). In addition, we introduce the following assumptions on the disturbances and the cost functions, which are standard in literature (Agarwal, Hazan, and Singh 2019). Assumption 1. {wt} are i.i.d. and bounded by wt w, where w > 0.1 Assumption 2. For any t 0, cost function ct(xt, ut) is convex and differentiable with respect to xt and ut. Further, there exists G > 0, such that for any x 2 b, u 2 b, we have xct(x, u) 2 Gb and uct(x, u) 2 Gb. Next, we define strictly and loosely safe controllers. Definition 3 (Strict and loose safety). A safe controller A is called ϵ-strictly safe for some ϵ > 0 if Dxx A t dx ϵ1kx and Duu A t du ϵ1ku for all 0 t T under any disturbance sequence {wk W}T k=0. A controller A is called ϵ-loosely safe for some ϵ > 0 if Dxx A t dx + ϵ1kx and Duu A t du + ϵ1ku for all 0 t T under any disturbance sequence {wk W}T k=0. In the following, we assume the existence of a strictly safe linear policy. The existence of a safe linear policy is necessary since otherwise our policy regret is not well-defined. The existence of a strictly safe policy provides some flexibility for the approximation steps in our algorithm design and is a common assumption in constrained optimization and control (Boyd and Vandenberghe 2004; Limon et al. 2010). Assumption 3. There exists K K such that the policy ut = K xt is ϵ -strictly safe for some ϵ > 0. Intuitively, Assumption 3 requires the sets X and U to have non-empty interiors and that the disturbance set W is small enough so that a disturbed linear system xt+1 = (A BK )xt + wt stays in the interiors of X and U for any {wk W}T k=0. In addition, Assumption 3 implicitly assumes that 0 belongs to the interiors of X and U since we let x0 = 0. Finally, though it is challenging to verify Assumption 3 directly, there are numerical verification methods, e.g. 
by solving linear matrix inequalities (LMI) programs (Limon et al. 2010).2 1The results of this paper can be extended to adversarial noises. 2(Limon et al. 2010) provides an LMI program to compute a Preliminaries This section briefly reviews the unconstrained online optimal control and robust constrained optimization literature, techniques from which motivate our algorithm design. Unconstrained Online Optimal Control In our setting, if one considers X = Rn and U = Rm, then the problem reduces to an unconstrained online optimal control. For such unconstrained online control problems, Agarwal, Hazan, and Singh (2019); Agarwal et al. (2019) propose a disturbance-action policy class to design an online policy. Definition 4 (Disturbance-Action Policy (Agarwal, Hazan, and Singh 2019)). Fix an arbitrary (κ, γ)-strongly stable matrix K a priori. Given an H {1, 2, . . . , T}, a disturbance-action policy defines the control policy as: i=1 M [i]wt i, t 0, (5) where, M [i] Rm n and wt = 0 for t 0. Let M = {M [i]}H i=1 denote the list of parameter matrices for the disturbance-action policy.3 For the rest of the paper, we will fix K and discuss how to choose parameter M. In Agarwal, Hazan, and Singh (2019), a bounded convex constraint set on policy M is introduced for technical simplicity and without loss of generality:4 M2 ={M ={M [i]}H i=1 : M [i] 2 κ3κB(1 γ)i, i} (6) The next proposition introduces state and action approximations when implementing disturbance-action policies. Proposition 1 ((Agarwal et al. 2019)). When implementing a disturbance-action policy (5) with time-varying Mt = {M [i] t }H i=1 at each stage t 0, the states and actions satisfy: xt = AH K xt H + xt and ut = KAH K xt H + ut, (7) where AK = A BK. The approximate/surrogate state and action, xt and ut, are defined as: k=1 Φx k(Mt H:t 1)wt k, ut = K xt + i=1 M [i] t wt i = k=1 Φu k(Mt H:t)wt k, Φx k(Mt H:t 1)=Ak 1 K 1(k H)+ i=1 Ai 1 K BM [k i] t i 1(1 k i H) Φu k(Mt H:t) = M [k] t 1(k H) KΦx k(Mt H:t 1), near-optimal linear controller for a time-invariant constrained control problem, which can be used to verify the existence of a safe solution. To verify Assumption 3, one could run the LMI program with the constraints tightened by ϵ and continue to reduce ϵ if no solution is found until ϵ is smaller than a certain threshold. 3The disturbance-action policy is mainly useful for non-zero disturbances. Nevertheless, our theoretical results do not require wt = 0 because for no-disturbance systems, any strongly stable controller ut = Kxt will only result in a constant O(1) regret. 4This is without loss of generality because (Agarwal et al. 2019) shows that any (κ, γ)-strongly stable linear policy can be approximated by a disturbance-action policy in M2. where Mt H:t := {Mt H, . . . , Mt}, the superscript k in Ak K denotes the kth power of AK, and M [k] t with superscript [k] denotes the kth matrix in list Mt. Further, define Φx k(M) = Φx k(M, . . . , M), Φu k(M) = Φu k(M, . . . , M). Notice that xt and ut are affine functions of Mt H:t. Based on xt and ut, Agarwal, Hazan, and Singh (2019) introduces an approximate cost function: ft(Mt H:t) = E[ct( xt, ut)], which is convex with respect to Mt H:t since xt and ut are affine functions of Mt H:t and ct( , ) is convex. Remark 1. The disturbance-action policy is related to affine disturbance feedback in stochastic MPC (Oldewurtel, Jones, and Morari 2008; Mesbah 2016), which also considers policies that are linear with disturbances to convexify the control problem in MPC s lookahead horizon. 
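To make Definition 4 and Proposition 1 concrete, the following minimal numpy sketch (ours, not the authors' implementation; all function and variable names are illustrative) builds the transfer matrices Φ^x_k(M) and Φ^u_k(M) for a time-invariant policy M = {M^[1], ..., M^[H]} and evaluates the action (5) and the surrogate state x̃_t from a window of past disturbances.

```python
import numpy as np

def phi_x(A_K, B, M, k):
    """Phi^x_k(M) = A_K^{k-1} 1(k<=H) + sum_{i=1}^H A_K^{i-1} B M^[k-i] 1(1<=k-i<=H),
    i.e., the map from w_{t-k} to the surrogate state, with M_{t-H} = ... = M_t = M."""
    n, H = A_K.shape[0], len(M)
    Phi = np.linalg.matrix_power(A_K, k - 1) if k <= H else np.zeros((n, n))
    for i in range(1, H + 1):
        if 1 <= k - i <= H:
            Phi = Phi + np.linalg.matrix_power(A_K, i - 1) @ B @ M[k - i - 1]
    return Phi

def phi_u(A_K, B, K, M, k):
    """Phi^u_k(M) = M^[k] 1(k<=H) - K Phi^x_k(M)."""
    H = len(M)
    Mk = M[k - 1] if k <= H else np.zeros((K.shape[0], A_K.shape[0]))
    return Mk - K @ phi_x(A_K, B, M, k)

def disturbance_action(K, M, x_t, w_hist):
    """Action of policy (5): u_t = -K x_t + sum_{i=1}^H M^[i] w_{t-i}
    (sign convention A_K = A - B K, as in Proposition 1).
    w_hist[i-1] holds w_{t-i}, zero-padded whenever t - i < 0."""
    return -K @ x_t + sum(M[i - 1] @ w_hist[i - 1] for i in range(1, len(M) + 1))

def surrogate_state(A_K, B, M, w_hist):
    """Approximate state x~_t = sum_{k=1}^{2H} Phi^x_k(M) w_{t-k} of Proposition 1."""
    return sum(phi_x(A_K, B, M, k) @ w_hist[k - 1] for k in range(1, 2 * len(M) + 1))
```

The constraint reformulation in the algorithm-design section below is expressed entirely through these transfer matrices.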
OCO with memory. In Agarwal, Hazan, and Singh (2019), the unconstrained online optimal control problem is converted into OCO with memory, i.e. at each stage t, the agent selects a policy Mt M2 and then incurs a cost ft(Mt H:t). Notice that the cost function at stage t couples the current policy Mt with the H-stage historical policies Mt H:t 1, but the constraint set M2 is decoupled and only depends on the current Mt. To solve this OCO with memory problem, Agarwal, Hazan, and Singh (2019) defines decoupled cost functions ft(Mt) := ft(Mt, . . . , Mt), (8) by letting the H-stage historical policies be identical to the current policy. Notice that ft(Mt) is still convex. Accordingly, the OCO with memory is reformulated as a classical OCO problem with stage cost ft(Mt), which is solved by classical OCO algorithms such as online gradient descent (OGD) in Agarwal, Hazan, and Singh (2019). The stepsizes of OGD are chosen to be sufficiently small so that the variation between the current policy Mt and the H-stage historical policies Mt H, . . . , Mt 1 is sufficiently small, which guarantees a small approximation error between ft(Mt) and ft(Mt H:t), and thus low regrets. For more details, we refer the reader to Agarwal, Hazan, and Singh (2019). Robust Optimization with Constraints Consider a robust optimization problem with linear constraints (Ben-Tal, El Ghaoui, and Nemirovski 2009): min x f(x) s.t. a i x bi, ai Ci, 1 i k, (9) where the (box) uncertainty sets are defined as Ci = {ai = ai+Piz : z z} for any i. Notice that the robust constraint {a i x bi, ai Ci} is equivalent to the standard constraint {supai Ci[a i x] bi}. Further, one can derive sup ai Ci a i x = sup z z ( ai + Piz) x = a i x + sup z z z (P i x) = a i x + P i x 1 z (10) Therefore, the robust optimization (9) can be equivalently reformulated as the linearly constrained optimization below: min x f(x) s.t. a i x + P i x 1 z bi, 1 i k. Online Algorithm Design This section introduces our online algorithm design for online disturbance-action policies (Definition 4). Roughly speaking, to develop our online algorithm, we first convert the constrained online optimal control into OCO with memory and coupled constraints, which is later converted into classical OCO and solved by OCO algorithms. The conversions leverage the approximation and the reformulation techniques in the Preliminaries. During the conversions, we ensure that the outputs of the OCO algorithms are safe for the original control problem. This is achieved by tightening the original constraints (adding buffer zones) to allow for approximation errors. Besides, our method ensures small approximation errors and thus small buffer zones so that the optimality/regret is not sacrificed significantly for safety. The details of algorithm design are discussed below. Step 1: Constraints on Approximate States and Actions When applying the disturbance-action policies (5), we can use (7) to rewrite the state constraint xt+1 X as Dx AH K xt H+1 + Dx xt+1 dx, {wk W}T k=0, (11) where xt+1 is the approximate state. Note that the term Dx AH K xt H+1 decays exponentially with H. If there exists H such that Dx AH K xt H+1 ϵ11kx, {wk W}T k=0, then a tightened constraint on the approximate state, i.e. Dx xt+1 dx ϵ11kx, {wk W}T k=0, (12) can guarantee the original constraint on the true state (11). The action constraint ut U can similarly be converted into a tightened constraint on the approximate action ut, i.e. Du ut du ϵ11ku, {wk W}T k=0, (13) if Du( KAH K xt H) ϵ11ku for any {wk W}T k=0. 
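Step 2 below applies the reformulation (10) to the tightened constraints (12)–(13). As a quick sanity check of (10), the toy snippet below (arbitrary data, unrelated to the control problem) verifies that the worst case over the box {z : ‖z‖_∞ ≤ z̄} is attained at the vertex z* = z̄·sign(Pᵀx) and equals āᵀx + z̄‖Pᵀx‖₁.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 5, 3                        # dimensions of x and of the uncertainty z
a_bar = rng.normal(size=d)         # nominal constraint vector a_bar
P = rng.normal(size=(d, p))        # uncertainty shape matrix
x = rng.normal(size=d)             # a fixed decision vector
z_bar = 0.7                        # box radius: ||z||_inf <= z_bar

# Closed form from (10): sup_{||z||_inf <= z_bar} (a_bar + P z)^T x
closed_form = a_bar @ x + z_bar * np.linalg.norm(P.T @ x, ord=1)

# The supremum is attained at the vertex z* = z_bar * sign(P^T x)
z_star = z_bar * np.sign(P.T @ x)
vertex_value = (a_bar + P @ z_star) @ x

# No sampled point of the box exceeds the closed-form value
samples = z_bar * rng.uniform(-1, 1, size=(10000, p))
sampled_max = max((a_bar + P @ z) @ x for z in samples)

assert np.isclose(closed_form, vertex_value)
assert sampled_max <= closed_form + 1e-9
```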
Step 2: Constraints on the Policy Parameters Next, we reformulate the robust constraints (12) and (13) on xt+1 and ut as polytopic constraints on policy parameters Mt H:t based on the robust optimization techniques reviewed in Robust Optimization with Constraints. Firstly, we consider the ith row of the constraint (12), i.e. D x,i xt+1 dx,i ϵ1 {wk W}T k=0, where D x,i denotes the ith row of the matrix Dx. This constraint is equivalent to sup{wk W}T k=0(D x,i xt+1) dx,i ϵ1. Further, by (10) and the definitions of xt+1 and W, we obtain sup {wk W} D x,i xt+1 = sup {wk W} D x,i s=1 Φx s(Mt H+1:t)wt+1 s s=1 sup wt+1 s W D x,iΦx s(Mt H+1:t)wt+1 s s=1 D x,iΦx s(Mt H+1:t) 1 w Define gx i (Mt H+1:t) = P2H s=1 D x,iΦx s(Mt H+1:t) 1 w. Hence, the robust constraint (12) on xt+1 is equivalent to the following polytopic constraints on Mt H+1:t: gx i (Mt H+1:t) dx,i ϵ1, 1 i kx. (14) Similarly, the constraint (13) on ut is equivalent to: gu j (Mt H:t) du,j ϵ1, 1 j ku, (15) where gu j (Mt H:t) = P2H s=1 D u,jΦu s(Mt H:t) 1 w. Step 3: OCO with Memory and Temporal-coupled Constraints By Step 2 and our review of robust optimization, we can convert the constrained online optimal control problem into OCO with memory and coupled constraints. That is, at each t, the decision maker selects a policy Mt satisfying constraints (14) and (15), and then incurs a cost ft(Mt H:t). In our framework, both the constraints (14), (15) and the cost function ft(Mt H:t) couple the current policy with the historical policies. This makes the problem far more challenging than OCO with memory which only considers coupled costs (Anava, Hazan, and Mannor 2015). Step 4: Benefits of the Slow Variation of Online Policies We approximate the coupled constraint functions gx i (Mt H+1:t) and gu j (Mt H:t) as decoupled ones below, gx i (Mt)=gx i (Mt, . . . , Mt), gu i (Mt)=gu i (Mt, . . . , Mt), by letting the historical policies Mt H:t 1 be identical to the current Mt.5 If the online policy Mt varies slowly with t, which is satisfied by most OCO algorithms (e.g. OGD with a diminishing stepsize (Hazan 2019)), one may be able to bound the approximation errors by gx i (Mt H+1:t) gx i (Mt) ϵ2 and gu j (Mt H:t) gu j (Mt) ϵ2 for a small ϵ2 > 0. Consequently, the constraints (14) and (15) are ensured by the polytopic constraints that only depend on Mt: gx i (Mt) dx,i ϵ1 ϵ2, gu j (Mt) du,j ϵ1 ϵ2, (16) where the buffer zone ϵ2 allows for the approximation error caused by neglecting the variation of the online policies. Step 5: Conversion to OCO By Step 4, we define a decoupled search space/constraint set on each policy below, Ωϵ ={M M : gx i (M) dx,i ϵ, 1 i kx, gu j (M) du,j ϵ, 1 j ku}, (17) where M is a bounded convex constraint set defined as M = {M : M [i] 2 nκ3(1 γ)i 1, 1 i H}. Our set M is slightly different from M2 in (6) to ensure that Ωϵ is a polytope.6 Notice that Ωϵ provides buffer zones with size ϵ to account for the approximation errors ϵ1 and ϵ2. Based on Ωϵ and technique (8), we can further convert the OCO with memory and coupled constraints in Step 3 into a classical OCO problem below. That is, at each t, the agent selects a policy Mt Ωϵ, and then suffers a convex stage cost ft(Mt) defined in (8). We apply online gradient descent to solve this OCO problem, as described in Algorithm 1. We select the stepsizes of OGD to be small enough to ensure small approximation errors from Step 4 and thus small buffer zones, but also to be large enough to allow online policies to adapt to time-varying environments. 
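Before turning to the stepsize conditions, the sketch below makes the search space Ωϵ in (17) concrete. It reuses phi_x and phi_u from the earlier snippet; the constant in the norm bound defining the set M follows our reading of (17), and all names are ours rather than the paper's notation.

```python
import numpy as np

def g_x(A_K, B, M, D_x, w_bar, i):
    """g~^x_i(M) = sum_{k=1}^{2H} || D_{x,i}^T Phi^x_k(M) ||_1 * w_bar (time-invariant M)."""
    H = len(M)
    return w_bar * sum(np.linalg.norm(D_x[i] @ phi_x(A_K, B, M, k), ord=1)
                       for k in range(1, 2 * H + 1))

def g_u(A_K, B, K, M, D_u, w_bar, j):
    """g~^u_j(M) = sum_{k=1}^{2H} || D_{u,j}^T Phi^u_k(M) ||_1 * w_bar."""
    H = len(M)
    return w_bar * sum(np.linalg.norm(D_u[j] @ phi_u(A_K, B, K, M, k), ord=1)
                       for k in range(1, 2 * H + 1))

def in_Omega_eps(A_K, B, K, M, D_x, d_x, D_u, d_u, w_bar, eps, kappa, gamma):
    """Membership test for the buffered search space Omega_eps of (17)."""
    n = A_K.shape[0]
    # Bounded set M: ||M^[i]||_inf <= 2 sqrt(n) kappa^3 (1-gamma)^{i-1}  (our reading of (17))
    for i, Mi in enumerate(M, start=1):
        if np.linalg.norm(Mi, ord=np.inf) > 2 * np.sqrt(n) * kappa**3 * (1 - gamma)**(i - 1):
            return False
    ok_x = all(g_x(A_K, B, M, D_x, w_bar, i) <= d_x[i] - eps for i in range(D_x.shape[0]))
    ok_u = all(g_u(A_K, B, K, M, D_u, w_bar, j) <= d_u[j] - eps for j in range(D_u.shape[0]))
    return ok_x and ok_u
```

Because each constraint function is a sum of L1 norms of affine functions of M, the set Ωϵ is indeed a polytope, which is what the projection step of the algorithm below exploits.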
Conditions for suitable stepsizes are discussed in Theoretical Results. 5Though we consider Mt = = Mt H here, the component M [i] t of Mt = {M [i] t }H i=1 can be different for different i. 6Compared with M2, our M uses the L norm; the n factor accounts for the change of norms; and κB disappears because we can prove that κB is not necessary here (see (Li, Das, and Li 2020)). Algorithm 1: OGD-BZ Input: A (κ, γ)-strongly stable matrix K, parameter H > 0, buffer size ϵ, stepsize ηt. 1 Determine the polytopic constraint set Ωϵ by (17) with buffer size ϵ and initialize M0 Ωϵ. 2 for t = 0, 1, 2, . . . , T do 3 Implement action ut = Kxt+PH i=1 M [i] t wt i. 4 Observe the next state xt+1 and record wt = xt+1 Axt But. 5 Run projected OGD Mt+1 = ΠΩϵ h Mt ηt ft(Mt) i where ft(M) is defined in (8). In Algorithm 1, the most computationally demanding step at each stage is the projection onto the polytope Ωϵ, which requires solving a quadratic program. Nevertheless, one can reduce the online computational burden via offline computation by leveraging the solution structure of quadratic programs (see (Alessio and Bemporad 2009) for more details). Lastly, we note that other OCO algorithms can be applied to solve this problem too, e.g. online natural gradient, online mirror descent, etc. One can also apply projection-free methods, e.g. (Yuan and Lamperski 2018), to reduce the computational burden at the expense of o(T) constraint violation. Remark 2. To ensure safety, safe RL literature usually constructs a safe set for the state (Fisac et al. 2018), while this paper constructs a safe search space Ωϵ for the policies directly. Besides, safe RL literature may employ unsafe policies occasionally, for example, Fisac et al. (2018) allows unsafe exploration policies within the safe set and changes to a safe policy on the boundary of the safe set. However, our search space Ωϵ only contains safe policies. Despite a smaller policy search space, our OGD-BZ still achieves desirable (theoretical) performance. Nevertheless, when the system is unknown, larger sets of exploration policies may benefit the performance, which is left as future work. Remark 3. It is worth comparing our method with a wellknown robust MPC method: tube-based robust MPC (see e.g. Rawlings and Mayne (2009)). Tube-based robust MPC also tightens the constraints to allow for model inaccuracy and/or disturbances. However, tube-based robust MPC considers constraints on the states, while our method converts the state (and action) constraints into the constraints on the policy parameters by leveraging the properties of disturbance-action policies. Theoretical Results In this section, we show that OGD-BZ guarantees both safety and O( T) policy regret under proper parameters. Preparation To establish the conditions on the parameters for our theoretical results, we introduce three quantities ϵ1(H), ϵ2(η, H), ϵ3(H) below. We note that ϵ1(H) and ϵ2(η, H) bound the approximation errors in Step 1 and Step 4 of the previous section respectively (see Lemma 1 and Lemma 3 in the proof of Theorem 1 for more details). ϵ3(H) bounds the constraint violation of the disturbance-action policy M(K), where M(K) approximates the linear controller ut = Kxt for any K K (see Lemma 4 in the proof of Theorem 1 for more details). Definition 5. We define ϵ1(H) = c1n m H(1 γ)H, ϵ2(η, H) = c2η n2 m H2 ϵ3(H) = c3 n(1 γ)H where c1, c2, and c3 are polynomials of Dx , Du , κ, κB, γ 1, w, G. Safety of OGD-BZ Theorem 1 (Feasibility & Safety). 
Consider constant stepsize ηt = η, ϵ 0, H log(2κ2) log((1 γ) 1). If the buffer size ϵ and H satisfy ϵ ϵ ϵ1(H) ϵ3(H), the set Ωϵ is non-empty. Further, if η, ϵ and H also satisfy ϵ ϵ1(H) + ϵ2(η, H), our OGD-BZ is safe, i.e. x OGD-BZ t X and u OGD-BZ t U for all t and for any disturbances {wk W}T k=0. Discussions: Firstly, Theorem 1 shows that ϵ should be small enough to ensure a nonempty Ωϵ and thus valid/feasible outputs of OGD-BZ. This is intuitive since the constraints are more conservative as ϵ increases. Since ϵ1(H) + ϵ3(H) = Θ(H(1 γ)H) decays with H by Definition 5, the first condition also implies a large enough H. Secondly, Theorem 1 shows that, to ensure safety, the buffer size ϵ should also be large enough to allow for the total approximation errors ϵ1(H) + ϵ2(η, H), which is consistent with our discussion in the previous section. To ensure the compatibility of the two conditions on ϵ, the approximation errors ϵ1(H)+ϵ2(η, H) should be small enough, which requires a large enough H and a small enough η by Definition 5. In conclusion, the safety requires a large enough H, a small enough η, and an ϵ which is neither too large nor too small. For example, we can select η ϵ 8c2n2 m H2 , ϵ /4 ϵ 3ϵ /4, and H max( log( 8(c1+c3)n m ϵ T ) log((1 γ) 1 , log(2κ2) log((1 γ) 1)). Remark 4. It can be shown that it is safe to implement any M Ωϵ for an infinite horizon (0 t + ) under the conditions of Theorem 1 based on the proof of Theorem 1. For more details, please refer to the end of the proof of Theorem 1. Policy Regret Bound for OGD-BZ Theorem 2 (Regret Bound). Under the conditions in Theorem 1, OGD-BZ enjoys the regret bound below: Reg(OGD-BZ) O n3m H3ηT + mn +(1 γ)HH2.5T(n3m1.5+ p + ϵTH1.5(n2m + p where the hidden constant depends polynomially on κ, κB, γ 1, Dx , Du , dx 2, du 2, w, G. Theorem 2 provides a regret bound for OGD-BZ as long as OGD-BZ is safe. Notice that as the buffer size ϵ increases, the regret bound becomes worse. This is intuitive since our OGD-BZ will have to search for policies in a smaller set Ωϵ if ϵ increases. Consequently, the buffer size ϵ can serve as a tuning parameter for the trade-off between safety and regrets, i.e. a small ϵ is preferred for low regrets while a large ϵ is preferred for safety (as long as Ωϵ = ). In addition, although a small stepsize η is preferred for safety in Theorem 1, Theorem 2 suggests that the stepsize should not be too small for low regrets since the regret bound contains a Θ(η 1) term. This is intuitive since the stepsize η should be large enough to allow OGD-BZ to adapt to the varying objectives for better online performance. Next, we provide a regret bound with specific parameters. Corollary 1. For sufficiently large T, when H log(8(c1+c2)n m T/ϵ ) log((1 γ) 1) , η = Θ( 1 n2 m H T ), ϵ = ϵ1(H) + ϵ2(η, H) = Θ( log(n m T ) T ), OGD-BZ is safe and Reg(OGD-BZ) O (n3m1.5kc 0.5) Corollary 1 shows that OGD-BZ achieves O( T) regrets when H Θ(log T), η 1 = Θ( T), and ϵ = Θ(1/ T). This demonstrates that OGD-BZ can ensure both constraint satisfaction and sublinear regrets under the proper parameters of the algorithm. We remark that a larger H is preferred for better performance due to smaller approximation errors and a potentially larger policy search space Ωϵ, but the computational complexity of OGD-BZ increases with H. Besides, though the choices of H, η, and ϵ above require the prior knowledge of T, one can apply doubling tricks (Hazan 2019) to avoid this requirement. 
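Putting the pieces together, a schematic version of Algorithm 1 with the parameter scalings of Corollary 1 could be organized as follows. This is a sketch rather than the authors' code: we assume the polytope Ωϵ is available in halfspace form for the vectorized policy m = vec(M), solve the projection with cvxpy, and let the caller supply the gradient of the decoupled surrogate cost f̃_t; all Θ(·) constants are suppressed.

```python
import numpy as np
import cvxpy as cp

def project_polytope(m_vec, G, h):
    """Euclidean projection onto {m : G m <= h} (the projection step of Algorithm 1)."""
    m = cp.Variable(m_vec.size)
    cp.Problem(cp.Minimize(cp.sum_squares(m - m_vec)), [G @ m <= h]).solve()
    return m.value

def ogd_bz(T, m0, G, h, eps, grad_f_tilde, eta, tighten=None):
    """Schematic OGD-BZ loop over the vectorized policy m_t = vec(M_t).

    G, h         : halfspace representation {m : G m <= h} of the policy constraints
    eps          : buffer size; rows flagged by `tighten` become h_i - eps (in OGD-BZ these
                   are the rows encoding g~^x_i and g~^u_j in (17))
    grad_f_tilde : callable (t, m) -> gradient of the decoupled surrogate cost f~_t at m
    eta          : constant stepsize
    """
    tighten = np.ones_like(h, dtype=bool) if tighten is None else tighten
    h_eps = np.where(tighten, h - eps, h)                 # buffer zone => Omega_eps
    m_t = project_polytope(np.asarray(m0, dtype=float), G, h_eps)
    trajectory = [m_t]
    for t in range(T):
        # In the full algorithm, u_t = -K x_t + sum_i M_t^[i] w_{t-i} is implemented here
        # and w_t = x_{t+1} - A x_t - B u_t is recorded before the policy update.
        m_t = project_polytope(m_t - eta * grad_f_tilde(t, m_t), G, h_eps)
        trajectory.append(m_t)
    return trajectory

# Parameter scalings suggested by Theorem 1 and Corollary 1 (constants suppressed):
#   H ~ log(T) / log(1/(1 - gamma)),   eta ~ 1 / (n^2 sqrt(m) H sqrt(T)),
#   eps = eps1(H) + eps2(eta, H) ~ log(T) / sqrt(T).
```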
Lastly, we note that our O( T) regret bound is consistent with the unconstrained online optimal control literature for convex cost functions (Agarwal et al. 2019). For strongly convex costs, the regret for the unconstrained case is logarithmic in T (Agarwal, Hazan, and Singh 2019). We leave the study on the constrained control with strongly convex costs for the future. Proof of Theorem 1 To prove Theorem 1, we first provide lemmas to bound errors by ϵ1(H), ϵ2(η, H), and ϵ3(H), respectively. The proofs of Lemmas 1-4 and Corollary 2 in this subsection are provided in the ar Xiv version (Li, Das, and Li 2020). Firstly, we show that the approximation error in Step 1 of the previous section can be bounded by ϵ1(H). Lemma 1 (Error bound ϵ1(H)). When Mk M for all k and H log(2κ2) log((1 γ) 1), we have max wk w Dx AH K xt H ϵ1(H), max wk w Du KAH K xt H ϵ1(H). The proof of Lemma 1 relies on the boundedness of xt when implementing Mt M as stated below. Lemma 2 (Bound on xt). With Mk M for all k and κ2(1 γ)H < 1, we have where b = κ n w(κ2+2κ5κB mn H) (1 κ2(1 γ)H)γ + 2 mnκ3 w/γ. Hence, when H log(2κ2) log((1 γ) 1), we have b 8 mn2H wκ6κB/γ. Secondly, we show that the error incurred by the Step 3 of the previous section can be bounded by ϵ2(η, H). Lemma 3 (Error bound ϵ2(η, H)). When H log(2κ2) log((1 γ) 1), the policies {Mt}T t=0 generated by OGD-BZ with a constant stepsize η satisfy max 1 i kx | gx i (Mt) gx i (Mt H+1:t)| ϵ2(η, H), max 1 j ku | gu j (Mt) gu j (Mt H:t)| ϵ2(η, H). Thirdly, we show that for any K K, there exists a disturbance-action policy M(K) M to approximate the policy ut = Kxt. However, M(K) may not be safe and is only ϵ3(H)-loosely safe. Lemma 4 (Error bound ϵ3(H)). For any K K, there exists a disturbance-action policy M(K) = {M [i](K)}H i=1 M defined as M [i](K) = (K K)(A BK)i 1 such that max( Dx[x K t x M(K) t ] , Du[u K t u M(K) t ] ) ϵ3(H) where (x K t , u K t ) and (x M(K) t , u M(K) t ) are produced by controller ut = Kxt and disturbance-action policy M(K) respectively. Hence, M(K) is ϵ3(H)-loosely safe. Based on Lemma 4, we can further show that M(K) belongs to a polytopic constraint set in the following corollary. For the rest of the paper, we will omit the arguments in ϵ1(H), ϵ2(η, H), ϵ3(H) for notational simplicity. Corollary 2. Consider K K, if K is ϵ0-strictly safe for ϵ0 0, then M(K) Ωϵ0 ϵ1 ϵ3. Proof of Theorem 1. For notational simplicity, we denote the states and actions generated by OGD-BZ as xt and ut in this proof. First, we show M(K ) Ωϵ below. Since K defined in Assumption 3 is ϵ -strictly safe, by Corollary 2, there exists M(K ) Ωϵ ϵ1 ϵ3. Since the set Ωϵ is smaller as ϵ increases, when ϵ ϵ1 ϵ3 ϵ, we have M(K ) Ωϵ ϵ1 ϵ3 Ωϵ, so Ωϵ is non-empty. Next, we prove the safety by Lemma 1 and Lemma 3 based on the discussions in the previous section. Specifically, OGD-BZ guarantees that Mt Ωϵ for all t. Thus, by Lemma 3, we have gx i (Mt H:t 1) = gx i (Mt H:t 1) gx i (Mt 1) + gx i (Mt 1) ϵ2 + dx,i ϵ for any i. Further, by Step 2 of the previous section and Lemma 1, we have D x,ixt = D x,i AH K xt H + D x,i xt Dx AH K xt H + gx i (Mt H:t 1) ϵ1 + ϵ2 + dx,i ϵ dx,i for any {wk W}T k=0 if ϵ ϵ1 + ϵ2. Therefore, xt X for all wk W. Similarly, we can show ut U for any wk W. Thus, OGD-BZ is safe. Proof of Remark 4. When implementing M Ωϵ for an infinite horizon, we have gx i (M) = gx i (M, . . . , M) dx,i ϵ. Since Proposition 1 holds for any t 0, we still have D x,ixt = D x,i AH K xt H + D x,i xt Dx AH K xt H + gx i (M, . . . 
, M) ϵ1 + dx,i ϵ dx,i for any {wk W}T k=0 if ϵ ϵ1 + ϵ2. Thus, xt X for all wk W for t 0. Constraint satisfaction of ut can be proved similarly. Proof of Theorem 2 We divide the regret into three parts and bound each part. Reg(OGD-BZ) = JT (A) min K K JT (K) | {z } Part i t=0 ft(Mt) min M Ωϵ | {z } Part ii t=0 ft(M) min K K JT (K) | {z } Part iii Bound on Part ii. Firstly, we bound Part ii based on the regret bound of OGD in the literature (Hazan 2019). Lemma 5. With a constant stepsize η, we have Part ii δ2/2η + ηG2 f T/2, where δ = sup M, M Ωϵ M M F 4 mnκ3/γ and Gf = maxt sup M Ωϵ ft(M) F Gb(1 + κ) n wκ2κB γ . Consequently, when H log(2κ2) log((1 γ) 1), we have Gf Θ( n3H3m) and the hidden factor is quadratic on w. The proof details are provided in (Li, Das, and Li 2020). Bound on Part iii. For notational simplicity, we denote M = arg minΩϵ PT t=0 ft(M), K = arg min K JT (K). By Lemma 4, we can construct a loosely safe Map = M(K ) to approximate K . By Corollary 2, we have Map Ω ϵ1 ϵ3. (18) We will bound Part iii by leveraging Map as middle-ground and bounding the Part iii-A and Part iii-B defined below. t=0 ( ft(M ) ft(Map)) | {z } Part iii-A t=0 ft(Map) JT (K ) | {z } Part iii-B Lemma 6. Consider K K and Map = M(K ), then Part iii-B Θ(Tn2m H2(1 γ)H). Lemma 7. Under the conditions in Theorem 2, we have Part iii-A Θ (ϵ1 + ϵ3 + ϵ)TH 3 2 n2m+ kcmn3 We highlight that Map may not belong to Ωϵ by (18). Therefore, even though M is optimal in Ωϵ, Part iii-A can still be positive and has to be bounded to yield a regret bound. This is different from the unconstrained online control literature (Agarwal, Hazan, and Singh 2019), where Part iii-A is non-positive because Map M and M is optimal in the same set M when there are no constraints (see (Agarwal, Hazan, and Singh 2019) for more details). Bound on Part i. Finally, we provide a bound on Part i. Lemma 8. Apply Algorithm 1 with constant stepsize η, then Part i O(Tn2m H2(1 γ)H+n3m H3ηT). The proofs of Lemma 6 and Lemma 8 are similar to those in Agarwal, Hazan, and Singh (2019). Finally, Theorem 2 can be proved by summing up the bounds on Part i, Part ii, Part iii-A, and Part iii-B in Lemmas 5-8 and only explicitly showing the highest order terms. Proof of Lemma 7 We define M = arg minΩ ϵ1 ϵ3 PT t=0 ft(M). By (18), we have PT t=0 ft(Map) PT t=0 ft(M ). Therefore, it suffices to bound PT t=0 ft(M ) PT t=0 ft(M ), which can be viewed as the difference in the optimal values when perturbing the feasible/safe set from Ωϵ to Ω ϵ1 ϵ3. To bound Part iii-A, we establish a perturbation result by leveraging the polytopic structure of Ωϵ and Ω ϵ1 ϵ3. Proposition 2. Consider two polytopes Ω1 = {x : Cx h}, Ω2 = {x : Cx h }, where i 0 for all i. Consider a convex function f(x) that is L-Lipschitz continuous on Ω1. If Ω1 is bounded, i.e. supx1,x 1 Ω1 x1 x 1 2 δ1 and if Ω2 is non-empty, i.e. there exists x Ω2, then | min Ω1 f(x) min Ω2 f(x)| Lδ1 min{i: i>0}(h C x)i . (19) To prove Lemma 7, it suffices to bound the quantities in (19) for our problem and then plug them in (19). Lemma 9. There exists an enlarged polytope Γϵ = { W : C W hϵ} that is equivalent to Ωϵ for any ϵ R, where W contains elements of M and auxiliary variables. Further, under the conditions of Theorem 1, (i) Γ ϵ1 ϵ3 is bounded by δ1 = Θ( mn + kc); (ii) PT t=0 ft(M) is Lipschitz continuous with L = Θ(T(n H)1.5 m); (iii) the difference between Γϵ and Γ ϵ1 ϵ3 satisfies = ϵ + ϵ1 + ϵ3; (iv) there exists W Γϵ s.t. min{i: i>0}(h( ϵ1 ϵ3) C W )i ϵ . 
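In summary, Theorem 2 follows by combining Lemma 8 (Part i), Lemma 5 (Part ii), and Lemmas 6, 7, and 9 (Parts iii-A and iii-B) in the decomposition

Reg(OGD-BZ) = [ J_T(OGD-BZ) − Σ_{t=0}^T f̃_t(M_t) ]  (Part i)
            + [ Σ_{t=0}^T f̃_t(M_t) − min_{M∈Ωϵ} Σ_{t=0}^T f̃_t(M) ]  (Part ii)
            + [ min_{M∈Ωϵ} Σ_{t=0}^T f̃_t(M) − min_{K∈K} J_T(K) ]  (Part iii),

where Part iii is further split around the loosely safe policy M^{ap} = M(K*) into Parts iii-A and iii-B, and only the highest-order terms are kept in the final bound.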
Numerical Experiments In this section, we numerically test our OGD-BZ on a thermal control problem with a Heating Ventilation and Air Conditioning (HVAC) system. Specifically, we consider the linear thermal dynamics studied in Zhang et al. (2016) with additional random disturbances, that is, x(t) = 1 υζ (θo(t) x(t)) 1 υw(t), where x(t) denotes the room temperature at time t, u(t) denotes the control input that is related with the air flow rate of the HVAC system, θo(t) denotes the outdoor temperature, w(t) represents random disturbances, π represents external heat sources impact, υ and ζ are physical constants. We discretize the thermal dynamics with t = 60s. For human comfort and/or safe operation of device, we impose constraints on the room temperature, x(t) [xmin, xmax], and the control inputs, u(t) [umin, umax]. Consider a desirable temperature θset set by the user and a control setpoint uset. Consider the cost function c(t) = qt(x(t) θset)2 + rt(u(t) uset)2. In our experiments, we consider v = 100, ζ = 6, θo = 30 C, π = 1.5, and let wt be i.i.d. generated from Unif( 2, 2). Besides, we consider θset = 24 C, xmin = 22 C, xmax = 26 C, umin = 0, umax = 5. We consider qt = 2 for all t and time-varying rt generated i.i.d. from Unif(0.1, 4). When applying OGD-BZ, we select H = 7 and a diminishing stepsize ηt = Θ(t 0.5), i.e. we let ηt = 0.5(40) 0.5 for t < 40 and ηt = 0.5(t + 1) 0.5 for t 40. Figure 1 plots the comparison of OGD-BZ with different buffer sizes. Specifically, ϵ = 0.04 is a properly chosen buffer size and ϵ = 0.4 offers larger buffer zones. From Figure 1-(a), we can observe that the averaged regret with a properly chosen buffer size ϵ = 0.04 quickly diminishes to 0, which is consistent with Theorem 2. In addition, Figure 1-(b) and Figure 1-(c) plot the range of x(t) and u(t) under random disturbances in 1000 trials to demonstrate the safety of OGD-BZ. With a larger buffer zone, i.e. ϵ = 0.4, the range of xt is smaller and further from the boundaries, thus being safer. Interestingly, the range of u(t) becomes slightly larger, which still satisfies the control constraints because the control constraints are not binding/active in this experiment and which indicates more control power is used here to ensure a smaller range of x(t) under disturbances. Finally, the regret with ϵ = 0.4 is worse than that with ϵ = 0.04, which demonstrates the trade-off between safety and performance and how the choices of the buffer size affect this trade-off. Supplementary Proofs for Lemma 7 Proof of Proposition 2 Since Ω2 Ω1, we have minΩ2 f(x) minΩ1 f(x) 0. Let x 1 = arg minΩ1 f(x). We will show that there exists x 2 Ω2 such that x 1 x 2 2 δ1 mini S(h C x)i , where S = {i : i > 0}. Then, by the Lipschitz continuity, we can prove the bound: minΩ2 f(x) minΩ1 f(x) f(x 2) f(x 1) Lδ1 mini S(h C x)i . In the following, we will show, more generally, that there exists x2 Ω2 that is close to x1 for any x1 Ω1. For ease of notation, we define y = x x, Ωy 1 = {y : Cy h C x}, and Ωy 2 = {y : Cy h C x }. Notice that 0 Ωy 2 and (h C x )i 0. Besides, we have y1 = x1 x Ωy 1. Further, by the convexity of Ωy 1, we have λy1 Ωy 1 for 0 λ 1. If (Cy1)i (h C x )i for all i, then y1 Ωy 2 and x1 Ω2. So we can let x2 = x1 and x2 x1 2 = 0. If, instead, there exists a set S such that for any i S , (Cy1)i > (h C x )i. Then, define λ = min i S (h C x )i Notice that λ [0, 1). We can show that λy1 Ωy 2 below. When i S , (λCy1)i (Cy1)i (h C x )i (Cy1)i = (h C x )i. 
When i S , we have (λCy1)i λ(h C x )i (a) Averaged regret (b) x(t) s range (c) u(t) s range Figure 1: Comparison of OGD-BZ with buffer sizes ϵ = 0.04 and ϵ = 0.4. In Figure (b) and (c), the yellow shade represents the range of x(t) generated by OGD-BZ with ϵ = 0.04, while the grey shade is generated by OGD-BZ with ϵ = 0.4. (h C x )i. Therefore, λy1 Ωy 2. Define x2 = λy1 + x, then x2 Ω2. Notice that x1 x2 2 = y1 y2 2 = (1 λ) y1 2 (1 λ)δ1. Since y1 Ωy 1, when i S , we have 0 (h C x )i < (Cy1)i (h C x)i. Therefore, (h C x )i (Cy1)i (h C x )i (h C x)i = 1 i (h C x)i . Con- sequently, by S S, we have 1 λ maxi S i (h C x)i mini S (h C x)i mini S(h C x)i . Proof of Lemma 9 We first provide an explicit expression for Γϵ and then prove the bounds on Γϵ based on the explicit expression. Lemma 10. For any ϵ R, M Ωϵ if and only if there exist {Y x i,k,l}(1 i kx,1 k 2H,1 l n), {Y u j,k,l}(1 j ku,1 k 2H,1 l n), {Z[i] k,j}(1 i H,1 k m,1 j n) such that P2H k=1 Pn l=1 Y x i,k,l w dx,i ϵ, 1 i kx P2H k=1 Pn l=1 Y u j,k,l w du,j ϵ, 1 j ku Pn j=1 Z[i] k,j 2 nκ3(1 γ)i 1, 1 i H,1 k m Y x i,k,l (D x,i Φx k(M))l Y x i,k,l, i, k, l Y u j,k,l (D u,j Φu k(M))l Y u j,k,l i, k, l Z[i] k,j M [i] k,j Z[i] k,j, i, k, j. Let W denote the vector containing the elements of M, Y x = {Y x i,k,l}, Y u = {Y u j,k,l}, Z = {Z[i] k,j}. Thus, the constraints above can be written as Γϵ = { W : C W hϵ}. Since Lemma 10 holds for any ϵ R, we can similarly define Γ ϵ1 ϵ3 = { W : C W h ϵ1 ϵ3} which is equivalent to Ω ϵ1 ϵ3. Lemma 10 is based on a standard reformulation method in constrained optimization to handle inequalities involving absolute values so the proof is omitted. Proof of (i). Firstly, notice that Pn j=1(M [i] k,j)2 Pn j=1(Z[i] k,j)2 (Pn j=1 Z[i] k,j)2 4nκ6(1 γ)2i 2. Then, j=1 (M [i] k,j)2 j=1 (Z[i] k,j)2 4nmκ6 1 2γ γ2 Similarly, by the first two constraints in Lemma 10 and by the definition of Γ ϵ1 ϵ3, we have P2H k=1 Pn l=1(Y x i,k,l)2 (dx,i + ϵ1 + ϵ3)2/ w2 (dx,i + ϵ )2/ w2 and P2H k=1 Pn l=1(Y u j,k,l)2 (du,j + ϵ1 + ϵ3)2/ w2 (du,j + ϵ )2/ w2. Therefore, Pkx i=1 P2H k=1 Pn l=1(Y x i,k,l)2 Pkx i=1(dx,i + ϵ )2/ w2, and Pku j=1 P2H k=1 Pn l=1(Y u j,k,l)2 Pku j=1(du,j + ϵ )2/ w2. Consequently, M 2 F + Y x 2 F + Y u 2 F + Z 2 F Pkx i=1(dx,i + ϵ )2 + Pku j=1(du,j + ϵ )2 where δ1 = Θ( mn + kc) by the boundedness of ϵ , dx, du. (Although δ1 depends linearly on 1/ w, we will show L = TGf and Gf is quadratic on w by Lemma 5, hence, Lδ1 is still linear with w.) Proof of (ii). Since the gradient of ft(M) is bounded by Gf = Θ( n3m H3), the gradient of PT t=0 ft(M) is bounded by LGf = Θ(T n3m H3). Proof of (iii). Notice that the differences between Γϵ and Γ ϵ1 ϵ3 come from the first two lines of the right-hand-side of inequalities in Lemma 10, which is ϵ + ϵ1 + ϵ3 in total. Proof of (iv). From the proof of Theorem 1, we know that M(K ) Ωϵ ϵ1 ϵ3 Ωϵ. Therefore, there exist corresponding Y x(K ), Y u(K ), Z(K ) such that W = vec(M(K ), Y x(K ), Y u(K ), Z(K )) Γϵ ϵ1 ϵ3 Γϵ. Therefore, min{i: i>0}(h ϵ1 ϵ3 C W )i ϵ1 + ϵ3 ( ϵ + ϵ1 + ϵ3) = ϵ . Conclusion and Future Work This paper studies online optimal control with linear constraints and linear dynamics with random disturbances. We propose OGD-BZ and show that OGD-BZ can satisfy all the constraints despite disturbances and ensure O( T) policy regret. There are many interesting future directions, e.g. 
(i) consider adversarial disturbances and robust stability, (ii) consider soft constraints and unbounded noises, (iii) consider bandit feedback, (iv) reduce the regret bound s dependence on dimensions, (v) consider unknown systems, (vi) consider more general policies than linear policies, (vii) prove logarithmic regrets for strongly convex costs, etc. Acknowledgements This work was conducted while the first author was doing internship at the MIT-IBM Watson AI Lab. We thank the helpful suggestions from Jeff Shamma, Andrea Simonetto, Yang Zheng, Runyu Zhang, and the reviewers. Ethics Statement The primary motivation for this paper is to develop an online control algorithm under linear constraints on the states and actions, and under noisy linear dynamics. Some practical physical systems can be approximated by noisy linear dynamics and most practical systems have to satisfy certain constraints on the states and actions, such as data center cooling and robotics, etc. Our proposed approach ensures to generate control policies that satisfy the constraints even under the uncertainty of unknown noises. Thus our algorithm can potentially be very beneficial for safety critical applications. However, note that our approach relies on a set of technical assumptions, as mentioned in the paper, which may not directly hold for all practical applications. Hence, when applying our algorithm, particular cares are needed when modeling the system and the constraints. References Agarwal, N.; Bullins, B.; Hazan, E.; Kakade, S. M.; and Singh, K. 2019. Online control with adversarial disturbances. In 36th International Conference on Machine Learning, ICML 2019, 154 165. International Machine Learning Society (IMLS). Agarwal, N.; Hazan, E.; and Singh, K. 2019. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, 10175 10184. Alessio, A.; and Bemporad, A. 2009. A survey on explicit model predictive control. In Nonlinear model predictive control, 345 369. Springer. Anava, O.; Hazan, E.; and Mannor, S. 2015. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, 784 792. Aswani, A.; Gonzalez, H.; Sastry, S. S.; and Tomlin, C. 2013. Provably safe and robust learning-based model predictive control. Automatica 49(5): 1216 1226. Bemporad, A.; and Morari, M. 1999. Robust model predictive control: A survey. In Robustness in identification and control, 207 226. Springer. Bemporad, A.; Morari, M.; Dua, V.; and Pistikopoulos, E. N. 2002. The explicit linear quadratic regulator for constrained systems. Automatica 38(1): 3 20. Ben-Tal, A.; El Ghaoui, L.; and Nemirovski, A. 2009. Robust optimization, volume 28. Princeton University Press. Boyd, S.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press. Cao, X.; Zhang, J.; and Poor, H. V. 2018. A virtual-queuebased algorithm for constrained online convex optimization with applications to data center resource allocation. IEEE Journal of Selected Topics in Signal Processing 12(4): 703 716. Chen, X.; Qu, G.; Tang, Y.; Low, S.; and Li, N. 2021. Reinforcement Learning for Decision-Making and Control in Power Systems: Tutorial, Review, and Vision. ar Xiv preprint ar Xiv:2102.01168 . Cheng, R.; Orosz, G.; Murray, R. M.; and Burdick, J. W. 2019. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3387 3395. 
Chow, Y.; Nachum, O.; Duenez-Guzman, E.; and Ghavamzadeh, M. 2018. A lyapunov-based approach to safe reinforcement learning. In Advances in neural information processing systems, 8092 8101. Cohen, A.; Hasidim, A.; Koren, T.; Lazic, N.; Mansour, Y.; and Talwar, K. 2018. Online Linear Quadratic Control. In International Conference on Machine Learning, 1029 1038. Dean, S.; Mania, H.; Matni, N.; Recht, B.; and Tu, S. 2018. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, 4188 4197. Dean, S.; Mania, H.; Matni, N.; Recht, B.; and Tu, S. 2019a. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 1 47. Dean, S.; Tu, S.; Matni, N.; and Recht, B. 2019b. Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), 5582 5588. IEEE. Fazel, M.; Ge, R.; Kakade, S.; and Mesbahi, M. 2018. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In International Conference on Machine Learning, 1467 1476. Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama, S.; Gillula, J.; and Tomlin, C. J. 2018. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control 64(7): 2737 2752. Foster, D. J.; and Simchowitz, M. 2020. Logarithmic regret for adversarial online control. ar Xiv preprint ar Xiv:2003.00189 . Fulton, N.; and Platzer, A. 2018. Safe reinforcement learning via formal methods. In AAAI Conference on Artificial Intelligence. Garcıa, J.; and Fern andez, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1): 1437 1480. Goel, G.; and Hassibi, B. 2020a. The Power of Linear Controllers in LQR Control. ar Xiv preprint ar Xiv:2002.02574 . Goel, G.; and Hassibi, B. 2020b. Regret-optimal control in dynamic environments. ar Xiv preprint ar Xiv:2010.10473 . Hazan, E. 2019. Introduction to online convex optimization. ar Xiv preprint ar Xiv:1909.05207 . Ibrahimi, M.; Javanmard, A.; and Roy, B. V. 2012. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, 2636 2644. Kouvaritakis, B.; Rossiter, J. A.; and Schuurmans, J. 2000. Efficient robust predictive control. IEEE Transactions on automatic control 45(8): 1545 1549. Kveton, B.; Yu, J. Y.; Theocharous, G.; and Mannor, S. 2008. Online Learning with Expert Advice and Finite-Horizon Constraints. In AAAI, 331 336. Lazic, N.; Boutilier, C.; Lu, T.; Wong, E.; Roy, B.; Ryu, M.; and Imwalle, G. 2018. Data center cooling using modelpredictive control. In Advances in Neural Information Processing Systems, 3814 3823. Li, Y.; Chen, X.; and Li, N. 2019. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. In Advances in Neural Information Processing Systems, 14887 14899. Li, Y.; Das, S.; and Li, N. 2020. Online Optimal Control with Affine Constraints. ar Xiv preprint ar Xiv:2010.04891 . Li, Y.; and Li, N. 2020. Leveraging Predictions in Smoothed Online Convex Optimization via Gradient-based Algorithms. Advances in Neural Information Processing Systems 33. Li, Y.; Qu, G.; and Li, N. 2020. Online optimization with predictions and switching costs: Fast algorithms and the fundamental limit. IEEE Transactions on Automatic Control . Li, Y.; Tang, Y.; Zhang, R.; and Li, N. 2019a. 
Distributed Reinforcement Learning for Decentralized Linear Quadratic Control: A Derivative-Free Policy Optimization Approach. ar Xiv preprint ar Xiv:1912.09135 . Li, Y.; Zhong, A.; Qu, G.; and Li, N. 2019b. Online markov decision processes with time-varying transition probabilities and rewards. In ICML workshop on Real-world Sequential Decision Making. Limon, D.; Alvarado, I.; Alamo, T.; and Camacho, E. 2008. On the design of Robust tube-based MPC for tracking. IFAC Proceedings Volumes 41(2): 15333 15338. Limon, D.; Alvarado, I.; Alamo, T.; and Camacho, E. 2010. Robust tube-based MPC for tracking of constrained linear systems with additive disturbances. Journal of Process Control 20(3): 248 260. Mayne, D. Q.; Seron, M. M.; and Rakovi c, S. 2005. Robust model predictive control of constrained linear systems with bounded disturbances. Automatica 41(2): 219 224. Mesbah, A. 2016. Stochastic model predictive control: An overview and perspectives for future research. IEEE Control Systems Magazine 36(6): 30 44. Nonhoff, M.; and M uller, M. A. 2020. An online convex optimization algorithm for controlling linear systems with state and input constraints. ar Xiv preprint ar Xiv:2005.11308 . Oldewurtel, F.; Jones, C. N.; and Morari, M. 2008. A tractable approximation of chance constrained stochastic MPC based on affine disturbance feedback. In 2008 47th IEEE conference on decision and control, 4731 4736. IEEE. Rawlings, J. B.; and Mayne, D. Q. 2009. Model predictive control: Theory and design. Nob Hill Pub. Sallab, A. E.; Abdou, M.; Perot, E.; and Yogamani, S. 2017. Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017(19): 70 76. Wabersich, K. P.; and Zeilinger, M. N. 2018. Linear model predictive safety certification for learning-based control. In 2018 IEEE Conference on Decision and Control (CDC), 7130 7135. IEEE. Yang, Z.; Chen, Y.; Hong, M.; and Wang, Z. 2019. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems, 8353 8365. Yuan, J.; and Lamperski, A. 2018. Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, 6137 6146. Zafiriou, E. 1990. Robust model predictive control of processes with hard constraints. Computers & Chemical Engineering 14(4-5): 359 371. Zanon, M.; and Gros, S. 2019. Safe reinforcement learning using robust MPC. ar Xiv preprint ar Xiv:1906.04005 . Zhang, X.; Shi, W.; Li, X.; Yan, B.; Malkawi, A.; and Li, N. 2016. Decentralized temperature control via HVAC systems in energy efficient buildings: An approximate solution procedure. In Proceedings of 2016 IEEE Global Conference on Signal and Information Processing, 936 940.