# Online Control in Population Dynamics

Noah Golowich (MIT, nzg@mit.edu), Elad Hazan (Google DeepMind & Princeton University, ehazan@princeton.edu), Zhou Lu (Princeton University, zhoul@princeton.edu), Dhruv Rohatgi (MIT, drohatgi@mit.edu), Y. Jennifer Sun (Princeton University, ys7849@princeton.edu)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract. The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies of population dynamics focus on prediction rather than control. Existing mathematical models for population control are often restricted to specific, noise-free dynamics, while real-world population changes can be complex and adversarial. To address this gap, we propose a new framework based on the paradigm of online control. We first characterize a set of linear dynamical systems that can naturally model evolving populations. We then give an efficient gradient-based controller for these systems, with near-optimal regret bounds with respect to a broad class of linear policies. Our empirical evaluations demonstrate the effectiveness of the proposed algorithm for population control even in non-linear models such as SIR and replicator dynamics.

1 Introduction

Dynamical systems involving populations are ubiquitous in describing processes that arise in natural environments. As one example, the SIR model [26] is a fundamental concept in epidemiology, used to describe the spread of infectious diseases within a population. It divides the population into three groups: Susceptible (S), Infected (I), and Removed (R). A susceptible individual has not contracted the disease but may become infected by interacting with an infected individual. A removed individual has either recovered from the disease and gained immunity or is deceased. The population evolves over time according to three ordinary differential equations:
$$\frac{dS}{dt} = -\beta I S, \qquad \frac{dI}{dt} = \beta I S - \theta I, \qquad \frac{dR}{dt} = \theta I, \qquad (1)$$
for constants $\beta, \theta > 0$ representing the infection and recovery rates, respectively.
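For intuition, the qualitative behavior of Eq. (1) can be reproduced with a simple forward-Euler discretization; the following minimal sketch is illustrative only (the step size and parameter values are assumptions, not taken from the paper):

```python
import numpy as np

def sir_step(x, beta=0.5, theta=0.03, dt=0.1):
    """One forward-Euler step of the SIR dynamics in Eq. (1); x = [S, I, R]."""
    S, I, R = x
    dS = -beta * I * S
    dI = beta * I * S - theta * I
    dR = theta * I
    return np.array([S + dt * dS, I + dt * dI, R + dt * dR])

x = np.array([0.9, 0.1, 0.0])   # initial distribution over the three compartments
for _ in range(1000):
    x = sir_step(x)
print(x, x.sum())               # total mass S + I + R is conserved
```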
Numerous extensions of this basic model have been proposed to better capture how epidemics evolve and spread [8]. Beyond epidemiology, population dynamics arise naturally in many other fields, notably evolutionary game theory [23], biology [14, 7], and the analysis of genetic algorithms [39, 25].

For any dynamical system, it is natural to ask how it might best be controlled. For instance, controlling the spread of an infectious disease while minimizing externalities is a problem of significant societal importance. In many natural models these dynamics are nonlinear, so the control problem is often computationally intractable. Existing algorithms are designed on a case-by-case basis by positing a specific system of differential equations and then numerically or analytically solving for the optimal controller. Unfortunately, this approach is not robust to adversarial shocks to the system and cannot adapt to time-varying cost functions.

A new approach for population control. In this paper we propose a generic and robust methodology for population control, drawing on the framework and tools of online nonstochastic control theory to obtain a computationally efficient gradient-based method of control. In online nonstochastic control, at every time $t = 1, \dots, T$, the learner is faced with a state $x_t$ and must choose a control $u_t$. The learner then incurs cost according to some time-varying cost function $c_t(x_t, u_t)$ evaluated at the current state/control pair, and the state evolves as
$$x_{t+1} := f(x_t, u_t) + w_t, \qquad (2)$$
where $f$ describes the (known) discrete-time dynamics, $x_{t+1}$ is the next state, and $w_t$ is an adversarially chosen perturbation. A policy is a mapping from states to controls. The goal of the learner is to minimize regret with respect to some rich policy class $\Pi$, formally defined by
$$\sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^\pi, u_t^\pi), \qquad (3)$$
where $(x_t^\pi, u_t^\pi)$ is the state/control pair at time $t$ had policy $\pi$ been carried out since time 1.

As with prior work in online control [1], our method is theoretically grounded by regret guarantees for a broad class of linear dynamical systems (LDSs). The key algorithmic and technical challenge we overcome is that prior methods only give regret bounds against comparator policies that strongly stabilize the LDS (Definition 4). Such policies force the magnitude of the state to decrease exponentially fast in the absence of noise. Unfortunately, for applications to population dynamics, even the assumption that such policies exist, let alone perform well, is fundamentally unreasonable, since it essentially implies that the population can be made to shrink exponentially.

A priori, one might hope to overcome this issue generically by broadening the comparator class to all policies that marginally stabilize the LDS (informally, policies under which the magnitude of the state does not blow up). But we show that, in general, it is impossible to achieve sublinear regret against that class, a result that may be of independent interest in online control:

Theorem 1 (Informal statement of Theorem 25). There is a distribution $\mathcal{D}$ over LDSs with state space and control space given by $\mathbb{R}$, such that any online control algorithm on a system $L \sim \mathcal{D}$ incurs expected regret $\Omega(T)$ against the class of time-invariant linear policies that marginally stabilize $L$.

(We note that [22, 20] showed that the prediction task in a marginally stable LDS can be solved with sublinear regret via spectral filtering if the state transition matrix is symmetric. The construction for Theorem 1 has symmetric transition matrices, so the result in a sense separates the analogous prediction and control tasks.)

For general LDSs, it is not obvious whether there is a natural intermediate comparator class that does not require strong stabilizability yet still enables control with low regret. However, systems that model populations possess rich additional structure, since they can be interpreted as controlled Markov chains. (Markov decision processes can also be thought of as controlled Markov chains; however, in that setting the controls/actions are at the individual level, whereas we are concerned with controls at the population level, as motivated by applications to epidemiology, evolutionary game theory, and other fields.) In this paper, leveraging that structure, we design an algorithm GPC-Simplex for online control that applies to LDSs constrained to the simplex (Definition 3) and achieves strong regret bounds against a natural comparator class of policies with bounded mixing time (Definition 6).

1.1 Our Results

Throughout this work, we model a population as a distribution over $d$ different categories, evolving over $T$ discrete timesteps. For simplicity, we assume that $u_t$ is a $d$-dimensional real vector.

Theoretical guarantees for online population control. We introduce the simplex LDS model (Definition 3), which is a modification of the standard LDS model (Definition 9) that ensures the states $(x_t)_t$ always represent valid distributions, i.e., never leave the simplex $\Delta_d$. Informally, given state $x_t \in \Delta_d$ and control $u_t \in \mathbb{R}^d_{\ge 0}$ with $\|u_t\|_1 \le 1$, the next state is
$$x_{t+1} = (1 - \gamma_t)\left[(1 - \|u_t\|_1) A x_t + B u_t\right] + \gamma_t w_t,$$
where $A, B$ are known stochastic matrices, $\gamma_t \in [0,1]$ is the observed perturbation strength (see Appendix B for discussion of this modelling assumption), and $w_t \in \Delta_d$ is an unknown perturbation. The perturbation $w_t$ can be interpreted as an adversary that can add individuals from a population with distribution $w_t$ to the population under study. Intuitively, $u_t$ represents a distribution over $d$ possible interventions as well as a null intervention.
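For concreteness, the interaction protocol of Eqs. (2) and (3) can be phrased as a simple loop. The following Python sketch is illustrative; the `controller` interface and the lists `costs` and `noises` are assumptions made for exposition:

```python
def run_online_control(f, controller, costs, noises, x1, T):
    """Generic nonstochastic-control loop for x_{t+1} = f(x_t, u_t) + w_t (Eq. 2).
    costs[t] and noises[t] are revealed only after the control is chosen."""
    x, total_cost = x1, 0.0
    for t in range(T):
        u = controller.act(x)             # learner picks a control from the current state
        total_cost += costs[t](x, u)      # time-varying cost c_t(x_t, u_t) is then revealed
        x = f(x, u) + noises[t]           # adversarial perturbation w_t enters the dynamics
        controller.update(x, costs[t])    # learner may adapt using the observed feedback
    return total_cost
```

Regret (3) compares `total_cost` to the cost of the best fixed policy in the class $\Pi$, evaluated on its own counterfactual trajectory.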
For any simplex LDS $L$ and mixing-time parameter $\tau > 0$, we define a class $\mathcal{K}_\tau(L)$ (Definition 6), which roughly consists of the linear time-invariant policies under which the state of the system would mix to stationarity in time $\tau$, in the absence of noise. Our main theoretical contribution is an algorithm GPC-Simplex that achieves low regret against this policy class:

Theorem 2 (Informal version of Theorem 7). Let $L$ be a simplex LDS on $\Delta_d$, and let $\tau > 0$. For any adversarially chosen perturbations $(w_t)_t$, perturbation strengths $(\gamma_t)_t$, and convex and Lipschitz cost functions $(c_t)_t$, the algorithm GPC-Simplex performs $T$ steps of online control on $L$ with regret $O(\tau^{7/2}\sqrt{dT})$ against $\mathcal{K}_\tau(L)$.

Finally, analogously to Theorem 1, we show that the mixing-time assumption cannot be removed: it is impossible to achieve sublinear regret (for online control of a simplex LDS) against the class of all linear time-invariant policies (Theorem 8).

Experimental evaluations. To illustrate the practicality of our results, we apply (a generalization of) GPC-Simplex to controlled versions of (a) the SIR model for disease transmission (Section 4), and (b) the replicator dynamics from evolutionary game theory (Appendix H). In the former, closed-form optimal controllers are known in the absence of perturbations [27]. We find that GPC-Simplex learns characteristics of the optimal control (e.g., the "turning point" phase transition where interventions stop once herd immunity is reached). Moreover, our algorithm is robust even in the presence of adversarial perturbations, where previous theoretical results no longer apply. In the latter, we demonstrate that even when the control affects the population only indirectly, through the replicator dynamics' payoff matrix, GPC-Simplex can learn to control the population effectively, and is more robust to noisy cost functions than a one-step best-response controller.

1.2 Related work

Online nonstochastic control. In recent years, the machine learning community has witnessed increasing interest in nonstochastic control problems (e.g., [1, 38, 19]). Unlike the classical setting of stochastic control, in nonstochastic control the dynamics are subject to time-varying, adversarially chosen perturbations and cost functions. See [21] for a survey of prior results. Most relevant to our work is the Gradient Perturbation Controller (GPC) for controlling general LDSs [1]. All existing controllers only provide provable regret guarantees against policies that strongly stabilize the system.

Population growth models. There is extensive research on modeling the evolution of populations in sociology, biology, and economics.
Besides the pioneering work of [33], notable models include the SIR model from epidemiology [26], the Lotka-Volterra model for predator-prey dynamics [32, 40], and the replicator dynamics from evolutionary game theory [23]. Recent years have seen intensive study of controlled versions of the SIR model; see, e.g., empirical work [10], vaccination control models [13], and many others [15, 12, 30, 17]. Most relevant to our work is the quarantine control model, where the control reduces the effective transmission rate. Some works consider optimal control in the noiseless setting [27, 5]; follow-up work [34] considers a budget constraint on the control. None of these prior works can handle the general case of adversarial noise and cost functions.

2 Definitions and setup

Notation. Denote by $S_d := \{ M \in [0,1]^{d\times d} : \sum_{i=1}^d M_{i,j} = 1 \ \forall j \in [d] \}$ the set of $d \times d$ column-stochastic matrices. For $a > 0$, define $S_d^a := \{ aM : M \in S_d \}$ and $S_d^{\le a} := \bigcup_{0 \le a' \le a} S_d^{a'}$. Let $\Delta_d$ denote the simplex in $\mathbb{R}^d$. Similarly, we define $\Delta_d^\alpha := \alpha \Delta_d$ and $\Delta_d^{\le \alpha} := \bigcup_{0 \le \alpha' \le \alpha} \Delta_d^{\alpha'}$. Given a square matrix $M \in \mathbb{R}^{d\times d}$, let $M_{\cdot,j}$ denote the $j$th column of $M$. We consider the following matrix norms: $\|M\|$ denotes the spectral norm of $M$; $\|M\|_{2,1}^2 := \sum_{j=1}^d \|M_{\cdot,j}\|_1^2$ is the sum of the squares of the $\ell_1$ norms of the columns of $M$; and $\|M\|_{1\to 1} := \sup_{x \in \mathbb{R}^d : \|x\|_1 = 1} \|Mx\|_1$.

2.1 Dynamical systems

The standard model in online control is the linear dynamical system (LDS). We define a simplex LDS to be an LDS where the state of the system always lies in the simplex. This requires enforcing certain constraints on the transition matrices, the control, and the noise:

Definition 3 (Simplex LDS). Let $d \in \mathbb{N}$. A simplex LDS on $\Delta_d$ is a tuple $L = (A, B, \mathcal{I}, x_1, (\gamma_t)_{t\in\mathbb{N}}, (w_t)_{t\in\mathbb{N}}, (c_t)_{t\in\mathbb{N}})$, where $A, B \in S_d$ are the transition matrices; $\mathcal{I} \subseteq \Delta_d^{\le 1}$ is the valid control set; $x_1 \in \Delta_d$ is the initial state; $\gamma_t \in [0,1]$ and $w_t \in \Delta_d$ are the noise strength and noise value at time $t$; and $c_t : \Delta_d \times \mathcal{I} \to \mathbb{R}$ is the cost function at time $t$. These parameters define a dynamical system where the state at time $t = 1$ is $x_1$. For each $t \ge 1$, given state $x_t$ and control $u_t \in \mathcal{I}$ at time $t$, the state at time $t+1$ is
$$x_{t+1} = (1 - \gamma_t)\left[(1 - \|u_t\|_1) A x_t + B u_t\right] + \gamma_t w_t, \qquad (4)$$
and the cost incurred at time $t$ is $c_t(x_t, u_t)$.

Note that since the set of possible controls $\mathcal{I}$ is contained in $\Delta_d^{\le 1}$, the states $(x_t)_t$ are guaranteed to remain within the simplex for all $t$. In this paper, we will assume that $\mathcal{I} = \bigcup_{\alpha \in [\underline{\alpha}, \bar{\alpha}]} \Delta_d^\alpha$ for some parameters $\underline{\alpha}, \bar{\alpha} \in [0,1]$, which represent lower and upper bounds on the strength of the control.

Online nonstochastic control. Let $L = (A, B, \mathcal{I}, x_1, (\gamma_t)_{t\in\mathbb{N}}, (w_t)_{t\in\mathbb{N}}, (c_t)_{t\in\mathbb{N}})$ be a simplex LDS and let $T \in \mathbb{N}$. We assume that the transition matrices $A, B$ are known to the controller at the beginning of time, but the perturbations $(w_t)_{t=1}^T$ are unknown. At each step $1 \le t \le T$, the controller observes $x_t$ and $\gamma_t$, plays a control $u_t \in \mathcal{I}$, and then observes the cost function $c_t$ and incurs cost $c_t(x_t, u_t)$. The system then evolves according to Eq. (4). Note that our assumption that the controller observes $\gamma_t$ contrasts with some of the existing work on nonstochastic control [1, 21], in which no information about the adversarial disturbances is known. In Appendix B, we justify the learner's ability to observe $\gamma_t$ by observing that in many situations, the learner observes the counts of individuals in a population (in addition to their proportions, represented by the state $x_t$), and this additional information allows computation of $\gamma_t$.
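As a sanity check on Definition 3, here is a minimal sketch of the update (4). Since each bracketed term is a convex combination of points in the simplex, the state provably stays in $\Delta_d$; the variable names and test values are illustrative:

```python
import numpy as np

def simplex_lds_step(A, B, x, u, gamma, w):
    """One step of the simplex LDS dynamics, Eq. (4).
    A, B are column-stochastic; x, w lie in the simplex; ||u||_1 <= 1; gamma in [0, 1]."""
    un = np.sum(u)                                # ||u_t||_1, the strength of the control
    inner = (1 - un) * (A @ x) + B @ u            # convex combination: stays in the simplex
    return (1 - gamma) * inner + gamma * w

d = 3
rng = np.random.default_rng(0)
A = rng.random((d, d)); A /= A.sum(axis=0)        # random column-stochastic matrices
B = rng.random((d, d)); B /= B.sum(axis=0)
x = np.ones(d) / d
u = 0.2 * np.ones(d) / d                          # control of norm 0.2
x_next = simplex_lds_step(A, B, x, u, 0.1, np.array([1.0, 0.0, 0.0]))
print(x_next, x_next.sum())                       # next state sums to 1
```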
The goal of the controller is to minimize regret with respect to some class $\mathcal{K} = \mathcal{K}(L)$ of comparator policies. Formally, for any fixed dynamical system and any time-invariant, Markovian policy $K : \Delta_d \to \mathcal{I}$, let $(x_t(K))_t$ and $(u_t(K))_t$ denote the counterfactual sequences of states and controls that would have been obtained by following policy $K$. Then the regret of the controller on observed sequences $(x_t)_t$ and $(u_t)_t$ with respect to $\mathcal{K}$ is
$$\sum_{t=1}^{T} c_t(x_t, u_t) - \inf_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t(x_t(K), u_t(K)).$$

The following assumption on the cost functions of $L$ is standard in online control [1, 3, 38, 37]:

Assumption 1. The cost functions $c_t : \Delta_d \times \mathcal{I} \to \mathbb{R}$ are convex and $L$-Lipschitz, in the following sense: for all $x, x' \in \Delta_d$ and $u, u' \in \mathcal{I}$, we have $|c_t(x, u) - c_t(x', u')| \le L\,(\|x - x'\|_1 + \|u - u'\|_1)$.

2.2 Comparator class and spectral conditions

In prior works on nonstochastic control for linear dynamical systems [1], the comparator class $\mathcal{K} = \mathcal{K}_{\kappa,\rho}(L)$ is defined to be the set of linear, time-invariant policies $x \mapsto Kx$ where $K \in \mathbb{R}^{d\times d}$ $(\kappa,\rho)$-strongly stabilizes $L$:

Definition 4. A matrix $M \in \mathbb{R}^{d\times d}$ is $(\kappa, \rho)$-strongly stable if there is a matrix $H \in \mathbb{R}^{d\times d}$ such that $\|H^{-1} M H\| \le 1 - \rho$ and $\|M\|, \|H\|, \|H^{-1}\| \le \kappa$. A matrix $K \in \mathbb{R}^{d\times d}$ is said to $(\kappa, \rho)$-strongly stabilize an LDS with transition matrices $A, B \in \mathbb{R}^{d\times d}$ if $A + BK$ is $(\kappa, \rho)$-strongly stable.

The regret bounds against $\mathcal{K}_{\kappa,\rho}(L)$ scale with $\rho^{-1}$, and so are vacuous for $\rho = 0$ [1]. Unfortunately, in the simplex LDS setting, no policies satisfy the analogous notion of strong stability (see discussion in Section 3) unless $\rho = 0$. Intuitively, the reason is that a $(\kappa, \rho)$-strongly stable policy with $\rho > 0$ makes the state converge to 0 in the absence of noise.

What is a richer but still-tractable comparator class for a simplex LDS? We propose the class of linear, time-invariant policies under which (a) the state of the LDS mixes, when viewed as a distribution, and (b) the level of control $\|u_t\|_1$ is independent of the state $x_t$. Formally, we make the following definitions:

Definition 5. Given $t \in \mathbb{N}$ and a matrix $X \in S_d$ with unique stationary distribution $\pi \in \Delta_d$, we define $D_X(t) := \sup_{p \in \Delta_d} \|X^t p - \pi\|_1$ and $\bar{D}_X(t) := \sup_{p,q\in\Delta_d} \|X^t(p - q)\|_1$. Moreover, we define $t_{\mathrm{mix}}(X, \varepsilon) := \min_{t\in\mathbb{N}} \{ t : D_X(t) \le \varepsilon \}$ for each $\varepsilon > 0$, and we write $t_{\mathrm{mix}}(X) := t_{\mathrm{mix}}(X, 1/4)$. (If $X$ does not have a unique stationary distribution, we say that all of these quantities are infinite.)

Definition 6 (Mixing a simplex LDS). Let $L$ be a simplex LDS with transition matrices $A, B \in S_d$ and control set $\mathcal{I} = \bigcup_{\alpha\in[\underline\alpha,\bar\alpha]} \Delta_d^\alpha$. A matrix $K \in S_d^{[\underline\alpha,\bar\alpha]}$ (where $S_d^{[\underline\alpha,\bar\alpha]} := \bigcup_{a\in[\underline\alpha,\bar\alpha]} S_d^a$) is said to $\tau$-mix $L$ if $t_{\mathrm{mix}}(A_K) \le \tau$, where
$$A_K := (1 - \|K\|_{1\to1})\, A + BK \in S_d. \qquad (5)$$
We define the comparator class $\mathcal{K}_\tau = \mathcal{K}_\tau(L)$ as the set of linear, time-invariant policies $x \mapsto Kx$ where $K \in S_d^{[\underline\alpha,\bar\alpha]}$ $\tau$-mixes $L$.

Notice that for any $K \in S_d^{[\underline\alpha,\bar\alpha]}$, the linear policy $u_t := K x_t$ always plays controls in the control set $\mathcal{I}$, and the dynamics (4) under this policy can be written as $x_{t+1} = (1 - \gamma_t) A_K x_t + \gamma_t w_t$. Notice also that by considering the comparator class $\mathcal{K}_\tau$, we require the control norm to be independent of the state. This assumption is needed for technical reasons: without it, since $Ax$ is multiplied by $1 - \|u\|_1$ in the transition dynamics (see Eq. (4)), even a linear policy $u := Kx$ does not induce a linear transition, so it would no longer be clear how to define the mixing time of a linear policy. It is a very interesting question whether there is a more natural (yet still tractable) definition of a simplex LDS that avoids this issue.
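To make Definitions 5 and 6 concrete, the sketch below numerically estimates the mixing time of the closed-loop matrix $A_K$ by powering it and measuring the worst-case $\ell_1$ distance between columns, which upper-bounds $D_X(t)$ since $D_X(t) \le \bar D_X(t)$. This is an illustrative computation, not part of the paper's algorithm:

```python
import numpy as np
from itertools import combinations

def tmix_upper_bound(X, eps=0.25, t_max=10_000):
    """Smallest t with bar{D}_X(t) <= eps, where bar{D}_X(t) = max_{i,j} ||X^t (e_i - e_j)||_1.
    Since D_X(t) <= bar{D}_X(t), this upper-bounds t_mix(X, eps) of Definition 5."""
    d = X.shape[0]
    P = np.eye(d)
    for t in range(1, t_max + 1):
        P = X @ P                                     # P = X^t; its columns are X^t e_j
        gap = max(np.abs(P[:, i] - P[:, j]).sum() for i, j in combinations(range(d), 2))
        if gap <= eps:
            return t
    return np.inf                                     # e.g. X is periodic or reducible

def closed_loop(A, B, K):
    """A_K from Eq. (5) for a nonnegative K: (1 - ||K||_{1->1}) A + B K."""
    alpha = K.sum(axis=0).max()                       # ||K||_{1->1} = max column sum
    return (1 - alpha) * A + B @ K
```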
3 Online Algorithm and Theoretical Guarantee

In this section, we describe our main upper bound and accompanying algorithm for online control in a simplex LDS $L$. As discussed above, we assume that the set of valid controls is $\mathcal{I} = \bigcup_{\alpha\in[\underline\alpha,\bar\alpha]} \Delta_d^\alpha$ for some constants $0 \le \underline\alpha \le \bar\alpha \le 1$, representing lower and upper bounds on the strength of the control. (While our techniques allow some more general choices of $\mathcal{I}$, we leave a full investigation of general $\mathcal{I}$ to future work.) For convenience, we write $\alpha_t := \|u_t\|_1$ and $u'_t = u_t/\alpha_t \in \Delta_d$ (if $\alpha_t = 0$, we set $u'_t := 0$). The dynamical system (4) can then be expressed as follows:
$$x_{t+1} = (1-\gamma_t)\left[(1-\alpha_t) A x_t + \alpha_t B u'_t\right] + \gamma_t w_t. \qquad (6)$$

We aim to obtain a regret guarantee as in Eq. (3) with respect to some rich class of comparator policies $\mathcal{K}$. As is typical in existing work on linear nonstochastic control, we take $\mathcal{K}$ to be a class of time-invariant linear policies, i.e., policies that choose control $u_t := K x_t$ at time $t$ for some matrix $K \in \mathbb{R}^{d\times d}$. In the standard setting of nonstochastic control, it is typically further assumed that all policies in the comparator class strongly stabilize the LDS (Definition 4); in light of Theorem 1, such an assumption cannot be dropped. The naive generalization of such a requirement to our setting would be that $A_K$ is strongly stable; however, this is impossible, since no stochastic matrix can be strongly stable. Instead, we aim to compete against the class $\mathcal{K} = \mathcal{K}_\tau(L)$ of time-invariant linear policies that (a) have a fixed level of control in $[\underline\alpha, \bar\alpha]$, and (b) $\tau$-mix $L$ (Definition 6). We view the second condition as a natural distributional analogue of strong stabilizability; the first condition is needed for $\tau$-mixing to even be well-defined.

Algorithm description. Our main algorithm, GPC-Simplex (Algorithm 1), is a modification of the GPC algorithm [1, 21]. As a refresher, GPC chooses the controls $u_t$ by learning a disturbance-action policy: a policy $u_t := K x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i}$, where $K$ is a known, fixed matrix that strongly stabilizes the LDS; $w_{t-1}, \dots, w_{t-H}$ are the recent noise terms; and $M^{[1]}, \dots, M^{[H]}$ are learnable, matrix-valued parameters, which we abbreviate as $M^{[1:H]}$. The key advantage of this parametrization of policies (as opposed to a simpler parametrization such as $u_t = K x_t$ for a parameter $K$) is that the entire trajectory is linear in the parameters, rather than a high-degree polynomial. Thus, optimizing the cost of a trajectory over the class of disturbance-action policies is a convex problem in $M^{[1:H]}$.

But why is the class of disturbance-action policies expressive enough to compete against the comparator class? This is where GPC crucially uses strong stabilizability. Notice that in the absence of noise, every disturbance-action policy is identical to the fixed policy $u_t := K x_t$. This is fine when $K$ and the comparator class are strongly stabilizing, since in the absence of noise all strongly stabilizing policies rapidly force the state to 0, and thus incur very similar costs in the long run. But in the simplex LDS setting, strong stabilizability is impossible. While all policies in $\mathcal{K}$ mix the LDS, they may mix to different states, which may incur different costs. There is no reason to expect that an arbitrary $K \in \mathcal{K}$, chosen before observing the cost functions, will have low regret against all policies in $\mathcal{K}$. We fix this issue by enriching the class of disturbance-action policies with an additional parameter $p \in \Delta_d$ which, roughly speaking, represents the desired stationary distribution to which $x_t$ would converge, in the absence of noise, as $t \to \infty$.
It is unreasonable to expect prior knowledge of the optimal choice of $p$, which depends on the not-yet-observed cost functions. Thus, GPC-Simplex instead learns $p$ together with $M^{[1:H]}$. We retain the property that the requisite online learning problem is convex in the parameters, and therefore can be efficiently solved via an online convex optimization algorithm (as discussed in Appendix C.1, we use lazy mirror descent, LazyMD).

One advantage of GPC-Simplex over GPC is that the former requires no knowledge of a fixed reference policy $K$ (which, in the context of GPC, had to be strongly stabilizing). While such a $K$ is needed in the context of GPC to bound a certain approximation error involving the cost functions, in the context of GPC-Simplex this approximation error may be bounded by some simple casework involving properties of stochastic matrices (see Appendix C.3).

Formally, for parameters $a_0 \in [\underline\alpha, \bar\alpha]$ and $H \in \mathbb{N}$, GPC-Simplex considers a class of policies parametrized by the set $\mathcal{X}_{d,H,a_0,\bar\alpha} := \bigcup_{a\in[a_0,\bar\alpha]} \Delta_d^a \times (S_d^a)^H$. We abbreviate elements $(p, (M^{[1]}, \dots, M^{[H]})) \in \mathcal{X}_{d,H,a_0,\bar\alpha}$ by $(p, M^{[1:H]})$. The high-level idea of GPC-Simplex, like that of GPC, is to perform online convex optimization on the domain $\mathcal{X}_{d,H,a_0,\bar\alpha}$ (Line 10). At each time $t$, the current iterate $(p_t, M^{[1:H]}_t)$, which defines a policy $\pi^{p_t, M^{[1:H]}_t}$, is used to choose the control $u_t$. The optimization subroutine then receives a new loss function $\ell_t : \mathcal{X}_{d,H,a_0,\bar\alpha} \to \mathbb{R}$ based on the newly observed cost function $c_t$. As with GPC, showing that this algorithm works requires showing that the policy class is sufficiently expressive. Unlike for GPC, our comparator policies are not strongly stabilizing, so new ideas are required for the proof.

We next formally define the policy $\pi^{p, M^{[1:H]}}$ associated with parameters $(p, M^{[1:H]})$, and the loss function $\ell_t$ used to update the optimization algorithm at time $t$.

Parametrization of policies. First, for $t \in [T]$ and $i \in \mathbb{N}$, we define the weights
$$\lambda_{t,i} := \gamma_{t-i} \prod_{j=1}^{i-1} (1 - \gamma_{t-j}), \qquad \bar\lambda_{t,i} := \prod_{j=1}^{i} (1 - \gamma_{t-j}), \qquad \lambda_{t,0} := 1 - \sum_{i=1}^{H} \lambda_{t,i}. \qquad (7)$$
We adopt the conventions $w_0 := x_1$, $\gamma_0 = 1$, and $w_t = 0$ for $t < 0$. (As a result of this convention, $\sum_{i=1}^{t} \lambda_{t,i} = 1$ for all $t \in [T]$, and $\lambda_{t,i} = 0$ for all $i > t$.) The weight $\lambda_{t,i}$ can be interpreted as the influence of perturbation $w_{t-i}$ on the state $x_t$, and $\bar\lambda_{t,i}$ as the influence of perturbations prior to time step $t-i$ on the state $x_t$. An element $(p, M^{[1:H]}) \in \mathcal{X}_{d,H,a_0,\bar\alpha}$ induces a policy at time $t$, denoted $\pi^{p,M^{[1:H]}}_t$, via the following variant of disturbance-action control [1]:
$$\pi^{p,M^{[1:H]}}_t(\delta_{t-1:t-H}) := \lambda_{t,0}\, p + \sum_{j=1}^{H} \lambda_{t,j} M^{[j]} \delta_{t-j}. \qquad (8)$$
(Technically, we are slightly abusing terminology here, since $\pi^{p,M^{[1:H]}}_t$ takes as input the previous $H$ disturbances $\delta_{t-1:t-H}$, as opposed to the current state $x_t$.) In Line 6 of GPC-Simplex, the control $u_t$ is chosen to be $\pi^{p_t,M^{[1:H]}_t}_t(w_{t-1:t-H})$, which belongs to $\Delta_d^{\|p_t\|_1}$ (using $\sum_{i=0}^{H} \lambda_{t,i} = 1$) and hence to the constraint set $\mathcal{I}$ (since $\|p_t\|_1 \in [a_0, \bar\alpha] \subseteq [\underline\alpha, \bar\alpha]$).
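The following is a minimal sketch of the weight computation in Eq. (7) and the disturbance-action control in Eq. (8); the function names and data layout are illustrative assumptions:

```python
import numpy as np

def lambda_weights(gammas, t, H):
    """Weights from Eq. (7): lambda_{t,i} = gamma_{t-i} * prod_{j<i} (1 - gamma_{t-j}),
    with the conventions gamma_0 = 1 and lambda_{t,i} = 0 for i > t."""
    lam = []
    for i in range(1, H + 1):
        if t - i < 0:
            lam.append(0.0)
            continue
        prod = 1.0
        for j in range(1, i):
            prod *= 1.0 - gammas[t - j]
        lam.append(gammas[t - i] * prod)
    return 1.0 - sum(lam), lam          # (lambda_{t,0}, [lambda_{t,1}, ..., lambda_{t,H}])

def disturbance_action_control(p, Ms, w, gammas, t):
    """Eq. (8): u_t = lambda_{t,0} p + sum_j lambda_{t,j} M^[j] w_{t-j}.
    w[s] holds the disturbance w_s, with w_0 = x_1 by convention; Ms = [M^[1], ..., M^[H]]."""
    lam0, lam = lambda_weights(gammas, t, len(Ms))
    u = lam0 * p
    for j, M in enumerate(Ms, start=1):
        if t - j >= 0:
            u = u + lam[j - 1] * (M @ w[t - j])
    return u
```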
Algorithm 1 GPC-Simplex: GPC for a simplex LDS
Require: Linear system $A, B$; mixing time $\tau > 0$ for the comparator class; horizon parameter $H \in \mathbb{N}$; set of valid controls $\mathcal{I} = \bigcup_{\alpha\in[\underline\alpha,\bar\alpha]} \Delta_d^\alpha$; total number of time steps $T$.
1: Write $\tau_A := t_{\mathrm{mix}}(A)$, and define $a_0 := \max\{\underline\alpha, \min\{\bar\alpha,\ \mathbb{1}\{\tau_A > 4\tau\}/(96\tau)\}\}$.
2: Initialize an instance LazyMD of mirror descent (Algorithm 2) for the domain $\mathcal{X}_{d,H,a_0,\bar\alpha}$ with regularizer $R_{d,H}$ (defined in Appendix C.1) and step size $\eta = c\sqrt{dH\ln(d)}/(L\tau^2\log^2(T)\sqrt{T})$, for a sufficiently small constant $c$.
3: Initialize $(p_1, M^{[1:H]}_1) \leftarrow \arg\min_{(p, M^{[1:H]})\in\mathcal{X}_{d,H,a_0,\bar\alpha}} R_{d,H}(p, M^{[1:H]})$.
4: Observe initial state $x_1 \in \Delta_d$.
5: for $1 \le t \le T$ do
6: Choose control $u_t := \lambda_{t,0}\, p_t + \sum_{i=1}^{H} \lambda_{t,i} M^{[i]}_t w_{t-i}$.
7: Receive cost $c_t(x_t, u_t)$.
8: Observe $x_{t+1}, \gamma_t$ and compute $w_t = \gamma_t^{-1}\big(x_{t+1} - (1-\gamma_t)[(1-\|u_t\|_1) A x_t + B u_t]\big)$. (If $\gamma_t = 0$, set $w_t = 0$.)
9: Define the loss function $\ell_t(p, M^{[1:H]}) := c_t(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]}))$.
10: Update $(p_{t+1}, M^{[1:H]}_{t+1}) \leftarrow \mathrm{LazyMD}_t(\ell_t; (p_t, M^{[1:H]}_t))$.

Loss functions. For $(p, M^{[1:H]}) \in \mathcal{X}_{d,H,a_0,\bar\alpha}$, we let $x_t(p, M^{[1:H]})$ and $u_t(p, M^{[1:H]})$ denote the state and control at step $t$ obtained by following the policy $\pi^{p,M^{[1:H]}}_s$ at all time steps $s$ prior to $t$ (see Eqs. (20) and (21) in the appendix for precise definitions). We then define $\ell_t(p, M^{[1:H]})$ to be the evaluation of the adversary's cost function $c_t$ on the state-action pair $(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]}))$ (Line 9).

Main guarantee and proof overview. Theorem 7 gives our regret upper bound for GPC-Simplex:

Theorem 7. Let $d, T \in \mathbb{N}$ and $\tau > 0$. Let $L = (A, B, \mathcal{I}, x_1, (\gamma_t)_{t\in\mathbb{N}}, (w_t)_{t\in\mathbb{N}}, (c_t)_{t\in\mathbb{N}})$ be a simplex LDS with cost functions $(c_t)_t$ satisfying Assumption 1 for some $L > 0$. Set $H := \tau\lceil\log(2LT^3)\rceil$. Then the iterates $(x_t, u_t)_{t=1}^T$ of GPC-Simplex (Algorithm 1) with input $(A, B, \tau, H, \mathcal{I}, T)$ satisfy
$$\mathrm{regret}_{\mathcal{K}_\tau(L)} := \sum_{t=1}^{T} c_t(x_t, u_t) - \inf_{K\in\mathcal{K}_\tau(L)} \sum_{t=1}^{T} c_t(x_t(K), u_t(K)) \le \tilde O\big(L\,\tau^{7/2}\, d^{1/2}\,\sqrt{T}\big),$$
where $\tilde O(\cdot)$ hides only universal constants and polylogarithmic dependence on $T$. Moreover, the time complexity of GPC-Simplex is $\mathrm{poly}(d, T)$.

While for simplicity we have stated our results for obliviously chosen $(\gamma_t)_t, (w_t)_t, (c_t)_t$, since GPC-Simplex is deterministic the result also holds when these parameters are chosen adaptively by an adversary. See Appendix C for the formal proof of Theorem 7.

Lower bound. We also show that the mixing assumption on the comparator class $\mathcal{K}_\tau(L)$ (Definition 6) cannot be removed. In particular, without that assumption, if the valid control set $\mathcal{I}$ is restricted to controls $u_t$ of norm at most roughly $O(1/T)$, then linear regret is unavoidable. (We remark that, with the mixing assumption, Theorem 7 does achieve $O(\sqrt{T})$ regret when $\mathcal{I} := \bigcup_{\alpha\in[0,O(1/T)]} \Delta_2^\alpha$; in particular, there is no hidden dependence on $\mathcal{I}$ in the regret bound.)

Theorem 8 (Informal statement of Theorem 30). Let $\beta > 0$ be a sufficiently large constant. For any $T \in \mathbb{N}$, there is a distribution $\mathcal{D}$ over simplex LDSs with state space $\Delta_2$ and control space $\bigcup_{\alpha\in[0,\beta/T]} \Delta_2^\alpha$, such that any online control algorithm on a system $L \sim \mathcal{D}$ incurs expected regret $\Omega(T)$ against the class of all time-invariant linear policies $x \mapsto Kx$ with $K \in \bigcup_{\alpha\in[0,\beta/T]} S_2^\alpha$.

4 Experimental Evaluation

The previous sections focused on linear systems, but GPC-Simplex can easily be modified to control non-linear systems, for similar reasons as in prior work [2]. It suffices for the dynamics to have the form
$$x_{t+1} := (1 - \gamma_t) f(x_t, u_t) + \gamma_t w_t \qquad (9)$$
for known $f$, observed $\gamma_t$, and unknown $w_t$. See Appendix E for discussion of the needed modifications and other implementation details. Relevant code is open-sourced in [16]. As a case study, in this section we apply GPC-Simplex (Algorithm 1) to a disease transmission model, specifically a controlled generalization of the SIR model introduced earlier. In Appendix H we apply GPC-Simplex to a controlled version of the replicator dynamics from evolutionary game theory.
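For dynamics of the form (9), the only changes to the control loop are the state update and the disturbance recovery step. A minimal sketch, assuming `f` is the known nonlinear transition (Appendix E describes the actual modifications used in the experiments):

```python
def nonlinear_step(f, x, u, gamma, w):
    """State update for the nonlinear model of Eq. (9)."""
    return (1 - gamma) * f(x, u) + gamma * w

def recover_disturbance(f, x, u, x_next, gamma):
    """Invert Eq. (9) for w_t given the observed next state and gamma_t (cf. Line 8 of Algorithm 1)."""
    if gamma == 0:
        return 0 * x
    return (x_next - (1 - gamma) * f(x, u)) / gamma
```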
A controlled disease transmission model. The Susceptible-Infectious-Recovered (SIR) model is a basic model for the spread of an epidemic [26]. The SIR model has been studied extensively since the last century [36, 41, 4, 24, 6] and attracted renewed interest during the COVID-19 pandemic [10, 29, 9]. As discussed previously, this model posits that a population consists of susceptible (S), infected (I), and recovered (R) individuals. When a susceptible individual comes into contact with an infected individual, the susceptible individual becomes infected at some transmission rate $\beta$. Infected individuals become uninfected and gain immunity at some recovery rate $\theta$. We consider a natural generalization of the standard dynamics (1) where recovered individuals may also lose immunity at a rate $\xi$. Formally, in the absence of control, the population evolves over time according to the following system of differential equations:
$$\frac{dS}{dt} = -\beta I S + \xi R, \qquad \frac{dI}{dt} = \beta I S - \theta I, \qquad \frac{dR}{dt} = \theta I - \xi R. \qquad (10)$$
Typically $\beta > \theta > \xi$. We normalize the total population to 1, so that $x = [S, I, R] \in \Delta_3$.

Next, we introduce a variable called the preventative control $u_t \in \Delta_2$, which has the effect of decreasing the transmission rate $\beta$, and adversarial perturbations $w_t$, which allow for model misspecification. Incorporating these changes into the forward discretization of Eq. (10) gives the following dynamics:
$$\begin{bmatrix} S_{t+1} \\ I_{t+1} \\ R_{t+1} \end{bmatrix} = (1-\gamma_t)\left( \begin{bmatrix} 1-\beta I_t & 0 & \xi \\ 0 & 1-\theta & 0 \\ 0 & \theta & 1-\xi \end{bmatrix} \begin{bmatrix} S_t \\ I_t \\ R_t \end{bmatrix} + \begin{bmatrix} \beta I_t S_t & 0 \\ 0 & \beta I_t S_t \\ 0 & 0 \end{bmatrix} u_t \right) + \gamma_t w_t. \qquad (11)$$
The control $u_t \in \Delta_2$ represents a distribution over transmission prevention protocols: $u_t = [1, 0]$ represents full-scale prevention, whereas $u_t = [0, 1]$ represents that no prevention measure is imposed. Concretely, the effective transmission rate under control $u_t$ is $\beta \cdot u_t(2)$.

Parameters and cost function. To model a highly infectious pandemic, we consider Eq. (11) with parameters $\beta = 0.5$, $\theta = 0.03$, and $\xi = 0.005$. Suppose we want to control the number of infected individuals by modulating a (potentially expensive) prevention protocol $u_t$. To model this setting, the cost function includes (1) a quadratic cost for the infected individuals $I_t$, and (2) a cost that is bilinear in the magnitude of prevention and the susceptible individuals:
$$c_t(x_t, u_t) = c_3\, x_t(2)^2 + c_2\, x_t(1)\, u_t(1), \qquad (12)$$
where $x_t = [S_t, I_t, R_t]$. Typically $c_3 \ge c_2 > 0$, to model the high cost of infection.

In Fig. 1, we compare GPC-Simplex against two baselines, (a) always executing $u_t = [1, 0]$ (i.e., full prevention) and (b) always executing $u_t = [0, 1]$ (i.e., no prevention), for $T = 200$ steps in the above model with no perturbations. We observe that GPC-Simplex suppresses the transmission rate via high prevention at the initial stage of the disease outbreak, then relaxes as the outbreak is effectively controlled. Moreover, GPC-Simplex outperforms both baselines in terms of cumulative cost. See Appendix F for additional experiments exhibiting the robustness of GPC-Simplex to perturbations (i.e., non-zero $\gamma_t$'s) and different model parameters.

Figure 1: Control with cost function (12) for $T = 200$ steps: initial distribution $x_1 = [0.9, 0.1, 0.0]$; parameters $c_3 = 10$, $c_2 = 1$; no noise. Left/Middle: cost and cumulative cost over time of GPC-Simplex versus the baselines. Right: the control $u_t(2)$ (proportional to the effective transmission rate) played by GPC-Simplex over time.
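A minimal sketch of the controlled SIR step (11) and cost (12); parameter defaults follow the values stated above, while the function names are illustrative:

```python
import numpy as np

def controlled_sir_step(x, u, gamma=0.0, w=None, beta=0.5, theta=0.03, xi=0.005):
    """One step of the controlled SIR dynamics, Eq. (11).
    x = [S, I, R] in the simplex; u = [prevention, no-prevention] in the simplex."""
    S, I, R = x
    A = np.array([[1 - beta * I, 0.0,       xi],
                  [0.0,          1 - theta, 0.0],
                  [0.0,          theta,     1 - xi]])
    B = np.array([[beta * I * S, 0.0],
                  [0.0,          beta * I * S],
                  [0.0,          0.0]])
    inner = A @ x + B @ u
    w = np.zeros(3) if w is None else w
    return (1 - gamma) * inner + gamma * w

def cost(x, u, c3=10.0, c2=1.0):
    """Cost from Eq. (12): quadratic in infections, bilinear in prevention and susceptibles."""
    return c3 * x[1] ** 2 + c2 * x[0] * u[0]
```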
4.1 Controlling hospital flows: reproducing a study by [27]

We now turn to the recent work [27], which also studies a controlled SIR model. Similar to the above, they considered a control that temporarily reduces the rate of contact within a population. In one scenario (inspired by the COVID-19 pandemic), they considered a cost function that penalizes medical surges, i.e., periods when the number of infected exceeds a threshold $y_{\max}$ determined by hospital capacities. Formally, they define the cost of a trajectory $(x_t, u_t)_{t=1}^T$ as
$$W_0\!\left(-3\, x_T(1)\, e^{-3\,(x_T(1)+x_T(2))}\right) + \sum_{t=1}^{T}\left[ c_2\, u_t(1)^2 + \frac{c_3\,(x_t(2) - y_{\max})}{1 + e^{-100\,(x_t(2)-y_{\max})}} \right], \qquad (13)$$
where $W_0$ is the principal branch of Lambert's $W$-function and $c_2, c_3$ are hyperparameters. The system parameters used by [27] are $\beta = 0.3$, $\theta = 0.1$, $\xi = 0$. In the absence of noise and with a known cost function, [27] is able to compute approximate solutions of the associated Hamilton-Jacobi-Bellman equations for various choices of $c_2, c_3$. In Fig. 2, we show that GPC-Simplex (with a slightly modified, instantaneous version of Eq. (13)) in fact matches the optimal solution analytically computed by [28]. See Appendix G for further experimental details, including the exact model parameters and cost function.

Figure 2: Controlling hospital flows for $T = 100$ steps: initial distribution $[0.9, 0.01, 0.09]$; parameters $y_{\max} = 0.1$, $c_2 = 0.01$, $c_3 = 100$. Left: the dashed red line shows the number of infected over time under no control; note that $y_{\max}$ (shown as a dashed purple line) is significantly exceeded. The solid yellow and blue lines show the number of infected and susceptible under GPC-Simplex, which closely match the optimal solutions computed by [28] (dashed yellow and blue). Right: GPC-Simplex control (solid) vs. optimal control (dashed).

Acknowledgements

NG is supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship. EH, ZL and JS gratefully acknowledge funding from the National Science Foundation, the Office of Naval Research, and Open Philanthropy. DR is supported by a U.S. DoD NDSEG Fellowship.

References

[1] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111-119. PMLR, 2019.
[2] Naman Agarwal, Elad Hazan, Anirudha Majumdar, and Karan Singh. A regret minimization approach to iterative learning control. In International Conference on Machine Learning, pages 100-109. PMLR, 2021.
[3] Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. Advances in Neural Information Processing Systems, 32, 2019.
[4] Linda J. S. Allen. Some discrete-time SI, SIR, and SIS epidemic models. Mathematical Biosciences, 124(1):83-105, 1994.
[5] Rocío Balderrama, Javier Peressutti, Juan Pablo Pinasco, Federico Vazquez, and Constanza Sánchez de la Vega. Optimal control for a SIR epidemic model with limited quarantine. Scientific Reports, 12(1):12583, 2022.
[6] Ottar N. Bjørnstad, Bärbel F. Finkenstädt, and Bryan T. Grenfell. Dynamics of measles epidemics: estimating scaling of transmission rates using a time series SIR model. Ecological Monographs, 72(2):169-184, 2002.
[7] Fred Brauer and Carlos Castillo-Chavez. Mathematical Models in Population Biology and Epidemiology, volume 2. Springer, 2012.
[8] Tom Britton. Stochastic epidemic models: a survey. Mathematical Biosciences, 225(1):24-35, 2010.
[9] Yi-Cheng Chen, Ping-En Lu, Cheng-Shang Chang, and Tzu-Hsuan Liu. A time-dependent SIR model for COVID-19 with undetectable infected persons. IEEE Transactions on Network Science and Engineering, 7(4):3279-3294, 2020.
[10] Ian Cooper, Argha Mondal, and Chris G. Antonopoulos. A SIR model assumption for the spread of COVID-19 in different communities. Chaos, Solitons & Fractals, 139:110057, 2020.
[11] Ross Cressman and Yi Tao. The replicator equation and other game dynamics. Proceedings of the National Academy of Sciences, 111(supplement 3):10810-10817, 2014.
[12] Adil El-Alami Laaroussi, Mostafa Rachik, and Mohamed Elhia. An optimal control problem for a spatiotemporal SIR model. International Journal of Dynamics and Control, 6:384-397, 2018.
[13] Mohamed Elhia, Mostafa Rachik, and Elhabib Benlahmar. Optimal control of an SIR model with delay in state and control variables. International Scholarly Research Notices, 2013.
[14] Herbert I. Freedman. Deterministic Mathematical Models in Population Ecology. 1980.
[15] Nicole M. Gatto and Henry Schellhorn. Optimal control of the SIR model in the presence of transmission and treatment uncertainty. Mathematical Biosciences, 333:108539, 2021.
[16] Noah Golowich, Elad Hazan, Zhou Lu, Dhruv Rohatgi, and Y. Jennifer Sun. Population control, 2024. https://github.com/jysun105/population_control.
[17] E. V. Grigorieva, E. N. Khailov, and A. Korobeinikov. Optimal control for a SIR epidemic model with nonlinear incidence rate. Mathematical Modelling of Natural Phenomena, 11(4):89-104, 2016.
[18] Elad Hazan. Introduction to online convex optimization, 2019.
[19] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic Learning Theory, pages 408-421. PMLR, 2020.
[20] Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. Advances in Neural Information Processing Systems, 31, 2018.
[21] Elad Hazan and Karan Singh. Introduction to online nonstochastic control. arXiv preprint arXiv:2211.09619, 2022.
[22] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. Advances in Neural Information Processing Systems, 30, 2017.
[23] Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.
[24] Chunyan Ji, Daqing Jiang, and Ningzhong Shi. The behavior of an SIR epidemic model with stochastic perturbation. Stochastic Analysis and Applications, 30(5):755-773, 2012.
[25] Sourabh Katoch, Sumit Singh Chauhan, and Vijay Kumar. A review on genetic algorithm: past, present, and future. Multimedia Tools and Applications, 80:8091-8126, 2021.
[26] William Ogilvy Kermack and Anderson G. McKendrick. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 115(772):700-721, 1927.
[27] David I. Ketcheson. Optimal control of an SIR epidemic through finite-time non-pharmaceutical intervention. Journal of Mathematical Biology, 83(1):7, 2021.
[28] David I. Ketcheson. SIR-control-code: optimal control of an SIR epidemic through finite-time non-pharmaceutical intervention, 2021.
[29] Nikolay A. Kudryashov, Mikhail A. Chmykhov, and Michael Vigdorowitsch. Analytical features of the SIR model and their applications to COVID-19. Applied Mathematical Modelling, 90:466-473, 2021.
[30] Urszula Ledzewicz and Heinz Schättler. On optimal singular controls for a general SIR-model with vaccination and treatment. In Conference Publications, volume 2011, pages 981-990, 2011.
[31] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2006.
[32] Alfred J. Lotka. Contribution to the theory of periodic reactions. The Journal of Physical Chemistry, 14(3):271-274, 1910.
[33] Thomas Robert Malthus.
An Essay on the Principle of Population; or, a View of Its Past and Present Effects on Human Happiness, an Inquiry Into Our Prospects Respecting the Future Removal or Mitigation of the Evils which it Occasions. Reeves and Turner, 1872.
[34] Emilio Molina and Alain Rapaport. An optimal feedback control that minimizes the epidemic peak in the SIR model under a budget constraint. Automatica, 146:110596, 2022.
[35] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
[36] Boris Shulgin, Lewi Stone, and Zvia Agur. Pulse vaccination strategy in the SIR epidemic model. Bulletin of Mathematical Biology, 60(6):1123-1148, 1998.
[37] Max Simchowitz. Making non-stochastic control (almost) as easy as stochastic. Advances in Neural Information Processing Systems, 33:18318-18329, 2020.
[38] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320-3436. PMLR, 2020.
[39] Mandavilli Srinivas and Lalit M. Patnaik. Genetic algorithms: a survey. Computer, 27(6):17-26, 1994.
[40] Vito Volterra. Fluctuations in the abundance of a species considered mathematically. Nature, 118(2972):558-560, 1926.
[41] Gul Zaman, Yong Han Kang, and Il Hyo Jung. Stability analysis and optimal vaccination of an SIR epidemic model. BioSystems, 93(3):240-249, 2008.

Contents
1 Introduction
  1.1 Our Results
  1.2 Related work
2 Definitions and setup
  2.1 Dynamical systems
  2.2 Comparator class and spectral conditions
3 Online Algorithm and Theoretical Guarantee
4 Experimental Evaluation
  4.1 Controlling hospital flows: reproducing a study by [27]
A Additional preliminaries
B Discussion on the observation model
C Proof of Theorem 7
  C.1 Preliminaries on mirror descent
  C.2 Approximation of linear policies
  C.3 Bounding the memory mismatch error
  C.4 Proof of Theorem 7
D Proof of Lower Bounds
  D.1 Proof of Theorem 1
  D.2 Proof of Theorem 8
E Implementation details
F Experiments: Controlled SIR model
  F.1 Control in presence of perturbations
  F.2 Alternative parameter settings
G Experiments: Controlling hospital flows
H Experiments: Controlled replicator dynamics
I Discussions
  I.1 Broader impacts
  I.2 Computational Resources for Experiments

A Additional preliminaries

For completeness, we recall the definition of a standard LDS [21].

Definition 9 (LDS). Let $d_x, d_u \in \mathbb{N}$. A linear dynamical system (LDS) is described by a tuple $L = (A, B, x_1, (w_t)_{t\in\mathbb{N}}, (c_t)_{t\in\mathbb{N}})$, where $A \in \mathbb{R}^{d_x\times d_x}, B \in \mathbb{R}^{d_x\times d_u}$ are the transition matrices; $x_1 \in \mathbb{R}^{d_x}$ is the initial state; $w_t \in \mathbb{R}^{d_x}$ is the noise value at time $t$; and $c_t : \mathbb{R}^{d_x}\times\mathbb{R}^{d_u} \to \mathbb{R}$ is the cost function at time $t$. For each $t \ge 1$, given state $x_t \in \mathbb{R}^{d_x}$ and control $u_t \in \mathbb{R}^{d_u}$ at time $t$, the state at time $t+1$ is given by $x_{t+1} := A x_t + B u_t + w_t$, and the instantaneous cost incurred at time $t$ is given by $c_t(x_t, u_t)$.
B Discussion on the observation model

Our main algorithm GPC-Simplex for online control of simplex LDSs assumes that, for each $t$, the perturbation strength $\gamma_t$ is observed by the controller at the same time as it observes $x_{t+1}$ (the algorithm does not require the entire sequence $(\gamma_t)_t$ to be known in advance). In this appendix we discuss (a) why this is a crucial technical assumption for the algorithm, and (b) why it is a reasonable assumption in many natural population models.

First, we explain why it is technically important for GPC-Simplex that the controller observes $\gamma_t$. Recall that, like the algorithm GPC from [1], GPC-Simplex is a disturbance-action controller, meaning that the control at time $t$ is computed based on previous disturbances $w_{t-i}$. In the standard LDS model (Definition 9) studied by [1], it is clear that $w_{t-1}$ can be computed from $x_{t-1}, u_{t-1}, x_t$, using the fact that $A, B$ are known. However, in the simplex LDS model, if $\gamma_{t-1}$ is not directly observed, then $w_{t-1}$ may not be uniquely identifiable given $x_{t-1}, u_{t-1}, x_t$. This is why GPC-Simplex requires observing the parameters $\gamma_t$. It is an interesting open problem whether this assumption can be removed.

Second, we argue that in many practical applications, it is reasonable for $\gamma_t$ to be observed along with the population state $x_{t+1}$. The reason is that often the controller can observe not just the proportions of individuals of different categories in a population but also the total population size. Formally, consider a population with $N_t$ individuals at time $t$. If the distribution of the population across $d$ categories is described by $x_t \in \Delta_d$, then for each $i \in [d]$ there are $N_t (x_t)_i$ individuals in category $i$. Suppose that under control $u_t \in \mathcal{I}$, this population evolves to a new distribution $(1 - \|u_t\|_1) A x_t + B u_t$, but then the adversary adds $n_t$ new individuals to the population, whose distribution over categories is given by $w_t \in \Delta_d$. Then, writing $\bar x_t \in \mathbb{R}^d_{\ge0}$ for the vector of counts of individuals in each category at time $t$, it holds that
$$\bar x_{t+1} = N_t\big((1 - \|u_t\|_1) A x_t + B u_t\big) + n_t w_t = N_{t+1}\big((1-\gamma_t)\big((1 - \|u_t\|_1) A x_t + B u_t\big) + \gamma_t w_t\big),$$
where $N_{t+1} = N_t + n_t$ is the total number of individuals at time $t+1$, and we write $\gamma_t := n_t/(N_t + n_t)$. Thus, the distribution of the population across the $d$ categories at time $t+1$ is
$$\frac{\bar x_{t+1}}{N_{t+1}} = (1-\gamma_t)\big((1 - \|u_t\|_1) A x_t + B u_t\big) + \gamma_t w_t,$$
which is exactly the update rule from Definition 3. Moreover, if the controller observes the total population counts $N_t, N_{t+1}$ in addition to $x_t, u_t, x_{t+1}$, then it may compute $\gamma_t = (N_{t+1} - N_t)/N_{t+1}$ as well as $w_t$ (using knowledge of $A, B$), which is what we wanted to show.
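A minimal sketch of this computation, assuming the controller logs the total counts $N_t$ (variable names are illustrative):

```python
import numpy as np

def recover_gamma_and_w(A, B, x, u, x_next, N, N_next):
    """Recover gamma_t and w_t from observed proportions and total counts (Appendix B)."""
    gamma = (N_next - N) / N_next                    # fraction of population added by the adversary
    inner = (1 - np.sum(u)) * (A @ x) + B @ u
    if gamma == 0:
        return 0.0, np.zeros_like(x)                 # no individuals added; w_t is irrelevant
    w = (x_next - (1 - gamma) * inner) / gamma
    return gamma, w
```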
C Proof of Theorem 7

In this section, we prove Theorem 7. We begin with an overview that outlines the structure and main ideas of the proof.

Overview. GPC-Simplex (Algorithm 1) essentially runs mirror descent on the loss functions $\ell_t(p, M^{[1:H]})$ constructed in Line 9. In particular, the loss at time $t$ measures the counterfactual cost of following the policy $\pi^{p,M^{[1:H]}}$ for the first $t$ timesteps. Thus, the regret of GPC-Simplex against the comparator class $\mathcal{K}_\tau(L)$ (Definition 6) can be bounded by the following decomposition:
$$\text{regret} \le \text{(approximation error of comparator class)} + \text{(mismatch error of costs)} + \text{(LazyMD regret)}.$$
In more detail:

Approximation error of comparator class. Since GPC-Simplex only optimizes over policies of the form $\pi^{p,M^{[1:H]}}$ for $(p, M^{[1:H]}) \in \mathcal{X}_{d,H,a_0,\bar\alpha}$, we must show that every policy in the comparator class can be approximated by some policy $\pi^{p,M^{[1:H]}}$. This is accomplished by Lemma 17.

Mismatch error of costs. The cost incurred by the mirror descent algorithm LazyMD at time $t$ is $\ell_t(p_t, M^{[1:H]}_t)$, which is the counterfactual cost at time $t$ had the current policy $\pi^{p_t,M^{[1:H]}_t}$ been carried out from the beginning of time. However, the cost actually incurred by the controller at time $t$ is $c_t(x_t, u_t)$, which is the cost incurred by following policy $\pi^{p_s,M^{[1:H]}_s}_s$ at each time $s \le t$. Thus, there is a mismatch between the loss that GPC-Simplex is optimizing and the loss that GPC-Simplex needs to optimize. This mismatch can be bounded using the stability of mirror descent together with a mixing argument; see Lemma 21.

LazyMD regret. GPC-Simplex uses LazyMD as its mirror descent subroutine. The regret of LazyMD can be bounded by standard guarantees; see Corollary 15.

C.1 Preliminaries on mirror descent

We begin with some preliminaries regarding mirror descent. Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex compact set, and let $R : \mathcal{X} \to \mathbb{R}$ be a convex function. We consider the Lazy Mirror Descent algorithm LazyMD (also known as Follow the Regularized Leader) for online convex optimization on $\mathcal{X}$. Given an offline optimization oracle over $\mathcal{X}$, the function $R$, and a parameter $\eta > 0$, LazyMD chooses each iterate $z_t$ based on the historical loss functions $\ell_s : \mathcal{X} \to \mathbb{R}$ (for $s \in [t-1]$) as described in Algorithm 2.

Algorithm 2 LazyMD: Lazy Mirror Descent [35]
Require: Offline convex optimization oracle over the set $\mathcal{X} \subset \mathbb{R}^d$; convex regularization function $R : \mathcal{X} \to \mathbb{R}$; step size $\eta > 0$; loss functions $\ell_1, \dots, \ell_T$, where $\ell_t$ is revealed after iteration $t$.
1: for $t \ge 1$ do
2: Compute and output the solution of the following convex optimization problem:
$$z_t := \arg\min_{z\in\mathcal{X}} \sum_{s=1}^{t-1} \langle z, \nabla\ell_s(z_s)\rangle + \frac{1}{\eta} R(z). \qquad (14)$$
3: Receive loss function $\ell_t : \mathcal{X} \to \mathbb{R}$.
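For concreteness, here is a minimal sketch of one iterate of Algorithm 2 with a Euclidean regularizer $R(z) = \frac12\|z\|_2^2$ and a projection-based oracle. Both are illustrative choices: GPC-Simplex instead uses the entropic regularizer of Definition 13.

```python
import numpy as np

def lazy_md_iterate(past_grads, project, eta, dim):
    """One LazyMD / FTRL iterate (Eq. 14) for R(z) = ||z||^2 / 2:
    z_t = argmin_z <z, sum of past gradients> + R(z)/eta = projection of -eta * (gradient sum)."""
    g = np.sum(past_grads, axis=0) if past_grads else np.zeros(dim)
    return project(-eta * g)

def project_simplex(v):
    """Euclidean projection onto the probability simplex (standard sort-based routine)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[idx] / (idx + 1.0), 0.0)

z1 = lazy_md_iterate([], project_simplex, eta=0.1, dim=3)  # first iterate minimizes R over X
```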
The following lemma bounds the regret of LazyMD against the single best $z \in \mathcal{X}$ in hindsight, for an appropriately chosen step size $\eta$.

Lemma 10 (Mirror descent). Suppose that $\mathcal{X} \subset \mathbb{R}^d$ is convex and compact, and let $\|\cdot\|$ be a norm on $\mathbb{R}^d$. Let $R : \mathcal{X} \to \mathbb{R}$ be a 1-strongly convex function with respect to $\|\cdot\|$. Let $L > 0$, and let $\rho := \max_{z\in\mathcal{X}} R(z) - \min_{z\in\mathcal{X}} R(z)$. Fix an arbitrary sequence of loss functions $\ell_t : \mathcal{X} \to \mathbb{R}$ which are each convex and $L$-Lipschitz with respect to $\|\cdot\|$. Then the iterates $z_t$ of LazyMD (Eq. (14)) with an optimization oracle over $\mathcal{X}$, regularizer $R$, step size $\eta = \sqrt{\rho}/(L\sqrt{2T})$, and loss functions $\ell_1, \dots, \ell_T$ satisfy
$$\sum_{t=1}^{T} \ell_t(z_t) - \min_{z\in\mathcal{X}} \sum_{t=1}^{T} \ell_t(z) \le L\sqrt{8\rho T}. \qquad (15)$$
Moreover, for each $t \in [T-1]$, it holds that
$$\|z_t - z_{t+1}\| \le \sqrt{\frac{\rho}{2T}}. \qquad (16)$$

Lemma 10 is essentially standard, but we provide a proof for completeness.

Proof of Lemma 10. By [18, Theorem 5.2], it holds that
$$\sum_{t=1}^{T} \ell_t(z_t) - \min_{z\in\mathcal{X}} \sum_{t=1}^{T} \ell_t(z) \le 2\eta \sum_{t=1}^{T} \|\nabla\ell_t(z_t)\|_*^2 + \frac{\rho}{\eta},$$
where $\|\cdot\|_* : \mathbb{R}^d \to \mathbb{R}$ is the dual norm of $\|\cdot\|$, defined by $\|y\|_* := \max_{\|z\|\le1} \langle y, z\rangle$. Recall that a convex $L$-Lipschitz loss function $\ell_t$ satisfies $\|\nabla\ell_t(z)\|_* \le L$ for all $z \in \mathcal{X}$. Thus, the above regret bound simplifies to
$$\sum_{t=1}^{T} \ell_t(z_t) - \min_{z\in\mathcal{X}} \sum_{t=1}^{T} \ell_t(z) \le 2\eta T L^2 + \frac{\rho}{\eta}.$$
Substituting $\eta = \sqrt{\rho}/(L\sqrt{2T})$ yields Eq. (15).

To establish the movement bound (16), we argue as follows. Consider any $y_1, y_2 \in \mathbb{R}^d$ and define, for $i \in \{1,2\}$, $w_i := \arg\min_{z\in\mathcal{X}} \langle y_i, z\rangle + R(z)$. Since $z \mapsto \langle y_2, z\rangle + R(z)$ is 1-strongly convex (by 1-strong convexity of $R$) and is minimized over $\mathcal{X}$ at $w_2$, the definition of $w_2$ implies
$$R(w_1) + \langle y_2, w_1\rangle - \langle y_2, w_2\rangle \ge R(w_2) + \tfrac12\|w_1 - w_2\|^2.$$
Symmetrically, the definition of $w_1$ together with strong convexity implies
$$R(w_2) + \langle y_1, w_2\rangle - \langle y_1, w_1\rangle \ge R(w_1) + \tfrac12\|w_1 - w_2\|^2.$$
Adding the two displays and rearranging gives
$$\|w_1 - w_2\|^2 \le \langle y_2 - y_1, w_1 - w_2\rangle \le \|y_2 - y_1\|_* \|w_1 - w_2\|,$$
so that $\|w_1 - w_2\| \le \|y_1 - y_2\|_*$. Setting $y_1 := \eta\sum_{s=1}^{t-1}\nabla\ell_s(z_s)$ and $y_2 := \eta\sum_{s=1}^{t}\nabla\ell_s(z_s)$, and recalling the definitions of $z_t, z_{t+1}$ from Eq. (14), we get
$$\|z_t - z_{t+1}\| \le \|\eta\nabla\ell_t(z_t)\|_* \le \eta L \le \sqrt{\frac{\rho}{2T}},$$
as desired.

We next apply Lemma 10 to the domain used in GPC-Simplex. Recall that, given $d, H \in \mathbb{N}$ and real numbers $0 \le a \le b \le 1$, we have defined $\mathcal{X}_{d,H,a,b} := \bigcup_{a'\in[a,b]} \Delta_d^{a'} \times (S_d^{a'})^H$.

Definition 11 (Entropy of a sub-distribution). Let $d \in \mathbb{N}$. We define the function $\mathrm{Ent} : \Delta_d^{\le1} \to \mathbb{R}_{\ge0}$ by
$$\mathrm{Ent}(v) := v_c \ln\frac{1}{v_c} + \sum_{j=1}^{d} v_j \ln\frac{1}{v_j},$$
where for any vector $v \in \mathbb{R}^d$ we write $v_c := 1 - \sum_{j=1}^d v_j \in \mathbb{R}$.

Lemma 12. Let $d \in \mathbb{N}$ and $u, v \in \Delta_d^{\le1}$. Then $\langle \nabla_u \mathrm{Ent}(u) - \nabla_v \mathrm{Ent}(v), u - v\rangle \le -\|u - v\|_1^2$. That is, $v \mapsto -\mathrm{Ent}(v)$ is 1-strongly convex on $\Delta_d^{\le1}$ with respect to $\|\cdot\|_1$.

Proof. Let $p$ be the probability mass function on $[d+1]$ with $p_i = u_i$ for all $i \in [d]$ (and $p_{d+1} = u_c$), and let $q$ be the probability mass function on $[d+1]$ with $q_i = v_i$ for all $i \in [d]$ (and $q_{d+1} = v_c$). Then it can be checked that
$$\langle \nabla_u\mathrm{Ent}(u) - \nabla_v\mathrm{Ent}(v), v - u\rangle = \mathrm{KL}(p\,\|\,q) + \mathrm{KL}(q\,\|\,p) \ge 4\,\mathrm{TV}(p,q)^2 = \|p - q\|_1^2 \ge \|u - v\|_1^2,$$
where the first inequality is Pinsker's inequality.

Definition 13 (Regularizer for mirror descent in GPC-Simplex). Let $d, H \in \mathbb{N}$ and $0 \le a \le b \le 1$. We define $R_{d,H} : \mathcal{X}_{d,H,a,b} \to \mathbb{R}_{\le0}$ as follows (omitting the domain's dependence on $a, b$ for notational simplicity):
$$R_{d,H}(p, M^{[1:H]}) := -\mathrm{Ent}(p) - \sum_{h=1}^{H}\sum_{j=1}^{d} \mathrm{Ent}(M^{[h]}_{\cdot,j}). \qquad (17)$$

Definition 14 (Norm for analysis of mirror descent in GPC-Simplex). Let $d, H \in \mathbb{N}$, and identify $\mathbb{R}^{d+Hd^2}$ with $\mathbb{R}^d \times (\mathbb{R}^{d\times d})^H$. We define a norm $\|\cdot\|_{d,H}$ on $\mathbb{R}^{d+Hd^2}$ as follows: for $p \in \mathbb{R}^d$, $M^{[1:H]} \in (\mathbb{R}^{d\times d})^H$,
$$\left\|(p, M^{[1:H]})\right\|_{d,H}^2 := \|p\|_1^2 + \sum_{h=1}^{H} \left\|M^{[h]}\right\|_{2,1}^2.$$

Corollary 15. Let $d, H \in \mathbb{N}$ and $0 \le a \le b \le 1$. Consider an arbitrary sequence of loss functions $\ell_t : \mathcal{X}_{d,H,a,b} \to \mathbb{R}$ which are convex and $L$-Lipschitz with respect to $\|\cdot\|_{d,H}$. Then the iterates $(p_t, M^{[1:H]}_t)$ of LazyMD with $\eta = \sqrt{2dH\ln(d)}/(L\sqrt{T})$, regularizer $R_{d,H}$, and loss functions $\ell_1, \dots, \ell_T$ satisfy the following regret guarantee:
$$\sum_{t=1}^{T} \ell_t((p_t, M^{[1:H]}_t)) - \min_{(p,M^{[1:H]})\in\mathcal{X}_{d,H,a,b}} \sum_{t=1}^{T} \ell_t((p, M^{[1:H]})) \le L\sqrt{32\, dH\ln(d)\, T}. \qquad (18)$$
Moreover, for $\beta := \sqrt{2dH\ln(d)}/\sqrt{T}$ and all $t \in [T-1]$, we have
$$\|p_t - p_{t+1}\|_1 \le \beta, \qquad \max_{h\in[H]} \left\|M^{[h]}_t - M^{[h]}_{t+1}\right\|_{1\to1} \le \beta. \qquad (19)$$

Proof. Note that the set of $(p, M^{[1:H]})$ where $p \in \mathbb{R}^d$ and $M^{[1:H]} \in (\mathbb{R}^{d\times d})^H$ can be identified with $\mathbb{R}^{d+Hd^2}$. We apply Lemma 10 with $\mathcal{X} := \mathcal{X}_{d,H,a,b}$, $R = R_{d,H}$, and the norm $\|\cdot\|_{d,H}$. It is straightforward to check that $\mathcal{X}$ is convex and compact in $\mathbb{R}^{d+Hd^2}$. By Lemma 12, $R_{d,H}$ is 1-strongly convex with respect to the norm $\|\cdot\|_{d,H}$. Moreover,
$$\max_{(p,M^{[1:H]})\in\mathcal{X}_{d,H,a,b}} R_{d,H}((p, M^{[1:H]})) - \min_{(p,M^{[1:H]})\in\mathcal{X}_{d,H,a,b}} R_{d,H}((p, M^{[1:H]})) \le (1 + dH)\ln(d+1) \le 4dH\ln(d),$$
since $0 \le \mathrm{Ent}(v) \le \ln(d+1)$ for all $v \in \Delta_d^{\le1}$. Thus, Lemma 10 implies the claimed bounds (18) and (19), where to prove (19) we use the fact that $\|C\|_{1\to1} = \max_{j\in[d]} \|C_{\cdot,j}\|_1 \le \sqrt{\sum_{j\in[d]} \|C_{\cdot,j}\|_1^2}$ for all $C \in \mathbb{R}^{d\times d}$.
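A small numerical sketch of Definitions 11 and 13, with the usual convention $0\ln 0 = 0$ (function names are illustrative):

```python
import numpy as np

def ent(v):
    """Entropy of a sub-distribution (Definition 11), including the slack mass v_c = 1 - sum(v)."""
    full = np.append(v, 1.0 - np.sum(v))
    full = full[full > 0]                  # convention: 0 * ln(0) = 0
    return float(-np.sum(full * np.log(full)))

def regularizer(p, Ms):
    """R_{d,H} from Eq. (17): negative entropy of p and of every column of each M^[h]."""
    return -ent(p) - sum(ent(M[:, j]) for M in Ms for j in range(M.shape[1]))
```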
C.2 Approximation of linear policies

Henceforth fix a simplex LDS $L = (A, B, \mathcal{I}, x_1, (\gamma_t)_t, (w_t)_t, (c_t)_t)$ on $\Delta_d$, where $\mathcal{I} = \bigcup_{\alpha\in[\underline\alpha,\bar\alpha]} \Delta_d^\alpha$ for some constants $0 \le \underline\alpha \le \bar\alpha \le 1$. Recall that any choice of parameters $(p, M^{[1:H]}) \in \mathcal{X}_{d,H,a,b}$ (for hyperparameters $H \in \mathbb{N}$ and $\underline\alpha \le a \le b \le \bar\alpha$) induces, via Eq. (8), a set of policies $(\pi^{p,M^{[1:H]}}_s)_{s\in[T]}$. The policy $\pi^{p,M^{[1:H]}}_s$ takes as input the disturbances $w_{s-1}, \dots, w_{s-H}$ observed at the $H$ time steps before step $s$, and outputs a control for step $s$. Recall that, in Algorithm 1, we used $x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})$ to denote the state and control at time step $t$ that one would observe by playing the control $\pi^{p,M^{[1:H]}}_s(w_{s-1:s-H})$ at step $s$, for each $1 \le s \le t$. Formally, we have the following expressions:

Fact 16. For any $(p, M^{[1:H]}) \in \mathcal{X}_{d,H,a,b}$ and $t \in [T]$, it holds that
$$x_t(p, M^{[1:H]}) = \sum_{i=1}^{t} \alpha^{p,M^{[1:H]}}_{t,i} A^{i-1}\left( \lambda_{t-i,0}\,\bar\lambda_{t,i}\, B p + B\sum_{j=1}^{H} \lambda_{t,i+j}\, M^{[j]} w_{t-i-j} + \lambda_{t,i}\, w_{t-i} \right), \qquad (20)$$
$$u_t(p, M^{[1:H]}) = \pi^{p,M^{[1:H]}}_t(w_{t-1:t-H}) = \lambda_{t,0}\, p + \sum_{j=1}^{H} \lambda_{t,j}\, M^{[j]} w_{t-j}, \qquad (21)$$
where we have written, for $t \in [T]$ and $i \in \mathbb{N}$,
$$\alpha^{p,M^{[1:H]}}_{t,i} := \prod_{j=1}^{i-1}\left(1 - \left\|\pi^{p,M^{[1:H]}}_{t-j}\right\|_1\right) = (1 - \|p\|_1)^{i-1}.$$

Proof. This follows by unrolling Eq. (6) with the controls $u_s := \pi^{p,M^{[1:H]}}_s(w_{s-1:s-H})$, and recalling the definitions in Eq. (7) and the conventions $w_0 := x_1$, $\gamma_0 = 1$, and $w_t = 0$ for $t < 0$.

In a sense, Algorithm 1 performs online convex optimization over the set of such policies. Even if we can show that doing so yields a good regret guarantee with respect to the class of policies $\{\pi^{p,M^{[1:H]}}_t : (p, M^{[1:H]}) \in \mathcal{X}_{d,H,a,b}\}$ for some choices of $H, a, b$, why should this imply a good regret guarantee with respect to the class $\mathcal{K}_\tau(L)$ of linear policies (see Definition 6)? Lemma 17 bridges this gap, showing that any policy in $\mathcal{K}_\tau(L)$ can be approximated by a policy of the form $(\pi^{p,M^{[1:H]}}_s)_{s\in[T]}$.

Lemma 17 (Approximation). Suppose that the cost functions $c_1, \dots, c_T$ of $L$ satisfy Assumption 1 with Lipschitz parameter $L$. Fix $\tau > 0$, $\varepsilon \in (0,1)$, and any $K^\star \in S_d^{\le1}$ such that $t_{\mathrm{mix}}(A_{K^\star}) \le \tau$. Write $\alpha^\star := \|K^\star\|_{1\to1}$. If $H \ge \tau\lceil\log_2(2LT^2/\varepsilon)\rceil$, then there is some $(p, M^{[1:H]}) \in \Delta_d^{\alpha^\star} \times (S_d^{\alpha^\star})^H$ such that
$$\left| \sum_{t=1}^{T} c_t(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})) - \sum_{t=1}^{T} c_t(x_t(K^\star), u_t(K^\star)) \right| \le \varepsilon, \qquad (22)$$
where $x_t(K^\star), u_t(K^\star)$ denote the state and control that one would observe at time step $t$ if one were to play according to the policy $x \mapsto K^\star x$ at all time steps $1 \le s \le t$.

Proof. For each $t$, if the controls $u_t$ are chosen to satisfy $u_t := K^\star x_t$, then $\alpha_t := \|u_t\|_1 = \|K^\star\|_{1\to1} = \alpha^\star$. Moreover, for $1 \le t \le T$, we can write
$$x_t(K^\star) = \sum_{i=1}^{t} A_{K^\star}^{i-1}\, \lambda_{t,i}\, w_{t-i}, \qquad (23)$$
$$u_t(K^\star) = K^\star x_t(K^\star) = \sum_{i=1}^{t} K^\star A_{K^\star}^{i-1}\, \lambda_{t,i}\, w_{t-i}, \qquad (24)$$
where $A_{K^\star}$ was defined in Eq. (5). By the assumption that $t_{\mathrm{mix}}(A_{K^\star}) \le \tau$, there is a unique $p' \in \Delta_d$ such that $A_{K^\star} p' = p'$ (see Definition 5). Moreover, by our bound on $H$ and Lemma 18, for any $i > H$ and $q \in \Delta_d$ we have $\|A_{K^\star}^{i-1} q - p'\|_1 \le (1/2)^{H/\tau} \le \varepsilon/(2LT^2)$. Using that $\lambda_{t,0} = \sum_{i=H+1}^{t} \lambda_{t,i}$ by the definition in Eq. (7),
$$\left\| \sum_{i=H+1}^{t} K^\star A_{K^\star}^{i-1}\lambda_{t,i} w_{t-i} - \lambda_{t,0} K^\star p' \right\|_1 = \left\| \sum_{i=H+1}^{t} \lambda_{t,i} K^\star\big(A_{K^\star}^{i-1} w_{t-i} - p'\big) \right\|_1 \le \sum_{i=H+1}^{t} \lambda_{t,i} \left\|A_{K^\star}^{i-1} w_{t-i} - p'\right\|_1 \le \varepsilon/(2LT^2). \qquad (25)$$
For $1 \le i \le H$, let us define $M^{[i]} := K^\star A_{K^\star}^{i-1} \in S_d^{\alpha^\star}$ and $p := K^\star p' \in \Delta_d^{\alpha^\star}$. Using Eqs. (21) and (24), we have
$$\left\| u_t(K^\star) - u_t(p, M^{[1:H]}) \right\|_1 = \left\| \sum_{i=1}^{t} K^\star A_{K^\star}^{i-1}\lambda_{t,i} w_{t-i} - \lambda_{t,0}\,p - \sum_{j=1}^{H}\lambda_{t,j} M^{[j]} w_{t-j} \right\|_1 = \left\| \sum_{i=H+1}^{t} K^\star A_{K^\star}^{i-1}\lambda_{t,i} w_{t-i} - \lambda_{t,0} K^\star p' \right\|_1 \le \varepsilon/(2LT^2), \qquad (26)$$
where the final inequality uses Eq. (25). Next, we may bound the difference in state vectors using Eq. (26), as follows. For any sequence $(u_i)_{i=1}^{t}$ with $\|u_i\|_1 = \alpha^\star$ for all $i$, we can expand Eq. (6) to get
$$x_t = \sum_{i=1}^{t} (1-\alpha^\star)^{i-1} A^{i-1}\big(\bar\lambda_{t,i}\, B u_{t-i} + \lambda_{t,i}\, w_{t-i}\big).$$
Thus, for any $t \in [T]$, we have
$$\left\| x_t(K^\star) - x_t(p, M^{[1:H]}) \right\|_1 \le \sum_{i=1}^{t} (1-\alpha^\star)^{i-1}\bar\lambda_{t,i} \left\| A^{i-1} B\big( u_{t-i}(K^\star) - u_{t-i}(p, M^{[1:H]}) \big) \right\|_1 \le \frac{\varepsilon}{2LT}. \qquad (27)$$
By Eqs. (26) and (27) and Assumption 1, it follows that, for each $t \in [T]$,
$$\left| c_t(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})) - c_t(x_t(K^\star), u_t(K^\star)) \right| \le \varepsilon/T,$$
which yields the claimed bound (22).

The following facts about distance to stationarity are well known (see, e.g., [31, Section 4.4]):

Lemma 18. Let $X \in S_d$ have a unique stationary distribution $\pi$. Then the following inequalities hold for any $c, t \in \mathbb{N}$:
1. $D_X(t) \le \bar D_X(t) \le 2 D_X(t)$.
2. $\bar D_X(ct) \le \bar D_X(t)^c$.

C.3 Bounding the memory mismatch error

In this section, we prove Lemma 21, which shows that an algorithm with bounded aggregate loss with respect to the loss functions $\ell_t$ defined on Line 9 of Algorithm 1 in fact has bounded aggregate cost with respect to the cost functions $c_t$ chosen by the adversary. First, we introduce two useful lemmas on the mixing times of matrices (Definition 5).

Lemma 19. Let $X \in S_d$ have a unique stationary distribution. Let $Y \in S_d$ satisfy $\|X - Y\|_{1\to1} \le \delta$. Then for any $t \in \mathbb{N}$, $D_Y(t) \le 2t\delta + 2D_X(t)$.

Proof. For any $v \in \Delta_d$, we have $\|Xv - Yv\|_1 \le \delta$. A hybrid argument then yields that for any $t \ge 1$, $\|X^t v - Y^t v\|_1 \le t\delta$. Then
$$D_Y(t) \le \sup_{p,q\in\Delta_d} \left\|Y^t(p-q)\right\|_1 \le 2t\delta + \sup_{p,q\in\Delta_d} \left\|X^t(p-q)\right\|_1 = 2t\delta + \bar D_X(t) \le 2t\delta + 2D_X(t),$$
where the first and last inequalities apply the first item of Lemma 18.

Lemma 20. Suppose that $A, B \in S_d$ and $K^\star \in S_d^{\le1}$ satisfy $t_{\mathrm{mix}}(A) > 4\, t_{\mathrm{mix}}(A_{K^\star})$. Then $\|K^\star\|_{1\to1} > 1/(96\, t_{\mathrm{mix}}(A_{K^\star}))$.

Proof. Write $\tau := t_{\mathrm{mix}}(A_{K^\star})$ and $\alpha^\star := \|K^\star\|_{1\to1}$, so that $A_{K^\star} = (1-\alpha^\star)A + BK^\star$. Suppose for the purpose of contradiction that $\alpha^\star \le 1/(96\tau)$. We have $\|A - A_{K^\star}\|_{1\to1} \le 2\alpha^\star$. By Lemma 18 and Definition 5, we have $\bar D_{A_{K^\star}}(\tau) \le 2 D_{A_{K^\star}}(\tau) \le 1/2$, so $D_{A_{K^\star}}(4\tau) \le \bar D_{A_{K^\star}}(4\tau) \le 1/16$. Using Lemma 19 and the assumption on $\alpha^\star$,
$$D_A(4\tau) \le 12\tau\alpha^\star + 2D_{A_{K^\star}}(4\tau) \le 12\tau\alpha^\star + 1/8 \le 1/4,$$
meaning that $t_{\mathrm{mix}}(A) \le 4\tau$, a contradiction.

The last step is to bound the memory mismatch error.

Lemma 21 (Memory mismatch error). Suppose that $(c_t)_t$ satisfy Assumption 1 with Lipschitz parameter $L$. Let $\tau, \beta > 0$, and suppose that $\mathcal{K}_\tau(L)$ is nonempty. Consider the execution of GPC-Simplex (Algorithm 1) on $L$ with input $\tau$. If the iterates $(p_t, M^{[1:H]}_t)_{t\in[T]}$ satisfy
$$\|p_t - p_{t+1}\|_1 \le \beta, \qquad \max_{i\in[H]} \left\|M^{[i]}_t - M^{[i]}_{t+1}\right\|_{1\to1} \le \beta, \qquad (28)$$
then for each $t \in [T]$, the loss function $\ell_t$ computed at time step $t$ satisfies
$$\left| \ell_t(p_t, M^{[1:H]}_t) - c_t(x_t, u_t) \right| \le O\big(L\tau^3\beta\log^3(1/\beta)\big).$$

Proof. Recall that $u_t \in \Delta_d^{\le1}$ denotes the control chosen in step $t$ of Algorithm 1. We write $\alpha_t := \|u_t\|_1$ and, for $i \in [t]$, $\bar\alpha_{t,i} := \prod_{j=1}^{i-1}(1 - \alpha_{t-j})$. Note that $\alpha_t = \|p_t\|_1 = \|M^{[h]}_t\|_{1\to1}$ for each $h \in [H]$, by definition of $\mathcal{X}_{d,H,a_0,\bar\alpha}$. Let us fix $t \in [T]$ and write $p := p_t$, $M^{[1:H]} := M^{[1:H]}_t$. By Eq. (6), the state $x_t$ at step $t$ of Algorithm 1 can be written as follows:
$$x_t = \sum_{i=1}^{t} \bar\alpha_{t,i} A^{i-1}\left( \lambda_{t-i,0}\,\bar\lambda_{t,i}\, B p_{t-i} + B\sum_{j=1}^{H} M^{[j]}_{t-i}\lambda_{t,i+j} w_{t-i-j} + \lambda_{t,i} w_{t-i} \right). \qquad (29)$$
By the assumption that $\mathcal{K}_\tau(L)$ is nonempty, there is some $K^\star \in S_d^{[\underline\alpha,\bar\alpha]}$ satisfying $t_{\mathrm{mix}}(A_{K^\star}) \le \tau$. Let us write $\alpha^\star := \|K^\star\|_{1\to1}$, so that $A_{K^\star} = (1-\alpha^\star)A + BK^\star$. Moreover, recall we have written in Algorithm 1 that $\tau_A := t_{\mathrm{mix}}(A)$. For $1 \le i \le t$, define
$$v_i := \lambda_{t-i,0}\,\bar\lambda_{t,i}\, B p_{t-i} + B\sum_{j=1}^{H} M^{[j]}_{t-i}\lambda_{t,i+j} w_{t-i-j} + \lambda_{t,i} w_{t-i}, \qquad v'_i := \lambda_{t-i,0}\,\bar\lambda_{t,i}\, B p + B\sum_{j=1}^{H} M^{[j]}\lambda_{t,i+j} w_{t-i-j} + \lambda_{t,i} w_{t-i}.$$
Then
$$\max\{\|v_i\|_1, \|v'_i\|_1, \|v_i - v'_i\|_1\} \le \lambda_{t-i,0}\bar\lambda_{t,i} + \sum_{j=1}^{H}\lambda_{t,i+j} + \lambda_{t,i} \le 1. \qquad (30)$$
Next, using Eq. (29) and Eq. (20), we have
$$x_t - x_t(p, M^{[1:H]}) = \sum_{i=1}^{t}\left( \bar\alpha_{t,i} A^{i-1} v_i - \alpha^{p,M^{[1:H]}}_{t,i} A^{i-1} v'_i \right). \qquad (31)$$
The condition (28) together with the triangle inequality gives that $\|Bp_{t-i} - Bp\|_1 \le i\beta$ and $\|BM^{[j]}_{t-i}w_{t-i-j} - BM^{[j]}w_{t-i-j}\|_1 \le i\beta$ for all $i, j \ge 1$, as well as $|\alpha_{t-i} - \alpha_t| \le i\beta$ for all $i \ge 1$.
It follows that $\|v_i - v'_i\|_1 \le i\beta$ and $|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}| \le i^2\beta$ for all $i \ge 1$, and that for any $\ell \ge 1$,
$$\Big|\sum_{i=1}^{\ell}\alpha_{t,i}\|v_i\|_1 - \sum_{i=1}^{\ell}\alpha^{p,M^{[1:H]}}_{t,i}\|v'_i\|_1\Big| \le \sum_{i=1}^{\ell}\big|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}\big|\,\|v_i\|_1 + \sum_{i=1}^{\ell}\alpha^{p,M^{[1:H]}}_{t,i}\|v_i - v'_i\|_1 \le \ell^3\beta. \qquad (32)$$
Using Eq. (32) and the fact that $\sum_{i=1}^{t}\alpha_{t,i}\|v_i\|_1 = \sum_{i=1}^{t}\alpha^{p,M^{[1:H]}}_{t,i}\|v'_i\|_1 = 1$, we see that
$$\Big|\sum_{i=\ell+1}^{t}\alpha_{t,i}\|v_i\|_1 - \sum_{i=\ell+1}^{t}\alpha^{p,M^{[1:H]}}_{t,i}\|v'_i\|_1\Big| \le \ell^3\beta. \qquad (33)$$
We consider the following two cases.

Case 1: $\tau_A \le 4\tau$. Write $t_0 = \lfloor\tau_A\log_2(1/\beta)\rfloor$. Let the stationary distribution of $A$ be denoted $\bar p \in \Delta^d$. By Lemma 18, we have that for all $i \ge 1$ and $q \in \Delta^d$, $\|A^i q - \bar p\|_1 \le D_A(i) \le 2^{-\lfloor i/\tau_A\rfloor}$. Now, using Eq. (31), we may compute
$$\big\|x_t - x_t(p, M^{[1:H]})\big\|_1 = \Big\|\sum_{i=1}^{t}\big(\alpha_{t,i}A^{i-1}v_i - \alpha^{p,M^{[1:H]}}_{t,i}A^{i-1}v'_i\big)\Big\|_1$$
$$\le \sum_{i=1}^{t_0}\Big(\alpha_{t,i}\big\|A^{i-1}(v_i - v'_i)\big\|_1 + \big|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}\big|\,\|v'_i\|_1\Big) + \Big|\sum_{i=t_0+1}^{t}\big(\alpha_{t,i}\|v_i\|_1 - \alpha^{p,M^{[1:H]}}_{t,i}\|v'_i\|_1\big)\Big| + \sum_{i=t_0+1}^{t}\Big(\alpha_{t,i}\big\|A^{i-1}v_i - \|v_i\|_1\,\bar p\big\|_1 + \alpha^{p,M^{[1:H]}}_{t,i}\big\|A^{i-1}v'_i - \|v'_i\|_1\,\bar p\big\|_1\Big)$$
$$\le \sum_{i=1}^{t_0}2i^2\beta + t_0^3\beta + \sum_{i=t_0+1}^{t}2\cdot 2^{-\lfloor(i-1)/\tau_A\rfloor} \le Ct_0^3\beta \le C'\tau^3\log^3(1/\beta)\,\beta \qquad (34)$$
for some universal constants $C, C'$. Above, the first inequality uses the triangle inequality together with Eq. (33); the second inequality uses that $\|v_i - v'_i\|_1 \le i\beta$, $|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}| \le i^2\beta$, and $\|v'_i\|_1 \le 1$; and the final inequalities use the bound $\sum_{i=t_0+1}^{t}2^{-\lfloor i/\tau_A\rfloor} \le O(\tau_A\beta) \le O(t_0\beta)$ together with $\tau_A \le 4\tau$.

Case 2: $\tau_A > 4\tau$. In this case, we claim that $a_0 \ge 1/(96\tau)$. By choice of $a_0$ in Line 1 of Algorithm 1 and the fact that $\tau_A > 4\tau$, it suffices to show that $\alpha \ge 1/(96\tau)$: to see this, note that $\tau_A = t_{\mathrm{mix}}(A) > 4\tau \ge 4\,t_{\mathrm{mix}}(A_K)$, so Lemma 20 gives that $\|K\|_{1\to1} > 1/(96\,t_{\mathrm{mix}}(A_K)) \ge 1/(96\tau)$. But $\|K\|_{1\to1} \le \alpha$, and thus $\alpha > 1/(96\tau)$. This proves that $a_0 \ge 1/(96\tau)$. Hence $\alpha_i \ge a_0 \ge 1/(96\tau)$ for all $i \in [T]$, by definition of $\mathcal{X}_{d,H,a_0,\alpha}$. Write $t_0 := \lfloor 200\tau\log(1/\beta)\rfloor$. Then for any $i > t_0$,
$$\max\big\{\alpha_{t,i},\ \alpha^{p,M^{[1:H]}}_{t,i}\big\} \le (1-a_0)^{i-1} \le \big(1 - 1/(96\tau)\big)^{\lfloor 200\tau\log(1/\beta)\rfloor} \le O(\beta).$$
Again using Eq. (31),
$$\big\|x_t - x_t(p, M^{[1:H]})\big\|_1 \le \sum_{i=1}^{t_0}\Big(\big|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}\big| + \alpha_{t,i}\|v_i - v'_i\|_1\Big) + \sum_{i=t_0+1}^{t}\big(\alpha_{t,i} + \alpha^{p,M^{[1:H]}}_{t,i}\big) \le \sum_{i=1}^{t_0}(i^2\beta + i\beta) + \sum_{i=t_0+1}^{t}O(\beta)(1-a_0)^{i-t_0-1} \qquad (35)$$
$$\le Ct_0^3\beta + C\beta/a_0 \le C'\tau^3\log^3(1/\beta)\,\beta,$$
for some constants $C, C'$. Above, the first inequality uses Eq. (30); the second inequality uses the previously derived bounds $|\alpha_{t,i} - \alpha^{p,M^{[1:H]}}_{t,i}| \le i^2\beta$ and $\|v_i - v'_i\|_1 \le i\beta$; and the final inequality uses that $a_0 \ge 1/(96\tau)$.

In both cases, we have $\|x_t - x_t(p, M^{[1:H]})\|_1 \le C'\tau^3\beta\log^3(1/\beta)$ for some universal constant $C'$. By definition, the control $u_t$ chosen by Algorithm 1 at time step $t$ is exactly $u_t = u_t(p, M^{[1:H]})$. Thus, using $L$-Lipschitzness of $c_t$, we have
$$\big|\ell_t(p, M^{[1:H]}) - c_t(x_t, u_t)\big| = \big|c_t\big(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})\big) - c_t(x_t, u_t)\big| \le L\,\big\|x_t - x_t(p, M^{[1:H]})\big\|_1 \le C'L\tau^3\beta\log^3(1/\beta),$$
as desired.

C.4 Proof of Theorem 7

Before proving Theorem 7, we establish that the loss functions $\ell_t$ used in GPC-Simplex are Lipschitz.

Lemma 22. Let $X \in \mathcal{S}^d$ with $\tau := t_{\mathrm{mix}}(X) < \infty$. Then for any $i \in \mathbb{N}$ and $v \in \mathbb{R}^d$ with $\langle \mathbf{1}, v\rangle = 0$, it holds that $\|X^i v\|_1 \le 2^{-\lfloor i/\tau\rfloor}\|v\|_1$.

Proof. Fix $v \in \mathbb{R}^d$ with $\langle\mathbf{1}, v\rangle = 0$. We can write $v = v^+ - v^-$, where $v^+, v^- \in \mathbb{R}^d_{\ge0}$ are the non-negative and negative components of $v$ respectively. We have $\|v^+\|_1 = \|v^-\|_1 = \frac12\|v\|_1$ since $\langle\mathbf{1}, v\rangle = 0$. Let $u_1 := 2v^+/\|v\|_1$ and $u_2 := 2v^-/\|v\|_1$, so that $u_1, u_2 \in \Delta^d$. By Lemma 18 and the definition of $t_{\mathrm{mix}}(X)$, we have
$$\big\|X^i(u_1 - u_2)\big\|_1 \le \bar{D}_X(i) \le \bar{D}_X(\tau)^{\lfloor i/\tau\rfloor} \le \big(2D_X(\tau)\big)^{\lfloor i/\tau\rfloor} \le 2^{-\lfloor i/\tau\rfloor}.$$
Since $X^i v = \frac{\|v\|_1}{2}X^i(u_1 - u_2)$, it follows that $\|X^i v\|_1 \le 2^{-\lfloor i/\tau\rfloor}\|v\|_1$.

Lemma 23 (Lipschitzness of $\ell_t$). Let $\tau > 0$, and suppose that $\mathcal{K}_\tau(\mathcal{L})$ is nonempty. For each $t \in [T]$, the loss function $\ell_t(p, M^{[1:H]}) = c_t\big(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})\big)$ (as defined on Line 9 of Algorithm 1) is $O(L\tau^2)$-Lipschitz with respect to the norm $\|\cdot\|_{d,H}$ on $\mathcal{X}_{d,H,a_0,\alpha}$.

Proof.
By $L$-Lipschitzness of $c_t$ with respect to $\|\cdot\|_1$, it suffices to show that for any $(p_1, M^{[1:H]}_1), (p_2, M^{[1:H]}_2) \in \mathcal{X}_{d,H,a_0,\alpha}$, we have
$$\big\|x_t(p_1, M^{[1:H]}_1) - x_t(p_2, M^{[1:H]}_2)\big\|_1 \le O(\tau^2)\,\big\|(p_1, M^{[1:H]}_1) - (p_2, M^{[1:H]}_2)\big\|_{d,H}, \qquad (36)$$
$$\big\|u_t(p_1, M^{[1:H]}_1) - u_t(p_2, M^{[1:H]}_2)\big\|_1 \le O(\tau^2)\,\big\|(p_1, M^{[1:H]}_1) - (p_2, M^{[1:H]}_2)\big\|_{d,H}. \qquad (37)$$
Fix $(p_1, M^{[1:H]}_1), (p_2, M^{[1:H]}_2) \in \mathcal{X}_{d,H,a_0,\alpha}$, and write
$$\varepsilon := \max\Big\{\|p_1 - p_2\|_1,\ \max_{j\in[H]}\big\|M^{[j]}_1 - M^{[j]}_2\big\|_{1\to1}\Big\}.$$
Since $\varepsilon \le \|(p_1, M^{[1:H]}_1) - (p_2, M^{[1:H]}_2)\|_{d,H}$, it suffices to show that Eqs. (36) and (37) hold with $\varepsilon$ on the right-hand sides. To verify Eq. (36) in this manner, we define, for $b \in \{1,2\}$,
$$v_{i,b} := \lambda_{t-i,0}\lambda_{t,i}\,Bp_b + B\sum_{j=1}^{H}\lambda_{t,i+j}M^{[j]}_b w_{t-i-j} + \lambda_{t,i}w_{t-i}.$$
Since $\lambda_{t-i,0}\lambda_{t,i} + \lambda_{t,i} + \sum_{j=1}^{H}\lambda_{t,i+j} \le 1$, we have $\|v_{i,b}\|_1 \le 1$ for each $i \in [t]$, $b \in \{1,2\}$. Moreover, $\|v_{i,1} - v_{i,2}\|_1 \le (\lambda_{t-i,0}\lambda_{t,i} + \sum_{j=1}^{H}\lambda_{t,i+j})\,\varepsilon \le \varepsilon$. Write $\sigma_1 := \|p_1\|_1$ and $\sigma_2 := \|p_2\|_1$, so that $|\sigma_1 - \sigma_2| \le \varepsilon$ and $|(1-\sigma_1)^i - (1-\sigma_2)^i| \le i\varepsilon$ for all $i \ge 1$. Also note that for each $b \in \{1,2\}$,
$$\sum_{i=1}^{t}(1-\sigma_b)^{i-1}\|v_{i,b}\|_1 = \sum_{i=1}^{t}(1-\sigma_b)^{i-1}\lambda_{t,i-1}\big((1-\gamma_{t-i})\sigma_b + \gamma_{t-i}\big) = \sum_{i=1}^{t}(1-\sigma_b)^{i-1}\lambda_{t,i-1}\big(1 - (1-\gamma_{t-i})(1-\sigma_b)\big) = 1, \qquad (38)$$
where the final equality follows since $\gamma_0 = 1$. By Eq. (20), we have
$$x_t(p_1, M^{[1:H]}_1) - x_t(p_2, M^{[1:H]}_2) = \sum_{i=1}^{t}\Big((1-\sigma_1)^{i-1}A^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}A^{i-1}v_{i,2}\Big).$$
We consider two cases, depending on the mixing time $\tau_A := t_{\mathrm{mix}}(A)$ of $A$.

Case 1: $\tau_A \le 4\tau$. Let the stationary distribution of $A$ be denoted $\bar p \in \Delta^d$. Then
$$\big\|x_t(p_1, M^{[1:H]}_1) - x_t(p_2, M^{[1:H]}_2)\big\|_1 = \Big\|\sum_{i=1}^{t}\big((1-\sigma_1)^{i-1}A^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}A^{i-1}v_{i,2}\big)\Big\|_1$$
$$= \Big\|\sum_{i=1}^{t}A^{i-1}\Big[(1-\sigma_1)^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}v_{i,2} - \big((1-\sigma_1)^{i-1}\|v_{i,1}\|_1 - (1-\sigma_2)^{i-1}\|v_{i,2}\|_1\big)\bar p\Big]\Big\|_1$$
$$\le \sum_{i=1}^{t}2^{1-\lfloor(i-1)/\tau_A\rfloor}\Big\|(1-\sigma_1)^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}v_{i,2} - \big((1-\sigma_1)^{i-1}\|v_{i,1}\|_1 - (1-\sigma_2)^{i-1}\|v_{i,2}\|_1\big)\bar p\Big\|_1 \le \sum_{i=1}^{t}2^{2-\lfloor(i-1)/\tau_A\rfloor}(i\varepsilon + \varepsilon) \le C\tau^2\varepsilon$$
for some constant $C$. Above, the second equality uses Eq. (38) (the coefficients of $\bar p$ sum to zero) together with $A^{i-1}\bar p = \bar p$; the first inequality uses Lemma 22 together with the fact that
$$\Big\langle \mathbf{1},\ (1-\sigma_1)^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}v_{i,2} - \big((1-\sigma_1)^{i-1}\|v_{i,1}\|_1 - (1-\sigma_2)^{i-1}\|v_{i,2}\|_1\big)\bar p\Big\rangle = 0;$$
and the final inequality uses the assumption that $\tau_A \le 4\tau$.

Case 2: $\tau_A > 4\tau$. In this case, the assumption that $\mathcal{K}_\tau(\mathcal{L})$ is nonempty together with the choice of $a_0$ in Line 1 of Algorithm 1 and Lemma 20 gives that $a_0 > 1/(96\tau)$; see Case 2 of the proof of Lemma 21 for more details of this argument, which uses the fact that $\tau_A > 4\tau$. Since $(p_b, M^{[1:H]}_b) \in \mathcal{X}_{d,H,a_0,\alpha}$ for $b \in \{1,2\}$, we have $\sigma_1, \sigma_2 \ge a_0 > 1/(96\tau)$. We may compute
$$\big\|x_t(p_1, M^{[1:H]}_1) - x_t(p_2, M^{[1:H]}_2)\big\|_1 = \Big\|\sum_{i=1}^{t}\big((1-\sigma_1)^{i-1}A^{i-1}v_{i,1} - (1-\sigma_2)^{i-1}A^{i-1}v_{i,2}\big)\Big\|_1 \le \sum_{i=1}^{t}\big|(1-\sigma_1)^{i-1} - (1-\sigma_2)^{i-1}\big| + \sum_{i=1}^{t}(1-\sigma_1)^{i-1}\|v_{i,1} - v_{i,2}\|_1$$
$$\le \sum_{i=2}^{t}\Big(\sum_{j=1}^{i-1}|\sigma_1 - \sigma_2|(1-\sigma_1)^{j-1}(1-\sigma_2)^{i-1-j}\Big) + \varepsilon\sum_{i=1}^{t}(1-\sigma_1)^{i-1} \le \sum_{i=2}^{t}(i-1)\varepsilon\Big(1 - \frac{1}{96\tau}\Big)^{i-2} + \varepsilon\sum_{i=1}^{t}\Big(1 - \frac{1}{96\tau}\Big)^{i-1} \le C\tau^2\varepsilon$$
for some constant $C$. Thus, in both cases above, we have $\|x_t(p_1, M^{[1:H]}_1) - x_t(p_2, M^{[1:H]}_2)\|_1 \le O(\tau^2\varepsilon)$, which verifies Eq. (36). The proof of Eq. (37) is much simpler: we have
$$\big\|u_t(p_1, M^{[1:H]}_1) - u_t(p_2, M^{[1:H]}_2)\big\|_1 \le \lambda_{t,0}\|p_1 - p_2\|_1 + \sum_{j=1}^{H}\lambda_{t,j}\big\|M^{[j]}_1 - M^{[j]}_2\big\|_{1\to1} \le \varepsilon,$$
since $\lambda_{t,0} + \lambda_{t,1} + \cdots + \lambda_{t,H} \le 1$.

Proof of Theorem 7. Set $\beta = \sqrt{2dH\ln d}/\sqrt{T}$, $\varepsilon = 1/T$, and $\mathcal{K}_\tau := \mathcal{K}_\tau(\mathcal{L})$. We will apply Corollary 15 to the sequence of iterates $(p_t, M^{[1:H]}_t)$ produced in Algorithm 1, for the domain $\mathcal{X}_{d,H,a_0,\alpha}$ (i.e., $a = a_0$, $b = \alpha$). Note that Lemma 23 gives that $\ell_t$ is $O(L\tau^2)$-Lipschitz, for each $t \in [T]$. Thus Corollary 15 guarantees a regret bound (with respect to $\mathcal{X}_{d,H,a_0,\alpha}$) of $O(L\tau^2\sqrt{dH\ln(d)\,T})$. Moreover, Eq. (19) of Corollary 15 ensures that the precondition Eq. (28) of Lemma 21 is satisfied.
Thus, we may bound
$$\sum_{t=1}^{T}c_t(x_t, u_t) - \inf_{K\in\mathcal{K}_\tau}\sum_{t=1}^{T}c_t\big(x_t(K), u_t(K)\big) \le \sum_{t=1}^{T}\ell_t(p_t, M^{[1:H]}_t) - \inf_{K\in\mathcal{K}_\tau}\sum_{t=1}^{T}c_t\big(x_t(K), u_t(K)\big) + O\big(TL\tau^3\log^3(1/\beta)\,\beta\big) \qquad (39)$$
$$\le \sum_{t=1}^{T}\ell_t(p_t, M^{[1:H]}_t) - \inf_{(p,M^{[1:H]})\in\mathcal{X}_{d,H,a_0,\alpha}}\sum_{t=1}^{T}c_t\big(x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})\big) + O\big(TL\tau^3\log^3(1/\beta)\,\beta\big) + \varepsilon \qquad (40)$$
$$= \sum_{t=1}^{T}\ell_t(p_t, M^{[1:H]}_t) - \inf_{(p,M^{[1:H]})\in\mathcal{X}_{d,H,a_0,\alpha}}\sum_{t=1}^{T}\ell_t(p, M^{[1:H]}) + O\big(TL\tau^3\log^3(1/\beta)\,\beta\big) + \varepsilon \le O\big(L\tau^2\sqrt{dH\ln(d)\,T}\big) + O\big(TL\tau^3\log^3(1/\beta)\,\beta\big) + \varepsilon,$$
where the first inequality uses Lemma 21 together with Eq. (19) of Corollary 15, and the second inequality uses Lemma 17 with $\epsilon = 1/T$ (by the theorem assumption, the inequality $H \ge \tau\lceil\log_2(2LT^2/\epsilon)\rceil$ is indeed satisfied). Note that for the second inequality to hold, we also need that $\|K\|_{1\to1} \ge a_0$ for all $K \in \mathcal{K}_\tau$, which in particular requires (by Line 1) that $\|K\|_{1\to1} \ge 1/(96\tau)$ if $t_{\mathrm{mix}}(A) > 4\tau$. But if $t_{\mathrm{mix}}(A) > 4\tau$, then for any $K \in \mathcal{K}_\tau$ we have $t_{\mathrm{mix}}(A) > 4\,t_{\mathrm{mix}}(A_K)$ and hence $\|K\|_{1\to1} \ge 1/(96\tau)$ by Lemma 20. Finally, the equality above uses the definition of $\ell_t$ in Algorithm 1, and the final inequality uses the regret bound of Corollary 15. By our choice of $\beta, \varepsilon$, we see that the overall policy regret is $O(L\tau^{7/2}d^{1/2}\sqrt{T})$ up to logarithmic factors, as desired.

D Proof of Lower Bounds

In this section, we formally state and prove the regret lower bounds Theorem 1 and Theorem 8. The former states that the comparator class for online control of standard LDSs cannot be broadened to all marginally stable (time-invariant, linear) policies; the latter states that the mixing-time assumption cannot be removed from the comparator class for online control of simplex LDSs. Both results hold even in constant dimension.

The basic idea is the same for both proofs: we construct two systems $\mathcal{L}_0, \mathcal{L}_1$ which are identical until time $T/2$, but then at time $T/2$ experience differing perturbations of constant magnitude. The costs are zero until time $T/2$, after which they penalize distance to a prescribed state (and can in fact be taken to be the same for both systems). The optimal strategy in the first $T/2$ time steps therefore depends on which system the controller is in, but the controller does not observe this until time $T/2$, and hence will necessarily incur regret with respect to the optimal policy.

Formalizing this intuition requires two additional pieces. First, for both systems there must be a near-optimal time-invariant linear policy; this can be achieved by careful design of the dynamics, perturbations, and costs. Second, if the controller finds itself in a high-cost state at time $T/2 + 1$, it must be unable to reach a low-cost state without incurring $\Omega(T)$ total cost along the way. In the standard LDS setting, we achieve this by setting the transition matrices $A, B$ so that $\|B\| = O(1/T)$ (i.e., so constant-size controls have small effect on the state) and adding a penalty of $|u_t|$ to the cost for $t > T/2$. In the simplex LDS setting, we achieve this by our choice of the valid control set $\mathcal{I}$ (which enforces that $\|u_t\|_1 = O(1/T)$ for all $t$). See Fig. 3 for a pictorial explanation of the proof in the simplex LDS setting.

D.1 Proof of Theorem 1

In this section we give a formal statement and proof of Theorem 1. Recall the definition of an LDS (Definition 9). We define the class $\mathcal{K}_\kappa(\mathcal{L})$ of policies that $\kappa$-marginally stabilize $\mathcal{L}$ below; it is equivalent to the class $\mathcal{K}_{\kappa,\rho}(\mathcal{L})$ of policies that $(\kappa,\rho)$-strongly stabilize $\mathcal{L}$ (Definition 4) with $\rho = 0$.

Definition 24 (Marginal stabilization). A matrix $M \in \mathbb{R}^{d\times d}$ is $\kappa$-marginally stable if there is a matrix $H \in \mathbb{R}^{d\times d}$ so that $\|H^{-1}MH\| \le 1$ and $\|M\|, \|H\|, \|H^{-1}\| \le \kappa$.
A matrix $K \in \mathbb{R}^{d\times d}$ is said to $\kappa$-marginally stabilize an LDS with transition matrices $A, B \in \mathbb{R}^{d\times d}$ if $A + BK$ is $\kappa$-marginally stable. For $\kappa > 0$ and an LDS $\mathcal{L}$ on $\mathbb{R}^d$, we define $\mathcal{K}_\kappa(\mathcal{L})$ to be the set of linear, time-invariant policies $x \mapsto Kx$ where $K \in \mathbb{R}^{d\times d}$ $\kappa$-marginally stabilizes $\mathcal{L}$.

We also introduce a standard regularity assumption on cost functions:

Assumption 2. Let $L > 0$. We say that cost functions $(c_t)_t$, where $c_t : \mathbb{R}^{d_x}\times\mathbb{R}^{d_u}\to\mathbb{R}$, are $L$-regular if $c_t$ is convex and $L$-Lipschitz with respect to the Euclidean norm for all $t$.

(Technically, in this setting of general LDSs where the state domain is unbounded, Assumption 2 is stronger than the assumption on cost functions made in prior work on non-stochastic control [1], because it enforces a uniform Lipschitzness bound on the entire domain. But we are proving a lower bound in this section, so this strengthening only makes our result stronger.)

Theorem 25 (Formal statement of Theorem 1). Let Alg be any randomized algorithm for online control with the following guarantee:

Let $d, T \in \mathbb{N}$ and $\kappa > 0$, and let $\mathcal{L} = (A, B, x_1, (w_t)_t, (c_t)_t)$ be an LDS with state space and control space $\mathbb{R}^d$, $L$-regular cost functions $(c_t)_t$ (Assumption 2), and perturbations $(w_t)_t$ satisfying $\|w_t\|_2 \le L$ for all $t$. Then the iterates $(x_t, u_t)_{t=1}^{T}$ produced by Alg with input $(A, B, \kappa, T)$ on interaction with $\mathcal{L}$ satisfy
$$\mathrm{regret}_{\mathcal{K}_\kappa(\mathcal{L})} := \mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] - \inf_{K\in\mathcal{K}_\kappa(\mathcal{L})}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L},K}_t, u^{\mathcal{L},K}_t\big) \le f(d, \kappa, L, T), \qquad (41)$$
where $(x^{\mathcal{L},K}_t, u^{\mathcal{L},K}_t)_{t=1}^{T}$ are the iterates produced by following policy $x \mapsto Kx$ in system $\mathcal{L}$ for all $t \in [T]$.

Then $f(1, 1, 1, T) = \Omega(T)$.

Figure 3: An intuitive illustration of $x_t(2)$ in the lower bound for simplex LDSs (Theorem 30). The blue curve is the trajectory of $\pi_0$, the "decreasing" comparator policy, in the system $\mathcal{L}_0$, which has the smaller perturbation. The green curve is the trajectory of $\pi_1$, the "lazy" comparator policy, in the system $\mathcal{L}_1$, which has the larger perturbation. The orange curves correspond to the trajectories of an arbitrary policy $\pi$ under the two different perturbation sequences. The sum of the regrets under the two perturbation sequences is equal to the area $S_1 + S_2 + S_3$, which is shown to be $\Omega(T)$ for any $h$.

Remark 26. In the above theorem statement, if $\mathcal{K}_\kappa(\mathcal{L})$ were replaced with $\mathcal{K}_{\kappa,\rho}(\mathcal{L})$, the class of linear time-invariant policies that $(\kappa,\rho)$-strongly stabilize $\mathcal{L}$, then the main result of [1] would imply that there is in fact a (deterministic) algorithm, GPC, with regret at most $\mathrm{poly}(d, \kappa, L, \rho^{-1})\sqrt{T}\log(T)$ on any LDS $\mathcal{L}$ satisfying the above conditions. Thus, Theorem 25 indeed provides a converse to [1].

We prove Theorem 25 by constructing a simple distribution over LDSs on which any algorithm must incur $\Omega(T)$ regret in expectation. Let $\beta \ge 2$ be a constant that we will determine later, and fix $T \ge \beta$. Recall that we denote an LDS on $\mathbb{R}^d$ using the notation $\mathcal{L} = (A, B, x_1, (w_t)_t, (c_t)_t)$, where $A, B \in \mathbb{R}^{d\times d}$. We define two LDSs on $\mathbb{R}$ as follows:
$$\mathcal{L}_0 := \big(1,\ \beta/T,\ x_1,\ (w^0_t)_t,\ (c_t)_t\big), \qquad \mathcal{L}_1 := \big(1,\ \beta/T,\ x_1,\ (w^1_t)_t,\ (c_t)_t\big),$$
where the (common) initial state is $x_1 = 1$, the (common) cost functions $(c_t)_t$ are defined as
$$c_t(x, u) := \begin{cases}|x| + |u| & \text{if } t > T/2,\\ 0 & \text{otherwise,}\end{cases}$$
the perturbations of $\mathcal{L}_0$ are $w^0_t := 0$ for all $t$, and the perturbations of $\mathcal{L}_1$ are
$$w^1_t := \begin{cases}-1 & \text{if } t = T/2,\\ 0 & \text{otherwise.}\end{cases}$$
For simplicity, we assume that $T/2$ is an integer. Thus, at all times $t \ne T/2$, the two systems have identical dynamics $x_{t+1} := x_t + \frac{\beta}{T}u_t$, but at time $t = T/2$, system $\mathcal{L}_1$ experiences a negative perturbation of magnitude 1, whereas $\mathcal{L}_0$ does not.
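To make the construction concrete, the following minimal Python simulation (ours, not from the paper) rolls out the two systems together with the two comparator policies analyzed in Lemma 27 below; the function names and the choice $\beta = 4$ are illustrative assumptions.

```python
def simulate(policy, b, T=1000, beta=4.0):
    """Roll out u_t = policy(x_t) in system L_b for b in {0, 1}.

    Dynamics: x_{t+1} = x_t + (beta / T) * u_t + w_t, with x_1 = 1; the only
    nonzero perturbation is w_{T/2} = -1 in L_1 (L_0 is noise-free).
    Cost:     c_t(x, u) = |x| + |u| for t > T/2, and 0 otherwise.
    Returns the total cost of the trajectory.
    """
    x, total = 1.0, 0.0
    for t in range(1, T + 1):
        u = policy(x)
        if t > T // 2:
            total += abs(x) + abs(u)
        w = -1.0 if (b == 1 and t == T // 2) else 0.0
        x = x + (beta / T) * u + w
    return total

pi0 = lambda x: -x   # "decreasing" policy: near-optimal in L_0, poor in L_1
pi1 = lambda x: 0.0  # "lazy" policy: optimal in L_1, poor in L_0

for name, pi in [("pi0", pi0), ("pi1", pi1)]:
    print(name, simulate(pi, b=0), simulate(pi, b=1))
```

Running this illustrates the dichotomy driving the lower bound: each policy is near-optimal in one system but incurs $\Omega(T)$ cost in the other, and no causal controller can distinguish the two systems before time $T/2$.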
The following lemma characterizes the performance of two time-invariant linear policies $\pi_0, \pi_1$ for $\mathcal{L}_0, \mathcal{L}_1$ respectively:

Lemma 27. Define $\pi_0, \pi_1 : \mathbb{R}\to\mathbb{R}$ by $\pi_0(x) = -x$ and $\pi_1(x) = 0$. Then:
- Policy $\pi_0$ is an element of $\mathcal{K}_1(\mathcal{L}_0)$, and the iterates $(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t)_{t=1}^{T}$ produced by following $\pi_0$ in system $\mathcal{L}_0$ satisfy
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t\big) \le \frac{2T}{\beta}e^{-\beta/2}.$$
- Policy $\pi_1$ is an element of $\mathcal{K}_1(\mathcal{L}_1)$, and the iterates $(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t)_{t=1}^{T}$ produced by following $\pi_1$ in system $\mathcal{L}_1$ satisfy
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t\big) = 0.$$

Proof. Note that $\pi_0, \pi_1$ are both time-invariant linear policies. The inclusion $\pi_0 \in \mathcal{K}_1(\mathcal{L}_0)$ is immediate from the fact that $\mathcal{L}_0$ has transitions $A = 1$, $B = \beta/T$, and $|A - B| \le 1$. Similarly, $\pi_1 \in \mathcal{K}_1(\mathcal{L}_1)$ because $|A| \le 1$. To bound the total cost of $\pi_0$ on $\mathcal{L}_0$, note that $u^{\mathcal{L}_0,\pi_0}_t = -x^{\mathcal{L}_0,\pi_0}_t = -(1-\beta/T)^{t-1}$ for all $t \in [T]$. Hence,
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t\big) = 2\sum_{t=T/2+1}^{T}(1-\beta/T)^{t-1} \le \frac{2T}{\beta}e^{-\beta/2}.$$
Moreover, we have $x^{\mathcal{L}_1,\pi_1}_t = \mathbb{1}[t \le T/2]$ and $u^{\mathcal{L}_1,\pi_1}_t = 0$ for all $t \in [T]$, from which it is clear that $\sum_{t=1}^{T}c_t(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t) = 0$.

Next, we show that the total cost of any trajectory $(x_t, u_t)_{t=1}^{T}$ can be lower bounded in terms of $|x_{T/2+1}|$ in both $\mathcal{L}_0$ and $\mathcal{L}_1$:

Lemma 28. Let Alg be any randomized algorithm for online control, and let $b \in \{0, 1\}$. The (random) trajectory $(x_t, u_t)_{t=1}^{T}$ produced by Alg in system $\mathcal{L}_b$ satisfies the inequality
$$\sum_{t=1}^{T}c_t(x_t, u_t) = \sum_{t=T/2+1}^{T}\big(|x_t| + |u_t|\big) \ge \frac{T}{2\beta}\,|x_{T/2+1}|$$
with probability 1.

Proof. By definition of $\mathcal{L}_0, \mathcal{L}_1$, any valid trajectory in $\mathcal{L}_b$ satisfies $|u_t| = \frac{T}{\beta}|x_{t+1} - x_t|$ for all $T/2 < t < T$. We consider two cases:
1. If $\min_{T/2 < t \le T}|x_t| \ge \frac12|x_{T/2+1}|$, then
$$\sum_{t=T/2+1}^{T}\big(|x_t| + |u_t|\big) \ge \frac{T}{4}|x_{T/2+1}| \ge \frac{T}{2\beta}|x_{T/2+1}|,$$
since $\beta \ge 2$.
2. If $\min_{T/2 < t \le T}|x_t| < \frac12|x_{T/2+1}|$, then
$$\sum_{t=T/2+1}^{T}\big(|x_t| + |u_t|\big) \ge \frac{T}{\beta}\sum_{t=T/2+1}^{T-1}|x_{t+1} - x_t| \ge \frac{T}{\beta}\Big||x_{T/2+1}| - \min_{T/2 < t \le T}|x_t|\Big| \ge \frac{T}{2\beta}|x_{T/2+1}|$$
by the triangle inequality.
In both cases the claimed inequality holds.

We can now prove Theorem 25.

Proof of Theorem 25. Let $b \sim \mathrm{Unif}(\{0,1\})$ be an unbiased random bit, and let $(x_t, u_t)_{t=1}^{T}$ be the (random) trajectory produced by executing Alg on $\mathcal{L}_b$. On the one hand, by Eq. (41) applied to $\mathcal{L}_0$ and $\mathcal{L}_1$, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] \le f(1,1,1,T) + \frac12\Big(\inf_{K\in\mathcal{K}_1(\mathcal{L}_0)}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,K}_t, u^{\mathcal{L}_0,K}_t\big) + \inf_{K\in\mathcal{K}_1(\mathcal{L}_1)}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_1,K}_t, u^{\mathcal{L}_1,K}_t\big)\Big) \le f(1,1,1,T) + \frac{T}{\beta}e^{-\beta/2}, \qquad (42)$$
where the first inequality applies Eq. (41) to each of $\mathcal{L}_0, \mathcal{L}_1$ (using that the cost functions $(c_t)_t$ are convex and 1-Lipschitz and that $|w^0_t|, |w^1_t| \le 1$ for all $t \in [T]$), and the second inequality is by Lemma 27. On the other hand, by Lemma 28, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] \ge \frac{T}{2\beta}\,\mathbb{E}\big[|x_{T/2+1}|\big] = \frac{T}{2\beta}\,\mathbb{E}\Big[\Big|x_{T/2} + \frac{\beta}{T}u_{T/2} - b\Big|\Big] \overset{(\ast)}{=} \frac{T}{4\beta}\Big(\mathbb{E}\Big[\Big|x_{T/2} + \frac{\beta}{T}u_{T/2}\Big|\Big] + \mathbb{E}\Big[\Big|x_{T/2} + \frac{\beta}{T}u_{T/2} - 1\Big|\Big]\Big) \ge \frac{T}{4\beta}, \qquad (43)$$
where the key equality $(\ast)$ uses the fact that $\mathcal{L}_0, \mathcal{L}_1$ are identical up until and including time $T/2$, and hence $(x_{T/2}, u_{T/2})$ is independent of $b$; the final inequality uses $|a| + |a-1| \ge 1$. Comparing Eq. (43) with Eq. (42) yields that $f(1,1,1,T) \ge \frac{T}{4\beta} - \frac{T}{\beta}e^{-\beta/2} = \Omega(T)$ for any sufficiently large constant $\beta$.

D.2 Proof of Theorem 8

Definition 29. Let $0 \le \underline{\alpha} \le \overline{\alpha} \le 1$ and let $\mathcal{I} := \bigcup_{\alpha\in[\underline\alpha,\overline\alpha]}\Delta^d_\alpha$. We define $\mathcal{K}(\mathcal{I})$ to be the set of linear, time-invariant policies $x \mapsto Kx$ where $K \in \bigcup_{\alpha\in[\underline\alpha,\overline\alpha]}\mathcal{S}^d_\alpha$.

Theorem 30 (Formal statement of Theorem 8). Let Alg be any randomized algorithm for online control with the following guarantee:

Let $d, T \in \mathbb{N}$ and $\mathcal{I} := \bigcup_{\alpha\in[0,\overline\alpha]}\Delta^d_\alpha$ for some $\overline\alpha \in (0,1)$. Let $\mathcal{L} = (A, B, \mathcal{I}, x_1, (\gamma_t)_t, (w_t)_t, (c_t)_t)$ be a simplex LDS with state space $\Delta^d$ and cost functions $(c_t)_t$ satisfying Assumption 1 with Lipschitz parameter $L > 0$.
Then the iterates $(x_t, u_t)_{t=1}^{T}$ produced by Alg with input $(A, B, \mathcal{I}, T)$ on interaction with $\mathcal{L}$ satisfy
$$\mathrm{regret}_{\mathcal{K}(\mathcal{I})} := \mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] - \inf_{K\in\mathcal{K}(\mathcal{I})}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L},K}_t, u^{\mathcal{L},K}_t\big) \le f(d, L, \overline\alpha, T), \qquad (44)$$
where $(x^{\mathcal{L},K}_t, u^{\mathcal{L},K}_t)_{t=1}^{T}$ are the iterates produced by following policy $x \mapsto Kx$ in system $\mathcal{L}$ for all $t \in [T]$.

For any sufficiently large constant $\beta$, if we define $\overline\alpha(T) := \beta/T$, then $f(1, 1, \overline\alpha(T), T) = \Omega(T)$.

We define two simplex LDSs on $\Delta^2$ as follows:
$$\mathcal{L}_0 := \big(I_2, I_2, \mathcal{I}, x_1, (\gamma_t)_t, (w^0_t)_t, (c_t)_t\big), \qquad \mathcal{L}_1 := \big(I_2, I_2, \mathcal{I}, x_1, (\gamma_t)_t, (w^1_t)_t, (c_t)_t\big),$$
where $I_2 \in \mathbb{R}^{2\times2}$ is the identity matrix, the (common) valid control set is $\mathcal{I} := \bigcup_{\alpha\in[0,\beta/T]}\Delta^2_\alpha$, the (common) initial state is $x_1 = (0, 1)$, the (common) cost functions $(c_t)_t$ are defined as
$$c_t(x, u) := \begin{cases}|x(2) - 1/2| & \text{if } t > T/2,\\ 0 & \text{otherwise,}\end{cases}$$
the (common) perturbation strengths are $\gamma_t := \frac12\,\mathbb{1}[t = T/2]$, and the perturbations of $\mathcal{L}_0$ are $w^0_t := (1/2, 1/2)$ for all $t$, whereas the perturbations of $\mathcal{L}_1$ are $w^1_t := (1, 0)$ for all $t$. Thus, for both systems, the dynamics are described by $x_{t+1} := (1 - \|u_t\|_1)x_t + u_t$ for all $t \ne T/2$.

Lemma 31. Define $\pi_0, \pi_1 : \Delta^2 \to \bigcup_{\alpha\in[0,1]}\Delta^2_\alpha$ by $\pi_0(x) := \frac{\beta}{T}(1/2, 1/2)$ and $\pi_1(x) := (0, 0)$. Then $\pi_0, \pi_1 \in \mathcal{K}(\mathcal{I})$, and:
- The iterates $(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t)_{t=1}^{T}$ produced by following $\pi_0$ in system $\mathcal{L}_0$ satisfy
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t\big) \le \frac{T}{\beta}e^{-\beta/2}.$$
- The iterates $(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t)_{t=1}^{T}$ produced by following $\pi_1$ in system $\mathcal{L}_1$ satisfy
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t\big) = 0.$$

Proof. The fact that $\pi_0, \pi_1 \in \mathcal{K}(\mathcal{I})$ is immediate from Definition 29 and the choice of $\mathcal{I}$. To bound the total cost of $\pi_0$ on $\mathcal{L}_0$, note that $x^{\mathcal{L}_0,\pi_0}_{t+1}(2) - 1/2 = (1 - \beta/T)\big(x_t(2) - 1/2\big)$ for all $t \ne T/2$, and $x^{\mathcal{L}_0,\pi_0}_{t+1}(2) - 1/2 = \frac12(1 - \beta/T)\big(x_t(2) - 1/2\big)$ for $t = T/2$. Thus,
$$\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,\pi_0}_t, u^{\mathcal{L}_0,\pi_0}_t\big) = \sum_{t=T/2+1}^{T}\big|x^{\mathcal{L}_0,\pi_0}_t(2) - 1/2\big| \le \sum_{t=T/2+1}^{T}(1 - \beta/T)^{t-1} \le \frac{T}{\beta}e^{-\beta/2}.$$
Moreover, we have $x^{\mathcal{L}_1,\pi_1}_t = (0, 1)$ for all $t \le T/2$ and $x^{\mathcal{L}_1,\pi_1}_t = (1/2, 1/2)$ for all $t > T/2$, so indeed $\sum_{t=1}^{T}c_t(x^{\mathcal{L}_1,\pi_1}_t, u^{\mathcal{L}_1,\pi_1}_t) = 0$ as claimed.

Lemma 32. Let Alg be any randomized algorithm for online control, and let $b \in \{0, 1\}$. The (random) trajectory $(x_t, u_t)_{t=1}^{T}$ produced by Alg in system $\mathcal{L}_b$ satisfies the inequality
$$\sum_{t=1}^{T}c_t(x_t, u_t) = \sum_{t=T/2+1}^{T}\big|x_t(2) - 1/2\big| \ge \frac{T}{8\beta}\big|x_{T/2+1}(2) - 1/2\big|^2 - 1.$$

Proof. By definition of $\mathcal{L}_0, \mathcal{L}_1$ and the valid control set $\mathcal{I}$, any valid trajectory in $\mathcal{L}_b$ satisfies $\|x_t - x_{t+1}\|_1 \le 2\|u_t\|_1 \le 2\beta/T$ for all $T/2 < t < T$. It follows that $|x_{t+1}(2) - 1/2| \ge |x_t(2) - 1/2| - 2\beta/T$ for all such $t$. Writing $z := |x_{T/2+1}(2) - 1/2|$, we hence have
$$\sum_{t=T/2+1}^{T}\big|x_t(2) - 1/2\big| \ge \sum_{n=1}^{T/2}\max\Big(0,\ z - \frac{2\beta n}{T}\Big) \ge \Big(\frac{Tz}{4\beta} - 1\Big)\cdot\frac{z}{2} \ge \frac{T}{8\beta}z^2 - 1,$$
as claimed.

Proof of Theorem 30. Let $b \sim \mathrm{Unif}(\{0,1\})$ be an unbiased random bit, and let $(x_t, u_t)_{t=1}^{T}$ be the (random) trajectory produced by executing Alg on $\mathcal{L}_b$. On the one hand, by Eq. (44) applied to $\mathcal{L}_0$ and $\mathcal{L}_1$, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] \le f(1, 1, \beta/T, T) + \frac12\Big(\inf_{K\in\mathcal{K}(\mathcal{I})}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_0,K}_t, u^{\mathcal{L}_0,K}_t\big) + \inf_{K\in\mathcal{K}(\mathcal{I})}\sum_{t=1}^{T}c_t\big(x^{\mathcal{L}_1,K}_t, u^{\mathcal{L}_1,K}_t\big)\Big) \le f(1, 1, \beta/T, T) + \frac{T}{2\beta}e^{-\beta/2}, \qquad (45)$$
where the first inequality uses the definition of $\mathcal{I}$ and the fact that the cost functions $(c_t)_t$ are convex and 1-Lipschitz per Assumption 1, and the second inequality is by Lemma 31. On the other hand, by Lemma 32, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] \ge \frac{T}{8\beta}\,\mathbb{E}\big[\big(x_{T/2+1}(2) - 1/2\big)^2\big] - 1, \quad\text{where}\quad x_{T/2+1} = \frac12\big[(1 - \|u_{T/2}\|_1)x_{T/2} + u_{T/2}\big] + \frac12 w^b_{T/2}.$$
Since $\mathcal{L}_0, \mathcal{L}_1$ are identical up until and including time $T/2$, the pair $(x_{T/2}, u_{T/2})$ is independent of $b$. The two possible values of $x_{T/2+1}(2)$ (corresponding to $w^b_{T/2}(2) \in \{1/2, 0\}$) differ by $1/4$, so at least one of them lies at distance at least $1/8$ from $1/2$; hence $\mathbb{E}[(x_{T/2+1}(2) - 1/2)^2] \ge \frac12(1/8)^2 = 1/128$, and
$$\mathbb{E}\Big[\sum_{t=1}^{T}c_t(x_t, u_t)\Big] \ge \frac{T}{1024\beta} - 1. \qquad (46)$$
Comparing Eq. (46) with Eq. (45) yields that
$$f(1, 1, \beta/T, T) \ge \frac{T}{1024\beta} - 1 - \frac{T}{2\beta}e^{-\beta/2} = \Omega(T)$$
for any sufficiently large constant $\beta$.

E Implementation details

In this section we describe the version of GPC-Simplex (Algorithm 1) implemented for our experiments. First, the dynamical systems in our experiments are non-linear. The GPC-Simplex algorithm is still practical and applicable in such settings (concretely, any setting with update rule Eq. (9)), but of course several modifications/generalizations must be made:
1. The algorithm takes as input the function $f$ describing the dynamics in Eq. (9), rather than transition matrices $A, B$. Accordingly, in Line 8, the expression $(1 - \|u_t\|_1)Ax_t + Bu_t$ (which exactly corresponds to the noiseless update rule in a simplex LDS) is replaced by $f(x_t, u_t)$. Moreover, in Line 9, the hypothetical iterates $x_t(p, M^{[1:H]}), u_t(p, M^{[1:H]})$ under policy $\pi^{p,M^{[1:H]}}$ are computed using the update rule $f$.
2. The algorithm directly takes as input a learning rate $\eta$ for the mirror descent subroutine, rather than the mixing time bound $\tau$. In our experiments, we always set $\eta := \sqrt{dH\ln(H)}/(2\sqrt{T})$.
3. We always parametrize our systems so that the valid control set is the space of distributions $\Delta^d$. Hence, the domain used for mirror descent is $\mathcal{X}_{d,H,1,1}$. Mirror descent is implemented by exponential-weights updates with learning rate $\eta$ and uniform initialization.
We remark that the above (natural) modifications to GPC-Simplex are analogous to the modifications to GPC made by [2] to perform online control for nonlinear systems.

F Experiments: Controlled SIR model

In this section, we provide additional experiments in the controlled SIR model. Specifically, in Appendix F.1 we provide experimental evaluations when there are perturbations to the system (i.e., $\gamma_t$ is not always 0 in Eq. (9)). In Appendix F.2 we vary the parameters of the SIR model.

F.1 Control in presence of perturbations

We experiment with the SIR system Eq. (11) with the following parameters: $\beta = 0.5$, $\theta = 0.03$, $\xi = 0.005$, and cost function given by
$$c_t(x_t, u_t) = c_3\, x_t(2)^2 + c_2\, x_t(1)u_t(1).$$
We test the performance of our algorithm on $(c_2, c_3) = (1, 5)$. In addition, we add a perturbation sequence: $w_t = [0, 1, 0]$ and $\gamma_t = 0.01\cdot\mathrm{Ber}(0.2)$ for all $1 \le t \le 200$. Fig. 4 compares the costs over $T = 200$ time steps incurred by GPC-Simplex to that of always executing $u_t = [1, 0]$ (full prevention) and that of always executing $u_t = [0, 1]$ (no prevention). In addition to cost, we plot the value of $u_t(2)$ over time, representing how the prevention measure is relaxed over time according to GPC-Simplex.

F.2 Alternative parameter settings

We experiment with two SIR systems with different sets of parameters. The first uses $\beta = 0.5$, $\theta = 0.03$, $\xi = 0.005$, whereas the second uses $\beta = 0.3$, $\theta = 0.05$, $\xi = 0.001$. In both cases, the cost function is
$$c_t(x_t, u_t) = c_3\, x_t(2)^2 + c_2\, x_t(1)u_t(1).$$
For both experiments, we test the performance of our algorithm on different choices of cost parameters. In particular, we test the parameter tuples $(c_2, c_3) \in \{(1, 20), (1, 10), (1, 5), (1, 1)\}$. Figs. 5 and 6 compare the costs over $T = 200$ time steps incurred by GPC-Simplex to that of always executing $u_t = [1, 0]$ (full prevention) and that of always executing $u_t = [0, 1]$ (no prevention). Specifically, Fig. 5 uses the first set of parameters above, and Fig. 6 uses the second set. In addition to cost, we plot the value of $u_t(2)$ over time, representing how the effective transmission rate evolves over time according to GPC-Simplex.
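For reference, here is a minimal sketch (ours, not the paper's implementation) of the discretized controlled SIR step and the cost function used in these experiments. Eq. (11) is not reproduced in this appendix, so the exact discretization, and in particular the assumption that the control $u_t(1)$ multiplicatively suppresses transmission, is our reading of it; parameter names follow the text above.

```python
import numpy as np

def sir_step(x, u, beta=0.5, theta=0.03, xi=0.005):
    """One step of the discretized controlled SIR dynamics (our reading of Eq. (11)).

    x = (S, I, R) is a distribution over compartments; u[0] in [0, 1] is the
    prevention level (u[0] = 1 is full prevention), assumed here to scale
    down the transmission rate beta.
    """
    S, I, R = x
    new_infections = beta * (1.0 - u[0]) * S * I  # suppressed by the control
    return np.array([
        S - new_infections + xi * R,   # loss of immunity returns R -> S
        I + new_infections - theta * I,
        R + theta * I - xi * R,
    ])

def cost(x, u, c2=1.0, c3=5.0):
    """Cost c_t(x_t, u_t) = c3 * x_t(2)^2 + c2 * x_t(1) * u_t(1) from the text
    (1-indexed in the paper: x(1) = S, x(2) = I)."""
    return c3 * x[1] ** 2 + c2 * x[0] * u[0]
```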
We notice that our algorithm consistently outperforms the two baselines. No matter how we set the parameters, our algorithm outperforms the full-intervention baseline, since that baseline's cumulative cost grows linearly with time. As $c_3$ gets larger, the gap between our algorithm and the no-intervention baseline becomes smaller, since the optimal policy under a high cost on control is essentially to play no control.

Figure 4: SIR with perturbations. $T = 200$. Initial state $x_1 = [0.9, 0.1, 0]$. GPC-Simplex parameter $H = 5$. Top: perturbation sequence $w_t = [0, 1, 0]$ and $\gamma_t = 0.01\cdot\mathrm{Ber}(0.2)$ for all $1 \le t \le 200$. Bottom: perturbation sequence where each $w_t$ is a normalized uniform random vector and $\gamma_t = 0.01$ for all $1 \le t \le 200$.

G Experiments: Controlling hospital flows

In this section, we provide more details regarding the setup and experiments in Section 4.1. The continuous-time dynamical system considered by [27] is the following: let $S(t), I(t)$ denote the susceptible and infected fractions of the population at time $t$, and let $\sigma(t)$ denote the control at time $t$. The system has some initial state $(S(0), I(0))$ in the set $D := \{(x_0, y_0) : x_0 > 0,\ y_0 > 0,\ x_0 + y_0 \le 1\}$, reflecting the constraint that $S(0), I(0)$ represent disjoint proportions of a population, and the system evolves according to the differential equations
$$S'(t) = -\gamma\sigma(t)I(t)S(t), \qquad (47)$$
$$I'(t) = \gamma\sigma(t)I(t)S(t) - \gamma I(t), \qquad (48)$$
where $\gamma > 0$ is some fixed model parameter, and the control $\sigma(t)$ models a non-pharmaceutical intervention (NPI) inducing a time-dependent reproduction number $\sigma(t) \in [0, \sigma_0]$, where $\sigma_0$ is the base reproduction number in the absence of interventions. In most examples in [27], including the example of controlling hospital flows, the parameter settings $\sigma_0 = 3$ and $\gamma = 0.1$ are used. This means that the natural discretization of Eqs. (47) and (48) is in fact equivalent to Eq. (11) with transmission rate $\beta := \gamma\sigma_0 = 0.3$, recovery rate $\theta := \gamma = 0.1$, loss-of-immunity rate $\xi := 0$, no perturbations (i.e., $\gamma_t = 0$ for all $t$), and control $u_t(1) := 1 - \sigma(t)/\sigma_0$ at each time $t$.

Figure 5: Control with costs: control over $T = 200$ steps. $\gamma_t = 0$ for all $t$. SIR parameters: $\beta = 0.5$, $\theta = 0.03$, $\xi = 0.005$. Initial state $x_1 = [0.9, 0.1, 0]$. GPC-Simplex parameter: $H = 5$. Left: instantaneous cost over time, compared with that of no control (green) and full control (orange). Middle: cumulative cost over time. Right: $u_t(2)$ output by GPC-Simplex over time. $(c_2, c_3)$ values (from top to bottom rows): $(1, 20), (1, 10), (1, 5), (1, 1)$.

The goal in [27] is the following: given an initial state $(S(0), I(0))$ along with a horizon length $T > 0$ and the parameters listed above, choose an admissible control function $\sigma : [0, T] \to [0, \sigma_0]$ to minimize the loss
$$J := -S_\infty\big(S(T), I(T), \sigma_0\big) + \int_0^T L\big(S(t), I(t), \sigma(t)\big)\,dt,$$
where $L(S(t), I(t), \sigma(t))$ is the instantaneous cost at time $t$, and the extra term involving $S_\infty(S(T), I(T), \sigma_0)$ incentivizes the state of the system at time $T$ to lead to a favorable long-term trajectory (in the absence of any interventions after time $T$). In [27], the following formula for $S_\infty$ is given; see that paper for further discussion:
$$S_\infty(S, I, \sigma_0) = -\frac{1}{\sigma_0}W_0\big(-\sigma_0\, S\, e^{-\sigma_0(S + I)}\big),$$
where $W_0$ denotes the principal branch of the Lambert $W$ function.

Figure 6: Control with costs: control over $T = 200$ steps. $\gamma_t = 0$ for all $t$. SIR parameters: $\beta = 0.3$, $\theta = 0.05$, $\xi = 0.001$. Initial state $x_1 = [0.9, 0.1, 0]$. GPC-Simplex parameter: $H = 5$. Left: instantaneous cost over time, compared with that of no control (green) and full control (orange). Middle: cumulative cost over time. Right: $u_t(2)$ output by GPC-Simplex over time. $(c_2, c_3)$ values (from top to bottom rows): $(1, 20), (1, 10), (1, 5), (1, 1)$.
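As a sanity check on the formula for $S_\infty$ above, the quantity can be computed numerically with SciPy's implementation of the Lambert $W$ function. The sketch below assumes the standard SIR final-size relation $S_\infty e^{-\sigma_0 S_\infty} = S\,e^{-\sigma_0(S+I)}$, which is how we have reconstructed the formula from [27]; treat it as illustrative rather than the exact experimental implementation.

```python
import numpy as np
from scipy.special import lambertw

def s_infinity(S, I, sigma0):
    """Long-run susceptible fraction if the epidemic runs uncontrolled from (S, I).

    Solves S_inf * exp(-sigma0 * S_inf) = S * exp(-sigma0 * (S + I))
    via the principal branch W0 of the Lambert W function.
    """
    w = lambertw(-sigma0 * S * np.exp(-sigma0 * (S + I)), k=0)
    return float(np.real(-w / sigma0))

# With the hospital-flows parameters sigma0 = 3 and an initial state
# (S, I) = (0.99, 0.01), only a small susceptible fraction remains:
print(s_infinity(0.99, 0.01, 3.0))  # ~0.06
```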
The instantaneous cost is modeled by [27] as follows:
$$L\big(S(t), I(t), \sigma(t)\big) = c_2\Big(1 - \frac{\sigma(t)}{\sigma_0}\Big)^2 + c_3\,\frac{I(t) - y_{\max}}{1 + e^{-100(I(t) - y_{\max})}},$$
where $c_2, c_3$ are some parameters determining the cost of preventing disease transmission and the cost of a medical surge (i.e., when the proportion of infected individuals exceeds $y_{\max}$). Notice that the second term above will indeed be very small in magnitude unless $I(t)$ exceeds $y_{\max}$.

Note that GPC-Simplex cannot directly handle end-of-trajectory losses such as the term $S_\infty(S(T), I(T), \sigma_0)$. Thus, in our evaluation of GPC-Simplex on this system, we instead incorporate $S_\infty$ into the instantaneous cost functions. Concretely, we use the following cost function at time $t$:
$$c_t(x_t, u_t) = -S_\infty\big(x_t(1), x_t(2), \sigma_0\big) + c_2\,u_t(1)^2 + c_3\,\frac{x_t(2) - y_{\max}}{1 + e^{-100(x_t(2) - y_{\max})}}.$$
Recall that we write $x_t = (S_t, I_t, R_t)$ and $u_t(1) = 1 - \sigma(t)/\sigma_0$, so modulo the addition of the $S_\infty$ term at all times $t < T$ and the conversion from continuous time to discrete time, our loss is analogous to that of [27].

H Experiments: Controlled replicator dynamics

The replicator equation is a basic model in evolutionary game theory that describes how individuals in a population update their strategies over time based on their payoffs from repeatedly playing a game with random opponents from the population [11]. The basic principle is that strategies (or traits) that perform better than average in a given environment will, over time, increase in frequency within the population, whereas strategies that perform worse than average will become less common.

Formally, consider a normal-form two-player game with $d$ possible strategies and payoff matrix $M \in \mathbb{R}^{d\times d}$. A population at time $t$ is modelled by the proportion of individuals that currently favor each strategy, and thus can be summarized by a distribution $x(t) \in \Delta^d$. The fitness of an individual playing strategy $i \in [d]$ in a population with strategy distribution $x \in \Delta^d$ is defined to be $\mathrm{fitness}_{M,x}(i) := e_i^\top Mx$, where $e_i \in \mathbb{R}^d$ is the indicator vector for strategy $i$. That is, $\mathrm{fitness}_{M,x}(i)$ is simply the expected payoff of playing strategy $i$ against a random individual from the population. The replicator dynamics posit that the population's distribution over strategies $x(t)$ will evolve according to the following differential equation:
$$\frac{dx_i(t)}{dt} = x_i(t)\Big(\mathrm{fitness}_{M,x(t)}(i) - \mathbb{E}_{j\sim x(t)}\big[\mathrm{fitness}_{M,x(t)}(j)\big]\Big) = x_i(t)\big(e_i^\top Mx(t) - x(t)^\top Mx(t)\big). \qquad (49)$$
It is straightforward to check that this differential equation preserves the invariant that $x(t)$ is a distribution. This equation can induce various types of dynamics depending on the initialization and payoff matrix $M$: the distribution may converge to an equilibrium, it may cycle, or it may even exhibit chaotic behavior [11]. In this study we focus on a simple (time-discretized) replicator equation, namely the one induced by a generalized Rock-Paper-Scissors game whose payoffs may be controlled.

Controlled Rock-Paper-Scissors. The standard Rock-Paper-Scissors game has $d = 3$ and payoff matrix
$$M = \begin{pmatrix}0 & -1 & 1\\ 1 & 0 & -1\\ -1 & 1 & 0\end{pmatrix}.$$
Consider a setting where the game is run by an external agent that is allowed to set the payoffs. For simplicity, we assume that the game remains zero-sum and the rewards sum to 1, so the payoff matrix is now
$$M(u) = \begin{pmatrix}0 & -u_1 & u_3\\ u_1 & 0 & -u_2\\ -u_3 & u_2 & 0\end{pmatrix}$$
for a control vector $u \in \Delta^3$. The discrete-time analogue of the replicator equation with this controlled payoff matrix $M(u)$ is
$$x_{t+1} = f(x_t, u_t) := x_t + \eta\begin{pmatrix}x_{t,1}\, e_1^\top M(u_t)x_t\\ x_{t,2}\, e_2^\top M(u_t)x_t\\ x_{t,3}\, e_3^\top M(u_t)x_t\end{pmatrix}, \qquad (50)$$
where $x_t, u_t \in \Delta^3$ are the population distribution and control at time $t$ respectively, and $\eta \in (0, 1)$ is the rate of evolution.
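To make Eq. (50) concrete, here is a short sketch (ours) of the controlled replicator step. The sign pattern in $M(u)$ below is an assumption chosen so that $M(u)$ is skew-symmetric (hence zero-sum) and $u = (1/3, 1/3, 1/3)$ recovers the standard Rock-Paper-Scissors matrix up to scaling, matching the text above.

```python
import numpy as np

def payoff(u):
    """Controlled zero-sum RPS payoff matrix M(u); sign placement is our assumption."""
    u1, u2, u3 = u
    return np.array([[0.0, -u1,  u3],
                     [ u1, 0.0, -u2],
                     [-u3,  u2, 0.0]])

def replicator_step(x, u, eta=0.25):
    """Discrete-time controlled replicator update of Eq. (50).

    Since M(u) is skew-symmetric, the mean-fitness term x^T M(u) x vanishes,
    so the coordinate sum of x is preserved by the update.
    """
    M = payoff(u)
    return x + eta * x * (M @ x)   # elementwise: x_i += eta * x_i * (M x)_i

x = np.array([0.5, 0.3, 0.2])  # an example (non-uniform) population
u = np.full(3, 1.0 / 3.0)      # induces the standard RPS game up to scaling
for _ in range(5):
    x = replicator_step(x, u)
print(x, x.sum())  # x remains a distribution
```

Because $M(u)$ is skew-symmetric, $x_t^\top M(u)x_t = 0$; together with $\eta \le 1$ and the entries of $M(u)$ being at most 1 in magnitude, this keeps $x_t$ in the simplex, as noted next.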
Note that the term $x_t^\top M(u_t)x_t$ does not need to appear in Eq. (50), because $M(u_t)$ is always zero-sum. Also, since $\eta \le 1$ and all entries of $M(u_t)$ are at most 1 in magnitude, if $x_t$ is a distribution then $x_{t+1}$ will remain a distribution. We omit noise in this study, so Eq. (50) is a special case of Eq. (9) with $\gamma_t = 0$ for all $t$.

Figure 7: Experimental results for the dynamical system with horizon $T = 100$, uniform initial state, update rule Eq. (50) with $\eta = 1/4$, no perturbations, and time-invariant cost function $c_t(x_t, u_t) = x_{t,1}^2$ for all times $t$. (a) Instantaneous cost achieved by GPC-Simplex over time, compared to the default Rock-Paper-Scissors control and Best Response control (dashed orange). (b) Proportions of the population playing strategies "rock", "paper", and "scissors" over time under control by GPC-Simplex. GPC-Simplex was implemented as described in Appendix E. The Best Response controller at each time $t$ picks the control $u$ that minimizes $c_{t-1}(f(x_t, u), u)$. The default controller picks the uniform control $u = (1/3, 1/3, 1/3)$.

Parameters and cost function. We define a (nonlinear) dynamical system with uniform initial state $x_1 = (1/3, 1/3, 1/3)$, update rule Eq. (50) with $\eta = 1/4$, and $T = 100$ timesteps. We consider the fixed cost function $c(x_t, u_t) := x_{t,1}^2$, which can be thought of as penalizing the strategy "rock".

Results. We compare GPC-Simplex (implemented as described in Appendix E) with a baseline control that simply uses the standard Rock-Paper-Scissors payoff matrix (up to scaling), induced by $u = (1/3, 1/3, 1/3) \in \Delta^3$. As shown in Fig. 7a, GPC-Simplex (shown in blue) significantly outperforms this baseline (shown in green), learning to alter the payoffs in such a way that the population tends to avoid the "rock" strategy. The evolution of the distribution over time under GPC-Simplex is shown in Fig. 7b.

For completeness, we also compare GPC-Simplex against the Best Response strategy (shown in dashed orange), which essentially performs 1-step optimal control, using the fact that the cost function for this example is time-invariant. While both controllers eventually learn a good policy, Fig. 7a clearly shows that Best Response learns faster. However, it is strongly exploiting the time-invariance of the cost function: in general, this algorithm computes the best response with respect to the previous cost function rather than the current cost function, which it does not observe until after playing a control. In Fig. 8, we consider a slightly modified system where the cost function includes a cost on the control with probability 1/2. In this setting, we see that GPC-Simplex still eventually learns a good policy, whereas Best Response and the default control incur large costs throughout the trajectory. Best Response in particular suffers greatly due to the time-varying nature of the costs.

I Discussions

I.1 Broader impacts

Our work provides a robust algorithm with theoretical justifications for practical control problems such as disease control. The experiments performed are preliminary, and more careful empirical verification is necessary before our algorithm can responsibly be deployed in high-impact scenarios. Excluding the scenario of ill intention, we do not anticipate any negative social impact.

Figure 8: Experimental results for the dynamical system with horizon $T = 200$, uniform initial state, update rule Eq. (50) with $\eta = 1/4$, no perturbations, and random cost function which is either $c_t(x_t, u_t) = x_{t,1}^2$ or $c_t(x_t, u_t) = x_{t,1}^2 + u_{t,3}^2$ with equal probability.
GPC-Simplex was implemented as described in Appendix E. The Best Response controller at each time $t$ picks the control $u$ that minimizes $c_{t-1}(f(x_t, u), u)$. The default controller picks the uniform control $u = (1/3, 1/3, 1/3)$. The plot shows the cost achieved by GPC-Simplex over time, compared to the default Rock-Paper-Scissors control and Best Response control (dashed orange). Due to the non-continuity induced by the random cost functions, the loss plotted at time $t$ is the average loss of the controller across the last $\min(t, 15)$ time steps.

I.2 Computational Resources for Experiments

The experiments in this work are simulations and relatively small-scale. They were run on Google Colab with default compute resources. For each experiment, the time required to roll out one trajectory using GPC-Simplex was less than 10 minutes.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims made in the abstract and introduction accurately reflect the paper's contributions and scope.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The paper discusses the limitations of the work.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: For each theoretical result, the paper provides the full set of assumptions and a complete (and correct) proof.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The paper fully discloses all the information needed to reproduce the main experimental results of the paper.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide a link to the anonymous repository containing our code.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The paper specifies all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: The experiments are primarily deterministic; the experiments with randomness are provided largely as proof-of-concept.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: See Appendix I.2.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The research conforms to the Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: See Appendix I.1.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We cite the original paper that produced each code package or dataset.

13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.