# A Regret Minimization Approach to Multi-Agent Control

Udaya Ghai 1 2, Udari Madhushani 3, Naomi Leonard 3, Elad Hazan 1 2

1 Department of Computer Science, Princeton University, Princeton, NJ. 2 Google AI Princeton, Princeton, NJ. 3 Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ. Correspondence to: Udaya Ghai.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft and show that the distributed method is robust to failure and to adversarial perturbations in the dynamics.

1 Introduction

In many real-world scenarios it is desirable to apply multi-agent control algorithms, as opposed to a centrally designed policy, for a number of practical reasons. First is increased robustness: agents, serving as control actuators, may be added to or removed from an existing set without changing the distributed policies. This also allows for fault tolerance: agents can adapt on the fly to faulty actuators. Second, limited computational resources and/or system information may limit the applicability of a sophisticated centralized policy, whereas simple distributed policy agents are feasible and can adapt to the underlying dynamics. Third, multi-agent control allows robustness to noisy state and control observations by the individual agents. These advantages motivate the study of distributed control, and indeed a vast literature exists on the design and analysis of such methods.

In this paper we propose a novel game-theoretic performance metric for multi-agent control (MAC), which we call multi-agent regret. This metric measures the difference between the total loss of the joint policy of the individual agents and the loss of the best joint policy in hindsight from a certain reference class of policies. Vanishing multi-agent regret thus implies competitive performance with respect to the optimal policy in that class.

We study a mechanism for MAC that takes individual controllers and merges them into a MAC method with vanishing regret. For this purpose, we make use of recent advances in online learning for control, and specifically the framework of nonstochastic control. This methodology was recently proposed for the study of robust adaptive control algorithms through the lens of online convex optimization. As opposed to optimal and robust control, nonstochastic control studies adaptive control algorithms that minimize regret, i.e. the difference in loss/reward of the controller vs. the best policy in hindsight from a reference class.
This is a game-theoretic performance metric, which is naturally applicable to agents interacting in a multi-agent environment.

1.1 Our result and techniques

Our main result is a reduction from single-agent to multi-agent control with provable regret guarantees. More precisely, we assume that we have access to single-agent controllers with the following guarantees:

1. Each agent has sublinear regret vs. a class of policies under adversarial perturbations and cost functions.
2. Each agent is able to evaluate its policy given the controls of the other agents.
3. The policy that each agent uses is independent of the controls that the other agents play.
4. The joint cost function over all agents is jointly convex.

This set of assumptions is well motivated: we review several well-studied control settings in the literature which provide agents with such guarantees. Under these assumptions we give a reduction that takes regret minimizing control agents and guarantees low multi-agent regret of the joint control policy, w.r.t. the optimal joint policy in hindsight.

Let $\mathcal{A}$ be a single-agent control algorithm (more formally, an online convex optimization with memory algorithm, which we precisely define in coming sections) that, given a state $x_t$, produces a control $u_t$ and suffers cost $c_t(x_t, u_t)$. The average (policy) regret of $\mathcal{A}$ w.r.t. policy class $\Pi$ is defined to be the difference between its total cost and that of the best policy in hindsight, namely
$$R_T(\mathcal{A}) = \frac{1}{T}\sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^{T} c_t(x^\pi_t, u^\pi_t) .$$
Notice that the comparators are the states $x^\pi_t$ and controls $u^\pi_t$ that would have been played by policy $\pi$. Then algorithm $\mathcal{C}$, described in Algorithm 3, takes as input $k$ such controllers and, under the aforementioned assumptions, controls a multi-agent system with the guarantee that
$$R_T(\mathcal{C}) \le \sum_{i=1}^{k} R_T(\mathcal{A}^i) + O\!\left(\frac{1}{T} + \varepsilon\right),$$
where $\mathcal{A}^i$ is the control algorithm for agent $i$, which is not necessarily the same for every agent, and $\varepsilon$ is an upper bound on the error of each agent's policy evaluation. Here $R_T(\mathcal{C})$ is the multi-agent regret of the joint policy, namely the difference between the cost of $\mathcal{C}$ and that of the best joint policy in hindsight. The reference policy class we use for comparison is the Cartesian product of the individual agent policy classes. This is defined precisely in Section 4.

In order to obtain this result we study a generalization of online convex optimization to the multi-agent setting. Specifically, we propose a reduction from online convex optimization (OCO) to multi-agent OCO, such that the overall multi-agent regret is guaranteed to be the sum of the individual agents' regret on an induced loss function.

1.2 Paper structure

Section 2 describes the formal setting in which we operate, as well as basic definitions and assumptions. Section 3 provides the foundation of our result: a multi-agent generalization of online convex optimization, a definition of multi-agent regret, and an efficient reduction from OCO to multi-agent OCO. In Section 4 we give our main result: a meta-algorithm that converts regret minimizing controllers into a joint multi-agent regret minimizing distributed policy. Experiments are shown in Section 5. Some analysis is deferred to the Appendix.

1.3 Related work

Multi-agent RL. Multi-agent RL (Sutton & Barto, 2018; Zhang et al., 2021) considers the problem of finding globally optimal joint policies through optimizing local policies.
Common approaches for finding optimal policies in RL include dynamic programming, Monte-Carlo methods (Williams, 1992), temporal difference learning (Watkins & Dayan, 1992), combining temporal difference and Monte-Carlo learning, evolution strategies (Salimans et al., 2017), and policy gradient methods (Williams, 1992; Mnih et al., 2016). Policy gradient methods, which are closest to our algorithm, compute stochastic gradients from trajectories (Sutton et al., 2000; Sutton & Barto, 2018). More recently, Jin et al. (2021); Daskalakis et al. (2022) provide methods with game-theoretic equilibrium guarantees. Performance in multi-agent coordination significantly improves with the agents' ability to predict the behaviour of other agents. In opponent modeling methods, wherein agents learn to model the behaviours of their partners (Foerster et al., 2018), agents use counterfactual policies of other agents to update their policy parameters. In contrast, our approach decouples the agents' policies, forgoing the need to deal with counterfactuals of other agents.

Decentralized and distributed control. Communication-free decentralized control is the earliest multi-agent control model considered in the literature. In this model, each agent observes a different partial observation of the state and each agent provides its own control input to the system. This classic setting is well understood, with exact characterizations of stabilizability (Wang & Davison, 1973) and controllability (Lefebvre et al., 1980). More recently, decentralized control, also called distributed control, has been generalized to include a network layer of communication among the controllers, modeled by a network graph. Coordinated and cooperative control over a communication network has been studied using a wide range of approaches, such as distributed optimization (Nedić et al., 2018) and model predictive control (Christofides et al., 2013; Stürz et al., 2020), and for a wide variety of problems, including stabilization of formations, coverage, and search (Olfati-Saber et al., 2007; Bullo et al., 2009; Cao et al., 2013; Knorn et al., 2016). One particular direction in this setting is the design of distributed observers, which use information from neighbors to provide local estimates of the global state (Olfati-Saber, 2005; Carli et al., 2008; Matei & Baras, 2010; Das & Moura, 2017; Park & Martins, 2017). Such distributed observers can be applied to produce end-to-end control laws. One such approach stabilizes systems when the neighbor graph is strongly connected and the joint system is stabilizable (Wang et al., 2020). In contrast to the present work, approaches to this problem are decentralized from an information perspective, but not from a game-theoretic perspective, as the controllers are designed centrally.

Exploiting input redundancy in control. Input redundancy in overactuated systems enables efficient responses (Sobel & Shapiro, 1985) and robustness to failures (Oppenheimer et al., 2006; Tohidi et al., 2017). A number of optimization methods have been developed for redistributing the desired control input under different faulty conditions. Our approach does not tackle this directly, but the action-centric rather than policy-centric philosophy of the algorithm leads to some robustness in such settings.

Online learning and online convex optimization.
The framework of learning in games has been extensively studied as a model for learning in adversarial and nonstochastic environments (Cesa-Bianchi & Lugosi, 2006). Online learning was infused with algorithmic techniques from mathematical optimization into the setting of online convex optimization; see (Hazan, 2019) for a comprehensive introduction. We particularly make use of the property that no-regret algorithms converge to equilibrium, a phenomenon originally proposed by (Hart & Mas-Colell, 2000).

Online and nonstochastic control. The starting point of our study are algorithms which enjoy sublinear regret for online control of dynamical systems; that is, whose performance tracks a given benchmark of policies up to a term which is vanishing relative to the problem horizon. (Abbasi-Yadkori & Szepesvári, 2011) initiated the study of online control under the regret benchmark for linear time-invariant (LTI) dynamical systems. Our work instead adopts the nonstochastic control setting (Agarwal et al., 2019), which allows for adversarially chosen (i.e. non-Gaussian and potentially adaptively chosen (Ghai et al., 2021)) noise and costs that may vary with time. The nonstochastic control model was extended to nonlinear and time-varying dynamics in (Gradu et al., 2020; Minasyan et al., 2021; Chen et al., 2021). See (Hazan & Singh, 2021) for a comprehensive survey of results and advances in online and nonstochastic control. In order to extend these results to a multi-agent setting, we introduce a simple multi-agent setting of OCO. A more sophisticated decentralized OCO setting involving a communication graph has been explored in (Lee et al., 2016), though the information model for that line of work requires sharing of control policies between agents rather than just black-box actions.

2 Problem Setting

2.1 Notation

The norm $\|\cdot\|$ refers to the $\ell_2$ norm for vectors and the spectral norm for matrices. For any natural number $n$, the set $[n]$ refers to the set $\{1, 2, \ldots, n\}$. We use the window notation $z_{a:b} \stackrel{\text{def}}{=} [z_a, \ldots, z_b]$ to represent a sequence of vectors. We use the convention that indices $i, j$ refer to agents, with superscripts (typically) denoting an agent-specific quantity, while lack of a superscript denotes something global. We use the index $-i$ to represent information for all agents except $i$, for example $u^{-i}_t = (u^1_t, \ldots, u^{i-1}_t, u^{i+1}_t, \ldots, u^k_t)$.

2.2 Multi-Agent Nonstochastic Control Problem

We consider the following multi-agent control problem. Let $f_t : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}^{d_x}$ and $g^i : \mathbb{R}^{d_x} \to \mathbb{R}^{d_{y_i}}$ define known time-varying dynamics. Starting from state $x_0$, the system follows the dynamical equation
$$y^i_t = g^i(x_t) + e^i_t, \qquad x_{t+1} = f_t(x_t, u_t) + w_t, \qquad (1)$$
where $x_t$ is the state, $u_t$ is the joint control consisting of $u^i_t \in \mathbb{R}^{d_{u_i}}$ for each $i \in [k]$, $y^i_t$ is a partial observation, and $e^i_t, w_t$ are additive, adversarially chosen disturbance terms. After committing to a control $u^i_t$, agent $i$ observes the controls of all other players $u^{-i}_t$ along with a convex cost $c_t : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}$. Together, the $k$ agents suffer cost $c_t(x_t, u_t)$. The information model described above is diagrammed for a 2-agent system in Figure 1.

Figure 1: Diagram of a 2-agent system S executing the dynamics from (1) with agents C1, C2.

It should be noted that while each agent shares its action with the other agents, it does not directly share its policy or its observation. Policy parameters might not be shareable due to memory/hardware restrictions (e.g. the policy is a neural net on a GPU), while lower dimensional controls may be less of a problem. Furthermore, some settings may involve local observations with private information, but with controls that are inherently public. This information model is also useful for seamlessly handling system failures, as discussed in Section 5.1.
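To make the information model concrete, the following minimal sketch (all names are illustrative, and $f$, $g^i$ are user-supplied) steps the dynamics of Eq. (1) once: every agent computes its control from its own observation and the public control history, and only the resulting actions are shared.

```python
import numpy as np

def simulate_round(x, agents, f, g_list, w, e_list, u_history):
    """One step of the multi-agent dynamics in Eq. (1).

    agents[i] is a callable u_i = agent(y_i, u_history): it sees only its
    own observation and the public history of joint controls, never the
    other agents' policies or observations.
    """
    # 1. Each agent receives its (noisy) partial observation y_t^i.
    ys = [g_i(x) + e_i for g_i, e_i in zip(g_list, e_list)]
    # 2. Each agent commits to a control from local information only.
    us = [agent(y_i, u_history) for agent, y_i in zip(agents, ys)]
    # 3. The joint control (but not policies or observations) becomes public.
    u_joint = np.concatenate(us)
    u_history.append(u_joint)
    # 4. The system moves: x_{t+1} = f_t(x_t, u_t) + w_t.
    return f(x, u_joint) + w, ys, u_joint
```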
A control policy for agent $i$, $\pi^i : \bigcup_{s \in \mathbb{N}} \mathbb{R}^{s d_{y_i}} \times \mathbb{R}^{(s-1) d_u} \to \mathbb{R}^{d_{u_i}}$, is a mapping from all available information to a control $u^i_t = \pi^i(y_{1:t}, u_{1:t-1})$. A joint policy for the multi-agent setting is a collection of policies for each agent, $\pi = \pi^{1:k}$. Policy classes are defined analogously, with $\Pi^i$ being a set of agent-$i$ policies and $\Pi$ being a set of joint policies. We often consider policies parameterized by vectors $\theta^i \in \Theta^i$, where $\Theta^i$ is some bounded convex domain, and use the parameterizations and policies interchangeably. For time horizon $T$, the performance of a joint policy $\pi$ is measured by the average cost along the trajectory $(x^\pi_1, u^\pi_1, \ldots)$,
$$J(\pi \mid w_{1:T}, e_{1:T}) = \frac{1}{T}\sum_{t=1}^{T} c_t(x^\pi_t, u^\pi_t) .$$
For a set of online control algorithms for each agent, $\mathcal{C} = \mathcal{C}^{1:k}$, we are interested in the average (policy) regret relative to some joint policy class $\Pi$, defined as
$$R_T(\mathcal{C}) = J(\mathcal{C} \mid w_{1:T}, e_{1:T}) - \inf_{\pi \in \Pi} J(\pi \mid w_{1:T}, e_{1:T}) .$$

2.3 Assumptions

Converging towards optimal play for an online control algorithm may not happen immediately, so our system cannot be allowed to explode too quickly. In order to bound the cost during this stage, $x_t$ must not explode in the absence of control. The corresponding outputs in the absence of any controller are a concept called Nature's y's in (Simchowitz et al., 2020), which are an integral part of the DRC policies defined below. We analogously define Nature's x's here.

Definition 2.1 (Nature's x's and y's). Given a sequence of disturbances $(w_t, e_t)_{t \ge 1}$ we define Nature's x's and Nature's y's respectively as the sequences $x^{\mathrm{nat}}_t, y^{\mathrm{nat},i}_t$ where
$$y^{\mathrm{nat},i}_t = g^i(x^{\mathrm{nat}}_t) + e^i_t, \qquad x^{\mathrm{nat}}_{t+1} = f_t(x^{\mathrm{nat}}_t, 0) + w_t .$$

Assumption 2.2 (Bounded Nature's x's). We assume that $w_t$ and $e_t$ are chosen by an oblivious adversary, and that $\|x^{\mathrm{nat}}_t\| \le R_{\mathrm{nat}}$ for all $t$.

We also need well-behaved cost functions for learning:

Assumption 2.3. The costs $c_t(x, u)$ are convex, and if $\|x\|, \|u\| \le D$ then there exists a constant $C$ such that $0 \le c_t(x, u) \le C D^2$.

In control, the most ubiquitous cost functions are quadratics, which satisfy this assumption. Finally, we must restrict our joint policy class $\Pi$ in order to obtain global regret guarantees. We make the following assumption about each agent's policy class:

Assumption 2.4 (Decoupling). For any policy $\pi^i \in \Pi^i$, the policy can always be represented as a function of the disturbances $e^i_t, w_t$. In particular, there exists a function $\tilde{\pi}^i : \bigcup_{s \in \mathbb{N}} \mathbb{R}^{(s-1) d_x} \times \mathbb{R}^{s d_{y_i}} \to \mathbb{R}^{d_{u_i}}$ such that for any set of controls $u_{1:t-1}$ and the resultant set of observations $y_{1:t}$,
$$\pi^i(y_{1:t}, u_{1:t-1}) = \tilde{\pi}^i(w_{1:t-1}, e^i_{1:t}) .$$

Note that if Assumption 2.4 holds for each agent's policy class, then the joint policy class $\Pi$ depends implicitly only on $w_{1:t-1}, e_{1:t}$. Policies that satisfy this assumption are special in that the controls played by one agent's policy are completely independent of another agent's policy, given that the disturbances are chosen obliviously.

2.4 Policy Classes

We now introduce a few important policy classes that satisfy Assumption 2.4.

Disturbance-Action Control. For fully observed linear systems, disturbances $w_t$ can be derived from states and joint controls. This allows us to use Disturbance-Action Control, a policy parameterization that is linear in the disturbance history.
Definition 2.5. A Disturbance-Action Controller (DAC) (Agarwal et al., 2019), $\pi(M)$, is specified by parameters $M \in \mathbb{R}^{d_{u_i} \times m d_x}$, where the control is $u^i_t = M w_{t-m:t-1}$.

A comparator policy class $\Pi^i$ can be defined as a bounded convex set $\mathcal{M} \subseteq \mathbb{R}^{d_{u_i} \times m d_x}$ of policy matrices. It has been shown that DACs can approximate stable Linear Dynamic Controllers (LDCs), a powerful generalization of linear controllers that encompasses $H_2$ and $H_\infty$ optimal controllers under partial observation and is a mainstay in control applications (see Definition A.1 in Appendix A.2). Furthermore, when costs are quadratic, it can be shown that a DAC also approximates the offline optimal policy (Goel & Hassibi, 2021).

Disturbance-Response Control. Another policy class that suits our needs, especially if the system is partially observed, is that of Disturbance-Response Control (DRC). Unlike in the fully observed setting, with partial observations it is generally impossible for an agent to calculate $w_t$ for use in a policy. Instead, DRC controllers are a linear function of a window of previous Nature's y's, defined as follows:

Definition 2.6. A Disturbance-Response Controller (DRC) (Simchowitz et al., 2020), $\pi(M)$, is specified by parameters $M \in \mathbb{R}^{d_{u_i} \times m d_y}$, where the control is $u^i_t = M y^{\mathrm{nat}}_{t-m:t-1}$.

Like DAC, DRC is a powerful control law which can also approximate stable LDCs. Furthermore, Nature's y's are just a function of the w's and e's, so DRC policy classes satisfy Assumption 2.4.

Neural Network Policies. We also consider neural network policies. In particular, fully-connected neural networks with ReLU activations that act on normalized windows of w's or Nature's y's were explored in (Chen et al., 2021). While there are more restrictions for this class, neural networks are a richer policy class than their linear counterparts DAC and DRC.

3 Multi-Agent Online Convex Optimization With and Without Memory

Our derivation for the multi-agent control problem hinges upon a generalization of OCO to the multi-agent setting. We now describe this generalization, explain the challenge, and give a reduction from vanilla OCO, as well as OCO with Memory (OCO-M), to the multi-agent settings.

In multiplayer OCO we have $k$ players, and the $i$th learner iteratively chooses a vector from a convex set $\mathcal{K}_i \subseteq \mathbb{R}^d$. We denote the total number of rounds by $T$. In each round, the $i$th learner commits to a decision $x^i_t \in \mathcal{K}_i$. We define the joint decision $x_t = x^{1:k}_t \in \mathcal{K} = \mathcal{K}_{1:k}$. Afterwards, the learners incur a loss $\ell_t(x_t)$, where $\ell_t : \mathcal{K} \to \mathbb{R}$ is a convex loss function. The $i$th learner observes $\ell^i_t : \mathcal{K}_i \to \mathbb{R}$ defined by $\ell^i_t(x) = \ell_t(x, x^{-i}_t)$, where we use the notation $(x, x^{-i}_t)$ to represent $(x^1_t, \ldots, x^{i-1}_t, x, x^{i+1}_t, \ldots, x^k_t)$. Regret is defined to be the total loss incurred by the algorithm with respect to the loss of the best fixed single prediction found in hindsight for each player. Formally, the average multi-agent regret of a set of learning algorithms $\mathcal{A}_{1:k} = \mathcal{A}^1, \ldots, \mathcal{A}^k$ is defined as
$$R_T(\mathcal{A}_{1:k}) \stackrel{\text{def}}{=} \sup_{\ell_1, \ldots, \ell_T} \left[ \frac{1}{T}\sum_{t=1}^{T} \ell_t(x_t) - \min_{x^\star \in \mathcal{K}} \frac{1}{T}\sum_{t=1}^{T} \ell_t(x^\star) \right] .$$
Our goal is to create a black-box mechanism that takes as input regret minimizing strategies and allows for joint regret minimization with minimal or no central coordination.

3.1 Challenges of multi-agent OCO

A natural first attempt for joint regret minimization in multi-agent OCO is to allow each regret minimizing agent to operate on its own on the given loss function. In this subsection we show that this naive strategy fails. As an example, consider the following simple scalar two-player game.
Let $\ell_t(x^1, x^2) = (x^1 - x^2)^2 + 0.1\,\|(x^1, x^2)\|_2^2$ for all $t$. Now consider $\mathcal{A}^1$ and $\mathcal{A}^2$ to be algorithms that each alternate between playing $1$ on odd $t$ and $-1$ on even $t$. In this setup, the online learners see the losses
$$\ell^1_t(z) = \ell^2_t(z) = \begin{cases} (z-1)^2 + 0.1 z^2 + 0.1 & t \text{ is odd} \\ (z+1)^2 + 0.1 z^2 + 0.1 & t \text{ is even} \end{cases}$$
By design of $\mathcal{A}^1$ and $\mathcal{A}^2$, the predictions alternate between $(1, 1)$ and $(-1, -1)$ and $\ell^1_t(x^1_t) = \ell^2_t(x^2_t) = 0.2$. In contrast, the best fixed decision in hindsight is $0$ for both players, suffering loss $\ell^1_t(0) = \ell^2_t(0) = 1.1$ for all $t$, so each player has negative regret. However, for the multiplayer game, the best loss in hindsight is $0$, achieved by playing $x^1_t = x^2_t = 0$ for all $t$, so the (unnormalized) multi-agent regret is $0.2\,T$. We note that this phenomenon is well known in the game theory literature, where the failure of no-regret algorithms to converge to Nash equilibrium has been studied (Blum et al., 2008). Our study extends this observation to the dynamic games setting.

3.2 Reduction from OCO to multi-agent OCO

To overcome the lower bound of the previous section, we apply the reduction from multi-agent to single-agent OCO via an instantaneous linearization. This simple change makes all the difference and allows for provable sublinear multi-agent regret, as we now prove.

Algorithm 1 Multiplayer OCO
Input: OCO algorithm $\mathcal{A}^i$, convex domain $\mathcal{K}_i \subseteq \mathbb{R}^d$
for $t = 1$ to $T$ do
  Learner $\mathcal{A}^i$ predicts $x^i_t$
  Calculate $g^i_t = \nabla \ell^i_t(x^i_t)$
  Feed $\mathcal{A}^i$ the linear loss function $\langle g^i_t, \cdot \rangle$
end for

Theorem 3.1. Suppose for each $i \in [k]$, OCO algorithm $\mathcal{A}^i$ has regret $R_T(\mathcal{A}^i)$ over $\mathcal{K}_i \subseteq \mathbb{R}^d$ with the linear loss functions provided by Alg. 1. Then the multi-agent regret of Alg. 1 is upper bounded by $R_T(\mathcal{A}_{1:k}) \le \sum_{i=1}^{k} R_T(\mathcal{A}^i)$.

Proof. Let $x^\star \in \arg\min_{x \in \mathcal{K}} \sum_{t=1}^T \ell_t(x)$, with $x^{\star i}$ the component corresponding to player $i$. By the convexity of $\ell_t$,
$$R_T(\mathcal{A}_{1:k}) = \frac{1}{T}\sum_{t=1}^{T} \big(\ell_t(x_t) - \ell_t(x^\star)\big) \le \frac{1}{T}\sum_{t=1}^{T} \nabla \ell_t(x_t)^\top (x_t - x^\star) .$$
Now we note that we can decompose $\nabla \ell_t$ by player, so
$$\nabla \ell_t(x_t) = \big[ \nabla \ell^1_t(x^1_t)^\top, \ldots, \nabla \ell^k_t(x^k_t)^\top \big]^\top, \qquad \nabla \ell_t(x_t)^\top (x_t - x^\star) = \sum_{i=1}^{k} \nabla \ell^i_t(x^i_t)^\top (x^i_t - x^{\star i}) .$$
Finally, $R_T(\mathcal{A}^i) \ge \frac{1}{T}\sum_{t=1}^{T} \nabla \ell^i_t(x^i_t)^\top (x^i_t - x^{\star i})$, since $\mathcal{A}^i$ is run on exactly these linear losses, and the result follows.

3.3 Multiplayer OCO with Memory

The control setting which motivates our whole study is inherently state-based. Therefore we need to extend multi-agent OCO to allow for memory, as in (Anava et al., 2015). This is similar to the previous section, but the loss functions now have memory of the $h$ previous decisions. After each of the $k$ online learners commits to its decision, forming the joint decision $x_t$ as defined in the previous section, the learners incur a loss $\ell_t(x_{t-h:t})$, where $\ell_t : \mathcal{K}^{h+1} \to \mathbb{R}$ is a convex loss function. The $i$th learner observes $\ell^i_t : \mathcal{K}_i^{h+1} \to \mathbb{R}$ defined by $\ell^i_t(z) = \ell_t(z, x^{-i}_{t-h:t})$. Regret is defined to be the total loss incurred by the algorithm with respect to the loss of the best fixed single prediction found in hindsight for each player. Formally, the average regret of a set of learning algorithms $\mathcal{A}_{1:k}$ is
$$R_T(\mathcal{A}_{1:k}) \stackrel{\text{def}}{=} \sup_{\ell_1, \ldots, \ell_T} \left[ \frac{1}{T}\sum_{t=1}^{T} \ell_t(x_{t-h:t}) - \min_{x^\star \in \mathcal{K}} \frac{1}{T}\sum_{t=1}^{T} \tilde{\ell}_t(x^\star) \right], \quad \text{where } \tilde{\ell}_t(x) = \ell_t(x, \ldots, x) .$$

Algorithm 2 Multiplayer OCO with Memory
Input: OCO-M algorithm $\mathcal{A}^i$, convex domain $\mathcal{K}_i \subseteq \mathbb{R}^d$, and memory length $h$
for $t = 1$ to $T$ do
  Learner $\mathcal{A}^i$ predicts $x^i_t$
  Calculate $g^i_t = \nabla \ell^i_t(x^i_{t-h:t})$
  Feed $\mathcal{A}^i$ the linear loss function $\langle g^i_t, \cdot \rangle$
end for

The main result of this section is given by the following theorem, whose proof follows the memoryless setting of the previous subsection.

Theorem 3.2. Suppose for each $i \in [k]$, OCO-M algorithm $\mathcal{A}^i$ has regret $R_T(\mathcal{A}^i)$ over $\mathcal{K}_i \subseteq \mathbb{R}^d$ with the linear loss functions provided by Alg. 2. Then the multi-agent regret of Alg. 2 is upper bounded by $\sum_{i=1}^{k} R_T(\mathcal{A}^i)$. See Appendix A for the proof.
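The reduction above is simple enough to sketch in a few lines. The following is a minimal illustration of Algorithm 1 (the memoryless case), using projected online gradient descent as each $\mathcal{A}^i$; the function names and the shared `grad_loss` callback are assumptions of the sketch, not part of the paper's protocol.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto an l2 ball, standing in for the convex set K_i.
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def multiplayer_oco(grad_loss, k, d, T, lr=0.05):
    """Algorithm 1 sketch: each agent runs online gradient descent on the
    linearized loss g_t^i = gradient of l_t with respect to its own block.

    grad_loss(t, x_joint) returns the full gradient of l_t at the joint
    decision; agent i only ever reads its own d-dimensional block of it,
    which equals the gradient of its induced loss l_t^i at x_t^i.
    """
    xs = [np.zeros(d) for _ in range(k)]        # current iterates x_t^i
    for t in range(T):
        x_joint = np.concatenate(xs)            # all agents commit first
        g = grad_loss(t, x_joint)
        for i in range(k):
            g_i = g[i * d:(i + 1) * d]          # linearized local loss
            xs[i] = project_ball(xs[i] - lr * g_i)   # OGD step of A^i
    return xs
```

On the two-player example of Section 3.1, gradient-descent agents run this way drift toward the joint minimizer at the origin rather than cycling, consistent with the multi-agent regret bound of Theorem 3.1.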
4 The Meta-Algorithm and its Analysis

This section contains our main result: a meta-algorithm for multi-agent nonstochastic control. We start by defining the precise requirements on the single-agent controllers that are given as input, along with the assumptions on them. We then describe the meta-algorithm and its main performance guarantee on the multi-agent regret, and conclude with a few special cases of interest.

4.1 Assumptions and definitions

The single-agent control algorithms that are the basis for the reduction need to satisfy the following requirements:

1. Each agent should be able to evaluate its policy given the controls of the other agents.
2. The joint cost function over all agents is jointly convex.
3. Agent policies need to be decoupled in the sense of Assumption 2.4.

We now specify these requirements more formally. We introduce two related notions of an $(\varepsilon, h)$-policy evaluation oracle (PEO) for a multi-agent control problem. A joint PEO allows us to counterfactually evaluate a complete multi-agent policy, while a local PEO allows us to counterfactually evaluate the policy of a single agent, fixing the remainder of the controls. Here $\varepsilon$ is the accuracy required of this oracle, while $h$ is something like a mixing time, such that the cost of the $t$th state can be well approximated regardless of the history before $t - h$.

Given the available control information $u_{1:t-1}$, observations $y^i_{1:t}$, and cost function $c_t$, a loss function $\ell_t$ is a joint $(\varepsilon, h)$-PEO if $\ell_t(\theta^{1:k}_{1:h})$ provides an $\varepsilon$-approximate estimate of $c_t(\tilde{x}_t, \tilde{u}_t)$, where $\tilde{x}_t$ is the counterfactual state had $u_{t-r+1}$ been played according to the joint policy $\pi_{\theta^{1:k}_r}$. Likewise, $\ell^i_t$ is the local PEO of the $i$th agent if $\ell^i_t(\theta_{1:h})$ provides an $\varepsilon$-approximate estimate of $c_t(\tilde{x}_t, \tilde{u}_t)$, where $\tilde{u}_t$ and $\tilde{x}_t$ are the counterfactual control and state had $u^i_{t-r+1}$ been played according to policy $\pi_{\theta_r}$ while all other controls $u^{-i}_{1:t-1}$ remain the same. These oracles are defined formally in Definition A.2 in Appendix A.3.

Remark 4.1. In order for a local agent to be able to provide an accurate estimate of the cost using only partial observations and controls, intuitively the cost may need to be restricted in its dependence on the fully observed state. For example, if all agents' partial observations contain certain features of the state (e.g. the x-y-z position of the center of gravity of a drone, but no other velocity or angular features), the cost can depend on these features and an agent PEO may exist.

A key observation is that, for policies satisfying Assumption 2.4, if $u^{-i}_{t-h+1:t}$ are generated by policies $\theta^{-i}_{t-h+1:t}$ respectively, then a local PEO and a joint PEO can be related via $\ell^i_t = \ell_t(\cdot, \theta^{-i}_{t-h+1:t})$. Because all controls are determined directly from the disturbances, changes in agent $i$'s controls have no impact on the controls of other agents. We now assume that beyond an initial burn-in time $T_b$ we can construct such an $(\varepsilon, h)$-local PEO from information available to the controller. We also must guarantee that the PEOs we use are convex in order to use OCO analysis.

Assumption 4.2 (Information). For $t \ge T_b > h$, for each agent $i$, there exists a functional $L^i$ such that $\ell^i_t = L^i(c_t, u_{1:t}, y^i_{1:t})$ is an $(\varepsilon, h)$-local PEO for policy class $\Theta^i$.

Assumption 4.3 (Convexity). There exists a convex $(\varepsilon, h)$-joint PEO $\ell_t$ such that $\ell^i_t = L^i(c_t, u_{1:t}, y^i_{1:t})$ satisfies $\ell^i_t = \ell_t(\cdot, \theta^{-i}_{t-h+1:t})$.
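The relation $\ell^i_t = \ell_t(\cdot, \theta^{-i}_{t-h+1:t})$ is the workhorse of the analysis and is easy to mirror in code. The sketch below (with a hypothetical `joint_peo` interface) builds a local oracle from a joint one by freezing the other agents' recent policy parameters; under Assumption 2.4 those parameters fully determine the other agents' controls.

```python
def make_local_peo(joint_peo, frozen_params, agent_idx):
    """Derive agent agent_idx's local PEO from a joint (eps, h)-PEO.

    joint_peo(window) expects a length-h list of dicts mapping agent
    index -> policy parameters; frozen_params is the list of parameter
    dicts that actually generated the other agents' last h controls.
    """
    def local_peo(theta_window):            # agent's own parameters, length h
        merged_window = []
        for own, frozen in zip(theta_window, frozen_params):
            merged = dict(frozen)           # other agents' parameters, fixed
            merged[agent_idx] = own         # counterfactual local parameters
            merged_window.append(merged)
        return joint_peo(merged_window)
    return local_peo
```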
4.2 Multi-Agent Control Meta-Algorithm and Analysis

Our reduction is described in Algorithm 3. It accepts as input OCO-M algorithms $\mathcal{A}^i$ with the requirements specified before, and applies them sequentially using the linearization technique of the previous section.

Algorithm 3 Multi-Agent Nonstochastic Control
1: Input: horizon length $T$, burn-in time $T_b$, policy evaluation horizon $h$, and for each $i \in [k]$, an OCO-M algorithm $\mathcal{A}^i$ over policy parameterization class $\Theta^i$ and a policy evaluation functional $L^i$.
2: Initialize $\theta^i_0 = \cdots = \theta^i_{T_b}$ as a policy that always plays control $0$.
3: for $t = 1$ to $T$ do
4:   for $i = 1$ to $k$ do
5:     Observe $y^i_t$
6:     Play control $u^i_t = \pi_{\theta^i_t}(u_{1:t-1}, y^i_{1:t})$
7:     Observe controls from other agents to form the joint control $u_t = u^{1:k}_t$
8:     Observe cost $c_t$
9:     if $t > T_b$ then
10:      Construct the policy evaluation oracle $\ell^i_t = L^i(c_t, u_{1:t}, y^i_{1:t})$
11:      Compute $g^i_t = \nabla \ell^i_t(\theta^i_{t-h:t})$
12:      Feed $\mathcal{A}^i$ the linear loss function $\langle g^i_t, \cdot \rangle$, and receive policy parameterization $\theta^i_{t+1}$
13:    end if
14:  end for
15: end for

The inner loop of the algorithm starting at Line 5 is what is performed by a local agent. The information available in this scope is the local observation and the controls of the other agents. We note that Alg. 3 only uses $\ell^i_t$ instead of $\ell_t$, which is used only for analysis. This is fundamental, as each agent is unaware of the policies of the other agents. The performance guarantee for this meta-algorithm is given by:

Theorem 4.4. Suppose Assumptions 2.2, 2.3, and 2.4 hold, and Assumptions 4.2 and 4.3 hold with burn-in $T_b$, horizon $h$, and error $\varepsilon$. If the $\mathcal{A}^i$ are OCO-M algorithms with regret $R_T(\mathcal{A}^i)$ on the linear loss functions provided by Algorithm 3 over policy parameterization class $\Theta^i$, then the multi-agent regret of Algorithm 3 is bounded by
$$R_T(\mathcal{C}) \le \sum_{i=1}^{k} R_T(\mathcal{A}^i) + \frac{T_b\, C R_{\mathrm{nat}}^2}{T} + 2\varepsilon .$$

Proof. In order to analyze the regret we begin with the following regret decomposition. We first let $\theta^\star = \arg\min_{\theta \in \Theta} \sum_{t=1}^{T} c_t(x^{\pi_\theta}_t, u^{\pi_\theta}_t)$ and let $\pi^\star = \pi_{\theta^\star}$ denote the offline optimal joint policy from our comparator policy class. Then
$$R_T(\mathcal{C}) = \frac{1}{T}\sum_{t=1}^{T} c_t(x^{\mathcal{C}}_t, u^{\mathcal{C}}_t) - \frac{1}{T}\sum_{t=1}^{T} c_t(x^{\pi^\star}_t, u^{\pi^\star}_t) \qquad (2)$$
$$= \underbrace{\frac{1}{T}\sum_{t=1}^{T_b} c_t(x^{\mathrm{nat}}_t, 0) - \frac{1}{T}\sum_{t=1}^{T_b} c_t(x^{\pi_{\theta^\star}}_t, u^{\pi_{\theta^\star}}_t)}_{\text{burn-in cost}} + \underbrace{\frac{1}{T}\sum_{t=T_b+1}^{T} c_t(x^{\mathcal{C}}_t, u^{\mathcal{C}}_t) - \frac{1}{T}\sum_{t=T_b+1}^{T} \ell_t(\theta_{t-h:t})}_{\text{algorithm truncation error}}$$
$$+ \underbrace{\frac{1}{T}\sum_{t=T_b+1}^{T} \ell_t(\theta_{t-h:t}) - \frac{1}{T}\sum_{t=T_b+1}^{T} \ell_t(\theta^\star, \ldots, \theta^\star)}_{\text{multi-agent } \ell \text{ policy regret}} + \underbrace{\frac{1}{T}\sum_{t=T_b+1}^{T} \ell_t(\theta^\star, \ldots, \theta^\star) - \frac{1}{T}\sum_{t=T_b+1}^{T} c_t(x^{\pi_{\theta^\star}}_t, u^{\pi_{\theta^\star}}_t)}_{\text{comparator truncation error}}$$

The first term of the decomposition is exactly the cost of the first $T_b$ iterations, because Alg. 3 starts with $T_b$ zero controls, so by definition the $x^{\mathcal{C}}_t$ are Nature's x's. By Assumption 2.3, losses are nonnegative, so $\sum_{t=1}^{T_b} c_t(x^{\pi_{\theta^\star}}_t, u^{\pi_{\theta^\star}}_t) \ge 0$ and we may drop these terms. By combining Assumptions 2.3 and 2.2 we bound the burn-in cost as
$$\frac{1}{T}\sum_{t=1}^{T_b} c_t(x^{\mathrm{nat}}_t, 0) \le \frac{C}{T} \sum_{t=1}^{T_b} \|x^{\mathrm{nat}}_t\|^2 \le \frac{C T_b R_{\mathrm{nat}}^2}{T} .$$
The comparator truncation error is bounded by $\varepsilon$ because, for $t > T_b$, $|\ell_t(\theta^\star, \ldots, \theta^\star) - c_t(x^{\pi_{\theta^\star}}_t, u^{\pi_{\theta^\star}}_t)| \le \varepsilon$ by the definition of an $(\varepsilon, h)$ joint PEO. Similarly, since $\mathcal{C}$ produces $x^{\mathcal{C}}_t$ with controls $u_s$ generated by the joint policies $\theta_s$ for the preceding $h$ controls, the algorithm truncation error is also bounded by $\varepsilon$.

To bound the multi-agent $\ell_t$ policy regret term, we note that Algorithm 3 chooses $\theta_t$ exactly by the mechanism of Algorithm 2 with loss functions $\ell^i_t$. By Assumptions 4.2 and 4.3, $\ell^i_t = \ell_t(\cdot, \theta^{-i}_{t-h+1:t})$ and $\ell_t$ is convex, as desired. As such, we bound this term using Theorem 3.2 as
$$\sum_{t=T_b+1}^{T} \ell_t(\theta_{t-h:t}) - \sum_{t=T_b+1}^{T} \ell_t(\theta^\star, \ldots, \theta^\star) \le T \sum_{i=1}^{k} R_T(\mathcal{A}^i) .$$
The result follows by combining the above bounds.
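The control flow of Algorithm 3, as seen by a single agent, is summarized by the sketch below. All names are illustrative: `policy`, `peo_grad`, and `oco` stand in for $\pi_\theta$, the gradient of $L^i(c_t, u_{1:t}, y^i_{1:t})$ at $\theta^i_{t-h:t}$, and any OCO-with-memory learner.

```python
class LocalAgent:
    """One agent's view of Algorithm 3 (a sketch, not the paper's implementation).

    policy(theta, u_hist, y_hist)           -> control u_t^i   (Line 6)
    peo_grad(cost, u_hist, y_hist, thetas)  -> gradient of the local PEO
                                               at theta_{t-h:t} (Lines 10-11)
    oco.step(g)                             -> theta_{t+1}^i   (Line 12)
    """
    def __init__(self, policy, peo_grad, oco, theta0, h, burn_in):
        self.policy, self.peo_grad, self.oco = policy, peo_grad, oco
        self.thetas = [theta0]
        self.h, self.burn_in = h, burn_in

    def act(self, u_hist, y_hist):
        # Play u_t^i from the current parameters and public information.
        return self.policy(self.thetas[-1], u_hist, y_hist)

    def update(self, t, cost, u_hist, y_hist):
        # During the burn-in the parameters are frozen (zero-control policy).
        if t <= self.burn_in:
            self.thetas.append(self.thetas[-1])
            return
        window = self.thetas[-(self.h + 1):]        # theta_{t-h:t}
        g = self.peo_grad(cost, u_hist, y_hist, window)
        self.thetas.append(self.oco.step(g))        # linear loss fed to A^i
```

Note that the update uses only public controls and the agent's own observations, matching the information model of Figure 1.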
4.3 Applications to Linear Dynamical Systems

Algorithm 3 leaves most of the technical challenge to the policy evaluation oracle. It is not immediately obvious that such a function exists and is efficient to compute. Fortunately, recent work from nonstochastic control provides exactly what we need for linear dynamical systems with a variety of policy classes. For simplicity of presentation, we consider only stable systems. However, we note that in settings with a single shared observation $y_t$, these algorithms can be extended to work for unstable systems. In order to do this, all agents must be aware of the baseline stabilizing controllers employed by each agent, and Algorithm 3 can then be applied to the stable closed-loop system including the stabilizing component. For a thorough treatment of this approach, we refer the reader to Appendix G of (Simchowitz et al., 2020). We assume strong stability, which guarantees that the spectral radius of $A$ is strictly less than 1. Strong stability ensures that disturbances and controls from the far history have a negligible impact on the current state. This geometric decay enables the construction of PEOs with horizon $h$ just logarithmic in $T$ while providing $\varepsilon = \frac{1}{T}$ approximation error.

Fully Observed LDS with Disturbance-Action Control. We consider the case of a fully observed LDS defined as follows:
$$y_{t+1} = x_{t+1} = f_t(x_t, u_t) + w_t = A x_t + B u_t + w_t ,$$
with bounded adversarial disturbances $w_t$, known strongly stable dynamics, and well-behaved convex costs $c_t$ satisfying Assumption 2.3. (In general, we may need further assumptions on the costs and on the size of our comparator policy classes, since $g^i_t$ in Line 11 of Algorithm 3 depends on these quantities, and hence so does $R_T(\mathcal{A}^i)$.)

We now show that a linear system with a DAC policy class $\mathcal{M}$ can satisfy the assumptions required for Theorem 4.4. We already saw that Assumption 2.4 holds for DAC policies. Next we show that Assumption 2.2 holds. By fully unrolling the dynamics, Nature's x's can be written as a linear combination of disturbances, $x^{\mathrm{nat}}_t = \sum_{s=0}^{t-1} A^s w_{t-s}$. Because the $w_t$ are bounded and $A$ is strongly stable, $x^{\mathrm{nat}}_t$ can be bounded by a convergent geometric series.

The last piece is constructing a PEO for a DAC policy. Unrolling the dynamics, we express states in terms of controls:
$$x_t \approx x^{\mathrm{nat}}_t + [B, AB, \ldots, A^{h-1}B]\, u_{t-1:t-h} = x^{\mathrm{nat}}_t + \hat{G} u_{t-1:t-h} .$$
The truncation follows from strong stability of the dynamics. Decomposing $\hat{G}$, our Markov operator, by agent, we can write $x_t \approx x^{\mathrm{nat}}_t + \sum_{i=1}^{k} \hat{G}^i u^i_{t-1:t-h}$. Now, to compute $\ell^i_t(M_{1:h}) = c_t(\tilde{x}(M_{1:h}), \tilde{u}(M_{1:h}))$, where agent $i$'s counterfactual controls are generated by the DACs $M_{1:h}$, we hold the other agents' controls $u^{-i}_s$ fixed while changing the policies for agent $i$. Because the counterfactual state $\tilde{x}(M_{1:h})$ is affine in $u^i_{t-1:t-h}$ and, by the DAC policy parameterization, agent $i$'s controls are linear in the parameters, the counterfactual state can be approximated as linear in $M_{1:h}$. Composing with a convex cost $c_t$ gives a convex PEO $\ell^i_t$. A complete analysis in the single-agent setting can be found in (Agarwal et al., 2019).

In order to compute $\ell^i_t(M_{1:h})$, the disturbances $w_t$ are needed. Because agent $i$ has access to the joint control $u_t$, disturbances can be computed as $w_t = x_{t+1} - A x_t - B u_t$. Once $T_b = m + h$ disturbances are observed, $\ell^i_t$ can be computed. Similar ideas can be applied using Nature's y's and DRC to handle partial observations. Furthermore, Theorem 4.4 can be applied with richer neural network policy classes as well. For a more complete discussion, see Appendix B.
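The DAC construction above is mechanical enough to sketch. In the snippet below (a simplified illustration, not the paper's exact oracle), `x_base` lumps together Nature's x and the fixed Markov-operator contribution of the other agents, `w_hists[r]` is the disturbance window available at the r-th step of the evaluation window, and the cost is assumed to depend on the state and agent $i$'s current control.

```python
import numpy as np

def dac_control(M, w_hist, m):
    # DAC: u^i = M @ [w_{s-m}; ...; w_{s-1}], a stacked disturbance window.
    return M @ np.concatenate(w_hist[-m:])

def recover_disturbance(x_next, x, u, A, B):
    # With known (A, B) and the observed joint control: w_t = x_{t+1} - A x_t - B u_t.
    return x_next - A @ x - B @ u

def local_dac_peo(ct, A, B_i, x_base, w_hists, m):
    """Sketch of agent i's local PEO for a fully observed LDS with DAC."""
    h = len(w_hists)

    def ell(M_window):                      # M_window = (M_1, ..., M_h), oldest first
        # Counterfactual controls of agent i over the evaluation window.
        u_cf = [dac_control(M, wh, m) for M, wh in zip(M_window, w_hists)]
        # The control played s steps before time t enters the state via A^{s-1} B_i.
        x = x_base.copy()
        for s in range(1, h):
            x = x + np.linalg.matrix_power(A, s - 1) @ B_i @ u_cf[h - 1 - s]
        return ct(x, u_cf[-1])              # the cost uses the current control
    return ell
```

Since each `dac_control` is linear in its parameter matrix and the state accumulates these terms affinely, `ell` is convex in the window whenever `ct` is convex, which is exactly the convexity property Assumption 4.3 asks for.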
5 Experiments

5.1 Robustness to System Failures

Our algorithm's guarantee can also be extended to handle situations where certain agents are no longer locally regret minimizing (e.g. an actuator breaks). Because Algorithm 3 is agnostic to the policy played by other players, if we want to bound the regret for a subset $I \subseteq [k]$ of controllers, we can simply view the policy class of the remaining agents as singleton open-loop policies that always play the same controls, and Theorem 4.4 then gives a regret guarantee with respect to just the agents in $I$, with the remaining controls obliviously fixed.

Such robust guarantees are not achieved with existing approaches in nonstochastic control. Typically, as in the Gradient Perturbation Controller (GPC) (Agarwal et al., 2019), a PEO $\ell_t(\theta_{1:h})$ is passed to an online learning algorithm. If the system breaks down and certain controls are not played according to the chosen policy, there will be a mismatch between the true cost and the PEO. In contrast, because $\ell^i_t$ uses the actual controls played by the algorithm, for an agent $i$ that is not experiencing failure the oracle accuracy remains unchanged and provides the correct gradient signal. As such, Algorithm 3 may be useful for providing robustness even in a single-agent system with some failures.

Figure 2: Cost of GPC, Multi-Agent GPC, $H_2$, and $H_\infty$ control before and after the 4th controller becomes inactive, in the presence of Gaussian disturbances, disturbances generated by a random walk, and sinusoidal disturbances.

5.2 ADMIRE experiments

In Fig. 2 we run GPC, LQR control, and a linear $H_\infty$ control policy, along with Multi-Agent GPC (MAGPC), which is Algorithm 3 using gradient descent as the online learners $\mathcal{A}^i$. These are used to stabilize the ADMIRE overactuated aircraft model (Härkegård & Glad, 2005) in the presence of 3 different disturbance profiles: Gaussian noise, a random walk, and a sinusoidal disturbance pattern. We run the algorithms on the complete system, and with the 4th controller forced inactive by zeroing out its control, to demonstrate the robustness of our approach. Figure 2 demonstrates the GPC failure described in Section 5.1. In the random walk experiment GPC explodes, but MAGPC stabilizes the system. See Appendix C for full details.

We note that, when a control failure does not occur and learning rates are sufficiently small, GPC and MAGPC are essentially the same algorithm. This is because the policy parameterizations move slowly, so $\ell^i_t(\theta_{t-h:t}) \approx \ell^i_t(\theta_t, \ldots, \theta_t)$, which is GPC's proxy loss.

6 Discussion and Conclusions

Conclusions. We have described a new approach for regret minimization in multi-agent control that is based on a newly proposed performance metric: multi-agent regret. This measures the difference in cost between that of a distributed control system and that of the best joint policy from a reference class. We give a reduction that takes any regret minimizing controller and converts it into a regret minimizing multi-agent distributed controller.

Future directions. The most interesting question is whether similar metrics of performance and ideas can be applied to general multi-agent reinforcement learning (MARL), since here we exploit full knowledge of the dynamics. It will be interesting to see if our information model can be relaxed and still allow sublinear multi-agent regret.
It also remains to explore the extent of robustness obtained from a multi-agent regret minimizing algorithm. Abbasi-Yadkori, Y. and Szepesv ari, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1 26, 2011. Agarwal, N., Bullins, B., Hazan, E., Kakade, S., and Singh, K. Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111 119. PMLR, 2019. Anava, O., Hazan, E., and Mannor, S. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pp. 784 792, 2015. Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. Regret minimization and the price of total anarchy. pp. 373 382, 05 2008. doi: 10.1145/1374376.1374430. Bullo, F., Cortes, J., and Martinez, S. Distributed control of robotic networks. Princeton University Press, 2009. Cao, Y., Yu, W., Ren, W., and Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427 438, 2013. doi: 10.1109/TII.2012.2219061. Carli, R., Chiuso, A., Schenato, L., and Zampieri, S. Distributed kalman filtering based on consensus strategies. IEEE Journal on Selected Areas in Communications, 26 (4):622 633, 2008. doi: 10.1109/JSAC.2008.080505. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006. Chen, X., Minasyan, E., Lee, J. D., and Hazan, E. Provable regret bounds for deep online learning and control, 2021. Christofides, P. D., Scattolini, R., de la Pena, D. M., and Liu, J. Distributed model predictive control: A tutorial review and future research directions. Computers & Chemical Engineering, 51:21 41, 2013. Das, S. and Moura, J. M. F. Consensus+innovations distributed kalman filter with optimized gains. IEEE Transactions on Signal Processing, 65(2):467 481, 2017. doi: 10.1109/TSP.2016.2617827. Daskalakis, C., Golowich, N., and Zhang, K. The complexity of markov equilibrium in stochastic games. ar Xiv preprint ar Xiv:2204.03991, 2022. Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponentlearning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pp. 122 130, 2018. Ghai, U., Snyder, D., Majumdar, A., and Hazan, E. Generating adversarial disturbances for controller verification. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144 of Proceedings of Machine Learning Research, pp. 1192 1204. PMLR, 07 08 June 2021. Goel, G. and Hassibi, B. Competitive control. ar Xiv preprint ar Xiv:2107.13657, 2021. Gradu, P., Hazan, E., and Minasyan, E. Adaptive regret for control of time-varying dynamics. ar Xiv preprint ar Xiv:2007.04393, 2020. Gradu, P., Hallman, J., Suo, D., Yu, A., Agarwal, N., Ghai, U., Singh, K., Zhang, C., Majumdar, A., and Hazan, E. Deluca a differentiable control library: Environments, methods, and benchmarking. ar Xiv preprint ar Xiv:2102.09968, 2021. H arkeg ard, O. and Glad, S. T. Resolving actuator redundancy optimal control vs. control allocation. Automatica, 41(1):137 144, 2005. Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5): 1127 1150, 2000. Hazan, E. Introduction to online convex optimization. ar Xiv preprint ar Xiv:1909.05207, 2019. Hazan, E. and Singh, K. Tutorial: online and non-stochastic control, July 2021. 
Jin, C., Liu, Q., Wang, Y., and Yu, T. V-learning a simple, efficient, decentralized algorithm for multiagent rl. ar Xiv preprint ar Xiv:2110.14555, 2021. Knorn, S., Chen, Z., and Middleton, R. H. Overview: Collective control of multiagent systems. IEEE Transactions on Control of Network Systems, 3(4):334 347, 2016. doi: 10.1109/TCNS.2015.2468991. Lee, S., Nedi c, A., and Raginsky, M. Coordinate dual averaging for decentralized online optimization with nonseparable global objectives. IEEE Transactions on Control of Network Systems, 5(1):34 44, 2016. Lefebvre, S., Richter, S., and De Carlo, R. Decentralized control of linear interconnected multivariable systems. In 1980 19th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes, pp. 525 529, 1980. doi: 10.1109/CDC.1980.271852. A Regret Minimization Approach to Multi-Agent Control Matei, I. and Baras, J. S. Consensus-based distributed linear filtering. In 49th IEEE Conference on Decision and Control (CDC), pp. 7009 7014, 2010. doi: 10.1109/CDC.2010.5718072. Minasyan, E., Gradu, P., Simchowitz, M., and Hazan, E. Online control of unknown time-varying dynamical systems. Advances in Neural Information Processing Systems, 34, 2021. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Nedi c, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106 (5):953 976, 2018. doi: 10.1109/JPROC.2018.2817461. Olfati-Saber, R. Distributed kalman filter with embedded consensus filters. In Proceedings of the 44th IEEE Conference on Decision and Control, pp. 8179 8184, 2005. doi: 10.1109/CDC.2005.1583486. Olfati-Saber, R., Fax, J. A., and Murray, R. M. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215 233, 2007. doi: 10.1109/JPROC.2006.887293. Oppenheimer, M. W., Doman, D. B., and Bolender, M. A. Control allocation for over-actuated systems. In 2006 14th Mediterranean Conference on Control and Automation, pp. 1 6, 2006. doi: 10.1109/MED.2006.328750. Park, S. and Martins, N. C. Design of distributed lti observers for state omniscience. IEEE Transactions on Automatic Control, 62(2):561 576, 2017. doi: 10.1109/ TAC.2016.2560766. Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. ar Xiv preprint ar Xiv:1703.03864, 2017. Simchowitz, M., Singh, K., and Hazan, E. Improper learning for non-stochastic control. In Conference on Learning Theory, pp. 3320 3436. PMLR, 2020. Sobel, K. and Shapiro, E. Eigenstructure assignment for design of multimode flight control systems. IEEE Control Systems Magazine, 5(2):9 15, 1985. doi: 10.1109/MCS. 1985.1104941. St urz, Y. R., Zhu, E. L., Rosolia, U., Johansson, K. H., and Borrelli, F. Distributed learning model predictive control for linear systems. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 4366 4373. IEEE, 2020. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Sutton, R. S., Mc Allester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057 1063, 2000. Tohidi, S. S., Yildiz, Y., and Kolmanovsky, I. V. 
Adaptive control allocation for over-actuated systems with actuator saturation. IFAC-PapersOnLine, 50:5492-5497, 2017.

Wang, L., Fullmer, D., Liu, F., and Morse, A. S. Distributed control of linear multi-channel systems: Summary of results. In 2020 American Control Conference (ACC), pp. 4576-4581, 2020. doi: 10.23919/ACC45564.2020.9148043.

Wang, S.-H. and Davison, E. On the stabilization of decentralized control systems. IEEE Transactions on Automatic Control, 18(5):473-478, 1973. doi: 10.1109/TAC.1973.1100362.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.

Zhang, K., Yang, Z., and Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321-384, 2021.

A Deferred Technical Material

A.1 Proof of Theorem 3.2

We provide a proof of Theorem 3.2, for the multi-agent regret of an OCO-M algorithm.

Proof. Let $x^\star \in \arg\min_{x \in \mathcal{K}} \sum_{t=1}^{T} \tilde{\ell}_t(x)$, with $x^{\star i}$ the component corresponding to player $i$. By the convexity of $\ell_t$,
$$R_T(\mathcal{A}^1, \ldots, \mathcal{A}^k) = \frac{1}{T}\sum_{t=1}^{T} \big( \ell_t(x_{t-h:t}) - \ell_t(x^\star, \ldots, x^\star) \big) \le \frac{1}{T}\sum_{t=1}^{T} \nabla \ell_t(x_{t-h:t})^\top \big( x_{t-h:t} - (x^\star, \ldots, x^\star) \big) .$$
Now we note that we can decompose $\nabla \ell_t$ by player (after reorganizing coordinates), so
$$\nabla \ell_t(x_{t-h:t}) = \big[ \nabla \ell^1_t(x^1_{t-h:t})^\top, \nabla \ell^2_t(x^2_{t-h:t})^\top, \ldots, \nabla \ell^k_t(x^k_{t-h:t})^\top \big]^\top$$
and
$$\nabla \ell_t(x_{t-h:t})^\top \big( x_{t-h:t} - (x^\star, \ldots, x^\star) \big) = \sum_{i=1}^{k} \nabla \ell^i_t(x^i_{t-h:t})^\top \big( x^i_{t-h:t} - (x^{\star i}, \ldots, x^{\star i}) \big) .$$
Finally, $R_T(\mathcal{A}^i) \ge \frac{1}{T}\sum_{t=1}^{T} \nabla \ell^i_t(x^i_{t-h:t})^\top \big( x^i_{t-h:t} - (x^{\star i}, \ldots, x^{\star i}) \big)$, since $\mathcal{A}^i$ is run on exactly these linear losses, and the result follows.

A.2 Linear Dynamical Controller

Definition A.1 (Linear Dynamic Controller). A linear dynamic controller $\pi$ is a linear dynamical system $(A_\pi, B_\pi, C_\pi, D_\pi)$ with internal state $s_t \in \mathbb{R}^{d_\pi}$, input $x_t \in \mathbb{R}^{d_x}$, and output $u_t \in \mathbb{R}^{d_u}$ that satisfies
$$s_{t+1} = A_\pi s_t + B_\pi x_t, \qquad u_t = C_\pi s_t + D_\pi x_t .$$
Linear dynamic controllers use an internal linear dynamical system to produce a control. The combination of a Kalman filter with optimal control on the state estimates is a classic example.

A.3 Technical Details from Section 4

Below are the formal definitions of joint and local policy evaluation oracles:

Definition A.2. $\ell_t : \Theta^h \to \mathbb{R}$ is an $(\varepsilon, h)$-joint PEO if $|\ell_t(\theta^{1:k}_{1:h}) - c_t(\tilde{x}_t, \tilde{u}_t)| \le \varepsilon$, where $(\tilde{x}_{1:t-h}, \tilde{y}_{1:t-h}, \tilde{u}_{1:t-h-1}) = (x_{1:t-h}, y_{1:t-h}, u_{1:t-h-1})$ and, for $s > t-h$,
$$\tilde{u}_s = \pi_{\theta^{1:k}_{t-s+1}}(\tilde{y}_{1:s}, \tilde{u}_{1:s-1}), \qquad (3)$$
$$\tilde{x}_{s+1} = f_s(\tilde{x}_s, \tilde{u}_s) + w_s, \qquad \forall i \in [k], \ \ \tilde{y}^i_{s+1} = g^i(\tilde{x}_{s+1}) + e^i_{s+1} .$$
$\ell^i_t : \Theta_i^h \to \mathbb{R}$ is an $(\varepsilon, h)$-local PEO for agent $i$ if $|\ell^i_t(\theta_{1:h}) - c_t(\tilde{x}_t, \tilde{u}_t)| \le \varepsilon$, where $(\tilde{x}_{1:t-h}, \tilde{y}^i_{1:t-h}, \tilde{u}_{1:t-h-1}) = (x_{1:t-h}, y^i_{1:t-h}, u_{1:t-h-1})$, $\tilde{u}^{-i}_{t-h+1:t} = u^{-i}_{t-h+1:t}$, and, for $s > t-h$,
$$\tilde{u}^i_s = \pi_{\theta_{t-s+1}}(\tilde{y}^i_{1:s}, \tilde{u}_{1:s-1}), \qquad (4)$$
$$\tilde{x}_{s+1} = f_s(\tilde{x}_s, \tilde{u}_s) + w_s, \qquad \tilde{y}^i_{s+1} = g^i(\tilde{x}_{s+1}) + e^i_{s+1} .$$

B More Applications for Linear Dynamical Systems

Partially Observed LDS with Disturbance-Response Control. Another setting that we can capture with our meta-algorithm is partially observed linear dynamical systems where all agents share the same partial observation. In this case, the dynamics are
$$y_t = g(x_t) + e_t = C x_t + e_t, \qquad x_{t+1} = f_t(x_t, u_t) + w_t = A x_t + B u_t + w_t ,$$
with bounded adversarial disturbances $e_t, w_t$ and known strongly stable dynamics. For a partially observed system with adversarial disturbances, it is known that many states can be consistent with the observations, and so costs must be restricted to be well defined for the agents.
This is solved by using convex costs of the observation, $c_t : \mathbb{R}^{d_y} \times \mathbb{R}^{d_u} \to \mathbb{R}$, so that at time $t$ the cost incurred is $c_t(y_t, u_t)$. Because $y_t$ is an affine function of $x_t$, this cost is still convex in $x$ and fits the general setting. A policy class that suits our needs is that of Disturbance-Response Control (DRC). The satisfaction of the assumptions of Algorithm 3 follows from arguments similar to those for DAC in the fully observable case. In particular, to compute the PEO $\ell^i_t$, the following approximate Markov operator is used:
$$y_t \approx y^{\mathrm{nat}}_t + \sum_{i=1}^{k} \hat{G}^i u^i_{t-1:t-h}, \qquad \hat{G} = [CB, CAB, \ldots, CA^{h-1}B] . \qquad (5)$$
In the above equation, $\hat{G}^i$ is the block of the Markov operator $\hat{G}$ corresponding to the $i$th agent's controls. Beyond this, the core ideas are the same, just replacing disturbances with Nature's y's. The last piece necessary is computing Nature's y's. This is not challenging given that we have access to all agents' controls, so we can let $y^{\mathrm{nat}}_t \approx y_t - \hat{G} u_{t-1:t-h}$. For a complete analysis of this approach in the single-agent setting, refer to (Simchowitz et al., 2020).

Locally Partially Observed LDS with Disturbance-Response Control. An extension of the prior setting of interest is for each agent to have its own local partial observation without needing to share it in the protocol. In particular, consider the multi-agent system
$$x_{t+1} = f_t(x_t, u_t, w_t) = A x_t + B u_t + w_t, \qquad y^i_{t+1} = g^i(x_{t+1}, e_t) = C^i x_{t+1} + e^i_{t+1} ,$$
with bounded adversarial disturbances $e^i_t, w_t$ and known strongly stable dynamics. As noted in Remark 4.1, in order for each agent to have a local PEO, the cost $c_t(x_t, u_t)$ needs to be computable with only local information at each agent. Therefore, we assume there are local cost functions $c^i_t : \mathbb{R}^{d_{y_i}} \times \mathbb{R}^{d_u} \to \mathbb{R}$ for each agent such that for all $i \in [k]$, $c_t(x, u) = c^i_t(C^i x + e^i_t, u)$. With access to the global loss at each agent, agent $i$ can use DRC with $y^{\mathrm{nat},i}_t$ to satisfy all assumptions required for our meta-algorithm. Each observation matrix $C^i$ produces a different Markov operator $G^i = [C^i B, C^i AB, \ldots, C^i A^{h-1} B]$, such that $y^i_t \approx y^{\mathrm{nat},i}_t + G^i u_{t-1:t-h}$. Using $G^i$, agent $i$ can compute $y^{\mathrm{nat},i}_t$ and build a PEO in the same way as in the single-partial-observation setting.

Linear Dynamics with Neural Network Policies. All the previous policy parameterizations were linear in some feature representation. Recent work shows that nonstochastic control can extend beyond linear policies using analysis from deep learning theory (Chen et al., 2021). The authors consider the fully observed setting with fully-connected ReLU neural network policies that act on normalized windows of disturbances. A regret bound for online episodic control is shown by proving that, under some technical conditions, a PEO is $\varepsilon$-nearly-convex, which means
$$p(x) - p(y) \le \langle \nabla p(x), x - y \rangle + \varepsilon .$$
It can be readily shown that Theorem 3.2 extends to nearly convex functions by using near-convexity in place of convexity, so we obtain a multi-agent control regret guarantee for two-layer neural nets that act on normalized windows of disturbances. In addition, the proof that a PEO is nearly convex can be extended to neural net policies that act on normalized windows of Nature's y's, so neural net policies can also be used in the partial observation settings described above.
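As with DAC, the DRC constructions in this appendix reduce to a few matrix operations. Below is a minimal sketch (illustrative names; the single shared-observation case) of recovering Nature's y and forming a DRC control.

```python
import numpy as np

def markov_operator(C, A, B, h):
    # [CB, CAB, ..., C A^{h-1} B], the truncated Markov operator blocks.
    return [C @ np.linalg.matrix_power(A, s) @ B for s in range(h)]

def natures_y(y_t, G_blocks, u_past):
    """Recover Nature's y by subtracting the effect of the last h joint controls.

    u_past = [u_{t-1}, ..., u_{t-h}], so G_blocks[s] multiplies the control
    played s+1 steps ago, matching y_t ~ y_nat_t + sum_s C A^s B u_{t-1-s}.
    """
    y_nat = np.array(y_t, dtype=float)
    for s, u in enumerate(u_past):
        y_nat -= G_blocks[s] @ u
    return y_nat

def drc_control(M, ynat_hist, m):
    # DRC: u_t^i = M @ stacked window of the last m Nature's y's.
    return M @ np.concatenate(ynat_hist[-m:])
```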
C Robustness to System Failures: Overactuated Aircraft

In this section we provide the mathematical model of an overactuated aircraft based on ADMIRE (Härkegård & Glad, 2005) and demonstrate the robustness of our controller to system failures. Let $\alpha, \beta, p, q, r$ be the angle of attack, sideslip angle, roll rate, pitch rate, and yaw rate respectively. The state is $x = [\alpha \ \beta \ p \ q \ r]^\top$, and the input consists of 4 one-dimensional controllers, $u_1, u_2, u_3, u_4$. The discrete linear dynamics of the system are given by $x(t+1) = A x(t) + \sum_{i=1}^{4} B_i u_i(t) + W(t)$, where
$$A = \begin{bmatrix} 1.5109 & 0.0084 & 0.0009 & 0.8598 & 0.0043 \\ 0 & 0.0295 & 0.0903 & 0 & 0.4500 \\ 0 & 3.1070 & 0.1427 & 0 & 2.7006 \\ 2.3057 & 0.0097 & 0.0006 & 1.5439 & 0.0029 \\ 0 & 0.5000 & 0.0125 & 0 & 0.4878 \end{bmatrix},$$
$$B = \begin{bmatrix} B_1 & B_2 & B_3 & B_4 \end{bmatrix} = \begin{bmatrix} 0.6981 & 0.5388 & 0.5367 & 0.0029 \\ 0 & 0.2031 & 0.2031 & 0.3912 \\ 0 & 2.0768 & 2.0768 & 0.4667 \\ 1.8415 & 1.4190 & 1.4190 & 0.0035 \\ 0 & 0.1854 & 0.1854 & 0.7047 \end{bmatrix},$$
and $W(t)$ is the disturbance.

C.1 Experimental Setup

Controller details:
1. GPC has a learning rate of 0.001 decaying in $t$, a rollout length $h$ of 5, and uses DACs with windows of size 5.
2. MAGPC is split into 4 one-dimensional controllers, each using a learning rate of 0.001 decaying in $t$, a rollout length of 5, and a DAC with window length 5.
3. We use a linear $H_\infty$ controller and a standard infinite-horizon LQR linear controller.

Disturbance details:
1. The random walk chooses $w_t = w_{t-1} + X_t$, where $X_t$ is a standard Gaussian random variable.
2. Gaussian noise is i.i.d. with variance 1.
3. Sinusoidal noise is chosen via $w_i = \sin(2t + \phi_i)$, where $\phi = [12.0, 21.0, 3.0, 42.0, 1.0]$, picked arbitrarily once.

The fourth control is set to 0 in one experiment, and otherwise controls are left unchanged. Code is adapted from reference GPC implementations in (Gradu et al., 2021).
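A minimal sketch of the simulation protocol described above is given below. The controller callables, disturbance generator, and quadratic cost are placeholders (e.g. `agents[i]` would wrap GPC or a MAGPC sub-controller); the snippet only illustrates the per-agent input channels and the actuator-failure injection, not the learning updates themselves.

```python
import numpy as np

def run_admire(A, B_cols, agents, disturbance, T, failed=None):
    """Simulate x(t+1) = A x(t) + sum_i B_i u_i(t) + W(t) with per-agent
    controllers, optionally zeroing a failed actuator (failed=3 deactivates
    the 4th controller). agents[i](x, t) returns a scalar control,
    disturbance(t) returns W(t), and an illustrative quadratic cost is averaged.
    """
    x = np.zeros(A.shape[0])
    total_cost = 0.0
    for t in range(T):
        us = [agent(x, t) for agent in agents]
        if failed is not None:
            us[failed] = 0.0                          # actuator failure
        x = A @ x + sum(B_i * u for B_i, u in zip(B_cols, us)) + disturbance(t)
        total_cost += x @ x + sum(u * u for u in us)
    return total_cost / T
```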