Countering Feedback Delays in Multi-Agent Learning

Zhengyuan Zhou (Stanford University) zyzhou@stanford.edu
Panayotis Mertikopoulos (Univ. Grenoble Alpes, CNRS, Inria, LIG) panayotis.mertikopoulos@imag.fr
Nicholas Bambos (Stanford University) bambos@stanford.edu
Peter Glynn (Stanford University) glynn@stanford.edu
Claire Tomlin (UC Berkeley) tomlin@eecs.berkeley.edu

Abstract

We consider a model of game-theoretic learning based on online mirror descent (OMD) with asynchronous and delayed feedback information. Instead of focusing on specific games, we consider a broad class of continuous games defined by a general equilibrium stability notion, which we call λ-variational stability. Our first contribution is that, in this class of games, the actual sequence of play induced by OMD-based learning converges to Nash equilibria provided that the feedback delays faced by the players are synchronous and bounded. Subsequently, to tackle fully decentralized, asynchronous environments with (possibly) unbounded delays between actions and feedback, we propose a variant of OMD which we call delayed mirror descent (DMD), and which relies on the repeated leveraging of past information. With this modification, the algorithm converges to Nash equilibria with no feedback synchronicity assumptions and even when the delays grow superlinearly relative to the horizon of play.

1 Introduction

Online learning is a broad and powerful theoretical framework enjoying widespread applications and great success in machine learning, data science, operations research, and many other fields [3, 7, 22]. The prototypical online learning problem may be described as follows: at each round $t = 0, 1, \dots$, a player selects an action $x^t$ from some convex, compact set and obtains a reward $u^t(x^t)$ based on some a priori unknown payoff function $u^t$. Subsequently, the player receives some feedback (e.g. the past history of the reward functions) and selects a new action $x^{t+1}$ with the goal of maximizing the obtained reward. Aggregated over the rounds of the process, this goal is usually quantified by asking that the player's (external) regret $\mathrm{Reg}(T) \equiv \max_{x \in X} \sum_{t=1}^{T} [u^t(x) - u^t(x^t)]$ grow sublinearly with the horizon of play $T$, a property known as "no regret".

One of the most widely used algorithmic schemes for learning in this context is the online mirror descent (OMD) class of algorithms [23]. Tracing its origins to [17] for offline optimization problems, OMD proceeds by taking a gradient step in the dual (gradient) space and projecting it back to the primal (decision) space via a mirror map generated by a strongly convex regularizer function (with different regularizers giving rise to different algorithms). In particular, OMD includes as special cases several seminal learning algorithms, such as Zinkevich's online gradient descent (OGD) scheme [29] and the multiplicative/exponential weights (EW) algorithm [1, 13]. Several variants of this class also exist and, perhaps unsurprisingly, they occur under a variety of different names, such as "Follow-the-Regularized-Leader" [9], dual averaging [18, 25], and so on.

When $u^t$ is concave, OMD enjoys a sublinear $O(\sqrt{T})$ regret bound which is known to be universally tight (in many formulations a cost function is used instead of a reward function, in which case the cost functions need to be convex). A common instantiation of this is found in repeated multi-player games, where each player's payoff function is determined by the actions of all other players via a fixed mechanism, the stage game.
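To make the mirror map description above concrete, the following minimal sketch implements one OMD step with the entropic regularizer $h(x) = \sum_j x_j \log x_j$ on the probability simplex, in which case the mirror step reduces to the multiplicative/exponential weights update. The toy gradient and the $1/\sqrt{t}$ step size below are illustrative choices, not prescriptions taken from the paper.

```python
import numpy as np

def omd_step_entropic(y, grad, step_size):
    """One online mirror descent step with the entropic regularizer on the
    probability simplex: a gradient step in the dual space followed by the
    mirror (softmax/logit) map back to the primal space."""
    y_new = y + step_size * grad            # dual-space gradient step
    z = np.exp(y_new - np.max(y_new))       # numerically stable mirror map
    x_new = z / z.sum()                     # primal action on the simplex
    return y_new, x_new

# Toy illustration on a 3-action problem (all numbers are arbitrary).
y = np.zeros(3)
for t in range(1, 101):
    z = np.exp(y - y.max()); x = z / z.sum()
    grad = np.array([1.0, 0.5, 0.2]) - x    # stand-in payoff gradient
    y, x = omd_step_entropic(y, grad, step_size=1.0 / np.sqrt(t))
```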
Even though the stage-game mechanism may be unknown to the players, the universality of the OMD regret bounds raises high expectations in terms of performance guarantees, so it is natural to assume that players adopt some variant thereof when faced with such online decision processes. This leads to the following central question: if all players of a repeated game employ an OMD updating rule, do their actions converge to a Nash equilibrium of the underlying one-shot game?

Related Work. Given the prominence of Nash equilibrium as a solution concept in game theory (compared to coarser notions such as correlated equilibria or the Hannan set), this problem lies at the heart of multi-agent learning [4]. However, convergence to a Nash equilibrium is, in the words of [4], considerably more difficult than attaining a no-regret state for all players (which leads to the weaker notion of coarse correlated equilibrium in finite games). To study this question, a growing body of literature has focused on special classes of games (e.g. zero-sum games, routing games) and established the convergence of the so-called ergodic average $T^{-1}\sum_{t=1}^{T} x^t$ of OMD [2, 10, 12]. In general, the actual sequence of play may fail to converge altogether, even in simple, finite games [16, 24]. On the other hand, there are a number of recent works establishing the convergence of play in potential games with finite action sets under different assumptions on the number of players involved (continuous or finite) and the quality of the available feedback (perfect, semi-bandit/imperfect, or bandit/payoff-based) [5, 11, 14, 19]. However, these works focus on games with finite action sets and assume that feedback is instantly available to the players (i.e. with no delays or asynchronicities), two crucial assumptions that we do not make in this paper.

A further major challenge arises in decentralized environments (such as transportation networks), where a considerable delay often occurs between a player's action and the corresponding received feedback. To study learning in such settings, [20] recently introduced an elegant and flexible delay framework where the gradient at round $t$ only becomes available at round $t + d_t - 1$, with $d_t$ being the delay associated with the player's action at round $t$ (of course, taking $d_t = 1$ yields the classical no-delay setting). [20] then considered a very natural extension of OMD under delays: updating with the set of gradients as they are received (see Algorithm 1 for details). If the total delay after time $T$ is $D(T) = \sum_{t=1}^{T} d_t$, [20] showed that OMD enjoys an $O(D(T)^{1/2})$ regret bound. This natural extension has several strengths: first, no assumption is made on how the gradients are received (the delayed gradients can arrive out of order); further, as pointed out in [6, 8], a gradient does not need to be timestamped with the round $s$ from which it originates, as required for example by the pooling strategies of [6, 8].

Our Contributions. Our investigations here differ from existing work in the following aspects: First, we consider learning in games with asynchronous and delayed feedback by extending the general single-agent feedback delay framework introduced in [20]. Previous work on the topic has focused on the regret analysis of single-agent learning with delays, but the convergence properties of such processes in continuous games are completely unknown. Second, we focus throughout on the convergence of the actual sequence of play generated by OMD (its "last iterate" in the parlance of optimization), as opposed to the algorithm's ergodic average $\frac{1}{T}\sum_{t=1}^{T} x^t$.
This last point is worth emphasizing for several reasons: a) this mode of convergence is stronger and theoretically more appealing, because it implies ergodic convergence; b) in a game-theoretic setting, payoffs are determined by the actual sequence of play, so ergodic convergence diminishes in value if it is not accompanied by similar conclusions for the players' realized actions; and c) because there is no inherent averaging, the techniques used to prove convergence of $x^t$ provide a much finer understanding of the evolution of OMD.

The starting point of our paper is the introduction of an equilibrium stability notion which we call λ-variational stability, a notion that is motivated by the concept of evolutionary stability in population games and builds on the characterization of stable Nash equilibria as solutions to a Minty-type variational inequality [15]. This stability notion is intimately related to monotone operators in variational analysis [21] and can be seen as a strict generalization of operator monotonicity in the current game-theoretic context (in the supplement, we give two well-known classes of games that satisfy this equilibrium notion). By means of this notion, we are able to treat convergence questions in general games with continuous action spaces, without having to focus on a specific class of games such as concave potential or strictly monotone games (though our analysis also covers such games).

Our first result is that, assuming variational stability, the sequence of play induced by OMD converges to the game's set of Nash equilibria, provided that the delays of all players are synchronous and bounded (see Theorems 4.3 and 4.4). As an inherited benefit, players adopting this learning algorithm can receive gradients out of order and do not need to keep track of the timestamps from which the gradients originate. In fact, even in the special case of learning without delays, we are not aware of a similar convergence result for the actual sequence of play. An important limitation of this result is that delays are assumed synchronous and bounded, an assumption which might not hold in large, decentralized environments. To lift this barrier, we introduce a modification of vanilla OMD which we call delayed mirror descent (DMD), and which leverages past information repeatedly, even in rounds where players receive no feedback. Thanks to this modification, play under DMD converges to variationally stable sets of Nash equilibria (Theorem 5.2), even if the players experience asynchronous and unbounded delays: in particular, delays could grow superlinearly in the game's horizon, and DMD would still converge.

We mention that the convergence proofs for both OMD and DMD rely on designing a particular Lyapunov function, the so-called λ-Fenchel coupling, which serves as a primal-dual divergence measure between actions and gradient variables. Thanks to its Lyapunov properties, the λ-Fenchel coupling provides a potent tool for proving convergence and we exploit it throughout. Further, we present a unified theoretical framework that puts the analysis of both algorithms under different delay assumptions on the same footing.

2 Problem Setup

2.1 Games with Continuous Action Sets

We start with the definition of a game with continuous action sets, which serves as a stage game and provides a reward function for each player in an online learning process.

Definition 2.1.
A continuous game $\mathcal{G}$ is a tuple $\mathcal{G} = (\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$, where $\mathcal{N} = \{1, 2, \dots, N\}$ is the set of players, $X_i$ is a compact convex subset of some finite-dimensional vector space $\mathbb{R}^{d_i}$ representing the action space of player $i$, and $u_i : X \to \mathbb{R}$ is the $i$-th player's payoff function.

Regarding the players' payoff functions, we make the following assumptions throughout:
1. For each $i \in \mathcal{N}$, $u_i(x)$ is continuous in $x$.
2. For each $i \in \mathcal{N}$, $u_i$ is continuously differentiable in $x_i$ and the partial gradient $\nabla_{x_i} u_i(x)$ is Lipschitz continuous in $x$.

Throughout the paper, $x_{-i}$ denotes the joint action of all players but player $i$. Consequently, the joint action $x$ will frequently be written as $(x_i, x_{-i})$ (boldfaced letters are only used to denote joint actions; in particular, $x_i$ is a vector even though it is not boldfaced). Two important quantities in the current context are:

Definition 2.2. We let $v(x)$ denote the profile of the players' individual payoff gradients, i.e. $v(x) = (v_1(x), \dots, v_N(x))$, where $v_i(x) \equiv \nabla_{x_i} u_i(x)$. (Per the assumptions accompanying Definition 2.1, the gradient $v(x)$ always exists and is a continuous function on the joint action space $X$.)

Definition 2.3. Given a continuous game $\mathcal{G}$, $x^* \in X$ is called a (pure-strategy) Nash equilibrium if for each $i \in \mathcal{N}$, $u_i(x_i^*, x_{-i}^*) \ge u_i(x_i, x_{-i}^*)$ for all $x_i \in X_i$.

2.2 Online Mirror Descent in Games under Delays

In what follows, we consider a general multi-agent delay model extending the single-agent delay model of [20] to the multi-agent learning case. At a high level, for each agent there can be an arbitrary delay between the stage at which an action is played and the stage at which feedback is received about said action (typically in the form of gradient information). No extra assumption is imposed on the feedback delays: in particular, feedback can arrive out of order and in a completely asynchronous manner across agents. Further, the received feedback is not timestamped, so a player might not know to which iteration a specific piece of feedback corresponds. When OMD is applied in this setting, we obtain the following scheme:

Algorithm 1 Online Mirror Descent on Games under Delays
1: Each player $i$ chooses an initial $y_i^0$.
2: for $t = 0, 1, 2, \dots$ do
3:   for $i = 1, \dots, N$ do
4:     $x_i^t = \arg\max_{x_i \in X_i} \{\langle y_i^t, x_i \rangle - h_i(x_i)\}$
5:     $y_i^{t+1} = y_i^t + \alpha^t \sum_{s \in G_i^t} v_i(x^s)$
6:   end for
7: end for

Three comments are in order here. First, each $h_i$ is a regularizer on $X_i$, as defined below:

Definition 2.4. Let $D$ be a compact and convex subset of $\mathbb{R}^m$. We say that $g : D \to \mathbb{R}$ is a regularizer if $g$ is continuous and strongly convex on $D$, i.e. there exists some $K > 0$ such that
$$ g(t d + (1-t) d') \le t g(d) + (1-t) g(d') - \tfrac{1}{2} K t (1-t) \|d - d'\|^2 \qquad (2.1) $$
for all $t \in [0, 1]$ and all $d, d' \in D$.

Second, the gradient step size $\alpha^t$ in Algorithm 1 can be any positive and non-increasing sequence that satisfies the standard summability assumption: $\sum_{t=0}^{\infty} \alpha^t = \infty$, $\sum_{t=0}^{\infty} (\alpha^t)^2 < \infty$.

Third, regarding the delay model: in Algorithm 1, $G_i^t$ denotes the set of rounds whose gradients become available to player $i$ at the current round $t$. Denoting by $d_i^s$ (a positive integer) player $i$'s delay for the gradient generated at round $s$, this gradient $v_i(x^s)$ becomes available at round $s + d_i^s - 1$, i.e. $s \in G_i^{s + d_i^s - 1}$. In particular, if $d_i^s = 1$ for all $s$, player $i$ does not experience any feedback delays.
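To illustrate how Algorithm 1 could be simulated, the sketch below instantiates it with a Euclidean regularizer, so that the choice map on line 4 reduces to a projection. The helpers `grad_oracle`, `feedback_sets`, and the per-player projection routine are hypothetical stand-ins supplied by the user, not objects defined in the paper.

```python
import numpy as np

def run_delayed_omd(num_rounds, players, grad_oracle, feedback_sets, alpha):
    """Sketch of Algorithm 1 with the Euclidean regularizer h_i(x) = ||x||^2/2.

    players        -- list of (y0, project) pairs; `project` maps R^{d_i} onto X_i
    grad_oracle    -- grad_oracle(i, x_joint) returns v_i(x) = grad_{x_i} u_i(x)
    feedback_sets  -- feedback_sets[t][i] is G_i^t, the (possibly empty, possibly
                      out-of-order) set of past rounds whose gradients reach
                      player i at round t
    alpha          -- alpha(t) returns the step size at round t
    """
    y = [np.array(y0, dtype=float) for y0, _ in players]
    history = []                                   # joint actions x^0, x^1, ...
    for t in range(num_rounds):
        # Line 4: with the Euclidean regularizer, the choice map is a projection.
        x = [players[i][1](y[i]) for i in range(len(players))]
        history.append(x)
        # Line 5: add the step size times the sum of the gradients received now.
        for i in range(len(players)):
            for s in feedback_sets[t][i]:
                y[i] = y[i] + alpha(t) * grad_oracle(i, history[s])
    return history
```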
Note here again that each player can receive feedback out of order: this can happen if the gradient generated at an earlier round has a much larger delay than that of the gradient generated at a later round.

3 λ-Variational Stability: A Key Criterion

In this section, we define a key stability notion, called λ-variational stability. This notion allows us to obtain strong convergence results for the induced sequence of play, as opposed to results that only hold in specific classes of games. The supplement provides two detailed special classes of games (convex potential games and asymmetric Cournot oligopolies) that admit variationally stable equilibria. Other examples include monotone games (discussed later in this section), pseudo-monotone games [28], non-atomic routing games [26, 27], symmetric influence network games [11] and many others.

3.1 λ-Variational Stability

Definition 3.1. Given a game with continuous actions $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$, a set $C \subseteq X$ is called λ-variationally stable for some $\lambda \in \mathbb{R}^N_{++}$ if
$$ \sum_{i=1}^{N} \lambda_i \langle v_i(x), x_i - x_i^* \rangle \le 0 \quad \text{for all } x \in X,\ x^* \in C, \qquad (3.1) $$
with equality if and only if $x \in C$.

Remark 3.1. If $C$ is λ-stable with $\lambda_i = 1$ for all $i$, it is simply called stable [15].

We emphasize that, in a game setting, λ-variational stability is more general than the important concept of operator monotonicity from variational analysis. Specifically, $v(\cdot)$ is called a monotone operator [21] if the following holds (with equality if and only if $x = \tilde{x}$):
$$ \langle v(x) - v(\tilde{x}), x - \tilde{x} \rangle \equiv \sum_{i=1}^{N} \langle v_i(x) - v_i(\tilde{x}), x_i - \tilde{x}_i \rangle \le 0 \quad \text{for all } x, \tilde{x} \in X. \qquad (3.2) $$
If $v(\cdot)$ is monotone, the game admits a unique Nash equilibrium $x^*$ which (per the defining property of a Nash equilibrium) satisfies $\langle v(x^*), x - x^* \rangle \le 0$. Consequently, if $v(\cdot)$ is a monotone operator, it follows that $\langle v(x), x - x^* \rangle \le \langle v(x^*), x - x^* \rangle \le 0$, where equality is achieved if and only if $x = x^*$. This implies that when $v(x)$ is a monotone operator, the singleton set consisting of the unique Nash equilibrium is 1-variationally stable, where 1 is the all-ones vector. The converse is not true: when $v(x)$ is not a monotone operator, we can still have a unique Nash equilibrium that is λ-variationally stable or, more generally, a λ-variationally stable set $C$.

3.2 Properties of λ-Variational Stability

Lemma 3.2. If $C$ is nonempty and λ-stable, then it is closed, convex and contains all Nash equilibria of the game.

The following lemma gives a convenient sufficient condition ensuring that a singleton λ-variationally stable set $\{x^*\}$ exists; in this case, we simply say that $x^*$ is λ-variationally stable.

Lemma 3.3. Consider a game with continuous actions $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$ in which each $u_i$ is twice continuously differentiable. For each $x \in X$, define the λ-weighted Hessian matrix $H^\lambda(x)$ blockwise as
$$ H^\lambda_{ij}(x) = \tfrac{1}{2} \lambda_i \nabla_{x_j} v_i(x) + \tfrac{1}{2} \lambda_j \big(\nabla_{x_i} v_j(x)\big)^{\mathsf T}. \qquad (3.3) $$
If $H^\lambda(x)$ is negative-definite for every $x \in X$, then the game admits a unique Nash equilibrium $x^*$ that is globally λ-variationally stable.

Remark 3.2. It is important to note that the Hessian matrix so defined is a block matrix: each $H^\lambda_{ij}(x)$ is a $d_i \times d_j$ block. Writing it in terms of the utility functions, we have $H^\lambda_{ij}(x) = \tfrac{1}{2} \lambda_i \nabla_{x_j} \nabla_{x_i} u_i(x) + \tfrac{1}{2} \lambda_j \big(\nabla_{x_i} \nabla_{x_j} u_j(x)\big)^{\mathsf T}$.
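As a quick numerical sanity check of the sufficient condition in Lemma 3.3, the sketch below assembles $H^\lambda(x)$ for a toy two-player game with scalar actions and inspects its eigenvalues. The payoff functions, the evaluation point, and the weights λ are arbitrary choices made for illustration and do not come from the paper.

```python
import numpy as np

# Toy two-player game with scalar actions:
#   u_1(x) = -x_1^2 + x_1 x_2,   u_2(x) = -2 x_2^2 + x_1 x_2.
def v(x):
    """Profile of individual payoff gradients v_i(x) = d u_i / d x_i."""
    x1, x2 = x
    return np.array([-2.0 * x1 + x2, -4.0 * x2 + x1])

def weighted_hessian(x, lam, eps=1e-6):
    """Numerically assemble H^lambda_{ij}(x) = (1/2) lam_i dv_i/dx_j
                                              + (1/2) lam_j (dv_j/dx_i)^T."""
    n = len(x)
    J = np.zeros((n, n))                       # Jacobian J_{ij} = d v_i / d x_j
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (v(x + e) - v(x - e)) / (2 * eps)
    lam = np.asarray(lam, dtype=float)
    return 0.5 * lam[:, None] * J + 0.5 * J.T * lam[None, :]

H = weighted_hessian(x=np.array([0.3, 0.1]), lam=[1.0, 1.0])
print(np.linalg.eigvalsh(H))   # both eigenvalues are negative at this point
```

In this toy example $H^\lambda$ is constant in $x$, so a single negative-definiteness check already certifies the condition of Lemma 3.3; in general one would need to verify it over all of $X$.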
4 Convergence under Synchronous and Bounded Delays

In this section, we tackle the convergence of the last iterate of OMD under delays. We start by defining an important divergence measure, the λ-Fenchel coupling, which generalizes the Bregman divergence. We then establish the properties that make it play an indispensable role in both this section and the next.

4.1 λ-Fenchel Coupling

Definition 4.1. Fix a game with continuous action spaces $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$ and, for each player $i$, let $h_i : X_i \to \mathbb{R}$ be a regularizer that is $K_i$-strongly convex with respect to the norm $\|\cdot\|_i$.
1. The convex conjugate $h_i^* : \mathbb{R}^{d_i} \to \mathbb{R}$ of $h_i$ is defined as $h_i^*(y_i) = \max_{x_i \in X_i} \{\langle x_i, y_i \rangle - h_i(x_i)\}$.
2. The choice function $C_i : \mathbb{R}^{d_i} \to X_i$ associated with the regularizer $h_i$ of player $i$ is defined as $C_i(y_i) = \arg\max_{x_i \in X_i} \{\langle x_i, y_i \rangle - h_i(x_i)\}$.
3. For $\lambda \in \mathbb{R}^N_{++}$, the λ-Fenchel coupling $F^\lambda : X \times \mathbb{R}^{\sum_{i=1}^N d_i} \to \mathbb{R}$ is defined as $F^\lambda(x, y) = \sum_{i=1}^{N} \lambda_i \big(h_i(x_i) - \langle x_i, y_i \rangle + h_i^*(y_i)\big)$.

Note that although the domain of $h_i$ is $X_i \subseteq \mathbb{R}^{d_i}$, the domain of its conjugate $h_i^*$ is the whole gradient space $\mathbb{R}^{d_i}$.

The two key properties of the λ-Fenchel coupling that will be important in establishing the convergence of OMD are given next.

Lemma 4.2. For each $i \in \{1, \dots, N\}$, let $h_i : X_i \to \mathbb{R}$ be a regularizer that is $K_i$-strongly convex with respect to the norm $\|\cdot\|_i$, and let $\lambda \in \mathbb{R}^N_{++}$. Then, for all $x \in X$ and all $y, \tilde{y} \in \mathbb{R}^{\sum_{i=1}^N d_i}$:
1. $F^\lambda(x, y) \ge \frac{1}{2} \sum_{i=1}^{N} K_i \lambda_i \|C_i(y_i) - x_i\|_i^2 \ge \frac{1}{2} (\min_i K_i \lambda_i) \sum_{i=1}^{N} \|C_i(y_i) - x_i\|_i^2$.
2. $F^\lambda(x, \tilde{y}) \le F^\lambda(x, y) + \sum_{i=1}^{N} \lambda_i \langle \tilde{y}_i - y_i, C_i(y_i) - x_i \rangle + \frac{1}{2} \big(\max_i \frac{\lambda_i}{K_i}\big) \sum_{i=1}^{N} \big(\|\tilde{y}_i - y_i\|_{i,*}\big)^2$, where $\|\cdot\|_{i,*}$ is the dual norm of $\|\cdot\|_i$ (i.e. $\|y_i\|_{i,*} = \max_{\|x_i\|_i \le 1} \langle x_i, y_i \rangle$).

Remark 4.1. Collecting the individual choice maps into a vector, we obtain the aggregate choice map $C : \mathbb{R}^{\sum_{i=1}^N d_i} \to X$ with $C(y) = (C_1(y_1), \dots, C_N(y_N))$. Since each space $X_i$ is endowed with the norm $\|\cdot\|_i$, we can define the induced aggregate norm on the joint space $X$ as $\|x\| = \sum_{i=1}^{N} \|x_i\|_i$, and similarly the aggregate dual norm $\|y\|_* = \sum_{i=1}^{N} \|y_i\|_{i,*}$. Henceforth, convergence statements in the joint spaces (e.g. $C(y^t) \to x$ or $y^t \to y$) are understood with respect to these aggregate norms. Finally, we assume throughout the paper that the choice maps are regular in the following (very weak) sense: a choice map $C(\cdot)$ is said to be λ-Fenchel coupling conforming if
$$ C(y^t) \to x \ \text{ implies } \ F^\lambda(x, y^t) \to 0 \ \text{ as } t \to \infty. \qquad (4.1) $$
Unless one aims for relatively pathological cases, choice maps induced by typical regularizers are always λ-Fenchel coupling conforming: examples include the Euclidean and entropic regularizers.
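For intuition, here is a short worked special case (assumed for illustration, not spelled out in the paper): for a single player using the entropic regularizer on the probability simplex, the Fenchel coupling reduces to a Kullback-Leibler divergence between the target point and the chosen action.

```latex
% Entropic regularizer on the simplex: h(x) = \sum_j x_j \log x_j.
\begin{align*}
  h^*(y) &= \log \sum\nolimits_j e^{y_j},
  \qquad
  C(y)_j = \frac{e^{y_j}}{\sum_k e^{y_k}}, \\
  F(x, y) &= h(x) - \langle x, y \rangle + h^*(y)
           = \sum\nolimits_j x_j \log \frac{x_j}{C(y)_j}
           = \operatorname{KL}\!\bigl(x \,\|\, C(y)\bigr).
\end{align*}
```

In this special case, Statement 1 of Lemma 4.2 (with $\lambda = 1$ and $K = 1$, the strong convexity constant of the entropic regularizer under the $\ell_1$-norm) is precisely Pinsker's inequality, $\operatorname{KL}(x \,\|\, C(y)) \ge \tfrac{1}{2}\|C(y) - x\|_1^2$.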
4.2 Convergence of OMD to Nash Equilibrium

We start by stating our assumption on the delay model:

Assumption 1. The delays are assumed to be:
1. Synchronous: $G_i^t = G_j^t$ for all $i, j, t$.
2. Bounded: $d_i^t \le D$ for all $i, t$ (for some positive integer $D$).

Theorem 4.3. Fix a game with continuous action spaces $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$ that admits $x^*$ as its unique Nash equilibrium, and assume that $x^*$ is λ-variationally stable. Under Assumption 1, the OMD iterate $x^t$ given in Algorithm 1 converges to $x^*$, irrespective of the initial point $x^0$.

Remark 4.2. The proof is rather long and involved. To aid understanding and build intuition, we break it down into four main steps, each of which is proved in detail in the appendix.

1. Since the delays are synchronous, we denote by $G^t$ the common feedback set and by $d^t$ the common delay at round $t$. The gradient update in OMD under delays can then be written as
$$ y_i^{t+1} = y_i^t + \alpha^t \sum_{s \in G^t} v_i(x^s) = y_i^t + \alpha^t \Big( |G^t| v_i(x^t) + \sum_{s \in G^t} \{ v_i(x^s) - v_i(x^t) \} \Big). $$
Define $b_i^t = \sum_{s \in G^t} \{ v_i(x^s) - v_i(x^t) \}$. We show that $\lim_{t \to \infty} \|b_i^t\|_{i,*} = 0$ for each player $i$.

2. Define $b^t = (b_1^t, \dots, b_N^t)$; by Claim 1 we have $\lim_{t \to \infty} b^t = 0$. Since each player's gradient update can be written as $y_i^{t+1} = y_i^t + \alpha^t (|G^t| v_i(x^t) + b_i^t)$, the joint OMD update (over all players) reads
$$ x^t = C(y^t), \qquad (4.3) $$
$$ y^{t+1} = y^t + \alpha^t \{ |G^t| v(x^t) + b^t \}. \qquad (4.4) $$
Let $B(x^*, \epsilon) \equiv \{ x \in X : \|x - x^*\| < \epsilon \}$ be the open ball centered at $x^*$ with radius $\epsilon$. Using the λ-Fenchel coupling as an "energy" function and leveraging the handle on $b^t$ given by Claim 1, we establish that, for any $\epsilon > 0$, the iterate $x^t$ eventually enters $B(x^*, \epsilon)$ and visits $B(x^*, \epsilon)$ infinitely often, no matter what the initial point $x^0$ is. Mathematically, the claim is that for every $\epsilon > 0$ and every $x^0$, $|\{ t : x^t \in B(x^*, \epsilon) \}| = \infty$.

3. Fix any $\delta > 0$ and consider the set $\tilde{B}(x^*, \delta) \equiv \{ C(y) : F^\lambda(x^*, y) < \delta \}$. In other words, $\tilde{B}(x^*, \delta)$ is a "neighborhood" of $x^*$ containing every $x$ that is the image (under the choice map $C(\cdot)$) of some $y$ within distance $\delta$ of $x^*$ under the λ-Fenchel coupling "metric". Although $F^\lambda(x^*, \cdot)$ is not a metric, $\tilde{B}(x^*, \delta)$ contains an open ball. Mathematically, the claim is that for any $\delta > 0$ there exists $\epsilon(\delta) > 0$ such that $B(x^*, \epsilon) \subseteq \tilde{B}(x^*, \delta)$.

4. For any "neighborhood" $\tilde{B}(x^*, \delta)$, after sufficiently many rounds, if $x^t$ ever enters $\tilde{B}(x^*, \delta)$, it is trapped inside $\tilde{B}(x^*, \delta)$ thereafter. Mathematically, the claim is that for any $\delta > 0$ there exists $T(\delta)$ such that for any $t \ge T(\delta)$, if $x^t \in \tilde{B}(x^*, \delta)$, then $x^{t'} \in \tilde{B}(x^*, \delta)$ for all $t' \ge t$.

Putting the four elements above together, the significance of Claim 3 is that, since the iterate $x^t$ enters $B(x^*, \epsilon)$ infinitely often (per Claim 2), $x^t$ must enter $\tilde{B}(x^*, \delta)$ infinitely often. It then follows from Claim 4 that, from some iteration onward, $x^t$ remains in $\tilde{B}(x^*, \delta)$. Since this is true for any $\delta > 0$, we have $F^\lambda(x^*, y^t) \to 0$ as $t \to \infty$. By Statement 1 of Lemma 4.2, this implies $\|C(y^t) - x^*\| \to 0$, thereby establishing that $x^t = C(y^t) \to x^*$ as $t \to \infty$.

In fact, the result generalizes straightforwardly to multiple Nash equilibria. The proof of the set-valued case is line-by-line identical, provided we redefine, in the standard way, every quantity measuring the distance between two points as the corresponding quantity measuring the distance between a point and a set (by taking the infimum of the distances between the point and the points of that set). We state the result directly below.

Theorem 4.4. Fix a game with continuous action spaces $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$ that admits $X^*$ as a λ-variationally stable set (necessarily consisting of all Nash equilibria) for some $\lambda \in \mathbb{R}^N_{++}$. Under Assumption 1, the OMD iterate $x^t$ given in Algorithm 1 satisfies $\lim_{t \to \infty} \operatorname{dist}(x^t, X^*) = 0$, irrespective of $x^0$, where $\operatorname{dist}(\cdot, \cdot)$ is the standard point-to-set distance function induced by the norm $\|\cdot\|$.

5 Delayed Mirror Descent: Asynchronous and Unbounded Delays

The synchronous and bounded delay requirement in Assumption 1 is fairly strong. In this section, through a simple modification of OMD, we propose a new learning algorithm, called delayed mirror descent (DMD), which allows the last-iterate convergence-to-Nash result to be generalized to arbitrary asynchronous delays among players as well as unbounded delay growth.

5.1 Delayed Mirror Descent in Games

The main idea behind the modification is that, when player $i$ does not receive any gradient in round $t$, instead of performing no gradient update as in OMD, the player reuses the most recent set of gradients. More formally, define the most recent information set as
$$ \tilde{G}_i^t = \begin{cases} G_i^t, & \text{if } G_i^t \ne \emptyset, \\ \tilde{G}_i^{t-1}, & \text{if } G_i^t = \emptyset. \end{cases} $$
(There may be no gradient information in the first few rounds due to delays. Without loss of generality, we can always start at the first round in which non-empty gradient information is available or, equivalently, assume that some gradient is available at $t = 0$.) Under this definition, delayed mirror descent is given by Algorithm 2 (note that $\tilde{G}_i^t$ is always non-empty here):

Algorithm 2 Delayed Mirror Descent on Games
1: Each player $i$ chooses an initial $y_i^0$.
2: for $t = 0, 1, 2, \dots$ do
3:   for $i = 1, \dots, N$ do
4:     $x_i^t = \arg\max_{x_i \in X_i} \{\langle y_i^t, x_i \rangle - h_i(x_i)\}$
5:     $y_i^{t+1} = y_i^t + \dfrac{\alpha^t}{|\tilde{G}_i^t|} \sum_{s \in \tilde{G}_i^t} v_i(x^s)$
6:   end for
7: end for

We only make the following assumption on the delays:

Assumption 2. For each player $i$, $\lim_{t \to \infty} \sum_{s = \min \tilde{G}_i^t}^{t} \alpha^s = 0$.

This assumption essentially requires that no player's delays grow too fast. Note that, in particular, the players' delays can be arbitrarily asynchronous.
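As a rough illustration of the one change DMD makes relative to Algorithm 1, the sketch below (same hypothetical helpers and Euclidean regularizer as the Algorithm 1 sketch above) reuses the most recent information set in rounds where no fresh feedback arrives, and averages the gradients it contains.

```python
import numpy as np

def run_dmd(num_rounds, players, grad_oracle, feedback_sets, alpha):
    """Sketch of Algorithm 2 (delayed mirror descent), Euclidean regularizer.
    Assumes feedback_sets[0][i] is non-empty for every player, mirroring the
    paper's convention that some gradient is available at t = 0."""
    y = [np.array(y0, dtype=float) for y0, _ in players]
    recent = [None] * len(players)            # most recent information sets
    history = []
    for t in range(num_rounds):
        x = [players[i][1](y[i]) for i in range(len(players))]
        history.append(x)
        for i in range(len(players)):
            if feedback_sets[t][i]:           # fresh feedback arrived this round
                recent[i] = list(feedback_sets[t][i])
            # Reuse the most recent (possibly stale) feedback set and average
            # the gradients it contains, as in line 5 of Algorithm 2.
            grads = [grad_oracle(i, history[s]) for s in recent[i]]
            y[i] = y[i] + alpha(t) * np.mean(grads, axis=0)
    return history
```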
To make Assumption 2 more concrete, we next give two explicit delay conditions that satisfy it. As made formal by the following lemma, if the delays are bounded (but not necessarily synchronous), then Assumption 2 is satisfied. Furthermore, by appropriately choosing the sequence $\alpha^t$, Assumption 2 can also accommodate delays that are unbounded and grow super-linearly.

Lemma 5.1. Let $\{d_i^s\}_{s=1}^{\infty}$ be the delay sequence of player $i$.
1. If each player $i$'s delays are bounded (i.e. there exists $d \in \mathbb{Z}_+$ with $d_i^s \le d$ for all $s$), then Assumption 2 is satisfied for any positive, non-increasing sequence $\{\alpha^t\}$ that is not summable but is square-summable.
2. There exists a positive, non-increasing, non-summable but square-summable sequence (e.g. $\alpha^t = \frac{1}{t \log t \log\log t}$) such that if $d_i^s = O(s \log s)$ for all $i$, then Assumption 2 is satisfied.

Proof: We only prove Statement 2, the more interesting case. Take $\alpha^t = \frac{1}{t \log t \log\log t}$, which is obviously positive, non-increasing and square-summable. Since $\int_{4}^{t} \frac{ds}{s \log s \log\log s} = \log\log\log t - \log\log\log 4 \to \infty$ as $t \to \infty$, the sequence $\{\alpha^t\}$ is not summable.

Next, let $\tilde{G}_i^t$ be given and let $\bar{t}$ be the most recent round (up to and including $t$) such that $G_i^{\bar{t}}$ is non-empty. This means
$$ \tilde{G}_i^t = G_i^{\bar{t}}, \qquad G_i^k = \emptyset \ \text{ for all } k \in (\bar{t}, t]. \qquad (5.1) $$
Since the gradient generated at round $\bar{t}$ becomes available at round $\bar{t} + d_i^{\bar{t}} - 1$, it follows that
$$ t - \bar{t} \le d_i^{\bar{t}}. \qquad (5.2) $$
Note that this implies $\bar{t} \to \infty$ as $t \to \infty$: otherwise $\bar{t}$ would be bounded, making the right-hand side $d_i^{\bar{t}}$ bounded, which contradicts the left-hand side diverging to infinity. Since $d_i^s = O(s \log s)$, we have $d_i^s \le K s \log s$ for some $K > 0$. Consequently, Equation (5.2) implies $t \le \bar{t} + K \bar{t} \log \bar{t}$.

Denote $s_{\min}^t \equiv \min \tilde{G}_i^t$. Equation (5.1) implies that $s_{\min}^t = \min G_i^{\bar{t}}$, thereby yielding $s_{\min}^t + d_i^{s_{\min}^t} - 1 = \bar{t}$. Therefore,
$$ d_i^{s_{\min}^t} = \bar{t} - s_{\min}^t + 1. \qquad (5.3) $$
Equation (5.3) implies that $s_{\min}^t \to \infty$ as $t \to \infty$, because otherwise the left-hand side of (5.3) would be bounded while the right-hand side goes to infinity (since $\bar{t} \to \infty$ as $t \to \infty$, as established above).

With the above notation, and using that $\alpha^t$ is non-increasing, it follows that
$$ \sum_{s = s_{\min}^t}^{t} \alpha^s \;\le\; d_i^{s_{\min}^t} \alpha^{s_{\min}^t} + (K \bar{t} \log \bar{t})\, \alpha^{\bar{t}+1} \;\le\; \frac{d_i^{s_{\min}^t}}{s_{\min}^t \log s_{\min}^t \log\log s_{\min}^t} + \frac{K \bar{t} \log \bar{t}}{(\bar{t}+1) \log(\bar{t}+1) \log\log(\bar{t}+1)} $$
$$ \le\; \frac{K s_{\min}^t \log s_{\min}^t}{s_{\min}^t \log s_{\min}^t \log\log s_{\min}^t} + \frac{K \bar{t} \log \bar{t}}{(\bar{t}+1) \log(\bar{t}+1) \log\log(\bar{t}+1)} \;\le\; \frac{K}{\log\log s_{\min}^t} + \frac{K}{\log\log(\bar{t}+1)}, \qquad (5.5) $$
which tends to $0$ as $t \to \infty$ since both $s_{\min}^t \to \infty$ and $\bar{t} \to \infty$. This verifies Assumption 2 and completes the proof.

Remark 5.1. The proof of the second claim of Lemma 5.1 indicates that one can easily obtain slightly larger delay growth rates, such as $O(t \log t \log\log t)$, $O(t \log t \log\log t \log\log\log t)$ and so on, by choosing the corresponding step-size sequences. Further, it is conceivable that one can identify meaningfully larger delay growth rates that still satisfy Assumption 2, particularly under additional restrictions on the degree of delay asynchrony among the players. We leave this for future work.

5.2 Convergence of DMD to Nash Equilibrium

Theorem 5.2. Fix a game with continuous action spaces $(\mathcal{N}, X = \prod_{i=1}^{N} X_i, \{u_i\}_{i=1}^{N})$ that admits $x^*$ as its unique Nash equilibrium, and assume that $x^*$ is λ-variationally stable.
Under Assumption 2, the DMD iterate $x^t$ given in Algorithm 2 converges to $x^*$, irrespective of the initial point $x^0$.

The proof uses a framework similar to the one outlined in Remark 4.2, although the details are somewhat different. Building on the notation and arguments given in Remark 4.2, we again outline three main ingredients that together establish the result. Detailed proofs are omitted due to space limitations.

1. The gradient update in DMD can be rewritten as
$$ y_i^{t+1} = y_i^t + \frac{\alpha^t}{|\tilde{G}_i^t|} \sum_{s \in \tilde{G}_i^t} v_i(x^s) = y_i^t + \alpha^t v_i(x^t) + \frac{\alpha^t}{|\tilde{G}_i^t|} \sum_{s \in \tilde{G}_i^t} \{ v_i(x^s) - v_i(x^t) \}. $$
By defining $b_i^t = \frac{1}{|\tilde{G}_i^t|} \sum_{s \in \tilde{G}_i^t} \{ v_i(x^s) - v_i(x^t) \}$, we can write player $i$'s gradient update as $y_i^{t+1} = y_i^t + \alpha^t (v_i(x^t) + b_i^t)$. By bounding the magnitude of $b_i^t$ in terms of the delay sequence, Assumption 2 allows us to establish that $b_i^t$ has negligible impact over time. Mathematically, the claim is that $\lim_{t \to \infty} \|b_i^t\|_{i,*} = 0$.

2. The joint DMD update can be written as
$$ x^t = C(y^t), \qquad (5.9) $$
$$ y^{t+1} = y^t + \alpha^t (v(x^t) + b^t). \qquad (5.10) $$
Here, again using the λ-Fenchel coupling as an "energy" function and leveraging the handle on $b^t$ given by Claim 1, we show that for any $\epsilon > 0$ the iterate $x^t$ eventually enters $B(x^*, \epsilon)$ and visits $B(x^*, \epsilon)$ infinitely often, no matter what the initial point $x^0$ is. Furthermore, per Claim 3 in Remark 4.2, $B(x^*, \epsilon) \subseteq \tilde{B}(x^*, \delta)$. This implies that $x^t$ must enter $\tilde{B}(x^*, \delta)$ infinitely often.

3. Again using the λ-Fenchel coupling, we show that under DMD, for any "neighborhood" $\tilde{B}(x^*, \delta)$, after sufficiently many iterations, if $x^t$ ever enters $\tilde{B}(x^*, \delta)$, it remains trapped inside $\tilde{B}(x^*, \delta)$ thereafter.

Combining the above three elements, it follows that under DMD, from some iteration onward, $x^t$ remains in $\tilde{B}(x^*, \delta)$. Since this is true for any $\delta > 0$, we have $F^\lambda(x^*, y^t) \to 0$ as $t \to \infty$, thereby establishing that $x^t = C(y^t) \to x^*$ as $t \to \infty$. Here again, the result generalizes straightforwardly to the case of multiple Nash equilibria (with an identical proof, modulo using the point-to-set distance). We omit the statement.

6 Conclusion

We examined a model of game-theoretic learning based on OMD with asynchronous and delayed information. By focusing on games with λ-variationally stable equilibria, we showed that the sequence of play induced by OMD converges to Nash equilibrium whenever the feedback delays faced by the players are synchronous and bounded. Subsequently, to tackle fully decentralized, asynchronous environments with unbounded feedback delays (possibly growing superlinearly in the game's horizon), we showed that our convergence result still holds under delayed mirror descent, a variant of vanilla OMD that leverages past information even in rounds where no feedback is received. To further enhance the distributed aspect of the algorithm, in future work we intend to focus on the case where the players' gradient input is not only delayed but also subject to stochastic imperfections or, taking this to its logical extreme, where players only observe their in-game payoffs and have no gradient information at all.

7 Acknowledgments

Zhengyuan Zhou is supported by a Stanford Graduate Fellowship and would like to thank Walid Krichene and Alex Bayen for stimulating discussions (and their charismatic research style) that firmly planted the initial seeds for this work. Panayotis Mertikopoulos gratefully acknowledges financial support from the Huawei Innovation Research Program ULTRON and the ANR JCJC project ORACLESS (grant no. ANR-16-CE33-0004-01). Claire Tomlin is supported in part by the NSF CPS:FORCES grant (CNS-1239166).

References

[1] S. ARORA, E. HAZAN, AND S.
KALE, The multiplicative weights update method: A meta-algorithm and applications, Theory of Computing, 8 (2012), pp. 121-164.
[2] M. BALANDAT, W. KRICHENE, C. TOMLIN, AND A. BAYEN, Minimizing regret on reflexive Banach spaces and Nash equilibria in continuous zero-sum games, in NIPS '16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.
[3] A. BLUM, On-line algorithms in machine learning, in Online Algorithms, Springer, 1998, pp. 306-325.
[4] N. CESA-BIANCHI AND G. LUGOSI, Prediction, Learning, and Games, Cambridge University Press, 2006.
[5] J. COHEN, A. HÉLIOU, AND P. MERTIKOPOULOS, Learning with bandit feedback in potential games, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[6] T. DESAUTELS, A. KRAUSE, AND J. W. BURDICK, Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization, Journal of Machine Learning Research, 15 (2014), pp. 3873-3923.
[7] E. HAZAN, Introduction to Online Convex Optimization, Foundations and Trends in Optimization, Now Publishers, 2016.
[8] P. JOULANI, A. GYORGY, AND C. SZEPESVÁRI, Online learning under delayed feedback, in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1453-1461.
[9] A. KALAI AND S. VEMPALA, Efficient algorithms for online decision problems, Journal of Computer and System Sciences, 71 (2005), pp. 291-307.
[10] S. KRICHENE, W. KRICHENE, R. DONG, AND A. BAYEN, Convergence of heterogeneous distributed learning in stochastic routing games, in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, 2015, pp. 480-487.
[11] W. KRICHENE, B. DRIGHÈS, AND A. M. BAYEN, Online learning of Nash equilibria in congestion games, SIAM Journal on Control and Optimization, 53 (2015), pp. 1056-1081.
[12] K. LAM, W. KRICHENE, AND A. BAYEN, On learning how players learn: Estimation of learning dynamics in the routing game, in Cyber-Physical Systems (ICCPS), 2016 ACM/IEEE 7th International Conference on, IEEE, 2016, pp. 1-10.
[13] N. LITTLESTONE AND M. K. WARMUTH, The weighted majority algorithm, Information and Computation, 108 (1994), pp. 212-261.
[14] R. MEHTA, I. PANAGEAS, AND G. PILIOURAS, Natural selection as an inhibitor of genetic diversity: Multiplicative weights updates algorithm and a conjecture of haploid genetics, in ITCS '15: Proceedings of the 6th Conference on Innovations in Theoretical Computer Science, 2015.
[15] P. MERTIKOPOULOS, Learning in games with continuous action sets and unknown payoff functions. https://arxiv.org/abs/1608.07310, 2016.
[16] P. MERTIKOPOULOS, C. H. PAPADIMITRIOU, AND G. PILIOURAS, Cycles in adversarial regularized learning, in SODA '18: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, to appear.
[17] A. S. NEMIROVSKI AND D. B. YUDIN, Problem Complexity and Method Efficiency in Optimization, Wiley, New York, NY, 1983.
[18] Y. NESTEROV, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221-259.
[19] G. PALAIOPANOS, I. PANAGEAS, AND G. PILIOURAS, Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[20] K. QUANRUD AND D. KHASHABI, Online learning with adversarial delays, in Advances in Neural Information Processing Systems, 2015, pp. 1270-1278.
[21] R. T.
ROCKAFELLAR AND R. J.-B. WETS, Variational Analysis, vol. 317, Springer Science & Business Media, 2009.
[22] S. SHALEV-SHWARTZ ET AL., Online learning and online convex optimization, Foundations and Trends in Machine Learning, 4 (2012), pp. 107-194.
[23] S. SHALEV-SHWARTZ AND Y. SINGER, Convex repeated games and Fenchel duality, in Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 1265-1272.
[24] Y. VIOSSAT AND A. ZAPECHELNYUK, No-regret dynamics and fictitious play, Journal of Economic Theory, 148 (2013), pp. 825-842.
[25] L. XIAO, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), pp. 2543-2596.
[26] Z. ZHOU, N. BAMBOS, AND P. GLYNN, Dynamics on linear influence network games under stochastic environments, in International Conference on Decision and Game Theory for Security, Springer, 2016, pp. 114-126.
[27] Z. ZHOU, B. YOLKEN, R. A. MIURA-KO, AND N. BAMBOS, A game-theoretical formulation of influence networks, in American Control Conference (ACC), 2016, IEEE, 2016, pp. 3802-3807.
[28] M. ZHU AND E. FRAZZOLI, Distributed robust adaptive equilibrium computation for generalized convex games, Automatica, 63 (2016), pp. 82-91.
[29] M. ZINKEVICH, Online convex programming and generalized infinitesimal gradient ascent, in ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 928-936.