# Learning in Games with Lossy Feedback

Zhengyuan Zhou (Stanford University) zyzhou@stanford.edu
Panayotis Mertikopoulos (Univ. Grenoble Alpes, CNRS, Inria, LIG) panayotis.mertikopoulos@imag.fr
Susan Athey (Stanford University) athey@stanford.edu
Nicholas Bambos (Stanford University) bambos@stanford.edu
Peter Glynn (Stanford University) glynn@stanford.edu
Yinyu Ye (Stanford University) yinyu-ye@stanford.edu

Abstract

We consider a game-theoretical multi-agent learning problem where the feedback information can be lost during the learning process and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the classical online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts ROGD, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, these results contribute to the broad landscape of multi-agent online learning by significantly relaxing the feedback information that is required to achieve desirable outcomes.

1 Introduction

In multi-agent online learning [13, 14, 21, 45], a set of agents repeatedly interact with the environment and each other while accumulating rewards in an online manner. The key feature in this problem is that to each agent, the environment consists of all other agents who are simultaneously making such sequential decisions; hence, each agent's reward depends not only on their own action, but also on the joint action of all other agents. A common way to model reward structures in this multi-agent online learning setting is through a repeated but otherwise unknown stage game: the reward of the $i$-th agent at iteration $n$ is $r_i(a_i^n, a_{-i}^n)$, where $a_i^n$ is agent $i$'s action at stage $n$ and $a_{-i}^n$ is the vector of all other agents' actions at stage $n$. Even though the underlying stage game is fixed (i.e., each $r_i(\cdot)$ is fixed), from each agent's own perspective, its reward is nevertheless a time-varying function when viewed solely as a function of its own action, i.e., $r_i^n(\cdot) = r_i(\cdot, a_{-i}^n)$, and it needs to select an action $a_i^n$ before receiving the reward $r_i^n(a_i^n)$. As such, each agent is exactly engaged in a classical online learning process [9, 23, 42, 43]. In this context, there is a fruitful line of existing literature providing a rich source of online learning algorithms to minimize the standard performance metric known as regret [10], defined as
$$\mathrm{Reg}_T^{\mathrm{Alg}} = \max_{a_i \in \mathcal{A}_i} \sum_{t=1}^{T} \big\{ r_i^t(a_i) - r_i^t(a_i^t) \big\},$$
with the sequence of actions $a_i^t$ generated by some online learning algorithm Alg. These algorithms, already classical, include follow-the-regularized-leader [26], online gradient descent [57], multiplicative/exponential weights [3], online mirror descent [44], and many others. Perhaps the simplest algorithm in the above list is Zinkevich's online gradient descent (OGD) method, where the agent simply takes a gradient step (using their current reward function) to form the action for the next stage, performing a projection if the iterate steps out of the feasible action set. This algorithm, straightforward as it is, provides (asymptotically) optimal regret guarantees (see Section B in the appendix for a brief review) and is arguably one of the most well-studied and widely-used algorithms in online learning theory and applications [24, 39, 57].
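For concreteness, the following is a minimal sketch of the single-agent OGD update just described; the quadratic reward sequence, the box action set, and the step-size schedule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project(x, lo=-1.0, hi=1.0):
    # Euclidean projection onto a box feasible set; stands in for Proj_A.
    return np.clip(x, lo, hi)

# Illustrative time-varying rewards r^t(a) = -(a - theta_t)^2, concave in a.
rng = np.random.default_rng(0)
thetas = rng.uniform(-0.5, 0.5, size=200)

a = np.array([0.9])                      # initial action in A = [-1, 1]
for t, theta in enumerate(thetas, start=1):
    grad = -2.0 * (a - theta)            # gradient of the current reward at a
    gamma = 1.0 / np.sqrt(t)             # a standard step-size choice for OGD
    a = project(a + gamma * grad)        # gradient (ascent) step, then project
print("final action:", a)
```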
Consequently, several natural questions arise in multi-agent online learning: if each agent adopts OGD to minimize regret, what is the resulting evolution of the joint action? In particular, under what games/assumptions would it lead to a Nash equilibrium (the leading solution concept in multi-agent games)? These questions fall under the broader inquiry of understanding the joint convergence of no-regret online learning algorithms, an inquiry that lies at the heart of game-theoretical learning for well-grounded reasons: studying whether the process converges at all provides an answer as to the stability of the joint learning dynamics, while studying whether it converges to a Nash equilibrium provides an answer as to whether the joint learning dynamics lead to the emergence of rationality. For the latter point, if, when agents adopt an online learning algorithm, the joint action converges to a point that is not a Nash equilibrium, each agent would be able to do better by unilaterally deviating from that algorithm. Hence, convergence to a non-Nash point would inherently produce "regret" in equilibrium.

1.1 Related Work

Despite the fact that game-theoretic learning has received significant scrutiny in the literature, the questions raised above are still open, for several reasons. First, in general, joint convergence of no-regret learning does not hold. In fact, even in simple finite games (where each agent has a finite number of actions), no-regret learning may fail to converge [31]. Even if it does converge, in mixed extensions of finite games (where each agent's action set is a probability simplex over a finite number of actions), the limit can assign positive weight only to strictly dominated actions. Consequently, establishing joint convergence to Nash equilibria under no-regret learning algorithms (OGD included) for a broad and meaningful subclass of games has been a challenging ongoing effort. Second, the extensive existing literature in the field has mainly focused on studying convergence in (mixed extensions of) finite games [8, 34, 45, 46]. Earlier work on game-theoretic learning (see [18] for a comprehensive review) has mainly focused on learning in finite games with dynamics that are not necessarily regret-less (e.g., best-response dynamics, fictitious play and the like). Subsequently, the seminal work [14] (see also the many references therein) has provided a unified treatment of the joint convergence of various no-regret learning algorithms in mixed extensions of finite games. The primary focus of [14] is convergence to other, coarser equilibrium notions (e.g., correlated equilibria), where a fairly complete characterization is given. On the other hand, as pointed out in [14], convergence to Nash is a much more difficult problem: recent results of [47] have clearly highlighted the gap between (coarse) correlated equilibria obtained by no-regret learning processes and Nash equilibria.
More positive results can be obtained in the class of potential games where, in a recent paper, the authors of [15] established the convergence of multiplicative weights and other regularized strategies in potential games with only payoff-based, bandit feedback. However, moving beyond mixed extensions of finite games, much less is known, and only a moderate amount of literature exists. In the context of mixed-strategy learning in games with continuous action spaces, the authors of [37] provide a convergence analysis for a perturbed version of the multiplicative weights algorithm in potential games. In a pure-strategy setting, the network games considered in [29] belong to the much broader class of games known as concave games: each agent's reward function is individually concave (see footnote 1). Therein, the dynamics investigated may lead to positive regret in the limit. The recent paper [5] studied a two-player continuous zero-sum game, and showed that if both players adopt a no-regret learning algorithm, then the empirical time-average of the joint action converges to Nash equilibria. However, barring a few recent exceptions, the territory of no-regret learning in concave games is not well understood (let alone in general games with continuous action sets). An exception to this are the recent papers [30, 32], where the authors establish the convergence of mirror descent in monotone games, a result later extended to learning with bandit, zeroth-order feedback in [7, 11].

Footnote 1: In this vein, mixed extensions of finite games can be called linear games, because each agent's reward is individually linear in its own action. Note also that convex potential games are a subclass of concave games. The following gives the chain of inclusions: finite games, mixed extensions of finite games, concave games, general continuous games, each class contained in the next.

Third, the convergence mode that is commonly adopted in the existing literature is ergodic convergence (i.e., convergence of $\frac{1}{n}\sum_{t=1}^{n} a^t$), rather than the convergence of the actual sequence of play (i.e., of $a^n$). The former is convergence of the empirical frequency of play, while the latter is convergence of actual play: note that the latter implies the former, but not the other way round. It is important to point out, however, that convergence of the actual sequence of play is more desirable (see footnote 2), not only because it is theoretically more appealing, but also because it characterizes the joint evolution of the system in no-regret learning (since $a^n$, rather than $\frac{1}{n}\sum_{t=1}^{n} a^t$, is the actual action taken). This issue was highlighted in [31], where it is shown that even though continuous-time follow-the-regularized-leader (another no-regret learning algorithm) converges to Nash equilibrium in linear zero-sum games in the sense of time-averages, actual play orbits Nash equilibria in perpetuity. However, most of the existing work focuses on establishing ergodic convergence (which, granted, is some form of stability), where the tools developed are far from sufficient to ensure convergence of actual play. Some recent exceptions in specific games do exist: [27, 28] studied nonatomic routing games (a special class of concave games) and established that both online mirror descent and multiplicative weights (yet another two no-regret algorithms) converge to Nash equilibria, while [36] established that multiplicative weights converges to Nash equilibria under certain conditions in atomic routing games (a subclass of finite games).

Footnote 2: Of course, convergence in ergodic average is still valuable in many contexts, particularly when convergence of actual play fails to hold.
Theoretically, the main difficulty in establishing convergence of the sequence of play lies in cleverly designing (as done in [27, 28, 36, 53]) specific and much sharper Lyapunov functions for the specific learning algorithms at hand. This difficulty is partially overcome in [11] for strongly monotone games where, assuming perfect gradient observations, an $O(1/n)$ convergence rate is established.

1.2 Our Contributions

In this paper, we study the convergence of OGD to Nash equilibrium in general continuous games. To the best of our knowledge, the existing state-of-the-art convergence guarantee is that multi-agent OGD converges to Nash equilibria in monotone games in the sense of ergodic averages, a result due to [35]. Monotone games (see footnote 3) are an important and well-known subclass of concave games (in particular, they include convex potential games as a special case). On a related note, a very recent paper [41] considered pseudo-monotone games (which strictly contain monotone games but still belong to concave games) and devised two distributed algorithms for computing Nash equilibria. However, it is far from clear that the algorithms devised are no-regret when used in an online fashion. In this paper, we work with a broad class of general continuous games (not necessarily even concave), known as variationally stable games [53, 54], that strictly includes pseudo-monotone games (and hence, in a chain of inclusions, monotone games and convex potential games), and also nonatomic routing games, linear Cournot games, resource allocation auctions, etc. More importantly, we go a step further and consider an even more general multi-agent online learning problem, where we allow agents to have asynchronous gradient feedback losses (note that the game-theoretical learning discussion so far is under perfect feedback, where the resulting landscape is already challenging). More specifically, instead of assuming that every agent receives information at every stage, we allow for cases where only a (random) subset of agents remains informed (specifically, each agent $i$ has a probability $p_i$ of receiving the feedback and probability $1 - p_i$ of losing it). Two important features of this model are: 1) the feedback loss can be asynchronous; 2) the feedback loss can be arbitrarily correlated among agents (whether some agent loses its feedback on the current iteration may affect whether others lose theirs). In this asynchronous context, we design a simple variant of OGD, called reweighted online gradient descent (ROGD), where each agent corrects its own marginal bias by dividing the received gradient (if any) by $p_i$. This is inspired by the classical EXP3 algorithm [4, 12] in the multi-armed bandit literature, where the observed bandit feedback is similarly reweighted by the probability of selecting the corresponding arm. We then establish in Theorem 4.3 that, in this asynchronous feedback loss setting, the sequence of play induced by joint ROGD converges almost surely to Nash equilibria in variationally stable games.

Footnote 3: In short, it means that $(\nabla_{a_1} r_1(\cdot), \nabla_{a_2} r_2(\cdot), \ldots, \nabla_{a_N} r_N(\cdot))$ is a monotone operator on the joint action space. Note that if there is only one player, this means the underlying problem is a convex optimization problem, since the gradient of a convex function is monotone. In this case, a Nash equilibrium is a globally optimal solution.
We achieve this strong theoretical guarantee by designing an energy function that sharply tracks the progress made in the joint evolution of the ROGD updates of all agents under feedback loss. We mention that a very recent work that also studies multi-agent online learning under imperfect information at the generality of variationally stable games is [53]. In particular, it is shown there that online mirror descent (also no-regret) converges to Nash even in the presence of delays in the gradients that grow super-linearly but sub-quadratically. However, in that context, in addition to studying a different algorithm, [53] focuses on delays and in particular requires all gradients to be received eventually (i.e., no gradient loss is allowed). Consequently, the results in [53] are strictly complementary to ours here: in particular, it is unclear whether online mirror descent would converge to Nash under feedback loss. Finally, we make two practically useful and theoretically meaningful extensions. First, we extend to the case where the loss probabilities (which are assumed known up to that point) are not known. In this case, we can replace the actual probabilities $p_i$ by estimates. Our main message (Theorem 5.1) is that provided the estimator is $\sqrt{n}$-consistent, convergence to Nash in the last iterate under ROGD is still guaranteed. Note that a simple estimator that satisfies this guarantee is the sample mean estimator (with Laplace smoothing). Second, we extend the multi-agent online learning setup to also accommodate stochastic reward functions, where in each iteration only a random reward function is realized. In such cases, we first note an important equivalence result, both from a modeling standpoint and from a proof standpoint: the setup where agents' reward functions are i.i.d. can be identified with the setup where agents' reward functions are fixed and deterministic but the received feedback gradients are noisy (yet unbiased). We then establish that (in either setup) multi-agent ROGD learning still converges almost surely to the set of Nash equilibria when the noise has bounded support (Theorem E.4 in the appendix), and converges in probability to the set of Nash equilibria if the noise has unbounded support but finite second moment (Theorem E.6 in the appendix); both are last-iterate results. Due to space limitations, this part is placed in the appendix. Together, these results not only make meaningful progress towards the challenging open problem of convergence of no-regret algorithms to Nash in general continuous games under perfect information, but also, more importantly, contribute to the broad landscape of multi-agent online learning under imperfect information.

2 Problem setup

2.1 Rewards for Multi-Agent Online Learning

We consider a general continuous game model for the rewards in multi-agent online learning:

Definition 2.1. A continuous game $\mathcal{G}$ is a tuple $\mathcal{G} = (\mathcal{N}, \mathcal{A} = \prod_{i=1}^{N} \mathcal{A}_i, \{r_i\}_{i=1}^{N})$, where $\mathcal{N}$ is the set of $N$ agents, $\mathcal{A}_i$ is a convex and compact subset of $\mathbb{R}^{d_i}$ representing the action space of agent $i$, and $r_i : \mathcal{A} \to \mathbb{R}$ is agent $i$'s reward function, such that $r_i(a) = r_i(a_1, a_2, \ldots, a_N)$ is continuous in $a$, continuously differentiable in $a_i$ for each $i \in \mathcal{N}$, and $\nabla_{a_i} r_i(a)$ is Lipschitz continuous in $a$.

Definition 2.2. Given a continuous game $\mathcal{G}$, $a^* \in \mathcal{A}$ is called a Nash equilibrium if, for each $i \in \mathcal{N}$, $r_i(a_i^*, a_{-i}^*) \ge r_i(a_i, a_{-i}^*)$ for all $a_i \in \mathcal{A}_i$.

For the rest of the paper, we set $d = \sum_{i=1}^{N} d_i$, which denotes the dimension of the joint action space: $\mathcal{A} \subseteq \mathbb{R}^d$. Variables associated with agent $i$ are denoted by subscripts; if there is no subscript, the variable is understood to be a joint variable of all agents.
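As a concrete illustration of Definition 2.1, the following minimal sketch encodes a hypothetical two-player quadratic game together with its partial gradients; the specific rewards and box action sets are assumptions made for illustration only.

```python
import numpy as np

# A hypothetical two-player continuous game on A_1 = A_2 = [-1, 1]:
#   r_1(a_1, a_2) = -a_1**2 - a_1 * a_2      (concave in a_1)
#   r_2(a_1, a_2) = -a_2**2 + a_1 * a_2      (concave in a_2)

def reward(i, a):
    a1, a2 = a
    return -a1 ** 2 - a1 * a2 if i == 0 else -a2 ** 2 + a1 * a2

def partial_grad(i, a):
    # Partial gradient of r_i with respect to agent i's own action.
    a1, a2 = a
    return -2 * a1 - a2 if i == 0 else -2 * a2 + a1

def project(x):
    # Euclidean projection onto the action set [-1, 1].
    return float(np.clip(x, -1.0, 1.0))

# a* = (0, 0) is a Nash equilibrium: both partial gradients vanish there,
# so neither agent can improve its reward by a unilateral deviation.
print(partial_grad(0, (0.0, 0.0)), partial_grad(1, (0.0, 0.0)))  # both zero
```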
When each agent applies OGD independently (see footnote 4), we obtain multi-agent OGD in Algorithm 1 (we assume $\sum_{n=1}^{\infty} \gamma_{n+1} = \infty$ and $\sum_{n=1}^{\infty} \gamma_{n+1}^2 < \infty$):

Footnote 4: Note that, in particular, the gradient here is a partial gradient with respect to one's own action, and its dimension is equal to $d_i$, the dimension of the agent's own action space.

Algorithm 1: Multi-Agent OGD Learning
Require: Each agent $i$ picks an arbitrary $a_i^0 \in \mathbb{R}^{d_i}$
1: $n \leftarrow 0$, $y_i^0 \leftarrow a_i^0$
2: repeat
3:   for each agent $i$ do
4:     $y_i^{n+1} = y_i^n + \gamma_{n+1} \nabla_{a_i} r_i(a^n)$
5:     $a_i^{n+1} = \mathrm{Proj}_{\mathcal{A}_i}(y_i^{n+1})$
6:   end for
7:   $n \leftarrow n + 1$
8: until end

2.2 Asynchronous Feedback Loss

The setup described in Algorithm 1 is multi-agent learning under perfect feedback information. Here we extend the model to allow for asynchronous feedback loss, where at each time step only a subset $\mathcal{N}^{n+1} \subseteq \mathcal{N}$ of the agents receive their gradients from time $n$ (see footnote 5), while the other agents' gradients are lost. We assume that this subset is drawn i.i.d. from a fixed distribution across time steps. To facilitate the discussion, let the indicator variable $I_i^{n+1}$ be 1 if agent $i$'s gradient from $n$ is received (at the beginning of $n+1$) and 0 otherwise. Mathematically:
$$I_i^{n+1} = \begin{cases} 1, & \text{if } i \in \mathcal{N}^{n+1}, \\ 0, & \text{if } i \notin \mathcal{N}^{n+1}. \end{cases} \quad (2.1)$$

Footnote 5: Here the superscript $n+1$ means that the set $\mathcal{N}^{n+1}$ is revealed at the beginning of time $n+1$.

Note that under this model, even though $\mathcal{N}^{n+1}$ is drawn i.i.d. across time steps, within the same time step $I_i^{n+1}$ and $I_j^{n+1}$ can be arbitrarily correlated (whether agent $i$'s feedback is lost can influence whether agent $j$'s feedback is lost). We assume each agent $i$ only knows its own marginal loss probability $1 - p_i$, where $p_i = \mathbb{E}[I_i^{n+1}]$. We refer to this setup as asynchronous feedback loss, because different agents can have different feedback loss patterns at any iteration. Applying multi-agent OGD in the feedback loss model yields:

Algorithm 2: Multi-Agent OGD Learning under Asynchronous Feedback Loss
Require: Each agent $i$ picks an arbitrary $A_i^0 \in \mathbb{R}^{d_i}$
1: $n \leftarrow 0$, $Y_i^0 \leftarrow A_i^0$
2: repeat
3:   for each agent $i$ do
4:     $Y_i^{n+1} = \begin{cases} Y_i^n + \gamma_{n+1} \nabla_{a_i} r_i(A^n), & \text{if } I_i^{n+1} = 1 \\ Y_i^n, & \text{if } I_i^{n+1} = 0 \end{cases}$
5:     $A_i^{n+1} = \mathrm{Proj}_{\mathcal{A}_i}(Y_i^{n+1})$
6:   end for
7:   $n \leftarrow n + 1$
8: until end

Here we have capitalized the iterates $Y_i^n$, $A_i^n$ because they are now random variables (due to random feedback loss). Specifically, denoting by $I^n$ the vector of indicator variables of all the agents, both $Y_i^n$ and $A_i^n$ are adapted to $A^0, I^1, I^2, \ldots, I^n$. Note the important but subtle point here: $Y_i^n$ and $A_i^n$ are not adapted to $A_i^0, I_i^1, I_i^2, \ldots, I_i^n$, because each individual's gradient update is coupled with all the other agents' actions, which is a fundamental phenomenon in multi-agent learning.

2.3 Variationally Stable Games

In general, no-regret learning need not converge to Nash equilibria; consequently, special structures must be imposed on the games/reward functions. Here we consider a broad meta-class of games, called variationally stable games [32, 54], that contains many existing well-known classes of games, such as convex potential games, monotone games, pseudo-monotone games [56], coercive games [41], influence network games [52, 55], and non-atomic routing games [40]. See [53] for more details.

Definition 2.3. $\mathcal{G}$ is a variationally stable game if its set of Nash equilibria $\mathcal{A}^*$ is non-empty and satisfies
$$\langle \nabla_a r(a), a - a^* \rangle \triangleq \sum_{i=1}^{N} \langle \nabla_{a_i} r_i(a), a_i - a_i^* \rangle \le 0, \quad \forall a \in \mathcal{A},\ \forall a^* \in \mathcal{A}^*,$$
with equality if and only if $a \in \mathcal{A}^*$.

Remark 1. Monotone games require $\langle \nabla_a r(a) - \nabla_a r(a'), a - a' \rangle \le 0$ for all $a, a' \in \mathcal{A}$. Pseudo-monotone games require: if $\langle \nabla_a r(a'), a - a' \rangle \le 0$, then $\langle \nabla_a r(a), a - a' \rangle \le 0$, for all $a, a' \in \mathcal{A}$. That monotone is a special case of pseudo-monotone is obvious. That pseudo-monotone is a special case of variational stability follows by recalling the standard characterization of a Nash equilibrium: $a^*$ satisfies $\langle \nabla_a r(a^*), a - a^* \rangle \le 0$ for all $a \in \mathcal{A}$.
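As a quick sanity check of Definition 2.3, the following sketch numerically verifies the variational stability inequality for the hypothetical two-player quadratic game introduced above; the sampling scheme is an illustrative choice, not a proof.

```python
import numpy as np

# Hypothetical two-player quadratic game with rewards
#   r_1(a) = -a_1**2 - a_1 * a_2,   r_2(a) = -a_2**2 + a_1 * a_2,
# whose unique Nash equilibrium is a* = (0, 0).

def v(a):
    # Profile of partial gradients (grad_{a_1} r_1, grad_{a_2} r_2).
    return np.array([-2 * a[0] - a[1], -2 * a[1] + a[0]])

a_star = np.zeros(2)
rng = np.random.default_rng(1)
worst = -np.inf
for _ in range(10_000):
    a = rng.uniform(-1.0, 1.0, size=2)
    worst = max(worst, v(a) @ (a - a_star))  # Definition 2.3: should be <= 0
print("largest sampled value of <v(a), a - a*>:", worst)  # non-positive
```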
In the presence of feedback loss, the convergence of multi-agent OGD as given in Algorithm 2 to Nash equilibria cannot be guaranteed in variationally stable games (and there exist cases where it fails; in fact, convergence of OGD is not guaranteed even in convex potential games). In the next section, we present a simple modification of the vanilla multi-agent OGD that will later be shown to converge to Nash equilibria, even in the presence of asynchronous feedback loss. We shall then see that, as a corollary (see Corollary 4.4), multi-agent OGD converges to the set of Nash equilibria (almost surely) if the feedback loss is synchronous (and, as a further special case, if there is no feedback loss).

3 Algorithm and energy function

In this section, to deal with asynchronous feedback loss, we give a new algorithm, called Reweighted Online Gradient Descent (ROGD), for the multi-agent learning problem.

3.1 Reweighted Online Gradient Descent

The main idea in the modification of OGD lies in each agent individually correcting its marginal bias (which comes from the loss of the gradient feedback) by dividing by its feedback probability. More specifically, when a gradient is lost on the current time step, the agent uses the previous action, as in OGD. Otherwise, when a gradient is indeed available, the gradient is reweighted by $1/p_i$ before the update is made. This results in reweighted online gradient descent:

Algorithm 3: Multi-Agent ROGD Learning under Asynchronous Feedback Loss
Require: Each agent $i$ picks an arbitrary $A_i^0 \in \mathbb{R}^{d_i}$
1: $n \leftarrow 0$, $Y_i^0 \leftarrow A_i^0$
2: repeat
3:   for each agent $i$ do
4:     $Y_i^{n+1} = \begin{cases} Y_i^n + \gamma_{n+1} \dfrac{\nabla_{a_i} r_i(A^n)}{p_i}, & \text{if } I_i^{n+1} = 1 \\ Y_i^n, & \text{if } I_i^{n+1} = 0 \end{cases}$
5:     $A_i^{n+1} = \mathrm{Proj}_{\mathcal{A}_i}(Y_i^{n+1})$
6:   end for
7:   $n \leftarrow n + 1$
8: until end

We emphasize once again that the gradient loss among agents can be correlated. However, in ROGD, each agent only corrects its own marginal bias and does not concern itself with other agents' feedback loss. To analyze the ROGD algorithm, we next introduce an important theoretical tool.

3.2 Energy Function

We start by defining the energy function, an important tool we use to establish convergence later. Throughout, $a^*$ denotes an arbitrary but fixed Nash equilibrium.

Definition 3.1. Define the energy function $L : \mathbb{R}^d \to \mathbb{R}$ as follows:
$$L(y) = \|a^*\|_2^2 - \|\mathrm{Proj}_{\mathcal{A}}(y)\|_2^2 + 2\,\langle y, \mathrm{Proj}_{\mathcal{A}}(y) - a^* \rangle. \quad (3.1)$$

Lemma 3.2. Let $\{y^n\}$ be a sequence in $\mathbb{R}^d$. Then:
1. $L(y^n) \to 0$ implies $\mathrm{Proj}_{\mathcal{A}}(y^n) \to a^*$.
2. $\|\mathrm{Proj}_{\mathcal{A}}(y) - \hat{y}\|_2^2 - \|\mathrm{Proj}_{\mathcal{A}}(\hat{y}) - \hat{y}\|_2^2 \le \|y - \hat{y}\|_2^2$, for any $y, \hat{y} \in \mathbb{R}^d$.
3. $L(y + \Delta y) - L(y) \le 2\,\langle \Delta y, \mathrm{Proj}_{\mathcal{A}}(y) - a^* \rangle + \|\Delta y\|_2^2$, for any $y, \Delta y \in \mathbb{R}^d$.

Remark 2. The second statement of the lemma serves as an important intermediate step in proving the third statement, and is established by leveraging the envelope theorem and several important properties of the Euclidean projection. To see that this is not trivial, consider the quantity $\|\mathrm{Proj}_{\mathcal{A}}(y) - \hat{y}\|_2 - \|\mathrm{Proj}_{\mathcal{A}}(\hat{y}) - \hat{y}\|_2$, which we know by the triangle inequality satisfies
$$\|\mathrm{Proj}_{\mathcal{A}}(y) - \hat{y}\|_2 - \|\mathrm{Proj}_{\mathcal{A}}(\hat{y}) - \hat{y}\|_2 \le \|\mathrm{Proj}_{\mathcal{A}}(y) - \mathrm{Proj}_{\mathcal{A}}(\hat{y})\|_2 \le \|y - \hat{y}\|_2, \quad (3.2)$$
where the last inequality follows from the fact that the projection is a non-expansive map. However, this inequality is not sufficient for our purposes because, in quantifying the perturbation $L(y + \Delta y) - L(y)$, we also need the squared term $\|\Delta y\|_2^2$, which is not easily obtainable from Equation (3.2). In fact, a finer-grained analysis is needed to establish that $\|y - \hat{y}\|_2^2$ is an upper bound on $\|\mathrm{Proj}_{\mathcal{A}}(y) - \hat{y}\|_2^2 - \|\mathrm{Proj}_{\mathcal{A}}(\hat{y}) - \hat{y}\|_2^2$.
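As a purely numerical illustration of Definition 3.1 and Lemma 3.2 (not a substitute for the proof in the appendix), the following sketch evaluates the energy function for a box action set and spot-checks statements 2 and 3 on random points; the box set, the equilibrium location, and the sampling scheme are assumptions for illustration.

```python
import numpy as np

a_star = np.array([0.3, -0.2])   # a hypothetical Nash equilibrium in A = [-1, 1]^2
rng = np.random.default_rng(2)

def proj(y):
    # Euclidean projection onto the box A = [-1, 1]^2.
    return np.clip(y, -1.0, 1.0)

def L(y):
    # Energy function of Definition 3.1.
    x = proj(y)
    return a_star @ a_star - x @ x + 2 * y @ (x - a_star)

for _ in range(10_000):
    y, y_hat, dy = (rng.normal(scale=2.0, size=2) for _ in range(3))
    # Lemma 3.2, statement 2: squared-distance perturbation bound.
    lhs = np.sum((proj(y) - y_hat) ** 2) - np.sum((proj(y_hat) - y_hat) ** 2)
    assert lhs <= np.sum((y - y_hat) ** 2) + 1e-9
    # Lemma 3.2, statement 3: perturbation bound on the energy function.
    assert L(y + dy) - L(y) <= 2 * dy @ (proj(y) - a_star) + dy @ dy + 1e-9
print("L at the equilibrium:", L(a_star))  # 0: the energy vanishes there
```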
4 Almost sure convergence to Nash equilibria

In this section, we establish that when the agents' rewards come from a variationally stable game, multi-agent ROGD converges to the set of Nash equilibria almost surely under asynchronous feedback loss. For ease of exposition, we break the framework into three steps, each of which centers on one idea and is described in detail in a subsection. All the proof details are given in the appendix. For the first two subsections, let $a^*$ be an arbitrary Nash equilibrium.

4.1 Controlling the Tail Behavior of the Expectation

Our first step lies in controlling the tail behavior of the expected value of the sequence $\{\langle \nabla_a r(A^n), a^* - A^n \rangle\}_{n=0}^{\infty}$. Note that, by variational stability, $\langle \nabla_a r(A^n), a^* - A^n \rangle \ge 0$ a.s. for every $n$. Consequently, $\mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] \ge 0$ for all $n$. By leveraging the energy function, its telescoping sum, and an appropriate conditioning, we show in the next lemma that this sequence of expectations must be rather small in the limit, in the sense that its tail "mostly" goes to 0.

Lemma 4.1.
$$\sum_{n=0}^{\infty} \gamma_{n+1}\, \mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] < \infty. \quad (4.1)$$

Remark 3. Since $\sum_{n=0}^{\infty} \gamma_{n+1} = \infty$, Lemma 4.1 implies that $\liminf_{n\to\infty} \mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] = 0$. Note that the converse is not true: when a subsequence of $\{\mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle]\}_{n=0}^{\infty}$ converges to 0, the sum need not be finite. As a simple example, consider $\gamma_{n+1} = \frac{1}{n}$ and
$$\mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] = \begin{cases} \frac{1}{n}, & \text{if } n = 2^k \text{ for some integer } k, \\ 1, & \text{otherwise.} \end{cases} \quad (4.2)$$
Then the subsequence on the indices $2^k$ converges to 0, but the sum still diverges. This means that Equation (4.1) is stronger than subsequence convergence.

4.2 Bounding the Successive Differences

However, Equation (4.1) is still not strong enough to guarantee that $\lim_{n\to\infty} \mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] = 0$, let alone $\lim_{n\to\infty} \langle \nabla_a r(A^n), a^* - A^n \rangle = 0$ a.s. This is because the convergent sum given in Equation (4.1) only limits the tail growth somewhat, but not completely. As an example to demonstrate this point, let $C_n$ be the following Boolean variable (see footnote 6):
$$C_n = \begin{cases} 1, & \text{if } n \text{ contains the digit 9 in its decimal expansion,} \\ 0, & \text{otherwise.} \end{cases} \quad (4.3)$$

Footnote 6: By its definition, $C_9 = 1$ and $C_{11} = 0$.

Now define $\gamma_{n+1} = \frac{1}{n}$ and
$$\mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] = \begin{cases} \frac{1}{n}, & \text{if } C_n = 1, \\ 1, & \text{if } C_n = 0. \end{cases}$$
Then it follows from a straightforward calculation that $\sum_{n=0}^{\infty} \gamma_{n+1}\, \mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle] < \infty$ (see Problem 1.3.24 in [16]). However, the limit of $\mathbb{E}[\langle \nabla_a r(A^n), a^* - A^n \rangle]$ does not exist. This indicates that, to obtain almost sure convergence of $\langle \nabla_a r(A^n), a^* - A^n \rangle$ to 0, we need to impose more stringent conditions to ensure sufficient tail decay. One way is to bound the difference between every two successive terms by a decreasing sequence. This ensures that $\langle \nabla_a r(A^n), a^* - A^n \rangle$ cannot change too much from iteration to iteration; further, the change between two successive terms becomes smaller and smaller. This is formalized in the following lemma (the proof is given in the appendix):

Lemma 4.2. There exists a constant $C > 0$ such that for every $n$,
$$\big| \langle \nabla_a r(A^{n+1}), a^* - A^{n+1} \rangle - \langle \nabla_a r(A^n), a^* - A^n \rangle \big| \le C \gamma_{n+1}, \quad \text{a.s.} \quad (4.4)$$

4.3 Main Convergence Result

We are now finally ready to put all the pieces together and state the main convergence result.

Theorem 4.3. Let the reward functions be given by a variationally stable game, and define the point-to-set distance in the standard way: $\mathrm{dist}(a, \mathcal{A}^*) = \inf_{a^* \in \mathcal{A}^*} \|a - a^*\|_2$. Then for any strictly positive probabilities $\{p_i\}_{i=1}^{N}$, ROGD converges almost surely to the set of Nash equilibria: $\lim_{n\to\infty} \mathrm{dist}(A^n, \mathcal{A}^*) = 0$ a.s., where $A^n$ is the sequence generated by Algorithm 3.
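As an illustrative companion to Theorem 4.3 (a simulation sketch, not part of the analysis), the following runs Algorithm 3 on the hypothetical two-player quadratic game from Section 2, with correlated feedback losses and unequal probabilities $p_i$; all numerical choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical two-player quadratic game (variationally stable, a* = (0, 0)).
def partial_grads(a):
    return np.array([-2 * a[0] - a[1], -2 * a[1] + a[0]])

p = np.array([0.7, 0.4])          # marginal probabilities of receiving feedback
Y = np.array([0.9, -0.8])         # arbitrary initial scores
A = np.clip(Y, -1.0, 1.0)         # initial joint action
for n in range(1, 50_001):
    gamma = 1.0 / n               # step-sizes: sum diverges, sum of squares converges
    u = rng.random()              # one shared uniform draw, so losses are correlated
    I = (u < p).astype(float)     # each I_i is Bernoulli(p_i) marginally
    # ROGD update: reweight any received gradient by 1/p_i, otherwise keep Y_i.
    Y = Y + I * gamma * partial_grads(A) / p
    A = np.clip(Y, -1.0, 1.0)     # projection onto each agent's action set [-1, 1]
print("distance to the Nash equilibrium:", np.linalg.norm(A))  # approaches 0
```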
Remark 4. As a quick outline (the details are in the appendix): pick an arbitrary Nash equilibrium $a^* \in \mathcal{A}^*$. Lemma 4.1 and Lemma 4.2 together ensure that $\lim_{n\to\infty} \langle \nabla_a r(A^n), a^* - A^n \rangle = 0$ a.s. Since $\langle \nabla_a r(a), a^* - a \rangle > 0$ if and only if $a \notin \mathcal{A}^*$ and $\langle \nabla_a r(a), a^* - a \rangle = 0$ if and only if $a \in \mathcal{A}^*$, it then follows by continuity of $\nabla_a r(\cdot)$ that $\lim_{n\to\infty} \langle \nabla_a r(A^n), a^* - A^n \rangle = 0$ a.s. implies $\lim_{n\to\infty} \mathrm{dist}(A^n, \mathcal{A}^*) = 0$ a.s. Although not mentioned in the theorem, another useful and interesting structural insight is that, as $\langle \nabla_a r(A^n), a^* - A^n \rangle$ converges to 0, if it ever becomes exactly 0 at some $n$, then $A^n \in \mathcal{A}^*$, and furthermore the joint action stays exactly at that Nash equilibrium forever. Why? There are two cases to consider.

1. First, this Nash equilibrium is an interior point of $\mathcal{A}$. In this case, the gradient vanishes there, i.e., $\nabla_a r(A^n) = 0$. Consequently, per the ROGD update rule, whether any agent updates or not does not matter: either the gradient is not received, in which case no gradient update happens; or a gradient is received, but at this Nash equilibrium it is 0 and therefore no agent makes any update.

2. Second, this Nash equilibrium is a boundary point of $\mathcal{A}$. In this case, $\nabla_a r(A^n)$ may not be 0, but it always points outside the feasible action set $\mathcal{A}$. Consequently, even if an agent receives a gradient and hence makes a gradient update, its action immediately gets projected back to the same point. As a result, the joint action stays exactly at that Nash equilibrium (even though the $Y^n$ variables may still change).

Corollary 4.4. Under the same setup as in Theorem 4.3, if the feedback loss is synchronous on average ($p_i = p_j$ for all $i, j$), then multi-agent OGD in Algorithm 2 converges almost surely to $\mathcal{A}^*$ in the last iterate. This can be seen by noting that the common probability can be absorbed into the step-size.

Corollary 4.5. Under the same setup as in Theorem 4.3, if there is no feedback loss ($p_i = 1$ for all $i$), multi-agent OGD in Algorithm 2 converges to $\mathcal{A}^*$ in the last iterate.

Remark 5. Note that, as stated, our results require a joint learning step-size policy, even if there is no missing gradient feedback; otherwise, the players' individual gradients weighted by individual step-sizes might not be stable and can cause divergence (and, of course, the situation only becomes worse in the lossy regime). That said, our analysis still holds if each player $i \in \mathcal{N}$ uses an individual step-size policy $\gamma_n^i$ such that $\limsup_{n\to\infty} \gamma_n^i / \gamma_n^j < \infty$ for all pairs $i, j \in \mathcal{N}$, i.e., if the players' updates do not follow different time-scales (so to speak). In that case, however, the proofs (and the overall write-up) would become much more cumbersome, so we avoided this extra degree of generality in this paper.

5 Extension: Unknown Loss Probabilities

So far we have assumed that the $p_i$'s are known. We close the paper with a brief comment on how to remove this assumption. When each agent $i$ does not know the underlying probability $p_i$, ROGD is no longer feasible. To overcome this, we use an estimator $\hat{p}_i$ (obtainable from the past history) in place of the true probability $p_i$. Since the only information an agent has is the past history of received gradients, we require the estimator $\hat{p}_i$ to be adapted to the sequence of indicator variables $I_i^t$: $\hat{p}_i^n = \hat{p}_i^n(I_i^1, \ldots, I_i^n)$. The resulting algorithm is called EROGD. An estimator $\hat{p}^n$ is called $\sqrt{n}$-consistent if $\mathbb{E}[(\hat{p}^n - p)^2] = O(\frac{1}{n})$, where $p$ is the true parameter (it is called $\sqrt{n}$-consistent because the root-mean-squared error is typically used to define the rate of consistency).
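For concreteness, here is a minimal sketch of one such adapted estimator, the smoothed sample mean discussed in Remark 6 below; the function and variable names are illustrative.

```python
def smoothed_mean_estimate(indicators):
    """Estimate p_i from agent i's observed indicators I_i^1, ..., I_i^n.

    The +1 smoothing keeps the estimate strictly positive, so dividing a
    received gradient by the estimate is always well defined.
    """
    n = len(indicators)
    return (sum(indicators) + 1) / (n + 1)

# After observing I_i = (1, 0, 1, 1), the estimate is (3 + 1) / (4 + 1) = 0.8.
print(smoothed_mean_estimate([1, 0, 1, 1]))
```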
We have the following result (proof omitted due to space limitations):

Theorem 5.1. Under the same setup as in Theorem 4.3, if $\hat{p}_i^n$ is $\sqrt{n}$-consistent for every $i \in \mathcal{N}$, and if $\sum_{n=1}^{\infty} \gamma_n \frac{1}{\sqrt{n}} < \infty$, then the last iterate of EROGD converges to $\mathcal{A}^*$ in probability.

Remark 6. One simple estimator that is $\sqrt{n}$-consistent is the sample mean with smoothing (where the smoothing is used to prevent the estimator from ever reaching 0):
$$\hat{p}_i^{n+1} = \frac{\sum_{s=1}^{n} I_i^s + 1}{n + 1}.$$
Further, many step-size sequences satisfy the requirement; examples include $\gamma_n = \frac{1}{n^{\alpha}}$ with $0.5 < \alpha < 1$.

6 Concluding remarks

In this paper, we have provided an algorithmic framework to deal with multi-agent online learning under feedback loss and obtained broad convergence-to-Nash results in fairly general settings. We also believe more exciting work remains. For instance, our formulation is game-theoretical, where participating agents are self-interested. A parallel facet of multi-agent online learning is coordination [1, 2, 17, 20], where participating agents coordinate to achieve a common goal. Understanding how to effectively cooperate under imperfect information will be an interesting future direction. Another direction is to incorporate state into the reward and allow actions to also depend on an underlying state that may transition. Such settings belong broadly to multi-agent online policy learning, where the imperfect-information regime is under-explored. Empirically, we believe the recent advances in deep learning and representation learning [19, 33, 48-50] could provide a flexible architecture for learning good policies in the imperfect-information regime, although characterizations of theoretical guarantees may require novel machinery. Finally, it would also be interesting to further extend the results to partial feedback settings. In the presence of a single agent, such problems have been studied in the context of offline policy learning [6, 51] and online bandits (with imperfect information) [22, 25, 38]. However, the multi-agent learning setting is again under-explored; we leave that for future work.

References

[1] A. Anand, A. Grover, P. Singla, et al., ASAP-UCT: Abstraction of state-action pairs in UCT, in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[2] A. Anand, A. Grover, P. Singla, et al., A novel abstraction framework for online planning, in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 2015, pp. 1901-1902.
[3] S. Arora, E. Hazan, and S. Kale, The multiplicative weights update method: a meta-algorithm and applications, Theory of Computing, 8 (2012), pp. 121-164.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, 32 (2002), pp. 48-77.
[5] M. Balandat, W. Krichene, C. Tomlin, and A. Bayen, Minimizing regret on reflexive Banach spaces and learning Nash equilibria in continuous zero-sum games, arXiv preprint arXiv:1606.01261, 2016.
[6] D. Bertsimas and N. Kallus, From predictive to prescriptive analytics, arXiv preprint arXiv:1402.5481, 2014.
[7] S. Bervoets, M. Bravo, and M. Faure, Learning with minimal information in continuous games, https://arxiv.org/abs/1806.11506, 2018.
[8] D. Bloembergen, K. Tuyls, D. Hennes, and M. Kaisers, Evolutionary dynamics of multi-agent learning: a survey, Journal of Artificial Intelligence Research, 53 (2015), pp. 659-697.
[9] A. Blum, On-line algorithms in machine learning, in Online Algorithms, Springer, 1998, pp. 306-325.
[10] A. Blum and Y. Mansour, From external to internal regret, Journal of Machine Learning Research, 8 (2007), pp. 1307-1324.
[11] M. Bravo, D. S. Leslie, and P. Mertikopoulos, Bandit learning in concave N-person games, in NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[12] S. Bubeck, N. Cesa-Bianchi, et al., Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, 5 (2012), pp. 1-122.
[13] L. Busoniu, R. Babuska, and B. De Schutter, Multi-agent reinforcement learning: An overview, in Innovations in Multi-Agent Systems and Applications-1, Springer, 2010, pp. 183-221.
[14] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
[15] J. Cohen, A. Héliou, and P. Mertikopoulos, Learning with bandit feedback in potential games, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[16] P. de Souza and J. Silva, Berkeley Problems in Mathematics, Problem Books in Mathematics, Springer New York, 2012.
[17] M. Dimakopoulou and B. Van Roy, Coordinated exploration in concurrent reinforcement learning, arXiv preprint arXiv:1802.01282, 2018.
[18] D. Fudenberg and D. K. Levine, The Theory of Learning in Games, vol. 2 of Economic Learning and Social Evolution, MIT Press, Cambridge, MA, 1998.
[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Adaptive Computation and Machine Learning, MIT Press, 2016.
[20] A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards, Evaluating generalization in multiagent systems using agent-interaction graphs, in International Conference on Autonomous Agents and Multiagent Systems, 2018.
[21] A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards, Learning policy representations in multiagent systems, in International Conference on Machine Learning, 2018.
[22] A. Grover, T. Markov, P. Attia, N. Jin, N. Perkins, B. Cheong, M. Chen, Z. Yang, S. Harris, W. Chueh, et al., Best arm identification in multi-armed bandits with delayed feedback, arXiv preprint arXiv:1803.10937, 2018.
[23] E. Hazan, Introduction to Online Convex Optimization, Foundations and Trends in Optimization Series, Now Publishers, 2016.
[24] E. Hazan, A. Agarwal, and S. Kale, Logarithmic regret algorithms for online convex optimization, Machine Learning, 69 (2007), pp. 169-192.
[25] P. Joulani, A. Gyorgy, and C. Szepesvári, Online learning under delayed feedback, in International Conference on Machine Learning, 2013, pp. 1453-1461.
[26] A. Kalai and S. Vempala, Efficient algorithms for online decision problems, Journal of Computer and System Sciences, 71 (2005), pp. 291-307.
[27] S. Krichene, W. Krichene, R. Dong, and A. Bayen, Convergence of heterogeneous distributed learning in stochastic routing games, in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, 2015, pp. 480-487.
[28] K. Lam, W. Krichene, and A. Bayen, On learning how players learn: estimation of learning dynamics in the routing game, in Cyber-Physical Systems (ICCPS), 2016 ACM/IEEE 7th International Conference on, IEEE, 2016, pp. 1-10.
[29] I. Menache and A. Ozdaglar, Network games: Theory, models, and dynamics, Synthesis Lectures on Communication Networks, 4 (2011), pp. 1-159.
[30] P. Mertikopoulos, E. V. Belmega, R. Negrel, and L. Sanguinetti, Distributed stochastic optimization via matrix exponential learning, IEEE Trans. Signal Process., 65 (2017), pp. 2277-2290.
[31] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras, Cycles in adversarial regularized learning, in Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2018, pp. 2703-2717.
[32] P. Mertikopoulos and Z. Zhou, Learning in games with continuous action sets and unknown payoff functions, Mathematical Programming, (2018), pp. 1-43.
[33] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602, 2013.
[34] B. Monnot and G. Piliouras, Limits and limitations of no-regret learning in games, The Knowledge Engineering Review, 32 (2017).
[35] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming, 120 (2009), pp. 221-259.
[36] G. Palaiopanos, I. Panageas, and G. Piliouras, Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos, in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds., Curran Associates, Inc., 2017, pp. 5872-5882.
[37] S. Perkins, P. Mertikopoulos, and D. S. Leslie, Mixed-strategy learning with continuous action sets, IEEE Trans. Autom. Control, 62 (2017), pp. 379-384.
[38] C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder, Bandits with delayed, aggregated anonymous feedback, in International Conference on Machine Learning, 2018, pp. 4102-4110.
[39] K. Quanrud and D. Khashabi, Online learning with adversarial delays, in Advances in Neural Information Processing Systems, 2015, pp. 1270-1278.
[40] T. Roughgarden, Selfish Routing and the Price of Anarchy, MIT Press, 2005.
[41] F. Salehisadaghiani and L. Pavel, Distributed Nash equilibrium seeking via the alternating direction method of multipliers, IFAC-PapersOnLine, 50 (2017), pp. 6166-6171.
[42] S. Shalev-Shwartz, Online learning: Theory, algorithms, and applications, PhD thesis, Hebrew University of Jerusalem, 2007.
[43] S. Shalev-Shwartz et al., Online learning and online convex optimization, Foundations and Trends in Machine Learning, 4 (2012), pp. 107-194.
[44] S. Shalev-Shwartz and Y. Singer, Convex repeated games and Fenchel duality, in Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 1265-1272.
[45] Y. Shoham and K. Leyton-Brown, Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations, Cambridge University Press, 2008.
[46] Y. Viossat and A. Zapechelnyuk, No-regret dynamics and fictitious play, Journal of Economic Theory, 148 (2013), pp. 825-842.
[47] Y. Viossat and A. Zapechelnyuk, No-regret dynamics and fictitious play, Journal of Economic Theory, 148 (2013), pp. 825-842.
[48] Z. Yin, K.-H. Chang, and R. Zhang, DeepProbe: Information directed sequence understanding and chatbot design via recurrent neural networks, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 2131-2139.
[49] Z. Yin, V. Sachidananda, and B. Prabhakar, The global anchor method for quantifying linguistic shifts and domain adaptation, in Advances in Neural Information Processing Systems, 2018.
[50] Z. Yin and Y. Shen, On the dimensionality of word embedding, in Advances in Neural Information Processing Systems, 2018.
[51] Z. Zhou, S. Athey, and S. Wager, Offline multi-action policy learning: Generalization and optimization, arXiv preprint arXiv:1810.04778, 2018.
[52] Z. Zhou, N. Bambos, and P. Glynn, Dynamics on linear influence network games under stochastic environments, in International Conference on Decision and Game Theory for Security, Springer, 2016, pp. 114-126.
[53] Z. Zhou, P. Mertikopoulos, N. Bambos, P. W. Glynn, and C. Tomlin, Countering feedback delays in multi-agent learning, in NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[54] Z. Zhou, P. Mertikopoulos, A. L. Moustakas, N. Bambos, and P. Glynn, Mirror descent learning in continuous games, in Decision and Control (CDC), 2017 IEEE 56th Annual Conference on, IEEE, 2017, pp. 5776-5783.
[55] Z. Zhou, B. Yolken, R. A. Miura-Ko, and N. Bambos, A game-theoretical formulation of influence networks, in American Control Conference (ACC), 2016, IEEE, 2016, pp. 3802-3807.
[56] M. Zhu and E. Frazzoli, Distributed robust adaptive equilibrium computation for generalized convex games, Automatica, 63 (2016), pp. 82-91.
[57] M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003, pp. 928-936.