# Bandit Learning in Concave N-Person Games

Mario Bravo
Universidad de Santiago de Chile
Departamento de Matemática y Ciencia de la Computación
mario.bravo.g@usach.cl

David Leslie
Lancaster University & PROWLER.io
d.leslie@lancaster.ac.uk

Panayotis Mertikopoulos
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
panayotis.mertikopoulos@imag.fr

This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.

1 Introduction

The bane of decision-making in an unknown environment is regret: no one wants to realize in hindsight that the decision policy they employed was strictly inferior to a plain policy prescribing the same action throughout. For obvious reasons, this issue becomes considerably more intricate when the decision-maker is subject to situational uncertainty and the "fog of war": when the only information at the optimizer's disposal is the reward obtained from a given action (the so-called bandit framework), is it even possible to design a no-regret policy? Especially in the context of online convex optimization (repeated decision problems with continuous action sets and convex costs), this problem becomes even more challenging because the decision-maker typically needs to infer gradient information from the observation of a single scalar. Nonetheless, despite this extra degree of difficulty, this question has been shown to admit a positive answer: regret minimization is possible, even with bandit feedback (Flaxman et al., 2005; Kleinberg, 2004).

In this paper, we consider a multi-agent extension of this framework where, at each stage $n = 1, 2, \dots$ of a repeated decision process, the reward of an agent is determined by the actions of all agents via a fixed mechanism: a non-cooperative N-person game. In general, the agents (or players) might be completely oblivious to this mechanism, perhaps even ignoring its existence: for instance, when choosing how much to bid for a good in an online auction, an agent is typically unaware of who the other bidders are, what their specific valuations are, etc. Hence, lacking any knowledge about the game, it is only natural to assume that agents will at least seek to achieve a minimal worst-case guarantee and minimize their regret. As a result, a fundamental question that arises is a) whether the agents' sequence of actions stabilizes to a rationally admissible state under no-regret learning; and b) if it does, whether convergence is affected by the information available to the agents.

Related work.
In finite games, no-regret learning guarantees that the players' time-averaged, empirical frequency of play converges to the game's set of coarse correlated equilibria (CCE), and the rate of this convergence is $O(1/n)$ for $(\lambda, \mu)$-smooth games (Foster et al., 2016; Syrgkanis et al., 2015). In general however, this set might contain highly subpar, rationally inadmissible strategies: for instance, Viossat and Zapechelnyuk (2013) provide examples of CCE that assign positive selection probability only to strictly dominated strategies. In the class of potential games, Cohen et al. (2017) recently showed that the actual sequence of play (i.e., the sequence of actions that determine the agents' rewards at each stage) converges under no-regret learning, even with bandit feedback. Outside this class however, the players' chosen actions may cycle in perpetuity, even in simple, two-player zero-sum games with full information (Mertikopoulos et al., 2018a,b); in fact, depending on the parameters of the players' learning process, agents could even exhibit a fully unpredictable, aperiodic and chaotic behavior (Palaiopanos et al., 2017). As such, without further assumptions in place, no-regret learning in a multi-agent setting does not necessarily imply convergence to a unilaterally stable, equilibrium state.

In the broader context of games with continuous action sets (the focal point of this paper), the long-run behavior of no-regret learning is significantly more challenging to analyze. In the case of mixed-strategy learning, Perkins and Leslie (2014) and Perkins et al. (2017) showed that learning based on stochastic fictitious play converges to an $\varepsilon$-perturbed Nash equilibrium in potential games (but may lead to as much as $O(\varepsilon n)$ regret in the process). More relevant for our purposes is the analysis of Nesterov (2009), who showed that the time-averaged sequence of play induced by a no-regret dual averaging (DA) process with noisy gradient feedback converges to Nash equilibrium in monotone games (a class which, in turn, contains all concave potential games). The closest antecedent to our approach is the recent work of Mertikopoulos and Zhou (2018), who showed that the actual sequence of play generated by dual averaging converges to Nash equilibrium in the class of variationally stable games (which includes all monotone games). To do so, the authors first showed that a naturally associated continuous-time dynamical system converges, and then used the so-called asymptotic pseudotrajectory (APT) framework of Benaïm (1999) to translate this result to discrete time. Similar APT techniques were also used in a very recent preprint by Bervoets et al. (2018) to establish the convergence of a payoff-based learning algorithm in two classes of one-dimensional concave games: games with strategic complements, and ordinal potential games with isolated equilibria. The algorithm of Bervoets et al. (2018) can be seen as a special case of mirror descent coupled with a two-point gradient estimation process, suggesting several interesting links with our paper.

Our contributions. In this paper, we drop all feedback assumptions and focus on the bandit framework where the only information at the players' disposal is the payoffs they receive at each stage. As we discussed above, this lack of information complicates matters considerably because players must now estimate their payoff gradients from their observed rewards.
What makes matters even worse is that an agent may introduce a significant bias in the (concurrent) estimation process of another, so traditional, multiple-point estimation techniques for derivative-free optimization cannot be applied (at least, not without significant communication overhead between players). To do away with player coordination requirements, we focus on learning processes which could be sensibly deployed in a single-agent setting, and we show that, in monotone games, the sequence of play induced by a wide class of no-regret learning policies converges to Nash equilibrium with probability 1. Furthermore, by specializing to the class of strongly monotone games, we show that the rate of convergence is $O(n^{-1/3})$, i.e., it is nearly optimal with respect to the attainable $O(n^{-1/2})$ rate for bandit, single-agent stochastic optimization with strongly convex and smooth objectives (Agarwal et al., 2010; Shamir, 2013).

We are not aware of a similar Nash equilibrium convergence result for concave games with general convex action spaces and bandit feedback: the analysis of Mertikopoulos and Zhou (2018) requires first-order feedback, while the analysis of Bervoets et al. (2018) only applies to one-dimensional games. We find this outcome particularly appealing for practical applications of game theory (e.g., in network routing) because it shows that, in a wide class of (possibly very complicated) nonlinear games, the Nash equilibrium prediction does not require full rationality, common knowledge of rationality, flawless execution, or even the knowledge that a game is being played: a commonly used, individual no-regret algorithm suffices.

2 Problem setup and preliminaries

Concave games. Throughout this paper, we will focus on games with a finite number of players $i \in \mathcal{N} = \{1, \dots, N\}$ and continuous action sets. During play, every player $i \in \mathcal{N}$ selects an action $x_i$ from a compact convex subset $\mathcal{X}_i$ of a $d_i$-dimensional normed space $\mathcal{V}_i$; subsequently, based on each player's individual objective and the action profile $x = (x_i; x_{-i}) \equiv (x_1, \dots, x_N)$ of all players' actions, every player receives a reward, and the process repeats. In more detail, writing $\mathcal{X} \equiv \prod_i \mathcal{X}_i$ for the game's action space, we assume that each player's reward is determined by an associated payoff (or utility) function $u_i \colon \mathcal{X} \to \mathbb{R}$. Since players are not assumed to know the game (or even that they are involved in one), these payoff functions might be a priori unknown, especially with respect to the dependence on the actions of other players. Our only structural assumption on $u_i$ will be that $u_i(x_i; x_{-i})$ is concave in $x_i$ for all $x_{-i} \in \mathcal{X}_{-i} \equiv \prod_{j \neq i} \mathcal{X}_j$ and all $i \in \mathcal{N}$. With all this in hand, a concave game will be a tuple $\mathcal{G} \equiv \mathcal{G}(\mathcal{N}, \mathcal{X}, u)$ with players, action spaces and payoffs defined as above. Below, we briefly discuss some examples thereof:

Example 2.1 (Cournot competition). In the standard Cournot oligopoly model, there is a finite set of firms indexed by $i = 1, \dots, N$, each supplying the market with a quantity $x_i \in [0, C_i]$ of some good (or service), up to the firm's production capacity $C_i$. By the law of supply and demand, the good is priced as a decreasing function $P(x_{\mathrm{tot}})$ of the total amount $x_{\mathrm{tot}} = \sum_{i=1}^N x_i$ supplied to the market, typically following a linear model of the form $P(x_{\mathrm{tot}}) = a - b\, x_{\mathrm{tot}}$ for positive constants $a, b > 0$.
The utility of firm $i$ is then given by

$u_i(x_i; x_{-i}) = x_i P(x_{\mathrm{tot}}) - c_i x_i$,   (2.1)

i.e., it comprises the total revenue from producing $x_i$ units of the good in question minus the associated production cost (in the above, $c_i > 0$ represents the marginal production cost of firm $i$).

Example 2.2 (Resource allocation auctions). Consider a service provider with a number of splittable resources $s \in \mathcal{S} = \{1, \dots, S\}$ (bandwidth, server time, GPU cores, etc.). These resources can be leased to a set of $N$ bidders (players) who can place monetary bids $x_{is} \geq 0$ for the utilization of each resource $s \in \mathcal{S}$, up to each player's total budget $b_i$, i.e., $\sum_{s \in \mathcal{S}} x_{is} \leq b_i$. Once all bids are in, resources are allocated proportionally to each player's bid, i.e., the $i$-th player gets $\rho_{is} = q_s x_{is} / (c_s + \sum_{j \in \mathcal{N}} x_{js})$ units of the $s$-th resource (where $q_s$ denotes the available units of said resource and $c_s \geq 0$ is the "entry barrier" for bidding on it). A simple model for the utility of player $i$ is then given by

$u_i(x_i; x_{-i}) = \sum_{s \in \mathcal{S}} [g_i \rho_{is} - x_{is}]$,   (2.2)

with $g_i$ denoting the marginal gain of player $i$ from acquiring a unit slice of resources. For more examples of monotone games, see Scutari et al. (2010), D'Oro et al. (2015), Mertikopoulos and Belmega (2016), and references therein.

Nash equilibrium and monotone games. The most widely used solution concept for non-cooperative games is that of a Nash equilibrium (NE), defined here as any action profile $x^* \in \mathcal{X}$ that is resilient to unilateral deviations, viz.

$u_i(x_i^*; x_{-i}^*) \geq u_i(x_i; x_{-i}^*)$ for all $x_i \in \mathcal{X}_i$, $i \in \mathcal{N}$.   (NE)

By the classical existence theorem of Debreu (1952), every concave game admits a Nash equilibrium. Moreover, thanks to the individual concavity of the game's payoff functions, Nash equilibria can also be characterized via the first-order optimality condition

$\langle v_i(x^*), x_i - x_i^* \rangle \leq 0$ for all $x_i \in \mathcal{X}_i$,   (2.3)

where $v_i(x)$ denotes the individual payoff gradient of the $i$-th player, i.e.,

$v_i(x) = \nabla_i u_i(x_i; x_{-i})$,   (2.4)

with $\nabla_i$ denoting differentiation with respect to $x_i$.[1] In terms of regularity, it will be convenient to assume that each $v_i$ is Lipschitz continuous; to streamline our presentation, this will be our standing assumption in what follows.

[1] We adopt here the standard convention of treating $v_i(x)$ as an element of the dual space $\mathcal{Y}_i \equiv \mathcal{V}_i^*$ of $\mathcal{V}_i$, with $\langle y_i, x_i \rangle$ denoting the duality pairing between $y_i \in \mathcal{Y}_i$ and $x_i \in \mathcal{X}_i \subseteq \mathcal{V}_i$.

Starting with the seminal work of Rosen (1965), much of the literature on continuous games and their applications has focused on games that satisfy a condition known as diagonal strict concavity (DSC). In its simplest form, this condition posits that there exist positive constants $\lambda_i > 0$ such that

$\sum_{i \in \mathcal{N}} \lambda_i \langle v_i(x') - v_i(x), x_i' - x_i \rangle < 0$ for all $x, x' \in \mathcal{X}$, $x' \neq x$.   (DSC)

Owing to the formal similarity between (DSC) and the various operator monotonicity conditions in optimization (see e.g., Bauschke and Combettes, 2017), games that satisfy (DSC) are commonly referred to as (strictly) monotone. As was shown by Rosen (1965, Theorem 2), monotone games admit a unique Nash equilibrium $x^* \in \mathcal{X}$ which, in view of (DSC) and (NE), is also the unique solution of the (weighted) variational inequality

$\sum_{i \in \mathcal{N}} \lambda_i \langle v_i(x), x_i - x_i^* \rangle < 0$ for all $x \neq x^*$.   (VI)

This property of Nash equilibria of monotone games will play a crucial role in our analysis, and we will use it freely in the rest of our paper. In terms of applications, monotonicity gives rise to a very rich class of games.
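To make the payoff gradients (2.4) and the monotonicity condition (DSC) concrete, the following minimal Python sketch instantiates the linear Cournot model of Example 2.1 with hypothetical parameter values and checks (DSC) numerically, with unit weights $\lambda_i = 1$, on randomly drawn pairs of action profiles:

```python
import numpy as np

# Illustrative Cournot parameters (hypothetical values, for demonstration only).
N, a, b = 3, 10.0, 1.0                 # number of firms, price intercept and slope
c = np.array([1.0, 2.0, 1.5])          # marginal production costs c_i
C = np.array([4.0, 4.0, 4.0])          # production capacities C_i

def payoff(i, x):
    """u_i(x) = x_i * P(x_tot) - c_i * x_i with P(x_tot) = a - b * x_tot, cf. (2.1)."""
    return x[i] * (a - b * np.sum(x)) - c[i] * x[i]

def payoff_gradient(x):
    """Individual payoff gradients v_i(x) = d u_i / d x_i, cf. (2.4)."""
    return a - b * np.sum(x) - b * x - c

# Numerical check of (DSC) with unit weights: the pairing
# sum_i <v_i(x') - v_i(x), x'_i - x_i> should be negative whenever x' != x.
rng = np.random.default_rng(0)
for _ in range(1000):
    x, xp = rng.uniform(0, C), rng.uniform(0, C)
    gap = np.dot(payoff_gradient(xp) - payoff_gradient(x), xp - x)
    assert gap < 0 or np.allclose(x, xp)
print("Monotonicity (DSC) holds on all sampled pairs.")
```

For this linear model the check always succeeds because the Jacobian of the gradient field is $-b(I + \mathbf{1}\mathbf{1}^{\top})$, a negative definite matrix.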
As we show in the paper's supplement, Examples 2.1 and 2.2 both satisfy diagonal strict concavity (with a nontrivial choice of weights for the latter), as do atomic splittable congestion games in networks with parallel links (Orda et al., 1993; Sorin and Wan, 2016), multi-user covariance matrix optimization problems in multiple-input and multiple-output (MIMO) systems (Mertikopoulos et al., 2017), and many other problems where online decision-making is the norm. Moreover, the class of monotone games contains all strictly convex-concave zero-sum games and all games that admit a (strictly) concave potential, i.e., a function $f \colon \mathcal{X} \to \mathbb{R}$ such that $v_i(x) = \nabla_i f(x)$ for all $x \in \mathcal{X}$, $i \in \mathcal{N}$. In view of all this (and unless explicitly stated otherwise), we will focus throughout on monotone games; for completeness, we also include in the supplement a straightforward second-order test for monotonicity.

3 Regularized no-regret learning

We now turn to the learning methods that players could employ to increase their individual rewards in an online manner. Building on Zinkevich's (2003) online gradient descent policy, the most widely used algorithmic schemes for no-regret learning in the context of online convex optimization invariably revolve around the idea of regularization. To name but the most well-known paradigms, "follow the regularized leader" (FTRL) explicitly relies on best-responding to a regularized aggregate of the reward functions revealed up to a given stage, while online mirror descent (OMD) and its variants use a linear surrogate thereof. All these no-regret policies fall under the general umbrella of regularized learning, and their origins can be traced back to the seminal mirror descent (MD) algorithm of Nemirovski and Yudin (1983).[2]

[2] In a utility maximization setting, mirror descent should be called "mirror ascent" because players seek to maximize their rewards (as opposed to minimizing their losses). Nonetheless, we keep the term "descent" throughout because, despite the role reversal, it is the standard name associated with the method.

The basic idea of mirror descent is to generate a new feasible point $x^+$ by taking a so-called "mirror step" from a starting point $x$ along the direction of an approximate gradient vector $y$ (which we treat here as an element of the dual space $\mathcal{Y} \equiv \prod_i \mathcal{Y}_i$ of $\mathcal{V} \equiv \prod_i \mathcal{V}_i$).[3] To do so, let $h_i \colon \mathcal{X}_i \to \mathbb{R}$ be a continuous and $K_i$-strongly convex distance-generating (or regularizer) function, i.e.,

$h_i(t x_i + (1-t) x_i') \leq t h_i(x_i) + (1-t) h_i(x_i') - \tfrac{1}{2} K_i t (1-t) \|x_i' - x_i\|^2$,   (3.1)

for all $x_i, x_i' \in \mathcal{X}_i$ and all $t \in [0, 1]$. In terms of smoothness (and in a slight abuse of notation), we also assume that the subdifferential of $h_i$ admits a continuous selection, i.e., a continuous function $\nabla h_i \colon \operatorname{dom} \partial h_i \to \mathcal{Y}_i$ such that $\nabla h_i(x_i) \in \partial h_i(x_i)$ for all $x_i \in \operatorname{dom} \partial h_i$.[4]

[3] For concreteness (and in a slight abuse of notation), we assume in what follows that $\mathcal{V}$ is equipped with the product norm $\|x\|^2 = \sum_i \|x_i\|^2$ and $\mathcal{Y}$ with the dual norm $\|y\|_* = \max\{\langle y, x \rangle : \|x\| \leq 1\}$.

[4] Recall here that the subdifferential of $h_i$ at $x_i \in \mathcal{X}_i$ is defined as $\partial h_i(x_i) \equiv \{y_i \in \mathcal{Y}_i : h_i(x_i') \geq h_i(x_i) + \langle y_i, x_i' - x_i \rangle \text{ for all } x_i' \in \mathcal{V}_i\}$, with the standard convention that $h_i(x_i) = +\infty$ if $x_i \in \mathcal{V}_i \setminus \mathcal{X}_i$. By standard results, the domain of subdifferentiability $\operatorname{dom} \partial h_i \equiv \{x_i \in \mathcal{X}_i : \partial h_i(x_i) \neq \varnothing\}$ of $h_i$ satisfies $\mathcal{X}_i^{\circ} \subseteq \operatorname{dom} \partial h_i \subseteq \mathcal{X}_i$.

Then, letting $h(x) = \sum_i h_i(x_i)$ for $x \in \mathcal{X}$ (so $h$ is strongly convex with modulus $K = \min_i K_i$), we get a pseudo-distance on $\mathcal{X}$ via the relation

$D(p, x) = h(p) - h(x) - \langle \nabla h(x), p - x \rangle$,   (3.2)

for all $p \in \mathcal{X}$ and all $x \in \operatorname{dom} \partial h$.
This pseudo-distance is known as the Bregman divergence, and we have $D(p, x) \geq 0$ with equality if and only if $x = p$; on the other hand, $D$ may fail to be symmetric and/or satisfy the triangle inequality, so, in general, it is not a bona fide distance function on $\mathcal{X}$. Nevertheless, we also have $D(p, x) \geq \tfrac{1}{2} K \|x - p\|^2$ (see the paper's supplement), so the convergence of a sequence $X_n$ to $p$ can be checked by showing that $D(p, X_n) \to 0$. For technical reasons, it will be convenient to also assume the converse, i.e., that $D(p, X_n) \to 0$ whenever $X_n \to p$. This condition is known in the literature as "Bregman reciprocity" (Chen and Teboulle, 1993), and it will be our blanket assumption in what follows (note that it is trivially satisfied by Examples 3.1 and 3.2 below).

Now, as with true Euclidean distances, $D(p, x)$ induces a prox-mapping given by

$P_x(y) = \operatorname*{arg\,min}_{x' \in \mathcal{X}} \{ \langle y, x - x' \rangle + D(x', x) \}$   (3.3)

for all $x \in \operatorname{dom} \partial h$ and all $y \in \mathcal{Y}$. Just like its Euclidean counterpart below, the prox-mapping (3.3) starts with a point $x \in \operatorname{dom} \partial h$ and steps along the dual (gradient-like) vector $y \in \mathcal{Y}$ to produce a new feasible point $x^+ = P_x(y)$. Standard examples of this process are:

Example 3.1 (Euclidean projections). Let $h(x) = \tfrac{1}{2} \|x\|_2^2$ denote the (squared) Euclidean norm regularizer. Then, the induced prox-mapping is

$P_x(y) = \Pi(x + y)$,   (3.4)

with $\Pi(x) = \operatorname*{arg\,min}_{x' \in \mathcal{X}} \|x' - x\|_2$ denoting the standard Euclidean projection onto $\mathcal{X}$. Hence, the update rule $x^+ = P_x(y)$ boils down to a "vanilla" Euclidean projection step along $y$.

Example 3.2 (Entropic regularization and multiplicative weights). Suppressing the player index for simplicity, let $\mathcal{X}$ be a $d$-dimensional simplex and consider the entropic regularizer $h(x) = \sum_{j=1}^d x_j \log x_j$. The induced pseudo-distance is the so-called Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}(p, x) = \sum_{j=1}^d p_j \log(p_j / x_j)$, which gives rise to the prox-mapping

$P_x(y) = \dfrac{(x_j \exp(y_j))_{j=1}^d}{\sum_{j=1}^d x_j \exp(y_j)}$   (3.5)

for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$. The update rule $x^+ = P_x(y)$ is widely known as the multiplicative weights (MW) algorithm and plays a central role for learning in multi-armed bandit problems and finite games (Arora et al., 2012; Auer et al., 1995; Freund and Schapire, 1999).

With all this in hand, the multi-agent mirror descent (MD) algorithm is given by the recursion

$X_{n+1} = P_{X_n}(\gamma_n \hat v_n)$,   (MD)

where $\gamma_n$ is a variable step-size sequence and $\hat v_n = (\hat v_{i,n})_{i \in \mathcal{N}}$ is a generic feedback sequence of estimated gradients. In the next section, we detail how this sequence is generated with first-order or zeroth-order (bandit) feedback.
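As a concrete illustration of the prox-mappings in Examples 3.1 and 3.2 and of a single (MD) update, here is a minimal Python sketch; the box-shaped feasible set used in the Euclidean case and all numerical values are illustrative assumptions rather than part of the model above:

```python
import numpy as np

def prox_euclidean(x, y, lo, hi):
    """Example 3.1: P_x(y) = Pi(x + y); projection onto a box [lo, hi]^d is
    used here purely for illustration (it is just coordinate-wise clipping)."""
    return np.clip(x + y, lo, hi)

def prox_entropic(x, y):
    """Example 3.2: multiplicative-weights prox-mapping on the simplex,
    P_x(y)_j = x_j exp(y_j) / sum_k x_k exp(y_k), cf. (3.5)."""
    z = x * np.exp(y - np.max(y))   # max-shift for numerical stability
    return z / np.sum(z)

# One mirror-descent step X_{n+1} = P_{X_n}(gamma_n * v_hat_n), cf. (MD):
x = np.array([0.5, 0.3, 0.2])        # current point on the simplex
v_hat = np.array([1.0, -0.5, 0.2])   # some gradient estimate (placeholder values)
gamma = 0.1                          # step-size
x_next = prox_entropic(x, gamma * v_hat)
print(x_next, x_next.sum())          # the update stays on the simplex
```

The max-shift inside prox_entropic rescales the numerator and denominator by a common factor, so it leaves (3.5) unchanged while avoiding numerical overflow.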
4 First-order vs. bandit feedback

4.1 First-order feedback. A common assumption in the literature is that players are able to obtain gradient information by querying a first-order oracle (Nesterov, 2004), i.e., a black-box feedback mechanism that outputs an estimate $\hat v_i$ of the individual payoff gradient $v_i(x)$ of the $i$-th player at the current action profile $x = (x_i; x_{-i}) \in \mathcal{X}$. This estimate could be either perfect, giving $\hat v_i = v_i(x)$ for all $i \in \mathcal{N}$, or imperfect, returning noisy information of the form $\hat v_i = v_i(x) + U_i$, where $U_i$ denotes the oracle's error (random, systematic, or otherwise). Having access to a perfect oracle is usually a tall order, either because payoff gradients are difficult to compute directly (especially without global knowledge), because they involve an expectation over a possibly unknown probability law, or for any number of other reasons.

It is therefore more common to assume that each player has access to a stochastic oracle which, when called against a sequence of actions $X_n \in \mathcal{X}$, produces a sequence of gradient estimates $\hat v_n = (\hat v_{i,n})_{i \in \mathcal{N}}$ that satisfies the following statistical assumptions:

a) Unbiasedness: $\mathbb{E}[\hat v_n \mid \mathcal{F}_n] = v(X_n)$.
b) Finite mean square: $\mathbb{E}[\|\hat v_n\|_*^2 \mid \mathcal{F}_n] \leq V^2$ for some finite $V \geq 0$.   (4.1)

In terms of measurability, the expectation in (4.1) is conditioned on the history $\mathcal{F}_n$ of $X_n$ up to stage $n$; in particular, since $\hat v_n$ is generated randomly from $X_n$, it is not $\mathcal{F}_n$-measurable (and hence not adapted). To make this more transparent, we will write $\hat v_n = v(X_n) + U_{n+1}$, where $U_n$ is an adapted martingale difference sequence with $\mathbb{E}[\|U_{n+1}\|_*^2 \mid \mathcal{F}_n] \leq \sigma^2$ for some finite $\sigma \geq 0$.

4.2 Bandit feedback. Now, if players don't have access to a first-order oracle (the so-called bandit or payoff-based framework), they will need to derive an individual gradient estimate from the only information at their disposal: the actual payoffs they receive at each stage. When a function can be queried at multiple points (as few as two in practice), there are efficient ways to estimate its gradient via directional sampling techniques, as in Agarwal et al. (2010). In a game-theoretic setting however, multiple-point estimation techniques do not apply because, in general, a player's payoff function depends on the actions of all players. Thus, when a player attempts to get a second query of their payoff function, this function may have already changed due to the query of another player; i.e., instead of sampling $u_i(\cdot\,; x_{-i})$, the $i$-th player would be sampling $u_i(\cdot\,; x_{-i}')$ for some $x_{-i}' \neq x_{-i}$.

Following Spall (1997) and Flaxman et al. (2005), we posit instead that players rely on a simultaneous perturbation stochastic approximation (SPSA) approach that allows them to estimate their individual payoff gradients $v_i$ based on a single function evaluation. In detail, the key steps of this one-shot estimation process for each player $i \in \mathcal{N}$ are:

0. Fix a query radius $\delta > 0$.[5]
1. Pick a pivot point $x_i \in \mathcal{X}_i$ where player $i$ seeks to estimate their payoff gradient.
2. Draw a vector $z_i$ from the unit sphere $\mathbb{S}_i \equiv \mathbb{S}^{d_i}$ of $\mathcal{V}_i \cong \mathbb{R}^{d_i}$ and play $\hat x_i = x_i + \delta z_i$.[6]
3. Receive $\hat u_i = u_i(\hat x_i; \hat x_{-i})$ and set $\hat v_i = \dfrac{d_i}{\delta}\, \hat u_i\, z_i$.   (4.2)

[5] For simplicity, we take $\delta$ equal for all players; the extension to player-specific $\delta$ is straightforward, so we omit it.

[6] We tacitly assume here that the query directions $z_i \in \mathbb{S}^{d_i}$ are drawn independently across players.

By adapting a standard argument based on Stokes' theorem (detailed in the supplement), it can be shown that $\hat v_i$ is an unbiased estimator of the individual gradient of the $\delta$-smoothed payoff function

$u_i^{\delta}(x) = \dfrac{1}{\operatorname{vol}(\delta \mathbb{B}_i) \prod_{j \neq i} \operatorname{vol}(\delta \mathbb{S}_j)} \int_{\delta \mathbb{B}_i} \int_{\prod_{j \neq i} \delta \mathbb{S}_j} u_i(x_i + w_i; x_{-i} + z_{-i}) \, dz_1 \cdots dw_i \cdots dz_N$,   (4.3)

with $\mathbb{B}_i \equiv \mathbb{B}^{d_i}$ denoting the unit ball of $\mathcal{V}_i$. The Lipschitz continuity of $v_i$ guarantees that $\nabla_i u_i - \nabla_i u_i^{\delta} = O(\delta)$, so this estimate becomes more and more accurate as $\delta \to 0^+$. On the other hand, the second moment of $\hat v_i$ grows as $O(1/\delta^2)$, implying in turn that the variability of $\hat v_i$ grows unbounded as $\delta \to 0^+$. This manifestation of the bias-variance dilemma plays a crucial role in designing no-regret policies with bandit feedback (Flaxman et al., 2005; Kleinberg, 2004), so $\delta$ must be chosen with care.
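For illustration, the following minimal sketch implements the one-shot estimate (4.2) for a single player; the two-player quadratic payoff used here is a hypothetical placeholder, and feasibility of the perturbed action is deliberately ignored at this point (it is addressed next):

```python
import numpy as np

def spsa_estimate(payoff_i, x, i, delta, rng):
    """One-shot SPSA estimate (4.2) for player i: perturb x_i by delta * z_i
    with z_i uniform on the unit sphere, observe the realized payoff once,
    and return (d_i / delta) * u_hat * z_i (feasibility is ignored here)."""
    d_i = np.size(x[i])
    z = rng.standard_normal(d_i)
    z /= np.linalg.norm(z)                         # uniform direction on the sphere
    x_hat = [np.asarray(xj, dtype=float).copy() for xj in x]
    x_hat[i] = x_hat[i] + delta * z                # played action: x_i + delta * z_i
    u_hat = payoff_i(x_hat)                        # single payoff observation
    return (d_i / delta) * u_hat * z

# Toy two-player payoff for player 1 (hypothetical, for illustration only):
u1 = lambda x: -float(np.sum((x[0] - 1.0) ** 2)) - float(x[0] @ x[1])
rng = np.random.default_rng(1)
v_hat = spsa_estimate(u1, [np.zeros(2), np.ones(2)], i=0, delta=0.1, rng=rng)
print(v_hat)   # noisy estimate of the gradient of a delta-smoothed payoff, cf. (4.3)
```

Averaging many such estimates at a fixed pivot would approximate the gradient of the smoothed payoff (4.3) rather than that of $u_i$ itself, which is precisely the $O(\delta)$ bias discussed above.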
Before dealing with this choice though, it is important to highlight two feasibility issues that arise with the single-shot SPSA estimate (4.2). The first has to do with the fact that the perturbation direction $z_i$ is chosen from the unit sphere $\mathbb{S}_i$, so it may fail to be tangent to $\mathcal{X}_i$, even when $x_i$ is interior. To iron out this wrinkle, it suffices to sample $z_i$ from the intersection of $\mathbb{S}_i$ with the affine hull of $\mathcal{X}_i$ in $\mathcal{V}_i$; on that account (and without loss of generality), we will simply assume in what follows that each $\mathcal{X}_i$ is a convex body of $\mathcal{V}_i$, i.e., that it has nonempty topological interior.

The second feasibility issue concerns the size of the perturbation step: even if $z_i$ is a feasible direction of motion, the query point $\hat x_i = x_i + \delta z_i$ may be infeasible if $x_i$ is too close to the boundary of $\mathcal{X}_i$. For this reason, we will introduce a "safety net" in the spirit of Agarwal et al. (2010), and we will constrain the set of possible pivot points $x_i$ to lie within a suitably shrunk zone of $\mathcal{X}$. In detail, let $\mathbb{B}_{r_i}(p_i)$ be an $r_i$-ball centered at $p_i \in \mathcal{X}_i$ so that $\mathbb{B}_{r_i}(p_i) \subseteq \mathcal{X}_i$. Then, instead of perturbing $x_i$ by $z_i$, we consider the feasibility adjustment

$w_i = z_i - r_i^{-1} (x_i - p_i)$,   (4.4)

and each player plays $\hat x_i = x_i + \delta w_i$ instead of $x_i + \delta z_i$. In other words, this adjustment moves each pivot to $x_i^{\delta} = x_i - r_i^{-1} \delta (x_i - p_i)$, i.e., $O(\delta)$-closer to the interior base point $p_i$, and then perturbs $x_i^{\delta}$ by $\delta z_i$. Feasibility of the query point is then ensured by noting that

$\hat x_i = x_i^{\delta} + \delta z_i = (1 - r_i^{-1} \delta)\, x_i + r_i^{-1} \delta\, (p_i + r_i z_i)$,   (4.5)

so $\hat x_i \in \mathcal{X}_i$ whenever $\delta / r_i < 1$ (since $p_i + r_i z_i \in \mathbb{B}_{r_i}(p_i) \subseteq \mathcal{X}_i$).

The difference between this estimator and the oracle framework we discussed above is twofold. First, each player's realized action is $\hat x_i = x_i + \delta w_i$, not $x_i$, so there is a disparity between the point at which payoffs are queried and the action profile where the oracle is called. Second, the resulting estimator $\hat v$ is not unbiased, so the statistical assumptions (4.1) for a stochastic oracle do not hold. In particular, given the feasibility adjustment (4.4), the estimate (4.2) with $\hat x$ given by (4.5) satisfies

$\mathbb{E}[\hat v_i] = \nabla_i u_i^{\delta}(x_i^{\delta}; x_{-i}^{\delta})$,   (4.6)

so there are two sources of systematic error: an $O(\delta)$ perturbation in the function, and an $O(\delta)$ perturbation of each player's pivot point from $x_i$ to $x_i^{\delta}$. Hence, to capture both sources of bias and separate them from the random noise, we will write

$\hat v_i = v_i(x) + U_i + b_i$,   (4.7)

where $U_i = \hat v_i - \mathbb{E}[\hat v_i]$ and $b_i = \nabla_i u_i^{\delta}(x^{\delta}) - \nabla_i u_i(x)$. We are thus led to the following manifestation of the bias-variance dilemma: the bias term $b$ in (4.7) is $O(\delta)$, but the second moment of the noise term $U$ is $O(1/\delta^2)$; as such, an increase in accuracy (small bias) would result in a commensurate loss of precision (large noise variance). Balancing these two factors will be a key component of our analysis in the next section.
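A minimal sketch of the feasibility adjustment (4.4)-(4.5), using a box-shaped action set and an interior safety ball chosen purely for illustration:

```python
import numpy as np

def feasible_query(x, z, delta, p, r):
    """Feasibility-adjusted query point of (4.4)-(4.5): w = z - (x - p)/r and
    x_hat = x + delta*w = (1 - delta/r)*x + (delta/r)*(p + r*z), which stays
    in X whenever delta < r and the ball B_r(p) lies inside X."""
    w = z - (x - p) / r                      # feasibility adjustment (4.4)
    return x + delta * w                     # query point (4.5)

# Illustration on the box X = [0, 1]^2 with p = (0.5, 0.5) and r = 0.5:
rng = np.random.default_rng(2)
p, r, delta = np.full(2, 0.5), 0.5, 0.2
x = np.array([1.0, 0.0])                     # pivot sitting on the boundary of X
z = rng.standard_normal(2)
z /= np.linalg.norm(z)                       # unit perturbation direction
x_hat = feasible_query(x, z, delta, p, r)
print(x_hat, bool(np.all((0.0 <= x_hat) & (x_hat <= 1.0))))   # query point is feasible
```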
5 Convergence analysis and results

Combining the learning framework of Section 3 with the single-shot gradient estimation machinery of Section 4, we obtain the following variant of (MD) with payoff-based, bandit feedback:

$\hat X_n = X_n + \delta_n W_n$, $X_{n+1} = P_{X_n}(\gamma_n \hat v_n)$.   (MD-b)

In the above, the perturbations $W_n$ and the estimates $\hat v_n$ are given respectively by (4.4) and (4.2), i.e.,

$W_{i,n} = Z_{i,n} - r_i^{-1} (X_{i,n} - p_i)$ and $\hat v_{i,n} = (d_i / \delta_n)\, u_i(\hat X_n)\, Z_{i,n}$,   (5.1)

and $Z_{i,n}$ is drawn independently and uniformly across players at each stage $n$ (see also Algorithm 1 for a pseudocode implementation and Fig. 1 for a schematic representation). In the rest of this paper, our goal will be to determine the equilibrium convergence properties of this scheme in concave N-person games. Our first asymptotic result below shows that, under (MD-b), the players' learning process converges to Nash equilibrium in monotone games:

Theorem 5.1. Suppose that the players of a monotone game $\mathcal{G} \equiv \mathcal{G}(\mathcal{N}, \mathcal{X}, u)$ follow (MD-b) with step-size $\gamma_n$ and query radius $\delta_n$ such that

$\lim_{n \to \infty} \gamma_n = \lim_{n \to \infty} \delta_n = 0$, $\sum_{n=1}^{\infty} \gamma_n = \infty$, $\sum_{n=1}^{\infty} \gamma_n \delta_n < \infty$, and $\sum_{n=1}^{\infty} \gamma_n^2 / \delta_n^2 < \infty$.   (5.2)

Then, the sequence of realized actions $\hat X_n$ converges to Nash equilibrium with probability 1.

Algorithm 1: Multi-agent mirror descent with bandit feedback (player indices suppressed)
Require: step-size $\gamma_n > 0$, query radius $\delta_n > 0$, safety ball $\mathbb{B}_r(p) \subseteq \mathcal{X}$
1: choose $X \in \operatorname{dom} \partial h$   # initialization
2: repeat at each stage $n = 1, 2, \dots$
3:   draw $Z$ uniformly from $\mathbb{S}^d$   # perturbation direction
4:   set $W \leftarrow Z - r^{-1}(X - p)$   # query direction
5:   play $\hat X \leftarrow X + \delta_n W$   # choose action
6:   receive $\hat u \leftarrow u(\hat X)$   # get payoff
7:   set $\hat v \leftarrow (d / \delta_n)\, \hat u\, Z$   # estimate gradient
8:   update $X \leftarrow P_X(\gamma_n \hat v)$   # update pivot
9: until end

Figure 1: Schematic representation of Algorithm 1 with ordinary, Euclidean projections. To reduce visual clutter, we did not include the feasibility adjustment $r^{-1}(x - p)$ in the action selection step $X_n \mapsto \hat X_n$.

Even though the setting is different, the conditions (5.2) for the tuning of the algorithm's parameters are akin to those encountered in Kiefer-Wolfowitz stochastic approximation schemes and serve a similar purpose. First, the conditions $\lim_n \gamma_n = 0$ and $\sum_{n=1}^{\infty} \gamma_n = \infty$ respectively mitigate the method's inherent randomness and ensure a horizon of sufficient length. The requirement $\lim_n \delta_n = 0$ is also straightforward to explain: as players accrue more information, they need to decrease the sampling bias in order to have any hope of converging. However, as we discussed in Section 4, decreasing $\delta$ also increases the variance of the players' gradient estimates, which might grow to infinity as $\delta \to 0$. The crucial observation here is that new gradients enter the algorithm with a weight of $\gamma_n$, so the aggregate bias after $n$ stages is of the order of $O(\sum_{k=1}^n \gamma_k \delta_k)$ and its variance is $O(\sum_{k=1}^n \gamma_k^2 / \delta_k^2)$. If these error terms can be controlled, there is an underlying drift that emerges over time and which steers the process to equilibrium. We make this precise in the supplement by using a suitably adjusted variant of the Bregman divergence as a quasi-Fejér energy function for (MD-b) and relying on a series of (sub)martingale convergence arguments to establish the convergence of $\hat X_n$ (first along a subsequence, then with probability 1).

Of course, since Theorem 5.1 is asymptotic in nature, it is not clear how to choose $\gamma_n$ and $\delta_n$ so as to optimize the method's convergence rate. Heuristically, if we take schedules of the form $\gamma_n = \gamma / n^p$ and $\delta_n = \delta / n^q$ with $\gamma, \delta > 0$ and $0 < p, q \leq 1$, the only conditions imposed by (5.2) are $p + q > 1$ and $p - q > 1/2$. However, as we discussed above, the aggregate bias in the algorithm after $n$ stages is $O(\sum_{k=1}^n \gamma_k \delta_k) = O(1/n^{p+q-1})$ and its variance is $O(\sum_{k=1}^n \gamma_k^2 / \delta_k^2) = O(1/n^{2p-2q-1})$: if the conditions (5.2) are satisfied, both error terms vanish, but they might do so at very different rates. By equating these exponents in order to bridge this gap, we obtain $q = p/3$; moreover, since the single-shot SPSA estimator (4.2) introduces a $\Theta(\delta_n)$ random perturbation, $q$ should be taken as large as possible to ensure that this perturbation vanishes at the fastest possible rate. As a result, the most suitable choice for $p$ and $q$ seems to be $p = 1$, $q = 1/3$, leading to an error bound of $O(1/n^{1/3})$.
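To put the pieces together, here is a self-contained Python sketch of (MD-b) / Algorithm 1 with Euclidean projections, run on the hypothetical Cournot instance used in the earlier sketches and with schedules of the form $\gamma_n = \gamma/n$, $\delta_n = \delta/n^{1/3}$ discussed above; the horizon, the step-size and query-radius constants, and the safety balls are all illustrative choices rather than tuned or prescribed values:

```python
import numpy as np

# Hypothetical Cournot instance (same illustrative parameters as the earlier sketch).
N, a, b = 3, 10.0, 1.0
c = np.array([1.0, 2.0, 1.5])
C = np.array([4.0, 4.0, 4.0])                 # action sets X_i = [0, C_i]
x_star = np.linalg.solve(b * (np.eye(N) + np.ones((N, N))), a - c)  # interior NE: v(x*) = 0

def run_md_bandit(T=50_000, gamma0=0.2, delta0=1.0, seed=0):
    """Sketch of Algorithm 1 / (MD-b) with Euclidean projections: each player
    perturbs its pivot, plays the perturbed action, observes only its own payoff,
    forms the one-point estimate (4.2), and takes a projected gradient step."""
    rng = np.random.default_rng(seed)
    p, r = C / 2.0, float(np.min(C)) / 2.0    # safety balls B_r(p_i) inside [0, C_i]
    X = p.copy()                              # initial pivots
    for n in range(1, T + 1):
        gamma_n = gamma0 / n                  # step-size  gamma_n = gamma / n
        delta_n = delta0 / n ** (1 / 3)       # query radius  delta_n = delta / n^{1/3}
        Z = rng.choice([-1.0, 1.0], size=N)   # the unit "sphere" of R^1 is {-1, +1}
        W = Z - (X - p) / r                   # feasibility adjustment (4.4)
        X_hat = X + delta_n * W               # realized actions (feasible since delta_n < r)
        u_hat = a * X_hat - b * np.sum(X_hat) * X_hat - c * X_hat   # own payoffs, cf. (2.1)
        v_hat = (1.0 / delta_n) * u_hat * Z   # one-point estimates (4.2) with d_i = 1
        X = np.clip(X + gamma_n * v_hat, 0.0, C)                    # Euclidean prox-step (3.4)
    return X_hat

X_final = run_md_bandit()
print(X_final, np.linalg.norm(X_final - x_star))
```

Each update uses only the player's own realized payoff $u_i(\hat X_n)$, in line with the bandit feedback model; the distance to the analytically computed equilibrium is expected to shrink only slowly, consistent with the $O(n^{-1/3})$ mean-square rate established below.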
We show below that this bound is indeed attainable for games that are strongly monotone, i.e., games that satisfy the following stronger variant of diagonal strict concavity:

$\sum_{i \in \mathcal{N}} \lambda_i \langle v_i(x') - v_i(x), x_i' - x_i \rangle \leq -\dfrac{\beta}{2} \|x' - x\|^2$   (β-DSC)

for some $\lambda_i, \beta > 0$ and for all $x, x' \in \mathcal{X}$. Focusing for expository reasons on the most widely used, Euclidean incarnation of the method (Example 3.1), we have:

Theorem 5.2. Let $x^*$ be the (necessarily unique) Nash equilibrium of a $\beta$-strongly monotone game. If the players follow (MD-b) with Euclidean projections and parameters $\gamma_n = \gamma / n$ and $\delta_n = \delta / n^{1/3}$ with $\gamma > 1/(3\beta)$ and $\delta > 0$, we have

$\mathbb{E}[\|\hat X_n - x^*\|^2] = O(n^{-1/3})$.   (5.3)

Theorem 5.2 is our main finite-time analysis result, so some remarks are in order. First, the step-size schedule $\gamma_n \propto 1/n$ is not required to obtain an $O(n^{-1/3})$ convergence rate: as we show in the paper's supplement, more general schedules of the form $\gamma_n \propto 1/n^p$ and $\delta_n \propto 1/n^q$ with $p > 3/4$ and $q = p/3 > 1/4$ still guarantee an $O(n^{-1/3})$ rate of convergence for (MD-b). To put things in perspective, we also show in the supplement that if (MD) is run with first-order oracle feedback satisfying the statistical assumptions (4.1), the rate of convergence becomes $O(1/n)$. Viewed in this light, the price for not having access to gradient information is no higher than $O(n^{2/3})$ in terms of the players' equilibration rate.

Finally, it is also worth comparing the bound (5.3) to the attainable rates for stochastic convex optimization (the single-player case). For problems with objectives that are both strongly convex and smooth, Agarwal et al. (2010) attained an $O(n^{-1/2})$ convergence rate with bandit feedback, which Shamir (2013) showed is unimprovable. Thus, in the single-player case, the bound (5.3) is off by a factor of $n^{1/6}$ and coincides with the bound of Agarwal et al. (2010) for strongly convex functions that are not necessarily smooth. One reason for this gap is that the $\Theta(n^{-1/2})$ bound of Shamir (2013) concerns the smoothed-out time average $\bar X_n = n^{-1} \sum_{k=1}^n X_k$, while our analysis concerns the sequence of realized actions $\hat X_n$. This difference is semantically significant: in optimization, the query sequence is just a means to an end, and only the algorithm's output matters (i.e., $\bar X_n$). In a game-theoretic setting however, it is the players' realized actions that determine their rewards at each stage, so the figure of merit is the actual sequence of play $\hat X_n$. This sequence is more difficult to control, so this disparity is, perhaps, not too surprising; nevertheless, we believe that this gap can be closed by using a more sophisticated single-shot estimate, e.g., as in Ghadimi and Lan (2013). We defer this analysis to the future.

6 Concluding remarks

The most sensible choice for agents who are oblivious to the presence of each other (or who are simply conservative) is to deploy a no-regret learning algorithm. With this in mind, we studied the long-run behavior of individual regularized no-regret learning policies, and we showed that, in monotone games, play converges to equilibrium with probability 1, and the rate of convergence almost matches the optimal rates of single-agent, stochastic convex optimization. Nevertheless, several questions remain open: whether there is an intrinsic information-theoretic obstacle to bridging this gap; whether our convergence rate estimates hold with high probability (and not just in expectation); and whether our analysis extends to a fully decentralized setting where the players' updates need not be synchronous. We intend to address these questions in future work.
Acknowledgments

M. Bravo gratefully acknowledges the support provided by FONDECYT grant 11151003. P. Mertikopoulos was partially supported by the Huawei HIRP flagship grant ULTRON and the French National Research Agency (ANR) grant ORACLESS (ANR-16-CE33-0004-01). Part of this work was carried out with financial support from the ECOS project C15E03.

References

Agarwal, Alekh, Ofer Dekel, Lin Xiao. 2010. Optimal algorithms for online convex optimization with multi-point bandit feedback. COLT '10: Proceedings of the 23rd Annual Conference on Learning Theory.
Arora, Sanjeev, Elad Hazan, Satyen Kale. 2012. The multiplicative weights update method: A meta-algorithm and applications. Theory of Computing 8(1): 121–164.
Auer, Peter, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proceedings of the 36th Annual Symposium on Foundations of Computer Science.
Bauschke, Heinz H., Patrick L. Combettes. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd ed. Springer, New York, NY, USA.
Benaïm, Michel. 1999. Dynamics of stochastic approximation algorithms. Jacques Azéma, Michel Émery, Michel Ledoux, Marc Yor, eds., Séminaire de Probabilités XXXIII, Lecture Notes in Mathematics, vol. 1709. Springer Berlin Heidelberg, 1–68.
Bervoets, Sebastian, Mario Bravo, Mathieu Faure. 2018. Learning with minimal information in continuous games. https://arxiv.org/abs/1806.11506.
Chen, Gong, Marc Teboulle. 1993. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization 3(3): 538–543.
Cohen, Johanne, Amélie Héliou, Panayotis Mertikopoulos. 2017. Learning with bandit feedback in potential games. NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems.
Debreu, Gerard. 1952. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences of the USA 38(10): 886–893.
D'Oro, Salvatore, Panayotis Mertikopoulos, Aris L. Moustakas, Sergio Palazzo. 2015. Interference-based pricing for opportunistic multi-carrier cognitive radio systems. IEEE Trans. Wireless Commun. 14(12): 6536–6549.
Flaxman, Abraham D., Adam Tauman Kalai, H. Brendan McMahan. 2005. Online convex optimization in the bandit setting: Gradient descent without a gradient. SODA '05: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 385–394.
Foster, Dylan J., Thodoris Lykouris, Karthik Sridharan, Éva Tardos. 2016. Learning in games: Robustness of fast convergence. NIPS '16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 4727–4735.
Freund, Yoav, Robert E. Schapire. 1999. Adaptive game playing using multiplicative weights. Games and Economic Behavior 29: 79–103.
Ghadimi, Saeed, Guanghui Lan. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4): 2341–2368.
Kleinberg, Robert D. 2004. Nearly tight bounds for the continuum-armed bandit problem. NIPS '04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems.
Mertikopoulos, Panayotis, E. Veronica Belmega. 2016. Learning to be green: Robust energy efficiency maximization in dynamic MIMO-OFDM systems. IEEE J. Sel. Areas Commun. 34(4): 743–757.
Mertikopoulos, Panayotis, E. Veronica Belmega, Romain Negrel, Luca Sanguinetti. 2017. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process. 65(9): 2277–2290.
Mertikopoulos, Panayotis, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, Georgios Piliouras. 2018a. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. https://arxiv.org/abs/1807.02629.
Mertikopoulos, Panayotis, Christos H. Papadimitriou, Georgios Piliouras. 2018b. Cycles in adversarial regularized learning. SODA '18: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms.
Mertikopoulos, Panayotis, Zhengyuan Zhou. 2018. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming.
Nemirovski, Arkadi Semen, David Berkovich Yudin. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY.
Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization: A Basic Course. No. 87 in Applied Optimization, Kluwer Academic Publishers.
Nesterov, Yurii. 2009. Primal-dual subgradient methods for convex problems. Mathematical Programming 120(1): 221–259.
Orda, Ariel, Raphael Rom, Nahum Shimkin. 1993. Competitive routing in multi-user communication networks. IEEE/ACM Trans. Netw. 1(5): 614–627.
Palaiopanos, Gerasimos, Ioannis Panageas, Georgios Piliouras. 2017. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems.
Perkins, Steven, David S. Leslie. 2014. Stochastic fictitious play with continuous action sets. Journal of Economic Theory 152: 179–213.
Perkins, Steven, Panayotis Mertikopoulos, David S. Leslie. 2017. Mixed-strategy learning with continuous action sets. IEEE Trans. Autom. Control 62(1): 379–384.
Rosen, J. B. 1965. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica 33(3): 520–534.
Scutari, Gesualdo, Francisco Facchinei, Daniel Pérez Palomar, Jong-Shi Pang. 2010. Convex optimization, game theory, and variational inequality theory in multiuser communication systems. IEEE Signal Process. Mag. 27(3): 35–49.
Shamir, Ohad. 2013. On the complexity of bandit and derivative-free stochastic convex optimization. COLT '13: Proceedings of the 26th Annual Conference on Learning Theory.
Sorin, Sylvain, Cheng Wan. 2016. Finite composite games: Equilibria and dynamics. Journal of Dynamics and Games 3(1): 101–120.
Spall, James C. 1997. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33(1): 109–112.
Syrgkanis, Vasilis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire. 2015. Fast convergence of regularized learning in games. NIPS '15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2989–2997.
Viossat, Yannick, Andriy Zapechelnyuk. 2013. No-regret dynamics and fictitious play. Journal of Economic Theory 148(2): 825–842.
Zinkevich, Martin. 2003. Online convex programming and generalized infinitesimal gradient ascent. ICML '03: Proceedings of the 20th International Conference on Machine Learning, 928–936.