# Strategizing against No-regret Learners

Yuan Deng (Duke University, ericdy@cs.duke.edu), Jon Schneider (Google Research, jschnei@google.com), Balasubramanian Sivan (Google Research, balusivan@google.com)

How should a player who repeatedly plays a game against a no-regret learner strategize to maximize his utility? We study this question and show that under some mild assumptions, the player can always guarantee himself a utility of at least what he would get in a Stackelberg equilibrium of the game. When the no-regret learner has only two actions, we show that the player cannot get any higher utility than the Stackelberg equilibrium utility. But when the no-regret learner has more than two actions and plays a mean-based no-regret strategy, we show that the player can get strictly more than the Stackelberg equilibrium utility. We provide a characterization of the optimal gameplay for the player against a mean-based no-regret learner as the solution to a control problem. When the no-regret learner's strategy also guarantees him no swap regret, we show that the player cannot get anything higher than the Stackelberg equilibrium utility.

## 1 Introduction

Consider a two-player bimatrix game with a finite number of actions for each player, repeated over $T$ rounds. When playing a repeated game, a widely adopted strategy is to employ a no-regret learning algorithm: a strategy that guarantees the player that, in hindsight, no single action played throughout the game would have performed significantly better. Knowing that one of the players (the learner) is playing a no-regret learning strategy, what is the optimal gameplay for the other player (the optimizer)? This question is the focus of our work.

If this were a single-shot strategic game where learning is not relevant, a (pure or mixed strategy) Nash equilibrium is a reasonable prediction of the game's outcome. In the $T$-round game with learning, can the optimizer guarantee himself a per-round utility of at least what he could get in a single-shot game? Is it possible to get significantly more utility than this? Does this utility depend on the specific choice of learning algorithm of the learner? What gameplay should the optimizer adopt to achieve maximal utility? None of these questions are straightforward, and indeed none of them have unconditional answers.

Our results. Central to our results is the notion of the Stackelberg equilibrium of the underlying game. The Stackelberg variant of our game is a single-shot, two-stage game in which the optimizer moves first and publicly commits to a mixed strategy; the learner then best responds to this strategy. The Stackelberg equilibrium is the resulting equilibrium of this game when both players play optimally. Note that the optimizer's utility in the Stackelberg equilibrium is always weakly larger than his utility in any (pure or mixed strategy) Nash equilibrium, and is often strictly larger. Let $V$ be the utility of the optimizer in the Stackelberg equilibrium. With some mild assumptions on the game, we show that the optimizer can always guarantee himself a utility of at least $(V - \varepsilon)T - o(T)$ over $T$ rounds, for any $\varepsilon > 0$, irrespective of the learning algorithm used by the learner as long as it has the no-regret guarantee (Theorem 4). This means that if one of the players is a learner, the other player can already profit over the Nash equilibrium regardless of the specifics of the learning algorithm employed or the structure of the game.
Further, if any one of the following conditions holds:

1. the game is a constant-sum game,
2. the learner's no-regret algorithm has the stronger guarantee of no swap regret (see Section 2), or
3. the learner has only two possible actions in the game,

then the optimizer cannot get a utility higher than $VT + o(T)$ (see Theorems 5, 6, and 7). If the learner employs a learning algorithm from a natural class called mean-based learning algorithms [Braverman et al., 2018] (see Section 2), which includes popular no-regret algorithms like the Multiplicative Weights algorithm, the Follow-the-Perturbed-Leader algorithm, and the EXP3 algorithm, we show that there exist games where the optimizer can guarantee himself a utility $V'T - o(T)$ for some $V' > V$ (see Theorem 8). We note the contrast between the cases of 2 and 3 actions for the learner: in the 2-action case, even if the learner plays a mean-based strategy, the optimizer cannot get anything more than $VT + o(T)$ (Theorem 7), whereas with 3 actions there are games where he can guarantee a linearly higher utility.

Given this possibility of exceeding the Stackelberg utility, our final result concerns the nature and structure of the utility-optimal gameplay for the optimizer against a learner that employs a mean-based strategy. First, we give a crisp characterization of the optimizer's asymptotically optimal algorithm as the solution to a control problem (see Section 4.2) in $N$ dimensions, where $N$ is the number of actions of the learner. This characterization is predicated on the fact that just knowing the cumulative historical utilities of each of the learner's actions is essentially enough information to accurately predict the learner's next action in the case of a mean-based learner. These $N$ cumulative utilities thus form an $N$-dimensional state for the learner, which the optimizer can manipulate via his choice of actions. We then proceed to make multiple observations that simplify the solution space of this control problem. We leave computing or characterizing the optimal solution to this control problem as a very interesting open question, and we provide one conjecture of a potential characterization.

Comparison to prior work. The very recent work of Braverman et al. [2018] is the closest to ours. They study the specific two-player game of an auction between a single seller and a single buyer: the seller's choice of auction represents his action, and the buyer's bid represents her action. The main difference from Braverman et al. [2018] is that they consider a Bayesian setting where the buyer's type is drawn from a distribution, whereas there is no Bayesian element in our setting. They show that regardless of the specific algorithm used by the buyer, as long as the buyer plays a no-regret learning algorithm, the seller can always earn at least the optimal revenue of a single-shot auction. Our Theorem 4 is a direct generalization of this result to arbitrary games without any structure. Further, Braverman et al. [2018] show that there exist no-regret strategies for the buyer that guarantee that the seller cannot get anything better than the single-shot optimal revenue. Our Theorems 5, 6 and 7 are both generalizations and refinements of this result, as they pinpoint both the exact learner strategies and the kinds of games that prevent the optimizer from going beyond the Stackelberg utility. Braverman et al.
[2018] also show that when the buyer plays a mean-based strategy, the seller can design an auction that guarantees him per-round revenue beyond the single-shot optimal auction revenue. Our control problem can be seen as a rough parallel and generalization of this result.

Other related work. The first notion of regret (without the swap qualification) we use in this paper is also referred to as external regret (see Hannan [1957], Foster and Vohra [1993], Littlestone and Warmuth [1994], Freund and Schapire [1997], Freund and Schapire [1999], Cesa-Bianchi et al. [1997]). The other notion of regret we use is swap regret. A slightly weaker notion, internal regret, was defined earlier by Foster and Vohra [1998]; it allows all occurrences of a given action $x$ to be replaced by another action $y$. Many no-internal-regret algorithms have been designed (see, for example, Hart and Mas-Colell [2000], Foster and Vohra [1997, 1998, 1999], Cesa-Bianchi and Lugosi [2003]). The stronger notion of swap regret was introduced by Blum and Mansour [2005]; it allows one to simultaneously swap several pairs of actions. Blum and Mansour also show how to efficiently convert a no-regret algorithm into a no-swap-regret algorithm. One reason for the importance of internal and swap regret is their close connection to the central notion of correlated equilibrium, introduced by Aumann [1974]. In a general $n$-player game, a distribution over action profiles of all the players is a correlated equilibrium if every player has zero internal regret. When all players use algorithms with no-internal-regret guarantees, the time-averaged strategies of the players converge to a correlated equilibrium (see Hart and Mas-Colell [2000]). When all players simply use algorithms with no-external-regret guarantees, the time-averaged strategies of the players converge to the weaker notion of coarse correlated equilibrium. When the game is a zero-sum game, the time-averaged strategies of players employing no-external-regret dynamics converge to the Nash equilibrium of the game. On the topic of optimizing against a no-regret learner, Agrawal et al. [2018] study a setting similar to Braverman et al. [2018] but also consider other types of buyer behavior apart from learning, and show how to robustly optimize against various buyer strategies in an auction.

## 2 Model and Preliminaries

### 2.1 Games and equilibria

Throughout this paper, we restrict our attention to simultaneous two-player bimatrix games $G$. We refer to the first player as the optimizer and the second player as the learner. We denote the set of actions available to the optimizer as $A = \{a_1, a_2, \ldots, a_M\}$ and the set of actions available to the learner as $B = \{b_1, b_2, \ldots, b_N\}$. If the optimizer chooses action $a_i$ and the learner chooses action $b_j$, then the optimizer receives utility $u_O(a_i, b_j)$ and the learner receives utility $u_L(a_i, b_j)$. We normalize the utilities so that $|u_O(a_i, b_j)| \le 1$ and $|u_L(a_i, b_j)| \le 1$. We write $\Delta(A)$ and $\Delta(B)$ to denote the sets of mixed strategies for the optimizer and the learner respectively. When the optimizer plays $\alpha \in \Delta(A)$ and the learner plays $\beta \in \Delta(B)$, the optimizer's utility is $u_O(\alpha, \beta) = \sum_{i=1}^{M} \sum_{j=1}^{N} \alpha_i \beta_j u_O(a_i, b_j)$, and similarly for the learner's utility. We say that a strategy $b \in B$ is a best response to a strategy $\alpha \in \Delta(A)$ if $b \in \arg\max_{b' \in B} u_L(\alpha, b')$. We are now ready to define the Stackelberg equilibrium [Von Stackelberg, 2010].

Definition 1. The Stackelberg equilibrium of a game is a pair of strategies $\alpha \in \Delta(A)$ and $b \in B$ that maximizes $u_O(\alpha, b)$ under the constraint that $b$ is a best response to $\alpha$. We call the value $u_O(\alpha, b)$ the Stackelberg value of the game.
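
To make Definition 1 concrete, here is a minimal sketch (ours, not the paper's) of one standard way to compute the Stackelberg value of a small bimatrix game: for each pure learner action $b_j$, solve a linear program that maximizes the optimizer's utility over commitments $\alpha$ for which $b_j$ is a best response, and take the best outcome over $j$. The function name, the use of `scipy.optimize.linprog`, and the instantiation of the Table 1 game from Section 4.1 (with its small learner payoff set to 0.1) are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def stackelberg_value(U_O, U_L):
    """Stackelberg value of a bimatrix game (optimizer = row player).

    For each learner action j, solve the LP
        max_alpha  alpha @ U_O[:, j]
        s.t.       alpha @ U_L[:, j] >= alpha @ U_L[:, k]  for all k != j,
                   alpha >= 0,  sum(alpha) = 1,
    and return the best objective value over j.
    """
    M, N = U_O.shape
    best = -np.inf
    for j in range(N):
        c = -U_O[:, j]  # linprog minimizes, so negate the objective
        # Best-response constraints: alpha @ (U_L[:, k] - U_L[:, j]) <= 0.
        A_ub = np.array([U_L[:, k] - U_L[:, j] for k in range(N) if k != j])
        b_ub = np.zeros(N - 1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.ones((1, M)), b_eq=np.ones(1),
                      bounds=[(0, 1)] * M, method="highs")
        if res.success:
            best = max(best, -res.fun)
    return best

# The game of Table 1 (Section 4.1), with its small learner payoff set to 0.1.
U_O = np.array([[0.0, -2.0, -2.0],    # optimizer utilities: Top row
                [0.0, -2.0,  2.0]])   # optimizer utilities: Bottom row
U_L = np.array([[0.1, -1.0,  0.0],    # learner utilities against Top
                [-1.0, 1.0,  0.0]])   # learner utilities against Bottom
print(stackelberg_value(U_O, U_L))    # prints (approximately) 0.0
```

Ties among the learner's best responses are resolved in the optimizer's favor here, which matches Definition 1's maximization over pairs $(\alpha, b)$ subject only to $b$ being some best response.
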
A game is zero-sum if $u_O(a_i, b_j) + u_L(a_i, b_j) = 0$ for all $i \in [M]$ and $j \in [N]$; likewise, a game is constant-sum if $u_O(a_i, b_j) + u_L(a_i, b_j) = C$ for some fixed constant $C$ for all $i \in [M]$ and $j \in [N]$. Note that for zero-sum or constant-sum games, the Stackelberg equilibrium coincides with the standard notion of Nash equilibrium, due to the celebrated minimax theorem [von Neumann, 1928]. Moreover, throughout this paper we assume that the learner has no weakly dominated strategies: a strategy $b \in B$ is weakly dominated if there exists $\beta \in \Delta(B \setminus \{b\})$ such that for all $a \in A$, $u_L(a, \beta) \ge u_L(a, b)$.

We are interested in the setting where the optimizer and the learner repeatedly play the game $G$ for $T$ rounds. We denote the optimizer's action at time $t$ as $a_t$ and the learner's action at time $t$ as $b_t$. Both the optimizer's and the learner's utilities are additive over rounds, with no discounting. The optimizer's strategy can be adaptive (i.e., $a_t$ can depend on the previous values of $b_t$) or non-adaptive (in which case it can be expressed as a sequence of mixed strategies $(\alpha_1, \alpha_2, \ldots, \alpha_T)$). Unless otherwise specified, all positive results (results guaranteeing the optimizer some utility) apply to non-adaptive optimizers, and all negative results apply even to adaptive optimizers. As the name suggests, the learner's (adaptive) strategy will be specified by some variant of a low-regret learning algorithm, as described in the next section.

### 2.2 No-regret learning and mean-based learning

In the classic multi-armed bandit problem with $T$ rounds, the learner selects one of $K$ options (a.k.a. arms) on round $t$ and receives a reward $r_{i,t} \in [0, 1]$ if he selects option $i$. The rewards can be chosen adversarially, and the learner's objective is to maximize his total reward. Let $i_t$ be the arm pulled by the learner at round $t$. The regret of a (possibly randomized) learning algorithm $\mathrm{Alg}$ is defined as the difference between the performance of the best arm and that of $\mathrm{Alg}$: $\mathrm{Reg}(\mathrm{Alg}) = \max_i \sum_{t=1}^{T} \left( r_{i,t} - r_{i_t,t} \right)$. An algorithm $\mathrm{Alg}$ for the multi-armed bandit problem is no-regret if its expected regret is sublinear in $T$, i.e., $\mathbb{E}[\mathrm{Reg}(\mathrm{Alg})] = o(T)$. In addition to the bandit setting, in which the learner only observes the reward of the arm he pulls, our results also apply to the experts setting, in which the learner observes the rewards of all arms in every round. Simple no-regret strategies exist in both the bandit and the experts settings.

Among no-regret algorithms, we are interested in two special classes. The first is the class of mean-based strategies:

Definition 2 (Mean-based Algorithm). Let $\sigma_{i,t} = \sum_{s=1}^{t} r_{i,s}$ be the cumulative reward of arm $i$ over the first $t$ rounds. An algorithm is $\gamma$-mean-based if, whenever $\sigma_{i,t} < \sigma_{j,t} - \gamma T$, the probability that the algorithm pulls arm $i$ on round $t$ is at most $\gamma$. An algorithm is mean-based if it is $\gamma$-mean-based for some $\gamma = o(1)$.

Intuitively, mean-based strategies are strategies that play the arm that has historically performed the best. Braverman et al. [2018] show that many no-regret algorithms are mean-based, including commonly used variants of EXP3 (for the bandit setting), the Multiplicative Weights algorithm (for the experts setting), and the Follow-the-Perturbed-Leader algorithm (for the experts setting).
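
As a concrete illustration of Definition 2 (ours, not from the paper), the following sketch implements the Multiplicative Weights / Hedge update in the experts setting. With the step size below it is a standard no-regret algorithm, and, as noted above, Braverman et al. [2018] show that commonly used variants of this update are also mean-based: an arm whose cumulative reward trails the leader by much more than $1/\eta$ receives vanishingly small probability. The class name and the particular step-size choice are illustrative assumptions.

```python
import math
import random

class HedgeLearner:
    """Multiplicative Weights (Hedge) over K arms in the experts setting.

    Plays arm i with probability proportional to exp(eta * sigma_i), where
    sigma_i is the cumulative reward of arm i so far.  Arms whose cumulative
    reward trails the leader by much more than 1/eta get vanishingly small
    probability, which is the mean-based property of Definition 2.
    """

    def __init__(self, K, T):
        self.K = K
        self.eta = math.sqrt(math.log(K) / T)  # a standard no-regret step size
        self.sigma = [0.0] * K                 # cumulative rewards per arm

    def probabilities(self):
        m = max(self.sigma)                    # subtract max for numerical stability
        w = [math.exp(self.eta * (s - m)) for s in self.sigma]
        z = sum(w)
        return [x / z for x in w]

    def act(self):
        return random.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, rewards):
        """Observe the full reward vector for this round (experts feedback)."""
        for i, r in enumerate(rewards):
            self.sigma[i] += r
```

In the repeated game of Section 2.1, the reward vector the learner observes on round $t$ would be $(u_L(a_t, b_1), \ldots, u_L(a_t, b_N))$, rescaled into $[0, 1]$ if desired.
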
The second class is the class of no-swap-regret algorithms:

Definition 3 (No-Swap-Regret Algorithm). The swap regret $\mathrm{Reg}_{\mathrm{swap}}(\mathrm{Alg})$ of an algorithm $\mathrm{Alg}$ is defined as $\mathrm{Reg}_{\mathrm{swap}}(\mathrm{Alg}) = \max_{\pi: [K] \to [K]} \mathrm{Reg}(\mathrm{Alg}, \pi) = \max_{\pi: [K] \to [K]} \sum_{t=1}^{T} \left( r_{\pi(i_t),t} - r_{i_t,t} \right)$, where the maximum is over all functions $\pi$ mapping options to options. An algorithm is no-swap-regret if its expected swap regret is sublinear in $T$, i.e., $\mathbb{E}[\mathrm{Reg}_{\mathrm{swap}}(\mathrm{Alg})] = o(T)$.

Intuitively, no-swap-regret strategies strengthen the no-regret criterion in the following way: no-regret guarantees that the learning algorithm performs as well as the best single arm overall, whereas no-swap-regret guarantees that the learning algorithm performs as well as the best possible arm on each subset of rounds where the same action is played. Given a no-regret algorithm, a no-swap-regret algorithm can be constructed via a clever reduction (see Blum and Mansour [2005]).

## 3 Playing against no-regret learners

### 3.1 Achieving Stackelberg equilibrium utility

To begin with, we show that the optimizer can achieve an average utility per round arbitrarily close to the Stackelberg value against a no-regret learner.

Theorem 4. Let $V$ be the Stackelberg value of the game $G$. If the learner is playing a no-regret learning algorithm, then for any $\varepsilon > 0$, the optimizer can guarantee at least $(V - \varepsilon)T - o(T)$ utility.

Proof. Let $(\alpha, b)$ be the Stackelberg equilibrium of the game $G$. Since $(\alpha, b)$ forms a Stackelberg equilibrium, $b \in \arg\max_{b'} u_L(\alpha, b')$. Moreover, by the assumption that the learner does not have a weakly dominated strategy, there does not exist $\beta \in \Delta(B \setminus \{b\})$ such that for all $a \in A$, $u_L(a, \beta) \ge u_L(a, b)$. By Farkas' lemma [Farkas, 1902], there must therefore exist an $\alpha' \in \Delta(A)$ and a $\kappa > 0$ such that $u_L(\alpha', b) \ge u_L(\alpha', b') + \kappa$ for all $b' \in B \setminus \{b\}$. Therefore, for any $\delta \in (0, 1)$, the optimizer can play the strategy $\alpha'' = (1 - \delta)\alpha + \delta\alpha'$, under which $b$ is the unique best response and playing any strategy $b' \ne b$ costs the learner at least $\delta\kappa$ per round. As a result, since the learner is playing a no-regret learning algorithm, in expectation there are at most $o(T)$ rounds in which the learner plays some $b' \ne b$. It follows that the optimizer's utility is at least $VT - \delta(V - u_O(\alpha', b))T - o(T)$. Thus, we can conclude the proof by setting $\varepsilon = \delta(V - u_O(\alpha', b))$.
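
To see the mechanism of this proof in action, here is a small deterministic simulation (our own illustration, not from the paper). It uses the game of Table 1 from Section 4.1 with its small learner payoff set to 0.1, takes $\alpha = (1/2, 1/2)$ (the Stackelberg commitment, with best response Right), $\alpha' = (0.6, 0.4)$ (a strategy under which Right is strictly best for the learner), and $\delta = 0.5$, and plays the commitment $\alpha'' = (1-\delta)\alpha + \delta\alpha'$ every round against a Hedge learner fed expected rewards. All of these concrete choices, and the function names, are assumptions made for the sake of illustration.

```python
import math

# Table 1 game from Section 4.1, with its small learner payoff set to 0.1.
# Rows: Top, Bottom.  Columns: Left, Mid, Right.
U_O = [[0.0, -2.0, -2.0], [0.0, -2.0, 2.0]]   # optimizer utilities
U_L = [[0.1, -1.0, 0.0], [-1.0, 1.0, 0.0]]    # learner utilities

def hedge_probs(sigma, eta):
    """Hedge/MW distribution over arms given cumulative rewards sigma."""
    m = max(sigma)
    w = [math.exp(eta * (s - m)) for s in sigma]
    z = sum(w)
    return [x / z for x in w]

def average_utility(T, commitment):
    """Expected-value simulation: a fixed optimizer commitment against a Hedge
    learner that observes the expected reward of every arm each round."""
    eta = math.sqrt(math.log(3) / T)
    sigma = [0.0, 0.0, 0.0]
    total = 0.0
    for _ in range(T):
        p = hedge_probs(sigma, eta)
        total += sum(commitment[i] * p[j] * U_O[i][j]
                     for i in range(2) for j in range(3))
        for j in range(3):
            sigma[j] += sum(commitment[i] * U_L[i][j] for i in range(2))
    return total / T

# Theorem 4's construction with alpha = (1/2, 1/2), alpha' = (0.6, 0.4),
# and delta = 0.5, i.e. the commitment alpha'' = (0.55, 0.45).
delta = 0.5
alpha_pp = [(1 - delta) * 0.5 + delta * 0.6, (1 - delta) * 0.5 + delta * 0.4]
print(average_utility(200_000, alpha_pp))
# Roughly -0.2 = V - epsilon, with epsilon = delta * (V - u_O(alpha', Right)),
# plus an o(1) learning transient; it approaches the Stackelberg value V = 0
# as delta shrinks and T grows.
```

Feeding expected rewards keeps the sketch deterministic; sampling the learner's arm each round instead gives the same behavior in expectation.
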
Next, we show that in the special class of constant-sum games, the Stackelberg value is the best that the optimizer can hope for when playing against a no-regret learner.

Theorem 5. Let $G$ be a constant-sum game, and let $V$ be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than $VT + o(T)$ utility.

Proof. Let $\vec{a} = (a_1, \ldots, a_T)$ be the sequence of the optimizer's actions, and let $\alpha^\star \in \Delta(A)$ be the mixed strategy that plays $a_i \in A$ with probability $\alpha^\star_i = |\{t \mid a_t = a_i\}| / T$. Since the learner is playing a no-regret learning algorithm, the learner's cumulative utility is at least $\max_{b' \in B} u_L(\alpha^\star, b') \cdot T - o(T) = CT - \left( \min_{b' \in B} u_O(\alpha^\star, b') \cdot T + o(T) \right)$, where $C$ is the constant sum. This implies that the optimizer's utility is at most
$$\max_{\alpha \in \Delta(A)} \min_{b' \in B} u_O(\alpha, b') \cdot T + o(T) = VT + o(T),$$
where the equality holds because the Stackelberg value equals the minimax value of a constant-sum game by the minimax theorem.

### 3.2 No-swap-regret learning

In this section, we show that if the learner is playing a no-swap-regret algorithm, the optimizer can only achieve his Stackelberg utility per round.

Theorem 6. Let $V$ be the Stackelberg value of the game $G$. If the learner is playing a no-swap-regret algorithm, then the optimizer will receive no more than $VT + o(T)$ utility.

Proof. Let $\vec{a} = (a_1, \ldots, a_T)$ be the sequence of the optimizer's actions and let $\vec{b} = (b_1, \ldots, b_T)$ be the realization of the sequence of the learner's actions. Moreover, let $\Pr[\vec{b}]$ be the probability that the learner (who is playing some no-swap-regret learning algorithm) plays $\vec{b}$ given that the optimizer plays $\vec{a}$. The marginal probability that the learner plays $b_j \in B$ at round $t$ is then $\Pr[b_t = b_j] = \sum_{\vec{b} : b_t = b_j} \Pr[\vec{b}]$. For each $b_j$, let $\alpha_{b_j} \in \Delta(A)$ be the mixed strategy that plays $a_i \in A$ with probability
$$\frac{\sum_{t : a_t = a_i} \Pr[b_t = b_j]}{\sum_{t} \Pr[b_t = b_j]}.$$
Let $B' = \{b_j \in B : b_j \notin \arg\max_{b'} u_L(\alpha_{b_j}, b')\}$ be the set of learner actions that are not best responses to their associated strategies, and consider a mapping $\pi$ with $\pi(b_j) \in \arg\max_{b'} u_L(\alpha_{b_j}, b')$ for $b_j \in B'$ and $\pi(b_j) = b_j$ otherwise. The learner's expected swap regret under $\pi$ is then
$$\sum_{b_j \in B'} \left( u_L(\alpha_{b_j}, \pi(b_j)) - u_L(\alpha_{b_j}, b_j) \right) \sum_{t} \Pr[b_t = b_j] \;\ge\; \delta \sum_{b_j \in B'} \sum_{t} \Pr[b_t = b_j],$$
where $\delta = \min_{b_j \in B'} \left( u_L(\alpha_{b_j}, \pi(b_j)) - u_L(\alpha_{b_j}, b_j) \right) > 0$. Therefore, since the learner is playing a no-swap-regret algorithm, we have $\sum_{b_j \in B'} \sum_{t} \Pr[b_t = b_j] = o(T)$. Moreover, for $b_j \in B \setminus B'$, the action $b_j$ is a best response to $\alpha_{b_j}$, so the optimizer's expected utility over the rounds in which the learner plays $b_j$ is at most $u_O(\alpha_{b_j}, b_j) \sum_{t} \Pr[b_t = b_j] \le V \sum_{t} \Pr[b_t = b_j]$. Thus, the optimizer's expected utility is at most
$$\sum_{b_j \in B} u_O(\alpha_{b_j}, b_j) \sum_{t} \Pr[b_t = b_j] \;\le\; V \sum_{b_j \in B \setminus B'} \sum_{t} \Pr[b_t = b_j] + \sum_{b_j \in B'} \sum_{t} \Pr[b_t = b_j] \;\le\; VT + o(T),$$
using $|u_O| \le 1$ and $\sum_{b_j \in B} \sum_{t} \Pr[b_t = b_j] = T$.

Theorem 7. Let $G$ be a game where the learner has $N = 2$ actions, and let $V$ be the Stackelberg value of this game. If the learner is playing a no-regret algorithm, then the optimizer receives no more than $VT + o(T)$ utility.

Proof. By Theorem 6, it suffices to show that when the learner has only two actions, a no-regret learning algorithm is in fact a no-swap-regret learning algorithm. When there are only two actions, there are three possible mappings from $B$ to $B$ other than the identity mapping: $\pi_1$ with $\pi_1(b_1) = b_1$ and $\pi_1(b_2) = b_1$; $\pi_2$ with $\pi_2(b_1) = b_2$ and $\pi_2(b_2) = b_2$; and $\pi_3$ with $\pi_3(b_1) = b_2$ and $\pi_3(b_2) = b_1$. Since the learner is playing a no-regret learning algorithm, we have $\mathbb{E}[\mathrm{Reg}(\mathrm{Alg}, \pi_1)] = o(T)$ and $\mathbb{E}[\mathrm{Reg}(\mathrm{Alg}, \pi_2)] = o(T)$. Moreover, notice that $\mathrm{Reg}(\mathrm{Alg}, \pi_3) = \mathrm{Reg}(\mathrm{Alg}, \pi_1) + \mathrm{Reg}(\mathrm{Alg}, \pi_2)$, so $\mathbb{E}[\mathrm{Reg}(\mathrm{Alg}, \pi_3)] = o(T)$ as well, which concludes the proof.

## 4 Playing against mean-based learners

From the results of the previous section, it is natural to conjecture that no optimizer can achieve more than the Stackelberg value per round when playing against a no-regret algorithm. After all, this is true for the subclass of no-swap-regret algorithms (Theorem 6) and for simple games: constant-sum games (Theorem 5) and games in which the learner has only two actions (Theorem 7). In this section we show that this is not the case. Specifically, we show that there exist games $G$ where an optimizer can win strictly more than the Stackelberg value per round when playing against a mean-based learner. We emphasize that the same strategy for the optimizer works against any mean-based learning algorithm the learner uses. We then proceed to characterize the optimal strategy for a non-adaptive optimizer playing against a mean-based learner as the solution to an optimal control problem in $N$ dimensions (where $N$ is the number of actions of the learner), and make several preliminary observations about the structure an optimal solution to this control problem must possess. Understanding how to efficiently solve this control problem (or whether the optimal solution is even computable) is an intriguing open question.
### 4.1 Beating the Stackelberg value

We begin by showing that it is possible for the optimizer to get significantly (linearly in $T$) more utility when playing against a mean-based learner.

Theorem 8. There exists a game $G$ with Stackelberg value $V$ where the optimizer can receive utility at least $V'T - o(T)$ against a mean-based learner, for some $V' > V$.

Proof. Assume that the learner is using a $\gamma$-mean-based algorithm. Consider the bimatrix game shown in Table 1, in which the optimizer is the row player and each cell lists (optimizer utility, learner utility). (These utilities are bounded in $[-2, 2]$ instead of $[-1, 1]$ for convenience; we can divide through by 2 to get a similar example where utility is bounded in $[-1, 1]$.)

|        | Left     | Mid      | Right   |
|--------|----------|----------|---------|
| Top    | (0, γ)   | (-2, -1) | (-2, 0) |
| Bottom | (0, -1)  | (-2, 1)  | (2, 0)  |

Table 1: Example game for beating the Stackelberg value.

We first argue that the Stackelberg value of this game is 0. Notice that if the optimizer plays Bottom with probability more than 0.5, then the learner's best response is to play Mid, resulting in a $-2$ utility for the optimizer. However, if the optimizer plays Bottom with probability at most 0.5, the expected utility for the optimizer from each column is at most 0. Therefore, in the Stackelberg equilibrium, the optimizer plays Top and Bottom with probability 0.5 each, and the learner best responds by purely playing Right, for a Stackelberg value of $V = 0$.

However, the optimizer can obtain utility $T - o(T)$ by playing Top for the first $\frac{1}{2}T$ rounds and then playing Bottom for the remaining $\frac{1}{2}T$ rounds. Given the optimizer's strategy, during the first $\frac{1}{2}T$ rounds the learner will play Left with probability at least $1 - 2\gamma$ after the first $\gamma T$ rounds. During the remaining $\frac{1}{2}T$ rounds, the learner will switch to playing Right with probability at least $1 - 2\gamma$ between the $\left(\frac{1+\gamma}{2} + \gamma\right)T$-th round and the $(1-\gamma)T$-th round, since by then the cumulative utility for playing Left is at most $\frac{\gamma}{2}T - \frac{3\gamma}{2}T = -\gamma T$ and the cumulative utility for playing Mid is at most $-\gamma T$, while the cumulative utility for playing Right remains 0. Therefore, the cumulative utility of the optimizer over the first $\frac{1}{2}T$ rounds is at least
$$(1 - 2\gamma)\left(\tfrac{1}{2} - \gamma\right)T \cdot 0 + \left[\tfrac{1}{2}T - (1 - 2\gamma)\left(\tfrac{1}{2} - \gamma\right)T\right] \cdot (-2) = -o(T),$$
and the cumulative utility of the optimizer over the remaining $\frac{1}{2}T$ rounds is at least
$$(1 - 2\gamma)\left(\tfrac{1}{2} - \tfrac{5\gamma}{2}\right)T \cdot 2 + \left[\tfrac{1}{2}T - (1 - 2\gamma)\left(\tfrac{1}{2} - \tfrac{5\gamma}{2}\right)T\right] \cdot (-2) = T - o(T).$$
Thus, the optimizer obtains total utility $T - o(T)$, which is greater than $VT = 0$ for the Stackelberg value $V = 0$ of this game.
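
The following sketch (our own illustration, not the paper's experiment) simulates the switching strategy from the proof against one concrete mean-based learner: the entry $\gamma$ of Table 1 is instantiated as 0.1, and the learner runs the Hedge update on expected rewards. The exact numbers depend on these choices, but the qualitative behavior matches the proof: the learner is herded onto Left during the Top phase and onto Right during the Bottom phase.

```python
import math

# Table 1 with its small learner payoff gamma instantiated as 0.1.
# Rows: Top, Bottom.  Columns: Left, Mid, Right.
U_O = [[0.0, -2.0, -2.0], [0.0, -2.0, 2.0]]   # optimizer utilities
U_L = [[0.1, -1.0, 0.0], [-1.0, 1.0, 0.0]]    # learner utilities

def hedge_probs(sigma, eta):
    m = max(sigma)
    w = [math.exp(eta * (s - m)) for s in sigma]
    z = sum(w)
    return [x / z for x in w]

def switching_strategy(T):
    """Optimizer plays Top for the first T/2 rounds, then Bottom; the learner
    runs Hedge on the (here deterministic) rewards of its three arms."""
    eta = math.sqrt(math.log(3) / T)
    sigma = [0.0, 0.0, 0.0]
    total = 0.0
    for t in range(T):
        row = 0 if t < T // 2 else 1          # Top first, then Bottom
        p = hedge_probs(sigma, eta)
        total += sum(p[j] * U_O[row][j] for j in range(3))
        for j in range(3):
            sigma[j] += U_L[row][j]
    return total / T

print(switching_strategy(200_000))
# Roughly 0.85 per round at this T, trending toward 1 as T grows, whereas the
# Stackelberg value of this game is 0.
```

The $o(T)$ terms in the proof show up here as the transient rounds before Hedge locks onto the current leader.
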
, u N) (this will represent the subset of state space where the learner will play action bi). Note that these sets Si (up to some intersection of measure 0) partition the entire space RN. We represent the optimizer s strategy π as a sequence of tuples (α1, t1), (α2, t2), . . . , (αk, tk) with αi (A) and ti [0, 1] satisfying P i ti = 1. Here the tuple (αi, ti) represents the optimizer playing mixed strategy αi for a ti fraction of the total rounds. This strategy evolves the learner s state as follows. The learner originally starts at the state P0 = 0. After the ith tuple (αi, ti), the learner s state evolves according to Pi = Pi 1 + ti(u L(αi, b1), u L(αi, b2), . . . , u L(αi, b N)) (in fact, the state linearly interpolates between Pi 1 and Pi as the optimizer plays this action). For simplicity, we will assume that positive combinations of vectors of the form (u L(αi, b1), u L(αi, b2), . . . , u L(αi, b N)) can generate the entire state space RN. To characterize the optimizer s reward, we must know which set Si the learner s state belongs to. For this reason, we will insist that for each 1 i k, there exists a ji such that both Pi 1 and Pi belong to the same region Sji. It is possible to convert any strategy π into a strategy of this form by subdividing a step (α, t) that crosses a region boundary into two steps (α, t ) and (α, t ) with t = t +t so that the first step stops exactly at the region boundary. If there is more than one possible choice for ji (i.e. Pi 1 and Pi lie on the same region boundary), then without loss of generality we let the optimizer choose ji, since the optimizer can always modify the initial path slightly so that Pi and Pi 1 both lie in a unique region. Once we have done this, the optimizer s average utility per round is given by the expression: i=1 tiu O(αi, bji). Theorem 9. Let U = supπ U(π) where the supremum is over all valid strategies π in this control game. Then 1. For any ε > 0, there exists a non-adaptive strategy for the optimizer which guarantees expected utility at least (U ε)T o(T) when playing against any mean-based learner. 2. For any ε > 0, there exists no non-adaptive strategy for the optimizer which can guarantee expected utility at least (U + ε)T + o(T) when playing against any mean-based learner. Understanding how to solve this control problem (even inefficiently, in finite time) is an interesting open problem. In the remainder of this section, we make some general observations which will let us cut down the strategy space of the optimizer even further and propose a conjecture to the form of the optimal strategy. The first observation is that when the learner has N actions, our state space is truly N 1 dimensional, not N dimensional. This is because in addition to the learner s actions only depending on the cumulative reward for each action, they in fact only depend on the differences between cumulative rewards for different actions (see Definition 2). This means we can represent the state of the learner as a vector (x1, x2, . . . , x N 1) RN 1, where xi = ui u N. The sets Si for 1 i N 1 can be written in terms of the xi as Si = {x|xi = max(x1, . . . , x N 1, 0)} and SN = {x|0 = max(x1, . . . , x N 1, 0)}. The next observation is that if the optimizer makes several consecutive steps in the same region Si, we can combine them into a single step. Specifically, assume Pi, Pi+1, and Pi+2 all belong to some region Sj, where (αi, ti) sends Pi to Pi+1 and (αi+1, ti+1) sends Pi+1 to Pi+2. 
Theorem 9. Let $U^\star = \sup_\pi U(\pi)$, where the supremum is over all valid strategies $\pi$ in this control game. Then:

1. For any $\varepsilon > 0$, there exists a non-adaptive strategy for the optimizer which guarantees expected utility at least $(U^\star - \varepsilon)T - o(T)$ when playing against any mean-based learner.
2. For any $\varepsilon > 0$, there exists no non-adaptive strategy for the optimizer which can guarantee expected utility at least $(U^\star + \varepsilon)T + o(T)$ when playing against any mean-based learner.

Understanding how to solve this control problem (even inefficiently, in finite time) is an interesting open problem. In the remainder of this section, we make some general observations which let us cut down the strategy space of the optimizer even further, and we propose a conjecture about the form of the optimal strategy.

The first observation is that when the learner has $N$ actions, our state space is truly $N-1$ dimensional, not $N$ dimensional. This is because, in addition to the learner's actions depending only on the cumulative reward of each action, they in fact depend only on the differences between the cumulative rewards of different actions (see Definition 2). This means we can represent the state of the learner as a vector $(x_1, x_2, \ldots, x_{N-1}) \in \mathbb{R}^{N-1}$, where $x_i = u_i - u_N$. The sets $S_i$ for $1 \le i \le N-1$ can be written in terms of the $x_i$ as $S_i = \{x \mid x_i = \max(x_1, \ldots, x_{N-1}, 0)\}$, and $S_N = \{x \mid 0 = \max(x_1, \ldots, x_{N-1}, 0)\}$.

The next observation is that if the optimizer makes several consecutive steps in the same region, we can combine them into a single step. Specifically, assume $P_i$, $P_{i+1}$, and $P_{i+2}$ all belong to some region $S_j$, where $(\alpha_i, t_i)$ sends $P_i$ to $P_{i+1}$ and $(\alpha_{i+1}, t_{i+1})$ sends $P_{i+1}$ to $P_{i+2}$. Then replacing these two steps with the single step $\left( \frac{\alpha_i t_i + \alpha_{i+1} t_{i+1}}{t_i + t_{i+1}},\ t_i + t_{i+1} \right)$ results in a strategy with exactly the same reward $U(\pi)$. Applying this fact whenever possible, we can restrict our attention to strategies where all $P_i$ (with the possible exception of the final state $P_k$) lie on the boundary of two or more regions $S_i$.

Finally, we observe that this control problem is scale-invariant: if $\pi = ((\alpha_1, t_1), (\alpha_2, t_2), \ldots, (\alpha_n, t_n))$ is a valid policy that obtains total utility $U$, then $\lambda\pi = ((\alpha_1, \lambda t_1), (\alpha_2, \lambda t_2), \ldots, (\alpha_n, \lambda t_n))$ is another valid policy (with the exception that $\sum_i t_i = \lambda$ rather than 1) which obtains total utility $\lambda U$ (this is true since all the regions $S_i$ are cones with apex at the origin). This means we do not have to restrict ourselves to policies with $\sum_i t_i = 1$; we can choose a policy of any total time, as long as we normalize the utility by $\sum_i t_i$. This generalizes the strategy space, but it is useful for the following reason. Consider a sequence of steps which starts at some point $P$ (not necessarily the origin) and ends at $P$. If $U'$ is the average utility of this cycle, then $U' \le U^\star$ (in particular, we can consider any policy which goes from the origin to $P$ and then repeats this cycle many times). Likewise, if we have a sequence of steps which starts at some point $P$ and ends at $\lambda P$ for some $\lambda > 1$ and achieves average utility $U'$, then again $U' \le U^\star$ (by considering the policy which proceeds $0 \to P \to \lambda P \to \lambda^2 P \to \cdots$; note that it is essential that $\lambda \ge 1$ to prevent this path from converging back to the origin in finite time).

These observations motivate the following conjecture.

Conjecture 10. The value $U^\star$ is achieved by either:

1. the average utility of a policy starting at the origin and consisting of at most $N$ steps (in distinct regions), or
2. the average utility of a path of at most $N$ steps (in distinct regions) which starts at some point $P$ and returns to $\lambda P$ for some $\lambda \ge 1$.

We leave it as an interesting open problem to compute the optimal solution to this control problem.

## References

Shipra Agrawal, Constantinos Daskalakis, Vahab S. Mirrokni, and Balasubramanian Sivan. Robust repeated auctions under heterogeneous buyer behavior. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18–22, 2018, page 171, 2018.

Robert J. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67–96, 1974.

Avrim Blum and Yishay Mansour. From external to internal regret. In Peter Auer and Ron Meir, editors, Learning Theory, 2005.

Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a no-regret buyer. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 523–538. ACM, 2018.

Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003.

Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.

Julius Farkas. Theorie der einfachen Ungleichungen. Journal für die reine und angewandte Mathematik, 124:1–27, 1902.

Dean P. Foster and Rakesh V. Vohra. A randomization rule for selecting forecasts. Operations Research, 41(4):704–709, 1993.

Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1):40–55, 1997.

Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
Dean P. Foster and Rakesh V. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1):7–35, 1999.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999.

James Hannan. Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97–139, 1957.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.

Heinrich Von Stackelberg. Market Structure and Equilibrium. Springer Science & Business Media, 2010.