# Learning to Steer Learners in Games

Yizhou Zhang¹, Yian Ma², Eric Mazumdar¹

¹Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA. ²Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA 92093, USA. Correspondence to: Yizhou Zhang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two-player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of the learner's payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the optimizer's problem to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach when the learner's algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.

1. Introduction

Learning algorithms and AI agents are increasingly being deployed into environments where they interact with other learning agents, be they people or other algorithms. This is already a reality in wide-ranging application areas such as ad auctions, self-driving, automated trading, and cybersecurity. In each of these problem areas, the presence of other agents with potentially misaligned objectives renders the problem game-theoretic in nature. Algorithms deployed in such environments therefore face a generalization of the classic exploration-exploitation trade-off in online learning: on one hand, they must take actions to learn the underlying structure of the game (i.e., its payoffs and potentially the opponents' objectives), and on the other, they must reason strategically about the game-theoretic implications of their actions to maximize utility. Thus, the problem becomes trading off between exploration and competition.

To address this problem, the dominant approach to learning in games has been to formulate it as an adversarial online learning problem. In this framing, to handle opponents with unknown objectives and learning rules, each player assumes they are faced with an arbitrary (and potentially adversarial) sequence of payoffs and seeks an algorithm that maximizes their own utility. Given this setup, a natural class of algorithms to choose from is the class of no-regret algorithms, which guarantee asymptotically optimal performance compared to the best fixed action in hindsight (Cesa-Bianchi & Lugosi, 2006). If all players use no-regret algorithms, it is well known that the average action of the players over the entire time horizon converges to a (coarse) correlated equilibrium (Foster & Vohra, 1998; Hart & Mas-Colell, 2000). However, since the opponents are also learners rather than adversaries, adopting a no-regret algorithm may not be optimal. In this paper, we seek to understand how one player should deviate from choosing a no-regret learning algorithm in games.
We focus on a simplified abstraction of this problem: repeated two-player games with finite actions, in which each player only knows their own utility function (i.e., their payoff matrix), but not their opponent's. This problem has been well studied in recent years, though primarily in the case where the payoff matrix, and thus the objective of the opponent player, is known. Under such assumptions, it was shown in (Deng et al., 2019) that if one player (the optimizer) deviates from using a no-regret algorithm, it can (under mild assumptions on the game instance) guarantee an asymptotic average payoff that is arbitrarily close to the Stackelberg value of the game by steering the no-regret player (the learner) to the Stackelberg equilibrium of the underlying matrix game. This is the highest attainable value if the optimizer plays one fixed mixed strategy. Further studies have focused on analyzing the steerability of smaller classes of algorithms, such as no-swap-regret (Brown et al., 2023) or mean-based algorithms (Arunachaleswaran et al., 2024), providing more insights into steering no-regret learners. Crucially, all these works assume knowledge of the learner's objectives, knowledge that the optimizer itself may not have. In real-world applications, instead of knowing the entire game (i.e., both payoff matrices), it is often the case that each player only knows their own payoff matrix. Despite this, few works have studied the setting where the payoff structure of the learner is initially unknown (or only partially known) to the optimizer. Thus, in this work we focus on answering the following question:

*Without full knowledge of the game, can an optimizer learn to steer a no-regret learner to an (approximate) Stackelberg equilibrium?*

As we will show, a crucial question that emerges from studying this problem is the learnability of the learner's objective through repeated interaction, giving rise to the second main question that we answer:

*What information does the optimizer need in order to achieve this goal?*

1.1. Our Contributions

We answer both questions in the setting of a repeated two-player bimatrix game over a fixed (but assumed large) time horizon $T$. Our results can be summarized as follows.

We provide a negative answer to the first question. More specifically, we show that when the learner's payoff matrix is unknown, no matter what algorithm the optimizer uses, there exists a no-regret algorithm for the learner that prevents the optimizer from achieving its approximate Stackelberg value. This happens because the optimizer cannot accurately learn the learner's payoff. This result suggests that we cannot hope to design an algorithm that asymptotically matches the performance of the Stackelberg equilibrium against all no-regret algorithms and all payoff matrices of the learner. This highlights a fundamental difference, driven by the lack of information, from the case where the learner's payoff is known and the asymptotic Stackelberg value is attainable.

Given the impossibility result above, we shift our focus to what information is needed to steer the learner.
We show that in order to achieve the asymptotic Stackelberg value, instead of exactly recovering the learner's payoff structure, it suffices for the optimizer to first obtain a reasonably accurate estimate of the payoff matrix (or, equivalently, of the best-response structure) through some learning method, and then incorporate the idea of pessimism to steer the learner to the Stackelberg equilibrium by leveraging its no-regret nature. We show that as long as the estimation process takes no more than $o(T)$ steps, the optimizer is able to steer the learner to the Stackelberg equilibrium and consequently (asymptotically) achieve its Stackelberg value using an explore-then-commit style algorithm.

Building on the previous result, we show through two concrete examples that, when some information about the learner's update rule is known, it is possible to learn the learner's payoff structure and thus to steer them to the Stackelberg equilibrium. One example assumes the learner has only two pure strategies and uses an arbitrary ascent algorithm, i.e., one whose payoff increases at each step; the other assumes that the learner uses stochastic mirror ascent with known step sizes and regularizer. We note that most existing no-regret algorithms share similar dynamics with these two cases.

2. Related Works

Before presenting our results, we discuss relevant related works.

Steering no-regret learners. The problem of steering a no-regret learner in a repeated game has been the focus of several recent works, though often under strong assumptions on what information is available to the optimizer. Assuming known learner payoffs, Braverman et al. (2018) first introduced the problem of steering a learner in an auction setting. Subsequently, Deng et al. (2019) showed that the optimizer can guarantee at least the Stackelberg value against the learner in bimatrix games, and Assos et al. (2024) studied the problem of utility maximization under additional assumptions on the learner's algorithm. Brown et al. (2023) also focused on smaller classes of no-regret algorithms such as no-swap-regret, anytime no-regret, and no-adaptive-regret algorithms. While sharing a similar goal with our work, all of these works use the known learner payoff to design steering algorithms, which is not available in our setting. Without knowledge of the learner payoff, Brânzei et al. (2024) showed the steerability of the learner in a cake-cutting model, which can be viewed as a 2D special case of our problem. Lin & Chen (2024) proposed a principal-agent framework and studied the optimal average utility that can be obtained by the optimizer assuming either a no-regret or a no-swap-regret algorithm under known learner payoff. There is also a broader line of work on more general properties of interacting with no-regret learners, including (Zhang et al., 2024), which considered the steering problem through direct payments to the learner, and (Arunachaleswaran et al., 2024), which studied Pareto-optimality in the space of learning algorithms.

Learning in Stackelberg games. The problem of steering is closely related to the problem of learning to play a Stackelberg equilibrium through repeated interactions. Thus, a particularly related line of work involves learning in unknown repeated Stackelberg games, where the decisions are made sequentially in each round.
This problem has been well studied in its own right, but often under simplifying assumptions on the games or on the responses of the follower (the learner in our framing of the problem). Letchford et al. (2009) and Peng et al. (2019) proposed algorithms for learning through interaction with a myopic best-responding agent, and Haghtalab et al. (2022) extended this framework to non-myopic agents with discounted utilities over time, a different setup from the one we consider. Other works have analyzed similar problems under different assumptions on the underlying game. In the control literature, Lauffer et al. (2023) studied the problem of learning in dynamic Stackelberg games and showed that one could learn the Stackelberg equilibrium. Maheshwari et al. (2024) studied a similar problem of learning Stackelberg equilibria in the context of continuous games; Goktas et al. (2022) studied the behavior of no-regret learning in a smaller class of zero-sum Stackelberg games; and the problem of steering learning agents has also emerged in the literature on strategic classification (Zrnic et al., 2021). In each of these problems, the additional structure introduced into the games simplifies the task of learning the Stackelberg equilibrium through, e.g., convexity or smoothness of the underlying optimization problem. Unfortunately, in bimatrix games the Stackelberg optimization problem is both highly nonconvex and discontinuous, vastly complicating the task of learning Stackelberg equilibria. Learning in Stackelberg games has also been studied in various application areas including security (Blum et al., 2014; Balcan et al., 2015), calibration (Haghtalab et al., 2023), and learning with side information (Harris et al., 2024). Most results proposed in these works assume the follower uses (nearly) myopic best-response dynamics, which does not fit our setting.

Other related works. Several other relevant works do not directly fit into either "steering" or "learning". Conitzer & Sandholm (2006) first proposed an efficient algorithm for computing Stackelberg equilibria when the game is known, by solving and combining several small linear programs, motivating our steering approach based on payoff matrix recovery. Gan et al. (2023) studied the notion of robust Stackelberg equilibrium, allowing the follower to respond with any δ-optimal response. Despite being formulated differently, this shares the same high-level idea as our approach of inducing pessimism to steer learners. Collina et al. (2024) considered the setting of finitely repeated Stackelberg games where the leader commitments and follower actions are adaptive to the gameplay history, and proposed an efficient algorithm for approximating Stackelberg equilibria in the space of adaptive game-playing algorithms. There are also works from an empirical perspective that consider (and leverage) the learning behavior of other agents to achieve higher payoffs and a more stable learning process (Foerster et al., 2018; Lu et al., 2022).

3. Notations and Preliminaries

Throughout this paper, we use $\Delta_m$ to denote the probability simplex in $\mathbb{R}^m$. We use $[m]$ to denote the set $\{1, 2, \ldots, m\}$, and $\{x_t\}_{t=1}^T$ to denote the set $\{x_1, x_2, \ldots, x_T\}$. We use $e_i$ to denote the $i$-th one-hot vector, whose $j$-th element is $\mathbb{1}\{i = j\}$. We use $\|\cdot\|_1$ and $\|\cdot\|_\infty$ to denote the $L_1$ and $L_\infty$ norms of a matrix/vector, and $\|\cdot\|_{\max}$ to denote the max norm of a matrix, which is given by the maximum absolute value among all entries. We use $0_n$ and $1_n$ to denote the all-zero and all-one vectors in $\mathbb{R}^n$, respectively.
For two vectors $a, b \in \mathbb{R}^n$, $a \leq b$ indicates $a_i \leq b_i$ for all $i \in [n]$. For a matrix $M$, $M_{i,:}$ (abbreviated $M_i$) denotes the $i$-th row of $M$ and $M_{:,j}$ denotes the $j$-th column. As an extension, we use $M_{I,:}$ (abbreviated $M_I$) and $M_{:,J}$ to denote the matrix obtained by combining the rows in an index set $I$ (the columns in an index set $J$), respectively, and $M_{I,J}$ similarly denotes the matrix obtained by choosing rows in $I$ and columns in $J$. For an index set $I$, let $I_i$ denote the $i$-th element of $I$.

3.1. Problem Setup

We consider a repeated general-sum bimatrix game $G$ with two players, referred to as P1 (the optimizer) and P2 (the learner). There are $T$ rounds of repeated interaction; in each round $t$, P1 and P2 simultaneously play mixed actions $x_t \in \Delta_m$ and $y_t \in \Delta_n$. We use $A, B \in \mathbb{R}^{m \times n}$ to denote the payoff matrices of P1 and P2 respectively, so their utilities in round $t$ can be written as $x_t^\top A y_t$ and $x_t^\top B y_t$. Within the interaction process, players use algorithms to decide the actions they play. An algorithm takes the interaction sequence $\{x_\tau, y_\tau\}_{\tau=1}^{t-1}$ as input, and outputs the action $x_t$ (for P1) or $y_t$ (for P2). We allow the algorithm of each player to be randomized, and the goal of each player is to obtain a higher expected total utility across all time steps, which can be written as:

$$U(\mathrm{P1}) = \mathbb{E}\Big[\sum_{t=1}^T x_t^\top A y_t\Big]; \qquad U(\mathrm{P2}) = \mathbb{E}\Big[\sum_{t=1}^T x_t^\top B y_t\Big].$$

3.2. Regret and No-regret Algorithms

Given a fixed sequence of actions $\{x_t\}_{t=1}^T$ played by P1, a natural metric for the performance of P2 is the regret, which compares its payoff to that of the best action in hindsight. We define three forms of regret: one associated with a trajectory, one incurred by an algorithm, and one incurred in a game.

Definition 3.1. Given an interaction history $\{x_t, y_t\}_{t=1}^T$, the learner regret of P2 on the trajectory is defined as:

$$\mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^T) := \max_{y \in \Delta_n} \sum_{t=1}^T x_t^\top B y - \sum_{t=1}^T x_t^\top B y_t. \qquad (1)$$

Given a sequence of optimizer actions $\{x_t\}_{t=1}^T$, the learner regret of P2 under the learner algorithm $\mathcal{A}_2$ is defined as:

$$\mathrm{Reg}_2(\mathcal{A}_2, \{x_t\}_{t=1}^T) := \max_{y \in \Delta_n} \sum_{t=1}^T x_t^\top B y - \mathbb{E}_{\mathcal{A}_2}\Big[\sum_{t=1}^T x_t^\top B y_t\Big].$$

Given the optimizer algorithm $\mathcal{A}_1$ and the learner algorithm $\mathcal{A}_2$, the learner regret of P2 is the expected learner regret under $\mathcal{A}_1, \mathcal{A}_2$:

$$\mathrm{Reg}_2(\mathcal{A}_1, \mathcal{A}_2) := \mathbb{E}_{\{x_t, y_t\}_{t=1}^T \sim \mathcal{A}_1, \mathcal{A}_2}\big[\mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^T)\big].$$

The learner would like to choose an algorithm that achieves low regret on different possible optimizer trajectories $\{x_t\}_{t=1}^T$, while remaining flexible enough to prevent the optimizer from efficiently learning its payoff matrix. Faced with this trade-off, the learner may set a regret budget $f$ and aim to act against possible optimizer exploration strategies while keeping its regret under this budget. To better characterize these circumstances, we make the following definition of fine-grained no-regret algorithms:

Definition 3.2. Given a function $f: \mathbb{N} \to \mathbb{R}$ such that $f(T) = o(T)$ and an optimizer action sequence $\{x_t\}_{t=1}^T$, an interaction sequence $\{x_t, y_t\}_{t=1}^T$ is $f$-no-regret for the learner (with constant $C$) if

$$\mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^T) \leq C f(T) \qquad (2)$$

for some constant $C$; a learner algorithm $\mathcal{A}_2$ is $f$-no-regret (with constant $C$) on $\{x_t\}_{t=1}^T$ if

$$\mathrm{Reg}_2(\mathcal{A}_2, \{x_t\}_{t=1}^T) \leq C f(T) \qquad (3)$$

for some constant $C$ as $T \to \infty$. An algorithm is $f$-no-regret if it is $f$-no-regret on all possible optimizer action sequences. An algorithm is no-regret if it is $f$-no-regret for some $f(T) = o(T)$.
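To make Definition 3.1 concrete, the short sketch below (our own illustration, not code from the paper) computes the trajectory regret $\mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^T)$ directly, using the fact that the linear comparator maximization over $\Delta_n$ is attained at a pure strategy:

```python
# A minimal sketch (ours) of the trajectory regret in Definition 3.1.
import numpy as np

def learner_regret(B: np.ndarray, xs: np.ndarray, ys: np.ndarray) -> float:
    """Reg_2 = max_{y in simplex} sum_t x_t^T B y - sum_t x_t^T B y_t.

    B: (m, n) learner payoff matrix; xs: (T, m) and ys: (T, n) mixed actions.
    """
    g = xs @ B                 # (T, n): payoff of each pure learner response
    realized = float(np.sum(g * ys))
    # The comparator is linear in y, so the max over the simplex is attained
    # at a pure strategy: maximize the cumulative payoff over columns.
    best_fixed = float(g.sum(axis=0).max())
    return best_fixed - realized

# Toy usage: the optimizer fixes e_2; the learner stubbornly plays e_2,
# although pure action 1 earns payoff 1 per round.
B = np.array([[0.0, 1.0], [1.0, 0.0]])
xs = np.tile([0.0, 1.0], (100, 1))
ys = np.tile([0.0, 1.0], (100, 1))
print(learner_regret(B, xs, ys))  # 100.0
```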
3.3. Stackelberg Equilibrium and Stackelberg Regret

Consider a simple optimizer strategy that plays a fixed action $x$ at each time step. If the learner wants to be no-regret on the resulting trajectory, its action sequence $\{y_t\}_{t=1}^T$ will converge to a best response to $x$, defined as:

Definition 3.3. For an action $x$ of P1, the best response of P2 given its payoff matrix $B$ is the set of actions maximizing its payoff:

$$\mathrm{BR}(B, x) := \{y \in \Delta_n : x^\top B y \geq x^\top B y', \ \forall y' \in \Delta_n\}.$$

If the optimizer simply commits to a fixed action and the learner best-responds, the choice that maximizes the optimizer's utility yields a Stackelberg equilibrium:

Definition 3.4. An action pair $(x^*, y^*)$ is a Stackelberg equilibrium if it is a solution to the following optimization problem:

$$\text{maximize } x^\top A y \quad \text{subject to } y \in \mathrm{BR}(B, x), \ x \in \Delta_m. \qquad (4)$$

The Stackelberg value for P1 is defined as the optimal value of (4), denoted by $V(A, B)$.

Notice that in general, the set $\mathrm{BR}(B, x)$ can contain more than one element. This can yield different Stackelberg values across the best-response set. We use the rule of optimistic tie-breaking, adopting the best response with the highest optimizer payoff, to define the Stackelberg equilibrium. However, we do not impose the assumption that the learner also uses optimistic tie-breaking when deciding between indifferent actions.

Since the Stackelberg value is the highest possible average reward the optimizer can obtain by fixing one action across all time steps, we use Stackelberg regret to measure its performance:

Definition 3.5. Given an interaction history $\{x_t, y_t\}_{t=1}^T$, the Stackelberg regret of P1 is:

$$\mathrm{StackReg}_1(\{x_t, y_t\}_{t=1}^T) := T \cdot V(A, B) - \sum_{t=1}^T x_t^\top A y_t.$$

Given the optimizer algorithm $\mathcal{A}_1$ and the learner algorithm $\mathcal{A}_2$, the Stackelberg regret of P1 is the expected Stackelberg regret under $\mathcal{A}_1, \mathcal{A}_2$:

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) := \mathbb{E}_{\{x_t, y_t\}_{t=1}^T \sim \mathcal{A}_1, \mathcal{A}_2}\big[\mathrm{StackReg}_1(\{x_t, y_t\}_{t=1}^T)\big]. \qquad (5)$$

In the following sections we build upon these preliminaries to study the problem of steering the learner to the Stackelberg equilibrium. For brevity and ease of exposition, we defer all proofs to the appendices.

4. Impossibility of Learning to Steer Agents Who Use General No-regret Algorithms

When the learner payoff matrix $B$ is known to the optimizer, Deng et al. (2019) proved that under mild assumptions, there exists an optimizer algorithm $\mathcal{A}_1$ that guarantees a Stackelberg regret of at most $\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) \leq \epsilon T + o(T)$ for an arbitrarily small constant $\epsilon$. However, if the optimizer does not have full knowledge of $B$, extracting the Stackelberg value becomes harder. It is natural to wonder: if we do not impose any extra assumption (other than being no-regret) on P2, is it still possible for P1 to learn the payoff structure of P2 through the interaction process and thereby extract the Stackelberg value for all possible game instances? The following result shows that this is impossible in general:

Theorem 4.1. There exists a pair of game instances $G_1 = (A, B_1)$ and $G_2 = (A, B_2)$ with the same optimizer payoff matrix $A$, such that for every optimizer algorithm $\mathcal{A}_1$, there exists a no-regret algorithm $\mathcal{A}_2$ for the learner satisfying: $\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) = o(T)$ on $G_1$, and $\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) \geq c\,T$ for some constant $c$ on $G_2$.

The proof of Theorem 4.1 is deferred to Appendix A.
Theorem 4.1 says that even if the optimizer knows that the learner payoff is one of two candidates $B_1$ or $B_2$, whatever algorithm $\mathcal{A}_1$ it comes up with, there exists a no-regret algorithm $\mathcal{A}_2$ for the learner that induces a $\Theta(T)$ Stackelberg regret on the optimizer in one of the two game instances. The proof of Theorem 4.1 relies on first designing a pair of game instances such that $G_1$ and $G_2$ have different Stackelberg equilibria, then using a simple algorithm $\mathcal{A}_2'$ against which the optimizer cannot distinguish whether the realized learner payoff matrix is $B_1$ or $B_2$, before finally modifying it to be no-regret. The idea of constructing a pair of game instances that have different equilibria but the same payoff function for one player is also used in the proof of Theorem 3 in (Bajaj et al., 2024), which showed that in a repeated two-player game where the opponent's strategy is not known, no algorithm can achieve a bounded competitive ratio against itself (used by the opponent) for all such game instances. This suggests that we cannot hope to design a no-Stackelberg-regret algorithm for the optimizer that works uniformly well on all game instances without any additional assumption on the learner beyond it being no-regret. Interestingly, Theorem 6 in (Brown et al., 2023) suggests that the optimizer is able to learn and steer a no-adaptive-regret learner, indicating a fundamental difference between learning to steer learners from different algorithm classes.

5. Steering through Facets and Payoff Matrix Recovery

Given Theorem 4.1, a natural question that emerges is what information the optimizer needs to acquire in order to steer no-regret learners to the Stackelberg equilibrium. In this section we present two alternative sufficient conditions under which the optimizer can steer the learner to the Stackelberg equilibrium and achieve $o(T)$ Stackelberg regret by simply fixing one action across time steps. The first sufficient condition is an approximate pessimistic recovery of the facets, the best-response regions for each pure learner strategy; the second sufficient condition is an approximation of the learner's payoff matrix, up to an equivalence class.

5.1. Facets and Equivalence Classes of Payoff Matrices

From the optimizer's perspective, the matrix $B$ does not directly show up in its payoff. The only way $B$ influences the Stackelberg equilibrium and value is through the induced best-response set $\mathrm{BR}(B, x)$. Therefore, intuitively, all the information the optimizer needs to characterize the response dynamics is encoded in the best response for each $x \in \Delta_m$. We characterize each point $x$ in the optimizer's simplex $\Delta_m$ by which best response it induces, as indicated in the following definition:

Definition 5.1. For any possible payoff matrix $B$ of the learner, the facet $E_i \subseteq \Delta_m$ corresponding to the $i$-th learner action $e_i$ is defined as the set of optimizer actions $x \in \Delta_m$ such that $e_i$ is a best response to $x$:

$$E_i := \{x \in \Delta_m : e_i \in \mathrm{BR}(B, x)\}. \qquad (6)$$

We sometimes write $E_i(B)$ to explicitly indicate that $E_i$ is induced by $B$.

Intuitively, the boundary of a facet specifies the critical hyperplane in the optimizer's space of mixed strategies $\Delta_m$ where the learner is indifferent between two (or more) pure strategies. In general, a Stackelberg equilibrium strategy $x^*$ lies at an extreme point of one facet, and therefore, in order to extract the Stackelberg value from the learner, the optimizer must be able to (at least implicitly) reconstruct the facet boundaries around the equilibrium point.
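As a quick illustration of Definition 5.1 (a sketch of ours, using zero-based indices), checking whether a given optimizer action lies on the facet $E_i$ reduces to comparing the learner's pure-strategy payoffs:

```python
# A small sketch (ours) of facet membership: x is on E_i iff e_i is a best
# response to x, i.e., the i-th pure payoff attains the maximum of x^T B.
import numpy as np

def on_facet(B: np.ndarray, x: np.ndarray, i: int, tol: float = 1e-12) -> bool:
    payoffs = x @ B            # learner payoff of each pure action against x
    return bool(payoffs[i] >= payoffs.max() - tol)

B = np.eye(3)                  # indifference boundaries are x_i = x_j here
print(on_facet(B, np.array([0.5, 0.3, 0.2]), 0))  # True
print(on_facet(B, np.array([0.5, 0.3, 0.2]), 1))  # False
```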
We illustrate the definition of facets with the following example:

Example 5.2. Let $m = n = 3$ and consider the learner payoff matrix $B = I$. With the optimizer action $x = (x_1, x_2, x_3)^\top$, the facets $E_1$, $E_2$ and $E_3$ are visualized in Figure 1.

[Figure 1. The facets $E_1$, $E_2$ and $E_3$ for payoff matrix $B = I$.]

While similar definitions are made in (Letchford et al., 2009; Peng et al., 2019; Lattimore & Szepesvári, 2019), we use the word "facet" because each $E_i$ induced by any matrix $B$ is indeed a polytope that is a subset of $\Delta_m$, and their union satisfies $\cup_{i \in [n]} E_i = \Delta_m$.

Following (Conitzer & Sandholm, 2006), once we have identified the facets for all $i \in [n]$, we can compute the Stackelberg equilibrium in the following way. First solve the linear program

$$V_i(A, B) = \max_{x \in E_i(B)} x^\top A_{:,i} \qquad (7)$$

for each $i \in [n]$. Since P2 plays a pure strategy at equilibrium, the Stackelberg value is then given by

$$V(A, B) = \max_{i \in [n]} V_i(A, B), \qquad (8)$$

with the corresponding solution $(x^*, i^*)$ being the Stackelberg equilibrium.

At a Stackelberg equilibrium the learner is usually indifferent between multiple pure strategies, which means that switching from the equilibrium pure strategy $y^*$ to another indifferent pure strategy $y'$ incurs no additional regret to the learner, while potentially vastly degrading the optimizer's payoff. Therefore, instead of simply selecting the equilibrium point, the optimizer must choose a pessimistic equilibrium point that sacrifices some utility relative to the Stackelberg value in order to guarantee that the learner responds with $y^*$. We define pessimistic facets as:

Definition 5.3. Given a facet $E_i \subseteq \Delta_m$, a pessimistic facet $E_i^-$ is a subset of $E_i$. We say $E_i^-$ is $d$-pessimistic if it is non-empty and $d_H(E_i, E_i^-) \leq d$, where $d_H(\cdot, \cdot)$ is the Hausdorff distance with respect to the $L_1$ norm.

If the optimizer plays a safe version of the Stackelberg equilibrium by modifying (7) to

$$V_i^-(A, B) = \max_{x \in E_i^-} x^\top A_{:,i} \qquad (9)$$

and obtains a pessimistic value with respect to (8) as

$$V^-(A, B) = \max_{i \in [n]} V_i^-(A, B), \qquad (10)$$

the obtained value $V^-(A, B)$ will be close to $V(A, B)$ if $d$ is small, stated formally as follows:

Proposition 5.4. Consider the optimization problem (9). If the facet $E_i^-$ is $d$-pessimistic, the pessimistic Stackelberg value $V_i^-(A, B)$ satisfies

$$V_i(A, B) - d\,\|A_{:,i}\|_\infty \leq V_i^-(A, B) \leq V_i(A, B), \qquad (11)$$

and therefore if $E_i^-$ is $d$-pessimistic for all $i \in [n]$,

$$V(A, B) - d\,\|A\|_{\max} \leq V^-(A, B) \leq V(A, B). \qquad (12)$$

The proof is deferred to Appendix B.1. As shown in Proposition 5.4, as long as the optimizer knows a complete set of $d$-pessimistic facets, one for each $i \in [n]$, it is able to compute an approximate Stackelberg equilibrium up to an error on the scale of $d$. We now extend this result to the optimizer's ability to extract the Stackelberg value against no-regret learners:

Theorem 5.5. If P1 has a set of $d_1$-pessimistic facets $E_i^-$, $i \in [n]$, such that for all $i \neq j$, $\inf_{x \in E_i^-, x' \in E_j} \|x - x'\|_1 \geq d_2$, then as long as P2 is using an $f$-no-regret algorithm $\mathcal{A}_2$, P1 can guarantee a Stackelberg regret of

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) = O\Big(\frac{f(T)}{d_2} + d_1 T\Big) \qquad (13)$$

by sticking to the corresponding $x^*$ obtained from (9) and (10). Here we keep $d_1$ and $d_2$ inside the $O(\cdot)$ notation to allow their choice to depend on $T$.

A refined version of Theorem 5.5 that expands the big-$O(\cdot)$ notation, as well as its proof, can be found in Appendix B.2. Here we can see that if we take $d_1 = d_2 = \sqrt{f(T)/T}$, we obtain an optimizer Stackelberg regret of $O(\sqrt{T f(T)})$, which is $o(T)$.
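The facet-by-facet computation in (7)-(8) is directly implementable as $n$ small linear programs. The sketch below is our own illustration (it assumes SciPy is available and is not code from the paper):

```python
# A sketch (ours, assuming SciPy) of the LP computation in (7)-(8): for each
# learner pure strategy i, maximize x^T A_{:,i} over the facet E_i(B), then
# keep the best feasible value across i.
import numpy as np
from scipy.optimize import linprog

def stackelberg_via_facets(A: np.ndarray, B: np.ndarray):
    m, n = A.shape
    best_val, best_x, best_i = -np.inf, None, None
    for i in range(n):
        # E_i(B): (B_{:,j} - B_{:,i})^T x <= 0 for all j != i, x in simplex.
        G = np.stack([B[:, j] - B[:, i] for j in range(n) if j != i])
        res = linprog(c=-A[:, i],               # linprog minimizes, so negate
                      A_ub=G, b_ub=np.zeros(n - 1),
                      A_eq=np.ones((1, m)), b_eq=[1.0],
                      bounds=[(0, 1)] * m)
        if res.success and -res.fun > best_val:
            best_val, best_x, best_i = -res.fun, res.x, i
    return best_val, best_x, best_i   # V(A, B) and the equilibrium (x*, i*)

print(stackelberg_via_facets(np.eye(3), np.eye(3)))  # value 1.0 at x* = e_1
```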
In Theorem 5.5 we require the condition $\inf_{x \in E_i^-, x' \in E_j} \|x - x'\|_1 \geq d_2$, ensuring the pessimistic facets are disjoint from (and at least $d_2$ away from) the other facets; once the optimizer selects a point within $E_{i^*}^-$, the learner has a unique best response $i^*$, and deviating from $i^*$ incurs a regret proportional to $d_2$ at each step.

While a proper estimation of pessimistic facets suffices to steer the learner, in some specific cases it may be easier for the optimizer to reconstruct the learner's payoff matrix, and it suffices to restrict our attention to those matrices that could induce different best-response sets, leading to the following definition of equivalent payoff matrix classes:

Definition 5.6. For two $m \times n$ matrices $B$ and $B'$, if there exist some $c \in \mathbb{R}_+$ and $\mu \in \mathbb{R}^m$ such that

$$B' = c\,B + \mu 1_n^\top, \qquad (14)$$

we say that $B$ and $B'$ are equivalent.

It is not hard to see that if two matrices $B$ and $B'$ are equivalent, the induced best-response sets satisfy $\mathrm{BR}(B, x) = \mathrm{BR}(B', x)$ for all $x \in \Delta_m$. Indeed, we show in Appendix B.3 that for all equivalent matrix pairs $(B_1, B_2)$, if a learner algorithm is $f$-no-regret on one, there exists a corresponding algorithm that is $f$-no-regret on the other, which indicates that the optimizer is in general not able to distinguish between these two matrices without knowing $\mathcal{A}_2$. Therefore, restricting our attention from payoff matrices to equivalence classes does not affect the optimizer's ability to steer learners.

To describe an equivalence class $\mathcal{B}$, observe that for all matrices $B$ in $\mathcal{B}$, the matrix $B_i^- \in \mathbb{R}^{m \times (n-1)}$, whose columns are given by

$$(B_i^-)_{:,k} = \frac{B_{:,k} - B_{:,i}}{\max_{j_1, j_2} \|B_{:,j_1} - B_{:,j_2}\|_\infty}$$

for each $k \neq i$ (with the convention $0/0 = 0$), is the same for any fixed index $i \in [n]$, so we can use $B_i^-$ to represent the entire equivalence class. Based on this, we can also define the difference between two equivalence classes:

Definition 5.7. For two equivalence classes $\mathcal{B}_1$ and $\mathcal{B}_2$, their difference on index $i$ is defined as $d_i(\mathcal{B}_1, \mathcal{B}_2) := B_{1,i}^- - B_{2,i}^-$.

5.2. Steering with Payoff Class Estimation

If P1 has an estimate $\hat{\mathcal{B}}$ that perfectly recovers the underlying payoff matrix class $\mathcal{B}$ of $B$, we can rewrite (7) as

$$V_i(A, \mathcal{B}) = \max_{x \in \Delta_m} x^\top A_{:,i} \quad \text{s.t. } (B_i^-)^\top x \leq 0_{n-1} \qquad (15)$$

for each $i \in [n]$. Again, each linear program solves for the best optimizer action $x$ under which $e_i$ is a best response. Consequently, (8) becomes $V(A, \mathcal{B}) = \max_i V_i(A, \mathcal{B})$, and the Stackelberg equilibrium point is the corresponding solution. With an estimate $\hat{\mathcal{B}}$ whose error is within a margin $d$, we can define an optimistic version of (15):

$$V_i^+(A, \hat{\mathcal{B}}) = \max_{x \in \Delta_m} x^\top A_{:,i} \quad \text{s.t. } (\hat{B}_i^-)^\top x \leq d\,1_{n-1}, \qquad (16)$$

and a pessimistic version:

$$V_i^-(A, \hat{\mathcal{B}}) = \max_{x \in \Delta_m} x^\top A_{:,i} \quad \text{s.t. } (\hat{B}_i^-)^\top x \leq -d\,1_{n-1}. \qquad (17)$$

The optimistic (cf. pessimistic) problem relaxes (cf. tightens) the constraint by a margin $d$. Noticing that the feasible sets of the optimization problems (15), (16) and (17) are also variants of the aforementioned concept of facets, we overload the notation:

Definition 5.8. Given a payoff matrix class $\mathcal{B}$, the facet $E_i$ corresponding to the $i$-th action $e_i$ of P2 is defined as:

$$E_i(\mathcal{B}) := \big\{x \in \Delta_m : (B_i^-)^\top x \leq 0_{n-1}\big\}. \qquad (18)$$

Similarly, given an estimated payoff matrix class $\hat{\mathcal{B}}$ and an error margin $d$, the optimistic facet $E_i^+$ and the pessimistic facet $E_i^-$ corresponding to the $i$-th action $e_i$ of P2 are defined respectively as:

$$E_i^+(\hat{\mathcal{B}}, d) := \big\{x \in \Delta_m : (\hat{B}_i^-)^\top x \leq d\,1_{n-1}\big\}; \qquad (19)$$

$$E_i^-(\hat{\mathcal{B}}, d) := \big\{x \in \Delta_m : (\hat{B}_i^-)^\top x \leq -d\,1_{n-1}\big\}. \qquad (20)$$
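The representative $B_i^-$ above is easy to compute. The sketch below (ours) does so under the assumption, not fully legible in the source, that the normalizer uses the entrywise-max ($L_\infty$) norm of the largest column difference; it also checks that two equivalent matrices share a representative:

```python
# A sketch (ours) of the class representative B_i^-; the choice of the
# infinity norm in the normalizer is our assumption.
import numpy as np

def class_representative(B: np.ndarray, i: int) -> np.ndarray:
    n = B.shape[1]
    scale = max(np.abs(B[:, j1] - B[:, j2]).max()
                for j1 in range(n) for j2 in range(n))
    D = np.stack([B[:, k] - B[:, i] for k in range(n) if k != i], axis=1)
    return D if scale == 0 else D / scale      # convention 0/0 = 0

# Equivalent matrices (B' = c*B + mu*1_n^T) map to the same representative.
B = np.array([[0.0, 1.0], [1.0, 0.0]])
B_prime = 3.0 * B + np.array([[5.0], [-2.0]])  # c = 3, mu = (5, -2)
print(np.allclose(class_representative(B, 0), class_representative(B_prime, 0)))
```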
The definition of pessimistic facets in Definition 5.8 enforces strict dominance of the corresponding action by at least a margin $d$. We show in Proposition 5.9 that when the error margin $d$ is larger than the scale of the difference between the equivalence classes, the optimistic (cf. pessimistic) problems are indeed relaxations (cf. tightenings) of (15).

Proposition 5.9. If the error margin $d$ satisfies

$$d \geq \|d_i(\mathcal{B}, \hat{\mathcal{B}})\|_{\max}, \qquad (21)$$

then: $E_i^-(\hat{\mathcal{B}}, d) \subseteq E_i(\mathcal{B}) \subseteq E_i^+(\hat{\mathcal{B}}, d)$. Further, since (15), (16) and (17) maximize the same objective, if $E_i^-(\hat{\mathcal{B}}, d)$ is non-empty then: $V_i^-(A, \hat{\mathcal{B}}) \leq V_i(A, \mathcal{B}) \leq V_i^+(A, \hat{\mathcal{B}})$.

We provide the following example:

Example 5.10. Following Example 5.2, consider

$$\hat{B} = \begin{bmatrix} 1.05 & 0.05 & 0 \\ 0.05 & 1.05 & 0 \\ 0.05 & 0 & 0.95 \end{bmatrix}.$$

For the facet $E_1$ and the corresponding $\mathcal{B}, \hat{\mathcal{B}}$, we have $\|d_1(\mathcal{B}, \hat{\mathcal{B}})\|_{\max} = 0.1$. We show $E_1^-(\hat{\mathcal{B}}, d)$ and the boundaries of $E_1(\mathcal{B})$, $E_1(\hat{\mathcal{B}})$ in Figure 2.

[Figure 2. Although $E_1(\hat{\mathcal{B}}) \not\subseteq E_1(\mathcal{B})$, by construction we have $E_1^-(\hat{\mathcal{B}}, d) \subseteq E_1(\mathcal{B})$.]

Proposition 5.9 suggests that given an estimate $\hat{\mathcal{B}}$ and a proper margin $d$, if P1 plays according to the solution to (17) (assuming $E_i^-(\hat{\mathcal{B}}, d)$ is not empty), the corresponding learner best response in the underlying game is $e_i$. However, a pessimistic facet may be infeasible even when the original facet is feasible: if $E_i^-(\hat{\mathcal{B}}, d)$ is empty, we cannot deduce that $E_i(\mathcal{B})$ is also empty. Instead, to certify the emptiness of $E_i(\mathcal{B})$, we need $E_i^+(\hat{\mathcal{B}}, d)$ to be empty as well. We use the following version of the definition given by (Gan et al., 2023) to capture the emptiness of $E_i^+$ and $E_i^-$.

Definition 5.11 ((Gan et al., 2023), Definition 3). Given a payoff matrix class $\mathcal{B}$ of P2, define the inducibility gap $C_i$ with respect to the $i$-th action $e_i$ of P2 to be:

$$C_i := \min_{x \in \Delta_m} \max_j x^\top (B_i^-)_{:,j}. \qquad (22)$$

We can see from Definition 5.11 that $C_i \leq 0$ if and only if (15) is feasible. We show in Appendix C.2 that if $C_i > 0$, (15) is infeasible and there is a gap by which it can be relaxed while remaining infeasible. If $C_i = 0$, however, Definition 5.11 indicates that for all $x \in \Delta_m$, $\max_j x^\top B(e_j - e_i) \geq 0$, and there exist $x \in \Delta_m$ and $j \in [n]$, $j \neq i$, such that $x^\top B(e_j - e_i) = 0$. In this case, the facet $E_i(\mathcal{B})$ has zero volume, and as long as the estimate $\hat{\mathcal{B}}$ is not exact, the facet $E_i^-(\hat{\mathcal{B}}, d)$ can always be empty. Also, even if the optimizer knows the real underlying $B$, since $e_i$ is weakly dominated, the learner is not steerable if $e_i$ happens to be the Stackelberg equilibrium, as the learner is indifferent between $e_i$ and $e_j$. To avoid this special case (which occurs with probability 0 for uniformly randomly generated $B$ matrices (Von Stengel & Zamir, 2010)), we make the following assumption, as is standard (see, e.g., (Gan et al., 2023; Deng et al., 2019; Brown et al., 2023)):

Assumption 5.12. The learner payoff matrix class $\mathcal{B}$ satisfies $C_i \neq 0$ for all $i \in [n]$.

Remark 5.13. Both game instances constructed in the proof of Theorem 4.1 satisfy this assumption. That is to say, if the optimizer knew $B_1$ and $B_2$, it would be able to steer the learner to the Stackelberg equilibrium, indicating that the impossibility lies in "learning" rather than "steering".

With Assumption 5.12 we are ready to show that if the estimate $\hat{\mathcal{B}}$ is accurate enough, all the facets are identifiable:

Proposition 5.14. Given an estimate $\hat{\mathcal{B}}$ satisfying (21):
1. If $E_i(\mathcal{B}) = \emptyset$ and $d \leq \frac{C_i}{4}$, then $E_i^+(\hat{\mathcal{B}}, d) = \emptyset$;
2. If $E_i(\mathcal{B}) \neq \emptyset$ and $d \leq -\frac{C_i}{2}$, then $E_i^-(\hat{\mathcal{B}}, d) \neq \emptyset$.
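The inducibility gap (22) is itself computable by one small LP with an epigraph variable; the following sketch (ours, assuming SciPy) makes the sign test $C_i \leq 0$ concrete:

```python
# A sketch (ours, assuming SciPy) of Definition 5.11:
#   C_i = min_{x in simplex} max_j x^T (B_i^-)_{:,j},
# written as: minimize t subject to (B_i^-)_{:,j}^T x <= t for all j.
import numpy as np
from scipy.optimize import linprog

def inducibility_gap(B_minus_i: np.ndarray) -> float:
    """B_minus_i: (m, n-1) class representative for learner action i."""
    m, k = B_minus_i.shape
    c = np.r_[np.zeros(m), 1.0]                       # objective: t
    A_ub = np.hstack([B_minus_i.T, -np.ones((k, 1))]) # x^T col_j - t <= 0
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)      # sum(x) = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(k), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return float(res.fun)

# For B = I (Example 5.2) and i = 1: C_1 = -1 < 0, so E_1 is non-empty.
B_minus_1 = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
print(inducibility_gap(B_minus_1))  # -1.0
```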
Proposition 5.14 shows that under Assumption 5.12, when the estimation error is small enough, either both $E_i^+$ and $E_i^-$ are empty, or neither is. Therefore, given a sufficiently accurate $\hat{\mathcal{B}}$, the optimizer can eventually decide whether $E_i(\mathcal{B})$ is empty or not. Since Proposition 5.9 gives $V_i^-(A, \hat{\mathcal{B}}) \leq V_i(A, \mathcal{B}) \leq V_i^+(A, \hat{\mathcal{B}})$, if the optimizer chooses the solution to (17), its suboptimality can be bounded by $V_i^+(A, \hat{\mathcal{B}}) - V_i^-(A, \hat{\mathcal{B}})$. To bound this difference term, we make the following definition to capture the sensitivity of the problem:

Definition 5.15. Given a matrix $M \in \mathbb{R}^{(n-1) \times m}$, define the sensitivity constant $\mathrm{Sen}(M)$ as:

$$\mathrm{Sen}(M) := \min_{\epsilon \neq 0} \max_{P, Q} \left\| \begin{bmatrix} M \\ \epsilon 1_m^\top \end{bmatrix}_{P,Q}^{-1} \right\|, \qquad (23)$$

where the maximization is over all $P$ and $Q$ satisfying $P \subseteq [n]$, $Q \subseteq [m]$, $|P| = |Q|$, and $\begin{bmatrix} M \\ \epsilon 1_m^\top \end{bmatrix}_{P,Q}$ invertible.

In Definition 5.15, we take the maximum over all invertible square submatrices; if we take $M = (\hat{B}_i^-)^\top$, we can interpret $P$ as choosing active constraints among the columns of $\hat{B}_i^-$, and $Q$ as choosing the nonzero entries of $x$. Based on Definition 5.15, we obtain Lemma 5.16:

Lemma 5.16. Suppose both the optimistic and pessimistic problems are feasible. Then the difference between the optimal value $V_i^+(A, \hat{\mathcal{B}})$ of (16) and the optimal value $V_i^-(A, \hat{\mathcal{B}})$ of (17) can be upper bounded by

$$V_i^+(A, \hat{\mathcal{B}}) - V_i^-(A, \hat{\mathcal{B}}) \leq 4d\,\|A_{:,i}\|_\infty\,\mathrm{Sen}\big((B_i^-)^\top\big),$$

as long as $\|d_i(\mathcal{B}, \hat{\mathcal{B}})\|_{\max} \leq d \leq \frac{1}{2\,\mathrm{Sen}((B_i^-)^\top)}$.

Lemma 5.16 says that once P1 has a small estimation error for the payoff matrix class $\mathcal{B}$ of P2, the value and the corresponding action obtained by solving (17) guarantee a bounded suboptimality proportional to the error scale. Based on this, if P1 has an estimate $\hat{\mathcal{B}}_t$ that is accurate enough, P1 can commit to the fixed strategy given by the solution to the pessimistic optimization problem and obtain a sublinear Stackelberg regret in the long run as $T \to \infty$. We state this result as follows:

Theorem 5.17. Under Assumption 5.12, if P1 has an estimate $\hat{\mathcal{B}}$ of $\mathcal{B}$ such that $\|d_i(\mathcal{B}, \hat{\mathcal{B}})\|_{\max} \leq \epsilon = O(g(T)/T)$ for all $i$, for some $g(T) = o(T)$, then if P2 is using an $f$-no-regret algorithm $\mathcal{A}_2$, there exists an algorithm $\mathcal{A}_1$ satisfying:

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) = O\big(\sqrt{T f(T)} + g(T)\big). \qquad (24)$$

Similarly, the refined version of Theorem 5.17 with an explicit Stackelberg regret bound, along with its proof, can be found in Appendix C.5.

5.3. Lower Bound on Stackelberg Regret

To illustrate the tightness of our result, we provide a lower bound on the Stackelberg regret of the optimizer in Appendix D, which shows that our rate $\sqrt{T f(T)}$ is essentially optimal against $f$-no-regret learners.

6. Learning to Steer Classes of Learners

We have shown in Section 5 that if P1 can recover the set of pessimistic facets or an approximate payoff matrix class, it is able to steer the learner to a Stackelberg equilibrium. It is therefore natural for the optimizer to adopt an explore-then-commit style algorithm that first learns either the facets or the approximate payoff matrix, and then commits to a pessimistic Stackelberg equilibrium (a sketch of this commit step follows below). In this section we show, through two concrete examples, that when some information about the update rule of the learner's algorithm is known, P2 leaks information about its payoff, which allows P1 to learn the desired payoff structure and thus steer the learner to the Stackelberg equilibrium. We provide numerical experiments illustrating the effectiveness of the algorithms in Appendix F.
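A minimal sketch of the commit step behind Theorem 5.17 (ours, assuming SciPy; the paper's full algorithms, including their exploration phases, are in the appendices) solves the pessimistic LP (17) for each action and commits to the best feasible solution:

```python
# A sketch (ours, assuming SciPy) of the pessimistic commit step: solve (17)
# for each learner action i with margin d and commit to the best solution.
import numpy as np
from scipy.optimize import linprog

def pessimistic_commit(A, Bhat_minus, d):
    """A: (m, n) optimizer payoffs; Bhat_minus: list of n estimated (m, n-1)
    class representatives; d: error margin. Returns (value, x, i)."""
    m, n = A.shape
    best = (-np.inf, None, None)
    for i in range(n):
        k = Bhat_minus[i].shape[1]
        res = linprog(c=-A[:, i],                        # maximize x^T A_{:,i}
                      A_ub=Bhat_minus[i].T,              # (Bhat_i^-)^T x
                      b_ub=-d * np.ones(k),              #   <= -d * 1
                      A_eq=np.ones((1, m)), b_eq=[1.0],
                      bounds=[(0, 1)] * m)
        if res.success and -res.fun > best[0]:
            best = (-res.fun, res.x, i)
    return best  # P1 then plays x for the remaining rounds
```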
6.1. Learning to Steer Ascending Learners with n = 2

In this section we assume that the learner uses an ascent algorithm, in which the learner's action greedily improves its payoff based on the last round's optimizer action:

Definition 6.1. A learner algorithm $\mathcal{A}_2$ is an ascent algorithm if $x_t^\top B y_{t+1} - x_t^\top B y_t \geq 0$ for all $t$, with equality if and only if $y_t \in \mathrm{BR}(B, x_t)$.

For simplicity we restrict our attention to the case $m = n = 2$, where the direction in which $y_{t+1}$ moves from $y_t$ directly reveals the best response to $x_t$. Based on this observation, we propose Algorithm 1, shown in Appendix E.1. The idea behind the algorithm is that the optimizer first performs a binary search across its simplex $[0, 1]$ and then applies pessimism to get estimates $E_1^-$ and $E_2^-$, before finally committing to the solution obtained through (9) and (10). We show in the following theorem that the algorithm obtains a sublinear Stackelberg regret:

Theorem 6.2. Suppose $m = n = 2$ and the payoff matrix $B$ does not contain identical columns. For some chosen parameter $d$, if either one facet is empty, or each facet has diameter at least $d$, and P2 uses an ascent algorithm $\mathcal{A}_2$ that is $f$-no-regret, then Algorithm 1 with accuracy margin $d$ achieves a Stackelberg regret of at most $O\big(\frac{f(T)}{d} + d\,T \log \frac{1}{d}\big)$ as long as $d = \Omega(\exp(-f(T)))$.

A more detailed version of Theorem 6.2 with its proof can be found in Appendix E.1. As an example, if $f(T) = T^\alpha$ and we take $d = \sqrt{f(T)/T}$, we achieve a Stackelberg regret bound for the optimizer of $O\big(\sqrt{T f(T)} \log \sqrt{T/f(T)}\big) = O\big(T^{\frac{1+\alpha}{2}} \log T\big)$.

For the more general case where $n = 2$ and $m$ is an arbitrary constant, we can use a similar approach that performs $m(m-1)/2$ binary searches on all pairs $(e_i, e_j)$, $i \neq j$, to find a set of approximate indifference points on each segment $\{x \in \Delta_m : x_i + x_j = 1\}$, and then use them to reconstruct the facets $E_1^-$ and $E_2^-$; the reconstruction is possible under mild assumptions since the real facets $E_1$ and $E_2$ are separated by the hyperplane $x^\top(Be_1 - Be_2) = 0$. We leave it as an open problem whether a similar approach works for the case $n > 2$. There is an alternative view of Algorithm 1 based on payoff matrix reconstruction; see the discussion in Appendix E.1.

6.2. Learning to Steer Mirror Ascent Learners

We now present an estimation algorithm that estimates the payoff matrix class $\mathcal{B}$ of P2, given that P2 is using stochastic mirror ascent with a known regularizer. More specifically, we assume that the follower uses the following update rule:

$$y_{t+1} = \arg\min_{y \in \Delta_n} \ \eta_t\, D(y \,\|\, y_t) - (x_t^\top B + \xi_t^\top)\, y, \qquad (25)$$

where $\xi_t \in \mathbb{R}^n$ is noise that is either innate to the problem or injected by P2 to prevent information leakage. We assume that $\eta_t$ and the Bregman divergence regularizer $D(\cdot \,\|\, \cdot)$ are both known to P1, and that the regularizer satisfies $\|\nabla_y D(y_{t+1} \,\|\, y_t)\| \to \infty$ if there exists $i \in [n]$ such that $y_{t+1,i} \to 0$. At each time step $t$, through the update rule (which the optimizer knows by knowing the regularizer and step size), the relationship between $y_{t+1}$ and $y_t$ depends only on the term $x_t^\top B + \xi_t^\top$; therefore if the optimizer selects $x_t = e_i$, it obtains information about the $i$-th row $B_i$ of $B$. By uniformly exploring all such rows, it is able to fully recover the entire matrix class $\mathcal{B}$, as the sketch below illustrates.
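The following simulation sketch is ours (the paper's Algorithm 4 appears in Appendix E.2 and may differ in its details); it assumes $D$ is the entropic (KL) regularizer with a constant, known step size, so that (25) becomes a multiplicative-weights update. Consecutive iterates then satisfy $\log(y_{t+1,j}/y_{t,j}) = \eta\,(x_t^\top B + \xi_t^\top)_j - \log Z_t$, so coordinate-wise differences cancel the normalizer, and playing $x_t = e_i$ while averaging recovers row $B_i$ up to a constant shift, i.e., exactly up to the equivalence class of Definition 5.6:

```python
# A simulation sketch (ours) of recovering one payoff row from observed
# entropic mirror ascent updates, assuming a constant, known step size.
import numpy as np

rng = np.random.default_rng(0)
n, eta, k, sigma = 3, 0.05, 2000, 0.05
B_row = rng.standard_normal(n)        # unknown row e_i^T B being probed

y = np.full(n, 1.0 / n)               # learner iterate, observed by P1
est = np.zeros(n)
for _ in range(k):
    g = B_row + sigma * rng.standard_normal(n)  # noisy gradient e_i^T B + xi
    w = y * np.exp(eta * g)                     # multiplicative-weights step
    y_next = w / w.sum()
    # Log-ratio differences cancel the normalizer: this equals g - g[0].
    est += (np.log(y_next / y) - np.log(y_next[0] / y[0])) / eta
    y = y_next
est /= k                              # estimates B_row - B_row[0]
print(np.abs(est - (B_row - B_row[0])).max())   # small, on the order of 1e-3
```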
Interestingly, since the update rule includes a projection onto the simplex $\Delta_n$, the information of one dimension is lost, so the optimizer cannot fully recover the exact matrix $B$; fortunately, the projection preserves all the information needed to recover $B$ up to its equivalence class, which suffices to steer the learner to the Stackelberg equilibrium. Based on the intuition above, we propose Algorithm 4, shown in Appendix E.2, with the following regret bound.

Theorem 6.3. If the learner payoff matrix $B$ satisfies the assumptions of Theorem 5.17, P2 follows the update rule (25), and each entry $\xi_{t,i}$ is i.i.d. $R$-sub-Gaussian, then with probability at least $1 - \delta$, P1 using Algorithm 4 with $k = (T/g(T))^2 \cdot 2R^2 \log(2mn/\delta)$ incurs a Stackelberg regret of at most

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) = O\Big(\sqrt{T f(T)} + g(T) + \frac{T^2}{g(T)^2}\Big).$$

The detailed version of Theorem 6.3 and its proof can be found in Appendix E.2. Since for all no-regret algorithms $\mathcal{A}_2$ we have $f(T) = \Omega(\sqrt{T})$, if we take $g(T) = \sqrt{T f(T)}$, then $T^2/g(T)^2 = o(g(T))$ and we have $\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) = O\big(\sqrt{T f(T)}\big)$.

7. Conclusion

We studied the problem of learning to steer no-regret learners to a Stackelberg equilibrium through repeated interactions. While we showed this to be impossible against a general no-regret learner, we provided sufficient conditions under which the learner can be exploited, and designed algorithms that learn to steer the learner under further assumptions on its algorithm. Our work suggests several future directions for learning in strategic environments, including, but not limited to, finding a more precise characterization of learnable and steerable learner algorithm classes, and learning in environments where neither payoff matrix is known.

Acknowledgements

EM acknowledges support from NSF Award 2240110. YM is supported by: NSF Award CCF-2112665 (TILOS), the DARPA AIE program, the U.S. Department of Energy, Office of Science, and CDC-RFA-FT-23-0069 from the CDC's Center for Forecasting and Outbreak Analytics.

Impact Statement

Our work focuses on learning within strategic unknown environments, providing theoretical insights that can guide the design of more robust and intelligent multi-agent systems. Since our society inherently fits into such environments, our work could serve as a guideline for designing algorithms that interact with people. For instance, in self-driving fleets, vehicles must not only learn how to navigate but also anticipate the diverse, evolving behaviors of other drivers, whether human or algorithmic. A better understanding of how to steer or align these vehicles' learning processes could reduce collisions, congestion, and erratic maneuvers. Similarly, in automated trading platforms where billions of dollars change hands every day, being able to steer and exploit rival strategies, especially when the rival's objectives are unknown, can prevent destabilizing market manipulations or cascading losses. Beyond commercial applications, cybersecurity systems stand to benefit as well: more nuanced models of adversarial behavior under uncertain objectives can better protect critical infrastructure from sophisticated attacks. Ultimately, our findings encourage organizations and policymakers to invest in adaptive mechanisms that account for varying degrees of information accessibility. By doing so, we can foster more stable, cooperative, and beneficial outcomes in multi-agent settings, ensuring that AI systems operating in unknown strategic environments are both robust and societally accountable.

References
Arunachaleswaran, E. R., Collina, N., and Schneider, J. Pareto-optimal algorithms for learning in games. In Proceedings of the 25th ACM Conference on Economics and Computation, pp. 490-510, 2024.

Assos, A., Dagan, Y., and Daskalakis, C. Maximizing utility in multi-agent environments by anticipating the behavior of other learners, 2024. URL https://arxiv.org/abs/2407.04889.

Bajaj, S., Das, P., Vorobeychik, Y., and Gupta, V. Rationality of learning algorithms in repeated normal-form games. IEEE Control Systems Letters, 8:2409-2414, 2024. doi: 10.1109/LCSYS.2024.3486631.

Balcan, M.-F., Blum, A., Haghtalab, N., and Procaccia, A. D. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pp. 61-78, 2015.

Bertsimas, D. and Tsitsiklis, J. N. Introduction to Linear Optimization, volume 6. 1997.

Blum, A., Haghtalab, N., and Procaccia, A. D. Learning optimal commitment to overcome insecurity. Advances in Neural Information Processing Systems, 27, 2014.

Brânzei, S., Hajiaghayi, M., Phillips, R., Shin, S., and Wang, K. Dueling over dessert, mastering the art of repeated cake cutting. Advances in Neural Information Processing Systems, 37:97699-97765, 2024.

Braverman, M., Mao, J., Schneider, J., and Weinberg, M. Selling to a no-regret buyer. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 523-538, 2018.

Brown, W., Schneider, J., and Vodrahalli, K. Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36:54228-54249, 2023.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Collina, N., Arunachaleswaran, E. R., and Kearns, M. Efficient Stackelberg strategies for finitely repeated games, 2024. URL https://arxiv.org/abs/2207.04192.

Conitzer, V. and Sandholm, T. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, EC '06, pp. 82-90, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595932364. doi: 10.1145/1134707.1134717. URL https://doi.org/10.1145/1134707.1134717.

Deng, Y., Schneider, J., and Sivan, B. Strategizing against no-regret learners. Advances in Neural Information Processing Systems, 32, 2019.

Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '18, pp. 122-130, Richland, SC, 2018. International Foundation for Autonomous Agents and Multiagent Systems.

Foster, D. P. and Vohra, R. V. Asymptotic calibration. Biometrika, 85(2):379-390, 1998.

Gan, J., Han, M., Wu, J., and Xu, H. Robust Stackelberg equilibria. In Proceedings of the 24th ACM Conference on Economics and Computation, EC '23, pp. 735, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701047. doi: 10.1145/3580507.3597680. URL https://doi.org/10.1145/3580507.3597680.

Goktas, D., Zhao, J., and Greenwald, A. Robust no-regret learning in min-max Stackelberg games, 2022. URL https://arxiv.org/abs/2203.14126.

Haghtalab, N., Lykouris, T., Nietert, S., and Wei, A. Learning in Stackelberg games with non-myopic agents. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 917-918, 2022.

Haghtalab, N., Podimata, C., and Yang, K. Calibrated Stackelberg games: Learning optimal commitments against calibrated agents.
Advances in Neural Information Processing Systems, 36:61645-61677, 2023.

Harris, K., Wu, S. Z., and Balcan, M.-F. F. Regret minimization in Stackelberg games with side information. Advances in Neural Information Processing Systems, 37:12944-12976, 2024.

Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000.

Lattimore, T. and Szepesvári, C. An information-theoretic approach to minimax regret in partial monitoring. In Conference on Learning Theory, pp. 2111-2139. PMLR, 2019.

Lauffer, N., Ghasemi, M., Hashemi, A., Savas, Y., and Topcu, U. No-regret learning in dynamic Stackelberg games. IEEE Transactions on Automatic Control, 69(3):1418-1431, 2023.

Letchford, J., Conitzer, V., and Munagala, K. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pp. 250-262. Springer, 2009.

Lin, T. and Chen, Y. Generalized principal-agent problem with a learning agent. arXiv preprint arXiv:2402.09721, 2024.

Lu, C., Willi, T., De Witt, C. A. S., and Foerster, J. Model-free opponent shaping. In International Conference on Machine Learning, pp. 14398-14411. PMLR, 2022.

Maheshwari, C., Cheng, J., Sastry, S. S., Ratliff, L., and Mazumdar, E. Convergent first-order methods for bi-level optimization and Stackelberg games. In 2024 IEEE 63rd Conference on Decision and Control (CDC), 2024.

Peng, B., Shen, W., Tang, P., and Zuo, S. Learning optimal strategies to commit to. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):2149-2156, Jul. 2019. doi: 10.1609/aaai.v33i01.33012149. URL https://ojs.aaai.org/index.php/AAAI/article/view/4047.

Von Stengel, B. and Zamir, S. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446-457, 2010.

Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., Celli, A., Gatti, N., Conitzer, V., and Sandholm, T. Steering no-regret learners to a desired equilibrium, 2024. URL https://arxiv.org/abs/2306.05221.

Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. Who leads and who follows in strategic classification? In Advances in Neural Information Processing Systems, 2021.

A. Proof of Theorem 4.1

Consider the game instances $G_1 = (A, B_1)$ and $G_2 = (A, B_2)$ where:

$$A = \begin{bmatrix} 0 & 0 \\ 1 & \epsilon \end{bmatrix}, \quad B_1 = \begin{bmatrix} 0 & \epsilon \\ 0 & 1 \end{bmatrix}, \quad B_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad (26)$$

with $\epsilon \in (0, 1/2)$ being a small positive constant. Fix an optimizer algorithm $\mathcal{A}_1$ and suppose it is no-Stackelberg-regret on $G_1$. We first show that there exists a learner algorithm $\mathcal{A}_2$ (which may not itself be a no-regret algorithm) that achieves $\mathrm{Reg}_2(\mathcal{A}_1, \mathcal{A}_2) = o(T)$ on both $G_1$ and $G_2$, and then modify this $\mathcal{A}_2$ to make it a no-regret algorithm. In this proof section we use $\mathrm{StackReg}_1(\cdot, \cdot\,; G_i)$ and $\mathrm{Reg}_2(\cdot, \cdot\,; G_i)$ to denote the corresponding regret notions evaluated on $G_i$.

For the first step, we show that the simple algorithm $\mathcal{A}_2$ that always plays $y = (0, 1)^\top$ satisfies the conditions above for both $G_1$ and $G_2$. Notice that the unique Stackelberg equilibrium for $G_1$ is:

$$x_1^* = (0, 1)^\top, \quad y_1^* = (0, 1)^\top, \qquad (27)$$

with the corresponding Stackelberg value $V(A, B_1) = \epsilon$. In order to achieve a sublinear Stackelberg regret, the selections $x_t$ by $\mathcal{A}_1$ must satisfy

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2; G_1) = \epsilon T - \mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T x_t^\top A y_t \,\Big|\, y_t = y_1^*, \forall t\Big] = o(T), \qquad (28)$$

which simplifies to:

$$\mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T \big((0, 1) - x_t^\top\big) \begin{pmatrix} 0 \\ \epsilon \end{pmatrix}\Big] = o(T). \qquad (29)$$

That is, the average of $\{x_t\}_{t=1}^T$ must be asymptotically close to $(0, 1)^\top$ in expectation.
Also, noticing that $y_1^*$ is a strictly dominant strategy for the learner, we have $\mathrm{Reg}_2(\mathcal{A}_1, \mathcal{A}_2; G_1) = 0$. Now consider $G_2$; the unique Stackelberg equilibrium for $G_2$ is:

$$x_2^* = (1/2, 1/2)^\top, \quad y_2^* = (1, 0)^\top, \qquad (30)$$

yielding a Stackelberg value of $V(A, B_2) = 1/2$. Since $\mathcal{A}_1$ only takes the $\{y_t\}_{t=1}^T$ sequence as input, it must behave identically as in $G_1$, with the expected average of $\{x_t\}_{t=1}^T$ asymptotically close to $(0, 1)^\top$ as well. That is,

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2; G_2) = \frac{T}{2} - \mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T x_t^\top A y_t \,\Big|\, y_t = y_1^*, \forall t\Big] \geq \frac{T}{2} - \big(\epsilon T + o(T)\big) = \Big(\frac{1}{2} - \epsilon\Big) T - o(T). \qquad (31)$$

Therefore, as long as $\mathcal{A}_1$ is no-Stackelberg-regret on $G_1$ against $\mathcal{A}_2$, it incurs linear Stackelberg regret on $G_2$ against the same $\mathcal{A}_2$. Also, since (29) still holds under $G_2$, we have the following upper bound on the learner regret:

$$\begin{aligned} \mathrm{Reg}_2(\mathcal{A}_1, \mathcal{A}_2; G_2) &\leq T - \mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T x_t^\top B_2 y_t \,\Big|\, y_t = y_1^*, \forall t\Big] \\ &= \mathbb{E}\Big[\sum_{t=1}^T \big((0, 1) - x_t^\top\big) B_2 y_1^*\Big] = \mathbb{E}\Big[\sum_{t=1}^T \big((0, 1) - x_t^\top\big) \begin{pmatrix} 0 \\ 1 \end{pmatrix}\Big] \\ &= \frac{1}{\epsilon}\, \mathbb{E}\Big[\sum_{t=1}^T \big((0, 1) - x_t^\top\big) \begin{pmatrix} 0 \\ \epsilon \end{pmatrix}\Big] = o(T), \qquad (32) \end{aligned}$$

where the first inequality holds because the best fixed comparator earns at most $T$. This completes the first part of the proof.

We now modify $\mathcal{A}_2$ into a no-regret algorithm. Notice that although the $\mathcal{A}_2$ constructed above obtains sublinear regret against $\mathcal{A}_1$ on both $G_1$ and $G_2$, it is not no-regret on $G_2$, since it may incur linear regret on some $\{x_t\}_{t=1}^T$ sequence, e.g., $x_t = (1, 0)^\top$ for all $t$. Since $\mathcal{A}_1$ is fixed, we can use a function $g(T) = o(T)$ to characterize its Stackelberg regret; namely, we select $g(T)$ such that

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2; G_1) = \epsilon T - \mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T x_t^\top A y_t \,\Big|\, y_t = y_1^*, \forall t\Big] = O(g(T)). \qquad (33)$$

Our idea is to let the learner keep track of the accumulated regret up to the current time step $t$ to identify whether the trajectory $\{x_\tau\}_{\tau=1}^{t-1}$ is generated by $\mathcal{A}_1$ or not. Consider the modified algorithm $\tilde{\mathcal{A}}_2$ as follows:

1. When the interaction process starts, stick to $(0, 1)^\top$;
2. At each time step $t$, calculate the running Stackelberg regret $\mathrm{SR}(\{x_\tau, y_\tau\}_{\tau=1}^{t-1}) := (t-1)\epsilon - \sum_{\tau=1}^{t-1} x_\tau^\top A y_\tau$;
3. If $\mathrm{SR}(\{x_\tau, y_\tau\}_{\tau=1}^{t-1}) \geq \sqrt{T g(T)}$, switch and stick to online mirror ascent; otherwise keep playing $(0, 1)^\top$.

Notice that here we may use the knowledge of $g(T)$, which serves as the Stackelberg regret bound of $\mathcal{A}_1$, because we only aim to prove the existence of such an algorithm $\tilde{\mathcal{A}}_2$. To complete the proof of Theorem 4.1, notice that on game instance $G_1$, no matter what trajectory $\{x_\tau\}_{\tau=1}^{t-1}$ it faces, since $(0, 1)^\top$ is a dominant learner action, $\tilde{\mathcal{A}}_2$ plays $(0, 1)^\top$ for all time steps, and therefore incurs 0 regret. It suffices to prove that on the game instance $G_2$, $\tilde{\mathcal{A}}_2$ is no-regret and still incurs $\Theta(T)$ Stackelberg regret on the optimizer using $\mathcal{A}_1$. To simplify calculations we use the notations $x = (x_1, x_2)^\top$ and $y = (y_1, y_2)^\top$ here.

To show that $\tilde{\mathcal{A}}_2$ is no-regret, notice that before switching to mirror ascent, the learner regret has the following form:

$$\mathrm{Reg}_2(\{x_\tau, y_\tau\}_{\tau=1}^t; G_2) = \max\Big\{\sum_{\tau=1}^t x_{\tau,1} - \sum_{\tau=1}^t x_{\tau,2},\ 0\Big\} = \max\Big\{\sum_{\tau=1}^t (1 - 2x_{\tau,2}),\ 0\Big\}. \qquad (34)$$

Also, $\tilde{\mathcal{A}}_2$ sticks to $(0, 1)^\top$, and therefore,

$$\mathrm{SR}(\{x_\tau, y_\tau\}_{\tau=1}^{t-1}) = (t-1)\epsilon - \epsilon \sum_{\tau=1}^{t-1} x_{\tau,2}. \qquad (35)$$
Let $T_s$ denote the time step at which $\tilde{\mathcal{A}}_2$ switches, or $T_s = T$ if the algorithm does not switch before the end. We have:

$$\begin{aligned} \mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^{T_s}; G_2) &\leq \max\Big\{\sum_{t=1}^{T_s - 1} (1 - 2x_{t,2}),\ 0\Big\} + O(1) \\ &= \max\Big\{(T_s - 1) - 2\Big(T_s - 1 - \frac{\mathrm{SR}(\{x_t, y_t\}_{t=1}^{T_s - 1})}{\epsilon}\Big),\ 0\Big\} + O(1) \\ &= \max\Big\{\frac{2\,\mathrm{SR}(\{x_t, y_t\}_{t=1}^{T_s - 1})}{\epsilon} - T_s + 1,\ 0\Big\} + O(1) \\ &\leq \frac{2\,\mathrm{SR}(\{x_t, y_t\}_{t=1}^{T_s - 1})}{\epsilon} + O(1), \qquad (36) \end{aligned}$$

and therefore,

$$\begin{aligned} \mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^T; G_2) &= \max_{y \in \Delta_n} \sum_{t=1}^T x_t^\top B_2 y - \sum_{t=1}^T x_t^\top B_2 y_t \\ &\leq \Big(\max_{y \in \Delta_n} \sum_{t=1}^{T_s} x_t^\top B_2 y - \sum_{t=1}^{T_s} x_t^\top B_2 y_t\Big) + \Big(\max_{y \in \Delta_n} \sum_{t=T_s+1}^T x_t^\top B_2 y - \sum_{t=T_s+1}^T x_t^\top B_2 y_t\Big) \\ &= \mathrm{Reg}_2(\{x_t, y_t\}_{t=1}^{T_s}; G_2) + \mathrm{Reg}_2(\{x_t, y_t\}_{t=T_s+1}^T; G_2) \\ &\leq \frac{2\sqrt{T g(T)}}{\epsilon} + O(\sqrt{T}) = o(T), \qquad (37) \end{aligned}$$

where the last inequality bounds the first term via the switching threshold and the second via the standard regret guarantee of online mirror ascent on the remaining rounds. Since this holds for an arbitrary $\{x_t\}_{t=1}^T$ sequence, we deduce that $\tilde{\mathcal{A}}_2$ is a no-regret algorithm.

To show that $\tilde{\mathcal{A}}_2$ incurs $\Theta(T)$ Stackelberg regret on the optimizer using $\mathcal{A}_1$, consider the event

$$\mathcal{E} = \big\{\mathrm{SR}(\{x_\tau, y_1^*\}_{\tau=1}^t; G_1) \geq \sqrt{T g(T)} \ \text{for some } t \in [T]\big\} = \big\{\mathrm{SR}(\{x_\tau, y_1^*\}_{\tau=1}^T; G_1) \geq \sqrt{T g(T)}\big\} \qquad (38)$$

that captures the case where the running Stackelberg regret of $\mathcal{A}_1$ exceeds $\sqrt{T g(T)}$ under $G_1$ given that the learner fixes $y_1^*$ (the two events coincide because the running Stackelberg regret is non-decreasing in $t$ when the learner plays $y_1^*$). Since $\mathcal{A}_1$ is no-Stackelberg-regret on $G_1$, the probability of $\mathcal{E}$ should be small; by Markov's inequality and (33),

$$\Pr(\mathcal{E}) \leq \frac{\mathbb{E}\big[\mathrm{SR}(\{x_\tau, y_\tau\}_{\tau=1}^T; G_1)\big]}{\sqrt{T g(T)}} = O\Big(\sqrt{\frac{g(T)}{T}}\Big). \qquad (39)$$

Conditioned on $\mathcal{E}$ not happening, $\tilde{\mathcal{A}}_2$ never switches to online mirror ascent, and the Stackelberg regret under $G_2$ satisfies:

$$\begin{aligned} \mathbb{E}_{\{x_t, y_t\}_{t=1}^T \sim \mathcal{A}_1, \tilde{\mathcal{A}}_2}\big[\mathrm{StackReg}_1(\{x_t, y_t\}_{t=1}^T; G_2) \,\big|\, \neg\mathcal{E}\big] &\geq \frac{T}{2} - \mathbb{E}_{\{x_t\}_{t=1}^T \sim \mathcal{A}_1}\Big[\sum_{t=1}^T x_t^\top A \begin{pmatrix} 0 \\ 1 \end{pmatrix} \Big|\, \neg\mathcal{E}\Big] \\ &\geq \frac{T}{2} - \Big(\epsilon T - \mathbb{E}\big[\mathrm{SR}(\{x_t, (0, 1)^\top\}_{t=1}^T) \,\big|\, \neg\mathcal{E}\big]\Big) \\ &\geq \Big(\frac{1}{2} - \epsilon\Big) T - O\big(\sqrt{T g(T)}\big). \qquad (40) \end{aligned}$$

Since $\mathcal{A}_1$ must respond identically on $G_2$, we can write the Stackelberg regret of $\mathcal{A}_1$ as:

$$\begin{aligned} \mathrm{StackReg}_1(\mathcal{A}_1, \tilde{\mathcal{A}}_2; G_2) &= \mathbb{E}_{\{x_t, y_t\}_{t=1}^T \sim \mathcal{A}_1, \tilde{\mathcal{A}}_2}\big[\mathrm{StackReg}_1(\{x_t, y_t\}_{t=1}^T; G_2)\big] \\ &= \mathbb{E}\big[\mathrm{StackReg}_1(\cdot\,; G_2) \,\big|\, \mathcal{E}\big] \Pr(\mathcal{E}) + \mathbb{E}\big[\mathrm{StackReg}_1(\cdot\,; G_2) \,\big|\, \neg\mathcal{E}\big] \big(1 - \Pr(\mathcal{E})\big) \\ &\geq -O\big(\sqrt{T g(T)}\big) + \mathbb{E}\big[\mathrm{StackReg}_1(\cdot\,; G_2) \,\big|\, \neg\mathcal{E}\big] \big(1 - \Pr(\mathcal{E})\big) \\ &\overset{(i)}{\geq} -O\big(\sqrt{T g(T)}\big) + \Big(\Big(\frac{1}{2} - \epsilon\Big) T - O\big(\sqrt{T g(T)}\big)\Big) \big(1 - o(1)\big) \\ &\geq c\, T \qquad (41) \end{aligned}$$

for some constant $c$, where we have used (40) in (i), and the third line holds because the per-round payoff is bounded, so $\mathbb{E}[\mathrm{StackReg}_1 \,|\, \mathcal{E}] \Pr(\mathcal{E}) \geq -O(T) \cdot \Pr(\mathcal{E}) = -O(\sqrt{T g(T)})$. As a result, $\tilde{\mathcal{A}}_2$ incurs $\Theta(T)$ Stackelberg regret against $\mathcal{A}_1$, which completes the proof of Theorem 4.1.

B. Proofs for Section 5.1

B.1. Proof of Proposition 5.4

Fix some $i$. Since $d$-pessimism implies $E_i^-$ and $E_i$ are non-empty, let $x_i^-$ denote the optimal solution to (9) and $x^*$ the optimal solution to (7). Since $E_i^- \subseteq E_i$, we have:

$$V_i^-(A, B) = (x_i^-)^\top A_{:,i} \leq (x^*)^\top A_{:,i} = V_i(A, B). \qquad (42)$$

Also, $d_H(E_i, E_i^-) \leq d$ implies there exists $\hat{x} \in E_i^-$ satisfying $\|\hat{x} - x^*\|_1 \leq d$, and thus:

$$V_i^-(A, B) = (x_i^-)^\top A_{:,i} \geq \hat{x}^\top A_{:,i} = (x^* + \hat{x} - x^*)^\top A_{:,i} \geq (x^*)^\top A_{:,i} - d\,\|A_{:,i}\|_\infty = V_i(A, B) - d\,\|A_{:,i}\|_\infty, \qquad (43)$$

where the first inequality holds due to the optimality of $x_i^-$ as a solution to (9), and in the second inequality we use Hölder's inequality, which gives $|a^\top b| \leq \|a\|_1 \|b\|_\infty$. This completes the proof.

B.2. Refined Statement and Proof of Theorem 5.5

We first provide a refined statement of Theorem 5.5 that expands the big-$O(\cdot)$ notation in the original statement.

Theorem B.1. If P1 has a set of $d_1$-pessimistic facets $E_i^-$, $i \in [n]$, such that for all $i \neq j$, $\inf_{x \in E_i^-, x' \in E_j} \|x - x'\|_1 \geq d_2$, then as long as P2 is using an $f$-no-regret algorithm $\mathcal{A}_2$ with constant $C$, P1 can guarantee a Stackelberg regret of

$$\mathrm{StackReg}_1(\mathcal{A}_1, \mathcal{A}_2) \leq \|A\|_{\max}\Big(T d_1 + \frac{2 C f(T)}{\epsilon d_2}\Big) \qquad (44)$$

by sticking to the corresponding $x^*$ obtained from (9) and (10). Here $\epsilon$ is a constant that depends only on the learner's payoff matrix $B$.

Proof. Let $x_i^-$ denote the optimal solution to (9) and $i^* = \arg\max_i V_i^-(A, B)$ the index of the best response under $x^*$, so that $x^* = x_{i^*}^-$. Since the pessimistic facet $E_{i^*}^-$ satisfies $\inf_{x \in E_{i^*}^-, x' \in E_j} \|x - x'\|_1 \geq d_2$ for all $j \neq i^*$, $e_{i^*}$ is the unique best response to $x^*$, and we have

$$(x^*)^\top B e_{i^*} - (x^*)^\top B e_j \geq \epsilon d_2, \quad \forall j \in [n],\ j \neq i^* \qquad (45)$$
Since the learner regret has the following expression:
$$\mathrm{Reg}_2(\{x^-,y_t\}_{t=1}^T) \ge \sum_{t=1}^T (x^-)^T B\big(e_{i^-}-y_t\big) = \sum_{t=1}^T (x^-)^T B\Big(e_{i^-}-\sum_{j\in[n]} y_{t,j}e_j\Big) = \sum_{t=1}^T (x^-)^T B\Big(\sum_{j\neq i^-} y_{t,j}\,e_{i^-} - \sum_{j\neq i^-} y_{t,j}\,e_j\Big) = \sum_{t=1}^T \sum_{j\neq i^-} y_{t,j}\big((x^-)^T B e_{i^-} - (x^-)^T B e_j\big) \ge \epsilon d_2\sum_{t=1}^T \sum_{j\neq i^-} y_{t,j}, \qquad (46)$$
where we have used the fact that $1-y_{t,i^-} = \sum_{j\neq i^-} y_{t,j}$ for all $y_t\in\Delta_n$ in the third equality. Since the algorithm $\mathcal{A}_2$ used by the learner is $f$-no-regret with regret constant $C$, we have:
$$\sum_{t=1}^T \big\|e_{i^-}-y_t\big\|_1 = \sum_{t=1}^T \Big(\big(1-y_{t,i^-}\big) + \sum_{j\neq i^-} y_{t,j}\Big) = 2\sum_{t=1}^T \sum_{j\neq i^-} y_{t,j} \le \frac{2\,\mathrm{Reg}_2(\{x^-,y_t\}_{t=1}^T)}{\epsilon d_2} \le \frac{2Cf(T)}{\epsilon d_2}, \qquad (47)$$
where the first equality follows from $y_{t,i}\in[0,1]$ for all $y_t\in\Delta_n$, $i\in[n]$, and the second follows from the same argument as above. Let $i^*$ denote $\arg\max_i V_i(A,B)$; the Stackelberg regret of $P_1$ satisfies:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) = T V^*(A,B) - \sum_{t=1}^T (x^-)^T A y_t = T V^*(A,B) - T(x^-)^T A e_{i^-} + \sum_{t=1}^T (x^-)^T A\big(e_{i^-}-y_t\big) \overset{(i)}{\le} T\big(V^*(A,B) - V^-_{i^-}(A,B)\big) + \|A^T x^-\|_\infty\sum_{t=1}^T\|e_{i^-}-y_t\|_1 \overset{(ii)}{\le} T\big(V_{i^*}(A,B) - V^-_{i^*}(A,B)\big) + \|A^T x^-\|_\infty\sum_{t=1}^T\|e_{i^-}-y_t\|_1 \overset{(iii)}{\le} Td_1\|A\|_{\max} + \|A^T x^-\|_\infty\sum_{t=1}^T\|e_{i^-}-y_t\|_1 \overset{(iv)}{\le} Td_1\|A\|_{\max} + \|A\|_{\max}\sum_{t=1}^T\|e_{i^-}-y_t\|_1 \overset{(v)}{\le} Td_1\|A\|_{\max} + \frac{2\|A\|_{\max}Cf(T)}{\epsilon d_2},$$
where we have used Hölder's inequality in $(i)$, the maximizing argument of (10) in $(ii)$, Proposition 5.4 in $(iii)$, Hölder's inequality in $(iv)$, and (47) in $(v)$. This completes the proof of Theorem 5.5.

B.3. Regret Invariance Properties of Equivalent Payoff Matrices

We now state and prove Proposition B.2, which shows that the same trajectory yields the same asymptotic learner regret for all learner payoff matrices within the same equivalence class.

Proposition B.2. Consider an interaction history $\{x_t,y_t\}_{t=1}^T$ that is $f$-no-regret for the learner with payoff matrix $B_1$. Then for every matrix $B_2$ equivalent to $B_1$, the same interaction history is also $f$-no-regret. As a result, for every $f$-no-regret learner algorithm $\mathcal{A}_2$ on $B_1$, there exists another learner algorithm $\mathcal{A}_2'$ on $B_2$ that simulates $\mathcal{A}_2$ on $B_1$ and is also $f$-no-regret on $B_2$.

Proof. We use the notation $\mathrm{Reg}_2(\cdot\,;B)$ to denote the learner regret under payoff matrix $B$. Since $B_1$ and $B_2$ are in the same equivalence class, by definition we have $B_2 = cB_1 + \mu 1_n^T$ for some $c\in\mathbb{R}_+$, $\mu\in\mathbb{R}^m$. The interaction history $\{x_t,y_t\}_{t=1}^T$ being $f$-no-regret on $B_1$ implies that for some constant $C$:
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T;B_1) = \max_{y\in\Delta_n}\sum_{t=1}^T x_t^T B_1 y - \sum_{t=1}^T x_t^T B_1 y_t \le Cf(T). \qquad (49)$$
Therefore, we have the following bound on the learner regret on $B_2$:
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T;B_2) = \max_{y\in\Delta_n}\sum_{t=1}^T x_t^T\big(cB_1+\mu 1_n^T\big)y - \sum_{t=1}^T x_t^T\big(cB_1+\mu 1_n^T\big)y_t = c\Big(\max_{y\in\Delta_n}\sum_{t=1}^T x_t^T B_1 y - \sum_{t=1}^T x_t^T B_1 y_t\Big) + \sum_{t=1}^T x_t^T\mu - \sum_{t=1}^T x_t^T\mu = c\,\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T;B_1) \le cCf(T),$$
where the second equality holds because $1_n^T y = \sum_{i=1}^n y_i = 1$ for all $y\in\Delta_n$. This shows that $\{x_t,y_t\}_{t=1}^T$ is also $f$-no-regret on $B_2$. A quick numerical sanity check of this invariance appears below.
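The following sketch verifies Proposition B.2 numerically on randomly generated data of our own choosing: the regret on $B_2 = cB_1 + \mu 1_n^T$ equals exactly $c$ times the regret on $B_1$, since the $\mu$ shift contributes the same constant to every comparator.

import numpy as np

rng = np.random.default_rng(0)
m, n, T = 3, 4, 500

B1 = rng.standard_normal((m, n))           # hypothetical learner payoff
xs = rng.dirichlet(np.ones(m), size=T)     # optimizer mixed strategies
ys = rng.dirichlet(np.ones(n), size=T)     # learner mixed strategies

def learner_regret(B, xs, ys):
    # Reg_2({x_t, y_t}; B) = max_y sum_t x_t^T B y - sum_t x_t^T B y_t
    cum = xs.sum(axis=0) @ B               # sum_t x_t^T B, a row vector
    return cum.max() - np.einsum('ti,ij,tj->', xs, B, ys)

c, mu = 2.5, rng.standard_normal(m)
B2 = c * B1 + np.outer(mu, np.ones(n))     # same equivalence class

r1, r2 = learner_regret(B1, xs, ys), learner_regret(B2, xs, ys)
assert np.isclose(r2, c * r1)              # regret scales by c, shift cancels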
C. Proofs for Section 5.2

Within the proofs in this section we will make extensive use of the following property: for all $x\in\Delta_m$,
$$-d_i(B,\hat B)_{\max}\,1_{n-1} \le \big(B_{-i}-\hat B_{-i}\big)^T x \le d_i(B,\hat B)_{\max}\,1_{n-1}. \qquad (51)$$
This property can be obtained by bounding each row of the column vector $(B_{-i}-\hat B_{-i})^T x$, using the fact that each entry of $x$ is in $[0,1]$.

C.1. Proof of Proposition 5.9

Suppose we have $d \ge d_i(B,\hat B)_{\max}$. To prove that $E_i^-(\hat B,d)\subseteq E_i(B)$, notice that if $E_i^-(\hat B,d)=\emptyset$ the inclusion naturally holds. Otherwise, for all $x\in E_i^-(\hat B,d)$, it holds that
$$\big(B_{-i}\big)^T x = \big(B_{-i}-\hat B_{-i}\big)^T x + \big(\hat B_{-i}\big)^T x \overset{(i)}{\le} \big(B_{-i}-\hat B_{-i}\big)^T x - d\,1_{n-1} \overset{(ii)}{\le} \big(d_i(B,\hat B)_{\max} - d\big)1_{n-1} \le 0,$$
where $(i)$ holds by the definition of $E_i^-(\hat B,d)$ and $(ii)$ holds due to (51), so that $x\in E_i(B)$. Similarly, to prove that $E_i(B)\subseteq E_i^+(\hat B,d)$, we only need to consider the case where $E_i(B)\neq\emptyset$; for all $x\in E_i(B)$, we have:
$$\big(\hat B_{-i}\big)^T x = \big(\hat B_{-i}-B_{-i}\big)^T x + \big(B_{-i}\big)^T x \overset{(i)}{\le} \big(\hat B_{-i}-B_{-i}\big)^T x \overset{(ii)}{\le} d_i(B,\hat B)_{\max}\,1_{n-1} \le d\,1_{n-1},$$
where again $(i)$ holds by the definition of $E_i(B)$ and $(ii)$ uses (51). Therefore $x\in E_i^+(\hat B,d)$.

C.2. Relaxed Empty Facet Condition under Positive Inducibility Gap

Proposition C.1. $C_i>0$ is equivalent to $E_i=\emptyset$, and both imply the following:
$$\Big\{x\in\Delta_m : \big(B_{-i}\big)^T x \le \frac{C_i}{2}\,1_{n-1}\Big\} = \emptyset. \qquad (54)$$
Proof. We first prove that $C_i>0 \Leftrightarrow E_i=\emptyset$. Notice that
$$C_i>0 \;\Leftrightarrow\; \forall x\in\Delta_m,\ \max_j x^T\big(B_{-i}\big)_{:,j} > 0 \;\Leftrightarrow\; \forall x\in\Delta_m,\ \exists j\ \text{s.t.}\ x^T\big(B_{-i}\big)_{:,j} > 0 \;\Leftrightarrow\; E_i=\emptyset.$$
We now prove the second part by showing that if there exists $x_0\in\Delta_m$ satisfying
$$\big(B_{-i}\big)^T x_0 \le \frac{C_i}{2}\,1_{n-1}, \qquad (56)$$
then $C_i\le 0$. This is because $C_i = \min_{x\in\Delta_m}\max_j x^T(B_{-i})_{:,j} \le \max_j x_0^T(B_{-i})_{:,j}$, so (56) implies
$$\big(B_{-i}\big)^T x_0 \le \frac12\max_j x_0^T\big(B_{-i}\big)_{:,j}\,1_{n-1}, \qquad (58)$$
which means that for the index $j^*$ attaining the maximum,
$$x_0^T\big(B_{-i}\big)_{:,j^*} \le \frac12\,x_0^T\big(B_{-i}\big)_{:,j^*}, \qquad (59)$$
which is equivalent to
$$x_0^T\big(B_{-i}\big)_{:,j^*} \le 0. \qquad (60)$$
This means that
$$C_i = \min_{x\in\Delta_m}\max_j x^T\big(B_{-i}\big)_{:,j} \le \max_j x_0^T\big(B_{-i}\big)_{:,j} = x_0^T\big(B_{-i}\big)_{:,j^*} \le 0,$$
which completes the proof.

C.3. Proof of Proposition 5.14

Proof of part 1: By Proposition C.1, $E_i=\emptyset$ implies
$$\Big\{x\in\Delta_m : \big(B_{-i}\big)^T x \le \frac{C_i}{2}\,1_{n-1}\Big\} = \emptyset, \qquad (62)$$
or equivalently,
$$\Big\{x\in\Delta_m : \big(\hat B_{-i}\big)^T x \le \big(\hat B_{-i}-B_{-i}\big)^T x + \frac{C_i}{2}\,1_{n-1}\Big\} = \emptyset. \qquad (63)$$
Combining (51) and (21) we obtain
$$\big(\hat B_{-i}-B_{-i}\big)^T x \ge -d_i(B,\hat B)_{\max}\,1_{n-1} \ge -d\,1_{n-1} \ge -\frac{C_i}{4}\,1_{n-1}. \qquad (64)$$
Based on the two equations above, we further have
$$\Big\{x\in\Delta_m : \big(\hat B_{-i}\big)^T x \le \frac{C_i}{4}\,1_{n-1}\Big\} = \emptyset, \qquad (65)$$
and since $d<\frac{C_i}{4}$, it holds that $E_i^+(\hat B,d)=\emptyset$.

Proof of part 2: Let $x_0 = \arg\min_{x\in\Delta_m}\max_j x^T(B_{-i})_{:,j}$. If $E_i(B)\neq\emptyset$, by definition we have
$$\big(B_{-i}\big)^T x_0 \le C_i\,1_{n-1}. \qquad (66)$$
That is,
$$\big(\hat B_{-i}\big)^T x_0 \le \big(\hat B_{-i}-B_{-i}\big)^T x_0 + C_i\,1_{n-1} \overset{(i)}{\le} d_i(B,\hat B)_{\max}\,1_{n-1} + C_i\,1_{n-1} \overset{(ii)}{\le} \big(C_i + d\big)1_{n-1} \overset{(iii)}{\le} -d\,1_{n-1},$$
where we have used (51) in $(i)$, the condition (21) in $(ii)$, and the assumption $d \le |C_i|/2$ in $(iii)$. This means that $x_0\in E_i^-(\hat B,d)$ and hence $E_i^-(\hat B,d)\neq\emptyset$.

C.4. Proof of Lemma 5.16

To bound the difference term
$$V_i^+(A,\hat B) - V_i^-(A,\hat B), \qquad (68)$$
notice that (16) can be written as the linear program
$$\text{maximize } V_i^+(A,\hat B) = x^T A_{:,i} \quad \text{subject to } \begin{pmatrix} (\hat B_{-i})^T \\ 1_m^T \\ -1_m^T \\ -I_m \end{pmatrix} x \le \begin{pmatrix} d\,1_{n-1} \\ 1 \\ -1 \\ 0 \end{pmatrix}, \qquad (69)$$
and similarly for (17):
$$\text{maximize } V_i^-(A,\hat B) = x^T A_{:,i} \quad \text{subject to } \begin{pmatrix} (\hat B_{-i})^T \\ 1_m^T \\ -1_m^T \\ -I_m \end{pmatrix} x \le \begin{pmatrix} -d\,1_{n-1} \\ 1 \\ -1 \\ 0 \end{pmatrix}. \qquad (70)$$
For notational simplicity, we use $M$ to denote the constraint matrix above, and we only need to bound the term $\|M_I^{-1}\delta_I\|_1$ over all linearly independent index sets $I$ such that $|I|=m$, where $k=|[n-1]\cap I|$ is the number of rows in $M_I$ corresponding to those in $(\hat B_{-i})^T$. We begin with some auxiliary lemmas:

Lemma C.2. Consider a linear optimization problem of the form
$$\text{maximize } V = c^T x \quad \text{subject to } Ax \le b, \qquad (71)$$
and its perturbed problem
$$\text{maximize } V(\delta) = c^T x \quad \text{subject to } Ax \le b+\delta, \qquad (72)$$
where $x\in\mathbb{R}^n$ and $A\in\mathbb{R}^{m\times n}$ for some $m\ge n$ (notice that in the context of this lemma the matrix $A$ and its dimensions $m,n$ are in general not those considered in the broader setting of the game). Assume both problems are feasible and the constraint sets are bounded. Recalling that $A_I$ denotes the matrix constructed by selecting rows of $A$ from an index set $I\subseteq[m]$, we have that
$$V(\delta) - V \le \max_{I\in S} c^T A_I^{-1}\delta_I, \qquad (73)$$
where $S$ denotes the set of all index sets corresponding to rows in any basic solution to (71); equivalently, the maximization is over all linearly independent row combinations of size $n$.

Proof. See Appendix G.1.
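As a quick numerical illustration of Lemma C.2 on an assumed small instance (SciPy's linprog serves as the LP solver; we enumerate all invertible $n$-row submatrices, a superset of the basic-solution sets $S$, which can only enlarge the right-hand side of (73)):

import numpy as np
from itertools import combinations
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 2, 5

# A bounded feasible region: max c^T x  s.t.  A x <= b  (box rows included).
A = np.vstack([rng.standard_normal((1, n)), np.eye(n), -np.eye(n)])
b = np.ones(m)
c = rng.standard_normal(n)

def solve(b_vec):
    res = linprog(-c, A_ub=A, b_ub=b_vec, bounds=[(None, None)] * n)
    assert res.success
    return -res.fun            # optimal value of the maximization problem

delta = 1e-3 * rng.standard_normal(m)
gap = solve(b + delta) - solve(b)

bound = -np.inf
for I in combinations(range(m), n):
    A_I = A[list(I)]
    if abs(np.linalg.det(A_I)) > 1e-9:
        bound = max(bound, c @ np.linalg.solve(A_I, delta[list(I)]))
print(gap, bound)              # gap <= bound, up to solver tolerance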
Lemma C.3. For arbitrary $\hat B_{-i}$ and the corresponding $M$ described above, with $\delta$ denoting the difference between the right-hand sides of (69) and (70) (so that $\delta$ equals $2d$ on the entries corresponding to $(\hat B_{-i})^T$ and $0$ elsewhere), we have:
$$\max_{I\in S}\big\|M_I^{-1}\delta_I\big\|_1 \le 2d\,\mathrm{Sen}\big((\hat B_{-i})^T\big), \qquad (74)$$
where $S$ denotes the set of all index sets containing linearly independent rows of $M$ of size $m$, and $k=|[n-1]\cap I|$ counts the rows of $M_I$ selected from $(\hat B_{-i})^T$.

Proof. See Appendix G.2.

Lemma C.4. For an invertible matrix $B$ and a small perturbation matrix $\delta B$, let $\|\cdot\|$ be any sub-multiplicative matrix norm. If $\|B^{-1}\|\,\|\delta B\| < 1$, then $B+\delta B$ is also invertible and its inverse is bounded by:
$$\big\|(B+\delta B)^{-1}\big\| \le \frac{\|B^{-1}\|}{1-\|B^{-1}\|\,\|\delta B\|}. \qquad (75)$$
Proof. See Appendix G.3.

Proof of Lemma 5.16. Comparing equations (69) and (70), we obtain the following through Lemmas C.2 and C.3:
$$V_i^+(A,\hat B) - V_i^-(A,\hat B) \le \|A_{:,i}\|_\infty\,\max_{I\in S}\big\|M_I^{-1}\delta_I\big\|_1 \le 2d\,\|A_{:,i}\|_\infty\,\mathrm{Sen}\big((\hat B_{-i})^T\big). \qquad (76)$$
Since $d_i(B,\hat B)\,\mathrm{Sen}\big((B_{-i})^T\big) \le \frac12$, for all invertible submatrices $(\hat B_{-i})^T_{P,Q}$ and $(B_{-i})^T_{P,Q}$ the condition of Lemma C.4 is satisfied, and for every possible combination of $P$ and $Q$, Lemma C.4 implies:
$$\Big\|\big((\hat B_{-i})^T_{P,Q}\big)^{-1}\Big\| \le \frac{\big\|\big((B_{-i})^T_{P,Q}\big)^{-1}\big\|}{1-\big\|\big((B_{-i})^T_{P,Q}\big)^{-1}\big\|\,d_i(B,\hat B)} \le \frac{\mathrm{Sen}\big((B_{-i})^T\big)}{1-\mathrm{Sen}\big((B_{-i})^T\big)\,d_i(B,\hat B)}.$$
We thus have that
$$\mathrm{Sen}\big((\hat B_{-i})^T\big) \le \frac{\mathrm{Sen}\big((B_{-i})^T\big)}{1-d\,\mathrm{Sen}\big((B_{-i})^T\big)}, \qquad (79)$$
which leads to the final result:
$$V_i^+(A,\hat B) - V_i^-(A,\hat B) \le 2d\,\|A_{:,i}\|_\infty\,\mathrm{Sen}\big((\hat B_{-i})^T\big) \le \frac{2d\,\|A_{:,i}\|_\infty\,\mathrm{Sen}\big((B_{-i})^T\big)}{1-d\,\mathrm{Sen}\big((B_{-i})^T\big)} \le 4d\,\|A_{:,i}\|_\infty\,\mathrm{Sen}\big((B_{-i})^T\big),$$
where the last inequality holds because $1-d\,\mathrm{Sen}\big((B_{-i})^T\big) \ge \frac12$.

C.5. Refined Statement and Proof of Theorem 5.17

We first expand the big-$O(\cdot)$ notation in the statement of Theorem 5.17 and obtain the following theorem:

Theorem C.5. Under Assumption 5.12, if $P_1$ has an estimator $\hat B$ of $B$ such that $d_i(B,\hat B)\le\epsilon = O(g(T)/T)$ for all $i$ and some $g(T)=o(T)$, then if $P_2$ is using an $f$-no-regret algorithm $\mathcal{A}_2$ with constant $C$, there exists an algorithm $\mathcal{A}_1$ satisfying:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) \le \Big(4\sqrt{Tf(T)}\,\mathrm{Sen}\big((B_{-i^*})^T\big) + \frac{C\sqrt{Tf(T)}}{\max_{j,k}\|B(e_j-e_k)\|_\infty}\Big)\|A\|_{\max} + O(g(T)) = O\big(\sqrt{Tf(T)} + g(T)\big).$$

Proof. Consider the algorithm $\mathcal{A}_1$ that commits to the solution $(x^-, i^-)$ of:
$$\underset{i,x}{\text{maximize}}\ V_i^-(A,\hat B) = x^T A_{:,i} \quad \text{subject to } \big(\hat B_{-i}\big)^T x \le -\big(\epsilon + \bar f(T)/T\big)1_{n-1},\ x\in\Delta_m,\ i\in[n], \qquad (82)$$
where $\bar f(T)$ is some function satisfying $\bar f(T)=o(T)$ and $f(T)=o(\bar f(T))$. Since $\epsilon + \bar f(T)/T \ge d_i(B,\hat B)$, Proposition 5.9 guarantees that $e_{i^-}$ is a best response to $x^-$ under $B$. Let $(x^*, i^*)$ denote the Stackelberg equilibrium of the game. We have that $E_{i^*}(B)\neq\emptyset$, and since $\epsilon=O(g(T)/T)$, the margin $\epsilon + \bar f(T)/T$ goes to zero as $T\to\infty$; Assumption 5.12 and Proposition 5.14 then guarantee that $V_{i^*}^-(A,\hat B)$ is well-defined (and therefore so is $V_{i^*}^+(A,\hat B)$) for large enough $T$.

Since $\mathcal{A}_2$ is $f$-no-regret, it holds that
$$\mathbb{E}\Big[\sum_{t=1}^T (x^-)^T B\big(e_{i^-}-y_t\big)\Big] \le Cf(T) \qquad (83)$$
for some constant $C$. We decompose the optimizer's cumulative payoff as
$$\mathbb{E}\Big[\sum_{t=1}^T x_t^T A y_t\Big] = \mathbb{E}\Big[\sum_{t=1}^T (x^-)^T A y_t\Big] = T\,(x^-)^T A e_{i^-} - \mathbb{E}\Big[\sum_{t=1}^T (x^-)^T A\big(e_{i^-}-y_t\big)\Big]. \qquad (84)$$
For the first term, notice that for large enough $T$, $\epsilon + \bar f(T)/T \le \frac{1}{2\,\mathrm{Sen}((B_{-i^*})^T)}$, and we get:
$$T(x^-)^T A e_{i^-} = T V_{i^-}^-(A,\hat B) \overset{(i)}{\ge} T V_{i^*}^-(A,\hat B) = T V_{i^*}^+(A,\hat B) - T\big(V_{i^*}^+(A,\hat B) - V_{i^*}^-(A,\hat B)\big) \overset{(ii)}{\ge} T V^*(A,B) - 4\big(\epsilon T + \bar f(T)\big)\|A_{:,i^*}\|_\infty\,\mathrm{Sen}\big((B_{-i^*})^T\big), \qquad (85)$$
where $(i)$ holds since we are taking the maximum in (82), and $(ii)$ holds by Proposition 5.9 and Lemma 5.16. For the second term, since $x^-$ satisfies
$$\big(\hat B_{-i^-}\big)^T x^- \le -\big(\epsilon + \bar f(T)/T\big)1_{n-1}, \qquad (86)$$
we have
$$\big(B_{-i^-}\big)^T x^- = \big(\hat B_{-i^-}\big)^T x^- + \big(B_{-i^-}-\hat B_{-i^-}\big)^T x^- \overset{(i)}{\le} \Big(-\big(\epsilon + \bar f(T)/T\big) + d_{i^-}(B,\hat B)\Big)1_{n-1} \overset{(ii)}{\le} -\frac{\bar f(T)}{T}\,1_{n-1},$$
where $(i)$ holds by the definition of $d_i(\cdot,\cdot)$ and (51), and $(ii)$ holds by the assumption that $d_{i^-}(B,\hat B)\le\epsilon$. Thus $e_{i^-}$ is the unique best response to $x^-$, with a per-step payoff gap of at least $\frac{\bar f(T)}{T}\max_{j,k}\|B(e_j-e_k)\|_\infty$ (recall that $d_i$ is defined on the normalized payoff matrix), so the same argument as in (46)–(47) yields
$$\mathbb{E}_{\mathcal{A}_2}\Big[\Big\|\sum_{t=1}^T\big(y_t-e_{i^-}\big)\Big\|_1\Big] \le \frac{Cf(T)\,T}{\bar f(T)\,\max_{j,k}\|B(e_j-e_k)\|_\infty}. \qquad (89)$$
Thus we have:
$$\Big|\mathbb{E}\Big[\sum_{t=1}^T (x^-)^T A\big(y_t-e_{i^-}\big)\Big]\Big| \overset{(i)}{\le} \|A^T x^-\|_\infty\,\mathbb{E}_{\mathcal{A}_2}\Big[\Big\|\sum_{t=1}^T\big(y_t-e_{i^-}\big)\Big\|_1\Big] \overset{(ii)}{\le} \frac{\|A\|_{\max}\,Cf(T)\,T}{\bar f(T)\,\max_{j,k}\|B(e_j-e_k)\|_\infty}, \qquad (90)$$
where we apply Hölder again in $(i)$ and use the fact that $x^-\in\Delta_m$ in $(ii)$.
Combining (84), (85) and (90), we obtain as $T\to\infty$:
$$\mathbb{E}\Big[\sum_{t=1}^T x_t^T A y_t\Big] \ge T V^*(A,B) - 4\big(\epsilon T + \bar f(T)\big)\|A_{:,i^*}\|_\infty\,\mathrm{Sen}\big((B_{-i^*})^T\big) - \frac{\|A\|_{\max}\,Cf(T)\,T}{\bar f(T)\,\max_{j,k}\|B(e_j-e_k)\|_\infty} \ge T V^*(A,B) - 4\big(O(g(T)) + \bar f(T)\big)\|A_{:,i^*}\|_\infty\,\mathrm{Sen}\big((B_{-i^*})^T\big) - \frac{\|A\|_{\max}\,Cf(T)\,T}{\bar f(T)\,\max_{j,k}\|B(e_j-e_k)\|_\infty}.$$
Taking $\bar f(T) = \sqrt{Tf(T)}$, we have:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) \le 4\sqrt{Tf(T)}\,\|A_{:,i^*}\|_\infty\,\mathrm{Sen}\big((B_{-i^*})^T\big) + \frac{\|A\|_{\max}\,C\sqrt{Tf(T)}}{\max_{j,k}\|B(e_j-e_k)\|_\infty} + O(g(T)) = O\big(g(T) + \sqrt{Tf(T)}\big).$$
This completes the proof of Theorem 5.17.

D. Lower Bound on Stackelberg Regret against General f-no-regret Learners

In this section we provide Theorem D.1, which shows that even if the optimizer knows $B$, the learner still has a no-regret algorithm with regret budget $f$ that can cause a $\sqrt{Tf(T)}$ Stackelberg regret for $P_1$, indicating that our Stackelberg regret bound in Theorem 5.17 is essentially optimal.

Theorem D.1. Consider a given function $f(T)=o(T)$ which serves as the regret budget of $P_2$. There exists a game instance $G=(A,B)$ that satisfies Assumption 5.12 and an $f$-no-regret learner algorithm $\mathcal{A}_2$ that can be used by $P_2$ such that for every mixed strategy $x\in\Delta_m$, the non-adaptive algorithm $\mathcal{A}_1$ that plays $x_t=x$ at all time steps has Stackelberg regret at least $\Omega\big(\sqrt{Tf(T)}\big)$.

Proof. Consider the game instance $G=(A,B)$ where
$$A = \begin{pmatrix} 0 & 0 \\ 3 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad (93)$$
whose unique Stackelberg equilibrium is
$$x^* = (\tfrac12,\tfrac12)^T, \qquad y^* = (1,0)^T \qquad (94)$$
with Stackelberg value $V^*(A,B)=\frac32$. For an algorithm $\mathcal{A}_1$ that outputs a fixed optimizer action $x=(x_1,x_2)^T$ and a given sequence $\{y_t\}_{t=1}^T$ of learner actions, the learner regret can be expressed as:
$$\mathrm{Reg}_2(\{x_t=x, y_t\}_{t=1}^T) = T\max\{x_1,x_2\} - \sum_{t=1}^T x^T y_t. \qquad (95)$$
Consider the learner algorithm $\mathcal{A}_2$ defined as:
1. If $x_2 \ge x_1$, play $(0,1)^T$;
2. Otherwise, play $(0,1)^T$ for $\frac{f(T)}{1-2x_2}$ rounds, and $(1,0)^T$ for the remaining $T-\frac{f(T)}{1-2x_2}$ rounds.
The learner regret of $\mathcal{A}_2$ if $x_2\ge x_1$ is:
$$\mathrm{Reg}_2(\mathcal{A}_2,\{x_t=x\}_{t=1}^T) = Tx_2 - Tx_2 = 0, \qquad (96)$$
and if $x_2<x_1$, we have:
$$\mathrm{Reg}_2(\mathcal{A}_2,\{x_t=x\}_{t=1}^T) = Tx_1 - \frac{f(T)}{1-2x_2}x_2 - \Big(T-\frac{f(T)}{1-2x_2}\Big)x_1 = T(1-x_2) - \frac{f(T)}{1-2x_2}x_2 - \Big(T-\frac{f(T)}{1-2x_2}\Big)(1-x_2) = \frac{(1-x_2)f(T)}{1-2x_2} - \frac{x_2 f(T)}{1-2x_2} = f(T),$$
and therefore $\mathcal{A}_2$ is $f$-no-regret. If the optimizer wants to achieve sublinear Stackelberg regret, $x$ must satisfy $x_2<x_1$, since otherwise $\mathcal{A}_2$ will stick to $(0,1)^T$ and cause a $\Theta(T)$ Stackelberg regret. We now calculate the Stackelberg regret of the optimizer when $x_2<x_1$:
$$\mathrm{StackReg}_1(\{x_t=x,y_t\}_{t=1}^T) = \frac32 T - \Big(3\Big(T-\frac{f(T)}{1-2x_2}\Big) + \frac{f(T)}{1-2x_2}\Big)x_2 = \Big(\frac32 - 3x_2\Big)T + \frac{2x_2}{1-2x_2}f(T). \qquad (98)$$
The minimum Stackelberg regret over $x_2\in[0,\frac12)$ is attained by taking $\frac12 - x_2 = \Theta\big(\sqrt{f(T)/T}\big)$, at which $\mathrm{StackReg}_1(\{x_t=x,y_t\}_{t=1}^T) = \Theta\big(\sqrt{Tf(T)}\big)$, so in general the Stackelberg regret is $\Omega\big(\sqrt{Tf(T)}\big)$.
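A quick numerical check of the trade-off (98), under an example regret budget of our own choosing, confirms the $\sqrt{Tf(T)}$ scaling of the minimum over the grid:

import numpy as np

T = 100_000
f_T = np.sqrt(T)                          # example budget f(T) = sqrt(T)

x2 = np.linspace(0.0, 0.499, 4000)        # optimizer weights with x2 < 1/2
stack_reg = (1.5 - 3 * x2) * T + (2 * x2 / (1 - 2 * x2)) * f_T

print(stack_reg.min(), np.sqrt(T * f_T))  # both on the order sqrt(T * f(T))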
E. Algorithms and Proofs for Section 6

Before presenting the algorithms and proofs, we first prove an auxiliary lemma which suggests that, in an explore-then-commit style algorithm, any interaction history of length $o(T)$ before committing has no impact on the asymptotic learner regret. Formally:

Lemma E.1. Consider an optimizer action $x\in\Delta_m$. If the optimizer action sequence $\{x_t\}_{t=1}^T$ satisfies $x_t=x$ for all $t>\tau$, for some $\tau=O(f(T))$, then the interaction sequence $\{x_t,y_t\}_{t=1}^T$ is $f$-no-regret if and only if
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=\tau+1}^T) \le Cf(T) \qquad (100)$$
for some constant $C$; consequently, $\mathcal{A}_2$ is $f$-no-regret on $\{x_t\}_{t=1}^T$ if and only if
$$\mathbb{E}\Big[\sum_{t=\tau+1}^T x^T B\big(\bar y - y_t\big)\Big] \le Cf(T) \qquad (101)$$
for some constant $C$.

Proof. Let $\bar y\in\mathrm{BR}(B,x)$ be a best response to $x$. For the "if" direction, observe that:
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T) = \max_{y\in\Delta_n}\sum_{t=1}^T x_t^T By - \sum_{t=1}^T x_t^T By_t \le \Big(\max_{y\in\Delta_n}\sum_{t=1}^\tau x_t^T By - \sum_{t=1}^\tau x_t^T By_t\Big) + \Big(\max_{y\in\Delta_n}\sum_{t=\tau+1}^T x_t^T By - \sum_{t=\tau+1}^T x_t^T By_t\Big) = \mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^\tau) + \mathrm{Reg}_2(\{x_t,y_t\}_{t=\tau+1}^T) \le \tau\|B\|_{\max} + Cf(T) \le C'f(T)$$
for some constant $C'$, where we have used the fact that $\tau=O(f(T))$; therefore $\{x_t,y_t\}_{t=1}^T$ is $f$-no-regret. For the "only if" direction, notice that
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T) = \max_{y\in\Delta_n}\sum_{t=1}^T x_t^T By - \sum_{t=1}^T x_t^T By_t \ge \sum_{t=1}^T x_t^T B\big(\bar y - y_t\big) = \sum_{t=1}^\tau x_t^T B\big(\bar y - y_t\big) + x^T B\sum_{t=\tau+1}^T\big(\bar y - y_t\big) \ge -\tau\|B\|_{\max} + \mathrm{Reg}_2(\{x_t,y_t\}_{t=\tau+1}^T).$$
Hence, if we want $\mathrm{Reg}_2(\{x_t,y_t\}_{t=1}^T) \le Cf(T)$ for some constant $C$, we must have:
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=\tau+1}^T) \le Cf(T) + \tau\|B\|_{\max} \le C'f(T) \qquad (104)$$
for some constant $C'$, which completes the proof. The claim (101) follows directly by taking expectation over all $\{y_t\}_{t=1}^T$ trajectories generated by $\mathcal{A}_2$.

E.1. Detailed Version of Algorithm 1, Proof of Theorem 6.2, and Discussion

We present the pseudocode for Algorithm 1 below. Here $\texttt{test}(\cdot)$ is a procedure that compares the learner's responses at two consecutive time steps to determine the underlying best response to some optimizer action $x=(p,1-p)^T$, as shown in Algorithm 2, and $\texttt{Binary\_Search}(\cdot,\cdot,d)$ is a procedure that approximates the optimizer action at which the learner is indifferent, up to an error margin $d$, as shown in Algorithm 3. A runnable sketch of the whole procedure against a gradient-ascent learner follows the listings.

Algorithm 1: Playing Against Ascending Learner
  Input: Accuracy margin d.
  Run BR(0) <- test(0) and BR(1) <- test(1).
  if test(0) = test(1) then
    Let y' = test(0) = test(1). Compute x^- <- arg max_{x in Delta_m} x^T A y'.
  else
    p_L <- 1[BR(1) = (1,0)^T], p_R <- 1[BR(1) = (0,1)^T].
    p_L, p_R <- Binary_Search(p_L, p_R, d).
    if p_L < p_R then
      E_1^- <- [0, max{p_L - d, 0}]; E_2^- <- [min{1, p_R + d}, 1].
    else
      E_2^- <- [0, max{p_R - d, 0}]; E_1^- <- [min{1, p_L + d}, 1].
    end if
    Compute x^- through (9) and (10).
  end if
  Stick to x^- for all remaining time steps.

Algorithm 2: test
  Input: Action parameter p in [0,1].
  Use t to denote the current time step.
  Play x_t = (p, 1-p)^T and observe y_t = (q_t, 1-q_t)^T.
  Play an arbitrary x_{t+1} and observe y_{t+1} = (q_{t+1}, 1-q_{t+1})^T.
  if q_{t+1} > q_t then return (1,0)^T else return (0,1)^T.

Algorithm 3: Binary_Search
  Input: Interval endpoints p_L, p_R, accuracy margin d.
  if |p_L - p_R| <= d then return p_L, p_R end if
  BR <- test((p_L + p_R)/2).
  if BR = (1,0)^T then return Binary_Search((p_L + p_R)/2, p_R, d)
  else return Binary_Search(p_L, (p_L + p_R)/2, d) end if
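The following is a simplified sketch of Algorithms 1–3 in Python, under assumptions of our own choosing: the learner runs projected gradient ascent on the first coordinate $q$ of its strategy, and the best response is $(0,1)^T$ for small $p$ and $(1,0)^T$ for large $p$ (the paper's listing instead labels the endpoints by their best responses). All names are illustrative.

import numpy as np

def make_ogd_learner(B, eta=0.05):
    # Hypothetical ascent learner: gradient ascent on q, the first
    # coordinate of the learner's mixed strategy y = (q, 1 - q).
    state = {"q": 0.5}
    def respond(x):
        grad = x @ (B[:, 0] - B[:, 1])       # d/dq of x^T B y
        state["q"] = float(np.clip(state["q"] + eta * grad, 0.0, 1.0))
        return state["q"]
    return respond

def test(p, learner):
    # Algorithm 2: infer the best response to x = (p, 1-p) from two steps.
    # Caveat: if q is already clipped at a boundary the comparison ties.
    q1 = learner(np.array([p, 1 - p]))
    q2 = learner(np.array([p, 1 - p]))       # an arbitrary second action
    return (1, 0) if q2 > q1 else (0, 1)

def binary_search(p_lo, p_hi, d, learner):
    # Algorithm 3, under the ordering assumption stated above.
    while abs(p_hi - p_lo) > d:
        mid = (p_lo + p_hi) / 2
        if test(mid, learner) == (1, 0):
            p_hi = mid
        else:
            p_lo = mid
    return p_lo, p_hi

# Example: a type-1 canonical matrix with lambda = -1, indifference at 0.5.
B = np.array([[0.0, -1.0], [0.0, 1.0]])
print(binary_search(0.0, 1.0, 0.01, make_ogd_learner(B)))  # brackets 0.5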
We also provide the detailed version of Theorem 6.2 as follows:

Theorem E.2. Suppose $m=n=2$ and the payoff matrix $B$ does not contain identical columns. For some chosen parameter $d$, if either one facet is empty, or each facet has diameter at least $d$ and $P_2$ uses an ascent algorithm $\mathcal{A}_2$ that is $f$-no-regret, then Algorithm 1 with accuracy margin $d$ achieves a Stackelberg regret of at most
$$\Big(4 + \frac{C}{\epsilon_1}f(T)\Big)\|A\|_{\max} \qquad (105)$$
if the learner has a strictly dominated action, where $\epsilon_1 = \min_{x\in\Delta_m} x^T B(e_1-e_2)$, and at most
$$\Big(2\Big\lceil\log_2\frac1d\Big\rceil + 6 + 2Td + \frac{2Cf(T)}{\epsilon_2 d}\Big)\|A\|_{\max} \qquad (106)$$
otherwise, where $\epsilon_2$ is a constant that depends only on $B$. In either case, $C$ is a constant that depends only on the learner regret constant and payoff matrix. This indicates that the Stackelberg regret is at most $O\big(\frac{f(T)}{d} + dT + \log\frac1d\big)$ as long as $d=\Omega(\exp(-f(T)))$.

Proof. If one of the facets is empty, we would have $\texttt{test}(0)=\texttt{test}(1)$. In this case, despite the first 4 interaction steps used by the two $\texttt{test}$ calls, Algorithm 1 will output the Stackelberg equilibrium strategy for all subsequent actions. We now analyze the Stackelberg regret against an $f$-no-regret algorithm $\mathcal{A}_2$. W.l.o.g. suppose $e_2$ is the strictly dominated action. Since $e_2$ is strictly dominated, we have $x^T Be_1 - x^T Be_2 > 0$ for all $x\in\Delta_m$, so there exists a constant $\epsilon_1$ that depends only on $B$ (but not on $d$) such that:
$$x^T Be_1 - x^T Be_2 \ge \epsilon_1. \qquad (107)$$
Therefore, by Lemma E.1 the sequence should satisfy:
$$\mathrm{Reg}_2(\{x_t,y_t\}_{t=5}^T) = (x^-)^T B\sum_{t=5}^T\big(e_1-y_t\big) \le Cf(T) \qquad (108)$$
for some constant $C$; therefore, we have:
$$\sum_{t=5}^T\big\|e_1-y_t\big\|_1 \le \frac{C}{\epsilon_1}f(T). \qquad (109)$$
This leads to an upper bound on the Stackelberg regret:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) \le 4V^*(A,B) - \mathbb{E}\Big[\sum_{t=1}^4 x_t^T Ay_t\Big] + \mathbb{E}\Big[(x^-)^T A\sum_{t=5}^T\big(e_1-y_t\big)\Big] \le 4\|A\|_{\max} + \|A^T x^-\|_\infty\,\mathbb{E}\Big[\Big\|\sum_{t=5}^T\big(e_1-y_t\big)\Big\|_1\Big] \le \Big(4 + \frac{C}{\epsilon_1}f(T)\Big)\|A\|_{\max}.$$
If neither facet is empty, the binary search phase shrinks the interval length from 1 to $d$, which requires at most $\lceil\log_2\frac1d\rceil + 1$ calls of the Binary_Search function. Each Binary_Search call invokes the procedure $\texttt{test}$ at most once, which corresponds to at most two interaction time steps. Therefore, the total number of time steps in the binary search phase is at most $2\lceil\log_2\frac1d\rceil + 6$, leading to a Stackelberg regret contribution of at most $(2\lceil\log_2\frac1d\rceil+6)\|A\|_{\max}$, which is $O(f(T))$. Since each facet has length at least $d$, the pessimistic facets computed as $E_1^-$ and $E_2^-$ satisfy $d_H(E_i,E_i^-)\le 2d$ and $\inf_{x\in E_i^-,\,x'\in E_j}\|x-x'\|_1 \ge d$ for $i=1,2$. Therefore, combining Lemma E.1 and Theorem 5.5 we obtain:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) \le \Big(2\Big\lceil\log_2\frac1d\Big\rceil+6\Big)V^*(A,B) - \mathbb{E}\Big[\sum_{t=1}^{2\lceil\log_2 1/d\rceil+6} x_t^T Ay_t\Big] + \mathbb{E}\Big[(x^-)^T A\sum_{t=2\lceil\log_2 1/d\rceil+7}^{T}\big(e_{i^-}-y_t\big)\Big] \le \Big(2\Big\lceil\log_2\frac1d\Big\rceil+6\Big)\|A\|_{\max} + \Big(2Td + \frac{2Cf(T)}{\epsilon_2 d}\Big)\|A\|_{\max},$$
which gives (106).

Discussion on the matrix reconstruction view. Observe that by Definition 5.6, multiplying the matrix by a positive constant and shifting a row by an all-ones vector preserve the equivalence class, so there are only three possible canonical forms for the learner payoff matrix $B$:
$$B = \begin{pmatrix} 0 & \lambda \\ 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & \lambda \\ 0 & -1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & \lambda \\ 0 & 0 \end{pmatrix}.$$
In the third case (and in the instances where $\lambda=0$) Assumption 5.12 is not satisfied, indicating that such instances of $B$ are not learnable, so we focus on the first two cases with $\lambda\neq 0$. For the first case, where $B = \begin{pmatrix} 0 & \lambda \\ 0 & 1 \end{pmatrix}$, for $x_t=(p,1-p)^T$ we know that if $\lambda>0$ then $y=(0,1)^T$ is a dominant action, and when $\lambda\le 0$ we have:
$$\begin{cases} p < \frac{1}{1-\lambda}: & (0,1)^T \text{ is the best response;} \\ p = \frac{1}{1-\lambda}: & \text{all actions are equivalent;} \\ p > \frac{1}{1-\lambda}: & (1,0)^T \text{ is the best response.} \end{cases} \qquad (114)$$
Similarly, for the second case, where $B = \begin{pmatrix} 0 & \lambda \\ 0 & -1 \end{pmatrix}$, if $\lambda<0$ then $(1,0)^T$ is a dominant action, and when $\lambda\ge 0$ we have:
$$\begin{cases} p < \frac{1}{1+\lambda}: & (1,0)^T \text{ is the best response;} \\ p = \frac{1}{1+\lambda}: & \text{all actions are equivalent;} \\ p > \frac{1}{1+\lambda}: & (0,1)^T \text{ is the best response.} \end{cases} \qquad (116)$$
Now we have reduced the problem to identifying the matrix type and finding $\lambda$ (or equivalently, the indifference point $p^* = \frac{1}{1-\lambda}$ or $p^* = \frac{1}{1+\lambda}$, respectively, at which all actions are equivalent for the learner).

E.2. Detailed Version of Algorithm 4 and Proof of Theorem 6.3

We present the pseudocode for Algorithm 4 below, with the Explore_Row procedure shown in Algorithm 5.

Algorithm 4: Playing Against Mirror Descent
  Input: Per-row exploration step k.
  for i = 1, 2, ..., m do
    B_hat_i <- Explore_Row(e_i, k)
  end for
  Construct the estimate B_hat = (B_hat_1, B_hat_2, ..., B_hat_m)^T.
  Compute the equivalence class [B_hat] and commit to x^- through (82)
  with bar_f(T) = sqrt(T f(T)).

Algorithm 5: Explore_Row
  Input: Action x in Delta_m, exploration step k.
  for tau = 1, 2, ..., k+1 do
    Play x, observe the follower action y_tau in this time step.
  end for
  return B_hat_i = (1/k) sum_{tau=1}^{k} eta_tau * grad_y D(y_{tau+1} || y_tau)
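The following is a minimal sketch of the Explore_Row estimation against a stochastic mirror-ascent learner, assuming a KL regularizer and a constant step size of our own choosing (the log-space updates avoid numerical underflow as the learner's strategy approaches a vertex). The estimate recovers each row of $B$ only up to an additive multiple of the all-ones vector, i.e., up to the equivalence class.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
m, n, k, eta, R = 2, 2, 2000, 10.0, 0.05

B = np.array([[2.0, -2.0],
              [-3.0, 3.0]])               # example learner payoff matrix

def explore_row(i):
    # Fix x = e_i, watch k noisy mirror-ascent updates of the learner, and
    # average the implied Bregman gradients h_t = eta * log(y_{t+1} / y_t),
    # which equal B_i + noise - (constant) * 1 by the KKT conditions.
    log_y = np.full(n, -np.log(n))        # track log y_t for stability
    grads = []
    for _ in range(k):
        noisy = B.T @ np.eye(m)[i] + rng.normal(0.0, R, size=n)
        log_y_next = log_y + noisy / eta  # multiplicative-weights update
        log_y_next = log_y_next - logsumexp(log_y_next)
        grads.append(eta * (log_y_next - log_y))
        log_y = log_y_next
    return np.mean(grads, axis=0)

B_hat = np.vstack([explore_row(i) for i in range(m)])
center = lambda M: M - M.mean(axis=1, keepdims=True)   # mod out c * 1^T
print(np.abs(center(B_hat) - center(B)).max())         # small for large k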
The detailed version of Theorem 6.3 is presented as follows:

Theorem E.3. If the learner payoff matrix $B$ satisfies the assumptions needed in Theorem 5.17, $P_2$ follows the update rule (25), and each entry $\xi_{t,i}$ is i.i.d. $R$-sub-Gaussian, then with probability at least $1-\delta$, $P_1$ using Algorithm 4 with $k = (T/g(T))^2\cdot 2R^2\log(2mn/\delta)$ incurs Stackelberg regret of at most
$$\Big(m(1+k) + 4\Big(\frac{8ng(T)}{\max_{j,k}\|B(e_j-e_k)\|_\infty} + \sqrt{Tf(T)}\Big)\mathrm{Sen}\big((B_{-i^*})^T\big) + \frac{C\sqrt{Tf(T)}}{\max_{j,k}\|B(e_j-e_k)\|_\infty}\Big)\|A\|_{\max}, \qquad (117)$$
where $C$ is a constant that depends only on the learner regret constant and payoff matrix.

Proof. The Lagrangian of the problem (25) can be written as:
$$L(y,\lambda,\mu) = \eta_t\,D(y\,\|\,y_t) - \big(x_t^T B + \xi_t^T\big)y - \lambda^T y + \mu\Big(\sum_i y_i - 1\Big). \qquad (118)$$
Since $y_{t+1}$ is the optimal solution to (25), the KKT conditions yield:
$$\eta_t\,\nabla_y D(y_{t+1}\|y_t) - B^T x_t - \lambda + \mu 1 - \xi_t = 0; \qquad y_{t+1,i}\,\lambda_i = 0,\ \forall i\in[n]. \qquad (119)$$
Since the Bregman-divergence regularizer satisfies $\|\nabla_y D(y_{t+1}\|y_t)\|\to\infty$ whenever some coordinate $y_{t+1,i}\to 0$, we can deduce that $\lambda=0$, and the condition becomes:
$$\eta_t\,\nabla_y D(y_{t+1}\|y_t) - B^T x_t + \mu 1 - \xi_t = 0. \qquad (120)$$
Letting $h_t := \eta_t\,\nabla_y D(y_{t+1}\|y_t)$, we know that
$$B^T x_t = h_t + \mu_t 1 - \xi_t, \qquad (121)$$
where we write $\mu_t$ instead of $\mu$ to indicate the different Lagrange multipliers at different time steps. If we set $x_t = e_i$, we obtain:
$$B_i = h_t + \mu_t 1 - \xi_t, \qquad (122)$$
where $B_i^T$ is the $i$-th row of $B$. So, if we fix $x_t = e_i$ for $t=1$ to $k$, we have the following estimate of $B_i$:
$$\hat B_i = \frac1k\sum_{t=1}^k h_t \qquad (123)$$
with error term
$$\hat B_i - B_i = -\frac1k\sum_{t=1}^k\mu_t 1 + \frac1k\sum_{t=1}^k\xi_t. \qquad (124)$$
The first term does not affect the equivalence class of $B$, and the second term diminishes over time. Hence we can obtain an estimate with arbitrarily small error using uniform exploration. Assuming that each entry $\xi_{t,i}$ is i.i.d. $R$-sub-Gaussian, we obtain through the Chernoff bound:
$$\Pr\Big(\Big|\frac1k\sum_{t=1}^k\xi_{t,i}\Big| \ge \epsilon\Big) \le 2\exp\Big(-\frac{k\epsilon^2}{2R^2}\Big). \qquad (125)$$
If we take $k = \frac{2R^2\log(2mn/\delta)}{\epsilon^2}$, it holds with probability at least $1-\delta/m$ that
$$\Big\|\hat B_i - B_i + \frac1k\sum_{t=1}^k\mu_t 1\Big\|_\infty \le \epsilon. \qquad (126)$$
Combining this with Theorem 5.17 and taking $\epsilon = \Theta(g(T)/T)$, the number of steps needed to explore one row of $B$ is:
$$k = \Big(\frac{T}{g(T)}\Big)^2\cdot 2R^2\log\frac{2mn}{\delta}. \qquad (127)$$
When $g(T)=\Omega(\sqrt{T})$ we would have $k=o(T)$, indicating that the exploration cost is also sublinear in $T$. The exploration phase consists of at most $m(k+1)$ steps and achieves an estimate $\hat B$ in the equivalence class $[\hat B]$. We first give an upper bound on $d_i(B,\hat B)$ for all $i$. Notice that:
$$\Big|\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty - \max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty\Big| \le \max_{j_1,j_2}\big\|(B-\hat B)(e_{j_1}-e_{j_2})\big\|_\infty = \max_{j_1,j_2}\max_i\big|\big(B_i^T-\hat B_i^T\big)(e_{j_1}-e_{j_2})\big| = \max_{j_1,j_2}\max_i\big|\bar\xi^T(e_{j_1}-e_{j_2})\big| \le 2\|\bar\xi\|_\infty, \qquad (128)$$
where $\bar\xi = \frac1k\sum_t\xi_t$ for the $\xi_t$ in the corresponding Explore_Row call (note that the $\mu_t 1$ components cancel since $1^T(e_{j_1}-e_{j_2})=0$). To simplify notation, in this proof we mildly overload the notation to let:
$$B_{-i} = \big(B_{:,1}-B_{:,i},\ B_{:,2}-B_{:,i},\ \dots,\ B_{:,n}-B_{:,i}\big), \qquad \hat B_{-i} = \big(\hat B_{:,1}-\hat B_{:,i},\ \dots,\ \hat B_{:,n}-\hat B_{:,i}\big).$$
Then
$$d_i(B,\hat B) = \Big\|\frac{B_{-i}}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty} - \frac{\hat B_{-i}}{\max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty}\Big\|_{\max} = \Big\|\frac{B_{-i}\big(\max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty - \max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty\big) + \max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty\big(B_{-i}-\hat B_{-i}\big)}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty\cdot\max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty}\Big\|_{\max} \le \frac{2\|\bar\xi\|_\infty\,\|B_{-i}\|_{\max}}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty\cdot\max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty} + \frac{\|B_{-i}-\hat B_{-i}\|_{\max}}{\max_{j_1,j_2}\|\hat B_{:,j_1}-\hat B_{:,j_2}\|_\infty},$$
where $\|B_{-i}-\hat B_{-i}\|_{\max} \le 2\|\bar\xi\|_\infty$ by the same argument as in (128).
Since our exploration length is $k = (T/g(T))^2\cdot 2R^2\log\frac{2mn}{\delta}$, we obtain through a union bound that with probability at least $1-\delta$, across all Explore_Row function calls, $\|\bar\xi\|_\infty \le \frac{g(T)}{T}$, which goes to zero as $T\to\infty$. Hence for large enough $T$ we have:
$$d_i(B,\hat B) \le \frac{4\|\bar\xi\|_\infty\,\|B_{-i}\|_{\max}}{\big(\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty\big)^2} + \frac{2\|B_{-i}-\hat B_{-i}\|_{\max}}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty} \le \frac{4n\|\bar\xi\|_\infty}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty} + \frac{4n\|\bar\xi\|_\infty}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty} = \frac{8n\|\bar\xi\|_\infty}{\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty} \le \frac{8n\,g(T)}{T\,\max_{j_1,j_2}\|B_{:,j_1}-B_{:,j_2}\|_\infty}. \qquad (130)$$
Therefore, combining Lemma E.1 and Theorem 5.17 we obtain:
$$\mathrm{StackReg}_1(\mathcal{A}_1,\mathcal{A}_2) \le m(k+1)V^*(A,B) + \Big(4\Big(\frac{8ng(T)}{\max_{j,k}\|B(e_j-e_k)\|_\infty} + \sqrt{Tf(T)}\Big)\mathrm{Sen}\big((B_{-i^*})^T\big) + \frac{C\sqrt{Tf(T)}}{\max_{j,k}\|B(e_j-e_k)\|_\infty}\Big)\|A\|_{\max} \le \Big(m(k+1) + 4\Big(\frac{8ng(T)}{\max_{j,k}\|B(e_j-e_k)\|_\infty} + \sqrt{Tf(T)}\Big)\mathrm{Sen}\big((B_{-i^*})^T\big) + \frac{C\sqrt{Tf(T)}}{\max_{j,k}\|B(e_j-e_k)\|_\infty}\Big)\|A\|_{\max} = O\big(\sqrt{Tf(T)} + g(T) + k\big), \qquad (131)$$
which completes the proof.

F. Numerical Experiments

F.1. Empirical Simulations for Section 6.1

For all experiments in this section, we assume the learner is using online gradient descent (OGD) with a step-size schedule parameterized by an initial value $\eta_0$. For the purpose of properly displaying the interaction and learning process, we choose a different $\eta_0$ for each game instance. For each game instance, we compare the performance and learning dynamics when the optimizer's algorithm is either OGD or binary-search explore-then-commit (BS, Algorithm 1). For Binary_Search, we set the accuracy margin to $d=0.01$. For each game instance, we plot both the payoff and the strategy (indicated by its 0-th entry) of each player at different time steps. We take the optimizer to be the row player and the learner to be the column player.

Matching pennies. We first test repeated matching pennies, where the payoff matrices are given by:
$$A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}; \qquad B = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}.$$
Both the unique Nash equilibrium and the Stackelberg equilibria have
$$x^* = y^* = (\tfrac12,\tfrac12)^T. \qquad (134)$$
We obtain the curves shown in Figure 3.

[Figure 3: Learning dynamics (payoff and strategy histories over rounds) for optimizer algorithms OGD and BS in matching pennies.]

Here the blue curves represent the learning dynamics when the optimizer uses binary search (Algorithm 1), with the exploration phase shaded in red. The orange curves represent the dynamics when the optimizer uses OGD. For both optimizer algorithms, the solid lines are curves for the optimizer and the dashed lines are curves for the learner. We plot the optimizer's Stackelberg strategy and payoff as black dotted lines. We can see that when both players use OGD, the trajectory keeps oscillating and does not converge to the Nash equilibrium. In comparison, when the optimizer uses BS, it quickly learns the true underlying Stackelberg equilibrium (which is also the Nash equilibrium) and commits to it, yielding stable learning dynamics.

Constructed game instance 1. Below we show that BS indeed yields a smaller Stackelberg regret than OGD. We construct the following game instance:
$$A = \begin{pmatrix} 5 & 0 \\ 0 & 3 \end{pmatrix}; \qquad B = \begin{pmatrix} -2 & 2 \\ 3 & -3 \end{pmatrix}.$$
The unique Stackelberg equilibrium action for the optimizer is
$$x^* = (\tfrac35,\tfrac25)^T \qquad (136)$$
with Stackelberg value 3. We obtain the curves shown in Figure 4.
[Figure 4: Learning dynamics (payoff and strategy histories over rounds) for optimizer algorithms OGD and BS on game instance 1, with average optimizer payoffs overlaid.]

In Figure 4 all curves are drawn with the same line style as in Figure 3; in addition, we use blue and orange dotted lines to plot the average optimizer payoff for BS and OGD, respectively. We notice again that when the optimizer uses OGD, the algorithm fails to converge. In addition, after the optimizer commits to the pessimistic Stackelberg solution, the learner slowly converges to the best response induced by the Stackelberg equilibrium and steers the optimizer payoff close to the Stackelberg value, which is higher on average than the payoff under OGD.

Constructed game instance 2. One may argue that OGD fails because it does not converge; however, this is not the essential issue. Below we construct a game instance that has a unique Nash equilibrium to which OGD converges, and a unique Stackelberg equilibrium with higher optimizer utility than that of Nash. The game instance is as follows:
$$A = \begin{pmatrix} 2 & 0 \\ 3 & 1 \end{pmatrix}; \qquad B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.$$
The unique Nash equilibrium is
$$x^* = y^* = (0,1)^T, \qquad (138)$$
while the unique Stackelberg equilibrium is
$$x^* = (\tfrac23,\tfrac13)^T, \qquad y^* = (1,0)^T. \qquad (139)$$
The optimizer payoff at Nash is 1, while its Stackelberg value is 2. The simulation result is shown in Figure 5.

[Figure 5: Learning dynamics (payoff and strategy histories over rounds) for optimizer algorithms OGD and BS on game instance 2.]

The plot above shows that even when both dynamics converge, BS and OGD converge to completely different equilibria, with BS always yielding a higher payoff and therefore a lower Stackelberg regret.
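The following is a minimal sketch of the simulation loop used in this subsection, assuming standard matching-pennies signs, a projected-OGD learner, and step sizes of our own choosing; the binary-search exploration phase is omitted, and the BS optimizer simply commits to a fixed strategy.

import numpy as np

def proj_simplex_2(v):
    # Closed-form projection of a 2-vector onto the probability simplex.
    t = np.clip((v[0] - v[1] + 1) / 2, 0.0, 1.0)
    return np.array([t, 1 - t])

def run(A, B, T=200, eta0=0.5, optimizer="commit", x_commit=None):
    # The learner always runs projected OGD on its payoff x^T B y; the
    # optimizer either runs OGD as well or commits to a fixed strategy.
    x = np.array([0.5, 0.5]); y = np.array([0.5, 0.5])
    payoffs = []
    for t in range(1, T + 1):
        eta = eta0 / np.sqrt(t)
        payoffs.append(x @ A @ y)
        y_new = proj_simplex_2(y + eta * (B.T @ x))   # learner ascent step
        if optimizer == "ogd":
            x = proj_simplex_2(x + eta * (A @ y))     # optimizer ascent step
        else:
            x = x_commit                              # commit phase of BS
        y = y_new
    return np.array(payoffs)

A = np.array([[1.0, -1.0], [-1.0, 1.0]]); B = -A      # matching pennies
avg_bs = run(A, B, optimizer="commit", x_commit=np.array([0.5, 0.5])).mean()
avg_ogd = run(A, B, optimizer="ogd").mean()
print(avg_bs, avg_ogd)     # committing to x* stabilizes the average payoff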
F.2. Empirical Simulations for Section 6.2

In this section we show the effectiveness of Algorithm 4 and illustrate the necessity of pessimism. Here we assume the optimizer is using Algorithm 4, but with different pessimism levels $d\in\{0.01, 0.02, 0.05\}$. We assume that the learner is using stochastic mirror descent with the KL regularizer. For each pure strategy of the optimizer, we set the number of exploration steps to $k=50$. We consider the following game instance:
$$A = \begin{pmatrix} 0 & 1 \\ 5 & 0 \end{pmatrix}; \qquad B = \begin{pmatrix} 2 & -2 \\ -3 & 3 \end{pmatrix}.$$
The unique Stackelberg equilibrium of this game is
$$x^* = (\tfrac35,\tfrac25)^T, \qquad y^* = (1,0)^T,$$
with optimizer payoff 2. We plot the payoffs and strategies of both players at each time step for the different values of $d$ in Figure 6.

[Figure 6: Learning dynamics (payoff and strategy histories over rounds) of payoff estimation for different pessimism levels $d$.]

In Figure 6, different line colors indicate the learning dynamics for different pessimism levels. The solid lines are optimizer curves, while the dashed lines are learner curves. The shaded region indicates the exploration phase, and the black dotted lines represent the optimizer's Stackelberg payoff and strategy. We can see that for larger $d$, the optimizer is more pessimistic and chooses an action farther away from the Stackelberg equilibrium. This leads to a lower optimizer payoff after the learner converges to the unique best response induced by that action. However, for less pessimistic choices of $d$, the committed optimizer strategy $x^-$ is very close to the Stackelberg equilibrium, where the learner is indifferent among all mixed strategies; the gradients of the learner payoff with respect to its strategy are then extremely small, so the learner takes much longer to converge. Until it converges, the optimizer payoff from committing to $x^-$ is smaller, illustrating the value of being pessimistic.

G. Proof of Auxiliary Lemmas

G.1. Proof of Lemma C.2

The dual problem of (71) can be written as:
$$\text{minimize } b^T y \quad \text{subject to } A^T y = c,\ y\ge 0,$$
and that of (72) is:
$$\text{minimize } (b+\delta)^T y \quad \text{subject to } A^T y = c,\ y\ge 0.$$
We can now follow the approach of Sections 5.4 and 5.5 in (Bertsimas & Tsitsiklis, 1997) to obtain the result. Since both (71) and (72) are feasible and have bounded constraint sets, they have finite optimal values and corresponding optimal solutions, and so do their dual problems. Therefore, for each dual problem there exists an optimal solution that is a basic feasible solution (and the basic feasible solutions are the same for both dual problems); denote them by $y_1, y_2, \dots, y_I$. That is,
$$V = \min_{i=1,2,\dots,I} b^T y_i \qquad (144)$$
and
$$V(\delta) = \min_{i=1,2,\dots,I} (b+\delta)^T y_i. \qquad (145)$$
Therefore we have:
$$V(\delta) - V = \min_{i=1,2,\dots,I}(b+\delta)^T y_i - \min_{i=1,2,\dots,I} b^T y_i \le \max_{i=1,2,\dots,I}\delta^T y_i. \qquad (146)$$
Since for each $i=1,2,\dots,I$ the basic feasible solution is captured by a linearly independent set $I$ of $n$ columns of $A^T$ (which is a linearly independent set of rows of $A$), and $y_i$ is specified by $y_i = (A_I^T)^{-1}c$ on its entries within $I$ and $0$ elsewhere, we can rewrite $\delta^T y_i$ as
$$\delta^T y_i = y_i^T\delta = c^T A_I^{-1}\delta_I, \qquad (147)$$
which completes the proof.

G.2. Proof of Lemma C.3

Consider an index set $I$ containing linearly independent rows of $M$ with $|I|=m$. Each such $m\times m$ submatrix contains $k$ rows from $(\hat B_{-i})^T$ and $m-k$ rows from $I_m$, for some $0<k\le m$. Thus $M_I$ has the form
$$M_I = \begin{pmatrix} \big((\hat B_{-i})^T\big)_{I_1,:} \\ \vdots \\ \big((\hat B_{-i})^T\big)_{I_k,:} \\ e_{I_{k+1}-(n-1)}^T \\ \vdots \\ e_{I_m-(n-1)}^T \end{pmatrix}, \qquad (148)$$
where $I_j$ denotes the $j$-th index in $I$ and $M_I$ contains its first $k$ rows selected from $(\hat B_{-i})^T$. We now show some properties of $M_I^{-1}$. First, for $i\in\{k+1,k+2,\dots,m\}$ and $j\in[m]$, it holds that
$$\mathbb{1}[i=j] = \big(M_I M_I^{-1}\big)_{ij} \overset{(i)}{=} (M_I)_{i,:}(M_I^{-1})_{:,j} \overset{(ii)}{=} e_{I_i-(n-1)}^T(M_I^{-1})_{:,j} = (M_I^{-1})_{I_i-(n-1),\,j}, \qquad (149)$$
where $(i)$ comes from the definition of $M_I^{-1}$ and $(ii)$ holds due to (148); we thus have that
$$(M_I^{-1})_{I_i-(n-1),\,:} = e_i^T. \qquad (150)$$
Second, consider the product
$$\sum_{j=1}^k (M_I^{-1})_{:,j}\,\zeta_j \qquad (151)$$
for arbitrary $\zeta\in\mathbb{R}^k$, whose $i$-th entry can be written as:
$$\Big(\sum_{j=1}^k (M_I^{-1})_{:,j}\,\zeta_j\Big)_i = \sum_{j=1}^k (M_I^{-1})_{ij}\,\zeta_j. \qquad (152)$$
Let $E = \{I_{k+1}-(n-1),\ I_{k+2}-(n-1),\ \dots,\ I_m-(n-1)\}$; we can see from (150) that if $i\in E$,
$$\sum_{j=1}^k (M_I^{-1})_{ij}\,\zeta_j = 0. \qquad (153)$$
Also, consider the product $(M_I^{-1})_{i,:}(M_I)_{:,j}$ for $i,j\in[m]\setminus E$; we have:
$$\mathbb{1}[i=j] = (M_I^{-1})_{i,:}(M_I)_{:,j} = \sum_{l=1}^k (M_I^{-1})_{i,l}(M_I)_{l,j} + \sum_{l=k+1}^m (M_I^{-1})_{i,l}(M_I)_{l,j} = \sum_{l=1}^k (M_I^{-1})_{i,l}(M_I)_{l,j}, \qquad (154)$$
where $\sum_{l=k+1}^m (M_I^{-1})_{i,l}(M_I)_{l,j} = 0$ because
$$(M_I)_{l,j} = 0, \quad \forall\, l\in\{k+1,\dots,m\},\ j\in[m]\setminus E, \qquad (155)$$
as shown in (148).
Similarly, for the product $(M_I)_{i,:}(M_I^{-1})_{:,j}$ for $i,j\in[k]$, we have:
$$\mathbb{1}[i=j] = (M_I)_{i,:}(M_I^{-1})_{:,j} = \sum_{l\in[m]\setminus E}(M_I)_{i,l}(M_I^{-1})_{l,j} + \sum_{l\in E}(M_I)_{i,l}(M_I^{-1})_{l,j} = \sum_{l\in[m]\setminus E}(M_I)_{i,l}(M_I^{-1})_{l,j}, \qquad (156)$$
where $\sum_{l\in E}(M_I)_{i,l}(M_I^{-1})_{l,j} = 0$ due to
$$(M_I^{-1})_{l,j} = 0, \quad \forall\, l\in E,\ j\in[k], \qquad (157)$$
as indicated by (150). The two equations (154) and (156) above show that if we consider the submatrix $(M_I)_{[k],[m]\setminus E}$, consisting of the first $k$ rows of $M_I$ and the columns in $[m]\setminus E$, its inverse is equal to the corresponding entries of the inverse of $M_I$; more specifically,
$$(M_I^{-1})_{[m]\setminus E,\,[k]} = \big((M_I)_{[k],\,[m]\setminus E}\big)^{-1}. \qquad (158)$$
Now, since the perturbation $\delta_I$ is supported on the first $k$ coordinates, we have:
$$\big\|M_I^{-1}\delta_I\big\|_1 = \big\|(M_I^{-1})_{:,[k]}(\delta_I)_{[k]}\big\|_1 \overset{(i)}{=} \big\|(M_I^{-1})_{[m]\setminus E,\,[k]}(\delta_I)_{[k]}\big\|_1 = \big\|\big((M_I)_{[k],\,[m]\setminus E}\big)^{-1}(\delta_I)_{[k]}\big\|_1, \qquad (159)$$
where again $(i)$ holds due to (157). Since the selection of $I$ is arbitrary, $E$ can also be constructed arbitrarily, so we obtain the upper bound
$$\max_{I\in S}\big\|M_I^{-1}\delta_I\big\|_1 \le 2d\,\max_{P,Q}\big\|\big((\hat B_{-i})^T_{P,Q}\big)^{-1}\big\| = 2d\,\mathrm{Sen}\big((\hat B_{-i})^T\big), \qquad (160)$$
where the maximization is over all $P,Q$ satisfying:
$$P\subseteq[n-1],\ Q\subseteq[m],\ |P|=|Q|,\ (\hat B_{-i})^T_{P,Q}\ \text{invertible}. \qquad (161)$$
Also, because the entries of the right-hand side corresponding to the $1_m^T$ rows are not perturbed, we can multiply those constraints by an arbitrary nonzero constant and still obtain the same result, which completes the proof.

G.3. Proof of Lemma C.4

Given that $\|B^{-1}\|\,\|\delta B\| < 1$, we can expand $(I+B^{-1}\delta B)^{-1}$ as a convergent Neumann series:
$$(I+B^{-1}\delta B)^{-1} = \sum_{k=0}^\infty\big(-B^{-1}\delta B\big)^k, \qquad (162)$$
so that $(I+B^{-1}\delta B)^{-1}$ is well-defined. As a result,
$$(B+\delta B)^{-1} = \big(B(I+B^{-1}\delta B)\big)^{-1} = (I+B^{-1}\delta B)^{-1}B^{-1} \qquad (163)$$
is also well-defined, and furthermore
$$\big\|(B+\delta B)^{-1}\big\| = \big\|(I+B^{-1}\delta B)^{-1}B^{-1}\big\| \le \big\|(I+B^{-1}\delta B)^{-1}\big\|\,\|B^{-1}\| \le \|B^{-1}\|\sum_{k=0}^\infty\big(\|B^{-1}\|\,\|\delta B\|\big)^k = \frac{\|B^{-1}\|}{1-\|B^{-1}\|\,\|\delta B\|},$$
which completes the proof.
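A quick numerical check of the bound (75) with the spectral norm (which is sub-multiplicative), on randomly generated matrices of our own choosing:

import numpy as np

rng = np.random.default_rng(3)

B = rng.standard_normal((4, 4)) + 6 * np.eye(4)   # well-conditioned example
dB = 0.02 * rng.standard_normal((4, 4))           # small perturbation

norm = lambda M: np.linalg.norm(M, 2)
assert norm(np.linalg.inv(B)) * norm(dB) < 1      # hypothesis of the lemma

lhs = norm(np.linalg.inv(B + dB))
rhs = norm(np.linalg.inv(B)) / (1 - norm(np.linalg.inv(B)) * norm(dB))
print(lhs, rhs)                                   # lhs <= rhs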