# Learning in Non-Cooperative Configurable Markov Decision Processes

Giorgia Ramponi (ETH AI Center, Zurich, Switzerland), gramponi@ethz.ch
Alberto Maria Metelli (Politecnico di Milano, Milan, Italy), albertomaria.metelli@polimi.it
Alessandro Concetti (Politecnico di Milano, Milan, Italy), alessandro.concetti@mail.polimi.it
Marcello Restelli (Politecnico di Milano, Milan, Italy), marcello.restelli@polimi.it

35th Conference on Neural Information Processing Systems (NeurIPS 2021). Work done while Giorgia Ramponi was at Politecnico di Milano.

Abstract

The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have identical reward functions. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a framework that allows modeling two (possibly different) reward functions for the configurator and the agent. Then, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms to minimize the configurator's expected regret, which exploit the problem's structure, depending on the agent's feedback. While a naïve application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically validate the performance of our algorithms in simulated domains.

1 Introduction

The standard Reinforcement Learning [RL, 40] framework involves an agent whose objective is to maximize the reward collected during its interaction with the environment. However, there exist real-world scenarios in which the agent itself or an external supervisor (configurator) can partially modify the environment. In a car racing problem, for example, it is possible to modify the car setup to better suit the driver's needs. Recently, Configurable Markov Decision Processes [Conf-MDPs, 29] were introduced to model these scenarios and exploit the configuration opportunities. Solving a Conf-MDP consists of simultaneously optimizing a set of environmental parameters and the agent's policy to reach the maximum expected return. In many scenarios, however, the configurator does not know the agent's reward, and their intentions are different, leading to new forms of interaction between the two actors. For instance, imagine we are the owner of a supermarket, and we have to arrange the products on the shelves. Our objective is to increase the company's final profit; on the other hand, a customer aims to spend the shortest time possible inside the supermarket and buy the indispensable products only. Since we do not know the customer's reward function, the only possibility is to try different dispositions and observe the customers' reactions. What if we knew what buyers are most interested in? In this case, we can strategically decide how to position other products close to the popular ones to induce the customer into a behavior that is more profitable for the supermarket owner. In this paper, we model these scenarios by introducing the Non-Cooperative Configurable Markov Decision Process (NConf-MDP). This novel framework handles the possibility of having different reward functions for the agent and the configurator.
While a Conf-MDP assumes that the configurator acts to help the agent optimize its expected reward, an NConf-MDP, instead, allows modeling a wider set of situations, including cases in which the agent and the configurator display a non-cooperative behavior. Obviously, this setting cannot be addressed with a straightforward application of the algorithms designed for cooperative Conf-MDPs. In fact, if the configurator and the agent separately optimize different objectives, they might not converge to an equilibrium strategy [52, 12, 51, 13]. In this novel setting, we consider an online learning problem, where the configurator has to select a configuration, within a finite set of possible configurations, in order to maximize its own return. This framework can be seen as a leader-follower game, in which the follower (the agent) is selfish and optimizes its own reward function, and the leader (the configurator) has to decide the best configuration, based on its own reward. Clearly, to adapt its decisions, the configurator has to receive some form of feedback related to the agent's behavior. We analyze two settings based on whether the configurator observes just the agent's actions or, in addition, a noisy version of the agent's reward.

Contributions. In this paper, we extend the Configurable Markov Decision Process setting to deal with situations where the configurator and the agent have different reward functions. We call this new framework the Non-Cooperative Configurable Markov Decision Process (NConf-MDP, Section 3). Then, we formalize the problem of finding the best environment configuration, according to the configurator's reward, as a leader-follower game, in which the agent (follower) reacts to each presented configuration with its best response policy (Section 4). We provide a first algorithm, Action-feedback Optimistic Configuration Learning (AfOCL), to tackle this problem under the assumption that the configurator observes the agent's actions only (Section 5.1). We show that AfOCL achieves finite expected regret, scaling linearly with the number of admissible configurations. As far as we know, this represents the first problem-dependent regret analysis in a multi-agent RL setting. Then, we introduce a second algorithm, Reward-feedback Optimistic Configuration Learning (RfOCL), that assumes the availability of a noisy version of the agent's reward, in addition to the agent's actions (Section 5.2). We prove that, under suitable conditions, RfOCL further exploits the structure underlying the decision process, removing the dependence on the number of configurations. The analysis uses novel ideas, combining the suboptimality gaps of the configurator with those of the agent. Finally, we provide an experimental evaluation on benchmark domains, inspired by scenarios that motivate the NConf-MDP framework (Section 7). The proofs of the results presented in the paper are reported in Appendix B. A preliminary version of this work was presented at the AAAI-21 Workshop on Reinforcement Learning in Games [36].

2 Preliminaries

A finite-horizon Markov Decision Process [MDP, 35] is a tuple $\mathcal{M} = (S, A, p, \mu, r, H)$, where $S$ is a finite state space ($S = |S|$), $A$ is a finite action space ($A = |A|$), $p : S \times A \times S \to [0, 1]$ is the transition model, which defines the density $p(s'|s, a)$ of state $s' \in S$ when taking action $a \in A$ in state $s \in S$, $\mu : S \to [0, 1]$ is the initial state distribution, $r : S \to [0, 1]$ is the reward function, and $H \in \mathbb{N}_{\geq 1}$ is the horizon.
A stochastic decision rule $\pi_h : S \times A \to [0, 1]$ with $h \in [H]$ prescribes the probability $\pi_h(a|s)$ of playing action $a \in A$ in state $s \in S$. A stochastic policy $\pi = (\pi_1, \ldots, \pi_H) \in \Pi^H$ is a sequence of decision rules, where $\Pi^H$ is the set of stochastic policies over the horizon $H$. A finite-horizon Configurable Markov Decision Process [Conf-MDP, 29] is defined as $\mathcal{CM} = (S, A, \mathcal{P}, \mu, r, H)$ and extends the MDP by considering a configuration space $\mathcal{P}$ instead of a single transition model $p$. The Q-value of a policy $\pi \in \Pi^H$ and configuration $p \in \mathcal{P}$ is the expected sum of the rewards starting from $(s, a) \in S \times A$ at step $h \in [H]$:

$$Q^{\pi,p}_h(s, a) = r(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right],$$

denoting with $\mathbb{E}_{s_{h'} \sim p, \pi}$ the expectation w.r.t. the state distribution induced by $\pi$ and $p$ at step $h'$. The value function is given by $V^{\pi,p}_h(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_h(s, a)]$ and the expected return is defined as $V^{\pi,p} = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_1(s)]$. In a Conf-MDP the goal consists in finding a policy $\pi^*$ together with an environment configuration $p^*$ so as to maximize the expected return, i.e., $(\pi^*, p^*) \in \arg\max_{\pi \in \Pi^H,\, p \in \mathcal{P}} V^{\pi,p}$.

3 Non-Cooperative Conf-MDPs

The definition of Conf-MDP allows modeling scenarios in which the agent and the configurator share the same objective, encoded in a single reward function $r$. In this section, we introduce an extension of this framework to account for the presence of a configurator having interests that might differ from those of the agent.

Definition 3.1. A Non-Cooperative Configurable Markov Decision Process (NConf-MDP) is defined by a tuple $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$, where $(S, A, \mathcal{P}, \mu, H)$ is a Conf-MDP without reward and $r_c, r_o : S \to [0, 1]$ are the configurator and agent (opponent) reward functions, respectively.

Given a policy $\pi \in \Pi^H$ and a configuration $p \in \mathcal{P}$, for every $(s, a) \in S \times A$ and $h \in [H]$ we define the configurator and agent Q-values as:

$$Q^{\pi,p}_{c,h}(s, a) = r_c(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r_c(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right],$$
$$Q^{\pi,p}_{o,h}(s, a) = r_o(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r_o(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right].$$

We denote with $V^{\pi,p}_{c,h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_{c,h}(s, a)]$ and $V^{\pi,p}_{o,h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_{o,h}(s, a)]$ the value functions, and with $V^{\pi,p}_c = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_{c,1}(s)]$ and $V^{\pi,p}_o = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_{o,1}(s)]$ the expected returns for the configurator and the agent, respectively.
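To make these quantities concrete, the following is a minimal Python sketch of finite-horizon policy evaluation by backward induction for a fixed policy and configuration, yielding the two expected returns defined above. The tabular array layout and the function and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

def evaluate(policy, p, r, H):
    """Finite-horizon policy evaluation by backward induction.

    policy: (H, S, A) array, policy[h, s, a] = probability of action a in state s
            at (0-indexed) step h
    p:      (S, A, S) array, p[s, a, s'] = transition probability
    r:      (S,) array, state-only reward as in the paper
    Returns V of shape (H, S) with V[h, s] the value at (0-indexed) step h.
    """
    _, S, _ = policy.shape
    V = np.zeros((H, S))
    V[H - 1] = r                              # last step collects r(s) only
    for h in range(H - 2, -1, -1):
        Q = r[:, None] + p @ V[h + 1]         # Q_h(s, a) = r(s) + sum_{s'} p(s'|s,a) V_{h+1}(s')
        V[h] = (policy[h] * Q).sum(axis=1)    # V_h(s) = E_{a ~ pi_h(.|s)}[Q_h(s, a)]
    return V

def expected_returns(policy, p, mu, r_c, r_o):
    """Configurator and agent expected returns for a given (policy, configuration) pair."""
    H = policy.shape[0]
    V_c = evaluate(policy, p, r_c, H)
    V_o = evaluate(policy, p, r_o, H)
    return float(mu @ V_c[0]), float(mu @ V_o[0])
```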
4 Problem Formulation

While for classical Conf-MDPs [29, 27] a notion of optimality is straightforward, as agent and configurator share the same objective, in an NConf-MDP they can display possibly conflicting interests. We assume a sequential interaction between the configurator and the agent that resembles the leader-follower protocol [10, 6, 34, 38]. First, the configurator (leader) selects an environment configuration $p \in \mathcal{P}$, where $\mathcal{P}$ is a finite set made of $M$ stochastic transition models, $\mathcal{P} = \{p_1, \ldots, p_M\}$. Then, the agent (follower) plays a policy chosen by a best response function $f : \mathcal{P} \to \Pi^H$, such that:

$$f(p) \in \operatorname*{arg\,max}_{\pi \in \Pi^H} V^{\pi,p}_o.$$

The solution concept that we use is the well-known Stackelberg equilibrium [43, 15, 30, 32, 19]. It captures the outcome in which the configurator's transition model is optimal, under the assumption that the agent will always respond optimally [26]. However, this definition includes the possibility of ties, i.e., situations in which multiple agent optimal policies exist, with possibly different performance for the configurator. Therefore, it is necessary to employ a tie-breaking rule, i.e., a criterion to select one agent best response. Different tie-breaking rules lead to different Stackelberg equilibria, and the two prevailing solution concepts in the literature are the Strong Stackelberg Equilibrium (SSE) and the Weak Stackelberg Equilibrium (WSE). A policy-transition model pair $(\pi^*, p^*)$ forms an SSE if ties are broken in favor of the configurator:

$$p^* \in \operatorname*{arg\,max}_{p \in \mathcal{P}} V^{f_S(p),\,p}_c \qquad \text{and} \qquad \pi^* := f_S(p^*) \in \operatorname*{arg\,max}_{\pi \in \arg\max_{\pi' \in \Pi^H} V^{\pi',p^*}_o} V^{\pi,\,p^*}_c.$$

The WSE can be constructed by breaking the ties against the configurator. In the rest of the paper, we employ the concept of SSE; however, every result can be applied to any deterministic tie-breaking rule. We call $\pi^p$ the application of the best response function $f_S$ to a transition model $p$. Notice that the goal of the configurator is well-defined whenever the function $f_S$ is fixed. From an online learning perspective, this goal is to minimize the expected regret:

$$\mathbb{E}[\mathrm{Regret}(K)] = \mathbb{E}\left[\sum_{k=1}^{K} \left(V^* - V^{I_k}\right)\right].$$

To lighten the notation, in the following we will denote with $\pi^i$ the agent's best response policy to the configuration $p_i$, i.e., $\pi^{p_i}$, and with $V^i$ the configurator's expected return attained with configuration $p_i$ and policy $\pi^i$, i.e., $V^{\pi^i,p_i}_c$. Finally, we denote $V^* = \max_{i \in [M]} V^i$.

Agent's Feedback. The configurator knows its reward $r_c$, but it does not know the agent's reward $r_o$. At each episode $k \in [K]$, the configurator selects a configuration $p_{I_k} \in \mathcal{P}$ and observes a trajectory of $H$ steps generated by the agent's best response policy $\pi^{I_k}$. We study two types of feedback:

Action-feedback (Af). The configurator observes the states and the actions played by the agent, $(s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H)$, where $a_h \sim \pi^{I_k}_h(\cdot|s_h)$.

Reward-feedback (Rf). The configurator observes the states, the actions played by the agent, and a noisy feedback of the agent's reward function, $(s_1, \widetilde r_1, a_1, \ldots, s_{H-1}, \widetilde r_{H-1}, a_{H-1}, s_H, \widetilde r_H)$, where $a_h \sim \pi^{I_k}_h(\cdot|s_h)$ and $\widetilde r_h$ is sampled from a distribution with mean $r_o(s_h)$ and support $[0, 1]$.²

The Rf setting models situations in which the agent's reward is known under uncertainty, or it is obtained in an approximate way through Inverse Reinforcement Learning [33].

² Clearly, the results we present can be directly extended to subgaussian distributions on the reward.

Connections with Bandit Algorithms. The online problem that we are facing can be seen as a stochastic multi-armed bandit [25], in which the arms are the configurations, and the configurator receives a random realization of its expected return at every episode. Thus, in principle, it can be solved by standard algorithms for bandit problems, such as UCB1 [1]. These algorithms are computationally less demanding than those we will present in the next sections. On the other hand, they suffer a regret that grows logarithmically, i.e., indefinitely, with the number of episodes. Indeed, they exploit neither the information regarding the agent's policy nor the structure induced by the agent's reward function. We will prove that, instead, the proposed algorithms, which use the problem structure, suffer bounded regret. Furthermore, our algorithms are combined with UCB1 confidence intervals, so their regret, at finite time, is never worse than that of UCB1.
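For reference, here is a minimal sketch of the UCB1 baseline just discussed, treating the $M$ configurations as bandit arms. The callback `sample_return` is a hypothetical helper producing one noisy realization of the configurator return (e.g., the sum of configurator rewards along the observed trajectory), and scaling the exploration bonus by $H$ is our assumption, made because returns lie in $[0, H]$.

```python
import math
import numpy as np

def ucb1_over_configurations(sample_return, M, K, H):
    """UCB1 baseline over M configurations treated as arms."""
    counts = np.zeros(M, dtype=int)
    means = np.zeros(M)
    choices = []
    for k in range(1, K + 1):
        if k <= M:
            i = k - 1                                   # play each configuration once
        else:
            bonus = H * np.sqrt(2.0 * math.log(k) / counts)
            i = int(np.argmax(means + bonus))           # optimism in the face of uncertainty
        ret = sample_return(i)
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]        # incremental sample mean
        choices.append(i)
    return choices
```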
5 Optimistic Configuration Learning

In this section, we present two algorithms for the online learning problem introduced in Section 4. The first algorithm uses only the collected agent decisions to optimistically learn the best configuration (Section 5.1). In the second algorithm, we also use the noisy reward feedback to construct an algorithm that leverages the structure linking all the transition probability models together: the agent's reward function $r_o$ (Section 5.2). In Appendix C, we provide some hints about the adversarial case to illustrate the additional complexities that arise. In the adversarial setting, the agent can play a different policy at each step, inside the set of possible policies that satisfy the SSE.

5.1 Action-feedback Optimistic Configuration Learning

We start with the action-feedback (Af) setting, in which the configurator observes the agent's actions only. The idea at the basis of the algorithm we propose, Action-feedback Optimistic Configuration Learning (AfOCL), is to maintain, for each configuration, a set of plausible policies that contains an agent's best response policy. The configurator plays the transition model that maximizes an optimistic approximation of its value function. Specifically, for every $i \in [M]$, $k \in [K]$, and $h \in [H]$ we denote with $A^i_{k,h}(s) \subseteq A$ the set of plausible actions in state $s$ at step $h$ for configuration $p_i$ at the beginning of episode $k$. For every model $p_i$, the first time we visit an $(s, h)$-pair and observe the agent's action $a \sim \pi^i_h(\cdot|s)$, we set $A^i_{k,h}(s) = \{a\}$. For the non-visited $(s, h)$-pairs, we leave $A^i_{k,h}(s) = A$. Based on this, we can compute an optimistic approximation $\widetilde V^i_{k,h}$ of the configurator value function $V^i_h$:

$$\widetilde V^i_{k,h}(s) = r_c(s) + \max_{a \in A^i_{k,h}(s)} \sum_{s' \in S} p_i(s'|s, a)\, \widetilde V^i_{k,h+1}(s'), \qquad \widetilde V^i_{k,H}(s) = r_c(s). \tag{2}$$

$\widetilde V^i_{k,h}$ can be computed by applying a value-iteration-like algorithm [35] that employs the iterate of Equation (2).³ Clearly, if the agent is playing deterministically, it holds that $A^i_{k,h}(s) = \{\pi^i_h(s)\}$ for all visited $(s, h)$-pairs and, consequently, $\widetilde V^i_{k,h}(s) = V^i_h(s)$. Instead, if the agent is playing stochastically, we possibly observe different actions whenever visiting $(s, h)$ and we record just the first one. The following lemma shows that even for stochastic agents, if the SSE tie-breaking rule is employed, $\widetilde V^i_{k,h}$ is optimistic.

³ Notice that the computational complexity decreases as the number of visited states increases and, in any case, is bounded by that of value iteration. Therefore, the time complexity of AfOCL is bounded by the cost of running value iteration for each of the $M$ configurations at every episode.

Lemma 5.1. The value function $\widetilde V^i_{k,h}(s)$ computed as in Equation (2) is such that $\widetilde V^i_{k,h}(s) \geq V^i_h(s)$ for all $s \in S$, $h \in [H]$, and $i \in [M]$.

In addition, we compute the confidence interval for UCB1, looking at the transition models as arms: $\widetilde V^{i,\mathrm{UCB}}_k = \widehat V^i_k + H\sqrt{2\log k / N_{i,k}}$, where $\widehat V^i_k$ is the sample mean of the observed return for model $p_i$ and $N_{i,k}$ is the number of times the algorithm has played model $i$ up to episode $k$. Thus, at each episode $k \in [K]$ the configurator plays the transition model $p_{I_k}$ maximizing the optimistic approximation:

$$I_k \in \operatorname*{arg\,max}_{i \in [M]} \min\big\{\widetilde V^i_k,\ \widetilde V^{i,\mathrm{UCB}}_k\big\}.$$

The pseudocode of AfOCL is reported in Algorithm 1.

Algorithm 1: Action-feedback Optimistic Configuration Learning (AfOCL)
1: Input: $S$, $A$, $H$, $\mathcal{P} = \{p_1, \ldots, p_M\}$
2: Initialize $A^i_{1,h}(s) = A$ for all $s \in S$, $h \in [H]$, and $i \in [M]$
3: for episodes $k = 1, 2, \ldots, K$ do
4: Compute $\widetilde V^{i,\mathrm{UCB}}_k$ for all $i \in [M]$
5: Compute $\widetilde V^i_k$ for all $i \in [M]$
6: Play $p_{I_k}$ with $I_k \in \arg\max_{i \in [M]} \min\{\widetilde V^i_k, \widetilde V^{i,\mathrm{UCB}}_k\}$
7: Observe $(s_{k,1}, a_{k,1}, \ldots, s_{k,H-1}, a_{k,H-1}, s_{k,H})$
8: Compute the plausible actions for all $s \in S$ and $h \in [H]$: $A^i_{k+1,h}(s) = \{a_{k,h}\}$ if $i = I_k$, $s = s_{k,h}$, and $N_{k,h}(s) = 0$; $A^i_{k+1,h}(s) = A^i_{k,h}(s)$ otherwise
9: end for
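The backward recursion of Equation (2) can be implemented directly. Below is a minimal, illustrative sketch (the array layout and names are ours, not the paper's); in AfOCL the resulting value would then be combined with the UCB1 estimate through the min in step 6 of Algorithm 1.

```python
import numpy as np

def optimistic_value(p_i, r_c, plausible, mu):
    """Optimistic configurator value of Equation (2) for one configuration.

    p_i:       (S, A, S) transition model of configuration p_i
    r_c:       (S,) configurator reward
    plausible: (H, S, A) boolean mask; plausible[h, s, a] is True iff action a
               is still plausible in state s at (0-indexed) step h
    Returns the optimistic expected return sum_s mu(s) * V_tilde_1(s).
    """
    H, S, _ = plausible.shape
    V = np.zeros((H, S))
    V[H - 1] = r_c                                     # terminal step: r_c(s)
    for h in range(H - 2, -1, -1):
        Q = r_c[:, None] + p_i @ V[h + 1]              # (S, A) one-step lookahead
        Q = np.where(plausible[h], Q, -np.inf)         # restrict the max to plausible actions
        V[h] = Q.max(axis=1)                           # optimistic max over A^i_{k,h}(s)
    return float(mu @ V[0])
```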
Regret Guarantees. We now provide an expected regret bound for the AfOCL algorithm. If the agent's policy $\pi^i$ is deterministic, it is not hard to get convinced that AfOCL suffers bounded regret since, whenever an $(s, h)$-pair is visited under a $p_i$, the agent reveals its (deterministic) policy $\pi^i$. Thus, either an $(s, h)$-pair is visited with high probability, or it will impact only marginally on the performance. The main challenge arises when the agent is playing a stochastic policy $\pi^i$ for some $p_i$. AfOCL just memorizes the first observed action for each $(s, h)$, pretending the agent's policy to be deterministic. Let $\widehat\pi^i$ be the policy that plays the action memorized by AfOCL at the end of the $K$ episodes, filled with the true agent's policy for the non-visited $(s, h)$-pairs. By construction, the support of $\widehat\pi^i$ is contained in the support of the true agent's policy $\pi^i$. Clearly, if $\pi^i$ is optimal for the agent's reward, $\widehat\pi^i$ is too. Furthermore, since the agent and the configurator are playing an SSE, $\widehat\pi^i$ will lead to the same configurator's performance as $\pi^i$. Indeed, if this were not the case, there would exist another deterministic policy optimal for the agent, leading to higher performance for the configurator, contradicting the definition of SSE. The following result shows that switching $\pi^i$ with $\widehat\pi^i$ changes the regret just by a multiplicative factor depending on the mismatch between the visitation distributions induced by the two policies, $d_{i,h}$ and $\widehat d_{i,h}$ respectively.

Theorem 5.1 (Regret of AfOCL). Let $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$ with $\mathcal{P} = \{p_1, \ldots, p_M\}$ be the $M$ configurations. The expected regret of AfOCL at every episode $K > 0$ is bounded by:

$$\mathbb{E}[\mathrm{Regret}(K)] \leq O\Bigg(\min\Bigg\{\underbrace{\sum_{i \in [M]:\, \Delta_i > 0} \frac{H^2 \log K}{\Delta_i}}_{\text{UCB1 regret}},\ \underbrace{\rho\, M H^3 S^2}_{\text{AfOCL regret}}\Bigg\}\Bigg),$$

where $\rho$ is defined as $\max_{i \in [M]:\, \Delta_i > 0} \mathbb{E}\big[\max_{s \in S} \max_{h \in [H]} \widehat d_{i,h}(s) / d_{i,h}(s)\big]$.

The result might be surprising as the regret is constant and independent of the suboptimality gaps between the configurations, i.e., $\Delta_i = V^* - V^i$ for every $i \in [M]$. As supported by intuition, we need to spend more time discarding MDPs that are more similar in performance to the optimal one. Formally, the maximum number of times a suboptimal configuration $p_i$ is played is proportional to $1/\Delta_i$ (and not proportional to $1/\Delta_i^2$ as in standard bandits). This is because we just need one visit to every reachable state. We underline that the term $\rho$, which indicates the expected ratio between the state distribution induced by the estimated policy and the one induced by the real policy, is equal to $1$ when the agent plays a deterministic policy and is bounded by $SH$ in the worst case (see Lemma B.3). As far as we know, Theorem 5.1 is the first problem-dependent result for regret minimization in a multi-entity MDP. More details on the proof are given in Appendix B.

5.2 Reward-feedback Optimistic Configuration Learning

The main drawback of AfOCL is that every transition model is treated separately, which prevents exploiting the underlying structure of the environment, represented by the agent's reward function $r_o$. Indeed, if the configurator knew $r_o$, it could find the optimal configuration with no need for interaction, by simply computing the agent's best response policies for the SSE. The algorithm we propose in this section, Reward-feedback Optimistic Configuration Learning (RfOCL), employs the reward feedback (Rf), i.e., at every interaction, the configurator also observes a noisy version of the agent's reward function. The crucial point is that $r_o$ is the same regardless of the chosen configuration and, for this reason, it provides a link between the configurations. Specifically, for every $k \in [K]$ and $s \in S$, RfOCL maintains a confidence interval for the agent's reward function $\mathcal{R}_k(s) = [\underline r_{o,k}(s), \overline r_{o,k}(s)]$, obtained using the samples collected up to episode $k - 1$ regardless of the played configuration. We apply Hoeffding's inequality to build the confidence interval:

$$\underline r_{o,k}(s) = \widehat r_{o,k}(s) - \sqrt{\frac{\log(2SHk^2)}{2\max\{N_k(s), 1\}}}, \qquad \overline r_{o,k}(s) = \widehat r_{o,k}(s) + \sqrt{\frac{\log(2SHk^2)}{2\max\{N_k(s), 1\}}},$$

where $N_k(s)$ is the number of visits of state $s$ in the first $k - 1$ episodes, and $\widehat r_{o,k}(s)$ is the sample mean of the observed rewards for state $s$ up to episode $k$.
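A minimal sketch of this reward confidence interval follows. The exact constants inside the radius are an assumption consistent with rewards bounded in $[0, 1]$, and the helper name and inputs are illustrative, not the paper's implementation.

```python
import numpy as np

def reward_confidence_interval(sum_rewards, visits, k, S, H):
    """Hoeffding-style confidence interval on the agent's reward (Section 5.2).

    sum_rewards[s]: sum of the noisy rewards observed in state s so far
    visits[s]:      number of visits of state s in the first k-1 episodes
    Returns (r_low, r_high), each of shape (S,), clipped to the known range [0, 1].
    """
    n = np.maximum(visits, 1)
    mean = sum_rewards / n
    # assumed radius sqrt(log(2 S H k^2) / (2 max{N_k(s), 1}))
    radius = np.sqrt(np.log(2.0 * S * H * k ** 2) / (2.0 * n))
    return np.clip(mean - radius, 0.0, 1.0), np.clip(mean + radius, 0.0, 1.0)
```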
Given the estimated reward, for every configuration $i \in [M]$, we can compute a confidence interval for the agent's Q-values, $\mathcal{Q}_{k,h}(s, a) = [\underline Q^i_{o,k,h}(s, a), \overline Q^i_{o,k,h}(s, a)]$, by simply applying the Bellman equation:

$$\underline Q^i_{o,k,h}(s, a) = \underline r_{o,k}(s) + \sum_{s' \in S} p_i(s'|s, a) \max_{a' \in A} \underline Q^i_{o,k,h+1}(s', a'),$$
$$\overline Q^i_{o,k,h}(s, a) = \overline r_{o,k}(s) + \sum_{s' \in S} p_i(s'|s, a) \max_{a' \in A} \overline Q^i_{o,k,h+1}(s', a'),$$

with $\underline Q^i_{o,k,H}(s, a) = \underline r_{o,k}(s)$ and $\overline Q^i_{o,k,H}(s, a) = \overline r_{o,k}(s)$. If the true reward function belongs to the confidence interval, i.e., $r_o \in \mathcal{R}_k$, then the true Q-value belongs to the corresponding confidence interval, i.e., $Q^i_{o,h} \in \mathcal{Q}_{k,h}$. Consequently, we can use $\mathcal{Q}_{k,h}$ to restrict the set of plausible actions in a state without actually observing the agent playing an action in that state. Indeed, the plausible actions are those whose Q-value upper bound is larger than the maximum Q-value lower bound:

$$\widetilde A^i_{k,h}(s) = \Big\{a \in A : \overline Q^i_{o,k,h}(s, a) \geq \max_{a' \in A} \underline Q^i_{o,k,h}(s, a')\Big\}. \tag{4}$$

In other words, if the upper Q-value of an action is smaller than the largest lower Q-value, it cannot be the greedy action, and it is discarded. Clearly, if we observe, for the first time, the agent playing an action in $(s, h)$ at episode $k$, we can reduce the plausible actions to the singleton $\{a_{k,h}\}$, as in the action-feedback setting (Section 5.1). Based on this refined definition of plausible actions, we can compute the optimistic estimate $\widetilde V^i_{k,h}$ of the configurator value function $V^i_h$ as in Equation (2) and proceed by playing the optimistic configuration. The pseudocode of RfOCL is reported in Algorithm 2. It is worth noting that we need to keep track of the states that have already been visited because, for those, we know the agent's action and there is no need to apply Equation (4). This is why we introduce the counts $N_{k,h}(s)$.⁴

⁴ The value iteration dominates the computational complexity of an individual iteration of RfOCL (steps 5 and 9), leading, as for AfOCL, to a per-episode time complexity dominated by value iteration.

Algorithm 2: Reward-feedback Optimistic Configuration Learning (RfOCL)
1: Input: $S$, $A$, $H$, $\mathcal{P} = \{p_1, \ldots, p_M\}$
2: Initialize $A^i_{1,h}(s) = A$ for all $s \in S$, $h \in [H]$, and $i \in [M]$
3: Initialize $\overline r_{o,1}(s) = 1$, $\underline r_{o,1}(s) = 0$, and $N_{1,h}(s) = 0$ for all $s \in S$ and $h \in [H]$
4: for episodes $k = 1, 2, \ldots, K$ do
5: Compute $\widetilde V^{i,\mathrm{UCB}}_k$ for all $i \in [M]$
6: Compute $\widetilde V^i_k$ for all $i \in [M]$
7: Play $p_{I_k}$ with $I_k \in \arg\max_{i \in [M]} \min\{\widetilde V^i_k, \widetilde V^{i,\mathrm{UCB}}_k\}$
8: Observe $(s_{k,1}, \widetilde r_{k,1}, a_{k,1}, \ldots, s_{k,H-1}, \widetilde r_{k,H-1}, a_{k,H-1}, s_{k,H}, \widetilde r_{k,H})$
9: Compute $\underline r_{o,k+1}(s)$, $\overline r_{o,k+1}(s)$, and $N_{k+1,h}(s)$ for all $s \in S$ and $h \in [H]$ using $\widetilde r_{k,1}, \ldots, \widetilde r_{k,H}$
10: Compute $\underline Q^i_{o,k+1,h}(s, a)$ and $\overline Q^i_{o,k+1,h}(s, a)$ for all $s \in S$, $a \in A$, $h \in [H]$, and $i \in [M]$
11: Compute the plausible actions for all $s \in S$ and $h \in [H]$: $A^i_{k+1,h}(s) = \{a_{k,h}\}$ if $i = I_k$, $s = s_{k,h}$, and $N_{k,h}(s) = 0$; $A^i_{k+1,h}(s) = A^i_{k,h}(s)$ if $N_{k,h}(s) > 0$; $A^i_{k+1,h}(s) = \widetilde A^i_{k+1,h}(s)$ otherwise, with $\widetilde A^i_{k+1,h}(s)$ as in Equation (4)
12: end for
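A minimal sketch of steps 10 and 11 of Algorithm 2, i.e., the agent Q-value bounds and the plausible-action set of Equation (4), is given below; in the full algorithm the returned mask is overridden by the singleton $\{a_{k,h}\}$ for already-visited $(s, h)$-pairs. The names and array layout are illustrative.

```python
import numpy as np

def agent_q_bounds(p_i, r_low, r_high, H):
    """Lower/upper bounds on the agent's Q-values for configuration p_i."""
    S, A, _ = p_i.shape
    Q_lo = np.zeros((H, S, A))
    Q_hi = np.zeros((H, S, A))
    Q_lo[H - 1] = r_low[:, None]                      # terminal step uses the reward bounds
    Q_hi[H - 1] = r_high[:, None]
    for h in range(H - 2, -1, -1):
        Q_lo[h] = r_low[:, None] + p_i @ Q_lo[h + 1].max(axis=1)
        Q_hi[h] = r_high[:, None] + p_i @ Q_hi[h + 1].max(axis=1)
    return Q_lo, Q_hi

def plausible_actions(Q_lo, Q_hi):
    """Equation (4): keep actions whose upper Q-value is not dominated by the
    best lower Q-value in the same (s, h); returns an (H, S, A) boolean mask."""
    return Q_hi >= Q_lo.max(axis=2, keepdims=True)
```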
Regret Guarantees. We now give a regret bound for the RfOCL algorithm. Obviously, the same arguments used for AfOCL also apply to this extended version, so the regret bound of Theorem 5.1 is valid for RfOCL. Moreover, for this algorithm, we prove that the regret, under the following assumption, does not depend on the number of configurations.

Assumption 1. There exists $\alpha > 0$ such that $\min_{i \in [M]} \min_{s \in S} \max_{h \in [H]} d^i_h(s) \geq \alpha$, where $d^i_h(s)$ is the probability of visiting state $s \in S$ at time $h \in [H]$ in configuration $p_i$ under the agent's best response policy $\pi^i$.

This assumption requires that in every model $p_i \in \mathcal{P}$ the agent has a non-zero probability, at some step $h$, of visiting every state $s$. This allows shrinking the confidence intervals for the reward of every state, so as to estimate the agent's policy correctly, regardless of the played configuration. Notice that this assumption is less strict than requiring the well-known ergodicity of the Markov process induced by any policy, used in many algorithms [9, 21, 44].⁵ Under Assumption 1 we prove the following regret guarantee.

Theorem 5.2 (Regret of RfOCL). Let $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$ with $\mathcal{P} = \{p_1, \ldots, p_M\}$ be the $M$ configurations. Under Assumption 1, the expected regret of RfOCL at every episode $K > 0$ is bounded by:

$$\mathbb{E}[\mathrm{Regret}(K)] \leq O\Bigg(\min\Bigg\{\underbrace{\sum_{i \in [M]:\, \Delta_i > 0} \frac{H^2 \log K}{\Delta_i}}_{\text{UCB1 regret}},\ \underbrace{\rho\, M H^3 S^2}_{\text{AfOCL regret}},\ \underbrace{\bar\Delta\, \bar K + \rho\, H^3 S^2}_{\text{RfOCL regret}}\Bigg\}\Bigg),$$

where $\rho$ is defined as in Theorem 5.1, $\bar K$ is the smallest integer solution of the inequality $\bar K \geq 1 + \frac{2H^2S^2 \log(2SH\bar K^2)}{\alpha^2 \Delta_Q^2}$, $\bar\Delta = \max_{i \in [M]} \Delta_i$ is the maximum suboptimality gap, and $\Delta_Q$ is the minimum positive gap of the agent's Q-values (see Appendix B).

The regret bound removes the dependence on the number of models $M$, as $\bar K$ is clearly independent of $M$, but it introduces, as expected, a dependence on the minimum visitation probability $\alpha$. The proof of the result is reported in Appendix B. Since RfOCL exploits additional information compared to AfOCL and its sets of plausible actions $A^i_{k,h}$ are subsets of those of AfOCL, the regret bound of AfOCL (Theorem 5.1) also holds for RfOCL. Thus, we can take as regret bound for RfOCL the minimum between $\bar\Delta \bar K + \rho H^3 S^2$ and $\rho M H^3 S^2$. We underline that, as far as we know, this is the first proof that takes into consideration both the suboptimality gap of the uncontrollable entity, the agent, and the suboptimality gap of the controllable entity, the configurator. This permits deriving a problem-dependent regret bound. We think that similar techniques can also be of interest for Markov games.

⁵ Moreover, the configurator can enforce this assumption, since it has control over the environmental transition model.

[Figure 1: Cumulative regret for the Gridworld experiment. 50 runs, 98% c.i.]
[Figure 2: Cumulative regret for the Gridworld experiment without ergodicity. 50 runs, 98% c.i.]

6 Related Works

The idea of altering the environment dynamics to improve the agent's learning experience has been exploited before the introduction of Conf-MDPs. Curriculum learning [8] provides the agent with a sequence of environments, of increasing difficulty, to shape the learning process with possible benefits on the learning speed [e.g., 14, 16]. Although the learning process is carried out in a different environment, the configuration is typically performed in simulation only. The setting most similar to the NConf-MDP is the one presented in [47], where the configurator and the agent have opposite reward functions (similar to a zero-sum game). In the Conf-MDP framework, instead, the configuration opportunities are an intrinsic property of the environment [29]. The initial approaches entrusted the agent with the configuration activity and, consequently, this task was entirely auxiliary to its learning experience [29, 39, 27]. More recently, it has been observed that environment configuration can be carried out even by an external entity, opening new opportunities for the application of environment configurability, including settings in which the configurator's interests conflict with those of the agent.
For instance, in [28] the configurator acts on the environment to induce the agent to reveal its capabilities in terms of perception and actuation. Instead, in [17] a threatener entity can change the transition probabilities either in a stochastic or in an adversarial manner. More generally, environment configuration carried out by an external entity has been studied in the field of planning as a form of environment design [48]. Thus, our NConf-MDP unifies these settings, allowing for arbitrary agent's and configurator's reward functions. An interesting connection is established with the robust control literature [31, 20]. Whenever the two reward functions are opposite, i.e., the interaction between the agent and the configurator is fully competitive, the resulting equilibrium corresponds to a robust policy. Indeed, while the agent tries to maximize its expected return, the configurator places the agent in the worst possible environment. Configurable environments (cooperative and non-cooperative) share similarities with environment design [49]. At a high level, the two frameworks share analogous objectives: they both aim at determining an environment with a certain goal that can differ from that of the agent. However, there are some notable differences. In particular, the classical environment design formulation [49] assumes that the configurator (called the "interested party") knows the agent's best response function, while in our approach we learn it by interaction. Nevertheless, general environment design makes no assumption about the underlying environment, which might not be an MDP. Instead, [22] is limited to MDPs and considers a form of cooperative environment design in which the goal is to maximize the agent's performance. Interestingly, some works [22, 37] also account for a cost function to penalize expensive environment configurations. The design of our approaches is based on the OFU principle used for stochastic multi-armed bandits [e.g., 23, 1, 18, 25] and MDPs [e.g., 2, 7, 3]. Moreover, our learning setting with reward feedback is related to structured bandits, or bandits with correlated arms.⁶ Interestingly, for certain structures, it is known that bounded regret is achievable [11, 24], a property that is enjoyed by both our algorithms.

⁶ In our case, playing a single configuration provides information about the opponent's reward, which, in turn, provides information about the value of all configurations.

Our setting is also close to the Stochastic Games model, in which two or more agents act in an MDP to maximize their own reward functions. Recently, the stochastic games framework has gained growing interest [5, 4, 50], especially in the offline setting, i.e., when we can control all the agents. For this reason, these approaches do not apply to our setting, where we control the configurator only. Although some works tackle the online setting [44, 45, 41], where we can control only one agent, all of these algorithms work in the zero-sum setting only.

[Figure 3: Cumulative regret as a function of the episodes for the Student-Teacher experiment. 50 runs, 98% c.i.]

7 Experiments

In this section, we provide the experimental evaluation of our algorithms in two different settings: when the policies are stochastic and when the policies are deterministic. For these experiments, we provide two novel environments: the Configurable Gridworld and the Student-Teacher.
We compare the algorithms with the standard (theoretical) implementation of UCB1 [1]. The environment descriptions and additional results can be found in Appendix D.

Stochastic policies. The Configurable Gridworld is a configurable version of a classic 3×3 Gridworld. The agent's starting state is the cell (0, 1), and its goal is to minimize the number of steps required to reach the exit located in the cell (2, 1). The configurator receives reward 1 when the agent occupies the central cell (1, 1) and 0 otherwise. In a classic Gridworld, the optimal policy would be trivial, as the agent would proceed straight to the exit. In this Configurable Gridworld, instead, the configurator can set the power $p$ of a stochastic obstacle located in the cell (1, 1). When the agent is in that cell and performs the action "go right" to reach the exit, it hits the obstacle and remains in the same position with probability $p$. The configurator's goal is to tune this probability to keep the agent in the central cell for the maximum number of steps. The $M$ configurations differ in the probability $p$ and are obtained by a regular discretization of $[0, 1]$. In the first experiment (Figure 1), we considered 10 and 30 configurations with a number of episodes $K = 2000$ and $K = 4000$, respectively, and horizon $H = 10$. For this experiment, the agent plays optimal stochastic policies. We can see that AfOCL and RfOCL suffer constant regret, whereas UCB1 displays a logarithmic regret, as expected. Specifically, RfOCL outperforms AfOCL and stops playing suboptimal configurations in less than 500 episodes in both cases. This can be explained by the fact that Assumption 1 is fulfilled (indeed, the agent has probability 0.1 of failing its action), so RfOCL is able to exploit the underlying structure of the problem more effectively.

Non-Ergodicity. In Figure 2, we have only three configurations, designed to induce an optimal agent's policy that generates a non-ergodic Markov chain. In this case, the optimal policies are deterministic, and we violate Assumption 1. For this reason, we observe that AfOCL and RfOCL display very similar behavior, but still significantly better than UCB1.

Deterministic policies: Student-Teacher. The Student-Teacher environment models a simple interaction between a student and a teacher. There is a set of exercises, each with a different level of "teacher hardness" and "student hardness". The teacher has to decide the optimal sequence of exercises in order to make the student acquire as much knowledge as possible. The student's goal is to maximize the number of exercises and to reduce the hardness of the proposed exercises. At each timestep, the student decides whether to answer the exercise or not. If it answers, it receives a reward equal to the level of correctness of the exercise, the teacher receives a reward corresponding to the level of the exercise's "teacher hardness", and they move on to the next exercise. If the student does not answer, the student and the teacher will receive 1, and with probability 0.7 the next exercise will be easier to solve. In Figure 3, the results with $M \in \{40, 60, 100\}$ and horizon $H = 10$ are shown. The configurations represent the distribution over the next exercise, given a positive answer. In every run, we change the "student hardness" of the exercises. We observe that both AfOCL and RfOCL suffer significantly less regret compared to UCB1 and tend to converge to constant regret, as expected. It is interesting to observe that, in line with our analysis, the gap between AfOCL and RfOCL appears more evident as the number of configurations grows.
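For concreteness, the following is an illustrative sketch of how a family of Gridworld configurations parameterized by the obstacle probability could be constructed as a regular discretization of $[0, 1]$. The state and action indices, and the exact way the obstacle modifies the dynamics, are assumptions made for the sake of the example, not the paper's implementation.

```python
import numpy as np

def gridworld_configurations(M, base_p, obstacle_state, exit_state, go_right):
    """Illustrative family of M Configurable Gridworld transition models.

    base_p is a (S, A, S) nominal transition model of the 3x3 grid. For the
    i-th configuration, taking `go_right` in `obstacle_state` reaches
    `exit_state` with probability 1 - q_i and stays put with probability q_i,
    where the q_i form a regular discretization of [0, 1].
    """
    configs = []
    for q in np.linspace(0.0, 1.0, M):
        p = base_p.copy()
        p[obstacle_state, go_right, :] = 0.0
        p[obstacle_state, go_right, exit_state] = 1.0 - q
        p[obstacle_state, go_right, obstacle_state] = q
        configs.append(p)
    return configs
```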
8 Conclusions

In this paper, we have introduced an extension of the Conf-MDP framework to account for a possibly non-cooperative interaction between the agent and the configurator. We focused on an online learning problem in this new setting, proposing two regret minimization algorithms for identifying the best environment configuration within a finite set, based on the principle of optimism in the face of uncertainty. We proved that, even when the agent's policy is stochastic and the configurator observes the agent's actions only, it is possible to achieve finite regret that depends linearly on the number of admissible configurations. Furthermore, we illustrated that we can remove this dependence if the configurator observes a possibly noisy version of the agent's reward, under sufficient regularity conditions on the environment. This paper also gives interesting insights into the importance of properly exploiting the available feedback to construct efficient algorithms. Moreover, as far as we know, the ones we have presented are the first problem-dependent regret results for multi-entity MDPs. The experimental evaluation showed that our algorithms display a convergence speed significantly faster than UCB1, and that RfOCL tends to outperform AfOCL thanks to the exploitation of the additional structure. Future research directions include a deeper analysis of the adversarial setting, as well as the application to inverse reinforcement learning.

Limitations and Societal Impact. Methods that incentivize the manipulation of users' behavior can have, generally speaking, a negative societal impact when used, for instance, in a marketing campaign. Nevertheless, our work is mainly theoretical and, at the present level, can hardly be used in a malevolent way. Another relevant aspect is the cost of environment configuration. We are aware that reconfiguring the environment is an activity that typically leads to higher costs compared with policy learning. However, we did not consider this aspect in the formalization of the Non-Cooperative Conf-MDP since it would possibly make the problem more complex (like, for instance, when considering bandits with switching costs).

References

[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
[2] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 21 (NIPS), pages 89–96. Curran Associates, Inc., 2008.
[3] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349, 2013.
[4] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 551–560. PMLR, 2020.
[5] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[6] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC), pages 61–78. ACM, 2015.
[7] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 35–42. AUAI Press, 2009.
[8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), volume 382 of ACM International Conference Proceeding Series, pages 41–48. ACM, 2009.
[9] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.
[10] Michele Breton, Abderrahmane Alj, and Alain Haurie. Sequential Stackelberg equilibria in two-person games. Journal of Optimization Theory and Applications, 59(1):71–97, 1988.
[11] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In The 26th Annual Conference on Learning Theory (COLT), volume 30 of JMLR Workshop and Conference Proceedings, pages 122–134. JMLR.org, 2013.
[12] Lucian Busoniu, Robert Babuska, and Bart De Schutter. Multi-agent reinforcement learning: A survey. In Ninth International Conference on Control, Automation, Robotics and Vision (ICARCV 2006), Singapore, 5-8 December 2006, Proceedings, pages 1–6. IEEE, 2006.
[13] Benjamin Chasnov, Lillian J. Ratliff, Eric Mazumdar, and Samuel Burden. Convergence analysis of gradient-based learning in continuous games. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), volume 115 of Proceedings of Machine Learning Research, pages 935–944. AUAI Press, 2019.
[14] Kamil Andrzej Ciosek and Shimon Whiteson. OFFER: Off-environment reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1819–1825. AAAI Press, 2017.
[15] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce (EC), pages 82–90. ACM, 2006.
[16] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In 1st Annual Conference on Robot Learning (CoRL), volume 78 of Proceedings of Machine Learning Research, pages 482–495. PMLR, 2017.
[17] Víctor Gallego, Roi Naveiro, and David Ríos Insua. Reinforcement learning under threats. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 9939–9940. AAAI Press, 2019.
[18] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In The 24th Annual Conference on Learning Theory (COLT), volume 19 of JMLR Proceedings, pages 359–376. JMLR.org, 2011.
[19] Qingyu Guo, Jiarui Gan, Fei Fang, Long Tran-Thanh, Milind Tambe, and Bo An. On the inducibility of Stackelberg equilibrium for security games. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 2020–2028. AAAI Press, 2019.
[20] Garud N. Iyengar. Robust dynamic programming. Math. Oper. Res., 30(2):257–280, 2005.
[21] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2-3):209–232, 2002.
[22] Sarah Keren, Luis Enrique Pineda, Avigdor Gal, Erez Karpas, and Shlomo Zilberstein. Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4353–4360. ijcai.org, 2017.
[23] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[24] Tor Lattimore and Rémi Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems 27 (NIPS), pages 550–558, 2014.
[25] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[26] George Leitmann. On generalized Stackelberg strategies. Journal of Optimization Theory and Applications, 26(4):637–643, 1978.
[27] Alberto Maria Metelli, Emanuele Ghelfi, and Marcello Restelli. Reinforcement learning in configurable continuous environments. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4546–4555. PMLR, 2019.
[28] Alberto Maria Metelli, Guglielmo Manneschi, and Marcello Restelli. Policy space identification in configurable environments. Mach. Learn., 2021.
[29] Alberto Maria Metelli, Mirco Mutti, and Marcello Restelli. Configurable Markov decision processes. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR, 2018.
[30] Thanh Hong Nguyen, Rong Yang, Amos Azaria, Sarit Kraus, and Milind Tambe. Analyzing the effectiveness of adversary modeling in security games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press, 2013.
[31] Arnab Nilim and Laurent El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems 16 (NIPS), pages 839–846. MIT Press, 2003.
[32] Steven Okamoto, Noam Hazon, and Katia P. Sycara. Solving non-zero sum multiagent network flow security games with attack costs. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 879–888. IFAAMAS, 2012.
[33] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Found. Trends Robotics, 7(1-2):1–179, 2018.
[34] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 2149–2156. AAAI Press, 2019.
[35] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[36] Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, and Marcello Restelli. Online learning in non-cooperative configurable Markov decision process. AAAI-21 Workshop on Reinforcement Learning in Games, 2021.
[37] Sandhya Saisubramanian and Shlomo Zilberstein. Mitigating negative side effects via environment shaping. In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1640–1642. ACM, 2021.
[38] Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Learning to play sequential games versus unknown opponents. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[39] Rui Silva, Francisco S. Melo, and Manuela Veloso. What if the world were different? Gradient-based exploration for new optimal policies. In 4th Global Conference on Artificial Intelligence (GCAI), volume 55 of EPiC Series in Computing, pages 229–242. EasyChair, 2018.
[40] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press.
[41] Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Provably efficient online agnostic learning in Markov games. CoRR, abs/2010.15020, 2020.
[42] Andrea Tirinzoni, Riccardo Poiani, and Marcello Restelli. Sequential transfer in reinforcement learning with a generative model. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 9481–9492. PMLR, 2020.
[43] Heinrich Von Stackelberg. Market Structure and Equilibrium. Springer Science & Business Media, 2010.
[44] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems 30 (NIPS), pages 4987–4997, 2017.
[45] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In Conference on Learning Theory (COLT), volume 125 of Proceedings of Machine Learning Research, pages 3674–3682. PMLR, 2020.
[46] Andrea Zanette, Mykel J. Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 5626–5635, 2019.
[47] Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Yin Wen, Yong Yu, and Wenxin Li. Learning to design games: Strategic environments in reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pages 3068–3074. ijcai.org, 2018.
[48] Haoqi Zhang, Yiling Chen, and David C. Parkes. A general approach to environment design with one agent. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2014, 2009.
[49] Haoqi Zhang, Yiling Chen, and David C. Parkes. A general approach to environment design with one agent. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2014, 2009.
[50] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[51] Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. CoRR, abs/1911.10635, 2019.
[52] Martin Zinkevich, Amy Greenwald, and Michael L. Littman. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems 18 (NIPS), pages 1641–1648, 2005.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See conclusion.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] The paper is mostly a theoretical contribution and we think it will not have any potential negative societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] The assumptions are all given in the main paper.
   (b) Did you include complete proofs of all theoretical results? [Yes] See appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See figures.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]