# Learning in Non-Cooperative Configurable Markov Decision Processes

Giorgia Ramponi (ETH AI Center, Zurich, Switzerland), gramponi@ethz.ch
Alberto Maria Metelli (Politecnico di Milano, Milan, Italy), albertomaria.metelli@polimi.it
Alessandro Concetti (Politecnico di Milano, Milan, Italy), alessandro.concetti@mail.polimi.it
Marcello Restelli (Politecnico di Milano, Milan, Italy), marcello.restelli@polimi.it

35th Conference on Neural Information Processing Systems (NeurIPS 2021). Work done while Giorgia Ramponi was at Politecnico di Milano.

Abstract

The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have identical reward functions. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a framework that allows modeling two (possibly different) reward functions for the configurator and the agent. Then, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms to minimize the configurator's expected regret, which exploit the problem's structure, depending on the agent's feedback. While a naïve application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically validate the performance of our algorithms in simulated domains.

1 Introduction

The standard Reinforcement Learning [RL, 40] framework involves an agent whose objective is to maximize the reward collected during its interaction with the environment. However, there exist real-world scenarios in which the agent itself or an external supervisor (configurator) can partially modify the environment. In a car racing problem, for example, it is possible to modify the car setup to better suit the driver's needs. Recently, Configurable Markov Decision Processes [Conf-MDPs, 29] were introduced to model these scenarios and exploit the configuration opportunities. Solving a Conf-MDP consists of simultaneously optimizing a set of environmental parameters and the agent's policy to reach the maximum expected return. In many scenarios, however, the configurator does not know the agent's reward, and their intentions are different, leading to new forms of interaction between the two actors. For instance, imagine we are the owner of a supermarket, and we have to arrange the products on the shelves. Our objective is to increase the company's final profit; on the other hand, a customer aims to spend the shortest time possible inside the supermarket and buy the indispensable products only. Since we do not know the customer's reward function, the only possibility is to try different dispositions and observe the customers' reactions. What if we knew what buyers are most interested in? In this case, we can strategically decide how to position other products close to the popular ones to induce the customer into a behavior that is more profitable for the supermarket owner. In this paper, we model these scenarios by introducing the Non-Cooperative Configurable Markov Decision Process (NConf-MDP). This novel framework handles the possibility of having different reward functions for the agent and the configurator.
While a Conf-MDP assumes that the configurator acts to help the agent optimize its expected reward, an NConf-MDP, instead, allows modeling a wider set of situations, including cases in which the agent and the configurator display a non-cooperative behavior. Obviously, this setting cannot be addressed with a straightforward application of the algorithms designed for cooperative Conf-MDPs. In fact, if the configurator and the agent separately optimize different objectives, they might not converge to an equilibrium strategy [52, 12, 51, 13]. In this novel setting, we consider an online learning problem, where the configurator has to select a configuration, within a finite set of possible configurations, in order to maximize its own return. This framework can be seen as a leader-follower game, in which the follower (the agent) is selfish and optimizes its own reward function, and the leader (the configurator) has to decide the best configuration, based on its own reward. Clearly, to adapt its decisions, the configurator has to receive some form of feedback related to the agent's behavior. We analyze two settings based on whether the configurator observes just the agent's actions or, in addition, a noisy version of the agent's reward.

Contributions. In this paper, we extend the Configurable Markov Decision Process setting to deal with situations where the configurator and the agent have different reward functions. We call this new framework the Non-Cooperative Configurable Markov Decision Process (NConf-MDP, Section 3). Then, we formalize the problem of finding the best environment configuration, according to the configurator's reward, as a leader-follower game, in which the agent (follower) reacts to each presented configuration with its best response policy (Section 4). We provide a first algorithm, Action-feedback Optimistic Configuration Learning (AfOCL), to tackle this problem under the assumption that the configurator observes the agent's actions only (Section 5.1). We show that AfOCL achieves finite expected regret, scaling linearly with the number of admissible configurations. As far as we know, this represents the first problem-dependent regret analysis in a multi-agent RL setting. Then, we introduce a second algorithm, Reward-feedback Optimistic Configuration Learning (RfOCL), that assumes the availability of a noisy version of the agent's reward, in addition to the agent's actions (Section 5.2). We prove that, under suitable conditions, RfOCL further exploits the structure underlying the decision process, removing the dependence on the number of configurations. The analysis uses novel ideas, combining the suboptimality gaps of the configurator with those of the agent. Finally, we provide an experimental evaluation on benchmark domains, inspired by scenarios that motivate the NConf-MDP framework (Section 7). The proofs of the results presented in the paper are reported in Appendix B. A preliminary version of this work was presented at the AAAI-21 Workshop on Reinforcement Learning in Games [36].

2 Preliminaries

A finite-horizon Markov Decision Process [MDP, 35] is a tuple $\mathcal{M} = (S, A, p, \mu, r, H)$, where $S$ is a finite state space ($S = |S|$), $A$ is a finite action space ($A = |A|$), $p : S \times A \times S \to [0, 1]$ is the transition model, which defines the density $p(s'|s, a)$ of state $s' \in S$ when taking action $a \in A$ in state $s \in S$, $\mu : S \to [0, 1]$ is the initial state distribution, $r : S \to [0, 1]$ is the reward function, and $H \in \mathbb{N}_{\geq 1}$ is the horizon.
A stochastic decision rule $\pi_h : S \times A \to [0, 1]$ with $h \in [H]$ prescribes the probability $\pi_h(a|s)$ of playing action $a \in A$ in state $s \in S$. A stochastic policy $\pi = (\pi_1, \ldots, \pi_H) \in \Pi^H$ is a sequence of decision rules, where $\Pi^H$ is the set of stochastic policies over the horizon $H$. A finite-horizon Configurable Markov Decision Process [Conf-MDP, 29] is defined as $\mathcal{CM} = (S, A, \mathcal{P}, \mu, r, H)$ and extends the MDP by considering a configuration space $\mathcal{P}$ instead of a single transition model $p$. The Q-value of a policy $\pi \in \Pi^H$ and configuration $p \in \mathcal{P}$ is the expected sum of the rewards starting from $(s, a) \in S \times A$ at step $h \in [H]$:

$$Q^{\pi,p}_h(s, a) = r(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right],$$

denoting with $\mathbb{E}_{s_{h'} \sim p, \pi}$ the expectation w.r.t. the state distribution induced by $\pi$ and $p$ at step $h'$. The value function is given by $V^{\pi,p}_h(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_h(s, a)]$ and the expected return is defined as $V^{\pi,p} = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_1(s)]$. In a Conf-MDP the goal consists in finding a policy $\pi^*$ together with an environment configuration $p^*$ so as to maximize the expected return, i.e., $(\pi^*, p^*) \in \arg\max_{\pi \in \Pi^H,\, p \in \mathcal{P}} V^{\pi,p}$.

3 Non-Cooperative Conf-MDPs

The definition of Conf-MDP allows modeling scenarios in which the agent and the configurator share the same objective, encoded in a single reward function $r$. In this section, we introduce an extension of this framework to account for the presence of a configurator having interests that might differ from those of the agent.

Definition 3.1. A Non-Cooperative Configurable Markov Decision Process (NConf-MDP) is defined by a tuple $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$, where $(S, A, \mathcal{P}, \mu, H)$ is a Conf-MDP without reward and $r_c, r_o : S \to [0, 1]$ are the configurator and agent (opponent) reward functions, respectively.

Given a policy $\pi \in \Pi^H$ and a configuration $p \in \mathcal{P}$, for every $(s, a) \in S \times A$ and $h \in [H]$ we define the configurator and agent Q-values as:

$$Q^{\pi,p}_{c,h}(s, a) = r_c(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r_c(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right],$$
$$Q^{\pi,p}_{o,h}(s, a) = r_o(s) + \mathbb{E}_{s_{h'} \sim p, \pi}\left[\sum_{h'=h+1}^{H} r_o(s_{h'}) \,\Big|\, s_h = s,\, a_h = a\right].$$

We denote with $V^{\pi,p}_{c,h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_{c,h}(s, a)]$ and $V^{\pi,p}_{o,h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot|s)}[Q^{\pi,p}_{o,h}(s, a)]$ the value functions, and with $V^{\pi,p}_c = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_{c,1}(s)]$ and $V^{\pi,p}_o = \mathbb{E}_{s \sim \mu}[V^{\pi,p}_{o,1}(s)]$ the expected returns for the configurator and the agent, respectively.
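To make these quantities concrete, the following is a minimal Python sketch of finite-horizon policy evaluation by backward induction for a fixed policy and configuration, yielding the two expected returns defined above. The tabular array layout and the function and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

def evaluate(policy, p, r, H):
    """Finite-horizon policy evaluation by backward induction.

    policy: (H, S, A) array, policy[h, s, a] = probability of action a in state s
            at (0-indexed) step h
    p:      (S, A, S) array, p[s, a, s'] = transition probability
    r:      (S,) array, state-only reward as in the paper
    Returns V of shape (H, S) with V[h, s] the value at (0-indexed) step h.
    """
    _, S, _ = policy.shape
    V = np.zeros((H, S))
    V[H - 1] = r                              # last step collects r(s) only
    for h in range(H - 2, -1, -1):
        Q = r[:, None] + p @ V[h + 1]         # Q_h(s, a) = r(s) + sum_{s'} p(s'|s,a) V_{h+1}(s')
        V[h] = (policy[h] * Q).sum(axis=1)    # V_h(s) = E_{a ~ pi_h(.|s)}[Q_h(s, a)]
    return V

def expected_returns(policy, p, mu, r_c, r_o):
    """Configurator and agent expected returns for a given (policy, configuration) pair."""
    H = policy.shape[0]
    V_c = evaluate(policy, p, r_c, H)
    V_o = evaluate(policy, p, r_o, H)
    return float(mu @ V_c[0]), float(mu @ V_o[0])
```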
4 Problem Formulation

While for classical Conf-MDPs [29, 27] a notion of optimality is straightforward, as agent and configurator share the same objective, in an NConf-MDP they can display possibly conflicting interests. We assume a sequential interaction between the configurator and the agent that resembles the leader-follower protocol [10, 6, 34, 38]. First, the configurator (leader) selects an environment configuration $p \in \mathcal{P}$, where $\mathcal{P}$ is a finite set made of $M$ stochastic transition models, $\mathcal{P} = \{p_1, \ldots, p_M\}$. Then, the agent (follower) plays a policy chosen by a best response function $f : \mathcal{P} \to \Pi^H$, such that:

$$f(p) \in \operatorname*{arg\,max}_{\pi \in \Pi^H} V^{\pi,p}_o.$$

The solution concept that we use is the well-known Stackelberg equilibrium [43, 15, 30, 32, 19]. It captures the outcome in which the configurator's transition model is optimal, under the assumption that the agent will always respond optimally [26]. However, this definition includes the possibility of ties, i.e., situations in which multiple agent optimal policies exist, with possibly different performance for the configurator. Therefore, it is necessary to employ a tie-breaking rule, i.e., a criterion to select one agent best response. Different tie-breaking rules lead to different Stackelberg equilibria, and the two prevailing solution concepts in the literature are the Strong Stackelberg Equilibrium (SSE) and the Weak Stackelberg Equilibrium (WSE). A policy-transition model pair $(\pi^*, p^*)$ forms an SSE if ties are broken in favor of the configurator:

$$p^* \in \operatorname*{arg\,max}_{p \in \mathcal{P}} V^{f_S(p),\,p}_c \qquad \text{and} \qquad \pi^* := f_S(p^*) \in \operatorname*{arg\,max}_{\pi \in \arg\max_{\pi' \in \Pi^H} V^{\pi',p^*}_o} V^{\pi,\,p^*}_c.$$

The WSE can be constructed by breaking the ties against the configurator. In the rest of the paper, we employ the concept of SSE; however, every result can be applied to any deterministic tie-breaking rule. We call $\pi^p$ the application of the best response function $f_S$ to a transition model $p$. Notice that the goal of the configurator is well-defined whenever the function $f_S$ is fixed. From an online learning perspective, this goal is to minimize the expected regret:

$$\mathbb{E}[\mathrm{Regret}(K)] = \mathbb{E}\left[\sum_{k=1}^{K} \left(V^* - V^{I_k}\right)\right].$$

To lighten the notation, in the following we will denote with $\pi^i$ the agent's best response policy to the configuration $p_i$, i.e., $\pi^{p_i}$, and with $V^i$ the configurator's expected return attained with configuration $p_i$ and policy $\pi^i$, i.e., $V^{\pi^i,p_i}_c$. Finally, we denote $V^* = \max_{i \in [M]} V^i$.

Agent's Feedback. The configurator knows its reward $r_c$, but it does not know the agent's reward $r_o$. At each episode $k \in [K]$, the configurator selects a configuration $p_{I_k} \in \mathcal{P}$ and observes a trajectory of $H$ steps generated by the agent's best response policy $\pi^{I_k}$. We study two types of feedback:

Action-feedback (Af). The configurator observes the states and the actions played by the agent, $(s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H)$, where $a_h \sim \pi^{I_k}_h(\cdot|s_h)$.

Reward-feedback (Rf). The configurator observes the states, the actions played by the agent, and a noisy feedback of the agent's reward function, $(s_1, \widetilde r_1, a_1, \ldots, s_{H-1}, \widetilde r_{H-1}, a_{H-1}, s_H, \widetilde r_H)$, where $a_h \sim \pi^{I_k}_h(\cdot|s_h)$ and $\widetilde r_h$ is sampled from a distribution with mean $r_o(s_h)$ and support $[0, 1]$.²

The Rf setting models situations in which the agent's reward is known under uncertainty, or it is obtained in an approximate way through Inverse Reinforcement Learning [33].

² Clearly, the results we present can be directly extended to subgaussian distributions on the reward.

Connections with Bandit Algorithms. The online problem that we are facing can be seen as a stochastic multi-armed bandit [25], in which the arms are the configurations, and the configurator receives a random realization of its expected return at every episode. Thus, in principle, it can be solved by standard algorithms for bandit problems, such as UCB1 [1]. These algorithms are computationally less demanding than those we will present in the next sections. On the other hand, they suffer a regret that grows logarithmically, i.e., indefinitely, with the number of episodes. Indeed, they exploit neither the information regarding the agent's policy nor the structure induced by the agent's reward function. We will prove that, instead, the proposed algorithms, which use the problem structure, suffer bounded regret. Furthermore, our algorithms are combined with UCB1 confidence intervals, so their regret, at finite time, is never worse than that of UCB1.
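For reference, here is a minimal sketch of the UCB1 baseline just discussed, treating the $M$ configurations as bandit arms. The callback `sample_return` is a hypothetical helper producing one noisy realization of the configurator return (e.g., the sum of configurator rewards along the observed trajectory), and scaling the exploration bonus by $H$ is our assumption, made because returns lie in $[0, H]$.

```python
import math
import numpy as np

def ucb1_over_configurations(sample_return, M, K, H):
    """UCB1 baseline over M configurations treated as arms."""
    counts = np.zeros(M, dtype=int)
    means = np.zeros(M)
    choices = []
    for k in range(1, K + 1):
        if k <= M:
            i = k - 1                                   # play each configuration once
        else:
            bonus = H * np.sqrt(2.0 * math.log(k) / counts)
            i = int(np.argmax(means + bonus))           # optimism in the face of uncertainty
        ret = sample_return(i)
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]        # incremental sample mean
        choices.append(i)
    return choices
```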
5 Optimistic Configuration Learning

In this section, we present two algorithms for the online learning problem introduced in Section 4. The first algorithm uses only the collected agent decisions to optimistically learn the best configuration (Section 5.1). In the second algorithm, we also use the noisy reward feedback to construct an algorithm that leverages the structure linking all the transition probability models together: the agent's reward function $r_o$ (Section 5.2). In Appendix C, we provide some hints about the adversarial case to illustrate the additional complexities that arise. In the adversarial setting, the agent can play a different policy at each step, inside the set of possible policies that satisfy the SSE.

5.1 Action-feedback Optimistic Configuration Learning

We start with the action-feedback (Af) setting, in which the configurator observes the agent's actions only. The idea at the basis of the algorithm we propose, Action-feedback Optimistic Configuration Learning (AfOCL), is to maintain, for each configuration, a set of plausible policies that contains an agent's best response policy. The configurator plays the transition model that maximizes an optimistic approximation of its value function. Specifically, for every $i \in [M]$, $k \in [K]$, and $h \in [H]$ we denote with $A^i_{k,h}(s) \subseteq A$ the set of plausible actions in state $s$ at step $h$ for configuration $p_i$ at the beginning of episode $k$. For every model $p_i$, the first time we visit an $(s, h)$-pair and observe the agent's action $a \sim \pi^i_h(\cdot|s)$, we set $A^i_{k,h}(s) = \{a\}$. For the non-visited $(s, h)$-pairs, we leave $A^i_{k,h}(s) = A$. Based on this, we can compute an optimistic approximation $\widetilde V^i_{k,h}$ of the configurator value function $V^i_h$:

$$\widetilde V^i_{k,h}(s) = r_c(s) + \max_{a \in A^i_{k,h}(s)} \sum_{s' \in S} p_i(s'|s, a)\, \widetilde V^i_{k,h+1}(s'), \qquad \widetilde V^i_{k,H}(s) = r_c(s). \tag{2}$$

$\widetilde V^i_{k,h}$ can be computed by applying a value-iteration-like algorithm [35] that employs the iterate of Equation (2).³ Clearly, if the agent is playing deterministically, it holds that $A^i_{k,h}(s) = \{\pi^i_h(s)\}$ for all visited $(s, h)$-pairs and, consequently, $\widetilde V^i_{k,h}(s) = V^i_h(s)$. Instead, if the agent is playing stochastically, we possibly observe different actions whenever visiting $(s, h)$ and we record just the first one. The following lemma shows that even for stochastic agents, if the SSE tie-breaking rule is employed, $\widetilde V^i_{k,h}$ is optimistic.

³ Notice that the computational complexity decreases as the number of visited states increases and, in any case, is bounded by that of value iteration. Therefore, the time complexity of AfOCL is bounded by the cost of running value iteration for each of the $M$ configurations at every episode.

Lemma 5.1. The value function $\widetilde V^i_{k,h}(s)$ computed as in Equation (2) is such that $\widetilde V^i_{k,h}(s) \geq V^i_h(s)$ for all $s \in S$, $h \in [H]$, and $i \in [M]$.

In addition, we compute the confidence interval for UCB1, looking at the transition models as arms: $\widetilde V^{i,\mathrm{UCB}}_k = \widehat V^i_k + H\sqrt{2\log k / N_{i,k}}$, where $\widehat V^i_k$ is the sample mean of the observed return for model $p_i$ and $N_{i,k}$ is the number of times the algorithm has played model $i$ up to episode $k$. Thus, at each episode $k \in [K]$ the configurator plays the transition model $p_{I_k}$ maximizing the optimistic approximation:

$$I_k \in \operatorname*{arg\,max}_{i \in [M]} \min\big\{\widetilde V^i_k,\ \widetilde V^{i,\mathrm{UCB}}_k\big\}.$$

The pseudocode of AfOCL is reported in Algorithm 1.

Algorithm 1: Action-feedback Optimistic Configuration Learning (AfOCL)
1: Input: $S$, $A$, $H$, $\mathcal{P} = \{p_1, \ldots, p_M\}$
2: Initialize $A^i_{1,h}(s) = A$ for all $s \in S$, $h \in [H]$, and $i \in [M]$
3: for episodes $k = 1, 2, \ldots, K$ do
4: Compute $\widetilde V^{i,\mathrm{UCB}}_k$ for all $i \in [M]$
5: Compute $\widetilde V^i_k$ for all $i \in [M]$
6: Play $p_{I_k}$ with $I_k \in \arg\max_{i \in [M]} \min\{\widetilde V^i_k, \widetilde V^{i,\mathrm{UCB}}_k\}$
7: Observe $(s_{k,1}, a_{k,1}, \ldots, s_{k,H-1}, a_{k,H-1}, s_{k,H})$
8: Compute the plausible actions for all $s \in S$ and $h \in [H]$: $A^i_{k+1,h}(s) = \{a_{k,h}\}$ if $i = I_k$, $s = s_{k,h}$, and $N_{k,h}(s) = 0$; $A^i_{k+1,h}(s) = A^i_{k,h}(s)$ otherwise
9: end for
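The backward recursion of Equation (2) can be implemented directly. Below is a minimal, illustrative sketch (the array layout and names are ours, not the paper's); in AfOCL the resulting value would then be combined with the UCB1 estimate through the min in step 6 of Algorithm 1.

```python
import numpy as np

def optimistic_value(p_i, r_c, plausible, mu):
    """Optimistic configurator value of Equation (2) for one configuration.

    p_i:       (S, A, S) transition model of configuration p_i
    r_c:       (S,) configurator reward
    plausible: (H, S, A) boolean mask; plausible[h, s, a] is True iff action a
               is still plausible in state s at (0-indexed) step h
    Returns the optimistic expected return sum_s mu(s) * V_tilde_1(s).
    """
    H, S, _ = plausible.shape
    V = np.zeros((H, S))
    V[H - 1] = r_c                                     # terminal step: r_c(s)
    for h in range(H - 2, -1, -1):
        Q = r_c[:, None] + p_i @ V[h + 1]              # (S, A) one-step lookahead
        Q = np.where(plausible[h], Q, -np.inf)         # restrict the max to plausible actions
        V[h] = Q.max(axis=1)                           # optimistic max over A^i_{k,h}(s)
    return float(mu @ V[0])
```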
Regret Guarantees. We now provide an expected regret bound for the AfOCL algorithm. If the agent's policy $\pi^i$ is deterministic, it is not hard to get convinced that AfOCL suffers bounded regret since, whenever an $(s, h)$-pair is visited under a $p_i$, the agent reveals its (deterministic) policy $\pi^i$. Thus, either an $(s, h)$-pair is visited with high probability, or it will impact only marginally on the performance. The main challenge arises when the agent is playing a stochastic policy $\pi^i$ for some $p_i$. AfOCL just memorizes the first observed action for each $(s, h)$, pretending the agent's policy to be deterministic. Let $\widehat\pi^i$ be the policy that plays the action memorized by AfOCL at the end of the $K$ episodes, filled with the true agent's policy for the non-visited $(s, h)$-pairs. By construction, the support of $\widehat\pi^i$ is contained in the support of the true agent's policy $\pi^i$. Clearly, if $\pi^i$ is optimal for the agent's reward, $\widehat\pi^i$ is too. Furthermore, since the agent and the configurator are playing an SSE, $\widehat\pi^i$ will lead to the same configurator's performance as $\pi^i$. Indeed, if this were not the case, there would exist another deterministic policy optimal for the agent, leading to higher performance for the configurator, contradicting the definition of SSE. The following result shows that switching $\pi^i$ with $\widehat\pi^i$ changes the regret just by a multiplicative factor depending on the mismatch between the visitation distributions induced by the two policies, $d_{i,h}$ and $\widehat d_{i,h}$ respectively.

Theorem 5.1 (Regret of AfOCL). Let $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$ with $\mathcal{P} = \{p_1, \ldots, p_M\}$ be the $M$ configurations. The expected regret of AfOCL at every episode $K > 0$ is bounded by:

$$\mathbb{E}[\mathrm{Regret}(K)] \leq O\Bigg(\min\Bigg\{\underbrace{\sum_{i \in [M]:\, \Delta_i > 0} \frac{H^2 \log K}{\Delta_i}}_{\text{UCB1 regret}},\ \underbrace{\rho\, M H^3 S^2}_{\text{AfOCL regret}}\Bigg\}\Bigg),$$

where $\rho$ is defined as $\max_{i \in [M]:\, \Delta_i > 0} \mathbb{E}\big[\max_{s \in S} \max_{h \in [H]} \widehat d_{i,h}(s) / d_{i,h}(s)\big]$.

The result might be surprising as the regret is constant and independent of the suboptimality gaps between the configurations, i.e., $\Delta_i = V^* - V^i$ for every $i \in [M]$. As supported by intuition, we need to spend more time discarding MDPs that are more similar in performance to the optimal one. Formally, the maximum number of times a suboptimal configuration $p_i$ is played is proportional to $1/\Delta_i$ (and not proportional to $1/\Delta_i^2$ as in standard bandits). This is because we just need one visit to every reachable state. We underline that the term $\rho$, which indicates the expected ratio between the state distribution induced by the estimated policy and the one induced by the real policy, is equal to $1$ when the agent plays a deterministic policy and is bounded by $SH$ in the worst case (see Lemma B.3). As far as we know, Theorem 5.1 is the first problem-dependent result for regret minimization in a multi-entity MDP. More details on the proof are given in Appendix B.

5.2 Reward-feedback Optimistic Configuration Learning

The main drawback of AfOCL is that every transition model is treated separately, which prevents exploiting the underlying structure of the environment, represented by the agent's reward function $r_o$. Indeed, if the configurator knew $r_o$, it could find the optimal configuration with no need for interaction, by simply computing the agent's best response policies for the SSE. The algorithm we propose in this section, Reward-feedback Optimistic Configuration Learning (RfOCL), employs the reward feedback (Rf), i.e., at every interaction, the configurator also observes a noisy version of the agent's reward function. The crucial point is that $r_o$ is the same regardless of the chosen configuration and, for this reason, it provides a link between the configurations. Specifically, for every $k \in [K]$ and $s \in S$, RfOCL maintains a confidence interval for the agent's reward function $\mathcal{R}_k(s) = [\underline r_{o,k}(s), \overline r_{o,k}(s)]$, obtained using the samples collected up to episode $k - 1$ regardless of the played configuration. We apply Hoeffding's inequality to build the confidence interval:

$$\underline r_{o,k}(s) = \widehat r_{o,k}(s) - \sqrt{\frac{\log(2SHk^2)}{2\max\{N_k(s), 1\}}}, \qquad \overline r_{o,k}(s) = \widehat r_{o,k}(s) + \sqrt{\frac{\log(2SHk^2)}{2\max\{N_k(s), 1\}}},$$

where $N_k(s)$ is the number of visits of state $s$ in the first $k - 1$ episodes, and $\widehat r_{o,k}(s)$ is the sample mean of the observed rewards for state $s$ up to episode $k$.
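A minimal sketch of this reward confidence interval follows. The exact constants inside the radius are an assumption consistent with rewards bounded in $[0, 1]$, and the helper name and inputs are illustrative, not the paper's implementation.

```python
import numpy as np

def reward_confidence_interval(sum_rewards, visits, k, S, H):
    """Hoeffding-style confidence interval on the agent's reward (Section 5.2).

    sum_rewards[s]: sum of the noisy rewards observed in state s so far
    visits[s]:      number of visits of state s in the first k-1 episodes
    Returns (r_low, r_high), each of shape (S,), clipped to the known range [0, 1].
    """
    n = np.maximum(visits, 1)
    mean = sum_rewards / n
    # assumed radius sqrt(log(2 S H k^2) / (2 max{N_k(s), 1}))
    radius = np.sqrt(np.log(2.0 * S * H * k ** 2) / (2.0 * n))
    return np.clip(mean - radius, 0.0, 1.0), np.clip(mean + radius, 0.0, 1.0)
```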
Given the estimated reward, for every configuration $i \in [M]$, we can compute a confidence interval for the agent's Q-values, $\mathcal{Q}_{k,h}(s, a) = [\underline Q^i_{o,k,h}(s, a), \overline Q^i_{o,k,h}(s, a)]$, by simply applying the Bellman equation:

$$\underline Q^i_{o,k,h}(s, a) = \underline r_{o,k}(s) + \sum_{s' \in S} p_i(s'|s, a) \max_{a' \in A} \underline Q^i_{o,k,h+1}(s', a'),$$
$$\overline Q^i_{o,k,h}(s, a) = \overline r_{o,k}(s) + \sum_{s' \in S} p_i(s'|s, a) \max_{a' \in A} \overline Q^i_{o,k,h+1}(s', a'),$$

with $\underline Q^i_{o,k,H}(s, a) = \underline r_{o,k}(s)$ and $\overline Q^i_{o,k,H}(s, a) = \overline r_{o,k}(s)$. If the true reward function belongs to the confidence interval, i.e., $r_o \in \mathcal{R}_k$, then the true Q-value belongs to the corresponding confidence interval, i.e., $Q^i_{o,h} \in \mathcal{Q}_{k,h}$. Consequently, we can use $\mathcal{Q}_{k,h}$ to restrict the set of plausible actions in a state without actually observing the agent playing an action in that state. Indeed, the plausible actions are those whose Q-value upper bound is larger than the maximum Q-value lower bound:

$$\widetilde A^i_{k,h}(s) = \Big\{a \in A : \overline Q^i_{o,k,h}(s, a) \geq \max_{a' \in A} \underline Q^i_{o,k,h}(s, a')\Big\}. \tag{4}$$

In other words, if the upper Q-value of an action is smaller than the largest lower Q-value, it cannot be the greedy action, and it is discarded. Clearly, if we observe, for the first time, the agent playing an action in $(s, h)$ at episode $k$, we can reduce the plausible actions to the singleton $\{a_{k,h}\}$, as in the action-feedback setting (Section 5.1). Based on this refined definition of plausible actions, we can compute the optimistic estimate $\widetilde V^i_{k,h}$ of the configurator value function $V^i_h$ as in Equation (2) and proceed by playing the optimistic configuration. The pseudocode of RfOCL is reported in Algorithm 2. It is worth noting that we need to keep track of the states that have already been visited because, for those, we know the agent's action and there is no need to apply Equation (4). This is why we introduce the counts $N_{k,h}(s)$.⁴

⁴ The value iteration dominates the computational complexity of an individual iteration of RfOCL (steps 5 and 9), leading, as for AfOCL, to a per-episode time complexity dominated by value iteration.

Algorithm 2: Reward-feedback Optimistic Configuration Learning (RfOCL)
1: Input: $S$, $A$, $H$, $\mathcal{P} = \{p_1, \ldots, p_M\}$
2: Initialize $A^i_{1,h}(s) = A$ for all $s \in S$, $h \in [H]$, and $i \in [M]$
3: Initialize $\overline r_{o,1}(s) = 1$, $\underline r_{o,1}(s) = 0$, and $N_{1,h}(s) = 0$ for all $s \in S$ and $h \in [H]$
4: for episodes $k = 1, 2, \ldots, K$ do
5: Compute $\widetilde V^{i,\mathrm{UCB}}_k$ for all $i \in [M]$
6: Compute $\widetilde V^i_k$ for all $i \in [M]$
7: Play $p_{I_k}$ with $I_k \in \arg\max_{i \in [M]} \min\{\widetilde V^i_k, \widetilde V^{i,\mathrm{UCB}}_k\}$
8: Observe $(s_{k,1}, \widetilde r_{k,1}, a_{k,1}, \ldots, s_{k,H-1}, \widetilde r_{k,H-1}, a_{k,H-1}, s_{k,H}, \widetilde r_{k,H})$
9: Compute $\underline r_{o,k+1}(s)$, $\overline r_{o,k+1}(s)$, and $N_{k+1,h}(s)$ for all $s \in S$ and $h \in [H]$ using $\widetilde r_{k,1}, \ldots, \widetilde r_{k,H}$
10: Compute $\underline Q^i_{o,k+1,h}(s, a)$ and $\overline Q^i_{o,k+1,h}(s, a)$ for all $s \in S$, $a \in A$, $h \in [H]$, and $i \in [M]$
11: Compute the plausible actions for all $s \in S$ and $h \in [H]$: $A^i_{k+1,h}(s) = \{a_{k,h}\}$ if $i = I_k$, $s = s_{k,h}$, and $N_{k,h}(s) = 0$; $A^i_{k+1,h}(s) = A^i_{k,h}(s)$ if $N_{k,h}(s) > 0$; $A^i_{k+1,h}(s) = \widetilde A^i_{k+1,h}(s)$ otherwise, with $\widetilde A^i_{k+1,h}(s)$ as in Equation (4)
12: end for
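A minimal sketch of steps 10 and 11 of Algorithm 2, i.e., the agent Q-value bounds and the plausible-action set of Equation (4), is given below; in the full algorithm the returned mask is overridden by the singleton $\{a_{k,h}\}$ for already-visited $(s, h)$-pairs. The names and array layout are illustrative.

```python
import numpy as np

def agent_q_bounds(p_i, r_low, r_high, H):
    """Lower/upper bounds on the agent's Q-values for configuration p_i."""
    S, A, _ = p_i.shape
    Q_lo = np.zeros((H, S, A))
    Q_hi = np.zeros((H, S, A))
    Q_lo[H - 1] = r_low[:, None]                      # terminal step uses the reward bounds
    Q_hi[H - 1] = r_high[:, None]
    for h in range(H - 2, -1, -1):
        Q_lo[h] = r_low[:, None] + p_i @ Q_lo[h + 1].max(axis=1)
        Q_hi[h] = r_high[:, None] + p_i @ Q_hi[h + 1].max(axis=1)
    return Q_lo, Q_hi

def plausible_actions(Q_lo, Q_hi):
    """Equation (4): keep actions whose upper Q-value is not dominated by the
    best lower Q-value in the same (s, h); returns an (H, S, A) boolean mask."""
    return Q_hi >= Q_lo.max(axis=2, keepdims=True)
```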
Regret Guarantees. We now give a regret bound for the RfOCL algorithm. Obviously, the same arguments used for AfOCL also apply to this extended version, so the regret bound of Theorem 5.1 is valid for RfOCL. Moreover, for this algorithm, we prove that the regret, under the following assumption, does not depend on the number of configurations.

Assumption 1. There exists $\alpha > 0$ such that $\min_{i \in [M]} \min_{s \in S} \max_{h \in [H]} d^i_h(s) \geq \alpha$, where $d^i_h(s)$ is the probability of visiting state $s \in S$ at time $h \in [H]$ in configuration $p_i$ under the agent's best response policy $\pi^i$.

This assumption requires that in every model $p_i \in \mathcal{P}$ the agent has a non-zero probability, at some step $h$, of visiting every state $s$. This allows shrinking the confidence intervals for the reward of every state, so as to estimate the agent's policy correctly, regardless of the played configuration. Notice that this assumption is less strict than requiring the well-known ergodicity of the Markov process induced by any policy, used in many algorithms [9, 21, 44].⁵ Under Assumption 1 we prove the following regret guarantee.

Theorem 5.2 (Regret of RfOCL). Let $\mathcal{NCM} = (S, A, \mathcal{P}, \mu, r_c, r_o, H)$ with $\mathcal{P} = \{p_1, \ldots, p_M\}$ be the $M$ configurations. Under Assumption 1, the expected regret of RfOCL at every episode $K > 0$ is bounded by:

$$\mathbb{E}[\mathrm{Regret}(K)] \leq O\Bigg(\min\Bigg\{\underbrace{\sum_{i \in [M]:\, \Delta_i > 0} \frac{H^2 \log K}{\Delta_i}}_{\text{UCB1 regret}},\ \underbrace{\rho\, M H^3 S^2}_{\text{AfOCL regret}},\ \underbrace{\bar\Delta\, \bar K + \rho\, H^3 S^2}_{\text{RfOCL regret}}\Bigg\}\Bigg),$$

where $\rho$ is defined as in Theorem 5.1, $\bar K$ is the smallest integer solution of the inequality $\bar K \geq 1 + \frac{2H^2S^2 \log(2SH\bar K^2)}{\alpha^2 \Delta_Q^2}$, $\bar\Delta = \max_{i \in [M]} \Delta_i$ is the maximum suboptimality gap, and $\Delta_Q$ is the minimum positive gap of the agent's Q-values (see Appendix B).

The regret bound removes the dependence on the number of models $M$, as $\bar K$ is clearly independent of $M$, but it introduces, as expected, a dependence on the minimum visitation probability $\alpha$. The proof of the result is reported in Appendix B. Since RfOCL exploits additional information compared to AfOCL and its sets of plausible actions $A^i_{k,h}$ are subsets of those of AfOCL, the regret bound of AfOCL (Theorem 5.1) also holds for RfOCL. Thus, we can take as regret bound for RfOCL the minimum between $\bar\Delta \bar K + \rho H^3 S^2$ and $\rho M H^3 S^2$. We underline that, as far as we know, this is the first proof that takes into consideration both the suboptimality gap of the uncontrollable entity, the agent, and the suboptimality gap of the controllable entity, the configurator. This permits deriving a problem-dependent regret bound. We think that similar techniques can also be of interest for Markov games.

⁵ Moreover, the configurator can enforce this assumption, since it has control over the environmental transition model.

[Figure 1: Cumulative regret for the Gridworld experiment. 50 runs, 98% c.i.]
[Figure 2: Cumulative regret for the Gridworld experiment without ergodicity. 50 runs, 98% c.i.]

6 Related Works

The idea of altering the environment dynamics to improve the agent's learning experience has been exploited before the introduction of Conf-MDPs. Curriculum learning [8] provides the agent with a sequence of environments, of increasing difficulty, to shape the learning process with possible benefits on the learning speed [e.g., 14, 16]. Although the learning process is carried out in a different environment, the configuration is typically performed in simulation only. The setting most similar to the NConf-MDP is the one presented in [47], where the configurator and the agent have opposite reward functions (similar to a zero-sum game). In the Conf-MDP framework, instead, the configuration opportunities are an intrinsic property of the environment [29]. The initial approaches entrusted the agent with the configuration activity and, consequently, this task was entirely auxiliary to its learning experience [29, 39, 27]. More recently, it has been observed that environment configuration can be carried out even by an external entity, opening new opportunities for the application of environment configurability, including settings in which the configurator's interests conflict with those of the agent.
For instance, in [28] the configurator acts on the environment to induce the agent to reveal its capabilities in terms of perception and actuation. Instead, in [17] a threatener entity can change the transition probabilities either in a stochastic or in an adversarial manner. More generally, environment configuration carried out by an external entity has been studied in the field of planning as a form of environment design [48]. Thus, our NConf-MDP unifies these settings, allowing for arbitrary agent's and configurator's reward functions. An interesting connection is established with the robust control literature [31, 20]. Whenever the two reward functions are opposite, i.e., the interaction between the agent and the configurator is fully competitive, the resulting equilibrium corresponds to a robust policy. Indeed, while the agent tries to maximize its expected return, the configurator places the agent in the worst possible environment. Configurable environments (cooperative and non-cooperative) share similarities with environment design [49]. At a high level, the two frameworks share analogous objectives: they both aim at determining an environment with a certain goal that can differ from that of the agent. However, there are some notable differences. In particular, the classical environment design formulation [49] assumes that the configurator (called the "interested party") knows the agent's best response function, while in our approach we learn it by interaction. Nevertheless, general environment design makes no assumption about the underlying environment, which might not be an MDP. Instead, [22] is limited to MDPs and considers a form of cooperative environment design in which the goal is to maximize the agent's performance. Interestingly, some works [22, 37] also account for a cost function to penalize expensive environment configurations. The design of our approaches is based on the OFU principle used for stochastic multi-armed bandits [e.g., 23, 1, 18, 25] and MDPs [e.g., 2, 7, 3]. Moreover, our learning setting with reward feedback is related to structured bandits, or bandits with correlated arms.⁶ Interestingly, for certain structures, it is known that bounded regret is achievable [11, 24], a property that is enjoyed by both our algorithms.

⁶ In our case, playing a single configuration provides information about the opponent's reward, which, in turn, provides information about the value of all configurations.

Our setting is also close to the Stochastic Games model, in which two or more agents act in an MDP to maximize their own reward functions. Recently, the stochastic games framework has gained growing interest [5, 4, 50], especially in the offline setting, i.e., when we can control all the agents. For this reason, these approaches do not apply to our setting, where we control the configurator only. Although some works tackle the online setting [44, 45, 41], where we can control only one agent, all of these algorithms work in the zero-sum setting only.

[Figure 3: Cumulative regret as a function of the episodes for the Student-Teacher experiment. 50 runs, 98% c.i.]

7 Experiments

In this section, we provide the experimental evaluation of our algorithms in two different settings: when the policies are stochastic and when the policies are deterministic. For these experiments, we provide two novel environments: the Configurable Gridworld and the Student-Teacher.
We compare the algorithms with the standard (theoretical) implementation of UCB1 [1]. The environment descriptions and additional results can be found in Appendix D.

Stochastic policies. The Configurable Gridworld is a configurable version of a classic 3×3 Gridworld. The agent's starting state is the cell (0, 1), and its goal is to minimize the number of steps required to reach the exit located in the cell (2, 1). The configurator receives reward 1 when the agent occupies the central cell (1, 1) and 0 otherwise. In a classic Gridworld, the optimal policy would be trivial, as the agent would proceed straight to the exit. In this Configurable Gridworld, instead, the configurator can set the power $p$ of a stochastic obstacle located in the cell (1, 1). When the agent is in that cell and performs the action "go right" to reach the exit, it hits the obstacle and remains in the same position with probability $p$. The configurator's goal is to tune this probability to keep the agent in the central cell for the maximum number of steps. The $M$ configurations differ in the probability $p$ and are obtained by a regular discretization of $[0, 1]$. In the first experiment (Figure 1), we considered 10 and 30 configurations with a number of episodes $K = 2000$ and $K = 4000$, respectively, and horizon $H = 10$. For this experiment, the agent plays optimal stochastic policies. We can see that AfOCL and RfOCL suffer constant regret, whereas UCB1 displays a logarithmic regret, as expected. Specifically, RfOCL outperforms AfOCL and stops playing suboptimal configurations in less than 500 episodes in both cases. This can be explained by the fact that Assumption 1 is fulfilled (indeed, the agent has probability 0.1 of failing its action), so RfOCL is able to exploit the underlying structure of the problem more effectively.

Non-Ergodicity. In Figure 2, we have only three configurations, designed to induce an optimal agent's policy that generates a non-ergodic Markov chain. In this case, the optimal policies are deterministic, and we violate Assumption 1. For this reason, we observe that AfOCL and RfOCL display very similar behavior, but still significantly better than UCB1.

Deterministic policies: Student-Teacher. The Student-Teacher environment models a simple interaction between a student and a teacher. There is a set of exercises, each with a different level of "teacher hardness" and "student hardness". The teacher has to decide the optimal sequence of exercises in order to make the student acquire as much knowledge as possible. The student's goal is to maximize the number of exercises and to reduce the hardness of the proposed exercises. At each timestep, the student decides whether to answer the exercise or not. If it answers, it receives a reward equal to the level of correctness of the exercise, the teacher receives a reward corresponding to the level of the exercise's "teacher hardness", and they move on to the next exercise. If the student does not answer, the student and the teacher will receive 1, and with probability 0.7 the next exercise will be easier to solve. In Figure 3, the results with $M \in \{40, 60, 100\}$ and horizon $H = 10$ are shown. The configurations represent the distribution over the next exercise, given a positive answer. In every run, we change the "student hardness" of the exercises. We observe that both AfOCL and RfOCL suffer significantly less regret compared to UCB1 and tend to converge to constant regret, as expected. It is interesting to observe that, in line with our analysis, the gap between AfOCL and RfOCL appears more evident as the number of configurations grows.
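For concreteness, the following is an illustrative sketch of how a family of Gridworld configurations parameterized by the obstacle probability could be constructed as a regular discretization of $[0, 1]$. The state and action indices, and the exact way the obstacle modifies the dynamics, are assumptions made for the sake of the example, not the paper's implementation.

```python
import numpy as np

def gridworld_configurations(M, base_p, obstacle_state, exit_state, go_right):
    """Illustrative family of M Configurable Gridworld transition models.

    base_p is a (S, A, S) nominal transition model of the 3x3 grid. For the
    i-th configuration, taking `go_right` in `obstacle_state` reaches
    `exit_state` with probability 1 - q_i and stays put with probability q_i,
    where the q_i form a regular discretization of [0, 1].
    """
    configs = []
    for q in np.linspace(0.0, 1.0, M):
        p = base_p.copy()
        p[obstacle_state, go_right, :] = 0.0
        p[obstacle_state, go_right, exit_state] = 1.0 - q
        p[obstacle_state, go_right, obstacle_state] = q
        configs.append(p)
    return configs
```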
8 Conclusions

In this paper, we have introduced an extension of the Conf-MDP framework to account for a possibly non-cooperative interaction between the agent and the configurator. We focused on an online learning problem in this new setting, proposing two regret minimization algorithms for identifying the best environment configuration within a finite set, based on the principle of optimism in the face of uncertainty. We proved that, even when the agent's policy is stochastic and the configurator observes the agent's actions only, it is possible to achieve finite regret that depends linearly on the number of admissible configurations. Furthermore, we illustrated that we can remove this dependence if the configurator observes a possibly noisy version of the agent's reward, under sufficient regularity conditions on the environment. This paper also gives interesting insights into the importance of properly exploiting the available feedback to construct efficient algorithms. Moreover, as far as we know, the ones we have presented are the first problem-dependent regret results for multi-entity MDPs. The experimental evaluation showed that our algorithms display a convergence speed significantly faster than UCB1, and that RfOCL tends to outperform AfOCL thanks to the exploitation of the additional structure. Future research directions include a deeper analysis of the adversarial setting, as well as the application to inverse reinforcement learning.

Limitations and Societal Impact. Methods that incentivize the manipulation of users' behavior can have, generally speaking, a negative societal impact when used, for instance, in a marketing campaign. Nevertheless, our work is mainly theoretical and, at the present level, can hardly be used in a malevolent way. Another relevant aspect is the cost of environment configuration. We are aware that reconfiguring the environment is an activity that typically leads to higher costs compared with policy learning. However, we did not consider this aspect in the formalization of the Non-Cooperative Conf-MDP since it would possibly make the problem more complex (like, for instance, when considering bandits with switching costs).

References

[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
[2] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 21 (NIPS), pages 89–96. Curran Associates, Inc., 2008.
[3] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349, 2013.
[4] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 551–560. PMLR, 2020.
[5] Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[6] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC), pages 61–78. ACM, 2015.
[7] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pages 35–42. AUAI Press, 2009.
[8] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), volume 382 of ACM International Conference Proceeding Series, pages 41–48. ACM, 2009.
[9] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.
[10] Michele Breton, Abderrahmane Alj, and Alain Haurie. Sequential Stackelberg equilibria in two-person games. Journal of Optimization Theory and Applications, 59(1):71–97, 1988.
[11] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In The 26th Annual Conference on Learning Theory (COLT), volume 30 of JMLR Workshop and Conference Proceedings, pages 122–134. JMLR.org, 2013.
[12] Lucian Busoniu, Robert Babuska, and Bart De Schutter. Multi-agent reinforcement learning: A survey. In Ninth International Conference on Control, Automation, Robotics and Vision (ICARCV 2006), Singapore, 5-8 December 2006, Proceedings, pages 1–6. IEEE, 2006.
[13] Benjamin Chasnov, Lillian J. Ratliff, Eric Mazumdar, and Samuel Burden. Convergence analysis of gradient-based learning in continuous games. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), volume 115 of Proceedings of Machine Learning Research, pages 935–944. AUAI Press, 2019.
[14] Kamil Andrzej Ciosek and Shimon Whiteson. OFFER: Off-environment reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1819–1825. AAAI Press, 2017.
[15] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce (EC), pages 82–90. ACM, 2006.
[16] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In 1st Annual Conference on Robot Learning (CoRL), volume 78 of Proceedings of Machine Learning Research, pages 482–495. PMLR, 2017.
[17] Víctor Gallego, Roi Naveiro, and David Ríos Insua. Reinforcement learning under threats. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 9939–9940. AAAI Press, 2019.
[18] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In The 24th Annual Conference on Learning Theory (COLT), volume 19 of JMLR Proceedings, pages 359–376. JMLR.org, 2011.
[19] Qingyu Guo, Jiarui Gan, Fei Fang, Long Tran-Thanh, Milind Tambe, and Bo An. On the inducibility of Stackelberg equilibrium for security games. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 2020–2028. AAAI Press, 2019.
[20] Garud N. Iyengar. Robust dynamic programming. Math. Oper. Res., 30(2):257–280, 2005.
[21] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2-3):209–232, 2002.
[22] Sarah Keren, Luis Enrique Pineda, Avigdor Gal, Erez Karpas, and Shlomo Zilberstein. Equi-reward utility maximizing design in stochastic environments. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4353–4360. ijcai.org, 2017.
[23] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[24] Tor Lattimore and Rémi Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems 27 (NIPS), pages 550–558, 2014.
[25] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[26] George Leitmann. On generalized Stackelberg strategies. Journal of Optimization Theory and Applications, 26(4):637–643, 1978.
[27] Alberto Maria Metelli, Emanuele Ghelfi, and Marcello Restelli. Reinforcement learning in configurable continuous environments. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4546–4555. PMLR, 2019.
[28] Alberto Maria Metelli, Guglielmo Manneschi, and Marcello Restelli. Policy space identification in configurable environments. Mach. Learn., 2021.
[29] Alberto Maria Metelli, Mirco Mutti, and Marcello Restelli. Configurable Markov decision processes. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 3488–3497. PMLR, 2018.
[30] Thanh Hong Nguyen, Rong Yang, Amos Azaria, Sarit Kraus, and Milind Tambe. Analyzing the effectiveness of adversary modeling in security games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press, 2013.
[31] Arnab Nilim and Laurent El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems 16 (NIPS), pages 839–846. MIT Press, 2003.
[32] Steven Okamoto, Noam Hazon, and Katia P. Sycara. Solving non-zero sum multiagent network flow security games with attack costs. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 879–888. IFAAMAS, 2012.
[33] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Found. Trends Robotics, 7(1-2):1–179, 2018.
[34] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 2149–2156. AAAI Press, 2019.
[35] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[36] Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, and Marcello Restelli. Online learning in non-cooperative configurable Markov decision process. AAAI-21 Workshop on Reinforcement Learning in Games, 2021.
[37] Sandhya Saisubramanian and Shlomo Zilberstein. Mitigating negative side effects via environment shaping. In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1640–1642. ACM, 2021.
[38] Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Learning to play sequential games versus unknown opponents. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[39] Rui Silva, Francisco S. Melo, and Manuela Veloso. What if the world were different? Gradient-based exploration for new optimal policies. In 4th Global Conference on Artificial Intelligence (GCAI), volume 55 of EPiC Series in Computing, pages 229–242. EasyChair, 2018.
[40] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press.
[41] Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Provably efficient online agnostic learning in Markov games. CoRR, abs/2010.15020, 2020.
[42] Andrea Tirinzoni, Riccardo Poiani, and Marcello Restelli. Sequential transfer in reinforcement learning with a generative model. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 9481–9492. PMLR, 2020.
[43] Heinrich Von Stackelberg. Market Structure and Equilibrium. Springer Science & Business Media, 2010.
[44] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems 30 (NIPS), pages 4987–4997, 2017.
[45] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. In Conference on Learning Theory (COLT), volume 125 of Proceedings of Machine Learning Research, pages 3674–3682. PMLR, 2020.
[46] Andrea Zanette, Mykel J. Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 5626–5635, 2019.
[47] Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Yin Wen, Yong Yu, and Wenxin Li. Learning to design games: Strategic environments in reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pages 3068–3074. ijcai.org, 2018.
[48] Haoqi Zhang, Yiling Chen, and David C. Parkes. A general approach to environment design with one agent. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2014, 2009.
[49] Haoqi Zhang, Yiling Chen, and David C. Parkes. A general approach to environment design with one agent. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 2002–2014, 2009.
[50] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[51] Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. CoRR, abs/1911.10635, 2019.
[52] Martin Zinkevich, Amy Greenwald, and Michael L. Littman. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems 18 (NIPS), pages 1641–1648, 2005.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See conclusion.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] The paper is mostly a theoretical contribution and we think it will not have any potential negative societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] The assumptions are all given in the main paper.
   (b) Did you include complete proofs of all theoretical results? [Yes] See appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See figures.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]