# Solving Robust MDPs through No-Regret Dynamics

Published in Transactions on Machine Learning Research (06/2024)

Etash Guha (etash.guha@sambanovasystems), SambaNova Systems, University of Washington

Reviewed on OpenReview: https://openreview.net/forum?id=SdCuffxg5A

Reinforcement learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. Designing an algorithm that finds environmentally robust policies efficiently and handles different model parameterizations, without imposing stringent assumptions on the uncertainty set of transitions, is difficult due to the intricate interactions between policy and environment. In this paper, we address both issues with a No-Regret Dynamics framework that uses policy gradient methods and iteratively approximates the worst-case environment during training, avoiding assumptions on the uncertainty set. Together with a toolbox of nonconvex online learning algorithms, we demonstrate that our framework achieves fast convergence rates in many different problem settings while relaxing assumptions on the uncertainty set of transitions.

## 1 Introduction

Reinforcement learning (RL) is a powerful subset of machine learning that enables agents to learn through trial-and-error interactions with their environment. RL has succeeded in various applications such as game playing, robotics, and finance (Sutton & Barto, 2018). However, when a trained policy operates under different environmental dynamics than the training environment, it often underperforms and achieves suboptimal rewards (Farebrother et al., 2018; Packer et al., 2018; Cobbe et al., 2018; Song et al., 2019; Raileanu & Fergus, 2021). Mitigating disastrous failures of RL-trained agents in practice can prevent many undesirable outcomes (Srinivasan et al., 2020; Choi et al., 2021).
Robust Markov Decision Processes (MDPs) have emerged as a promising solution to this problem, addressing the sensitivity of reinforcement learning to changing environments. The Robust MDP problem is that of finding a policy that maximizes some chosen objective function in the worst-case environment. By optimizing policies against hostile environmental dynamics, Robust MDPs provide a more stable approach to training agents (Nilim & Ghaoui, 2003; Bagnell et al.; Iyengar, 2005). While the Robust MDP is a powerful framework for changing environments, designing algorithms that solve such MDPs still presents significant challenges. One primary difficulty arises from the nonconvexity of many objective functions in Robust MDPs. Additionally, the intricate interactions between policy and environment in many MDPs make the problem difficult to handle. Many Value Iteration based approaches have been studied to solve Robust MDPs under a Direct Parameterization setting (Nilim & Ghaoui, 2003; Iyengar, 2005; Badrinath & Kalathil, 2021; Wiesemann et al., 2013; Roy et al., 2017; Tamar et al., 2014). However, many recent works have focused on extending these methodologies past the direct parameterization setting (Dong et al., 2022; Wang & Zou, 2022) by utilizing modern policy gradient methods. Such gradient techniques are popular for their scalability and versatility across problem settings (Williams, 1992; Sutton et al., 1999; Konda & Tsitsiklis, 1999; Kakade, 2001). Most works make convergence explicit in Robust MDPs with such gradient techniques by assuming that the set of possible adversarial environments, known as the uncertainty set, takes a specific shape, such as Rectangular sets (Dong et al., 2022), R-Contamination sets (Wang & Zou, 2022), or Wasserstein balls (Clement & Kroer, 2021). Finding algorithms that can solve Robust MDPs for general shapes of uncertainty sets is a complex and open problem.
To attack this problem more fundamentally, we want an algorithmic framework for designing algorithms that solve Robust MDPs under many different settings. Several works have found robust minima in different problem settings using a No-Regret Dynamics framework (Wang et al., 2022a; Balduzzi et al., 2018; Syrgkanis et al., 2015). Such a framework casts an algorithm for finding robust minima as a two-player game between a learner and the environment. These frameworks enjoy both versatility in design and simplicity of analysis. Most importantly, the convergence of an algorithm from this framework can be bounded in terms of the expected regrets of both players' online strategies without strong assumptions. We use a similar framework style to address both the limiting assumptions and the lack of versatility across settings. Our proposed game consists of two players: a policy player and an environment player. In each round of the game, the policy player first picks a policy, trying to maximize its objective function. The environment then observes the policy and chooses a hostile environment to minimize the policy player's objective function. The two players repeatedly play against each other, and the goal is to converge to a robust minimum, which naturally yields a policy that is robust to worst-case environments. This framework has three concrete advantages. First, in each round, the framework calls for explicitly approximating the worst-case environment. This step is critical to relaxing assumptions on the shape of the uncertainty set of environments. Second, both players can choose any online strategy, including gradient-based updates. This flexibility makes the framework applicable to a host of problem settings.
Third, the number of rounds until the algorithm converges to a robust equilibrium is bounded by the regret of the two players' chosen strategies. This makes both algorithm design and convergence analysis simple. To utilize our framework, we provide various possible online strategies for each player. Given the nonconvexity of the objective functions in Robust MDPs, we draw on the existing nonconvex online learning literature to develop our toolbox and augment it with several new nonconvex online algorithms that work well in our framework. For the first augmentation, we exploit the fact that the environment player in our framework can see the incoming loss function and develop two new algorithms, Best-Response and Follow the Perturbed Leader Plus (FTPL+), which enjoy improved regret rates thanks to this ability to peek at the policy player's chosen policy. The second augmentation addresses the fact that the nonconvex online learning literature often depends on a minimization oracle that can find a global minimizer of the nonconvex loss function (Suggala & Netrapalli, 2019). While this is impossible to build in every setting, we identify the MDP settings under which such an oracle is accessible. More specifically, we demonstrate that for both the policy and environment players, the Value Function, a common objective for Robust MDPs, exhibits gradient-dominance. Thus, Projected Gradient Descent (PGD) surprisingly suffices as a minimization oracle for both players. This choice has the added benefit of enjoying the scalability of gradient methods. Under different conditions, including simple gradient-dominance, smooth MDPs, or strongly gradient-dominated MDPs, we build different algorithms from our framework using online algorithms that take advantage of each setting. With our framework, proving convergence for each algorithm is simple yet powerful.
For example, in the most common setting, where the objective function is the Value Function, our algorithm achieves the strong convergence rate of $O(T^{-1/2})$, where $T$ is the number of iterations of our algorithm. This rate is competitive with the most recent Robust MDP algorithms in the literature. Moreover, due to the explicit approximation of the worst-case environment throughout training, we do not need an assumption on the shape of the uncertainty set; we only need the uncertainty set to be convex. We utilize our algorithmic framework to provide different algorithms for Robust MDPs with gradient-dominance, smooth MDPs, and strongly gradient-dominated MDPs. We also prove robust convergence rates with only a convexity assumption on the uncertainty set of environments. Alongside these guarantees, we run several small experiments on the convergence behavior of our algorithm where the Value Function is the optimization objective in the Grid World MDP. Our small experiments corroborate this convergence rate in Section 7.

**Contributions** Our contributions are as follows.

1. We design an algorithmic framework for generating algorithms to solve Robust MDPs based on No-Regret Dynamics. Alongside our framework, we develop novel no-regret online learning algorithms to use with it. This framework is versatile and can work for various problem settings, including direct parameterization. Moreover, due to the explicit approximation of the worst-case environment during training, we need only the uncertainty set of transitions to be convex to guarantee convergence.

2. Many of the no-regret online algorithms in our framework require access to a Minimization Oracle. We show that when the objective function of the game is gradient-dominated for both players, we can use Projected Gradient Descent as the oracle.
We also provide tools for identifying the conditions under which the gradient-dominance property of an objective function is ensured. For example, we show that when the MDP's objective function is the Value Function in the tabular setting, both players' objective functions enjoy gradient-dominance. We then prove that when the objective function is the Value Function in the direct parameterization setting, our framework can generate an algorithm that yields an $O(T^{-1/2})$ convergence rate. We empirically verify that our algorithm finds a robust policy at a rate of roughly $O(T^{-1/2})$ in the Grid World MDP across several uncertainty set shapes and sizes.

3. Finally, we demonstrate that even faster convergence rates can be obtained when the objective function is Lipschitz smooth or enjoys strong gradient-dominance, using more advanced online dynamics. To our knowledge, this is the first work to study Robust MDPs with strong gradient-dominance.

## 2 Related Works

**Robust MDPs** Some of the first algorithms to solve Robust MDPs with convergence guarantees and empirical performance rely on assumptions on the set of transition dynamics (Nilim & Ghaoui, 2003; Iyengar, 2005), Robust Policy Iteration (Mankowitz et al., 2019; Tamar et al., 2014; Sinha & Ghate, 2016), or Robust Q-Learning (Wang & Zou, 2021). After the introduction of Robust MDPs, several works studied the setting where only samples from the MDP are accessible (Wang & Zou, 2022; Badrinath & Kalathil, 2021; Roy et al., 2017; Tessler et al., 2019). Goyal & Grand-Clement (2023) generalized the rectangular uncertainty set using Factor Matrix Uncertainty Sets. Zhang et al. (2021) introduced a different adversary, where the stochastic samples from the central MDP are adversarial. Our work focuses solely on optimizing the policy when the uncertainty set is known. Regarding robust policy optimization, Mankowitz et al. (2018) proposed robust learning through policy gradient for learning temporally extended actions. Mankowitz et al.
(2019) used Bellman contractions to optimize the robust objective but provided only convergence, without sub-optimality guarantees. Dong et al. (2022) used an online approach similar to ours but relied on rectangular uncertainty set assumptions and did not utilize any no-regret dynamics. Wang et al. (2023) and Tamar et al. (2014) discussed a policy iteration approach to the Average-Reward Robust MDP using Bellman equations. Wang et al. (2022b) use a similar game framework but do not use no-regret dynamics or sophisticated online learning algorithms, achieving a worse convergence rate with weaker robustness guarantees. Wang & Zou (2022) is one of the best-known algorithms for solving Robust MDPs in this setting with policy gradient methods. Unlike our work, they rely on the uncertainty set taking the form of an R-Contamination set, while our algorithm needs no such assumption. Moreover, Clement & Kroer (2021) do approximate the worst-case environment iteratively. However, unlike our work, they utilize Wasserstein uncertainty set assumptions and focus on direct parameterizations. There are also other forms of Robust MDPs, including distributionally robust MDPs (Xu & Mannor, 2010; Ruszczyński, 2010; Shapiro, 2016; Xu & Mannor, 2009; Petrik, 2012; Ghavamzadeh et al., 2016; Chen et al., 2019), soft-robust MDPs (Buchholz & Scheftelowitsch, 2019; Lobo et al., 2020; Steimle et al., 2021), and noise robustness (Pattanaik et al., 2017; Eysenbach & Levine, 2021).

**No-Regret Learning** In the context of no-regret learning in games, McMahan & Abernethy (2013) used a similar game framework to solve linear optimization, while Wang et al. (2022a) used this framework for binary linear classification. Wang et al. (2021) used these frameworks to solve Fenchel games, recasting many optimization algorithms. Daskalakis et al.
(2017) showed that using a similar framework with online players for training GANs yields strong theoretical and empirical improvements. The classical Syrgkanis et al. (2015) uses No-Regret Dynamics to solve social welfare games. Balduzzi et al. (2018) helped show the convergence of these games even in adversarial settings and explained gradient descent as a two-player game.

## 3 Preliminaries

In this section, we present the problem setup along with some standard assumptions and definitions.

### 3.1 Notation

We denote our infinite-horizon, $\gamma$-discounted MDP as a tuple $(\mathcal{S}, \mathcal{A}, P_W, R)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P_W$ is a transition dynamics function parameterized by $W$ that maps a state-action-state triple $(s', a, s) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ to a probability between 0 and 1, and $R$ is a reward function that maps a state $s$ to a reward $r$. Here, $W$ parameterizes the transition dynamics and belongs to some bounded convex set $\mathcal{W}$. Moreover, we denote by $R_{\max}$ the largest possible value of $R$. We assume that both $\mathcal{S}$ and $\mathcal{A}$ are finite sets. We design a policy $\pi_\theta$ parameterized by a term $\theta \in \mathcal{T}$, where $\mathcal{T}$ is a bounded convex set of vectors of dimension $d$ for some constant $d$. This policy $\pi_\theta$ maps a state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ to a probability in $[0, 1]$. For most of this paper, we refer to $\pi_\theta$ as $\pi$ where clear. We do not specify the form $W$ and $\theta$ take, since that depends on the parameterization of the policy and transition dynamics. For example, under direct parameterization for both transition dynamics and policy, $W$ belongs to the product of probability simplices $\Delta(\mathcal{S})^{|\mathcal{S}| \times |\mathcal{A}|}$ and $\theta$ belongs to $\Delta(\mathcal{A})^{|\mathcal{S}|}$. We denote the value function of a state $s$ under the policy $\pi$ and transition dynamics $P_W$ as $V^\pi_W(s)$; this is the expected discounted return starting from state $s$. We define $\mu \in \Delta(\mathcal{S})$ as some probability distribution over the initial states.
Moreover, we slightly abuse notation and write $V^\pi_W(\mu) = \mu^\top V^\pi_W$, where $V^\pi_W$ is the vector of value functions over all states. We also define $d^W_{s_0}(s)$ and $d^\pi_{s_0}(s)$ as the probability distribution over the occupancy of states when a policy $\pi$ interacts with an environment with dynamics $W$ given the initial state $s_0$. Formally,

$$d^W_{s_0}(s) = d^\pi_{s_0}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \, P(s_t = s \mid s_0, \pi, W).$$

We write either $d^W_{s_0}(s)$ or $d^\pi_{s_0}(s)$ based on context. Moreover, the state visitation distribution under an initial state distribution $\mu$ is $d^\pi_\mu(s) = \mathbb{E}_{s_0 \sim \mu}[d^\pi_{s_0}(s)]$ and $d^W_\mu(s) = \mathbb{E}_{s_0 \sim \mu}[d^W_{s_0}(s)]$.

### 3.2 Robust Policies

Our main goal is to create a policy $\pi$ for which some objective function over the initial state distribution is as large as possible against the worst environment. Different formulations of Robust MDPs may use different objective functions for the robust optimization, which we denote $g(\pi, W)$. One common setup is where $g(\pi, W)$ is simply the Value Function given $\pi$ and $W$, i.e., $g(\pi, W) = V^\pi_W(\mu)$.

**Definition 3.1.** We are in the infinite-horizon setting, where the discount factor $\gamma$ is a positive constant less than 1. Given a policy $\pi$ and transition dynamics $P_W$, we define the value function recursively as

$$V^\pi_W(s) = R(s) + \gamma \sum_{a \in \mathcal{A}} \pi(a, s) \sum_{s' \in \mathcal{S}} P_W(s', a, s) \, V^\pi_W(s').$$

In this setting, the Robust MDP problem is simply the task of finding a policy $\pi^\star$ that maximizes the Value Function under worst-case transition dynamics:

$$\pi^\star = \arg\max_{\pi \in \mathcal{T}} \min_{W \in \mathcal{W}} g(\pi, W) = \arg\max_{\pi \in \mathcal{T}} \min_{W \in \mathcal{W}} V^\pi_W(\mu).$$

However, in different setups such as Control or Regularized MDPs (Bhandari & Russo, 2022), it may be desirable to find a policy solving a similar robust optimization problem with a different objective function. For example, one may wish to regularize the policy parameter, as in $g(\pi_\theta, W) = V^{\pi_\theta}_W(\mu) - \|\theta\|^2$.
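In the tabular setting, the recursion in Definition 3.1 is linear in the vector of values, so policy evaluation reduces to a single linear solve. A minimal sketch (the helper name `evaluate_policy` and the two-state toy MDP are our own illustration, not from the paper):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Tabular policy evaluation: solve V = R + gamma * P_pi @ V.

    P:  transition tensor with P[s, a, s2] = P_W(s2 | s, a)
    R:  reward vector over states, R[s]
    pi: policy matrix, pi[s, a] = pi(a | s)
    """
    S = R.shape[0]
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # (I - gamma * P_pi) V = R has a unique solution for gamma < 1.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R)

# Two-state toy MDP with a single action: state 0 moves to state 1,
# state 1 is absorbing, and only state 1 gives reward.
P = np.zeros((2, 1, 2)); P[0, 0, 1] = 1.0; P[1, 0, 1] = 1.0
R = np.array([0.0, 1.0])
pi = np.ones((2, 1))
V = evaluate_policy(P, R, pi, gamma=0.5)  # V = [1.0, 2.0]
```

Here $V(1) = 1 + 0.5\,V(1) = 2$ and $V(0) = 0.5\,V(1) = 1$, matching the recursion.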
To generalize the Robust MDP problem to all such possible MDPs, we denote the optimization problem as follows.

**Definition 3.2.** The Robust MDP problem is that of finding $\pi^\star$ such that

$$\pi^\star = \arg\max_{\pi \in \mathcal{T}} \min_{W \in \mathcal{W}} g(\pi, W).$$

To connect this more clearly to work on repeated games, we view this as finding a policy $\pi$ that minimizes the suboptimality

$$\max_{\pi' \in \mathcal{T}} \min_{W \in \mathcal{W}} g(\pi', W) - \min_{W \in \mathcal{W}} g(\pi, W).$$

Improving the robustness of the policy can similarly be seen as reducing the difference between our policy's robustness and the best policy's robustness. This second definition connects intuitively to the online learning literature's definition of regret. We use this algorithmic perspective to design a Robust Policy Optimization algorithm that minimizes the suboptimality of the learned policy with the least computational complexity possible. Similar min-max games have been solved in the past using No-Regret Dynamics, a powerful and versatile framework for algorithm design. Given its existing connection to robustness (Wang et al., 2022a), we explore how to use a No-Regret-Dynamics-based framework to solve the Robust MDP problem.

## 4 No-Regret Dynamics

In this section, we present a general No-Regret Dynamics framework. We then demonstrate how such a framework can be used to generate algorithms for Robust MDPs.

### 4.1 The General Framework

**Algorithm 1: No-Regret RL**

    Data: T
    for t in [T] do
        OL_pi chooses a policy:             pi_t <- OL_pi
        OL_W sees the policy:               OL_W <- pi_t
        OL_W chooses transition dynamics:   W_t <- OL_W
        OL_pi sees the transition dynamics: OL_pi <- W_t
        OL_W incurs loss:                   l_t(W_t)
        OL_pi incurs loss:                  h_t(pi_t)
    end

Our No-Regret framework is presented in Algorithm 1. Intuitively, we iteratively choose a policy that performs well in previously seen adversarial environments, and we iteratively look for adversarial environments. In the end, the policies produced in later rounds become more and more robust.
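The round structure of Algorithm 1 can be sketched as a short loop; the player internals are abstracted behind callables, and all names here are our own illustration rather than the paper's API:

```python
def no_regret_dynamics(policy_player, env_player, T):
    """Skeleton of the two-player game in Algorithm 1.

    policy_player(past_policies, past_envs) -> pi_t  (e.g. FTPL)
    env_player(pi_t)                        -> W_t   (e.g. Best-Response)
    """
    policies, envs = [], []
    for _ in range(T):
        pi_t = policy_player(policies, envs)  # policy player moves first
        W_t = env_player(pi_t)                # adversary sees pi_t, then moves
        policies.append(pi_t)
        envs.append(W_t)
        # Implicit losses: l_t(W_t) = g(W_t, pi_t), h_t(pi_t) = -g(W_t, pi_t).
    return policies, envs

# Trivial numeric "players", just to exercise the round structure.
pis, Ws = no_regret_dynamics(lambda ps, ws: len(ps), lambda pi: -pi, T=5)
```

Any online strategies can be dropped in for the two callables, which is exactly the flexibility the framework claims.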
Specifically, in each round $t$ of this algorithm, our policy player, called $OL_\pi$, chooses a policy $\pi_t$. Our second player is a $W$-player, called $OL_W$, which sees the policy $\pi_t$ output by the policy player and then outputs a transition dynamics $W_t$. The policy player then sees the chosen $W_t$. The policy player incurs a loss $h_t(\pi_t)$, and the environment player incurs a loss $l_t(W_t)$ corresponding to their decisions. They repeatedly play this game until an equilibrium is reached. Here, the online players $OL_\pi$ and $OL_W$ can follow any online strategy. In Wang et al. (2022a), example strategies for online players include Mirror Descent and Follow the Leader. The performance of a strategy for either online player can be measured in terms of regret. By definition, the average regret of the $W$-player is

$$\overline{\mathrm{Reg}}_W = \frac{1}{T} \sum_{t=1}^{T} l_t(W_t) - \min_{W \in \mathcal{W}} \frac{1}{T} \sum_{t=1}^{T} l_t(W).$$

Similarly, for the $\pi$-player,

$$\overline{\mathrm{Reg}}_\pi = \frac{1}{T} \sum_{t=1}^{T} h_t(\pi_t) - \min_{\pi \in \mathcal{T}} \frac{1}{T} \sum_{t=1}^{T} h_t(\pi).$$

This framework is a simple and versatile method for solving many optimization problems: we need only choose the online algorithms that the $\pi$-player and $W$-player employ and the loss functions they see. The benefit of a No-Regret Dynamics framework is that when the two players' loss functions are negatives of each other, the algorithm's convergence follows from the convergence of the two players. By setting $l_t(W_t) = g(W_t, \pi_t)$ and $h_t(\pi_t) = -g(W_t, \pi_t)$, we have Theorem 4.1.

**Theorem 4.1.** The difference between the robustness of the chosen policies and that of any policy $\pi$ is upper bounded by the regret of the two players:

$$\min_{W' \in \mathcal{W}} \frac{1}{T} \sum_{t=1}^{T} g(W', \pi) - \frac{1}{T} \sum_{t=1}^{T} \min_{W' \in \mathcal{W}} g(W', \pi_t) \le \overline{\mathrm{Reg}}_W + \overline{\mathrm{Reg}}_\pi.$$

Here, $\overline{\mathrm{Reg}}_W$ and $\overline{\mathrm{Reg}}_\pi$ are the average regrets of the two players $OL_W$ and $OL_\pi$. As Theorem 4.1 shows, the convergence of Algorithm 1 is governed by the regret of the two players. This framework is both powerful and intuitive. Say we are in the traditional Robust MDP setting where $g(W, \pi) = V^\pi_W(\mu)$.
We can get convergence guarantees for the robustness of our setting by configuring the loss functions $l_t$ and $h_t$ as follows: the $t$-th loss function for the $W$-player is $l_t(W_t) = V^{\pi_t}_{W_t}(\mu)$, and for the $\pi$-player it is $h_t(\pi_t) = -V^{\pi_t}_{W_t}(\mu)$. Setting the loss functions this way, we get a convergence guarantee based on the regret of the two players, just as in Theorem 4.1. Therefore, despite the nonconvexity, we have a simple framework for solving the Robust MDP problem. Since we approximate the worst-case environment at every round, we need only assume that the uncertainty set of transitions is convex. Moreover, we have the flexibility to generate many algorithms for different settings by choosing different online learning algorithms for both players. However, many recent online learning results require the underlying loss functions to be convex or strongly convex. Regrettably, in the simplest setting where $g(W, \pi) = V^\pi_W(\mu)$, neither property is satisfied. Therefore, we must look for efficient online learning algorithms for nonconvex loss functions. We now present a toolbox of such online learning algorithms.

### 4.2 Online Learning Toolbox

To generate an algorithm from our No-Regret framework in Algorithm 1, we need only choose strategies for both players. Here, we develop a toolbox of nonconvex online learning algorithms usable in our framework, summarized in Algorithm 2.

**Online algorithms for the $\pi$-player** First, we present two existing algorithms from Suggala & Netrapalli (2019), namely Follow the Perturbed Leader (FTPL) and Optimistic Follow the Perturbed Leader (OFTPL), which achieve strong convergence in expectation in nonconvex settings and are suitable for the $\pi$-player. At each time step, the Follow the Perturbed Leader algorithm generates a random noise vector $\sigma_t$ whose coordinates are drawn from an Exponential distribution with parameter $\eta$.
Intuitively, this noise vector helps the online learner avoid local minima. FTPL then chooses

$$x_t = \mathcal{O}_\alpha\left( \sum_{i=1}^{t-1} h_i - \langle \sigma_t, \cdot \rangle \right).$$

Here, $\mathcal{O}_\alpha$ is an Approximate Optimization Oracle that approximately minimizes the received loss function to accuracy $\alpha$. Optimistic Follow the Perturbed Leader follows a similar procedure, except it chooses $x_t = \mathcal{O}_\alpha\left( m_t + \sum_{i=1}^{t-1} h_i - \langle \sigma_t, \cdot \rangle \right)$, where $m_t$ is some optimistic function. The main dependency of their regret bounds is the existence of an Approximate Optimization Oracle $\mathcal{O}_\alpha$.

**Definition 4.1.** An optimization oracle is a function that takes some nonconvex function $f$ and returns an approximate minimizer $x'$ such that

$$f(x') - \langle \sigma, x' \rangle \le \inf_x \left[ f(x) - \langle \sigma, x \rangle \right] + \alpha.$$

Throughout this section, we assume the presence of such an oracle and discuss how to construct it later. Suggala & Netrapalli (2019) prove regret bounds for the Follow the Perturbed Leader and Optimistic Follow the Perturbed Leader algorithms when they have access to such an oracle.

**Online algorithms for the $W$-player** Note that, in our framework, the environment player can see the incoming loss function before choosing an environment, but neither of the aforementioned methods can exploit this. To address this, we prove that a "Best Response" algorithm achieves an even smaller regret upper bound. The Best Response algorithm assumes knowledge of the incoming loss function and returns the value that minimizes it. Formally, Best Response outputs $x_t = \arg\min_x l_t(x)$. This is a useful algorithm given that the $W$-player can see the output of the $\pi$-player when making its decision, greatly reducing its regret. We prove this in the following regret bound.

**Lemma 4.1.** Suppose we have an incoming sequence of loss functions $f_t$ for $t \in [T]$ and an optimization oracle that can minimize a function to error at most $\alpha$. The Best Response algorithm satisfies the regret bound

$$\frac{1}{T} \sum_{t=1}^{T} f_t(x_t) - \min_x \frac{1}{T} \sum_{t=1}^{T} f_t(x) \le \alpha.$$
**Algorithm 2: A set of useful online learning algorithms**

    Input: eta, O_alpha
    for t = 1, ..., T do
        Sample sigma in R^d with sigma_i ~ Exp(eta) for i in [d]
        FTPL:          x_t = O_alpha( sum_{i=1}^{t-1} f_i(x) - <sigma, x> )
        OFTPL[m_t]:    x_t = O_alpha( sum_{i=1}^{t-1} f_i(x) + m_t(x) - <sigma, x> )
        Best-Response: x_t = O_alpha( f_t(x) )
        FTPL+:         x_t = O_alpha( sum_{i=1}^{t} f_i(x) - <sigma, x> )
    end

We also prove the regret of one more algorithm, Follow the Perturbed Leader Plus (FTPL+). Similar to the Follow the Leader Plus algorithm from Wang et al. (2021), FTPL+ assumes knowledge of the incoming loss function and outputs the minimizer of the sum of all seen loss functions, including the incoming one, minus the noise term. Formally, FTPL+ outputs $x_t = \arg\min_{x} \sum_{i=1}^{t} l_i(x) - \langle \sigma, x \rangle$. While this algorithm achieves worse regret than Best Response, it produces more stable outcomes, which will be useful for later extensions such as Smooth Robust MDPs. In fact, FTPL+ is equivalent to OFTPL when the optimistic function $m_t = l_t$ is the true incoming loss. We present the regret of this algorithm here.

**Lemma 4.2.** Let $\eta$ be the parameter of the exponential distribution from which the noise $\sigma_t$ is drawn and $d$ the dimensionality of the set of choices $\mathcal{X}$. Moreover, denote by $D = \max_{x, x' \in \mathcal{X}} \|x - x'\|_2$ the maximum distance between any two points in $\mathcal{X}$. Assume access to an optimization oracle that yields solutions with error at most $\alpha$. Given a series of choices $x_1, \dots, x_T$ by FTPL+ with loss functions $l_1, \dots, l_T$, the expected average regret $\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} l_t(x_t) - \min_{x \in \mathcal{X}} \frac{1}{T}\sum_{t=1}^{T} l_t(x)\right]$ is upper bounded in terms of $\eta$, $d$, $D$, $\alpha$, and $T$.

We summarize these online strategies in Algorithm 2. These two algorithms are suitable choices for the environment player, and the two algorithms from Suggala & Netrapalli (2019) are suitable choices for the policy player. However, an Approximate Optimization Oracle is generally challenging to construct.
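Taking the oracle as given, one FTPL round from Algorithm 2 is a single oracle call on the noise-perturbed cumulative loss. A minimal sketch, with the oracle mocked by grid search and all names our own; whether $\eta$ is the rate or the scale of the Exponential distribution is a convention, and we treat it as the rate (mean $1/\eta$):

```python
import numpy as np

def ftpl_choice(past_losses, oracle, eta, dim, rng):
    """One FTPL step: hand the perturbed objective
    sum_i f_i(x) - <sigma, x> to the approximate oracle O_alpha."""
    sigma = rng.exponential(scale=1.0 / eta, size=dim)  # Exp(eta) noise
    perturbed = lambda x: sum(f(x) for f in past_losses) - float(sigma @ x)
    return oracle(perturbed)

# Toy check: an exact "oracle" by grid search over [-1, 1].
rng = np.random.default_rng(0)
grid = [np.array([v]) for v in np.linspace(-1.0, 1.0, 201)]
oracle = lambda g: min(grid, key=g)
# One past loss (x - 0.5)^2; the positive perturbation -sigma*x can only
# push the minimizer to the right of 0.5.
x1 = ftpl_choice([lambda x: float((x[0] - 0.5) ** 2)], oracle,
                 eta=5.0, dim=1, rng=rng)
```

In the paper's setting the grid-search oracle would be replaced by Projected Gradient Descent, as Section 5 develops.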
However, we exploit a unique characteristic of many Robust MDP variants: while the objective functions of Robust MDPs are not convex, they exhibit gradient-dominance in some settings. This property helps us design a sufficient Approximate Optimization Oracle.

## 5 Constructing the Optimization Oracle

The online learning algorithms of the previous section require access to an approximate optimization oracle, which can itself be complex to design. In this section, we first show that such an oracle can be easily constructed when the objective function is gradient-dominated. We then demonstrate that under direct parameterization, the gradient-dominance property is satisfied.

### 5.1 Gradient-Dominance

**Algorithm 3: Projected Gradient Descent**

    Data: initial guess x_0, step size beta, projection operator Proj_X
    for t = 0 to T_O - 1 do
        x_{t+1} <- Proj_X( x_t - beta * grad f(x_t) )
    end

A function is gradient-dominated if the difference between the function value at a point $x$ and the minimal function value is upper bounded on the order of the function's gradient at $x$. We formalize this in the definition below.

**Definition 5.1.** We say a function $f$ is gradient-dominated on a set $\mathcal{X}$ with constant $K > 0$ if

$$f(x) - \min_{x' \in \mathcal{X}} f(x') \le K \max_{x' \in \mathcal{X}} \langle x - x', \nabla f(x) \rangle.$$

As noted by Bhandari & Russo (2022), this gradient-dominance property is relatively common across RL settings, including Quadratic Control and direct parameterization. It is useful for proving convergence guarantees for traditional policy gradient methods (Agarwal et al., 2019). For gradient-dominated functions, one can use projected gradient descent to minimize the function $f$. We recap projected gradient descent in Algorithm 3. Namely, Bhandari & Russo (2022) show the following Lemma 5.1.
**Lemma 5.1.** Suppose $f$ is gradient-dominated with constant $K$ and $L$-Lipschitz continuous, and let the step size satisfy $\beta \le \min\left\{ \frac{1}{\sup_x \|\nabla f(x)\|_2}, 1 \right\}$, where $T_O$ is the number of iterations of Projected Gradient Descent and $D = \max_{x, x' \in \mathcal{X}} \|x - x'\|_2$. Then the sequence $x_{t+1} = \mathrm{Proj}_{\mathcal{X}}(x_t - \beta \nabla f(x_t))$ enjoys the property

$$f(x_{T_O}) - \inf_{x'} f(x') \le \sqrt{\frac{2 D^2 K^2 \left( f(x_0) - \inf_{x'} f(x') \right)}{\beta T_O}} =: c_x(T_O, K).$$

The function $c_x$ is shorthand for the bounds $c_\pi$ and $c_W$ used later. Ideally, we would like to use projected gradient descent as our Approximate Optimization Oracle for $OL_\pi$ and $OL_W$, since it is simple and utilizes scalable gradient-based techniques. However, this requires the loss functions $l_t$ and $h_t$ to be gradient-dominated. While not true in general, this is known to hold in many cases: gradient-dominance is a well-known phenomenon for the $\pi$-player when the Value Function is the objective function, as seen in Agarwal et al. (2019). We formally list the relevant conditions here.

**Condition 5.1.** We consider the following subconditions.

1. The function $\sum_{i=1}^{t} f_i(x) - \langle \sigma, x \rangle$ is gradient-dominated, enabling the use of FTPL+.
2. The function $\sum_{i=1}^{t-1} f_i(x) - \langle \sigma, x \rangle$ is gradient-dominated, enabling the use of FTPL.
3. The function $\sum_{i=1}^{t-1} f_i(x) + f_{t-1}(x) - \langle \sigma, x \rangle$ is gradient-dominated, enabling the use of OFTPL.
4. The function $f_t(x)$ is gradient-dominated, enabling the use of Best Response.

We first provide tools useful for showing when these conditions hold.

### 5.2 Tools for Demonstrating Gradient-Dominance

Here, we provide the tools to show that any of these conditions hold for the objective functions of either the $\pi$- or $W$-player. The gradient-dominance bound in Definition 5.1 contains a gradient term, and in many cases the loss function contains the Value Function. Therefore, we must know the gradients of the Value Function for both players.
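Before turning to those gradients, the PGD oracle of Algorithm 3 itself is straightforward to sketch. Under direct parameterization the feasible set is a product of probability simplices, so the projection step is a Euclidean projection onto the simplex; the sorting-based projection below is a standard construction and, like the toy quadratic, is our own illustration rather than the paper's code:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def projected_gradient_descent(grad, x0, beta, steps, project):
    """Algorithm 3: x_{t+1} = Proj_X(x_t - beta * grad(x_t))."""
    x = x0
    for _ in range(steps):
        x = project(x - beta * grad(x))
    return x

# Minimize f(x) = ||x - c||^2 over the simplex; since c lies on the
# simplex, the minimizer is c itself.
c = np.array([0.2, 0.8, 0.0])
x = projected_gradient_descent(lambda x: 2 * (x - c),
                               np.ones(3) / 3, beta=0.1, steps=200,
                               project=project_simplex)
```

For a gradient-dominated objective, Lemma 5.1 is what upgrades this simple routine into an $\alpha$-accurate oracle for the online algorithms above.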
While this is known for the policy player from the Policy Gradient Theorem (Sutton & Barto, 2018), we demonstrate a similar result for the gradient of the Value Function for the $W$-player.

**Lemma 5.2.** The gradient of the value function $V_W$ with respect to the parameter $W$ is given by

$$\nabla_W V_W(\mu) = \frac{1}{1 - \gamma} \sum_{s', a, s} d^W_\mu(s) \, \pi(a, s) \, \nabla_W P_W(s', a, s) \, V_W(s').$$

Next, we express the suboptimality of a transition dynamics parameter $W$ in terms of the gradient of the Value Function. While this is shown via the Performance Difference Lemma of Kakade & Langford (2002) for the policy player, we need a similar lemma for the transition dynamics. The Performance Difference Lemma relies on the Advantage function, so we define an analogous advantage function for the transition dynamics. Intuitively, this Advantage Function measures the value of arriving in state $s'$ relative to the current value of state $s$. Given this, we can provide an analog of the Performance Difference Lemma for the $W$-player.

**Lemma 5.3.** Given two different transition dynamics parameters $W$ and $W'$, we have

$$V_{W'}(\mu) - V_W(\mu) = \sum_{s', a, s} d^{W'}_\mu(s) \, P_{W'}(s', a, s) \, \pi(a, s) \, A_W(s', a, s).$$

Here, we define the $W$-Advantage Function as $A_W(s', a, s) = \gamma V_W(s') + r(s, a) - V_W(s)$.

We have now presented the tools necessary to demonstrate gradient-dominance in many cases. While many MDP settings exhibit gradient-dominance (Bhandari & Russo, 2022), we demonstrate how to prove gradient-dominance for both players' loss functions under direct parameterization.

### 5.3 Direct Parameterization

We now begin with the direct parameterization case with standard Robust MDPs, one of the most standard settings for MDPs. Here, $P_W(s', a, s) = W_{s', a, s}$ directly parameterizes the transition dynamics, $\pi(a, s) = \theta_{s, a}$, and $g(W, \pi) = V^\pi_W(\mu)$.
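In this tabular, direct-parameterization setting, the gradient of Lemma 5.2 can be sanity-checked numerically. Writing $V(\mu) = \mu^\top (I - \gamma P_\pi)^{-1} R$, the chain rule gives $\partial V(\mu)/\partial W_{s,a,s'} = \gamma \, \tilde d(s) \, \pi(a, s) \, V(s')$, where $\tilde d = (I - \gamma P_\pi^\top)^{-1}\mu$ is the unnormalized occupancy measure (i.e. $d^W_\mu / (1-\gamma)$); this matches the structure of Lemma 5.2 up to how the discount and the occupancy normalization are absorbed. A finite-difference sketch, with our own names and the transitions indexed as `P[s, a, s2]` for readability:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s2] = P(s2 | s, a)
R = rng.random(S)
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a] = pi(a | s)
mu = np.ones(S) / S

def v_mu(P):
    """Return (V(mu), V) for the induced Markov chain."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
    return mu @ V, V

# Analytic gradient: dV(mu)/dW[s,a,s2] = gamma * d(s) * pi(s,a) * V(s2),
# with d the unnormalized discounted occupancy measure.
_, V = v_mu(P)
P_pi = np.einsum("sa,sat->st", pi, P)
d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
grad = gamma * d[:, None, None] * pi[:, :, None] * V[None, None, :]

# Central finite-difference check on one coordinate of W.
eps = 1e-6
E = np.zeros_like(P); E[0, 1, 2] = eps
fd = (v_mu(P + E)[0] - v_mu(P - E)[0]) / (2 * eps)
```

The finite difference agrees with the analytic entry to numerical precision, which is the kind of check one would run before trusting a hand-derived robust policy gradient.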
Moreover, the set of transition dynamics is some convex bounded set, such as a rectangular uncertainty set (Dong et al., 2022). We demonstrate that in this setting, Subcondition 2 holds for the loss function of the policy player, and Subcondition 4 holds for the loss function of the $W$-player.

Lemma 5.4. Let the objective function be the Value Function. For any positive noise term $\sigma$, we have that under direct parameterization, $\sum_{i=1}^{t-1} l_i(\cdot) - \langle \sigma, \cdot \rangle$ and $\sum_{i=1}^{t-1} h_i(\cdot) - \langle \sigma, \cdot \rangle$ are both gradient-dominated for the $W$-player and the $\pi$-player on sets $\mathcal{W}$ and $\mathcal{T}$ with constants $K_W = \frac{1}{1-\gamma}$ and $K_\pi = \frac{1}{1-\gamma}$ respectively. Therefore, Subcondition 2 holds for the loss functions of both players.

Algorithm 4: Algorithm under Direct Parameterization
Data: $T$
for $t \in [T]$ do
    Sample $\sigma_t \sim \mathrm{Exp}(\eta)$;
    $\pi_t \leftarrow O_\alpha\big(-\sum_{i=1}^{t-1} V^{\pi}_{W_i}(\mu) - \langle \sigma_t, \pi \rangle\big)$  /* Follow the Perturbed Leader */
    $W_t \leftarrow O_\alpha\big(V^{\pi_t}_{W}(\mu)\big)$  /* Best Response */
end

Now, we have shown that when the objective function is the Value Function, the loss functions for the policy player satisfy Subcondition 2. Therefore, we can use Follow the Perturbed Leader for $OL_\pi$. Next, we wish to show that Subcondition 4 holds for the loss function of the $W$-player.

Lemma 5.5. Under direct parameterization, $V_W(\mu)$ is gradient-dominated with constant $K_W$: for an arbitrary $W \in \mathcal{W}$ and the optimal parameter $W^\star \in \mathcal{W}$, we have
$$V_W(\mu) - V_{W^\star}(\mu) \le -K_W \min_{\bar{W} \in \mathcal{W}} \big\langle \bar{W} - W, \nabla_W V_W(\mu) \big\rangle.$$
Here, we have that Subcondition 4 holds for the $W$-player. Now that Subcondition 4 holds for the $W$-player, we know we can use Best Response for $OL_W$. Thus, in the direct parameterization setting with standard Robust MDPs, we can use our framework from Algorithm 1 with Follow the Perturbed Leader for $OL_\pi$ and Best Response for $OL_W$, where both use Projected Gradient Descent as their Approximate Optimization Oracle. We present our algorithm in Algorithm 4.
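The alternating structure of Algorithm 4 can be sketched as below. All names here are illustrative: `value_fn` stands in for $V^\pi_W(\mu)$, and an exact vertex oracle over a toy two-action "game" stands in for the Approximate Optimization Oracle, since the toy losses are linear over the simplex.

```python
import numpy as np

def ftpl_best_response(value_fn, pi_oracle, W_oracle, T, eta, rng):
    """Sketch of Algorithm 4: FTPL for the policy player, Best Response for
    the environment player. value_fn(W, pi) returns the value of pi under W;
    pi_oracle(f) / W_oracle(f) approximately minimize f over their sets."""
    pis, Ws = [], []
    W_prev = W_oracle(lambda W: 0.0)            # arbitrary initial environment
    for t in range(T):
        sigma = rng.exponential(eta, size=2)    # FTPL perturbation, Exp(eta)
        past = list(Ws) + [W_prev]
        # policy player: maximize the perturbed sum of past values
        pi_t = pi_oracle(lambda p: -(sum(value_fn(W, p) for W in past)
                                     + float(sigma @ p)))
        # environment player: best-respond by minimizing the current policy's value
        W_t = W_oracle(lambda W: value_fn(W, pi_t))
        pis.append(pi_t); Ws.append(W_t); W_prev = W_t
    return pis, Ws

# Toy instance: a 2x2 matrix payoff standing in for V^pi_W(mu); both players
# pick simplex vertices, so checking vertices is an exact oracle here.
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
value_fn = lambda W, p: float(p @ M @ W)
vertices = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
vertex_oracle = lambda f: min(vertices, key=f)
rng = np.random.default_rng(0)
pis, Ws = ftpl_best_response(value_fn, vertex_oracle, vertex_oracle,
                             T=10, eta=1.0, rng=rng)
```

In the paper's actual setting, the vertex oracle would be replaced by Projected Gradient Descent and `value_fn` by a policy evaluation routine.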
We can now prove the convergence and robustness of our framework using our proof framework.

5.4 Overall Convergence Rates

Now, if the policy player employs FTPL and the environment player employs Best Response, we get the following convergence bound when the objective function is the Value Function.

Theorem 5.1. Assume that the set of possible transition matrices $\mathcal{W}$ is convex. Let $L_\pi$ be the smoothness constant of $h_t$ with respect to the $\ell_1$ norm, $D_\pi = \max_{\pi, \pi' \in \mathcal{T}} \|\pi - \pi'\|_2$, and $d_\pi$ be the dimension of the input for the $\pi$-player. Let $OL_\pi$ use FTPL and $OL_W$ use Best Response. Under direct parameterization, when $\eta = \frac{1}{L_\pi \sqrt{T d_\pi}}$, the robustness of the set of trained policies satisfies, for any $\pi$,
$$\min_{W \in \mathcal{W}} \frac{1}{T}\sum_{t=0}^{T-1} V^{\pi}_{W}(\mu) - \mathbb{E}\Big[\min_{W \in \mathcal{W}} \frac{1}{T}\sum_{t=0}^{T-1} V^{\pi_t}_{W}(\mu)\Big] \le O\Big(\frac{1}{\sqrt{T}}\Big) + c_\pi(T_O, K_\pi) + c_W(T_O, K_W).$$
We formalize this in Algorithm 4. Here, we have shown that in this simple setting, the robustness in expectation is $O(T^{-1/2})$ under simple gradient-dominance assumptions.

5.5 Comparison to Existing Rates and Algorithms

Our resulting algorithm is similar to that of Wang et al. (2022b): both iteratively optimize the policy and the transition matrix in turns. However, their policy update is a single step of Projected Gradient Descent, whereas our framework utilizes different updates for the policy player. With our framework, our method achieves a stronger convergence rate under less stringent assumptions. Moreover, Wang & Zou (2022) update the policy on each step using only the gradient of the worst-case Value Function, and their method relies on stringent assumptions on the uncertainty set to make that gradient computable. Clement & Kroer (2021) is similar in structure to Wang et al. (2022b) in that it iteratively updates both the policy and the environment by directly optimizing the objective function. Instead, our method uses different updates, such as OFTPL.
Overall, our algorithm differs from existing algorithms by optimizing both the policy and environment and exploring different updates for each, resulting in better convergence rates with weaker assumptions. Both our algorithm and Wang & Zou (2022) achieve rates of roughly $O(T^{-1/2})$. However, they assume the uncertainty set is an R-Contamination set. Moreover, Clement & Kroer (2021) achieve a slightly faster rate, but their analysis is limited to the tabular setting and relies on the uncertainty set being a Wasserstein ball. To our knowledge, these are the fastest rates in the literature that solve Robust MDPs with gradient-based updates. Thus, our work roughly meets or comes near the fastest rates for solving Robust MDPs while requiring the least stringent assumptions. However, given some additional properties, it may be possible to improve this bound. We investigate this in the following sections.

6 Faster Rates

Our framework can be applied with only the gradient-dominance property of each player's loss function. However, under different assumptions, the algorithm's convergence rate can be improved by utilizing different properties. The two properties we investigate are smoothness and strong gradient-dominance.

6.1 Smoothness

As in Agarwal et al. (2019), for the direct parameterization, the Value Function is smooth with respect to the $\pi$-player:
$$\big\| \nabla V^{\pi}_W(\mu) - \nabla V^{\pi'}_W(\mu) \big\| = O\Big(\frac{2\gamma R_{\max} |A|}{(1-\gamma)^2}\Big) \|\pi - \pi'\|_2.$$

Algorithm 5: Algorithm under Smoothness
Data: $T$
for $t \in [T]$ do
    Sample $\sigma_t \sim \mathrm{Exp}(\eta)$;
    $\pi_t \leftarrow O_\alpha\big(-\sum_{i=1}^{t-1} V^{\pi}_{W_i}(\mu) - \langle \sigma_t, \pi \rangle - V^{\pi}_{W_{t-1}}(\mu)\big)$  /* Optimistic Follow the Perturbed Leader */
    $W_t \leftarrow O_\alpha\big(\sum_{i=1}^{t} V^{\pi_i}_{W}(\mu) - \langle W, \sigma_t \rangle\big)$  /* Follow the Perturbed Leader Plus */
end

If our objective function is the value function and we can show that the difference of value functions under different transition dynamics is smooth, we can improve our bounds by exploiting this smoothness.
Formally, such smoothness takes the form of Condition 6.1.

Condition 6.1. The difference in value functions between subsequent rounds is smooth with respect to policies, such that for all $s \in S$ and policies $\pi$ and $\pi'$, we have
$$\Big| \big[V^{\pi}_{W}(\mu) - V^{\pi}_{W'}(\mu)\big] - \big[V^{\pi'}_{W}(\mu) - V^{\pi'}_{W'}(\mu)\big] \Big| \le L\, \|W - W'\|_1\, \|\pi - \pi'\|_1.$$

In general, it is difficult to show that such an $L$ is smaller than the Lipschitz constant of the Value Function, $\frac{2\gamma R_{\max}|A|}{(1-\gamma)^2}$. However, if we are in a setting where $L$ is small, we can take advantage of this by using Optimistic Follow the Perturbed Leader for the $\pi$-player, where the optimistic guess is simply the last loss function $h_{t-1}$. If Condition 6.1 holds, then $h_t - h_{t-1}$ is very smooth, and we can directly improve our rate of convergence. However, from the formulation of Condition 6.1, we also need to choose an online strategy for the environment such that $\|W - W'\|_1$ is bounded, which we can demonstrate is the case if the $W$-player uses FTPL+. We summarize these algorithmic choices in Algorithm 5. In the direct parameterization setting with the traditional objective function, we also show that the loss functions seen by OFTPL for the policy player and by FTPL+ satisfy our gradient-dominance properties; these conditions hold as shown in the appendix (see Lemma B.1 and Lemma B.2). Furthermore, in this setting, we have the following robustness bound. Indeed, if $L$ is small, we can improve the convergence and robustness of our trained algorithm.

Theorem 6.1. Assume that the set of possible transition matrices $\mathcal{W}$ is convex and we are in the direct parameterization setting. Let $OL_\pi$ use OFTPL and $OL_W$ use FTPL+.
Given that Condition 6.1 holds, when setting $\eta = \sqrt{20 L_\pi (d_W D_W + d_\pi D_\pi)}$, the robustness of the algorithm satisfies, for any $\pi$,
$$\min_{W \in \mathcal{W}}\frac{1}{T}\sum_{t=0}^{T-1} V^{\pi}_{W}(\mu) - \mathbb{E}\Big[\min_{W \in \mathcal{W}}\frac{1}{T}\sum_{t=0}^{T-1} V^{\pi_t}_{W}(\mu)\Big] \le O\Big(\frac{D_\pi(d_W D_W + d_\pi D_\pi)\sqrt{5 L_\pi T\, c_\pi(T_O, K_\pi) + 1}}{T}\Big) + c_\pi(T_O, K_\pi) + c_W(T_O, K_W).$$

6.2 Strong Gradient-Dominance

We now explore another extension: strong gradient-dominance. In specific parameterizations, the objective functions for Robust MDPs obey strong gradient-dominance, famously known as the Polyak-Łojasiewicz condition (Polyak, 1963). This condition helps improve convergence in nonconvex settings, analogous to the strong convexity condition. We formally state this property in Condition 6.2.

Condition 6.2. A function $a(x)$ satisfies $(K, \delta)$-strong gradient-dominance if for any point $x \in \mathcal{X}$,
$$\min_{x' \in \mathcal{X}} a(x') \ge a(x) + \min_{x' \in \mathcal{X}} \Big[ K \langle x' - x, \nabla a(x) \rangle + \delta \|x' - x\|_2^2 \Big].$$

In general, this is useful for achieving even tighter optimization bounds. Indeed, from Karimi et al. (2016), under this condition Projected Gradient Descent enjoys better convergence.

Lemma 6.1. Let $x^\star = \arg\min_{x' \in \mathcal{X}} a_t(x')$ be the minimizer of $a_t$; $c^s_x$ is used for brevity. Given that $a_t$ is $L$-Lipschitz continuous and $(K, \delta)$-strongly dominated, projected gradient descent attains global linear convergence:
$$a_t(x_k) - a_t(x^\star) \le \Big(1 - \frac{\delta}{K^2 L}\Big)^k \big(a_t(x_0) - a_t(x^\star)\big) = c^s_x(k, \delta, K).$$

6.3 Direct Parameterization with Regularization

Say we are in a setting where we want to maximize $V^\pi_W$ while ensuring that the policy $\pi$ is regularized according to the $\ell_2$ norm. We redefine the objective to be $g(W, \pi) = V^\pi_W(\mu) - \|\pi\|_2^2$. Again, the sets $\mathcal{W}$ and $\mathcal{T}$ are bounded convex sets in this setting.

Algorithm 6: Algorithm under Regularization
Data: $T$
for $t \in [T]$ do
    Sample $\sigma_t \sim \mathrm{Exp}(\eta)$;
    $\pi_t \leftarrow O_\alpha\big(-\sum_{i=1}^{t-1}\big(V^{\pi}_{W_i}(\mu) - \|\pi\|_2^2\big) - \langle \sigma_t, \pi \rangle\big)$  /* Follow the Perturbed Leader */
    $W_t \leftarrow O_\alpha\big(V^{\pi_t}_{W}(\mu) - \|\pi_t\|_2^2\big)$  /* Best Response */
end

Lemma 6.2.
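The linear convergence of Lemma 6.1 can be seen on a simple function that satisfies strong gradient-dominance: a strongly convex quadratic (which satisfies the Polyak-Łojasiewicz condition). The instance below is a toy illustration, not one of the paper's objectives.

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A positive definite is strongly convex, hence
# PL / strongly gradient-dominated; gradient descent should shrink the
# suboptimality gap by a fixed geometric factor each step (cf. Lemma 6.1).
A = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ A @ x        # minimum value 0 at x* = 0
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
beta = 0.2                           # step size below 2 / L with L = 4
gaps = [f(x)]
for _ in range(50):
    x = x - beta * grad(x)
    gaps.append(f(x))

# every step multiplies the gap by at most a constant factor < 1
ratios = [b / a for a, b in zip(gaps, gaps[1:]) if a > 0]
```

Here each coordinate contracts by a fixed factor per step, so the per-step gap ratio is bounded away from 1, matching the $(1 - \delta/(K^2 L))^k$ behavior in the lemma.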
We have that $g(W, \pi) = V^\pi_W(\mu) - \|\pi\|_2^2$ is $(K_\pi, \frac{T}{2})$-strongly gradient-dominated on the set $\mathcal{T}$.

In this setting, if the $\pi$-player employs FTPL, where its Approximate Optimization Oracle enjoys even better convergence, and the $W$-player employs Best Response, we have the robustness bound below. We summarize this algorithm in Algorithm 6. Indeed, given strong gradient-dominance, the dependence on the complexity of the Optimization Oracle is better for the $\pi$-player, slightly improving the robustness bounds.

Theorem 6.2. We are in the direct parameterization setting where the objective function is the regularized Value Function. Assume that the set of possible transition matrices $\mathcal{W}$ is convex. Let $OL_\pi$ use FTPL and $OL_W$ use Best Response. Given that Condition 6.2 holds and $\eta = \frac{1}{L_\pi \sqrt{T d_\pi}}$, the robustness of Algorithm 1 satisfies, for any $\pi$,
$$\frac{1}{T}\sum_{t=0}^{T-1}\big(V^{\pi}_{W}(\mu) - \|\pi\|_2^2\big) - \mathbb{E}\Big[\min_{W \in \mathcal{W}} \frac{1}{T}\sum_{t=0}^{T-1}\big(V^{\pi_t}_{W}(\mu) - \|\pi_t\|_2^2\big)\Big] \le O\Big(\frac{1}{\sqrt{T}}\Big) + c^s_\pi(T_O, K_\pi, \delta) + c_W(T_O, K_W).$$

7 Experiments

We now explore how our algorithm finds robust minima empirically. We use Algorithm 4 to optimize a policy in the Grid World MDP (Sutton & Barto, 2018). This is a traditional MDP where the world is a grid, the initial state is one corner of the grid, and the goal state is the opposite corner. At each step, the policy can take any of four actions, and the next state is sampled from the transition matrix. If the policy lands in the goal state, it receives a reward of 10 and the MDP terminates; otherwise, it receives a reward of $-1$. We experiment with a grid 5 states tall and 5 states wide. We wish to measure how quickly the robustness of the policy improves through each iteration of our algorithm. As a robustness metric, given a policy, we choose the transition matrix that minimizes the expected reward of the initial state and report that expected reward. We do this for every iteration of our algorithm.
We do this over different adversarial transition matrix sets. The sets in question are $\mathcal{T} = \{T \text{ s.t. } \|T - T_0\|_q \le \tau\}$.

Figure 1 (panels (a)-(h): $q \in \{1, 2\}$, $\tau \in \{.1, .2, .3, .5\}$): We plot the convergence of our algorithm over many different transition matrix uncertainty set shapes. We see that over all shapes, our algorithm converges at roughly the $\frac{1}{\sqrt{T}}$ rate predicted by our results. The convergence curves of DRPG and our algorithm are very similar.

Here, $T_0$ is a randomly generated initial transition matrix, $q$ is a hyperparameter affecting the shape of the transition set, and $\tau$ is the radius of the transition set. We demonstrate robustness improvement over several values of $\tau$ and $q$. We plot the convergence of our algorithm over $q \in \{1, 2\}$ and $\tau \in \{.1, .2, .3, .5\}$ in Figure 1. As a baseline, we provide the convergence results for Double-Loop Robust Policy Gradient (DRPG) from Wang et al. (2022b). We see in these convergence plots that our algorithm roughly achieves a convergence rate of $O(T^{-1/2})$ across all uncertainty set shapes and radii. This corroborates the theoretical results on our framework's versatility and efficiency in different environmental settings. When comparing our results to the baseline, we see that our method and DRPG achieve similar convergence curves. Overall, these results suggest that, empirically, our algorithm performs competitively with existing algorithms.

7.1 Implementation of the Practical Algorithm

We have provided a GitHub repository to reproduce our experiments. We use the OpenAI Gym environment to set up our experiments.
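A simple feasibility step for the uncertainty sets $\{T : \|T - T_0\|_q \le \tau\}$ used above can be sketched as follows. This is a simplified illustration: for $q = 1$ it rescales the deviation rather than computing the exact Euclidean projection, and a full implementation would additionally need to keep each row of $T$ a probability distribution.

```python
import numpy as np

def pull_into_uncertainty_set(T, T0, tau, q):
    """Rescale the deviation T - T0 so that ||T - T0||_q <= tau (q in {1, 2}).

    Simplified sketch: exact projection for q = 2; a feasibility step (not
    the exact projection) for q = 1; ignores row-stochasticity of T.
    """
    d = (T - T0).ravel()
    if q == 2:
        n = np.linalg.norm(d)
    elif q == 1:
        n = np.abs(d).sum()
    else:
        raise ValueError("only q in {1, 2} handled in this sketch")
    if n > tau:
        d = d * (tau / n)              # shrink toward T0 until feasible
    return T0 + d.reshape(T.shape)

T0 = np.zeros((2, 2))
T = np.ones((2, 2))                     # deviation has l2 norm 2, l1 norm 4
T_proj2 = pull_into_uncertainty_set(T, T0, tau=1.0, q=2)
T_proj1 = pull_into_uncertainty_set(T, T0, tau=1.0, q=1)
```

In a projected-gradient loop over the environment player, a step like this would follow each gradient update on $T$.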
For all optimization tasks, we utilize the constrained optimization libraries in SciPy. Moreover, implementing projected gradient descent unfortunately requires manually deriving the gradient calculation for each problem setting.

8 Discussion and Limitations

In this paper, we developed a nonconvex No-Regret framework that decouples the convergence analysis for Robust MDP algorithms. Based on the proposed framework, we designed different Robust MDP algorithms under standard gradient-dominance, strong gradient-dominance, and smooth MDPs. The proven convergence results are some of the strongest in the literature, with only a convexity assumption on the set of possible transition matrices. Possible extensions include using this nonconvex No-Regret framework for other nonconvex problems, such as other nonconvex games, or exploring how different minimization oracles could empirically improve the performance of our algorithms. Another possible avenue is studying Robust MDPs where the optimization objective obeys the strict saddle property.

Limitations. However, our work has several limitations. Firstly, we require some gradient-dominance conditions for the policy and environmental dynamics players. In many settings, the objective function for the Robust MDP does not satisfy this, which reduces the scope of our work. Moreover, many of our convergence guarantees hold only in expectation, while many other bounds in the literature hold absolutely. Finally, we do not produce a single policy that achieves strong robustness but, instead, a series of policies that, if used together, obey our convergence guarantees.

References

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. CoRR, abs/1908.00261, 2019. URL http://arxiv.org/abs/1908.00261.

Kishan Panaganti Badrinath and Dileep Kalathil.
Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 511-520. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/badrinath21a.html.

J. Andrew Bagnell, Andrew Y. Ng, and Jeff G. Schneider. Solving uncertain Markov decision processes.

David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354-363. PMLR, 2018.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods, 2022.

Peter Buchholz and Dimitri Scheftelowitsch. Light robustness in the optimization of Markov decision processes with uncertain parameters. Computers & Operations Research, 108:69-81, 2019.

Zhi Chen, Pengqian Yu, and William B. Haskell. Distributionally robust optimization for sequential decision-making. Optimization, 68(12):2397-2426, 2019.

Jason J. Choi, Donggun Lee, Koushil Sreenath, Claire J. Tomlin, and Sylvia L. Herbert. Robust control barrier-value functions for safety-critical control, 2021.

Julien Grand-Clement and Christian Kroer. First-order methods for Wasserstein distributionally robust MDP. In International Conference on Machine Learning, pp. 2010-2019. PMLR, 2021.

Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. CoRR, abs/1812.02341, 2018. URL http://arxiv.org/abs/1812.02341.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

Jing Dong, Jingwei Li, Baoxiang Wang, and Jingzhao Zhang. Online policy optimization for robust MDP. arXiv preprint arXiv:2209.13841, 2022.

Benjamin Eysenbach and Sergey Levine.
Maximum entropy RL (provably) solves some robust RL problems. arXiv preprint arXiv:2103.06257, 2021.

Jesse Farebrother, Marlos C. Machado, and Michael Bowling. Generalization and regularization in DQN. CoRR, abs/1810.00123, 2018. URL http://arxiv.org/abs/1810.00123.

Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. Advances in Neural Information Processing Systems, 29, 2016.

Vineet Goyal and Julien Grand-Clement. Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research, 48(1):203-226, 2023.

Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, 2005.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pp. 267-274, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1558608737.

Sham M. Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. CoRR, abs/1608.04636, 2016. URL http://arxiv.org/abs/1608.04636.

Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.

Elita A. Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for handling model misspecification. CoRR, abs/2011.14495, 2020. URL https://arxiv.org/abs/2011.14495.

Daniel J. Mankowitz, Timothy A. Mann, Pierre-Luc Bacon, Doina Precup, and Shie Mannor. Learning robust options. CoRR, abs/1802.03236, 2018. URL http://arxiv.org/abs/1802.03236.

Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy A. Mann, Todd Hester, and Martin A. Riedmiller.
Robust reinforcement learning for continuous control with model misspecification. CoRR, abs/1906.07516, 2019. URL http://arxiv.org/abs/1906.07516.

Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. Advances in Neural Information Processing Systems, 26, 2013.

Arnab Nilim and Laurent Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In S. Thrun, L. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. URL https://proceedings.neurips.cc/paper_files/paper/2003/file/300891a62162b960cf02ce3827bb363c-Paper.pdf.

Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. CoRR, abs/1810.12282, 2018. URL http://arxiv.org/abs/1810.12282.

Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632, 2017.

Marek Petrik. Approximate dynamic programming by minimizing distributionally robust bounds. arXiv preprint arXiv:1205.1782, 2012.

B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864-878, 1963. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(63)90382-3. URL https://www.sciencedirect.com/science/article/pii/0041555363903823.

Roberta Raileanu and Rob Fergus. Decoupling value and policy for generalization in reinforcement learning. CoRR, abs/2102.10330, 2021. URL https://arxiv.org/abs/2102.10330.

Aurko Roy, Huan Xu, and Sebastian Pokutta. Reinforcement learning under model mismatch. Advances in Neural Information Processing Systems, 30, 2017.

Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125:235-261, 2010.

Alexander Shapiro. Rectangular sets of probability measures.
Operations Research, 64(2):528-541, 2016.

Saumya Sinha and Archis Ghate. Policy iteration for robust nonstationary Markov decision processes. Optimization Letters, 10:1613-1628, 2016.

Xingyou Song, Yiding Jiang, Stephen Tu, Yilun Du, and Behnam Neyshabur. Observational overfitting in reinforcement learning. CoRR, abs/1912.02975, 2019. URL http://arxiv.org/abs/1912.02975.

Krishnan Srinivasan, Benjamin Eysenbach, Sehoon Ha, Jie Tan, and Chelsea Finn. Learning to be safe: Deep RL with a safety critic. CoRR, abs/2010.14603, 2020. URL https://arxiv.org/abs/2010.14603.

Lauren N. Steimle, David L. Kaufman, and Brian T. Denton. Multi-model Markov decision processes. IISE Transactions, 53(10):1124-1139, 2021.

Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. CoRR, abs/1903.08110, 2019. URL http://arxiv.org/abs/1903.08110.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games, 2015.

Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pp. 181-189. PMLR, 2014.

Chen Tessler, Yonathan Efroni, and Shie Mannor. Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pp. 6215-6224. PMLR, 2019.

Guanghui Wang, Rafael Hanashiro, Etash Guha, and Jacob Abernethy. On accelerated perceptrons and beyond. arXiv preprint arXiv:2210.09371, 2022a.
Jun-Kun Wang, Jacob D. Abernethy, and Kfir Y. Levy. No-regret dynamics in the Fenchel game: A unified framework for algorithmic convex optimization. CoRR, abs/2111.11309, 2021. URL https://arxiv.org/abs/2111.11309.

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. On the convergence of policy gradient in robust MDPs. arXiv preprint arXiv:2212.10439, 2022b.

Yue Wang and Shaofeng Zou. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193-7206, 2021.

Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. In International Conference on Machine Learning, pp. 23484-23526. PMLR, 2022.

Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, and Shaofeng Zou. Robust average-reward Markov decision processes. arXiv preprint arXiv:2301.00858, 2023.

Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153-183, 2013.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

Huan Xu and Shie Mannor. Parametric regret in uncertain Markov decision processes. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp. 3606-3613. IEEE, 2009.

Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. Advances in Neural Information Processing Systems, 23, 2010.

Huan Zhang, Hongge Chen, Duane S. Boning, and Cho-Jui Hsieh. Robust reinforcement learning on state observations with learned optimal adversary. CoRR, abs/2101.08452, 2021. URL https://arxiv.org/abs/2101.08452.
Symbol : Meaning
$\gamma$ : Discount Factor
$S$ : Set of States
$A$ : Set of Actions
$P_W$ : Transition Matrix parameterized by $W$
$R$ : Reward Function
$W$ : Parameter for Environment
$R_{\max}$ : Maximum Reward
$\theta$ : Parameter for the Policy
$\mathcal{T}$ : Set of Policy Parameters
$\pi_\theta, \pi$ : Policy
$\mu$ : Distribution over Starting States
$V^\pi_W$ : Value Function
$d^W_{s_0}, d^\pi_{s_0}$ : Occupancy Measure of a State
$g$ : Objective Function of the Robust MDP
$l_t$ : Loss Function for the Environment Player
$h_t$ : Loss Function for the Policy Player
$\mathrm{Reg}_W$ : Regret of the Environment Player
$\mathrm{Reg}_\pi$ : Regret of the Policy Player
$O_\alpha$ : Optimization Oracle
$\sigma$ : Noise Vector for Online Algorithms
$\eta$ : Parameter for the Noise Distribution
$K_W, K_\pi$ : Gradient Dominance Constants for $W$ and $\pi$
$T_O$ : Number of Iterations for the Optimization Oracle
$c_x(T_O, K)$ : Suboptimality of Projected Gradient Descent
$L_\pi$ : Lipschitz Smoothness Constant of $h_t$
$D_\pi$ : Radius of the Set of Policies
$d_\pi$ : Dimension of $\theta$
$c_W(T_O, K_W)$ : Suboptimality of Projected Gradient Descent for the Environment Player
$c_\pi(T_O, K_\pi)$ : Suboptimality of Projected Gradient Descent for the Policy Player
$L$ : Lipschitz Constant of the Difference of Objective Functions
$\delta$ : Constant for Strong Gradient Dominance
$c^s_x(T_O, K)$ : Suboptimality of Projected Gradient Descent under Strong Gradient Dominance

Table 1: Table of Notation

A Proof of Preliminary Theorems

A.1 Proof of Theorem 4.1

Theorem 4.1. The difference between the robustness of the chosen policies and that of any policy $\pi$ is upper bounded by the regret of the two players:
$$\frac{1}{T} \min_{W' \in \mathcal{W}} \sum_{t=0}^{T-1} g(W', \pi) - \frac{1}{T} \min_{W' \in \mathcal{W}} \sum_{t=0}^{T-1} g(W', \pi_t) \le \overline{\mathrm{Reg}}_W + \overline{\mathrm{Reg}}_\pi.$$
Here, $\overline{\mathrm{Reg}}_W$ and $\overline{\mathrm{Reg}}_\pi$ are the two average regrets of the players $OL_W$ and $OL_\pi$.

Proof.
By definition, the regret of the $W$-player satisfies
$$\overline{\mathrm{Reg}}_W = \frac{1}{T}\sum_{t=0}^{T-1} l_t(W_t) - \min_{W'} \frac{1}{T}\sum_{t=0}^{T-1} l_t(W') = \frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi_t) - \min_{W'} \frac{1}{T}\sum_{t=0}^{T-1} g(W', \pi_t).$$
Similarly, for the $\pi$-player, we have
$$\overline{\mathrm{Reg}}_\pi = \max_{\pi'} \frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi') - \frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi_t). \tag{1}$$
Therefore, we can upper bound the average objective value throughout training as
$$\frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi_t) = \overline{\mathrm{Reg}}_W + \min_{W'} \frac{1}{T}\sum_{t=0}^{T-1} g(W', \pi_t).$$
We similarly lower bound it as
$$\frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi_t) = \max_{\pi'} \frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi') - \overline{\mathrm{Reg}}_\pi \ge \frac{1}{T}\sum_{t=0}^{T-1} g(W_t, \pi) - \overline{\mathrm{Reg}}_\pi \ge \min_{W'} \frac{1}{T}\sum_{t=0}^{T-1} g(W', \pi) - \overline{\mathrm{Reg}}_\pi. \tag{2}$$
Combining Equation (1) and Equation (2), we have our desired statement:
$$\frac{1}{T}\min_{W'}\sum_{t=0}^{T-1} g(W', \pi) - \frac{1}{T}\min_{W'}\sum_{t=0}^{T-1} g(W', \pi_t) \le \overline{\mathrm{Reg}}_W + \overline{\mathrm{Reg}}_\pi.$$

B Proofs of Gradient Dominance

B.1 Proof of Lemma 5.2

Lemma 5.2. The gradient of the value function $V_W$ with respect to the parameter $W$, denoted $\nabla_W V_W(s)$, is given by
$$\nabla_W V_W(s) = \frac{1}{1-\gamma}\sum_{s',a,s} d^W_\mu(s)\,\pi(a,s)\,\nabla P_W(s',a,s)\,V_W(s').$$

Proof. We wish to calculate the gradient of the value function $V_W$ with respect to $P_W(s', a, s)$. We have
$$\nabla_W V_W(s) = \nabla_W \sum_a \pi(a, s) Q^\pi(s, a) = \sum_a \pi(a, s) \nabla_W Q^\pi(s, a)$$
$$= \sum_a \pi(a, s) \nabla_W \Big[ R(s, a) + \gamma \sum_{s'} P_W(s', a, s) V^\pi(s') \Big] \tag{3}$$
$$= \sum_a \gamma \pi(a, s) \sum_{s'} \big[ \nabla_W P_W(s', a, s)\, V^\pi(s') + P_W(s', a, s)\, \nabla_W V^\pi(s') \big]$$
$$= \sum_{a, s'} \gamma \pi(a, s)\, P_W(s', a, s)\, \nabla_W V^\pi(s') + \sum_{s', a} \gamma \pi(a, s)\, \nabla_W P_W(s', a, s)\, V^\pi(s').$$
Here, Equation (3) comes from the definition of the Q function. Unrolling this recursion, we have
$$\nabla_W V_W(\mu) = \sum_{s', a, s} \sum_{t} \Pr(s_t = s \mid \mu)\, \gamma^t\, \pi(a, s)\, \nabla P_W(s', a, s)\, V_W(s') = \frac{1}{1-\gamma} \sum_{s', a, s} d^W_\mu(s)\, \pi(a, s)\, \nabla P_W(s', a, s)\, V_W(s').$$
We have now arrived at our desired quantity.

B.2 Proof of Lemma 5.3

Lemma 5.3. Given two different transition dynamics parameters $W$ and $W'$, we have that
$$V_{W'}(\mu) - V_W(\mu) = \frac{1}{1-\gamma}\sum_{s',a,s} d^{W'}_\mu(s)\, P_{W'}(s',a,s)\,\pi(a,s)\, A_W(s',a,s).$$
Here, we define the $W$-Advantage Function as $A_W(s', a, s) = \gamma V_W(s') + r(s, a) - V_W(s)$.

Proof. This proof follows mainly from the proof of the Performance Difference Lemma from Kakade & Langford (2002).
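The regret decomposition in the proof above can be sanity-checked numerically: on a random finite game with arbitrary play sequences, the robustness gap of Theorem 4.1 never exceeds the sum of the two players' average regrets. The instance below is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nW, nPi, T = 5, 4, 20
g = rng.random((nW, nPi))        # g[W, pi]; W-player minimizes, pi-player maximizes
Wt = rng.integers(nW, size=T)    # arbitrary played sequence of the W-player
pit = rng.integers(nPi, size=T)  # arbitrary played sequence of the pi-player

played = g[Wt, pit].mean()
reg_W = played - g[:, pit].mean(axis=1).min()    # average regret of the W-player
reg_pi = g[Wt, :].mean(axis=0).max() - played    # average regret of the pi-player

robust_of_played = g[:, pit].mean(axis=1).min()  # min_W avg_t g(W, pi_t)
# largest robustness gap over all comparator policies pi
max_gap = max(g[:, p].min() - robust_of_played for p in range(nPi))
```

Because each inequality in the proof holds pointwise, `max_gap <= reg_W + reg_pi` holds for every random draw.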
$$V_{W'}(\mu) - V_W(\mu) = \mathbb{E}_{P_{W'}, \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big] - V_W(\mu)$$
$$= \mathbb{E}_{P_{W'}, \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \gamma^t V_W(s_t) - \gamma^t V_W(s_t)\Big] - V_W(\mu)$$
$$= \mathbb{E}_{P_{W'}, \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \gamma^{t+1} V_W(s_{t+1}) - \gamma^t V_W(s_t)\Big]$$
$$= \mathbb{E}_{P_{W'}, \pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_W(s_{t+1}, a_t, s_t)\Big] \tag{4}$$
$$= \frac{1}{1-\gamma}\sum_{s',a,s} d^{W'}_\mu(s)\, P_{W'}(s',a,s)\,\pi(a,s)\, A_W(s',a,s).$$
Here, Equation (4) comes from our definition of the Advantage function for the $W$-player. This concludes the proof.

B.3 Proof of Lemma 5.5

Lemma 5.5. Under direct parameterization, $V_W(\mu)$ is gradient-dominated: for an arbitrary $W \in \mathcal{W}$ and the optimal parameter $W^\star \in \mathcal{W}$, we have
$$V_W(\mu) - V_{W^\star}(\mu) \le -K_W \min_{\bar{W} \in \mathcal{W}} \big\langle \bar{W} - W, \nabla_W V_W(\mu) \big\rangle.$$
Here, we have that Subcondition 4 holds for the $W$-player.

Proof. We use Lemma 5.3 to prove this. We will lower bound the difference between the optimal $V_{W^\star}(\mu)$ and $V_W(\mu)$. To prove gradient dominance, we need $V_W(\mu) - V_{W^\star}(\mu)$ to be upper bounded, so we equivalently lower bound its negative. We have
$$V_{W^\star}(\mu) - V_W(\mu) = \frac{1}{1-\gamma}\sum_{s',a,s} d^{W^\star}_\mu(s)\,\pi(a,s)\, P_{W^\star}(s' \mid a, s)\, A_W(s',a,s)$$
$$\ge \frac{1}{1-\gamma}\sum_{a,s} d^{W^\star}_\mu(s)\,\pi(a,s)\min_{s'} A_W(s',a,s) \tag{5}$$
$$\ge \max_s \frac{d^{W^\star}_\mu(s)}{d^{W}_\mu(s)}\cdot\frac{1}{1-\gamma}\sum_{a,s} d^{W}_\mu(s)\,\pi(a,s)\min_{s'} A_W(s',a,s).$$
Here, Equation (5) comes from seeing that $\sum_{s'} P(s' \mid a, s)\, A_W(s', a, s)$ is minimized when the transition dynamics put all weight on the state minimizing the advantage function. Looking only at that last term, we can bound it in the following manner:
$$\frac{1}{1-\gamma}\sum_{a,s} d^{W}_\mu(s)\,\pi(a,s)\min_{s'} A_W(s',a,s) = \frac{1}{1-\gamma}\min_{\bar W}\sum_{s',a,s} d^{W}_\mu(s)\,\pi(a,s)\, P_{\bar W}(s',a,s)\, A_W(s',a,s) \tag{6}$$
$$= \frac{1}{1-\gamma}\min_{\bar W}\sum_{s',a,s} d^{W}_\mu(s)\,\pi(a,s)\big(P_{\bar W}(s',a,s) - P_W(s',a,s)\big) A_W(s',a,s) \tag{7}$$
$$= \frac{1}{1-\gamma}\min_{\bar W}\sum_{s',a,s} d^{W}_\mu(s)\,\pi(a,s)\big(P_{\bar W}(s',a,s) - P_W(s',a,s)\big)\gamma V_W(s'). \tag{8}$$
Here, Equation (6) again comes from seeing that the value $\sum_{s'} P(s' \mid a, s)\, A_W(s', a, s)$ is minimized when the transition dynamics put all weight on the state minimizing the advantage function, Equation (7) comes from the fact that $\sum_{s',a} \pi(a,s)\, P_W(s',a,s)\, A_W(s',a,s) = 0$, and Equation (8) comes from the definition of the $W$-player advantage function.
Finally, by Lemma 5.2 the last term is lower bounded by the inner product with the gradient:
$$\frac{1}{1-\gamma}\min_{\bar W}\sum_{s',a,s} d^{W}_\mu(s)\,\pi(a,s)\big(P_{\bar W}(s',a,s) - P_W(s',a,s)\big)\gamma V_W(s') \ge \min_{\bar W \in \mathcal{W}}\big\langle \bar W - W, \nabla_W V_W(\mu)\big\rangle. \tag{9}$$
Combining these yields
$$V_{W^\star}(\mu) - V_W(\mu) \ge \max_s \frac{d^{W^\star}_\mu(s)}{d^{W}_\mu(s)}\min_{\bar W \in \mathcal{W}}\big\langle \bar W - W, \nabla_W V_W(\mu)\big\rangle \ge \frac{1}{1-\gamma}\min_{\bar W \in \mathcal{W}}\big\langle \bar W - W, \nabla_W V_W(\mu)\big\rangle. \tag{10}$$
Equation (10) comes from the fact that $d^W_\mu(s) \ge (1-\gamma)\mu(s)$ by definition. Flipping this, we have
$$V_W(\mu) - V_{W^\star}(\mu) \le -\frac{1}{1-\gamma}\min_{\bar W \in \mathcal{W}}\big\langle \bar W - W, \nabla_W V_W(\mu)\big\rangle,$$
which satisfies the definition of gradient dominance.

B.4 Proof of Lemma 5.4

Lemma 5.4. Let the objective function be the Value Function. For any positive noise term $\sigma$, we have that under direct parameterization, $\sum_{i=1}^{t-1} l_i(\cdot) - \langle \sigma, \cdot \rangle$ and $\sum_{i=1}^{t-1} h_i(\cdot) - \langle \sigma, \cdot \rangle$ are both gradient-dominated for the $W$-player and the $\pi$-player on sets $\mathcal{W}$ and $\mathcal{T}$ with constants $K_W = \frac{1}{1-\gamma}$ and $K_\pi = \frac{1}{1-\gamma}$ respectively. Therefore, Subcondition 2 holds for the loss functions of both players.

Proof. We use Lemma 5.3 to prove both claims. We start with the $W$-player:
$$\sum_i V^{\pi_i}_{W^\star}(\mu) - \sum_i V^{\pi_i}_{W}(\mu) - \langle \sigma, W^\star - W\rangle = \frac{1}{1-\gamma}\sum_i\sum_{s',a,s} d^{W^\star}_\mu(s)\,\pi_i(a \mid s)\, P_{W^\star}(s' \mid a, s)\, A^{\pi_i}_W(s',a,s) - \langle \sigma, W^\star - W\rangle \tag{11}$$
$$\ge \max_s \frac{d^{W^\star}_\mu(s)}{d^{W}_\mu(s)}\cdot\frac{1}{1-\gamma}\min_{\bar W}\Big[\sum_i\sum_{s',a,s} d^{W}_\mu(s)\,\pi_i(a \mid s)\, P_{\bar W}(s' \mid a, s)\, A^{\pi_i}_W(s',a,s) - \langle \sigma, \bar W - W\rangle\Big]. \tag{12}$$
Here, Equation (11) comes from Lemma 5.3 (the noise term can be folded in since the maximum of the interior of the right-hand side is always nonnegative), and Equation (12) comes from using the minimizing transition dynamics $\bar W$. Looking at the inside term, we have
$$\min_{\bar W}\Big[\sum_i\sum_{s',a,s} d^{W}_\mu(s)\,\pi_i(a \mid s)\, P_{\bar W}(s',a,s)\, A^{\pi_i}_W(s',a,s) - \langle \sigma, \bar W - W\rangle\Big]$$
$$= \min_{\bar W}\Big[\sum_i\sum_{s',a,s} d^{W}_\mu(s)\,\pi_i(a \mid s)\big(P_{\bar W}(s',a,s) - P_W(s',a,s)\big) A^{\pi_i}_W(s',a,s) - \langle \sigma, \bar W - W\rangle\Big] \tag{13}$$
$$= \min_{\bar W}\Big[\sum_i\sum_{s',a,s} d^{W}_\mu(s)\,\pi_i(a \mid s)\big(P_{\bar W}(s',a,s) - P_W(s',a,s)\big)\gamma V^{\pi_i}_W(s') - \langle \sigma, \bar W - W\rangle\Big] \tag{14}$$
$$\ge \min_{\bar W}\Big\langle \bar W - W, \sum_i \nabla_W V^{\pi_i}_W(\mu) - \sigma\Big\rangle. \tag{15}$$
Equation (13) comes from the fact that $\sum_{s',a} \pi_i(a \mid s)\, P_W(s',a,s)\, A^{\pi_i}_W(s',a,s) = 0$, Equation (14) comes from the definition of the $W$-player advantage function, and Equation (15) comes from Lemma 5.2. Combining these, we have
$$\sum_i V^{\pi_i}_{W^\star}(\mu) - \sum_i V^{\pi_i}_{W}(\mu) - \langle \sigma, W^\star - W\rangle \ge \frac{1}{1-\gamma}\min_{\bar W}\Big\langle \bar W - W, \sum_i \nabla_W V^{\pi_i}_W(\mu) - \sigma\Big\rangle.$$
Moreover, we use the fact that $d^W_\mu(s) \ge (1-\gamma)\mu(s)$ by definition. We now do this for the $\pi$-player.
By the Performance Difference Lemma,

$$
\begin{aligned}
\sum_i\Big[V^{\pi^\star}_{W_i}(\mu) - V^{\pi}_{W_i}(\mu)\Big] - \langle\sigma, \pi^\star - \pi\rangle
&= \frac{1}{1-\gamma}\sum_i\sum_{s,a} d^{\pi^\star}_{\mu}(s)\,\pi^\star(a|s)\,A^{\pi}_{W_i}(s,a) - \langle\sigma, \pi^\star - \pi\rangle \qquad (16)\\
&\ge \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_i\sum_{s,a} d^{\pi^\star}_{\mu}(s)\,\bar\pi(a|s)\,A^{\pi}_{W_i}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big]\\
&\ge \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_i\sum_{s,a} d^{\pi}_{\mu}(s)\,\bar\pi(a|s)\,A^{\pi}_{W_i}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big]\\
&= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_i\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,A^{\pi}_{W_i}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big] \qquad (17)\\
&= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_i\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,Q^{\pi}_{W_i}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big] \qquad (18)\\
&\ge \min_{\bar\pi\in\mathcal{T}}\Big\langle \bar\pi - \pi,\ \nabla_\pi\sum_i V^{\pi}_{W_i}(\mu) - \sigma\Big\rangle. \qquad (19)
\end{aligned}
$$

Here, Equation (16) comes from the Performance Difference Lemma, Equation (17) comes from the fact that $\sum_a \pi(a|s)\,A^{\pi}_{W_i}(s,a) = 0$, Equation (18) comes from the definition of the advantage function for the $\pi$-player, and Equation (19) comes from both the Policy Gradient Theorem and the fact that $d^{\pi}_{\mu}(s) \ge (1-\gamma)\mu(s)$. We have now proven both claims of our lemma.

Lemma B.1. The $\pi$-player enjoys Subcondition 3; i.e., $\sum_{i=1}^{t-1} h_i(\cdot) + h_{t-1}(\cdot) - \langle\sigma,\cdot\rangle$ is gradient-dominated with constant $\frac{1}{1-\gamma}$.

Proof. For simplicity, write $\sum_j h_j(\cdot) := \sum_{i=1}^{t-1} h_i(\cdot) + h_{t-1}(\cdot)$, where $j$ indexes over the multiset $\{h_1, \dots, h_{t-2}, h_{t-1}, h_{t-1}\}$. With this, we can follow through with our proof. By the Performance Difference Lemma,

$$
\begin{aligned}
\sum_j\Big[V^{\pi^\star}_{W_j}(\mu) - V^{\pi}_{W_j}(\mu)\Big] - \langle\sigma, \pi^\star - \pi\rangle
&= \frac{1}{1-\gamma}\sum_j\sum_{s,a} d^{\pi^\star}_{\mu}(s)\,\pi^\star(a|s)\,A^{\pi}_{W_j}(s,a) - \langle\sigma, \pi^\star - \pi\rangle \qquad (20)\\
&\ge \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_j\sum_{s,a} d^{\pi}_{\mu}(s)\,\bar\pi(a|s)\,A^{\pi}_{W_j}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big]\\
&= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_j\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,A^{\pi}_{W_j}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big] \qquad (21)\\
&= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_j\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,Q^{\pi}_{W_j}(s,a) - \langle\sigma, \bar\pi - \pi\rangle\Big] \qquad (22)\\
&\ge \min_{\bar\pi\in\mathcal{T}}\Big\langle \bar\pi - \pi,\ \nabla_\pi\sum_j V^{\pi}_{W_j}(\mu) - \sigma\Big\rangle. \qquad (23)
\end{aligned}
$$

Here, Equation (20) comes from the Performance Difference Lemma, Equation (21) comes from the fact that $\sum_a \pi(a|s)\,A^{\pi}_{W_j}(s,a) = 0$, Equation (22) comes from the definition of the advantage function for the $\pi$-player, and Equation (23) comes from both the Policy Gradient Theorem and the fact that $d^{\pi}_{\mu}(s) \ge (1-\gamma)\mu(s)$.

Lemma B.2. Subcondition 1 is satisfied for the W-player; i.e., $\sum_{i=1}^{t} l_i(\cdot) - \langle\sigma,\cdot\rangle$ is gradient-dominated.

Proof. We can use Lemma 5.3 to prove this. The argument for the W-player proceeds as follows.
$$
\begin{aligned}
\sum_{i=1}^{t}\Big[V^{\pi_i}_{W^\star}(\mu) - V^{\pi_i}_{W}(\mu)\Big] - \langle\sigma, W^\star - W\rangle
&= \frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{s,a,s'} d^{W^\star}_{\mu}(s)\,\pi_i(a|s)\,P_{W^\star}(s'|s,a)\,A_{W}(s',a,s) - \langle\sigma, W^\star - W\rangle\\
&\ge \Big\lVert\frac{d^{W^\star}_{\mu}}{d^{W}_{\mu}}\Big\rVert_\infty\Big[\frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{s,a,s'} d^{W}_{\mu}(s)\,\pi_i(a|s)\,P_{\bar W}(s'|s,a)\,A_{W}(s',a,s) - \langle\sigma, \bar W - W\rangle\Big]. \qquad (24)
\end{aligned}
$$

Here, Equation (24) comes from using the minimizing transition dynamics $\bar W$. Looking at the inside term, we have

$$
\begin{aligned}
&\frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{s,a,s'} d^{W}_{\mu}(s)\,\pi_i(a|s)\,P_{\bar W}(s'|s,a)\,A_{W}(s',a,s) - \langle\sigma, \bar W - W\rangle\\
&\quad= \frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{s,a,s'} d^{W}_{\mu}(s)\,\pi_i(a|s)\,\big(P_{\bar W}(s'|s,a) - P_{W}(s'|s,a)\big)\,A_{W}(s',a,s) - \langle\sigma, \bar W - W\rangle \qquad (25)\\
&\quad= \frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{s,a,s'} d^{W}_{\mu}(s)\,\pi_i(a|s)\,\big(P_{\bar W}(s'|s,a) - P_{W}(s'|s,a)\big)\,\gamma V_{W}(s') - \langle\sigma, \bar W - W\rangle \qquad (26)\\
&\quad= \Big\langle \bar W - W,\ \sum_{i=1}^{t}\nabla_W V^{\pi_i}_{W}(\mu) - \sigma\Big\rangle. \qquad (27)
\end{aligned}
$$

Equation (25) comes from the fact that $\sum_{s'} P_{W}(s'|s,a)\,A_{W}(s',a,s) = 0$. Equation (26) comes from the definition of the W-player advantage function. Finally, Equation (27) comes from Lemma 5.2. Combining these, we have

$$\sum_{i=1}^{t}\Big[V^{\pi_i}_{W^\star}(\mu) - V^{\pi_i}_{W}(\mu)\Big] - \langle\sigma, W^\star - W\rangle \;\ge\; \frac{1}{1-\gamma}\min_{\bar W\in\mathcal{W}}\Big\langle \bar W - W,\ \sum_{i=1}^{t}\nabla_W V^{\pi_i}_{W}(\mu) - \sigma\Big\rangle,$$

where we again use the fact that $d^{W}_{\mu}(s) \ge (1-\gamma)\mu(s)$ by definition.

C Proofs for Convergence

Here, we state a lemma on the convergence of FTPL and OFTPL, as proven in Suggala & Netrapalli (2019).

Lemma C.1. Let $D$ be the $\ell_\infty$ diameter of the space $\mathcal{X}$. Suppose the losses encountered by the learner are $L$-Lipschitz with respect to the $\ell_1$ norm, and suppose the optimization oracle used has error $\alpha$. For any fixed $\eta$, the predictions of Follow the Perturbed Leader satisfy the following regret bound, where $d$ is the dimension of the noise vector:
$$\mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T} f_t(x_t)\Big] - \inf_{x\in\mathcal{X}}\frac{1}{T}\sum_{t=1}^{T} f_t(x) \;\le\; O\Big(\eta d^2 D L^2 + \frac{dD}{\eta T} + \alpha\Big).$$
For OFTPL, suppose our guess $g_t$ is such that $g_t - f_t$ is $L_t$-Lipschitz with respect to the $\ell_1$ norm for all $t\in[T]$. The predictions of Optimistic Follow the Perturbed Leader satisfy
$$\mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T} f_t(x_t)\Big] - \inf_{x\in\mathcal{X}}\frac{1}{T}\sum_{t=1}^{T} f_t(x) \;\le\; O\Big(\frac{\eta d^2 D}{T}\sum_{t=1}^{T} L_t^2 + \frac{dD}{\eta T} + \alpha\Big).$$

C.1 Proof of Lemma 4.1

Lemma 4.1. Suppose we have an incoming sequence of loss functions $f_t$ for $t\in[T]$ and an optimization oracle that can minimize a function to within error $\alpha$. The Best Response algorithm satisfies the following regret bound:
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \inf_{x\in\mathcal{X}}\frac{1}{T}\sum_{t=1}^{T} f_t(x) \;\le\; \alpha.$$
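Before the proof, the two online learners above can be contrasted numerically. The following sketch is an illustrative assumption rather than code from the paper: the box domain, the uniformly sampled linear losses, and the tuning of $\eta$ are all hypothetical choices, with an exact linear-minimization oracle playing the role of the $\alpha$-approximate oracle ($\alpha = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy online linear optimization over the box [0, 1]^d.
# Round-t loss: f_t(x) = <c_t, x>, with c_t drawn uniformly from [-1, 1]^d.
d, T = 5, 2000
costs = rng.uniform(-1.0, 1.0, size=(T, d))

def oracle(v):
    """Exact linear-minimization oracle over [0, 1]^d (oracle error alpha = 0)."""
    return (v < 0).astype(float)

# Follow the Perturbed Leader (Lemma C.1): at round t, minimize the perturbed
# cumulative loss sum_{i<t} f_i(x) - <sigma, x>, with fresh noise sigma_j ~ Exp(eta).
eta = 1.0 / np.sqrt(T * d)   # an illustrative tuning, not the paper's exact constant
cum = np.zeros(d)
ftpl_loss = 0.0
for t in range(T):
    sigma = rng.exponential(scale=1.0 / eta, size=d)  # Exp(eta) noise, mean 1/eta
    x = oracle(cum - sigma)
    ftpl_loss += costs[t] @ x
    cum += costs[t]

best_loss = float(cum @ oracle(cum))   # best fixed decision in hindsight
ftpl_regret = (ftpl_loss - best_loss) / T

# Best Response (Lemma 4.1): with one-step access to f_t, play an approximate
# minimizer of f_t itself; the average regret is then at most the oracle error.
br_loss = sum(float(costs[t] @ oracle(costs[t])) for t in range(T))
br_regret = (br_loss - best_loss) / T

print(f"FTPL average regret: {ftpl_regret:.4f}")
print(f"Best Response average regret: {br_regret:.4f}")
```

With an exact oracle, Best Response achieves nonpositive average regret (since a per-round minimum is never larger than the best fixed decision's average), matching Lemma 4.1, while FTPL's average regret shrinks as $T$ grows.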
Proof. The regret is defined as $\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \frac{1}{T}\inf_{x\in\mathcal{X}}\sum_{t=1}^{T} f_t(x)$, where the $x_t$ are the choices made by the Best Response algorithm. For an arbitrary time step $t$ and any $x\in\mathcal{X}$, we have
$$f_t(x_t) \;\le\; \min_{x'\in\mathcal{X}} f_t(x') + \alpha \;\le\; f_t(x) + \alpha, \qquad (28)$$
where Equation (28) comes from the fact that $x_t$ is computed by an optimization oracle whose error is upper bounded by $\alpha$. Summing over $t$ and dividing by $T$, we have
$$\frac{1}{T}\sum_{t=1}^{T} f_t(x_t) - \inf_{x\in\mathcal{X}}\frac{1}{T}\sum_{t=1}^{T} f_t(x) \;\le\; \alpha.$$

D Proofs for Extension Section

D.1 Proof of Lemma D.1

Lemma D.1. Given a series of choices $x_1,\dots,x_T$ made by FTPL+ with smooth, gradient-dominated loss functions $f_1,\dots,f_T$ and noise sampled $\sigma\sim\mathrm{Exp}(\eta)$, the stability of the choices is bounded in expectation by
$$\mathbb{E}\big[\lVert x_{t+1}-x_t\rVert_1\big] \;\le\; 125\,\eta L d^2 D + \frac{\alpha}{20L}.$$

To prove this, we first establish two monotonicity properties of the loss function as a function of the noise $\sigma$. Much of this proof structure is inspired by Suggala & Netrapalli (2019).

Lemma D.2. Let $x_t(\sigma)$ be the solution chosen by FTPL+ under noise $\sigma$, and let $\sigma' = \sigma + c\,e_i$ for some positive constant $c$. Then $x_{t,i}(\sigma') \ge x_{t,i}(\sigma) - \frac{2\alpha}{c}$.

Proof. Given the approximate optimality of $x_t(\sigma)$, we have

$$
\begin{aligned}
&\sum_{j=1}^{t-1} f_j(x_t(\sigma)) + m_t(x_t(\sigma)) - \langle\sigma, x_t(\sigma)\rangle \qquad (29)\\
&\quad\le \sum_{j=1}^{t-1} f_j(x_t(\sigma')) + m_t(x_t(\sigma')) - \langle\sigma, x_t(\sigma')\rangle + \alpha \qquad (30)\\
&\quad= \sum_{j=1}^{t-1} f_j(x_t(\sigma')) + m_t(x_t(\sigma')) - \langle\sigma', x_t(\sigma')\rangle + \langle\sigma' - \sigma, x_t(\sigma')\rangle + \alpha\\
&\quad\le \sum_{j=1}^{t-1} f_j(x_t(\sigma)) + m_t(x_t(\sigma)) - \langle\sigma', x_t(\sigma)\rangle + \langle\sigma' - \sigma, x_t(\sigma')\rangle + 2\alpha \qquad (31)\\
&\quad= \sum_{j=1}^{t-1} f_j(x_t(\sigma)) + m_t(x_t(\sigma)) - \langle\sigma, x_t(\sigma)\rangle + \langle\sigma' - \sigma, x_t(\sigma') - x_t(\sigma)\rangle + 2\alpha. \qquad (32)
\end{aligned}
$$

Here, Equation (30) comes from the fact that $x_t(\sigma)$ is an approximate minimizer of the loss function with noise $\sigma$, and Equation (31) comes from the fact that $x_t(\sigma')$ approximately minimizes the loss function with the noise set to $\sigma'$. Combining Equation (29) and Equation (32), we have
$$0 \;\le\; \langle\sigma' - \sigma,\, x_t(\sigma') - x_t(\sigma)\rangle + 2\alpha \;=\; c\big(x_{t,i}(\sigma') - x_{t,i}(\sigma)\big) + 2\alpha.$$
We therefore get that $x_{t,i}(\sigma') \ge x_{t,i}(\sigma) - \frac{2\alpha}{c}$.

Next, we show that the predictions made by our algorithm at consecutive time steps are close.
Lemma D.3. If $\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1 \le 10d\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|$ and $\sigma' = \sigma + 100Ld\,e_i$, then
$$\min\big(x_{t,i}(\sigma'),\, x_{t+1,i}(\sigma')\big) \;\ge\; \max\big(x_{t,i}(\sigma),\, x_{t+1,i}(\sigma)\big) - \frac{1}{10}\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| - \frac{3\alpha}{100Ld}.$$

Proof. We have that

$$
\begin{aligned}
&\sum_{j=1}^{t-1} f_j(x_t(\sigma)) - \langle\sigma, x_t(\sigma)\rangle + f_t(x_t(\sigma)) + m_{t+1}(x_t(\sigma))\\
&\quad\le \sum_{j=1}^{t-1} f_j(x_{t+1}(\sigma)) - \langle\sigma, x_{t+1}(\sigma)\rangle + f_t(x_{t+1}(\sigma)) + m_{t+1}(x_t(\sigma)) + \alpha \qquad (33)\\
&\quad\le \sum_{j=1}^{t-1} f_j(x_{t+1}(\sigma)) - \langle\sigma, x_{t+1}(\sigma)\rangle + f_t(x_{t+1}(\sigma)) + m_{t+1}(x_{t+1}(\sigma)) + L\,\lVert x_{t+1}(\sigma) - x_t(\sigma)\rVert_1 + \alpha. \qquad (34)
\end{aligned}
$$

Here, Equation (33) comes from the approximate optimality of $x_t(\sigma)$, and Equation (34) from the smoothness of the optimistic function. Moreover, from the opposite direction, we have that

$$
\begin{aligned}
&\sum_{j=1}^{t-1} f_j(x_t(\sigma)) - \langle\sigma, x_t(\sigma)\rangle + f_t(x_t(\sigma)) + m_{t+1}(x_t(\sigma))\\
&\quad= \sum_{j=1}^{t-1} f_j(x_t(\sigma)) - \langle\sigma', x_t(\sigma)\rangle + \langle\sigma' - \sigma, x_t(\sigma)\rangle + f_t(x_t(\sigma)) + m_{t+1}(x_t(\sigma))\\
&\quad\ge \sum_{j=1}^{t-1} f_j(x_{t+1}(\sigma')) - \langle\sigma', x_{t+1}(\sigma')\rangle + \langle\sigma' - \sigma, x_t(\sigma)\rangle + f_t(x_{t+1}(\sigma')) + m_{t+1}(x_{t+1}(\sigma')) - \alpha \qquad (35)\\
&\quad= \sum_{j=1}^{t-1} f_j(x_{t+1}(\sigma')) - \langle\sigma, x_{t+1}(\sigma')\rangle + \langle\sigma' - \sigma, x_t(\sigma) - x_{t+1}(\sigma')\rangle + f_t(x_{t+1}(\sigma')) + m_{t+1}(x_{t+1}(\sigma')) - \alpha\\
&\quad\ge \sum_{j=1}^{t-1} f_j(x_{t+1}(\sigma)) - \langle\sigma, x_{t+1}(\sigma)\rangle + \langle\sigma' - \sigma, x_t(\sigma) - x_{t+1}(\sigma')\rangle + f_t(x_{t+1}(\sigma)) + m_{t+1}(x_{t+1}(\sigma)) - 2\alpha. \qquad (36)
\end{aligned}
$$

Here, Equation (35) comes from the approximate optimality of $x_{t+1}(\sigma')$, and Equation (36) from the approximate optimality of $x_{t+1}(\sigma)$. From these and our original assumption, we have that
$$10Ld\,|x_{t+1,i}(\sigma) - x_{t,i}(\sigma)| + 3\alpha \;\ge\; 100Ld\,\big(x_{t,i}(\sigma) - x_{t+1,i}(\sigma')\big).$$
Using a similar argument, we have that
$$10Ld\,|x_{t+1,i}(\sigma) - x_{t,i}(\sigma)| + 3\alpha \;\ge\; 100Ld\,\big(x_{t+1,i}(\sigma) - x_{t,i}(\sigma')\big).$$
Moreover, from Lemma D.2 with $c = 100Ld$, we know that $x_{t+1,i}(\sigma') \ge x_{t+1,i}(\sigma) - \frac{2\alpha}{100Ld}$ and $x_{t,i}(\sigma') \ge x_{t,i}(\sigma) - \frac{2\alpha}{100Ld}$. Combining these, we have our claim.

We can now finally prove our claim. This proof is very similar to the proof of Theorem 1 in Suggala & Netrapalli (2019).

Proof. We note that we can decompose the $\ell_1$ norm in $\mathbb{E}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big]$ as
$$\mathbb{E}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big] = \sum_{i=1}^{d} \mathbb{E}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big].$$
To bound $\mathbb{E}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big]$, we derive an upper bound for each coordinate $\mathbb{E}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big]$, $i \in [d]$. For any $i \in [d]$, define
$$\mathbb{E}_{-i}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big] = \mathbb{E}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| \;\big|\; \{\sigma_j\}_{j\ne i}\big],$$
where $\sigma_j$ is the $j$th coordinate of $\sigma$.
Intuitively, we are taking the expectation over the noise in a single dimension while holding the noise in the other dimensions constant. Let $x_{\max,i}(\sigma) = \max\big(x_{t,i}(\sigma), x_{t+1,i}(\sigma)\big)$ and $x_{\min,i}(\sigma) = \min\big(x_{t,i}(\sigma), x_{t+1,i}(\sigma)\big)$. Then, by definition, we have that
$$\mathbb{E}_{-i}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big] = \mathbb{E}_{-i}\big[x_{\max,i}(\sigma)\big] - \mathbb{E}_{-i}\big[x_{\min,i}(\sigma)\big].$$
Define the event $E$ as
$$E = \big\{\sigma : \lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1 \le 10d\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big\}.$$
For notational ease, let $P = \exp(-100\eta L d)$ be a constant. Consider the following:
$$
\begin{aligned}
\mathbb{E}_{-i}\big[x_{\min,i}(\sigma)\big]
&= \mathbb{P}(\sigma_i < 100Ld)\,\mathbb{E}_{-i}\big[x_{\min,i}(\sigma) \mid \sigma_i < 100Ld\big] + \mathbb{P}(\sigma_i \ge 100Ld)\,\mathbb{E}_{-i}\big[x_{\min,i}(\sigma) \mid \sigma_i \ge 100Ld\big]\\
&\ge (1-P)\big(\mathbb{E}_{-i}[x_{\max,i}(\sigma)] - D\big) \qquad (37)\\
&\quad + P\,\mathbb{E}_{-i}\big[x_{\min,i}(\sigma + 100Ld\,e_i)\big], \qquad (38)
\end{aligned}
$$
where Equation (37) follows from the fact that the domain of the $i$th coordinate lies within some interval of length $D$, and since $\mathbb{E}_{-i}[x_{\min,i}(\sigma) \mid \sigma_i < 100Ld]$ and $\mathbb{E}_{-i}[x_{\max,i}(\sigma)]$ are points in this interval, their difference is bounded by $D$; Equation (38) follows from the memorylessness of the exponential distribution. We can further lower bound $\mathbb{E}_{-i}[x_{\min,i}(\sigma)]$ as follows:
$$\mathbb{E}_{-i}[x_{\min,i}(\sigma)] \;\ge\; (1-P)\big(\mathbb{E}_{-i}[x_{\max,i}(\sigma)] - D\big) + P\,\mathbb{P}_{-i}(E)\,\mathbb{E}_{-i}\big[x_{\min,i}(\sigma + 100Ld\,e_i) \mid E\big] + P\,\mathbb{P}_{-i}(E^c)\,\mathbb{E}_{-i}\big[x_{\min,i}(\sigma + 100Ld\,e_i) \mid E^c\big],$$
where $\mathbb{P}_{-i}(E)$ is defined as $\mathbb{P}_{-i}(E) := \mathbb{P}\big(E \mid \{\sigma_j\}_{j\ne i}\big)$. We now use the monotonicity properties proved in Lemma D.2 and Lemma D.3 to further lower bound $\mathbb{E}_{-i}[x_{\min,i}(\sigma)]$. Then
$$
\begin{aligned}
\mathbb{E}_{-i}[x_{\min,i}(\sigma)]
&\ge (1-P)\big(\mathbb{E}_{-i}[x_{\max,i}(\sigma)] - D\big) \qquad (39)\\
&\quad + P\,\mathbb{P}_{-i}(E)\,\mathbb{E}_{-i}\Big[x_{\max,i}(\sigma) - \tfrac{1}{10}\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| - \tfrac{3\alpha}{100Ld} \;\Big|\; E\Big] + P\,\mathbb{P}_{-i}(E^c)\,\mathbb{E}_{-i}\Big[x_{\min,i}(\sigma) - \tfrac{2\alpha}{100Ld} \;\Big|\; E^c\Big]\\
&\ge (1-P)\big(\mathbb{E}_{-i}[x_{\max,i}(\sigma)] - D\big) \qquad (40)\\
&\quad + P\,\mathbb{P}_{-i}(E)\,\mathbb{E}_{-i}\Big[x_{\max,i}(\sigma) - \tfrac{1}{10}\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| - \tfrac{3\alpha}{100Ld} \;\Big|\; E\Big] + P\,\mathbb{P}_{-i}(E^c)\,\mathbb{E}_{-i}\Big[x_{\max,i}(\sigma) - \tfrac{1}{10d}\,\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1 - \tfrac{2\alpha}{100Ld} \;\Big|\; E^c\Big],
\end{aligned}
$$
where Equation (39) follows from Lemma D.2 and Lemma D.3, and Equation (40) follows from the definition of $E^c$.
Rearranging the terms on the right-hand side and using $\mathbb{P}_{-i}(E) \le 1$ gives us
$$
\begin{aligned}
\mathbb{E}_{-i}[x_{\min,i}(\sigma)]
&\ge (1-P)\big(\mathbb{E}_{-i}[x_{\max,i}(\sigma)] - D\big) + P\,\mathbb{E}_{-i}\big[x_{\max,i}(\sigma)\big] - \tfrac{3\alpha}{100Ld} - \mathbb{E}_{-i}\Big[\tfrac{1}{10}\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| + \tfrac{1}{10d}\,\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\Big]\\
&\ge \mathbb{E}_{-i}[x_{\max,i}(\sigma)] - 100\,\eta L d\,D - \tfrac{3\alpha}{100Ld} - \mathbb{E}_{-i}\Big[\tfrac{1}{10}\,|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)| + \tfrac{1}{10d}\,\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\Big], \qquad (41)
\end{aligned}
$$
where Equation (41) uses the fact that $\exp(-x) \ge 1 - x$. Rearranging the terms in the last inequality gives us
$$\mathbb{E}_{-i}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big] \;\le\; \frac{1}{9d}\,\mathbb{E}_{-i}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big] + \frac{1000}{9}\,\eta L d\,D + \frac{\alpha}{30Ld}. \qquad (42)$$
Since Equation (42) holds for any $\{\sigma_j\}_{j\ne i}$, we get the same bound on the unconditional expectation:
$$\mathbb{E}\big[|x_{t,i}(\sigma) - x_{t+1,i}(\sigma)|\big] \;\le\; \frac{1}{9d}\,\mathbb{E}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big] + \frac{1000}{9}\,\eta L d\,D + \frac{\alpha}{30Ld}. \qquad (43)$$
Summing Equation (43) over the $d$ coordinates and rearranging yields the claimed bound on the stability of the predictions of FTPL+:
$$\mathbb{E}\big[\lVert x_t(\sigma) - x_{t+1}(\sigma)\rVert_1\big] \;\le\; 125\,\eta L d^2 D + \frac{\alpha}{20L}.$$
Utilizing Appendix D.1 gives us the required bound on regret.

D.2 Proofs for Strong Gradient Dominance

We first establish strong gradient dominance of the regularized value function in this setting.

Lemma 6.2. The function $g(W, \pi) = V^{\pi}_{W}(\mu) - \lVert\pi\rVert_2^2$ is $(K_\pi, 2)$-strongly gradient-dominated on the set $\mathcal{T}$.

Proof. By the Performance Difference Lemma,
$$
\begin{aligned}
&\sum_t \Big[ V^{\pi^\star}_{W_t}(\mu) - V^{\pi}_{W_t}(\mu) \Big] - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2\\
&\quad= \frac{1}{1-\gamma}\sum_t\sum_{s,a} d^{\pi^\star}_{\mu}(s)\,\pi^\star(a|s)\,A^{\pi}_{W_t}(s,a) - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2 \qquad (44)\\
&\quad\ge \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_t\sum_{s,a} d^{\pi^\star}_{\mu}(s)\,\bar\pi(a|s)\,A^{\pi}_{W_t}(s,a)\Big] - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2\\
&\quad\ge \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_t\sum_{s,a} d^{\pi}_{\mu}(s)\,\bar\pi(a|s)\,A^{\pi}_{W_t}(s,a)\Big] - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2\\
&\quad= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_t\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,A^{\pi}_{W_t}(s,a)\Big] - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2 \qquad (45)\\
&\quad= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_t\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,Q^{\pi}_{W_t}(s,a)\Big] - \langle\sigma, \pi^\star - \pi\rangle + \lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2 \qquad (46)\\
&\quad= \frac{1}{1-\gamma}\min_{\bar\pi\in\mathcal{T}}\Big[\sum_t\sum_{s,a} d^{\pi}_{\mu}(s)\,\big(\bar\pi(a|s) - \pi(a|s)\big)\,Q^{\pi}_{W_t}(s,a)\Big] - \langle\sigma, \pi^\star - \pi\rangle + \langle 2\pi, \pi - \pi^\star\rangle - \lVert\pi - \pi^\star\rVert_2^2 \qquad (47)\\
&\quad\ge \min_{\bar\pi\in\mathcal{T}} \Big\langle \bar\pi - \pi,\ \nabla_\pi\Big(\sum_t V^{\pi}_{W_t}(\mu) - \lVert\pi\rVert_2^2\Big) - \sigma \Big\rangle - \lVert\pi^\star - \pi\rVert_2^2. \qquad (48)
\end{aligned}
$$
Here, Equation (44) comes from the Performance Difference Lemma, Equation (45) comes from the fact that $\sum_a \pi(a|s)\,A^{\pi}_{W_t}(s,a) = 0$, Equation (46) comes from the definition of the advantage function for the $\pi$-player, and Equation (48) comes from both the Policy Gradient Theorem and the fact that $d^{\pi}_{\mu}(s) \ge (1-\gamma)\mu(s)$.
Moreover, Equation (47) comes from the following algebra:
$$\lVert\pi\rVert_2^2 - \lVert\pi^\star\rVert_2^2 \;=\; 2\lVert\pi\rVert_2^2 - 2\langle\pi, \pi^\star\rangle - \big(\lVert\pi\rVert_2^2 - 2\langle\pi, \pi^\star\rangle + \lVert\pi^\star\rVert_2^2\big) \;=\; \langle 2\pi,\, \pi - \pi^\star\rangle - \lVert\pi - \pi^\star\rVert_2^2.$$
From this, we have that the regularized value function is strongly gradient-dominated with the stated constants.

D.3 Proof of Theorem 5.1

Theorem 5.1. Assume that the set of possible transition matrices $\mathcal{W}$ is convex. Let $L_\pi$ be the smoothness constant of $h_t$ with respect to the $\ell_1$ norm, $D_\pi = \max_{\pi,\pi'\in\mathcal{T}} \lVert\pi - \pi'\rVert_2$, and $d_\pi$ the dimension of the $\pi$-player's input. Let $OL_\pi$ use FTPL and $OL_W$ use Best-Response. Under direct parameterization, when $\eta = \frac{1}{L_\pi\sqrt{T d_\pi}}$, the robustness of the set of trained policies satisfies, for any $\pi$,
$$\frac{1}{T}\sum_{t=1}^{T} V^{\pi}_{W_t}(\mu) - \mathbb{E}\Big[\min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T} V^{\pi_t}_{W}(\mu)\Big] \;\le\; O\Big(\frac{d_\pi^{3/2} D_\pi L_\pi}{\sqrt{T}}\Big) + c_\pi(T_O, K_\pi) + c_W(T_O, K_W).$$

Proof. From Theorem 4.1, we have that
$$\frac{1}{T}\sum_{t=1}^{T} V^{\pi}_{W_t}(\mu) - \min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T} V^{\pi_t}_{W}(\mu) \;\le\; \mathrm{Reg}_W + \mathrm{Reg}_\pi.$$
When the $\pi$-player uses FTPL, its regret is bounded by
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\eta d_\pi^2 D_\pi L_\pi^2 + \frac{d_\pi D_\pi}{\eta T} + \alpha\Big).$$
Setting $\eta = \frac{1}{L_\pi\sqrt{T d_\pi}}$ to minimize this, we have
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\frac{d_\pi^{3/2} D_\pi L_\pi}{\sqrt{T}} + \alpha\Big).$$
Here, $\alpha$ is the oracle error term. Therefore, given that we have gradient dominance, using Projected Gradient Descent as the oracle yields, from Lemma 5.1,
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\frac{d_\pi^{3/2} D_\pi L_\pi}{\sqrt{T}}\Big) + c_\pi(T_O, K_\pi).$$
Given that the W-player employs Best-Response, the regret of the W-player is bounded, by Lemma 4.1, as $\mathbb{E}(\mathrm{Reg}_W) \le \alpha$, where $\alpha$ is the optimization error. Given that we have gradient dominance properties for the W-player as well, using Projected Gradient Descent gives $\mathbb{E}(\mathrm{Reg}_W) \le c_W(T_O, K_W)$. Adding these two inequalities together yields the final result.

D.4 Proof of Theorem 6.1

Theorem 6.1. Assume that the set of possible transition matrices $\mathcal{W}$ is convex and that we are in the direct parameterization setting. Let $OL_\pi$ use OFTPL and $OL_W$ use FTPL+. Given that Condition 6.1 holds and setting $\eta = \sqrt{\frac{20 L_\pi (d_W D_W + d_\pi D_\pi)}{d_\pi^2 D_\pi L^2\, c_\pi(T_O, K_\pi)\, T}}$, the robustness of the algorithm satisfies, for any $\pi$,
$$\frac{1}{T}\sum_{t=1}^{T} V^{\pi}_{W_t}(\mu) - \mathbb{E}\Big[\min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T} V^{\pi_t}_{W}(\mu)\Big] \;\le\; O\Big(\sqrt{\frac{D_\pi (d_W D_W + d_\pi D_\pi)\, c_\pi(T_O, K_\pi)}{5 L_\pi T}} + c_\pi(T_O, K_\pi) + c_W(T_O, K_W)\Big).$$

Proof.
From Theorem 4.1, we have that
$$\frac{1}{T}\sum_{t=1}^{T} V^{\pi}_{W_t}(\mu) - \min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T} V^{\pi_t}_{W}(\mu) \;\le\; \mathrm{Reg}_W + \mathrm{Reg}_\pi.$$
Given that the $\pi$-player employs OFTPL, we have from Lemma C.1 that its regret is upper bounded by
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\frac{\eta d_\pi^2 D_\pi}{T}\sum_{t=1}^{T} \mathbb{E}(L_t^2) + \frac{d_\pi D_\pi}{\eta T} + \alpha\Big).$$
However, we have that
$$\mathbb{E}(L_t^2) = L^2\,\mathbb{E}\big(\lVert W_t - W_{t-1}\rVert_1\big) \;\le\; L^2\Big(125\,\eta L_\pi d_\pi^2 D_\pi + \frac{\alpha}{20 L_\pi}\Big),$$
where the last inequality comes from the fact that the W-player uses FTPL+, and the decisions made by FTPL+ are stable by Lemma D.1. Using this above, we have that
$$\mathrm{Reg}_\pi \;\le\; O\Big(\eta d_\pi^2 D_\pi L^2\Big(125\,\eta L_\pi d_\pi^2 D_\pi + \frac{\alpha}{20 L_\pi}\Big) + \frac{d_\pi D_\pi}{\eta T} + \alpha\Big).$$
Since our $\pi$-player enjoys gradient dominance for its loss function, we have from Lemma 5.1 that $\alpha \le c_\pi(T_O, K_\pi)$. Moreover, given that the W-player is employing FTPL+, from Lemma 4.2 the regret of the W-player is bounded by
$$\mathrm{Reg}_W \;\le\; O\Big(\frac{d_W D_W}{\eta T} + \alpha\Big).$$
In this setting, the W-player still enjoys gradient dominance properties, so using Projected Gradient Descent gives $\alpha \le c_W(T_O, K_W)$. Adding these together yields
$$\mathrm{Reg}_\pi + \mathrm{Reg}_W \;\le\; O\Big(\eta d_\pi^2 D_\pi L^2\Big(125\,\eta L_\pi d_\pi^2 D_\pi + \frac{\alpha}{20 L_\pi}\Big) + \frac{d_W D_W + d_\pi D_\pi}{\eta T} + c_\pi(T_O, K_\pi) + c_W(T_O, K_W)\Big).$$
Setting $\eta = \sqrt{\frac{20 L_\pi (d_W D_W + d_\pi D_\pi)}{d_\pi^2 D_\pi L^2 \alpha T}}$, we have that
$$\mathrm{Reg}_\pi + \mathrm{Reg}_W \;\le\; O\Big(\sqrt{\frac{D_\pi (d_W D_W + d_\pi D_\pi)\, c_\pi(T_O, K_\pi)}{5 L_\pi T}} + c_\pi(T_O, K_\pi) + c_W(T_O, K_W)\Big).$$

D.5 Proof of Theorem 6.2

Theorem 6.2. We are in the direct parameterization setting where the objective function is the regularized value function. Assume that the set of possible transition matrices $\mathcal{W}$ is convex. Let $OL_\pi$ use FTPL and $OL_W$ use Best-Response. Given that Condition 6.2 holds and $\eta = \frac{1}{L_\pi\sqrt{T d_\pi}}$, the robustness of Algorithm 1 satisfies, for any $\pi$,
$$\frac{1}{T}\sum_{t=1}^{T}\big(V^{\pi}_{W_t}(\mu) - \lVert\pi\rVert_2^2\big) - \mathbb{E}\Big[\min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T}\big(V^{\pi_t}_{W}(\mu) - \lVert\pi_t\rVert_2^2\big)\Big] \;\le\; O\Big(\frac{d_\pi^{3/2} D_\pi L_\pi}{\sqrt{T}}\Big) + c_W(T_O, K_W).$$

Proof. From Theorem 4.1, we have that
$$\frac{1}{T}\sum_{t=1}^{T}\big(V^{\pi}_{W_t}(\mu) - \lVert\pi\rVert_2^2\big) - \min_{W\in\mathcal{W}}\frac{1}{T}\sum_{t=1}^{T}\big(V^{\pi_t}_{W}(\mu) - \lVert\pi_t\rVert_2^2\big) \;\le\; \mathrm{Reg}_W + \mathrm{Reg}_\pi.$$
When the $\pi$-player uses FTPL, its regret is bounded by
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\eta d_\pi^2 D_\pi L_\pi^2 + \frac{d_\pi D_\pi}{\eta T} + \alpha\Big).$$
Setting $\eta = \frac{1}{L_\pi\sqrt{T d_\pi}}$ to minimize this, we have
$$\mathbb{E}(\mathrm{Reg}_\pi) \;\le\; O\Big(\frac{d_\pi^{3/2} D_\pi L_\pi}{\sqrt{T}} + \alpha\Big).$$
Here, $\alpha$ is the oracle error term.
Therefore, given that we have strong gradient dominance, using Projected Gradient Descent as the oracle yields the corresponding oracle-error bound from Lemma 6.1. Given that the W-player employs Best-Response, the regret of the W-player is bounded, by Lemma 4.1, as
$$\mathbb{E}(\mathrm{Reg}_W) \;\le\; \alpha,$$
where $\alpha$ is the optimization error. Given that we have gradient dominance properties for the W-player as well, using Projected Gradient Descent gives
$$\mathbb{E}(\mathrm{Reg}_W) \;\le\; c_W(T_O, K_W).$$
Adding these two inequalities together yields the final result.
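Taken together, the theorems above analyze a simple alternating loop: the $\pi$-player takes oracle-based no-regret steps while the W-player best-responds. The following sketch is a hypothetical instantiation of that loop, not the paper's Algorithm 1 verbatim: it assumes a tiny two-state MDP, a finite uncertainty set (so the Best Response is an exact argmin), a finite-difference gradient in place of an exact policy-gradient oracle, and clip-and-renormalize in place of a true simplex projection.

```python
import numpy as np

# Toy two-state, two-action robust MDP; all constants here are illustrative.
gamma = 0.9
nS, nA = 2, 2
rng = np.random.default_rng(1)
r = rng.uniform(0.0, 1.0, size=(nS, nA))   # reward r(s, a)
mu = np.array([0.5, 0.5])                  # start-state distribution

def random_kernel():
    P = rng.uniform(size=(nS, nS, nA))     # P[s', s, a]
    return P / P.sum(axis=0, keepdims=True)

# Small finite uncertainty set, so the W-player's Best Response is exact
# (the paper allows general convex sets accessed through an oracle).
uncertainty_set = [random_kernel() for _ in range(4)]

def value(pi, P):
    """Exact V^pi_P(mu) by solving the Bellman linear system."""
    P_pi = np.einsum('tsa,sa->st', P, pi)  # induced state-to-state kernel
    r_pi = np.einsum('sa,sa->s', r, pi)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return float(mu @ V)

def grad_pi(pi, P, eps=1e-5):
    """Finite-difference policy gradient; a stand-in for an exact gradient oracle."""
    g = np.zeros_like(pi)
    for s in range(nS):
        for a in range(nA):
            e = np.zeros_like(pi)
            e[s, a] = eps
            g[s, a] = (value(pi + e, P) - value(pi - e, P)) / (2 * eps)
    return g

def renormalize_rows(pi):
    """Crude surrogate for simplex projection: clip then renormalize each row."""
    pi = np.clip(pi, 1e-6, None)
    return pi / pi.sum(axis=1, keepdims=True)

pi = np.full((nS, nA), 1.0 / nA)           # direct parameterization, uniform start
lr, T = 0.5, 200
for t in range(T):
    # W-player: Best Response to the current policy (Lemma 4.1).
    W = min(uncertainty_set, key=lambda P: value(pi, P))
    # pi-player: projected gradient step against the adversarial dynamics.
    pi = renormalize_rows(pi + lr * grad_pi(pi, W))

robust_value = min(value(pi, P) for P in uncertainty_set)
print(f"worst-case value of trained policy: {robust_value:.3f}")
```

Each round, the adversary picks the kernel that is worst for the current policy, and the policy takes a projected ascent step against it; the trained policy's worst-case value over the uncertainty set then serves as its robustness certificate.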