# robust_averagereward_reinforcement_learning__eaa7c467.pdf

Journal of Artificial Intelligence Research 80 (2024) 719-803 Submitted 09/2023; published 06/2024

Robust Average-Reward Reinforcement Learning

Yue Wang yue.wang@ucf.edu University of Central Florida

Alvaro Velasquez alvaro.velasquez@colorado.edu University of Colorado Boulder

George Atia george.atia@ucf.edu University of Central Florida

Ashley Prater-Bennette ashley.prater-bennette@us.af.mil Air Force Research Laboratory

Shaofeng Zou szou3@buffalo.edu University at Buffalo, The State University of New York

Robust Markov decision processes (MDPs) aim to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. Existing studies mostly have focused on the robust MDPs under the discounted reward criterion, leaving the ones under the average-reward criterion largely unexplored. In this paper, we develop the first comprehensive and systematic study of robust average-reward MDPs, where the goal is to optimize the long-term average performance under the worst case. Our contributions are four-folds: (1) we prove the uniform convergence of the robust discounted value function to the robust average-reward function as the discount factor γ goes to 1; (2) we derive the robust average-reward Bellman equation, characterize the structure of its solution set, and prove the equivalence between solving the robust Bellman equation and finding the optimal robust policy; (3) we design robust dynamic programming algorithms, and theoretically characterize their convergence to the optimal policy; and (4) we design two model-free algorithms unitizing the multi-level Monte-Carlo approach, and prove their asymptotic convergence.

1. Introduction

The Markov decision process (MDP) is an effective mathematical tool for modeling sequential decision-making problems in stochastic environments (Derman, 1970; Puterman, 1994). Solving an MDP problem entails finding an optimal policy that maximizes a cumulative reward according to a given criterion. However, due to reasons including non-stationarity of the environment, modeling error, exogenous perturbation, partial observability, and adversarial attacks, a model mismatch exists between the assumed MDP model and the underlying environment and can result in solution policies with poor performance. To solve this issue, a framework of robust MDPs, e.g., (Bagnell et al., 2001; Nilim & El Ghaoui, 2004; Iyengar, 2005), has been proposed. Rather than adopting a fixed MDP model, in robust MDP, one seeks to optimize the worst-case performance over an uncertainty set of possible MDP models. The solution provides performance guarantees for all MDP models in the uncertainty set, and is thus robust to the model mismatch. Robust MDPs falling under different reward optimality criteria are fundamentally different. In robust discounted MDPs, the goal is to find a policy that maximizes the cumulative

2024 The Authors. Published by AI Access Foundation under Creative Commons Attribution License CC BY 4.0.

Wang, Velasquez, Atia, Prater-Bennette, Zou

discounted reward in the worst case, where the reward received diminishes exponentially over time. Much of the prior work in the robust setting has focused on the discounted reward criterion (see Sec. 1.2). Although discounted MDPs induce an elegant contractive Bellman operator which is mathematically convenient, the policy obtained may have a poor long-term performance when a system operates for an extended period of time. When the discount factor is close to 1, the agent may prefer to compare policies on the basis of their average expected reward instead of their expected total discounted reward, e.g., queueing control, inventory management in supply chains, scheduling automatic guided vehicles and applications in communication networks (Kober et al., 2013). Therefore, it is of great importance to optimize the long-term average performance of a system, especially in the present of environment uncertainty or model mismatch.

Nevertheless, robust MDPs under the average-reward criterion are largely understudied. Compared to the discounted reward, the average-reward depends on the limiting behavior of the underlying stochastic process and is markedly more intricate. A recognized instance of such intricacy concerns the one-to-one correspondence between the stationary policies and the limit points of state-action frequencies, which while true for discounted MDPs, breaks down under the average-reward criterion even in the non-robust setting except in some very special cases (Puterman, 1994; Atia et al., 2021). This is largely due to the dependence on the necessary conditions for establishing a contraction in average-reward settings on the graph structure of the MDP, versus the discounted-reward setting where it simply suffices to have a discount factor that is strictly less than one (Kazemi, Perez, Somenzi, Soudjani, Trivedi, & Velasquez, 2022). Heretofore, only a handful of studies have considered average-reward MDPs in the robust setting. The first work by (Tewari & Bartlett, 2007) considers robust average-reward MDPs under a specific finite interval uncertainty set, but their method is not easily applicable to other uncertainty sets. More recently, (Lim et al., 2013) proposed an algorithm for robust average-reward MDPs under the ℓ1 uncertainty set, and in (Grand-Cl ement & Petrik, 2023), a characterization of the similarity between the optimal robust policy for average-reward and discounted reward MDPs is studied for a class of uncertainty models. Beyond these works, however, obtaining fundamental characterizations of the problem and convergence guarantees remains elusive.

On the other hand, model-free approaches for robust MDPs, even for the discounted reward setting, are still far from well-established. Recent work (Roy et al., 2017) developed the first model-free algorithm for robust discounted RL, where they develop a relaxation to the uncertainty set which can be over-pessimistic due to the deviation from the nominal kernel. Moreover, a strong assumption is made on the discount factor in order to guarantee the convergence. To solve these issues, recent works (Wang & Zou, 2021; Liu et al., 2022; Liang et al., 2023; Wang et al., 2023) developed robust Q-learning algorithm for various uncertainty sets with convergence guarantee and sample complexity analyses. However, these approaches are for the discounted reward setting, and there is still a gap in developing model-free approaches for the average-reward setting.

1.1 Challenges and Contributions

In this paper, we derive characterizations of robust average-reward MDPs with general uncertainty sets, and develop model-based and model-free approaches with provable the-

Robust Average-Reward Reinforcement Learning

oretical guarantees. Our approach is fundamentally different from prior work on robust discounted MDPs, and robust and non-robust average-reward MDPs. In particular, the key challenges and the main contributions are summarized below. We characterize the limiting behavior of robust discounted value function as the discount factor γ 1. For the standard non-robust MDP and for a fixed transition kernel, the discounted non-robust value function converges to the average-reward non-robust value function as γ 1 (Puterman, 1994). However, in the robust setting, we need to consider the worst-case limiting behavior under all possible transition kernels in the uncertainty set. Hence, the previous point-wise convergence result (Puterman, 1994) cannot be directly applied. In (Tewari & Bartlett, 2007), a finite interval uncertainty set is studied, where due to its special structure, the number of possible worst-case transition kernels of robust discounted MDPs is finite, and hence the order of min (over transition kernel) and limγ 1 can be exchanged, and therefore, the robust discounted value function converges to the robust average-reward value function. This result, however, does not hold for general uncertainty sets investigated in this paper. We first prove the uniform convergence of discounted non-robust value function to average-reward w.r.t. the transition kernels and policies. Based on this uniform convergence, we show the convergence of the robust discounted value function to the robust average-reward. This uniform convergence result is the first in the literature and is of key importance to motivate our algorithm design and to guarantee convergence to the optimal robust policy in the average-reward setting. We design algorithms for robust policy evaluation and optimal control based on the limit method. Based on the uniform convergence, we then use robust discounted MDPs to approximate robust average-reward MDPs. We show that when γ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust averagereward, and hence solves the robust optimal control problem in the average-reward setting. This result is similar to the Blackwell optimality (Blackwell, 1962; Hordijk & Yushkevich, 2002) for the non-robust setting. However, our proof is fundamentally different. Technically, the proof in (Blackwell, 1962; Hordijk & Yushkevich, 2002) is based on the fact that the difference between the discounted value functions of two policies is a rational function of the discount factor, which has a finite number of zeros. However, in the robust setting with a general uncertainty set, the difference is no longer a rational function due to the min over the transition kernel. We construct a novel proof based on the limiting behavior of robust discounted MDPs, and show that the (optimal) robust discounted value function converges to the (optimal) robust average-reward as γ 1. Motivated by these insights, we then design our algorithms by applying a sequence of robust discounted Bellman operators with an increasing discount factor. We prove that our method can (i) evaluate the robust average-reward for a given policy and (ii) find the optimal robust value function and, in turn, the optimal robust policy for general uncertainty sets. We derive the robust average-reward Bellman equation and design a direct model-based algorithm with convergence guarantee. The fundamental structure of MDPs is usually characterized by a bootstrap-type equation, namely, the Bellman equation, which reveals the relation among the value function, reward function, and the transition kernels. It is of great importance in value-based approaches, since finding the optimal policy and solving the Bellman equation are equivalent. We derive a robust Bellman equation for robust average-reward MDPs, and show that the pair of robust relative value functions

Wang, Velasquez, Atia, Prater-Bennette, Zou

and robust average-reward is a solution to the robust Bellman equation under the averagereward setting. We further prove the equivalence between finding its solution and optimizing the robust average-reward. We then design a robust value iteration method that provably converges to the solution of the robust Bellman equation, i.e., solves the optimal policy for the robust average-reward MDP problem.

We design model-free algorithms for robust policy evaluation and optimal control. We then design model-free algorithms for robust average-reward RL. The major challenges are two-fold: 1) Constructing an unbiased estimator of the robust Bellman operator that captures the worst-case performance using the samples from the nominal transition kernel; and 2) Deriving the convergence of the stochastic algorithms. Regarding the first problem, one plausible approach is to define the estimator using the empirical transition kernel. However, due to the non-linear dependence of the robust Bellman operator on the nominal transition kernel, this plug-in estimator is biased. We hence employ the multilevel Monte-Carlo method (Blanchet & Glynn, 2015) and construct unbiased estimators. We then utilize them to design robust RVI TD and Q-learning algorithms. We leverage the stochastic approximation approach and the characterization of the Bellman equation to show the global asymptotic stability of our algorithms and further derive the convergence of our stochastic algorithms. Specifically, robust RVI TD converges to the worst-case average-reward; and for the robust RVI Q-learning, the greedy policy w.r.t. the Q-function converges to an optimal robust policy.

1.2 Related Work

Robust discounted MDPs. Model-based methods for robust discounted MDPs were studied in (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001; Satia & Lave Jr, 1973; Wiesemann et al., 2013; Lim & Autef, 2019; Xu & Mannor, 2010; Yu & Xu, 2015; Lim et al., 2013; Tamar et al., 2014), where the uncertainty set is assumed to be known, and the problem can be solved using robust dynamic programming. Later, the studies were generalized to the model-free setting where stochastic samples from the nominal MDP of the uncertainty set are available in an online fashion (Roy et al., 2017; Badrinath & Kalathil, 2021; Wang & Zou, 2021, 2022; Tessler et al., 2019) and an offline fashion (Zhou et al., 2021; Yang et al., 2022; Panaganti & Kalathil, 2022; Goyal & Grand-Clement, 2018; Kaufman & Schaefer, 2013; Ho et al., 2018, 2021; Si et al., 2020). There are also empirical studies on robust RL, e.g., (Vinitsky et al., 2020; Pinto et al., 2017; Abdullah et al., 2019; Hou et al., 2020; Rajeswaran et al., 2017; Huang et al., 2017; Kos & Song, 2017; Lin et al., 2017; Pattanaik et al., 2018; Mandlekar et al., 2017). For discounted MDPs, the robust Bellman operator is a contraction, based on which robust dynamic programming and value-based methods can be designed. In this paper, we focus on robust average-reward MDPs, where the robust Bellman operator for average-reward MDPs is not a contraction, and its fixed point may not be unique. Moreover, the average-reward setting depends on the limiting behavior of the underlying stochastic process, which is thus more intricate.

Robust average-reward MDPs. Studies on robust average-reward MDPs are quite limited in the literature. Robust average-reward MDPs under a specific finite interval uncertainty set was studied in (Tewari & Bartlett, 2007), where the authors showed the existence of a robust Blackwell optimality constant, i.e., there exists some δ [0, 1), such that the

Robust Average-Reward Reinforcement Learning

optimal robust policy for the robust average-reward MDP exists and remains unchanged for the robust discounted reward ones with any discount factor γ [δ, 1). However, this result depends on the structure of the uncertainty set, where the number of possible worst-case transition kernels is assumed finite. Under the similar assumptions, a recent work (Grand Cl ement & Petrik, 2023) derived a lower bound on the robust Blackwell optimality constant δ; Under a similar polytopic assumption, (Chatterjee et al., 2023) design a policy iteration algorithm with convergence and computational complexity analysis. For more general uncertainty sets, the studies of robust average-reward MDPs, are not well-explored. Another work (Lim et al., 2013) designed a model-free algorithm for a specific ℓ1-norm uncertainty set and characterized its regret bound. However, their method also relies on the structure of the ℓ1-norm uncertainty set, and may not be generalizable to other types of uncertainty sets. In this paper, our results can be applied to various types of uncertainty sets, and thus is more general. Non-robust average-reward MDPs. Early contributions to non-robust average-reward MDPs include a fundamental characterization of the problem and model-based methods (Puterman, 1994; Bertsekas, 2011). Model-free methods in the tabular setting, e.g., RVI Q-learning (Abounadi et al., 2001) and differential Q-learning (Wan et al., 2021; Wan & Sutton, 2022), were developed recently and are both shown to converge to the optimal average-reward. There is also work on average-reward RL with function approximation, e.g., (Zhang et al., 2021b; Tsitsiklis & Van Roy, 1999; Zhang et al., 2021a; Yu & Bertsekas, 2009). In this paper, we focus on the robust setting, where the key challenge lies in the non-linearity of the robust average-reward Bellman equation, whereas it is linear in the non-robust setting.

2. Preliminaries and Problem Model

In this section, we introduce some preliminaries on discounted MDPs, average-reward MDPs, and robust MDPs. Discounted MDPs. A discounted MDP (S, A, P, r, γ) is specified by: a finite state space S, a finite action space A, a transition kernel P = {pa s (S), a A, s S}1, where pa s is the distribution of the next state over S upon taking action a in state s (with pa s,s denoting the probability of transitioning to s ), a reward function r : S A [0, 1], and a discount factor γ [0, 1). At each time step t, the agent at state st takes an action at, the environment then transitions to the next state st+1 according to pat st, and produces a reward signal r(st, at) [0, 1] to the agent. In this paper, we also write rt = r(st, at) for convenience. A stationary policy2 π : S (A) is a distribution over A for any given state s, and the agent takes action a at state s with probability π(a|s). The discounted value function of a stationary policy π starting from s S is defined as the expected discounted cumulative reward by following policy π: V π P,γ(s) Eπ,P P t=0 γtrt|S0 = s . Average-Reward MDPs. Different from discounted MDPs, average-reward MDPs do not discount the reward over time, and consider the behavior of the underlying Markov

1. (S): the (|S| 1)-dimensional probability simplex on S. 2. In this paper, we focus on the stationary policies. The studies under the history-dependent policies are left for future exploration.

Wang, Velasquez, Atia, Prater-Bennette, Zou

process under the steady-state distribution. More specifically, under a specific transition kernel P, the average-reward of a policy π starting from s S is defined as

gπ P(s) lim n Eπ,P

t=0 rt|S0 = s , (1)

which we also refer to in this paper as the average-reward value function for convenience. The average-reward value function can also be equivalently written as follows: gπ P = limn 1

n Pn 1 t=0 (Pπ)trπ Pπ rπ, where (Pπ)s,s P a π(a|s)pa s,s and rπ(s) P a π(a|s)r(s, a) are the transition matrix and reward function induced by π, and Pπ limn 1

n Pn 1 t=0 (Pπ)t

is the limit matrix of Pπ. In the average-reward setting, we also define the following relative value function

V π P (s) Eπ,P

t=0 (rt gπ P)|S0 = s , (2)

which is the cumulative difference over time between the reward and the average value gπ P. It has been shown that (Puterman, 1994): V π P = Hπ Prπ, where Hπ P (I Pπ +Pπ ) 1(I Pπ ) is defined as the deviation matrix of Pπ. The relationship between the average-reward and the relative value functions can be characterized by the following Bellman equation (Puterman, 1994):

V π P (s) = Eπ

r(s, A) gπ P(s) + X

s S p A s,s V π P (s ) . (3)

Robust discounted and average-reward MDPs. For robust MDPs, the transition kernel is not fixed but belongs to some uncertainty set P. After the agent takes an action, the environment transits to the next state according to an arbitrary transition kernel P P. In this paper, we focus on the (s, a)-rectangular uncertainty set (Nilim & El Ghaoui, 2004; Iyengar, 2005), i.e., P = N s,a Pa s , where Pa s (S). We note that there are also studies on relaxing the (s, a)-rectangular uncertainty set to s-rectangular uncertainty set, which is not the focus of this paper. Under the robust setting, we consider the worst-case performance over the uncertainty set of MDPs. More specifically, the robust discounted value function of a policy π for a discounted MDP is defined as

V π P,γ(s) min P P Eπ,P

t=0 γtrt|S0 = s

In this paper, we focus on the following worst-case average-reward for a policy π:

gπ P(s) min P P lim n Eπ,P

t=0 rt|S0 = s

to which, for convenience, we refer as the robust average-reward value function3.

3. Here we consider the worst case performance among the stationary model, i.e., the transition kernels are identical at each time step. However, as shown later in this paper, the worst case performance under the dynamic uncertain model is the same as the one under the stationary model. Hence it is sufficient to consider the stationary model.

Robust Average-Reward Reinforcement Learning

For robust discounted MDPs, it has been shown that the robust discounted value function is the unique fixed-point of the robust discounted Bellman operator (Nilim & El Ghaoui, 2004; Iyengar, 2005; Puterman, 1994):

a A π(a|s) r(s, a) + γσPas (V ) , (6)

where σPas (V ) minp Pas p V is the support function of V on Pa s . Based on the contraction of Tπ, robust dynamic programming approaches, e.g., robust value iteration, can be designed (Nilim & El Ghaoui, 2004; Iyengar, 2005) (see Appendix B for a review of these methods). However, there is no such contraction result for robust average-reward MDPs. In this paper, our goal is to find a policy that optimizes the robust average-reward value function:

max π Π gπ P(s), for any s S, (7)

where Π is the set of all stationary policies, and we denote by g P(s) maxπ gπ P(s) the optimal robust average-reward.

3. Limit Approach

We first take a limit approach to solve the problem of robust average-reward MDPs in (7). It is known that under the non-robust setting, for any fixed π and P, the discounted value function converges to the average-reward value function as the discount factor γ approaches 1 (Puterman, 1994), i.e.,

lim γ 1(1 γ)V π P,γ = gπ P. (8)

Note that the term (1 γ) is necessary to ensure the finiteness of the limit, otherwise V π P,γ if γ 1. We take a similar idea, and show that the same result holds in the robust case: limγ 1(1 γ)V π P,γ = gπ P under a mild assumption. This result further enables us to draw numerous characterizations of the fundamental structure of the robust MDPs under the average-reward setting. Moreover, we design algorithms (Algorithms 1 and 2) to solve robust MDPs under the average-reward criterion based on this result, and further prove its convergence and optimality.

3.1 Uniform Convergence of Robust Discounted Value Functions

In this section, we first show that the convergence limγ 1(1 γ)V π P,γ = gπ P is uniform on the set Π P. In studies of average-reward MDPs, it is usually the case that a certain class of MDPs is considered, e.g., unichain and communicating (Wei et al., 2020; Zhang & Ross, 2021; Chen et al., 2022; Wan et al., 2021). In this paper, we focus on the unichain setting to highlight the major technical novelty to achieve robustness.

Assumption 1. For any s S, a A, the uncertainty set Pa s is a compact subset of (S). And for any π Π, P P, the induced MDP is a unichain.

Wang, Velasquez, Atia, Prater-Bennette, Zou

The first part of Assumption 1 amounts to assuming that the uncertainty set is closed. We remark that many standard uncertainty sets satisfy this assumption, e.g., those defined by ϵ-contamination (Huber, 1965), finite interval (Tewari & Bartlett, 2007), total-variation (Rahimian et al., 2022) and KL-divergence (Hu & Hong, 2013). The unichain assumption is also widely used in studies of average-reward MDPs, e.g., (Puterman, 1994; Wan et al., 2021; Zhang & Ross, 2021; Lan, 2020; Zhang et al., 2021b). Also, it is worth noting that under the unichain assumption, the average-reward is identical for every starting state, i.e., gπ P(s1) = gπ P(s2), s1, s2 S (Bertsekas, 2011).

Remark 1. The results in this section actually only require the uniform boundedness of Hπ P , π Π, P P (Lemma 4 in the appendix). Assumption 1 is one sufficient condition.

In (Puterman, 1994), the convergence limγ 1(1 γ)V π P,γ = gπ P for a fixed policy π and a fixed transition kernel P (non-robust setting) is point-wise. However, such pointwise convergence does not provide any convergence guarantee on the robust discounted value function, as the robust value function measures the worst-case performance over the uncertainty set and the order of lim and min may not be exchangeable in general. In the following theorem, we prove the uniform convergence of the discounted value function under the foregoing assumption.

Theorem 1 (Uniform convergence). Under Assumption 1, the discounted value function converges uniformly to the average-reward value function on Π P as γ 1, i.e.,

lim γ 1(1 γ)V π P,γ = gπ P uniformly on Π P. (9)

With uniform convergence in Theorem 1, the order of the limit γ 1 and min P can be interchanged. Then, the following convergence of the robust discounted value function can be established.4

Theorem 2. The robust discounted value function in (4) converges to the robust averagereward uniformly on Π:

lim γ 1(1 γ)V π P,γ = gπ P uniformly on Π. (10)

We note that the convergence can also be derived under some other assumptions or for specific uncertainty sets. For example, a similar convergence result is shown in (Tewari & Bartlett, 2007) for a special uncertainty set of finite interval type. This result is further generalized in (Grand-Cl ement & Petrik, 2023; Goyal & Grand-Clement, 2018), where the convergence is obtained under the assumption that the number of the possible worstcase transition kernels is finite, i.e., {P P : gπ P = gπ P} is a finite set for any policy π. Besides the finite interval uncertainty set, it is shown that the uncertainty sets defined by lp-norm (Grand-Cl ement & Petrik, 2023; Goyal & Grand-Clement, 2018) also satisfy this assumption. Under this assumption, the lim and max are interchangeable and hence the convergence can be obtained. Our Theorem 2 holds for general compact uncertainty sets. Moreover, it is worth highlighting that our proof technique is fundamentally different from

4. During the preparation of our manuscript, a recent work (Grand-Clement, Petrik, & Vieille, 2023) develops a similar result without the unichian assumption.

Robust Average-Reward Reinforcement Learning

the one in (Tewari & Bartlett, 2007; Grand-Cl ement & Petrik, 2023), where the worst-case transition kernels are from a finite set, i.e., V π P,γ = min P M V π P,γ for a finite set M P. This hence implies the interchangeability of lim and min. However, for general uncertainty sets, the number of worst-case transition kernels may not be finite. We demonstrate the interchangeability via our uniform convergence result in Theorem 1. The preceding uniform convergence result enables us to exchange the order of the operators limγ 1, min P P and maxπ Π. This connects robust MDPs under the discounted reward with average-reward settings. In the following theorem, we show that we can also restrict ourselves to deterministic polices when optimizing the robust MDPs under the average-reward criterion.

Theorem 3. (Deterministic Optimality) There exists a deterministic optimal robust policy, i.e., π ΠD, such that gπ P = g P.

The uniform convergence result in Theorem 2 motivates the use of robust discounted MDPs with γ 1 to approximate robust average-reward MDPs, which we refer to as the limit method. As discussed in (Blackwell, 1962; Hordijk & Yushkevich, 2002), the average-reward criterion is insensitive and under selective since it is only interested in the performance under the steady-state distribution. For example, two policies providing rewards: 100 + 0 + 0 + and 0 + 0 + 0 + are equally good/bad. For the non-robust setting, a more sensitive term of optimality was introduced by Blackwell (Blackwell, 1962). More specifically, a policy is said to be Blackwell optimal if it optimizes the discounted value function for any discount factor γ (δ, 1) for some δ (0, 1). Together with (8), the optimal policy obtained by taking γ 1 is optimal not only for the average-reward criterion, but also for the discounted criterion with large γ. Intuitively, it is optimal under the average-reward setting, and is sensitive to early rewards. Following a similar idea, we justify that the optimal robust policy for the robust averagereward MDPs is also sensitive to early rewards. Denote by Π D the set of all the deterministic optimal policies for robust average-reward (proved to exist in Lemma 9), i.e. Π D = {π ΠD : gπ P = g P} .

Theorem 4 (Blackwell optimality). There exists 0 < δ < 1, such that for any γ > δ, the deterministic optimal robust policy for robust discounted value function V P,γ is also optimal under the average-reward criterion. Moreover, when Π D is a singleton, there exists a unique Blackwell optimal policy.

This result implies that the optimal robust policy for average-reward MDPs has an additional advantage that the policy it finds not only optimizes the average-reward in steady state, but also is sensitive to early rewards. It is worth highlighting the distinction of our results from the technique used in the proof of Blackwell optimality (Blackwell, 1962). In the non-robust setting, the existence of a stationary Blackwell optimal policy is proved via contradiction, where a difference function of two policies π and ν: fπ,ν(γ) V π P,γ V ν P,γ is used in the proof. It was shown by contradiction that f has infinitely many zeros, which however contradicts with the fact that f is a rational function of γ with a finite number of zeros. A similar technique was also used in (Tewari & Bartlett, 2007) for the finite interval uncertainty set. Specifically, in (Tewari & Bartlett, 2007), it was shown that the worst-case transition kernels for any π, γ

Wang, Velasquez, Atia, Prater-Bennette, Zou

are from a finite set M, hence fπ,ν(γ) min P M V π P,γ min P M V ν P,γ can also be shown to be a rational function with a finite number of zeroes. For a general uncertainty set P, the difference function fπ,ν(γ), however, may not be rational. This makes the method in (Blackwell, 1962; Tewari & Bartlett, 2007) inapplicable to our problem.

3.2 Limit Method: Robust Value Iteration

Results in Section 3.1 play a fundamental role in developing the limit method for robust average-reward MDPs, and are of key importance to motivate the design of the following two algorithms. The basic idea is to apply a sequence of robust discounted Bellman operators on an arbitrary initialization while increasing the discount factor at a certain rate.

We first consider the robust policy evaluation problem, which aims to estimate the robust average-reward gπ P for a fixed policy π. This problem for robust discounted MDPs is well studied in the literature. However, results for robust average-reward MDPs are quite limited except for the one in (Tewari & Bartlett, 2007; Goyal & Grand-Clement, 2018) for specific uncertainty sets. We present a robust value iteration (robust VI) algorithm for evaluating the robust average-reward with general uncertainty sets in Algorithm 1. At each time step

Algorithm 1 Robust VI: Policy Evaluation Input: π, V0(s) = 0, s, T

1: for t = 0, 1, ..., T 1 do

t+2 3: for all s S do

4: Vt+1(s) Eπ[(1 γt)r(s, A) + γtσPA s (Vt)] = Eπ[(1 γt)r(s, A) + γt min P PA s (PVt)]

7: return VT

t, the discount factor γt is set to be t+1

t+2, which converges to 1 as t . Subsequently, a robust Bellman operator w.r.t discount factor γt is applied on the current estimate Vt of the robust discounted value function (1 γt)V π P,γt. As the discount factor approaches 1, the estimated robust discounted value function converges to the robust average-reward gπ P by Theorem 2. The following result shows that the output of Algorithm 1 converges to the robust average-reward.

Theorem 5. Algorithm 1 converges to the robust average-reward, i.e., lim T VT = gπ P.

Besides the robust policy evaluation problem, it is also of great practical importance to find an optimal policy that maximizes the worst-case average-reward, i.e., to solve (7). Based on a similar idea as the one of Algorithm 1, we extend our limit approach to solve the robust optimal control problem in Algorithm 2.

Robust Average-Reward Reinforcement Learning

Algorithm 2 Robust VI: Optimal Control Input: V0(s) = 0, s, T

1: for t = 0, 1, ..., T 1 do

t+2 3: for all s S do

4: Vt+1(s) max a A (1 γt)r(s, a) + γtσPas (Vt)

7: for s S do

8: πT (s) arg maxa A (1 γt)r(s, a) + γtσPas (VT )

10: return VT , πT

The discount factor γt is set similarly as in Algorithm 1, and a one-step robust discounted Bellman operator (for optimal control) w.r.t. γt is applied to the current estimate Vt. The following theorem establishes that VT in Algorithm 2 converges to the optimal robust value function, and hence can find the optimal robust policy.

Theorem 6. The output VT in Algorithm 2 converges to the optimal robust average-reward g P, i.e., VT g P as T .

4. Direct Approach

The limit approach in Section 3 is based on the uniform convergence of the robust discounted value function, and uses discounted MDPs to approximate average-reward MDPs. In this section, we develop a direct approach to solving the robust average-reward MDPs that does not adopt discounted MDPs as intermediate steps.

4.1 Robust Bellman Equation

One of the most important results which enable the dynamic programming approach for solving MDPs is the Bellman equation. Such results have been generalized to robust discounted MDPs (Nilim & El Ghaoui, 2004; Iyengar, 2005), and we develop an analog result for robust average-reward MDPs as follows. We first generalize the relative value function in (2) to the robust relative value function. The robust relative value function measures the difference between the worst-case cumulative reward and the worst-case average-reward for a policy π.

Definition 1. The robust relative value function is defined as

V π P (s) min κ N t 0 P Eκ,π

t=0 (rt gπ P)|S0 = s , (11)

where gπ P is the worst-case average-reward defined in (5).

We further introduce several notations. For V R|S|, denote by PV (s, a) arg minp Pas p V and let PV = {PV (s, a), s S, a A}. Moreover, denote the set of the worst-case transition kernels by Ωπ g, i.e., Ωπ g = {P P : gπ P = gπ P}.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Theorem 7. For any π, (V π P , gπ P) is a solution to the following robust Bellman equation:

V (s) + g = X

a π(a|s) r(s, a) + σPas (V ) , s. (12)

Moreover, if (g, V ) is a solution to it, then

1) g = gπ P;

2) PV Ωπ g;

3) V = V π PV + ce for some c R, where e denotes the vector (1, 1, ..., 1) R|S|.

It can be seen that the robust Bellman equation for average-reward MDPs has a similar structure to the one for discounted MDPs in (6) except for a discount factor. This actually reveals a fundamental difference between the robust Bellman operator of the discounted MDPs and the average-reward ones. For a discounted MDP, its robust Bellman operator is a contraction with constant γ (Nilim & El Ghaoui, 2004; Iyengar, 2005), and hence the fixed point is unique. Based on this, the robust value function can be found by recursively applying the robust Bellman operator (see Appendix B for a review). In sharp contrast, in the average-reward setting, the robust Bellman operator is not necessarily a contraction, and the fixed point may not be unique. Therefore, repeatedly applying the robust Bellman operator in the average-reward setting may not even converge, which underscores that the two problem settings are fundamentally different. The second part of Theorem 7 provides a characterization of the solutions to the Bellman equation, where we show that for any solution (g, V ) to (12), the transition kernel PV Ωπ g, i.e., it is a worst-case transition kernel for gπ P. This result also distinguishes the structure of the robust Bellman equation from the non-robust one. Under the non-robust setting, the solution set to the Bellman equation can be written as {(gπ P, V π P +ce) : c R}. The solution is uniquely determined by the transition kernel (up to some constant vector ce). In contrast, in the robust setting, the robust Bellman equation is no longer linear. Any solution V to (12) is a relative value function w.r.t. some worst-case transition kernel P Ωπ g (up to some additive constant vector), i.e., V {V π P + ce : P Ωπ g, c R}. A natural question that arises is whether, for any P Ωπ g, (gπ P, V π P ) is a solution to (12)? Lemma 1 refutes this.

Lemma 1. There exists a robust MDP such that for some P Ωπ g, (gπ P, V π P ) is not a solution to (12).

Lemma 1 implies that the solution set to (12) is a subset of {V π P + ce, P Ωπ g, c R}. Note that an explicit characterization of the solution set to (12) is challenging due to its non-linear structure; however, result 3 in Theorem 7 suffices to establish the convergence of our model-free algorithms (shown later in Section 5). Theorem 7 characterizes the robust average-reward for a fixed policy π, which plays an essential role in policy evaluation problems, i.e., to estimate the robust average-reward for π. To find the optimal robust policy, we similarly derive the following optimality condition for robust average-reward MDPs. It is generally useful to consider the action-value functions in optimal control problems, hence we consider a Q-function Q : S A R, and define VQ(s) = maxa Q(s, a), s S.

Robust Average-Reward Reinforcement Learning

Theorem 8 (Optimal robust Bellman equation). If (g, Q) is a solution to the optimal robust Bellman equation

Q(s, a) = r(s, a) g + σPas (VQ), s, a, (13)

1) g = g P;

2) the greedy policy w.r.t. Q: πQ(s) = arg maxa Q(s, a) is an optimal robust policy;

3) VQ = V πQ P + ce for some P ΩπQ g , c R.

Note that the theorem is presented using the action-value function Q; Similar results can be easily adapted using the V function as follows.

Corollary 1. For any (g, V ) that is a solution to

max a r(s, a) g + σPas (V ) V (s) = 0, s, (14)

g = g P. If we further set

π (s) = arg max a r(s, a) + σPas (V ) , s S, (15)

then π is an optimal robust policy.

According to Theorem 8, finding a solution to (13) is sufficient to get the optimal robust average-reward and to derive the optimal robust policy. Similarly to Theorem 7, we omit the explicated study and characterization of the solution set to (13), but the above results are sufficient for the convergence proof of our direct methods and algorithms. Results in this section provide a comprehensive characterization of the fundamental structure of robust MDPs under the average-reward criterion, and indicates the equivalence between solving them and find the solutions to the robust Bellman equations. However, as discussed, the solution set to the robust Bellman equations can be complicated, and are not straightforward to solve. In the next section, we develop a model-based algorithm to solve the equations and find the optimal robust policy.

4.2 Direct Method: Robust Relative Value Iteration

In the following, we generalize the RVI approach to the robust setting, and design a robust RVI algorithm in Algorithm 3. We will further show that the output of this algorithm converges to a solution to (14), and further the optimal policy could be obtained by (15). Here, sp denotes the span semi-norm: sp(w) = maxs w(s) mins w(s), and f : R|S| R is an offset function introduced to stabilize the algorithm updating. We adopt the following assumption from (Puterman, 1994).

Assumption 2. f : R|S| R is Lf-Lipschitz and satisfies f(e) = 1, f(x + ce) = f(x) + c, f(cx) = cf(x), c R.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Algorithm 3 Robust RVI Input: V0, ϵ

1: w0 V0 f(V0)

2: while sp(wt wt+1) ϵ do

3: for all s S do

4: Vt+1(s) maxa(r(s, a) + σPas (wt))

5: wt+1(s) Vt+1(s) f(Vt+1)

7: end while

8: return wt, Vt

Assumption 2 can be easily satisfied, e.g., f(V ) = V (s0) for some reference state s0 S, or f(V ) = P s V (s)

|S| (Abounadi et al., 2001). Compared with the discounted setting, f is critical here. As we discussed above, in the average-reward setting, the solution to the Bellman equation V + ce can be arbitrarily large because c can be any real number. This may lead to a non-convergent sequence Vn (see, e.g., Example 8.5.2 of (Puterman, 1994)). Hence, a function f is introduced to offset Vn and keep the iterates stable. Different from Algorithm 2, in Algorithm 3, we do not apply the robust discounted Bellman operator. The method directly solves the robust optimal control problem for average-reward robust MDPs. We note that in the previous studies of non-robust averagereward, a stronger assumption is made to guarantee the convergence of the non-robust relative value iteration (see, e.g., (Puterman, 1994)). However, it can be weakened using the aperiodic transform technique. In this paper, we further generalize such technique to the robust setting, and show the convergence of our robust RVI under Assumption 1. In the following theorem, we show that our Algorithm 3 converges to a solution of (14). Then according to Theorem 8, the optimal robust policy can be obtained by setting π according to (15) from the limit of the algorithm.

Theorem 9. (wt, Vt) converges to a solution (w, V ) to (14) as ϵ 0.

Remark 2. In this section, we present the robust RVI algorithm for the robust optimal control problem, and its asymptotic convergence and optimality guarantee. A robust RVI algorithm for robust policy evaluation can be similarly designed by replacing the max in line 4, Algorithm 3 with an expectation w.r.t. π. The convergence results in Theorem 9 can also be similarly derived. Our algorithms are expected to converge linearly, as we show in Theorem 9 that it is a multi-step contraction. However, the exact number of steps and the contraction coefficients are involved dependent on the underlying MDP, and hence we leave the exact characterization of its convergence rate as a future research interest.

5. Model-Free Approaches for Robust Average-Reward MDPs

In the previous sections, we developed the fundamental characterizations of the robust average-reward MDPs and designed two algorithms under the model-based setting, where we assume full knowledge of the uncertainty set P. However, in practice the learner may

Robust Average-Reward Reinforcement Learning

not know exactly the nominal MDP and thus the uncertainty set, instead, it can obtain samples from the nominal MDP. A natural idea for the model-free setting is to replace the robust Bellman operators in Algorithms 2 and 3 using some estimators obtained from the samples. However, such an extension is not straightforward due to two reasons: 1) The convergence results above are not guaranteed with stochastic estimators; 2) the construction of an unbiased estimator can be difficult, due to the distribution shift between the nominal kernel and the worst-case kernel. In this section, we first construct unbiased estimators of the robust Bellman operator for five uncertainty set models, including the contamination model, the total variation model, the Chi-square model, the KL-divergence model, and the Wasserstein distance model. We then design stochastic algorithms using these unbiased estimators and show their convergence.

5.1 Robust RVI TD for Policy Evaluation

In this section, we first study the problem of robust policy evaluation, which aims to estimate the robust average-reward gπ P for a fixed policy π. For technical convenience, we make the following assumption to guarantee that the average-reward is independent of the initial state (Abounadi et al., 2001; Wan et al., 2021; Zhang et al., 2021a; Zhang & Ross, 2021; Chen et al., 2022).

Assumption 3. The Markov chain induced by π is a unichain for all P P.

Note that this assumption is weaker than Assumption 1, which is because the policy evaluation problem only considers a fixed policy. In general, the average-reward depends on the initial state. For example, imagine a policy that induces a multichain in an MDP with two closed communicating classes. A learning algorithm would be able to learn the average-reward for each communicating class; however, the average-rewards for the two classes may be different. To remove this complexity, it is common and convenient to rule out this possibility. Under Assumption 3, the average-reward w.r.t. any P P is identical for any start state, i.e., gπ P(s) = gπ P(s ), s, s S. Motivated by the robust Bellman equation in (12), we propose a model-free robust RVI TD algorithm in Algorithm 4, where ˆT is some estimator of the robust Bellman operator and will be discussed later.

Algorithm 4 Robust RVI TD Input: V0, αn, n = 0, 1, ..., N 1

1: for n = 0, 1, ..., N 1 do

2: for all s S do

3: Vn+1(s) Vn(s) + αn(ˆTVn(s) f(Vn) Vn(s))

Note that (12) can be written as V = TV g, where T is the robust average-reward Bellman operator. Since in the model-free setting P is unknown, in Algorithm 4, we construct ˆTV as an estimate of TV satisfying

E[ˆTV ] = TV, Var[ˆTV (s)] C(1 + V 2), (16)

Wang, Velasquez, Atia, Prater-Bennette, Zou

for some constant C > 0. In this paper, if not specified, denotes the infinity norm . It is challenging to construct such ˆT as T is non-linear in the nominal transition kernel from which samples are generated. In Section 5.3, we will present in detail how to construct such ˆT for various uncertainty set models. We then assume the Robbins-Monro condition on the stepsize, and further show the convergence of robust RVI TD.

Assumption 4. The stepsize {αn} n=0 satisfies the Robbins-Monro condition, i.e., P n=0 αn = , P n=0 α2 n < .

Theorem 10 (Convergence of robust RVI TD). Under Assumptions 2, 3, 4, and if ˆT satisfies (16), then almost surely, (f(Vn), Vn) converges to a solution to (12) which may depend on the initialization.

The result implies that f(Vn) gπ P a.s., which means our robust RVI TD converges to the worst-case average-reward for the given policy π. Our robust RVI TD algorithm is hence shown to converge to a solution to (12), which solves the problem under the model-free setting. To show the convergence of the stochastic model-free algorithm, we utilize the stochastic approximation approach. We study the associated ODE of the algorithm, tackle the stochastic noise and show the algorithm converge to the equilibrium of the ODE. As we discussed after Theorem 7, result 3 of Theorem 7 is crucial to the convergence proof of Theorem 10. Specifically, it is necessary in order to characterize the equilibrium of the associated ODE, and thus the limit of the iterates f(Vn) gπ P.

Remark 3. Although our algorithm is presented in a synchronous fashion, the similar convergence result is also expected to hold for asynchronous version of algorithm, under the assumption that all state-action pairs are visited for infinite number of times. Such an extension is standard, see, e.g., (Wan et al., 2021).

5.2 Robust RVI Q-Learning for Control

In this section, we aim to find the optimal robust policy under the model-free setting, i.e., find π = arg maxπ gπ P. Inspired by the model-based methods and the previous section, We hence present the following model-free robust RVI Q-learning algorithm.

Algorithm 5 Robust RVI Q-learning Input: Q0, αn

1: for n = 0, ..., N 1 do

2: for all s S, a A do

3: Qn+1(s, a) Qn(s, a) + αn ˆHQn(s, a) f(Qn) Qn(s, a)

Similar to the robust RVI TD algorithm, denote the optimal robust Bellman operator by HQ(s, a) r(s, a) + σPas (VQ), and we construct an estimate ˆH such that for some finite

Robust Average-Reward Reinforcement Learning

constant C,

E[ ˆHQ] = HQ, Var[ ˆHQ(s, a)] C(1 + Q 2). (17)

In Section 5.3, we will present in detail how to construct such ˆH for various uncertainty set models. The following theorem shows the convergence of the robust RVI Q-learning to the optimal robust average-reward g P and the optimal robust policy π .

Theorem 11 (Convergence of robust RVI Q-learning). Under Assumptions 1, 2, 4, and if ˆH satisfies (17), then almost surely, (f(Qn), Qn) converges to a solution to (13), i.e., f(Qn) converges to g P, and the greedy policy πQn(s) arg maxa Qn(s, a) converges to an optimal robust average-reward π .

The above results imply that our robust RVI Q-learning algorithm finds the optimal robust average-reward function and the optimal robust policy, under the model-free setting. The proof technique is similar to the one of Theorem 10, where we first characterize the equilibrium of the associated ODE, and prove the global stability and convergence of our algorithm.

5.3 Construction of Estimated Operator: Case Studies

In the previous two sections, we showed that if an unbiased estimator with bounded variance is available for the robust Bellman operator, then both robust algorithms proposed converge to the optimum. In this section, we present the design of these estimators for various uncertainty set models. The major challenge in designing the estimated operators satisfying (16) and (17) lies in estimating the support function σPas (V ) using samples from the nominal transition kernel Pa s, which in general is different from the worst-case transition kernel. The function σPas (V ) is non-linear in the nominal kernel, which makes it challenging to construct such an estimator. For instance, the most widely-used MLE plugging-in estimator is shown to be biased. If we use the empirical transition kernel ˆP as the centroid to construct an uncertainty set ˆP, then the estimator is biased: E[σ ˆPas (V )] = σPas (V ). To solve this issue, we hence utilize the multi-level Monte-Carlo approach (Blanchet & Glynn, 2015), and construct an unbiased estimator for several widely-used non-linear uncertainty models including the total variation model, the Chi-square model, the KLdivergence model, and the Wasserstein distance model. We show that our estimators are unbiased and have bounded variance in the following theorem. We will present the design in later sections.

Theorem 12. For each uncertainty set, denote its corresponding estimators by ˆT and ˆH as in Sections 5.3.1 and 5.3.2. Then, there exists some constant C, such that (16) and (17) hold.

In the following sections, we construct an operator ˆσPas to estimate the support function σPas , s S, a A for each uncertainty set. We further define the estimated robust Bellman operators as ˆTV (s) P a π(a|s)(r(s, a) + ˆσPas (V )) and ˆHQ(s, a) r(s, a) + ˆσPas (VQ).

Wang, Velasquez, Atia, Prater-Bennette, Zou

5.3.1 Linear Model: Contamination Uncertainty Set

The ζ-contamination uncertainty set is Pa s = {(1 ζ)Pa s +ζq : q ζ(S)}, where 0 < ζ < 1 is the radius. Under this uncertainty set, the support function can be computed as σPas (V ) = (1 ζ)Pa s V + ζ mins V (s), and this is linear in the nominal transition kernel Pa s. We hence use the transition to the subsequent state to construct our estimator:

ˆσPas (V ) (1 ζ)γV (s ) + ζ min x V (x), (18)

where s is a subsequent state sample after (s, a).

5.3.2 Non-Linear Models

Unlike the contamination model, most uncertainty sets result in a non-linear support function of the nominal transition kernel. We will employ the approach of multi-level Monte Carlo which is widely used in quantile estimation under stochastic environments (Blanchet & Glynn, 2015; Blanchet et al., 2019; Wang & Wang, 2022) to construct an unbiased estimator with bounded variance. For any s, a, we first generate N according to a geometric distribution with parameter Ψ (0, 1). Then, we take action a at state s for 2N+1 times, and observe r(s, a) and the subsequent state {s i}, i = 1, ..., 2N+1. We divide these 2N+1 samples into two groups: samples with odd indices, and samples with even indices. We then individually calculate the empirical distribution of s using the even-index samples, odd-index ones, all the samples, and the first sample: ˆPa,E s,N+1 = 1 2N P2N i=1 1s 2i, ˆPa,O s,N+1 = 1 2N P2N i=1 1s 2i 1, ˆPa s,N+1 =

1 2N+1 P2N+1 i=1 1s i, ˆPa,1 s,N+1 = 1s 1. Then, we use these estimated transition kernels as nomi-

nal kernels to construct four estimated uncertainty sets ˆPa,E s,N+1, ˆPa,O s,N+1, ˆPa s,N+1, ˆPa,1 s,N+1. The multi-level estimator is then defined as

ˆσPas (V ) σ ˆPa,1 s,N+1(V ) + N(V )

where p N = Ψ(1 Ψ)N and

N(V ) σ ˆPa s,N+1(V ) σ ˆPa,E s,N+1(V ) + σ ˆPa,O s,N+1(V )

We note that in previous results of the multi-level Monte-Carlo estimator (Blanchet & Glynn, 2015; Blanchet et al., 2019; Wang & Wang, 2022), several assumptions are needed to show that the estimator is unbiased. These assumptions, however, do not hold in our cases. For example, the function σP(V ) is not continuously differentiable. Hence, their analysis cannot be directly applied here. We then present four examples of non-linear uncertainty sets. Under each example, a solution to the support function σP(V ) is given, and by plugging it into (19) the unbiased estimator can then be constructed. More details can be found in Section H and Section I. Total Variation Uncertainty Set. The total variation uncertainty set is Pa s = {q (|S|) : 1

2 q Pa s 1 ζ}, and the support function can be computed using its dual function (Iyengar, 2005):

σPas (V ) = max µ 0 Pa s(V µ) ζSpan(V µ) . (20)

Robust Average-Reward Reinforcement Learning

Chi-square Uncertainty Set. The Chi-square uncertainty set is Pa s = {q (|S|) : dc(Pa s, q) ζ}, where dc(q, p) = P s (p(s) q(s))2

p(s) . Its support function can be computed using its dual function (Iyengar, 2005):

σPas (V ) = max µ 0 Pa s(V µ) q

ζVar Pas(V µ) . (21)

Kullback Leibler (KL) Divergence Uncertainty Set. The KL-divergence between two distributions p, q is defined as DKL(q||p) = P s q(s) log q(s)

p(s), and the uncertainty set defined via KL divergence is

Pa s = {q : DKL(q||Pa s) ζ} , s S, a A. (22)

Its support function can be efficiently solved using the duality result in (Hu & Hong, 2013):

σPas (V ) = min α 0

ζα + α log EPas e V

The above estimator for the KL-divergence uncertainty set has also been developed in (Liu et al., 2022) for robust discounted MDPs. Its extension to our average-reward setting is similar. Wasserstein Distance Uncertainty Sets. Consider the metric space (S, d) by defining some distance metric d. For some parameter l [1, ) and two distributions p, q (S), define the l-Wasserstein distance between them as Wl(q, p) = infµ Γ(p,q) d µ,l, where Γ(p, q) denotes the distributions over S S with marginal distributions p, q, and d µ,l = E(X,Y ) µ d(X, Y )l 1/l. The Wasserstein distance uncertainty set is then defined as

Pa s = {q (|S|) : Wl(Pa s, q) ζ} . (24)

To solve the support function w.r.t. the Wasserstein distance set, we first prove the following duality lemma.

Lemma 2. It holds that

σPas (V ) = sup λ 0

λζl + EPas inf y V (y) + λd(S, y)l . (25)

Thus, the support function can be solved using its dual form, and the estimator can then be constructed following (19).

Remark 4. We note that in the construction of MLMC estimators for the non-linear uncertainty sets, we do not need to estimate and store any transition models. When updating asynchronously, we only need a memory space of |S| |A| to store the nominal kernel of the specific state-action pair, instead of the whole model. From this aspect, we refer to our algorithms with MLMC estimators as model-free approaches. It is left for further exploration of model-free algorithms with less memory space.

Wang, Velasquez, Atia, Prater-Bennette, Zou

6. Numerical Results

In this section, we numerically verify our theoretical results. We aim to verify two aspects of our methods: the convergence of the algorithms, and the robustness of them. Additional experiments can be found in Appendix A.

6.1 Convergence of Robust RVI TD and Q-Learning

We first verify the convergence of our robust RVI TD and Q-learning algorithms under a Garnet problem G(30, 20) (Archibald et al., 1995). The problem can be characterized as a MDP (30, 20, P, r), where there are 30 states and 20 actions. The nominal transition kernel P = {Pa s, s S, a A} is randomly generated by a normal distribution: Pa s N(1, σa s) and then normalized. The reward function r(s, a) N(1, µa s), where µa s, σa s Uniform[0, 100]. We set the radius of the uncertainty set ζ = 0.4, αn = 0.01, f(V ) = P s V (s)

|S| and f(Q) = P s,a Q(s,a)

|S||A| . We show the results under the Chi-square and Wasserstein Distance models. The results under the other three uncertainty sets are presented in Appendix A. For policy evaluation, we evaluate the robust average-reward of the uniform policy π(a|s) = 1 |A|. We implement our robust RVI TD algorithm under different uncertainty models. We run the algorithm independently for 30 times and plot the average value of f(V ) over all 30 trajectories. We also plot the 95th and 5th percentiles of the 30 curves as the upper and lower envelopes of the curves. To compare, we plot the true robust averagereward computed using the model-based robust value iteration method. It can be seen from the results in Figure 1 that our robust RVI TD algorithm converges to the true robust average-reward value.

Figure 1: Robust RVI TD Algorithm.

We then consider policy optimization. We run our robust RVI Q-learning independently for 30 times. The curves in Figure 2 show the average value of f(Q) over 30 trajectories, and the upper/lower envelopes are the 95/5 percentiles. We also plot the optimal robust average-reward g P computed by the model-based RVI method. Our robust RVI Q-learning converges to the optimal robust average-reward g P under each uncertainty set, which verifies our theoretical results.

Robust Average-Reward Reinforcement Learning

Figure 2: Robust RVI Q-Learning Algorithm.

6.2 Robustness of Our Robust Approaches

In this section, we aim to demonstrate the robustness of our approaches, including both model-based and model-free ones. We first compare our model-based approaches with the non-robust model-based ones under the Garnet problem (Archibald et al., 1995). Then we verify the robustness of our model-free robust RVI Q-learning under two problems, namely, the Recycling Robot and the inventory control problem.

6.2.1 Robustness of Model-Based Approaches

We study several commonly used uncertainty set models, including contamination model, Kullback-Lerbler (KL) divergence, and total-variation defined model. As can be observed from Algorithm 1, 2, and 3 for different uncertainty sets, the only difference lies in how the support function σPas (V ) is calculated. In the sequel, we discuss how to efficiently calculate the support function for various uncertainty sets. We numerically compare our robust (relative) value iteration v.s. non-robust (relative) value iteration methods on different uncertainty sets. Our experiments are based on the same Garnet problem G(20, 40) considered in 6.1, with the same nominal transition kernel, reward functions, and uncertainty set structures. We run different algorithms, i.e., (robust) value iteration and (robust) relative value iteration, and obtain the greedy policies at each time step. Then, we use robust average-reward policy evaluation (Algorithm 1) to evaluate the robust average-reward of these policies. We plot the robust average-reward against the number of iterations. Contamination model. Our experimental results under the contamination model are shown in Figure 3. Total variation. Our experimental results under the total variation model are shown in Figure 4. Kullback-Lerbler (KL) divergence. Our experimental results under the KL-divergence model are shown in Figure 5. It can be seen that our robust methods can obtain policies that achieve higher worstcase rewards. Also, both our limit-based robust value iteration and our direct method of

Wang, Velasquez, Atia, Prater-Bennette, Zou

Figure 3: Comparison on contamination model with R = 0.4.

Figure 4: Comparison on total variation model with R = 0.6.

robust relative value iteration converge to the optimal robust policies, which validates our theoretical results.

6.2.2 Robustness of Model-Free Approach: Recycling Robot

We first consider the recycling robot problem (Example 3.3 (Sutton & Barto, 2018)). A mobile robot running on a rechargeable battery aims to collect empty soda cans. It has 2 battery levels: S = {low and high}. The robot can either 1) search for empty cans; 2) remain stationary and wait for someone to bring it a can; or 3) go back to its home base to recharge, i.e., A = {wait, search, recharge}. Under low (high) battery level, the robot finds an empty can with probabilities α (β), and remains at the same battery level. If the robot goes out to search but finds nothing, it will run out of its battery and can only be carried back by humans. The reward of finding a can is set to be +5, the reward of finding nothing and running out of battery is 5, and r = 0 for waiting. More details can be found in (Sutton & Barto, 2018). In this experiment, the uncertainty lies in the probabilities α, β of finding an empty can if the robot chooses the action search . We set ζ = 0.4 and implement our algorithms and

Robust Average-Reward Reinforcement Learning

Figure 5: Comparison on KL-divergence model with R = 0.8.

vanilla Q-learning under the nominal environment (α = β = 0.5) with stepsize 0.01. To show the difference among the policies that the algorithms learned, we plot the difference of Q values at low battery level in Figure 6a. The results showed the average value of the difference between the Q-function among 10 independent experiments. In the low battery level, the robust algorithms find conservative policies that choose to wait instead of search, whereas the vanilla Q-learning finds a policy that chooses to search. To test the robustness of the obtained policies, we evaluate the average reward of the learned policies in perturbed environments. Specifically, let x denote the amplitude of the perturbation. Then, we calculate the exact robust average reward functions of the two policies over the testing uncertainty set (0.5 x, 0.5 + x) using the model-based approach Alg 1, and plot them in Figure 6b. It can be seen that when the perturbation is small, the true worst-case kernels (w.r.t. ζ during training) are far from the testing environment, and hence the vanilla Q-learning has a higher reward; however, as the perturbation level becomes larger, the testing environment gets closer to the worst-case kernels, and then our robust algorithms perform better. It can be seen that the performance of Q-learning decreases rapidly while our robust algorithm is stable and outperforms the non-robust Q-learning. This implies that our algorithm is robust to the model uncertainty.

6.2.3 Robustness of Model-Free Approach: Inventory Control Problem

We now consider the supply chain problem (Giannoccaro & Pontrandolfo, 2002; Kemmer et al., 2018; Liu et al., 2022). At the beginning of each month, the manager of a warehouse inspects the current inventory of a product. Based on the current stock, the manager decides whether or not to order additional stock from a supplier. During this month, if the customer demand is satisfied, the warehouse can make a sale and obtain profits; but if not, the warehouse will obtain a penalty associated with being unable to satisfy customer demand for the product. The warehouse also needs to pay the holding cost for the remaining stock and new items ordered. The goal is to maximize the average profit. We let st denote the inventory at the beginning of the t-th month, Dt be a random demand during this month, and at be the number of units ordered by the manager. We assume that Dt follows some distribution and is independent over time. When the agent

Wang, Velasquez, Atia, Prater-Bennette, Zou

(a) Q(low,wait)-Q(low,search)

(b) Perturbed environment

Figure 6: Recycling Robot.

takes action at, the order cost is at, and the holding cost is 3 (st + at). If the demand Dt st +at, then selling the item brings 5 Dt in total; but if the demand Dt > st +at, then there will not be any sale and a penalty of 15 will be received. We set S = {0, 1, ..., 16} and A = {0, ..., 8}. We first set ζ = 0.4 and αt = 0.01, and implement our algorithms and vanilla Q-learning under the nominal environment where Dt Uniform(0, 16) is generated following the uniform distribution. To verify the robustness, we test the obtained policies under different perturbed environments. More specifically, we perturb the distribution of the demand to Dt U(m,b), where

U(m,b)(x) =

( 1 |S| + b |S| 2

2|S| , if x {m, m + 1}, 1 b

|S| , else.

The results are plotted in Figure 7. We first fix m = 0 and plot the performance under different values of b in Figure 7a, then we fix b = 0.25 and plot the performance under different values of m in Figure 7b. As the results show, when b is small, i.e., the perturbation of the environment is small, the non-robust Q-learning obtains higher reward than our robust methods; as b becomes larger, the performance of the non-robust method decreases rapidly, while our robust methods are more robust and outperform the non-robust one. When b is fixed, our robust methods outperform the non-robust Q-learning, which also demonstrates the robustness of our methods.

7. Conclusion

In this paper, we investigated the problem of robust MDPs under the average-reward criterion. We first developed the fundamental characterization of their structures using two approaches: the limit approach and the direct one. We first established uniform convergence of the discounted value function to average-reward and showed the common properties

Robust Average-Reward Reinforcement Learning

(b) b = 0.25

Figure 7: Inventory Control.

shared by robust MDPs under both reward criteria. We then designed a direct approach for robust average-reward MDPs, where we derived the robust Bellman equation for robust average-reward MDPs. Based on these results, we further designed two model-based algorithms, robust VI and robust RVI, with convergence and robustness guarantees. We then generalized such approaches to the model-free setting, where we constructed an unbiased estimator of the robust Bellman operator and proposed robust RVI TD and Q-learning algorithms, and further theoretically proved their convergence and optimality.

8. Acknowledgement

This work is supported by the National Science Foundation under Grants CCF-2007783, CCF-2106560, ECCS-2337375 (CAREER), and CCF-2106339. This material is based upon work supported under the AI Research Institutes program by National Science Foundation and the Institute of Education Sciences, U.S. Department of Education through Award # 2229873 - National AI Institute for Exceptional Education. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the Institute of Education Sciences, or the U.S. Department of Education.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Appendix A. Additional Experiments

In this section, we first show the additional experiments on the Garnet problem in Section 6.1. Then, we further verify our theoretical results using some additional experiments.

A.1 Garnet Problem

We first verify the convergence of our robust RVI TD and robust RVI Q-learning under the Garnet problem with the same setting as in Section 6.1. Our results show that both our algorithms converge to the (optimal) robust average-reward under the other three uncertainty sets.

Figure 8: Robust RVI TD Algorithm under Garnet Problem.

Figure 9: Robust RVI Q-Learning Algorithm under Garnet Problem.

Robust Average-Reward Reinforcement Learning

A.2 Frozen-Lake Problem

We first verify our robust RVI TD algorithm and robust RVI Q-learning under the Frozen Lake environment of Open AI (Brockman et al., 2016). We set the uncertainty radius ζ = 0.4, αn = 0.01 and plot the (optimal) robust average-reward computed using modelbased methods as the baseline. We evaluate the uniform policy for the policy evaluation problem, plot the average value of f(Vt) of 30 trajectories and plot the 95/5 percentile as the upper/lower envelope. For the optimal control problem, we plot the average value of f(Qt) of 30 trajectories and plot the 95/5 percentile as the upper/lower envelope. The results show that both algorithms converge to the (optimal) robust average-reward.

Figure 10: Robust RVI TD Algorithm under Frozen-Lake environment.

Figure 11: Robust RVI Q-learning Algorithm under Frozen-Lake environment.

Wang, Velasquez, Atia, Prater-Bennette, Zou

A.3 Robustness of Robust RVI Q-Learning

We further use the simple, yet widely-used problem, referred to as the one-loop task problem (Panaganti & Kalathil, 2022), to verify the robustness of our robust RVI Q-learning. This environment is widely used to demonstrate that robust methods can learn different optimal polices from the non-robust methods, which are more robust to model uncertainty. The one-loop MDP contains 2 states s1, s2, and 2 actions al, ar indicating going left or right. The nominal environment is shown in the left of Figure 12, where at state s1, going left and right will result in a transition to s1 or s2; and at s2, going left and right will result in a transition to s1 or s2.

ar, 2 al, 0

s1 s2 al, 0

ar, 2 al, 0

Figure 12: One-Loop Task.

We implement our robust RVI Q-learning and vanilla non-robust Q-learning as the baseline in this environment. At each time step t, we plot the difference between Qt(s1, al) and Qt(s1, ar) in Figure 13a. If Qt(s1, al) Qt(s1, ar) < 0, the greedy policy will be going right; and if Qt(s1, al) Qt(s1, ar) > 0, the policy will be going left. As the results show, the vanilla Q-learning will finally learn a policy π(s1) = ar, while our algorithms learn a policy π(s1) = al. To verify the robustness of our method, we test the learned policies under a perturbed testing environment, shown on the right of Figure 12. We plot the average-reward of policies πt under this perturbed environment. The results are shown in Figure 13b.

(a) Q(s1, al) Q(s2, ar)

(b) Average-Reward under Testing MDP

Figure 13: One-Loop Task.

Robust Average-Reward Reinforcement Learning

As the results show, our robust RVI Q-learning learns a more robust policy under the nominal environment, which obtains a higher reward in the perturbed environment; whereas the non-robust Q-learning learns a policy that is optimal w.r.t. the nominal environment, but less robust when the environment is perturbed. This verifies that our algorithm is more robust than the vanilla method.

Appendix B. Review of Robust Discounted MDPs

In this section, we provide a brief review on the existing methods and results for robust discounted MDPs.

B.1 Robust Policy Evaluation

We first consider the robust policy evaluation problem, where we aim to estimate the robust value function V π P,γ for any policy π. It has been shown that the robust Bellman operator Tπ is a γ-contraction, and the robust value function V π P,γ is its unique fixed-point. Hence by recursively applying the robust Bellman operator, we can find the robust discounted value function (Nilim & El Ghaoui, 2004; Iyengar, 2005).

Algorithm 6 Policy evaluation for robust discounted MDPs Input: π, V0, T

1: for t = 0, 1, ..., T 1 do

2: for all s S do

3: Vt+1(s) Eπ[r(s, A) + γσPA s (Vt)]

6: return VT

B.2 Robust Optimal Control

Another important problem in robust MDP is finding the optimal policy that maximizes the robust discounted value function:

π = arg max π V π P,γ. (26)

A robust value iteration approach is developed in (Nilim & El Ghaoui, 2004; Iyengar, 2005) as follows.

Algorithm 7 Optimal Control for robust discounted MDPs Input: V0, T

1: for t = 0, 1, ..., T 1 do

2: for all s S do

3: Vt+1(s) maxa r(s, a) + γσPas (Vt)

6: π (s) arg maxa r(s, a) + γσPas (VT ) , s

7: return π

Wang, Velasquez, Atia, Prater-Bennette, Zou

Appendix C. Equivalence between Time-Varying and Stationary Models

We first provide an equivalence result between time-varying and stationary transition kernel models under stationary policies, which is an analog result to the one for robust discounted MDPs (Iyengar, 2005; Nilim & El Ghaoui, 2004). This result will be used in our following proofs. Recall the definitions of the robust discounted value function and worst-case averagereward in (4),(5), the worst-case is taken w.r.t. κ = (P0, P1...) N t 0 P, therefore, the transition kernel at each time step could be different. This model is referred to as the time-varying transition kernel model (as in (Iyengar, 2005; Nilim & El Ghaoui, 2004)). Another commonly used setting is that the transition kernels at different time steps are the same, which is referred to as the stationary model (Iyengar, 2005; Nilim & El Ghaoui, 2004). In this paper, we use the following notations to distinguish the two models. By EP[ ], we denote the expectation when the transition kernels at all time steps are the same, P, i.e., the stationary model. We also denote by gπ P(s) limn EP,π h 1 n Pn 1 t=0 rt S0 = s i

and V π P,γ(s) EP,π P t=0 γtrt S0 = s being the expected average-reward and expected discounted value function under the stationary model P. By Eκ[ ], we denote the expectation when the transition kernel at time t is Pt, i.e., the time-varying model. For the discounted setting, it has been shown in (Nilim & El Ghaoui, 2004) that for a stationary policy π, any γ [0, 1), and any s S,

V π P,γ(s) = min κ N t 0 P Eπ,κ

t=0 γtrt|S0 = s

= min P P Eπ,P

t=0 γtrt|S0 = s

In the following theorem, we prove an analog of (27) for robust average-reward MDPs that if we consider stationary policies, then the robust average-reward problem with the time-varying model can be equivalently solved by a stationary model. Specifically, we define the worst-case average-reward for the stationary transition kernel model as follows:

min P P lim n Eπ,P

t=0 rt S0 = s

Recall the worst-case average-reward for the time-varying model in (5). We will show that for any stationary policy, (5) can be equivalently solved by solving (28).

Theorem 13. Consider an arbitrary stationary policy π. Then, the worst-case averagereward under the time-varying model is the same as the one under the stationary model:

gπ P(s) min κ N t 0 P lim n Eκ,π

t=0 rt|S0 = s

= min P P lim n EP,π

t=0 rt S0 = s

Robust Average-Reward Reinforcement Learning

Similar result also holds for the robust relative value function:

V π P (s) min κ N t 0 P Eκ,π

t=0 (rt gπ P)|S0 = s

= min P P EP,π

t=0 (rt gπ P)|S0 = s . (30)

Proof. From the robust Bellman equation in Theorem 7 5, we have that

V π P (s) + gπ P = X

a π(a|s) r(s, a) + σPas (V π P ) . (31)

Denote by arg minp Pas (p) V π P pa s 6, and denote by Pπ {pa s : s S, a A}. It then follows that

V π P (s) = X

a π(a|s) r(s, a) gπ P + σPas (V π P )

a π(a|s)(r(s, a) gπ P) + X

a π(a|s)EPπ[V π P (S1)|S0 = s, A0 = a]

a π(a|s)(r(s, a) gπ P) + EPπ,π[V π P (S1)|S0 = s]

a π(a|s)(r(s, a) gπ P) + EPπ,π

a π(a|S1)(r(S1, a) gπ P)|S0 = s

a π(a|S1)σPa S1(V π P )|S0 = s

a π(a|s)(r(s, a) gπ P) + EPπ,π [r1 gπ P|S0 = s] + EPπ,π

σPA1 S1 (V π P )|S0 = s

a π(a|s)(r(s, a) gπ P) + EPπ,π

r1 gπ P S0 = s + EPπ,π

(p A1 S1 ) V π P |S0 = s

r0 gπ P + r1 gπ P|S0 = s + EPπ,π[V π P (S2)|S0 = s]

t=0 (rt gπ P)|s . (32)

By the definition, the following always hold:

min κ N t 0 P Eκ,π

t=0 (rt gπ P)|S0 = s min P P EP,π

t=0 (rt gπ P)|S0 = s . (33)

5. The proof of Theorem 7 is independent of Theorem 13 and does not rely on the results to be shown here. 6. We pick one arbitrarily if there are multiple minimizers.

Wang, Velasquez, Atia, Prater-Bennette, Zou

This hence implies that a stationary transition kernel sequence κ = (Pπ, Pπ, ...) is one of the worst-case transition kernels for V π P . Therefore, (30) can be proved. Consider the transition kernel Pπ. We denote its non-robust average-reward and the non-robust relative value function by gπ Pπ and V π Pπ. By the non-robust Bellman equation (Sutton & Barto, 2018), we have that

V π Pπ(s) = X

a π(a|s)(r(s, a) gπ Pπ) + EPπ,π[V π Pπ(S1)|s]. (34)

On the other hand, the robust Bellman equation shows that

V π P (s) = V π Pπ(s) = X

a π(a|s)(r(s, a) gπ P) + EPπ,π[V π Pπ(S1)|s]. (35)

These two equations imply that gπ P = gπ Pπ, and hence the stationary kernel (Pπ, Pπ, ...) is also a worst-case kernel of robust average-reward in the time-varying setting. This proves (29).

Appendix D. Limit Approach

D.1 Proof of Theorem 1

In the proof, unless otherwise specified, we denote by v the l norm of a vector v, and for a matrix A, we denote by A its matrix norm induced by l norm, i.e., A = supx Rd Ax

Lemma 3. [Theorem 8.2.3 in (Puterman, 1994)] For any P, γ, π,

V π P,γ = 1 1 γ gπ P + hπ P + fπ P(γ), (36)

where hπ P = Hπ Prπ, and fπ P(γ) = 1

γ P n=1( 1)n 1 γ

γ n (Hπ P)n+1rπ.

Following Proposition 8.4.6 in (Puterman, 1994), we can show the following lemma.

Lemma 4. Hπ P is continuous on Π P. If Π and P are compact, Hπ P is uniformly bounded on Π P, i.e., there exists a constant h, such that Hπ P h for any π, P.

For simplicity, denote by

Sπ (P, γ) 1

n=1 ( 1)n 1 γ

n (Hπ P)n+1rπ,

Sπ N(P, γ) 1

n=1 ( 1)n 1 γ

n (Hπ P)n+1rπ. (37)

Clearly Sπ (P, γ) = fπ P(γ) and lim N Sπ N(P, γ) = Sπ (P, γ) for any specific π, P.

Lemma 5. There exists δ (0, 1), such that

lim N Sπ N(P, γ) = Sπ (P, γ) (38)

uniformly on Π P [δ, 1].

Robust Average-Reward Reinforcement Learning

Proof. Note that Hπ P h, hence there exists δ, s.t.

δ h k < 1 (39)

for some constant k. Then for any γ δ,

δ h k. (40)

Moreover, note that 1 γ ( 1)n 1 γ

n (Hπ P)n+1r 1

which is because A+B A + B for induced l norm, Ax A x and rπ 1. Note that

δ k 1 k, (42)

hence by Weierstrass M-test (Rudin, 2022), Sπ N(P, γ) uniformly converges to Sπ (P, γ) on Π P [δ, 1].

Lemma 6. There exists a uniform constant L, such that

Sπ N(P, γ1) Sπ N(P, γ2) L|γ1 γ2|, (43)

for any N, π, P, γ1, γ2 [δ, 1].

Proof. We first show that γSπ N(P, γ) = PN n=1( 1)n 1 γ

γ n (Hπ P)n+1rπ T π N(P, γ) is uniformly Lipschitz w.r.t. the l norm, i.e.,

T π N(P, γ1) T π N(P, γ2) l|γ1 γ2|, (44)

for any N, π, P, γ1, γ2 [δ, 1] and some constant l. Clearly, it can be shown by verifying T π N(P, γ) is uniformly bounded for any π, N, P or γ. First, it can be shown that

T π N(P, γ) =

n=1 ( 1)nn 1 γ

γ2 (Hπ P)n+1rπ, (45)

and moreover

T π N(P, γ)

γ2 hn+1 l N(γ). (46)

Wang, Velasquez, Atia, Prater-Bennette, Zou

γ2 hn+2, (47)

then, we can show that 1 h1 γ

γ2 h2 N 1 γ

γ h 1 1 1 γ

γ h 1 1 1 γ

Hence, we have that

T π N(P, γ) l N(γ) 1 1 h 1 γ

γ h 1 1 1 γ

which implies a uniform bound on T π N(P, γ) . Now, we have that

|Sπ N(P, γ1) Sπ N(P, γ2)|

γ1γ2 T π N(P, γ1) + T π N(P, γ1) T π N(P, γ2) γ2 . (50)

To show T π N(P, γ) is uniformly bounded, we have that

T π N(P, γ)

n (Hπ P)n+1r

h k 1 k. (51)

Robust Average-Reward Reinforcement Learning

Then, it follows that

Sπ N(P, γ1) Sπ N(P, γ2)

γ1γ2 T π N(P, γ1) + T π N(P, γ1) T π N(P, γ2) γ2

δ2 h k 1 k + 1

L|γ1 γ2|, (52)

where L = 1 δ2 h k 1 k + 1

δ 1 1 k h2 δ2 + h2

δ2 k 1 k is a universal constant that does not depend on N, P, π or γ.

Lemma 7. Sπ (P, γ) uniformly converges as γ 1 on Π P. Also, Sπ (P, γ) is L-Lipschitz for any γ > δ: for any π, P and any γ1, γ2 (δ, 1].

Sπ (P, γ1) Sπ (P, γ2) L|γ1 γ2|. (53)

Proof. From Lemma 5, for any ϵ, there exists Nϵ, such that for any n Nϵ, π, P, γ > δ,

Sπ (P, γ) Sπ n(P, γ) < ϵ. (54)

Thus for any γ1, γ2 (δ, 1],

Sπ (P, γ1) Sπ (P, γ2)

Sπ (P, γ1) Sπ n(P, γ1) + Sπ n(P, γ1) Sπ n(P, γ2) + Sπ n(P, γ2) Sπ (P, γ2)

2ϵ + Sπ n(P, γ1) Sπ n(P, γ2)

2ϵ + L|γ1 γ2|, (55)

where the last step is from Lemma 6. Thus, for any ϵ, there exists ω = max {δ, 1 ϵ}, such that for any γ1, γ2 > ω,

Sπ (P, γ1) Sπ (P, γ2) < (2 + L)ϵ, (56)

and hence by Cauchy s criterion we conclude that Sπ (P, γ) converges uniformly on Π P. On the other hand, since (55) holds for any ϵ, it implies that

Sπ (P, γ1) Sπ (P, γ2) L|γ1 γ2|, (57)

which completes the proof.

We now prove Theorem 1. For any P, π, we have that

V π P,γ = 1 1 γ gπ P + hπ P + fπ P(γ). (58)

It then follows that

(1 γ)V π P,γ = gπ P + (1 γ)hπ P + (1 γ)fπ P(γ). (59)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Clearly (1 γ)hπ P 0 uniformly on Π P because hπ P = Hπ Prπ h is uniformly bounded. Then,

(1 γ1)fπ P(γ1) (1 γ2)fπ P(γ2)

(1 γ1)fπ P(γ1) (1 γ1)fπ P(γ2) + (1 γ1)fπ P(γ2) (1 γ2)fπ P(γ2)

(1 γ1)L|γ1 γ2| + fπ P(γ2) |γ1 γ2|. (60)

For any π, P, γ > δ,

fπ P(γ) = 1 γ

n=1 ( 1)n 1 γ

n (Hπ P)n+1rπ

γ h 1 1 1 γ

δ k 1 k cf. (61)

Hence, (1 γ)fπ P(γ) 0 uniformly on Π P due to the fact that fπ P(γ) is uniformly bounded for any π, γ > δ, P. Then we have that limγ 1(1 γ)V π P,γ = gπ P uniformly on P Π. This completes the proof of Theorem 1.

D.2 Proof of Theorem 2

We first show a lemma which allows us to interchange the order of lim and max.

Lemma 8. If a function f(x, y) converges uniformly to F(x) on X as y y0, then

max x lim y y0 f(x, y) = lim y y0 max x f(x, y). (62)

Proof. For each f(x, y), denote by arg maxx f(x, y) = xy, and hence f(xy, y) f(x, y) for any x, y. Also denote by arg maxx F(x) = x . Now because f(x, y) uniformly converges to F(x), then for any ϵ, there exists δ , such that |y y0| < δ ,

|f(x, y) F(x)| ϵ (63)

for any x. Now consider |f(xy, y) F(x )| for |y y0| < δ . If f(xy, y) F(x ) > 0, then

|f(xy, y) F(x )| = f(xy, y) F(x ) = f(xy, y) F(xy) + F(xy) F(x ) ϵ; (64)

On the other hand if f(xy, y) F(x ) < 0, then

|f(xy, y) F(x )| = F(x ) f(xy, y) = F(x ) f(x , y) + f(x , y) f(xy, y) ϵ. (65)

Robust Average-Reward Reinforcement Learning

Hence, we showed that for any ϵ, there exists δ , such that |y y0| < δ ,

|f(xy, y) F(x )| = | max x f(x, y) max x F(x)| ϵ, (66)

lim y y0 max x f(x, y) = max x F(x) = max x lim y y0 f(x, y), (67)

and this completes the proof.

Then, we show that the robust discounted value function converges uniformly to the robust average-reward as the discounted factor approaches 1.

Theorem 14 (Restatement of Theorem 2). The robust discounted value function converges uniformly to the robust average-reward on Π:

lim γ 1(1 γ)V π P,γ = gπ P. (68)

Proof. Due to Theorem 13, for any stationary policy π, gπ P(s) = min P P gπ P(s) under the stationary model. Hence from the uniform convergence in Theorem 1, we first show the following:

gπ P = min P P gπ P

= min P P lim γ 1(1 γ)V π P,γ

(a) = lim γ 1 min P P(1 γ)V π P,γ

= lim γ 1(1 γ)V π P,γ, (69)

where (a) is because Lemma 8. Moreover, note that limγ 1(1 γ)V π P,γ = gπ P uniformly on Π P, hence the convergence in (69) is also uniform on Π. Thus, we complete the proof.

D.3 Proof of Theorem 5

Theorem 15 (Restatement of Theorem 5). VT generated by Algorithm 1 converges to the robust average-reward gπ P as T .

Proof. From discounted robust Bellman equation (Nilim & El Ghaoui, 2004), it can be shown that

(1 γt)V π P,γt = (1 γt) X

a π(a|s)(r(s, a) + γtσPas (V π P,γt)). (70)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Then we can show that for any s S,

|Vt+1(s) (1 γt+1)V π P,γt+1(s)|

= |Vt+1(s) (1 γt)V π P,γt(s) + (1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)| (71)

|(1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)| + |Vt+1(s) (1 γt)V π P,γt(s)|

= |(1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)|

a π(a|s) (1 γt)r(s, a) + γtσPas (Vt) ((1 γt)r(s, a) + γtσPas ((1 γt)V π P,γt))

= |(1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)| + X

a π(a|s) γtσPas (Vt) γtσPas ((1 γt)V π P,γt)

= |(1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)| + γt

a π(a|s) σPas (Vt) σPas ((1 γt)V π P,γt) .

If we denote by t Vt (1 γt)V π P,γt , then

(1 γt)V π P,γt (1 γt+1)V π P,γt+1 + γt max s

a π(a|s) σPas (Vt) σPas ((1 γt)V π P,γt)

It can be easily verified that σPas (V ) is a 1-Lipschitz function, thus the second term in (73) can be further bounded as

a π(a|s) σPas (Vt) σPas ((1 γt)V π P,γt)

a π(a|s) Vt (1 γt)V π P,γt

= Vt (1 γt)V π P,γt , (74)

t+1 (1 γt)V π P,γt (1 γt+1)V π P,γt+1 + γt t. (75)

Recall that

(1 γt)V π P,γt = (1 γt) min P V π P,γt. (76)

Let s t arg maxs |(1 γt)V π P,γt(s) (1 γt+1)V π P,γt+1(s)|. Then it follows that

(1 γt)V π P,γt (1 γt+1)V π P,γt+1 = |(1 γt)V π P,γt(s t ) (1 γt+1)V π P,γt+1(s t )|. (77)

Note that from (Nilim & El Ghaoui, 2004; Iyengar, 2005), for any stationary policy

π, there exists a stationary model P such that V π P,γ(s) = EP,π

P t=0 γtrt|S0 = s V π P,γ.

Robust Average-Reward Reinforcement Learning

Hence in the following, for each γt, we denote the worst-case transition kernel of V π P,γt by Pt. If (1 γt)V π P,γt(s t ) (1 γt+1)V π P,γt+1(s t ), then

|(1 γt)V π P,γt(s t ) (1 γt+1)V π P,γt+1(s t )|

= min P (1 γt)V π P,γt(s t ) min P (1 γt+1)V π P,γt+1(s t )

= (1 γt)V π Pt,γt(s t ) (1 γt+1)V π Pt+1,γt+1(s t )

= (1 γt)V π Pt,γt(s t ) (1 γt)V π Pt+1,γt(s t ) + (1 γt)V π Pt+1,γt(s t ) (1 γt+1)V π Pt+1,γt+1(s t )

(a) (1 γt)V π Pt+1,γt(s t ) (1 γt+1)V π Pt+1,γt+1(s t )

(1 γt)V π Pt+1,γt (1 γt+1)V π Pt+1,γt+1 , (78)

where (a) is due to (1 γt)V π Pt,γt(s t ) = min P(1 γt)V π P,γt(s t ) (1 γt)V π Pt+1,γt(s t ). Now, according to Lemma 3,

(1 γt)V π Pt+1,γt = gπ Pt+1 + (1 γt)hπ Pt+1 + (1 γt)fπ Pt+1(γt), (79)

(1 γt+1)V π Pt+1,γt+1 = gπ Pt+1 + (1 γt+1)hπ Pt+1 + (1 γt+1)fπ Pt+1(γt+1). (80)

Hence, for any γt > δ, (78) can be further bounded as

(1 γt)V π Pt+1,γt (1 γt+1)V π Pt+1,γt+1

= (γt+1 γt)hπ Pt+1 + (1 γt)fπ Pt+1(γt) (1 γt+1)fπ Pt+1(γt+1)

(γt+1 γt) hπ Pt+1 + fπ Pt+1(γt) fπ Pt+1(γt+1) + γt+1fπ Pt+1(γt+1) γtfπ Pt+1(γt)

(a) h(γt+1 γt) + L(γt+1 γt) + γt+1fπ Pt+1(γt+1) γtfπ Pt+1(γt)

h(γt+1 γt) + L(γt+1 γt) + γt+1fπ Pt+1(γt+1) γt+1fπ Pt+1(γt)

+ γt+1fπ Pt+1(γt) γtfπ Pt+1(γt)

h(γt+1 γt) + L(γt+1 γt) + γt+1 fπ Pt+1(γt+1) fπ Pt+1(γt) + fπ Pt+1(γt) (γt+1 γt)

(b) (h + L + γt+1L + sup π,P,γ fπ P(γ) )(γt+1 γt)

K(γt+1 γt), (81)

where (a) is from Lemma 7 for any γt > δ, cf is defined in (61) and K h + 2L + cf is a uniform constant; And (b) is from Lemma 7. Similarly, the inequality also holds for the case when (1 γt)V π P,γt(s t ) (1 γt+1)V π P,γt+1(s t ). Thus we have that for any t such that γt > δ,

t+1 K(γt+1 γt) + γt t, (82)

where K is a uniform constant. Following Lemma 8 from (Tewari & Bartlett, 2007), we have that t 0. Note that

Vt gπ P Vt (1 γt)V π P,γt + (1 γt)V π P,γt gπ P = t + (1 γt)V π P,γt gπ P . (83)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Together with Theorem 2, we further have that

lim t Vt gπ P = 0, (84)

which completes the proof.

D.4 Proof of Theorem 6

Note that the optimal robust average-reward is defined as

g P(s) max π gπ P(s). (85)

We further define

V P,γ(s) max π V π P,γ(s). (86)

Theorem 16 (Restatement of Theorem 6). VT generated by Algorithm 2 converges to the optimal robust average-reward g P as T .

Proof. Firstly, from the uniform convergence in Theorem 2, it can be shown that

lim t (1 γt)V P,γt = g P. (87)

We then show that for any s S,

|Vt+1(s) (1 γt+1)V P,γt+1(s)|

|Vt+1(s) (1 γt)V P,γt(s)| + |(1 γt)V P,γt(s) (1 γt+1)V P,γt+1(s)|

(a) = |(1 γt)V P,γt(s) (1 γt+1)V P,γt+1(s)|

(1 γt)r(s, a) + γtσPas (Vt) max a

((1 γt)r(s, a) + γtσPas ((1 γt)V P,γt))

|(1 γt)V P,γt(s) (1 γt+1)V P,γt+1(s)|

(1 γt)r(s, a) + γtσPas (Vt) ((1 γt)r(s, a) + γtσPas ((1 γt)V P,γt)) , (88)

where (a) is because the optimal robust Bellman equation, and the last inequality is from the fact that | maxx f(x) maxx g(x)| maxx |f(x) g(x)|. Hence (88) can be further bounded as

|Vt+1(s) (1 γt+1)V P,γt+1(s)|

|(1 γt)V P,γt(s) (1 γt+1)V P,γt+1(s)| + γt max a

σPas (Vt) σPas ((1 γt)V P,γt) . (89)

If we denote by t Vt (1 γt)V P,γt , then

t+1 (1 γt)V P,γt (1 γt+1)V P,γt+1 + γt max s.a

σPas (Vt) σPas ((1 γt)V P,γt) .

Robust Average-Reward Reinforcement Learning

Since the support function σPas (V ) is 1-Lipschitz, then it can be shown that for any s, a, σPas (Vt) σPas ((1 γt)V P,γt) Vt (1 γt)V P,γt . (91)

t+1 (1 γt)V P,γt (1 γt+1)V P,γt+1 + γt t. (92)

Similar to (81) in Theorem 5, we can show that

(1 γt)V P,γt (1 γt+1)V P,γt+1 K|γt γt+1|, (93)

and similar to Lemma 8 from (Tewari & Bartlett, 2007),

lim t t = 0. (94)

Moreover, note that

Vt g P Vt (1 γt)V P,γt + (1 γt)V P,γt g P = t + (1 γt)V P,γt g P , (95)

which together with (87) implies that

Vt g P 0, (96)

and hence it completes the proof.

Lemma 9. There exists a deterministic optimal policy, i.e., π ΠD, s.t. gπ P = g P = maxπ Π gπ P.

D.5 Proof of Lemma 9

Lemma 10. (Restatement of Lemma 9). There exists a deterministic optimal policy, i.e., π ΠD, s.t. gπ P = g P = maxπ Π gπ P.

Proof. Assume that there is no deterministic optimal robust policy, i.e., there exists a strictly random policy πr Π, such that for any deterministic policy π ΠD,

gπr P > gπ P. (97)

According to Theorem 2, we have that

lim γ 1(1 γ)V πr P,γ = gπr P , (98)

lim γ 1(1 γ)V π P,γ = gπ P, π ΠD. (99)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Since there are only finite number of deterministic policies, there exists δ < 1, such that for any γ > δ,

V πr P,γ > V π P,γ, π ΠD. (100)

This implies that for γ > δ, the random policy πr is better than all the deterministic policies, i.e.,

V πr P,γ > max π ΠD V π P,γ. (101)

However, Theorem 3.1 of (Iyengar, 2005) implies that there exists deterministic optimal robust policy, i.e.,

max π ΠD V π P,γ = max π Π V π P,γ V πr P,γ, (102)

which contradicts to (101). Hence it implies that there exists a deterministic optimal robust policy, and completes the proof.

D.6 Proof of Theorem 4

Theorem 17 (Restatement of Theorem 4). There exists 0 < δ < 1, such that for any γ > δ, a deterministic optimal robust policy for robust discounted value function V P,γ is also an optimal policy for robust average-reward, i.e.,

V π P,γ = V P,γ. (103)

Moreover, when arg maxπ ΠD gπ P is a singleton, there exists a unique Blackwell optimal policy.

Proof. According to Lemma 9, there exists π ΠD such that

g P = gπ P . (104)

Assume the robust average-reward of all deterministic policies are sorted in a descending order:

g P = gπ 1 P = gπ 2 P = ... = gπ m P > gπ1 P ... gπn P (105)

for all π i , πi ΠD, and we define Π = {π i : i = 1, ..., m}. Denote by d = gπ i P gπ1 P . From Theorem 2, we know that for any π ΠD,

lim γ 1(1 γ)V π P,γ = gπ P. (106)

Because the set ΠD is finite, for any ϵ < d

2, there exists δ < 1, such that for any γ > δ , π i and πj,

|(1 γ)V π i P,γ g P| < ϵ, (107)

|(1 γ)V πj P,γ gπj P | < ϵ. (108)

Robust Average-Reward Reinforcement Learning

It hence implies that

(1 γ)V π i P,γ (d 2ϵ) + (1 γ)V πj P,γ > (1 γ)V πj P,γ, (109)

V π i P,γ > V πj P,γ. (110)

Note that from Theorem 3.1 in (Iyengar, 2005), i.e., maxπ ΠD V π P,γ = V P,γ, we have that for any γ, there exists a deterministic policy π ΠD, such that V P,γ = V π P,γ. Together with (110), it implies that all the possible optimal robust polices of V π P,γ belong to {π 1, ...π m}, i.e., the set Π . Hence, there exists π j Π , such that

V π j P,γ = max π ΠD V π P,γ = V P,γ. (111)

For the second part, when the optimal robust policy of robust average-reward is unique, i.e., Π = {π }. Then from the results above, there exists δ , such that for any γ > δ , V π P,γ > V π P,γ for any π = π ΠD, and hence π is the optimal policy for discounted robust MDPs, which is the unique Blackwell optimal policy.

Appendix E. Direct Approach

Recall that

V π P (s) min κ N t 0 P Eκ,π

t=0 (rt gπ P) S0 = s , (112)

gπ P = min κ N t 0 P lim n Eκ,π

t=0 rt|S0 = s

We first show that the robust relative function is always finite.

Lemma 11. For any π, V π P is finite.

Proof. According to Theorem 13, V π P = min P P V π P = min P P EP,π

P t=0(rt gπ P) . Note

that V π P can be rewritten as

V π P = min P P EP,π

t=0 (rt gπ P)

= min P P EP,π

t=0 (rt gπ P)

= min P P EP,π

t=0 (rt gπ P + gπ P gπ P)

= min P P EP,π

lim n (Rn ngπ P + ngπ P ngπ P) , (114)

Wang, Velasquez, Atia, Prater-Bennette, Zou

where Rn = Pn t=0 rt. Note that for any P P and n, ngπ P ngπ P, hence

lim n (Rn ngπ P + ngπ P ngπ P) lim n (Rn ngπ P), (115)

and thus the lower bound of V π P can be derived as follows,

V π P min P P EP,π

t=0 (rt gπ P)

= min P P V π P

= min P P Hπ Prπ. (116)

which is finite due to the fact that Hπ P is continuous on the compact set P. From Theorem 13, we denote the stationary worst-case transition kernel of gπ P by Pg. Then the upper bound of V π P can be bounded by noting that

V π P = min P P EP,π

t=0 (rt gπ Pg)

t=0 (rt gπ Pg)

= V π Pg, (117)

which is also finite and Pg denotes the worst-case transition kernel of gπ P. Hence we show that V π P is finite for any π and hence complete the proof.

After showing that the robust relative value function is well-defined, we show the following robust Bellman equation for average-reward robust MDPs.

Theorem 18 (Restatement of Theorem 7). For any s and π, (V π P , gπ P) is a solution to the following robust Bellman equation:

V (s) + g = X

a π(a|s) r(s, a) + σPas (V ) . (118)

And if (g, V ) is a solution to the robust Bellman equation

a π(a|s)(r(s, a) g + σPas (V )), s, (119)

then 1) g = gπ P; 2) PV Ωπ g; 3) V = V π PV + ce for some c R.

Proof. We first show the first part. From the definition,

V π P (s) = min κ N t 0 P Eκ,π

t=0 (rt gπ P) S0 = s , (120)

Robust Average-Reward Reinforcement Learning

V π P (s) = min κ N t 0 P Eκ,π

t=0 (rt gπ P) S0 = s

= min κ N t 0 P Eκ,π

(r0 gπ P) +

t=1 (rt gπ P) S0 = s

= min κ N t 0 P

a π(a|s)r(s, a) gπ P + Eκ,π

t=1 (rt gπ P) S0 = s )

a π(a|s) (r(s, a) gπ P) + min κ N t 0 P

a,s π(a|s)Pa s,s Eκ,π

t=1 (rt gπ P)|S1 = s

a π(a|s) (r(s, a) gπ P)

+ min P0 P min κ=(P1,...) N t 1 P

a,s π(a|s)(P0)a s,s Eκ,π

t=1 (rt gπ P)|S1 = s

a π(a|s) (r(s, a) gπ P)

a,s π(a|s)(P0)a s,s min κ=(P1,...) N t 1 P

t=1 (rt gπ P)|S1 = s )

a π(a|s) (r(s, a) gπ P) + X

s min pa s,s Pas pa s,s V π P (s )

a π(a|s) (r(s, a) gπ P) + X

a π(a|s)σPas (V π P )

a π(a|s) r(s, a) gπ P + σPas (V π P ) . (121)

This hence completes the proof. We then show the second part. 1). The robust Bellman equation in (119) can be rewritten as

g + V (s) rπ(s) = σPs(V ), s S. (122)

From the definition, it follows that

σPs(V ) = X

a π(a|s) min Pas Pas Pa s V. (123)

Hence, for any transition kernel P = (Pa s) N s,a Pa s ,

g + V (s) rπ(s) X

a π(a|s)Pa s V 0, s. (124)

Wang, Velasquez, Atia, Prater-Bennette, Zou

It can be further rewritten in matrix form as:

ge rπ + (Pπ I)V, (125)

where Pπ is the state transition matrix induced by π and P, i.e., the s-th row of Pπ is X

a π(a|s)Pa s. (126)

Note that Pπ has non-negative components since it is a transition matrix. Multiplying by Pπ on both sides, we have that

Pπge = ge Pπrπ + Pπ(Pπ I)V,

ge (Pπ)2rπ + (Pπ)2(Pπ I)V,

ge (Pπ)n 1rπ + (Pπ)n 1(Pπ I)V. (127)

Now, by summing up all these inequalities in (125) and (127), we have that

i=0 (Pπ)irπ + ((Pπ)n I)V, (128)

ge Pn 1 i=0 (Pπ)irπ

n + ((Pπ)n I)V

Let n , and we have that

Pn 1 i=0 (Pπ)irπ

n + lim n ((Pπ)n I)V

n = gπ Pe, (130)

where the last inequality is from the definition of gπ P and the fact that limn ((Pπ)n I)V

n = 0. Hence, g gπ P for any P N s,a Pa s . Consider the worst-case transition kernel PV of V . The robust Bellman equation can be equivalently rewritten as

ge = rπ V + Pπ V V. (131)

This means that (g, V ) is a solution to the non-robust Bellman equation for transition kernel PV and policy π:

xe = rπ Y + Pπ V Y. (132)

Thus, by Thm 8.2.6 from (Puterman, 1994),

g = gπ PV , (133)

V = V π PV + ce, for some c R. (134)

Robust Average-Reward Reinforcement Learning

However, note that

gπ PV = g gπ P = min P P gπ P gπ PV , (135)

gπ PV = g = gπ P. (136)

2). From (136),

gπ PV = gπ P . (137)

It then follows from the definition of Ωπ g that PV Ωπ g. 3). Since (g, V ) is a solution to the non-robust Bellman equation

xe = rπ Y + Pπ V Y, (138)

the claim then follows from Theorem 8.2.6 in (Puterman, 1994).

Theorem 19. [Restatement of Theorem 8] If (g, Q) is a solution to the optimal robust Bellman equation

Q(s, a) = r(s, a) g + σPas (VQ), s, a, (139)

then 1) g = g P; 2) the greedy policy w.r.t. Q: πQ(s) = arg maxa Q(s, a) is an optimal robust policy; 3) VQ = V πQ P + ce for some P ΩπQ g , c R.

Proof. In this proof, for two vectors v, w Rn, v w denotes that v(s) w(s) entry-wise. Taking the maximum on both sides of (139) w.r.t. a, we have that

max a Q(s, a) = max a {r(s, a) g + σPas (VQ)}, s S. (140)

This is equivalent to

VQ(s) = max a {r(s, a) g + σPas (VQ)}, s S, (141)

and hence (g, VQ) is a solution to (141). We hence only need to show the conclusion for any solution (g, V ) to (141). Let B(g, V )(s) maxa r(s, a) g + σPas (V ) V (s) . Since (g, V ) is a solution to (14), hence for any a A and any s S,

r(s, a) g + σPas (V ) V (s) 0, (142)

from which it follows that for any policy π,

g(s) rπ(s) + X

a π(a|s)σPas (V ) V (s) rπ(s) + X

a π(a|s)(pa s) V V (s), (143)

where rπ(s) P a π(a|s)r(s, a), pa s arg minp Pas p V , and PV = {pa s : s S, a A}. We also denotes the state transition matrix induced by π and PV by Pπ V .

Wang, Velasquez, Atia, Prater-Bennette, Zou

Using these notations, and rewrite (143), we have that

g1 rπ + (Pπ V I)V. (144)

Since the inequality in (144) holds entry-wise, all entries of Pπ V are positive, then by multiplying both sides of (144) by Pπ V , we have that

g1 = g Pπ V 1 Pπ V rπ + Pπ V (Pπ V I)V. (145)

Multiplying the both sides of (145) by Pπ V , and repeatedly doing that, we have that

g1 (Pπ V )2rπ + (Pπ V )2(Pπ V I)V, (146) ... ... (147)

g1 (Pπ V )n 1rπ + (Pπ V )n 1(Pπ V I)V. (148)

Summing up these inequalities from (144) to (148), we have that

ng1 (I + Pπ V + ... + (Pπ V )n 1)rπ + (I + Pπ V + ... + (Pπ V )n 1)(Pπ V I)V, (149)

and from which, it follows that

n(I + Pπ V + ... + (Pπ V )n 1)rπ + 1

n(I + Pπ V + ... + (Pπ V )n 1)(Pπ V I)V

n(I + Pπ V + ... + (Pπ V )n 1)rπ + 1

n((Pπ V )n I)V. (150)

It can be easily verified that limn 1

n((Pπ V )n I)V = 0, and hence it implies that

g1 lim n 1 n(I + Pπ V + ... + (Pπ V )n 1)rπ

= lim n 1 n EPπ V ,π

= gπ Pπ V 1

gπ P1. (151)

Since (151) holds for any policy π, it follows that g g P. On the other hand, since B(g, V ) = 0, there exists a policy τ such that

g1 = rτ + (Pτ V I)V, (152)

where rτ, Pτ V are similarly defined as for π. From Theorem 13, there exists a stationary transition kernel Pτ ave such that gτ P = gτ Pτave. We denote the state transition matrix induced by τ and Pτ ave by Pτ. Then because Pτ V is the worst-case transition of V , it follows that

Pτ V V PτV. (153)

g1 rτ + (Pτ I)V. (154)

Robust Average-Reward Reinforcement Learning

Similarly, we have that

g1 (Pτ)j 1rτ + (Pτ)j 1(Pτ I)V, (155)

for j = 2, ..., n. Summing these inequalities together we have that

ng1 (I + Pτ + ... + (Pτ)n 1)rτ + (I + Pτ + ... + (Pτ)n 1)(Pτ)n 1(Pτ I)V

= (I + Pτ + ... + (Pτ)n 1)rτ + ((Pτ)n I)V. (156)

g1 lim n 1 n EPτave,τ

= gτ Pτave1 = gτ P1 g P1. (157)

Thus g = g P. This hence proves (1). To prove (2), note that for any stationary policy π, we denote by

σPπ(V ) ( X

a π(a|s1)σPas1(V ), ..., X

a π(a|s|S|)σPas|S|(V ))

being a vector in R|S|. Then (15) is equivalent to

rπ + σPπ (V ) = max π {rπ + σPπ(V )} . (158)

rπ g + σPπ (V ) V = max π {rπ g + σPπ(V ) V } . (159)

Since (g, V ) is a solution to (14), it follows that

rπ g + σPπ (V ) V = 0. (160)

According to the robust Bellman equation (12), (gπ P , V π P ) is a solution to (160). Thus from Theorem 19, gπ P = g P, and hence π is an optimal robust policy. To prove (3), recall that VQ(s) = maxa Q(s, a). It can be also written as

a πQ(a|s)Q(s, a). (161)

Here, we slightly abuse the notation of πQ, and use πQ(s) and πQ(a|s) interchangeably. Then, the optimal robust Bellman equation in (140) can be rewritten as

Q(s, πQ(s)) = r(s, πQ(s)) g + σ P πQ(s) s

a πQ(a| )Q( , a) . (162)

Moreover, if we denote by W(s) = Q(s, a) = Q(s, πQ(s)) = maxa Q(s, a), then the equation above is equivalent to

a πQ(a|s)(r(s, a) g + σPas (W)). (163)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Therefore, (W, g) is a solution to the robust Bellman equation for the policy πQ in Theorem 7. By Theorem 7, we have that

g = gπQ P , (164)

W = V πQ P + ce, (165)

for some P ΩπQ g and c R. Combining this with the claim (1) implies that πQ is an optimal robust policy. Claims (2) and (3) are thus proved.

Theorem 20 (Restatement of Theorem 9). (w T , Vt) in Algorithm 3 converges to a solution of (14).

Before showing this theorem, we first present the robust aperiodic transform in the next lemma.

Lemma 12. (Robust Aperiodic Transform) Assume a robust MDP (S, A, P, r) satisfies Assumption 1. Construct another uncertainty set as follows:

Pa s = { Pa s = (1 τ)Pa s + τ1s : Pa s Pa s },

where τ (0, 1). Then the optimal robust policies for both robust MDPs are the same.

Proof. The result can be straightforwardly derived by the following claim: for any π and P P, gπ P = gπ P. Note that the discounted value function V π P,γ = (I γPπ) 1rπ. Hence we have that

V π P,γ = (I γ Pπ) 1rπ

= (I γ((1 τ)Pπ + τI)) 1rπ = ((1 τγ)I γ(1 τ)Pπ) 1rπ

= (1 τγ) I γ(1 τ)

1 τγ Pπ 1 rπ

1 τγ Pπ 1 rπ

= 1 1 τγ V π P, γ(1 τ)

1 τγ . (166)

Moreover, by noting 1 γ(1 τ)

1 τγ = 1 γ 1 τγ , we have

(1 γ)V π P,γ = 1 γ

1 τγ V π P, γ(1 τ)

1 τγ V π P, γ(1 τ)

V π P, γ(1 τ)

1 τγ . (167)

Robust Average-Reward Reinforcement Learning

Now set γ 1 on both sides, we have that

gπ P = lim γ 1(1 γ)V π P,γ = lim γ 1

V π P, γ(1 τ)

1 τγ = gπ P, (168)

where the last equation is from the fact that γ 1 implies γ(1 τ)

1 τγ 1, and hence proves the claim and the lemma.

The reason we apply such a robust aperiodic transform is that the modified transition kernel is always aperiodic (since Pπ(s|s), ( Pπ)2(s|s) > 0, s). Hence Assumption 1 is equivalent to the following strong assumption:

Assumption 5. There exists a positive integer J such that for any P = {pa s (S)} P and any stationary deterministic policy π, there exists κ > 0 and a state s S, such that ((Pπ)J)x,s κ, x S.

This assumption is shown to be equivalent to assuming each transition kernel in the uncertainty set can induce a unichain and aperiodic Markov chain under any determinisit policy (Bertsekas, 2011). Under the robust aperiodic transform, the modified uncertainty set hence satisfy this assumption. Without loss of generality, we prove the convergence under this assumption.

Proof. We first denote the update operator as

Lv(s) max a (r(s, a) + σPas (v)). (169)

Now, consider sp(Lv Lu). Denote by s arg maxs(Lv(s) Lu(s)) and s arg mins(Lv(s) Lu(s)). Also denote by av arg maxa(r( s, a)+σPa s (v)) and au arg maxa(r( s, a)+σPa s (u)) Then

Lv( s) Lu( s) = max a (r( s, a) + σPa s (v)) max a (r( s, a) + σPa s (u))

r( s, av) + σPav s (v) (r( s, au) + σPau s (u))

r( s, av) + σPav s (v) (r( s, av) + σPav s (u))

= σPav s (v) σPav s (u)

(pav,v s ) v (pav,u s ) u, (170)

where pav,v s = arg minp Pav s p v and pav,u s = arg minp Pav s p u. Thus (170) can be further bounded as

Lv( s) Lu( s)

(pav,v s ) v (pav,u s ) u

(pav,u s ) (v u). (171)

Lv( s) Lu( s) (pau,v s ) (v u). (172)

Wang, Velasquez, Atia, Prater-Bennette, Zou

sp(Lv Lu) (pav,u s ) (v u) (pau,v s ) (v u). (173)

Now denote by v u (x1, x2, ..., xn), pav,u s = (p1, ..., pn) and pau,v s = (q1, ..., qn). Further denote by bi min{pi, qi} Then

i=1 (pi bi)xi

i=1 (qi bi)xi

i=1 (pi bi) max{xi}

i=1 (qi bi) min{xi}

i=1 (pi bi)sp(x) + n X

i=1 (pi bi)

i=1 (qi bi) min{xi}

sp(x). (174)

Thus we showed that

sp(Lv Lu) 1

sp(v u). (175)

Now from Assumption 1, and following Theorem 8.5.3 from (Puterman, 1994), it can be shown that there exists 1 > λ > 0, such that for any a, u, v,

i=1 bi λ. (176)

Further, following Theorem 8.5.2 in (Puterman, 1994), it can be shown that L is a J-step contraction operator for some integer J, i.e.,

sp(LJv LJu) (1 λ)sp(v u). (177)

Then, it can be shown that the relative value iteration converges to a solution of the optimal equation similar to the relative value iteration for non-robust MDPs under the average-reward criterion (Theorem 8.5.7 in (Puterman, 1994), Section 1.6.4 in(Sigaud & Buffet, 2013)), and hence (wt, Vt) converges to a solution to (14) as ϵ 0.

Appendix F. Robust RVI TD Method for Policy Evaluation

We define the following notation:

a π(a|s)r(s, a),

a π(a|s)σPas (V ),

σP(V ) (σPs1(V ), σPs2(V ), ..., σPs|S|(V )) R|S|.

Robust Average-Reward Reinforcement Learning

F.1 Proof of Lemma 1

We construct the following example.

Example 1. Consider an MDP with 3 states (1,2,3) and only one action a, and set a (s, a)- rectangular uncertainty set P = Pa 1 N Pa 2 N Pa 3 where Pa 1 = {Pa 11, Pa 12}, Pa 2 = {(0, 0, 1) } and Pa 3 = {(0, 1, 0) }, where Pa 11 = (0, 1, 0) , Pa 12 = (0, 0, 1) . Hence, the uncertainty set contains two transition kernels P = {P1, P2}. The reward of each state is set to be r = (r1, r2, r3). The only stationary policy π in this example is π(i) = a, i.

Note that this robust MDP is a unichain and hence satisfies Assumption 3 with gπ P1(1) = gπ P1(2) = gπ P1(3), gπ P2(1) = gπ P2(2) = gπ P2(3). Under both transition kernels P1, P2, the average-reward are identical: gπ P1 = gπ P2 = 0.5r2 + 0.5r3. Hence, both P1, P2 are the worst-case transition kernels. According to Section A.5 of (Puterman, 1994), the relative value functions w.r.t. P1, P2 can be computed as

V π P1 = r1 1

V π P2 = r1 3

When r3 > r2, only V π P1 is the solution to (12); and when r2 > r3, only V π P2 is the solution to (12). Hence, this proves Lemma 1 and implies that not any relative value function w.r.t. a worst-case transition kernel is a solution to (12).

F.2 Proof of Theorem 10

Theorem 21. (Restatement of Theorem 10) Under Assumptions 3,2,4, (f(Vn), Vn) converges to a (possible sample path dependent) solution to (12) a.s..

We first show the stability of the robust RVI TD algorithm in the following lemma.

Lemma 13. Algorithm 4 remains bounded during the update, i.e.,

sup n Vn < , a.s.. (178)

Proof. Denote by

h(V ) rπ + σP(V ) f(V )e V. (179)

Then the update of robust RVI TD can be rewritten as

Vn+1 = Vn + αn(h(Vn) + Mn+1), (180)

where Mn+1 ˆTVn rπ σP(V ) is the noise term. Further, define the limit function h :

h (V ) lim c h(c V )

Wang, Velasquez, Atia, Prater-Bennette, Zou

Then, from σPas (c V ) = cσPas (V ) and f(c V ) = cf(V ), it follows that

h (V ) = lim c rπ

c + σP(V ) f(V )e V = σP(V ) f(V )e V. (182)

According to Section 2.1 and Section 3.2 of (Borkar, 2009), it suffices to verify the following assumptions: (1). h is Lipschitz; (2). Stepsize αn satisfies Assumption 4; (3). Denoting by Fn the σ-algebra generated by V0, M1, ..., Mn, then E[Mn+1|Fn] = 0, E[ Mn+1 2|Fn] K(1 + Vn 2) for some constant K > 0. (4). h has the origin as its unique globally asymptotically stable equilibrium. First, note that

h(V1) h(V2)

a π(a|s)(σPas (V1) σPas (V2)) (f(V1) f(V2)) (V1(s) V2(s))

a π(a|s)(σPas (V1) σPas (V2))

+ |(f(V1) f(V2))| + |(V1(s) V2(s))|

(2 + Lf) V1 V2 , (183)

where the last inequality follows from the fact that the support function σP( ) is 1-Lipschitz and the assumptions on f in Assumption 2. Thus, h is Lipschitz, which verifies (1). It is straightforward that (3) is satisfied if E[ˆTVn|Fn] = rπ +σP(Vn) and Var[ˆTVn|Fn] K(1+ Vn 2). As discussed in Section 5.1, we assume the existence of an unbiased estimator ˆT with bounded variance here, and we will construct the estimator in Section 5.3. Then, it suffices to verify condition (4), i.e., to show that the ODE

x(t) = h (x(t)) (184)

has 0 as its unique globally asymptotically stable equilibrium. Define an operator T0(V )(s) P a π(a|s)σPas (V ). Then, any equilibrium W of (184) satisfies

T0(W) f(W)e W = 0. (185)

This equation can be further rewritten as a set of equations: ( W =T0(W) ge,

g =f(W). (186)

The equation in (186) is the robust Bellman equation for a zero-reward robust MDP. Hence, from Theorem 7, any solution (g, W) to (186) satisfies

g = gπ P, W = V π P + ce, (187)

Robust Average-Reward Reinforcement Learning

where V π P is the relative value function w.r.t. some worst-case transition kernel P (i.e., gπ P = min P P gπ P), and some c R. Hence, any equilibrium of (184) satisfies

W = V π P + ce, f(W) = gπ P. (188)

However, note that this robust Bellman equation is for a zero-reward robust MDP, hence for any P,

gπ P = lim T EP

t=0 (rt gπ P)

thus gπ P = 0 and W = ce for some c R. From (188), it follows that f(W) = f(ce) = 0, for any equilibrium W. From Assumption 2, we have that f(ce) = cf(e) = c = 0. This further implies that

W = V π P + ce = 0. (191)

Thus, the only equilibrium of (184) is 0. We then show that 0 is globally asymptotically stable. Recall that the zero-reward robust Bellman operator

T0V (s) = X

a π(a|s)(σPas (V )). (192)

We further introduce two operators:

T 0V T0V f(V )e, (193) T0V T0V gπ Pe. (194)

Note that in the zero-reward robust MDP, gπ P = 0 and T0 = T0, but we introduce this notation for future use. Consider the ODEs w.r.t. these two operators:

x = T 0x x, (195)

y = T0y y. (196)

First, it can be easily shown that both T 0 and T0 are Lipschitz with constants 1 + Lf and 1, respectively. Hence, both two ODEs are well-posed. Also, it can be seen that (195) is the same as the ODE in (184). Since the second equation (196) is a non-expansion (Lipschitz with parameter no larger than 1), Theorem 3.1 of (Borkar & Soumyanatha, 1997) implies that any solution y(t) to (196) converges to the set of equilibrium points, i.e.,

y(t) n W : W = T0W o , a.s.. (197)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Similar to the discussion for T0, our Theorem 7 implies that the set of equilibrium points of (196) is {W = ce : c R}. Hence, for any solution y(t) to (196), y(t) ce for some constant k that may depend on the initial value of y(t). Now, consider the solution x(t) to (195). According to Lemma 20 (note that T0 here is a special case of T in Lemma 20 with r = 0), if the solutions x(t), y(t) have the same initial value x(0) = y(0), then

x(t) = y(t) + r(t)e, (198)

where r(t) is a solution to r(t) = r(t) + gπ P f(y(t)), r(0) = 0. Note that the solution r(t) with r(0) = 0 can be written as

0 e (t s)(gπ P f(y(s)))ds (199)

by variation of constants formula (Abounadi et al., 2001). If we denote the limit of y(t) by y = ce, then limt r(t) = gπ P f(y ) (Lemma B.4 in (Wan et al., 2021), Theorem 3.4 in (Abounadi et al., 2001)). Hence, x(t) = y(t) + r(t)e converges to y + (gπ P f(y ))e, i.e.,

x(t) ce f(ce)e = 0. (200)

Hence, any solution x(t) to (195) converges to 0, which is its unique equilibrium. This thus implies that 0 is the unique globally asymptotically stable equilibrium. Together with Theorem 3.7 in (Borkar, 2009), it further implies the boundedness of Vn, which completes the proof.

We can readily prove Theorem 21.

Proof. In Lemma 13, we have shown that

sup n Vn < , a.s.. (201)

Thus, we have verified that conditions (A1-A3) and (A5) in (Borkar, 2009) are satisfied. Lemma 2.1 in (Borkar, 2009) thus implies that it suffices to study the solution to the ODE x(t) = h(x(t)). For the robust Bellman operator TV = rπ + σP(V ), define

T V TV f(V )e, (202) TV TV gπ Pe. (203)

From Lemma 20, we know that if x(t), y(t) are the solutions to equations

x = T x x, (204)

y = Ty y, (205)

with the same initial value x(0) = y(0), then

x(t) = y(t) + r(t)e, (206)

Robust Average-Reward Reinforcement Learning

where r(t) satisfies

r(t) = r(t) + gπ P f(y(t)), r(0) = 0. (207)

Thus, by the variation of constants formula,

0 e (t s)(gπ P f(y(s)))ds. (208)

Note that T is also non-expansive, hence y(t) converges to some equilibrium of (205) (Theorem 3.1 of (Borkar & Soumyanatha, 1997)). The set of equilibrium points of (205) can be characterized as n W : TW = W o

= {W : W = TW gπ Pe}

W : W(s) = X

a π(a|s)(r(s, a) gπ P + σPas (W)), s S

From Theorem 7, any equilibrium of (205) can be rewritten as

W = V π P + ce, for some P Ωπ g, c R. (210)

Thus, y(t) converges to an equilibrium denoted by y :

y(t) y V π P + c e, for some P Ωπ g, c R. (211)

Similar to Lemma 13, it can be showed that r(t) gπ P f(y ) (Lemma B.4 in (Wan et al., 2021), Theorem 3.4 in (Abounadi et al., 2001)). This further implies that

x(t) y + (gπ P f(y ))e = V π P + (c + gπ P f(y ))e, (212)

and we denote m = c +gπ P f(y ). Moreover, since f is continuous (because it is Lipschitz), we have that

f(x(t)) f(V π P + (c + gπ P f(y ))e)

= f(V π P ) + c + gπ P f(y )

= f(V π P ) + c + gπ P f(V π P + c e)

= f(V π P ) + c + gπ P f(V π P ) c

= gπ P. (213)

Hence, we show that

x(t) V π P + m e, (214)

f(x(t)) gπ P. (215)

Following Lemma 2.1 from (Borkar, 2009), we conclude that a.s.,

Vn V π P + m e, (216)

f(Vn) gπ P, (217)

which completes the proof.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Appendix G. Robust RVI Q-Learning

G.1 Proof of Theorem 11

Lemma 14. If ˆH satisfies that for any Q, s S, a A, E[ ˆHQ(s, a)] = HQ(s, a) and Var( ˆHQ(s, a)) C(1 + Q 2) for some constant C, then under Assumptions 2, 1 and 4, Algorithm 5 remains bounded during the update almost surely, i.e.,

sup n Qn < , a.s.. (218)

Proof. Denote by

h(Q) rπ + σP(VQ) f(Q)e Q. (219)

Then, the update of robust RVI Q-learning can be rewritten as

Qn+1 = Qn + αn(h(Qn) + Mn+1), (220)

where Mn+1 ˆHQn rπ σP(VQ) is the noise term. Further, define the limit function h :

h (Q) lim c h(c Q)

Then, note that σPas (Vc Q) = σPas (c VQ) = cσPas (VQ) for c > 0 and f(c Q) = cf(Q). It then follows that

h (Q) = lim c rπ

c + σP(VQ) f(Q)e Q = σP(VQ) f(Q)e Q. (222)

Similar to the proof of Theorem 13, it suffices to verify the following conditions: (1). h is Lipschitz; (2). Stepsize αn satisfies Assumption 4; (3). E[Mn+1|Fn] = 0, and E[ Mn+1 2|Fn] K(1 + Qn 2) for some constant K. (4). h has the origin as its unique globally asymptotically stable equilibrium. Clearly, (2) and (3) can be verified similarly to Theorem 13. We then verify (1) and (4). Firstly, it can be shown that

|h(Q1)(s, a) h(Q2)(s, a)|

= σPas (VQ1) f(Q1) Q1(s, a) σPas (VQ2) f(Q2) Q2(s, a)

σPas (VQ1) σPas (VQ2) + |f(Q1) f(Q2)| + |Q1(s, a) Q2(s, a)|

VQ1 VQ2 + Lf Q1 Q2 + Q1 Q2

(2 + Lf) Q1 Q2 , (223)

where the last inequality is from the fact that VQ1 VQ2 Q1 Q2 . This implies that h is Lipschitz. To verify (4), note that the stability equation is

X(t) = h (X(t)) = σP(VX(t)) f(X(t))e X(t), (224)

Robust Average-Reward Reinforcement Learning

where VX(t) is a |S|-dimensional vector with VX(t)(s) = maxa X(t)(s, a). Any equilibrium Q of the stability equation (224) satisfies that

Q(s, a) = σPas (VQ) f(Q)e, (225)

which can be viewed as an optimal robust Bellman equation (13) with zero reward. Hence, by Theorem 19, it implies that

f(Q) = g P = 0, (226)

VQ = V πQ P + ce for some P ΩπQ g , c R. (227)

In the zero-reward MDP, we have that V π P = 0 for any π, P, thus VQ(s) = maxa Q(s, a) = c for any s S. Note that from (225), Q satisfies that

Q(s, a) = σPas (VQ) = σPas (ce) = c. (228)

Since f(Q) = 0, it implies that

f(Q) = f(ce) = c = 0. (229)

c = 0, (230)

Q = 0. (231)

Thus, 0 is the unique equilibrium of the stability equation. We then show that 0 is globally asymptotically stable. Define the zero-reward optimal robust Bellman operator

H0Q(s, a) = σPas (VQ), (232)

and further introduce two operators

H 0Q(s, a) = σPas (VQ) f(Q), (233) H0Q(s, a) = σPas (VQ) g P. (234)

It is straightforward to verify that H0 is non-expansive. Hence by (Borkar & Soumyanatha, 1997), the solution y(t) to equation

y = H0y y (235)

converges to the set of equilibrium points

{W : W(s, a) = σPas (VW ) g P}, a.s.. (236)

This again can be viewed as an optimal robust Bellman equation with zero-reward. Hence, any equilibrium W of (235) satisfies

max a W(s, a) = c, s. (237)

Wang, Velasquez, Atia, Prater-Bennette, Zou

This together with (236) further implies that the equilibrium W of (235) satisfies

W(s, a) = σPas (VW ) = σPas (ce) = c, (238)

and hence y(t) converges to {ce : c R}. We denote its limit by y = c e. Lemma 25 implies the solution x(t) to the ODE x = H 0(x) x can be decomposed as x(t) = y(t) + r(t)e, where r(t) satisfies r(t) = r(t) + g P f(y(t)), r(0) = 0. Then, similar to Lemma 13, Lemma B.4 in (Wan et al., 2021) and Theorem 3.4 in (Abounadi et al., 2001), it can be shown that r(t) g P f(y(t)) = c . Hence,

x(t) 0, (239)

which proves the asymptotic stability. Thus, we conclude that 0 is the unique globally asymptotically stable equilibrium of the stability equation, which implies the boundedness of {Qn} together with results from Section 2.1 and 3.2 from (Borkar, 2009).

Theorem 22 (Restatement of Theorem 11). The sequence {Qn} generated by Algorithm 5 converges to a solution Q to the optimal robust Bellman equation a.s., and f(Qn) converges to the optimal robust average-reward g P a.s..

Proof. According to Lemma 1 from (Borkar, 2009) and Theorem 3.5 from (Abounadi et al., 2001), the sequence {Qn} converge to the same limit as the solution x(t) to the ODE x = H x x. Hence the proof can be completed by showing convergence of x(t) and f(x(t)). For the optimal robust Bellman operator,

HQ(s, a) = r(s, a) + σPas (VQ), (240)

define two operators

H Q HQ f(Q)e, (241) HQ HQ g Pe. (242)

From Lemma 25, we know that if x(t), y(t) are the solutions to equations

x = H x x, (243)

y = Hy y, (244)

with the same initial value x(0) = y(0), then

x(t) = y(t) + r(t)e, (245)

where r(t) satisfies

r(t) = r(t) + g P f(y(t)), r(0) = 0. (246)

Robust Average-Reward Reinforcement Learning

It can be easily verified that H is non-expansive. Hence y(t) converges to the set of equilibrium points of of (244) (Theorem 3.1 of (Borkar & Soumyanatha, 1997)), which can be characterized as n W : HW = W o

= {W : W = HW g Pe}

= W : W(s, a) = r(s, a) g P + σPas (VW ), s, a . (247)

From Theorem 19, any equilibrium W satisfies

VW = V πW P + ce, for some P ΩπW g , c R, (248)

and πW is robust optimal. We denote the limit of y(t) by W . Similar to (212) to (216), it can be shown that r(t) g P f(W ). This further implies that

x(t) W + (g P f(W ))e W + m e, (249)

where m = g P f(W ). Note that W + m e is a solution to the optimal robust Bellman equation, hence x(t) converges to a solution to (13). Moreover, since f is continuous (because it is Lipschitz), we have that

f(x(t)) f(W + m e)

= f(W ) + g P f(W )

= g P. (250)

This completes the proof.

Appendix H. Case Studies for Robust RVI TD

In this section, we provide the proof of the first part of Theorem 12, i.e., that ˆT is unbiased and has bounded variance under each uncertainty model. We first show a lemma, by which the problem can be reduced to investigating whether ˆσPas is unbiased and has bounded variance.

Lemma 15. If

E[ˆσPas V ] = σPas (V ), s, a, (251)

and moreover, there exists a constant C, such that

Var(ˆσPas V ) C(1 + V 2), s, a, (252)

E[ˆTV (s)] = TV (s), s, (253)

Var(ˆTV (s)) |A|C(1 + V 2), s. (254)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Proof. From the definition, ˆTV (s) = P a π(a|s)(r(s, a) + ˆσPas V ). Thus,

E[ˆTV (s)] = E X

a π(a|s)(r(s, a) + ˆσPas V )

a π(a|s)(r(s, a) + E[ˆσPas V ])

a π(a|s)(r(s, a) + σPas (V )) = TV (s), (255)

which shows that ˆT is unbiased. On the other hand, we have that

Var(ˆTV (s)) = E X

a π(a|s)(r(s, a) + ˆσPas V ) E X

a π(a|s)(r(s, a) + ˆσPas V ) 2

a π(a|s)(r(s, a) + ˆσPas V ) X

a π(a|s)(r(s, a) + E ˆσPas V 2

a π(a|s)(ˆσPas V ) E ˆσPas V 2

a π(a|s)(ˆσPas V E ˆσPas V )2

a π(a|s)E (ˆσPas V E ˆσPas V 2 )2

a π(a|s)Var(ˆσPas V )

|A|C(1 + V 2), (256)

where (a) is because (E[X])2 E[X2], which completes the proof.

This lemma implies that to prove Theorem 12, it suffices to show that ˆσPas is unbiased and has bounded variance.

H.1 Contamination Uncertainty Set

Theorem 23. ˆT defined in (18) is unbiased and has bounded variance.

Proof. First, note that

Vn+1(s) = Vn(s) + αn(r(s, a) + ((1 ζ)Vn(s ) + ζ min x Vn(x) f(Vn) Vn(s))

= Vn(s) + αn(TVn(s) f(Vn) Vn(s) + Mn(s)), (257)

Mn(s) = r(s, a) + (1 ζ)Vn(s ) + ζ min x Vn(x) TVn(s), (258)

Robust Average-Reward Reinforcement Learning

a π(a|s) r(s, a) + (1 ζ) X

s Pa s,s Vn(s ) + ζ min x Vn(x) . (259)

E[Mn(s)] = E r(s, a) + (1 ζ)Vn(s ) + ζ min x Vn(x)

a π(a|s) r(s, a) + (1 ζ) X

s Pa s,s Vn(s ) + ζ min x Vn(x)

a π(a|s) r(s, a) + (1 ζ) X

s Pa s,s Vn(s ) + ζ min x Vn(x)

a π(a|s) r(s, a) + (1 ζ) X

s Pa s,s Vn(s ) + ζ min x Vn(x))

Hence, the operator is unbiased. We also have that

E[|Mn(s)|2] = E r(s, a) + (1 ζ)Vn(s ) + ζ min x Vn(x) TVn(s) 2

2E r(s, a) + (1 ζ)Vn(s ) + ζ min x Vn(x) 2 + 2E[(TVn(s))2]

(a) 8 + 8 Vn 2

8(1 + Vn 2), (261)

where (a) is from the fact that E (1 ζ)Vn(s ) + ζ minx Vn(x) 2 = E (1 ζ)Vn(s ) + ζ minx Vn(x) 2 E (1 ζ)Vn(s ) + ζ minx Vn(x) 2 E (1 ζ) Vn + ζ Vn 2 Vn 2. The proof is completed.

H.2 Total Variation Uncertainty Set

The estimator under the total variation uncertainty set can be written as

ˆσPas (V ) = max µ 0 ˆPa,1 s,N+1(V µ) ζSpan(V µ) + N(V )

p N , (262)

N(V ) = max µ 0 ˆPa s,N+1(V µ) ζSpan(V µ)

2 max µ 0 ˆPa,O s,N+1(V µ) ζSpan(V µ)

2 max µ 0 ˆPa,E s,N+1(V µ) ζSpan(V µ) . (263)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Theorem 24. The estimated operator ˆσPas defined in (262) is unbiased, i.e.,

E[ˆσPas V ] = σPas (V ). (264)

Proof. First, denote the dual function (20) by g:

g V s,a(µ) = Pa s(V µ) ζSpan(V µ), (265)

and denote its optimal solution by µV s,a:

µV s,a = arg max µ 0 Pa s(V µ) ζSpan(V µ) . (266)

Then, the support function σPas (V ) = g V s,a(µV s,a). Similarly, define the empirical function

g V s,a,N+1(µ) = ˆPa s,N+1(V µ) ζSpan(V µ), (267)

g V s,a,N+1,O(µ) = ˆPa,O s,N+1(V µ) ζSpan(V µ), (268)

g V s,a,N+1,E(µ) = ˆPa,E s,N+1(V µ) ζSpan(V µ), (269)

and their optimal solutions are denoted by µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E. We have that

E[ˆσPas V ] = E max µ 0 ˆPa,1 s,N+1(V µ) ζSpan(V µ) + N(V )

= E[g V s,a,0(µV s,a,0)] + E N(V )

= E[g V s,a,0(µV s,a,0)] +

n=0 p(N = n)E N(V )

= E[g V s,a,0(µV s,a,0)] +

n=0 E[ n(V )]

= E[g V s,a,0(µV s,a,0)]

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n+1,O(µV s,a,n+1,O) + g V s,a,n+1,E(µV s,a,n+1,E)

= E[g V s,a,0(µV s,a,0)] +

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) , (270)

where the last inequality is from Lemma 21. The last equation can be further rewritten as

E[ˆσPas V ] = E[g V s,a,0(µV s,a,0)] +

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n)

= lim n E g V s,a,n(µV s,a,n) . (271)

To show that ˆσPas is unbiased, it suffices to prove that

lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a). (272)

Robust Average-Reward Reinforcement Learning

For any arbitrary i.i.d. samples {Xi} and its corresponding function g V s,a,n, together with Lemma 22, we have that

|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|

= | max 0 µ V + V e g V s,a(µ) max 0 µ V + V e g V s,a,n(µ)|

max 0 µ V + V e |g V s,a(µ) g V s,a,n(µ)|

= max 0 µ V + V e |Pa s(V µ) ζSpan(V µ) ˆPa s,n(V µ) + ζSpan(V µ)|

= max 0 µ V + V e |Pa s(V µ) ˆPa s,n(V µ)|

max 0 µ V + V e V µ Pa s ˆPa s,n 1

3 V Pa s ˆPa s,n 1. (273)

Thus, by Hoeffding s inequality and Theorem 3.7 from (Liu et al., 2022),

E[|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|] 3 V |S|2 π

which implies that

lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a), (275)

completing the proof.

Theorem 25. The estimated operator ˆσPas defined in (262) has bounded variance, i.e., there exists a constant C0, such that

Var(ˆσPas V ) (1 + 18(1 + 2ζ)2 + 2C0) V 2. (276)

Proof. Similar to Theorem 24, we have that

Var(ˆσPas V )

= E[(ˆσPas V )2] σPas (V )2

E g V s,a,0(µV s,a,0) + N(V )

2 + (σPas (V ))2

2E g V s,a,0(µV s,a,0) 2 + 2E N(V )

2 + (σPas (V ))2

(1 + 18(1 + 2ζ)2) V 2 + 2

E[( i(V ))2]

Wang, Velasquez, Atia, Prater-Bennette, Zou

where the last inequality is from Lemma 22. For any n 1, we have that

E[( n(V ))2]

= E g V s,a,n(µV s,a,n) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

= E g V s,a,n(µV s,a,n) g V s,a(µV s,a) + g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2]

+ 2E g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

(a) = 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E[(g V s,a,n 1(µV s,a,n 1) g V s,a(µV s,a))2]

18 V 2E[ Pa s ˆPa s,n 2 1] + 18 V 2E[ Pa s ˆPa s,n 1 2 1], (278)

where (a) is due to Lemma 21 and the last inequality follows a similar argument to (273). Note that pn = Ψ(1 Ψ)n for Ψ (0, 0.5), thus similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that

E[( i(V ))2]

pi C0 V 2. (279)

Var(ˆσPas V ) (1 + 18(1 + 2ζ)2) V 2 + 2C0 V 2 = (1 + 18(1 + 2ζ)2 + 2C0) V 2 . (280)

H.3 Chi-Square Uncertainty Set

The estimator under the Chi-square uncertainty set can be written as

ˆσPas V = max µ 0 ˆPa,1 s,N+1(V µ) q

ζVarˆPa,1 s,N+1(V µ)

p N , (281)

N(V ) = max µ 0 EˆPa s,N+1[V µ] q

ζVarˆPa s,N+1(V µ)

2 max µ 0 EˆPa,O s,N+1[V µ] q

ζVarˆPa,O s,N+1(V µ)

2 max µ 0 EˆPa,E s,N+1[V µ] q

ζVarˆPa,E s,N+1(V µ) .

Theorem 26. The estimated operator defined in (281) is unbiased, i.e.,

E[ˆσPas V ] = σPas (V ). (282)

Robust Average-Reward Reinforcement Learning

Proof. Denote the dual function (21) by g:

g V s,a(µ) = Pa s(V µ) q

ζVar Pas(V µ), (283)

and denote its optimal solution by µV s,a:

µV s,a = arg max µ 0 Pa s(V µ) q

ζVar Pas(V µ) . (284)

Then, the support function σPas (V ) = g V s,a(µV s,a). Similarly, define the empirical function

g V s,a,N+1(µ) = ˆPa s,N+1(V µ) q

ζVarˆPa s,N+1(V µ), (285)

g V s,a,N+1,O(µ) = ˆPa,O s,N+1(V µ) q

ζVarˆPa,O s,N+1(V µ), (286)

g V s,a,N+1,E(µ) = ˆPa,E s,N+1(V µ) q

ζVarˆPa,E s,N+1(V µ), (287)

and their optimal solutions are denoted by µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E. We have that

E[ˆσPas V ]

= E[g V s,a,0(µV s,a,0)] + E N(V )

= E[g V s,a,0(µV s,a,0)] +

n=0 p(N = n)E N(V )

= E[g V s,a,0(µV s,a,0)] +

= E[g V s,a,0(µV s,a,0)]

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n+1,O(µV s,a,n+1,O) + g V s,a,n+1,E(µV s,a,n+1,E)

= E[g V s,a,0(µV s,a,0)] +

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) , (288)

where the last inequality is from Lemma 21. The last equation can be further rewritten as

E[ˆσPas V ] = E[g V s,a,0(µV s,a,0)] +

n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n)

= lim n E g V s,a,n(µV s,a,n) . (289)

To show that ˆσPas is unbiased, it suffices to prove that

lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a). (290)

Wang, Velasquez, Atia, Prater-Bennette, Zou

For any arbitrary i.i.d. samples {Xi} and its corresponding function g V s,a,n, together with Lemma 23, we have that

|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|

= | max 0 µ V + V e g V s,a(µ) max 0 µ V + V e g V s,a,n(µ)|

max 0 µ V + V e |g V s,a(µ) g V s,a,n(µ)|

= max 0 µ V + V e

Pa s(V µ) ˆPa s,n(V µ) q

ζVar Pas(V µ) q

ζVarˆPas,n(V µ)

max 0 µ V + V e |Pa s(V µ) ˆPa s,n(V µ)|

+ max 0 µ V + V e

ζVar Pas(V µ) q

ζVarˆPas,n(V µ)

(a) max 0 µ V + V e V µ Pa s ˆPa s,n 1

+ max 0 µ V + V e

|ζVar Pas(V µ) ζVarˆPas,n(V µ)|, (291)

where (a) is due to | x y| p

|x y|. Note that for any distribution p, q (|S|) and any random variable X,

|Varp[X] Varq[X]| = |Ep[X2] Ep[X]2 Eq[X2] + Eq[X]2|

|Ep[X2] Eq[X2]| + |(Ep[X] + Eq[X])(Ep[X] Eq[X])|

sup |X2| p q 1 + 2(sup |X|)2 p q 1. (292)

|ζVar Pas(V µ) ζVarˆPas,n(V µ)| q

3ζ V µ 2 Pas ˆPas,n 1. (293)

Thus, by Hoeffding s inequality and Theorem 3.7 from (Liu et al., 2022),

E[|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|] 3 V

which implies that

lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a), (295)

which completes the proof.

Theorem 27. The estimated operator ˆσPas defined in (281) has bounded variance, i.e., there exists a constant C0, such that

Var(ˆσPas V ) (1 + 18(1 + p

2ζ)2 + 2C0) V 2. (296)

Robust Average-Reward Reinforcement Learning

Proof. We have that

Var(ˆσPas V )

= E[(ˆσPas V )2] σPas (V )2

E g V s,a,0(µV s,a,0) + N(V )

2 + (σPas (V ))2

2E g V s,a,0(µV s,a,0) 2 + 2E N(V )

2 + (σPas (V ))2

(1 + 18(1 + p

2ζ)2) V 2 + 2

E[( i(V ))2]

where the last inequality is from Lemma 23. For any n 1, we have that

E[( n(V ))2]

= E g V s,a,n(µV s,a,n) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

= E g V s,a,n(µV s,a,n) g V s,a(µV s,a) + g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2]

+ 2E g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O)

(a) = 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E[(g V s,a,n 1(µV s,a,n 1) g V s,a(µV s,a))2]

3ζ)2 V 2E[ Pa s ˆPa s,n 2 1 + Pa s ˆPa s,n 1]

3ζ)2 V 2E[ Pa s ˆPa s,n 1 2 1 + Pa s ˆPa s,n 1 1], (298)

where (a) is due to Lemma 21 and the last inequality follows a similar argument to (291). Note that pn = Ψ(1 Ψ)n for Ψ 0, 1

2 2 . Thus, similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that

E[( i(V ))2]

pi C0 V 2. (299)

Var(ˆσPas V ) (1 + 18(1 + p

2ζ)2) V 2 + 2C0 V 2 = (1 + 18(1 + p

2ζ)2 + 2C0) V 2. (300)

H.4 KL-Divergence Uncertainty Sets

The estimator under the KL-Divergence uncertainty set can be written as

ˆσPas V min α 0

ζα + α log e

V (s 1) α + N(V )

Wang, Velasquez, Atia, Prater-Bennette, Zou

N(V ) = min α 0

ζα + α log EˆPa s,N+1 e V

ζα + α log EˆPa,O s,N+1 e V

ζα + α log EˆPa,E s,N+1 e V

Theorem 28. (Liu et al., 2022) The estimated operator ˆσPas is unbiased and has bounded variance, i.e., there exists a constant C0, such that Var(ˆσPas V ) C0(1 + V 2).

H.5 Wasserstein Distance Uncertainty Sets

To study the support function w.r.t. this uncertainty model, we first introduce some notation.

Definition 2. For any function f : Z R, λ 0 and x Z, define the regularization operator

Φ(λ, x) inf y Z(λd(x, y)l + f(y)). (302)

The growth rate κ of function f and any distribution q over Z is defined as

κq inf λ 0 : X

x Z q(x)Φ(λ, x) > . (303)

Lemma 16. (Gao & Kleywegt, 2023) Consider the distributional robust optimization of a function f:

inf Wl(q,p) ζ Ex q[f(x)], (304)

and define its dual problem as

sup λ 0 ( λζl + X

x Z p(x) inf y Z(f(y) + λd(x, y)l)). (305)

If κp < , then the strong duality holds, i.e.,

inf Wl(q,p) ζ Ex q[f(x)] = sup λ 0 ( λζl + X

x Z p(x) inf y Z(f(y) + λd(x, y)l)). (306)

We first verify that this strong duality holds for our support function.

Lemma 17. (Restatement of (25)) It holds that

σPas (V ) = sup λ 0

x Pa s,x inf y (V (y) + λd(x, y)l) . (307)

Robust Average-Reward Reinforcement Learning

Proof. In our case, the regularization operator is

Φ(λ, x) = inf s S(λd(s, x)l + V (s)). (308)

Note that for any λ 0,

x S Pa s(x)Φ(λ, x) = X

x S Pa s(x) inf s S(λd(s, x)l + V (s)) V > . (309)

Hence, the growth rate κPas = 0 < . Thus, the strong duality holds.

Then, the estimator under the Wasserstein distance uncertainty set can be constructed as

ˆσPas V sup λ 0

λζl + inf y (V (y) + λd(s 1, y)l) + N(V )

p N + r(s, a), (310)

λζl + EˆPa s,N+1

inf y (V (y) + λd(S, y)l)

λζl + EˆPa,O s,N+1

inf y (V (y) + λd(S, y)l)

λζl + EˆPa,E s,N+1

inf y (V (y) + λd(S, y)l) .

Theorem 29. The estimated operator defined in (310) is unbiased, i.e.,

E[ˆσPas V ] = σPas (V ). (311)

Proof. Denote the dual function (25) by g:

g V s,a(λ) = λζl + ES Pas[inf x S(V (x) + λd(S, x)l)], (312)

and denote its optimal solution by λV s,a:

λV s,a = arg max λ 0

λζl + ES Pas[inf x S(V (x) + λd(S, x)l)] . (313)

Then, the support function σPas (V ) = g V s,a(λV s,a). Similarly, define the empirical function g V s,a,N+1, g V s,a,N+1,O, g V s,a,N+1,E, and denote their optimal solutions by λV s,a,N+1, λV s,a,N+1,O, λV s,a,N+1,E.

Wang, Velasquez, Atia, Prater-Bennette, Zou

We have that

E[ˆσPas V ]

= E[g V s,a,0(λV s,a,0)] + E N(V )

= E[g V s,a,0(λV s,a,0)] +

n=0 p(N = n)E N(V )

= E[g V s,a,0(λV s,a,0)] +

= E[g V s,a,0(λV s,a,0)]

n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n+1,O(λV s,a,n+1,O) + g V s,a,n+1,E(λV s,a,n+1,E)

= E[g V s,a,0(λV s,a,0)] +

n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n(λV s,a,n) , (314)

where the last inequality is from Lemma 21. The last equation can be further rewritten as

E[ˆσPas V ] = E[g V s,a,0(λV s,a,0)] +

n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n(λV s,a,n)

= lim n E g V s,a,n(λV s,a,n) . (315)

To show that ˆσPas is unbiased, it suffices to prove that

lim n E g V s,a,n(λV s,a,n) = g V s,a(λV s,a). (316)

For any arbitrary i.i.d. samples {Xi} and its corresponding function g V s,a,n, together with Lemma 24, we have that

|g V s,a,n(λV s,a,n) g V s,a(λV s,a)|

= | max 0 λ 2 V

ζl g V s,a(λ) max 0 λ 2 V

ζl g V s,a,n(λ)|

max 0 λ 2 V

ζl |g V s,a(λ) g V s,a,n(λ)|

= max 0 λ 2 V

ES Pas[inf x S(V (x) + λd(S, x)l)] ES ˆPas,n[inf x S(V (x) + λd(S, x)l)]

max 0 λ 2 V

ζl Pa s ˆPa s,n 1 sup x,S S (|V (x) + λd(S, x)l|)

V Pa s ˆPa s,n 1, (317)

Robust Average-Reward Reinforcement Learning

where the last inequality is from the bound on λ and D is the diameter of the metric space (S, d). By Hoeffding s inequality and similar to the previous proofs, we have that

E[|g V s,a,n(λV s,a,n) g V s,a(λV s,a)|] 1 + 2Dl

which implies that

lim n E g V s,a,n(λV s,a,n) = g V s,a(λV s,a). (319)

This completes the proof.

Theorem 30. The estimated operator ˆσPas defined in (310) has bounded variance, i.e., there exists a constant C0, such that

Var(ˆσPas V ) (3 + 2C0) V 2. (320)

Proof. We first have that

Var(ˆσPas V )

= E[(ˆσPas V )2] σPas (V )2

E g V s,a,0(λV s,a,0) + N(V )

2 + (σPas (V ))2

2E g V s,a,0(λV s,a,0) 2 + 2E N(V )

2 + (σPas (V ))2

E[( i(V ))2]

where the last inequality is from Lemma 23. For any n 1, we have that

E[( n(V ))2]

= E g V s,a,n(λV s,a,n) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O)

= E g V s,a,n(λV s,a,n) g V s,a(λV s,a) + g V s,a(λV s,a) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O)

2E[(g V s,a,n(λV s,a,n) g V s,a(λV s,a))2]

+ 2E g V s,a(λV s,a) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O)

(a) = 2E[(g V s,a,n(λV s,a,n) g V s,a(λV s,a))2] + 2E[(g V s,a,n 1(λV s,a,n 1) g V s,a(λV s,a))2]

2 V 2E[ Pa s ˆPa s,n 2 1] + 2 1 + 2Dl

2 V 2E[ Pa s ˆPa s,n 1 2 1], (322)

Wang, Velasquez, Atia, Prater-Bennette, Zou

where (a) is due to Lemma 21 and the last inequality follows a similar argument to (318). Note that pn = Ψ(1 Ψ)n for Ψ 0, 0.5 , thus similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that

E[( i(V ))2]

pi C0 V 2. (323)

Thus, we have that

Var(ˆσPas V ) 3 V 2 + 2C0 V 2 = (3 + 2C0) V 2. (324)

Appendix I. Case Studies for Robust RVI Q-Learning

In this section, we provide the proof of the second part of Theorem 12, i.e., ˆH is bounded and unbiased under each uncertainty model. We note that the proofs in this part can be easily derived by following the ones in Section H. We first prove a lemma necessary to the proofs in this section.

Lemma 18. It holds that

VQ Q . (325)

Proof. From the definition of VQ, we have that

VQ = max s |VQ(s)| = max s | max a Q(s, a)| |Q(s , a )|. (326)

Clearly, |Q(s , a )| maxs,a |Q(s, a)|, hence

VQ Q . (327)

Similar to Section H, the propositions of ˆH can be reduced to the ones of ˆσPas .

Lemma 19. If E[ˆσPas V ] = σPas (V ), and moreover there exists a constant C, such that for any s, a, Var(ˆσPas V ) C(1 + V 2), then E[ ˆHQ(s, a)] = HQ(s, a), and Var( ˆHQ(s, a)) C(1 + Q 2).

Proof. First, we have that

E[ ˆHQ(s, a)] = E[r(s, a) + ˆσPas VQ(s)] = r(s, a) + σPas (VQ) = HQ(s, a). (328)

For boundedness, note that

Var( ˆHQ(s, a)) = E ( ˆHQ(s, a) HQ(s, a))2

= E ˆσPas VQ(s) σPas (VQ) 2

C(1 + VQ 2)

C(1 + Q 2), (329)

where the last inequality is from Lemma 18.

Robust Average-Reward Reinforcement Learning

This implies that the problem is reduced to verifying whether ˆσPas is unbiased and has bounded variance, which is identical to the results in Section H. We thus omit the proofs for this part.

Appendix J. Technical Lemmas

Lemma 20. For a robust Bellman operator T, define

T V TV f(V )e, (330) TV TV gπ Pe. (331)

Assume that x(t), y(t) are the solutions to equations

x = T x x, (332)

y = Ty y, (333)

with the same initial value x(0) = y(0) = x0. Then,

x(t) = y(t) + r(t)e, (334)

where r(t) satisfies

r(t) = r(t) + gπ P f(y(t)). (335)

Proof. Note that T V = TV + (gπ P f(V ))e, then from the variation of constants formula, we have that

x(t) = x0e t + Z t

0 e (t s) T(x(s))ds + Z t

0 e (t s)(gπ P f(x(s)))ds e, (336)

y(t) = x0e t + Z t

0 e (t s) T(y(s))ds. (337)

Hence, the maximal and minimal components of x(t) y(t) can be bounded as:

max i (xi(t) yi(t)) Z t

0 e (t s) max i ( Ti(x(s)) Ti(y(s)))ds + Z t

0 e (t s)(gπ P f(x(s)))ds,

min i (xi(t) yi(t)) Z t

0 e (t s) min i ( Ti(x(s)) Ti(y(s)))ds + Z t

0 e (t s)(gπ P f(x(s)))ds.

This hence implies that

Span(x(t) y(t)) Z t

0 e (t s)Span( T(x(s)) T(y(s)))ds

0 e (t s)Span(x(s) y(s))ds, (338)

where the last inequality is because T is non-expansive w.r.t. the span semi-norm.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Gronwall s inequality implies that Span(x(t) y(t)) 0 R t 0 e (t s)ds = 0 for any t 0. However, since Span is non-negative, then Span(x(t) y(t)) = 0. Hence, we have that x(t) = y(t) + r(t)e for some r(t) satisfying r(0) = 0. Also note that the differential of r(t) can be written as

r(t)e = x(t) y(t)

= Tx(t) + (gπ P f(x(t)))e x(t) Ty(t) + y(t)

= ( r(t) + gπ P f(y(t)))e, (339)

where the last equation is because

Tx(t) = T(y(t) + r(t)e) = T(y(t)) + r(t)e, (340)

f(x(t)) = f(y(t) + r(t)e) = f(y(t)) + r(t). (341)

This completes the proof.

Lemma 21. For any function g : (|S|) R, assume there are 2n+1 i.i.d. samples Xi q. Denote the empirical distributions from samples {Xi : i = 1, ..., 2n+1}, {X2i 1 : i = 1, ..., 2n}, {X2i : i = 1, ..., 2n} by ˆqn+1, ˆqn+1,O, ˆqn+1,E. Then,

E[g(ˆqn+1,O)] = E[g(ˆqn+1,E)] = E[g(ˆqn)]. (342)

Proof. Note that

ˆqn+1,O(s) = P2n i=1 1X2i 1=s

E[g(ˆqn+1,O)] = X

p=(p1,...,p|S|) (|S|) g(p)P(ˆqn+1,O = p)

p=(p1,...,p|S|) (|S|) P P2n i=1 1X2i 1=s1

2n = p1, ...,

P2n i=1 1X2i 1=s|S|

where 2npi N and P|S| i=1 pi = 1. On the other hand,

E[g(ˆqn)] = X

p=(p1,...,p|S|) (|S|) g(p)P(ˆqn = p)

p=(p1,...,p|S|) (|S|) P P2n i=1 1Xi=s1

2n = p1, ...,

P2n i=1 1Xi=s|S|

q g(p). (345)

Note that Xi are i.i.d., hence,

P P2n i=1 1Xi=s1

2n = p1, ...,

P2n i=1 1Xi=s|S|

= P P2n i=1 1X2i 1=s1

2n = p1, ...,

P2n i=1 1X2i 1=s|S|

Robust Average-Reward Reinforcement Learning

E[g(ˆqn+1,O)] = E[g(ˆqn)]. (346)

Similarly, E[g(ˆqn+1,E)] = E[g(ˆqn)] and hence it completes the proof.

Lemma 22. Under the total variation uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. Specifically,

µV s,a, µV s,a,N+1, µV s,a,N+1,E, µV s,a,N+1,O V + V e, (347)

µV s,a , µV s,a,N+1 , µV s,a,N+1,E , µV s,a,N+1,O 2 V , (348)

|g V s,a(µV s,a)|, |g V s,a,N+1(µV s,a,N+1)| 3(1 + 2ζ) V ,

|g V s,a,N+1,E(µV s,a,N+1,E)|, |g V s,a,N+1,O(µV s,a,N+1,O)| 3(1 + 2ζ) V . (349)

Proof. First we show the bounds on the optimal solutions. If we denote the minimal entry of V by w: w = mins V (s), then W V we 0. Note that,

µW s,a = arg max µ 0 Pa s(W µ) ζSpan(W µ)

= arg max µ 0 w + Pa s(V µ) ζSpan(V µ) , (350)

which is because Span(V + ke) = Span(V ) and Pa s(V + ke) = k + Pa s V . Hence, µW s,a = µV s,a. Moreover note that W 0, hence µW s,a is bounded: µW s,a W, this further implies that

µV s,a = µW s,a W 2 V . (351)

The bounds on µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E can be similarly derived. We then consider the optimal value. Note that,

g V s,a(µV s,a) = Pa s(V µV s,a) ζSpan(V µV s,a)

V + µV s,a + ζ| max i (V (i) µV s,a(i))| + ζ| min i (V (i) µV s,a(i))|

3 V + 2ζ( V + µV s,a )

3(1 + 2ζ) V . (352)

On the other hand,

g V s,a(µV s,a) g V s,a(0) = Pa s V ζSpan(V ) = Pa s V ζ max i V (i) + ζ min i V (i). (353)

Denote the maximal and minimal entries of V by V (M) and V (m), then we have that

Pa s V ζ max i V (i) + ζ min i V (i)

x Pa s,x V (x) ζV (M) + ζV (m)

V 2ζ V , (354)

Wang, Velasquez, Atia, Prater-Bennette, Zou

where the last inequality is from V V (i) V for any entry i. Thus, combining (353) and (354) implies that

(1 + 2ζ) V g V s,a(µV s,a) 3(1 + 2ζ) V . (355)

Similarly, the bounds on g V s,a,N+1(µV s,a,N+1), g V s,a,N+1,O(µV s,a,N+1,O), g V s,a,N+1,E(µV s,a,N+1,E) can be derived.

Lemma 23. Under the chi-square uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. Specifically,

µV s,a, µV s,a,N+1, µV s,a,N+1,E, µV s,a,N+1,O V + V e, (356)

µV s,a , µV s,a,N+1 , µV s,a,N+1,E , µV s,a,N+1,O 2 V , (357)

|g V s,a(µV s,a)|, |g V s,a,N+1(µV s,a,N+1)| 3(1 + p

2ζ) V , (358)

|g V s,a,N+1,O(µV s,a,N+1,O)|, |g V s,a,N+1,E(µV s,a,N+1,E)| 3(1 + p

2ζ) V . (359)

Proof. First, we show the bounds on the optimal solutions. If we denote the minimal entry of V by w: w = mins V (s), then W V we 0. Note that,

µW s,a = arg max W µ 0 Pa s(W µ) q

ζVar Pas(W µ)

= arg max W µ 0 w + Pa s(V µ) q

ζVar Pas(V µ) , (360)

which is because Var Pas(V µ we) = Var Pas(V µ) + Var Pas(we) 2Cov Pas(V µ, we) = Var Pas(V µ). Hence µW s,a = µV s,a. Moreover note that W 0, hence µW s,a is bounded: µW s,a W, this further implies that

µV s,a = µW s,a W 2 V . (361)

The bounds on µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E can be similarly derived. We then consider the optimal value. Note that,

|g V s,a(µV s,a)| = |Pa s(V µV s,a) q

ζVar Pas(V µVs,a)|

V + µV s,a + q

2ζ V µVs,a 2

2ζ( V + µV s,a )

2ζ) V . (362)

Similarly, the bounds on g V s,a,N+1(µV s,a,N+1), g V s,a,N+1,O(µV s,a,N+1,O), g V s,a,N+1,E(µV s,a,N+1,E) can be derived.

Lemma 24. Under the Wasserstein distance uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. Specifically,

λV s,a, λV s,a,n, λV s,a,n,O, λV s,a,n,E 2 V

|g V s,a(λV s,a)|, |g V s,a,n(λV s,a,n)|, |g V s,a,n,O(λV s,a,n,O)|, |g V s,a,n,E(λV s,a,n,E)| V . (364)

Robust Average-Reward Reinforcement Learning

Proof. First, we show the bounds on the optimal solutions. Denote the optimal solution to maxλ 0 g V s,a(λ) by λV s,a. Moreover, for each state y S and any λ 0, denote sy λ arg minx S{λd(x, y)l + V (x)}. Hence,

g V s,a(λ) = λζl + ES Pas[λd(S, s S λ)l + V (s S λ)]. (365)

Moreover, note that g V s,a(λV s,a) = maxλ 0 g V s,a(λ), hence,

λV s,aζl + ES Pas[λV s,ad(S, s S λVs,a)l + V (s S λVs,a)] g V s,a(0) = ES Pas[V (s S 0 )] = min x V (x), (366)

where the last equation is due to the fact that s S 0 = arg minx S{V (x)} = minx V (x). Now consider the inner problem ES Pas[λV s,ad(S, s S λVs,a)l + V (s S λVs,a)]. Note that,

ES Pas[λV s,ad(S, s S λVs,a)l + V (s S λVs,a)]

x Pa s,x λV s,ad(x, sx λVs,a)l + V (sx λVs,a)

x Pa s,x λV s,ad(x, x)l + V (x)

= EPas[V (S)], (367)

where (a) is because sx λVs,a = arg miny S{λV s,ad(x, y)l + V (y)} and hence λV s,ad(x, sx λVs,a)l +

V (sx λVs,a) λV s,ad(x, x)l + V (x).

Combine (366) and (367), then we further have that

min x V (x) λV s,aζl + ES Pas[λV s,ad(S, s S λVs,a)l + V (s S λVs,a)] λV s,aζl + EPas[V (S)]. (368)

This implies that

λV s,a EPas[V (S)] minx V (x)

and hence λV s,a is bounded. On the other hand, note that g V s,a(λV s,a) = σPas [V (S)], hence, g V s,a(λV s,a) V . (370)

Same bound can be similarly derived for

λV s,a,n, λV s,a,n,O, λV s,a,n,E, g V s,a,n(λV s,a,n), g V s,a,n,O(λV s,a,n,O), g V s,a,n,E(λV s,a,n,E).

Lemma 25. For an optimal robust Bellman operator: HQ(s, a) = r(s, a)+σPas (VQ), define

H Q HQ f(Q)e, (371) HQ HQ g Pe. (372)

Wang, Velasquez, Atia, Prater-Bennette, Zou

Assume that x(t), y(t) are the solutions to equations

x = H x x, (373)

y = Hy y, (374)

with the same initial value x(0) = y(0). Then x(t) = y(t) + r(t)e, where r(t) satisfies r(t) = r(t) + g P f(y(t)).

Proof. The proof follows exactly that of Lemma 20 if we show that H is non-expansion w.r.t. the span semi-norm. It can be shown that

Span( H(Q1) H(Q2)) Span(VQ1 VQ2). (375)

s = arg max i {max a Q1(i, a) max a Q2(i, a)}, (376)

t = arg min i {max a Q1(i, a) max a Q2(i, a)}. (377)

Span(VQ1 VQ2) = (max a Q1(s, a) max a Q2(s, a)) (max a Q1(t, a) max a Q2(t, a))

Q1(s, as) Q2(s, as) (Q1(t, at) Q2(t, at))

max x,b (Q1(x, b) Q2(x, b)) min x,b (Q1(x, b) Q2(x, b))

= Span(Q1 Q2). (378)

where as = arg maxa Q1(S, a) and at = arg maxa Q2(t, a). This completes the proof.

Robust Average-Reward Reinforcement Learning

Abdullah, M. A., Ren, H., Ammar, H. B., Milenkovic, V., Luo, R., Zhang, M., & Wang, J. (2019). Wasserstein robust reinforcement learning. ar Xiv preprint ar Xiv:1907.13196, abs/1907.13196.

Abounadi, J., Bertsekas, D., & Borkar, V. S. (2001). Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3), 681 698.

Archibald, T., Mc Kinnon, K., & Thomas, L. (1995). On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3), 354 361.

Atia, G. K., Beckus, A., Alkhouri, I., & Velasquez, A. (2021). Steady-state planning in expected reward multichain mdps. Journal of Artificial Intelligence Research, 72, 1029 1082.

Badrinath, K. P., & Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In Proc. International Conference on Machine Learning (ICML), pp. 511 520. PMLR.

Bagnell, J. A., Ng, A. Y., & Schneider, J. G. (2001). Solving uncertain Markov decision processes.. 1.

Bertsekas, D. P. (2011). Dynamic Programming and Optimal Control 3rd edition, Vol. II.

Blackwell, D. (1962). Discrete Dynamic Programming. The Annals of Mathematical Statistics, 33(2), 719 726.

Blanchet, J. H., & Glynn, P. W. (2015). Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pp. 3656 3667. IEEE.

Blanchet, J. H., Glynn, P. W., & Pei, Y. (2019). Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. ar Xiv preprint ar Xiv:1904.09929, 1904.09929.

Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint, Vol. 48. Springer.

Borkar, V. S., & Soumyanatha, K. (1997). An analog scheme for fixed point computation. i. theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 44(4), 351 355.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Open AI Gym. ar Xiv preprint ar Xiv:1606.01540, 1606.01540.

Chatterjee, K., Goharshady, E. K., Karrabi, M., Novotny, P., & Zikelic, D. (2023). Solving long-run average reward robust mdps via stochastic games. ar Xiv preprint ar Xiv:2312.13912, abs/2312.13912.

Chen, L., Jain, R., & Luo, H. (2022). Learning infinite-horizon average-reward Markov decision processes with constraints. ar Xiv preprint ar Xiv:2202.00150, abs/2202.00150.

Derman, C. (1970). Finite state Markovian decision processes. Academic Press, Inc.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Gao, R., & Kleywegt, A. (2023). Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research, 48(2), 603 655.

Giannoccaro, I., & Pontrandolfo, P. (2002). Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2), 153 161.

Goyal, V., & Grand-Clement, J. (2018). Robust Markov decision process: Beyond rectangularity. ar Xiv preprint ar Xiv:1811.00215, abs/1811.00215.

Grand-Cl ement, J., & Petrik, M. (2023). Reducing blackwell and average optimality to discounted mdps via the blackwell discount factor. ar Xiv preprint ar Xiv:2302.00036, abs/2302.00036.

Grand-Clement, J., Petrik, M., & Vieille, N. (2023). Beyond discounted returns: Robust markov decision processes with average and blackwell optimality. ar Xiv preprint ar Xiv:2312.03618, abs/2312.03618.

Ho, C. P., Petrik, M., & Wiesemann, W. (2018). Fast Bellman updates for robust MDPs. In Proc. International Conference on Machine Learning (ICML), pp. 1979 1988. PMLR.

Ho, C. P., Petrik, M., & Wiesemann, W. (2021). Partial policy iteration for L1-robust Markov decision processes. Journal of Machine Learning Research, 22(275), 1 46.

Hordijk, A., & Yushkevich, A. A. (2002). Blackwell optimality. In Handbook of Markov decision processes, pp. 231 267. Springer.

Hou, L., Pang, L., Hong, X., Lan, Y., Ma, Z., & Yin, D. (2020). Robust reinforcement learning with Wasserstein constraint. ar Xiv preprint ar Xiv:2006.00945, abs/2006.00945.

Hu, Z., & Hong, L. J. (2013). Kullback-Leibler divergence constrained distributionally robust optimization. Optimization Online, 1, 1695 1724.

Huang, S., Papernot, N., Goodfellow, I., Duan, Y., & Abbeel, P. (2017). Adversarial attacks on neural network policies. In Proc. International Conference on Learning Representations (ICLR).

Huber, P. J. (1965). A robust version of the probability ratio test. Ann. Math. Statist., 36, 1753 1758.

Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 257 280.

Kaufman, D. L., & Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3), 396 410.

Kazemi, M., Perez, M., Somenzi, F., Soudjani, S., Trivedi, A., & Velasquez, A. (2022). Translating omega-regular specifications to average objectives for model-free reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 732 741.

Kemmer, L., von Kleist, H., de Rochebou et, D., Tziortziotis, N., & Read, J. (2018). Reinforcement learning for supply chain optimization. In European Workshop on Reinforcement Learning, Vol. 14.

Robust Average-Reward Reinforcement Learning

Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238 1274.

Kos, J., & Song, D. (2017). Delving into adversarial attacks on deep policies. In Proc. International Conference on Learning Representations (ICLR).

Lan, G. (2020). First-order and Stochastic Optimization Methods for Machine Learning. Springer Nature.

Liang, Z., Ma, X., Blanchet, J., Zhang, J., & Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. ar Xiv preprint ar Xiv:2301.11721, abs/2301.11721.

Lim, S. H., & Autef, A. (2019). Kernel-based reinforcement learning in robust Markov decision processes. In Proc. International Conference on Machine Learning (ICML), pp. 3973 3981. PMLR.

Lim, S. H., Xu, H., & Mannor, S. (2013). Reinforcement learning in robust Markov decision processes. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 701 709.

Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., & Sun, M. (2017). Tactics of adversarial attack on deep reinforcement learning agents. In Proc. International Joint Conferences on Artificial Intelligence (IJCAI), pp. 3756 3762.

Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., & Zhou, Z. (2022). Distributionally robust Q-learning. In Proc. International Conference on Machine Learning (ICML), pp. 13623 13643. PMLR.

Mandlekar, A., Zhu, Y., Garg, A., Fei-Fei, L., & Savarese, S. (2017). Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3932 3939. IEEE.

Nilim, A., & El Ghaoui, L. (2004). Robustness in Markov decision problems with uncertain transition matrices. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 839 846.

Panaganti, K., & Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pp. 9582 9602. PMLR.

Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., & Chowdhary, G. (2018). Robust deep reinforcement learning with adversarial attacks. In Proc. International Conference on Autonomous Agents and Multi Agent Systems, pp. 2040 2042.

Pinto, L., Davidson, J., Sukthankar, R., & Gupta, A. (2017). Robust adversarial reinforcement learning. In Proc. International Conference on Machine Learning (ICML), pp. 2817 2826. PMLR.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming..

Rahimian, H., Bayraksan, G., & De-Mello, T. H. (2022). Effective scenarios in multistage distributionally robust optimization with a focus on total variation distance. SIAM Journal on Optimization, 32(3), 1698 1727.

Wang, Velasquez, Atia, Prater-Bennette, Zou

Rajeswaran, A., Ghotra, S., Ravindran, B., & Levine, S. (2017). Epopt: Learning robust neural network policies using model ensembles. In Proc. International Conference on Learning Representations (ICLR).

Roy, A., Xu, H., & Pokutta, S. (2017). Reinforcement learning under model mismatch. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 3046 3055.

Rudin, W. (2022). Functional Analysis (2nd edition). Mc Graw-Hill Science &Engineering &Math.

Satia, J. K., & Lave Jr, R. E. (1973). Markovian decision processes with uncertain transition probabilities. Operations Research, 21(3), 728 740.

Si, N., Zhang, F., Zhou, Z., & Blanchet, J. (2020). Distributionally robust policy evaluation and learning in offline contextual bandits. In Proc. International Conference on Machine Learning (ICML), pp. 8884 8894. PMLR.

Sigaud, O., & Buffet, O. (2013). Markov decision processes in artificial intelligence. John Wiley & Sons.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts.

Tamar, A., Mannor, S., & Xu, H. (2014). Scaling up robust MDPs using function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 181 189. PMLR.

Tessler, C., Efroni, Y., & Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In Proc. International Conference on Machine Learning (ICML), pp. 6215 6224. PMLR.

Tewari, A., & Bartlett, P. L. (2007). Bounded parameter Markov decision processes with average reward criterion. In International Conference on Computational Learning Theory, pp. 263 277. Springer.

Tsitsiklis, J. N., & Van Roy, B. (1999). Average cost temporal-difference learning. Automatica, 35(11), 1799 1808.

Vinitsky, E., Du, Y., Parvate, K., Jang, K., Abbeel, P., & Bayen, A. (2020). Robust reinforcement learning using adversarial populations. ar Xiv preprint ar Xiv:2008.01825, 2008.01825.

Wan, Y., Naik, A., & Sutton, R. S. (2021). Learning and planning in average-reward Markov decision processes. In Proc. International Conference on Machine Learning (ICML), pp. 10653 10662. PMLR.

Wan, Y., & Sutton, R. S. (2022). On convergence of average-reward off-policy control algorithms in weakly-communicating MDPs. ar Xiv preprint ar Xiv:2209.15141, 2209.15141.

Wang, G., & Wang, T. (2022). Unbiased multilevel Monte Carlo methods for intractable distributions: Mlmc meets mcmc. ar Xiv preprint ar Xiv:2204.04808, 2204.04808.

Wang, S., Si, N., Blanchet, J., & Zhou, Z. (2023). A finite sample complexity bound for distributionally robust q-learning. In International Conference on Artificial Intelligence and Statistics, pp. 3370 3398. PMLR.

Robust Average-Reward Reinforcement Learning

Wang, Y., & Zou, S. (2021). Online robust reinforcement learning with model uncertainty. In Proc. Advances in Neural Information Processing Systems (Neur IPS).

Wang, Y., & Zou, S. (2022). Policy gradient method for robust reinforcement learning. In Proc. International Conference on Machine Learning (ICML), Vol. 162, pp. 23484 23526. PMLR.

Wei, C.-Y., Jahromi, M. J., Luo, H., Sharma, H., & Jain, R. (2020). Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International conference on machine learning, pp. 10170 10180. PMLR.

Wiesemann, W., Kuhn, D., & Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1), 153 183.

Xu, H., & Mannor, S. (2010). Distributionally robust Markov decision processes. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 2505 2513.

Yang, W., Zhang, L., & Zhang, Z. (2022). Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6), 3223 3248.

Yu, H., & Bertsekas, D. P. (2009). Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54(7), 1515 1531.

Yu, P., & Xu, H. (2015). Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9), 2538 2543.

Zhang, S., Wan, Y., Sutton, R. S., & Whiteson, S. (2021a). Average-reward off-policy policy evaluation with function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 12578 12588. PMLR.

Zhang, S., Zhang, Z., & Maguluri, S. T. (2021b). Finite sample analysis of average-reward TD learning and Q-learning. In Proc. Advances in Neural Information Processing Systems (Neur IPS), Vol. 34, pp. 1230 1242.

Zhang, Y., & Ross, K. W. (2021). On-policy deep reinforcement learning for the averagereward criterion. In Proc. International Conference on Machine Learning (ICML), pp. 12535 12545. PMLR.

Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., & Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Proc. International Conference on Artifical Intelligence and Statistics (AISTATS), pp. 3331 3339. PMLR.