# modelfree_robust_averagereward_reinforcement_learning__ac7e5957.pdf Model-Free Robust Average-Reward Reinforcement Learning Yue Wang 1 Alvaro Velasquez 2 George Atia 3 Ashley Prater-Bennette 4 Shaofeng Zou 1 Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on the robust average-reward MDPs under the modelfree setting. We first theoretically characterize the structure of solutions to the robust averagereward Bellman equation, which is essential for our later convergence analysis. We then design two model-free algorithms, robust relative value iteration (RVI) TD and robust RVI Q-learning, and theoretically prove their convergence to the optimal solution. We provide several widely used uncertainty sets as examples, including those defined by the contamination model, total variation, Chi-squared divergence, Kullback-Leibler (KL) divergence and Wasserstein distance. 1. Introduction Reinforcement learning (RL) has demonstrated remarkable success in applications like robotics, finance, and games. However, RL often suffers from a severe performance degradation when deployed in real-world environments, which is often due to the model mismatch between the training and testing environments. Such a model mismatch could be a result of non-stationary conditions, modeling errors between simulated and real systems, external perturbations and adversarial attacks. To address this challenge, a framework of robust MDPs and robust RL was developed in (Nilim & El Ghaoui, 2004; Iyengar, 2005; Bagnell et al., 2001), where an uncertainty set of MDPs is constructed, and a pessimistic approach is adopted to optimize the worst-case performance over the uncertainty set. Such a minimax approach guarantees performance for all MDPs within the uncertainty set, making it robust to model mismatch. 1University at Buffalo 2University of Colorado Boulder 3University of Central Florida 4Air Force Research Laboratory. Correspondence to: Yue Wang , Shaofeng Zou . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Two performance criteria are commonly used for infinitehorizon MDPs: 1) the discounted-reward setting, where the reward is discounted exponentially with time; and 2) the average-reward setting, where the long-term averagereward over time is of interest. For systems that operate for an extended period of time, e.g., queue control, inventory management in supply chains, or communication networks, it is more important to optimize the average-reward since policies obtained from the discounted-reward setting may be myopic and have poor long-term performance (Kazemi et al., 2022). However, much of the work on robust MDPs has focused on the discounted-reward setting, and robust average-reward MDPs remain largely unexplored. Robust MDPs under the two criteria are fundamentally different. Compared to the discounted-reward setting, the average-reward setting places equal weight on immediate and future rewards, thus depends on the limiting stateaction frequencies of the underlying MDPs. This makes it more challenging to investigate than the discounted-reward setting. For example, to establish a contraction in the discounted-reward setting, it generally suffices to have a discount factor strictly less than one, whereas in the averagereward setting, convergence depends on the structure of the MDP. 
Studies on robust average-reward MDPs are quite limited in the literature, e.g., (Tewari & Bartlett, 2007; Lim et al., 2013; Wang et al., 2023), and are model-based, where the uncertainty set is fully known to the learner. However, the more practical model-free setting, where only samples from the nominal MDP (the centroid of the uncertainty set) can be observed, has yet to be explored. Algorithms and fundamental convergence guarantees in this setting have not been established yet, which is the focus of this paper. 1.1. Challenges and Contributions In this paper, we develop a fundamental characterization of solutions to the robust average-reward Bellman equation, design model-free algorithms with provable optimality guarantees, and construct an unbiased estimator of the robust average-reward Bellman operator for various types of uncertainty sets. Our results fill the gap in model-free algorithms for robust average-reward RL, and are fundamentally different from existing studies on robust discounted RL and robust and non-robust average-reward MDPs. In the following, we summarize the major challenges and our key contributions. Model-Free Robust Average-Reward Reinforcement Learning We characterize the fundamental structure of solutions to the robust average-reward Bellman equation. Valuebased approaches typically converge to a solution to the (robust) Bellman equation. For example, TD and Q-learning converge to the unique fixed-point that yields an optimal solution to the Bellman equation in the discounted setting (Sutton & Barto, 2018). However, in the (robust) averagereward setting, solutions to the (robust) average-reward Bellman equation may not be unique (Puterman, 1994; Wang et al., 2023), and the fundamental structure of the solution space usually plays an important role in proving convergence (e.g., (Wan et al., 2021; Abounadi et al., 2001; Tsitsiklis & Van Roy, 1999)). In the non-robust average-reward setting, the solution to the average-reward Bellman equation is unique up to an additive constant (Puterman, 1994). This result is largely due to the nice linear structure of the non-robust average-reward Bellman equation. However, in the robust setting, where the worst-case performance is of interest, the robust average-reward Bellman equation is nonlinear, which poses a significant challenge in characterizing its solution. We demonstrate that the worst-case transition kernel may not be unique and then show that any solution to the robust average-reward Bellman equation is the relative value function for some worst-case transition kernel (up to an additive constant). We note that this structure is of key importance later in our convergence analysis. We develop model-free approaches for robust policy evaluation and optimal control. We then design model-free algorithms for robust average-reward RL. To ensure stability of the iterates, we adopt a similar idea as in the nonrobust average-reward setting (Puterman, 1994; Bertsekas, 2011), which is to subtract an offset function and design robust RVI TD and Q-learning algorithms. Unlike the convergence proof for model-free algorithms for non-robust average-reward MDPs, e.g., (Wan et al., 2021; Abounadi et al., 2001) and robust discounted MDPs, e.g., (Wang & Zou, 2021; Liu et al., 2022), where the Bellman operator is either linear or a contraction, our robust average-reward Bellman operator is neither linear nor a contraction, which makes the convergence analysis challenging. 
Nevertheless, based on the solution structure of the robust average-reward Bellman equation discussed above, we prove that our algorithms converge to solutions to the robust average-reward Bellman equations. Specifically, for robust RVI TD, its output converges to the worst-case average-reward; and for the robust RVI Q-learning, the greedy policy w.r.t. the Qfunction converges to an optimal robust policy. We construct unbiased estimators with bounded variance for robust Bellman operators under various uncertainty set models. One popular approach in RL is bootstrapping, i.e., use HQ as an estimate of the true Q-function, where H is the (robust) Bellman operator. In the non-robust setting, an unbiased estimate of H can be easily derived because H is linear in the transition kernel. This linearity, however, does not hold in the robust setting since we minimize over the transition kernel to account for the worst-case and, therefore, directly plugging in the empirical transition kernel results in a biased estimate. In this paper, we employ the multi-level Monte-Carlo method (Blanchet & Glynn, 2015) and construct unbiased estimators for five popular uncertainty models, including those defined by contamination models, total variation, Chi-squared divergence, Kullback Leibler (KL) divergence, and Wasserstein distance. We then prove that our estimator is unbiased and has a bounded variance. These properties play an important role in establishing the convergence of our algorithms. 1.2. Related Work Robust average-reward MDPs. Studies on robust averagereward MDPs are quite limited in the literature. Modelbased robust average-reward MDPs were first studied in (Tewari & Bartlett, 2007), where the authors showed the existence of the Blackwell optimal policy for a specific uncertainty set of bounded parameters. Results for general uncertainty sets were developed in recent work (Wang et al., 2023), following a fundamentally different approach from (Tewari & Bartlett, 2007). However, these two methods are model-based. The work of (Lim et al., 2013) designed a model-free algorithm for the uncertainty set defined by total variation and characterized its regret bound. However, their method cannot be generalized to other uncertainty sets. In this paper, we design the first model-free method for general uncertainty sets and provide fundamental insights into the model-free robust average-reward RL problem. Robust discounted-reward MDPs. Model-based methods for robust discounted MDPs were studied in (Iyengar, 2005; Nilim & El Ghaoui, 2004; Bagnell et al., 2001; Satia & Lave Jr, 1973; Wiesemann et al., 2013; Lim & Autef, 2019; Xu & Mannor, 2010; Yu & Xu, 2015; Lim et al., 2013; Tamar et al., 2014; Neufeld & Sester, 2022), where the uncertainty set is assumed to be known, and the problem can be solved using robust dynamic programming. Later, the studies were generalized to the model-free setting (Roy et al., 2017; Badrinath & Kalathil, 2021; Wang & Zou, 2021; 2022; Tessler et al., 2019; Liu et al., 2022; Zhou et al., 2021; Yang et al., 2021; Panaganti & Kalathil, 2021; Goyal & Grand Clement, 2018; Kaufman & Schaefer, 2013; Ho et al., 2018; 2021). Our work focuses on the average-reward setting. First, the robust average-reward Bellman operator does not come with a simple contraction property like the one for the discounted setting, and the average-reward depends on the limiting behavior of the underlying MDP. 
Moreover, the average-reward Bellman function is a function of two variables: the average-reward and the relative value function, whereas its discounted-reward counterpart is a function of only the value function. These challenges make the robust Model-Free Robust Average-Reward Reinforcement Learning average-reward problem more intricate. Furthermore, existing studies mostly focus on a certain type of uncertainty set, e.g., contamination (Wang & Zou, 2021) and total variation (Panaganti et al., 2022); in this paper, our method can be used to solve a wide range of uncertainty sets. Non-robust average-reward MDPs. Early contributions to non-robust average-reward MDPs include a fundamental characterization of the problem and model-based methods (Puterman, 1994; Bertsekas, 2011). Model-free methods in the tabular setting, e.g., RVI Q-learning (Abounadi et al., 2001) and differential Q-learning (Wan et al., 2021; Wan & Sutton, 2022), were developed recently and are both shown to converge to the optimal average-reward. There is also work on average-reward RL with function approximation, e.g., (Zhang et al., 2021b; Tsitsiklis & Van Roy, 1999; Zhang et al., 2021a; Yu & Bertsekas, 2009). In this paper, we focus on the robust setting, where the key challenge lies in the non-linearity of the robust average-reward Bellman equation, whereas it is linear in the non-robust setting. 2. Preliminaries and Problem Formulation Average-reward MDPs. An MDP (S, A, P, r) is specified by: a state space S, an action space A, a transition kernel P = {Pa s (S), a A, s S}1, where Pa s is the distribution of the next state over S upon taking action a in state s, and a reward function r : S A [0, 1]. At each time step t, the agent at state st takes an action at, the environment then transitions to the next state st+1 Pat st , and provides a reward signal rt [0, 1]. A stationary policy π : S (A) maps a state to a distribution over A, following which the agent takes action a at state s with probability π(a|s). Under a transition kernel P, the average-reward of π starting from s S is defined as gπ P(s) lim T Eπ,P n=0 rt|S0 = s . (1) The relative value function is defined to measure the cumulative difference between the reward and gπ P: V π P (s) Eπ,P t=0 (rt gπ P)|S0 = s . (2) Then (gπ P, V π P ) satisfies the following Bellman equation (Puterman, 1994): V π P (s) = Eπ,P r(s, A) gπ P(s) + X s S p A s,s V π P (s ) . (3) 1 (S) denotes the (|S| 1)-dimensional probability simplex on S. Robust average-reward MDPs. For robust MDPs, the transition kernel is assumed to be in some uncertainty set P. At each time step, the environment transits to the next state according to an arbitrary transition kernel P P. In this paper, we focus on the (s, a)-rectangular uncertainty set (Nilim & El Ghaoui, 2004; Iyengar, 2005), i.e., P = N s,a Pa s, where Pa s (S). Popular uncertainty sets include those defined by the contamination model (Huber, 1965; Wang & Zou, 2022), total variation (Lim et al., 2013), Chi-squared divergence (Iyengar, 2005), Kullback-Leibler (KL) divergence (Hu & Hong, 2013) and Wasserstein distance (Gao & Kleywegt, 2022). We will investigate these uncertainty sets in detail in Section 5. We investigate the worst-case average-reward over the uncertainty set of MDPs. Specifically, define the robust averagereward of a policy π as gπ P(s) min κ N n 0 P lim T Eπ,κ t=0 rt|S0 = s where κ = (P0, P1...) N n 0 P. 
It was shown in (Wang et al., 2023) that the worst case under the time-varying model is equivalent to the one under the stationary model: gπ P(s) = min P P lim T Eπ,P t=0 rt|S0 = s Therefore, we limit our focus to the stationary model. We refer to the minimizers of (5) as the worst-case transition kernels for the policy π, and denote the set of all possible worst-case transition kernels by Ωπ g , i.e., Ωπ g {P P : gπ P = gπ P}. As shown in Example A.1 in the appendix, the worst-case transition kernel may not be unique. We focus on the model-free setting, where only samples from the nominal MDP (centroid of the uncertainty set) are available. We investigate two problems: 1) given a policy π, estimate its robust average-reward gπ P, and 2) find an optimal robust policy that optimizes the robust average-reward: max π gπ P(s), for any s S. (6) We denote by g P(s) maxπ gπ P(s) the optimal robust average-reward. 3. Robust RVI TD for Policy Evaluation In this section, we study the problem of policy evaluation, which aims to estimate the robust average-reward gπ P for a fixed policy π. For technical convenience, we make the following assumption to guarantee that the average-reward is independent of the initial state (Abounadi et al., 2001; Wan et al., 2021; Zhang et al., 2021a; Zhang & Ross, 2021; Chen et al., 2022; Wang et al., 2023). Model-Free Robust Average-Reward Reinforcement Learning Assumption 1. The Markov chain induced by π is a unichain for all P P. In general, the average-reward depends on the initial state. For example, imagine a policy that induces a multichain in an MDP with two closed communicating classes. A learning algorithm would be able to learn the average-reward for each communicating class; however, the average-rewards for the two classes may be different. To remove this complexity, it is common and convenient to rule out this possibility. Under Assumption 1, the average-reward w.r.t. any P P is identical for any start state, i.e., gπ P(s) = gπ P(s ), s, s S. We first revisit the robust average-reward Bellman equation in (Wang et al., 2023), and further characterize the structure of its solutions. For V R|S|, denote by σPas (V ) minp Pas p V , PV (s, a) arg minp Pas p V and let PV = {PV (s, a), s S, a A}. Theorem 3.1 (Robust Bellman equation). If (g, V ) is a solution to the robust Bellman equation a π(a|s)(r(s, a) g + σPas (V )), s, (7) then 1) g = gπ P (Wang et al., 2023); 2) PV Ωπ g ; 3) V = V π PV + ce for some c R, where e denotes the vector (1, 1, ..., 1) R|S|. The robust Bellman equation in (7) was initially developed in (Wang et al., 2023). The authors also defined a robust relative value function: V π P (s) min P P Eπ,P n=0 (rn gπ P)|S0 = s which is the worst-case relative value function, and showed that (gπ P, V π P ) is a solution to (7). However, conversely, it may not be the case that any solution (g, V ) to (7) can be written as V = V π P + ce for some c R. This is in contrast to the results for the non-robust average-reward setting, where the set of solutions to the non-robust averagereward Bellman equation can be written as {(gπ P, V π P + ce) : c R} (Puterman, 1994). In Theorem 3.1, we show that for any solution (g, V ) to (7), the transition kernel PV Ωπ g , i.e., it is a worst-case transition kernel for gπ P. Moreover, any solution to (7) can be written as V = V π PV + ce for some constant c. This is different from the non-robust setting, as V π PV actually depends on V . 
As will be seen later, this result is crucial to establish the convergence of our robust RVI TD. Theorem 3.1 also reveals the fundamental difference between the robust and the non-robust average-reward settings. Under the non-robust setting, the solution set to the Bellman equation can be written as {(gπ P, V π P + ce) : c R}. The solution is uniquely determined by the transition kernel (up to some constant vector ke). In contrast, in the robust setting, the robust Bellman equation is no longer linear. Any solution V to (7) is a relative value function w.r.t. some worst-case transition kernel P Ωπ g (up to some additive constant vector), i.e., V {V π P + ce : P Ωπ g , c R}. A natural question that arises is whether, for any P Ωπ g , (gπ P, V π P ) is a solution to (7)? Lemma 3.1 refutes this. Lemma 3.1. There exists a robust MDP such that for some P Ωπ g , (gπ P, V π P ) is not a solution to (7). Lemma 3.1 implies that the solution set to (7) is a subset of {V π P + ce, P Ωπ g , c R}. Note that explicit characterization of the solution set to (7) is challenging due to its non-linear structure; however, result 3 in Theorem 3.1 suffices for the convergence proof (see appendix for proofs). Motivated by the robust Bellman equation in (7), we propose a model-free robust RVI TD algorithm in Algorithm 1, where ˆT and function f will be discussed later. Algorithm 1 Robust RVI TD Input: V0, αn, n = 0, 1, ..., N 1 1: for n = 0, 1, ..., N 1 do 2: for all s S do 3: Vn+1(s) Vn(s)+αn(ˆTVn(s) f(Vn) Vn(s)) 4: end for 5: end for Note that (7) can be written as V = TV g, where T is the robust average-reward Bellman operator. Since in the model-free setting P is unknown, in Algorithm 1, we construct ˆTV as an estimate of TV satisfying E[ˆTV ] = TV, Var[ˆTV (s)] C(1 + V 2), (9) for some constant C > 0. In this paper, if not specified, denotes the infinity norm . It is challenging to construct such ˆT as T is non-linear in the nominal transition kernel from which samples are generated. In Section 5, we will present in detail how to construct such ˆT for various uncertainty set models. To make the iterates stable, we follow the idea of RVI in the non-robust setting and introduce an offset function f satisfying the following assumption (Puterman, 1994). Assumption 2. f : R|S| R is Lf-Lipschitz and satisfies f(e) = 1, f(x + ce) = f(x) + c, f(cx) = cf(x), c R. Assumption 2 can be easily satisfied, e.g., f(V ) = V (s0) for some reference state s0 S, and f(V ) = |S| (Abounadi et al., 2001). Compared with the discounted setting, f is critical here. As we discussed above, in the average-reward setting, the solution to the Bellman equation Model-Free Robust Average-Reward Reinforcement Learning V + ce can be arbitrarily large because c can be any real number. This may lead to a non-convergent sequence Vn (see, e.g., example 8.5.2 of (Puterman, 1994)). Hence, a function f is introduced to "offset" Vn and keep the iterates stable. Also, f(Vn) serves as an estimator of the averagereward gπ P, as we shall see later. We then assume the Robbins-Monro condition on the stepsize, and further show the convergence of robust RVI TD. Assumption 3. The stepsize {αn} n=0 satisfies the Robbins Monro condition, i.e., P n=0 αn = , P n=0 α2 n < . Theorem 3.2 (Convergence of robust RVI TD). Under Assumptions 1,2,3, and if ˆT satisfies (9), then almost surely, (f(Vn), Vn) converges to a solution to (7) which may depend on the initialization. 
The result implies that f(Vn) gπ P a.s., which means our robust RVI TD converges to the worst-case average-reward for the given policy π. Remark 3.1. Our robust RVI TD algorithm is shown to converge to a solution to (7). This model-free result is the same as the model-based results in (Wang et al., 2023). Though it was shown in (Wang et al., 2023) that (gπ P, V π P ) (defined in (8)) is a solution to (7), it is not guaranteed that the convergence is to (gπ P, V π P + ce) for some c. Remark 3.2. As we discussed following the statement of Theorem 3.1, result 3) of Theorem 3.1 is crucial to the convergence proof of Theorem 3.2. Specifically, it is necessary in order to characterize the equilibrium of the associated ODE, and thus the limit of the iterates f(Vn) gπ P. 4. Robust RVI Q-Learning for Control In this section, we study the problem of optimal control, which aims to find a policy that optimizes the robust averagereward: π = arg maxπ gπ P. Similar to Assumption 1, we make the following assumption to guarantee that the average-reward is independent of the initial state (Abounadi et al., 2001; Li et al., 2022). Assumption 4. The Markov chain induced by any P P and any π is a unichain. We first revisit the optimal robust Bellman equation in (Wang et al., 2023) and further present a characterization of its solutions. Consider a Q-function Q : S A R, and define VQ(s) = maxa Q(s, a), s S. Lemma 4.1 (Optimal robust Bellman equation). If (g, Q) is a solution to the optimal robust Bellman equation Q(s, a) = r(s, a) g + σPa s (VQ), s, a, (10) then 1) g = g P (Wang et al., 2023); 2) the greedy policy w.r.t. Q: πQ(s) = arg maxa Q(s, a) is an optimal robust policy (Wang et al., 2023); 3) VQ = V πQ P + ce for some P ΩπQ g , c R. According to result 2 in Lemma 4.1, finding a solution to (10) is sufficient to get the optimal robust average-reward and to derive the optimal robust policy. We note that a complete characterization of the solution set to (10) can be obtained similarly to the one in result 3 of Theorem 3.1. Here, we only provide its structure to simplify the presentation and avoid cumbersome notation. This structure as outlined in Lemma 4.1 is sufficient for our convergence analysis. We hence present the following model-free robust RVI Qlearning algorithm. Algorithm 2 Robust RVI Q-learning Input: Q0, αn 1: for n = 0, ..., N 1 do 2: for all s S, a A do 3: Qn+1(s, a) Qn(s, a) + αn ˆHQn(s, a) f(Qn) Qn(s, a) 4: end for 5: end for Similar to the robust RVI TD algorithm, denote the optimal robust Bellman operator by HQ(s, a) r(s, a)+σPas (VQ), and we construct an estimate ˆH such that for some finite constant C, E[ ˆHQ] = HQ, Var[ ˆHQ(s, a)] C(1 + Q 2). (11) In Section 5, we will present in detail how to construct such ˆH for various uncertainty set models. The following theorem shows the convergence of the robust RVI Q-learning to the optimal robust average-reward g P and the optimal robust policy π . Theorem 4.1 (Convergence of robust RVI Q-learning). Under Assumptions 2, 3 and 4, and if ˆH satisfies (11), then almost surely, (f(Qn), Qn) converges to a solution to (10), i.e., f(Qn) converges to g P, and the greedy policy πQn(s) arg maxa Qn(s, a) converges to an optimal robust average-reward π . 5. Case Studies In the previous two sections, we showed that if an unbiased estimator with bounded variance is available for the robust Bellman operator, then both robust algorithms proposed converge to the optimum. 
In this section, we present the design of these estimators for various uncertainty set models. The major challenge in designing the estimated operators satisfying (9) and (11) lies in estimating the support function σPa s (V ) unbiasedly with bounded variance using samples from the nominal transition kernel. However, the nominal transition kernel P in general is different from the worst-case Model-Free Robust Average-Reward Reinforcement Learning transition kernel, and the straightforward estimator is shown to be biased. Specifically, if we centered at the empirical transition kernel ˆP and construct the uncertainty set ˆP, then the estimator is biased: E[σˆPas (V )] = σPas (V ). We consider several widely-used uncertainty models including the contamination model, the total variation model, the Chi-square model, the KL-divergence model, and the Wasserstein distance model. We show that our estimators are unbiased and have bounded variance in the following theorem. We will present the design in later sections. Theorem 5.1. For each uncertainty set, denote its corresponding estimators by ˆT and ˆH as in Sections 5.1 and 5.2. Then, there exists some constant C, such that (9) and (11) hold. In the following sections, we construct an operator ˆσPas to estimate the support function σPas , s S, a A for each uncertainty set. We further define the estimated robust Bellman operators as ˆTV (s) P a π(a|s)(r(s, a) + ˆσPas (V )) and ˆHQ(s, a) r(s, a) + ˆσPas (VQ). 5.1. Linear Model: Contamination Uncertainty Set The δ-contamination uncertainty set is Pa s = {(1 δ)Pa s + δq : q (S)}, where 0 < δ < 1 is the radius. Under this uncertainty set, the support function can be computed as σPas (V ) = (1 δ)Pa s V + δ mins V (s), and this is linear in the nominal transition kernel Pa s. We hence use the transition to the subsequent state to construct our estimator: ˆσPas (V ) (1 δ)γV (s ) + δ min x V (x), (12) where s is a subsequent state sample after (s, a). 5.2. Non-Linear Models Unlike the δ-contamination model, most uncertainty sets result in a non-linear support function of the nominal transition kernel. We will employ the approach of multi-level Monte-Carlo which is widely used in quantile estimation under stochastic environments (Blanchet & Glynn, 2015; Blanchet et al., 2019; Wang & Wang, 2022) to construct an unbiased estimator with bounded variance. For any s, a, we first generate N according to a geometric distribution with parameter Ψ (0, 1). Then, we take action a at state s for 2N+1 times, and observe r(s, a) and the subsequent state {s i}, i = 1, ..., 2N+1. We divide these 2N+1 samples into two groups: samples with odd indices, and samples with even indices. We then individually calculate the empirical distribution of s using the even-index samples, odd-index ones, all the samples, and the first sample: ˆPa,E s,N+1 = 1 2N P2N i=1 1s 2i, ˆPa,O s,N+1 = 1 2N P2N i=1 1s 2i 1, ˆPa s,N+1 = 1 2N+1 P2N+1 i=1 1s i, ˆPa,1 s,N+1 = 1s 1. Then, we use these estimated transition kernels as nominal kernels to construct four estimated uncertainty sets ˆPa,E s,N+1, ˆPa,O s,N+1, ˆPa s,N+1, ˆPa,1 s,N+1. The multi-level estimator is then defined as ˆσPa s (V ) σˆPa,1 s,N+1(V ) + N(V ) where p N = Ψ(1 Ψ)N and N(V ) σˆPa s,N+1(V ) σˆPa,E s,N+1(V ) + σˆPa,O s,N+1(V ) We note that in previous results of the multi-level Monte Carlo estimator (Blanchet & Glynn, 2015; Blanchet et al., 2019; Wang & Wang, 2022), several assumptions are needed to show that the estimator is unbiased. 
These assumptions, however, do not hold in our cases. For example, the function σP(V ) is not continuously differentiable. Hence, their analysis cannot be directly applied here. We then present four examples of non-linear uncertainty sets. Under each example, a solution to the support function σP(V ) is given, and by plugging it into (13) the unbiased estimator can then be constructed. More details can be found in Appendix D and Appendix E. Total Variation Uncertainty Set. The total variation uncertainty set is Pa s = {q (|S|) : 1 2 q Pa s 1 δ}, and the support function can be computed using its dual function (Iyengar, 2005): σPas (V ) = max µ 0 Pa s(V µ) δSpan(V µ) . (14) Chi-square Uncertainty Set. The Chi-square uncertainty set is Pa s = {q (|S|) : dc(Pa s, q) δ}, where dc(q, p) = P s (p(s) q(s))2 p(s) . Its support function can be computed using its dual function (Iyengar, 2005): σPas (V ) = max µ 0 Pa s(V µ) q δVar Pas (V µ) . (15) Kullback Leibler (KL) Divergence Uncertainty Set. The KL-divergence between two distributions p, q is defined as DKL(q||p) = P s q(s) log q(s) p(s), and the uncertainty set defined via KL divergence is Pa s = {q : DKL(q||Pa s) δ} , s S, a A. (16) Its support function can be efficiently solved using the duality result in (Hu & Hong, 2013): σPas (V ) = min α 0 δα + α log EPas e V The above estimator for the KL-divergence uncertainty set has also been developed in (Liu et al., 2022) for robust discounted MDPs. Its extension to our average-reward setting is similar. Model-Free Robust Average-Reward Reinforcement Learning Wasserstein Distance Uncertainty Sets. Consider the metric space (S, d) by defining some distance metric d. For some parameter l [1, ) and two distributions p, q (S), define the l-Wasserstein distance between them as Wl(q, p) = infµ Γ (p,q) d µ,l, where Γ(p, q) denotes the distributions over S S with marginal distributions p, q, and d µ,l = E(X,Y ) µ d(X, Y )l 1/l. The Wasserstein distance uncertainty set is then defined as Pa s = {q (|S|) : Wl(Pa s, q) δ} . (18) To solve the support function w.r.t. the Wasserstein distance set, we first prove the following duality lemma. Lemma 5.1. It holds that σPas (V ) = sup λ 0 λδl + EPas inf y V (y) + λd(S, y)l . Thus, the support function can be solved using its dual form, and the estimator can then be constructed following (13). 6. Experiments We numerically verify our previous convergence results and demonstrate the robustness of our algorithms. Additional experiments can be found in Appendix G. 6.1. Convergence of Robust RVI TD and Q-Learning We first verify the convergence of our robust RVI TD and Q-learning algorithms under a Garnet problem G(30, 20) (Archibald et al., 1995). There are 30 states and 20 actions. The nominal transition kernel P = {Pa s, s S, a A} is randomly generated by a normal distribution: Pa s N(1, σa s) and then normalized, and the reward function r(s, a) N(1, µa s), where µa s, σa s Uniform[0, 100]. We set δ = 0.4, αn = 0.01, f(V ) = |S||A| . Due to the space limit, we only show the results under the Chi-square and Wasserstein Distance models. The results under the other three uncertainty sets are presented in Appendix G. For policy evaluation, we evaluate the robust average-reward of the uniform policy π(a|s) = 1 |A|. We implement our robust RVI TD algorithm under different uncertainty models. We run the algorithm independently for 30 times and plot the average value of f(V ) over all 30 trajectories. 
We also plot the 95th and 5th percentiles of the 30 curves as the upper and lower envelopes of the curves. To compare, we plot the true robust average-reward computed using the model-based robust value iteration method in (Wang et al., 2023). It can be seen from the results in Figure 1 that our robust RVI TD algorithm converges to the true robust average-reward value. Figure 1: Robust RVI TD Algorithm. We then consider policy optimization. We run our robust RVI Q-learning independently for 30 times. The curves in Figure 2 show the average value of f(Q) over 30 trajectories, and the upper/lower envelopes are the 95/5 percentiles. We also plot the optimal robust average-reward g P computed by the model-based RVI method in (Wang et al., 2023). Our robust RVI Q-learning converges to the optimal robust average-reward g P under each uncertainty set, which verifies our theoretical results. Figure 2: Robust RVI Q-Learning Algorithm. 6.2. Robustness of Robust RVI Q-Learning We then demonstrate the robustness of our robust RVI Qlearning by showing that our method achieves a higher average-reward when there is model deviation between training and evaluation. 6.2.1. RECYCLING ROBOT We first consider the recycling robot problem (Example 3.3 (Sutton & Barto, 2018)). A mobile robot running on a rechargeable battery aims to collect empty soda cans. It has 2 battery levels: low and high. The robot can either 1) search for empty cans; 2) remain stationary and wait for someone to bring it a can; 3) go back to its home base to recharge. Under low (high) battery level, the robot finds an empty can with probabilities α (β), and remains at the same battery level. If the robot goes out to search but finds nothing, it will run out of its battery and can only be carried back by human. More details can be found in (Sutton & Barto, 2018). In this experiment, the uncertainty lies in the probabilities α, β of finding an empty can if the robot chooses the action search . We set δ = 0.4 and implement our algorithms and vanilla Q-learning under the nominal environment (α = β = 0.5) with stepsize 0.01. To show the Model-Free Robust Average-Reward Reinforcement Learning difference among the policies that the algorithms learned, we plot the difference of Q values at low battery level in Figure 3(a). In the low battery level, the robust algorithms find conservative policies which choose to wait instead of search, whereas the vanilla Q-learning finds a policy that chooses to search. To test the robustness of the obtained policies, we evaluate the average-reward of the learned policies in perturbed environments. Specifically, let x denote the amplitude of the perturbation. Then, we estimate the worst performance of the two policies over the testing uncertainty set (0.5 x, 0.5 + x), and plot them in Figure 3(b). It can be seen that when the perturbation is small, the true worst-case kernels (w.r.t. δ during training) are far from the testing environment, and hence the vanilla Q-learning has a higher reward; however, as the perturbation level becomes larger, the testing environment gets closer to the worst-case kernels, and then our robust algorithms perform better. It can be seen that the performance of Q-learning decreases rapidly while our robust algorithm is stable and outperforms the non-robust Q-learning. This implies that our algorithm is robust to the model uncertainty. (a) Q(low,wait)-Q(low,search) (b) Perturbed environment Figure 3: Recycling Robot. 6.2.2. 
INVENTORY CONTROL PROBLEM We now consider the supply chain problem (Giannoccaro & Pontrandolfo, 2002; Kemmer et al., 2018; Liu et al., 2022). At the beginning of each month, the manager of a warehouse inspects the current inventory of a product. Based on the current stock, the manager decides whether or not to order additional stock from a supplier. During this month, if the customer demand is satisfied, the warehouse can make a sale and obtain profits; but if not, the warehouse will obtain a penalty associated with being unable to satisfy customer demand for the product. The warehouse also needs to pay the holding cost for the remaining stock and new items ordered. The goal is to maximize the average profit. We let st denote the inventory at the beginning of the t-th month, Dt be a random demand during this month, and at be the number of units ordered by the manager. We assume that Dt follows some distribution and is independent over time. When the agent takes action at, the order cost is at, and the holding cost is 3 (st + at). If the demand Dt st + at, then selling the item brings 5 Dt in total; but if the demand Dt > st + at, then there will not be any sale and a penalty of 15 will be received. We set S = {0, 1, ..., 16} and A = {0, ..., 8}. We first set δ = 0.4 and αt = 0.01, and implement our algorithms and vanilla Q-learning under the nominal environment where Dt Uniform(0, 16) is generated following the uniform distribution. To verify the robustness, we test the obtained policies under different perturbed environments. More specifically, we perturb the distribution of the demand to Dt U(m,b), where U(m,b)(x) = ( 1 |S| + b |S| 2 2|S| , if x {m, m + 1}, 1 b |S| , else. The results are plotted in Figure 4. We first fix m = 0 and plot the performance under different values of b in Figure 4(a), then we fix b = 0.25 and plot the performance under different values of m in Figure 4(b). As the results show, when b is small, i.e., the perturbation of the environment is small, the non-robust Q-learning obtains higher reward than our robust methods; as b becomes larger, the performance of the non-robust method decreases rapidly, while our robust methods are more robust and outperform the non-robust one. When b is fixed, our robust methods outperform the non-robust Q-learning, which also demonstrates the robustness of our methods. (b) b = 0.25 Figure 4: Inventory Control. 7. Conclusion In this paper, we developed the first model-free algorithms with provable convergence and optimality guarantee for robust average-reward RL under a broad range of uncertainty set models. We characterized the fundamental structure of solutions to the robust average-reward Bellman equation, which is crucial for the convergence analysis. We designed model-free robust algorithms based on the ideas of relative value iteration for non-robust average-reward MDPs and the robust average-reward Bellman equation. We developed concrete solutions to five popular uncertainty sets, where we generalized the idea of multi-level Monte-Carlo and constructed an unbiased estimate of the non-linear robust average-reward Bellman operator. Model-Free Robust Average-Reward Reinforcement Learning 8. Acknowledgments This work is supported by the National Science Foundation under Grants CCF-2106560, CCF-2007783, CCF-2106339, and CCF-1552497. This material is based upon work supported under the AI Research Institutes program by National Science Foundation and the Institute of Education Sciences, U.S. 
Department of Education through Award # 2229873 - National AI Institute for Exceptional Education. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the Institute of Education Sciences, or the U.S. Department of Education. Abounadi, J., Bertsekas, D., and Borkar, V. S. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3): 681 698, 2001. Archibald, T., Mc Kinnon, K., and Thomas, L. On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3):354 361, 1995. Badrinath, K. P. and Kalathil, D. Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In Proc. International Conference on Machine Learning (ICML), pp. 511 520. PMLR, 2021. Bagnell, J. A., Ng, A. Y., and Schneider, J. G. Solving uncertain Markov decision processes. 2001. Bertsekas, D. P. Dynamic Programming and Optimal Control 3rd edition, volume II. Belmont, MA: Athena Scientific, 2011. Blanchet, J. H. and Glynn, P. W. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pp. 3656 3667. IEEE, 2015. Blanchet, J. H., Glynn, P. W., and Pei, Y. Unbiased multilevel Monte Carlo: Stochastic optimization, steadystate simulation, quantiles, and other applications. ar Xiv preprint ar Xiv:1904.09929, 2019. Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009. Borkar, V. S. and Soumyanatha, K. An analog scheme for fixed point computation. i. theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 44(4):351 355, 1997. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Open AI Gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Chen, L., Jain, R., and Luo, H. Learning infinite-horizon average-reward Markov decision processes with constraints. ar Xiv preprint ar Xiv:2202.00150, 2022. Gao, R. and Kleywegt, A. Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research, 2022. Giannoccaro, I. and Pontrandolfo, P. Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2): 153 161, 2002. Goyal, V. and Grand-Clement, J. Robust Markov decision process: Beyond rectangularity. ar Xiv preprint ar Xiv:1811.00215, 2018. Ho, C. P., Petrik, M., and Wiesemann, W. Fast Bellman updates for robust MDPs. In Proc. International Conference on Machine Learning (ICML), pp. 1979 1988. PMLR, 2018. Ho, C. P., Petrik, M., and Wiesemann, W. Partial policy iteration for L1-robust Markov decision processes. Journal of Machine Learning Research, 22(275):1 46, 2021. Hu, Z. and Hong, L. J. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, pp. 1695 1724, 2013. Huber, P. J. A robust version of the probability ratio test. Ann. Math. Statist., 36:1753 1758, 1965. Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. Kaufman, D. L. and Schaefer, A. J. Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396 410, 2013. Kazemi, M., Perez, M., Somenzi, F., Soudjani, S., Trivedi, A., and Velasquez, A. 
Translating omega-regular specifications to average objectives for model-free reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 732 741, 2022. Kemmer, L., von Kleist, H., de Rochebouët, D., Tziortziotis, N., and Read, J. Reinforcement learning for supply chain optimization. In European Workshop on Reinforcement Learning, volume 14, 2018. Li, T., Wu, F., and Lan, G. Stochastic first-order methods for average-reward Markov decision processes. ar Xiv preprint ar Xiv:2205.05800, 2022. Model-Free Robust Average-Reward Reinforcement Learning Lim, S. H. and Autef, A. Kernel-based reinforcement learning in robust Markov decision processes. In Proc. International Conference on Machine Learning (ICML), pp. 3973 3981. PMLR, 2019. Lim, S. H., Xu, H., and Mannor, S. Reinforcement learning in robust Markov decision processes. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 701 709, 2013. Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. Distributionally robust Q-learning. In Proc. International Conference on Machine Learning (ICML), pp. 13623 13643. PMLR, 2022. Neufeld, A. and Sester, J. Robust Q-learning algorithm for Markov decision processes under Wasserstein uncertainty. ar Xiv preprint ar Xiv:2210.00898, 2022. Nilim, A. and El Ghaoui, L. Robustness in Markov decision problems with uncertain transition matrices. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 839 846, 2004. Panaganti, K. and Kalathil, D. Sample complexity of robust reinforcement learning with a generative model. ar Xiv preprint ar Xiv:2112.01506, 2021. Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. Robust reinforcement learning using offline data. ar Xiv preprint ar Xiv:2208.05129, 2022. Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming, 1994. Roy, A., Xu, H., and Pokutta, S. Reinforcement learning under model mismatch. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 3046 3055, 2017. Satia, J. K. and Lave Jr, R. E. Markovian decision processes with uncertain transition probabilities. Operations Research, 21(3):728 740, 1973. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 2018. Tamar, A., Mannor, S., and Xu, H. Scaling up robust MDPs using function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 181 189. PMLR, 2014. Tessler, C., Efroni, Y., and Mannor, S. Action robust reinforcement learning and applications in continuous control. In Proc. International Conference on Machine Learning (ICML), pp. 6215 6224. PMLR, 2019. Tewari, A. and Bartlett, P. L. Bounded parameter Markov decision processes with average reward criterion. In International Conference on Computational Learning Theory, pp. 263 277. Springer, 2007. Tsitsiklis, J. N. and Van Roy, B. Average cost temporaldifference learning. Automatica, 35(11):1799 1808, 1999. Wan, Y. and Sutton, R. S. On convergence of average-reward off-policy control algorithms in weakly-communicating MDPs. ar Xiv preprint ar Xiv:2209.15141, 2022. Wan, Y., Naik, A., and Sutton, R. S. Learning and planning in average-reward Markov decision processes. In Proc. International Conference on Machine Learning (ICML), pp. 10653 10662. PMLR, 2021. Wang, G. and Wang, T. Unbiased multilevel Monte Carlo methods for intractable distributions: Mlmc meets mcmc. ar Xiv preprint ar Xiv:2204.04808, 2022. Wang, Y. 
and Zou, S. Online robust reinforcement learning with model uncertainty. In Proc. Advances in Neural Information Processing Systems (Neur IPS), 2021. Wang, Y. and Zou, S. Policy gradient method for robust reinforcement learning. In Proc. International Conference on Machine Learning (ICML), volume 162, pp. 23484 23526. PMLR, 2022. Wang, Y., Velasquez, A., Atia, G., Prater-Bennette, A., and Zou, S. Robust average-reward Markov decision processes. In Proc. Conference on Artificial Intelligence (AAAI), 2023. Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. Xu, H. and Mannor, S. Distributionally robust Markov decision processes. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 2505 2513, 2010. Yang, W., Zhang, L., and Zhang, Z. Towards theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. ar Xiv preprint ar Xiv:2105.03863, 2021. Yu, H. and Bertsekas, D. P. Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54(7):1515 1531, 2009. Yu, P. and Xu, H. Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538 2543, 2015. Model-Free Robust Average-Reward Reinforcement Learning Zhang, S., Wan, Y., Sutton, R. S., and Whiteson, S. Averagereward off-policy policy evaluation with function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 12578 12588. PMLR, 2021a. Zhang, S., Zhang, Z., and Maguluri, S. T. Finite sample analysis of average-reward TD learning and Q-learning. In Proc. Advances in Neural Information Processing Systems (Neur IPS), volume 34, pp. 1230 1242, 2021b. Zhang, Y. and Ross, K. W. On-policy deep reinforcement learning for the average-reward criterion. In Proc. International Conference on Machine Learning (ICML), pp. 12535 12545. PMLR, 2021. Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Proc. International Conference on Artifical Intelligence and Statistics (AISTATS), pp. 3331 3339. PMLR, 2021. Model-Free Robust Average-Reward Reinforcement Learning A. Proof of Lemma 3.1 We construct the following example. Example A.1. Consider an MDP with 3 states (1,2,3) and only one action a, and set a (s, a)-rectangular uncertainty set P = Pa 1 N Pa 2 N Pa 3 where Pa 1 = {Pa 11, Pa 12}, Pa 2 = {(0, 0, 1) } and Pa 3 = {(0, 1, 0) }, where Pa 11 = (0, 1, 0) , Pa 12 = (0, 0, 1) . Hence, the uncertainty set contains two transition kernels P = {P1, P2}. The reward of each state is set to be r = (r1, r2, r3). The only stationary policy π in this example is π(i) = a, i. Note that this robust MDP is a unichain and hence satisfies Assumption 1 with gπ P1(1) = gπ P1(2) = gπ P1(3), gπ P2(1) = gπ P2(2) = gπ P2(3). Under both transition kernels P1, P2, the average-reward are identical: gπ P1 = gπ P2 = 0.5r2 + 0.5r3. Hence, both P1, P2 are the worst-case transition kernels. According to Section A.5 of (Puterman, 1994), the relative value functions w.r.t. P1, P2 can be computed as V π P1 = r1 1 V π P2 = r1 3 When r3 > r2, only V π P1 is the solution to (7); and when r2 > r3, only V π P2 is the solution to (7). Hence, this proves Lemma 3.1 and implies that not any relative value function w.r.t. a worst-case transition kernel is a solution to (7). B. 
Robust RVI TD Method for Policy Evaluation We define the following notation: a π(a|s)r(s, a), a π(a|s)σPas (V ), σP(V ) (σPs1(V ), σPs2(V ), ..., σPs|S| (V )) R|S|. B.1. Proof of Theorem 3.1 Theorem B.1 (Restatement of Theorem 3.1). If (g, V ) is a solution to the robust Bellman equation a π(a|s)(r(s, a) g + σPas (V )), s, (20) then 1) g = gπ P (Wang et al., 2023); 2) PV Ωπ g ; 3) V = V π PV + ce for some c R. Proof. 1). The robust Bellman equation in (20) can be rewritten as g + V (s) rπ(s) = σPs(V ), s S. (21) From the definition, it follows that σPs(V ) = X a π(a|s) min Pas Pas Pa s V. (22) Hence, for any transition kernel P = (Pa s) N s,a Pa s, g + V (s) rπ(s) X a π(a|s)Pa s V 0, s. (23) Model-Free Robust Average-Reward Reinforcement Learning It can be further rewritten in matrix form as: ge rπ + (Pπ I)V, (24) where Pπ is the state transition matrix induced by π and P, i.e., the s-th row of Pπ is X a π(a|s)Pa s. (25) Note that Pπ has non-negative components since it is a transition matrix. Multiplying by Pπ on both sides, we have that Pπge = ge Pπrπ + Pπ(Pπ I)V, ge (Pπ)2rπ + (Pπ)2(Pπ I)V, ge (Pπ)n 1rπ + (Pπ)n 1(Pπ I)V. (26) Now, by summing up all these inequalities in (24) and (26), we have that i=0 (Pπ)irπ + ((Pπ)n I)V, (27) ge Pn 1 i=0 (Pπ)irπ n + ((Pπ)n I)V Let n , and we have that Pn 1 i=0 (Pπ)irπ n + lim n ((Pπ)n I)V n = gπ Pe, (29) where the last inequality is from the definition of gπ P and the fact that limn ((Pπ)n I)V n = 0. Hence, g gπ P for any P N Consider the worst-case transition kernel PV of V . The robust Bellman equation can be equivalently rewritten as ge = rπ V + Pπ V V. (30) This means that (g, V ) is a solution to the non-robust Bellman equation for transition kernel PV and policy π: xe = rπ Y + Pπ V Y. (31) Thus, by Thm 8.2.6 from (Puterman, 1994), g = gπ PV , (32) V = V π PV + ce, for some c R. (33) However, note that gπ PV = g gπ P = min P P gπ P gπ PV , (34) gπ PV = g = gπ P. (35) 2). From (35), gπ PV = gπ P . (36) It then follows from the definition of Ωπ g that PV Ωπ g . 3). Since (g, V ) is a solution to the non-robust Bellman equation xe = rπ Y + Pπ V Y, (37) the claim then follows from Theorem 8.2.6 in (Puterman, 1994). Model-Free Robust Average-Reward Reinforcement Learning B.2. Proof of Theorem 3.2 Theorem B.2. (Restatement of Theorem 3.2) Under Assumptions 1,2,3, (f(Vn), Vn) converges to a (possible sample path dependent) solution to (7) a.s.. We first show the stability of the robust RVI TD algorithm in the following lemma. Lemma B.1. Algorithm 1 remains bounded during the update, i.e., sup n Vn < , a.s.. (38) Proof. Denote by h(V ) rπ + σP(V ) f(V )e V. (39) Then the update of robust RVI TD can be rewritten as Vn+1 = Vn + αn(h(Vn) + Mn+1), (40) where Mn+1 ˆTVn rπ σP(V ) is the noise term. Further, define the limit function h : h (V ) lim c h(c V ) Then, from σPas (c V ) = cσPas (V ) and f(c V ) = cf(V ), it follows that h (V ) = lim c rπ c + σP(V ) f(V )e V = σP(V ) f(V )e V. (42) According to Section 2.1 and Section 3.2 of (Borkar, 2009), it suffices to verify the following assumptions: (1). h is Lipschitz; (2). Stepsize αn satisfies Assumption 3; (3). Denoting by Fn the σ-algebra generated by V0, M1, ..., Mn, then E[Mn+1|Fn] = 0, E[ Mn+1 2|Fn] K(1 + Vn 2) for some constant K > 0. (4). h has the origin as its unique globally asymptotically stable equilibrium. 
First, note that h(V1) h(V2) = max s a π(a|s)(σPas (V1) σPas (V2)) (f(V1) f(V2)) (V1(s) V2(s)) a π(a|s)(σPas (V1) σPas (V2)) + |(f(V1) f(V2))| + |(V1(s) V2(s))| (2 + Lf) V1 V2 , (43) where the last inequality follows from the fact that the support function σP( ) is 1-Lipschitz and the assumptions on f in Assumption 2. Thus, h is Lipschitz, which verifies (1). It is straightforward that (3) is satisfied if E[ˆTVn|Fn] = rπ + σP(Vn) and Var[ˆTVn|Fn] K(1 + Vn 2). As discussed in Section 3, we assume the existence of an unbiased estimator ˆT with bounded variance here, and we will construct the estimator in Section 5. Then, it suffices to verify condition (4), i.e., to show that the ODE x(t) = h (x(t)) (44) Model-Free Robust Average-Reward Reinforcement Learning has 0 as its unique globally asymptotically stable equilibrium. Define an operator T0(V )(s) P a π(a|s)σPas (V ). Then, any equilibrium W of (44) satisfies T0(W) f(W)e W = 0. (45) This equation can be further rewritten as a set of equations: ( W =T0(W) ge, g =f(W). (46) The equation in (46) is the robust Bellman equation for a zero-reward robust MDP. Hence, from Theorem 3.1, any solution (g, W) to (46) satisfies g = gπ P, W = V π P + ce, (47) where V π P is the relative value function w.r.t. some worst-case transition kernel P (i.e., gπ P = min P P gπ P), and some c R. Hence, any equilibrium of (44) satisfies W = V π P + ce, f(W) = gπ P. (48) However, note that this robust Bellman equation is for a zero-reward robust MDP, hence for any P, gπ P = lim T EP t=0 (rt gπ P) thus gπ P = 0 and W = ce for some c R. From (48), it follows that f(W) = f(ce) = 0, for any equilibrium W. From Assumption 2, we have that f(ce) = cf(e) = c = 0. This further implies that W = V π P + ce = 0. (51) Thus, the only equilibrium of (44) is 0. We then show that 0 is globally asymptotically stable. Recall that the zero-reward robust Bellman operator T0V (s) = X a π(a|s)(σPas (V )). (52) We further introduce two operators: T 0V T0V f(V )e, (53) T0V T0V gπ Pe. (54) Note that in the zero-reward robust MDP, gπ P = 0 and T0 = T0, but we introduce this notation for future use. Consider the ODEs w.r.t. these two operators: x = T 0x x, (55) y = T0y y. (56) First, it can be easily shown that both T 0 and T0 are Lipschitz with constants 1 + Lf and 1, respectively. Hence, both two ODEs are well-posed. Also, it can be seen that (55) is the same as the ODE in (44). Since the second equation (56) is a non-expansion (Lipschitz with parameter no larger than 1), Theorem 3.1 of (Borkar & Soumyanatha, 1997) implies that any solution y(t) to (56) converges to the set of equilibrium points, i.e., y(t) n W : W = T0W o , a.s.. (57) Model-Free Robust Average-Reward Reinforcement Learning Similar to the discussion for T0, our Theorem 3.1 implies that the set of equilibrium points of (56) is {W = ce : c R}. Hence, for any solution y(t) to (56), y(t) ce for some constant k that may depend on the initial value of y(t). Now, consider the solution x(t) to (55). According to Lemma F.1 (note that T0 here is a special case of T in Lemma F.1 with r = 0), if the solutions x(t), y(t) have the same initial value x(0) = y(0), then x(t) = y(t) + r(t)e, (58) where r(t) is a solution to r(t) = r(t) + gπ P f(y(t)), r(0) = 0. Note that the solution r(t) with r(0) = 0 can be written as 0 e (t s)(gπ P f(y(s)))ds (59) by variation of constants formula (Abounadi et al., 2001). 
If we denote the limit of y(t) by y = ce, then limt r(t) = gπ P f(y ) (Lemma B.4 in (Wan et al., 2021), Theorem 3.4 in (Abounadi et al., 2001)). Hence, x(t) = y(t) + r(t)e converges to y + (gπ P f(y ))e, i.e., x(t) ce f(ce)e = 0. (60) Hence, any solution x(t) to (55) converges to 0, which is its unique equilibrium. This thus implies that 0 is the unique globally asymptotically stable equilibrium. Together with Theorem 3.7 in (Borkar, 2009), it further implies the boundedness of Vn, which completes the proof. We can readily prove Theorem B.2. Proof. In Lemma B.1, we have shown that sup n Vn < , a.s.. (61) Thus, we have verified that conditions (A1-A3) and (A5) in (Borkar, 2009) are satisfied. Lemma 2.1 in (Borkar, 2009) thus implies that it suffices to study the solution to the ODE x(t) = h(x(t)). For the robust Bellman operator TV = rπ + σP(V ), define T V TV f(V )e, (62) TV TV gπ Pe. (63) From Lemma F.1, we know that if x(t), y(t) are the solutions to equations x = T x x, (64) y = Ty y, (65) with the same initial value x(0) = y(0), then x(t) = y(t) + r(t)e, (66) where r(t) satisfies r(t) = r(t) + gπ P f(y(t)), r(0) = 0. (67) Thus, by the variation of constants formula, 0 e (t s)(gπ P f(y(s)))ds. (68) Model-Free Robust Average-Reward Reinforcement Learning Note that T is also non-expansive, hence y(t) converges to some equilibrium of (65) (Theorem 3.1 of (Borkar & Soumyanatha, 1997)). The set of equilibrium points of (65) can be characterized as n W : TW = W o = {W : W = TW gπ Pe} = W : W(s) = X a π(a|s)(r(s, a) gπ P + σPas (W)), s S From Theorem 3.1, any equilibrium of (65) can be rewritten as W = V π P + ce, for some P Ωπ g , c R. (70) Thus, y(t) converges to an equilibrium denoted by y : y(t) y V π P + c e, for some P Ωπ g , c R. (71) Similar to Lemma B.1, it can be showed that r(t) gπ P f(y ) (Lemma B.4 in (Wan et al., 2021), Theorem 3.4 in (Abounadi et al., 2001)). This further implies that x(t) y + (gπ P f(y ))e = V π P + (c + gπ P f(y ))e, (72) and we denote m = c + gπ P f(y ). Moreover, since f is continuous (because it is Lipschitz), we have that f(x(t)) f(V π P + (c + gπ P f(y ))e) = f(V π P ) + c + gπ P f(y ) = f(V π P ) + c + gπ P f(V π P + c e) = f(V π P ) + c + gπ P f(V π P ) c = gπ P. (73) Hence, we show that x(t) V π P + m e, (74) f(x(t)) gπ P. (75) Following Lemma 2.1 from (Borkar, 2009), we conclude that a.s., Vn V π P + m e, (76) f(Vn) gπ P, (77) which completes the proof. C. Robust RVI Q-Learning C.1. Proof of Lemma 4.1 Part of the following theorem is proved in (Wang et al., 2023), but we include the proof for completeness. Theorem C.1 (Restatement of Lemma 4.1). If (g, Q) is a solution to the optimal robust Bellman equation Q(s, a) = r(s, a) g + σPa s (VQ), s, a, (78) then 1) g = g P (Wang et al., 2023); 2) the greedy policy w.r.t. Q: πQ(s) = arg maxa Q(s, a) is an optimal robust policy (Wang et al., 2023); 3) VQ = V πQ P + ce for some P ΩπQ g , c R. Proof. Taking the maximum on both sides of (78) w.r.t. a, we have that max a Q(s, a) = max a {r(s, a) g + σPas (VQ)}, s S. (79) Model-Free Robust Average-Reward Reinforcement Learning This is equivalent to VQ(s) = max a {r(s, a) g + σPas (VQ)}, s S. (80) By Theorem 7 in (Wang et al., 2023), we can show that g = g P, which proves claim (1). Recall that VQ(s) = maxa Q(s, a). It can be also written as a πQ(a|s)Q(s, a). (81) Here, we slightly abuse the notation of πQ, and use πQ(s) and πQ(a|s) interchangeably. 
Then, the optimal robust Bellman equation in (79) can be rewritten as Q(s, πQ(s)) = r(s, πQ(s)) g + σP πQ(s) s a πQ(a| )Q( , a) . (82) Moreover, if we denote by W(s) = Q(s, a) = Q(s, πQ(s)) = maxa Q(s, a), then the equation above is equivalent to a πQ(a|s)(r(s, a) g + σPas (W)). (83) Therefore, (W, g) is a solution to the robust Bellman equation for the policy πQ in Theorem 3.1. By Theorem 3.1, we have that g = gπQ P , (84) W = V πQ P + ce, (85) for some P ΩπQ g and c R. Combining this with the claim (1) implies that πQ is an optimal robust policy. Claims (2) and (3) are thus proved. C.2. Proof of Theorem 4.1 Lemma C.1. If ˆH satisfies that for any Q, s S, a A, E[ ˆHQ(s, a)] = HQ(s, a) and Var( ˆHQ(s, a)) C(1 + Q 2) for some constant C, then under Assumptions 2, 4 and 3, Algorithm 2 remains bounded during the update almost surely, i.e., sup n Qn < , a.s.. (86) Proof. Denote by h(Q) rπ + σP(VQ) f(Q)e Q. (87) Then, the update of robust RVI Q-learning can be rewritten as Qn+1 = Qn + αn(h(Qn) + Mn+1), (88) where Mn+1 ˆHQn rπ σP(VQ) is the noise term. Further, define the limit function h : h (Q) lim c h(c Q) Then, note that σPa s (Vc Q) = σPa s (c VQ) = cσPas (VQ) for c > 0 and f(c Q) = cf(Q). It then follows that h (Q) = lim c rπ c + σP(VQ) f(Q)e Q = σP(VQ) f(Q)e Q. (90) Model-Free Robust Average-Reward Reinforcement Learning Similar to the proof of Lemma B.1, it suffices to verify the following conditions: (1). h is Lipschitz; (2). Stepsize αn satisfies Assumption 3; (3). E[Mn+1|Fn] = 0, and E[ Mn+1 2|Fn] K(1 + Qn 2) for some constant K. (4). h has the origin as its unique globally asymptotically stable equilibrium. Clearly, (2) and (3) can be verified similarly to Lemma B.1. We then verify (1) and (4). Firstly, it can be shown that |h(Q1)(s, a) h(Q2)(s, a)| = σPas (VQ1) f(Q1) Q1(s, a) σPas (VQ2) f(Q2) Q2(s, a) σPas (VQ1) σPas (VQ2) + |f(Q1) f(Q2)| + |Q1(s, a) Q2(s, a)| VQ1 VQ2 + Lf Q1 Q2 + Q1 Q2 (2 + Lf) Q1 Q2 , (91) where the last inequality is from the fact that VQ1 VQ2 Q1 Q2 . This implies that h is Lipschitz. To verify (4), note that the stability equation is X(t) = h (X(t)) = σP(VX(t)) f(X(t))e X(t), (92) where VX(t) is a |S|-dimensional vector with VX(t)(s) = maxa X(t)(s, a). Any equilibrium Q of the stability equation (92) satisfies that Q(s, a) = σPas (VQ) f(Q)e, (93) which can be viewed as an optimal robust Bellman equation (10) with zero reward. Hence, by Lemma 4.1, it implies that f(Q) = g P = 0, (94) VQ = V πQ P + ce for some P ΩπQ g , c R. (95) In the zero-reward MDP, we have that V π P = 0 for any π, P, thus VQ(s) = maxa Q(s, a) = c for any s S. Note that from (93), Q satisfies that Q(s, a) = σPas (VQ) = σPas (ce) = c. (96) Since f(Q) = 0, it implies that f(Q) = f(ce) = c = 0. (97) c = 0, (98) Q = 0. (99) Thus, 0 is the unique equilibrium of the stability equation. We then show that 0 is globally asymptotically stable. Define the zero-reward optimal robust Bellman operator H0Q(s, a) = σPas (VQ), (100) and further introduce two operators H 0Q(s, a) = σPas (VQ) f(Q), (101) H0Q(s, a) = σPa s (VQ) g P. (102) Model-Free Robust Average-Reward Reinforcement Learning It is straightforward to verify that H0 is non-expansive. Hence by (Borkar & Soumyanatha, 1997), the solution y(t) to equation y = H0y y (103) converges to the set of equilibrium points {W : W(s, a) = σPas (VW ) g P}, a.s.. (104) This again can be viewed as an optimal robust Bellman equation with zero-reward. Hence, any equilibrium W of (103) satisfies max a W(s, a) = c, s. 
(105) This together with (104) further implies that the equilibrium W of (103) satisfies W(s, a) = σPas (VW ) = σPas (ce) = c, (106) and hence y(t) converges to {ce : c R}. We denote its limit by y = c e. Lemma F.6 implies the solution x(t) to the ODE x = H 0(x) x can be decomposed as x(t) = y(t) + r(t)e, where r(t) satisfies r(t) = r(t) + g P f(y(t)), r(0) = 0. Then, similar to Lemma B.1, Lemma B.4 in (Wan et al., 2021) and Theorem 3.4 in (Abounadi et al., 2001), it can be shown that r(t) g P f(y(t)) = c . Hence, x(t) 0, (107) which proves the asymptotic stability. Thus, we conclude that 0 is the unique globally asymptotically stable equilibrium of the stability equation, which implies the boundedness of {Qn} together with results from Section 2.1 and 3.2 from (Borkar, 2009). Theorem C.2 (Restatement of Theorem 4.1). The sequence {Qn} generated by Algorithm 2 converges to a solution Q to the optimal robust Bellman equation a.s., and f(Qn) converges to the optimal robust average-reward g P a.s.. Proof. According to Lemma 1 from (Borkar, 2009) and Theorem 3.5 from (Abounadi et al., 2001), the sequence {Qn} converge to the same limit as the solution x(t) to the ODE x = H x x. Hence the proof can be completed by showing convergence of x(t) and f(x(t)). For the optimal robust Bellman operator, HQ(s, a) = r(s, a) + σPas (VQ), (108) define two operators H Q HQ f(Q)e, (109) HQ HQ g Pe. (110) From Lemma F.6, we know that if x(t), y(t) are the solutions to equations x = H x x, (111) y = Hy y, (112) with the same initial value x(0) = y(0), then x(t) = y(t) + r(t)e, (113) where r(t) satisfies r(t) = r(t) + g P f(y(t)), r(0) = 0. (114) Model-Free Robust Average-Reward Reinforcement Learning It can be easily verified that H is non-expansive. Hence y(t) converges to the set of equilibrium points of of (112) (Theorem 3.1 of (Borkar & Soumyanatha, 1997)), which can be characterized as n W : HW = W o = {W : W = HW g Pe} = W : W(s, a) = r(s, a) g P + σPas (VW ), s, a . (115) From Lemma 4.1, any equilibrium W satisfies VW = V πW P + ce, for some P ΩπW g , c R, (116) and πW is robust optimal. We denote the limit of y(t) by W . Similar to (72) to (76), it can be shown that r(t) g P f(W ). This further implies that x(t) W + (g P f(W ))e W + m e, (117) where m = g P f(W ). Note that W + m e is a solution to the optimal robust Bellman equation, hence x(t) converges to a solution to (10). Moreover, since f is continuous (because it is Lipschitz), we have that f(x(t)) f(W + m e) = f(W ) + g P f(W ) = g P. (118) This completes the proof. D. Case Studies for Robust RVI TD In this section, we provide the proof of the first part of Theorem 5.1, i.e., that ˆT is unbiased and has bounded variance under each uncertainty model. We first show a lemma, by which the problem can be reduced to investigating whether ˆσPas is unbiased and has bounded variance. Lemma D.1. If E[ˆσPas V ] = σPas (V ), s, a, (119) and moreover, there exists a constant C, such that Var(ˆσPas V ) C(1 + V 2), s, a, (120) E[ˆTV (s)] = TV (s), s, (121) Var(ˆTV (s)) |A|C(1 + V 2), s. (122) Proof. From the definition, ˆTV (s) = P a π(a|s)(r(s, a) + ˆσPas V ). Thus, E[ˆTV (s)] = E X a π(a|s)(r(s, a) + ˆσPa s V ) a π(a|s)(r(s, a) + E[ˆσPa s V ]) a π(a|s)(r(s, a) + σPas (V )) = TV (s), (123) Model-Free Robust Average-Reward Reinforcement Learning which shows that ˆT is unbiased. 
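As a numerical check of this reduction (not part of the proof), the sketch below plugs a per-state-action estimator into T̂V(s) = Σ_a π(a|s)(r(s, a) + σ̂_{P^a_s}V) and compares the Monte-Carlo mean of T̂V with the exact TV. The toy MDP, the random policy, and the single-sample δ-contamination estimator are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, delta = 3, 2, 0.2                         # toy sizes and contamination radius (assumed)
P = rng.dirichlet(np.ones(S), size=(S, A))      # nominal kernel (assumed)
r = rng.uniform(size=(S, A))                    # reward table (assumed)
pi = rng.dirichlet(np.ones(A), size=S)          # arbitrary policy pi[s] over actions (assumed)
V = rng.standard_normal(S)                      # an arbitrary value vector

def sigma_exact(p, V):
    # exact support function of the delta-contamination set
    return (1.0 - delta) * p @ V + delta * V.min()

def sigma_hat(s, a, V):
    # single-sample contamination estimator, unbiased for sigma_{P^a_s}(V)
    s_next = rng.choice(S, p=P[s, a])
    return (1.0 - delta) * V[s_next] + delta * V.min()

def T_exact(V):
    return np.array([sum(pi[s, a] * (r[s, a] + sigma_exact(P[s, a], V)) for a in range(A))
                     for s in range(S)])

def T_hat(V):
    # plug the per-(s, a) estimator into TV(s) = sum_a pi(a|s)(r(s, a) + sigma_{P^a_s}(V))
    return np.array([sum(pi[s, a] * (r[s, a] + sigma_hat(s, a, V)) for a in range(A))
                     for s in range(S)])

samples = np.array([T_hat(V) for _ in range(20000)])
print("Monte-Carlo mean of T_hat V:", samples.mean(axis=0))
print("exact TV:                   ", T_exact(V))
```

We now return to the proof of Lemma D.1.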
On the other hand, we have that Var(ˆTV (s)) = E X a π(a|s)(r(s, a) + ˆσPas V ) E X a π(a|s)(r(s, a) + ˆσPas V ) 2 a π(a|s)(r(s, a) + ˆσPas V ) X a π(a|s)(r(s, a) + E ˆσPas V 2 a π(a|s)(ˆσPas V ) E ˆσPas V 2 a π(a|s)(ˆσPas V E ˆσPas V )2 a π(a|s)E (ˆσPas V E ˆσPas V 2 )2 a π(a|s)Var(ˆσPas V ) |A|C(1 + V 2), (124) where (a) is because (E[X])2 E[X2], which completes the proof. This lemma implies that to prove Theorem 5.1, it suffices to show that ˆσPas is unbiased and has bounded variance. D.1. Contamination Uncertainty Set Theorem D.1. ˆT defined in (12) is unbiased and has bounded variance. Proof. First, note that Vn+1(s) = Vn(s) + αn(r(s, a) + ((1 δ)Vn(s ) + δ min x Vn(x) f(Vn) Vn(s)) = Vn(s) + αn(TVn(s) f(Vn) Vn(s) + Mn(s)), (125) Mn(s) = r(s, a) + (1 δ)Vn(s ) + δ min x Vn(x) TVn(s), (126) a π(a|s) r(s, a) + (1 δ) X s Pa s,s Vn(s ) + δ min x Vn(x) . (127) E[Mn(s)] = E r(s, a) + (1 δ)Vn(s ) + δ min x Vn(x) X a π(a|s) r(s, a) + (1 δ) X s Pa s,s Vn(s ) + δ min x Vn(x) a π(a|s) r(s, a) + (1 δ) X s Pa s,s Vn(s ) + δ min x Vn(x) a π(a|s) r(s, a) + (1 δ) X s Pa s,s Vn(s ) + δ min x Vn(x)) Hence, the operator is unbiased. Model-Free Robust Average-Reward Reinforcement Learning We also have that E[|Mn(s)|2] = E r(s, a) + (1 δ)Vn(s ) + δ min x Vn(x) TVn(s) 2 2E r(s, a) + (1 δ)Vn(s ) + δ min x Vn(x) 2 + 2E[(TVn(s))2] (a) 8 + 8 Vn 2 8(1 + Vn 2), (129) where (a) is from the fact that E (1 δ)Vn(s ) + δ minx Vn(x) 2 = E (1 δ)Vn(s ) + δ minx Vn(x) 2 E (1 δ)Vn(s ) + δ minx Vn(x) 2 E (1 δ) Vn + δ Vn 2 Vn 2. The proof is completed. D.2. Total Variation Uncertainty Set The estimator under the total variation uncertainty set can be written as ˆσPas (V ) = max µ 0 ˆPa,1 s,N+1(V µ) δSpan(V µ) + N(V ) p N , (130) N(V ) = max µ 0 ˆPa s,N+1(V µ) δSpan(V µ) 2 max µ 0 ˆPa,O s,N+1(V µ) δSpan(V µ) 2 max µ 0 ˆPa,E s,N+1(V µ) δSpan(V µ) . (131) Theorem D.2. The estimated operator ˆσPas defined in (130) is unbiased, i.e., E[ˆσPas V ] = σPas (V ). (132) Proof. First, denote the dual function (14) by g: g V s,a(µ) = Pa s(V µ) δSpan(V µ), (133) and denote its optimal solution by µV s,a: µV s,a = arg max µ 0 Pa s(V µ) δSpan(V µ) . (134) Then, the support function σPas (V ) = g V s,a(µV s,a). Similarly, define the empirical function g V s,a,N+1(µ) = ˆPa s,N+1(V µ) δSpan(V µ), (135) g V s,a,N+1,O(µ) = ˆPa,O s,N+1(V µ) δSpan(V µ), (136) g V s,a,N+1,E(µ) = ˆPa,E s,N+1(V µ) δSpan(V µ), (137) and their optimal solutions are denoted by µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E. We have that E[ˆσPa s V ] = E max µ 0 ˆPa,1 s,N+1(V µ) δSpan(V µ) + N(V ) = E[g V s,a,0(µV s,a,0)] + E N(V ) Model-Free Robust Average-Reward Reinforcement Learning = E[g V s,a,0(µV s,a,0)] + n=0 p(N = n)E N(V ) = E[g V s,a,0(µV s,a,0)] + n=0 E[ n(V )] = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n+1,O(µV s,a,n+1,O) + g V s,a,n+1,E(µV s,a,n+1,E) 2 = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) , (138) where the last inequality is from Lemma F.2. The last equation can be further rewritten as E[ˆσPas V ] = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) = lim n E g V s,a,n(µV s,a,n) . (139) To show that ˆσPas is unbiased, it suffices to prove that lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a). (140) For any arbitrary i.i.d. 
samples {Xi} and its corresponding function g V s,a,n, together with Lemma F.3, we have that |g V s,a,n(µV s,a,n) g V s,a(µV s,a)| = | max 0 µ V + V e g V s,a(µ) max 0 µ V + V e g V s,a,n(µ)| max 0 µ V + V e |g V s,a(µ) g V s,a,n(µ)| = max 0 µ V + V e |Pa s(V µ) δSpan(V µ) ˆPa s,n(V µ) + δSpan(V µ)| = max 0 µ V + V e |Pa s(V µ) ˆPa s,n(V µ)| max 0 µ V + V e V µ Pa s ˆPa s,n 1 3 V Pa s ˆPa s,n 1. (141) Thus, by Hoeffding s inequality and Theorem 3.7 from (Liu et al., 2022), E[|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|] 3 V |S|2 π which implies that lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a), (143) completing the proof. Theorem D.3. The estimated operator ˆσPa s defined in (130) has bounded variance, i.e., there exists a constant C0, such that Var(ˆσPas V ) (1 + 18(1 + 2δ)2 + 2C0) V 2. (144) Proof. Similar to Theorem D.2, we have that Var(ˆσPa s V ) Model-Free Robust Average-Reward Reinforcement Learning = E[(ˆσPas V )2] σPas (V )2 E g V s,a,0(µV s,a,0) + N(V ) 2 + (σPas (V ))2 2E g V s,a,0(µV s,a,0) 2 + 2E N(V ) 2 + (σPas (V ))2 (1 + 18(1 + 2δ)2) V 2 + 2 E[( i(V ))2] where the last inequality is from Lemma F.3. For any n 1, we have that E[( n(V ))2] = E g V s,a,n(µV s,a,n) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 = E g V s,a,n(µV s,a,n) g V s,a(µV s,a) + g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 (a) = 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E[(g V s,a,n 1(µV s,a,n 1) g V s,a(µV s,a))2] 18 V 2E[ Pa s ˆPa s,n 2 1] + 18 V 2E[ Pa s ˆPa s,n 1 2 1], (146) where (a) is due to Lemma F.2 and the last inequality follows a similar argument to (141). Note that pn = Ψ(1 Ψ)n for Ψ (0, 0.5), thus similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that E[( i(V ))2] pi C0 V 2. (147) Var(ˆσPas V ) (1 + 18(1 + 2δ)2) V 2 + 2C0 V 2 = (1 + 18(1 + 2δ)2 + 2C0) V 2 . (148) D.3. Chi-Square Uncertainty Set The estimator under the Chi-square uncertainty set can be written as ˆσPas V = max µ 0 ˆPa,1 s,N+1(V µ) q δVarˆPa,1 s,N+1(V µ) p N , (149) N(V ) = max µ 0 EˆPa s,N+1[V µ] q δVarˆPa s,N+1(V µ) 2 max µ 0 EˆPa,O s,N+1[V µ] q δVarˆPa,O s,N+1(V µ) 2 max µ 0 EˆPa,E s,N+1[V µ] q δVarˆPa,E s,N+1(V µ) . Theorem D.4. The estimated operator defined in (149) is unbiased, i.e., E[ˆσPas V ] = σPas (V ). (150) Model-Free Robust Average-Reward Reinforcement Learning Proof. Denote the dual function (15) by g: g V s,a(µ) = Pa s(V µ) q δVar Pas (V µ), (151) and denote its optimal solution by µV s,a: µV s,a = arg max µ 0 Pa s(V µ) q δVar Pas (V µ) . (152) Then, the support function σPas (V ) = g V s,a(µV s,a). Similarly, define the empirical function g V s,a,N+1(µ) = ˆPa s,N+1(V µ) q δVarˆPa s,N+1(V µ), (153) g V s,a,N+1,O(µ) = ˆPa,O s,N+1(V µ) q δVarˆPa,O s,N+1(V µ), (154) g V s,a,N+1,E(µ) = ˆPa,E s,N+1(V µ) q δVarˆPa,E s,N+1(V µ), (155) and their optimal solutions are denoted by µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E. We have that E[ˆσPas V ] = E[g V s,a,0(µV s,a,0)] + E N(V ) = E[g V s,a,0(µV s,a,0)] + n=0 p(N = n)E N(V ) = E[g V s,a,0(µV s,a,0)] + = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n+1,O(µV s,a,n+1,O) + g V s,a,n+1,E(µV s,a,n+1,E) 2 = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) , (156) where the last inequality is from Lemma F.2. 
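Before continuing with the proof, we note that the multi-level constructions in (130) and (149) share the same pattern, which the following sketch (not part of the analysis) makes concrete: draw N with P(N = n) = Ψ(1 - Ψ)^n, collect 2^{N+1} i.i.d. samples from the nominal kernel, compare the full-batch plug-in value of the dual function with the average of its odd- and even-half plug-in values, and add this correction, scaled by 1/p_N, to a single-sample plug-in estimate. In the sketch, the dual function is the Chi-squared dual (15) with the inner maximizer µ fixed to 0 purely for brevity, the parameter Ψ and the truncation of N are practical illustrative choices (the analysis uses the untruncated estimator and restricts Ψ so that the variance is finite), and the sampling distribution is a random toy kernel.

```python
import numpy as np

rng = np.random.default_rng(4)
Psi = 0.25                                      # geometric parameter (assumed; must be small enough for finite variance)

def mlmc_estimate(g, draw):
    """One multi-level Monte-Carlo estimate of g(p) from i.i.d. samples of p,
    mirroring the construction in (130)/(149):
      g(samples) : plug-in dual value computed from a batch of samples
      draw(k)    : draws k i.i.d. samples from the (unknown) distribution p."""
    N = min(rng.geometric(Psi) - 1, 16)         # P(N = n) = Psi (1 - Psi)^n; truncated only to keep the demo cheap
    pN = Psi * (1.0 - Psi) ** N
    base = g(draw(1))                           # level-0 plug-in estimate from a single sample
    xs = draw(2 ** (N + 1))                     # 2^{N+1} fresh samples
    Delta_N = g(xs) - 0.5 * (g(xs[0::2]) + g(xs[1::2]))   # full batch vs. odd/even halves
    return base + Delta_N / pN

# --- illustrative instantiation: Chi-squared dual (15) with mu fixed to 0 (assumed) ---
S, radius = 4, 0.2
p_true = rng.dirichlet(np.ones(S))
V = rng.standard_normal(S)

def g_chi2(samples):
    q = np.bincount(samples, minlength=S) / len(samples)   # empirical distribution of the batch
    mean, second = q @ V, q @ (V ** 2)
    return mean - np.sqrt(max(radius * (second - mean ** 2), 0.0))

draw = lambda k: rng.choice(S, size=k, p=p_true)
mean_t = p_true @ V
exact = mean_t - np.sqrt(radius * (p_true @ (V ** 2) - mean_t ** 2))
estimates = [mlmc_estimate(g_chi2, draw) for _ in range(10000)]
print("MLMC mean:", np.mean(estimates), " exact dual value at mu = 0:", exact)
# the two numbers should agree up to Monte-Carlo error and the (tiny) truncation bias
```

Returning to the proof of Theorem D.4, recall the telescoping identity (156).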
The last equation can be further rewritten as E[ˆσPas V ] = E[g V s,a,0(µV s,a,0)] + n=0 E g V s,a,n+1(µV s,a,n+1) g V s,a,n(µV s,a,n) = lim n E g V s,a,n(µV s,a,n) . (157) To show that ˆσPas is unbiased, it suffices to prove that lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a). (158) For any arbitrary i.i.d. samples {Xi} and its corresponding function g V s,a,n, together with Lemma F.4, we have that |g V s,a,n(µV s,a,n) g V s,a(µV s,a)| = | max 0 µ V + V e g V s,a(µ) max 0 µ V + V e g V s,a,n(µ)| max 0 µ V + V e |g V s,a(µ) g V s,a,n(µ)| = max 0 µ V + V e Pa s(V µ) ˆPa s,n(V µ) q δVar Pas (V µ) q δVarˆPas,n(V µ) Model-Free Robust Average-Reward Reinforcement Learning max 0 µ V + V e |Pa s(V µ) ˆPa s,n(V µ)| + max 0 µ V + V e δVar Pas (V µ) q δVarˆPas,n(V µ) (a) max 0 µ V + V e V µ Pa s ˆPa s,n 1 + max 0 µ V + V e |δVar Pas (V µ) δVarˆPas,n(V µ)|, (159) where (a) is due to | x y| p |x y|. Note that for any distribution p, q (|S|) and any random variable X, |Varp[X] Varq[X]| = |Ep[X2] Ep[X]2 Eq[X2] + Eq[X]2| |Ep[X2] Eq[X2]| + |(Ep[X] + Eq[X])(Ep[X] Eq[X])| sup |X2| p q 1 + 2(sup |X|)2 p q 1. (160) |δVar Pas (V µ) δVarˆPas,n(V µ)| q 3δ V µ 2 Pas ˆPas,n 1. (161) Thus, by Hoeffding s inequality and Theorem 3.7 from (Liu et al., 2022), E[|g V s,a,n(µV s,a,n) g V s,a(µV s,a)|] 3 V which implies that lim n E g V s,a,n(µV s,a,n) = g V s,a(µV s,a), (163) which completes the proof. Theorem D.5. The estimated operator ˆσPas defined in (149) has bounded variance, i.e., there exists a constant C0, such that Var(ˆσPas V ) (1 + 18(1 + 2δ)2 + 2C0) V 2. (164) Proof. We have that Var(ˆσPas V ) = E[(ˆσPas V )2] σPas (V )2 E g V s,a,0(µV s,a,0) + N(V ) 2 + (σPas (V ))2 2E g V s,a,0(µV s,a,0) 2 + 2E N(V ) 2 + (σPas (V ))2 (1 + 18(1 + 2δ)2) V 2 + 2 E[( i(V ))2] where the last inequality is from Lemma F.4. For any n 1, we have that E[( n(V ))2] = E g V s,a,n(µV s,a,n) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 = E g V s,a,n(µV s,a,n) g V s,a(µV s,a) + g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E g V s,a(µV s,a) g V s,a,n,E(µV s,a,n,E) + g V s,a,n,O(µV s,a,n,O) 2 Model-Free Robust Average-Reward Reinforcement Learning (a) = 2E[(g V s,a,n(µV s,a,n) g V s,a(µV s,a))2] + 2E[(g V s,a,n 1(µV s,a,n 1) g V s,a(µV s,a))2] 3δ)2 V 2E[ Pa s ˆPa s,n 2 1 + Pa s ˆPa s,n 1] 3δ)2 V 2E[ Pa s ˆPa s,n 1 2 1 + Pa s ˆPa s,n 1 1], (166) where (a) is due to Lemma F.2 and the last inequality follows a similar argument to (159). Note that pn = Ψ(1 Ψ)n for Ψ 0, 1 2 2 . Thus, similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that E[( i(V ))2] pi C0 V 2. (167) Var(ˆσPas V ) (1 + 18(1 + 2δ)2) V 2 + 2C0 V 2 = (1 + 18(1 + 2δ)2 + 2C0) V 2. (168) D.4. KL-Divergence Uncertainty Sets The estimator under the KL-Divergence uncertainty set can be written as ˆσPas V min α 0 δα + α log e V (s 1) α + N(V ) N(V ) = min α 0 δα + α log EˆPa s,N+1 e V δα + α log EˆPa,O s,N+1 e V δα + α log EˆPa,E s,N+1 e V Theorem D.6. (Liu et al., 2022) The estimated operator ˆσPas is unbiased and has bounded variance, i.e., there exists a constant C0, such that Var(ˆσPas V ) C0(1 + V 2). D.5. Wasserstein Distance Uncertainty Sets To study the support function w.r.t. this uncertainty model, we first introduce some notation. Definition D.1. For any function f : Z R, λ 0 and x Z, define the regularization operator Φ(λ, x) inf y Z(λd(x, y)l + f(y)). 
(170) The growth rate κ of function f and any distribution q over Z is defined as κq inf λ 0 : X x Z q(x)Φ(λ, x) > . (171) Lemma D.2. (Gao & Kleywegt, 2022) Consider the distributional robust optimization of a function f: inf Wl(q,p) δ Ex q[f(x)], (172) and define its dual problem as sup λ 0 ( λδl + X x Z p(x) inf y Z(f(y) + λd(x, y)l)). (173) If κp < , then the strong duality holds, i.e., inf Wl(q,p) δ Ex q[f(x)] = sup λ 0 ( λδl + X x Z p(x) inf y Z(f(y) + λd(x, y)l)). (174) Model-Free Robust Average-Reward Reinforcement Learning We first verify that this strong duality holds for our support function. Lemma D.3. (Restatement of Equation (19)) It holds that σPas (V ) = sup λ 0 x Pa s,x inf y (V (y) + λd(x, y)l) . (175) Proof. In our case, the regularization operator is Φ(λ, x) = inf s S(λd(s, x)l + V (s)). (176) Note that for any λ 0, X x S Pa s(x)Φ(λ, x) = X x S Pa s(x) inf s S(λd(s, x)l + V (s)) V > . (177) Hence, the growth rate κPas = 0 < . Thus, the strong duality holds. Then, the estimator under the Wasserstein distance uncertainty set can be constructed as ˆσPas V sup λ 0 λδl + inf y (V (y) + λd(s 1, y)l) + N(V ) p N + r(s, a), (178) λδl + EˆPa s,N+1 inf y (V (y) + λd(S, y)l) λδl + EˆPa,O s,N+1 inf y (V (y) + λd(S, y)l) λδl + EˆPa,E s,N+1 inf y (V (y) + λd(S, y)l) . Theorem D.7. The estimated operator defined in (178) is unbiased, i.e., E[ˆσPas V ] = σPas (V ). (179) Proof. Denote the dual function (19) by g: g V s,a(λ) = λδl + ES Pas [inf x S(V (x) + λd(S, x)l)], (180) and denote its optimal solution by λV s,a: λV s,a = arg max λ 0 λδl + ES Pa s [inf x S(V (x) + λd(S, x)l)] . (181) Then, the support function σPas (V ) = g V s,a(λV s,a). Similarly, define the empirical function g V s,a,N+1, g V s,a,N+1,O, g V s,a,N+1,E, and denote their optimal solutions by λV s,a,N+1, λV s,a,N+1,O, λV s,a,N+1,E. We have that E[ˆσPas V ] = E[g V s,a,0(λV s,a,0)] + E N(V ) = E[g V s,a,0(λV s,a,0)] + n=0 p(N = n)E N(V ) Model-Free Robust Average-Reward Reinforcement Learning = E[g V s,a,0(λV s,a,0)] + = E[g V s,a,0(λV s,a,0)] + n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n+1,O(λV s,a,n+1,O) + g V s,a,n+1,E(λV s,a,n+1,E) 2 = E[g V s,a,0(λV s,a,0)] + n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n(λV s,a,n) , (182) where the last inequality is from Lemma F.2. The last equation can be further rewritten as E[ˆσPas V ] = E[g V s,a,0(λV s,a,0)] + n=0 E g V s,a,n+1(λV s,a,n+1) g V s,a,n(λV s,a,n) = lim n E g V s,a,n(λV s,a,n) . (183) To show that ˆσPas is unbiased, it suffices to prove that lim n E g V s,a,n(λV s,a,n) = g V s,a(λV s,a). (184) For any arbitrary i.i.d. samples {Xi} and its corresponding function g V s,a,n, together with Lemma F.5, we have that |g V s,a,n(λV s,a,n) g V s,a(λV s,a)| = | max 0 λ 2 V δl g V s,a(λ) max 0 λ 2 V δl g V s,a,n(λ)| max 0 λ 2 V δl |g V s,a(λ) g V s,a,n(λ)| = max 0 λ 2 V ES Pas [inf x S(V (x) + λd(S, x)l)] ES ˆPas,n[inf x S(V (x) + λd(S, x)l)] max 0 λ 2 V δl Pa s ˆPa s,n 1 sup x,S S (|V (x) + λd(S, x)l|) V Pa s ˆPa s,n 1, (185) where the last inequality is from the bound on λ and D is the diameter of the metric space (S, d). By Hoeffding s inequality and similar to the previous proofs, we have that E[|g V s,a,n(λV s,a,n) g V s,a(λV s,a)|] 1 + 2Dl which implies that lim n E g V s,a,n(λV s,a,n) = g V s,a(λV s,a). (187) This completes the proof. Theorem D.8. The estimated operator ˆσPas defined in (178) has bounded variance, i.e., there exists a constant C0, such that Var(ˆσPas V ) (3 + 2C0) V 2. (188) Proof. 
We first have that Var(ˆσPa s V ) Model-Free Robust Average-Reward Reinforcement Learning = E[(ˆσPas V )2] σPas (V )2 E g V s,a,0(λV s,a,0) + N(V ) 2 + (σPas (V ))2 2E g V s,a,0(λV s,a,0) 2 + 2E N(V ) 2 + (σPas (V ))2 E[( i(V ))2] where the last inequality is from Lemma F.4. For any n 1, we have that E[( n(V ))2] = E g V s,a,n(λV s,a,n) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O) 2 = E g V s,a,n(λV s,a,n) g V s,a(λV s,a) + g V s,a(λV s,a) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O) 2 2E[(g V s,a,n(λV s,a,n) g V s,a(λV s,a))2] + 2E g V s,a(λV s,a) g V s,a,n,E(λV s,a,n,E) + g V s,a,n,O(λV s,a,n,O) 2 (a) = 2E[(g V s,a,n(λV s,a,n) g V s,a(λV s,a))2] + 2E[(g V s,a,n 1(λV s,a,n 1) g V s,a(λV s,a))2] 2 V 2E[ Pa s ˆPa s,n 2 1] + 2 1 + 2Dl 2 V 2E[ Pa s ˆPa s,n 1 2 1], (190) where (a) is due to Lemma F.2 and the last inequality follows a similar argument to (186). Note that pn = Ψ(1 Ψ)n for Ψ 0, 0.5 , thus similar to Theorem 3.7 of (Liu et al., 2022), we can show that there exists a constant C0, such that E[( i(V ))2] pi C0 V 2. (191) Thus, we have that Var(ˆσPas V ) 3 V 2 + 2C0 V 2 = (3 + 2C0) V 2. (192) E. Case Studies for Robust RVI Q-Learning In this section, we provide the proof of the second part of Theorem 5.1, i.e., ˆH is bounded and unbiased under each uncertainty model. We note that the proofs in this part can be easily derived by following the ones in Appendix D. We first prove a lemma necessary to the proofs in this section. Lemma E.1. It holds that VQ Q . (193) Proof. From the definition of VQ, we have that VQ = max s |VQ(s)| = max s | max a Q(s, a)| |Q(s , a )|. (194) Clearly, |Q(s , a )| maxs,a |Q(s, a)|, hence VQ Q . (195) Model-Free Robust Average-Reward Reinforcement Learning Similar to Appendix D, the propositions of ˆH can be reduced to the ones of ˆσPas . Lemma E.2. If E[ˆσPas V ] = σPas (V ), and moreover there exists a constant C, such that for any s, a, Var(ˆσPas V ) C(1 + V 2), then E[ ˆHQ(s, a)] = HQ(s, a), and Var( ˆHQ(s, a)) C(1 + Q 2). Proof. First, we have that E[ ˆHQ(s, a)] = E[r(s, a) + ˆσPas VQ(s)] = r(s, a) + σPas (VQ) = HQ(s, a). (196) For boundedness, note that Var( ˆHQ(s, a)) = E ( ˆHQ(s, a) HQ(s, a))2 = E ˆσPa s VQ(s) σPa s (VQ) 2 C(1 + VQ 2) C(1 + Q 2), (197) where the last inequality is from Lemma E.1. This implies that the problem is reduced to verifying whether ˆσPas is unbiased and has bounded variance, which is identical to the results in Appendix D. We thus omit the proofs for this part. F. Technical Lemmas Lemma F.1. For a robust Bellman operator T, define T V TV f(V )e, (198) TV TV gπ Pe. (199) Assume that x(t), y(t) are the solutions to equations x = T x x, (200) y = Ty y, (201) with the same initial value x(0) = y(0) = x0. Then, x(t) = y(t) + r(t)e, (202) where r(t) satisfies r(t) = r(t) + gπ P f(y(t)). (203) Proof. Note that T V = TV + (gπ P f(V ))e, then from the variation of constants formula, we have that x(t) = x0e t + Z t 0 e (t s) T(x(s))ds + Z t 0 e (t s)(gπ P f(x(s)))ds e, (204) y(t) = x0e t + Z t 0 e (t s) T(y(s))ds. (205) Hence, the maximal and minimal components of x(t) y(t) can be bounded as: max i (xi(t) yi(t)) Z t 0 e (t s) max i ( Ti(x(s)) Ti(y(s)))ds + Z t 0 e (t s)(gπ P f(x(s)))ds, min i (xi(t) yi(t)) Z t 0 e (t s) min i ( Ti(x(s)) Ti(y(s)))ds + Z t 0 e (t s)(gπ P f(x(s)))ds. (206) Model-Free Robust Average-Reward Reinforcement Learning This hence implies that Span(x(t) y(t)) Z t 0 e (t s)Span( T(x(s)) T(y(s)))ds 0 e (t s)Span(x(s) y(s))ds, (207) where the last inequality is because T is non-expansive w.r.t. 
the span semi-norm (Wang et al., 2023). Gronwall s inequality implies that Span(x(t) y(t)) 0 R t 0 e (t s)ds = 0 for any t 0. However, since Span is non-negative, then Span(x(t) y(t)) = 0. Hence, we have that x(t) = y(t) + r(t)e for some r(t) satisfying r(0) = 0. Also note that the differential of r(t) can be written as r(t)e = x(t) y(t) = Tx(t) + (gπ P f(x(t)))e x(t) Ty(t) + y(t) = ( r(t) + gπ P f(y(t)))e, (208) where the last equation is because Tx(t) = T(y(t) + r(t)e) = T(y(t)) + r(t)e, (209) f(x(t)) = f(y(t) + r(t)e) = f(y(t)) + r(t). (210) This completes the proof. Lemma F.2. For any function g : (|S|) R, assume there are 2n+1 i.i.d. samples Xi q. Denote the empirical distributions from samples {Xi : i = 1, ..., 2n+1}, {X2i 1 : i = 1, ..., 2n}, {X2i : i = 1, ..., 2n} by ˆqn+1, ˆqn+1,O, ˆqn+1,E. Then, E[g(ˆqn+1,O)] = E[g(ˆqn+1,E)] = E[g(ˆqn)]. (211) Proof. Note that ˆqn+1,O(s) = P2n i=1 1X2i 1=s E[g(ˆqn+1,O)] = X p=(p1,...,p|S|) (|S|) g(p)P(ˆqn+1,O = p) p=(p1,...,p|S|) (|S|) P P2n i=1 1X2i 1=s1 2n = p1, ..., i=1 1X2i 1=s|S| q g(p), (213) where 2npi N and P|S| i=1 pi = 1. On the other hand, E[g(ˆqn)] = X p=(p1,...,p|S|) (|S|) g(p)P(ˆqn = p) p=(p1,...,p|S|) (|S|) P P2n 2n = p1, ..., i=1 1Xi=s|S| q g(p). (214) Note that Xi are i.i.d., hence, 2n = p1, ..., i=1 1Xi=s|S| i=1 1X2i 1=s1 2n = p1, ..., i=1 1X2i 1=s|S| Model-Free Robust Average-Reward Reinforcement Learning E[g(ˆqn+1,O)] = E[g(ˆqn)]. (215) Similarly, E[g(ˆqn+1,E)] = E[g(ˆqn)] and hence it completes the proof. Lemma F.3. Under the total variation uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. Specifically, µV s,a, µV s,a,N+1, µV s,a,N+1,E, µV s,a,N+1,O V + V e, (216) µV s,a , µV s,a,N+1 , µV s,a,N+1,E , µV s,a,N+1,O 2 V , (217) |g V s,a(µV s,a)|, |g V s,a,N+1(µV s,a,N+1)|, |g V s,a,N+1,O(µV s,a,N+1,O)|, |g V s,a,N+1,E(µV s,a,N+1,E)| 3(1 + 2δ) V . (218) Proof. First we show the bounds on the optimal solutions. If we denote the minimal entry of V by w: w = mins V (s), then W V we 0. Note that, µW s,a = arg max µ 0 Pa s(W µ) δSpan(W µ) = arg max µ 0 w + Pa s(V µ) δSpan(V µ) , (219) which is because Span(V + ke) = Span(V ) and Pa s(V + ke) = k + Pa s V . Hence, µW s,a = µV s,a. Moreover note that W 0, hence µW s,a is bounded: µW s,a W, this further implies that µV s,a = µW s,a W 2 V . (220) The bounds on µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E can be similarly derived. We then consider the optimal value. Note that, g V s,a(µV s,a) = Pa s(V µV s,a) δSpan(V µV s,a) V + µV s,a + δ| max i (V (i) µV s,a(i))| + δ| min i (V (i) µV s,a(i))| 3 V + 2δ( V + µV s,a ) 3(1 + 2δ) V . (221) On the other hand, g V s,a(µV s,a) g V s,a(0) = Pa s V δSpan(V ) = Pa s V δ max i V (i) + δ min i V (i). (222) Denote the maximal and minimal entries of V by V (M) and V (m), then we have that Pa s V δ max i V (i) + δ min i V (i) x Pa s,x V (x) δV (M) + δV (m) V 2δ V , (223) where the last inequality is from V V (i) V for any entry i. Thus, combining (222) and (223) implies that (1 + 2δ) V g V s,a(µV s,a) 3(1 + 2δ) V . (224) Similarly, the bounds on g V s,a,N+1(µV s,a,N+1), g V s,a,N+1,O(µV s,a,N+1,O), g V s,a,N+1,E(µV s,a,N+1,E) can be derived. Lemma F.4. Under the chi-square uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. 
Specifically, µV s,a, µV s,a,N+1, µV s,a,N+1,E, µV s,a,N+1,O V + V e, (225) µV s,a , µV s,a,N+1 , µV s,a,N+1,E , µV s,a,N+1,O 2 V , (226) |g V s,a(µV s,a)|, |g V s,a,N+1(µV s,a,N+1)|, |g V s,a,N+1,O(µV s,a,N+1,O)|, |g V s,a,N+1,E(µV s,a,N+1,E)| 3(1 + 2δ) V . (227) Model-Free Robust Average-Reward Reinforcement Learning Proof. First, we show the bounds on the optimal solutions. If we denote the minimal entry of V by w: w = mins V (s), then W V we 0. Note that, µW s,a = arg max W µ 0 Pa s(W µ) q δVar Pas (W µ) = arg max W µ 0 w + Pa s(V µ) q δVar Pas (V µ) , (228) which is because Var Pa s (V µ we) = Var Pa s (V µ) + Var Pa s (we) 2Cov Pa s (V µ, we) = Var Pa s (V µ). Hence µW s,a = µV s,a. Moreover note that W 0, hence µW s,a is bounded: µW s,a W, this further implies that µV s,a = µW s,a W 2 V . (229) The bounds on µV s,a,N+1, µV s,a,N+1,O, µV s,a,N+1,E can be similarly derived. We then consider the optimal value. Note that, |g V s,a(µV s,a)| = |Pa s(V µV s,a) q δVar Pas (V µVs,a)| V + µV s,a + q 2δ V µVs,a 2 2δ( V + µV s,a ) 2δ) V . (230) Similarly, the bounds on g V s,a,N+1(µV s,a,N+1), g V s,a,N+1,O(µV s,a,N+1,O), g V s,a,N+1,E(µV s,a,N+1,E) can be derived. Lemma F.5. Under the Wasserstein distance uncertainty model, the optimal solution and optimal value for g V s,a, g V s,a,N+1, g V s,a,N+1,E, g V s,a,N+1,O are bounded. Specifically, λV s,a, λV s,a,n, λV s,a,n,O, λV s,a,n,E 2 V |g V s,a(λV s,a)|, |g V s,a,n(λV s,a,n)|, |g V s,a,n,O(λV s,a,n,O)|, |g V s,a,n,E(λV s,a,n,E)| V . (232) Proof. First, we show the bounds on the optimal solutions. Denote the optimal solution to maxλ 0 g V s,a(λ) by λV s,a. Moreover, for each state y S and any λ 0, denote sy λ arg minx S{λd(x, y)l + V (x)}. Hence, g V s,a(λ) = λδl + ES Pas [λd(S, s S λ)l + V (s S λ)]. (233) Moreover, note that g V s,a(λV s,a) = maxλ 0 g V s,a(λ), hence, λV s,aδl + ES Pas [λV s,ad(S, s S λV s,a)l + V (s S λV s,a)] g V s,a(0) = ES Pas [V (s S 0 )] = min x V (x), (234) where the last equation is due to the fact that s S 0 = arg minx S{V (x)} = minx V (x). Now consider the inner problem ES Pas [λV s,ad(S, s S λV s,a)l + V (s S λV s,a)]. Note that, ES Pas [λV s,ad(S, s S λV s,a)l + V (s S λV s,a)] x Pa s,x λV s,ad(x, sx λV s,a)l + V (sx λV s,a) x Pa s,x λV s,ad(x, x)l + V (x) = EPas [V (S)], (235) where (a) is because sx λV s,a = arg miny S{λV s,ad(x, y)l + V (y)} and hence λV s,ad(x, sx λV s,a)l + V (sx λV s,a) λV s,ad(x, x)l + V (x). Model-Free Robust Average-Reward Reinforcement Learning Combine (234) and (235), then we further have that min x V (x) λV s,aδl + ES Pas [λV s,ad(S, s S λV s,a)l + V (s S λV s,a)] λV s,aδl + EPas [V (S)]. (236) This implies that λV s,a EPas [V (S)] minx V (x) and hence λV s,a is bounded. On the other hand, note that g V s,a(λV s,a) = σPas [V (S)], hence, g V s,a(λV s,a) V . (238) Same bound can be similarly derived for λV s,a,n, λV s,a,n,O, λV s,a,n,E, g V s,a,n(λV s,a,n), g V s,a,n,O(λV s,a,n,O), g V s,a,n,E(λV s,a,n,E). Lemma F.6. For an optimal robust Bellman operator: HQ(s, a) = r(s, a) + σPas (VQ), define H Q HQ f(Q)e, (239) HQ HQ g Pe. (240) Assume that x(t), y(t) are the solutions to equations x = H x x, (241) y = Hy y, (242) with the same initial value x(0) = y(0). Then x(t) = y(t) + r(t)e, where r(t) satisfies r(t) = r(t) + g P f(y(t)). Proof. The proof follows exactly that of Lemma F.1 if we show that H is non-expansion w.r.t. the span semi-norm. Following Theorem 17 of (Wang et al., 2023), it can be shown that Span( H(Q1) H(Q2)) Span(VQ1 VQ2). 
(243) Specifically, let
$s = \arg\max_i \{\max_a Q_1(i,a) - \max_a Q_2(i,a)\}$, (244)
$t = \arg\min_i \{\max_a Q_1(i,a) - \max_a Q_2(i,a)\}$. (245)
Then
$\mathrm{Span}(V_{Q_1} - V_{Q_2}) = \big(\max_a Q_1(s,a) - \max_a Q_2(s,a)\big) - \big(\max_a Q_1(t,a) - \max_a Q_2(t,a)\big) \le \big(Q_1(s,a_s) - Q_2(s,a_s)\big) - \big(Q_1(t,a_t) - Q_2(t,a_t)\big) \le \max_{x,b}\big(Q_1(x,b) - Q_2(x,b)\big) - \min_{x,b}\big(Q_1(x,b) - Q_2(x,b)\big) = \mathrm{Span}(Q_1 - Q_2)$, (246)
where $a_s = \arg\max_a Q_1(s,a)$ and $a_t = \arg\max_a Q_2(t,a)$. This completes the proof.

G. Additional Experiments

In this section, we first present additional experiments on the Garnet problem of Section 6.1, and then further verify our theoretical results with additional experiments.

G.1. Garnet Problem

We first verify the convergence of our robust RVI TD and robust RVI Q-learning on the Garnet problem, under the same setting as in Section 6.1. Our results show that both algorithms converge to the (optimal) robust average-reward under the other three uncertainty sets.

Figure 5: Robust RVI TD Algorithm under Garnet Problem.
Figure 6: Robust RVI Q-Learning Algorithm under Garnet Problem.

G.2. Frozen-Lake Problem

We next verify our robust RVI TD algorithm and robust RVI Q-learning on the Frozen-Lake environment of OpenAI Gym (Brockman et al., 2016). We set the uncertainty radius δ = 0.4 and the step size αn = 0.01, and plot the (optimal) robust average-reward computed using the model-based methods in (Wang et al., 2023) as the baseline. For the policy evaluation problem, we evaluate the uniform policy and plot the average value of f(Vt) over 30 trajectories, with the 95/5 percentiles as the upper/lower envelopes. For the optimal control problem, we plot the average value of f(Qt) over 30 trajectories, again with the 95/5 percentiles as the upper/lower envelopes. The results show that both algorithms converge to the (optimal) robust average-reward.

Figure 7: Robust RVI TD Algorithm under Frozen-Lake environment.
Figure 8: Robust RVI Q-learning Algorithm under Frozen-Lake environment.

G.3. Robustness of Robust RVI Q-Learning

We further use a simple yet widely used problem, referred to as the one-loop task (Panaganti & Kalathil, 2021), to verify the robustness of our robust RVI Q-learning. This environment is widely used to demonstrate that robust methods can learn optimal policies different from those of non-robust methods, and that these policies are more robust to model uncertainty. The one-loop MDP contains two states s1, s2 and two actions al, ar, indicating going left or right. The nominal environment is shown on the left of Figure 9: at state s1, going left leads to s1 and going right leads to s2; likewise, at s2, going left leads to s1 and going right leads to s2.

Figure 9: One-Loop Task.

We implement our robust RVI Q-learning in this environment, with vanilla non-robust Q-learning as the baseline. At each time step t, we plot the difference Qt(s1, al) − Qt(s1, ar) in Figure 10(a). If Qt(s1, al) − Qt(s1, ar) < 0, the greedy policy goes right; if Qt(s1, al) − Qt(s1, ar) > 0, it goes left. As the results show, vanilla Q-learning eventually learns the policy π(s1) = ar, while our algorithm learns π(s1) = al. To verify the robustness of our method, we test the learned policies in a perturbed testing environment, shown on the right of Figure 9, and plot the average-reward of the policies πt under this perturbed environment. The results are shown in Figure 10(b).
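The average-reward curves in Figure 10(b) can be obtained, for any fixed stationary policy and any tabular test kernel, from the stationary distribution of the induced Markov chain. The following sketch shows one way to do this; the test kernel, reward table, and greedy policy below are placeholders rather than the quantities used in our experiments.

```python
import numpy as np

def average_reward(P, r, pi, iters=10000):
    """Average reward of a stationary policy pi in a tabular MDP (P, r).
    P[s, a] is the next-state distribution, r[s, a] the reward, and pi[s] the action
    distribution at state s.  Returns g = sum_s d(s) sum_a pi(a|s) r(s, a), where d is
    the stationary distribution of the induced chain (computed by damped power
    iteration; the 1/2 damping removes periodicity without changing d)."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)        # induced state-transition matrix
    r_pi = np.einsum('sa,sa->s', pi, r)          # induced expected reward per state
    d = np.full(S, 1.0 / S)
    for _ in range(iters):
        d = 0.5 * d + 0.5 * (d @ P_pi)
    return d @ r_pi

# placeholder example: evaluate a deterministic (greedy) policy on a perturbed test kernel
rng = np.random.default_rng(6)
S, A = 2, 2
P_test = rng.dirichlet(np.ones(S), size=(S, A))  # stand-in for the perturbed test kernel
r = rng.uniform(size=(S, A))                     # stand-in reward table
greedy_actions = np.array([0, 1])                # stand-in greedy policy, e.g. read off a learned Q-table
pi = np.eye(A)[greedy_actions]                   # one-hot action distribution per state
print("average reward under the test kernel:", average_reward(P_test, r, pi))
```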
Figure 10: One-Loop Task. (a) Qt(s1, al) − Qt(s1, ar). (b) Average-reward under the testing MDP.

As the results show, our robust RVI Q-learning, although trained in the nominal environment, learns a more robust policy that obtains a higher average-reward in the perturbed environment, whereas the non-robust Q-learning learns a policy that is optimal w.r.t. the nominal environment but performs worse when the environment is perturbed. This verifies that our algorithm is more robust than the vanilla method.
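For completeness, we include a rough end-to-end sketch of the training part of this comparison. The nominal kernel follows the one-loop structure described above; the reward values, uncertainty radius, step size, offset function, and number of iterations are illustrative placeholders rather than the settings used for Figures 9 and 10, so this sketch is not guaranteed to reproduce the qualitative separation observed there. The non-robust baseline is recovered by setting δ = 0, and the learned greedy policies can then be evaluated on a perturbed kernel with the average_reward routine sketched before Figure 10.

```python
import numpy as np

# One-loop MDP as described above: states (0 = s1, 1 = s2), actions (0 = a_l, 1 = a_r);
# in the nominal kernel, a_l moves to s1 and a_r moves to s2 from either state.
S, A = 2, 2
P_nominal = np.zeros((S, A, S))
P_nominal[:, 0, 0] = 1.0                          # a_l -> s1
P_nominal[:, 1, 1] = 1.0                          # a_r -> s2
r = np.array([[0.0, 2.0],                         # placeholder rewards (NOT the values used in the paper)
              [0.0, 2.0]])

def rvi_q_learning(delta, iters=100000, seed=0):
    """Robust RVI Q-learning under the delta-contamination set, run on samples from
    the nominal kernel; delta = 0 recovers the non-robust baseline.  Returns the final
    Q-table and the trace of Q_t(s1, a_l) - Q_t(s1, a_r), as plotted in Figure 10(a)."""
    rng = np.random.default_rng(seed)
    Q, trace = np.zeros((S, A)), []
    f = lambda Q: Q[0].max()                      # offset function (assumed choice)
    for n in range(1, iters + 1):
        alpha = 1.0 / (1.0 + 1e-3 * n)            # illustrative step size
        s, a = rng.integers(S), rng.integers(A)   # uniformly explored state-action pair
        s_next = rng.choice(S, p=P_nominal[s, a])
        VQ = Q.max(axis=1)
        H_hat = r[s, a] + (1 - delta) * VQ[s_next] + delta * VQ.min()
        Q[s, a] += alpha * (H_hat - f(Q) - Q[s, a])
        trace.append(Q[0, 0] - Q[0, 1])
    return Q, trace

Q_rob, trace_rob = rvi_q_learning(delta=0.4)      # robust learner (radius assumed)
Q_van, trace_van = rvi_q_learning(delta=0.0)      # vanilla non-robust baseline
print("robust greedy action at s1 :", "a_l" if Q_rob[0, 0] > Q_rob[0, 1] else "a_r")
print("vanilla greedy action at s1:", "a_l" if Q_van[0, 0] > Q_van[0, 1] else "a_r")
```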