Journal of Machine Learning Research 25 (2024) 1-77. Submitted 4/23; Revised 8/24; Published 10/24.

Sample Complexity of Variance-Reduced Distributionally Robust Q-Learning

Shengbo Wang (shengbo.wang@stanford.edu), Department of Management Science and Engineering, Stanford University
Nian Si (niansi@ust.hk), Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Jose Blanchet (jose.blanchet@stanford.edu), Department of Management Science and Engineering, Stanford University
Zhengyuan Zhou (zz26@stern.nyu.edu), Stern School of Business, New York University

Editor: Zhihua Zhang

Abstract

Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: the distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the q-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with a Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$, where $S$ and $A$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity-theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.

Keywords: sample complexity, reinforcement learning, distributional robustness, Q-learning, stochastic approximation.

(c) 2024 Shengbo Wang, Nian Si, Jose Blanchet, Zhengyuan Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0526.html.

1. Introduction

Reinforcement learning (RL) (Sutton and Barto, 2018) focuses on how agents can learn to make optimal decisions in uncertain and dynamic environments. It is based on the principle of trial-and-error learning, where the agent interacts with the environment, receives rewards or penalties for its actions, and adjusts its behavior to maximize the expected long-term reward.

A significant obstacle in RL is the limited interaction between the agent and the environment, often due to factors such as data-collection costs or safety constraints. To overcome this, practitioners often rely on historical datasets or simulation environments to train the agent. However, this approach can suffer from distributional shifts (Quinonero-Candela et al., 2008) between the real-world environment and the data-collection/simulation environment, potentially leading to suboptimal learned policies when deployed in the actual environment. It has also been observed in RL environments that an agent trained this way can be vulnerable to adversarial attacks (Lin et al., 2017; Pan et al., 2019).
To tackle these challenges, distributionally robust reinforcement learning (DR-RL) (Zhou et al., 2021; Yang et al., 2021; Liu et al., 2022; Shi and Chi, 2022; Wang et al., 2023b) has emerged as a promising approach. DR-RL seeks to learn policies that are robust to distributional shifts in the environment by explicitly considering a family of possible distributions that the agent may encounter during deployment. This approach allows the agent to learn a policy that performs well across a range of environments, rather than just the one it was trained on. These benefits of distributionally robust policies motivate the exploration of a critical question:

Can we construct efficient reinforcement learning algorithms that achieve the desired robustness properties while also providing provable guarantees on their sample complexity?

A growing body of literature aims to understand the sample complexities of distributionally robust reinforcement learning. Specifically, we are interested in a robust tabular Markov decision process (MDP) with state space S and action space A, in the discounted infinite-horizon setting with discount factor $\gamma$. To account for uncertainty, we use an ambiguity set based on the Kullback-Leibler (KL) divergence with ambiguity size $\delta$, which is arguably the most natural and challenging divergence in the distributionally robust literature. Previous research has mainly focused on the model-based approach, where a specific model of the environment is estimated and value iteration (VI) is run on the estimated model. Table 1 shows the worst-case sample complexity of model-based distributionally robust RL, with Shi and Chi (2022) proposing a method with state-of-the-art sample complexity in terms of $|S|, |A|, (1-\gamma)^{-1}, \epsilon$.

Algorithm  | Sample Complexity | Origin
DRVI       | $O(|S|^2|A|\, e^{O((1-\gamma)^{-1})} (1-\gamma)^{-4}\epsilon^{-2}\delta^{-2})$ | Zhou et al. (2021)
REVI/DRVI  | $O(|S|^2|A|\, e^{O((1-\gamma)^{-1})} (1-\gamma)^{-4}\epsilon^{-2}\delta^{-2})$ | Panaganti and Kalathil (2021)
DRVI       | $O(|S|^2|A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-2}\delta^{-2})$ | Yang et al. (2021)
DRVI-LCB   | $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-1}\delta^{-2})$ | Shi and Chi (2022)

Table 1: Summary of sample complexity upper bounds for finding an $\epsilon$-optimal robust policy in model-based distributionally robust RL ($p_\wedge$ is the minimal support probability of the nominal MDP; see Definition 6).

1.1 Our Motivation

The emerging line of work mentioned above reflects the growing interest and fruitful results in the pursuit of sample-efficient distributionally robust reinforcement learning. At the same time, a closer scrutiny of the results suggests that two fundamental aspects of the problem are inadequately addressed. For one thing, the complexity bounds of existing results exhibit an $O(\delta^{-2})$ dependence as $\delta \to 0$. This increase in the complexity bounds appears to reflect an increased need for learning the training environment as the training and adversarial environments become more alike. At the surface level, this makes sense: in the extreme case where $\delta \to \infty$, (assuming known support of the distributions) no samples are needed to find an optimal distributionally robust policy. Nevertheless, such bounds fail to align with the continuity property of the robust MDP: the robust value function should converge to the non-robust optimal cumulative reward as $\delta \to 0$. Therefore, for all sufficiently small $\delta$ that may depend on the training environment and $\epsilon$, the robust value function can be approximated by the output of a classical RL algorithm. Specifically, we expect an algorithm and analysis with an $O(1)$ dependence as $\delta \to 0$.
This is presently absent in the literature. Additionally, with the exception of Wang et al. (2023b) (discussed in more detail in the next subsection), all existing distributionally robust policy learning algorithms that have finite-sample guarantees (such as the ones mentioned above (Zhou et al., 2021; Panaganti and Kalathil, 2021; Yang et al., 2021; Shi and Chi, 2022)) are model-based, i.e., they estimate the underlying MDP first before extracting a policy from it. Although model-based methods are often more sample-efficient and easier to analyze, their drawbacks are also well understood (Sutton and Barto, 2018; François-Lavet et al., 2018): they are computationally intensive, they require more memory to store MDP models, and they often do not generalize well to non-tabular RL settings. These issues limit the practical applicability of model-based algorithms, in contrast to model-free algorithms that learn to select actions without first learning an MDP model. Such methods are often more computationally efficient, have less storage overhead, and generalize better to RL with function approximation. In particular, Q-learning (Watkins and Dayan, 1992), as the prototypical model-free learning algorithm, has been widely studied theoretically and deployed in practical applications. However, Q-learning is not robust (as demonstrated in our simulations), and the policy learned by Q-learning in one environment can perform poorly in another under a worst-case shift (with bounded magnitude). As such, the above discussion naturally motivates the following research question:

Can we design a variant of Q-learning that is distributionally robust, where the sample complexity has the right scaling with $\delta$?

1.2 Our Contributions

We answer the above question affirmatively and contribute to the existing literature on the worst-case sample complexity theory of model-free distributionally robust RL. We propose two distributionally robust variants of the Q-learning algorithm (Watkins and Dayan, 1992), namely DR Q-learning (Algorithm 1) and variance-reduced DR Q-learning (Algorithm 2), which effectively solve the DR-RL problem under the KL ambiguity set. The proposed algorithms operate efficiently under the assumption of limited power of the adversary (as per Assumption 1), which is realistic in many real-world applications. We prove that both algorithms have near-optimal worst-case sample complexity guarantees in this regime. Additionally, the variance-reduced version exhibits superior complexity dependence on the effective horizon $(1-\gamma)^{-1}$, as shown in Table 2. To the best of our knowledge, both algorithms and their worst-case sample complexity upper bounds represent state-of-the-art results in model-free distributionally robust RL. Moreover, our sample complexity upper bound for variance-reduced DR Q-learning matches the best-known upper bound for this DR-RL problem in Shi and Chi (2022) in terms of the $\epsilon^{-2}$ and $(1-\gamma)^{-4}$ dependence.

Algorithm                       | Sample Complexity | Origin
MLMC DR Q-learning              | $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2} p_\wedge^{-6}\delta^{-4})$ | Wang et al. (2023b)
DR Q-learning                   | $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2} p_\wedge^{-3})$ | Theorem 9
Variance-reduced DR Q-learning  | $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-3})$ | Theorem 13

Table 2: Summary of sample complexity upper bounds for finding an $\epsilon$-optimal robust policy in model-free distributionally robust RL ($p_\wedge$ is the minimal support probability of the nominal MDP; see Definition 6).

The DR Q-learning Algorithm 1 is a direct extension of mini-batch Q-learning.
Compared to the MLMC DR Q-learning method proposed by Wang et al. (2023b), Algorithm 1 is easier to implement in real-world applications. Additionally, this approach allows for the design of a more sophisticated variant, the variance-reduced DR Q-learning, which provides a provable enhancement of the worst-case sample complexity guarantee of DR Q-learning. To achieve this improvement, we leverage Wainwright's variance reduction technique and algorithm structure (Wainwright, 2019a), adapting it to the DR-RL context and redesigning the variance reduction scheme accordingly.

Both DR Q-learning and its variance-reduced version use a stochastic approximation (SA) step to iteratively update the estimator of the optimal DR q-function towards the fixed point of the population DR Bellman operator. However, both algorithms involve a bias that must be controlled at the algorithmic and iterative-update levels. Our contribution to the literature lies in the near-optimal analysis of the biased SA resulting from DR Q-learning and its variance-reduced version. This analysis also generalizes to settings where the biased stochastic version of the contraction mapping is a monotonic contraction.

We highlight that these are the first algorithmic complexity results showing that the worst-case complexity dependence on the uncertainty set size $\delta$ is $O(1)$ as $\delta \to 0$ for the DR-RL problem with a KL ambiguity set. This resolves the issue of worst-case complexity bounds blowing up as $\delta$ approaches 0, a problem present in all previous works, including both model-based and model-free approaches (Yang et al., 2021; Panaganti and Kalathil, 2021; Shi and Chi, 2022; Wang et al., 2023b). The significance of this characteristic lies in its theoretical illustration that as the adversary's power $\delta$ approaches 0, not only does the solution to the DR-RL problem converge to that of the non-robust version, but so does the sample complexity required to solve it. This sheds light on the connection between robust and non-robust RL problems, indicating that in more general settings and real-world applications, DR-RL problems with function approximation may be efficiently addressed by utilizing variants of the corresponding approaches for non-robust RL problems.

1.3 Literature Review

This section is dedicated to reviewing the literature that is relevant to our work. The literature on RL and MDPs is extensive. One major line of research focuses on developing algorithms that can efficiently learn policies to maximize cumulative discounted rewards. When discussing RL and MDP problems, we will concentrate on this infinite-horizon discounted reward formulation.

Minimax Sample Complexity of Tabular RL: Recent years have seen significant developments in the worst-case sample complexity theory of tabular RL. Two principles, namely model-based and model-free, have motivated distinct algorithmic designs. In the model-based approach, the controller aims to gather a dataset so as to construct an empirical model of the underlying MDP and solve it using variations of the dynamic programming principle. A line of research (Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2020; Li et al., 2022) has proposed model-based algorithms and proven optimal upper bounds for achieving an $\epsilon$-optimal policy, with a matching lower bound $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ proven in Azar et al. (2013). In contrast, the model-free approach involves maintaining only lower-dimensional statistics of the transition data, which are iteratively updated.
As one of the most well-known model-free algorithms, Q-learning has had its sample complexity extensively studied (Even-Dar et al., 2003; Wainwright, 2019b; Chen et al., 2020; Li et al., 2021). However, Li et al. (2021) have shown that Q-learning has a minimax sample complexity of $\Theta(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$, which does not match the lower bound $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$. Nevertheless, variance-reduced variants of Q-learning, such as the one proposed in Wainwright (2019a), achieve the aforementioned sample complexity lower bound. Other algorithmic techniques such as Polyak-Ruppert averaging (Li et al., 2023) have also been shown to result in optimal sample complexity.

Finite-Time Analysis of SA: The classical theory of asymptotic convergence for SA has been extensively studied, as seen in Kushner and Yin (2013). Recent progress in the minimax and instance-dependent sample complexity theory of Q-learning and its variants has been aided by advances in the finite-time analysis of SA. Traditional RL research focuses on settings where the random operator is unbiased. Wainwright (2019b) demonstrated a sample-path bound for the SA recursion, which enables the use of variance reduction techniques to achieve optimal learning rates. In contrast, Chen et al. (2020, 2022) provided finite-sample guarantees for SA under only a second-moment bound on the martingale difference noise sequence. Additionally, research has been conducted on the non-asymptotic analysis of SA procedures in the presence of bias, as documented in Karimi et al. (2019) and Wang (2022).

Robust MDP and RL: Our work draws upon the theoretical framework of classical max-min control and robust MDPs, as established in previous works (González-Trejo et al., 2002; Iyengar, 2005; Nilim and El Ghaoui, 2005; Wiesemann et al., 2013; Xu and Mannor, 2010; Shapiro, 2022; Wang et al., 2024). These works have established the concept of distributional robustness in dynamic decision making. In particular, González-Trejo et al. (2002), Iyengar (2005), and Nilim and El Ghaoui (2005) established the distributionally robust dynamic programming principles for SA-rectangular adversaries under symmetric information structures, while Wiesemann et al. (2013) and Wang et al. (2024) study asymmetric settings, leading to the same DR Bellman equation. Recent research has shown great interest in learning DR policies from data (Si et al., 2020; Zhou et al., 2021; Yang et al., 2021; Liu et al., 2022; Shi and Chi, 2022; Wang et al., 2023b; Yang et al., 2023). For instance, Si et al. (2020) studied the contextual bandit setting, while Zhou et al. (2021), Panaganti and Kalathil (2021), Yang et al. (2021), and Shi and Chi (2022) focused on the model-based tabular RL setting. On the other hand, Liu et al. (2022), Wang et al. (2023b), and Yang et al. (2023) tackled the DR-RL problem using a model-free approach.¹ Before our work, the best worst-case sample complexity upper bound for DR-RL under the KL ambiguity set was established for the model-based DRVI-LCB algorithm, as proposed and analyzed by Shi and Chi (2022). Their analysis showed that the worst-case sample complexity has an upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2}\delta^{-2}p_\wedge^{-1})$.

2. Distributionally Robust Reinforcement Learning

2.1 Classical Tabular Reinforcement Learning

Let $M_0 = (S, A, R, P_0, N_0, \gamma)$ be a Markov decision process (MDP), where $S$, $A$, and $R \subseteq \mathbb{R}_+$ are finite state, action, and reward spaces,² respectively. Let $\mathcal{P}(U)$, where $U = S, A, R$, denote the set of probability measures on the power set $2^U$.
Then $P_0 = \{p_{s,a} \in \mathcal{P}(S) : s \in S, a \in A\}$ and $N_0 = \{\nu_{s,a} \in \mathcal{P}(R) : s \in S, a \in A\}$ are the sets of transition and reward distributions, respectively. $\gamma \in (0, 1)$ is the discount factor. Define $r_{\max} = \max\{r \in R\}$ as the maximum reward. At each time $t$, given that the state process is at $S_t$ and the decision maker takes action $A_t$, the subsequent state is determined by the conditional distribution $S_{t+1} \sim p_{S_t, A_t}$. Then, a randomized reward $R_t \sim \nu_{S_t, A_t}$ is collected, independent of the history. Let $\Pi$ be the history-dependent policy class (see Wang et al. (2024) for a rigorous construction). For $\pi \in \Pi$, the value function $v^\pi(s)$ is defined as

$$v^\pi(s) := E^\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s\Big].$$

The optimal value function is $v^*(s) := \max_{\pi \in \Pi} v^\pi(s)$, $s \in S$. It is well known that the optimal value function is the unique solution of the following Bellman equation:

$$v^*(s) = \max_{a \in A}\big\{ E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}[v^*(S)] \big\},$$

where the expectations are taken over $R \sim \nu_{s,a}$ and $S \sim p_{s,a}$, respectively.

1. Liu et al. (2022)'s algorithm is infeasible: it requires an infinite number of samples in expectation for each iteration, and only asymptotic convergence is established with an infinite number of iterations.
2. We assume a finite reward space for simplicity. However, our results can be extended to continuous reward spaces by imposing a minimum density assumption, as described in Si et al. (2020).

An important implication of the Bellman equation is that it suffices to optimize within the stationary Markovian deterministic policy class. We define the optimal q-function as $q^*(s, a) := E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}[v^*(S)]$. It is well known that $q^*$ satisfies its own Bellman equation

$$q^*(s, a) = E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}\Big[\max_{b \in A} q^*(S, b)\Big].$$

An optimal policy can be constructed as $\pi^*(s) = \arg\max_{a \in A} q^*(s, a)$. Therefore, policy learning in RL environments can be achieved if we can learn a good estimate of $q^*$.

2.2 Kullback-Leibler Divergence Constrained DR-RL

We consider a DR-RL setting where the adversary is constrained to perturb both transition probabilities and rewards within a KL divergence ball of radius $\delta$. Specifically, for a probability measure $Q$ that is absolutely continuous w.r.t. $P$ on some measurable space $(\Omega, \mathcal{F})$, denoted by $Q \ll P$, define

$$D_{\mathrm{KL}}(Q\|P) := \int_\Omega \frac{dQ}{dP}(\omega)\,\log\frac{dQ}{dP}(\omega)\, P(d\omega), \qquad (2.1)$$

where $dQ/dP$ is the Radon-Nikodym derivative. For each $(s, a) \in S \times A$ and $\delta > 0$, we define the KL ambiguity sets centered at $p_{s,a} \in P_0$ and $\nu_{s,a} \in N_0$ of radius $\delta$ by

$$\mathcal{P}_{s,a}(\delta) := \{p : D_{\mathrm{KL}}(p\|p_{s,a}) \le \delta\}, \qquad \mathcal{N}_{s,a}(\delta) := \{\nu : D_{\mathrm{KL}}(\nu\|\nu_{s,a}) \le \delta\}. \qquad (2.2)$$

These ambiguity sets represent the possible distributional shifts from the reference model $P_0, N_0$. In particular, the parameter $\delta > 0$ controls the size of the ambiguity sets, quantifying the power of the adversary. With these definitions in mind, we define the DR optimal value function as the solution to a fixed-point equation, a.k.a. the DR Bellman equation, which serves as the learning objective of this paper.

Definition 1 The DR Bellman operator $\mathcal{B}_\delta$ for the value function is defined as the mapping

$$\mathcal{B}_\delta(v)(s) := \max_{a \in A} \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v(S)] \big). \qquad (2.3)$$

Define the DR optimal value function $v^*_\delta$ as the solution of the DR Bellman equation:

$$v^*_\delta = \mathcal{B}_\delta(v^*_\delta). \qquad (2.4)$$

Moving forward, we will suppress the explicit dependence on $\delta$. The DR Bellman equation has a unique solution as the fixed point of $\mathcal{B}$, which is a consequence of $\mathcal{B}$ being a contraction operator. Furthermore, the solution is equal to the max-min control optimal value of an SA-rectangular distributionally robust MDP (DRMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005; Wiesemann et al., 2013).
Specifically, this max-min optimal value is given by

$$u^*(s) := \sup_{\pi \in \Pi}\, \inf_{\kappa \in K}\, E^{\pi,\kappa}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s\Big], \qquad (2.5)$$

where $\Pi$ is the history-dependent policy class, and the adversary chooses a policy $\kappa$ from an adversarial ambiguity set $K$ that is induced by the KL ambiguity sets in (2.2). Intuitively, this value represents the optimal reward in the following adversarial environment: when the controller selects a policy $\pi$, an adversary observes this policy and then chooses a counter-policy that determines the sequence of reward and transition distributions. The adversary's choice is constrained such that the reward and transition distributions induced by the counter-policy lie within the ambiguity sets (2.2) of radius $\delta$. The decisions made by both the controller and the adversary uniquely specify the law of the state-action-reward process, thereby determining the value of the policy pair $(\pi, \kappa)$.

The equivalence of the max-min control optimal value (2.5) and the solution to the DR Bellman equation (2.4) shows the optimality of stationary deterministic Markov control policies and stationary Markovian adversarial distribution choices. This equivalence, known as the dynamic programming principle (DPP), is explored in detail in Wang et al. (2024), where the adversary and controller can have asymmetric information structures; we refer the interested reader to that paper. We note that Wang et al. (2024) considers a setting where the reward is not randomized, i.e., $\mathcal{N}_{s,a} = \{\delta_{r(s,a)}\}$ for some reward function $r : S \times A \to [0, 1]$. However, it is straightforward to generalize the DPP to include randomized rewards in the SA-rectangular setting.

2.3 Dual and q-Function Formulations

The right-hand side of (2.3) can be challenging to work with because the measure underlying the expectations is not directly accessible. To address this, we use strong duality to reveal the dependence of the value on the reference transition and reward distributions, $P_0$ and $N_0$. Specifically, we consider the dual representation:

Lemma 2 (Hu and Hong (2013), Theorem 1) Let $X$ be a random variable and $\mu_0$ be a probability measure on $(\Omega, \mathcal{F})$ s.t. $X$ has a finite moment generating function in a neighborhood of zero. Then for any $\delta > 0$,

$$\inf_{\mu : D_{\mathrm{KL}}(\mu\|\mu_0) \le \delta} E_\mu[X] = \sup_{\alpha \ge 0}\Big\{ -\alpha \log E_{\mu_0}\big[e^{-X/\alpha}\big] - \alpha\delta \Big\}.$$

Since the rewards and values are bounded, directly applying Lemma 2 to the r.h.s. of (2.4), the DR value function $v^*$ in fact satisfies the following dual form of the DR Bellman equation:

$$v^*(s) = \max_{a \in A}\Big( \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a}}\big[e^{-v^*(S)/\beta}\big] - \beta\delta \big\} \Big).$$

Similar to the traditional RL policy learning approach, we utilize the optimal DR state-action value function, also known as the q-function, to address the DR-RL problem. The q-function assigns real numbers to pairs of states and actions, and can be represented as a matrix $q \in \mathbb{R}^{S \times A}$. From now on, we will assume this representation. To simplify notation, let us define

$$v(q)(s) := \max_{b \in A} q(s, b), \qquad (2.6)$$

which is the value function induced by the q-function $q(\cdot, \cdot)$. We proceed to rigorously define the optimal q-function and its Bellman equation.

Definition 3 The optimal DR q-function is defined as

$$q^*(s, a) := \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v^*(S)] \big), \qquad (2.7)$$

where $v^*$ is the DR optimal value function in Definition 1.
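To make the dual representation in Lemma 2 concrete, the following minimal Python sketch (ours, not from the paper) evaluates the one-dimensional dual objective $\sup_{\alpha \ge 0}\{-\alpha\log E_{\mu_0}[e^{-X/\alpha}] - \alpha\delta\}$ for a discrete reference measure by a plain grid search; the function names, the grid-search solver, and the grid bounds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def _logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def kl_dual_inf_mean(values, probs, delta, num_alpha=2000, alpha_max=100.0):
    """Dual form in Lemma 2:
        inf_{D_KL(mu||mu_0) <= delta} E_mu[X]
          = sup_{alpha >= 0} { -alpha * log E_{mu_0}[exp(-X/alpha)] - alpha*delta }.
    `values`/`probs` describe the discrete reference measure mu_0 of X.
    A crude grid search over alpha stands in for a proper 1-d solver."""
    values = np.asarray(values, dtype=float)
    logp = np.log(np.asarray(probs, dtype=float))
    best = float(values.min())  # the alpha -> 0+ limit of the dual objective
    for alpha in np.linspace(1e-6, alpha_max, num_alpha):
        obj = -alpha * _logsumexp(logp - values / alpha) - alpha * delta
        best = max(best, obj)
    return best

if __name__ == "__main__":
    vals, probs = np.array([0.0, 1.0]), np.array([0.5, 0.5])
    for delta in [0.0, 0.01, 0.1, 1.0]:
        # delta = 0 approximately recovers the nominal mean 0.5; as delta grows,
        # the robust value moves toward the essential infimum 0.0.
        print(delta, kl_dual_inf_mean(vals, probs, delta))
```

As $\alpha \to 0^+$ the objective tends to the essential infimum of $X$, and as $\alpha \to \infty$ it behaves like $E_{\mu_0}[X] - \alpha\delta$, so for $\delta > 0$ the supremum is attained at a finite $\alpha$; this is the scalar optimization that reappears in the empirical Bellman operators below.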
Similar to the Bellman operator, we can define the DR Bellman operator for the q-function as follows:

Definition 4 Given $\delta > 0$ and $q \in \mathbb{R}^{S \times A}$, the primal form of the DR Bellman operator $\mathcal{T} : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$ is defined as

$$\mathcal{T}(q)(s, a) := \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v(q)(S)] \big). \qquad (2.8)$$

The dual form of the DR Bellman operator is

$$\mathcal{T}(q)(s, a) = \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a}}\big[e^{-v(q)(S)/\beta}\big] - \beta\delta \big\}. \qquad (2.9)$$

The equivalence of the primal and dual forms follows from Lemma 2. We remark that the dual form is usually easier to work with, as the outer supremum is a 1-d optimization problem and the dependence on the reference measures $\nu_{s,a}$ and $p_{s,a}$ is explicit.

Note that by definition (2.7) and the Bellman equation (2.4), we have $v(q^*) = v^*$. So, our definition implies that $q^*$ is a fixed point of $\mathcal{T}$ and the following Bellman equation for the $q^*$-function holds:

$$q^* = \mathcal{T}(q^*). \qquad (2.10)$$

The uniqueness of the fixed point $q^*$ of $\mathcal{T}$ follows from the contraction property of the operator $\mathcal{T}$; c.f. Lemma 19. The optimal DR policy can be extracted from the optimal q-function by $\pi^*(s) = \arg\max_{a \in A} q^*(s, a)$. Hence the goal of the DR-RL paradigm is to learn the DR q-function and extract the corresponding robust policy.

2.4 Synchronous Q-Learning and Stochastic Approximation

Q-learning estimates the optimal q-function by iteratively updating the estimators $\{q_k : k \ge 0\}$ using samples generated by the reference measures. The classical synchronous Q-learning proceeds as follows. At iteration $k \in \mathbb{Z}_{\ge 0}$ and for each $(s, a) \in S \times A$, we draw samples $R_{k+1} \sim \nu_{s,a}$ and $S_{k+1} \sim p_{s,a}$. Then we perform the Q-learning update

$$q_{k+1}(s, a) = (1 - \lambda_k) q_k(s, a) + \lambda_k \big( R_{k+1} + \gamma v(q_k)(S_{k+1}) \big) \qquad (2.11)$$

for some chosen step-size sequence $\{\lambda_k\}$. The synchronous Q-learning can be analyzed as a stochastic approximation (SA) algorithm. SA for the fixed point of a contraction operator $L$ refers to the class of algorithms using the update

$$X_{k+1} = (1 - \lambda_k) X_k + \lambda_k \big( L(X_k) + W_{k+1} \big), \qquad (2.12)$$

where $\{W_k\}$ is a sequence satisfying $E[W_k \,|\, W_{k-1}, \dots, W_1] = 0$ and some higher-order moment conditions, hence known as the martingale difference noise. The asymptotics of the above recursion are well understood in the literature, as discussed in Kushner and Yin (2013). Recent developments on the finite-time/finite-sample behavior of SA are discussed in the literature review. The Q-learning recursion in (2.11) can be represented as an SA update if we notice that, given any q-function, $R + \gamma v(q)(S)$ is an unbiased estimator of the population Bellman operator applied to $q$. However, the DR Q-learning and its variance-reduced version cannot be formulated in the same way as (2.12) with martingale difference noise, as there is bias present in these algorithms. Consequently, to achieve nearly optimal sample complexity bounds, we must conduct a tight analysis of these algorithms as biased SA, as we will explain in the subsequent sections.
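As a point of reference before the robust algorithms, the synchronous update (2.11) can be written in a few lines. The sketch below is ours and assumes a generative-model interface `sample(s, a, rng)` returning one reward drawn from $\nu_{s,a}$ and one next state drawn from $p_{s,a}$; the rescaled linear stepsize matches the one used later by Algorithms 1 and 2.

```python
import numpy as np

def synchronous_q_learning(sample, n_states, n_actions, gamma, num_iters, rng=None):
    """Classical synchronous Q-learning, update (2.11), with stepsize
    lambda_k = 1 / (1 + (1 - gamma) * k).

    sample(s, a, rng) -> (reward, next_state): assumed generative-model
    interface drawing one sample from nu_{s,a} and one from p_{s,a}."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.zeros((n_states, n_actions))
    for k in range(1, num_iters + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)
        target = np.empty_like(q)
        for s in range(n_states):
            for a in range(n_actions):
                r, s_next = sample(s, a, rng)
                # R + gamma * v(q)(S'), an unbiased draw of the Bellman operator
                target[s, a] = r + gamma * np.max(q[s_next])
        q = (1.0 - lam) * q + lam * target
    return q
```

Here `target` is an unbiased draw of the Bellman operator applied to $q_k$; the robust algorithms below replace it with the empirical DR Bellman operator of Definition 5, which is biased.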
3. The DR Q-Learning and Variance Reduction

This section introduces two model-free algorithms, the DR Q-learning (Section 3.1) and its variance-reduced version (Section 3.2), for learning the optimal q-function of a robust MDP. We also present the upper bounds on their worst-case sample complexity. In addition, we outline the fundamental ideas behind the proof of the sample complexity results in Section 3.3.

Prior to presenting the algorithms, we introduce several notations. Let $\nu_{s,a,n}$ and $p_{s,a,n}$ denote the empirical measures of $\nu_{s,a}$ and $p_{s,a}$ formed by $n$ i.i.d. samples, respectively; i.e., for $f : U \to \mathbb{R}$, where $U$ could be $S$ or $R$,

$$E_{\mu_{s,a,n}} f(U) := \frac{1}{n}\sum_{i=1}^{n} f(U_i) \qquad (3.1)$$

for $\mu = \nu, p$ and $U_i = R_i, S_i$ i.i.d. across $i$. Assuming access to a simulator, we are able to draw samples and construct an empirical version of the DR Bellman operator.

Definition 5 Define the empirical DR Bellman operator on $n$ i.i.d. samples by

$$\widehat{\mathcal{T}}(q)(s, a) := \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a,n}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a,n}}\big[e^{-v(q)(S)/\beta}\big] - \beta\delta \big\}. \qquad (3.2)$$

Note that $\widehat{\mathcal{T}}$ is a random operator whose randomness comes from the samples used to construct $\{\nu_{s,a,n}, p_{s,a,n} : (s, a) \in S \times A\}$. Definition 5 presents the empirical DR Bellman operator in its dual form. Lemma 2 establishes that this definition is equivalent to the DR Bellman operator $\mathcal{T}$ in (2.8) with the sets $\mathcal{P}_{s,a}(\delta)$ and $\mathcal{N}_{s,a}(\delta)$ replaced by their empirical counterparts $\{p : D_{\mathrm{KL}}(p\|p_{s,a,n}) \le \delta\}$ and $\{\nu : D_{\mathrm{KL}}(\nu\|\nu_{s,a,n}) \le \delta\}$.

The dual formulation of the empirical DR Bellman operator implies that it is generally a biased estimator of the population DR Bellman operator $\mathcal{T}$, in the sense that $E[\widehat{\mathcal{T}}(q)] \neq \mathcal{T}(q)$ for a generic $q \in \mathbb{R}^{S \times A}$. This bias poses a significant challenge in the design of model-free algorithms and the analysis of sample complexities. The previous works Liu et al. (2022) and Wang et al. (2023b) eliminate this bias by using a randomized multilevel Monte Carlo estimator. However, the randomization procedure requires a random (and heavy-tailed) sample size. Therefore, the complexity bound is stated in terms of the expected number of samples. Also, this complex algorithmic design limits its generalizability. In contrast, this paper takes a different approach by directly analyzing the DR Q-learning and its variance-reduced version as biased SA. To achieve near-optimal sample complexity guarantees, the bias of the empirical DR Bellman operator and the propagation of the systematic error it causes are tightly controlled, and samples are optimally allocated so that the stochasticity is in balance with the cumulative bias. A detailed discussion of this approach is provided in Section 3.3.

To state the key assumption that constrains the operating regime of our algorithms, we introduce the following complexity metric parameter:

Definition 6 Define the minimum support probability as

$$p_\wedge := \min_{(s,a) \in S \times A} \min\Big\{ \min_{r \in R :\, \nu_{s,a}(r) > 0} \nu_{s,a}(r),\ \min_{s' \in S :\, p_{s,a}(s') > 0} p_{s,a}(s') \Big\}. \qquad (3.3)$$

The intuition behind the dependence of the MDP complexity on the minimal support probability is that, in order to estimate the DR Bellman operator with high accuracy in the worst case, it is necessary to know the entire support of the transition and reward distributions. As a result, at least $1/p_\wedge$ samples are required, as discussed in Wang et al. (2023b). We are now prepared to present the main assumption that defines the operating regime for which our algorithms are optimized.

Assumption 1 (Limited Adversarial Power) Suppose the adversary's power $\delta$ satisfies $\delta < p_\wedge / 24$.

It should be noted that the constant 1/24 is only for mathematical convenience and can potentially be improved. Under this assumption, the adversary cannot collapse the support of the transition or reward distributions to a singleton, preventing it from completely ruling out transition events that are possible under $P_0$. This assumption regime is of practical significance because overly conservative policies can be produced if $\delta$ is large. Furthermore, the support of the reward and transition measures often encodes physical constraints intrinsic to the real environment, which the adversary should not be allowed to violate.
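The sketch below illustrates how one batch of the operator in Definition 5 can be evaluated from $n$ samples per state-action pair. It is our own illustration, not the authors' code: the dual suprema over $\alpha$ and $\beta$ are handled by a crude grid search (any 1-d solver would do), and the data layout and function names are assumptions.

```python
import numpy as np

def _kl_dual(samples, delta, grid):
    """sup_{alpha >= 0} { -alpha * log( mean_i exp(-x_i/alpha) ) - alpha*delta }
    over a finite grid of alpha values (stand-in for a 1-d scalar solver)."""
    x = np.asarray(samples, dtype=float)
    best = float(x.min())  # alpha -> 0+ limit
    for alpha in grid:
        z = -x / alpha
        m = z.max()
        lse = m + np.log(np.mean(np.exp(z - m)))   # log of the empirical MGF term
        best = max(best, -alpha * lse - alpha * delta)
    return best

def empirical_dr_bellman(q, reward_samples, next_state_samples, gamma, delta,
                         grid=np.linspace(1e-3, 20.0, 400)):
    """n-sample empirical DR Bellman operator, dual form (3.2).

    reward_samples[s][a]     : array of n i.i.d. rewards drawn from nu_{s,a}
    next_state_samples[s][a] : int array of n i.i.d. next states drawn from p_{s,a}
    """
    n_states, n_actions = q.shape
    v = q.max(axis=1)                       # v(q)(s) = max_b q(s, b), eq. (2.6)
    tq = np.empty_like(q)
    for s in range(n_states):
        for a in range(n_actions):
            r_term = _kl_dual(reward_samples[s][a], delta, grid)
            v_term = _kl_dual(v[next_state_samples[s][a]], delta, grid)
            tq[s, a] = r_term + gamma * v_term
    return tq
```

Because each $(s, a)$ entry only touches its own batch of samples, the batch can be discarded right after the entry is updated, which is the memory advantage over model-based methods noted for the update rule (3.4) below.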
We also make the following simplifying assumption.

Assumption 2 (Reward Bound) The reward $R \in [0, 1]$.

This assumption is straightforward to remove, given that the results for the empirical Bellman operator hold for $R \subseteq \mathbb{R}_{\ge 0}$. We assume it so as to clarify our presentation.

3.1 The Distributionally Robust Q-learning

First, we propose the DR Q-learning Algorithm 1, a robust version of the classical Q-learning that is based on iteratively updating the q-function by applying the n-sample empirical Bellman operator.

Algorithm 1 Distributionally Robust Q-Learning
  Input: the total number of iterations $k_0$ and a batch size $n_0$.
  Initialization: $q_1 \leftarrow 0$; $k \leftarrow 1$.
  for $1 \le k \le k_0$ do
    Sample $\widehat{\mathcal{T}}_{k+1}$, the $n_0$-sample empirical DR Bellman operator as in Definition 5.
    Compute the Q-learning update
      $$q_{k+1} = (1 - \lambda_k) q_k + \lambda_k \widehat{\mathcal{T}}_{k+1}(q_k) \qquad (3.4)$$
    with stepsize $\lambda_k = 1/(1 + (1 - \gamma)k)$.
  end for
  return $q_{k_0+1}$.

Algorithm 1 can be viewed as a biased SA: we can rewrite the update (3.4) as

$$q_{k+1} = (1 - \lambda_k) q_k + \lambda_k \mathcal{T}(q_k) + \lambda_k \big( \widehat{\mathcal{T}}_{k+1}(q_k) - \mathcal{T}(q_k) \big).$$

This is in the form of (2.12). However, notice that $E[\widehat{\mathcal{T}}_{k+1}(q_k) - \mathcal{T}(q_k) \,|\, q_k] \neq 0$.

Moreover, we note that the update (3.4) involves computing $\widehat{\mathcal{T}}_{k+1}(q_k)(s, a)$ for all $(s, a) \in S \times A$. Unlike a model-based algorithm, which requires storing the entire empirical kernel and reward distributions $\{p_{s,a,n}, \nu_{s,a,n} : (s, a) \in S \times A\}$, the update rule (3.4) can be implemented separately for each state-action pair. This allows $p_{s,a,n}$ and $\nu_{s,a,n}$ to be discarded immediately after the update, significantly reducing the memory requirements for running Algorithm 1 when the state space is large.

It turns out that, by leveraging the fact that the empirical Bellman operators are monotone contractions w.p.1 (as proven in Lemma 19), we can perform a stronger pathwise analysis of Algorithm 1 instead of treating it as a variant of the SA update in (2.12). As a result, we will prove in Section B.2 that the DR Q-learning algorithm satisfies the error bound in Proposition 7 below.

To simplify notation, we define the dimensionality parameter $d := |S||A|(|S| \vee |R|)$. It will only show up inside the $\log(\cdot)$ terms in our complexity bounds because of the use of union bound techniques.

Proposition 7 Suppose that Assumptions 1 and 2 are satisfied. The output $q_{k_0+1}$ of the distributionally robust Q-learning satisfies

$$\|q_{k_0+1} - q^*\|_\infty \le c\left( \frac{1}{(1-\gamma)^3 k_0} + \frac{1}{p_\wedge^3 (1-\gamma)^2 n_0} + \frac{1}{p_\wedge (1-\gamma)^{5/2} \sqrt{n_0 k_0}} \right) \big(\log(3dk_0/\eta)\big)^2$$

with probability at least $1 - \eta$, where $c$ is an absolute constant.

By "absolute constant", we mean a constant that does not depend on the complexity metric parameters $\epsilon, p_\wedge, (1-\gamma)^{-1}, \eta, d$. Although the logarithmic term in the above proposition can be further improved, we will not focus on optimizing the logarithmic dependence in this paper. For clarity, we adjust the constant in the logarithmic factor using the inequality $\log(C_1 C_2) = \log(C_1) + \log(C_2) \le C_1 \log(C_2)$ for $C_1 \ge 1$, $C_2 \ge e$, and incorporate $C_1$ into $c$. These adjustments are applied to all subsequent convergence results. The proof of this proposition, which is outlined in Section 3.3, is postponed to Section B.2.

Proposition 7 provides an upper bound on the terminal error of the estimator after $k_0$ iterations of Algorithm 1.
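For concreteness, here is a short sketch of Algorithm 1. It is our own illustration under stated assumptions: `bellman_op` is any routine computing the $n_0$-sample empirical DR Bellman operator (e.g., the sketch given after Definition 5), and `sample_batch` is an assumed generative-model interface; neither name comes from the paper.

```python
import numpy as np

def dr_q_learning(bellman_op, sample_batch, n_states, n_actions, gamma, k0, n0, rng=None):
    """Algorithm 1 (DR Q-learning), a minimal sketch.

    bellman_op(q, rewards, next_states) -> (n_states, n_actions) array:
        empirical DR Bellman operator evaluated on one batch (Definition 5).
    sample_batch(n, rng) -> (rewards, next_states):
        assumed simulator giving, for every (s, a), n i.i.d. rewards from
        nu_{s,a} and n i.i.d. next states from p_{s,a}."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.zeros((n_states, n_actions))
    for k in range(1, k0 + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)        # lambda_k = 1/(1+(1-gamma)k)
        rewards, next_states = sample_batch(n0, rng)  # fresh n0-sample batch
        q = (1.0 - lam) * q + lam * bellman_op(q, rewards, next_states)  # update (3.4)
    return q
```

The total number of simulator draws is $|S||A| n_0 k_0$, which is exactly the quantity counted in the proof of Theorem 9 below.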
This bound consists of three terms that decay at rates $O(k_0^{-1})$, $O(n_0^{-1})$, and $O((k_0 n_0)^{-1/2})$, respectively, where the first and third terms resemble the upper bounds for standard Q-learning and the second term arises because of the bias. We optimize the algorithm parameters to balance these three terms and ensure that the right-hand side of the probability bound in Proposition 7 is less than $\epsilon$. One way to achieve this is by selecting the parameters $n_0$ and $k_0$ as follows:

Corollary 8 Assume Assumptions 1 and 2. Running Algorithm 1 with parameters

$$k_0 = c_0 \frac{1}{(1-\gamma)^3 \epsilon} \log\Big(\frac{3d}{(1-\gamma)\eta\epsilon}\Big)^2 \quad \text{and} \quad n_0 = c_0 \frac{1}{p_\wedge^3 (1-\gamma)^2 \epsilon} \log(3dk_0/\eta)^2$$

will produce an output $q_{k_0+1}$ s.t. $\|q_{k_0+1} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$, where $c_0$ is an absolute constant.

An immediate consequence of Corollary 8 is the following worst-case sample complexity upper bound for the robust Q-learning.

Theorem 9 Assume Assumptions 1 and 2. Then the distributionally robust Q-learning Algorithm 1 with parameters specified in Corollary 8 computes a solution $q_{k_0+1}$ s.t. $\|q_{k_0+1} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^3 (1-\gamma)^5 \epsilon^2} \right)$$

samples.

Proof The total number of samples used is $|S||A| n_0 k_0$, implying the sample complexity upper bound.

Theorem 9 provides a near-optimal worst-case sample complexity guarantee that matches or improves upon the expected-sample-complexity upper bound in Wang et al. (2023b) in every parameter dependence. In particular, we have shown that the dependence on $\delta$ is $O(1)$ as $\delta \to 0$. This resolves the issue of the worst-case complexity bound blowing up as $\delta \to 0$ for KL-divergence-based DR-RL that is present in all prior works (Yang et al., 2021; Panaganti and Kalathil, 2021; Shi and Chi, 2022; Wang et al., 2023b).

3.2 The Variance-Reduced Distributionally Robust Q-learning

We adapt Wainwright's variance-reduced Q-learning (Wainwright, 2019a) to the robust RL setting. This is outlined in Algorithm 2.

Algorithm 2 Variance-Reduced Distributionally Robust Q-Learning
  Input: the number of epochs $l_{vr}$, a sequence of recentering sample sizes $\{m_l\}_{l=1}^{l_{vr}}$, an epoch length $k_{vr}$, and a batch size $n_{vr}$.
  Initialization: $\hat q_0 \leftarrow 0$; $l \leftarrow 1$; $k \leftarrow 1$.
  for $1 \le l \le l_{vr}$ do
    Compute $\widetilde{\mathcal{T}}_l$, the $m_l$-sample empirical DR Bellman operator as in Definition 5.
    Set $q_{l,1} = \hat q_{l-1}$.
    for $1 \le k \le k_{vr}$ do
      Sample $\widehat{\mathcal{T}}_{l,k+1}$, an $n_{vr}$-sample empirical Bellman operator.
      Compute the recentered Q-learning update
        $$q_{l,k+1} = (1 - \lambda_k) q_{l,k} + \lambda_k \Big( \widehat{\mathcal{T}}_{l,k+1}(q_{l,k}) - \widehat{\mathcal{T}}_{l,k+1}(\hat q_{l-1}) + \widetilde{\mathcal{T}}_l(\hat q_{l-1}) \Big) \qquad (3.5)$$
      with stepsize $\lambda_k = 1/(1 + (1 - \gamma)k)$.
    end for
    Set $\hat q_l = q_{l, k_{vr}+1}$.
  end for
  return $\hat q_{l_{vr}}$.

As in the Q-learning case, the update rule (3.5) can be implemented separately for each state-action pair. Thus, Algorithm 2 does not require storing or performing computations with the entire empirical kernel and reward distributions.

Before delving into the convergence rate theory of the variance-reduced DR Q-learning, we provide an intuitive description of this variance reduction scheme. The basic idea is to partition the algorithm into epochs. During each epoch, we perform a recentered version of the stochastic approximation recursion with the aim of eliminating the variance component in the SA iteration (2.12). Specifically, instead of approximating $q^*$ by a single stochastic approximation run, in each epoch, starting with an estimator $\hat q_{l-1}$, we recenter the SA procedure so that it approximates $\mathcal{T}(\hat q_{l-1})$. However, since $\mathcal{T}$ is not known, we use $\widetilde{\mathcal{T}}_l(\hat q_{l-1})$ as a natural estimator. By choosing a sequence of empirical DR Bellman operators with exponentially increasing sample sizes, we expect the errors $\|\hat q_l - q^*\|_\infty$ to decrease exponentially with high probability. This indeed holds true for Algorithm 2.
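The epoch structure of Algorithm 2 translates directly into code. The sketch below is ours; as before, `bellman_op` and `sample_batch` are assumed interfaces (an empirical DR Bellman routine and a generative-model simulator), and the recentering estimate is recomputed once per epoch from a larger batch.

```python
import numpy as np

def vr_dr_q_learning(bellman_op, sample_batch, n_states, n_actions, gamma,
                     l_vr, k_vr, n_vr, m_sizes, rng=None):
    """Algorithm 2 (variance-reduced DR Q-learning), a minimal sketch.

    bellman_op(q, rewards, next_states): empirical DR Bellman operator on one batch.
    sample_batch(n, rng): assumed simulator giving n i.i.d. reward/next-state
        samples for every (s, a).
    m_sizes: recentering sample sizes (m_1, ..., m_{l_vr}), e.g. growing geometrically."""
    rng = np.random.default_rng() if rng is None else rng
    q_hat = np.zeros((n_states, n_actions))
    for l in range(1, l_vr + 1):
        # Epoch recentering: T~_l(q_hat_{l-1}) from a large m_l-sample batch.
        rewards, next_states = sample_batch(m_sizes[l - 1], rng)
        t_center = bellman_op(q_hat, rewards, next_states)
        q = q_hat.copy()
        for k in range(1, k_vr + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)
            rewards, next_states = sample_batch(n_vr, rng)
            # Recentered update (3.5): T^(q_{l,k}) - T^(q_hat_{l-1}) + T~_l(q_hat_{l-1}),
            # with both T^ terms evaluated on the same n_vr-sample batch.
            recentered = (bellman_op(q, rewards, next_states)
                          - bellman_op(q_hat, rewards, next_states) + t_center)
            q = (1.0 - lam) * q + lam * recentered
        q_hat = q  # epoch output, q_hat_l
    return q_hat
```

Because the noise in (3.5) enters only through the difference $\widehat{\mathcal{T}}_{l,k+1}(q_{l,k}) - \widehat{\mathcal{T}}_{l,k+1}(\hat q_{l-1})$, its magnitude shrinks as $q_{l,k}$ approaches $\hat q_{l-1}$, which is what drives the epoch-wise error halving in Proposition 11.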
The outer loop produces a sequence of estimators $\hat q_l$, $l \ge 1$. We will show that if $\hat q_{l-1}$ is within some error of the optimal $q^*$, then $\hat q_l$ will satisfy a concentration bound that is better by a geometric factor. This result is summarized in Proposition 10. Denote by $\mathcal{F}_l$ the $\sigma$-field generated by the random samples used until the end of epoch $l$. We define the conditional expectation $E_{l-1}[\cdot] := E[\cdot \,|\, \mathcal{F}_{l-1}]$ and probability measure $P_{l-1}(\cdot) := E_{l-1}[\mathbf{1}\{\cdot\}]$.

Proposition 10 Suppose that Assumptions 1 and 2 are satisfied. On $\{\omega : \|\hat q_{l-1} - q^*\|_\infty \le b\}$ for some $b \le 1/(1-\gamma)$, under the measure $P_{l-1}(\cdot)(\omega)$, there exists a numerical constant $c$ s.t.

$$\|\hat q_l - q^*\|_\infty \le c\left( \frac{b}{(1-\gamma)^2 k_{vr}} + \frac{b}{p_\wedge^{3/2} (1-\gamma)^{3/2} \sqrt{n_{vr} k_{vr}}} + \frac{b}{p_\wedge^{3/2} (1-\gamma) \sqrt{n_{vr}}} \right) \log(3dk_{vr}/\eta)^2 + \frac{c}{p_\wedge^{3/2} (1-\gamma)^2 \sqrt{m_l}}$$

w.p. at least $1 - \eta$, provided that $m_l \ge 8 p_\wedge^{-2} \log(24d/\eta)$ and $n_{vr} \ge p_\wedge^{-1}$.

Proposition 10 implies that if the variance-reduced algorithm finds an approximation of $q^*$ within infinity-norm error $b$, then the error after one epoch improves accordingly with high probability. This and the Markovian nature of the sequence $\{\hat q_l\}$ imply a high-probability bound for trajectories satisfying the pathwise property $\{\omega : \forall\, l \le l_{vr} : \|\hat q_l - q^*\|_\infty \le b_l\}$. This is formalized by the next result, where we use $b_l = 2^{-l}(1-\gamma)^{-1}$.

Let us define the following parameter choice: for a sufficiently large absolute constant $c_{vr}$ that does not depend on the complexity metric parameters $\epsilon, p_\wedge, (1-\gamma)^{-1}, \eta, d$, define $l_{vr} = \lceil \log_2(\epsilon^{-1}(1-\gamma)^{-1}) \rceil$ and

$$k_{vr} = c_{vr} \frac{1}{(1-\gamma)^2} \log\Big(\frac{3dl_{vr}}{(1-\gamma)\eta}\Big), \quad n_{vr} = c_{vr} \frac{1}{p_\wedge^3 (1-\gamma)^2} \log(3dk_{vr}l_{vr}/\eta)^4, \quad m_l = c_{vr} \frac{4^l}{p_\wedge^3 (1-\gamma)^2} \log(3dl_{vr}/\eta)^2. \qquad (3.6)$$

Notice that evidently $m_l \ge 8 p_\wedge^{-2} \log(24d/\eta)$ and $n_{vr} \ge p_\wedge^{-1}$, satisfying the requirements of Proposition 10.

Proposition 11 Assume Assumptions 1 and 2. For $\epsilon < (1-\gamma)^{-1}$, define the parameters according to (3.6). Then, the sequence $\{\hat q_l, 0 \le l \le l_{vr}\}$ produced by Algorithm 2 satisfies the pathwise property that $\|\hat q_l - q^*\|_\infty \le 2^{-l}(1-\gamma)^{-1}$ for all $0 \le l \le l_{vr}$ w.p. at least $1 - \eta$. In particular, the final estimator $\hat q_{l_{vr}}$ satisfies $\|\hat q_{l_{vr}} - q^*\|_\infty \le 2^{-l_{vr}}(1-\gamma)^{-1}$ w.p. at least $1 - \eta$.

Remark 12 The base of the geometric growth in our choice of $m_l$ in (3.6) can be modified. The same proof as in Proposition 11 suggests that with $m_l = \alpha^{2l}\,\Theta(p_\wedge^{-3}(1-\gamma)^{-2})$ and $l_{vr} = \lceil \log_\alpha(\epsilon^{-1}(1-\gamma)^{-1}) \rceil$ for some $\alpha > 1$, we have $\|\hat q_l - q^*\|_\infty \le \alpha^{-l}(1-\gamma)^{-1}$ for all $0 \le l \le l_{vr}$ with probability at least $1 - \eta$. Running Algorithm 2 with this new parameter choice will yield the same sample complexity as in Theorem 13. The choice of base 4 in (3.6) was made only for clarity of presentation.

Proposition 11 immediately implies the following worst-case sample complexity upper bound.

Theorem 13 Assume Assumptions 1 and 2. For $\epsilon < (1-\gamma)^{-1}$, the variance-reduced DR Q-learning Algorithm 2 with parameters specified in (3.6) computes a solution $\hat q_{l_{vr}}$ s.t. $\|\hat q_{l_{vr}} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^3 (1-\gamma)^4 \min(1, \epsilon^2)} \right)$$

samples.

Proof Given the specified parameters, the total number of samples used is

$$|S||A|\Big( l_{vr} n_{vr} k_{vr} + \sum_{l=1}^{l_{vr}} m_l \Big) = O\left( |S||A| \frac{1}{p_\wedge^3}\Big( \frac{1}{(1-\gamma)^4} + \frac{4^{l_{vr}}}{(1-\gamma)^2} \Big) \right),$$

up to logarithmic factors. Since $4^{l_{vr}} = O(\epsilon^{-2}(1-\gamma)^{-2} \vee 1)$, this simplifies to the claimed result.

Theorem 13 establishes an upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2}p_\wedge^{-3})$ when $\epsilon \le 1$, which is superior to the upper bound $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_\wedge^{-3})$ for Algorithm 1 (see Theorem 9) in terms of $1-\gamma$. This represents the best-known upper bound for DR-RL problems in the KL case, including both model-free and model-based algorithms (Shi and Chi, 2022). Although Shi and Chi (2022) achieve a similar rate of $O((1-\gamma)^{-4})$, their result suffers from an $O(\delta^{-2})$ dependence, which becomes problematic as $\delta \to 0$. In contrast, our upper bound is free from $\delta$-dependence.
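The schedule (3.6) and the sample-count arithmetic in the proof of Theorem 13 can be checked numerically. The sketch below is illustrative only: the absolute constant $c_{vr}$ is unspecified in the paper, so the value `c_vr=1.0` here is an assumption, as are the function names.

```python
import numpy as np

def vr_parameter_schedule(eps, gamma, p_min, eta, d, c_vr=1.0):
    """Parameter choice (3.6), with l_vr chosen so that 2^{-l_vr}/(1-gamma) <= eps.
    c_vr stands in for the paper's unspecified absolute constant."""
    l_vr = max(1, int(np.ceil(np.log2(1.0 / (eps * (1.0 - gamma))))))
    k_vr = int(np.ceil(c_vr / (1.0 - gamma) ** 2
                       * np.log(3.0 * d * l_vr / ((1.0 - gamma) * eta))))
    n_vr = int(np.ceil(c_vr / (p_min ** 3 * (1.0 - gamma) ** 2)
                       * np.log(3.0 * d * k_vr * l_vr / eta) ** 4))
    m = [int(np.ceil(c_vr * 4.0 ** l / (p_min ** 3 * (1.0 - gamma) ** 2)
                     * np.log(3.0 * d * l_vr / eta) ** 2)) for l in range(1, l_vr + 1)]
    return l_vr, k_vr, n_vr, m

def total_samples(n_states, n_actions, l_vr, k_vr, n_vr, m):
    """|S||A| (l_vr * n_vr * k_vr + sum_l m_l), as counted in Theorem 13's proof."""
    return n_states * n_actions * (l_vr * n_vr * k_vr + sum(m))

if __name__ == "__main__":
    for eps in [0.5, 0.25, 0.125]:
        sched = vr_parameter_schedule(eps, gamma=0.9, p_min=0.1, eta=0.05, d=100)
        # Halving eps roughly quadruples the total, matching the eps^{-2} scaling.
        print(eps, total_samples(2, 2, *sched))
```

Note that $k_{vr}$ and $n_{vr}$ depend on $\epsilon$ only through $\log(l_{vr})$, so the dominant effect of tightening $\epsilon$ is the growth of the last recentering batch $m_{l_{vr}} \propto 4^{l_{vr}} \approx \epsilon^{-2}(1-\gamma)^{-2}$.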
We recall that the information-theoretic lower bound for the sample complexity of the classical tabular RL problem is $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ (Azar et al., 2013). In this setting, the variance-reduced Q-learning algorithm in Wainwright (2019a) is minimax optimal. For distributionally robust RL, Shi and Chi (2022) recently showed that the minimax lower bound dependence on $|S||A|$, $(1-\gamma)^{-1}$, and $\epsilon$ remains $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ when $\delta$ is small. Furthermore, Shi et al. (2024) showed that the information-theoretic lower bound may be $\Omega(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$ when $\delta = O(1)$ for $\chi^2$-divergence uncertainty sets. However, their construction of hard instances violates our Assumption 1. It is currently unknown whether variance-reduced DR Q-learning can achieve those rates. Further refinement of this bound is left for future research.

Notice that the variance-reduced Algorithm 2 has the property that $k_{vr}$, $n_{vr}$, and $m_l$ depend on $\frac{1}{\epsilon}$ only through $\log(l_{vr}) = \Theta(\log\log\frac{1}{\epsilon})$. Therefore, within a reasonable range of $\epsilon$, the algorithm can operate with the sample complexity guarantee in Theorem 13 without needing to tune $k_{vr}$, $n_{vr}$, and $m_l$ based on $\epsilon$. This introduces significant versatility in applications: for example, we can continue to run the algorithm beyond the termination epoch $l_{vr}$ without losing sample efficiency.

3.3 Overview of the Analysis of Algorithms

In this section, we provide a road map for proving the key results, Propositions 7 and 10.

Definition 14 We say that $L$ is a monotonic $\gamma$-quasi-contraction with center $\bar q$ if

$$\|L(q) - L(\bar q)\|_\infty \le \gamma \|q - \bar q\|_\infty, \qquad (3.7)$$

and, entrywise,

$$q_1 \le q_2 \implies L(q_1) \le L(q_2) \qquad (3.8)$$

for all $q, q_1, q_2 \in \mathbb{R}^{|S| \times |A|}$. Moreover, a monotonic $\gamma$-contraction is such that the above relations hold for every center $\bar q \in \mathbb{R}^{|S| \times |A|}$.

The term "quasi" refers to the fact that the relation (3.7) is only required for a single center $\bar q$ (Wainwright, 2019b). Therefore, a monotonic $\gamma$-contraction is a quasi-contraction with center $\bar q$ for any $\bar q \in \mathbb{R}^{S \times A}$. The successive application of monotonic $\gamma$-quasi-contractions under the rescaled linear stepsize $\lambda_k = \frac{1}{1 + (1-\gamma)k}$ satisfies the following deterministic bound:

Proposition 15 (Corollary 1, Wainwright (2019b)) Let $\{L_k, k \ge 2\}$ be a family of monotonic $\gamma$-quasi-contractions with center $q^*$, and let $H_k(q) := L_k(q) - L_k(q^*)$ be the recentered operator. Then, for the sequence of step sizes $\{\lambda_k, k \ge 1\}$, the iterates of

$$q_{k+1} - q^* = (1 - \lambda_k)(q_k - q^*) + \lambda_k [H_{k+1}(q_k) + w_{k+1}] \qquad (3.9)$$

admit, for all $k \ge 1$, a deterministic pathwise error bound in terms of the noise-averaging sequence $\{p_k, k \ge 1\}$ defined by $p_1 = 0$ and $p_{k+1} := (1 - \lambda_k)p_k + \lambda_k w_{k+1}$.

A key observation is that the empirical robust Bellman operators $\widehat{\mathcal{T}}_k$, $\widehat{\mathcal{T}}_{l,k}$ used in the iterative updates of Algorithms 1 and 2 are monotonic $\gamma$-contractions (see Lemma 19). In the proof of the main results, we apply the deterministic bound for contraction mappings from Proposition 15 to each sample path of the distributionally robust Q-learning and of the inner loop of the variance-reduced version.

We illustrate this by considering the distributionally robust Q-learning. Since $\{\widehat{\mathcal{T}}_{k+1}, k \ge 0\}$ are monotonic $\gamma$-contractions, they are quasi-contractions with center $q^*$. We can define $H_{k+1}(q) := \widehat{\mathcal{T}}_{k+1}(q) - \widehat{\mathcal{T}}_{k+1}(q^*)$ for all $q \in \mathbb{R}^{S \times A}$. Then, the update rule of Algorithm 1 can be written as

$$q_{k+1} - q^* = (1 - \lambda_k)(q_k - q^*) + \lambda_k \big[ (\widehat{\mathcal{T}}_{k+1}(q_k) - \widehat{\mathcal{T}}_{k+1}(q^*)) + (\widehat{\mathcal{T}}_{k+1}(q^*) - \mathcal{T}(q^*)) \big] = (1 - \lambda_k)(q_k - q^*) + \lambda_k [H_{k+1}(q_k) + W_{k+1}],$$

where $W_{k+1} := \widehat{\mathcal{T}}_{k+1}(q^*) - \mathcal{T}(q^*)$ and we used the Bellman equation (2.10), $q^* = \mathcal{T}(q^*)$.
This representation allows us to apply Proposition 15 to bound the error of the q-function estimate using the sequence $P_1 = 0$ and $P_{k+1} := (1 - \lambda_k)P_k + \lambda_k W_{k+1}$. Note that the only source of randomness in $W_{k+1}$ is $\widehat{\mathcal{T}}_{k+1}(q^*)$, and these are i.i.d. across $k$. Therefore, the process $P$ is a non-stationary auto-regressive (AR) process. It follows that the concentration properties of $P_k$ can be derived from those of $\widehat{\mathcal{T}}_{k+1}(q^*)$.

While standard Q-learning updates utilize an unbiased empirical Bellman operator, the DR empirical Bellman operator is biased due to its non-linearity in the empirical measure (c.f. (3.2)), resulting in $E[W_k] \neq 0$. To achieve a canonical error rate of $O(n^{-1/2})$, it is necessary that both the bias and the standard deviation of the n-sample DR empirical Bellman operator are $O(n^{-1/2})$. However, our DR Q-learning algorithms require an additional condition: the one-step bias must be of order $O(n^{-1})$. This is because the final bias, which is the systematic error resulting from the repeated use of the DR Bellman estimator, is compounded from the one-step bias through the model-free Q-learning updates. We illustrate this algorithmic behavior using a simple biased SA instance in Appendix B.1. This imposes significant challenges on the design and analysis of our model-free algorithms. Fortunately, we are able to establish tight bounds (in $n_0$ and $\delta$) on the bias, c.f. Proposition 22, in the important regime where $\delta$ is small, as stated in Assumption 1. These bounds are central to our sample complexity analysis. We summarize the relevant bounds on the variance and bias of the empirical DR Bellman operator in Section A. By utilizing these variance and bias bounds, we can efficiently allocate samples such that the systematic error due to bias is balanced with the stochasticity in the estimator at the termination of the algorithm. With this optimal sample allocation, we can establish the worst-case sample complexity bounds as claimed.

The theory for the convergence rate of the variance-reduced DR Q-learning is more complex. In order to achieve the geometric convergence in Proposition 11, an $O(n^{-1})$ bias bound on the empirical DR Bellman operator is not enough. However, by introducing a recentered dynamics, a similar recursion can be derived in this context if we consider the conditionally recentered noise $H_{l,k+1}(\hat q_{l-1}) - E[H_{l,k+1}(\hat q_{l-1}) \,|\, \hat q_{l-1}]$ and a random bias (denoted by $D_l$ in Appendix B.3). For details, please refer to Appendix B.3.

4. Numerical Experiments

This section presents a numerical validation of our theoretical findings regarding the convergence properties of the proposed algorithms. We conduct a comparative analysis between our algorithms and MLMC DR Q-learning, as studied in Wang et al. (2023b). Additionally, we investigate the complexity of Algorithm 2 as the adversary's power $\delta \to 0$.

Section 4.1 demonstrates convergence and compares the proposed algorithms with multilevel Monte Carlo distributionally robust (MLMC DR) Q-learning. We use the hard MDP instances constructed in Li et al. (2021), on which standard Q-learning performs at its worst-case complexity dependence of $\Omega((1-\gamma)^{-4}\epsilon^{-2})$. Both algorithms in this paper exhibit the canonical convergence rate of $O(\epsilon^{-2})$, with the variance-reduced version displaying superior performance.
In Section 4.2, we test the stability of the sample complexity of the variance-reduced DR Q-learning Algorithm 2 as $\delta \to 0$ using a simple DRMDP instance. In the subsequent developments, we use $m_l = 2^l (1-\gamma)^{-2}$ for the variance-reduced Algorithm 2. As explained in Remark 12, this choice (up to a log factor) yields the same complexity guarantee as stated in Theorem 13. An advantage of this parameter choice is that it allows us to run more epochs for the plots, thereby clarifying the convergence behavior.

4.1 Hard MDPs for Q-learning

Figure 1: Transition diagram of the hard MDP instance for Q-learning.

First, we demonstrate the convergence of the proposed algorithms using the MDP instance shown in Figure 1. This MDP has 4 states and 2 actions, with the transition probabilities of actions 1 and 2 labeled on the arrows between states. Constructed in Li et al. (2021), it is shown there that when $p = \frac{4\gamma - 1}{3\gamma}$, standard non-robust Q-learning has a sample complexity of $\Theta((1-\gamma)^{-4}\epsilon^{-2})$.

Figure 2: Convergence of Algorithms 1 and 2 on the MDP instance of Figure 1. (a) DR Q-learning; (b) variance-reduced DR Q-learning.

Figures 2a and 2b depict the convergence properties of the two algorithms for $\gamma \in \{0.93, 0.95\}$ and $\delta = 0.1$. These figures show the averaged (over 4000 replications) error of the output q-function in the infinity norm plotted against the averaged (over 4000 replications) number of samples used, both on a log-log scale. The parameters for DR Q-learning in Figure 2a are set according to Corollary 8. On the other hand, Figure 2b plots the averaged error achieved by the variance-reduced algorithm after each epoch against the total number of samples used. The figures indicate that both algorithms converge to the optimal robust $q^*$, with the variance-reduced algorithm outperforming DR Q-learning. Additionally, when comparing the log-log error plot with a reference line of slope $-1/2$, we observe that the log error for both algorithms decays at a rate of $1/2$ as the log of the number of samples increases. This behavior aligns with the $\epsilon^{-2}$ dependence of the sample complexity bounds in Theorems 9 and 13, corresponding to the canonical convergence rate of Monte Carlo estimation, $O(n^{-1/2})$.

Remark 16 With $\delta = 0.1$ and $\gamma = 0.93$ or $0.95$, the DRMDP instances do not satisfy Assumption 1. However, the figures still show the canonical $n^{-1/2}$ convergence rate, suggesting that our proposed algorithms might perform well even outside the regime prescribed by Assumption 1.

Figure 3: Comparison of Algorithms 1 and 2 with the MLMC DR Q-learning on the MDP of Figure 1.

Figure 3 compares the performance of the algorithms proposed in this paper with the MLMC DR Q-learning in Wang et al. (2023b). We observe the performance comparison of three Q-learning methods: MLMC DR, DR, and DR-VR, for $\gamma \in \{0.6, 0.7\}$. The results indicate that the distributionally robust variance-reduced Q-learning approach achieves the smallest errors. Although our DR Q-learning method shows slightly lower expected performance than the MLMC DR Q-learning, it is worth noting that the curve corresponding to MLMC DR Q-learning is considerably rougher. This suggests that the MLMC DR Q-learning approach has a higher degree of variability in its performance.

4.2 Testing the Small δ Regime

We proceed to empirically demonstrate the stability of the sample complexity of Algorithm 2 as $\delta \to 0$. First, we introduce a family of MDP instances.
Define reference MDPs with $S = \{1, 2\}$, $A = \{a_1, a_2\}$, transition kernel

$$P_{0,a_1} = P_{0,a_2} = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}, \qquad (4.1)$$

and deterministic reward function $r(1, \cdot) = 1$ and $r(2, \cdot) = 0$. For any positive adversarial power level $\delta$, the worst-case transition kernel chosen by the adversary is

$$P_{\delta,a_1} = P_{\delta,a_2} = \begin{pmatrix} q(\delta) & 1 - q(\delta) \\ q(\delta) & 1 - q(\delta) \end{pmatrix},$$

where $q(\delta) < 1/2$ and $q(\delta) \to 1/2$ as $\delta \to 0$. In a classical tabular RL setting, this worst-case MDP ($\delta > 0$) should be easier to learn compared to (4.1), c.f. (Khamaru et al., 2021; Wang et al., 2023a).

Figure 4: Testing the sample complexity behavior as $\delta \to 0$.

Using this DRMDP instance, we plot the average number of samples required to achieve a fixed error $\epsilon$ while varying $\delta$, as shown in Figure 4. We observe that the average number of samples increases as $\delta \to 0$, because the worst-case MDP converges to the instance in (4.1), which is more challenging to learn. Additionally, the number of samples needed to reach the target error level becomes insensitive to increasingly small $\delta$ when $\delta \le 10^{-2}$, confirming the theoretical results presented in this paper.

5. Extension: χ² Divergence Ambiguity Sets

We extend the variance-reduced version of the Q-learning Algorithm 2 to the setting where the adversary is constrained to perturbations within $\chi^2$ divergence balls of radius $\delta$. The $\chi^2$ divergence is defined for $Q \ll P$ as

$$D_{\chi^2}(Q\|P) := \frac{1}{2}\int_\Omega \Big( \frac{dQ}{dP}(\omega) - 1 \Big)^2 P(d\omega). \qquad (5.1)$$

Note that we follow the convention in Duchi and Namkoong (2021) and include a factor of 1/2 in (5.1). We reuse the notation of the KL case in the discussion of this section. In particular, for each $(s, a) \in S \times A$ and $\delta > 0$, we define $\chi^2$ ambiguity sets analogous to (2.2) as

$$\mathcal{P}_{s,a}(\delta) := \{p : D_{\chi^2}(p\|p_{s,a}) \le \delta\}, \qquad \mathcal{N}_{s,a}(\delta) := \{\nu : D_{\chi^2}(\nu\|\nu_{s,a}) \le \delta\}. \qquad (5.2)$$

For the $\chi^2$ divergence defined in (5.1), we have the following strong duality.

Lemma 17 (Duchi and Namkoong (2021), Lemma 1) Let $X$ be a random variable and $\mu_0$ a probability measure on $(\Omega, \mathcal{F})$. Then, for any $\delta > 0$,

$$\inf_{\mu : D_{\chi^2}(\mu\|\mu_0) \le \delta} E_\mu[X] = \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\mu_0}\big[(\alpha - X)_+^2\big]^{1/2} \Big\}, \qquad (5.3)$$

where $c(\delta) = \sqrt{1 + 2\delta}$ and $(\cdot)_+ := \max\{\cdot, 0\}$. We note that the dual variable $\alpha$ can be optimized within $\alpha \ge \operatorname{ess\,inf}_{\mu_0} X$.

We wish to learn the optimal q-function as defined in (2.7). To achieve this, we use the DR Bellman equation for the q-function (2.10), where the dual form of the Bellman operator $\mathcal{T} : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$ in the $\chi^2$ case is given by

$$\mathcal{T}(q)(s, a) := \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\nu_{s,a}}\big[(\alpha - R)_+^2\big]^{1/2} \Big\} + \gamma \sup_{\beta \in \mathbb{R}}\Big\{ \beta - c(\delta)\, E_{p_{s,a}}\big[(\beta - v(q)(S))_+^2\big]^{1/2} \Big\}. \qquad (5.4)$$

Then, the empirical Bellman operator $\widehat{\mathcal{T}}$ is defined, as in (3.2), using this dual representation:

$$\widehat{\mathcal{T}}(q)(s, a) := \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\nu_{s,a,n}}\big[(\alpha - R)_+^2\big]^{1/2} \Big\} + \gamma \sup_{\beta \in \mathbb{R}}\Big\{ \beta - c(\delta)\, E_{p_{s,a,n}}\big[(\beta - v(q)(S))_+^2\big]^{1/2} \Big\}, \qquad (5.5)$$

where the empirical measures and expectations are defined in (3.1).

Recall the definition of the minimum support probability $p_\wedge$ in (3.3). As in the KL case, we also consider the regime where $\delta = O(p_\wedge)$:

Assumption 3 (Limited Adversarial Power) Suppose the adversary's power $\delta$ satisfies $\delta < p_\wedge / 24$.
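The scalar dual problem in Lemma 17 (and thus each entry of (5.4)-(5.5)) can be evaluated directly from samples. The sketch below is our own illustration: the grid search and its bounds are assumptions, chosen using the facts that the objective is concave in $\alpha$ and that, once $\alpha$ exceeds all sample values, its unconstrained maximizer under the empirical measure is $\bar X + \sqrt{\mathrm{Var}(X)/(2\delta)}$.

```python
import numpy as np

def chi2_dual_inf_mean(samples, delta, num_alpha=4000):
    """Empirical version of the dual in Lemma 17:
        inf_{D_chi2(mu||mu_0) <= delta} E_mu[X]
          = sup_alpha { alpha - sqrt(1 + 2*delta) * sqrt( E_mu0[(alpha - X)_+^2] ) },
    with mu_0 replaced by the empirical measure of `samples`."""
    x = np.asarray(samples, dtype=float)
    if delta == 0.0:
        return float(x.mean())          # no adversary: the value is the sample mean
    c = np.sqrt(1.0 + 2.0 * delta)
    # Beyond max(x) the objective is smooth and maximized at mean + sqrt(var/(2*delta)),
    # so the grid below brackets the optimum of the concave objective.
    hi = max(float(x.max()), float(x.mean() + np.sqrt(x.var() / (2.0 * delta)))) + 1.0
    grid = np.linspace(float(x.min()), hi, num_alpha)
    penalty = np.sqrt(np.mean(np.maximum(grid[:, None] - x[None, :], 0.0) ** 2, axis=1))
    return float(np.max(grid - c * penalty))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=1000).astype(float)   # rewards in {0, 1}
    for delta in [0.0, 0.05, 0.5]:
        # larger delta drives the robust value from the sample mean toward 0
        print(delta, chi2_dual_inf_mean(x, delta))
```

Plugging a routine like this into the `empirical_dr_bellman` sketch of Section 3, in place of the KL dual, gives the $\chi^2$ analog of the empirical operator (5.5), which is how Algorithm 2 is reused in this section with the parameters (5.6) below.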
In this context, we will apply the variance-reduced Q-learning Algorithm 2 with the following parameters:

$$k_{vr} = c_{vr} \frac{1}{(1-\gamma)^2} \log\Big(\frac{3dl_{vr}}{(1-\gamma)\eta}\Big), \quad n_{vr} = c_{vr} \frac{1}{p_\wedge^2 (1-\gamma)^2} \log(3dk_{vr}l_{vr}/\eta)^4, \quad m_l = c_{vr} \frac{4^l}{p_\wedge^2 (1-\gamma)^2} \log(3dl_{vr}/\eta)^2. \qquad (5.6)$$

Notice that, compared to the specification in (3.6), (5.6) has a $p_\wedge^{-2}$ dependence instead of $p_\wedge^{-3}$. Running Algorithm 2 with these parameters will yield an estimate $\hat q_{l_{vr}}$ of $q^*$ with an error of at most $\epsilon$ with high probability, leading to the following theorem.

Theorem 18 Assume Assumptions 2 and 3. For $\epsilon < (1-\gamma)^{-1}$, the variance-reduced DR Q-learning Algorithm 2 with parameters specified in (5.6) computes a solution $\hat q_{l_{vr}}$ s.t. $\|\hat q_{l_{vr}} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^2 (1-\gamma)^4 \min(1, \epsilon^2)} \right)$$

samples.

The proof of Theorem 18 closely follows the proof of Theorem 13. We first establish the analog of Proposition 10 and then apply it to obtain the statement of Proposition 11, using the parameters in (5.6), for the χ² divergence ambiguity set case. The sample complexity is then derived by summing the number of samples used in each epoch. This procedure is carried out in Appendix G.

Acknowledgments

The material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 2229012, 2312204, 2312205, 2419564; ONR 13983111, 13983263; and a 2024 New York University Center for Global Economy and Business grant.

References

Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67-83. PMLR, 09-12 Jul 2020. URL https://proceedings.mlr.press/v125/agarwal20b.html. Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325-349, jun 2013. ISSN 0885-6125. doi: 10.1007/s10994-013-5368-1. URL https://doi.org/10.1007/s10994-013-5368-1. Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8223-8234. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf. Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022. ISSN 0005-1098. doi: 10.1016/j.automatica.2022.110623. URL https://www.sciencedirect.com/science/article/pii/S0005109822004873. John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378-1406, 2021. Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003. Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4):219-354, 2018. J. I. González-Trejo, O. Hernández-Lerma, and L. F. Hoyos-Reyes. Minimax control of discrete-time stochastic systems. SIAM Journal on Control and Optimization, 41(5):1626-1659, 2002. doi: 10.1137/S0363012901383837. URL https://doi.org/10.1137/S0363012901383837. Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013. Garud Iyengar. Robust dynamic programming. Math. Oper. Res., 30:257-280, 05 2005. Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai.
Non-asymptotic analysis of biased stochastic approximation scheme, 2019. Koulik Khamaru, Eric Xia, Martin J. Wainwright, and Michael I. Jordan. Instanceoptimality in optimal value estimation: Adaptivity via variance-reduced q-learning, 2021. URL https://arxiv.org/abs/2106.14352. H. Kushner and G.G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York, 2013. ISBN 9781489926968. URL https://books.google.com/books?id=s B0GCAAAQBAJ. Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Is q-learning minimax optimal? a tight sample complexity analysis. ar Xiv preprint ar Xiv:2102.06548, 2021. Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, and Yuting Wei. Settling the sample complexity of model-based offline reinforcement learning, 2022. URL https://arxiv.org/abs/ 2204.05275. Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2207 2261. PMLR, 2023. Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. ar Xiv preprint ar Xiv:1703.06748, 2017. Variance-Reduced Distributionally Robust Q-Learning Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust q-learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13623 13643. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/liu22a.html. Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. Xinlei Pan, Chaowei Xiao, Warren He, Shuang Yang, Jian Peng, Mingjie Sun, Jinfeng Yi, Zijiang Yang, Mingyan Liu, Bo Li, et al. Characterizing attacks on deep reinforcement learning. ar Xiv preprint ar Xiv:1907.09470, 2019. Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model, 2021. URL https://arxiv.org/abs/2112.01506. Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008. Philippe Rigollet. 18. s997: High dimensional statistics lecture notes, 2015. Alexander Shapiro. Distributionally robust modeling of optimal control. Operations Research Letters, 50(5):561 567, 2022. ISSN 0167-6377. doi: https://doi.org/10.1016/j.orl. 2022.08.002. URL https://www.sciencedirect.com/science/article/pii/ S0167637722000992. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy nski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2014. doi: 10.1137/1.9781611973433. URL https: //epubs.siam.org/doi/abs/10.1137/1.9781611973433. Laixi Shi and Yuejie Chi. Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity, 2022. URL https://arxiv.org/abs/2208. 05767. Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 36, 2024. 
Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pages 8884 8894. PMLR, 2020. Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ bb03e43ffe34eeb242a2ee4a4f125e56-Paper.pdf. Wang, Si, Blanchet, Zhou Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Martin J. Wainwright. Variance-reduced q-learning is minimax optimal, 2019a. URL https://arxiv.org/abs/1906.04697. Martin J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp ℓ -bounds for q-learning, 2019b. URL https://arxiv.org/abs/1905.06265. Gang Wang. Finite-time error bounds of biased stochastic approximation with application to td-learning. IEEE Transactions on Signal Processing, 70:950 962, 2022. doi: 10.1109/ TSP.2021.3128723. Shengbo Wang, Jose Blanchet, and Peter Glynn. Optimal sample complexity of reinforcement learning for uniformly ergodic discounted markov decision processes, 2023a. Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust Q-learning, 2023b. URL https://arxiv.org/abs/ 2302.13203. Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. On the foundation of distributionally robust reinforcement learning, 2024. Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279 292, 1992. Wolfram Wiesemann, Daniel Kuhn, and Bre c Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/23358653. Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505 2513, 2010. Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Towards theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics, 2021. URL https://arxiv.org/abs/2105.03863. Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M. Jordan, and Zhihua Zhang. Avoiding model estimation in robust Markov decision processes with a generative model, 2023. Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, and Peter Glynn. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3331 3339. PMLR, 13 15 Apr 2021. URL https://proceedings.mlr.press/v130/zhou21d.html. Variance-Reduced Distributionally Robust Q-Learning Appendix A. The Empirical Robust Bellman Operator: KL Case In this section, we establish the bias and concentration properties of the empirical DR Bellman operator. As pointed out in the previous sections, they are the key ingredients for proving our near-optimal sample complexity bounds. Let b Tn be the empirical DR Bellman operator formed by n samples defined in Definition 5. To simplify the notation, we will omit the subscript n and only keep b T when there is no confusion. 
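To make these objects concrete, the following minimal numerical sketch evaluates the KL dual functional and one application of the n-sample empirical DR Bellman operator through its dual representation (3.2). It is illustrative only: the function names kl_dual_value and empirical_dr_bellman_kl, the dictionary format used for the empirical reward and transition frequencies, and the use of SciPy's bounded scalar optimizer are our own conventions, not part of the paper's algorithms.

import numpy as np
from scipy.optimize import minimize_scalar

def kl_dual_value(probs, values, delta):
    # sup_{alpha >= 0} { -alpha * log E_mu[exp(-u/alpha)] - alpha * delta }, the KL dual
    # functional f(mu, u, alpha); its value at alpha = 0 is the essential infimum of u.
    # Assumes delta > 0 and a discrete measure given by `probs` over `values`.
    probs = np.asarray(probs, dtype=float)
    u = np.asarray(values, dtype=float)
    u_min = u[probs > 0].min()
    span = u[probs > 0].max() - u_min
    if span == 0.0:
        return float(u_min)  # u is essentially constant under mu
    def neg_dual(alpha):
        # shift by u_min for numerical stability; the dual is shift-equivariant
        logmgf = np.log(np.dot(probs, np.exp(-(u - u_min) / alpha)))
        return -(u_min - alpha * logmgf - alpha * delta)
    # it suffices to search alpha on a compact interval of length O(span/delta)
    res = minimize_scalar(neg_dual, bounds=(1e-12, span / delta), method="bounded")
    return max(-res.fun, float(u_min))

def empirical_dr_bellman_kl(q, emp_rewards, emp_transitions, gamma, delta):
    # One application of the empirical DR Bellman operator in its KL dual form:
    # T_hat(q)(s, a) = sup_beta f(nu_hat_{s,a}, id, beta) + gamma * sup_alpha f(p_hat_{s,a}, v(q), alpha),
    # with v(q)(s) = max_a q(s, a).  emp_rewards[(s, a)] = (reward_values, frequencies) and
    # emp_transitions[(s, a)] = (next_states, frequencies) are the empirical distributions
    # built from the n samples drawn at each state-action pair (this format is an assumption).
    num_s, num_a = q.shape
    v = q.max(axis=1)
    q_new = np.empty_like(q)
    for s in range(num_s):
        for a in range(num_a):
            r_vals, r_freq = emp_rewards[(s, a)]
            next_states, p_freq = emp_transitions[(s, a)]
            q_new[s, a] = kl_dual_value(r_freq, r_vals, delta) \
                + gamma * kl_dual_value(p_freq, v[np.asarray(next_states)], delta)
    return q_new

The same skeleton is reused in the sketch given later for the chi-square case; only the dual functional changes.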
Even though the main results of this paper restrict R [0, 1] to simplify notation and align with convention in the literature, in this section, we consider R [0, rmax]. This allows our results to be directly applied to contexts beyond RL, such as supervised learning where rmax may vary. In order to employ the analysis outlined in the previous section, the empirical Bellman operators need to be contraction mappings. Indeed, we have Lemma 19 b T is a monotonic γ-contraction. Direct consequences of b T being a γ-contraction with γ < 1 are the following bounds: Lemma 20 The following two bounds hold with probability 1: b T(q)(s, a) T (q)(s, a) 2(rmax + q ); As motivated in the paper, to obtain a desired complexity dependence on problem primitives, we need to develop good bounds on the bias and the variance of the empirical Bellman operator. We define the span seminorm of the q-function as |q|span := maxs,a q(s, a) mins,a q(s, a) and |q|span (1 γ) 1. The proofs of the following propositions are in Appendix C. Proposition 21 The empirical DR Bellman operator satisfies the following variance bound: Var(b T(q)(s, a)) 104 r2 max + γ2 |q|2 span p2 n (log(e(|R| |S|))). We note that here p can be replaced by mins S min {ps,a(s ), νs,a(s )}. In particular, the variance upper bound can depend on the state and action. However, since we are interested in a minimax complexity bound, such distinction will not make a difference if we consider an example with only O(1) number of states and actions. We also have the following bound on the bias: Proposition 22 Under Assumption 1, the empirical DR Bellman Operator satisfies the following bias bound: |Bias(b T(q)(s, a))| := |E[b T(q)(s, a)] T (q)(s, a)| 4480 rmax + γ |q|span p3 n log(e|S| |R|). Wang, Si, Blanchet, Zhou Again, the dependence on p can be replaced by mins S min {ps,a(s ), νs,a(s )}. By Lemma 20, the DR empirical Bellman operator is bounded. This along with the uniform (across s, a S A) variance bound in Proposition 21 yields: Proposition 23 The empirical DR Bellman operator b T(q) T (q) 17(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η) w.p. at least 1 η, provided that n 8p 2 log (12|S||A|(|S| |R|)/η). Recall that for fixed ˆq, we haved defined the recentered DR Bellman operators b H(ˆq) := b T(ˆq) b T(q ) and H(ˆq) := T (ˆq) T (q ). (A.1) For the variance-reduced algorithm, we instead consider the bias and concentration properties of the recentered operator b H. As we will observe, the recentering allows the concentration bounds to depend on the residual error in the q-function ˆq q instead of ˆq . As a consequence, one can imagine that as the algorithm progresses, ˆql q will progressively become smaller, making b H having much better concentration properties than b T. We start with bias and variance bounds. Proposition 24 Suppose Assumption 1 is enforced. Then |E[ b H(ˆq)(s, a) H(ˆq)(s, a)]| 26 ˆq q provided n p 1 , and Var( b H(q ))(s, a) 212 ˆq q 2 p3 n log(e|S|) for all n 1. Similar to the extension from Proposition 21 to Proposition 23, we can obtain the following concentration bound for the recentered operator by extending the variance bound in Propositon 24. Proposition 25 Assume Assumption 1. Then w.p. at least 1 η H(ˆq) b H(ˆq) 8 ˆq q log(4|S|2|A|/η) provided that n 8p 2 log(4|S|2|A|/η) We emphasize that all of the propositions are O(1) when δ 0. This is due to a more thorough analysis, which allows us to remove the dual variable α (see Lemma 2) in the bounds, as explained in Lemma 37 in detail. 
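As a numerical sanity check on the scalings in Propositions 21 and 22, one can simulate the n-sample dual estimator on a toy distribution and track its bias and variance as n grows. The sketch below reuses kl_dual_value from the sketch at the beginning of this appendix; the two-point distribution, the radius delta = 0.01 (chosen small in the spirit of Assumption 1), and the replication counts are illustrative choices rather than quantities taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.0, 1.0])                   # two-point toy distribution
p_true = np.array([0.3, 0.7])
delta = 0.01                                    # small radius, in the spirit of Assumption 1
target = kl_dual_value(p_true, values, delta)   # population robust expectation

for n in (100, 400, 1600, 6400):
    estimates = []
    for _ in range(5000):
        counts = rng.multinomial(n, p_true)
        estimates.append(kl_dual_value(counts / n, values, delta))
    estimates = np.array(estimates)
    # Propositions 21 and 22 suggest variance of order 1/n and bias of order 1/n
    # (rather than 1/sqrt(n)) in this small-delta regime.
    print(n, estimates.mean() - target, estimates.var())

The distinction between 1/n and 1/sqrt(n) bias decay is exactly the point of the illustrative example that opens Appendix B below.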
Variance-Reduced Distributionally Robust Q-Learning Appendix B. Proofs for the Analysis of Algorithms: KL Case Before we prove the main results in this paper, we illustrate the necessity of requiring the nsample empirical Bellman operator to exhibit a bias of e O(n 1) to achieve an overall sample complexity of e O(ϵ 2). B.1 An Illustrative Example for the Necessity of the Bias Requirement To understand our requirement that the bias of the empirical Bellman operator (3.2) decay like O(n 1), we consider the following simple example. We use stochastic approximation to find the fixed point 0 of the equation x = γx, γ (0, 1). In the notation of (2.12), L(x) = γx. We generate biased noisy observation of x γx; i.e. Ln,k(x) = γx + Un,k + βn, where {Un, k : k = 1, . . .} are i.i.d., each requires n-sample to compute, with EUn,k = 0 and Var Un,k = σ2 n. The Un, k and βn are analogous to the noise and bias associated with an n-sample empirical Bellman operator. For simplicity, we consider a constant stepsize setting where the stepsize λ = eΘ(ϵ), ignoring the dependence on 1/(1 γ). We note that similar phenomena happen for discounted stepsize as well. We consider the stochastic approximation: Xt+1 = (1 λ)Xt + λ(γXt + Un,t+1 + βn) with X0 = 1. Then, it is not hard to see that Xt = (1 λ(1 γ))t + λ k=0 (1 λ(1 γ))k(Un,t k + βn). To achieve a statistical error of ϵ in mean squared, we need t = Ω 1 λ(1 γ) log 1 k=0 (1 λ(1 γ))k Un,t k = O(ϵ2), βnλ k=0 (1 λ(1 γ))k = O(ϵ). The variance requirement suggests λ2(1 (1 λ(1 γ))2t) λ(1 γ)(2 λ(1 γ)) σ2 n = O(ϵ2). So, with the choice of t earlier, we need σ2 n = O(ϵ2/λ) = e O(ϵ). On the other hand, the bias requirement implies βn λ(1 (1 λ(1 γ))t) λ(1 γ) = O(ϵ); i.e. βn = O(ϵ). Assume that the variance σ2 n = Θ(n 1), which is the canonical variance decay rate of Monte Carlo estimation. Then, to guarantee σ2 n = O(ϵ), we need n = Ω(1/ϵ). Yet, if Wang, Si, Blanchet, Zhou βn = Θ(n 1/2), we also need n = Ω(1/ϵ2) so as to have βn = O(ϵ). In this case, the total number of samples used until iteration t is nt = eΩ(ϵ 3). On the other hand, if we managed to show that βn = eΘ(n 1), we see that we only need n = eΩ(1/ϵ), and hence nt = eΩ(ϵ 2), the correct rate. Having explained the necessity of e O(n 1) bias and equipped with the key bias and concentration bounds, we are ready to carry out the proofs of the worst-case sample complexity bounds for Algorithm 1 and 2. We will follow the proof outlined in Section 3.3. B.2 The Distributionally Robust Q-learning Algorithm 1 B.2.1 Proof of Proposition 7 Proof Recall that the update rule for Algorithm 1 can be written as qk+1 q = (1 λk)(qk q ) + λk [Tk+1(qk) Tk+1(q ) + Tk+1(q ) T (q )] = (1 λk)(qk q ) + λk [Hk+1(qk) + Wk+1] (B.1) where we define Wk+1 := Tk+1(q ) T (q ). Since Tk+1(q ) is a i.i.d. sequence of estimators to T (q ), β := Bias(Tk(q )) = E[Tk(q )] T (q ) is independent of k. We can write Wk+1 = β + Uk+1 where Uk+1 := Tk+1(q ) T (q ) β has zero mean. Next, we would like to apply Proposition 15. Define the auxiliary sequences Pk+1 = (1 λk)Pk + λk Wk+1 (B.2) Qk+1 = (1 λk)Qk + λk Uk+1 (B.3) with Q0 = P0 = 0. Notice that since {Uk, k 1} has mean zero, E[Qk] = 0 for all k 0. It is easier to analyze the process {Qk, k 0} than {Pk, k 0} which correspond to {pk} in Proposition 15. To use {Qk, k 0}, we first show that Pk = Qk + β. We prove this by induction. The base case P1 = λ0W1 = Q1 + β as λ0 = 1. Next we check the induction step. 
By the iterative updates (B.2) and (B.3) and the induction hypothesis, we have that Pk+1 = (1 λk)Pk + λk Wk+1 = (1 λk)(Qk + β) + λk(Uk+1 + β) = Qk+1 + β. (B.4) Variance-Reduced Distributionally Robust Q-Learning By Algorithm 1, q1 = 0. We have that by Lemma 20, q1 q (1 γ) 1. Therefore, by Proposition 15 1 λ1(1 γ) + + Ql,k+1 + β . + Qk+1 + 2 β (B.5) w.p.1, where we used kλk = 1/(1/k + (1 γ)) 1/(1 γ). Next, we bound the sequence {Qk, k 1}. Lemma 26 The {Qk, k 1} sequence satisfies P( Qk+1 > t) 2|S||A| exp t2 λk(8γ(1 γ) 1t + 4 σ2(q ) ) where σ2(q )(s, a) = Var(Tk(q )(s, a)). The proof of Lemma 26 is in Appendix B.4. By applying Lemma 26, we have that 1 γ log (2|S||A|/η) + 2 p log (2|S||A|/η) log (2|S||A|/η) w.p. at least 1 η. To establish high probability bound using (B.5), we also need the following properties of the stepsize: Lemma 27 (Proof of Corollary 3,Wainwright (2019b)) The following inequalities hold: λj 2 (1 γ) λk ; j=1 λj log(1 + (1 γ)k) Wang, Si, Blanchet, Zhou We have that by Lemma 27 and the union bound, j=1 Qj + Qk0+1 λk0 log(1 + (1 γ)k0) (1 γ)2 + σ(q ) p log (4|S||A|k0/η) 1 γ + 2 σ(q ) p log (4|S||A|k0/η) . λk0 log(1 + (1 γ)k0) (1 γ)2 + σ(q ) p log (4|S||A|k0/η) 16 1 (1 γ)3k0 + 20 p (1 γ)5/2 n0k0 log (4dk0/η)2 w.p. at least 1 η, where we utilize Proposition 21 to bound σ(q ) . We use Proposition 22 to bound β. Then, from (B.5) we conclude that there exists constant c s.t. qk0+1 q c 1 (1 γ)3k0 + 1 p (1 γ)5/2 n0k0 log (4dk0/η)2 + c rmax + γ |q|span (1 γ)p3 n0 log(e|S| |R|) c 1 (1 γ)3k0 + 1 p (1 γ)5/2 n0k0 + 1 p3 (1 γ)2n0 log (4dk0/η)2 where c can change from line to line. Finally, note that for C1 1, C2 e, log(C1C2) = log(C1) + log(C2) C1 log(C2). So, log (4dk0/η)2 16 9 log (3dk0/η)2. This completes the proof. B.3 The Variance-Reduced Distributionally Robust Q-learning Algorithm 2 B.3.1 Proof of Proposition 11 Proof Recall that Fl be the σ-field generated by the random samples used until the end of epoch l and El[ ] = E[ |Fl], Pl[ ] = P[ |Fl], and Varl( ) = Var( |Fl). In the following proof, the probabilities are w.r.t. Pl 1( ). Recall that Hl,k = Tl,k(q) Tl,k(q ) and e Hl = e Tl,k(q) e Tl,k(q ). From the variance-reduced DR-RL (Algorithm 2) update rule, we have at epoch l, ql,k+1 q = (1 λk)(ql,k q ) + λk h Tl,k+1(ql,k) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) i = (1 λk)(ql,k q ) + λk h Hl,k+1(ql,k) + Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) i = (1 λk)(ql,k q ) + λk [Hl,k+1(ql,k) + Wl,k+1] (B.6) Variance-Reduced Distributionally Robust Q-Learning where we define Wl,k+1 = Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ). Notice that only the first two terms is dependent on k. We can write Wl,k+1 = Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) = Hl,k+1(ˆql 1) + e Hl(ˆql 1) + e Tl(q ) T (q ) = [Hl,k+1(ˆql 1) El 1[Hl,k+1(ˆql 1)]] + e Hl(ˆql 1) + e Tl(q ) T (q ) El 1[Hl,k+1(ˆql 1)] = Ul,k+1 + Dl (B.7) where Ul,k+1 := Hl,k+1(ˆql 1) El 1[Hl,k+1(ˆql 1)] (B.8) Dl := e Hl(ˆql 1) + e Tl(q ) T (q ) El 1[Hl,k+1(ˆql 1)]. (B.9) Note that El 1[Hl,k+1(ˆql 1)] is constant in k. We will apply Proposition 15. Define the auxiliary sequences Pl,k+1 = (1 λk)Pl,k + λk Wl,k+1 (B.10) Ql,k+1 = (1 λk)Ql,k + λk( Ul,k+1) (B.11) with Ql,0 = Pl,0 = 0. Note that Ul,k+1 under El 1 are i.i.d. and has mean 0. So El 1[Ql,k] = 0 for any k 0. It is easier to analyze the process {Ql,k, k 0} than {Pl,k, k 0} which correspond to {pk} in Proposition 15. As in the DR Q-learning case (Equation (B.4)), the same induction argument implies that Pl,k = Ql,k + Dl. By the algorithm, ql,1 = ˆql 1, we have that ql,1 q b. 
Therefore, by Proposition 15 ql,k+1 q λk j=1 Ql,j + γk Dl + Ql,k+1 + Dl . + Ql,k+1 + 2 Dl w.p.1, where we used kλk = 1/(1/k + (1 γ)) 1/(1 γ). Next, we prove bounds for { Ql,k , k 0} and Dl . Lemma 28 Under measure Pl 1( ), Pl 1( Ql,j > t) 2|S||A| exp 4λj(γ ζl 1 t + σ2 l 1 ) where ζl 1 = ˆql 1 q and σ2 l 1(s, a) = Varl 1(Hl,k(ˆql 1)(s, a)). Wang, Si, Blanchet, Zhou The proof of this Lemma is deferred to Appendix B.4. Apply Lemma 28, we have that Ql,j 4λj ζl 1 log (2|S||A|/η) + 2 p log (2|S||A|/η). w.p. at least 1 η. Recall the definition of σ2 l 1 and Proposition 24. We have that by Lemma 27 and the union bound, j=1 Ql,j + Ql,kvr+1 λkvr log(1 + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) + 4λkvr ζl 1 + 2 σl 1 p λkvr log (4|S||A|kvr/η) . λkvr log(e + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) b (1 γ)2kvr + 26b p3/2 (1 γ)3/2 nvrkvr log (4|S||A|kvr/η)2 w.p. at least 1 η. For Dl, recall the definition in (B.9). We add and subtract H(ˆql 1) and write: Dl = e Hl(ˆql 1) H(ˆql 1) + e Tl(q ) T (q ) + (H(ˆql 1) El 1[Hl,k+1(ˆql 1)]) = e Hl(ˆql 1) H(ˆql 1) + e Tl(q ) T (q ) + El 1 [H(ˆql 1) Hl,k+1(ˆql 1)] . (B.14) Recall Propositions 23, 24, and 25, we have that by union bound, Dl c rmax + |q |span + ˆql 1 q log (12d/η) + El 1 (H(ˆql 1) Hl,k+1(ˆql 1)) c rmax + |q |span + ˆql 1 q log (12d/η) + c ˆql 1 q w.p. at least 1 η, provided that c is a large enough constant ml 8p 2 log(24d/η), and nvr p 1 . Variance-Reduced Distributionally Robust Q-Learning Finally, recall that ˆql = ˆql,kvr+1 and rmax = 1, combine (B.12), (B.13), and (B.15) we conclude that there exists absolute constant c s.t. b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr log (8dkvr/η)2 + c rmax + |q |span + b p3/2 (1 γ) ml log(24d/η) + c b p3/2 (1 γ) nvr b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr + b p3/2 (1 γ) nvr log (8dkvr/η)2 p3/2 (1 γ)2 ml w.p. at least 1 η, where we used |q |span 2 q 2/(1 γ) and b 1/(1 γ), c can change from line to line. Finally, note that for C1 1, C2 e, log(C1C2) = log(C1)+log(C2) C1 log(C2). This completes the proof of Proposition 10. B.3.2 Proof of Proposition 11 Proof By the definition of conditional probability n ˆql q 2 l(1 γ) 1o! ˆql q 2 l(1 γ) 1 ˆqn q 2 n(1 γ) 1 ! ˆql q 2 l(1 γ) 1 ˆqn q 2 n(1 γ) 1 ! where we note that ˆq0 = 0 and Lemma 20 implies that ˆq0 q (1 γ) 1 w.p.1, so the conditioned intersection and product can start from s = 1. Let ˆqs q 2 s(1 γ) 1 . We analyze the probability for l 1 P ˆql q 2 l(1 γ) 1 Al 1 = 1 P(Al 1)E h 1 n ˆql q 2 l(1 γ) 1o 1Al 1 i = 1 P(Al 1)E h 1 n ˆql 1 q 2 (l 1)(1 γ) 1o E h 1 n ˆql q 2 l(1 γ) 1o Fl 1 i 1Al 1 i = 1 P(Al 1)E h 1 n ˆql 1 q 2 (l 1)(1 γ) 1o Pl 1 1 n ˆql q 2 l(1 γ) 1o 1Al 1 i , Wang, Si, Blanchet, Zhou By Proposition 10, we recall conditioned on ˆql 1 q 2 (l 1)(1 γ) 1 =: b b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr + b p3/2 (1 γ) nvr log (3dkvr/η)2 p3/2 (1 γ)2 ml w.p. at least 1 η. Therefore, by the parameter choice (3.6), we have that for sufficiently large cvr and for events ω ˆql 1 q 2 (l 1)(1 γ) 1 , Pl 1 1 n ˆql q 2 l(1 γ) 1o (ω) 1 η lvr ; (B.16) 1 n ˆql 1 q 2 (l 1)(1 γ) 1o Pl 1 1 n ˆql q 2 l(1 γ) 1o 1 η Therefore, we have P ˆql q 2 l(1 γ) 1 Al 1 1 η which further gives us n ˆql q 2 l(1 γ) 1o! lvr . (B.17) To finish the proof, we consider the function e(η) := 1 η Clearly, e(η) is C2 with derivatives e (η) = 1 η l 1 , e (η) = l 1 Note that e 0 if l 1. So e (η) is non-decreasing. Hence for all η 0, e (η) e (0). This implies that e(η) = e(0) + Z η Assumption ϵ < (1 γ) 1 implies that lvr 1. 
Therefore, we plug in this to (B.17) and conclude that P ˆqlvr q 2 lvr(1 γ) 1 P n ˆql q 2 l(1 γ) 1o! Variance-Reduced Distributionally Robust Q-Learning B.4 Proof of Lemma 26 and 28 To prove these two lemma, we introduce the following result: Lemma 29 (Wainwright (2019b), Lemma 2) Let {Yk R, k 1} be a sequence of i.i.d. zero mean ζ-bounded r.v.s with variance σ2. Define {Xk, k 0} by the recursion X0 = 0 Xk+1 = (1 λk)Xk + λk Yk+1, where λk = 1/(1 + (1 γ)k). Then E [exp(t Xk+1)] exp t2σ2λk for all |t| < 1/(ζλk). We first prove Lemma 28. Proof We use the same steps. Recall (B.11), where Ul,k is an i.i.d. sequence under El 1 given by (B.8). By Lemma 20 Ul,k Tl,k(ˆql 1) Tl,k(q ) + El 1[Tl,k(ˆql 1) Tl,k(q )] 2γ ˆql 1 q = 2γ ζl 1 . Notice that by construction, Tl,k(q )(s, a) are independent across s S, a A. Therefore, by Lemma 29, El 1 exp(λ Ql,k+1 ) = El 1 sup (s,a) S A max {exp(λQl,k+1(s, a)), exp( λQl,k+1(s, a))} (s,a) S A El 1 exp(λQl,k+1(s, a)) + E exp( λQl,k+1(s, a)) 2|S||A| exp λ2 σ2 l 1 λk 1 2γ ζl 1 λk|λ| for any λ < 1/(2γ ζl 1 λk). Therefore, by the Chernoffbound Pl 1( Ql,k+1 > t) 2|S||A| exp λ2 σ2 l 1 λk 1 2γ ζl 1 λk|λ| for any λ (0, 1/(2γ ζl 1 λk)). Choose λ = t 2γ ζl 1 λkt + 2 σ2 l 1 λk , we conclude that Pl 1( Ql,k+1 > t) 2|S||A| exp 4λk(γ ζl 1 t + σ2 l 1 ) Wang, Si, Blanchet, Zhou Next, we prove Lemma 26. Notice that we only need to modify the bounds on ζ and σ2. Proof Recall that {Ql,k, k 0} is given by recursive relation (B.3), where Uk has mean 0. By Lemma 20 Uk 2 Tk+1(q ) 2rmax + 2γ q and Var(Tk+1(q )(s, a)) = σ2(q )(s, a). Therefore, using the same arguments, we conclude that P( Qk+1 > t) 2|S||A| exp t2 λk(8γ(1 γ) 1t + 4 σ2(q ) ) Appendix C. Proofs of Properties of the Empirical Bellman Operator: KL Case C.1 Glossary of Notations and Basic Properties Before we present our proofs, we first define some technical notations. For finite discrete measurable space (Y, 2Y ), fixed u m2Y , and signed measure ν M (Y, 2Y ), let y Y ν(y)u(y) denotes the integral. For generic probability measure µ on (Y, 2Y ) and random variable u : Y R, let w = w(α) = e u/α; define the KL dual functional under the reference measure µ f(µ, u, α) := α log µ[e u/α] αδ. (C.1) We clarify that f(µ, u, 0) = limα 0 f(µ, u, α) = ess infµ u. We present two basic properties of the dual functional f for which the proofs are deferred to Appendix E. Lemma 30 For any ν µ, the dual functional is bounded u L (µ) sup α 0 f(ν, u, α) u L (µ) Lemma 31 The following bound holds w.p.1.: sup α 0 f(µ, u, α) sup α 0 f(µn, u, α) 2 |u|span , where |u|span = maxs S u(s) mins S u(s). Variance-Reduced Distributionally Robust Q-Learning Let µn be the empirical measure form by n i.i.d. samples drawn from µ. In the following development, we need to consider the perturbation analysis on the line of center measures {tµ + (1 t)µn : t [0, 1]}. So, it is convenient to define for µs,a = ps,a, νs,a µs,a,n(t) = tµs,a + (1 t)µs,a,n ms,a,n = µs,a µs,a,n gs,a,n(t, α) = f(µs,a,n(t), u, α). Note that we will not explicitly indicate the dependence of u for the function g, because it will always be the identity function when µ = ν and the value function when µ = p. We will also drop the dependence on (s, a) when clear. Our analysis involves many derivative computations. We use three type of derivative notations, two of which is explained here and the Radon-Nikodym derivative is introduced in the following paragraph. 
For a smooth function of multiple arguments g(t, αs,t) where αs,t could be dependent on parameters s, t, denote the partial derivatives by t, α; i.e. tg(t, αs,t) := lim ϵ 0 g(t + ϵ, αs,t) ϵ , αg(t, αs,t) := lim ϵ 0 g(t, αs,t + ϵ) On the other hand, when αs,t is also smooth in t, denote the total derivative w.r.t. t by dt; i.e. dtg(t, αs,t) := lim ϵ 0 g(t + ϵ, αs,t+ϵ) ϵ = tg(t, αs,t) + αg(t, αs,t) tαs,t The intuition behind our ability to remove the 1/δ dependence stems from the mutual absolute continuity (also known as equivalence) between the empirical worst-case transition kernel and reward distribution and the true ones. This holds if δ is sufficiently small and the empirical centers of the uncertainty sets are equivalent to the true centers. As a result, our techniques rely on the absolute continuity between the empirical measure µn and µ. We say that µ is absolute continuous w.r.t. another measure ν, denoted by ν µ, if for A 2Y , ν(A) = 0 implies that µ(A) = 0. We say that µ is equivalent to ν, denoted by µ ν, if ν µ and ν µ. Note that the empirical measure µn always satisfies µn µ w.p.1. For absolutely continuous measures ν µ, the Radon-Nikodym derivative is well defined: dν dµ(y) := ν(s) µ(s)1 {µ(s) = 0} . The proof strategy we will implement is to consider separately the good events on which µn and µ are close (so that we have µn µ) and the bad events where the empirical measure is not close to the reference model. This motivates us to define for p > 0 Ωn,p(µ) = ω : sup y |µn(ω)(y) µ(y)| p . (C.3) Then, in the DR-RL setting, define s,a Ωn,p(ps,a) \ s,a Ωn,p(νs,a) = ω : sup s,a sup s |ps,a,n(ω)(s ) ps,a(s )| p, sup s,a sup r |νs,a,n(ω)(r) νs,a(r)| p . Wang, Si, Blanchet, Zhou We frequently make use of the minimum support probability of certain measures such as µ, µs,a. This is denoted by µ := min {µ(s) : µ(s) > 0}, µs,a, := min {µ(s) : µ(s) > 0}. It is easy to see that the following lemma holds: Lemma 32 Suppose p < µ , then on Ωn,p(µ), µ µn and infy:µ(y)>0 µn(y) > µ p. Moreover, the empirical measures are satisfies the following concentrations: Lemma 33 Let µ be any probability measure on finite measure space (Y, 2Y ). Then, for any k = 1, 2, 3, , . . . P(Ωn,p(µ)c) 1 p2knk log(e2k 1|Y |)k. In particular, if we choose k = 1, P(Ωn,p(µ)c) 1 p2n log(e|Y |). This lemma follows from the subgaussian property of empirical measures on finite measure space; i.e. Lemma 39 holds. For absolutely continuous empirical measures, we also have the following lemma, again as a consequence of subgaussianity and hence Lemma 39. Lemma 34 Let ξn be another random measure on (Y, 2Y ). Let (Ω, F, P) be the probability space that supports ξn, µn. Suppose that µn ξn, µ ξn, and ξn(y) > p for all y s.t. ξn(y) = 0. Then, for all A F, the following bounds hold: L (ξn) 1A 1 p n L (ξn) 1A 1 p2n log(e|Y |). The proofs of these results are deferred to Appendix D. C.2 Proof of Proposition 21 Proof By definition, we have b T(q)(s, a) T (q)(s, a) sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)| + γ sup α 0 |f(ps,a,n, v(q), α) f(ps,a, v(q), α)| . (C.4) We will drop the s, a dependence for simplicity. This motivates us to look at the dual functional applied to generic measureable u : Y R. Let s Define w = e u/α. |f(µn, u, α) f(µ, u, α)| = |gn(0, α) gn(1, α)| = tgn(t, α) t=τ = α mn[w] µn(τ)[w] Variance-Reduced Distributionally Robust Q-Learning for some random variable τ (0, 1). To bound this, we introduce the following lemma for which the proof is deferred to E. 
Lemma 35 Let m = µ1 µ2 with µ1, µ2 µ and w = e u/α, we have that µ[w]2 3j inf κ R u κ j L (µ) To apply Lemma 35, we consider p 1 2µ . Then, By Lemma 32, on Ωn,p(µ), µn(t) µ for all t [0, 1]. So, on Ωn,p(µ), we have by Lemma 35 sup α 0 |f(µn, u, α) f(µ, u, α)| sup α 0 α mn[w] µn(τ)[w] inf κ R 3 u κ L (µ) = 3 |u|span Therefore, by partitioning Ωinto Ωn,p(µ)c and Ωn,p(µ), we bound E sup α 0 |f(µn, u, α) f(µ, u, α)|2 9 |u|2 span E dmn dµn(τ) L (µ) 1Ωn,p(µ) + 4 |u|2 span P(Ωn,p(µ)c) (C.5) where on Ωn,p(µ)c, we use the bound in Lemma 31. By Lemma 32, on Ωn,p(µ) for y s.t. µ(y) > 0, µn(y) µ p 1 2µ p. Since µ(y) > 0 implies that µ(y) µ , we have that µn(t)(y) p for any t [0, 1]. Therefore, Lemma 34 applies. On the other hand, Lemma 33 also applies and is used to bound P(Ωn,p(µ)c). Therefore, continue from (C.5), we have E sup α 0 |f(µn, u, α) f(µ, u, α)|2 13 |u|2 span p2n log(e|Y |). We conclude that choosing p = 1 2 min {νs,a, , ps,a, }, Var(b T(q)(s, a)) 2E sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)|2 + 2γ2E sup α 0 |f(ps,a,n, v(q), α) f(ps,a, v(q), α)|2 26 id 2 νs,a,span p2n log(e|R|) + 26γ2 v(q) 2 ps,a,span p2n log(e|S|) 26 r2 max + γ2 |q|2 span p2n log(e(|R| |S|)). Plugging in p = 1 2p , we obtain the claimed inequality in Proposition 21. Wang, Si, Blanchet, Zhou C.3 Proof of Proposition 22 Proof We consider for generic u and measure µ on (Y, 2Y ). We assume δ < 1 24µ , which will be guaranteed by Assumption 1. Since α f(µ, u, α) is continuous, and from Si et al. (2020) it is sufficient to optimize the Lagrange multiplier on compact set [0, δ 1 u L (µ)], there is an optimal Lagrange multiplier α n(t) that achieves supα 0 f(µn(t), u, α). The bias of the dual functional Bias(f(µn, u, α n)) = E(gn(0, α n(0)) gn(1, α ))1Ωn,p(µ) + E (gn(0, α n(0)) gn(1, α )) 1Ωn,p(µ)c =: E1 + E2. 4µ . Notice that by assumption, Then, the following Lemma 36 holds. Lemma 36 (Differentiability of the Dual Functional) Suppose δ < log(1 1 2µ ) and p 1 On Ωn,p(µ), t supα 0 gn(t, α) is C2((0, 1)) C[0, 1]. α = 0 iffu is µ essentially constant. So, α n(t) 0 and supα 0 gn(t, α) µ[u] If α > 0, then α n(t) > 0 for all t [0, 1] with dt sup α 0 gn(t, α) = α n(t) mn[w] and dtdt sup α 0 gn(t, α) = α n(t) mn[w]2 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 The proof of this result is deferred to Appendix E. So, on Ωn,p(µ), t gn(t, α n(t)) is C2(0, 1) C[0, 1]. By the (second order) mean value theorem, there exists random variable τ [0, 1] s.t. E1 = E dtgn(t, α n(t)) t=1 + 1 2dtdtgn(t, α n(t)) t=τ = E α mn[w] 2dtdtgn(t, α n(t)) t=τ µ[w] Eα mn[w] µ[w] 1Ωn,p(µ)c + E 1 2dtdtgn(t, α n(t)) t=τ 1Ωn,p(µ) = 0 E1,1 + E1,2 Variance-Reduced Distributionally Robust Q-Learning where Emn[u] = 0 for any function u. Recall Lemma 35. Since naturally µ µ, µn, |E1,1| 3 |u|span E dmn L (µ) 1Ωn,p(µ)c µ P(Ωn,p(µ)c) 3 |u|span µ p2n log(e|Y |), where we use Lemma 33 for the last inequality. On Ωn,p(µ), by Lemma 36, for all t [0, 1] either α n(t) = 0 or α n(t) > 0. In the first case, we have trivially E1,2 = 0. In the second case, dtdtgn(t, α n(t)) = dt tgn(t, α n(t)) = α n(t) mn[w]2 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] µn(t)[uw]mn[w] α n(t)µn(t)[w]2 Next, we prove a finer characteristic when δ goes to 0. 
We need the following Lemma: Lemma 37 On Ωn,p(µ) with p < µ sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] 2 136 inf κ R u κ 3 L (µ) Applying Lemma 35 and 37, we have that on Ωn,p(µ) |dtdtgn(t, α n(t))|1Ωn,p(µ) 3 inf κ R u κ L (µ) L (µ) + 136 infκ R u κ 3 L (µ) Varµ n(t)(u) L (µ) + 136 |u|span u µ n[u] 2 L (µ n) u µ n(t)[u] 2 L2(µ n) To bound the second ratio in the last inequality, we introduce the following lemma, whose proof is deferred to Appendix E as well. Lemma 38 Suppose δ 1 24µ and p 1 4µ . When the optimal Lagrange multiplier α > 0, worst-case measures µ n(t) = µn(t)[w ]/µn(t)[w] satisfies µ n(t)(y) 1 2µ on Ωn,p(µ). 24p , by Lemma 38, for some y s.t. µ n(t)(y ) > 0, u µ n[u] 2 L (µ n) u µ n(t)[u] 2 L2(µ n) = |u(y ) µ n[u]|2 µ n(t)(y )|u(y ) µ n[u]|2 + P y =y µ n(t)(y)|u(y) µ n[u]|2 |u(y ) µ n[u]|2 µ n(t)(y )|u(y ) µ n[u]|2 Wang, Si, Blanchet, Zhou As in the proof of Propositon 21, under the choice p 1 4µ , Lemma 34 applies. Therefore, |E1,2| 275 |u|span µ p2n log(e|Y |) For E2 in (C.6), we use Lemma 31 and previous bound on P (Ωn,p(µ)c) |E2| E |f(µn, u, α n(0)) f(µ, u, α )| 1Ωn,p(µ)c 2 |u|span P (Ωn,p(µ)c) p2n log(e|Y |) |u|span µ p2n log(e|Y |) Therefore, going back to (C.6), we have Bias sup α 0 f(µn, u, α) 280 |u|span µ p2n log(e|Y |). Apply this to the empirical Bellman operator with p = 1 4 min {ps,a, , µs,a, } and Assumption 1 holds. So, δ < 1 24p implies that δ < 1 24 min {ps,a, , µs,a, }. Therefore, we have |Bias(b T(q)(s, a))| = sup β 0 f(νs,a,n, id, β) + γBias sup α 0 f(ps,a,n, v(q), α) 4480 id νs,a,span + γ |v(q)|span p3 n log(e|S| |R|) 4480 rmax + γ |q|span p3 n log(e|S| |R|). C.4 Proof of Proposition 23 Proof We recall the bound (C.4) and the subsequent result sup α 0 |f(µn, u, α) f(µ, u, α)| 3 |u|span Again, we consider p 1 2µ . Also recall the definition (C.3) of Ωn,p(µ). By Lemma 32, on Ωn,p(µ) for y s.t. µ(y) > 0, µn(y) µ p 1 2µ p. Since µ(y) > 0 implies that Variance-Reduced Distributionally Robust Q-Learning µ(y) µ , we have that µn(t)(y) p for any t [0, 1]. Therefore, we have P sup α 0 |f(µn, u, α) f(µ, u, α)| > t P(Ωn,p(µ)c) + P L (µ) > t, Ωn,p(µ) P sup y |µn(y) µ(y)| > p + P 3 |u|span p sup y |mn(y)| > t exp( 2p2n) + exp 9 |u|2 span exp( 2p2n) + exp 9 |u|2 span where we used Hoeffding s inequality and union bound. Therefore, going back to the DR Bellman operator setting, we choose p = 1 4p and by union bound P( b T(q) T (q) > t) sup s,a sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)| > t + P sup s,a sup α 0 |f(ps,a,n, v(q), β) f(ps,a, v(q), β)| > t 2(|S|2|A| + |S||A||R|) exp p2 n + 2|S||A||R| exp p2 t2n 288r2max + 2|S|2|A| exp p2 t2n 288γ2 |q|2 span We set each of the three terms to be less than η/3 and find that it suffices to have p2 log (12|S||A|(|S| |R|)/η) t 17(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η). This implies the statement of the proposition. C.5 Proof of Proposition 24 Proof We define V := H(ˆq) b H(ˆq) = (T (ˆq) T (q )) (b T(ˆq) b T(q )). (C.9) Wang, Si, Blanchet, Zhou Recall the dual formulation T (q)(s, a) = sup β 0 f(νs,a, id, β) + γ sup α 0 f(ps,a, v(q), α). The first term is not dependent on q, hence canceled in V . We have that |V (s, a)| = γ sup α 0 f(ps,a, v(ˆq), α) sup α 0 f(ps,a, v(q ), α) sup α 0 f(ps,a,n, v(ˆq), α) + sup α 0 f(ps,a,n, v(q ), α) Note that if v(ˆq) and v(q ) are both µ essentially constant, then V = 0, and the claim of Proposition 24 holds trivially. Therefore, moving forward, we consider the case at least one of v(ˆq) and v(q ) is not µ essentially constant. 
To analyze V while keeping the consistency of our notations, we define vt = tv(ˆq)+(1 t)v(q ), µ = ps,a, µn = ps,a,n, m = µ µn, and µ(t) = tµ (1 t)µn. Because Assumption 1 is imposed, we have that δ < 1 24µ . We consider the parametric function for s, t [0, 1] h(s, t) := sup α 0 f(µ(t), vs, α) = f(µ(t), vs, α s,t). (C.10) To motivates our analysis, we assume that h(s, ) is C1(0, 1) C[0, 1] and th( , t) is C1(0, 1) C[0, 1] as well. Then the fundamental theorem of calculus implies that |V (s, a)| = γ |h(1, 0) h(0, 0) h(1, 1) + h(0, 1)| 0 th(1, t)dt + Z 1 0 th(0, t)dt 0 s th(s, t)dsdt 0 | s th(s, t)| dsdt where s th(s, t) is easier to analyze. We proceed to show that (C.11) is valid (with some minor modification) on Ωn,p(µ). As in the proof of Proposition 22, Lemma 36 applies when we consider p 1 4µ . So, for p 1 4µ , on Ωn,p(µ), h(s, ) is C2(0, 1) C[0, 1] with derivative th(s, t) = dt sup α 0 f(µ(t), vs, α) = α s,t m[ws] µ(t)[ws]. Here, by Lemma 36, α s,t is the unique optimal Lagrange multiplier, and ws = e vs/α s,t. Next, we show that for every fixed t, there is a function Ds th s.t. Z 1 0 Ds th(s, t)ds = th(1, t) th(0, t). (C.12) We note that by Lemma 36, α s,t = 0 if and only if vs is essentially constant. This can only happen at one particular s = s . Otherwise, if there are some 0 s1 < s2 1, Variance-Reduced Distributionally Robust Q-Learning s1v(ˆq) + (1 s1)v(q ) = c1e and s2v(ˆq) + (1 s2)v(q ) = c2e w.p.1 under µ, where e is the vector of all ones, then for all a, b 0, a + b v(ˆq) + 1 as1 + bs2 v(q ) = (ac1 + bc2)e. This would imply that v(ˆq) and v(q ) are both essentially constant. We consider two cases: Case 1: vs is never essentially constant for all s [0, 1]. In this case, α s,t > 0 for all s [0, 1]. Note that s e vs/α is clearly C for α > 0. So, on Ωn,p(µ) if α s,t is C1(0, 1) in s, then s th(s, t) is C1(0, 1) C[0, 1]. We show differentiability of s α s,t by invoking the implicit function theorem as in the proof of Lemma 36. When α s,t > 0, as shown in Lemma 36, it is the unique solution to the optimality condition α s,t( log µ(t)[ws] δ) µ(t)[vsws] µ(t)[ws] =: F(s, α s,t) = 0. (C.13) Define the optimal measure µ (s, t)[ ] = µ(t)[ws ] Since for all fixed t, α s,t > 0 on (0, 1) and F is infinite smooth. The implicit function theorem then implies that α s,t is C1(0, 1) C[0, 1] and s th(s, t) is C1(0, 1) C[0, 1]. We compute the derivative s th in this case. Let v = v(ˆq) v(q ). Rewrite the optimality equation as α s,t( log µ(t)[ws] δ) = µ(t)[vsws] Differentiate w.r.t. s on both side LHS = sα s,t( log µ(t)[ws] δ) + µ(t)[ vws] µ(t)[ws] sα s,t µ(t)[vsws] α s,tµ(t)[ws] = sα s,t( log µ(t)[ws] δ) + µ (s, t)[ v] sα s,tµ (s, t)[vs/α s,t] RHS = µ(t)[ vws]µ(t)[vsws] α s,tµ(t)[ws]2 + µ(t)[ vws] µ(t)[ws] µ(t)[ vvsws] α s,tµ(t)[ws] µ(t)[vsws]2 (α s,t)2µ(t)[ws]2 + µ(t)[v2 sws] (α s,t)2µ(t)[ws] = Covµ (s,t) v, vs/α s,t + µ (s, t)[ v] + sα s,t Varµ (s,t)(vs/α s,t) From the optimality equation and the LHS and RHS derivatives, we have sα s,t log µ(t)[ws] + δ + µ (s, t)[vs/α s,t] + Varµ (s,t)(vs/α s,t) = Covµ (s,t)( v, vs/α s,t) sα s,t Varµ (s,t)(vs/α s,t) = Covµ (s,t)( v, vs/α s,t) sα s,t = Covµ (s,t)( v, vs/α s,t) Varµ (s,t)(vs/α s,t) . Wang, Si, Blanchet, Zhou Therefore, when α s,t > 0, s th(s, t) = s α s,tm[ws] µ(t)[ws] = m[ws]µ(t)[ vws] µ(t)[ws]2 + m[ vws] µ(t)[ws] sα s,t m[ws] µ(t)[ws] m[vsws] α s,tµ(t)[ws] + m[ws]µ(t)[vsws] α s,tµ(t)[ws]2 =: D1 + D2 + D3 + D4. Case 2: There is a unique s [0, 1] s.t. vs is essentially constant. 
In this case, the previous argument implies that s th(s, t) is C1(0, s ), C1(s , 1), and continuous at 0, 1. The derivative is also given by (C.15). We need to show the existence of Ds th that satisfy (C.12). Observe that if s th(s, t) is continuous at s , then applying the fundamental theorem of calculus on the interval [0, s ] and [s , 1] separately, we will have that th(1, t) th(0, t) = Z s 0 s th(s, t)ds + Z 1 s s th(s, t)ds. Hence, taking Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0 will suffice to produce (C.12). It is left to check the continuity at s . As analyzed in (E.1), lim α 0 α s,t m[ws] µ(t)[ws] = 0. So we can conclude the continuity of s th(s, t) at s , if we can show that when vs ce for some constant c, then α s,t 0. To prove this, we assume to the contrary that there is a subsequential limit α sn,t β > 0 for some sequence sn s . But since F defined (C.13) in s and α when α > 0, we must have that 0 = lim n F(sn, α sn,t) = β( log µ(t)[e ce/β] δ) c = δβ raising a contradiction. This implies that s th(s, t) is continuous at s , and hence (C.12) holds with Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0. Therefore, we have shown that the bound (C.11) is valid on Ωn,p(µ) with p 1 4µ . Now we bound s th(s, t) using the decomposition (C.15). |D1| and |D2| can be bounded using the change of measure techniques: on Ωn,p(µ) |D1| µ(t)[ dm dµ(t)ws]µ(t)[| v|ws] Variance-Reduced Distributionally Robust Q-Learning |D2| µ(t)[ dm dµ(t)| v|ws] To bound |D3| and |D4|, recall sα s,t from (C.14). |D3| = sα s,t m[ws] µ(t)[ws] Covµ (s,t)( v, vs/α s,t) Varµ (s,t)(vs/α s,t) m[ws] µ(t)[ws] Covµ (s,t)( v, vs) Varµ (s,t)(vs) α s,t m[ws] µ(t)[ws] Covµ (s,t)( v, vs) Varµ (s,t)(vs) inf κ R vs κ L (µ) Varµ (s,t)( v) vs µ (s, t)[vs] L (µ) q Varµ (s,t)(vs) 3 v vs µ (s, t)[vs] L (µ (s,t)) vs µ (s, t)[vs] L2(µ (s,t)) where (i) used Lemma 35 with j = 1. Since α s,t > 0 for s (0, 1) and vs is not essentially constant, by Lemma 38, for some s S s.t. vs(s ) µ (s, t)[vs] = 0 vs µ (s, t)[vs] 2 L (µ (s,t)) vs µ (s, t)[vs] 2 L2(µ (s,t)) = |vs(s ) µ (s, t)[vs]|2 µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 + P s =s µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 |vs(s ) µ (s, t)[vs]|2 µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 From (C.14), by the property of variance, Varµ (s,t)( v) Varµ (s,t)(vs/α s,t) v q Varµ (s,t)(vs/α s,t) . Wang, Si, Blanchet, Zhou Hence applying similar analysis, m[vsws] α s,tµ(t)[ws] + m[ws]µ(t)[vsws] α s,tµ(t)[ws]2 = | sα s,t| µ (s, t) dm dµ(t)vs/α s,t + µ (s, t) dm µ (s, t)[vs/α s,t] Varµ (s,t)(vs/α s,t) Covµ (s,t) dµ(t), vs/α s,t By (C.11), we have E|V | E|V |1Ωn,p(µ)c + γ Z 1 0 E | s th(s, t)| 1Ωn,p(µ)dsdt E|V |1Ωn,p(µ)c + γ sup s,t (0,1) E(|D1| + |D2| + |D3| + |D4|)1Ωn,p(µ). Recall the definition (C.9) of V , V = (T (ˆq) T (q )) (b T(ˆq) b T(q )) T (ˆq) T (q ) + b T(ˆq) b T(q ) 2γ ˆq q . So, by Lemma 33, E|V |1Ωn,p(µ)c 2γ ˆq q P(Ωn,p(µ)c) p2n log(e|S|) (C.16) By the previous bounds on |Di|, i = 1, 2, 3, 4, E|V | 2γ ˆq q p2n log(e|S|) + γ sup s,t (0,1) L (µ) 1Ωn,p(µ) µ2 n log(e|S|) + 8 v µ2 n log(e|S|) + 25 ˆq q p3/2 n log(e|S|) where we choose p = 1 4p and the last inequality follows from the assumption in Proposition 24 that n p 1 . Variance-Reduced Distributionally Robust Q-Learning To bound the variance, note that Var(b T( x) b T(x )) EV 2 and EV 21Ωn,p(µ) γ2 Z 1 0 ( s th(s, t))2dsdt which follows from applying Jensen s inequality to the [0, 1]2 integral. 
Therefore, Var(b T(ˆq) b T(q )) 8γ2 ˆq q 2 P(Ωn,p(µ)c) + γ2E Z 1 0 4(D2 1 + D2 2 + D2 3 + D2 4)dsdt1Ωn,p(µ) 27γ2 ˆq q 2 µ2 n log(e|S|) + 112 µ v 2 sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) 27γ2 ˆq q 2 µ2 n log(e|S|) + 211 ˆq q 2 µ3 n log(e|S|). 212 ˆq q 2 p3 n log(e|S|). C.6 Proof of Proposition 25 Proof Recall the notations and definitions in as the proof of Proposition 24 in Appendix C.5 and, in particular, the definition (C.9) and bound (C.11) for V . We again choose p 1 4ps,a, . As Appendix C.5, we have that |V (s, a)| |V |1Ωn,p(µ)c + γ sup s,t (0,1) (|D1| + |D2| + |D3| + |D4|)1Ωn,p(µ) where µ = ps,a. Since Assumption 1 is assumed, the bounds on D1, D2, D3, D4 are still applicable. Therefore, by Hoeffding s inequality and union bound P(|V (s, a)| > t) P(Ωn,p(ps,a)c) + P γ sup s,t (0,1) (|D1| + |D2| + |D3| + |D4|) > t, Ωn,p(ps,a) P sup s S |ps,a,n(s ) ps,a(s )| > p + P 8 ˆq q (ps,a, p) ps,a, sup s S |mn(s )| > t P |mn(s )| > p + P p3/2 s,a, |mn(s )| > t exp 2p2n + exp p3/2 s,a, t2n 56 ˆq q 2 Wang, Si, Blanchet, Zhou where mn = ps,a,n ps,a. Then, as p ps,a, for all (s, a) S A, by union bound P( V > t) 2|S|2|A| exp p2 n + exp p3 t2n 56 ˆq q 2 We first control the first term to be less than η/2, which is implied by p2 log(4|S|2|A|/η). Finally, the second term less than η/2 is implied by choosing t2 = 56 ˆq q p3 n log(4|S|2|A|/η). This proves the claimed result. Appendix D. Proof of Technical Lemmas: Empirical Measures and Concentrations The proofs in the rest of this section is based on the following concentration property of maximum subgaussian random variables. D.1 Subgaussian Maximum Inequality Lemma 39 Let {Yi, i = 1 . . . n} be σ2-sub-Gaussian with zero means, not necessarily independent, then EZ := E max i=1...n |Yi|k 2kσk (k 1 + log n)k/2 . Proof For any λ > 0, consider an increasing function φλ(z) = exp(λz1/k) for z 0. Since Z 0, φλ(EZ) = φλ(EZ1 {Z > u} + EZ1 {Z u}) φλ(EZ1 {Z > u} + u P(Z u)) Take second derivatives, φ λ(z) = k 2λz1/k 2eλz1/k(λz1/k k + 1); one can see that φλ(z) is convex for z (k 1)kλ k. Let u = (k 1)kλ k. By Jensen s inequality φλ(EZ) Eφλ(Z + (k 1)kλ k) = ek 1E exp(λ max i=1...n |Yi|) i=1 Eeλ|Yi| Variance-Reduced Distributionally Robust Q-Learning Since {Yi} are Sub-Gaussian, P(|Yi| > t) 2 exp t2 By Rigollet (2015, Lemmas 1.4 and 1.5), one can show that log Eeλ|Yi| log(E[eλYi + e λYi]) log(2 exp(σ2λ2/2)) 4σ2λ2. λ E max i=1...n |Yi|k 1/k = log φλ(EZ) k 1 + log n + 4σ2λ2. Rearrange and take infimum over λ > 0, we conclude E max i=1...n |Yi|k inf λ>0 k 1 + log n 2kσk (k 1 + log n)k/2 D.2 Proof of Lemma 33 Proof By definition and Markov s inequality P(Ωn,p(µ)c) = P sup y |µn(y) µ(y)| > p 1 p2k E sup y |µn(y) µ(y)|2k = 1 p2kn2k E i=1 1 {Yi = y} µ(y) Since Pn i=1 1 {Yi = y} µ(y) is n/4 sub-Gaussian, by Lemma 39 P(Ωn,p(µ)c) 1 p2knk (2k 1 + log(|Y |))k = 1 p2knk log(e2k 1|Y |)k as claimed. D.3 Proof of Lemma 34 Proof Note that by Jensen s inequality, Wang, Si, Blanchet, Zhou So it suffices to show the second claim. By assumption, L (ξn) 1A 1 p2 E sup y |mn(y)|21A. Same as the proof of Lemma 33, we use Lemma 39 to conclude that L (ξn) 1A 1 p2n log(e|Y |). Appendix E. Proof of Technical Lemmas: KL Case E.1 Proof of Lemma 30 Proof sup α 0 f(ν, u, α) lim α 0 f(ν, u, α) = ess inf ν u ess inf µ u u L (µ) On the other hand, since the sup is achieved on compact K. For optimal α ν > 0, sup α 0 f(ν, α) u L (ν) α ν log ν[e (u u L (ν))/α ν] where the last line follows from that ν[e (u u L (ν))/α ν] > 0 and ν µ. Also, if α ν = 0, the above holds trivially. 
E.2 Proof of Lemma 31 Proof Let α and α n Use Lemma 30, sup α 0 f(µ, u, α) sup α 0 f(µn, u, α) = α n log µn[e u/α n] + α nδ α log µ[e u/α ] α δ α n(0) log µn[e (u κ)/α n(0)] + α nδ α log µ[e (u κ)/α ] α δ inf κ R |f(µn, u κ, α n(0))| + |f(µ, u κ, α )| 2 inf κ R u κ L (µ) = 2 |u|span Variance-Reduced Distributionally Robust Q-Learning E.3 Proof of Lemma 32 By definition, on Ωn,p(µ), |µn(y) µ(y)| p. So for all y s.t. µ(y) > 0 0 < µ p µ(y) p µn(y). Moreover, if µn(y) = 0, then 0 µ(y) p, we must have that µ(y) = 0. So, µn µ and hence µn µ. E.4 Proof of Lemma 35 Proof First we note that for any κ R µ[e u/α] = m[e (u κ)/α] µ[e (u κ)/α] . Therefore, it suffices to show that for m = µ1 µ2 s.t. µ µ1, µ2 µ[w]2 9 u j L (µs,a) Fix any c > 0, write µ[w]2 = max sup α [0,c u ] µ[w]2 , sup α c u =: max {J1(c), J2(c)} . We first bound J2(c) J2(c) = sup α c u L (µ) αjm[e (u+ u L (µ))/α]2 µn[e (u+ u L (µ))/α]2 For simplicity, let w := e (u+ u L (µ))/α. Recall that m = µn µ, so m[1] = 0 and αjm[e (u+ u L (µ))/α]2 = (m[αj/2(e (u+ u L (µ))/α 1)])2. Define and note that v := αj/2(e (u+ u L (µ))/α 1) < 0. Then µ[w ]2 = m[v]2 = 1 µ[w ]2 µ dm We defer the proof of the following claim: Wang, Si, Blanchet, Zhou Lemma 40 For any j [0, 2] sup α c u L (µ) L (µ) (c u L (µ))j/2(e2/c 1). J2(c) cj u j L (µ)(e2/c 1)2 dm Choose c = 2/ log 2 µ[w]2 = max {J1(c), J2(c)} max n cj u j L (µ), cj u j L (µ)(e2/c 1)2o dm 9 u j L (µ) which completes the proof. E.4.1 Proof of Lemma 40 Proof We bound v L (µ) = ess sup µ αj/2(e(u(s)+ u L (µ))/α 1) αj/2(e2 u L (µ)/α 1) Compute derivative: let β = 2 u L (µ)/α d dααj/2(e2 u L (µ)/α 1) = αj/2 1((eβ 1)j/2 βeβ) Notice that when β = 0, the above expression is 0. Moreover, for j [0, 2] d dβ ((eβ 1)j/2 βeβ) = (j/2 1 β)eβ < 0; i.e. (eβ 1)j/2 βeβ is decreasing. Therefore, for α > 0 d dααj/2(e2 u L (µ)/α 1) < 0; i.e. αj/2(e2 u L (µ)/α 1) is decreasing in α. Hence sup α c u L (µ) L (µ) sup α c u L (µ) αj/2(e2 u L (µ)/α 1) = (c u L (µ))j/2(e2/c 1) establishing the claim. Variance-Reduced Distributionally Robust Q-Learning E.5 Proof of Lemma 37 Proof Let u = u + u L (µ) and w = e u /α. sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] = sup α 0 α3 mn[w ] µn(t)[w ] + mn[u w ] αµn(t)[w ] mn[w ]µn(t)[u w ] αµn(t)[w ]2 2 sup α 0 α3 mn[w ] µn(t)[w ] + mn[u w ] 2 + 2 sup α 0 αmn[w ]2µn(t)[u w ]2 =: 2OPT1 + 2OPT2 We first analyze OPT1. Fix c 0, we separately consider α c u L (µ) and α [0, c u L (µ)]. The first two terms sup α c u L (µ) α3 mn[w ] µn(t)[w ] + mn[u w ] 2 = sup α c u L (µ) α3 mn[(1 + u /α)w ]2 = sup α c u L (µ) mn[α3/2((1 + u /α)w 1)]2 µn(t)[w ]2 . Recall that 1 + x ex; i.e. (1 + u /α)w 1 0. Also, by Lemma 32, on Ωn,p(µ), p < µ , µn(t) µn µ. So, mn[α3/2((1 + u /α)w 1)]2 α3/2(1 (1 + u /α)w ) α3/2(eu /α (1 + u /α)) 2 Recall the Taylor series of ex. For all s S, we have that α3/2(eu (s)/α (1 + u (s)/α)) = Notice that k 3/2 > 0 for k 2 and the terms in the sum are non-negative. So, the above expression suggests that α α3/2(eu (s)/α (1 + u (s)/α)) is decreasing. Therefore, on Ωn,p(µ) sup α c u L (µ) α3 mn[w ] µn(t)[w ] + mn[u w ] 2 c3 u 3 L (µ)(e2/c 1)2 dmn dµn(t) Wang, Si, Blanchet, Zhou sup α [0,c u L (µ)] α3 mn[w ] µn(t)[w ] + mn[u w ] 2 sup α [0,c u L (µ)] µn(t)[w ]2 + αmn[u w ]2 2 sup α [0,c u L (µ)] α3 dmn dµn(t) L (µ) + α u 2 L (µ) 2(c3 + 4c) u 3 L (µ) Choose c = 2, we conclude that OPT1 32 u 3 L (µ) For OPT2, we use Lemma 35. 
OPT2 9 u L (µ) µn(t)[u w ]2 L (µ) u 2 L (µ) 36 u 3 L (µ) Therefore, we conclude that on Ωn,p(µ) sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] 2 136 u 3 L (µ) The lemma follows from considering u κ, which won t change the left hand side. E.6 Proof of Lemma 36 Proof From Si et al. (2020), it is sufficient to consider α [0, δ 1 u L (µ)] =: K. For α > 0 fixed, tgn(t, α) = α mn[w] Also, for α = 0, by Lemma 32 and p 1 4µ , gn(t, 0) ess infµ u; hence tgn(t, 0) 0. Again by Lemma 32 µn(t) µ on Ωn,p(µ). So, the Radon-Nikodym theorem applies: For Variance-Reduced Distributionally Robust Q-Learning fixed t [0, 1], lim α 0 sup s (t ϵ) [0,1] | tgn(t, α)| lim α 0 sup t [0,1] α mn[w] µn(t)[w] = lim α 0 sup t [0,1] α 1 µn(t)[w]µn(t) dmn lim α 0 sup t [0,1] α dmn dµn(t) lim α 0 α µ p = 0. where we used H older s inequality to get the second last line. Therefore, tg( , ) is continuous on [0, 1] K. Next define Θ(t) := arg max α K g(t, α). To simplify notation, we use the w to denote w = w n(t) = e u/α n(t). We discuss two cases: 1. If u is µ-essentially constant with u L (µ) = u, then sup α K α log e u/α αδ = sup α K u αδ; i.e. Θ(t) = {0}. 2. u is not µ-essentially constant. Note that when α > 0, w > 0; we can define a new measure µ n(t)[ ] = µn(t)[w ] We have that α αgn(t, α) = µn(t)[u2w] α3µn(t)[w] + µn(t)[uw]2 α3µn(t)[w]2 = µ n(t)[u2] α3 + µ n(t)[u]2 = Varµ n(t)(u) i.e. gn(t, ) is strictly concave for α > 0. Also, recall that gn(t, ) is continuous at 0. So, in this case either Θ(t) = {0} or Θ(t) = {α n(t)} where δ 1 u L (µ) α n(t) > 0. In particular, Θ(t) is a singleton which we will denote by α n(t) in both cases. We conclude that by Shapiro et al. (2014) Theorem 7.21, the following derivative exists dt sup α K gn(t, α) = sup α Θ(t) tgn(t, α) = tgn(t, α n(t)). Wang, Si, Blanchet, Zhou Next, we analyze the second derivative. We prove that under Assumption 1, we have that on Ωn,p(µ) α = 0 or α > 0 will imply that α n(t) = 0 or α n(t) > 0 respectively. Let ρ = µ({y : u(y) = ess infµ u}) and ρn(t) the mixed version. Since µn µ, if ρ = 1 (thence α = 0), then we automatically have that ρn(t) 1 and α n(t) 0. Now we consider the case ρ = 1. Notice that by definition of Ωn,p(µ), ρ p ρn ρ+p. There are two cases: 1. α = 0. From Hu and Hong (2013), α = 0 iffρ e δ. If we want α n(t) = 0 for all t [0, 1], a sufficient condition is that ρn(t) ρ p e δ. 2. α > 0 iffρ < e δ. If we want α n(t) > 0 for all t [0, 1], a sufficient condition is that ρn(t) ρ + p < e δ. Therefore, for any e δ = ρ {µ({y : u(y) t}) : t R}, we can always choose p small enough s.t. for ω Ωn,p(µ), ρn(t) is close to ρ for all t and the above sufficient conditions hold. Remark 41 While this generalizes to all but finitely many δ, for simplicity of presentation, we assume Assumption 1 that µ /2 1 e δ. So, if ρ = 1, then 1 ρ µ > 1 e δ; i.e. ρ < e δ and case 1 cannot happen. Therefore, α = 0 iffu is µ essentially constant. Moreover, by our choice p 1 satisfying the sufficient condition in case 2. Hence our assumption on p implies that if α = 0 or α > 0, then on ω Ωn,p(µ), α n(t) = 0 or α n(t) > 0 for all t [0, 1] respectively. 1. α = 0, then gn(t, α n(t)) = gn(t, 0) is constant. Hence dtdtgn(t, α n(t)) = 0. 2. α > 0, then α n(t1), α n(t2) > 0. Since gn(t, ) is strictly convex, α n(t) is the unique solution to the first order optimality condition 0 = αgn(t, α n(t)) = log µn(t)[w] δ µn(t)[uw] α n(t)µn(t)[w]. (E.2) Note that αgn C ([0, 1] R++) and that α αgn(t, α n(t)) < 0. 
The implicit function theorem implies that α n(t) C1((0, 1)) with derivative dtα n(t) = t αgn(t, α n(t)) α αgn(t, α n(t)) Varµ n(t)(u) µn(t)[w] + µn(t)[uw]mn[w] α n(t)µn(t)[w]2 mn[uw] α n(t)µn(t)[w] We conclude that tgn(t, α n(t)) = α n(t) mn[w] Variance-Reduced Distributionally Robust Q-Learning is C1((0, 1)) as a function of t. Therefore, gn(t, α n(t)) is C2((0, 1)) with derivative dtdtgn(t, α n(t)) = dt tgn(t, α n(t)) = α n(t) mn[w]2 µn(t)[w]2 + dtα n(t) mn[w] µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 = α n(t) mn[w]2 µn(t)[w]2 α n(t)3 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 Therefore, Lemma 36 summarizes these two cases. E.7 Proof of Lemma 38 Proof First, we note that if µ n(t)(y) µn(t)(y) > 0, then by Lemma 32, µ n(t)(y) µn(t)(y) 3 4µ . So, we will only consider cases where µ n(t)(y) < µn(t)(y). We now fix any such y. By Lemma 36, under the given assumptions α > 0 implies that α n(t) > 0. So, the KL constraint is binding; i.e. δ = DKL(µ n(t)||µn(t)). By the log-sum inequality, δ = DKL(µ n(t)||µn(t)) µ n(t)(y) log µ n(t)(y) µn(t)(y) + (1 µ n(t))(y) log 1 µ n(t)(y) 1 µn(t)(y) kl(q, b) = q log q + (1 q) log 1 q where we think of b = µn(t)(y). Observe that for q (0, b) q qkl(q, b) = 1 q + 1 1 q > 0; i.e. kl( , b) is strictly convex and the maximum is achieved at q = 0, kl(0, b) = log(1/(1 b)). Since b [3 4µ ], we have that log(1/(1 b)) log(1/(1 3 4µ > δ. So, by the convexity, continuity of kl( , b) and kl(b, b) = 0, there is unique q (0, b) s.t. kl(q , b) = δ. Now we bound such q . Since dqkl(q, b) < 0 for q < b, by the fundamental theorem of calculus and convexity kl(q, b) = Z b q xkl(x, b)dx 2b(1 b) =: ζ(q, b) Wang, Si, Blanchet, Zhou Note that for q < b dbζ(q, b) = (b q)(q + b 2qb) 2(1 b)2b2 > 0 i.e. ζ(q, ) is increasing. Suppose to the contrary q < 1 kl(q , b) ζ(q , b) 4 µ ] ζ(q , b) However, by assumption, 1 24µ δ kl(q , µn(t)(y)) > µ . Hence q 1 2µ . We conclude that µ n(t) 1 Appendix F. The Empirical Robust Bellman Operator: χ2 Case To analyze the variance-reduced Q-learning for the χ2 case, we establish important statistical properties of the empirical DR ellman operator b T and its recentered version b H. We defer the proofs to Appendix H. The proof techniques are similar to that in Appendix C. We let b T be the empirical DR Bellman operator formed by n samples defined in (5.5). Define the recentered operators b H, H as in (A.1). We fix ˆq RS A. Proposition 42 Suppose Assumption 3 is enforced. Then |E[ b H(ˆq)(s, a) H(ˆq)(s, a)]| 26 ˆq q p n log(e|S|), provided n p 2 , and Var( b H(q ))(s, a) 211 ˆq q 2 p2 n log(e|S|) for all n 1. Proposition 43 Assume Assumption 3. Then w.p. at least 1 η H(ˆq) b H(ˆq) 6 ˆq q log(4|S|2|A|/η) provided that n 8p 2 log(4|S|2|A|/η) Proposition 44 The empirical DR Bellman operator b T(q) T (q) 8(rmax + γ q ) log (6|S||A|(|S| |R|)/η) w.p. at least 1 η, provided that n 8p 2 log (12|S||A|(|S| |R|)/η). Variance-Reduced Distributionally Robust Q-Learning Appendix G. Analysis of the Variance-Reduced Q-Learning: χ2 Case We proceed with the analysis of the variance-reduced DR Q-learning Algorithm 2 in the χ2 divergence case, similar to the KL case. Specifically, we aim to show that if the q-function from the last variance-reduced algorithm epoch, ˆql 1, is within a certain error b of the optimal q , then ˆql will have a better concentration bound by a geometric factor. This is summarized in Proposition 45, which is analogous to Proposition 10 in the KL case. 
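Before turning to the formal statements, the following minimal sketch spells out the two ingredients that the epoch-wise argument manipulates: the chi-square dual value behind the operator (5.5), and one variance-reduced epoch written in the form of update (B.6). Both are illustrations under our own conventions: chi2_dual_value and vr_epoch_chi2 are hypothetical helper names, dr_bellman stands in for the chi-square empirical operator (5.5) applied to a batch of samples, and the batch bookkeeping is an assumption rather than the paper's implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def chi2_dual_value(probs, values, delta):
    # sup_alpha { alpha - c(delta) * sqrt(E_mu[(alpha - u)_+^2]) } with c(delta) = sqrt(1 + 2*delta),
    # i.e. the dual form in Lemma 17 / (H.1); assumes delta > 0.
    u = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    c = np.sqrt(1.0 + 2.0 * delta)
    lo = u[probs > 0].min()
    hi = (c * u.max() - lo) / (c - 1.0) + 1e-8   # beyond this point the objective drops below its value at lo
    def neg_dual(alpha):
        return -(alpha - c * np.sqrt(np.dot(probs, np.maximum(alpha - u, 0.0) ** 2)))
    res = minimize_scalar(neg_dual, bounds=(lo, hi), method="bounded")
    return max(-res.fun, float(lo))

def vr_epoch_chi2(q_hat_prev, recenter_batch, step_batches, gamma, delta):
    # One epoch of the variance-reduced scheme, cf. update (B.6): a single recentering
    # estimate T_tilde(q_hat_{l-1}) is formed from a large batch, after which each of the
    # k_vr steps uses only a small batch to estimate the difference T(q_{l,k}) - T(q_hat_{l-1}).
    T_tilde = dr_bellman(q_hat_prev, recenter_batch, gamma, delta)
    q = q_hat_prev.copy()                              # q_{l,1} = q_hat_{l-1}
    for k, batch in enumerate(step_batches, start=1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)          # stepsize lambda_k = 1/(1 + (1 - gamma) k)
        t_q = dr_bellman(q, batch, gamma, delta)
        t_ref = dr_bellman(q_hat_prev, batch, gamma, delta)   # same batch, so the difference concentrates
        q = (1.0 - lam) * q + lam * (t_q - t_ref + T_tilde)
    return q                                           # plays the role of q_hat_l

The benefit of the recentering is visible directly in the update: the fluctuations of the increment t_q - t_ref scale with the residual error of the previous epoch rather than with the magnitude of the q-function itself, which is exactly what Propositions 42 and 43 quantify for the recentered operator.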
Recall that Fl denotes the σ-field generated by the random samples used until the end of epoch l. We define the conditional expectation El 1[ ] = E[ |Fl 1]. Proposition 45 Assuming that Assumptions 2 and 3 are satisfied. On {ω : ˆql 1 q b} for some b 1/(1 γ), under measure Pl 1( ) := El 1[1 { }], we have that there exists numerical constant c s.t. ˆql q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (3dkvr/η)2 + c 1 p (1 γ)2 ml w.p. at least 1 η, provided that ml 8p 2 log(24d/η) and nvr p 1 . Proof [Proof of Proposition 45] We recall the proof of Proposition 10 in Appendix B.3.1. We have that by (B.12), under Pl 1, on {ω : ˆql 1 q b} ql,k+1 q λk + Ql,k+1 + 2 Dl w.p.1. The sequence {Ql,j : j = 1, . . . , k + 1}, by (B.13), satisfies j=1 Ql,j + Ql,kvr+1 λkvr log(e + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) w.p. at least 1 η, where we recall that ζl 1 = ˆql 1 q , σ2 l 1 = max (s,a) S A Varl 1(Hl,k(ˆql 1)(s, a)). Therefore, by Proposition 42, we have that j=1 Ql,j + Ql,kvr+1 c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr log (4|S||A|kvr/η)2 Wang, Si, Blanchet, Zhou for some constant c. Moreover, recall the definition of Dl in (B.14). By Propositions 42, 43, and 44, we have that Dl c rmax + |q |span + ˆql 1 q log (12d/η) + c ˆql 1 q c 1 p (1 γ) ml log (12d/η) + c b p nvr for some constant c that can change from line to line. Combining these bound with (G.1) and apply union bound, we conclude that ql,kvr+1 q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (8dkvr/η)2 + c 1 p (1 γ)2 ml log (24d/η) w.p. at least 1 η. Recall the definition in Algorithm 2 that ql,kvr+1 = ˆql. Finally, we adjust the constant in the log factor using the inequality for C1 1, C2 e, log(C1C2) = log(C1) + log(C2) C1 log(C2). This completes the proof. Given Proposition 46, we apply the analysis techniques for the variance-reduction iterates in the proof of 46. This yields the following Proposition. Proposition 46 Assume Assumptions 2 and 3. For ϵ < (1 γ) 1, define parameters according to(5.6). Then, the statement of Proposition 11 hold; i.e. the sequence {ˆql, 0 l lvr} produced by Algorithm 2 satisfies the pathwise property that ˆql q 2 l(1 γ) 1 for all 0 l lvr w.p. at least 1 η. In particular, the final estimator ˆqlvr satisfies ˆqlvr q 2 lvr(1 γ) 1 w.p. at least 1 η. Proof [Proof of Proposition 46] Follow the proof of Proposition 11, we only to validate (B.16) given the parameter choice in (5.6). By Proposition 45, conditioned on ˆql 1 q 2 (l 1)(1 γ) 1 =: b ˆql q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (3dkvr/η)2 + c 1 p (1 γ)2 ml w.p. at least 1 η. Therefore, it is easy to see that by the parameter choice (5.6), we have that for sufficiently large cvr and for events ω ˆql 1 q 2 (l 1)(1 γ) 1 , Pl 1 1 n ˆql q 2 l(1 γ) 1o (ω) 1 η validating (B.16). Following the same arguments as in proof of Proposition 11 will yield Proposition 46. Variance-Reduced Distributionally Robust Q-Learning Now, we prove Theorem 18. Proof [Proof of Theorem 18] By Proposition 46, under the parameter choice (5.6), ˆqlvr q ϵ w.p. at least 1 η. The total number of samples used is lvrnvrkvr + = O |S||A| 1 p2 (1 γ)4 + 4lvr This yields the sample complexity bound in Theorem 18. Appendix H. Proofs of Properties of the Empirical Bellman Operator: χ2 Case We first define some notations that mimic the definitions in Appendix C. Again, we override the notations for the KL case. 
For generic probability measure µ on (Y, 2Y ) and random variable u : Y R, let w = (α u)+; define the χ2 dual functional under the reference measure µ as f(µ, u, α) := α c(δ)µ[w2] 1 2 . (H.1) Recall the dual formulation of the DR Bellman operator (5.4), we have that T (q)(s, a) = sup β R f(νs,a, id, β) + γ sup α R f(ps,a, v(q), α). (H.2) Next, we present two important lemmas that underlie our analysis of the DR Bellman operator in the χ2 case. First, we characterize the optimal Lagrange multiplier in the dual formulation (5.4). Lemma 47 For δ > 0, f(µ, u, α) is second continuously differentiable and concave for α > ess infµ u. The supremum is achieved at ess infµ u α < , i.e. supα R f(µ, u, α) = f(µ, u, α ), satisfying µ[w2] = c(δ)2µ[w]2. (H.3) Moreover, if α > ess infµ u, then µ ( ) := µ[w1 { }] µ[w] . (H.4) is a worst-case measure satisfying µ [u] = f(µ, u, α ) = inf µ :Dχ2(µ µ) δ µ [u] = inf µ :Dχ2(µ µ)=δ µ [u]; i.e. the χ2 constraint is active. Finally, if α = ess infµ u, then the measure µ ( ) := µ[1 {U }] where U := {s : µ(s) > 0, u(s) = ess infµ u} is a worst-case measure. Wang, Si, Blanchet, Zhou With this lemma, we can show that under Assumption 3, the optimal Lagrange multiplier α is sufficiently large so that w = (α v)+ = α v a.s.µ. Lemma 48 If δ < 1 2µ := mins:µ(s)>0 µ(s), then α ess supµ u. Moreover, if u is not µ essentially constant, then α > ess supµ u. The proofs of these Lemmas are deferred to Appendix I. H.1 Proof of Proposition 42 As in Appendix C.5, call V := H(ˆq) b H(ˆq) = (T (ˆq) T (q )) (b T(ˆq) b T(q )). Recall the following notations in Appendix C.5: vt = tv(ˆq) + (1 t)v(q ), µ = p, m = p pn, and µ(t) = tp (1 t)pn. Let h(s, t) := sup α R f(µ(t), vs, α). We consider Ωn,p(µ) with p 1 4µ . Then, by Lemma 32, we have that µ µn µ(t) on Ωn,p(µ). Also, recall that α s,t is the optimal Lagrange multiplier that satisfies the conclusions of Lemma 47. First we note that if v(ˆq) and v(q ) are both µ essentially constant, then V = 0, and the claim of Proposition 42 holds trivially. Moving forward, we consider the case at least one of v(ˆq) and v(q ) is not µ essentially constant. We proceed to show the differentiability of h in this setting. This is summarized by Lemma 49. The proof of this result is deferred to Appendix I. Note that Assumption 3 implies that δ < 1 Lemma 49 Suppose δ < 1 4µ . If at least one of v(ˆq) and v(q ) is not µ essentially constant, then on Ωn,p(µ) there exists function s, t D2h(s, t) s.t. |V (s, a)| γ Z 1 D2h(s, t) dsdt (H.6) w.p.1, where D2h(s, t) = µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] m[w2 s] 2µ(t)[w2s] m[ws] sα s,t = c(δ)2µ(t)[ws]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] (H.8) We analyze the two terms separately. Recall that ws 0. Similar to the techniques in Appendix C.5, we have that on Ωn,p(µ) with µ µn µ(t), µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] v m[w2 s] 2µ(t)[w2s] Variance-Reduced Distributionally Robust Q-Learning and m[ vws] Hence on Ωn,p(µ), For D2, we note that sα s,t = c(δ)2µ(t)[ws]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] = 1 c(δ)2 1 c(δ)2µ(t)[ v] µ(t)[ws v] Next, we consider m[w2 s] 2µ(t)[w2s] m[ws] µ(t)[ws] = m[(ws µ(t)[ws])2] + 2m[ws]µ(t)[ws] 2µ(t)[w2s] m[ws] = m[(ws µ(t)[ws])2] 2c(δ)2µ(t)[ws]2 (c(δ)2 1)m[ws] c(δ)2µ(t)[ws] where we use the optimality condition (E.2) to replace µ(t)[w2 s] with c(δ)2µ(t)[ws]2. Then, m[(ws µ(t)[ws])2] = µ(t) dm dµ(t)(ws µ(t)[ws])2 µ(t) (ws µ(t)[ws])2 dm dµ(t) = µ(t)[w2 s] µ(t)[ws]2 dm dµ(t) = (c(δ)2 1)µ(t)[ws]2 dm dµ(t) where we also apply (E.2) and µ(t) µ. 
So, m[w2 s] 2µ(t)[w2s] m[ws] m[(ws µ(t)[ws])2] 2c(δ)2µ(t)[ws]2 + (c(δ)2 1)m[ws] c(δ)2µ(t)[ws] Wang, Si, Blanchet, Zhou Therefore, we have that |D2| = sα s,t m[w2 s] 2µ(t)[w2s] m[ws] as c(δ)2 = 1 + 2δ 1. So, on Ωn,p(µ), | s th(s, t)| |D1| + |D2| 9 Recall (C.16) and (C.17), we have that E|V | 2γ ˆq q p2n log(e|S|) + 5γ v sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) µ2 n log(e|S|) + 5 ˆq q µ2 n log(e|S|) + 20 ˆq q p n log(e|S|) where we choose p = 1 4p and the last inequality follows from the assumption that n p 2 . To bound the variance, we use the same techniques as in (C.18) and conclude that for n 1 Var(b T(ˆq) b T(q )) 8γ2 ˆq q 2 P(Ωn,p(µ)c) + γ2E Z 1 0 2(D2 1 + D2 2)dsdt1Ωn,p(µ) 27 ˆq q 2 µ2 n log(e|S|) + 24 v 2 sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) 27 ˆq q 2 µ2 n log(e|S|) + 210 ˆq q 2 µ n log(e|S|). 211 ˆq q 2 p n log(e|S|). This is the variance of H(ˆq) as H(ˆq) is deterministic. H.2 Proof of Proposition 43 Proof Given Lemma 49, we directly apply the arguments in Appendix C.6. Variance-Reduced Distributionally Robust Q-Learning We have that w.p.1, |V (s, a)| |V |1Ωn,p(µ)c + γ sup s,t (0,1) (|D1| + |D2)1Ωn,p(µ) where µ = ps,a. Recall the choice p 1 4ps,a . By Hoeffding s inequality and the union bound P(|V (s, a)| > t) P(Ωn,p(ps,a)c) + P γ sup s,t (0,1) (|D1| + |D2|) > t, Ωn,p(ps,a) P sup s S |ps,a,n(s ) ps,a(s )| > p + P 5γ ˆq q ps,a, p sup s S |m(s)| > t P (|m(s)| > p) + P 8 ˆq q ps,a, |m(s)| > t exp 2p2n + exp p2 s,a, t2n 32 ˆq q 2 Then, as p ps,a, for all (s, a) S A, by union bound P( V > t) 2|S|2|A| exp p2 n + exp p2 t2n 32γ2 ˆq q 2 We first control the first term to be less than η/2, which is implied by p2 log(4|S|2|A|/η). Finally, the second term less than η/2 is implied by choosing t2 = 32γ2 ˆq q p2 n log(4|S|2|A|/η). This proves the claimed result. H.3 Proof of Proposition 44 Proof We recall the bound (C.4). If v(q) is essentially constant w.r.t. ps,a, then b T(q)(s, a) = T (q)(s, a). Therefore, we then focus on the case that v(q) is not essentially constant. Again, we fix p 1 4µ and thus on Ωn,p(µ), µ µn where µ = νs,a or ps,a. So, if u is not essentially constant, by Assumption 3 and Lemma 48, we have that sup α R f(µn, u, α) = f(µn, u, α n), sup α R f(µ, u, α) = f(µ, u, α ) for some α n, α > ess supµ u =: u . Then, as in (C.4) we analyze sup α>u |f(µn, u, α) f(µ, u, α)| . Wang, Si, Blanchet, Zhou Since α > u , µ[w2] > 0 and f is differentiable in µ on Ωn,p(µ). By the mean value theorem, |f(µn, u, α) f(µ, u, α)| = c(δ)1 for some τ [0, 1] where we used (H.3) and µ(t) = tµ + (1 t)µn and m = µ µn. We first consider when α > 2 u , sup α>2 u |f(µn, u, α) f(µ, u, α)| m[α2 2αu + u2] αm[u] α µ(τ)[u] m[u2] α µ(τ)[u] (α µ(τ)[u])m[u] + µ(τ)[u]m[u] m[u2] α µ(τ)[u] (i) |m[u]| + µ(τ)[u] 2 m[u2] u u sup y Y |m(y)| where (i) uses that α 2 u and hence α µ(τ)[u] u . On the other hand, if u < α 2 u sup u <α 2 u |f(µn, u, α) f(µ, u, α)| sup u <α 2 u sup u <α 2 u m[α u] µ(τ)[α u] 2 u 1 µ p sup y Y |m(y)| µ sup y Y |m(y)| where the last two inequalities follow from Lemma 32 and p 1 Variance-Reduced Distributionally Robust Q-Learning Therefore, we have P sup α R |f(µn, u, α) f(µ, u, α)| > t P(Ωn,p(µ)c) + P sup α>v |f(µn, u, α) f(µ, u, α)| > t, Ωn,p(µ) P sup y |µn(y) µ(y)| > p + P µ sup y Y |mn(y)| > t exp( 2p2n) + exp µ2 t2n 2 u 2 exp( 2p2n) + exp µ2 t2n 2 u 2 where we used Hoeffding s inequality and union bound. Therefore, going back to the DR Bellman operator setting, we choose p = 1 4p . 
By union bound P( b T(q) T (q) > t) sup s,a sup β R |f(νs,a,n, id, β) f(νs,a, id, β)| > t + P sup s,a sup α R |f(ps,a,n, v(q), β) f(ps,a, v(q), β)| > t 2(|S|2|A| + |S||A||R|) exp p2 n + 2|S||A||R| exp p2 t2n 64r2max + 2|S|2|A| exp p2 t2n 64γ2 q 2 We set each of the three terms to be less than η/3 and find that it suffices to have p2 log (12|S||A|(|S| |R|)/η) t 8(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η). This implies the statement of the proposition. Wang, Si, Blanchet, Zhou Appendix I. Proof of Technical Lemmas: χ2 Case I.1 Proof of Lemma 47 Proof First, we note that for every u and µ, f is continuous in α. Differentiate, we see that f(µ, u, ) is C1 with derivative αf(µ, u, α) = 1 c(δ)µ[w2] 1 2 µ[w] (I.1) which is again continuous. Differentiate again, we get that α αf(µ, u, α) = c(δ) µ[w2] 3 2 µ[w]2 µ[w2] 1 2 µ[1 {α > v}] = c(δ)µ[w2] 3 2 µ[w1 {α > v}]2 µ[w2]µ[1 {α > v}2] when α > ess infµ u, where (i) follows from Jensen s inequality. Moreover, this expression is continuous for α > Therefore, f is second differentiable and convex in α when α > ess infµ u. As we commented after Lemma 17, it suffices to optimize over α ess infµ u. By the continuity of f and αf in α and convexity, if the optimizer ess infµ u < α < , it must satisfies 0 = αf(µ, u, α ) = 1 c(δ)µ[w2] 1 which is (H.3). Next, we handle the boundary cases α = and α = ess infµ u. Notice that rewriting (H.3) as we see that for δ > 0, α = , because otherwise w µ[w] = 1 a.s.µ. and the above equality cannot hold. On the other hand, if α = ess infµ u, then (H.3) holds trivially with w = 0. Then, we show that (H.4) is a worst-case measure. It suffices to check that µ [u] = f(µ, u, α ) and Dχ2(µ µ) = δ. We have that µ [u] = µ[wu] = α µ[(α u)1 {α > u} (α u)] (i) = α c(δ)2µ[w] (ii) = α c(δ)µ[w2] 1 2 = f(µ, u, α ) Variance-Reduced Distributionally Robust Q-Learning where (i) and (ii) follows from (H.3). Moreover, by definition (5.1), Dχ2(µ µ) = 1 µ[w]2 + 1 2 again (i) follows from (H.3). Finally, clearly µ defined in (H.5) satisfies µ [u] = ess infµ u = f(µ, u, α ). So, to show that µ is a worst-case measure, it suffices to check that Dχ2(µ µ) δ. To show this, we observe that if α = ess infµ u, then by convexity we must have that for all sufficiently small ϵ > 0, αf(µ, u, α + ϵ) 0. Otherwise, α = ess infµ u cannot be optimal. In particular, let w(ϵ) = (α + ϵ u)+, then by (I.1), we have that µ[w(ϵ)2] c(δ)2µ[w(ϵ)]2. Note that if ϵ is sufficiently small, i.e. when α + ϵ < u(s) for all s / U and µ(s) > 0, then w(ϵ) = ϵ1U. Therefore, we must have that µ[1U] c(δ)2µ[1U]2; i.e. µ(U) 1 c(δ)2. With this bound, we now compute Dχ2(µ µ) = 1 Therefore, this proves Lemma 47. I.2 Proof of Lemma 48 Proof If u is µ essentially constant, then ess infµ u = ess supµ u = α ; i.e. the statement of Lemma 48 holds. Wang, Si, Blanchet, Zhou Next, we prove that if u is not µ essentially constant, then δ < 1 2µ implies α ess supµ u. To achieve this, we first show that α > ess infµ u under these assumptions. We prove this by assuming α = ess infµ u and raising a contradiction. By Lemma 47, µ defined in (H.5) is a worst-case measure. Hence, where (i) follows from the assumption that u is not µ essentially constant, so U = s : µ(s) > 0, u(s) = ess inf µ u cannot be of probability 1. In particular, by the definition of µ , µ(U) 1 µ . Therefore, rearrange terms, we have that δ µ 1 2µ , contradicting our assumption. Therefore, α > ess infµ u. Using this, we then show that if u is not µ essentially constant, δ < 1 2µ , and α > ess infµ u, then α ess supµ u. 
We prove by contradiction, assuming that ess infµ u < α ess supµ u. Since α ess supµ u, we must have that for some s S s.t. µ(s ) > 0, w(s ) = (α u(s ))+ = 0. By Lemma 47, µ defined in (H.4) is a worst-case measure when α > ess infµ u. Moreover, δ = Dχ2(µ µ) contradicting the assumption. Therefore, α > ess supµ u. This completes the proof of Lemma 48. I.3 Proof of Lemma 49 Proof By assumption, we are interested in empirical measures that satisfy Ωn,p(µ) (c.f. (C.3)) with p 1 4µ . Then, by Lemma 32, we have that µ µn µ(t) on Ωn,p(µ). Variance-Reduced Distributionally Robust Q-Learning We first fix s [0, 1]. Let us denote vs, := ess supµ vs. Recall that by Lemma 48, when δ < 1 2µ , it suffices to optimize the Lagrange multiplayer in [vs, , ). We have tf(µ(t), vs, α) = 1 2c(δ)m[w2 s]µ(t)[w2 s] 1 where ws = (α vs)+ = α v. It is not hard to see that tf(µ(t), vs, α) is continuous on [0, 1] [vs, , ) even if vs is essentially constant (in this case we note that tf(µ(t), vs, vs, ) = 0). Next define Θ(t) := arg max α>vs, f(µ(t), vs, α). We discuss two cases: 1. If vs is µ essentially constant, then for α > vs, f(µ(t), vs, α) = α c(δ)(α vs, ) = (1 c(δ))α + c(δ)vs, . Since c(δ) = 1 + 2δ > 0, this is maximized at Θ(t) = {vs, }. 2. vs is not µ essentially constant. Note that then by Lemma 48, α > ess supµ vs, α > vs a.s.µ (hence µ(t)). Recall that the second derivative in (I.2), α αf(µ(t), vs, α) = c(δ)µ(t)[w2 s] 3 2 µ(t)[ws1 {α > vs}]2 µ(t)[w2 s]µ(t)[1 {α > vs}2] = c(δ)µ(t)[w2 s] 3 2 µ(t)[ws]2 µ(t)[w2 s] where the last inequality follows from that ws is not µ(t) constant, hence the variance is positive. So, in this case f(µ(t), vs, ) is strictly concave. Thus, Θ(t) is a singleton. Therefore, in both case, Θ(t) is a singleton. We conclude that by Shapiro et al. (2014, Theorem 7.21), the following derivative exists dt sup α>vs, f(µ(t), vs, α) = sup α Θ(t) tf(µ(t), vs, α) = tf(µ(t), vs, α s,t) 2c(δ)m[w2 s]µ(t)[w2 s] 1 where it is understood that ws = (α s,t vs)+ = α s,t vs. Therefore, we have shown that t h(s, t) is C1(0, 1) C[0, 1]. Hence, |V (s, a)| = γ |h(1, 0) h(0, 0) h(1, 1) + h(0, 1)| 0 th(1, t) th(0, t)dt Wang, Si, Blanchet, Zhou Next, we show that for any fixed t, there exists a mapping s Ds th(s, t) s.t. (C.12) holds. We note that by Lemma 48, α s,t = vs, only when vs is essentially constant. Again, assuming that at least one of v(ˆq) and v(q ) is not µ essentially constant, as in the proof of Proposition 24, this can only happen at one particular s = s . We separately consider these two cases: Case 1: vs is never essentially constant for all s [0, 1]. In this case, α s,t > vs, for all s [0, 1]. Note that ws = α s,t vs > 0. So, if on Ωn,p(µ), α s,t is C1(0, 1) C[0, 1] in s, then by chain rule, s th(s, t) in (I.4) is C1(0, 1) C[0, 1]. As in the proof of Proposition 24, we show differentiability of s α s,t by invoking the implicit function theorem. By the strict convexity (I.3), α s,t is the unique solution to the optimality condition (H.3) 0 = c(δ)2µ(t)[ws]2 µ(t)[w2 s] =: F(s, α s,t). Since F is infinite smooth, the implicit function theorem implies that α s,t is C1(0, 1) C[0, 1] and s th(s, t) is C1(0, 1) C[0, 1]. We compute the derivative s th in this case. Recall v = v(ˆq) v(q ). Differentiate w.r.t. s on both side, we have 0 = c(δ2)2µ(t)[ws]µ(t)[ sα s,t v] 2µ(t)[ws( sα s,t v)]. Rearranging terms, we have sα s,t = c(δ)2µ(t)[w]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] This gives (H.8). 
Moreover, when α s,t > 0, s th(s, t) = 1 2c(δ)µ(t)[w2 s] 3 2 µ(t)[ vw2 s]m[w2 s] c(δ)µ(t)[w2 s] 1 c(δ)µ(t)[w2 s] 1 2 m[ws] + 1 2c(δ)µ(t)[w2 s] 3 2 µ(t)[ws]m[w2 s] = c(δ)µ(t)[w2 s] 1 2 µ(t)[ws] µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] + c(δ)µ(t)[w2 s] 1 2 µ(t)[ws] sα s,t m[w2 s] 2µ(t)[w2s] m[ws] (i) = µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] m[w2 s] 2µ(t)[w2s] m[ws] where (i) uses the optimality equation (H.3). This is consistent with (H.7). Case 2: There is a unique s [0, 1] s.t. vs is essentially constant. As in the proof of Proposition 24, in this case, the previous argument implies that s th(s, t) is C1(0, s ), C1(s , 1), and continuous at 0, 1. The derivative is also given by (H.7). Variance-Reduced Distributionally Robust Q-Learning Again, we show the existence of Ds th that satisfy (C.12). Observe that if s th(s, t) is continuous at s , then applying the fundamental theorem of calculus on the interval [0, s ] and [s , 1] separately, we will have that th(1, t) th(0, t) = Z s 0 s th(s, t)ds + Z 1 s s th(s, t)ds. Hence, taking Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0 will suffice to produce (C.12). It is left to check the continuity at s of th(s, t) = tf(µ(t), vs, α s,t) = 1 2c(δ)m[w2 s]µ(t)[w2 s] 1 from (I.4). Note that on Ωn,p(µ), for all s [0, 1], α vs, , 2c(δ)m[α vs]µ(t)[(α vs)2] 1 1 2µ(t)[(α vs)2] 1 2 dm dµ(t) (i) 1 2 α vs L (µ) 1 µ p 1 2 α vs L (µ) 1 3 4µ where (i) follows from Lemma 32. Also, th(s , t) = 0. Therefore, if sα s,t vs , as s s , then α vs L (µ) 0 as s s , implying continuity at s . It is left to check that sα s,t vs , as s s . To prove this, we assume to the contrary that there is a subsequential limit α sn,t β + vs , for some sequence sn s and β > 0. But by Lemma 47, we must have that 0 = lim n c(δ)2µ(t)[α sn,t vsn]2 µ(t)[(α sn,t vsn)2] = δβ raising a contradiction. This implies that s th(s, t) is continuous at s , and hence (C.12) holds with Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0. Therefore, in both cases (H.6) holds with D2h(s, t) is given by (H.7). This gives the claim of the lemma.