# Single-Trajectory Distributionally Robust Reinforcement Learning

Zhipeng Liang *1, Xiaoteng Ma *2, Jose Blanchet 3, Jun Yang 2, Jiheng Zhang 1 4, Zhengyuan Zhou 5 6

To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As the price for this robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, so the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.

*Equal contribution. 1 Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology; 2 Department of Automation, Tsinghua University; 3 Department of Management Science and Engineering, Stanford University; 4 Department of Mathematics, Hong Kong University of Science and Technology; 5 Stern School of Business, New York University; 6 Arena Technologies. Correspondence to: Zhengyuan Zhou. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Reinforcement Learning (RL) is a machine learning paradigm for studying sequential decision problems. Despite considerable progress in recent years (Silver et al., 2016; Mnih et al., 2015; Vinyals et al., 2019), RL algorithms often encounter a discrepancy between training and test environments. This discrepancy is widespread, since test environments may be too complex to be perfectly represented in training, or may inherently shift from the training ones, especially in application scenarios such as financial markets and robotic control. Overlooking this mismatch can impede the application of RL algorithms in real-world settings, given the known sensitivity of the optimal policy of a Markov Decision Process (MDP) to the model (Mannor et al., 2004; Iyengar, 2005). To address this concern, Distributionally Robust RL (DRRL) (Zhou et al., 2021; Yang et al., 2022; Shi & Chi; Panaganti & Kalathil, 2022; Panaganti et al., 2022; Ma et al., 2022; Yang, 2018; Abdullah et al., 2019; Neufeld & Sester, 2022) formulates the decision problem under the assumption that the test environment varies but remains close to the training environment. The objective is to design algorithms that optimize the worst-case expected return over an ambiguity set encompassing all possible test distributions.
Evaluating a DRRL policy requires deeper insight into the transition dynamics than evaluating a non-robust one, as it entails searching for the worst-case performance across all distributions within the ambiguity set. Therefore, most prior solutions are model-based, requiring the maintenance of an estimator for the entire transition model and the ambiguity set. Such requirements may render these algorithms less practical in scenarios with large state-action spaces or where adequate modeling of the real environment is infeasible. Prompted by this issue, we study a fully model-free DRRL algorithm in this paper, which learns the optimal DR policy without explicit environmental modeling. The algorithm's distinctive feature is its capacity to learn from a single sample trajectory, the least demanding requirement for data collection. This feature results from our algorithmic framework, comprising incrementally updated estimators and a delicate approximation scheme. While most model-free non-robust RL algorithms support training in this setting, contributing to their widespread use, no existing work can effectively address the DRRL problem in this way. The challenge arises from the fact that approximating a DR policy by learning from a single trajectory suffers from restricted control over state-action pairs and limited samples, i.e., only one sample at a time. As we will demonstrate, a simple plug-in estimator using one sample, which is unbiased in the non-robust Q-learning algorithm, fails to approximate any robust value accurately. The complexity of this task is further affirmed by the only prior attempt to develop a model-free DRRL algorithm (Liu et al., 2022). It relies on a restrictive simulator assumption, enabling the algorithm to access an arbitrary number of samples from any state-action pair, thereby amassing sufficient system-dynamics information before addressing the DRRL problem. Relaxing the dependence on a simulator and developing a fully model-free algorithm capable of learning from a single trajectory necessitates a delicate one-sample estimator for the DR value, carefully integrated into an algorithmic framework that eradicates the bias from insufficient samples and ensures convergence to the optimal policy. Moreover, current solutions heavily depend on the specific divergence chosen to construct the ambiguity set and fail to bridge different divergences, underscoring the practical importance of divergence selection. Thus a natural question arises: Is it possible to develop a model-free DRRL framework that can learn the optimal DR policy across different divergences using only a single sample trajectory?

1.1. Our Contributions

In this paper, we provide a positive answer to the above question by making the following contributions:

1. We introduce a pioneering approach to construct the ambiguity set using the Cressie-Read family of f-divergences. By leveraging the strong duality form of the corresponding distributionally robust reinforcement learning (DRRL) problem, we reformulate it so that optimal DR policies can be learned from misspecified MDP samples. This formulation effortlessly covers widely used divergences such as the Kullback-Leibler (KL) and χ2 divergences.

2. To address the additional nonlinearity that arises from the DR Bellman equation, which is absent in its non-robust counterpart, we develop a novel multi-timescale stochastic approximation scheme.
This scheme carefully exploits the structure of the DR Bellman operator. The update of the Q table occurs in the slowest loop, while the other two loops are delicately designed to mitigate the bias introduced by the plug-in estimator due to the nonlinearity.

3. We instantiate our framework into a DR variant of the Q-learning algorithm, called distributionally robust Q-learning with single trajectory (DRQ). This algorithm solves discounted Markov Decision Processes (MDPs) in a fully online and incremental manner. We prove the asymptotic convergence of our proposed algorithm by extending the classical two-timescale stochastic approximation framework, which may be of independent interest.

4. We conduct extensive experiments to showcase the robustness and sample efficiency of the policy learned by our proposed DR Q-learning algorithm. We also create a deep-learning version of our algorithm and compare its performance to representative online and offline (robust) reinforcement learning benchmarks on classical control tasks.

1.2. Related Work

Robust MDPs and RL: The framework of robust MDPs has been studied in several works, such as Nilim & El Ghaoui (2005); Iyengar (2005); Wiesemann et al. (2013); Lim et al. (2013); Ho et al. (2021); Goyal & Grand-Clement (2022). These works discuss the computational issues of dynamic programming under different choices of MDP formulation and ambiguity set when the transition model is known. Robust Reinforcement Learning (RL) (Roy et al., 2017; Badrinath & Kalathil, 2021; Wang & Zou, 2021) relaxes the requirement of access to the transition model by simultaneously approximating the ambiguity set and the optimal robust policy, using only samples from the misspecified MDP.

Online Robust RL: Existing online robust RL algorithms, including Wang & Zou (2021); Badrinath & Kalathil (2021); Roy et al. (2017), rely heavily on the R-contamination model and can suffer from over-conservatism. This ambiguity set keeps the corresponding Bellman operator linear and thus inherits most of the desirable properties of its non-robust counterpart. In contrast, common distributionally robust ambiguity sets, such as KL or χ2 divergence balls, suffer from extra nonlinearity when learning from single-trajectory data, which is the fundamental challenge addressed in this paper.

Distributionally Robust RL: To tackle the over-conservatism arising from the probability-agnostic R-contamination ambiguity set in the aforementioned robust RL, DRRL constructs the ambiguity set with a probability-aware distance (Zhou et al., 2021; Yang et al., 2022; Shi & Chi; Panaganti & Kalathil, 2022; Panaganti et al., 2022; Ma et al., 2022), including the KL and χ2 divergences. To the best of our knowledge, most existing DRRL algorithms are model-based: they first estimate the whole transition model, then construct the ambiguity set around it, and finally compute the DR value and the corresponding policy. Their main focus is to understand the sample complexity of the DRRL problem in the offline RL regime, leaving the more prevalent single-trajectory setting largely unexplored.

2. Preliminary

2.1. Discounted MDPs

Consider an infinite-horizon MDP $(S, A, \gamma, \mu, P, r)$, where $S$ and $A$ are finite state and action spaces with cardinalities $|S|$ and $|A|$. $P: S \times A \to \Delta_S$ is the state transition probability measure, where $\Delta_S$ is the set of probability measures over $S$.
$r$ is the reward function and $\gamma$ is the discount factor. We assume that $r: S \times A \to [0, 1]$ is deterministic and bounded in $[0, 1]$. A stationary policy $\pi: S \to \Delta_A$ maps each state $s$ to a probability distribution over the action set $A$ and induces a random trajectory $s_1, a_1, r_1, s_2, \ldots$, with $s_1 \sim \mu$, $a_n \sim \pi(s_n)$ and $s_{n+1} \sim P(\cdot \mid s_n, a_n) := P_{s_n, a_n}$ for $n \in \mathbb{N}_+$. To derive the policy corresponding to the value function, we define the optimal state-action function $Q^*: S \times A \to \mathbb{R}$ as the expected cumulative discounted reward under the optimal policy,
$$Q^*(s, a) := \sup_{\pi \in \Pi} \mathbb{E}^{\pi, P}\Big[\sum_{n=1}^{\infty} \gamma^{n-1} r(s_n, a_n) \,\Big|\, s_1 = s, a_1 = a\Big].$$
The optimal state-action function $Q^*$ is also the fixed point of the Bellman optimality equation,
$$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P}\big[\max_{a' \in A} Q^*(s', a')\big]. \tag{1}$$

2.2. Q-learning

Our model-free algorithmic design relies on the Q-learning template, originally designed to solve the non-robust Bellman optimality equation (Equation 1). Q-learning is a model-free reinforcement learning algorithm that uses a single sample trajectory to update the estimator of the Q function incrementally. Suppose at time $n$ we draw a sample $(s_n, a_n, r_n, s'_n)$ from the environment. The algorithm then updates the estimated Q-function as
$$Q_{n+1}(s_n, a_n) = (1 - \alpha_n)\, Q_n(s_n, a_n) + \alpha_n\big(r_n + \gamma \max_{a' \in A} Q_n(s'_n, a')\big),$$
where $\alpha_n > 0$ is a learning rate. The algorithm updates the estimated Q function by constructing an unbiased estimator of the true Q value, i.e., $r_n + \gamma \max_{a' \in A} Q_n(s'_n, a')$, from one sample.

2.3. Distributionally Robust MDPs

DRRL learns an optimal policy that is robust to unknown environmental changes, where the transition model $P$ and the reward function $r$ may differ in the test environment. To focus on the perturbation of the transition model, we assume no perturbation to the reward function. Our approach adopts the notion of distributional robustness, where the true transition model is unknown but lies within an ambiguity set $\mathcal{P}$ that contains all transition models close to the training environment under some probability distance $D$. To ensure computational feasibility, we construct the ambiguity set $\mathcal{P}$ in the $(s, a)$-rectangular manner: for each $(s, a) \in S \times A$, we define the ambiguity set $\mathcal{P}_{s,a}$ as
$$\mathcal{P}_{s,a} := \{P'_{s,a} \in \Delta_S \mid D(P'_{s,a} \,\|\, P_{s,a}) \le \rho\}. \tag{2}$$
We then build the ambiguity set for the whole transition model as the Cartesian product of the $(s, a)$-ambiguity sets, i.e., $\mathcal{P} = \prod_{(s,a) \in S \times A} \mathcal{P}_{s,a}$. Given $\mathcal{P}$, we define the optimal DR state-action function $Q^{\mathrm{rob},*}$ as the value function of the best policy maximizing the worst-case return over the ambiguity set,
$$Q^{\mathrm{rob},*}(s, a) := \sup_{\pi \in \Pi} \inf_{P \in \mathcal{P}} \mathbb{E}^{\pi, P}\Big[\sum_{n=1}^{\infty} \gamma^{n-1} r(s_n, a_n) \,\Big|\, s_1 = s, a_1 = a\Big].$$
Under the $(s, a)$-rectangular assumption, the Bellman optimality equation has been established by Iyengar (2005); Xu & Mannor (2010):
$$Q^{\mathrm{rob},*}(s, a) = \mathcal{T}(Q^{\mathrm{rob},*})(s, a) := r(s, a) + \gamma \inf_{P \in \mathcal{P}} \mathbb{E}_{s' \sim P}\big[\max_{a' \in A} Q^{\mathrm{rob},*}(s', a')\big]. \tag{3}$$
For notational simplicity, we omit the superscript rob in the remainder of the paper.
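To make the inner minimization in Equation 3 concrete, here is a small self-contained sketch (not from the paper; the successor values, nominal distribution, and radii are made up) that computes the worst-case expected next-state value over a divergence ball by direct constrained optimization on the simplex. It uses the Cressie-Read generator that Section 3.1 introduces formally; $k = 2$ corresponds to a χ2-type ball.

```python
import numpy as np
from scipy.optimize import minimize

def f_k(t, k):
    # Cressie-Read generator f_k(t) = (t^k - k t + k - 1) / (k (k - 1))
    return (t ** k - k * t + k - 1.0) / (k * (k - 1.0))

def worst_case_expectation(v, p0, rho, k=2.0):
    """min_p <p, v> over the simplex subject to D_{f_k}(p || p0) <= rho."""
    cons = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "ineq", "fun": lambda p: rho - np.sum(p0 * f_k(p / p0, k))},
    ]
    res = minimize(lambda p: p @ v, x0=p0.copy(),
                   bounds=[(0.0, 1.0)] * len(v),
                   constraints=cons, method="SLSQP")
    return res.fun

v = np.array([1.0, 2.0, 4.0])    # hypothetical values max_a' Q(s', a')
p0 = np.array([0.2, 0.5, 0.3])   # nominal transition P(. | s, a)
for rho in (0.0, 0.1, 0.5):
    print(f"rho = {rho:.1f}  robust backup = {worst_case_expectation(v, p0, rho):.3f}")
# rho = 0 recovers the nominal expectation 2.4; larger rho moves probability
# mass toward low-value successor states, so the robust backup decreases.
```

This direct approach needs the full nominal distribution $P_{s,a}$, which is exactly what a model-free, single-trajectory algorithm does not have; the rest of the paper works around this via the dual reformulation.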
3. Distributionally Robust Q-learning with Single Trajectory

This section presents a general model-free framework for DRRL. We begin by instantiating the distance $D$ as the Cressie-Read family of f-divergences (Cressie & Read, 1984), which recovers previous common choices such as the χ2 and KL divergences. We then discuss the challenges in solving the corresponding DRRL problem and previous solutions, as described in Section 3.2. Finally, we present the design of our three-timescale framework and establish the corresponding convergence guarantee.

3.1. Divergence Families

Previous work on DRRL has mainly focused on one or a few divergences, such as the KL, χ2, and total variation (TV) divergences. In contrast, we provide a unified framework that applies to a family of divergences known as the Cressie-Read family of f-divergences. This family is parameterized by $k \in (-\infty, \infty) \setminus \{0, 1\}$, and for any chosen $k$, the Cressie-Read family of f-divergences is defined as
$$D_{f_k}(Q \,\|\, P) = \int f_k\Big(\frac{dQ}{dP}\Big)\, dP, \qquad \text{with } f_k(t) := \frac{t^k - kt + k - 1}{k(k-1)}.$$
Based on this family, we instantiate our ambiguity set in Equation 2 as $\mathcal{P}_{s,a} = \{P'_{s,a} \in \Delta_S \mid D_{f_k}(P'_{s,a} \,\|\, P_{s,a}) \le \rho\}$ for some radius $\rho > 0$. The Cressie-Read family of f-divergences includes the χ2-divergence ($k = 2$) and the KL divergence ($k \to 1$).

One key challenge in developing DRRL algorithms using the formulation in Equation 3 is that the expectation is taken over distributions in the ambiguity set $\mathcal{P}$, which is computationally intensive even with access to the center model $P$. Since we only have access to samples generated from the possibly misspecified model $P$, estimating the expectation with respect to other models $P' \in \mathcal{P}$ is even more challenging. While importance-sampling-based techniques can achieve this, the cost of high variance is undesirable. To address this issue, we rely on the dual reformulation of Equation 3:

Lemma 3.1 (Duchi & Namkoong, 2021). For any random variable $X \sim P$, define $\sigma_k(X, \eta) = -c_k(\rho)\, \mathbb{E}_P\big[(\eta - X)_+^{k_*}\big]^{1/k_*} + \eta$ with $k_* = \frac{k}{k-1}$ and $c_k(\rho) = (1 + k(k-1)\rho)^{1/k}$. Then
$$\inf_{Q}\big\{\mathbb{E}_Q[X] : D_{f_k}(Q \,\|\, P) \le \rho\big\} = \sup_{\eta \in \mathbb{R}} \sigma_k(X, \eta). \tag{4}$$
Here $(x)_+ = \max\{x, 0\}$.

Equation 4 shows that protecting against distribution shift is equivalent to optimizing the tail performance of a model, as only the values below the dual variable $\eta$ are taken into account. Another key insight from the reformulation is that, as the growth of $f_k(t)$ for large $t$ becomes steeper for larger $k$, the f-divergence ball shrinks and the risk measure becomes less conservative. This bridges the gap between different divergences, whereas previous literature, including Yang et al. (2022) and Zhou et al. (2021), treats different divergences separately. By applying the dual reformulation, we can rewrite the Cressie-Read Bellman operator in Equation 3 as
$$\mathcal{T}_k(Q)(s, a) = r(s, a) + \gamma \sup_{\eta \in \mathbb{R}} \sigma_k\big(\max_{a' \in A} Q(\cdot, a'), \eta\big). \tag{5}$$

3.2. Bias of the Plug-in Estimator in the Single-Trajectory Setting

In this subsection, we aim to solve Equation 5 using single-trajectory data, which has not been addressed by previous DRRL literature. As we can only observe one newly arriving sample at a time, designing an online model-free DRRL algorithm requires properly approximating the expectation in Equation 5 using that single sample. As mentioned in Section 2.2, the design of the Q-learning algorithm relies on a one-sample unbiased estimator of the true Bellman operator. However, this convenience vanishes for the DR Bellman operator. To illustrate this, consider plugging a single sample into the Cressie-Read Bellman operator in Equation 5:
$$r(s, a) + \gamma \sup_{\eta \in \mathbb{R}}\Big\{\eta - c_k(\rho)\big(\eta - \max_{a'} Q(s', a')\big)_+\Big\} = r(s, a) + \gamma \max_{a'} Q(s', a').$$
This reduces to the non-robust Bellman operator and is clearly not an unbiased estimator of $\mathcal{T}_k(Q)$. This example reveals the inherently more challenging nature of the online DRRL problem. Whereas non-robust RL only needs to improve the expectation of the cumulative return, improving the worst-case return requires more information about the system dynamics, which seems hopeless to obtain from only one sample and sharply contrasts with our goal.
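As a quick numerical illustration of the point above (with made-up values, not taken from the paper), the sketch below evaluates the dual objective $\sigma_k$ of Lemma 3.1 for a small discrete distribution to obtain the robust value, and contrasts it with the one-sample plug-in, whose expectation is simply the non-robust mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

def sigma_k(eta, x, probs, rho, k):
    # dual objective of Lemma 3.1: -c_k(rho) * E[(eta - X)_+^{k*}]^{1/k*} + eta
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)
    moment = np.sum(probs * np.clip(eta - x, 0.0, None) ** k_star)
    return -c_k * moment ** (1.0 / k_star) + eta

# hypothetical next-state values max_a' Q(s', a') and nominal distribution
x = np.array([1.0, 2.0, 4.0])
probs = np.array([0.2, 0.5, 0.3])
rho, k = 0.5, 2.0

# robust value: sup_eta sigma_k(eta)
res = minimize_scalar(lambda eta: -sigma_k(eta, x, probs, rho, k),
                      bounds=(x.min() - 1.0, x.max() + 1.0), method="bounded")
robust_value = -res.fun

# one-sample plug-in: sup_eta {eta - c_k(rho)(eta - X)_+} = X for every sample,
# so averaging plug-in estimates just recovers the non-robust mean E[X].
samples = rng.choice(x, size=50_000, p=probs)
print(f"robust value      : {robust_value:.3f}")
print(f"one-sample plug-in: {samples.mean():.3f}  (non-robust mean {probs @ x:.3f})")
```

The gap between the two printed numbers is exactly the bias that a naive single-sample DR Q-learning update would inherit.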
Even with the help of batch samples, deriving an appropriate estimator for the DR Bellman operator is still nontrivial. Consider a standard approach to constructing estimators, sample average approximation (SAA): given a batch of size $n$ starting from a fixed state-action pair $(s, a)$, i.e., $\mathcal{D}_n = \{(s_i, a_i, s'_i, r_i),\, i \in [n],\, (s_i, a_i) = (s, a)\}$, the SAA empirical Bellman operator is defined as
$$\widehat{\mathcal{T}}_k(Q)(s, a, \mathcal{D}_n) = r(s, a) + \gamma \sup_{\eta \in \mathbb{R}} \widehat{\sigma}_k\big(\max_{a' \in A} Q(\cdot, a'), \eta, \mathcal{D}_n\big).$$
Here, $\widehat{\sigma}_k$ is the empirical Cressie-Read functional defined as
$$\widehat{\sigma}_k := -c_k(\rho)\Big[\frac{1}{n}\sum_{i \in [n]} \big(\eta - \max_{a' \in A} Q(s'_i, a')\big)_+^{k_*}\Big]^{1/k_*} + \eta.$$
As pointed out by Liu et al. (2022), the SAA estimator is biased, prompting the introduction of the multilevel Monte Carlo method (Blanchet & Glynn, 2015). Specifically, it first draws $N \in \mathbb{N}_+$ from the distribution $P(N = n) = p_n = \epsilon(1-\epsilon)^n$, and then uses the simulator to draw $2^{N+1}$ samples $\mathcal{D}_{2^{N+1}}$. The samples are further decomposed into two parts: $\mathcal{D}_{:2^N}$ consists of the first $2^N$ samples, while $\mathcal{D}_{2^N+1:}$ contains the remaining ones. Finally, the DR term in Equation 5 is approximated by solving three optimization problems:
$$\widehat{\mathcal{T}}_k(Q)(s, a, \mathcal{D}) = r_1 + \max_{a' \in A} Q(s'_1, a') + \Delta_{N,\delta}(Q),$$
$$\Delta_{N,\delta}(Q) := \sup_{\eta \ge 0} \widehat{\sigma}_k\big(\max_{a' \in A} Q(\cdot, a'), \eta, \mathcal{D}_{2^{N+1}}\big) - \frac{1}{2}\sup_{\eta \ge 0} \widehat{\sigma}_k\big(\max_{a' \in A} Q(\cdot, a'), \eta, \mathcal{D}_{:2^N}\big) - \frac{1}{2}\sup_{\eta \ge 0} \widehat{\sigma}_k\big(\max_{a' \in A} Q(\cdot, a'), \eta, \mathcal{D}_{2^N+1:}\big).$$
However, this multilevel Monte Carlo solution requires a large batch of samples for the same state-action pair before the next update, resulting in unbounded memory cost and computation time, which is not practical. Furthermore, it is infeasible in the single-trajectory setting, where only one sample is observed at each step. Our experimental results show that simply approximating the Bellman operator with simulation data, without exploiting its structure, suffers from low data efficiency.
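The following sketch (again with a made-up three-point distribution, not tied to the paper's experiments) implements the batch SAA estimator $\sup_{\eta}\widehat{\sigma}_k$ described above and averages it over many batches of each size; the drift of the average with the batch size is the finite-sample bias that motivates the multilevel construction and, in turn, the three-timescale design introduced next.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def saa_robust_value(next_values, rho=0.5, k=2.0):
    """sup_eta of the empirical Cressie-Read dual built from a batch of
    next-state values max_a' Q(s'_i, a')."""
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)

    def neg_dual(eta):
        moment = np.mean(np.clip(eta - next_values, 0.0, None) ** k_star)
        return c_k * moment ** (1.0 / k_star) - eta   # = -sigma_hat_k(eta)

    lo, hi = next_values.min() - 1.0, next_values.max() + 1.0
    return -minimize_scalar(neg_dual, bounds=(lo, hi), method="bounded").fun

rng = np.random.default_rng(1)
support, probs = np.array([1.0, 2.0, 4.0]), np.array([0.2, 0.5, 0.3])
for n in (2, 10, 100, 1000):
    batches = rng.choice(support, size=(500, n), p=probs)
    print(n, round(np.mean([saa_robust_value(b) for b in batches]), 3))
# the batch average changes systematically with n: the SAA estimate is biased
# for small batches and only settles near the true robust value as n grows.
```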
3.3. Three-timescale Framework

Q-learning solves for the fixed point of the non-robust Bellman operator in a stochastic approximation manner. A salient feature of the DR Bellman operator, compared with its non-robust counterpart, is its bi-level optimization nature, i.e., jointly solving for the dual parameter $\eta$ and the fixed point $Q$ of the Bellman optimality equation. We revisit the stochastic approximation view of Q-learning and develop a three-timescale framework, with a faster-running estimate of the optimal dual parameter and a slower update of the Q table. To solve Equation 5 with a stochastic approximation template, we iteratively update the dual variable $\eta$ and the Q table as follows: at the $n$-th iteration, after observing a new transition sample $(s_n, a_n, s'_n, r_n)$ and with learning rates $\zeta_1, \zeta_2 > 0$,
$$\eta_{n+1} = \eta_n + \zeta_1 \cdot \big(\text{gradient of } \sigma_k \text{ with respect to } \eta_n\big), \qquad Q_{n+1} = (1 - \zeta_2)\, Q_n + \zeta_2\big(r_n + \gamma\, \sigma_k(\max_{a' \in A} Q_n(\cdot, a'), \eta_n)\big).$$
As the updates of $\eta$ and $Q$ rely on each other, we keep their learning speeds, i.e., $\zeta_1$ and $\zeta_2$, different to stabilize the training process. Additionally, due to the $(s, a)$-rectangular assumption, $\eta$ is independent across different $(s, a)$-pairs, while the entries of the Q table depend on one another. The independent structure of $\eta$ allows it to be estimated more easily, so we approximate it in a faster loop, while $Q$ is updated in a slower loop.

3.4. Algorithmic Design

In this subsection, we instantiate the three-timescale framework for the Cressie-Read family of f-divergences. First, we compute the (sub-)gradient of $\sigma_k(\max_{a' \in A} Q(\cdot, a'), \eta)$ in Equation 5 with respect to $\eta$.

Lemma 3.2 (Sub-gradient of the dual function $\sigma_k$). With $Z_1$ and $Z_2$ defined in Equations 7 and 8,
$$\partial_\eta\, \sigma_k\big(\max_{a' \in A} Q(\cdot, a'), \eta\big) = \begin{cases} \big\{-c_k(\rho)\, Z_1^{\frac{1}{k_*}-1} Z_2 + 1\big\}, & \eta > \max_{a' \in A} Q(\cdot, a'),\\ \big[-c_k(\rho)\, Z_1^{\frac{1}{k_*}-1} Z_2 + 1,\; 1\big], & \eta = \max_{a' \in A} Q(\cdot, a'),\\ \{1\}, & \eta < \max_{a' \in A} Q(\cdot, a'), \end{cases} \tag{6}$$
where
$$Z_1 = \mathbb{E}_P\big[\big(\eta - \max_{a' \in A} Q(\cdot, a')\big)_+^{k_*}\big], \tag{7}$$
$$Z_2 = \mathbb{E}_P\big[\big(\eta - \max_{a' \in A} Q(\cdot, a')\big)_+^{k_*-1}\big]. \tag{8}$$

Algorithm 1 Distributionally Robust Q-learning with the Cressie-Read family of f-divergences
1: Input: exploration rate $\epsilon$, learning rates $\{\zeta_i(n)\}_{i \in [3]}$, Cressie-Read family parameter $k$, ambiguity set radius $\rho$.
2: Init: initialize $Q$, $Z_1$, $Z_2$ and $\eta$ with zeros.
3: for $n = 1, 2, \ldots$ do
4:   Observe the state $s_n$ and execute the action $a_n = \arg\max_{a \in A} Q(s_n, a)$ using an $\epsilon$-greedy policy.
5:   Observe the reward $r_n$ and the next state $s'_n$.
6:   Update $Z_1(s_n, a_n) \leftarrow (1 - \zeta_1(n)) Z_1(s_n, a_n) + \zeta_1(n)\big(\eta(s_n, a_n) - \max_{a} Q(s'_n, a)\big)_+^{k_*}$ and $Z_2(s_n, a_n) \leftarrow (1 - \zeta_1(n)) Z_2(s_n, a_n) + \zeta_1(n)\big(\eta(s_n, a_n) - \max_{a} Q(s'_n, a)\big)_+^{k_*-1}$.
7:   Update $\eta(s_n, a_n) \leftarrow \eta(s_n, a_n) + \zeta_2(n)\big(-c_k(\rho)\, Z_1^{\frac{1}{k_*}-1}(s_n, a_n)\, Z_2(s_n, a_n) + 1\big)$.
8:   Update $Q(s_n, a_n) \leftarrow (1 - \zeta_3(n)) Q(s_n, a_n) + \zeta_3(n)\big(r_n - \gamma\big(c_k(\rho)\, Z_1^{\frac{1}{k_*}}(s_n, a_n) - \eta(s_n, a_n)\big)\big)$.
9: end for

Due to the nonlinearity in Equation 6, the plug-in gradient estimator is in fact biased: for a random variable $X$, $\mathbb{E}[f(X)] \neq f(\mathbb{E}[X])$ for the nonlinear map $f(x) = x^{\frac{1}{k_*}-1}$ appearing in Equation 6. To address this issue, we introduce another, even faster timescale to estimate $Z_1$ and $Z_2$:
$$Z_1(s_n, a_n) \leftarrow (1 - \zeta_1(n)) Z_1(s_n, a_n) + \zeta_1(n)\big(\eta(s_n, a_n) - \max_{a'} Q(s'_n, a')\big)_+^{k_*}, \tag{9}$$
$$Z_2(s_n, a_n) \leftarrow (1 - \zeta_1(n)) Z_2(s_n, a_n) + \zeta_1(n)\big(\eta(s_n, a_n) - \max_{a'} Q(s'_n, a')\big)_+^{k_*-1}. \tag{10}$$
In the medium timescale, we approximate $\eta^*(s, a) := \arg\max_{\eta \in \mathbb{R}} \sigma_k(\max_{a' \in A} Q(\cdot, a'), \eta)$ by incrementally updating the dual variable $\eta$ with a stochastic (sub-)gradient step, where the true gradient computed in Equation 6 is approximated by
$$\eta(s_n, a_n) \leftarrow \eta(s_n, a_n) + \zeta_2(n)\big(-c_k(\rho)\, Z_1^{\frac{1}{k_*}-1}(s_n, a_n)\, Z_2(s_n, a_n) + 1\big). \tag{11}$$
Finally, we update the DR Q function on the slowest timescale using Equation 12,
$$Q(s_n, a_n) \leftarrow (1 - \zeta_3(n)) Q(s_n, a_n) + \zeta_3(n)\, \widehat{\mathcal{T}}_{n,k}(Q)(s_n, a_n), \tag{12}$$
where $\widehat{\mathcal{T}}_{n,k}(Q)(s_n, a_n)$ is the empirical version of Equation 5 at the $n$-th iteration:
$$\widehat{\mathcal{T}}_{n,k}(Q)(s_n, a_n) = r_n - \gamma\big(c_k(\rho)\, Z_1^{\frac{1}{k_*}}(s_n, a_n) - \eta(s_n, a_n)\big).$$
Here $\zeta_1(n)$, $\zeta_2(n)$ and $\zeta_3(n)$ are the learning rates of the three timescales at time $n$, which will be specified later. We summarize these ingredients into our DR Q-learning (DRQ) algorithm (Algorithm 1) and prove almost sure (a.s.) convergence of the algorithm in Theorem 3.3; the proof is deferred to Appendix B.

Theorem 3.3. The estimators at the $n$-th step of Algorithm 1, $(Z_{n,1}, Z_{n,2}, \eta_n, Q_n)$, converge to $(Z^*_1, Z^*_2, \eta^*, Q^*)$ a.s. as $n \to \infty$, where $\eta^*$ and $Q^*$ are the fixed point of the equation $Q = \mathcal{T}_k(Q)$, and $Z^*_1$ and $Z^*_2$ are the corresponding quantities under $\eta^*$ and $Q^*$.

The proof establishes that, by appropriately selecting stepsizes so that $Z_{n,1}$ and $Z_{n,2}$ are updated most frequently, followed by $\eta_n$, with $Q_n$ updated at the slowest rate, the solution path of $(Z_{n,1}, Z_{n,2}, \eta_n, Q_n)$ closely tracks a three-timescale system of ordinary differential equations (ODEs) in the presence of martingale noise. Our approach is to generalize the classic machinery of two-timescale stochastic approximation (Borkar, 2009) to a three-timescale framework and use it to analyze our proposed algorithm. See Appendix B for the detailed proof.
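To connect Algorithm 1 to code, here is a minimal tabular sketch of a single DRQ update. It is our own illustrative rendering rather than the authors' implementation: the table layout and function signature are assumptions, and the stepsize constants simply copy the Cliffwalking schedules reported in Section 4.

```python
import numpy as np

def drq_step(Q, eta, Z1, Z2, s, a, r, s_next, n, rho=0.5, k=2.0, gamma=0.9):
    """One step of Algorithm 1 on tabular arrays of shape (num_states, num_actions).
    All tables are updated in place."""
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * rho) ** (1.0 / k)
    # stepsizes: the Z estimates use the largest one (fastest loop),
    # the Q table the smallest one (slowest loop)
    zeta1 = 1.0 / (1.0 + (1.0 - gamma) * n ** 0.6)
    zeta2 = 1.0 / (1.0 + 0.1 * (1.0 - gamma) * n ** 0.8)
    zeta3 = 1.0 / (1.0 + 0.05 * (1.0 - gamma) * n)

    gap = max(eta[s, a] - Q[s_next].max(), 0.0)
    # fastest timescale: running moments Z1, Z2 (Equations 9 and 10)
    Z1[s, a] = (1.0 - zeta1) * Z1[s, a] + zeta1 * gap ** k_star
    Z2[s, a] = (1.0 - zeta1) * Z2[s, a] + zeta1 * gap ** (k_star - 1.0)
    # medium timescale: sub-gradient ascent on the dual variable (Equation 11)
    if Z1[s, a] > 0.0:
        grad = -c_k * Z1[s, a] ** (1.0 / k_star - 1.0) * Z2[s, a] + 1.0
    else:
        grad = 1.0  # sub-gradient below the kink, cf. Lemma 3.2
    eta[s, a] += zeta2 * grad
    # slowest timescale: robust Q update (Equation 12)
    target = r - gamma * (c_k * Z1[s, a] ** (1.0 / k_star) - eta[s, a])
    Q[s, a] = (1.0 - zeta3) * Q[s, a] + zeta3 * target
```

Wrapping this function in an $\epsilon$-greedy loop (choose $\arg\max_a Q(s, a)$ with probability $1-\epsilon$, act, observe $(r, s')$, then call `drq_step`) reproduces the single-trajectory training loop of Algorithm 1.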
4. Experiments

We demonstrate the robustness and sample complexity of our DRQ algorithm in the Cliffwalking environment (Deletang et al., 2021) and in an American put option environment (deferred to Appendix A). These environments provide a focused perspective on the policy and enable a clear understanding of the effects of the key parameters. We also develop a deep-learning version of DRQ and compare it with practical online and offline (robust) RL algorithms on the classical control tasks Lunar Lander and Cart Pole.

4.1. Convergence and Sample Complexity

Before we begin, let us outline the key findings of this subsection: (1) Our ambiguity set design provides substantial robustness, as demonstrated through comparisons with non-robust Q-learning and the R-contamination ambiguity set (Wang & Zou, 2021). (2) Our DRQ algorithm exhibits desirable sample complexity, significantly outperforming the multilevel Monte Carlo based DRQ algorithm proposed by Liu et al. (2022) and comparable to the model-based DRRL algorithm of Panaganti & Kalathil (2022).

Figure 1. The Cliffwalking environment and the learned policies for different ρ's (panels: (a) environment, (b) non-robust, (c) ρ = 1.0, (d) ρ = 1.5).

Experiment Setup: The Cliffwalking task is commonly used in risk-sensitive RL research (Deletang et al., 2021). Compared to the Frozen Lake environment used by Panaganti & Kalathil (2022), Cliffwalking offers a more intuitive visualization of robust policies (see Figure 1). The task involves a robot navigating from an initial state (2, 0) to a goal state (2, 3). At each step, the robot is affected by wind, which causes it to move in a random direction with probability p. Reaching the goal state earns a reward of +5, while encountering a wave in the water region {(3, j) | 0 ≤ j ≤ 3} results in a penalty of −1. We train the agent in the nominal environment with p = 0.5 for 3 million steps per run, using an ϵ-greedy exploration strategy with ϵ = 0.1. We evaluate its performance in perturbed environments, varying the choices of k and ρ to demonstrate different levels of robustness. We set the stepsize parameters according to Condition B.1: $\zeta_1(t) = 1/(1 + (1-\gamma)t^{0.6})$, $\zeta_2(t) = 1/(1 + 0.1(1-\gamma)t^{0.8})$, and $\zeta_3(t) = 1/(1 + 0.05(1-\gamma)t)$, where the discount factor is γ = 0.9.

Figure 2. Averaged return and steps over 100 random seeds in the perturbed environments (panels: (b) episode length; (c) value for various k and ρ). ρ = 0 corresponds to non-robust Q-learning; R denotes the R-contamination ambiguity set.

Figure 3. Training curves in the Cliffwalking environment. Each curve is averaged over 100 random seeds and shaded by its standard deviation; the dashed line is the optimal robust value for the corresponding k and ρ.

Robustness: To evaluate the robustness of the learned policies, we compare their cumulative returns in perturbed environments with p ∈ {0.5, 0.6, 0.7, 0.8, 0.9} over 100 episodes per setting. We visualize the decision at each state in Figure 1 for different robustness levels ρ. In particular, the more robust policy tends to avoid falling into the water and thus reaches the goal state via a longer path, keeping going up before going right. Figure 2a shows the return distribution for each policy. Figure 2b displays the time taken for the policies to reach the goal: the more robust policy tends to spend more time, which quantitatively supports our observations in Figure 1. Interestingly, we find that the robust policies outperform the non-robust one even in the nominal environment.
For the different ρ s, ρ = 1.0 is the best within a relatively wide range (p {0.6, 0.7, 0.8}), while ρ = 1.5 is preferred in the environment of extreme pertubation (p = 0.9). This suggests that DRRL provides a elegant trade-off for different robustness preferences. We also compare our model-free DRRL algorithm with the robust RL algorithm presented in Wang & Zou (2021), which also supports training using a single trajectory. The algorithm in Wang & Zou (2021) uses an R-contamination ambiguity set. We select the best value of R from 0.1 to 0.9 and other detailed descriptions in Appendix A. In most cases, the R-contamination based algorithm performs very similarly to the non-robust benchmark, and even performs worse in some cases (i.e., p = 0.8 and 0.9), due to its excessive conservatism. As we mentioned in Section 3.1, larger k would render the the risk measure less conservative and thus less sensitive to the change in the ball radius ρ, which is empirically confirmed by Figure 2c. Sample Complexity: The training curves in Figure 3 depict the estimated value maxa b Q(s0, a) (solid line) and the optimal robust value V (s0) (dashed line) for the initial state s0. The results indicate that the estimated value converges quickly to the optimal value, regardless of the values of k and ρ. Importantly, our DRQ algorithm achieves a similar convergence rate to the non-robust baseline (represented by the black line). We further compare our algorithm with two robust baselines: the DRQ algorithm with a weak simulator proposed by Liu et al. (2022) (referred to as Liu s), and the model-based algorithm introduced by Panaganti & Kalathil (2022) (referred to as Model) in Figure 4. To ensure a fair comparison, we set the same learning rate, ζ3(t), for our DRQ algorithm and the Q-table update loop of the Liu s algorithm, as per their recommended choices. Our algorithm converges to the true DR value at a similar rate as the model-based algorithm, while the Liu s algorithm exhibits substantial deviation from the true value and converges relatively slowly. Our algorithm s superior sample efficiency is attributed to the utilization of first-order information to approximate optimal dual variables, whereas Liu s relies on a large amount of simulation data for an unbiased Single-Trajectory DRRL 0 0.075 0.15 Sample Size (Million) = 0.1, k = 2.0 0 0.075 0.15 Sample Size (Million) = 0.2, k = 2.0 0 0.075 0.15 Sample Size (Million) = 0.4, k = 2.0 Liu's Model DRQ (Ours) Figure 4. Sample complexity comparisons in Cliffwalking environment with Liu s and Model-based algorithms. Each curve is averaged over 100 random seeds and shaded by their standard deviations. 4.2. Practical Implementation We validate the practicality of our DRQ framework by implementing a practical version, called the Deep Distributionally Robust Q-learning (DDRQ) algorithm, based on the DQN algorithm (Mnih et al., 2015). We apply this algorithm to two classical control tasks from the Open AI Gym (Brockman et al., 2016): Cart Pole and Lunar Lander. Our practical algorithm, denoted as Algorithm 2, is a variant of Algorithm 1. Specifically, we adopt the Deep Q-Network (DQN) architecture (Mnih et al., 2015) and employ two sets of neural networks as functional approximators. One set, Qθ1 and Qθ2, serves as approximators for the Q function, while the other set, ηθ3 and ηθ4, approximates the distributionally robust dual variable η. 
To enhance training stability, we introduce a target network, Qθ2, for the fast Q network Qθ1 and ηθ4 for the fast dual variable η network ηθ3. Due to the approximation error introduced by neural networks and to further improve sample efficiency, our practical DDRQ algorithm adopts a two-timescale update approach. In this approach, our Q network aims to minimize the Bellman error, while the dual variable η network strives to maximize the DR Q value defined in Equation 5. It s important to note that the two-timescale update approach could introduce bias in the convergence of the dual variable, and thus the dual variable η may not the optimal dual variable for the primal problem. Given the primal-dual structure of this DR problem, this could render an even lower target value for the Q network to learn. This approach can be understood as a robust update strategy for our original DRRL problem, share some spirits to the optimization techniques used in other algorithms like Variational Autoencoders (VAE)(Kingma & Welling, 2013), Proximal Policy Optimization (PPO)(Schulman et al., 2017), and Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018). Additional experimental details can be found in Appendix A.3. To assess the effectiveness of our DDRQ algorithm, we compare it against the RFQI algorithm (Panaganti et al., 2022), the soft-robust RL algorithm (Derman et al., 2018), and the non-robust DQN and FQI algorithms. This comparison encompasses representative practical (robust) reinforcement learning algorithms for both online and offline datasets. To evaluate the robustness of the learned policies, we introduce action and physical environment perturbations. For action perturbation, we simulate the perturbations by varying the probability ϵ of randomly selecting an action for both Cart Pole and Lunar Lander tasks. We test with ϵ {0, 0.1, 0.2, , 1.0} for Cart Pole and ϵ {0, 0.1, 0.2, , 0.6} for Lunar Lander. Regarding physical environment perturbation in Lunar Lander, we decrease the power of all the main engine and side engines by the same proportions, ranging from 0 to 0.6. For Cart Pole, we reduce the force mag parameter from 0.2 to 0.8. We set the same ambiguity set radius for both our DDRQ and RFQI algorithm for fair comparisons. Figure 5 illustrates how our DDRQ algorithm successfully learns robust policies across all tested tasks, achieving comparable performance to other robust counterparts such as RFQI and SR-DQN. Conversely, the non-robust DQN and FQI algorithms fail to learn robust policies and deteriorate significantly even under slight perturbations. It is worth noting that RFQI does not perform well in the Lunar Lander environment, despite using the official code provided by the authors. This outcome could be attributed to the restriction to their TV distance in constructing the ambiguity set, while our Creass-Read ambiguity set can be flexibily chosen to well adopted to the environment nature. Additionally, the soft-robust RL algorithm requires generating data based on multiple models within the ambiguity set. This process can be excessively time-consuming, particularly in large-scale applications. 5. Conclusion In this paper, we introduce our DRQ algorithm, a fully model-free DRRL algorithm trained on a single trajectory. Single-Trajectory DRRL 0 20 40 60 80 100 Prob. of picking a random action Return in 100 Games Cart Pole (AP) 80 70 60 50 40 30 20 Change from Nominal Value (%) Cart Pole (FMP) 0 10 20 30 40 50 60 Prob. 
of picking a random action Return in 100 Games Lunar Lander (AP) 60 50 40 30 20 10 0 Change from Nominal Value (%) Lunar Lander (EPP) RFQI FQI DQN SR-DQN DDRQ (Ours) Figure 5. The return in the Cart Pole and Lunar Lander environment. Each curve is averaged over 100 random seeds and shaded by their standard deviations. AP: Action Perturbation; FMP: Force Mag Perturbation; EPP: Engines Power Perturbation. By leveraging the stochastic approximation framework, we effectively tackle the joint optimization problem involving the state-action function and the DR dual variable. Through an extension of the classic two-timescale stochastic approximation framework, we establish the asymptotic convergence of our algorithm to the optimal DR policy. Our extensive experimentation showcases the convergence, sample efficiency, and robustness improvements achieved by our approach, surpassing non-robust methods and other robust RL algorithms. Our DDRQ algorithm further validates the practicality of our algorithmic framework. Acknowledgements This work is generously supported by the General Research Fund [Grants 16208120, and 16214121] from the Hong Kong Research Grants Council, the NSF grants CCF2312205 and CCF-2312204. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning, especially to enhance the robustness of the widely-used reinforcement learning algorithms. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. ar Xiv preprint ar Xiv:1806.06920, 2018. Abdullah, M. A., Ren, H., Ammar, H. B., Milenkovic, V., Luo, R., Zhang, M., and Wang, J. Wasserstein Robust Reinforcement Learning, 2019. URL http://arxiv. org/abs/1907.13196. Badrinath, K. P. and Kalathil, D. Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning, pp. 511 520. PMLR, 2021. Blanchet, J. H. and Glynn, P. W. Unbiased monte carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pp. 3656 3667. IEEE, 2015. Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009. Borkar, V. S. and Meyn, S. P. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 (2):447 469, 2000. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Cox, J. C., Ross, S. A., and Rubinstein, M. Option pricing: A simplified approach. Journal of financial Economics, 7 (3):229 263, 1979. Cressie, N. and Read, T. R. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):440 464, 1984. Del etang, G., Grau-Moya, J., Kunesch, M., Genewein, T., Brekelmans, R., Legg, S., and Ortega, P. A. Modelfree risk-sensitive reinforcement learning. ar Xiv preprint ar Xiv:2111.02907, 2021. Single-Trajectory DRRL Derman, E., Mankowitz, D. J., Mann, T. A., and Mannor, S. Soft-robust actor-critic policy-gradient. ar Xiv preprint ar Xiv:1803.04848, 2018. Duchi, J. C. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378 1406, 2021. Goyal, V. 
and Grand-Clement, J. Robust markov decision processes: Beyond rectangularity. Mathematics of Operations Research, 2022. Ho, C. P., Petrik, M., and Wiesemann, W. Partial policy iteration for l1-robust markov decision processes. J. Mach. Learn. Res., 22:275 1, 2021. Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Lim, S. H., Xu, H., and Mannor, S. Reinforcement learning in robust markov decision processes. Advances in Neural Information Processing Systems, 26, 2013. Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. Distributionally robust q-learning. In International Conference on Machine Learning, pp. 13623 13643. PMLR, 2022. Ma, X., Liang, Z., Xia, L., Zhang, J., Blanchet, J., Liu, M., Zhao, Q., and Zhou, Z. Distributionally robust offline reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:2209.06620, 2022. Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance in value function estimation. In Proceedings of the twenty-first international conference on Machine learning, pp. 72, 2004. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518(7540): 529 533, 2015. Neufeld, A. and Sester, J. Robust q-learning algorithm for markov decision processes under wasserstein uncertainty. Ar Xiv, abs/2210.00898, 2022. Nilim, A. and El Ghaoui, L. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. Panaganti, K. and Kalathil, D. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pp. 9582 9602. PMLR, 2022. Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. Robust reinforcement learning using offline data. Advances in neural information processing systems, 35: 32211 32224, 2022. Roy, A., Xu, H., and Pokutta, S. Reinforcement learning under model mismatch. Advances in neural information processing systems, 30, 2017. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Shi, L. and Chi, Y. Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity. URL http://arxiv.org/abs/2208. 05767. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Tamar, A., Mannor, S., and Xu, H. Scaling up robust mdps using function approximation. In International conference on machine learning, pp. 181 189. PMLR, 2014. Tsitsiklis, J. N. Asynchronous stochastic approximation and q-learning. Machine learning, 16:185 202, 1994. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. 
L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., W unsch, D., Mc Kinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in Star Craft II using multi-agent reinforcement learning. 575(7782):350 354, 2019. doi: 10.1038/s41586-019-1724-z. Wang, Y. and Zou, S. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193 7206, 2021. Wiesemann, W., Kuhn, D., and Rustem, B. Robust markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. Xu, H. and Mannor, S. Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23, 2010. Yang, I. Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Transactions on Automatic Control, 66:3863 3870, 2018. Single-Trajectory DRRL Yang, W., Zhang, L., and Zhang, Z. Toward theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223 3248, 2022. Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 3331 3339. PMLR, 2021. Single-Trajectory DRRL In the subsequent sections, we delve into the experimental specifics and provide the technical proofs that were not included in the primary content. In Section A, we commence by showcasing an additional experiment on the American call option. This aligns with the convergence and sample complexity discussions from the main content. We then elucidate the intricacies of Liu s algorithm to facilitate a transparent comparison with our methodology. Lastly, we discuss the algorithmic intricacies of our DDRQ algorithm and provide details on the experiments that were previously omitted. In Section B, to prove Theorem 3.3, we begin by extending the two-timescale stochastic approximation framework to a three-timescale one. Following this, we adapt it to our algorithm, ensuring all requisite conditions are met. A. Additional Experiments Details A.1. Experiment on the American Put Option Problem In this section, we present additional experimental results from a simulated American put option problem (Cox et al., 1979) that has been previously studied in robust RL literature (Zhou et al., 2021; Tamar et al., 2014). The problem involves holding a put option in multiple stages, whose payoff depends on the price of a financial asset that follows a Bernoulli distribution. Specifically, the next price sh+1 at stage h + 1 follows, ( cush, w.p. p0, cdsh, w.p. 1 p0, (13) where the cu and cd are the price up and down factors and p0 is the probability that the price goes up. The initial price s0 is uniformly sampled from [κ ϵ, κ + ϵ], where κ = 100 is the strike price and ϵ = 5 in our simulation. The agent can take an action to exercise the option (ah = 1) or not exercise (ah = 0) at the time step h. If exercising the option, the agent receives a reward max(0, κ sh) and the state transits into an exit state. Otherwise, the price will fluctuate based on the above model and no reward will be assigned. Moreover we introduce a discount structure in this problem, i.e., the 1 reward in the stage h + 1 worths γ in stage h as our algorithm is designed for discounted RL setting. In our experiments, we set H = 5, cu = 1.02, cd = 0.98 and γ = 0.95. 
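A minimal simulator sketch of the price dynamics in Equation 13 under the parameters just listed; the rollout policy that simply exercises at the final stage is only for illustration and is not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
c_u, c_d, p0 = 1.02, 0.98, 0.5          # price up/down factors and up-probability
kappa, eps, gamma, H = 100.0, 5.0, 0.95, 5

def step(price, exercise):
    """Exercising pays max(0, kappa - price) and moves to an exit state;
    otherwise the price moves up or down according to Equation 13."""
    if exercise:
        return None, max(0.0, kappa - price)
    return price * (c_u if rng.random() < p0 else c_d), 0.0

# roll out a toy "hold until the last stage, then exercise" policy once
price = rng.uniform(kappa - eps, kappa + eps)   # initial price s_0
ret = 0.0
for h in range(H):
    price, reward = step(price, exercise=(h == H - 1))
    ret += gamma ** h * reward
    if price is None:
        break
print(f"discounted payoff: {ret:.3f}")
```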
We limit the price in [80, 140] and discretize with the precision of 1 decimal place. Thus the state space size |S| = 602. 0.3 0.4 0.5 0.6 0.7 p 0.0 0.1 0.2 0.4 Figure 6. Averaged return in the American call option problem. ρ = 0.0 is the non-robust Q-learning. We first demonstrate the robustness gain of our DR Q-learning algorithm by comparing with the non-robust Q-learning algorithm, and investigate the effect of different robustness levels by varying ρ. Each agent is trained for 107 steps with an ϵ-greedy exploration policy of ϵ = 0.2 and evaluated in perturbed environments. We use the same learning rates for the three timescales in our DR Q-learning algorithm as in the Cliffwalking environment: ζ1(t) = 1/(1 + (1 γ)t0.6), ζ2(t) = 1/(1 + 0.1 (1 γ)t0.8), and ζ3(t) = 1/(1 + 0.01 (1 γ)t). For the non-robust Q-learning we set the same learning rate as in our Q-update, i.e., ζ3(t). We perturb the transition probability to the price up and down status Single-Trajectory DRRL p = {0.3, 0.4, 0.5, 0.6, 0.7}, and evaluate each agent for 5000 episodes. Figure 6 reports the average return and one standard deviation level. The non-robust Q-learning performs best when the price tends to decrease and the market gets more benefitial (p = {0.3, 0.4, 0.5}), which benefits the return of holding an American put option. However, when the prices tend to increase and the market is riskier (p = {0.6, 0.7}), our DR Q-learning algorithm significantly outperforms the non-robust counterpart, demonstrating the robustness gain of our algorithm against worst-case scenarios. 0.0 0.2 0.4 0.6 0.8 1.0 Million Steps 0.0 0.2 0.4 0.6 0.8 1.0 Million Steps 0.0 0.2 0.4 0.6 0.8 1.0 Million Steps 0.0 0.2 0.4 0.6 0.8 1.0 Million Steps = 0.05 = 0.1 = 0.2 = 0.4 = 0.5 Figure 7. Convergence curve of DR Q-learning algorithm to the true DR value under different ρ s and k s. Each curve is averaged over 10 random seeds and shaded by their standard deviation. The dashed line is the optimal robust value with corresponding k and ρ. We present the learning curve of our DR Q-learning algorithm with different ρ in Figure 7. Our algorithm can accurately learn the DR value under different ρ s and k s within 0.1 million steps. We compare the sample efficiency of our algorithm with the DR Q-learning algorithm in Liu et al. (2022) (referred to as Liu s) and the model-based algorithm in Panaganti & Kalathil (2022) (referred to as Model). We set a smaller learning rate for Liu s as ζ(t) = 1/(1 + (1 γ)t). The reason is setting the same learning rate ζ3(t) for their algorithm would render a much slower convergence performance, which is not fair for comparisons. We use the recommended choice ε = 0.5 for the sampling procedure in Liu algorithm. Both DR Q-learning and Liu are trained for 5 107 steps per run, while the model-based algorithm is trained for 106 steps per run to ensure sufficient samples for convergence. As shown in Figure 8, the model-based approach is the most sample-efficient, converging accurately to the optimal robust value with less than 104 samples. Our DR Q-learning algorithm is slightly less efficient, using 105 samples to converge. Liu algorithm is significantly less efficient, using 107 samples to converge. Note that the model-based approach we compared here is to first obtain samples for each state-action pairs, and then conduct the learning procedure to learn the optimal robust value. In particular, we need to specify the number of samples for each state-action pair n. 
Then the total number of samples used is the sum of all these number, i.e., S A n, whose computation manner is different from that in the model-free algorithms we used where each update requires one or a batch of new samples. To ensure self-containment, we provide the pseudocode for our implemented Liu algorithm (Algorithm 3) and the modelbased algorithm (Algorithm 2) below. These algorithms were not originally designed to solve the ambiguity set constructed by the Cressie-Read family of f-divergences. A.2. Liu s Algorithm Descriptions In this subsection, we provide the pseudo-code for the Liu algorithm, represented in Algorithm 2. Our intention is to emphasize the differences in algorithmic design between their approach and ours. Their algorithm, in particular, relies extensively on multi-level Monte Carlo, requiring the sampling of a batch of samples for each state-action pair. Once they estimate the Doubly Robust (DR) value for a specific state-action pair, the samples are promptly discarded and subsequently resampled from a simulator. To summarize, their algorithm exhibits significant distinctions from ours in terms of algorithmic design. A.3. Practical Experiments In this section, we provide a comprehensive description of our Deep Distributionally Robust Q-learning (DDRQ) algorithm, as illustrated in Algorithm 2, along with its experimental setup in the context of Caro Pole and Lunar Lander. Most of the hyperparameters are set the same for both Lunar Lander and Cart Pole. We choose Cressie-Read family parameter Single-Trajectory DRRL Algorithm 2 Distributionally Robust Deep Q-learning with Cressie-Read family of f-divergences 1: Input: Discount Factor γ, Radius of robustness ρ, Cressie-Read family parameter k, Q-network target update rate τQ and η-network target update rate τη, mini-batch size N, maximum number of iterations T, start training timestep Ttr, training network update frequency Ftr and target network update frequency Fup. 2: Init: Two state-action neural networks Qθ1 and Qθ2, two dual neural network ηθ1 and ηθ2, C = (1+k (k 1) ρ)1/k. 3: for for t = 1, , T do 4: Observe a state st and execute an action at using ϵ-greedy policy. 5: if t Ttr and t%Ftr then 6: Sample a minibatch B with N samples from the replay buffer. 7: Compute next-state target value for Q network Qi = rt γC (ηθ1(si, ai) max a A Qθ1(si, ai))k + , i B and for η network Q i = rt γC (ηθ2(si, ai) max a A Qθ2(si, ai))k + , i B. 8: Update θ1 = arg minθ P i(Qi Qθ(si, ai))2. 9: Update θ3 = arg maxθ P i Q i(θ). 10: end if 11: if t Ttr and t%Fup then 12: Update target network θ2 = (1 τQ)θ2 + τQθ2, θ4 = (1 τη)θ4 + τηθ3. 13: end if 14: end for 15: t = t + 1 Environment Maximum Training Step T ϵEnd τQ τη Cart Pole 1e8 0.05 1 0.05 Lunar Lander 3e7 0.2 0.5 0.1 Table 1. Different Hyperparamers between Cart Pole and Lunar Lander k = 2, which is indeed the χ2 ambiguity set and we set ambiguity set radius as ρ = 0.3. For RFQI we also use the same ρ for fair comparison. Our replay buffer size is set 1e6 and the batch size for training is set 4096. Our fast Q and η network are update every 10 steps (Ftr = 10) and the target networks are updated every 500 steps (Fup = 500). The learning rate for Q network is 2.5 10 4 and for η network is 2.5 10 3. The Q network and the η network both employ a dual-layer structure, with each layer consisting of 120 dimensions. For exploration scheme, we choose epsilon-greedy exploration with linearly decay epsilon with ending ϵEnd. 
The remain parameters tuned for each environments are referred in Table 1. B. Multiple Timescale Convergence We fix some notations that will be used in the following proof. For a positive integer n, [n] denotes the set {1, 2, , n}. |A| denotes the cardinality of the set A. We adopt the standard asymptotic notations: for two non-negative sequences an and bn, an = O(bn) iff lim supn an/bn < . d is the simplex on a d dimensional space, i.e., d = {x : Pd i=1 xi = 1, xi 0, i [d]}. For any vector x Rd and any semi-positive matrix A Rd d with A 0, we denote x A := x Ax. is Euclidean norm. B.1. Three Timescales Convergence Analysis In this subsection, we outline the roadmap for establishing the a.s. convergence of the Algorithm 1. For ease of presentation, our analysis is given for the synchronous case, where every entry of the Q function is updated at each timestep. Extension to the asynchronous case, where only one state-action pair entry is updated at each timestep, follows Tsitsiklis (1994). Our Single-Trajectory DRRL 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.1, k = 2.0 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.2, k = 2.0 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.4, k = 2.0 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.1, k = 4.0 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.2, k = 4.0 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 Sample Size (log10N) = 0.4, k = 4.0 Liu Model DRQ (Ours) Figure 8. Sample complexity comparisons in American option environment with other DRRL algorithms. The dashed line is the optimal robust value with corresponding k and ρ. The x-axis is in log10 scale. Each curve is averaged over 10 random seeds and shaded by their one standard deviation. The dashed line is the optimal robust value with corresponding k and ρ. approach is to generalize the classic machinery of two-timescale stochastic approximation (Borkar, 2009) to a three-timescale framework, and use it to analyze our proposed algorithm. We rewrite the Algorithm 1 as Zn+1 = Zn + ζ1(n)[f(Zn, ηn, Qn) + M Z n ], (14) ηn+1 = ηn + ζ2(n)[g(Zn, ηn, Qn) + ϵη n], (15) Qn+1 = Qn + ζ3(n)[h(Zn, ηn, Qn) + ϵQ n ]. (16) Here, we use Zn = (Zn,1, Zn,2) to represent the Zn,1 and Zn,2 jointly. To echo with our algorithm, f = (f1, f2) and M Z n = (M Z n,1, M Z n,2) are defined as, f1(Zn, ηn, Qn)(s, a) = Es [(ηn(s, a) max a Qn(s , a ))k + Zn,1(s, a)], f2(Zn, ηn, Qn)(s, a) = Es n[(ηn(s, a) max a Qn(s , a ))k 1 + Zn,2(s, a)], M Z n,1(s, a) = (ηn(s, a) max a Qn(s , a ))k + Zn,1(s, a) f1(Zn, ηn, Qn)(s, a), M Z n,2(s, a) = (ηn(s, a) max a Qn(s , a ))k 1 + Zn,2(s, a) f2(Zn, ηn, Qn)(s, a). In the update of ηn (Equation 15), g and ϵη n are defined as g(Zn, ηn, Qn)(s, a) = ck(ρ)E[(ηn(s, a) max a A Qn(s , a ))k + ] 1 k 1 E[(ηn(s, a) max a A Qn(s , a ))k 1 + ] + 1, ϵη n(s, a) = ck(ρ)Z 1 k 1 n,1 (s, a) Zn,2(s, a) + 1 g(Zn, ηn, Qn)(s, a). Finally in the update of Qn (Equation 16), h and ϵQ n are defined as h(Zn, ηn, Qn)(s, a) = r(s, a) γ(ck(ρ)(EP [(ηn(s, a) max a A Qn(s , a ))k + ]) 1 k ηn(s, a)), ϵQ n (s, a) = r(s, a) γ(ck(ρ)Z 1 k n,1(s, a) ηn(s, a)) h(Zn, ηn, Qn)(s, a). The algorithm 1 approximates the dynamic described by the system of f, g and h through samples along a single trajectory, with the resulting approximation error manifesting as martingale noise M Z n conditioned on some filtration Fn and the error terms ϵη n and ϵQ n . 
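Written out compactly, the three coupled recursions (14)–(16) take the form below; $M^Z_n$, $\epsilon^\eta_n$, and $\epsilon^Q_n$ are the martingale-noise and error terms defined above.

```latex
\begin{align}
Z_{n+1}    &= Z_n    + \zeta_1(n)\,\bigl[f(Z_n,\eta_n,Q_n) + M^{Z}_{n}\bigr],          \tag{14}\\
\eta_{n+1} &= \eta_n + \zeta_2(n)\,\bigl[g(Z_n,\eta_n,Q_n) + \epsilon^{\eta}_{n}\bigr], \tag{15}\\
Q_{n+1}    &= Q_n    + \zeta_3(n)\,\bigl[h(Z_n,\eta_n,Q_n) + \epsilon^{Q}_{n}\bigr].    \tag{16}
\end{align}
```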
To analyze the dynamic of algorithm 1, we first obtain the continuous dynamic of f, g, and h using ordinary differential equations (ODEs) analysis. The second step is to analyze the stochastic nature of the noise term M Z n and the error terms ϵη n and ϵQ n , to ensure that they are negligible compared to the main trend of f, g, and h, which is achieved by the following stepsizes, Single-Trajectory DRRL Condition B.1. The stepsizes ζi(n), i = 1, 2, 3 satisfy X n ζi(n) = , X n ζ2 i (n) < , ζ1(n) = o(ζ2(n)), ζ2(n) = o(ζ3(n)). These stepsize schedules satisfy the standard conditions for stochastic approximation algorithms, ensuring that (1). the key quantities in gradient estimator Zn update on the fastest timescale, (2). the dual variable for the DR problem, ηn, update on the intermediate timescale; and (3). the Q table updates on the slowest timescale. Examples of such stepsize are ζ1(n) = 1 1+n0.6 , ζ2(n) = 1 1+n0.8 and ζ3(n) = 1 1+n. Notably, the first two conditions in Condition B.1 ensure the martingale noise is negligible. The different stepsizes for the three loops specificed by the third and fourth conditions ensures that Zn,1 and Zn,2 are sufficiently estimated with respect to the ηn and Qn, and these outer two loops are free from bias or noise in the stochastic approximation sense. Under Condition B.1, when analyzing the behavior of the Zn, the ηn and the Qn can be viewed as quasi-static. To study the behavior of the fastest loop, we analyze the following ODEs: Z(t) = f(Z(t), β(t), Q(t)), η(t) = 0, Q(t) = 0, (17) and prove that ODEs (17) a.s. converge to λ 1 (η, Q) for proper η and Q and some mapping λ 1 . Similarly, Qn can be viewed as fixed when analyzing the behavior of ηn, and the corresponding ODEs to understand its behavior are η(t) = g(λ 1(η(t), Q(t)), η(t), Q(t)), Q(t) = 0. (18) By exploiting the dual form of the distributionally robust optimization problem, we can prove these ODEs converge to the set {λ 1(Q), λ 2(Q), Q|Q V } for some mapping λ 1 and λ 2 with V is the set containing all the mapping from S to R. Lastly, we examine the slowest timescale ODE given by Q(t) = h(λ 1(Q(t)), λ 2(Q(t)), Q(t)), (19) and employ our analysis to establish the almost sure convergence of Algorithm 1 to the globally optimal pair (Z 1, Z 2, η , Q ). Lemma B.2 (Discrete Gronwall inequality). Let {xn, n 0} (resp. {an, n 0} ) be nonnegative (resp. positive) sequences and C, L 0 scalars such that for all n, Then for Tn = Pn m=0 am, xn+1 Ce LTn. Lemma B.3 (Gronwall inequality). For continuous u( ), v( ) 0 and scalars C, K, T 0 u(t) C + K Z t 0 u(s)v(s)ds, t [0, T], implies u(t) Ce K R T 0 v(s)ds, t [0, T]. B.2. Stability Criterion Consider the stochastic approximation scheme zn RN given by zn+1 = zn + an [g (zn) + Mn+1] , with the following Condition: Condition B.4. g : RN RN is Lipschitz. Condition B.5. The sequence {an} R satisfies P n a2 n < . Condition B.6. {Mn} is a martingale difference sequence with respect to the filtration Fn = σ (zm, Mm, m n), there exists K > 0 such that E Mn+1 2 | Fn K(1 + zn 2) a.s.. Condition B.7. The functions gd(z) = g(dz)/d, d 1 satisfy gd(z) g (z) as d uniformly on compacts for some continuous function g : RN RN. In addition, the ODE z(t) = g (z(t)) has the origin as its globally asymptotically stable equilibrium. We then have Lemma B.8. Under Condition B.4 to B.6, we have supn zn < a.s. See Section 2.2 and 3.2 in Borkar (2009) for the proof. 
As the stability proofs in Section 3.2 of Borkar (2009) are path-wise, we can apply this result to analyze the multiple-timescale dynamics.

B.3. Three Timescales Convergence Criterion

Consider the scheme
$$x_{n+1} = x_n + a_n\big[f(x_n, y_n, z_n) + M^{(1)}_{n+1}\big], \qquad (20)$$
$$y_{n+1} = y_n + b_n\big[g(x_n, y_n, z_n) + M^{(2)}_{n+1}\big], \qquad (21)$$
$$z_{n+1} = z_n + c_n\big[h(x_n, y_n, z_n) + M^{(3)}_{n+1}\big], \qquad (22)$$
where $f : \mathbb{R}^{d+k+p} \to \mathbb{R}^d$, $g : \mathbb{R}^{d+k+p} \to \mathbb{R}^k$, $h : \mathbb{R}^{d+k+p} \to \mathbb{R}^p$, $\{M^{(i)}_n\}$, $i = 1, 2, 3$, are martingale difference sequences with respect to the $\sigma$-fields $\mathcal{F}_n = \sigma\big(x_m, y_m, z_m, M^{(1)}_m, M^{(2)}_m, M^{(3)}_m;\ m \le n\big)$, and $a_n$, $b_n$, $c_n$ are decreasing stepsize sequences. It is instructive to compare the stochastic updates in Equations 20 to 22 with the o.d.e.
$$\dot{x}(t) = a\,f(x(t), y(t), z(t)), \qquad \dot{y}(t) = b\,g(x(t), y(t), z(t)), \qquad \dot{z}(t) = c\,h(x(t), y(t), z(t)),$$
in the limit where $a, b, c \to 0$ with $b = o(a)$ and $c = o(b)$. We impose the following conditions, which are necessary for the a.s. convergence of each timescale itself and are commonly used in the stochastic approximation literature, e.g., Borkar (2009).

Condition B.9. $f$ and $g$ are $L$-Lipschitz maps for some $0 < L < \infty$, and $h$ is bounded.

Condition B.10. $\sum_n (a_n^2 + b_n^2 + c_n^2) < \infty$, $\sum_n a_n = \sum_n b_n = \sum_n c_n = \infty$, and $b_n = o(a_n)$, $c_n = o(b_n)$.

Condition B.11. For $i = 1, 2, 3$ and $n \in \mathbb{N}_+$, $\{M^{(i)}_n\}$ is a martingale difference sequence with respect to the increasing family of $\sigma$-fields $\mathcal{F}_n$. Furthermore, there exists some $K > 0$ such that, for $i = 1, 2, 3$ and $n \in \mathbb{N}_+$, $\mathbb{E}[\|M^{(i)}_{n+1}\|^2 \mid \mathcal{F}_n] \le K(1 + \|x_n\|^2 + \|y_n\|^2 + \|z_n\|^2)$.

Condition B.12. $\sup_n (\|x_n\| + \|y_n\| + \|z_n\|) < \infty$ a.s..

Condition B.13. For each $y \in \mathbb{R}^k$ and $z \in \mathbb{R}^p$, $\dot{x}(t) = f(x(t), y, z)$ has a globally asymptotically stable equilibrium $\lambda_1(y, z)$, where $\lambda_1 : \mathbb{R}^{k+p} \to \mathbb{R}^d$ is an $L$-Lipschitz map for some $L > 0$.

Condition B.14. For each $z \in \mathbb{R}^p$, $\dot{y}(t) = g(\lambda_1(y(t), z), y(t), z)$ has a globally asymptotically stable equilibrium $\lambda_2(z)$, where $\lambda_2 : \mathbb{R}^p \to \mathbb{R}^k$ is an $L$-Lipschitz map for some $L > 0$.

Condition B.15. $\dot{z}(t) = h(\lambda_1(z(t)), \lambda_2(z(t)), z(t))$ has a globally asymptotically stable equilibrium $z^*$, where, with a slight abuse of notation, $\lambda_1(z) := \lambda_1(\lambda_2(z), z)$.

Conditions B.9, B.10, B.11 and B.12 are necessary for the a.s. convergence of each timescale itself. Moreover, Condition B.12 itself requires conditions like B.9, B.10, and B.11, together with an extra condition like Condition B.7. Instead, we need to prove boundedness for each timescale; thus the three-timescale version is as follows.

Condition B.16. The ODEs
$$\dot{x}(t) = f_\infty(x(t), y, z), \qquad \dot{y}(t) = g_\infty(\lambda_1(y(t), z), y(t), z), \qquad \dot{z}(t) = h_\infty(\lambda_1(z(t)), \lambda_2(z(t)), z(t))$$
all have the origin as their globally asymptotically stable equilibrium for each $y \in \mathbb{R}^k$ and $z \in \mathbb{R}^p$, where $f_\infty = \lim_{d \to \infty} f(d\,\cdot)/d$, $g_\infty = \lim_{d \to \infty} g(d\,\cdot)/d$, and $h_\infty = \lim_{d \to \infty} h(d\,\cdot)/d$.

We have the following result, which is a three-timescale extension of Lemma 6.1 in Borkar (2009) and serves as an auxiliary lemma for our a.s. convergence.

Lemma B.17. Under Conditions B.9, B.10, B.11 and B.12, $(x_n, y_n, z_n) \to \{(\lambda_1(z), \lambda_2(z), z) : z \in \mathbb{R}^p\}$ a.s..

Proof. Rewrite Equations 21 and 22 as
$$y_{n+1} = y_n + a_n\big[\epsilon_{1,n} + \hat{M}^{(2)}_{n+1}\big], \qquad z_{n+1} = z_n + a_n\big[\epsilon_{2,n} + \hat{M}^{(3)}_{n+1}\big],$$
where $\epsilon_{1,n} = \frac{b_n}{a_n} g(x_n, y_n, z_n)$, $\epsilon_{2,n} = \frac{c_n}{a_n} h(x_n, y_n, z_n)$, $\hat{M}^{(2)}_{n+1} = \frac{b_n}{a_n} M^{(2)}_{n+1}$, and $\hat{M}^{(3)}_{n+1} = \frac{c_n}{a_n} M^{(3)}_{n+1}$. Note that $\epsilon_{1,n}, \epsilon_{2,n} \to 0$ as $n \to \infty$. Treating these as the special case of the third extension in Section 2.2 of Borkar (2009), we can conclude that $(x_n, y_n, z_n)$ converges to the internally chain transitive invariant sets of the o.d.e.
$$\dot{x}(t) = f(x(t), y(t), z(t)), \qquad \dot{y}(t) = 0, \qquad \dot{z}(t) = 0,$$
which implies that $(x_n, y_n, z_n) \to \{(\lambda_1(y, z), y, z) : y \in \mathbb{R}^k, z \in \mathbb{R}^p\}$. Rewrite Equation 22 again as
$$z_{n+1} = z_n + b_n\big[\epsilon'_{2,n} + \tilde{M}^{(3)}_{n+1}\big],$$
where $\epsilon'_{2,n} = \frac{c_n}{b_n} h(x_n, y_n, z_n)$ and $\tilde{M}^{(3)}_{n+1} = \frac{c_n}{b_n} M^{(3)}_{n+1}$.
We use the same extension again and conclude that $(x_n, y_n, z_n)$ converges to the internally chain transitive invariant sets of the o.d.e.
$$\dot{y}(t) = g(\lambda_1(y(t), z(t)), y(t), z(t)), \qquad \dot{z}(t) = 0.$$
Thus $(x_n, y_n, z_n) \to \{(\lambda_1(z), \lambda_2(z), z) : z \in \mathbb{R}^p\}$.

Theorem B.18. Under Conditions B.9 to B.16, $(x_n, y_n, z_n) \to (\lambda_1(z^*), \lambda_2(z^*), z^*)$ a.s..

Proof. Let $t(0) = 0$ and $t(n) = \sum_{i=0}^{n-1} c_i$ for $n \ge 1$. Define the piecewise linear continuous interpolation $\bar{z}(t)$, $t \ge 0$, by $\bar{z}(t(n)) = z_n$ and
$$\bar{z}(t) = \frac{t(n+1) - t}{t(n+1) - t(n)}\, z_n + \frac{t - t(n)}{t(n+1) - t(n)}\, z_{n+1}, \qquad t \in [t(n), t(n+1)],$$
for any $n \in \mathbb{N}$. Let $\psi_n = \sum_{i=0}^{n-1} c_i M^{(3)}_{i+1}$, $n \in \mathbb{N}_+$. For any $t \ge 0$, denote $[t] = \max\{t(n) : t(n) \le t\}$. Then for $n, m \ge 0$ we have
$$\bar{z}(t(n+m)) = \bar{z}(t(n)) + \sum_{k=0}^{m-1} c_{n+k}\, h(x_{n+k}, y_{n+k}, z_{n+k}) + (\psi_{n+m} - \psi_n)$$
$$= \bar{z}(t(n)) + \int_{t(n)}^{t(n+m)} h(\lambda_1(\bar{z}(s)), \lambda_2(\bar{z}(s)), \bar{z}(s))\,ds + \int_{t(n)}^{t(n+m)} \big( h(\lambda_1(\bar{z}([s])), \lambda_2(\bar{z}([s])), \bar{z}([s])) - h(\lambda_1(\bar{z}(s)), \lambda_2(\bar{z}(s)), \bar{z}(s)) \big)\,ds$$
$$\quad + \sum_{k=0}^{m-1} c_{n+k}\big( h(x_{n+k}, y_{n+k}, z_{n+k}) - h(\lambda_1(z_{n+k}), \lambda_2(z_{n+k}), z_{n+k}) \big) + (\psi_{n+m} - \psi_n). \qquad (23)$$
We further define $z^{t(n)}(t)$ as the trajectory of $\dot{z}(t) = h(\lambda_1(z(t)), \lambda_2(z(t)), z(t))$ with $z^{t(n)}(t(n)) = \bar{z}(t(n))$, so that
$$z^{t(n)}(t(n+m)) = \bar{z}(t(n)) + \int_{t(n)}^{t(n+m)} h(\lambda_1(z^{t(n)}(s)), \lambda_2(z^{t(n)}(s)), z^{t(n)}(s))\,ds. \qquad (24)$$

Taking the difference between Equation 23 and Equation 24 and writing $T = t(n+m) - t(n)$, we have
$$\|\bar{z}(t(n+m)) - z^{t(n)}(t(n+m))\| \le \Big\| \int_{t(n)}^{t(n+m)} \big( h(\lambda_1(\bar{z}(s)), \lambda_2(\bar{z}(s)), \bar{z}(s)) - h(\lambda_1(z^{t(n)}(s)), \lambda_2(z^{t(n)}(s)), z^{t(n)}(s)) \big)\,ds \Big\|$$
$$+ \underbrace{\Big\| \int_{t(n)}^{t(n+m)} \big( h(\lambda_1(\bar{z}([s])), \lambda_2(\bar{z}([s])), \bar{z}([s])) - h(\lambda_1(\bar{z}(s)), \lambda_2(\bar{z}(s)), \bar{z}(s)) \big)\,ds \Big\|}_{\text{I}}$$
$$+ \underbrace{\Big\| \sum_{k=0}^{m-1} c_{n+k}\big( h(x_{n+k}, y_{n+k}, z_{n+k}) - h(\lambda_1(z_{n+k}), \lambda_2(z_{n+k}), z_{n+k}) \big) \Big\|}_{\text{II}} + \underbrace{\|\psi_{n+m} - \psi_n\|}_{\text{III}}.$$

We analyze term I. For notational simplicity we drop the superscript $t(n)$ and the bar where no confusion arises. Changing one argument at a time,
$$\|h(\lambda_1(z([t])), \lambda_2(z([t])), z([t])) - h(\lambda_1(z(t)), \lambda_2(z(t)), z(t))\|$$
$$\le \|h(\lambda_1(z([t])), \lambda_2(z([t])), z([t])) - h(\lambda_1(z([t])), \lambda_2(z([t])), z(t))\|$$
$$\quad + \|h(\lambda_1(z([t])), \lambda_2(z([t])), z(t)) - h(\lambda_1(z([t])), \lambda_2(z(t)), z(t))\|$$
$$\quad + \|h(\lambda_1(z([t])), \lambda_2(z(t)), z(t)) - h(\lambda_1(z(t)), \lambda_2(z(t)), z(t))\|, \qquad (25)$$
so that, by the Lipschitzness of $h$, $\lambda_1$, and $\lambda_2$ (enlarging $L$ if necessary), the integrand of term I is bounded by $L\,\|z(t) - z([t])\|$. By the Lipschitzness of $h$ we also have $\|h(x) - h(0)\| \le L\|x\|$, which implies $\|h(x)\| \le \|h(0)\| + L\|x\|$. Hence, for $t \in [t(n), t(n+m)]$,
$$\|z^{t(n)}(t)\| \le \|\bar{z}(t(n))\| + \int_{t(n)}^{t} \big(\|h(0)\| + L\,\|z^{t(n)}(s)\|\big)\,ds \le \big(C + \|h(0)\|\,T\big) + L \int_{t(n)}^{t} \|z^{t(n)}(s)\|\,ds,$$
where $C := \sup_n \|z_n\| < \infty$ a.s. by Condition B.12. By Gronwall's inequality (Lemma B.3), $\|z^{t(n)}(t)\| \le (C + \|h(0)\|\,T)\,e^{LT}$ for $t \in [t(n), t(n+m)]$. Thus for all $t \in [t(n), t(n+m)]$,
$$\|h(\lambda_1(z^{t(n)}(t)), \lambda_2(z^{t(n)}(t)), z^{t(n)}(t))\| \le C_T := \|h(0)\| + L(C + \|h(0)\|\,T)\,e^{LT} < \infty, \quad \text{a.s..}$$
For any $0 \le k \le m-1$ and $t \in [t(n+k), t(n+k+1)]$,
$$\|z^{t(n)}(t) - z^{t(n)}(t(n+k))\| \le \int_{t(n+k)}^{t} \|h(\lambda_1(z^{t(n)}(s)), \lambda_2(z^{t(n)}(s)), z^{t(n)}(s))\|\,ds \le C_T\,(t - t(n+k)) \le C_T\, c_{n+k},$$
where the last inequality follows from the construction of $\{t(n) : n \in \mathbb{N}_+\}$; the interpolated trajectory $\bar{z}$ satisfies an analogous bound on each interval. Finally we can conclude that
$$\Big\| \int_{t(n)}^{t(n+m)} \big( h(\lambda_1(z([s])), \lambda_2(z([s])), z([s])) - h(\lambda_1(z(s)), \lambda_2(z(s)), z(s)) \big)\,ds \Big\| \le \int_{t(n)}^{t(n+m)} L\,\|z(s) - z([s])\|\,ds$$
$$\le L \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|z(s) - z(t(n+k))\|\,ds \le L\, C_T \sum_{k=0}^{m-1} c_{n+k}^2 \to 0, \quad \text{a.s..}$$
Term III converges to zero by the martingale convergence property, and term II vanishes asymptotically by Lemma B.17 together with the continuity of $h$. Subtracting Equation 24 from Equation 23 and taking norms, we therefore have
$$\|\bar{z}(t(n+m)) - z^{t(n)}(t(n+m))\| \le L \sum_{i=0}^{m-1} c_{n+i}\,\|\bar{z}(t(n+i)) - z^{t(n)}(t(n+i))\| + C_T L \sum_{k \ge 0} c_{n+k}^2 + \sup_{k \ge 0} \delta_{n, n+k}, \quad \text{a.s.},$$
where $\delta_{n, n+k}$ collects the contributions of terms II and III up to time $t(n+k)$. Define $K_{T,n} = C_T L \sum_{k \ge 0} c_{n+k}^2 + \sup_{k \ge 0} \delta_{n, n+k}$, and note that $K_{T,n} \to 0$ a.s. as $n \to \infty$. Let $u_i = \|\bar{z}(t(n+i)) - z^{t(n)}(t(n+i))\|$. The above inequality becomes
$$u_m \le K_{T,n} + L \sum_{i=0}^{m-1} c_{n+i}\, u_i.$$
Note that $u_0 = 0$ and $\sum_{i=0}^{m-1} c_{n+i} \le T$; then, using the discrete Gronwall lemma (Lemma B.2), we have $\sup_{0 \le i \le m} u_i \le K_{T,n}\, e^{LT}$.
Following logic similar to Lemma 1 in Borkar (2009), we can extend the above result to $\sup_{t \in [t(n), t(n)+T]} \|\bar{z}(t) - z^{t(n)}(t)\| \to 0$ a.s.. Then, using the proof of Theorem 2 of Chapter 2 in Borkar (2009), we get $z_n \to z^*$ a.s., and thus, combined with Lemma B.17, the proof is concluded.

C. Convergence of the DR Q-learning Algorithm

Before starting the proof for the DR Q-learning algorithm, we first introduce the following lemma.

Lemma C.1. Denote $\eta^* = \arg\max_{\eta} \sigma_k(X, \eta)$, where $\sigma_k(X, \eta) = -c_k(\rho)\,\mathbb{E}_P[(\eta - X)_+^{k}]^{1/k} + \eta$. Given that $X(\omega) \in [0, M]$, we have $\eta^* \in \big[0, \frac{c_k(\rho)}{c_k(\rho)-1} M\big]$.

Proof. Note that for $\eta = \min_\omega X(\omega)$, $-c_k(\rho)\,\mathbb{E}_P[(\eta - X)_+^{k}]^{1/k} + \eta = \min_\omega X(\omega) \ge 0$. Also, when $\eta \ge \frac{c_k(\rho)}{c_k(\rho)-1} M$, we have
$$-c_k(\rho)\,\mathbb{E}_P[(\eta - X)_+^{k}]^{1/k} + \eta \le -c_k(\rho)\,\mathbb{E}_P[(\eta - M)_+^{k}]^{1/k} + \eta = -c_k(\rho)(\eta - M) + \eta \le 0.$$
Then we can conclude that $\eta^* \le \frac{c_k(\rho)}{c_k(\rho)-1} M$. Moreover, as $X(\omega) \ge 0$, we know $\sigma_k(X, 0) = 0$, while $\sigma_k(X, \eta) = \eta < 0$ for any $\eta < 0$; this concludes that $\eta^* \in \big[0, \frac{c_k(\rho)}{c_k(\rho)-1} M\big]$.

Note that $Q_n \in [0, \frac{1}{1-\gamma}]$ when the reward is bounded in $[0, 1]$. Thus $M = \frac{1}{1-\gamma}$ in our case, and we denote $\bar{\eta} = \frac{c_k(\rho)}{c_k(\rho)-1} M$. Now we are ready to prove the convergence of the DR Q-learning algorithm. For the theoretical analysis, we consider the clipped version of our DR Q-learning algorithm.

Proof of Theorem 3.3. We define the filtration generated by the historical trajectory as $\mathcal{F}_n = \sigma\big(\{(s_t, a_t, s'_t, r_t)\}_{t \in [n-1]}, s_n, a_n\big)$. In the following analysis, we fix a pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ but suppress the $(s, a)$ dependence for notational simplicity. Following the roadmap in Section 3.4, we rewrite the algorithm as
$$Z_{n+1,1} = Z_{n,1} + \zeta_1(n)\big[f_1(Z_{n,1}, Z_{n,2}, \eta_n, Q_n) + M^{(1)}_{n+1}\big], \qquad (26)$$
$$Z_{n+1,2} = Z_{n,2} + \zeta_1(n)\big[f_2(Z_{n,1}, Z_{n,2}, \eta_n, Q_n) + M^{(2)}_{n+1}\big], \qquad (27)$$
$$\eta_{n+1} = \Gamma_\eta\big[\eta_n + \zeta_2(n)\, f_3(Z_{n,1}, Z_{n,2}, \eta_n, Q_n)\big], \qquad (28)$$
$$Q_{n+1} = \Gamma_Q\big[Q_n + \zeta_3(n)\, f_4(Z_{n,1}, Z_{n,2}, \eta_n, Q_n)\big]. \qquad (29)$$
Here, for the theoretical analysis, we add the clipping operators $\Gamma_\eta(x) = \min(\max(x, 0), \bar{\eta})$ and $\Gamma_Q(x) = \min(\max(x, 0), M)$ compared with the algorithm presented in the main text.

We first proceed by identifying the terms in Equations 26 and 27 and studying the corresponding ODEs
$$\dot{Q}(t) = 0, \qquad \dot{\eta}(t) = 0, \qquad \dot{Z}_1(t) = f_1(Z_1(t), Z_2(t), \eta(t), Q(t)), \qquad \dot{Z}_2(t) = f_2(Z_1(t), Z_2(t), \eta(t), Q(t)).$$
As $f_1$ and $f_2$ do not depend on $Z_2$ and $Z_1$, respectively, we analyze their equilibria separately. For notational convenience, we denote $y_n(s) = \max_{a' \in \mathcal{A}} Q_n(s, a')$. For the ODE associated with Equation 26 and each $\eta_n \in \mathbb{R}$, $Q_n \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, it is easy to see that there exists a unique globally asymptotically stable equilibrium $Z^*_{n,1} = \lambda_1(\eta_n, y_n) = \mathbb{E}[(\eta_n - y_n)_+^{k}]$. Similarly, for the ODE associated with Equation 27 and each $\eta_n \in \mathbb{R}$, $Q_n \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, there exists a unique globally asymptotically stable equilibrium $Z^*_{n,2} = \lambda_2(\eta_n, y_n) = \mathbb{E}[(\eta_n - y_n)_+^{k-1}]$.

Second, $M^{(1)}_{n+1} = (\eta_n - y_n(s'_n))_+^{k} - \mathbb{E}[(\eta_n - y_n(s'))_+^{k}]$ and $M^{(2)}_{n+1} = (\eta_n - y_n(s'_n))_+^{k-1} - \mathbb{E}[(\eta_n - y_n(s'))_+^{k-1}]$. Note that for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, $\eta_n(s, a) \le \bar{\eta}$, $y_n(s') \le M$, and $M \le \bar{\eta}$. Thus $|(\eta_n(s, a) - y_n(s'))_+^{k}| \le \bar{\eta}^{k}$, which leads to
$$|M^{(1)}_{n+1}(s, a)| = \big|(\eta_n(s, a) - y_n(s'))_+^{k} - \mathbb{E}[(\eta_n(s, a) - y_n(s'))_+^{k}]\big| \le \bar{\eta}^{k}.$$
Since $\|y_n\|_\infty \le \|Q_n\|_\infty$ and $(x - y)_+^2 \le x^2 + y^2$ for any $x, y$, we have
$$\mathbb{E}\big[\|M^{(1)}_{n+1}\|^2 \mid \mathcal{F}_n\big] = \mathbb{E}\big[\|(\eta_n - y_n)_+^{k} - \mathbb{E}[(\eta_n - y_n)_+^{k}]\|^2 \mid \mathcal{F}_n\big] \le K_1\big(1 + \|Z_{n,1}\|^2 + \|Z_{n,2}\|^2 + \|Q_n\|^2 + \|\eta_n\|^2\big),$$
where $K_1 = |\mathcal{S}|\,\bar{\eta}^{2k}$. Similarly, we can conclude that $\mathbb{E}[\|M^{(2)}_{n+1}\|^2 \mid \mathcal{F}_n] \le K_2(1 + \|Z_{n,1}\|^2 + \|Z_{n,2}\|^2 + \|Q_n\|^2 + \|\eta_n\|^2)$ for some $K_2 = |\mathcal{S}|\,\bar{\eta}^{2(k-1)}$.

Next we analyze the second loop:
$$\dot{Q}(t) = 0, \qquad \dot{\eta}(t) = \Gamma_\eta\big[f_3(\lambda_1(\eta(t), Q(t)), \lambda_2(\eta(t), Q(t)), \eta(t), Q(t))\big],$$
$$f_3(\lambda_1(\eta, Q), \lambda_2(\eta, Q), \eta, Q) = -c_k(\rho)\,\lambda_1(\eta, Q)^{\frac{1}{k}-1}\,\lambda_2(\eta, Q) + 1.$$
The global convergence point is $\eta^*(Q) = \arg\max_{\eta \in [0, \bar{\eta}]} \sigma_k(Q, \eta) = \arg\max_{\eta \in \mathbb{R}} \sigma_k(Q, \eta)$, where the second equality follows from Lemma C.1. Finally we arrive at the outer loop, i.e.,
$$\dot{Q}(t) = \Gamma_Q\big[f_4(\lambda_1(Q(t)), \lambda_2(Q(t)), \lambda_3(Q(t)), Q(t))\big],$$
where $\lambda_3(Q) := \eta^*(Q)$ denotes the limiting dual variable.
By using the dual form of the Cressie-Read divergence (Lemma 3.1), this is equivalent to
$$\dot{Q}(t) = r + \gamma \inf_{P \in \mathcal{P}} \mathbb{E}_P\big[\max_{a'} Q(s', a')\big] - Q(t)$$
for an ambiguity set constructed with the Cressie-Read family of $f$-divergences. Denote $H(Q) = r + \gamma \inf_{P \in \mathcal{P}} \mathbb{E}_P[\max_{a'} Q(s', a')]$, so that the above ODE can be rewritten as $\dot{Q}(t) = H(Q(t)) - Q(t)$. To verify the scaled-ODE condition, we consider the limiting version $H_\infty(Q) = \lim_{c \to \infty} H(cQ)/c$, which gives
$$\dot{Q}(t) = \gamma \inf_{P \in \mathcal{P}} \mathbb{E}_P\big[\max_{a'} Q(s', a')\big] - Q(t).$$
The operator $H$ (and likewise $H_\infty$) is a contraction by Theorem 3.2 in Iyengar (2005). By the proof in Section 3.2 of Borkar & Meyn (2000), the contraction property implies that the ODE has a unique globally asymptotically stable equilibrium. Thus we finish verifying all the conditions in Section B.3, which yields the desired result.

Algorithm 3 Distributionally Robust Q-learning with the Cressie-Read Family of f-divergences with a Simulator
1: Input: Exploration rate $\epsilon$, learning rates $\{\zeta_i(n)\}_{i \in [3]}$, ambiguity set radius $\rho > 0$, parameter $\varepsilon \in (0, 0.5)$
2: Init: $\hat{Q}(s, a) = 0$, $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$
3: while not converged do
4:   for every $(s, a) \in \mathcal{S} \times \mathcal{A}$ do
5:     Sample $N \in \mathbb{N}$ from $P(N = n) = p_n = \varepsilon(1 - \varepsilon)^n$.
6:     Draw $2^{N+1}$ samples $\{(r_i, s'_i)\}_{i \in [2^{N+1}]}$ from the simulator.
7:     Compute $\Delta^{r}_{N,\rho}$ via
$$\Delta^{r}_{N,\rho} = \sup_{\eta \in \mathbb{R}} \hat{\sigma}^{r}_k([2^{N+1}], \eta) - \frac{1}{2}\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{r}_k([2^{N}], \eta) - \frac{1}{2}\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{r}_k([2^{N}{:}], \eta),$$
$$\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{r}_k(I, \eta) = \sup_{\eta \in \mathbb{R}} \Big\{ -c_k(\rho)\Big[\sum_{i \in I} (\eta - r_i)_+^{k} / |I| \Big]^{\frac{1}{k}} + \eta \Big\},$$
where $[2^N] = \{1, 2, 3, \ldots, 2^N\}$ and $[2^N{:}] = \{2^N + 1, \ldots, 2^{N+1}\}$.
8:     Compute $\Delta^{q}_{N,\rho}(\hat{Q}_t)$ via
$$\Delta^{q}_{N,\rho}(\hat{Q}_t) = \sup_{\eta \in \mathbb{R}} \hat{\sigma}^{q}_k(\hat{Q}_t, [2^{N+1}], \eta) - \frac{1}{2}\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{q}_k(\hat{Q}_t, [2^{N}], \eta) - \frac{1}{2}\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{q}_k(\hat{Q}_t, [2^{N}{:}], \eta),$$
$$\sup_{\eta \in \mathbb{R}} \hat{\sigma}^{q}_k(\hat{Q}_t, I, \eta) = \sup_{\eta \in \mathbb{R}} \Big\{ -c_k(\rho)\Big[\sum_{i \in I} \big(\eta - \max_{a' \in \mathcal{A}} \hat{Q}_t(s'_i, a')\big)_+^{k} / |I| \Big]^{\frac{1}{k}} + \eta \Big\}.$$
9:     Set $R_\rho(s, a) = r_1 + \Delta^{r}_{N,\rho} / p_N$.
10:    Update $Q$ via
$$\hat{Q}_{t+1}(s, a) = (1 - \zeta_t)\,\hat{Q}_t(s, a) + \zeta_t\,\hat{\mathcal{T}}_\rho(\hat{Q}_t)(s, a),$$
$$\hat{\mathcal{T}}_\rho(\hat{Q}_t)(s, a) = R_\rho(s, a) + \gamma\Big(\max_{a' \in \mathcal{A}} \hat{Q}_t(s'_1, a') + \Delta^{q}_{N,\rho}(\hat{Q}_t) / p_N\Big).$$
11:   end for
12:   $t \leftarrow t + 1$
13: end while
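For concreteness, the following is a minimal sketch of how the randomized multilevel estimator in steps 5-9 of Algorithm 3 might be implemented. The grid search over $\eta$ (whose range is guided by the bound of Lemma C.1), the helper names `sup_sigma` and `mlmc_dr_estimate`, and the toy uniform sampling distribution are our own illustrative choices; $c_k(\rho)$ is treated as a precomputed constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_sigma(values, c_k, k, grid):
    """Grid-search approximation of sup_eta { -c_k(rho) * [mean((eta - values)_+^k)]^(1/k) + eta }."""
    gaps = np.maximum(grid[:, None] - values[None, :], 0.0) ** k      # shape (|grid|, |I|)
    return np.max(-c_k * np.mean(gaps, axis=1) ** (1.0 / k) + grid)

def mlmc_dr_estimate(sample_fn, c_k, k, eps, grid):
    """Randomized multilevel estimate in the spirit of steps 5-9 of Algorithm 3.
    sample_fn(m) returns m i.i.d. draws of the random variable inside sigma_k
    (rewards r_i, or max_a' Q(s'_i, a') for the value term)."""
    N = rng.geometric(eps) - 1                  # P(N = n) = eps * (1 - eps)^n, n = 0, 1, ...
    p_N = eps * (1.0 - eps) ** N
    x = sample_fn(2 ** (N + 1))
    delta = (sup_sigma(x, c_k, k, grid)
             - 0.5 * sup_sigma(x[: 2 ** N], c_k, k, grid)
             - 0.5 * sup_sigma(x[2 ** N :], c_k, k, grid))
    return x[0], delta / p_N                    # first sample and the debiased correction

# Toy usage: robust reward term with rewards uniform on [0, 1], so eta* <= c_k/(c_k - 1) by Lemma C.1.
c_k, k, eps = 1.5, 2.0, 0.4                     # arbitrary illustrative values
grid = np.linspace(0.0, c_k / (c_k - 1.0), 300)
r1, corr = mlmc_dr_estimate(lambda m: rng.uniform(0.0, 1.0, size=m), c_k, k, eps, grid)
print(r1 + corr)                                # randomized estimate of r_1 + Delta^r_{N,rho} / p_N
```

Averaging many such randomized estimates would approximate the robust reward term; applying the same routine to $\max_{a' \in \mathcal{A}} \hat{Q}_t(s'_i, a')$ gives the value correction $\Delta^{q}_{N,\rho}(\hat{Q}_t)/p_N$ used in step 10.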