# reinforcement_learning_with_random_delays__1ceb5f33.pdf Published as a conference paper at ICLR 2021 REINFORCEMENT LEARNING WITH RANDOM DELAYS Yann Bouteiller Polytechnique Montreal yann.bouteiller@polymtl.ca Simon Ramstedt Mila, Mc Gill University simonramstedt@gmail.com Giovanni Beltrame Polytechnique Montreal Christopher Pal Mila, Polytechnique Montreal Jonathan Binas Mila, University of Montreal Action and observation delays commonly occur in many Reinforcement Learning applications, such as remote control scenarios. We study the anatomy of randomly delayed environments, and show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation. We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor Critic with significantly better performance in environments with delays. This is shown theoretically and also demonstrated practically on a delay-augmented version of the Mu Jo Co continuous control benchmark. 1 INTRODUCTION Undelayed environment Delayed environment observation delay action delay Figure 1: A delayed environment can be decomposed into an undelayed environment and delayed communication dynamics. This article is concerned with the Reinforcement Learning (RL) scenario depicted in Figure 1, which is commonly encountered in real-world applications (Mahmood et al., 2018; Fuchs et al., 2020; Hwangbo et al., 2017). Oftentimes, actions generated by the agent are not immediately applied in the environment, and observations do not immediately reach the agent. Such environments have mainly been studied under the unrealistic assumption of constant delays (Nilsson et al., 1998; Ge et al., 2013; Mahmood et al., 2018). Here, prior work has proposed different planning algorithms which naively try to undelay the environment by simulating future observations (Walsh et al., 2008; Schuitema et al., 2010; Firoiu et al., 2018). We propose an off-policy, planning-free approach that enables lowbias and low-variance multi-step value estimation in environments with random delays. First, we study the anatomy of such environments in order to exploit their structure, defining Random-Delay Markov Decision Processes (RDMDP). Then, we show how to transform trajectory fragments collected under one policy into trajectory fragments distributed according to another policy. We demonstrate this principle by deriving a novel off-policy algorithm (DCAC) based on Soft Actor-Critic (SAC), and exhibiting greatly improved performance in delayed environments. Along with this work we release our code, including a wrapper that conveniently augments any Open AI gym environment with custom delays. 2 DELAYED ENVIRONMENTS We frame the general setting of real-world Reinforcement Learning in terms of an agent, random observation delays, random action delays, and an undelayed environment. At the beginning of each time-step, the agent starts computing a new action from the most recent available delayed observation. Meanwhile, a new observation is sent and the most recent delayed action is applied in the undelayed environment. Real-valued delays are rounded up to the next integer time-step. equal contribution Published as a conference paper at ICLR 2021 Figure 3: Influence of actions on delayed observations in delayed environments. For a given delayed observation st, the observation delay ωt refers to the number of elapsed time-steps from when st finishes being captured to when it starts being used to compute a new action. The action delay αt refers to the number of elapsed time-steps from when the last action influencing st starts being computed to one time-step before st finishes being captured. Figure 2: Histogram of real-world Wi Fi delays. We further refer to ωt + αt as the total delay of st. As a motivating illustration of real-world delayed setting, we have collected a dataset of communication delays between a decisionmaking computer and a flying robot over Wi Fi, summarized in Figure 2. In the presence of such delays, the naive approach is to simply use the last received observation. In this case, any delay longer than one time-step violates the Markov assumption, since the last sent action becomes an unobserved part of the current state of the environment. To overcome this issue, we define a Markov Decision Process that takes into account the communication dynamics. 2.1 RANDOM DELAY MARKOV DECISION PROCESSES To ensure the Markov property in delayed settings, it is necessary to augment the delayed observation with at least the last K sent actions. K is the combined maximum possible observation and action delay. This is required as the oldest actions along with the delayed observation describe the current state of the undelayed environment, whereas the most recent actions are yet to be applied (see Appendix C). Using this augmentation suffices to ensure that the Markov property is met in certain delayed environments. On the other hand, it is possible to do much better when the delays themselves are also part of the state-space. First, this allows us to model self-correlated delays, e.g. discarding outdated actions and observations (see Appendix A.1). Second, this provides useful information to the model about how old an observation is and what actions have been applied next. Third, knowledge over the total delay allows for efficient credit assignment and off-policy partial trajectory resampling, as we show in this work. Definition 1. A Random Delay Markov Decision Process RDMDP(E, pω, pα) = (X, A, µ, p) augments a Markov Decision Process E = (S, A, µ, p) with: (1) state-space X = S AK N2, (2) action-space A, (3) initial state distribution µ(x0) = µ(s, u, ω, α) = µ(s) δ(u cu) δ(ω cω) δ(α cα), (4) transition distribution p(s ,u ,ω ,α ,r |s,u,ω,α,a)=fω ω (s ,α ,r |s,u,ω,α,a)pω(ω |ω)pu(u |u,a), where s S is the delayed observation, u AK is a buffer of the last K sent actions, ω N is the observation delay, and α N is the action delay as defined above. To avoid conflicting with the subscript notation, we index the action buffers elements using square brackets. Here, u[1] is the most recent and u[K] is the oldest action in the buffer. We denote slices by u[i:j] = (u[i], . . . , u[j]) and u[i: j] = (u[i], . . . , u[K j]). We slightly override this notation and additionally define u[0] = a. Published as a conference paper at ICLR 2021 The constants cu AK and cω, cα N initialize u, ω, α, since δ is the Dirac delta distribution. The transition distribution itself is composed of three parts: (1) The observation delay distribution pω modelling the evolution of observation delays. Note that this density function must represent a discrete distribution (i.e. be a weighted sum of Dirac delta distributions). Furthermore, this process will repeat observations if there are no new ones available. This means that the observation delay can maximally grow by one from one time-step to the next. (2) The transition distribution for the action buffer pu(u |u, a) = δ(u (a, u[1: 1])). (3) The distribution f describing the evolution of observations, rewards and action delays (Definition 2). Definition 2. For each change in observation delays ( =ω ω ) we define a variable step update distribution f as f (s ,α ,r |s,u,ω,α,a)=Es ,α ,r f 1( |s,u,ω,α,a)[p(s ,r r |s ,u[ ω z }| { ω +α ]) pα(α |α )]. (1) The base case of the recursion is f 1(s , α , r | s, u, ω, α, a) = δ(s s) δ(α α) δ(r ). Here, pα is the action delay distribution which, similar to pω, must be discrete. The transition distribution of the underlying, undelayed MDP is p. The r r term accumulates intermediate rewards in case observations are skipped or repeated (see Appendix A.4). Since the observation delay cannot increment by more than one, f 1 is used when ω is increasing, whereas f0 is used when there is no change in observation delay. A simple special case of the RDMDP is the constant observation and action delay case with pω(ω |ω) = δ(ω cω) and pα(α |α) = δ(α cα). Here, the RDMDP reduces to a Constantly Delayed Markov Decision Process, described by Walsh et al. (2008). In this case, the action and observation delays α, ω can be removed from the state-space as they don t carry information. Examples of RDMDP dynamics are visualized in Figure 3 (see also Appendix C). 3 REINFORCEMENT LEARNING IN DELAYED ENVIRONMENTS Delayed environments as described in Section 2 are specific types of MDP, with an augmented statespace and delayed dynamics. Therefore, using this augmented state-space, traditional algorithms such as Soft Actor-Critic (SAC) (Haarnoja et al., 2018a)(Haarnoja et al., 2018b) will always work in randomly delayed settings. However, their performance will still deteriorate because of the more difficult credit assignment caused by delayed observations and rewards, on top of the exploration and generalization burdens of delayed environments. We now analyze how to compensate for the credit assignment difficulty by leveraging our knowledge about the delays dynamics. One solution is to perform on-policy multi-step rollouts on sub-trajectories that are longer than the considered delays. On the other hand, on-policy algorithms are known to be sample-inefficient and therefore are not commonly used in real-world applications, where data collection is costly. This motivates the development of off-policy algorithms able to reuse old samples, such as SAC. Intuitively, in delayed environments, one should take advantage of the fact that actions only influence observations and rewards after a number of time-steps relative to the beginning of their computation (the total delay ω + α). Since the delay information is part of the state-space, it can be leveraged to track the action influence through time. However, applying conventional off-policy algorithms in delayed settings leads to the following issue: the trajectories used to perform the aforementioned multi-step backups have been sampled under an outdated policy, and therefore contain outdated action buffers. In this section, we propose a method to tackle this issue by performing partial trajectory resampling. We make use of the fact that the delayed dynamics are known to simulate the effect they would have had under the current policy, effectively transforming off-policy sub-trajectories into on-policy sub-trajectories. This enables us to derive a family of efficient off-policy algorithms for randomly delayed settings. 3.1 PARTIAL TRAJECTORY RESAMPLING IN DELAYED ENVIRONMENTS One important observation implied by Figure 3 is that, given the delayed dynamics of RDMDPs , some actions contained in the action buffer of an off-policy state did not influence the subsequent delayed Published as a conference paper at ICLR 2021 Figure 4: Partial resampling of a small sub-trajectory. The action buffer is recursively resampled according to the current policy π (rewards are not modified by σ and are omitted here). observations and rewards for a number of time-steps. Therefore, if an off-policy sub-trajectory is short enough, it is possible to recursively resample its action buffers with no influence on the return. We propose the following transformation of off-policy sub-trajectories: Definition 3. The partial trajectory resampling operator recursively updates action buffers as follows σπ n(s 1, u 1, ω 1, α 1 | {z } x 1 , r 1, τ n 1|x 0; s1, u1, ω1, α1 | {z } x1 , r1, τn 1) =δ((s 1,ω 1,α 1,r 1) (s1,ω1,α1,r1))Ea0 π( |x 0)[δ(u 1 (a0,u 0[1: 1]))] σπ n 1(τ n 1|x 1;τn 1) (2) with trivial base case σ0(x 0) = 1 This operator recursively resamples the most recent actions of each action buffer in an input subtrajectory τn, according to a new policy π. Everything else stays unchanged. A visual example is provided in Figure 4 with n = 2 and an action buffer of two actions. When resampled actions are delayed and would not affect the environment, they do not "invalidate" the sub-trajectory. The resampled trajectories can then be considered on-policy. Theorem 1. The partial trajectory resampling operator σπ n (Def. 3) transforms off-policy trajectories into on-policy trajectories Eτn pµ n( |x0)[σπ n(τ n|x0;τn)]=pπ n(τ n|x0) (3) on the condition that none of the delayed observations depend on any of the resampled actions, i.e. ω t + α t t (4) where t indexes the trajectory τ n = (s 1, u 1, ω 1, α 1, r 1, . . . , s n, u n, ω n, α n, r n) from 1 to n. The condition in Equation 4 can be understood visually with the help of Figure 3. In the constant delay example it is fulfilled until the third time-step. After that, the observations would have been influenced by the resampled actions (starting with a0). 3.2 MULTI-STEP OFF-POLICY VALUE ESTIMATION IN DELAYED ENVIRONMENTS We have shown in Section 3.1 how it is possible to transform off-policy sub-trajectories into on-policy sub-trajectories in the presence of random delays. From this, we can derive a family of efficient off-policy algorithms for the randomly delayed setting. For this matter, we make use of the classic on-policy Monte-Carlo n-step value estimator: Definition 4. The n-step state-value estimator is defined as ˆvn(x0; x 1, r 1, τ n 1 | {z } τ n ) = r 1 + γˆvn 1(x 1; τ n 1) = i=1 γi 1r i + γnˆv0(x n). (5) where ˆv0 is a state-value function approximator (e.g. a neural network). Published as a conference paper at ICLR 2021 Figure 5: Visual example in a 1D-world with random delays (K = 3). The original trajectory has been sampled under the policy µ: always go left . The current policy is π: always go right . Indeed, in γ-discounted RL, performing on-policy n-step rollouts to estimate the value function reduces the bias introduced by the function approximator by a factor of γn: Lemma 1. The n-step value estimator has the following bias: bias(ˆvn(x0, )) = γn E...,x n,r n pπ n( |x0)[bias(ˆv0(x n))] (6) A simple corollary of Lemma 1 is that the on-policy n-step value estimator is unbiased when the function approximator ˆv0 is unbiased. On the other hand, Theorem 1 provides a recipe for transforming sub-trajectories collected under old policies into actual on-policy sub-trajectories. From a given state in an off-policy trajectory, this is done by applying σπ n to all the subsequent transitions until we meet a total delay (ωi + αi) that is shorter than the length of the formed sub-trajectory. Consequently, the transformed sub-trajectory can be fed to the on-policy n-step value estimator, where n is the length of this sub-trajectory. This does not only provide a better value estimate than usual 1-step off-policy estimators according to Lemma 1, but it maximally compensates for the multi-step credit assignment difficulty introduced by random delays. Indeed, the length of the transformed sub-trajectory is then exactly the number of time-steps it took the first action of the sub-trajectory to have an influence on subsequent delayed observations, minus one time-step. As opposed to other unbiased n-step off-policy methods, such as importance sampling and Retrace (Munos et al., 2016), this method doesn t suffer from variance explosion. This is because the presence of delays allows us to transform off-policy sub-trajectories into on-policy sub-trajectories, so that old samples don t need to be weighted by the policy ratio. Although we use a multi-step state-value estimator, the same principles can be applied to action-value estimation as well. In fact, the trajectory transformation described in Definition 3 enables efficient off-policy n-step value estimation in any value-based algorithm that would otherwise perform 1-step action-value backups, such as DQN, DDPG or SAC. In the next section, we illustrate this using SAC. Figure 5 summarizes the whole procedure in a simple 1D-world example. The maximum possible delay is K = 3 here, and the agent can only go left or right . An initial augmented state x0 is sampled from the replay memory, along with the 3 subsequent augmented states and rewards. The condition of Theorem 1 is satisfied for n 2. It follows that τn = τ2 = (x1, x2). This offpolicy trajectory fragment is partially resampled, which yields the corresponding on-policy trajectory fragment τ n = τ 2 . This on-policy trajectory fragment can then be used to compute an unbiased n-step value estimate of the initial state x0 = x 0. Published as a conference paper at ICLR 2021 4 DELAY-CORRECTING ACTOR-CRITIC We have seen in Section 3 how it is possible, in the delayed setting, to collect off-policy trajectories and still use on-policy multi-step estimators in an unbiased way, which allows us to compensate for the more difficult credit assignment introduced by the presence of random delays. We now apply this method to derive Delay-Correcting Actor-Critic (DCAC), an improved version of Soft Actor-Critic (Haarnoja et al., 2018a;b) for real-time randomly delayed settings. 4.1 VALUE APPROXIMATION Like SAC, DCAC makes use of the entropy-augmented soft value function (Haarnoja et al., 2018a): Lemma 2. In a RDMDP (E, pω, pα) the soft value function is: vsoft(x 0)=Ea π( |x 0)[Ex 1,r 1 p( |x 0,a)[r 1+γvsoft(x 1)] logπ(a|x 0)] (7) It can be estimated by augmenting the reward function in Definition 4 with an entropy reward: Definition 5. The delayed on-policy n-step soft state-value estimator, i.e. the n-step state-value estimator with entropy augmented rewards under the current policy π, is ˆvsoft n (x 0; τ n)=r 1+γˆvsoft n 1(x 1;τ n 1) Ea π( |x 0)[logπ(a|x 0)] (8) where ˆvsoft 0 is a state-value function approximator (e.g. a neural network). Given the off-policy trajectory transformation proposed in Section 3, Definition 5 directly gives DCAC s value target. To recap, we sample an initial state x0 (= x 0) and a subsequent trajectory τn (= x1, r1, . . . xn, rn) from a replay memory. The sampling procedure ensures that n is the greatest length so that the sampled trajectory τn does not contain any total delay ωi + βi < i. This trajectory was collected under an old policy µ, but we need a trajectory compatible with the current policy π to use ˆvsoft n in an unbiased way. Therefore, we feed τn to the partial trajectory resampling operator defined in Definition 3. This produces an equivalent on-policy sub-trajectory τ n with respect to the current policy π according to Theorem 1, while maximally taking advantage of the bias reduction described by Lemma 1. This partially resampled on-policy sub-trajectory is fed as input to ˆvsoft n (x0; τ n), which yields the target used in DCAC s soft state-value loss: Definition 6. The DCAC critic loss is LDCAC v (v) = E(x0,τn) D Eτ n σπ n( |x0;τn)[(vθ(x0) ˆvsoft n (x0; τ n))2] (9) where x0, τn are a start state and following trajectory, sampled from the replay memory, and satisfying the condition of Theorem 1. 4.2 POLICY IMPROVEMENT In addition to using the on-policy n-step value estimator as target for our parametric value estimator, we can also use it for policy improvement. As in SAC we use the reparameterization trick (Kingma & Welling, 2013) to obtain the policy gradient from the value estimator. However, since we use our trajectory transformation and a multi-step value estimator, this involves backpropagation through time in the action buffer. Definition 7. The DCAC actor loss is LDCAC π (π) = E(x0,τn) D Eτ n σπ n( |x0;τn)[ˆvsoft n (x0; τ n)] (10) where x0, τn are a start state and following trajectory, sampled from the replay memory, and satisfying the condition of Theorem 1. Published as a conference paper at ICLR 2021 Proposition 1. The DCAC actor loss is a less biased version of the SAC actor loss with bias(LDCAC π ) = En[γn] bias(LSAC π ) (11) assuming both are using similarly biased parametric value estimators to compute the loss, i.e. bias(ˆvsoft 0 (x)) = Ea π( |x)[bias(ˆqsoft 0 (x, a))] (12) 5 EXPERIMENTAL RESULTS To evaluate our approach and make future work in this direction easy for the RL community, we release as open-source, along with our code, a Gym wrapper that introduces custom multi-step delays in any classical turn-based Gym environment. In particular, this enables us to introduce random delays to the Gym Mu Jo Co continuous control suite (Brockman et al., 2016; Todorov et al.), which is otherwise turn-based. Compared algorithms. A naive version of SAC would only use the unaugmented delayed observations, which violates the Markov assumption in delayed settings as previously pointed out. Consequently, naive SAC exhibits near-random results in delayed environments. A few such experiments are provided in the Appendix for illustration (Figure 9). In order to make a fair comparison, all other experiments compare DCAC against SAC in the same RDMDP setting, i.e. all algorithms use the augmented observation space defined in Section 2.1. Since SAC is the algorithm we chose to improve for delayed scenarios, comparing DCAC against it in the same setting provides a like-for-like comparison. We also found it interesting to compare against RTAC (Ramstedt & Pal, 2019). Indeed, DCAC reduces to this algorithm in the special case where observation transmission is instantaneous (ω=0) and action computation and transmission constantly takes one time-step (α=1). Whereas DCAC performs variable-length state-value backups with partial trajectory resampling as explained in Section 4 , RTAC performs 1-step state-value backups, and SAC performs the usual 1-step action-value backup described in its second version (Haarnoja et al., 2018b). All hyperparameters and implementation details are provided in Section B of the Appendix. For each experiment, we perform six runs with different seeds, and shade the 90% confidence intervals. Figure 6: ω = 2, α = 3 (constant delays). With a constant total delay of five time-steps, DCAC exhibits a very strong advantage in performance. All tested algorithms use the same RDMDP augmented observations. Published as a conference paper at ICLR 2021 Figure 7: α, ω Wi Fi (random delays). DCAC clearly dominates the baselines. Ant became too difficult for all tested algorithms. Half Cheetah also became difficult and only DCAC escapes from local minima. Constant delays. Our first batch of experiments features simple, constantly delayed scenarios. Figure 6 displays the results of the most difficult of these experiments (i.e. where the delays are longest), while the others are provided in Section D.2 of the Appendix. The advantage of using DCAC is obvious in the presence of long constant delays. Note that DCAC reduces to the RTAC (Ramstedt & Pal, 2019) algorithm when ω = 0 and α = 1 and behaves as an evolved form of RTAC in the presence of longer constant delays. Real-world random delays. Our second batch of experiments features random delays of different magnitudes. The experiment we chose to present in Figure 7 is motivated by the fact that our approach is designed for real-world applications. Importantly, it provides an example how to implement DCAC in practice (see Appendix A and B for more details). We sample the communication delays for actions and observations from our real-world Wi Fi dataset, presented in Figure 2. When action or observation communications supersede previous communications, only the most recently produced information is kept. In other words, when an action is received in the undelayed environment, its age is compared to the action that is currently being applied. Then, the one that the agent most recently started to produce is applied. Similarly, when the agent receives a new observation, it only keeps the one that was most recently captured in the undelayed environment (see the right-hand side of Figure 3 for a visual example). We discretize the communication delays by using a time-step of 20ms. Importantly, note that Figure 2 has been cropped to 60ms, but the actual dataset contains outliers that can go as far as 1s. However, long delays (longer than 80ms in our example) are almost always superseded and discarded. Therefore, when such information is received, we clip the corresponding delay with no visible impact in performance: in practice, the maximum acceptable delays are design choices, and can be guided by existing probabilistic timing methods (Santinelli et al., 2017). 6 RELATED WORK We trace our line of research back to Katsikopoulos & Engelbrecht (2003), who provided the first discussion about Delayed Markov Decision Processes. In particular, they were interested in asynchronous rewards, which provides interesting insights in relation to Appendix A.4. Walsh et al. (2008) later re-introduced the notion of Constantly Delayed Markov Decision Process . While recent advances in deep learning enable implementations of what the authors call an augmented approach , this was considered intractable at the time because the size of the action buffer grows with the considered delay length. Instead, they studied the case where observations are retrieved with a constant delay and developed a model-based algorithm to predict the current state of the environment. Similarly, Schuitema et al. (2010) developed memory-less approaches based on SARSA and vanilla Published as a conference paper at ICLR 2021 Q-learning, taking advantage of prior knowledge about the duration of a constant control delay. Hester & Stone (2013) adopted the action buffer-augmented approach to handle random delays, and relied on a decision-tree algorithm to perform credit assignment implicitly. By comparison, our approach relies on delay measurements to perform credit assignment explicitly. More recently, Firoiu et al. (2018) introduced constant action delays to a video game to train agents whose reaction time compares to humans. Similar to previous work, the authors used a state-predictive model, but based on a recurrent neural network architecture. Ramstedt & Pal (2019) formalized the framework of Real-Time Reinforcement Learning (RTRL) that we generalize here to all forms of real-time delays. Initially designed to cope with the fact that inference is not instantaneous in real-world control, the RTRL setting is equivalent to a constantly delayed MDP with α = 1 and ω = 0. Finally, Xiao et al. (2020) adopted an alternative approach by considering the influence of the action selection time when action selection is performed within the duration of a larger time-step. However, their framework only allows delays smaller than one time-step, whereas large time-steps are not compatible with high-frequency control. 7 CONCLUSION AND FUTURE WORK We proposed a deep off-policy and planning-free approach that explicitly tackles the credit assignment difficulty introduced by real-world random delays. This is done by taking advantage of delay measurements in order to generate actual on-policy sub-trajectories from off-policy samples. In addition, we provide a theoretical analysis that can easily be reused to derive a wide family of algorithms such as DCAC, whereas previous work mostly dealt with finding approximate ways of modelling the state-space in constantly delayed environments. The action buffer is fundamentally required to define a Markovian state-space for RDMDPs , but it is of course possible to observe this action buffer approximately, e.g. by compressing it in the hidden state of an RNN, which is complementary to our work. We have designed our approach with real-world applications in mind, and it is easily scalable to a wide variety of scenarios. For practical implementation, see Section 5 and Sections A and B of the Appendix. See also rtgym, a small python helper that we use in future work to easily implement delayed environments in the real world. To the best of our knowledge, DCAC is the first deep actor-critic approach to exhibit such strong performance on both randomly and constantly delayed settings, as it makes use of the partially known dynamics of the environment to compensate for difficult credit assignment. We believe that our model can be further improved by making use of the fact that our critic estimates the state-value instead of the action-value function. Indeed, in this setting, Ramstedt & Pal (2019) showed that it is possible to simplify the model by merging the actor and the critic networks using the Pop Art output normalization (van Hasselt et al., 2016), which we did not try yet and leave for future work. Our approach handles and adapts to arbitrary choices of time-step duration, although in practice time-steps smaller than the upper bound of the inference time will require a few tricks. We believe that this approach is close to time-step agnostic RL and will investigate this direction in future work. ACKNOWLEDGMENTS We thank Pierre-Yves Lajoie, Yoshua Bengio and our anonymous reviewers for their constructive feedback, which greatly helped us improve the article. We also thank Element AI and Compute Canada for providing the computational resources we used to run our experiments. Published as a conference paper at ICLR 2021 Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Vlad Firoiu, Tina Ju, and Joshua B. Tenenbaum. At human speed: Deep reinforcement learning with action delay. Co RR, abs/1810.07286, 2018. Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, and Peter Duerr. Super-human performance in gran turismo sport using deep reinforcement learning, 2020. Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. ar Xiv preprint ar Xiv:1802.09477, 2018. Yuan Ge, Qigong Chen, Ming Jiang, and Yiqing Huang. Modeling of random delays in networked control systems. Journal of Control Science and Engineering, 2013, 2013. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ar Xiv preprint ar Xiv:1801.01290, 2018a. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018b. Todd Hester and Peter Stone. Texplore: real-time sample-efficient reinforcement learning for robots. Machine learning, 90(3):385 429, 2013. Jemin Hwangbo, Inkyu Sa, Roland Siegwart, and Marco Hutter. Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters, 2(4):2096 2103, 2017. Konstantinos V Katsikopoulos and Sascha E Engelbrecht. Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control, 48(4):568 574, 2003. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. A. Rupam Mahmood, Dmytro Korenkevych, Brent J. Komer, and James Bergstra. Setting up a reinforcement learning task with a real-world robot, 2018. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. Remi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 1054 1062. Curran Associates, Inc., 2016. Johan Nilsson, Bo Bernhardsson, and Björn Wittenmark. Stochastic analysis and control of real-time systems with random time delays. Automatica, 34(1):57 64, 1998. Simon Ramstedt and Christopher Pal. Real-time reinforcement learning. In Neur IPS, 2019. L. Santinelli, F. Guet, and J. Morio. Revising measurement-based probabilistic timing analysis. In 2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 199 208, 2017. Erik Schuitema, Lucian Busoniu, Robert Babuska, and Pieter Jonker. Control delay in reinforcement learning for real-time dynamic systems: A memoryless approach. In International Conference on Intelligent Robots and Systems, 2010. Published as a conference paper at ICLR 2021 Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. in 2012 ieee. In RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double qlearning. ar Xiv preprint ar Xiv:1509.06461, 2015. Hado P van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287 4295, 2016. Thomas J. Walsh, Ali Nouri, Lihong Li, and Michael L. Littman. Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems, 18:83 105, 2008. Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman, and Alexander Herzog. Thinking while moving: Deep reinforcement learning with concurrent control. In International Conference on Learning Representations, 2020. Published as a conference paper at ICLR 2021 A PRACTICAL CONSIDERATIONS AND SCALABILITY A.1 SELF-CORRELATED DELAYS The separation between ω and α allows auto-correlated conditional distributions on both delays. This is necessary to allow superseded actions and observations to be discarded. In RDMDPs , the agent keeps the delayed observation that was most recently captured in the undelayed environment. Ideally, it is also ensured by the undelayed environment that the applied action is the action that most recently started being computed by the agent. In practice, this can be ensured by augmenting the actions with timestamps corresponding to the beginning of their computation, and observations with timestamps corresponding to the end of their capture. Thus, the undelayed environment and the agent can keep track of the most recent received timestamp and discard outdated incoming information. A.2 HOW TO MEASURE DELAYS To measure the delays in practice, one possibility is to make use of the aforementioned timestamps. In addition to the augmentations described in A.1, one can augment each observation sent by the undelayed environment with the timestamp of the action that was applied before the end of observation capture. When the agent receives an observation, this observation then contains two timestamps: one that directly corresponds to an action in the buffer (agent s clock), and one that corresponds to when the observation finished being captured (undelayed environment s clock). The identified action in the buffer directly gives the total delay. If the agent and the undelayed environment have e.g. synchronized clocks, the current timestamp minus the timestamp corresponding to observation capture gives the observation delay (and thus we can deduce the action delay). A.3 SCALABILITY OF THE ACTION BUFFER As seen in our Wi Fi experiment, the maximum delays are design choices in practice. The actual maximum delays can be prohibitively long (e.g. infinite when packets are lost) and would require a long action buffer to be handled in the worst-case scenario. However, in random delays scenarios, long delays are likely to be superseded by shorter delays. Therefore, observations reaching the agent with a total delay that exceeds the chosen K value should simply be discarded, and a procedure implemented to handle the unlikely edge-case where more than K such observations are received in a row. Also note that, although we used a simple action buffer in this work, more clever representations are possible in the presence of long delays, e.g. run-length encoding. A.4 DELAYED REWARDS We have implicitly made a choice when defining the rewards for RDMDPs . Indeed, keep in mind that observations can be dropped (superseded) at the level of the agent. In such cases, we chose to accumulate the rewards corresponding to the lost transitions. When an observation gets repeated because no new observation is available, the corresponding reward is 0, and when a new observation arrives, the corresponding reward contains the sum of intermediate rewards in lost transitions. In practice, this is ensured for example by making the assumption that the remote robot (i.e. the undelayed environment) can observe its own instantaneous reward. This allows the robot to compute its cumulative reward and send it to the agent along with the observation. The agent can then compute the difference between the last cumulative reward it received from the remote robot and the new one for each incoming observation (NB: outdated observations are discarded so the agent only sees cumulative rewards with time-increasing timestamps). Alternatively, the practitioner can choose to repeat the delayed rewards along with the repeated delayed observations at the level of the agent (this is what we use to do in earlier versions of the paper). When a trick similar to the aforementioned cannot be implemented, this can be done instead, with no impact on our analysis. However, the reward signal will inherently have a higher variance. Published as a conference paper at ICLR 2021 A.5 LONG OBSERVATION CAPTURE In practice, it is often the case that observation capture is not instantaneous. In such situation, one should increase the size of the action buffer so that it always includes the actions for which it is unclear whether they have influenced the observation yet or not. Indeed, when observation capture is not instantaneous it is not possible to know which undelayed state(s) it describes. The length of the multi-step backup performed by DCAC doesn t need to be adapted, because it only cares about the first action that is known to not have influenced the delayed observation. A.6 COMBINED OBSERVATIONS Equivalently, if observations are formed of several combined parts that were captured at different times, the action buffer must be long enough to always include the first action that has not influenced the oldest sub-observation yet (i.e. be as long as the maximum possible combined total delay). B IMPLEMENTATION DETAILS B.1 MORE INFORMATION AS INPUT TO THE MODEL The action delay α identifies the action that was applied during the previous time-step. It is needed to define RDMDPs and thus is used by DCAC. However, in practice we can include another piece of information on top of α: the delay of the action that is going to be applied in the undelayed environment when the captured observation is sent. We use this additional information as input of the model for all tested algorithms. B.2 MODEL ARCHITECTURE The model we use in all our experiments is composed of two separate multi-layer perceptrons (MLPs): a critic network, and an actor network. Both MLPs are built with the same simple architecture of two hidden layers, 256 units each. The critic outputs a single value, whereas the actor outputs an action distribution with the dimension of the action-space, from which actions are sampled with the reparameterization trick. This architecture is compatible with the second version of SAC described in Haarnoja et al. (2018b). The only difference from the DCAC model is that the SAC critic tracks q(x), and not v(x). Indeed, differently from usual actor-critic algorithms, the output of DCAC s critic approximates the state-value v(x) (instead of the action-value q(x)), as it is sufficient to optimize the actor loss described in Definition 7. Weights and biases are initialized with the default Pytorch initializer. Both the actor and the critic are optimized by gradient descent with the Adam optimizer, on losses LDCAC(π) (Equation 10) and LDCAC(v) (Equation 9), respectively. Classically, we use twin critic networks (Van Hasselt et al., 2015; Fujimoto et al., 2018) with target weight tracking (Mnih et al., 2015) to stabilize training. B.3 HYPERPARAMETERS Other than our neural network architecture, our implementations of SAC, RTAC and DCAC all share the following hyperparameters: Published as a conference paper at ICLR 2021 Table 1: Hyperparameters Optimizer Adam (Kingma & Ba, 2014) Learning rate 0.0003 Discount factor (γ) 0.99 Batch size 128 Target weights update coefficient (τ) 0.005 Gradient steps / environment steps 1 Reward scale 5.0 Entropy scale 1.0 Replay memory size 1000000 Number of samples before training starts 10000 Number of critics 2 NB: the target weights are updated according to the following running average: θ τθ + (1 τ)θ C VISUAL EXAMPLES Figure 8: Left: Example of Constantly Delayed MDP, with an action delay of three time-steps and an observation delay of two time-steps. Here, actions are indexed by the time at which they started being produced. The augmented observation is composed of an action buffer of the last five computed actions along with the delayed observation st 2. It will be used by the agent to compute action at. Meanwhile, in the undelayed environment, action at 3 is received and observation st is captured. Right: Example of Random Delay MDP, with α 3 time-steps and ω 2 time-steps. Actions and observations may be superseded due to random delays. In such cases, only the most recently produced actions and observations are kept, the others are discarded (crossed out). Published as a conference paper at ICLR 2021 D ADDITIONAL EXPERIMENTS D.1 IMPORTANCE OF THE AUGMENTED OBSERVATION SPACE Figure 9: ω = 0, α = 1: We illustrate the importance of the augmented observation space in delayed settings using our simplest task (constant 1-step action delay). Even with this small 1-step constant delay, the delayed observations are not Markov and a naive algorithm using only these observations (here: SAC naive) has near-random results. By comparison, an algorithm using the RDMDP augmented observations instead (here: SAC) is able to learn in delayed environments. D.2 CONSTANT DELAYS Figure 10: ω = 0, α = 1: This specific setting is equivalent to the RTRL setting (Ramstedt & Pal, 2019), in which DCAC reduces to the vanilla RTAC algorithm (without output normalization and merged networks). DCAC (RTAC) slightly outperforms SAC in this setting. Published as a conference paper at ICLR 2021 Figure 11: ω = 1, α = 2: In this more difficult setting (total constant delay of 3 instead of 1), DCAC starts really showing its potential, clearly outperforming all other approaches. D.3 RANDOM DELAYS Figure 12: ω [0; 2], α [1; 3] (uniformly sampled delays): This experiment is perhaps even more difficult than the Wi Fi experiment featured in the main paper, because it gives equal probability to all possible delays in the specified ranges (but delays are smaller here which makes it easier for RTAC, because these delays are closer to 1). All tested approaches fail on randomly delayed Ant. For other tasks, the advantage of DCAC is very clear over SAC. Published as a conference paper at ICLR 2021 E DEFINITIONS Definition 8. The n-step state-reward distribution for an environment E = (S, A, µ, p) and a policy π is defined as pπ n+1(s , r , τn | {z } τn+1 |s) = Ea π( |s)[pπ n(τn|s )p(s , r |s, a)] = Z A pπ n(τn|s )p(s , r |s, a)π(a|s)da (13) with the base case pπ 0(s) = 1 and the first iterate pπ 1(s , r |s) = R A p(s , r |s, a)π(a|s)da. Definition 9. A 1-step action-value estimator is defined as ˆq1(s,a; s ,r )=r +γ Ea π( |s )[ˆq0(s ,a )]. (14) Part of this estimator is usually another parametric estimator ˆq0 (e.g. a neural network trained with stochastic gradient descent). F OTHER MATHEMATICAL RESULTS F.1 LEMMA ON STEADY-STATE VALUE ESTIMATION BIAS Lemma 3. The expected bias of the n-step value estimator under the steady-state distribution (if it exists) is Ex pπ ss[bias ˆvn(x)] = γn Ex pπ ss[bias ˆv0(x)] (15) Proof. We remind ourselves that the steady state distribution observes pπ ss(xn) = Ex0 pπ ss[pπ n(..., xn, rn|x0)]. (16) According to Lemma 1 we then have Ex0 pπ ss bias(ˆvn(x0, )) =γn E...,x n,r n pπ n( |x0)[bias(ˆv0(x n))] (17) =γn Ex pπ ss[bias ˆv0(x)]. (18) F.2 LEMMA ON A DIRAC DELTA PRODUCT DISTRIBUTION Lemma 4. For p(u, v) = δ(u c)q(u, v) if q(u, v) < for u = c then p(u, v) = δ(u c)q(c, v). Proof. If u = c then p(u, v) = δ(u c)q(c, v), otherwise p(u, v) = 0 = δ(u c)q(c, v) F.3 LEMMA ON F Lemma 5. The dynamics described by f depend neither on the input action nor on a range of actions in the action buffer: f (s 1, α 1, r 1|x0, aµ 0) = f (s 1, α 1, r 1|x 0, aπ 0) with x0 = s0, u0, ω0, α0 and x 0 = s 0, u 0, ω 0, α 0, given that s0, ω0, α0 = s 0, ω 0, α 0 and given u0[ω 0 δ + α 1] = u 0[ω 0 δ + α 1] for all δ { , 1, . . . , 0} Proof. We prove by induction. The base case (ω 0 ω 1 = 1) is trivial since it does not depend on the inputs that differ. Published as a conference paper at ICLR 2021 For the induction step we have f (s 1, α 1, r 1|s0, u0, ω0, α0, aµ 0) = E s, α, r f 1( |s0,u0,ω0,α0 ,aµ 0 )[p(s 1, r 1 r| s, u[ω0 + α 1]) pα(α 1| α)] (19) Because of our condition on u0 and u 0 and the fact that ω0 = ω 0 this is equal to E s, α, r f 1( |s0,u0,ω0,α0 ,aµ 0 )[p(s 1, r 1 r| s, u 0[ω 0 + α 1]) pα(α 1| α)] We can now use the induction hypothesis since the conditions on s0, u0, ω0, α0 are still met when 1. E s, α, r f 1( |s 0,u 0,ω 0,α 0 ,aπ 0 )[p(s 1, r 1 r| s, u 0[ω 0 + α 1]) pα(α 1| α)] = f (s 1, α 1, r 1|x 0, aπ 0) (20) F.4 LEMMA ON PARTIAL RESAMPLING Lemma 6. Partially resampling trajectories collected under a policy µ according to σπ n transforms them into trajectories distributed according to π. Eτn pµ n( |x0)[σπ n(τ n|x 0; τn)] = pπ n(τ n|x 0) with x0 = s0, u0, ω0, α0 and x 0 = s 0, u 0, ω 0, α 0, on the condition that s0, ω0, α0 = s 0, ω 0, α 0 and on the condition that the actions in the initial action buffers u0 and u 0 that are applied in the following trajectory are the same, i.e. u0[k : end] = u 0[k : end] with k = min i (ω i+1 + α i+1 i) for i {0, n 1} and for the trajectory τ n = (s 1, u 1, ω 1, α 1, . . . , s n, u n, ω n, α n). Proof. We start with the induction base for n = 0. The theorem is trivial in this case since we have 0-length trajectories () and pµ 0(()|x0) = σπ 0 (()|x 0; ()) = pπ 0(()|x 0) = 1. For the induction step we start with the left hand side of the lemma s main equation. Eτn pµ n( |x0)[σπ n(τ n|x 0; τn)] = Eaµ 0 µ( |x0)[Ex1,r1 p(x1,r1|x0,aµ 0 )[Eτn 1 pµ n 1( |x1)[σπ n(τ n|x 0; x1, r1, τn 1)]]] (21) p(s1, u1, ω1, α1, r1|s0, u0, ω0, α0, aµ 0) = fω0 ω1(s1, α1, r1|s0, u0, ω0, α0, aµ 0) pω(ω1|ω0) pu(u1|u0, aµ 0) Plugging that and solving the integral over u1 yields = Eaµ 0 µ( |x0)[Eω1 pω( |ω0)[Es1,α1,r1 fω0 ω1( |s0,u0,ω0,α0 ,aµ 0 )[ Eτn 1 pµ n 1( |s1,(aµ 0 ,u0[1: 1]),ω1,α1 )[σπ n(τ n|x 0; s1, (aµ, u0[1 : 1]), ω1, α1, r1, τn 1)]]]] (22) Rolling out σπ n by one step and integrating out s1, ω1, α1, r1 yields Published as a conference paper at ICLR 2021 = Eaµ 0 µ( |x0)[Eτn 1 pµ n 1( |s 1,(aµ 0 ,u0[1: 1]),ω 1,α 1 )[Eaπ 0 π( |x 0)[δ(u 1 (aπ 0, u 0[1 : 1])) σπ n 1(τ n 1|s 1, u 1, ω 1, α 1; τn 1)fω0 ω 1 (s 1, α 1, r 1|s0, u0, ω0, α0, aµ 0) pω(ω 1|ω0)]]] (23) Reordering terms and substituting s0, ω0, α0 = s 0, ω 0, α 0 yields = pω(ω 1|ω 0)Eaπ 0 π( |x 0)[δ(u 1 (aπ 0, u 0[1 : 1])) Eaµ 0 µ( |x0)[fω 0 ω 1 (s 1, α 1, r 1|x0, aµ 0) Eτn 1 pµ n 1( |s 1,(aµ,u0[1: 1]),ω 1,α 1 )[σπ n 1(τ n 1|x 1; τn 1)]]] (24) We can substitute the f term according to Lemma 5 since the condition between x0 and x 0 is met. More precisely the condition on u0 and u 0 is met because k ω 0 + α 1 = ω 1 + α 1. After the substitution we have = pω(ω 1|ω 0)Eaπ 0 π( |x 0)[δ(u 1 (aπ 0, u 0[1 : 1])) fω 0 ω 1 (s 1, α 1, r 1|x 0, aπ 0) Eaµ 0 µ( |x0)[Eτn 1 pµ n 1( |s 1,(aµ,u0[1: 1]),ω 1,α 1 )[σπ n 1(τ n 1|x 1; τn 1)]]] (25) We can substitute the induction hypothesis in the following form. Eτn 1 pµ n 1( |x1)[σπ n 1(τ n 1|x 1; τn 1)] = pπ n 1(τ n 1|x 1) on the condition that u1[k : end] = u 1[k : end] with k = min i (ω i+2 + α i+2 i) for i {0, n 2} for the trajectory τ n 1 = (s 2, u 2, ω 2, α 2, . . . , s n, u n, ω n, α n). To check that this condition is met we observe that u1 = (aµ 0, u0[1 : 1]) and substitute u 1 = (aπ 0, u 0[1 : 1]) (made possible by Lemma 4) which means that u0[k 1 : end] = u 0[k 1 : end] with k = min i (ω i+2 + α i+2 i) for i {0, n 2} Substituting the induction hypothesis yields = pω(ω 1|ω 0)Eaπ 0 π( |x 0)[δ(u 1 (aπ 0, u0[1 : 1])) fω 0 ω 1 (s 1, α 1, r 1|x 0, aπ 0) pπ n 1(τ n 1|x 1)] (26) Eaπ 0 π( |x 0)[pπ n 1(τ n 1|x 1) p(x 1, r 1|x 0, aπ 0)] = pπ n(τ n|x 0) G PROOFS OF THE RESULTS FROM THE MAIN PAPER Theorem 1. The partial trajectory resampling operator σπ n (Def. 3) transforms off-policy trajectories into on-policy trajectories Eτn pµ n( |x0)[σπ n(τ n|x0;τn)]=pπ n(τ n|x0) (3) Published as a conference paper at ICLR 2021 on the condition that none of the delayed observations depend on any of the resampled actions, i.e. ω t + α t t (4) where t indexes the trajectory τ n = (s 1, u 1, ω 1, α 1, r 1, . . . , s n, u n, ω n, α n, r n) from 1 to n. Proof. The theorem is a special case of Lemma 6 with x0 = x 0. This allows us to simplify the condition in the lemma as we show next. Since u0 = u 0 we can allow all k 1 which is the minimum allowed index for u. Therefore we must ensure 1 mini(ω i+1 + α i+1 i). Since the min must be larger than 1 then all arguments must be larger than 1 which means this is equivalent to 1 ω i+1 + α i+1 i for i {0, n 1}. This can be transformed into ω t + α t t for t {1, n} (27) Lemma 1. The n-step value estimator has the following bias: bias(ˆvn(x0, )) = γn E...,x n,r n pπ n( |x0)[bias(ˆv0(x n))] (6) bias(ˆvn(x0, )) = Eτ n pπ n( |x0)[ˆvn(x0, τ n) vπ(x0)] (28) = Eτ n pπ n( |x0)[r 1 + γˆvn 1(x 1; τ n 1)] Ea0 π( |x0)[Er 1,x 1 p( |x0,a0)[r 1 + γvπ(x 1)]] (29) = Eτ n pπ n( |x0)[r 1 + γˆvn 1(x 1; τ n 1) r 1 γvπ(x 1)] (30) = γEτ n pπ n( |x0)[ˆvn 1(x 1; τ n 1) vπ(x 1)] (31) = . . . (32) = γn Eτ n pπ n( |x0)[ˆv0(x n) vπ(x n)] (33) = γn E...,x n,r n pπ n( |x0)[bias(ˆv0(x n))] (34) Lemma 2. In a RDMDP (E, pω, pα) the soft value function is: vsoft(x 0)=Ea π( |x 0)[Ex 1,r 1 p( |x 0,a)[r 1+γvsoft(x 1)] logπ(a|x 0)] (7) Proof. The soft value function for an environment (X, A, µ, p) is defined as vsoft(x 0)=Ea π( |x 0)[qsoft(x 0,a) logπ(a|x 0)] (35) where qsoft(x 0, a) = Ex 1,r 1 p(x 0,a)[r 1 + γvsoft(x 1)] (36) If (X, A, µ, p) = RDMDP(E, pω, pα) = (X, A, µ, p) with E = (S, A, µ, p) this is qsoft(x 0,a)=Ex 1,r 1 p( |x 0,a)[r 1+γvsoft(x 1)] (37) and vsoft(x 0) = Ea π( |x 0)[Ex 1,r 1 p( |x 0)[r 1 + γvsoft(x 1)] log π(a|x 0)] (38) Published as a conference paper at ICLR 2021 Proposition 1. The DCAC actor loss is a less biased version of the SAC actor loss with bias(LDCAC π ) = En[γn] bias(LSAC π ) (11) assuming both are using similarly biased parametric value estimators to compute the loss, i.e. bias(ˆvsoft 0 (x)) = Ea π( |x)[bias(ˆqsoft 0 (x, a))] (12) Proof. Note that for simplicity, we also assume that the states in the replay memory are distributed according to the steady-state distribution, i.e. D pπ ss. This assumption could be avoided by making more complicated assumptions about the biases of the state-value and action-value estimators. We now start with the bias of the DCAC loss with respect to an unbiased SAC loss using the true action-value function, bias(LDCAC π ) = LDCAC π LSAC -UB π (39) LDCAC π = Ex0,τn D Eτ n σπ n( |x0;τn)[ˆvsoft n (x0; τ n)] (40) = Ex0 DEn Eτ n pπ n( |x0)[ˆvsoft n (x0; τ n)] | Theorem 1 (41) LSAC -UB π =Ex0 D[Ea π( |x0)[log π(a|x0) qsoft(x0, a)]] (42) =Ex0 D[vsoft(x0)]. (43) Substituting these we have bias(LDCAC π ) =Ex0 DEn[ˆvsoft n (x0; τ n) vsoft(x0)] (44) =Ex0 DEn[bias(ˆvsoft n (x0; )] (45) =Ex0 DEn[γn E...,xn,rn pπ n( |x0)[bias(ˆvsoft 0 (xn))]] | Lemma 1 (46) =En[γn] Ex D[bias(ˆvsoft 0 (x))] | using D pπ ss and Lemma 3 (47) =En[γn] Ex D[Ea π( |x)[bias(ˆqsoft 0 (x, a))]] | Equation 12 (48) =En[γn] Ex D[Ea π( |x)[ˆqsoft 0 (x, a) qsoft(x, a)]] (49) =En[γn] (LSAC π LSAC -UB π ) (50) =En[γn] bias(LSAC π ) (51)